Programming Basics with C++ explained in Assembly

Like I said earlier, this post is a sort of introduction to programming; however, it is not a typical “Hello World” introduction. We’ll be using C++ and inline Assembly to investigate what exactly is going on in a C++ program. The following will be explained:

  • what a C++ program is and looks like to a computer
  • what variables are and how the stack and memory works
  • what functions are and how they work
  • how pointers work
  • how math and logic works, such as when you evaluate an equation

What a C++ program is and what it looks like to the computer

C++, like any programming language, is just a bunch of words and syntax used to describe a process for the computer to perform.

You write a bunch of stuff, and the computer does exactly what you commanded it to. That said, while a program is a series of words and syntax to us, to the computer, it ends up being quite different. Computers only understand numbers, so somewhere between when a program is written and when it is executed by the computer, it gets interpreted from the programming language into numbers – Machine Code.

For example, the following statement:

r = x + x * y / z - z * x

turns into nothing but a sequence of numbers in Machine Code.

Programming languages exist for several reasons:

  • to provide an alternative to writing in Machine Code
  • to make writing maintainable and manageable code bases possible
  • to allow programmers to express themselves in different ways
  • to accomplish a task with efficiency given a language customized to that task

Finally, the basic C++ program that writes, “Hello World!”, to the console screen:

int main(){
	std::cout << "Hello World!";
	return 0;
}

In Machine Code, it again becomes a long sequence of numbers.

C++ is quite the improvement.

The language that translates most closely to Machine Code is Assembly Language. While both Machine Code and Assembly Language are impractical for day-to-day work, Assembly Language is actually something people do use, and we are going to use it to link what’s going on in the computer to what’s going on in C++.

Before we can move into Assembly, however, we need to review some basics of how programs work.


As we know, computers work with numbers, and a program is ultimately a bunch of numbers. These numbers are stored in the computer in a place called memory – lots of them, more numbers than you could count in your lifetime. Memory can be thought of as an array, grid or matrix where each cell contains a number and has a specific location.

A visualization of memory.

These cells exist in sequence, and that sequence describes each cell’s location. For instance, we could start counting memory at cell 1 and go to cell 7, and each of those locations could be referred to by its sequence number: 1 to 7.

The numbers in memory can represent one of two things:

  • a Machine Code or operand used to execute a command
  • some sort of Variable – an arbitrary value to be used by, or that has been saved by, a program

In programming languages such as C++, variables are referred to using names rather than their sequence number, and Machine Code is abstracted away with verbose language.

A snippet of C++ that declares three variables and prints them out.

The same snippet in Assembly; for glancing only.

The Stack: a type of Memory for Variables

Whenever a Variable is declared, it gets allocated in an area of Memory called the Stack. The Stack is a special type of Memory which is essential to how a C++ program executes. When a Variable enters scope, it gets allocated onto the Stack, and then later, when it exits scope, it’s removed from the Stack. The Stack grows in size as Variables enter scope and shrinks as Variables exit scope.

In C++, curly braces “{ “ and “ } “ are used to define a variable’s scope, also known as a block.

A C++ program with several blocks:

int main(){	//begin block a:

	//x is allocated, stack size goes to 1
	int x = 0;

	{	// begin block b: 
		// k, l and m are allocated, stack size goes to 4
		int  k = 2,  l = 3, m = 4;
		// an example operation
		x = k + l + m;

	}	// end of block b, stack size shrinks back to 1

	// - back to block a - y is allocated, stack size goes to 2
	int y = 0;

	{	// begin block c:
		// l and o are allocated, stack size goes to 4
		int l = 12, o = 2;

		// an example operation
		y = l * o;

	}	// end of block c, stack size shrinks back 2

	// - back to block a - z is allocated, stack size goes to 3
	int z = 0;

	// an example operation
	z = x - y;

	// write result, z, to the console
	std::cout << z << std::endl;	

}	// end of block a, stack shrinks to size 0

x86 Assembly Brief on the Stack

Assembly is really simple. Simple but verbose, and hard to work with because of that verbosity. Let’s jump right in.

A line of Assembly consists of a command and its operands. Operands are numbers of course, but these numbers can refer to a few different things: Constants ( plain numbers ), Memory locations (such as a reference to a Stack Variable), and Registers, which are a special type of Memory on the processor.

The Stack’s current location is held in the esp register. esp starts at a high memory address, the top of the stack, such as 1000, and as the stack grows, the register is decremented, toward say 900, then 500, and eventually to the point where the stack can grow no further. General purpose registers, such as eax, are used to store whatever you like, and some instructions use them by default.

There are two commands that manipulate the stack implicitly:

  • push [operand]

    put the operand value onto the Stack, and decrement esp

  • pop [operand]

    remove a value from the Stack, store it in the operand register, and increment esp


And two registers we will use:

  • esp

    the Stack’s top; it moves up or down with pushes/pops, or with plain adds/subtracts

  • eax

    a general purpose register

Given these two instructions we can create and destroy stack variables. For instance, similar to the program above but without the example equations:

    	push 0		//create x, with a value of 0
    	push 2		//create k
    	push 3		//create l
    	push 4		//create m
    	//do an equation
    	pop eax		//destroy m
    	pop eax		//destroy l
    	pop eax		//destroy k
    	push 0		//create y
    	push 12		//create l
    	push 2		//create o
    	//do an equation
    	pop eax		//destroy o
    	pop eax		//destroy l
    	push 0		//create z
    	//do an equation
    	pop eax		//destroy z
    	pop eax		//destroy y
    	pop eax		//destroy x

Another way to manipulate the Stack is to add to or subtract from esp directly, by an offset in bytes that is a multiple of the Stack alignment; on x86, that is 4 bytes.

    add esp, 12 	// reduce the stack by 12 bytes – same as 3 * pop
    sub esp, 4 	// increase the stack by 4 bytes – same as 1 * push

    The difference in usage is that add and sub do not copy any values around. With add, we trash the contents, and with sub, we get uninitialized space.

    Once Stack space has been allocated, it can be accessed through dereferencing esp. Dereferencing is done with brackets.

    [esp]  		// the top of the stack, the last push
    [esp + 4] 	// top + 4 bytes, the second to last push
    [esp + 8] 	// top + 8 bytes, the third to last push
    //[esp – x ] 	makes no sense

    Adding in the example equations requires some additional commands:

• add [add to, store result], [ operand to add ]

  adds the operand to the first operand and stores the result there

• sub [subtract from, store result], [ operand to subtract ]

  subtracts the operand from the first operand and stores the result there

• mul [ operand to multiply by eax, store in eax ]

  multiplies eax by the operand and stores the result in eax

    • mov [ copy to ], [ copy from ]

      copies a value from a Register to Memory or vice versa.

    	push 0			//create x, with a value of 0
    	push 2			//create k
    	push 3			//create l
    	push 4			//create m
    	//k + l + m;
    	mov eax, [esp + 8]	// copy k into eax
    	add eax, [esp + 4] 	// add l to eax
    	add eax, [esp]		// add m to eax
    	add esp, 12		//destroy k through m
    	//store the value in x
    	mov [esp], eax
    	push 0			//create y
    	push 12			//create l
    	push 2			//create o
    	// l * o
    	mov eax, [esp + 4]	 //copy l into eax
    	mul [esp]		 //multiply o by eax
    	add esp, 8		//destroy l and o
    	//store the value in y
    	mov [esp], eax 
    	push 0			//create z
    	//x - y;
    	mov eax, [esp + 8]
    	sub eax, [esp + 4]
    	mov [esp], eax		//write to z; z = -15
    	add esp, 12		//destroy z through x


Functions

Functions are a way to associate three things:

• A Stack block
• Code
• A useful name

As we know, rather than writing out the same code over and over, we can put it in a function and reuse it.

But that is not the end of what functions are. They are also Named Code Segments: a segment of code with a useful name, a name that both describes what the function does and can be used to call that function in the language.

    In C++ functions have four properties:

    • return type
    • function name
    • parameters
    • function body

    And the following prototype:

    [return type] [name]([parameters]){[body]}

    In C++ a sum function would look like so:

int sum( int x, int y ){
	return x + y;
}
    In assembly to sum these two values we’d do the following:

    push 5 			// y = 5
    push 4 			// x = 4
    mov eax, [esp] 		//eax = x
    add eax, [esp+4]	//eax = 9
    add esp, 8		//clear stack

    You can see there are three stages here:

    • Setting up the function’s parameters using push
    • the actual function code
    • resetting the stack

We are also missing three things: the function name, the return type, and actually calling the function. In assembly, a function name is the same as a C++ label:

sum:
Now comes an interesting part. Since we can do anything in assembly, we are going to implement something called a Calling Convention: an agreement on how the function’s parameters are passed, how and where the stack is cleaned, and how the return value is handled.

In C++ the default calling convention is called cdecl. In this calling convention, arguments are pushed onto the stack by the caller, in right to left order. Integer return values are saved in eax, and the caller cleans up the stack.

First off, to actually call a function we could use the jmp command. Jmp is similar to the C++ goto statement: it transfers execution to a label. Here’s how we could implement a function using jmp:

	jmp main
sum:		//sum function body
	mov eax, [esp+4]	//we have a default offset due to the return address
	add eax, [esp+8]
	jmp [esp]		//return to: sumreturn
main:		//main program body
	push 5			//we plan to call sum, so push arguments
	push 4
	push sumreturn		//push the return address
	jmp sum			//call the function
sumreturn:
	add esp, 12		//clean the stack
	//result is still in eax

    To make this much easier, there are additional instructions:

    • call [operand]

      automatically pushes the return address onto the stack and then jumps to the function

    • ret [ optional operand ]

      automatically pops the return address and jumps to it

    These instructions wrap up some of the requirements of function calling – the push/pop of the return address, the jmp instructions, and the return label. With these additional instructions we’d have the following:

	jmp main
sum:
	mov eax, [esp+4]	//we have a default offset due to the return address
	add eax, [esp+8]
	ret
main:
	push 5			//we plan to call sum, so push arguments
	push 4
	call sum
	add esp, 8		//clean the stack
	//result is still in eax

    In other calling conventions, specifically x64 ones, most parameters are passed in registers instead of on the stack by default.

    See: https://en.wikipedia.org/wiki/X86_calling_conventions

    See: https://scc.ustc.edu.cn/zlsc/sugon/intel/compiler_c/main_cls/bldaps_cls/common/bldaps_calling_conv.htm

    With cdecl, every time we call sum, we have an equivalent assembly expansion:

    push 5			//pass an argument
    push 4			//pass an argument
    call sum		//enter function 
    add esp, 8		//clear arguments

Recursive functions run out of stack space if they recurse too deeply, because each call allocates a frame on the stack, and the stack has a finite size.


Pointers

Pointers confuse a lot of folks. What I think may be the source of the confusion is that a pointer is a single variable with two different parts to it. A pointer is in fact its own data type, and in a sense all pointers are of the same data type: pointer. They all have the same size in bytes, and all are treated the same by the computer.

    The two parts of a pointer:

    • The pointer variable itself.
    • The variable pointed to by the pointer.

    In C++, non-function pointers have the following prototype:

    [data type]* [pointer variable name];

    So we actually have a description of two different parts here.

    • The Data Type

      this is the type of the variable that is pointed to by the pointer.

    • The Pointer Variable

      this is a variable which points to another variable.

What a lot of folks may not realize is that all Dynamic Memory ultimately routes back to some pointer on the Stack, or to some pointer, period – if it doesn’t, that Dynamic Memory has been leaked and is lost forever.

    Consider the following C++

    int* pi = new int(5); 

    With this statement, we put data into two different parts of memory; the first part is the Stack, it now has pi on it, the second, int(5), is on the heap. The value contained by the pi variable is the address of int(5). To work with a value that is pointed-to by a pointer, we need to tell the computer to go and fetch the value at the pointed-to address, and then we can manipulate it. This is called Dereferencing.

int i = 10;		//declare i on the stack
int* pi = &i;		//declare pi on the stack and set its value to the
			//address of i
*pi = *pi + 7;		//dereference pi, add 7, assign to i
// i = 17

So in assembly, there is an instruction for working with addresses:

• lea [ operand a ], [ operand b ]

  Load Effective Address: computes the address expression in operand b and stores the result in operand a, without reading memory

For example:

  lea eax, [ eax + 1 ] will increment eax by 1

  lea eax, [ esp + 4 ] takes the value of esp, an address, adds 4, and stores the result in eax

_asm {
	push 10			//a value, i
	push 0			//a pointer, pi
	lea eax, [esp + 4]	//compute the address of i and store it in eax
	mov [esp], eax		//copy the address of i into pi
	mov ebx, [esp]		//read the value of pi from the stack:
				//the address of i
	add [ebx], 7		//dereference pi and add 7, i == 17
	//[esp+4] now == 17
	add esp, 8		//clean the stack
}

These examples pointed to memory on the stack, but the same process applies to memory on the heap as well; that memory is just allocated with a function rather than a push.

    Logic and Math

Probably the key aspect of programming is the use of if-statements. An if-statement can be a long line of code, and obviously that means it turns into lots of assembly statements. What you may not realize, though, is that a single C++ if-statement often turns into multiple assembly compare-and-jump pairs. So let’s examine what an if-statement is in assembly.

    Consider the following C++:

    int x = 15;
    if( x == 10 ) ++x; 

    Can result in the following assembly:

_asm {
	push 15			// x
	cmp [esp], 10		//compare x with 10 and set the comparison flags.
	jne untrue		//jump to a label based on the comparison flags.
				//in this case, ne: jump if not-equal.
				//since 15 != 10, we jump to untrue and skip the
				//addition. If x were 10, it would be executed.
	add [esp], 1		//increment x by 1
untrue:	//proceed whether it was evaluated true or not
	add esp, 4		//clean stack
}

If-statements generate multiple jumps when there are logical operators in the condition. This results in a property called short-circuiting: only some of the if-statement may be executed. As soon as one failed comparison evaluates the whole condition to false, the subsequent logical operations are not evaluated at all.

An example with the and ( && ) logical operator:

int x = 10, y = 15;
if( x == 10 && y == 12) ++x;
_asm {
	push 10			// x
	push 15			// y
	//the first comparison
	cmp [esp+4], 10		//if the current logic operation evaluates
	jne untrue		//the condition to false, then all other
				//operations are skipped.
				//in our case it was true, so we continue to
	//the second comparison
	cmp [esp], 12		//this comparison is false, so we jump
	jne untrue
	//only evaluated if both && operands were true
	add [esp+4], 1		//increment x by 1
untrue:	//proceed whether it was evaluated true or not
	add esp, 8		//clear stack
}

    In the above example, if x had been != 10, then y == 12 would not have been evaluated.

    An example with the or logical operator.

int x = 10, y = 15;
if( x == 10 || y == 12) ++x;
_asm {
	push 10			// x
	push 15			// y
	cmp [esp+4], 10
	je true			//if the first comparison is true, no need to
				//evaluate the second
	cmp [esp], 12		//if the first comparison was false,
	jne untrue		//check the next one
true:	//evaluated if either || operand was true
	add [esp+4], 1		//increment x by 1
untrue:	//proceed whether it was evaluated true or not
	add esp, 8		//clear stack
}

Logical operators evaluate in left to right order, which ensures that short-circuiting works as expected.

    Back to the top, we had the following equation:

    r = x + x * y / z - z * x

This entire equation needs to be evaluated, and as you can imagine, that is a bunch of assembly steps. The order in which the steps execute is almost certainly not left to right; it follows mathematical order-of-operations, arranged to run quickly and efficiently.

The equation has four steps:

a = x*y/z
b = x
c = z*x
result = b + a - c

    Again, the order is going to respect mathematical order of operations, and try to run efficiently on the computer. Here’s how it may turn out:

_asm {
	//all of these variables are integers and the operations are integer
	push 10			//x
	push 15			//y
	push 2			//z
	//cache x, it's used 3 times: ebx = x
	mov ebx, [esp+8]
	//eax = x*y/z
	mov eax, ebx
	mul [esp+4]
	div [esp]
	//ecx = eax + x
	add eax, ebx
	mov ecx, eax
	//eax = z*x
	mov eax, [esp]
	mul ebx
	//finally, subtract: ecx = ( x*y/z + x ) - ( z*x )
	sub ecx, eax
	//result is in ecx
	add esp, 12		//clear stack
}

On a final note, if you want to get output from these examples into C++, here’s an easy way in Visual Studio:

int main(int argc, char** argv){
	int result = 0;
	_asm {
		mov eax, 5
		mov dword ptr[result], eax
	}
	std::cout << result;
	return 0;
}
    That’s it for this post! In the future I’ll mostly post about C++ programming itself. I would love to hear your comments below! Good Luck!

    How std::span changes C++ fundamentally

In C++, we have all sorts of std containers, such as vector. Normally, when we work with vectors, we pass them around by reference, and vectors of course own their data in a contiguous way.

    When a vector is passed to a function, we pass it by reference:

std::vector<int> ints;
auto foo = [&](std::vector<int>& v){
	std::iota( v.begin(), v.end(), 0);
};

In C++20, there is a new type called std::span. A span is a non-owning object that interfaces like a typical vector. Before standardisation, span was considered a “view”, and that’s what it is.

    To start, you can use a span when passing a vector to a function, and perhaps normally should.

std::vector<int> ints(100,0);
auto foo = [&](std::span<int> v){
	std::iota( v.begin(), v.end(), 0);
};
foo(ints);

    The span allows us to do really cool stuff with memory management and access, for instance:

std::vector<int> ints(100,0);
std::span<int> a( ints.begin(), 50 )
	, b( ints.begin()+50, ints.end() );
auto foo = [&](std::span<int> v){
	std::iota( v.begin(), v.end(), 0);
};
foo(a);
foo(b);

    In this example, a refers to the first half of ints, and b refers to the second half.

We now have two vector-like objects that in fact refer to a single contiguous segment of memory, which helps cache performance. Spans can be used much like vectors, except they are non-owning: just “views” into owning containers.

    Software development and sneaky algorithms

So I have been engineering my own neural network technology for some time, and running tests and experiments along the way. What I have discovered about backpropagation is that errors in the backpropagation algorithm may not break it completely. It can work in partial completeness, or with some amount of error, to some extent.

The first time I attempted to write backpropagation, some years ago, I used a math library called Armadillo in hopes that it would be fast. This code ended up sort of working, but clearly was not actually working correctly, and indeed there were many errors, since I wrote it during a 24 hour challenge.

Then I branched and started from scratch, without Armadillo, just C++. This code worked too, but it turns out there were bugs which, in hindsight, introduced limits.

In the next branch, I revisited Armadillo, using my understanding of my C++ code and more Armadillo research, to produce yet another version of backpropagation, which seemed to work at least as well as my C++ version.

Finally, I created a branch to compare these two working methods to each other and see what errors I could find. I found a lot of similarities between the two branches, and also clear errors in both. I corrected them as far as I can tell, and found that my C++ code is some 10x faster than the Armadillo code – I wonder if I am using Armadillo properly, IDK.

In my current project, in a pool of 2000 networks, if the backpropagation math is correct, then I generate maybe 30 xor networks with fewer than 100 trainings of the specific xor function.

If I train a circle function beforehand, then I end up training more like 200 xor networks with fewer than 100 trainings of the specific xor function.

Some of the technology I developed related to genetic algorithms functioned properly even though backpropagation was broken. I produced an mnist network which could recognize the digits 0, 1 and 2 – with a broken backpropagation process on CPU, probably functioning mainly through genetic features. It seemed to approach limits which I hoped were errors that I could fix, and perhaps I have. Errors or no errors, there is still the project.

    The smallest xor network and lottery ticket

I am creating networks that are: two input nodes, x*y hidden nodes, and one output node. The ability to create these networks comes down to odds based on the initial state of the starting network pool. I have a pool of 2000 randomly generated networks, and I try to train them to learn xor in real time; this trains few or zero xor networks. After this, I train on some other function for some period of real time. After returning to xor, xor trains in real time: at first, xor is learned by some small number, say one to five percent, of the 2000 networks. I then move to some other problem – a math problem, a geometry problem, some arbitrary function – then return to xor, and there are more xor networks trained in real time. This process continues, and more and more xor networks are trained. The final number of xor networks is far greater than the number found by training xor alone, which could be zero.

In real time, in this case, I create 3×3 xor networks in a pool of 2000 in fewer than 10k trainings of the specific xor function, and they continue to be created at a rate greater than with xor training alone.

This leads to a process I call network formatting: starting out with existing network data in order to train some arbitrary other type of network. I want to create small xor networks which have had their solution space expanded dramatically beyond xor, and then, in some way, use these existing networks to do other things instead. That is the idea.

So one thing I have learned is that there is a whole dimension of output in just one class of output. All the numbers we can represent in that space on a computer could be entirely distinct outputs. One thing I do with these networks is math functions: with two input nodes, x*y hidden, and one output, I can create the following: f(x, y) = z. Using this format I can perform math, memorization, constant passing and manipulation, geometry and line functions, all using these small networks originally intended for xor or something else.

Another thing I’ve learned is that a remarkable number of networks are convergent on many problems starting from their completely random state. As the difficulty of the initial problem increases, the number found goes down. That is, I can process a signal such that, on many problems, completely randomly generated networks show convergence through this signal.

Another thing I’ve found is that I had been using a smaller radius to measure a network’s “convergence”. It turns out that a network with a larger radius can potentially beat a smaller-radius network, and I’m calling this measurement the super radius.

    C++ comma operator and parallelization

I have never had a reason to use the comma operator; however, writing some modern code, it seems required.

    Say you have some series of variables and you want to perform a common operation on the group.

std::vector<std::size_t> a, b, c;
auto generateSortingIndexes = [&](auto&... args) {
	auto gen = [&](auto& arg){
		std::iota(arg.begin(), arg.end(), 0);
	};
	(gen(args), ...);
};
generateSortingIndexes(a, b, c);

    New C++ is always a fun thing, I know.

    We could change this to be a generic function that could be a good addition to any C++ library:

std::vector<std::size_t> a, b, c;
auto generate = [&](auto gen, auto&... args) {
	(gen(args), ...);
};
generate([&](auto& arg) {
	std::iota(arg.begin(), arg.end(), 0);
}, a, b, c);

    In this case, maybe the best code would actually be as follows:

    std::vector<std::size_t> a, b, c;
    std::iota(a.begin(), a.end(), 0);
    b = c = a;

    Though this is only because they’re all identical.

The real problem, or benefit, of this code is that it is essentially a compile-time expansion and not multi-threaded at runtime. In reality, I would probably want all of that initialization to occur in parallel rather than in sequence.

I could parallelize this code in two easy ways.

With the standard library, keeping with the trend of passing naked variables, the first step is to wrap the vectors and group them. After that comes the parallel part:

std::vector<std::reference_wrapper<std::vector<std::size_t>>> vectors = { a, b, c };
std::for_each(std::execution::par, vectors.begin(), vectors.end(), [&](auto& ref) {
	auto& vector = ref.get();
	std::iota(vector.begin(), vector.end(), 0);
});

    Another way is with openmp, which looks like as follows:

#pragma omp parallel
for (int i = omp_get_thread_num(); i < vectors.size(); i += omp_get_num_threads()) {
	auto& vector = vectors[i].get();
	std::iota(vector.begin(), vector.end(), 0);
}

What a fascinating loop paradigm openmp produces: the loop is executed in parallel stripes.

A simpler way would be:

std::vector<std::reference_wrapper<std::vector<std::size_t>>> vectors = { b, c };
#pragma omp parallel
for (int i = omp_get_thread_num(); i < vectors.size(); i += omp_get_num_threads()) {
	auto& vector = vectors[i].get();
	vector = a;
}

    Errors and Progress in mnist recognition

So, in my earlier post, I said I had a “convergent mnist network”. At the time I was excited and wrote that in haste. That network had been trained on null and the digits, but only one image of each digit was ever actually trained into the network. It could recognize more than ten images, around 60, out of a total test dataset of 1000.

Since then, I have been trying to train on as much data, and as many types of data, as possible, and I would not say that I have a convergent mnist network yet.

What I have produced and learned about is tension inside the network. I can see why there is an ML library called “Tensorflow”, though I have not studied it yet.

For each class of data the network recognizes or thinks about, there is a tension which spreads across the network. So, in my network of 11 classes, there are eleven of these tensors which I can measure. As the network is trained, these tensors change in relation to each other. Some of the tensors have a sharp pull and are highly convergent, accurate on 100% of the test data. Other tensors may not be convergent at all, and are a source of confusion and change in the network’s space. Many tensors can be highly convergent while others are not convergent at all.

    The way I am sort of viewing the network tensors right now, is that each highly convergent tensor is a sort of “black hole” in the network’s space. The relationships in the space between all of these objects affects the network’s capacity to learn and I think could even prevent learning.

One thing I am experimenting with is layer width and layer depth. Mainly I am engineering algorithms to make convergence more likely for any given network. I am working on the theory that great engineering skill can compensate for lesser science and mathematical skills. I hope this turns out to be true. And I do think there is cutting edge math and science being done that I can’t really expect myself to follow very effectively or find access to.

I am still not on GPU because I feel there is much more progress to be made in CPU/C++ as far as the technology is concerned. However, my test times on CPU make me think training anything larger than mnist will require bigger compute.

    I have imaged a network which is highly convergent though not on all digits, and the image seems to be dramatically orders of magnitude more complex than the images seen before. I have no video yet though there could be one soon.

    Non-Convolutional Image Recognition

    I have had some difficulty determining numbers for training times for mnist, so I am going to post some of mine, and also discuss what my network is doing.

So far in my work on mnist, I have generated a convergent network. Using CPU only, a Ryzen 7 1700, I train a convergent network in under five minutes.

    I have made a video of a small convergent network which was worked on for few hours, and here it is:

    You may need to go to youtube to view it full screen, which I recommend.

This network consists of 10 layers, two of which, the input and output layers, are not shown.

    The input layer is the size of the mnist image, 784, and would have a pixel dimension of 7840, this layer is excluded.

    The output layer is the number of classifications, it has 11, one for each digit and also null. This layer is also excluded.

    The remaining hidden layers are all ten nodes in length and therefore 100 pixels in dimension.

    The total number of nodes in the network is 875, with the vast majority of these nodes in the input layer alone.

    The network does no convolution; all layers are fully connected. I have been studying convolution for some time and am unsure how necessary it is for image recognition. I have read claims online that a fully connected network has no spatial relationships between nodes, and that such relationships must therefore be created through convolution. This is not true: spatial relationships do exist, and this is evident in my videos. I have not yet implemented convolution in C++ or Python to compare non-convolutional and convolutional image recognition. Furthermore, I do not have much experience with Python and am only starting to study tools like TensorFlow.

    My initial impression is that convolution revisits many pixels, and I wonder whether that results in more processing speed or recognition power than the non-convolutional approach. I do not know yet.

    Code Simplicity of binary files and C++ wonder

    There is a lot of code online for reading the MNIST dataset, and I have produced my own version, which uses the Binary Reader/Writer classes discussed in a previous post. So, you could be thinking: binary image file handling in C++, oh no! I have seen some of the parsers easily obtainable online, written in C++, and I found none of them to be particularly expressive. I have since written my own code, which uses modern C++ features up through C++23.

    In this post I’m going to describe loading the MNIST binary files and parsing them, specifically highlighting my custom binary file class. It should seem magical how simple working with a binary file in C++ actually is, and indeed, it is somewhat magical.

    The MNIST dataset includes two files: an images file and a labels file. Both files have a binary heading followed by a binary data segment.

    The format of the binary heading is as follows and all fields are uint32_t:

    image file Magic number = 2051;
    number of images = 10000
    image height = 28
    image width = 28

    The data segment of the image file contains the greyscale pixels of the 28×28 images; the pixel type is uint8_t:

    image pixels dimension = [28*28];

    The number of images is 10,000.

    The labels file is a list of 10,000 labels, uint8_t, and has a heading consisting of a Magic number and the number of labels.

    label file Magic number = 2049
    number of labels = 10000
    uint8_t label[10000];

    I read these files simultaneously.

    The values in the headings need to have their byte order reversed, because they are stored big-endian; the actual data does not.

    Here is my code to read MNIST followed by more code.

    void parseMNIST() {
    	std::string imagesFileName = "mnist//t10k-images.idx3-ubyte"
    		, labelsFileName = "mnist//t10k-labels.idx1-ubyte";
    	// Read one header field and undo the big-endian byte order (C++23).
    	auto readHeading = [](auto& file, auto& h) {
    		file >> h; h = std::byteswap(h);
    	};
    	auto checkMagic = [&](auto& file, uint32_t magic) {
    		uint32_t input{ 0 }; readHeading(file, input);
    		if (input != magic) {
    			throw std::logic_error("Incorrect file magic");
    		}
    	};
    	BinaryReader imagesIn(imagesFileName)
    		, labelsIn(labelsFileName);
    	uint32_t numImages{ 0 }
    		, numLabels{ 0 }
    		, rows{ 0 }
    		, cols{ 0 };
    	checkMagic(imagesIn, 2051);
    	checkMagic(labelsIn, 2049);
    	readHeading(imagesIn, numImages);
    	readHeading(labelsIn, numLabels);
    	if (numImages != numLabels) {
    		throw std::logic_error("image num should equal label num");
    	}
    	readHeading(imagesIn, rows);
    	readHeading(imagesIn, cols);
    	std::size_t imageSize = rows * cols;
    	uint8_t label{ 0 };
    	std::vector<uint8_t> image(imageSize);
    	for (std::size_t i = 0; i < numImages; ++i) {
    		imagesIn >> image;
    		labelsIn >> label;
    		// ...store or process the image/label pair here
    	}
    }

    To draw an image in the console I do the following:

    void drawDigit(uint8_t digit, std::size_t index = 0) {
        auto& image = mData[digit][index];
        using namespace std;
        cout << to_string(digit) << endl;
        for (size_t y = 0; y < 28; ++y) {
            for (size_t x = 0; x < 28; ++x) {
                cout << to_string(image[y * 28 + x]) << ",";
            }
            cout << endl;
        }
        cout << endl;
    }

    I separated the MNIST data into separate digit files, here’s what creating and reading them ended up looking like:

    void loadDigit(uint8_t digit) {
    	BinaryReader imagesIn("mnist_" + std::to_string(digit));
    	std::vector<uint8_t> image(28*28);
    	while (imagesIn >> image) {
    		mData[digit].push_back(image); // store each image read (assumed loop body)
    	}
    }
    void makeDigitFiles() {
    	for (auto& [label, images] : mData) {
    		std::string fileName = "mnist_" + std::to_string(label);
    		BinaryWriter imagesOut(fileName);
    		for (auto& image : images) {
    			imagesOut << image;
    		}
    	}
    }

    Movement to OpenCl++

    In my project, I have been thinking about data-oriented design (DOD) quite a lot during the creation of new code. I’m going to present some of these concepts as they appear in the current implementation and then describe moving to a specifically data-oriented language, OpenCL++. Moving to OpenCL++ seems like a natural extension of the DOD style of C++ I use.

    So, in C++, a simple data-oriented technique is as follows:

    std::for_each(std::execution::par, container.begin(), container.end(),
    	[&, front = &container.front()](auto& item) {
    		std::size_t idx = &item - front; // recover this item's index
    		item = x;                        // some algorithm
    		someOtherContainer[idx] = x;     // some algorithm
    	});

    Basically, we have a vector of items which are iterated over, possibly in a hardware-parallel fashion, and which can be synchronized with mutexes or other mechanisms. Often, each item in the container is associated with entries in other containers, so idx can be used to access a variety of related information across those containers.

    When the containers are vectors, they could potentially be further optimized by SIMD code. The loop could also be run with std::execution::seq instead, which can perform better in some circumstances or during debugging.

    I have headed in the direction of SIMD before, fairly successfully I think. However, since I am seeking parallel power to test, I want processors other than the CPU, such as a GPU or some supercomputer. I’m pretty sure std::execution::par does not attempt to use the GPU or other hardware, at least at the moment. Therefore, I think learning and implementing OpenCL++ is the next step toward a good process.

    OpenCL is a standard for operating in a parallel fashion on a variety of processing platforms, such as GPUs and CPUs. It is developed and maintained by the Khronos Group, so I expect it to be an excellent product. It is based on C, and there is also a C++ version, which I intend to explore.

    Studying OpenCL++ so far, I have had to compile three libraries and also install Ruby. The SDK source includes examples, which I have been working through. The first example compiles but then crashes at runtime with a specific error code. This is apparently due to an open issue on their forums, where I also found a solution; the ticket has been open since 2021.

    Implementing the solution, it seems to work just fine, though it is concerning whether that is actually the correct solution, because it appears it may not be. But it does work in some circumstances.

    The C++ involved is modern and pretty interesting, and I am still learning about it. I have encountered several interesting code features so far…

    Some lines of C++ code:

    setArgs<index + 1, T1s...>(std::forward<T1s>(t1s)...);
    std::vector<int, cl::SVMAllocator<int, cl::SVMTraitCoarse<>>> output2(numElements / 2, 0xdeadbeef);

    Some lines of OpenCl++ code:

    "output[get_global_id(0)] = inputA[get_global_id(0)] + inputB[get_global_id(0)] + val + *(aNum->bar);"
           "  enqueue_kernel(childQueue, CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange, "
            "    ^{"
            "      output[get_global_size(0)*3 + get_global_id(0)] = inputA[get_global_size(0)*3 + get_global_id(0)] + inputB[get_global_size(0)*3 + get_global_id(0)] + globalA + 2;"
            "    });"

    Until next time!

    Backpropagation is a class of random search algorithms

    This is evident when watching networks learn using my visualization program. The backpropagation algorithm creates spontaneous patterns which are sculpted into the network briefly, or for some period, and then there is a random shift to a new concept with a new, equally brief pattern. This process repeats until there is a clearly learned function, at which point the random search abruptly ends. These processes generate animations that are explosions of movement in particular directions and angles, composed of various lines and shapes. They are common across the entire landscape of the network and appear to work in unison, which makes their appearance and trending behaviour obvious. I’m going to be posting more videos.

    Visualizing network training

    Here is a video of where my project is at.

    Basically, it provides a graphical visualization of the network weights. I am going to experiment with generating and examining this sort of image throughout training. In this video, the network is training on six problems, and it learns slowly for about 45 seconds. After that, its learning skyrockets for a few seconds. What if I had decided to stop training before the 45-second mark? That seems problematic.

    You can see in this image that the network is 11 pixels high: input and output rows plus 9 hidden layers. There are obvious horizontal and vertical lines, some recurring shapes, and lots of movement. The hidden layers are 100 pixels wide, so the total image of the network is 11×100 pixels; the network itself is 3x9x10x3 in shape.