Neural Network in C++ From Scratch and Backprop-Free Optimizers

Introduction

In this article I’ll present a beginner-oriented framework implementing neural networks in C++. The main goal of this code is to understand the root of neural networks for beginners, it also allows multiple modifications which are usually hard to implement in classical high-level framework (Pytorch, Tensorflow, Keras etc..).

  1. Neural networks from scratch
  2. Framework design
  3. Exotic backpropagation-free optimizer: Shaking Tree optimizer
  4. What I learned doing this
  5. Github repository

1. Features of the framework

  • Graph implementation of neural network (no matrices)
  • Neural network, neurons and edges as explicit objects
  • Only multilayer perceptrons with Linear/ReLU/Sigmoid activations
  • No parallel computations, no optimizations, no convolutional layers
  • Exotic backpropagation-free optimizer

2. Neural networks from scratch

Neural networks can be interpreted in two ways. The “graph” way and the “matrix” way. When you begin with neural networks, people usually teach you the “graph” way:

A graph representing a neural network

3. Framework Design

1. Folders (latest version)

  • Dataset: Contains one simple class for reading a dataset
  • Optimizer: Contains a generic optimizer class, a classical backpropagation optimizer and an exotic optimizer
  • Misc: Contains activation functions and their derivatives
  • Layer.h/.cpp: A Layer is composed of multiple neurons. It has a type (INPUT/STANDARD/OUTPUT) because it doesn’t work the same way depending on its type and it has an activation function
  • Neuron.h/.cpp: A Neuron has a parent layer. A neuron accumulates the output of the edges connected to it (_accumulated), it outputs that input to its edges after processing it with its _activation_function (which you can change on a per-neuron basis). A neuron knows its _next and _previous edges.
  • Edge.h/.cpp: An Edge knows its next neuron (_n), the neuron its coming from (_nb), it has a weight (_w), it can memorize how much its weight was shift (_last_shift) last time it was, and it has a backpropagation memory so that it can retain a part of the chain rule
  • Optimizer.h/.cpp: Abstract class that links the network and the dataset together. Child classes must implement a “minimize” method that is called during the optimization process to minimize the loss.
  • Backpropagation.h/.cpp: Implementation of the backpropagation optimizer (though the backpropagation computations are in the neuron class for simplification). This class is mostly there to call these methods in the neuron class and to apply the new computed weights. It uses a batch size, the learning rate is a global variable because it’s common to all implemented optimizers and it’s easier to access it from anywhere.
  • ShakingTree.h/.cpp: Exotic optimizer that’ll be detailed later.

Computation example

The algorithm I provide was also made to exactly reproduce the “famous” article from Matt Mazur on his website: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

Second iteration reproduces the computation on mattmazur’s website

Building a Neural Network

It’s pretty much straightforward:

Predicting a value

That’s also an easy one:

Backpropagation

The chain rule is implemented in the getBackpropagationShifts() methods in neuron.cpp.

4. Exotic backpropagation-free optimizer: ShakingTree optimizer

Here I’ll introduce a new kind of exotic optimizer which I named “Shaking Tree optimizer”.

Theory

The goal is of course to do better than a backpropagation algorithm. How does backprop work? (1) do the forward pass (2) compute the loss (3) do complex computations to evaluate the gradients (4) loop.

  • Save current weights
  • Apply new weights randomly (amplitude, distribution, number of weights etc.. many hyperparameters)
  • Re-evaluate the score on (another?) part of the dataset
  • If the new score is worst, re-apply old weights
  • Generate a delta to be applied to all weights (this delta is meant to act like the gradients)
  • Apply the delta weight, evaluate it, save the delta score and the delta weight
  • Now technically you can evaluate a kind of f(delta_weight) = delta_score (that’s what the gradients aim to be!)
  • Each N iterations, average the delta_weight of the weights that reduced the score (delta_score < 0), and apply that delta

In practice

The dataset contains two 2d-gaussian distribution, so input is size 2 and output is size 1 (<0.5 is first gaussian, ≥ 0.5 is the second one), so that’s far from any real-world use case for deep learning algorithms. I don’t tune hyerparameters much. Meaning I’m pretty sure any of these curves could be higher or lower than the others with tuning but as we’re not in a real-world use case, it doesn’t really matter. The neural network contains five 10-neurons hidden layer with sigmoid activations (again, probably far from real-world use case).

Loss on test set = f(seconds) on a dedicated machine
Best loss on test set = f(seconds) on a dedicated machine

Limitations

Just to be clear, be sure that these algorithms do bad outside of their confort zone, which, again, to be honest, would be the confort zone of many other algorithms that don’t scale. For example, the minimizeBasic algorithm changes one weight per iteration, which is ok in a neural network with 500 weights but you sure can’t use that version for a neural network with 5M parameters. And if you randomly change 500.000 weights, you’re less prone to find a great configuration by chance, meaning you have to use smarter metrics.

5. What I learned doing this

  • Tensor computations on GPUs and optimizations are probably the only reasons why Deep Learning became that big. I can easily train a model with 15M parameters on my server with GPUs but I can barely train a 20k parameters model on my framework with one thread on one CPU.
  • Backpropagation is very efficient because it’s two operations in one. When doing backprop, you’re evaluating a way to change your parameters AND evaluating the score of your parameters at the same time. And this delta you apply to your parameters is qualitative, it’s based on a rigorous metric.
  • Doing better than backprop in some configurations will always be possible. But to create completely new and better optimization algorithms that scale well, we need to focus on this qualitative way to change the parameters. There’s tons of information that can be used, variance of parameters on a layer, statistics on one particular layer, on its outputs etc.. so I doubt that backpropagation or its derivatives give the best possible “delta” you could apply on the weights. New kind of optimization algorithms definitely should get more attention.
  • Neural networks are not blackboxes. That’s just a way to say that it’s a complex model but if you made hyperparameter optimization to find the best features to extract from images in the past, that could also be a blackbox. The way they’re optimized can be explained fairly well and the gradient descent algorithm is easy to understand in its basic form. ShakingTree also shows that finding a great value for one parameter and doing so multiple times helps in finding the overall great weights for a neural network. So backpropagation doesn’t have to always be right, it only needs to be right most of the time, that’s even more true when you use a momentum.
  • Finally, building the root of a deep learning framework is insanely hard and people developing Pytorch, Tensorflow or CuDNN truly deserve some recognition and it’s not a surprise that these guys are supported by some of the biggest companies in the world.

6. Github repository / code

We work on statistics, machine learning, AI and we want to bring Artificial General Intelligence into the real world — Twitter