# Convolutional Neural Networks from the ground up

### A NumPy implementation of the famed Convolutional Neural Network: one of the most influential neural network architectures to date.

When Yann LeCun published his work on the development of a new kind of neural network architecture [1], the Convolutional Neural Network (CNN), his work went largely unnoticed. It took 14 years and a team of researchers from The University of Toronto to bring CNN’s into the public’s view during the 2012 ImageNet Computer Vision competition. Their entry, which they named AlexNet after chief architect Alex Krizhevsky, achieved an error of only 15.8% when tasked with classifying millions of images from thousands of categories [2]. Fast forward to 2018 and the current state-of-the-art Convolutional Neural Networks achieve accuracies that surpass human-level performance [3].

Motivated by these promising results, I set out to understand how CNN’s function, and how it is that they perform so well. As Richard Feynman pointed out, “What I cannot build, I do not understand”, and so to gain a well-rounded understanding of this advancement in AI, I built a convolutional neural network from scratch in NumPy. After finishing this project I feel that there’s a disconnect between how complex convolutional neural networks appear to be, and how complex they really are. Hopefully, you too will share this feeling after building your own network from scratch.

Code for this project can be found here.

### The Challenge

CNN’s are best known for their ability to recognize patterns present in images, and so the task chosen for the network described in this post was that of image classification. One of the most common benchmarks for gauging how well a computer vision algorithm performs is to train it on the MNIST handwritten digit database: a collection of 70,000 handwritten digits and their corresponding labels. The goal is to train a CNN to be as accurate as possible when labeling handwritten digits (ranging from 0–9). After about five hours of training and two loops over the training set, the network presented here was able to achieve an accuracy of 98% on the test data, meaning it could correctly guess almost every handwritten digit shown to it.

Let’s go over the individual components that form the network and how they link together to form predictions from the input data. After explaining each component, we will code its functionality. In the last section of this post, we’ll program every piece of the network and train it using NumPy (Code here). It is important to note that this section assumes at least a working knowledge of linear algebra and calculus, as well as familiarity with the Python programming language. If you are unfamiliar with these domains or are in need of a tune-up, check out this publication to learn about linear algebra in the scope of machine learning and this resource to start programming with Python. Without further ado, let’s get into it.

### How Convolutional Neural Networks learn

#### Convolutions

CNN’s make use of filters (also known as kernels), to detect what features, such as edges, are present throughout an image. A filter is just a matrix of values, called weights, that are trained to detect specific features. The filter moves over each part of the image to check if the feature it is meant to detect is present. To provide a value representing how confident it is that a specific feature is present, the filter carries out a convolution operation, which is an element-wise product and sum between two matrices.

When the feature is present in part of an image, the convolution operation between the filter and that part of the image results in a real number with a high value. If the feature is not present, the resulting value is low.

In the following example, a filter that is in charge of checking for right-hand curves is passed over a part of the image. Since that part of the image contains the same curve that the filter is looking for, the result of the convolution operation is a large number (6600).

But when that same filter is passed over a part of the image with a considerably different set of edges, the convolution’s output is small, meaning that there was no strong presence of a right hand curve.

The result of passing this filter over the entire image is an output matrix that stores the convolutions of this filter over various parts of the image. The filter must have the same number of channels as the input image so that the element-wise multiplication can take place. For instance, if the input image contains three channels (RGB, for example), then the filter must contain three channels as well.

The convolution of a filter over a 2D image:

Additionally, a filter can be slid over the input image at varying intervals, using a stride value. The stride value dictates by how much the filter should move at each step. The output dimensions of a strided convolution can be calculated using the following equation:

Where n_in denotes the dimension of the input image, f denotes the window size, and s denotes the stride.

So that the Convolutional Neural Network can learn the values for a filter that detect features present in the input data, the filter must be passed through a non-linear mapping. The output of the convolution operation between the filter and the input image is summed with a bias term and passed through a non-linear activation function. The purpose of the activation function is to introduce non-linearity into our network. Since our input data is non-linear (it is infeasible to model the pixels that form a handwritten signature linearly), our model needs to account for that. To do so, we use the Rectified Linear Unit (ReLU) activation function:

As you can see, the ReLU function is quite simple; values that are less than or equal to zero become zero and all positive values remain the same.

Usually, a network utilizes more than one filter per layer. When that is the case, the outputs of each filter’s convolution over the input image are concatenated along the last axis, forming a final 3D output.

The Code

Using NumPy, we can program the convolution operation quite easily. The convolution function makes use of a for-loop to convolve all the filters over the image. Within each iteration of the for-loop, two while-loops are used to pass the filter over the image. At each step, the filter is multipled element-wise(*) with a section of the input image. The result of this element-wise multiplication is then summed to obtain a single value using NumPy’s sum method, and then added with a bias term.

The filt input is initialized using a standard normal distribution and bias is initialized to be a vector of zeros.

After one or two convolutional layers, it is common to reduce the size of the representation produced by the convolutional layer. This reduction in the representation’s size is known as downsampling.

#### Downsampling

To speed up the training process and reduce the amount of memory consumed by the network, we try to reduce the redundancy present in the input feature. There are a couple of ways we can downsample an image, but for this post, we will look at the most common one: max pooling.

In max pooling, a window passes over an image according to a set stride (how many units to move on each pass). At each step, the maximum value within the window is pooled into an output matrix, hence the name max pooling.

In the following visual, a window of size f=2 passes over an image with a stride of 2. f denotes the dimensions of the max pooling window (red box) and s denotes the number of units the window moves in the x and y-direction. At each step, the maximum value within the window is chosen:

Max pooling significantly reduces the representation size, in turn reducing the amount of memory required and the number of operations performed later in the network. The output size of the max pooling operation can be calculated using the following equation:

Where n_in denotes the dimension of the input image, f denotes the window size, and s denotes the stride.

An added benefit of max pooling is that it forces the network to focus on a few neurons instead of all of them which has a regularizing effect on the network, making it less likely to overfit the training data and hopefully generalize well.

The Code

The max pooling operation boils down to a for loop and a couple of while loops. The for-loop is used pass through each layer of the input image, and the while-loops slide the window over every part of the image. At each step, we use NumPy’s max method to obtain the maximum value:

After multiple convolutional layers and downsampling operations, the 3D image representation is converted into a feature vector that is passed into a Multi-Layer Perceptron, which merely is a neural network with at least three layers. This is referred to as a Fully-Connected Layer.

#### Fully-Connected Layer

In the fully-connected operation of a neural network, the input representation is flattened into a feature vector and passed through a network of neurons to predict the output probabilities. The following image describes the flattening operation:

The rows are concatenated to form a long feature vector. If multiple input layers are present, its rows are also concatenated to form an even longer feature vector.

The feature vector is then passed through multiple dense layers. At each dense layer, the feature vector is multiplied by the layer’s weights, summed with its biases, and passed through a non-linearity.

The following image visualizes the fully connected operation and dense layers:

It is worth noting that, according to this Facebook post by Yann LeCun, “there is no such thing as a fully connected layer,” and he’s right. When thinking back to the convolutional layer, one realizes that a fully connected layer is a convolutional operation with a 1×1 output kernel. That is, If we pass 128 n-by-n filters over an image of dimensions n-by-n, what we would end up with is a vector of length 128.

The Code

NumPy makes it quite simple to program the fully connected layer of a CNN. As a matter of fact, you can do it in a single line of code using NumPy’s reshape method:

In this code snippet, we gather the dimensions of the previous layer (number of channels and height/width) then use them to flatten the previous layer into a fully connected layer. This fully connected layer is proceeded by multiple dense layers of neurons that eventually produce raw predictions:

#### Output Layer

The output layer of a CNN is in charge of producing the probability of each class (each digit) given the input image. To obtain these probabilities, we initialize our final Dense layer to contain the same number of neurons as there are classes. The output of this dense layer then passes through the Softmax activation function, which maps all the final dense layer outputs to a vector whose elements sum up to one:

Where x denotes each element in the final layer’s outputs.

The Code

Once again, the softmax function can be written in a few lines of simple code:

#### Calculating the Loss

To measure how accurate our network was in predicting the handwritten digit from the input image, we make use of a loss function. The loss function assigns a real-valued number to define the model’s accuracy when predicting the output digit. A common loss function to use when predicting multiple output classes is the Categorical Cross-Entropy Loss function, defined as follows:

Here, ŷ is the CNN’s prediction, and y is the desired output label. When making predictions over multiple examples, we take the average of the loss over all examples.

The Code

The Categorical Cross-Entropy loss function can be easily programmed using two simple lines of code, which are a mirror of the equation shown above:

This about wraps up the operations that compose a convolutional neural network. Let us join these operations to construct the CNN.

### The Network

Given the relatively low amount of classes (10 in total) and the small size of each training image (28x28px.), a simple network architecture was chosen to solve the task of digit recognition. The network uses two consecutive convolutional layers followed by a max pooling operation to extract features from the input image. After the max pooling operation, the representation is flattened and passed through a Multi-Layer Perceptron (MLP) to carry out the task of classification.

### Programming the CNN

Now that we have gone over the elementary operations that form a Convolutional Neural Network, let’s create it.

Feel free to use this repo when following along.

#### Step 1: Getting the Data

The MNIST handwritten digit training and test data can be obtained here. The files store image and label data as tensors, so the files must be read through their bytestream. We define two helper methods to perform the extraction:

#### Step 2: Initialize parameters

We first define methods to initialize both the filters for the convolutional layers and the weights for the dense layers. To make for a smoother training process, we initialize each filter with a mean of 0 and a standard deviation of 1.

#### Step 3: Define the backpropagation operations

To compute the gradients that will force the network to update its weights and optimize its objective, we need to define methods that backpropagate gradients through the convolutional and max pooling layers. To keep this post (relatively) short, I won’t go into the derivation of these gradients but, If you would like me to make a post that describes backpropagation through a convolutional neural network, leave a comment below.

#### Step 4: Building the network

In the spirit abstraction, we now define a method that combines the forward and backward operations of a convolutional neural network. It takes the network’s parameters and hyperparameters as inputs and spits out the gradients:

#### Step 5: Training the network

To efficiently force the network’s parameters to learn meaningful representations, we use the Adam optimization algorithm. I won’t go into much detail regarding this algorithm, but it can be thought of this way: if stochastic gradient descent is a drunk college student stumbling down a hill, then Adam is a bowling ball beaming down that same hill. A better explanation of Adam found here.

That about wraps up the development of the network. To train it locally, download this repo and run the following command in the terminal:

$python3 train_cnn.py '<file_name>.pkl' Replace <file_name> with whatever file name you would like. The terminal should display the following progress bar to training progress, as well as the cost for the current training batch. After the CNN has finished training, a .pkl file containing the network’s parameters is saved to the directory where the script was run. The network takes about 5 hours to train on my macbook pro. I included the trained params in the GitHub repo under the name params.pkl. To use them, replace <file_name> with params.pkl. To measure the network’s accuracy, run the following command in the terminal: $ python3 measure_performance.py '<file_name>.pkl'

This command will use the trained parameters to run predictions on all 10,000 digits in the test dataset. After all predictions are made, a value displaying the network’s accuracy will appear in the command prompt:

If you run into any issues regarding dependencies, the following command can be used to install the required packages:

\$ pip install -r requirements.txt

### Results

After two epochs over the training set, the network’s accuracy on the test set averaged 98%, which I would say is quite decent. After extending the training time by 2–3 epochs, I found that the test set performance decreased. I speculate that on the third to fourth training loop, the network begins overfitting the training set and is no longer generalizing.

Because we were passing in batches of data, the network had to account for the variability in every new batch, which is why the cost fluctuated so heavily during training time:

Additionally, we measure the network’s recall to understand how well it is able to predict each digit. Recall is a measure of accuracy, and it can be understood with the following example: of all the digits labeled ‘7’(or any other digit) in our test set, how many did our network correctly predict?

The following bar graph displays the recall for each digit:

This indicates that our network learned meaningful representations for all the digits. Overall, the CNN generalized well.

### Conclusion

Hopefully this post provided you with a richer understanding of convolutional neural networks, and perhaps even removed their perceived complexity. If you have any questions or would like to know a bit more, drop a comment below 🙂

#### References

[1]: Lecun, Y., et al. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE, vol. 86, no. 11, 1998, pp. 2278–2324., doi:10.1109/5.726791.

[2]: Krizhevsky, Alex, et al. “ImageNet Classification with Deep Convolutional Neural Networks.” Communications of the ACM, vol. 60, no. 6, 2017, pp. 84–90., doi:10.1145/3065386.

[3]: He, Kaiming, et al. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.” 2015 IEEE International Conference on Computer Vision (ICCV), 2015, doi:10.1109/iccv.2015.123.