Neural Network From Scratch with NumPy and MNIST
This article was first published by IBM Developer at developer.ibm.com, but authored by Casper Hansen. Here is the Direct link.
Creating complex neural networks with different architectures in Python should be a standard practice for any Machine Learning Engineer and Data Scientist. But a genuine understanding of how a neural network works is equally as valuable. This is what we aim to expand on in this article, the very fundamentals on how we can build neural networks, without the help of the frameworks that make it easy for us.
Please open the notebook from GitHub and run the code alongside reading the explanations in this article.
Prerequisite Knowledge
In this specific article, we explore how to make a basic deep neural network, by implementing the forward and backward pass (backpropagation). This requires some specific knowledge on the functionality of neural networks – which I went over in this complete introduction to neural networks.
It’s also important to know the fundamentals of linear algebra, to be able to understand why we do certain operations in this article. I have a series of articles here, where you can learn some of the fundamentals. Though, my best recommendation would be watching 3Blue1Brown’s brilliant series Essence of linear algebra.
NumPy
We are building a basic deep neural network with 4 layers in total: 1 input layer, 2 hidden layers and 1 output layer. All layers will be fully connected.
We are making this neural network, because we are trying to classify digits from 0 to 9, using a dataset called MNIST, that consists of 70000 images that are 28 by 28 pixels. The dataset contains one label for each image, specifying the digit we are seeing in each image. We say that there are 10 classes, since we have 10 labels.
For training the neural network, we will use stochastic gradient descent; which means we put one image through the neural network at a time.
Let’s try to define the layers in an exact way. To be able to classify digits, we must end up with the probabilities of an image belonging to a certain class, after running the neural network, because then we can quantify how well our neural network performed.
 Input layer: In this layer, we input our dataset, consisting of 28×28 images. We flatten these images into one array with $28 times 28 = 784$ elements. This means our input layer will have 784 nodes.
 Hidden layer 1: In this layer, we have decided to reduce the number of nodes from 784 in the input layer to 128 nodes. This brings a challenge when we are going forward in the neural network (explained later).
 Hidden layer 2: In this layer, we have decided to go with 64 nodes, from the 128 nodes in the first hidden layer. This is no new challenge, since we already reduced the number in the first layer.
 Output layer: In this layer, we are reducing the 64 nodes to a total of 10 nodes, so that we can evaluate the nodes against the label. This label is received in the form of an array with 10 elements, where one of the elements is 1, while the rest is 0.
You might realize that the number of nodes in each layer decreases from 784 nodes, to 128 nodes, to 64 nodes and then to 10 nodes. This is based on empirical observations that this yields better results, since we are not overfitting nor underfitting, but trying to get just the right number of nodes. Though, the specific number of nodes chosen for this article were just chosen at random, although decreasing to avoid overfitting. In most reallife scenarios, you would want to optimize these parameters by brute force or good guesses – usually by Grid Search or Random Search, but this is outside the scope of this article.
Imports And Dataset
For the whole NumPy part, I specifically wanted to share the imports used. Note that we use other libraries than NumPy to more easily load the dataset, but they are not used for any of the actual neural network.
from sklearn.datasets import fetch_openml
from keras.utils.np_utils import to_categorical
import numpy as np
from sklearn.model_selection import train_test_split
import time
Now we have to load the dataset and preprocess it, so that we can use it in NumPy. We do normalization by dividing all images by 255, and make it such that all images have values between 0 and 1, since this removes some of the numerical stability issues with activation functions later on. We choose to go with onehot encoded labels, since we can more easily subtract these labels from the output of the neural network. We also choose to load our inputs as flattened arrays of 28 * 28 = 784 elements, since that is what the input layer requires.
x, y = fetch_openml('mnist_784', version=1, return_X_y=True)
x = (x/255).astype('float32')
y = to_categorical(y)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.15, random_state=42)
Initialization
The initialization of weights in the neural network is kind of hard to think about. To really understand how and why the following approach works, you need a grasp of linear algebra, specifically dimensionality when using the dot product operation.
The specific problem that arises, when trying to implement the feedforward neural network, is that we are trying to transform from 784 nodes all the way down to 10 nodes. When instantiating the DeepNeuralNetwork class, we pass in an array of sizes that defines the number of activations for each layer.
dnn = DeepNeuralNetwork(sizes=[784, 128, 64, 10])
This initializes the DeepNeuralNetwork class by the init function.
def __init__(self, sizes, epochs=10, l_rate=0.001):
self.sizes = sizes
self.epochs = epochs
self.l_rate = l_rate
# we save all parameters in the neural network in this dictionary
self.params = self.initialization()
Let’s look at how the sizes affect the parameters of the neural network, when calling the initialization()
function. We are preparing m x n matrices that are “dotable”, so that we can do a forward pass, while shrinking the number of activations as the layers increase. We can only use the dot product operation for two matrices M1 and M2, where m in M1 is equal to n in M2, or where n in M1 is equal to m in M2.
With this explanation, you can see that we initialize the first set of weights W1 with $m=128$ and $n=784$, while the next weights W2 are $m=64$ and $n=128$. The number of activations in the input layer A0 is equal to 784, as explained earlier, and when we dot W1 by the activations A0, the operation is successful.
def initialization(self):
# number of nodes in each layer
input_layer=self.sizes[0]
hidden_1=self.sizes[1]
hidden_2=self.sizes[2]
output_layer=self.sizes[3]
params = {
'W1':np.random.randn(hidden_1, input_layer) * np.sqrt(1. / hidden_1),
'W2':np.random.randn(hidden_2, hidden_1) * np.sqrt(1. / hidden_2),
'W3':np.random.randn(output_layer, hidden_2) * np.sqrt(1. / output_layer)
}
return params
Feedforward
The forward pass consists of the dot operation in NumPy, which turns out to be just matrix multiplication. As described in the introduction to neural networks article, we have to multiply the weights by the activations of the previous layer. Then we have to apply the activation function to the outcome.
To get through each layer, we sequentially apply the dot operation, followed by the sigmoid activation function. In the last layer we use the softmax activation function, since we wish to have probabilities of each class, so that we can measure how well our current forward pass performs.
Note: A numerical stable version of the softmax function was chosen, you can read more from the course at Stanford called CS231n.
def forward_pass(self, x_train):
params = self.params
# input layer activations becomes sample
params['A0'] = x_train
# input layer to hidden layer 1
params['Z1'] = np.dot(params["W1"], params['A0'])
params['A1'] = self.sigmoid(params['Z1'])
# hidden layer 1 to hidden layer 2
params['Z2'] = np.dot(params["W2"], params['A1'])
params['A2'] = self.sigmoid(params['Z2'])
# hidden layer 2 to output layer
params['Z3'] = np.dot(params["W3"], params['A2'])
params['A3'] = self.softmax(params['Z3'])
return params['A3']
The following are the activation functions used for this article. As can be observed, we provide a derivative version of the sigmoid, since we will need that later on when backpropagating through the neural network.
def sigmoid(self, x, derivative=False):
if derivative:
return (np.exp(x))/((np.exp(x)+1)**2)
return 1/(1 + np.exp(x))
def softmax(self, x):
# Numerically stable with large exponentials
exps = np.exp(x  x.max())
return exps / np.sum(exps, axis=0)
Backpropagation
The backward pass is hard to get right, because there are so many sizes and operations that have to align, for all the operations to be successful. Here is the full function for the backward pass; we will go through each weight update below.
def backward_pass(self, y_train, output):
'''
This is the backpropagation algorithm, for calculating the updates
of the neural network's parameters.
Note: There is a stability issue that causes warnings. This is
caused by the dot and multiply operations on the huge arrays.
RuntimeWarning: invalid value encountered in true_divide
RuntimeWarning: overflow encountered in exp
RuntimeWarning: overflow encountered in square
'''
params = self.params
change_w = {}
# Calculate W3 update
error = 2 * (output  y_train) / output.shape[0] * self.softmax(params['Z3'], derivative=True)
change_w['W3'] = np.outer(error, params['A2'])
# Calculate W2 update
error = np.dot(params['W3'].T, error) * self.sigmoid(params['Z2'], derivative=True)
change_w['W2'] = np.outer(error, params['A1'])
# Calculate W1 update
error = np.dot(params['W2'].T, error) * self.sigmoid(params['Z1'], derivative=True)
change_w['W1'] = np.outer(error, params['A0'])
return change_w
W3 Update
The update for W3
can be calculated by subtracting the ground truth array with labels called y_train
from the output of the forward pass called output
. This operation is successful, because len(y_train)
is 10 and len(output)
is also 10.
An example of y_train
might be the following, where the 1 is corresponding to the label of the output
:
y_train = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
While an example of output
might be the following, where the numbers are probabilities corresponding to the classes of y_train
:
output = np.array([0.2, 0.2, 0.5, 0.3, 0.6, 0.4, 0.2, 0.1, 0.3, 0.7])
If we subtract them, we get the following:
>>> output  y_train
array([ 0.2, 0.2, 0.5, 0.3, 0.6, 0.4, 0.2, 0.1, 0.3, 0.7])
We use that operation when calculating the initial error, along with the length of our output vector, and the softmax derivative.
error = 2 * (output  y_train) / output.shape[0] * self.softmax(params['Z3'], derivative=True)
change_w['W3'] = np.outer(error, params['A2'])
W2 Update
The next is updating the weights W2
. More operations are involved for success. Firstly, there is a slight mismatch in shapes, because W3
has the shape (10, 64)
, and error
has (10, 64)
, i.e. the exact same dimensions. Thus, we can use a transpose operation on the W3
parameter by the .T
, such that the array has its dimensions permuted and the shapes now align up for the dot operation.
W3
now has shape (64, 10)
and error
has shape (10, 64)
, which are compatible with the dot operation. The result is multiplied elementwise (also called Hadamard product) with the outcome of the derivative of the sigmoid function of Z2
. At last, we use the outer product of two vectors to multiply the error with the activations A1
.
error = np.dot(params['W3'].T, error) * self.sigmoid(params['Z2'], derivative=True)
change_w['W2'] = np.outer(error, params['A1'])
W1 Update
Likewise, the code for updating W1
is using the parameters of the neural network one step earlier. Except for other parameters, the code is equivalent to the W2 update.
error = np.dot(params['W2'].T, error) * self.sigmoid(params['Z1'], derivative=True)
change_w['W1'] = np.outer(error, params['A0'])
Training (Stochastic Gradient Descent)
We have defined a forward and backward pass, but how can we start using them? We have to make a training loop and choose to use Stochastic Gradient Descent (SGD) as the optimizer to update the parameters of the neural network.
There are two main loops in the training function. One loop for the number of epochs, which is the number of times we run through the whole dataset, and a second loop for running through each observation one by one.
For each observation, we do a forward pass with x
, which is one image in an array with the length 784, as explained earlier. The output
of the forward pass is used along with y
, which are the onehot encoded labels (the ground truth), in the backward pass. This gives us a dictionary of updates to the weights in the neural network.
def train(self, x_train, y_train, x_val, y_val):
start_time = time.time()
for iteration in range(self.epochs):
for x,y in zip(x_train, y_train):
output = self.forward_pass(x)
changes_to_w = self.backward_pass(y, output)
self.update_network_parameters(changes_to_w)
accuracy = self.compute_accuracy(x_val, y_val)
print('Epoch: {0}, Time Spent: {1:.2f}s, Accuracy: {2}'.format(
iteration+1, time.time()  start_time, accuracy
))
The update_network_parameters()
function has the code for the SGD update rule, which just needs the gradients for the weights as input. And to be clear, SGD involves calculating the gradient using backpropagation from the backward pass, not just updating the parameters. They seem separate and they should be thought of separately, since the two algorithms are different.
def update_network_parameters(self, changes_to_w):
'''
Update network parameters according to update rule from
Stochastic Gradient Descent.
θ = θ  η * ∇J(x, y),
theta θ: a network parameter (e.g. a weight w)
eta η: the learning rate
gradient ∇J(x, y): the gradient of the objective function,
i.e. the change for a specific theta θ
'''
for key, value in changes_to_w.items():
for w_arr in self.params[key]:
w_arr = self.l_rate * value
After having updated the parameters of the neural network, we can measure the accuracy on a validation set that we conveniently prepared earlier, to validate how well our network performs after each iteration over the whole dataset.
This code uses some of the same pieces as the training function; to begin with, it does a forward pass, then it finds the prediction of the network and checks for equality with the label. We return the average of the accuracy.
def compute_accuracy(self, x_val, y_val):
'''
This function does a forward pass of x, then checks if the indices
of the maximum value in the output equals the indices in the label
y. Then it sums over each prediction and calculates the accuracy.
'''
predictions = []
for x, y in zip(x_val, y_val):
output = self.forward_pass(x)
pred = np.argmax(output)
predictions.append(pred == np.argmax(y))
return np.mean(predictions)
Finally, we can call the training function, after knowing what will happen. We use the training and validation data as input to the training function, and then we wait.
dnn.train(x_train, y_train, x_val, y_val)
Note that the results may vary a lot, depending on how the weights are initialized.
Here is the full code, for an easy copypaste and overview of what’s happening.
from sklearn.datasets import fetch_openml
from keras.utils.np_utils import to_categorical
import numpy as np
from sklearn.model_selection import train_test_split
import time
x, y = fetch_openml('mnist_784', version=1, return_X_y=True)
x = (x/255).astype('float32')
y = to_categorical(y)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.15, random_state=42)
class DeepNeuralNetwork():
def __init__(self, sizes, epochs=10, l_rate=0.001):
self.sizes = sizes
self.epochs = epochs
self.l_rate = l_rate
# we save all parameters in the neural network in this dictionary
self.params = self.initialization()
def sigmoid(self, x, derivative=False):
if derivative:
return (np.exp(x))/((np.exp(x)+1)**2)
return 1/(1 + np.exp(x))
def softmax(self, x, derivative=False):
# Numerically stable with large exponentials
exps = np.exp(x  x.max())
if derivative:
return exps / np.sum(exps, axis=0) * (1  exps / np.sum(exps, axis=0))
return exps / np.sum(exps, axis=0)
def initialization(self):
# number of nodes in each layer
input_layer=self.sizes[0]
hidden_1=self.sizes[1]
hidden_2=self.sizes[2]
output_layer=self.sizes[3]
params = {
'W1':np.random.randn(hidden_1, input_layer) * np.sqrt(1. / hidden_1),
'W2':np.random.randn(hidden_2, hidden_1) * np.sqrt(1. / hidden_2),
'W3':np.random.randn(output_layer, hidden_2) * np.sqrt(1. / output_layer)
}
return params
def forward_pass(self, x_train):
params = self.params
# input layer activations becomes sample
params['A0'] = x_train
# input layer to hidden layer 1
params['Z1'] = np.dot(params["W1"], params['A0'])
params['A1'] = self.sigmoid(params['Z1'])
# hidden layer 1 to hidden layer 2
params['Z2'] = np.dot(params["W2"], params['A1'])
params['A2'] = self.sigmoid(params['Z2'])
# hidden layer 2 to output layer
params['Z3'] = np.dot(params["W3"], params['A2'])
params['A3'] = self.softmax(params['Z3'])
return params['A3']
def backward_pass(self, y_train, output):
'''
This is the backpropagation algorithm, for calculating the updates
of the neural network's parameters.
Note: There is a stability issue that causes warnings. This is
caused by the dot and multiply operations on the huge arrays.
RuntimeWarning: invalid value encountered in true_divide
RuntimeWarning: overflow encountered in exp
RuntimeWarning: overflow encountered in square
'''
params = self.params
change_w = {}
# Calculate W3 update
error = 2 * (output  y_train) / output.shape[0] * self.softmax(params['Z3'], derivative=True)
change_w['W3'] = np.outer(error, params['A2'])
# Calculate W2 update
error = np.dot(params['W3'].T, error) * self.sigmoid(params['Z2'], derivative=True)
change_w['W2'] = np.outer(error, params['A1'])
# Calculate W1 update
error = np.dot(params['W2'].T, error) * self.sigmoid(params['Z1'], derivative=True)
change_w['W1'] = np.outer(error, params['A0'])
return change_w
def update_network_parameters(self, changes_to_w):
'''
Update network parameters according to update rule from
Stochastic Gradient Descent.
θ = θ  η * ∇J(x, y),
theta θ: a network parameter (e.g. a weight w)
eta η: the learning rate
gradient ∇J(x, y): the gradient of the objective function,
i.e. the change for a specific theta θ
'''
for key, value in changes_to_w.items():
self.params[key] = self.l_rate * value
def compute_accuracy(self, x_val, y_val):
'''
This function does a forward pass of x, then checks if the indices
of the maximum value in the output equals the indices in the label
y. Then it sums over each prediction and calculates the accuracy.
'''
predictions = []
for x, y in zip(x_val, y_val):
output = self.forward_pass(x)
pred = np.argmax(output)
predictions.append(pred == np.argmax(y))
return np.mean(predictions)
def train(self, x_train, y_train, x_val, y_val):
start_time = time.time()
for iteration in range(self.epochs):
for x,y in zip(x_train, y_train):
output = self.forward_pass(x)
changes_to_w = self.backward_pass(y, output)
self.update_network_parameters(changes_to_w)
accuracy = self.compute_accuracy(x_val, y_val)
print('Epoch: {0}, Time Spent: {1:.2f}s, Accuracy: {2:.2f}%'.format(
iteration+1, time.time()  start_time, accuracy * 100
))
dnn = DeepNeuralNetwork(sizes=[784, 128, 64, 10])
dnn.train(x_train, y_train, x_val, y_val)
Good Exercises in NumPy
You might have noticed that the code is very readable, but takes up a lot of space and could be optimized to run in loops. Here is a chance to optimize and improve the code. For newcomers, the difficulty of the following exercises are easyhard, where the last exercise is the hardest.
 Easy: Implement the ReLU activation function, or any other activation function from this overview of activation functions. Check how the sigmoid functions are implemented for reference, and remember to implement the derivative as well. Use the ReLU activation function in place of the sigmoid function.
 Easy: Initialize biases and add them to Z before the activation function in the forward pass, and update them in the backward pass. Be careful of the dimensions of the arrays when you try to add biases.

Medium: Optimize the forward and backward pass, such that they run in a for loop in each function. This makes the code easier to modify and possibly easier to maintain.
a. Optimize the initialization function that makes weights for the neural network, such that you can modify thesizes=[]
argument without the neural network failing.  Medium: Implement minibatch gradient descent, replacing stochastic gradient descent. Instead of making an update to a parameter for each sample, make an update based on the average value of the sum of the gradients accumulated from each sample in the minibatch. The size of the minibatch is usually below 64.

Hard: Implement the Adam optimizer, described in this overview of optimizers. This should be implemented in the training function.
a. Implement Momentum by adding the extra term.
b. Implement an adaptive learning rate, based on the AdaGrad optimizer.
c. Combine step a and b to finally implement Adam.
My belief is that if you complete these exercises, you will have learnt a lot. The next step would be implementing convolutions, filters and more, but that is left for a future article. As a disclaimer, there are no solutions to these exercises, but feel free to share GitHub/Colab links to your solution in the comment section.
PyTorch
Now that we have shown how to implement these calculations for the feedforward neural network with backpropagation, let’s show just how easy and how much time PyTorch saves us, in comparison to NumPy.
Loading MNIST Dataset
One of the things that seems more complicated, or harder to understand than it should be, is loading datasets with PyTorch.
You start by defining the transformation of the data, specifying that it should be a tensor and that it should be normalized. Then you use the DataLoader in combination with the datasets import to load a dataset. This is all we need, and we will see how to unpack the values from these loaders later.
import torch
from torchvision import datasets, transforms
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
train_loader = torch.utils.data.DataLoader(
datasets.MNIST('data', train=True, download=True, transform=transform))
test_loader = torch.utils.data.DataLoader(
datasets.MNIST('data', train=False, transform=transform))
Training
I have defined a class called Net, that is similar to the DeepNeuralNetwork class written in NumPy earlier. This class has some of the same methods, but you can clearly see that we don’t need to think about initializing the network parameters nor the backward pass in PyTorch, since those functions are gone along with the function for computing accuracy.
import time
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
class Net(nn.Module):
def __init__(self, epochs=10):
super(Net, self).__init__()
self.linear1 = nn.Linear(784, 128)
self.linear2 = nn.Linear(128, 64)
self.linear3 = nn.Linear(64, 10)
self.epochs = epochs
def forward_pass(self, x):
x = self.linear1(x)
x = torch.sigmoid(x)
x = self.linear2(x)
x = torch.sigmoid(x)
x = self.linear3(x)
x = torch.softmax(x, dim=0)
return x
def one_hot_encode(self, y):
encoded = torch.zeros([10], dtype=torch.float64)
encoded[y[0]] = 1.
return encoded
def train(self, train_loader, optimizer, criterion):
start_time = time.time()
loss = None
for iteration in range(self.epochs):
for x,y in train_loader:
y = self.one_hot_encode(y)
optimizer.zero_grad()
output = self.forward_pass(torch.flatten(x))
loss = criterion(output, y)
loss.backward()
optimizer.step()
print('Epoch: {0}, Time Spent: {1:.2f}s, Loss: {2}'.format(
iteration+1, time.time()  start_time, loss
))
When reading this class, we observe that PyTorch has implemented all the relevant activation functions for us, along with different types of layers. We don’t even have to think about it, we can just define some layers like nn.Linear()
for a fully connected layer.
We have imported optimizers earlier, and here we specify which optimizer we want to use, along with the criterion for the loss. We pass both the optimizer and criterion into the training function, and PyTorch starts running through our examples, just like in NumPy. We could even include a metric for measuring accuracy, but that is left out in favor of measuring the loss instead.
model = Net()
optimizer = optim.SGD(model.parameters(), lr=0.001)
criterion = nn.BCEWithLogitsLoss()
model.train(train_loader, optimizer, criterion)
TensorFlow 2.0 With Keras
For the TensorFlow/Keras version of our neural network, I chose to use a simple approach, minimizing the number of lines of code. That means we are not defining any class, but instead using the high level API of Keras to make a neural network with just a few lines of code. If you are just getting into learning neural networks, you will find that the bar to entry is the lowest when using Keras, therefore I recommend it.
We start off by importing all the functions we need for later.
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.losses import BinaryCrossentropy
We can load the dataset and preprocess it with just these few lines of code. Note that we only preprocess the training data, because we are not planning on using the validation data for this approach. I will explain how we can use the validation data later on.
(x_train, y_train), (x_val, y_val) = mnist.load_data()
x_train = x_train.astype('float32') / 255
y_train = to_categorical(y_train)
The next step is defining our model. In Keras, this is extremely simple once you know which layers you want to apply to your data. In this case, we are going for the fully connected layers, as in our NumPy example; in Keras, this is done by the Dense()
function.
Once we have defined the layers of our model, we compile the model and define the optimizer, loss function and metric. At last, we can tell Keras to fit to our training data for 10 epochs, just like in our other examples.
model = tf.keras.Sequential([
Flatten(input_shape=(28, 28)),
Dense(128, activation='sigmoid'),
Dense(64, activation='sigmoid'),
Dense(10)
])
model.compile(optimizer='SGD',
loss=BinaryCrossentropy(),
metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10)
If you want to use the validation data, you could pass it in using the validation_data
parameter of the fit function:
model.fit(x_train, y_train, epochs=10, validation_data=(x_val, y_val))
References
 Optimizing Gradient Descent by Sebastian Ruder. A very good read on the majority of optimizers.
 Neural Networks by 3Blue1Brown. Probably one of the best introductions to neural networks in general. Extremely visual and clear intuitive explanations are provided.
 Neural Networks and Deep Learning by Michael Nielsen. A formal book that intuitively explains the mathematical perspective behind neural network.