# Neural Networks: Feedforward and Backpropagation Explained & Optimization

Towards really understanding neural networks.

One of the most recognized concepts in Deep Learning (a subfield of Machine Learning) is the neural network.

Something fairly important is that all types of neural networks are different combinations of the same basic principles. When you know the basics of how neural networks work, new architectures are just small additions to everything you already know.

Moving forward, the above will be the primary motivation for every other deep learning post on this website.

## Overview

The big picture in neural networks is how we go from having some data to throwing it into some algorithm and hoping for the best. But what happens inside that algorithm? This question is important to answer for many reasons; one is that you might otherwise regard the inner workings of a neural network as a black box.

Neural networks consist of neurons, connections between these neurons called weights, and some biases connected to each neuron. We distinguish between input, hidden and output layers, where we hope each layer helps us towards solving our problem.

To move forward through the network, called a forward pass, we iteratively use a formula to calculate each neuron in the next layer. Don’t worry about the notation for now; we call the values of the neurons activations $a$, the weights $w$ and the biases $b$, and these are collected in vectors and matrices.

$a^{(l)} = \sigma\left( \boldsymbol{W}\boldsymbol{a}^{(l-1)} + \boldsymbol{b} \right)$

This takes us forward until we get an output. We measure how good this output $\hat{y}$ is with a cost function $C$, which compares it to the result we wanted in the output layer, $y$, and we do this for every example. A commonly used cost function is the mean squared error (MSE):

$$C = \frac{1}{n} \sum_{i=1}^n (y_i-\hat{y}_i)^2$$

Given the first result, we go back and adjust the weights and biases so that we optimize the cost function; this is called a backward pass. We essentially try to adjust the whole neural network so that the output value is optimized. In a sense, this is how we tell the algorithm whether it performed poorly or well. We keep trying to optimize the cost function by running through new observations from our dataset.

To update the network, we calculate so-called gradients, which tell us how to nudge (update) the individual weights in each layer.

$$\frac{\partial C}{\partial w^{(L)}} = \frac{\partial C}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial w^{(L)}}$$

We simply go through each weight, e.g. in the output layer, and subtract the learning rate times the partial derivative of the cost with respect to that weight from the value the weight had before.

$$w^{(L)} = w^{(L)} - \text{learning rate} \times \frac{\partial C}{\partial w^{(L)}}$$

Add something called mini-batches, where we average the gradient over a defined number of observations per mini-batch, and then you have the basic neural network setup.

I’m going to explain each part in great detail if you continue reading. Refer to the table of contents if you want to read something specific.

We start off with feedforward neural networks, then go into the notation for a bit, then give a deep explanation of backpropagation, and at last an overview of how optimizers help us use the backpropagation algorithm, specifically stochastic gradient descent.

## What is a neural network?

There is so much terminology to cover. Let me just take it step by step, and then you will need to sit tight.

A neural network is an algorithm inspired by the neurons in our brain. It is designed to recognize patterns in complex data, and often performs best when recognizing patterns in audio, images or video.

### Neurons — Connected

A neural network simply consists of neurons (also called nodes). These nodes are connected in some way. Then each neuron holds a number, and each connection holds a weight.

These neurons are split between the input, hidden and output layers. In practice, there are many layers, and there is no generally best number of layers. White circles correspond to neurons, and yellow arrows are the connections (each with a weight) from one neuron to another. The white boxes above indicate which layer is which.

The idea is that we input data into the input layer, which sends the numbers from our data ping-ponging forward, through the different connections, from one neuron to another in the network. Once we reach the output layer, we hopefully have the number we wished for.

The input data is just your dataset, where each observation is run through sequentially from $x=1,…,x=i$. Each neuron has some activation — a value between 0 and 1, where 1 is the maximum activation and 0 is the minimum activation a neuron can have. That is, if we use the activation function called sigmoid, explained below. Thus, it is recommended to scale your data to values between 0 and 1 (e.g. by using MinMaxScaler from Scikit-Learn).
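As a rough sketch of that scaling step, the snippet below rescales each feature column to the range 0 to 1 with plain NumPy. It is meant to mirror the idea behind Scikit-Learn's `MinMaxScaler` with its default range, not to replace it; the function name `min_max_scale` and the toy data are made up for illustration.

```python
import numpy as np

def min_max_scale(X):
    """Rescale every column of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Assumption: constant columns are mapped to 0 instead of dividing by zero.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

# Toy dataset: 3 observations, 2 features (hypothetical values).
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 300.0]])
X_scaled = min_max_scale(X)
print(X_scaled)
```

Each column is scaled independently, so features measured in very different units end up on the same 0 to 1 scale as the sigmoid activations.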

#### From input layer to hidden layer

The input layer is more or less given to us by the dataset that we input, but what about the layers afterwards? What happens is just a lot of ping-ponging of numbers; it is nothing more than basic math operations. We look at all the neurons in the input layer that are connected to a new neuron in the next layer (which is a hidden layer).

Remember this: each neuron has an activation $a$, and each connection from a neuron to a new neuron has a weight $w$. Activations are typically numbers in the range 0 to 1, and a weight is a floating-point number, e.g. 2.2, -1.2, 0.4 etc. This is an example supposing we have the value for each activation and weight to a new neuron.

(see Stochastic Gradient Descent for weight explanation)
Then one could multiply activations by weights and sum the results to get a single neuron in the next layer, from the first weights and activations $w_1a_1$ all the way to $w_na_n$:

$$w_1a_1+w_2a_2+\ldots+w_na_n = \text{new neuron}$$

That is, multiply $n$ weights by their activations and add up the products to get the value of a new neuron.

$$1.1 \times 0.3+2.6 \times 1.0 = 2.93$$

The procedure is the same moving forward in the network of neurons, hence the name feedforward neural network.

#### Activation Functions

But things are not that simple. We also have an activation function, most commonly a sigmoid function, which squashes the output to be between 0 and 1 again. The sigmoid is also known as the logistic function. In future posts, a comparison or walkthrough of many activation functions will be posted.

$$\text{sigmoid} = \sigma(x) = \frac{1}{1+e^{-x}} = \text{number between 0 and 1}$$

We wrap the equation for new neurons with the activation function, i.e. we apply it to the sum of the products of the weights and activations:

$$\sigma(w_1a_1+w_2a_2+\ldots+w_na_n) = \text{new neuron}$$

Image of the sigmoid function. Given an input $x$, we get a number between 0 and 1.

Now we just need to explain adding a bias to the equation, and then you have the basic setup of calculating a new neuron’s value.

The bias shifts the point at which the value of the new neuron starts to be meaningful. So you add a bias, which may be positive or negative, to the sum of the products of activations and weights.

$$\sigma(w_1a_1+w_2a_2+\ldots+w_na_n + b) = \text{new neuron}$$
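To make the formula concrete, here is a minimal sketch of a single neuron in plain Python. It reuses the weights and activations from the numeric example above; the function names `sigmoid` and `neuron` and the bias value of -2.0 are assumptions made for illustration.

```python
import math

def sigmoid(x):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(weights, activations, bias):
    """Weighted sum of the incoming activations plus a bias, passed through sigmoid."""
    z = sum(w * a for w, a in zip(weights, activations)) + bias
    return sigmoid(z)

# Weights and activations from the numeric example above.
weights = [1.1, 2.6]
activations = [0.3, 1.0]

weighted_sum = sum(w * a for w, a in zip(weights, activations))
print(weighted_sum)                          # roughly 2.93
print(neuron(weights, activations, 0.0))     # roughly 0.949
print(neuron(weights, activations, -2.0))    # a negative bias lowers the activation
```

Without the sigmoid, the value 2.93 would fall outside the 0 to 1 range; the activation function brings it back in, and the bias decides how large the weighted sum must be before the neuron becomes strongly active.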

There are many other types of activation functions besides the sigmoid.

This is all there is to a very basic neural network, the feedforward neural network. But we need to introduce other algorithms into the mix, to introduce you to how such a network actually learns.

Before moving into the heart of what makes neural networks learn, we have to talk about the notation. At least for me, I got confused about the notation at first, because not many people take the time to explain it.

## Math for neural networks

Before moving into the more advanced algorithms, I would like to provide some of the notation and general math knowledge for neural networks — or at least resources for it, if you don’t know linear algebra or calculus.

### Notation: Linear Algebra

When learning neural network theory, one will often find that most of the neurons and layers are formatted in linear algebra. Note that I did a short series of articles, where you can learn linear algebra from the bottom up. I would recommend reading most of them and try to understand them. Leave a comment if you don’t and I will do my best to answer in time.

The notation is quite neat, but can also be cumbersome. Let me start from the final equation of the previous section and work my way up to the matrix form:

$$\sigma(w_1a_1+w_2a_2+\ldots+w_na_n + b) = \text{new neuron}$$

So what we start off with is organising activations and weights into a corresponding matrix.

We denote each activation by $a_{\text{neuron}}^{(\text{layer})}$, e.g. $a_{2}^{(1)}$ corresponds to the third neuron in the second layer (we count from 0). So the number below (subscript) tells us which neuron we are talking about, and the number above (superscript) tells us which layer we are talking about, counting from zero.

We denote each weight by $w_{\text{to},\text{from}}$, where the "to" index is denoted by $j$ and the "from" index by $k$, e.g. $w_{2,3}^{(2)}$ means the weight to the third neuron in the third layer from the fourth neuron in the previous layer (the second layer), since we count from zero. It also makes sense when checking up on the matrix for $w$, but I won’t go into the details here.

To calculate each activation in the next layer, we need all the activations from the previous layer:

$$\begin{bmatrix}
a_0^{(0)}\\
a_1^{(0)}\\
\vdots \\
a_n^{(0)}\\
\end{bmatrix}$$

And all the weights connected to each neuron in the next layer:

$$\begin{bmatrix}
w_{0,0} & w_{0,1} & \cdots & w_{0,k}\\
w_{1,0} & w_{1,1} & \cdots & w_{1,k}\\
\vdots & \vdots & \ddots & \vdots \\
w_{j,0} & w_{j,1} & \cdots & w_{j,k}\\
\end{bmatrix}$$

Combining these two with matrix multiplication (read my post on it), adding a bias vector with one entry per neuron in the next layer, and wrapping the whole expression in the sigmoid function, we get:

$\sigma \left( \begin{bmatrix} w_{0,0} & w_{0,1} & \cdots & w_{0,k}\\ w_{1,0} & w_{1,1} & \cdots & w_{1,k}\\ \vdots & \vdots & \ddots & \vdots \\ w_{j,0} & w_{j,1} & \cdots & w_{j,k}\\ \end{bmatrix} \, \begin{bmatrix} a_0^{(0)}\\ a_1^{(0)}\\ \vdots \\ a_n^{(0)}\\ \end{bmatrix} + \begin{bmatrix} b_0\\ b_1\\ \vdots \\ b_j\\ \end{bmatrix} \right)$

This is the final expression, the one that is neat but perhaps cumbersome if you did not follow along:

$a^{(1)}= \sigma\left( \boldsymbol{W}\boldsymbol{a}^{(0)}+\boldsymbol{b} \right)$

Sometimes we might reduce the notation even more and replace the weights, activations and biases within the sigmoid function with a mere $\boldsymbol{z}$:

$a^{(1)}= \sigma\left( \boldsymbol{z} \right)$

We take all the activations from the first layer $\boldsymbol{a}^{(0)}$, do matrix multiplication with all the weights connecting each neuron from the first to the second layer $\boldsymbol{W}$, add a bias vector, and at last apply the sigmoid function $\sigma$ to the result. From this, we get a vector of all the activations in the second layer.
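In code, one layer of the forward pass is a single line of linear algebra. The sketch below uses NumPy; the layer sizes (3 neurons feeding 2 neurons), the numbers and the helper name `layer_forward` are assumptions chosen only to show the shapes involved.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, a_prev, b):
    """Compute a = sigmoid(W a_prev + b) for one layer."""
    z = W @ a_prev + b
    return sigmoid(z)

# Hypothetical sizes: 3 neurons in the first layer, 2 in the second.
a0 = np.array([0.3, 1.0, 0.5])        # activations of the first layer
W = np.array([[1.1, 2.6, -0.4],       # row j holds the weights going *to* neuron j
              [0.2, -1.2, 0.4]])
b = np.array([0.0, 0.1])              # one bias per neuron in the second layer

a1 = layer_forward(W, a0, b)
print(a1.shape)   # (2,)
print(a1)
```

Row $j$ of `W` matches the $w_{j,k}$ indexing used here: the row picks the neuron we go to, the column picks the neuron we come from.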

### Calculus Knowledge

You need to know how to find the slope of a tangent line, i.e. how to find the derivative of a function. In practice, you don’t actually need to know how to compute every derivative, but you should at least have a feel for what a derivative means.

There are different rules for differentiation. One of the most important and most used is the chain rule, but there are several other rules that are good to know if you want to calculate the gradients in the upcoming algorithms. The partial derivative, where we take the derivative with respect to one variable and treat the rest as constants, is also valuable to have some knowledge about.

My own opinion is that you don’t need to be able to do the math, you just have to be able to understand the process behind these algorithms. I will pick apart each algorithm, to a more down to earth understanding of the math behind these prominent algorithms.

To summarize, you should understand what these terms mean, or be able to do the calculations for:

• Matrices; matrix multiplication and addition, the notation of matrices.
• Derivatives; measuring the steepness at a particular point of a slope on a graph.
• Partial derivative; the derivative with respect to one variable, while the rest are held constant.
• The chain rule; finding the derivative of a composite of two or more functions.
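To get a feel for what a derivative means, the following sketch estimates the slope of the sigmoid numerically and compares it with the well-known closed form $\sigma'(x) = \sigma(x)(1-\sigma(x))$. The helper names and the step size `h` are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Closed-form derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numerical_derivative(f, x, h=1e-5):
    """Slope of f at x, estimated with a central difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid_prime(x), numerical_derivative(sigmoid, x))
```

The two columns agree to many decimal places, and the slope is steepest at $x=0$, where it equals 0.25.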

Now that you understand the notation, we should move into the heart of what makes neural networks work. This algorithm is part of every neural network. When I break it down, there is some math, but don’t be frightened. What the math does is actually fairly simple, if you get the big picture of backpropagation.

## Backpropagation

Backpropagation is the heart of every neural network. Firstly, we need to make a distinction between backpropagation and optimizers (which are covered later).

Backpropagation is for calculating the gradients efficiently, while optimizers are for training the neural network, using the gradients computed with backpropagation. In short, all backpropagation does for us is compute the gradients. Nothing more.

SO.. Err, how do we go backwards?

We always start from the output layer and propagate backwards, updating weights and biases for each layer.

The idea is simple: adjust the weights and biases throughout the network, so that we get the desired output in the output layer. Say we wanted the output neuron to be 1.0, then we would need to nudge the weights and biases so that we get an output closer to 1.0.

We can only change the weights and biases, but activations are direct calculations of those weights and biases, which means we indirectly can adjust every part of the neural network, to get the desired output — except for the input layer, since that is the dataset that you input.

Now, before the equations, let’s define what each variable means. We have already defined some of them, but it’s good to summarize. Some of this should be familiar to you, if you read the post.

PLEASE! Pay attention to the notation used between L, L-1 and l. I intentionally mix it up, so that you can get an understanding of how both of them work.

Firstly, let’s start by defining the relevant equations. Note that any indexing explained earlier is left out here, and we abstract to each layer instead of each weight, bias or activation:

$$z^{(L)}=w^{(L)} \times a^{(L-1)} + b^{(L)}$$
$$a^{(L)}= \sigma\left( \boldsymbol{z}^{(L)} \right)$$
$$C=(a^{(L)}-y)^2$$

More on the cost function later in the cost function section.

The way we might discover how to calculate gradients in the backpropagation algorithm is by thinking of this question:

How might we measure the change in the cost function in relation to a specific weight, bias or activation?

Mathematically, this is why we need to understand partial derivatives, since they allow us to compute the relationship between components of the neural network and the cost function. And as should be obvious, we want to minimize the cost function. When we know what affects it, we can effectively change the relevant weights and biases to minimize the cost function.

$$\frac{\partial C}{\partial w^{(L)}} = \frac{\partial C}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial w^{(L)}} = 2 \left(a^{(L)} - y \right) \sigma' \left(z^{(L)}\right) a^{(L-1)}$$

If you are not a math student or have not studied calculus, this is not at all clear. So let me try to make it more clear.

The squished ‘d’ is the partial derivative sign. $\partial C/\partial w^{L}$ means that we look into the cost function $C$ and, within it, take the derivative only with respect to $w^{L}$, i.e. the rest of the variables are treated as constants. I’m not showing how to differentiate in this article, as there are many great resources for that.

Although $w^{L}$ is not directly found in the cost function, we start by considering the change of w in the z equation, since that z equation holds a w. Next we consider the change of $z^{L}$ in $a^{L}$, and then the change $a^{L}$ in the function $C$. Effectively, this measures the change of a particular weight in relation to a cost function.
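For reference, the three factors can be worked out one at a time from the definitions of $z^{(L)}$, $a^{(L)}$ and $C$. This is a sketch for a single weight, assuming the input to the layer is the previous activation, so that $z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}$:

$$\frac{\partial C}{\partial a^{(L)}} = 2\left(a^{(L)} - y\right), \qquad \frac{\partial a^{(L)}}{\partial z^{(L)}} = \sigma'\left(z^{(L)}\right), \qquad \frac{\partial z^{(L)}}{\partial w^{(L)}} = a^{(L-1)}$$

Multiplying the three factors gives the right-hand side of the equation above. For the sigmoid, the middle factor has the closed form $\sigma'(z) = \sigma(z)\left(1-\sigma(z)\right)$.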

We measure a ratio between the weights (and biases) and the cost function. The ones with the largest ratio will have the greatest impact on the cost function and will give us ‘the most bang for our buck’.

### Three equations for calculating the gradient

We need to move backwards in the network and update the weights and biases. Let’s introduce how to do that with math. One equation for weights, one for biases and one for activations:

$$\frac{\partial C}{\partial w^{(L)}} = \frac{\partial C}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial w^{(L)}}$$
$$\frac{\partial C}{\partial b^{(L)}} = \frac{\partial C}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial b^{(L)}}$$
$$\frac{\partial C}{\partial a^{(L-1)}} = \frac{\partial C}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial a^{(L-1)}}$$

Remember that these equations just measure the ratio of how a particular weight, bias or activation affects the cost function, which we want to minimize. We optimize by stepping in the opposite direction of the output of these equations. It really is (almost) that simple.
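The three equations translate almost line by line into code. Below is a minimal sketch for a single output neuron with a sigmoid activation and the cost $C=(a^{(L)}-y)^2$; the function names and the numeric values are hypothetical, and the finite-difference check at the end is only there to confirm that the chain rule gives the same slope.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cost(w, b, a_prev, y):
    """C = (a - y)^2 with a = sigmoid(w * a_prev + b)."""
    return (sigmoid(w * a_prev + b) - y) ** 2

def output_layer_gradients(w, b, a_prev, y):
    """Return (dC/dw, dC/db, dC/da_prev) for a single output neuron."""
    z = w * a_prev + b
    a = sigmoid(z)
    dC_da = 2.0 * (a - y)        # dC/da
    da_dz = a * (1.0 - a)        # sigma'(z) for the sigmoid
    shared = dC_da * da_dz       # factor common to all three equations
    # dz/dw = a_prev, dz/db = 1, dz/da_prev = w
    return shared * a_prev, shared * 1.0, shared * w

# Hypothetical values for one weight, bias, incoming activation and target.
w, b, a_prev, y = 0.8, -0.3, 0.6, 1.0
dw, db, da_prev = output_layer_gradients(w, b, a_prev, y)

# Sanity check against a finite-difference estimate of dC/dw.
h = 1e-6
dw_numeric = (cost(w + h, b, a_prev, y) - cost(w - h, b, a_prev, y)) / (2 * h)
print(dw, dw_numeric)
```

Here the output is below the target, so `dw` is negative: increasing the weight would lower the cost, which is exactly what stepping against the gradient does.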

Each partial derivative with respect to the weights and biases is saved in a gradient vector, which has as many dimensions as you have weights and biases. The gradient is written with the upside-down triangle symbol $\nabla$, with the index running over all weights and biases:

$$\nabla C(w_1, b_1,\ldots, w_n, b_n) = \begin{bmatrix} \frac{\partial C}{\partial w_1} \\ \frac{\partial C}{\partial b_1} \\ \vdots \\ \frac{\partial C}{\partial w_n} \\ \frac{\partial C}{\partial b_n} \end{bmatrix}$$

Activations are also a good idea to keep track of, to see how the network reacts to changes, but we don’t save them in the gradient vector. Importantly, they also help us measure which weights matter the most, since weights are multiplied by activations. From an efficiency standpoint, this is important to us.

You compute the gradient on a mini-batch of your data (sizes such as 16 or 32 are common choices), i.e. you subsample your observations into batches. For each observation in your mini-batch, you compute the partial derivative for each weight and bias, and then you average these over the mini-batch. That average becomes the gradient, which gives a step in the average best direction over the mini-batch.

Then you would update the weights and biases after each mini-batch. Each weight and bias is ‘nudged’ a certain amount for each layer $l$:

$$w^{(l)} = w^{(l)} - \text{learning rate} \times \frac{\partial C}{\partial w^{(l)}}$$
$$b^{(l)} = b^{(l)} - \text{learning rate} \times \frac{\partial C}{\partial b^{(l)}}$$

The learning rate is usually written as an alpha $\alpha$ or eta $\eta$.
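A minimal sketch of one such update in plain Python: the per-observation gradients of a mini-batch are averaged, and every parameter is moved a small step against that average. The function name `sgd_step`, the parameter values and the gradients are all made up for illustration.

```python
def sgd_step(params, per_sample_grads, learning_rate):
    """Average the per-sample gradients and take one step against that average."""
    n = len(per_sample_grads)
    updated = []
    for i, p in enumerate(params):
        avg_grad = sum(g[i] for g in per_sample_grads) / n
        updated.append(p - learning_rate * avg_grad)
    return updated

# Hypothetical example: two parameters (one weight, one bias), mini-batch of 4.
params = [0.5, 0.0]
per_sample_grads = [
    [0.2, -0.1],
    [0.4, 0.1],
    [0.0, -0.3],
    [0.2, -0.1],
]
new_params = sgd_step(params, per_sample_grads, learning_rate=0.1)
print(new_params)   # roughly [0.48, 0.01]
```

The average gradient for the weight is 0.2, so the weight decreases; the average for the bias is -0.1, so the bias increases.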

But this is not all there is to it. The three equations I showed are just for the output layer. If we were to move one layer back through the network, there would be more partial derivatives to compute for each weight, bias and activation. We have to move all the way back through the network and adjust each weight and bias.

#### Example: Going Deeper

Taking the rest of the layers into consideration, we have to chain more partial derivatives to find the gradient for a weight in the first layer, but we do not have to compute anything else.

If we look at the hidden layer in the previous example, we would have to use the previous partial derivatives as well as some newly calculated partial derivatives. To help you see why, you should look at the dependency graph below, since it helps explain each layer’s dependencies on the previous weights and biases.

Updating the weights and biases in layer 2 (or $L$) depends only on the cost function, and the weights and biases connected to layer 2. Similarly, for updating layer 1 (or $L-1$), the dependencies are on the calculations in layer 2 and the weights and biases in layer 1. This adds up: if we had more layers, there would be more dependencies. As you might find, this is why we call it ‘back propagation’.

As the graph above shows, to calculate the weights connected to the hidden layer, we will have to reuse the previous calculations for the output layer (L or layer 2). Let me just remind you of them:

$$\frac{\partial C}{\partial w^{(2)}} = \frac{\partial C}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial w^{(2)}}$$
$$\frac{\partial C}{\partial b^{(2)}} = \frac{\partial C}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial b^{(2)}}$$

If we wanted to calculate the updates for the weights and biases connected to the hidden layer (L-1 or layer 1), we would have to reuse some of the previous calculations.

$$\frac{\partial C}{\partial w^{(1)}} = \underbrace{ \frac{\partial C}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} }_{\text{Reused from } \frac{\partial C}{\partial w^{(2)}}} \, \frac{\partial z^{(2)}}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial w^{(1)}}$$
$$\frac{\partial C}{\partial b^{(1)}} = \underbrace{ \frac{\partial C}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} }_{\text{Reused from } \frac{\partial C}{\partial b^{(2)}}} \, \frac{\partial z^{(2)}}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial b^{(1)}}$$

We use all previous calculations, except the partial derivatives with respect to either the weights or biases of a layer, e.g. we don’t reuse $\partial z^{(1)}/\partial w^{(1)}$ (we obviously use some of $\partial C/\partial w^{(1)}$).

If you look at the dependency graph above, you can connect these last two equations to the big curly bracket that says “Layer 1 Dependencies” on the left. Try to make sense of the notation used by linking up which layer L-1 is in the graph. This should make things more clear, and if you are in doubt, just leave a comment.

A small detail left out here, is that if you calculate weights first, then you can reuse the 4 first partial derivatives, since they are the same when calculating the updates for the bias. And of course the reverse.

Suppose we had another hidden layer, that is, if we have input-hidden-hidden-output — a total of four layers. Then we would just reuse the previous calculations for updating the previous layer. We essentially do this for every weight and bias for each layer, reusing calculations.

So.. if we suppose we had an extra hidden layer, the equation would look like this:

$$\frac{\partial C}{\partial w^{(1)}} = \underbrace{ \frac{\partial C}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial z^{(3)}} }_{\text{From } w^{(3)}} \, \underbrace{ \frac{\partial z^{(3)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} }_{\text{From } w^{(2)}} \, \frac{\partial z^{(2)}}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial w^{(1)}}$$
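The reuse of earlier factors is easiest to see in code. The sketch below assumes a deliberately tiny network with one neuron per layer (input-hidden-hidden-output), sigmoid activations and the cost $C=(a^{(3)}-y)^2$; all names and numbers are hypothetical. The variable `upstream` carries the product of the reused factors from one layer to the layer before it.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(ws, bs, x):
    """Return the activations [a0, a1, ..., aL] of a chain with one neuron per layer."""
    activations = [x]
    for w, b in zip(ws, bs):
        activations.append(sigmoid(w * activations[-1] + b))
    return activations

def backward(ws, bs, x, y):
    """Return (dC/dw, dC/db) per layer for C = (aL - y)^2."""
    acts = forward(ws, bs, x)
    grads_w = [0.0] * len(ws)
    grads_b = [0.0] * len(ws)
    upstream = 2.0 * (acts[-1] - y)              # dC/da for the output layer
    for layer in reversed(range(len(ws))):       # list index 0 holds layer 1
        a, a_prev = acts[layer + 1], acts[layer]
        delta = upstream * a * (1.0 - a)         # dC/dz for this layer
        grads_w[layer] = delta * a_prev          # dz/dw = a_prev
        grads_b[layer] = delta                   # dz/db = 1
        upstream = delta * ws[layer]             # dC/da_prev, reused one layer back
    return grads_w, grads_b

# Hypothetical weights and biases for an input-hidden-hidden-output chain.
ws, bs = [0.5, -1.2, 0.8], [0.1, 0.0, -0.3]
x, y = 0.7, 1.0
grads_w, grads_b = backward(ws, bs, x, y)

def cost(ws_):
    return (forward(ws_, bs, x)[-1] - y) ** 2

# Finite-difference check of dC/dw for the first layer.
h = 1e-6
numeric = (cost([ws[0] + h] + ws[1:]) - cost([ws[0] - h] + ws[1:])) / (2 * h)
print(grads_w[0], numeric)
```

Each pass through the loop multiplies in exactly the two extra factors that appear under a brace in the equation above, so nothing is computed twice.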

If you are looking for a concrete example with explicit numbers, I can recommend watching Lex Fridman from 7:55 to 20:33 or Andrej Karpathy’s lecture on Backpropagation.

#### Summarization

• Do a forward pass with the help of this equation

$a^{(l)}= \sigma\left( \boldsymbol{W}\boldsymbol{a}^{(l-1)}+\boldsymbol{b} \right)$

• For the weights and biases of each layer, backpropagate using these equations (replace $w$ by $b$ when calculating biases)

$$\frac{\partial C}{\partial w^{(3)}} = \frac{\partial C}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial z^{(3)}} \frac{\partial z^{(3)}}{\partial w^{(3)}}$$
$$\frac{\partial C}{\partial w^{(2)}} = \underbrace{ \frac{\partial C}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial z^{(3)}} }_{\text{From } w^{(3)}} \, \frac{\partial z^{(3)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial w^{(2)}}$$
$$\frac{\partial C}{\partial w^{(1)}} = \underbrace{ \frac{\partial C}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial z^{(3)}} }_{\text{From } w^{(3)}} \, \underbrace{ \frac{\partial z^{(3)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} }_{\text{From } w^{(2)}} \, \frac{\partial z^{(2)}}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial w^{(1)}}$$

Continue on adding more partial derivatives for each extra layer in the same manner as done here.

• Repeat for each observation/sample (or for each mini-batch, e.g. of size 32 or smaller)

## Optimizers

Optimizers are how neural networks learn, using backpropagation to calculate the gradients.

Many factors contribute to how well a model performs. The way we measure performance, as may be obvious to some, is by a cost function.

### Cost Function

The cost function gives us a value, which we want to optimize. There are too many cost functions to mention them all, but one of the simpler and more often used cost functions is the mean of the squared differences (the mean squared error).

$$C = \frac{1}{n} \sum_{i=1}^n (y_i-\hat{y}_i)^2$$

Here $y$ is what we want the output to be and $\hat{y}$ is the actual predicted output from the neural network. Basically, with $n$ samples, we start from the first example $i=1$ and sum the squares of the differences between the output we want $y_i$ and the predicted output $\hat{y}_i$ over all observations, and then divide by $n$.
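As a small illustration, the cost above can be computed in a few lines of plain Python. The function name `mean_squared_error` and the targets and predictions are made up for this sketch.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n

# Hypothetical targets and predictions for four observations.
y_true = [1.0, 0.0, 1.0, 0.0]
y_pred = [0.9, 0.2, 0.6, 0.1]
print(mean_squared_error(y_true, y_pred))   # roughly 0.055
```

The third prediction is the furthest from its target and contributes most of the cost, because the differences are squared before they are averaged.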

There are obviously many factors contributing to how well a particular neural network performs. Complexity of model, hyperparameters (learning rate, activation functions etc.), size of dataset and more.

In Stochastic Gradient Descent, we take a mini-batch of random samples and perform an update to the weights and biases based on the average gradient from the mini-batch. The weights are randomly initialized to small values, such as 0.1, once before training starts. The biases can be initialized in many different ways; the easiest is to initialize them to 0.

1. Define a cost function, with a vector as input (weight or bias vector)
2. Start at a random point along the x-axis and ask: which way should we step to decrease the cost function most quickly?
3. Calculate the gradient using backpropagation, as explained earlier
4. Step in the opposite direction of the gradient. The gradient points in the direction of steepest ascent, so we put a minus in front of it to make it gradient descent.

Getting a good grasp of what stochastic gradient descent looks like is pretty easy from the GIF below. Each step you see on the graph is a gradient descent step, meaning we calculated the gradient with backpropagation for some number of samples, to move in a direction.

We say that we want to reach the global minimum, the lowest point on the function. This is not always possible, though. We are very likely to hit a local minimum, which is a point where the slope moves upwards on both the left and right side. If we find a minimum, we say that our neural network has converged. If we don’t, or we see a weird drop in performance, we say that the neural network has diverged.

If we calculate a positive derivative, we move to the left on the slope, and if negative, we move to the right, until we are at a local minimum.
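The left-or-right rule can be tried out on a one-dimensional toy problem. This sketch assumes a made-up cost $C(x) = (x-3)^2$ with its minimum at $x=3$; the function names, learning rate and number of steps are arbitrary choices for illustration.

```python
def gradient_descent(derivative, x_start, learning_rate=0.1, steps=100):
    """Repeatedly step against the sign of the derivative."""
    x = x_start
    for _ in range(steps):
        x = x - learning_rate * derivative(x)
    return x

# Toy cost C(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
def dC_dx(x):
    return 2.0 * (x - 3.0)

print(gradient_descent(dC_dx, x_start=10.0))   # derivative starts positive, so x moves left
print(gradient_descent(dC_dx, x_start=-4.0))   # derivative starts negative, so x moves right
```

Both runs end up close to 3. This toy cost has only one minimum; on the bumpy cost surface of a real network, the same procedure can settle in a local minimum instead.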

## Putting Neural Networks Into Steps

Here, I will briefly break down what neural networks are doing into smaller steps.

Initialize once, then repeat steps 2 to 4 for each mini-batch:

1. Initialize weights to small random numbers and let all biases be 0
2. For each sample in the mini-batch, do a forward pass with the equation for calculating activations
$a^{(l)}=\sigma\left(\boldsymbol{W}\boldsymbol{a}^{(l-1)}+\boldsymbol{b}\right)$
3. Calculate gradients and update the gradient vector (the average of the updates from the mini-batch) by iteratively propagating backwards through the neural network. An example calculation of the partial derivative with respect to $w^{(1)}$ in an input-hidden-hidden-output neural network (4 layers)
$\frac{\partial C}{\partial w^{(1)}} = \underbrace{ \frac{\partial C}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial z^{(3)}} }_{\text{From } w^{(3)}} \, \underbrace{ \frac{\partial z^{(3)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} }_{\text{From } w^{(2)}} \, \frac{\partial z^{(2)}}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial w^{(1)}}$
4. Put a minus in front of the gradient vector, and update weights and biases based on the gradient vector calculated from averaging over the nudges of the mini-batch.
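Put together, the steps can be sketched as a short NumPy training loop. Everything specific here is an assumption made for illustration: the toy dataset (logical OR), the network size (2 inputs, 3 hidden neurons, 1 output), the learning rate, the batch size and the number of epochs. Weights are initialized once, before the loop over mini-batches.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Forward pass for a batch; X has one observation per row."""
    W1, b1, W2, b2 = params
    A1 = sigmoid(X @ W1.T + b1)
    A2 = sigmoid(A1 @ W2.T + b2)
    return A1, A2

def gradients(X, Y, params):
    """Backpropagation: gradients of the MSE cost, averaged over the batch."""
    W1, b1, W2, b2 = params
    A1, A2 = forward(X, params)
    n = X.shape[0]
    delta2 = 2.0 * (A2 - Y) * A2 * (1.0 - A2)     # dC/dz for the output layer
    delta1 = (delta2 @ W2) * A1 * (1.0 - A1)      # reuses delta2, one layer back
    return [delta1.T @ X / n, delta1.mean(axis=0),
            delta2.T @ A1 / n, delta2.mean(axis=0)]

def cost(X, Y, params):
    return float(np.mean((forward(X, params)[1] - Y) ** 2))

rng = np.random.default_rng(0)
# Toy dataset (logical OR), chosen only for illustration.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.0], [1.0], [1.0], [1.0]])

# Step 1: small random weights, zero biases.
params = [rng.normal(0, 0.1, (3, 2)), np.zeros(3),
          rng.normal(0, 0.1, (1, 3)), np.zeros(1)]
initial_cost = cost(X, Y, params)

learning_rate, batch_size = 0.5, 2
for epoch in range(2000):
    order = rng.permutation(len(X))               # shuffle, then cut into mini-batches
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        grads = gradients(X[idx], Y[idx], params)                         # steps 2 and 3
        params = [p - learning_rate * g for p, g in zip(params, grads)]   # step 4

final_cost = cost(X, Y, params)
print(initial_cost, final_cost)
```

The forward pass and the gradients are computed for a whole mini-batch at once by stacking observations as rows, which is equivalent to looping over the samples and averaging but much faster in NumPy.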