Picking the right optimizer with the right parameters can help you squeeze the last bit of accuracy out of your neural network model. In this article, optimizers are explained from the classical approaches to the newer ones.

This post can be seen as part three of how neural networks learn; in the previous posts, we proposed the update rule used in gradient descent. Now we explore newer and better optimizers. If you want to know how a forward and backward pass works in a neural network, you should read the first part – especially how we calculate the gradient, which is covered there in great detail.

If you are new to neural networks, you probably won't understand this post without reading the first part.

Before explaining the different optimizers, I want to add that you really should read Sebastian Ruder's paper An overview of gradient descent optimization algorithms. It's a great resource that briefly describes many of the optimizers available today.

This is the basic algorithm responsible for making neural networks converge, i.e., shift towards the optimum of the cost function. Multiple gradient descent algorithms exist, and I have mixed them together in previous posts. Here, I am not talking about batch (vanilla) gradient descent or mini-batch gradient descent.

The basic difference between batch gradient descent (BGD) and stochastic gradient descent (SGD) is that in SGD we only calculate the cost of one example per step, while in BGD we have to calculate the cost for all training examples in the dataset. This greatly speeds up training, and exactly this is the motivation behind SGD.
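To make the per-step difference concrete, here is a minimal sketch (not from the article) that contrasts one BGD step with one SGD step for a 1-D linear model $y = w \cdot x$ under squared-error loss. All function and variable names are illustrative assumptions.

```python
# Toy data for the linear model y = w * x (the true slope is 2).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def grad_single(w, x, y):
    """Gradient of the per-example loss 0.5 * (w*x - y)**2 w.r.t. w."""
    return (w * x - y) * x

def bgd_step(w, lr):
    # Batch gradient descent: average the gradient over ALL examples,
    # then take one parameter step.
    g = sum(grad_single(w, x, y) for x, y in zip(xs, ys)) / len(xs)
    return w - lr * g

def sgd_step(w, lr, i):
    # Stochastic gradient descent: step using the gradient of ONE example.
    return w - lr * grad_single(w, xs[i], ys[i])

w = 0.0
for _ in range(100):
    w = bgd_step(w, lr=0.05)
print(round(w, 3))  # converges toward the true slope 2.0
```

In each iteration, `bgd_step` touches every example before updating once, while `sgd_step` updates after a single example – the cost per step shrinks from the dataset size to one.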

The equation for SGD is used to update parameters in a neural network – we apply it in the backward pass, using backpropagation to calculate the gradient $\nabla$:

$$\theta = \theta - \eta \cdot \overbrace{\nabla_\theta J(\theta; \, x, \, y)}^{\text{Backpropagation}}$$
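The update rule above can be sketched in a few lines of Python. This is an illustrative assumption, not the article's code: `grad` stands in for the gradient $\nabla_\theta J(\theta; \, x, \, y)$ that backpropagation would compute, and `eta` is the learning rate $\eta$.

```python
def sgd_update(theta, grad, eta=0.01):
    """One SGD step: theta <- theta - eta * grad, element-wise."""
    return [t - eta * g for t, g in zip(theta, grad)]

theta = [0.5, -1.2]          # current parameters
grad = [0.1, -0.4]           # pretend backprop produced these gradients
theta = sgd_update(theta, grad, eta=0.1)
```

Each parameter moves a small step against its own gradient component; the learning rate $\eta$ scales how large that step is.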

This is how the e

[...]