When calculating the activation values in each layer, we apply an activation function right before deciding what exactly each activation value should be. From the previous layer's activations, together with the weights and biases, we calculate a value for every activation in the next layer. But before passing that value on to the next layer, we apply an activation function to scale the output. Here, we will explore different activation functions.
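To make that step concrete, here is a minimal sketch of one layer's computation in plain Python. The names (layer_forward, prev, W, b) and the numbers are made up for illustration, and sigmoid is used only as an example activation; this is not the code from the notebook.

```python
import math

def sigmoid(z):
    # Example activation: maps any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(prev_activations, weights, biases, activation=sigmoid):
    """Compute the next layer's activations.

    weights[j][k] is the (hypothetical) weight from neuron k in the
    previous layer to neuron j in the next layer.
    """
    next_activations = []
    for w_row, bias in zip(weights, biases):
        # Weighted sum of the previous activations, plus the bias.
        z = sum(w * a for w, a in zip(w_row, prev_activations)) + bias
        # The activation function is applied before the value is passed on.
        next_activations.append(activation(z))
    return next_activations

# Illustrative values: 3 neurons in the previous layer, 2 in the next.
prev = [0.5, -1.0, 2.0]
W = [[0.1, 0.2, 0.3],
     [-0.4, 0.5, -0.6]]
b = [0.0, 0.1]

out = layer_forward(prev, W, b)
print(out)
```

Swapping in a different activation only requires passing another function as the activation argument, which is why the choice can be studied independently of the rest of the network.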
The prerequisite for this post is my previous post about feedforward and backpropagation in neural networks. There, I briefly talked about activation functions, but never actually expanded on what they do for us. Much of what I talk about here will only be relevant if you have that prior knowledge or have read my previous post.
Activation functions can make or break a neural network. In this extensive article (>6k words), I'm going to go over 6 different activation functions, each with pros and cons. For each one, I will give you the equation, its derivative, and plots of both. The goal is to explain the equations and graphs in simple input-output terms.
I also show you the vanishing and exploding gradient problems; for the latter, I follow Nielsen's great example of why gradients might explode.
Finally, I provide some code that you can run yourself in a Jupyter Notebook.
From the small code experiment on the MNIST dataset, we obtain a loss and accuracy graph for each activation function.
The sigmoid function is a logistic function, which means that, whatever you input, you get an output between 0 and 1. That is, whatever value a neuron (also called a node or activation) receives as input, it is scaled to a value between 0 and 1.
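As a rough illustration of that input-output behaviour, the sketch below implements the standard logistic sigmoid, 1 / (1 + e^(-z)), together with its derivative, sigmoid(z) * (1 - sigmoid(z)). The function names are my own, and the branch on the sign of z is just one common way to avoid overflow for large negative inputs; treat it as a sketch rather than the notebook's implementation.

```python
import math

def sigmoid(z):
    # Logistic sigmoid: 1 / (1 + e^(-z)).
    # Branching on the sign keeps math.exp from overflowing
    # when z is a large negative number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_prime(z):
    # Derivative of the sigmoid, written in terms of the sigmoid itself.
    s = sigmoid(z)
    return s * (1.0 - s)

# Large negative inputs land near 0, large positive inputs near 1,
# and an input of 0 maps to exactly 0.5.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z={z:6.1f}  sigmoid={sigmoid(z):.5f}  derivative={sigmoid_prime(z):.5f}")
```

The derivative peaks at 0.25 when z is 0 and shrinks toward 0 as the input moves away from 0 in either direction.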
Source - Continue Reading: https://mlfromscratch.com/activation-functions-explained/