Quora user Nathan made a really nice and useful comment on Quora (https://www.quora.com/Why-is-ReLU-non-linear/answer/Nathan-Yan-2?comment_id=39524791&comment_type=2) about how ReLU really does introduce non-linearity into a neural network. Respect.
He wrote a short program that simulates a random ReLU neural network with one hidden layer. The network takes a scalar input and produces a scalar output. To generate the graph, he sweeps the input from -100 to 100 in steps of 0.01 and plots the network's output at each value as the blue graph. Looking at the blue graph, you can clearly see the non-linearity: the graph isn't a straight line.
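The experiment described above can be sketched roughly as follows. This is not Nathan's original program, just a minimal reconstruction: the hidden width of 8, the Gaussian weights, the bias term, and the seed are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed (assumption)
hidden = 8                       # hidden-layer width (assumption)

# Random parameters for a 1 -> hidden -> 1 network.
W1 = rng.normal(size=(1, hidden))
b1 = rng.normal(size=hidden)     # bias term is an assumption
W2 = rng.normal(size=(hidden, 1))

def relu(z):
    return np.maximum(z, 0.0)

def relu_net(x):
    """Evaluate the network on a 1-D array of scalar inputs."""
    h = relu(x[:, None] @ W1 + b1)   # hidden activations, shape (N, hidden)
    return (h @ W2)[:, 0]            # scalar output per input

# Sweep the input from -100 to 100 in steps of 0.01.
xs = np.arange(-100.0, 100.0, 0.01)
ys = relu_net(xs)

# Finite-difference slope between neighbouring points.
# For a straight line this would be the same value everywhere.
slopes = np.diff(ys) / np.diff(xs)
print("slope range:", slopes.min(), slopes.max())
```

Plotting `ys` against `xs` (for example with matplotlib) should give a curve like the blue graph. The spread between the smallest and largest slope is what shows it is not a single straight line.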
On the other hand, look at a linear neural network:
The blue graph is now a straight line.
Now let's get into the nitty-gritty details and show why ReLU introduces non-linearity. Suppose we have a linear neural network with weight matrices A and B, layer inputs i1 and i2, and layer outputs o1 and o2.
To produce the output of layer one, we take the dot product of the layer input i1 with the weight matrix A:
o1 = i1 · A
Clearly, o1 is just a set of linear combinations of i1. The second layer does the same thing with B: o2 = i2 · B, where i2 is the same thing as o1. So o2 is a linear combination of o1, and therefore also a linear combination of i1: o2 = i1 · A · B.
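A quick numerical check of this argument: two stacked linear layers give the same result as one layer whose weight matrix is the product of the two. The shapes and random values below are arbitrary choices for illustration, and the names follow the A, B, i1, o1, i2, o2 notation used here.

```python
import numpy as np

rng = np.random.default_rng(1)    # arbitrary seed (assumption)

# Illustrative shapes: 3 inputs -> 4 hidden -> 2 outputs.
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
i1 = rng.normal(size=(5, 3))      # a batch of 5 input vectors

o1 = i1 @ A                       # output of layer one
i2 = o1                           # input of layer two is the output of layer one
o2 = i2 @ B                       # output of layer two

# The two layers collapse into a single matrix C = A B.
C = A @ B
collapsed = i1 @ C

print(np.allclose(o2, collapsed))
```

No matter how many linear layers are stacked this way, the whole network can still be written as one matrix multiplication.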
Using differentiation, it's pretty simple to see that the derivative of o2 with respect to i1 is constant: it is just the product of A and B, which does not depend on i1. The slope therefore always stays the same, which explains the linear nature of the network.
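The constant slope can also be checked numerically. The sketch below builds a scalar-in, scalar-out linear network (no activation, no bias) and compares the finite-difference slope at every point of the sweep with the single number A · B. The hidden width and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)    # arbitrary seed (assumption)
hidden = 8                        # hidden width (assumption)

A = rng.normal(size=(1, hidden))  # first-layer weights
B = rng.normal(size=(hidden, 1))  # second-layer weights

xs = np.arange(-100.0, 100.0, 0.01)
o1 = xs[:, None] @ A              # layer one, no activation
o2 = (o1 @ B)[:, 0]               # layer two, scalar output per input

# Finite-difference slope between neighbouring points of the sweep.
slopes = np.diff(o2) / np.diff(xs)

# For scalar input and output, A @ B is a 1x1 matrix: the slope itself.
expected = (A @ B).item()
print(np.allclose(slopes, expected))
```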
With a ReLU network, on the other hand, we have to factor in the fact that the activation function, the ReLU, sometimes returns 0. This can change the gradient.
In the image shown before, the green graph marks where the slope of the function changes, and the orange graph shows how many negative activations there are in the network. Notice how every time the number of negative activations goes up or down, the green graph jumps. This is because every time an activation turns negative, or stops being negative, the network dynamics change and so does the slope. This is how we get non-linearity. However, an interesting point is that, as you pointed out, since both components of the ReLU are linear, the blue graph is really a composite of linear sections: it is non-linear overall, but its individual sections are linear.
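The link between negative activations and slope jumps can be reproduced in a small sketch. This is an illustration under assumed settings (8 hidden units, Gaussian weights, a bias term, arbitrary seed), not the code behind the original plot. It counts the negative pre-activations at each input, computes the slope from the units that are currently active, and checks that the slope only changes at steps where some unit switches on or off.

```python
import numpy as np

rng = np.random.default_rng(0)    # arbitrary seed (assumption)
hidden = 8                        # hidden width (assumption)

W1 = rng.normal(size=hidden)      # input -> hidden weights
b1 = rng.normal(size=hidden)      # hidden biases (assumption)
W2 = rng.normal(size=hidden)      # hidden -> output weights

xs = np.arange(-100.0, 100.0, 0.01)
pre = xs[:, None] * W1 + b1       # pre-activations, shape (N, hidden)
active = pre > 0                  # units the ReLU lets through

# Count of negative (zeroed) activations per input: the "orange" quantity.
negative_count = (~active).sum(axis=1)

# Within a linear section the slope is the sum of W1[j] * W2[j]
# over the active units j: the "green" quantity.
slope = active.astype(float) @ (W1 * W2)

# Steps where the slope jumps, and steps where any unit flips.
slope_changes = np.flatnonzero(np.abs(np.diff(slope)) > 1e-9)
pattern_changes = np.flatnonzero((active[1:] != active[:-1]).any(axis=1))

# Every slope jump coincides with a unit switching on or off.
print(np.all(np.isin(slope_changes, pattern_changes)))
print("number of kinks in the sweep:", len(pattern_changes))
```

Between two consecutive kinks the set of active units is fixed, so the network behaves like a plain linear function there. That is the piecewise-linear structure described above.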
(If the owner of this comment would like this article deleted, just contact cyberlatentspace by email.)