Non-linear activation functions are essential in deep neural networks. Without them, a stack of linear layers, however deep, is equivalent to a single linear layer, so the approximation ability of the network is very limited¹. Some of the most commonly used non-linear activation functions are Sigmoid, ReLU and Tanh.
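To make the "many linear layers equal one linear layer" point concrete, here is a minimal NumPy sketch. The matrices `W1`, `W2` and the input `x` are made-up values chosen only for illustration, and the layers have no bias terms.

```python
import numpy as np

# Two tiny bias-free linear layers (hypothetical weights, chosen for illustration)
W1 = np.array([[1.0, 0.0],
               [0.0, -1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([1.0, 1.0])

# Applying the two layers one after the other...
two_layers = W2 @ (W1 @ x)

# ...is the same as applying one layer whose weight matrix is the product W2 W1.
one_layer = (W2 @ W1) @ x

# With a ReLU between the layers, the output differs from that single linear layer.
with_relu = W2 @ np.maximum(0.0, W1 @ x)

print(two_layers, one_layer, with_relu)
```

In this example the two linear versions agree, while the version with ReLU in between gives a different output, so it cannot be reproduced by simply multiplying the weight matrices together.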

Nonlinear activations and their derivatives

Sigmoid

The sigmoid function, also known as the logistic function, has the following form:

\[f(x) = \frac{1}{1+e^{-x}}\]

The derivative of sigmoid is:

\[\begin{aligned}\frac{df}{dx} &= \frac{e^{-x}}{(1+e^{-x})^2}\\ &= \frac{1}{1+e^{-x}}(1- \frac{1}{1+e^{-x}})\\ &= f(x)(1-f(x)) \end{aligned}\]
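As a quick sanity check of the identity \(f'(x) = f(x)(1-f(x))\), we can compare it against a central finite difference. This is only a sketch: the step size `h` and the sample points are arbitrary choices of mine.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    # Closed form from the derivation above: f(x) * (1 - f(x))
    s = sigmoid(x)
    return s * (1.0 - s)


x = np.linspace(-5.0, 5.0, 11)
h = 1e-5  # finite-difference step (arbitrary small value)

# Central difference approximation of the derivative
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
max_error = np.max(np.abs(numeric - sigmoid_grad(x)))

print(max_error)
```

The printed error should be tiny, which suggests the closed form and the numerical derivative agree.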

Tanh

The Tanh function has the following form:

\[f(x) = \frac{e^{2x}-1}{e^{2x}+1}\]

The derivative of Tanh is:

\[\frac{df}{dx} = \frac{4e^{2x}}{(e^{2x} + 1)^2} = 1 - {f(x)}^2\]
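The sketch below checks two things numerically: that the exponential form above matches NumPy's built-in `np.tanh`, and that the two expressions for the derivative agree. The helper names `tanh_from_formula` and `tanh_grad_explicit` are my own.

```python
import numpy as np


def tanh_from_formula(x):
    # (e^{2x} - 1) / (e^{2x} + 1), written exactly as in the formula above
    e2x = np.exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)


def tanh_grad_explicit(x):
    # 4 e^{2x} / (e^{2x} + 1)^2
    e2x = np.exp(2.0 * x)
    return 4.0 * e2x / (e2x + 1.0) ** 2


x = np.linspace(-5.0, 5.0, 101)

matches_numpy = np.allclose(tanh_from_formula(x), np.tanh(x))
forms_agree = np.allclose(tanh_grad_explicit(x), 1.0 - np.tanh(x) ** 2)

print(matches_numpy, forms_agree)
```

For very large \(x\) the hand-written form can overflow in `np.exp`, so in practice `np.tanh` is the safer choice; the formula version is here only to mirror the math.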

ReLU

ReLU, short for rectified linear unit, has the following form:

\[f(x) = \max(0, x)\]

We can also write ReLU as:

\[f(x) = \begin{cases} x & x \geq 0 \\ 0 & x < 0 \end{cases}\]

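The derivative of ReLU is quite simple: it is 1 for \(x > 0\) and 0 for \(x < 0\). At \(x = 0\), where the function is not differentiable, the derivative is conventionally taken to be 0.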

There are also variants of ReLU, such as Leaky ReLU, PReLU (parametric ReLU), and RReLU (randomized ReLU). In Empirical Evaluation of Rectified Activations in Convolutional Network, the authors claim that PReLU and RReLU work better than ReLU on small-scale datasets such as CIFAR-10, CIFAR-100 and Kaggle NDSB.
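For reference, here is a minimal sketch of the ReLU derivative and of Leaky ReLU. The slope `negative_slope=0.01` is an assumed illustrative value, not one taken from the paper. Roughly speaking, PReLU treats this slope as a learned parameter, and RReLU samples it randomly during training.

```python
import numpy as np


def relu_grad(x):
    # 1 for x > 0, 0 otherwise (using the convention that the derivative at 0 is 0)
    return np.where(x > 0, 1.0, 0.0)


def leaky_relu(x, negative_slope=0.01):
    # Like ReLU, but negative inputs are scaled by a small slope instead of zeroed
    return np.where(x > 0, x, negative_slope * x)


def leaky_relu_grad(x, negative_slope=0.01):
    # The gradient never becomes exactly zero for negative inputs
    return np.where(x > 0, 1.0, negative_slope)


x = np.array([-2.0, 0.0, 3.0])
print(relu_grad(x))
print(leaky_relu(x))
print(leaky_relu_grad(x))
```

The design point is that the negative side keeps a small, non-zero gradient, whereas plain ReLU passes no gradient at all for negative inputs.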

Vanishing gradient

I show the plot of different activation functions and their derivatives in the title image.

The code used for the visualization:

import matplotlib.pyplot as plt
import numpy as np


def main():
    x = np.linspace(-5, 5, 100)

    r = [relu(v) for v in x]

    sig = [sigmoid(v) for v in x]
    d_sig = [sigmoid(v)*(1 - sigmoid(v)) for v in x]

    t = [tanh(v) for v in x]
    d_tanh = [1 - tanh(v)**2 for v in x]

    fig = plt.figure(figsize=[6, 3])

    ax = fig.add_subplot(111)

    ax.plot(x, r, '#66c2a5', label='ReLU')

    ax.plot(x, sig, '#fc8d62', label='sigmoid')
    ax.plot(x, d_sig, '#8da0cb', label='sigmoid derivative')

    ax.plot(x, t, '#e78ac3', label='tanh')
    ax.plot(x, d_tanh, '#a6d854', label='tanh derivative')

    ax.legend()

    plt.savefig('activation-curve.png', dpi=96, bbox_inches='tight')


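def relu(x):
    # max(0, x): pass non-negative inputs through, clamp negatives to zero
    if x >= 0:
        return x

    return 0


def sigmoid(x):
    # logistic function: 1 / (1 + e^{-x})
    return 1/(1 + np.exp(-x))


def tanh(x):
    # (e^{2x} - 1) / (e^{2x} + 1), written to match the formula in the text
    return (np.exp(2*x) - 1)/(np.exp(2*x) + 1)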


if __name__ == "__main__":
    main()

The derivative of sigmoid is relatively small: its largest value is only 0.25 (at \(x = 0\)), and it approaches zero as \(|x|\) grows. Tanh has a similar issue: although its maximum gradient is 1 (at \(x=0\)), the gradient falls off quickly and is close to zero when \(|x|\) is large.

This causes the vanishing gradient problem. To calculate the derivative of the loss w.r.t. the weights of earlier layers in the network, we need to multiply together the gradients from the later layers. When you multiply several values no larger than 0.25, the result shrinks toward zero quickly, so the weights in earlier layers are updated slowly. In other words, training converges much more slowly than with ReLU, and we might need many more epochs to get a satisfactory result.
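To get a feel for how fast this product shrinks, consider a deliberately simplified, hypothetical setup: a chain of scalar sigmoid units with every weight fixed to 1. The function `chain_gradient` is my own illustration, not part of any library.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def chain_gradient(depth, x=0.0):
    """Derivative of `depth` stacked scalar sigmoids (all weights assumed to be 1)
    with respect to the input x, computed with the chain rule."""
    grad = 1.0
    a = x
    for _ in range(depth):
        s = sigmoid(a)
        grad *= s * (1.0 - s)  # each layer contributes a factor of at most 0.25
        a = s                  # the output of this unit feeds the next one
    return grad


for depth in (1, 5, 10):
    print(depth, chain_gradient(depth), 0.25 ** depth)
```

Each factor is at most 0.25, so the gradient through `depth` units is bounded by \(0.25^{\text{depth}}\), which is already below \(10^{-6}\) for 10 units. A chain of ReLU units with positive inputs would instead contribute a factor of exactly 1 per unit.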

Another advantage of ReLU is that it is computationally cheap compared to sigmoid, in both the forward and the backward pass.

Try it yourself interactively

To gain more insight into this, we can train on MNIST with ConvNetJS and change the activation function to see how training goes. Training under tanh and sigmoid activations is much slower than under ReLU, and sigmoid is the slowest of the three.

We can also quickly experiment with different activation functions in TensorFlow Playground.