Activation functions are an integral component of neural networks. There are many common activation functions, so it is often confusing to decide which one is best suited to a particular task.
In this blog post I will talk about,
 Why do we need activation functions in neural networks?
 Output layer activation functions. When not to use the softmax activation?
 The popular types of hidden layer activation functions and their pros and cons.
 The best practices to follow (for hidden layer activations).
 Some of the recent developments that we should be aware of.
Why do we need activation functions?
Why do we use nonlinear activation functions? Couldn’t we just multiply the input by the weights, add a bias and propagate the result forward? The reason we don’t is that a composition of linear layers is itself linear, so no matter how many layers we add, the final output would still be a linear function of the input. It is because of nonlinear activation functions that neural networks are considered universal function approximators. Adding nonlinearity allows a sufficiently large network to approximate a very broad class of functions (linear or nonlinear).
To get a better understanding of this property, you should definitely check out this awesome post by Michael Nielsen – A Visual Proof that neural nets can compute any function.
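The point about stacked linear layers can be checked with a few lines of plain Python. This is only a minimal sketch: the scalar "layers", the function names and the numbers are made up for illustration. Two linear layers without an activation give the same output as one linear layer with weight w2 * w1 and bias w2 * b1 + b2, while putting max(0, x) between them breaks that equivalence.

```python
# Hypothetical scalar "layers"; all names and values are illustrative only.
def linear(x, w, b):
    return w * x + b

def two_linear_layers(x, w1, b1, w2, b2):
    # Layer 2 applied to layer 1, with no activation in between.
    return linear(linear(x, w1, b1), w2, b2)

def collapsed_layer(x, w1, b1, w2, b2):
    # The same mapping written as a single linear layer.
    return linear(x, w2 * w1, w2 * b1 + b2)

def two_layers_with_relu(x, w1, b1, w2, b2):
    # A nonlinearity between the layers; this no longer collapses.
    return linear(max(0.0, linear(x, w1, b1)), w2, b2)
```

For x = -2 with w1 = 2, b1 = 1, w2 = -3, b2 = 0.5, the two purely linear versions agree, while the version with the nonlinearity gives a different value.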
Output Layer Activation Functions
It is very important to understand that the output layer activation functions are different from the hidden layer activation functions. The output layer has a very specific objective – to try to replicate the true labels as much as possible.
We need to carefully select the final layer activation depending on the task at hand (regression, single-label classification, multi-label classification, etc.).
Softmax Activation
The softmax function takes as input a K-dimensional vector z of real values and squashes it to a K-dimensional vector f(z) of real values in the range (0, 1] that add up to one. The function is given by,
f(z)_j = e^{z_j} ∕ Σ_{k=1}^{K} e^{z_k}, for j = 1, …, K
The softmax is a popular choice for the output layer activation.
 It mimics the one-hot encoded labels better than normalizing by the absolute values.
 If we used the absolute (modulus) values we would lose information (a score of -2 and a score of 2 would be treated the same), while the exponential intrinsically takes care of this.
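Here is a minimal sketch of the softmax in plain Python, next to a normalization by absolute values for comparison. Subtracting the maximum before exponentiating is a common stability trick that I am adding here; it is not part of the definition. The helper name abs_normalize is made up for this example.

```python
import math

def softmax(z):
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp().
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def abs_normalize(z):
    # Hypothetical alternative: normalize by absolute values. The sign of each score is lost.
    total = sum(abs(v) for v in z)
    return [abs(v) / total for v in z]
```

With the scores [2.0, -2.0], abs_normalize returns equal values for both classes, while softmax puts almost all of the mass on the first class.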
When not to use softmax activations?
The softmax function should not be used for multi-label classification. Unlike one-hot encoded labels, more than one label can be true in multi-label classification (for example, an image containing both a dog and a bone). Because its outputs add up to one, the softmax function simply can’t produce more than one label with a value close to 1. Therefore, the sigmoid function (discussed later) is preferred for multi-label classification.
The softmax function should not be used for a regression task either. Simple linear units, f(x) = x, should be used instead.
Now, let’s discuss some of the popular hidden layer activation functions and then decide which one should be preferred.
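The multi-label case can be illustrated with a small sketch. The label names and logits below are invented for the example. With two labels that are both present, softmax splits the probability mass between them, while independent sigmoids can score both close to 1.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logits for the labels ["dog", "bone", "cat"]; dog and bone are both present.
logits = [6.0, 6.0, -6.0]

softmax_scores = softmax(logits)               # labels compete: dog and bone share the mass
sigmoid_scores = [sigmoid(v) for v in logits]  # each label is scored independently
```

Here the softmax scores for dog and bone are each about 0.5, while the sigmoid scores for both are above 0.99.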
Sigmoid Activation
The sigmoid function has the mathematical formula –
σ(x) = 1 ∕ (1 + e^{-x})
This function takes a real number as input and squashes it into the range (0, 1). Earlier it was widely used as it has a nice interpretation as the firing rate of a neuron. It maps large negative numbers to values near 0 (not firing) and large positive numbers to values near 1 (completely firing).
The sigmoid function is rarely used as a hidden layer activation any more. It has two major drawbacks –
 The derivative of the sigmoid function is σ(x)(1 – σ(x)). During backpropagation, when the output of a neuron gets close to 0 or 1, this gradient gets close to 0. As a result, the weights of the neuron barely get updated. Such neurons are called saturated neurons. Not only this, the weights of the neurons feeding into a saturated neuron are also updated very slowly, because their gradients are multiplied by the same near-zero factor. Hence, a network with sigmoid activations may learn very slowly, or not at all, if many saturated neurons are present.
 The exp( ) function is computationally expensive.
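The saturation problem is easy to see numerically. The sketch below is plain Python and assumes the standard definition σ(x) = 1 / (1 + e^{-x}); it is not hardened against overflow for very large negative inputs.

```python
import math

def sigmoid(x):
    # Note: math.exp overflows for very large negative x; fine for a small demo.
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: sigma(x) * (1 - sigma(x)).
    s = sigmoid(x)
    return s * (1.0 - s)
```

The gradient peaks at 0.25 for x = 0 and is already below 0.0001 at x = 10 or x = -10, so a saturated neuron passes almost no gradient back.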
Tanh Activation
The tanh function has a mathematical formula –
tanh(x) = 2σ(2x) – 1, where σ(x) is the sigmoid function.
It takes a real value as input and squashes it into the range (-1, 1).
 In practice, the tanh activation is preferred over the sigmoid activation because its output is zero-centered.
 It is also common to use the tanh function in state to state transition models (recurrent neural networks).
 The tanh function also suffers from the gradient saturation problem and kills gradients when saturated.
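As a quick check of the relation between tanh and the sigmoid, and of the saturation behavior, here is a small plain Python sketch. The function names are my own.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    # tanh written as a scaled and shifted sigmoid: 2 * sigma(2x) - 1.
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2.
    return 1.0 - math.tanh(x) ** 2
```

tanh_via_sigmoid agrees with math.tanh up to floating point error, and tanh_grad drops from 1 at x = 0 to nearly 0 for large |x|, which is the same saturation effect as with the sigmoid.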
ReLU Activation
The ReLU is the most popular and commonly used activation function. It can be represented as –
f(x) = max(0, x)
It takes a real value as input. The output is x when x > 0 and 0 when x ≤ 0. It is mostly preferred over sigmoid and tanh.
Advantages
 Accelerates the convergence of SGD compared to sigmoid and tanh (a speed-up of around 6 times has been reported).
 No expensive operations are required.
Disadvantages
 It is fragile and might die during training. If the learning rate is too high, a large update may push the weights to values for which the neuron outputs 0 for every data point. Its gradient is then 0 everywhere, so the neuron never gets updated again.
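To make the dying ReLU problem concrete, here is a minimal sketch with a single-input neuron. The helper neuron_is_dead and the numbers are hypothetical. I treat a neuron as dead on a dataset when its pre-activation w * x + b is non-positive for every input.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Convention used here: the gradient at x == 0 is taken to be 0.
    return 1.0 if x > 0 else 0.0

def neuron_is_dead(w, b, inputs):
    # Dead on this data: the gradient through the ReLU is 0 for every input,
    # so gradient descent can no longer change w or b.
    return all(relu_grad(w * x + b) == 0.0 for x in inputs)
```

For the inputs [0.5, 1.0, 2.0], a neuron with w = 1, b = 0 is alive. If one oversized update moves it to w = -1, b = -5, it outputs 0 for all three inputs and receives no gradient from any of them.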
Leaky ReLU Activation
f(x) = 1(x<0)(αx) + 1(x>=0)(x), where α is a small constant
Leaky ReLUs attempt to fix the “dying ReLU” problem. Instead of the function being zero when x < 0, a leaky ReLU has a small slope (of 0.01 or so) in the negative region, so some gradient always flows. Some people report success with this form of activation function, but the results are not always consistent.
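A minimal sketch of the leaky ReLU and its gradient, following the formula above. The default α = 0.01 is just the example value mentioned in the text.

```python
def leaky_relu(x, alpha=0.01):
    # x for x >= 0, alpha * x otherwise.
    return x if x >= 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    # The gradient is never exactly zero, so the unit cannot die like a plain ReLU.
    return 1.0 if x >= 0 else alpha
```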
Parametric ReLU Activation
The PReLU function is given by,
f(x) = 1(x<0)(αx) + 1(x>=0)(x), where α is a parameter that is learned during training along with the weights.
This gives the neurons the ability to choose what slope is best in the negative region. With α = 0 it reduces to a ReLU, and with a small fixed α it reduces to a leaky ReLU.
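The sketch below assumes the usual PReLU form, f(x) = x for x > 0 and αx otherwise, with α updated by gradient descent. The helper sgd_step_alpha and the learning rate are hypothetical and only show how a single input would change α.

```python
def prelu(x, alpha):
    return x if x > 0 else alpha * x

def prelu_grad_alpha(x):
    # Derivative of the output with respect to alpha: 0 for positive inputs, x otherwise.
    return 0.0 if x > 0 else x

def sgd_step_alpha(alpha, x, upstream_grad, lr=0.1):
    # One plain SGD update of alpha for a single input (illustrative helper).
    return alpha - lr * upstream_grad * prelu_grad_alpha(x)
```

Only negative inputs contribute to the update of α. Setting alpha to 0 gives a ReLU.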
Maxout Activation
The Maxout activation is a generalization of the ReLU and the leaky ReLU functions. It is represented as –
f(x) = max(w_1^{T}x + b_1, w_2^{T}x + b_2)
 Both ReLU and leaky ReLU are special cases of Maxout. The Maxout neuron therefore enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU).
 However, it doubles the number of parameters for each neuron, so a higher total number of parameters needs to be trained.
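Here is a small sketch of a Maxout neuron, assuming the two-piece form f(x) = max(w1·x + b1, w2·x + b2). It also shows how a ReLU neuron falls out as a special case when the first piece is set to zero. All names and values are illustrative.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def maxout(x, w1, b1, w2, b2):
    # Maximum of two linear pieces, each with its own weights and bias.
    return max(dot(w1, x) + b1, dot(w2, x) + b2)

def relu_neuron(x, w, b):
    return max(0.0, dot(w, x) + b)
```

With w1 = [0, 0] and b1 = 0, maxout(x, w1, b1, w, b) equals relu_neuron(x, w, b) for any x. The two sets of weights are also where the doubling of parameters comes from.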
Best Practices (For Hidden Layer Activations)
 It is rare to mix and match different types of activation functions, even though there is no fundamental problem in doing so. Have a look at the paper Learning Combinations of Activation Functions [1] if you are interested in learning more.
 Use ReLU, but be careful with the learning rate and monitor the fraction of dead units.
 If ReLU is giving problems, try Leaky ReLU, PReLU, Maxout or some of the other recently developed functions (more on this below).
 Do not use sigmoid.
 The tanh function is useful in some state to state transition models. But generally speaking, you can expect it to perform worse than ReLU or Maxout.
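Monitoring the fraction of dead units can be done with a few lines. This is a rough sketch under my own assumptions: activations is a list of rows, one per example, holding the ReLU outputs of one layer, and a unit counts as dead when it outputs 0 for every example in the batch.

```python
def dead_fraction(activations):
    # activations[i][j] is the ReLU output of unit j for example i.
    n_units = len(activations[0])
    dead = sum(
        1 for j in range(n_units)
        if all(row[j] == 0.0 for row in activations)
    )
    return dead / n_units
```

A unit that is silent on one batch is not necessarily dead for good, so in practice this is worth checking over a large batch or several batches.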
Recent Developments
Neural networks are a rapidly evolving field. You should not be surprised if something that you learn today gets replaced by a totally new technique in a few months.
Therefore, I have also included some of the recent developments in activation functions that claim to outperform the current favorite, ReLU. I have not used them before. But, if I am getting unsatisfactory results with the existing techniques, I would definitely try them out.
Swish Activation
The swish activation function is represented as,
f(x) = x * σ(β * x), where
σ(x) = 1 ∕ (1 + e^{-x}) is the sigmoid function and β is either a constant or a trainable parameter.
According to the paper Searching for Activation Functions [2], published by the Google Brain team, the swish function tends to work better than ReLU on deeper models.
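A minimal sketch of swish in plain Python, with β passed in as a constant (a trainable β would need a framework with automatic differentiation, which I am not showing here).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x, beta=1.0):
    # x * sigma(beta * x)
    return x * sigmoid(beta * x)
```

Two limiting cases follow directly from the formula: with β = 0 the function is the linear map x / 2, and for large β it approaches ReLU. Unlike ReLU, swish is slightly negative for small negative inputs.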
E-swish Activation
The E-swish function was introduced in a fairly recent paper that came out in January 2018, E-swish: Adjusting Activations to Different Network Depths [3].
The mathematical function proposed is,
f(x) = β * x * sigmoid(x), where β is a constant
The recommended value of β is 1 ≤ β ≤ 2. The paper reports that this function outperforms ReLU and also the swish activation function.
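For completeness, here is the same kind of sketch for E-swish. The default β = 1.5 is only an illustrative value inside the range 1 ≤ β ≤ 2, not a recommendation taken from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def e_swish(x, beta=1.5):
    # beta * x * sigma(x); beta=1.5 is an arbitrary value within 1 <= beta <= 2.
    return beta * x * sigmoid(x)
```

With β = 1 this is the same as swish with β = 1. Here β scales the whole output instead of the input to the sigmoid.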
More Interesting Papers

 Empirical analysis of nonlinear activation functions for Deep Neural Networks in classification tasks [4] conducts an empirical analysis of several nonlinear activation functions on the MNIST classification task.
 Another very recent paper (February 2018), Training Neural Networks by Using Power Linear Units (PoLUs) [5], introduces the power linear unit activation function.
Again, these are very new findings and some may not live up to their promises. But it is important for us to keep track of the latest developments.
References
 [1] Franco Manessi and Alessandro Rozza. Learning Combinations of Activation Functions. arXiv:1801.09403
 [2] Prajit Ramachandran, Barret Zoph and Quoc V. Le. Searching for Activation Functions. arXiv:1710.05941
 [3] Eric Alcaide. E-swish: Adjusting Activations to Different Network Depths. arXiv:1801.07145
 [4] Giovanni Alcantara. Empirical analysis of nonlinear activation functions for Deep Neural Networks in classification tasks. arXiv:1710.11272
 [5] Yikang Li, Pak Lun Kevin Ding and Baoxin Li. Training Neural Networks by Using Power Linear Units (PoLUs). arXiv:1802.00212

A visual proof that neural nets can compute any function by Michael Nielsen

Course notes, Module 1 Neural Networks Part 1, CS231n Convolutional Neural Networks for Visual Recognition
 fast.ai, Deep Learning Part 1, Version 2
Thank You. 🙂