The astounding success of artificial neural networks can be attributed in part to their ability to estimate the complex, non-linear functions that are often present in real-world data. This is achieved through activation functions, which introduce non-linearity into neural networks and help them fit the input data better. In other words, activation functions are crucial to the effectiveness of neural networks.
This article will try to provide a relatively comprehensive, but not overly technical overview of activation functions in neural networks. By the end, you'll have a firm grasp of the following:
- What activation functions are and why to use them
- How activation functions help neural networks learn
- Why activation functions need to be differentiable
- What the most widely used activation functions are, along with their pros and cons
- How to choose an activation function when training a neural network
What is an activation function?
An activation function in a neural network is a mathematical function that determines the output of a neuron based on its input. As the name suggests, it is the function that "activates" the neuron. Whether the network is convolutional or recurrent, the activation function decides how the neuron's input is passed on. Just as neurons in the brain receive signals from the body and decide how to process them, neurons in artificial neural networks work in a similar manner: they act as transfer functions, receiving input values and producing corresponding output values.
How do activation functions work?
Before discussing modern and widely used activation functions, it's a good idea to get a solid understanding of how they work in a neural network. Regardless of the network architecture, an activation function takes the values generated by a given network layer (in a fully connected network, this would be the weighted sum of the inputs plus a bias) and applies a certain transformation to map them to a specific range.
Here's a useful illustration of the role an activation function plays in a neural network.
After taking the weighted sum of the inputs plus the bias (W₁·X₁ + W₂·X₂ + … + Wₙ·Xₙ + b), we pass this value to some activation function ⨍, which then gives us the output of the given neuron. Here, each Xᵢ is the output of a neuron from the previous layer, while Wᵢ is the weight our neuron assigns to the input Xᵢ.
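As a minimal sketch of this computation in Python (the input values, weights, bias, and the choice of a sigmoid activation below are arbitrary, purely for illustration):

```python
import numpy as np

def neuron_output(x, w, b, activation):
    """Weighted sum of the inputs plus the bias, passed through an activation."""
    z = np.dot(w, x) + b          # W1*X1 + W2*X2 + ... + Wn*Xn + b
    return activation(z)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))   # one possible choice of f, discussed later

x = np.array([0.5, -1.2, 3.0])   # outputs of the previous layer
w = np.array([0.4, 0.7, -0.2])   # this neuron's weights
b = 0.1                          # this neuron's bias
print(neuron_output(x, w, b, sigmoid))
```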
Why use an activation function?
While not all activation functions are non-linear, the overwhelming majority are, and for a good reason. Non-linear activation functions introduce additional complexity into neural networks and enable them to "learn" to approximate a much larger swathe of functions. Without non-linear activation functions, neural networks would only be able to learn linear and affine functions, because stacking layers that depend linearly on each other just produces one glorified affine function.
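To make this concrete, here is a quick numpy sketch (with random weights chosen only for illustration) showing that two stacked layers with no activation in between are exactly equivalent to a single affine function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function in between: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2

# ...which is exactly one affine function with W = W2 @ W1 and b = W2 @ b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True
```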
Another important aspect of activation functions is that they take inputs of unknown distribution and scale and map them to a known range (e.g., the sigmoid function maps any input to a value between 0 and 1). This helps stabilize the training of neural networks and, in the output layer, helps map values to the desired form of output (for non-regression tasks).
Why should an activation function be differentiable?
The most important property an activation function should have is differentiability. Artificial neural networks learn using an algorithm called backpropagation. This algorithm essentially uses the model's incorrect predictions to adjust the network in a way that makes it less incorrect, thereby improving its predictive capabilities. That adjustment relies on gradients, which are computed through differentiation.
Therefore, in order for a network to be trainable, all of its elements need to be differentiable, including the activation function. Differentiability alone, however, doesn't guarantee that the network will train well; there are more barriers to overcome. Deep networks in particular run into problems such as "vanishing" and "exploding" gradients.
In the "vanishing" gradient case, the gradient values get smaller and smaller from one hidden layer to the next until they are effectively zero and the earliest layers stop learning. The "exploding" gradient problem is the other side of the coin: from one hidden layer to the next, the values get bigger and bigger until they blow up toward infinity.
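A tiny numpy sketch makes the mechanism visible (the layer count and the weight magnitude of 1.5 are arbitrary choices for illustration): the backpropagated gradient is a product of per-layer factors, so factors below 1 shrink it toward zero and factors above 1 blow it up:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Hypothetical pre-activation values at each of 20 stacked layers.
pre_activations = np.zeros(20)              # sigmoid'(0) = 0.25, its maximum
vanishing = np.prod(sigmoid_grad(pre_activations))
print(f"product of 20 sigmoid derivatives: {vanishing:.3e}")    # ~9.1e-13

# With linear activations and weight factors of magnitude 1.5, the product
# grows instead of shrinking -- the exploding gradient case.
exploding = np.prod(np.full(20, 1.5))
print(f"product of 20 weight factors of 1.5: {exploding:.3e}")  # ~3.3e+03
```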
Simple activation functions
With this in mind, what does a real-world activation function look like? Perhaps the simplest activation function one could think of is the identity activation function, which simply returns its input unchanged.
Using this linear activation function doesn't add any complexity to the neural network; the model effectively becomes similar to a linear regression model.
Of course, this wouldn't be of much use as it literally doesn't do anything, and so we would still face the aforementioned problem of an unpredictable distribution of values, destabilizing the training of our deep neural networks.
Step function
A somewhat more effective activation function than the identity, but still a super simple way to tackle this problem, is the binary step function:
As one can see, all the step activation function does is take the input and map it to either 0 or 1, depending on whether the input is larger or smaller than 0. While this does give a predictable range of values, the step function is almost never used, because squishing all nuance out of the signal throws away a lot of information.
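Both of these simple activations are one-liners in numpy; in this sketch an input of exactly 0 is treated as "off", which is just one common convention:

```python
import numpy as np

def identity(x):
    return x                        # output equals input

def binary_step(x):
    return np.where(x > 0, 1, 0)    # 1 for positive inputs, 0 otherwise

z = np.array([-2.0, -0.1, 0.0, 0.3, 5.0])
print(identity(z))       # [-2.  -0.1  0.   0.3  5. ]
print(binary_step(z))    # [0 0 0 1 1]
```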
Non-linear activation functions
Now that we have a solid grasp of what activation functions do, let's discuss some non-linear activation functions that are actually used in practice. There has been a hefty amount of research regarding non-linear activation functions in recent years, which has introduced new and improved activation functions and, thus, affected the popularity of old ones. However, tried-and-true activation functions are still used often and have their place.
Sigmoid / Logistic activation function
The sigmoid activation function or logistic activation function is a very popular non-linear activation function that maps input data to the output range (0, 1). Unlike the step function, the sigmoid function doesn't just output 0 or 1, but instead numbers in that range (not including 0 and 1 themselves). Here's an illustration of the sigmoid activation function and its first derivative:
In comparison to the linear function or the binary step function, the derivative of the sigmoid is not a constant: it is a well-defined function that can be evaluated for any input value. However, while the sigmoid activation function is better than the ones discussed before and does have its place (especially in tasks like binary classification), it has some major drawbacks. For very large or very small inputs, the sigmoid saturates and its derivative gets close to zero, which is exactly the "vanishing" gradient problem described above.
All these saturated neurons “kill” the gradients. Another drawback is that since the range is (0, 1), the output of the sigmoid activation function is not 0-centered, which also causes problems during backpropagation (a detailed discussion of these phenomena is out of this article's scope, though). Finally, exponential functions are a bit expensive computationally, which can slow down the neural network training process.
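In numpy, the sigmoid and its derivative can be sketched as follows; note how the derivative collapses toward zero for inputs of large magnitude:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # peaks at 0.25 for x = 0

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))               # values squeezed into (0, 1)
print(sigmoid_derivative(z))    # near zero for large |x|: saturated, "killed" gradients
```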
Softmax function
The softmax activation function is similar to the sigmoid function. It is commonly used on the output layer to represent output values as probabilities. The mathematical expression is presented below.
The expression is mathematically defined for all x in (-∞, ∞), but computationally it has some limitations: the exponentials of large inputs can overflow during calculation. To avoid this, the maximum input value is first subtracted from every xᵢ, and only then is the softmax expression above applied to compute the output; this shift does not change the result.
The softmax function is mainly used on the output layer as some kind of transfer function to represent output layer values as probabilities, usually for classification tasks.
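Here is a minimal numpy sketch of this numerically stable softmax (the example logits are arbitrary, with one deliberately huge value that a naive implementation would overflow on):

```python
import numpy as np

def softmax(x):
    shifted = x - np.max(x)      # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 1000.0])    # exp(1000) would overflow without the shift
probs = softmax(logits)
print(probs, probs.sum())                # probabilities summing to 1
```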
Hyperbolic tangent function
The tanh activation function is somewhat similar to the sigmoid in the sense that it also maps the input values to an s-shaped curve, but in this case, the output range is (-1, 1) and is centered at 0, which solves one of the problems with the sigmoid function. Tanh stands for the hyperbolic tangent, which is just the hyperbolic sine divided by the hyperbolic cosine, similar to the regular tangent. Here's an illustration along with the formula:
While the tanh activation function can be more effective than the sigmoid, it still suffers from the same saturation problem during backpropagation: for very large or very small inputs, its derivative gets closer and closer to zero, making the neural network harder to train. Being exponent-based, it is also computationally costly. That said, tanh is a handy activation function to use in the hidden layers, since its zero-centered output passes better-behaved values to the next hidden layer.
Inverse tangent function
The inverse tangent (arctan) function is another non-linear activation function. Like the sigmoid and tanh functions, it has an S-shape, and its derivative has a similar shape as well. The inverse tangent function outputs values in the range (-π/2, π/2).
It again has the same "killing"/"vanishing" gradient problem, but since there is no exponential in the calculation of its gradient, it is relatively faster to compute than the tanh or sigmoid functions.
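Both of these S-shaped functions and their derivatives are available directly in numpy; a quick sketch:

```python
import numpy as np

def tanh(x):
    return np.tanh(x)                 # output in (-1, 1), zero-centered

def tanh_derivative(x):
    return 1.0 - np.tanh(x) ** 2

def arctan(x):
    return np.arctan(x)               # output in (-pi/2, pi/2)

def arctan_derivative(x):
    return 1.0 / (1.0 + x ** 2)       # no exponentials involved

z = np.array([-5.0, 0.0, 5.0])
print(tanh(z), tanh_derivative(z))
print(arctan(z), arctan_derivative(z))
```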
Rectified linear unit (ReLU) function
The ReLU (Rectified Linear Unit) activation function is a more modern and widely used activation function. It looks like this:
The beauty of the ReLU activation function lies partly in its simplicity. As one can see, all it does is replace negative values with 0 and keep positive ones as they are. This avoids the problem of "killing" the gradients for large and small values, while also being much faster computationally, as it involves only simple operations. In practice, neural networks using ReLU also tend to converge about six times faster than those using sigmoid or tanh.
However, ReLU still has some problems. First off, it's not 0-centered, which can cause problems during training. Most importantly, though, it does not deal with negative inputs in a particularly meaningful way. During backpropagation, the network updates its weights using the gradients, and neurons that receive negative inputs have a zero gradient, so their weights are not updated. Some neurons may never be updated during the entire training process; these are called "dead" neurons.
Modern activation functions tend to take ReLU and try to fix these problems; many variations of ReLU have been developed to avoid them during neural network training.
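For reference, here is ReLU and its gradient sketched in numpy; the zero gradient on the negative side is what can leave neurons "dead":

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)         # negative inputs become 0, positive pass through

def relu_derivative(x):
    return np.where(x > 0, 1.0, 0.0)  # zero gradient for negative inputs -> "dead" neurons

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))
print(relu_derivative(z))
```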
Parametric ReLU function
The Parametric ReLU activation function builds on top of ReLU by trying to handle negative values in a more meaningful way. More specifically, instead of replacing negative values with 0, it multiplies them by a slope parameter between 0 and 1, which lets some of the information contained in the negative inputs be used during training. In the original Parametric ReLU formulation this slope is learned together with the network's other weights; when it is treated as a fixed, user-defined hyperparameter instead, it has to be chosen carefully, since results can vary noticeably depending on its value.
Leaky ReLU function
The leaky ReLU activation function is the special case of the parametric ReLU with a fixed slope of a = 0.01. Because this slope is so small, leaky ReLU updates the weights of negative-input neurons only very slowly, so the parametric ReLU function is often preferable in a neural network model.
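Both variants are simple to sketch in numpy (the slope of 0.2 in the example is arbitrary; only the 0.01 of leaky ReLU is standard):

```python
import numpy as np

def parametric_relu(x, a):
    return np.where(x > 0, x, a * x)    # scale negative inputs by a instead of zeroing them

def leaky_relu(x):
    return parametric_relu(x, 0.01)     # leaky ReLU is the special case a = 0.01

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(parametric_relu(z, 0.2))
print(leaky_relu(z))
```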
Exponential linear units (ELU) function
The exponential linear unit (ELU) is yet another non-linear alternative to the ReLU function. For positive inputs the two produce the same output, but for negative inputs ELU decays smoothly thanks to the exponential term. The downside is that the exponential makes this activation function more computationally costly. The illustration below shows the function and the shape of its first derivative.
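A common form of ELU and its derivative, sketched in numpy (α = 1 is the usual default, assumed here):

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))  # smooth decay for negative inputs

def elu_derivative(x, alpha=1.0):
    return np.where(x > 0, 1.0, alpha * np.exp(x))

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(elu(z))
print(elu_derivative(z))
```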
All of the above-mentioned activation functions (ReLU, leaky ReLU, parametric ReLU, and ELU) share one common problem: for positive inputs they are all the same linear function, so their gradient there is a constant 1. If the weights of the hidden layers are large, the backpropagated factors can keep multiplying and growing, which can cause the exploding gradient problem.
Gaussian error linear unit (GELU) function
GELU is one of the newest activation functions. It uses the standard Gaussian cumulative distribution function to weight the input data, which is a major difference from ReLU, ELU, and the like: those functions gate the input based only on its sign, whereas GELU weights the input by its value. Conceptually, the neuron input is multiplied by m ∼ Bernoulli(Φ(x)), where Φ(x) = P(X ≤ x), X ∼ N(0, 1), is the cumulative distribution function of the standard normal distribution, and GELU(x) = x·Φ(x) is the expectation of this stochastic gate:
The computational cost of evaluating Φ(x) exactly is relatively high, so approximations are commonly used to make it easier to compute.
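Two approximations that are widely used in practice, sketched in numpy (a tanh-based one and an even cheaper sigmoid-based one):

```python
import numpy as np

def gelu_tanh_approx(x):
    # Tanh approximation of GELU(x) = x * Phi(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def gelu_sigmoid_approx(x):
    # Even cheaper approximation: x * sigmoid(1.702 * x)
    return x / (1.0 + np.exp(-1.702 * x))

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(gelu_tanh_approx(z))
print(gelu_sigmoid_approx(z))
```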
Swish function
The swish activation function multiplies the input by the output of a parametrized sigmoid applied to that same input: swish(x) = x · sigmoid(a·x). In the vast majority of neural network models the parameter "a" is set to 1, in which case the function is called the sigmoid linear unit (SiLU).
The swish function tends to show its advantages in deeper models and is mostly used when the number of hidden layers is large.
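A one-line numpy sketch of swish (with a = 1, i.e., SiLU, as the default):

```python
import numpy as np

def swish(x, a=1.0):
    return x / (1.0 + np.exp(-a * x))   # x * sigmoid(a * x); a = 1 gives SiLU

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(swish(z))
```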
How to choose an activation function
Here's the million-dollar question in machine learning: how do you actually choose the right activation function when training a neural network from scratch? Different activation functions have different advantages and disadvantages, and depending on the type of artificial neural network, the outcome may differ.
A good starting point is to choose one of the ReLU-based activation functions (including ReLU itself), since they have empirically proven to be very effective for almost any task. After that, you can try other activation functions in the hidden layers, possibly different ones for different layers, and see how the performance changes.
The network architecture, the machine learning task, and many other factors influence the choice of activation function. For example, if the task is binary classification, the sigmoid activation function is a good choice for the output layer, but for multi-class classification the softmax function is better, as it outputs a probability for each class.
In convolutional neural networks, ReLU-based activation functions are typically used to speed up convergence. However, some architectures require specific activation functions: recurrent neural networks and Long Short-Term Memory networks, for example, rely on the sigmoid and tanh functions, and their logic-gate-like design wouldn't work with ReLU.
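In a framework like PyTorch, this choice comes down to a single line per layer. A minimal sketch (the layer sizes are arbitrary, and in practice the final softmax or sigmoid is often folded into the loss function instead):

```python
import torch.nn as nn

# Hypothetical sizes, just to illustrate where the activation choice appears.
n_features, n_hidden, n_classes = 20, 64, 5

# ReLU in the hidden layer as a sensible default, softmax on the output
# layer for multi-class classification.
multiclass_model = nn.Sequential(
    nn.Linear(n_features, n_hidden),
    nn.ReLU(),
    nn.Linear(n_hidden, n_classes),
    nn.Softmax(dim=1),
)

# Sigmoid output for binary classification.
binary_model = nn.Sequential(
    nn.Linear(n_features, n_hidden),
    nn.ReLU(),
    nn.Linear(n_hidden, 1),
    nn.Sigmoid(),
)
```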
Summing up
To recap, activation functions are crucial for modern neural networks because they facilitate the learning of a much wider set of functions and can, thus, make the model learn to perform all sorts of complex tasks. The impressive advances in computer vision, natural language processing, time series, and many other fields would be nearly impossible without the opportunities created by non-linear activation functions. While exponent-based activation functions like the sigmoid and tanh functions have been used for decades and can yield good results, more modern ones like ReLU work better in most applications. As a rule of thumb, when training a neural network from scratch, one can simply use ReLU, leaky ReLU, or GELU and expect decent results.