We often talk about activation functions when working with artificial neural networks. Let's look at what an activation function is and at its most common varieties.
Activation functions prevent a neural network from collapsing into a single linear transformation. Without them, a network of any depth behaves like a linear model with limited learning power, because a composition of linear layers is itself linear. When we feed the network complex real-world data such as images, sound, or video, it has to learn relationships that are not linear, so we need nonlinear functions that can represent higher-degree relationships. Activation functions regulate the outputs of the nodes and add a level of complexity that a network without them cannot achieve. Thus, although the network becomes more complex, it also becomes stronger and learns better.
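To make the linearity argument concrete, here is a minimal sketch in plain Python. The weights `W1`, `W2` and the input `x` are made-up values chosen only for illustration, and biases are left out for brevity. It shows that two stacked layers without an activation give the same result as one layer with the product of the weights, while putting a nonlinearity (ReLU here) between them breaks that equivalence.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]


def relu(v):
    """Apply ReLU element-wise to a vector."""
    return [max(0.0, vi) for vi in v]


# Hypothetical weights for a tiny two-layer network (illustrative values only).
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

# No activation: two layers are equivalent to one layer with weights W2 * W1.
two_linear_layers = matvec(W2, matvec(W1, x))
single_layer = matvec(matmul(W2, W1), x)

# With a nonlinearity between the layers, the result differs.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(two_linear_layers)  # [-3.5, 10.5]
print(single_layer)       # [-3.5, 10.5]
print(with_relu)          # [2.5, 7.5]
```

However many linear layers we stack, the result can always be rewritten as a single matrix, which is why depth alone does not add learning power.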
The sigmoid function compresses the values it receives into the range from 0 to 1. Here is the mathematical expression for the sigmoid function.
f(x) = 1 / (1 + e^(-x))
When a large positive value arrives, the output gets closer to one and produces a stronger signal; when a negative value arrives, the output approaches zero and produces a weaker signal.
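The behaviour described above can be sketched in a few lines of Python, assuming the standard logistic form f(x) = 1 / (1 + e^(-x)). The function name `sigmoid` and the sample inputs are choices made for this illustration. The branch on the sign of `x` is an implementation detail that avoids overflow in `math.exp` for very negative inputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: squashes any real number into the range 0 to 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use an algebraically equivalent form so that
    # math.exp never receives a large positive argument.
    z = math.exp(x)
    return z / (1.0 + z)


for x in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(f"sigmoid({x:+.1f}) = {sigmoid(x):.4f}")
```

Large positive inputs land close to 1, large negative inputs land close to 0, and an input of 0 gives exactly 0.5.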
The sigmoid function is not linear, so the network can model more complex relationships and we can use it for more difficult tasks. But if we look carefully at the graph, we can see that toward both ends of the curve the y values react very little to changes in x. In these regions the derivative becomes very small and approaches 0. This is called the vanishing gradient problem, and learning takes place only at a minimal level. When learning is this slow, the optimization algorithm that minimizes the error can get stuck in local minima, and we cannot get the maximum performance that the artificial neural network model could otherwise achieve.
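The flat regions can be seen numerically. The sketch below assumes the standard logistic sigmoid and uses the well-known identity that its derivative equals f(x) × (1 - f(x)). The names `sigmoid` and `sigmoid_grad` are chosen for this example.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for very negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x):
    """Derivative of the sigmoid: f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient is largest at x = 0 and shrinks quickly toward both ends.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  gradient = {sigmoid_grad(x):.6f}")
```

The derivative never exceeds 0.25, and by x = 10 it is already smaller than 0.0001, so weight updates that pass through a saturated sigmoid unit become tiny.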
The tanh function compresses the values it receives into the range from -1 to 1. Here is the mathematical expression for the tanh function.
f(x) = 2 / (1 + e^(-2x)) - 1
The derivative of the tanh function is steeper than that of the sigmoid function: it peaks at 1 at x = 0, while the sigmoid's derivative peaks at 0.25. Its derivative can therefore take larger values, and its wider output range can make it more effective for classification. However, the vanishing gradient problem at the ends of the function remains.
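As a rough check of these claims, the sketch below evaluates tanh through the formula f(x) = 2 / (1 + e^(-2x)) - 1, compares it with Python's built-in `math.tanh`, and prints the derivatives of tanh and sigmoid side by side. The helper names are hypothetical, and the derivative identities used (1 - tanh(x)^2 for tanh, s × (1 - s) for sigmoid) are the standard ones. The simple `math.exp` calls are fine for the moderate inputs used here but are not hardened against overflow.

```python
import math


def tanh_via_formula(x):
    """tanh written as a rescaled sigmoid: 2 / (1 + e^(-2x)) - 1."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


def sigmoid_grad(x):
    """Derivative of the logistic sigmoid: s * (1 - s)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


for x in (0.0, 1.0, 3.0, 5.0):
    print(
        f"x = {x:3.1f}  tanh = {tanh_via_formula(x):+.4f}  "
        f"tanh' = {tanh_grad(x):.4f}  sigmoid' = {sigmoid_grad(x):.4f}"
    )
```

At x = 0 the tanh gradient is 1 against 0.25 for sigmoid, but by x = 5 the tanh gradient has also shrunk close to zero, which is the vanishing gradient problem again.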
ReLU is commonly used in deep neural networks for speech recognition and computer vision. The function first separates incoming values according to whether they are positive or negative. It outputs 0 if the input is negative and returns the input unchanged if the input is positive, so it is very cheap to compute. The problem with the ReLU function is that the zero-valued region, which gives us this processing speed, also has a derivative of zero, so learning cannot occur there.
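A minimal sketch of ReLU and its derivative follows. The function names are chosen for this example, and treating the derivative at exactly x = 0 as 0 is a convention assumed here, since the derivative is not defined at that point.

```python
def relu(x):
    """ReLU: 0 for negative inputs, the input itself otherwise."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 for negative inputs.

    The value at x == 0 is a convention; 0 is assumed here.
    """
    return 1.0 if x > 0 else 0.0


for x in (-3.0, -0.5, 0.5, 3.0):
    print(f"x = {x:+.1f}  relu = {relu(x):.1f}  gradient = {relu_grad(x):.1f}")
```

Every negative input gets a gradient of exactly 0, so a unit whose inputs stay negative receives no updates at all.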
The Leaky ReLU function was developed to counter the dead neuron problem in the ReLU function.
f(x) = max(0.01x, x)
As shown in the figure, this problem is addressed by a small leak with a slope of 0.01 below the x axis for negative inputs. This slope is close to 0 but not 0, so the gradient does not vanish in the negative region as it does in ReLU, and learning also takes place for negative values.
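Here is a small sketch of Leaky ReLU based on f(x) = max(0.01x, x). The parameter name `alpha` for the leak slope is an assumption made for this example; 0.01 is the value used in the text.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for positive inputs, alpha * x otherwise.

    For 0 < alpha < 1 this is the same as max(alpha * x, x).
    """
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Derivative of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha


for x in (-100.0, -2.0, 3.0):
    print(
        f"x = {x:+7.1f}  leaky_relu = {leaky_relu(x):+.2f}  "
        f"gradient = {leaky_relu_grad(x):.2f}"
    )
```

The gradient for negative inputs is small but never zero, so a unit stuck in the negative region can still receive updates.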
5. Swish Function
Like the Leaky ReLU function, the Swish function produces nonzero values in the negative region, but its response there is not linear.
f(x) = x × 1 / (1 + e^(-x))
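Swish is simply the input multiplied by its own sigmoid, which can be sketched as follows. The function names are chosen for this example, and the sigmoid is written in an overflow-safe form.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for very negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x):
    """Swish: x * sigmoid(x)."""
    return x * sigmoid(x)


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"swish({x:+.1f}) = {swish(x):+.4f}")
```

For large positive inputs Swish behaves almost like the identity. In the negative region the output first dips below zero and then rises back toward zero as x becomes more negative, so the curve there is not a straight line as it is in Leaky ReLU.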
Thus, we have seen that activation functions play a key role in artificial neural networks. Hope to see you in our next article…