Activation Functions
Last update: 28-05-2022 • Dr. Ghulam Gilanie Janjua • PhD (Artificial Intelligence)
Activation functions are one of the building blocks of ML.
Activation Functions
[Figure: input information passes through the activation function, which decides what is "useful", "less-useful", or "not-so-useful" before passing it on to the neural network.]
The output from the activation function moves to the next hidden layer and the same process is repeated. This forward movement of information is known as "forward propagation".
The output generated may be far away from the actual value. Using the output from forward propagation, the error is calculated. Based on this error value, the weights and biases of the neurons are updated. This process is known as "back-propagation".
Can we do it without an activation function?
- The activation function introduces an additional step at each layer during forward propagation and increases the complexity.
- Without activation functions, every neuron would only perform a linear transformation on the inputs using the weights and biases.
- Although linear transformations make the neural network simpler, such a network would be less powerful and would not be able to learn complex patterns from the data.
- A neural network without an activation function is essentially just a linear regression model (see the sketch below).
- Thus, we apply a non-linear transformation to the inputs of the neuron, and this non-linearity in the network is introduced by an activation function.
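To make the "linear regression" point concrete, here is a small illustrative check (all names and shapes are assumptions): stacking two layers with no activation in between collapses into a single linear map, so depth alone adds no expressive power.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked layers with no activation function in between.
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))   # True: same result as one linear layer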
Activation functions and when to use them?
1. Binary Step
2. Linear
3. Sigmoid
4. Tanh
5. ReLU
6. Leaky ReLU
7. Parameterised ReLU
8. Exponential Linear Unit
9. Swish
10. Softmax

WHAT IS A GRADIENT?
In machine learning, a gradient is a derivative of a function that has more than one input variable. Known as the slope of a function in mathematical terms, the gradient simply measures the change in all weights with regard to the change in error.
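As an illustrative aside (names and numbers assumed), the snippet below approximates the gradient of a squared error with respect to a single weight by a finite difference and compares it with the analytic derivative:

# Squared error of a single linear neuron: E(w) = (w * x - y)^2
x, y = 2.0, 3.0
error = lambda w: (w * x - y) ** 2

w, eps = 0.5, 1e-6
numeric_grad = (error(w + eps) - error(w - eps)) / (2 * eps)
analytic_grad = 2 * (w * x - y) * x

print(numeric_grad, analytic_grad)   # both approximately -8.0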
Binary Step or Step function or threshold
f(x) = 1, x >= 0
f(x) = 0, x < 0
• If the input to the activation function is greater than a threshold, the neuron is activated, else it is deactivated, i.e., its output is not considered for the next hidden layer.
• Simplest activation function, which can be implemented with a single if-else condition.

def binary_step(x):
    if x < 0:
        return 0
    else:
        return 1

binary_step(5)
Output?
binary_step(-1)
Output?
Binary Step or Step function or threshold
• Suitable for a binary classifier.
• Not useful for multiple classes in the target variable.
• Gradients are calculated to update the weights and biases during the backpropagation process.
• The gradient of the step function is zero, so no backpropagation is possible.
• If you calculate the derivative of f(x) with respect to x, it comes out to be 0.
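To illustrate the zero-gradient point above, a quick finite-difference check (names assumed) shows the derivative of the step function is 0 at any point away from the threshold, so no gradient signal can flow back:

def binary_step(x):
    return 1 if x >= 0 else 0

eps = 1e-6
for x in [-2.0, 3.0]:
    grad = (binary_step(x + eps) - binary_step(x - eps)) / (2 * eps)
    print(x, grad)   # 0.0 at both points, so the weights receive no update signal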
Linear Function
• Add a component proportional to x in place of the hard threshold of the binary step function: "a linear function".
f(x) = ax

def linear_function(x):
    return 4*x

linear_function(4), linear_function(-2)
Output?

• The gradient is not zero now, but it is a constant which does not depend upon the input value x at all.
f'(x) = a
• Weights and biases will be updated during the backpropagation process, but the updating factor would be the same.
• The network will not really improve the error since the gradient is the same for every iteration.
• A linear function might be ideal for simple tasks where interpretability is highly desired.
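A finite-difference check (illustrative only) confirms the constant-gradient bullet above: the gradient of f(x) = 4x is 4 regardless of the input, so every iteration updates the weights by the same factor:

def linear_function(x):
    return 4 * x

eps = 1e-6
for x in [-3.0, 0.5, 10.0]:
    grad = (linear_function(x + eps) - linear_function(x - eps)) / (2 * eps)
    print(x, round(grad, 6))   # 4.0 for every input value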
Sigmoid
• One of the most widely used "non-linear" activation functions.
• Transforms the values into the range between 0 and 1.
f(x) = 1/(1+e^(-x))

import numpy as np

def sigmoid_function(x):
    z = 1/(1 + np.exp(-x))
    return z

sigmoid_function(7), sigmoid_function(-22)
Output? (0.9990889488055994, 2.7894680920908113e-10)

• A smooth S-shaped function that is continuously differentiable.
f'(x) = sigmoid(x)*(1-sigmoid(x))
• The gradient values are significant only in the range -3 to 3.
• For values greater than 3 or less than -3, the function has very small gradients.
• Not symmetric around zero, so the output of all the neurons will be of the same sign.
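To illustrate the vanishing-gradient claim above, a short illustrative check (names assumed) evaluates f'(x) = sigmoid(x)*(1-sigmoid(x)) at a few points and shows how quickly it shrinks outside the range -3 to 3:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

for x in [0, 3, 5, 10]:
    # Gradient falls from 0.25 at x=0 to about 0.045 at x=3 and about 0.000045 at x=10.
    print(x, sigmoid_grad(x))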
Tanh
• Similar to the sigmoid function, but symmetric around the origin.
• Transforms the values into the range between -1 and 1.
tanh(x) = 2*sigmoid(2x) - 1
tanh(x) = 2/(1+e^(-2x)) - 1

tanh_function(0.5), tanh_function(-1)
Output? (0.4621171572600098, -0.7615941559557646)
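The slide does not show the code for tanh_function; a minimal definition assumed directly from the formula above:

import numpy as np

def tanh_function(x):
    # tanh(x) = 2*sigmoid(2x) - 1 = 2/(1 + e^(-2x)) - 1
    return 2 / (1 + np.exp(-2 * x)) - 1

tanh_function(0.5), tanh_function(-1)
# Output: approximately (0.4621, -0.7616)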
• Inputs to the next layers will not always be of the same sign.
• All other properties of the tanh function are the same as those of the sigmoid function; it is continuous and differentiable at all points.
• The gradient of the tanh function is steeper as compared to the sigmoid function.
• Being zero-centered, tanh is preferred over the sigmoid function, and the gradients are not restricted to move in a certain direction.
ReLU
• Non-linear activation function that has gained popularity in the machine learning domain.
• The main advantage of using the ReLU function over other activation functions is that it does not activate all the neurons at the same time.
f(x) = max(0, x)

relu_function(7), relu_function(-7)
Output? (7, 0)
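A minimal sketch of the relu_function assumed by the call above (the slide does not show its definition):

def relu_function(x):
    # Pass positive inputs through unchanged; clamp negative inputs to zero.
    return max(0, x)

relu_function(7), relu_function(-7)
# Output: (7, 0)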
• For the negative input values, the result is zero, which means the neuron does not get activated.
• The ReLU function is far more computationally efficient when compared to the sigmoid and tanh functions.
• On the negative side, the gradient value is zero. During the backpropagation process, the weights and biases for some neurons are therefore not updated. This can create dead neurons which never get activated. This is taken care of by the Leaky ReLU function.
Leaky ReLU
• An improved version of the ReLU function.
• For the ReLU function, the gradient is 0 for x < 0, which would deactivate the neurons in that region.
• Instead of defining the ReLU function as 0 for negative values of x, we define it as an extremely small linear component of x.
f(x) = 0.01x, x < 0
     = x,     x >= 0
f'(x) = 1,    x >= 0
      = 0.01, x < 0

leaky_relu_function(7), leaky_relu_function(-7)
Output? (7, -0.07)
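A minimal sketch of the leaky_relu_function assumed by the call above (not shown on the slide):

def leaky_relu_function(x):
    # Small linear component (slope 0.01) for negative inputs instead of zero.
    if x < 0:
        return 0.01 * x
    return x

leaky_relu_function(7), leaky_relu_function(-7)
# Output: (7, -0.07)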
• Apart from Leaky ReLU, there are a few other variants of ReLU; the two most popular are the Parameterised ReLU function and the Exponential Linear Unit.
Parameterised ReLU
• Another variant of ReLU that solves the problem of gradients becoming zero for the left half of the axis.
• As the name suggests, it introduces a new parameter as the slope of the negative part of the function.
f(x) = x,  x >= 0      f'(x) = 1, x >= 0
     = ax, x < 0              = a, x < 0
• When the value of a is fixed to 0.01, the function acts as a Leaky ReLU function. In the case of a parameterised ReLU function, 'a' is also a trainable parameter: the network learns the value of 'a' for faster and more optimum convergence.
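A hypothetical sketch of the parameterised ReLU forward pass; the function name and the example slope a=0.05 are assumptions (in a real network 'a' would be learned during training):

def parameterised_relu(x, a=0.05):
    # 'a' is the trainable slope of the negative part; 0.05 is only illustrative.
    return x if x >= 0 else a * x

parameterised_relu(7), parameterised_relu(-7)
# Output: (7, -0.35)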
Swish
• Lesser-known activation function, discovered by researchers at Google.
• As computationally efficient as ReLU and shows better performance than ReLU on deeper models.
• The values for swish range from negative infinity to infinity.
f(x) = x*sigmoid(x)
f(x) = x/(1+e^(-x))

swish_function(-67), swish_function(4)
Output? approximately (-5.35e-28, 3.93)
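A minimal sketch of the swish_function assumed by the call above, written directly from f(x) = x*sigmoid(x):

import numpy as np

def swish_function(x):
    # x multiplied by sigmoid(x): large negative inputs give tiny negative outputs.
    return x / (1 + np.exp(-x))

swish_function(-67), swish_function(4)
# Output: approximately (-5.35e-28, 3.93)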
• The curve of the function is smooth, and the function is differentiable at all points. This is helpful during the model optimization process and is one of the reasons that swish outperforms ReLU.
• The swish function is not monotonic: the value of the function may decrease even when the input values are increasing.
Softmax
• Described as a combination of multiple sigmoids.
• Sigmoid returns values between 0 and 1, which can be treated as probabilities of a data point belonging to a particular class.
• Sigmoid is widely used for binary classification problems.
• The softmax function can be used for multiclass classification problems.
• Returns the probability for a data point belonging to each individual class.
• For a multiclass problem, the output layer would have as many neurons as the number of classes in the target; if we have three classes, there would be three neurons in the output layer, producing raw outputs such as [1.2, 0.9, 0.75].
• The softmax function applied to [1.2, 0.9, 0.75] will give the following result: [0.42, 0.31, 0.27], representing the probability of the data point belonging to each class; the sum of all the values is 1.
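A minimal numpy sketch (function name assumed) that reproduces the example above:

import numpy as np

def softmax_function(x):
    # Exponentiate each score, then normalise so the outputs sum to 1.
    e = np.exp(x)
    return e / e.sum()

print(np.round(softmax_function([1.2, 0.9, 0.75]), 2))
# Output: [0.42 0.31 0.27]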
Choosing the right Activation Function
• Good or bad – there is no rule of thumb.
• Depending upon the properties of the problem, we might be able to make a better choice for easier and quicker convergence of the network.
• Sigmoid functions and their combinations generally work better in the case of classifiers.
• Sigmoid and tanh functions are sometimes avoided due to the vanishing gradient problem.
• ReLU is a general-purpose activation function and is used in most cases these days.
• If we encounter a case of dead neurons in our network, the leaky ReLU function is the best choice.
• Always keep in mind that the ReLU function should only be used in the hidden layers.
• You can begin with the ReLU function and then move to other activation functions if ReLU doesn't provide optimum results.