Different activation functions with the Equations

1) Step Activation Function:-


The step function is an activation function used in binary classification
tasks: a neuron activates (outputs 1) if the input is greater than or
equal to 0 and deactivates (outputs 0) otherwise. It is simple and can be
implemented with an if-else condition, but it cannot be used for
multi-class classification and its gradient is zero everywhere, which
makes it unsuitable for backpropagation in deep learning because it blocks
learning through gradient descent. The function is mathematically
represented as:
· f(x) = 1 if x ≥ 0
· f(x) = 0 if x < 0
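A minimal NumPy sketch of this rule (the function name and vectorized form are illustrative choices, not from the text):

```python
import numpy as np

def step(x):
    """Step activation: 1 where x >= 0, else 0."""
    return np.where(np.asarray(x) >= 0, 1.0, 0.0)

print(step([-2.0, 0.0, 3.5]))  # [0. 1. 1.]
```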

2) Sigmoid Activation Function:-


The sigmoid is one of the most widely used activation functions and is
non-linear. It transforms input values into the range 0 to 1 and can be
defined as: f(x) = 1 / (1 + e^(-x)). The sigmoid is continuously
differentiable and is a smooth S-shaped function. Its derivative is:
f'(x) = sigmoid(x) · (1 - sigmoid(x)).
Also, the sigmoid function is not symmetric about zero, which means the
output values of all neurons have the same (positive) sign. This issue
can be mitigated by scaling or shifting the sigmoid function.
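A small NumPy sketch of the sigmoid and its derivative as defined above (function names are my own):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid: squashes inputs into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x)))

def sigmoid_derivative(x):
    """Derivative: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25
```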
3) TanH Activation Function:-
Tanh is the hyperbolic tangent function. It is similar to the sigmoid
function but is symmetric around the origin, so the outputs passed from
one layer as inputs to the next can have different signs. It can be
defined in terms of the sigmoid as: f(x) = 2·sigmoid(2x) - 1. Tanh is
continuous and differentiable, and its values lie in the range -1 to 1.
Compared to the sigmoid function, the gradient of tanh is steeper. Tanh
is preferred over the sigmoid because it is zero-centered, so its
gradients are not restricted to a single sign.
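A short sketch of the tanh-via-sigmoid identity used above, checked against NumPy's built-in tanh (the helper name is illustrative):

```python
import numpy as np

def tanh_via_sigmoid(x):
    """Tanh written as 2*sigmoid(2x) - 1."""
    return 2.0 / (1.0 + np.exp(-2.0 * np.asarray(x))) - 1.0

x = np.array([-1.0, 0.0, 1.0])
print(np.allclose(tanh_via_sigmoid(x), np.tanh(x)))  # True
```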

4) ReLU Activation Function:-


ReLU stands for rectified linear unit and is a non-linear activation
function that is widely used in neural networks. An advantage of ReLU is
that not all neurons are activated at the same time: a neuron is
deactivated whenever the output of the linear transformation is negative.
It can be defined mathematically as: f(x) = max(0, x). ReLU is more
efficient than other functions because only a certain number of neurons
are active at any given time.
For negative inputs the gradient is zero, so the corresponding weights
and biases are not updated during the backpropagation step of neural
network training (the "dying ReLU" problem).
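A minimal sketch of ReLU and the gradient behaviour described above (names are illustrative):

```python
import numpy as np

def relu(x):
    """ReLU: max(0, x), applied element-wise."""
    return np.maximum(0.0, np.asarray(x))

def relu_grad(x):
    """Gradient is 1 for positive inputs and 0 otherwise (the dying-ReLU case)."""
    return (np.asarray(x) > 0).astype(float)

print(relu([-3.0, 0.0, 2.0]))       # [0. 0. 2.]
print(relu_grad([-3.0, 0.0, 2.0]))  # [0. 0. 1.]
```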

5) Leaky ReLU Activation Function:-

Leaky ReLU is an improved version of the ReLU function: for negative
values of x, instead of defining the function's value as zero, it is
defined as an extremely small linear component of x. It can be expressed
mathematically as:
f(x) = 0.01x, x < 0
f(x) = x, x ≥ 0
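A minimal sketch of the leaky ReLU, with the 0.01 slope used above exposed as a parameter (the `alpha` name is my own):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x >= 0, alpha * x for x < 0."""
    x = np.asarray(x)
    return np.where(x >= 0, x, alpha * x)

print(leaky_relu([-10.0, 0.0, 5.0]))  # [-0.1  0.   5. ]
```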

Batch Normalization:- Batch normalization (BN) is a powerful technique
to address issues like vanishing and exploding gradients and internal
covariate shift in training deep neural networks. The core idea is to
normalize the inputs to a layer such that the mean and variance are
controlled, thereby stabilizing and speeding up training.

Key Concepts of Batch Normalization:

1. Internal Covariate Shift:
   1. During training, as parameters are updated, the hidden layer
      inputs change, leading to instability and slower convergence.
      This shift is called internal covariate shift.
   2. Batch normalization helps mitigate this by normalizing the
      input at each layer, ensuring that the distribution of inputs
      remains more stable.
2. Normalization Layer:
   1. The BN layer is introduced between hidden layers in the
      network. It normalizes inputs so that they have a mean of
      zero and a standard deviation of one over each mini-batch of
      training data.
   2. The normalization is followed by a scaling and shifting
      operation using learnable parameters βi (shift) and γi (scale).

3. Choices for Normalization:
   1. Post-activation normalization: the values are normalized after
      applying the activation function.
   2. Pre-activation normalization: the normalization is applied
      right after the linear transformation (before applying the
      activation function).
   3. Research suggests that normalizing pre-activation values is
      more effective, leading to faster and more stable convergence
      during training.

4. Mathematical Operations:
   1. For each unit i, compute the mini-batch mean μi and variance σi²
      of its pre-activation values.
   2. Normalize: x̂i = (xi - μi) / sqrt(σi² + ε), where ε is a small
      constant added for numerical stability.
   3. Scale and shift with the learnable parameters: yi = γi · x̂i + βi.

5. Backpropagation Through Batch Normalization:
   1. BN involves additional parameters βi and γi that must be
      updated during backpropagation.
   2. To update these parameters, the gradients of the loss L with
      respect to βi and γi are computed by summing over the
      mini-batch: ∂L/∂βi = Σ ∂L/∂yi and ∂L/∂γi = Σ (∂L/∂yi · x̂i).
   3. Backpropagation through the BN layer also requires computing
      the gradients with respect to the mean μi and variance σi²,
      which are batch-dependent. This non-linearity adds complexity
      but still allows the network to backpropagate through these
      layers. A minimal sketch of the forward pass and these
      parameter gradients is given after this list.
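A minimal NumPy sketch of the batch-normalization forward pass and the βi/γi gradients described above, assuming pre-activation inputs of shape (batch, features); function names and shapes are illustrative:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta, x_hat

def batch_norm_param_grads(dout, x_hat):
    """Gradients of the loss w.r.t. the learnable scale and shift."""
    dgamma = np.sum(dout * x_hat, axis=0)   # dL/dgamma_i = sum over batch of dL/dy_i * x_hat_i
    dbeta = np.sum(dout, axis=0)            # dL/dbeta_i  = sum over batch of dL/dy_i
    return dgamma, dbeta

x = np.random.randn(32, 4)                  # mini-batch of 32 examples, 4 units
y, x_hat = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # roughly 0 and 1
```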

Key Benefits of Batch Normalization:

· Faster Training: By reducing internal covariate shift, BN allows
  higher learning rates, leading to faster training.
· Improved Stability: The controlled input distributions reduce the
  risk of vanishing or exploding gradients.
· Regularization Effect: BN introduces a slight regularization effect,
  sometimes reducing the need for dropout.
Ensemble Methods

· Ensemble methods help improve classifier performance by
  addressing the bias-variance trade-off.
· Bagging reduces variance, while Boosting reduces bias.
· Neural networks, which tend to have low bias but high variance,
  benefit from ensemble methods to enhance generalization.

4.5.1 Bagging and Subsampling

· Bagging: Creates multiple models by sampling the training data
  with replacement and averaging the predictions to reduce variance.
  o Customarily, sample size s = n (the size of the original data),
    but a smaller s can work better.
· Subsampling: Similar to bagging but without replacement.
  Preferable when sufficient data is available.
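A minimal sketch of bagging and subsampling as described above, assuming NumPy arrays and using a scikit-learn decision tree as a stand-in base learner (the text does not prescribe a particular model):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in; any fit/predict model works

def resampled_ensemble_predict(X_train, y_train, X_test,
                               n_models=10, replace=True, s=None, seed=0):
    """Train models on resampled data and average their predictions.
    replace=True gives bagging; replace=False (with s < n) gives subsampling."""
    n = len(X_train)
    s = s if s is not None else n   # customary choice: s = n for bagging
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.choice(n, size=s, replace=replace)
        model = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)   # averaging reduces variance
```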

4.5.2 Parametric Model Selection and Averaging

· Model selection involves finding the best configuration from a set
  of hyperparameters.
· Averaging the predictions from the top k configurations gives more
  robust predictions.
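A small sketch of averaging the top-k configurations, assuming each candidate has already been fitted and scored on a validation set (the data structure and names are my own):

```python
import numpy as np

def average_top_k_predictions(scored_models, X, k=3):
    """scored_models: list of (validation_score, fitted_model), higher score is better.
    Average the predictions of the k best configurations."""
    top_k = sorted(scored_models, key=lambda pair: pair[0], reverse=True)[:k]
    return np.mean([model.predict(X) for _, model in top_k], axis=0)
```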

4.5.3 Randomized Connection Dropping

· Randomly drop connections between layers in a neural network.
· Diverse models are generated, and averaging their predictions
  improves accuracy.
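A minimal sketch of dropping random connections from one layer's weight matrix (the function is illustrative; in practice each ensemble member would use its own mask):

```python
import numpy as np

def drop_connections(W, drop_prob=0.3, rng=None):
    """Zero out a random subset of the connections (weights) of a layer."""
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(W.shape) >= drop_prob   # keep each connection with prob 1 - drop_prob
    return W * mask
```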

4.5.4 Dropout

· Node Sampling: Dropout randomly drops nodes (input and hidden)
  along with their connections to create different neural networks
  during training.
· Weight Sharing: The different sampled networks share the same
  weights, updated via backpropagation.
· Sampling Process: Each node is sampled with a probability,
  typically between 20% and 50%, and all edges connected to dropped
  nodes are also removed.
· Training with Dropout: A new neural network is sampled for each
  mini-batch, making the number of sampled networks very large.
· Weight Scaling Inference Rule: At inference, the base network (no
  dropping) is used with re-scaled weights to approximate the
  ensemble's output.
· Regularization: Dropout acts as a regularizer by introducing noise
  (setting random nodes to 0), preventing overfitting and feature
  co-adaptation.
· Feature Co-Adaptation: Dropout prevents complex dependencies
  between features by encouraging the network to rely on subsets of
  features, increasing generalization.
· Larger Models Needed: Due to the regularization effect, larger
  models and more units are often required.
· Performance Improvement: Dropout commonly improves model
  performance by around 2% on large datasets like ImageNet.
· DropConnect: A variation that applies dropout to weights instead
  of nodes.
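A minimal sketch of dropout on one layer's activations. Note it uses the equivalent "inverted dropout" convention (scaling during training) rather than re-scaling the weights at inference as the weight-scaling rule above describes; names are illustrative:

```python
import numpy as np

def dropout_forward(h, drop_prob=0.5, training=True):
    """Randomly zero nodes during training; scale survivors so no
    re-scaling is needed at inference time (inverted dropout)."""
    if not training or drop_prob == 0.0:
        return h                          # inference: full base network
    mask = (np.random.rand(*h.shape) >= drop_prob) / (1.0 - drop_prob)
    return h * mask                       # dropped nodes contribute nothing

h = np.ones((4, 5))                       # activations of one hidden layer
print(dropout_forward(h, drop_prob=0.5))
```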

4.5.5 Data Perturbation Ensembles

· Noise on Input Data: Add small amounts of noise to the input data,
  train multiple models, and average their predictions for better
  generalization.
· Noise in Hidden Layers: Inject noise into hidden layers (e.g.,
  Dropout), but careful calibration is needed to avoid degrading
  performance.
· Dropout: Randomly drops nodes in neural networks, indirectly
  adding noise to the hidden layers and enhancing robustness.
· Data Augmentation: Apply transformations such as rotations and
  translations to increase the dataset size, improving model
  generalization (used in CNNs).
· Denoising Autoencoders: Common in unsupervised learning; they
  reconstruct input data from noisy versions to enhance feature
  learning.
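A small sketch of the noise-on-input-data ensemble, assuming NumPy arrays and again using a scikit-learn tree as a stand-in base learner; the noise level and model choice are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in; any fit/predict model works

def train_noisy_input_ensemble(X, y, n_models=10, noise_std=0.05, seed=0):
    """Train each model on an independently noise-perturbed copy of the inputs."""
    rng = np.random.default_rng(seed)
    return [DecisionTreeRegressor().fit(X + rng.normal(0.0, noise_std, X.shape), y)
            for _ in range(n_models)]

def ensemble_predict(models, X):
    """Average the predictions of all ensemble members."""
    return np.mean([m.predict(X) for m in models], axis=0)
```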
