The Step Function is an activation function used in binary classification tasks: a neuron activates (outputs 1) if the input is greater than or equal to 0 and deactivates (outputs 0) otherwise. It is simple and can be implemented with an if-else condition, but it cannot be used for multi-class classification and has a gradient of zero everywhere, which makes it unsuitable for backpropagation in deep learning as it prevents learning through gradient descent. The function is mathematically represented as:
· f(x) = 1 if x ≥ 0
· f(x) = 0 if x < 0
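A minimal NumPy sketch of the step function described above (the function name is illustrative, not from the original notes):

```python
import numpy as np

def step(x):
    # Outputs 1 where x >= 0, otherwise 0; the gradient is zero everywhere,
    # so this activation cannot be trained with gradient descent.
    return np.where(x >= 0, 1, 0)

print(step(np.array([-2.0, 0.0, 3.5])))  # [0 1 1]
```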
2) Sigmoid Activation Function:-
It is one of the most widely used activation functions, as it is non-linear. The sigmoid function transforms values into the range 0 to 1. It can be defined as: f(x) = 1/(1 + e^(−x)). The sigmoid function is continuously differentiable and a smooth S-shaped function. The derivative of the function is: f′(x) = sigmoid(x)·(1 − sigmoid(x)). Also, the sigmoid function is not symmetric about zero, which means that the signs of all output values of the neurons will be the same. This issue can be improved by scaling the sigmoid function.
3) TanH Activation Function:-
It is the Hyperbolic Tangent function. The tanh function is similar to the sigmoid function but is symmetric around the origin. This results in outputs of different signs from previous layers, which are then fed as inputs to the next layer. It can be defined as: f(x) = 2·sigmoid(2x) − 1. The tanh function is continuous and differentiable, and its values lie in the range −1 to 1. Compared to the sigmoid function, the gradient of the tanh function is steeper. Tanh is preferred over the sigmoid function as its gradients are not restricted to vary in only one direction and it is zero-centered.
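A small NumPy sketch of sigmoid and tanh following the definitions above (the helper names are illustrative):

```python
import numpy as np

def sigmoid(x):
    # Squashes inputs into (0, 1); derivative is sigmoid(x) * (1 - sigmoid(x)).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Zero-centered variant; equal to 2 * sigmoid(2x) - 1, with range (-1, 1).
    return 2.0 * sigmoid(2.0 * x) - 1.0

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))  # approx [0.119, 0.5, 0.881]
print(tanh(x))     # approx [-0.964, 0.0, 0.964]
```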
4) ReLU Activation Function:-
ReLU stands for rectified linear unit and is a non-linear activation function that is widely used in neural networks. The advantage of using the ReLU function is that not all neurons are activated at the same time: a neuron is deactivated whenever the output of the linear transformation is less than zero. It can be defined mathematically as: f(x) = max(0, x). ReLU is more efficient than other functions because not all neurons are activated at the same time; only a certain number of neurons are active at any given time. However, for negative inputs the gradient is zero, due to which the corresponding weights and biases are not updated during the backpropagation step in neural network training.
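A minimal NumPy sketch of ReLU as defined above (function name is illustrative):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): negative inputs are zeroed out, positive inputs pass through.
    return np.maximum(0.0, x)

print(relu(np.array([-3.0, 0.0, 2.5])))  # [0.  0.  2.5]
```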
5) Leaky ReLU Activation Function:-
Leaky ReLU is an improved version of the ReLU function where, for negative values of x, instead of defining the function's value as zero, it is defined as an extremely small linear component of x. It can be expressed mathematically as:
f(x) = 0.01x, x < 0
f(x) = x, x ≥ 0
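A minimal NumPy sketch of Leaky ReLU with the 0.01 slope used above (the function name and default slope parameter are illustrative):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # Negative inputs keep a small slope (0.01 by default) instead of being zeroed,
    # so some gradient still flows for x < 0.
    return np.where(x >= 0, x, slope * x)

print(leaky_relu(np.array([-3.0, 0.0, 2.5])))  # [-0.03  0.    2.5 ]
```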
Batch Normalization:-
Batch normalization (BN) is a powerful technique to address issues like vanishing and exploding gradients and internal covariate shift in training deep neural networks. The core idea is to normalize the inputs to a layer such that the mean and variance are controlled, thereby stabilizing and speeding up training.
Key Concepts of Batch Normalization:
1. Internal Covariate Shift:
   1. During training, as parameters are updated, the hidden layer inputs change, leading to instability and slower convergence. This shift is called internal covariate shift.
   2. Batch normalization helps mitigate this by normalizing the input at each layer, ensuring that the distribution of inputs remains more stable.
2. Normalization Layer:
   1. The BN layer is introduced between hidden layers in the network. It normalizes inputs such that they have a mean of zero and a standard deviation of one over each mini-batch of training data.
   2. The normalization is followed by a scaling and shifting operation using learnable parameters βi (shift) and γi (scale).
3. Choices for Normalization:
   1. Post-activation normalization: The values are normalized after applying the activation function.
   2. Pre-activation normalization: The normalization is applied right after the linear transformation (before applying the activation function).
   3. Research suggests that normalizing pre-activation values is more effective, leading to faster and more stable convergence during training.
4. Mathematical Operations:
   1. For each unit i, the mini-batch mean μi and variance σi² are computed, and each pre-activation value vi is normalized as v̂i = (vi − μi)/√(σi² + ε), where ε is a small constant added for numerical stability.
   2. The normalized value is then scaled and shifted using the learnable parameters: ai = γi · v̂i + βi. A minimal NumPy sketch of this forward pass is given after this list.
5. Back propagation Through Batch Normalization:
   1. BN involves additional parameters βi and γi that must be updated during backpropagation.
   2. To update these parameters, the gradients of the loss L with respect to βi and γi are computed as ∂L/∂βi = Σ ∂L/∂ai and ∂L/∂γi = Σ (∂L/∂ai) · v̂i, where the sums run over the mini-batch.
   3. Backpropagation through the BN layer also requires computing the gradients with respect to the mean μi and variance σi², which are batch-dependent. This non-linearity adds complexity but still allows the network to backpropagate through these layers.
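The forward pass from step 4 can be sketched in NumPy as follows (the function name, array shapes, and ε value are illustrative assumptions, not from the original notes):

```python
import numpy as np

def batch_norm_forward(v, gamma, beta, eps=1e-5):
    """Normalize a mini-batch of pre-activations v of shape (batch, units),
    then scale by gamma and shift by beta (both of shape (units,))."""
    mu = v.mean(axis=0)                     # per-unit mini-batch mean
    var = v.var(axis=0)                     # per-unit mini-batch variance
    v_hat = (v - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * v_hat + beta             # learnable scale and shift

rng = np.random.default_rng(0)
v = rng.normal(loc=3.0, scale=2.0, size=(8, 4))
out = batch_norm_forward(v, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))  # approximately 0 for each unit
print(out.std(axis=0))   # approximately 1 for each unit
```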
Key Benefits of Batch Normalization:
Faster Training: By reducing internal covariate shift, BN allows higher learning rates, leading to faster training.
Improved Stability: The controlled input distributions reduce the risk of vanishing or exploding gradients.
Regularization Effect: BN introduces a slight regularization effect, sometimes reducing the need for dropout.
Ensemble Methods
Ensemble methods help improve classifier performance by
addressing the bias-variance trade-off. Bagging reduces variance, while Boosting reduces bias. Neural networks, which tend to have low bias but high variance, benefit from ensemble methods to enhance generalization.
4.5.1 Bagging and Subsampling
Bagging: Creates multiple models by sampling the training data
with replacement and averaging the predictions to reduce variance.
o Customarily, the sample size is s = n (the size of the original data), but a smaller s can work better.
Subsampling: Similar to bagging but without replacement.
Preferable when sufficient data is available.
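A short scikit-learn sketch of bagging versus subsampling (the dataset and hyperparameters are illustrative; bootstrap=True samples with replacement, bootstrap=False samples without replacement):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging: each tree sees a bootstrap sample drawn with replacement (s = n).
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                            max_samples=1.0, bootstrap=True, random_state=0)

# Subsampling: each tree sees a smaller sample drawn without replacement.
subsampling = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                                max_samples=0.6, bootstrap=False, random_state=0)

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(subsampling, X, y, cv=5).mean())
```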
4.5.2 Parametric Model Selection and Averaging
Model selection involves finding the best configuration from a set of hyperparameters. Model averaging instead averages the predictions from the top-k configurations for more robust predictions.
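A minimal sketch of averaging predictions from the top-k configurations, assuming each configuration's model exposes predicted class probabilities (the helper name, scores, and k are placeholders):

```python
import numpy as np

def average_top_k(prob_list, val_scores, k=3):
    """prob_list: list of (n_samples, n_classes) probability arrays, one per configuration.
    val_scores: validation score of each configuration."""
    top_k = np.argsort(val_scores)[::-1][:k]           # indices of the k best configurations
    stacked = np.stack([prob_list[i] for i in top_k])  # shape (k, n_samples, n_classes)
    return stacked.mean(axis=0)                        # averaged ensemble prediction

# Toy example: three configurations predicting probabilities for 2 samples, 2 classes.
probs = [np.array([[0.9, 0.1], [0.2, 0.8]]),
         np.array([[0.7, 0.3], [0.4, 0.6]]),
         np.array([[0.6, 0.4], [0.5, 0.5]])]
print(average_top_k(probs, val_scores=[0.91, 0.88, 0.75], k=2))
```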
4.5.3 Randomized Connection Dropping
Randomly drop connections between layers in a neural network.
Diverse models are generated, and averaging their predictions improves accuracy.
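A tiny NumPy sketch of randomly dropping connections (masking individual weights) to build diverse ensemble members; the function name and drop probability are illustrative:

```python
import numpy as np

def drop_connections(W, drop_prob=0.3, rng=None):
    # Zero out each weight (connection) independently with probability drop_prob,
    # yielding a different thinned network each time this is called.
    rng = rng or np.random.default_rng()
    mask = rng.random(W.shape) >= drop_prob
    return W * mask

W = np.ones((4, 3))              # toy weight matrix between two layers
print(drop_connections(W, 0.3))  # a random subset of entries set to 0
```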
4.5.4 Dropout
· Node Sampling: Dropout randomly drops nodes (input and hidden) along with their connections to create different neural networks during training.
· Weight Sharing: Different sampled networks share the same weights, updated via backpropagation.
· Sampling Process: Each node is sampled with a probability, typically between 20-50%, and all edges connected to dropped nodes are also removed.
· Training with Dropout: A new neural network is sampled for each mini-batch, making the number of sampled networks large.
· Weight Scaling Inference Rule: At inference, the base network (no dropping) is used with re-scaled weights to approximate the ensemble's output (see the sketch after this list).
· Regularization: Dropout acts as a regularizer by introducing noise (setting random nodes to 0), preventing overfitting and feature co-adaptation.
· Feature Co-Adaptation: Dropout prevents complex dependencies between features by encouraging the network to rely on subsets of features, increasing generalization.
· Larger Models Needed: Due to the regularization effect, larger models and more units are often required.
· Performance Improvement: Commonly improves model performance by around 2% on large datasets like ImageNet.
· DropConnect: A variation that applies dropout to weights instead of nodes.
4.5.5 Data Perturbation Ensembles:
Noise on Input Data: Add small amounts of noise to the input data, train multiple models, and average their predictions for better generalization.
Noise in Hidden Layers: Inject noise into the hidden layers (e.g., Dropout), but careful calibration is needed to avoid degrading performance.
Dropout: Randomly drops nodes in neural networks, indirectly adding noise to the hidden layers and enhancing robustness.
Data Augmentation: Apply transformations such as rotations and translations to increase the dataset size, improving model generalization (used in CNNs).
Denoising Autoencoders: Common in unsupervised learning; reconstruct input data from noisy versions to enhance feature learning.
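A small sketch of the input-noise idea: train several copies of a model on noise-perturbed inputs and average their predictions (the dataset, learner, and noise scale are placeholders standing in for a real setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
rng = np.random.default_rng(0)

# Train several models, each on a noise-perturbed copy of the input data.
models = []
for _ in range(5):
    X_noisy = X + rng.normal(scale=0.1, size=X.shape)  # small Gaussian input noise
    models.append(LogisticRegression(max_iter=1000).fit(X_noisy, y))

# Average the predicted probabilities of the perturbed-data models.
avg_proba = np.mean([m.predict_proba(X) for m in models], axis=0)
ensemble_pred = avg_proba.argmax(axis=1)
print((ensemble_pred == y).mean())  # ensemble accuracy on the training data
```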