MODULE 2
TRAINING DEEP MODELS:
INTRODUCTION
• Deep learning is an advanced form of machine learning that tries to
emulate the way the human brain learns.
• In the brain, we have nerve cells called neurons, which are connected to
one another by nerve extensions that pass electrochemical signals
through the network.
• When the first neuron in the network is stimulated, the input signal is
processed, and if it exceeds a particular threshold, the neuron is
activated and passes the signal on to the neurons to which it is
connected.
• These neurons in turn may be activated and pass the signal on
through the rest of the network.
• Over time, the connections between the neurons are strengthened by
frequent use as we learn how to respond effectively.
• Machine learning is concerned with predicting a label based on
some features of a particular observation.
• In simple terms, a machine learning model is a function that
calculates y (the label) from x (the features): f(x)=y
A deep neural network model
• Because of the layered architecture of the network, this kind of
model is sometimes referred to as a multilayer perceptron.
• Additionally, notice that all neurons in the input and hidden layers
are connected to all neurons in the subsequent layers - this is an
example of a fully connected network.
• While creating a model like this, we must define an input layer that
supports the number of features our model will process, and an
output layer that reflects the number of outputs we expect it to
produce.
• We can decide how many hidden layers we want to include and
how many neurons are in each of them;
• But we have no control over the input and output values for these
layers - these are determined by the model training process.
Training a deep neural network
• The training process for a deep neural network consists of multiple iterations, called
epochs.
• For the first epoch, we start by assigning random initialization values for the
weight (w) and bias (b) values.
• Then the process is as follows:
• Features for data observations with known label values are submitted to the input layer.
Generally, these observations are grouped into batches (often referred to as mini-batches).
• The neurons then apply their function, and if activated, pass the result onto the next layer until
the output layer produces a prediction.
• The prediction is compared to the actual known value, and the amount of variance between
the predicted and true values (which we call the loss) is calculated.
• Based on the results, revised values for the weights and bias values are calculated to reduce the
loss, and these adjustments are backpropagated to the neurons in the network layers.
• The next epoch repeats the batch training forward pass with the revised weight and bias
values, hopefully improving the accuracy of the model (by reducing the loss).
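A rough sketch of this loop in PyTorch is shown below. The network shape, data, batch size, learning rate, and epoch count are illustrative assumptions, not values from the text.
```python
# A minimal sketch of the epoch/mini-batch training loop described above.
import torch
import torch.nn as nn

# Fully connected network: 4 input features -> hidden layer -> 3 outputs
model = nn.Sequential(nn.Linear(4, 10), nn.ReLU(), nn.Linear(10, 3))
loss_fn = nn.CrossEntropyLoss()                  # compares prediction to known label
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(120, 4)                          # stand-in features
y = torch.randint(0, 3, (120,))                  # stand-in known labels

for epoch in range(10):                          # multiple iterations (epochs)
    for i in range(0, len(X), 32):               # mini-batches
        xb, yb = X[i:i + 32], y[i:i + 32]
        pred = model(xb)                         # forward pass to the output layer
        loss = loss_fn(pred, yb)                 # the loss
        optimizer.zero_grad()
        loss.backward()                          # backpropagate the adjustments
        optimizer.step()                         # revise weights and biases
```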
LOSS FUNCTION
A loss function is a function that compares the target and predicted output values;
it measures how well the neural network models the training data.
It is calculated for each sample output.
When training, we aim to minimize this loss between the predicted and target outputs.
COST FUNCTION
• A cost function is an important parameter that determines how well a
machine learning model performs for a given dataset.
• It is the average of the loss-function values over the entire training set.
• It calculates the difference between the expected value and the predicted
value and represents it as a single real number.
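A tiny illustration of the loss/cost distinction; the numbers are made up and squared error is used as the per-sample loss.
```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.4])

loss_per_sample = (y_true - y_pred) ** 2   # loss: one value per sample output
cost = loss_per_sample.mean()              # cost: average loss over the whole set
print(loss_per_sample, cost)
```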
• GRADIENT DESCENT
• Gradient Descent is an optimization algorithm which is used for optimizing
the cost function or error in the model.
KAIMING AND XAVIER WEIGHT
INITIALIZATIONS
• The aim of weight initialization is to prevent layer
activation outputs from exploding or vanishing
during the course of a forward pass through a deep
neural network.
• If either occurs, loss gradients will either be too large or
too small to flow backwards beneficially, and the
network will take longer to converge, if it is even able to
do so at all.
Different Weight Initialization Techniques
• If we initialize all the weights with 0, the derivative of the loss function with
respect to every weight in W[l] is the same, so all weights take the same value
in subsequent iterations.
• This makes the hidden layers symmetric, and the process continues for all
n iterations. Thus, weights initialized with zero make the network
no better than a linear model.
Random Initialization (weights initialized randomly)
This technique addresses the problem of zero initialization: it prevents
neurons from learning the same features of their inputs. Our goal is to make
each neuron learn a different function of its input, and this technique gives
much better accuracy than zero initialization.
In general, it is used to break the symmetry. It is better to assign
random, non-zero values to the weights.
Remember, neural networks are very sensitive and prone to overfitting,
as they quickly memorize the training data.
Best Practices for Weight Initialization
Fan_in :
This represents the number of input connections to a neuron, i.e. the number
of units (neurons) in the previous layer.
Fan_out :
This represents the number of output connections from a neuron or the
number of output units (neurons) in a layer.
For a given layer, the fan_out is the number of neurons in that layer itself.
1)Xavier Init[Glorot Initialization] :
This weight initialization has two variations. It works well for sigmoid and
tanh activation functions.
i)Xavier Normal :
Wij ~ N(mean,std) , mean=0 , std=sqrt(2/(fan_in+fan_out))
ii)Xavier Uniform :
Wij ~ D[-sqrt(6/(fan_in+fan_out)), sqrt(6/(fan_in+fan_out))]
2)He Init[Kaiming Initialization] :
This weight initialization also has two variations. It works pretty well for the ReLU
and LeakyReLU activation functions prevalent in deep learning models like
CNNs and transformers.
i)He Normal :
Normal Distribution with Mean=0
Wij ~ N(mean,std) , mean=0 , std=sqrt(2/fan_in)
Where N is a Normal Distribution
ii)He Uniform :
Wij ~ D[-sqrt(6/fan_in),sqrt(6/fan_in)]
Where D is a Uniform Distribution
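A minimal NumPy sketch of the two He variants above; the layer sizes are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(fan_in, fan_out):
    std = np.sqrt(2.0 / fan_in)              # He Normal: std = sqrt(2 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / fan_in)            # He Uniform: U[-sqrt(6/fan_in), +sqrt(6/fan_in)]
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = he_normal(256, 128)
print(W.std(), np.sqrt(2.0 / 256))           # sample std is close to the target std
```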
• Benefits of using these heuristics:
• All these heuristics serve as good starting points for
weight initialization and they reduce the chances of
exploding or vanishing gradients.
• With these heuristics, activations and gradients neither vanish nor explode
too quickly, as the weights are neither much bigger than 1 nor much
smaller than 1.
• They help to avoid slow convergence.
Setup and Initialization Issues
• First, the hyperparameters of the neural network (such as the
learning rates and regularization parameters) need to be selected.
• Tuning Hyperparameters
• Feature Processing
• Initialization
• Feature Processing:
Whitening : It is done to transform a feature set to a new axis (through PCA,
principal component analysis).
Principal component analysis can be viewed as the application of singular
value decomposition after mean-centering a data matrix.
Let D be an n × d data matrix that has already been mean-centered.
Let C be the d × d covariance matrix of D, in which the (i, j)th entry is the
covariance between the dimensions i and j. Because the matrix D is mean-
centered, we have the following:

C = Dᵀ D / n
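A minimal sketch of PCA whitening using this covariance relation; the data shape and the small epsilon are illustrative assumptions.
```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 5))
D = X - X.mean(axis=0)                         # mean-centered n x d data matrix
C = D.T @ D / D.shape[0]                       # d x d covariance matrix
eigvals, P = np.linalg.eigh(C)                 # columns of P = principal axes
D_white = (D @ P) / np.sqrt(eigvals + 1e-8)    # rotate to the new axes, unit variance
print(np.cov(D_white, rowvar=False).round(2))  # approximately the identity matrix
```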
Initialization
• Initializations are surprisingly important.
• Poor initializations can lead to premature end of training process and bad
convergence behavior.
• Instability across different layers (vanishing and exploding gradients).
• Initial parameters need to break symmetry between different units.
• If two hidden units connected to the same inputs have identical initial weights, they
will have identical influence on the cost, which will lead to identical gradients. The
neurons will thus evolve symmetrically, preventing different neurons from learning
different things. For this reason we don't initialize weights with zero or even
constant values.
• Initializing weights to random values breaks symmetry.
More sophisticated initializations, such as pretraining, can also be used.
Even some simple rules for initialization can help with conditioning:
The mean of the activations is zero.
The variance of the activations stays the same across every layer.
The average magnitude of the random variables is important for stability.
• An important consideration is that symmetry breaking matters: if all weights are initialized
to the same value (such as 0), all updates will move in lock-step within a layer. As a result,
identical features will be created by the neurons in a layer, and no learning will happen.
Hence it is important to have a source of asymmetry among the neurons to begin with,
i.e., the symmetry in the weights should be broken.
• If the weights are too small, the activation values stay in a small range near zero, and the
non-linearity is lost. Also, the variance of the input to succeeding layers starts diminishing as
we pass through each layer of the network. The input eventually drops to a very small value,
and the 'adjustability' of the neurons during backpropagation becomes very poor (the
vanishing-gradient issue).
• If the weights are too large and push the activation values out of the small range into the
saturated region, the variance of the input to succeeding layers tends to increase rapidly
with each passing layer (the exploding-gradient issue).
• To avoid these vanishing- and exploding-gradient issues, we need a good initialization
for the weights.
OPTIMIZATION TECHNIQUES
• Optimization algorithms are responsible for reducing losses and
providing the most accurate results possible.
• The weights are initialized using some initialization strategy and are
updated with each epoch according to the optimizer's update equation.
• The best results are achieved using optimization strategies or
algorithms called optimizers.
• When we realize that our model is performing poorly at the
current instance, we need to minimize the loss and maximize the
accuracy. That process is known as optimization.
• Optimizers are methods or algorithms used to change the attributes
of a neural network, such as the weights and the learning rate, to reduce the
loss.
• After calculating the loss, we need to optimize the weights and biases in
the same iteration.
• Some of the techniques are
1. Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini-Batch Stochastic Gradient Descent (MB — SGD)
4. SGD with Momentum
5. Nesterov Accelerated Gradient (NAG)
6. Adaptive Gradient (AdaGrad)
7.AdaDelta
8. RMSProp
9. Adam
10. Nadam
Gradient Descent Optimization Algorithm
Gradient descent is an optimization algorithm that's used to train machine
learning models and neural networks to minimize errors between predicted
and actual results.
It's an iterative algorithm that works by finding the direction in which a
function decreases the most, and following that direction to minimize the
function.
The gradient descent algorithm is based on a convex function.
Gradient Descent
• Gradient Descent is one of the popular techniques to perform
optimization.
• It's based on a convex function and tweaks its parameters iteratively
to minimize a given function to its local minimum.
How the algorithm works
The starting point is just an arbitrary point for us to evaluate the performance.
From that starting point, we will find the derivative (or slope), and from there,
we can use a tangent line to observe the steepness of the slope.
The slope will inform the updates to the parameters—i.e. the weights and bias.
The slope at the starting point will be steeper, but as new parameters are
generated, the steepness should gradually reduce until it reaches the lowest
point on the curve, known as the point of convergence.
• Gradient Descent is an optimization algorithm for finding a local
minimum of a differentiable function.
• We start by defining the initial parameter values, and from there
gradient descent uses calculus to iteratively adjust the values so that they
minimize the given cost function:

θ ← θ − η · ∇θ J(θ)

• The above equation computes the gradient of the cost function J(θ)
w.r.t. the parameters/weights θ for the entire training dataset and steps
against it with learning rate η.
• "A gradient measures how much the output of a function changes if
you change the inputs a little bit."
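A hedged, one-dimensional illustration of this update rule for the convex cost J(θ) = (θ − 3)², whose minimum is at θ = 3; the starting point and learning rate are arbitrary choices.
```python
def grad_J(theta):
    return 2 * (theta - 3)         # derivative of (theta - 3)^2

theta, lr = 0.0, 0.1               # arbitrary starting point and learning rate
for step in range(50):
    theta -= lr * grad_J(theta)    # step in the direction of steepest descent
print(theta)                       # ~3.0, the point of convergence
```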
Importance of Learning Rate
• The learning rate determines how big the steps are that gradient descent
takes in the direction of the local minimum.
• It tells us how fast or slow we will move towards the optimal
weights.
• When we initialize the learning rate, we set an appropriate value which is
neither too low nor too high.
• A constant learning rate is not desirable.
• A lower learning rate will cause the algorithm to take too long to come
even close to an optimal solution.
• On the other hand, a large initial learning rate will allow the algorithm to
come reasonably close to a good solution at first; however, the algorithm
will then oscillate around the point for a very long time.
Learning Rate Decay
Allowing the learning rate to decay over time can naturally achieve the desired learning-rate
adjustment and avoid these challenges.
Exponential decay: αt = α0 · e^(−kt)
Inverse decay: αt = α0 / (1 + kt)
The learning rate αt can be expressed in terms of the initial rate α0 and the epoch t as above.
The parameter k controls the rate of the decay.
Step decay: the learning rate is reduced by a particular factor every few epochs.
For example, the learning rate might be multiplied by 0.5 every 5 epochs.
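The three schedules can be sketched as simple Python functions of the epoch t; alpha_0 and k below are illustrative values.
```python
import math

alpha_0, k = 0.1, 0.05

def exponential_decay(t):
    return alpha_0 * math.exp(-k * t)          # alpha_t = alpha_0 * e^(-k t)

def inverse_decay(t):
    return alpha_0 / (1 + k * t)               # alpha_t = alpha_0 / (1 + k t)

def step_decay(t, factor=0.5, every=5):
    return alpha_0 * factor ** (t // every)    # multiply by 0.5 every 5 epochs

for t in (0, 5, 10):
    print(t, exponential_decay(t), inverse_decay(t), step_decay(t))
```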
• Advantages of Gradient Descent
• Easy Computation
• Easy to implement
• Easy to understand
• Disadvantages of Gradient Descent
• May trap at local minima
• Weights are changed only after calculating the gradient on the whole dataset, so if
the dataset is too large it may take years to converge to the minima
• Requires large memory to calculate gradient for whole dataset
• 3 Types of Gradient Descent: batch gradient descent, stochastic gradient
descent (SGD), and mini-batch gradient descent.
• There are a few problems that can occur when using gradient descent:
• Local Minima:
• Gradient descent can get stuck in local minima, points that are not the
global minimum of the cost function but are still lower than the
surrounding points.
• This can occur when the cost function has multiple valleys, and the
algorithm gets stuck in one instead of reaching the global minimum.
• Saddle Points and Plateaus:
• A plateau is a region in the cost function where the gradients are very
small or close to zero; a saddle point similarly has a near-zero gradient
without being a minimum. These can cause gradient descent to take a
long time to converge, or not converge at all.
• Oscillations:
• Oscillations occur when the learning rate is too high, causing the
algorithm to overshoot the minimum and oscillate back and forth.
• Slow convergence:
• Gradient descent can converge very slowly when the cost function is
complex or has many local minima. This means the algorithm may
take a long time to find the global minimum.
• Stochasticity:
• In stochastic gradient descent, the cost function is evaluated at
random samples from the data set. This introduces randomness into
the algorithm, making converging to a global minimum more difficult.
• Vanishing or Exploding Gradients:
• Deep neural networks with many layers can suffer from vanishing or
exploding gradients. This occurs when the gradients become very
small or large, respectively, as they are backpropagated through the
layers. This can make it difficult for the algorithm to update the
weights and biases.
• Momentum helps to accelerate descent in consistent directions and to move through flat regions.
The normal updates for gradient descent with respect to a loss function L (defined over a mini-batch of instances)
are as follows:

V ← −α · ∂L/∂W ;  W ← W + V

Here, α is the learning rate.
In momentum-based descent, the vector V is modified with exponential smoothing, where β ∈ (0, 1) is a
smoothing parameter:

V ← βV − α · ∂L/∂W ;  W ← W + V

Momentum (energy) from the previous gradients is added to the update so that it gains energy to move forward in
flat regions.
Setting β = 0 specializes to straightforward mini-batch gradient descent. The parameter β is also referred to as the
momentum parameter or the friction parameter.
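A minimal sketch of this momentum update; the learning rate, β, and gradient values are illustrative.
```python
import numpy as np

def momentum_step(w, V, grad, lr=0.01, beta=0.9):
    V = beta * V - lr * grad   # exponential smoothing of past updates
    w = w + V                  # move with the accumulated velocity
    return w, V

w, V = np.zeros(2), np.zeros(2)
w, V = momentum_step(w, V, grad=np.array([1.0, -0.5]))
print(w, V)
```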
• Momentum at time 't' is computed using all previous updates, giving
more weight to recent updates than to earlier ones.
This speeds up the convergence.
• Essentially, when using momentum, we push a ball down a hill. The ball
accumulates momentum as it rolls downhill, becoming faster and faster
on the way.
• The same thing happens to our parameter updates: The momentum
term increases for dimensions whose gradients point in the same
directions and reduces updates for dimensions whose gradients change
directions.
• As a result, we gain faster convergence and reduced oscillation.
It is evident that momentum-based updates can reach the optimal solution in fewer
updates.
The basic idea is to give greater preference to consistent directions over multiple
steps, which have greater importance in the descent. This allows the use of larger
steps in the correct direction without causing overflows or “explosions” in the
sideways direction.
The use of momentum will often cause the solution to slightly overshoot in the
direction where velocity is picked up; this overshooting is desirable to the
extent that it helps avoid local optima.
While increased values of β help in avoiding local optima, they may also increase
oscillation at the end.
• Advantages
• Converges faster than SGD
• All advantages of SGD
• Reduces the oscillations and high variance of the parameters
• Disadvantage
• An extra variable (the velocity) is introduced that we need to compute and
store for each update
Animation: https://miro.medium.com/v2/resize:fit:640/format:webp/1*zVi4ayX9u0MQQwa90CnxVg.gif
NAG - Nesterov Accelerated Gradient/Nesterov
Momentum
• Momentum may be a good method, but if the momentum is too high the
algorithm may miss the local minima and continue past them.
• The approach followed here was that the parameters update would be
made with the history element first and then only the derivative is
calculated which can move it in the forward or backward direction.
• This is called the look-ahead approach, and it makes more sense because
if the curve reaches near to the minima, then the derivative can make it
move slowly so that there are fewer oscillations and therefore saving
more time.
The idea is that this corrected gradient uses a better understanding of how the
gradients will change because of the momentum portion of the update, and
incorporates this information into the gradient portion of the update.
Therefore, one is using a certain amount of lookahead in computing the
updates.
The Nesterov momentum is a modification of the traditional momentum
method in which the gradients are computed at a point that would be reached
after executing a β- discounted version of the previous step again.
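A minimal sketch of the look-ahead step, assuming a hypothetical grad_fn that returns ∂L/∂w at a given point.
```python
# Nesterov look-ahead update: the gradient is evaluated at the shifted
# point w + beta*V before the velocity update. grad_fn is a hypothetical
# function returning dL/dw at a given point.
def nesterov_step(w, V, grad_fn, lr=0.01, beta=0.9):
    g = grad_fn(w + beta * V)   # gradient at the look-ahead point
    V = beta * V - lr * g
    w = w + V
    return w, V
```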
Parameter-Specific Learning Rates
Use learning rates specific to each parameter. Some parameters' gradients follow a
steep path, while the gradients of other parameters pass through flat regions.
The basic idea in the momentum methods of the previous section is to leverage the
consistency in the gradient direction of certain parameters in order to speed up the
updates. This goal can also be achieved more explicitly by having different learning
rates for different parameters.
Parameters with large partial derivatives often oscillate and zigzag,
whereas parameters with small partial derivatives tend to be more consistent,
moving in the same direction.
Adagrad, RMSProp, Adam
AdaGrad - Adaptive Gradient Descent
• AdaGrad is a little bit different from the other gradient descent algorithms.
In all the previously discussed algorithms the learning rate was constant.
So here the key idea is to have an adaptive learning rate for each of the
weights.
• It uses a different learning rate for each iteration. The more a parameter
has been changed, the smaller its learning rate becomes.
Here, we keep track of the aggregated squared magnitude of the partial derivative with respect to each
parameter.
The squared partial derivatives are aggregated and used as a factor in the learning-rate update:

Ai ← Ai + (∂L/∂wi)² ;  wi ← wi − (α / √(Ai + ε)) · ∂L/∂wi

A small constant ε is added to avoid division by zero (ill-conditioning).
When the previous aggregate is large, the learning-rate factor is reduced by this aggregate, and hence the weight
values are updated only slowly.
When the previous aggregate is small, the learning rate is larger, which may cause a large weight update.
The aggregate over all past components tends to slow updates down over time, which is the main problem with
the AdaGrad approach.
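A minimal sketch of this per-parameter update; the learning rate and ε are illustrative defaults.
```python
import numpy as np

def adagrad_step(w, A, grad, lr=0.01, eps=1e-8):
    A = A + grad ** 2                        # aggregate squared magnitudes
    w = w - lr * grad / (np.sqrt(A) + eps)   # per-parameter scaled step
    return w, A
```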
Animation: https://miro.medium.com/v2/resize:fit:828/format:webp/1*WRtvrr9Z0QcokiKlgU7xEw.gif
Advantages
The learning rate changes adaptively; no human intervention is required.
One of the best algorithms to train on sparse data.
Disadvantages
The learning rate is always decreasing, which leads to slow convergence.
Due to the small learning rate, the model eventually becomes unable to train
properly and cannot acquire the required knowledge, so the accuracy of the
model is compromised.
RMSProp - Root Mean Square Propagation
RMSProp is a variant of AdaGrad; it is actually an improvement on the
AdaGrad optimizer.
Here the learning-rate scaling uses an exponential average of the squared
gradients instead of the cumulative sum of squared gradients.
It uses a decay parameter to make the extreme past values decay over
time.
Instead of simply adding the squared gradients to estimate Ai, it uses
exponential averaging.
The basic idea is to use a decay factor ρ ∈ (0, 1) to decay the squared partial
derivatives occurring t updates ago:

Ai ← ρAi + (1 − ρ)(∂L/∂wi)² ;  wi ← wi − (α / √(Ai + ε)) · ∂L/∂wi
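A minimal sketch of this exponentially averaged variant; ρ, the learning rate, and ε are illustrative defaults.
```python
import numpy as np

def rmsprop_step(w, A, grad, lr=0.01, rho=0.9, eps=1e-8):
    A = rho * A + (1 - rho) * grad ** 2      # decayed average of squared grads
    w = w - lr * grad / (np.sqrt(A) + eps)
    return w, A
```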
RMSProp with Nesterov Momentum
Note that the partial derivative of the loss function is computed at a shifted point,
as is common in the Nesterov method.
The weight W is shifted with βV while computing the partial derivative with respect
to the loss function.
AdaDelta
The AdaDelta algorithm uses a similar update to RMSProp, except that it eliminates
the need for a global learning parameter by computing it as a function of the
incremental updates in previous iterations.
Consider the update of RMSProp, which is repeated below:

wi ← wi − (α / √(Ai + ε)) · ∂L/∂wi

In each update, the value of Δwi is the increment in the value of wi.
As with the exponentially smoothed gradients Ai, we keep an exponentially smoothed value δi of the
values of Δwi in previous iterations, with the same decay parameter ρ:

δi ← ρδi + (1 − ρ)(Δwi)²

AdaDelta then replaces the global learning rate α with √(δi + ε), giving the update
Δwi = −(√(δi + ε) / √(Ai + ε)) · ∂L/∂wi.
For a given iteration, the value of δi can be computed using only the iterations before it, because the
value of Δwi is not yet available.
On the other hand, Ai can be computed using the partial derivative in the current iteration as well.
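A minimal sketch of the AdaDelta step under these definitions; ρ and ε are illustrative defaults, and note that there is no global learning rate.
```python
import numpy as np

def adadelta_step(w, A, delta, grad, rho=0.95, eps=1e-8):
    A = rho * A + (1 - rho) * grad ** 2
    dw = -np.sqrt(delta + eps) / np.sqrt(A + eps) * grad  # increment in w
    delta = rho * delta + (1 - rho) * dw ** 2             # smoothed history of increments
    return w + dw, A, delta
```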
ADAM (Adaptive Moment Estimation)
• Adam uses the squared gradients to scale the learning rate like RMSProp, and it
takes advantage of momentum by using a moving average of the gradient.
• Its name is derived from adaptive moment estimation: Adam uses estimates of the
first and second moments of the gradient to adapt the learning rate for each weight
of the neural network.
We have two decay parameters, β1 and β2; they are usually kept
around 0.9 and 0.99, but we can change them according to our use case. The default value
for the learning rate η is 0.001.
Along with the decay parameters, a first moment (mt) and a second moment (vt) are also
used.
The first moment is the mean, and the second moment is the uncentered variance
(meaning we don't subtract the mean during the variance calculation).
The gradient value is used in the weight update indirectly, through the first and second
moments.
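A minimal sketch of the Adam step with bias correction; β1, β2, and the learning rate follow the defaults mentioned in the text, and ε is the usual small constant.
```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: uncentered variance
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```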
• What Are the Advantages of Adam Optimization?
• Adam optimization offers several advantages over other optimization algorithms:
1. Adaptive Learning Rates: Unlike fixed learning rate methods like SGD, Adam
optimization provides adaptive learning rates for each parameter based on the
history of gradients. This allows the optimizer to converge faster and more
accurately, especially in high-dimensional parameter spaces.
2. Momentum: Adam optimization uses momentum to smooth out fluctuations in the
optimization process, which can help the optimizer avoid local minima and saddle
points.
3. Bias Correction: Adam optimization applies bias correction to the first and second
moment estimates to ensure that they are unbiased estimates of the true values.
4. Robustness: Adam optimization is relatively robust to hyperparameter choices and
works well across a wide range of deep learning architectures
• Best Practices for Using Adam Optimization
• Use Default Hyperparameters: In most cases, the default hyperparameters for
Adam optimization (β1 = 0.9, β2 = 0.999, ε = 10⁻⁸) work well and do not need to be
tuned.
• Monitor Learning Rate: It can be helpful to monitor the learning rate during
training to ensure that it is not too high or too low. A good rule of thumb is to set
the initial learning rate to a small value and then gradually increase it until
convergence.
• Regularization: Adam optimization can benefit from regularization techniques
like weight decay or dropout to prevent overfitting.
• Batch Size: The batch size can have an impact on the performance of Adam
optimization. In general, larger batch sizes tend to work better with Adam
optimization compared to other optimization algorithms.
Animation of 5 gradient descent methods on a surface: gradient descent (cyan), momentum (magenta),
AdaGrad (white), RMSProp (green), Adam (blue). The left well is the global minimum; the right well is a local
minimum.
https://miro.medium.com/v2/resize:fit:828/format:webp/1*47skUygd3tWf3yB9A10QHg.gif
REGULARIZATION TECHNIQUES
CONCEPT OF REGULARIZATION
• Regularization is a set of techniques that can prevent overfitting in
neural networks and thus improve the accuracy of a Deep
Learning model when facing completely new data from the problem
domain.
• Overfitting refers to the phenomenon where a neural network models
the training data very well but fails when it sees new data from the
same problem domain.
• Overfitting is caused by noise in the training data that the neural
network picks up during training and learns as an underlying
concept of the data.
• Weight regularization is a technique which aims to stabilize an overfitted
network by penalizing large weight values in the network.
• An overfitted network usually has large weight values, so a small change in
the input can lead to large changes in the output.
• For instance, when the network is given new or test data, it produces incorrect
predictions.
• Weight regularization penalizes the network's large weights, forcing the
optimization algorithm to reduce the larger weight values to smaller ones;
this leads to stability of the network and good performance.
• In weight regularization, the network configuration remains unchanged;
only the values of the weights are modified.
• Weight Regularization reduces overfitting by penalizing or adding a
constraint to the loss function.
• In Deep Learning there are two well-known regularization
techniques:
• L1 and L2 regularization
• Both add a penalty to the cost based on the model complexity, so
instead of calculating the cost by simply using a loss function, there
will be an additional element (called “regularization term”) that will
be added in order to penalize complex models.
Regularization to prevent over-fit
L1 regularization (LASSO regression: Least Absolute Shrinkage and Selection Operator) produces sparse
weight matrices.
A sparse matrix is one in which most elements are zero; in this context, a sparse weight matrix has many
close-to-zero values and a few larger values.
If we find a model with neurons whose weights are close to zero, it means we don't need those neurons:
the model deactivates them with zeros, and we might not need a specific feature/input, leading to a
simpler model. For instance, if we have 50 coefficients but only 10 are non-zero, the other 40 are irrelevant to
making our predictions. This is interesting not only from the efficiency point of view but also from the
economic point of view: gathering data and extracting its features can be a very expensive task (in terms
of time and money). Reducing this will benefit us.
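As a hedged sketch, an L1 penalty can be added to any loss in PyTorch as follows; lambda_l1 is an illustrative strength, not a value from the text.
```python
import torch

def loss_with_l1(base_loss, model, lambda_l1=1e-4):
    l1_term = sum(p.abs().sum() for p in model.parameters())
    return base_loss + lambda_l1 * l1_term   # pushes weights towards zero / sparsity
```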
Regularization to prevent overfit: Data Augmentation
• I. Spatial Transformation
• Flipping
• This is a very simple technique in which an image is flipped
horizontally to produce a mirror image or flipped vertically to produce
an image that is upside down.
• Rotation
• With this technique, we rotate the entire image by a certain degree.
• Translation
• The entire image is shifted left/right and/or up/down by a certain
amount. This will result in objects of interest appearing in different
locations of the image frame after translation is applied.
• Cropping
• Given an image, we select part of the image (normally a square or
rectangular section), take a crop of this selection, and then resize the
crop to the original size of the image.
• II.Colour Transformation
• With colour transformation techniques, the spatial aspect of the
image is normally preserved while the values of the pixels are edited.
• Brightness
• The pixel values of the image are either increased to result in a
lighter, brighter image or reduced to result in a darker, dimmer
image.
• Contrast
• Contrast is the difference between the bright and dark parts of an
image. Increasing the contrast generally involves making the bright
parts of the image brighter and the dark parts darker.
• III.Advanced Data Augmentation Techniques
• GridMask
• Unlike the above techniques of spatial and colour transformations,
GridMask falls under a third set of transformation techniques which
we can refer to as 'information deletion’.
• With these techniques, parts of the image are removed by setting
the pixels to 0 or placing some patch over that part of the image,
thereby deleting information.
• IV.Niche Data Augmentation Techniques
• Temporal Reordering
• Most of the techniques we've talked about above work well on single
images.
• However given that we work with multiple images from a camera, we
can try and incorporate other augmentation techniques specific to
video data.
• With temporal reordering, given a pair of images, we can reverse the
order of the images and present this to the model as a different
training example.
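Minimal NumPy sketches of the flipping, rotation, translation, cropping, and brightness transformations above, treating an image as an H × W × C array; the image and parameter values are illustrative.
```python
import numpy as np

img = np.random.default_rng(0).integers(0, 256, size=(32, 32, 3)).astype(np.uint8)

flipped  = img[:, ::-1]                    # horizontal flip (mirror image)
rotated  = np.rot90(img)                   # rotate by 90 degrees
shifted  = np.roll(img, shift=4, axis=1)   # translate 4 pixels right (with wraparound)
cropped  = img[4:28, 4:28]                 # crop a square section (then resize back)
brighter = np.clip(img.astype(int) + 40, 0, 255).astype(np.uint8)  # brightness increase
```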
One must be careful not to apply data augmentation blindly without regard
to the data set and application at hand.
For example, applying rotations and reflections on the MNIST data set of
handwritten digits is a bad idea because the digits in the data set are all
presented in a similar orientation.
Furthermore, the mirror image of an asymmetric digit is not a valid digit,
and a rotation of a ‘6’ is a ‘9.’
The key point in deciding what types of data augmentation are reasonable
is to account for the natural distribution of images in the full data set.
BENEFITS OF DATA
AUGMENTATION
• Improving model prediction accuracy
• Adding more training data into the models
• Preventing data scarcity for better models
• Reducing data overfitting (i.e. the statistical error in which a
function corresponds too closely to a limited set of data points) and
creating variability in the data
• Increasing generalization ability of the models
• Reducing costs of collecting and labeling data
• Enabling rare event prediction
ENSEMBLE METHODS
• Ensemble learning is a machine learning paradigm where multiple
models (often called “weak learners”) are trained to solve the same
problem and combined to get better results.
• The main hypothesis is that when weak models are correctly combined
we can obtain more accurate and/or robust models.
• Weak Learners: A ‘weak learner’ is any ML algorithm (for
regression/classification) that provides an accuracy slightly better than
random guessing.
• In ensemble learning theory, we call weak learners (or base models) models
that can be used as building blocks for designing more complex models by
combining several of them.
• Most of the time, these basic models do not perform so well by themselves,
either because they have a high bias or because they have too much
variance to be robust.
• Then, the idea of ensemble methods is to try reducing bias and/or variance
of such weak learners by combining several of them together to create
a strong learner (or ensemble model) that achieves better performances.
ENSEMBLE METHODS
• BAGGING aims to decrease variance
• BOOSTING aims to decrease bias
• STACKING aims to improve prediction accuracy.
BAGGING
• Bagging stands for Bootstrap Aggregation.
• Bootstrapping is a technique of sampling different sets of data from a
given training set by using replacement.
• After bootstrapping the training dataset, we train the model on all the
different sets and aggregate the result. This technique is known as
Bootstrap Aggregation or Bagging.
• The idea behind bagging is to combine the results of multiple models
(for instance, all decision trees) to get a generalized result. This is
where bootstrapping comes into the picture.
Bagging works as follows:-
Multiple subsets are created from the original dataset, selecting observations
with replacement.
A base model (weak model) is created on each of these subsets.
The models run in parallel and are independent of each other.
The final predictions are determined by combining the predictions from all
the models.
• For aggregating the outputs of base learners, bagging uses
majority voting (most frequent prediction among all predictions) for
classification and averaging (mean of all the predictions) for
regression.
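A minimal sketch of bagging's aggregation step: majority voting for classification and averaging for regression; the base-model predictions are made up.
```python
import numpy as np

clf_preds = np.array([[0, 1, 1],            # 3 base models x 3 samples (class labels)
                      [0, 1, 0],
                      [1, 1, 1]])
majority = np.round(clf_preds.mean(axis=0)).astype(int)   # most frequent prediction

reg_preds = np.array([[2.1, 3.0],
                      [1.9, 2.8],
                      [2.0, 3.2]])
averaged = reg_preds.mean(axis=0)                         # mean of all predictions
print(majority, averaged)
```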
Advantages of a Bagging Model:
1. Bagging significantly decreases the variance without increasing bias.
2. Bagging methods work so well because of diversity in the training
data since the sampling is done by bootstrapping.
3. Also, if the training set is very large, bagging can save computational time
by training each model on a relatively smaller data set while still
increasing the accuracy of the model.
4. Works well with small datasets as well.
BOOSTING
• The term ‘Boosting’ refers to a family of algorithms which converts
weak learner to strong learners.
• Boosting is an ensemble method for improving the model
predictions of any given learning algorithm.
• The idea of boosting is to train weak learners sequentially, each
trying to correct its predecessor.
• The weak learners are sequentially corrected by their predecessors
and, in the process, they are converted into strong learners.
• Boosting is a sequential process, where each subsequent model
attempts to correct the errors of the previous model. The
succeeding models are dependent on the previous model.
Pros
Computational scalability
Robust to outliers