MODULE 2
TRAINING DEEP MODELS:
INTRODUCTION
• Deep learning is an advanced form of machine learning that tries to
emulate the way the human brain learns.
• In the brain, we have nerve cells called neurons, which are connected to
one another by nerve extensions that pass electrochemical signals
through the network.
• When the first neuron in the network is stimulated, the input signal is
processed, and if it exceeds a particular threshold, the neuron is
activated and passes the signal on to the neurons to which it is
connected.
• These neurons in turn may be activated and pass the signal on
through the rest of the network.
• Over time, the connections between the neurons are strengthened by
frequent use as we learn how to respond effectively.
• Machine learning is concerned with predicting a label based on
some features of a particular observation.
• In simple terms, a machine learning model is a function that
calculates y (the label) from x (the features): f(x)=y
A deep neural network model
• Because of the layered architecture of the network, this kind of
model is sometimes referred to as a multilayer perceptron.
• Additionally, notice that all neurons in the input and hidden layers
are connected to all neurons in the subsequent layers - this is an
example of a fully connected network.
• While creating a model like this, we must define an input layer that
supports the number of features our model will process, and an
output layer that reflects the number of outputs we expect it to
produce.
• We can decide how many hidden layers we want to include and
how many neurons are in each of them;
• But we have no control over the input and output values for these
layers - these are determined by the model training process.
Training a deep neural network
• The training process for a deep neural network consists of multiple iterations, called
epochs.
• For the first epoch, we start by assigning random initialization values for the
weight (w) and bias (b) values.
• Then the process is as follows:
• Features for data observations with known label values are submitted to the input layer.
Generally, these observations are grouped into batches (often referred to as mini-batches).
• The neurons then apply their function, and if activated, pass the result onto the next layer until
the output layer produces a prediction.
• The prediction is compared to the actual known value, and the amount of variance between
the predicted and true values (which we call the loss) is calculated.
• Based on the results, revised values for the weights and bias values are calculated to reduce the
loss, and these adjustments are backpropagated to the neurons in the network layers.
• The next epoch repeats the batch training forward pass with the revised weight and bias
values, hopefully improving the accuracy of the model (by reducing the loss).
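A rough sketch of this loop in PyTorch is shown below. The network shape, data, batch size, learning rate, and epoch count are illustrative assumptions, not values from the text.
```python
# A minimal sketch of the epoch/mini-batch training loop described above.
import torch
import torch.nn as nn

# Fully connected network: 4 input features -> hidden layer -> 3 outputs
model = nn.Sequential(nn.Linear(4, 10), nn.ReLU(), nn.Linear(10, 3))
loss_fn = nn.CrossEntropyLoss()                  # compares prediction to known label
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(120, 4)                          # stand-in features
y = torch.randint(0, 3, (120,))                  # stand-in known labels

for epoch in range(10):                          # multiple iterations (epochs)
    for i in range(0, len(X), 32):               # mini-batches
        xb, yb = X[i:i + 32], y[i:i + 32]
        pred = model(xb)                         # forward pass to the output layer
        loss = loss_fn(pred, yb)                 # the loss
        optimizer.zero_grad()
        loss.backward()                          # backpropagate the adjustments
        optimizer.step()                         # revise weights and biases
```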
LOSS FUNCTION
A loss function is a function that compares the target and predicted output values;
it measures how well the neural network models the training data.
It is calculated for each sample output.
When training, we aim to minimize this loss between the predicted and target outputs.
COST FUNCTION
• A cost function is an important parameter that determines how well a
machine learning model performs for a given dataset.
• It is the average of the loss-function values over the entire training set.
• It calculates the difference between the expected value and the predicted
value and represents it as a single real number.
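A tiny illustration of the loss/cost distinction; the numbers are made up and squared error is used as the per-sample loss.
```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.4])

loss_per_sample = (y_true - y_pred) ** 2   # loss: one value per sample output
cost = loss_per_sample.mean()              # cost: average loss over the whole set
print(loss_per_sample, cost)
```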
• GRADIENT DESCENT
• Gradient Descent is an optimization algorithm which is used for optimizing
the cost function or error in the model.
KAIMING AND XAVIER WEIGHT
INITIALIZATIONS
• The aim of weight initialization is to prevent layer
activation outputs from exploding or vanishing
during the course of a forward pass through a deep
neural network.
• If either occurs, loss gradients will either be too large or
too small to flow backwards beneficially, and the
network will take longer to converge, if it is even able to
do so at all.
Different Weight Initialization Techniques
• If we initialize all the weights with 0, the derivative of the loss function with
respect to every weight in W[l] is the same, so all weights take the same value
in subsequent iterations.
• This makes the hidden layers symmetric, and the process continues for all
n iterations. Thus, weights initialized with zero make the network
no better than a linear model.
Random Initialization (weights initialized randomly)
This technique addresses the problem of zero initialization: it prevents
neurons from learning the same features of their inputs. Our goal is to make
each neuron learn a different function of its input, and this technique gives
much better accuracy than zero initialization.
In general, it is used to break the symmetry. It is better to assign
random, non-zero values to the weights.
Remember, neural networks are very sensitive and prone to overfitting,
as they quickly memorize the training data.
Best Practices for Weight Initialization
Fan_in :
This represents the number of input connections to a neuron, i.e. the number
of units (neurons) in the previous layer.
Fan_out :
This represents the number of output connections from a neuron or the
number of output units (neurons) in a layer.
For a given layer, the fan_out is the number of neurons in that layer itself.
1)Xavier Init[Glorot Initialization] :
This weight initialization has two variations. It works well for sigmoid and
tanh activation functions.
i)Xavier Normal :
Wij ~ N(mean,std) , mean=0 , std=sqrt(2/(fan_in+fan_out))
ii)Xavier Uniform :
Wij ~ D[-sqrt(6/(fan_in+fan_out)), sqrt(6/(fan_in+fan_out))]
2)He Init[Kaiming Initialization] :
This weight initialization also has two variations. It works pretty well for the ReLU
and LeakyReLU activation functions prevalent in deep learning models like
CNNs and transformers.
i)He Normal :
Normal Distribution with Mean=0
Wij ~ N(mean,std) , mean=0 , std=sqrt(2/fan_in)
Where N is a Normal Distribution
ii)He Uniform :
Wij ~ D[-sqrt(6/fan_in),sqrt(6/fan_in)]
Where D is a Uniform Distribution
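A minimal NumPy sketch of the two He variants above; the layer sizes are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(fan_in, fan_out):
    std = np.sqrt(2.0 / fan_in)              # He Normal: std = sqrt(2 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / fan_in)            # He Uniform: U[-sqrt(6/fan_in), +sqrt(6/fan_in)]
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = he_normal(256, 128)
print(W.std(), np.sqrt(2.0 / 256))           # sample std is close to the target std
```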
• Benefits of using these heuristics:
• All these heuristics serve as good starting points for
weight initialization and they reduce the chances of
exploding or vanishing gradients.
• With these heuristics, activations and gradients neither vanish nor explode
too quickly, as the weights are neither much bigger than 1 nor much
smaller than 1.
• They help to avoid slow convergence.
Setup and Initialization Issues
• First, the hyperparameters of the neural network (such as the
learning rates and regularization parameters) need to be selected.
• Tuning Hyperparameters
• Feature Processing
• Initialization
• Feature Processing:
Whitening : It is done to transform a feature set to a new axis (through PCA,
principal component analysis).
Principal component analysis can be viewed as the application of singular
value decomposition after mean-centering a data matrix.
Let D be an n × d data matrix that has already been mean-centered.
Let C be the d × d covariance matrix of D, in which the (i, j)th entry is the
covariance between the dimensions i and j. Because the matrix D is mean-
centered, we have the following:

C = Dᵀ D / n
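A minimal sketch of PCA whitening using this covariance relation; the data shape and the small epsilon are illustrative assumptions.
```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 5))
D = X - X.mean(axis=0)                         # mean-centered n x d data matrix
C = D.T @ D / D.shape[0]                       # d x d covariance matrix
eigvals, P = np.linalg.eigh(C)                 # columns of P = principal axes
D_white = (D @ P) / np.sqrt(eigvals + 1e-8)    # rotate to the new axes, unit variance
print(np.cov(D_white, rowvar=False).round(2))  # approximately the identity matrix
```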
Initialization
• Initializations are surprisingly important.
• Poor initializations can lead to premature end of training process and bad
convergence behavior.
• Instability across different layers (vanishing and exploding gradients).
• Initial parameters need to break symmetry between different units.
• If two hidden units connected to the same inputs have identical initial weights, they
will have identical influence on the cost, which will lead to identical gradients. The
neurons will thus evolve symmetrically, preventing different neurons from learning
different things. For this reason we don't initialize weights with zero or even
constant values.
• Initializing weights to random values breaks symmetry.
More sophisticated initializations, such as pretraining, can also be used.
Even some simple rules for initialization can help with conditioning:
The mean of the activations is zero.
The variance of the activations stays the same across every layer.
The average magnitude of the random variables is important for stability.
• An important consideration is that symmetry breaking matters: if all weights are initialized
to the same value (such as 0), all updates will move in lock-step within a layer. As a result,
identical features will be created by the neurons in a layer, and no learning will happen.
Hence it is important to have a source of asymmetry among the neurons to begin with,
i.e., the symmetry in the weights should be broken.
• If the weights are too small, the activation values stay in a small range near zero, and the
non-linearity is lost. Also, the variance of the input to succeeding layers starts diminishing as
we pass through each layer of the network. The input eventually drops to a very small value,
and the 'adjustability' of the neurons during backpropagation becomes very poor (the
vanishing-gradient issue).
• If the weights are too large and push the activation values out of the small range into the
saturated region, the variance of the input to succeeding layers tends to increase rapidly
with each passing layer (the exploding-gradient issue).
• To avoid these vanishing- and exploding-gradient issues, we need a good initialization
for the weights.
OPTIMIZATION TECHNIQUES
• Optimization algorithms are responsible for reducing losses and
providing the most accurate results possible.
• The weights are initialized using some initialization strategy and are
updated with each epoch according to the optimizer's update equation.
• The best results are achieved using optimization strategies or
algorithms called optimizers.
• When we realize that our model is performing poorly at the
current instance, we need to minimize the loss and maximize the
accuracy. That process is known as optimization.
• Optimizers are methods or algorithms used to change the attributes
of a neural network, such as the weights and the learning rate, to reduce the
loss.
• After calculating the loss, we need to optimize the weights and biases in
the same iteration.
• Some of the techniques are
1. Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini-Batch Stochastic Gradient Descent (MB — SGD)
4. SGD with Momentum
5. Nesterov Accelerated Gradient (NAG)
6. Adaptive Gradient (AdaGrad)
7.AdaDelta
8. RMSProp
9. Adam
10. Nadam
Gradient Descent Optimization Algorithm
Gradient descent is an optimization algorithm that's used to train machine
learning models and neural networks to minimize errors between predicted
and actual results.
It's an iterative algorithm that works by finding the direction in which a
function decreases the most, and following that direction to minimize the
function.
The gradient descent algorithm is based on a convex function.
Gradient Descent
• Gradient Descent is one of the popular techniques to perform
optimization.
• It's based on a convex function and tweaks its parameters iteratively
to minimize a given function to its local minimum.
How the algorithm works
The starting point is just an arbitrary point for us to evaluate the performance.
From that starting point, we will find the derivative (or slope), and from there,
we can use a tangent line to observe the steepness of the slope.
The slope will inform the updates to the parameters—i.e. the weights and bias.
The slope at the starting point will be steeper, but as new parameters are
generated, the steepness should gradually reduce until it reaches the lowest
point on the curve, known as the point of convergence.
• Gradient Descent is an optimization algorithm for finding a local
minimum of a differentiable function.
• We start by defining the initial parameter values, and from there
gradient descent uses calculus to iteratively adjust the values so that they
minimize the given cost function:

θ ← θ − η · ∇θ J(θ)

• The above equation computes the gradient of the cost function J(θ)
w.r.t. the parameters/weights θ for the entire training dataset and steps
against it with learning rate η.
• "A gradient measures how much the output of a function changes if
you change the inputs a little bit."
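A hedged, one-dimensional illustration of this update rule for the convex cost J(θ) = (θ − 3)², whose minimum is at θ = 3; the starting point and learning rate are arbitrary choices.
```python
def grad_J(theta):
    return 2 * (theta - 3)         # derivative of (theta - 3)^2

theta, lr = 0.0, 0.1               # arbitrary starting point and learning rate
for step in range(50):
    theta -= lr * grad_J(theta)    # step in the direction of steepest descent
print(theta)                       # ~3.0, the point of convergence
```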
Importance of Learning Rate
• The learning rate determines how big the steps are that gradient descent
takes in the direction of the local minimum.
• It tells us how fast or slow we will move towards the optimal
weights.
• When we initialize the learning rate, we set an appropriate value which is
neither too low nor too high.
• A constant learning rate is not desirable.
• A lower learning rate will cause the algorithm to take too long to come
even close to an optimal solution.
• On the other hand, a large initial learning rate will allow the algorithm to
come reasonably close to a good solution at first; however, the algorithm
will then oscillate around the point for a very long time.
Learning Rate Decay
Allowing the learning rate to decay over time can naturally achieve the desired learning-rate
adjustment and avoid these challenges.
Exponential decay: αt = α0 · e^(−kt)
Inverse decay: αt = α0 / (1 + kt)
The learning rate αt can be expressed in terms of the initial rate α0 and the epoch t as above.
The parameter k controls the rate of the decay.
Step decay: the learning rate is reduced by a particular factor every few epochs.
For example, the learning rate might be multiplied by 0.5 every 5 epochs.
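The three schedules can be sketched as simple Python functions of the epoch t; alpha_0 and k below are illustrative values.
```python
import math

alpha_0, k = 0.1, 0.05

def exponential_decay(t):
    return alpha_0 * math.exp(-k * t)          # alpha_t = alpha_0 * e^(-k t)

def inverse_decay(t):
    return alpha_0 / (1 + k * t)               # alpha_t = alpha_0 / (1 + k t)

def step_decay(t, factor=0.5, every=5):
    return alpha_0 * factor ** (t // every)    # multiply by 0.5 every 5 epochs

for t in (0, 5, 10):
    print(t, exponential_decay(t), inverse_decay(t), step_decay(t))
```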
• Advantages of Gradient Descent
• Easy Computation
• Easy to implement
• Easy to understand
• Disadvantages of Gradient Descent
• May trap at local minima
• Weights are changed only after calculating the gradient on the whole dataset, so if
the dataset is too large it may take years to converge to the minima
• Requires large memory to calculate gradient for whole dataset
• 3 Types of Gradient Descent: batch gradient descent, stochastic gradient
descent (SGD), and mini-batch gradient descent.
• There are a few problems that can occur when using gradient descent:
• Local Minima:
• Gradient descent can get stuck in local minima, points that are not the
global minimum of the cost function but are still lower than the
surrounding points.
• This can occur when the cost function has multiple valleys, and the
algorithm gets stuck in one instead of reaching the global minimum.
• Saddle Points and Plateaus:
• A plateau is a region in the cost function where the gradients are very
small or close to zero; a saddle point similarly has a near-zero gradient
without being a minimum. These can cause gradient descent to take a
long time to converge, or not converge at all.
• Oscillations:
• Oscillations occur when the learning rate is too high, causing the
algorithm to overshoot the minimum and oscillate back and forth.
• Slow convergence:
• Gradient descent can converge very slowly when the cost function is
complex or has many local minima. This means the algorithm may
take a long time to find the global minimum.
• Stochasticity:
• In stochastic gradient descent, the cost function is evaluated at
random samples from the data set. This introduces randomness into
the algorithm, making converging to a global minimum more difficult.
• Vanishing or Exploding Gradients:
• Deep neural networks with many layers can suffer from vanishing or
exploding gradients. This occurs when the gradients become very
small or large, respectively, as they are backpropagated through the
layers. This can make it difficult for the algorithm to update the
weights and biases.
• Momentum helps to accelerate descent in consistent directions and to move through flat regions.
The normal updates for gradient descent with respect to a loss function L (defined over a mini-batch of instances)
are as follows:

V ← −α · ∂L/∂W ;  W ← W + V

Here, α is the learning rate.
In momentum-based descent, the vector V is modified with exponential smoothing, where β ∈ (0, 1) is a
smoothing parameter:

V ← βV − α · ∂L/∂W ;  W ← W + V

Momentum (energy) from the previous gradients is added to the update so that it gains energy to move forward in
flat regions.
Setting β = 0 specializes to straightforward mini-batch gradient descent. The parameter β is also referred to as the
momentum parameter or the friction parameter.
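A minimal sketch of this momentum update; the learning rate, β, and gradient values are illustrative.
```python
import numpy as np

def momentum_step(w, V, grad, lr=0.01, beta=0.9):
    V = beta * V - lr * grad   # exponential smoothing of past updates
    w = w + V                  # move with the accumulated velocity
    return w, V

w, V = np.zeros(2), np.zeros(2)
w, V = momentum_step(w, V, grad=np.array([1.0, -0.5]))
print(w, V)
```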
• Momentum at time 't' is computed using all previous updates, giving
more weight to recent updates than to earlier ones.
This speeds up the convergence.
• Essentially, when using momentum, we push a ball down a hill. The ball
accumulates momentum as it rolls downhill, becoming faster and faster
on the way.
• The same thing happens to our parameter updates: The momentum
term increases for dimensions whose gradients point in the same
directions and reduces updates for dimensions whose gradients change
directions.
• As a result, we gain faster convergence and reduced oscillation.
It is evident that momentum-based updates can reach the optimal solution in fewer
updates.
The basic idea is to give greater preference to consistent directions over multiple
steps, which have greater importance in the descent. This allows the use of larger
steps in the correct direction without causing overflows or “explosions” in the
sideways direction.
The use of momentum will often cause the solution to slightly overshoot in the
direction where velocity is picked up; this overshooting is desirable to the
extent that it helps avoid local optima.
While increased values of β help in avoiding local optima, they may also increase
oscillation at the end.
• Advantages
• Converges faster than SGD
• All advantages of SGD
• Reduces the oscillations and high variance of the parameters
• Disadvantage
• An extra variable (the velocity) is introduced that we need to compute and
store for each update
Animation: https://miro.medium.com/v2/resize:fit:640/format:webp/1*zVi4ayX9u0MQQwa90CnxVg.gif
NAG - Nesterov Accelerated Gradient/Nesterov
Momentum
• Momentum may be a good method, but if the momentum is too high the
algorithm may miss the local minima and continue past them.
• The approach followed here was that the parameters update would be
made with the history element first and then only the derivative is
calculated which can move it in the forward or backward direction.
• This is called the look-ahead approach, and it makes more sense because
if the curve reaches near to the minima, then the derivative can make it
move slowly so that there are fewer oscillations and therefore saving
more time.
The idea is that this corrected gradient uses a better understanding of how the
gradients will change because of the momentum portion of the update, and
incorporates this information into the gradient portion of the update.
Therefore, one is using a certain amount of lookahead in computing the
updates.
The Nesterov momentum is a modification of the traditional momentum
method in which the gradients are computed at a point that would be reached
after executing a β- discounted version of the previous step again.
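A minimal sketch of the look-ahead step, assuming a hypothetical grad_fn that returns ∂L/∂w at a given point.
```python
# Nesterov look-ahead update: the gradient is evaluated at the shifted
# point w + beta*V before the velocity update. grad_fn is a hypothetical
# function returning dL/dw at a given point.
def nesterov_step(w, V, grad_fn, lr=0.01, beta=0.9):
    g = grad_fn(w + beta * V)   # gradient at the look-ahead point
    V = beta * V - lr * g
    w = w + V
    return w, V
```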
Parameter-Specific Learning Rates
Use learning rates specific to each parameter. Some parameters' gradients follow a
steep path, while the gradients of other parameters pass through flat regions.
The basic idea in the momentum methods of the previous section is to leverage the
consistency in the gradient direction of certain parameters in order to speed up the
updates. This goal can also be achieved more explicitly by having different learning
rates for different parameters.
Parameters with large partial derivatives often oscillate and zigzag,
whereas parameters with small partial derivatives tend to be more consistent,
moving in the same direction.
Adagrad, RMSProp, Adam
AdaGrad - Adaptive Gradient Descent
• AdaGrad is a little bit different from the other gradient descent algorithms.
In all the previously discussed algorithms the learning rate was constant.
So here the key idea is to have an adaptive learning rate for each of the
weights.
• It uses a different learning rate for each iteration. The more a parameter
has been changed, the smaller its learning rate becomes.
Here, we keep track of the aggregated squared magnitude of the partial derivative with respect to each
parameter.
The squared partial derivatives are aggregated and used as a factor in the learning-rate update:

Ai ← Ai + (∂L/∂wi)² ;  wi ← wi − (α / √(Ai + ε)) · ∂L/∂wi

A small constant ε is added to avoid division by zero (ill-conditioning).
When the previous aggregate is large, the learning-rate factor is reduced by this aggregate, and hence the weight
values are updated only slowly.
When the previous aggregate is small, the learning rate is larger, which may cause a large weight update.
The aggregate over all past components tends to slow updates down over time, which is the main problem with
the AdaGrad approach.
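A minimal sketch of this per-parameter update; the learning rate and ε are illustrative defaults.
```python
import numpy as np

def adagrad_step(w, A, grad, lr=0.01, eps=1e-8):
    A = A + grad ** 2                        # aggregate squared magnitudes
    w = w - lr * grad / (np.sqrt(A) + eps)   # per-parameter scaled step
    return w, A
```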
Animation: https://miro.medium.com/v2/resize:fit:828/format:webp/1*WRtvrr9Z0QcokiKlgU7xEw.gif
Advantages
The learning rate changes adaptively; no human intervention is required.
One of the best algorithms to train on sparse data.
Disadvantages
The learning rate is always decreasing, which leads to slow convergence.
Due to the small learning rate, the model eventually becomes unable to train
properly and cannot acquire the required knowledge, so the accuracy of the
model is compromised.
RMSProp - Root Mean Square Propagation
RMSProp is a variant of AdaGrad; it is actually an improvement on the
AdaGrad optimizer.
Here the learning-rate scaling uses an exponential average of the squared
gradients instead of the cumulative sum of squared gradients.
It uses a decay parameter to make the extreme past values decay over
time.
Instead of simply adding the squared gradients to estimate Ai, it uses
exponential averaging.
The basic idea is to use a decay factor ρ ∈ (0, 1) to decay the squared partial
derivatives occurring t updates ago:

Ai ← ρAi + (1 − ρ)(∂L/∂wi)² ;  wi ← wi − (α / √(Ai + ε)) · ∂L/∂wi
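A minimal sketch of this exponentially averaged variant; ρ, the learning rate, and ε are illustrative defaults.
```python
import numpy as np

def rmsprop_step(w, A, grad, lr=0.01, rho=0.9, eps=1e-8):
    A = rho * A + (1 - rho) * grad ** 2      # decayed average of squared grads
    w = w - lr * grad / (np.sqrt(A) + eps)
    return w, A
```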
RMSProp with Nesterov Momentum
Note that the partial derivative of the loss function is computed at a shifted point,
as is common in the Nesterov method.
The weight W is shifted with βV while computing the partial derivative with respect
to the loss function.
AdaDelta
The AdaDelta algorithm uses a similar update to RMSProp, except that it eliminates
the need for a global learning parameter by computing it as a function of the
incremental updates in previous iterations.
Consider the update of RMSProp, which is repeated below:

wi ← wi − (α / √(Ai + ε)) · ∂L/∂wi

In each update, the value of Δwi is the increment in the value of wi.
As with the exponentially smoothed gradients Ai, we keep an exponentially smoothed value δi of the
values of Δwi in previous iterations, with the same decay parameter ρ:

δi ← ρδi + (1 − ρ)(Δwi)²

AdaDelta then replaces the global learning rate α with √(δi + ε), giving the update
Δwi = −(√(δi + ε) / √(Ai + ε)) · ∂L/∂wi.
For a given iteration, the value of δi can be computed using only the iterations before it, because the
value of Δwi is not yet available.
On the other hand, Ai can be computed using the partial derivative in the current iteration as well.
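A minimal sketch of the AdaDelta step under these definitions; ρ and ε are illustrative defaults, and note that there is no global learning rate.
```python
import numpy as np

def adadelta_step(w, A, delta, grad, rho=0.95, eps=1e-8):
    A = rho * A + (1 - rho) * grad ** 2
    dw = -np.sqrt(delta + eps) / np.sqrt(A + eps) * grad  # increment in w
    delta = rho * delta + (1 - rho) * dw ** 2             # smoothed history of increments
    return w + dw, A, delta
```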
ADAM (Adaptive Moment Estimation)
• Adam uses the squared gradients to scale the learning rate like RMSProp, and it
takes advantage of momentum by using a moving average of the gradient.
• Its name is derived from adaptive moment estimation: Adam uses estimates of the
first and second moments of the gradient to adapt the learning rate for each weight
of the neural network.
We have two decay parameters, β1 and β2; they are usually kept
around 0.9 and 0.99, but we can change them according to our use case. The default value
for the learning rate η is 0.001.
Along with the decay parameters, a first moment (mt) and a second moment (vt) are also
used.
The first moment is the mean, and the second moment is the uncentered variance
(meaning we don't subtract the mean during the variance calculation).
The gradient value is used in the weight update indirectly, through the first and second
moments.
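A minimal sketch of the Adam step with bias correction; β1, β2, and the learning rate follow the defaults mentioned in the text, and ε is the usual small constant.
```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: uncentered variance
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```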
• What Are the Advantages of Adam Optimization?
• Adam optimization offers several advantages over other optimization algorithms:
1. Adaptive Learning Rates: Unlike fixed learning rate methods like SGD, Adam
optimization provides adaptive learning rates for each parameter based on the
history of gradients. This allows the optimizer to converge faster and more
accurately, especially in high-dimensional parameter spaces.
2. Momentum: Adam optimization uses momentum to smooth out fluctuations in the
optimization process, which can help the optimizer avoid local minima and saddle
points.
3. Bias Correction: Adam optimization applies bias correction to the first and second
moment estimates to ensure that they are unbiased estimates of the true values.
4. Robustness: Adam optimization is relatively robust to hyperparameter choices and
works well across a wide range of deep learning architectures
• Best Practices for Using Adam Optimization
• Use Default Hyperparameters: In most cases, the default hyperparameters for
Adam optimization (β1 = 0.9, β2 = 0.999, ε = 10⁻⁸) work well and do not need to be
tuned.
• Monitor Learning Rate: It can be helpful to monitor the learning rate during
training to ensure that it is not too high or too low. A good rule of thumb is to set
the initial learning rate to a small value and then gradually increase it until
convergence.
• Regularization: Adam optimization can benefit from regularization techniques
like weight decay or dropout to prevent overfitting.
• Batch Size: The batch size can have an impact on the performance of Adam
optimization. In general, larger batch sizes tend to work better with Adam
optimization compared to other optimization algorithms.
Animation of 5 gradient descent methods on a surface: gradient descent (cyan), momentum (magenta),
AdaGrad (white), RMSProp (green), Adam (blue). The left well is the global minimum; the right well is a local
minimum.
https://miro.medium.com/v2/resize:fit:828/format:webp/1*47skUygd3tWf3yB9A10QHg.gif
REGULARIZATION TECHNIQUES
CONCEPT OF REGULARIZATION
• Regularization is a set of techniques that can prevent overfitting in
neural networks and thus improve the accuracy of a Deep
Learning model when facing completely new data from the problem
domain.
• Overfitting refers to the phenomenon where a neural network models
the training data very well but fails when it sees new data from the
same problem domain.
• Overfitting is caused by noise in the training data that the neural
network picks up during training and learns as an underlying
concept of the data.
• Weight regularization is a technique which aims to stabilize an overfitted
network by penalizing large weight values in the network.
• An overfitted network usually has large weight values, so a small change in
the input can lead to large changes in the output.
• For instance, when the network is given new or test data, it produces incorrect
predictions.
• Weight regularization penalizes the network's large weights, forcing the
optimization algorithm to reduce the larger weight values to smaller ones;
this leads to stability of the network and good performance.
• In weight regularization, the network configuration remains unchanged;
only the values of the weights are modified.
• Weight Regularization reduces overfitting by penalizing or adding a
constraint to the loss function.
• In Deep Learning there are two well-known regularization
techniques:
• L1 and L2 regularization
• Both add a penalty to the cost based on the model complexity, so
instead of calculating the cost by simply using a loss function, there
will be an additional element (called “regularization term”) that will
be added in order to penalize complex models.
Regularization to prevent over-fit
L1 regularization (LASSO regression: Least Absolute Shrinkage and Selection Operator) produces sparse
weight matrices.
A sparse matrix is one in which most elements are zero; in this context, a sparse weight matrix has many
close-to-zero values and a few larger values.
If we find a model with neurons whose weights are close to zero, it means we don't need those neurons:
the model deactivates them with zeros, and we might not need a specific feature/input, leading to a
simpler model. For instance, if we have 50 coefficients but only 10 are non-zero, the other 40 are irrelevant to
making our predictions. This is interesting not only from the efficiency point of view but also from the
economic point of view: gathering data and extracting its features can be a very expensive task (in terms
of time and money). Reducing this will benefit us.
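As a hedged sketch, an L1 penalty can be added to any loss in PyTorch as follows; lambda_l1 is an illustrative strength, not a value from the text.
```python
import torch

def loss_with_l1(base_loss, model, lambda_l1=1e-4):
    l1_term = sum(p.abs().sum() for p in model.parameters())
    return base_loss + lambda_l1 * l1_term   # pushes weights towards zero / sparsity
```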
Regularization to prevent overfit: Data Augmentation
• I. Spatial Transformation
• Flipping
• This is a very simple technique in which an image is flipped
horizontally to produce a mirror image or flipped vertically to produce
an image that is upside down.
• Rotation
• With this technique, we rotate the entire image by a certain degree.
• Translation
• The entire image is shifted left/right and/or up/down by a certain
amount. This will result in objects of interest appearing in different
locations of the image frame after translation is applied.
• Cropping
• Given an image, we select part of the image (normally a square or
rectangular section), take a crop of this selection, and then resize the
crop to the original size of the image.
• II.Colour Transformation
• With colour transformation techniques, the spatial aspect of the
image is normally preserved while the values of the pixels are edited.
• Brightness
• The pixel values of the image are either increased to result in a
lighter, brighter image or reduced to result in a darker, dimmer
image.
• Contrast
• Contrast is the difference between the bright and dark parts of an
image. Increasing the contrast generally involves making the bright
parts of the image brighter and the dark parts darker.
• III.Advanced Data Augmentation Techniques
• GridMask
• Unlike the above techniques of spatial and colour transformations,
GridMask falls under a third set of transformation techniques which
we can refer to as 'information deletion’.
• With these techniques, parts of the image are removed by setting
the pixels to 0 or placing some patch over that part of the image,
thereby deleting information.
• IV.Niche Data Augmentation Techniques
• Temporal Reordering
• Most of the techniques we've talked about above work well on single
images.
• However given that we work with multiple images from a camera, we
can try and incorporate other augmentation techniques specific to
video data.
• With temporal reordering, given a pair of images, we can reverse the
order of the images and present this to the model as a different
training example.
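Minimal NumPy sketches of the flipping, rotation, translation, cropping, and brightness transformations above, treating an image as an H × W × C array; the image and parameter values are illustrative.
```python
import numpy as np

img = np.random.default_rng(0).integers(0, 256, size=(32, 32, 3)).astype(np.uint8)

flipped  = img[:, ::-1]                    # horizontal flip (mirror image)
rotated  = np.rot90(img)                   # rotate by 90 degrees
shifted  = np.roll(img, shift=4, axis=1)   # translate 4 pixels right (with wraparound)
cropped  = img[4:28, 4:28]                 # crop a square section (then resize back)
brighter = np.clip(img.astype(int) + 40, 0, 255).astype(np.uint8)  # brightness increase
```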
One must be careful not to apply data augmentation blindly without regard
to the data set and application at hand.
For example, applying rotations and reflections on the MNIST data set of
handwritten digits is a bad idea because the digits in the data set are all
presented in a similar orientation.
Furthermore, the mirror image of an asymmetric digit is not a valid digit,
and a rotation of a ‘6’ is a ‘9.’
The key point in deciding what types of data augmentation are reasonable
is to account for the natural distribution of images in the full data set.
BENEFITS OF DATA
AUGMENTATION
• Improving model prediction accuracy
• Adding more training data into the models
• Preventing data scarcity for better models
• Reducing data overfitting (i.e. the statistical error in which a
function corresponds too closely to a limited set of data points) and
creating variability in the data
• Increasing generalization ability of the models
• Reducing costs of collecting and labeling data
• Enabling rare event prediction
ENSEMBLE METHODS
• Ensemble learning is a machine learning paradigm where multiple
models (often called “weak learners”) are trained to solve the same
problem and combined to get better results.
• The main hypothesis is that when weak models are correctly combined
we can obtain more accurate and/or robust models.
• Weak Learners: A ‘weak learner’ is any ML algorithm (for
regression/classification) that provides an accuracy slightly better than
random guessing.
• In ensemble learning theory, we call weak learners (or base models) models
that can be used as building blocks for designing more complex models by
combining several of them.
• Most of the time, these basic models do not perform so well by themselves,
either because they have a high bias or because they have too much
variance to be robust.
• Then, the idea of ensemble methods is to try reducing bias and/or variance
of such weak learners by combining several of them together to create
a strong learner (or ensemble model) that achieves better performances.
ENSEMBLE METHODS
• BAGGING aims to decrease variance
• BOOSTING aims to decrease bias
• STACKING aims to improve prediction accuracy.
BAGGING
• Bagging stands for Bootstrap Aggregation.
• Bootstrapping is a technique of sampling different sets of data from a
given training set by using replacement.
• After bootstrapping the training dataset, we train the model on all the
different sets and aggregate the result. This technique is known as
Bootstrap Aggregation or Bagging.
• The idea behind bagging is to combine the results of multiple models
(for instance, all decision trees) to get a generalized result. This is
where bootstrapping comes into the picture.
Bagging works as follows:-
Multiple subsets are created from the original dataset, selecting observations
with replacement.
A base model (weak model) is created on each of these subsets.
The models run in parallel and are independent of each other.
The final predictions are determined by combining the predictions from all
the models.
• For aggregating the outputs of base learners, bagging uses
majority voting (most frequent prediction among all predictions) for
classification and averaging (mean of all the predictions) for
regression.
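A minimal sketch of bagging's aggregation step: majority voting for classification and averaging for regression; the base-model predictions are made up.
```python
import numpy as np

clf_preds = np.array([[0, 1, 1],            # 3 base models x 3 samples (class labels)
                      [0, 1, 0],
                      [1, 1, 1]])
majority = np.round(clf_preds.mean(axis=0)).astype(int)   # most frequent prediction

reg_preds = np.array([[2.1, 3.0],
                      [1.9, 2.8],
                      [2.0, 3.2]])
averaged = reg_preds.mean(axis=0)                         # mean of all predictions
print(majority, averaged)
```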
Advantages of a Bagging Model:
1. Bagging significantly decreases the variance without increasing bias.
2. Bagging methods work so well because of diversity in the training
data since the sampling is done by bootstrapping.
3. Also, if the training set is very large, bagging can save computational time
by training each model on a relatively smaller data set while still
increasing the accuracy of the model.
4. Works well with small datasets as well.
BOOSTING
• The term ‘Boosting’ refers to a family of algorithms which converts
weak learner to strong learners.
• Boosting is an ensemble method for improving the model
predictions of any given learning algorithm.
• The idea of boosting is to train weak learners sequentially, each
trying to correct its predecessor.
• The weak learners are sequentially corrected by their predecessors
and, in the process, they are converted into strong learners.
• Boosting is a sequential process, where each subsequent model
attempts to correct the errors of the previous model. The
succeeding models are dependent on the previous model.
Pros
Computational scalability
Robust to outliers