
Deep Learning

Module 2 & 3

Course Owner: Seetha Parameswaran
Lead Instructor: Bharatesh Chakravarthi
Section Faculty: Raja Vadhana Prabhakar

The designers/authors of this course deck gratefully acknowledge the original authors who made their course materials freely available online.
Course Content

● Fundamentals of Neural Networks
● Multilayer Perceptron
● Deep Feedforward Neural Network
● Improving DNN performance with Optimization and Regularization
● Convolutional Neural Networks
● Sequence Models
● Attention Mechanism
● Representation Learning
● Generative Adversarial Networks
Module 3
Optimization of Deep models

Agenda

• Overview of optimization algorithms
  • Gradient Descent
  • Stochastic Gradient Descent
  • Mini-batch Gradient Descent
• Momentum-based algorithms
  • Gradient Descent with momentum
• Algorithms with Adaptive Learning Rates
  • AdaGrad
  • RMSProp
  • Adam


Optimization

• Nature of the cost function
• Gradient-based iterative solution
• Initialization point in the search
• Local minima
• Multivariate
• Saddle points
• Hyperparameter tuning
• Learning rate
• Indicator of trajectory
GD vs SGD vs Mini-batch GD

• Gradient descent: takes consistent steps toward the minimum.
• Stochastic gradient descent: noisy, oscillates around the minimum but may not converge to it.
• Mini-batch gradient descent: noisy, oscillates around the minimum but may not converge to it.
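In code, the only difference between the three variants is how many samples enter each gradient computation. A minimal NumPy sketch (the toy least-squares objective, function names, and hyperparameters are illustrative assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: minimize (1/2n) * ||X w - y||^2
X = rng.normal(size=(256, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=256)

def gradient(w, Xb, yb):
    """Gradient of the mean squared error on the (mini-)batch (Xb, yb)."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

def train(batch_size, lr=0.1, steps=200):
    w = np.zeros(3)
    for _ in range(steps):
        if batch_size >= len(X):              # full-batch gradient descent
            Xb, yb = X, y
        else:                                  # SGD (batch_size=1) or mini-batch
            idx = rng.choice(len(X), size=batch_size, replace=False)
            Xb, yb = X[idx], y[idx]
        w -= lr * gradient(w, Xb, yb)
    return w

print("GD        :", train(batch_size=len(X)))
print("SGD       :", train(batch_size=1))
print("Mini-batch:", train(batch_size=32))
```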
SGD

[Figure: stochastic gradient descent trajectory; noisy, oscillates around the minimum but may not converge to it.]
Dynamic learning rate

• Replace the fixed learning rate η with a time-dependent learning rate η(t), e.g. exponential decay or polynomial decay.
• This adds to the complexity of controlling the convergence of an optimization algorithm.

[Figure: SGD trajectories under exponential and polynomial learning-rate decay.]
SGD

[Figure: SGD trajectory with a decaying learning rate; noisy, oscillates around the minimum but may not converge to it.]

Learning rate values η(t) for initial LR = 0.9:
• Exponential decay (decay rate = 0.2): t=1: 0.7369, t=10: 0.1218, t=100: 0.0165, t=500: 0.0022
• Polynomial decay (A = 0.5, B = 0.8): t=1: 0.6708, t=10: 0.3, t=100: 0.2183, t=500: 0.18
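The two decay schedules can be written as small Python functions. A minimal sketch, assuming the common forms η(t) = η₀·exp(−λt) for exponential decay and η(t) = η₀·(1 + B·t)^(−A) for polynomial decay with the parameters from the table above (the exact forms and time units used on the slide are not shown, so treat this as illustrative):

```python
import math

def exponential_decay(t, lr0=0.9, decay_rate=0.2):
    # eta(t) = eta0 * exp(-decay_rate * t)
    return lr0 * math.exp(-decay_rate * t)

def polynomial_decay(t, lr0=0.9, a=0.5, b=0.8):
    # eta(t) = eta0 * (1 + b*t) ** (-a)
    return lr0 * (1 + b * t) ** (-a)

for t in (1, 5, 10):
    print(f"t={t}: exp={exponential_decay(t):.4f}, poly={polynomial_decay(t):.4f}")
```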
Drawback of gradient-based methods

• The most critical challenge in optimizing deep networks is finding the correct trajectory to move in.
• The gradient usually is not a very good indicator of a good trajectory.
• When the contours are perfectly circular, the gradient always points in the direction of the local minimum.
• However, if the contours are extremely elliptical (as is usually the case for the error surfaces of deep networks), the gradient can be as much as 90 degrees away from the correct direction!
Gradient Descent with Momentum

• Gradient descent with momentum uses the momentum of the gradient for parameter optimization.
• Movement = Negative of Gradient + Momentum

[Figure: cost ℒ(θ) versus θ, showing at each step the negative gradient, the momentum (previous movement), and the real movement.]

Slide credit: Hung-yi Lee – Deep Learning Tutorial
SGD

[Figure: stochastic gradient descent trajectory; noisy, oscillates around the minimum but may not converge to it.]

● Replace the gradient computation by a "leaky average" for better variance reduction, with β ∈ (0, 1): v_t = β v_{t−1} + g_{t,t−1}
● v is called momentum. It accumulates past gradients.
● A large β amounts to a long-range average, while a small β amounts to only a slight correction relative to a gradient method.
Solved problem: SGD and SGD + momentum

Consider the following loss function, with initial value w_0 = −2.8, learning rate η = 0.05 and momentum β = 0.7. Use SGD and SGD + momentum to find the updated value w_1 after the first iteration.

L(w) = 0.3w⁴ − 0.1w³ − 2w² − 0.8w

Answer:

SGD
• Iteration 1: g_1 = −18.2943, w_1 = −1.88527
• Iteration 2: g_2 = −2.36614, w_2 = −1.76697

SGD + momentum
• Iteration 1: g_1 = −18.294, v = 0 (initial), w_1 = −1.88527
• Iteration 2: g_2 = −2.3661, v_2 = −15.172, w_2 = −1.126
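These numbers can be checked by running the two update rules directly. A minimal sketch, assuming the convention v_t = β·v_{t−1} + g_t and w_t = w_{t−1} − η·v_t for the momentum variant:

```python
# Verify the worked example: L(w) = 0.3w^4 - 0.1w^3 - 2w^2 - 0.8w
def grad(w):
    return 1.2 * w**3 - 0.3 * w**2 - 4 * w - 0.8

eta, beta = 0.05, 0.7

# Plain SGD
w = -2.8
for i in range(1, 3):
    g = grad(w)
    w = w - eta * g
    print(f"SGD           iter {i}: g = {g:.4f}, w = {w:.5f}")

# SGD with momentum: v_t = beta*v_{t-1} + g_t,  w_t = w_{t-1} - eta*v_t
w, v = -2.8, 0.0
for i in range(1, 3):
    g = grad(w)
    v = beta * v + g
    w = w - eta * v
    print(f"SGD+momentum  iter {i}: g = {g:.4f}, v = {v:.4f}, w = {w:.5f}")
```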
Comparison of Momentum in Learning

● Consider a moderately distorted ellipsoid objective (e.g. f(x) = 0.1 x1² + 2 x2²).
● f has its minimum at (0, 0). This function is very flat in the x1 direction.
● For LR = 0.4, without momentum: the gradient in the x2 direction oscillates much more than in the horizontal x1 direction.
● For LR = 0.6, without momentum: convergence in the x1 direction improves, but the overall solution quality diverges.
Comparison of Momentum in Learning

● Consider the same moderately distorted ellipsoid objective, now using gradient descent with momentum.
● f has its minimum at (0, 0). This function is very flat in the x1 direction.
● For MR (momentum) = 0.5: converges well. Fewer oscillations. Larger steps in the x1 direction.
● For MR (momentum) = 0.25: reduced convergence. More oscillations, with larger magnitude.
Adagrad - motivation

• In GD, the same learning rate is used for all parameters and all iterations.
• What happens to features that occur infrequently (sparse features)? E.g., w is sparse while b is not (since x_0 is always 1).
• Parameters associated with infrequent features only receive meaningful updates whenever these features occur.
Adagrad

• Decay the learning rate for each parameter in proportion to its update history.
• An individual learning rate per parameter (feature).
• Accumulate past squared gradients in s_t (accumulate the history): s_t = s_{t−1} + g_t²
• Parameter update: w_t = w_{t−1} − (η / √(s_t + ϵ)) g_t
• ϵ is a smoothing term that avoids division by zero.
• Initialize s_0 = 0.
• Adagrad's main benefit is that it eliminates the need to manually tune the learning rate. Most implementations use a default value of 0.01 and leave it at that.
• Applications: natural language processing and image recognition.
Adagrad

• It adapts the learning rate to the parameters:
  • smaller updates (i.e., low learning rates) for parameters associated with frequently occurring features;
  • larger updates (i.e., high learning rates) for parameters associated with infrequent features.
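A minimal per-coordinate AdaGrad step in NumPy (the toy quadratic objective and the learning rate used in the loop are illustrative choices, not from the slides):

```python
import numpy as np

def adagrad_update(w, g, s, lr=0.01, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients, then scale the update per coordinate."""
    s = s + g ** 2                    # s_t = s_{t-1} + g_t^2  (the accumulated history)
    w = w - lr * g / np.sqrt(s + eps) # per-parameter effective learning rate lr / sqrt(s_t + eps)
    return w, s

# Run a few steps on a toy quadratic f(w) = 0.1*w1^2 + 2*w2^2, gradient = [0.2*w1, 4*w2]
w = np.array([5.0, 5.0])
s = np.zeros_like(w)
for _ in range(100):
    g = np.array([0.2 * w[0], 4.0 * w[1]])
    w, s = adagrad_update(w, g, s, lr=0.4)
print(w)
```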
Drawback of AdaGrad

• AdaGrad decays the learning rate very aggressively (as the denominator keeps growing).
• As a result, after a while the frequent parameters start receiving very small updates because of the decayed learning rate.
• The algorithm may therefore experience sluggish convergence or even stall prematurely.
• To avoid this, why not decay the denominator and prevent its rapid growth?
RMSProp

• The issue:
  • Adagrad accumulates the squares of the gradient g_t into a state vector s_t = s_{t−1} + g_t².
  • As a result, s_t keeps on growing without bound due to the lack of normalization.
• Use a leaky average, in the same way we used one in the momentum method:
  s_t = γ s_{t−1} + (1 − γ) g_t², with parameter γ > 0, and w_t = w_{t−1} − (η / √(s_t + ϵ)) g_t.
• The constant ϵ > 0 is typically set to 10⁻⁶.
• Faster convergence compared to AdaGrad.
• Works well on big and redundant datasets.
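A minimal RMSProp step, mirroring the AdaGrad sketch above but with a leaky average of the squared gradients so the state s stays bounded (γ, the learning rate, and the toy objective are illustrative):

```python
import numpy as np

def rmsprop_update(w, g, s, lr=0.01, gamma=0.9, eps=1e-6):
    """One RMSProp step: leaky average of squared gradients keeps s bounded."""
    s = gamma * s + (1 - gamma) * g ** 2   # s_t = gamma*s_{t-1} + (1-gamma)*g_t^2
    w = w - lr * g / np.sqrt(s + eps)
    return w, s

# Same toy quadratic as before: f(w) = 0.1*w1^2 + 2*w2^2
w = np.array([5.0, 5.0])
s = np.zeros_like(w)
for _ in range(100):
    g = np.array([0.2 * w[0], 4.0 * w[1]])
    w, s = rmsprop_update(w, g, s, lr=0.1)
print(w)
```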
Review of techniques learned so far

1. Stochastic gradient descent
   ○ More computationally efficient than gradient descent when solving optimization problems.
   ○ Mini-batch stochastic gradient descent affords significant additional efficiency arising from vectorization, using larger sets of observations in one mini-batch.
   ○ This is the key to efficient multi-machine, multi-GPU and overall parallel processing.
2. Momentum
   ○ Added a mechanism for aggregating a history of past gradients to accelerate convergence.
3. Adagrad
   ○ Used per-coordinate scaling to allow for a computationally efficient preconditioner.
4. RMSProp
   ○ Leaky average + dynamic learning rate.
Adam (Adaptive Moment Estimation)

• Adam combines all these techniques into one efficient learning algorithm.
• It computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients.
• It utilizes the momentum concept from "SGD with momentum" and the adaptive learning rate from "RMSProp".
• Disadvantage: Adam can diverge due to poor variance control.
• v for momentum, s for RMSProp (the adaptive learning rate).
Adam Algorithm

• First moment (the mean) of the gradients: v_t = β_1 v_{t−1} + (1 − β_1) g_t
• Second moment (uncentered variance) of the gradients: s_t = β_2 s_{t−1} + (1 − β_2) g_t²
• β_1 and β_2 are nonnegative weighting parameters. Common choices for them are β_1 = 0.9 and β_2 = 0.999.
• Initialize v_0 = s_0 = 0.
• Normalize the state variables (bias correction): v̂_t = v_t / (1 − β_1^t), ŝ_t = s_t / (1 − β_2^t).
  Bias correction helps Adam slightly outperform RMSProp.
• Rescale the gradient: g′_t = η v̂_t / (√ŝ_t + ϵ)
• Compute the update: w_t = w_{t−1} − g′_t
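Putting the pieces together, a minimal Adam step in NumPy, following the update equations above with the common defaults β_1 = 0.9 and β_2 = 0.999 (the toy objective and learning rate are illustrative):

```python
import numpy as np

def adam_update(w, g, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum (v), RMSProp-style scaling (s), and bias correction."""
    v = beta1 * v + (1 - beta1) * g            # first moment (mean of gradients)
    s = beta2 * s + (1 - beta2) * g ** 2       # second moment (uncentered variance)
    v_hat = v / (1 - beta1 ** t)               # bias correction (t starts at 1)
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

# Same toy quadratic: f(w) = 0.1*w1^2 + 2*w2^2
w = np.array([5.0, 5.0])
v = np.zeros_like(w)
s = np.zeros_like(w)
for t in range(1, 501):
    g = np.array([0.2 * w[0], 4.0 * w[1]])
    w, v, s = adam_update(w, g, v, s, t, lr=0.1)
print(w)
```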
Optimization algorithm comparison

• Gradient Descent: w_t ← w_{t−1} − η g_t
• GD with Momentum (replaces the gradient with a leaky average over past gradients):
  v_t ← β v_{t−1} + g_{t,t−1}; w_t ← w_{t−1} − η v_t
• AdaGrad (adaptive learning rate: an individual learning rate per parameter):
  s_t ← s_{t−1} + g_t²; w_t ← w_{t−1} − (η / √(s_t + ϵ)) g_t
• RMSProp (exponentially decaying average of past squared gradients):
  s_t ← γ s_{t−1} + (1 − γ) g_t²; w_t ← w_{t−1} − (η / √(s_t + ϵ)) g_t
• Adam: exponentially decaying average of past squared gradients (RMSProp) + a leaky average over past gradients (momentum), with bias correction.
Hybrid Approaches to Optimization

• Adaptive methods (such as Adam) do not generalize as well as SGD with momentum when tested on a diverse set of deep learning tasks.
• In the research paper "Improving Generalization Performance by Switching from Adam to SGD" by Nitish Shirish Keskar and Richard Socher: in the earlier stages of training Adam still outperforms SGD, but later the learning saturates. They propose a simple strategy in which training of the deep neural network starts with Adam but switches to SGD when a certain criterion is hit.
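As a rough illustration of such a hybrid schedule, a PyTorch-style sketch that starts with Adam and swaps in SGD with momentum partway through training. Note that the switching criterion here is a simple epoch threshold used as a placeholder, not the paper's actual criterion (which is based on monitoring the Adam update steps), and the model and data below are stand-ins:

```python
import torch
import torch.nn as nn

# Tiny stand-in model and data for illustration only.
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
X, y = torch.randn(256, 10), torch.randn(256, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
switch_epoch = 30  # placeholder criterion for switching from Adam to SGD

for epoch in range(100):
    if epoch == switch_epoch:
        # Switch to SGD with momentum for the later stage of training
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
```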
Optimization algorithm comparison

[Figure: visual comparison of the optimization algorithms.]
References

• Ref: Chapter 12 of T1: Dive into Deep Learning. [Link]
• [Link]
• [Link]
• [Link]
• A Neural Network Playground
• [Link] (Check here to visualize the convergence in cost space)
Thank you
