
Optimization

Nguyen Van Vinh


UET - 2025
Content

• Regularization (*)
• Gradient Descent
• Momentum, RMSProp, Adam
• Second Order Optimization

Quiz

● Q: How do we check an analytic gradient implementation against the numerical gradient?

(1)

(2)

● Q: Why is second-order optimization impractical for deep learning?

● Q: Why is Adam a good default choice for optimization in most cases?
Strategy #1: A first, very bad idea: random search

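As a rough illustration (not from the slides), a minimal random-search sketch in Python; the toy `loss` function and sizes are assumptions for the example only:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(W):
    # Toy stand-in for a training loss (illustrative only).
    return float(np.sum((W - 3.0) ** 2))

best_loss, best_W = float("inf"), None
for _ in range(1000):
    W = rng.normal(size=10)          # guess a parameter vector completely at random
    l = loss(W)
    if l < best_loss:                # keep the best guess seen so far
        best_loss, best_W = l, W
```

The point of the slide is that this scales terribly: the probability of a random guess landing near a good solution shrinks rapidly with the number of parameters.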
Strategy #2: Follow the slope

• In 1-dimension, the derivative of a function:

• In multiple dimensions, the gradient is the vector of (partial derivatives) along each
dimension
• The slope in any direction is the dot product of the direction with the gradient
• The direction of steepest descent is the negative gradient

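For reference, written in LaTeX, the standard definitions these bullets describe (the slide's rendered formulas are not reproduced in this transcript):

```latex
\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}
\qquad
\nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)
\qquad
\text{slope along a unit direction } \mathbf{u}: \; \mathbf{u} \cdot \nabla f(\mathbf{x})
```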
In summary:

• Numerical gradient: approximate, slow, easy to write


• Analytic gradient: exact, fast, error-prone

In practice: Always use analytic gradient, but check implementation with
numerical gradient. This is called a gradient check.
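A minimal gradient-check sketch (illustrative, not the course's official code), assuming a scalar-valued `f(x)` and an `analytic_grad(x)` to verify:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered finite differences: grad_i ≈ (f(x + h*e_i) - f(x - h*e_i)) / (2h)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h; f_plus = f(x)
        x.flat[i] = old - h; f_minus = f(x)
        x.flat[i] = old                      # restore the coordinate
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

def max_relative_error(f, analytic_grad, x):
    num = numerical_gradient(f, x)
    ana = analytic_grad(x)
    return np.max(np.abs(num - ana) / np.maximum(1e-8, np.abs(num) + np.abs(ana)))

# Example: f(x) = sum(x^2) has analytic gradient 2x; the relative error should be tiny.
print(max_relative_error(lambda x: np.sum(x ** 2), lambda x: 2 * x, np.random.randn(5)))
```

The numerical gradient is slow (one loop iteration per parameter), which is exactly why it is used only to check the fast analytic gradient, not to train.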
Gradient Descent

Gradient Descent: The Secret Weapon of Machine Learning

Stochastic Gradient Descent (SGD)

• Full sum expensive when N is large!
• Approximate sum using a minibatch of examples: 32 / 64 / 128 common (see the sketch below)
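A self-contained minibatch SGD sketch on a toy linear-regression problem (the data, loss, and hyperparameters are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))                 # N = 10,000 examples, 20 features
w_true = rng.normal(size=20)
y = X @ w_true + 0.1 * rng.normal(size=10_000)

w = np.zeros(20)
learning_rate, batch_size = 1e-2, 64              # 32 / 64 / 128 are common minibatch sizes
for step in range(2_000):
    idx = rng.integers(0, X.shape[0], size=batch_size)
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size    # gradient of the mean squared error on the minibatch
    w -= learning_rate * grad                          # approximate the full-sum gradient with the minibatch
```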
Optimization: Problem #1 with SGD

• What if loss changes quickly in one direction and slowly in another?


• What does gradient descent do?

Very slow progress along shallow dimension, jitter along steep direction

Aside: the loss function has a high condition number: the ratio of largest to smallest
singular value of the Hessian matrix is large.
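For reference, the condition number mentioned in the aside is

```latex
\kappa(H) = \frac{\sigma_{\max}(H)}{\sigma_{\min}(H)}
```

where σ_max and σ_min are the largest and smallest singular values of the Hessian H.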
Optimization: Problem #2 with SGD

• What if the loss function has local minima or saddle points?
• Zero gradient: gradient descent gets stuck

• Saddle points are much more common in high dimensions

Source: Dauphin et al, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”, NIPS 2014
Optimization: Problem #2 with SGD

• Saddle point in two dimensions
Optimization: Problem #3 with SGD

• Our gradients come from minibatches so they can be noisy!

SGD + Momentum

Gradient Noise

SGD: the simple two line update code

• SGD

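The code itself is not reproduced in this transcript; a minimal sketch of the two-line update on a toy objective (the objective and step count are illustrative):

```python
import numpy as np

def compute_gradient(x):
    return x.copy()                # toy objective f(x) = 0.5 * ||x||^2, so grad f(x) = x (illustrative)

x = np.array([3.0, -2.0])
learning_rate = 0.1
for _ in range(200):
    dx = compute_gradient(x)       # the "two line" SGD update:
    x -= learning_rate * dx        #   step in the negative gradient direction
```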
SGD + Momentum: continue moving in the general direction as in the previous iterations

• SGD vs. SGD+Momentum (update sketch below)
• Build up “velocity” as a running mean of gradients
• Rho gives “friction”; typically rho = 0.9 or 0.99

Source: Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013
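A sketch of the velocity-based update described above, on the same kind of toy objective (illustrative, not the slide's exact code):

```python
import numpy as np

def compute_gradient(x):
    return x.copy()                # toy objective 0.5 * ||x||^2 (illustrative)

x = np.array([3.0, -2.0])
vx = np.zeros_like(x)
learning_rate, rho = 0.1, 0.9      # rho plays the role of "friction"
for _ in range(200):
    dx = compute_gradient(x)
    vx = rho * vx + dx             # build up velocity as a running mean of gradients
    x -= learning_rate * vx        # step along the velocity instead of the raw gradient
```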
SGD + Momentum: alternative equivalent formulation
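The alternative formulation usually shown for this slide folds the learning rate into the velocity; for a constant learning rate it traces the same iterates as the version above (sketch, illustrative toy gradient):

```python
import numpy as np

x = np.array([3.0, -2.0])
v = np.zeros_like(x)
learning_rate, rho = 0.1, 0.9
for _ in range(200):
    dx = x.copy()                      # toy gradient of 0.5 * ||x||^2 (illustrative)
    v = rho * v - learning_rate * dx   # velocity now already includes the learning rate
    x += v
```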
More Complex Optimizers: AdaGrad

AdaGrad adds element-wise scaling of the gradient based on the historical sum of squares
in each dimension (compare the SGD + Momentum update above with the AdaGrad sketch below).

Source: Duchi et al, “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011
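A sketch of the AdaGrad update just described (illustrative toy gradient and hyperparameters):

```python
import numpy as np

x = np.array([3.0, -2.0])
grad_squared = np.zeros_like(x)
learning_rate = 0.1
for _ in range(200):
    dx = x.copy()                             # toy gradient (illustrative)
    grad_squared += dx * dx                   # historical sum of squares, per dimension
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)   # element-wise scaled step
```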
AdaGrad
Q1: What happens with AdaGrad?
Q2: What happens to the step size over long time?
RMSProp

RMSProp modifies AdaGrad with “per-parameter learning rates” or “adaptive learning rates”:
it adds element-wise scaling of the gradient based on the historical sum of squares
in each dimension, with decay (see the sketch below).

Source: Tieleman and Hinton, 2012
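A sketch of the RMSProp update, which replaces AdaGrad's running sum with a decaying (leaky) average (illustrative toy gradient and hyperparameters):

```python
import numpy as np

x = np.array([3.0, -2.0])
grad_squared = np.zeros_like(x)
learning_rate, decay_rate = 0.01, 0.99
for _ in range(2000):
    dx = x.copy()                                                            # toy gradient (illustrative)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx    # decaying sum of squares
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)                 # per-parameter learning rates
```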
RMSProp

Q: What happens with RMSProp?
Optimizers: Adam (almost)

Adam (almost) combines the Momentum-style first moment estimate with RMSProp-style second moment scaling.

Source: Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015
Adam (full form)

Bias correction for the fact that first and second moment estimates start at zero.

Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4
is a great starting point for many models!

Source: Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015
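A sketch of the full Adam update with bias correction, using the defaults suggested above (the toy gradient is illustrative):

```python
import numpy as np

x = np.array([3.0, -2.0])
m, v = np.zeros_like(x), np.zeros_like(x)
learning_rate, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
for t in range(1, 10_001):
    dx = x.copy()                            # toy gradient (illustrative)
    m = beta1 * m + (1 - beta1) * dx         # first moment (Momentum-like)
    v = beta2 * v + (1 - beta2) * dx * dx    # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)             # bias correction: the moments start at zero
    v_hat = v / (1 - beta2 ** t)
    x -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
```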
AdamW: Adam Variant with Weight Decay

Q: How does regularization interact with the optimizer? (e.g., L2)

A: It depends!
AdamW: Adam Variant with Weight Decay

Standard Adam computes the L2 penalty as part of the gradient; AdamW adds a separate weight
decay term (λ: weight decay) applied directly to the weights (see the sketch below).
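A sketch contrasting the two choices: standard Adam + L2 folds the penalty into the gradient (so it passes through the adaptive scaling), while AdamW decays the weights directly (illustrative toy gradient and λ):

```python
import numpy as np

x = np.array([3.0, -2.0])
m, v = np.zeros_like(x), np.zeros_like(x)
learning_rate, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
weight_decay = 0.01                           # lambda, the weight decay coefficient
for t in range(1, 10_001):
    dx = x.copy()                             # toy gradient of the data loss only (illustrative)
    # Standard Adam + L2 would instead do: dx += weight_decay * x   (penalty passes through m, v)
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * dx * dx
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    x -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    x -= learning_rate * weight_decay * x     # AdamW: decoupled weight decay applied to the weights directly
```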
AdamW: Adam Variant with Weight Decay

Source: https://www.fast.ai/posts/2018-07-02-adam-weight-decay.html
Learning rate schedules

• Learning Rate Schedules are techniques used in deep learning to adjust the
learning rate dynamically during training, instead of keeping it at a fixed value.

SGD, SGD+Momentum, RMSProp, Adam, AdamW all have learning rate as a hyperparameter

Q: Which one of these learning rates is best to use?
Learning rate decays over time

Step: Reduce the learning rate at a few fixed points. E.g. for ResNets, multiply the LR by 0.1
after epochs 30, 60, and 90 (see the sketch below).
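A sketch of that step schedule as a plain function (the milestones and factor follow the ResNet example above; the function name is illustrative):

```python
def step_decay_lr(base_lr, epoch, milestones=(30, 60, 90), gamma=0.1):
    """Multiply the learning rate by gamma at each milestone epoch that has been passed."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# e.g. base_lr = 0.1 gives 0.1 for epochs 0-29, 0.01 for 30-59, 0.001 for 60-89, 1e-4 afterwards
```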
First Order Optimization

Second Order Optimization

second-order Taylor expansion:

Solving for the critical point we obtain the Newton parameter update:
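Written out in LaTeX (the slide's rendered formulas are not in this transcript, so these are the standard textbook forms):

```latex
% Second-order Taylor expansion around the current parameters \theta_0 (H = Hessian at \theta_0):
J(\theta) \approx J(\theta_0)
  + (\theta - \theta_0)^{\top} \nabla_{\theta} J(\theta_0)
  + \tfrac{1}{2} (\theta - \theta_0)^{\top} H (\theta - \theta_0)

% Setting the gradient of this quadratic to zero gives the Newton parameter update:
\theta^{*} = \theta_0 - H^{-1} \nabla_{\theta} J(\theta_0)
```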

No hyperparameters!
No learning rate!

Hessian has O(N^2) elements
Inverting takes O(N^3)
N = (Tens or Hundreds of) Millions
Second-Order Optimization

• Quasi-Newton methods (BFGS most popular): instead of inverting the Hessian (O(n^3)),
approximate the inverse Hessian with rank-1 updates over time (O(n^2) each).
• L-BFGS (Limited-memory BFGS): does not form/store the full inverse Hessian (see the sketch below).
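For a deterministic, full-batch objective, an off-the-shelf L-BFGS call looks like the following; this uses SciPy's implementation (not anything from the slides), and the poorly conditioned quadratic is illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    return 0.5 * (100.0 * x[0] ** 2 + x[1] ** 2)      # poorly conditioned toy quadratic

def grad(x):
    return np.array([100.0 * x[0], x[1]])

# L-BFGS keeps only a few recent (gradient, step) pairs to approximate the inverse Hessian.
result = minimize(f, x0=np.array([1.0, 1.0]), jac=grad, method="L-BFGS-B")
print(result.x)    # close to [0, 0]
```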
In practice:

• Adam(W) is a good default choice in many cases; it often works ok even with
constant learning rate
• SGD+Momentum can outperform Adam but may require more tuning of Learning
Rate and Learning Rate schedule
• If you can afford to do full batch updates then look beyond 1st order optimization
(2nd order and beyond)

Conclusion

• Gradient Descent
• Momentum, AdaGrad, RMSProp, Adam

• Second Order Optimization

Questions?
