Optimization 2024_ver3
• Regularization (*)
• Gradient Descent
• Momentum, RMSProp, Adam
• Second Order Optimization
Quiz
(1)
(2)
Strategy #1: A first very bad idea solution: Random search
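A minimal sketch of what random search looks like; the data, shapes, and loss here are assumptions made only for illustration, not taken from the slides:

import numpy as np

# Toy setup, assumed for illustration: random data and a placeholder loss.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 32))           # 100 examples, 32 features
y = rng.integers(0, 10, size=100)            # 10 classes

def loss_fun(W):
    scores = X @ W.T                         # class scores, shape (100, 10)
    return -np.mean(scores[np.arange(100), y])

best_loss, best_W = float('inf'), None
for _ in range(1000):                        # try 1000 random weight matrices
    W = rng.standard_normal((10, 32)) * 1e-4
    loss = loss_fun(W)
    if loss < best_loss:                     # keep the best guess seen so far
        best_loss, best_W = loss, W

Even thousands of guesses explore the weight space blindly, which is why this is a very bad idea.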
Strategy #2: Follow the slope
• In multiple dimensions, the gradient is the vector of partial derivatives along each dimension
• The slope in any direction is the dot product of that direction with the gradient
• The direction of steepest descent is the negative gradient (illustrated in the sketch below)
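A minimal sketch of these three bullets; the toy loss f and the direction u are assumptions chosen only for illustration:

import numpy as np

def f(w):                                    # toy loss, assumed only for illustration
    return np.sum(w ** 2)

def numerical_gradient(f, w, h=1e-5):
    grad = np.zeros_like(w)
    for i in range(w.size):                  # central-difference partial derivative per dimension
        e = np.zeros_like(w)
        e.flat[i] = h
        grad.flat[i] = (f(w + e) - f(w - e)) / (2 * h)
    return grad

w = np.array([1.0, -2.0, 0.5])
g = numerical_gradient(f, w)                 # vector of partial derivatives
u = np.array([0.0, 1.0, 0.0])                # some unit-length direction
slope_along_u = np.dot(u, g)                 # slope in direction u = dot(u, gradient)
steepest_descent_direction = -g              # negative gradient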
In summary:
Gradient Descent
Stochastic Gradient Descent (SGD)
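A minimal sketch of the minibatch SGD loop in slide-style pseudocode; sample_minibatch, evaluate_gradient, loss_fn, train_data, weights, and learning_rate are placeholder names, not defined in the deck:

# Slide-style pseudocode with placeholder helpers.
while True:
    data_batch = sample_minibatch(train_data, batch_size=256)       # small random subset of the data
    weights_grad = evaluate_gradient(loss_fn, data_batch, weights)  # noisy estimate of the full-batch gradient
    weights -= learning_rate * weights_grad                         # parameter update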
Optimization: Problem #1 with SGD
Very slow progress along the shallow dimension, jitter along the steep direction (a toy example is sketched below).
Aside: the loss function has a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix is large.
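A toy illustration of this (my own assumed example, not from the slides): f(x, y) = 0.5 * (x**2 + 100 * y**2) has Hessian singular values 1 and 100, so gradient descent oscillates along the steep y direction while crawling along the shallow x direction.

import numpy as np

def grad(p):                          # gradient of f(x, y) = 0.5 * (x**2 + 100 * y**2)
    x, y = p
    return np.array([x, 100.0 * y])

p = np.array([10.0, 1.0])
learning_rate = 0.018                 # near the 2/100 stability limit of the steep direction
trajectory = [p.copy()]
for _ in range(50):
    p = p - learning_rate * grad(p)   # y oscillates in sign, x shrinks very slowly
    trajectory.append(p.copy())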
Optimization: Problem #2 with SGD
Optimization: Problem #3 with SGD
SGD + Momentum
Gradient Noise
SGD: the simple two-line update code
• SGD
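A sketch of what this two-line update conventionally looks like, with compute_gradient, x, and learning_rate as placeholder names:

# Slide-style pseudocode with placeholder names.
while True:
    dx = compute_gradient(x)      # gradient of the loss at the current parameters
    x -= learning_rate * dx       # step in the direction of steepest descent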
SGD + Momentum: continue moving in the general direction of the previous iterations
• SGD+Momentum
• SGD
Source: Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013
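A minimal sketch of the momentum update in the same pseudocode style; rho is the momentum coefficient (typically 0.9 or 0.99), and compute_gradient is again a placeholder:

vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx + dx            # velocity: running accumulation of past gradients
    x -= learning_rate * vx       # step along the velocity rather than the raw gradient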
SGD + Momentum: alternative equivalent formulation
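A sketch of the alternative formulation, which folds the learning rate into the velocity; for a fixed learning rate it behaves the same as the version above (placeholder names as before):

v = 0
while True:
    dx = compute_gradient(x)
    v = rho * v - learning_rate * dx   # the learning rate is carried inside the velocity
    x += v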
More Complex Optimizers: AdaGrad
Source: Duchi et al, “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011
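A minimal sketch of AdaGrad in the slides' pseudocode style: each dimension is scaled by the square root of its accumulated squared gradients (np is NumPy; compute_gradient, x, and learning_rate are placeholders; 1e-7 avoids division by zero):

grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx                                     # per-dimension historical sum of squares
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)    # small steps in steep dims, larger in flat dims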
AdaGrad
Q2: What happens to the step size over a long time?
RMSProp ("leaky" AdaGrad)
Adds element-wise scaling of the gradient based on the historical sum of squares in each dimension, with decay.
Source: Tieleman and Hinton, 2012
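A minimal sketch of the RMSProp update in the same pseudocode style; decay_rate (often around 0.9) controls how quickly old squared gradients are forgotten:

grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx   # leaky average instead of a sum
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)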
RMSProp
Optimizers: Adam (almost)
• Combines the Momentum update (first moment of the gradient) with RMSProp-style scaling (second moment of the gradient); see the sketch below
Source: Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015
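A sketch of this "almost Adam" in the same pseudocode style; beta1 is typically around 0.9 and beta2 around 0.999:

first_moment = 0
second_moment = 0
while True:
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx           # Momentum-style moving average
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx    # RMSProp-style squared-gradient average
    x -= learning_rate * first_moment / (np.sqrt(second_moment) + 1e-7)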
Adam (full form)
Bias correction: for the fact that the first and second moment estimates start at zero.
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
Source: Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015
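A sketch of the full update in the same pseudocode style; the (1 - beta ** t) factors are the bias correction for the zero-initialized moments:

first_moment = 0
second_moment = 0
for t in range(1, num_iterations + 1):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
    first_unbias = first_moment / (1 - beta1 ** t)                   # bias-corrected first moment
    second_unbias = second_moment / (1 - beta2 ** t)                 # bias-corrected second moment
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)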
Adam
AdamW: Adam Variant with Weight Decay
Q: Are L2 regularization and weight decay the same thing? A: It depends!
λ: Weight Decay
Source: https://www.fast.ai/posts/2018-07-02-adam-weight-decay.html
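One common way to write the decoupled weight decay, as a sketch in the same pseudocode style as the Adam block above; weight_decay plays the role of λ, and the decay term is applied directly to the weights rather than being pushed through the adaptive scaling:

first_moment = 0
second_moment = 0
for t in range(1, num_iterations + 1):
    dx = compute_gradient(x)                                         # gradient of the loss only, no L2 term
    first_moment = beta1 * first_moment + (1 - beta1) * dx
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
    first_unbias = first_moment / (1 - beta1 ** t)
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * (first_unbias / (np.sqrt(second_unbias) + 1e-7)
                          + weight_decay * x)                        # decay applied directly to the weights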
Learning rate schedules
• Learning Rate Schedules are techniques used in deep learning to adjust the learning rate dynamically during training, instead of keeping it at a fixed value.
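Two common schedules as a sketch; the specific formulas and default values are standard choices, not taken from the slides:

import math

def step_decay(base_lr, epoch, drop=0.1, every=30):
    # multiply the learning rate by `drop` every `every` epochs
    return base_lr * (drop ** (epoch // every))

def cosine_decay(base_lr, t, total_steps, lr_min=0.0):
    # smoothly anneal from base_lr down to lr_min over total_steps
    return lr_min + 0.5 * (base_lr - lr_min) * (1 + math.cos(math.pi * t / total_steps))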
SGD, SGD+Momentum, RMSProp, Adam, and AdamW all have the learning rate as a hyperparameter
Learning rate decays over time
First Order Optimization
Second Order Optimization
Solving for the critical point, we obtain the Newton parameter update:
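In standard form (a reconstruction in LaTeX notation, using J for the loss, \theta_0 for the current parameters, and H for the Hessian):

J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^{\top} \nabla_{\theta} J(\theta_0) + \tfrac{1}{2} (\theta - \theta_0)^{\top} H(\theta_0) (\theta - \theta_0)

\theta^{*} = \theta_0 - H(\theta_0)^{-1} \nabla_{\theta} J(\theta_0)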
No hyperparameters!
No learning rate!
Second-Order Optimization
In practice:
• Adam(W) is a good default choice in many cases; it often works OK even with a constant learning rate
• SGD+Momentum can outperform Adam but may require more tuning of the learning rate and learning rate schedule
• If you can afford to do full-batch updates, then look beyond 1st order optimization (2nd order and beyond)
Conclusion
• Gradient Descent
• Momentum, AdaGrad, RMSProp, Adam
Questions?