
OPTIM

THIS BOOK
This book is written (typed) by
Ari, who hails from the South
and has keen interests in
Computer Science, Biology,
and Tamil Literature.
Occasionally, he updates his
website, where you can reach
out to him.
https://arihara-sudhan.github.io
PREREQUISITES
To get the most out of this
book, you should be familiar
with some basics, which you
can learn from the following
book.
FITTING A LINE TO DATA
Assume we have a data distribution as shown below.

Initially, the weights and biases make a line as in the following figure.
Now, we want to check whether this is a perfect fit or not. A perfect fit makes the loss low. So, this is the place for experiments: we can tweak and tune the parameters and check different lines. Let's consider that horizontal line (no slope). What could be the distance to the data points from this line? Since the Y value of this line is just the bias (intercept), let's measure the differences accordingly.
The answer for this is 25.
Let us tilt the line and try once again.

Now, the sum is way less than the previous one. It is 19. I would like to make this sum even smaller. So, let's tweak the parameters again.

Now, the sum is 14. Yes! It


could be a better fit than the
all previous ones.
This method of finding the
appropriate parameter values
for the smallest sum of
squares is called, “Least
Squares”. However, The
expression in general
mathematical term would be
like the following.

We find values for a (weight)


and b (intercept). Now, how do
we find the optimal line? How
do we fit the line to data?
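To make this concrete, here is a minimal Python sketch (with made-up data points; the numbers are illustrative, not the book's figures) that computes the sum of squared residuals for a candidate line:

    # Sum of squared residuals for a candidate line y = a*x + b
    def ssr(a, b, xs, ys):
        return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

    xs = [1.0, 2.0, 3.0, 4.0]   # made-up data
    ys = [1.5, 2.0, 3.5, 4.0]

    print(ssr(0.0, 2.0, xs, ys))  # horizontal line (slope 0, intercept 2)
    print(ssr(1.0, 0.0, xs, ys))  # tilted line: a smaller sum here

Trying different (a, b) pairs and keeping the pair with the smallest sum is exactly the "Least Squares" experiment described above.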
GRADIENT
The gradient refers to the vector
of partial derivatives of a function
with respect to its parameters. It’s
like the slope of the function at a
certain point, and it tells us the
direction and rate at which the
function’s value increases or
decreases. In one dimension, imagine a hill with a path leading
up to the peak. If Ari stands at a
certain point on the path, the
slope tells him whether he is
going up or down. A positive
slope means he is going up, a
negative slope means he is going
down, and a slope of zero means
he is at a flat point (possibly a
peak or a valley). In calculus, this
slope is the derivative of the
function.
If he wants to find the minimum
point of a function (like reaching
the valley), he would look at how
the slope changes and move
opposite to it. In machine learning,
our functions are often defined
over many dimensions (like many
features in a dataset). The
gradient in this multi-dimensional
space tells us the direction of
steepest ascent. It is represented
as a vector of partial derivatives,
where each element represents
the rate of change along one
dimension. If we want to minimize
a function (say, reduce error in a
model), we take a step opposite
to the gradient direction, because
this leads us downhill toward
lower values.
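As a quick illustration, here is a minimal sketch using PyTorch's autograd with a made-up two-parameter function; the gradient it returns points in the direction of steepest ascent, so we would step the opposite way to descend:

    import torch

    # J(a, b) = (2a + b - 5)^2, a made-up one-sample squared error
    a = torch.tensor(1.0, requires_grad=True)
    b = torch.tensor(0.0, requires_grad=True)
    loss = (2 * a + b - 5) ** 2
    loss.backward()          # fills a.grad and b.grad
    print(a.grad, b.grad)    # partial derivatives dJ/da and dJ/db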
In the figure above, the slope at the first chosen parameter value is very steep. At the second, it is also fairly steep. But at the middle (third) one, it is less steep.
As we discussed in the "MLP" book, it is all about derivatives. Taking the derivative at a point tells us how steep the function is at that point.
BATCH GRADIENT DESCENT
Batch Gradient Descent or
Gradient Descent is an
optimization algorithm used to
minimize the loss function of a
model. It works by iteratively
adjusting the model's parameters
(weights and biases) in the
direction opposite to the gradient
of the loss function. This helps
reduce the loss and improves the
model's performance. It calculates
the gradient of the loss function
with respect to each parameter
(weight or bias). Following is the
parameter updating rule.

θ = θ − α·∇θJ(θ)

Here, α is the learning rate, and ∇θJ(θ) is the gradient of the loss function with respect to the parameters.
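Here is a minimal sketch of this rule for the line-fitting example (made-up data; a and b play the roles of weight and intercept, and the gradient is taken over ALL samples each step):

    import torch

    x = torch.tensor([1.0, 2.0, 3.0, 4.0])     # made-up data
    y = torch.tensor([2.0, 4.0, 6.0, 8.0])
    a = torch.tensor(0.0, requires_grad=True)  # weight (slope)
    b = torch.tensor(0.0, requires_grad=True)  # bias (intercept)
    alpha = 0.01                               # learning rate

    for _ in range(100):
        loss = ((a * x + b - y) ** 2).sum()    # loss over the ENTIRE dataset
        loss.backward()
        with torch.no_grad():
            a -= alpha * a.grad                # theta = theta - alpha * grad
            b -= alpha * b.grad
        a.grad.zero_()
        b.grad.zero_()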
STOCHASTIC GD
Gradient Descent (GD) has several
drawbacks, including the risk of
getting stuck in local minima or
saddle points, especially in non-
convex functions, which can lead
to suboptimal solutions. It is
sensitive to the learning rate,
requiring careful tuning to avoid
slow convergence or instability.
Stochastic Gradient Descent is a
variant of Gradient Descent where
the parameters are updated using
only a single training example at a
time, rather than the entire
dataset. SGD updates more frequently and each step is cheaper, but the updates are noisier and less stable. Over time,
this noise can help the algorithm
escape local minima and explore
more of the solution space.
θ = θ − α·∇θJ(θ; x⁽ⁱ⁾, y⁽ⁱ⁾)
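A minimal sketch of this single-example variant (same made-up setup as the batch sketch above, but each update sees only one sample):

    import torch

    x = torch.tensor([1.0, 2.0, 3.0, 4.0])
    y = torch.tensor([2.0, 4.0, 6.0, 8.0])
    a = torch.tensor(0.0, requires_grad=True)
    b = torch.tensor(0.0, requires_grad=True)
    alpha = 0.01

    for _ in range(100):
        for i in torch.randperm(len(x)):        # visit samples in random order
            loss = (a * x[i] + b - y[i]) ** 2   # loss on ONE example
            loss.backward()
            with torch.no_grad():
                a -= alpha * a.grad
                b -= alpha * b.grad
            a.grad.zero_()
            b.grad.zero_()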
MINI BATCH GD
Mini-Batch Gradient Descent
is a compromise between
Batch Gradient Descent and
Stochastic Gradient Descent.
In this approach, instead of
using the entire dataset or just
a single example, we compute
the gradient using a small
batch of training examples.
We split the training data into
small batches (e.g., 32 or 64
samples per batch), and then
we calculate the gradient for
each batch and update the
parameters accordingly. This
reduces the computational
cost compared to Batch
Gradient Descent, while still
providing more stable updates.
The update rule is as the
following:
θ = θ − α·∇θJ(θ; X_batch, Y_batch)
where X_batch and Y_batch are the features and labels of the mini-batch.
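A minimal sketch (same made-up setup; here each update sees a small slice of the data):

    import torch

    x = torch.tensor([1.0, 2.0, 3.0, 4.0])
    y = torch.tensor([2.0, 4.0, 6.0, 8.0])
    a = torch.tensor(0.0, requires_grad=True)
    b = torch.tensor(0.0, requires_grad=True)
    alpha, batch_size = 0.01, 2

    for _ in range(100):
        perm = torch.randperm(len(x))           # reshuffle every epoch
        for s in range(0, len(x), batch_size):
            idx = perm[s:s + batch_size]        # one mini-batch of indices
            loss = ((a * x[idx] + b - y[idx]) ** 2).sum()
            loss.backward()
            with torch.no_grad():
                a -= alpha * a.grad
                b -= alpha * b.grad
            a.grad.zero_()
            b.grad.zero_()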
LET’S LIGHT ON IT
Let’s define the network first.
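Since the original code figure is not reproduced here, the following is a minimal sketch of the kind of network this section assumes (the layer sizes and names are illustrative, not the author's exact code):

    import torch
    import torch.nn as nn

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.layers = nn.Sequential(
                nn.Linear(10, 32),   # assumed input of 10 features
                nn.ReLU(),
                nn.Linear(32, 1),    # single regression output
            )

        def forward(self, x):
            return self.layers(x)

    model = Net()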

The way we pass the data matters here.
SGD in PyTorch is more flexible
and can work for both Mini-Batch
GD and full Batch GD by adjusting
the batch_size. The underlying
optimization algorithm remains
the same (stochastic in nature).
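A sketch of how the batch_size choice plays out (made-up tensors, continuing the model above): batch_size=1 gives SGD proper, batch_size=len(dataset) gives full Batch GD, and anything in between is Mini-Batch GD.

    import torch
    from torch.utils.data import TensorDataset, DataLoader

    X = torch.randn(100, 10)   # made-up features
    Y = torch.randn(100, 1)    # made-up targets
    loader = DataLoader(TensorDataset(X, Y), batch_size=32, shuffle=True)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.MSELoss()

    for xb, yb in loader:                    # one epoch
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()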

Full Code is HERE


GD WITH MOMENTUM
In basic gradient descent, the
weights are updated by moving in
the direction of the negative
gradient (the direction that
decreases the cost function),
scaled by the learning rate. The
idea of momentum is inspired by
physics, where an object in
motion tends to continue in its
direction unless acted upon by a
force. Similarly, momentum helps
to "carry" the update in the same
direction over time, even if the
gradient is small. Instead of just
using the gradient at the current
step, we use a combination of the
current gradient and the previous
update.
vₜ = β·vₜ₋₁ + (1−β)·∇θJ(θ)
θ = θ − η·vₜ
The velocity vₜ is the exponentially weighted moving average of the past gradients. It accumulates the gradients' influence over time. β is the momentum coefficient, which determines how much of the previous velocity is retained. A typical value for β is between 0.9 and 0.99. By
incorporating the past gradients,
momentum smooths out the noisy
fluctuations, leading to more stable
convergence. Momentum helps the
algorithm accelerate in the direction of
consistent gradients and dampen
oscillations, especially in areas with
steep gradients.
Gradient descent with momentum
helps overcome the challenge of
getting stuck in local minima by adding
a momentum term to the standard
gradient descent update rule. There’s
no absolute guarantee that
momentum-based gradient descent
will find the global minimum, especially
in complex, non-convex landscapes.
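A minimal hand-rolled sketch of the update above (made-up quadratic loss; note that PyTorch's built-in momentum uses a slightly different formula, without the (1−β) factor):

    import torch

    theta = torch.tensor([0.0, 0.0])
    v = torch.zeros_like(theta)          # velocity starts at zero
    beta, eta = 0.9, 0.1

    def grad(theta):
        # gradient of the made-up loss J(theta) = ||theta - target||^2
        return 2 * (theta - torch.tensor([3.0, -1.0]))

    for _ in range(200):
        g = grad(theta)
        v = beta * v + (1 - beta) * g    # EWMA of past gradients
        theta = theta - eta * v          # step along the smoothed direction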
NESTEROV ACCELERATED
Nesterov Accelerated Gradient is
a modification of traditional
momentum-based gradient
descent where, instead of just
adding momentum, it anticipates
the direction of the next step by
looking slightly ahead in the
parameter space. In standard
gradient descent with momentum,
the update step simply
accumulates the previous
gradients. But in NAG, the gradient is computed at a look-ahead position, meaning we compute the gradient not at the current position θ but at θ − βvₜ₋₁ (the point the momentum is about to carry us to), where v is the momentum term.
vₜ = β·vₜ₋₁ + α·∇J(θ − β·vₜ₋₁)
θ = θ − vₜ
While this can help with escaping
shallow local minima or speeding
up convergence, it does not
inherently guarantee that you’ll
avoid deep or well-formed local
minima. The challenge with local
minima is inherent in the non-
convex nature of the optimization
landscape.
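A minimal hand-rolled sketch of the look-ahead step (same made-up quadratic loss; in PyTorch you would instead pass nesterov=True to optim.SGD, as shown later):

    import torch

    theta = torch.tensor([0.0, 0.0])
    v = torch.zeros_like(theta)
    beta, alpha = 0.9, 0.05

    def grad(theta):
        return 2 * (theta - torch.tensor([3.0, -1.0]))  # made-up loss gradient

    for _ in range(200):
        g = grad(theta - beta * v)   # gradient at the look-ahead point
        v = beta * v + alpha * g
        theta = theta - v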
ADAGRAD
Adaptive Gradient Algorithm
adapts the learning rate for each
parameter individually based on
how frequently it is updated.
Each parameter has its own
learning rate that changes over
time. Parameters updated
frequently receive lower learning
rates, while those updated
infrequently get higher learning
rates. This adjustment helps
Adagrad handle sparse features
or parameters well, making it
useful in cases where some
features (or weights) rarely occur
in the data. Adagrad keeps track
of the sum of squares of the
gradients for each parameter over
time. This is called the
accumulated squared gradient.
In Adagrad, the learning rate effectively decays over time. Adagrad calculates the learning rate for each parameter based on the cumulative sum of squared gradients vₜ for that parameter. As training progresses, vₜ becomes very large because it continuously accumulates squared gradient values, causing the square root of vₜ to grow significantly over time. As it grows, the learning rate becomes smaller and smaller.
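The update the section describes, as a minimal sketch (ε is a small constant that avoids division by zero; the loss is made up):

    import torch

    theta = torch.tensor([0.0, 0.0])
    v = torch.zeros_like(theta)      # accumulated squared gradients
    alpha, eps = 0.5, 1e-8

    def grad(theta):
        return 2 * (theta - torch.tensor([3.0, -1.0]))  # made-up loss gradient

    for _ in range(500):
        g = grad(theta)
        v = v + g ** 2                                # only ever grows
        theta = theta - alpha * g / (v.sqrt() + eps)  # per-parameter lr decays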
RMS PROP
Root Mean Square Propagation is
an adaptive learning rate
optimization algorithm designed
to address the vanishing learning
rate problem that arises in
Adagrad. RMSprop adjusts the
learning rate by calculating an
exponentially decaying average of
squared gradients rather than
accumulating the squared
gradients over time like Adagrad.
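In standard form, the rule this section describes looks like the following sketch (ρ is the decay rate, typically 0.9; the loss is made up):

    import torch

    theta = torch.tensor([0.0, 0.0])
    sq_avg = torch.zeros_like(theta)   # decaying average of squared gradients
    alpha, rho, eps = 0.01, 0.9, 1e-8

    def grad(theta):
        return 2 * (theta - torch.tensor([3.0, -1.0]))  # made-up loss gradient

    for _ in range(1000):
        g = grad(theta)
        sq_avg = rho * sq_avg + (1 - rho) * g ** 2   # recent history, not a full sum
        theta = theta - alpha * g / (sq_avg.sqrt() + eps)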

The result is an adaptive learning rate for each parameter based on the recent history of gradients, which prevents the learning rate from becoming too small or too large.
By using an exponentially
decaying average of squared
gradients, RMSprop keeps
learning steady over time, making
it suitable for deep learning tasks.
Unlike Adagrad, RMSprop avoids
the problem of a rapidly shrinking
learning rate, which allows the
algorithm to continue updating
parameters effectively over time.

Wait for Adam!


ADAM
Adam (Adaptive Moment
Estimation) is an optimization
algorithm that combines the
advantages of two popular
optimizers which are Momentum
and RMSprop. It’s widely used in
deep learning because it adapts
the learning rate for each
parameter and incorporates an
adaptive momentum, making the
learning process both faster and
more stable. Adam adjusts the
learning rate for each parameter
individually, based on both the
first moment (mean) and the
second moment (uncentered
variance) of the gradient.
The main steps in Adam include
computing these moments, bias-
correcting them, and then
updating the parameters.
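In standard form, those steps look like the following sketch (β₁ and β₂ are the moment decay rates, typically 0.9 and 0.999; the loss is made up):

    import torch

    theta = torch.tensor([0.0, 0.0])
    m = torch.zeros_like(theta)   # first moment (mean of gradients)
    v = torch.zeros_like(theta)   # second moment (mean of squared gradients)
    alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

    def grad(theta):
        return 2 * (theta - torch.tensor([3.0, -1.0]))  # made-up loss gradient

    for t in range(1, 1001):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction: moments start at zero
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (v_hat.sqrt() + eps)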

Adam leverages Momentum to smooth the update direction,
making learning faster and more
stable. The adaptive learning rate
from RMSprop enables Adam to
handle noisy, sparse, or non-
stationary data efficiently.
LET’S LIGHT ON IT
To create an SGD with Momentum
optimizer, we have to specify the
momentum attribute over there.
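A sketch (the learning rate and momentum values here and below are illustrative; model is the network defined earlier):

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)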

If it is Nesterov GD, we have to set the nesterov parameter to True.
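For example (same illustrative values; nesterov=True requires a nonzero momentum):

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, nesterov=True)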
For AdaGrad,
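    optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)  # illustrative lr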

For RMSProp,
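    optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)  # illustrative lr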
And this is for Adam...
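    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # illustrative lr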

MERCI
