Module C: Algorithms for Optimization
Recall that an optimization problem in standard form is given by
min_{x ∈ R^n}  f(x)
s.t.  g_i(x) ≤ 0,  i ∈ [m] := {1, 2, . . . , m},
      h_j(x) = 0,  j ∈ [p].
Most algorithms generate a sequence x_0, x_1, x_2, . . . by exploiting local information collected along the path.
Zeroth Order: Only the values f(x_t), g_i(x_t), h_j(x_t) are available.
First Order: Gradients ∇f(x_t), ∇g_i(x_t), ∇h_j(x_t) are used. Heavily used in ML.
Second Order: Hessian information is used, e.g., Newton's method.
Distributed Algorithms
Stochastic/Randomized Algorithms
Measure of progress
Let x⋆ be the optimal solution. The iterative algorithms continue until one of the following error metrics is sufficiently small.
err_t := ||x_t − x⋆||
err_t := f(x_t) − f(x⋆)
A solution x̄ is ε-optimal when
f(x̄) ≤ f(x⋆) + ε.
We often run the algorithm until err_t is smaller than a sufficiently small ε > 0.
In the presence of constraints, we define
err_t := max( f(x_t) − f(x⋆), g_1(x_t), g_2(x_t), . . . , g_m(x_t), |h_1(x_t)|, . . . , |h_p(x_t)| ).
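As a quick illustration, a minimal Python sketch of this constrained error metric (the callables f, gs, hs and the optimal value f_opt are hypothetical placeholders, not from the slides):

```python
def constrained_error(x, f, f_opt, gs, hs):
    """err_t = max of suboptimality, g_i(x), and |h_j(x)|."""
    terms = [f(x) - f_opt]               # f(x_t) - f(x*)
    terms += [g(x) for g in gs]          # inequality constraint values g_i(x)
    terms += [abs(h(x)) for h in hs]     # equality violations |h_j(x)|
    return max(terms)
```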
First order methods: Gradient descent
Consider the unconstrained optimization problem: min_{x ∈ R^n} f(x).
Gradient Descent (GD): x_{t+1} = x_t − η_t ∇f(x_t), t ≥ 0, starting from an initial guess x_0 ∈ R^n.
Any fixed point x⋆ of the iteration satisfies x⋆ = x⋆ − η_t ∇f(x⋆), which implies the stationarity condition ∇f(x⋆) = 0.
The convergence rate depends on the choice of step size η_t and on properties of the function.
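A minimal GD sketch in Python (the gradient oracle grad_f, the constant step size eta, and the horizon T are illustrative assumptions, not from the slides):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, T=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        x = x - eta * grad_f(x)   # x_{t+1} = x_t - eta_t * grad f(x_t)
    return x

# Example: f(x) = ||x - c||^2 has gradient 2(x - c); iterates converge to c.
c = np.array([1.0, -2.0])
x_hat = gradient_descent(lambda x: 2.0 * (x - c), np.zeros(2))
```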
Bounded Gradient: ||∇f(x)|| ≤ G for all x ∈ R^n.
Smooth: A differentiable convex f is β-smooth if for any x, y, we have
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (β/2)||y − x||².
We can obtain a quadratic upper bound on the function from local information.
Strongly Convex: A differentiable convex f is α-strongly convex if for any x, y, we have
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (α/2)||y − x||².
We can obtain a quadratic lower bound on the function from local information.
If f is twice di↵erentiable, then
– f is -smooth if and only if r2 f (x) I or max (r
2
f (x)) for all
x 2 Rn .
– f is ↵-strongly convex if and only if r2 f (x) ⌫ ↵I or min (r
2
f (x)) ↵
for all x 2 Rn .
Exercise: Determine β and α for f(x) = ||Ax − b||₂².
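A sketch of one way to answer this numerically (NumPy; the random matrix A is purely illustrative). Since ∇²f(x) = 2AᵀA, we have β = 2λ_max(AᵀA) and α = 2λ_min(AᵀA), with α > 0 only when A has full column rank:

```python
import numpy as np

A = np.random.randn(50, 5)               # illustrative data matrix
eigvals = np.linalg.eigvalsh(A.T @ A)    # eigenvalues of A^T A, ascending
beta = 2 * eigvals[-1]                   # smoothness constant
alpha = 2 * eigvals[0]                   # strong convexity constant
kappa = beta / alpha                     # condition number
```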
Gradient Descent with Bounded Gradient Assumption
Let x_0, x_1, . . . , x_T be the iterates generated by the GD algorithm.
For any t, we define x̄_t := (1/t) Σ_{i=0}^{t−1} x_i. Let x⋆ be the optimal solution.
Theorem 1: Convergence of Gradient Descent
Let the function f satisfy ||∇f(x)|| ≤ G for all x ∈ R^n. Let ||x_0 − x⋆|| ≤ D. Then, for the choice of step size η_t = D/(G√T), we have
f(x̄_T) − f(x⋆) ≤ DG/√T.
To find an ε-optimal solution, choose T ≥ (DG/ε)² and η = ε/G².
Possible Limitation: Need to know G and D.
Proof: Define the following (potential) function:
Φ_t := (1/(2η)) ||x_t − x⋆||².
Expanding the update x_{t+1} = x_t − η∇f(x_t),
Φ_{t+1} − Φ_t = −⟨∇f(x_t), x_t − x⋆⟩ + (η/2)||∇f(x_t)||² ≤ −(f(x_t) − f(x⋆)) + (η/2)G²,
where the inequality uses convexity and the gradient bound. Summing over t = 0, . . . , T−1, rearranging, and applying Jensen's inequality to x̄_T yields the stated bound.
Gradient Descent with Smoothness Assumption
Recall that a differentiable convex f is β-smooth if for any x, y, we have
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (β/2)||y − x||².
Theorem 2
Let the function f be β-smooth. Let ||x_0 − x⋆|| ≤ D. Then, for the choice of step size η_t = 1/β, we have
f(x_T) − f(x⋆) ≤ β||x_0 − x⋆||² / (2T).
Proof: Define the following (potential) function:
Φ_t := t[f(x_t) − f(x⋆)] + (β/2)||x_t − x⋆||².
We show that Φ_t is decreasing in t by computing Φ_{t+1} − Φ_t and bounding it using smoothness and convexity.
Gradient Descent with Smoothness and Strong Convexity
Recall that a differentiable convex f is α-strongly convex if for any x, y, we have
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (α/2)||y − x||².
Theorem 3
Let the function f be β-smooth and α-strongly convex with α ≤ β. Define the condition number κ := β/α. Then, for the choice of step size η_t = 1/β, we have
f(x_T) − f(x⋆) ≤ e^{−T/κ} (f(x_0) − f(x⋆)).
Note: To obtain an ε-optimal solution, choose T = O(κ log(1/ε)).
Proof: Define the following (potential) function:
Φ_t := (1 + δ)^t [f(x_t) − f(x⋆)], where δ = 1/(κ − 1) = α/(β − α).
We need to show that Φ_{t+1} ≤ Φ_t.
Summary of gradient descent convergence rates
Consider the unconstrained optimization problem: min_{x ∈ R^n} f(x).
Gradient Descent (GD): x_{t+1} = x_t − η_t ∇f(x_t), t ≥ 0, starting from an initial guess x_0 ∈ R^n.
Theorem 4: GD Convergence rates
Let ||x_0 − x⋆|| ≤ D.
If ||∇f(x)|| ≤ G for all x ∈ R^n, then with η_t = D/(G√T),
f(x̄_T) − f(x⋆) ≤ DG/√T.
If f is β-smooth, then for η_t = 1/β, f(x_T) − f(x⋆) ≤ β||x_0 − x⋆||²/(2T).
If f is β-smooth and α-strongly convex, then for η_t = 1/β,
f(x_T) − f(x⋆) ≤ e^{−T/κ} (f(x_0) − f(x⋆)), where κ := β/α is the condition number.
Gradient descent: Constrained Case
Consider the constrained optimization problem: min_{x ∈ X} f(x), where X ⊆ R^n is a convex feasible set.
Projected Gradient Descent (PGD): x_{t+1} = Π_X[x_t − η_t ∇f(x_t)], t ≥ 0, starting from an initial guess x_0 ∈ R^n, where Π_X(y) is the projection of y onto the set X.
Theorem 5
Let ||x_0 − x⋆|| ≤ D.
If ||∇f(x)|| ≤ G for all x ∈ R^n, then with η_t = D/(G√T),
f(x̄_T) − f(x⋆) ≤ DG/√T.
If f is β-smooth, then for η_t = 1/β, f(x_T) − f(x⋆) ≤ β||x_0 − x⋆||²/(2T).
If f is β-smooth and α-strongly convex, then for η_t = 1/β,
f(x_T) − f(x⋆) ≤ e^{−T/κ} (f(x_0) − f(x⋆)), where κ := β/α is the condition number.
Note: Convergence rates remain unchanged.
Note: Projection itself is another optimization problem!
Non-expansive property, which preserves the convergence rates:
||Π_X(y_1) − Π_X(y_2)|| ≤ ||y_1 − y_2||.
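A minimal PGD sketch in Python (grad_f, project, eta, and T are illustrative assumptions; project can be any of the closed-form projections sketched on the next slide):

```python
import numpy as np

def projected_gd(grad_f, project, x0, eta=0.1, T=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        x = project(x - eta * grad_f(x))  # x_{t+1} = Pi_X[x_t - eta_t grad f(x_t)]
    return x
```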
When is Projection easy to find?
Note that Π_X(y) = argmin_{x ∈ X} ||y − x||². Find a closed-form expression for the projection in each of the following cases (sketches for the first three follow the list).
X = {x ∈ R^n : ||x||₂ ≤ r}.
X = {x ∈ R^n : x_l ≤ x ≤ x_u}.
X = {x ∈ R^n : Ax = b}.
X = {x ∈ R^n : x ≥ 0, Σ_{i=1}^n x_i ≤ 1}.
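Hedged NumPy sketches of the closed forms for the first three sets (the fourth reduces to a simplex-type projection and needs a small algorithm rather than a one-line formula):

```python
import numpy as np

def proj_ball(y, r):                     # X = {x : ||x||_2 <= r}
    n = np.linalg.norm(y)
    return y if n <= r else (r / n) * y  # radially rescale if outside

def proj_box(y, xl, xu):                 # X = {x : xl <= x <= xu}
    return np.clip(y, xl, xu)            # coordinate-wise truncation

def proj_affine(y, A, b):                # X = {x : Ax = b}, A full row rank
    return y - A.T @ np.linalg.solve(A @ A.T, A @ y - b)
```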
Accelerated Gradient Descent
Start with x_0 = y_0 = z_0 ∈ R^n. At every time step t,
y_{t+1} = x_t − (1/β) ∇f(x_t),
z_{t+1} = z_t − η_t ∇f(x_t),
x_{t+1} = (1 − τ_{t+1}) y_{t+1} + τ_{t+1} z_{t+1}.
Theorem 6
Let f be β-smooth, η_t = (t+1)/(2β), and τ_t = 2/(t+2). Then, we have
f(y_T) − f(x⋆) ≤ 2β||x_0 − x⋆||² / (T(T+1)).
Proof: Define Φ_t = t(t+1)(f(y_t) − f(x⋆)) + 2β||z_t − x⋆||² and show that Φ_{t+1} ≤ Φ_t.
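A minimal sketch of this three-sequence scheme in Python (grad_f, beta, and T are user-supplied assumptions):

```python
import numpy as np

def agd(grad_f, beta, x0, T=1000):
    x = np.asarray(x0, dtype=float)
    y, z = x.copy(), x.copy()
    for t in range(T):
        g = grad_f(x)
        y = x - g / beta                    # y_{t+1} = x_t - (1/beta) grad f(x_t)
        z = z - (t + 1) / (2 * beta) * g    # z_{t+1} = z_t - eta_t grad f(x_t)
        tau = 2.0 / (t + 3)                 # tau_{t+1} = 2 / ((t+1) + 2)
        x = (1 - tau) * y + tau * z         # x_{t+1}
    return y
```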
Accelerated Gradient Descent 2
Start with x_0 = y_0. At every time step t,
y_{t+1} = x_t − (1/β) ∇f(x_t),
x_{t+1} = (1 + (√κ − 1)/(√κ + 1)) y_{t+1} − ((√κ − 1)/(√κ + 1)) y_t.
Theorem 7
Let f be β-smooth and α-strongly convex with κ = β/α, and let δ = 1/(√κ − 1). Then, we have
f(y_T) − f(x⋆) ≤ (1 + δ)^{−T} · ((α + β)/2) · ||x_0 − x⋆||².
Improvement upon the previous rate, where we had δ = 1/(κ − 1).
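A minimal sketch of this strongly convex variant in Python (the momentum coefficient (√κ − 1)/(√κ + 1) follows the update above; inputs are assumptions):

```python
import numpy as np

def agd_strongly_convex(grad_f, alpha, beta, x0, T=1000):
    kappa = beta / alpha
    m = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # momentum coefficient
    x = np.asarray(x0, dtype=float)
    y_prev = x.copy()
    for _ in range(T):
        y = x - grad_f(x) / beta        # y_{t+1} = x_t - (1/beta) grad f(x_t)
        x = (1 + m) * y - m * y_prev    # x_{t+1} = (1+m) y_{t+1} - m y_t
        y_prev = y
    return y_prev
```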
Further details
AGD was invented by Nesterov in a series of papers in the 80s and early 2000s, and later popularized by ML researchers.
The convergence rates in the previous two theorems are the best possible ones.
Book by Nesterov:
https://link.springer.com/book/10.1007/978-1-4419-8853-9
https://francisbach.com/continuized-acceleration/
https://www.nowpublishers.com/article/Details/OPT-036
Finite Sum Setting
A large number of problems that arise in (supervised) ML can be written as
min_{x ∈ R^n} f(x) := (1/N) Σ_{i=1}^N f_i(x) = (1/N) Σ_{i=1}^N l(x, ξ_i).
Examples: Regression/Least Squares, SVM, NN Training.
The above problem can also be viewed as a sample average approximation of the stochastic optimization problem
f(x) = E[l(x, ξ)]
involving an uncertain parameter or random variable ξ.
Challenge: N (the number of samples) or n (the dimension of the decision variable) may both be large. Samples may be located on different servers.
Gradient Descent vs. Stochastic Gradient Descent
Gradient Descent (GD): x_{t+1} = x_t − η_t ∇f(x_t) = x_t − (η_t/N) Σ_{i=1}^N ∇f_i(x_t), t ≥ 0, starting from an initial guess x_0 ∈ R^n.
Each step requires N gradient computations.
Stochastic Gradient Descent (SGD): At every time step t,
Pick an index (sample) i_t uniformly at random from the set {1, 2, . . . , N}.
Set x_{t+1} = x_t − η_t ∇f_{i_t}(x_t).
Each step requires one gradient computation, which is a noisy version of the true gradient of the cost function at x_t.
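A minimal SGD sketch in Python for the finite-sum problem (grad_fi(x, i) returning ∇f_i(x) is an assumed user-supplied oracle):

```python
import numpy as np

def sgd(grad_fi, N, x0, eta=0.01, T=10000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        i = rng.integers(N)             # i_t uniform over {0, ..., N-1}
        x = x - eta * grad_fi(x, i)     # x_{t+1} = x_t - eta_t grad f_{i_t}(x_t)
    return x
```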
Key result for SGD convergence
Under the following assumptions:
Convexity: each f_i is convex,
Bounded variance: E[||∇f_{i_t}(x)||²] ≤ σ² for some σ and all x,
Unbiased gradient estimate: E[∇f_{i_t}(x)] = ∇f(x) for all x,
the solutions generated by the SGD algorithm satisfy
Σ_{t=0}^{T−1} η_t [E[f(x_t)] − f(x⋆)] ≤ (1/2)||x_0 − x⋆||² + (σ²/2) Σ_{t=0}^{T−1} η_t²
⟹ E[f(x̄_T)] − f(x⋆) ≤ ||x_0 − x⋆||² / (2 Σ_{t=0}^{T−1} η_t) + σ² Σ_{t=0}^{T−1} η_t² / (2 Σ_{t=0}^{T−1} η_t),
where x̄_T = (Σ_{t=0}^{T−1} η_t)^{−1} Σ_{t=0}^{T−1} η_t x_t.
Choice of stepsize
Constant step-size will not give us convergence. For convergence, we need to choose step sizes that are diminishing and square-summable, i.e.,
lim_{T→∞} Σ_{t=0}^{T−1} η_t = ∞,  lim_{T→∞} Σ_{t=0}^{T−1} η_t² < ∞.
If η_t := 1/(c√(t+1)), then E[f(x̄_T)] − f(x⋆) ≤ O(log T / √T). This rate does not improve when the function is smooth.
When the function is smooth, for η_t := η chosen appropriately, the R.H.S. will be of order O(1/(ηT)) + O(η).
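A sketch combining the diminishing schedule η_t = 1/(c√(t+1)) with the weighted average iterate x̄_T from the previous slide (grad_fi and the constant c are illustrative assumptions):

```python
import numpy as np

def sgd_weighted_average(grad_fi, N, x0, c=1.0, T=10000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    weighted_sum, eta_sum = np.zeros_like(x), 0.0
    for t in range(T):
        eta = 1.0 / (c * np.sqrt(t + 1))   # diminishing step size
        weighted_sum += eta * x            # accumulate eta_t * x_t
        eta_sum += eta
        i = rng.integers(N)
        x = x - eta * grad_fi(x, i)
    return weighted_sum / eta_sum          # bar{x}_T
```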
Analysis for Smooth and Strongly Convex Functions
When the function f is β-smooth and α-strongly convex, we have the following guarantees for SGD after T iterations.
If η_t := 1/(ct) for a suitable constant c, then the error bound is O(log T / T). This can be improved to O(1/T).
If η_t := η, then the error bound is
E[||x_T − x⋆||²] ≤ (1 − ηα)^T ||x_0 − x⋆||² + ησ²/(2α).
2↵
With constant step-size ⌘ < ↵1 , convergence is quick to a neighborhood of the
optimal solution.
Extension: Mini-Batch
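The slide leaves details to the lecture; a minimal hedged sketch of the mini-batch idea (average B sampled gradients per step to reduce the variance of the gradient estimate; all names are illustrative):

```python
import numpy as np

def minibatch_sgd(grad_fi, N, x0, B=32, eta=0.01, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        batch = rng.choice(N, size=B, replace=False)         # sample a mini-batch
        g = np.mean([grad_fi(x, i) for i in batch], axis=0)  # batch-averaged gradient
        x = x - eta * g
    return x
```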
Extension: Stochastic Averaging
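The slide defers details to SAG/SAGA (see Further Reading below); a minimal hedged SAGA-style sketch: store the last gradient seen for each f_i and use the table to build a variance-reduced update (all names are illustrative):

```python
import numpy as np

def saga(grad_fi, N, x0, eta=0.01, T=10000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    table = np.array([grad_fi(x, i) for i in range(N)])  # stored per-sample gradients
    avg = table.mean(axis=0)                             # their running average
    for _ in range(T):
        i = rng.integers(N)
        g_new = grad_fi(x, i)
        x = x - eta * (g_new - table[i] + avg)           # variance-reduced direction
        avg += (g_new - table[i]) / N                    # maintain the running average
        table[i] = g_new
    return x
```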
Further Reading
SAG: Schmidt, Mark, Nicolas Le Roux, and Francis Bach. “Minimizing finite
sums with the stochastic average gradient.” Mathematical Programming
162 (2017): 83-112.
SAGA: Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien. “SAGA:
A fast incremental gradient method with support for non-strongly convex
composite objectives.” Advances in neural information processing systems
27 (2014).
Recent Review: Gower, Robert M., Mark Schmidt, Francis Bach, and Peter
Richtárik. “Variance-reduced methods for machine learning.” Proceedings
of the IEEE 108, no. 11 (2020): 1968-1983.
Allen-Zhu, Zeyuan. “Katyusha: The First Direct Acceleration of Stochastic
Gradient Methods.” Journal of Machine Learning Research 18 (2018): 1-51.
Varre, Aditya, and Nicolas Flammarion. “Accelerated SGD for non-strongly-
convex least squares.” In Conference on Learning Theory, pp. 2062-2126.
PMLR, 2022.
Hanzely, Filip, Konstantin Mishchenko, and Peter Richtárik. “SEGA: Variance reduction via gradient sketching.” Advances in Neural Information Processing Systems 31 (2018).
Extension: Adaptive Step-sizes
AdaGrad: Duchi, John, Elad Hazan, and Yoram Singer. “Adaptive subgradient methods for online learning and stochastic optimization.” Journal of Machine Learning Research 12, no. 7 (2011).
Adam: Kingma, Diederik P., and Jimmy Ba. “Adam: A method for stochastic optimization.” arXiv preprint arXiv:1412.6980 (2014).
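A minimal hedged AdaGrad-style sketch (per-coordinate step sizes scaled by accumulated squared gradients; see Duchi et al. above for the actual method, including its subgradient and regret analysis):

```python
import numpy as np

def adagrad(grad_f, x0, eta=0.1, T=1000, eps=1e-8):
    x = np.asarray(x0, dtype=float)
    s = np.zeros_like(x)                       # running sum of squared gradients
    for _ in range(T):
        g = grad_f(x)
        s += g * g
        x = x - eta * g / (np.sqrt(s) + eps)   # coordinate-wise scaled step
    return x
```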