Module C: Algorithms for Optimization
Recall that an optimization problem in standard form is given by
min_{x ∈ R^n}  f(x)
s.t.  g_i(x) ≤ 0,  i ∈ [m] := {1, 2, . . . , m},
      h_j(x) = 0,  j ∈ [p].
Most algorithms generate a sequence x_0, x_1, x_2, . . . by exploiting local information collected along the path.
Zeroth Order: Only the values f(x_t), g_i(x_t), h_j(x_t) are available.
First Order: Gradients ∇f(x_t), ∇g_i(x_t), ∇h_j(x_t) are used. Heavily used in ML.
Second Order: Hessian information is used, e.g., Newton's method.
Distributed Algorithms
Stochastic/Randomized Algorithms
Measure of progress
Let x⋆ be the optimal solution. The iterative algorithms continue until one of the following error metrics is sufficiently small.
err_t := ||x_t − x⋆||
err_t := f(x_t) − f(x⋆)
A solution x̄ is ε-optimal when
f(x̄) ≤ f(x⋆) + ε.
We often run the algorithm until err_t is smaller than a sufficiently small ε > 0.
In the presence of constraints, we define
err_t := max( f(x_t) − f(x⋆), g_1(x_t), g_2(x_t), . . . , g_m(x_t), |h_1(x_t)|, . . . , |h_p(x_t)| ).
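As a quick illustration, a minimal Python sketch of this constrained error metric (the callables f, gs, hs and the optimal value f_opt are hypothetical placeholders, not from the slides):

```python
def constrained_error(x, f, f_opt, gs, hs):
    """err_t = max of suboptimality, g_i(x), and |h_j(x)|."""
    terms = [f(x) - f_opt]               # f(x_t) - f(x*)
    terms += [g(x) for g in gs]          # inequality constraint values g_i(x)
    terms += [abs(h(x)) for h in hs]     # equality violations |h_j(x)|
    return max(terms)
```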
First order methods: Gradient descent
Consider the unconstrained optimization problem: min_{x ∈ R^n} f(x).
Gradient Descent (GD): x_{t+1} = x_t − η_t ∇f(x_t), t ≥ 0, starting from an initial guess x_0 ∈ R^n.
Any fixed point x⋆ of the iteration satisfies x⋆ = x⋆ − η_t ∇f(x⋆), which implies the stationarity condition ∇f(x⋆) = 0.
The convergence rate depends on the choice of step size η_t and on properties of the function.
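A minimal GD sketch in Python (the gradient oracle grad_f, the constant step size eta, and the horizon T are illustrative assumptions, not from the slides):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, T=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        x = x - eta * grad_f(x)   # x_{t+1} = x_t - eta_t * grad f(x_t)
    return x

# Example: f(x) = ||x - c||^2 has gradient 2(x - c); iterates converge to c.
c = np.array([1.0, -2.0])
x_hat = gradient_descent(lambda x: 2.0 * (x - c), np.zeros(2))
```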
Bounded Gradient: ||∇f(x)|| ≤ G for all x ∈ R^n.
Smooth: A differentiable convex f is β-smooth if for any x, y, we have
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (β/2)||y − x||².
We can obtain a quadratic upper bound on the function from local information.
Strongly Convex: A differentiable convex f is α-strongly convex if for any x, y, we have
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (α/2)||y − x||².
We can obtain a quadratic lower bound on the function from local information.
If f is twice di↵erentiable, then
– f is -smooth if and only if r2 f (x) I or max (r
2
f (x)) for all
x 2 Rn .
– f is ↵-strongly convex if and only if r2 f (x) ⌫ ↵I or min (r
2
f (x)) ↵
for all x 2 Rn .
Exercise: Determine β and α for f(x) = ||Ax − b||₂².
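A sketch of one way to answer this numerically (NumPy; the random matrix A is purely illustrative). Since ∇²f(x) = 2AᵀA, we have β = 2λ_max(AᵀA) and α = 2λ_min(AᵀA), with α > 0 only when A has full column rank:

```python
import numpy as np

A = np.random.randn(50, 5)               # illustrative data matrix
eigvals = np.linalg.eigvalsh(A.T @ A)    # eigenvalues of A^T A, ascending
beta = 2 * eigvals[-1]                   # smoothness constant
alpha = 2 * eigvals[0]                   # strong convexity constant
kappa = beta / alpha                     # condition number
```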
Gradient Descent with Bounded Gradient Assumption
Let x_0, x_1, . . . , x_T be the iterates generated by the GD algorithm.
For any t, we define x̄_t := (1/t) Σ_{i=0}^{t−1} x_i. Let x⋆ be the optimal solution.
Theorem 1: Convergence of Gradient Descent
Let the function f satisfy ||∇f(x)|| ≤ G for all x ∈ R^n. Let ||x_0 − x⋆|| ≤ D. Then, for the choice of step size η_t = D/(G√T), we have
f(x̄_T) − f(x⋆) ≤ DG/√T.
To find an ε-optimal solution, choose T ≥ (DG/ε)² and η = ε/G².
Possible Limitation: Need to know G and D.
Proof: Define the following (potential) function:
Φ_t := (1/(2η)) ||x_t − x⋆||².
Expanding the update x_{t+1} = x_t − η∇f(x_t),
Φ_{t+1} − Φ_t = −⟨∇f(x_t), x_t − x⋆⟩ + (η/2)||∇f(x_t)||² ≤ −(f(x_t) − f(x⋆)) + (η/2)G²,
where the inequality uses convexity and the gradient bound. Summing over t = 0, . . . , T−1, rearranging, and applying Jensen's inequality to x̄_T yields the stated bound.
Gradient Descent with Smoothness Assumption
Recall that a differentiable convex f is β-smooth if for any x, y, we have
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (β/2)||y − x||².
Theorem 2
Let the function f be β-smooth. Let ||x_0 − x⋆|| ≤ D. Then, for the choice of step size η_t = 1/β, we have
f(x_T) − f(x⋆) ≤ β||x_0 − x⋆||² / (2T).
Proof: Define the following (potential) function:
Φ_t := t[f(x_t) − f(x⋆)] + (β/2)||x_t − x⋆||².
We show that Φ_t is decreasing in t by computing Φ_{t+1} − Φ_t and bounding it using smoothness and convexity.
Gradient Descent with Smoothness and Strong Convexity
Recall that a differentiable convex f is α-strongly convex if for any x, y, we have
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (α/2)||y − x||².
Theorem 3
Let the function f be β-smooth and α-strongly convex with α ≤ β. Define the condition number κ := β/α. Then, for the choice of step size η_t = 1/β, we have
f(x_T) − f(x⋆) ≤ e^{−T/κ} (f(x_0) − f(x⋆)).
Note: To obtain an ε-optimal solution, choose T = O(κ log(1/ε)).
Proof: Define the following (potential) function:
Φ_t := (1 + δ)^t [f(x_t) − f(x⋆)], where δ = 1/(κ − 1) = α/(β − α).
We need to show that Φ_{t+1} ≤ Φ_t.
Summary of gradient descent convergence rates
Consider the unconstrained optimization problem: min_{x ∈ R^n} f(x).
Gradient Descent (GD): x_{t+1} = x_t − η_t ∇f(x_t), t ≥ 0, starting from an initial guess x_0 ∈ R^n.
Theorem 4: GD Convergence rates
Let ||x_0 − x⋆|| ≤ D.
If ||∇f(x)|| ≤ G for all x ∈ R^n, then with η_t = D/(G√T),
f(x̄_T) − f(x⋆) ≤ DG/√T.
If f is β-smooth, then for η_t = 1/β, f(x_T) − f(x⋆) ≤ β||x_0 − x⋆||²/(2T).
If f is β-smooth and α-strongly convex, then for η_t = 1/β,
f(x_T) − f(x⋆) ≤ e^{−T/κ} (f(x_0) − f(x⋆)), where κ := β/α is the condition number.
Gradient descent: Constrained Case
Consider the constrained optimization problem: min_{x ∈ X} f(x), where X ⊆ R^n is a convex feasible set.
Projected Gradient Descent (PGD): x_{t+1} = Π_X[x_t − η_t ∇f(x_t)], t ≥ 0, starting from an initial guess x_0 ∈ R^n, where Π_X(y) is the projection of y onto the set X.
Theorem 5
Let ||x_0 − x⋆|| ≤ D.
If ||∇f(x)|| ≤ G for all x ∈ R^n, then with η_t = D/(G√T),
f(x̄_T) − f(x⋆) ≤ DG/√T.
If f is β-smooth, then for η_t = 1/β, f(x_T) − f(x⋆) ≤ β||x_0 − x⋆||²/(2T).
If f is β-smooth and α-strongly convex, then for η_t = 1/β,
f(x_T) − f(x⋆) ≤ e^{−T/κ} (f(x_0) − f(x⋆)), where κ := β/α is the condition number.
Note: Convergence rates remain unchanged.
Note: Projection itself is another optimization problem!
Non-expansive property, which preserves the convergence rates:
||Π_X(y_1) − Π_X(y_2)|| ≤ ||y_1 − y_2||.
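A minimal PGD sketch in Python (grad_f, project, eta, and T are illustrative assumptions; project can be any of the closed-form projections sketched on the next slide):

```python
import numpy as np

def projected_gd(grad_f, project, x0, eta=0.1, T=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        x = project(x - eta * grad_f(x))  # x_{t+1} = Pi_X[x_t - eta_t grad f(x_t)]
    return x
```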
When is Projection easy to find?
Note that Π_X(y) = argmin_{x ∈ X} ||y − x||². Find a closed-form expression for the projection in each of the following cases (sketches for the first three follow the list).
X = {x ∈ R^n : ||x||₂ ≤ r}.
X = {x ∈ R^n : x_l ≤ x ≤ x_u}.
X = {x ∈ R^n : Ax = b}.
X = {x ∈ R^n : x ≥ 0, Σ_{i=1}^n x_i ≤ 1}.
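Hedged NumPy sketches of the closed forms for the first three sets (the fourth reduces to a simplex-type projection and needs a small algorithm rather than a one-line formula):

```python
import numpy as np

def proj_ball(y, r):                     # X = {x : ||x||_2 <= r}
    n = np.linalg.norm(y)
    return y if n <= r else (r / n) * y  # radially rescale if outside

def proj_box(y, xl, xu):                 # X = {x : xl <= x <= xu}
    return np.clip(y, xl, xu)            # coordinate-wise truncation

def proj_affine(y, A, b):                # X = {x : Ax = b}, A full row rank
    return y - A.T @ np.linalg.solve(A @ A.T, A @ y - b)
```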
Accelerated Gradient Descent
Start with x_0 = y_0 = z_0 ∈ R^n. At every time step t,
y_{t+1} = x_t − (1/β) ∇f(x_t),
z_{t+1} = z_t − η_t ∇f(x_t),
x_{t+1} = (1 − τ_{t+1}) y_{t+1} + τ_{t+1} z_{t+1}.
Theorem 6
Let f be β-smooth, η_t = (t+1)/(2β), and τ_t = 2/(t+2). Then, we have
f(y_T) − f(x⋆) ≤ 2β||x_0 − x⋆||² / (T(T+1)).
Proof: Define Φ_t = t(t+1)(f(y_t) − f(x⋆)) + 2β||z_t − x⋆||² and show that Φ_{t+1} ≤ Φ_t.
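A minimal sketch of this three-sequence scheme in Python (grad_f, beta, and T are user-supplied assumptions):

```python
import numpy as np

def agd(grad_f, beta, x0, T=1000):
    x = np.asarray(x0, dtype=float)
    y, z = x.copy(), x.copy()
    for t in range(T):
        g = grad_f(x)
        y = x - g / beta                    # y_{t+1} = x_t - (1/beta) grad f(x_t)
        z = z - (t + 1) / (2 * beta) * g    # z_{t+1} = z_t - eta_t grad f(x_t)
        tau = 2.0 / (t + 3)                 # tau_{t+1} = 2 / ((t+1) + 2)
        x = (1 - tau) * y + tau * z         # x_{t+1}
    return y
```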
Accelerated Gradient Descent 2
Start with x_0 = y_0. At every time step t,
y_{t+1} = x_t − (1/β) ∇f(x_t),
x_{t+1} = (1 + (√κ − 1)/(√κ + 1)) y_{t+1} − ((√κ − 1)/(√κ + 1)) y_t.
Theorem 7
Let f be β-smooth and α-strongly convex with κ = β/α, and let δ = 1/(√κ − 1). Then, we have
f(y_T) − f(x⋆) ≤ (1 + δ)^{−T} · ((α + β)/2) · ||x_0 − x⋆||².
Improvement upon the previous rate, where we had δ = 1/(κ − 1).
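A minimal sketch of this strongly convex variant in Python (the momentum coefficient (√κ − 1)/(√κ + 1) follows the update above; inputs are assumptions):

```python
import numpy as np

def agd_strongly_convex(grad_f, alpha, beta, x0, T=1000):
    kappa = beta / alpha
    m = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # momentum coefficient
    x = np.asarray(x0, dtype=float)
    y_prev = x.copy()
    for _ in range(T):
        y = x - grad_f(x) / beta        # y_{t+1} = x_t - (1/beta) grad f(x_t)
        x = (1 + m) * y - m * y_prev    # x_{t+1} = (1+m) y_{t+1} - m y_t
        y_prev = y
    return y_prev
```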
Further details
AGD was invented by Nesterov in a series of papers in the 80s and early 2000s, and later popularized by ML researchers.
The convergence rates in the previous two theorems are the best possible ones.
Book by Nesterov:
https://link.springer.com/book/10.1007/978-1-4419-8853-9
https://francisbach.com/continuized-acceleration/
https://www.nowpublishers.com/article/Details/OPT-036
Finite Sum Setting
A large number of problems that arise in (supervised) ML can be written as
min_{x ∈ R^n} f(x) := (1/N) Σ_{i=1}^N f_i(x) = (1/N) Σ_{i=1}^N l(x, ξ_i).
Examples: Regression/Least Squares, SVM, NN Training.
The above problem can also be viewed as a sample average approximation of the stochastic optimization problem
f(x) = E[l(x, ξ)]
involving an uncertain parameter or random variable ξ.
Challenge: N (the number of samples) or n (the dimension of the decision variable) may both be large. Samples may be located on different servers.
Gradient Descent vs. Stochastic Gradient Descent
Gradient Descent (GD): x_{t+1} = x_t − η_t ∇f(x_t) = x_t − (η_t/N) Σ_{i=1}^N ∇f_i(x_t), t ≥ 0, starting from an initial guess x_0 ∈ R^n.
Each step requires N gradient computations.
Stochastic Gradient Descent (SGD): At every time step t,
Pick an index (sample) i_t uniformly at random from the set {1, 2, . . . , N}.
Set x_{t+1} = x_t − η_t ∇f_{i_t}(x_t).
Each step requires one gradient computation, which is a noisy version of the true gradient of the cost function at x_t.
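A minimal SGD sketch in Python for the finite-sum problem (grad_fi(x, i) returning ∇f_i(x) is an assumed user-supplied oracle):

```python
import numpy as np

def sgd(grad_fi, N, x0, eta=0.01, T=10000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        i = rng.integers(N)             # i_t uniform over {0, ..., N-1}
        x = x - eta * grad_fi(x, i)     # x_{t+1} = x_t - eta_t grad f_{i_t}(x_t)
    return x
```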
Key result for SGD convergence
Under the following assumptions:
Convexity: each f_i is convex,
Bounded variance: E[||∇f_{i_t}(x)||²] ≤ σ² for some σ and all x,
Unbiased gradient estimate: E[∇f_{i_t}(x)] = ∇f(x) for all x,
the solutions generated by the SGD algorithm satisfy
Σ_{t=0}^{T−1} η_t [E[f(x_t)] − f(x⋆)] ≤ (1/2)||x_0 − x⋆||² + (σ²/2) Σ_{t=0}^{T−1} η_t²
⟹ E[f(x̄_T)] − f(x⋆) ≤ ||x_0 − x⋆||² / (2 Σ_{t=0}^{T−1} η_t) + σ² Σ_{t=0}^{T−1} η_t² / (2 Σ_{t=0}^{T−1} η_t),
where x̄_T = (Σ_{t=0}^{T−1} η_t)^{−1} Σ_{t=0}^{T−1} η_t x_t.
Choice of stepsize
Constant step-size will not give us convergence. For convergence, we need to choose step sizes that are diminishing and square-summable, i.e.,
lim_{T→∞} Σ_{t=0}^{T−1} η_t = ∞,  lim_{T→∞} Σ_{t=0}^{T−1} η_t² < ∞.
If η_t := 1/(c√(t+1)), then E[f(x̄_T)] − f(x⋆) ≤ O(log T / √T). This rate does not improve when the function is smooth.
When the function is smooth, for η_t := η chosen appropriately, the R.H.S. will be of order O(1/(ηT)) + O(η).
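A sketch combining the diminishing schedule η_t = 1/(c√(t+1)) with the weighted average iterate x̄_T from the previous slide (grad_fi and the constant c are illustrative assumptions):

```python
import numpy as np

def sgd_weighted_average(grad_fi, N, x0, c=1.0, T=10000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    weighted_sum, eta_sum = np.zeros_like(x), 0.0
    for t in range(T):
        eta = 1.0 / (c * np.sqrt(t + 1))   # diminishing step size
        weighted_sum += eta * x            # accumulate eta_t * x_t
        eta_sum += eta
        i = rng.integers(N)
        x = x - eta * grad_fi(x, i)
    return weighted_sum / eta_sum          # bar{x}_T
```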
Analysis for Smooth and Strongly Convex Functions
When the function f is β-smooth and α-strongly convex, we have the following guarantees for SGD after T iterations.
If η_t := 1/(ct) for a suitable constant c, then the error bound is O(log T / T). This can be improved to O(1/T).
If η_t := η, then the error bound is
E[||x_T − x⋆||²] ≤ (1 − ηα)^T ||x_0 − x⋆||² + ησ²/(2α).
2↵
With constant step-size ⌘ < ↵1 , convergence is quick to a neighborhood of the
optimal solution.
Extension: Mini-Batch
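The slide leaves details to the lecture; a minimal hedged sketch of the mini-batch idea (average B sampled gradients per step to reduce the variance of the gradient estimate; all names are illustrative):

```python
import numpy as np

def minibatch_sgd(grad_fi, N, x0, B=32, eta=0.01, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        batch = rng.choice(N, size=B, replace=False)         # sample a mini-batch
        g = np.mean([grad_fi(x, i) for i in batch], axis=0)  # batch-averaged gradient
        x = x - eta * g
    return x
```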
Extension: Stochastic Averaging
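The slide defers details to SAG/SAGA (see Further Reading below); a minimal hedged SAGA-style sketch: store the last gradient seen for each f_i and use the table to build a variance-reduced update (all names are illustrative):

```python
import numpy as np

def saga(grad_fi, N, x0, eta=0.01, T=10000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    table = np.array([grad_fi(x, i) for i in range(N)])  # stored per-sample gradients
    avg = table.mean(axis=0)                             # their running average
    for _ in range(T):
        i = rng.integers(N)
        g_new = grad_fi(x, i)
        x = x - eta * (g_new - table[i] + avg)           # variance-reduced direction
        avg += (g_new - table[i]) / N                    # maintain the running average
        table[i] = g_new
    return x
```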
Further Reading
SAG: Schmidt, Mark, Nicolas Le Roux, and Francis Bach. “Minimizing finite
sums with the stochastic average gradient.” Mathematical Programming
162 (2017): 83-112.
SAGA: Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien. “SAGA:
A fast incremental gradient method with support for non-strongly convex
composite objectives.” Advances in neural information processing systems
27 (2014).
Recent Review: Gower, Robert M., Mark Schmidt, Francis Bach, and Peter
Richtárik. “Variance-reduced methods for machine learning.” Proceedings
of the IEEE 108, no. 11 (2020): 1968-1983.
Allen-Zhu, Zeyuan. “Katyusha: The First Direct Acceleration of Stochastic
Gradient Methods.” Journal of Machine Learning Research 18 (2018): 1-51.
Varre, Aditya, and Nicolas Flammarion. “Accelerated SGD for non-strongly-
convex least squares.” In Conference on Learning Theory, pp. 2062-2126.
PMLR, 2022.
Hanzely, Filip, Konstantin Mishchenko, and Peter Richtárik. “SEGA: Variance reduction via gradient sketching.” Advances in Neural Information Processing Systems 31 (2018).
Extension: Adaptive Step-sizes
AdaGrad: Duchi, John, Elad Hazan, and Yoram Singer. “Adaptive subgradient methods for online learning and stochastic optimization.” Journal of Machine Learning Research 12, no. 7 (2011).
Adam: Kingma, Diederik P., and Jimmy Ba. “Adam: A method for stochastic optimization.” arXiv preprint arXiv:1412.6980 (2014).
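A minimal hedged AdaGrad-style sketch (per-coordinate step sizes scaled by accumulated squared gradients; see Duchi et al. above for the actual method, including its subgradient and regret analysis):

```python
import numpy as np

def adagrad(grad_f, x0, eta=0.1, T=1000, eps=1e-8):
    x = np.asarray(x0, dtype=float)
    s = np.zeros_like(x)                       # running sum of squared gradients
    for _ in range(T):
        g = grad_f(x)
        s += g * g
        x = x - eta * g / (np.sqrt(s) + eps)   # coordinate-wise scaled step
    return x
```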