EE236C (Spring 2010-11)
8. Fast gradient methods
fast proximal gradient method (FISTA)
Nesterov's second method
8-1
Fast (proximal) gradient methods
Nesterov (1983, 1988, 2005): three gradient projection methods with 1/k² convergence rate
Beck & Teboulle (2008): FISTA, a proximal gradient version of
Nesterov's 1983 method
Nesterov (2004 book), Tseng (2008): overview and unified analysis of
fast gradient methods
several recent variations and extensions
this lecture:
FISTA and Nesterov's 2nd method (1988) as presented by Tseng
8-2
Outline
fast proximal gradient method (FISTA)
Nesterov's second method
Fast proximal gradient method
convex problem with composite objective
minimize   f(x) = g(x) + h(x)
g differentiable with dom g = R^n; h has inexpensive prox_{th} operator
algorithm: choose x^{(0)} = y^{(0)} ∈ dom h; for k ≥ 1
    x^{(k)} = \mathrm{prox}_{t_k h}\left( y^{(k-1)} - t_k \nabla g(y^{(k-1)}) \right)

    y^{(k)} = x^{(k)} + \frac{k-1}{k+2} \left( x^{(k)} - x^{(k-1)} \right)
known as FISTA (Fast Iterative Shrinkage-Thresholding Algorithm)
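a minimal NumPy sketch of the iteration above with a fixed step size t; the names fista, grad_g, prox_h are illustrative (not from the slides), with grad_g(y) returning ∇g(y) and prox_h(w, t) returning prox_{th}(w)

    import numpy as np

    def fista(grad_g, prox_h, x0, t, num_iters=200):
        """Sketch of FISTA with fixed step size t (caller supplies grad_g and prox_h)."""
        x = y = np.asarray(x0, dtype=float)
        for k in range(1, num_iters + 1):
            x_prev = x
            x = prox_h(y - t * grad_g(y), t)              # proximal gradient step at y^(k-1)
            y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)  # extrapolation gives y^(k)
        return x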
8-3
Interpretation
first iteration (k = 1) is a proximal gradient step at x^{(0)}

next iterations are proximal gradient steps at extrapolated points y^{(k-1)}:

    x^{(k)} = \mathrm{prox}_{t_k h}\left( y^{(k-1)} - t_k \nabla g(y^{(k-1)}) \right)

[figure: extrapolation step, showing x^{(k-2)}, x^{(k-1)}, and the extrapolated point y^{(k-1)}]

sequence x^{(k)} remains feasible (in dom h); sequence y^{(k)} not necessarily
8-4
Example
minimize   \log \sum_{i=1}^{m} \exp(a_i^T x + b_i)
randomly generated data with m = 2000, n = 1000, same fixed step size
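for reference, a small NumPy sketch of this objective and its gradient (here h = 0, so the prox step in the FISTA sketch on page 8-3 reduces to the identity); the arrays A, b are assumed given

    import numpy as np

    def logsumexp_and_grad(x, A, b):
        """g(x) = log sum_i exp(a_i^T x + b_i) and its gradient, with a max shift for stability."""
        z = A @ x + b
        zmax = z.max()
        w = np.exp(z - zmax)              # unnormalized softmax weights
        val = zmax + np.log(w.sum())      # objective value g(x)
        grad = A.T @ (w / w.sum())        # gradient = A^T softmax(z)
        return val, grad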
[figure: (f(x^{(k)}) − f⋆)/|f⋆| versus k for the gradient method and FISTA]
8-5
another instance
[figure: (f(x^{(k)}) − f⋆)/|f⋆| versus k for the gradient method and FISTA]
FISTA is not a descent method
8-6
Convergence of FISTA
assumptions
optimal value f⋆ is finite and attained at x⋆ (not necessarily unique)

dom g = R^n and ∇g is Lipschitz continuous with constant L > 0:

    \|\nabla g(x) - \nabla g(y)\|_2 \le L \|x - y\|_2 \qquad \forall x, y

h is closed and convex (hence prox_{th}(u) exists and is unique for all u)
result: f(x^{(k)}) − f⋆ decreases at least as fast as 1/k²
if fixed step size t_k = 1/L is used
if backtracking line search is used
8-7
Reformulation of FISTA
define θ_k = 2/(k + 1) and introduce an intermediate variable v^{(k)}

algorithm: choose x^{(0)} = y^{(0)} = v^{(0)} ∈ dom h; for k ≥ 1

    1.  x^{(k)} = \mathrm{prox}_{t_k h}\left( y^{(k-1)} - t_k \nabla g(y^{(k-1)}) \right)

    2.  v^{(k)} = x^{(k-1)} + \frac{1}{\theta_k} \left( x^{(k)} - x^{(k-1)} \right)

    3.  y^{(k)} = (1 - \theta_{k+1}) x^{(k)} + \theta_{k+1} v^{(k)}
substituting the expression for v^{(k)} in step 3 gives the algorithm on page 8-3
θ_k = 2/(k + 1) satisfies

    \frac{1 - \theta_k}{\theta_k^2} \le \frac{1}{\theta_{k-1}^2}, \qquad k \ge 2
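a direct check of this inequality (for k ≥ 2):

    \frac{1 - \theta_k}{\theta_k^2} = \frac{k-1}{k+1} \cdot \frac{(k+1)^2}{4} = \frac{k^2 - 1}{4} \le \frac{k^2}{4} = \frac{1}{\theta_{k-1}^2}

(also, θ_1 = 1, which is what makes the k = 1 terms drop out in the recursions on pages 8-12 and 8-14)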
8-8
Key inequalities
upper bound from Lipschitz property
    g(u) \le g(z) + \nabla g(z)^T (u - z) + \frac{L}{2} \|u - z\|_2^2 \qquad \forall u, z
property of proximal operators: if u = prox_{th}(w), then

    h(u) \le h(z) + \frac{1}{t} (w - u)^T (u - z) \qquad \forall z

this follows from the subgradient characterization of the prox-operator (page 4-15):

    u = \mathrm{prox}_{th}(w) \quad \Longleftrightarrow \quad w - u \in t \, \partial h(u)
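spelling out the step: (w − u)/t ∈ ∂h(u), so the subgradient inequality for h at u gives

    h(z) \ge h(u) + \frac{1}{t} (w - u)^T (z - u) \qquad \forall z,

which rearranges to the inequality above.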
8-9
Progress in one iteration
define x = x^{(i-1)}, x^+ = x^{(i)}, y = y^{(i-1)}, v = v^{(i-1)}, v^+ = v^{(i)}, t = t_i, θ = θ_i

from the Lipschitz property, if t ≤ 1/L,

    g(x^+) \le g(y) + \nabla g(y)^T (x^+ - y) + \frac{1}{2t} \|x^+ - y\|_2^2 \qquad (1)

from the property of the prox-operator (applied with w = y − t∇g(y), u = x^+):

    h(x^+) \le h(z) + \nabla g(y)^T (z - x^+) + \frac{1}{t} (x^+ - y)^T (z - x^+)

add the upper bounds and use convexity of g:

    f(x^+) \le f(z) + \frac{1}{t} (x^+ - y)^T (z - x^+) + \frac{1}{2t} \|x^+ - y\|_2^2 \qquad \forall z
8-10
make a convex combination of the upper bounds for z = x and z = x⋆ (weights 1 − θ and θ), and use y − (1 − θ)x = θv, x^+ − (1 − θ)x = θv^+:

    f(x^+) - f^\star - (1 - \theta)(f(x) - f^\star)
      = f(x^+) - \theta f^\star - (1 - \theta) f(x)
      \le \frac{1}{t} (x^+ - y)^T \left( \theta x^\star + (1 - \theta)x - x^+ \right) + \frac{1}{2t} \|x^+ - y\|_2^2
      = \frac{1}{2t} \left( \|y - (1 - \theta)x - \theta x^\star\|_2^2 - \|x^+ - (1 - \theta)x - \theta x^\star\|_2^2 \right)
      = \frac{\theta^2}{2t} \left( \|v - x^\star\|_2^2 - \|v^+ - x^\star\|_2^2 \right)

conclusion: if the inequality (1) holds (for example, if 0 < t ≤ 1/L), then

    \frac{1}{\theta^2} (f(x^+) - f^\star) + \frac{1}{2t} \|v^+ - x^\star\|_2^2
      \le \frac{1 - \theta}{\theta^2} (f(x) - f^\star) + \frac{1}{2t} \|v - x^\star\|_2^2
8-11
Analysis for fixed step size
apply the inequality with t = t_i = 1/L recursively, using (1 − θ_i)/θ_i² ≤ 1/θ_{i−1}², to get

    \frac{1}{\theta_k^2} (f(x^{(k)}) - f^\star) + \frac{1}{2t} \|v^{(k)} - x^\star\|_2^2
      \le \frac{1 - \theta_1}{\theta_1^2} (f(x^{(0)}) - f^\star) + \frac{1}{2t} \|v^{(0)} - x^\star\|_2^2
      = \frac{1}{2t} \|x^{(0)} - x^\star\|_2^2
therefore,
    f(x^{(k)}) - f^\star \le \frac{\theta_k^2}{2t} \|x^{(0)} - x^\star\|_2^2 = \frac{2L}{(k+1)^2} \|x^{(0)} - x^\star\|_2^2
conclusion: reaches f(x^{(k)}) − f⋆ ≤ ε after O(√(L/ε)) iterations
8-12
Line search
purpose: determine step size t = tk in
    x^+ = \mathrm{prox}_{th}\left( y - t \nabla g(y) \right)

(with x^+ = x^{(k)}, y = y^{(k-1)})
backtracking line search: start at t := t_{k−1}; repeat t := βt (with 0 < β < 1) until

    g(x^+) \le g(y) + \nabla g(y)^T (x^+ - y) + \frac{1}{2t} \|x^+ - y\|_2^2

for t_0, can choose any positive value t_0 = t̂

from the Lipschitz property, t_k ≥ t_min = min{t̂, β/L}

this guarantees that inequality (1) on page 8-10 holds

the initialization implies t_k ≤ t_{k−1}, i.e., the step sizes are nonincreasing
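a sketch of one iteration with this backtracking rule; g, grad_g, prox_h are assumed supplied (prox_h(w, t) = prox_{th}(w)) and beta ∈ (0, 1) is the backtracking parameter

    def fista_step_with_backtracking(g, grad_g, prox_h, y, t, beta=0.5):
        """Proximal gradient step at y; shrink t by beta until the upper bound above holds."""
        gy, grad_y = g(y), grad_g(y)
        while True:
            x = prox_h(y - t * grad_y, t)   # candidate x^+ for the current trial step size
            d = x - y
            if g(x) <= gy + grad_y @ d + (d @ d) / (2 * t):
                return t, x                 # accepted step size and x^+
            t *= beta                       # otherwise backtrack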
8-13
Analysis for backtracking line search
apply inequality on page 8-11 recursively to get
    \frac{t_{\min}}{\theta_k^2} (f(x^{(k)}) - f^\star)
      \le \frac{t_k}{\theta_k^2} (f(x^{(k)}) - f^\star) + \frac{1}{2} \|v^{(k)} - x^\star\|_2^2
      \le \frac{t_1 (1 - \theta_1)}{\theta_1^2} (f(x^{(0)}) - f^\star) + \frac{1}{2} \|v^{(0)} - x^\star\|_2^2
      = \frac{1}{2} \|x^{(0)} - x^\star\|_2^2
therefore
    f(x^{(k)}) - f^\star \le \frac{2}{(k+1)^2 \, t_{\min}} \|x^{(0)} - x^\star\|_2^2

conclusion: reaches f(x^{(k)}) − f⋆ ≤ ε after O(√(L/ε)) iterations
8-14
Example: quadratic program with box constraints
minimize   (1/2) x^T A x + b^T x
subject to   0 ≤ x ≤ 1
[figure: (f(x^{(k)}) − f⋆)/|f⋆| versus k for the gradient method and FISTA]
n = 3000; fixed step size t = 1/λ_max(A)
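here h is the indicator function of the box [0, 1]^n, so prox_{th} is projection onto the box (independent of t); a small sketch, reusing the hypothetical fista helper from page 8-3 (A, b assumed given)

    import numpy as np

    def prox_box(w, t):
        """prox of the indicator of [0, 1]^n: projection onto the box."""
        return np.clip(w, 0.0, 1.0)

    # grad_g = lambda x: A @ x + b   # gradient of (1/2) x^T A x + b^T x
    # x = fista(grad_g, prox_box, x0=np.zeros(n), t=1.0 / np.linalg.eigvalsh(A).max())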
8-15
1-norm regularized least-squares
minimize   (1/2) ‖Ax − b‖₂² + ‖x‖₁
[figure: (f(x^{(k)}) − f⋆)/f⋆ versus k for the gradient method and FISTA]
randomly generated A ∈ R^{2000×1000}; step size t_k = 1/L with L = λ_max(A^T A)
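for this problem prox_{th} is elementwise soft thresholding; a small sketch, again wired into the hypothetical fista helper from page 8-3 (A, b assumed given)

    import numpy as np

    def prox_l1(w, t):
        """prox_{t ||.||_1}(w): elementwise soft thresholding with threshold t."""
        return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

    # grad_g = lambda x: A.T @ (A @ x - b)    # gradient of (1/2) ||Ax - b||_2^2
    # L = np.linalg.norm(A, 2) ** 2           # lambda_max(A^T A) = sigma_max(A)^2
    # x = fista(grad_g, prox_l1, x0=np.zeros(A.shape[1]), t=1.0 / L)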
8-16
Example: nuclear norm regularization
minimize   g(X) + ‖X‖_*

g is smooth and convex; variable X ∈ R^{m×n} (with m ≥ n)

nuclear norm:

    \|X\|_* = \sum_i \sigma_i(X)

where σ_1(X) ≥ σ_2(X) ≥ ⋯ are the singular values of X

the dual norm of the matrix 2-norm (maximum singular value)

for diagonal X, reduces to the 1-norm of diag(X)

popular as penalty function that promotes low rank
8-17
prox-operator prox_{th}(X) of h(X) = ‖X‖_*:

    \mathrm{prox}_{th}(X) = \operatorname*{argmin}_U \left( \|U\|_* + \frac{1}{2t} \|U - X\|_F^2 \right)

take singular value decomposition X = P diag(σ_1, …, σ_n) Q^T

apply soft thresholding to the singular values:

    \mathrm{prox}_{th}(X) = P \, \mathrm{diag}(\hat\sigma_1, \ldots, \hat\sigma_n) \, Q^T, \qquad
    \hat\sigma_k = \begin{cases} \sigma_k - t & (\sigma_k \ge t) \\ 0 & (\sigma_k \le t) \end{cases}
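a small sketch of this prox-operator via the SVD (function name illustrative)

    import numpy as np

    def prox_nuclear(X, t):
        """prox_{t ||.||_*}(X): soft thresholding of the singular values of X."""
        P, s, Qt = np.linalg.svd(X, full_matrices=False)   # X = P diag(s) Q^T
        s_thr = np.maximum(s - t, 0.0)                     # sigma_hat_k = max(sigma_k - t, 0)
        return (P * s_thr) @ Qt                            # P diag(s_thr) Q^T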
8-18
Approximate low-rank completion
minimize   Σ_{(i,j)∈N} (X_ij − A_ij)² + ‖X‖_*

entries (i, j) ∈ N are approximately specified (X_ij ≈ A_ij); the rest are free

nuclear norm regularization is added to obtain a low-rank X
example
m = n = 500
5000 specified entries
8-19
convergence (fixed step size t = 1/L)
[figure: (f(x^{(k)}) − f⋆)/f⋆ versus k for the gradient method and FISTA]
8-20
result
[figure: normalized singular values of the solution X versus index]
optimal X has rank 38; relative error in specified entries is 9%
8-21
Descent version of FISTA
choose x^{(0)} ∈ dom h and y^{(0)} = x^{(0)}; for k ≥ 1

    1.  u^{(k)} = \mathrm{prox}_{t_k h}\left( y^{(k-1)} - t_k \nabla g(y^{(k-1)}) \right)

    2.  x^{(k)} = \begin{cases} u^{(k)} & f(u^{(k)}) \le f(x^{(k-1)}) \\ x^{(k-1)} & \text{otherwise} \end{cases}

    3.  v^{(k)} = x^{(k-1)} + \frac{1}{\theta_k} \left( u^{(k)} - x^{(k-1)} \right)

    4.  y^{(k)} = (1 - \theta_{k+1}) x^{(k)} + \theta_{k+1} v^{(k)}

step 2 implies f(x^{(k)}) ≤ f(x^{(k−1)})

same iteration complexity as the original FISTA

changes on page 8-10: replace x^+ with u^+ = u^{(i)} and use f(x^+) ≤ f(u^+)
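a sketch of this monotone variant, assuming f, grad_g, prox_h are supplied and using θ_k = 2/(k + 1)

    import numpy as np

    def fista_descent(f, grad_g, prox_h, x0, t, num_iters=200):
        """Sketch of the descent version of FISTA with fixed step size t."""
        x = y = np.asarray(x0, dtype=float)
        for k in range(1, num_iters + 1):
            theta, theta_next = 2.0 / (k + 1), 2.0 / (k + 2)
            u = prox_h(y - t * grad_g(y), t)         # candidate proximal gradient step
            x_prev = x
            x = u if f(u) <= f(x_prev) else x_prev   # keep the better of u and x^(k-1)
            v = x_prev + (u - x_prev) / theta        # momentum sequence is built from u
            y = (1 - theta_next) * x + theta_next * v
        return x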
8-22
Example
(from page 8-6)
[figure: (f(x^{(k)}) − f⋆)/|f⋆| versus k for the gradient method, FISTA, and the descent version FISTA-d]
8-23
Outline
fast proximal gradient method (FISTA)
Nesterov's second method
Nesterov's second method
algorithm: choose x^{(0)} = y^{(0)} = v^{(0)} ∈ dom h; for k ≥ 1

    v^{(k)} = \mathrm{prox}_{(t_k/\theta_k) h}\left( v^{(k-1)} - \frac{t_k}{\theta_k} \nabla g(y^{(k-1)}) \right)

    x^{(k)} = (1 - \theta_k) x^{(k-1)} + \theta_k v^{(k)}

    y^{(k)} = (1 - \theta_{k+1}) x^{(k)} + \theta_{k+1} v^{(k)}

with θ_k = 2/(k + 1)
can be shown to be identical to FISTA if h(x) = 0
unlike in FISTA, y^{(k)} remains feasible (i.e., in dom h)
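a sketch of the method, assuming grad_g and prox_h (with prox_h(w, s) = prox_{sh}(w)) are supplied

    import numpy as np

    def nesterov_second(grad_g, prox_h, x0, t, num_iters=200):
        """Sketch of Nesterov's second method with fixed step size t and theta_k = 2/(k+1)."""
        x = v = y = np.asarray(x0, dtype=float)
        for k in range(1, num_iters + 1):
            theta, theta_next = 2.0 / (k + 1), 2.0 / (k + 2)
            v = prox_h(v - (t / theta) * grad_g(y), t / theta)  # prox step on the v-sequence
            x = (1 - theta) * x + theta * v                     # averaged iterate x^(k)
            y = (1 - theta_next) * x + theta_next * v           # next gradient evaluation point
        return x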
8-24
Convergence of Nesterov's second method
assumptions
optimal value f⋆ is finite and attained at x⋆ (not necessarily unique)

∇g is Lipschitz continuous on dom h ⊆ dom g with constant L > 0:

    \|\nabla g(x) - \nabla g(y)\|_2 \le L \|x - y\|_2 \qquad \forall x, y \in \mathrm{dom}\, h
h is closed and convex
result: f(x^{(k)}) − f⋆ decreases at least as fast as 1/k²
if fixed step size t_k = 1/L is used
if backtracking line search is used
8-25
Analysis of one iteration
define x = x^{(i-1)}, x^+ = x^{(i)}, y = y^{(i-1)}, v = v^{(i-1)}, v^+ = v^{(i)}, t = t_i, θ = θ_i

from the Lipschitz property, if t ≤ 1/L,

    g(x^+) \le g(y) + \nabla g(y)^T (x^+ - y) + \frac{1}{2t} \|x^+ - y\|_2^2 \qquad (2)

plug in x^+ = (1 − θ)x + θv^+ and x^+ − y = θ(v^+ − v):

    g(x^+) \le g(y) + \nabla g(y)^T \left( (1 - \theta)x + \theta v^+ - y \right) + \frac{\theta^2}{2t} \|v^+ - v\|_2^2

from convexity of g, h:

    g(x^+) \le (1 - \theta) g(x) + \theta \left( g(y) + \nabla g(y)^T (v^+ - y) \right) + \frac{\theta^2}{2t} \|v^+ - v\|_2^2

    h(x^+) \le (1 - \theta) h(x) + \theta h(v^+)
8-26
from property of prox-operator on page 8-9
    h(v^+) \le h(z) + \nabla g(y)^T (z - v^+) - \frac{\theta}{t} (v^+ - v)^T (v^+ - z)

combine the upper bounds on g(x^+), h(x^+), h(v^+), with z = x⋆:

    f(x^+) \le (1 - \theta) f(x) + \theta f^\star - \frac{\theta^2}{t} (v^+ - v)^T (v^+ - x^\star) + \frac{\theta^2}{2t} \|v^+ - v\|_2^2
           = (1 - \theta) f(x) + \theta f^\star + \frac{\theta^2}{2t} \left( \|v - x^\star\|_2^2 - \|v^+ - x^\star\|_2^2 \right)
the same final inequality as in the analysis of FISTA on page 8-11
conclusion: same 1/k² complexity as FISTA

for fixed step size t_i = 1/L

with a backtracking line search that ensures (2) and t_i ≤ t_{i−1}
8-27
References
surveys of fast gradient methods
Yu. Nesterov, Introductory Lectures on Convex Optimization. A Basic Course (2004)
P. Tseng, On accelerated proximal gradient methods for convex-concave optimization
(2008)
FISTA
A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear
inverse problems, SIAM J. on Imaging Sciences (2009)
A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal
recovery, in: Y. Eldar and D. Palomar (Eds.), Convex Optimization in Signal
Processing and Communications (2009)
Nesterov's third method (not covered in this lecture)
Yu. Nesterov, Smooth minimization of non-smooth functions, Mathematical
Programming (2005)
S. Becker, J. Bobin, E. J. Candès, NESTA: a fast and accurate first-order method for
sparse recovery, SIAM J. Imaging Sciences (2011)
8-28