
Fast Gradient Methods: Fast Proximal Gradient Method (FISTA) and Nesterov's Second Method

This document covers fast gradient methods for minimizing composite convex objectives: the fast proximal gradient method (FISTA) and Nesterov's second method. FISTA takes proximal gradient steps at extrapolated points to accelerate convergence, while Nesterov's second method additionally keeps its iterates feasible. Both methods attain the optimal O(1/k²) convergence rate with either a fixed step size or a backtracking line search. Examples illustrate the methods on logistic regression, box-constrained quadratic programming, 1-norm regularized least-squares, and nuclear norm regularized problems.


EE236C (Spring 2010-11)

8. Fast gradient methods

• fast proximal gradient method (FISTA)
• Nesterov's second method

8-1

Fast (proximal) gradient methods

• Nesterov (1983, 1988, 2005): three gradient projection methods with 1/k² convergence rate
• Beck & Teboulle (2008): FISTA, a proximal gradient version of Nesterov's 1983 method
• Nesterov (2004 book), Tseng (2008): overview and unified analysis of fast gradient methods
• several recent variations and extensions

this lecture: FISTA and Nesterov's second method (1988), as presented by Tseng

Fast gradient methods

8-2

Outline

• fast proximal gradient method (FISTA)
• Nesterov's second method

Fast proximal gradient method


convex problem with composite objective

    minimize   f(x) = g(x) + h(x)

g differentiable with dom g = R^n; h has an inexpensive prox_{th} operator

algorithm: choose x^(0) = y^(0) ∈ dom h; for k ≥ 1,

    x^(k) = prox_{t_k h}( y^(k-1) - t_k ∇g(y^(k-1)) )

    y^(k) = x^(k) + ((k-1)/(k+2)) (x^(k) - x^(k-1))

known as FISTA (Fast Iterative Shrinkage-Thresholding Algorithm)


Fast gradient methods

8-3
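The two-step iteration above translates almost line for line into code. A minimal sketch, assuming NumPy and a fixed step size; the names fista, grad_g, prox_h are illustrative, and prox_h(w, t) is expected to evaluate prox_{th}(w):

```python
import numpy as np

def fista(grad_g, prox_h, f, x0, t, num_iters=200):
    """Minimal FISTA sketch for minimizing f(x) = g(x) + h(x).

    grad_g : callable returning the gradient of the smooth term g
    prox_h : callable prox_h(w, t) evaluating prox_{t h}(w)
    f      : callable returning g(x) + h(x), recorded for monitoring
    x0     : starting point (must lie in dom h)
    t      : fixed step size, e.g. 1/L when grad_g is L-Lipschitz
    """
    x = x0.copy()
    x_prev = x0.copy()
    y = x0.copy()
    history = []
    for k in range(1, num_iters + 1):
        # proximal gradient step at the extrapolated point y^(k-1)
        x = prox_h(y - t * grad_g(y), t)
        # extrapolation with coefficient (k-1)/(k+2)
        y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)
        x_prev = x
        history.append(f(x))
    return x, history
```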

Interpretation
• first iteration (k = 1) is a proximal gradient step at x^(0)
• next iterations are proximal gradient steps at the extrapolated points y^(k-1):

    x^(k) = prox_{t_k h}( y^(k-1) - t_k ∇g(y^(k-1)) )

[figure: the extrapolated point y^(k-1) lies on the line through x^(k-2) and x^(k-1), beyond x^(k-1)]

• sequence x^(k) remains feasible (in dom h); sequence y^(k) not necessarily

Fast gradient methods

8-4

Example
    minimize   log Σ_{i=1}^m exp(a_i^T x + b_i)

randomly generated data with m = 2000, n = 1000, same fixed step size

[figure: relative error (f(x^(k)) - f⋆)/|f⋆| versus k (0 to 200) for the gradient method and FISTA, on a log scale from 10^0 down to 10^-6]

Fast gradient methods
8-5

another instance

[figure: relative error (f(x^(k)) - f⋆)/|f⋆| versus k (0 to 200) for the gradient method and FISTA on a second instance; the FISTA curve is not monotone]

FISTA is not a descent method

Fast gradient methods

8-6

Convergence of FISTA
assumptions
• optimal value f⋆ is finite and attained at x⋆ (not necessarily unique)
• dom g = R^n and ∇g is Lipschitz continuous with constant L > 0:

    ‖∇g(x) - ∇g(y)‖₂ ≤ L ‖x - y‖₂   ∀ x, y

• h is closed and convex (hence prox_{th}(u) exists and is unique for all u)

result: f(x^(k)) - f⋆ decreases at least as fast as 1/k²
• if fixed step size t_k = 1/L is used
• if backtracking line search is used

Fast gradient methods

8-7

Reformulation of FISTA
define θ_k = 2/(k+1) and introduce an intermediate variable v^(k)

algorithm: choose x^(0) = y^(0) = v^(0) ∈ dom h; for k ≥ 1,

    x^(k) = prox_{t_k h}( y^(k-1) - t_k ∇g(y^(k-1)) )

    v^(k) = x^(k-1) + (1/θ_k) (x^(k) - x^(k-1))

    y^(k) = (1 - θ_{k+1}) x^(k) + θ_{k+1} v^(k)

• substituting the expression for v^(k) in step 3 gives the algorithm on page 8-3 (see the short derivation after this slide)
• θ_k = 2/(k+1) satisfies θ_1 = 1 and

    (1 - θ_k)/θ_k² ≤ 1/θ_{k-1}²,   k ≥ 2

Fast gradient methods

8-8
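For completeness, the substitution mentioned above works out as follows (elementary algebra, not spelled out on the slide):

$$
y^{(k)} = (1-\theta_{k+1})x^{(k)} + \theta_{k+1}\left(x^{(k-1)} + \tfrac{1}{\theta_k}\,(x^{(k)}-x^{(k-1)})\right)
        = x^{(k)} + \theta_{k+1}\left(\tfrac{1}{\theta_k}-1\right)(x^{(k)}-x^{(k-1)}),
$$

and with $\theta_k = 2/(k+1)$ the coefficient is $\theta_{k+1}(1/\theta_k - 1) = \tfrac{2}{k+2}\cdot\tfrac{k-1}{2} = \tfrac{k-1}{k+2}$, the extrapolation coefficient used on page 8-3.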

Key inequalities
upper bound from Lipschitz property

    g(u) ≤ g(z) + ∇g(z)^T (u - z) + (L/2) ‖u - z‖₂²   ∀ u, z

property of proximal operators: if u = prox_{th}(w), then

    h(u) ≤ h(z) + (1/t) (w - u)^T (u - z)   ∀ z

this follows from the subgradient characterization of the prox-operator (page 4-15):

    u = prox_{th}(w)   ⟺   w - u ∈ t ∂h(u)

Fast gradient methods

8-9

Progress in one iteration


define x = x^(i-1), x⁺ = x^(i), y = y^(i-1), v = v^(i-1), v⁺ = v^(i), t = t_i, θ = θ_i

• from the Lipschitz property, if t ≤ 1/L,

    g(x⁺) ≤ g(y) + ∇g(y)^T (x⁺ - y) + (1/(2t)) ‖x⁺ - y‖₂²    (1)

• from the property of the prox-operator (applied with u = x⁺ and w = y - t ∇g(y)),

    h(x⁺) ≤ h(z) + ∇g(y)^T (z - x⁺) + (1/t) (x⁺ - y)^T (z - x⁺)   ∀ z

• add the two upper bounds and use convexity of g:

    f(x⁺) ≤ f(z) + (1/t) (x⁺ - y)^T (z - x⁺) + (1/(2t)) ‖x⁺ - y‖₂²   ∀ z

Fast gradient methods
8-10

make a convex combination of the upper bounds for z = x and z = x⋆ (with weights 1-θ and θ):

    f(x⁺) - f⋆ - (1-θ)(f(x) - f⋆)
      = f(x⁺) - θ f⋆ - (1-θ) f(x)
      ≤ (1/t) (x⁺ - y)^T (θ x⋆ + (1-θ) x - x⁺) + (1/(2t)) ‖x⁺ - y‖₂²
      = (1/(2t)) ( ‖y - (1-θ)x - θ x⋆‖₂² - ‖x⁺ - (1-θ)x - θ x⋆‖₂² )
      = (θ²/(2t)) ( ‖v - x⋆‖₂² - ‖v⁺ - x⋆‖₂² )

(the last step uses y = (1-θ)x + θv and x⁺ = (1-θ)x + θv⁺)

conclusion: if inequality (1) holds (for example, if 0 < t ≤ 1/L), then

    (1/θ²) (f(x⁺) - f⋆) + (1/(2t)) ‖v⁺ - x⋆‖₂²  ≤  ((1-θ)/θ²) (f(x) - f⋆) + (1/(2t)) ‖v - x⋆‖₂²

Fast gradient methods

8-11

Analysis for fixed step size


apply the inequality with t = t_i = 1/L recursively, using (1 - θ_i)/θ_i² ≤ 1/θ_{i-1}²:

    (1/θ_k²) (f(x^(k)) - f⋆) + (1/(2t)) ‖v^(k) - x⋆‖₂²
      ≤ ((1 - θ_1)/θ_1²) (f(x^(0)) - f⋆) + (1/(2t)) ‖v^(0) - x⋆‖₂²
      = (1/(2t)) ‖x^(0) - x⋆‖₂²

(the last equality holds because θ_1 = 1 and v^(0) = x^(0))

therefore,

    f(x^(k)) - f⋆ ≤ (θ_k²/(2t)) ‖x^(0) - x⋆‖₂² = (2L/(k+1)²) ‖x^(0) - x⋆‖₂²

conclusion: reaches f(x^(k)) - f⋆ ≤ ε after O(√(L/ε)) iterations

Fast gradient methods
8-12
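As a worked consequence of this bound (simple algebra, not shown on the slide), the accuracy f(x^(k)) - f⋆ ≤ ε is guaranteed once

$$
k + 1 \;\ge\; \sqrt{\frac{2L}{\varepsilon}}\;\|x^{(0)} - x^\star\|_2,
$$

which is the O(√(L/ε)) iteration count quoted above.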

Line search
purpose: determine the step size t = t_k in

    x⁺ = prox_{th}( y - t ∇g(y) )        (with x⁺ = x^(k), y = y^(k-1))

backtracking line search: start at t := t_{k-1}; repeat t := βt (with 0 < β < 1) until

    g(x⁺) ≤ g(y) + ∇g(y)^T (x⁺ - y) + (1/(2t)) ‖x⁺ - y‖₂²

• for t_0, one can choose any positive value t_0 = t̂
• from the Lipschitz property, t_k ≥ t_min = min{t̂, β/L}
• the exit condition guarantees that inequality (1) on page 8-10 holds
• the initialization implies t_k ≤ t_{k-1}, i.e., step sizes are nonincreasing

Fast gradient methods

8-13
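A minimal sketch of one step with this backtracking rule (assuming NumPy; the function name backtracking_step and the default beta=0.5 are illustrative choices, not from the lecture):

```python
import numpy as np

def backtracking_step(g, grad_g, prox_h, y, t_prev, beta=0.5):
    """One proximal gradient step with the backtracking rule sketched above.

    Starts from the previous step size and shrinks it by beta until the
    sufficient-decrease condition on g holds; returns (x_plus, t).
    """
    t = t_prev
    while True:
        x_plus = prox_h(y - t * grad_g(y), t)
        diff = x_plus - y
        # exit test: g(x+) <= g(y) + grad_g(y)'(x+ - y) + ||x+ - y||^2 / (2t)
        if g(x_plus) <= g(y) + grad_g(y) @ diff + (diff @ diff) / (2 * t):
            return x_plus, t
        t *= beta
```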

Analysis for backtracking line search


apply the inequality on page 8-11 recursively to get

    (t_min/θ_k²) (f(x^(k)) - f⋆)
      ≤ (t_k/θ_k²) (f(x^(k)) - f⋆) + (1/2) ‖v^(k) - x⋆‖₂²
      ≤ (t_1 (1 - θ_1)/θ_1²) (f(x^(0)) - f⋆) + (1/2) ‖v^(0) - x⋆‖₂²
      = (1/2) ‖x^(0) - x⋆‖₂²

therefore,

    f(x^(k)) - f⋆ ≤ (2/((k+1)² t_min)) ‖x^(0) - x⋆‖₂²

conclusion: reaches f(x^(k)) - f⋆ ≤ ε after O(√(L/ε)) iterations

Fast gradient methods
8-14

Example: quadratic program with box constraints


    minimize    (1/2) x^T A x + b^T x
    subject to  0 ≤ x ≤ 1

n = 3000; fixed step size t = 1/λ_max(A)

[figure: (f(x^(k)) - f⋆)/|f⋆| versus k (0 to 50) for the gradient method and FISTA]

Fast gradient methods

8-15
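Here h is the indicator function of the box [0, 1]^n, so prox_{th} is simply projection onto the box and FISTA becomes an accelerated projected gradient method. A minimal sketch reusing the fista helper above (NumPy assumed; the data A, b and the smaller dimension are illustrative, not the instance from the slide):

```python
import numpy as np

n = 300                                        # smaller than the n = 3000 instance above
M = np.random.randn(n, n)
A = M.T @ M / n                                # illustrative positive semidefinite A
b = np.random.randn(n)

grad_g = lambda x: A @ x + b                   # gradient of (1/2) x'Ax + b'x
prox_box = lambda w, t: np.clip(w, 0.0, 1.0)   # projection onto [0, 1]^n
obj = lambda x: 0.5 * x @ A @ x + b @ x

t = 1.0 / np.linalg.eigvalsh(A).max()          # fixed step size 1/lambda_max(A)
x_opt, history = fista(grad_g, prox_box, obj, np.zeros(n), t)
```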

1-norm regularized least-squares


    minimize    (1/2) ‖Ax - b‖₂² + ‖x‖₁

randomly generated A ∈ R^(2000×1000); step size t_k = 1/L with L = λ_max(A^T A)

[figure: (f(x^(k)) - f⋆)/f⋆ versus k (0 to 100) for the gradient method and FISTA]

Fast gradient methods

8-16
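For h(x) = ‖x‖₁ the prox operator is elementwise soft-thresholding, so the generic fista sketch above specializes directly. A hedged illustration (NumPy assumed; the data and the small problem size are made up, not the 2000 × 1000 instance from the slide):

```python
import numpy as np

def soft_threshold(w, tau):
    # prox of tau * ||.||_1: shrink each entry toward zero by tau
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

m, n = 200, 100                                    # illustrative sizes
A = np.random.randn(m, n)
b = np.random.randn(m)

grad_g = lambda x: A.T @ (A @ x - b)               # gradient of (1/2)||Ax - b||_2^2
prox_h = soft_threshold                            # prox of t * ||.||_1
obj = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + np.sum(np.abs(x))

L = np.linalg.eigvalsh(A.T @ A).max()              # L = lambda_max(A'A)
x_opt, history = fista(grad_g, prox_h, obj, np.zeros(n), 1.0 / L)
```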

Example: nuclear norm regularization


    minimize    g(X) + ‖X‖_*

g is smooth and convex; variable X ∈ R^(m×n) (with m ≥ n)

nuclear norm:

    ‖X‖_* = Σ_i σ_i(X)

where σ_1(X) ≥ σ_2(X) ≥ ··· are the singular values of X

• the dual norm of the matrix 2-norm ‖X‖₂ (maximum singular value)
• for diagonal X, reduces to the 1-norm of diag(X)
• popular as a penalty function that promotes low rank

Fast gradient methods

8-17

prox-operator of h(X) = ‖X‖_*:

    prox_{th}(X) = argmin_U ( ‖U‖_* + (1/(2t)) ‖U - X‖_F² )

• take the singular value decomposition X = P diag(σ_1, ..., σ_n) Q^T
• apply soft-thresholding to the singular values:

    prox_{th}(X) = P diag(σ̂_1, ..., σ̂_n) Q^T,    where σ̂_k = σ_k - t if σ_k ≥ t, and σ̂_k = 0 if σ_k ≤ t

Fast gradient methods

8-18
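A minimal sketch of this singular value soft-thresholding operation (NumPy assumed; the name svt is illustrative):

```python
import numpy as np

def svt(X, t):
    """Singular value soft-thresholding: prox of t * (nuclear norm) at X."""
    P, sigma, Qt = np.linalg.svd(X, full_matrices=False)
    sigma_shrunk = np.maximum(sigma - t, 0.0)    # soft-threshold the singular values
    return (P * sigma_shrunk) @ Qt
```

Plugged into the fista sketch as the prox_h argument, this handles nuclear norm regularized problems such as the completion example that follows.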

Approximate low-rank completion


    minimize    Σ_{(i,j)∈N} (X_ij - A_ij)² + ‖X‖_*

• entries (i, j) ∈ N are approximately specified (X_ij ≈ A_ij); the rest of X is free
• nuclear norm regularization is added to obtain a low-rank X

example: m = n = 500, with 5000 specified entries

Fast gradient methods

8-19
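A hedged sketch of how this objective can be fed to the fista and svt helpers above (NumPy assumed; the synthetic data, rank, and random mask are made up for illustration):

```python
import numpy as np

m = n = 500
rng = np.random.default_rng(0)

# synthetic low-rank matrix observed on 5000 randomly chosen entries
A_true = rng.standard_normal((m, 5)) @ rng.standard_normal((5, n))
mask = np.zeros((m, n), dtype=bool)
mask.flat[rng.choice(m * n, size=5000, replace=False)] = True
A_obs = np.where(mask, A_true, 0.0)

g = lambda X: np.sum((X[mask] - A_obs[mask]) ** 2)   # smooth fitting term
grad_g = lambda X: 2.0 * mask * (X - A_obs)          # gradient; zero off the mask
obj = lambda X: g(X) + np.linalg.norm(X, 'nuc')      # add the nuclear norm term

# grad_g is Lipschitz with constant L = 2, so t = 1/L = 0.5 is a valid fixed step
X_hat, history = fista(grad_g, svt, obj, np.zeros((m, n)), 0.5)
```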

convergence (fixed step size t = 1/L)

[figure: (f(x^(k)) - f⋆)/f⋆ versus k (0 to 200) for the gradient method and FISTA]

Fast gradient methods

8-20

result

[figure: normalized singular values of the computed X versus index (1 to 50), on a log scale from 10^1 down to 10^-17]

optimal X has rank 38; relative error in specified entries is 9%

Fast gradient methods

8-21

Descent version of FISTA


choose x^(0) ∈ dom h and y^(0) = x^(0); for k ≥ 1,

    u^(k) = prox_{t_k h}( y^(k-1) - t_k ∇g(y^(k-1)) )

    x^(k) = u^(k) if f(u^(k)) ≤ f(x^(k-1)), and x^(k) = x^(k-1) otherwise

    v^(k) = x^(k-1) + (1/θ_k) (u^(k) - x^(k-1))

    y^(k) = (1 - θ_{k+1}) x^(k) + θ_{k+1} v^(k)

• step 2 implies f(x^(k)) ≤ f(x^(k-1))
• same iteration complexity as the original FISTA
• changes on page 8-10: replace x⁺ with u⁺ = u^(i) and use f(x⁺) ≤ f(u⁺)

Fast gradient methods

8-22
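A minimal sketch of this descent variant (NumPy assumed; the name fista_descent is illustrative):

```python
import numpy as np

def fista_descent(grad_g, prox_h, f, x0, t, num_iters=200):
    """Sketch of the descent variant: keep the previous iterate when the
    proximal gradient candidate u does not decrease the objective."""
    x = x0.copy()
    y = x0.copy()
    history = []
    for k in range(1, num_iters + 1):
        theta = 2.0 / (k + 1)
        theta_next = 2.0 / (k + 2)
        u = prox_h(y - t * grad_g(y), t)
        x_new = u if f(u) <= f(x) else x            # enforce monotone decrease
        v = x + (u - x) / theta                     # v^(k) uses the candidate u and old x
        y = (1 - theta_next) * x_new + theta_next * v
        x = x_new
        history.append(f(x))
    return x, history
```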

Example

(the problem instance from page 8-6)

[figure: (f(x^(k)) - f⋆)/|f⋆| versus k (0 to 200) for the gradient method, FISTA, and the descent variant FISTA-d]

Fast gradient methods

8-23

Outline

• fast proximal gradient method (FISTA)
• Nesterov's second method

Nesterov's second method


algorithm: choose x^(0) = y^(0) = v^(0) ∈ dom h; for k ≥ 1,

    v^(k) = prox_{(t_k/θ_k) h}( v^(k-1) - (t_k/θ_k) ∇g(y^(k-1)) )

    x^(k) = (1 - θ_k) x^(k-1) + θ_k v^(k)

    y^(k) = (1 - θ_{k+1}) x^(k) + θ_{k+1} v^(k)

with θ_k = 2/(k+1)

• can be shown to be identical to FISTA if h(x) = 0
• unlike in FISTA, y^(k) remains feasible (i.e., in dom h), since it is a convex combination of x^(k) ∈ dom h and v^(k) ∈ dom h

Fast gradient methods

8-24
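A minimal sketch of these iterations with a fixed step size (NumPy assumed; the name nesterov2 is illustrative):

```python
import numpy as np

def nesterov2(grad_g, prox_h, x0, t, num_iters=200):
    """Sketch of Nesterov's second method for f(x) = g(x) + h(x),
    with a fixed step size t (e.g. t = 1/L)."""
    x = x0.copy()
    v = x0.copy()
    y = x0.copy()
    for k in range(1, num_iters + 1):
        theta = 2.0 / (k + 1)
        theta_next = 2.0 / (k + 2)
        # prox step on v with the enlarged step t/theta
        v = prox_h(v - (t / theta) * grad_g(y), t / theta)
        x = (1 - theta) * x + theta * v
        y = (1 - theta_next) * x + theta_next * v
    return x
```

Because every iterate is produced by a prox step or a convex combination of points in dom h, the gradient of g is only ever evaluated at feasible points.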

Convergence of Nesterov's second method

assumptions
• optimal value f⋆ is finite and attained at x⋆ (not necessarily unique)
• ∇g is Lipschitz continuous on dom h ⊆ dom g with constant L > 0:

    ‖∇g(x) - ∇g(y)‖₂ ≤ L ‖x - y‖₂   ∀ x, y ∈ dom h

• h is closed and convex

result: f(x^(k)) - f⋆ decreases at least as fast as 1/k²
• if fixed step size t_k = 1/L is used
• if backtracking line search is used

Fast gradient methods

8-25

Analysis of one iteration

define x = x^(i-1), x⁺ = x^(i), y = y^(i-1), v = v^(i-1), v⁺ = v^(i), t = t_i, θ = θ_i

• from the Lipschitz property, if t ≤ 1/L,

    g(x⁺) ≤ g(y) + ∇g(y)^T (x⁺ - y) + (1/(2t)) ‖x⁺ - y‖₂²    (2)

• plug in x⁺ = (1-θ)x + θv⁺ and x⁺ - y = θ(v⁺ - v):

    g(x⁺) ≤ g(y) + ∇g(y)^T ( (1-θ)x + θv⁺ - y ) + (θ²/(2t)) ‖v⁺ - v‖₂²

• from convexity of g and h:

    g(x⁺) ≤ (1-θ) g(x) + θ ( g(y) + ∇g(y)^T (v⁺ - y) ) + (θ²/(2t)) ‖v⁺ - v‖₂²

    h(x⁺) ≤ (1-θ) h(x) + θ h(v⁺)

Fast gradient methods

8-26

• from the property of the prox-operator on page 8-9 (applied with u = v⁺ and w = v - (t/θ) ∇g(y)):

    h(v⁺) ≤ h(z) + ∇g(y)^T (z - v⁺) + (θ/t) (v - v⁺)^T (v⁺ - z)   ∀ z

• combine the upper bounds on g(x⁺), h(x⁺), h(v⁺), with z = x⋆:

    f(x⁺) ≤ (1-θ) f(x) + θ f⋆ + (θ²/t) (v - v⁺)^T (v⁺ - x⋆) + (θ²/(2t)) ‖v⁺ - v‖₂²
          = (1-θ) f(x) + θ f⋆ + (θ²/(2t)) ( ‖v - x⋆‖₂² - ‖v⁺ - x⋆‖₂² )

• this is the same final inequality as in the analysis of FISTA on page 8-11

conclusion: same 1/k² complexity as FISTA
• for fixed step size t_i = 1/L
• with a backtracking line search that ensures (2) and t_i ≤ t_{i-1}

Fast gradient methods

8-27

References
surveys of fast gradient methods
• Yu. Nesterov, Introductory Lectures on Convex Optimization. A Basic Course (2004)
• P. Tseng, On accelerated proximal gradient methods for convex-concave optimization (2008)

FISTA
• A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. on Imaging Sciences (2009)
• A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal recovery, in: Y. Eldar and D. Palomar (Eds.), Convex Optimization in Signal Processing and Communications (2009)

Nesterov's third method (not covered in this lecture)
• Yu. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming (2005)
• S. Becker, J. Bobin, E. J. Candès, NESTA: a fast and accurate first-order method for sparse recovery, SIAM J. Imaging Sciences (2011)

Fast gradient methods

8-28
