EE236C (Spring 2010-11)
8. Fast gradient methods
fast proximal gradient method (FISTA)
Nesterov's second method
8-1
Fast (proximal) gradient methods
Nesterov (1983, 1988, 2005): three gradient projection methods with 1/k² convergence rate
Beck & Teboulle (2008): FISTA, a proximal gradient version of
Nesterov's 1983 method
Nesterov (2004 book), Tseng (2008): overview and unified analysis of
fast gradient methods
several recent variations and extensions
this lecture:
FISTA and Nesterov's 2nd method (1988) as presented by Tseng
8-2
Outline
fast proximal gradient method (FISTA)
Nesterov's second method
Fast proximal gradient method
convex problem with composite objective
minimize   f(x) = g(x) + h(x)
g differentiable with dom g = R^n; h has inexpensive prox_{th} operator
algorithm: choose x^{(0)} = y^{(0)} ∈ dom h; for k ≥ 1
    x^{(k)} = \mathrm{prox}_{t_k h}\left( y^{(k-1)} - t_k \nabla g(y^{(k-1)}) \right)

    y^{(k)} = x^{(k)} + \frac{k-1}{k+2} \left( x^{(k)} - x^{(k-1)} \right)
known as FISTA (Fast Iterative Shrinkage-Thresholding Algorithm)
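a minimal NumPy sketch of the iteration above with a fixed step size t; the names fista, grad_g, prox_h are illustrative (not from the slides), with grad_g(y) returning ∇g(y) and prox_h(w, t) returning prox_{th}(w)

    import numpy as np

    def fista(grad_g, prox_h, x0, t, num_iters=200):
        """Sketch of FISTA with fixed step size t (caller supplies grad_g and prox_h)."""
        x = y = np.asarray(x0, dtype=float)
        for k in range(1, num_iters + 1):
            x_prev = x
            x = prox_h(y - t * grad_g(y), t)              # proximal gradient step at y^(k-1)
            y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)  # extrapolation gives y^(k)
        return x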
8-3
Interpretation
first iteration (k = 1) is a proximal gradient step at x^{(0)}

next iterations are proximal gradient steps at extrapolated points y^{(k-1)}:

    x^{(k)} = \mathrm{prox}_{t_k h}\left( y^{(k-1)} - t_k \nabla g(y^{(k-1)}) \right)

[figure: extrapolation step, showing x^{(k-2)}, x^{(k-1)}, and the extrapolated point y^{(k-1)}]

sequence x^{(k)} remains feasible (in dom h); sequence y^{(k)} not necessarily
8-4
Example
minimize   \log \sum_{i=1}^{m} \exp(a_i^T x + b_i)
randomly generated data with m = 2000, n = 1000, same fixed step size
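for reference, a small NumPy sketch of this objective and its gradient (here h = 0, so the prox step in the FISTA sketch on page 8-3 reduces to the identity); the arrays A, b are assumed given

    import numpy as np

    def logsumexp_and_grad(x, A, b):
        """g(x) = log sum_i exp(a_i^T x + b_i) and its gradient, with a max shift for stability."""
        z = A @ x + b
        zmax = z.max()
        w = np.exp(z - zmax)              # unnormalized softmax weights
        val = zmax + np.log(w.sum())      # objective value g(x)
        grad = A.T @ (w / w.sum())        # gradient = A^T softmax(z)
        return val, grad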
[figure: (f(x^{(k)}) − f⋆)/|f⋆| versus k for the gradient method and FISTA]
8-5
another instance
[figure: (f(x^{(k)}) − f⋆)/|f⋆| versus k for the gradient method and FISTA]
FISTA is not a descent method
8-6
Convergence of FISTA
assumptions
optimal value f⋆ is finite and attained at x⋆ (not necessarily unique)

dom g = R^n and ∇g is Lipschitz continuous with constant L > 0:

    \|\nabla g(x) - \nabla g(y)\|_2 \le L \|x - y\|_2 \qquad \forall x, y

h is closed and convex (hence prox_{th}(u) exists and is unique for all u)
result: f(x^{(k)}) − f⋆ decreases at least as fast as 1/k²
if fixed step size t_k = 1/L is used
if backtracking line search is used
8-7
Reformulation of FISTA
define θ_k = 2/(k + 1) and introduce an intermediate variable v^{(k)}

algorithm: choose x^{(0)} = y^{(0)} = v^{(0)} ∈ dom h; for k ≥ 1

    1.  x^{(k)} = \mathrm{prox}_{t_k h}\left( y^{(k-1)} - t_k \nabla g(y^{(k-1)}) \right)

    2.  v^{(k)} = x^{(k-1)} + \frac{1}{\theta_k} \left( x^{(k)} - x^{(k-1)} \right)

    3.  y^{(k)} = (1 - \theta_{k+1}) x^{(k)} + \theta_{k+1} v^{(k)}
substituting the expression for v^{(k)} in step 3 gives the algorithm on page 8-3
θ_k = 2/(k + 1) satisfies

    \frac{1 - \theta_k}{\theta_k^2} \le \frac{1}{\theta_{k-1}^2}, \qquad k \ge 2
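a direct check of this inequality (for k ≥ 2):

    \frac{1 - \theta_k}{\theta_k^2} = \frac{k-1}{k+1} \cdot \frac{(k+1)^2}{4} = \frac{k^2 - 1}{4} \le \frac{k^2}{4} = \frac{1}{\theta_{k-1}^2}

(also, θ_1 = 1, which is what makes the k = 1 terms drop out in the recursions on pages 8-12 and 8-14)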
8-8
Key inequalities
upper bound from Lipschitz property
    g(u) \le g(z) + \nabla g(z)^T (u - z) + \frac{L}{2} \|u - z\|_2^2 \qquad \forall u, z
property of proximal operators: if u = prox_{th}(w), then

    h(u) \le h(z) + \frac{1}{t} (w - u)^T (u - z) \qquad \forall z

this follows from the subgradient characterization of the prox-operator (page 4-15):

    u = \mathrm{prox}_{th}(w) \quad \Longleftrightarrow \quad w - u \in t \, \partial h(u)
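spelling out the step: (w − u)/t ∈ ∂h(u), so the subgradient inequality for h at u gives

    h(z) \ge h(u) + \frac{1}{t} (w - u)^T (z - u) \qquad \forall z,

which rearranges to the inequality above.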
8-9
Progress in one iteration
define x = x^{(i-1)}, x^+ = x^{(i)}, y = y^{(i-1)}, v = v^{(i-1)}, v^+ = v^{(i)}, t = t_i, θ = θ_i

from the Lipschitz property, if t ≤ 1/L,

    g(x^+) \le g(y) + \nabla g(y)^T (x^+ - y) + \frac{1}{2t} \|x^+ - y\|_2^2 \qquad (1)

from the property of the prox-operator (applied with w = y − t∇g(y), u = x^+):

    h(x^+) \le h(z) + \nabla g(y)^T (z - x^+) + \frac{1}{t} (x^+ - y)^T (z - x^+)

add the upper bounds and use convexity of g:

    f(x^+) \le f(z) + \frac{1}{t} (x^+ - y)^T (z - x^+) + \frac{1}{2t} \|x^+ - y\|_2^2 \qquad \forall z
8-10
make a convex combination of the upper bounds for z = x and z = x⋆ (weights 1 − θ and θ), and use y − (1 − θ)x = θv, x^+ − (1 − θ)x = θv^+:

    f(x^+) - f^\star - (1 - \theta)(f(x) - f^\star)
      = f(x^+) - \theta f^\star - (1 - \theta) f(x)
      \le \frac{1}{t} (x^+ - y)^T \left( \theta x^\star + (1 - \theta)x - x^+ \right) + \frac{1}{2t} \|x^+ - y\|_2^2
      = \frac{1}{2t} \left( \|y - (1 - \theta)x - \theta x^\star\|_2^2 - \|x^+ - (1 - \theta)x - \theta x^\star\|_2^2 \right)
      = \frac{\theta^2}{2t} \left( \|v - x^\star\|_2^2 - \|v^+ - x^\star\|_2^2 \right)

conclusion: if the inequality (1) holds (for example, if 0 < t ≤ 1/L), then

    \frac{1}{\theta^2} (f(x^+) - f^\star) + \frac{1}{2t} \|v^+ - x^\star\|_2^2
      \le \frac{1 - \theta}{\theta^2} (f(x) - f^\star) + \frac{1}{2t} \|v - x^\star\|_2^2
8-11
Analysis for fixed step size
apply the inequality with t = t_i = 1/L recursively, using (1 − θ_i)/θ_i² ≤ 1/θ_{i−1}², to get

    \frac{1}{\theta_k^2} (f(x^{(k)}) - f^\star) + \frac{1}{2t} \|v^{(k)} - x^\star\|_2^2
      \le \frac{1 - \theta_1}{\theta_1^2} (f(x^{(0)}) - f^\star) + \frac{1}{2t} \|v^{(0)} - x^\star\|_2^2
      = \frac{1}{2t} \|x^{(0)} - x^\star\|_2^2
therefore,
    f(x^{(k)}) - f^\star \le \frac{\theta_k^2}{2t} \|x^{(0)} - x^\star\|_2^2 = \frac{2L}{(k+1)^2} \|x^{(0)} - x^\star\|_2^2
conclusion: reaches f(x^{(k)}) − f⋆ ≤ ε after O(√(L/ε)) iterations
8-12
Line search
purpose: determine step size t = tk in
    x^+ = \mathrm{prox}_{th}\left( y - t \nabla g(y) \right)

(with x^+ = x^{(k)}, y = y^{(k-1)})
backtracking line search: start at t := t_{k−1}; repeat t := βt (with 0 < β < 1) until

    g(x^+) \le g(y) + \nabla g(y)^T (x^+ - y) + \frac{1}{2t} \|x^+ - y\|_2^2

for t_0, can choose any positive value t_0 = t̂

from the Lipschitz property, t_k ≥ t_min = min{t̂, β/L}

this guarantees that inequality (1) on page 8-10 holds

the initialization implies t_k ≤ t_{k−1}, i.e., the step sizes are nonincreasing
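a sketch of one iteration with this backtracking rule; g, grad_g, prox_h are assumed supplied (prox_h(w, t) = prox_{th}(w)) and beta ∈ (0, 1) is the backtracking parameter

    def fista_step_with_backtracking(g, grad_g, prox_h, y, t, beta=0.5):
        """Proximal gradient step at y; shrink t by beta until the upper bound above holds."""
        gy, grad_y = g(y), grad_g(y)
        while True:
            x = prox_h(y - t * grad_y, t)   # candidate x^+ for the current trial step size
            d = x - y
            if g(x) <= gy + grad_y @ d + (d @ d) / (2 * t):
                return t, x                 # accepted step size and x^+
            t *= beta                       # otherwise backtrack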
8-13
Analysis for backtracking line search
apply inequality on page 8-11 recursively to get
    \frac{t_{\min}}{\theta_k^2} (f(x^{(k)}) - f^\star)
      \le \frac{t_k}{\theta_k^2} (f(x^{(k)}) - f^\star) + \frac{1}{2} \|v^{(k)} - x^\star\|_2^2
      \le \frac{t_1 (1 - \theta_1)}{\theta_1^2} (f(x^{(0)}) - f^\star) + \frac{1}{2} \|v^{(0)} - x^\star\|_2^2
      = \frac{1}{2} \|x^{(0)} - x^\star\|_2^2
therefore
    f(x^{(k)}) - f^\star \le \frac{2}{(k+1)^2 \, t_{\min}} \|x^{(0)} - x^\star\|_2^2

conclusion: reaches f(x^{(k)}) − f⋆ ≤ ε after O(√(L/ε)) iterations
8-14
Example: quadratic program with box constraints
minimize   (1/2) x^T A x + b^T x
subject to   0 ≤ x ≤ 1
[figure: (f(x^{(k)}) − f⋆)/|f⋆| versus k for the gradient method and FISTA]
n = 3000; fixed step size t = 1/λ_max(A)
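here h is the indicator function of the box [0, 1]^n, so prox_{th} is projection onto the box (independent of t); a small sketch, reusing the hypothetical fista helper from page 8-3 (A, b assumed given)

    import numpy as np

    def prox_box(w, t):
        """prox of the indicator of [0, 1]^n: projection onto the box."""
        return np.clip(w, 0.0, 1.0)

    # grad_g = lambda x: A @ x + b   # gradient of (1/2) x^T A x + b^T x
    # x = fista(grad_g, prox_box, x0=np.zeros(n), t=1.0 / np.linalg.eigvalsh(A).max())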
8-15
1-norm regularized least-squares
minimize   (1/2) ‖Ax − b‖₂² + ‖x‖₁
[figure: (f(x^{(k)}) − f⋆)/f⋆ versus k for the gradient method and FISTA]
randomly generated A ∈ R^{2000×1000}; step size t_k = 1/L with L = λ_max(A^T A)
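for this problem prox_{th} is elementwise soft thresholding; a small sketch, again wired into the hypothetical fista helper from page 8-3 (A, b assumed given)

    import numpy as np

    def prox_l1(w, t):
        """prox_{t ||.||_1}(w): elementwise soft thresholding with threshold t."""
        return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

    # grad_g = lambda x: A.T @ (A @ x - b)    # gradient of (1/2) ||Ax - b||_2^2
    # L = np.linalg.norm(A, 2) ** 2           # lambda_max(A^T A) = sigma_max(A)^2
    # x = fista(grad_g, prox_l1, x0=np.zeros(A.shape[1]), t=1.0 / L)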
8-16
Example: nuclear norm regularization
minimize   g(X) + ‖X‖_*

g is smooth and convex; variable X ∈ R^{m×n} (with m ≥ n)

nuclear norm:

    \|X\|_* = \sum_i \sigma_i(X)

where σ_1(X) ≥ σ_2(X) ≥ ⋯ are the singular values of X

the dual norm of the matrix 2-norm (maximum singular value)

for diagonal X, reduces to the 1-norm of diag(X)

popular as penalty function that promotes low rank
8-17
prox-operator prox_{th}(X) of h(X) = ‖X‖_*:

    \mathrm{prox}_{th}(X) = \operatorname*{argmin}_U \left( \|U\|_* + \frac{1}{2t} \|U - X\|_F^2 \right)

take singular value decomposition X = P diag(σ_1, …, σ_n) Q^T

apply soft thresholding to the singular values:

    \mathrm{prox}_{th}(X) = P \, \mathrm{diag}(\hat\sigma_1, \ldots, \hat\sigma_n) \, Q^T, \qquad
    \hat\sigma_k = \begin{cases} \sigma_k - t & (\sigma_k \ge t) \\ 0 & (\sigma_k \le t) \end{cases}
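a small sketch of this prox-operator via the SVD (function name illustrative)

    import numpy as np

    def prox_nuclear(X, t):
        """prox_{t ||.||_*}(X): soft thresholding of the singular values of X."""
        P, s, Qt = np.linalg.svd(X, full_matrices=False)   # X = P diag(s) Q^T
        s_thr = np.maximum(s - t, 0.0)                     # sigma_hat_k = max(sigma_k - t, 0)
        return (P * s_thr) @ Qt                            # P diag(s_thr) Q^T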
8-18
Approximate low-rank completion
minimize   Σ_{(i,j)∈N} (X_ij − A_ij)² + ‖X‖_*

entries (i, j) ∈ N are approximately specified (X_ij ≈ A_ij); the rest are free

nuclear norm regularization is added to obtain a low-rank X
example
m = n = 500
5000 specified entries
8-19
convergence (fixed step size t = 1/L)
[figure: (f(x^{(k)}) − f⋆)/f⋆ versus k for the gradient method and FISTA]
8-20
result
[figure: normalized singular values of the solution X versus index]
optimal X has rank 38; relative error in specified entries is 9%
8-21
Descent version of FISTA
choose x^{(0)} ∈ dom h and y^{(0)} = x^{(0)}; for k ≥ 1

    1.  u^{(k)} = \mathrm{prox}_{t_k h}\left( y^{(k-1)} - t_k \nabla g(y^{(k-1)}) \right)

    2.  x^{(k)} = \begin{cases} u^{(k)} & f(u^{(k)}) \le f(x^{(k-1)}) \\ x^{(k-1)} & \text{otherwise} \end{cases}

    3.  v^{(k)} = x^{(k-1)} + \frac{1}{\theta_k} \left( u^{(k)} - x^{(k-1)} \right)

    4.  y^{(k)} = (1 - \theta_{k+1}) x^{(k)} + \theta_{k+1} v^{(k)}

step 2 implies f(x^{(k)}) ≤ f(x^{(k−1)})

same iteration complexity as the original FISTA

changes on page 8-10: replace x^+ with u^+ = u^{(i)} and use f(x^+) ≤ f(u^+)
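a sketch of this monotone variant, assuming f, grad_g, prox_h are supplied and using θ_k = 2/(k + 1)

    import numpy as np

    def fista_descent(f, grad_g, prox_h, x0, t, num_iters=200):
        """Sketch of the descent version of FISTA with fixed step size t."""
        x = y = np.asarray(x0, dtype=float)
        for k in range(1, num_iters + 1):
            theta, theta_next = 2.0 / (k + 1), 2.0 / (k + 2)
            u = prox_h(y - t * grad_g(y), t)         # candidate proximal gradient step
            x_prev = x
            x = u if f(u) <= f(x_prev) else x_prev   # keep the better of u and x^(k-1)
            v = x_prev + (u - x_prev) / theta        # momentum sequence is built from u
            y = (1 - theta_next) * x + theta_next * v
        return x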
8-22
Example
(from page 8-6)
[figure: (f(x^{(k)}) − f⋆)/|f⋆| versus k for the gradient method, FISTA, and the descent version FISTA-d]
8-23
Outline
fast proximal gradient method (FISTA)
Nesterov's second method
Nesterov's second method
algorithm: choose x^{(0)} = y^{(0)} = v^{(0)} ∈ dom h; for k ≥ 1

    v^{(k)} = \mathrm{prox}_{(t_k/\theta_k) h}\left( v^{(k-1)} - \frac{t_k}{\theta_k} \nabla g(y^{(k-1)}) \right)

    x^{(k)} = (1 - \theta_k) x^{(k-1)} + \theta_k v^{(k)}

    y^{(k)} = (1 - \theta_{k+1}) x^{(k)} + \theta_{k+1} v^{(k)}

with θ_k = 2/(k + 1)
can be shown to be identical to FISTA if h(x) = 0
unlike in FISTA, y^{(k)} remains feasible (i.e., in dom h)
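a sketch of the method, assuming grad_g and prox_h (with prox_h(w, s) = prox_{sh}(w)) are supplied

    import numpy as np

    def nesterov_second(grad_g, prox_h, x0, t, num_iters=200):
        """Sketch of Nesterov's second method with fixed step size t and theta_k = 2/(k+1)."""
        x = v = y = np.asarray(x0, dtype=float)
        for k in range(1, num_iters + 1):
            theta, theta_next = 2.0 / (k + 1), 2.0 / (k + 2)
            v = prox_h(v - (t / theta) * grad_g(y), t / theta)  # prox step on the v-sequence
            x = (1 - theta) * x + theta * v                     # averaged iterate x^(k)
            y = (1 - theta_next) * x + theta_next * v           # next gradient evaluation point
        return x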
8-24
Convergence of Nesterov's second method
assumptions
optimal value f⋆ is finite and attained at x⋆ (not necessarily unique)

∇g is Lipschitz continuous on dom h ⊆ dom g with constant L > 0:

    \|\nabla g(x) - \nabla g(y)\|_2 \le L \|x - y\|_2 \qquad \forall x, y \in \mathrm{dom}\, h
h is closed and convex
result: f(x^{(k)}) − f⋆ decreases at least as fast as 1/k²
if fixed step size t_k = 1/L is used
if backtracking line search is used
8-25
Analysis of one iteration
define x = x^{(i-1)}, x^+ = x^{(i)}, y = y^{(i-1)}, v = v^{(i-1)}, v^+ = v^{(i)}, t = t_i, θ = θ_i

from the Lipschitz property, if t ≤ 1/L,

    g(x^+) \le g(y) + \nabla g(y)^T (x^+ - y) + \frac{1}{2t} \|x^+ - y\|_2^2 \qquad (2)

plug in x^+ = (1 − θ)x + θv^+ and x^+ − y = θ(v^+ − v):

    g(x^+) \le g(y) + \nabla g(y)^T \left( (1 - \theta)x + \theta v^+ - y \right) + \frac{\theta^2}{2t} \|v^+ - v\|_2^2

from convexity of g, h:

    g(x^+) \le (1 - \theta) g(x) + \theta \left( g(y) + \nabla g(y)^T (v^+ - y) \right) + \frac{\theta^2}{2t} \|v^+ - v\|_2^2

    h(x^+) \le (1 - \theta) h(x) + \theta h(v^+)
8-26
from property of prox-operator on page 8-9
    h(v^+) \le h(z) + \nabla g(y)^T (z - v^+) - \frac{\theta}{t} (v^+ - v)^T (v^+ - z)

combine the upper bounds on g(x^+), h(x^+), h(v^+), with z = x⋆:

    f(x^+) \le (1 - \theta) f(x) + \theta f^\star - \frac{\theta^2}{t} (v^+ - v)^T (v^+ - x^\star) + \frac{\theta^2}{2t} \|v^+ - v\|_2^2
           = (1 - \theta) f(x) + \theta f^\star + \frac{\theta^2}{2t} \left( \|v - x^\star\|_2^2 - \|v^+ - x^\star\|_2^2 \right)
the same final inequality as in the analysis of FISTA on page 8-11
conclusion: same 1/k² complexity as FISTA

for fixed step size t_i = 1/L

with a backtracking line search that ensures (2) and t_i ≤ t_{i−1}
8-27
References
surveys of fast gradient methods
Yu. Nesterov, Introductory Lectures on Convex Optimization. A Basic Course (2004)
P. Tseng, On accelerated proximal gradient methods for convex-concave optimization
(2008)
FISTA
A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear
inverse problems, SIAM J. on Imaging Sciences (2009)
A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal
recovery, in: Y. Eldar and D. Palomar (Eds.), Convex Optimization in Signal
Processing and Communications (2009)
Nesterov's third method (not covered in this lecture)
Yu. Nesterov, Smooth minimization of non-smooth functions, Mathematical
Programming (2005)
S. Becker, J. Bobin, E. J. Candès, NESTA: a fast and accurate first-order method for
sparse recovery, SIAM J. Imaging Sciences (2011)
8-28