Mirror Descent and Variable Metric Methods
Stephen Boyd & John Duchi & Mert Pilanci
EE364b, Stanford University
April 24, 2019
Mirror descent
• due to Nemirovski and Yudin (1983)
• recall the projected subgradient method:
(1) get subgradient $g^{(k)} \in \partial f(x^{(k)})$
(2) update
$$x^{(k+1)} = \operatorname*{argmin}_{x \in C} \; g^{(k)T} x + \frac{1}{2\alpha_k}\big\|x - x^{(k)}\big\|_2^2$$
• replace $\|\cdot\|_2^2$ with an alternative distance-like function
Convergence rate of projected subgradient method
Consider $\min_{x \in C} f(x)$
• Bounded subgradients: $\|g\|_2 \le G$ for all $g \in \partial f$
• Initialization radius: $\|x^{(1)} - x^\star\|_2 \le R$
The projected subgradient method iterates satisfy
$$f_{\mathrm{best}}^{(k)} - f^\star \le \frac{R^2 + G^2 \sum_{i=1}^k \alpha_i^2}{2 \sum_{i=1}^k \alpha_i}$$
Setting $\alpha_i = (R/G)/\sqrt{k}$ gives
$$f_{\mathrm{best}}^{(k)} - f^\star \le \frac{RG}{\sqrt{k}}$$
Here $G = \max_{x \in C} \|\partial f(x)\|_2$ and $R = \max_{x,y \in C} \|x - y\|_2$. The analysis and the convergence results depend on the Euclidean ($\ell_2$) norm.
Bregman Divergence
Let $h$ be convex and differentiable over an open convex set $C$.
• The Bregman divergence associated with $h$ is
$$D_h(x, y) = h(x) - h(y) - \nabla h(y)^T (x - y)$$
• can be interpreted as the distance between $x$ and $y$ as measured by the function $h$
• Example: $h(x) = \|x\|_2^2$ gives $D_h(x, y) = \|x - y\|_2^2$
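As a concrete check (not from the original slides), here is a minimal NumPy sketch that evaluates $D_h$ for the two choices of $h$ used later in the lecture; the helper names are our own:

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """D_h(x, y) = h(x) - h(y) - grad_h(y)^T (x - y)."""
    return h(x) - h(y) - grad_h(y) @ (x - y)

# h(x) = ||x||_2^2 gives the squared Euclidean distance ||x - y||_2^2
sq = lambda x: x @ x
grad_sq = lambda x: 2 * x

# negative entropy h(x) = sum_i x_i log x_i gives the generalized KL divergence
negent = lambda x: np.sum(x * np.log(x))
grad_negent = lambda x: np.log(x) + 1

x, y = np.array([0.2, 0.3, 0.5]), np.array([0.1, 0.4, 0.5])
print(bregman(sq, grad_sq, x, y))          # equals np.sum((x - y)**2)
print(bregman(negent, grad_negent, x, y))  # equals KL(x || y) on the simplex
```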
Strong convexity
$h$ is $\lambda$-strongly convex with respect to the norm $\|\cdot\|$ if
$$h(x) \ge h(y) + \nabla h(y)^T (x - y) + \frac{\lambda}{2}\|x - y\|^2$$
Properties of Bregman divergence
For a $\lambda$-strongly convex function $h$, the Bregman divergence
$$D_h(x, y) = h(x) - h(y) - \nabla h(y)^T (x - y)$$
satisfies
$$D_h(x, y) \ge \frac{\lambda}{2}\|x - y\|^2 \ge 0$$
Pythagorean theorem
Bregman projection:
$$P_C^h(y) = \operatorname*{argmin}_{x \in C} D_h(x, y)$$
For any $x \in C$,
$$D_h(x, y) \ge D_h\big(x, P_C^h(y)\big) + D_h\big(P_C^h(y), y\big)$$
Projected Gradient Descent
$$x^{(k+1)} = P_C\Big(\operatorname*{argmin}_{x} \; f(x^{(k)}) + g^{(k)T}(x - x^{(k)}) + \frac{1}{2\alpha_k}\|x - x^{(k)}\|_2^2\Big)$$
where $P_C$ is the Euclidean projection onto $C$.
Mirror Descent
$$x^{(k+1)} = P_C^h\Big(\operatorname*{argmin}_{x} \; f(x^{(k)}) + g^{(k)T}(x - x^{(k)}) + \frac{1}{\alpha_k} D_h(x, x^{(k)})\Big)$$
where $D_h(x, y)$ is the Bregman divergence
$$D_h(x, y) = h(x) - h(y) - \nabla h(y)^T (x - y)$$
$h(x)$ is strongly convex with respect to $\|\cdot\|$, and $P_C^h$ is the Bregman projection:
$$P_C^h(y) = \operatorname*{argmin}_{x \in C} D_h(x, y)$$
Mirror Descent update rule
$$\begin{aligned}
x^{(k+1)} &= P_C^h\Big(\operatorname*{argmin}_{x} \; f(x^{(k)}) + g^{(k)T}(x - x^{(k)}) + \frac{1}{\alpha_k} D_h(x, x^{(k)})\Big)\\
&= P_C^h\Big(\operatorname*{argmin}_{x} \; g^{(k)T} x + \frac{1}{\alpha_k} D_h(x, x^{(k)})\Big)\\
&= P_C^h\Big(\operatorname*{argmin}_{x} \; g^{(k)T} x + \frac{1}{\alpha_k} h(x) - \frac{1}{\alpha_k} \nabla h(x^{(k)})^T x\Big)
\end{aligned}$$
The optimality condition for $y = \operatorname{argmin}$ is $g^{(k)} + \frac{1}{\alpha_k}\nabla h(y) - \frac{1}{\alpha_k}\nabla h(x^{(k)}) = 0$.
$D_h(x, y)$ is the Bregman divergence
$$D_h(x, y) = h(x) - h(y) - \nabla h(y)^T (x - y)$$
and $P_C^h$ is the Bregman projection:
$$\begin{aligned}
P_C^h(y) = \operatorname*{argmin}_{x \in C} D_h(x, y) &= \operatorname*{argmin}_{x \in C} \; h(x) - \nabla h(y)^T x\\
&= \operatorname*{argmin}_{x \in C} \; h(x) - \big(\nabla h(x^{(k)}) - \alpha_k g^{(k)}\big)^T x\\
&= \operatorname*{argmin}_{x \in C} \; D_h(x, x^{(k)}) + \alpha_k g^{(k)T} x
\end{aligned}$$
Mirror Descent update rule (simplified)
$$\begin{aligned}
x^{(k+1)} &= P_C^h\Big(\operatorname*{argmin}_{x} \; f(x^{(k)}) + g^{(k)T}(x - x^{(k)}) + \frac{1}{\alpha_k} D_h(x, x^{(k)})\Big)\\
&= \operatorname*{argmin}_{x \in C} \; D_h(x, x^{(k)}) + \alpha_k g^{(k)T} x
\end{aligned}$$
where $D_h(x, y)$ is the Bregman divergence
$$D_h(x, y) = h(x) - h(y) - \nabla h(y)^T (x - y)$$
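A minimal sketch of this update in code, assuming the user supplies $\nabla h$, its inverse $(\nabla h)^{-1} = \nabla h^*$, and a Bregman projection onto $C$ (all names here are illustrative):

```python
import numpy as np

def mirror_descent_step(x, g, alpha, grad_h, grad_h_inv, proj_C):
    """One mirror descent step: argmin_{x' in C} D_h(x', x) + alpha * g^T x'.

    Per the derivation above, the unconstrained minimizer y satisfies
    grad_h(y) = grad_h(x) - alpha * g; the constrained update is its
    Bregman projection P_C^h(y)."""
    y = grad_h_inv(grad_h(x) - alpha * g)
    return proj_C(y)
```

For $h = \frac{1}{2}\|\cdot\|_2^2$ this reduces to an ordinary projected subgradient step.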
Convergence guarantees
Let $\|g\|_* \le G_{\|\cdot\|}$ for all $g \in \partial f$, or equivalently
$$f(x) - f(y) \le G_{\|\cdot\|} \|x - y\|$$
Let $x_h = \operatorname*{argmin}_{x \in C} h(x)$ and $R_{\|\cdot\|}^h = \big(2 \max_y D_h(x_h, y)/\lambda\big)^{1/2}$; then $\|x - x_h\| \le R_{\|\cdot\|}^h$ for $x \in C$.
General guarantee:
$$\sum_{i=1}^k \alpha_i \big[f(x^{(i)}) - f(x^\star)\big] \le D_h(x^\star, x^{(1)}) + \frac{1}{2}\sum_{i=1}^k \alpha_i^2 \|g^{(i)}\|_*^2$$
Choose step size $\alpha_k = \dfrac{\lambda R_{\|\cdot\|}^h}{G_{\|\cdot\|} \sqrt{k}}$. The mirror descent iterates then satisfy
$$f_{\mathrm{best}}^{(k)} - f^\star \le \frac{R_{\|\cdot\|}^h \, G_{\|\cdot\|}}{\sqrt{k}}$$
Standard setups for Mirror Descent
$$\begin{aligned}
x^{(k+1)} &= P_C^h\Big(\operatorname*{argmin}_{x} \; f(x^{(k)}) + g^{(k)T}(x - x^{(k)}) + \frac{1}{\alpha_k} D_h(x, x^{(k)})\Big)\\
&= \operatorname*{argmin}_{x \in C} \; D_h(x, x^{(k)}) + \alpha_k g^{(k)T} x
\end{aligned}$$
where $D_h(x, y)$ is the Bregman divergence
$$D_h(x, y) = h(x) - h(y) - \nabla h(y)^T (x - y)$$
• the simplest version is $h(x) = \frac{1}{2}\|x\|_2^2$, which is strongly convex w.r.t. $\|\cdot\|_2$; mirror descent = projected subgradient descent, with $D_h(x, y) = \frac{1}{2}\|x - y\|_2^2$
• negative entropy $h(x) = \sum_{i=1}^n x_i \log x_i$, which is 1-strongly convex w.r.t. $\|\cdot\|_1$ (Pinsker's inequality); here $D_h(x, y) = \sum_{i=1}^n x_i \log \frac{x_i}{y_i} - (x_i - y_i)$ is the generalized Kullback-Leibler divergence
Negative Entropy
• negative entropy $h(x) = \sum_{i=1}^n x_i \log x_i$; $D_h(x, y) = \sum_{i=1}^n x_i \log \frac{x_i}{y_i} - (x_i - y_i)$ is the generalized Kullback-Leibler divergence
• unit simplex $C = \Delta_n = \{x \in \mathbf{R}_+^n : \sum_i x_i = 1\}$
• the Bregman projection onto the simplex is a simple renormalization:
$$P_C^h(y) = \frac{y}{\|y\|_1}$$
• mirror descent:
$$x^{(k+1)} = P_C^h\Big(\operatorname*{argmin}_{x} \; f(x^{(k)}) + g^{(k)T}(x - x^{(k)}) + \frac{1}{\alpha_k} D_h(x, x^{(k)})\Big)$$
• $y \in \operatorname{argmin} \implies \nabla h(y) = \log(y) + \mathbf{1} = \nabla h(x^{(k)}) - \alpha_k g^{(k)} \implies y_i = x_i^{(k)} \exp(-\alpha_k g_i^{(k)})$
• mirror descent update:
$$x_i^{(k+1)} = \frac{x_i^{(k)} \exp(-\alpha_k g_i^{(k)})}{\sum_{j=1}^n x_j^{(k)} \exp(-\alpha_k g_j^{(k)})}$$
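In code, this entropic mirror descent (exponentiated gradient) step is a one-liner; the max-shift below is a standard numerical-stability trick, not part of the math:

```python
import numpy as np

def entropic_md_step(x, g, alpha):
    """Mirror descent on the simplex with negative-entropy h:
    multiplicative update followed by renormalization."""
    z = -alpha * g
    y = x * np.exp(z - z.max())  # the shift by z.max() cancels in the ratio
    return y / y.sum()
```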
Mirror descent examples
• usual (projected) subgradient descent: $h(x) = \frac{1}{2}\|x\|_2^2$
• with the constraint of the simplex, $C = \{x \in \mathbf{R}_+^n \mid \mathbf{1}^T x = 1\}$, use negative entropy
$$h(x) = \sum_{i=1}^n x_i \log x_i$$
(1) strongly convex with respect to the $\ell_1$-norm
(2) with $x^{(1)} = \mathbf{1}/n$, have $D_h(x^\star, x^{(1)}) \le \log n$ for $x^\star \in C$
(3) if $\|g\|_\infty \le G_\infty$ for $g \in \partial f(x)$, $x \in C$,
$$f_{\mathrm{best}}^{(k)} - f^\star \le \frac{\log n}{\alpha k} + \frac{\alpha}{2} G_\infty^2$$
(4) can be much better than the regular subgradient method (see the bound below)
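As a quick side calculation (not on the original slide), minimizing the bound in (3) over $\alpha$ shows why:

$$\alpha = \frac{1}{G_\infty}\sqrt{\frac{2\log n}{k}} \quad\Longrightarrow\quad f_{\mathrm{best}}^{(k)} - f^\star \le G_\infty \sqrt{\frac{2\log n}{k}},$$

so the dimension enters only through $\sqrt{\log n}$, whereas the Euclidean bound $RG/\sqrt{k}$ can pick up polynomial factors of $n$ through $G$ and $R$.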
Example
Robust regression problem (an LP):
$$\begin{array}{ll}
\text{minimize} & f(x) = \|Ax - b\|_1 = \sum_{i=1}^m |a_i^T x - b_i|\\
\text{subject to} & x \in C = \{x \in \mathbf{R}_+^n \mid \mathbf{1}^T x = 1\}
\end{array}$$
A subgradient of the objective is $g = \sum_{i=1}^m \operatorname{sign}(a_i^T x - b_i)\, a_i$.
• Projected subgradient update ($h(x) = (1/2)\|x\|_2^2$): homework
• Mirror descent update ($h(x) = \sum_{i=1}^n x_i \log x_i$):
$$x_i^{(k+1)} = \frac{x_i^{(k)} \exp(-\alpha g_i^{(k)})}{\sum_{j=1}^n x_j^{(k)} \exp(-\alpha g_j^{(k)})}$$
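A self-contained sketch of the mirror descent loop for this problem; the problem data and the $1/\sqrt{k}$ step-size schedule below are illustrative assumptions, not the instance on the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 100                            # small illustrative instance
A = rng.standard_normal((m, n))
b = (A[:, 0] + A[:, 1]) / 2 + 0.1 * rng.standard_normal(m)

f = lambda x: np.abs(A @ x - b).sum()     # f(x) = ||Ax - b||_1
x = np.ones(n) / n                        # start at the center of the simplex
f_best = f(x)

for k in range(1, 201):
    g = A.T @ np.sign(A @ x - b)          # subgradient of f at x
    alpha = np.sqrt(2 * np.log(n) / k) / np.abs(g).max()  # heuristic schedule
    y = x * np.exp(-alpha * g)            # entropic mirror descent update
    x = y / y.sum()
    f_best = min(f_best, f(x))
```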
Example
Robust regression problem with $a_i \sim \mathcal{N}(0, I_{n \times n})$ and $b_i = (a_{i,1} + a_{i,2})/2 + \varepsilon_i$, where $\varepsilon_i \sim \mathcal{N}(0, 10^{-2})$, $m = 20$, $n = 3000$.
[Figure: $f_{\mathrm{best}}^{(k)} - f^\star$ versus iteration $k$ (0 to 60), log scale from $10^{-6}$ to $10^{1}$, for the entropy (mirror descent) and gradient (projected subgradient) methods.]
Step sizes chosen according to the best bounds (but still sensitive to step-size choice).
Example: Spectrahedron
Minimizing a function over the spectrahedron $\mathcal{S}_n$, defined as
$$\mathcal{S}_n = \{X \in \mathbf{S}_+^n : \operatorname{tr}(X) = 1\}$$
Example: Spectrahedron
$$\mathcal{S}_n = \{X \in \mathbf{S}_+^n : \operatorname{tr}(X) = 1\}$$
• von Neumann entropy:
$$h(X) = \sum_{i=1}^n \lambda_i(X) \log \lambda_i(X)$$
where $\lambda_1(X), \ldots, \lambda_n(X)$ are the eigenvalues of $X$
• $\frac{1}{2}$-strongly convex with respect to the trace norm
$$\|X\|_{\mathrm{tr}} = \sum_{i=1}^n \lambda_i(X)$$
• mirror descent update:
$$Y_{t+1} = \exp\big(\log X_t - \alpha_t \nabla f(X_t)\big), \qquad X_{t+1} = P_C^h(Y_{t+1}) = Y_{t+1}/\|Y_{t+1}\|_{\mathrm{tr}}$$
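A sketch of this matrix exponentiated-gradient step via an eigendecomposition; the helpers are our own, and we assume $X \succ 0$ (so $\log X$ exists) and symmetric $\nabla f(X)$:

```python
import numpy as np

def mat_fun(X, fun):
    """Apply a scalar function to a symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(X)
    return (V * fun(w)) @ V.T

def spectrahedron_md_step(X, G, alpha):
    """Y = exp(log X - alpha * G), then the Bregman projection onto S_n,
    which is division by tr(Y) (= ||Y||_tr, since Y is positive definite)."""
    Y = mat_fun(mat_fun(X, np.log) - alpha * G, np.exp)
    return Y / np.trace(Y)
```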
Mirror Descent Analysis
Distance-generating function $h$, 1-strongly convex w.r.t. $\|\cdot\|$:
$$h(y) \ge h(x) + \nabla h(x)^T (y - x) + \frac{1}{2}\|x - y\|^2$$
Fenchel conjugate:
$$h^*(\theta) = \sup_{x \in C} \big\{\theta^T x - h(x)\big\}, \qquad \nabla h^*(\theta) = \operatorname*{argmax}_{x \in C} \big\{\theta^T x - h(x)\big\}$$
$\nabla h$ and $\nabla h^*$ take us "through the mirror" and back: $\nabla h$ maps $x$ to $\theta$, and $\nabla h^*$ maps $\theta$ back to $x$.
Mirror descent iterations for $C = \mathbf{R}^n$:
$$x^{(k+1)} = \operatorname*{argmin}_{x \in C} \Big\{\alpha_k g^{(k)T} x + D_h(x, x^{(k)})\Big\} = \nabla h^*\big(\nabla h(x^{(k)}) - \alpha_k g^{(k)}\big)$$
$h(x) = \frac{1}{2}\|x\|_2^2$ recovers the standard case.
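One concrete mirror pair, taking $h$ to be negative entropy over the positive orthant (an illustrative choice, not the only one): $\nabla h(x) = \log x + \mathbf{1}$ and $\nabla h^*(\theta) = e^{\theta - 1}$, which invert each other:

```python
import numpy as np

grad_h = lambda x: np.log(x) + 1        # "through the mirror": x -> theta
grad_h_star = lambda t: np.exp(t - 1)   # and back: theta -> x

x = np.array([0.2, 0.3, 0.5])
assert np.allclose(grad_h_star(grad_h(x)), x)

# unconstrained mirror descent step x+ = grad_h_star(grad_h(x) - alpha * g):
g, alpha = np.array([1.0, -0.5, 0.0]), 0.1
x_next = grad_h_star(grad_h(x) - alpha * g)
assert np.allclose(x_next, x * np.exp(-alpha * g))  # multiplicative update
```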
Convergence analysis
$$g^{(k)} \in \partial f(x^{(k)}), \qquad \theta^{(k+1)} = \theta^{(k)} - \alpha_k g^{(k)}, \qquad x^{(k+1)} = \nabla h^*(\theta^{(k+1)})$$
Bregman divergence:
$$D_{h^*}(\theta', \theta) = h^*(\theta') - h^*(\theta) - \nabla h^*(\theta)^T (\theta' - \theta)$$
Let $\theta^\star = \nabla h(x^\star)$. Then
$$\begin{aligned}
D_{h^*}(\theta^{(k+1)}, \theta^\star) = D_{h^*}(\theta^{(k)}, \theta^\star) &+ (\theta^{(k+1)} - \theta^{(k)})^T \big(\nabla h^*(\theta^{(k)}) - \nabla h^*(\theta^\star)\big)\\
&+ D_{h^*}(\theta^{(k+1)}, \theta^{(k)})
\end{aligned}$$
and
$$(\theta^{(k+1)} - \theta^{(k)})^T \big(\nabla h^*(\theta^{(k)}) - \nabla h^*(\theta^\star)\big) = -\alpha_k g^{(k)T}(x^{(k)} - x^\star)$$
Convergence analysis continued
From convexity and $g^{(k)} \in \partial f(x^{(k)})$,
$$f(x^{(k)}) - f(x^\star) \le g^{(k)T}(x^{(k)} - x^\star)$$
Therefore
$$\alpha_k \big[f(x^{(k)}) - f(x^\star)\big] \le D_{h^*}(\theta^{(k)}, \theta^\star) - D_{h^*}(\theta^{(k+1)}, \theta^\star) + D_{h^*}(\theta^{(k+1)}, \theta^{(k)})$$
Fact: $h$ is 1-strongly convex w.r.t. $\|\cdot\|$ $\iff$ $D_h(x', x) \ge \frac{1}{2}\|x' - x\|^2$ $\iff$ $h^*$ is 1-smooth w.r.t. $\|\cdot\|_*$ $\iff$ $D_{h^*}(\theta', \theta) \le \frac{1}{2}\|\theta' - \theta\|_*^2$
Bounding the $D_{h^*}(\theta^{(k+1)}, \theta^{(k)})$ terms and telescoping gives
$$\sum_{i=1}^k \alpha_i \big[f(x^{(i)}) - f(x^\star)\big] \le D_{h^*}(\theta^{(1)}, \theta^\star) + \frac{1}{2}\sum_{i=1}^k \alpha_i^2 \|g^{(i)}\|_*^2$$
Convergence guarantees
Note: $D_{h^*}(\theta^{(1)}, \theta^\star) = D_h(x^\star, x^{(1)})$
Most general guarantee:
$$\sum_{i=1}^k \alpha_i \big[f(x^{(i)}) - f(x^\star)\big] \le D_h(x^\star, x^{(1)}) + \frac{1}{2}\sum_{i=1}^k \alpha_i^2 \|g^{(i)}\|_*^2$$
Fixed step size $\alpha_k = \alpha$:
$$\frac{1}{k}\sum_{i=1}^k f(x^{(i)}) - f(x^\star) \le \frac{1}{\alpha k} D_h(x^\star, x^{(1)}) + \frac{\alpha}{2} \max_i \|g^{(i)}\|_*^2$$
In general, the method converges if
• $D_h(x^\star, x^{(1)}) < \infty$
• $\sum_k \alpha_k = \infty$ and $\alpha_k \to 0$
• for all $g \in \partial f(x)$ and $x \in C$, $\|g\|_* \le G$ for some $G < \infty$
Stochastic gradients are fine!
Variable metric subgradient methods
Subgradient method with variable metric $H_k \succ 0$:
(1) get subgradient $g^{(k)} \in \partial f(x^{(k)})$
(2) update (diagonal) metric $H_k$
(3) update $x^{(k+1)} = x^{(k)} - H_k^{-1} g^{(k)}$
• the matrix $H_k$ generalizes the step length $\alpha_k$
• there are many such methods (ellipsoid method, AdaGrad, ...)
Variable metric projected subgradient method
Same, with the projection carried out in the $H_k$ metric:
(1) get subgradient $g^{(k)} \in \partial f(x^{(k)})$
(2) update (diagonal) metric $H_k$
(3) update $x^{(k+1)} = \Pi_X^{H_k}\big(x^{(k)} - H_k^{-1} g^{(k)}\big)$
where
$$\Pi_X^H(y) = \operatorname*{argmin}_{x \in X} \|x - y\|_H^2$$
and $\|x\|_H = \sqrt{x^T H x}$.
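For example, with a diagonal metric and a box constraint $X = \{x : l \le x \le u\}$ (an illustrative choice), the $H$-metric projection separates across coordinates and reduces to plain clipping; a minimal sketch with our own names:

```python
import numpy as np

def vm_step_box(x, g, h, lo, hi):
    """Variable-metric projected subgradient step with H = diag(h), h > 0,
    onto the box [lo, hi]^n. With a diagonal metric, argmin ||x' - y||_H^2
    over the box decouples per coordinate, so the projection is a clip."""
    return np.clip(x - g / h, lo, hi)
```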
Convergence analysis
Since $\Pi_X^{H_k}$ is non-expansive in the $\|\cdot\|_{H_k}$ norm, we get
$$\begin{aligned}
\|x^{(k+1)} - x^\star\|_{H_k}^2 &= \big\|\Pi_X^{H_k}\big(x^{(k)} - H_k^{-1} g^{(k)}\big) - \Pi_X^{H_k}(x^\star)\big\|_{H_k}^2\\
&\le \|x^{(k)} - H_k^{-1} g^{(k)} - x^\star\|_{H_k}^2\\
&= \|x^{(k)} - x^\star\|_{H_k}^2 - 2 g^{(k)T}(x^{(k)} - x^\star) + \|g^{(k)}\|_{H_k^{-1}}^2\\
&\le \|x^{(k)} - x^\star\|_{H_k}^2 - 2\big(f(x^{(k)}) - f^\star\big) + \|g^{(k)}\|_{H_k^{-1}}^2,
\end{aligned}$$
using $f^\star = f(x^\star) \ge f(x^{(k)}) + g^{(k)T}(x^\star - x^{(k)})$.
Applying this recursively, using
$$\sum_{i=1}^k f(x^{(i)}) - f^\star \ge k\big(f_{\mathrm{best}}^{(k)} - f^\star\big)$$
and rearranging gives
$$f_{\mathrm{best}}^{(k)} - f^\star \le \frac{\|x^{(1)} - x^\star\|_{H_1}^2 + \sum_{i=1}^k \|g^{(i)}\|_{H_i^{-1}}^2}{2k} + \frac{\sum_{i=2}^k \Big(\|x^{(i)} - x^\star\|_{H_i}^2 - \|x^{(i)} - x^\star\|_{H_{i-1}}^2\Big)}{2k}$$
The numerator of the additional term can be bounded to get estimates:
• for general $H_k = \operatorname{diag}(h_k)$,
$$f_{\mathrm{best}}^{(k)} - f^\star \le \frac{R_\infty^2 \|H_1\|_1 + \sum_{i=1}^k \|g^{(i)}\|_{H_i^{-1}}^2}{2k} + \frac{R_\infty^2 \sum_{i=2}^k \|H_i - H_{i-1}\|_1}{2k}$$
• for $H_k = \operatorname{diag}(h_k)$ with $h_i \ge h_{i-1}$ for all $i$,
$$f_{\mathrm{best}}^{(k)} - f^\star \le \frac{\sum_{i=1}^k \|g^{(i)}\|_{H_i^{-1}}^2}{2k} + \frac{R_\infty^2 \|h_k\|_1}{2k}$$
where $\max_{1 \le i \le k} \|x^{(i)} - x^\star\|_\infty \le R_\infty$.
The method converges if
• $R_\infty < \infty$ (e.g., if $X$ is compact)
• $\sum_{i=1}^k \|g^{(i)}\|_{H_i^{-1}}^2$ grows slower than $k$
• $\sum_{i=2}^k \|H_i - H_{i-1}\|_1$ grows slower than $k$, or $h_i \ge h_{i-1}$ for all $i$ and $\|h_k\|_1$ grows slower than $k$
AdaGrad
AdaGrad, the adaptive subgradient method:
(1) get subgradient $g^{(k)} \in \partial f(x^{(k)})$
(2) choose metric $H_k$:
• set $S_k = \sum_{i=1}^k \operatorname{diag}(g^{(i)})^2$
• set $H_k = \frac{1}{\alpha} S_k^{1/2}$
(3) update $x^{(k+1)} = \Pi_X^{H_k}\big(x^{(k)} - H_k^{-1} g^{(k)}\big)$
where $\alpha > 0$ is the step size
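A minimal diagonal-AdaGrad sketch for the unconstrained case $X = \mathbf{R}^n$ (so the projection is the identity); the small floor inside the square root is a practical guard, not part of the algorithm:

```python
import numpy as np

def adagrad(x0, subgrad, alpha, iters):
    """Diagonal AdaGrad: S_k = sum_i diag(g^(i))^2 and H_k = (1/alpha) S_k^(1/2),
    so the step x - H_k^{-1} g is x - alpha * g / sqrt(s)."""
    x = x0.astype(float).copy()
    s = np.zeros_like(x)                  # running sum of squared subgradients
    for _ in range(iters):
        g = subgrad(x)
        s += g * g
        x -= alpha * g / np.sqrt(np.maximum(s, 1e-12))  # s = 0 only where g = 0
    return x
```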
AdaGrad – motivation
• for fixed $H_k = H$ we have the estimate
$$f_{\mathrm{best}}^{(k)} - f^\star \le \frac{1}{2k}(x^{(1)} - x^\star)^T H (x^{(1)} - x^\star) + \frac{1}{2k}\sum_{i=1}^k \|g^{(i)}\|_{H^{-1}}^2$$
• idea: choose the diagonal $H_k \succ 0$ that minimizes this estimate in hindsight:
$$H_k = \operatorname*{argmin}_{h} \max_{x,y \in C} \; (x - y)^T \operatorname{diag}(h)(x - y) + \sum_{i=1}^k \|g^{(i)}\|_{\operatorname{diag}(h)^{-1}}^2$$
• optimal $H_k = \frac{1}{R_\infty} \operatorname{diag}\bigg(\sqrt{\sum_{i=1}^k (g_1^{(i)})^2}, \ldots, \sqrt{\sum_{i=1}^k (g_n^{(i)})^2}\bigg)$
• intuition: adapt the per-coordinate step length based on the history of gradient magnitudes
AdaGrad – convergence
By construction, $H_i = \frac{1}{\alpha}\operatorname{diag}(h_i)$ and $h_i \ge h_{i-1}$, so
$$\begin{aligned}
f_{\mathrm{best}}^{(k)} - f^\star &\le \frac{1}{2k}\sum_{i=1}^k \|g^{(i)}\|_{H_i^{-1}}^2 + \frac{1}{2k\alpha} R_\infty^2 \|h_k\|_1\\
&\le \frac{\alpha}{k}\|h_k\|_1 + \frac{1}{2k\alpha} R_\infty^2 \|h_k\|_1
\end{aligned}$$
(the second line is a theorem)
Also have (with $\alpha = R_\infty$), for compact sets $C$,
$$f_{\mathrm{best}}^{(k)} - f^\star \le \frac{2}{k} \inf_{h \ge 0} \sup_{x,y \in C} \left\{(x - y)^T \operatorname{diag}(h)(x - y) + \sum_{i=1}^k \|g^{(i)}\|_{\operatorname{diag}(h)^{-1}}^2\right\}$$
Example
Classification problem:
• data: $\{a_i, b_i\}$, $i = 1, \ldots, 50000$
• $a_i \in \mathbf{R}^{1000}$
• $b_i \in \{-1, 1\}$
• data created with 5% misclassifications w.r.t. $w = \mathbf{1}$, $v = 0$
• objective: find classifiers $w \in \mathbf{R}^{1000}$ and $v \in \mathbf{R}$ such that
• $a_i^T w + v > 1$ if $b_i = 1$
• $a_i^T w + v < -1$ if $b_i = -1$
• optimization method (sketched below):
• minimize the hinge loss $\sum_i \max\big(0, 1 - b_i(a_i^T w + v)\big)$
• choose an example uniformly at random, take a subgradient step w.r.t. that example
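A sketch of this experiment; the synthetic data generation and the $1/\sqrt{k}$ step-size schedule are our assumptions (the slides compare tuned subgradient and AdaGrad variants):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50000, 1000
A = rng.standard_normal((m, n))
b = np.sign(A @ np.ones(n))             # labels from w = 1, v = 0
flip = rng.random(m) < 0.05             # 5% label noise
b[flip] = -b[flip]

w, v = np.zeros(n), 0.0
for k in range(1, 5001):
    i = rng.integers(m)                 # example chosen uniformly at random
    if b[i] * (A[i] @ w + v) < 1:       # hinge term active: nonzero subgradient
        alpha = 1 / np.sqrt(k)          # illustrative schedule
        w += alpha * b[i] * A[i]        # w <- w - alpha * (-b_i * a_i)
        v += alpha * b[i]
```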
Best subgradient method vs best AdaGrad
[Figure: relative convergence versus iteration $k$ (0 to 5000), log scale, for the subgradient method and AdaGrad.]
Often the best-tuned AdaGrad performs better than the best-tuned subgradient method.
AdaGrad with different step sizes $\alpha$:
[Figure: relative convergence versus iteration $k$ (0 to 5000), log scale, for AdaGrad with four step-size choices $\alpha_1, \ldots, \alpha_4$.]
Sensitive to step-size selection (like the standard subgradient method).