
9. Unconstrained minimization

Outline

▶ Gradient descent method
▶ Steepest descent method
▶ Newton's method
▶ Self-concordant functions
▶ Implementation

Descent methods
▶ descent methods generate iterates as

x^(k+1) = x^(k) + t^(k)Δx^(k)

with f(x^(k+1)) < f(x^(k)) (hence the name)
▶ other notations: x⁺ = x + tΔx, x := x + tΔx
▶ Δx^(k) is the step, or search direction
▶ t^(k) > 0 is the step size, or step length
▶ from convexity, f(x⁺) < f(x) implies ∇f(x)ᵀΔx < 0
▶ this means Δx is a descent direction

Generic descent method

General descent method.


given a starting point x ∈ dom f .
repeat
1. Determine a descent direction Δx.
2. Line search. Choose a step size t > 0.
3. Update. x := x + tΔx.
until stopping criterion is satisfied.

Line search types

▶ exact line search: t = argmin_{t>0} f(x + tΔx)
▶ backtracking line search (with parameters α ∈ (0, 1/2), β ∈ (0, 1)):
– starting at t = 1, repeat t := βt until f(x + tΔx) < f(x) + αt∇f(x)ᵀΔx
(a minimal code sketch follows below)
▶ graphical interpretation: reduce t (i.e., backtrack) until t ≤ t₀, where t₀ is the point at which the line f(x) + αt∇f(x)ᵀΔx crosses f(x + tΔx)
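A minimal Python sketch of the backtracking rule above (my own construction, not code from the slides; the names f, grad_fx, and dx are assumptions):

```python
import numpy as np

def backtracking(f, grad_fx, x, dx, alpha=0.3, beta=0.8):
    """Backtracking line search: shrink t until the sufficient-decrease
    condition f(x + t*dx) <= f(x) + alpha*t*grad_fx@dx holds."""
    t = 1.0
    fx = f(x)
    slope = grad_fx @ dx   # directional derivative; < 0 for a descent direction
    while f(x + t * dx) > fx + alpha * t * slope:
        t *= beta
    return t
```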
Gradient descent method

▶ general descent method with Δx = −∇f(x)


given a starting point x ∈ dom f.
repeat
1. Δx := −∇f(x).
2. Line search. Choose step size t via exact or backtracking line search.
3. Update. x := x + tΔx.
until stopping criterion is satisfied.
▶ stopping criterion usually of the form ∥∇f(x)∥₂ ≤ ε
▶ convergence result: for strongly convex f,

f(x^(k)) − p★ ≤ cᵏ(f(x^(0)) − p★)

where c ∈ (0, 1) depends on m, x^(0), and the line search type
▶ very simple, but can be very slow (a code sketch follows below)
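A sketch of the method in Python, reusing the backtracking routine from the earlier sketch (again my own construction, not code from the slides):

```python
def gradient_descent(f, grad, x0, eps=1e-6, max_iter=10_000):
    """Gradient descent with backtracking line search; stops when the
    usual criterion ||grad f(x)||_2 <= eps is met."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:   # stopping criterion
            break
        dx = -g                        # descent direction: negative gradient
        t = backtracking(f, g, x, dx)  # line search
        x = x + t * dx                 # update
    return x
```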

Example: Quadratic function on R²

▶ take f(x) = (1/2)(x₁² + γx₂²), with γ > 0
▶ with exact line search, starting at x^(0) = (γ, 1):

– very slow if γ ≫ 1 or γ ≪ 1
– e.g., for γ = 10 the iterates bounce back and forth across the valley (see the demo below)
– called zig-zagging
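A quick numerical illustration of the zig-zagging (my construction; for a quadratic f(x) = (1/2)xᵀPx, the exact line search step has the closed form t = gᵀg/(gᵀPg) with g = ∇f(x)):

```python
import numpy as np

gamma = 10.0
P = np.diag([1.0, gamma])
x = np.array([gamma, 1.0])     # starting point x^(0) = (gamma, 1)
for k in range(1, 6):
    g = P @ x                  # gradient of f(x) = 0.5 * x^T P x
    t = (g @ g) / (g @ P @ g)  # exact line search step
    x = x - t * g
    print(k, x)                # x2 flips sign every step: the zig-zag
```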

Example: Nonquadratic function on R²


▶ f(x₁, x₂) = e^(x₁+3x₂−0.1) + e^(x₁−3x₂−0.1) + e^(−x₁−0.1)

[figure: iterates with backtracking line search vs. exact line search]


Example: A problem in R^100

▶ f(x) = cᵀx − Σᵢ₌₁⁵⁰⁰ log(bᵢ − aᵢᵀx)
▶ linear convergence, i.e., a straight line on a semilog plot of f(x^(k)) − p★

Steepest descent method


▶ normalized steepest descent direction (at x, for norm ∥·∥):
Δx_nsd = argmin{∇f(x)ᵀv | ∥v∥ = 1}
▶ interpretation: for small v, f(x + v) ≈ f(x) + ∇f(x)ᵀv;
Δx_nsd is the unit-norm step with the most negative directional derivative
▶ (unnormalized) steepest descent direction: Δx_sd = ∥∇f(x)∥_* Δx_nsd
▶ satisfies ∇f(x)ᵀΔx_sd = −∥∇f(x)∥_*²
▶ steepest descent method
– general descent method with Δx = Δx_sd
– convergence properties similar to gradient descent

Examples

▶ Euclidean norm: Δx_sd = −∇f(x)
▶ quadratic norm ∥x∥_P = (xᵀPx)^(1/2) (P ∈ S^n_++): Δx_sd = −P⁻¹∇f(x)
▶ ℓ₁-norm: Δx_sd = −(∂f(x)/∂xᵢ)eᵢ, where i is an index with |∂f(x)/∂xᵢ| = ∥∇f(x)∥∞ (both non-Euclidean directions are sketched in code below)
[figure: unit balls and normalized steepest descent directions for the quadratic norm and ℓ₁-norm]
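A short sketch of the two non-Euclidean directions above (my construction; grad_fx stands for ∇f(x), P for the matrix defining the quadratic norm):

```python
import numpy as np

def sd_quadratic_norm(grad_fx, P):
    """Steepest descent direction for the P-norm: -P^{-1} grad f(x)."""
    return -np.linalg.solve(P, grad_fx)   # solve rather than invert P

def sd_l1_norm(grad_fx):
    """l1-norm steepest descent: step along the coordinate with the
    largest gradient magnitude (a coordinate-descent-like update)."""
    i = np.argmax(np.abs(grad_fx))
    dx = np.zeros_like(grad_fx, dtype=float)
    dx[i] = -grad_fx[i]
    return dx
```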

Choice of norm for steepest descent


▶ steepest descent with backtracking line search for two quadratic norms
▶ ellipses show {x | ∥x − x^(k)∥_P = 1}
▶ interpretation of steepest descent with quadratic norm ∥·∥_P: gradient descent after the change of variables x̄ = P^(1/2)x
▶ shows that the choice of P has a strong effect on the speed of convergence

Newton’s method

▶ Newton step: Δx_nt = −∇²f(x)⁻¹∇f(x)


▶ interpretation: x + Δx_nt minimizes the second-order approximation

f̂(x + v) = f(x) + ∇f(x)ᵀv + (1/2)vᵀ∇²f(x)v

Another interpretation

▶ x + Δx_nt solves the linearized optimality condition

∇f(x + v) ≈ ∇f(x) + ∇²f(x)v = 0

And one more interpretation


▶ Δx_nt is the steepest descent direction at x in the local Hessian norm
∥u∥_{∇²f(x)} = (uᵀ∇²f(x)u)^(1/2)

▶ dashed lines are contour lines of f; ellipse is {x + v | vᵀ∇²f(x)v = 1}
▶ arrow shows −∇f(x)

Newton decrement

▶ Newton decrement: λ(x) = (∇f(x)ᵀ∇²f(x)⁻¹∇f(x))^(1/2)


▶ a measure of the proximity of x to x★
▶ gives an estimate of f(x) − p★, using the quadratic approximation f̂:

f(x) − inf_y f̂(y) = λ(x)²/2

▶ equal to the norm of the Newton step in the quadratic Hessian norm:
λ(x) = (Δx_ntᵀ∇²f(x)Δx_nt)^(1/2)
▶ directional derivative in the Newton direction: ∇f(x)ᵀΔx_nt = −λ(x)²
▶ affine invariant (unlike ∥∇f(x)∥₂)

Newton’s method

given a starting point x ∈ dom f, tolerance ε > 0.
repeat
1. Compute the Newton step and decrement:
Δx_nt := −∇²f(x)⁻¹∇f(x);  λ² := ∇f(x)ᵀ∇²f(x)⁻¹∇f(x).
2. Stopping criterion. quit if λ²/2 ≤ ε.
3. Line search. Choose step size t by backtracking line search.
4. Update. x := x + tΔx_nt.
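A Python sketch of this algorithm (my construction; it reuses the backtracking routine from the line search sketch and assumes the Hessian is positive definite, so a Cholesky-based solve applies):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def newton(f, grad, hess, x0, eps=1e-8, max_iter=50):
    """Damped Newton method following the pseudocode above."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        dx = cho_solve(cho_factor(H), -g)  # Newton step: solve H dx = -g
        lam2 = -g @ dx                     # decrement squared: g^T H^{-1} g
        if lam2 / 2 <= eps:                # stopping criterion
            break
        t = backtracking(f, g, x, dx)      # backtracking line search
        x = x + t * dx                     # update
    return x
```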

▶ affine invariant, i.e., independent of linear changes of coordinates:
Newton iterates for f̃(y) = f(Ty) with starting point y^(0) = T⁻¹x^(0) are y^(k) = T⁻¹x^(k)

Classical convergence analysis


assumptions
▶ f strongly convex on S with constant m
▶ ∇²f is Lipschitz continuous on S, with constant L > 0:
∥∇²f(x) − ∇²f(y)∥₂ ≤ L∥x − y∥₂
(L measures how well f can be approximated by a quadratic function)

outline: there exist constants η ∈ (0, m²/L), γ > 0 such that
▶ if ∥∇f(x^(k))∥₂ ≥ η, then f(x^(k+1)) − f(x^(k)) ≤ −γ
▶ if ∥∇f(x^(k))∥₂ < η, then (L/(2m²))∥∇f(x^(k+1))∥₂ ≤ ((L/(2m²))∥∇f(x^(k))∥₂)²

Classical convergence analysis

damped Newton phase (∥∇f(x)∥₂ ≥ η)

▶ most iterations require backtracking steps
▶ function value decreases by at least γ
▶ if p★ > −∞, this phase ends after at most (f(x^(0)) − p★)/γ iterations

quadratically convergent phase (∥∇f(x)∥₂ < η)

▶ all iterations use step size t = 1
▶ ∥∇f(x)∥₂ converges to zero quadratically: if ∥∇f(x^(k))∥₂ < η, then

(L/(2m²))∥∇f(x^(l))∥₂ ≤ ((L/(2m²))∥∇f(x^(k))∥₂)^(2^(l−k)) ≤ (1/2)^(2^(l−k)),  l ≥ k

Classical convergence analysis

conclusion: the number of iterations until f(x) − p★ ≤ ε is bounded above by

(f(x^(0)) − p★)/γ + log₂ log₂(ε₀/ε)

▶ γ, ε₀ are constants that depend on m, L, x^(0)
▶ second term is small (of the order of 6) and almost constant for practical purposes
▶ in practice, constants m, L (hence γ, ε₀) are usually unknown
▶ provides qualitative insight into convergence properties (i.e., explains the two algorithm phases)

Example in R²
(same problem as slide 9.13)


▶ backtracking parameters α = 0.1, β = 0.7
▶ converges in only 5 steps
▶ quadratic local convergence

Example in R^100
(same problem as slide 9.14)

▶ backtracking parameters α = 0.01, β = 0.5
▶ backtracking line search almost as fast as exact l.s. (and much simpler)
▶ clearly shows the two phases of the algorithm

Example in R^10000
(with sparse aᵢ)

▶ backtracking parameters α = 0.01, β = 0.5
▶ performance similar to that for the small examples
Self-concordant functions

Self-concordance
shortcomings of classical convergence analysis
▶ depends on unknown constants (m, L, . . . )
▶ bound is not affinely invariant, although Newton’s method is
convergence analysis via self-concordance (Nesterov and Nemirovski)
▶ does not depend on any unknown constants
▶ gives affine-invariant bound
▶ applies to special class of convex self-concordant functions
▶ developed to analyze polynomial-time interior-point methods for convex
optimization

Convergence analysis for self-concordant functions


definition
▶ convex f : R → R is self-concordant if |f′′′(x)| ≤ 2f″(x)^(3/2) for all x ∈ dom f
▶ f : Rⁿ → R is self-concordant if g(t) = f(x + tv) is self-concordant for all x ∈ dom f, v ∈ Rⁿ

examples on R
▶ linear and quadratic functions
▶ negative logarithm f(x) = −log x (checked below)
▶ negative entropy plus negative logarithm: f(x) = x log x − log x
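As a quick check of the negative logarithm example (my arithmetic): for f(x) = −log x, f″(x) = 1/x² and f′′′(x) = −2/x³, so |f′′′(x)| = 2/x³ = 2(1/x²)^(3/2) = 2f″(x)^(3/2), i.e., the defining inequality holds with equality.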

affine invariance: if f : R → R is s.c., then f̃(y) = f(ay + b) is s.c.:

f̃′′′(y) = a³f′′′(ay + b),  f̃″(y) = a²f″(ay + b)

Self-concordant calculus

properties
▶ preserved under positive scaling α ≥ 1, and sum
▶ preserved under composition with affine function
▶ if g is convex with dom g = R₊₊ and |g′′′(x)| ≤ 3g″(x)/x, then

f(x) = −log(−g(x)) − log x

is self-concordant (on {x | x > 0, g(x) < 0})

examples: these properties can be used to show that functions such as
f(x) = −Σᵢ₌₁ᵐ log(bᵢ − aᵢᵀx) on {x | aᵢᵀx < bᵢ, i = 1, . . . , m} are s.c.

Convergence analysis for self-concordant functions

summary: there exist constants η ∈ (0, 1/4], γ > 0 such that

▶ if λ(x^(k)) > η, then f(x^(k+1)) − f(x^(k)) ≤ −γ
▶ if λ(x^(k)) ≤ η, then 2λ(x^(k+1)) ≤ (2λ(x^(k)))²
(η and γ only depend on the backtracking parameters α, β)

complexity bound: the number of Newton iterations is bounded above by

(f(x^(0)) − p★)/γ + log₂ log₂(1/ε),  with γ = αβ(1 − 2α)²/(20 − 8α)

for α = 0.1, β = 0.8, ε = 10⁻¹⁰, the bound evaluates to 375(f(x^(0)) − p★) + 6
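As a check of that evaluation (my arithmetic): 20 − 8(0.1) = 19.2 and αβ(1 − 2α)² = 0.1 · 0.8 · 0.64 = 0.0512, so the leading constant is 19.2/0.0512 = 375; and log₂ log₂ 10¹⁰ ≈ log₂ 33.2 ≈ 5.1, which is rounded up to 6.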

Numerical example
▶ 150 randomly generated instances of f(x) = −Σᵢ₌₁ᵐ log(bᵢ − aᵢᵀx), x ∈ Rⁿ
▶ ○: m = 100, n = 50; □: m = 1000, n = 500; ◇: m = 1000, n = 50

▶ number of iterations much smaller than 375(f(x^(0)) − p★) + 6
▶ a bound of the form c(f(x^(0)) − p★) + 6, with smaller c, is (empirically) valid

Implementation

main effort in each iteration: evaluate derivatives and solve the Newton system

HΔx = −g

where H = ∇²f(x), g = ∇f(x)

via Cholesky factorization:

H = LLᵀ,  Δx_nt = −L⁻ᵀL⁻¹g,  λ(x) = ∥L⁻¹g∥₂

▶ cost: (1/3)n³ flops for an unstructured system
▶ cost ≪ (1/3)n³ if H is sparse, banded, or has other structure
(a code sketch follows below)
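A sketch of these formulas in Python (the random stand-ins for H = ∇²f(x) and g = ∇f(x) are mine, just to make the snippet runnable):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
H = M @ M.T + 5 * np.eye(5)                     # positive definite stand-in Hessian
g = rng.standard_normal(5)                      # stand-in gradient

L = cholesky(H, lower=True)                     # H = L L^T
w = solve_triangular(L, g, lower=True)          # w = L^{-1} g
dx_nt = -solve_triangular(L.T, w, lower=False)  # Newton step -L^{-T} L^{-1} g
lam = np.linalg.norm(w)                         # Newton decrement ||L^{-1} g||_2
```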

Example

▶ f(x) = Σᵢ₌₁ⁿ ψᵢ(xᵢ) + ψ₀(Ax + b), with A ∈ R^(p×n) dense, p ≪ n
▶ Hessian has diagonal-plus-low-rank structure: H = D + AᵀH₀A
▶ D diagonal with diagonal elements ψᵢ″(xᵢ); H₀ = ∇²ψ₀(Ax + b)

method 1: form H, solve via dense Cholesky factorization (cost: (1/3)n³ flops)

method 2 (block elimination; sketched in code below): factor H₀ = L₀L₀ᵀ; write the Newton system as

DΔx + AᵀL₀w = −g,  L₀ᵀAΔx − w = 0

eliminate Δx from the first equation; compute w and Δx from

(I + L₀ᵀAD⁻¹AᵀL₀)w = −L₀ᵀAD⁻¹g,  DΔx = −g − AᵀL₀w

cost: 2p²n flops (dominated by the computation of L₀ᵀAD⁻¹AᵀL₀)
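A runnable sketch of method 2 (my construction with random stand-in data; the diagonal matrix D is stored as a 1-D array of its diagonal entries):

```python
import numpy as np
from scipy.linalg import cholesky

rng = np.random.default_rng(0)
p, n = 5, 200                                 # p << n, as on the slide
A = rng.standard_normal((p, n))
D = 1.0 + rng.random(n)                       # positive diagonal entries psi_i''(x_i)
M = rng.standard_normal((p, p))
H0 = M @ M.T + np.eye(p)                      # positive definite H0
g = rng.standard_normal(n)

L0 = cholesky(H0, lower=True)                 # factor H0 = L0 L0^T
S = np.eye(p) + L0.T @ ((A / D) @ A.T) @ L0   # I + L0^T A D^{-1} A^T L0
w = np.linalg.solve(S, -L0.T @ (A @ (g / D)))  # small p x p solve
dx = (-g - A.T @ (L0 @ w)) / D                # recover dx from D dx = -g - A^T L0 w

# sanity check against the direct solve of (D + A^T H0 A) dx = -g
assert np.allclose((np.diag(D) + A.T @ H0 @ A) @ dx, -g)
```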

Terminology and assumptions (supplement to the first page)


Unconstrained minimization
▶ unconstrained minimization problem:
minimize f(x)
▶ we assume
– f convex, twice continuously differentiable (hence dom f open)
– optimal value p★ = inf_x f(x) is attained at x★ (not necessarily unique)
▶ optimality condition: ∇f(x★) = 0
▶ minimizing f is the same as solving ∇f(x) = 0, a set of n equations in n unknowns

Quadratic functions
▶ convex quadratic: f(x) = (1/2)xᵀPx + qᵀx + r, with P ⪰ 0
▶ we can solve exactly via the linear equations (see the snippet below)

∇f(x) = Px + q = 0

▶ much more on this special case later
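For instance, in Python (the numerical values here are mine, purely illustrative):

```python
import numpy as np

P = np.array([[2.0, 0.5],
              [0.5, 1.0]])           # P positive definite
q = np.array([1.0, -1.0])
x_star = np.linalg.solve(P, -q)      # solve grad f(x) = P x + q = 0
```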

Iterative methods
▶ for most non-quadratic functions, we use iterative methods
▶ these produce a sequence of points x^(k) ∈ dom f, k = 0, 1, . . .
▶ x^(0) is the initial point or starting point
▶ x^(k) is the kth iterate
▶ we hope that the method converges, i.e.,

f(x^(k)) → p★,  ∇f(x^(k)) → 0

Initial point and sublevel set

▶ algorithms in this chapter require a starting point x^(0) such that

– x^(0) ∈ dom f
– the sublevel set S = {x | f(x) ≤ f(x^(0))} is closed

▶ the 2nd condition is hard to verify, except when all sublevel sets are closed
– equivalent to the condition that epi f is closed
– true if dom f = Rⁿ
– true if f(x) → ∞ as x → bd dom f

▶ examples of differentiable functions with closed sublevel sets:
f(x) = log(Σᵢ₌₁ᵐ e^(aᵢᵀx + bᵢ)),  f(x) = −Σᵢ₌₁ᵐ log(bᵢ − aᵢᵀx)

Strong convexity and implications

▶ f is strongly convex on S if there exists an m > 0 such that

∇²f(x) ⪰ mI for all x ∈ S

▶ same as saying that f(x) − (m/2)∥x∥₂² is convex
▶ if f is strongly convex, then for x, y ∈ S,

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)∥y − x∥₂²

▶ hence, S is bounded
▶ we conclude p★ > −∞, and for x ∈ S,

f(x) − p★ ≤ (1/(2m))∥∇f(x)∥₂²

▶ useful as a stopping criterion (if you know m, which you usually do not)
