ECS171: Machine Learning: Lecture 4: Optimization (LFD 3.3, SGD)

This document summarizes key concepts about optimization and gradient descent algorithms for machine learning models. It discusses: 1) the goal of optimization, which is to find the minimizer of an objective function; gradient descent does this by repeatedly taking steps proportional to the negative gradient of the loss function; 2) stochastic gradient descent, which approximates the full gradient with a small batch of samples to reduce the per-iteration cost on large datasets, updating the parameters after each batch; and 3) optimizing logistic regression with stochastic gradient descent by sampling mini-batches to estimate the gradient of the objective and taking a step in the opposite direction.


ECS171: Machine Learning

Lecture 4: Optimization (LFD 3.3, SGD)

Cho-Jui Hsieh
UC Davis

Jan 22, 2018


Gradient descent
Optimization

Goal: find the minimizer of a function

min_w f(w)

For now we assume f is twice differentiable


Machine learning algorithm: find the hypothesis that minimizes Ein
Convex vs Nonconvex

Convex function:
∇f(w*) = 0 ⇔ w* is a global minimum
A twice-differentiable function is convex if ∇²f(w) is positive semidefinite for all w
Examples: linear regression, logistic regression, · · ·
Non-convex function:
∇f(w*) = 0 only implies w* is a stationary point: a global min, local min, or saddle point
Most algorithms only converge to a point with gradient = 0
Examples: neural networks, · · ·
Gradient Descent

Gradient descent: repeatedly do

w^{t+1} ← w^t − α∇f(w^t)

α > 0 is the step size
This generates a sequence w^1, w^2, · · · that converges to a solution with ∇f(w) = 0
Step size too large ⇒ divergence; too small ⇒ slow convergence
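
A minimal sketch of this loop in Python/NumPy (the stopping rule, step size, and example objective below are assumptions for illustration, not from the lecture):

    import numpy as np

    def gradient_descent(grad_f, w0, alpha=0.1, max_iters=1000, tol=1e-8):
        """Repeat w <- w - alpha * grad_f(w) until the gradient is small."""
        w = w0
        for _ in range(max_iters):
            g = grad_f(w)
            if np.linalg.norm(g) < tol:   # stop near a stationary point
                break
            w = w - alpha * g
        return w

    # Example: minimize f(w) = ||w - 1||^2, whose gradient is 2(w - 1).
    w_star = gradient_descent(lambda w: 2 * (w - np.ones(3)), w0=np.zeros(3))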
Why gradient descent?

Reason I: the gradient is the “steepest direction” to decrease the objective function locally
Reason II: successive approximation view
At each iteration, form an approximation function of f(·):

f(w^t + d) ≈ g(d) := f(w^t) + ∇f(w^t)^T d + (1/(2α))‖d‖²

Update the solution by w^{t+1} ← w^t + d*, where d* = arg min_d g(d):

∇g(d*) = 0 ⇒ ∇f(w^t) + (1/α)d* = 0 ⇒ d* = −α∇f(w^t)

d* will decrease f(·) if α (step size) is sufficiently small
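
A quick numerical check of this successive-approximation view (the quadratic f, the value of α, and the test point are assumed for illustration):

    import numpy as np

    # Assumed example: f(w) = w^T A w with PSD A, so grad f(w) = 2 A w.
    A = np.array([[2.0, 0.0], [0.0, 1.0]])
    f = lambda w: w @ A @ w
    grad_f = lambda w: 2 * A @ w

    alpha, w_t = 0.1, np.array([1.0, -2.0])
    g = lambda d: f(w_t) + grad_f(w_t) @ d + d @ d / (2 * alpha)

    d_star = -alpha * grad_f(w_t)            # the closed-form minimizer of g
    assert g(d_star) <= g(d_star + 1e-3)     # nearby d cannot do better
    assert f(w_t + d_star) < f(w_t)          # the step decreases f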


Illustration of gradient descent

Form a quadratic approximation:

f(w^t + d) ≈ g(d) = f(w^t) + ∇f(w^t)^T d + (1/(2α))‖d‖²

Illustration of gradient descent

Minimize g(d):

∇g(d*) = 0 ⇒ ∇f(w^t) + (1/α)d* = 0 ⇒ d* = −α∇f(w^t)
Illustration of gradient descent

Update:
w^{t+1} = w^t + d* = w^t − α∇f(w^t)
Illustration of gradient descent

Form another quadratic approximation:

f(w^{t+1} + d) ≈ g(d) = f(w^{t+1}) + ∇f(w^{t+1})^T d + (1/(2α))‖d‖²

d* = −α∇f(w^{t+1})
Illustration of gradient descent

Update:
w^{t+2} = w^{t+1} + d* = w^{t+1} − α∇f(w^{t+1})
When will it diverge?

Can diverge (f(w^t) < f(w^{t+1})) if g is not an upper bound of f


When will it converge?

Always converges (f(w^t) > f(w^{t+1})) when g is an upper bound of f


Convergence

Let L be the Lipschitz constant of the gradient (∇²f(x) ⪯ LI for all x)
Theorem: gradient descent converges if α < 1/L
In practice, we do not know L · · ·
so the step size needs to be tuned when running gradient descent
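
A small sketch illustrating the theorem on a quadratic (the matrix A and the two step sizes are assumed; L is the largest eigenvalue of the Hessian 2A):

    import numpy as np

    A = np.diag([5.0, 1.0])                 # f(w) = w^T A w, Hessian = 2A
    L = np.linalg.eigvalsh(2 * A).max()     # Lipschitz constant of the gradient

    for alpha in (0.9 / L, 2.2 / L):        # one below 1/L, one too large
        w = np.array([1.0, 1.0])
        for _ in range(100):
            w = w - alpha * 2 * A @ w       # gradient step
        print(alpha, np.linalg.norm(w))     # first converges, second diverges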
Applying to Logistic regression

Gradient descent for logistic regression:

Initialize the weights w^0
For t = 1, 2, · · ·
Compute the gradient

∇Ein = −(1/N) Σ_{n=1}^N y_n x_n / (1 + e^{y_n w^T x_n})

Update the weights: w ← w − η∇Ein

Return the final weights w

When to stop?
Fixed number of iterations, or
Stop when ‖∇Ein‖ < ε
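
A sketch of this procedure in Python/NumPy (the data shapes, step size η, and tolerance ε are illustrative assumptions):

    import numpy as np

    def logistic_gd(X, y, eta=0.1, max_iters=1000, eps=1e-6):
        """X: (N, d) features, y: (N,) labels in {-1, +1}."""
        N, d = X.shape
        w = np.zeros(d)
        for _ in range(max_iters):
            margins = y * (X @ w)                        # y_n w^T x_n
            grad = -(y / (1 + np.exp(margins))) @ X / N  # gradient of E_in
            if np.linalg.norm(grad) < eps:               # stop when gradient small
                break
            w = w - eta * grad
        return w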
Stochastic Gradient Descent
Large-scale Problems

Machine learning: usually minimizing the in-sample loss (training loss)

min_w { (1/N) Σ_{n=1}^N ℓ(w^T x_n, y_n) } := Ein(w)   (linear model)
min_w { (1/N) Σ_{n=1}^N ℓ(h_w(x_n), y_n) } := Ein(w)   (general hypothesis)

ℓ: loss function (e.g., ℓ(a, b) = (a − b)²)

Gradient descent:
w ← w − η∇Ein(w), where computing ∇Ein(w) is the main cost

In general, Ein(w) = (1/N) Σ_{n=1}^N f_n(w),
where each f_n(w) depends only on (x_n, y_n)
Stochastic gradient

Gradient:

∇Ein(w) = (1/N) Σ_{n=1}^N ∇f_n(w)

Each gradient computation needs to go through all training samples:
slow when there are millions of samples
Is there a faster way to compute an “approximate gradient”?
Use stochastic sampling:
Sample a small subset B ⊆ {1, · · · , N}
Estimate the gradient:

∇Ein(w) ≈ (1/|B|) Σ_{n∈B} ∇f_n(w)

|B|: batch size
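
A sketch of this batch estimate (the per-sample squared loss f_n and the random data are assumed for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    N, d, batch = 100_000, 10, 64
    X, y = rng.normal(size=(N, d)), rng.normal(size=N)
    w = np.zeros(d)

    # f_n(w) = (w^T x_n - y_n)^2, so grad f_n(w) = 2 (w^T x_n - y_n) x_n.
    B = rng.choice(N, size=batch, replace=False)     # sample a small subset
    grad_est = 2 * (X[B] @ w - y[B]) @ X[B] / batch  # average grad f_n over B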


Stochastic gradient descent

Stochastic Gradient Descent (SGD)

Input: training data {x_n, y_n}_{n=1}^N
Initialize w (zero or random)
For t = 1, 2, · · ·
Sample a small batch B ⊆ {1, · · · , N}
Update the parameters:

w ← w − η_t (1/|B|) Σ_{n∈B} ∇f_n(w)

Extreme case: |B| = 1 ⇒ sample one training example at a time
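
A minimal sketch of this loop (the grad_fn interface, batch size, and constant step size are assumptions; a decaying η_t is discussed later):

    import numpy as np

    def sgd(grad_fn, w0, N, eta=0.01, batch=32, steps=1000, seed=0):
        """grad_fn(w, idx) returns the averaged gradient of f_n over idx."""
        rng = np.random.default_rng(seed)
        w = w0
        for t in range(steps):
            B = rng.choice(N, size=batch, replace=False)  # sample a mini-batch
            w = w - eta * grad_fn(w, B)                   # step on the estimate
        return w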


Logistic Regression by SGD

Logistic regression:

min_w (1/N) Σ_{n=1}^N log(1 + e^{−y_n w^T x_n}),   with f_n(w) = log(1 + e^{−y_n w^T x_n})

SGD for Logistic Regression

Input: training data {x_n, y_n}_{n=1}^N
Initialize w (zero or random)
For t = 1, 2, · · ·
Sample a batch B ⊆ {1, · · · , N}
Update the parameters:

w ← w − η_t (1/|B|) Σ_{n∈B} −y_n x_n / (1 + e^{y_n w^T x_n})

where the summand is ∇f_n(w)
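
A sketch of this update in Python/NumPy (batch size, step size, and iteration count are illustrative assumptions):

    import numpy as np

    def logistic_sgd(X, y, eta=0.1, batch=32, iters=5000, seed=0):
        """X: (N, d), y: (N,) in {-1, +1}; mini-batch SGD on the logistic loss."""
        rng = np.random.default_rng(seed)
        N, d = X.shape
        w = np.zeros(d)
        for t in range(1, iters + 1):
            B = rng.choice(N, size=batch, replace=False)
            margins = y[B] * (X[B] @ w)                    # y_n w^T x_n
            grad = -(y[B] / (1 + np.exp(margins))) @ X[B] / batch
            w = w - eta * grad
        return w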
Why SGD works?

The stochastic gradient is an unbiased estimator of the full gradient:

E[(1/|B|) Σ_{n∈B} ∇f_n(w)] = (1/N) Σ_{n=1}^N ∇f_n(w) = ∇Ein(w)

Each iteration updates by

gradient + zero-mean noise
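
A quick empirical check of this unbiasedness (the squared loss f_n and random data are assumed for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    N, d, batch = 1000, 5, 10
    X, y = rng.normal(size=(N, d)), rng.normal(size=N)
    w = rng.normal(size=d)

    # f_n(w) = (w^T x_n - y_n)^2; the full gradient averages grad f_n over all n.
    full_grad = 2 * (X @ w - y) @ X / N

    # Average many mini-batch estimates: the mean approaches the full gradient.
    batches = (rng.choice(N, batch, replace=False) for _ in range(20000))
    est = np.mean([2 * (X[B] @ w - y[B]) @ X[B] / batch for B in batches], axis=0)
    print(np.linalg.norm(est - full_grad))  # close to 0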


Stochastic gradient descent

In gradient descent, η (step size) is a fixed constant

Can we use a fixed step size for SGD?
SGD with a fixed step size cannot converge to global/local minimizers:
if w* is the minimizer, ∇f(w*) = (1/N) Σ_{n=1}^N ∇f_n(w*) = 0,

but (1/|B|) Σ_{n∈B} ∇f_n(w*) ≠ 0 if B is a strict subset

(Even if we reach the minimizer, SGD will move away from it)


Stochastic gradient descent, step size

To make SGD converge:


The step size should decrease to 0: η_t → 0,
usually at a polynomial rate η_t ≈ t^{−a} for some constant a > 0
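
A sketch of such a schedule plugged into the SGD loop above (the values of eta0 and a are assumptions):

    def eta_schedule(t, eta0=0.5, a=0.5):
        """Polynomially decaying step size: eta_t = eta0 * t^(-a) -> 0."""
        return eta0 * t ** (-a)

    # Inside the SGD loop (t starting at 1): w = w - eta_schedule(t) * grad_est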
Stochastic gradient descent vs Gradient descent

Stochastic gradient descent:


pros:
cheaper computation per iteration
faster convergence in the beginning
cons:
less stable, slower final convergence
hard to tune step size

(Figure from https://medium.com/@ImadPhd/gradient-descent-algorithm-and-its-variants-10f652806a3)
Revisit the Perceptron Learning Algorithm

Given classification data {x_n, y_n}_{n=1}^N
Learn a linear model:

min_w (1/N) Σ_{n=1}^N ℓ(w^T x_n, y_n)

Consider the loss:

ℓ(w^T x_n, y_n) = max(0, −y_n w^T x_n)

What’s the gradient?


Revisit the Perceptron Learning Algorithm

ℓ(w^T x_n, y_n) = max(0, −y_n w^T x_n)

Consider two cases:
Case I: y_n w^T x_n > 0 (prediction correct)
ℓ(w^T x_n, y_n) = 0
∂/∂w ℓ(w^T x_n, y_n) = 0
Case II: y_n w^T x_n < 0 (prediction wrong)
ℓ(w^T x_n, y_n) = −y_n w^T x_n
∂/∂w ℓ(w^T x_n, y_n) = −y_n x_n
SGD update rule (with |B| = 1): sample an index n and set

w^{t+1} ← w^t                 if y_n w^T x_n ≥ 0 (prediction correct)
w^{t+1} ← w^t + η_t y_n x_n   if y_n w^T x_n < 0 (prediction wrong)

Equivalent to the Perceptron Learning Algorithm when η_t = 1
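
A sketch of this single-sample SGD in Python/NumPy (shapes and the epoch count are assumptions; the ≤ 0 check treats a zero margin as a mistake, a common convention; with η_t = 1 it matches the perceptron update):

    import numpy as np

    def perceptron_sgd(X, y, eta=1.0, epochs=10, seed=0):
        """SGD with |B| = 1 on max(0, -y w^T x); eta = 1 recovers the perceptron."""
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(epochs * len(y)):
            n = rng.integers(len(y))       # sample one index
            if y[n] * (X[n] @ w) <= 0:     # update only on mistakes
                w = w + eta * y[n] * X[n]
        return w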


Conclusions

Gradient descent
Stochastic gradient descent
Next class: LFD 2

Questions?
