ECS171: Machine Learning: Lecture 4: Optimization (LFD 3.3, SGD)
Cho-Jui Hsieh
UC Davis
Convex vs Nonconvex

min_w f(w)

Convex function:
∇f(w*) = 0 ⇔ w* is a global minimum
A twice-differentiable function is convex if and only if ∇²f(w) is positive semidefinite (positive definite gives strict convexity)
Examples: linear regression, logistic regression, ...

Non-convex function:
∇f(w) = 0 ⇔ w is a global minimum, a local minimum, or a saddle point
Most algorithms only guarantee convergence to a point with gradient = 0
Examples: neural networks, ...
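As a quick check of the linear-regression example (my own addition, assuming the standard least-squares objective), convexity follows because the Hessian is positive semidefinite:

```latex
% Least-squares linear regression (assumed form, not from the slides)
f(w) = \frac{1}{N}\sum_{n=1}^{N}\left(w^\top x_n - y_n\right)^2
     = \frac{1}{N}\,\lVert Xw - y\rVert^2,
\qquad
\nabla^2 f(w) = \frac{2}{N}\,X^\top X \succeq 0,
```

since vᵀXᵀXv = ‖Xv‖² ≥ 0 for every v.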
Gradient Descent
f(w_t + d) ≈ g(d) := f(w_t) + ∇f(w_t)ᵀ d + (1/(2α)) ‖d‖²

Update the solution by w_{t+1} ← w_t + d*, where d* = arg min_d g(d).
Why gradient descent?

Minimize g(d):

∇g(d*) = 0 ⇒ ∇f(w_t) + (1/α) d* = 0 ⇒ d* = −α ∇f(w_t)

So the gradient-descent step −α ∇f(w_t) is exactly the minimizer of the local approximation g(d), with α acting as the step size.
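A minimal NumPy sketch of this update rule (my own illustration; the quadratic objective, step size, and names like gradient_descent are assumptions, not from the lecture):

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha=0.1, num_iters=500):
    """Iterate w_{t+1} = w_t - alpha * grad_f(w_t)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - alpha * grad_f(w)
    return w

# Illustrative objective: f(w) = 0.5 * ||A w - b||^2, so grad f(w) = A^T (A w - b).
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
b = np.array([1.0, -1.0])
grad_f = lambda w: A.T @ (A @ w - b)

w_hat = gradient_descent(grad_f, w0=np.zeros(2))
print(w_hat)  # close to the exact minimizer A^{-1} b = [0.5, -1.0]
```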
Illustration of gradient descent

Update:
w_{t+1} = w_t + d* = w_t − α ∇f(w_t)
w_{t+2} = w_{t+1} + d* = w_{t+1} − α ∇f(w_{t+1})
When will it diverge?
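One way to see when this can happen (my own illustration, not from the slides): on the quadratic f(w) = (L/2)w², a gradient step gives w ← (1 − αL)w, which diverges once α > 2/L.

```python
# f(w) = (L/2) * w**2, grad f(w) = L * w, so one step is w <- (1 - alpha * L) * w.
L = 4.0
for alpha in [0.1, 0.4, 0.6]:   # the stability threshold here is 2 / L = 0.5
    w = 1.0
    for _ in range(20):
        w = w - alpha * L * w
    print(f"alpha={alpha}: |w| after 20 steps = {abs(w):.3e}")
# alpha = 0.1 and 0.4 shrink |w| toward 0; alpha = 0.6 gives |1 - alpha*L| = 1.4 > 1, so |w| grows.
```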
Stochastic gradient

In general, E_in(w) = (1/N) Σ_{n=1}^{N} f_n(w), where each f_n(w) only depends on (x_n, y_n).
Gradient:

∇E_in(w) = (1/N) Σ_{n=1}^{N} ∇f_n(w)
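A quick numerical check of this identity (my own example with a synthetic squared loss, not the lecture's code): the full-batch gradient equals the average of the per-example gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))      # rows of X are the examples x_n
y = rng.normal(size=N)
w = rng.normal(size=d)

# Squared loss: f_n(w) = 0.5 * (w^T x_n - y_n)^2, so grad f_n(w) = (w^T x_n - y_n) * x_n.
per_example_grads = (X @ w - y)[:, None] * X       # shape (N, d)
full_gradient = X.T @ (X @ w - y) / N              # grad E_in(w) computed in one shot

print(np.allclose(full_gradient, per_example_grads.mean(axis=0)))  # True
```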
Logistic regression:

min_w (1/N) Σ_{n=1}^{N} log(1 + e^{−y_n wᵀ x_n}),   where f_n(w) = log(1 + e^{−y_n wᵀ x_n})
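For later use in the stochastic updates, the per-example gradient works out to (a standard derivation added here, not shown on the slide):

```latex
f_n(w) = \log\left(1 + e^{-y_n w^\top x_n}\right)
\;\Longrightarrow\;
\nabla f_n(w)
  = \frac{-y_n x_n \, e^{-y_n w^\top x_n}}{1 + e^{-y_n w^\top x_n}}
  = \frac{-y_n x_n}{1 + e^{\,y_n w^\top x_n}}.
```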
Stochastic gradient descent
The mini-batch gradient (1/|B|) Σ_{n∈B} ∇f_n(w) approximates ∇E_in(w), but (1/|B|) Σ_{n∈B} ∇f_n(w*) ≠ 0 in general when B is only a subset of the data.
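A minimal mini-batch SGD sketch for the logistic-regression objective above (my own implementation; the step size, batch size, and synthetic data are illustrative assumptions):

```python
import numpy as np

def minibatch_sgd(X, y, alpha=0.1, batch_size=10, num_epochs=50, seed=0):
    """Mini-batch SGD for E_in(w) = (1/N) * sum_n log(1 + exp(-y_n * w^T x_n))."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(num_epochs):
        for idx in np.array_split(rng.permutation(N), N // batch_size):
            B_x, B_y = X[idx], y[idx]                 # the mini-batch B
            margin = B_y * (B_x @ w)
            # (1/|B|) * sum_{n in B} grad f_n(w), with grad f_n(w) = -y_n x_n / (1 + e^{y_n w^T x_n})
            grad = -(B_x * (B_y / (1.0 + np.exp(margin)))[:, None]).mean(axis=0)
            w -= alpha * grad
    return w

# Tiny synthetic example with labels in {-1, +1}.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([1.0, -2.0]))
print(minibatch_sgd(X, y))   # roughly aligned with the direction [1, -2]
```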
Gradient descent vs. stochastic gradient descent
Next class: LFD 2
Questions?