Lecture 11

Math Foundations
Team
Introduction

► We will take a deeper look at some challenges in non-linear optimization, continuing on from the previous lecture.
► First we will look at the problem of different levels of curvature in different directions.
► We will need to figure out strategies to deal with these situations in order to design a good optimization algorithm.
► Second, we will look into how to solve constrained optimization problems.
Examples of high curvature surfaces

► Two examples of high curvature surfaces are cliffs and valleys.

[Figure: (a) a cliff-like surface; (b) a valley-like surface]
Difficult topologies

► In Figure (a), the partial derivative with respect to x changes drastically as we move along the x-axis.
► A modest learning rate will cause minimal reduction in the value of the objective function in the gently sloping regions.
► The same modest learning rate in the steeply sloping regions will cause us to overshoot the optimal value there.
► In Figure (b), there is a gentle slope along the y-direction and a U-shape in the x-direction.
► The gradient descent method will bounce violently between the steep sides of the valley while not making much progress along the y-axis.
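This behaviour is easy to reproduce on a toy quadratic. In the sketch below, f(x, y) = 10x² + y² is an assumed stand-in for Figure (b): high curvature in x, a gentle slope in y.

```python
import numpy as np

# Toy valley mimicking Figure (b): U-shaped (steep) in x, gently sloping in y.
# f(x, y) = 10 * x**2 + y**2, so grad f = (20x, 2y).
def grad(p):
    return np.array([20.0 * p[0], 2.0 * p[1]])

p = np.array([1.0, 5.0])   # start on a steep wall of the valley
alpha = 0.09               # a modest, fixed learning rate

for i in range(15):
    p = p - alpha * grad(p)
    # x flips sign every step (bouncing between the valley walls),
    # while y only creeps toward the optimum at (0, 0).
    print(f"iter {i:2d}: x = {p[0]:+.4f}, y = {p[1]:+.4f}")
```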
Gradient descent with momentum

► Gradient descent may be slow if the curvature of the function is such that the gradient descent steps hop between the walls of the valley of contours and approach the optimum slowly.
► If we endow the optimization procedure with memory, we can improve convergence.
► We use an additional term in the step update to remember what happened in the previous iteration, so that we can dampen oscillations and speed up convergence.
► This is a momentum term; the name comes from an analogy with a rolling ball whose direction becomes more and more difficult to change as its velocity increases.
Gradient descent with momentum

Gradient descent with momentum uses the following iteration:

x_{i+1} = x_i − α_i ((∇f)(x_i))^T + v_i

► v_i = β (x_i − x_{i−1}) and v_0 = 0.
► β ∈ [0, 1] is referred to as the momentum parameter or the friction parameter.
► Momentum-based methods attack the issues of flat regions, cliffs and valleys by emphasizing medium-term to long-term directions of consistent movement.
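A minimal sketch of this update rule (the quadratic objective and the values of α and β are assumptions for illustration, not from the lecture):

```python
import numpy as np

# Same toy valley as before: f(x, y) = 10 * x**2 + y**2.
def grad(p):
    return np.array([20.0 * p[0], 2.0 * p[1]])

p_prev = np.array([1.0, 5.0])
p = p_prev.copy()
alpha, beta = 0.09, 0.8    # step size alpha_i (held fixed) and momentum beta

for i in range(100):
    v = beta * (p - p_prev)            # v_i = beta * (x_i - x_{i-1}); v_0 = 0
    p_next = p - alpha * grad(p) + v   # x_{i+1} = x_i - alpha_i * grad f(x_i)^T + v_i
    p_prev, p = p, p_next

print(p)   # close to the minimum at (0, 0)
```

With the same modest step size as the plain gradient descent sketch, the memory term cancels much of the wall-to-wall bouncing in x while accumulating speed along y.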
Momentum-based learning

► An aggregated measure of feedback is used to reinforce movement along certain directions and speed up gradient descent.
► Momentum-based learning accelerates gradient descent since the algorithm moves more quickly in the direction of the optimal solution.
► The useless sideways oscillations get cancelled out during the averaging process.
► The momentum term is useful when only an approximate gradient is known, as it averages out the noisy gradient estimates.
Momentum-based learning

► The concept of momentum can be illustrated by a marble rolling down a hill that has a number of "local" distortions such as potholes and ditches. The momentum of the marble causes it to navigate these local distortions and emerge out of them.
Momentum-based learning

Momentum increases the relative component of the gradient in the correct direction.
AdaGrad

► The AdaGrad algorithm keeps track of the aggregated squared magnitude of the partial derivative with respect to each parameter.
► At each iteration of AdaGrad, for each parameter w_i:

A_i ← A_i + (∂f/∂w_i)²;  w_i ← w_i − (α/√A_i) (∂f/∂w_i)

► A_i measures only the historical magnitude of the gradient rather than its sign. To avoid ill-conditioning, ϵ = 10⁻⁸ can be added to A_i.
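A minimal sketch of the per-parameter update above (the toy objective and step size are illustrative assumptions):

```python
import numpy as np

def grad(p):
    # Gradient of the toy objective f(x, y) = 10 * x**2 + y**2.
    return np.array([20.0 * p[0], 2.0 * p[1]])

p = np.array([1.0, 5.0])
A = np.zeros_like(p)        # running sum of squared partial derivatives
alpha, eps = 0.5, 1e-8      # step size; eps guards against division by zero

for _ in range(200):
    g = grad(p)
    A += g ** 2                          # A_i <- A_i + (df/dw_i)^2
    p -= alpha * g / (np.sqrt(A) + eps)  # per-parameter scaled step

print(p)   # progress slows over time as A grows monotonically
```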
AdaGrad

► If the gradient takes the values +100 and −100 alternately, A_i will be large and the update step along the parameter in question will be small.
► If the gradient consistently takes the value 0.1, A_i will not be as large as before and the update step will be comparatively larger.
► With the passage of time, absolute movements along all components will slow down because A_i is monotonically increasing with time.
► AdaGrad suffers from the problem of not making much progress after a while. The fact that A_i is aggregated over the entire history of partial derivatives may make the method stale.
RMSProp

► At each iteration of RMSProp, for each parameter w_i:

A_i ← ρ A_i + (1 − ρ) (∂f/∂w_i)²;  w_i ← w_i − (α/√A_i) (∂f/∂w_i)

► RMSProp uses exponential averaging, so the scaling factor A_i does not constantly increase. Note that ρ ∈ (0, 1).
► In RMSProp the importance of ancient gradients decays exponentially with time, as the gradient from t steps before is weighted by ρ^t.
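A minimal sketch of the exponentially averaged update (again with an assumed toy objective and hyperparameters):

```python
import numpy as np

def grad(p):
    # Gradient of the toy objective f(x, y) = 10 * x**2 + y**2.
    return np.array([20.0 * p[0], 2.0 * p[1]])

p = np.array([1.0, 5.0])
A = np.zeros_like(p)
alpha, rho, eps = 0.05, 0.9, 1e-8

for _ in range(300):
    g = grad(p)
    A = rho * A + (1.0 - rho) * g ** 2   # exponential average; A is not monotone
    p -= alpha * g / (np.sqrt(A) + eps)

print(p)   # stays close to (0, 0) without AdaGrad's progressive slowdown
```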
Adam
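Adam combines the momentum idea (an exponentially averaged gradient) with RMSProp-style per-parameter scaling, plus a bias correction for the zero initialization of the averages. A minimal sketch of the standard Adam update, with an assumed toy objective and the commonly used default hyperparameters:

```python
import numpy as np

def grad(p):
    # Gradient of the toy objective f(x, y) = 10 * x**2 + y**2.
    return np.array([20.0 * p[0], 2.0 * p[1]])

p = np.array([1.0, 5.0])
m = np.zeros_like(p)   # exponential average of the gradient (momentum)
v = np.zeros_like(p)   # exponential average of the squared gradient (scaling)
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 301):
    g = grad(p)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    p -= alpha * m_hat / (np.sqrt(v_hat) + eps)

print(p)   # close to the minimum at (0, 0)
```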
Constrained optimization and Lagrange multipliers
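The bullets below compare an unconstrained objective J(x) with the Lagrangian L(x, λ); for the problem min_x f(x) subject to g_i(x) ≤ 0, the standard definitions (stated here for reference) are:

```latex
J(x) = f(x) + \sum_{i=1}^{m} \mathbb{1}\big(g_i(x)\big),
\qquad
\mathbb{1}(z) =
\begin{cases}
0 & z \le 0\\
\infty & z > 0
\end{cases}
\qquad
L(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i\, g_i(x)
```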
Primal and dual problems
Minimax inequality

► The difference between J(x) and the Lagrangian L(x, λ) is that the indicator function is relaxed to a linear function.
► When λ ≥ 0, the Lagrangian L(x, λ) is a lower bound on J(x).
► The maximum of L(x, λ) with respect to λ is J(x): if the point x satisfies all the constraints g_i(x) ≤ 0, then the maximum of the Lagrangian is attained at λ = 0 and equals J(x).
► If one or more constraints are violated, so that g_i(x) > 0, then the associated Lagrange multiplier λ_i can be taken to ∞, making the maximum of the Lagrangian infinite, which again equals J(x).
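Together, these bullets say that the maximum over λ ≥ 0 of L(x, λ) equals J(x); combined with the general minimax inequality, this gives weak duality:

```latex
\max_{y} \min_{x} \varphi(x, y) \;\le\; \min_{x} \max_{y} \varphi(x, y)
\quad\Longrightarrow\quad
\underbrace{\max_{\lambda \ge 0} \min_{x} L(x, \lambda)}_{\text{dual value}}
\;\le\;
\underbrace{\min_{x} \max_{\lambda \ge 0} L(x, \lambda)}_{\text{primal value } \min_x J(x)}
```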
Lagrangian formulation
Modeling equality constraints

► Suppose the problem is min_x f(x) subject to g_i(x) ≤ 0 for all 1 ≤ i ≤ m and h_j(x) = 0 for 1 ≤ j ≤ n.
► We model each equality constraint h_j(x) = 0 with two inequality constraints h_j(x) ≥ 0 and h_j(x) ≤ 0.
► The resulting Lagrange multipliers are then unconstrained.
► The Lagrange multipliers for the original inequality constraints are non-negative, while those corresponding to the equality constraints are unconstrained.
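Written out, with λ_j⁺ and λ_j⁻ denoting the multipliers of the two inequality copies (names introduced here only for this derivation), the equality constraint contributes to the Lagrangian:

```latex
\lambda_j^{+}\, h_j(x) + \lambda_j^{-}\big({-h_j(x)}\big)
= \big(\lambda_j^{+} - \lambda_j^{-}\big)\, h_j(x),
\qquad \lambda_j^{+}, \lambda_j^{-} \ge 0
```

Since the difference of two non-negative numbers can take any real value, the combined multiplier of h_j(x) is unconstrained.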
Convex optimization

► We are interested in a class of optimization problems where we can guarantee global optimality.
► When the objective f(·) and the inequality-constraint functions g(·) are convex and the equality-constraint functions h(·) are affine, we have a convex optimization problem.
► In this setting we have strong duality: the optimal value of the primal problem is equal to the optimal value of the dual problem.
► What is a convex function?
Convex function

► First we need to know what a convex set is. A set C is a convex set if for any x, y ∈ C and any θ ∈ [0, 1], θx + (1 − θ)y ∈ C.
► For any two points lying in the convex set, the line segment joining them lies entirely in the convex set.
► Let f : R^d → R be a function whose domain is a convex set C.
► The function is a convex function if for any x, y ∈ C and any θ ∈ [0, 1],

f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y)

► Another way of looking at a convex function is to use the gradient: for any two points x and y, we have

f(y) ≥ f(x) + ∇_x f(x) (y − x)
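As a concrete check of the first definition, take f(x) = x²; the defining inequality reduces to a sum-of-squares statement:

```latex
\theta x^2 + (1-\theta)\, y^2 - \big(\theta x + (1-\theta)\, y\big)^2
= \theta\, (1-\theta)\, (x - y)^2 \;\ge\; 0
\qquad \text{for } \theta \in [0, 1]
```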
Example
Linear programming

► We can solve the original primal linear program or the dual one; the optimum in each case is the same.
► The primal linear program is in d variables but the dual is in m variables, where m is the number of constraints in the original primal program.
► We choose to solve the primal or the dual based on which of m or d is smaller.
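As a minimal numerical sketch (the particular c, A, and b below are made-up illustrative data, and scipy.optimize.linprog is one readily available solver):

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative primal LP: minimize c^T x subject to A x <= b (d = 2, m = 3).
c = np.array([-1.0, -2.0])              # objective coefficients
A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
b = np.array([4.0, 3.0, 2.0])

res = linprog(c, A_ub=A, b_ub=b)        # default bounds impose x >= 0
print(res.x, res.fun)                   # optimal point and optimal value
```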
Quadratic programming
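A quadratic program generalizes the linear program by allowing a quadratic objective; the standard form (a textbook definition, stated here for reference) is:

```latex
\min_{x \in \mathbb{R}^d} \; \tfrac{1}{2}\, x^{\top} Q\, x + c^{\top} x
\quad \text{subject to} \quad A x \le b
```

The problem is convex, and hence enjoys strong duality, when Q is symmetric positive semidefinite.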
Worked examples

► The remaining slides work through a numerical example of gradient descent with momentum, followed by step-by-step solutions for AdaGrad, RMSProp, and ADAM on the same function.
