Gradient Descent
Deep Learning
By
T.K. Damodharan
Vice President, RBS
Reg.No: PC2013003013008
Under the guidance of
Dr V.Rajasekar,
Associate Professor,
Department of Computer Science & Engineering,
SRM Institute of Science and Technology-Vadapalani Campus.
Gradient Descent
Gradient descent is by far the most
popular optimization strategy used in
machine learning and deep learning at the
moment.
It is used when training models, can be
combined with almost every algorithm and is easy to
understand and implement.
Everyone working with machine learning
should understand its concept.
Gradient Descent
Gradient Descent is an optimization algorithm
for finding a local minimum of a differentiable
function.
Gradient descent is simply used to find the
values of a function's parameters (coefficients)
that minimize a cost function as far as possible.
It's based on a convex function and tweaks its
parameters iteratively to minimize a given
function to its local minimum.
What is a Gradient
"A gradient measures how much the output of a
function changes if you change the inputs a little
bit." — Lex Fridman (MIT)
A gradient simply measures the change in all
weights with regard to the change in error.
You can also think of a gradient as the slope of a
function. The higher the gradient, the steeper the
slope and the faster a model can learn.
But if the slope is zero, the model stops learning.
In mathematical terms, a gradient is the vector of
partial derivatives of a function with respect to its inputs.
Imagine a blindfolded man who wants to climb to
the top of a hill in as few steps as possible.
He might start climbing the hill by taking really
big steps in the steepest direction, which he can do
as long as he is not close to the top.
As he comes closer to the top, however, his steps
will get smaller and smaller to avoid overshooting
it.
This process can be described mathematically
using the gradient.
What is a Gradient
Imagine the image below illustrates our hill from a
top-down view and the red arrows are the steps of
our climber.
Think of a gradient in this context as a vector that
contains the direction of the steepest step the
blindfolded man can take and also how long
that step should be.
What is a Gradient
Note that the gradient ranging from X0 to X1 is
much longer than the one reaching from X3 to X4.
This is because the steepness/slope of the hill,
which determines the length of the vector,
decreases as the climber nears the top.
This perfectly represents the example of the hill:
the hill is getting less steep the higher it's climbed.
Therefore a reduced gradient goes along with a
reduced slope and a reduced step size for the hill
climber.
How Gradient Descent Works
Instead of climbing up a hill, think of gradient
descent as hiking down to the bottom of a valley.
This is a better analogy because it is a
minimization algorithm that minimizes a given
function.
Equation: b = a − γ ∇F(a)
b is the next position of our climber,
while a represents his current position.
The minus sign refers to the minimization part of gradient descent.
The gamma in the middle is a weighting factor (the learning rate),
and the gradient term ∇F(a) is the direction of steepest
ascent at the current position.
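The climber's update can be sketched in a few lines of Python; the function f(x) = x² (with gradient 2x) and the step size are illustrative choices, not from the slides:

```python
# A minimal sketch of the update b = a - gamma * grad_F(a),
# using the convex function f(x) = x**2 as a stand-in.

def grad_f(x):
    return 2.0 * x  # derivative of f(x) = x**2

def gradient_descent(start, gamma=0.1, steps=100):
    a = start
    for _ in range(steps):
        a = a - gamma * grad_f(a)  # b = a - gamma * grad_F(a)
    return a

x_min = gradient_descent(start=5.0)  # converges towards the minimum at 0
```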
Gradient Descent
More details and Types of Gradient Descent
https://builtin.com/data-science/gradient-descent
Step by step Video guide:
https://youtu.be/sDv4f4s2SB8
Linear Models
A strong high-bias assumption is linear separability:
in 2 dimensions, can separate classes by a line
in higher dimensions, need hyperplanes
A linear model is a model that assumes the data is linearly
separable
Linear Regression
DATASET
inputs      outputs
x1 = 1      y1 = 1
x2 = 3      y2 = 2.2
x3 = 2      y3 = 2
x4 = 1.5    y4 = 1.9
x5 = 4      y5 = 3.1
Linear regression assumes that the expected value of
the output given an input, E[y|x], is linear.
Simplest case: Out(x) = wx for some unknown w.
Given the data, we can estimate w.
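A minimal sketch of estimating w from this dataset: minimizing squared error for Out(x) = wx has the closed-form least-squares solution w = Σ xy / Σ x² (the variable names are illustrative):

```python
# The slide's dataset
xs = [1.0, 3.0, 2.0, 1.5, 4.0]
ys = [1.0, 2.2, 2.0, 1.9, 3.1]

# Least-squares estimate of w for the model Out(x) = w * x:
# w = sum(x*y) / sum(x**2)
w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

For this data the estimate comes out a little below 1, so the fitted line passes close to all five points.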
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 14
Linear models
A linear model in n-dimensional space (i.e. n
features) is defined by n+1 weights:
In two dimensions, a line:
0 = w1 f1 + w2 f2 + b (where b = -a)
In three dimensions, a plane:
0 = w1 f1 + w2 f2 + w3 f3 + b
In m dimensions, a hyperplane:
0 = b + Σ_{j=1}^{m} w_j f_j
Which line will it find?
Which line will it find?
Only guaranteed to find some
line that separates the data
Linear models
Perceptron algorithm is one example of a linear
classifier
Many, many other algorithms that learn a line (i.e. a
setting of a linear combination of weights)
Goals:
Explore a number of linear training algorithms
Understand why these algorithms work
Linear models in general
1. pick a model
0 = b + Σ_{j=1}^{m} w_j f_j
These are the parameters we want to learn
2. pick a criteria to optimize (aka objective function)
Some notation: indicator function
1[x] = 1 if x = True, 0 if x = False
Convenient notation for turning T/F answers into numbers/counts:
drinks_to_bring_for_class = Σ_{x ∈ class} 1[x >= 21]
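In code, the indicator function is just a boolean turned into a 0/1 value, so summing it counts how many elements satisfy the condition; the ages below are a made-up roster:

```python
# 1[x >= 21] as code: each comparison contributes 1 if true, 0 if false
ages = [18, 22, 25, 19, 30]  # hypothetical class, not from the slides
drinks_to_bring = sum(1 if age >= 21 else 0 for age in ages)
```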
Some notation: dot-product
Sometimes it is convenient to use vector notation
We represent an example f1, f2, …, fm as a single vector, x
Similarly, we can represent the weight vector w1, w2, …, wm as a single
vector, w
The dot-product between two vectors a and b is defined as:
a · b = Σ_{j=1}^{m} a_j b_j
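The dot-product definition translates directly into a few lines (pure Python, no libraries assumed):

```python
def dot(a, b):
    # a . b = sum over j of a_j * b_j
    return sum(aj * bj for aj, bj in zip(a, b))

result = dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])  # 4 + 10 + 18
```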
Linear models
1. pick a model
0 = b + Σ_{j=1}^{n} w_j f_j
These are the parameters we want to learn
2. pick a criteria to optimize (aka objective function)
Σ_{i=1}^{n} 1[y_i (w · x_i + b) ≤ 0]
What does this equation say?
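The equation counts how many training examples the linear model gets wrong: with labels ±1, the quantity y_i (w · x_i + b) is positive exactly when the prediction lands on the correct side. A sketch on a tiny made-up dataset:

```python
def num_errors(w, b, xs, ys):
    # counts examples where y_i * (w . x_i + b) <= 0,
    # i.e. the model predicts the wrong side (labels are +1 / -1)
    def dot(a, c):
        return sum(aj * cj for aj, cj in zip(a, c))
    return sum(1 for x, y in zip(xs, ys) if y * (dot(w, x) + b) <= 0)

# hypothetical 2-feature examples with +1/-1 labels
xs = [[1.0, 1.0], [-1.0, -1.0], [2.0, 0.5]]
ys = [1, -1, -1]
errs = num_errors([1.0, 1.0], 0.0, xs, ys)  # third example is misclassified
```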
Convex functions
Convex functions look something like:
One definition: The line segment between any
two points on the function is above the function
Finding the minimum
You’re blindfolded, but you can see out of the bottom of the
blindfold to the ground right by your feet. I drop you off
somewhere and tell you that you’re in a convex shaped valley
and escape is at the bottom/minimum. How do you get out?
Finding the minimum
How do we do this for a function?
One approach: gradient descent
Partial derivatives give us the
slope (i.e. direction to move)
in that dimension
One approach: gradient descent
Partial derivatives give us the
slope (i.e. direction to move) in
that dimension
Approach:
pick a starting point (w)
repeat:
pick a dimension
move a small amount in that
dimension towards decreasing loss
(using the derivative)
Gradient descent
pick a starting point (w)
repeat until loss doesn’t decrease in all dimensions:
pick a dimension
move a small amount in that dimension towards decreasing loss
(using the derivative)
w_j = w_j − η · d/dw_j loss(w)
What does this do?
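The loop above can be sketched as follows; the quadratic loss (minimum at (3, −1)) and the step size are illustrative choices, not from the slides:

```python
# loss(w) = (w1 - 3)^2 + (w2 + 1)^2, minimized at w = (3, -1)
def grad(w):
    # partial derivatives, one per dimension
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

eta = 0.1          # learning rate
w = [0.0, 0.0]     # starting point
for _ in range(200):
    g = grad(w)
    # w_j = w_j - eta * d/dw_j loss(w), in every dimension
    w = [wj - eta * gj for wj, gj in zip(w, g)]
```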
Gradient descent
pick a starting point (w)
repeat until loss doesn’t decrease in all dimensions:
pick a dimension
move a small amount in that dimension towards decreasing loss
(using the derivative)
w_j = w_j − η · d/dw_j loss(w)
learning rate (how much we want to move in the error
direction, often this will change over time)
Some maths
d/dw_j loss = d/dw_j Σ_{i=1}^{n} exp(−y_i (w · x_i + b))

= Σ_{i=1}^{n} exp(−y_i (w · x_i + b)) · d/dw_j [−y_i (w · x_i + b)]

= Σ_{i=1}^{n} −y_i x_ij exp(−y_i (w · x_i + b))
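The analytic derivative can be sanity-checked numerically; a sketch with made-up data, comparing the partial derivative to a central finite-difference approximation:

```python
import math

def dot(a, b):
    return sum(aj * bj for aj, bj in zip(a, b))

def loss(w, b, xs, ys):
    # exponential loss: sum_i exp(-y_i (w . x_i + b))
    return sum(math.exp(-y * (dot(w, x) + b)) for x, y in zip(xs, ys))

def dloss_dwj(w, b, xs, ys, j):
    # the slide's result: sum_i -y_i * x_ij * exp(-y_i (w . x_i + b))
    return sum(-y * x[j] * math.exp(-y * (dot(w, x) + b))
               for x, y in zip(xs, ys))

# hypothetical data and weights
xs = [[1.0, 2.0], [-1.0, 0.5]]
ys = [1, -1]
w, b, j = [0.1, -0.2], 0.0, 0

# central finite-difference approximation in dimension j
eps = 1e-6
w_hi = [w[0] + eps, w[1]]
w_lo = [w[0] - eps, w[1]]
numeric = (loss(w_hi, b, xs, ys) - loss(w_lo, b, xs, ys)) / (2 * eps)
```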
Gradient descent
pick a starting point (w)
repeat until loss doesn’t decrease in all dimensions:
pick a dimension
move a small amount in that dimension towards decreasing loss
(using the derivative)
w_j = w_j + η Σ_{i=1}^{n} y_i x_ij exp(−y_i (w · x_i + b))
What is this doing?
Exponential update rule
w_j = w_j + η Σ_{i=1}^{n} y_i x_ij exp(−y_i (w · x_i + b))
for each example x_i:
w_j = w_j + η y_i x_ij exp(−y_i (w · x_i + b))
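A sketch of the per-example update on a tiny, made-up linearly separable dataset (b is held at 0 for brevity):

```python
import math

def dot(a, b):
    return sum(aj * bj for aj, bj in zip(a, b))

# hypothetical separable data with +1/-1 labels
xs = [[2.0, 1.0], [1.0, 2.0], [-1.5, -1.0], [-1.0, -2.0]]
ys = [1, 1, -1, -1]

eta, b = 0.1, 0.0
w = [0.0, 0.0]
for _ in range(50):                      # passes over the data
    for x, y in zip(xs, ys):             # one update per example
        scale = math.exp(-y * (dot(w, x) + b))
        # w_j = w_j + eta * y_i * x_ij * exp(-y_i (w . x_i + b))
        w = [wj + eta * y * xj * scale for wj, xj in zip(w, x)]

# after training, no example should satisfy y_i (w . x_i + b) <= 0
errors = sum(1 for x, y in zip(xs, ys) if y * (dot(w, x) + b) <= 0)
```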
Summary
Gradient descent minimization algorithm
requires that our loss function is convex
makes small updates towards lower losses
Gradient descent
pick a starting point (w)
repeat until loss doesn’t decrease in all dimensions:
pick a dimension
move a small amount in that dimension towards decreasing loss
(using the derivative)
w_i = w_i − η · d/dw_i (loss(w) + regularizer(w, b))

With an L2 regularizer (λ/2)·‖w‖², this gives:

w_j = w_j + η Σ_{i=1}^{n} y_i x_ij exp(−y_i (w · x_i + b)) − ηλ w_j
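A sketch of one batch update including the regularization term; the tiny dataset and hyperparameters are illustrative:

```python
import math

def dot(a, b):
    return sum(aj * bj for aj, bj in zip(a, b))

def step(w, b, xs, ys, eta, lam):
    # w_j = w_j + eta * sum_i y_i x_ij exp(-y_i (w . x_i + b)) - eta*lam*w_j
    new_w = []
    for j, wj in enumerate(w):
        grad_term = sum(y * x[j] * math.exp(-y * (dot(w, x) + b))
                        for x, y in zip(xs, ys))
        new_w.append(wj + eta * grad_term - eta * lam * wj)
    return new_w

# hypothetical one-feature-active data with +1/-1 labels
xs = [[1.0, 0.0], [-1.0, 0.0]]
ys = [1, -1]
w = step([0.0, 0.0], 0.0, xs, ys, eta=0.1, lam=0.01)
```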
The update
w_j = w_j + η y_i x_ij exp(−y_i (w · x_i + b)) − ηλ w_j
η: learning rate
y_i x_ij: direction to update
exp(−y_i (w · x_i + b)): constant: how far from wrong
ηλ w_j: regularization
What effect does the regularizer have?
The update
w_j = w_j + η y_i x_ij exp(−y_i (w · x_i + b)) − ηλ w_j
η: learning rate
y_i x_ij: direction to update
exp(−y_i (w · x_i + b)): constant: how far from wrong
ηλ w_j: regularization
If w_j is positive, the regularization term reduces w_j;
if w_j is negative, it increases w_j;
either way, it moves w_j towards 0.
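The shrinking effect can be seen by applying only the −ηλ w_j term to one positive and one negative weight (values are illustrative):

```python
eta, lam = 0.1, 0.5
w_pos, w_neg = 2.0, -2.0
for _ in range(100):
    w_pos -= eta * lam * w_pos   # positive weight decreases towards 0
    w_neg -= eta * lam * w_neg   # negative weight increases towards 0
```

Each step multiplies the weight by (1 − ηλ), so both weights decay geometrically towards 0 without ever crossing it.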