
Lecture 15: LMS Regression and Gradient Descent
COMP 343, Spring 2022

Victoria Manfredi

Acknowledgements: These slides are based primarily on material from the book Machine Learning by Tom Mitchell (and associated slides), the book Machine Learning: An Applied Mathematics Introduction by Paul Wilmott, and slides created by Vivek Srikumar (Utah), Dan Roth (Penn), Julia Hockenmaier (Illinois Urbana-Champaign), Jessica Wu (Harvey Mudd), and C. David Page (U of Wisconsin-Madison).
Today’s Topics
Midterm
– Wednesday, March 30

Linear Regression
– Gradient descent
– Example
– Convergence
– Stochastic gradient descent
– Regularization

Recap
Evaluation metrics (for classification)
To evaluate a model, compare predicted labels to actual labels.

Accuracy: proportion of examples where we predicted the correct label

    accuracy = (# of correct predictions) / (# of examples)

Error: proportion of examples where we predicted the incorrect label

    error = 1 − accuracy = (# of incorrect predictions) / (# of examples)
Precision-Recall analysis
What fraction of class "label" examples did the classifier discover?

    Recall(label) = Correct predictions(label) / (Correct predictions(label) + Missed examples(label))

What fraction of the classifier's predictions of class "label" were correct?

    Precision(label) = Correct predictions(label) / (Correct predictions(label) + Incorrect predictions(label))

By default, precision and recall are computed for the positive label, as that is usually the case of interest and usually the one with fewer examples (e.g., diagnosing diseases in patients, identifying spam emails).
Combining into one number

Sometimes it is easier to work with a single number as a performance measure.

The F1 score balances precision p and recall r: it is their harmonic mean

    f1 = 2pr / (p + r)

Training to directly optimize F1 is difficult, but we can choose hyperparameters for which F1 is maximized.
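To make these metrics concrete, here is a minimal sketch (mine, not from the slides) that computes accuracy, precision, recall, and F1 from true and predicted binary labels, treating 1 as the positive class:

    def evaluate(y_true, y_pred, positive=1):
        """Accuracy, precision, recall, and F1 for one label of interest."""
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
        accuracy = correct / len(y_true)
        # correct predictions of the label / all predictions of the label
        precision = tp / (tp + fp) if tp + fp else 0.0
        # correct predictions of the label / all true examples of the label
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return accuracy, precision, recall, f1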
Linear regression
Inputs are feature vectors: x ∈ ℜ^d
Outputs are real numbers: y ∈ ℜ

For simplicity, we will assume that the first feature is always 1, to make notation easier:

    x_i = [x_1, x_2, ⋯, x_d]^T    with x_1 = 1

We have a training data set:

    D = {(x_1, y_1), (x_2, y_2), ⋯, (x_m, y_m)}

We want to approximate y as

    y = w_1 + w_2 x_2 + ⋯ + w_d x_d = w^T x

where w is the learned weight vector in ℜ^d. We are making the assumption that the output y is a linear function of the features x.

Goal: use the training data to find the best possible value of w.
If our hypothesis space is linear functions …
How do we know which weight vector is the best one for a training set?

For an input (x_i, y_i) in the training set, the cost of a mistake is

    |y_i − w^T x_i|

i.e., how far apart the true output is from the predicted output in an absolute sense. If they are very different, the weight vector is probably not very good:

    |y_i − w_1^T x_i| = 60000        |y_(i+1) − w_1^T x_(i+1)| = 0.1
    |y_i − w_2^T x_i| = 0.1          |y_(i+1) − w_2^T x_(i+1)| = 0.3

But it could also be that the weight vector is just bad for that example.
How do we decide whether a weight vector is good?
How do we know which weight vector is the best one for a training set?

For an input (x_i, y_i) in the training set, the cost of a mistake is

    |y_i − w^T x_i|

This tells us how good w is for one example.

Define the cost (or loss) for a particular weight vector w over the m training examples to be

    J(w) = (1/2) ∑_{i=1}^m (y_i − w^T x_i)^2

This tells us how good w is on the whole training set. Squared error is a popular loss function: the sum of squared costs over the training set. Dividing by 2 rather than m will make our math work out nicely later.

One strategy for learning: find the w with the least cost on this data.
J is a "function of functions": a function that evaluates how good other functions, or regressors, are, e.g., J(f) = (1/2) ∑_{i=1}^m (y_i − f(x_i))^2 with f(x) = w^T x. Every choice of w gives a different regressor, so J evaluates how good a regressor is.
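As a quick illustration (a sketch of mine, not from the slides), the cost J(w) in numpy, assuming each row of X already starts with the constant feature 1:

    import numpy as np

    def cost(w, X, y):
        """J(w) = 1/2 * sum_i (y_i - w.x_i)^2 over the training set."""
        residuals = y - X @ w          # y_i - w^T x_i for every example
        return 0.5 * np.sum(residuals ** 2)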
This is called Least Mean Squares (LMS) Regression

    min_w J(w) = min_w (1/2) ∑_{i=1}^m (y_i − w^T x_i)^2

Goal of learning: minimize the mean squared error.

‣ This is just the training objective: you can use different learning algorithms to minimize this objective.

‣ Properties of J(w): differentiable and convex. Lower values mean a better weight vector w, i.e., a better regressor.

‣ Mathematical optimization focuses on solving problems of the form min_w J(w), so many algorithms exist to solve this problem.

Different strategies exist for learning by optimization:
‣ Gradient descent: a popular algorithm
‣ Matrix inversion: for this particular minimization objective, there is also an analytical solution; no need for gradient descent: b = (X^T X)^(−1) X^T Y
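A minimal sketch of the analytical solution b = (X^T X)^(−1) X^T Y; using np.linalg.solve instead of forming the inverse explicitly is a standard numerical choice of mine, not something the slides prescribe:

    import numpy as np

    def lms_closed_form(X, y):
        """Solve the normal equations (X^T X) b = X^T y for the LMS minimizer."""
        return np.linalg.solve(X.T @ X, X.T @ y)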
Linear Regression
GRADIENT DESCENT
Gradient descent
We are trying to minimize

    J(w) = (1/2) ∑_{i=1}^m (y_i − w^T x_i)^2

General strategy for minimizing a function J(w):

1. Start with an initial guess for w, say w^0

2. Iterate until convergence:
   – Compute the gradient of J at w^t
   – Update w^t to get w^(t+1) by taking a step in the opposite direction of the gradient

Intuition: the gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction.

What is the gradient of a function?
In 2 dimensions: the slope of a line.
In higher dimensions: the direction of steepest ascent, that is, the direction in which the function grows the fastest.
[Figure: J(w) plotted against w, a convex bowl, with successive iterates w1, w2, w3, w4 stepping down toward the minimum.]

Walking through the picture:
– Initialize your starting point for the search for the minimum anywhere; say, pick the point w1. The gradient points in the direction the function grows.
– Then at every point, compute the gradient (the arrow), and take a step in the direction away from the gradient (i.e., move to a point where the value of the function is lower).
– Keep repeating …
– … and eventually you will get to the minimum.
Gradient descent for LMS
We are trying to minimize

    J(w) = (1/2) ∑_{i=1}^m (y_i − w^T x_i)^2

1. Initialize w^0 to zeroes or random values (J is a convex function, so it doesn't matter where we initialize)

2. For t = 0, 1, 2, …
   – Compute the gradient of J(w) at w^t. Call it ∇J(w^t) ("grad J" or "nabla J")
   – Update w as follows:

         w^(t+1) = w^t − r ∇J(w^t)

     using "−" since the step is in the opposite direction of the gradient, where r is the learning rate (a small constant)
What is the gradient of J?
Gradient of the cost J at point w
Remember that w is a vector with d elements, w = [w_1, w_2, w_3, …, w_j, …, w_d], and J is a function that maps w to a real number (the total cost).

To find the best direction in the weight space we compute the gradient of J with respect to each of the components of w:

    ∇J(w^t) = [∂J/∂w_1, ∂J/∂w_2, ⋯, ∂J/∂w_d]

This vector specifies the direction that produces the steepest increase in J. We want to modify w in the direction of −∇J(w), where (with a fixed step size r):

    w^(t+1) = w^t − r ∇J(w^t)
The gradient is itself a vector with d elements, since w is a vector with d elements. Each element is a partial derivative, and we need to compute every element of ∇J(w^t) to define the gradient.
Gradient of the cost J at point w
We are trying to minimize

    J(w) = (1/2) ∑_{i=1}^m (y_i − w^T x_i)^2

The gradient is of the form ∇J(w^t) = [∂J/∂w_1, ∂J/∂w_2, ⋯, ∂J/∂w_d]. Let's compute the gradient for the jth weight:

    ∂J/∂w_j = ∂/∂w_j (1/2) ∑_{i=1}^m (y_i − w^T x_i)^2

            = (1/2) ∑_{i=1}^m ∂/∂w_j (y_i − w^T x_i)^2
              (the gradient of a sum is the sum of the gradients, so move the partial derivative inside)

            = (1/2) ∑_{i=1}^m 2(y_i − w^T x_i) ∂/∂w_j (y_i − w_1 x_i1 − ⋯ − w_j x_ij − ⋯)
              (apply the chain rule and expand the dot product)

            = (1/2) ∑_{i=1}^m 2(y_i − w^T x_i)(−x_ij)
              (only one element of the expanded dot product depends on w_j)

            = − ∑_{i=1}^m (y_i − w^T x_i) x_ij
              (move the 2 and the minus outside; the 2s cancel)
This final expression is one element of the gradient vector: a sum of Error × Input. The negative of this gradient is how much to change the jth weight: larger features x_ij with larger errors will cause a larger change.
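The derivation translates directly into code. In a sketch of mine (not the slides'), the whole gradient vector −∑_i (y_i − w^T x_i) x_i is computed at once:

    import numpy as np

    def gradient(w, X, y):
        """dJ/dw_j = -sum_i (y_i - w.x_i) x_ij, computed for all j at once."""
        errors = y - X @ w           # one error per training example
        return -X.T @ errors         # sum of Error x Input, per feature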
Gradient descent for LMS
We are trying to minimize

    J(w) = (1/2) ∑_{i=1}^m (y_i − w^T x_i)^2

1. Initialize w^0

2. For t = 0, 1, 2, … until the error is below a threshold:
   – Compute the gradient of J(w) at w^t. Call it ∇J(w^t). Evaluate the function for each training example to compute the error and construct the gradient vector:

         ∂J/∂w_j = − ∑_{i=1}^m (y_i − w^T x_i) x_ij      (one element of ∇J(w^t))

         ∇J(w^t) = [∂J/∂w_1, ∂J/∂w_2, ⋯, ∂J/∂w_d]

   – Update w as follows:

         w^(t+1) = w^t − r ∇J(w^t)

     where r is the learning rate (for now a small constant)
We take a step in the opposite direction of the gradient, hence the minus sign in the update. After computing the error for all training examples, we get a single vector that we use to update all the weights at once: basically a batch update.
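Putting the pieces together, a minimal batch-gradient-descent sketch; the fixed learning rate r and epoch count are illustrative choices of mine:

    import numpy as np

    def batch_gradient_descent(X, y, r=1e-4, epochs=1000):
        w = np.zeros(X.shape[1])             # initialize to zeroes
        for _ in range(epochs):
            errors = y - X @ w               # evaluate every training example
            grad = -X.T @ errors             # construct the full gradient vector
            w = w - r * grad                 # one batch update of all weights
        return w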
Gradient Descent
EXAMPLE
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage
(x 100 lb) (years) per gallon
31.5 6 21
36.2 2 25
43.1 0 18
27.6 2 30

38
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18
x4 x40 = 1 x41 27.6 x42 2 30

Example Feature
index index

39
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18
x4 x40 = 1 x41 27.6 x42 2 30

What does weight vector look like?

40
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2

41
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2

How do we update weights?

42
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2

[ ∂w0 ∂w1 ∂w2 ]


∂J ∂J ∂J
∇J(wt ) = , ,

m
∂J One element
(yi − wT xi)xij

=−
∂wj of ∇J(wt )
i=1

wt+1 = wt − r ∇J(wt )
43
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2
m
∂J
(yi − wT xi)xi0

=−
[ ∂w0 ∂w1 ∂w2 ]
∂J ∂J ∂J
∇J(wt ) = , , ∂w0 i=1

m
∂J
(yi − wT xi)xij

=−
∂wj i=1

wt+1 = wt − r ∇J(wt )
44
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2
m
∂J
(yi − wT xi)xi0

=−
[ ∂w0 ∂w1 ∂w2 ]
∂J ∂J ∂J
∇J(wt ) = , , ∂w0 i=1
m
∂J
m (yi − wT xi)xi1

∂J =−
(yi − wT xi)xij ∂w1

=− i=1
∂wj i=1
m
∂J
(yi − wT xi)xi2

wt+1 = wt − r ∇J(wt ) =−
∂w2 i=1 45
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
∂J m w2
(yi − wT xi)xi0

=−
∂w0 i=1
= − (y1 − wT x1)x10 − (y2 − wT x2)x20 − (y3 − wT x3)x30 − (y4 − wT x4)x40

46
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
∂J m w2
(yi − wT xi)xi0

=−
∂w0 i=1
= − (y1 − wT x1)x10 − (y2 − wT x2)x20 − (y3 − wT x3)x30 − (y4 − wT x4)x40
m
∂J
(yi − wT xi)xi1

=−
∂w1 i=1
= − (y1 − wT x1)x11 − (y2 − wT x2)x21 − (y3 − wT x3)x31 − (y4 − wT x4)x41
m
∂J
(yi − wT xi)xi2

=−
∂w2 i=1
= − (y1 − wT x1)x12 − (y2 − wT x2)x22 − (y3 − wT x3)x32 − (y4 − wT x4)x42 47
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2

m
∂J
(yi − wT xi)xi0

=−
∂w0 i=1
= − (y1 − wT x1)x10 − (y2 − wT x2)x20 − (y3 − wT x3)x30 − (y4 − wT x4)x40
= − (21 − wT x1)1 − (25 − wT x2)1 − (18 − wT x3)1 − (30 − wT x4)1
= − (21 − 0)1 − (25 − 0)1 − (18 − 0)1 − (30 − 0)1
= − 94
48
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2
m
∂J
(yi − wT xi)xi1

=−
∂w1 i=1
= − (y1 − wT x1)x11 − (y2 − wT x2)x21 − (y3 − wT x3)x31 − (y4 − wT x4)x41
= − (21 − wT x1)31.5 − (25 − wT x2)36.2 − (18 − wT x3)43.1 − (30 − wT x4)27.6
= − (21 − 0)31.5 − (25 − 0)36.2 − (18 − 0)43.1 − (30 − 0)27.6
= − 661.5 − 905 − 775 − 828
= − 3169.5 49
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2

m
∂J
(yi − wT xi)xi2

=−
∂w2 i=1
= − (y1 − wT x1)x12 − (y2 − wT x2)x22 − (y3 − wT x3)x32 − (y4 − wT x4)x42
= − (21 − wT x1)6 − (25 − wT x2)2 − (18 − wT x3)0 − (30 − wT x4)2
= − (21 − 0)6 − (25 − 0)2 − (18 − 0)0 − (30 − 0)2
= − 126 − 50 − 0 − 60
= − 236 50
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2

[ 0 1 ∂w2 ]
∂J ∂J ∂J
∇J(wt ) = , ,
∂w ∂w m
∂J
(yi − wT xi)xi0 = − 94

wt+1 = wt − r ∇J(wt ) =−
∂w0 i=1
m
∂J
(yi − wT xi)xi1 = − 3169.5

=−
∂w1 i=1
m
∂J
(yi − wT xi)xi2 = − 236

=−
∂w2 i=1 51
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2

∇J(wt ) = [−94,3169.5, − 236]


m
∂J
(yi − wT xi)xi0 = − 94

wt+1 = wt − r ∇J(wt ) =−
∂w0 i=1
0 −94
[0]
m
wt+1 = 0 − r −3169.5 ∂J
(yi − wT xi)xi1 = − 3169.5

=−
−236 ∂w1 i=1
94 ∂J m
(yi − wT xi)xi2 = − 236

= 3169.5 =−
∂w2 i=1
236 52
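We can check this first step numerically. A small sketch (mine) computing the gradient at w = 0 on the mileage data:

    import numpy as np

    X = np.array([[1, 31.5, 6],
                  [1, 36.2, 2],
                  [1, 43.1, 0],
                  [1, 27.6, 2]])
    y = np.array([21., 25., 18., 30.])

    w = np.zeros(3)
    grad = -X.T @ (y - X @ w)
    print(grad)        # [-94.0, -3170.3, -236.0]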
Gradient Descent
CONVERGENCE
How to improve the likelihood of convergence
‣ Normalize the values of features and labels: it is important to normalize features when using gradient descent (otherwise it takes longer to converge). All features should have a similar scale.

‣ Decrease the learning rate over time.

‣ Check whether the weights are converging.

‣ Use cross-validation to determine how best to set hyperparameters like the number of epochs or the learning rate.
Learning Rates and Convergence
▪ In the general ("non-separable") case the learning rate r must decrease to zero to guarantee convergence.

▪ The learning rate is also called the step size. There are more sophisticated algorithms that choose the step size automatically and converge faster.

▪ Choosing a better starting point also has an impact.
Impact of learning rate
[Figure: cost J plotted against w for two runs from a random initial value; left panel, learning rate too large; right panel, learning rate too small.]
Gradient descent
The algorithm is guaranteed to converge to the minimum of J if the learning rate r is small enough (small enough steps) or if the learning rate is decreased appropriately.

Why? The objective J is a convex function here (LMS for linear regression): the surface contains only a single global minimum. The surface may have local minima if the loss function is different.
Decreasing learning rate over time
In order to guarantee that the algorithm will converge, the learning rate should decrease over time. Here is a general formula:

‣ At iteration t,

    r_t = c_1 / (t^a + c_2)

  where 0.5 < a < 2, c_1 > 0, and c_2 ≥ 0.
When should algorithm stop?
1. Stop after fixed number of iterations

2. Stop once prediction error is less than threshold

3. Stop when validation loss stops changing

Stopping criteria
For most functions, you probably won’t get the gradient
to be exactly equal to 0 in a reasonable amount of time

Once the gradient is sufficiently close to 0, stop trying to


minimize further

How do we measure how close a gradient is to 0?


Gradient is just a vector, so can compute distance.

Distance
How far apart are two points (x_1, y_1) and (x_2, y_2)?

[Figure: two points joined by a segment of length d, with horizontal leg x_2 − x_1 and vertical leg y_2 − y_1.]

Euclidean distance between 2 points in 2 dimensions:

    d = √((x_2 − x_1)^2 + (y_2 − y_1)^2)

In 3 dimensions (x, y, z):

    d = √((x_2 − x_1)^2 + (y_2 − y_1)^2 + (z_2 − z_1)^2)
Distance
General formula for the Euclidean distance between 2 points with k dimensions:

    d(p, q) = √(∑_{i=1}^k (p_i − q_i)^2)

where p and q are 2 points (each represents a k-dimensional vector).
Distance
A special case is the distance between a point and zero (the origin):

    d(p, 0) = √(∑_{i=1}^k p_i^2)     also written ||p||

This is called the Euclidean norm of p.
‣ A norm is a measure of a vector's length
‣ The Euclidean norm is also called the L2 norm
Stopping criteria
Stop when the norm of the gradient is below some threshold θ:

    ||∇L(w)|| < θ

Common values of θ are around 0.01, but if it is taking too long, you can make the threshold larger.
Gradient descent
1. Initialize the parameters w to some guess (usually all zeroes, or random values)

2. Update the parameters:

    w = w − r_t ∇L(w),    with r_t = c_1 / (t^a + c_2)

3. Repeat step 2 until ||∇L(w)|| < θ or until the maximum number of iterations is reached
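A sketch of this full loop with a decaying learning rate and the gradient-norm stopping test; the constants c1, c2, a, and theta below are illustrative choices of mine, not values from the slides:

    import numpy as np

    def gradient_descent(X, y, c1=1e-4, c2=1.0, a=1.0, theta=0.01, max_iters=100000):
        w = np.zeros(X.shape[1])                 # initial guess: all zeroes
        for t in range(max_iters):
            grad = -X.T @ (y - X @ w)            # gradient of the LMS loss
            if np.linalg.norm(grad) < theta:     # stop when ||grad L(w)|| < theta
                break
            r = c1 / (t ** a + c2)               # decreasing learning rate r_t
            w = w - r * grad
        return w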
Linear Regression
INCREMENTAL/STOCHASTIC
GRADIENT DESCENT
Gradient descent for LMS
We are trying to minimize

    J(w) = (1/2) ∑_{i=1}^m (y_i − w^T x_i)^2

1. Initialize w^0

2. For t = 0, 1, 2, … until the error is below a threshold:
   – Compute the gradient of J(w) at w^t, evaluating the function for each training example to compute the error and construct the gradient vector:

         ∂J/∂w_j = − ∑_{i=1}^m (y_i − w^T x_i) x_ij

   – Update w as follows: w^(t+1) = w^t − r ∇J(w^t)

The weight vector is not updated until all errors are calculated. Why not make early updates to the weight vector as soon as we encounter errors, instead of waiting for a full pass over the data?
Incremental/stochastic gradient descent
Repeat for each example (x_i, y_i):
‣ Pretend that the entire training set is represented by this single example
‣ Use this example to calculate the gradient and update the model

Contrast this with batch gradient descent, which makes one update to the weight vector for every pass over the data.
Incremental/stochastic gradient descent
1. Initialize w

2. For t = 0, 1, 2, … until the error is below a threshold:
   – For each training example (x_i, y_i), update w. For each element w_j of the weight vector:

         w_j^(t+1) = w_j^t + r (y_i − w^T x_i) x_ij

Contrast this with the previous method, where the weights are updated only after all examples are processed once.
This update rule is also called the Widrow-Hoff rule in the neural networks literature.

It may get close to the optimum much faster than the batch version. In general it does not converge to the global minimum; decreasing r with time guarantees convergence. Online/incremental algorithms are often preferred when the training set is very large.
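A minimal stochastic version updating after every example with the Widrow-Hoff rule; shuffling the examples each pass is common practice, not something the slides specify:

    import numpy as np

    def sgd(X, y, r=1e-4, epochs=100, seed=0):
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in rng.permutation(len(y)):    # one example at a time
                error = y[i] - w @ X[i]
                w = w + r * error * X[i]         # w_j += r (y_i - w.x_i) x_ij
        return w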
Linear regression: summary
▪ What we want: predict a real-valued output using a feature representation of the input

▪ Assumption: the output is a linear function of the inputs

▪ Learning by minimizing total cost
   – gradient descent and stochastic gradient descent to find the best weight vector
   – this particular optimization can also be computed directly by framing the problem as a matrix problem
Linear Regression
REGULARIZATION
Generalization
▪ Prediction functions that work on the training data
might not work on other data

▪ Minimizing the training error is a reasonable thing to


do, but it’s possible to minimize it “too well”

▪ Overfitting: your function matches the training data


well but is not learning general rules that will work for
new data

Regularization
▪ Modify learning algorithm to favor “simpler” prediction
rules to avoid overfitting

▪ Most commonly, regularization refers to modifying the


loss function to penalize certain values of the weights
you are learning. Specifically, penalize weights that are
large.

Regularization
▪ How do we define whether weights are large? Use the distance from w to the zero vector:

    d(w, 0) = √(∑_{i=1}^k (w_i)^2) = ||w||

  Note that the bias term w_0 is not regularized.

▪ This is called the L2 norm of w
   – A norm is a measure of a vector's length
   – Also called the Euclidean norm
Regularization
▪ New goal for minimization:

    L(w) + λ ||w||^2

  Here L(w) is whatever loss function we are using, and we square the norm to eliminate the square root, which is easier to work with mathematically. By minimizing this combined objective we prefer solutions where w is closer to 0.

▪ λ is a hyperparameter that adjusts the trade-off between having low training loss and having low weights.
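Differentiating the λ||w||^2 term adds 2λw to the gradient. A sketch (mine) of one regularized batch update, leaving the bias term w_0 unregularized as noted above:

    import numpy as np

    def ridge_step(w, X, y, r=1e-4, lam=0.1):
        grad = -X.T @ (y - X @ w)     # gradient of the squared-error loss L(w)
        reg = 2 * lam * w             # gradient of lambda * ||w||^2
        reg[0] = 0.0                  # do not regularize the bias term w_0
        return w - r * (grad + reg)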
Regularization
▪ Regularization helps the computational problem
because gradient descent won’t try to make some
feature weights grow larger and larger

▪ At some point, the penalty of having too large | | w | |2


will outweigh whatever gain you would make in your
loss function.

Regularization
▪ This also helps with generalization because it won’t
give large weight to features unless there is sufficient
evidence that they are useful

▪ The usefulness of a feature toward improving the loss


has to outweigh the cost of having large feature
weights

Regularization
▪ More generally:

    L(w) + λ R(w)

  R(w) is called the regularization term, regularizer, or penalty. The squared L2 norm is one kind of penalty, but there are others.

▪ λ is called the regularization strength. Other common names for λ: alpha in sklearn, C in many algorithms. Usually C actually refers to the inverse regularization strength, 1/λ. Figure out which one your implementation is using (i.e., whether increasing it will increase or decrease regularization).
L2 Regularization
▪ When the regularizer is the squared L2 norm ||w||^2, this is called L2 regularization

▪ This is the most common type of regularization

▪ When used with linear regression, this is called Ridge regression

▪ Logistic regression implementations usually use L2 regularization by default

▪ L2 regularization can be added to other algorithms like perceptron (or any gradient descent algorithm)
L2 Regularization
▪ The function R(w) = | | w | |2 is convex, so if it is added
to a convex loss function, the combined function will
still be convex.

L1 Regularization
▪ Another common regularizer is the L1 norm:

    ||w||_1 = ∑_{j=1}^k |w_j|

▪ When used with linear regression, this is called Lasso

▪ Often results in many weights being exactly 0 (while L2 just makes them small but nonzero)
L1+L2 Regularization
▪ L2 and L1 regularization can be combined:

    R(w) = λ_2 ||w||^2 + λ_1 ||w||_1

▪ Also called ElasticNet
▪ Can work better than either type alone
▪ Can adjust the hyperparameters to control which of the two penalties is more important
▪ Once training is done, remove the regularization term to measure model performance
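In scikit-learn these three penalties correspond to the Ridge, Lasso, and ElasticNet estimators, where alpha plays the role of λ and l1_ratio controls the L1/L2 mix. A usage sketch on the mileage data (sklearn fits the intercept itself, so the constant-1 column is dropped; the alpha values are illustrative):

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso, ElasticNet

    X = np.array([[31.5, 6], [36.2, 2], [43.1, 0], [27.6, 2]])   # weight, age
    y = np.array([21., 25., 18., 30.])

    ridge = Ridge(alpha=1.0).fit(X, y)                    # squared L2 penalty
    lasso = Lasso(alpha=0.1).fit(X, y)                    # L1: weights can be exactly 0
    enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2
    print(ridge.coef_, lasso.coef_, enet.coef_)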
Feature normalization
▪ The scale of the feature values matters when using regularization

▪ If one feature has values between [0, 1] and another between [0, 10000], the learned weights might be on very different scales – but whatever weights are "naturally" larger are going to get penalized more by the regularizer

▪ Feature normalization or standardization refers to converting the values to a standard range
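A minimal standardization sketch (mine): shift each feature column to zero mean and unit variance, leaving the constant bias column alone, and keep the training statistics so the same transform can be applied at test time:

    import numpy as np

    def standardize(X):
        """Scale feature columns to mean 0, std 1; column 0 is the bias.

        Assumes no feature column (other than the bias) is constant.
        """
        Z = X.astype(float).copy()
        mu = Z[:, 1:].mean(axis=0)
        sigma = Z[:, 1:].std(axis=0)
        Z[:, 1:] = (Z[:, 1:] - mu) / sigma
        return Z, mu, sigma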
