
Lecture 15: LMS Regression and Gradient Descent
COMP 343, Spring 2022

Victoria Manfredi

Acknowledgements: These slides are based primarily on material from the book Machine Learning by Tom Mitchell (and associated slides), the book Machine Learning: An Applied Mathematics Introduction by Paul Wilmott, and slides created by Vivek Srikumar (Utah), Dan Roth (Penn), Julia Hockenmaier (Illinois Urbana-Champaign), Jessica Wu (Harvey Mudd), and C. David Page (U of Wisconsin-Madison).
Today’s Topics
Midterm
– Wednesday, March 30

Linear Regression
– Gradient descent
– Example
– Convergence
– Stochastic gradient descent
– Regularization

Recap
Evaluation metrics (for classification)
To evaluate a model, compare predicted labels to actual labels.

Accuracy: proportion of examples where we predicted the correct label

    accuracy = (# of correct predictions) / (# of examples)

Error: proportion of examples where we predicted the incorrect label

    error = 1 − accuracy = (# of incorrect predictions) / (# of examples)
Precision-Recall analysis
What fraction of class "label" examples did the classifier discover?

    Recall(label) = Correct predictions(label) / (Correct predictions(label) + Missed examples(label))

What fraction of the classifier's predictions of class "label" were correct?

    Precision(label) = Correct predictions(label) / (Correct predictions(label) + Incorrect predictions(label))

By default, precision and recall are computed for the positive label, as that is usually the case of interest and usually the one with fewer examples (e.g., diagnosing diseases in patients, identifying spam emails).
Combining into one number

Sometimes it is easier to work with a single number as a performance measure.

The F1 score balances precision p and recall r: it is their harmonic mean

    f1 = 2pr / (p + r)

Training to directly optimize F1 is difficult, but we can choose hyperparameters for which F1 is maximized.
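To make these metrics concrete, here is a minimal sketch (mine, not from the slides) that computes accuracy, precision, recall, and F1 from true and predicted binary labels, treating 1 as the positive class:

    def evaluate(y_true, y_pred, positive=1):
        """Accuracy, precision, recall, and F1 for one label of interest."""
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
        accuracy = correct / len(y_true)
        # correct predictions of the label / all predictions of the label
        precision = tp / (tp + fp) if tp + fp else 0.0
        # correct predictions of the label / all true examples of the label
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return accuracy, precision, recall, f1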
Linear regression
Inputs are feature vectors: x ∈ ℜ^d
Outputs are real numbers: y ∈ ℜ

For simplicity, we will assume that the first feature is always 1, to make notation easier:

    x_i = [x_1, x_2, ⋯, x_d]^T    with x_1 = 1

We have a training data set:

    D = {(x_1, y_1), (x_2, y_2), ⋯, (x_m, y_m)}

We want to approximate y as

    y = w_1 + w_2 x_2 + ⋯ + w_d x_d = w^T x

where w is the learned weight vector in ℜ^d. We are making the assumption that the output y is a linear function of the features x.

Goal: use the training data to find the best possible value of w.
If our hypothesis space is linear functions …
How do we know which weight vector is the best one for a training set?

For an input (x_i, y_i) in the training set, the cost of a mistake is

    |y_i − w^T x_i|

i.e., how far apart the true output is from the predicted output in an absolute sense. If they are very different, the weight vector is probably not very good:

    |y_i − w_1^T x_i| = 60000        |y_(i+1) − w_1^T x_(i+1)| = 0.1
    |y_i − w_2^T x_i| = 0.1          |y_(i+1) − w_2^T x_(i+1)| = 0.3

But it could also be that the weight vector is just bad for that example.
How do we decide whether a weight vector is good?
How do we know which weight vector is the best one for a training set?

For an input (x_i, y_i) in the training set, the cost of a mistake is

    |y_i − w^T x_i|

This tells us how good w is for one example.

Define the cost (or loss) for a particular weight vector w over the m training examples to be

    J(w) = (1/2) ∑_{i=1}^m (y_i − w^T x_i)^2

This tells us how good w is on the whole training set. Squared error is a popular loss function: the sum of squared costs over the training set. Dividing by 2 rather than m will make our math work out nicely later.

One strategy for learning: find the w with the least cost on this data.
J is a "function of functions": a function that evaluates how good other functions, or regressors, are, e.g., J(f) = (1/2) ∑_{i=1}^m (y_i − f(x_i))^2 with f(x) = w^T x. Every choice of w gives a different regressor, so J evaluates how good a regressor is.
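As a quick illustration (a sketch of mine, not from the slides), the cost J(w) in numpy, assuming each row of X already starts with the constant feature 1:

    import numpy as np

    def cost(w, X, y):
        """J(w) = 1/2 * sum_i (y_i - w.x_i)^2 over the training set."""
        residuals = y - X @ w          # y_i - w^T x_i for every example
        return 0.5 * np.sum(residuals ** 2)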
This is called Least Mean Squares (LMS) Regression

    min_w J(w) = min_w (1/2) ∑_{i=1}^m (y_i − w^T x_i)^2

Goal of learning: minimize the mean squared error.

‣ This is just the training objective: you can use different learning algorithms to minimize this objective.

‣ Properties of J(w): differentiable and convex. Lower values mean a better weight vector w, i.e., a better regressor.

‣ Mathematical optimization focuses on solving problems of the form min_w J(w), so many algorithms exist to solve this problem.

Different strategies exist for learning by optimization:
‣ Gradient descent: a popular algorithm
‣ Matrix inversion: for this particular minimization objective, there is also an analytical solution; no need for gradient descent: b = (X^T X)^(−1) X^T Y
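A minimal sketch of the analytical solution b = (X^T X)^(−1) X^T Y; using np.linalg.solve instead of forming the inverse explicitly is a standard numerical choice of mine, not something the slides prescribe:

    import numpy as np

    def lms_closed_form(X, y):
        """Solve the normal equations (X^T X) b = X^T y for the LMS minimizer."""
        return np.linalg.solve(X.T @ X, X.T @ y)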
Linear Regression
GRADIENT DESCENT
Gradient descent
We are trying to minimize

    J(w) = (1/2) ∑_{i=1}^m (y_i − w^T x_i)^2

General strategy for minimizing a function J(w):

1. Start with an initial guess for w, say w^0

2. Iterate until convergence:
   – Compute the gradient of J at w^t
   – Update w^t to get w^(t+1) by taking a step in the opposite direction of the gradient

Intuition: the gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction.

What is the gradient of a function?
In 2 dimensions: the slope of a line.
In higher dimensions: the direction of steepest ascent, that is, the direction in which the function grows the fastest.
[Figure: J(w) plotted against w, a convex bowl, with successive iterates w1, w2, w3, w4 stepping down toward the minimum.]

Walking through the picture:
– Initialize your starting point for the search for the minimum anywhere; say, pick the point w1. The gradient points in the direction the function grows.
– Then at every point, compute the gradient (the arrow), and take a step in the direction away from the gradient (i.e., move to a point where the value of the function is lower).
– Keep repeating …
– … and eventually you will get to the minimum.
Gradient descent for LMS
We are trying to minimize

    J(w) = (1/2) ∑_{i=1}^m (y_i − w^T x_i)^2

1. Initialize w^0 to zeroes or random values (J is a convex function, so it doesn't matter where we initialize)

2. For t = 0, 1, 2, …
   – Compute the gradient of J(w) at w^t. Call it ∇J(w^t) ("grad J" or "nabla J")
   – Update w as follows:

         w^(t+1) = w^t − r ∇J(w^t)

     using "−" since the step is in the opposite direction of the gradient, where r is the learning rate (a small constant)
What is the gradient of J?
Gradient of the cost J at point w
Remember that w is a vector with d elements, w = [w_1, w_2, w_3, …, w_j, …, w_d], and J is a function that maps w to a real number (the total cost).

To find the best direction in the weight space we compute the gradient of J with respect to each of the components of w:

    ∇J(w^t) = [∂J/∂w_1, ∂J/∂w_2, ⋯, ∂J/∂w_d]

This vector specifies the direction that produces the steepest increase in J. We want to modify w in the direction of −∇J(w), where (with a fixed step size r):

    w^(t+1) = w^t − r ∇J(w^t)
The gradient is itself a vector with d elements, since w is a vector with d elements. Each element is a partial derivative, and we need to compute every element of ∇J(w^t) to define the gradient.
Gradient of the cost J at point w
We are trying to minimize

    J(w) = (1/2) ∑_{i=1}^m (y_i − w^T x_i)^2

The gradient is of the form ∇J(w^t) = [∂J/∂w_1, ∂J/∂w_2, ⋯, ∂J/∂w_d]. Let's compute the gradient for the jth weight:

    ∂J/∂w_j = ∂/∂w_j (1/2) ∑_{i=1}^m (y_i − w^T x_i)^2

            = (1/2) ∑_{i=1}^m ∂/∂w_j (y_i − w^T x_i)^2
              (the gradient of a sum is the sum of the gradients, so move the partial derivative inside)

            = (1/2) ∑_{i=1}^m 2(y_i − w^T x_i) ∂/∂w_j (y_i − w_1 x_i1 − ⋯ − w_j x_ij − ⋯)
              (apply the chain rule and expand the dot product)

            = (1/2) ∑_{i=1}^m 2(y_i − w^T x_i)(−x_ij)
              (only one element of the expanded dot product depends on w_j)

            = − ∑_{i=1}^m (y_i − w^T x_i) x_ij
              (move the 2 and the minus outside; the 2s cancel)
This final expression is one element of the gradient vector: a sum of Error × Input. The negative of this gradient is how much to change the jth weight: larger features x_ij with larger errors will cause a larger change.
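The derivation translates directly into code. In a sketch of mine (not the slides'), the whole gradient vector −∑_i (y_i − w^T x_i) x_i is computed at once:

    import numpy as np

    def gradient(w, X, y):
        """dJ/dw_j = -sum_i (y_i - w.x_i) x_ij, computed for all j at once."""
        errors = y - X @ w           # one error per training example
        return -X.T @ errors         # sum of Error x Input, per feature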
Gradient descent for LMS
We are trying to minimize

    J(w) = (1/2) ∑_{i=1}^m (y_i − w^T x_i)^2

1. Initialize w^0

2. For t = 0, 1, 2, … until the error is below a threshold:
   – Compute the gradient of J(w) at w^t. Call it ∇J(w^t). Evaluate the function for each training example to compute the error and construct the gradient vector:

         ∂J/∂w_j = − ∑_{i=1}^m (y_i − w^T x_i) x_ij      (one element of ∇J(w^t))

         ∇J(w^t) = [∂J/∂w_1, ∂J/∂w_2, ⋯, ∂J/∂w_d]

   – Update w as follows:

         w^(t+1) = w^t − r ∇J(w^t)

     where r is the learning rate (for now a small constant)
We take a step in the opposite direction of the gradient, hence the minus sign in the update. After computing the error for all training examples, we get a single vector that we use to update all the weights at once: basically a batch update.
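Putting the pieces together, a minimal batch-gradient-descent sketch; the fixed learning rate r and epoch count are illustrative choices of mine:

    import numpy as np

    def batch_gradient_descent(X, y, r=1e-4, epochs=1000):
        w = np.zeros(X.shape[1])             # initialize to zeroes
        for _ in range(epochs):
            errors = y - X @ w               # evaluate every training example
            grad = -X.T @ errors             # construct the full gradient vector
            w = w - r * grad                 # one batch update of all weights
        return w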
Gradient Descent
EXAMPLE
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage
(x 100 lb) (years) per gallon
31.5 6 21
36.2 2 25
43.1 0 18
27.6 2 30

38
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18
x4 x40 = 1 x41 27.6 x42 2 30

Example Feature
index index

39
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18
x4 x40 = 1 x41 27.6 x42 2 30

What does weight vector look like?

40
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2

41
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2

How do we update weights?

42
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2

[ ∂w0 ∂w1 ∂w2 ]


∂J ∂J ∂J
∇J(wt ) = , ,

m
∂J One element
(yi − wT xi)xij

=−
∂wj of ∇J(wt )
i=1

wt+1 = wt − r ∇J(wt )
43
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2
m
∂J
(yi − wT xi)xi0

=−
[ ∂w0 ∂w1 ∂w2 ]
∂J ∂J ∂J
∇J(wt ) = , , ∂w0 i=1

m
∂J
(yi − wT xi)xij

=−
∂wj i=1

wt+1 = wt − r ∇J(wt )
44
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2
m
∂J
(yi − wT xi)xi0

=−
[ ∂w0 ∂w1 ∂w2 ]
∂J ∂J ∂J
∇J(wt ) = , , ∂w0 i=1
m
∂J
m (yi − wT xi)xi1

∂J =−
(yi − wT xi)xij ∂w1

=− i=1
∂wj i=1
m
∂J
(yi − wT xi)xi2

wt+1 = wt − r ∇J(wt ) =−
∂w2 i=1 45
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
∂J m w2
(yi − wT xi)xi0

=−
∂w0 i=1
= − (y1 − wT x1)x10 − (y2 − wT x2)x20 − (y3 − wT x3)x30 − (y4 − wT x4)x40

46
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
∂J m w2
(yi − wT xi)xi0

=−
∂w0 i=1
= − (y1 − wT x1)x10 − (y2 − wT x2)x20 − (y3 − wT x3)x30 − (y4 − wT x4)x40
m
∂J
(yi − wT xi)xi1

=−
∂w1 i=1
= − (y1 − wT x1)x11 − (y2 − wT x2)x21 − (y3 − wT x3)x31 − (y4 − wT x4)x41
m
∂J
(yi − wT xi)xi2

=−
∂w2 i=1
= − (y1 − wT x1)x12 − (y2 − wT x2)x22 − (y3 − wT x3)x32 − (y4 − wT x4)x42 47
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2

m
∂J
(yi − wT xi)xi0

=−
∂w0 i=1
= − (y1 − wT x1)x10 − (y2 − wT x2)x20 − (y3 − wT x3)x30 − (y4 − wT x4)x40
= − (21 − wT x1)1 − (25 − wT x2)1 − (18 − wT x3)1 − (30 − wT x4)1
= − (21 − 0)1 − (25 − 0)1 − (18 − 0)1 − (30 − 0)1
= − 94
48
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2
m
∂J
(yi − wT xi)xi1

=−
∂w1 i=1
= − (y1 − wT x1)x11 − (y2 − wT x2)x21 − (y3 − wT x3)x31 − (y4 − wT x4)x41
= − (21 − wT x1)31.5 − (25 − wT x2)36.2 − (18 − wT x3)43.1 − (30 − wT x4)27.6
= − (21 − 0)31.5 − (25 − 0)36.2 − (18 − 0)43.1 − (30 − 0)27.6
= − 661.5 − 905 − 775 − 828
= − 3169.5 49
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2

m
∂J
(yi − wT xi)xi2

=−
∂w2 i=1
= − (y1 − wT x1)x12 − (y2 − wT x2)x22 − (y3 − wT x3)x32 − (y4 − wT x4)x42
= − (21 − wT x1)6 − (25 − wT x2)2 − (18 − wT x3)0 − (30 − wT x4)2
= − (21 − 0)6 − (25 − 0)2 − (18 − 0)0 − (30 − 0)2
= − 126 − 50 − 0 − 60
= − 236 50
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2

[ 0 1 ∂w2 ]
∂J ∂J ∂J
∇J(wt ) = , ,
∂w ∂w m
∂J
(yi − wT xi)xi0 = − 94

wt+1 = wt − r ∇J(wt ) =−
∂w0 i=1
m
∂J
(yi − wT xi)xi1 = − 3169.5

=−
∂w1 i=1
m
∂J
(yi − wT xi)xi2 = − 236

=−
∂w2 i=1 51
What’s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
Weight Age Mileage x10 1
(x 100 lb) (years) per gallon x1 = x11 = 31.5
x1 x10 = 1 x11 31.5 x12 6 21 x12
x22 2 6
x2 x20 = 1 x21 36.2 25
x3 x30 = 1 x31 43.1 x32 0 18 w0 0
[0]
x4 x40 = 1 x41 27.6 x42 2 30 w0 = w1 = 0
w2

∇J(wt ) = [−94,3169.5, − 236]


m
∂J
(yi − wT xi)xi0 = − 94

wt+1 = wt − r ∇J(wt ) =−
∂w0 i=1
0 −94
[0]
m
wt+1 = 0 − r −3169.5 ∂J
(yi − wT xi)xi1 = − 3169.5

=−
−236 ∂w1 i=1
94 ∂J m
(yi − wT xi)xi2 = − 236

= 3169.5 =−
∂w2 i=1
236 52
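We can check this first step numerically. A small sketch (mine) computing the gradient at w = 0 on the mileage data:

    import numpy as np

    X = np.array([[1, 31.5, 6],
                  [1, 36.2, 2],
                  [1, 43.1, 0],
                  [1, 27.6, 2]])
    y = np.array([21., 25., 18., 30.])

    w = np.zeros(3)
    grad = -X.T @ (y - X @ w)
    print(grad)        # [-94.0, -3170.3, -236.0]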
Gradient Descent
CONVERGENCE
How to improve the likelihood of convergence
‣ Normalize the values of features and labels: it is important to normalize features when using gradient descent (otherwise it takes longer to converge). All features should have a similar scale.

‣ Decrease the learning rate over time.

‣ Check whether the weights are converging.

‣ Use cross-validation to determine how best to set hyperparameters like the number of epochs or the learning rate.
Learning Rates and Convergence
▪ In the general ("non-separable") case the learning rate r must decrease to zero to guarantee convergence.

▪ The learning rate is also called the step size. There are more sophisticated algorithms that choose the step size automatically and converge faster.

▪ Choosing a better starting point also has an impact.
Impact of learning rate
[Figure: cost J plotted against w for two runs from a random initial value; left panel, learning rate too large; right panel, learning rate too small.]
Gradient descent
The algorithm is guaranteed to converge to the minimum of J if the learning rate r is small enough (small enough steps) or if the learning rate is decreased appropriately.

Why? The objective J is a convex function here (LMS for linear regression): the surface contains only a single global minimum. The surface may have local minima if the loss function is different.
Decreasing learning rate over time
In order to guarantee that the algorithm will converge, the learning rate should decrease over time. Here is a general formula:

‣ At iteration t,

    r_t = c_1 / (t^a + c_2)

  where 0.5 < a < 2, c_1 > 0, and c_2 ≥ 0.
When should algorithm stop?
1. Stop after fixed number of iterations

2. Stop once prediction error is less than threshold

3. Stop when validation loss stops changing

Stopping criteria
For most functions, you probably won’t get the gradient
to be exactly equal to 0 in a reasonable amount of time

Once the gradient is sufficiently close to 0, stop trying to


minimize further

How do we measure how close a gradient is to 0?


Gradient is just a vector, so can compute distance.

Distance
How far apart are two points (x_1, y_1) and (x_2, y_2)?

[Figure: two points joined by a segment of length d, with horizontal leg x_2 − x_1 and vertical leg y_2 − y_1.]

Euclidean distance between 2 points in 2 dimensions:

    d = √((x_2 − x_1)^2 + (y_2 − y_1)^2)

In 3 dimensions (x, y, z):

    d = √((x_2 − x_1)^2 + (y_2 − y_1)^2 + (z_2 − z_1)^2)
Distance
General formula for the Euclidean distance between 2 points with k dimensions:

    d(p, q) = √(∑_{i=1}^k (p_i − q_i)^2)

where p and q are 2 points (each represents a k-dimensional vector).
Distance
A special case is the distance between a point and zero (the origin):

    d(p, 0) = √(∑_{i=1}^k p_i^2)     also written ||p||

This is called the Euclidean norm of p.
‣ A norm is a measure of a vector's length
‣ The Euclidean norm is also called the L2 norm
Stopping criteria
Stop when the norm of the gradient is below some threshold θ:

    ||∇L(w)|| < θ

Common values of θ are around 0.01, but if it is taking too long, you can make the threshold larger.
Gradient descent
1. Initialize the parameters w to some guess (usually all zeroes, or random values)

2. Update the parameters:

    w = w − r_t ∇L(w),    with r_t = c_1 / (t^a + c_2)

3. Repeat step 2 until ||∇L(w)|| < θ or until the maximum number of iterations is reached
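A sketch of this full loop with a decaying learning rate and the gradient-norm stopping test; the constants c1, c2, a, and theta below are illustrative choices of mine, not values from the slides:

    import numpy as np

    def gradient_descent(X, y, c1=1e-4, c2=1.0, a=1.0, theta=0.01, max_iters=100000):
        w = np.zeros(X.shape[1])                 # initial guess: all zeroes
        for t in range(max_iters):
            grad = -X.T @ (y - X @ w)            # gradient of the LMS loss
            if np.linalg.norm(grad) < theta:     # stop when ||grad L(w)|| < theta
                break
            r = c1 / (t ** a + c2)               # decreasing learning rate r_t
            w = w - r * grad
        return w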
Linear Regression
INCREMENTAL/STOCHASTIC
GRADIENT DESCENT
Gradient descent for LMS
We are trying to minimize

    J(w) = (1/2) ∑_{i=1}^m (y_i − w^T x_i)^2

1. Initialize w^0

2. For t = 0, 1, 2, … until the error is below a threshold:
   – Compute the gradient of J(w) at w^t, evaluating the function for each training example to compute the error and construct the gradient vector:

         ∂J/∂w_j = − ∑_{i=1}^m (y_i − w^T x_i) x_ij

   – Update w as follows: w^(t+1) = w^t − r ∇J(w^t)

The weight vector is not updated until all errors are calculated. Why not make early updates to the weight vector as soon as we encounter errors, instead of waiting for a full pass over the data?
Incremental/stochastic gradient descent
Repeat for each example (x_i, y_i):
‣ Pretend that the entire training set is represented by this single example
‣ Use this example to calculate the gradient and update the model

Contrast this with batch gradient descent, which makes one update to the weight vector for every pass over the data.
Incremental/stochastic gradient descent
1. Initialize w

2. For t = 0, 1, 2, … until the error is below a threshold:
   – For each training example (x_i, y_i), update w. For each element w_j of the weight vector:

         w_j^(t+1) = w_j^t + r (y_i − w^T x_i) x_ij

Contrast this with the previous method, where the weights are updated only after all examples are processed once.
This update rule is also called the Widrow-Hoff rule in the neural networks literature.

It may get close to the optimum much faster than the batch version. In general it does not converge to the global minimum; decreasing r with time guarantees convergence. Online/incremental algorithms are often preferred when the training set is very large.
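A minimal stochastic version updating after every example with the Widrow-Hoff rule; shuffling the examples each pass is common practice, not something the slides specify:

    import numpy as np

    def sgd(X, y, r=1e-4, epochs=100, seed=0):
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in rng.permutation(len(y)):    # one example at a time
                error = y[i] - w @ X[i]
                w = w + r * error * X[i]         # w_j += r (y_i - w.x_i) x_ij
        return w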
Linear regression: summary
▪ What we want: predict a real-valued output using a feature representation of the input

▪ Assumption: the output is a linear function of the inputs

▪ Learning by minimizing total cost
   – gradient descent and stochastic gradient descent to find the best weight vector
   – this particular optimization can also be computed directly by framing the problem as a matrix problem
Linear Regression
REGULARIZATION
Generalization
▪ Prediction functions that work on the training data
might not work on other data

▪ Minimizing the training error is a reasonable thing to


do, but it’s possible to minimize it “too well”

▪ Overfitting: your function matches the training data


well but is not learning general rules that will work for
new data

Regularization
▪ Modify learning algorithm to favor “simpler” prediction
rules to avoid overfitting

▪ Most commonly, regularization refers to modifying the


loss function to penalize certain values of the weights
you are learning. Specifically, penalize weights that are
large.

Regularization
▪ How do we define whether weights are large? Use the distance from w to the zero vector:

    d(w, 0) = √(∑_{i=1}^k (w_i)^2) = ||w||

  Note that the bias term w_0 is not regularized.

▪ This is called the L2 norm of w
   – A norm is a measure of a vector's length
   – Also called the Euclidean norm
Regularization
▪ New goal for minimization:

    L(w) + λ ||w||^2

  Here L(w) is whatever loss function we are using, and we square the norm to eliminate the square root, which is easier to work with mathematically. By minimizing this combined objective we prefer solutions where w is closer to 0.

▪ λ is a hyperparameter that adjusts the trade-off between having low training loss and having low weights.
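Differentiating the λ||w||^2 term adds 2λw to the gradient. A sketch (mine) of one regularized batch update, leaving the bias term w_0 unregularized as noted above:

    import numpy as np

    def ridge_step(w, X, y, r=1e-4, lam=0.1):
        grad = -X.T @ (y - X @ w)     # gradient of the squared-error loss L(w)
        reg = 2 * lam * w             # gradient of lambda * ||w||^2
        reg[0] = 0.0                  # do not regularize the bias term w_0
        return w - r * (grad + reg)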
Regularization
▪ Regularization helps the computational problem
because gradient descent won’t try to make some
feature weights grow larger and larger

▪ At some point, the penalty of having too large | | w | |2


will outweigh whatever gain you would make in your
loss function.

Regularization
▪ This also helps with generalization because it won’t
give large weight to features unless there is sufficient
evidence that they are useful

▪ The usefulness of a feature toward improving the loss


has to outweigh the cost of having large feature
weights

Regularization
▪ More generally:

    L(w) + λ R(w)

  R(w) is called the regularization term, regularizer, or penalty. The squared L2 norm is one kind of penalty, but there are others.

▪ λ is called the regularization strength. Other common names for λ: alpha in sklearn, C in many algorithms. Usually C actually refers to the inverse regularization strength, 1/λ. Figure out which one your implementation is using (i.e., whether increasing it will increase or decrease regularization).
L2 Regularization
▪ When the regularizer is the squared L2 norm ||w||^2, this is called L2 regularization

▪ This is the most common type of regularization

▪ When used with linear regression, this is called Ridge regression

▪ Logistic regression implementations usually use L2 regularization by default

▪ L2 regularization can be added to other algorithms like perceptron (or any gradient descent algorithm)
L2 Regularization
▪ The function R(w) = | | w | |2 is convex, so if it is added
to a convex loss function, the combined function will
still be convex.

L1 Regularization
▪ Another common regularizer is the L1 norm:

    ||w||_1 = ∑_{j=1}^k |w_j|

▪ When used with linear regression, this is called Lasso

▪ Often results in many weights being exactly 0 (while L2 just makes them small but nonzero)
L1+L2 Regularization
▪ L2 and L1 regularization can be combined:

    R(w) = λ_2 ||w||^2 + λ_1 ||w||_1

▪ Also called ElasticNet
▪ Can work better than either type alone
▪ Can adjust the hyperparameters to control which of the two penalties is more important
▪ Once training is done, remove the regularization term to measure model performance
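In scikit-learn these three penalties correspond to the Ridge, Lasso, and ElasticNet estimators, where alpha plays the role of λ and l1_ratio controls the L1/L2 mix. A usage sketch on the mileage data (sklearn fits the intercept itself, so the constant-1 column is dropped; the alpha values are illustrative):

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso, ElasticNet

    X = np.array([[31.5, 6], [36.2, 2], [43.1, 0], [27.6, 2]])   # weight, age
    y = np.array([21., 25., 18., 30.])

    ridge = Ridge(alpha=1.0).fit(X, y)                    # squared L2 penalty
    lasso = Lasso(alpha=0.1).fit(X, y)                    # L1: weights can be exactly 0
    enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2
    print(ridge.coef_, lasso.coef_, enet.coef_)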
Feature normalization
▪ The scale of the feature values matters when using regularization

▪ If one feature has values between [0, 1] and another between [0, 10000], the learned weights might be on very different scales – but whatever weights are "naturally" larger are going to get penalized more by the regularizer

▪ Feature normalization or standardization refers to converting the values to a standard range
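A minimal standardization sketch (mine): shift each feature column to zero mean and unit variance, leaving the constant bias column alone, and keep the training statistics so the same transform can be applied at test time:

    import numpy as np

    def standardize(X):
        """Scale feature columns to mean 0, std 1; column 0 is the bias.

        Assumes no feature column (other than the bias) is constant.
        """
        Z = X.astype(float).copy()
        mu = Z[:, 1:].mean(axis=0)
        sigma = Z[:, 1:].std(axis=0)
        Z[:, 1:] = (Z[:, 1:] - mu) / sigma
        return Z, mu, sigma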
