CS229 Lecture 2 PDF

This lecture covered supervised learning and linear regression. It introduced linear regression as a way to model the relationship between input features (x) and continuous output variables (y). The goal is to find parameters (θ) for the linear function hθ(x) = Σθjxj such that it best predicts the outputs y in the training data. This is done by minimizing the least squares cost function J(θ) using gradient descent. The lecture also discussed representing the data and predictions using vector notation for clarity.

CS 229 Lecture Two

Supervised Learning: Regression

Chris Ré

April 2, 2023
Disclaimers

- I’m trying a new format with slides (vs. whiteboard).
- The course notes maintained by Tengyu are your best source; the lecture is to give you the overall sense and highlight issues.
- The slides are new (copied from old hand-written notes), so apologies for any bugs. Please flag them!
- I’m worried that the lecture pacing will be too fast. Please slow me down with questions.
- I talk fast, so please watch at a slower speed.
Supervised Learning and Linear Regression

- Definitions
- Linear Regression
- Batch and Stochastic Gradient
- Normal Equations
Supervised Learning

- A hypothesis or a prediction function is a function h : X → Y.
  - X is an image, and Y contains “cat” or “not.”
  - X is a text snippet, and Y contains “hate speech” or “not.”
  - X is house data, and Y could be the price.
- A training set is a set of pairs (x^(1), y^(1)), ..., (x^(n), y^(n)) s.t. x^(i) ∈ X and y^(i) ∈ Y for i = 1, ..., n.
- Given a training set, our goal is to produce a good prediction function h.
  - Defining “good” will take us a bit. It’s a modeling question!
  - We will want to use h on new data not in the training set.
- If Y is continuous, the problem is called a regression problem.
- If Y is discrete, it is called a classification problem.
Our first example: Regression using Housing Data.
Example Data (Housing Prices from Ames Dataset from
Kaggle)
How do we represent h? (One popular choice)

h(x) = θ0 + θ1 x1 is an affine function.

        size  | Price
x^(1)   2104  | y^(1)  400
x^(2)   2500  | y^(2)  900

An example prediction?

Notice the prediction is defined by the parameters θ0 and θ1. This is a huge reduction in the space of functions!
Simple Line Fit
Slightly More Interesting Data

We add features (bedrooms and lot size) to incorporate more information about houses.

        size  bedrooms  lot size | Price
x^(1)   2104  4         45k      | y^(1)  400
x^(2)   2500  3         30k      | y^(2)  900

What’s a prediction here?

h(x) = θ0 + θ1 x1 + θ2 x2 + θ3 x3.

With the convention that x0 = 1 we can write:

h(x) = Σ_{j=0}^{3} θ_j x_j
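As a concrete illustration of this sum, here is a minimal numpy sketch; the θ values below are invented for the example, not fit to any data:

```python
import numpy as np

# Invented parameters and one house's features; x0 = 1 is prepended
# so that theta_0 acts as the intercept.
theta = np.array([10.0, 0.15, 20.0, 0.001])   # [theta_0, ..., theta_3], made up
x = np.array([1.0, 2104.0, 4.0, 45000.0])     # [x_0, size, bedrooms, lot size]

# h(x) = sum_{j=0}^{3} theta_j * x_j, written as a dot product.
h = theta @ x
```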
Vector Notation for Prediction

        size  bedrooms  lot size | Price
x^(1)   2104  4         45k      | y^(1)  400
x^(2)   2500  3         30k      | y^(2)  900

We write the vectors as (important notation)

θ = [θ0, θ1, θ2, θ3]^T  and  x^(1) = [x0^(1), x1^(1), x2^(1), x3^(1)]^T = [1, 2104, 4, 45]^T  and  y^(1) = 400

We call θ the parameters, x^(i) the input or the features, and y^(i) the output or target. To be clear, (x, y) is a training example and (x^(i), y^(i)) is the i-th example.

We have n examples (i.e., i = 1, ..., n). There are d features, so x^(i) and θ are (d + 1)-dimensional (since x0 = 1).
Visual version of linear regression

Let hθ(x) = Σ_{j=0}^{d} θ_j x_j; we want to choose θ so that hθ(x) ≈ y. One popular idea is called least squares:

J(θ) = (1/2) Σ_{i=1}^{n} (hθ(x^(i)) − y^(i))^2.

Choose

θ = argmin_θ J(θ).
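As a sanity check on this definition, here is a sketch of J in code, using the toy numbers from the earlier table (only the size feature, with x0 = 1 prepended by hand):

```python
import numpy as np

def J(theta, xs, ys):
    """Least-squares cost: 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    return 0.5 * sum((theta @ x - y) ** 2 for x, y in zip(xs, ys))

# The two houses from the table, using only the size feature.
xs = [np.array([1.0, 2104.0]), np.array([1.0, 2500.0])]
ys = [400.0, 900.0]

# With theta = 0 every prediction is 0, so J = 1/2 * (400^2 + 900^2).
cost_at_zero = J(np.zeros(2), xs, ys)
```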
Linear Regression Summary

- We saw our first hypothesis class: affine or linear functions.
- We refreshed ourselves on notation and introduced terminology like parameters, features, etc.
- We saw the paradigm that a “good” hypothesis is somehow one that is close to the data (objective function J).
- Next, we’ll see how to solve these equations.
Solving the least squares optimization problem.
Gradient Descent

θ^(0) = 0
θ_j^(t+1) = θ_j^(t) − α (∂/∂θ_j) J(θ^(t))  for j = 0, ..., d.
Gradient Descent Computation

θ_j^(t+1) = θ_j^(t) − α (∂/∂θ_j) J(θ^(t))  for j = 0, ..., d.

Note that α is called the learning rate or step size.

Let’s compute the derivatives. . .

(∂/∂θ_j) J(θ^(t)) = Σ_{i=1}^{n} (1/2) (∂/∂θ_j) (hθ(x^(i)) − y^(i))^2
                  = Σ_{i=1}^{n} (hθ(x^(i)) − y^(i)) (∂/∂θ_j) hθ(x^(i))

For our particular hθ we have:

hθ(x) = θ0 x0 + θ1 x1 + · · · + θd xd,  so  (∂/∂θ_j) hθ(x) = x_j.
Gradient Descent Computation

Thus, our update rule for component j can be written:

θ_j^(t+1) = θ_j^(t) − α Σ_{i=1}^{n} (hθ(x^(i)) − y^(i)) x_j^(i).

We write this in vector notation for j = 0, ..., d as:

θ^(t+1) = θ^(t) − α Σ_{i=1}^{n} (hθ(x^(i)) − y^(i)) x^(i).

Saves us a lot of writing! And easier to understand . . . eventually.
Batch Versus Stochastic Minibatch: Motivation

Consider our update rule:

θ^(t+1) = θ^(t) − α Σ_{i=1}^{n} (hθ(x^(i)) − y^(i)) x^(i).

- For a single update, our rule examines all n data points.
- In some modern applications (more later) n may be in the billions or trillions!
  - E.g., we try to “predict” every word on the web.
- Idea: sample a few points (maybe even just one!) to approximate the gradient. This is called Stochastic Gradient (SGD).
- SGD is the workhorse of modern ML, e.g., pytorch and tensorflow.
Stochastic Minibatch

- We randomly select a batch B ⊆ {1, ..., n} where |B| < n.
- We approximate the gradient using just those B points as follows (vs. gradient descent):

(1/|B|) Σ_{j∈B} (hθ(x^(j)) − y^(j)) x^(j)   vs.   (1/n) Σ_{j=1}^{n} (hθ(x^(j)) − y^(j)) x^(j).

- So our update rule for SGD is:

θ^(t+1) = θ^(t) − α_B Σ_{j∈B} (hθ(x^(j)) − y^(j)) x^(j).

- NB: the scaling of 1/|B| versus 1/n is “hidden” inside the choice of α_B.
Stochastic Minibatch vs. Gradient Descent

- Recall our rule over B points:

θ^(t+1) = θ^(t) − α_B Σ_{j∈B} (hθ(x^(j)) − y^(j)) x^(j).

- If B = {1, ..., n} (the whole set), then the two methods coincide.
- A smaller B implies a lower-quality approximation of the gradient (higher variance).
- Nevertheless, it may actually converge faster! (Consider a dataset with many copies of the same point: an extreme case, but real data has lots of redundancy.)
- In practice, choose |B| proportional to what works well on modern parallel hardware (GPUs).
Summary of this Subsection of Optimization

- Our goal was to optimize a loss function to find a good predictor.
- We learned about gradient descent and the workhorse algorithm for ML, Stochastic Gradient Descent (SGD).
- We touched on the tradeoffs of choosing the right batch size.
Normal Equations

- Least squares with a linear hypothesis is really special: we can solve it exactly (algebraically)!
- We’ll derive the Normal Equations for least squares.
Notation for Least Squares with Linear hθ

J(θ) = (1/2) Σ_{i=1}^{n} (hθ(x^(i)) − y^(i))^2.

Let’s get some convenient notation: stack the examples as rows,

X = [ (x^(1))^T ; (x^(2))^T ; ... ; (x^(n))^T ] ∈ R^{n×(d+1)}  and  y = [y^(1), y^(2), ..., y^(n)]^T ∈ R^n.

We may call X the Design Matrix.

With this notation for linear hθ(x), matrix multiplication is evaluation of hθ, that is

Xθ = [hθ(x^(1)), hθ(x^(2)), ..., hθ(x^(n))]^T  and so  J(θ) = (1/2)(Xθ − y)^T (Xθ − y).
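In code, this matrix form of J is essentially one line. A sketch using the toy table values (with prices in $1000s and lot sizes in thousands of square feet):

```python
import numpy as np

# Design matrix from the table, with the x0 = 1 column prepended.
X = np.array([[1.0, 2104.0, 4.0, 45.0],
              [1.0, 2500.0, 3.0, 30.0]])
y = np.array([400.0, 900.0])

def J(theta):
    r = X @ theta - y        # residual vector X theta - y
    return 0.5 * (r @ r)     # 1/2 (X theta - y)^T (X theta - y)

# At theta = 0 this reduces to 1/2 * ||y||^2.
cost_at_zero = J(np.zeros(4))
```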
Vector Derivatives

Recall that for a real-valued matrix function f : R^{n×d} → R, ∇_A f(A) means the n×d matrix whose (i, j) entry is (∂/∂a_ij) f(A):

∇_A f(A) = [ ∂f/∂a_11  ∂f/∂a_12  ...  ∂f/∂a_1d
             ∂f/∂a_21  ∂f/∂a_22  ...  ∂f/∂a_2d
             ...
             ∂f/∂a_n1  ∂f/∂a_n2  ...  ∂f/∂a_nd ]

Here A ∈ R^{n×d}.

- With this notation, to find the minimum of J(θ) we solve ∇θ J(θ) = 0.
- Note that ∇θ J(θ) ∈ R^{d+1} since θ ∈ R^{d+1}.
The normal equation

From our previous derivation,

∇θ J(θ) = (1/2) ∇θ (Xθ − y)^T (Xθ − y).

Multiplying out we have:

∇θ J(θ) = X^T X θ − X^T y.

Setting ∇θ J(θ) = 0 and solving for θ, assuming (X^T X)^{-1} exists, we obtain:

θ = (X^T X)^{-1} X^T y.

We have the optimal solution for θ!
Some slight cheating. . .

θ = (X^T X)^{-1} X^T y.

- We’ve assumed (X^T X)^{-1} exists. What happens if not? Is θ uniquely defined? Up to what?
- Why was ∇θ J(θ) = 0 a minimum? Notice that

∇²θ J(θ) = ∇θ (X^T X θ − X^T y) = X^T X ⪰ 0,

that is, its second derivative is positive (semi)definite.

- We did some quick vector calculus; if this isn’t familiar, practice on Friday!
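On the first question: if the columns of X are linearly dependent, X^T X is singular and θ is only defined up to the null space of X. One common resolution (a sketch, not covered on the slide) is the minimum-norm solution via the pseudoinverse, illustrated here with an invented duplicated feature:

```python
import numpy as np

# The third column duplicates the second, so X^T X is singular and many
# different theta produce identical predictions X @ theta.
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])      # exactly 1 + 2 * (second column)

# np.linalg.lstsq returns the minimum-norm least-squares solution, the
# same as applying the Moore-Penrose pseudoinverse: pinv(X) @ y.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
```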
Summary from Today

- We saw a lot of notation.
- The TAs can help you practice on Friday!
- We learned about linear regression: the model, how to solve it, and more.
- We learned the workhorse algorithm for ML, called SGD.
- Next time: Classification!
