Regression Review

SLR, Constant Model, and OLS

Data 100/Data 200, Spring 2023 @ UC Berkeley


Narges Norouzi and Lisa Yan
Content credit: Acknowledgments
Today's Roadmap
Regression Review

● Ordinary Least Squares (OLS)
● Simple Linear Regression (SLR)
● Constant Model
Ordinary Least Squares
Regression Review
Multiple Linear Regression

Define the multiple linear regression model (p features):

\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_p x_p

Here \hat{y} is the predicted value of y for a single observation. This is a linear model because it is a linear combination of the parameters \theta_0, \theta_1, \dots, \theta_p.
Vector Notation (NBA Data)

To combine the two terms into one matrix operation, we can assume that there is an additional intercept term x_0 = 1 in each observation, and hence:

\hat{y} = \theta_0 \cdot 1 + \theta_1 x_1 + \dots + \theta_p x_p = x^T \theta

In the NBA data, rows correspond to individual players. One player's feature vector (with the leading 1) is, for example,

x = [1, 0.4, 0.8, 1.5]^T

Note that the same values written as a row vector give the transpose: x^T = [1  0.4  0.8  1.5].
Matrix Notation

To make predictions on all datapoints in our sample, we write one equation per datapoint, each using the same parameter vector \theta:

\hat{y}_1 = x_1^T \theta        (Datapoint 1)
\hat{y}_2 = x_2^T \theta        (Datapoint 2)
...
\hat{y}_n = x_n^T \theta        (Datapoint n)

Expanding out each datapoint's (transposed) input gives n row vectors, each with dimension (p + 1).

Vectorize the predictions and parameters to encapsulate all n equations into a single matrix equation:

\hat{Y} = X \theta

where X is the design matrix with dimensions n x (p + 1); its rows are the n row vectors x_i^T.
The Multiple Linear Regression Model using Matrix Notation

We can express our linear model on our entire dataset as follows:

\hat{Y} = X \theta

Here \hat{Y} is the prediction vector (n x 1), X is the design matrix (n x (p + 1)), and \theta is the parameter vector ((p + 1) x 1). Note that our true output Y is also a vector (n x 1).
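To make this concrete, here is a minimal NumPy sketch (hypothetical feature values and parameters, not taken from the slides) that builds a design matrix with an intercept column and computes the prediction vector:

import numpy as np

# Hypothetical data: n = 4 observations, p = 3 features (e.g. per-player NBA stats).
features = np.array([
    [0.4, 0.8, 1.5],
    [1.2, 0.3, 2.0],
    [0.9, 1.1, 0.7],
    [2.1, 0.5, 1.8],
])
n, p = features.shape

# Design matrix X: prepend a column of 1s for the intercept term, shape (n, p + 1).
X = np.hstack([np.ones((n, 1)), features])

# Hypothetical parameter vector theta, shape (p + 1,).
theta = np.array([0.5, 1.0, -0.3, 0.2])

# Prediction vector Y_hat = X @ theta: one prediction per row (player).
Y_hat = X @ theta
print(Y_hat)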
Mean Squared Error with L2 Norms

We can rewrite mean squared error as a squared L2 norm:

R(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \lVert Y - \hat{Y} \rVert_2^2

With our linear model \hat{Y} = X\theta:

R(\theta) = \frac{1}{n} \lVert Y - X\theta \rVert_2^2
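A quick numerical check of this identity, using synthetic (assumed) data:

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

# Synthetic design matrix (intercept column + random features), outputs, and parameters.
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
Y = rng.normal(size=n)
theta = rng.normal(size=p + 1)

residual = Y - X @ theta

# MSE written two ways: average of squared residuals, and squared L2 norm divided by n.
mse_sum = np.mean(residual ** 2)
mse_norm = np.linalg.norm(residual) ** 2 / n
assert np.isclose(mse_sum, mse_norm)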
[Linear Algebra] Span

The set of all possible linear combinations of the columns of X is called the span of the columns of X (denoted span(X)), also called the column space.
● Intuitively, this is all of the vectors you can "reach" using the columns of X.
● If each column of X has length n, span(X) is a subspace of \mathbb{R}^n.

A Linear Combination of Columns

Our prediction \hat{Y} = X\theta is a linear combination of the columns of X. Therefore \hat{Y} \in span(X).
Interpret: Our linear prediction will be in span(X), even if the true values Y might not be.
Goal: Find the vector in span(X) that is closest to Y.
The difference Y - \hat{Y} is the residual vector, e.

Goal: Minimize the L2 norm of the residual vector, i.e., get the predictions \hat{Y} to be "as close" to our true values Y as possible.

How do we minimize this distance, the (squared) norm of the residual vector?
The vector in span(X) that is closest to Y is the orthogonal projection of Y onto span(X). (We will not prove this property of orthogonal projection; see Khan Academy.)

Thus, we should choose the θ that makes the residual vector orthogonal to span(X).
[Linear Algebra] Orthogonality

1. Vector a and vector b are orthogonal if and only if their dot product is 0: a^T b = 0.
   This is a generalization of the notion of two vectors in 2D being perpendicular.

2. A vector v is orthogonal to span(M), the span of the columns of a matrix M, if and only if v is orthogonal to each column in M.

Let's express 2 in matrix notation. Let M be a matrix with d columns, where v has the same length as each column:

M^T v = \mathbf{0}

That is, v is orthogonal to each column of M, where \mathbf{0} is the zero vector (a d-length vector full of 0s).
Ordinary Least Squares Proof

The least squares estimate \hat{\theta} is the parameter \theta that minimizes the objective function R(\theta):

R(\theta) = \frac{1}{n} \lVert Y - X\theta \rVert_2^2        (X is the design matrix; Y - X\theta is the residual vector)

Equivalently, this is the \hat{\theta} such that the residual vector is orthogonal to span(X):

X^T (Y - X\hat{\theta}) = \mathbf{0}        (definition of orthogonality of Y - X\hat{\theta} to span(X); \mathbf{0} is the zero vector)

X^T X \hat{\theta} = X^T Y        (rearrange terms; this is the normal equation)

\hat{\theta} = (X^T X)^{-1} X^T Y        (if X^T X is invertible)
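A minimal NumPy sketch of this closed form, assuming a synthetic full-column-rank design matrix (np.linalg.solve on the normal equation avoids forming the inverse explicitly):

import numpy as np

rng = np.random.default_rng(42)
n, p = 200, 3

# Synthetic data: design matrix with an intercept column, plus noisy outputs.
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
true_theta = np.array([2.0, 1.0, -0.5, 0.3])
Y = X @ true_theta + rng.normal(scale=0.1, size=n)

# Solve the normal equation X^T X theta = X^T Y (preferred over inverting X^T X).
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta_hat)  # should be close to true_theta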
[Metrics] Multiple R²

Simple linear regression:
● Error: RMSE
● Linearity: correlation coefficient, r

Multiple linear regression:
● Error: RMSE
● Linearity: multiple R², also called the coefficient of determination
[Metrics] Multiple R²

We define the multiple R² value as the proportion of the variance of our fitted values (predictions) \hat{y} to our true values y:

R^2 = \frac{\text{variance of fitted values}}{\text{variance of } y} = \frac{\sigma_{\hat{y}}^2}{\sigma_y^2}

It is also called the coefficient of determination.

R² ranges from 0 to 1 and is effectively "the proportion of variance that the model explains."

For OLS with an intercept term (e.g. \theta_0):
● R² is equal to the square of the correlation between y and \hat{y}.
● For SLR, R² = r², where r is the correlation between x and y.
● The proof of these last two properties is beyond this course.
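A small NumPy check of this definition on synthetic (assumed) data; for an OLS fit with an intercept, the variance-ratio form and the squared correlation between y and the fitted values should agree:

import numpy as np

rng = np.random.default_rng(7)
n = 500

# Synthetic data with an intercept column and two features.
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# OLS fit and fitted values.
theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ theta_hat

# R^2 as a variance ratio, and as the squared correlation between Y and Y_hat.
r2_var = np.var(Y_hat) / np.var(Y)
r2_corr = np.corrcoef(Y, Y_hat)[0, 1] ** 2
assert np.isclose(r2_var, r2_corr)
print(r2_var)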
[Metrics] Multiple R²

Simple linear regression (AST only):
● Error: RMSE
● Linearity: correlation coefficient, r
● R² = 0.457

Multiple linear regression (AST & 3PA):
● Error: RMSE
● Linearity: multiple R², the coefficient of determination
● R² = 0.609

As we add more features, our fitted values tend to become closer and closer to our actual values. Thus, R² increases.
● The SLR model (AST only) explains 45.7% of the variance in the true y.
● The AST & 3PA model explains 60.9%.
Adding more features doesn't always mean our model is better, though! We are a few weeks away from understanding why.
Residual Properties

When using the optimal parameter vector \hat{\theta}, our residuals e = Y - \hat{Y} are orthogonal to span(X):

X^T e = \mathbf{0}

Proof: First line of our OLS estimate proof (previous slide).

For all linear models: since our predicted response \hat{Y} = X\hat{\theta} is in span(X) by definition, it is orthogonal to the residuals: \hat{Y}^T e = 0.

For all linear models with an intercept term \theta_0, the sum of residuals is zero:

\sum_{i=1}^{n} e_i = 0

(Proof hint: the intercept corresponds to a column of all 1s in X, and e is orthogonal to every column of X.)
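A minimal NumPy sketch verifying both properties numerically on synthetic data (assumed design matrix with an intercept column):

import numpy as np

rng = np.random.default_rng(3)
n = 100

# Synthetic design matrix with an intercept column, and noisy outputs.
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
Y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(size=n)

theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
residuals = Y - X @ theta_hat

# Residuals are orthogonal to every column of X (so also to the fitted values),
# and, because of the all-ones intercept column, they sum to zero.
print(X.T @ residuals)   # ~ zero vector
print(residuals.sum())   # ~ 0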
Properties When Our Model Has an Intercept Term

For all linear models with an intercept term, the sum of residuals is zero (previous slide).
● This is the real reason why we don't directly use residuals as loss.
● This is also why positive and negative residuals will cancel out in any residual plot where the (linear) model contains an intercept term, even if the model is terrible.

It follows from the property above that for linear models with intercepts, the average predicted value is equal to the average true value.

These properties are true when there is an intercept term, and not necessarily when there isn't.
Does a Unique Solution Always Exist?

Model: Constant Model + MSE
● Estimate: \hat{\theta}_0 = mean(y)
● Unique? Yes. Any set of values has a unique mean.

Model: Constant Model + MAE
● Estimate: \hat{\theta}_0 = median(y)
● Unique? Yes, if n is odd. No, if n is even (return the average of the middle 2 values).

Model: Simple Linear Regression + MSE
● Estimate: \hat{\theta}_1 = r (\sigma_y / \sigma_x), \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}
● Unique? Yes. Any set of non-constant* values has a unique mean, SD, and correlation coefficient.

Model: Ordinary Least Squares (Linear Model + MSE)
● Estimate: \hat{\theta} = (X^T X)^{-1} X^T Y
● Unique? ???
Understanding The Solution Matrices

In most settings, the number of observations is much greater than the number of features:

n >> p
Understanding The Solution Matrices

In practice, instead of directly inverting matrices, we can use more efficient numerical solvers to directly solve a system of linear equations.

The Normal Equation:   X^T X \hat{\theta} = X^T Y

Note that at least one solution always exists: intuitively, we can always draw a line of best fit for a given set of data, but there may be multiple lines that are "equally good". (Formal proof is beyond this course.)
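As an illustration of such a solver, np.linalg.lstsq returns a least squares solution directly from X and Y, and does so even when X^T X is not invertible. A sketch with synthetic data containing a deliberately duplicated column:

import numpy as np

rng = np.random.default_rng(1)
n = 50

# Synthetic design matrix; the third column duplicates the second,
# so X is NOT full column rank and X^T X is not invertible.
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x1])
Y = 3.0 + 2.0 * x1 + rng.normal(scale=0.1, size=n)

# lstsq still returns a (minimum-norm) least squares solution.
theta_hat, residuals, rank, _ = np.linalg.lstsq(X, Y, rcond=None)
print(theta_hat, rank)  # rank is 2, not 3: the solution is not unique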
Uniqueness of a Solution: Proof

Claim
The Least Squares estimate \hat{\theta} is unique if and only if X is full column rank.

Proof
● The solution to the normal equation X^T X \hat{\theta} = X^T Y is the least squares estimate \hat{\theta}.
● It has a unique solution if and only if the square matrix X^T X is invertible, which happens if and only if X^T X is full rank.
○ The rank of a square matrix is the max # of linearly independent columns it contains.
○ X^T X has shape (p + 1) x (p + 1), and therefore has max rank p + 1.
● X^T X and X have the same rank (proof out of scope).
● Therefore X^T X has rank p + 1 if and only if X has rank p + 1 (full column rank).
Uniqueness of a Solution: Interpretation

Claim:
The Least Squares estimate \hat{\theta} is unique if and only if X is full column rank.

When would we not have unique estimates?

1. If our design matrix X is "wide" (n data points by p + 1 features):
○ (property of rank) If n < p, the rank of X is at most min(n, p + 1) = n < p + 1.
○ In other words, if we have way more features than observations, then \hat{\theta} is not unique.
○ Typically we have n >> p, so this is less of an issue.

2. If our design matrix has features that are linear combinations of other features:
○ By definition, the rank of X is the number of linearly independent columns in X.
○ Example: If "Width", "Height", and "Perimeter" are all columns,
■ Perimeter = 2 * Width + 2 * Height → X is not full rank.
○ This is important with one-hot encoding (to discuss later).
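A small NumPy sketch of the Width/Height/Perimeter example (hypothetical values), showing the drop in rank via np.linalg.matrix_rank:

import numpy as np

rng = np.random.default_rng(0)
n = 20

width = rng.uniform(1, 10, size=n)
height = rng.uniform(1, 10, size=n)
perimeter = 2 * width + 2 * height   # exact linear combination of the other two

# Design matrix with an intercept column plus the three features.
X = np.column_stack([np.ones(n), width, height, perimeter])

print(X.shape[1])                 # 4 columns (p + 1)
print(np.linalg.matrix_rank(X))   # 3: X is not full column rank, so theta-hat is not unique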
Does a Unique Solution Always Exist?

Model: Constant Model + MSE
● Estimate: \hat{\theta}_0 = mean(y)
● Unique? Yes. Any set of values has a unique mean.

Model: Constant Model + MAE
● Estimate: \hat{\theta}_0 = median(y)
● Unique? Yes, if n is odd. No, if n is even (return the average of the middle 2 values).

Model: Simple Linear Regression + MSE
● Estimate: \hat{\theta}_1 = r (\sigma_y / \sigma_x), \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}
● Unique? Yes. Any set of non-constant* values has a unique mean, SD, and correlation coefficient.

Model: Ordinary Least Squares (Linear Model + MSE)
● Estimate: \hat{\theta} = (X^T X)^{-1} X^T Y
● Unique? Yes, if X is full column rank (all columns linearly independent, # datapoints >> # features).
Simple Linear Regression
Regression Review

(Only got to OLS in lecture; the rest was not covered.)
Parametric Model Notation

For data (x_1, y_1), ..., (x_n, y_n), the i-th datapoint is an observation (x_i, y_i):
● y_i is the i-th true output (aka dependent variable)
● x_i is the i-th feature (aka independent variable)
● \hat{y}_i is the i-th prediction (aka estimation)

Model parameter(s) \theta: any linear model with parameters \theta.
Estimated parameter(s) \hat{\theta}: the "best" fitting linear model, with parameters \hat{\theta} chosen to "best" fit the data in some sense.
L2 and L1 Loss

L2 Loss or Squared Loss: L(y, \hat{y}) = (y - \hat{y})^2
● Widely used. Also called "L2 loss".
● Reasonable:
○ \hat{y} close to y → good prediction → good fit → no loss
○ \hat{y} far from y → bad prediction → bad fit → lots of loss

L1 Loss or Absolute Loss: L(y, \hat{y}) = |y - \hat{y}|
● Sounds worse than it is. Also called "L1 loss".
● Reasonable:
○ \hat{y} close to y → good prediction → good fit → no loss
○ \hat{y} far from y → bad prediction → bad fit → some loss
Empirical Risk is Average Loss over Data

We care about how bad our model's predictions are for our entire data set, not just for one point. A natural measure, then, is the average loss (aka empirical risk) across all points. Given data (x_1, y_1), ..., (x_n, y_n):

\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i)

This is a function of the parameter \theta (holding the data fixed), because \theta determines the predictions \hat{y}_i.

The average loss on the sample tells us how well the model fits the data (not the population), but hopefully these are close.
Minimizing MSE for the SLR Model

Recall: we wanted to pick the regression line \hat{y} = \hat{\theta}_0 + \hat{\theta}_1 x to minimize the (sample) Mean Squared Error:

\hat{R}(\theta_0, \theta_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i - (\theta_0 + \theta_1 x_i))^2

To find the best values, we set the derivatives with respect to \theta_0 and \theta_1 equal to zero to obtain the optimality conditions, which are "equivalent" to the estimating equations:

\frac{1}{n} \sum_{i=1}^{n} (y_i - (\hat{\theta}_0 + \hat{\theta}_1 x_i)) = 0
\frac{1}{n} \sum_{i=1}^{n} (y_i - (\hat{\theta}_0 + \hat{\theta}_1 x_i)) x_i = 0

To find the best \hat{\theta}_0, \hat{\theta}_1, we need to solve these estimating equations.
From Estimating Equations to Estimators

Goal: Choose \hat{\theta}_0, \hat{\theta}_1 to solve the two estimating equations:

(1)  \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i) = 0      and      (2)  \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i) x_i = 0

Equation (1) gives \bar{y} - \hat{\theta}_0 - \hat{\theta}_1 \bar{x} = 0, i.e. \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}.

Now, let's try (2) - (1) * \bar{x}:

\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i)(x_i - \bar{x}) = 0

Plug in the definitions of correlation and SD, and solve for \hat{\theta}_1:

\hat{\theta}_1 = r \frac{\sigma_y}{\sigma_x},        \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}
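A minimal NumPy sketch of these estimators on synthetic data, with np.polyfit as a sanity check:

import numpy as np

rng = np.random.default_rng(5)
n = 200

# Synthetic SLR data.
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)

# SLR estimators: slope = r * sigma_y / sigma_x, intercept = ybar - slope * xbar.
r = np.corrcoef(x, y)[0, 1]
theta1_hat = r * np.std(y) / np.std(x)
theta0_hat = y.mean() - theta1_hat * x.mean()

# Sanity check against NumPy's own least squares line fit.
slope, intercept = np.polyfit(x, y, deg=1)
assert np.allclose([theta1_hat, theta0_hat], [slope, intercept])
print(theta0_hat, theta1_hat)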
Constant Model
Regression Review
Fit the Model: Rewrite MSE for the Constant Model

Recall that Mean Squared Error (MSE) is average squared loss (L2 loss) over the data y_1, ..., y_n:

\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2        ((y_i - \hat{y}_i)^2 is the L2 loss on a single datapoint)

Given the constant model \hat{y} = \theta_0:

\hat{R}(\theta_0) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta_0)^2

We fit the model by finding the optimal \hat{\theta}_0 that minimizes the MSE.
Fit the Model: Calculus for the General Case

1. Differentiate with respect to \theta_0:

\frac{d}{d\theta_0} \hat{R}(\theta_0) = \frac{1}{n} \sum_{i=1}^{n} \frac{d}{d\theta_0} (y_i - \theta_0)^2        (derivative of a sum is the sum of derivatives)
                                      = \frac{1}{n} \sum_{i=1}^{n} 2 (y_i - \theta_0)(-1)                        (chain rule)
                                      = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \theta_0)                             (simplify constants)

2. Set equal to 0:

0 = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \hat{\theta}_0)

3. Solve for \hat{\theta}_0:

0 = \sum_{i=1}^{n} y_i - n \hat{\theta}_0        (separate the sums; c + c + … + c = n·c)

\hat{\theta}_0 = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}        (the mean of y)
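A quick numerical sanity check of this result on synthetic data, comparing the MSE at the sample mean against a grid of other candidate \theta_0 values:

import numpy as np

rng = np.random.default_rng(11)
y = rng.normal(loc=5.0, scale=2.0, size=100)

def mse(theta0):
    # MSE of the constant model y-hat = theta0.
    return np.mean((y - theta0) ** 2)

# The MSE at the mean should be no larger than at any other candidate value.
candidates = np.linspace(y.min(), y.max(), 1001)
best = candidates[np.argmin([mse(t) for t in candidates])]
print(best, y.mean())   # nearly identical
assert mse(y.mean()) <= min(mse(t) for t in candidates)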
Fit the Model: Rewrite MAE for the Constant Model

Recall that Mean Absolute Error (MAE) is average absolute loss (L1 loss) over the data y_1, ..., y_n:

\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|        (|y_i - \hat{y}_i| is the L1 loss on a single datapoint)

Given the constant model \hat{y} = \theta_0:

\hat{R}(\theta_0) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \theta_0|

We fit the model by finding the optimal \hat{\theta}_0 that minimizes the MAE.
Fit the Model: Calculus

1. Differentiate with respect to \theta_0. ⚠ Absolute value!

Note: The derivative of the absolute value when the argument is 0 (i.e. when y_i = \theta_0) is technically undefined. We ignore this case in our derivation, since thankfully, it doesn't change our result (proof left to you).

\frac{d}{d\theta_0} |y_i - \theta_0| = -1 if observation y_i > our prediction \theta_0; +1 if observation y_i < our prediction \theta_0.

Summing up over i = 1, ..., n:

\frac{d}{d\theta_0} \hat{R}(\theta_0) = \frac{1}{n} \left[ \sum_{y_i < \theta_0} (+1) + \sum_{y_i > \theta_0} (-1) \right]

2. Set equal to 0:

0 = \frac{1}{n} \left[ \#(y_i < \hat{\theta}_0) - \#(y_i > \hat{\theta}_0) \right]

3. Solve for \hat{\theta}_0. Where do we go from here?
Median Minimizes MAE for the Constant Model

The constant model parameter \hat{\theta}_0 that minimizes MAE must satisfy:

\#(y_i > \hat{\theta}_0) = \#(y_i < \hat{\theta}_0)        (# observations greater than = # observations less than)

In other words, \hat{\theta}_0 needs to be such that there are an equal # of points to the left and right.

This is the definition of the median!
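And the analogous numerical check for the MAE result on synthetic data, comparing the median against a grid of candidate values:

import numpy as np

rng = np.random.default_rng(13)
y = rng.exponential(scale=3.0, size=101)   # odd n, so the median is unique

def mae(theta0):
    # MAE of the constant model y-hat = theta0.
    return np.mean(np.abs(y - theta0))

candidates = np.linspace(y.min(), y.max(), 2001)
best = candidates[np.argmin([mae(t) for t in candidates])]
print(best, np.median(y))   # nearly identical
assert mae(np.median(y)) <= min(mae(t) for t in candidates)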