Regression Review
SLR, Constant Model, and OLS
Data 100/Data 200, Spring 2023 @ UC Berkeley
Narges Norouzi and Lisa Yan
Content credit: Acknowledgments
Today's Roadmap
● Ordinary Least Squares (OLS)
● Simple Linear Regression (SLR)
● Constant Model
Ordinary Least Squares
Multiple Linear Regression
Define the multiple linear regression model (p features) for a single observation:

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_p x_p$$

Here $\hat{y}$ is the predicted value of $y$ for a single observation (a single observation prediction). This is a linear model because it is a linear combination of the parameters $\theta_0, \theta_1, \dots, \theta_p$.
Vector Notation
NBA Data: rows correspond to individual players.
To combine the two terms (the intercept $\theta_0$ and the weighted features) into one matrix operation, we can assume that there is an additional intercept feature $x_0 = 1$ in each observation's feature vector, and hence:

$$\hat{y} = x^T \theta = \theta^T x$$

For example, one player's feature vector (with the intercept feature prepended) is $x = [1, \; 0.4, \; 0.8, \; 1.5]^T$. Note that its transpose is the row vector $x^T = [1 \;\; 0.4 \;\; 0.8 \;\; 1.5]$.
Matrix Notation
To make predictions on all n datapoints in our sample, we write one equation per datapoint, with the same parameter vector $\theta$ for all predictions:

$$\hat{y}_1 = x_1^T \theta \;\;\text{(Datapoint 1)}, \qquad \hat{y}_2 = x_2^T \theta \;\;\text{(Datapoint 2)}, \qquad \dots, \qquad \hat{y}_n = x_n^T \theta \;\;\text{(Datapoint n)}$$

● Expand out each datapoint's (transposed) input: the inputs form n row vectors $x_i^T$, each with dimension (p + 1).
● Vectorize the predictions and parameters to encapsulate all n equations into a single matrix equation:

$$\hat{\mathbb{Y}} = \mathbb{X}\theta$$

where $\mathbb{X}$ is the design matrix with dimensions n × (p + 1), whose i-th row is $x_i^T$.
The Multiple Linear Regression Model using Matrix Notation
We can express our linear model on our entire dataset as follows:

$$\hat{\mathbb{Y}} = \mathbb{X}\theta$$

where $\hat{\mathbb{Y}}$ is the n × 1 prediction vector, $\mathbb{X}$ is the n × (p + 1) design matrix, and $\theta$ is the (p + 1) × 1 parameter vector. Note that our true output is also a vector: $\mathbb{Y} = [y_1, y_2, \dots, y_n]^T$.
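As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch that builds a small design matrix with an intercept column and computes the prediction vector $\hat{\mathbb{Y}} = \mathbb{X}\theta$. The feature values and parameter vector are hypothetical.

```python
import numpy as np

# Hypothetical feature matrix: n = 4 players, p = 3 features each.
features = np.array([
    [0.4, 0.8, 1.5],
    [1.2, 0.5, 2.0],
    [0.9, 1.1, 0.7],
    [2.0, 0.3, 1.8],
])
n, p = features.shape

# Design matrix X: prepend a column of 1s for the intercept term, shape (n, p + 1).
X = np.hstack([np.ones((n, 1)), features])

# Hypothetical parameter vector theta, shape (p + 1,).
theta = np.array([0.5, 1.0, -0.2, 0.3])

# Prediction vector Y_hat = X @ theta: one prediction per row (player).
Y_hat = X @ theta
print(Y_hat)   # shape (n,)
```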
Mean Squared Error with L2 Norms
We can rewrite mean squared error as a squared L2 norm:

$$R(\theta) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n}\lVert \mathbb{Y} - \hat{\mathbb{Y}} \rVert_2^2$$

With our linear model $\hat{\mathbb{Y}} = \mathbb{X}\theta$:

$$R(\theta) = \frac{1}{n}\lVert \mathbb{Y} - \mathbb{X}\theta \rVert_2^2$$
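To see the equivalence numerically, here is a small sketch with made-up values showing that the average of squared residuals equals $\frac{1}{n}$ times the squared L2 norm of the residual vector:

```python
import numpy as np

Y = np.array([1.0, 2.0, 3.0, 4.0])        # hypothetical true values
Y_hat = np.array([1.1, 1.9, 2.5, 4.2])    # hypothetical predictions

# MSE as the average of squared residuals.
mse_sum = np.mean((Y - Y_hat) ** 2)

# MSE as (1/n) * squared L2 norm of the residual vector.
mse_norm = np.linalg.norm(Y - Y_hat) ** 2 / len(Y)

print(mse_sum, mse_norm)   # identical up to floating-point error
```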
[Linear Algebra] Span
The set of all possible linear combinations of the columns of $\mathbb{X}$ is called the span of the columns of $\mathbb{X}$ (denoted $\text{span}(\mathbb{X})$), also called the column space.
● Intuitively, this is all of the vectors you can "reach" using the columns of $\mathbb{X}$.
● If each column of $\mathbb{X}$ has length n, $\text{span}(\mathbb{X})$ is a subspace of $\mathbb{R}^n$.
A Linear Combination of Columns
Our prediction $\hat{\mathbb{Y}} = \mathbb{X}\theta$ is a linear combination of the columns of $\mathbb{X}$. Therefore $\hat{\mathbb{Y}} \in \text{span}(\mathbb{X})$.
Interpret: Our linear prediction will be in $\text{span}(\mathbb{X})$, even if the true values $\mathbb{Y}$ might not be.
Goal: Find the vector in $\text{span}(\mathbb{X})$ that is closest to $\mathbb{Y}$.
This is the residual vector, $e = \mathbb{Y} - \hat{\mathbb{Y}} = \mathbb{Y} - \mathbb{X}\theta$.
Goal: Minimize the L2 norm of the residual vector, i.e., get the predictions $\hat{\mathbb{Y}}$ to be "as close" to our true values $\mathbb{Y}$ as possible.
How do we minimize this distance, the (squared) norm of the residual vector?
The vector in $\text{span}(\mathbb{X})$ that is closest to $\mathbb{Y}$ is the orthogonal projection of $\mathbb{Y}$ onto $\text{span}(\mathbb{X})$. (We will not prove this property of orthogonal projection; see Khan Academy.)
Thus, we should choose the $\theta$ that makes the residual vector orthogonal to $\text{span}(\mathbb{X})$.
[Linear Algebra] Orthogonality
1. Vector a and vector b are orthogonal if and only if their dot product is 0: $a^T b = 0$. This is a generalization of the notion of two vectors in 2D being perpendicular.
2. A vector v is orthogonal to $\text{span}(M)$, the span of the columns of a matrix M, if and only if v is orthogonal to each column in M.
Let's express 2 in matrix notation. Let $M = [m_1 \; m_2 \; \cdots \; m_d]$, where each column $m_j \in \mathbb{R}^n$: v is orthogonal to each column of M if and only if $M^T v = \mathbf{0}$, the zero vector (a d-length vector full of 0s).
Ordinary Least Squares Proof
The least squares estimate $\hat{\theta}$ is the parameter that minimizes the objective function $R(\theta) = \frac{1}{n}\lVert \mathbb{Y} - \mathbb{X}\theta \rVert_2^2$, where $\mathbb{X}$ is the design matrix and $\mathbb{Y} - \mathbb{X}\theta$ is the residual vector.
Equivalently, this is the $\hat{\theta}$ such that the residual vector is orthogonal to $\text{span}(\mathbb{X})$:

$$\mathbb{X}^T(\mathbb{Y} - \mathbb{X}\hat{\theta}) = \mathbf{0}$$   (definition of orthogonality of $\mathbb{Y} - \mathbb{X}\hat{\theta}$ to $\text{span}(\mathbb{X})$; $\mathbf{0}$ is the zero vector)

$$\mathbb{X}^T\mathbb{X}\,\hat{\theta} = \mathbb{X}^T\mathbb{Y}$$   (rearrange terms; this is the normal equation)

$$\hat{\theta} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{Y}$$   (if $\mathbb{X}^T\mathbb{X}$ is invertible)
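The derivation above can be checked numerically. Below is a minimal NumPy sketch (not from the slides) using synthetic data: it solves the normal equation for $\hat{\theta}$ and verifies that the residual vector is orthogonal to every column of the design matrix. All variable names and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])   # design matrix with intercept
Y = rng.normal(size=n)                                       # hypothetical responses

# Solve the normal equation X^T X theta = X^T Y (avoids forming an explicit inverse).
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The residuals should be orthogonal to every column of X: X^T e = 0.
e = Y - X @ theta_hat
print(np.allclose(X.T @ e, 0))   # True (up to numerical precision)
```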
[Metrics] Multiple R²
Compare the metrics for simple vs. multiple linear regression:
● Error: RMSE (simple linear regression) vs. RMSE (multiple linear regression).
● Linearity: correlation coefficient r (simple linear regression) vs. Multiple R², also called the coefficient of determination (multiple linear regression).
We define the multiple R² value as the proportion of variance of our fitted values (predictions) $\hat{y}$ to our true values $y$:

$$R^2 = \frac{\text{variance of fitted values}}{\text{variance of } y} = \frac{\sigma_{\hat{y}}^2}{\sigma_y^2}$$

It is also called the coefficient of determination. R² ranges from 0 to 1 and is effectively "the proportion of variance that the model explains."
● For OLS with an intercept term (e.g. $\theta_0$), R² is equal to the square of the correlation between $y$ and $\hat{y}$.
● For SLR, R² = r², where r is the correlation between x and y.
● The proof of these last two properties is beyond this course.
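As an illustration (assumptions: synthetic data and an OLS fit with an intercept, so the two formulas below should agree), here is a sketch computing multiple R² both ways:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)       # hypothetical features
Y = 3 + 2 * x1 - x2 + rng.normal(scale=2, size=n)     # hypothetical responses

X = np.column_stack([np.ones(n), x1, x2])              # design matrix with intercept
theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)      # OLS fit
Y_hat = X @ theta_hat

# Multiple R^2 as the ratio of the variance of fitted values to the variance of Y.
r2_variance = np.var(Y_hat) / np.var(Y)

# Because this OLS model has an intercept, R^2 also equals the squared
# correlation between Y and Y_hat.
r2_corr = np.corrcoef(Y, Y_hat)[0, 1] ** 2

print(r2_variance, r2_corr)   # the two values agree
```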
Comparing the two NBA models:
● The SLR model (AST only) has R² = 0.457: it explains 45.7% of the variance in the true $y$.
● The multiple linear regression model (AST & 3PA) has R² = 0.609: it explains 60.9%.
As we add more features, our fitted values tend to become closer and closer to our actual values. Thus, R² increases.
Adding more features doesn't always mean our model is better, though! We are a few weeks away from understanding why.
Residual Properties
For all linear models: when using the optimal parameter vector $\hat{\theta}$, our residuals $e = \mathbb{Y} - \hat{\mathbb{Y}}$ are orthogonal to $\text{span}(\mathbb{X})$.
● Proof: the first line of our OLS estimate proof (earlier slide): $\mathbb{X}^T(\mathbb{Y} - \mathbb{X}\hat{\theta}) = \mathbf{0}$.
● Since our predicted response $\hat{\mathbb{Y}} = \mathbb{X}\hat{\theta}$ is in $\text{span}(\mathbb{X})$ by definition, it is orthogonal to the residuals: $\hat{\mathbb{Y}}^T e = 0$.
For all linear models with an intercept term $\theta_0$, the sum of residuals is zero: $\sum_{i=1}^n e_i = 0$.
● (Proof hint: the intercept column of 1s is a column of $\mathbb{X}$, and $e$ is orthogonal to every column of $\mathbb{X}$.)
Properties When Our Model Has an Intercept Term
For all linear models with an intercept term, the sum of residuals is zero: $\sum_{i=1}^n e_i = 0$ (previous slide).
● This is the real reason why we don't directly use residuals as loss.
● This is also why positive and negative residuals will cancel out in any residual plot where the (linear) model contains an intercept term, even if the model is terrible.
It follows from the property above that for linear models with intercepts, the average predicted value is equal to the average true value: $\frac{1}{n}\sum_{i=1}^n \hat{y}_i = \frac{1}{n}\sum_{i=1}^n y_i$.
These properties are true when there is an intercept term, and not necessarily when there isn't.
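These two intercept-term properties are easy to check numerically. The sketch below (synthetic data, hypothetical variable names) fits a line with an intercept column and verifies both:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 10, size=n)                       # hypothetical feature
Y = 5 + 0.5 * x + rng.normal(size=n)                 # hypothetical responses

X = np.column_stack([np.ones(n), x])                 # linear model WITH an intercept column
theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
residuals = Y - X @ theta_hat

print(np.isclose(residuals.sum(), 0))                # True: residuals sum to zero
print(np.isclose((X @ theta_hat).mean(), Y.mean()))  # True: average prediction = average true value
```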
Does a Unique Solution Always Exist?
● Constant Model + MSE: estimate $\hat{\theta} = \text{mean}(y) = \bar{y}$. Unique? Yes. Any set of values has a unique mean.
● Constant Model + MAE: estimate $\hat{\theta} = \text{median}(y)$. Unique? Yes, if n is odd. No, if n is even: return the average of the middle 2 values.
● Simple Linear Regression + MSE: estimates $\hat{\theta}_1 = r\,\frac{\sigma_y}{\sigma_x}$, $\hat{\theta}_0 = \bar{y} - \hat{\theta}_1\bar{x}$. Unique? Yes. Any set of non-constant* values has a unique mean, SD, and correlation coefficient.
● Ordinary Least Squares (Linear Model + MSE): estimate $\hat{\theta} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{Y}$. Unique? ???
Understanding The Solution Matrices
In most settings, the number of observations is much larger than the number of features: n >> p.
In practice, instead of directly inverting matrices, we can use more efficient numerical solvers to directly solve the system of linear equations given by the normal equation:

$$\mathbb{X}^T\mathbb{X}\,\hat{\theta} = \mathbb{X}^T\mathbb{Y}$$

Note that at least one solution always exists. Intuitively, we can always draw a line of best fit for a given set of data, but there may be multiple lines that are "equally good." (Formal proof is beyond this course.)
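For instance, here is a sketch (with synthetic data) comparing the explicit-inverse formula to a numerical least-squares solver; in practice the solver route is preferred for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 500, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])   # synthetic design matrix
Y = rng.normal(size=n)                                       # synthetic responses

# Textbook formula: explicitly invert X^T X (fine here, but less stable in general).
theta_inv = np.linalg.inv(X.T @ X) @ X.T @ Y

# Preferred in practice: let a least-squares solver handle the system directly.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(theta_inv, theta_lstsq))   # True
```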
Uniqueness of a Solution: Proof
Claim: The least squares estimate $\hat{\theta}$ is unique if and only if $\mathbb{X}$ is full column rank.
Proof
● The solution to the normal equation $\mathbb{X}^T\mathbb{X}\,\hat{\theta} = \mathbb{X}^T\mathbb{Y}$ is the least squares estimate $\hat{\theta}$.
● It has a unique solution if and only if the square matrix $\mathbb{X}^T\mathbb{X}$ is invertible, which happens if and only if $\mathbb{X}^T\mathbb{X}$ is full rank.
  ○ The rank of a square matrix is the max number of linearly independent columns it contains.
  ○ $\mathbb{X}^T\mathbb{X}$ has shape (p + 1) × (p + 1), and therefore has max rank p + 1.
● $\mathbb{X}^T\mathbb{X}$ and $\mathbb{X}$ have the same rank (proof out of scope).
● Therefore $\mathbb{X}^T\mathbb{X}$ has rank p + 1 if and only if $\mathbb{X}$ has rank p + 1 (full column rank).
Uniqueness of a Solution: Interpretation
Claim: The least squares estimate $\hat{\theta}$ is unique if and only if $\mathbb{X}$ is full column rank.
When would we not have unique estimates?
1. If our design matrix $\mathbb{X}$ is "wide" (n datapoints, p + 1 features, with n < p + 1):
  ○ (Property of rank) If n < p + 1, the rank of $\mathbb{X}$ is at most min(n, p + 1) = n < p + 1.
  ○ In other words, if we have way more features than observations, then $\hat{\theta}$ is not unique.
  ○ Typically we have n >> p, so this is less of an issue.
2. If our design matrix has features that are linear combinations of other features:
  ○ By definition, the rank of $\mathbb{X}$ is the number of linearly independent columns in $\mathbb{X}$.
  ○ Example: If "Width", "Height", and "Perimeter" are all columns, and Perimeter = 2 * Width + 2 * Height, then $\mathbb{X}$ is not full rank.
  ○ This is important with one-hot encoding (to discuss later).
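To see this rank deficiency concretely, here is a small sketch using the Width/Height/Perimeter example with made-up values; `np.linalg.matrix_rank` reports that the design matrix is not full column rank:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10
width = rng.uniform(1, 5, size=n)
height = rng.uniform(1, 5, size=n)
perimeter = 2 * width + 2 * height          # exact linear combination of the other features

X = np.column_stack([np.ones(n), width, height, perimeter])   # n x (p + 1) design matrix

print(X.shape[1])                      # 4 columns (p + 1)
print(np.linalg.matrix_rank(X))        # 3 < 4, so X is not full column rank
print(np.linalg.matrix_rank(X.T @ X))  # same rank as X, so X^T X is not invertible
```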
Does a Unique Solution Always Exist?
Returning to the table: for Ordinary Least Squares (Linear Model + MSE), the estimate $\hat{\theta} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{Y}$ is unique if $\mathbb{X}$ is full column rank (all columns linearly independent; # datapoints >> # features).
Simple Linear Regression
(Note: lecture only got to OLS; the rest was not covered.)
Parametric Model Notation
For data $\{(x_i, y_i)\}_{i=1}^n$, the i-th datapoint is an observation $(x_i, y_i)$:
● $y_i$ is the i-th true output (aka dependent variable).
● $x_i$ is the i-th feature (aka independent variable).
● $\hat{y}_i$ is the i-th predicted output (aka estimation).
● $\theta_0, \theta_1$ are the model parameter(s): any linear model $\hat{y} = \theta_0 + \theta_1 x$ with parameters $\theta_0, \theta_1$.
● $\hat{\theta}_0, \hat{\theta}_1$ are the estimated parameter(s), the "best" fit to the data in some sense: the "best" fitting linear model $\hat{y} = \hat{\theta}_0 + \hat{\theta}_1 x$.
L2 and L1 Loss
Squared Loss (L2 Loss): $L(y, \hat{y}) = (y - \hat{y})^2$
● Widely used; also called "L2 loss."
● Reasonable:
  ○ $\hat{y}$ close to $y$ → good prediction → good fit → no loss.
  ○ $\hat{y}$ far from $y$ → bad prediction → bad fit → lots of loss.
Absolute Loss (L1 Loss): $L(y, \hat{y}) = |y - \hat{y}|$
● Sounds worse than it is; also called "L1 loss."
● Reasonable:
  ○ $\hat{y}$ close to $y$ → good prediction → good fit → no loss.
  ○ $\hat{y}$ far from $y$ → bad prediction → bad fit → some loss.
Empirical Risk is Average Loss over Data
We care about how bad our model's predictions are for our entire data set, not just for one point. A natural measure, then, is the average loss (aka empirical risk) across all points.
Given data $\{(x_i, y_i)\}_{i=1}^n$:

$$R(\theta) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{y}_i)$$

This is a function of the parameter $\theta$ (holding the data fixed), because $\theta$ determines $\hat{y}_i$.
The average loss on the sample tells us how well the model fits the data (not the population). But hopefully these are close.
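As a small illustration (not from the slides), the empirical risk under each loss can be written directly as an average over the data; the arrays below are hypothetical:

```python
import numpy as np

def mse(y, y_hat):
    """Empirical risk under squared (L2) loss: average of (y_i - y_hat_i)^2."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Empirical risk under absolute (L1) loss: average of |y_i - y_hat_i|."""
    return np.mean(np.abs(y - y_hat))

y = np.array([3.0, 5.0, 2.0, 8.0])        # hypothetical true outputs
y_hat = np.array([2.5, 5.5, 2.0, 6.0])    # hypothetical predictions
print(mse(y, y_hat), mae(y, y_hat))
```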
Minimizing MSE for the SLR Model
Recall: we wanted to pick the regression line $\hat{y} = \hat{\theta}_0 + \hat{\theta}_1 x$ to minimize the (sample) Mean Squared Error:

$$R(\theta_0, \theta_1) = \frac{1}{n}\sum_{i=1}^{n} \big(y_i - (\theta_0 + \theta_1 x_i)\big)^2$$

To find the best values, we set the partial derivatives with respect to $\theta_0$ and $\theta_1$ equal to zero to obtain the optimality conditions, which are "equivalent" to the estimating equations:

$$\sum_{i=1}^{n} \big(y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i\big) = 0 \qquad\text{and}\qquad \sum_{i=1}^{n} \big(y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i\big)\, x_i = 0$$

To find the best $\hat{\theta}_0, \hat{\theta}_1$, we need to solve these estimating equations.
From Estimating Equations to Estimators
Goal: Choose $\hat{\theta}_0, \hat{\theta}_1$ to solve the two estimating equations:

(1) $\sum_{i=1}^{n} \big(y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i\big) = 0$  and  (2) $\sum_{i=1}^{n} \big(y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i\big)\, x_i = 0$

From (1), $\hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}$. Now, let's try (2) − (1) · $\bar{x}$:

$$\sum_{i=1}^{n} \big(y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i\big)(x_i - \bar{x}) = 0$$

Plug in the definitions of correlation and SD, and solve for $\hat{\theta}_1$:

$$\hat{\theta}_1 = r\,\frac{\sigma_y}{\sigma_x}, \qquad \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}$$
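A quick numerical check of these estimators, using synthetic data: the closed-form slope $r\,\sigma_y/\sigma_x$ and intercept $\bar{y} - \hat{\theta}_1\bar{x}$ match NumPy's degree-1 least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(100)
x = rng.uniform(0, 10, size=50)             # hypothetical feature
y = 2 + 1.5 * x + rng.normal(size=50)       # hypothetical responses

# Closed-form SLR estimators from the estimating equations.
r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = y.mean() - slope * x.mean()

# Cross-check against NumPy's least-squares polynomial fit (degree 1).
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(np.allclose([slope, intercept], [slope_np, intercept_np]))   # True
```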
Constant Model
Fit the Model: Rewrite MSE for the Constant Model
Recall that Mean Squared Error (MSE) is the average squared loss (L2 loss) over the data $\{(x_i, y_i)\}_{i=1}^n$:

$$R(\theta) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$   (the term $(y_i - \hat{y}_i)^2$ is the L2 loss on a single datapoint)

Given the constant model $\hat{y} = \theta$:

$$R(\theta) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \theta)^2$$

We fit the model by finding the optimal $\hat{\theta}$ that minimizes the MSE.
Fit the Model: Calculus for the General Case
1. Differentiate $R(\theta)$ with respect to $\theta$:

$$\frac{d}{d\theta}R(\theta) = \frac{1}{n}\sum_{i=1}^{n} \frac{d}{d\theta}(y_i - \theta)^2$$   (derivative of a sum is the sum of derivatives)

$$= \frac{1}{n}\sum_{i=1}^{n} -2(y_i - \theta)$$   (chain rule)

$$= \frac{-2}{n}\sum_{i=1}^{n} (y_i - \theta)$$   (simplify constants)

2. Set equal to 0:

$$\frac{-2}{n}\sum_{i=1}^{n} (y_i - \hat{\theta}) = 0$$

3. Solve for $\hat{\theta}$:

$$\sum_{i=1}^{n} y_i - \sum_{i=1}^{n}\hat{\theta} = 0$$   (separate sums)

$$\sum_{i=1}^{n} y_i = n\hat{\theta}$$   (since $c + c + \dots + c = n \cdot c$)

$$\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y} = \text{mean}(y)$$
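A quick numerical sanity check of this result, with made-up observations: evaluating the MSE of the constant model over a grid of candidate θ values, the minimizer lands at the sample mean.

```python
import numpy as np

y = np.array([2.0, 3.0, 7.0, 8.0, 10.0])   # hypothetical observations

thetas = np.linspace(0, 12, 1201)           # grid of candidate constants (step 0.01)
mses = [np.mean((y - t) ** 2) for t in thetas]

best_theta = thetas[np.argmin(mses)]
print(best_theta, y.mean())                 # both 6.0: the mean minimizes MSE
```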
Fit the Model: Rewrite MAE for the Constant Model
Recall that Mean Absolute Error (MAE) is the average absolute loss (L1 loss) over the data $\{(x_i, y_i)\}_{i=1}^n$:

$$R(\theta) = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$$   (the term $|y_i - \hat{y}_i|$ is the L1 loss on a single datapoint)

Given the constant model $\hat{y} = \theta$:

$$R(\theta) = \frac{1}{n}\sum_{i=1}^{n} |y_i - \theta|$$

We fit the model by finding the optimal $\hat{\theta}$ that minimizes the MAE.
Fit the Model: Calculus
1. Differentiate $R(\theta)$ with respect to $\theta$. ⚠ Absolute value!

$$\frac{d}{d\theta}R(\theta) = \frac{1}{n}\sum_{i=1}^{n} \frac{d}{d\theta}|y_i - \theta|, \qquad \frac{d}{d\theta}|y_i - \theta| = \begin{cases} -1 & \text{if observation } y_i > \text{our prediction } \theta \\ +1 & \text{if observation } y_i < \text{our prediction } \theta \end{cases}$$

Note: the derivative of the absolute value when the argument is 0 (i.e. when $y_i = \theta$) is technically undefined. We ignore this case in our derivation since, thankfully, it doesn't change our result (proof left to you).
Summing up over all i:

$$\frac{d}{d\theta}R(\theta) = \frac{1}{n}\Big[\, \#\{y_i < \theta\} - \#\{y_i > \theta\} \,\Big]$$

2. Set equal to 0.
3. Solve for $\hat{\theta}$. Where do we go from here?
Median Minimizes MAE for the Constant Model
The constant model parameter $\hat{\theta}$ that minimizes MAE must satisfy:

$$\#\{\text{observations greater than } \hat{\theta}\} = \#\{\text{observations less than } \hat{\theta}\}$$

In other words, $\hat{\theta}$ needs to be such that there are an equal number of points to the left and to the right. This is the definition of the median!

$$\hat{\theta} = \text{median}(y)$$
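And the analogous numerical check for MAE, with made-up observations (n odd): a grid search over candidate θ values lands on the sample median.

```python
import numpy as np

y = np.array([2.0, 3.0, 7.0, 8.0, 10.0])   # hypothetical observations (n odd)

thetas = np.linspace(0, 12, 1201)           # grid of candidate constants (step 0.01)
maes = [np.mean(np.abs(y - t)) for t in thetas]

best_theta = thetas[np.argmin(maes)]
print(best_theta, np.median(y))             # both 7.0: the median minimizes MAE
```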