Regression Review
SLR, Constant Model, and OLS
Data 100/Data 200, Spring 2023 @ UC Berkeley
Narges Norouzi and Lisa Yan
Content credit: Acknowledgments
Today's Roadmap
● Ordinary Least Squares (OLS)
● Simple Linear Regression (SLR)
● Constant Model
Ordinary Least Squares
Multiple Linear Regression
Define the multiple linear regression model (p features) for a single observation:

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_p x_p$$

Here $\hat{y}$ is the predicted value of $y$ for a single observation (a single observation prediction). This is a linear model because it is a linear combination of the parameters $\theta_0, \theta_1, \dots, \theta_p$.
Vector Notation
NBA Data: rows correspond to individual players.
To combine the two terms (the intercept $\theta_0$ and the weighted features) into one matrix operation, we can assume that there is an additional intercept feature $x_0 = 1$ in each observation's feature vector, and hence:

$$\hat{y} = x^T \theta = \theta^T x$$

For example, one player's feature vector (with the intercept feature prepended) is $x = [1, \; 0.4, \; 0.8, \; 1.5]^T$. Note that its transpose is the row vector $x^T = [1 \;\; 0.4 \;\; 0.8 \;\; 1.5]$.
Matrix Notation
To make predictions on all n datapoints in our sample, we write one equation per datapoint, with the same parameter vector $\theta$ for all predictions:

$$\hat{y}_1 = x_1^T \theta \;\;\text{(Datapoint 1)}, \qquad \hat{y}_2 = x_2^T \theta \;\;\text{(Datapoint 2)}, \qquad \dots, \qquad \hat{y}_n = x_n^T \theta \;\;\text{(Datapoint n)}$$

● Expand out each datapoint's (transposed) input: the inputs form n row vectors $x_i^T$, each with dimension (p + 1).
● Vectorize the predictions and parameters to encapsulate all n equations into a single matrix equation:

$$\hat{\mathbb{Y}} = \mathbb{X}\theta$$

where $\mathbb{X}$ is the design matrix with dimensions n × (p + 1), whose i-th row is $x_i^T$.
The Multiple Linear Regression Model using Matrix Notation
We can express our linear model on our entire dataset as follows:

$$\hat{\mathbb{Y}} = \mathbb{X}\theta$$

where $\hat{\mathbb{Y}}$ is the n × 1 prediction vector, $\mathbb{X}$ is the n × (p + 1) design matrix, and $\theta$ is the (p + 1) × 1 parameter vector. Note that our true output is also a vector: $\mathbb{Y} = [y_1, y_2, \dots, y_n]^T$.
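As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch that builds a small design matrix with an intercept column and computes the prediction vector $\hat{\mathbb{Y}} = \mathbb{X}\theta$. The feature values and parameter vector are hypothetical.

```python
import numpy as np

# Hypothetical feature matrix: n = 4 players, p = 3 features each.
features = np.array([
    [0.4, 0.8, 1.5],
    [1.2, 0.5, 2.0],
    [0.9, 1.1, 0.7],
    [2.0, 0.3, 1.8],
])
n, p = features.shape

# Design matrix X: prepend a column of 1s for the intercept term, shape (n, p + 1).
X = np.hstack([np.ones((n, 1)), features])

# Hypothetical parameter vector theta, shape (p + 1,).
theta = np.array([0.5, 1.0, -0.2, 0.3])

# Prediction vector Y_hat = X @ theta: one prediction per row (player).
Y_hat = X @ theta
print(Y_hat)   # shape (n,)
```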
Mean Squared Error with L2 Norms
We can rewrite mean squared error as a squared L2 norm:

$$R(\theta) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n}\lVert \mathbb{Y} - \hat{\mathbb{Y}} \rVert_2^2$$

With our linear model $\hat{\mathbb{Y}} = \mathbb{X}\theta$:

$$R(\theta) = \frac{1}{n}\lVert \mathbb{Y} - \mathbb{X}\theta \rVert_2^2$$
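To see the equivalence numerically, here is a small sketch with made-up values showing that the average of squared residuals equals $\frac{1}{n}$ times the squared L2 norm of the residual vector:

```python
import numpy as np

Y = np.array([1.0, 2.0, 3.0, 4.0])        # hypothetical true values
Y_hat = np.array([1.1, 1.9, 2.5, 4.2])    # hypothetical predictions

# MSE as the average of squared residuals.
mse_sum = np.mean((Y - Y_hat) ** 2)

# MSE as (1/n) * squared L2 norm of the residual vector.
mse_norm = np.linalg.norm(Y - Y_hat) ** 2 / len(Y)

print(mse_sum, mse_norm)   # identical up to floating-point error
```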
[Linear Algebra] Span
The set of all possible linear combinations of the columns of $\mathbb{X}$ is called the span of the columns of $\mathbb{X}$ (denoted $\text{span}(\mathbb{X})$), also called the column space.
● Intuitively, this is all of the vectors you can "reach" using the columns of $\mathbb{X}$.
● If each column of $\mathbb{X}$ has length n, $\text{span}(\mathbb{X})$ is a subspace of $\mathbb{R}^n$.
A Linear Combination of Columns
Our prediction $\hat{\mathbb{Y}} = \mathbb{X}\theta$ is a linear combination of the columns of $\mathbb{X}$. Therefore $\hat{\mathbb{Y}} \in \text{span}(\mathbb{X})$.
Interpret: Our linear prediction will be in $\text{span}(\mathbb{X})$, even if the true values $\mathbb{Y}$ might not be.
Goal: Find the vector in $\text{span}(\mathbb{X})$ that is closest to $\mathbb{Y}$.
This is the residual vector, $e = \mathbb{Y} - \hat{\mathbb{Y}} = \mathbb{Y} - \mathbb{X}\theta$.
Goal: Minimize the L2 norm of the residual vector, i.e., get the predictions $\hat{\mathbb{Y}}$ to be "as close" to our true values $\mathbb{Y}$ as possible.
How do we minimize this distance, the (squared) norm of the residual vector?
The vector in $\text{span}(\mathbb{X})$ that is closest to $\mathbb{Y}$ is the orthogonal projection of $\mathbb{Y}$ onto $\text{span}(\mathbb{X})$. (We will not prove this property of orthogonal projection; see Khan Academy.)
Thus, we should choose the $\theta$ that makes the residual vector orthogonal to $\text{span}(\mathbb{X})$.
[Linear Algebra] Orthogonality
1. Vector a and vector b are orthogonal if and only if their dot product is 0: $a^T b = 0$. This is a generalization of the notion of two vectors in 2D being perpendicular.
2. A vector v is orthogonal to $\text{span}(M)$, the span of the columns of a matrix M, if and only if v is orthogonal to each column in M.
Let's express 2 in matrix notation. Let $M = [m_1 \; m_2 \; \cdots \; m_d]$, where each column $m_j \in \mathbb{R}^n$: v is orthogonal to each column of M if and only if $M^T v = \mathbf{0}$, the zero vector (a d-length vector full of 0s).
Ordinary Least Squares Proof
The least squares estimate $\hat{\theta}$ is the parameter that minimizes the objective function $R(\theta) = \frac{1}{n}\lVert \mathbb{Y} - \mathbb{X}\theta \rVert_2^2$, where $\mathbb{X}$ is the design matrix and $\mathbb{Y} - \mathbb{X}\theta$ is the residual vector.
Equivalently, this is the $\hat{\theta}$ such that the residual vector is orthogonal to $\text{span}(\mathbb{X})$:

$$\mathbb{X}^T(\mathbb{Y} - \mathbb{X}\hat{\theta}) = \mathbf{0}$$   (definition of orthogonality of $\mathbb{Y} - \mathbb{X}\hat{\theta}$ to $\text{span}(\mathbb{X})$; $\mathbf{0}$ is the zero vector)

$$\mathbb{X}^T\mathbb{X}\,\hat{\theta} = \mathbb{X}^T\mathbb{Y}$$   (rearrange terms; this is the normal equation)

$$\hat{\theta} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{Y}$$   (if $\mathbb{X}^T\mathbb{X}$ is invertible)
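The derivation above can be checked numerically. Below is a minimal NumPy sketch (not from the slides) using synthetic data: it solves the normal equation for $\hat{\theta}$ and verifies that the residual vector is orthogonal to every column of the design matrix. All variable names and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])   # design matrix with intercept
Y = rng.normal(size=n)                                       # hypothetical responses

# Solve the normal equation X^T X theta = X^T Y (avoids forming an explicit inverse).
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The residuals should be orthogonal to every column of X: X^T e = 0.
e = Y - X @ theta_hat
print(np.allclose(X.T @ e, 0))   # True (up to numerical precision)
```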
[Metrics] Multiple R²
Compare the metrics for simple vs. multiple linear regression:
● Error: RMSE (simple linear regression) vs. RMSE (multiple linear regression).
● Linearity: correlation coefficient r (simple linear regression) vs. Multiple R², also called the coefficient of determination (multiple linear regression).
We define the multiple R² value as the proportion of variance of our fitted values (predictions) $\hat{y}$ to our true values $y$:

$$R^2 = \frac{\text{variance of fitted values}}{\text{variance of } y} = \frac{\sigma_{\hat{y}}^2}{\sigma_y^2}$$

It is also called the coefficient of determination. R² ranges from 0 to 1 and is effectively "the proportion of variance that the model explains."
● For OLS with an intercept term (e.g. $\theta_0$), R² is equal to the square of the correlation between $y$ and $\hat{y}$.
● For SLR, R² = r², where r is the correlation between x and y.
● The proof of these last two properties is beyond this course.
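As an illustration (assumptions: synthetic data and an OLS fit with an intercept, so the two formulas below should agree), here is a sketch computing multiple R² both ways:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)       # hypothetical features
Y = 3 + 2 * x1 - x2 + rng.normal(scale=2, size=n)     # hypothetical responses

X = np.column_stack([np.ones(n), x1, x2])              # design matrix with intercept
theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)      # OLS fit
Y_hat = X @ theta_hat

# Multiple R^2 as the ratio of the variance of fitted values to the variance of Y.
r2_variance = np.var(Y_hat) / np.var(Y)

# Because this OLS model has an intercept, R^2 also equals the squared
# correlation between Y and Y_hat.
r2_corr = np.corrcoef(Y, Y_hat)[0, 1] ** 2

print(r2_variance, r2_corr)   # the two values agree
```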
Comparing the two NBA models:
● The SLR model (AST only) has R² = 0.457: it explains 45.7% of the variance in the true $y$.
● The multiple linear regression model (AST & 3PA) has R² = 0.609: it explains 60.9%.
As we add more features, our fitted values tend to become closer and closer to our actual values. Thus, R² increases.
Adding more features doesn't always mean our model is better, though! We are a few weeks away from understanding why.
Residual Properties
For all linear models: when using the optimal parameter vector $\hat{\theta}$, our residuals $e = \mathbb{Y} - \hat{\mathbb{Y}}$ are orthogonal to $\text{span}(\mathbb{X})$.
● Proof: the first line of our OLS estimate proof (earlier slide): $\mathbb{X}^T(\mathbb{Y} - \mathbb{X}\hat{\theta}) = \mathbf{0}$.
● Since our predicted response $\hat{\mathbb{Y}} = \mathbb{X}\hat{\theta}$ is in $\text{span}(\mathbb{X})$ by definition, it is orthogonal to the residuals: $\hat{\mathbb{Y}}^T e = 0$.
For all linear models with an intercept term $\theta_0$, the sum of residuals is zero: $\sum_{i=1}^n e_i = 0$.
● (Proof hint: the intercept column of 1s is a column of $\mathbb{X}$, and $e$ is orthogonal to every column of $\mathbb{X}$.)
Properties When Our Model Has an Intercept Term
For all linear models with an intercept term, the sum of residuals is zero: $\sum_{i=1}^n e_i = 0$ (previous slide).
● This is the real reason why we don't directly use residuals as loss.
● This is also why positive and negative residuals will cancel out in any residual plot where the (linear) model contains an intercept term, even if the model is terrible.
It follows from the property above that for linear models with intercepts, the average predicted value is equal to the average true value: $\frac{1}{n}\sum_{i=1}^n \hat{y}_i = \frac{1}{n}\sum_{i=1}^n y_i$.
These properties are true when there is an intercept term, and not necessarily when there isn't.
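These two intercept-term properties are easy to check numerically. The sketch below (synthetic data, hypothetical variable names) fits a line with an intercept column and verifies both:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 10, size=n)                       # hypothetical feature
Y = 5 + 0.5 * x + rng.normal(size=n)                 # hypothetical responses

X = np.column_stack([np.ones(n), x])                 # linear model WITH an intercept column
theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
residuals = Y - X @ theta_hat

print(np.isclose(residuals.sum(), 0))                # True: residuals sum to zero
print(np.isclose((X @ theta_hat).mean(), Y.mean()))  # True: average prediction = average true value
```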
Does a Unique Solution Always Exist?
● Constant Model + MSE: estimate $\hat{\theta} = \text{mean}(y) = \bar{y}$. Unique? Yes. Any set of values has a unique mean.
● Constant Model + MAE: estimate $\hat{\theta} = \text{median}(y)$. Unique? Yes, if n is odd. No, if n is even: return the average of the middle 2 values.
● Simple Linear Regression + MSE: estimates $\hat{\theta}_1 = r\,\frac{\sigma_y}{\sigma_x}$, $\hat{\theta}_0 = \bar{y} - \hat{\theta}_1\bar{x}$. Unique? Yes. Any set of non-constant* values has a unique mean, SD, and correlation coefficient.
● Ordinary Least Squares (Linear Model + MSE): estimate $\hat{\theta} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{Y}$. Unique? ???
Understanding The Solution Matrices
In most settings, the number of observations is much larger than the number of features: n >> p.
In practice, instead of directly inverting matrices, we can use more efficient numerical solvers to directly solve the system of linear equations given by the normal equation:

$$\mathbb{X}^T\mathbb{X}\,\hat{\theta} = \mathbb{X}^T\mathbb{Y}$$

Note that at least one solution always exists. Intuitively, we can always draw a line of best fit for a given set of data, but there may be multiple lines that are "equally good." (Formal proof is beyond this course.)
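For instance, here is a sketch (with synthetic data) comparing the explicit-inverse formula to a numerical least-squares solver; in practice the solver route is preferred for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 500, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])   # synthetic design matrix
Y = rng.normal(size=n)                                       # synthetic responses

# Textbook formula: explicitly invert X^T X (fine here, but less stable in general).
theta_inv = np.linalg.inv(X.T @ X) @ X.T @ Y

# Preferred in practice: let a least-squares solver handle the system directly.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(theta_inv, theta_lstsq))   # True
```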
Uniqueness of a Solution: Proof
Claim: The least squares estimate $\hat{\theta}$ is unique if and only if $\mathbb{X}$ is full column rank.
Proof
● The solution to the normal equation $\mathbb{X}^T\mathbb{X}\,\hat{\theta} = \mathbb{X}^T\mathbb{Y}$ is the least squares estimate $\hat{\theta}$.
● It has a unique solution if and only if the square matrix $\mathbb{X}^T\mathbb{X}$ is invertible, which happens if and only if $\mathbb{X}^T\mathbb{X}$ is full rank.
  ○ The rank of a square matrix is the max number of linearly independent columns it contains.
  ○ $\mathbb{X}^T\mathbb{X}$ has shape (p + 1) × (p + 1), and therefore has max rank p + 1.
● $\mathbb{X}^T\mathbb{X}$ and $\mathbb{X}$ have the same rank (proof out of scope).
● Therefore $\mathbb{X}^T\mathbb{X}$ has rank p + 1 if and only if $\mathbb{X}$ has rank p + 1 (full column rank).
Uniqueness of a Solution: Interpretation
Claim: The least squares estimate $\hat{\theta}$ is unique if and only if $\mathbb{X}$ is full column rank.
When would we not have unique estimates?
1. If our design matrix $\mathbb{X}$ is "wide" (n datapoints, p + 1 features, with n < p + 1):
  ○ (Property of rank) If n < p + 1, the rank of $\mathbb{X}$ is at most min(n, p + 1) = n < p + 1.
  ○ In other words, if we have way more features than observations, then $\hat{\theta}$ is not unique.
  ○ Typically we have n >> p, so this is less of an issue.
2. If our design matrix has features that are linear combinations of other features:
  ○ By definition, the rank of $\mathbb{X}$ is the number of linearly independent columns in $\mathbb{X}$.
  ○ Example: If "Width", "Height", and "Perimeter" are all columns, and Perimeter = 2 * Width + 2 * Height, then $\mathbb{X}$ is not full rank.
  ○ This is important with one-hot encoding (to discuss later).
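To see this rank deficiency concretely, here is a small sketch using the Width/Height/Perimeter example with made-up values; `np.linalg.matrix_rank` reports that the design matrix is not full column rank:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10
width = rng.uniform(1, 5, size=n)
height = rng.uniform(1, 5, size=n)
perimeter = 2 * width + 2 * height          # exact linear combination of the other features

X = np.column_stack([np.ones(n), width, height, perimeter])   # n x (p + 1) design matrix

print(X.shape[1])                      # 4 columns (p + 1)
print(np.linalg.matrix_rank(X))        # 3 < 4, so X is not full column rank
print(np.linalg.matrix_rank(X.T @ X))  # same rank as X, so X^T X is not invertible
```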
Does a Unique Solution Always Exist?
Returning to the table: for Ordinary Least Squares (Linear Model + MSE), the estimate $\hat{\theta} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{Y}$ is unique if $\mathbb{X}$ is full column rank (all columns linearly independent; # datapoints >> # features).
Simple Linear Regression
(Note: lecture only got to OLS; the rest was not covered.)
Parametric Model Notation
For data $\{(x_i, y_i)\}_{i=1}^n$, the i-th datapoint is an observation $(x_i, y_i)$:
● $y_i$ is the i-th true output (aka dependent variable).
● $x_i$ is the i-th feature (aka independent variable).
● $\hat{y}_i$ is the i-th predicted output (aka estimation).
● $\theta_0, \theta_1$ are the model parameter(s): any linear model $\hat{y} = \theta_0 + \theta_1 x$ with parameters $\theta_0, \theta_1$.
● $\hat{\theta}_0, \hat{\theta}_1$ are the estimated parameter(s), the "best" fit to the data in some sense: the "best" fitting linear model $\hat{y} = \hat{\theta}_0 + \hat{\theta}_1 x$.
L2 and L1 Loss
Squared Loss (L2 Loss): $L(y, \hat{y}) = (y - \hat{y})^2$
● Widely used; also called "L2 loss."
● Reasonable:
  ○ $\hat{y}$ close to $y$ → good prediction → good fit → no loss.
  ○ $\hat{y}$ far from $y$ → bad prediction → bad fit → lots of loss.
Absolute Loss (L1 Loss): $L(y, \hat{y}) = |y - \hat{y}|$
● Sounds worse than it is; also called "L1 loss."
● Reasonable:
  ○ $\hat{y}$ close to $y$ → good prediction → good fit → no loss.
  ○ $\hat{y}$ far from $y$ → bad prediction → bad fit → some loss.
Empirical Risk is Average Loss over Data
We care about how bad our model's predictions are for our entire data set, not just for one point. A natural measure, then, is the average loss (aka empirical risk) across all points.
Given data $\{(x_i, y_i)\}_{i=1}^n$:

$$R(\theta) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{y}_i)$$

This is a function of the parameter $\theta$ (holding the data fixed), because $\theta$ determines $\hat{y}_i$.
The average loss on the sample tells us how well the model fits the data (not the population). But hopefully these are close.
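As a small illustration (not from the slides), the empirical risk under each loss can be written directly as an average over the data; the arrays below are hypothetical:

```python
import numpy as np

def mse(y, y_hat):
    """Empirical risk under squared (L2) loss: average of (y_i - y_hat_i)^2."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Empirical risk under absolute (L1) loss: average of |y_i - y_hat_i|."""
    return np.mean(np.abs(y - y_hat))

y = np.array([3.0, 5.0, 2.0, 8.0])        # hypothetical true outputs
y_hat = np.array([2.5, 5.5, 2.0, 6.0])    # hypothetical predictions
print(mse(y, y_hat), mae(y, y_hat))
```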
Minimizing MSE for the SLR Model
Recall: we wanted to pick the regression line $\hat{y} = \hat{\theta}_0 + \hat{\theta}_1 x$ to minimize the (sample) Mean Squared Error:

$$R(\theta_0, \theta_1) = \frac{1}{n}\sum_{i=1}^{n} \big(y_i - (\theta_0 + \theta_1 x_i)\big)^2$$

To find the best values, we set the partial derivatives with respect to $\theta_0$ and $\theta_1$ equal to zero to obtain the optimality conditions, which are "equivalent" to the estimating equations:

$$\sum_{i=1}^{n} \big(y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i\big) = 0 \qquad\text{and}\qquad \sum_{i=1}^{n} \big(y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i\big)\, x_i = 0$$

To find the best $\hat{\theta}_0, \hat{\theta}_1$, we need to solve these estimating equations.
From Estimating Equations to Estimators
Goal: Choose $\hat{\theta}_0, \hat{\theta}_1$ to solve the two estimating equations:

(1) $\sum_{i=1}^{n} \big(y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i\big) = 0$  and  (2) $\sum_{i=1}^{n} \big(y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i\big)\, x_i = 0$

From (1), $\hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}$. Now, let's try (2) − (1) · $\bar{x}$:

$$\sum_{i=1}^{n} \big(y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i\big)(x_i - \bar{x}) = 0$$

Plug in the definitions of correlation and SD, and solve for $\hat{\theta}_1$:

$$\hat{\theta}_1 = r\,\frac{\sigma_y}{\sigma_x}, \qquad \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}$$
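A quick numerical check of these estimators, using synthetic data: the closed-form slope $r\,\sigma_y/\sigma_x$ and intercept $\bar{y} - \hat{\theta}_1\bar{x}$ match NumPy's degree-1 least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(100)
x = rng.uniform(0, 10, size=50)             # hypothetical feature
y = 2 + 1.5 * x + rng.normal(size=50)       # hypothetical responses

# Closed-form SLR estimators from the estimating equations.
r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = y.mean() - slope * x.mean()

# Cross-check against NumPy's least-squares polynomial fit (degree 1).
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(np.allclose([slope, intercept], [slope_np, intercept_np]))   # True
```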
Constant Model
Fit the Model: Rewrite MSE for the Constant Model
Recall that Mean Squared Error (MSE) is the average squared loss (L2 loss) over the data $\{(x_i, y_i)\}_{i=1}^n$:

$$R(\theta) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$   (the term $(y_i - \hat{y}_i)^2$ is the L2 loss on a single datapoint)

Given the constant model $\hat{y} = \theta$:

$$R(\theta) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \theta)^2$$

We fit the model by finding the optimal $\hat{\theta}$ that minimizes the MSE.
Fit the Model: Calculus for the General Case
1. Differentiate $R(\theta)$ with respect to $\theta$:

$$\frac{d}{d\theta}R(\theta) = \frac{1}{n}\sum_{i=1}^{n} \frac{d}{d\theta}(y_i - \theta)^2$$   (derivative of a sum is the sum of derivatives)

$$= \frac{1}{n}\sum_{i=1}^{n} -2(y_i - \theta)$$   (chain rule)

$$= \frac{-2}{n}\sum_{i=1}^{n} (y_i - \theta)$$   (simplify constants)

2. Set equal to 0:

$$\frac{-2}{n}\sum_{i=1}^{n} (y_i - \hat{\theta}) = 0$$

3. Solve for $\hat{\theta}$:

$$\sum_{i=1}^{n} y_i - \sum_{i=1}^{n}\hat{\theta} = 0$$   (separate sums)

$$\sum_{i=1}^{n} y_i = n\hat{\theta}$$   (since $c + c + \dots + c = n \cdot c$)

$$\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y} = \text{mean}(y)$$
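A quick numerical sanity check of this result, with made-up observations: evaluating the MSE of the constant model over a grid of candidate θ values, the minimizer lands at the sample mean.

```python
import numpy as np

y = np.array([2.0, 3.0, 7.0, 8.0, 10.0])   # hypothetical observations

thetas = np.linspace(0, 12, 1201)           # grid of candidate constants (step 0.01)
mses = [np.mean((y - t) ** 2) for t in thetas]

best_theta = thetas[np.argmin(mses)]
print(best_theta, y.mean())                 # both 6.0: the mean minimizes MSE
```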
Fit the Model: Rewrite MAE for the Constant Model
Recall that Mean Absolute Error (MAE) is the average absolute loss (L1 loss) over the data $\{(x_i, y_i)\}_{i=1}^n$:

$$R(\theta) = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$$   (the term $|y_i - \hat{y}_i|$ is the L1 loss on a single datapoint)

Given the constant model $\hat{y} = \theta$:

$$R(\theta) = \frac{1}{n}\sum_{i=1}^{n} |y_i - \theta|$$

We fit the model by finding the optimal $\hat{\theta}$ that minimizes the MAE.
Fit the Model: Calculus
1. Differentiate $R(\theta)$ with respect to $\theta$. ⚠ Absolute value!

$$\frac{d}{d\theta}R(\theta) = \frac{1}{n}\sum_{i=1}^{n} \frac{d}{d\theta}|y_i - \theta|, \qquad \frac{d}{d\theta}|y_i - \theta| = \begin{cases} -1 & \text{if observation } y_i > \text{our prediction } \theta \\ +1 & \text{if observation } y_i < \text{our prediction } \theta \end{cases}$$

Note: the derivative of the absolute value when the argument is 0 (i.e. when $y_i = \theta$) is technically undefined. We ignore this case in our derivation since, thankfully, it doesn't change our result (proof left to you).
Summing up over all i:

$$\frac{d}{d\theta}R(\theta) = \frac{1}{n}\Big[\, \#\{y_i < \theta\} - \#\{y_i > \theta\} \,\Big]$$

2. Set equal to 0.
3. Solve for $\hat{\theta}$. Where do we go from here?
Median Minimizes MAE for the Constant Model
The constant model parameter $\hat{\theta}$ that minimizes MAE must satisfy:

$$\#\{\text{observations greater than } \hat{\theta}\} = \#\{\text{observations less than } \hat{\theta}\}$$

In other words, $\hat{\theta}$ needs to be such that there are an equal number of points to the left and to the right. This is the definition of the median!

$$\hat{\theta} = \text{median}(y)$$
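And the analogous numerical check for MAE, with made-up observations (n odd): a grid search over candidate θ values lands on the sample median.

```python
import numpy as np

y = np.array([2.0, 3.0, 7.0, 8.0, 10.0])   # hypothetical observations (n odd)

thetas = np.linspace(0, 12, 1201)           # grid of candidate constants (step 0.01)
maes = [np.mean(np.abs(y - t)) for t in thetas]

best_theta = thetas[np.argmin(maes)]
print(best_theta, np.median(y))             # both 7.0: the median minimizes MAE
```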