1 Introduction to Supervised Learning
We have the task of predicting a response/output y using features/input x. 1 We model the
relationship between y and x via a function g(x) (so we hope that g(x) ≈ y).
In this course we will assume that a linear function g is a good model for the relationship
between the features and the response:
g(x) = x⊤β = [1, x] [β0, β1]⊤ = β0 + β1x,
where β0 is the intercept, β1 is the slope, and β = [β0, β1]⊤ is the coefficient vector.
Here the feature vector x consists of two features. One is the constant feature 1 and the other is x.
Here β is the vector of coefficients of the linear relationship. Obviously, β0 + xβ1 is a straight line
in one dimension.
We can have a more general linear relationship (hyperplane in higher dimensions):
g(x) = x⊤β = [1, x1, . . . , xp] [β0, β1, . . . , βp]⊤ = β0 + ∑ᵢ xiβi.
Sometimes we write instead:
g(x) = x⊤β = [1, x1, . . . , xp] [β1, β2, . . . , βp+1]⊤ = β1 + ∑ᵢ xiβi+1.
You should pay attention to context to figure out which notation is used.
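As a quick numerical sketch of the inner-product form (the values are hypothetical, chosen only to illustrate the computation in the first convention):

```python
import numpy as np

# Hypothetical coefficients [beta_0, beta_1, beta_2] (first convention)
beta = np.array([1.0, 2.0, -0.5])
x_raw = np.array([3.0, 4.0])        # raw features x_1, x_2 (p = 2)

x = np.concatenate(([1.0], x_raw))  # feature vector [1, x_1, x_2] with constant feature
g = x @ beta                        # g(x) = x^T beta = beta_0 + x_1*beta_1 + x_2*beta_2
print(g)                            # 1.0 + 3.0*2.0 + 4.0*(-0.5) = 5.0
```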
We call g(x) a regression function, when y is a continuous variable.2 For example, y is the
weight of a baby and x contains features such as the age of the mother, number of pregnancies,
health status of mother, etc. This is the main setting of this course (90%).
When y is discrete, say y ∈ {0, 1},3 then we say g(x) is a classification function. This is a small
part of the course (time permitting).
Note about notation: In this course scalar variables are denoted as
x, y, z, α, β, γ,
vectors are denoted (in bold) as
x, y, z, α, β, γ,
and matrices are denoted as
X, Y, Z.
1 "Feature" is the modern machine learning term; statisticians call it an explanatory variable.
2 This is the simplest possible neural network.
3 This should not be confused with the interval [0, 1] or (0, 1).
We use the special blue matrix symbol for a matrix of features (more on that later):
X.
Generally, but not always, we use upper case notation for random variables:
X, Y, Z
For random vectors, we use the notation:
X, Y , Z.
When a coefficient, say β, is estimated via statistical procedures, we denote it as
β̂.
This β̂ can denote either a random vector (in which case it is called an estimator) or a real
vector (in which case it is called an estimate). This distinction is made in, for example, MATH2901,
where we talk about the maximum likelihood estimator or the maximum likelihood estimate.
1.1 Loss Function
Typically, the only thing we do not know about the linear regression function g is the coefficient
vector β. We estimate/determine β from empirical data and denote the estimate by β̂.
When we make a prediction of y, we denote it as ŷ. For example, when we use a linear function:
ŷ = x⊤ β̂,
where β̂ is estimated using some data (e.g., MLE).
We measure the quality of the prediction using a loss function:
Loss(y, ŷ).
When y is continuous, in this course we use the squared-error loss:
Loss(y, ŷ) = (y − ŷ)2 .
For binary response, the loss function is simply:
Loss(y, ŷ) = 1{y ≠ ŷ} = 1 if y ≠ ŷ, and 0 if y = ŷ.
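Both losses are one-liners; a minimal sketch (the function names are mine, not from the course):

```python
def squared_error_loss(y, y_hat):
    # squared-error loss for a continuous response
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    # zero-one loss for a binary response: 1{y != y_hat}
    return 1 if y != y_hat else 0

print(squared_error_loss(3.0, 2.5))  # 0.25
print(zero_one_loss(1, 0))           # 1
print(zero_one_loss(1, 1))           # 0
```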
Remark 1.1 (Using a linear function to predict binary output.) One should not use a linear
function to predict a binary output, especially if the inputs are continuous variables. A contrived
example is:
g(x) = β0 + β1 x
with x, β1 , β0 ∈ {0, 1}. The proper way to predict a binary output is to use logistic (not linear)
regression. We may or may not talk about this, depending on time.
We frequently think of the pair (x, y) as the outcomes of random variables (X, Y ) with a certain
joint pdf f (x, y) (in practice unknown). In this case we consider the expected loss:
ℓ(g(X | β)) = E Loss(Y, g(X)),
which is called the risk for g. In our linear regression case with squared-error loss, the risk of
using a linear function g is:
E(Y − g(X))2 = E(Y − X ⊤ β)2 .
It can be shown that the best possible g when using squared-error loss is the conditional expectation
of Y given X:
g ∗ (x) = E[Y | X = x].
This means that for any other prediction function g(x) we have that
E(Y − g ∗ (X))2 ≤ E(Y − g(X))2 .
Note that to determine g ∗ we need to be able to compute a conditional expectation using the
pdf f (y | x), which is not possible in practice.
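A small Monte Carlo sketch of this inequality, under an assumed toy model (my own choice, not from the notes) in which g∗(x) = 2x:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed toy model: Y = 2X + eps with eps ~ N(0, 0.25),
# so the best predictor is g*(x) = E[Y | X = x] = 2x.
X = rng.uniform(0.0, 1.0, size=n)
Y = 2 * X + rng.normal(0.0, 0.5, size=n)

risk_star = np.mean((Y - 2 * X) ** 2)             # estimated risk of g*
risk_other = np.mean((Y - (1.5 * X + 0.3)) ** 2)  # risk of some other linear g

print(risk_star, risk_other)  # risk_star is smaller (close to sigma^2 = 0.25)
```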
Given g ∗ , we can in fact write the random response given X = x as:
Y = g∗(x) + ϵ(x),
where ϵ(x) is an error term satisfying E[ϵ(X) | X = x] = 0.
Problem 1.1 (Question) Why is ϵ(x) forced to be zero mean? From the equation
Y = g ∗ (x) + ϵ(x),
we know that
E[Y | X = x] = g ∗ (x) + E[ϵ(X) | X = x]
and since g ∗ (x) = E[Y | X = x], then
E[ϵ(X) | X = x] = E[ϵ(x)] = 0.
That the variance of ϵ(x) is independent of x is an assumption.
In this course, we assume that the variance of the noise term does not depend on x,
so that Var(ϵ(x)) = σ², and that g∗ is a linear function.
In other words, the response y can be viewed as the outcome of:
Y = x⊤ β + ϵ,
where E[ϵ] = 0 and Var(ϵ) = σ 2 and β = β ∗ is the true, but unknown, coefficient.
For example, we can have n = 100 of these pairs (x, Y), so that
Yi = xi⊤β + ϵi,  i = 1, . . . , n.
Assuming that these n observations are statistically independent is the same as assuming that the
noise components ϵ1 , . . . , ϵn are independent of each other. In fact, we may assume that ϵ1 , ϵ2 , . . .
are iid (independent and identically distributed).
We can write this in matrix form for the simple linear regression (where x⊤ = [1, x]) as:
Y = [Y1, . . . , Yn]⊤ = Xβ + ϵ,
where X ∈ Rn×2 is the matrix of features, with rows [1, xi], Y ∈ Rn contains all the responses,
β = [β1, β2]⊤, and ϵ = [ϵ1, . . . , ϵn]⊤ is the random noise. At this stage, we
assume that n > p. Here the number of features is the number of columns of X and the number of
independent observations is the number of rows of X. Note that here Y and ϵ are random vectors.
We can have more than two features and still have a linear model:
Y = [Y1, . . . , Yn]⊤ = Xβ + ϵ,
where X ∈ Rn×(p+1) has rows [1, xi,1, . . . , xi,p] and β = [β1, . . . , βp+1]⊤.
Problem 1.2 (Polynomial Regression) We can take powers of a single variable as features and
still have a linear model:
Y = [Y1, . . . , Yn]⊤ = Xβ + ϵ,
where X has rows [1, xi, xi²] and β = [β1, β2, β3]⊤, so that the model here is
Y = β1 + xβ2 + x²β3 + ϵ = [1, x, x²] [β1, β2, β3]⊤ + ϵ.
The ultimate test for whether the model is linear or not is whether we can write the regression function
as an inner/dot product of a feature vector x with a coefficient vector β.
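For instance, a degree-2 polynomial design matrix can be built directly with NumPy (the x values and coefficients below are toy choices of mine):

```python
import numpy as np

# Degree-2 polynomial design matrix: each row is [1, x_i, x_i^2]
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.vander(x, N=3, increasing=True)  # columns x^0, x^1, x^2

beta = np.array([1.0, 0.0, 2.0])  # hypothetical coefficients [beta_1, beta_2, beta_3]
g = X @ beta                      # g(x_i) = beta_1 + x_i*beta_2 + x_i^2*beta_3
print(g)                          # [ 1.  3.  9. 19.]
```

The model is nonlinear in x but linear in β, which is all that matters here.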
1.2 Training Set
In practice we do not know g ∗ (and if assumed linear, we do not know the corresponding coefficient
β ∗ : g ∗ (x) = x⊤ β ∗ ), but only have some (random) data
T = {(X1 , Y1 ), . . . , (Xn , Yn )},
where we assume that (Xi , Yi ) are iid for i = 1, . . . , n, and its realization:
τ = {(x1 , y1 ), . . . , (xn , yn )}
We are trying to find a way to use the training data to construct a learner (or prediction function
in older literature):
gτ (x).
In our case the learner is a linear function:
gτ (x) = x⊤ β̂,
where β̂ is an estimate of the "true" β based on τ . The learner can also be a random object (with
its own statistical properties) when we write:
gT (x) = x⊤ β̂,
where β̂ is an estimator (a function of T ).
2 Training loss
Recall that we have the risk of the learner g:
E(Y − g(X))² = E(Y − X⊤β)²  (assuming a linear model).
We do not know the distribution of (Y, X), so we replace the expectation with its empirical estimate:
ℓT(g) := (1/n) ∑ᵢ₌₁ⁿ (Yi − g(Xi))²,
where the average approximates the expectation E.
This is called the training loss. The training loss is our approximation to the risk. Note that the
training loss can be either a function of T or τ . We can think of τ as the realization of the random
T:
ℓτ(g) := (1/n) ∑ᵢ₌₁ⁿ (yi − g(xi))².
When the learner is linear (90% of this course!), then
ℓT(g(· | β)) := (1/n) ∑ᵢ₌₁ⁿ (Yi − xi⊤β)²,
which can be written very neatly using matrix/vector notation. Recall that x⊤x = ∥x∥² = ∑ᵢ xi²;
then
ℓT(g(· | β)) = ∥Y − Xβ∥² / n,
where Y = [Y1, . . . , Yn]⊤, X is the matrix with rows xi⊤, and Xβ = [x1⊤β, . . . , xn⊤β]⊤.
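A quick numerical check, with made-up data, that the matrix form of the training loss agrees with the sum over observations:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
X = np.column_stack((np.ones(n), rng.uniform(size=n)))  # rows x_i^T = [1, x_i]
Y = rng.normal(size=n)
beta = np.array([0.5, -1.0])  # arbitrary coefficient vector

loss_sum = sum((Y[i] - X[i] @ beta) ** 2 for i in range(n)) / n
loss_mat = np.linalg.norm(Y - X @ beta) ** 2 / n
print(loss_sum, loss_mat)  # identical up to floating-point rounding
```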
Our learner is now
gτ (x) = x⊤ β̂,
where β̂ is the minimizer of the training loss:
β̂ ∈ argminβ ∥Y − Xβ∥² / n.
Here β̂ is called the least-squares estimator. How do we compute this estimate?
We need to consider how to find the minimizer of the following function:
b⊤Ab − 2b⊤z = ∑ᵢ,ⱼ bi bj ai,j − 2 ∑ⱼ bj zj,
where A is a suitable symmetric invertible matrix.
To find out, let us start with the simplest 1-dimensional case.
b × a × b − 2b × z = ab² − 2bz.
Setting the derivative with respect to b, namely 2ab − 2z, to zero and solving the resulting equation gives:
b = a⁻¹ × z.
So we can then guess that the critical/stationary point of b⊤ Ab − 2b⊤ z is:
b = A−1 z.
This guess is correct.
Quick formal proof: Let us differentiate b⊤Ab − 2b⊤z with respect to bk:
∂/∂bk [b⊤Ab − 2b⊤z] = ∂/∂bk [b⊤Ab] − 2 ∂/∂bk [b⊤z] = −2zk + ∂/∂bk [b⊤Ab].
Write b = b/k + ek bk, where b/k denotes b with its k-th entry set to zero and ek is the k-th unit
vector. Recall that for symmetric A we have (b + c)⊤A(b + c) = b⊤Ab + c⊤Ac + 2c⊤Ab. Hence:
b⊤Ab = (b/k + ek bk)⊤ A (b/k + ek bk) = b/k⊤ A b/k + bk² ek⊤Aek + 2bk ek⊤A b/k.
Therefore:
∂/∂bk [b⊤Ab] = 0 + 2bk ek⊤Aek + 2ek⊤A b/k = 2ek⊤A(ek bk + b/k) = 2ek⊤Ab.
So setting
∂/∂bk [b⊤Ab − 2b⊤z] = 2ek⊤Ab − 2ek⊤z = 0
for each k and organizing the result in matrix form gives (the gradient of the multivariate function):
∂/∂b [b⊤Ab − 2b⊤z] = 2Ab − 2z,
which is a shorthand notation for
∂/∂bk [b⊤Ab − 2b⊤z] = 2ek⊤Ab − 2ek⊤z,  k = 1, . . . , n.
Solving the gradient=0 equations gives
b = A−1 z.
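A numerical sanity check of this formula on a small made-up symmetric matrix (my own toy values):

```python
import numpy as np

# Toy symmetric invertible A and vector z (assumed values)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
z = np.array([1.0, -1.0])

f = lambda b: b @ A @ b - 2 * b @ z
b_star = np.linalg.solve(A, z)  # the claimed minimizer b = A^{-1} z

grad = 2 * A @ b_star - 2 * z   # gradient 2Ab - 2z, should vanish at b_star
print(grad)                     # ~ [0, 0]
print(f(b_star) < f(b_star + np.array([0.1, -0.2])))  # True: nearby points are worse
```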
In the least-squares minimization case, we have:4
∥Y − Xβ∥² = (Y − Xβ)⊤(Y − Xβ)
= Y⊤Y − 2Y⊤Xβ + (Xβ)⊤(Xβ)   (using (a − b)² = a² − 2ab + b²)
= ∥Y∥² + β⊤(X⊤X)β − 2β⊤[X⊤Y]   (using a⊤b = b⊤a),
so here A = X⊤X and z = X⊤Y. Therefore, using our formula:
b = A⁻¹z = (X⊤X)⁻¹X⊤Y.
In summary, the least-squares estimate is given by:
β̂ = (X⊤X)⁻¹X⊤y.
This formula works when n > p (more examples than features) and X is full rank, meaning that
the columns of X are linearly independent. It minimizes the training loss.
If we denote
X+ := (X⊤ X)−1 X⊤ ,
then β̂ = X+ y. The X+ is called a pseudoinverse. It generalizes the usual inverse of a matrix
for non-square matrices. When A is a square invertible matrix, then
A+ = A−1 .
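As a check (with a made-up design matrix), the formula agrees with NumPy's SVD-based np.linalg.pinv when X is tall and full rank:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
X = np.column_stack((np.ones(n), rng.uniform(size=n)))  # tall, full-rank: n > p

pinv_formula = np.linalg.inv(X.T @ X) @ X.T  # (X^T X)^{-1} X^T
pinv_numpy = np.linalg.pinv(X)               # NumPy's SVD-based pseudoinverse

print(np.allclose(pinv_formula, pinv_numpy))  # True for full-rank tall X
```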
Problem 2.1 (Is X+ X = I?) Suppose that X+ := (X⊤ X)−1 X⊤ . Then,
X+ X = [(X⊤ X)−1 X⊤ ]X = Ip .
However, it is not true that XX+ = In , because
XX+ = X(X⊤ X)−1 X⊤ ̸= In ,
unless X is square and invertible.
Here X has n rows and p columns/features and we assume that n > p. So X−1 does not make
sense.
If n = p and the square X is invertible, then
X+ = (X⊤X)⁻¹X⊤ = X⁻¹X⁻⊤X⊤ = X⁻¹.
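Numerically, with a made-up tall X, one identity holds and the other fails:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
X = np.column_stack((np.ones(n), rng.uniform(size=n)))  # n = 6 rows, p = 2 columns
Xplus = np.linalg.inv(X.T @ X) @ X.T                    # pseudoinverse X+

print(np.allclose(Xplus @ X, np.eye(2)))  # True:  X+ X = I_p
print(np.allclose(X @ Xplus, np.eye(n)))  # False: X X+ is a projection, not I_n
```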
To discuss the pseudoinverse further, we will talk about the singular value decomposition of an
arbitrary matrix.
Here is the code used during the lecture.
4 Note that ⟨a, b⟩ = a⊤b = a · b = ∑ᵢ ai bi.
import numpy as np               # numerical Python for linear algebra etc
import matplotlib.pyplot as plt  # nice plots like Matlab!

# --- Parameters ---
n = 20       # sample size
sigma = 0.5  # sigma^2 is the variance of the noise
beta = np.array([2.0, 3.5])  # beta vector: [intercept, slope]

# --- Simulate predictors ---
x = np.random.uniform(low=0.0, high=1.0, size=n)

# --- Simulate Gaussian errors ---
epsilon = np.random.normal(loc=0.0, scale=sigma, size=n)

# --- Build design matrix: each row is [1, x_i] ---
X = np.column_stack((np.ones(n), x))

# --- Generate response variable ---
Y = X @ beta + epsilon

# --- Plot scatterplot ---
plt.figure(figsize=(8, 4))
plt.scatter(x, Y, alpha=0.7, edgecolor='r')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Simulated Simple Linear Regression Data')
plt.grid(True)

EY = X @ beta
plt.plot(x, EY, 'r')  # the true model/relationship

bhat = np.linalg.solve(X.T @ X, X.T @ Y)  # least-squares estimate
plt.plot(x, X @ bhat, 'y')

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.linalg.norm(X - U @ np.diag(s) @ Vt))  # check: should be ~0
D = np.diag(1 / s)
pinvX = Vt.T @ D @ U.T  # pseudoinverse via the SVD
print(pinvX @ Y)        # same as bhat