1 Introduction to Supervised Learning
We have the task of predicting a response/output y using features/input x. 1 We model the
relationship between y and x via a function g(x) (so we hope that g(x) ≈ y).
In this course we will assume that a linear function g is a good model for the relationship
between the features and the response:
g(x) = x⊤β = [1, x] [β0, β1]⊤ = β0 + β1x,
where β0 is the intercept, β1 is the slope, and β = [β0, β1]⊤ is the coefficient vector.
Here the feature vector x consists of two features. One is the constant feature 1 and the other is x.
Here β is the vector of coefficients of the linear relationship. Obviously, β0 + xβ1 is a straight line
in one dimension.
We can have a more general linear relationship (hyperplane in higher dimensions):
g(x) = x⊤β = [1, x1, . . . , xp] [β0, β1, . . . , βp]⊤ = β0 + ∑ᵢ xiβi.
Sometimes we write instead:
g(x) = x⊤β = [1, x1, . . . , xp] [β1, β2, . . . , βp+1]⊤ = β1 + ∑ᵢ xiβi+1.
You should pay attention to context to figure out which notation is used.
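As a quick numerical sketch of the inner-product form (the values are hypothetical, chosen only to illustrate the computation in the first convention):

```python
import numpy as np

# Hypothetical coefficients [beta_0, beta_1, beta_2] (first convention)
beta = np.array([1.0, 2.0, -0.5])
x_raw = np.array([3.0, 4.0])        # raw features x_1, x_2 (p = 2)

x = np.concatenate(([1.0], x_raw))  # feature vector [1, x_1, x_2] with constant feature
g = x @ beta                        # g(x) = x^T beta = beta_0 + x_1*beta_1 + x_2*beta_2
print(g)                            # 1.0 + 3.0*2.0 + 4.0*(-0.5) = 5.0
```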
We call g(x) a regression function, when y is a continuous variable.2 For example, y is the
weight of a baby and x contains features such as the age of the mother, number of pregnancies,
health status of mother, etc. This is the main setting of this course (90%).
When y is discrete, say y ∈ {0, 1},3 then we say g(x) is a classification function. This is a small
part of the course (time permitting).
Note about notation: In this course scalar variables are denoted as
x, y, z, α, β, γ,
vectors are denoted (in bold) as
x, y, z, α, β, γ,
and matrices are denoted as
X, Y, Z.
1 "Feature" is the modern machine learning term; statisticians call it an explanatory variable.
2 This is the simplest possible neural network.
3 This should not be confused with the interval [0, 1] or (0, 1).
We use the special blue matrix symbol for a matrix of features (more on that later):
X.
Generally, but not always, we use upper case notation for random variables:
X, Y, Z
For random vectors, we use the notation:
X, Y , Z.
When a coefficient, say β, is estimated via statistical procedures, we denote it as
β̂.
This β̂ can denote either a random vector (in which case it is called an estimator) or a real
vector (in which case it is called an estimate). This distinction is made in, for example, MATH2901,
where we talk about the maximum likelihood estimator or the maximum likelihood estimate.
1.1 Loss Function
Typically, the only thing we do not know about the linear regression function g is the coefficient
vector β. We estimate/determine β from empirical data and denote the estimate by β̂.
When we make a prediction of y, we denote it as ŷ. For example, when we use a linear function:
ŷ = x⊤ β̂,
where β̂ is estimated using some data (e.g., MLE).
We measure the quality of the prediction using a loss function:
Loss(y, ŷ).
When y is continuous, in this course we use the squared-error loss:
Loss(y, ŷ) = (y − ŷ)2 .
For binary response, the loss function is simply:
Loss(y, ŷ) = 1{y ≠ ŷ} = 1 if y ≠ ŷ, and 0 if y = ŷ.
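Both losses are one-liners; a minimal sketch (the function names are mine, not from the course):

```python
def squared_error_loss(y, y_hat):
    # squared-error loss for a continuous response
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    # zero-one loss for a binary response: 1{y != y_hat}
    return 1 if y != y_hat else 0

print(squared_error_loss(3.0, 2.5))  # 0.25
print(zero_one_loss(1, 0))           # 1
print(zero_one_loss(1, 1))           # 0
```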
Remark 1.1 (Using a linear function to predict binary output.) One should not use a linear
function to predict a binary output, especially if the inputs are continuous variables. A contrived
example is:
g(x) = β0 + β1 x
with x, β1 , β0 ∈ {0, 1}. The proper way to predict a binary output is to use logistic (not linear)
regression. We may or may not talk about this, depending on time.
We frequently think of the pair (x, y) as the outcomes of random variables (X, Y ) with a certain
joint pdf f (x, y) (in practice unknown). In this case we consider the expected loss:
ℓ(g(X | β)) = E Loss(Y, g(X)),
which is called the risk for g. In our linear regression case with squared-error loss, the risk of
using a linear function g is:
E(Y − g(X))2 = E(Y − X ⊤ β)2 .
It can be shown that the best possible g when using squared-error loss is the conditional expectation
of Y given X:
g ∗ (x) = E[Y | X = x].
This means that for any other prediction function g(x) we have that
E(Y − g ∗ (X))2 ≤ E(Y − g(X))2 .
Note that to determine g ∗ we need to be able to compute a conditional expectation using the
pdf f (y | x), which is not possible in practice.
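A small Monte Carlo sketch of this inequality, under an assumed toy model (my own choice, not from the notes) in which g∗(x) = 2x:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed toy model: Y = 2X + eps with eps ~ N(0, 0.25),
# so the best predictor is g*(x) = E[Y | X = x] = 2x.
X = rng.uniform(0.0, 1.0, size=n)
Y = 2 * X + rng.normal(0.0, 0.5, size=n)

risk_star = np.mean((Y - 2 * X) ** 2)             # estimated risk of g*
risk_other = np.mean((Y - (1.5 * X + 0.3)) ** 2)  # risk of some other linear g

print(risk_star, risk_other)  # risk_star is smaller (close to sigma^2 = 0.25)
```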
Given g ∗ , we can in fact write the random response given X = x as:
Y = g∗(x) + ϵ(x),
where ϵ(x) is an error term satisfying E[ϵ(X) | X = x] = 0.
Problem 1.1 (Question) Why is ϵ(x) forced to be zero mean? From the equation
Y = g ∗ (x) + ϵ(x),
we know that
E[Y | X = x] = g ∗ (x) + E[ϵ(X) | X = x]
and since g ∗ (x) = E[Y | X = x], then
E[ϵ(X) | X = x] = E[ϵ(x)] = 0.
That the variance of ϵ(x) is independent of x is an assumption.
In this course, we assume that the variance of the noise term does not depend on x,
so that Var(ϵ(x)) = σ², and that g∗ is a linear function.
In other words, the response y can be viewed as the outcome of:
Y = x⊤ β + ϵ,
where E[ϵ] = 0 and Var(ϵ) = σ 2 and β = β ∗ is the true, but unknown, coefficient.
For example, we can have n = 100 of these pairs (x, Y), so that
Yi = xi⊤β + ϵi,  i = 1, . . . , n.
Assuming that these n observations are statistically independent is the same as assuming that the
noise components ϵ1 , . . . , ϵn are independent of each other. In fact, we may assume that ϵ1 , ϵ2 , . . .
are iid (independent and identically distributed).
We can write this in matrix form for the simple linear regression (where x⊤ = [1, x]) as:
Y = [Y1, . . . , Yn]⊤ = Xβ + ϵ,
where X ∈ Rn×2 is the matrix of features, with rows [1, xi], Y ∈ Rn contains all the responses,
β = [β1, β2]⊤, and ϵ = [ϵ1, . . . , ϵn]⊤ is the random noise. At this stage, we
assume that n > p. Here the number of features is the number of columns of X and the number of
independent observations is the number of rows of X. Note that here Y and ϵ are random vectors.
We can have more than two features and still have a linear model:
Y = [Y1, . . . , Yn]⊤ = Xβ + ϵ,
where X ∈ Rn×(p+1) has rows [1, xi,1, . . . , xi,p] and β = [β1, . . . , βp+1]⊤.
Problem 1.2 (Polynomial Regression) We can take powers of a single variable as features and
still have a linear model:
Y = [Y1, . . . , Yn]⊤ = Xβ + ϵ,
where X has rows [1, xi, xi²] and β = [β1, β2, β3]⊤, so that the model here is
Y = β1 + xβ2 + x²β3 + ϵ = [1, x, x²] [β1, β2, β3]⊤ + ϵ.
The ultimate test for whether the model is linear or not is whether we can write the regression function
as an inner/dot product of a feature vector x with a coefficient vector β.
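For instance, a degree-2 polynomial design matrix can be built directly with NumPy (the x values and coefficients below are toy choices of mine):

```python
import numpy as np

# Degree-2 polynomial design matrix: each row is [1, x_i, x_i^2]
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.vander(x, N=3, increasing=True)  # columns x^0, x^1, x^2

beta = np.array([1.0, 0.0, 2.0])  # hypothetical coefficients [beta_1, beta_2, beta_3]
g = X @ beta                      # g(x_i) = beta_1 + x_i*beta_2 + x_i^2*beta_3
print(g)                          # [ 1.  3.  9. 19.]
```

The model is nonlinear in x but linear in β, which is all that matters here.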
1.2 Training Set
In practice we do not know g ∗ (and if assumed linear, we do not know the corresponding coefficient
β ∗ : g ∗ (x) = x⊤ β ∗ ), but only have some (random) data
T = {(X1 , Y1 ), . . . , (Xn , Yn )},
where we assume that (Xi , Yi ) are iid for i = 1, . . . , n, and its realization:
τ = {(x1 , y1 ), . . . , (xn , yn )}
We are trying to find a way to use the training data to construct a learner (or prediction function
in older literature):
gτ (x).
In our case the learner is a linear function:
gτ (x) = x⊤ β̂,
where β̂ is an estimate of the "true" β based on τ . The learner can also be a random object (with
its own statistical properties) when we write:
gT (x) = x⊤ β̂,
where β̂ is an estimator (a function of T ).
2 Training loss
Recall that we have the risk of the learner g:
E(Y − g(X))² = E(Y − X⊤β)²  (assuming a linear model).
We do not know the distribution of (Y, X), so we replace the expectation with its empirical estimate:
ℓT(g) := (1/n) ∑ᵢ₌₁ⁿ (Yi − g(Xi))²,
where the average approximates the expectation E.
This is called the training loss. The training loss is our approximation to the risk. Note that the
training loss can be either a function of T or τ . We can think of τ as the realization of the random
T:
ℓτ(g) := (1/n) ∑ᵢ₌₁ⁿ (yi − g(xi))².
When the learner is linear (90% of this course!), then
ℓT(g(· | β)) := (1/n) ∑ᵢ₌₁ⁿ (Yi − xi⊤β)²,
which can be written very neatly using matrix/vector notation. Recall that x⊤x = ∥x∥² = ∑ᵢ xi²;
then
ℓT(g(· | β)) = ∥Y − Xβ∥² / n,
where Y = [Y1, . . . , Yn]⊤, X is the matrix with rows xi⊤, and Xβ = [x1⊤β, . . . , xn⊤β]⊤.
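A quick numerical check, with made-up data, that the matrix form of the training loss agrees with the sum over observations:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
X = np.column_stack((np.ones(n), rng.uniform(size=n)))  # rows x_i^T = [1, x_i]
Y = rng.normal(size=n)
beta = np.array([0.5, -1.0])  # arbitrary coefficient vector

loss_sum = sum((Y[i] - X[i] @ beta) ** 2 for i in range(n)) / n
loss_mat = np.linalg.norm(Y - X @ beta) ** 2 / n
print(loss_sum, loss_mat)  # identical up to floating-point rounding
```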
Our learner is now
gτ (x) = x⊤ β̂,
where β̂ is the minimizer of the training loss:
β̂ ∈ argminβ ∥Y − Xβ∥² / n.
Here β̂ is called the least-squares estimator. How do we compute this estimate?
We need to consider how to find the minimizer of the following function:
b⊤Ab − 2b⊤z = ∑ᵢ,ⱼ bi bj ai,j − 2 ∑ⱼ bj zj,
where A is a suitable symmetric invertible matrix.
To find out, let us start with the simplest 1-dimensional case.
b × a × b − 2b × z = ab² − 2bz.
Setting the derivative with respect to b, namely 2ab − 2z, to zero and solving the resulting equation gives:
b = a⁻¹ × z.
So we can then guess that the critical/stationary point of b⊤ Ab − 2b⊤ z is:
b = A−1 z.
This guess is correct.
Quick formal proof: Let us differentiate b⊤Ab − 2b⊤z with respect to bk:
∂/∂bk [b⊤Ab − 2b⊤z] = ∂/∂bk [b⊤Ab] − 2 ∂/∂bk [b⊤z] = −2zk + ∂/∂bk [b⊤Ab].
Write b = b/k + ek bk, where b/k denotes b with its k-th entry set to zero and ek is the k-th unit
vector. Recall that for symmetric A we have (b + c)⊤A(b + c) = b⊤Ab + c⊤Ac + 2c⊤Ab. Hence:
b⊤Ab = (b/k + ek bk)⊤ A (b/k + ek bk) = b/k⊤ A b/k + bk² ek⊤Aek + 2bk ek⊤A b/k.
Therefore:
∂/∂bk [b⊤Ab] = 0 + 2bk ek⊤Aek + 2ek⊤A b/k = 2ek⊤A(ek bk + b/k) = 2ek⊤Ab.
So setting
∂/∂bk [b⊤Ab − 2b⊤z] = 2ek⊤Ab − 2ek⊤z = 0
for each k and organizing the result in matrix form gives (the gradient of the multivariate function):
∂/∂b [b⊤Ab − 2b⊤z] = 2Ab − 2z,
which is a shorthand notation for
∂/∂bk [b⊤Ab − 2b⊤z] = 2ek⊤Ab − 2ek⊤z,  k = 1, . . . , n.
Solving the gradient=0 equations gives
b = A−1 z.
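A numerical sanity check of this formula on a small made-up symmetric matrix (my own toy values):

```python
import numpy as np

# Toy symmetric invertible A and vector z (assumed values)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
z = np.array([1.0, -1.0])

f = lambda b: b @ A @ b - 2 * b @ z
b_star = np.linalg.solve(A, z)  # the claimed minimizer b = A^{-1} z

grad = 2 * A @ b_star - 2 * z   # gradient 2Ab - 2z, should vanish at b_star
print(grad)                     # ~ [0, 0]
print(f(b_star) < f(b_star + np.array([0.1, -0.2])))  # True: nearby points are worse
```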
In the least-squares minimization case, we have:4
∥Y − Xβ∥² = (Y − Xβ)⊤(Y − Xβ)
= Y⊤Y − 2Y⊤Xβ + (Xβ)⊤(Xβ)   (using (a − b)² = a² − 2ab + b²)
= ∥Y∥² + β⊤(X⊤X)β − 2β⊤[X⊤Y]   (using a⊤b = b⊤a),
so here A = X⊤X and z = X⊤Y. Therefore, using our formula:
b = A⁻¹z = (X⊤X)⁻¹X⊤Y.
In summary, the least-squares estimate is given by:
β̂ = (X⊤X)⁻¹X⊤y.
This formula works when n > p (more examples than features) and X is full rank, meaning that
the columns of X are linearly independent. It minimizes the training loss.
If we denote
X+ := (X⊤ X)−1 X⊤ ,
then β̂ = X+ y. The X+ is called a pseudoinverse. It generalizes the usual inverse of a matrix
for non-square matrices. When A is a square invertible matrix, then
A+ = A−1 .
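As a check (with a made-up design matrix), the formula agrees with NumPy's SVD-based np.linalg.pinv when X is tall and full rank:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
X = np.column_stack((np.ones(n), rng.uniform(size=n)))  # tall, full-rank: n > p

pinv_formula = np.linalg.inv(X.T @ X) @ X.T  # (X^T X)^{-1} X^T
pinv_numpy = np.linalg.pinv(X)               # NumPy's SVD-based pseudoinverse

print(np.allclose(pinv_formula, pinv_numpy))  # True for full-rank tall X
```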
Problem 2.1 (Is X+ X = I?) Suppose that X+ := (X⊤ X)−1 X⊤ . Then,
X+ X = [(X⊤ X)−1 X⊤ ]X = Ip .
However, it is not true that XX+ = In , because
XX+ = X(X⊤ X)−1 X⊤ ̸= In ,
unless X is square and invertible.
Here X has n rows and p columns/features and we assume that n > p. So X−1 does not make
sense.
If n = p and the square X is invertible, then
X+ = (X⊤X)⁻¹X⊤ = X⁻¹X⁻⊤X⊤ = X⁻¹.
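Numerically, with a made-up tall X, one identity holds and the other fails:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
X = np.column_stack((np.ones(n), rng.uniform(size=n)))  # n = 6 rows, p = 2 columns
Xplus = np.linalg.inv(X.T @ X) @ X.T                    # pseudoinverse X+

print(np.allclose(Xplus @ X, np.eye(2)))  # True:  X+ X = I_p
print(np.allclose(X @ Xplus, np.eye(n)))  # False: X X+ is a projection, not I_n
```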
To discuss the pseudoinverse further, we will talk about the singular value decomposition of an
arbitrary matrix.
Here is the code used during the lecture.
4 Note that ⟨a, b⟩ = a⊤b = a · b = ∑ᵢ ai bi.
import numpy as np               # numerical Python for linear algebra etc
import matplotlib.pyplot as plt  # nice plots like Matlab!

# --- Parameters ---
n = 20       # sample size
sigma = 0.5  # sigma^2 is the variance of the noise
beta = np.array([2.0, 3.5])  # beta vector: [intercept, slope]

# --- Simulate predictors ---
x = np.random.uniform(low=0.0, high=1.0, size=n)

# --- Simulate Gaussian errors ---
epsilon = np.random.normal(loc=0.0, scale=sigma, size=n)

# --- Build design matrix: each row is [1, x_i] ---
X = np.column_stack((np.ones(n), x))

# --- Generate response variable ---
Y = X @ beta + epsilon

# --- Plot scatterplot ---
plt.figure(figsize=(8, 4))
plt.scatter(x, Y, alpha=0.7, edgecolor='r')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Simulated Simple Linear Regression Data')
plt.grid(True)

EY = X @ beta
plt.plot(x, EY, 'r')  # the true model/relationship

bhat = np.linalg.solve(X.T @ X, X.T @ Y)  # least-squares estimate
plt.plot(x, X @ bhat, 'y')

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.linalg.norm(X - U @ np.diag(s) @ Vt))  # check: should be ~0
D = np.diag(1 / s)
pinvX = Vt.T @ D @ U.T  # pseudoinverse via the SVD
print(pinvX @ Y)        # same as bhat