Wk05 machine learning
Sherry Thomas
21f3001449
Contents
Introduction to Supervised Learning
Linear Regression
Gradient Descent
    Stochastic Gradient Descent
Kernel Regression
Acknowledgments
Abstract
The week commences with an exploration of Supervised Learning,
specifically focusing on the topic of Regression. The aim is to provide
a comprehensive understanding of the underlying mechanism of this pop-
ular machine learning technique, and its various applications. Addition-
ally, this study delves into the variants of Regression, including kernel
regression, and examines the probabilistic aspects of the technique.
Linear Regression
Linear regression is a supervised learning algorithm employed to predict a con-
tinuous output variable based on one or more input features, assuming a linear
relationship between the input and output variables. The primary objective of
linear regression is to determine the line of best fit that minimizes the sum of
squared errors between the predicted and actual output values.
Given a dataset $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ where each $\mathbf{x}_i \in \mathbb{R}^d$, and the corresponding labels $y_1, \ldots, y_n \in \mathbb{R}$, the goal of linear regression is to find a mapping between the input and output variables, represented as follows:
$$h: \mathbb{R}^d \rightarrow \mathbb{R}$$
$$\text{error}(h) = \sum_{i=1}^{n} \left(h(\mathbf{x}_i) - y_i\right)^2$$
Ideally, this error should be minimized, which occurs when ℎ(x𝑖 ) = y𝑖 for all 𝑖.
However, achieving this may only result in memorizing the data and its outputs,
which is not a desired outcome.
To mitigate this memorization problem, we impose structure on the mapping. The simplest and most commonly used structure is linear, which we adopt as the underlying structure for our data.
Let $\mathcal{H}_{\text{linear}}$ denote the solution space for the mapping in the linear domain:
$$\min_{h \in \mathcal{H}_{\text{linear}}} \sum_{i=1}^{n} \left(h(\mathbf{x}_i) - y_i\right)^2$$
Equivalently,
$$\min_{\mathbf{w} \in \mathbb{R}^d} \sum_{i=1}^{n} \left(\mathbf{w}^T\mathbf{x}_i - y_i\right)^2$$
Optimizing the above objective is the main aim of the linear regression algo-
rithm.
Optimizing the Error Function
With $X \in \mathbb{R}^{d \times n}$ denoting the matrix whose columns are the data points, the minimization problem can be expressed in vectorized form; setting its gradient with respect to $\mathbf{w}$ to zero yields the normal equations:
$$(XX^T)\mathbf{w}^* = Xy$$
$$\therefore \mathbf{w}^* = (XX^T)^+ Xy$$
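As a concrete illustration, here is a minimal NumPy sketch of this closed-form solution; the synthetic data, variable names, and the $d \times n$ column layout of $X$ are assumptions made for the example.

```python
import numpy as np

# Synthetic data: columns of X are the data points, so X has shape (d, n).
rng = np.random.default_rng(0)
d, n = 3, 100
X = rng.normal(size=(d, n))
w_true = np.array([1.5, -2.0, 0.5])
y = X.T @ w_true + 0.1 * rng.normal(size=n)       # noisy labels

# w* = (X X^T)^+ X y  -- the normal-equation solution via the pseudo-inverse
w_star = np.linalg.pinv(X @ X.T) @ X @ y
print(w_star)                                     # close to w_true
```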
Gradient Descent
The normal equation for linear regression, as shown above, involves calculating $(XX^T)^+$, which can be computationally expensive, with a complexity of $O(d^3)$.
Since $\mathbf{w}^*$ is the solution of an unconstrained optimization problem, it can also be found using gradient descent. The iterative formula for gradient descent is:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \nabla f(\mathbf{w}_t)$$
$$\therefore \mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \left[2(XX^T)\mathbf{w}_t - 2(Xy)\right]$$
Here, $\eta_t$ is a scalar that controls the step size of the descent, and $t$ denotes the current iteration.
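Below is a short sketch of this update rule in NumPy, keeping the $d \times n$ layout of $X$; the step-size heuristic and iteration count are illustrative choices, not part of the lecture material.

```python
import numpy as np

def linreg_gradient_descent(X, y, T=500, eta=None):
    """Gradient descent on sum_i (w^T x_i - y_i)^2, with X of shape (d, n)."""
    d = X.shape[0]
    w = np.zeros(d)
    if eta is None:
        # A conservative constant step size: 1 / (2 * largest eigenvalue of X X^T).
        eta = 1.0 / (2.0 * np.linalg.norm(X @ X.T, 2))
    for _ in range(T):
        grad = 2 * (X @ X.T @ w - X @ y)   # gradient of the squared-error objective
        w = w - eta * grad
    return w

# Usage: w_gd = linreg_gradient_descent(X, y) approaches the normal-equation solution.
```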
Even in the above update, the product $XX^T$ must still be computed, which remains computationally expensive. Is there a way to make this step cheaper?
Stochastic Gradient Descent
Stochastic gradient descent (SGD) is an optimization algorithm widely employed
in machine learning to minimize the loss function of a model by determining
the optimal parameters. Unlike traditional (batch) gradient descent, which updates the model parameters using the entire dataset, SGD updates the parameters using a randomly selected subset of the data, known as a mini-batch. This approach makes each update much cheaper and makes SGD particularly suitable for handling large datasets.
Instead of updating w using the entire dataset at each step 𝑡, SGD leverages a
small randomly selected subset of 𝑘 data points to update w. Consequently, the
new gradient becomes $2(\tilde{X}\tilde{X}^T \mathbf{w}_t - \tilde{X}\tilde{y})$, where $\tilde{X}$ and $\tilde{y}$ represent small samples randomly chosen from the dataset. This strategy is feasible since $\tilde{X} \in \mathbb{R}^{d \times k}$, which is considerably smaller than $X$.
After 𝑇 rounds of training, the final estimate is obtained as follows:
$$\mathbf{w}^{T}_{\text{SGD}} = \frac{1}{T} \sum_{i=1}^{T} \mathbf{w}_i$$
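A minimal sketch of mini-batch SGD with averaged iterates, under the same assumptions as before; the batch size $k$, step size, and number of rounds are arbitrary illustrative values.

```python
import numpy as np

def linreg_sgd(X, y, k=10, T=2000, eta=1e-3, seed=0):
    """Mini-batch SGD for linear regression; X has shape (d, n).
    Returns the average of the iterates w_1, ..., w_T."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for _ in range(T):
        idx = rng.choice(n, size=k, replace=False)  # random batch of k points
        Xb, yb = X[:, idx], y[idx]                  # X_tilde, y_tilde
        grad = 2 * (Xb @ Xb.T @ w - Xb @ yb)        # gradient on the batch only
        w = w - eta * grad                          # eta may need tuning per dataset
        w_sum += w
    return w_sum / T                                # averaged iterate, w_SGD
```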
Kernel Regression
What if the data points reside in a non-linear subspace? Similar to dealing with
non-linear data clustering, kernel functions are employed in this scenario as well.
Let $\mathbf{w}^* = X\alpha^*$, where $\alpha^* \in \mathbb{R}^n$. Then:
$$
\begin{aligned}
X\alpha^* &= \mathbf{w}^* \\
\therefore X\alpha^* &= (XX^T)^+ Xy \\
(XX^T)X\alpha^* &= (XX^T)(XX^T)^+ Xy \\
(XX^T)X\alpha^* &= Xy \\
X^T(XX^T)X\alpha^* &= X^T Xy \\
(X^T X)^2 \alpha^* &= X^T X y \\
K^2 \alpha^* &= K y \\
\therefore \alpha^* &= K^{-1} y
\end{aligned}
$$
Here, $K \in \mathbb{R}^{n \times n}$ is the Gram matrix with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, and it can be obtained using a kernel function such as the polynomial kernel or the RBF kernel.
To predict using $\alpha^*$ and the kernel function, let $X_{\text{test}} \in \mathbb{R}^{d \times m}$ represent the test dataset. The prediction for a test point $\mathbf{x}_{\text{test}}$ is made as follows:
$$\mathbf{w}^{*T}\phi(\mathbf{x}_{\text{test}}) = \sum_{i=1}^{n} \alpha_i^* \, k(\mathbf{x}_i, \mathbf{x}_{\text{test}})$$
Here, $\alpha_i^*$ denotes the importance of the $i$-th training point in relation to $\mathbf{w}^*$, and $k(\mathbf{x}_i, \mathbf{x}_{\text{test}})$ measures the similarity between $\mathbf{x}_{\text{test}}$ and $\mathbf{x}_i$.
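The following sketch puts these pieces together for kernel regression with an RBF kernel; the kernel choice, the gamma value, and the use of a pseudo-inverse in place of $K^{-1}$ (in case $K$ is ill-conditioned) are assumptions of this example, not part of the lecture material.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel between the columns of A (d x n) and the columns of B (d x m)."""
    sq_dists = (np.sum(A**2, axis=0)[:, None]
                + np.sum(B**2, axis=0)[None, :]
                - 2.0 * A.T @ B)
    return np.exp(-gamma * sq_dists)

def kernel_regression_fit(X, y, gamma=1.0):
    K = rbf_kernel(X, X, gamma)            # Gram matrix, K in R^{n x n}
    return np.linalg.pinv(K) @ y           # alpha* = K^{-1} y (pinv in case K is singular)

def kernel_regression_predict(X, alpha, X_test, gamma=1.0):
    K_test = rbf_kernel(X, X_test, gamma)  # entries k(x_i, x_test_j), shape (n, m)
    return K_test.T @ alpha                # one prediction per test column

# Usage: alpha = kernel_regression_fit(X, y); y_hat = kernel_regression_predict(X, alpha, X_test)
```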
Linear regression can also be viewed probabilistically. Suppose each label is generated as
$$y_i = \mathbf{w}^T\mathbf{x}_i + \epsilon_i$$
where $\mathbf{w} \in \mathbb{R}^d$ is the weight vector that captures the relationship between the inputs and the target variable, and $\epsilon_i$ is a noise term.
To estimate the weight vector w that best fits the data, we can apply the
principle of Maximum Likelihood (ML). The ML estimation seeks to find the
parameter values that maximize the likelihood of observing the given data.
Assuming that the noise term $\epsilon_i$ follows a zero-mean Gaussian distribution with variance $\sigma^2$, we can express the likelihood function as:
$$
\begin{aligned}
\mathcal{L}(\mathbf{w}; X, y) &= P(y \mid X; \mathbf{w}) \\
&= \prod_{i=1}^{n} P(y_i \mid \mathbf{x}_i; \mathbf{w}) \\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(\mathbf{w}^T\mathbf{x}_i - y_i)^2}{2\sigma^2}\right)
\end{aligned}
$$
Taking the logarithm,
$$
\begin{aligned}
\log \mathcal{L}(\mathbf{w}; X, y) &= \sum_{i=1}^{n} \left[ \log\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right) - \frac{(\mathbf{w}^T\mathbf{x}_i - y_i)^2}{2\sigma^2} \right] \\
&= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(\mathbf{w}^T\mathbf{x}_i - y_i)^2
\end{aligned}
$$
Dropping the term that does not depend on $\mathbf{w}$, the negative log-likelihood is
$$-\log \mathcal{L}(\mathbf{w}; X, y) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}(\mathbf{w}^T\mathbf{x}_i - y_i)^2 + \text{const.}$$
Up to a positive scaling factor, this expression is the sum-of-squared-errors objective used in linear regression. Therefore, finding the maximum likelihood estimate $\mathbf{w}_{\text{ML}}$ is equivalent to solving the linear regression problem with the squared-error loss.
To obtain the closed-form solution for $\mathbf{w}_{\text{ML}}$, we differentiate the negative log-likelihood with respect to $\mathbf{w}$ and set the derivative to zero:
$$\nabla_{\mathbf{w}}\left(-\log \mathcal{L}(\mathbf{w}; X, y)\right) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(\mathbf{w}^T\mathbf{x}_i - y_i)\,\mathbf{x}_i = 0$$
$$\frac{1}{\sigma^2}\left(XX^T\mathbf{w} - Xy\right) = 0$$
where $X$ is the matrix whose columns are the input vectors $\mathbf{x}_i$ and $y$ is the vector of labels. Rearranging the equation, we have:
$$XX^T\mathbf{w} = Xy$$
To obtain the closed-form solution for $\mathbf{w}_{\text{ML}}$, we multiply both sides by the inverse of $XX^T$, denoted as $(XX^T)^{-1}$:
$$\mathbf{w}_{\text{ML}} = (XX^T)^{-1}Xy$$
Thus, the closed-form solution for the maximum likelihood estimate $\mathbf{w}_{\text{ML}}$ is given by the product of $(XX^T)^{-1}$ and $Xy$.
This closed-form solution shows that $\mathbf{w}_{\text{ML}}$ is obtained by applying the inverse of $XX^T$ to the vector $Xy$, and (when $XX^T$ is invertible) it coincides with the least-squares solution $(XX^T)^+Xy$ derived earlier. It provides a direct way to estimate the weight vector $\mathbf{w}$ from the given data, although, as noted above, computing the inverse costs $O(d^3)$.
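As a quick numerical check (on synthetic, illustrative data), the ML closed form $(XX^T)^{-1}Xy$ matches NumPy's least-squares solver applied to the same data; all names and values below are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma = 4, 200, 0.3
X = rng.normal(size=(d, n))                      # columns are the inputs x_i
w_true = rng.normal(size=d)
y = X.T @ w_true + sigma * rng.normal(size=n)    # y_i = w^T x_i + eps_i

w_ml = np.linalg.inv(X @ X.T) @ X @ y            # (X X^T)^{-1} X y
w_ls, *_ = np.linalg.lstsq(X.T, y, rcond=None)   # ordinary least squares on the same data

print(np.allclose(w_ml, w_ls))                   # True: both recover the same estimate
```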
Acknowledgments
Professor Arun Rajkumar: The content, including the concepts and nota-
tions presented in this document, has been sourced from his slides and lectures.
His expertise and educational materials have greatly contributed to the devel-
opment of this document.