
Week-5: Linear Regression - Introduction

Sherry Thomas
21f3001449

Contents
Introduction to Supervised Learning
Linear Regression
Optimizing the Error Function
Gradient Descent
    Stochastic Gradient Descent
Kernel Regression
Probabilistic View of Linear Regression
Acknowledgments

Abstract
The week commences with an exploration of Supervised Learning, specifically
focusing on the topic of Regression. The aim is to provide a comprehensive
understanding of the underlying mechanism of this popular machine learning
technique and its various applications. Additionally, this study delves into
variants of Regression, including kernel regression, and examines the
probabilistic aspects of the technique.

Introduction to Supervised Learning


Supervised learning, a fundamental machine learning technique, involves training
an algorithm on labeled data where the target variable or outcome is known.
The primary objective of supervised learning is to establish a mapping between
input data and corresponding output variables.
Given a dataset x_1, …, x_n, where each x_i ∈ ℝᵈ, the corresponding labels
y_1, …, y_n can fall into the following categories:
• Regression: y_i ∈ ℝ (e.g., rainfall prediction)
• Binary Classification: y_i ∈ {0, 1} (e.g., distinguishing between cats and dogs)
• Multi-class Classification: y_i ∈ {0, 1, …, K} (e.g., digit classification)

Linear Regression
Linear regression is a supervised learning algorithm employed to predict a
continuous output variable based on one or more input features, assuming a
linear relationship between the input and output variables. The primary objective
of linear regression is to determine the line of best fit that minimizes the sum
of squared errors between the predicted and actual output values.
Given a dataset x_1, …, x_n, where each x_i ∈ ℝᵈ, and corresponding labels
y_1, …, y_n ∈ ℝ, the goal of linear regression is to find a mapping between the
input and output variables, represented as follows:

h : ℝᵈ → ℝ

The error for this mapping function can be quantified as:

error(h) = ∑_{i=1}^{n} (h(x_i) − y_i)²

Ideally, this error should be zero, which occurs when h(x_i) = y_i for all i.
However, a mapping that achieves this may simply memorize the data and its
outputs, which is not a desired outcome.
To mitigate this memorization problem, we must impose some structure on the
mapping. The simplest and most commonly used structure is linear, which we
adopt as the underlying structure for our data.
Let ℋ_linear denote the solution space of linear mappings:

ℋ_linear = {h_w : ℝᵈ → ℝ such that h_w(x) = wᵀx, for some w ∈ ℝᵈ}

Thus, our objective is to minimize:

min_{h∈ℋ_linear} ∑_{i=1}^{n} (h(x_i) − y_i)²

Equivalently,

min_{w∈ℝᵈ} ∑_{i=1}^{n} (wᵀx_i − y_i)²

Optimizing the above objective is the main aim of the linear regression algorithm.
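
As a concrete illustration, the objective can be evaluated directly. The following
is a minimal NumPy sketch (not from the lecture; names and data are illustrative),
using the convention that the data points are stored as the columns of a d × n
matrix X:

import numpy as np

def squared_error(w, X, y):
    # Sum of squared errors for h_w(x) = w^T x, with data points as columns of X.
    predictions = X.T @ w        # shape (n,): w^T x_i for each data point
    residuals = predictions - y  # h_w(x_i) - y_i
    return float(np.sum(residuals ** 2))

# Toy data: d = 3 features, n = 5 points
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))
y = rng.standard_normal(5)
print(squared_error(np.zeros(3), X, y))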

Optimizing the Error Function
The minimization problem can be expressed in vectorized form as:

min_{w∈ℝᵈ} ‖Xᵀw − y‖₂²

where X ∈ ℝ^{d×n} is the matrix whose columns are the data points x_1, …, x_n,
and y ∈ ℝⁿ is the vector of labels.

Defining a function f(w) that captures this minimization problem, we have:

f(w) = ‖Xᵀw − y‖₂²
     = (Xᵀw − y)ᵀ(Xᵀw − y)

∴ ∇f(w) = 2(XXᵀ)w − 2(Xy)

Setting the gradient to zero, we obtain:

(XXᵀ)w* = Xy
∴ w* = (XXᵀ)⁺Xy

Here, (XXᵀ)⁺ denotes the pseudo-inverse of XXᵀ.

Further analysis reveals that Xᵀw* corresponds to the projection of the label
vector y onto the subspace spanned by the features.
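
A minimal sketch of this closed-form solution, again assuming the column-wise
data convention (all names are illustrative); np.linalg.pinv computes the
pseudo-inverse, so the code also handles the case where XXᵀ is singular:

import numpy as np

def fit_linear_regression(X, y):
    # Closed-form least squares: w* = (X X^T)^+ X y, data points as columns of X.
    return np.linalg.pinv(X @ X.T) @ (X @ y)

# Toy data generated from a known weight vector plus a little noise
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((3, 100))
y = X.T @ w_true + 0.1 * rng.standard_normal(100)

w_star = fit_linear_regression(X, y)
print(w_star)        # close to w_true
print(X.T @ w_star)  # predictions, i.e. the projection of y onto the feature subspace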

Gradient Descent
The normal equation for linear regression, as shown above, involves calculating
(XXᵀ)⁺, which can be computationally expensive, with a complexity of O(d³).
Since w* is the solution of an unconstrained optimization problem, it can also
be found using gradient descent. The iterative update for gradient descent is:

w_{t+1} = w_t − 𝜂_t ∇f(w_t)
∴ w_{t+1} = w_t − 𝜂_t [2(XXᵀ)w_t − 2(Xy)]

Here, 𝜂_t is a scalar that controls the step size of the descent, and t denotes
the current iteration.
Even with this update, the calculation of XXᵀ is still required, which remains
computationally expensive. Is there a way to further improve this process?
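
The update rule can be implemented in a few lines; the following rough sketch
(step size, iteration count, and all names are illustrative choices, not prescribed
by the lecture) precomputes XXᵀ and Xy once and then iterates:

import numpy as np

def gradient_descent(X, y, eta=0.001, num_iters=500):
    # Minimize ||X^T w - y||^2; the gradient is 2(X X^T w - X y).
    d = X.shape[0]
    w = np.zeros(d)
    XXt = X @ X.T   # still the expensive part, computed once
    Xy = X @ y
    for _ in range(num_iters):
        w = w - eta * 2 * (XXt @ w - Xy)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 100))
y = X.T @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)
print(gradient_descent(X, y))   # approaches the closed-form solution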

Stochastic Gradient Descent
Stochastic gradient descent (SGD) is an optimization algorithm widely employed
in machine learning to minimize the loss function of a model by determining
the optimal parameters. Unlike traditional gradient descent, which updates the
model parameters based on the entire dataset, SGD updates the parameters
using a randomly selected subset of the data, known as a batch. This approach
leads to faster training times and makes SGD particularly suitable for handling
large datasets.
Instead of updating w using the entire dataset at each step t, SGD uses a small,
randomly selected subset of k data points to update w. Consequently, the new
gradient becomes 2(X̃X̃ᵀw_t − X̃ỹ), where X̃ ∈ ℝ^{d×k} and ỹ ∈ ℝᵏ are a small
sample randomly drawn from the dataset. This strategy is cheaper since X̃ is
considerably smaller than X.
After T rounds of training, the final estimate is obtained by averaging the iterates:

w_SGD = (1/T) ∑_{t=1}^{T} w_t

The stochastic nature of SGD introduces noise into each update, but averaging
the iterates helps smooth out this noise, so that in practice SGD converges close
to the optimum.
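
A rough sketch of mini-batch SGD with iterate averaging, under the same data
convention (batch size, step size, and names are illustrative):

import numpy as np

def sgd(X, y, k=8, eta=0.001, T=2000, seed=0):
    # Mini-batch SGD for least squares, returning the average of the iterates.
    rng = np.random.default_rng(seed)
    d, n = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for _ in range(T):
        idx = rng.choice(n, size=k, replace=False)  # random mini-batch
        Xb, yb = X[:, idx], y[idx]                  # X~ and y~
        grad = 2 * (Xb @ (Xb.T @ w) - Xb @ yb)      # 2(X~ X~^T w - X~ y~)
        w = w - eta * grad
        w_sum += w
    return w_sum / T                                # (1/T) * sum_t w_t

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 500))
y = X.T @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(500)
print(sgd(X, y))   # roughly the same solution, without ever forming the full X X^T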

Kernel Regression
What if the data points reside in a non-linear subspace? Just as with non-linear
data in clustering, kernel functions are employed in this scenario as well.
Let w* = X𝛼*, where 𝛼* ∈ ℝⁿ. (This is possible because w* = (XXᵀ)⁺Xy lies in
the column space of X, i.e., it is a linear combination of the data points.)

X𝛼* = w*
∴ X𝛼* = (XXᵀ)⁺Xy
(XXᵀ)X𝛼* = (XXᵀ)(XXᵀ)⁺Xy
(XXᵀ)X𝛼* = Xy
Xᵀ(XXᵀ)X𝛼* = XᵀXy
(XᵀX)²𝛼* = XᵀXy
K²𝛼* = Ky
∴ 𝛼* = K⁻¹y

Here, K = XᵀX ∈ ℝ^{n×n} is the Gram matrix, whose entries K_{ij} = k(x_i, x_j)
can be computed using a kernel function such as the polynomial kernel or the
RBF kernel.
To predict using 𝛼* and the kernel function, let x_test denote a test point (or
Xtest ∈ ℝ^{d×m} a test dataset of m points). The prediction is made as follows:

w*ᵀ𝜙(x_test) = ∑_{i=1}^{n} 𝛼*_i k(x_i, x_test)

Here, 𝛼*_i denotes the importance of the i-th training point in relation to w*, and
k(x_i, x_test) signifies the similarity between the test point x_test and x_i.
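
A minimal sketch of kernel regression with an RBF kernel, assuming the
column-wise data convention (the kernel width gamma and all names are
illustrative); the pseudo-inverse stands in for K⁻¹ in case K is ill-conditioned:

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2) for every column pair of A and B.
    sq = np.sum(A**2, axis=0)[:, None] + np.sum(B**2, axis=0)[None, :] - 2 * A.T @ B
    return np.exp(-gamma * sq)

def fit_kernel_regression(X, y, gamma=1.0):
    K = rbf_kernel(X, X, gamma)       # Gram matrix K in R^{n x n}
    return np.linalg.pinv(K) @ y      # alpha* = K^{-1} y

def predict(X, alpha, X_test, gamma=1.0):
    # For each test point: sum_i alpha_i * k(x_i, x_test)
    return rbf_kernel(X, X_test, gamma).T @ alpha

# Toy non-linear data: a noisy sine of a one-dimensional input
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1, 50))
y = np.sin(X[0]) + 0.05 * rng.standard_normal(50)
alpha = fit_kernel_regression(X, y, gamma=2.0)
X_test = np.linspace(-3, 3, 5).reshape(1, -1)
print(predict(X, alpha, X_test, gamma=2.0))
print(np.sin(X_test[0]))   # compare with the true underlying function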

Probabilistic View of Linear Regression


Consider a dataset x_1, …, x_n with x_i ∈ ℝᵈ, and corresponding labels y_1, …, y_n
with y_i ∈ ℝ. The probabilistic view of linear regression assumes that the target
variable y_i can be modeled as a linear combination of the input features x_i, plus
a noise term 𝜖_i following a zero-mean Gaussian distribution with variance 𝜎².
Mathematically, this can be expressed as:

y_i = wᵀx_i + 𝜖_i

where w ∈ ℝᵈ represents the weight vector that captures the relationship between
the inputs and the target variable.
To estimate the weight vector w that best fits the data, we can apply the
principle of Maximum Likelihood (ML). The ML estimation seeks to find the
parameter values that maximize the likelihood of observing the given data.
Assuming that the noise term 𝜖_i follows a zero-mean Gaussian distribution with
variance 𝜎², we can express the likelihood function as:

ℒ(w; X, y) = P(y | X; w)
           = ∏_{i=1}^{n} P(y_i | x_i; w)
           = ∏_{i=1}^{n} (1/√(2𝜋𝜎²)) exp(−(wᵀx_i − y_i)²/(2𝜎²))

Taking the logarithm of the likelihood function, we have:

log ℒ(w; X, y) = ∑_{i=1}^{n} [log(1/√(2𝜋𝜎²)) − (wᵀx_i − y_i)²/(2𝜎²)]
              = −(n/2) log(2𝜋𝜎²) − (1/(2𝜎²)) ∑_{i=1}^{n} (wᵀx_i − y_i)²

To find the maximum likelihood estimate w_ML, we want to maximize
log ℒ(w; X, y). Maximizing the likelihood is equivalent to minimizing the
negative log-likelihood. Dropping the additive constant (n/2) log(2𝜋𝜎²), which
does not depend on w, we seek to minimize:

(1/(2𝜎²)) ∑_{i=1}^{n} (wᵀx_i − y_i)²

Up to a positive scaling factor, this expression is the same squared-error objective
used in linear regression. Therefore, finding the maximum likelihood estimate
w_ML is equivalent to solving the linear regression problem with the squared
error loss.
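
This equivalence is easy to check numerically. The following sketch (noise level,
candidates, and names are illustrative) verifies that the negative log-likelihood
and the sum of squared errors differ only by a positive scale and an additive
constant, so they rank candidate weight vectors identically:

import numpy as np

def neg_log_likelihood(w, X, y, sigma=0.5):
    n = y.shape[0]
    resid = X.T @ w - y
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

def squared_error(w, X, y):
    return np.sum((X.T @ w - y) ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 50))
y = X.T @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.standard_normal(50)

candidates = [rng.standard_normal(3) for _ in range(5)]
by_nll = sorted(range(5), key=lambda i: neg_log_likelihood(candidates[i], X, y))
by_sse = sorted(range(5), key=lambda i: squared_error(candidates[i], X, y))
print(by_nll == by_sse)   # True: both objectives prefer the same weights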
To obtain the closed-form solution for w_ML, we differentiate the negative
log-likelihood with respect to w and set the derivative to zero:

∇_w(−log ℒ(w; X, y)) = (1/𝜎²) ∑_{i=1}^{n} (wᵀx_i − y_i) x_i = 0

This can be rewritten as:

(1/𝜎²)(XXᵀw − Xy) = 0

where X is the matrix whose columns are the input vectors x_i and y is the column
vector of labels. Rearranging the equation, we have:

XXᵀw = Xy

To obtain the closed-form solution for w_ML, we multiply both sides by the
inverse of XXᵀ, denoted (XXᵀ)⁻¹ (assuming XXᵀ is invertible; otherwise the
pseudo-inverse is used, as before):

w_ML = (XXᵀ)⁻¹Xy

Thus, the closed-form solution for the maximum likelihood estimate w_ML is
given by the product of (XXᵀ)⁻¹ and Xy.
This is exactly the least-squares solution derived earlier, confirming that, under
Gaussian noise, maximum likelihood estimation and minimizing the squared error
lead to the same estimate. The closed-form solution provides an efficient and
direct way to estimate the weight vector w from the given data.
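
As a closing sketch, the probabilistic model can be simulated directly: data are
generated as y_i = wᵀx_i + 𝜖_i with Gaussian noise, and the closed-form estimate
recovers the underlying weights (all values and names here are illustrative, not
from the lecture):

import numpy as np

rng = np.random.default_rng(42)
d, n, sigma = 3, 1000, 0.3
w_true = np.array([2.0, -1.0, 0.5])

# Simulate the model: y_i = w^T x_i + eps_i, with eps_i ~ N(0, sigma^2)
X = rng.standard_normal((d, n))
y = X.T @ w_true + sigma * rng.standard_normal(n)

# Maximum likelihood estimate: w_ML = (X X^T)^{-1} X y
w_ml = np.linalg.solve(X @ X.T, X @ y)
print(w_ml)   # approaches w_true as n grows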

Acknowledgments
Professor Arun Rajkumar: The content, including the concepts and notations
presented in this document, has been sourced from his slides and lectures. His
expertise and educational materials have greatly contributed to the development
of this document.
