Linear Regression
S. Sumitra
Notation: $x_i$: $i$th data point; $x^T$: transpose of $x$; $x_{ij}$: $j$th attribute of the $i$th data point.
Let $\{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$ be the given data, with $x_i \in \mathcal{D}$ and $y_i \in \mathcal{Y}$. Here $\mathcal{D}$ denotes the space of input values and $\mathcal{Y}$ denotes the space of output values.
If the target variable is continuous, the learning problem is known as regression; if the target variable $y$ takes only discrete values, it is a classification problem.
In this paper, $\mathcal{D} = \mathbb{R}^n$ and $\mathcal{Y} = \mathbb{R}$.
The simplest linear model represents $f$ as a linear combination of the components of $x$. That is,
$$f(x_i) = w_0 + w_1 x_{i1} + w_2 x_{i2} + \dots + w_n x_{in} \qquad (1)$$
where $x_i = (x_{i1}, x_{i2}, \dots, x_{in})^T$ and $w = (w_0, w_1, \dots, w_n)^T \in \mathbb{R}^{n+1}$. Here the $w_j$'s are the parameters, parameterizing the space of linear functions mapping from $\mathcal{D}$ to $\mathcal{Y}$.
(1) is the equation of a hyperplane.
By taking $x_{i0} = 1$, (1) can be written as
$$f(x_i) = \sum_{j=0}^{n} w_j x_{ij} = w^T x_i \qquad (2)$$
Given a training set, how do we choose the values of the $w_j$'s?
As there are $N$ data points, we can write $N$ equations:
$$f(x_i) = \sum_{j=0}^{n} w_j x_{ij} = w^T x_i, \quad i = 1, 2, \dots, N \qquad (3)$$
Define the design matrix to be
$$X = \begin{pmatrix} (x_1)^T \\ (x_2)^T \\ \vdots \\ (x_N)^T \end{pmatrix}$$
Here, the $x_i$'s are the training data (each $x_i$ is an $n$-dimensional vector). $X$ is an $N \times (n+1)$ matrix if we include the intercept term, that is, if we set $x_i = (x_{i0}, x_{i1}, \dots, x_{in})^T$ with $x_{i0} = 1$. Let $y$ be the $N$-dimensional vector that contains all the target values, that is,
$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$
Now the matrix representation for (3) is
$$Xw = y \qquad (4)$$
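As an illustrative sketch (not from the original note), the design matrix with the intercept column and the target vector can be assembled in Python/NumPy as follows; the numbers and variable names are made up for illustration.

import numpy as np

# Raw training inputs: N points, each an n-dimensional vector (here N = 4, n = 2).
X_raw = np.array([[0.5, 1.2],
                  [1.0, 0.3],
                  [1.5, 2.1],
                  [2.0, 0.9]])
y = np.array([2.1, 1.4, 3.9, 2.6])       # one target value per data point

N, n = X_raw.shape
X = np.hstack([np.ones((N, 1)), X_raw])  # prepend x_i0 = 1 (intercept term), giving an N x (n+1) matrix

w = np.zeros(n + 1)                      # parameter vector (w_0, ..., w_n)
f = X @ w                                # predictions f(x_i) = w^T x_i for the current w, cf. (2)-(4)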
The range space of $X$ is spanned by the columns of $X$. If $X$ is a square matrix and its inverse exists, $w = X^{-1} y$. This is not always the case: (4) may have no solution or more than one solution. The former happens when $X$ is not onto, and the latter when $X$ is not one-to-one.
Next we consider the case of rectangular matrices.
1 y is not in the range of X (n + 1 < N )
We will first consider the case when $y$ is not in the range of $X$. In this case we find the preimage of the projection of $y$ onto the range space of $X$. That is, the optimal solution is
$$w^* = \arg\min_{w \in \mathbb{R}^{n+1}} J(w)$$
where
$$J(w) = \frac{1}{2}\,(d(Xw, y))^2 = \frac{1}{2}\,\|Xw - y\|^2 \qquad (5)$$
$J(w)$ is called the least squares cost function.
Now $J(w) = \frac{1}{2}(d(Xw, y))^2 = \frac{1}{2}\|Xw - y\|^2 = \frac{1}{2}\langle Xw - y,\, Xw - y\rangle$.
At the minimizing $w$, $\nabla J = 0$. That is,
$$\nabla J = X^T X w - X^T y = 0$$
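This gradient follows from expanding the inner product (a routine step made explicit here for completeness):
$$J(w) = \frac{1}{2}\left(w^T X^T X w - 2\, w^T X^T y + y^T y\right), \qquad \nabla J(w) = X^T X w - X^T y.$$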
Hence,
$$X^T X w = X^T y \qquad (6)$$
(6) is called the normal equation. Hence,
$$w = (X^T X)^{-1} X^T y \qquad (7)$$
provided $(X^T X)^{-1}$ exists. $(X^T X)^{-1} X^T$ is called the pseudo-inverse. Determining $w$ this way requires computing the inverse of $X^T X$, which is not computationally efficient for large data sets. Hence we resort to iterative search algorithms for finding $w$.
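A minimal sketch of the closed-form solution (7) in Python/NumPy, assuming $X$ and $y$ are built as in the earlier snippet; the function name is my own.

import numpy as np

def normal_equation_fit(X, y):
    # Solve the normal equation X^T X w = X^T y from (6)-(7).
    # np.linalg.solve is used instead of forming the inverse explicitly,
    # which is numerically better behaved.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Equivalently, np.linalg.lstsq(X, y, rcond=None)[0] returns the same
# least-squares solution without forming X^T X at all.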
1.1 Least Mean Squares Algorithm
To find the $w$ that minimizes $J$, we apply an iterative search algorithm. Such an algorithm starts with an initial guess of $w$ and then repeatedly changes $w$ to make $J(w)$ smaller, until it converges to the value of $w$ that minimizes $J(w)$. We consider gradient descent for finding $w$.
[Gradient descent: if a real-valued function $F(x)$ is defined and differentiable in a neighbourhood of a point $a$, then $F(x)$ decreases fastest if one goes from $a$ in the direction of the negative gradient of $F$ at $a$, that is, $-\nabla F(a)$. The gradient of a function is always perpendicular to its contour lines. (A contour line of a function of two variables is a curve along which the function has a constant value.)]
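For instance (an illustrative example, not in the original), for $F(x) = x^2$ the update $x := x - \alpha F'(x) = (1 - 2\alpha)x$ with $\alpha = 0.1$ and $x_0 = 1$ gives the iterates $0.8, 0.64, 0.512, \dots$, which decrease toward the minimizer $x = 0$.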
(5) can also be written as
$$J(w) = \frac{1}{2}\sum_{i=1}^{N} (f(x_i) - y_i)^2 \qquad (8)$$
To apply gradient descent, choose an initial $w = (w_0, w_1, \dots, w_n)^T \in \mathbb{R}^{n+1}$ and then repeatedly perform the update
$$w := w - \alpha \nabla J \qquad (9)$$
Here, $\alpha > 0$ is called the learning rate.
Since $f$ is a function of $w_0, w_1, \dots, w_n$, $J$ is a function of $w_0, w_1, \dots, w_n$. Therefore,
$$\nabla J = \left(\frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \dots, \frac{\partial J}{\partial w_n}\right)^T \qquad (10)$$
Substituting (10) in (9),
$$(w_0, w_1, \dots, w_n)^T := (w_0, w_1, \dots, w_n)^T - \alpha \left(\frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \dots, \frac{\partial J}{\partial w_n}\right)^T \qquad (11)$$
Hence,
$$w_j := w_j - \alpha \frac{\partial J}{\partial w_j}, \quad j = 0, 1, \dots, n \qquad (12)$$
Now
\begin{align*}
\frac{\partial J}{\partial w_j} &= \frac{\partial}{\partial w_j}\,\frac{1}{2}\sum_{i=1}^{N} (f(x_i) - y_i)^2 \\
&= \sum_{i=1}^{N} (f(x_i) - y_i)\,\frac{\partial}{\partial w_j}\left(\sum_{k=0}^{n} w_k x_{ik}\right) \\
&= \sum_{i=1}^{N} (f(x_i) - y_i)\, x_{ij}
\end{align*}
Therefore,
$$w_j := w_j + \alpha \sum_{i=1}^{N} (y_i - f(x_i))\, x_{ij}, \quad j = 0, 1, 2, \dots, n \qquad (13)$$
(13) is called the LMS (least mean squares) update or the Widrow-Hoff learning rule. Pseudocode for the algorithm can be written as:
Iterate until convergence {
    $w_j := w_j + \alpha \sum_{i=1}^{N} (y_i - f(x_i))\, x_{ij}, \quad j = 0, 1, 2, \dots, n$
}
The magnitude of the parameter update is proportional to the error term $(y_i - f(x_i))$: a larger change to the parameter is made when the error is large, and vice versa.
To update the parameters, the algorithm looks at every data point in the training set at every step, and hence it is called batch gradient descent. In general, gradient descent does not guarantee a global minimum. However, as $J$ is a convex quadratic function, the algorithm converges to the global minimum (assuming $\alpha$ is not too large).
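A minimal Python/NumPy sketch of batch gradient descent implementing (13); the function name, the fixed iteration cap, and the stopping tolerance are illustrative choices, not part of the original note.

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, max_iters=10000, tol=1e-8):
    # X: N x (n+1) design matrix (first column all ones); y: N-vector of targets.
    N, d = X.shape
    w = np.zeros(d)                          # initial guess for (w_0, ..., w_n)
    for _ in range(max_iters):
        error = y - X @ w                    # error terms (y_i - f(x_i)) for all i
        w_new = w + alpha * (X.T @ error)    # LMS update (13), all j at once
        if np.linalg.norm(w_new - w) < tol:  # stop when the update becomes negligible
            return w_new
        w = w_new
    return w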
There is an alternative to batch gradient descent called stochastic gradient descent, which can be stated as follows:
Iterate until convergence {
    for i = 1 to N {
        $w_j := w_j + \alpha (y_i - f(x_i))\, x_{ij}, \quad j = 0, 1, 2, \dots, n$
    }
}
In contrast to batch gradient descent, stochastic gradient descent processes only one training point at each step. Hence when $N$ becomes large, that is, for large data sets, stochastic gradient descent is more computationally efficient than batch gradient descent.
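A corresponding sketch of stochastic gradient descent, under the same assumptions as the batch version above; the number of passes over the data is an illustrative choice.

import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_passes=50):
    # X: N x (n+1) design matrix; y: N-vector of targets.
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_passes):                # repeated passes over the training set
        for i in range(N):                   # update after looking at a single point
            error_i = y[i] - X[i] @ w        # (y_i - f(x_i)) for this point only
            w = w + alpha * error_i * X[i]   # per-point LMS update
    return w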
2 y has more than one pre-image (N < (n + 1))
The second case is when $y$ is in the range of $X$ but has more than one pre-image. As there would be more than one $w$ that satisfies the given equation, the following constrained optimization problem has to be considered:
$$\min_{w \in \mathbb{R}^{n+1}} \|w\|^2 \quad \text{subject to } Xw = y \qquad (14)$$
By applying Lagrangian theory,
$$L(w, \lambda) = \|w\|^2 + \lambda^T (Xw - y)$$
where $\lambda^T = (\lambda_1, \lambda_2, \dots, \lambda_N)$ and $\lambda_i,\ i = 1, 2, \dots, N$ are the Lagrange multipliers. By equating $\frac{\partial L}{\partial w} = 0$,
$$2w + X^T \lambda = 0$$
Hence
$$w = -\frac{X^T \lambda}{2} \qquad (15)$$
By equating $\frac{\partial L}{\partial \lambda} = 0$ we get
$$Xw - y = 0 \qquad (16)$$
Using (15), the above equation becomes
$$-\frac{X X^T \lambda}{2} = y$$
Therefore
$$\lambda = -2 (X X^T)^{-1} y \qquad (17)$$
Substituting (17) in (15),
$$w = X^T (X X^T)^{-1} y$$
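A minimal Python/NumPy sketch of this minimum-norm solution; the toy dimensions and numbers are illustrative only. np.linalg.pinv computes the same result via the Moore-Penrose pseudo-inverse when $X$ has full row rank.

import numpy as np

def min_norm_solution(X, y):
    # Minimum-norm solution w = X^T (X X^T)^{-1} y for the underdetermined case N < n + 1.
    return X.T @ np.linalg.solve(X @ X.T, y)

# Example with N = 2 equations and n + 1 = 4 unknowns (toy numbers).
X = np.array([[1.0, 0.5, 2.0, 1.5],
              [1.0, 1.0, 0.0, 3.0]])
y = np.array([1.0, 2.0])
w = min_norm_solution(X, y)
# np.allclose(X @ w, y) holds, and np.linalg.pinv(X) @ y gives the same w.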