
Linear Regression

S. Sumitra

Notations: $x_i$: $i$th data point; $x^T$: transpose of $x$; $x_{ij}$: $i$th data point's $j$th attribute.
Let $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ be the given data, with $x_i \in D$ and $y_i \in Y$. Here $D$ denotes the space of input values and $Y$ denotes the space of output values.
If the target variable is continuous, the learning problem is known as regression. If the target variable $y$ takes only discrete values, it is a classification problem.
In this paper, $D = \mathbb{R}^n$ and $Y = \mathbb{R}$.
The simplest linear model is the representation of $f$ as a linear combination of the components of $x$. That is,
$$f(x_i) = w_0 + w_1 x_{i1} + w_2 x_{i2} + \ldots + w_n x_{in} \qquad (1)$$
where $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})^T$ and $w = (w_0, w_1, \ldots, w_n)^T \in \mathbb{R}^{n+1}$. Here, the $w_i$'s are the parameters, parameterizing the space of linear functions mapping from $D$ to $Y$. (1) is the equation of a hyperplane.
By taking $x_{i0} = 1$, (1) can be written as
$$f(x_i) = \sum_{j=0}^{n} w_j x_{ij} = w^T x_i \qquad (2)$$

Given a training set, how do we choose the values of the $w_i$'s?


As there are $N$ data points, we can write $N$ equations such that
$$f(x_i) = \sum_{j=0}^{n} w_j x_{ij} = w^T x_i, \quad i = 1, 2, \ldots, N \qquad (3)$$
Define the design matrix to be
$$X = \begin{pmatrix} (x_1)^T \\ (x_2)^T \\ \vdots \\ (x_N)^T \end{pmatrix}$$
Here, the $x_i$'s are the training data (each $x_i$ is an $n$-dimensional vector). $X$ is an $N \times (n+1)$ matrix if we include the intercept term, that is, if we set $x_i = (x_{i0}, x_{i1}, \ldots, x_{in})^T$ with $x_{i0} = 1$. Let $y$ be the $N$-dimensional vector that contains all the target values, that is,
$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}.$$
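As a concrete illustration, the design matrix and target vector can be assembled as in the following minimal NumPy sketch; the data values are made up for illustration:

```python
import numpy as np

# Toy data: N = 4 points in R^2 (n = 2), with made-up target values.
data = np.array([[1.0, 2.0],
                 [2.0, 0.5],
                 [3.0, 1.5],
                 [4.0, 3.0]])          # shape (N, n)
y = np.array([3.1, 2.4, 4.6, 6.9])     # shape (N,)

# Prepend the intercept term x_{i0} = 1 to every data point,
# giving the N x (n + 1) design matrix X.
N = data.shape[0]
X = np.hstack([np.ones((N, 1)), data])  # shape (N, n + 1)
```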
Now the matrix representation of (3) is
$$Xw = y \qquad (4)$$
The range space of $X$ is spanned by the columns of $X$. If $X$ is a square matrix and its inverse exists, $w = X^{-1}y$. This is not always the case: (4) may have no solution or more than one solution. The former case happens when $X$ is not onto, and the latter case happens when $X$ is not one to one.
Next we consider the case of rectangular matrices.

1  $y$ is not in the range of $X$ ($n + 1 < N$)


We will first consider the case when $y$ is not in the range of $X$. In this case we find the preimage of the projection of $y$ onto the range space of $X$. That is, the optimal solution is
$$w^* = \arg\min_{w \in \mathbb{R}^{n+1}} J(w)$$
where
$$J(w) = \frac{1}{2}(d(Xw, y))^2 = \frac{1}{2}\|Xw - y\|^2 \qquad (5)$$
$J(w)$ is called the least squares cost function.
Now $J(w) = \frac{1}{2}(d(Xw, y))^2 = \frac{1}{2}\|Xw - y\|^2 = \frac{1}{2}\langle Xw - y, Xw - y \rangle$.
At the minimizing value of $w$, $\nabla J = 0$. That is,
$$\nabla J = X^T X w - X^T y = 0$$
Hence,
$$X^T X w = X^T y \qquad (6)$$
(6) is called the normal equation. Hence,
$$w = (X^T X)^{-1} X^T y \qquad (7)$$
provided $(X^T X)^{-1}$ exists. $(X^T X)^{-1} X^T$ is called the pseudo-inverse. For determining $w$ using this derivative method, the inverse of $X^T X$ has to be computed, which is not computationally efficient for large data sets. Hence we resort to iterative search algorithms for finding $w$.
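For problems where $X^T X$ is small enough to factor, (7) can still be computed directly. Below is a minimal sketch continuing the NumPy example above; the helper name `normal_equation` is our own, not from the notes:

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X w = X^T y for w, as in equations (6) and (7)."""
    # Solving the linear system is preferable to forming (X^T X)^{-1} explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Equivalently, via the pseudo-inverse:
# w = np.linalg.pinv(X) @ y
```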

1.1 Least Mean Squares Algorithm


For finding the $w$ that minimizes $J$, we apply an iterative search algorithm. An iterative search algorithm that minimizes $J(w)$ starts with an initial guess of $w$ and then repeatedly changes $w$ to make $J(w)$ smaller, until it converges to the value of $w$ that minimizes $J(w)$. We consider gradient descent for finding $w$.
[Gradient descent: If a real valued function $F(x)$ is defined and differentiable in a neighbourhood of a point $a$, then $F(x)$ decreases fastest if one goes from $a$ in the direction of the negative gradient of $F$ at $a$, that is, $-\nabla F(a)$. The gradient of the function is always perpendicular to the contour lines. (A contour line of a function of two variables is a curve along which the function has a constant value.)]
(5) can also be written as
$$J(w) = \frac{1}{2}\sum_{i=1}^{N} (f(x_i) - y_i)^2 \qquad (8)$$

For applying gradient descent, consider the following steps. Choose an initial $w = (w_0, w_1, \ldots, w_n)^T \in \mathbb{R}^{n+1}$. Then repeatedly perform the update
$$w := w - \alpha \nabla J \qquad (9)$$
Here, $\alpha > 0$ is called the learning rate.


Since $f$ is a function of $w_0, w_1, \ldots, w_n$, $J$ is a function of $w_0, w_1, \ldots, w_n$. Therefore,
$$\nabla J = \left(\frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_n}\right)^T \qquad (10)$$

Substituting (10) in (9),
$$(w_0, w_1, \ldots, w_n)^T := (w_0, w_1, \ldots, w_n)^T - \alpha \left(\frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_n}\right)^T \qquad (11)$$
Hence,
$$w_j := w_j - \alpha \frac{\partial J}{\partial w_j}, \quad j = 0, 1, \ldots, n \qquad (12)$$
Now
$$\begin{aligned}
\frac{\partial J}{\partial w_j} &= \frac{\partial}{\partial w_j}\, \frac{1}{2}\sum_{i=1}^{N} (f(x_i) - y_i)^2 \\
&= \sum_{i=1}^{N} (f(x_i) - y_i)\, \frac{\partial}{\partial w_j}\left(\sum_{k=0}^{n} w_k x_{ik}\right) \\
&= \sum_{i=1}^{N} (f(x_i) - y_i)\, x_{ij}
\end{aligned}$$

Therefore,
$$w_j := w_j + \alpha \sum_{i=1}^{N} (y_i - f(x_i))\, x_{ij}, \quad j = 0, 1, 2, \ldots, n \qquad (13)$$

(13) is called the LMS (least mean squares) update or the Widrow-Hoff learning rule. Pseudocode of the algorithm can be written as:
Iterate until convergence {
$$w_j := w_j + \alpha \sum_{i=1}^{N} (y_i - f(x_i))\, x_{ij}, \quad j = 0, 1, 2, \ldots, n$$
}
The magnitude of the parameter update is proportional to the error term $(y_i - f(x_i))$: a larger change to the parameter is made when the error is large, and vice versa.
To update the parameters, the algorithm looks at every data point in the training set at every step, and hence it is called batch gradient descent. In general, gradient descent does not guarantee a global minimum. However, as $J$ is a convex quadratic function, the algorithm converges to the global minimum (assuming $\alpha$ is not too large).
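A minimal sketch of batch gradient descent with the LMS update (13) follows; the learning rate and iteration count are illustrative choices, not values prescribed by the notes:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for J(w) = 1/2 ||Xw - y||^2, using update (13)."""
    N, d = X.shape            # d = n + 1, intercept column included in X
    w = np.zeros(d)           # initial guess
    for _ in range(num_iters):
        errors = y - X @ w              # (y_i - f(x_i)) for all N points
        w = w + alpha * (X.T @ errors)  # every step uses the whole training set
    return w
```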

There is an alternative to batch gradient descent, called stochastic gradient descent, which can be stated as follows:
Iterate until convergence {
for i = 1 to N {
$$w_j := w_j + \alpha (y_i - f(x_i))\, x_{ij}, \quad j = 0, 1, 2, \ldots, n$$
}
}
In contrast to batch gradient descent, stochastic gradient descent processes only one training point at each step. Hence when $N$ becomes large, that is, for large data sets, stochastic gradient descent is more computationally efficient than batch gradient descent.
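Below is a corresponding sketch of stochastic gradient descent, which updates $w$ after each individual training point; the number of passes over the data and the learning rate are again illustrative assumptions:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_passes=50):
    """Stochastic gradient descent: one LMS update per training point."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(num_passes):
        for i in range(N):
            error = y[i] - X[i] @ w       # error on the i-th point only
            w = w + alpha * error * X[i]  # update uses a single point
    return w
```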

2  $y$ has more than one pre-image ($N < n + 1$)


The second case is when $y$ is in the range of $X$ but has more than one pre-image. As there would be more than one $w$ that satisfies the given equation, the following constrained optimization problem has to be considered:
$$\begin{aligned}
\underset{w \in \mathbb{R}^{n+1}}{\text{minimize}} \quad & \|w\|^2 \\
\text{subject to} \quad & Xw = y
\end{aligned} \qquad (14)$$
By applying Lagrangian theory,
$$L(w, \lambda) = \|w\|^2 + \lambda^T (Xw - y)$$
where $\lambda^T = (\lambda_1, \lambda_2, \ldots, \lambda_N)$ and $\lambda_i$, $i = 1, 2, \ldots, N$ are the Lagrange multipliers. By equating $\frac{\partial L}{\partial w} = 0$,
$$2w + X^T \lambda = 0$$
Hence
$$w = -\frac{X^T \lambda}{2} \qquad (15)$$
By equating $\frac{\partial L}{\partial \lambda} = 0$ we get
$$Xw - y = 0 \qquad (16)$$

Using (15), the above equation becomes
$$-\frac{XX^T \lambda}{2} = y$$
Therefore
$$\lambda = -2(XX^T)^{-1} y \qquad (17)$$
Substituting (17) into (15),
$$w = X^T (XX^T)^{-1} y$$
