Linear Regression
S. Sumitra
Notation: $x_i$: $i$th data point; $x^T$: transpose of $x$; $x_{ij}$: $j$th attribute of the $i$th data point.
Let $\{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$ be the given data, with $x_i \in \mathcal{D}$ and $y_i \in \mathcal{Y}$. Here $\mathcal{D}$ denotes the space of input values and $\mathcal{Y}$ denotes the space of output values.
If the target variable is continuous, the learning problem is known as regression; if the target variable $y$ takes only discrete values, it is a classification problem.
In this paper, $\mathcal{D} = \mathbb{R}^n$ and $\mathcal{Y} = \mathbb{R}$.
The simplest linear model represents $f$ as a linear combination of the components of $x$. That is,
$$f(x_i) = w_0 + w_1 x_{i1} + w_2 x_{i2} + \dots + w_n x_{in} \qquad (1)$$
where $x_i = (x_{i1}, x_{i2}, \dots, x_{in})^T$ and $w = (w_0, w_1, \dots, w_n)^T \in \mathbb{R}^{n+1}$. Here the $w_j$'s are the parameters, parameterizing the space of linear functions mapping from $\mathcal{D}$ to $\mathcal{Y}$.
(1) is the equation of a hyperplane.
By taking $x_{i0} = 1$, (1) can be written as
$$f(x_i) = \sum_{j=0}^{n} w_j x_{ij} = w^T x_i \qquad (2)$$
Given a training set, how do we choose the values of the $w_j$'s?
As there are $N$ data points, we can write $N$ equations:
$$f(x_i) = \sum_{j=0}^{n} w_j x_{ij} = w^T x_i, \quad i = 1, 2, \dots, N \qquad (3)$$
Define the design matrix to be
$$X = \begin{pmatrix} (x_1)^T \\ (x_2)^T \\ \vdots \\ (x_N)^T \end{pmatrix}$$
Here, the $x_i$'s are the training data (each $x_i$ is an $n$-dimensional vector). $X$ is an $N \times (n+1)$ matrix if we include the intercept term, that is, if we set $x_i = (x_{i0}, x_{i1}, \dots, x_{in})^T$ with $x_{i0} = 1$. Let $y$ be the $N$-dimensional vector that contains all the target values, that is,
$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$
Now the matrix representation for (3) is
$$Xw = y \qquad (4)$$
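As an illustrative sketch (not from the original note), the design matrix with the intercept column and the target vector can be assembled in Python/NumPy as follows; the numbers and variable names are made up for illustration.

import numpy as np

# Raw training inputs: N points, each an n-dimensional vector (here N = 4, n = 2).
X_raw = np.array([[0.5, 1.2],
                  [1.0, 0.3],
                  [1.5, 2.1],
                  [2.0, 0.9]])
y = np.array([2.1, 1.4, 3.9, 2.6])       # one target value per data point

N, n = X_raw.shape
X = np.hstack([np.ones((N, 1)), X_raw])  # prepend x_i0 = 1 (intercept term), giving an N x (n+1) matrix

w = np.zeros(n + 1)                      # parameter vector (w_0, ..., w_n)
f = X @ w                                # predictions f(x_i) = w^T x_i for the current w, cf. (2)-(4)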
The range space of $X$ is spanned by the columns of $X$. If $X$ is a square matrix and its inverse exists, $w = X^{-1} y$. This is not always the case: (4) may have no solution or more than one solution. The former happens when $X$ is not onto, and the latter when $X$ is not one-to-one.
Next we consider the case of rectangular matrices.
1 y is not in the range of X (n + 1 < N )
We will first consider the case when $y$ is not in the range of $X$. In this case we find the preimage of the projection of $y$ onto the range space of $X$. That is, the optimal solution is
$$w^* = \arg\min_{w \in \mathbb{R}^{n+1}} J(w)$$
where
$$J(w) = \frac{1}{2}\,(d(Xw, y))^2 = \frac{1}{2}\,\|Xw - y\|^2 \qquad (5)$$
$J(w)$ is called the least squares cost function.
Now $J(w) = \frac{1}{2}(d(Xw, y))^2 = \frac{1}{2}\|Xw - y\|^2 = \frac{1}{2}\langle Xw - y,\, Xw - y\rangle$.
At the minimizing $w$, $\nabla J = 0$. That is,
$$\nabla J = X^T X w - X^T y = 0$$
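This gradient follows from expanding the inner product (a routine step made explicit here for completeness):
$$J(w) = \frac{1}{2}\left(w^T X^T X w - 2\, w^T X^T y + y^T y\right), \qquad \nabla J(w) = X^T X w - X^T y.$$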
Hence,
$$X^T X w = X^T y \qquad (6)$$
(6) is called the normal equation. Hence,
$$w = (X^T X)^{-1} X^T y \qquad (7)$$
provided $(X^T X)^{-1}$ exists. $(X^T X)^{-1} X^T$ is called the pseudo-inverse. Determining $w$ this way requires computing the inverse of $X^T X$, which is not computationally efficient for large data sets. Hence we resort to iterative search algorithms for finding $w$.
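A minimal sketch of the closed-form solution (7) in Python/NumPy, assuming $X$ and $y$ are built as in the earlier snippet; the function name is my own.

import numpy as np

def normal_equation_fit(X, y):
    # Solve the normal equation X^T X w = X^T y from (6)-(7).
    # np.linalg.solve is used instead of forming the inverse explicitly,
    # which is numerically better behaved.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Equivalently, np.linalg.lstsq(X, y, rcond=None)[0] returns the same
# least-squares solution without forming X^T X at all.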
1.1 Least Mean Squares Algorithm
To find the $w$ that minimizes $J$, we apply an iterative search algorithm. Such an algorithm starts with an initial guess of $w$ and then repeatedly changes $w$ to make $J(w)$ smaller, until it converges to the value of $w$ that minimizes $J(w)$. We consider gradient descent for finding $w$.
[Gradient descent: if a real-valued function $F(x)$ is defined and differentiable in a neighbourhood of a point $a$, then $F(x)$ decreases fastest if one goes from $a$ in the direction of the negative gradient of $F$ at $a$, that is, $-\nabla F(a)$. The gradient of a function is always perpendicular to its contour lines. (A contour line of a function of two variables is a curve along which the function has a constant value.)]
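For instance (an illustrative example, not in the original), for $F(x) = x^2$ the update $x := x - \alpha F'(x) = (1 - 2\alpha)x$ with $\alpha = 0.1$ and $x_0 = 1$ gives the iterates $0.8, 0.64, 0.512, \dots$, which decrease toward the minimizer $x = 0$.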
(5) can also be written as
$$J(w) = \frac{1}{2}\sum_{i=1}^{N} (f(x_i) - y_i)^2 \qquad (8)$$
To apply gradient descent, choose an initial $w = (w_0, w_1, \dots, w_n)^T \in \mathbb{R}^{n+1}$ and then repeatedly perform the update
$$w := w - \alpha \nabla J \qquad (9)$$
Here, $\alpha > 0$ is called the learning rate.
Since $f$ is a function of $w_0, w_1, \dots, w_n$, $J$ is a function of $w_0, w_1, \dots, w_n$. Therefore,
$$\nabla J = \left(\frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \dots, \frac{\partial J}{\partial w_n}\right)^T \qquad (10)$$
Substituting (10) in (9),
$$(w_0, w_1, \dots, w_n)^T := (w_0, w_1, \dots, w_n)^T - \alpha \left(\frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \dots, \frac{\partial J}{\partial w_n}\right)^T \qquad (11)$$
Hence,
$$w_j := w_j - \alpha \frac{\partial J}{\partial w_j}, \quad j = 0, 1, \dots, n \qquad (12)$$
Now
\begin{align*}
\frac{\partial J}{\partial w_j} &= \frac{\partial}{\partial w_j}\,\frac{1}{2}\sum_{i=1}^{N} (f(x_i) - y_i)^2 \\
&= \sum_{i=1}^{N} (f(x_i) - y_i)\,\frac{\partial}{\partial w_j}\left(\sum_{k=0}^{n} w_k x_{ik}\right) \\
&= \sum_{i=1}^{N} (f(x_i) - y_i)\, x_{ij}
\end{align*}
Therefore,
$$w_j := w_j + \alpha \sum_{i=1}^{N} (y_i - f(x_i))\, x_{ij}, \quad j = 0, 1, 2, \dots, n \qquad (13)$$
(13) is called the LMS (least mean squares) update or the Widrow-Hoff learning rule. Pseudocode for the algorithm can be written as:
Iterate until convergence {
    $w_j := w_j + \alpha \sum_{i=1}^{N} (y_i - f(x_i))\, x_{ij}, \quad j = 0, 1, 2, \dots, n$
}
The magnitude of the parameter update is proportional to the error term $(y_i - f(x_i))$: a larger change to the parameter is made when the error is large, and vice versa.
To update the parameters, the algorithm looks at every data point in the training set at every step, and hence it is called batch gradient descent. In general, gradient descent does not guarantee a global minimum. However, as $J$ is a convex quadratic function, the algorithm converges to the global minimum (assuming $\alpha$ is not too large).
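A minimal Python/NumPy sketch of batch gradient descent implementing (13); the function name, the fixed iteration cap, and the stopping tolerance are illustrative choices, not part of the original note.

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, max_iters=10000, tol=1e-8):
    # X: N x (n+1) design matrix (first column all ones); y: N-vector of targets.
    N, d = X.shape
    w = np.zeros(d)                          # initial guess for (w_0, ..., w_n)
    for _ in range(max_iters):
        error = y - X @ w                    # error terms (y_i - f(x_i)) for all i
        w_new = w + alpha * (X.T @ error)    # LMS update (13), all j at once
        if np.linalg.norm(w_new - w) < tol:  # stop when the update becomes negligible
            return w_new
        w = w_new
    return w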
There is an alternative to batch gradient descent called stochastic gradient descent, which can be stated as follows:
Iterate until convergence {
    for i = 1 to N {
        $w_j := w_j + \alpha (y_i - f(x_i))\, x_{ij}, \quad j = 0, 1, 2, \dots, n$
    }
}
In contrast to batch gradient descent, stochastic gradient descent processes only one training point at each step. Hence when $N$ becomes large, that is, for large data sets, stochastic gradient descent is more computationally efficient than batch gradient descent.
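A corresponding sketch of stochastic gradient descent, under the same assumptions as the batch version above; the number of passes over the data is an illustrative choice.

import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_passes=50):
    # X: N x (n+1) design matrix; y: N-vector of targets.
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_passes):                # repeated passes over the training set
        for i in range(N):                   # update after looking at a single point
            error_i = y[i] - X[i] @ w        # (y_i - f(x_i)) for this point only
            w = w + alpha * error_i * X[i]   # per-point LMS update
    return w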
2 y has more than one pre-image (N < (n + 1))
The second case is when $y$ is in the range of $X$ but has more than one pre-image. As there would be more than one $w$ that satisfies the given equation, the following constrained optimization problem has to be considered:
$$\min_{w \in \mathbb{R}^{n+1}} \|w\|^2 \quad \text{subject to } Xw = y \qquad (14)$$
By applying Lagrangian theory,
$$L(w, \lambda) = \|w\|^2 + \lambda^T (Xw - y)$$
where $\lambda^T = (\lambda_1, \lambda_2, \dots, \lambda_N)$ and $\lambda_i,\ i = 1, 2, \dots, N$ are the Lagrange multipliers. By equating $\frac{\partial L}{\partial w} = 0$,
$$2w + X^T \lambda = 0$$
Hence
$$w = -\frac{X^T \lambda}{2} \qquad (15)$$
By equating $\frac{\partial L}{\partial \lambda} = 0$ we get
$$Xw - y = 0 \qquad (16)$$
Using (15), the above equation becomes
$$-\frac{X X^T \lambda}{2} = y$$
Therefore
$$\lambda = -2 (X X^T)^{-1} y \qquad (17)$$
Substituting (17) in (15),
$$w = X^T (X X^T)^{-1} y$$
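A minimal Python/NumPy sketch of this minimum-norm solution; the toy dimensions and numbers are illustrative only. np.linalg.pinv computes the same result via the Moore-Penrose pseudo-inverse when $X$ has full row rank.

import numpy as np

def min_norm_solution(X, y):
    # Minimum-norm solution w = X^T (X X^T)^{-1} y for the underdetermined case N < n + 1.
    return X.T @ np.linalg.solve(X @ X.T, y)

# Example with N = 2 equations and n + 1 = 4 unknowns (toy numbers).
X = np.array([[1.0, 0.5, 2.0, 1.5],
              [1.0, 1.0, 0.0, 3.0]])
y = np.array([1.0, 2.0])
w = min_norm_solution(X, y)
# np.allclose(X @ w, y) holds, and np.linalg.pinv(X) @ y gives the same w.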