Linear Regression
• A supervised machine learning algorithm.
• Predicted output is continuous, e.g., estimating sales, cost, etc.
• Let x denote the set of input variables and y denote the
output variable.
• As y is obtained from x, x is also called a set of Attributes or
Features used to determine y.
• Collect n data points, (xi , yi ), i = 1, 2, · · · n – Training data.
• Choose a linear function, y = f (x) and estimate coefficients
of f using n training points – Learning.
• That means, f , which gives the relationship between x and y ,
is learnt from data.
Linear Regression (contd.)
• After learning f , new points can be predicted by, y = f (x).
• Types of linear regression:
Simple Regression – Based on a single input variable.
Multivariate Regression – Based on multiple input variables.
• Simple Regression: Choose y = f (x) = w0 + w1 x, and learn
coefficients or weights, w0 and w1 using n training data
points, (xi , yi ), i = 1, 2, · · · n.
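Simple regression can also be solved directly: setting the derivatives of the squared error to zero gives the standard closed-form least-squares formulas for w0 and w1. A minimal sketch, with made-up illustrative data:

```python
# Simple linear regression y = w0 + w1*x, fit by the closed-form
# least-squares formulas (illustrative sketch; data is made up).
n = 5
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]   # exactly y = 2x, so w0 = 0, w1 = 2

x_mean = sum(xs) / n
y_mean = sum(ys) / n
# w1 = covariance(x, y) / variance(x); w0 = y_mean - w1 * x_mean
w1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
     sum((x - x_mean) ** 2 for x in xs)
w0 = y_mean - w1 * x_mean

print(w0, w1)   # recovers w0 = 0.0, w1 = 2.0
```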
Linear Regression (contd.)
• Multivariate Regression:
Choose y = f (x1 , x2 , · · · xk ) = w0 + w1 x1 + w2 x2 + · · · + wk xk .
(a) Learn weights, w0 , w1 , · · · wk , using n training data
points, (xi1 , xi2 , · · · xik , yi ), i = 1, 2, · · · n.
(b) Input data is a matrix of size n × (k + 1), where each row
i denotes a k-dimensional input, (xi1, xi2, · · · xik), and its
output, yi.
(c) In compact form, data can be represented by: (xi , yi ),
i = 1, 2, · · · n, where each xi is a k-dimensional vector.
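The multivariate form above can be sketched as a small prediction function; the weight values below are arbitrary for illustration, not learned ones:

```python
# Multivariate regression prediction: f(x) = w0 + w1*x1 + ... + wk*xk.
def predict(w, x):
    """w = [w0, w1, ..., wk]; x = [x1, ..., xk], a k-dimensional input."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))

w = [1.0, 2.0, 3.0]      # w0 = 1, w1 = 2, w2 = 3 (so k = 2)
x = [4.0, 5.0]           # one 2-dimensional input point
print(predict(w, x))     # 1 + 2*4 + 3*5 = 24.0
```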
An example
• Consider the following dataset for a problem related to
computer hardware.
Estimate the CPU relative performance (output variable
denoted by CRP) based on the input attributes: Vendor
name, Model name, Machine cycle time, Minimum main
memory in KB, Maximum main memory in KB, Cache
memory in KB, Minimum and maximum channels.
• Standard datasets for different ML problems are available and
maintained in the UCI repository:
[Link]
An example (contd.)
• Some instances (rows) from the UCI dataset for the computer
hardware problem are shown in the table below.
Vendor     Model     MCT  MinMain  MaxMain  CacheMem  MinCh  MaxCh  CRP
honeywell  dps:8/52  140     2000    32000        32      1     54  141
honeywell  dps:8/62  140     2000    32000        32      1     54  189
ibm        3033:s     57     4000    16000         1      6     12  132
ibm        3033:u     57     4000    24000        64     12     16  237
hp         3000/88    75     3000     8000         8      3     48   64
hp         3000/iii  175      256     2000         0      3     24   22
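For regression, the non-numeric Vendor and Model columns are typically dropped, leaving numeric feature vectors. A sketch encoding a few of the rows above:

```python
# A few rows of the hardware dataset as numeric feature vectors;
# the last value in each row is the output, CRP.
rows = [
    # MCT, MinMain, MaxMain, CacheMem, MinCh, MaxCh, CRP
    [140, 2000, 32000, 32, 1, 54, 141],
    [140, 2000, 32000, 32, 1, 54, 189],
    [57,  4000, 16000,  1, 6, 12, 132],
]
X = [r[:-1] for r in rows]   # inputs: k = 6 attributes per point
y = [r[-1] for r in rows]    # outputs: CRP values
print(len(X), len(X[0]))     # 3 points, 6 features each
```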
Learning algorithm
• Learn coefficients or weights by minimizing an error function.
• Error function: Mean Squared Error (MSE) between actual
and predicted values over n training points.
MSE = J(w0, w1, · · · , wk) = J(W)
    = (1/n) Σ_{i=1}^{n} (yi − f(xi))²
    = (1/n) Σ_{i=1}^{n} (yi − w0 − w1 xi1 − w2 xi2 − · · · − wk xik)²
Note that each xi is a k-dimensional vector and W is the
weight vector, (w0 , w1 , · · · , wk ).
• Here, yi and f (xi ) are the actual and predicted output values
for xi , respectively.
• Compute the weights, w0 , w1 , · · · , wk , in such a manner that
MSE is minimized.
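The MSE defined above translates directly into code. A minimal sketch, with a tiny made-up dataset:

```python
# MSE between actual outputs y_i and predictions f(x_i), as defined above.
def predict(w, x):
    # f(x) = w0 + w1*x1 + ... + wk*xk
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))

def mse(w, X, y):
    n = len(X)
    return sum((yi - predict(w, xi)) ** 2 for xi, yi in zip(X, y)) / n

X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]           # y = 2x exactly
print(mse([0.0, 2.0], X, y))  # perfect weights give MSE = 0.0
print(mse([0.0, 1.0], X, y))  # errors 1, 2, 3 -> (1 + 4 + 9)/3
```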
Learning algorithm (contd.)
• Analytical solution (based on differentiation) is
computationally expensive, especially in higher dimensions.
• As such, Gradient descent algorithm is used to find the
solution iteratively.
Gradient descent method
• Gradient descent is an optimization algorithm used to
minimize some function by iteratively moving in the direction
of steepest descent.
• Steepest descent is defined by the negative of the gradient.
• For the error function J(W), the gradient is denoted by ∇J(W):
∇J(W) = (∂J(W)/∂w0, ∂J(W)/∂w1, · · · , ∂J(W)/∂wk).
• For example,
∂J(W)/∂wj = (1/n) Σ_{i=1}^{n} (−xij) · 2 (yi − w0 − w1 xi1 − w2 xi2 − · · · − wk xik)
• Each component of the above gradient vector gives the rate of
change of J(W) with respect to the corresponding weight, wj.
Gradient descent method (contd.)
• Each weight is updated by taking a step (scaled by η) in the
opposite (negative) direction of the error gradient.
For each j,
δwj = −η ∂J(W)/∂wj,
wj = wj + δwj.
Here, η is a learning-rate parameter that controls how far to move
in the direction of the negative error gradient.
Vector representation: W = W + δW , where
δW = (δw0 , δw1 , · · · , δwk ).
That means, move in the direction of negative gradient
towards the minimum of the error function.
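A single update step, in the vector form W = W + δW, can be sketched as follows (the gradient values here are arbitrary illustrative numbers):

```python
# One gradient-descent update: w_j <- w_j - eta * dJ/dw_j, applied
# component-wise to the whole weight vector.
def step(w, grad, eta):
    return [wj - eta * gj for wj, gj in zip(w, grad)]

w = [0.0, 0.0]
grad = [-4.0, -6.0]          # example gradient values (not computed here)
eta = 0.1                    # learning rate
print(step(w, grad, eta))    # roughly [0.4, 0.6]
```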
Gradient descent method (contd.)
• The above weight-update process can be repeated for
several iterations until the minimum point of the error
function is reached. Each such iteration is called an Epoch.
• Initially, random values are assigned to the weights.
• This updating of the weights over a number of epochs to obtain
optimal weights (corresponding to the minimum of the error
function) is called Training.
Gradient descent method (contd.)
• A one-dimensional example:
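A one-dimensional case can be sketched directly: minimize J(w) = (w − 3)², whose gradient is 2(w − 3) and whose minimum lies at w = 3.

```python
# One-dimensional gradient descent on J(w) = (w - 3)^2.
# Gradient: dJ/dw = 2*(w - 3); minimum at w = 3.
w = 0.0        # starting point
eta = 0.1      # learning rate
for _ in range(100):
    w -= eta * 2.0 * (w - 3.0)   # step against the gradient
print(round(w, 6))               # converges to 3.0
```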
Training algorithm
• Initialize each wj, j = 0, 1, · · · , k, to some random value.
• For one or more epochs, or until some minimum error
threshold (say, ϵ < 0.001) is reached, do the following:
For each j = 0, 1, · · · , k,
(i) δwj = −η ∂J(W)/∂wj
(ii) wj = wj + δwj
• The training process can be monitored by plotting MSE against
the training iterations/epochs.
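The whole training algorithm can be sketched for the simple-regression case (k = 1), recording the MSE at each epoch so the training process can be monitored. The data and hyperparameters below are illustrative choices:

```python
# Full training loop for simple regression, y = w0 + w1*x,
# with random weight initialization and per-epoch MSE tracking.
import random

def predict(w, x):
    return w[0] + w[1] * x

def mse(w, xs, ys):
    return sum((y - predict(w, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]        # generated from y = 1 + 2x

random.seed(0)
w = [random.uniform(-1, 1), random.uniform(-1, 1)]   # random init
eta = 0.05                                           # learning rate
history = []
n = len(xs)
for epoch in range(500):
    # partial derivatives of the MSE w.r.t. w0 and w1
    g0 = sum(-2.0 * (y - predict(w, x)) for x, y in zip(xs, ys)) / n
    g1 = sum(-2.0 * x * (y - predict(w, x)) for x, y in zip(xs, ys)) / n
    w = [w[0] - eta * g0, w[1] - eta * g1]           # update step
    history.append(mse(w, xs, ys))                   # monitor training

print(w)              # approaches [1.0, 2.0]
print(history[-1])    # MSE near 0 after training
```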
Training algorithm (contd.)
• After the weights are learnt from the data by the training process,
the function y = f(x1, x2, · · · , xk) can be used to predict the
output, y, for any new input, (x1, x2, · · · , xk) – Testing.
How to choose Learning rate (η)
• Learning rate parameter, η, controls the rate or speed at
which the weights (model) are learnt in the training process.
• A high learning rate covers more distance at each step, but
risks overshooting the minimum point.
• A low learning rate is more precise but time-consuming, since
more gradient-update steps are needed to reach the minimum.
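Both effects can be seen on the one-dimensional function J(w) = w², whose gradient is 2w and whose minimum is at w = 0 (the η values below are illustrative):

```python
# Effect of the learning rate eta on 1-D gradient descent for
# J(w) = w^2 (gradient 2w, minimum at w = 0), starting from w = 1.
def run(eta, steps=20, w=1.0):
    for _ in range(steps):
        w -= eta * 2.0 * w
    return w

print(run(0.05))   # small eta: steady but slow progress toward 0
print(run(0.45))   # moderate eta: much closer to 0 after 20 steps
print(run(1.05))   # eta too large: |w| grows, overshooting the minimum
```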
How to choose Learning rate (η) (contd.)
• In the figure below, weight (θ) update steps on the MSE
curve, J(θ), are illustrated for different values of η.
Thank You