3. Linear Regression

Linear regression is a supervised machine learning algorithm used for predicting continuous outputs based on input variables. It can be categorized into simple regression, which uses a single variable, and multivariate regression, which involves multiple variables. The learning process involves minimizing the Mean Squared Error (MSE) through methods like gradient descent to optimize the coefficients that define the relationship between inputs and outputs.


Linear Regression

• A supervised machine learning algorithm.
• The predicted output is continuous, e.g., estimating sales, cost, etc.
• Let $x$ denote the set of input variables and $y$ the output variable.
• As $y$ is obtained from $x$, $x$ is also called a set of Attributes or Features used to determine $y$.
• Collect $n$ data points $(x_i, y_i)$, $i = 1, 2, \dots, n$ – Training data.
• Choose a linear function $y = f(x)$ and estimate the coefficients of $f$ using the $n$ training points – Learning.
• That means $f$, which gives the relationship between $x$ and $y$, is learnt from the data.
Linear Regression (contd.)

• After learning $f$, the output for new points can be predicted by $y = f(x)$.
• Types of linear regression:
  Simple Regression – based on a single variable.
  Multivariate Regression – based on multiple variables.
• Simple Regression: choose $y = f(x) = w_0 + w_1 x$, and learn the coefficients or weights $w_0$ and $w_1$ using the $n$ training data points $(x_i, y_i)$, $i = 1, 2, \dots, n$ (see the sketch below).
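As a minimal sketch (not from the slides), the simple-regression model can be expressed directly in Python; the weight values used here are arbitrary illustrative numbers, standing in for coefficients that would come from training:

```python
def f(x, w0, w1):
    """Simple linear regression model: y = w0 + w1 * x."""
    return w0 + w1 * x

# Illustrative weights (in practice, learnt from the training data).
w0, w1 = 2.0, 0.5
print(f(10.0, w0, w1))  # predicted y for input x = 10.0 -> 7.0
```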

Linear Regression (contd.)

• Multivariate Regression:
  Choose $y = f(x_1, x_2, \dots, x_k) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_k x_k$.
  (a) Learn the weights $w_0, w_1, \dots, w_k$ using $n$ training data points $(x_{i1}, x_{i2}, \dots, x_{ik}, y_i)$, $i = 1, 2, \dots, n$.
  (b) The input data is a matrix of size $n \times (k + 1)$, where each row $i$ holds a $k$-dimensional input $(x_{i1}, x_{i2}, \dots, x_{ik})$ and its output $y_i$.
  (c) In compact form, the data can be represented by $(x_i, y_i)$, $i = 1, 2, \dots, n$, where each $x_i$ is a $k$-dimensional vector (see the array sketch below).
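A minimal NumPy sketch of this compact representation (the array contents are illustrative, loosely echoing the hardware dataset used later; the slide's $n \times (k+1)$ matrix corresponds to X with y appended as a final column):

```python
import numpy as np

# Illustrative layout: n = 4 training points, k = 3 numeric input attributes.
X = np.array([[140, 2000, 32],
              [ 57, 4000,  1],
              [ 75, 3000,  8],
              [175,  256,  0]], dtype=float)   # shape (n, k); row i is x_i
y = np.array([141.0, 132.0, 64.0, 22.0])       # shape (n,); output y_i per row

n, k = X.shape                                 # n = 4, k = 3
```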

An example
• Consider the following dataset for a problem related to
computer hardware.
Estimate the CPU relative performance (output variable
denoted by CRP) based on the input attributes: Vendor
name, Model name, Machine cycle time, Minimum main
memory in KB, Maximum main memory in KB, Cache
memory in KB, Minimum and maximum channels.
• Standard datasets for different ML problems are available and
maintained in the UCI repository:
[Link]

An example (contd.)

• Some instances (rows) from the UCI dataset for the computer hardware problem are shown in the table below.
Vendor     Model     MCT  MinMain  MaxMain  CacheMem  MinCh  MaxCh  CRP
honeywell  dps:8/52  140  2000     32000    32        1      54     141
honeywell  dps:8/62  140  2000     32000    32        1      54     189
ibm        3033:s    57   4000     16000    1         6      12     132
ibm        3033:u    57   4000     24000    64        12     16     237
hp         3000/88   75   3000     8000     8         3      48     64
hp         3000/iii  175  256      2000     0         3      24     22

Learning algorithm
• Learn coefficients or weights by minimizing an error function.
• Error function: Mean Squared Error (MSE) between actual
and predicted values over n training points.
$$\mathrm{MSE} = J(w_0, w_1, \dots, w_k) = J(W) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(x_i)\bigr)^2 = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - w_0 - w_1 x_{i1} - w_2 x_{i2} - \cdots - w_k x_{ik}\bigr)^2$$
Note that each $x_i$ is a $k$-dimensional vector and $W$ is the weight vector $(w_0, w_1, \dots, w_k)$.
• Here, $y_i$ and $f(x_i)$ are the actual and predicted output values for $x_i$, respectively.
• Compute the weights $w_0, w_1, \dots, w_k$ in such a manner that the MSE is minimized (a sketch of this error function follows).
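A minimal NumPy sketch of this error function, assuming the inputs are stored as an $n \times k$ matrix X and the weight vector W is ordered $(w_0, w_1, \dots, w_k)$; the data values are illustrative:

```python
import numpy as np

def predict(X, W):
    """f(x_i) = w0 + w1*x_i1 + ... + wk*x_ik for every row of X."""
    return W[0] + X @ W[1:]

def mse(X, y, W):
    """Mean squared error J(W) over the n training points."""
    residuals = y - predict(X, W)
    return np.mean(residuals ** 2)

# Illustrative data and weights.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])  # n = 3, k = 2
y = np.array([5.0, 4.0, 6.0])
W = np.zeros(X.shape[1] + 1)  # (w0, w1, ..., wk), all zero to start
print(mse(X, y, W))           # J(W) at the initial weights
```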
Learning algorithm (contd.)

• The analytical solution (based on differentiation, i.e., setting the derivatives to zero and solving the resulting equations) is computationally expensive, especially in higher dimensions.
• As such, the gradient descent algorithm is used to find the solution iteratively.

Gradient descent method

• Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent.
• The steepest descent direction is given by the negative of the gradient.
• For the error function $J(W)$, the gradient is denoted by $\nabla J(W)$:
$$\nabla J(W) = \left(\frac{\partial J(W)}{\partial w_0}, \frac{\partial J(W)}{\partial w_1}, \dots, \frac{\partial J(W)}{\partial w_k}\right)$$
• For example,
$$\frac{\partial J(W)}{\partial w_j} = \frac{1}{n}\sum_{i=1}^{n} 2\,(y_i - w_0 - w_1 x_{i1} - w_2 x_{i2} - \cdots - w_k x_{ik})(-x_{ij})$$
(with the convention $x_{i0} = 1$, so the same formula covers $w_0$).
• Each component of the above gradient vector gives the rate of change of $J(W)$ with respect to the corresponding weight $w_j$ (computed in the sketch below).
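A vectorized sketch of this gradient, under the same assumptions as the MSE sketch above (X is $n \times k$, W is $(w_0, \dots, w_k)$); it computes all $k + 1$ partial derivatives at once:

```python
import numpy as np

def gradient(X, y, W):
    """Gradient of the MSE, i.e. all partial derivatives dJ/dw_j.

    Uses the convention x_i0 = 1, so the same formula covers w0.
    """
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])    # prepend the constant column x_i0 = 1
    residuals = y - Xb @ W                  # y_i - f(x_i) for each i
    return (-2.0 / n) * (Xb.T @ residuals)  # shape (k + 1,)
```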

Gradient descent method (contd.)

• Each weight is updated by taking a step in the opposite (negative) direction of the error gradient. For each $j$,
$$\delta w_j = -\eta\,\frac{\partial J(W)}{\partial w_j}, \qquad w_j = w_j + \delta w_j.$$
Here, $\eta$ is a learning parameter that controls the distance moved in the direction of the negative error gradient.
Vector representation: $W = W + \delta W$, where $\delta W = (\delta w_0, \delta w_1, \dots, \delta w_k)$.
That means, move in the direction of the negative gradient towards the minimum of the error function (see the one-step sketch below).
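A one-step sketch of this update rule, again assuming X is the $n \times k$ input matrix and W the $(k+1)$-vector of weights:

```python
import numpy as np

def gd_step(X, y, W, eta):
    """One gradient-descent update: W <- W + delta_W, with delta_W = -eta * grad J(W)."""
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])          # x_i0 = 1 handles w0
    grad = (-2.0 / n) * (Xb.T @ (y - Xb @ W))     # gradient of the MSE
    return W - eta * grad                         # i.e. W + (-eta) * grad
```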

Gradient descent method (contd.)
• The above weight-update process can be repeated for several iterations until the minimum point of the error function is reached. Each such iteration is called an Epoch.
• Initially, random values are assigned to the weights.
• This updating of the weights over a number of epochs to obtain the optimal weights (corresponding to the minimum of the error function) is called Training.

Gradient descent method (contd.)

• A one-dimensional example (the original figure is not reproduced here; a numeric stand-in follows):
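As a stand-in for the missing figure, a minimal 1-D sketch; the objective $J(w) = (w - 3)^2$ is an illustrative choice (minimum at $w = 3$), not from the slides:

```python
# One-dimensional gradient descent on J(w) = (w - 3)^2, with dJ/dw = 2*(w - 3).
eta = 0.1
w = 0.0   # initial value (would be random in general)
for epoch in range(5):
    grad = 2.0 * (w - 3.0)
    w = w - eta * grad
    print(f"epoch {epoch}: w = {w:.4f}, J(w) = {(w - 3.0) ** 2:.4f}")
# w moves monotonically towards 3 and J(w) shrinks at every step.
```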
Training algorithm

• Initialize each $w_j$, $j = 0, 1, \dots, k$, to some random value.
• For one or more epochs, or until some minimum error threshold (say, $\epsilon = 0.001$) is reached, do the following:
  For each $j = 0, 1, \dots, k$,
  (i) $\delta w_j = -\eta\,\frac{\partial J(W)}{\partial w_j}$
  (ii) $w_j = w_j + \delta w_j$
• The training process can be monitored by plotting MSE against the training iterations/epochs (a full sketch of this loop follows).
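Putting the pieces together, a sketch of this training loop (batch gradient descent; the learning rate, stopping threshold, and test data are illustrative choices):

```python
import numpy as np

def train(X, y, eta=0.01, epochs=1000, eps=0.001):
    """Batch gradient descent for linear regression (a sketch of the slide's algorithm)."""
    n, k = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])          # x_i0 = 1, so W[0] is w0
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=k + 1)         # random initial weights
    history = []                                  # MSE per epoch, for monitoring
    for epoch in range(epochs):
        residuals = y - Xb @ W
        mse = np.mean(residuals ** 2)
        history.append(mse)
        if mse < eps:                             # minimum-error threshold reached
            break
        grad = (-2.0 / n) * (Xb.T @ residuals)    # gradient of J(W)
        W = W - eta * grad                        # W <- W + delta_W
    return W, history

# Illustrative run on a noiseless line y = 1 + 2*x, so training should recover (1, 2).
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 1.0 + 2.0 * X[:, 0]
W, history = train(X, y, eta=0.01, epochs=5000)
print(W)            # approximately [1.0, 2.0]
print(history[-1])  # final MSE, below the threshold
```

On this toy line the loop recovers $w_0 \approx 1$ and $w_1 \approx 2$; for real data such as the hardware table, the input attributes would typically need scaling before $\eta$ can be chosen comfortably.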

Training algorithm (contd.)

• After the weights are learnt from the data using the training process, the function $y = f(x_1, x_2, \dots, x_k)$ can be used to predict the output $y$ for any new input $(x_1, x_2, \dots, x_k)$ – Testing (a one-line sketch follows).
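A minimal sketch of this testing step, assuming a weight vector $W = (w_0, w_1, \dots, w_k)$ has already been learnt; the values below are roughly what the earlier illustrative run recovers:

```python
import numpy as np

def predict_one(x_new, W):
    """Predict y = w0 + w1*x1 + ... + wk*xk for a single new input vector."""
    return W[0] + np.dot(W[1:], x_new)

# Weights from the earlier illustrative run (y = 1 + 2*x).
W = np.array([1.0, 2.0])
print(predict_one(np.array([4.5]), W))  # -> 10.0
```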

How to choose the learning rate ($\eta$)
• The learning rate parameter $\eta$ controls the rate (speed) at which the weights, i.e., the model, are learnt during the training process.
• A high learning rate covers more distance at each step, but there is a risk of overshooting the minimum point.
• A low learning rate is more precise but time-consuming, since it requires a larger number of gradient calculations (both behaviours are demonstrated in the sketch below).
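A small sketch demonstrating both effects on the same 1-D objective $J(w) = (w - 3)^2$ used earlier; the $\eta$ values are arbitrary, chosen only to expose slow convergence versus overshooting:

```python
def descend(eta, steps=20):
    """Run 1-D gradient descent on J(w) = (w - 3)^2 from w = 0."""
    w = 0.0
    for _ in range(steps):
        w = w - eta * 2.0 * (w - 3.0)
    return w

print(descend(eta=0.01))  # low eta: still far from 3 after 20 steps (slow)
print(descend(eta=0.45))  # moderate eta: very close to the minimum at 3
print(descend(eta=1.05))  # too high: each step overshoots and |w| diverges
```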

How to choose the learning rate ($\eta$) (contd.)

• Weight ($\theta$) update steps on the MSE curve $J(\theta)$ are illustrated for different values of $\eta$ in the original figure (not reproduced here).
Thank You
