Linear Regression For Intermediate
Introduction
Least squares “Linear Regression” is a statistical method for regressing data in which the dependent variable takes continuous values, while the independent variables can be either continuous or categorical. In other words, “Linear Regression” is a method to predict a dependent variable (Y) based on the values of independent variables (X).
Prerequisites
To start with Linear Regression, you must be aware of a few basic concepts of statistics:
Correlation (r) – explains the relationship between two variables; possible values range from -1 to +1
Variance (σ²) – a measure of spread in your data
Standard Deviation (σ) – a measure of spread in your data (the square root of variance)
Normal distribution
Residual (error term) – {Actual value – Predicted value}
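To make these quantities concrete, here is a minimal Python sketch; NumPy is an assumed tool here, and the sample values are made up purely for illustration:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 1, 3, 6, 9], dtype=float)

r = np.corrcoef(x, y)[0, 1]          # correlation, between -1 and +1
var_y = np.var(y, ddof=1)            # sample variance (spread of y)
sd_y = np.sqrt(var_y)                # standard deviation (sqrt of variance)
y_pred = np.full_like(y, y.mean())   # a naive "prediction": the mean of y
residuals = y - y_pred               # actual value - predicted value

print(r, var_y, sd_y, residuals)
```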
Assumptions
To check homoscedasticity:
1. Plot residuals vs. predicted values; there should be no pattern.
2. Perform a non-constant variance test.
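As an illustration of both checks, here is a sketch assuming the statsmodels package and simulated data; the Breusch-Pagan test is one common choice of non-constant variance test:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 + 2 * x + rng.normal(0, 1, 100)   # simulated data with constant error variance

X = sm.add_constant(x)                  # design matrix with an intercept column
model = sm.OLS(y, X).fit()

# Check 1: residuals vs. predicted values should show no pattern
# (plot model.fittedvalues against model.resid with any plotting tool).

# Check 2: Breusch-Pagan non-constant variance test.
# A large p-value means constant variance cannot be rejected.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print(lm_pvalue)
```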
While doing linear regression, our objective is to fit a line through the distribution that is nearest to most of the points, thereby reducing the distance (error term) of the data points from the fitted line.
For example, in the figure above, the dots (left) represent various data points, and the line (right) represents an approximate line that can explain the relationship between the ‘x’ and ‘y’ axes. Through linear regression, we try to find such a line. For example, if we have one dependent variable ‘Y’ and one independent variable ‘X’, the relationship between ‘X’ and ‘Y’ can be represented in the form of the following equation:
Y = β0 + β1X
Where,
Y = Dependent Variable
X = Independent Variable
β0 = Constant term, a.k.a. the intercept
β1 = Coefficient of the relationship between ‘X’ and ‘Y’
The regression line always passes through the mean of the independent variable (x) as well as the mean of the dependent variable (y).
The regression line minimizes the sum of the “squares of residuals”. That’s why this method of linear regression is known as “Ordinary Least Squares (OLS)”.
β1 explains the change in Y with a one-unit change in X. In other words, if we increase the value of ‘X’ by one unit, β1 tells us the resulting change in the value of Y.
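A quick numerical check of these properties, as a sketch on made-up data (NumPy assumed purely for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

b1, b0 = np.polyfit(x, y, 1)            # OLS slope and intercept

# Property 1: the fitted line passes through (mean(x), mean(y)).
print(b0 + b1 * x.mean(), y.mean())     # both print 4.0

# Property 3: increasing x by one unit changes the prediction by b1.
print((b0 + b1 * 4) - (b0 + b1 * 3))    # equals b1
```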
Using a statistical tool, e.g., Excel, R, or SAS, you will directly obtain the constants (β0 and β1) as the result of a linear regression function. But conceptually, as discussed, it works on the OLS principle: software packages calculate these constants by minimizing the sum of squared errors.
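For instance, a few lines of Python return the constants directly; scikit-learn is one such package, assumed here alongside Excel, R, and SAS, and the data is taken from the table in the example below:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# x and y values from the table below (x reshaped to a column for sklearn)
x = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.array([2, 1, 3, 6, 9, 11, 13, 15, 17, 20], dtype=float)

model = LinearRegression().fit(x, y)
print(model.intercept_)   # B0 = -2.2
print(model.coef_[0])     # B1 ≈ 2.16
```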
For example, let’s say we want to predict ‘y’ from the ‘x’ values given in the following table, and let’s assume that our regression equation will look like y = B0 + B1*x.
x     y     Predicted 'y'
1     2     B0 + B1*1
2     1     B0 + B1*2
3     3     B0 + B1*3
4     6     B0 + B1*4
5     9     B0 + B1*5
6     11    B0 + B1*6
7     13    B0 + B1*7
8     15    B0 + B1*8
9     17    B0 + B1*9
10    20    B0 + B1*10
Table 1:
Mean of x = 5.5
Mean of y = 9.7
If we differentiate the Residual Sum of Squares (RSS) with respect to B0 and B1 and equate the results to zero, we get the following equations:
B1 = Correlation * (Std. Dev. of y / Std. Dev. of x)
B0 = Mean(y) – B1 * Mean(x)
Putting values from Table 1 into the above equations (for this data, r ≈ 0.99, Std. Dev. of y ≈ 6.62, and Std. Dev. of x ≈ 3.03):
B1 = 2.16
B0 = -2.2
Hence, the least squares regression equation becomes:
Y = -2.2 + 2.16*x
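These values can be verified numerically with the closed-form formulas above; a minimal sketch, assuming NumPy as the computing environment:

```python
import numpy as np

# x and y values from the table above
x = np.arange(1, 11, dtype=float)
y = np.array([2, 1, 3, 6, 9, 11, 13, 15, 17, 20], dtype=float)

r = np.corrcoef(x, y)[0, 1]                # correlation ≈ 0.99
b1 = r * (y.std(ddof=1) / x.std(ddof=1))   # slope ≈ 2.16
b0 = y.mean() - b1 * x.mean()              # intercept = -2.2
print(b1, b0)
```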
Model Performance
Once you build the model, the next logical question that comes to mind is whether your model is good enough to predict future values, or whether the relationship you built between the dependent and independent variables is good enough.
For this purpose, there are various metrics that we look into:
i. R-Squared (R²)
The formula for calculating R² is given by: