Unit 5
BUSINESS ANALYTICS
Prof. Aditya Suresh Kasar
Linear Regression
Y = β0 + β1X1 + β2X1X2 + β3X2²
Regression Model Development
Simple Linear Regression Model Building
A simple linear regression model is developed to understand how the value of a KPI is associated with
changes in the values of an independent variable.
Some examples are as follows:
1. A hospital may be interested in finding how the total treatment cost of a patient varies with the body
weight of the patient.
2. E-commerce companies such as Amazon, Bigbasket and Flipkart would like to understand the
relationship between the number of customer visits to their portal and the revenue.
3. Retailers such as Walmart, Target, Reliance Retail, Hyper City, etc. would be interested in
understanding the impact of price cut promotions on the revenue of their private labels (store brands
or house brands).
4. Original equipment manufacturers (OEMs) would like to know the impact of duration of warranty on
the profit.
Framework for SLR model development
Estimation of Parameters using Ordinary Least Squares
Given a set of dependent variable values (Yi) and the corresponding independent variable
values (Xi), each subject to a random error (εi), one has to find the best equation to represent
the relationship between the dependent and independent variables.
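As an illustration, the least-squares estimates can be computed directly from data. The minimal Python sketch below uses made-up body-weight/treatment-cost figures; the closed-form expressions b1 = Σ(Xi − X̄)(Yi − Ȳ)/Σ(Xi − X̄)² and b0 = Ȳ − b1·X̄ are the standard OLS results.

```python
import numpy as np

# Made-up illustrative data: X = body weight (kg), Y = treatment cost (INR)
X = np.array([45.0, 52.0, 60.0, 68.0, 75.0, 83.0])
Y = np.array([150000.0, 165000.0, 210000.0, 240000.0, 255000.0, 290000.0])

# OLS estimates for Y = b0 + b1*X
x_bar, y_bar = X.mean(), Y.mean()
b1 = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

print(f"Estimated model: Y_hat = {b0:.2f} + {b1:.2f} * X")
```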
Assumptions
The method of least squares gives the best equation under the assumptions stated below
(Harter 1974, 1975):
o In case of time series data, the residuals are uncorrelated, that is, Cov(εi, εj) = 0 for all i ≠ j.
o The variance of the residuals, Var(εi|Xi), is constant for all values of Xi. When the variance
of the residuals is constant for different values of Xi, it is called homoscedasticity. A non-constant
variance of residuals is called heteroscedasticity. (A sketch for checking these two assumptions on sample data is given below.)
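A rough, illustrative check of these two assumptions on the made-up data from the sketch above; what counts as "close to zero" or "similar variance" is a judgment call and not from the slides.

```python
import numpy as np

# Made-up data and residuals from the OLS sketch above
X = np.array([45.0, 52.0, 60.0, 68.0, 75.0, 83.0])
Y = np.array([150000.0, 165000.0, 210000.0, 240000.0, 255000.0, 290000.0])
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
resid = Y - (b0 + b1 * X)

# Check of Cov(e_i, e_j) = 0 via the lag-1 autocorrelation of residuals
# (mainly relevant for time-series data; values near 0 are reassuring).
lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]

# Check of homoscedasticity: residual variance for the low-X half versus
# the high-X half of the sample (similar values are reassuring).
order = np.argsort(X)
half = len(X) // 2
var_low = resid[order][:half].var(ddof=1)
var_high = resid[order][half:].var(ddof=1)

print(f"Lag-1 autocorrelation of residuals: {lag1:.3f}")
print(f"Residual variance (low X): {var_low:.1f}, (high X): {var_high:.1f}")
```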
Example
Ŷi = 61555.3553 + 3076.1774 Xi
• The interpretation will depend on the functional form of the relationship between
the response and the explanatory variables.
• Interpretation of β0 and β1 in Y = β0 + β1X
o Analysis of Variance (ANOVA) for overall model validity (relevant more for multiple linear
regression).
o Outlier analysis.
The above measures and tests are essential, but not exhaustive.
Validation of the SLR
Coefficient of Determination (R-Square or R²)
o The coefficient of determination (or R-square or R²) measures the percentage of variation in Y
explained by the model (β0 + β1X).
o The simple linear regression model can be broken into explained variation and unexplained variation as
shown below:
Yi = (β0 + β1 Xi) + εi
Variation in Y = Variation in Y explained by the model + Variation in Y not explained by the model
It can be proved mathematically that sum of squares of total variation is equal to sum of
squares of explained variation plus sum of squares of unexplained variation
Σ_{i=1}^{n} (Yi − Ȳ)² = Σ_{i=1}^{n} (Ŷi − Ȳ)² + Σ_{i=1}^{n} (Yi − Ŷi)²
        SST                     SSR                     SSE
where SST is the sum of squares of total variation, SSR is the sum of squares of variation
explained by the regression model and SSE is the sum of squares of errors or unexplained
variation.
Validation of the SLR
Coefficient of Determination (R-Square or R²)
Coefficient of determination R² = Explained variation / Total variation = SSR / SST = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)²
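Continuing the made-up data from the earlier sketch, the decomposition SST = SSR + SSE and the resulting R-square can be verified numerically:

```python
import numpy as np

# Made-up data and fitted values from the OLS sketch above
X = np.array([45.0, 52.0, 60.0, 68.0, 75.0, 83.0])
Y = np.array([150000.0, 165000.0, 210000.0, 240000.0, 255000.0, 290000.0])
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

SST = np.sum((Y - Y.mean()) ** 2)       # total variation
SSR = np.sum((Y_hat - Y.mean()) ** 2)   # variation explained by the model
SSE = np.sum((Y - Y_hat) ** 2)          # unexplained variation

print(f"SST = {SST:.1f}, SSR + SSE = {SSR + SSE:.1f}")   # the two should match
print(f"R-square = SSR / SST = {SSR / SST:.4f}")
```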
Spurious Regression
Number of Facebook users and the number of people who died of helium poisoning in UK
             df    SS        MS        F         Significance F
Regression   1     2803.94   2803.94   978.4229  8.82E-09
Residual     7     20.06042  2.865775
Total        8     2824
Coefficients Standard Error t-stat P-value Lower 95% Upper 95%
Intercept 1.9967 0.76169 2.62143 0.034338 0.195607 3.79783
o The regression coefficient (β1) captures the existence of a linear relationship between the response
variable and the explanatory variable.
o If β1 = 0, we can conclude that there is no statistically significant linear relationship between the two
variables.
β1 = 0 would imply that there is no linear relationship between the response variable Y and the
explanatory variable X. Thus, the null and alternative hypotheses for the SLR model can be stated as follows:
H0: β1 = 0
HA: β1 ≠ 0
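An illustrative by-hand version of this t-test on the made-up data used earlier; the standard-error formula se(b1) = sqrt(MSE / Σ(Xi − X̄)²) is the standard SLR result rather than something stated on the slides.

```python
import numpy as np
from scipy import stats

# Made-up data from the earlier sketches
X = np.array([45.0, 52.0, 60.0, 68.0, 75.0, 83.0])
Y = np.array([150000.0, 165000.0, 210000.0, 240000.0, 255000.0, 290000.0])
n = len(X)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
resid = Y - (b0 + b1 * X)

# t-test of H0: beta1 = 0 against HA: beta1 != 0
mse = np.sum(resid ** 2) / (n - 2)
se_b1 = np.sqrt(mse / np.sum((X - X.mean()) ** 2))
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value

print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
# Reject H0 at alpha = 0.05 when the p-value is below 0.05.
```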
Validation of the SLR
Test for Overall Model: Analysis of Variance (F-test)
H0: There is no statistically significant relationship between Y and any of the explanatory
variables (i.e., all regression coefficients are zero).
• HA: At least one regression coefficient is not zero.
• The F-statistic is given by F = MSR / MSE = (SSR / 1) / (SSE / (n − 2))
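A matching sketch of the F-test on the same made-up data; for simple linear regression the F-test is equivalent to the two-sided t-test of β1 (F = t²).

```python
import numpy as np
from scipy import stats

# Made-up data from the earlier sketches
X = np.array([45.0, 52.0, 60.0, 68.0, 75.0, 83.0])
Y = np.array([150000.0, 165000.0, 210000.0, 240000.0, 255000.0, 290000.0])
n = len(X)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

SSR = np.sum((Y_hat - Y.mean()) ** 2)
SSE = np.sum((Y - Y_hat) ** 2)
MSR = SSR / 1              # one explanatory variable
MSE = SSE / (n - 2)
F = MSR / MSE
p_value = stats.f.sf(F, dfn=1, dfd=n - 2)

print(f"F = {F:.3f}, p-value = {p_value:.4f}")
```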
Validation of the SLR
Residual Analysis
Residual (error) analysis is important to check whether the assumptions of regression models
have been satisfied. It is performed to check the following:
• The residuals (Yi − Ŷi) are normally distributed.
• The easiest technique to check whether the residuals follow normal distribution is to use the P-P plot
(Probability-Probability plot).
• The P-P plot compares the cumulative distribution function of two probability distributions against each
other.
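A minimal matplotlib/SciPy sketch of a P-P plot; simulated values stand in for the model's standardized residuals, and the plotting positions used for the empirical CDF are one common convention.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulated stand-in for the standardized residuals of a fitted model
rng = np.random.default_rng(0)
resid = rng.normal(size=100)
z = np.sort((resid - resid.mean()) / resid.std(ddof=1))

# Observed cumulative probabilities versus expected normal probabilities
n = len(z)
observed_cdf = (np.arange(1, n + 1) - 0.5) / n   # plotting positions
expected_cdf = stats.norm.cdf(z)

plt.scatter(expected_cdf, observed_cdf, s=10)
plt.plot([0, 1], [0, 1], color="black")          # 45-degree reference line
plt.xlabel("Expected cumulative probability (normal)")
plt.ylabel("Observed cumulative probability")
plt.title("P-P plot of standardized residuals")
plt.show()
```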
Validation of the SLR
Residual Analysis
Test of Homoscedasticity
An important assumption of the regression model is that the residuals have constant variance
(homoscedasticity) across different values of the explanatory variable (X).
That is, the variance of residuals is assumed to be independent of the variable X. Failure to meet this
assumption makes the hypothesis tests unreliable.
Any pattern in the residual plot would indicate incorrect specification (misspecification) of the model.
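An illustrative residual plot of the kind described above, with simulated fitted values and residuals standing in for model output; a roughly even band around zero supports homoscedasticity, while a funnel or curve suggests a problem.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated stand-ins for fitted values and residuals of a model
rng = np.random.default_rng(1)
y_hat = np.linspace(100, 300, 100)
resid = rng.normal(scale=10.0, size=100)

# Standardize both axes, as in the SPSS-style plot described above
z_pred = (y_hat - y_hat.mean()) / y_hat.std(ddof=1)
z_resid = (resid - resid.mean()) / resid.std(ddof=1)

plt.scatter(z_pred, z_resid, s=10)
plt.axhline(0, color="black")
plt.xlabel("Standardized predicted value")
plt.ylabel("Standardized residual")
plt.title("Residual plot for checking homoscedasticity")
plt.show()
```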
Validation of the SLR
Outlier Analysis
o Outliers are observations whose values show a large deviation from the mean value, that is,
(Yi − Ȳ) is large.
o The presence of an outlier can have a significant influence on the values of the regression coefficients.
Thus, it is important to identify the existence of outliers in the data.
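A small NumPy sketch of a simple outlier screen based on deviation from the mean; the data and the |Z| > 2 cut-off are illustrative assumptions (|Z| > 3 is another commonly used threshold).

```python
import numpy as np

# Made-up response values with one suspiciously large observation
Y = np.array([120.0, 135.0, 128.0, 142.0, 131.0, 540.0, 125.0])

# Standardize the deviations from the mean and flag large ones
z_scores = (Y - Y.mean()) / Y.std(ddof=1)
outliers = np.where(np.abs(z_scores) > 2)[0]

print("Z-scores:", np.round(z_scores, 2))
print("Indices flagged as possible outliers:", outliers)
```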
Example
Use the data on body weight of patients and their treatment cost provided in the data file “DAD” and answer the following
questions:
1. Is there statistical evidence to support that the cost of treatment and body weight are related? Support your answer
with all necessary tests.
2. Comment on the value of R-square. Does a low R-square value indicate that the model is not useful?
3. Interpret the value of the coefficient of weight in the model developed in question 1. What will be the average difference
in the cost of treatment for a patient weighing 50 kg and a patient weighing 51 kg?
Example
1. Is there statistical evidence to support that the cost of treatment and the body weight are related?
Support your answer with all necessary tests.
Solution:
Let Y = cost of treatment and X = body weight of the patient. The corresponding simple linear regression model is
given by
Y = β0 + β1 × Body Weight
From the regression output for the model obtained using the software SPSS, the relationship between the
cost of treatment and the body weight is given by
Y = 127498.079 + 1678.933 × Body Weight -------Eq-1
The p-value for the coefficient “Body Weight” is 0.030, which is less than 0.05; thus, the independent variable
body weight is significant at α = 0.05, that is, at the 95% confidence level.
From the model we can interpret that the cost of treatment increases at the rate of INR 1678.933 per 1 kg
increase in the body weight.
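The slides report SPSS output; as a hedged alternative, the same model could be fitted with Python's statsmodels along the lines below. The file name DAD.csv and the column names BODY_WEIGHT and TOTAL_COST are assumptions that must be matched to the actual data file.

```python
import pandas as pd
import statsmodels.api as sm

# Assumed file and column names -- adjust to the actual "DAD" data set
dad = pd.read_csv("DAD.csv")

X = sm.add_constant(dad["BODY_WEIGHT"])   # adds the intercept term
y = dad["TOTAL_COST"]

model = sm.OLS(y, X).fit()
print(model.summary())   # coefficients, t-stats, p-values, R-square
# A p-value below 0.05 for BODY_WEIGHT supports a statistically
# significant linear relationship between body weight and cost.
```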
Example
However, before we accept the model, we have to check the important assumptions of normality and homoscedasticity.
Figure 5.1 below is the P-P plot that shows the observed cumulative probability of standardized residuals and expected
cumulative probability of a normal distribution (diagonal line). Figure 5.2 is a plot between the standardized residual
and the standardized response variable (Y). A plot of the residuals against the values of the independent variable
can also be used to detect heteroscedasticity.
It is evident from Figures 5.1 and 5.2 that neither the normality assumption nor the homoscedasticity assumption is
satisfied by the model defined in Eq-1, which casts doubt on the model.
FIGURE 5.1 P-P plot for the model
FIGURE 5.2 Plot of standardized predicted values versus standardized residuals for the model
Example
Whenever the assumptions of the regression model are not met, we have to use a remedial measure; one of the
popular remedial measures is Transformation of Variables (transformation of variables is discussed in Chapter
10). In this case, instead of Y, we build the model between ln(Y) and X, where ln(Y) is the natural logarithm of Y:
ln(Y) = a0 + a1 × Body Weight
That is, the relationship between the cost of treatment and the weight is given by
ln(Y) = 11.804 + 0.0074 × Body Weight ----Eq-2
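A hedged sketch of the corresponding log-transformed fit in Python/statsmodels, with the same assumed file and column names as before:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Assumed file and column names, as in the earlier sketch
dad = pd.read_csv("DAD.csv")

X = sm.add_constant(dad["BODY_WEIGHT"])
ln_y = np.log(dad["TOTAL_COST"])          # model ln(Y) instead of Y

log_model = sm.OLS(ln_y, X).fit()
print(log_model.params)                   # estimates corresponding to a0 and a1
print(log_model.pvalues)                  # significance of the coefficients
```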
Example
The p-value for the coefficient ‘body weight’ is less than 0.05, thus the variable body weight is significant at 95%
confidence level.
Figures 5.3 and 5.4 provide the P-P plot and the residual plot between the standardized residual and the standardized
response variable ln(Y). Figure 5.3 (for normality) and Figure 5.4 (for homoscedasticity) look better than Figures
5.1 and 5.2.
Thus, the model in Eq. 2 may be used for predicting the cost of treatment since it satisfies important assumptions of
SLR model.
FIGURE 5.3 P-P plot for the model
FIGURE 5.4 Plot of standardized predicted values versus standardized residuals for the model
Example
2. Comment on the value of the R-square. Does a low R-square value indicate that the model is not useful?
Answer: The R-square value for the model ln(Y) = a0 + a1 × Body Weight is only 0.046. That is, the model is explaining
only 4.6% of the variation in the value of ln(Y).
Low R-square values do not imply that the model is not useful. The primary objective of regression is to find whether
there is a relationship between the response variable (cost of treatment) and the independent variable (body weight of
the patient).
The regression model establishes this relationship since the p-value of the weight coefficient is less than 0.05 and both
the normality and homoscedasticity assumptions are reasonably satisfied.
A low R-square may create problems when the model is used for prediction, since the prediction error is likely to be high.
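Since the accepted model (Eq-2) predicts ln(Y), a prediction of the treatment cost itself requires back-transforming with the exponential function. The sketch below uses the coefficients reported in Eq-2; the 60 kg body weight is a made-up input.

```python
import numpy as np

# Eq-2 from the slides: ln(Y) = 11.804 + 0.0074 * Body Weight
a0, a1 = 11.804, 0.0074

weight = 60.0                    # hypothetical body weight in kg
ln_cost = a0 + a1 * weight
cost = np.exp(ln_cost)           # back-transform to the original cost scale

print(f"Predicted ln(cost) = {ln_cost:.3f}")
print(f"Predicted cost of treatment = INR {cost:,.0f}")
# Because R-square is low, such point predictions carry wide uncertainty.
```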
Example
3. Interpret the value of the coefficient of weight in the model developed in question 1. What will be the
average difference in the cost of treatment for a patient weighing 50 kg and a patient weighing 51 kg?
a) Develop a simple linear regression model between winning margin (Y) and maximum flight delay (X)
and calculate the regression coefficients.
b) What is the value of R2?
c) Is the model statistically significant, what can you infer from the regression model?
Example
a) The model outputs for the regression equation are provided below
c) The estimated values of β0 and β1 from the SPSS output are given by β0 = -136368.738 and β1 = 851.227.
The t-stat value for β0 is -10.42 and the corresponding p-value is less than 0.001 (which is less than 0.05), hence β0 is
statistically significant.
Similarly, the t-stat value for β1 is 14.49 and the corresponding p-value is less than 0.001 (which is less than 0.05), hence β1 is
statistically significant. So, we can say that the model is statistically significant.
And the R-square value for the model is 0.921.
That means the model is explaining 92.10% of the variation in the value of Y (winning margin).