Simple Linear Regression

BUSINESS ANALYTICS

PROF. ADITYA SURESH KASAR

Unit 5
Linear Regression

• Linear regression refers to a model that is linear in the regression coefficients (not necessarily linear in the explanatory variables).

• For example, the following equation is treated as linear as far as regression is concerned, since it is linear in the coefficients:

Y = β₀ + β₁X₁ + β₂X₁X₂ + β₃X₂²
Regression Model Development
Simple Linear Regression Model Building
A simple linear regression model is developed to understand how the value of a KPI is associated with
changes in the values of an independent variable.
Some examples are as follows:
1. A hospital may be interested in finding how the total treatment cost of a patient varies with the body
weight of the patient.
2. E-commerce companies such as Amazon, Bigbasket and Flipkart would like to understand the relationship
between the number of customer visits to their portal and the revenue.
3. Retailers such as Walmart, Target, Reliance Retail, Hyper City, etc. would be interested in
understanding the impact of price cut promotions on the revenue of their private labels (store brands
or house brands).
4. Original equipment manufacturers (OEMs) would like to know the impact of duration of warranty on
the profit.
Framework for SLR model development
Estimation of Parameters using Ordinary Least Squares

Given a set of dependent variable values (Yᵢ) and the corresponding independent variable
values (Xᵢ), each observation subject to a random error (εᵢ), one has to find the best equation to represent
the relationship between the dependent and independent variables (a computational sketch of the
least-squares estimates follows the assumptions below).
Assumptions
The method of least squares gives the best equation under the assumptions stated below
(Harter 1974, 1975):

o The regression model is linear in the regression parameters.

o The explanatory variable, X, is assumed to be non-stochastic (i.e., X is deterministic).

o The conditional expected value of the residuals, E(εᵢ|Xᵢ), is zero.

o In case of time-series data, the residuals are uncorrelated, that is, Cov(εᵢ, εⱼ) = 0 for all i ≠ j.

o The residuals, εᵢ, follow a normal distribution.

o The variance of the residuals, Var(εᵢ|Xᵢ), is constant for all values of Xᵢ. When the variance
of the residuals is constant for different values of Xᵢ, it is called homoscedasticity. A non-constant
variance of residuals is called heteroscedasticity.
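Under these assumptions, the OLS estimates have the closed form β̂₁ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)² and β̂₀ = Ȳ − β̂₁X̄. Below is a minimal sketch of this calculation in Python; the slides themselves use SPSS, and the function name, variable names and the five sample observations (taken from the salary table later in this unit) are only for illustration.

```python
import numpy as np

def ols_estimates(x, y):
    """Closed-form ordinary least squares estimates for Y = b0 + b1*X."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of cross-deviations divided by sum of squared X-deviations
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: the fitted line passes through (x_bar, y_bar)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# First five observations from the MBA salary example, for illustration only
pct_grade10 = [62, 76.33, 72, 60, 61]
salary = [270000, 200000, 240000, 250000, 180000]
b0, b1 = ols_estimates(pct_grade10, salary)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```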
Example

Salary of Graduating MBA Students versus Their Percentage Marks in Grade 10

The table on the next slide provides the salary of 50 graduating MBA students of a
Business School in 2016 and their corresponding percentage marks in grade
10. Develop a linear regression model by estimating the model parameters.
Salary of MBA students versus their grade 10 marks
S. No.   Percentage in Grade 10   Salary   |   S. No.   Percentage in Grade 10   Salary
1 62 270000 26 64.6 250000
2 76.33 200000 27 50 180000
3 72 240000 28 74 218000
4 60 250000 29 58 360000
5 61 180000 30 67 150000
6 55 300000 31 75 250000
7 70 260000 32 60 200000
8 68 235000 33 55 300000
9 82.8 425000 34 78 330000
10 59 240000 35 50.08 265000
11 58 250000 36 56 340000
12 60 180000 37 68 177600
13 66 428000 38 52 236000
14 83 450000 39 54 265000
15 68 300000 40 52 200000
16 37.33 240000 41 76 393000
17 79 252000 42 64.8 360000
18 68.4 280000 43 74.4 300000
19 70 231000 44 74.5 250000
20 59 224000 45 73.5 360000
21 63 120000 46 57.58 180000
22 50 260000 47 68 180000
23 69 300000 48 69 270000
24 52 120000 49 66 240000
25 49 120000 50 60.8 300000
Solution

From the SPSS output, the values of the estimated coefficients are:

β̂₀ = 61555.3553 and β̂₁ = 3076.1774

The corresponding regression equation is given by

Ŷᵢ = 61555.3553 + 3076.1774 Xᵢ

where Ŷᵢ is the predicted value of Y for a given value of Xᵢ.

The equation can be interpreted as follows: for every one percentage-point increase in grade 10 marks,
the salary of the MBA students will increase at the rate of 3076.177, on average.

The notations β̂₀ and β̂₁ are used to denote that these are
estimated values of the regression coefficients obtained from the
sample of 50 students.
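The slides report SPSS output; an equivalent fit can be obtained in Python with the statsmodels library. The sketch below assumes the 50 (percentage, salary) pairs from the table are loaded into the arrays `pct_grade10` and `salary` (only the first few values are shown here); with the full data it should reproduce β̂₀ ≈ 61555.36 and β̂₁ ≈ 3076.18.

```python
import numpy as np
import statsmodels.api as sm

# First few observations from the table; in practice all 50 pairs are used.
pct_grade10 = np.array([62, 76.33, 72, 60, 61, 55, 70, 68, 82.8, 59])
salary = np.array([270000, 200000, 240000, 250000, 180000,
                   300000, 260000, 235000, 425000, 240000])

X = sm.add_constant(pct_grade10)   # adds the intercept (beta_0) column
model = sm.OLS(salary, X).fit()    # ordinary least squares fit

print(model.params)     # estimated beta_0 (const) and beta_1 (slope)
print(model.rsquared)   # coefficient of determination R^2
print(model.summary())  # coefficients, standard errors, t-stats, p-values, ANOVA
```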
Interpretation of Simple Linear Regression Coefficients
• Interpretation of regression coefficients is important for understanding the
relationship between the response variable and the explanatory variable and the
impact of change in the values of explanatory variables on the response variable.

• The interpretation will depend on the functional form of the relationship between
the response and the explanatory variables.

• Interpretation of β₀ and β₁ in Y = β₀ + β₁X

When the functional form is Y = β₀ + β₁X, the value of β₀ = E(Y | X = 0).

β₁ = ∂Y/∂X; that is, β₁ is the change in the value of Y for a unit change in the value of
X, where ∂Y/∂X is the partial derivative of Y with respect to X.
Validation of the Simple Linear Regression Model
It is important to validate the regression model to ensure its validity and goodness of fit before
it can be used for practical applications. The following measures are used to validate the
simple linear regression models:

o Co-efficient of determination (R-square).

o Hypothesis test for the regression coefficient

o Analysis of Variance (ANOVA) for overall model validity (relevant more for multiple linear
regression).

o Residual analysis to validate the regression model assumptions.

o Outlier analysis.
The above measures and tests are essential, but not exhaustive.
Validation of the SLR
Coefficient of Determination (R-Square or R2)
o The coefficient of determination (or R-square or R²) measures the percentage of variation in Y
explained by the model (β₀ + β₁X).
o The simple linear regression model can be broken into explained variation and unexplained variation, as
shown below:

Yᵢ = β₀ + β₁Xᵢ + εᵢ

where Yᵢ represents the variation in Y, β₀ + β₁Xᵢ is the variation in Y explained by the model, and εᵢ is the
variation in Y not explained by the model.

In the absence of a predictive model for Yᵢ, the users will
use the mean value of Yᵢ. Thus, the total variation is
measured as the difference between Yᵢ and the mean value
of Yᵢ (i.e., Yᵢ − Ȳ).
Validation of the SLR
Coefficient of Determination (R-Square or R2)

Description of total variation, explained variation and unexplained variation:

Variation Type                               Measure       Description
Total variation (SST)                        (Yᵢ − Ȳ)      The difference between the actual value and the mean value of Y.
Variation explained by the model (SSR)       (Ŷᵢ − Ȳ)      The difference between the estimated (predicted) value of Yᵢ and the mean value of Y.
Variation not explained by the model (SSE)   (Yᵢ − Ŷᵢ)     The difference between the actual value and the predicted value of Yᵢ (the error in prediction).
Validation of the SLR
Coefficient of Determination (R-Square or R2)
The relationship between the total variation, the explained variation and the unexplained variation
is given as follows:

(Yᵢ − Ȳ)  =  (Ŷᵢ − Ȳ)  +  (Yᵢ − Ŷᵢ)

Total variation in Y  =  Variation in Y explained by the model  +  Variation in Y not explained by the model

It can be proved mathematically that the sum of squares of the total variation is equal to the sum of
squares of the explained variation plus the sum of squares of the unexplained variation:

Σ(Yᵢ − Ȳ)²  =  Σ(Ŷᵢ − Ȳ)²  +  Σ(Yᵢ − Ŷᵢ)²     (summation over i = 1, …, n)
    SST             SSR              SSE

where SST is the sum of squares of total variation, SSR is the sum of squares of variation
explained by the regression model and SSE is the sum of squares of errors or unexplained
variation.
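A minimal sketch of this decomposition in Python (assuming `y` holds the observed values and `y_hat` the fitted values, e.g. `model.fittedvalues` from the statsmodels fit sketched earlier; names are illustrative):

```python
import numpy as np

def variation_decomposition(y, y_hat):
    """Return (SST, SSR, SSE, R^2) for observed y and fitted y_hat."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    y_bar = y.mean()
    sst = np.sum((y - y_bar) ** 2)       # total variation
    ssr = np.sum((y_hat - y_bar) ** 2)   # variation explained by the model
    sse = np.sum((y - y_hat) ** 2)       # unexplained variation (errors)
    # SST = SSR + SSE holds when the fitted model includes an intercept term
    return sst, ssr, sse, 1 - sse / sst  # last value is R^2 (next slides)

# Example usage with the earlier statsmodels fit:
# sst, ssr, sse, r2 = variation_decomposition(salary, model.fittedvalues)
```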
Validation of the SLR
Coefficient of Determination (R-Square or R2)

Coefficient of Determination or R-Square


The coefficient of determination (R²) is given by

R²  =  Explained variation / Total variation  =  SSR / SST  =  Σ(Ŷᵢ − Ȳ)² / Σ(Yᵢ − Ȳ)²

Since SSR = SST − SSE, the above equation can be written as

R²  =  1 − SSE/SST  =  1 − Σ(Yᵢ − Ŷᵢ)² / Σ(Yᵢ − Ȳ)²
Validation of the SLR
Coefficient of Determination (R-Square or R2)

Coefficient of Determination or R-Square


Thus, R2 is the proportion of variation in response variable Y explained by the regression model.
Coefficient of determination (R2) has the following properties:

o The value of R2 lies between 0 and 1.


o Higher value of R2 implies better fit, but one should be aware of spurious regression.
o Mathematically, for simple linear regression the square of the correlation coefficient between X and Y is equal to the
coefficient of determination (i.e., r² = R²).
o We do not put any minimum threshold on R²; a higher value of R² implies a better fit. However, a minimum
value of R² for a given significance level α can be derived using the relationship between the F-statistic
and R², shown below.
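For a simple linear regression with n observations, the F-statistic and R² are linked by the following identity (a standard result, stated here as a supplement to the slides), which can be inverted to give the smallest R² that is statistically significant at level α:

```latex
F \;=\; \frac{MSR}{MSE} \;=\; \frac{SSR/1}{SSE/(n-2)} \;=\; \frac{(n-2)\,R^{2}}{1-R^{2}}
\qquad\Longrightarrow\qquad
R^{2}_{\min} \;=\; \frac{F_{\alpha,\,1,\,n-2}}{F_{\alpha,\,1,\,n-2} + (n-2)}
```

Here F(α, 1, n−2) is the critical value of the F-distribution with 1 and n − 2 degrees of freedom.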
Validation of the SLR
Coefficient of Determination (R-Square or R2)

Spurious Regression
Number of Facebook users and the number of people who died of helium poisoning in UK

Year   Number of Facebook users in millions (X)   Number of people who died of helium poisoning in UK (Y)
2004 1 2
2005 6 2
2006 12 2
2007 58 2
2008 145 11
2009 360 21
2010 608 31
2011 845 40
2012 1056 51
Facebook users versus helium poisoning in UK
SUMMARY OUTPUT

Regression Statistics
Multiple R        0.996442
R Square          0.992896
Standard Error    1.69286
Observations      9

ANOVA
              df    SS         MS         F          Significance F
Regression    1     2803.94    2803.94    978.4229   8.82E-09
Residual      7     20.06042   2.865775
Total         8     2824

              Coefficients   Standard Error   t-stat     P-value    Lower 95%   Upper 95%
Intercept     1.9967         0.76169          2.62143    0.034338   0.195607    3.79783
FB            0.0465         0.00149          31.27975   8.82E-09   0.043074    0.050119

The R-square value for the regression model between the number of deaths due to helium
poisoning in UK and the number of Facebook users is 0.9928. That is, 99.28% of the variation in the
number of deaths due to helium poisoning in UK is explained by the number of Facebook
users.
The regression model is given as Y = 1.9967 + 0.0465 X
Validation of the SLR
Hypothesis Test for Regression Co-efficient (t-Test)

o The regression coefficient (β₁) captures the existence of a linear relationship between the response
variable and the explanatory variable.

o If β₁ = 0, we can conclude that there is no statistically significant linear relationship between the two
variables.

The null and alternative hypotheses for the SLR model can be stated as follows:

H0: There is no relationship between X and Y

HA: There is a relationship between X and Y

β₁ = 0 would imply that there is no linear relationship between the response variable Y and the
explanatory variable X. Thus, the null and alternative hypotheses can be restated as follows:

H0: β₁ = 0

HA: β₁ ≠ 0
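The test statistic used for this hypothesis is the standard t-statistic below (the slides rely on the SPSS output for its value):

```latex
t \;=\; \frac{\hat{\beta}_{1}}{S_{e}(\hat{\beta}_{1})},
\qquad
S_{e}(\hat{\beta}_{1}) \;=\; \sqrt{\frac{MSE}{\sum_{i=1}^{n}(X_{i}-\bar{X})^{2}}},
\qquad
MSE \;=\; \frac{\sum_{i=1}^{n}(Y_{i}-\hat{Y}_{i})^{2}}{n-2}
```

H0 is rejected at significance level α when |t| exceeds t(α/2, n−2), equivalently when the p-value is below α.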
Validation of the SLR
Test for Overall Model: Analysis of Variance (F-test)

The null and alternative hypothesis for F-test is given by

H0: There is no statistically significant relationship between Y and any of the explanatory
variables (i.e., all regression coefficients are zero).

HA: Not all regression coefficients are zero

• Alternatively:

H0: All regression coefficients are equal to zero

HA: Not all regression coefficients are equal to zero

• The F-statistic is given by

F  =  MSR / MSE  =  (SSR / 1) / (SSE / (n − 2))

where MSR is the mean square due to regression and MSE is the mean square error.
Validation of the SLR
Residual Analysis

Residual (error) analysis is important to check whether the assumptions of regression models
have been satisfied. It is performed to check the following:

• The residuals (Yᵢ − Ŷᵢ) are normally distributed.

• The variance of residual is constant (homoscedasticity).

• The functional form of regression is correctly specified.

• Whether there are any outliers.


Validation of the SLR
Residual Analysis

Checking for Normal Distribution of Residuals (Yᵢ − Ŷᵢ)

• The easiest technique to check whether the residuals follow normal distribution is to use the P-P plot
(Probability-Probability plot).
• The P-P plot compares the cumulative distribution functions of two probability distributions against each
other, as in the sketch below.
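A minimal sketch of a P-P plot for the residuals in Python (assuming `resid` holds the model residuals, e.g. `model.resid` from the statsmodels fit sketched earlier; plotting choices are illustrative):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def pp_plot(resid):
    """P-P plot: observed vs. expected cumulative probabilities of standardized residuals."""
    z = np.sort((resid - np.mean(resid)) / np.std(resid, ddof=1))
    n = len(z)
    observed_cdf = (np.arange(1, n + 1) - 0.5) / n   # empirical plotting positions
    expected_cdf = stats.norm.cdf(z)                 # cumulative probability under normality
    plt.scatter(expected_cdf, observed_cdf, s=15)
    plt.plot([0, 1], [0, 1], "r--")                  # 45-degree reference line
    plt.xlabel("Expected cumulative probability (normal)")
    plt.ylabel("Observed cumulative probability")
    plt.title("P-P plot of standardized residuals")
    plt.show()

# pp_plot(model.resid)   # residuals from the earlier statsmodels fit
```

Points lying close to the 45-degree line support the normality assumption.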
Validation of the SLR
Residual Analysis
Test of Homoscedasticity
An important assumption of regression model is that the residuals have constant variance
(homoscedasticity) across different values of the explanatory variable (X).
That is, the variance of residuals is assumed to be independent of variable X. Failure to meet this
assumption will result in unreliability of the hypothesis tests.

Testing the Functional Form of Regression Model

Any pattern in the residual plot would indicate incorrect specification (misspecification) of the model.
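Both checks are usually done visually with a plot of standardized residuals against standardized predicted values; a sketch along these lines (assuming the statsmodels fit `model` from earlier; purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def residual_plot(model):
    """Plot standardized residuals against standardized predicted values."""
    resid = np.asarray(model.resid, dtype=float)
    fitted = np.asarray(model.fittedvalues, dtype=float)
    z_resid = (resid - resid.mean()) / resid.std(ddof=1)
    z_fitted = (fitted - fitted.mean()) / fitted.std(ddof=1)
    plt.scatter(z_fitted, z_resid, s=15)
    plt.axhline(0, color="r", linestyle="--")
    plt.xlabel("Standardized predicted value")
    plt.ylabel("Standardized residual")
    plt.title("Residual plot")
    plt.show()

# A roughly constant spread with no visible pattern supports homoscedasticity
# and a correctly specified functional form; a funnel shape suggests
# heteroscedasticity, and a curve suggests misspecification.
# residual_plot(model)
```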
Validation of the SLR
Outlier Analysis
o Outliers are observations whose values show a large deviation from the mean value, that is,
(Yᵢ − Ȳ) is large.
o Presence of an outlier can have a significant influence on the values of the regression coefficients.
Thus, it is important to identify the existence of outliers in the data; one simple screen is sketched below.
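One common screen for outliers (an illustrative convention, not prescribed by the slides) flags observations whose standardized residuals exceed 3 in absolute value:

```python
import numpy as np

def flag_outliers(resid, threshold=3.0):
    """Return indices of observations with |standardized residual| > threshold."""
    resid = np.asarray(resid, dtype=float)
    z = (resid - resid.mean()) / resid.std(ddof=1)
    return np.where(np.abs(z) > threshold)[0]

# flag_outliers(model.resid)   # indices of potential outliers in the earlier fit
```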
Example
Use the data on body weight of patients and their treatment cost provided in the data file “DAD” and answer the following
questions:
1. Is there statistical evidence to support that the cost of treatment and body weight are related? Support your answer
with all necessary tests.
2. Comment on the value of R-square. Does a low R-square value indicate that the model is not useful?
3. Interpret the value of the coefficient of weight in the model developed in question 1. What will be the average difference
in cost of treatment for a patient weighing 50 kg and a patient weighing 51 kg?
Example
1. Is there statistical evidence to support that the cost of treatment and the body weight are related?
Support your answer with all necessary tests.
Solution:
Let Y = cost of treatment and X = body weight of the patient. The corresponding simple linear regression model is
given by
Y = β₀ + β₁ × Body Weight
The regression output for the model was obtained using the software SPSS.

That is, the relationship between the cost of treatment and the body weight is given by
Y = 127498.079 + 1678.933 × Body Weight    (Eq. 1)
The p-value for the coefficient “Body Weight” is 0.030, which is less than 0.05; thus, the independent variable
body weight is significant at α = 0.05, i.e., at the 95% confidence level.
From the model we can interpret that the cost of treatment increases at the rate of INR 1678.933 per 1 kg
increase in the body weight.
Example
However, before we accept the model, we have to check the important assumptions of normality and homoscedasticity.
Figure 5.1 below is the P-P plot that shows the observed cumulative probability of standardized residuals and expected
cumulative probability of a normal distribution (diagonal line). Figure 5.2 is a plot between the standardized residual
and the standardized response variable (Y). The plot between residual and independent variable values can also be
used for finding existence of heteroscedasticity.

It is evident from Figures 5.1 and 5.2 that neither the normality nor the homoscedasticity assumption is satisfied by
the model defined in Eq. 1, which casts doubt on the model.

FIGURE 5.1: P-P plot for the model.
FIGURE 5.2: Plot of standardized predicted values versus standardized residuals for the model.
Example
Whenever the assumptions of the regression model are not met, we have to use a remedial measure; one of the
popular remedial measures is transformation of variables (discussed in Chapter 10). In this case, instead of modelling Y
directly, we build the model between ln(Y) and X, where ln(Y) is the natural logarithm of Y:
ln(Y) = a0 + a1 × Body Weight

The model outputs for the regression are provided.

That is, the relationship between the cost of treatment and the body weight is given by
ln(Y) = 11.804 + 0.0074 × Body Weight    (Eq. 2)
Example
The p-value for the coefficient ‘body weight’ is less than 0.05; thus, the variable body weight is significant at the 95%
confidence level.
Figures 5.3 and 5.4 provide the P-P plot and the residual plot of the standardized residuals against the standardized
response variable ln(Y). Figure 5.3 (for normality) and Figure 5.4 (for homoscedasticity) look better than Figures
5.1 and 5.2.
Thus, the model in Eq. 2 may be used for predicting the cost of treatment, since it satisfies the important assumptions of the
SLR model.

FIGURE 5.3: P-P plot for the model.
FIGURE 5.4: Plot of standardized predicted values versus standardized residuals for the model.
Example
2. Comment on the value of the R-square. Does a low R-square value indicate that the model is not useful?

Answer: The R-square value for the model ln(Y) = a0 + a1 × Body Weight is only 0.046. That is, the model is explaining
only 4.6% of the variation in the value of ln(Y).
Low R-square values do not imply that the model is not useful. The primary objective of regression is to find whether
there is a relationship between the response variable (cost of treatment) and the independent variable (body weight of
the patient).
The regression model establishes this relationship, since the p-value of the weight coefficient is less than 0.05 and both
the normality and homoscedasticity assumptions are reasonably satisfied.
A low R-square may, however, create problems when we use the model for prediction, since the prediction error is likely to be higher.
Example
3. Interpret the value of the coefficient of weight in the model developed in question 1. What will be the
average difference in cost of treatment for a patient weighing 50 kg and a patient weighing 51 kg?

Answer: The regression model is given by

ln(Y) = 11.804 + 0.0074 × Body Weight
⇒ Y = exp(11.804 + 0.0074 × Body Weight)

The coefficient for weight is 0.0074; that is, for every 1 kg increase in weight, the cost of treatment increases by a
factor of e^0.0074 (about 0.74%), or equivalently by an amount e^(11.804 + 0.0074X) × (e^0.0074 − 1). The average costs of
treatment for patients weighing 50 kg and 51 kg are given by:
X = 50: Y = exp(11.804 + 0.0074 × 50) = exp(12.174) = 1,93,687.2
X = 51: Y = exp(11.804 + 0.0074 × 51) = exp(12.1814) = 1,95,125.80
The difference in the average cost of treatment for patients weighing 50 kg and 51 kg is INR 1438.602 (note that the
regression coefficient values are truncated after 3 decimals; including more decimals will give a slightly different
answer).
Alternatively, e^(11.804 + 0.0074 × 50) × (e^0.0074 − 1) = 1438.602.
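A quick numerical check of this calculation in Python, using the coefficient values from Eq. 2:

```python
import math

b0, b1 = 11.804, 0.0074            # estimates from the log-linear model (Eq. 2)
cost_50 = math.exp(b0 + b1 * 50)   # predicted cost at body weight 50 kg
cost_51 = math.exp(b0 + b1 * 51)   # predicted cost at body weight 51 kg

print(round(cost_50, 1), round(cost_51, 1))    # approx. 193687.2 and 195125.8
print(round(cost_51 - cost_50, 1))             # approx. 1438.6
print(round(cost_50 * (math.exp(b1) - 1), 1))  # same difference via the factor form
```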
Example
Table 5.3 provides the winning margin of all 20 Lok Sabha constituencies of Kerala in 2014 parliament
elections of India and maximum delay of top 20 flights (origin-destination) of Air India between 15 July
2014 and 15 September 2014.

a) Develop a simple linear regression model between winning margin (Y) and maximum flight delay (X)
and calculate the regression coefficients.
b) What is the value of R2?
c) Is the model statistically significant, what can you infer from the regression model?
Example
a) The model outputs for the regression equation are provided below

Y = -136368.7379 + 851.2274 × maximum flight delay


b) Value of R2 = 0.921

c) The estimated values of β̂₀ and β̂₁ from the SPSS output are β̂₀ = -136368.738 and β̂₁ = 851.227.

The t-stat value for β̂₀ is -10.42 and the corresponding p-value is less than 0.001, which is less than 0.05 and hence
statistically significant.
Similarly, the t-stat value for β̂₁ is 14.49 and the corresponding p-value is less than 0.001, which is less than 0.05 and hence
statistically significant. So, we can say that the model is statistically significant.
The R-square value for the model is 0.921; that is, the model explains 92.10% of the variation in the value of Y (winning margin).
