MULTIPLE REGRESSION
Mr. Pranav Ranjan & Ms. Razia Sehdev ICTC, LPU
The Multiple Regression Model
Idea: Examine the linear relationship between 1 dependent (Y) & 2 or more independent variables (Xi)
Multiple Regression Model with k Independent Variables:

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$

where $\beta_0$ is the Y-intercept, $\beta_1, \ldots, \beta_k$ are the population slopes, and $\varepsilon_i$ is the random error.
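In a sample of n observations, the unknown $\beta$'s are replaced by estimated coefficients, giving the estimated regression equation used for prediction:

$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}$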
Assumptions

The error term is normally distributed: for each fixed value of X, the distribution of Y is normal.
The mean of the error term is 0.
The variance of the error term is constant; it does not depend on the values assumed by X.
The error terms are uncorrelated; in other words, the observations have been drawn independently.
The regressors are independent amongst themselves.
Assumptions (continued)

The independent variables should be uncorrelated with the residuals.
The model should be properly specified.
The number of observations should exceed the number of parameters.
The model is linear in the parameters.
The independent variables are fixed in repeated samples.
Statistics Associated with Multiple Regression
Coefficient of multiple determination.
The strength of association in multiple regression is measured by the square of the multiple correlation coefficient, R2, which is also called the coefficient of multiple determination.
Adjusted R2
R2 is adjusted for the number of independent variables and the sample size to account for diminishing returns: after the first few variables, additional independent variables do not contribute much.
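The standard adjustment, where n is the sample size and k the number of independent variables:

$R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}$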
Statistics Associated with Multiple Regression
F test. Used to test the null hypothesis that the coefficient of multiple determination in the population, R2pop, is zero.
The test statistic has an F distribution with k and (n - k - 1) degrees of freedom.
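The statistic is the ratio of explained to unexplained mean squares:

$F = \frac{SSR/k}{SSE/(n-k-1)} = \frac{MSR}{MSE}$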
Statistics Associated with Multiple Regression
Partial regression coefficient. The partial regression coefficient, b1, denotes the change in the predicted value, $\hat{Y}$, per unit change in X1 when the other independent variables, X2 to Xk, are held constant.
Multiple Regression Output
Regression Statistics
  Multiple R           0.72213
  R Square             0.52148
  Adjusted R Square    0.44172
  Standard Error      47.46341
  Observations              15

Sales = 306.526 - 24.975(Price) + 74.131(Advertising)

ANOVA
               df          SS          MS        F    Significance F
  Regression    2   29460.027   14730.013  6.53861           0.01201
  Residual     12   27033.306    2252.776
  Total        14   56493.333

               Coefficients  Standard Error    t Stat  P-value  Lower 95%  Upper 95%
  Intercept       306.52619       114.25389   2.68285  0.01993   57.58835  555.46404
  Price           -24.97509        10.83213  -2.30565  0.03979  -48.57626   -1.37392
  Advertising      74.13096        25.96732   2.85478  0.01449   17.55303  130.70888
The Multiple Regression Equation
Sales = 306.526 - 24.975(Price) + 74.131(Advertising)
where:
  Sales is in number of pies per week
  Price is in $
  Advertising is in $100s
b1 = -24.975: sales will decrease, on average, by 24.975 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising
b2 = 74.131: sales will increase, on average, by 74.131 pies per week for each $100 increase in advertising, net of the effects of changes due to price
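As a minimal sketch, the same model can be fit in Python with statsmodels. The data arrays below are hypothetical placeholders, not the actual 15 weekly observations behind the output above:

```python
# Minimal sketch: fitting Sales on Price and Advertising by OLS.
# The arrays are hypothetical placeholder data, for illustration only.
import numpy as np
import statsmodels.api as sm

price       = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50])  # $ per pie
advertising = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0])        # $100s
sales       = np.array([350, 460, 350, 430, 350, 380])        # pies per week

X = sm.add_constant(np.column_stack([price, advertising]))  # adds the b0 column
model = sm.OLS(sales, X).fit()

print(model.params)        # b0, b1 (Price), b2 (Advertising)
print(model.rsquared)      # R Square
print(model.rsquared_adj)  # Adjusted R Square
print(model.fvalue)        # F statistic for overall significance
```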
Using The Equation to Make Predictions
Predict sales for a week in which the selling price is $5.50 and advertising is $350:
Sales = 306.526 - 24.975(Price) + 74.131(Advertising)
      = 306.526 - 24.975(5.50) + 74.131(3.5)
      = 428.62
Note that Advertising is in $100s, so $350 means that X2 = 3.5
Predicted sales is 428.62 pies
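The same arithmetic as a quick check in Python, with the coefficients copied from the output above:

```python
# Plugging the fitted coefficients into the regression equation.
b0, b1, b2 = 306.526, -24.975, 74.131
price, advertising = 5.50, 3.5  # $5.50 price; $350 of advertising = 3.5 (in $100s)
predicted_sales = b0 + b1 * price + b2 * advertising
print(round(predicted_sales, 2))  # 428.62 pies
```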
Multiple Coefficient of Determination
(continued)
r² = SSR / SST = 29460.0 / 56493.3 = 0.52148 (values from the regression output above)
52.1% of the variation in pie sales is explained by the variation in price and advertising
Adjusted r²
(continued)

r²adj = 0.44172
44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and number of independent variables
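As a check, this value follows from the adjustment formula given earlier, with n = 15 and k = 2:

r²adj = 1 - (1 - 0.52148)(15 - 1)/(15 - 2 - 1) = 1 - (0.47852)(14/12) ≈ 0.44172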
F Test for Overall Significance
(continued)
F = MSR / MSE = 14730.0 / 2252.8 = 6.5386
With 2 and 12 degrees of freedom
The p-value for the F test (Significance F in the output above) is 0.01201. Since 0.01201 < .05, reject the null hypothesis: the model as a whole is significant, i.e., at least one slope coefficient is nonzero.
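A minimal check of this p-value with SciPy, using the F value and degrees of freedom above:

```python
# Upper-tail probability of F = 6.53861 under an F(2, 12) distribution.
from scipy.stats import f

p_value = f.sf(6.53861, dfn=2, dfd=12)
print(round(p_value, 5))  # should reproduce Significance F ≈ 0.01201
```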
Are Individual Variables Significant?
(continued)

The t-value for Price is t = -2.306, with p-value .0398.
The t-value for Advertising is t = 2.855, with p-value .0145.

Both p-values are below .05, so each variable is individually significant (each t test has n - k - 1 = 12 degrees of freedom).
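Each t statistic is simply a coefficient divided by its standard error; a minimal check using the values from the output above:

```python
# t statistics and two-tailed p-values for the individual coefficients.
from scipy.stats import t

for name, b, se in [("Price", -24.97509, 10.83213),
                    ("Advertising", 74.13096, 25.96732)]:
    t_stat = b / se
    p_value = 2 * t.sf(abs(t_stat), df=12)  # df = n - k - 1 = 15 - 2 - 1
    print(f"{name}: t = {t_stat:.3f}, p = {p_value:.4f}")
```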
Multicollinearity
Multicollinearity arises when intercorrelations among the predictors are very high. It results in several problems, including:

The partial regression coefficients may not be estimated precisely; the standard errors are likely to be high.
The magnitudes as well as the signs of the partial regression coefficients may change from sample to sample.
It becomes difficult to assess the relative importance of the independent variables in explaining the variation in the dependent variable.
Predictor variables may be incorrectly included or removed in stepwise regression.
Multicollinearity
A simple procedure for adjusting for multicollinearity consists of using only one of the variables in a highly correlated set of variables.
Alternatively, the set of independent variables can be transformed into a new set of predictors that are mutually independent by using techniques such as principal components analysis. More specialized techniques, such as ridge regression and latent root regression, can also be used.
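A minimal sketch of the principal components idea, using scikit-learn; the two predictor arrays are hypothetical placeholders constructed to be nearly collinear:

```python
# Replacing correlated predictors with mutually uncorrelated components.
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical, nearly collinear predictors (placeholders for illustration).
x1 = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50])
x2 = np.array([5.60, 7.40, 8.10, 7.90, 6.90, 7.60])

X = np.column_stack([x1, x2])
Z = PCA().fit_transform(X)  # columns of Z are uncorrelated components
print(np.corrcoef(Z, rowvar=False).round(6))  # off-diagonals are ~0
```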
Multicollinearity Diagnostics:
Variance Inflation Factor (VIF) measures how much the variance of a regression coefficient is inflated by multicollinearity. For predictor j, VIFj = 1 / (1 - Rj²), where Rj² is from regressing Xj on the other predictors. A VIF of 1 (its minimum) means that predictor is uncorrelated with the other independent variables; values above 1 indicate some association among the predictors, but generally not enough to cause problems. A common maximum acceptable VIF is 10; anything higher indicates a problem with multicollinearity.
Tolerance: the proportion of variance in an independent variable that is not explained by the other independent variables; equivalently, Tolerancej = 1 - Rj² = 1 / VIFj. If the other variables explain much of the variance of a particular independent variable, there is a problem with multicollinearity, so small tolerance values indicate trouble. The typical cutoff is .10: a tolerance value below .10 indicates a problem of multicollinearity.
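A minimal sketch of computing both diagnostics with statsmodels; the predictor arrays are hypothetical placeholders:

```python
# VIF and tolerance for each predictor in a two-variable model.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical placeholder data, for illustration only.
price       = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50])
advertising = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0])

X = sm.add_constant(np.column_stack([price, advertising]))
for j, name in enumerate(["Price", "Advertising"], start=1):
    vif = variance_inflation_factor(X, j)  # index 0 is the constant column
    print(f"{name}: VIF = {vif:.3f}, tolerance = {1 / vif:.3f}")
```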