Welcome to: Multiple Regression and Model Building
Unit objectives
• Multiple regression is used when predicting the value of a variable based on the values of two or more other variables.
• It allows us to estimate the relationship between a dependent variable and a set of explanatory variables.
• The dependent variable is an interval variable and can, in principle, take any real value between −∞ and +∞.
• The multiple linear regression model assumes a linear relationship between a dependent variable yi and a set of explanatory variables x'i = (xi0, xi1, ..., xiK). Each xik is also called an independent variable, a covariate or a regressor. The first regressor xi0 = 1 is a constant unless otherwise specified.
• Let y denote the dependent variable that is linearly related to k independent (or explanatory) variables X1, X2, ..., Xk through the parameters β1, β2, ..., βk, and we write
– y = β1X1 + β2X2 + ... + βkXk + ε    (1)
• We note that the jth regression coefficient βj represents the expected change in y per unit change in the jth independent variable Xj. Assuming E(ε) = 0, βj = ∂E(y)/∂Xj.
• A model is said to be linear when it is linear in the parameters. In such a case ∂E(y)/∂βj should not depend on any β's. A small fitting sketch follows below.
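• As a minimal illustration (not part of the original slides), the sketch below fits the model y = β1X1 + ... + βkXk + ε by ordinary least squares on synthetic data; the sample size, coefficient values and the use of NumPy are illustrative assumptions.

# A minimal sketch of fitting y = X*beta + eps by ordinary least squares
# on synthetic data (all names and values here are illustrative).
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3                                       # 100 observations, 3 regressors
X = np.column_stack([np.ones(n),                    # constant regressor x_i0 = 1
                     rng.normal(size=(n, k))])      # explanatory variables
beta_true = np.array([2.0, 1.5, -0.7, 0.3])
y = X @ beta_true + rng.normal(scale=0.5, size=n)   # add the error term eps

# OLS estimate of beta, computed via least squares
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)

Because the data were generated from beta_true, the printed estimates should lie close to those values.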
Partial correlation and regression model building
• Partial correlation is a measure of the strength and direction of a linear relationship between
two continuous variables whilst controlling for the effect of one or more other continuous
variables (also known as covariates or control variables).
• Suppose we want to find the correlation between Y and X controlling W. This is called the
partial correlation.
• We need to ensure that no variance predictable from W enters the relationship between Y and X. In z-score form we can predict both X and Y from W, then subtract those predictions, leaving only the information in X and Y that is independent of W, as follows.
– zXP = rXW zW (1),  zYP = rYW zW (2)
where zXP and zYP are the predicted z-scores for X and Y respectively.
• The residuals zX(res) = zX − zXP, with variance (1 − rXW²), and zY(res) = zY − zYP, with variance (1 − rYW²), carry the residual information in X and Y controlling W; their correlation is the partial correlation rXY.W (see the sketch below).
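• A minimal sketch of the idea above, assuming synthetic data: X and Y are each regressed on W, and the correlation of the two residual series gives the partial correlation rXY.W.

# Partial correlation of Y and X controlling for W: regress X on W and Y on W,
# then correlate the residual series (synthetic, purely illustrative data).
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=200)
x = 0.8 * w + rng.normal(size=200)
y = 0.5 * w + 0.3 * x + rng.normal(size=200)

def residualize(v, w):
    # Remove the part of v that is linearly predictable from w
    slope, intercept = np.polyfit(w, v, 1)
    return v - (slope * w + intercept)

x_res = residualize(x, w)
y_res = residualize(y, w)
r_partial = np.corrcoef(x_res, y_res)[0, 1]   # r_XY.W
print(r_partial)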
Multiple linear regression model
• A multiple linear regression model with k predictor variables X1, X2, ..., Xk and a response Y can be written as
– y = β0 + β1x1 + β2x2 + ... + βkxk + ε
• More complex models may include higher powers of one or more predictor variables (see the sketch below), e.g.,
– y = β0 + β1x + β2x² + ε    (1)
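• As an illustration of the higher-order model above, the sketch below fits y = b0 + b1x + b2x² by building the powers of x into the design matrix; the data are synthetic and the coefficient values are assumptions.

# Fitting a second-order model by least squares on synthetic data.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-2, 2, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)

X = np.column_stack([np.ones_like(x), x, x**2])   # columns: 1, x, x^2
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)                                       # estimates of b0, b1, b2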
Multiple linear regression coefficients - partial regression coefficients
• A linear regression model with two predictor variables can be expressed with the following
equation: Y = B0 + B1*X1 + B2*X2 + e.
• Unit normal scaling: subtract the sample mean and divide by the sample standard deviation for both the predictor variables and the response:
– zij = (xij − x̄j) / sj,  yi* = (yi − ȳ) / sy
where x̄j and sj are the sample mean and standard deviation of the jth predictor, and ȳ and sy those of the response (see the sketch below).
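• A minimal sketch of unit normal scaling, assuming synthetic data: both predictors and the response are centred and scaled before fitting, so the resulting coefficients are standardized coefficients.

# Unit normal scaling: centre each column and divide by its sample standard
# deviation before fitting (synthetic, illustrative data).
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(loc=50, scale=10, size=(100, 2))     # two predictors
y = 3 + 0.2 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(size=100)

Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # z_ij = (x_ij - xbar_j) / s_j
yz = (y - y.mean()) / y.std(ddof=1)                 # scaled response

b_std, *_ = np.linalg.lstsq(Xz, yz, rcond=None)     # standardized coefficients
print(b_std)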
• Missing data causes problems because multiple regression procedures require that every
case have a score on every variable that is used in the analysis.
• If data are missing randomly, then it may be appropriate to estimate each bivariate correlation on the basis of all cases that have data on the two variables.
– pairwise deletion of missing data (contrasted with list-wise deletion in the sketch after this list).
• A second procedure is to delete an entire case if information is missing on any one of the
variables that is used in the analysis
– list-wise deletion.
• A third procedure is simply to delete a variable that has substantial missing data.
– deletion of variables.
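• A minimal sketch, assuming a small illustrative pandas DataFrame, contrasting list-wise deletion (drop any incomplete case) with pairwise deletion (each correlation uses all cases with data on that pair of variables).

# List-wise vs. pairwise deletion of missing data (illustrative DataFrame).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "y":  [10.0, 12.0, np.nan, 15.0,   14.0],
    "x1": [1.0,  2.0,  3.0,    np.nan, 5.0],
    "x2": [2.0,  np.nan, 6.0,  8.0,    10.0],
})

listwise = df.dropna()        # drop every case with any missing value
print(listwise.shape)         # only complete cases remain

# Pairwise deletion: each correlation uses all cases with data on that pair;
# pandas skips NaNs pair by pair when computing the correlation matrix.
pairwise_corr = df.corr(method="pearson")
print(pairwise_corr)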
Validation of multiple regression model
• The validation process can involve analyzing the goodness of fit of the regression, analyzing whether the regression residuals are random, and checking whether the model's predictive performance deteriorates substantially when applied to data that were not used in model estimation.
• Numerical methods also play an important role in model validation. For example, the lack-of-fit test for assessing the correctness of the functional part of the model can aid in interpreting a borderline residual plot.
• Cross-validation is the process of assessing how the results of a statistical analysis will generalize to an independent data set (see the sketch below).
• A development in medical statistics is the use of out-of-sample cross-validation techniques in meta-analysis. It forms the basis of the validation statistic, Vn, which is used to test the statistical validity of meta-analysis summary estimates. Essentially it measures a type of normalized prediction error, and its distribution is a linear combination of χ² variables with 1 degree of freedom each.
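• A minimal sketch of k-fold cross-validation for a linear model, written in plain NumPy; the synthetic data and the choice of 5 folds are illustrative assumptions, not part of the original slides.

# k-fold cross-validation: hold out each fold in turn, fit on the rest,
# and measure prediction error on the held-out cases.
import numpy as np

rng = np.random.default_rng(4)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

k_folds = 5
indices = rng.permutation(n)
fold_mse = []
for fold in np.array_split(indices, k_folds):
    train = np.setdiff1d(indices, fold)
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    resid = y[fold] - X[fold] @ beta          # out-of-sample prediction error
    fold_mse.append(np.mean(resid ** 2))

print(np.mean(fold_mse))                       # average held-out MSE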
Coefficient of multiple determination (R-Squared)
• R-squared indicates the percentage of the variance in the dependent variable that the independent variables explain collectively.
• R-squared measures the strength of the relationship between the model and the dependent variable on a convenient 0-100% scale.
• Residuals are the differences between the observed values and the fitted values (see the sketch below).
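• A minimal sketch, assuming synthetic data, of computing R-squared directly from the residuals as 1 − SSres/SStot, i.e. the share of the variance in y explained by the fitted model.

# R-squared from residuals of an OLS fit (synthetic data, as above).
import numpy as np

rng = np.random.default_rng(5)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, 2.0]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta                      # observed minus fitted values
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(r_squared)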
Adjusted R-squared
• The adjusted R-squared compares the explanatory power of regression models that contain
different numbers of predictors.
• The adjusted R-squared is a modified version of R-squared that has been adjusted for the
number of predictors in the model.
• Multiple R-squared is the proportion of Y variance that can be explained by the linear model using the X variables in the sample data, but it over-estimates that proportion in the population.
• Consider, for example, sample R2 = 0.60 based on k=7 predictor variables in a sample of
N=15 cases. An estimate of the proportion of Y variance that can be accounted for by the X
variables in the population is called shrunken R squared or adjusted R squared. It can be
calculated with the following formula:
– Shrunken R² = 1 − (1 − R²) × (N − 1) / (N − k − 1) = 1 − (1 − 0.60) × (14 / 7) = 0.20 (see the sketch below).
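• The short sketch below simply re-computes the shrunken R-squared example above (R² = 0.60, k = 7 predictors, N = 15 cases) with the stated formula.

# Adjusted (shrunken) R-squared: 1 - (1 - R^2) * (N - 1) / (N - k - 1)
def adjusted_r_squared(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r_squared(0.60, 15, 7))   # 0.2, i.e. 0.20 as in the example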
Statistical significance: t-Test
• A t-test looks at the t-statistic, the t-distribution values and the degrees of freedom to determine the probability of difference between two sets of data.
• Mathematically, the t-test takes a sample from each of the two sets and establishes the problem statement by assuming a null hypothesis that the two means are equal (see the sketch below).
• For a large sample size, statisticians use a z-test. Other testing options include the chi-square test and the F-test.
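• A minimal sketch of a two-sample t-test, assuming SciPy is available; the two synthetic samples are purely illustrative.

# Two-sample t-test under the null hypothesis that the population means are equal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=11.0, scale=2.0, size=30)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)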
Checkpoint (1 of 2)
1. The main purpose(s) of linear regression (LR) is/are to explain ____. (Choose all that apply.)
2. In MLR, the square of the multiple correlation coefficient or R2 is called the _____.
3. _______ is a modified version of R-squared.
4. OLS stands for _______
True or False:
1. The major conceptual limitation of all regression techniques is that one can only ascertain relationships, but never be sure about the underlying causal mechanism. True/False
2. In MLR, a residual is the difference between the predicted Y and actual Y values.
True/False
3. Multiple regression is not an extension of simple linear regression. True/False
Checkpoint solutions (2 of 2)
1. The main purpose(s) of linear regression (LR) is/are to explain one variable in terms of another.
2. In MLR, the square of the multiple correlation coefficient or R2 is called the coefficient of determination.
3. Adjusted R-squared is a modified version of R-squared.
4. OLS stands for Ordinary Least Squares.
True or False:
1. The major conceptual limitation of all regression techniques is that one can only ascertain relationships, but never be sure about the underlying causal mechanism. True
2. In MLR, a residual is the difference between the predicted Y and actual Y values. True
3. Multiple regression is not an extension of simple linear regression. False
Question bank