
IBM ICE (Innovation Centre for Education)

Welcome to:
Multiple regression and model building

9.1
Unit objectives

After completing this unit, you should be able to:

• Understand the concept of multiple regression
• Learn the significance of ordinary least squares
• Gain insight into regression model building
• Gain conceptual clarity on interpreting regression coefficients
• Learn the concepts of standardized coefficients and categorical variables
• Know how to validate a multiple regression model
• Get a brief idea of R-squared and adjusted R-squared
• Have a clear understanding of the t-test and F-test
Introduction
• Multiple regression is an extension of simple linear regression.

• We consider the problem of regression when a study variable depends on more than one
explanatory (or independent) variable; this is called the multiple linear regression model.

• It is used to predict the value of a variable based on the values of two or more other
variables.

• For example, we may use multiple regression to understand whether examination
performance can be predicted from revision time, test anxiety, lecture attendance and
gender.
Ordinary least squares estimation for multiple linear regression

• OLS allows us to estimate the relation between a dependent variable and a set of explanatory
variables.

• The dependent variable is an interval variable and can, in principle, take any real value
between −∞ and +∞.

• The multiple linear regression model assumes a linear relationship between a dependent
variable yi and a set of explanatory variables x'i = (xi0, xi1, ..., xiK). Each xik is also called an
independent variable, a covariate or a regressor. The first regressor xi0 = 1 is a constant
unless otherwise specified.

• Consider a sample of N observations i = 1, ..., N. Every single observation i follows

  – yi = x'iβ + ui

• where β is a (K + 1)-dimensional column vector of parameters, x'i is a (K + 1)-dimensional
row vector and ui is a scalar called the error term.
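• As a concrete illustration, the OLS estimate β̂ = (X'X)⁻¹X'y can be computed directly. The
following is a minimal sketch in Python with NumPy; the data values are invented purely for
demonstration.

```python
import numpy as np

# Illustrative data: N = 5 observations, a constant plus K = 2 regressors.
# The numbers are made up for demonstration only.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 1.0, 5.0],
              [1.0, 4.0, 2.0],
              [1.0, 3.0, 4.0],
              [1.0, 5.0, 1.0]])   # first column is the constant x_i0 = 1
y = np.array([7.0, 9.0, 8.0, 10.0, 6.0])

# OLS estimate: beta_hat solves (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residuals u_hat = y - X beta_hat
residuals = y - X @ beta_hat
print("beta_hat:", beta_hat)
print("residuals:", residuals)
```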
Multiple linear regression model building

• Let y denote the dependent variable that is linearly related to k independent (or explanatory)
variables X1, X2, ..., Xk through the parameters β1, β2, ..., βk, and we write

  y = β1X1 + β2X2 + ... + βkXk + ε    ... (1)

• We note that the jth regression coefficient βj represents the expected change in y per unit
change in the jth independent variable Xj. Assuming E(ε) = 0,

  βj = ∂E(y)/∂Xj

• A model is said to be linear when it is linear in its parameters. In such a case, ∂E(y)/∂βj
should not depend on any β's.
Partial correlation and regression model building

• Partial correlation is a measure of the strength and direction of a linear relationship between
two continuous variables whilst controlling for the effect of one or more other continuous
variables (also known as covariates or control variables).

• Suppose we want to find the correlation between Y and X controlling for W. This is called the
partial correlation.

• We need to ensure that no variance predictable from W enters the relationship between Y and
X. In z-score form, we can predict both X and Y from W, then subtract those predictions,
leaving only the information in X and Y that is independent of W, as follows:

  zXP = rXW zW    (1)
  zYP = rYW zW    (2)

where zXP and zYP are the predicted z-scores for X and Y respectively.

• Subtracting these predicted scores we get

  zX(res) = zX − rXW zW    (3)
  zY(res) = zY − rYW zW    (4)

with variance (1 − rXW²) and (1 − rYW²) respectively, where zX(res) and zY(res) are the
residual information in X and Y controlling for W.
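• The residual-based computation above can be sketched directly. Here is a minimal Python
(NumPy) example; the data for X, Y and W are synthetic and the variable names simply mirror
equations (1)-(4).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=200)
X = 0.5 * W + rng.normal(size=200)   # X partly driven by W
Y = 0.7 * W + rng.normal(size=200)   # Y partly driven by W

# z-scores of each variable
zX = (X - X.mean()) / X.std()
zY = (Y - Y.mean()) / Y.std()
zW = (W - W.mean()) / W.std()

# Predict X and Y from W in z-score form, then subtract the predictions
r_XW = np.corrcoef(X, W)[0, 1]
r_YW = np.corrcoef(Y, W)[0, 1]
zX_res = zX - r_XW * zW   # residual X information, controlling W
zY_res = zY - r_YW * zW   # residual Y information, controlling W

# Partial correlation of X and Y controlling W: correlate the residuals
r_XY_given_W = np.corrcoef(zX_res, zY_res)[0, 1]
print("partial correlation:", r_XY_given_W)
```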
Multiple linear regression model

• A multiple linear regression model with k predictor variables X1, X2, ..., Xk and a response Y
can be written as
  – y = β0 + β1x1 + β2x2 + · · · + βkxk + ε

• More complex models may include higher powers of one or more predictor variables, e.g.,
  – y = β0 + β1x + β2x² + ε    ... (1)

• Or interaction effects of two or more variables:
  – y = β0 + β1x1 + β2x2 + β12x1x2 + ε    ... (2)

• Equations (1) and (2) are examples of such extended models; a fitting sketch follows below.
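• In practice, a squared term or an interaction term is just an extra column of the design
matrix. The following is a minimal Python (NumPy) sketch of fitting model (2) on invented
data; the true coefficient values are chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(0, 10, size=100)
x2 = rng.uniform(0, 10, size=100)
# Synthetic response following y = b0 + b1*x1 + b2*x2 + b12*x1*x2 + e
y = 2.0 + 1.5 * x1 - 0.5 * x2 + 0.3 * x1 * x2 + rng.normal(size=100)

# Design matrix: constant, x1, x2 and the interaction column x1*x2
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated [b0, b1, b2, b12]:", beta_hat)
```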
Multiple linear regression coefficients - partial regression coefficients

• Linear regression is one of the most popular statistical techniques.

• A linear regression model with two predictor variables can be expressed with the following
equation: Y = B0 + B1*X1 + B2*X2 + e.

• The variables in the model are:
  – Y, the response variable.
  – X1, the first predictor variable.
  – X2, the second predictor variable and
  – e, the residual error, which is an unmeasured variable.

• The parameters in the model are:
  – B0, the Y-intercept.
  – B1, the first regression coefficient and
  – B2, the second regression coefficient.

• Interpreting the intercept:
  – B0, the Y-intercept, can be interpreted as the value we would predict for Y if both X1 = 0 and X2 = 0.

• Interpreting the partial regression coefficients:
  – B1 is the expected change in Y per unit change in X1, holding X2 constant; B2 is interpreted analogously.
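• A short sketch (Python/NumPy, invented data) of fitting Y = B0 + B1*X1 + B2*X2 + e and
reading off the parameters; the true values 10, 2 and −1 are arbitrary choices for
demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal(5, 2, size=50)
X2 = rng.normal(3, 1, size=50)
Y = 10.0 + 2.0 * X1 - 1.0 * X2 + rng.normal(size=50)

# Fit by least squares with an explicit intercept column
X = np.column_stack([np.ones_like(X1), X1, X2])
B0, B1, B2 = np.linalg.lstsq(X, Y, rcond=None)[0]

# B0: predicted Y when X1 = 0 and X2 = 0
# B1: expected change in Y per unit change in X1, holding X2 fixed
# B2: expected change in Y per unit change in X2, holding X1 fixed
print(f"B0 = {B0:.2f}, B1 = {B1:.2f}, B2 = {B2:.2f}")
```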
Standardized regression coefficients

• The value of a slope in a multiple regression problem depends on the units in which the
corresponding predictor xj is measured.

• Scaling is necessary to overcome this problem of comparison.

• Unit normal scaling: subtract the sample mean and divide by the sample standard deviation,
for both the predictor variables and the response:

  zij = (xij − x̄j) / sj,    yi* = (yi − ȳ) / sy

  – where sj is the estimated sample standard deviation of predictor xj and sy is the
estimated sample standard deviation of the response.

• Using these new standardized variables, our regression model becomes:

  – yi* = b1zi1 + b2zi2 + · · · + bkzik + εi,    i = 1, . . . , n.

• The least squares estimator b̂ = (Z'Z)⁻¹Z'y* is the standardized coefficient estimate; a short
sketch follows below.
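• A minimal Python (NumPy) sketch of unit normal scaling, using synthetic predictors that are
deliberately placed on very different scales so the effect of standardizing is visible.

```python
import numpy as np

rng = np.random.default_rng(3)
# Two predictors on very different scales (factors 10.0 and 0.1 are arbitrary)
X = rng.normal(size=(100, 2)) * [10.0, 0.1]
y = X @ np.array([0.2, 30.0]) + rng.normal(size=100)

# Unit normal scaling: z-score each predictor and the response
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # s_j = sample std of predictor j
y_star = (y - y.mean()) / y.std(ddof=1)            # s_y = sample std of response

# Standardized coefficient estimate: b_hat = (Z'Z)^{-1} Z'y*
b_hat = np.linalg.solve(Z.T @ Z, Z.T @ y_star)
print("standardized coefficients:", b_hat)   # now directly comparable across predictors
```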


Missing data

• Missing data causes problems because multiple regression procedures require that every
case have a score on every variable that is used in the analysis.

• The most common ways of dealing with missing data are:
  – Pairwise deletion, listwise deletion, deletion of variables and coding of missingness.

• If data are missing randomly, then it may be appropriate to estimate each bivariate
correlation on the basis of all cases that have data on the two variables.
  – Pairwise deletion of missing data.

• A second procedure is to delete an entire case if information is missing on any one of the
variables that is used in the analysis.
  – Listwise deletion.

• A third procedure is simply to delete a variable that has substantial missing data.
  – Deletion of variables.
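• A hedged illustration with pandas of three of these strategies (coding of missingness is
omitted here); the small data frame and its NaN gaps are invented for demonstration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "y":  [1.0, 2.0, np.nan, 4.0, 5.0],
    "x1": [2.0, np.nan, 6.0, 8.0, 10.0],
    "x2": [1.0, 1.5, 2.0, np.nan, 3.0],
})

# Listwise deletion: drop any case with a missing value on any variable
listwise = df.dropna()

# Deletion of a variable: drop a column with substantial missing data
dropped_var = df.drop(columns=["x2"])

# Pairwise deletion: pandas' corr() uses, for each pair of variables,
# all cases that have data on both of those variables
pairwise_corr = df.corr()
print(listwise, dropped_var, pairwise_corr, sep="\n\n")
```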
Validation of multiple regression model

• The validation process can involve analyzing the goodness of fit of the regression, analyzing
whether the regression residuals are random, and checking whether the model's predictive
performance deteriorates substantially when applied to data that were not used in model
estimation.

• One measure of goodness of fit is R² (the coefficient of determination), which, in ordinary
least squares with an intercept, ranges between 0 and 1.

• Numerical methods also play an important role in model validation. For example, the lack-of-fit
test for assessing the correctness of the functional part of the model can aid in interpreting
a borderline residual plot.

• Cross-validation is the process of assessing how the results of a statistical analysis will
generalize to an independent data set.

• A development in medical statistics is the use of out-of-sample cross-validation techniques in
meta-analysis. It forms the basis of the validation statistic, Vn, which is used to test the
statistical validity of meta-analysis summary estimates. Essentially it measures a type of
normalized prediction error, and its distribution is a linear combination of χ² variables with
1 degree of freedom each.
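• A minimal cross-validation sketch using scikit-learn's cross_val_score on synthetic data;
this is one common implementation choice, not a tool prescribed by the unit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=60)

# 5-fold cross-validation: fit on 4 folds, score (R^2) on the held-out fold
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("per-fold R^2:", scores)
print("mean out-of-sample R^2:", scores.mean())
```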
Coefficient of multiple determination (R-squared)

• R-squared is a goodness-of-fit measure for linear regression models.

• It indicates the percentage of the variance in the dependent variable that the independent
variables explain collectively.

• R-squared measures the strength of the relationship between the model and the dependent
variable on a convenient 0-100% scale.

• Residuals are the distances between the observed values and the fitted values; in terms of
them, R² = 1 − SSres/SStot, where SSres is the sum of squared residuals and SStot is the
total sum of squares about the mean.
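• A short Python (NumPy) sketch computing R² from the residuals of a simple fit; the data are
synthetic.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=40)
y = 3.0 + 2.0 * x + rng.normal(size=40)

# Fit a line by least squares and form the residuals
X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X @ beta_hat

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(f"R-squared: {r_squared:.3f}")   # on a 0-1 (i.e., 0-100%) scale
```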
Adjusted R-squared

• The adjusted R-squared compares the explanatory power of regression models that contain
different numbers of predictors.

• The adjusted R-squared is a modified version of R-squared that has been adjusted for the
number of predictors in the model.

• Multiple R-squared is the proportion of Y variance that can be explained by the linear model
using the X variables in the sample data, but it over-estimates that proportion in the population.

• Consider, for example, sample R² = 0.60 based on k = 7 predictor variables in a sample of
N = 15 cases. An estimate of the proportion of Y variance that can be accounted for by the X
variables in the population is called shrunken R-squared or adjusted R-squared. It can be
calculated with the following formula:

  Shrunken R² = R̃² = 1 − (1 − R²) · (N − 1)/(N − k − 1) = 1 − (1 − 0.6) · (14/7) = 0.20
Statistical significance: t-Test

• A t-test is a type of inferential statistic used to determine if there is a significant difference
between the means of two groups, which may be related in certain features.

• A t-test looks at the t-statistic, the t-distribution values and the degrees of freedom to
determine the probability of difference between two sets of data.

• Mathematically, the t-test takes a sample from each of the two sets and establishes the
problem statement by assuming a null hypothesis that the two means are equal.

• For a large sample size, statisticians use a z-test. Other testing options include the chi-square
test and the F-test.
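• In the regression setting, the same idea is applied per coefficient: each individual coefficient
is t-tested against the null hypothesis that it equals zero. A hedged sketch using statsmodels
on synthetic data (where the second predictor truly has a zero coefficient):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 2))
y = 1.0 + 2.5 * X[:, 0] + 0.0 * X[:, 1] + rng.normal(size=80)

# t-test for each coefficient: H0 is that the coefficient equals zero
results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.tvalues)   # t-statistics for intercept, x1, x2
print(results.pvalues)   # corresponding two-sided p-values
```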
Checkpoint (1 of 2)

Multiple choice questions:

1. Multiple linear regression (MLR) is a __________ type of statistical analysis.


a) Univariate
b) Bivariate
c) Multivariate
d) None of these
2. The following types of data can be used in MLR (choose all that apply)
a) Interval or higher dependent variable (DV)
b) Interval or higher independent variables (IVs)
c) Dichotomous IVs
d) Interval or lower independent variables
3. A LR analysis produces the equation Y = -3.2X + 7. This indicates that:
a) A 1 unit increase in X results in a 3.2 unit decrease in Y.
b) A 1 unit decrease in X results in a 3.2 unit decrease in Y.
c) A 1 unit increase in X results in a 3.2 unit increase in Y.
d) None of these
Checkpoint solutions (1 of 2)

Multiple choice questions:

1. Multiple linear regression (MLR) is a __________ type of statistical analysis.
   Answer: c) Multivariate
2. The following types of data can be used in MLR (choose all that apply)
   Answers: a) Interval or higher dependent variable (DV), b) Interval or higher independent
   variables (IVs), and c) Dichotomous IVs
3. A LR analysis produces the equation Y = -3.2X + 7. This indicates that:
   Answer: a) A 1 unit increase in X results in a 3.2 unit decrease in Y.
Checkpoint (2 of 2)

Fill in the blanks:

1. The main purpose of linear regression (LR) is to explain ____.
2. In MLR, the square of the multiple correlation coefficient, or R², is called the _____.
3. _______ is a modified version of R-squared.
4. OLS stands for _______.

True or False:

1. The major conceptual limitation of all regression techniques is that one can only ascertain
relationships, but never be sure about the underlying causal mechanism. True/False
2. In MLR, a residual is the difference between the predicted Y and actual Y values.
True/False
3. Multiple regression is not an extension of simple linear regression. True/False
Checkpoint solutions (2 of 2)

Fill in the blanks:

1. The main purpose of linear regression (LR) is to explain one variable in terms of another.
2. In MLR, the square of the multiple correlation coefficient, or R², is called the coefficient of
determination.
3. Adjusted R-squared is a modified version of R-squared.
4. OLS stands for Ordinary Least Squares.

True or False:

1. The major conceptual limitation of all regression techniques is that one can only ascertain
relationships, but never be sure about the underlying causal mechanism. True
2. In MLR, a residual is the difference between the predicted Y and actual Y values. True
3. Multiple regression is not an extension of simple linear regression. False
Question bank

Two marks questions:

1. What is multiple linear regression?
2. What is meant by dependent and independent variables?
3. What is the notion of partial correlation?
4. Define the validation process in multiple linear regression.

Four marks questions:

1. Give an insight into standardized regression coefficients.
2. What are categorical variables? How are the regression coefficients of categorical variables
interpreted?
3. What is the response variable? What are the explanatory variables?
4. What is adjusted R-squared? Give its mathematical representation.

Eight marks questions:

1. Describe the coefficient of multiple determination (R-squared).
2. Describe the statistical significance of individual variables in multiple linear regression: the t-test.
Unit summary

Having completed this unit, you should be able to:

• Understand the concept of multiple regression
• Learn the significance of ordinary least squares
• Gain insight into regression model building
• Gain conceptual clarity on interpreting regression coefficients
• Learn the concepts of standardized coefficients and categorical variables
• Know how to validate a multiple regression model
• Get a brief idea of R-squared and adjusted R-squared
• Have a clear understanding of the t-test and F-test
