LM04 Extensions of Multiple Regression IFT Notes
1. Introduction
2. Influence Analysis
3. Dummy Variables in a Multiple Linear Regression
4. Multiple Linear Regression with Qualitative Dependent Variables
Summary
This document should be read in conjunction with the corresponding learning module in the 2024
Level II CFA® Program curriculum. Some of the graphs, charts, tables, examples, and figures are
copyright 2023, CFA Institute. Reproduced and republished with permission from CFA Institute. All
rights reserved.
Required disclaimer: CFA Institute does not endorse, promote, or warrant the accuracy or quality of
the products or services offered by IFT. CFA Institute, CFA®, and Chartered Financial Analyst® are
trademarks owned by CFA Institute.
Version 1.0
1. Introduction
This learning module covers:
• Influence analysis and methods of detecting influential data points
• Dummy variables
• Logistic regression models
2. Influence Analysis
Besides violation of regression assumptions, another issue that an analyst should look for is
the presence of influential observations in the sample data. An influential observation is an
observation whose inclusion may significantly alter regression results.
Two kinds of observations may potentially influence regression results:
• A high-leverage point: A data point having an extreme value of an independent
variable (X).
• An outlier: A data point having an extreme value of the dependent variable (Y).
Exhibit 1 from the curriculum shows a high-leverage point (triangle). It has an unusually
high X value relative to other observations. The exhibit also presents two regression lines:
The dashed line includes the high-leverage point in the regression sample; the solid line
deletes it from the sample.
Exhibit 2 from the curriculum shows an outlier data point (triangle). It has an unusually high
Y value relative to other observations. As before, two regression lines are shown: The dashed
line includes the outlier in the regression sample; the solid line deletes it from the sample.
If the high-leverage points or outliers are far from the regression line, they can tilt the
estimated regression line towards them, affecting slope coefficients and goodness-of-fit
statistics.
Detecting Influential Points
Leverage (hii)
A high-leverage point can be identified using a measure called leverage (hii). Leverage
measures the distance between the value of the ith observation of an independent variable
and the mean value of that variable across all n observations. Leverage ranges from 0 to 1;
the higher the leverage, the more distant the observation’s value is from the mean, and hence
the more influence it can exert on the estimated regression line. Statistical software
packages can easily calculate the leverage measure.
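For intuition, leverage values are the diagonal elements of the hat matrix H = X(XᵀX)⁻¹Xᵀ, where X is the design matrix including the intercept column. A minimal NumPy sketch on simulated data (all values illustrative, not from the curriculum):

```python
import numpy as np

# Illustrative data: one independent variable, with one extreme X value
rng = np.random.default_rng(0)
x = np.append(rng.normal(10, 1, 19), 25.0)   # last observation has high leverage
X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept column
n, k = len(x), 1                             # n observations, k independent variables

# Leverage h_ii = diagonal of the hat matrix H = X (X'X)^-1 X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Rule of thumb: observation i is potentially influential if h_ii > 3(k+1)/n
threshold = 3 * (k + 1) / n
flagged = np.where(leverage > threshold)[0]
print(flagged)   # only the extreme-X observation (index 19) exceeds the threshold
```

Note that the leverages always sum to k + 1 (the number of estimated coefficients), so the average leverage is (k + 1)/n; the rule of thumb flags observations at three times that average.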
Interpretation:
If leverage hii > 3((k + 1)/n), where k is the number of independent variables, then the
observation is potentially influential.
Cook’s Distance (Di)
An observation is potentially influential if its exclusion from the sample causes substantial
changes in the estimated regression function. Cook’s distance, or Cook’s D (Di), is a metric
for identifying influential data points. It measures how much the estimated values of the
regression change if observation i is deleted from the sample.
Interpretation:
If Di > 0.5, then the ith observation may be influential and merits further investigation.
If Di > 1.0, then the ith observation is highly likely to be an influential data point.
If Di > 2√(k/n), then the ith observation is highly likely to be an influential data point.
Example:
(This is a Question Set example from Section 2 of the curriculum.)
You are analyzing a regression model of companies’ ROA estimated with 26 observations
and three independent variables and are concerned about outliers and influential
observations. Using software, you calculate the studentized residual t-statistic and Cook’s D
for each observation.
While meeting with the research director to discuss your results, she asks you to do the
following:
1. Identify which observations, if any, are considered outliers using studentized residuals at
a 5% significance level.
Solution
There are n – k – 2 = 26 – 3 – 2 = 21 degrees of freedom, so the critical t-statistics are ±2.080.
Therefore, Observations 3 and 5 are considered outliers based on the studentized residuals.
3. Recommend what actions, if any, you should take to deal with any influential observations.
Solution
From the results of studentized residuals and Cook’s D, the analyst should investigate outlier
Observation 5 to ensure there are no data entry or quality issues.
Slope Dummy
A slope dummy allows for the slope of the regression line to change if a specific condition is
met.
Consider the following regression equation that has one independent variable, X, and one
slope dummy variable, D.
Yi = b0 + b1Xi + d1DiXi + εi.
The slope dummy variable creates an interaction term between the X variable and the
condition represented by D = 1.
• If D = 0, then Y = b0 + b1X + ε (base category).
• If D = 1, then Y = b0 + (b1 + d1) X + ε (category to which changed slope applies).
Exhibit 11, Panel B, shows the effect of a slope dummy variable.
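A slope-dummy regression can be estimated by ordinary least squares once the interaction column D·X is added to the design matrix. A minimal sketch with simulated data (the true coefficients b0 = 1, b1 = 0.5, d1 = 0.3 are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
X = rng.uniform(0, 10, n)
D = (rng.uniform(size=n) > 0.5).astype(float)   # dummy: condition met (1) or not (0)

# True model: Y = b0 + b1*X + d1*(D*X) + noise
Y = 1.0 + 0.5 * X + 0.3 * D * X + rng.normal(scale=0.2, size=n)

# Design matrix: intercept, X, and the interaction term D*X
A = np.column_stack([np.ones(n), X, D * X])
b0, b1, d1 = np.linalg.lstsq(A, Y, rcond=None)[0]

print(round(b1, 2))        # slope for the base category (D = 0), approximately 0.5
print(round(b1 + d1, 2))   # slope when D = 1, approximately 0.8
```

The estimated base-category slope is b1, and the slope for the D = 1 category is b1 + d1, matching the two cases listed above.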
Example:
(This is a Question Set example from Section 3 of the curriculum.)
During an interview, a managing director (MD) asks how you would extend a model of
country stock returns to classify countries by stock market type, emerging (1) or developed
(0) markets. He provides three choices, saying the following:
1. To answer the MD’s question, identify the new variable and its function.
A. Slope dummy
B. Intercept dummy
C. Interaction term
Solution
B is correct. The new variable, an intercept dummy, allows for a change in intercept to
classify countries by emerging versus developed stock market status.
2. The MD continues, indicating that you must refine the model to capture the effect on stock
returns of the interaction of each country’s GDP growth and its stock market development
status. He then asks you to do the following:
Identify the model you should use (noting these definitions).
GDPG: Country GDP growth
EM: Indicates emerging stock market country
DM: Indicates developed stock market country
A. Stock return = b0 + b1GDPG + d1EM + d2DM + d3(EM × GDPG) + ɛ.
B. Stock return = b0 + b1GDPG + d1EM + d2DM + ɛ.
C. Stock return = b0 + b1GDPG + d1EM + d2(EM × GDPG) + ɛ.
Solution
C is correct. This model includes a variable for country GDP growth, (GDPG); one dummy for
emerging stock market status (EM = 1, 0 otherwise), with developed market status as the
base case; and a term (EM × GDPG) for the interaction of EM status with GDP growth.
3. Another MD joins the interview and mentions that an analyst on her team estimated a
regression to explain a cross-section of returns on assets of companies using a regulation
dummy variable (REG = 1 if regulated, 0 otherwise), market share (MKTSH), and an
interaction term, REG_MKTSH, the product of REG and MKTSH. She notes the resulting
model is
RET = 0.50 – 0.5REG + 0.4MKTSH – 0.2REG_MKTSH
and asks you to do the following:
4. Multiple Linear Regression with Qualitative Dependent Variables
A qualitative (categorical) dependent variable, such as whether a company goes bankrupt,
cannot be modeled well with ordinary linear regression. Logistic (logit) regression instead
models the probability P of the event through the odds of the event, defined as P/(1 − P).
For example, if the probability of a company going bankrupt is 0.75, P/(1 – P) is 0.75/(1 −
0.75) = 3. So, the odds of bankruptcy are 3 to 1, implying the probability of bankruptcy is
three times as large as the probability of the company not going bankrupt.
The natural logarithm (ln) of the odds of an event happening is the log odds, which is also
called the logit function.
The logit transformation linearizes the relation between the transformed dependent
variable and the independent variables.
ln(P/(1 − P)) = b0 + b1X1 + b2X2 + b3X3 + ε
Once the log odds are estimated, the event probability can be derived as:
P = 1 / (1 + exp[−(b0 + b1X1 + b2X2 + b3X3)])
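The odds and inverse-logit transformations are easy to verify numerically; a quick sketch (function names are illustrative):

```python
import math

def odds(p):
    """Odds of an event: P / (1 - P)."""
    return p / (1 - p)

def inv_logit(log_odds):
    """Recover the event probability from log odds: 1 / (1 + exp(-log_odds))."""
    return 1 / (1 + math.exp(-log_odds))

p = 0.75
print(odds(p))                        # 3.0: bankruptcy is 3 times as likely as no bankruptcy
print(inv_logit(math.log(odds(p))))  # approximately 0.75: the inverse logit recovers P
```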
Logistic regression coefficients are typically estimated using the maximum likelihood
estimation (MLE) method rather than by least squares. MLE is an iterative process
performed by software, where the goal is to maximize the log likelihood. Each iteration
produces a higher log likelihood, and the process stops when the difference in log likelihood
between two successive iterations is very small.
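The iterative idea can be illustrated with Newton–Raphson steps on simulated data; each update should raise the log likelihood until the improvement becomes negligible. This is a sketch of the mechanics (true coefficients and data are invented), not production estimation code:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one regressor
true_b = np.array([-0.5, 1.0])                          # hypothetical true coefficients
p_true = 1 / (1 + np.exp(-X @ true_b))
y = (rng.uniform(size=n) < p_true).astype(float)        # simulated 0/1 outcomes

def log_likelihood(b):
    z = X @ b
    # sum of y*ln(p) + (1-y)*ln(1-p), written as y*z - ln(1 + e^z)
    return np.sum(y * z - np.log1p(np.exp(z)))

b = np.zeros(2)                        # starting guess
ll_path = [log_likelihood(b)]
for _ in range(25):                    # Newton-Raphson iterations
    p = 1 / (1 + np.exp(-X @ b))
    W = p * (1 - p)                    # weights in the Hessian X' diag(W) X
    grad = X.T @ (y - p)
    hess = X.T @ (X * W[:, None])
    b = b + np.linalg.solve(hess, grad)
    ll_path.append(log_likelihood(b))
    if ll_path[-1] - ll_path[-2] < 1e-8:   # stop when the gain is very small
        break

print(ll_path[0] < ll_path[-1])        # True: the log likelihood improved
```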
Exhibit 14 from the curriculum compares a linear probability model to a logit model.
In a logit model, slope coefficients are interpreted as the change in the log odds that the
event happens per unit change in the independent variable, holding all other independent
variables constant.
A likelihood ratio (LR) test is a method to assess the fit of logistic regression models. The
LR test statistic is:
LR = −2 (Log likelihood restricted model − Log likelihood unrestricted model)
The test is similar to the joint F-test seen in an earlier learning module. It compares the fit of
the restricted and unrestricted models. For example, say we want to compare an
unrestricted Model A:
Model A: ln(P/(1 − P)) = b0 + b1X1 + b2X2 + b3X3 + ε
to Model B, with restrictions b2 = b3 = 0,
Model B: ln(P/(1 − P)) = b0 + b1X1 + ε
In this case, the null hypothesis is H0: b2 = b3 = 0, and the alternative hypothesis is that at
least one of the coefficients is different from zero. Thus, the LR test is a joint test of the
restricted coefficients. Rejecting the null hypothesis means rejecting the smaller, restricted
model in favor of the larger, unrestricted model.
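Given the two models’ log likelihoods, the LR statistic and its p-value follow directly. With two restrictions (b2 = b3 = 0), the statistic is compared against a chi-square distribution with 2 degrees of freedom, whose survival function has the closed form exp(−x/2). The log-likelihood values below are hypothetical:

```python
import math

ll_unrestricted = -120.5   # Model A log likelihood (hypothetical value)
ll_restricted = -126.8     # Model B log likelihood, with b2 = b3 = 0 (hypothetical value)

# LR = -2 * (log likelihood restricted - log likelihood unrestricted)
LR = -2 * (ll_restricted - ll_unrestricted)

# Chi-square survival function for df = 2 has the closed form exp(-x/2)
p_value = math.exp(-LR / 2)

print(round(LR, 2))       # 12.6
print(round(p_value, 4))  # 0.0018: reject H0, preferring the unrestricted model
```

For more than two restrictions, the same comparison uses a chi-square distribution with degrees of freedom equal to the number of restricted coefficients.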
Example:
(This is a Knowledge Check example from Section 4 of the curriculum.)
You are assigned to examine the propensity of companies to repurchase their shares, so for a
sample of 500 companies, you have identified those that repurchased shares (Repurchase =
1) and those that did not (Repurchase = 0). You also collected company data for the year
prior to the repurchase, including cash-to-total-assets ratio (CASH), debt-to-equity ratio
(DE), and net profit margin (NPM), and estimated the following logistic regression:
Repurchasei = b0 + b1CASHi + b2DEi + b3NPMi + εi.
Your regression results are shown in Exhibit 15.
In the weekly research team meeting, the research director asks you to explain your logistic
regression model and assess how the model fits the data, as follows:
To begin, we use the coefficient estimates from the logistic regression results and the mean
values of the independent variables to find the initial average probability of repurchasing
shares:
This implies that for the average firm, there is a 29.06% probability of share repurchase.
Now, for each independent variable, let us increase it by 1%, or 0.01, while holding the
others constant, and see the marginal impact on the probability of a share buyback.
CASH:
We increase the CASH variable by 1%, from 0.083 to 0.093, and calculate the new probability
of share buyback:
P = 28.87%
Therefore, the marginal impact of increasing the CASH variable by 1% and holding all the
other variables constant is a change in the probability of a share buyback of 28.87% −
29.06% = −0.19%; differently put, increasing the CASH variable by 1% decreases the
probability of a buyback by 0.19%.
NPM:
We increase the NPM variable by 1%, from −0.0535 to −0.0435, and calculate the new
probability of a share buyback:
Therefore, the marginal impact of increasing the NPM variable by 1% is an increase in the
probability of a share buyback of 29.26% − 29.06% = 0.20%.
DE:
We increase the DE variable by 1%, from 0.9182 to 0.9282, and calculate the new probability
of a share buyback:
Therefore, the marginal impact of increasing the DE variable by 1%, rounded to two
decimal places, is a change in the probability of a share buyback of 29.00% − 29.06% =
−0.06%; differently put, increasing the DE variable by 1% decreases the probability of a
share buyback.
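The marginal-impact calculations above all follow one recipe: compute P at the mean values of the regressors, increase one regressor by 0.01 holding the others fixed, and recompute P. A sketch of that recipe with hypothetical coefficients (the actual estimates from Exhibit 15 are not reproduced here; only the sample means come from the example):

```python
import math

# Hypothetical logistic coefficients: intercept, CASH, DE, NPM (illustrative only)
b = {"intercept": -1.0, "CASH": -2.0, "DE": 0.3, "NPM": 1.5}
# Sample means of the independent variables, as given in the example
means = {"CASH": 0.083, "DE": 0.9182, "NPM": -0.0535}

def prob(x):
    """Event probability P = 1 / (1 + exp(-(b0 + sum of b_j * x_j)))."""
    z = b["intercept"] + sum(b[name] * val for name, val in x.items())
    return 1 / (1 + math.exp(-z))

base = prob(means)
for var in ("CASH", "DE", "NPM"):
    bumped = dict(means)
    bumped[var] += 0.01                         # increase one variable by 0.01
    print(var, round(prob(bumped) - base, 5))   # marginal impact on the probability
```

With a logistic model, the sign of each marginal impact matches the sign of the corresponding coefficient, though its size depends on where the baseline probability sits on the logistic curve.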
3. Evaluate how your logistic regression model fits the data using the LR test and an
intercept-only model as the restricted model.
Solution:
The log likelihood statistics from the logistic regression results are:
The LR test is a test of the hypothesis for the restrictions, using the standard six-step
hypothesis test process, as follows:
Based on the LR test, your conclusion is that the unrestricted model fits the data better than
the intercept-only model, indicating that the three explanatory variables are jointly
significant. Note the regression results show the LR test statistic’s P-value is 0.0007.
Moreover, individual (z-statistic) tests of the coefficients show that DE and NPM are each
significant at the 5 percent level.
Summary
LO: Describe influence analysis and methods of detecting influential data points.
Two kinds of observations may potentially influence regression results:
• A high-leverage point: A data point having an extreme value of an independent
variable (X).
• An outlier: A data point having an extreme value of the dependent variable (Y).
Exhibit 7 presents a summary of the measures of influential observations.
LO: Formulate and interpret a multiple regression model that includes qualitative
independent variables.
Dummy (or indicator) variables represent qualitative independent variables. They take on a
value of 1 if a particular condition is true and 0 if that condition is false. To distinguish
among n categories, the model must include n – 1 dummy variables.
An intercept dummy adds to or reduces the original intercept if a specific condition is met.
When the intercept dummy is 1, the regression line shifts up or down parallel to the base
regression line.
Yi = b0 + d0Di + b1Xi + εi.
A slope dummy allows for the slope of the regression line to change if a specific condition is
met.
Yi = b0 + b1Xi + d1DiXi + εi.
It is also possible for a regression model to use both intercept and slope dummy variables.
Yi = b0 + d0Di + b1Xi + d1DiXi + εi.