
LM04 Extensions of Multiple Regression
2024 Level II Notes

1. Introduction
2. Influence Analysis
3. Dummy Variables in a Multiple Linear Regression
4. Multiple Linear Regression with Qualitative Dependent Variables
Summary

This document should be read in conjunction with the corresponding learning module in the 2024
Level II CFA® Program curriculum. Some of the graphs, charts, tables, examples, and figures are
copyright 2023, CFA Institute. Reproduced and republished with permission from CFA Institute. All
rights reserved.

Required disclaimer: CFA Institute does not endorse, promote, or warrant the accuracy or quality of
the products or services offered by IFT. CFA Institute, CFA®, and Chartered Financial Analyst® are
trademarks owned by CFA Institute.

Version 1.0

1. Introduction
This learning module covers:
• Influence analysis and methods of detecting influential data points
• Dummy variables
• Logistic regression models
2. Influence Analysis
Besides violation of regression assumptions, another issue that an analyst should look for is
the presence of influential observations in the sample data. An influential observation is an
observation whose inclusion may significantly alter regression results.
Two kinds of observations may potentially influence regression results:
• A high-leverage point: A data point having an extreme value of an independent
variable (X).
• An outlier: A data point having an extreme value of the dependent variable (Y).
Exhibit 1 from the curriculum shows a high-leverage point (triangle). It has an unusually
high X value relative to other observations. The exhibit also presents two regression lines:
The dashed line includes the high-leverage point in the regression sample; the solid line
deletes it from the sample.

Exhibit 2 from the curriculum shows an outlier data point (triangle). It has an unusually high
Y value relative to other observations. As before, two regression lines are shown: The dashed
line includes the outlier in the regression sample; the solid line deletes it from the sample.

If the high-leverage points or outliers are far from the regression line, they can tilt the
estimated regression line towards them, affecting slope coefficients and goodness-of-fit
statistics.
Detecting Influential Points
Leverage (hii)
A high-leverage point can be identified using a measure called leverage (hii). Leverage
measures the distance between the value of the ith observation of an independent variable
and the mean value of that variable across all n observations. Leverage ranges from 0 to 1;
the higher the leverage, the more distant the observation's value is from the mean, and hence
the more influence it can exert on the estimated regression line. Statistical software
packages can easily calculate the leverage measure.
Interpretation:
• If hii > 3[(k + 1)/n], the observation is potentially influential.

where: k is the number of independent variables and n is the number of observations
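To make this concrete, the sketch below (simulated data; Python with numpy assumed)
computes hii as the diagonal of the hat matrix X(XᵀX)⁻¹Xᵀ and applies the cutoff above:

import numpy as np

# Simulated data in which one observation has an extreme X value.
rng = np.random.default_rng(0)
n, k = 40, 2
X = rng.normal(size=(n, k))
X[0, 0] = 8.0  # extreme value of an independent variable -> high leverage

Xd = np.column_stack([np.ones(n), X])      # design matrix with intercept
H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T   # hat matrix
h = np.diag(H)                             # leverage h_ii of each observation

cutoff = 3 * (k + 1) / n
print("potentially influential:", np.where(h > cutoff)[0])  # expected to flag observation 0

In practice the leverage values come straight from the regression output; the hat-matrix
computation is shown only to make the definition concrete.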


Studentized residual (ti*)
An outlier can be identified using a measure called studentized residuals. Statistical software
packages can calculate and present this measure.
Interpretation:
• If |ti*| > 3, flag the observation as being an outlier.
• If |ti*| > the critical value of the t-statistic with n − k − 2 degrees of freedom, flag the
outlier observation as being potentially influential.
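As a minimal sketch of how such software produces this measure (simulated data with one
planted outlier; statsmodels assumed):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 40
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[5] += 6.0  # extreme value of the dependent variable -> outlier

fit = sm.OLS(y, sm.add_constant(x)).fit()
t_star = fit.get_influence().resid_studentized_external  # studentized residuals t_i*
print("outliers:", np.where(np.abs(t_star) > 3)[0])      # expected to flag observation 5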
Cook’s distance
Outliers and high-leverage points are not necessarily influential. An observation is influential

if its exclusion from the sample causes substantial changes in the estimated regression
function. Cook’s distance, or Cook’s D (Di), is a metric for identifying influential data points.
It measures how much the estimated values of the regression change if observation i is
deleted from the sample.
Interpretation:
• If Di > 0.5, the ith observation may be influential and merits further investigation.
• If Di > 1.0, the ith observation is highly likely to be an influential data point.
• If Di > 2√(k/n), the ith observation is highly likely to be an influential data point.
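A companion sketch for Cook's D (again simulated data; statsmodels assumed), applying all
three cutoffs from the guidelines above:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, k = 40, 3
X = rng.normal(size=(n, k))
y = X @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=n)
X[0], y[0] = 6.0, 25.0  # plant a point that shifts the fitted regression

fit = sm.OLS(y, sm.add_constant(X)).fit()
D = fit.get_influence().cooks_distance[0]  # Cook's D for each observation

for cutoff in (0.5, 1.0, 2 * np.sqrt(k / n)):
    print(f"D > {cutoff:.4f}:", np.where(D > cutoff)[0])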

Exhibit 7 presents a summary of the measures of influential observations.

Example:
(This is a Question Set example from Section 2 of the curriculum.)
You are analyzing a regression model of companies’ ROA estimated with 26 observations
and three independent variables and are concerned about outliers and influential
observations. Using software, you calculate the studentized residual t-statistic and Cook’s D
for each observation, as shown below.


While meeting with the research director to discuss your results, she asks you to do the
following:

1. Identify which observations, if any, are considered outliers using studentized residuals at
a 5% significance level.
Solution
There are 21 (= 26 – 3 – 2) degrees of freedom, so the critical t-statistics are ±2.080.
Therefore, Observations 3 and 5 are considered outliers based on the studentized residuals.

2. Identify which observations, if any, are considered influential observations based on
Cook’s D.
Solution
Using the 0.5 and 1.0 guidelines for Cook’s D, there are no influential observations. Using the
alternative approach, 2√(k/n) = 2√(3/26) = 0.6794, there are no influential observations.
However, on visual inspection, Observation 5 has a Cook’s D much different from those of the
other observations.
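The critical value and the alternative cutoff used in these solutions can be reproduced in a
few lines (scipy assumed; the per-observation statistics themselves come from the
curriculum’s exhibit):

from math import sqrt
from scipy.stats import t

n, k = 26, 3
crit = t.ppf(1 - 0.05 / 2, n - k - 2)  # two-tailed 5% critical value with 21 df
print(round(crit, 3))                  # 2.080
print(round(2 * sqrt(k / n), 4))       # 0.6794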

3. Recommend what actions, if any, you should take to deal with any influential observations.


Solution
From the results of studentized residuals and Cook’s D, the analyst should investigate outlier
Observation 5 to ensure there are no data entry or quality issues.

3. Dummy Variables in a Multiple Linear Regression


Dummy (or indicator) variables represent qualitative independent variables. They take on a
value of 1 if a particular condition is true and 0 if that condition is false. Dummy variables
are often used to distinguish between “groups” or “categories” of data.
Defining a Dummy Variable
A dummy variable may arise in several ways, such as:
• It may reflect an inherent property of the data (e.g., industry membership).
• It may be a characteristic of the data represented by a condition that is either true or
false (e.g., a date before or after a key market event).
• It may be constructed from some characteristic of the data where the dummy variable
reflects a condition that is either true or false (e.g., firm sales less than or greater than
some value).
To distinguish among n categories, the model must include n – 1 dummy variables. So, if we
use dummy variables to denote companies belonging to one of five industry sectors, we use
four dummies. The category not assigned becomes the “base” or “control” group and the
slope of each dummy variable is interpreted relative to the base.
Visualizing and Interpreting Dummy Variables
Intercept Dummy
A commonly used dummy variable is the intercept dummy. It adds to or reduces the original
intercept if a specific condition is met. When the intercept dummy is 1, the regression line
shifts up or down parallel to the base regression line.
Consider the following regression equation that has one independent variable, X, and one
intercept dummy variable, D.
Yi = b0 + d0Di + b1Xi + εi.
This single regression model estimates two lines of best fit based on the value of the dummy
variable:
• If D = 0, then the equation becomes Y = b0 + b1X + ε (base category).
• If D = 1, then the equation becomes Y = (b0 + d0) + b1X + ε (category to which the
changed intercept applies).
Exhibit 11, Panel A, shows the effect of an intercept dummy variable.


Slope Dummy
A slope dummy allows for the slope of the regression line to change if a specific condition is
met.
Consider the following regression equation that has one independent variable, X, and one
slope dummy variable, D.
Yi = b0 + b1Xi + d1DiXi + εi.
The slope dummy variable creates an interaction term between the X variable and the
condition represented by D = 1.
• If D = 0, then Y = b0 + b1X + ε (base category).
• If D = 1, then Y = b0 + (b1 + d1) X + ε (category to which changed slope applies).
Exhibit 11, Panel B, shows the effect of a slope dummy variable.


Models with both intercept and slope dummy variables


It is possible for a regression model to use both intercept and slope dummy variables.
Consider the following equation:
Yi = b0 + d0Di + b1Xi + d1DiXi + εi.
• If D = 0, then Y = b0 + b1X + ε (base category).
• If D = 1, then Y = (b0 + d0) + (b1 + d1)X + ε (category to which the changed intercept
and slope apply).
Exhibit 11, Panel C, shows the combined effect of an intercept and slope dummy variable.

Testing for Statistical Significance of Dummy Variables


Individual t-tests on the dummy variable coefficients indicate whether they are significantly
different from zero.
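The sketch below (simulated data; statsmodels assumed) estimates a model with both an
intercept dummy and a slope dummy and reads off the coefficient t-statistics:

import numpy as np
import statsmodels.api as sm

# Simulate Y = b0 + d0*D + b1*X + d1*D*X + e with b0 = 2, d0 = 1.5, b1 = 1, d1 = -0.8.
rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
d = (rng.random(n) < 0.5).astype(float)  # dummy: 1 if the condition is true, else 0
y = 2.0 + 1.5 * d + 1.0 * x - 0.8 * d * x + rng.normal(scale=0.3, size=n)

Xd = sm.add_constant(np.column_stack([d, x, d * x]))  # columns: [1, D, X, D*X]
fit = sm.OLS(y, Xd).fit()
print(fit.params)   # estimates of b0, d0, b1, d1
print(fit.tvalues)  # individual t-tests on each coefficient, including the dummies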
Example:
(This is a Question Set example from Section 3 of the curriculum.)
You are interviewing for the position of junior analyst at a global macro hedge fund. The
managing director (MD) interviewing you outlines the following scenario: You are tasked
with studying the relation between stock market returns and GDP growth for multiple
countries and must use a binary variable in your regression model to categorize countries by
stock market type, emerging (1) or developed (0) markets. He provides three choices, saying
the following:

1. To answer the MD’s question, identify the new variable and its function.
A. Slope dummy
B. Intercept dummy
C. Interaction term
Solution
B is correct. The new variable, an intercept dummy, allows for a change in intercept to
classify countries by emerging versus developed stock market status.

2. The MD continues, indicating that you must refine the model to capture the effect on stock
returns of the interaction of each country’s GDP growth and its stock market development
status. He then asks you to do the following:
Identify the model you should use (noting these definitions).
GDPG: Country GDP growth
EM: Indicates emerging stock market country
DM: Indicates developed stock market country
A. Stock return = b0 + b1GDPG + d1EM + d2DM + d3(EM × GDPG) + ɛ.
B. Stock return = b0 + b1GDPG + d1EM + d2DM + ɛ.
C. Stock return = b0 + b1GDPG + d1EM + d2(EM × GDPG) + ɛ.
Solution
C is correct. This model includes a variable for country GDP growth, (GDPG); one dummy for
emerging stock market status (EM = 1, 0 otherwise), with developed market status as the
base case; and a term (EM × GDPG) for the interaction of EM status with GDP growth.

3. Another MD joins the interview and mentions that an analyst on her team estimated a
regression to explain a cross-section of returns on assets of companies using a regulation
dummy variable (REG = 1 if regulated, 0 otherwise), market share (MKTSH), and an
interaction term, REG_MKTSH, the product of REG and MKTSH. She notes the resulting
model is
RET = 0.50 – 0.5REG + 0.4MKTSH – 0.2REG_MKTSH
and asks you to do the following:

Identify which of the following statements is correct regarding interpretation of the
regression results (indicate all that apply).
A. The average return for a regulated firm is 0.5% lower than for a non-regulated firm,
holding the market share constant.
B. Non-regulated companies with larger market shares have lower ROAs than regulated
companies.
C. For each increase in market share, a regulated firm has a 0.3 lower return on assets than a
non-regulated firm.
Solution
A and C are correct.
A is correct because the coefficient on REG is –0.5.
C is correct because the sum of the coefficients on REG, MKTSH, and REG_MKTSH is
−0.5 + 0.4 − 0.2 = −0.3.
B is not correct because the coefficient on MKTSH is positive and the coefficient on REG is
negative.
4. Multiple Linear Regression with Qualitative Dependent Variables
Qualitative (or categorical) dependent variables are outcome variables describing data
that fit into categories. For example, to predict whether a company will go bankrupt or not,
we need to use a qualitative dependent variable (bankrupt or not) as the dependent variable
and financial performance of the company (e.g. return on equity, debt-to-equity ratio) as
independent variables.
In these situations, a linear regression model is not the right estimation model to predict a
discrete outcome. In the bankruptcy example, the outcome we want to predict is 0 or 1, but a
linear regression will not produce a discrete outcome: the predicted value can be greater
than 1 or less than 0. Since a probability cannot be greater than 100 percent or less than 0
percent, this model is not appropriate.
To address this issue, we apply a logistic transformation. If P denotes the probability of the
event (e.g. bankruptcy), the logistic transformation is:
ln[P/(1 − P)]
The ratio P/(1 – P) is a ratio of probabilities—the probability that the event of interest
happens, P, divided by the probability that it does not happen (1 − P), with the ratio
representing the odds of an event occurring.


For example, if the probability of a company going bankrupt is 0.75, P/(1 – P) is 0.75/(1 −
0.75) = 3. So, the odds of bankruptcy are 3 to 1, implying the probability of bankruptcy is
three times as large as the probability of the company not going bankrupt.
The natural logarithm (ln) of the odds of an event happening is the log odds, which is also
called the logit function.
The logit transformation linearizes the relation between the transformed dependent
variable and the independent variables.
ln(P/(1 − P)) = b0 + b1X1 + b2X2 + b3X3 + ε
Once the log odds are estimated, the event probability can be derived as:
P = 1 / (1 + exp[−(b0 + b1X1 + b2X2 + b3X3)])
Logistic regression coefficients are typically estimated using the maximum likelihood
estimation (MLE) method rather than by least squares. MLE is an iterative process
performed by software, where the goal is to maximize the log likelihood. Each iteration
produces a higher log likelihood, and the process stops when the difference in log likelihood
between two successive iterations is very small.
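The sketch below ties these pieces together (simulated data; statsmodels’ Logit, which
estimates by MLE, is assumed):

import numpy as np
import statsmodels.api as sm

# Simulate a binary outcome from known log-odds (logit) coefficients.
rng = np.random.default_rng(4)
n = 500
X = rng.normal(size=(n, 3))
log_odds = -0.5 + X @ np.array([1.2, -0.7, 0.4])
p = 1.0 / (1.0 + np.exp(-log_odds))
y = rng.binomial(1, p)

fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)  # coefficients estimated by MLE
print(fit.params)  # log-odds coefficients (compare with -0.5, 1.2, -0.7, 0.4)
print(fit.llf)     # maximized log likelihood

# Event probabilities follow from P = 1 / (1 + exp(-(b0 + b1*X1 + b2*X2 + b3*X3))).
p_hat = fit.predict(sm.add_constant(X))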
Exhibit 14 from the curriculum compares a linear probability model to a logit model.

In a logit model, slope coefficients are interpreted as the change in the log odds that the
event happens per unit change in the independent variable, holding all other independent
variables constant.


A likelihood ratio (LR) test is a method to assess the fit of logistic regression models. The
LR test statistic is:
LR = −2 (Log likelihood restricted model − Log likelihood unrestricted model)
The test is similar to the joint F-test seen in an earlier learning module. It compares the fit of
the restricted and unrestricted models. For example, say we want to compare an
unrestricted Model A:
Model A: ln(P/(1 − P)) = b0 + b1X1 + b2X2 + b3X3 + ε

to Model B, with restrictions b2 = b3 = 0:

Model B: ln(P/(1 − P)) = b0 + b1X1 + ε
In this case, the null hypothesis is H0: b2 = b3 = 0, and the alternative hypothesis is that at
least one of the coefficients is different from zero. Thus, the LR test is a joint test of the
restricted coefficients. Rejecting the null hypothesis means rejecting the smaller, restricted
model in favor of the larger, unrestricted model.
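A sketch of the LR test for exactly this pair of models (simulated data; statsmodels and
scipy assumed):

import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(5)
n = 500
X = rng.normal(size=(n, 3))
p = 1.0 / (1.0 + np.exp(-(-0.5 + X @ np.array([1.2, -0.7, 0.4]))))
y = rng.binomial(1, p)

unrestricted = sm.Logit(y, sm.add_constant(X)).fit(disp=0)       # Model A
restricted = sm.Logit(y, sm.add_constant(X[:, :1])).fit(disp=0)  # Model B: b2 = b3 = 0

LR = -2 * (restricted.llf - unrestricted.llf)
p_value = chi2.sf(LR, df=2)  # chi-square test with df = number of restrictions
print(LR, p_value)           # a small p-value rejects the restricted model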
Example:
(This is a Knowledge Check example from Section 4 of the curriculum.)
You are assigned to examine the propensity of companies to repurchase their shares, so for a
sample of 500 companies, you have identified those that repurchased shares (Repurchase =
1) and those that did not (Repurchase = 0). You also collected company data for the year
prior to the repurchase, including cash-to-total-assets ratio (CASH), debt-to-equity ratio
(DE), and net profit margin (NPM), and estimated the following logistic regression:
Repurchasei = b0 + b1CASHi + b2DEi + b3NPMi + εi.
Your regression results are shown in Exhibit 15.


In the weekly research team meeting, the research director asks you to explain your logistic
regression model and assess how the model fits the data, as follows:

1. Interpret the logit regression intercept.


Solution:
The intercept of −0.4738 is the log odds of the probability of being a share repurchaser if
CASH, DE, and NPM are all zero. The odds are e^(−0.4738) = 0.6226, and the probability (P) =
0.6226/(1 + 0.6226) = 0.3837, or 38.37%. This is the baseline probability of being a share
repurchaser that is not explained by the independent variables in the logistic regression
equation.
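The arithmetic can be checked in a couple of lines:

from math import exp

b0 = -0.4738
odds = exp(b0)         # e^(-0.4738) = 0.6226
p = odds / (1 + odds)  # 0.6226 / 1.6226 = 0.3837, or 38.37%
print(odds, p)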

2. Estimate the marginal effect of each independent variable in explaining companies’
propensity to repurchase shares.
Solution:
Starting with the equation for the probability to repurchase shares,

P = 1 / (1 + exp[−(b0 + b1CASH + b2DE + b3NPM)])

we use the values of the coefficients from the logistic regression results and the mean
values of the independent variables to find the initial average probability of repurchasing
shares.


This implies that for the average firm, there is a 29.06% probability of share repurchase.
Now, for each independent variable, let us increase it by 1%, or 0.01, while holding the
others constant, and see the marginal impact on the probability of a share buyback.
CASH:
We increase the CASH variable by 1%, from 0.083 to 0.093, and calculate the new probability
of share buyback:

P = 28.87%
Therefore, the marginal impact of increasing the CASH variable by 1% and holding all the
other variables constant is a change in the probability of a share buyback of 28.87% −
29.06% = −0.19%; differently put, increasing the CASH variable by 1% decreases the
probability of a buyback by 0.19%.
NPM:
We increase the NPM variable by 1%, from −0.0535 to −0.0435, and calculate the new
probability of a share buyback:

Therefore, the marginal impact of increasing the NPM variable by 1% is an increase in the
probability of a share buyback of 29.26% − 29.06% = 0.20%.
DE:
We increase the DE variable by 1%, from 0.9182 to 0.9282, and calculate the new probability
of a share buyback:

Therefore, the marginal impact of increasing the DE variable by 1%, rounded to two
decimal places, is a change in the probability of a share buyback of 29.00% − 29.06% =
−0.07%; differently put, increasing the DE variable by 1% decreases the probability of a
share buyback by approximately 0.07%.
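The mechanics generalize as in the sketch below. The intercept and the variable means are
taken from the example, but the three slope coefficients are hypothetical placeholders
(Exhibit 15’s estimates are not reproduced in these notes), so the printed numbers will not
match the solution exactly:

from math import exp

def prob(b, x):
    # P = 1 / (1 + exp(-(b0 + b1*CASH + b2*DE + b3*NPM)))
    z = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
    return 1.0 / (1.0 + exp(-z))

# Intercept and means are from the example; the slopes are assumed for
# illustration, with signs chosen to match the directions in the solution.
b = [-0.4738, -0.9, -0.3, 1.0]    # [b0, b_CASH, b_DE, b_NPM] (slopes hypothetical)
means = [0.083, 0.9182, -0.0535]  # mean [CASH, DE, NPM]

base = prob(b, means)
for i, name in enumerate(["CASH", "DE", "NPM"]):
    bumped = list(means)
    bumped[i] += 0.01                    # increase the variable by 1% (0.01)
    print(name, prob(b, bumped) - base)  # marginal effect on the probability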

3. Evaluate how your logistic regression model fits the data using the LR test and an
intercept-only model as the restricted model.
Solution:
The log likelihood statistics for the unrestricted and the intercept-only (restricted) models
are taken from the logistic regression results.


The LR test is a test of the hypothesis for the restrictions, using the standard six-step
hypothesis test process.

Based on the LR test, your conclusion is that the unrestricted model fits the data better than
the intercept-only model, indicating that the three explanatory variables are jointly
significant. Note the regression results show the LR test statistic’s P-value is 0.0007.
Moreover, individual (z-statistic) tests of the coefficients show that DE and NPM are each
significant at the 5 percent level.


Summary
LO: Describe influence analysis and methods of detecting influential data points.
Two kinds of observations may potentially influence regression results:
• A high-leverage point: A data point having an extreme value of an independent
variable (X).
• An outlier: A data point having an extreme value of the dependent variable (Y).
Exhibit 7 presents a summary of the measures of influential observations.

LO: Formulate and interpret a multiple regression model that includes qualitative
independent variables.
Dummy (or indicator) variables represent qualitative independent variables. They take on a
value of 1 if a particular condition is true and 0 if that condition is false. To distinguish
among n categories, the model must include n – 1 dummy variables.
An intercept dummy adds to or reduces the original intercept if a specific condition is met.
When the intercept dummy is 1, the regression line shifts up or down parallel to the base
regression line.
Yi = b0 + d0Di + b1Xi + εi.
A slope dummy allows for the slope of the regression line to change if a specific condition is
met.
Yi = b0 + b1Xi + d1DiXi + εi.
It is also possible for a regression model to use both intercept and slope dummy variables.
Yi = b0 + d0Di + b1Xi + d1DiXi + εi.


LO: Formulate and interpret a logistic regression model.


Qualitative dependent variables are outcome variables describing data that fit into
categories (e.g. bankrupt or not bankrupt). A logistic transformation is applied when the
model contains qualitative dependent variables. The logistic transformation is:
ln[P/(1 − P)]
The logit transformation linearizes the relation between the transformed dependent
variable and the independent variables.
ln(P/(1 − P)) = b0 + b1X1 + b2X2 + b3X3 + ε
The natural logarithm (ln) of the odds of an event happening is the log odds, which is also
called the logit function.
Logistic regression coefficients are typically estimated using the maximum likelihood
estimation (MLE) method rather than by least squares.
In a logit model, slope coefficients are interpreted as the change in the log odds that the
event happens per unit change in the independent variable, holding all other independent
variables constant.
A likelihood ratio (LR) test is a method to assess the fit of logistic regression models. The test
is similar to the joint F-test. It compares the fit of the restricted and unrestricted models.

© IFT. All rights reserved.