21 Multiple Regression

The document provides an overview of multiple regression analysis in R, highlighting the importance of interpreting slope and intercept values, and the potential for confounding variables to bias results. It discusses how to address confounders through design, subclassification, and control variables, emphasizing the need for careful analysis to isolate the effects of independent variables on a dependent variable. Additionally, it illustrates the application of multiple regression with examples and the significance of coefficients in understanding relationships between variables.

DATA IN POLITICS I

MULTIPLE REGRESSION

Dr. Annie Watson
November 12, 2024
[email protected]
REVIEW
Review
• Running a regression in R is simple
• lm(y ~ x)
• Interpreting it is trickier
• Two main outputs: slope and intercept
• Slope (aka beta-hat, beta coefficient, regression coefficient)
• “What is the predicted change in Y when X increases by 1 unit?”
• Trick: What is “one unit” for X in your analysis?
• Intercept (aka alpha-hat, Y intercept, constant)
• “What is the predicted value of Y when X is zero?”
• Trick: Is this actually meaningful?
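A minimal sketch of this in R, using simulated data (the variable names and numbers here are made up for illustration, not the real survey data from the slides):

```r
# Simulate a thermometer-style dataset (hypothetical values)
set.seed(42)
n <- 1000
union_therm <- runif(n, 0, 100)                         # feelings toward unions, 0-100
socialist_therm <- 5 + 0.6 * union_therm + rnorm(n, sd = 15)

fit <- lm(socialist_therm ~ union_therm)
coef(fit)["(Intercept)"]   # predicted Y when union_therm = 0
coef(fit)["union_therm"]   # predicted change in Y when union_therm rises by 1 unit
```

Because the data were generated with intercept 5 and slope 0.6, the estimates should land close to those values; the "one unit" here is one point on a 0-100 scale.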
Review Example 1
• 𝑌𝑖 = 𝛼 + 𝛽1𝑋𝑖 + 𝜀𝑖
• Regression estimates α̂ and β̂
• You tell me: in this regression, what (conceptually) are 𝑋𝑖 and 𝑌𝑖?
• Which number is our estimate of α̂? How would we interpret it?
• Which number is our estimate of β̂? How would we interpret it?

Table of Regression Output
                      Socialist Thermometer
Unions Therm          0.58
                      (0.013)
Constant              4.58
                      (0.822)
N                     6354
R2                    0.2377
Review Example 2
- Simplified from real study in Ghana
- Unit of analysis: voting precinct
- IV: Whether precinct had election monitors: Treatment group = 1, Control group = 0
- DV: voter turnout (as a %, 0-100)

                                DV: % of voters who turn out in that precinct (0-100)
Had election monitor (0 or 1)   -2.00
                                (.012)
Constant                        76.7
                                (4.6)

Discuss with your group:
- What does the Intercept tell us here?
- What does it mean here for X to move from 0 to 1?
- What was the estimated treatment effect of election monitors in this experiment?
REGRESSION AND CAUSALITY
From Math to Interpretation
• Does the slope have a causal interpretation?
• We often want it to!
• “Liking unions more causes you to like socialists more.”
• It can, but it’s certainly not foreordained.
• You can regress anything on anything else and get a result.
• E.g. you’d find that ambulance rides predict your risk of dying. This does not
mean that ambulance rides cause people to die. (Quite the opposite!)
• Even sillier: How does the number of oranges you eat per day predict the
number of your grandparents who were born abroad?
• For a causal interpretation, we need (at a minimum) an argument about:
• Why X precedes Y temporally
• Why the people high in X are similar to the people low in X.
Confounders / Omitted Variables
• Something associated with our X of interest also affects Y.

• As a result, we may have failed to isolate the effect of X on Y.

• We’ve seen this before.


• Maybe Francine didn’t go to high school, and Allison did.
• Maybe New Jersey started a new job training program around the same time that it raised the minimum
wage.
• Maybe partisanship affects both how you feel about unions, and your feelings towards socialists

• These kinds of concerns are everywhere, and a constant focus in social scientific analysis.

• In regression contexts, confounders are often called “omitted variables,” since there is some
promise that, if we cease omitting them, it could solve some of our problems (coming up).
Hypothetical Example
• Say women tend to both
• Dislike unions
• Dislike socialists
• This is a potential confounder.
• Suggests our result is biased
• Bias means our regression results are too high or too low, compared to the “true” effect of X on Y.
• Nothing to do with bias in terms of ideology etc.
• In this case: maybe true effect is 0, but we get a positive result.

[Figure: scatterplot of the hypothetical data — more women? more men?]
Confounders / Omitted Variables
• Omitted variables:
• X = Variable of interest
• Y = Dependent variable
• O = Omitted variable
• O is only a problem if it’s correlated with both the IV of interest, and the DV.

                                X and O are       X and O are       X and O are
                                positively        negatively        uncorrelated
                                correlated        correlated
O has a positive effect on Y                                        No bias!
O has a negative effect on Y                                        No bias!
O has no effect on Y            No bias!          No bias!          No bias!
Confounders / Omitted Variables
• If there’s only 1 confounder, we can actually say something about direction of bias
• Meaning: Are you getting a coefficient that’s too big, or too small?
• With multiple confounders, gets complicated really fast
• And what if you don’t think of a key confounder?

                                X and O are       X and O are       X and O are
                                positively        negatively        uncorrelated
                                correlated        correlated
O has a positive effect on Y    Positive bias     Negative bias     No bias!
O has a negative effect on Y    Negative bias     Positive bias     No bias!
O has no effect on Y            No bias!          No bias!          No bias!
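The first cell of the table can be checked by simulation. In this sketch (all numbers are made up), O is positively correlated with X and has a positive effect on Y, so omitting O should bias the slope on X upward:

```r
# Simulated omitted-variable bias: true effect of X on Y is 2
set.seed(1)
n <- 10000
o <- rnorm(n)
x <- 0.8 * o + rnorm(n)          # X and O positively correlated
y <- 2 * x + 3 * o + rnorm(n)    # O has a positive effect on Y

coef(lm(y ~ x))["x"]        # omitting O: estimate is biased well above 2
coef(lm(y ~ x + o))["x"]    # including O: estimate is close to the true 2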
Dealing with Potential Confounders in Regression
1. Rely on your design
• Maybe your X-variable is randomly assigned.
• If so, you start with a strong argument that the people high in X are similar to the
people low in X.
• Could imagine an experiment where we randomly assign some treatment to change perceptions of unions, then see if that also affects feelings towards socialists

2. Subclassification
• Suppose men have more positive feelings toward unions, and more positive feelings
toward socialists.
• This is a classic confounding problem.
• It goes away if we estimate separate regressions (socialist liking = 𝛼 + 𝛽1 · union liking) within gender subgroups.
• But this requires clear categories, and the number of regressions explodes as we
attempt to “control for” more and more things.
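Subclassification can be sketched in R with `subset()`. In this simulated example (hypothetical numbers), men like both unions and socialists more, but within each gender the true union-socialist slope is zero, so the pooled regression is confounded while the subgroup regressions are not:

```r
# Simulated data where gender confounds the union-socialist relationship
set.seed(2)
n <- 2000
male <- rbinom(n, 1, 0.5)
union_therm <- 40 + 20 * male + rnorm(n, sd = 10)       # men like unions more
socialist_therm <- 10 + 15 * male + rnorm(n, sd = 10)   # true union slope is 0
df <- data.frame(male, union_therm, socialist_therm)

# Pooled regression picks up the spurious (confounded) association
coef(lm(socialist_therm ~ union_therm, data = df))["union_therm"]

# Separate regressions within each gender subgroup: slope near 0
coef(lm(socialist_therm ~ union_therm, data = subset(df, male == 1)))["union_therm"]
coef(lm(socialist_therm ~ union_therm, data = subset(df, male == 0)))["union_therm"]
```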
Dealing with Potential Confounders in Regression

3. Control variables
• Think about the (potential) negative association between having positive feelings toward socialists, and Republican partisanship
• It’s possible to estimate more than one linear relationship simultaneously.
• That is, we can have more than one independent variable in a regression.
• This is what we call multiple regression.
MULTIPLE REGRESSION
Multiple Regression
• Model
• Before: 𝑌𝑖 = 𝛼 + 𝛽1𝑋𝑖 + 𝜀𝑖
• Now: 𝑌𝑖 = 𝛼 + 𝛽1𝑋1𝑖 + 𝛽2𝑋2𝑖 + ⋯ + 𝛽𝑛𝑋𝑛𝑖 + 𝜀𝑖
• We are saying that Y is a linear function of multiple X variables

• Decision rule
• Before: Find the 𝛼 and 𝛽1 that minimize the SSR.
• Now: Find the 𝛼 and 𝛽1, 𝛽2, … , 𝛽𝑛 that jointly minimize the SSR.

• Estimation
• Before: 𝛽1 = Σᵢ₌₁ⁿ (𝑌𝑖 − 𝑌̄)(𝑋𝑖 − 𝑋̄) / Σᵢ₌₁ⁿ (𝑋𝑖 − 𝑋̄)²
• Now: Requires matrix algebra. (R will do this for us.)
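The matrix-algebra solution is β̂ = (XᵀX)⁻¹Xᵀy, and `lm()` reproduces it exactly. A sketch on simulated data:

```r
# The computation lm() does behind the scenes: solve (X'X) beta = X'y
set.seed(3)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                      # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'y

# Matches lm() to numerical precision
cbind(beta_hat, coef(lm(y ~ x1 + x2)))
```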
What does multiple regression do?
• One-variable regression: Estimate of 𝛽1 comes from comparing all variation in 𝑋1 (liking unions) to variation in Y (liking socialists)

• Multiple regression: Estimates of 𝛽1 (slope of union_th) and 𝛽2 (slope of partisanship) come only from cases where X1 and X2 depart from each other.
• “Holds constant” partisanship, and estimates effect of liking unions
• “Holds constant” liking unions, and estimates effect of partisanship
• Conceptually this is similar to subclassification! But we’re “holding constant” one variable by
estimating the linear relationship of best fit, rather than looking within categories.
• If X1 and X2 are perfectly correlated in our dataset, we can’t even do this. (We couldn’t do
subclassification, either.)

• As a result, “controlling for” another variable can change our estimate on 𝛽1.
Perhaps a lot.
Intuition for Multiple Regression
• Worried that some omitted variable, X2, is biasing our result
• So, estimate 𝑌𝑖 = 𝛼 + 𝛽1𝑋1𝑖 + 𝛽2𝑋2𝑖 + 𝜀𝑖
• In R, lm(y ~ x1 + x2)
• Behind the scenes in R: to get 𝛽1 , multiple regression is like the following:
• Regresses X1 on X2, and takes the residual for X1.
• This is the variation in X1 that’s uncorrelated with X2
• So, X2 no longer a confounder: uncorrelated w. residual for X1
• Then regresses Y on the residual of X1
• End result:
• 𝛼 – still the intercept. Predicted value of Y when X1 and X2 are BOTH zero
• 𝛽1 –Slope of line of best fit between X1 and Y, once we’ve accounted for X2
• 𝛽2 –Slope of line of best fit between X2 and Y, once we’ve accounted for X1
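The "behind the scenes" steps above (the Frisch-Waugh-Lovell result) can be verified directly in R on simulated data:

```r
# Check: regressing Y on the residual of X1 recovers beta_1 from the full model
set.seed(4)
n <- 500
x2 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)         # X1 and X2 are correlated
y <- 1 + 2 * x1 - 1.5 * x2 + rnorm(n)

# Step 1: regress X1 on X2; keep the part of X1 uncorrelated with X2
x1_resid <- resid(lm(x1 ~ x2))

# Step 2: regress Y on that residual
coef(lm(y ~ x1_resid))["x1_resid"]   # same slope as...
coef(lm(y ~ x1 + x2))["x1"]          # ...beta_1 in the full regression
```

The two slopes agree to numerical precision, which is exactly why "controlling for" X2 works.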
Visualizing Multiple Regression (2 predictors)
One X: Fit a line through a 2-D scatterplot Two X’s: Fit a plane through
a 3-D cloud
Implementation in R
• Only ever 1 DV. Multiple IVs separated by “+”.

                               Socialist       Socialist       Socialist
                               Thermometer     Thermometer     Thermometer
Union Thermometer (0-1)        58.22           33.06           33.16
                               (1.268)         (1.197)         (1.21)
Strong Republican (0-1)        --              -37.87          -37.64
                                               (0.76)          (0.778)
High Education (0-1)           --              --              0.2163
                                                               (0.132)
Intercept                      4.26            37.14           35.98
                               (0.798)         (0.95)          (1.21)
N                              6,689           6,674           6,582
R2                             0.239           0.445           0.4438
Partisanship has 7 levels, from 0 = Strong Democrat to 1 = Strong Republican
Education has 15 levels ranging from 0 = Less than 1st grade to 1 = Ph.D.

• The intercept in column 2 is the predicted value of the DV when “Union therm” = 0 AND “Strong Republican” = 0.
• The intercept in column 3 is the predicted value of the DV when all three IVs = 0.
Multiple Regression
• What does the coefficient (e.g. 𝛽1 = 33.06) mean now?
• The linear relationship between Union Therm and the DV that minimizes SSR… once we’ve accounted for the linear relationship between partisanship and the DV.
• Likewise, 𝛽2 (= -37.87) is the linear relationship between partisanship and the DV that minimizes SSR… once we’ve accounted for the linear relationship between Union Therm and the DV.
• Have we “controlled for” Partisanship?
• Yes, in a sense, but in a different way than in the subsetting approach.
• We haven’t isolated observations that are all exactly the same with respect to partisanship. Rather, we’ve assumed the relationship is linear, and analyzed only the variance that was left over, after a linear prediction.

                               Socialist Thermometer
Union Thermometer (0-1)        33.06
                               (1.197)
Strong Republican (0-1)        -37.87
                               (0.76)
Intercept                      37.14
                               (0.95)
N                              6,674
R2                             0.445
Partisanship has 7 levels, from 0 = Strong Democrat to 1 = Strong Republican
Education has 15 levels ranging from 0 = Less than 1st grade to 1 = Ph.D.
Multiple Regression – The Perils
• Lots. This is just for starters.

• You might not include (or even have a measure of) all the
confounding variables.

• It is sensitive to “outlier” observations.

• If any of your variables have a nonlinear relationship with the DV, results can be very misleading.

• If one of the X’s has a causal relationship with another one of them, things can also be misleading.
• Imagine controlling for “Minutes exercising per week” and “VO2 Max” in a model predicting “Time in a 10-mile race.”

• A linear model is an awkward way to analyze categorical data.
Multiple Regression – The Perils (continued)
• You can regress anything on anything, and it’s
easy to read too much into the relationships
you uncover.

• There are temptations to data-mine.


Multiple Regression – The Promise
• This basic model is highly adaptable.

• If you think there is a nonlinear relationship, there are ways to model that.
• E.g., square an X variable and include it in the model.

• You can examine interactive relationships between different variables of interest.
• “I think the effect of fiscal stimulus on economic recovery depends on whether a country has a parliamentary or presidential system.”

• This can be adapted to make sense of far more complex data structures
• Children inside of schools inside of U.S. states, which develop over time. (I.e. multilevel panel data.)
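In R's formula syntax, the squared-term and interaction extensions look like this (simulated data with hypothetical variable names; `parl` stands in for a parliamentary-system indicator):

```r
# Simulated data with a genuinely quadratic relationship
set.seed(5)
n <- 300
x <- runif(n, -2, 2)
parl <- rbinom(n, 1, 0.5)            # hypothetical 0/1 parliamentary indicator
y <- 1 + x + 2 * x^2 + rnorm(n)

# Nonlinear: square an X variable and include it (I() keeps ^ as arithmetic)
fit_sq <- lm(y ~ x + I(x^2))

# Interaction: x * parl expands to x + parl + x:parl
fit_int <- lm(y ~ x * parl)
names(coef(fit_int))                 # intercept, x, parl, and the x:parl interaction
```

The coefficient on `I(x^2)` should land near the true value of 2, and the `x:parl` term lets the slope of `x` differ by regime type.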
MULTIPLE REGRESSION IN R
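A sketch of the full workflow from the slides, on simulated survey-style data (the variable names are hypothetical and the data are generated so the coefficients land near the values in the tables above, not the actual survey data):

```r
# Simulated ANES-style data, with both IVs rescaled to 0-1 as in the tables
set.seed(6)
n <- 5000
union_therm <- runif(n, 0, 1)
strong_rep <- runif(n, 0, 1)
socialist_therm <- 37 + 33 * union_therm - 38 * strong_rep + rnorm(n, sd = 18)

# Only ever 1 DV; multiple IVs separated by "+"
fit <- lm(socialist_therm ~ union_therm + strong_rep)
summary(fit)    # coefficients, standard errors, residual df, and R-squared
coef(fit)       # intercept plus the two slopes
```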
