DATA IN POLITICS I
MULTIPLE REGRESSION
Dr. Annie Watson
November 12, 2024
[email protected]
REVIEW
Review
• Running a regression in R is simple
• lm(y ~ x)
• Interpreting it is trickier (a sketch follows this list)
• Two main outputs: slope and intercept
• Slope (aka beta-hat, beta coefficient, regression coefficient)
• “What is the predicted change in Y when X increases by 1 unit”
• Trick: What is “one unit” for X in your analysis?
• Intercept (aka alpha-hat, Y intercept, constant)
• “What is the predicted value of Y when X is zero?”
• Trick: Is this actually meaningful?
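A minimal sketch of the one-variable case in R, using simulated data (the variable names and data-generating values here are invented for illustration, not taken from the survey data used later):

  # Simulate data where the "true" intercept is 10 and the "true" slope is 0.5
  set.seed(42)
  x <- runif(500, min = 0, max = 100)       # e.g., a 0-100 feeling thermometer
  y <- 10 + 0.5 * x + rnorm(500, sd = 15)   # Y is a linear function of X, plus noise

  fit <- lm(y ~ x)
  summary(fit)

  # coef(fit)["(Intercept)"]: predicted y when x = 0
  # coef(fit)["x"]: predicted change in y when x increases by 1 unit
  coef(fit)

Here “one unit” of x is one point on a 0-100 scale, so the slope is the predicted change in y per one-point increase in x.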
Review Example 1
Table of Regression Output

                     Socialist Thermometer
  Unions Therm       0.58
                     (0.013)
  Constant           4.58
                     (0.822)
  N                  6,354
  R²                 0.2377

• 𝑌𝑖 = 𝛼 + 𝛽1𝑋𝑖 + 𝜀𝑖
• Regression estimates α̂ and β̂
• You tell me: in this regression, what (conceptually) are 𝑋𝑖 and 𝑌𝑖?
• Which number is our estimate of α̂? How would we interpret it?
• Which number is our estimate for β̂? How would we interpret it?
Review Example 2
- Simplified from real study in Ghana
- Unit of analysis: voting precinct
- IV: Whether precinct had election monitors
  - Treatment group = 1, Control group = 0
- DV: voter turnout (as a %, 0-100)

                                   DV: % of voters who turn out in that precinct (0-100)
  Had election monitor (0 or 1)    -2.00
                                   (.012)
  Constant                         76.7
                                   (4.6)
Discuss with your group:
- What does the Intercept tell us here?
- What does it mean here for X to move from 0 to 1?
- What was the estimated treatment effect of election monitors in this
experiment?
REGRESSION AND CAUSALITY
From Math to Interpretation
• Does the slope have a causal interpretation?
• We often want it to!
• “Liking unions more causes you to like socialists more.”
• It can, but it’s certainly not foreordained.
• You can regress anything on anything else and get a result.
• E.g. you’d find that ambulance rides predict your risk of dying. This does not
mean that ambulance rides cause people to die. (Quite the opposite!)
• Even sillier: How does the number of oranges you eat per day predict the
number of your grandparents who were born abroad?
• For a causal interpretation, we need (at a minimum) an argument about:
• Why X precedes Y temporally
• Why the people high in X are similar to the people low in X.
Confounders / Omitted Variables
• Something associated with our X of interest also affects Y.
• As a result, we may have failed to isolate the effect of X on Y.
• We’ve seen this before.
• Maybe Francine didn’t go to high school, and Allison did.
• Maybe New Jersey started a new job training program around the same time that it raised the minimum
wage.
• Maybe partisanship affects both how you feel about unions, and your feelings towards socialists
• These kinds of concerns are everywhere, and a constant focus in social scientific analysis.
• In regression contexts, confounders are often called “omitted variables,” since there is some
promise that, if we cease omitting them, it could solve some of our problems (coming up).
Hypothetical Example
• Say women tend to both
• Dislike unions
• Dislike socialists
• This is a potential confounder.
• Suggests our result is biased
• Bias means our regression results are
too high or too low, compared to the
“true” effect of X on Y.
• Nothing to do with bias in terms of
ideology etc.
• In this case: maybe the true effect is 0, but we get a positive result.
[Diagram: does the group that likes unions contain more women or more men?]
Confounders / Omitted Variables
• Omitted variables:
  • X = Variable of interest
  • Y = Dependent variable
  • O = Omitted variable
• O is only a problem if it’s correlated with both the IV of interest and the DV.

                                  X and O are              X and O are              X and O are
                                  positively correlated    negatively correlated    uncorrelated
  O has a positive effect on Y    ?                        ?                        No bias!
  O has a negative effect on Y    ?                        ?                        No bias!
  O has no effect on Y            No bias!                 No bias!                 No bias!
Confounders / Omitted Variables
• If there’s only 1 confounder, we can actually say something about the direction of bias (see the simulation sketch after this table).
• Meaning: Are you getting a coefficient that’s too big, or too small?
• With multiple confounders, this gets complicated really fast.
• And what if you don’t think of a key confounder?

                                  X and O are              X and O are              X and O are
                                  positively correlated    negatively correlated    uncorrelated
  O has a positive effect on Y    Positive bias            Negative bias            No bias!
  O has a negative effect on Y    Negative bias            Positive bias            No bias!
  O has no effect on Y            No bias!                 No bias!                 No bias!
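A hedged simulation sketch of the table’s logic (all names and numbers invented): here O is positively correlated with X and has a positive effect on Y, so omitting O should bias the slope on X upward.

  set.seed(1)
  n <- 10000
  o <- rnorm(n)                      # omitted variable
  x <- 0.6 * o + rnorm(n)            # X and O are positively correlated
  y <- 2 * x + 3 * o + rnorm(n)      # O has a positive effect on Y; true effect of X is 2

  coef(lm(y ~ x))["x"]       # omits O: slope is biased upward (well above 2)
  coef(lm(y ~ x + o))["x"]   # includes O: slope is close to the true value of 2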
Dealing with Potential Confounders in Regression
1. Rely on your design
• Maybe your X-variable is randomly assigned.
• If so, you start with a strong argument that the people high in X are similar to the
people low in X.
• Could imagine an experiment where we randomly assign some treatment to change perceptions of unions, then see if that also affects feelings towards socialists
2. Subclassification
• Suppose men have more positive feelings toward unions, and more positive feelings
toward socialists.
• This is a classic confounding problem.
• It goes away if we estimate separate regressions (socialist liking = 𝛼 + 𝛽1 · union liking) within gender subgroups (a sketch follows this list).
• But this requires clear categories, and the number of regressions explodes as we attempt to “control for” more and more things.
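A minimal sketch of subclassification in R, assuming a data frame dat with hypothetical columns socialist_therm, union_therm, and gender (not the course’s actual variable names):

  # Fit the same one-variable regression separately within each gender subgroup
  fits_by_gender <- lapply(split(dat, dat$gender), function(d) {
    lm(socialist_therm ~ union_therm, data = d)
  })

  # Compare the union_therm slopes across subgroups
  lapply(fits_by_gender, coef)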
Dealing with Potential Confounders in Regression
3. Control variables
• Think about the (potential) positive association between having positive
feelings toward socialists, and Republican partisanship
• It’s possible to estimate more than one linear relationship simultaneously.
• That is, we can have more than one independent variable in a regression.
• This is what we call multiple regression.
MULTIPLE REGRESSION
Multiple Regression
• Model
• Before: 𝑌𝑖 = 𝛼 + 𝛽1𝑋𝑖 + 𝜀𝑖
• Now: 𝑌𝑖 = 𝛼 + 𝛽1𝑋1𝑖 + 𝛽2𝑋2𝑖 + ⋯ + 𝛽𝑛𝑋𝑛𝑖 + 𝜀𝑖
• We are saying that Y is a linear function of multiple X variables
• Decision rule
• Before: Find the 𝛼 and 𝛽1 that minimize the SSR.
• Now: Find the 𝛼 and 𝛽1, 𝛽2, … , 𝛽𝑛 that jointly minimize the SSR.
• Estimation
• Before: 𝛽1 = Σᵢ(𝑌𝑖 − Ȳ)(𝑋𝑖 − X̄) / Σᵢ(𝑋𝑖 − X̄)², summing over i = 1, …, n
• Now: Requires matrix algebra. (R will do this for us.)
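A quick check of that claim with simulated data (names invented): the coefficients lm() reports match the textbook matrix solution (XᵀX)⁻¹Xᵀy.

  set.seed(7)
  n  <- 200
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

  # lm() does the minimization for us
  coef(lm(y ~ x1 + x2))

  # The same estimates via the matrix formula (X'X)^{-1} X'y
  X <- cbind(1, x1, x2)                 # design matrix with a column of 1s for the intercept
  solve(crossprod(X), crossprod(X, y))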
What does multiple regression do?
• One-variable regression: Estimate of 𝛽1 comes from comparing all variation in 𝑋1 (liking unions) to variation in Y (liking socialists)
• Multiple regression: Estimates of 𝛽1 (slope of union_th) and 𝛽2 (slope of
partisanship) come only from cases where X1 and X2 depart from each other.
• “Holds constant” partisanship, and estimates effect of liking unions
• “Holds constant” liking unions, and estimates effect of partisanship
• Conceptually this is similar to subclassification! But we’re “holding constant” one variable by
estimating the linear relationship of best fit, rather than looking within categories.
• If X1 and X2 are perfectly correlated in our dataset, we can’t even do this. (We couldn’t do
subclassification, either.)
• As a result, “controlling for” another variable can change our estimate on 𝛽1.
Perhaps a lot.
Intuition for Multiple Regression
• Worried that some omitted variable, X2, is biasing our result
• So, estimate 𝑌𝑖 = 𝛼 + 𝛽1𝑋1𝑖 + 𝛽2𝑋2𝑖 + 𝜀𝑖
• In R, lm(y ~ x1 + x2)
• Behind the scenes in R: to get 𝛽1, multiple regression works like the following (sketched after this list):
• Regresses X1 on X2, and takes the residual for X1.
• This is the variation in X1 that’s uncorrelated with X2
• So, X2 no longer a confounder: uncorrelated w. residual for X1
• Then regresses Y on the residual of X1
• End result:
• 𝛼 – still the intercept. Predicted value of Y when X1 and X2 are BOTH zero
• 𝛽1 –Slope of line of best fit between X1 and Y, once we’ve accounted for X2
• 𝛽2 –Slope of line of best fit between X2 and Y, once we’ve accounted for X1
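A sketch of that residualization logic with simulated data (variable names invented): the slope on the residualized X1 matches 𝛽1 from the full multiple regression.

  set.seed(3)
  n  <- 5000
  x2 <- rnorm(n)
  x1 <- 0.5 * x2 + rnorm(n)            # x1 and x2 are correlated
  y  <- 4 + 2 * x1 - 1.5 * x2 + rnorm(n)

  # Step 1: the part of x1 that is uncorrelated with x2
  x1_resid <- resid(lm(x1 ~ x2))

  # Step 2: regress y on that residual
  coef(lm(y ~ x1_resid))["x1_resid"]

  # Same slope as the multiple regression
  coef(lm(y ~ x1 + x2))["x1"]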
Visualizing Multiple Regression (2 predictors)
• One X: Fit a line through a 2-D scatterplot
• Two X’s: Fit a plane through a 3-D cloud
Implementation in R
• In the lm() call, there is only ever 1 DV; multiple IVs are separated by “+” (a hedged example of the call appears below).
• The table that follows builds up three models, each with DV = Socialist Thermometer, adding Union Thermometer, Strong Republican, and High Education one at a time.
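A hedged version of the lm() calls these slides describe; the data frame and variable names (anes, socialist_therm, union_therm, strong_republican, high_educ) are placeholders, not the course’s actual names:

  # Model 1: one IV
  m1 <- lm(socialist_therm ~ union_therm, data = anes)

  # Model 2: add partisanship — still only 1 DV; IVs separated by "+"
  m2 <- lm(socialist_therm ~ union_therm + strong_republican, data = anes)

  # Model 3: add education
  m3 <- lm(socialist_therm ~ union_therm + strong_republican + high_educ, data = anes)

  summary(m2)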
                                 (1)              (2)              (3)
                                 Socialist        Socialist        Socialist
                                 Thermometer      Thermometer      Thermometer
  Union Thermometer (0-1)        58.22            33.06            33.16
                                 (1.268)          (1.197)          (1.21)
  Strong Republican (0-1)        --               -37.87           -37.64
                                                  (0.76)           (0.778)
  High Education (0-1)           --               --               0.2163
                                                                   (0.132)
  Intercept                      4.26             37.14            35.98
                                 (0.798)          (0.95)           (1.21)
  N                              6,689            6,674            6,582
  R²                             0.239            0.445            0.4438

• In model (2), the intercept (37.14) is the predicted value of the DV when “Union therm” = 0 AND “Strong Republican” = 0.
• In model (3), the intercept (35.98) is the predicted value of the DV when all three IVs = 0.

Partisanship has 7 levels, from 0 = Strong Democrat to 1 = Strong Republican.
Education has 15 levels, ranging from 0 = Less than 1st grade to 1 = Ph.D.
Multiple Regression
• What does the coefficient (e.g. 𝛽1 = 33.06) mean now?
  • The linear relationship between Union Therm and the DV that minimizes SSR… once we’ve accounted for the linear relationship between partisanship and the DV.
• Likewise, 𝛽2 (= -37.87) is the linear relationship between partisanship and the DV that minimizes SSR… once we’ve accounted for the linear relationship between Union Therm and the DV.
• Have we “controlled for” partisanship?
  • Yes, in a sense, but in a different way than in the subsetting approach.
  • We haven’t isolated observations that are all exactly the same with respect to partisanship. Rather, we’ve assumed the relationship is linear, and analyzed only the variance that was left over after a linear prediction. (A predicted-values sketch follows below.)

                             Socialist Thermometer
  Union Thermometer (0-1)    33.06
                             (1.197)
  Strong Republican (0-1)    -37.87
                             (0.76)
  Intercept                  37.14
                             (0.95)
  N                          6,674
  R²                         0.445

Partisanship has 7 levels, from 0 = Strong Democrat to 1 = Strong Republican.
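One way to see what “holding partisanship constant” means in practice, continuing the placeholder names from the earlier sketch (m2 is the hypothetical two-IV model): compare predicted thermometer scores as union feelings move from 0 to 1 with partisanship fixed.

  # Predicted socialist thermometer at low vs. high union feelings,
  # holding partisanship fixed at 0 (Strong Democrat) in both rows
  newdat <- data.frame(union_therm = c(0, 1), strong_republican = c(0, 0))
  predict(m2, newdata = newdat)
  # The difference between the two predictions equals the union_therm coefficient (about 33)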
Multiple Regression – The Perils
• Lots. This is just for starters.
• You might not include (or even have a measure of) all the
confounding variables.
• It is sensitive to “outlier” observations.
• If any of your variables have a nonlinear relationship with
the DV, results can be very misleading.
• If one of the X’s has a causal relationship with another one
of them, things can also be misleading.
• Imagine controlling for “Minutes exercising per week” and “VO2
Max” in a model predicting “Time in a 10-mile race.”
• A linear model is an awkward way to analyze categorical
data.
Multiple Regression – The Perils (continued)
• You can regress anything on anything, and it’s
easy to read too much into the relationships
you uncover.
• There are temptations to data-mine.
Multiple Regression – The Promise
• This basic model is highly adaptable.
• If you think there is a nonlinear relationship, there are ways to model
that.
• E.g., square an X variable and include it in the model (see the sketch after this list).
• You can examine interactive relationships between different variables of interest (also sketched after this list).
• “I think the effect of fiscal stimulus on economic recovery depends on whether a country has a parliamentary or presidential system.”
• This can be adapted to make sense of far more complex data
structures
• Children inside of schools inside of U.S. states, which develop over time. (I.e.
multilevel panel data.)
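A sketch of those two extensions in lm() syntax, with invented variable names and a hypothetical data frame dat: I() wraps a squared term, and * adds an interaction.

  # Nonlinear relationship: include X and X^2
  m_quad <- lm(recovery ~ stimulus + I(stimulus^2), data = dat)

  # Interaction: the effect of stimulus is allowed to differ by institutional system;
  # stimulus * parliamentary expands to stimulus + parliamentary + stimulus:parliamentary
  m_int  <- lm(recovery ~ stimulus * parliamentary, data = dat)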
MULTIPLE REGRESSION IN R