HLST 2302
LECTURE 5
ADVANCED OLS
REGRESSION
GUYTANO VIRDO, PHD
ANNOUNCEMENTS
• Mid-term exam May 27th
• In class during usual lecture time
• Mix of MC, definition, and math questions
• You will need a scientific calculator, YUcard (or passport), and writing instrument of
choice
• You will be provided with a formula sheet
TODAY
• Residuals
• Extending the linear equation
• Adding additional IVs
• Prediction with additional IVs
• Interpreting coefficients and standardized coefficients
• R-squared and adjusted r-squared
• Multiple dummy variables
ERRORS IN REGRESSION
• y = a + bx + e, where e refers to the error term
• Multiple causes
• Poor measurement, sampling bias, missing data, difficult to measure constructs
• Fact of life in the social sciences
• In stats speak, these errors are called residuals
RESIDUALS
• A residual is the difference between a case’s observed value on the dependent variable and the value predicted for that case by the OLS regression results
• In the aggregate, they are our estimate of all the errors in the model
• y(observed) – y(predicted) = prediction error (residual)
• These tell us how well our model predicts the DV for all cases in the sample
RESIDUAL CALCULATIONS
• We can use our linear equation to calculate the difference between each case’s observed y value and the y value predicted by our model
• Constant = -51.075; slope coefficient = 4.232
• For case 1:
• y = -51.075 + 4.232*30
• y = -51.075 + 126.96
• y = 75.885

Case   Degrees   Deaths   Predicted y   Difference
1      30        73       75.885        -2.885
2      28        68       67.421        0.579
3      29        67       71.653        -4.653
4      26        52       58.957        -6.957
5      28        70       67.421        2.579
6      28        76       67.421        8.579
7      27        63       63.189        -0.189
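• A minimal Python sketch of the same calculation (values taken from the table above; numpy is assumed to be available, and is not part of the original slides):

import numpy as np

# Degrees (x) and deaths (y) for the seven cases in the table
x = np.array([30, 28, 29, 26, 28, 28, 27])
y = np.array([73, 68, 67, 52, 70, 76, 63])

# Constant and slope coefficient from the fitted bivariate model
a, b = -51.075, 4.232

y_hat = a + b * x          # predicted y for each case
residuals = y - y_hat      # observed minus predicted
print(residuals)           # first residual is approximately -2.885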
RESIDUALS II
• These should be:
• Normally distributed (check skew and kurtosis)
• ~95% of cases within +2/-2 SD, etc.
• Not significantly related to any of the independent variables
• Check with correlations
• Homoskedastic (constant variance)
• We shouldn’t be systematically better at predicting some cases than others
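• A hedged sketch of these checks in Python, reusing the degrees/deaths example above (scipy is assumed to be available; this is illustrative only):

import numpy as np
from scipy import stats

x = np.array([30, 28, 29, 26, 28, 28, 27])
y = np.array([73, 68, 67, 52, 70, 76, 63])
residuals = y - (-51.075 + 4.232 * x)

print(stats.skew(residuals), stats.kurtosis(residuals))  # values near 0 suggest approximate normality
print(np.corrcoef(x, residuals)[0, 1])                   # correlation with the IV should be close to 0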
EXTENDING OLS REGRESSION
• Most phenomena in the social sciences cannot be explained by a single IV
• We could do a series of bivariate regressions
• This wouldn’t solve the third variable problem
• Including multiple IVs in one regression model allows us to “control” for each
IV
• The regression coefficient for each IV shows the impact of that IV on the DV, controlling
for the impact from the other included IVs
EXTENDING OUR REGRESSION EQUATION
• We simply need to extend our linear equation to accommodate additional
independent variables
• y = a + bx + e, becomes
• y = a + b1x1 + b2x2 + b3x3 + … + e
• b1 now shows the estimated impact of the first independent variable, holding constant the other IVs (x2, x3, …); see the sketch at the end of this slide
• Why is this important?
• We now have a visual problem
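• A minimal Python sketch of fitting such a model with statsmodels (the data below are simulated, purely for illustration; they are not from the course datasets):

import numpy as np
import statsmodels.api as sm

# Simulated data: two IVs (x1, x2) and one DV (y), with random error e
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 2 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2]))  # adds the constant a
model = sm.OLS(y, X).fit()
print(model.params)                             # estimates of a, b1, b2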
WE NOW HAVE MULTIPLE DIMENSIONS
• With bivariate regression, we could
make a scatterplot with a line of
best fit
• If we have two IVs, we now have a
“plane” of best fit in a 3D graph
instead of a regression line
• If we have more IVs, we can no
longer visualize the relationship
between the IVs and the DV
INTERPRETING REGRESSION COEFFICIENTS
• Interpreting regression coefficients is slightly different
• It’s still about a one-unit change in x has b impact on y, but now it is controlling
for other variables
• Which IV has the strongest impact on the DV?
• Dependent variable: personal income
• Data source: American General Social Survey (2017)
Variable            B coefficient   Beta coefficient   Pr (sig.)
Constant            -19,248.32                         <0.01
Age (years)         516.68          0.21               <0.01
Sex (M=0, F=1)      -17,129.71      -0.26              <0.01
Education (years)   4,007.87        0.358              <0.01
BETA COEFFICIENTS
• Looking at regression coefficients alone does not reliably indicate which IV
has the strongest impact
• Sex had the largest absolute coefficient, but can only range from 0-1
• Education can range from 0-20
• We can “standardize” the coefficients based on the SD of the IV and the DV
• Beta = (original slope) × (SD of x / SD of y)
• Ignore signs and focus on the absolute value
• These show us the relative impact of each IV on the DV
• They are not easily interpreted or used beyond this
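• A hedged worked example in Python, using hypothetical SDs (the actual GSS standard deviations are not shown on this slide):

# Standardizing the education slope from the table above
b_education = 4007.87
sd_education = 3.0      # hypothetical SD of years of education
sd_income = 33000.0     # hypothetical SD of personal income
beta = b_education * sd_education / sd_income
print(beta)             # roughly 0.36 with these illustrative numbers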
MULTIVARIATE OLS REGRESSION EXAMPLE I
• We can now fully interpret this regression model
• Hypotheses?
• Significance?
• Impact of Xs on Y?
• Relative impact of Xs on Y?
• Linear equation?
Variable            B coefficient   Beta coefficient   Pr (sig.)
Constant            -19,248.32                         <0.01
Age (years)         516.68          0.21               <0.01
Sex (M=0, F=1)      -17,129.71      -0.26              <0.01
Education (years)   4,007.87        0.358              <0.01
MULTIVARIATE OLS REGRESSION & PREDICTION
• We can use our linear equation to predict the value of y for cases from the
population not in our dataset
• How would we predict the income of a 40-year-old American female with a bachelor’s degree?
• Income = -19,248.32 + 516.68(x1) - 17,129.71(x2) + 4,007.87(x3)
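• A worked example, assuming a bachelor’s degree corresponds to roughly 16 years of education (that conversion is an assumption, not stated on the slide):
• Income = -19,248.32 + 516.68(40) - 17,129.71(1) + 4,007.87(16)
• Income = -19,248.32 + 20,667.20 - 17,129.71 + 64,125.92
• Income ≈ $48,415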
R-SQUARED
• The R-squared value indicates how well our IVs explain the DV
• Statistically, not epistemologically
• It is an overall assessment of the model based on comparing the predicted versus observed values (residuals)
• R² = explained variation / total variation
• R² = Σ(ŷ − ȳ)² / Σ(y − ȳ)²
• Y-hat = the predicted values for Y
• R-squared ranges from 0 (no explanation) to 1 (perfect explanation)
• For our previous model, R-squared = 0.239
• We can turn this into a percentage to make it easier to understand: 0.239 * 100 = 23.9%
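• A minimal Python sketch of the formula, reusing the degrees/deaths example from earlier (not the income model, whose raw data are not shown here):

import numpy as np

x = np.array([30, 28, 29, 26, 28, 28, 27])
y = np.array([73, 68, 67, 52, 70, 76, 63])
y_hat = -51.075 + 4.232 * x                    # predicted values (y-hat)

r_squared = np.sum((y_hat - y.mean())**2) / np.sum((y - y.mean())**2)
print(r_squared)                               # explained variation / total variation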
ADJUSTED R-SQUARED
• Generally, any additional IV included in a model will increase the R-squared, which has encouraged the questionable practice of padding models with IVs just to inflate it
• The adjusted R-squared provides a small penalty for each additional IV included in a model
• It is more commonly reported, and is interpreted the same way
• Adjusted R-squared for our model: 0.238, or 23.8%
Adjusted R² = R² − [k / (n − k + 1)] × (1 − R²)

Where: k = number of IVs
       n = sample size
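• A minimal Python sketch of this formula (the n used below is hypothetical; the GSS sample size is not given on this slide):

def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared as defined on this slide: k = number of IVs, n = sample size."""
    return r2 - (k / (n - k + 1)) * (1 - r2)

print(adjusted_r_squared(0.239, 2000, 3))  # approximately 0.238 with a hypothetical n of 2000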
OMITTED VARIABLE BIAS
• A regression model is only as good as the theory behind it
• Results need to be interpreted with caution, as often we are unable to
measure important IVs (known unknowns), and often we don’t know what we
should be measuring (unknown unknowns)
• Our coefficients (and R-squared) will often be inflated if an important
variable is absent from the regression model
MULTIPLE DUMMY VARIABLES
• We can create multiple dummy variables if needed to incorporate a multi-
attribute variable
• We need to exclude one of the dummy variables, as failing to do so would result in perfect collinearity (the dummy variable trap)
• This excluded category becomes the reference category
MARITAL STATUS AND PULSE REGRESSION I
• We can build on the dummy variable model we created last week to get more specific
• Using the NHANES marital status variable, we need to create five dummy variables:
• Married (1) or not (0)
• Divorced (1) or not (0)
• Live with partner (1) or not (0)
• Separated (1) or not (0)
• Widowed (1) or not (0)
• Never married: reference category
• [Chart: distribution of Marital Status (NHANES) across Divorced, Live with partner, Married, Never married, Separated, and Widowed]
• We can then include them in a regression model, and compare the impact of each marital status on pulse relative to the reference category
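• A hedged sketch of creating these dummies with pandas (the column name "marital_status" is assumed for illustration, not taken from the NHANES file):

import pandas as pd

df = pd.DataFrame({"marital_status": ["Married", "Divorced", "Never married",
                                      "Widowed", "Live with partner", "Separated"]})
dummies = pd.get_dummies(df["marital_status"])
dummies = dummies.drop(columns=["Never married"])  # drop the reference category
print(dummies.head())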
MARITAL STATUS AND PULSE REGRESSION II
• How do we interpret this?
Estimate Pr (>|t|)
(Intercept) 73.90240 < 2e-16 ***
married_not -2.32265 8.61e-10 ***
divorced_not -0.18352 0.7428
livepart_not 0.06439 0.9153
sep_not -1.36439 0.1575
widow_not -1.59647 0.0147 *
TYPES OF VARIABLES IN MODELS
• So far, we have only considered simple relationships between variables
• E.g. the relationship between education and income
• Independent variables: variables we are interested in and expect to have an
independent effect on y
• Control variables: variables we include for epistemological reasons, but don’t
necessarily care about
• What counts as an IV or control variable is usually researcher determined
• E.g. I might care about the impact of sex on income but need to include education to
control for that variable. OR vice versa.
• This doesn’t change how we create a regression, only the interpretation of the
results
TYPES OF VARIABLES IN MODELS II
• Mediating variables: a variable in-between the IV and the DV
• E.g. exercise and energy levels. Exercise costs energy, but people who exercise are more
energetic. Exercise improves sleep, cardiovascular functioning, etc.
• We would expect exercise to no longer have a strong/significant impact on energy if we
included these other variables in our model
• Moderating variables: a variable that changes the relationship (either direction or
strength) between the IV and the DV
• E.g. the relationship between income, sex, and education
• Sex directly impacts income, but sex also directly impacts education which impacts income
MEDIATORS
• Mediators are a direct link between X and Y, and without this variable there is
no reason X would influence Y
• If a mediating variable is included in a model, the other variable should no
longer be significant/have a strong impact
Exercise → Better sleep → More energy
MODERATORS
• Moderators change the impact of other variables in the model, but still
independently impact the DV
• Sex moderates the impact of education on income, but still influences income
on its own
[Diagram: Education → Income, with Sex influencing Income directly and moderating the Education → Income relationship]
MEDIATION, MODERATION, OR BOTH?
HIERARCHICAL LINEAR REGRESSION
• This technique allows us to compare our full model to one or more nested models
• Full model: includes all IVs we think are related to the DV
• Nested model: missing one or more IVs from the full model
• These models are considered to be “nested” in the full model
• The results allow us to determine if the full model better explains the DV than
the nested models
• Why would we care about this?
• Why would we want to avoid doing this?
THE RELATIONSHIP BETWEEN STATISTICS AND
ACCESSIBILITY
• There is a trade-off in quantitative analysis between simplicity and complexity
• Simple statistics are easily communicated and understood
• They also tend to oversimplify and reduce complex phenomena to one variable/a few
variables
• Complicated statistics are hard to communicate effectively and are understood fully by a
select few
• In many circumstances they most accurately reflect the underlying phenomena
• Striking an appropriate balance between these two is a large part of the job of a
quantitative researcher, data analyst, or data scientist
[Spectrum from Simple to Complicated: means and pie graphs at the simple end, advanced regression techniques at the complicated end]
HIERARCHICAL LINEAR REGRESSION II
• Full model
• y = a + b1(x1) + b2(x2) + e
• Nested model
• y = a + b1(x1) + e
• The second model is nested in the first
• Comparing these models allows us to determine if the full model better explains y
than the second model
• How do we determine which model is better?
COMPARING MODELS
• We can compare the R-squared of the two models to see if adding additional variables improves the model (similar to a measure of association)
• ∆R2 = Change in R-squared between the models
• If the full model is better than the nested model, then the added IVs are
empirically important
• If the full model is not better than the nested model, then the added IVs are
redundant and likely shouldn’t be in the model
• Unless there are strong theoretical reasons to keep them

Change in R² (∆R²)   Size
<0.1                 Small
0.1 - 0.24           Medium
0.25 or >            Large
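• A minimal Python sketch of computing ∆R² between a nested and a full model (simulated data, purely for illustration; not the course datasets):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = 1 + 0.5 * x1 + 0.3 * x2 + rng.normal(size=200)

nested = sm.OLS(y, sm.add_constant(x1)).fit()
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

delta_r2 = full.rsquared - nested.rsquared  # change in R-squared between the models
print(delta_r2)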
HIERARCHICAL REGRESSION I
• Does including some social determinants of health variables help explain Pulse?
• We can run two regression models
• Nested model: without SDOH variables
• Full model: with SDOH variables
• Full model:
• Pulse = a + age + BMI + Sex + Income + Education + e
• Nested model:
• Pulse = a + age + BMI + e
• We can run a hierarchical regression and evaluate the differences between the models to answer this question
HIERARCHICAL REGRESSION II
Nested model:

Variable             Coefficient   Standardized Beta   P
Constant             76.23                             <0.01
BMI                  0.13          0.07                <0.01
Age                  -0.15         -0.25               <0.01
Adjusted R-squared   0.06

Full model:

Variable             Coefficient   Standardized Beta   P
Constant             72.07                             <0.01
BMI                  0.22          0.124               <0.01
Age                  -0.12         -0.17               <0.01
Income               <0.001        -0.07               <0.01
Female               3.28          0.014               <0.01
College              -1.05         -0.041              <0.01
Adjusted R-squared   0.07

• The R-squared change is small (0.01)
• What is interesting across the models?
WHAT’S WRONG WITH THIS GRAPH?