
HLST 2302

LECTURE 5
ADVANCED OLS
REGRESSION
GUYTANO VIRDO, PHD
ANNOUNCEMENTS

• Mid-term exam May 27th


• In class during usual lecture time
• Mix of MC, definition, and math questions
• You will need a scientific calculator, YUcard (or passport), and writing instrument of
choice
• You will be provided with a formula sheet
TODAY

• Residuals
• Extending the linear equation
• Adding additional IVs
• Prediction with additional IVs
• Interpreting coefficients and standardized coefficients
• R-squared and adjusted r-squared
• Multiple dummy variables
ERRORS IN REGRESSION

• y = a + bx + e, where e refers to the error term


• Multiple causes
• Poor measurement, sampling bias, missing data, difficult to measure constructs

• Fact of life in the social sciences


• In stats speak, these errors are called residuals
RESIDUALS

• The residuals are the difference between our dependent variable values, and
the case’s predicted value for the dependent variable based on the OLS
regression results
• In the aggregate, they represent our estimate for what all the errors would be

• y(observed) – y(predicted)= prediction error


• These tell us how well our model predicts the DV for all cases in the sample
RESIDUAL CALCULATIONS
• We can use our linear equation to calculate the difference between the case's y value and the predicted y value from our model

For case 1:
y = -51.075 + 4.232*30
y = -51.075 + 126.96
y = 75.885

Case   Degrees   Deaths   Predicted y   Difference
1      30        73       75.885        -2.885
2      28        68       67.421        0.579
3      29        67       71.653        -4.653
4      26        52       58.957        -6.957
5      28        70       67.421        2.579
6      28        76       67.421        8.579
7      27        63       63.189        -0.189

Slope coefficient: 4.232
Constant: -51.075
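A minimal sketch of the same calculation in Python (NumPy); the data, slope, and constant are the values shown above:

import numpy as np

# Data from the table above: degrees (x) and deaths (y) for the 7 cases
degrees = np.array([30, 28, 29, 26, 28, 28, 27])
deaths = np.array([73, 68, 67, 52, 70, 76, 63])

# Regression results from the lecture: y = -51.075 + 4.232*x
constant, slope = -51.075, 4.232

predicted = constant + slope * degrees   # predicted y for each case
residuals = deaths - predicted           # observed minus predicted

print(np.round(predicted, 3))   # 75.885, 67.421, 71.653, ...
print(np.round(residuals, 3))   # -2.885, 0.579, -4.653, ...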
RESIDUALS II

• These should be:


• Normally distributed (skew and kurtosis)
• 95% of cases within +2/-2 SD, etc
• Not significantly related to any of the independent variables
• Correlations
• Homoskedastic (equal error variance), not heteroskedastic
• We shouldn’t be better at predicting some cases than others
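A minimal sketch of these checks in Python, reusing the degrees/deaths example (correlating the squared residuals with the IV is just one rough way to look for heteroskedasticity):

import numpy as np
from scipy.stats import skew, kurtosis, pearsonr

# Same example data and fitted line as the residual table above
degrees = np.array([30, 28, 29, 26, 28, 28, 27])
deaths = np.array([73, 68, 67, 52, 70, 76, 63])
residuals = deaths - (-51.075 + 4.232 * degrees)

# Roughly normal? (skew and excess kurtosis should be near 0)
print("skew:", round(skew(residuals), 3))
print("excess kurtosis:", round(kurtosis(residuals), 3))

# Not related to the IV? (correlation near 0, non-significant p-value)
r, p = pearsonr(degrees, residuals)
print("correlation with IV:", round(r, 3), "p =", round(p, 3))

# Homoskedastic? (squared residuals should not track the IV)
r2, p2 = pearsonr(degrees, residuals ** 2)
print("correlation of squared residuals with IV:", round(r2, 3))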
EXTENDING OLS REGRESSION

• Most phenomena in the social sciences cannot be explained by a single IV


• We could do a series of bivariate regressions
• This wouldn’t solve the third variable problem

• Including multiple IVs in one regression model allows us to “control” for each
IV
• The regression coefficient for each IV shows the impact of that IV on the DV, controlling
for the impact from the other included IVs
EXTENDING OUR REGRESSION EQUATION

• We simply need to extend our linear equation to accommodate additional independent variables
• y = a + bx + e, becomes
• y = a + b1x1 + b2x2 + b3x3 + ... + e
• b1 now shows the estimated impact of the first independent variable, holding
constant the impact of b2 and b3
• Why is this important?
• We now have a visual problem
WE NOW HAVE MULTIPLE DIMENSIONS

• With bivariate regression, we could make a scatterplot with a line of best fit
• If we have two IVs, we now have a
“plane” of best fit in a 3D graph
instead of a regression line
• If we have more IVs, we can no
longer visualize the relationship
between the IVs and the DV
INTERPRETING REGRESSION COEFFICIENTS
• Interpreting regression coefficients is slightly different
• It’s still about a one-unit change in x has b impact on y, but now it is controlling
for other variables
• Which IV has the strongest impact on the DV?
• Dependent variable: personal income
• Data source: American General Social Survey (2017)
Variable             B coefficient   Beta coefficient   Pr (sig.)
Constant             -19,248.32                         <0.01
Age (years)          516.68          0.21               <0.01
Sex (M=0, F=1)       -17,129.71      -0.26              <0.01
Education (years)    4,007.87        0.358              <0.01
BETA COEFFICIENTS
• Looking at regression coefficients alone does not reliably indicate which IV
has the strongest impact
• Sex had the largest absolute coefficient, but can only range from 0-1
• Education can range from 0-20
• We can “standardize” the coefficients based on the SD of the IV and the DV
!"!"
• Beta=(𝑜𝑟𝑖𝑔𝑖𝑛𝑎𝑙 𝑠𝑙𝑜𝑝𝑒) !"
#"

• Ignore signs and focus on the absolute value


• These show us the relative impact of each IV on the DV
• They are not easily interpreted or used beyond this
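As a rough illustration (the standard deviations here are assumed for the example, not taken from the GSS output): if education has a SD of about 3 years and personal income a SD of about $33,500, then Beta for education = 4,007.87 × (3 / 33,500) ≈ 0.36, in line with the table above.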
MULTIVARIATE OLS REGRESSION EXAMPLE I
• We can now fully interpret this regression model
• Hypotheses?
• Significance?
• Impact of Xs on Y?
• Relative impact of Xs on Y?
• Linear equation?
Variable             B coefficient   Beta coefficient   Pr (sig.)
Constant             -19,248.32                         <0.01
Age (years)          516.68          0.21               <0.01
Sex (M=0, F=1)       -17,129.71      -0.26              <0.01
Education (years)    4,007.87        0.358              <0.01
MULTIVARIATE OLS REGRESSION & PREDICTION

• We can use our linear equation to predict the value of y for cases from the
population not in our dataset
• How would we predict the income of a 40-year-old American female with a bachelor’s degree?
• Income = -19,248.32 + 516.68(x1) - 17,129.71(x2) + 4,007.87(x3)
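A worked version, assuming a bachelor’s degree corresponds to 16 years of education (an assumption for this example):

Income = -19,248.32 + 516.68(40) - 17,129.71(1) + 4,007.87(16)
Income = -19,248.32 + 20,667.20 - 17,129.71 + 64,125.92
Income ≈ $48,415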
R-SQUARED
• The R-squared value indicates how well our IVs explain the DV
• Statistically, not epistemologically
• It is an overall assessment of the model based on comparing the predicted versus observed
values (residuals)
%&'()*+%, -).*)/*0+
• 𝑅$ = /0/)( -).*)/*0+

6 !
3 4)
∑(45
• 𝑅$ = ∑(454)
6 !

• Y-hat = the predicted values for Y


• R-squared ranges from 0 (no explanation) to 1 (perfect explanation)
• For our previous model, R-squared = 0.239
• We can turn this into a percentage to make it easier to understand: 0.239 * 100 = 23.9 %
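A minimal sketch of this calculation in Python, reusing the degrees/deaths example from the residuals slide:

import numpy as np

degrees = np.array([30, 28, 29, 26, 28, 28, 27])
deaths = np.array([73, 68, 67, 52, 70, 76, 63])
predicted = -51.075 + 4.232 * degrees

# R-squared = explained variation / total variation
explained = np.sum((predicted - deaths.mean()) ** 2)
total = np.sum((deaths - deaths.mean()) ** 2)
print("R-squared:", round(explained / total, 3))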
ADJUSTED R-SQUARED
• Generally, any additional IV included in a model will increase the R-squared, which has resulted in unethical practices by some researchers
• The adjusted R-squared provides a small penalty for each additional IV included in a model
• It is more commonly reported, and is interpreted the same way
• Adjusted R-squared for our model: 0.238, or 23.8%

Adjusted R² = R² − [k / (n − k − 1)] × (1 − R²)

Where: k = number of IVs


n = sample size
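A quick illustration with a hypothetical sample size (n here is assumed for the example, not taken from the GSS output): with R² = 0.239, k = 3 IVs, and n = 2,000, Adjusted R² = 0.239 − [3 / 1,996] × (1 − 0.239) ≈ 0.239 − 0.001 ≈ 0.238.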
OMITTED VARIABLE BIAS
• A regression model is only as good as the theory behind it
• Results need to be interpreted with caution, as often we are unable to
measure important IVs (known unknowns), and often we don’t know what we
should be measuring (unknown unknowns)
• Our coefficients (and R-squared) will often be inflated if an important
variable is absent from the regression model
MULTIPLE DUMMY VARIABLES

• We can create multiple dummy variables if needed to incorporate a multi-attribute variable
• We need to exclude one of the dummy variables, as failing to do so would
result in collinearity
• This excluded category becomes the reference category
MARITAL STATUS AND PULSE REGRESSION I
• We can build on the dummy variable model we created last week to get more specific
• We need to create five dummy variables using the marital status variable
• Married (1) or not (0)
• Divorced (1) or not (0)
• Live with a partner (1) or not (0)
• Separated (1) or not (0)
• Widowed (1) or not (0)
• Never married: reference

[Chart: Marital Status (NHANES) — Divorced, Live with partner, Married, Never married, Separated, Widowed]

• We can then include them in a regression model, and compare the associated impact of each
marital status on pulse compared to the reference category
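A minimal sketch of the dummy coding step in pandas (the data frame and the column name marital_status are assumptions for the example, not the actual NHANES variable names):

import pandas as pd

# Hypothetical data frame with an NHANES-style marital status column
df = pd.DataFrame({
    "marital_status": ["Married", "Divorced", "Never married",
                       "Live with partner", "Separated", "Widowed"],
    "pulse": [72, 74, 75, 71, 73, 76],
})

# One 0/1 dummy per category, dropping "Never married" as the reference
dummies = pd.get_dummies(df["marital_status"], prefix="marital").astype(int)
dummies = dummies.drop(columns=["marital_Never married"])

# These five dummies can then be entered together in the regression model
df = pd.concat([df, dummies], axis=1)
print(df.columns.tolist())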
MARITAL STATUS AND PULSE REGRESSION II

• How do we interpret this?


Estimate Pr (>|t|)
(Intercept) 73.90240 < 2e-16 ***
married_not -2.32265 8.61e-10 ***
divorced_not -0.18352 0.7428
livepart_not 0.06439 0.9153
sep_not -1.36439 0.1575
widow_not -1.59647 0.0147 *
TYPES OF VARIABLES IN MODELS
• So far, we have only considered simple relationships between variables
• E.g. the relationship between education and income
• Independent variables: variables we are interested in and expect to have an
independent effect on y
• Control variables: variables we include for epistemological reasons, but don’t
necessarily care about
• What counts as an IV or control variable is usually researcher determined
• E.g. I might care about the impact of sex on income but need to include education to
control for that variable. OR vice versa.

• This doesn’t change how we create a regression, only the interpretation of the
results
TYPES OF VARIABLES IN MODELS II

• Mediating variables: a variable in-between the IV and the DV


• E.g. exercise and energy levels. Exercise costs energy, but people who exercise are more
energetic. Exercise improves sleep, cardiovascular functioning, etc.
• We would expect exercise to no longer have a strong/significant impact on energy if we
included these other variables in our model

• Moderating variables: a variable that changes the relationship (either direction or strength) between the IV and the DV
• E.g. the relationship between income, sex, and education
• Sex directly impacts income, but sex also directly impacts education which impacts income
MEDIATORS

• Mediators are a direct link between X and Y, and without this variable there is
no reason X would influence Y
• If a mediating variable is included in a model, the other variable should no
longer be significant/have a strong impact

Exercise → Better sleep → More energy


MODERATORS

• Moderators change the impact of other variables in the model, but still
independently impact the DV
• Sex moderates the impact of education on income, but still influences income
on its own

[Diagram: Education → Income, with Sex pointing both at Income and at the Education → Income arrow]
MEDIATION, MODERATION, OR BOTH?
HIERARCHICAL LINEAR REGRESSION

• This technique allows us to compare our full model to a nested model, or to several nested models
• Full model: includes all IVs we think are related to the DV
• Nested model: missing one or more IVs from the full model
• These models are considered to be “nested” in the full model
• The results allow us to determine if the full model better explains the DV than
the nested models
• Why would we care about this?
• Why would we want to avoid doing this?
THE RELATIONSHIP BETWEEN STATISTICS AND
ACCESSIBILITY
• There is a trade-off in quantitative analysis between simplicity and complexity
• Simple statistics are easily communicated and understood
• They also tend to oversimplify and reduce complex phenomena to one variable/a few
variables
• Complicated statistics are hard to communicate effectively and are understood fully by a
select few
• In many circumstances they most accurately reflect the underlying phenomena
• Striking an appropriate balance between these two is a large part of the job of a
quantitative researcher, data analyst, or data scientist

[Diagram: a simple-to-complicated continuum — pie graphs at the simple end, means and regression in the middle, advanced regression techniques at the complicated end]
HIERARCHICAL LINEAR REGRESSION II

• Full model
• y = a + b1(x1) + b2(x2) + e
• Nested model
• y = a + b1(x1) + e
• The second model is nested in the first
• Comparing these models allows us to determine if the full model better explains y
than the second model
• How do we determine which model is better?
COMPARING MODELS
• We can compare the R-squared of each model to see if adding additional variables improves the model (similar to a measure of association)
• ∆R2 = Change in R-squared between the models
• If the full model is better than the nested model, then the added IVs are
empirically important
• If the full model is not better than the nested model, then the added IVs are
redundant and likely shouldn’t be in the model
• Unless there are strong theoretical reasons

∆R² (change in R²)   Interpretation
< 0.1                Small
0.1 - 0.24           Medium
0.25 or greater      Large
HIERARCHICAL REGRESSION I

• Does including some social determinants of health variables help explain Pulse?
• We can run two regression models
• Nested model: without SDOH variables
• Full model: with SDOH variables
• Full model:
• Pulse = a + age + BMI + Sex + Income + Education + e
• Nested model:
• Pulse = a + age + BMI + e
• We can run a hierarchical regression and evaluate the differences to answer this question
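A minimal sketch of this comparison with statsmodels (the data frame df and the variable names are assumptions for the example, not the actual NHANES names; anova_lm gives the F-test for the added block of IVs):

import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# df is assumed to hold pulse, age, bmi, sex, income, and education
nested = smf.ols("pulse ~ age + bmi", data=df).fit()
full = smf.ols("pulse ~ age + bmi + sex + income + education", data=df).fit()

# Change in R-squared between the nested and full models
print("Delta R-squared:", round(full.rsquared - nested.rsquared, 3))

# F-test of whether the added IVs significantly improve the model
print(anova_lm(nested, full))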
HIERARCHICAL REGRESSION II
Nested model:
Variable             Coefficient   Standardized Beta   P
Constant             76.23                             <0.01
BMI                  0.13          0.07                <0.01
Age                  -0.15         -0.25               <0.01
Adjusted R-squared   0.06

Full model:
Variable             Coefficient   Standardized Beta   P
Constant             72.07                             <0.01
BMI                  0.22          0.124               <0.01
Age                  -0.12         -0.17               <0.01
Income               <0.001        -0.07               <0.01
Female               3.28          0.014               <0.01
College              -1.05         -0.041              <0.01
Adjusted R-squared   0.07

The R-squared change is small (0.01)

What is interesting across the models?
WHAT’S WRONG WITH THIS GRAPH?
