HLST 2302
LECTURE 5
ADVANCED OLS
REGRESSION
GUYTANO VIRDO, PHD
ANNOUNCEMENTS
• Mid-term exam May 27th
• In class during usual lecture time
• Mix of MC, definition, and math questions
• You will need a scientific calculator, YUcard (or passport), and writing instrument of
choice
• You will be provided with a formula sheet
TODAY
• Residuals
• Extending the linear equation
• Adding additional IVs
• Prediction with additional IVs
• Interpreting coefficients and standardized coefficients
• R-squared and adjusted r-squared
• Multiple dummy variables
ERRORS IN REGRESSION
• y = a + bx + e, where e refers to the error term
• Multiple causes
• Poor measurement, sampling bias, missing data, difficult to measure constructs
• Fact of life in the social sciences
• In stats speak, these errors are called residuals
RESIDUALS
• A residual is the difference between a case’s observed value on the dependent variable and the value predicted for that case by the OLS regression results
• In the aggregate, they are our estimate of all the errors in the model
• y(observed) – y(predicted) = prediction error (residual)
• These tell us how well our model predicts the DV for all cases in the sample
RESIDUAL CALCULATIONS
• We can use our linear equation to calculate the difference between each case’s observed y value and the y value predicted by our model
• Constant = -51.075; slope coefficient = 4.232
• For case 1:
• y = -51.075 + 4.232*30
• y = -51.075 + 126.96
• y = 75.885

Case   Degrees   Deaths   Predicted y   Difference
1      30        73       75.885        -2.885
2      28        68       67.421        0.579
3      29        67       71.653        -4.653
4      26        52       58.957        -6.957
5      28        70       67.421        2.579
6      28        76       67.421        8.579
7      27        63       63.189        -0.189
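• A minimal Python sketch of the same calculation (values taken from the table above; numpy is assumed to be available, and is not part of the original slides):

import numpy as np

# Degrees (x) and deaths (y) for the seven cases in the table
x = np.array([30, 28, 29, 26, 28, 28, 27])
y = np.array([73, 68, 67, 52, 70, 76, 63])

# Constant and slope coefficient from the fitted bivariate model
a, b = -51.075, 4.232

y_hat = a + b * x          # predicted y for each case
residuals = y - y_hat      # observed minus predicted
print(residuals)           # first residual is approximately -2.885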
RESIDUALS II
• These should be:
• Normally distributed (check skew and kurtosis)
• ~95% of cases within +2/-2 SD, etc.
• Not significantly related to any of the independent variables
• Check with correlations
• Homoskedastic (constant variance)
• We shouldn’t be systematically better at predicting some cases than others
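• A hedged sketch of these checks in Python, reusing the degrees/deaths example above (scipy is assumed to be available; this is illustrative only):

import numpy as np
from scipy import stats

x = np.array([30, 28, 29, 26, 28, 28, 27])
y = np.array([73, 68, 67, 52, 70, 76, 63])
residuals = y - (-51.075 + 4.232 * x)

print(stats.skew(residuals), stats.kurtosis(residuals))  # values near 0 suggest approximate normality
print(np.corrcoef(x, residuals)[0, 1])                   # correlation with the IV should be close to 0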
EXTENDING OLS REGRESSION
• Most phenomena in the social sciences cannot be explained by a single IV
• We could do a series of bivariate regressions
• This wouldn’t solve the third variable problem
• Including multiple IVs in one regression model allows us to “control” for each
IV
• The regression coefficient for each IV shows the impact of that IV on the DV, controlling
for the impact from the other included IVs
EXTENDING OUR REGRESSION EQUATION
• We simply need to extend our linear equation to accommodate additional
independent variables
• y = a + bx + e, becomes
• y = a + b1x1 + b2x2 + b3x3 + … + e
• b1 now shows the estimated impact of the first independent variable, holding constant the other IVs (x2, x3, …); see the sketch at the end of this slide
• Why is this important?
• We now have a visual problem
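• A minimal Python sketch of fitting such a model with statsmodels (the data below are simulated, purely for illustration; they are not from the course datasets):

import numpy as np
import statsmodels.api as sm

# Simulated data: two IVs (x1, x2) and one DV (y), with random error e
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 2 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2]))  # adds the constant a
model = sm.OLS(y, X).fit()
print(model.params)                             # estimates of a, b1, b2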
WE NOW HAVE MULTIPLE DIMENSIONS
• With bivariate regression, we could
make a scatterplot with a line of
best fit
• If we have two IVs, we now have a
“plane” of best fit in a 3D graph
instead of a regression line
• If we have more IVs, we can no
longer visualize the relationship
between the IVs and the DV
INTERPRETING REGRESSION COEFFICIENTS
• Interpreting regression coefficients is slightly different
• It’s still about a one-unit change in x has b impact on y, but now it is controlling
for other variables
• Which IV has the strongest impact on the DV?
• Dependent variable: personal income
• Data source: American General Social Survey (2017)
Variable            B coefficient   Beta coefficient   Pr (sig.)
Constant            -19,248.32                         <0.01
Age (years)         516.68          0.21               <0.01
Sex (M=0, F=1)      -17,129.71      -0.26              <0.01
Education (years)   4,007.87        0.358              <0.01
BETA COEFFICIENTS
• Looking at regression coefficients alone does not reliably indicate which IV
has the strongest impact
• Sex had the largest absolute coefficient, but can only range from 0-1
• Education can range from 0-20
• We can “standardize” the coefficients based on the SD of the IV and the DV
• Beta = (original slope) × (SD of x / SD of y)
• Ignore signs and focus on the absolute value
• These show us the relative impact of each IV on the DV
• They are not easily interpreted or used beyond this
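• A hedged worked example in Python, using hypothetical SDs (the actual GSS standard deviations are not shown on this slide):

# Standardizing the education slope from the table above
b_education = 4007.87
sd_education = 3.0      # hypothetical SD of years of education
sd_income = 33000.0     # hypothetical SD of personal income
beta = b_education * sd_education / sd_income
print(beta)             # roughly 0.36 with these illustrative numbers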
MULTIVARIATE OLS REGRESSION EXAMPLE I
• We can now fully interpret this regression model
• Hypotheses?
• Significance?
• Impact of Xs on Y?
• Relative impact of Xs on Y?
• Linear equation?
Variable            B coefficient   Beta coefficient   Pr (sig.)
Constant            -19,248.32                         <0.01
Age (years)         516.68          0.21               <0.01
Sex (M=0, F=1)      -17,129.71      -0.26              <0.01
Education (years)   4,007.87        0.358              <0.01
MULTIVARIATE OLS REGRESSION & PREDICTION
• We can use our linear equation to predict the value of y for cases from the
population not in our dataset
• How would we predict the income of a 40-year-old American female with a bachelor’s degree?
• Income = -19,248.32 + 516.68(x1) - 17,129.71(x2) + 4,007.87(x3)
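• A worked example, assuming a bachelor’s degree corresponds to roughly 16 years of education (that conversion is an assumption, not stated on the slide):
• Income = -19,248.32 + 516.68(40) - 17,129.71(1) + 4,007.87(16)
• Income = -19,248.32 + 20,667.20 - 17,129.71 + 64,125.92
• Income ≈ $48,415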
R-SQUARED
• The R-squared value indicates how well our IVs explain the DV
• Statistically, not epistemologically
• It is an overall assessment of the model based on comparing the predicted versus observed values (residuals)
• R² = explained variation / total variation
• R² = Σ(ŷ − ȳ)² / Σ(y − ȳ)²
• Y-hat = the predicted values for Y
• R-squared ranges from 0 (no explanation) to 1 (perfect explanation)
• For our previous model, R-squared = 0.239
• We can turn this into a percentage to make it easier to understand: 0.239 * 100 = 23.9%
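• A minimal Python sketch of the formula, reusing the degrees/deaths example from earlier (not the income model, whose raw data are not shown here):

import numpy as np

x = np.array([30, 28, 29, 26, 28, 28, 27])
y = np.array([73, 68, 67, 52, 70, 76, 63])
y_hat = -51.075 + 4.232 * x                    # predicted values (y-hat)

r_squared = np.sum((y_hat - y.mean())**2) / np.sum((y - y.mean())**2)
print(r_squared)                               # explained variation / total variation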
ADJUSTED R-SQUARED
• Generally, any additional IV included in a model will increase the R-squared, which has encouraged the questionable practice of padding models with IVs just to inflate it
• The adjusted R-squared provides a small penalty for each additional IV included in a model
• It is more commonly reported, and is interpreted the same way
• Adjusted R-squared for our model: 0.238, or 23.8%
Adjusted R² = R² − [k / (n − k + 1)] × (1 − R²)

Where: k = number of IVs
       n = sample size
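• A minimal Python sketch of this formula (the n used below is hypothetical; the GSS sample size is not given on this slide):

def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared as defined on this slide: k = number of IVs, n = sample size."""
    return r2 - (k / (n - k + 1)) * (1 - r2)

print(adjusted_r_squared(0.239, 2000, 3))  # approximately 0.238 with a hypothetical n of 2000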
OMITTED VARIABLE BIAS
• A regression model is only as good as the theory behind it
• Results need to be interpreted with caution, as often we are unable to
measure important IVs (known unknowns), and often we don’t know what we
should be measuring (unknown unknowns)
• Our coefficients (and R-squared) will often be inflated if an important
variable is absent from the regression model
MULTIPLE DUMMY VARIABLES
• We can create multiple dummy variables if needed to incorporate a multi-
attribute variable
• We need to exclude one of the dummy variables, as failing to do so would result in perfect collinearity (the dummy variable trap)
• This excluded category becomes the reference category
MARITAL STATUS AND PULSE REGRESSION I
• We can build on the dummy variable model we created last week to get more specific
• Using the NHANES marital status variable, we need to create five dummy variables:
• Married (1) or not (0)
• Divorced (1) or not (0)
• Live with partner (1) or not (0)
• Separated (1) or not (0)
• Widowed (1) or not (0)
• Never married: reference category
• [Chart: distribution of Marital Status (NHANES) across Divorced, Live with partner, Married, Never married, Separated, and Widowed]
• We can then include them in a regression model, and compare the impact of each marital status on pulse relative to the reference category
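• A hedged sketch of creating these dummies with pandas (the column name "marital_status" is assumed for illustration, not taken from the NHANES file):

import pandas as pd

df = pd.DataFrame({"marital_status": ["Married", "Divorced", "Never married",
                                      "Widowed", "Live with partner", "Separated"]})
dummies = pd.get_dummies(df["marital_status"])
dummies = dummies.drop(columns=["Never married"])  # drop the reference category
print(dummies.head())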
MARITAL STATUS AND PULSE REGRESSION II
• How do we interpret this?
Estimate Pr (>|t|)
(Intercept) 73.90240 < 2e-16 ***
married_not -2.32265 8.61e-10 ***
divorced_not -0.18352 0.7428
livepart_not 0.06439 0.9153
sep_not -1.36439 0.1575
widow_not -1.59647 0.0147 *
TYPES OF VARIABLES IN MODELS
• So far, we have only considered simple relationships between variables
• E.g. the relationship between education and income
• Independent variables: variables we are interested in and expect to have an
independent effect on y
• Control variables: variables we include for epistemological reasons, but don’t
necessarily care about
• What counts as an IV or control variable is usually researcher determined
• E.g. I might care about the impact of sex on income but need to include education to
control for that variable. OR vice versa.
• This doesn’t change how we create a regression, only the interpretation of the
results
TYPES OF VARIABLES IN MODELS II
• Mediating variables: a variable in-between the IV and the DV
• E.g. exercise and energy levels. Exercise costs energy, but people who exercise are more
energetic. Exercise improves sleep, cardiovascular functioning, etc.
• We would expect exercise to no longer have a strong/significant impact on energy if we
included these other variables in our model
• Moderating variables: a variable that changes the relationship (either direction or
strength) between the IV and the DV
• E.g. the relationship between income, sex, and education
• Sex directly impacts income, but sex also directly impacts education which impacts income
MEDIATORS
• Mediators are a direct link between X and Y, and without this variable there is
no reason X would influence Y
• If a mediating variable is included in a model, the other variable should no
longer be significant/have a strong impact
Exercise → Better sleep → More energy
MODERATORS
• Moderators change the impact of other variables in the model, but still
independently impact the DV
• Sex moderates the impact of education on income, but still influences income
on its own
[Diagram: Education → Income, with Sex influencing Income directly and moderating the Education → Income relationship]
MEDIATION, MODERATION, OR BOTH?
HIERARCHICAL LINEAR REGRESSION
• This technique allows us to compare our full model to one or more nested models
• Full model: includes all IVs we think are related to the DV
• Nested model: missing one or more IVs from the full model
• These models are considered to be “nested” in the full model
• The results allow us to determine if the full model better explains the DV than
the nested models
• Why would we care about this?
• Why would we want to avoid doing this?
THE RELATIONSHIP BETWEEN STATISTICS AND
ACCESSIBILITY
• There is a trade-off in quantitative analysis between simplicity and complexity
• Simple statistics are easily communicated and understood
• They also tend to oversimplify and reduce complex phenomena to one variable/a few
variables
• Complicated statistics are hard to communicate effectively and are understood fully by a
select few
• In many circumstances they most accurately reflect the underlying phenomena
• Striking an appropriate balance between these two is a large part of the job of a
quantitative researcher, data analyst, or data scientist
[Spectrum from Simple to Complicated: means and pie graphs at the simple end, advanced regression techniques at the complicated end]
HIERARCHICAL LINEAR REGRESSION II
• Full model
• y = a + b1(x1) + b2(x2) + e
• Nested model
• y = a + b1(x1) + e
• The second model is nested in the first
• Comparing these models allows us to determine if the full model better explains y
than the second model
• How do we determine which model is better?
COMPARING MODELS
• We can compare the R-squared of the two models to see if adding additional variables improves the model (similar to a measure of association)
• ∆R2 = Change in R-squared between the models
• If the full model is better than the nested model, then the added IVs are
empirically important
• If the full model is not better than the nested model, then the added IVs are
redundant and likely shouldn’t be in the model
• Unless there are strong theoretical reasons to keep them

Change in R² (∆R²)   Size
<0.1                 Small
0.1 - 0.24           Medium
0.25 or >            Large
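• A minimal Python sketch of computing ∆R² between a nested and a full model (simulated data, purely for illustration; not the course datasets):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = 1 + 0.5 * x1 + 0.3 * x2 + rng.normal(size=200)

nested = sm.OLS(y, sm.add_constant(x1)).fit()
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

delta_r2 = full.rsquared - nested.rsquared  # change in R-squared between the models
print(delta_r2)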
HIERARCHICAL REGRESSION I
• Does including some social determinants of health variables help explain Pulse?
• We can run two regression models
• Nested model: without SDOH variables
• Full model: with SDOH variables
• Full model:
• Pulse = a + age + BMI + Sex + Income + Education + e
• Nested model:
• Pulse = a + age + BMI + e
• We can run a hierarchical regression and evaluate the differences between the models to answer this question
HIERARCHICAL REGRESSION II
Nested model:

Variable             Coefficient   Standardized Beta   P
Constant             76.23                             <0.01
BMI                  0.13          0.07                <0.01
Age                  -0.15         -0.25               <0.01
Adjusted R-squared   0.06

Full model:

Variable             Coefficient   Standardized Beta   P
Constant             72.07                             <0.01
BMI                  0.22          0.124               <0.01
Age                  -0.12         -0.17               <0.01
Income               <0.001        -0.07               <0.01
Female               3.28          0.014               <0.01
College              -1.05         -0.041              <0.01
Adjusted R-squared   0.07

• The R-squared change is small (0.01)
• What is interesting across the models?
WHAT’S WRONG WITH THIS GRAPH?