Confounding and Adjustment
John McGready, PhD
Johns Hopkins University
Learning Objectives
► In this set of lectures, we will:
► Formally define confounding and give explicit examples of its impact
► Define adjustment and adjusted estimates conceptually
► Begin a discussion of the analytics of adjustment
Confounding: A Formal Definition and Some Examples
The material in this video is subject to the copyright of the owners of the material and is being provided for educational purposes under
rules of fair use for registered students in this course only. No additional copies of the copyrighted work may be made or distributed.
Section Objectives
► Formally define confounding
► Establish conditions that can result in the confounding of an outcome/exposure
relationship
► Demonstrate the potential effects of confounding via examples
Confounding (Lurking Variable)—1
► Consider results from the following (fictitious) study:
► This study was done to investigate the association between smoking and a certain
disease in male and female adults
► 210 smokers and 240 nonsmokers were recruited for the study
Diagnosis Smokers Nonsmokers Total
Disease 52 64 116
No disease 158 176 334
Total 210 240 450
► Here, $\widehat{RR}_{S \text{ to } NS} = \dfrac{\hat{p}_{S}}{\hat{p}_{NS}} = \dfrac{52/210}{64/240} \approx 0.93$
Confounding (Lurking Variable)—2
► Additional information: The following table shows the distribution of smokers and
nonsmokers by sex
Sex Smokers Nonsmokers Total
Male 160 40 200
Female 50 200 250
Total 210 240 450
Confounding (Lurking Variable)—3
► Even more additional information: The following table shows the distribution of disease
status by sex
Sex Disease No disease Total
Male 33 167 200
Female 83 167 250
Total 116 334 450
Recap
► The original outcome of interest is DISEASE, and the original exposure of interest is
SMOKING
► In this sample, SEX is related to both the outcome and exposure
► This relationship is possibly impacting the overall relationship between DISEASE and
SMOKING
► How can we look at the relationship between DISEASE and SMOKING, removing any
possible “interference” from SEX?
► One approach—look at the DISEASE and SMOKING relationship separately for males
and females
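As a quick check of the claim that SEX is related to both the exposure and the outcome, the two sex tables can be summarized as proportions. This is a minimal sketch using only the counts shown above; the variable and function names are illustrative, not part of the lecture.

```python
# Counts copied from the sex-by-smoking and sex-by-disease tables above.
smokers_by_sex = {"male": (160, 200), "female": (50, 250)}  # (smokers, total)
disease_by_sex = {"male": (33, 200), "female": (83, 250)}   # (diseased, total)


def proportion(count, total):
    """Return count / total as a float."""
    return count / total


prop_smokers = {sex: proportion(*ct) for sex, ct in smokers_by_sex.items()}
prop_disease = {sex: proportion(*ct) for sex, ct in disease_by_sex.items()}

for sex in ("male", "female"):
    print(f"{sex:>6}: P(smoker) = {prop_smokers[sex]:.3f}, "
          f"P(disease) = {prop_disease[sex]:.3f}")
```

In this sample, males are far more likely to smoke and females are more likely to have the disease, which is the pattern that makes SEX a candidate confounder.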
Disease and Smoking: Males Only
► Here is a 2×2 table of the disease/smoking relationship among males only
Diagnoses for males Smokers Nonsmokers Total
Disease 29 4 33
No disease 131 36 167
Total 160 40 200
► Here, $\widehat{RR}_{M:S \text{ to } M:NS} = \dfrac{\hat{p}_{M:S}}{\hat{p}_{M:NS}} = \dfrac{29/160}{4/40} \approx 1.8$
Disease and Smoking: Females Only
► Here is a 2×2 table of the disease/smoking relationship among females only
Diagnoses for females Smokers Nonsmokers Total
Disease 23 60 83
No disease 27 140 167
Total 50 200 250
► Here, $\widehat{RR}_{F:S \text{ to } F:NS} = \dfrac{\hat{p}_{F:S}}{\hat{p}_{F:NS}} = \dfrac{23/50}{60/200} \approx 1.5$
Smoking, Disease, and Sex: A Recap
► The overall (sometimes called crude or unadjusted) relationship (RR) between smoking
and disease was nearly 1 (risk difference nearly 0)
$\widehat{RR}_{S \text{ to } NS} \approx 0.93$; $\widehat{RD}_{S \text{ to } NS} = \hat{p}_{S} - \hat{p}_{NS} = -0.02$
► The sex-specific results showed similar positive associations between smoking and disease
Males: $\widehat{RR}_{M:S \text{ to } M:NS} \approx 1.8$; $\widehat{RD}_{M:S \text{ to } M:NS} = \hat{p}_{M:S} - \hat{p}_{M:NS} = 0.08$
Females: $\widehat{RR}_{F:S \text{ to } F:NS} \approx 1.5$; $\widehat{RD}_{F:S \text{ to } F:NS} = \hat{p}_{F:S} - \hat{p}_{F:NS} = 0.16$
► (Note: for the moment, we are not considering statistical significance, just using estimates
to illustrate the point)
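The crude and sex-specific relative risks and risk differences in this recap can be reproduced directly from the three 2×2 tables. The sketch below is illustrative only; the function names `risk` and `rr_and_rd` are assumptions made for this example.

```python
def risk(cases, total):
    """Estimated risk (proportion with disease) in one group."""
    return cases / total


def rr_and_rd(cases_exp, n_exp, cases_unexp, n_unexp):
    """Relative risk and risk difference, exposed versus unexposed."""
    p_exp = risk(cases_exp, n_exp)
    p_unexp = risk(cases_unexp, n_unexp)
    return p_exp / p_unexp, p_exp - p_unexp


# (diseased smokers, all smokers, diseased nonsmokers, all nonsmokers)
tables = {
    "crude": (52, 210, 64, 240),
    "males": (29, 160, 4, 40),
    "females": (23, 50, 60, 200),
}

results = {name: rr_and_rd(*counts) for name, counts in tables.items()}
for name, (rr, rd) in results.items():
    print(f"{name:>7}: RR = {rr:.2f}, RD = {rd:+.2f}")
```

The crude comparison sits just below 1, while both sex-specific comparisons are clearly above 1.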
Smoking, Disease, and Sex: What Happened?
► Recall: males were more likely to be smokers, and females were more likely to have the disease
► The crude RR comparing the risk of disease in smokers to nonsmokers is distorted because the
smoker group has an over-representation of persons with lower risk of disease (males)
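One way to see the over-representation is to write each group's crude risk as an average of its sex-specific risks, weighted by that group's sex mix. The sketch below uses only the counts from the sex-specific 2×2 tables; the names `strata`, `crude_risk`, and `share_male` are illustrative.

```python
# (diseased, total) for each sex within each smoking group.
strata = {
    "smokers": {"male": (29, 160), "female": (23, 50)},
    "nonsmokers": {"male": (4, 40), "female": (60, 200)},
}


def crude_risk(group):
    """Crude risk as a sex-mix-weighted average of sex-specific risks."""
    n_group = sum(total for _, total in group.values())
    return sum((total / n_group) * (cases / total)
               for cases, total in group.values())


def share_male(group):
    """Fraction of the group that is male."""
    n_group = sum(total for _, total in group.values())
    return group["male"][1] / n_group


for name, group in strata.items():
    print(f"{name:>10}: {share_male(group):.0%} male, "
          f"crude risk = {crude_risk(group):.3f}")

crude_rr = crude_risk(strata["smokers"]) / crude_risk(strata["nonsmokers"])
print(f"crude RR = {crude_rr:.2f}")
```

Smokers are mostly male (the lower-risk sex) and nonsmokers are mostly female, so the crude risk among smokers is pulled down even though smoking raises risk within each sex.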
Simpson’s Paradox and Confounding—1
► The nature of an association can change (and even reverse direction) or disappear when
data from several groups are combined to form a single group
► An association between an exposure $X$ and an outcome $Y$ can be confounded by another
lurking (hidden) variable $Z$ (or variables $Z_1, Z_2, \ldots$)
Simpson’s Paradox and Confounding—2
► A confounder $Z$ (or set of confounders $Z_1, \ldots, Z_p$) distorts the true relation between $X$
and $Y$
► This can happen if $Z$ is related both to $X$ and to $Y$
Arm Circumference, Height, and Weight—1
► An observational study to estimate the association between arm circumference and height in
Nepali children (we’ve used these data before, of course)
► 150 randomly selected subjects, ages 0–12 months, had arm circumference, weight,
and height measured
► This study is observational—it is not possible to randomize subjects to height groups!
► The data
► Arm circumference range: 7.3–15.6 cm
► Height range: 40.9–73.3 cm
► Weight range: 1.6–9.9 kg
Arm Circumference, Height, and Weight—2
► Scatterplot with regression line, $\hat{y} = 2.7 + 0.16x_1$
Arm Circumference, Height, and Weight—3
► Perhaps not surprisingly, weight is associated with both arm circumference (AC) and
height
Arm Circumference, Height and Weight-4
► Scatterplot: Arm circumference by height, after adjusting for weight
“Batch Effects” in Lab-based Analyses
► Lab-based results can be influenced by the technician, the laboratory used, the time of
day, the temperature in the lab, etc.
► If the goal of a study is to ascertain differences in lab measures between groups (for
example, diseased and non-diseased), and the group is associated with at least some of
the above characteristics, then there can be confounding
Summary
► In non-randomized studies, outcome/exposure relationships of interest may be
confounded by other variables
► In such a situation, the relationship between the outcome and exposure differs after
taking into account the confounder(s) of note
► In order to confound an outcome/exposure relationship, a variable must be related to
both the outcome and exposure
Adjusted Estimates: Presentation, Interpretation, and Utility for Assessing Confounding
Learning Objectives
► Understand how to interpret estimates of association that have been adjusted to control
for a confounder
► Compare/contrast the comparisons being made by unadjusted and adjusted association
estimates
Adjustment
► Adjustment is a method for making comparable comparisons between groups in the
presence of a confounder/confounding variables
► We will discuss the basics of the mechanics behind adjustment in the next lecture section
Fictitious Example—1
► Recall the results from the following (fictitious) study:
► This study was done to investigate the association between smoking and a certain
disease in male and female adults
► 210 smokers and 240 nonsmokers were recruited for the study
Diagnosis Smokers Nonsmokers Total
Disease 52 64 116
No disease 158 176 334
Total 210 240 450
► Here, $\widehat{RR}_{S \text{ to } NS} = \dfrac{\hat{p}_{S}}{\hat{p}_{NS}} = \dfrac{52/210}{64/240} \approx 0.93$
Fictitious Example—2
► This relative risk is being influenced by the difference in sex distributions among smokers
and nonsmokers
► This relative risk compares all smokers to all nonsmokers in the sample without taking any
other factors into account: this is called the unadjusted or crude estimated association
between disease and smoking
Fictitious Example—3
► Adjustment provides a mechanism for estimating an outcome/exposure relationship after
removing the potential distortion or negation that comes from a confounder or multiple
confounders
► In the fictitious example, for instance, the relationship between disease and smoking can
be adjusted for sex
Fictitious Example—4
► Frequently, the presentation of results from non-randomized studies will include a table of
unadjusted and adjusted measures of association
► Example: table of unadjusted and sex-adjusted relative risks from this fictitious example
Table 1: Relative risks of disease (and 95% CIs)
Participation in smoking Unadjusted Adjusted1
Nonsmokers ref ref
Smokers 0.93 (0.68, 1.27) 1.57 (1.12, 2.20)
1 Adjusted for sex
Fictitious Example—5
► Unadjusted estimated relative risk, 0.93
► This compares the risk of disease for all smokers to all nonsmokers in the
sample, regardless of sex or any other characteristic, and, hence,
estimates the comparison of all smokers to all nonsmokers in the population sampled
► Adjusted estimated relative risk, 1.57
► This compares the risk of disease for smokers to nonsmokers of the same sex in the
sample and, hence, estimates the comparison of smokers to nonsmokers of the same
sex in the population sampled: male smokers to male nonsmokers and female smokers
to female nonsmokers
Fictitious Example—6
► The unadjusted and adjusted associations can be compared both numerically and
qualitatively to assess confounding by (at least some of) the adjustors
Table 1: Relative risks of disease (and 95% CIs)
Participation in smoking Unadjusted Adjusted1
Nonsmokers ref ref
Smokers 0.93 (0.68, 1.27) 1.57 (1.12, 2.20)
1 Adjusted for sex
Arm Circumference, Height, and Weight—1
► An observational study to estimate the association between arm circumference and height in
Nepali children (we’ve used these data before, of course)
► 150 randomly selected subjects, ages 0–12 months, had arm circumference, weight,
and height measured
► This study is observational—it is not possible to randomize subjects to height groups!
► The data
► Arm circumference range: 7.3–15.6 cm
► Height range: 40.9–73.3 cm
► Weight range: 1.6–9.9 kg
Arm Circumference, Height, and Weight—2
► The unadjusted and adjusted associations can be compared both numerically and
qualitatively to assess confounding by (at least some of) the adjustors
Table 1: Regression slopes (and 95% CIs) from models with AC as outcome ($y = AC$)
Physical characteristic Unadjusted Adjusted1
Height 0.16 (0.13, 0.19) −0.16 (−0.21, −0.11)
Weight 0.80 (0.72, 0.88) 1.40 (1.21, 1.59)
Arm Circumference, Height, and Weight—3
► Unadjusted linear regression slope estimate for height, $\hat{\beta}_{\text{height}} = 0.16$
► This estimates the average difference in arm circumference (cm) between two groups
of children who differ by one centimeter in height
► The average change in arm circumference (cm) per one-centimeter increase in height
► Adjusted linear regression slope estimate for height, $\hat{\beta}^{*}_{\text{height}} = -0.16$
► This estimates the average difference in arm circumference (cm) between two groups
of children who differ by one centimeter in height and are the same weight
► The average change in arm circumference (cm) per one-centimeter increase in height
adjusted for weight
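The sign change between the unadjusted and weight-adjusted height slopes can be illustrated with simulated data. The sketch below does not use the Nepali data: the sample size, coefficients, and noise levels are hypothetical values chosen only so that heavier children are taller and have larger arms. It obtains the adjusted slope by removing the part of both AC and height that weight explains and then regressing one set of residuals on the other, which gives the same height coefficient as a multiple regression of AC on height and weight.

```python
import random

random.seed(0)

# Simulated (hypothetical) children; NOT the study data.
n = 1000
weight = [random.uniform(1.6, 9.9) for _ in range(n)]
height = [40 + 3.5 * w + random.gauss(0, 2) for w in weight]
ac = [12 + 1.4 * w - 0.16 * h + random.gauss(0, 0.5)
      for w, h in zip(weight, height)]


def mean(values):
    return sum(values) / len(values)


def slope_intercept(x, y):
    """Least-squares slope and intercept for a simple regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return b, my - b * mx


def residuals(x, y):
    """Residuals of y after a simple regression on x."""
    b, a = slope_intercept(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Unadjusted: AC on height, ignoring weight.
unadjusted_slope, _ = slope_intercept(height, ac)

# Adjusted: relate what is left of AC and height once weight is removed.
adjusted_slope, _ = slope_intercept(residuals(weight, height),
                                    residuals(weight, ac))

print(f"unadjusted slope: {unadjusted_slope:+.2f}")
print(f"adjusted slope:   {adjusted_slope:+.2f}")
```

In the simulated data the unadjusted slope is positive because taller children are also heavier, while the slope among children of the same weight is negative, mirroring the qualitative pattern in the table.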
Arm Circumference, Height, and Weight—4
► The unadjusted and adjusted associations can be compared both numerically and
qualitatively to assess confounding by (at least some of) the adjustors
Table 1: Regression slopes (and 95% CIs) from models with AC as outcome ($y = AC$)
Physical characteristic Unadjusted Adjusted1
Height 0.16 (0.13, 0.19) −0.16 (−0.21, −0.11)
Weight 0.80 (0.72, 0.88) 1.40 (1.21, 1.59)
Summary
► Adjustment is a method for making comparable comparisons between groups in the
presence of a confounder/confounding variables
► The group comparisons made by adjusted associations are more specific than those made
by unadjusted (crude) associations
► Contrasting crude and adjusted association estimates is useful for identifying confounding
Adjusted Estimates: The General Idea Behind the Computations
Learning Objectives
► Gain some conceptual insight into how adjusted estimates are computed
Confounding (Lurking Variable)
► Consider results from the following (fictitious) study:
► This study was done to investigate the association between smoking and a certain
disease in male and female adults
► 210 smokers and 240 nonsmokers were recruited for the study
Diagnosis Smokers Nonsmokers Total
Disease 52 64 116
No disease 158 176 334
Total 210 240 450
► Here, $\widehat{RR}_{S \text{ to } NS} = \dfrac{\hat{p}_{S}}{\hat{p}_{NS}} = \dfrac{52/210}{64/240} \approx 0.93$
Smoking, Disease, and Sex: A Recap
► The overall (sometimes called crude or unadjusted) relationship (RR) between smoking and
disease was nearly 1 (risk difference nearly 0)
$\widehat{RR}_{S \text{ to } NS} \approx 0.93$; $\widehat{RD}_{S \text{ to } NS} = \hat{p}_{S} - \hat{p}_{NS} = -0.02$
► The sex-specific results showed similar positive associations between smoking and disease
Males: $\widehat{RR}_{M:S \text{ to } M:NS} \approx 1.8$; $\widehat{RD}_{M:S \text{ to } M:NS} = \hat{p}_{M:S} - \hat{p}_{M:NS} = 0.08$
Females: $\widehat{RR}_{F:S \text{ to } F:NS} \approx 1.5$; $\widehat{RD}_{F:S \text{ to } F:NS} = \hat{p}_{F:S} - \hat{p}_{F:NS} = 0.16$
Computing an Adjusted Estimate, Conceptually—1
► Stratify when the confounder $Z$ is categorical
► Compute the association between the outcome and the exposure separately for each
level (stratum) of $Z$
► In this fictitious example, separate sex-specific estimates of the disease/smoking
relationship for males and females
► Take a weighted average of the stratum-specific estimates
Computing an Adjusted Estimate, Conceptually—2
► For example, to get a sex-adjusted relative risk for the smoking/disease relationship, we
could weight the sex-specific relative risks by the numbers of males and females, i.e.:
$\widehat{RR}_{S \text{ to } NS,\ \text{sex adj}} = \dfrac{n_{\text{males}} \times \widehat{RR}_{M:S \text{ to } M:NS} + n_{\text{females}} \times \widehat{RR}_{F:S \text{ to } F:NS}}{n_{\text{males}} + n_{\text{females}}}$
► So, for the given results:
$\widehat{RR}_{S \text{ to } NS,\ \text{sex adj}} = \dfrac{200 \times 1.8 + 250 \times 1.5}{200 + 250} \approx 1.6$
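The sample-size-weighted average above takes only a few lines to reproduce. This is a minimal sketch of that conceptual calculation, using the rounded sex-specific relative risks from the slides; the helper name `weighted_average` is illustrative.

```python
def weighted_average(estimates, weights):
    """Weighted average of stratum-specific estimates."""
    total_weight = sum(weights)
    return sum(w * e for e, w in zip(estimates, weights)) / total_weight


# Rounded sex-specific relative risks and stratum sizes from the slides.
rr_by_sex = {"male": 1.8, "female": 1.5}
n_by_sex = {"male": 200, "female": 250}

sexes = list(rr_by_sex)
rr_sex_adjusted = weighted_average(
    [rr_by_sex[s] for s in sexes],
    [n_by_sex[s] for s in sexes],
)
print(f"sex-adjusted RR (size-weighted): {rr_sex_adjusted:.2f}")
```

Because females make up the larger stratum, the result lies closer to the female relative risk than to the male one.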
Computing an Adjusted Estimate, Conceptually—3
► There are better ways than this to take such a weighted average (for example, doing the
computation on the natural log ($\ln$) scale and then weighting by standard error), but this
just illustrates the concept
► Confidence intervals can be computed for these adjusted measures of association
► Multiple regression (in this case, logistic) will be a very useful tool for performing
adjustment
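As one possible reading of the log-scale weighting mentioned above, the sketch below pools the sex-specific relative risks on the ln scale using inverse-variance weights (one over the squared standard error of each ln RR). The choice of inverse-variance weighting and the large-sample variance formula for ln RR are assumptions made for this illustration; the slides do not state which method produced the adjusted estimate reported earlier.

```python
import math


def log_rr_and_variance(cases_exp, n_exp, cases_unexp, n_unexp):
    """ln(RR) and its approximate large-sample variance for one stratum."""
    log_rr = math.log((cases_exp / n_exp) / (cases_unexp / n_unexp))
    variance = 1 / cases_exp - 1 / n_exp + 1 / cases_unexp - 1 / n_unexp
    return log_rr, variance


# (diseased smokers, all smokers, diseased nonsmokers, all nonsmokers)
strata = {"males": (29, 160, 4, 40), "females": (23, 50, 60, 200)}

weights, weighted_logs = [], []
for counts in strata.values():
    log_rr, variance = log_rr_and_variance(*counts)
    weights.append(1 / variance)            # inverse-variance weight
    weighted_logs.append(log_rr / variance)

pooled_log_rr = sum(weighted_logs) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

rr_adjusted = math.exp(pooled_log_rr)
ci_low = math.exp(pooled_log_rr - 1.96 * pooled_se)
ci_high = math.exp(pooled_log_rr + 1.96 * pooled_se)
print(f"adjusted RR = {rr_adjusted:.2f} (95% CI {ci_low:.2f}, {ci_high:.2f})")
```

Under these assumptions the pooled estimate and interval land close to the sex-adjusted relative risk and confidence interval shown in the earlier table, and the female stratum, which is estimated more precisely, receives most of the weight.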
Arm Circumference, Height, and Weight—1
► (Unadjusted) scatterplot with regression line, $\hat{y} = 2.7 + 0.16x_1$
Arm Circumference, Height, and Weight—2
► Scatterplot: Arm circumference by height, after adjusting for weight
Arm Circumference, Height, and Weight—3
► How to adjust for a continuous measure (in this case, weight)?
► Conceptually, the algorithm (multiple regression) breaks the data into individual weight groups
► In each specific weight stratum, a simple linear regression is fit to the AC/height data for
the stratum
► The overall weight-adjusted association between AC and height is a weighted average
of the AC/height slopes for each of the individual weight strata
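The stratify-then-average idea for a continuous adjustor can be sketched on simulated data. Everything in this example is hypothetical: the data are simulated rather than taken from the study, the 0.5 kg band width is an arbitrary choice, and weighting each stratum by its size is just one simple option.

```python
import random
from collections import defaultdict

random.seed(1)

# Simulated (hypothetical) children; NOT the study data.
n = 1000
weight = [random.uniform(1.6, 9.9) for _ in range(n)]
height = [40 + 3.5 * w + random.gauss(0, 2) for w in weight]
ac = [12 + 1.4 * w - 0.16 * h + random.gauss(0, 0.5)
      for w, h in zip(weight, height)]


def slope(x, y):
    """Least-squares slope from a simple linear regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# Step 1: break the data into narrow weight bands (strata).
band_width = 0.5  # kg, arbitrary
bands = defaultdict(list)
for w, h, a in zip(weight, height, ac):
    bands[int(w // band_width)].append((h, a))

# Step 2: fit AC on height within each band that has enough children.
slopes, sizes = [], []
for members in bands.values():
    if len(members) < 3:
        continue
    hs, acs = zip(*members)
    slopes.append(slope(hs, acs))
    sizes.append(len(members))

# Step 3: average the band-specific slopes, weighting by band size.
stratified_slope = sum(s * k for s, k in zip(slopes, sizes)) / sum(sizes)
crude_slope = slope(height, ac)

print(f"crude slope:      {crude_slope:+.2f}")
print(f"stratified slope: {stratified_slope:+.2f}")
```

Because children within a band still differ slightly in weight, the stratified slope only approximates what multiple regression would give, which is one reason stratification is hard to operationalize for a continuous variable.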
Summary
► The adjusted association between $Y$ and $X$, adjusted for a single potential confounder $Z$,
can be estimated by:
► Stratifying on $Z$ (hard to operationalize if $Z$ is continuous)
► Estimating the $Y$/$X$ relationship for each stratum of $Z$
► Taking a weighted average of all $Z$ stratum-specific $Y$/$X$ associations
► The idea can be generalized to estimating the adjusted association between $Y$ and $X$, adjusted
for multiple potential confounders $Z_1, Z_2, \ldots, Z_c$
► Multiple regression methods will make the adjustment process easy and straightforward
Additional Examples
Physician Salaries and Sex of the Physician—1
► Article abstract
Source: Jagsi, R., et al. (2012). Gender differences in the salaries of physician researchers. JAMA, 307(22), 2410–2417.
Physician Salaries and Sex of the Physician—2
► Unadjusted linear regression slope estimate for sex (1=M, 0=F)
$\hat{\beta}_{\text{sex}} = \$32{,}764$
► Adjusted linear regression slope estimate for sex (1=M, 0=F)
$\hat{\beta}^{*}_{\text{sex}} = \$13{,}399$
► (*after adjustment for specialty, academic rank, leadership positions, publications, and
research time)
Example: Clinical Trial, PBC: Incidence Rate Ratio—1
► Crude (unadjusted) incidence rate ratio:
$\widehat{IRR}_{\text{DPCA to placebo}} = \dfrac{\widehat{IR}_{\text{DPCA}}}{\widehat{IR}_{\text{placebo}}} = 1.06$, with 95% CI (0.75, 1.50)
► Interpretations
► The risk of death in the DPCA group (in the study follow-up period) is 1.06 times the
risk in the placebo group
► Subjects in the DPCA group had 6% higher risk of death in the follow-up period when
compared to the subjects in the placebo group
► This comparison is not statistically significant: the 95% CI includes 1
Source: Dickson, E., et al. (1985). Trial of penicillamine in advanced primary biliary cirrhosis. N Engl J Med, 312(16), 1011–1015.
Example: Clinical Trial, PBC: Incidence Rate Ratio—2
► Recall, patients (𝑛𝑛 = 312 total) were randomized to the DPCA or placebo group
► In a moment, the adjusted IRR, adjusted for sex and baseline bilirubin, will be presented
► How do you expect this to compare in value to the unadjusted estimate from the
previous slide? Why?
Source: Dickson, E., et al. (1985). Trial of penicillamine in advanced primary biliary cirrhosis. N Engl J Med, 312(16), 1011–1015.
Example: Clinical Trial, PBC: Incidence Rate Ratio—3
► The adjusted IRR is $\widehat{IRR}^{*}_{\text{DPCA to placebo}} = 1.01$, with 95% CI (0.70, 1.43)
► Interpretations
► The risk of death in the DPCA group (in the study follow-up period) is 1.01 times the
risk in the placebo group after adjusting for sex and baseline bilirubin
► Subjects in the DPCA group had 1% higher risk of death in the follow-up period when
compared to the subjects in the placebo group, among subjects of the same sex with
the same baseline bilirubin levels
Example: Clinical Trial, PBC: Incidence Rate Ratio—4
► Why are unadjusted and adjusted IRRs so similar?