Causal Notes
(with corrections)
Qingyuan Zhao
Contents

2 Randomised experiments
  2.1 Assignment mechanism
  2.2 Potential outcomes/Counterfactuals
  2.3 Randomisation distribution of causal effect estimator
  2.4 Randomisation test of sharp null hypothesis
  2.5 Super-population inference and regression adjustment
  2.6 Comparison of different modes of inference
3 Path analysis
  3.1 Graph terminology
  3.2 Linear structural equation models
  3.3 Path analysis
  3.4 Correlation and causation
  3.5 Latent variables and identifiability
  3.6 Factor models and measurement models
  3.7 Estimation in linear SEMs
  3.8 Strengths and weaknesses of linear SEMs
4 Graphical models
  4.1 Markov properties for undirected graphs
  4.2 Markov properties for directed graphs
  4.3 Structure discovery
  4.4 Discussion: Using DAGs to represent causality
  4.A Graphical proofs
  5.2 Markov properties for counterfactuals
  5.3 From counterfactual to factual
  5.4 Causal identification
  5.5 Proofs (non-examinable)
8 Sensitivity analysis
  8.1 A roadmap
  8.2 Rosenbaum's sensitivity analysis
  8.3 Sensitivity analysis in semiparametric inference
Chapter 1
• Smoking.
Such results suggest that an error has been made of an old kind, in arguing
from correlation to causation.... Such differences in genetic make-up between
those classes would naturally be associated with differences of disease incidence
without the disease being causally connected with smoking.
Fisher then presented evidence of a gene that is associated with both smoking and
lung cancer.
We now know Fisher was wrong. His criticism was logical, but the association between
smoking and lung cancer is simply too strong to be explained away by different genetic
make-ups.4 Some believe that his views may have been influenced by personal and
professional conflicts, by his work as a consultant to the tobacco industry, and by the
fact that he was himself a smoker.
[Chart (Varsity/everviz): "Record state school intake but independent schools still over-represented. Cambridge welcomes 68.7% of its 2019 intake from maintained schools but falls far below a national average of 93% of state-educated students." The chart shows the percentage of the first-year cohort of home students from independent schools versus state schools, 2015–2019.]
Some interesting quotes: “Considering 93% of pupils in England are taught in state
schools, a figure of 68.7% means that state school students are still vastly under-
represented in the University.... Cambridge’s acceptance of state school applicants
continues to be amongst the lowest in the UK, with 90% of university students on
average hailing from state schools across the country. ”
Does this mean Cambridge’s admission is biased against state schools? Not neces-
sarily. For example, applicants from independent schools may have better A-level
results.
Causal inference can be used to understand fairness in decisions made by humans
and computer algorithms.6
[Chart (Varsity): "Racial disparities persist in acceptance rates. Although the successful applications ratio for Black students moved up to 15.1% from 13%, this still falls a way below the average of 21.4% across all groups." The chart shows acceptance rates by group: Chinese, Arab, White, Mixed Race, Bangladeshi, Indian, Pakistani, Asian (Other), Other/Unknown, and the overall average.]
[Chart (Varsity/everviz): "Continued decline of EU applicants. The percentage of EU applicants declines to 12.5% as Chinese applications increase 33% and the nation sees more acceptances than Northern Ireland, Wales and Scotland combined." The chart plots the percentage of EU applicants, 2012–2018.]
[Chart (Varsity): "The gender divide: offer holder discrepancies between Sciences and Humanities. Computer science continues to rank amongst the lowest in terms of female intake at 20.4%." The chart shows the male/female split of offer holders by subject: Education, Veterinary Medicine, Land Economy, Natural Science, Economics, Maths, Engineering, Computer Science.]
This paradox was first discovered by Pearson (1899), who offered a causal explanation:
“To those who persist on looking upon all correlation as cause and effect, the fact
that correlation can be produced between two quite uncorrelated characters A and
B by taking an artificial mixture of the two closely allied races, must come as rather
a shock.” 10
(iii) Using graphs:
Some remarks:
• Applications are often related to humans (biology, public health, economics, political
science...). Why? These are open systems with external interactions, where
manipulation in experiments is difficult or nearly impossible.
• State of the art: the three languages are essentially equivalent, each being advantageous
for different purposes.
– Example: Investigation of aircraft crash; Cigarette smoking causes lung
cancer.
• Question: What type of inference is mathematical induction?
• The boundary between induction and abduction is not always clear.
• Very very roughly speaking, deduction ≈ mathematics; induction ≈ statistics;
abduction ≈ causal inference.
(vi) All models are wrong, but some are useful (G. Box).
(viii) Specificity.
• One of Hill’s 9 criteria for causality11 : “If as here, the association is limited to
specific workers and to particular sites and types of disease and there is no
association between the work and other modes of dying, then clearly that is a
strong argument in favor of causation.”
• Original definition now considered weak or obsolete. Counterexample: smoking.
• In Hill’s era, exposure = an occupational setting or a residential location
(proxies for true exposures).
• Nowadays, exposure is much more precise (for example, a specific gene expres-
sion).
• Specificity is still useful. Examples: Instrumental variables, negative controls,
sparsity.
Notes
1 Doll, R., & Hill, A. B. (1950). Smoking and carcinoma of the lung. BMJ, 2(4682), 739–748. doi:10.1136/bmj.2.4682.739.
2 Hammond, E. C. (1964). Smoking in relation to mortality and morbidity. Findings in first thirty-four months of follow-up in a prospective study started in 1959. Journal of the National Cancer Institute, 32(5), 1161–1188.
3 Fisher, R. A. (1958). Cancer and smoking. Nature, 182(4635), 596. doi:10.1038/182596a0.
4 This is pointed out by Cornfield, J., Haenszel, W., Hammond, E. C., Lilienfeld, A. M., Shimkin, M. B., & Wynder, E. L. (1959). Smoking and lung cancer: Recent evidence and a discussion of some questions. Journal of the National Cancer Institute, 22(1), 173–203, which is widely regarded as the first sensitivity analysis in observational studies.
5 Vides, G., & Powell, J. (2020, June 16). The eight charts that explain the university's 2019-2020 undergraduate admissions data. Varsity.
6 See e.g. Kusner, M. J., & Loftus, J. R. (2020). The long road to fairer algorithms. Nature, 578(7793), 34–36. doi:10.1038/d41586-020-00274-3.
7 Pearl, J. (1999). Probabilities of causation: Three counterfactual interpretations and their identification. Synthese, 121, 93–149. doi:10.1023/a:1005233831499.
8 Dawid, A. P., Musio, M., & Murtas, R. (2017). The probability of causation. Law, Probability and Risk, 16(4), 163–179. doi:10.1093/lpr/mgx012.
9 https://en.wikipedia.org/wiki/Simpson’s_paradox#UC_Berkeley_gender_bias
10 See Sec. 6.1 of Pearl, J. (2009). Causality (2nd ed.). Cambridge University Press.
11 Hill, A. B. (2015). The environment and disease: Association or causation? Journal of the Royal Society of Medicine, 108(1), 32–37. doi:10.1177/0141076814562718.
12 Cochran, W. G. (1965). The planning of observational studies of human populations. Journal of the Royal Statistical Society. Series A (General), 128(2), 234–266.
Chapter 2
Randomised experiments
• A randomised experiment (or randomised controlled trial) is the gold standard for
establishing causality.
• This chapter covers basic concepts and techniques in designing and analysing a
randomised experiment.
• For the i-th unit, we observe some covariates Xi prior to treatment assignment.
2.1 Example (Bernoulli trial). The treatment assignments are independent and the
probability of being treated is a constant 0 < π < 1. Then

π(a[n] | x[n]) = ∏_{i=1}^n π^{a_i} (1 − π)^{1 − a_i}.
2.2 Example (Sample without replacement). The treatment assignments are "completely
randomised" with the only restriction that the number of treated units is 0 < n1 < n. Then

π(a[n] | x[n]) = (n choose n1)^{−1} if ∑_{i=1}^n a_i = n1, and 0 otherwise.
2.3 Example (Bernoulli trial with covariates). Bernoulli trial with π replaced by a
function 0 < π(x) < 1:

π(a[n] | x[n]) = ∏_{i=1}^n π(x_i)^{a_i} {1 − π(x_i)}^{1 − a_i}.
2.4 Example (Pairwise experiment). Suppose n is even. The units are divided into
n/2 pairs based on the covariates. Within each pair, one unit is randomly assigned to
treatment. Let Bi = Bi(x[n]) be the pair that unit i is assigned to. Then

π(a[n] | x[n]) = 2^{−n/2} if ∑_{i=1}^n a_i · I(Bi = j) = 1 for all j = 1, ..., n/2, and 0 otherwise.
2.5 Exercise (Stratified experiment). Generalise the pairwise experiment to allow more
than 2 units in each group. Suppose there are m groups and group j has nj units and
n1j treated units. What is the assignment mechanism?
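The following Python sketch simulates one draw from each of the assignment mechanisms in Examples 2.1–2.4; the logistic propensity function, the pairing of units by sorted covariate values, and all variable names are illustrative assumptions, not part of the notes.

import numpy as np

rng = np.random.default_rng(0)
n = 10
x = rng.normal(size=n)                     # one covariate per unit

# Example 2.1: Bernoulli trial with a constant propensity pi
pi = 0.5
a_bernoulli = rng.binomial(1, pi, size=n)

# Example 2.2: completely randomised experiment with n1 treated units
n1 = 5
a_complete = np.zeros(n, dtype=int)
a_complete[rng.choice(n, size=n1, replace=False)] = 1

# Example 2.3: Bernoulli trial with a covariate-dependent propensity pi(x)
pi_x = 1 / (1 + np.exp(-x))                # an assumed (logistic) propensity
a_covariate = rng.binomial(1, pi_x)

# Example 2.4: pairwise experiment; pair units with adjacent covariate values
order = np.argsort(x)
a_pairwise = np.zeros(n, dtype=int)
for pair in order.reshape(n // 2, 2):
    a_pairwise[rng.choice(pair)] = 1       # treat one unit per pair at random

print(a_bernoulli, a_complete, a_covariate, a_pairwise, sep="\n")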
After treatment assignment, we follow up the units and measure an outcome variable Yi
for unit i.
This approach allows us to apply familiar statistical methodologies, but it has several
limitations:
(i) Causal inference is only implicit and informal, as it seems that any difference can
only be reasonably attributed to the different treatment assignments.
(iii) Cannot distinguish internal validity from external validity.
Internal validity: inference for the finite population consisting of the n units.
External validity: inference for the super-population from which the n units are
sampled.
Potential outcomes
The potential outcome model avoids the above problems and provides a flexible basis for
causal inference. It was first introduced by Neyman in his 1923 Master's thesis3 to study
randomised experiments and later brought to observational studies by Rubin.4
This approach posits a potential outcome (or counterfactual ), Yi (a[n] ), for unit i under
treatment assignment a[n] . The potential outcomes (or counterfactuals) are linked to the
observed outcome (or factuals) via the following assumption.
This should not be confused with the consistency of statistical estimators, which says
the estimator converges to its target parameter as the sample size grows.
Question: How many potential outcomes are there in an experiment with n units?
Each unit has one potential outcome for every assignment vector, so there are |A^n| = 2^n per unit.
To reduce the unknowns in the problem, a common assumption is
2.7 Assumption (No interference). Yi(a[n]) = Yi(ai) for all i ∈ [n] and a[n] ∈ A^n.
i    Yi(0)   Yi(1)   Ai   Yi
1    ?       −3.7    1    −3.7
2    2.3     ?       0    2.3
3    ?       7.4     1    7.4
4    0.8     ?       0    0.8
...  ...     ...     ...  ...
2.9 Remark. The latter implicitly assumes that the n units are sampled from a super-
population, so Yi (0) and Yi (1) follow an unknown bivariate probability distribution.
The conditioning on X[n] can be removed if X is not used in the treatment assignment
(such as in Examples 2.1 and 2.2).
2.11 Remark (Fatalism). To better understand Assumption 2.10, it is often helpful to
view Y[n](0) and Y[n](1) as determined prior to treatment assignment. The randomness
of A[n] given X[n] (e.g. picking balls from an urn or a computer pseudo-random number
generator) should then be independent of the potential outcomes. From a statistical
point of view, this fatalist interpretation is unnecessary. One may regard the statistical
inference as being conditional on the potential outcomes.
Note that Assumption 2.10 is different from A[n] ⊥⊥ Y[n] | X[n], as Yi = Yi(Ai)
generally depends on Ai.
Recall that we are using X, A, and Y to refer to a generic Xi , Ai , and Yi when they
are iid.
2.12 Theorem (Causal identification in randomised experiments). Consider a
Bernoulli trial with covariates (Example 2.3), where {Xi, Ai, Yi(a), a ∈ A} are iid.
Suppose the above assumptions are given and

0 < π(x) < 1 for all x.   (2.1)

Then

Y(a) | X = x  =d  Y | A = a, X = x,   (2.2)

where =d means the random variables have the same distribution.

Proof. For any y,

P(Y(a) ≤ y | X = x) = P(Y(a) ≤ y | A = a, X = x) = P(Y ≤ y | A = a, X = x),

where the first equality uses Assumption 2.10 and the second uses Assumption 2.6.
2.13 Remark. Equation (2.1) is called the positivity assumption. It is also called the
overlap assumption because (2.1) implies that X | A = a has the same support for all a.
It makes sure the right hand side of (2.2) is well defined.
2.14 Corollary. Under the conditions of Theorem 2.12,

E[Y(a)] = E{E[Y | A = a, X]},   (2.3)

and, if additionally A ⊥⊥ X,

E[Y(a)] = E[Y | A = a].   (2.4)

Proof. Equation (2.3) follows from taking the expectation of (2.2) and then averaging
over X. For (2.4), we prove it in the case of discrete X. Since A ⊥⊥ X, we have
P(X = x) = P(X = x | A = 0) = P(X = x | A = 1). By using Theorem 2.12 and the
law of total expectation,

E[Y(1)] = ∑_{x∈X} E[Y | A = 1, X = x] P(X = x)
        = ∑_{x∈X} E[Y | A = 1, X = x] P(X = x | A = 1)
        = E[Y | A = 1].
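As a quick illustration of (2.3) and (2.4), the sketch below estimates E[Y(1)] both by stratifying on a discrete covariate and by the marginal treated mean; the data-generating process is an arbitrary assumption for demonstration.

import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.integers(0, 2, size=n)             # a binary covariate
a = rng.binomial(1, 0.5, size=n)           # A independent of X
y1 = 2.0 + x + rng.normal(size=n)          # potential outcome Y(1)
y0 = rng.normal(size=n)                    # potential outcome Y(0)
y = np.where(a == 1, y1, y0)               # consistency (Assumption 2.6)

# (2.3): E[Y(1)] = sum_x E[Y | A = 1, X = x] P(X = x)
est_23 = sum(y[(a == 1) & (x == v)].mean() * (x == v).mean() for v in (0, 1))
# (2.4): E[Y(1)] = E[Y | A = 1], valid here because A is independent of X
est_24 = y[a == 1].mean()

print(est_23, est_24, y1.mean())           # all approximately 2.5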
2.15 Remark. Results like (2.2), (2.3), (2.4) are called causal identification, because they
equate a counterfactual quantity on the left-hand side with a factual (and hence estimable)
quantity on the right-hand side.
Denote Y(a) = (Y1(a), Y2(a), ..., Yn(a))ᵀ for a ∈ A. Neyman studied the conditional
distribution of β̂ given the potential outcomes Y(0), Y(1). We may refer to this as
the randomisation distribution, because the only randomness left in β̂ comes from the
randomisation of the treatment A[n].
2.16 Theorem. Let Assumptions 2.6, 2.7 and 2.10 be given and suppose the treatment
assignments Ai are sampled without replacement according to Example 2.2. Then

E[β̂ | Y(0), Y(1)] = SATE = (1/n) ∑_{i=1}^n {Yi(1) − Yi(0)},   (2.6)

Var(β̂ | Y(0), Y(1)) = S0²/n0 + S1²/n1 − S01²/n,   (2.7)

where n0 = n − n1, Sa² = ∑_{i=1}^n (Yi(a) − Ȳ(a))²/(n − 1) and Ȳ(a) = ∑_{i=1}^n Yi(a)/n for
a = 0, 1, and S01² = ∑_{i=1}^n (Yi(1) − Yi(0) − SATE)²/(n − 1).
The expectation and variance are computed under the randomisation distribution
of β̂, in which the potential outcomes Y(1) and Y(0) are treated as fixed and
the randomness comes from the randomisation of A[n]. As a consequence, the right-hand
sides of (2.6) and (2.7) depend on the unobserved potential outcomes Y(1) and Y(0).
Proof of Equation (2.6). For simplicity of exposition, we omit the conditioning on Y(0), Y(1)
below. By using E[Ai] = n1/n, the consistency assumption and the linearity of expectations,

E[β̂] = E[ (1/n1) ∑_{i=1}^n Ai Yi − (1/n0) ∑_{i=1}^n (1 − Ai) Yi ]
     = E[ (1/n1) ∑_{i=1}^n Ai Yi(1) − (1/n0) ∑_{i=1}^n (1 − Ai) Yi(0) ]
     = (1/n1) ∑_{i=1}^n (n1/n) Yi(1) − (1/n0) ∑_{i=1}^n (n0/n) Yi(0)
     = Ȳ(1) − Ȳ(0).
2.17 Exercise. Prove (2.7). Hint: Let Yi*(a) = Yi(a) − Ȳ(a), a = 0, 1. Show that

Var(β̂ | Y(0), Y(1)) = E[ { ∑_{i=1}^n ( (Ai/n1) Yi*(1) − ((1 − Ai)/n0) Yi*(0) ) }² ].
Unfortunately, the variance (2.7) is non-estimable. Why is that? Notice that S01² is the sample variance of the individual
treatment effects and depends on the covariance of Yi(1) and Yi(0), which can never be
observed together (the "fundamental problem of causal inference").
Instead, it is common to estimate the variance (2.7) by Ŝ0²/n0 + Ŝ1²/n1, where

Ŝ1² = (1/(n1 − 1)) ∑_{i=1}^n Ai (Yi − Ȳ1)²,   Ŝ0² = (1/(n0 − 1)) ∑_{i=1}^n (1 − Ai)(Yi − Ȳ0)².
This is an unbiased estimator of S0²/n0 + S1²/n1 (the proof is similar to that of (2.6) and
is left as an exercise). Thus we get a conservative (on average) estimator of the variance
of β̂.
Distributional results are further needed to form confidence intervals. Central limit
theorems can be established by assuming that the potential outcomes in Y(0) and Y(1)
are not too volatile.6
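A small Monte Carlo sketch of these facts: under complete randomisation, the difference-in-means estimator is unbiased for the SATE, and the variance estimator above is conservative on average. The potential outcomes below are invented for illustration.

import numpy as np

rng = np.random.default_rng(2)
n, n1 = 50, 25
y0 = rng.normal(size=n)                        # fixed potential outcomes
y1 = y0 + 1 + rng.normal(scale=0.5, size=n)    # heterogeneous effects
sate = np.mean(y1 - y0)

betas, vhats = [], []
for _ in range(20_000):
    a = np.zeros(n, dtype=int)
    a[rng.choice(n, n1, replace=False)] = 1    # Example 2.2 assignment
    y = np.where(a == 1, y1, y0)
    betas.append(y[a == 1].mean() - y[a == 0].mean())
    vhats.append(y[a == 1].var(ddof=1) / n1 + y[a == 0].var(ddof=1) / (n - n1))

print(np.mean(betas) - sate)                   # approximately 0, as in (2.6)
print(np.mean(vhats) - np.var(betas))          # positive on average: conservative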
2.18 Remark. One drawback of Neyman’s randomisation inference is that it is difficult
to extend it to settings with covariates (unless the covariates are discrete). The main
obstacle is that the randomisation distribution necessarily depends on unobserved potential
outcomes.
Fisher7 appears to be the first to grasp fully the importance of randomisation for credible
causal inference.8
i    Yi(0)   Yi(1)   Ai   Yi
1    −3.7    −3.7    1    −3.7
2    2.3     2.3     0    2.3
3    7.4     7.4     1    7.4
4    0.8     0.8     0    0.8
Randomisation distribution
The key step is to derive the randomisation distribution of T . There are two ways to do
this:
(i) Consider the distribution of T1 (A[n] , X[n] , Y[n] (0)) given X[n] and Y[n] (0);
(ii) Consider the distribution of T2 (A[n] , X[n] , Y[n] (A[n] )) given X[n] , Y[n] (0), and
Y[n] (1);
In both cases, the randomness comes from the randomisation of A[n]. The first approach
tries to test the conditional independence A[n] ⊥⊥ Y[n](0) | X[n]. The second approach
tries to directly obtain the randomisation distribution of T(A[n], X[n], Y[n]) and bears a
resemblance to Neyman's inference.
It is easy to see that the two approaches are exactly the same if β = 0. Exercise 2.21
below shows that they are still equivalent if β ≠ 0. For more complex hypotheses, however,
one approach can be more convenient than the other.
Let F = (X[n], Y[n](0), Y[n](1)). The randomisation distributions in the two approaches
above are given by

F1(t) = P( T1(A[n], X[n], Y[n](0)) ≤ t | F )

and

F2(t) = P( T2(A[n], X[n], Y[n](A[n])) ≤ t | F ).

The one-sided p-value is the probability of observing the same or a more extreme test
statistic than the observed statistic T:

Pm = Fm(Tm), m = 1, 2.
2.19 Theorem. Under SUTVA (Assumptions 2.6 and 2.7) and H0, P(Pm ≤ α) ≤ α
for all 0 < α < 1 and m = 1, 2.
Proof. This follows from the property of the distribution function: if F(t) is the distribution
function of a random variable T, then F(T) stochastically dominates the uniform
distribution on [0, 1]. To show this, let F⁻¹(α) = sup{t | F(t) ≤ α}. By using the fact
that F(t) is non-decreasing and right-continuous (see Figure 2.1), one can show that
P(F(T) ≤ α) ≤ α.
2.20 Remark. The probability integral transform says that if T is a continuous random
variable and F (t) is its distribution function, then F (T ) is uniformly distributed on [0, 1].
However, we cannot directly use this well known result here because our T has a discrete
(conditional) distribution.
The conditional independence in Assumption 2.10 and H0 (so that there is no further randomness
in Y(1) after conditioning on Y(0)) allow us to replace the first term in the corresponding derivation.
[Figure 2.1: A hand-drawn illustration of the probability mass function and the distribution function F(t) of a discrete test statistic, the rejection region, and the definition F⁻¹(α) = sup{t | F(t) ≤ α}.]
Practical issues
Computing the p-value exactly via its definition (2.10) can be computationally intensive
because it requires summing over A^n. In practice, F(T) is often computed by Monte Carlo
simulation.
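A minimal sketch of such a Monte Carlo randomisation test, using the difference in means as the test statistic and re-drawing complete randomisations; the data and the choice of one-sided tail are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)

def randomisation_pvalue(y, a, n_draws=10_000):
    """Monte Carlo p-value under complete randomisation of the treatment."""
    n, n1 = len(y), int(a.sum())
    t_obs = y[a == 1].mean() - y[a == 0].mean()
    t_null = np.empty(n_draws)
    for b in range(n_draws):
        a_star = np.zeros(n, dtype=int)
        a_star[rng.choice(n, n1, replace=False)] = 1
        # under the sharp null, y is unchanged by reassigning the treatment
        t_null[b] = y[a_star == 1].mean() - y[a_star == 0].mean()
    return (1 + np.sum(t_null >= t_obs)) / (1 + n_draws)

a = np.repeat([1, 0], 10)
y = rng.normal(size=20) + 0.8 * a       # a true treatment effect of 0.8
print(randomisation_pvalue(y, a))       # small p-value: reject the sharp null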
In the example sheet, you will learn how to obtain an estimator of β (suppose H0
is true for some unknown β) by using the Hodges-Lehmann estimator.9 You will also
explore how to obtain a confidence interval for β.
It is easy to show that β̂1 is the difference-in-means estimator (2.5). The rationale for
β̂2 is that the adjustment tends to improve precision if X is correlated with Y . This is
known as the analysis of covariance (ANCOVA)10 . The third estimator further models
treatment effect heterogeneity through the interaction term AX.
The classical linear regression theory for these estimators assumes the regression models
are correct. Below we provide a modern analysis that allows model misspecification by
using the M-estimation theory. Our analysis is a compact version of previous results11 .
Let’s first write down the population version of the least squares problems:
By the law of large numbers, we expect β̂m to converge to βm, m = 1, 2, 3, as n → ∞. To
focus on the essential ideas, below we will omit the regularity conditions (for example,
those ensuring that these parameters exist and that the central limit theorems hold).
The population least squares problem for the third estimator has first-order conditions

E[Y − α3 − β3 A] = 0,
E[A(Y − α3 − β3 A)] = 0.

Following the same derivation, analogous equations also hold for the other estimators. By
cancelling α3 in the equations, we get β3 = β.
2.25 Remark. Notice that Lemma 2.23 does not rely on the correctness of the linear model.
Modern causal inference often tries to make minimal assumptions about the data and
avoid relying on specific statistical models (“all models are wrong, but some are useful”).
We will use a general result for least squares estimators to study the asymptotic
behaviour of β̂1, β̂2, and β̂3.12
2.26 Lemma. Suppose (Zi, Yi), i = 1, ..., n are iid and E[ZZᵀ] is positive definite. Let
θ = (E[ZZᵀ])⁻¹ E[ZY] be the population least squares parameter and

θ̂ = ( (1/n) ∑_{i=1}^n Zi Ziᵀ )⁻¹ ( (1/n) ∑_{i=1}^n Zi Yi )

be its sample counterpart. Then

√n (θ̂ − θ) →d N( 0, (E[ZZᵀ])⁻¹ E[ε² ZZᵀ] (E[ZZᵀ])⁻¹ ), where ε = Y − Zᵀθ,   (2.17)

as n → ∞.
Informal proof of (2.17). Notice that θ̂ is an empirical solution to the equation
E[ψ(θ; Z, Y)] = 0, where

ψ(θ; Z, Y) = Z (Y − Zᵀθ).   (2.18)

For a general function ψ, the Z-estimation theory shows that

√n (θ̂ − θ) →d N( 0, {E[∂ψ(θ)/∂θ]}⁻¹ E[ψ(θ) ψ(θ)ᵀ] {E[∂ψ(θ)/∂θ]}⁻ᵀ ).   (2.19)

By plugging in (2.18), we obtain (2.17). The asymptotic normality (2.19) follows from
the argument below. Using Taylor's expansion,

0 = (1/n) ∑_{i=1}^n ψ(θ̂; Zi, Yi)
  = (1/n) ∑_{i=1}^n { ψ(θ; Zi, Yi) + [∂ψ(θ; Zi, Yi)/∂θ] (θ̂ − θ) } + Rn.

By using θ̂ →p θ, it can be shown that the residual term Rn is asymptotically smaller
than the other two terms and can be ignored. Thus

√n (θ̂ − θ) ≈ −{E[∂ψ(θ; Z, Y)/∂θ]}⁻¹ · (1/√n) ∑_{i=1}^n ψ(θ; Zi, Yi).   (2.20)

The first factor on the right-hand side converges in probability to −{E[∂ψ(θ)/∂θ]}⁻¹. The second
factor converges in distribution to a normal random variable with variance E[ψ(θ)ψ(θ)ᵀ].
Using Slutsky's theorem, we arrive at (2.19).
2.28 Remark. The Z-estimation theory generalises the asymptotic theory for the maximum
likelihood estimator (MLE), where ψ is the score function (the gradient of the log-likelihood).
In that case, it can be shown that −E[∂ψ(θ)/∂θ] = E[ψ(θ)ψ(θ)ᵀ] is the Fisher information
matrix, which you may recognise from your undergraduate lectures (Part II Principles of
Statistics).
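To make Lemma 2.26 concrete, here is a sketch that computes the least squares Z-estimator and its sandwich variance (2.17) on simulated heteroskedastic data; the design matrix loosely mimics the second regression estimator, and all numbers are illustrative.

import numpy as np

rng = np.random.default_rng(4)
n = 5_000
x = rng.normal(size=n)
a = rng.binomial(1, 0.5, size=n)
y = 1.0 + 2.0 * a + x + (1 + a) * rng.normal(size=n)   # heteroskedastic errors

Z = np.column_stack([np.ones(n), a, x])      # intercept, treatment, covariate
theta = np.linalg.solve(Z.T @ Z, Z.T @ y)    # least squares estimate
eps = y - Z @ theta                          # residuals

bread = np.linalg.inv(Z.T @ Z / n)                      # estimates (E[Z Z^T])^{-1}
meat = (Z * eps[:, None]).T @ (Z * eps[:, None]) / n    # estimates E[eps^2 Z Z^T]
sandwich = bread @ meat @ bread / n                     # Var(theta_hat), as in (2.17)

print(theta)                                 # approximately (1, 2, 1)
print(np.sqrt(np.diag(sandwich)))            # robust standard errors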
2.29 Exercise. Let ε_{i1}, ε_{i2}, ε_{i3} be the error terms in the three regression estimators:

ε_{im} = Yi − αm − βm Ai − γmᵀ Xi − Ai (δmᵀ Xi), m = 1, 2, 3.

Here we are using the convention γ1 = 0 and δ1 = δ2 = 0. By using Lemma 2.26 with
different Z and θ, show that, as n → ∞,
2.30 Theorem. Suppose (Xi, Ai, Yi) are iid, A ⊥⊥ X, and E[X] = 0. Then as n → ∞,

√n (β̂m − β) →d N(0, Vm), m = 1, 2, 3.

By Lemma 2.23, β1 = β2 = β3 = β. Thus, for m = 1, 2,

Vm − V3 ≥ 0.
2.31 Exercise. Use your results in Exercise 2.24 to derive the conditions under which
V1 = V2 = V3. Show that V2 ≤ V1 is not always true; therefore, regression adjustment
does not always reduce the asymptotic variance (if not done properly)!
2.32 Remark. The assumption E[X] = 0 is useful to simplify the calculations above. In
practice, we obviously don't know if this assumption is true, so it is common to centre X
before solving the least squares problem. Let β̃1, β̃2, β̃3 be the least squares estimators
in eqs. (2.11) to (2.13) with Xi replaced by Xi − X̄, where X̄ = ∑_{i=1}^n Xi/n. It is easy
to show that β̃1 = β̂1 and β̃2 = β̂2 (because of the intercept term) and β̃3 = β̂3 + δ̂3ᵀ X̄.
Therefore, β̃1 and β̃2 have the same asymptotic distributions as β̂1 and β̂2, even when
E[X] ≠ 0. However, the asymptotic variance of β̃3 (denoted as Ṽ3) is larger than that of
β̂3. It can be shown that Ṽ3 = V3 + δ3ᵀ Σ δ3 and Ṽ3 ≤ min{V1, V2} still holds.
Notes
1 Ye, T., Shao, J., & Zhao, Q. (2020). Principles for covariate adjustment in analyzing randomized clinical trials. arXiv: 2009.11828 [stat.ME].
2 Li, X., & Ding, P. (2019). Rerandomization and regression adjustment. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1), 241–268. doi:10.1111/rssb.12353.
3 Splawa-Neyman, J., Dabrowska, D. M., & Speed, T. P. (1990). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, 5(4), 465–472. doi:10.1214/ss/1177012031.
4 Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688–701. doi:10.1037/h0037350.
5 Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396), 945–960. doi:10.1080/01621459.1986.10478354.
6 For a recent review, see Li, X., & Ding, P. (2017). General forms of finite population central limit theorems with applications to causal inference. Journal of the American Statistical Association, 112(520), 1759–1769. doi:10.1080/01621459.2017.1295865.
7 Fisher, R. A. (1925). Statistical methods for research workers (1st ed.). Oliver and Boyd, Edinburgh and London.
8 See Chapter 2 of Imbens and Rubin, 2015, for a historical account.
9 Rosenbaum, P. R. (1993). Hodges-Lehmann point estimates of treatment effect in observational studies. Journal of the American Statistical Association, 88(424), 1250–1253. doi:10.1080/01621459.1993.10476405.
10 Fisher, R. A. (1932). Statistical methods for research workers (4th ed.). Oliver and Boyd, Edinburgh and London.
11 Tsiatis, A. A., Davidian, M., Zhang, M., & Lu, X. (2008). Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: A principled yet flexible approach. Statistics in Medicine, 27(23), 4658–4677. doi:10.1002/sim.3113. Imbens, G. W., & Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences (1st ed.). Cambridge University Press, Chapter 7. For finite-population randomisation inference, see Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman's critique. The Annals of Applied Statistics, 7(1), 295–318. doi:10.1214/12-aoas583.
12 See, for example, White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838. doi:10.2307/1912934.
Chapter 3
Path analysis
The last chapter introduced potential outcomes to describe a causal effect. This approach is
most convenient when we focus on a single cause-effect pair.
An alternative and more holistic approach is to use a graph, in which directed
edges represent causal effects. This framework traces back to a series of works on path
analysis by Sewall Wright a century ago.1
3.1 Definition (Graph and subgraph). A graph G = (V, E) is defined by its finite
vertex set V and its edge set E ⊆ V × V containing ordered pairs of vertices. The
subgraph of G restricted to A ⊂ V is GA = (A, EA ) where EA = {(i, j) ∈ E | i, j ∈ A}.
3.2 Definition (Directed edge and graph). An edge (i, j) is called directed and
written as i → j if (i, j) ∈ E but (j, i) 6∈ E. Vertex i is called a parent of j and j a
child of i if i → j. The set of parents of a vertex i is denoted as pa G (i) or simply
pa(i). A graph G is called a directed graph if all its edges are directed.
3.3 Definition (Path and cycle). A path between i and j is a sequence of distinct
vertices k0 = i, k1 , k2 , . . . , km = j such that the consecutive vertices are adjacent,
that is, (kl−1 , kl ) ∈ E or (kl , kl−1 ) ∈ E for all l = 1, 2, . . . , m.
A directed path from i to j is a path in which all the arrows are going “forward”,
that is, (kl−1 , kl ) ∈ E for all l = 1, 2, . . . , m.
A cycle is a directed path with the only modification that the first and last
vertices are the same km = k0 .
A directed acyclic graph (DAG) is a directed graph with no cycles.
3.4 Definition (Ancestor and descendant). In a DAG G, a vertex i is an ancestor
of j if there exists a directed path from i to j; conversely, j is called a descendant of
i. Let an_G(I) denote the union of ancestors and de_G(I) the union of descendants of
all vertices in I.
3.5 Exercise. Show that a directed graph is acyclic if and only if the vertices can be
relabeled in a way that the edges are monotone in the label (this is called a topological
ordering). In other words, there exists a permutation (k1 , . . . , kp ) of (1, . . . , p) such that
(i, j) ∈ E implies ki < kj .
3.6 Exercise. Show that for any J ⊂ [p], there exists i 6∈ J such that all the descendants
of i in a DAG G are in J. Hint: Use the topological ordering.
For the rest of this course we will focus on DAG models, which provide a natural
setup for causal inference.
Some conventions: We will often consider the random variables X = X[p] =
(X1 , . . . , Xp ) and the graphical model G = (V = [p], E) with the map i → Xi . To
simplify notation, we won’t distinguish X[p] with the set {X1 , . . . , Xp }. We will often not
distinguish between G and the graph induced by the graphical model, G[X] = (X, E[X])
where (Xi , Xj ) ∈ E[X] if and only if (i, j) ∈ E.
Wright’s path analysis applies to random variables that satisfy the linear structural
equation model (SEM), a (causal) graphical model defined below.
3.8 Definition (Linear SEM). The random variables X[p] satisfy a linear SEM with
respect to a DAG G = (V = [p], E) if they satisfy

Xi = β0i + ∑_{j∈pa(i)} βji Xj + εi,   (3.1)

where ε1, ..., εp are mutually independent with mean 0 and finite variance, and the
interventional distributions of X also satisfy (3.1) (see Remark 3.9). The parameter
βji is called a path coefficient.
Equation (3.1) looks just like a linear model and can indeed be fitted using linear
regression. What makes it structural or causal is an implicit assumption that (3.1) still
holds if we make interventions to one or some of the variables. For example, consider the
following linear SEM,

X1 = ε1,
X2 = 2X1 + ε2,
X3 = X1 + 2X2 + ε3.

Intervening on X1 by setting it to a fixed value x1 amounts to replacing the first structural
equation:

X1 = x1,
X2 = 2X1 + ε2,
X3 = X1 + 2X2 + ε3.

In the potential outcome notation, this intervention defines the counterfactuals

X1(x1) = x1,
X2(x1) = 2x1 + ε2,
X3(x1) = x1 + 2X2(x1) + ε3.

So the SEM is not only a model for the factuals (like regression models) but also a model for
the counterfactuals (unlike regression models).
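A simulation sketch of this example: the same structural equations generate the factual and the interventional (here, setting X1 = 1) distributions, reusing the same structural noise; the standard normal noise is assumed purely for illustration.

import numpy as np

rng = np.random.default_rng(5)
n = 100_000
e1, e2, e3 = rng.normal(size=(3, n))

# observational (factual) distribution
x1 = e1
x2 = 2 * x1 + e2
x3 = x1 + 2 * x2 + e3

# intervention: set X1 = 1 and propagate through the same equations,
# reusing the same structural noise e2 and e3 (the counterfactuals X(1))
x1_do = np.ones(n)
x2_do = 2 * x1_do + e2
x3_do = x1_do + 2 * x2_do + e3

print(x3.mean(), x3_do.mean())    # approximately 0 and 1 + 2 * 2 = 5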
We can use matrix notation to write (3.1) more compactly: X = β0 + Bᵀ X + ε, where B = (βji) collects the path coefficients (with βji = 0 whenever (j, i) ∉ E).
Given a linear SEM, we may define the causal effect of Xi on Xj as the sum, over all
directed paths from i to j, of the products of the path coefficients along each path.
3.11 Definition. Let C(i, j) be the collection of all directed paths ("causal paths")
from i to j. The causal effect of Xi on Xj in a linear SEM is defined as

β(Xi → Xj) = ∑_{(k0,...,km)∈C(i,j)} ∏_{l=1}^m β_{k_{l−1} k_l}.   (3.2)
Wright’s path analysis uses the path coefficients to express the covariance matrix of X
and clearly describes why “correlation does not imply causation”.
3.14 Theorem (Wright's path analysis). Suppose the random variables X[p] satisfy
the linear SEM (3.1) with respect to a DAG G and are standardised so that Var(Xi) = 1
for all i. Then

Cov(Xi, Xj) = ∑_{(k0,...,km)∈D(i,j)} ∏_{l=1}^m β_{k_{l−1} k_l}.   (3.3)
Proof. Without loss of generality, let’s assume (1, . . . , p) is a topological order of G and
i < j. We prove Theorem 3.14 by induction. Equation (3.3) is obviously true if i = 1 and
j = 2. Now suppose (3.3) is true for any i < j ≤ k, where 2 ≤ k ≤ p − 1. It suffices to
show that (3.3) also holds for i < j = k + 1. The key is to realise that D(i, j) can be
obtained by taking a union of all paths in D(i, l) for l ∈ pa(j) appended with the edge
l → j. See Figure 3.1 for an illustration.
By (3.1),

Xj = ∑_{l∈pa(j)} βlj Xl + εj.
[Figure 3.1: A path from i to j in D(i, j) passes through a parent l ∈ {l1, l2} of j; it consists of a path in D(i, l) followed by the edge l → j.]
Therefore, using the induction hypothesis and the trivial fact that Xi ⊥⊥ εj (because i
precedes j),

Cov(Xi, Xj) = ∑_{l∈pa(j)} βlj Cov(Xi, Xl)
            = ∑_{l∈pa(j)} βlj ∑_{(k0,...,km)∈D(i,l)} ∏_{l'=1}^m β_{k_{l'−1} k_{l'}}
            = ∑_{(k0,...,k_{m+1})∈D(i,j)} ∏_{l'=1}^{m+1} β_{k_{l'−1} k_{l'}}.
3.15 Exercise. Modify equation (3.3) so that it is still true when the random variables
are not standardised. Hint: How many “forks” kl−1 ← kl → kl+1 can a d-connected path
have?
When is correlation the same as causation? Comparing (3.2) with (3.3), we see that this is
only the case if all the d-connected paths are directed.
The causal effect of Xi on Xj is said to be confounded if i and j share a common
ancestor in the graph. In this case, a non-zero correlation between Xi and Xj does not
imply a causal relationship.
3.16 Example. Suppose the standardised random variables (X, A, Y) satisfy a linear SEM
corresponding to the graph in Figure 3.2. By path analysis,

Cov(A, X) = βXA,
Cov(X, Y) = βXY + βXA βAY,   (3.4)
Cov(A, Y) = βAY + βXA βXY.
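The covariances in (3.4) can be checked by simulation; the sketch below generates the SEM of Figure 3.2 with illustrative path coefficients, standardising the noise variances so every variable has unit variance.

import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000
b_xa, b_xy, b_ay = 0.5, 0.3, 0.4                       # illustrative values

x = rng.normal(size=n)                                 # Var(X) = 1
a = b_xa * x + rng.normal(scale=np.sqrt(1 - b_xa**2), size=n)
s2_y = 1 - (b_xy**2 + b_ay**2 + 2 * b_xy * b_ay * b_xa)  # makes Var(Y) = 1
y = b_xy * x + b_ay * a + rng.normal(scale=np.sqrt(s2_y), size=n)

print(np.cov(a, x)[0, 1], b_xa)                        # Cov(A, X) = beta_XA
print(np.cov(x, y)[0, 1], b_xy + b_xa * b_ay)          # Cov(X, Y)
print(np.cov(a, y)[0, 1], b_ay + b_xa * b_xy)          # Cov(A, Y)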
[Figure 3.2: A linear SEM in which X confounds the effect of A on Y: edges X → A (βXA), X → Y (βXY), and A → Y (βAY).]
(ii) The linear model (3.8) is structural (see Remark 3.9), so βAY is indeed the causal
effect.
Following Example 3.17, a natural question is: in order to identify causal effects
by regression coefficients, which variables should be included as regressors ("adjusted
for")? We will learn the answer later in the course, but we will examine some negative
examples below to gain intuition.
3.19 Example. Consider the two graphical models in Figure 3.3, in which the random
variables are all centred and standardised. In the left diagram, β(A → Y) = βAX βXY
but γAY·X = 0. In the right diagram, β(A → Y) = 0 but, using (3.5),

γAY·X = − βAX βYX / (1 − βAX²).

This is commonly referred to as collider bias because X is a collider in A → X ← Y.
[Figure 3.3 diagrams: left, the chain A → X → Y with coefficients βAX and βXY; right, the collider A → X ← Y with coefficients βAX and βYX.]
Figure 3.3: Two examples in which adjusting for X in a linear regression introduces bias
to estimating the causal effect of A on Y .
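The collider bias formula in Example 3.19 can likewise be checked numerically; the following sketch regresses Y on A and the collider X, with coefficient values that are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000
b_ax, b_yx = 0.6, 0.5                       # illustrative path coefficients

a = rng.normal(size=n)
y = rng.normal(size=n)                      # no causal effect of A on Y
x = b_ax * a + b_yx * y + rng.normal(scale=np.sqrt(1 - b_ax**2 - b_yx**2), size=n)

Z = np.column_stack([np.ones(n), a, x])     # regress Y on A and the collider X
coef = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(coef[1], -b_ax * b_yx / (1 - b_ax**2))   # both approximately -0.469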
An immediate lesson learned from Example 3.19 is that we should not include
descendants of the cause in the regression. However, the next exercise shows that this is
not enough.
3.20 Exercise. In each of the two cases below, give a linear SEM such that X is not a
descendant of A or Y, β(A → Y) = 0, but γAY·X ≠ 0.
So far we have assumed that all the variables in the linear SEM are observed. A direct
consequence is that all the path coefficients are identifiable from the distribution of X.
3.21 Proposition. Suppose the random variables X[p] satisfy the linear SEM with
respect to a DAG G. Then the path coefficients in B can be written as functions of
Σ = Cov(X[p] ).
Proof. First of all, Σ is positive definite (Exercise 3.10), so any principal submatrix of Σ
is also positive definite. For each variable Xi, the path coefficients from its parents to Xi
can be identified by the corresponding linear regression, i.e., β_{pa(i),i} = (Σ_{pa(i),pa(i)})⁻¹ Σ_{pa(i),i}.
There are at least two reasons to consider SEMs with latent variables (also called
factors):
(i) Confounding bias: simply ignoring the latent variables (e.g. using the subgraph of G
restricted to the observed variables) leads to biased estimates of the path coefficients.
It is thus important to know if we can still identify a causal effect when some
variables are unobserved.
(ii) Proxy measurement: In many real applications, the variables of interest are not
directly measured. This is particularly common in the social sciences where the
variable of interest may be socioeconomic status, personality, or political ideology.
These variables may only be approximately measured by observable variables
(proxies) like human behaviours and questionnaires.
With latent variables, identifiability of path coefficients no longer follows from Proposition
3.21 because Σ is only partially estimable. Path analysis (3.3) allows us to construct
a mapping (Exercise 3.10)

B ↦ Σ(B)

between the path coefficients and the covariance matrix of the observed and unobserved
variables.
An entry (or a function) of B is said to be identifiable if it can be expressed in terms
of the distribution of the observed variables. In linear SEMs with normal errors, this is
equivalent to expressing B in terms of the submatrix of Σ corresponding to the observed
variables (because the multivariate normal distribution is uniquely determined by its
mean and covariance matrix).
When the errors are non-normal, we may further use higher moments or the entire
distribution of the observed variables to identify B. However, this approach is also more
sensitive to the distributional assumptions. Below we will restrict our discussion to the
case of normal errors.
3.23 Remark. The notion of identifiability can depend on the context of the problem. With
latent variables, it is often the case that we can only identify some path coefficients up to
a sign change. In other problems (such as problems with instrumental variables), the set
of nonidentifiable path coefficients has measure zero (this is called generic identifiability).
We will not differentiate between these concepts in the discussion below.
To identify the entire matrix B, a necessary condition is that Σ has at least as many
entries as B. Unfortunately, there is no known necessary and sufficient condition for
identifiability in linear SEMs.3
Below we give some examples in which the path coefficients are indeed identifiable.4 The
basic idea is to use proxies for the latent variables.
Without loss of generality, we assume all the unmeasured variables are standardised so
they all have unit variance. In the diagrams below, we will use dashed circles to indicate
latent variables.
3.24 Lemma (Three-indicator rule). Consider any linear SEM for (U, X[p] ) corresponding
to Figure 3.4. Suppose X[p] is observed but U is not. Suppose Var(U ) = 1. Then the path
coefficients are identifiable (up to a sign change) if p ≥ 3 and at least 3 coefficients are
nonzero.
[Figure 3.4: A single latent factor U with observed indicators X1, X2, ..., Xp and loadings β1, β2, ..., βp.]
Proof. Denote the path coefficient for U → Xi as βi and the variance of the noise variable
for Xi as σi². It is straightforward to show that Cov(Xi, Xj) = βi βj for all i ≠ j.
Therefore, we have

β1² = Cov(X1, X2) · Cov(X1, X3) / Cov(X2, X3),

and similarly for β2² and β3². Although the sign of β1 is not identifiable, it is easy to see
that once it is fixed, the signs of β2 and β3 are also determined. Thus the vector β is
identifiable up to the transformation β ↦ −β.
For p > 3, we can apply the above result to any 3-subset of X[p] whose
corresponding path coefficients are nonzero.
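A sketch of the three-indicator rule in action: the identity β1² = Cov(X1, X2) Cov(X1, X3) / Cov(X2, X3) is verified on a simulated one-factor model; the loadings and the noise scale are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(8)
n = 1_000_000
beta = np.array([0.8, 0.6, 0.9])                  # illustrative loadings
u = rng.normal(size=n)                            # latent factor, Var(U) = 1
x = beta[:, None] * u + rng.normal(scale=0.5, size=(3, n))

c = np.cov(x)
beta1_sq = c[0, 1] * c[0, 2] / c[1, 2]            # Cov(X1,X2) Cov(X1,X3) / Cov(X2,X3)
print(beta1_sq, beta[0] ** 2)                     # both approximately 0.64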
3.25 Remark. Statistical inference for the graphical model in Figure 3.4 is often called
confirmatory factor analysis because the structure is already given. This is different from
exploratory factor analysis (e.g., via principal component analysis), which tries to use
observed data to discover the factor structure.
3.26 Example. For the linear SEM corresponding to the graphical model in Figure 3.5,
βAY is identifiable. To see this, we can first use Lemma 3.24 on {A, Y, X} and {A, Y, Z}
to identify (βU A , βU Y ) (up to a sign change). Without loss of generality we assume A
and Y have unit variance, then βAY = Cov(A, Y ) − βU A βU Y is also identified.
The partial covariance of X1 and X2 given X3 is defined as

PCov(X1, X2 | X3) = Cov(X1, X2) − Cov(X1, X3) Var(X3)⁻¹ Cov(X3, X2).
[Figure 3.5: A latent variable U with children X, A, Y, Z (coefficients βUX, βUA, βUY, βUZ) and a directed edge A → Y (coefficient βAY).]
Show that if we add a directed edge from X to A in Figure 3.5, βAY is still identifiable
by5

βAY = Cov(A, Y) − [PCov(X, Y | A) / PCov(X, Z | A)] · Cov(A, Z).
The three-indicator rule is also quite useful in so-called measurement models. In
this type of problem (an instance is Example 3.22), we are interested in the causal
effects between the latent variables (often abstract constructs like personality
and academic achievement).
Suppose the latent variables U ∈ R^q have unit variances and satisfy a linear SEM
with respect to a prespecified DAG. The observed variables (or measurements) X ∈ R^p
satisfy the following model (the intercept term is omitted for simplicity):
X = Γ U + εX,

where Γ ∈ R^{p×q} is the matrix of factor loadings and the entries of εX are mutually
independent noise variables.
3.29 Proposition. In the measurement model above, the path coefficients are identifiable
(up to sign changes) if:
(i) Each row of Γ has only 1 nonzero entry (i.e., every measurement loads on only
one factor).
(ii) Each column of Γ has at least 3 nonzero entries (i.e., each factor has at least
three measurements).
Proof. By Lemma 3.24, Γ can be identified. The assumptions in the proposition statement
also ensure that Γ has full column rank, so Cov(U) can be identified from Cov(X). The
conclusion then follows from Proposition 3.21.
3.30 Example. The graphical model in Figure 3.6 satisfies the criterion in Proposition
3.29, thus βU is identifiable (up to its sign). To see this, β11, β12, ..., β26 can be
identified by confirmatory factor analysis, and by using path analysis, we have
Cov(X1, X4) = β11 βU β24, so βU = Cov(X1, X4)/(β11 β24).
[Figure 3.6: A measurement model with two latent factors and an edge U1 → U2 (coefficient βU); U1 has indicators X1, X2, X3 (loadings β11, β12, β13) and U2 has indicators X4, X5, X6 (loadings β24, β25, β26).]
3.31 Exercise. Show that βU in the last example is still identifiable if each latent variable
only has two measurements (i.e. if X3 and X6 are deleted from the graph).
3.32 Remark. Although the path coefficients between U can only be identified up to
sign changes, this is usually not a problem in practice. Usually we can confidently make
assumptions about the signs of certain factor loadings (for example, the loading of a
student’s maths score on academic achievements is positive).
Let X ∈ Rp be the observed variables in a linear SEM. Let B denote the matrix of path
coefficients between all the variables, observed or latent. Suppose B is indeed identifiable.
There are two main approaches to fit a linear SEM and estimate B: maximum
likelihood and generalised method of moments.
By assuming the noise variables ε[p] in (3.1) follow a multivariate normal distribution,
the maximum likelihood estimator of B minimises

l(B) = (1/2) log det ΣX(B) + (1/2) tr( S ΣX(B)⁻¹ ),   (3.7)

where S is the sample covariance matrix of X and ΣX(B) is the covariance matrix of X,
which depends on the path coefficients B through (3.3).
Different choices of W lead to estimators with different asymptotic efficiency. The
"optimal" choice is W = Σ(B) (or any other matrix that converges in probability to
Σ(B)). This motivates the practical choice W = S, so we estimate B by minimising

lS(B) = (1/2) tr[ ( I − S⁻¹ Σ(B) )² ].   (3.9)
The generalised method of moments estimator is consistent if the linear SEM is
correctly specified (so Var(X) = Σ(B)). Furthermore, if lS(B) is used and X is normally
distributed, the estimator is asymptotically equivalent to the maximum likelihood
estimator and is thus asymptotically efficient.6
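The sketch below fits the simplest possible linear SEM (X1 → X2) by numerically minimising the likelihood objective (3.7); the parameterisation and the use of a generic optimiser are simplifying assumptions, standing in for a dedicated SEM package.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
n = 10_000
x1 = rng.normal(size=n)
x2 = 1.5 * x1 + rng.normal(scale=0.7, size=n)     # true b = 1.5, Var(eps2) = 0.49
S = np.cov(np.stack([x1, x2]))

def sigma(params):
    """Model-implied covariance matrix for the SEM X1 -> X2."""
    b, log_s1, log_s2 = params
    s1, s2 = np.exp(log_s1), np.exp(log_s2)
    return np.array([[s1, b * s1], [b * s1, b**2 * s1 + s2]])

def loss(params):
    """The maximum likelihood objective l(B) in (3.7)."""
    sig = sigma(params)
    return 0.5 * np.log(np.linalg.det(sig)) + 0.5 * np.trace(S @ np.linalg.inv(sig))

fit = minimize(loss, x0=np.zeros(3))
print(fit.x[0], np.exp(fit.x[1:]))                # approximately 1.5 and (1.0, 0.49)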
Despite being a century old, linear SEMs are still widely used in many applications for
many good reasons:
(i) Graphs and linear SEMs provide an intuitive way to rigorously describe causality
that can also be easily understood by applied researchers.
(ii) Path analysis provides a powerful tool to distinguish correlation from causation.
Even though we will move away from linearity soon, path analysis provides a
straightforward way to disprove some statements and gain intuitions for others7 .
(iii) Linear SEMs allow us to directly put models on unobserved variables. This is
especially useful when the causes and effects of interest are abstract constructs.
(iv) Fitting a linear SEM only requires the sample covariance matrix, which can be
handy in modern applications with privacy constraints.
(ii) The linear model can be misspecified, and it does not handle binary or discrete
variables very well. This is problematic because the causal effect is not
well defined if the model is nonlinear. As a consequence, the meaning of structural
equation models became obscure, leading many to believe they are just the same as
linear regression. This misconception led many researchers to reject linear SEMs
as a tool for causal inference.8
(iii) Any model put on the unobserved variables is dangerous, because there is no realistic
way to verify those assumptions.
Notes
1 Wright, S. (1918). On the nature of size factors. Genetics, 3(4), 367–374; Wright, S. (1934). The method of path coefficients. The Annals of Mathematical Statistics, 5(3), 161–215.
2 Marsh, H. W. (1990). Causal ordering of academic self-concept and academic achievement: A multiwave, longitudinal panel analysis. Journal of Educational Psychology, 82(4), 646–656. doi:10.1037/0022-0663.82.4.646.
3 For a review on recent advances using algebraic geometry, see Drton, M. et al. (2018). Algebraic problems in structural equation modeling. In The 50th Anniversary of Gröbner Bases (pp. 35–86). Mathematical Society of Japan.
4 More discussion and examples can be found in Bollen, K. A. (1989). Structural equations with latent variables. doi:10.1002/9781118619179, page 326.
5 Kuroki, M., & Pearl, J. (2014). Measurement bias and effect restoration in causal inference. Biometrika, 101(2), 423–437. doi:10.1093/biomet/ast066.
6 For more detail, see Browne, M. W. (1984). Asymptotically distribution-free methods for the analysis of covariance structures. British Journal of Mathematical and Statistical Psychology, 37(1), 62–83. doi:10.1111/j.2044-8317.1984.tb00789.x.
7 Pearl, J. (2013). Linear models: A useful "microscope" for causal analysis. Journal of Causal Inference, 1(1), 155–170. doi:10.1515/jci-2013-0003.
8 For a historical account, see Pearl, 2009, Section 5.1.
Chapter 4
Graphical models
Linear SEMs are intuitive and easy to interpret, but they become inadequate when the
structural relations are non-linear. Intuitively, causality should already be entailed in the
graphical diagram, and linearity should be unnecessary for causal inference.
To move away from the linearity assumption, we will introduce graphical models for
the observed variables in this chapter and for the unobserved counterfactuals in the next
chapter. The main idea in this chapter is to describe conditional independences with
graphs.
Briefly speaking, a graphical model provides a concise representation of all the conditional
independence relations (aka Markov properties) between random variables. We will start
with undirected graphs.
Let G = (V = [p], E) be an undirected graphical model for the random variables
X[p] = (X1, X2, ..., Xp). Edges in an undirected graph have no direction; in other words,
if (i, j) ∈ E then so is (j, i).
The global Markov property with respect to a graph is closely related to the factorisa-
tion of a probability distribution.
4.3 Definition (Factorisation according to an undirected graph). A clique in an
undirected graph G is a subset of vertices such that every two distinct vertices in the
clique are adjacent.
A probability distribution P is said to factorise according to G (or to be a Gibbs random
field with respect to G) if P has a density f that can be written as

f(x) = ∏_{clique C⊆V} ψC(xC)

for some functions ψC.

4.4 Theorem (Hammersley–Clifford). Suppose P has a positive density f. Then P
satisfies the global Markov property with respect to G if and only if P factorises
according to G.
Proof. We will only prove the ⟸ direction here.1 Let I, J, K be disjoint subsets of V
such that I ⊥ J | K [G]. A consequence of I and J being separated by K is that I and J
must be in different connected components of the subgraph G_{V\K}. (A set of vertices is
called connected if there exists a path from any vertex to any other vertex in this set. A
connected component is a maximal connected set, meaning no superset of the connected
component is connected.)
Let Ĩ denote the connected component that I is in and let J̃ = V \ (Ĩ ∪ K). Any
clique of G must either be a subset of Ĩ ∪ K or of J̃ ∪ K, otherwise at least one vertex in
Ĩ is adjacent to a vertex in J̃, violating the maximality of Ĩ. This implies that

f(x) = ∏_{clique C⊆V} ψC(xC)
     = [∏_{clique C⊆Ĩ∪K} ψC(xC)] · [∏_{clique C⊆J̃∪K} ψC(xC)] / [∏_{clique C⊆K} ψC(xC)]
     = h(x_{Ĩ∪K}) · g(x_{J̃∪K}).

We have shown that f(x) can be written as a function of x_{Ĩ∪K} multiplied by another
function of x_{J̃∪K}. By normalising the functions properly, this shows that X_Ĩ ⊥⊥ X_J̃ | X_K
and hence X_I ⊥⊥ X_J | X_K.
Notice that in the proof above we did not use the positive density assumption. It is
only needed for the ⟹ direction.
4.5 Definition. We say a probability distribution P factorises according to a DAG
G if its density function f satisfies

f(x) = ∏_{i∈V} f_{i|pa(i)}(x_i | x_{pa(i)}),

where f_{i|pa(i)}(x_i | x_{pa(i)}) is the conditional density of Xi given X_{pa(i)}.

Below we will often suppress the subscript and use f as a generic symbol to indicate
a density or conditional density function.
4.6 Example. A probability distribution factorises according to Figure 4.1 if its density
can be written as

f(x) = f(x1) f(x2 | x1) f(x3 | x1) f(x4 | x2, x3) f(x5 | x2) f(x6 | x3, x4) f(x7 | x4, x5, x6).

[Figure 4.1: A DAG on vertices 1–7 with edges 1→2, 1→3, 2→4, 3→4, 2→5, 3→6, 4→6, 4→7, 5→7, 6→7.]
4.7 Example. Probabilistic graphical models are widely used in Bayesian statistics,
machine learning, and engineering.2 Besides the intuitive representation, another motivation
for using graphical models is more efficient storage of probability distributions. Consider p
binary random variables. A naive approach that stores the entire table of their joint distribution
needs to record 2^p probability mass values. In contrast, suppose the random variables
factorise according to a DAG G in which each vertex has no more than d parents. Then
it is sufficient to store p · 2^d values.
It is obvious that if X[p] satisfies the linear SEM according to G (Definition 3.8), then
the distribution of X[p] also factorises according to G. Therefore, we can often use linear
SEMs to understand properties of DAG models (see, e.g., Exercise 4.16 below). However,
a linear SEM further makes assumptions about the interventional distributions of X[p]. On
the contrary, a DAG model only restricts the observational distribution of X[p], like a
regression model (see Remark 3.9 for a comparison of linear SEM with linear regression).
The next chapter will introduce DAG models for counterfactuals.
4.8 Definition. Given a DAG G, its undirected moral graph G m is obtained by first
adding undirected edges between all pairs of vertices that have a common child and
then erasing the direction of all the directed edges.
This graph is called “moral” because we are marrying all the parents that have a
common child.
4.9 Example. Figure 4.2 illustrates the moralisation of the DAG in Figure 4.1. First,
three undirected edges (2,3), (4,5), (5,6) are added because they share a common child.
Second, the directions of all the edges in the original graph are erased.
[Figure 4.2: Moralisation of the DAG in Figure 4.1. (a) Step 1: add undirected edges (in red) between all pairs of vertices that have a common child, i.e. (2,3), (4,5), (5,6). (b) Step 2: erase the direction of all edges.]
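Moralisation is easy to implement; the sketch below applies Definition 4.8 to the DAG of Figure 4.1, whose edge set is reconstructed here from Examples 4.9, 4.13 and 4.18, and recovers the three added edges.

from itertools import combinations

def moralise(vertices, edges):
    """Return the moral graph of a DAG as a set of undirected edges."""
    moral = {frozenset(e) for e in edges}          # erase edge directions
    for v in vertices:
        parents = [i for (i, j) in edges if j == v]
        for a, b in combinations(parents, 2):      # marry the parents
            moral.add(frozenset((a, b)))
    return moral

dag_edges = [(1, 2), (1, 3), (2, 4), (3, 4), (2, 5),
             (3, 6), (4, 6), (4, 7), (5, 7), (6, 7)]
moral = moralise(range(1, 8), dag_edges)
added = moral - {frozenset(e) for e in dag_edges}
print(sorted(tuple(sorted(e)) for e in added))     # [(2, 3), (4, 5), (5, 6)]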
Lemma 4.10 gives us a way to obtain conditional independence relations for distributions
that factorise according to a DAG (using Definition 4.1).
4.11 Corollary. Suppose P factorises according to a DAG G, then

I ⊥ J | K [G^m] ⟹ X_I ⊥⊥ X_J | X_K.
This criterion can be improved. Let ān(I) = an(I) ∪ I. The next proposition says
that we only need to moralise the subgraph of G restricted to ān(I ∪ J ∪ K).

4.12 Proposition. Suppose P factorises according to a DAG G, then

I ⊥ J | K [(G_{ān(I∪J∪K)})^m] ⟹ X_I ⊥⊥ X_J | X_K.
Proof. It is easy to verify that for any I ⊆ V, i ∈ ān(I) implies that pa(i) ⊆ ān(I). By
Definition 4.5, this implies that the marginal distribution of X_{ān(I∪J∪K)} must factorise
according to the subgraph G_{ān(I∪J∪K)}. The proposition then immediately follows from
Corollary 4.11.
4.13 Example. Suppose the distribution P of X factorises according to the DAG in
Figure 4.1. We can use the criterion in Proposition 4.12, but not the one in Corollary 4.11,
to derive X4 ⊥⊥ X5 | {X2, X3}.
4.14 Exercise. Suppose the distribution P of X factorises according to the DAG in
Figure 4.1. Which one(s) of the following conditional independence relationships can be
derived from Proposition 4.12?
(i) X2 ⊥⊥ X6 | X4;
(ii) X2 ⊥⊥ X6 | X3;
(iii) X2 ⊥⊥ X7 | {X4, X5};
(iv) X5 ⊥⊥ X6 | X4;
(v) X5 ⊥⊥ X6 | {X3, X4};
Next we give another criterion called d-separation that only uses the original DAG G
and is thus much easier to apply. To gain some intuition, consider the following example.
4.15 Example. There are three possible situations for a DAG with 3 vertices and 2
edges (Figure 4.3). Using Corollary 4.11, it is easy to show that X1 ⊥⊥ X3 | X2 is true in
the first two cases. However, in the third case, even though X1 and X3 are marginally
independent, conditioning on the collider X2 (the common child of X1 and X3) actually
introduces dependence, so X1 ⊥⊥ X3 | X2 is not true in general. Example 3.19 showed the
same phenomenon using the more restrictive linear SEM interpretation of these graphical
models.
4.16 Exercise. (i) By directly using the DAG factorisation (without using moralisa-
tion), show that X1 ⊥⊥ X3 | X2 is true in the first two cases but generally false for
the third case in Figure 4.3.
[Figure 4.3: Three DAGs with 3 vertices and 2 edges: the chain/mediator X1 → X2 → X3; the fork/confounder X1 ← X2 → X3; and the collider X1 → X2 ← X3.]
(ii) Alternatively, by assuming (X1 , X2 , X3 ) satisfies a linear SEM with respect to the
corresponding graph, demonstrate why X1 ⊥ ⊥ X3 | X2 holds or does not hold. For
simplicity, you may assume all the path coefficients are equal to 1.
(iii) What happens if there is an additional vertex X4 that is a child of X2 and has no
other parent, and we condition on X4 instead of X2 ?
4.17 Definition (d-separation). A path between i and j is blocked by a set K ⊆ V \ {i, j}
if there is a vertex k on the path such that either:
(i) k is not a collider on this path and k ∈ K; or
(ii) k is a collider on this path and k and all its descendants are not in K.
Two sets I and J are d-separated by K, written I ⊥ J | K [G], if every path between
I and J is blocked by K.
4.18 Example. For the DAG in Figure 4.1, K = {1} blocks the paths (3, 1, 2, 5),
(3, 4, 2, 5), (3, 6, 4, 2, 5), (3, 6, 7, 4, 2, 5), and (3, 6, 7, 5). Therefore the vertices 3 and 5 are
d-separated by {1}.
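d-separation can also be checked mechanically via Proposition 4.12: moralise the ancestral subgraph and test ordinary graph separation with a breadth-first search. The sketch below reproduces this example and Example 4.13; as before, the edge set of Figure 4.1 is a reconstruction from the text.

from itertools import combinations

def moralise(vertices, edges):
    moral = {frozenset(e) for e in edges}
    for v in vertices:
        parents = [i for (i, j) in edges if j == v]
        for a, b in combinations(parents, 2):
            moral.add(frozenset((a, b)))
    return moral

def ancestors(vs, edges):
    anc, frontier = set(vs), set(vs)
    while frontier:
        frontier = {i for (i, j) in edges if j in frontier} - anc
        anc |= frontier
    return anc

def d_separated(I, J, K, edges):
    keep = ancestors(set(I) | set(J) | set(K), edges)   # ancestral vertex set
    sub = [(i, j) for (i, j) in edges if i in keep and j in keep]
    moral = moralise(keep, sub)                         # moralise the subgraph
    reach, frontier = set(I), set(I)                    # BFS avoiding K
    while frontier:
        frontier = {w for e in moral for w in e
                    if (e & frontier) and w not in reach and w not in K}
        reach |= frontier
    return not (reach & set(J))

dag_edges = [(1, 2), (1, 3), (2, 4), (3, 4), (2, 5),
             (3, 6), (4, 6), (4, 7), (5, 7), (6, 7)]
print(d_separated({3}, {5}, {1}, dag_edges))        # True, as in Example 4.18
print(d_separated({4}, {5}, {2, 3}, dag_edges))     # True, as in Example 4.13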
4.19 Remark. To memorise the definition of d-separation, imagine water flowing along the
edges and each vertex acts as a valve. A collider valve is naturally “off”, meaning there
is no flow of water from one side of the collider to the other side. A non-collider valve
is naturally “on”, allowing water to flow freely. Now imagine we can turn on or off the
valves (the perhaps non-intuitive part is that turning on any descendant of a collider also
turns on that collider). Water can flow from one end of the path to the other end (path
is unblocked) if and only if all the valves on the path are “on”.
In path analysis (Definition 3.13), we have already seen that a d-connected path
can induce dependence between variables. The induced dependence can be removed
(“blocked”) by conditioning on any non-collider on the path. Conversely, although a closed
(not d-connected) path does not induce dependence, it can do so if we condition on all
the colliders.
The proof of this lemma is a bit technical and is deferred to Section 4.A.1 (and so is
non-examinable).
Proof. Proposition 4.12 and Lemma 4.20 immediately imply the ⟹ direction. The ⟸
direction can be shown by induction on |V|. Without loss of generality let's assume
V = [p] and (1, 2, ..., p) is a topological ordering of G, so (i, j) ∈ E implies that i < j.
Because the vertex p has no child, it is easy to see that

p ⊥ V \ {p} \ pa(p) | pa(p) [G].

By (4.1), Xp is independent of the other variables given X_{pa(p)}. Thus the joint density of
X[p] can be written as

f(x_{[p]}) = f(x_{[p−1]}) · f(x_p | x_{pa(p)}).

Using the induction hypothesis for the first term on the right-hand side, we thus conclude
that P also factorises according to G when |V| = p.
Distributions P satisfying (4.1) are said to satisfy the global Markov property with
respect to the DAG G. In the last section, Theorem 4.4 establishes the equivalence
between global Markov property and factorisation in undirected graphs. Theorem 4.21
extends this equivalence to DAGs, with a small distinction that P is no longer required to
have a positive density function.
4.22 Exercise. Apply the d-separation criterion in Theorem 4.21 to the examples in
Exercise 4.14.
4.23 Remark (Completeness of d-separation). The criterion (4.1) cannot be further
improved in the following sense. Given any DAG G, it can be shown that there exists a
probability distribution P3 such that

I ⊥ J | K [G] ⟺ X_I ⊥⊥ X_J | X_K, for all disjoint I, J, K ⊂ V.   (4.2)

Furthermore, it can be shown that if X[p] is discrete, the set of probability distributions
that factorise according to G but do not satisfy (4.2) has Lebesgue measure zero.4
4.24 Example. Consider the setting in Example 3.16 where three random variables
(X, A, Y) satisfy a linear SEM (3.4) corresponding to the graph in Figure 3.2. Suppose the
structural noise variables are jointly normal and the random variables are standardised
so that Var(X) = Var(A) = Var(Y) = 1. For most values of βXA, βXY, βAY, the variables
X, A, Y are unconditionally and conditionally dependent. However, very occasionally
the distribution of (X, A, Y) may have some "coincidental" independence relations. For
example, A ⊥⊥ Y if the path coefficients happen to satisfy βAY + βXA βXY = 0. It is easy
to see that this event has Lebesgue measure 0.
In structure discovery, the goal is to use conditional independence in the observed data
to infer the graphical model, that is, to invert (4.1). Remark 4.23 suggests that this is
possible for almost all distributions, which is formalised below.

4.25 Definition (Faithfulness). A distribution P is said to be faithful to a DAG G if
(4.2) holds, i.e. every conditional independence relation in P corresponds to a
d-separation in G.

Given that P is faithful to the unknown DAG G, we can obtain all d-separation
relations in G from the conditional independence relations in P. Without faithfulness, we
cannot even exclude the possibility that the underlying DAG is complete.
However, this may not be enough to recover G. A simple counter-example is the two
DAGs X1 → X2 and X2 → X1, both implying X1 ⊥̸⊥ X2.
4.26 Definition. Two DAGs are called Markov equivalent if they contain the same
d-separation relations. A Markov equivalence class is the maximal set of Markov
equivalent DAGs.
Without additional assumptions, we can only recover the Markov equivalence class that contains G. The next Theorem gives a complete characterisation of a Markov
equivalence class.
4.27 Theorem. Two DAGs are Markov equivalent if and only if the next two
properties are satisfied:
(i) They have the same “skeleton” (set of edges ignoring the directions);
(ii) They have the same “immoralities” (structures like i → k ← j where i and j
are not adjacent).
We will only show the =⇒ direction here by proving two Lemmas. The proof for the
⇐= direction can be found in Section 4.A.2.
4.28 Lemma. Given a DAG G = (V, E), two vertices i, j ∈ V are adjacent if and only
if they cannot be d-separated by any set D ⊆ V \ {i, j}; otherwise they can be d-separated
by pa(i) or pa(j).
Proof. The =⇒ direction is obvious because no set can block the edge between i and j. For the ⇐= direction, because i ∈ an(j) and j ∈ an(i) cannot both be true (otherwise there would be a cycle), without loss of generality we assume j ∉ an(i). Any path connecting i and j thus cannot be a directed path from j to i. Consider the edge on this path with j as one end. If this edge points into j, the path contains a parent of j that is not a collider; otherwise the path must contain a collider that is a descendant of j. In either case the path is blocked by pa(j).
4.29 Lemma. For any undirected path i − k − j in G such that i and j are not adjacent,
k is a collider if and only if i and j are not d-separated by any set containing k.
Proof. This immediately follows from the fact that the path i − k − j is blocked by k if
and only if k is not a collider.
The IC5 or SGS6 algorithm uses the conditions in Lemmas 4.28 and 4.29 to recover
the Markov equivalence class:
Step 0 Start with a complete undirected graph in which all vertices are adjacent.

Step 1 For every pair of vertices i, j, remove the edge between i and j if Xi ⊥⊥ Xj | XK for some K ⊆ V \ {i, j}. This gives us the skeleton of the graph (Lemma 4.28).

Step 2 For every path i − k − j such that i and j are not adjacent in the skeleton obtained in Step 1, orient the edges as i → k ← j if Xi ⊥̸⊥ Xj | XK for all K ⊆ V \ {i, j} containing k (Lemma 4.29).

Step 3 Orient some of the remaining edges: an edge is oriented in a given direction whenever the opposite orientation would create a directed cycle or a new immorality. (In general it is impossible to orient all the edges unless the Markov equivalence class is a singleton.)

The first two steps are illustrated in the code sketch below.
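To make Steps 1 and 2 concrete, here is a minimal Python sketch (mine, not from the notes). It assumes access to a conditional independence oracle indep(i, j, K); in practice this would be replaced by a statistical test, with all the caveats that entails.

    from itertools import combinations

    def ic_skeleton_and_immoralities(vertices, indep):
        """Steps 1-2 of IC/SGS given a CI oracle indep(i, j, K) -> bool."""
        # Step 1: start complete; remove i - j if independent given some K
        adj = {v: set(vertices) - {v} for v in vertices}
        for i, j in combinations(vertices, 2):
            others = [v for v in vertices if v not in (i, j)]
            for r in range(len(others) + 1):
                if any(indep(i, j, set(K)) for K in combinations(others, r)):
                    adj[i].discard(j); adj[j].discard(i)
                    break
        # Step 2: orient i -> k <- j if i, j are non-adjacent and dependent
        # given every conditioning set that contains k
        immoralities = set()
        for i, j in combinations(vertices, 2):
            if j in adj[i]:
                continue
            for k in adj[i] & adj[j]:
                others = [v for v in vertices if v not in (i, j, k)]
                if all(not indep(i, j, {k} | set(K))
                       for r in range(len(others) + 1)
                       for K in combinations(others, r)):
                    immoralities.add((i, k, j))
        return adj, immoralities

With a d-separation oracle for a known DAG, this returns the skeleton and immoralities that characterise its Markov equivalence class (Theorem 4.27).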
The PC algorithm7 accelerates Step 1 above using the following trick: to test whether i and j can be d-separated, one only needs to go through subsets of the neighbours of i and subsets of the neighbours of j. The PC algorithm also imposes an order: it starts with K = ∅ and then gradually increases the size of K. For sparse graphs, these two tricks not only let us test fewer conditional independences for each pair but also allow the algorithm to stop at a much smaller size of K.
4.30 Exercise. Use the IC/SGS algorithm to derive the Markov equivalence class
containing the DAG in Figure 4.1. More specifically, give the conditional independence
and dependence relations you used in Steps 1 and 2 of the algorithm. How many DAGs
are there in this Markov equivalence class?
It may not be surprising that many people have been fascinated by the prospects of the graphical approach to causality:

(i) The possibility of discovering (possibly causal) structures from observational data is exciting.

However, there are also important caveats:

(ii) DAGs do not necessarily represent causality: graphical models can just be viewed as a useful tool to describe a probability distribution.

(iii) Additional assumptions like the causal Markov condition needed to define causal DAGs are not as transparent as assumptions on counterfactuals or structural noises.9
Notes

1. Proof of the other direction can be found, for example, in Lauritzen, S. L. (1996). Graphical models. Clarendon Press, page 36.
2. See, for example, Wainwright, M. J., & Jordan, M. I. (2007). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2), 1–305. doi:10.1561/2200000001.
3. Geiger, D., & Pearl, J. (1990). On the logic of causal models. In R. D. Shachter, T. S. Levitt, L. N. Kanal, & J. F. Lemmer (Eds.), Uncertainty in artificial intelligence (Vol. 9, pp. 3–14). Machine Intelligence and Pattern Recognition. doi:10.1016/B978-0-444-88650-7.50006-8.
4. Meek, C. (1995). Strong completeness and faithfulness in Bayesian networks. In Proceedings of the eleventh conference on uncertainty in artificial intelligence (pp. 411–418). Montréal, Qué, Canada: Morgan Kaufmann Publishers Inc.
5. Pearl, J., & Verma, T. S. (1991). A theory of inferred causation. In J. Allen, R. Fikes, & E. Sandewall (Eds.), Principles of knowledge representation and reasoning: Proceedings of the second international conference (pp. 441–452). Morgan Kaufmann.
6. Spirtes, P., Glymour, C., & Scheines, R. (1993). Causation, prediction, and search. doi:10.1007/978-1-4612-2748-9.
7. Spirtes, P., Glymour, C., & Scheines, R. (2001). Causation, prediction, and search (2nd ed.). doi:10.7551/mitpress/1754.001.0001.
8. Shah, R. D., & Peters, J. (2020). The hardness of conditional independence testing and the generalised covariance measure. Annals of Statistics, 48(3), 1514–1538. doi:10.1214/19-aos1857.
9. Dawid, A. P. (2010). Beware of the DAG! In I. Guyon, D. Janzing, & B. Schölkopf (Eds.), (pp. 59–86). Proceedings of Machine Learning Research. Whistler, Canada: PMLR.
4.A Graphical proofs

4.A.1 Proof of Lemma 4.20

Proof. Suppose first that I is not d-separated from J by K in G, so there is a path from I to J that is unblocked by K.

(i) If this unblocked path contains no collider, then every vertex on this path, if not already in I ∪ J, must be an ancestor of I ∪ J.

(ii) If this unblocked path contains at least one collider, then every collider must be in K or have a descendant in K. Thus, all the vertices on this path must be ancestors of I ∪ J ∪ K.

In case (i), this path cannot contain a vertex in K (because it is unblocked in G), so it is unblocked by K in (Gan(I∪J∪K))m. In case (ii), the path in (Gan(I∪J∪K))m that marries the parents of all the colliders is not separated by K (the parents of a collider cannot be colliders and thus do not belong to K). In both cases, K does not separate I from J in the moral graph.
Next we consider the other direction. Suppose I is not separated from J by K in
(Gan(I∪J∪K) )m , so there exists a path from a vertex in I to a vertex in J circumventing K
in (Gan(I∪J∪K))m. Edges in the moral graph are either already in G or added during the “marriage”. For each edge added because of the marriage of a collider's parents, extend the path to include that collider. This results in a path in G. The set K does not block this path at the non-colliders because the original path in the moral graph is not separated by K.
Consider the subsequence of this path, say from i ∈ I to j ∈ J, that does not contain any intermediate vertex in I ∪ J. Consider any collider on this sub-path (call it l) that does not belong to an(K); then l ∈ an(I ∪ J), and without loss of generality assume l ∈ an(I). By definition, there exists a directed path in G from l to i′ for some i′ ∈ I. Consider a new path, tracing back from i′ to l and then joining the original path from l to j (see Figure 4.4 for an illustration). Because l ∉ an(K), the new part of the path from i′ to l is not blocked by K. Thus we have obtained a path in G from I to J, unblocked by K at non-colliders, with one fewer collider outside an(K) than the original. Repeat the argument in this paragraph until we end up with a path from I to J whose colliders are in an(K) and whose non-colliders are not in K. By Definition 4.17, this path is not blocked by K.
Figure 4.4: Illustration for obtaining a new path with fewer colliders (in red). Dashed line indicates an edge added during the marriage.
4.A.2 Proof of Theorem 4.27

4.31 Lemma. Consider a shortest path between i and j that is unblocked by K ⊂ V. If k − l − m are three consecutive vertices on this path such that k and m are adjacent, then the edges must be oriented as k ← l → m and (letting k → m without loss of generality) k must be a collider on the path. In particular, every collider on such a shortest unblocked path is part of an immorality.

4.32 Lemma. Suppose two DAGs G1 and G2 have the same skeleton and immoralities. If there is a path from i to j unblocked by K ⊂ V in G1, then there exists a path from i to j that is unblocked by K in both G1 and G2.
Proof of Lemma 4.31. If k − l − m constitutes a moral collider, so that the edges are oriented as k → l ← m, we show that the shorter path that bypasses l (by directly going through the k → m or k ← m edge) is also unblocked by K, thus contradicting the hypothesis that the path is shortest. Because k and m are not colliders on the original path (there cannot be two consecutive colliders) and the original path is unblocked by K, we have k, m ∉ K. Suppose the path is of the form i · · · k → l ← m · · · j (k could be the same as i and m could be the same as j). The sub-paths i · · · k and j · · · m are not blocked by K because the original path is not blocked by K. Although k or m might be a collider on the new path, neither of them blocks the path when conditioning on K, because k and m are parents of l and l ∈ an(K) (the original path being unblocked at the collider l).

The other possibility is that k − l − m forms a chain, for example k → l → m. In this case k − m must be oriented as k → m in order to not create a cycle. By observing that any intermediate vertex is a collider on the new path if and only if it is a collider on the original path, it is easy to show that the shorter path bypassing l is also unblocked by K, again resulting in a contradiction.

The only remaining possibility is the fork k ← l → m. Since k and m are adjacent, without loss of generality let's say the orientation is k → m. It must be the case that k is a collider on the original path; otherwise the shorter path bypassing l would have the same colliders and non-colliders as the original path (except l) and would thus be unblocked by K.
Proof of Lemma 4.32. Consider the shortest unblocked path between i and j in G1 . The
goal is to show that the same path (or some path constructed based on this path) is
unblocked by K in G2 . By Lemma 4.31, all G1 -colliders in this path are immoral and
hence are also G2-colliders (since G1 and G2 share the same immoralities). Consider any intermediate vertex l on this path; if there is none then obviously the path cannot be blocked by any K in G2. Let's say the adjacent vertices are k, m. The vertex l must be one of the following: (i) a non-collider in both G1 and G2; (ii) a collider in G2 but not in G1; or (iii) a collider in both G1 and G2. Obviously K does not block the path in G2 at vertices of the first kind, because the path is unblocked in G1. There can be no l of the second kind for the following reason. Because l is a collider in G2 but not in G1, the parents k, m of l must be adjacent; otherwise G1 would not share the immorality k → l ← m with G2. By Lemma 4.31, the orientation in G1 must be k ← l → m, and at least one of k, m is an immoral collider in G1. However, neither k nor m can be a collider in G2 (l being a G2-collider, the edges k − l and l − m point out of k and m), which contradicts the hypothesis that G1 and G2 have the same immoralities.
Finally we consider the third case, that l is a collider in both G1 and G2. Because the shortest path we are considering is unblocked by K in G1, l or a G1-descendant of l must be in K. Among de_G1(l) ∩ K, let o be a vertex that is closest to l (if there are several, let o be any one of them). There are three possibilities:

(i) o = l;
(ii) o ∈ ch_G1(l);
(iii) o ∈ de_G1(l) but o ∉ {l} ∪ ch_G1(l).

In the first case, the path is not blocked at l in G2 because l ∈ K. In the second or the third case, the path is blocked at l in G2 only if the edge directions in the shortest directed path from l to o in G1 are no longer the same in G2. In the third case, we claim that this shortest directed path from l to o must also be a directed path in G2, hence o ∈ de_G2(l) and the path is unblocked at l in G2. If this were not true, this path from l to o would have a G2-collider. In order to not create an immorality that does not exist in G1, the two G2-parents of this collider must be adjacent. In G1, this edge either creates a cycle or a shorter directed path from l to o, a contradiction.
We are left with the second case with the edge direction l ← o in G2. In order to not create the immorality k → l ← o, which would be inconsistent with G1, k and o must be adjacent; furthermore, the direction must be k → o in G1 so as to not create the cycle k → l → o → k. For the same reason, m and o must be adjacent and the direction must be m → o in G1. Hence k → o ← m is an immorality in G1 and must also be an immorality in G2. Consider a new path, modified from the original path from i to j, with k → l ← m replaced by k → o ← m. It is easy to show that this new path has the same length as the original path and is also unblocked by K in G1 and G2 because o ∈ K. Apart from o, any other vertex on the new path is a G2-collider if and only if it is a G2-collider on the original path. We can continue to use the argument in this paragraph until we no longer have a collider l with a G1-child, but not a G2-child, o in K.
Chapter 5

So far we have seen three mathematical languages for causal inference:

(i) The potential outcomes are useful to elucidate the causal effect that is being estimated in a randomised experiment (Chapter 2);

(ii) The linear structural equations can be used to define causal effects and distinguish correlation from causation (Chapter 3);

(iii) The graphical models (particularly DAGs) can encode conditional independence relations and can be used to represent causality (Chapter 4).
The key idea is to define counterfactuals from a DAG by using nonparametric structural
equations:1
5.1 Definition (NPSEMs). Given a DAG G = (V = [p], E), the random variables
X = X[p] satisfy a nonparametric structural equation model (NPSEM) if the observed
and interventional distributions of X[p] satisfy

Xi = fi(Xpa_G(i), εi), i = 1, . . . , p,

for some functions fi and structural noise variables ε1, . . . , εp.
5.2 Definition (Counterfactuals). Given the above NPSEM, the counterfactual
variables {Xi (XJ = xJ ) | i ∈ [p], J ⊆ [p]} (Xi (XJ = xJ ) is often abbreviated as
Xi (xJ )) are defined as follows:
(i) Basic counterfactuals: For any i ∈ [p], define Xi(xpa(i)) = fi(xpa(i), εi);

(ii) Substantive counterfactuals: For any i ∈ [p] and J ⊆ an(i), recursively define

Xi(XJ = xJ) = Xi( Xk = xk I(k ∈ J) + Xk(xJ∩an(k)) I(k ∉ J), k ∈ pa(i) ).
5.4 Example. Consider the graphical model in Figure 5.1. The basic counterfactuals are X1, X2(x1), and X3(x1, x2). The substantive counterfactuals of X3 are defined using the basic counterfactuals as

X3(x1) = X3(x1, X2(x1)) and X3(x2) = X3(X1, x2).

The other counterfactuals are irrelevant and can be reduced to the basic or substantive counterfactuals. For example, X1(x1) = X1(x2) = X1, X2(x2) = X2, X2(x1, x2) = X2(x1).
Figure 5.1: A DAG with edges X1 → X2, X2 → X3, and X1 → X3.
5.5 Proposition. Consider any disjoint J, K ⊆ V and any i ∈ V . If K blocks all
directed paths from J to i, then
Xi (xJ , xK ) = Xi (xK ).
Proof. This follows from recursive substitution and the following observation: if K blocks all directed paths from J to i, then K also blocks all directed paths from J to pa(i) \ K.
5.6 Proposition (Consistency of counterfactuals). In a NPSEM,

Xi = Xi(Xpa(i)), (5.1)

and more generally XJ = xJ implies X(xJ) = X, in the sense that the event defined on the left hand side is a subset of the event defined on the right hand side.

Proof. By definition, Xi(Xpa(i)) = fi(Xpa(i), εi) = Xi; the general statement then follows from recursive substitution.
If you come from a statistics background, it may be natural to assume that the error variables ε1, . . . , εp are mutually independent. But after translating this assumption into the basic counterfactuals, it may seem rather strong.3
5.7 Definition (Basic counterfactual independence). A NPSEM is said to satisfy the single-world independence assumptions if

For any x[p], the variables Xi(xpa(i)), i ∈ [p], are mutually independent. (5.3)

It is said to satisfy the multiple-world independence assumptions if the sets of variables {Xi(xpa(i)) | xpa(i)}, i ∈ [p], are mutually independent.
5.8 Example. Consider the graph in Figure 5.1. The single-world independence assumptions assert that X1, X2(x1), X3(x1, x2) are mutually independent for any x1 and x2. The multiple-world independence assumptions are

X1 ⊥⊥ {X2(x1) | x1} ⊥⊥ {X3(x1, x2) | x1, x2}.

In particular, the latter imply cross-world independences such as

X2(x1) ⊥⊥ X3(x̃1, x2) for any x1 ≠ x̃1 and any x2.
5.10 Definition (Single-world causal model). We say the random variables X[p]
satisfy a single-world causal model or simply a causal model defined by a DAG
G = (V = [p], E), if X[p] satisfies a NPSEM defined by G and the counterfactuals of
X[p] satisfy the single-world independence assumptions.
Next we introduce a transformation that maps a graphical model for the factual
variables X to a graphical model for the counterfactual variables X(xJ ).
5.11 Definition. The single-world intervention graph (SWIG) G[X(xJ )] (sometimes
abbreviated as G[xJ ]) for the intervention XJ = xJ is constructed from G via the
following two steps:
(i) Node splitting: For every j ∈ J, split the vertex Xj into a random and a fixed component, labelled Xj and xj respectively. The random half inherits all edges into Xj and the fixed half inherits all edges out of Xj.
(ii) Labelling: For every random node Xi in the new graph, label it with Xi (xJ ) =
Xi (xJ∩an(i) ).
5.12 Example. Figure 5.2 shows the SWIGs for the graphical model in Figure 5.1.
Figure 5.2: The SWIGs for the graphical model in Figure 5.1: (a) G[x1, x2], with random vertices X1, X2(x1), X3(x1, x2); (b) G[x1], with random vertices X1, X2(x1), X3(x1); (c) G[x2], with random vertices X1, X2, X3(x2).
The next Theorem states that in a single-world causal model, the counterfactuals
X(xJ ) “factorise” according to the SWIG G[X(xJ )]. “Factorise” is quoted because
G[X(xJ )] has non-random vertices and we have not formally defined a graphical model
for a mixture of random and non-random quantities. In this case, we essentially always
condition on the fixed quantities, so in the graph they block all the paths they are on.
To simplify this, let G ∗ [X(xJ )] be the random part of G[X(xJ )], i.e., the subgraph of
G[X(xJ )] restricted to Xi (xJ ), i ∈ [p]. This is sometimes abbreviated as G ∗ [xJ ]. Thus
G[xJ ] has the same number of edges as G and G ∗ [xJ ] has the same number of vertices as
G.
5.13 Theorem (Factorisation of counterfactual distributions). Suppose X satisfies
the causal model defined by a DAG G, then X(xJ ) factorises according to G ∗ [X(xJ )]
for all J ⊆ [p].
A key step in the proof of Theorem 5.13 is to establish the following Lemma using
induction.
5.14 Lemma. For any k ∉ J ⊆ [p] such that de(k) ⊆ J, any i ∈ [p], xJ and x̃,

P( Xi(xJ, x̃k) = x̃i | Xpa(i)\J\{k}(xJ, x̃k) = x̃pa(i)\J\{k} )
= P( Xi(xJ) = x̃i | Xpa(i)\J(xJ) = x̃pa(i)\J ).
Proof of Lemma 5.14 and Theorem 5.13. To simplify the exposition, let G ∗ [J] denote the
modified graph G ∗ [X(xJ )] with the vertex mapping Xi (xJ ) → i, so G ∗ [J] can be obtained
by removing the outgoing arrows from J in G. Notice that for any i ∈ [p] and J ⊂ [p],
pa G ∗ [J] (i) = pa G (i) \ J.
The single-world independence assumptions in (5.3) mean that the conclusion is true for J = [p]. The Theorem can then be proven by reverse induction from J ∪ {k} ⊆ [p] to J, where k ∉ J and de(k) ⊆ J (Exercise 3.6 shows that such k always exists). By Proposition 5.6 (consistency of counterfactuals),

P(X(xJ) = x̃) = P(X(xJ, x̃k) = x̃), for any x̃.
If i ∈ ch_G(k), we have

P( Xi(xJ) = x̃i | Xpa(i)\J(xJ) = x̃pa(i)\J )
= P( Xi(xJ) = x̃i | Xpa(i)\J\{k}(xJ) = x̃pa(i)\J\{k}, Xk(xJ) = x̃k )
= P( Xi(xJ, x̃k) = x̃i | Xpa(i)\J\{k}(xJ, x̃k) = x̃pa(i)\J\{k}, Xk(xJ, x̃k) = x̃k )
= P( Xi(xJ, x̃k) = x̃i | Xpa(i)\J\{k}(xJ, x̃k) = x̃pa(i)\J\{k} ).
The second equality follows from the consistency of counterfactuals (Proposition 5.6).
The third equality follows from the conditional independence between Xi (xJ∪{k} ) and
Xk (xJ∪{k} ) that follows from the induction hypothesis and the observation that pa G ∗ [J∪{k}] (i)
d-separates i from k in G ∗ [J ∪ {k}].
5.16 Exercise. Consider the causal model defined by the graph in Figure 5.3. Show that Y(a1, a2) ⊥⊥ A1 and Y(a2) ⊥⊥ A2 | A1, X.
Figure 5.3: A sequentially randomised experiment (A1 and A2 are time-varying treatments).
5.17 Example (Continuing from Example 5.12). By using d-separation for the SWIG in Figure 5.2(c), we have X2 ⊥⊥ X3(x2) | X1 for any x2. This conditional independence is
the same as the randomisation assumption (Assumption 2.10) in Chapter 2. We can then
apply Theorem 2.12 with X = X1 , A = X2 , Y = X3 to obtain
P(X3 (x2 ) = x3 | X1 = x1 ) = P(X3 = x3 | X1 = x1 , X2 = x2 ). (5.6)
Next, we give some general results that link counterfactual distributions with factual
distributions. The first Lemma establishes the modularity property of the counterfactual
distribution.4 A proof of this result can be found in the appendix.
5.18 Lemma. Suppose X satisfies the causal model defined by a DAG G = ([p], E). Then for any i ∈ [p], J ⊆ [p] and x̃,

P( Xi(xJ) = x̃i | Xpa(i)\J(xJ) = x̃pa(i)\J )
= P( Xi = x̃i | Xpa(i)\J = x̃pa(i)\J, Xpa(i)∩J = xpa(i)∩J ).
5.20 Theorem. Suppose X satisfies the causal model defined by a DAG G. Then for any J ⊆ [p],

P(X(xJ) = x̃) = ∏_{i=1}^{p} P( Xi = x̃i | Xpa(i)∩J = xpa(i)∩J, Xpa(i)\J = x̃pa(i)\J ).
In practice, we are often more interested in the marginals of X(xJ ). The next result
simplifies the marginalisation.
5.21 Corollary. Suppose X is discrete. For any disjoint I, J ⊆ [p], let K = [p] \ (I ∪ J). Then

P(XI(xJ) = x̃I) = ∑_{x̃K} ∏_{i ∈ I∪K} P( Xi = x̃i | Xpa(i)∩J = xpa(i)∩J, Xpa(i)\J = x̃pa(i)\J ).
Proof. For any i ∈ J, x̃i only appears in the ith term in the product of Theorem 5.20. When marginalising over such x̃i, this term sums to 1 (because it is a conditional density).
The identity in Corollary 5.21 is known as the g-computation formula (or simply the g-formula, g for generalised)5 or the truncated factorisation6.
5.22 Example (Continuing from Example 5.19). By applying Theorem 5.20 with J = {2}, we have

P(X1 = x̃1, X2 = x̃2, X3(x2) = x̃3) = P(X1 = x̃1) P(X2 = x̃2 | X1 = x̃1) P(X3 = x̃3 | X1 = x̃1, X2 = x2),

which is what we would obtain if we directly apply the g-formula (Corollary 5.21). Applying Corollary 5.21 with I = {3} and J = {2} gives the cleaner form

P(X3(x2) = x3) = ∑_{x1} P(X1 = x1) P(X3 = x3 | X1 = x1, X2 = x2).
5.23 Remark. Notice that the formula for P(X3(x2) = x3) above is generally different from the conditional distribution of X3 given X2:

P(X3 = x3 | X2 = x2) = ∑_{x1} P(X1 = x1 | X2 = x2) P(X3 = x3 | X1 = x1, X2 = x2).

This generalises the discussion in Section 3.4 and demonstrates how “correlation does not imply causation”.
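The contrast in Remark 5.23 is easy to check numerically. Below is a minimal Python sketch (mine, not from the notes) for the DAG of Figure 5.1 with binary variables; the conditional probability tables are made up for illustration.

    import numpy as np

    # Made-up probability tables for the DAG X1 -> X2 -> X3 and X1 -> X3.
    p_x1 = {0: 0.6, 1: 0.4}                  # P(X1 = x1)
    p_x2 = {0: 0.7, 1: 0.2}                  # P(X2 = 1 | X1 = x1)
    p_x3 = {(0, 0): 0.1, (0, 1): 0.5,        # P(X3 = 1 | X1, X2)
            (1, 0): 0.4, (1, 1): 0.8}

    def g_formula(x2):
        """P(X3(x2) = 1) = sum_x1 P(X1 = x1) P(X3 = 1 | x1, x2)."""
        return sum(p_x1[x1] * p_x3[(x1, x2)] for x1 in (0, 1))

    def conditional(x2):
        """P(X3 = 1 | X2 = x2), which involves P(X1 = x1 | X2 = x2)."""
        num = den = 0.0
        for x1 in (0, 1):
            p2 = p_x2[x1] if x2 == 1 else 1 - p_x2[x1]
            num += p_x1[x1] * p2 * p_x3[(x1, x2)]
            den += p_x1[x1] * p2
        return num / den

    for x2 in (0, 1):
        print(x2, g_formula(x2), conditional(x2))  # generally different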
5.24 Exercise. Consider the causal model defined by the graph in Figure 5.3. Suppose all the random variables are discrete.

(i) Use the g-formula to show that

P(Y(a1, a2) = y) = ∑_x P(X = x | A1 = a1) P(Y = y | A1 = a1, X = x, A2 = a2). (5.7)

(ii) Derive (5.7) using the conditional independence in Exercise 5.16 and the consistency of counterfactuals (Proposition 5.6).
5.25 Remark. In the identification formula (5.7), the conditional expectation E[Y | A1 = a1, A2 = a2, X = x] is weighted by P(X = x | A1 = a1) instead of the marginal probability P(X = x) as in Example 5.22. This makes (5.7) a non-trivial extension of the simple case with one treatment variable. Intuitively, the dilemma is that, in order to recover the causal effect of A2 on Y, we need to condition on their confounder X. However, conditioning on X blocks the directed path A1 → X → Y and would make the estimated causal effect of A1 on Y biased.
5.4 Causal identification

A major limitation of the results in the last section is that they require all relevant variables in the causal model to be measured. This is rarely the case in practice. Fortunately, in many problems with unmeasured variables it may still be possible to identify the causal effect. To focus on the main ideas, we will assume all the random variables are discrete in the examples below.
5.4.1 Back-door formula

Figure 5.4: It suffices to adjust for X in these scenarios to estimate the average causal effect of A on Y. (a) U confounds the X-A relation; (b) U confounds the X-Y relation. The bottom row shows the corresponding SWIGs G[a].
63
This assumption is usually called the no unmeasured confounders assumption, because we can no longer guarantee it by physically randomising A.
5.27 Remark. A common name for A ⊥⊥ Y(a) | X is treatment assignment ignorability or simply ignorability7. The underlying idea is that A can be treated as a missingness indicator for Y(0) (and 1 − A for Y(1)), and this assumption says that the missingness is “ignorable”. Essentially, this assumption allows us to treat the observational study as if the data came from a randomised experiment (but with an unknown distribution of A given X). Another name is exchangeability8. Our nomenclature emphasises the structural (instead of the statistical) content of the assumption.
5.28 Remark. In observational studies, a widely held belief is that the more pre-treatment covariates are included in X, the more “likely” the assumption Y(a) ⊥⊥ A | X is to be satisfied. However, this is not necessarily true; see Figure 5.5 for two counterexamples.9
Figure 5.5: Counter-examples to the claim that adjusting for all observed variables that temporally precede the treatment would be sufficient. U1 and U2 are unobserved.
5.4.2 Front-door formula
The back-door formula cannot be applied when there are unmeasured confounders between
A and Y . The front-door formula is designed to overcome this problem by decomposing
the causal effect of A on Y into unconfounded mechanisms.10
Figure 5.6: A causal model showing the causal effect of A on Y being entirely mediated by M.
5.31 Example. Consider the causal model in Figure 5.6, where the counterfactuals satisfy the exclusion restriction

Y(a, m) = Y(m) for all a and m, (5.8)

and the independence conditions

Y(m) ⊥⊥ M(a), M(a) ⊥⊥ A, Y(m) ⊥⊥ M | A. (5.9)

The distribution of Y(m) and M(a) can be identified by the back-door formula. Thus

P(Y(a) = y) = ∑_m { ∑_{a′} P(Y = y | M = m, A = a′) P(A = a′) } P(M = m | A = a). (5.10)
There are two key counterfactual relations in Example 5.31. First, the exclusion restriction (5.8) allows us to decompose the effect of A on Y into the product of the effect of A on M and the effect of M on Y. This is possible because M blocks all the directed paths from A to Y, which is often referred to as the front-door condition. Next, the no unmeasured confounders conditions in (5.9) allow us to use back-door adjustment to identify the effect of A on M and the effect of M on Y.
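The front-door formula (5.10) can be verified by simulation. The following Python sketch (mine, not from the notes) simulates a model in which an unmeasured U confounds A and Y, and compares (5.10), computed from the observed (A, M, Y) only, with the ground truth obtained by intervening in the simulation. All parameter values are made up.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1_000_000

    def sim(a_do=None):
        """A -> M -> Y with U confounding A and Y (U is unmeasured)."""
        U = rng.binomial(1, 0.5, n)
        A = rng.binomial(1, 0.2 + 0.6 * U) if a_do is None else np.full(n, a_do)
        M = rng.binomial(1, 0.3 + 0.4 * A)
        Y = rng.binomial(1, 0.1 + 0.5 * M + 0.3 * U)
        return A, M, Y

    A, M, Y = sim()

    def front_door(a):
        """The right hand side of (5.10) for P(Y(a) = 1)."""
        total = 0.0
        for m in (0, 1):
            p_m_a = np.mean(M[A == a] == m)                 # P(M = m | A = a)
            inner = sum(np.mean(Y[(M == m) & (A == a2)]) *  # P(Y = 1 | m, a')
                        np.mean(A == a2) for a2 in (0, 1))  # P(A = a')
            total += inner * p_m_a
        return total

    for a in (0, 1):
        _, _, Y_do = sim(a_do=a)
        print(a, front_door(a), Y_do.mean())  # should approximately agree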
5.4.3 Counterfactual calculus
Back-door and front-door are two graphical conditions for causal identification. More
generally, there are three graphical rules that allow us to simplify counterfactual distribu-
tions:11
(iii) If Y(x, z) ⊥⊥ z [G[x, z]] (z is the fixed half-vertex), then

P(Y(x, z) = y) = P(Y(x) = y).
Figure: The instrumental variable model, in which Z → A → Y and an unmeasured confounder U affects both A and Y.
In the example sheet you will further explore partial identification of the average
treatment effect using instrumental variables without linearity.
5.5 Proofs (non-examinable)

Proof of Lemma 5.18. It suffices to remove the intervention xJ\pa(i) from the right hand side.
By Proposition 5.5, we can first remove any intervention that is not on an ancestor
of i. So without loss of generality we may assume J ⊆ an(i). To achieve our goal, we
first add XJ\pa(i) (xJ\pa(i) ) = xJ\pa(i) to the conditioning event. This does not change
the conditional probability because Xi (xJ\pa(i) ) is d-separated from XJ\pa(i) (xJ\pa(i) ) by
Xpa(i) (xJ\pa(i) ) in the SWIG G[X(xJ\pa(i) )]. We can then remove the intervention xJ\pa(i)
from all the counterfactuals by consistency. Finally, we can remove XJ\pa(i) = xJ\pa(i)
from the conditioning event because Xi is d-separated from XJ\pa(i) by Xpa(i) in G.
Following Richardson and Robins (2013, p. 113), a previous version of the lecture notes suggested that Lemma 5.18 follows from repeatedly applying Lemma 5.14. However, an application of the factorisation property seems needed to remove the interventions on XJ\pa(i).
Notes

1. Pearl, J. (2000). Causality (1st ed.). Cambridge University Press.
2. Malinsky, D., Shpitser, I., & Richardson, T. (2019). A potential outcomes calculus for identifying conditional path-specific effects. Proceedings of Machine Learning Research, 89, 3080.
3. The next two Sections are based on Richardson, T. S., & Robins, J. M. (2013). Single world intervention graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality (tech. rep. No. 128). Center for the Statistics and the Social Sciences, University of Washington Series.
4. The term “modularity” is originally due to Pearl (2000). The notion here is due to Richardson and Robins (2013).
5. Robins, J. (1986). A new approach to causal inference in mortality studies with a sustained exposure period-application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9-12), 1393–1512. doi:10.1016/0270-0255(86)90088-6.
6. Pearl and Verma, 1991.
7. Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.
8. For connections with exchangeability in Bayesian statistics, see Saarela, O., Stephens, D. A., & Moodie, E. E. M. (2020). The role of exchangeability in causal inference. arXiv: 2006.01799 [stat.ME].
9. For an interesting scientific debate on this from different perspectives, see Sjölander, A. (2009). Propensity scores and m-structures. Statistics in Medicine, 28(9), 1416–1420. doi:10.1002/sim.3532, and the reply in the same issue by Rubin.
10. Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4), 669–688. doi:10.1093/biomet/82.4.669.
11. Malinsky et al., 2019.
12. Pearl, 1995.
13. Huang, Y., & Valtorta, M. (2006). Pearl's calculus of intervention is complete. In Proceedings of the twenty-second conference on uncertainty in artificial intelligence (pp. 217–224). UAI'06. Cambridge, MA, USA: AUAI Press; Shpitser, I., & Pearl, J. (2006). Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the 21st national conference on artificial intelligence - volume 2 (pp. 1219–1226). AAAI'06. Boston, Massachusetts: AAAI Press.
14. Wald, A. (1940). The fitting of straight lines if both variables are subject to error. The Annals of Mathematical Statistics, 11(3), 284–300. doi:10.1214/aoms/1177731868.
Chapter 6

Broadly speaking, an empirical study consists of two stages:

(i) Design: Empirical data are collected and preprocessed in an organised way.

(ii) Analysis: A statistical method is then applied to answer the research question.

In statistics lectures (including this course), you spend most of the time learning useful statistical models for data that have already been collected, and how the analysis can be done correctly and optimally. In applications, it is often the opposite: “design trumps analysis”.1
Consider the following argument in a randomised experiment:

(i) Design: Suppose we let half of the patients receive the treatment at random.

(ii) Analysis: Significantly more treated participants have a better outcome.

Compare this with the corresponding argument in a matched observational study:

(i) Design: Suppose the observed patients are pair matched, so that the patients in the same pair have similar demographics and medical history.

(ii) Analysis: In significantly more pairs, the treated patient has a better outcome.
Causal estimator − True causal effect = Design bias + Modelling bias + Statistical noise. (6.1)

This is more than a conceptual statement. To make (6.1) more concrete, let O be all the observed variables (O for observed data) and 𝒪 be the distribution of O. Similarly, let F denote the relevant factuals and counterfactuals in the causal question being asked (F for full data) and ℱ be its distribution. Then (6.1) amounts to the decomposition

β(O[n]; θ̂) − β(ℱ) = {β(𝒪) − β(ℱ)} + {β(𝒪; θ) − β(𝒪)} + {β(O[n]; θ̂) − β(𝒪; θ)}, (6.2)

where β is a generic symbol for a causal effect functional or estimator, O[n] is the observed data of size n, θ is the parameter in a statistical model and θ̂ = θ̂(O[n]) is an estimator of θ.
6.3 Exercise. In Example 6.2, how much is the design bias? How much is the modelling
bias?
6.2 No unmeasured confounders
In this and the next Chapters, we will assume all relevant confounders are measured, so the observational study is mimicking a randomised experiment. More specifically, let Ai ∈ {0, 1} be a binary treatment for individual i, Yi be the outcome of interest with two counterfactuals Yi(0) and Yi(1), and Xi be a p-dimensional vector of covariates. Although unnecessary for randomisation inference, for now we assume (Xi, Ai, Yi(0), Yi(1)), i = 1, . . . , n, are i.i.d. In this setting, subscripts are often suppressed to indicate a generic random variable.
We restate Assumption 2.10, but now with a different name:

6.4 Assumption (No unmeasured confounders). Y(a) ⊥⊥ A | X for a = 0, 1.
Nearest-neighbour matching

Given the distance measure d, this naive method matches a treated observation 1 ≤ i ≤ n1 with its nearest control observation, that is, the control j ∈ {n1 + 1, . . . , n} that minimises d(Xi, Xj). The problem with this method is that one control individual could be matched to several treated individuals, which never happens in a pairwise randomised experiment. We can fix this problem by a greedy algorithm that sequentially matches each treated i to its nearest control neighbour that has not yet been selected. A drawback is that the result will then depend on the order of the input.
Optimal matching

An improvement is optimal matching, which solves the following optimisation problem:

minimise ∑_{i=1}^{n1} ∑_{j=n1+1}^{n} Mij d(Xi, Xj)
subject to Mij ∈ {0, 1}, ∑_{j=n1+1}^{n} Mij = 1 for all i, ∑_{i=1}^{n1} Mij ≤ 1 for all j, (6.4)

where Mij is an indicator for the treated observation i being matched to the control observation j. The last two constraints mean that every treated is matched to exactly one control and every control is matched to at most one treated. Although combinatorial optimisation is generally NP-complete, the optimal matching problem (6.4) can be recast as a network flow problem and solved efficiently in polynomial time.4
6.6 Exercise. Prove (6.5), then show that the propensity score is a balancing score.
Furthermore, show that π(X) can be written as a function of any balancing score b(X).
The propensity score can be estimated from the observational data, commonly by fitting a logistic regression of Ai on Xi. Let the estimated propensity score for individual i be π̂(Xi). A popular distance measure is the squared distance between the estimated propensity scores on the logit scale:

dPS(Xi, Xj) = { log( π̂(Xi) / (1 − π̂(Xi)) ) − log( π̂(Xj) / (1 − π̂(Xj)) ) }².

A caliper is often added, for example by setting the distance to ∞ whenever dPS(Xi, Xj) > τ², where τ > 0 is some tuning parameter. In this case, a treated observation is only allowed to be matched to a control observation whose estimated propensity score differs by no more than τ on the logit scale.
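To illustrate, here is a minimal Python sketch (mine, not from the notes) of greedy nearest-neighbour matching on the logit of an estimated propensity score with a caliper. It assumes scikit-learn is available, and the data and caliper value are made up.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n, p = 500, 3
    X = rng.standard_normal((n, p))
    A = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([0.8, -0.5, 0.3]))))

    # estimated propensity score on the logit scale
    logit_ps = LogisticRegression().fit(X, A).decision_function(X)

    treated = np.flatnonzero(A == 1)
    controls = list(np.flatnonzero(A == 0))
    tau = 0.2                            # caliper on the logit scale

    pairs = []
    for i in treated:
        if not controls:
            break
        d = np.abs(logit_ps[controls] - logit_ps[i])
        j = int(np.argmin(d))
        if d[j] <= tau:                  # match only within the caliper
            pairs.append((i, controls.pop(j)))

    print(len(pairs), "matched pairs out of", len(treated), "treated")

As discussed above, the result depends on the order in which treated units are processed; optimal matching removes this dependence.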
Recall that the logic of randomised experiments (Section 6.1) leaves only two explanations for an observed difference: causality or statistical error. This is reasonable because randomisation balances all pre-treatment covariates in simple Bernoulli trials and pairwise/stratified experiments. In other words, all pre-treatment covariates, measured or unmeasured, have the same distribution in the treated and control groups. Therefore, we cannot attribute any difference in the outcome (beyond some statistical error) to systematic differences in the covariates.
Following this logic, we can assess whether the matching is satisfactory by checking covariate balance. A common measure of covariate imbalance is the standardised covariate difference,6

Bk(M) = { (1/n1) ∑_{i=1}^{n1} ( Xik − ∑_{j=n1+1}^{n} Mij Xjk ) } / √{ (s²_{1k} + s²_{0k}) / 2 }, k = 1, . . . , p,

where Xik is the kth covariate for the ith observation and s²_{1k} and s²_{0k} are the sample variances of Xk in the treated and control groups before matching. A rule of thumb is that the kth covariate Xk is considered approximately balanced if |Bk| < 0.1, but obviously we would like the entire vector B to be as close to 0 as possible.
If the covariate balance is unsatisfactory, a common practice is to rerun the matching algorithm with a different distance measure or remove treated units that have extreme propensity scores. This is often called the “propensity score tautology”.7 In modern optimal matching algorithms, it is possible to include |Bk(M)| ≤ η for all k as a constraint in the combinatorial optimisation problem.
Let

M = { a[2n1] ∈ {0, 1}^{2n1} : ai + a_{n1+i} = 1 for all i ∈ [n1] }

be all the treatment assignments such that, within a matched pair, exactly one observation receives the treatment. Let Ci = (Xi, Yi(0), Yi(1)) and let Di = Yi − Y_{n1+i} be the within-pair difference in observed outcomes.
There are two ways to proceed from here. The first approach is to use the sample average of Di,

D̄ = (1/n1) ∑_{i=1}^{n1} Di,

to estimate E[Y(1) − Y(0) | A = 1], the average treatment effect on the treated (ATT). We will not say more about this estimator other than that it is commonly used in practice but its statistical inference is not as straightforward as one might imagine.8
The second and perhaps more interesting approach is to use a randomisation test to mimic what is done for randomised experiments in Section 2.4. The next assumption mimics Example 2.4.

6.7 Assumption. Conditional on C[2n1] and the event A[2n1] ∈ M, the treatment assignment is uniformly distributed over M, that is, P(A[2n1] = a[2n1] | C[2n1], A[2n1] ∈ M) = (1/2)^{n1} for every a[2n1] ∈ M.

This assumption is satisfied if there are no unmeasured confounders and the (true) propensity scores are exactly matched.
6.8 Proposition. Suppose the data are i.i.d. and Assumption 6.4 is satisfied. Then Assumption 6.7 holds if π(Xi) = π(X_{i+n1}) for all i ∈ [n1].
Assumption 6.7 allows us to apply the randomisation test described in Section 2.4.
Because the observations are matched, it is common to construct test statistics based on
D[n1 ] .
Consider the sharp null hypothesis Hβ : Yi(1) − Yi(0) = β, ∀i, where β is given. Under Hβ and by using the consistency assumption (Assumption 2.6), the counterfactual values of D[n1] can be imputed as

Di(a[2n1]) = (ai − a_{i+n1}) · (Yi(ai) − Y_{i+n1}(a_{i+n1}))
           = Di,        if ai = 1, a_{i+n1} = 0,
           = 2β − Di,   if ai = 0, a_{i+n1} = 1. (6.6)
Consider any test statistic T = T(D[n1]). Next we construct a randomisation test based on the randomisation distribution of T(D[n1](A[2n1])). Let F(t) denote its cumulative distribution function given C[2n1] and A[2n1] ∈ M under Hβ,

F(t; D[n1], β) = P( T ≤ t | C[2n1], A[2n1] ∈ M, Hβ ) = ∑_{a[2n1] ∈ M} (1/2)^{n1} · I( T(D[n1](a[2n1])) ≤ t ). (6.7)

We then compute the p-value for this randomisation test as P2 = F(T) and reject the hypothesis Hβ if P2 is less than a significance threshold 0 < α < 1. Following the same argument as in the proof of Theorem 2.19, this is a valid test of Hβ.
6.10 Theorem. Under Assumptions 2.6 and 6.7, P(P2 ≤ α) ≤ α under Hβ for all 0 < α < 1.
6.11 Example (Signed score statistic). Let ψ : [0, 1] → R+ be a positive function on the unit interval. The signed score statistic is defined as

Tψ(D[n1]) = ∑_{i=1}^{n1} sgn(Di) ψ( rank(|Di|) / (n1 + 1) ), (6.8)

where sgn is the sign function and rank(|Di|) is the rank of the absolute difference |Di| among |D1|, . . . , |Dn1|. The widely used Wilcoxon signed rank statistic corresponds to the choice ψ(t) = (n1 + 1)t.
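A minimal Python sketch (mine, not from the notes) of this paired randomisation test with the Wilcoxon choice of ψ. The exact distribution (6.7) sums over all 2^{n1} assignments; the sketch approximates it by Monte Carlo sign flips, which is common practice for moderate n1.

    import numpy as np

    rng = np.random.default_rng(0)
    n1 = 50
    D = rng.standard_normal(n1) + 0.3     # made-up within-pair differences

    def t_psi(d):
        """Signed score statistic (6.8) with psi(t) = (n1 + 1) t (Wilcoxon)."""
        ranks = np.argsort(np.argsort(np.abs(d))) + 1
        return np.sum(np.sign(d) * ranks)

    def randomisation_p_value(D, beta=0.0, n_draws=10_000):
        """Monte Carlo p-value for the sharp null H_beta."""
        d0 = D - beta                     # imputed differences under H_beta
        t_obs = t_psi(d0)
        # under Assumption 6.7, the signs of d0 are i.i.d. +/-1 with prob. 1/2
        signs = rng.choice([-1, 1], size=(n_draws, len(D)))
        t_null = np.array([t_psi(s * np.abs(d0)) for s in signs])
        return np.mean(t_null >= t_obs)   # one-sided p-value

    print(randomisation_p_value(D, beta=0.0))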
6.12 Remark. We have been following the second approach of randomisation test in
Section 2.4. We can also follow the first approach by considering the distribution of
T (A[2n1 ] , Y[2n1 ] (0)), although this is less intuitive because A[2n1 ] = 0 is not in the allowed
set M .
6.13 Exercise. Consider the signed score statistic in (6.8), treated as a function of (A[2n1], Y[2n1]). Derive the randomisation test based on the randomisation distribution of T(A[2n1], Y[2n1](0)) and show that, given Hβ and conditioning on C[2n1], T(A[2n1], Y[2n1](0)) given A[2n1] ∈ M is distributed as

∑_{i=1}^{n1} Si ψ( rank(|Yi(0) − Y_{n1+i}(0)|) / (n1 + 1) ), (6.9)

where Si = (Ai − A_{i+n1}) · sgn(Yi(0) − Y_{i+n1}(0)) takes the values ±1 with probability 1/2 each. Justify this test using the symmetry of Di − β under Assumption 6.7 and Hβ. Establish the equivalence between P1 and P2 (see also Exercise 2.21).
Equation (6.9) is indeed the more commonly used test because it is more computa-
tionally friendly. (Why?)
Notes

1. Rubin, D. B. (2008). For objective causal inference, design trumps analysis. The Annals of Applied Statistics, 2(3), 808–840. doi:10.1214/08-aoas187.
2. Zhao, Q., Keele, L. J., & Small, D. S. (2019). Comment: Will competition-winning methods for causal inference also succeed in practice? Statistical Science, 34(1), 72–76. doi:10.1214/18-sts680.
3. Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1), 1–21. doi:10.1214/09-sts313.
4. Rosenbaum, P. R. (2020). Modern algorithms for matching in observational studies. Annual Review of Statistics and Its Application, 7(1), 143–176. doi:10.1146/annurev-statistics-031219-041058.
5. This was proposed in one of the most cited statistics papers, by Rosenbaum and Rubin, 1983.
6. Rosenbaum, P. R., & Rubin, D. B. (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39(1), 33–38. doi:10.1080/00031305.1985.10479383.
7. Imai, K., King, G., & Stuart, E. A. (2008). Misunderstandings between experimentalists and observationalists about causal inference. Journal of the Royal Statistical Society: Series A (Statistics in Society), 171(2), 481–502. doi:10.1111/j.1467-985x.2007.00527.x.
8. Abadie, A., & Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1), 235–267. doi:10.1111/j.1468-0262.2006.00655.x.
Chapter 7

As in the last Chapter, we will focus on the case of a binary treatment A and assume (Xi, Ai, Yi(0), Yi(1)), i ∈ [n], are i.i.d. We maintain the assumptions of no unmeasured confounders (Assumption 6.4), consistency (Assumption 2.6), and positivity:

7.1 Assumption (Positivity). 0 < P(A = 1 | X = x) < 1 for all x.
Figure 7.1: A single-world causal model with measured covariates X, treatment A and outcome Y.
7.2 Remark. We have seen in Chapter 5 that no unmeasured confounders and consistency
necessarily follow from assuming the single-world causal model corresponding to Figure 7.1.
Assumption 7.1 is also called the overlap assumption because, by Bayes' rule, it is equivalent to assuming that the conditional distribution of X given A = a has the same support for all a.

By Theorem 2.12, we have

ATE = E[Y(1) − Y(0)] = E{ E[Y | A = 1, X] − E[Y | A = 0, X] }. (7.1)
Our goal is to estimate the right hand side of the above equation, a functional of the observed data distribution.
This is where semiparametric inference becomes useful. A semiparametric model is a
statistical model with parametric and nonparametric components.
7.3 Example. An example of a semiparametric model is the partially linear model

E[Y | A, X] = β · A + g(X),
where β and g(·) are unknown. Other well known examples include single index models,
varying coefficient models, and Cox’s proportional hazards model in survival analysis.
Semiparametric inference is mainly concerned with estimating and making inference
for the (finite-dimensional) parametric component. This is called the parameter of interest,
and the (infinite-dimensional) nonparametric component is called the nuisance parameter.
An alternative setup is to consider estimating a functional β(P) using an i.i.d. sample D1, . . . , Dn from a distribution P that is known to belong to a set 𝒫 of probability distributions. This is very general and is well suited for causal inference problems: the
causal identification theory (Section 5.4) often equates a causal effect of interest with a
functional of the observed variables; an example is (7.1).
Formally deriving the semiparametric inference theory is beyond the scope of this
course.1 Below we will just informally describe some key results in this theory.
Roughly speaking, semiparametric inference provides a theory for well-behaved (so-called regular) estimators2 that admit the so-called asymptotic linear expansion:

√n (β̂ − β) = (1/√n) ∑_{i=1}^{n} ψβ(Di) + op(1), (7.2)

where the influence function ψβ(·) has mean 0 and finite variance, and op(1) means that the residual converges to 0 in probability as n → ∞. The asymptotic linearity (7.2) implies that β̂ has an asymptotic normal distribution:

√n (β̂ − β) →d N( 0, Var(ψβ(D)) ).
The influence function with the smallest variance, denoted ψβ,eff(·), is called the efficient influence function. Semiparametric inference theory gives a geometric characterisation of the space of influence functions. A key conclusion is that ψβ,eff(·) can be obtained by projecting any influence function onto the so-called tangent space consisting of all score functions of the model. In consequence, Var(ψβ,eff(D)) ≤ Var(ψβ(D)). This generalises the Cramér-Rao lower bound and the asymptotic efficiency of the maximum likelihood estimator from parametric models to semiparametric models.
7.4 Example. Following the proof of Lemma 2.26, a Z-estimator is generally asymptotically linear and its influence function is given in (2.20).

7.5 Exercise. Derive the influence function for the regression estimator β̂1 in (2.11). Verify that it has mean 0.
7.2 Discrete covariates

The semiparametric inference theory is very general and abstract. To obtain some intuition for our problem, we first consider the case of discrete covariates X and the estimation of

βa = E{ E[Y | A = a, X] } = ∑_x µa(x) P(X = x), a = 0, 1, (7.3)

where µa(x) = E[Y | A = a, X = x]. With discrete X, the nuisance parameters can be estimated by their empirical versions,

µ̂a(x) = ∑_{i=1}^{n} Yi I(Ai = a, Xi = x) / ∑_{i=1}^{n} I(Ai = a, Xi = x), P̂(X = x) = (1/n) ∑_{i=1}^{n} I(Xi = x).

The first estimator is well defined if the denominator is non-zero, an event with probability tending to 1 as n → ∞. By plugging these into (7.3), we obtain the outcome regression (OR) estimator

β̂a,OR = ∑_x µ̂a(x) P̂(X = x) = (1/n) ∑_x ∑_{i=1}^{n} µ̂a(x) I(Xi = x) = (1/n) ∑_{i=1}^{n} µ̂a(Xi). (7.4)
7.6 Remark. Since (7.4) does not depend on the form of µ̂a(x), the outcome regression estimator can be easily extended to the case of continuous X.

Next, we analyse the asymptotic behaviour of β̂a,OR with discrete X. This does not follow trivially from the central limit theorem because the summands in (7.4) are not independent. To solve this problem, we derive an alternative representation of the outcome regression estimator. Recall that πa(x) = P(A = a | X = x), which can be estimated by

π̂a(x) = ∑_{i=1}^{n} I(Ai = a, Xi = x) / ∑_{i=1}^{n} I(Xi = x).
Because A is binary, π0(x) = 1 − π1(x). Note that π1(x) = π(x), the propensity score defined in Section 6.3. The inverse probability weighted (IPW) estimator3 is given by

β̂a,IPW = (1/n) ∑_{i=1}^{n} [ I(Ai = a) / π̂a(Xi) ] Yi. (7.6)
The name is derived from the form of the estimator. The average treatment effect can subsequently be estimated by

β̂IPW = β̂1,IPW − β̂0,IPW = (1/n) ∑_{i=1}^{n} [ Ai / π̂(Xi) − (1 − Ai) / (1 − π̂(Xi)) ] Yi. (7.7)
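For concreteness, the following Python sketch (mine, not from the notes) computes β̂a,OR and β̂a,IPW with a discrete covariate and illustrates Proposition 7.7 below: with empirical nuisance estimates the two estimators coincide. The data-generating numbers are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000
    X = rng.integers(0, 3, n)                    # a discrete covariate
    A = rng.binomial(1, 0.3 + 0.2 * (X == 1))
    Y = 1.0 * A + 0.5 * X + rng.standard_normal(n)

    def estimators(a):
        # empirical nuisance estimates within each level of X
        mu_hat = {x: Y[(A == a) & (X == x)].mean() for x in np.unique(X)}
        pi_hat = {x: np.mean(A[X == x] == a) for x in np.unique(X)}
        beta_or = np.mean([mu_hat[x] for x in X])
        beta_ipw = np.mean((A == a) * Y / np.array([pi_hat[x] for x in X]))
        return beta_or, beta_ipw

    print(estimators(1))   # the two numbers agree exactly
    print(estimators(0))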
7.7 Proposition. Suppose X is discrete and π̂a(x) > 0 for all a and x. Then β̂a,OR = β̂a,IPW, a = 0, 1, and β̂OR = β̂IPW.

Notice that the summands in (7.7) are still not independent. To break through this impasse, a key observation is the following.

7.8 Lemma. Under the same assumptions as in Proposition 7.7, the identity

(1/n) ∑_{i=1}^{n} [ I(Ai = a) / π̂a(Xi) ] µ(Xi) = (1/n) ∑_{i=1}^{n} µ(Xi), a = 0, 1,

holds for any function µ(·).
Intuitively, Lemma 7.8 says that the distribution of X in the entire sample can be recovered from the A = a subsample by inverse probability weighting. Combining Proposition 7.7 and Lemma 7.8, one can show that β̂a,OR is asymptotically linear with influence function

ψβa(D) = [ I(A = a) / πa(X) ] [Y − µa(X)] + µa(X) − βa.

The asymptotic results are summarised in the next Theorem.
7.10 Theorem. Suppose the positivity assumption (Assumption 7.1) holds. Under i.i.d. sampling with discrete X and regularity conditions, we have

√n (β̂a,OR − βa) →d N( 0, Var(ψβa(Di)) ), a = 0, 1,

and

√n (β̂OR − β) →d N( 0, Var(ψβ1(Di) − ψβ0(Di)) ).
7.11 Remark. In the discrete X case, ψβa (D) is the only influence function for estimating
βa because we are considering the nonparametric model and the tangent space contains
all square-integrable functions with mean 0. As a consequence, ψβa (D) is also the efficient
influence function. This last conclusion is still true when X contains continuous covariates,
although there are many other possible influence functions.4
7.12 Exercise. Derive the analogous results for estimating the average treatment effect on the treated, ATT = E[Y(1) − Y(0) | A = 1].

(iv) Complete the asymptotic theory for estimating the ATT with discrete X.
When X contains continuous covariates, Remark 7.6 suggests that the OR and IPW estimators can still be applied by plugging in empirical estimates of the nuisance parameters µa(x) = E[Y | A = a, X = x] and πa(x) = P(A = a | X = x). However, unlike the discrete X case, the OR and IPW estimators are generally different. Intuitively, these estimators are reasonable because of the following dual representation of βa:

βa = E[µa(X)] = E[ I(A = a) Y / πa(X) ], a = 0, 1,

and thus

β = E[µ1(X) − µ0(X)] = E[ A Y / π(X) − (1 − A) Y / (1 − π(X)) ].
Next we will combine the OR estimator and the IPW estimator to obtain a more efficient and robust estimator. The idea is to use the efficient influence function ψβa(D) derived in Section 7.2, which can be written as ψβa(D) = ma(D; η) − βa for the nuisance parameters η = (µa, πa), where

ma(D; η) = [ I(A = a) / πa(X) ] ( Y − µa(X) ) + µa(X). (7.8)

Let µ̂a(·) be an estimator of µa(·) and π̂a(·) be an estimator of πa(·). Because influence functions have mean 0, the above representation motivates the estimator

β̂a,DR = (1/n) ∑_{i=1}^{n} ma(Di; µ̂a, π̂a). (7.9)
A straightforward calculation shows that, for any functions µ̃a(·) and π̃a(·),

βa = E[ma(Di; µa, πa)] = E[ma(Di; µa, π̃a)] = E[ma(Di; µ̃a, πa)].

This shows that if either µa(·) or πa(·) is consistently estimated, the estimator β̂a,DR is generally consistent for βa. This is why we call β̂a,DR the doubly robust estimator.
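The following Python sketch (mine, not from the notes) computes the doubly robust estimate (7.9) of the average treatment effect with simple two-fold cross-fitting, using logistic and linear regressions from scikit-learn as the nuisance estimators; any other regression method could be substituted.

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    def dr_ate(X, A, Y, n_folds=2, seed=0):
        """Doubly robust (AIPW) estimate of E[Y(1) - Y(0)] with cross-fitting."""
        n = len(Y)
        folds = np.random.default_rng(seed).integers(0, n_folds, n)
        psi = np.zeros(n)
        for k in range(n_folds):
            tr, te = folds != k, folds == k
            pi = LogisticRegression().fit(X[tr], A[tr]).predict_proba(X[te])[:, 1]
            pi = np.clip(pi, 0.01, 0.99)        # guard against extreme weights
            mu = {a: LinearRegression()
                     .fit(X[tr][A[tr] == a], Y[tr][A[tr] == a])
                     .predict(X[te]) for a in (0, 1)}
            m1 = A[te] / pi * (Y[te] - mu[1]) + mu[1]              # (7.8), a = 1
            m0 = (1 - A[te]) / (1 - pi) * (Y[te] - mu[0]) + mu[0]  # (7.8), a = 0
            psi[te] = m1 - m0
        return psi.mean(), psi.std() / np.sqrt(n)  # estimate and standard error

    rng = np.random.default_rng(1)
    n = 2000
    X = rng.standard_normal((n, 2))
    A = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
    Y = 1.0 * A + X.sum(axis=1) + rng.standard_normal(n)
    print(dr_ate(X, A, Y))                       # close to the true ATE of 1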
The doubly robust estimator is also useful when the nuisance parameters are estimated using flexible machine learning methods. To see this, we examine the residual term in the asymptotic linear expansion of β̂a,DR:

Rn = √n (β̂a,DR − βa) − (1/√n) ∑_{i=1}^{n} ψβa(Di) = (1/√n) ∑_{i=1}^{n} { ma(Di; µ̂a, π̂a) − ma(Di; µa, πa) }.

We cannot immediately conclude that Rn →p 0 using the law of large numbers, because the summands on the right hand side are not i.i.d. (µ̂a and π̂a are obtained using D[n]). There are two ways to resolve this issue. First, if the models we use for µa and πa are not too complex (they are in the so-called Donsker function class)7, one can deduce
from the empirical process theory that the dependence of µ̂a and π̂a on the data can be ignored:

Rn ≈ √n E_{Dn+1}[ ma(Dn+1; η̂) − ma(Dn+1; η) ],

where E_{Dn+1} indicates that the expectation is taken over a new and independent observation Dn+1. Another approach is to use sample splitting, so that the nuisance parameters are estimated using an independent subsample.8 This technique allows us to avoid restricting the complexity of the nuisance models and has become popular with machine learning methods (sometimes under the name “cross-fitting”).
Using (7.8) and by taking the expectation over (An+1, Yn+1) given Xn+1, it can be shown that

Rn ≈ √n E_{Xn+1}[ {π̂a(Xn+1) − πa(Xn+1)} {µ̂a(Xn+1) − µa(Xn+1)} / π̂a(Xn+1) ]. (7.10)

In other words, the residual term Rn depends on the product of the estimation errors of π̂a and µ̂a.
7.19 Exercise. Derive (7.10) and use it to (informally) prove the double robustness of β̂a,DR.
Define the mean squared error (MSE) of µ̂a(·) as

MSE(µ̂a(·)) = ( E_{Xn+1}[ {µ̂a(Xn+1) − µa(Xn+1)}² ] )^{1/2}.

Similarly, we may define the MSE of π̂a(·). By applying the Cauchy-Schwarz inequality to (7.10), we obtain

RHS of (7.10) ≤ √n · sup_x π̂a(x)^{-1} · MSE(µ̂a(·)) · MSE(π̂a(·)).
7.20 Lemma. Under i.i.d. sampling and mild regularity conditions, the residual term Rn →p 0 if

(i) there exists C > 0 such that P(π̂a(x) ≥ C, ∀x) → 1 as n → ∞; and

(ii) √n MSE(µ̂a(·)) · MSE(π̂a(·)) →p 0 as n → ∞.
Suppose π̂0 (x) = 1 − π̂(x). By combining the previous results, we obtain the next
Theorem.
7.21 Theorem (Semiparametric efficiency of the DR estimator). Under i.i.d. sampling and mild regularity conditions, suppose there exists C > 0 such that

P( C ≤ π̂(x) ≤ 1 − C, ∀x ) → 1 as n → ∞. (7.11)

Furthermore, suppose

√n max{ MSE(µ̂0(·)), MSE(µ̂1(·)) } · MSE(π̂(·)) →p 0 as n → ∞. (7.12)

Then β̂DR = β̂1,DR − β̂0,DR is asymptotically linear with the efficient influence function, so

√n (β̂DR − β) →d N( 0, Var(ψβ1(D) − ψβ0(D)) ).
7.22 Remark. The condition (7.11) highlights the role of the positivity assumption and is needed because of the weighting by the inverse of π̂a(X). It is satisfied, for example, if π(x) is bounded away from 0 and 1 and π̂(·) is consistent. The condition (7.12) is satisfied if both MSE(µ̂a(·)) and MSE(π̂a(·)) are op(n^{−1/4}), for a = 0, 1.
We have covered several statistical methods for observational studies with no unmeasured confounders. Each method has its own strengths and weaknesses and may be preferable in different practical problems.

Outcome regression

• Advantages: can reach the parametric Cramér-Rao bound (smaller than the semiparametric bound); can easily incorporate machine learning methods.
Doubly robust estimator
• Advantages: doubly robust; modelling bias is reduced; can reach the semiparametric
efficiency bound.
Notes

1. For the general theory, see Van der Vaart, A. W. (2000). Asymptotic statistics. Cambridge University Press, Chapter 25. For a less formal treatment with examples in causal inference, see Tsiatis, A. (2007). Semiparametric theory and missing data. Springer.
2. The regularity condition is needed to rule out estimators (for example, Hodges' estimator) that are “super-efficient” at some parameter values but have erratic behaviours nearby. See Van der Vaart, 2000, Example 8.1.
3. This is also called the Horvitz-Thompson estimator, which was first proposed in survey sampling; see Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260), 663–685. doi:10.1080/01621459.1952.10483446.
4. Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90(429), 106–121. doi:10.1080/01621459.1995.10476493. See also Tsiatis, 2007, Chapters 7 and 13.
5. Dorie, V., Hill, J., Shalit, U., Scott, M., & Cervone, D. (2019). Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, 34(1), 43–68. doi:10.1214/18-sts667.
6. Keele, L., & Small, D. (2018). Comparing covariate prioritization via matching to machine learning methods for causal inference using five empirical applications. arXiv: 1805.03743 [stat.AP].
7. Van der Vaart, 2000, Chapter 19.
8. This idea is originally due to Hájek, J. (1962). Asymptotically most powerful rank-order tests. The Annals of Mathematical Statistics, 33(3), 1124–1147. doi:10.1214/aoms/1177704476. See also Van der Vaart, 2000, Section 25.8.
9. For example, see Robins, J. M., Hernán, M. Á., & Brumback, B. (2000). Marginal structural models and causal inference in epidemiology. Epidemiology, 11(5), 550–560. doi:10.1097/00001648-200009000-00011.
10. Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4), 1161–1189. doi:10.1111/1468-0262.00442.
11. Zhao, Q. (2019). Covariate balancing propensity score by tailored loss functions. The Annals of Statistics, 47(2), 965–993. doi:10.1214/18-aos1698.
12. Kang, J. D. Y., & Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4), 523–539. doi:10.1214/07-sts227.
13. Kang and Schafer, 2007.
Chapter 8
Sensitivity analysis
8.1 A roadmap
(i) Model augmentation: Specify a family of distributions Fθ,η for the full data F of all the relevant factuals and counterfactuals, where η is a sensitivity parameter. It is customary to let η = 0 correspond to the case of no unmeasured confounders.

(ii) Statistical inference: Test a causal hypothesis regarding Fθ,η or estimate a causal effect β(Fθ,η). This is usually done in one of the two following senses: for a fixed value of η, or uniformly over a range of values of η.

(iii) Interpretation: Assess the strength of evidence by examining how sensitive the conclusions are to unmeasured confounders. This typically involves finding the “tipping point” η and making a heuristic interpretation.
Below we will give two sensitivity analysis methods that rely on different model
augmentations and provide different statistical guarantees.
We first describe a sensitivity analysis that applies to randomisation inference for matched observational studies. Recall the setting in Section 6.5 where observations 1, . . . , n1 are treated and matched to control observations n1 + 1, . . . , 2n1, respectively. Suppose the data are i.i.d. and denote

πi = P(Ai = 1 | Ci), i ∈ [2n1],

where Ci = (Xi, Yi(0), Yi(1)). The following sensitivity model is widely used in observational studies.1

8.1 Assumption (Rosenbaum's sensitivity model). For a given Γ ≥ 1,

1/Γ ≤ [ πi (1 − π_{n1+i}) ] / [ π_{n1+i} (1 − πi) ] ≤ Γ, for all i ∈ [n1]. (8.1)

The odds ratio bound (8.1) further implies that

1/(1 + Γ) ≤ P( Ai = 1, A_{n1+i} = 0 | C[2n1], Ai + A_{n1+i} = 1 ) ≤ Γ/(1 + Γ). (8.2)
8.2 Remark. An alternative and perhaps more intuitive formulation of Rosenbaum's sensitivity model is the following. Suppose there exists an unmeasured confounder U ∈ [0, 1] so that A ⊥⊥ {Y(0), Y(1)} | X, U. If we let πi = P(Ai = 1 | Xi, Ui), the sensitivity model (8.1) is equivalent to assuming the logistic regression model

logit(πi) = g(Xi) + (log Γ) Ui

for some arbitrary function g.
Next we consider the randomisation distribution of the signed score statistic (6.8) under Rosenbaum's sensitivity model. Following Exercise 6.13, given Hβ and conditioning on C[2n1], T(A[2n1], Y[2n1](0)) given A[2n1] ∈ M is distributed as

∑_{i=1}^{n1} Si ψ( rank(|Yi(0) − Y_{n1+i}(0)|) / (n1 + 1) ),

where, under Assumption 8.1, the signs Si are independent with P(Si = 1) ∈ [1/(1 + Γ), Γ/(1 + Γ)] by (8.2).
8.4 Theorem. Suppose Assumption 8.1 holds and we are using the signed score statistic (6.8). Given Hβ,

T(A[2n1], Y[2n1](0)) ⪰ ∑_{i=1}^{n1} Si− ψ( rank(|Di − β|) / (n1 + 1) ), (8.4)

where ⪰ denotes stochastic ordering and Si−, i ∈ [n1], are independent signs with P(Si− = 1) = 1/(1 + Γ) = 1 − P(Si− = −1).

Proof. This Theorem follows from noticing |Di − β| = |Yi(0) − Y_{n1+i}(0)| and the following property of stochastic ordering: if Xi ⪰ Yi for i ∈ [n] and Xi ⊥⊥ Xj, Yi ⊥⊥ Yj for all i ≠ j, then ∑_{i=1}^{n} Xi ⪰ ∑_{i=1}^{n} Yi.
8.5 Exercise. For the sign statistic ψ(t) ≡ 1, derive an asymptotic p-value based on a
central limit theorem for the bounding variable.
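To show how the bound (8.4) is used, the following Python sketch (mine, not from the notes) computes a Monte Carlo upper bound on the one-sided p-value of the Wilcoxon-type statistic at a given Γ; the matched-pair differences are simulated.

    import numpy as np

    rng = np.random.default_rng(0)
    n1 = 100
    D = rng.standard_normal(n1) + 0.5     # made-up matched-pair differences

    def rosenbaum_p_bound(D, gamma, beta=0.0, n_draws=20_000):
        """Worst-case one-sided p-value for H_beta at sensitivity Gamma."""
        d = np.abs(D - beta)
        ranks = np.argsort(np.argsort(d)) + 1
        t_obs = np.sum(np.sign(D - beta) * ranks)
        # the upper bounding variable uses P(S_i = +1) = Gamma / (1 + Gamma)
        p_plus = gamma / (1 + gamma)
        S = np.where(rng.random((n_draws, n1)) < p_plus, 1, -1)
        t_upper = S @ ranks
        return np.mean(t_upper >= t_obs)

    for gamma in (1.0, 1.5, 2.0, 3.0):
        print(gamma, rosenbaum_p_bound(D, gamma))  # bound grows with Gamma

The “tipping point” Γ at which the bound crosses the significance threshold is then reported as the sensitivity of the study.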
8.3 Sensitivity analysis in semiparametric inference

Consider the decomposition

E[Y(a)] = E{ E[Y(a) | X] }
= E{ E[Y(a) | A = a, X] P(A = a | X) } + E{ E[Y(a) | A = 1 − a, X] P(A = 1 − a | X) }
= E{ E[Y | A = a, X] · πa(X) } + E{ E[Y(a) | A = 1 − a, X] · π1−a(X) },

where the first term is identifiable from the observed data but the second is not. This motivates us to specify the contrast between the identifiable and non-identifiable counterfactual quantities as a sensitivity parameter δa(x):2
8.6 Exercise. Show that the design bias for estimating the average treatment effect is given by

bias = E[ (1 − π(X)) δ1(X) + π(X) δ0(X) ].

Given the functions δ0(x) and δ1(x), estimating E[Y(1) − Y(0)] becomes another semiparametric inference problem. For example, we can estimate the design bias by plugging in the estimated propensity score:

bias^ = (1/n) ∑_{i=1}^{n} { (1 − π̂(Xi)) δ1(Xi) + π̂(Xi) δ0(Xi) }.
8.7 Exercise. Suggest an outcome regression estimator and a doubly robust estimator
in this setting.
Notes

1. Rosenbaum, P. R. (1987). Sensitivity analysis for certain permutation inferences in matched observational studies. Biometrika, 74(1), 13–26. doi:10.1093/biomet/74.1.13.
2. Robins, J. M. (1999). Association, causation, and marginal structural models. Synthese, 121, 151–179. doi:10.1023/a:1005285815569.
Chapter 9
Instrumental variables
Causal estimator − True causal effect = Design bias + Modelling bias + Statistical noise.
In the last three Chapters about observational studies, we assumed that the unmeasured
confounders are non-existent or have a limited strength. In other words, this amounts to
assuming that the design bias is zero or determined by the sensitivity model. Besides
trying to measure all the confounders, is it possible to design the observational study
cleverly to reduce the design bias?
To overcome unmeasured confounders, the key idea is leveraging specificity of the causal
structure.
Specificity is one of Bradford Hill’s nine principles1 for causality in epidemiological
studies:
One reason, needless to say, is the specificity of the association, the third
characteristic which invariably we must consider. If, as here, the association
is limited to specific workers and to particular sites and types of disease and
there is no association between the work and other modes of dying, then
clearly that is a strong argument in favour of causation.
Hill is the coauthor of a landmark observational study on smoking and lung cancer2 .
Somewhat ironically, smoking has many detrimental health effects and is often used as a
counterexample to the specificity principle. Nonetheless, specificity unifies several causal
inference approaches that do not require the assumption of no unmeasured confounders.
In graphical terminology, specificity refers to the lack of certain causal pathways. One
classical example is the use of instrumental variables, which will be a central topic in
this Chapter. In Exercise 5.35, it is shown that in a linear structural equation model
corresponding to Figure 9.1, the causal effect of A on Y can be identified by the Wald
ratio Cov(Z, Y )/ Cov(Z, A). This relies on two structural specificities in Figure 9.1: Z is
independent of U , and Z has no direct effect on Y .
Figure 9.1: The basic instrumental variable diagram, with instrument Z, treatment A, outcome Y, and unmeasured confounder U.
Figure 9.2: A causal diagram with treatment A, outcome Y, and a negative control outcome W.
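A quick simulation can illustrate why these two specificities matter. The sketch below (all parameter values are made up for illustration) generates data from a linear SEM compatible with Figure 9.1 and compares the confounded regression slope with the Wald ratio:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

U = rng.normal(size=n)                      # unmeasured confounder
Z = rng.normal(size=n)                      # instrument, independent of U
A = 0.8 * Z + U + rng.normal(size=n)        # treatment
Y = 2.0 * A - 1.5 * U + rng.normal(size=n)  # true causal effect of A is 2.0

naive = np.cov(A, Y)[0, 1] / np.var(A, ddof=1)  # biased by U
wald = np.cov(Z, Y)[0, 1] / np.cov(Z, A)[0, 1]  # Wald ratio, close to 2.0
print(naive, wald)
```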
9.1 Exercise. Consider the causal diagram in Figure 9.2. Suppose the negative control outcome W has the same confounding bias as Y in the following sense:
$$E[Y(0) \mid A = 1] - E[Y(0) \mid A = 0] = E[W \mid A = 1] - E[W \mid A = 0].$$
Explain when this assumption may be satisfied, and use it to show that the average treatment effect on the treated is identified by the so-called difference-in-differences estimator:
$$E[Y(1) - Y(0) \mid A = 1] = \bigl(E[Y \mid A = 1] - E[Y \mid A = 0]\bigr) - \bigl(E[W \mid A = 1] - E[W \mid A = 0]\bigr).$$
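A sketch of this estimator, assuming the two-group formulation displayed above (`did_att` is a hypothetical helper):

```python
import numpy as np

def did_att(y, w, a):
    """Difference-in-differences estimate of the ATT using a negative
    control outcome: the treated-vs-control contrast in Y, minus the
    same contrast in W, which estimates the shared confounding bias."""
    y_contrast = y[a == 1].mean() - y[a == 0].mean()
    w_contrast = w[a == 1].mean() - w[a == 0].mean()
    return y_contrast - w_contrast
```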
Among all the study designs that leverage specificity, the method of instrumental variables
(IV) has the longest history and is the most well established.
Figure 9.3: Estimating price elasticity using instrumental variables. The price elasticity of supply (demand) at (P1, Q1) is defined as the slope of the price-supply (price-demand) curve.
See Figure 9.4 for an example with one treatment and two IVs. The other linear structural equations can be derived from Definition 3.8 and are omitted. In fact, we do not need the other equations to be structural (causal) in the derivations below.
Figure 9.4: An example showing two IVs, Z1 and Z2, and one treatment A. There is one measured confounder X and one unmeasured confounder U. A bi-directional arrow indicates that the variables can be dependent.
By using $Z \perp\!\!\!\perp \{U, \epsilon_A, \epsilon_Y\}$, we obtain from (9.1) and (9.2) that
$$E[A \mid Z, X] = \tilde\beta_{0A} + \beta_{ZA}^T Z + \tilde\beta_{XA}^T X, \tag{9.3}$$
$$E[Y \mid Z, X] = \tilde\beta_{0Y} + \beta_{AY}\, E[A \mid Z, X] + \tilde\beta_{XY}^T X. \tag{9.4}$$
Equations (9.3) and (9.4) motivate the two-stage least squares estimator, as sketched in the code below:
(i) Estimate E[A | Z, X] by a least squares regression of A on Z and X. Let the fitted model be Ê[A | Z, X].
(ii) Fit another regression of Y on Ê[A | Z, X] and X by least squares, and let β̂AY be the coefficient of Ê[A | Z, X].
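A minimal sketch of the two steps with plain least squares (the function is illustrative, not a full implementation):

```python
import numpy as np

def two_stage_least_squares(Y, A, Z, X):
    """Steps (i) and (ii) above: regress A on (1, Z, X), then regress
    Y on (1, A_hat, X) and return the coefficient of A_hat."""
    n = len(Y)
    ones = np.ones((n, 1))
    D1 = np.hstack([ones, Z, X])                        # stage (i) design
    A_hat = D1 @ np.linalg.lstsq(D1, A, rcond=None)[0]  # fitted E[A | Z, X]
    D2 = np.hstack([ones, A_hat[:, None], X])           # stage (ii) design
    return np.linalg.lstsq(D2, Y, rcond=None)[0][1]
```

Note that naively reusing the second-stage least squares standard errors is invalid, because Ê[A | Z, X] is itself estimated.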
To study the asymptotic properties of the two-stage least squares estimator, we consider a more general counterfactual setup. We will omit the observed confounders X and consider the diagram in Figure 9.1 with a possibly multi-dimensional instrumental variable Z.
9.2 Assumption (Core IV assumptions).
(i) Relevance: $Z \not\perp\!\!\!\perp A$;
(ii) Exogeneity: $Z \perp\!\!\!\perp \{A(z), Y(z, a)\}$ for all z, a;
(iii) Exclusion restriction: Y(z, a) = Y(a) for all z, a.
The core IV assumptions in Assumption 9.2 are structural and nonparametric. The three assumptions would follow from (i) assuming the distribution is faithful to Figure 9.1; (ii) assuming the variables satisfy a single-world causal model according to Figure 9.1; and (iii) the recursive substitution of counterfactuals.
9.3 Remark. Different authors state these assumptions slightly differently, but they all
reflect the structural assumptions: Z and A are dependent; there are no unmeasured Z-Y
confounders; there is no direct effect from Z on Y .
As mentioned in Section 9.1, structural assumptions alone are generally not enough to overcome unmeasured confounders. For the rest of this Section, we further assume the causal effect of A on Y is a constant β:
$$Y(a) = Y(0) + \beta a \quad \text{for all } a. \tag{9.5}$$
This would be satisfied if we assume the linear structural equation model (9.2).
9.4 Remark. When A is binary, this reduces to the constant treatment effect assumption
(2.8) in randomisation inference. The main distinction is that randomisation inference is
concerned with testing a given β, while the derivations below focus on the estimation of
β. But we can also use randomisation inference for instrumental variables5 and method
of moments (Z-estimation) for models like (9.5)6 .
The estimation of β in (9.5) is a semiparametric inference problem. We first express it
in terms of the observed data. Let ã be a reference level of the treatment; for simplicity,
let ã = 0. Like randomisation inference, (9.5) gives the imputation Y (0) = Y − βA.
Let α = E[Y(0)]. The exogeneity assumption $Z \perp\!\!\!\perp Y(z, a)$ and the exclusion restriction Y(z, a) = Y(a) imply that, for any function g(z),
$$E\bigl[(Y - \alpha - \beta A)\, g(Z)\bigr] = E\bigl[(Y(0) - \alpha)\, g(Z)\bigr] = 0. \tag{9.6}$$
Let $\hat\alpha = \bar Y - \beta \bar A$, where $\bar A = \sum_{i=1}^n A_i / n$ and $\bar Y = \sum_{i=1}^n Y_i / n$. The method of moments estimator of β is given by solving the empirical version of (9.6) (with α replaced by α̂). After some algebra, we obtain
$$\hat\beta_g = \frac{\frac{1}{n}\sum_{i=1}^n (Y_i - \bar Y)\, g(Z_i)}{\frac{1}{n}\sum_{i=1}^n (A_i - \bar A)\, g(Z_i)}. \tag{9.7}$$
This is an empirical estimator of Cov(Y, g(Z))/Cov(A, g(Z)), which is the Wald ratio (5.11) with Z replaced by g(Z). In view of this, g(Z) is a one-dimensional summary statistic of all the instrumental variables.
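For instance (a hypothetical helper; any g with Cov(A, g(Z)) ≠ 0 may be used):

```python
import numpy as np

def beta_hat_g(Y, A, Z, g):
    """Method of moments estimator (9.7) with instrument summary g(Z)."""
    gz = g(Z)
    return np.mean((Y - Y.mean()) * gz) / np.mean((A - A.mean()) * gz)

# On the simulated IV data above, beta_hat_g(Y, A, Z, lambda z: z)
# reproduces the Wald ratio, while e.g. beta_hat_g(Y, A, Z, np.sign)
# uses a coarser summary of Z and is still consistent (larger variance).
```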
9.5 Theorem. Under Assumption 9.2, model (9.5), iid sampling, and suitable regularity conditions including Cov(A, g(Z)) > 0, we have
$$\sqrt{n}\,(\hat\beta_g - \beta) \stackrel{d}{\to} N(0, \sigma_g^2) \quad \text{as } n \to \infty,$$
where
$$\sigma_g^2 = \frac{\operatorname{Var}(Y - \beta A)\, \operatorname{Var}(g(Z))}{\operatorname{Cov}^2(A, g(Z))} \tag{9.8}$$
is minimised at $g^*(z) = E[A \mid Z = z]$ by the Cauchy–Schwarz inequality.
9.6 Exercise. Prove Theorem 9.5 by showing that the influence function of β̂g is given by
$$\varphi(A, Y, Z) = \frac{\{g(Z) - E[g(Z)]\}\,\{(Y - \beta A) - E[Y - \beta A]\}}{\operatorname{Cov}(A, g(Z))}.$$
The choice g(Z) = g*(Z) that minimises the variance of β̂g is often called the optimal instrument. To reduce the variance, we can first estimate g*(Z) and then plug it into (9.7). Let the resulting estimator be β̂ĝ.
It is common to estimate g ∗ (Z) by a linear regression, which returns the two-stage
least squares estimator in Section 9.2. Other regression models including machine learning
methods can also be used.
9.7 Exercise. Show that β̂ĝ reduces to the two-stage least squares estimator (with no
observed covariates X), if we use a linear regression model for g ∗ (Z) and obtain ĝ(Z) by
least squares.
Can we relax the constant treatment effect assumption (9.5)? The answer is yes, but
alternative assumptions need to be made. In this Section we will show how a monotonicity
assumption allows us to identify the so-called complier average treatment effect.
Before we go into any detail, let us first introduce an example to motivate the definition
of compliance classes.
9.9 Example. A common problem in randomised experiments is that not all experimental subjects would comply with the assigned treatment. This problem may be described by the IV diagram in Figure 9.1, where Z is the assigned treatment, A is the treatment actually received, and Y is the outcome. Such an experiment can be analysed in several ways, including:
(i) The intention-to-treat analysis that ignores A and estimates the causal effect of Z on Y (which is not confounded). This has been discussed in Chapter 2.
We will focus on the case of binary Z and A. As before, A = 1 refers to receiving the
treatment and A = 0 refers to the control. The same terminology is used for the levels of
Z.
9.10 Exercise. Verify that, if Z is binary, the Wald ratio can be written as
$$\frac{\operatorname{Cov}(Z, Y)}{\operatorname{Cov}(Z, A)} = \frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[A \mid Z = 1] - E[A \mid Z = 0]}.$$
9.11 Assumption. We make all the core IV assumptions in Assumption 9.2 and additionally assume cross-world counterfactual independence according to Figure 9.1. That is, (ii) in Assumption 9.2 is replaced by (ii') $Z \perp\!\!\!\perp \{A(z), Y(z, a) : \text{all } z, a\}$.
97
Based on the joint values of (A(0), A(1)), each subject belongs to one of four compliance classes: always-takers (C = at: A(0) = A(1) = 1), never-takers (C = nt: A(0) = A(1) = 0), compliers (C = co: A(0) = 0, A(1) = 1), and defiers (C = de: A(0) = 1, A(1) = 0).
9.12 Assumption (Monotonicity). P(A(1) ≥ A(0)) = 1, or equivalently, P(C = de) = 0.
For example, Assumption 9.12 is reasonable if the control patients have no access to the
new treatment drug.
The next Theorem shows that, under the above assumptions, the Wald ratio identifies the complier average treatment effect.
9.13 Theorem. Under Assumptions 9.11 and 9.12, the complier average treatment effect is identified by
$$E[Y(1) - Y(0) \mid C = \text{co}] = \frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[A \mid Z = 1] - E[A \mid Z = 0]}.$$
Proof. Let us expand the term in the numerator using the compliance classes. Using the
law of total expectation,
$$E[Y \mid Z = 1] = \sum_{c \in \{\text{at}, \text{nt}, \text{co}, \text{de}\}} E[Y \mid Z = 1, C = c]\, P(C = c \mid Z = 1).$$
By using the exogeneity of Z (Assumption 9.11(ii')) and the fact that C is a deterministic function of A(0) and A(1), we can drop the conditioning on Z = 1:
$$\begin{aligned} E[Y \mid Z = 1] &= E[Y(1) \mid C = \text{at}]\, P(C = \text{at}) + E[Y(0) \mid C = \text{nt}]\, P(C = \text{nt}) \\ &\quad + E[Y(1) \mid C = \text{co}]\, P(C = \text{co}) + E[Y(0) \mid C = \text{de}]\, P(C = \text{de}). \end{aligned}$$
Similarly, we have
$$\begin{aligned} E[Y \mid Z = 0] &= E[Y(1) \mid C = \text{at}]\, P(C = \text{at}) + E[Y(0) \mid C = \text{nt}]\, P(C = \text{nt}) \\ &\quad + E[Y(0) \mid C = \text{co}]\, P(C = \text{co}) + E[Y(1) \mid C = \text{de}]\, P(C = \text{de}). \end{aligned}$$
Therefore,
$$E[Y \mid Z = 1] - E[Y \mid Z = 0] = E[Y(1) - Y(0) \mid C = \text{co}]\, P(C = \text{co}) - E[Y(1) - Y(0) \mid C = \text{de}]\, P(C = \text{de}).$$
Using a similar argument, it can be shown that the denominator of β is
$$E[A \mid Z = 1] - E[A \mid Z = 0] = P(C = \text{co}) - P(C = \text{de}).$$
By dividing the two equations above and using P(C = de) = 0 (Assumption 9.12), we obtain the identification formula of E[Y(1) − Y(0) | C = co].
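A simulation can make the Theorem concrete. The sketch below (all parameter values are hypothetical) draws compliance classes with no defiers, builds in confounding through a class-dependent baseline, and checks that the Wald ratio recovers the complier effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Compliance classes with no defiers (Assumption 9.12).
C = rng.choice(["at", "nt", "co"], size=n, p=[0.2, 0.3, 0.5])
Z = rng.binomial(1, 0.5, size=n)                      # randomised instrument
A = np.where(C == "at", 1, np.where(C == "nt", 0, Z)) # received treatment

# Class-dependent baseline confounds A and Y; complier effect is 2.0.
Y = rng.normal(loc=(C == "at") * 1.0) + 2.0 * A * (C == "co") + 0.5 * A * (C == "at")

wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (A[Z == 1].mean() - A[Z == 0].mean())
print(wald)  # close to the complier average treatment effect 2.0
```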
9.14 Remark. The complier average treatment effect E[Y (1)−Y (0) | C = co] is an instance
of the local average treatment effect.7 Here, local means the treatment effect is averaged
over a specific subpopulation. What is unusual about the complier average treatment
effect is that the subpopulation is defined in terms of cross-world counterfactuals (so it
can never be observed). This demonstrates the utility of the counterfactual language, as the compliance class does not exist as a concept in a purely graphical setup. Pearl, 2009, page 29 discussed three layers of data queries: predictions, interventions, and counterfactuals.
The meaning of “counterfactual” in Pearl’s classification is not immediately clear. It
is helpful to base the classification on whether the query only contains factuals (can
be answered without causal inference), only contains counterfactuals in the same world
(can be answered with randomised intervention), or contains cross-world counterfactuals
(cannot be answered unless some non-verifiable cross-world independence is assumed).8
Notes
1. Hill, A. B. (1965). The environment and disease: Association or causation? Proceedings of the Royal Society of Medicine, 58(5), 295–300. doi:10.1177/003591576505800503.
2. Doll and Hill, 1950.
3. Stock, J. H., & Trebbi, F. (2003). Retrospectives: Who invented instrumental variable regression? Journal of Economic Perspectives, 17(3), 177–194. doi:10.1257/089533003769204416.
4. This is studied in simultaneous equations models (the variables are determined simultaneously). See, for example, Davidson, R., & MacKinnon, J. G. (1993). Estimation and inference in econometrics. Oxford University Press, Chapter 18.
5. Imbens, G. W., & Rosenbaum, P. R. (2005). Robust, accurate confidence intervals with a weak instrument: Quarter of birth and education. Journal of the Royal Statistical Society: Series A (Statistics in Society), 168(1), 109–126. doi:10.1111/j.1467-985x.2004.00339.x.
6. Equation (9.5) belongs to a more general class of models called structural nested models proposed by Robins; for a review, see Vansteelandt, S., & Joffe, M. (2014). Structural nested models and g-estimation: The partially realized promise. Statistical Science, 29(4), 707–731. doi:10.1214/14-sts493.
7. Imbens, G. W., & Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2), 467. doi:10.2307/2951620.
8. See also Robins, J. M., & Richardson, T. S. (2010). Alternative graphical causal models and the identification of direct effects. In P. Shrout, K. Keyes, & K. Ornstein (Eds.), Causality and psychopathology: Finding the determinants of disorders and their cures (pp. 103–158). Oxford University Press.
Chapter 10
Mediation analysis
In the last Chapter, we saw how specificity (absence of causal mechanisms) could be
useful to overcome unmeasured confounding.
In many other problems, the research question is about the causal mechanism itself.
For example, instead of simply concluding that smoking causes lung cancer, it is more
informative to determine which chemicals in the cigarettes are carcinogenic.
The problem of inferring causal mechanisms is called mediation analysis. It is a
challenging problem and this Chapter will introduce you to some basic ideas.1
We start with the simplest setting with three variables A (treatment), Y (outcome), and
M (mediator) and no measured or unmeasured confounder (Figure 10.1).
Figure 10.1: The basic mediation analysis problem with three variables and no confounder.
A linear SEM with respect to this causal diagram (Definition 3.8) assumes that the variables are generated by
$$M = \beta_{AM} A + \epsilon_M, \tag{10.1}$$
$$Y = \beta_{AY} A + \beta_{MY} M + \epsilon_Y, \tag{10.2}$$
where $\epsilon_M, \epsilon_Y$ are mutually independent noise variables that are also independent of A.
Wright's path analysis formula (Theorem 3.14) shows that the total effect of A on Y is $\beta_{AY} + \beta_{AM}\beta_{MY}$. This can be seen directly from the reduced-form equation that plugs equation (10.1) into (10.2):
$$Y = (\beta_{AY} + \beta_{AM}\beta_{MY})\, A + \beta_{MY}\, \epsilon_M + \epsilon_Y.$$
The path coefficient βAY is the direct effect of A on Y , and the product βAM βM Y is
the indirect effect of A on Y through the mediator M . They can be estimated by first
estimating the path coefficients using linear regression.
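A sketch of this product-of-coefficients approach on simulated data (the coefficient values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

beta_AM, beta_AY, beta_MY = 0.7, 1.0, 1.5
A = rng.normal(size=n)
M = beta_AM * A + rng.normal(size=n)                # equation (10.1)
Y = beta_AY * A + beta_MY * M + rng.normal(size=n)  # equation (10.2)

# Regress M on A, then Y on (A, M), and combine the coefficients.
b_AM = np.linalg.lstsq(A[:, None], M, rcond=None)[0][0]
b_AY, b_MY = np.linalg.lstsq(np.column_stack([A, M]), Y, rcond=None)[0]

print("direct:", b_AY)             # ~ 1.0
print("indirect:", b_AM * b_MY)    # ~ 0.7 * 1.5 = 1.05
print("total:", b_AY + b_AM * b_MY)
```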
Figure 10.2: The mediation analysis problem with measured covariates X.
This approach can be easily extended to allow for covariates (Figure 10.2). In this case, the linear SEM is
$$M = \beta_{AM} A + \beta_{XM}^T X + \epsilon_M,$$
$$Y = \beta_{AY} A + \beta_{MY} M + \beta_{XY}^T X + \epsilon_Y.$$
The direct and indirect effects are still $\beta_{AY}$ and $\beta_{AM}\beta_{MY}$, though to estimate them one now needs to include X in the regression models for M and Y.
This regression approach to mediation analysis is intuitive and very popular in
practice.2
The obvious drawback is the strong linearity assumption, which is precisely what enables us to express direct and indirect effects using regression coefficients. In more sophisticated scenarios (with nonlinearities and interactions), we can no longer rely on this decomposition.
The rest of this Chapter will develop a counterfactual approach to mediation analysis.
For simplicity, we will again focus on the case of binary treatment. The counterfactuals
Y (a, m), Y (a), and Y (m) are defined as before via a nonparametric SEM (see Chapter 5).
The controlled direct effect (CDE) of A on Y when M is fixed at m is defined as
$$\text{CDE}(m) = E[Y(1, m) - Y(0, m)].$$
10.1 Theorem. In a single-world causal model for the graph in Figure 10.2, we have
$$\text{CDE}(m) = E\bigl\{E[Y \mid A = 1, M = m, X]\bigr\} - E\bigl\{E[Y \mid A = 0, M = m, X]\bigr\}.$$
Proof. By the law of total expectation and the single-world independence assumptions $Y(a, m) \perp\!\!\!\perp A \mid X$ and $Y(m) \perp\!\!\!\perp M \mid A, X$, we have, for any a and m,
$$\begin{aligned} E[Y(a, m)] &= E\bigl\{E[Y(a, m) \mid X]\bigr\} \\ &= E\bigl\{E[Y(a, m) \mid A = a, X]\bigr\} \\ &= E\bigl\{E[Y(m) \mid A = a, X]\bigr\} \\ &= E\bigl\{E[Y(m) \mid A = a, M = m, X]\bigr\} \\ &= E\bigl\{E[Y \mid A = a, M = m, X]\bigr\}. \end{aligned}$$
The third and the last equalities used consistency of the counterfactual.
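For discrete M and X, the identification formula in Theorem 10.1 suggests a simple plug-in estimate of CDE(m) by cell means (a hypothetical sketch, assuming every (A, M, X) cell is non-empty):

```python
import pandas as pd

def cde_plugin(df, m):
    """Plug-in estimate of CDE(m): estimate E[Y | A, M, X] by cell
    averages, then average the A=1 vs A=0 contrast at M = m over the
    empirical distribution of X."""
    mu = df.groupby(["A", "M", "X"])["Y"].mean()
    px = df["X"].value_counts(normalize=True)
    return sum(px[x] * (mu[(1, m, x)] - mu[(0, m, x)]) for x in px.index)
```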
Another approach to mediation analysis is to consider the natural direct effect (NDE) and natural indirect effect (NIE):³
$$\text{NDE} = E\bigl[Y(1, M(0)) - Y(0, M(0))\bigr], \qquad \text{NIE} = E\bigl[Y(1, M(1)) - Y(1, M(0))\bigr].$$
Compared to the CDE, these new quantities allow the mediator M to vary naturally according to some treatment level. By definition, they provide a decomposition of the average treatment effect:
$$E[Y(1) - Y(0)] = E\bigl[Y(1, M(1)) - Y(0, M(0))\bigr] = \text{NIE} + \text{NDE}.$$
Both the NDE and NIE depend on a cross-world counterfactual, Y (1, M (0)). This
means that some cross-world independence is needed for causal identification.
To focus on this issue, let’s consider the basic mediation analysis problem with no
covariates (Figure 10.1). The single-world independence assumptions of the graph in
Figure 10.1 are
$$A \perp\!\!\!\perp M(a) \perp\!\!\!\perp Y(a, m), \quad \forall\, a, m, \tag{10.4}$$
where consecutive $\perp\!\!\!\perp$ means mutual independence.
To identify E[Y(1, M(0))], we need an additional cross-world independence:
$$Y(1, m) \perp\!\!\!\perp M(0), \quad \forall\, m. \tag{10.5}$$
10.2 Exercise. Suppose both A and M are binary. Count the number of pairwise inde-
pendences in the single-world and multiple-world independence assumptions introduced
in Definition 5.7.
10.3 Proposition. Consider a causal model corresponding to the graph in Figure 10.1, in which the counterfactuals satisfy the multiple-world independence assumptions. When M is discrete, we have
$$E[Y(1, M(0))] = \sum_m E[Y \mid A = 1, M = m] \cdot P(M = m \mid A = 0).$$
In consequence,
$$\text{NDE} = \sum_m \bigl\{E[Y \mid A = 1, M = m] - E[Y \mid A = 0, M = m]\bigr\} \cdot P(M = m \mid A = 0), \tag{10.6}$$
$$\text{NIE} = \sum_m E[Y \mid A = 1, M = m] \cdot \bigl\{P(M = m \mid A = 1) - P(M = m \mid A = 0)\bigr\}. \tag{10.7}$$
Proof. By the law of total expectation,
$$\begin{aligned} E[Y(1, M(0))] &= \sum_m E[Y(1, m) \mid M(0) = m]\, P(M(0) = m) \\ &= \sum_m E[Y(1, m)]\, P(M(0) = m \mid A = 0) \\ &= \sum_m E[Y(m) \mid A = 1]\, P(M = m \mid A = 0) \\ &= \sum_m E[Y(m) \mid A = 1, M = m]\, P(M = m \mid A = 0) \\ &= \sum_m E[Y \mid A = 1, M = m]\, P(M = m \mid A = 0). \end{aligned}$$
The identification formulas (10.6) and (10.7) for the NDE and NIE then follow.
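A plug-in sketch of (10.6) and (10.7) for binary A and discrete M, replacing all conditional quantities by sample means and frequencies (hypothetical helper):

```python
import numpy as np

def mediation_formula(y, a, m):
    """Plug-in estimates of NDE (10.6) and NIE (10.7), no covariates."""
    levels = np.unique(m)
    pm0 = np.array([np.mean(m[a == 0] == v) for v in levels])
    pm1 = np.array([np.mean(m[a == 1] == v) for v in levels])
    mu1 = np.array([y[(a == 1) & (m == v)].mean() for v in levels])
    mu0 = np.array([y[(a == 0) & (m == v)].mean() for v in levels])
    nde = np.sum((mu1 - mu0) * pm0)
    nie = np.sum(mu1 * (pm1 - pm0))
    return nde, nie
```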
10.4 Remark. In the proof of Proposition 10.3, only the second equality uses the cross-world independence. Thus, without this assumption, we can still interpret the right hand side of (10.6) as the expectation of the controlled direct effect Y(1, M′) − Y(0, M′), where M′ is a randomised interventional analogue in the sense that M′ is an independent random variable with the same distribution as M(0).⁴
To extend the identification results to more complex situations with covariates X, let's examine the proof of Proposition 10.3. The first equality is just the law of total expectation. The second equality uses the cross-world independence $Y(1, m) \perp\!\!\!\perp M(0)$ and the independence $M(0) \perp\!\!\!\perp A$. The third equality uses $Y(1, m) \perp\!\!\!\perp A$ and consistency. The fourth equality uses consistency and $Y(m) \perp\!\!\!\perp M \mid A$. The last equality uses consistency.
To extend this proof, we can assume all the independences we used still hold when conditioning on X:
10.5 Assumption (Single-world). $M(a) \perp\!\!\!\perp A \mid X$, $Y(a, m) \perp\!\!\!\perp A \mid X$, and $Y(m) \perp\!\!\!\perp M \mid A, X$ for all a, m.
10.6 Assumption (Cross-world). $Y(a, m) \perp\!\!\!\perp M(a') \mid X$ for all a, a', m.
The above assumptions are stated in terms of the counterfactuals. Assumption 10.5
can be checked by d-separation in the corresponding SWIGs. Assumption 10.6 cannot be
checked in SWIGs because the counterfactuals are in different “worlds”.
Importantly, Assumption 10.6 is not implied by Assumption 10.5 and multiple-world
independence, as illustrated in the next example.
Figure 10.3: A mediation problem where another observed mediator L precedes M and confounds M and Y.
10.7 Example. Consider the causal diagram in Figure 10.3, where the mediator M and
the outcome Y are confounded by an observed variable L that is a descendant of A. In
other words, L is another mediator that precedes M . The NPSEM corresponding to
Figure 10.3 is
$$A = f_A(\epsilon_A), \quad L = f_L(A, \epsilon_L), \quad M = f_M(A, L, \epsilon_M), \quad Y = f_Y(A, L, M, \epsilon_Y).$$
The counterfactuals are defined according to Definition 5.2, and the multiple-world independence assumptions (Definition 5.7) assert that the noise variables $\epsilon_A, \epsilon_L, \epsilon_M, \epsilon_Y$ are mutually independent. In this case, Assumption 10.5 is satisfied but Assumption 10.6 is generally not. To see this, the counterfactuals are, by definition,
$$M(a') = f_M(a', L(a'), \epsilon_M), \qquad Y(a, m) = f_Y(a, L(a), m, \epsilon_Y).$$
So $Y(a, m) \perp\!\!\!\perp M(a') \mid L(a), L(a')$. However, the variable $L = f_L(A, \epsilon_L)$ does not contain as much information as $L(a) = f_L(a, \epsilon_L)$ and $L(a') = f_L(a', \epsilon_L)$ together, unless $f_L(a, \epsilon_L)$ does not depend on a (in which case $L \perp\!\!\!\perp A$ and the graph in Figure 10.3 is not faithful).
This example motivates the following structural assumption for X:
10.8 Assumption. No variable in X is a descendant of A.
10.9 Lemma. Under the multiple-world independence assumptions and Assumption 10.8,
if Y (m) and M are d-separated by {A, X} in G(m), then Assumption 10.6 is satisfied.
Proof. Assumption 10.8 implies that X does not contain a descendant of M or Y . Given
X, the randomness of M (a) comes from all (the noise variables of) the ancestors of M (a)
which have a directed path to M that is not blocked by {A, X}; denote those ancestors as
an(M | A, X). Similarly, let an(Y | A, M, X) be the ancestors of Y that are d-connected
with Y given {A, M, X}. So given X, the randomness of Y(a′, m) comes from (the noise
variables of) the variables in an(Y | A, M, X).
We claim that an(M | A, X) and an(Y | A, M, X) are d-separated by X. Otherwise, say $V \in \text{an}(M \mid A, X)$ and $U \in \text{an}(Y \mid A, M, X)$ are d-connected given X; we can then append that d-connected path with the directed paths from V to M and from U to Y to create a d-connected path from M to Y, which contradicts the assumption that $Y(m) \perp\!\!\!\perp M \mid A, X$ in G(m) (see Figure 10.4 for an illustration).
Figure 10.4: Illustration for the proof of Lemma 10.9. Adding the edge from U to X or from U to V creates a d-connected path from M to Y given X.
10.10 Theorem. Suppose M and X are discrete. Under Assumptions 10.5 and 10.8,
$$E[Y(1, M(0))] = \sum_{m, x} E[Y \mid A = 1, M = m, X = x]\, P(M = m \mid A = 0, X = x)\, P(X = x).$$
10.11 Example. Suppose a new process can completely remove the nicotine from tobacco,
allowing the production of a nicotine-free cigarette to begin next year. The goal is to use
the collected data on smoking status A, hypertensive status M and heart disease status
Y from a randomised smoking cessation trial to estimate the incidence of heart disease
in smokers were all smokers to change to nicotine-free cigarettes. Suppose a scientific
theory tells us that the entire effect of nicotine on heart disease is through changing
the hypertensive status, while the non-nicotine toxins in cigarettes have no effect on
hypertension. Then, under the additional assumption that there are no confounders
(besides A) for the effect of hypertension on heart disease, the causal DAG in Figure 10.1
can be used to describe the assumptions. The heart disease incidence rate among smokers
of the new nicotine-free cigarettes is equal to E[Y (a = 1, M (a = 0))].
Figure 10.5: An extended causal diagram with the nicotine content N and non-nicotine content O of cigarettes.
The scientific story in this example allows us to extend the graph and include two additional variables N and O to represent the nicotine and non-nicotine content in cigarettes (Figure 10.5). Since the nicotine-free cigarette is not available until next year, we have P(A = N = O) = 1 in our current data.
Using this new graph, the heart disease incidence rate among the future smokers of nicotine-free cigarettes is given by
$$E[Y(N = 0, O = 1)].$$
Although the event {N = 0, O = 1} has probability 0 in the current data, once the
nicotine-free cigarettes become available, it will become possible to estimate E[Y (N =
0, O = 1)] by randomising this new treatment.
What is really interesting here is that we can still identify the distribution of Y (N =
0, O = 1) with the current data, even though P(N = 0, O = 1) = 0. Using the g-formula
and P(A = N = O) = 1,
$$\begin{aligned} &P\bigl(Y(N{=}0, O{=}1) = y,\, M(N{=}0) = m\bigr) \\ &\quad= P\bigl(Y(N{=}0, O{=}1) = y \mid M(N{=}0) = m\bigr) \cdot P\bigl(M(N{=}0) = m\bigr) \\ &\quad= P\bigl(Y(O{=}1) = y \mid M = m, N = 0\bigr) \cdot P(M = m \mid N = 0) \\ &\quad= P(Y = y \mid M = m, N = 0, O = 1) \cdot P(M = m \mid N = 0) \\ &\quad= P(Y = y \mid M = m, O = 1) \cdot P(M = m \mid N = 0) \\ &\quad= P(Y = y \mid M = m, A = 1) \cdot P(M = m \mid A = 0). \end{aligned}$$
Notes
1. A more comprehensive treatment is given in VanderWeele, T. (2015). Explanation in causal inference: Methods for mediation and interaction. Oxford University Press.
2. To give you a sense of how popular the mediation question is in psychology, the paper that made this regression analysis popular is now one of the most cited of all time (close to 100,000 citations on Google Scholar): Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51(6), 1173–1182. doi:10.1037/0022-3514.51.6.1173.
3. This terminology is due to Pearl, 2000. These quantities were first proposed under the names "pure direct effect" and "total indirect effect" by Robins, J. M., & Greenland, S. (1992). Identifiability and exchangeability for direct and indirect effects. Epidemiology, 3(2), 143–155. doi:10.1097/00001648-199203000-00013.
4. Didelez, V., Dawid, A. P., & Geneletti, S. (2006). Direct and indirect effects of sequential treatments. In Proceedings of the twenty-second conference on uncertainty in artificial intelligence (pp. 138–146). UAI'06. Cambridge, MA, USA: AUAI Press.
5. Robins and Richardson, 2010; see also Didelez, V. (2018). Defining causal mediation with a longitudinal mediator and a survival outcome. Lifetime Data Analysis, 25(4), 593–610. doi:10.1007/s10985-018-9449-0.