
Lecture Notes on Causal Inference

(with corrections)

Qingyuan Zhao

May 30, 2022

Website for this course: http://www.statslab.cam.ac.uk/~qz280/teaching/causal-2020/.


Please contact me if you find any mistakes or have any comments.
Copyright © 2022 Dr Qingyuan Zhao ([email protected])
Please ask permission from the copyright holder to distribute and/or modify this document.

Contents

1 What is causal inference? 1


1.1 Some motivating examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Languages for causality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Concepts and principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Randomised experiments 12
2.1 Assignment mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Potential outcomes/Counterfactuals . . . . . . . . . . . . . . . . . . . . . 13
2.3 Randomisation distribution of causal effect estimator . . . . . . . . . . . . 17
2.4 Randomisation test of sharp null hypothesis . . . . . . . . . . . . . . . . . 18
2.5 Super-population inference and regression adjustment . . . . . . . . . . . 23
2.6 Comparison of different modes of inference . . . . . . . . . . . . . . . . . . 26

3 Path analysis 28
3.1 Graph terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Linear structural equation models . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Path analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Correlation and causation . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Latent variables and identifiability . . . . . . . . . . . . . . . . . . . . . . 34
3.6 Factor models and measurement models . . . . . . . . . . . . . . . . . . . 35
3.7 Estimation in linear SEMs . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.8 Strengths and weaknesses of linear SEMs . . . . . . . . . . . . . . . . . . 39

4 Graphical models 41
4.1 Markov properties for undirected graphs . . . . . . . . . . . . . . . . . . . 41
4.2 Markov properties for directed graphs . . . . . . . . . . . . . . . . . . . . 42
4.3 Structure discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Discussion: Using DAGs to represent causality . . . . . . . . . . . . . . . 50
4.A Graphical proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5 A unifying theory of causality 54


5.1 From graphs to structural equations to counterfactuals . . . . . . . . . . . 54

5.2 Markov properties for counterfactuals . . . . . . . . . . . . . . . . . . . . 56
5.3 From counterfactual to factual . . . . . . . . . . . . . . . . . . . . . . . . 60
5.4 Causal identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.5 Proofs (non-examinable) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

6 No unmeasured confounders: Randomisation inference 69


6.1 The logic of observational studies . . . . . . . . . . . . . . . . . . . . . . . 69
6.2 No unmeasured confounders . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.3 Matching algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.4 Covariate balance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.5 Randomisation inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

7 No unmeasured confounders: Semiparametric inference 77


7.1 An introduction to semiparametric inference . . . . . . . . . . . . . . . . . 77
7.2 Discrete covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.3 Outcome regression and inverse probability weighting . . . . . . . . . . . . 81
7.4 Doubly robust estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.5 A comparison of the statistical methods . . . . . . . . . . . . . . . . . . . 85

8 Sensitivity analysis 87
8.1 A roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
8.2 Rosenbaum’s sensitivity analysis . . . . . . . . . . . . . . . . . . . . . . . 87
8.3 Sensitivity analysis in semiparametric inference . . . . . . . . . . . . . . . 89

9 Unmeasured confounders: Leveraging specificity 91


9.1 Structural specificity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
9.2 Instrumental variables and two-stage least squares . . . . . . . . . . . . . 92
9.3 Instrumental variables: Method of moments . . . . . . . . . . . . . . . . . 94
9.4 Complier average treatment effect . . . . . . . . . . . . . . . . . . . . . . . 96

10 Mediation analysis 100


10.1 Linear SEM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
10.2 Identification of controlled direct effect . . . . . . . . . . . . . . . . . . . . 101
10.3 Natural direct and indirect effects . . . . . . . . . . . . . . . . . . . . . . . 102
10.4 Observed confounders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
10.5 Extended graph interpretation . . . . . . . . . . . . . . . . . . . . . . . . . 106

Chapter 1

What is causal inference?

Causal inference ≈ Causal language/model + Statistical inference.

1.1 Some motivating examples

Example 1: Smoking and lung cancer


By the mid-1940s, it had been observed that lung cancer cases had tripled over the
previous three decades. But the cause for the increase in lung cancer was unclear and not
agreed upon. Possible explanations included

• Changes in air quality due to the introduction of the automobile;

• Widespread expansion of paved roads that contained many carcinogens;

• Aging of the population;

• The advent of radiography;

• Better clinical awareness of lung cancer and better diagnostic methods;

• Smoking.

Advertisement for cigarette smoking (1950):


A series of observational studies reported an overwhelming association between smoking and lung cancer1. Some data2: 36,975 pairs of heavy smokers and nonsmokers, matched by age, race, nativity, rural versus urban residence, occupational exposures to dust and fumes, religion, education, marital status, ....
Of the 36,975 pairs, there were 122 discordant pairs in which exactly one person died
of lung cancer. Among them,

• 12 pairs in which nonsmoker died of lung cancer;

• 110 pairs in which smoker died of lung cancer.

So smoking is very strongly associated with lung cancer.


Fisher strongly objected to the idea that smoking is carcinogenic:3

Such results suggest that an error has been made of an old kind, in arguing from correlation to causation.... Such differences in genetic make-up between those classes would naturally be associated with differences of disease incidence without the disease being causally connected with smoking.

Fisher then demonstrated evidence of a gene that is associated with both smoking and
lung cancer.
We now know Fisher was wrong. His criticism was logical, but the association between
smoking and lung cancer is simply too strong to be explained away by different genetic
make-ups4 . Some believe that his views may have been influenced by personal and
professional conflicts, by his work as a consultant to the tobacco industry, and by the
fact that he was himself a smoker.

Example 2: Undergraduate admissions


Some great visualisations of Cambridge’s 2019–2020 admission data5 :

Chart 1 State vs. independent schools.

[Chart: “Record state school intake but independent schools still over-represented. Cambridge welcomes 68.7% of 2019 from maintained schools but falls far below a national average of 93% of state educated students.” Percentage of the first-year home cohort from independent and state schools, 2015–2019.]

Some interesting quotes: “Considering 93% of pupils in England are taught in state schools, a figure of 68.7% means that state school students are still vastly under-represented in the University.... Cambridge’s acceptance of state school applicants continues to be amongst the lowest in the UK, with 90% of university students on average hailing from state schools across the country.”
Does this mean Cambridge’s admission is biased against state schools? Not necessarily. For example, applicants from independent schools may have better A-level results.
Causal inference can be used to understand fairness in decisions made by humans and computer algorithms6.

Chart 3 Racial disparities.

[Chart: “Racial disparities persist in acceptance rates. Although the successful applications ratio for Black students moved up to 15.1% from 13%, this still falls a way below the average of 21.4% across all groups.” Male and female success rates by ethnic group (Chinese, Arab, Black, White, Mixed Race, Bangladeshi, Indian, Pakistani, other Asian, other/unknown), as percentages of the first-year cohort.]

Again this is showing associations. But in general it is not straightforward to discuss the causal effect of race, because it is hard to conceptualise “manipulation” of race at birth. One possibility is to consider “perceived race” instead.

Chart 4 Impact of Brexit.

[Chart: “Continued decline of EU applicants. Percentage of EU applicants declines to 12.5% as Chinese applications increase 33% and the nation sees more acceptances than Northern Ireland, Wales and Scotland combined.” Proportion of applicants from the EU as a percentage of total applicants, 2012–2019.]

Is the steep decline in EU applications caused by Brexit (or Brexit dubiety)? It is possible to answer this question by using a concept in causal inference called the probability of causation7, which is quite useful in law8.

Chart 5 Gender difference.

[Chart: “The gender divide: offer holder discrepancies between Sciences and Humanities. Computer Science continues to rank amongst the lowest in terms of female intake at 20.4%.” Male and female percentages of offer holders by subject (Education, Veterinary Medicine, Psychological and Behavioural Sciences, Land Economy, Natural Sciences, Economics, Maths, Engineering, Computer Science).]

Related: Simpson’s (or Yule–Simpson) paradox. UC Berkeley 1973 admission data9.
This paradox was first discovered by Pearson (1899), who offered a causal explanation: “To those who persist on looking upon all correlation as cause and effect, the fact that correlation can be produced between two quite uncorrelated characters A and B by taking an artificial mixture of the two closely allied races, must come as rather a shock.”10

1.2 Languages for causality

(0) “Implicit” in randomisation:

1925 Fisher (statistics, genetics, agricultural experiments).

(i) Using potential outcomes/counterfactuals:

1923 Neyman (statistics);


1973 Lewis (philosophy);
1974 Rubin (statistics);
1986 Robins (epidemiology);

(ii) Using structural equations:

1921 Wright (genetics);


1943 Haavelmo (econometrics);
1975 Duncan (social sciences);
2000 Pearl (computer science).

(iii) Using graphs:

1921 Wright (genetics);


1988 Pearl (computer science “AI”);
1993 Spirtes, Glymour, Scheines (philosophy).

Some remarks:

• The multi-disciplinary origin and development.

• Theorisation driven by demand from applications.

• Applications are often related to humans (biology, public health, economics, political sciences...). Why? These are open systems with external interactions, where manipulation in experiments is difficult or nearly impossible.

• State-of-the-art: The three languages are basically equivalent and advantageous for
different purposes.

– Graphs: Easy to visualise the causal assumptions; difficult for statistical inference because the model is nonparametric.
– Structural equations: Bridge between graphs and counterfactuals; easy to operationalise; danger of being confused with regressions.
– Counterfactuals: Easy to incorporate additional assumptions; elucidation of the meaning of statistical inference; not as convenient if the system is complex.

1.3 Concepts and principles

(i) Observation vs. intervention.

• Non-experimental vs. experimental data.


• Seeing vs. doing (J. Pearl).

(ii) (Controlled) Randomised experiment is the gold standard of causal inference.

(iii) Different types of inference (C. S. Peirce, late 1800s):

Deduction Necessary inference following logic.
– Example: Euclidean geometry.

Induction Probable or non-necessary inference (purely) based on statistical data.
– Example: Survey sampling; correlation between cigarette smoking and lung cancer.

Abduction Inference with implicit or explicit appeal to explanatory considerations.
– Example: Investigation of an aircraft crash; cigarette smoking causes lung cancer.
• Question: What type of inference is mathematical induction?
• The boundary between induction and abduction is not always clear.
• Very very roughly speaking, deduction ≈ mathematics; induction ≈ statistics;
abduction ≈ causal inference.

(iv) Causal identification.

• Identifiability of statistical models: Pθ1 = Pθ2 =⇒ θ1 = θ2 , ∀θ1 , θ2 .


• Causal identifiability: think about θ as counterfactuals or unobserved variables
and Pθ as the distribution or model of the factuals or observed data.
• Causal identification with non-experimental data always requires non-verifiable
assumptions.
– Example: No unmeasured confounding; Instrumental variable is exogenous.

(v) Design trumps analysis (D. Rubin).

• Design: Choose an identification strategy and collect relevant data.


• Analysis: Apply an appropriate statistical method to analyse the data.
• Success of a research study: 99% (maybe exaggerating...) depends on the
design and how data are collected.
• Unfortunately, we usually learn much more about analysis in statistics courses.

(vi) All models are wrong, but some are useful (G. Box).

• Historically, causality is often defined as certain parameters in a statistical model (e.g., a linear model) not equal to 0.
• A strong emphasis of modern causal inference is on robustness to the statistical models; this includes
– Nonparametric identification of causal effects that is free of modelling assumptions.
– Nonparametric/semiparametric inference for causal effects.

(vii) Causal mechanism and causal mediation.

• Asking why and how.


• Example: How does (or which substance in) cigarettes cause lung cancer?
• Often more important than “is”.

(viii) Specificity.

• One of Hill’s 9 criteria for causality11 : “If as here, the association is limited to
specific workers and to particular sites and types of disease and there is no
association between the work and other modes of dying, then clearly that is a
strong argument in favor of causation.”
• Original definition now considered weak or obsolete. Counterexample: smoking.
• In Hill’s era, exposure = an occupational setting or a residential location
(proxies for true exposures).
• Nowadays, exposure is much more precise (for example, a specific gene expres-
sion).
• Specificity is still useful. Examples: Instrumental variables, negative controls,
sparsity.

(ix) Corroboration of evidence.

• Famous quote from Fisher12 :


About 20 years ago, when asked in a meeting what can be done in
observational studies to clarify the step from association to causation,
Sir Ronald Fisher replied: “Make your theories elaborate.” The
reply puzzled me at first, since by Occam’s razor, the advice usually
given is to make theories as simple as is consistent with known data.
What Sir Ronald meant, as subsequent discussion showed, was that
when constructing a causal hypothesis one should envisage as many
different consequences of its truth as possible, and plan observational
studies to discover whether each of these consequences is found to
hold... this multi-phasic attack is one of the most potent weapons in
observational studies.
• Falsifiability of scientific theory (K. Popper, 1959).
• Evolution of scientific paradigms—the social and collaborative nature of scien-
tific progress (T. Kuhn, 1962).

Notes

1. Doll, R., & Hill, A. B. (1950). Smoking and carcinoma of the lung. BMJ, 2(4682), 739–748. doi:10.1136/bmj.2.4682.739.
2. Hammond, E. C. (1964). Smoking in relation to mortality and morbidity. Findings in first thirty-four months of follow-up in a prospective study started in 1959. Journal of the National Cancer Institute, 32(5), 1161–1188.
3. Fisher, R. A. (1958). Cancer and smoking. Nature, 182(4635), 596. doi:10.1038/182596a0.
4. This is pointed out by Cornfield, J., Haenszel, W., Hammond, E. C., Lilienfeld, A. M., Shimkin, M. B., & Wynder, E. L. (1959). Smoking and lung cancer: Recent evidence and a discussion of some questions. Journal of the National Cancer Institute, 22(1), 173–203, which is widely regarded as the first sensitivity analysis in observational studies.
5. Vides, G., & Powell, J. (2020, June 16). The eight charts that explain the university’s 2019-2020 undergraduate admissions data. Varsity.
6. See e.g. Kusner, M. J., & Loftus, J. R. (2020). The long road to fairer algorithms. Nature, 578(7793), 34–36. doi:10.1038/d41586-020-00274-3.
7. Pearl, J. (1999). Probabilities of causation: Three counterfactual interpretations and their identification. Synthese, 121, 93–149. doi:10.1023/a:1005233831499.
8. Dawid, A. P., Musio, M., & Murtas, R. (2017). The probability of causation. Law, Probability and Risk, 16(4), 163–179. doi:10.1093/lpr/mgx012.
9. https://en.wikipedia.org/wiki/Simpson's_paradox#UC_Berkeley_gender_bias
10. See Sec. 6.1 of Pearl, J. (2009). Causality (2nd ed.). Cambridge University Press.
11. Hill, A. B. (2015). The environment and disease: Association or causation? Journal of the Royal Society of Medicine, 108(1), 32–37. doi:10.1177/0141076814562718.
12. Cochran, W. G. (1965). The planning of observational studies of human populations. Journal of the Royal Statistical Society, Series A (General), 128(2), 234–266.
Chapter 2

Randomised experiments

• Randomised experiment (or randomised controlled trial) is the gold standard for
establishing causality.

• This Chapter: basic concepts and techniques in designing and analysing a ran-
domised experiment.

• We will focus on the case of a binary treatment.

2.1 Assignment mechanism

• Suppose there are n units in this experiment.

• For the i-th unit, we observe some covariates Xi prior to treatment assignment.

• Treatment Ai ∈ A = {0, 1}. Example: a new drug.

– Convention: Ai = 1 is the treatment or treated and Ai = 0 is the control.


– Some also refer to Ai as the exposure. Example: toxic chemicals in a factory.

• Ai is assigned according to a randomisation scheme (below).

• Notation: [n] = {1, 2, . . . , n}; A[n] = (A1, A2, . . . , An)T; X[n] = (X1, X2, . . . , Xn)T. All vectors are column vectors. We often suppress the subscript i to refer to a generic variable; for example, A refers to a generic Ai.

The assignment mechanism is the conditional distribution:

P(A[n] = a[n] | X[n] = x[n] ) = π(a[n] | x[n] ),

where the function π(a[n] | x[n] ) is prespecified.


Next: some commonly used assignment mechanisms. What is the corresponding
π(· | ·)?

2.1 Example (Bernoulli trial). The treatment assignments are independent and the probability of being treated is a constant 0 < π < 1. Here π(a[n] | x[n]) = ∏_{i=1}^n π^{a_i}(1 − π)^{1−a_i}.

2.2 Example (Sample without replacement). The treatment assignments are “completely randomised” with the only restriction that the number of treated units is 0 < n1 < n:

    π(a[n] | x[n]) = C(n, n1)^{−1} if ∑_{i=1}^n a_i = n1, and 0 otherwise,

where C(n, n1) is the binomial coefficient.

2.3 Example (Bernoulli trial with covariates). Bernoulli trial with π replaced by a function 0 < π(x) < 1: π(a[n] | x[n]) = ∏_{i=1}^n π(x_i)^{a_i}{1 − π(x_i)}^{1−a_i}.

2.4 Example (Pairwise experiment). Suppose n is even. The units are divided into n/2 pairs based on the covariates. Within each pair, one unit is randomly assigned to treatment. Let Bi = Bi(x[n]) be the pair that unit i is assigned to. Then

    π(a[n] | x[n]) = 2^{−n/2} if ∑_{i=1}^n a_i · I(Bi = j) = 1 for all j = 1, . . . , n/2, and 0 otherwise.

2.5 Exercise (Stratified experiment). Generalise the pairwise experiment to allow more
than 2 units in each group. Suppose there are m groups and group j has nj units and
n1j treated units. What is the assignment mechanism?

More complicated assignment mechanisms attempt to make the distributions of X in the treated and the control as close as possible. Examples: covariate adaptive randomisation1; re-randomisation2.
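To make these schemes concrete, here is a minimal Python sketch (illustrative only, not part of the notes; the function names are ours and the inputs are arbitrary) that draws treatment assignments under Examples 2.1, 2.2 and 2.4:

    import numpy as np

    rng = np.random.default_rng(0)

    def bernoulli_trial(n, prob):
        # Example 2.1: independent assignments with constant P(A_i = 1) = prob
        return rng.binomial(1, prob, size=n)

    def complete_randomisation(n, n1):
        # Example 2.2: sample exactly n1 treated units without replacement
        a = np.zeros(n, dtype=int)
        a[rng.choice(n, size=n1, replace=False)] = 1
        return a

    def pairwise_experiment(pairs):
        # Example 2.4: within each pair of unit indices, treat one unit at random
        a = np.zeros(2 * len(pairs), dtype=int)
        for i, j in pairs:
            a[rng.choice([i, j])] = 1
        return a

    print(bernoulli_trial(10, 0.5))
    print(complete_randomisation(10, 5))
    print(pairwise_experiment([(0, 1), (2, 3), (4, 5)]))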

2.2 Potential outcomes/Counterfactuals

After treatment assignment, we follow up the units and measure an outcome variable Yi
for unit i.

“Implicit” causal inference

To estimate the causal effect of A on Y, one may

• Compare the conditional expectations E[Y | A = 0] with E[Y | A = 1].

• Compare the conditional distributions P(Y ≤ y | A = 0) with P(Y ≤ y | A = 1).

• Further condition on X and compare E[Y | A = 0, X = x] with E[Y | A = 1, X = x], or the conditional distributions.

This approach allows us to apply familiar statistical methodologies, but it has several
limitations:

(i) Causal inference is only implicit and informal, as it seems that any difference can
only be reasonably attributed to the different treatment assignments.

(ii) Difficult to extend to non-iid treatment assignments.

(iii) Cannot distinguish internal validity from external validity.

Internal validity: Inference for the finite population consisting of the n units.
External validity: Inference for the super-population from which the n units are sampled.

Potential outcomes
The potential outcome model avoids the above problems and provides a flexible basis for
causal inference. It was first introduced by Neyman in his 1923 Master’s thesis3 to study
randomised experiments and later brought to observational studies by Rubin4 .
This approach posits a potential outcome (or counterfactual ), Yi (a[n] ), for unit i under
treatment assignment a[n] . The potential outcomes (or counterfactuals) are linked to the
observed outcome (or factuals) via the following assumption.

2.6 Assumption (Consistency). Yi = Yi (A[n] ) for all i ∈ [n].

This should not be confused with consistency of statistical estimators, which says the estimator converges to its target parameter as the sample size grows.
Question: How many potential outcomes are there in an experiment with n units? For each unit there are |A^n| = 2^n of them, one for every assignment vector a[n].
To reduce the unknowns in the problem, a common assumption is

2.7 Assumption (No interference). Yi (a[n] ) = Yi (ai ) for all i ∈ [n] and a[n] ∈ An .

Note the abuse of notation.


Due to historical reasons, people often use the jargon SUTVA (stable unit treatment
value assumption) to refer to Assumptions 2.6 and 2.7.
Unless noted otherwise, we will maintain these assumptions in this Chapter. Because
Ai is binary, we are only dealing with two potential outcomes, Yi (0) and Yi (1), for each
unit i.

Fundamental problem of causal inference

The potential outcome framework allows us to view causal inference as a missing data problem.
Consider a hypothetical experiment in Table 2.1. Using the observed (Ai, Yi) and the consistency assumption, we can impute one of the potential outcomes, Yi(Ai). However, the other potential outcome Yi(1 − Ai) is unknown.
The difference Yi(1) − Yi(0) is called the individual treatment effect for unit i, which can never be observed. This is often referred to as the “fundamental problem of causal inference”5.
However, it should be possible to estimate the treatment effect at the population level
if the treatment is randomised. Two populations can be considered:

i    Yi(0)    Yi(1)    Ai    Yi
1    ?        -3.7     1     -3.7
2    2.3      ?        0     2.3
3    ?        7.4      1     7.4
4    0.8      ?        0     0.8
...  ...      ...      ...   ...

Table 2.1: Illustration of potential outcome and the consistency assumption.

2.8 Definition. The sample average treatment effect (SATE) is defined as

    SATE = (1/n) ∑_{i=1}^n {Yi(1) − Yi(0)}.

The population average treatment effect (PATE, or simply ATE) is defined as

    ATE = E[Yi(1) − Yi(0)].

2.9 Remark. The latter implicitly assumes that the n units are sampled from a super-
population, so Yi (0) and Yi (1) follow an unknown bivariate probability distribution.

The role of randomisation


Intuitively, it should be possible to estimate the treatment effect at the population level if
the treatment is randomised. This can be formalised by the following assumption:

2.10 Assumption (Randomisation). A[n] ⊥⊥ Y[n](a[n]) | X[n] for all a[n] ∈ A^n.

The conditioning on X[n] can be removed if X is not used in the treatment assignment
(such as in Examples 2.1 and 2.2).
2.11 Remark (Fatalism). To better understand Assumption 2.10, it is often helpful to view Y[n](0) and Y[n](1) as determined prior to treatment assignment. The randomness of A[n] given X[n] (e.g. picking balls from an urn or a computer pseudo-random number generator) should then be independent of the potential outcomes. From a statistical point of view, this fatalism interpretation is unnecessary. One may regard the statistical inference as being conditional on the potential outcomes.
Note that Assumption 2.10 is different from A[n] ⊥⊥ Y[n] | X[n], as Yi = Yi(Ai) generally depends on Ai.
Recall that we are using X, A, and Y to refer to a generic Xi, Ai, and Yi when they are iid.

2.12 Theorem (Causal identification in randomised experiments). Consider a Bernoulli trial with covariates (Example 2.3), where {Xi, Ai, Yi(a), a ∈ A} are iid. Suppose the above assumptions are given and

    P(A = a | X = x) > 0, ∀a ∈ A, x,    (2.1)

then we have, for all a ∈ A and x,

    [Y(a) | X = x] =d [Y | A = a, X = x],    (2.2)

where =d means the random variables have the same distribution.

Proof. For any y ∈ R, a ∈ A, and x,

    P(Y(a) ≤ y | X = x) = P(Y(a) ≤ y | X = x, A = a)
                        = P(Y ≤ y | X = x, A = a),

where the first equality uses Assumption 2.10 and the second uses Assumption 2.6.

2.13 Remark. Equation (2.1) is called the positivity assumption. It is also called the
overlap assumption because (2.1) implies that X | A = a has the same support for all a.
It makes sure the right hand side of (2.2) is well defined.

2.14 Corollary. Under the assumptions in Theorem 2.12,

    ATE = E[Y(1) − Y(0)] = E{ E[Y | A = 1, X] − E[Y | A = 0, X] }.    (2.3)

If P(A = 1 | X) does not depend on X (Example 2.1), then

ATE = E[Y | A = 1] − E[Y | A = 0]. (2.4)

Proof. Equation (2.3) follows from taking the expectation of (2.2) and then averaging over X. For (2.4), we prove it in the case of discrete X. Since A ⊥⊥ X, we have P(X = x) = P(X = x | A = 0) = P(X = x | A = 1). By using Theorem 2.12 and the law of total expectation,

    E[Y(1)] = ∑_{x∈X} E[Y | A = 1, X = x] P(X = x)
            = ∑_{x∈X} E[Y | A = 1, X = x] P(X = x | A = 1)
            = E[Y | A = 1].

Similarly, E[Y(0)] = E[Y | A = 0].

2.15 Remark. Results like (2.2), (2.3) and (2.4) are called causal identification, because they equate a counterfactual quantity on the left hand side with a factual (so estimable) quantity on the right hand side.

2.3 Randomisation distribution of causal effect estimator

Neyman considered the following difference-in-means estimator:

    β̂ = Ȳ1 − Ȳ0, where Ȳ1 = ∑_{i=1}^n Ai Yi / ∑_{i=1}^n Ai and Ȳ0 = ∑_{i=1}^n (1 − Ai)Yi / ∑_{i=1}^n (1 − Ai).    (2.5)

Denote Y(a) = (Y1(a), Y2(a), . . . , Yn(a))T for a ∈ A. Neyman studied the conditional distribution of β̂ given the potential outcomes Y(0), Y(1). We may refer to this as the randomisation distribution, because the only randomness left in β̂ comes from the randomisation of the treatment A[n].

2.16 Theorem. Let Assumptions 2.6, 2.7 and 2.10 be given and suppose the treatment assignments Ai are sampled without replacement according to Example 2.2. Then

    E[β̂ | Y(0), Y(1)] = SATE = (1/n) ∑_{i=1}^n {Yi(1) − Yi(0)},    (2.6)

    Var(β̂ | Y(0), Y(1)) = S0²/n0 + S1²/n1 − S01²/n,    (2.7)

where n0 = n − n1, Sa² = ∑_{i=1}^n (Yi(a) − Ȳ(a))²/(n − 1) and Ȳ(a) = ∑_{i=1}^n Yi(a)/n for a = 0, 1, and S01² = ∑_{i=1}^n (Yi(1) − Yi(0) − SATE)²/(n − 1).

The expectation and variance are computed under the randomisation distribution of β̂, in which the potential outcomes Y(1) and Y(0) are treated as fixed and the randomness comes from the randomisation of A[n]. As a consequence, the right hand sides of (2.6) and (2.7) depend on the unobserved potential outcomes Y(1) and Y(0).

Proof of Equation (2.6). For simplicity of exposition, we omit the conditioning on Y(0), Y(1) below. By using E[Ai] = n1/n, the consistency assumption and the linearity of expectations,

    E[β̂] = E[ (1/n1) ∑_{i=1}^n Ai Yi − (1/n0) ∑_{i=1}^n (1 − Ai) Yi ]
         = E[ (1/n1) ∑_{i=1}^n Ai Yi(1) − (1/n0) ∑_{i=1}^n (1 − Ai) Yi(0) ]
         = (1/n1) ∑_{i=1}^n (n1/n) Yi(1) − (1/n0) ∑_{i=1}^n (n0/n) Yi(0)
         = Ȳ(1) − Ȳ(0).
2.17 Exercise. Prove (2.7). Hint: Let Yi*(a) = Yi(a) − Ȳ(a), a = 0, 1. Show that

    Var(β̂ | Y(0), Y(1)) = E[ { ∑_{i=1}^n ( (Ai/n1) Yi*(1) − ((1 − Ai)/n0) Yi*(0) ) }² ].

Then expand the sum of squares and use

    E[Ai Ai′] = (n1/n) · (n1 − 1)/(n − 1) for i ≠ i′, and ∑_{i=1}^n Yi*(a) = 0.

To get interval estimators, we need to estimate the variance in (2.7). However, S01² is non-estimable. Why is that? Notice that S01² is the sample variance of the individual treatment effects and depends on the covariance of Yi(1) and Yi(0), which can never be observed together (the “fundamental problem of causal inference”).
Instead, it is common to estimate the variance (2.7) by Ŝ0²/n0 + Ŝ1²/n1, where

    Ŝ1² = (1/(n1 − 1)) ∑_{i=1}^n Ai(Yi − Ȳ1)², Ŝ0² = (1/(n0 − 1)) ∑_{i=1}^n (1 − Ai)(Yi − Ȳ0)².

This is an unbiased estimator of S0²/n0 + S1²/n1 (the proof is similar to that of (2.6) and is left as an exercise). Thus we get a conservative (on average) estimator of the variance of β̂.
Distributional results are further needed to form confidence intervals. Central limit theorems can be established by assuming that the potential outcomes in Y(0) and Y(1) are not too volatile.6
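As a numerical illustration, the following minimal Python sketch (simulated potential outcomes; the constant treatment effect of 2 is an arbitrary choice) computes β̂ and the conservative variance estimate Ŝ0²/n0 + Ŝ1²/n1:

    import numpy as np

    rng = np.random.default_rng(1)
    n, n1 = 100, 50
    y0 = rng.normal(0, 1, n)   # potential outcomes under control (simulated)
    y1 = y0 + 2                # potential outcomes under treatment (simulated)

    a = np.zeros(n, dtype=int)
    a[rng.choice(n, size=n1, replace=False)] = 1   # sampling without replacement
    y = np.where(a == 1, y1, y0)                   # consistency: Y_i = Y_i(A_i)

    beta_hat = y[a == 1].mean() - y[a == 0].mean()
    var_hat = y[a == 1].var(ddof=1) / n1 + y[a == 0].var(ddof=1) / (n - n1)
    # point estimate and an approximate 95% confidence interval
    print(beta_hat, beta_hat - 1.96 * var_hat**0.5, beta_hat + 1.96 * var_hat**0.5)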
2.18 Remark. One drawback of Neyman’s randomisation inference is that it is difficult
to extend it to settings with covariates (unless the covariates are discrete). The main
obstacle is that the randomisation distribution necessarily depends on unobserved potential
outcomes.

2.4 Randomisation test of sharp null hypothesis

Fisher7 appears to be the first to grasp fully the importance of randomisation for credible
causal inference.8

Testing no causal effect


Fisher considered testing the sharp null hypothesis (or exact null hypothesis) H0 : Yi (0) =
Yi (1), ∀i ∈ [n].
Using H0 , we can impute all the missing potential outcomes (Table 2.2).
Suppose A[n] in Table 2.2 is randomised by sampling without replacement (Example 2.2). Consider Neyman’s difference-in-means estimator β̂ = Ȳ1 − Ȳ0. Because of the way A[n] is randomised, the following 6 scenarios are equally likely:

i    Yi(0)    Yi(1)    Ai    Yi
1    -3.7     -3.7     1     -3.7
2    2.3      2.3      0     2.3
3    7.4      7.4      1     7.4
4    0.8      0.8      0     0.8

Table 2.2: Illustration of Fisher’s randomisation test.

(i) A[4] = (1, 1, 0, 0), β̂ = (−3.7 + 2.3 − 7.4 − 0.8)/2 = −4.8.

(ii) A[4] = (1, 0, 1, 0), β̂ = (−3.7 − 2.3 + 7.4 − 0.8)/2 = 0.3.

(iii) A[4] = (1, 0, 0, 1), β̂ = (−3.7 − 2.3 − 7.4 + 0.8)/2 = −6.3.

(iv) A[4] = (0, 1, 1, 0), β̂ = (3.7 + 2.3 + 7.4 − 0.8)/2 = 6.3.

(v) A[4] = (0, 1, 0, 1), β̂ = (3.7 + 2.3 − 7.4 + 0.8)/2 = −0.3.

(vi) A[4] = (0, 0, 1, 1), β̂ = (3.7 − 2.3 + 7.4 + 0.8)/2 = 4.8.

The observed realisation is the second.


Fisher proposed to test H0 based on how extreme the observed β̂ is compared to other
potential values of β̂.
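Here is a minimal Python sketch of this enumeration for the data in Table 2.2 (using the absolute difference in means to measure extremeness, i.e. a two-sided test; this implementation is ours, not from the notes):

    from itertools import combinations
    import numpy as np

    y = np.array([-3.7, 2.3, 7.4, 0.8])   # observed outcomes from Table 2.2
    observed = (0, 2)                      # treated units under A = (1, 0, 1, 0)

    def diff_in_means(treated):
        mask = np.zeros(len(y), dtype=bool)
        mask[list(treated)] = True
        return y[mask].mean() - y[~mask].mean()

    beta_obs = diff_in_means(observed)                              # 0.3
    betas = [diff_in_means(t) for t in combinations(range(4), 2)]   # the 6 scenarios
    p_value = np.mean([abs(b) >= abs(beta_obs) for b in betas])
    print(betas, p_value)   # p = 1 here: the observed 0.3 is the least extreme value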

A more general setup


Next we formalise the idea above. We first consider a more general class of null hypotheses (let β be a fixed value):

    H0 : Yi(1) − Yi(0) = β, ∀i ∈ [n].    (2.8)


This is still a very strong hypothesis: it says the individual treatment effect is always
a fixed value β.
Using the consistency assumption, (2.8) allow us to impute the potential outcomes as

Yi ,
 if a = Ai ,
Yi (a) = Yi + β, if a > Ai , (2.9)

Yi − β, if a < Ai .

A more compact form is Y[n] (a[n] ) = Y[n] + β(a[n] − A[n] ).


In randomisation inference, it is also necessary to choose a test statistic T (A[n] , X[n] , Y[n] ).
An example is the difference-in-means estimator β̂.

Randomisation distribution
The key step is to derive the randomisation distribution of T . There are two ways to do
this:

(i) Consider the distribution of T1(A[n], X[n], Y[n](0)) given X[n] and Y[n](0);

(ii) Consider the distribution of T2(A[n], X[n], Y[n](A[n])) given X[n], Y[n](0), and Y[n](1).

In both cases, the randomness comes from the randomisation of A[n]. The first approach tries to test the conditional independence A[n] ⊥⊥ Y[n](0) | X[n]. The second approach tries to directly obtain the randomisation distribution of T(A[n], X[n], Y[n]) and bears a resemblance to Neyman’s inference.
It is easy to see that the two approaches are exactly the same if β = 0. Exercise 2.21 below shows that they are still equivalent if β ≠ 0. For more complex hypotheses, however, one approach can be more convenient than the other.
Let F = (X[n], Y[n](0), Y[n](1)). The randomisation distributions in the two approaches above are given by

    F1(t) = P( T1(A[n], X[n], Y[n](0)) ≤ t | F )

and

    F2(t) = P( T2(A[n], X[n], Y[n](A[n])) ≤ t | F ).

The observed test statistics are

T1 = T1 (A[n] , X[n] , Y[n] − βA[n] ), T2 = T2 (A[n] , X[n] , Y[n] ).

The one-sided p-value is the probability of observing the same or a more extreme test
statistic than the observed statistic T ,

Pm = Fm (Tm ), m = 1, 2.

An equivalent and perhaps more informative representation is

    P1 = P*( T1(A*[n], X[n], Y[n](0)) ≤ T1 | F ),

where A*[n] is an independent copy of A[n] (so A*[n] | X[n] ∼ π but A*[n] ⊥⊥ A[n]), and P* means that the probability is with respect to the distribution of A*[n]. The other p-value P2 can be similarly defined. Note that the dependence of P1 and P2 on F is omitted.
A level-α randomisation test then rejects H0 if Pm ≤ α.

2.19 Theorem. Under SUTVA (Assumptions 2.6 and 2.7) and H0 , P(Pm ≤ α) ≤
α, ∀ 0 < α < 1, m = 1, 2.

Proof. This follows from the property of the distribution function: if F(t) is the distribution function of a random variable T, then F(T) stochastically dominates the uniform distribution on [0, 1]. To show this, let F⁻¹(α) = sup{t | F(t) ≤ α}. By using the fact that F(t) is non-decreasing and right-continuous (see Figure 2.1),

    P(F(T) ≤ α) = P(T < F⁻¹(α)) = lim_{t↑F⁻¹(α)} P(T ≤ t) ≤ α.

This shows that P(Pm ≤ α | F) ≤ α. To get the unconditional result, take the expectation over the random variables in F.

2.20 Remark. The probability integral transform says that if T is a continuous random
variable and F (t) is its distribution function, then F (T ) is uniformly distributed on [0, 1].
However, we cannot directly use this well known result here because our T has a discrete
(conditional) distribution.

The role of randomisation


Theorem 2.19 essentially just restates a basic fact in probability theory. Notice that the randomisation assumption (Assumption 2.10) is not required in Theorem 2.19. So what is the role of randomisation?
The randomisation assumption (and basically all other assumptions in Theorem 2.19) makes the p-values possible to compute. To see this, by definition,

    F1(t) = P( T1(A[n], X[n], Y[n](0)) ≤ t | F )
          = ∑_{a[n]∈A^n} P(A[n] = a[n] | F) · I( T1(a[n], X[n], Y[n](0)) ≤ t ).    (2.10)

The conditional independence in Assumption 2.10 and H0 (so there is no further randomness in Y[n](1) after conditioning on Y[n](0)) allow us to replace the first term by

    P(A[n] = a[n] | F) = P(A[n] = a[n] | X[n]) = π(a[n] | X[n]),

which is the known randomisation scheme.


2.21 Exercise. Show that the two tests for H0 above (based on different randomisation
distributions) are equivalent. Hint: Construct a one-to-one mapping between the functions
T1 and T2 .
2.22 Example. Some commonly used test statistics in randomisation inference include (to simplify the exposition, we consider β = 0):

(i) The t-statistic: T = |β̂| / √(V̂ar(β̂)), where β̂ is the difference-in-means estimator (2.5).

(ii) The Mann–Whitney U statistic (equivalent to Wilcoxon’s rank-sum statistic): T = ∑_{i,j=1}^n I(Ai > Aj) · I(Yi > Yj).

(iii) The regression-adjusted statistic: T is the least squares coefficient of A in the regression Y ∼ A + X + A : X (R formula notation, see next section).

[Figure 2.1 sketches the probability mass function and the distribution function of the test statistic T, illustrating why P(F(T) ≤ α) = P(T < F⁻¹(α)) ≤ α.]

Figure 2.1: Illustration of the randomisation test.

Practical issues
Computing the p-value exactly via its definition (2.10) can be computationally intensive because it requires summing over A^n. In practice, F(T) is often computed by Monte-Carlo simulation.
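A minimal Python sketch of such a Monte-Carlo approximation, for complete randomisation and the difference-in-means statistic (the data below are simulated purely for illustration):

    import numpy as np

    rng = np.random.default_rng(2)

    def mc_randomisation_pvalue(a, y, n_draws=10_000):
        # Two-sided Monte-Carlo p-value for the sharp null beta = 0
        def stat(a_):
            return y[a_ == 1].mean() - y[a_ == 0].mean()
        t_obs = stat(a)
        # re-randomise A by permutation, which keeps n1 fixed
        draws = np.array([stat(rng.permutation(a)) for _ in range(n_draws)])
        return np.mean(np.abs(draws) >= abs(t_obs))

    a = np.repeat([1, 0], 50)
    y = np.concatenate([rng.normal(1, 1, 50), rng.normal(0, 1, 50)])
    print(mc_randomisation_pvalue(a, y))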
In the example sheet, you will learn how to obtain an estimator of β (suppose H0
is true for some unknown β) by using the Hodges-Lehmann estimator.9 You will also
explore how to obtain a confidence interval for β.

2.5 Super-population inference and regression adjustment

Next we consider a different inference paradigm. Rather than conditioning or assuming a hypothesis on the potential outcomes, we will assume the potential outcomes are drawn from a “super-population” and are independent and identically distributed (i.i.d.). We will discuss asymptotic super-population inference by considering the simple Bernoulli trial (Example 2.1), so Ai ⊥⊥ Xi. Further, suppose (Ai, Xi, Yi(0), Yi(1)) are i.i.d. To focus on the main ideas, we assume that X is centred, i.e. E[X] = 0.
Denote π = P(A = 1), Σ = E[XXᵀ], and β = E[Y | A = 1] − E[Y | A = 0]. Corollary 2.14 shows that β in a randomised experiment is equal to the (population) average treatment effect.
We shall consider three regression estimators of β:

    (α̂1, β̂1) = argmin_{α,β} (1/n) ∑_{i=1}^n (Yi − α − βAi)²,    (2.11)

    (α̂2, β̂2, γ̂2) = argmin_{α,β,γ} (1/n) ∑_{i=1}^n (Yi − α − βAi − γᵀXi)²,    (2.12)

    (α̂3, β̂3, γ̂3, δ̂3) = argmin_{α,β,γ,δ} (1/n) ∑_{i=1}^n (Yi − α − βAi − γᵀXi − Ai(δᵀXi))².    (2.13)

It is easy to show that β̂1 is the difference-in-means estimator (2.5). The rationale for β̂2 is that the adjustment tends to improve precision if X is correlated with Y. This is known as the analysis of covariance (ANCOVA)10. The third estimator further models treatment effect heterogeneity through the interaction term AX.
The classical linear regression theory for these estimators assumes the regression models
are correct. Below we provide a modern analysis that allows model misspecification by
using the M-estimation theory. Our analysis is a compact version of previous results11 .
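Before turning to the theory, here is a minimal Python sketch (a simulated Bernoulli trial with one centred covariate; the coefficients are arbitrary choices of ours) showing that all three estimators target the same β:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 2000
    x = rng.normal(0, 1, n)       # one centred covariate
    a = rng.binomial(1, 0.5, n)   # Bernoulli trial, A independent of X
    y = 1 + 2 * a + x + 0.5 * a * x + rng.normal(0, 1, n)   # ATE = 2 since E[X] = 0

    def beta_hat(design):
        # least squares fit of y on the design; column 1 is A in all three designs
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        return coef[1]

    one = np.ones(n)
    print(beta_hat(np.column_stack([one, a])))             # (2.11): difference in means
    print(beta_hat(np.column_stack([one, a, x])))          # (2.12): ANCOVA
    print(beta_hat(np.column_stack([one, a, x, a * x])))   # (2.13): with interaction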
Let’s first write down the population versions of the least squares problems:

    (α1, β1) = argmin_{α,β} E[(Y − α − βA)²],    (2.14)

    (α2, β2, γ2) = argmin_{α,β,γ} E[(Y − α − βA − γᵀX)²],    (2.15)

    (α3, β3, γ3, δ3) = argmin_{α,β,γ,δ} E[(Y − α − βA − γᵀX − A·(δᵀX))²].    (2.16)

By the law of large numbers, we expect β̂m to converge to βm, m = 1, 2, 3, as n → ∞. To focus on the essential ideas, below we will omit the regularity conditions (for example, to ensure these parameters exist and the central limit theorems hold).

2.23 Lemma. Suppose (Xi, Ai, Yi) are iid, A ⊥⊥ X, and E[X] = 0. Then α1 = α2 = α3 and β1 = β2 = β3 = β.

Proof. By taking partial derivatives of E[(Y − α − βA − γᵀX − A·(δᵀX))²] with respect to α and β, we obtain

    E[Y − α3 − β3A − γ3ᵀX − A(δ3ᵀX)] = 0,
    E[A(Y − α3 − β3A − γ3ᵀX − A(δ3ᵀX))] = 0.

Using E[X] = 0 and A ⊥⊥ X, they can be simplified to

    E[Y − α3 − β3A] = 0,
    E[A(Y − α3 − β3A)] = 0.

Following the same derivation, these two equations also hold for the other estimators. By cancelling α3 in the equations, we get β3 = β.

2.24 Exercise. Derive α1 , α2 , α3 , γ2 , γ3 , and δ3 in terms of the distribution of (X, A, Y )


and in terms of the distribution of (X, A, Y (0), Y (1)).

2.25 Remark. Notice that Lemma 2.23 does not rely on the correctness of the linear model.
Modern causal inference often tries to make minimal assumptions about the data and
avoid relying on specific statistical models (“all models are wrong, but some are useful”).
We will use a general result for least squares estimators to study the asymptotic
behaviour of β̂1 ,β̂2 , and β̂3 .12

2.26 Lemma. Suppose (Zi, Yi), i = 1, . . . , n are iid and E[ZZᵀ] is positive definite. Let θ = (E[ZZᵀ])⁻¹(E[ZY]) be the population least squares parameter and

    θ̂ = ( (1/n) ∑_{i=1}^n Zi Ziᵀ )⁻¹ ( (1/n) ∑_{i=1}^n Zi Yi )

be its sample estimator. Let εi = Yi − θᵀZi be the regression error. Suppose Y and Z have bounded fourth moments; then θ̂ →p θ and

    √n (θ̂ − θ) →d N( 0, {E[ZZᵀ]}⁻¹ E[ZZᵀε²] {E[ZZᵀ]}⁻¹ ),    (2.17)

as n → ∞.

2.27 Remark. The variance in (2.17) is said to be robust to heteroscedasticity, meaning that Var(ε | X) depends on X in a regression model. It can be obtained using the general M-estimation (M for maximum/minimum) or Z-estimation (Z for zero) theory.

Informal proof of (2.17). Notice that θ̂ is an empirical solution to the equation

    E[ψ(θ; Z, Y)] = 0,

where

    ψ(θ; Z, Y) = Z · (Y − Zᵀθ) = Zε.    (2.18)

For a general function ψ, the Z-estimation theory shows that

    √n (θ̂ − θ) →d N( 0, {E[∂ψ(θ)/∂θ]}⁻¹ E[ψ(θ)ψ(θ)ᵀ] {E[∂ψ(θ)/∂θ]}⁻ᵀ ).    (2.19)

By plugging in (2.18), we obtain (2.17). The asymptotic normality (2.19) follows from the argument below. Using Taylor’s expansion,

    0 = (1/n) ∑_{i=1}^n ψ(θ̂; Zi, Yi)
      = (1/n) ∑_{i=1}^n [ ψ(θ; Zi, Yi) + (θ̂ − θ)ᵀ (∂/∂θ) ψ(θ; Zi, Yi) ] + Rn.

By using θ̂ →p θ, it can be shown that the residual term Rn is asymptotically smaller than the other two terms and can be ignored. Thus

    √n (θ̂ − θ) ≈ { E[(∂/∂θ) ψ(θ; Z, Y)] }⁻¹ [ (1/√n) ∑_{i=1}^n ψ(θ; Zi, Yi) ].    (2.20)

The first term on the right hand side converges in probability to E[∂ψ(θ)/∂θ]. The second term converges in distribution to a normal random variable with variance E[ψ(θ)ψ(θ)ᵀ]. Using Slutsky’s theorem, we arrive at (2.19).
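A minimal Python sketch of the sandwich variance in (2.17), on simulated heteroscedastic data (the error scaling below is an arbitrary choice of ours):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 1000
    z = np.column_stack([np.ones(n), rng.normal(0, 1, n)])   # design with intercept
    eps = rng.normal(0, 1, n) * (1 + np.abs(z[:, 1]))        # heteroscedastic errors
    y = z @ np.array([1.0, 2.0]) + eps

    theta_hat, *_ = np.linalg.lstsq(z, y, rcond=None)
    resid = y - z @ theta_hat
    bread = np.linalg.inv(z.T @ z / n)              # {E[Z Z^T]}^{-1}
    meat = (z * resid[:, None] ** 2).T @ z / n      # E[Z Z^T eps^2]
    cov_hat = bread @ meat @ bread / n              # estimated Var(theta_hat)
    print(theta_hat, np.sqrt(np.diag(cov_hat)))     # robust standard errors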

2.28 Remark. The Z-estimation theory generalises the asymptotic theory for maximum
likelihood estimator (MLE), where ψ is the score function (gradient of the log-likelihood).
In that case, it can be shown that E[∂ψ(θ)/∂θ] = E[ψ(θ)ψ(θ)T ] is the Fisher information
matrix, which you may recognise from your undergraduate lectures (Part II Principles of
Statistics).

2.29 Exercise. Let ε_{i1}, ε_{i2}, ε_{i3} be the error terms in the three regression estimators:

    ε_{im} = Yi − αm − βmAi − γmᵀXi − Ai(δmᵀXi), m = 1, 2, 3.

Here we are using the convention γ1 = 0 and δ1 = δ2 = 0. By using Lemma 2.26 with different Z and θ, show that, as n → ∞,

    √n (β̂m − β) →d N(0, Vm), where Vm = E[(A − π)² εm²] / {π²(1 − π)²}, m = 1, 2, 3.
2.30 Theorem. Suppose (Xi, Ai, Yi) are iid, A ⊥⊥ X, and E[X] = 0. Then as n → ∞,

    √n (β̂m − β) →d N(0, Vm), m = 1, 2, 3,

where the asymptotic variances satisfy V3 ≤ min{V1, V2}.

Proof. By differentiating (2.16),

    E[ε3] = E[Aε3] = 0, E[Xε3] = E[AXε3] = 0.

By Lemma 2.23,

    ε1 = ε3 + γ3ᵀX + A(δ3ᵀX),
    ε2 = ε3 + (γ3 − γ2)ᵀX + A(δ3ᵀX).

Thus, for m = 1, 2,

    E[(A − π)² εm²] − E[(A − π)² ε3²] = E[ (A − π)² { (γ3 − γm)ᵀX + A(δ3ᵀX) }² ] ≥ 0.

2.31 Exercise. Use your results in Exercise 2.24 to derive the conditions under which
V1 = V2 = V3 . Show that V2 ≤ V1 is not always true, therefore, regression adjustment
does not always reduce the asymptotic variance (if not done properly)!

2.32 Remark. The assumption E[X] = 0 is useful to simplify the calculations above. In practice, we obviously don’t know if this assumption is true, so it is common to centre X before solving the least squares problem. Let β̃1, β̃2, β̃3 be the least squares estimators in eqs. (2.11) to (2.13) with Xi replaced by Xi − X̄, where X̄ = (1/n) ∑_{i=1}^n Xi. It is easy to show that β̃1 = β̂1 and β̃2 = β̂2 (because of the intercept term) and β̃3 = β̂3 + δ̂3ᵀX̄. Therefore, β̃1 and β̃2 have the same asymptotic distributions as β̂1 and β̂2, even when E[X] ≠ 0. However, the asymptotic variance of β̃3 (denoted as Ṽ3) is larger than that of β̂3. It can be shown that Ṽ3 = V3 + δ3ᵀΣδ3 and that Ṽ3 ≤ min{V1, V2} still holds.

2.6 Comparison of different modes of inference

See Table 2.3.

                        Neyman’s inference       Randomisation test           Regression

Population              Finite                   Finite                       Super-population
Randomness              Only A                   Only A                       A, X, Y
Point estimator         Difference-in-means      Hodges-Lehmann               Least squares
                                                 (example sheet)
Inference               Exact variance,          Exact (approximate if        Asymptotic
                        asymptotic CI            using Monte-Carlo)
Covariate adjustment    No                       Yes                          Yes
Effect heterogeneity    Yes                      No                           Yes

Table 2.3: Side-by-side comparison of the modes of inference.

Notes

1. Ye, T., Shao, J., & Zhao, Q. (2020). Principles for covariate adjustment in analyzing randomized clinical trials. arXiv: 2009.11828 [stat.ME].
2. Li, X., & Ding, P. (2019). Rerandomization and regression adjustment. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1), 241–268. doi:10.1111/rssb.12353.
3. Splawa-Neyman, J., Dabrowska, D. M., & Speed, T. P. (1990). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, 5(4), 465–472. doi:10.1214/ss/1177012031.
4. Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688–701. doi:10.1037/h0037350.
5. Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396), 945–960. doi:10.1080/01621459.1986.10478354.
6. For a recent review, see Li, X., & Ding, P. (2017). General forms of finite population central limit theorems with applications to causal inference. Journal of the American Statistical Association, 112(520), 1759–1769. doi:10.1080/01621459.2017.1295865.
7. Fisher, R. A. (1925). Statistical methods for research workers (1st ed.). Oliver and Boyd, Edinburgh and London.
8. See Chapter 2 of Imbens and Rubin, 2015, for a historical account.
9. Rosenbaum, P. R. (1993). Hodges-Lehmann point estimates of treatment effect in observational studies. Journal of the American Statistical Association, 88(424), 1250–1253. doi:10.1080/01621459.1993.10476405.
10. Fisher, R. A. (1932). Statistical methods for research workers (4th ed.). Oliver and Boyd, Edinburgh and London.
11. Tsiatis, A. A., Davidian, M., Zhang, M., & Lu, X. (2008). Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: A principled yet flexible approach. Statistics in Medicine, 27(23), 4658–4677. doi:10.1002/sim.3113. Imbens, G. W., & Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences (1st ed.). Cambridge University Press, Chapter 7. For finite-population randomisation inference, see Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. The Annals of Applied Statistics, 7(1), 295–318. doi:10.1214/12-aoas583.
12. See, for example, White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838. doi:10.2307/1912934.

Chapter 3

Path analysis

The last Chapter introduced potential outcomes to describe a causal effect. That approach is most convenient when we focus on a single cause-effect pair.
An alternative and more holistic approach is to use a graph, in which directed edges represent causal effects. This framework traces back to a series of works on path analysis by Sewall Wright a century ago.1

3.1 Graph terminology

We start by introducing some terminology from graph theory.

3.1 Definition (Graph and subgraph). A graph G = (V, E) is defined by its finite
vertex set V and its edge set E ⊆ V × V containing ordered pairs of vertices. The
subgraph of G restricted to A ⊂ V is GA = (A, EA ) where EA = {(i, j) ∈ E | i, j ∈ A}.

3.2 Definition (Directed edge and graph). An edge (i, j) is called directed and written as i → j if (i, j) ∈ E but (j, i) ∉ E. Vertex i is called a parent of j and j a child of i if i → j. The set of parents of a vertex i is denoted as paG(i) or simply pa(i). A graph G is called a directed graph if all its edges are directed.

3.3 Definition (Path and cycle). A path between i and j is a sequence of distinct
vertices k0 = i, k1 , k2 , . . . , km = j such that the consecutive vertices are adjacent,
that is, (kl−1 , kl ) ∈ E or (kl , kl−1 ) ∈ E for all l = 1, 2, . . . , m.
A directed path from i to j is a path in which all the arrows are going “forward”,
that is, (kl−1 , kl ) ∈ E for all l = 1, 2, . . . , m.
A cycle is a directed path with the only modification that the first and last
vertices are the same km = k0 .
A directed acyclic graph (DAG) is a directed graph with no cycles.

3.4 Definition (Ancestor and descendant). In a DAG G, a vertex i is an ancestor
of j if there exists a directed path from i to j; conversely, j is called a descendant of
i. Let an G (I) denote the union of ancestors and de G (I) the union of descendants of
all vertices in I.

3.5 Exercise. Show that a directed graph is acyclic if and only if the vertices can be
relabeled in a way that the edges are monotone in the label (this is called a topological
ordering). In other words, there exists a permutation (k1 , . . . , kp ) of (1, . . . , p) such that
(i, j) ∈ E implies ki < kj .

3.6 Exercise. Show that for any J ⊂ [p], there exists i 6∈ J such that all the descendants
of i in a DAG G are in J. Hint: Use the topological ordering.

3.7 Definition (Graphical model). A graphical model is a graph G = (V, E) and a bijection from the vertex set V of the graph to a set of random variables X.
A causal graphical model is a collection of graphical models, each corresponding to an intervention on X.

For the rest of this course we will focus on DAG models, which provide a natural
setup for causal inference.
Some conventions: We will often consider the random variables X = X[p] = (X1, . . . , Xp) and the graphical model G = (V = [p], E) with the map i ↦ Xi. To simplify notation, we won’t distinguish X[p] from the set {X1, . . . , Xp}. We will often not distinguish between G and the graph induced by the graphical model, G[X] = (X, E[X]), where (Xi, Xj) ∈ E[X] if and only if (i, j) ∈ E.

3.2 Linear structural equation models

Wright’s path analysis applies to random variables that satisfy the linear structural
equation model (SEM), a (causal) graphical model defined below.

3.8 Definition (Linear SEM). The random variables X[p] satisfy a linear SEM with respect to a DAG G = (V = [p], E) if they satisfy

    Xi = β0i + ∑_{j∈pa(Xi)} βji Xj + εi,    (3.1)

where ε1, . . . , εp are mutually independent with mean 0 and finite variance, and the interventional distributions of X also satisfy (3.1) (see Remark 3.9). The parameter βji is called a path coefficient.

Equation (3.1) looks just like a linear model and can indeed be fitted using linear regression. What makes it structural or causal is an implicit assumption that (3.1) still holds if we make interventions to one or some of the variables. For example, consider the following linear SEM,

    X1 = ε1,
    X2 = 2X1 + ε2,
    X3 = X1 + 2X2 + ε3.

By recursive substitution, we have X3 = 5ε1 + 2ε2 + ε3. If we make an intervention that sets X1 to some value x1 (for example, give a treatment to an experimental unit), we can still use the above equations to derive the values of X by simply replacing the equation X1 = ε1 with X1 = x1:

    X1 = x1,
    X2 = 2X1 + ε2,
    X3 = X1 + 2X2 + ε3.

By recursive substitution, we obtain X3 = 5x1 + 2ε2 + ε3 under the intervention X1 = x1. Thus a structural equation model not only describes the distribution of the random variables but also their data generating mechanism.
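A minimal Python sketch of this example (simulated errors; the intervention value x1 = 1 is an arbitrary choice of ours):

    import numpy as np

    rng = np.random.default_rng(5)
    n = 100_000
    e1, e2, e3 = rng.normal(0, 1, (3, n))

    # Observational regime
    x1 = e1
    x2 = 2 * x1 + e2
    x3 = x1 + 2 * x2 + e3

    # Intervention: replace the equation X1 = e1 by X1 = 1 and keep the rest
    x1_do = np.ones(n)
    x2_do = 2 * x1_do + e2
    x3_do = x1_do + 2 * x2_do + e3

    print(x3.mean(), x3_do.mean())   # approx 0 and 5: X3 = 5*x1 + 2*e2 + e3 at x1 = 1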
3.9 Remark. The distinction between a structural equation model and a regression model can be further elucidated using counterfactuals. Later in the course we will define the counterfactuals for the intervention Xj = xj by replacing Xj with xj in (3.1) and all other Xi by Xi(xj), the counterfactual value of Xi. In the example above, this amounts to

    X1(x1) = x1,
    X2(x1) = 2x1 + ε2,
    X3(x1) = x1 + 2X2(x1) + ε3.

So an SEM is not only a model for the factuals (like regression models) but also a model for the counterfactuals (unlike regression models).
We can use matrix notation to write (3.1) more compactly:

    X[p] = β0 + Bᵀ X[p] + ε[p],

where the (i, j)-entry of B is βij.

3.10 Exercise. Suppose the vertices in a DAG are labelled according to a topological ordering. What property does the matrix B have? Use this property to show that Cov(X[p]) is positive definite.

Given a linear SEM, we may define the causal effect of Xi on Xj as the sum, over all directed paths from i to j, of the products of path coefficients along each path.
3.11 Definition. Let C(i, j) be the collection of all directed paths (“causal paths”) from i to j. The causal effect of Xi on Xj in a linear SEM is defined as

    β(Xi → Xj) = ∑_{(k0,...,km)∈C(i,j)} ∏_{l=1}^m β_{k_{l−1} k_l}.    (3.2)

Immediately we have: if j precedes i in a topological ordering of G, then β(Xi → Xj) = 0, i.e., Xi has no causal effect on Xj.
3.12 Remark. Notice that Cov(Xi , Xj ) = Cov(Xj , Xi ) is symmetric, but β(Xi → Xj ) is
clearly not symmetric.

3.3 Path analysis

Wright’s path analysis uses the path coefficients to express the covariance matrix of X
and clearly describes why “correlation does not imply causation”.

3.13 Definition. A path k0 = i, k1, . . . , km = j between i and j is called d-connected or open if it does not contain a collider k_{l−1} → k_l ← k_{l+1}.

The letter d in d-connected stands for dependence. Intuitively, a d-connected path introduces dependence between Xi and Xj.
Let D(i, j) be the collection of all d-connected paths between i and j.

3.14 Theorem (Wright’s path analysis). Suppose the random variables X[p] satisfy the linear SEM (3.1) with respect to a DAG G and are standardised so that Var(Xi) = 1 for all i. Then

    Cov(Xi, Xj) = ∑_{(k0,...,km)∈D(i,j)} ∏_{l=1}^m β_{k_{l−1} k_l}.    (3.3)

Proof. Without loss of generality, let’s assume (1, . . . , p) is a topological order of G and
i < j. We prove Theorem 3.14 by induction. Equation (3.3) is obviously true if i = 1 and
j = 2. Now suppose (3.3) is true for any i < j ≤ k, where 2 ≤ k ≤ p − 1. It suffices to
show that (3.3) also holds for i < j = k + 1. The key is to realise that D(i, j) can be
obtained by taking a union of all paths in D(i, l) for l ∈ pa(j) appended with the edge
l → j. See Figure 3.1 for an illustration.
By (3.1),

    Xj = ∑_{l∈pa(j)} βlj Xl + εj.

[Figure 3.1: a DAG in which the d-connected paths from i to j pass through the parents l1 and l2 of j.]

Figure 3.1: Illustration for the proof of Theorem 3.14.

Therefore, using the induction hypothesis and the trivial fact that Xi ⊥⊥ εj (because i precedes j),

    Cov(Xi, Xj) = ∑_{l∈pa(j)} βlj Cov(Xi, Xl)
                = ∑_{l∈pa(j)} βlj ∑_{(k0,...,km)∈D(i,l)} ∏_{l′=1}^m β_{k_{l′−1} k_{l′}}
                = ∑_{l∈pa(j)} ∑_{(k0,...,km)∈D(i,l)} ( ∏_{l′=1}^m β_{k_{l′−1} k_{l′}} ) · βlj
                = ∑_{(k0,...,km+1)∈D(i,j)} ∏_{l′=1}^{m+1} β_{k_{l′−1} k_{l′}}.

3.15 Exercise. Modify equation (3.3) so that it is still true when the random variables
are not standardised. Hint: How many “forks” kl−1 ← kl → kl+1 can a d-connected path
have?

3.4 Correlation and causation

When is correlation the same as causation? Comparing (3.2) with (3.3), we see that this is only the case if all the d-connected paths are directed.
The causal effect of Xi on Xj is said to be confounded if i and j share a common ancestor in the graph. In this case, a non-zero correlation between Xi and Xj does not imply a causal relationship.

3.16 Example. Consider the graphical model in Figure 3.2. Applying path analysis and
assuming A, X, Y have unit variance, we obtain

    Cov(A, X) = βXA,
    Cov(X, Y) = βXY + βXA βAY,                                  (3.4)
    Cov(A, Y) = βAY + βXA βXY.

Thus Cov(A, Y), the coefficient of regressing Y on A, is generally not equal to
β(A → Y) = βAY.

Figure 3.2: X confounds the causal effect of A on Y (edges X → A, X → Y, and
A → Y, with coefficients βXA, βXY, and βAY respectively).

3.17 Example (Continuing Example 3.16). To remove the confounding effect, it is
common to add X to the linear regression. Let the coefficient of A in that regression be
γAY·X. Using least squares theory, this is equal to the partial regression coefficient of
Y − Cov(X, Y)X on A − Cov(X, A)X:

    γAY·X = Cov(A − Cov(X, A)X, Y − Cov(X, Y)X) / Var(A − Cov(X, A)X)
          = [Cov(A, Y) − Cov(X, A) Cov(X, Y)] / [1 − Cov(X, A)²].          (3.5)

Plugging the expressions in (3.4) into (3.5), we obtain

    γAY·X = [βAY + βXA βXY − βXA(βXY + βXA βAY)] / (1 − βXA²) = βAY.
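To make this concrete, here is a small simulation sketch (not part of the notes; the
coefficient values, sample size, and use of numpy are illustrative choices). Regressing
Y on A alone recovers Cov(A, Y) = βAY + βXA βXY, while adding X as a regressor
recovers βAY:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500_000
    bXA, bXY, bAY = 0.5, 0.4, 0.3   # illustrative path coefficients

    # Simulate the standardised SEM of Figure 3.2
    X = rng.normal(size=n)
    A = bXA * X + rng.normal(size=n) * np.sqrt(1 - bXA**2)
    sd_Y = np.sqrt(1 - bXY**2 - bAY**2 - 2 * bXA * bXY * bAY)  # so Var(Y) = 1
    Y = bXY * X + bAY * A + rng.normal(size=n) * sd_Y

    # Slope of Y on A alone: approx Cov(A, Y) = 0.3 + 0.5 * 0.4 = 0.5
    print(np.polyfit(A, Y, 1)[0])
    # Coefficient of A when X is added: approx betaAY = 0.3
    coef, *_ = np.linalg.lstsq(np.column_stack([A, X, np.ones(n)]), Y, rcond=None)
    print(coef[0])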
3.18 Remark. The fact that γAY·X = βAY in Example 3.16 relies on different assumptions
than the regression adjustment discussed in Section 2.5. In regression adjustment, the
coefficient of A in the linear regression equals the average causal effect because A
is randomised and A ⊥⊥ X, and the conclusion holds regardless of whether the linear
regression correctly specifies E[Y | A, X] (see Remark 2.25). In contrast, two strong
assumptions are needed here:

(i) The linear regression correctly specifies E[Y | A, X];

(ii) The linear model in Example 3.16 is structural (see Remark 3.9), so βAY is indeed
the causal effect.

Following Example 3.17, a natural question is: in order to identify causal effects
by regression coefficients, which variables should be included as regressors (“adjusted
for”)? We will learn the answer later in the course, but we first examine some negative
examples below to gain intuition.

3.19 Example. Consider the two graphical models in Figure 3.3, in which the random
variables are all centred and standardised. In the left diagram, β(A → Y) = βAX βXY
but γAY·X = 0. In the right diagram, β(A → Y) = 0 but, using (3.5),

    γAY·X = −βAX βYX / (1 − βAX²).

This is commonly referred to as collider bias because X is a collider in A → X ← Y.

Figure 3.3: Two examples in which adjusting for X in a linear regression introduces bias
to estimating the causal effect of A on Y. Left: A → X → Y with coefficients βAX, βXY.
Right: A → X ← Y with coefficients βAX, βYX.
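The collider bias in the right diagram can be checked numerically with the following
sketch (not from the notes; βAX = 0.6 and βYX = 0.5 are illustrative). Even though A
has no causal effect on Y, the coefficient of A in the regression of Y on (A, X) equals
−βAX βYX / (1 − βAX²):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500_000
    bAX, bYX = 0.6, 0.5

    A = rng.normal(size=n)
    Y = rng.normal(size=n)                    # A and Y are independent
    sd_X = np.sqrt(1 - bAX**2 - bYX**2)       # so Var(X) = 1
    X = bAX * A + bYX * Y + rng.normal(size=n) * sd_X

    coef, *_ = np.linalg.lstsq(np.column_stack([A, X, np.ones(n)]), Y, rcond=None)
    print(coef[0])                            # approx -0.469
    print(-bAX * bYX / (1 - bAX**2))          # = -0.46875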

An immediate lesson learned from Example 3.19 is that we should not include
descendants of the cause in the regression. However, the next exercise shows that this is
not enough.

3.20 Exercise. In each of the two cases below, give a linear SEM such that X is not a
descendant of A or Y, β(A → Y) = 0, but γAY·X ≠ 0.

(i) There is no d-connected path between A and Y ;

(ii) X is on every d-connected path between A and Y .

3.5 Latent variables and identifiability

So far we have assumed that all the variables in the linear SEM are observed. A direct
consequence is that all the path coefficients are identifiable from the distribution of X.

3.21 Proposition. Suppose the random variables X[p] satisfy the linear SEM with
respect to a DAG G. Then the path coefficients in B can be written as functions of
Σ = Cov(X[p] ).

Proof. First of all, Σ is positive definite (Exercise 3.10), so any principal submatrix of Σ
is also positive definite. For each variable Xi, the path coefficients from its parents to Xi
can be identified by the corresponding linear regression, i.e.,

    β_{pa(Xi), i} = Σ_{pa(Xi), pa(Xi)}^{-1} Σ_{pa(Xi), i}.    (3.6)

There are at least two reasons to consider SEMs with latent variables (also called
factors):

(i) Confounding bias: Simply ignoring the latent variables (e.g. using the subgraph of G
restricted to the observed variables) leads to biased estimates of the path coefficients.
It is thus important to know if we can still identify a causal effect when some
variables are unobserved.

(ii) Proxy measurement: In many real applications, the variables of interest are not
directly measured. This is particularly common in the social sciences where the
variable of interest may be socioeconomic status, personality, or political ideology.
These variables may only be approximately measured by observable variables
(proxies) like human behaviours and questionnaires.

3.22 Example. An excerpt of an educational psychology study.2

With latent variables, identifiability of path coefficients no longer follows from Proposi-
tion 3.21 because Σ is only partially estimable. Path analysis (3.3) allows us to construct
a mapping (Exercise 3.10)

    B ↦ Σ(B)

between the path coefficients and the covariance matrix of the observed and unobserved
variables.
An entry (or a function) of B is said to be identifiable if it can be expressed in terms
of the distribution of the observed variables. In linear SEMs with normal errors, this is
equivalent to expressing B in terms of the submatrix of Σ corresponding to the observed
variables (because the multivariate normal distribution is uniquely determined by its
mean and covariance matrix).
When the errors are non-normal, we may further use higher moments or the entire
distribution of the observed variables to identify B. However, this approach is also more
sensitive to the distributional assumptions. Below we will restrict our discussion to the
case of normal errors.
3.23 Remark. The notion of identifiability can depend on the context of the problem. With
latent variables, it is often the case that we can only identify some path coefficients up to
a sign change. In other problems (such as problems with instrumental variables), the set
of nonidentifiable path coefficients has measure zero (this is called generic identifiability).
We will not differentiate between these concepts in the discussion below.
To identify the entire matrix B, a necessary condition is that Σ has at least as many
entries as B. Unfortunately, there is no known necessary and sufficient condition for
identifiability in linear SEMs.3

3.6 Factor models and measurement models

Below we give some examples in which the path coefficients are indeed identifiable.4 The
basic idea is to use proxies for the latent variables.
Without loss of generality, we assume all the unmeasured variables are standardised so
they all have unit variance. In the diagrams below, we will use dashed circles to indicate
latent variables.

3.24 Lemma (Three-indicator rule). Consider any linear SEM for (U, X[p] ) corresponding
to Figure 3.4. Suppose X[p] is observed but U is not. Suppose Var(U ) = 1. Then the path
coefficients are identifiable (up to a sign change) if p ≥ 3 and at least 3 coefficients are
nonzero.

Figure 3.4: Illustration for the three-indicator rule (p ≥ 3): U → Xi with coefficient βi
for i = 1, . . . , p.

Proof. Denote the path coefficient for U → Xi as βi and the variance of the noise variable
for Xi as σi². It is straightforward to show that

    Cov(X) = ββ^T + diag(σ1², . . . , σp²).

When p = 3, this means that (showing only the lower triangle of each symmetric matrix)

    ( Var(X1)                           )   ( β1² + σ1²                      )
    ( Cov(X1, X2)  Var(X2)              ) = ( β1 β2      β2² + σ2²           ).
    ( Cov(X1, X3)  Cov(X2, X3)  Var(X3) )   ( β1 β3      β2 β3     β3² + σ3² )

Therefore, we have

    β1² = Cov(X1, X2) · Cov(X1, X3) / Cov(X2, X3),

and similarly for β2² and β3². Although the sign of β1 is not identifiable, it is easy to see
that once it is fixed, the signs of β2 and β3 are also determined. Thus the vector β is
identifiable up to the transformation β ↦ −β.
For p > 3, we can apply the above result to a 3-subset of X_{[p]} whose corresponding
path coefficients are nonzero.
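The identification argument can be checked with a quick simulation (a sketch that is
not part of the notes; the loadings and noise scales below are arbitrary illustrative
values):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 1_000_000
    beta = np.array([0.8, 0.6, 0.7])    # loadings U -> X1, X2, X3
    sigma = np.array([0.5, 0.9, 0.6])   # noise standard deviations

    U = rng.normal(size=n)
    X = np.outer(U, beta) + rng.normal(size=(n, 3)) * sigma
    S = np.cov(X, rowvar=False)

    # beta1^2 = Cov(X1, X2) Cov(X1, X3) / Cov(X2, X3)
    print(S[0, 1] * S[0, 2] / S[1, 2], beta[0] ** 2)   # both approx 0.64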

3.25 Remark. Statistical inference for the graphical model in Figure 3.4 is often called a
confirmatory factor analysis because the structure is already given. This is different from
exploratory factor analysis (e.g., via principal component analysis), which tries to use
observed data to discover the factor structure.

3.26 Example. For the linear SEM corresponding to the graphical model in Figure 3.5,
βAY is identifiable. To see this, we can first use Lemma 3.24 on {A, Y, X} and {A, Y, Z}
to identify (βU A , βU Y ) (up to a sign change). Without loss of generality we assume A
and Y have unit variance, then βAY = Cov(A, Y ) − βU A βU Y is also identified.

3.27 Exercise. Show that βAY is non-identifiable if Z is unobserved in Figure 3.5.

3.28 Exercise. Let X1, X2, X3 be three random variables/vectors. The partial
covariance between X1 and X2 given X3 is defined as

    PCov(X1, X2 | X3) = Cov(X1, X2) − Cov(X1, X3) Var(X3)^{-1} Cov(X3, X2).

Figure 3.5: Illustration of using proxies of unmeasured confounders to remove unmeasured
confounding bias (U → X, U → A, U → Y, U → Z with coefficients βUX, βUA, βUY, βUZ,
and A → Y with coefficient βAY).

Show that if we add a directed edge from X to A in Figure 3.5, βAY is still identifiable
by5

    βAY = Cov(A, Y) − [PCov(X, Y | A) / PCov(X, Z | A)] · Cov(A, Z).
The three-indicator rule is also quite useful in the so-called measurement models. In
this type of problem (an instance is Example 3.22), we are indeed interested in the causal
effects between the latent variables (these are often abstract constructs like personalities
and academic achievements).
Suppose the latent variables U ∈ R^q have unit variances and satisfy a linear SEM
with respect to a prespecified DAG. The observed variables (or measurements) X ∈ R^p
satisfy the following model (the intercept term is omitted for simplicity):

    X = ΓU + εX,

where Γ ∈ R^{p×q} is the factor loading matrix and εX is a vector of mutually independent
mean-zero noise variables with εX ⊥⊥ U.

3.29 Proposition. Suppose (U, X) satisfy the measurement model described above.
The path coefficients between the latent variables U are identifiable (up to sign changes
of U) if the following conditions are satisfied:

(i) Each row of Γ has only 1 nonzero entry (i.e., every measurement loads on only
one factor).

(ii) Each column of Γ has at least 3 nonzero entries (i.e., each factor has at least
three measurements).

Proof. By Lemma 3.24, Γ can be identified. The assumptions in the proposition statement
also ensure that Γ has full column rank, so Cov(U) can be identified from Cov(X). The
conclusion then follows from Proposition 3.21.

3.30 Example. The graphical model in Figure 3.6 satisfies the criterion in Proposi-
tion 3.29, thus βU is identifiable (up to its sign). To see this, β11 , β12 , . . . , β26 can be
identified by confirmatory factor analysis, and by using path analysis, we have

Cov(X1 , X4 ) = β11 βU β24 .

Figure 3.6: Example of a measurement model: U1 → X1, X2, X3 with loadings β11, β12,
β13; U2 → X4, X5, X6 with loadings β24, β25, β26; and U1 → U2 with coefficient βU.
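As a numerical companion to Example 3.30 (again a sketch that is not part of the notes,
with illustrative loadings), βU can be recovered from sample covariances by applying
the three-indicator rule within each block and then using Cov(X1, X4) = β11 βU β24:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 1_000_000
    bU = 0.5
    l1 = np.array([0.8, 0.7, 0.6])   # loadings of U1 on X1, X2, X3
    l2 = np.array([0.9, 0.6, 0.7])   # loadings of U2 on X4, X5, X6

    U1 = rng.normal(size=n)
    U2 = bU * U1 + np.sqrt(1 - bU**2) * rng.normal(size=n)   # Var(U2) = 1
    X = np.hstack([np.outer(U1, l1), np.outer(U2, l2)])
    X = X + rng.normal(size=(n, 6)) * 0.5                    # measurement noise
    S = np.cov(X, rowvar=False)

    b11 = np.sqrt(S[0, 1] * S[0, 2] / S[1, 2])   # three-indicator rule, block 1
    b24 = np.sqrt(S[3, 4] * S[3, 5] / S[4, 5])   # three-indicator rule, block 2
    print(S[0, 3] / (b11 * b24), bU)             # both approx 0.5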

3.31 Exercise. Show that βU in the last example is still identifiable if each latent variable
only has two measurements (i.e. if X3 and X6 are deleted from the graph).

3.32 Remark. Although the path coefficients between U can only be identified up to
sign changes, this is usually not a problem in practice. Usually we can confidently make
assumptions about the signs of certain factor loadings (for example, the loading of a
student’s maths score on academic achievements is positive).

3.7 Estimation in linear SEMs

Let X ∈ R^p be the observed variables in a linear SEM. Let B denote the matrix of path
coefficients between all the variables, observed or latent. Suppose B is indeed identifiable.
There are two main approaches to fit a linear SEM and estimate B: maximum
likelihood and generalised method of moments.
By assuming the noise variables ε_{[p]} in (3.1) follow a multivariate normal distribution,
the maximum likelihood estimator of B minimises

    l(B) = (1/2) log det Σ_X(B) + (1/2) tr(S Σ_X(B)^{-1}),    (3.7)

where S is the sample covariance matrix of X and Σ_X(B) is the model-implied covariance
matrix of X, which depends on the path coefficients B through (3.3).

3.33 Exercise. Derive (3.7).

Generalised method of moments (an extension of Z-estimation) tries to directly
match the theoretical covariance matrix Σ(B) with the sample covariance matrix S by
minimising

    l_W(B) = (1/2) tr( [(S − Σ(B)) W^{-1}]² ),    (3.8)

where W is a p × p positive definite weighting matrix. This is also called the generalised
least squares estimator in the SEM literature.

Different choices of W lead to estimators with different asymptotic efficiency. The
“optimal” choice is W = Σ(B) (or any other matrix that converges in probability to
Σ(B)). This motivates the practical choice W = S, so we estimate B by minimising

    l_S(B) = (1/2) tr( [I − S^{-1} Σ(B)]² ).    (3.9)

The generalised method of moments estimator is consistent if the linear SEM is
correctly specified (so Var(X) = Σ(B)). Furthermore, if l_S(B) is used and ε is normally
distributed, the estimator is asymptotically equivalent to the maximum likelihood
estimator and is thus asymptotically efficient.6
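To illustrate, the following sketch (not from the notes) fits the three-variable SEM of
Figure 3.2 by numerically minimising l_S(B) in (3.9), with the model-implied covariance
taken from (3.4); the true coefficients and the use of scipy's general-purpose optimiser
are illustrative choices:

    import numpy as np
    from scipy.optimize import minimize

    def sigma_of(b):
        # Model-implied covariance of (X, A, Y) from path analysis, eq. (3.4)
        bXA, bXY, bAY = b
        return np.array([
            [1.0, bXA, bXY + bXA * bAY],
            [bXA, 1.0, bAY + bXA * bXY],
            [bXY + bXA * bAY, bAY + bXA * bXY, 1.0],
        ])

    def l_S(b, S):
        M = np.eye(3) - np.linalg.solve(S, sigma_of(b))   # I - S^{-1} Sigma(B)
        return 0.5 * np.trace(M @ M)

    # Simulate standardised data with (bXA, bXY, bAY) = (0.5, 0.4, 0.3)
    rng = np.random.default_rng(4)
    n = 200_000
    X = rng.normal(size=n)
    A = 0.5 * X + rng.normal(size=n) * np.sqrt(0.75)
    Y = 0.4 * X + 0.3 * A + rng.normal(size=n) * np.sqrt(0.63)
    S = np.cov(np.column_stack([X, A, Y]), rowvar=False)

    print(minimize(l_S, x0=np.zeros(3), args=(S,)).x)   # approx [0.5, 0.4, 0.3]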

3.8 Strengths and weaknesses of linear SEMs

Despite being a century old, linear SEMs are still widely used in many applications for
many good reasons:

(i) Graphs and linear SEMs provide an intuitive way to rigorously describe causality
that can also be easily understood by applied researchers.

(ii) Path analysis provides a powerful tool to distinguish correlation from causation.
Even though we will move away from linearity soon, path analysis provides a
straightforward way to disprove some statements and gain intuition for others.7

(iii) Linear SEMs allow us to directly put models on unobserved variables. This is
especially useful when the causes and effects of interest are abstract constructs.

(iv) Fitting a linear SEM only requires the sample covariance matrix, which can be
handy in modern applications with privacy constraints.

Linear SEMs also have important limitations:

(i) The causal structure needs to be known a priori.

(ii) The linear model can be misspecified and does not handle binary or discrete
variables very well. This is problematic because the causal effect is not well defined if
the model is nonlinear. As a consequence, the meaning of structural equation models
became obscure, leading many to believe they are just the same as linear regression.
This misconception led many researchers to reject linear SEMs as a tool for causal
inference.8

(iii) Any model put on the unobserved variables is dangerous, because there is no realistic
way to verify those assumptions.

Notes
1. Wright, S. (1918). On the nature of size factors. Genetics, 3(4), 367–374; Wright, S. (1934).
The method of path coefficients. The Annals of Mathematical Statistics, 5(3), 161–215.
2. Marsh, H. W. (1990). Causal ordering of academic self-concept and academic achievement: A
multiwave, longitudinal panel analysis. Journal of Educational Psychology, 82(4), 646–656.
doi:10.1037/0022-0663.82.4.646.
3. For a review on recent advances using algebraic geometry, see Drton, M. et al. (2018). Algebraic
problems in structural equation modeling. In The 50th Anniversary of Gröbner Bases (pp. 35–86).
Mathematical Society of Japan.
4. More discussion and examples can be found in Bollen, K. A. (1989). Structural equations with
latent variables. doi:10.1002/9781118619179, page 326.
5. Kuroki, M., & Pearl, J. (2014). Measurement bias and effect restoration in causal inference.
Biometrika, 101(2), 423–437. doi:10.1093/biomet/ast066.
6. For more detail, see Browne, M. W. (1984). Asymptotically distribution-free methods for the
analysis of covariance structures. British Journal of Mathematical and Statistical Psychology, 37(1),
62–83. doi:10.1111/j.2044-8317.1984.tb00789.x.
7. Pearl, J. (2013). Linear models: A useful “microscope” for causal analysis. Journal of Causal
Inference, 1(1), 155–170. doi:10.1515/jci-2013-0003.
8. For a historical account, see Pearl, 2009, Section 5.1.

Chapter 4

Graphical models

Linear SEMs are intuitive and easy to interpret, but become inadequate when the
structural relations are non-linear. Intuitively, causality should already be entailed in the
graphical diagram, and linearity should be unnecessary for causal inference.
To move away from the linearity assumption, we will introduce graphical models for
the observed variables in this Chapter and for the unobserved counterfactuals in the next
Chapter. The main idea in this Chapter is to describe conditional independences with
graphs.

4.1 Markov properties for undirected graphs

Briefly speaking, a graphical model provides a concise representation of all the conditional
independence relations (aka Markov properties) between random variables. We will start
from undirected graphs.
Let G = (V = [p], E) be an undirected graphical model for the random variables
X[p] = (X1 , X2 , . . . , Xp ). Edges in an undirected graph have no direction. In other words,
if (i, j) ∈ E, so does (j, i).

4.1 Definition (Separation in undirected graph). For any disjoint I, J, K ⊂ V, K
is said to separate I and J in an undirected graph G, denoted as I ⊥⊥ J | K [G], if
every path from a node in I to a node in J contains a node in K.

4.2 Definition (Global Markov property). A probability distribution P is said to
satisfy the global Markov property with respect to the graph G if I ⊥⊥ J | K [G] =⇒
XI ⊥⊥ XJ | XK for any disjoint I, J, K ⊂ V.

The global Markov property with respect to a graph is closely related to the factorisa-
tion of a probability distribution.

4.3 Definition (Factorisation according to an undirected graph). A clique in an
undirected graph G is a subset of vertices such that every two distinct vertices in the
clique are adjacent.
A probability distribution P is said to factorise according to G (or be a Gibbs random
field with respect to G) if P has a density f that can be written as

    f(x) = Π_{clique C ⊆ V} ψ_C(x_C),

for some functions ψ_C, C ⊂ V.

4.4 Theorem (Hammersley-Clifford). Suppose the probability distribution P has a
positive density function. Then P satisfies the global Markov property with respect to
G if and only if it factorises according to G.

Proof. We will only prove the ⇐= direction here.1 Let I, J, K be disjoint subsets of V
such that I ⊥⊥ J | K [G]. A consequence of I and J being separated by K is that I and J
must be in different connected components of the subgraph G_{V\K}. (A set of vertices is
called connected if there exists a path from any vertex to any other vertex in this set. A
connected component is a maximal connected set, meaning no superset of the connected
component is connected.)
Let Ĩ denote the connected component that I is in and let J̃ = V \ (Ĩ ∪ K). Any
clique of G must either be a subset of Ĩ ∪ K or of J̃ ∪ K; otherwise at least one vertex in
Ĩ is adjacent to a vertex in J̃, violating the maximality of Ĩ. This implies that

    f(x) = Π_{clique C ⊆ V} ψ_C(x_C)
         = Π_{clique C ⊆ Ĩ∪K} ψ_C(x_C) · Π_{clique C ⊆ J̃∪K} ψ_C(x_C) / Π_{clique C ⊆ K} ψ_C(x_C)
         = h(x_{Ĩ∪K}) · g(x_{J̃∪K}).

We have shown that f(x) can be written as a function of x_{Ĩ∪K} multiplied by another
function of x_{J̃∪K}. By normalising the functions properly, this shows that X_Ĩ ⊥⊥ X_J̃ | XK
and hence XI ⊥⊥ XJ | XK.

Notice that in the proof above we did not use the positive density assumption. It is
only needed for the =⇒ direction.

4.2 Markov properties for directed graphs

We now move to DAG models.

4.5 Definition. We say a probability distribution P factorises according to a DAG
G if its density function f satisfies

    f(x) = Π_{i ∈ V} f_{i|pa(i)}(x_i | x_{pa(i)}),

where f_{i|pa(i)}(x_i | x_{pa(i)}) is the conditional density of Xi given X_{pa(i)}, that is,

    f_{i|pa(i)}(x_i | x_{pa(i)}) = f_{{i}∪pa(i)}(x_i, x_{pa(i)}) / f_{pa(i)}(x_{pa(i)}),

where f_I(x_I) is the marginal density function for X_I.

Below we will often suppress the subscript and use f as a generic symbol to indicate
a density or conditional density function.

4.6 Example. A probability distribution factorises according to Figure 4.1 if its density
can be written as

f (x) = f (x1 )f (x2 | x1 )f (x3 | x1 )f (x4 | x2 , x3 )f (x5 | x2 )f (x6 | x3 , x4 )f (x7 | x4 , x5 , x6 ).

Figure 4.1: Example of a DAG, with edges 1 → 2, 1 → 3, 2 → 4, 3 → 4, 2 → 5, 3 → 6,
4 → 6, 4 → 7, 5 → 7, and 6 → 7.

4.7 Example. Probabilistic graphical models are widely used in Bayesian statistics,
machine learning, and engineering.2 Besides its intuitive representation, another motivation
for using graphical models is more efficient storage of probability distributions. Consider p
binary random variables. A naive way that stores the entire table of their joint distribution
needs to record 2^p probability mass values. In contrast, suppose the random variables
factorise according to a DAG G in which each vertex has no more than d parents. Then
it is sufficient to store p · 2^d values.
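For instance (a trivial computation, with p = 30 and d = 3 chosen arbitrarily for
illustration):

    p, d = 30, 3
    print(2 ** p)       # 1073741824 values for the full joint table
    print(p * 2 ** d)   # 240 values under the DAG factorisation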

It is obvious that if X_{[p]} satisfies the linear SEM according to G (Definition 3.8), then
the distribution of X_{[p]} also factorises according to G. Therefore, we can often use linear
SEMs to understand properties of DAG models (see, e.g., Exercise 4.16 below). However,
the linear SEM further makes assumptions about the interventional distributions of X_{[p]}.
In contrast, a DAG model only restricts the observational distribution of X_{[p]}, like a
regression model (see Remark 3.9 for a comparison of linear SEM with linear regression).
The next Chapter will introduce DAG models for counterfactuals.

4.8 Definition. Given a DAG G, its undirected moral graph G m is obtained by first
adding undirected edges between all pairs of vertices that have a common child and
then erasing the direction of all the directed edges.

This graph is called “moral” because we are marrying all the parents that have a
common child.

4.9 Example. Figure 4.2 illustrates the moralisation of the DAG in Figure 4.1. First,
three undirected edges (2,3), (4,5), (5,6) are added because they share a common child.
Second, the directions of all the edges in the original graph are erased.

(a) Step 1: Add undirected edges between all pairs of vertices that have a common child.

(b) Step 2: Remove the directions of all edges.

Figure 4.2: Illustration of moralising the DAG in Figure 4.1.

For any i ∈ V, the subgraph of G^m restricted to {i} ∪ pa(i) is a clique. Thus, by
Definitions 4.3 and 4.5 and Theorem 4.4, we immediately have

4.10 Lemma. If a probability distribution P factorises according to a DAG G, it also
factorises according to its moral graph G^m and thus satisfies the global Markov property
with respect to G^m.

Lemma 4.10 gives us a way to obtain conditional independence relations for distribu-
tions that factorise according to a DAG (using Definition 4.1).

4.11 Corollary. Suppose P factorises according to a DAG G, then

    I ⊥⊥ J | K [G^m] =⇒ XI ⊥⊥ XJ | XK.

This criterion can be improved. Let ān(I) = an(I) ∪ I denote the ancestral closure of I.
The next Proposition says that we only need to moralise the subgraph of G restricted to
ān(I ∪ J ∪ K).

4.12 Proposition. Suppose P factorises according to a DAG G, then

    I ⊥⊥ J | K [(G_{ān(I∪J∪K)})^m] =⇒ XI ⊥⊥ XJ | XK.

Proof. It is easy to verify that for any I ⊆ V, i ∈ ān(I) implies that pa(i) ⊆ ān(I). By
Definition 4.5, this implies that the marginal distribution of X_{ān(I∪J∪K)} must factorise
according to the subgraph G_{ān(I∪J∪K)}. The proposition then immediately follows from
Corollary 4.11.
4.13 Example. Suppose the distribution P of X factorises according to the DAG in
Figure 4.1. We can use the criterion in Proposition 4.12 but not the one in Corollary 4.11
to derive X4 ⊥⊥ X5 | {X2, X3}.
4.14 Exercise. Suppose the distribution P of X factorises according to the DAG in
Figure 4.1. Which one(s) of the following conditional independence relationships can be
derived from Proposition 4.12?

(i) X2 ⊥⊥ X6 | X4;

(ii) X2 ⊥⊥ X6 | X3;

(iii) X2 ⊥⊥ X7 | {X4, X5};

(iv) X5 ⊥⊥ X6 | X4;

(v) X5 ⊥⊥ X6 | {X3, X4}.
Next we give another criterion called d-separation that only uses the original DAG G
and thus is much easier to apply. To gain some intuition, consider the following example.
4.15 Example. There are three possible situations for a DAG with 3 vertices and 2
edges (Figure 4.3). Using Corollary 4.11, it is easy to show that X1 ⊥⊥ X3 | X2 is true in
the first two cases. However, in the third case, even though X1 and X3 are marginally
independent, conditioning on the collider X2 (common child of X1 and X3) actually
introduces dependence, so X1 ⊥⊥ X3 | X2 is not true in general. Example 3.19 showed the
same phenomenon using the more restrictive linear SEM interpretation of these graphical
models.
4.16 Exercise. (i) By directly using the DAG factorisation (without using moralisa-
tion), show that X1 ⊥⊥ X3 | X2 is true in the first two cases but generally false for
the third case in Figure 4.3.

Figure 4.3: Possible DAGs with 3 vertices and 2 edges: chain/mediator (X1 → X2 → X3),
fork/confounder (X1 ← X2 → X3), and collider (X1 → X2 ← X3).

(ii) Alternatively, by assuming (X1, X2, X3) satisfies a linear SEM with respect to the
corresponding graph, demonstrate why X1 ⊥⊥ X3 | X2 holds or does not hold. For
simplicity, you may assume all the path coefficients are equal to 1.

(iii) What happens if there is an additional vertex X4 that is a child of X2 and has no
other parent, and we condition on X4 instead of X2 ?

4.17 Definition. Given a DAG G, a path is said to be blocked by K ⊆ V if there
exists a vertex k on the path such that either

(i) k is not a collider on this path and k ∈ K; or

(ii) k is a collider on this path, and k and all its descendants are not in K.

For disjoint subsets of nodes I, J, K ⊂ V, we say I and J are d-separated by K,
written as I ⊥⊥ J | K [G], if all paths from a vertex in I to a vertex in J are blocked
by K.

4.18 Example. For the DAG in Figure 4.1, K = {1} blocks the paths (3, 1, 2, 5),
(3, 4, 2, 5), (3, 6, 4, 2, 5), (3, 6, 7, 4, 2, 5), and (3, 6, 7, 5). Therefore the nodes 3 and 5 are
d-separated by 1.

4.19 Remark. To memorise the definition of d-separation, imagine water flowing along the
edges and each vertex acts as a valve. A collider valve is naturally “off”, meaning there
is no flow of water from one side of the collider to the other side. A non-collider valve
is naturally “on”, allowing water to flow freely. Now imagine we can turn on or off the
valves (the perhaps non-intuitive part is that turning on any descendant of a collider also
turns on that collider). Water can flow from one end of the path to the other end (path
is unblocked) if and only if all the valves on the path are “on”.
In path analysis (Definition 3.13), we have already seen that a d-connected path
can induce dependence between variables. The induced dependence can be removed
(“blocked”) by conditioning on any non-collider on the path. Conversely, although a closed
(not d-connected) path does not induce dependence, it can do so if we condition on all
the colliders.

4.20 Lemma. Consider a DAG G = (V, E) and disjoint I, J, K ⊂ V. Then

    I ⊥⊥ J | K [(G_{ān(I∪J∪K)})^m] ⇐⇒ I ⊥⊥ J | K [G].

The proof of this Lemma is a bit technical and is deferred to Section 4.A.1 (so is
non-examinable).
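Lemma 4.20 also yields a simple algorithm for deciding d-separation: moralise the
ancestral subgraph and test ordinary graph separation. Below is a minimal sketch (not
from the notes; encoding a DAG by parent sets is an arbitrary implementation choice):

    from collections import deque

    def ancestral_closure(dag, nodes):
        # Vertices in `nodes` together with all of their ancestors
        closure, stack = set(nodes), list(nodes)
        while stack:
            v = stack.pop()
            for p in dag[v]:
                if p not in closure:
                    closure.add(p)
                    stack.append(p)
        return closure

    def d_separated(dag, I, J, K):
        # Check I d-separated from J by K, for a DAG given as {vertex: set of parents}
        A = ancestral_closure(dag, set(I) | set(J) | set(K))
        # Moralise the subgraph restricted to A: connect each vertex to its
        # parents, and marry every pair of parents sharing a common child.
        # (All parents of a vertex in A are themselves in A.)
        nbr = {v: set() for v in A}
        for v in A:
            for p in dag[v]:
                nbr[v].add(p); nbr[p].add(v)
            for p in dag[v]:
                for q in dag[v]:
                    if p != q:
                        nbr[p].add(q)
        # Separation in the moral graph: no vertex of J is reachable from I
        # along a path avoiding K.
        seen, queue = set(I), deque(I)
        while queue:
            v = queue.popleft()
            if v in J:
                return False
            for w in nbr[v] - set(K) - seen:
                seen.add(w)
                queue.append(w)
        return True

    # The DAG of Figure 4.1, encoded by parent sets:
    fig41 = {1: set(), 2: {1}, 3: {1}, 4: {2, 3}, 5: {2}, 6: {3, 4}, 7: {4, 5, 6}}
    print(d_separated(fig41, {3}, {5}, {1}))      # True, as in Example 4.18
    print(d_separated(fig41, {4}, {5}, {2, 3}))   # True, as in Example 4.13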

4.21 Theorem. The distribution P of X_{[p]} factorises according to a DAG G if and
only if

    I ⊥⊥ J | K [G] =⇒ XI ⊥⊥ XJ | XK, ∀ disjoint I, J, K ⊂ V.    (4.1)

Proof. Proposition 4.12 and Lemma 4.20 immediately imply the =⇒ direction. The ⇐=
direction can be shown by induction on |V|. Without loss of generality let’s assume
V = [p] and (1, 2, . . . , p) is a topological ordering of G, so (i, j) ∈ E implies that i < j.
Because the vertex p has no child, it is easy to see that

    p ⊥⊥ V \ {p} \ pa(p) | pa(p) [G].

By (4.1), Xp is independent of the other variables given X_{pa(p)}. Thus the joint density of
X_{[p]} can be written as

    f(x_{[p]}) = f(x_{[p−1]}) · f(x_p | x_{[p−1]}) = f(x_{[p−1]}) · f(x_p | x_{pa(p)}).

Using the induction hypothesis for the first term on the right hand side, we thus conclude
that P also factorises according to G when |V | = p.

Distributions P satisfying (4.1) are said to satisfy the global Markov property with
respect to the DAG G. In the last section, Theorem 4.4 establishes the equivalence
between global Markov property and factorisation in undirected graphs. Theorem 4.21
extends this equivalence to DAGs, with a small distinction that P is no longer required to
have a positive density function.

4.22 Exercise. Apply the d-separation criterion in Theorem 4.21 to the examples in
Exercise 4.14.

4.23 Remark (Completeness of d-separation). The criterion (4.1) cannot be further
improved in the following sense. Given any DAG G, it can be shown that there exists a
probability distribution P3 such that

    I ⊥⊥ J | K [G] ⇐⇒ XI ⊥⊥ XJ | XK, ∀ disjoint I, J, K ⊂ V.    (4.2)

Furthermore, it can be shown that if X_{[p]} is discrete, the set of probability distributions
that factorise according to G but do not satisfy (4.2) has Lebesgue measure zero.4

4.24 Example. Consider the setting in Example 3.16, where three random variables
(X, A, Y) satisfy a linear SEM corresponding to the graph in Figure 3.2, with covariances
given in (3.4). Suppose the structural noise variables are jointly normal and the random
variables are standardised so Var(X) = Var(A) = Var(Y) = 1. For most values of
βXA, βXY, βAY, the variables X, A, Y are unconditionally and conditionally dependent.
However, very occasionally the distribution of (X, A, Y) may have some “coincidental”
independence relations. For example, A ⊥⊥ Y if the path coefficients happen to satisfy
βAY + βXA βXY = 0. It is easy to see that this event has Lebesgue measure 0.

4.3 Structure discovery

In structure discovery, the goal is to use conditional independence in the observed data
to infer the graphical model, that is, to invert (4.1). Remark 4.23 suggests that this is
possible for almost all distributions, which is formalised below.

4.25 Definition. We say a distribution P of X that factorises according to G is
faithful to G if I ⊥⊥ J | K [G] ⇐⇒ XI ⊥⊥ XJ | XK for all disjoint I, J, K ⊂ V.

Given that P is faithful to the unknown DAG G, we can obtain all d-separation
relations in G from the conditional independence relations in P. Without faithfulness, we
cannot even exclude the possibility that the underlying DAG is complete.
However, faithfulness may not be enough to recover G. A simple counter-example is
the two DAGs X1 → X2 and X2 → X1, both implying that X1 and X2 are dependent.

4.26 Definition. Two DAGs are called Markov equivalent if they contain the same
d-separation relations. A Markov equivalence class is the maximal set of Markov
equivalent DAGs.

Without additional assumptions, we can only recover the Markov equivalence class
that contains G. The next Theorem gives a complete characterisation of a Markov
equivalence class.

4.27 Theorem. Two DAGs are Markov equivalent if and only if the next two
properties are satisfied:

(i) They have the same “skeleton” (set of edges ignoring the directions);

(ii) They have the same “immoralities” (structures like i → k ← j where i and j
are not adjacent).

We will only show the =⇒ direction here by proving two Lemmas. The proof for the
⇐= direction can be found in Section 4.A.2.

4.28 Lemma. Given a DAG G = (V, E), two vertices i, j ∈ V are adjacent if and only
if they cannot be d-separated by any set D ⊆ V \ {i, j}; otherwise they can be d-separated
by pa(i) or pa(j).

Proof. =⇒ is obvious because no set can block the edge between i and j. For the ⇐=
direction, because i ∈ an(j) and j ∈ an(i) cannot both be true (otherwise creating a
cycle), without loss of generality we assume j ∉ an(i). Any path connecting i and j
cannot be a directed path from j to i. Consider the directed edge on this path with j as
one end. If this edge points into j, this path contains a parent of j that is not a collider;
otherwise it must contain a collider that is a descendant of j. In either case the path is
blocked by pa(j).

4.29 Lemma. For any undirected path i − k − j in G such that i and j are not adjacent,
k is a collider if and only if i and j are not d-separated by any set containing k.

Proof. This immediately follows from the fact that the path i − k − j is blocked by k if
and only if k is not a collider.

The IC5 or SGS6 algorithm uses the conditions in Lemmas 4.28 and 4.29 to recover
the Markov equivalence class:

Step 0 Start with an undirected complete graph in which all vertices are adjacent.

Step 1 For every two vertices i, j, remove the edge between i and j if Xi ⊥⊥ Xj | XK for
some K ⊆ V \ {i, j}. This gives us the skeleton of the graph (Lemma 4.28).

Step 2 For every undirected path i − k − j such that i and j are not adjacent in the
skeleton obtained in Step 1, orient the edges as i → k ← j if Xi and Xj are dependent
given XK for all K ⊆ V \ {i, j} containing k (Lemma 4.29).

Step 3 Orient some of the other edges, namely whenever the opposite orientation would
introduce a cycle or a new immorality. (In general it is impossible to orient all the
edges unless the Markov equivalence class is a singleton.)

The PC algorithm7 accelerates Step 1 above using the following trick: to test
whether i and j can be d-separated, one only needs to go through subsets of the neighbours
of i and subsets of the neighbours of j. The PC algorithm also imposes an order: it starts
with K = ∅ and then gradually increases the size of K. For sparse graphs, these two tricks
allow us not only to test fewer conditional independences for each pair but also to stop
the algorithm at a much smaller size of K.
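As an illustration of Step 1, the sketch below (not from the notes) recovers the skeleton
from the covariance matrix of jointly normal data by declaring Xi ⊥⊥ Xj | XK whenever
the corresponding partial correlation is numerically zero; the tolerance and the chain
example are illustrative choices:

    import numpy as np
    from itertools import combinations

    def partial_corr(S, i, j, K):
        # Partial correlation of Xi and Xj given XK, from a covariance matrix S
        idx = [i, j] + list(K)
        P = np.linalg.inv(S[np.ix_(idx, idx)])   # precision of the submatrix
        return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

    def skeleton(S, tol=1e-2):
        p = S.shape[0]
        edges = set(combinations(range(p), 2))
        for (i, j) in sorted(edges):
            others = [v for v in range(p) if v not in (i, j)]
            for m in range(len(others) + 1):
                if any(abs(partial_corr(S, i, j, K)) < tol
                       for K in combinations(others, m)):
                    edges.discard((i, j))
                    break
        return edges

    # Chain X0 -> X1 -> X2: the skeleton should be {(0, 1), (1, 2)}
    rng = np.random.default_rng(5)
    n = 100_000
    X0 = rng.normal(size=n)
    X1 = 0.7 * X0 + rng.normal(size=n)
    X2 = 0.7 * X1 + rng.normal(size=n)
    S = np.cov(np.column_stack([X0, X1, X2]), rowvar=False)
    print(skeleton(S))   # {(0, 1), (1, 2)}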

4.30 Exercise. Use the IC/SGS algorithm to derive the Markov equivalence class
containing the DAG in Figure 4.1. More specifically, give the conditional independence
and dependence relations you used in Steps 1 and 2 of the algorithm. How many DAGs
are there in this Markov equivalence class?

4.4 Discussion: Using DAGs to represent causality

It may not be surprising that many people have been fascinated by the prospects of
the graphical approach to causality:

(i) Representing causality by directed edges is naturally appealing.

(ii) The theory of probabilistic graphical models is elegant and powerful.

(iii) The possibility of discovering (possibly causal) structures from observational data
is exciting.

But structure learning also has important limitations:

(i) Structure learning algorithms are computationally intensive;

(ii) Testing conditional independence is known to be a very difficult statistical problem.8

(iii) DAGs do not necessarily represent causality: graphical models can just be viewed
as a useful tool to describe a probability distribution.

(iv) Additional assumptions like the causal Markov condition needed to define causal
DAGs are not as transparent as assumptions on counterfactuals or structural noises.9

Notes
1. Proof of the other direction can be found, for example, in Lauritzen, S. L. (1996). Graphical
models. Clarendon Press, page 36.
2. See, for example, Wainwright, M. J., & Jordan, M. I. (2007). Graphical models, exponential
families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2), 1–305.
doi:10.1561/2200000001.
3. Geiger, D., & Pearl, J. (1990). On the logic of causal models. R. D. Shachter, T. S. Levitt,
L. N. Kanal, & J. F. Lemmer (Eds.), Uncertainty in artificial intelligence (Vol. 9, pp. 3–14). Machine
Intelligence and Pattern Recognition. doi:10.1016/B978-0-444-88650-7.50006-8.
4. Meek, C. (1995). Strong completeness and faithfulness in Bayesian networks. Proceedings of the
eleventh conference on uncertainty in artificial intelligence (pp. 411–418). Montréal, Qué, Canada:
Morgan Kaufmann Publishers Inc.
5. Pearl, J., & Verma, T. S. (1991). A theory of inferred causation. J. Allen, R. Fikes, & E. Sandewall
(Eds.), Principles of knowledge representation and reasoning: Proceedings of the second international
conference (pp. 441–452). Morgan Kaufmann.
6. Spirtes, P., Glymour, C., & Scheines, R. (1993). Causation, prediction, and search.
doi:10.1007/978-1-4612-2748-9.
7. Spirtes, P., Glymour, C., & Scheines, R. (2001). Causation, prediction, and search (2nd ed.).
doi:10.7551/mitpress/1754.001.0001.
8. Shah, R. D., & Peters, J. (2020). The hardness of conditional independence testing and the
generalised covariance measure. Annals of Statistics, 48(3), 1514–1538. doi:10.1214/19-aos1857.
9. Dawid, A. P. (2010). Beware of the DAG!. I. Guyon, D. Janzing, & B. Schölkopf (Eds.),
(pp. 59–86). Proceedings of Machine Learning Research. Whistler, Canada: PMLR.

4.A Graphical proofs (non-examinable)

4.A.1 Proof of Lemma 4.20


Suppose K does not d-separate I from J in G, so there exists a path from I to J not
blocked by K. The vertices on this path must be contained in ān(I ∪ J ∪ K), which can
be shown by considering the following two cases:

(i) If this unblocked path contains no collider, then every vertex on this path, if not
already in I ∪ J, must be an ancestor of I ∪ J.

(ii) If this unblocked path contains at least one collider, then all the colliders must be
or have a descendant that is in K. Thus, all the vertices on this path must be in K
or ancestors of K.

In case (i), this path cannot contain a vertex in K (because it is unblocked in G), so it is
unblocked by K in (G_{ān(I∪J∪K)})^m. In case (ii), the path in (G_{ān(I∪J∪K)})^m that
marries the parents of all the colliders is not separated by K (the parents of a collider
cannot be colliders and thus do not belong to K). In both cases, K does not separate I
from J in the moral graph.
Next we consider the other direction. Suppose I is not separated from J by K in
(G_{ān(I∪J∪K)})^m, so there exists a path from a vertex in I to a vertex in J circumventing
K in (G_{ān(I∪J∪K)})^m. Edges in the moral graph are either already in G or added during
the “marriage”. For each edge added because of a “marriage” by a collider, extend the path
to include that collider. This results in a path in G. The set K does not block this path at
the non-colliders because the original path in the moral graph is not separated by K.
Consider the subsequence of this path, say from i ∈ I to j ∈ J, that does not contain
any intermediate vertex in I ∪ J. Consider any collider in this sub-path (let’s call it l) that
does not belong to ān(K), so l ∈ an(I ∪ J); without loss of generality assume l ∈ an(I).
By definition, there exists a directed path in G from l to i′ for some i′ ∈ I. Consider a
new path, tracing back from i′ to l and then joining the original path from l to j (see
Figure 4.4 for an illustration). Because l ∉ ān(K), the new part of the path from i′ to l
is not blocked by K. Thus we have obtained a path in G from I to J, unblocked by K
at non-colliders, with one fewer collider than the original. Repeating the argument in
this paragraph, we end up with a path from I to J whose colliders are in ān(K) and
non-colliders are not in K. By Definition 4.17, this path is not blocked by K.

Figure 4.4: Illustration for obtaining a new path with fewer colliders. The dashed line
indicates an edge added during the marriage.

4.A.2 Additional proofs for Markov equivalent DAGs


The ⇐= direction of Theorem 4.27 is established through the next two Lemmas.

4.31 Lemma. Among all paths from i to j that are unblocked by K ⊂ V in a DAG
G = (V, E), consider the shortest one (if there are several, consider any one of them).
Moreover, suppose this path contains a segment k − l − m such that k, m are adjacent.
Then the edges must be oriented as k ← l → m, and at least one of k, m is a collider on
this path (if k → m then k is a collider; if k ← m then m is a collider). As a corollary, all
colliders on this path must be “immoral”.

4.32 Lemma. Suppose two DAGs G1 and G2 have the same skeleton and immoralities.
If there is a path from i to j unblocked by K ⊂ V in G1 , then there exists a path from i to
j that is unblocked by K in both G1 and G2 .

Proof of Lemma 4.31. If k − l − m constitutes a moral collider, so the edges are oriented
as k → l ← m, we show that the shorter path that bypasses l (by directly going through
the k → m or k ← m edge) is also unblocked by K, contradicting the minimality of the
original path. Because k, m are not colliders on the original path (there cannot be two
consecutive colliders) and the original path is unblocked by K, we have k, m ∉ K. Suppose
the path is like i · · · k → l ← m · · · j (k could be the same as i and m could be the same
as j). The sub-paths i · · · k and j · · · m are not blocked by K because the original path is
not blocked by K. Although k or m might be a collider on the new path, neither of them
blocks the path when conditioning on K: k and m are parents of l, and l or one of its
descendants must be in K because the original path is unblocked at the collider l.
The other possibility is that k − l − m forms a chain, for example k → l → m. In this
case k − m must be oriented as k → m in order to not create a cycle. By observing that
any intermediate vertex is a collider on the new path if and only if it is a collider on the
original path, it is easy to show that the shorter path bypassing l is also unblocked by K,
thus resulting in a contradiction.
The only remaining possibility is k ← l → m. Since k and m are adjacent, without
loss of generality let’s say the orientation is k → m. It must be the case that k is a
collider in the original path, otherwise the shorter path bypassing l would have the same
colliders and non-colliders as the original path (except l) and is thus unblocked by K.

Proof of Lemma 4.32. Consider the shortest unblocked path between i and j in G1 . The
goal is to show that the same path (or some path constructed based on this path) is
unblocked by K in G2 . By Lemma 4.31, all G1 -colliders in this path are immoral and
hence are also G2 -colliders (since G1 and G2 share the same immoralities). Consider any
intermediate vertex l in this path; if there is none then obviously the path can’t be blocked
by any K in G2 . Let’s say the adjacent vertices are k, m. The vertex l must either be

(i) A non-collider in both G1 and G2 ; or

(ii) A non-collider in G1 and collider in G2 ; or

(iii) A collider in both G1 and G2 .

Obviously K does not block the path in G2 at the first kind of vertices, because the path
is unblocked in G1. There can be no l of the second kind for the following reason. Because
l is a collider in G2 but not in G1, the parents k, m of l must be adjacent; otherwise G1
would not share the immorality k → l ← m with G2. By Lemma 4.31, the orientation in
G1 must be k ← l → m, and at least one of k, m is an immoral collider in G1. However,
k, m cannot be colliders in G2, which contradicts the hypothesis that G1 and G2 have the
same immoralities.
Finally we consider the third case, that l is a collider in both G1 and G2. Because the
shortest path we are considering is unblocked by K in G1, l or a G1-descendant of l must
be in K. Among de_{G1}(l) ∩ K, let o be a vertex that is closest to l (if there are several, let
o be any one of them). There are three possibilities:

(i) o = l;

(ii) o ∈ ch_{G1}(l);

(iii) The length of the shortest directed path from l to o in G1 is at least 2.

In the first case, the path is not blocked at l in G2 because l ∈ K. In the second and third
cases, the path is blocked at l in G2 only if the edge directions in the shortest directed
path from l to o in G1 are no longer the same in G2. In the third case, we claim that this
shortest directed path from l to o must also be a directed path in G2, hence o ∈ de_{G2}(l)
and the path is unblocked at l in G2. If this were not true, this path from l to o must have
a G2-collider. In order to not create an immoral collider that does not exist in G1, the two
G2-parents of this collider must be adjacent. In G1, this edge either results in a cycle or a
shorter directed path from l to o.
We are left with the second case, with the edge direction l ← o in G2. In order to
not create the immorality k → l ← o, which would be inconsistent with G1, k, o must be
adjacent; furthermore, the direction must be k → o in G1 so as not to create a
k → l → o → k cycle. For the same reason, m, o must be adjacent and the direction must
be m → o in G1. Hence k → o ← m is an immorality in G1 and must also be an immorality
in G2. Consider a new path, modified from the original path from i to j, with k → l ← m
replaced by k → o ← m. It is easy to show that this new path has the same length as the
original path and is also unblocked by K in G1 and G2 because o ∈ K. Apart from o, any
other vertex in the new path is a G2-collider if and only if it is a G2-collider in the original
path. We can continue to use the argument in this paragraph until we no longer have a
collider l with a G1-child but not a G2-child o in K.

Chapter 5

A unifying theory of causality

This Chapter introduces a theory that unifies the previous approaches:

(i) The potential outcomes are useful to elucidate the causal effect that is being
estimated in a randomised experiment (Chapter 2);

(ii) The linear structural equations can be used to define causal effects and distinguish
correlation from causation (Chapter 3);

(iii) The graphical models (particularly DAGs) can encode conditional independence
relations and can be used to represent causality (Chapter 4).

5.1 From graphs to structural equations to counterfactuals

The key idea is to define counterfactuals from a DAG by using nonparametric structural
equations:1

5.1 Definition (NPSEMs). Given a DAG G = (V = [p], E), the random variables
X = X_{[p]} satisfy a nonparametric structural equation model (NPSEM) if the observed
and interventional distributions of X_{[p]} satisfy

    Xi = fi(X_{pa_G(i)}, εi), i = 1, . . . , p,

for some functions f1, . . . , fp and random variables ε_{[p]}.

5.2 Definition (Counterfactuals). Given the above NPSEM, the counterfactual
variables {Xi(XJ = xJ) | i ∈ [p], J ⊆ [p]} (Xi(XJ = xJ) is often abbreviated as
Xi(xJ)) are defined as follows:

(i) Basic counterfactuals: For each Xi and intervention X_{pa(i)} = x_{pa(i)}, define
Xi(X_{pa(i)} = x_{pa(i)}) = fi(x_{pa(i)}, εi).

(ii) Substantive counterfactuals: For any i ∈ [p] and J ⊆ an(i), recursively define

    Xi(XJ = xJ) = Xi( Xk = xk I(k ∈ J) + Xk(x_{J∩an(k)}) I(k ∉ J), k ∈ pa(i) ).

(iii) Irrelevant counterfactuals: For any i ∈ [p] and J ⊄ an(i), Xi(xJ) = Xi(x_{J∩an(i)}).

5.3 Remark. The recursive definition of substantive counterfactuals is easy to comprehend
but makes the algebra clunky (because we need to keep track of the relevant intervention).
An equivalent but more compact definition for the non-basic counterfactuals can be
obtained via recursive substitution: for any i ∈ [p], J ⊆ [p] and J ≠ pa(i), the
counterfactual Xi(xJ) is defined recursively by

    Xi(XJ = xJ) = Xi( X_{pa(i)∩J} = x_{pa(i)∩J}, X_{pa(i)\J} = X_{pa(i)\J}(xJ) ).

We will often abbreviate Xi(XJ = xJ) as Xi(xJ), where the subscript J indicates
the intervention. For disjoint J, K ⊂ [p], we write Xi(X_{J∪K} = x_{J∪K}) as Xi(xJ, xK).
By induction, it is easy to show that Xi = Xi(x∅). This is consistent with our intuition
that Xi is the “counterfactual” corresponding to no intervention.

5.4 Example. Consider the graphical model in Figure 5.1. The basic counterfactuals are
X1, X2(x1), and X3(x1, x2). The substantive counterfactuals of X3 are defined using
the basic counterfactuals as

    X3(x1) = X3(x1, X2(x1))  and  X3(x2) = X3(X1, x2).

The other counterfactuals are irrelevant and can be trimmed into the basic or substantive
counterfactuals. For example, X1(x1) = X1(x2) = X1, X2(x2) = X2, and X2(x1, x2) =
X2(x1).

Figure 5.1: A simple graphical model (X1 → X2 → X3 and X1 → X3) illustrating the
concept of counterfactuals.
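These definitions can be made concrete with a small simulation. The sketch below (not
from the notes) takes linear functions with additive noise as an arbitrary choice of
f1, f2, f3 for Figure 5.1; the key point is that factuals and counterfactuals are computed
from the same noise draws:

    import numpy as np

    rng = np.random.default_rng(6)
    n = 5
    e1, e2, e3 = rng.normal(size=(3, n))   # one noise variable per equation

    f1 = lambda e: e
    f2 = lambda x1, e: 0.8 * x1 + e        # illustrative structural equations
    f3 = lambda x1, x2, e: 0.5 * x1 + 0.3 * x2 + e

    # Factuals: Xi = Xi(x_emptyset)
    X1 = f1(e1)
    X2 = f2(X1, e2)
    X3 = f3(X1, X2, e3)

    # Counterfactual under X2 = 1: X3(x2) = X3(X1, x2) by recursive substitution
    x2 = 1.0
    X3_cf = f3(X1, np.full(n, x2), e3)

    print(X3)      # factual values
    print(X3_cf)   # counterfactual values under the intervention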

We can further simplify some of the substantive counterfactuals:

5.5 Proposition. Consider any disjoint J, K ⊆ V and any i ∈ V . If K blocks all
directed paths from J to i, then

Xi (xJ , xK ) = Xi (xK ).

Proof. This follows from recursive substitution and the next observation: if K blocks all
directed paths from J to i, then K also blocks directed paths from J to pa(i) \ K.

A corollary of Proposition 5.5 is that Xi(xJ) = Xi(x_{J∩an(i)}) for all i ∈ V and J ⊆ V.
From Definitions 5.1 and 5.2, we immediately obtain

    Xi = Xi(X_{pa(i)}).    (5.1)

This generalises the consistency assumption in Chapter 2 (Assumption 2.6). Here,
consistency is a property instead of an assumption because the connections between the
factuals and counterfactuals are already made explicit in the definition of counterfactuals.
The next result further generalises the consistency property:2

5.6 Proposition. For any disjoint J, K ⊆ V and any i ∈ V ,

XJ (xK ) = xJ =⇒ Xi (xJ , xK ) = Xi (xK ), (5.2)

in the sense that the event defined on the left hand side is a subset of the event defined
on the right hand side.

Proof. By definition,

    Xi(xJ, xK) = Xi(x_{pa(i)∩J}, x_{pa(i)∩K}, X_{pa(i)\J\K}(xJ, xK)),
    Xi(xK) = Xi(X_{pa(i)∩J}(xK), x_{pa(i)∩K}, X_{pa(i)\J\K}(xK)).

Using the assumption XJ(xK) = xJ, we see it suffices to show X_{pa(i)\J\K}(xJ, xK) =
X_{pa(i)\J\K}(xK). We can then complete the proof by induction.

Notice that (5.1) is implied by (5.2) by letting K = ∅.

5.2 Markov properties for counterfactuals

If you come from a statistics background, it may be natural to assume the error variables
ε1, . . . , εp are mutually independent. But after translating it into the basic counterfactuals,
this assumption may seem rather strong.3

5.7 Definition (Basic counterfactual independence). An NPSEM is said to satisfy
the single-world independence assumptions if

    for any x_{[p]}, the variables Xi(x_{pa(i)}), i ∈ [p], are mutually independent.    (5.3)

An NPSEM is said to satisfy the multiple-world independence assumptions if
ε1, . . . , εp are mutually independent; equivalently,

    the sets of variables {Xi(x_{pa(i)}) | x_{pa(i)}}, i ∈ [p], are mutually independent.

5.8 Example. Consider the graph in Figure 5.1. The single-world independence assump-
tions assert that X1, X2(x1), X3(x1, x2) are mutually independent for any x1 and x2.
The multiple-world independence assumptions are

    X1 ⊥⊥ {X2(x1) | x1} ⊥⊥ {X3(x1, x2) | x1, x2}.

Thus, in addition to the single-world independence assumptions, the multiple-world
independence assumptions also make the following cross-world independence assumption:

    X2(x1) ⊥⊥ X3(x̃1, x2) for any x1 ≠ x̃1 and x2.

Cross-world independence is controversial because it is about two counterfactuals that
can never be observed together in any experiment. Fortunately, it is not needed in most
causal inference problems.
5.9 Remark. Whether the cross-world independence seems “natural” depends on how
we approach the problem. If we start from structural equations, the multiple-world
independence assumptions may seem natural. If we start from counterfactuals, the same
assumptions may seem unnecessarily strong. But the two frameworks are equivalent.
To give an example of an NPSEM that satisfies the single-world but not the cross-world
independence assumptions, suppose all variables are discrete. We can simply let εi be
the vector of all the basic counterfactuals of Xi and let fi select the basic counterfactual
according to x_{pa(i)}.

5.10 Definition (Single-world causal model). We say the random variables X[p]
satisfy a single-world causal model or simply a causal model defined by a DAG
G = (V = [p], E), if X[p] satisfies a NPSEM defined by G and the counterfactuals of
X[p] satisfy the single-world independence assumptions.

Next we introduce a transformation that maps a graphical model for the factual
variables X to a graphical model for the counterfactual variables X(xJ ).

5.11 Definition. The single-world intervention graph (SWIG) G[X(xJ)] (sometimes
abbreviated as G[xJ]) for the intervention XJ = xJ is constructed from G via the
following two steps:

(i) Node splitting: For every j ∈ J, split the vertex Xj into a random and a
fixed component, labelled Xj and xj respectively. The random half inherits
all edges into Xj and the fixed half inherits all edges out of Xj.

(ii) Labelling: For every random node Xi in the new graph, label it with Xi(xJ) =
Xi(x_{J∩an(i)}).

5.12 Example. Figure 5.2 shows the SWIGs for the graphical model in Figure 5.1.

X1 x1 X2 (x1 ) x2 X3 (x1 , x2 )

(a) SWIG for the (X1 , X2 ) = (x1 , x2 ) intervention.

X1 x1 X2 (x1 ) X3 (x1 )

(b) SWIG for the X1 = x1 intervention.

X1 X2 x2 X3 (x2 )

(c) SWIG for the X2 = x2 intervention.

Figure 5.2: SWIGs for the graphical model in Figure 5.1.

The next Theorem states that in a single-world causal model, the counterfactuals
X(xJ ) “factorise” according to the SWIG G[X(xJ )]. “Factorise” is quoted because
G[X(xJ )] has non-random vertices and we have not formally defined a graphical model
for a mixture of random and non-random quantities. In this case, we essentially always
condition on the fixed quantities, so in the graph they block all the paths they are on.
To simplify this, let G ∗ [X(xJ )] be the random part of G[X(xJ )], i.e., the subgraph of
G[X(xJ )] restricted to Xi (xJ ), i ∈ [p]. This is sometimes abbreviated as G ∗ [xJ ]. Thus
G[xJ ] has the same number of edges as G and G ∗ [xJ ] has the same number of vertices as
G.

5.13 Theorem (Factorisation of counterfactual distributions). Suppose X satisfies
the causal model defined by a DAG G, then X(xJ ) factorises according to G ∗ [X(xJ )]
for all J ⊆ [p].

A key step in the proof of Theorem 5.13 is to establish the following Lemma using
induction.

5.14 Lemma. For any k ∉ J ⊆ [p] such that de(k) ⊆ J, any i ∈ [p], and any xJ and x̃,

    P( Xi(xJ, x̃k) = x̃i | X_{pa(i)\J\{k}}(xJ, x̃k) = x̃_{pa(i)\J\{k}} )
    = P( Xi(xJ) = x̃i | X_{pa(i)\J}(xJ) = x̃_{pa(i)\J} ).

Proof of Lemma 5.14 and Theorem 5.13. To simplify the exposition, let G*[J] denote the
modified graph G*[X(xJ)] with the vertex mapping Xi(xJ) → i, so G*[J] can be obtained
by removing the outgoing arrows from J in G. Notice that for any i ∈ [p] and J ⊂ [p],
pa_{G*[J]}(i) = pa_G(i) \ J.
The single-world independence assumptions in Equation (5.3) mean that this con-
clusion is true for J = [p]. The Theorem can be proven by reverse induction from
J ∪ {k} ⊆ [p] to J, where k ∉ J and de(k) ⊆ J (Exercise 3.6 shows that such k always
exists). By Proposition 5.6 (consistency of counterfactuals),

    P(X(xJ) = x̃) = P(X(xJ, x̃k) = x̃), for any x̃.

Using the induction hypothesis, we have, for any x̃,

    P(X(xJ, x̃k) = x̃)
    = Π_{i=1}^p P( Xi(xJ, x̃k) = x̃i | X_{pa_{G*[J∪{k}]}(i)}(xJ, x̃k) = x̃_{pa_{G*[J∪{k}]}(i)} ).    (5.4)

By using Lemma 5.14, we have, for any x̃,

    P(X(xJ) = x̃) = Π_{i=1}^p P( Xi(xJ) = x̃i | X_{pa_{G*[J]}(i)}(xJ) = x̃_{pa_{G*[J]}(i)} ).    (5.5)

This shows that X(xJ) factorises according to G*[J].


We next prove Lemma 5.14 using the induction hypothesis. If i 6∈ de(k), then there is
no directed path from k to i or pa(i). If i ∈ de(k) but i 6∈ ch(k), all the directed paths
from k to i or pa(i) \ J (if non-empty) are blocked by J, because de(k) ⊂ J and there are
no directed paths from k to pa(i) \ J. In both cases above, Proposition 5.5 shows that
Xi (xJ ) = Xi (xJ , x̃k ) and Xpa(i)\J (xJ ) = Xpa(i)\J (xJ , x̃k ), which proves the identity in
Lemma 5.14.

If i ∈ ch_G(k), we have

    P( Xi(xJ) = x̃i | X_{pa(i)\J}(xJ) = x̃_{pa(i)\J} )
    = P( Xi(xJ) = x̃i | X_{pa(i)\J\{k}}(xJ) = x̃_{pa(i)\J\{k}}, Xk(xJ) = x̃k )
    = P( Xi(xJ, x̃k) = x̃i | X_{pa(i)\J\{k}}(xJ, x̃k) = x̃_{pa(i)\J\{k}}, Xk(xJ, x̃k) = x̃k )
    = P( Xi(xJ, x̃k) = x̃i | X_{pa(i)\J\{k}}(xJ, x̃k) = x̃_{pa(i)\J\{k}} ).

The second equality follows from the consistency of counterfactuals (Proposition 5.6).
The third equality follows from the conditional independence between Xi(x_{J∪{k}}) and
Xk(x_{J∪{k}}), which follows from the induction hypothesis and the observation that
pa_{G*[J∪{k}]}(i) d-separates i from k in G*[J ∪ {k}].

5.15 Remark (Completeness of d-separation in SWIGs). The completeness of d-separation
in DAGs (Remark 4.23) also extends to SWIGs, in the sense that if two counterfactuals
are d-connected given a third counterfactual, then there exists a distribution of the
factuals and counterfactuals obeying Theorem 5.13 and Lemma 5.18 in which the two
counterfactuals are dependent given the third.

5.16 Exercise. Consider the causal model defined by the graph in Figure 5.3. Show
that Y(a1, a2) ⊥⊥ A1 and Y(a2) ⊥⊥ A2 | A1, X.

Figure 5.3: A sequentially randomised experiment with variables A1, X, A2, Y (A1 and
A2 are time-varying treatments).

5.3 From counterfactual to factual

Theorem 5.13 allows us to use d-separation to check all conditional independences
in a causal DAG model. This can be used, together with the consistency property
(Proposition 5.6), to identify causal effects.

5.17 Example (Continuing from Example 5.12). By using d-separation for the SWIG
in Figure 5.2c, we have X2 ⊥⊥ X3(x2) | X1 for any x2. This conditional independence is
the same as the randomisation assumption (Assumption 2.10) in Chapter 2. We can then
apply Theorem 2.12 with X = X1, A = X2, Y = X3 to obtain

    P(X3(x2) = x3 | X1 = x1) = P(X3 = x3 | X1 = x1, X2 = x2).    (5.6)
Next, we give some general results that link counterfactual distributions with factual
distributions. The first Lemma establishes the modularity property of the counterfactual
distribution.4 A proof of this result can be found in the appendix.
5.18 Lemma. Suppose X satisfies the causal model defined by a DAG G = ([p], E). Then
for any i ∈ [p], J ⊆ [p] and x̃,

    P( Xi(xJ) = x̃i | X_{pa(i)\J}(xJ) = x̃_{pa(i)\J} )
    = P( Xi = x̃i | X_{pa(i)\J} = x̃_{pa(i)\J}, X_{pa(i)∩J} = x_{pa(i)∩J} ).

5.19 Example (Continuing from Example 5.17). By letting i = 3 and J = {2} in Lemma 5.18, we obtain

P(X3(x2) = x̃3 | X1(x2) = x̃1) = P(X3 = x̃3 | X1 = x̃1, X2 = x2),

which is exactly the same as (5.6).
Because X(xJ) factorises according to G∗[X(xJ)] (Theorem 5.13), Lemma 5.18 implies the following result.

5.20 Theorem. Suppose X satisfies the causal model defined by a DAG G. Then for any J ⊆ [p],

P(X(xJ) = x̃) = ∏_{i=1}^{p} P( Xi = x̃i | Xpa(i)∩J = xpa(i)∩J, Xpa(i)\J = x̃pa(i)\J ).
In practice, we are often more interested in the marginals of X(xJ). The next result simplifies the marginalisation.

5.21 Corollary. Suppose X is discrete. For any disjoint I, J ⊆ [p], let K = [p] \ (I ∪ J); then

P(XI(xJ) = x̃I) = Σ_{x̃K} ∏_{i∈I∪K} P( Xi = x̃i | Xpa(i)∩J = xpa(i)∩J, Xpa(i)\J = x̃pa(i)\J ).

For continuous X, replace the summation by an integral.

Proof. By Theorem 5.20,

P(XI(xJ) = x̃I) = Σ_{x̃K, x̃J} ∏_{i=1}^{p} P( Xi = x̃i | Xpa(i)∩J = xpa(i)∩J, Xpa(i)\J = x̃pa(i)\J ).

For any i ∈ J, x̃i only appears in the ith term in the product. When marginalising over such x̃i, this term sums up to 1 (because it is a conditional density).

The identity in Corollary 5.21 is known as the g-computation formula (or simply the g-formula, g for generalised)5 or the truncated factorisation6.

5.22 Example (Continuing from Example 5.19). By applying Theorem 5.20 with J = {2}, we have

P(X1 = x̃1, X2 = x̃2, X3(x2) = x̃3) = P(X1 = x̃1) P(X2 = x̃2 | X1 = x̃1) P(X3 = x̃3 | X1 = x̃1, X2 = x2).

By summing over x̃1 and x̃2, we obtain

P(X3(x2) = x̃3)
= Σ_{x̃1, x̃2} P(X1 = x̃1) P(X2 = x̃2 | X1 = x̃1) P(X3 = x̃3 | X1 = x̃1, X2 = x2)
= Σ_{x̃1} P(X1 = x̃1) P(X3 = x̃3 | X1 = x̃1, X2 = x2) Σ_{x̃2} P(X2 = x̃2 | X1 = x̃1)
= Σ_{x̃1} P(X1 = x̃1) P(X3 = x̃3 | X1 = x̃1, X2 = x2),

which is what we would obtain by directly applying the g-formula (Corollary 5.21). A cleaner form is

P(X3(x2) = x3) = Σ_{x1} P(X1 = x1) P(X3 = x3 | X1 = x1, X2 = x2),

which is simply a marginalisation of (5.6).

5.23 Remark. Notice that the formula for P(X3(x2) = x3) above is generally different from the conditional distribution of X3 given X2:

P(X3 = x3 | X2 = x2) = Σ_{x1} P(X1 = x1 | X2 = x2) P(X3 = x3 | X1 = x1, X2 = x2).

This generalises the discussion in Section 3.4 and demonstrates how "correlation does not imply causation".
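To make this concrete, here is a minimal numerical sketch in Python (all probability values are invented for illustration) that evaluates both formulas for binary X1, X2, X3 and shows that they disagree:

```python
# Hypothetical conditional probabilities along the DAG X1 -> X2 -> X3, X1 -> X3
p_x1 = {0: 0.6, 1: 0.4}                       # P(X1 = x1)
p_x2 = {0: 0.2, 1: 0.7}                       # P(X2 = 1 | X1 = x1), keyed by x1
p_x3 = {(0, 0): 0.1, (0, 1): 0.5,
        (1, 0): 0.4, (1, 1): 0.9}             # P(X3 = 1 | X1 = x1, X2 = x2)

def pr2(x2, x1):
    """P(X2 = x2 | X1 = x1)."""
    return p_x2[x1] if x2 == 1 else 1 - p_x2[x1]

def pr3(x3, x1, x2):
    """P(X3 = x3 | X1 = x1, X2 = x2)."""
    return p_x3[(x1, x2)] if x3 == 1 else 1 - p_x3[(x1, x2)]

def g_formula(x3, x2):
    """P(X3(x2) = x3): weight by the marginal P(X1 = x1)."""
    return sum(p_x1[x1] * pr3(x3, x1, x2) for x1 in (0, 1))

def conditional(x3, x2):
    """P(X3 = x3 | X2 = x2): weight by P(X1 = x1 | X2 = x2) instead."""
    num = sum(p_x1[x1] * pr2(x2, x1) * pr3(x3, x1, x2) for x1 in (0, 1))
    den = sum(p_x1[x1] * pr2(x2, x1) for x1 in (0, 1))
    return num / den

print(g_formula(1, 1))    # 0.66  (interventional)
print(conditional(1, 1))  # 0.78  (observational)
```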

5.24 Exercise. Consider the causal model defined by the graph in Figure 5.3. Suppose all the random variables are discrete.

(i) By applying the g-computation formula, show that

E[Y(a1, a2)] = Σ_x P(X = x | A1 = a1) · E[Y | A1 = a1, A2 = a2, X = x].   (5.7)

(ii) Derive (5.7) using the conditional independence in Exercise 5.16 and the consistency of counterfactuals (Proposition 5.6).

5.25 Remark. In the identification formula (5.7), the conditional expectation E[Y | A1 = a1, A2 = a2, X = x] is weighted by P(X = x | A1 = a1) instead of the marginal probability P(X = x) as in Example 5.22. This makes (5.7) a non-trivial extension of the simple case with one treatment variable. Intuitively, the dilemma is that, in order to recover the causal effect of A2 on Y, we need to condition on the confounder X. However, this blocks the directed path A1 → X → Y and makes the estimated causal effect of A1 on Y biased.

5.4 Causal identification

A major limitation of the results in the last section is that they require all relevant variables in the causal model to be measured. This is rarely the case in practice. Fortunately, in many problems with unmeasured variables it may still be possible to identify the causal effect. To focus on the main ideas, we will assume all the random variables are discrete in the examples below.

5.4.1 Back-door formula

5.26 Example. Consider the causal graphs in Figures 5.4a and 5.4b. We cannot directly apply the g-formula to identify the causal effect of A on Y, because there is an unmeasured variable U in both cases. However, the same identification formula is still correct because A ⊥⊥ Y(a) | X still holds (you can check this using the SWIGs in Figures 5.4c and 5.4d). It then follows from Theorem 2.12 that P(Y(a) = y | X = x) = P(Y = y | A = a, X = x) for all a, x, y.

[Figure omitted: two DAGs over U, X, A, Y and the corresponding SWIGs.]
(a) U confounds the X–A relation. (b) U confounds the X–Y relation. (c) SWIG for (a). (d) SWIG for (b).
Figure 5.4: It suffices to adjust for X in these scenarios to estimate the average causal effect of A on Y.

The crucial condition here is A ⊥⊥ Y(a) | X. This has appeared before in Chapter 2 under the name "randomisation assumption". With observational data, the same assumption is usually called the no unmeasured confounders assumption, because we can no longer guarantee it by physically randomising A.
5.27 Remark. A common name for A ⊥⊥ Y(a) | X is treatment assignment ignorability or simply ignorability7. The underlying idea is that A can be treated as a missingness indicator for Y(0) (and 1 − A for Y(1)), and this assumption says that the missingness is "ignorable". Essentially, this assumption allows us to treat the observational study as if the data came from a randomised experiment (but with an unknown distribution of A given X). Another name is exchangeability8. Our nomenclature emphasises the structural (instead of the statistical) content of the assumption.
5.28 Remark. In observational studies, a widely held belief is that the more pre-treatment covariates are included in X, the more "likely" the assumption Y(a) ⊥⊥ A | X is to be satisfied. However, this is not necessarily true; see Figure 5.5 for two counterexamples.9

[Figure omitted: two DAGs over U1, U2, X, A, Y.]
(a) M-bias. (b) Butterfly bias.
Figure 5.5: Counterexamples to the claim that adjusting for all observed variables that temporally precede the treatment is sufficient. U1 and U2 are unobserved.

In the graphical framework, we can check Y(a) ⊥⊥ A | X by d-separation in the SWIG. Because there is no outgoing arrow from A in G∗[a], this essentially says that every back-door path from A to Y (meaning a path with an edge going into A) must be blocked by X.

5.29 Proposition (Back-door adjustment). Suppose (X, A, Y) are random variables in a causal model G that may contain other unobserved variables. Suppose X contains no descendant of A and blocks every back-door path from A to Y in G. Then Y(a) ⊥⊥ A | X for all a, and

P(Y(a) ≤ y) = Σ_x P(X = x) · P(Y ≤ y | A = a, X = x), ∀a, y.

5.30 Exercise. Why is it necessary to assume X contains no descendant of A?

5.4.2 Front-door formula

The back-door formula cannot be applied when there are unmeasured confounders between A and Y. The front-door formula is designed to overcome this problem by decomposing the causal effect of A on Y into unconfounded mechanisms.10

[Figure omitted: a causal DAG over A, M, Y in which all directed paths from A to Y pass through M.]
Figure 5.6: A causal model in which the causal effect of A on Y is entirely mediated by M.

5.31 Example. Consider the DAG in Figure 5.6. By recursive substitution,

Y(a, m) = Y(m).   (5.8)

Using d-separation in the corresponding SWIGs, we have

Y(m) ⊥⊥ M(a),  M(a) ⊥⊥ A,  Y(m) ⊥⊥ M | A.   (5.9)

In other words, A → M is unconfounded and M → Y is unconfounded given A. Using the law of total probability, we obtain

P(Y(a) = y) = P(Y(a, M(a)) = y)
= P(Y(M(a)) = y)
= Σ_m P(Y(M(a)) = y, M(a) = m)
= Σ_m P(Y(m) = y, M(a) = m)
= Σ_m P(Y(m) = y) · P(M(a) = m).

The distributions of Y(m) and M(a) can be identified by the back-door formula. Thus

P(Y(a) = y) = Σ_m { Σ_{a′} P(Y = y | M = m, A = a′) P(A = a′) } P(M = m | A = a).   (5.10)

There are two key counterfactual relations in Example 5.31. First, the exclusion restriction (5.8) allows us to decompose the effect of A on Y into the product of the effect of A on M and the effect of M on Y. This is possible because M blocks all the directed paths from A to Y, which is often referred to as the front-door condition. Next, the no unmeasured confounder conditions in (5.9) allow us to use back-door adjustment to identify the effect of A on M and the effect of M on Y.

5.4.3 Counterfactual calculus

Back-door and front-door are two graphical conditions for causal identification. More generally, there are three graphical rules that allow us to simplify counterfactual distributions:11

5.32 Proposition (Counterfactual calculus). Consider a causal model with respect to a DAG G. For any disjoint subsets of variables X, Y, Z, W ⊂ V, and any x, y, z, w, we have:

(i) If Z(x) ⊥⊥ Y(x) | W(x) [G∗[x]], then

P(Y(x) = y | Z(x), W(x)) = P(Y(x) = y | W(x)).

(ii) If Y(x, z) ⊥⊥ Z(x, z) | W(x, z) [G∗[x, z]], then

P(Y(x, z) = y | W(x, z) = w) = P(Y(x) = y | W(x) = w, Z(x) = z).

(iii) If Y(x, z) ⊥⊥ z [G[x, z]] (z is the fixed half-vertex), then

P(Y(x, z) = y) = P(Y(x) = y).

5.33 Exercise. Prove Proposition 5.32.

The rules in Proposition 5.32 provide a systematic, graphical approach to deduce causal identification. In our framework, they simply follow from the SWIG Markov properties and basic properties of counterfactuals. These rules are not always user-friendly because all the counterfactual variables must be in the same world. In practice, we do not need to be so dogmatic.
5.34 Remark. Proposition 5.32 is the counterfactual version of the famous do-calculus12. By using SWIGs and counterfactuals, Proposition 5.32 is, however, much simpler and more transparent than the original do-calculus. It has been shown that this calculus is complete for acyclic directed mixed graphs, in the sense that all identifiable counterfactual distributions can be deduced by repeatedly applying these rules (and, of course, the probability calculus).13

The completeness of the counterfactual calculus does not preclude the possibility of causal identification under additional assumptions.
5.35 Exercise. Consider the causal diagram in Figure 5.7, where the effect of A on Y is confounded by an unmeasured variable U. The variable Z is called an instrumental variable and represents an unconfounded change to A. Show that if the variables satisfy a linear structural equation model according to the diagram, the causal effect of A on Y is identified by the so-called Wald ratio14

βAY = Cov(Z, Y)/Cov(Z, A).   (5.11)

[Figure omitted: the DAG Z → A → Y with an unmeasured variable U affecting both A and Y.]
Figure 5.7: Instrumental variables.

In the example sheet you will further explore partial identification of the average treatment effect using instrumental variables without linearity.
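The Wald ratio (5.11) only needs two sample covariances. Here is a minimal sketch with a simulated linear SEM as a sanity check (all coefficients below are invented for illustration):

```python
import numpy as np

def wald_ratio(Z, A, Y):
    """Sample analogue of (5.11): Cov(Z, Y) / Cov(Z, A)."""
    return np.cov(Z, Y)[0, 1] / np.cov(Z, A)[0, 1]

# Sanity check on a simulated linear SEM matching Figure 5.7:
rng = np.random.default_rng(0)
n = 100_000
U = rng.normal(size=n)                       # unmeasured confounder
Z = rng.normal(size=n)                       # instrument
A = 0.8 * Z + U + rng.normal(size=n)
Y = 1.5 * A - 2.0 * U + rng.normal(size=n)   # true effect of A on Y is 1.5
print(wald_ratio(Z, A, Y))                   # close to 1.5 despite confounding
```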

5.5 Proofs (non-examinable)

5.5.1 Proof of Lemma 5.18
Consider any k ∈ pa(i). By counterfactual consistency (Proposition 5.6),

P( Xi(xJ) = x̃i | Xpa(i)\J(xJ) = x̃pa(i)\J )
= P( Xi(xJ\{k}) = x̃i | Xpa(i)\J(xJ\{k}) = x̃pa(i)\J, Xk(xJ\{k}) = xk ), if k ∈ J,
= P( Xi(xJ) = x̃i | Xpa(i)\J\{k}(xJ) = x̃pa(i)\J\{k}, Xk(xJ) = x̃k ), if k ∉ J.

Repeating this, we obtain

P( Xi(xJ) = x̃i | Xpa(i)\J(xJ) = x̃pa(i)\J )
= P( Xi(xJ\pa(i)) = x̃i | Xpa(i)\J(xJ\pa(i)) = x̃pa(i)\J, Xpa(i)∩J(xJ\pa(i)) = xpa(i)∩J ).

From here it is sufficient to remove the intervention xJ\pa(i) from the right-hand side. By Proposition 5.5, we can first remove any intervention that is not on an ancestor of i, so without loss of generality we may assume J ⊆ an(i). To achieve our goal, we first add XJ\pa(i)(xJ\pa(i)) = xJ\pa(i) to the conditioning event. This does not change the conditional probability because Xi(xJ\pa(i)) is d-separated from XJ\pa(i)(xJ\pa(i)) by Xpa(i)(xJ\pa(i)) in the SWIG G[X(xJ\pa(i))]. We can then remove the intervention xJ\pa(i) from all the counterfactuals by consistency. Finally, we can remove XJ\pa(i) = xJ\pa(i) from the conditioning event because Xi is d-separated from XJ\pa(i) by Xpa(i) in G.

Following Richardson and Robins (2013, p. 113), a previous version of these lecture notes suggested that Lemma 5.18 follows from repeatedly applying Lemma 5.14. However, an application of the factorisation property seems needed to remove the interventions on XJ\pa(i).

Notes

1. Pearl, J. (2000). Causality (1st ed.). Cambridge University Press.
2. Malinsky, D., Shpitser, I., & Richardson, T. (2019). A potential outcomes calculus for identifying conditional path-specific effects. Proceedings of Machine Learning Research, 89, 3080.
3. The next two Sections are based on Richardson, T. S., & Robins, J. M. (2013). Single world intervention graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality (Tech. Rep. No. 128). Center for Statistics and the Social Sciences, University of Washington.
4. The term "modularity" is originally due to Pearl (2000). The notion here is due to Richardson and Robins (2013).
5. Robins, J. (1986). A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9–12), 1393–1512. doi:10.1016/0270-0255(86)90088-6.
6. Pearl and Verma, 1991.
7. Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.
8. For connections with exchangeability in Bayesian statistics, see Saarela, O., Stephens, D. A., & Moodie, E. E. M. (2020). The role of exchangeability in causal inference. arXiv: 2006.01799 [stat.ME].
9. For an interesting scientific debate on this from different perspectives, see Sjölander, A. (2009). Propensity scores and M-structures. Statistics in Medicine, 28(9), 1416–1420. doi:10.1002/sim.3532, and the reply in the same issue by Rubin.
10. Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4), 669–688. doi:10.1093/biomet/82.4.669.
11. Malinsky et al., 2019.
12. Pearl, 1995.
13. Huang, Y., & Valtorta, M. (2006). Pearl's calculus of intervention is complete. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (pp. 217–224). AUAI Press; Shpitser, I., & Pearl, J. (2006). Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the 21st National Conference on Artificial Intelligence (pp. 1219–1226). AAAI Press.
14. Wald, A. (1940). The fitting of straight lines if both variables are subject to error. The Annals of Mathematical Statistics, 11(3), 284–300. doi:10.1214/aoms/1177731868.
Chapter 6

No unmeasured confounders: Randomisation inference

Causal inference ≈ Causal language/model + Statistical inference.

We are at a turning point in this course. In the last Chapter, we unified three seemingly different languages for causality: counterfactuals (Chapter 2), structural equation models (Chapter 3), and graphical models (Chapter 4). If you are a philosopher, the problem of causal inference might seem to have been solved. But for statisticians, the real (and fun) part is just beginning. Starting from this Chapter, we will put the theory into practice—how can we actually make "good causal inference" in the real world?

6.1 The logic of observational studies

An observational study is an empirical investigation that utilises observational data. By observational (as opposed to experimental) data, we mean data collected without manipulation or intervention by the researcher. In fact, in many observational studies the data were recorded before a concrete research question was posed.
6.1 Remark. We will only talk about observational studies for causality below, but many of the principles also apply to non-causal inference from observational data (such as estimating the basic reproduction number of COVID-19).

All observational (and experimental) studies have two stages:

(i) Design: Empirical data are collected and preprocessed in an organised way.

(ii) Analysis: A statistical method is then applied to answer the research question.

In statistics lectures (including this course), you spend most of your time learning useful statistical models for data already collected and how the analysis can be done correctly and optimally.
In applications, it is the opposite: "design trumps analysis"1.

Consider the following argument in a randomised experiment:

(i) Design: Suppose we let half of the patients receive the treatment at random.

(ii) Analysis: Significantly more treated participants have a better outcome.

(iii) Conclusion: Therefore, the treatment must be beneficial.

The underlying logic is that randomisation allows us to choose between statistical error and causality. This reasoning is inductive.

Now consider another argument:

(i) Design: Suppose the observed patients are pair matched, so that the patients in the same pair have similar demographics and medical history.

(ii) Analysis: In significantly more pairs, the treated patient has a better outcome.

(iii) Conclusion: Therefore, the treatment must be beneficial.

In this example, randomisation is replaced by pair matching. As a consequence, apart from statistical error and causality, a third possible explanation is that the treated patients and the control patients are systematically different in some other way (for instance, different lifestyles).

So causal inference in observational studies is always abductive (inference to the best explanation). This is summarised in the following equation:2

Causal estimator − True causal effect = Design bias + Modelling bias + Statistical noise.   (6.1)
This is more than a conceptual statement. To make (6.1) concrete, let O be all the observed variables (O for observed data) and let O be the distribution of O. Similarly, let F denote the relevant factuals and counterfactuals in the causal question being asked (F for full data) and let F be its distribution. Then (6.1) amounts to the decomposition

β(O[n]; θ̂) − β(F) = {β(O) − β(F)} + {β(O; θ) − β(O)} + {β(O[n]; θ̂) − β(O; θ)},   (6.2)

where β is a generic symbol for a causal effect functional or estimator, O[n] is the observed data of size n, θ is the parameter in a statistical model, and θ̂ = θ̂(O[n]) is an estimator of θ.

6.2 Example. In regression adjustment for randomised experiments, O = (X, A, Y), F = (Y(0), Y(1)), β(F) = E[Y(1) − Y(0)], β(O) = E[Y | A = 1] − E[Y | A = 0], β(O; θ) is any of (2.14), (2.15), or (2.16), and β(O[n]; θ̂) is the corresponding (2.11), (2.12), or (2.13).

6.3 Exercise. In Example 6.2, how much is the design bias? How much is the modelling bias?

Unlike previous Chapters, we now use subscripts to index observations instead of variables. This convention will be used throughout the rest of this course, as we are mainly concerned with statistical inference.
6.2 No unmeasured confounders

In this and the next Chapter, we will assume all relevant confounders are measured, so the observational study mimics a randomised experiment.

More specifically, let Ai ∈ {0, 1} be a binary treatment for individual i, Yi be the outcome of interest with two counterfactuals Yi(0) and Yi(1), and Xi be a p-dimensional vector of covariates.

Although unnecessary for randomisation inference, for now we assume (Xi, Ai, Yi(0), Yi(1)), i = 1, ..., n, are i.i.d. In this setting, subscripts are often suppressed to indicate a generic random variable.

We restate Assumption 2.10, but now with a different name:
We restate Assumption 2.10, but now with a different name:

6.4 Assumption (No unmeasured confounders). A ⊥


⊥ Y (a) | X for a = 0, 1.

In other words, we eliminate one of the possible explanations to observed associations


by assumption. This is convenient for studying statistical methodologies but obviously
optimistic for practical applications.

6.3 Matching algorithms

Matching is a popular observational study design. It is essentially a preprocessing algorithm that aims to reconstruct a pairwise randomised experiment (Example 2.4) or a stratified randomised experiment (Exercise 2.5) from observational data. We will focus on 1-to-1 matching below, but the algorithms can easily be extended to more general forms of matching.3

To simplify the exposition, we will assume Ai = 1 for 1 ≤ i ≤ n1 and Ai = 0 for n1 + 1 ≤ i ≤ n. An essential element is a measure of distance d(·, ·) between two values of the covariates X.

6.5 Example. A commonly used distance measure in matching is the Mahalanobis distance:

dMA(x, x̃) = (x − x̃)ᵀ Σ̂⁻¹ (x − x̃),   (6.3)

where Σ̂ is an estimate of the covariance matrix of Xi within a treatment group. For example,

Σ̂ = (1/n) [ Σ_{i=1}^{n1} (Xi − X̄1)(Xi − X̄1)ᵀ + Σ_{i=n1+1}^{n} (Xi − X̄0)(Xi − X̄0)ᵀ ],

where X̄1 = Σ_{i=1}^{n1} Xi/n1 and X̄0 = Σ_{i=n1+1}^{n} Xi/n0 are the treated and control sample means.
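A minimal sketch of computing the distance matrix in (6.3) (the array and function names are hypothetical; rows 1, ..., n1 of X are the treated units):

```python
import numpy as np

def mahalanobis_matrix(X, n1):
    """Pairwise Mahalanobis distances (6.3) between treated units
    (rows 0..n1-1 of X) and control units (remaining rows), using the
    pooled within-group covariance estimate from Example 6.5."""
    Xt, Xc = X[:n1], X[n1:]
    S = ((Xt - Xt.mean(0)).T @ (Xt - Xt.mean(0))
         + (Xc - Xc.mean(0)).T @ (Xc - Xc.mean(0))) / X.shape[0]
    S_inv = np.linalg.inv(S)
    diff = Xt[:, None, :] - Xc[None, :, :]          # shape (n1, n0, p)
    return np.einsum('ijk,kl,ijl->ij', diff, S_inv, diff)
```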

Nearest-neighbour matching

Given the distance measure d, this naive method matches a treated observation 1 ≤ i ≤ n1 with its nearest control observation,

j = argmin_{Aj = 0} d(Xi, Xj).

The problem with this method is that one control individual could be matched to several treated individuals, which never happens in a pairwise randomised experiment. We can fix this problem with a greedy algorithm that sequentially matches each treated observation to its nearest control neighbour that has not yet been selected. A drawback is that the result then depends on the order of the input.

Optimal matching

An improvement is optimal matching, which solves the following optimisation problem:

minimise   Σ_{i=1}^{n1} d( Xi, Σ_{j=n1+1}^{n} Mij Xj )
subject to   Mij ∈ {0, 1}, ∀ 1 ≤ i ≤ n1, n1 + 1 ≤ j ≤ n,
             Σ_{j=n1+1}^{n} Mij = 1, ∀ 1 ≤ i ≤ n1,   (6.4)
             Σ_{i=1}^{n1} Mij ≤ 1, ∀ n1 + 1 ≤ j ≤ n,

where Mij is an indicator for the treated observation i being matched to the control observation j. The last two constraints mean that every treated observation is matched to exactly one control and every control is matched to at most one treated observation.

Although combinatorial optimisation is generally NP-complete, the optimal matching problem (6.4) can be recast as a network flow problem and solved efficiently in polynomial time.4
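When the goal is simply to minimise the total distance in (6.4), the problem is an instance of the linear assignment problem, so one minimal sketch (an assumption of this illustration; production implementations typically use the more flexible network-flow formulation, which also accommodates calipers and balance constraints) is:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_match(dist):
    """Solve (6.4) when n1 <= n0, where dist[i, j] = d(X_i, X_{n1+j}).
    linear_sum_assignment matches every row (treated unit) to a distinct
    column (control unit) minimising the total distance."""
    rows, cols = linear_sum_assignment(dist)
    pair = np.empty(dist.shape[0], dtype=int)
    pair[rows] = cols               # pair[i] = index of the matched control
    return pair

# e.g. pair = optimal_match(mahalanobis_matrix(X, n1))
```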

Propensity score matching

When matching was first developed in the 1980s, computing power was limited and it was often desirable to reduce the dimension of X before running a matching algorithm.

We say b(X) is a balancing score if A ⊥⊥ X | b(X); that is, given b(X), the covariates X have the same conditional distribution (are "balanced") in the different treatment groups. If b(X) is a balancing score, then the no unmeasured confounders assumption implies that

A ⊥⊥ Y(a) | b(X), for a = 0, 1.   (6.5)

Among all balancing scores, of particular interest is the propensity score5, defined as

π(x) = P(A = 1 | X = x).

6.6 Exercise. Prove (6.5), then show that the propensity score is a balancing score. Furthermore, show that π(X) can be written as a function of any balancing score b(X).

The propensity score can be estimated from the observational data, commonly by fitting a logistic regression of Ai on Xi. Let the estimated propensity score for individual i be π̂(Xi). A popular distance measure is the squared distance between the estimated propensity scores on the logit scale:

dPS(Xi, Xj) = [ log( π̂(Xi)/(1 − π̂(Xi)) ) − log( π̂(Xj)/(1 − π̂(Xj)) ) ]².

Propensity score caliper

The distance measure d can be freely modified according to the problem. As an example, we may use the Mahalanobis distance with a propensity score caliper:

d(Xi, Xj) = dMA(Xi, Xj) if dPS(Xi, Xj) ≤ τ², and d(Xi, Xj) = ∞ otherwise,

where τ > 0 is a tuning parameter. In this case, a treated observation is only allowed to be matched to a control observation whose propensity score differs by no more than τ on the logit scale.

6.4 Covariate balance

Recall that the logic of randomised experiments (Section 6.1) is that randomisation allows us to choose between causality and statistical error. This is reasonable because randomisation balances all pre-treatment covariates in simple Bernoulli trials and pairwise/stratified experiments. In other words, all pre-treatment covariates—measured or unmeasured—have the same distribution in the treated and control groups. Therefore, we cannot attribute any difference in the outcome (beyond some statistical error) to systematic differences in the covariates.
Following this logic, we can assess whether the matching is satisfactory by checking covariate balance. A common measure of covariate imbalance is the standardised covariate difference,6

Bk(M) = [ (1/n1) Σ_{i=1}^{n1} ( Xik − Σ_{j=n1+1}^{n} Mij Xjk ) ] / √((s²1k + s²0k)/2),   k = 1, ..., p,

where Xik is the kth covariate for the ith observation, and s²1k and s²0k are the sample variances of Xk in the treated and control groups before matching.
A rule of thumb is that the kth covariate Xk is considered approximately balanced if |Bk| < 0.1, but obviously we would like the entire vector B to be as close to 0 as possible. If the covariate balance is unsatisfactory, a common practice is to rerun the matching algorithm with a different distance measure or to remove treated units that have extreme propensity scores. This is often called the "propensity score tautology"7. In modern optimal matching algorithms, it is possible to include |Bk(M)| ≤ η, ∀k, as a constraint in the combinatorial optimisation problem.
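A sketch of the balance check for a 1-to-1 match (hypothetical names; `pair[i]` gives the index, among the controls, of the control matched to treated unit i, as returned by the matching sketch above):

```python
import numpy as np

def standardised_differences(X, n1, pair):
    """B_k(M): mean treated-minus-matched-control covariate differences,
    scaled by the pre-matching pooled standard deviation."""
    Xt = X[:n1]
    Xc = X[n1:][pair]                      # controls aligned with treated
    s2_t = X[:n1].var(axis=0, ddof=1)      # pre-matching sample variances
    s2_c = X[n1:].var(axis=0, ddof=1)
    return (Xt - Xc).mean(axis=0) / np.sqrt((s2_t + s2_c) / 2)

# Rule of thumb: covariates with |B_k| >= 0.1 deserve a second look.
```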

6.5 Randomisation inference

Next we consider statistical inference after matching. To simplify the exposition, we assume treated observation i is matched to control observation i + n1, for i = 1, ..., n1. Let

Di = (Ai − Ai+n1)(Yi − Yi+n1), i = 1, ..., n1,

be the treated-minus-control difference in pair i. Let

M = {a[2n1] ∈ {0, 1}^{2n1} : ai + ai+n1 = 1, ∀i ∈ [n1]}

be the set of all treatment assignments such that, within each matched pair, exactly one observation receives the treatment. Let Ci = (Xi, Yi(0), Yi(1)).
There are two ways to proceed from here. The first approach is to use the sample average of the Di,

D̄ = (1/n1) Σ_{i=1}^{n1} Di,

to estimate E[Y(1) − Y(0) | A = 1], the average treatment effect on the treated (ATT). We will not say more about this estimator other than that it is commonly used in practice but its statistical inference is not as straightforward as one might imagine.8

The second and perhaps more interesting approach is to use a randomisation test to mimic what was done for randomised experiments in Section 2.4. The next assumption mimics Example 2.4.

6.7 Assumption. We assume matching reconstructs a pairwise randomised experiment, so

P(A[2n1] = a | C[2n1], A[2n1] ∈ M) = 2^{−n1} if a ∈ M, and 0 otherwise.

This assumption is satisfied if there are no unmeasured confounders and the (true) propensity scores are exactly matched.

6.8 Proposition. Suppose the data are i.i.d. and Assumption 6.4 is satisfied. Then Assumption 6.7 holds if π(Xi) = π(Xi+n1) for all i ∈ [n1].

6.9 Exercise. Prove Proposition 6.8.
Assumption 6.7 allows us to apply the randomisation test described in Section 2.4. Because the observations are matched, it is common to construct test statistics based on D[n1].

Consider the sharp null hypothesis H0: Yi(1) − Yi(0) = β, ∀i, where β is given. Under H0 and using the consistency assumption (Assumption 2.6), the counterfactual values of D[n1] can be imputed as

Di(a[2n1]) = (ai − ai+n1) · (Yi(ai) − Yi+n1(ai+n1)) = Di if (ai, ai+n1) = (1, 0), and 2β − Di if (ai, ai+n1) = (0, 1).   (6.6)
Consider any test statistic T = T(D[n1]). Next we construct a randomisation test based on the randomisation distribution of T(D[n1](A[2n1])). Let F(t) denote its cumulative distribution function given C[2n1] and A[2n1] ∈ M under H0:

F(t; D[n1], β) = P(T ≤ t | C[2n1], A[2n1] ∈ M, Hβ) = Σ_{a[2n1]∈M} (1/2)^{n1} · I( T(D[n1](a[2n1])) ≤ t ).   (6.7)

We then compute the p-value for this randomisation test as P2 = F(T) and reject the hypothesis Hβ if P2 is less than a significance threshold 0 < α < 1. Following the same argument as in the proof of Theorem 2.19, this is a valid test of H0.

6.10 Theorem. Under Assumptions 2.6 and 6.7, P(P2 ≤ α) ≤ α under H0 for all 0 < α < 1.

6.11 Example (Signed score statistic). Let ψ: [0, 1] → R+ be a positive function on the unit interval. The signed score statistic is defined as

Tψ(D[n1]) = Σ_{i=1}^{n1} sgn(Di) ψ( rank(|Di|)/(n1 + 1) ),   (6.8)

where sgn is the sign function and rank(|Di|) is the rank of the absolute difference |Di| among |D1|, ..., |Dn1|. The widely used Wilcoxon signed rank statistic corresponds to the choice ψ(t) = (n1 + 1)t.
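Since |Di − β| is invariant under the pair flips in (6.6), the randomisation distribution only involves random signs, which makes a Monte Carlo approximation of the p-value (6.7) straightforward. A minimal sketch for the Wilcoxon signed rank score, applied to the β-centred differences (names are hypothetical):

```python
import numpy as np

def randomisation_p_value(D, beta=0.0, n_draws=10_000, seed=0):
    """Monte Carlo approximation of the randomisation p-value P2 = F(T)
    in (6.7) for the Wilcoxon signed rank score psi(t) = (n1 + 1) t,
    applied to D_i - beta (for beta = 0 this is exactly (6.8))."""
    rng = np.random.default_rng(seed)
    d = np.asarray(D) - beta
    # |d| is invariant under the flips D_i -> 2*beta - D_i in (6.6),
    # so the ranks are fixed and only the signs are randomised.
    ranks = np.argsort(np.argsort(np.abs(d))) + 1
    t_obs = np.sum(np.sign(d) * ranks)
    signs = rng.choice([-1, 1], size=(n_draws, len(d)))
    t_rand = signs @ ranks
    return np.mean(t_rand <= t_obs)
```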

6.12 Remark. We have been following the second approach to randomisation tests in Section 2.4. We can also follow the first approach by considering the distribution of T(A[2n1], Y[2n1](0)), although this is less intuitive because A[2n1] = 0 is not in the allowed set M.

6.13 Exercise. Consider the signed score statistic in (6.8) (treated as a function of A[2n1] and Y[2n1]). Derive the randomisation test based on the randomisation distribution of T(A[2n1], Y[2n1](0)) and show that, given H0 and conditioning on C[2n1],

T(A[2n1], Y[2n1](0)) | A[2n1] ∈ M  =d  Σ_{i=1}^{n1} Si ψ( rank(|Yi(0) − Yn1+i(0)|)/(n1 + 1) ),   (6.9)

where Si = (Ai − Ai+n1) · sgn(Yi(0) − Yi+n1(0)) takes the values ±1 with probability 1/2 each. Justify this test using the symmetry of Di − β under Assumption 6.7 and H0. Establish the equivalence between P1 and P2 (see also Exercise 2.21).

Equation (6.9) is indeed the more commonly used test because it is more computationally friendly. (Why?)

Notes

1. Rubin, D. B. (2008). For objective causal inference, design trumps analysis. The Annals of Applied Statistics, 2(3), 808–840. doi:10.1214/08-aoas187.
2. Zhao, Q., Keele, L. J., & Small, D. S. (2019). Comment: Will competition-winning methods for causal inference also succeed in practice? Statistical Science, 34(1), 72–76. doi:10.1214/18-sts680.
3. Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1), 1–21. doi:10.1214/09-sts313.
4. Rosenbaum, P. R. (2020). Modern algorithms for matching in observational studies. Annual Review of Statistics and Its Application, 7(1), 143–176. doi:10.1146/annurev-statistics-031219-041058.
5. This was proposed in one of the most cited statistics papers, Rosenbaum and Rubin, 1983.
6. Rosenbaum, P. R., & Rubin, D. B. (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39(1), 33–38. doi:10.1080/00031305.1985.10479383.
7. Imai, K., King, G., & Stuart, E. A. (2008). Misunderstandings between experimentalists and observationalists about causal inference. Journal of the Royal Statistical Society: Series A (Statistics in Society), 171(2), 481–502. doi:10.1111/j.1467-985x.2007.00527.x.
8. Abadie, A., & Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1), 235–267. doi:10.1111/j.1468-0262.2006.00655.x.
Chapter 7

No unmeasured confounders: Semiparametric inference

In this Chapter, we extend the super-population inference in randomised experiments (Section 2.5) to observational studies. The main challenge is that the treatment assignment mechanism is now unknown and must be estimated.

7.1 An introduction to semiparametric inference

As in the last Chapter, we will focus on the case of a binary treatment A and assume (Xi, Ai, Yi(0), Yi(1)), i ∈ [n], are i.i.d. We maintain the assumptions of no unmeasured confounders (Assumption 6.4), consistency (Assumption 2.6), and positivity:

7.1 Assumption (Positivity). πa(x) = P(A = a | X = x) > 0, ∀a, x.

[Figure omitted.]
Figure 7.1: An observational study with no unmeasured confounders.

7.2 Remark. We have seen in Chapter 5 that no unmeasured confounders and consistency necessarily follow from assuming the single-world causal model corresponding to Figure 7.1. Assumption 7.1 is also called the overlap assumption because, by Bayes' rule, it is equivalent to assuming that the distribution of X has the same support given A = a for all a.

By Theorem 2.12, we have

ATE = E[Y(1) − Y(0)] = E{ E[Y | A = 1, X] − E[Y | A = 0, X] }.   (7.1)

Our goal is to estimate the right-hand side of the above equation, a functional of the observed data distribution.

This is where semiparametric inference becomes useful. A semiparametric model is a statistical model with parametric and nonparametric components.
7.3 Example. An example of a semiparametric model is the partially linear model

E[Y | A, X] = β · A + g(X),

where β and g(·) are unknown. Other well-known examples include single index models, varying coefficient models, and Cox's proportional hazards model in survival analysis.
Semiparametric inference is mainly concerned with estimating and making inference for the (finite-dimensional) parametric component. This is called the parameter of interest, and the (infinite-dimensional) nonparametric component is called the nuisance parameter.

An alternative setup is to consider estimating a functional β(P) using an i.i.d. sample D1, ..., Dn from a distribution P that is known to belong to a set P of probability distributions. This is very general and well suited to causal inference problems: the causal identification theory (Section 5.4) often equates a causal effect of interest with a functional of the observed variables; an example is (7.1).

Formally deriving the semiparametric inference theory is beyond the scope of this course.1 Below we informally describe some key results of this theory.
Roughly speaking, semiparametric inference provides a theory for well-behaved (so-called regular) estimators2 that admit the so-called asymptotic linear expansion:

√n(β̂ − β) = (1/√n) Σ_{i=1}^{n} ψβ(Di) + op(1),   (7.2)

where the influence function ψβ(·) has mean 0 and finite variance, and op(1) means that the residual converges to 0 in probability as n → ∞.

The asymptotic linearity (7.2) implies that β̂ has an asymptotic normal distribution:

√n(β̂ − β) →d N(0, Var(ψβ(D))).

The influence function with the smallest variance, ψβ,eff(·), is called the efficient influence function.

Semiparametric inference theory gives a geometric characterisation of the space of influence functions. A key conclusion is that ψβ,eff(·) can be obtained by projecting any influence function onto the so-called tangent space, which consists of all score functions of the model. In consequence, Var(ψβ,eff(D)) ≤ Var(ψβ(D)). This generalises the Cramér-Rao lower bound and the asymptotic efficiency of the maximum likelihood estimator from parametric models to semiparametric models.

7.4 Example. Following the proof of Lemma 2.26, a Z-estimator is generally asymptotically linear and its influence function is given in (2.20).

7.5 Exercise. Derive the influence function for the regression estimator β̂1 in (2.11). Verify that it has mean 0.

7.2 Discrete covariates

The semiparametric inference theory is very general and abstract. To obtain some intuition for our problem, we first consider the case of discrete covariates X and the estimation of

βa = E{E[Y | A = a, X]} = Σ_x µa(x) P(X = x), a = 0, 1,   (7.3)

where µa(x) = E[Yi | Ai = a, Xi = x]. The ATE can be written as β = β1 − β0.


Given an i.i.d. sample from this population, we can empirically estimate the quantities in (7.3) by

µ̂a(x) = Σ_{i=1}^{n} I(Ai = a, Xi = x) Yi / Σ_{i=1}^{n} I(Ai = a, Xi = x),
P̂(X = x) = (1/n) Σ_{i=1}^{n} I(Xi = x).

The first estimator is well defined if the denominator is non-zero, an event with probability tending to 1 as n → ∞. By plugging these into (7.3), we obtain the outcome regression (OR) estimator

β̂a,OR = Σ_x µ̂a(x) P̂(X = x) = (1/n) Σ_x Σ_{i=1}^{n} µ̂a(x) I(Xi = x) = (1/n) Σ_{i=1}^{n} µ̂a(Xi).   (7.4)

The ATE can subsequently be estimated by

β̂OR = β̂1,OR − β̂0,OR = (1/n) Σ_{i=1}^{n} [ µ̂1(Xi) − µ̂0(Xi) ].   (7.5)

7.6 Remark. Since (7.4) does not depend on the form of µ̂a(x), this outcome regression estimator can easily be extended to the continuous-X case.

Next, we analyse the asymptotic behaviour of β̂OR with discrete X. This does not follow trivially from the central limit theorem because the summands in (7.4) are not independent. To solve this problem, we derive an alternative representation of the outcome regression estimator.
Recall πa(x) = P(A = a | X = x), which can be estimated by

π̂a(x) = Σ_{i=1}^{n} I(Ai = a, Xi = x) / Σ_{i=1}^{n} I(Xi = x).

Because A is binary, π0(x) = 1 − π1(x). Note that π1(x) = π(x), the propensity score defined in Section 6.3.

The inverse probability weighted (IPW) estimator3 is given by

β̂a,IPW = (1/n) Σ_{i=1}^{n} [ I(Ai = a)/π̂a(Xi) ] Yi.   (7.6)

The name is derived from the form of the estimator. The average treatment effect can subsequently be estimated by

β̂IPW = β̂1,IPW − β̂0,IPW = (1/n) Σ_{i=1}^{n} [ Ai/π̂(Xi) − (1 − Ai)/(1 − π̂(Xi)) ] Yi.   (7.7)

7.7 Proposition. Suppose X is discrete and π̂a(x) > 0 for all a and x. Then β̂a,OR = β̂a,IPW for a = 0, 1, and β̂OR = β̂IPW.

Notice that the summands in (7.7) are still not independent. To break through this impasse, a key observation is the following:

7.8 Lemma. Under the same assumptions as in Proposition 7.7, the identity

(1/n) Σ_{i=1}^{n} [ I(Ai = a)/π̂a(Xi) ] µ(Xi) = (1/n) Σ_{i=1}^{n} µ(Xi), a = 0, 1,

holds for any function µ(·).

Intuitively, Lemma 7.8 says that the distribution of X in the entire sample can be recovered from the A = a subsample by inverse probability weighting.

7.9 Exercise. Prove Proposition 7.7 and Lemma 7.8.

Using Lemma 7.8, we have, by adding and subtracting µa(Xi),

√n(β̂a,IPW − βa)
= (1/√n) Σ_{i=1}^{n} { [ I(Ai = a)/π̂a(Xi) ] [Yi − µa(Xi)] + µa(Xi) − βa }
= (1/√n) Σ_{i=1}^{n} { [ I(Ai = a)/πa(Xi) ] [Yi − µa(Xi)] + µa(Xi) − βa } + Rn.

The residual term

Rn = (1/√n) Σ_{i=1}^{n} I(Ai = a) [ 1/π̂a(Xi) − 1/πa(Xi) ] [Yi − µa(Xi)] →p 0

as n → ∞. This is because π̂a(x) generally converges to πa(x) at the 1/√n rate and the other term I(Ai = a)[Yi − µa(Xi)] is i.i.d. with mean 0. This shows that β̂a,IPW admits the asymptotic linear expansion in (7.2) with the influence function

ψβa(D) = [ I(A = a)/πa(X) ] [Y − µa(X)] + µa(X) − βa.

The asymptotic results are summarised in the next Theorem.

7.10 Theorem. Suppose the positivity assumption (Assumption 7.1) holds. Under i.i.d. sampling with discrete X and regularity conditions, we have

√n(β̂a,OR − βa) →d N(0, Var(ψβa(Di))), a = 0, 1,

and

√n(β̂OR − β) →d N(0, Var(ψβ1(Di) − ψβ0(Di))).

7.11 Remark. In the discrete-X case, ψβa(D) is the only influence function for estimating βa, because we are considering the nonparametric model and the tangent space contains all square-integrable functions with mean 0. As a consequence, ψβa(D) is also the efficient influence function. This last conclusion remains true when X contains continuous covariates, although there are then many other possible influence functions.4

7.12 Exercise. Derive the same results for estimating the average treatment effect on the treated, ATT = E[Y(1) − Y(0) | A = 1].

(i) Show that

E[Y(0) | A = 1] = E[µ0(X)π(X)] / P(A = 1) =: β0|1.

(ii) Let n1 = Σ_{i=1}^{n} I(Ai = 1). Show that the OR estimator of β0|1 is given by

β̂0|1,OR = (1/n1) Σ_{i=1}^{n} I(Ai = 1) µ̂0(Xi),

the IPW estimator of β0|1 is given by

β̂0|1,IPW = (1/n1) Σ_{i=1}^{n} I(Ai = 0) [ π̂(Xi)/(1 − π̂(Xi)) ] Yi,

and β̂0|1,OR = β̂0|1,IPW.

(iii) Show that

√n1(β̂0|1,OR − β0|1) = (1/√n1) Σ_{i=1}^{n} { (1 − Ai) [ π(Xi)/(1 − π(Xi)) ] [Yi − µ0(Xi)] + Ai [µ0(Xi) − β0|1] } + op(1).

(iv) Complete the asymptotic theory for estimating the ATT with discrete X.

7.3 Outcome regression and inverse probability weighting

When X contains continuous covariates, Remark 7.6 suggests that the OR and IPW estimators can still be applied by plugging in empirical estimates of the nuisance parameters µa(x) = E[Y | A = a, X = x] and πa(x) = P(A = a | X = x). However, unlike in the discrete-X case, the OR and IPW estimators are generally different.
Intuitively, these estimators are reasonable because of the following dual representation of βa.

7.13 Proposition. Under the positivity assumption (Assumption 7.1), we have

βa = E[µa(X)] = E[ I(A = a) Y / πa(X) ], a = 0, 1,

and thus

β = E[µ1(X) − µ0(X)] = E[ AY/π(X) − (1 − A)Y/(1 − π(X)) ].

7.14 Exercise. Prove Proposition 7.13.

Thus, the asymptotic behaviour of the OR and IPW estimators crucially depends on how well µa(x) and π(x) are estimated. When µa(x) is modelled parametrically, β̂a,OR is generally consistent if the model is correctly specified. The same conclusion holds for β̂a,IPW and modelling πa(x).
7.15 Example. In practice, a common model for µa(x) is the linear regression

µa(x; ηµ) = γµa + δµaᵀ x, for a = 0, 1,

where ηµ = (γµ0, δµ0, γµ1, δµ1). A common model for π(x) = π1(x) is the logistic regression

π(x; ηπ) = exp(γπ + δπᵀ x) / [1 + exp(γπ + δπᵀ x)],

where ηπ = (γπ, δπ). The parameters ηµ and ηπ are usually estimated by maximum likelihood in the corresponding generalised linear models (Y is normally distributed and A is Bernoulli). When the regression models are correctly specified, µa(x; η̂µ) and π(x; η̂π) generally converge to µa(x) and π(x). However, when the regression models are incorrectly specified, they only converge to the best parametric approximations of the true µa(x) and π(x).
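A minimal sketch of the plug-in OR and IPW estimators with the working models of Example 7.15, using scikit-learn for the two regressions (this choice of library is an assumption of the sketch; any fitting routine would do):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def or_ipw_estimates(X, A, Y):
    """Plug-in OR and IPW estimates of the ATE with linear mu_a and
    logistic pi, as in Example 7.15 (X: n-by-p array, A: 0/1 array)."""
    mu1 = LinearRegression().fit(X[A == 1], Y[A == 1])
    mu0 = LinearRegression().fit(X[A == 0], Y[A == 0])
    # Caveat: scikit-learn's logistic regression is L2-penalised by
    # default, so it only approximates the maximum likelihood fit.
    pi = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
    beta_or = np.mean(mu1.predict(X) - mu0.predict(X))        # (7.5)
    beta_ipw = np.mean(A * Y / pi - (1 - A) * Y / (1 - pi))   # (7.7)
    return beta_or, beta_ipw
```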
7.16 Remark. To reduce the modelling bias (recall (6.1)), we can estimate the nuisance parameters µa(·), πa(·), a = 0, 1, using more flexible models. Many researchers have suggested using machine learning methods to estimate the nuisance parameters, in the hope that they can adapt to unspecified patterns in the data. These methods indeed perform much better than traditional statistical models on simulated datasets5, but the discrepancy tends to be much smaller in real datasets6.

7.4 Doubly robust estimator

Next we combine the OR estimator and the IPW estimator to obtain a more efficient and robust estimator. The idea is to use the efficient influence function ψβa(D) derived in Section 7.2, which can be written as

ψβa(D; µa, πa) = ma(D; µa, πa) − βa,

where

ma(D; µa, πa) = [ I(A = a)/πa(X) ] (Y − µa(X)) + µa(X).   (7.8)
Let µ̂a(·) be an estimator of µa(·) and π̂a(·) be an estimator of πa(·). Because influence functions have mean 0, the above representation motivates the estimator

β̂a,DR = (1/n) Σ_{i=1}^{n} ma(Di; µ̂a, π̂a),   (7.9)

and β̂DR = β̂1,DR − β̂0,DR.
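A minimal sketch of (7.9), given fitted values of the nuisance parameters (e.g. from the sketch in Section 7.3; the names are hypothetical):

```python
import numpy as np

def dr_estimate(A, Y, mu0_hat, mu1_hat, pi_hat):
    """Doubly robust ATE estimate: the sample average of m_1 - m_0,
    with m_a as in (7.8).

    mu0_hat, mu1_hat: fitted values of mu_a(X_i) for each unit;
    pi_hat: fitted propensity scores pi(X_i)."""
    m1 = A * (Y - mu1_hat) / pi_hat + mu1_hat               # (7.8), a = 1
    m0 = (1 - A) * (Y - mu0_hat) / (1 - pi_hat) + mu0_hat   # (7.8), a = 0
    return np.mean(m1 - m0)                                 # (7.9)
```

A sample-splitting variant (discussed below) would fit the nuisance parameters on one half of the sample and average ma over the other half.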
The efficient influence function has the following appealing property.

7.17 Proposition (Double robustness). Under positivity (Assumption 7.1), we have, for any functions µ̃a(·) and π̃a(·),

0 = E[ψβa(D; µa, πa)] = E[ψβa(D; µ̃a, πa)] = E[ψβa(D; µa, π̃a)].

7.18 Exercise. Prove Proposition 7.17.

In other words, we have

βa = E[ma(Di; µa, πa)] = E[ma(Di; µa, π̃a)] = E[ma(Di; µ̃a, πa)].

This shows that if either µa(·) or πa(·) is consistently estimated, the estimator β̂a,DR is generally consistent for estimating βa. This is why we call β̂a,DR the doubly robust estimator.
The doubly robust estimator is also useful when the nuisance parameters are estimated using flexible machine learning methods. To see this, we examine the residual term in the asymptotic linear expansion of β̂a,DR:

Rn = √n(β̂a,DR − βa) − (1/√n) Σ_{i=1}^{n} ψβa(Di)
= (1/√n) Σ_{i=1}^{n} [ ma(Di; µ̂a, π̂a) − ma(Di; µa, πa) ].

We cannot immediately conclude that Rn →p 0 using the law of large numbers, because the summands on the right-hand side are not i.i.d. (µ̂a and π̂a are obtained using D[n]). There are two ways to resolve this issue. First, if the models we use for µa and πa are not too complex (they are in the so-called Donsker function class)7, one can deduce from empirical process theory that the dependence of µ̂a and π̂a on the data can be ignored:

Rn ≈ √n E_{Dn+1}[ ma(Dn+1; µ̂a, π̂a) − ma(Dn+1; µa, πa) ],

where E_{Dn+1} indicates that the expectation is taken over a new and independent observation Dn+1.
Another approach is to use sample splitting, so that the nuisance parameters are estimated using an independent subsample.8 This technique allows us to avoid restricting the complexity of the nuisance models, and it has become popular with machine learning methods (sometimes under the name "cross-fitting").

Using (7.8) and taking the expectation over An+1, Yn+1 given Xn+1, it can be shown that

Rn ≈ √n E_{Xn+1}[ {π̂a(Xn+1) − πa(Xn+1)} {µ̂a(Xn+1) − µa(Xn+1)} / π̂a(Xn+1) ].   (7.10)

In other words, the residual term Rn depends on the product of the estimation errors of π̂a and µ̂a.

7.19 Exercise. Derive (7.10) and use it to (informally) prove the double robustness of β̂DR.

Let us define the mean squared error (MSE) of µ̂a(·) as

MSE(µ̂a(·)) = E_{Xn+1}[ (µ̂a(Xn+1) − µa(Xn+1))² ].

Similarly, we may define the MSE of π̂a(·). By applying the Cauchy-Schwarz inequality to (7.10), we obtain

RHS of (7.10) ≤ √n sup_x π̂a⁻¹(x) √( MSE(µ̂a(·)) · MSE(π̂a(·)) ).

This result is summarised in the next Lemma.

7.20 Lemma. Under i.i.d. sampling and mild regularity conditions, the above residual term Rn →p 0 if

(i) there exists C > 0 such that P(π̂a(x) ≥ C, ∀x) → 1 as n → ∞; and

(ii) √n √( MSE(µ̂a(·)) · MSE(π̂a(·)) ) →p 0 as n → ∞.

Suppose π̂0(x) = 1 − π̂(x). By combining the previous results, we obtain the next Theorem.
7.21 Theorem (Semiparametric efficiency of the DR estimator). Under i.i.d. sampling and mild regularity conditions, suppose there exists C > 0 such that

P(C ≤ π̂(x) ≤ 1 − C, ∀x) → 1 as n → ∞.   (7.11)

Furthermore, suppose

√n √( max{MSE(µ̂0(·)), MSE(µ̂1(·))} · MSE(π̂(·)) ) →p 0 as n → ∞.   (7.12)

Then the estimator β̂DR satisfies

√n(β̂DR − β) →d N(0, Var(ψβ1(Di) − ψβ0(Di))).

7.22 Remark. The condition (7.11) highlights the role of the positivity assumption and is needed because of the weighting by the inverse of π̂a(X). It is satisfied, for example, if π(x) is bounded away from 0 and 1 and π̂(·) is consistent. The condition (7.12) is satisfied if both MSE(µ̂a(·)) and MSE(π̂(·)) are op(n^{−1/2}) for a = 0, 1, that is, if the nuisance estimators converge faster than the n^{−1/4} rate in root mean squared error.


7.5 A comparison of the statistical methods

We have covered several statistical methods for observational studies with no unmeasured confounders. Each method has its own strengths and weaknesses and may be preferable in different practical problems.

Matching and randomisation inference

• Advantages: transparent; easy implementation; can incorporate prior knowledge; ensures well overlapping covariates.

• Disadvantages: less efficient (though this can be improved).

Inverse probability weighting

• Advantages: extends matching; generalisable to more complex problems9; can reach the semiparametric efficiency bound10; can be doubly robust11.

• Disadvantages: can be unstable if the estimated probabilities are close to zero12; not robust to model misspecification.

Outcome regression

• Advantages: can reach the parametric Cramér-Rao bound (smaller than the semiparametric bound); can easily incorporate machine learning methods.

• Disadvantages: not robust to model misspecification.
Doubly robust estimator

• Advantages: doubly robust; modelling bias is reduced; can reach the semiparametric efficiency bound.

• Disadvantages: can be unstable if the estimated probabilities are close to zero13.

Notes

1. For the general theory, see Van der Vaart, A. W. (2000). Asymptotic statistics. Cambridge University Press, Chapter 25. For a less formal treatment with examples in causal inference, see Tsiatis, A. (2007). Semiparametric theory and missing data. Springer.
2. The regularity condition is needed to rule out estimators (for example, Hodges' estimator) that are "super-efficient" at some parameter values but have erratic behaviour nearby. See Van der Vaart, 2000, Example 8.1.
3. This is also called the Horvitz-Thompson estimator, which was first proposed in survey sampling; see Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260), 663–685. doi:10.1080/01621459.1952.10483446.
4. Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90(429), 106–121. doi:10.1080/01621459.1995.10476493. See also Tsiatis, 2007, Chapters 7 and 13.
5. Dorie, V., Hill, J., Shalit, U., Scott, M., & Cervone, D. (2019). Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, 34(1), 43–68. doi:10.1214/18-sts667.
6. Keele, L., & Small, D. (2018). Comparing covariate prioritization via matching to machine learning methods for causal inference using five empirical applications. arXiv: 1805.03743 [stat.AP].
7. Van der Vaart, 2000, Chapter 19.
8. This idea is originally due to Hájek, J. (1962). Asymptotically most powerful rank-order tests. The Annals of Mathematical Statistics, 33(3), 1124–1147. doi:10.1214/aoms/1177704476. See also Van der Vaart, 2000, Section 25.8.
9. See, for example, Robins, J. M., Hernán, M. Á., & Brumback, B. (2000). Marginal structural models and causal inference in epidemiology. Epidemiology, 11(5), 550–560. doi:10.1097/00001648-200009000-00011.
10. Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4), 1161–1189. doi:10.1111/1468-0262.00442.
11. Zhao, Q. (2019). Covariate balancing propensity score by tailored loss functions. The Annals of Statistics, 47(2), 965–993. doi:10.1214/18-aos1698.
12. Kang, J. D. Y., & Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4), 523–539. doi:10.1214/07-sts227.
13. Kang and Schafer, 2007.
Chapter 8

Sensitivity analysis

No unmeasured confounders, as introduced in Section 6.2, is a rather optimistic assumption in practice. Does a limited violation of this assumption render our statistical analysis useless? This question is answered by a sensitivity analysis.

8.1 A roadmap

Broadly speaking, a sensitivity analysis consists of three steps:

(i) Model augmentation: Specify a family of distributions Fθ,η for the full data F of all the relevant factuals and counterfactuals, where η is a sensitivity parameter. It is customary to let η = 0 correspond to the case of no unmeasured confounders.

(ii) Statistical inference: Test a causal hypothesis regarding Fθ,η or estimate a causal effect β(Fθ,η). This is usually done in one of the two following senses:

(a) Point identification: the inference is valid at a given η;

(b) Partial identification: the inference is valid for all η in a given set H.

(iii) Interpretation: Assess the strength of evidence by examining how sensitive the conclusions are to unmeasured confounders. This typically involves finding the "tipping point" η and making a heuristic interpretation.

Below we will give two sensitivity analysis methods that rely on different model
augmentations and provide different statistical guarantees.

8.2 Rosenbaum’s sensitivity analysis

We first describe a sensitivity analysis that applies to randomisation inference for matched observational studies. Recall the setting in Section 6.5: observations 1, ..., n1 are treated and matched to control observations n1 + 1, ..., 2n1, respectively.

Suppose the data are i.i.d. and denote

πi = P(Ai = 1 | Ci), i ∈ [2n1],

where Ci = (Xi, Yi(0), Yi(1)). The following sensitivity model is widely used in observational studies.1

8.1 Assumption (Rosenbaum's sensitivity model). For a given value Γ ≥ 1, we have

1/Γ ≤ [ πi/(1 − πi) ] / [ πn1+i/(1 − πn1+i) ] ≤ Γ, ∀i ∈ [n1].   (8.1)

In fact, Γ = 1 recovers no unmeasured confounders (Assumption 6.7). This is because

P(Ai = 1, An1+i = 0 | C[2n1], Ai + An1+i = 1) = πi(1 − πn1+i) / [ πi(1 − πn1+i) + πn1+i(1 − πi) ].

The odds ratio bound (8.1) further implies that

1/(1 + Γ) ≤ P(Ai = 1, An1+i = 0 | C[2n1], Ai + An1+i = 1) ≤ Γ/(1 + Γ).   (8.2)
8.2 Remark. An alternative and perhaps more intuitive formulation of Rosenbaum's sensitivity model is the following. Suppose there exists an unmeasured confounder U ∈ [0, 1] such that A ⊥⊥ {Y(0), Y(1)} | X, U. Then if we let πi = P(Ai = 1 | Xi, Ui), the sensitivity model (8.1) is equivalent to assuming the logistic regression model

P(A = 1 | X, U) = expit(g(X) + γU), 0 ≤ γ ≤ log Γ,   (8.3)

where g(·) is an arbitrary function and expit(η) = e^η/(1 + e^η).

8.3 Exercise. Show that (8.3) implies (8.1) if Xi = Xn1+i.

Next we consider the randomisation distribution of the signed score statistic (6.8) under Rosenbaum's sensitivity model. Following Exercise 6.13, given H0 and conditioning on C[2n1],

T(A[2n1], Y[2n1](0)) | A[2n1] ∈ M  =d  Σ_{i=1}^{n1} Si ψ( rank(|Yi(0) − Yn1+i(0)|)/(n1 + 1) ),

where Si = (Ai − An1+i) · sgn(Yi(0) − Yn1+i(0)).

where Si = (Ai − An1 +i ) · sgn(Yi (0) − Yn1 +i (0)).


A random variable X is said to stochastically dominate another random variable
Y , written as X  Y , if P(X > t) ≥ P(Y > t) for all t. The distribution of Si given
Yi (0), Yn1 +i (0) is unkonwn, but by using (8.2), Si stochastically dominates the following
random variable (
−1, with probability Γ/(1 + Γ),
Si− =
1, with probability 1/(1 + Γ).
This can be used to obtain a (sharp) bound on the p-value:

88
8.4 Theorem. Suppose Assumption 8.1 holds and we use the signed score statistic (6.8). Given H0,

T(A[2n1], Y[2n1](0)) ⪰ Σ_{i=1}^{n1} Si⁻ ψ( rank(|Di − β|)/(n1 + 1) ).   (8.4)

Proof. This Theorem follows from noticing that |Di − β| = |Yi(0) − Yn1+i(0)| and the following property of stochastic ordering: if Xi ⪰ Yi for i ∈ [n] and Xi ⊥⊥ Xj, Yi ⊥⊥ Yj for all i ≠ j, then Σ_{i=1}^{n} Xi ⪰ Σ_{i=1}^{n} Yi.

Theorem 8.4 allows us to upper bound the randomisation p-value in Rosenbaum's sensitivity model. Let F1⁻(·) denote the cumulative distribution function of the right-hand side of (8.4) given C[2n1], and let P1⁻ = F1⁻(T(A[2n1], Y[2n1] − βA[2n1])). Then, under the assumptions in Theorem 8.4, P1⁻ is a valid p-value if the distribution of (Ai, Xi, Yi(0), Yi(1)) satisfies Rosenbaum's sensitivity model.

We can compute the bounding p-value P1⁻ using the exact distribution of the right-hand side of (8.4) or by Monte Carlo. Alternatively, we can approximate it by the central limit theorem.
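A Monte Carlo sketch of the bounding p-value P1⁻ for the Wilcoxon signed rank score (hypothetical names; Γ = 1 recovers the ordinary randomisation p-value, up to Monte Carlo error):

```python
import numpy as np

def sensitivity_p_value(D, beta=0.0, gamma=1.0, n_draws=10_000, seed=0):
    """Monte Carlo approximation of the bounding p-value P1- under
    Rosenbaum's sensitivity model with parameter Gamma = gamma, for the
    Wilcoxon signed rank score (right-hand side of (8.4))."""
    rng = np.random.default_rng(seed)
    d = np.asarray(D) - beta
    ranks = np.argsort(np.argsort(np.abs(d))) + 1
    t_obs = np.sum(np.sign(d) * ranks)
    # S_i^- = -1 with probability gamma/(1 + gamma), +1 otherwise.
    signs = np.where(rng.random((n_draws, len(d))) < gamma / (1 + gamma), -1, 1)
    t_bound = signs @ ranks
    return np.mean(t_bound <= t_obs)

# Scanning gamma over a grid and locating where the bound first crosses the
# significance level gives the "tipping point" Gamma of Section 8.1.
```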

8.5 Exercise. For the sign statistic ψ(t) ≡ 1, derive an asymptotic p-value based on a
central limit theorem for the bounding variable.

8.3 Sensitivity analysis in semiparametric inference

We will give another example of sensitivity analysis in the semiparametric inference framework described in Chapter 7.

The idea is quite simple. Suppose the treatment A is binary. By the law of total expectation, we have, for a = 0, 1,

E[Y(a)] = E{E[Y(a) | X]}
= E{E[Y(a) | A = a, X] P(A = a | X)} + E{E[Y(a) | A = 1 − a, X] P(A = 1 − a | X)}
= E{E[Y | A = a, X] · πa(X)} + E{E[Y(a) | A = 1 − a, X] · π1−a(X)}.

The last equality used the consistency of counterfactuals.


The only non-identifiable term here is E[Y(a) | A = 1 − a, X]. The no unmeasured confounders assumption, A ⊥⊥ Y(a) | X, renders this term identifiable:

E[Y(a) | A = 1 − a, X] = E[Y(a) | A = a, X] = E[Y | A = a, X].

This motivates us to specify the contrast between the identifiable and non-identifiable counterfactual quantities as a sensitivity parameter:2

δa(x) = E[Y(a) | A = 1, X = x] − E[Y(a) | A = 0, X = x].   (8.5)
8.6 Exercise. Show that the design bias for estimating the average treatment effect is given by

E{E[Y | A = 1, X]} − E{E[Y | A = 0, X]} − E[Y(1) − Y(0)] = E[ (1 − π(X)) δ1(X) + π(X) δ0(X) ].

So when δ0(x) = δ1(x) = δ for all x, the design bias is simply δ.

Given the functions δ0(x) and δ1(x), estimating E[Y(1) − Y(0)] becomes another semiparametric inference problem. For example, we can estimate the design bias by plugging in the estimated propensity score,

(1/n) Σ_{i=1}^{n} [ (1 − π̂(Xi)) δ1(Xi) + π̂(Xi) δ0(Xi) ],

and then estimate the ATE by β̂IPW minus this estimated bias.

8.7 Exercise. Suggest an outcome regression estimator and a doubly robust estimator
in this setting.

Notes

1. Rosenbaum, P. R. (1987). Sensitivity analysis for certain permutation inferences in matched observational studies. Biometrika, 74(1), 13–26. doi:10.1093/biomet/74.1.13.
2. Robins, J. M. (1999). Association, causation, and marginal structural models. Synthese, 121, 151–179. doi:10.1023/a:1005285815569.
Chapter 9

Unmeasured confounders: Leveraging specificity

Recall the decomposition (6.1) of the error of a causal estimator:

Causal estimator − True causal effect = Design bias + Modelling bias + Statistical noise.

In the last three Chapters on observational studies, we assumed that the unmeasured confounders are non-existent or have limited strength. In other words, this amounts to assuming that the design bias is zero or determined by the sensitivity model. Besides trying to measure all the confounders, is it possible to design the observational study cleverly to reduce the design bias?

9.1 Structural specificity

To overcome unmeasured confounders, the key idea is to leverage the specificity of the causal structure.

Specificity is one of Bradford Hill's nine principles1 for causality in epidemiological studies:

One reason, needless to say, is the specificity of the association, the third
characteristic which invariably we must consider. If, as here, the association
is limited to specific workers and to particular sites and types of disease and
there is no association between the work and other modes of dying, then
clearly that is a strong argument in favour of causation.

Hill is the coauthor of a landmark observational study on smoking and lung cancer2 .
Somewhat ironically, smoking has many detrimental health effects and is often used as a
counterexample to the specificity principle. Nonetheless, specificity unifies several causal
inference approaches that do not require the assumption of no unmeasured confounders.
In graphical terminology, specificity refers to the lack of certain causal pathways. One
classical example is the use of instrumental variables, which will be a central topic in
this Chapter. In Exercise 5.35, it is shown that in a linear structural equation model
corresponding to Figure 9.1, the causal effect of A on Y can be identified by the Wald
ratio Cov(Z, Y )/ Cov(Z, A). This relies on two structural specificities in Figure 9.1: Z is
independent of U , and Z has no direct effect on Y .

[Diagram: Z → A → Y, with an unmeasured U pointing into both A and Y.]

Figure 9.1: Instrumental variables.

By itself, structural specificity is not enough for causal identification. Additional assumptions are needed. One typical assumption is linearity, and the role of structural specificity can indeed be intuitively understood in the linear SEM framework (Chapter 3). Assuming the absence of certain edges reduces the dimension of the unknown parameters. If sufficiently many edges are non-existent, the path coefficients become identifiable; see Section 3.6 for some examples.
Besides linearity, other assumptions can be used with specificity to establish causal identification. One example is monotonicity of instrumental variables; see Section 9.4 below. Another example is the difference-in-differences estimator, a popular method to evaluate the effect of a policy. The corresponding causal diagram is Figure 9.2. The structural specificity here is the lack of a causal pathway from A to W; for this reason, W is often called a negative control outcome.

[Diagram with nodes A, Y, W; A → Y, and there is no causal pathway from A to W.]

Figure 9.2: The use of a negative control in the difference-in-differences estimator.

9.1 Exercise. Consider the causal diagram in Figure 9.2. Suppose the negative control
outcome W has the same confounding bias as Y in the following sense:

E[Y (0) | A = 1] − E[Y (0) | A = 0] = E[W (0) | A = 1] − E[W (0) | A = 0].

Show that the so-called parallel trend assumption

E[Y (0) − W | A = 1] = E[Y (0) − W | A = 0]

is satisfied, and use it to show that the average treatment effect on the treated is identified
by the so-called difference-in-differences estimator:

E[Y (1) − Y (0) | A = 1] = E[Y − W | A = 1] − E[Y − W | A = 0].
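A minimal numerical sketch of this estimator (the data arrays are hypothetical):

import numpy as np

def did_estimate(A, Y, W):
    # Plug-in estimate of E[Y - W | A = 1] - E[Y - W | A = 0].
    A = np.asarray(A, dtype=bool)
    Y, W = np.asarray(Y, dtype=float), np.asarray(W, dtype=float)
    return (Y[A] - W[A]).mean() - (Y[~A] - W[~A]).mean()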

9.2 Instrumental variables and two-stage least squares

Among all the study designs that leverage specificity, the method of instrumental variables
(IV) has the longest history and is the most well established.

Figure 9.3: Estimating price elasticity using instrumental variables. The price elasticity of supply (demand) at (P1, Q1) is defined as the slope of the price-supply (price-demand) curve.

IV was invented by economist Philip Wright (father of Sewall Wright) in 1928 to estimate the price elasticities of (the causal effects of price on) demand and supply.3 This is a challenging problem because price is determined simultaneously by demand and supply (Figure 9.3). To estimate the causal effect of price on supply, we cannot use observational data that correspond to different demand and supply curves (e.g., (P1, Q1) and (P2, Q2) in Figure 9.3). Instead, we need to use “exogenous” events that change the demand but not the supply (e.g., (P1, Q1) and (P3, Q3) in Figure 9.3). For example, we can use the COVID-19 outbreak as an instrumental variable for the demand of masks.
The simultaneous determination of price and quantity does not immediately fit in our causal inference framework,4 but the same idea applies to observational studies with unmeasured confounders. In this case, the instrumental variable needs to be independent of the unmeasured confounders and change the outcome only through changing the treatment (Figure 9.1). The “exogenous” variability in the IV can then be used to make unbiased causal inference.
In the rest of this section, we extend Exercise 5.35 to the setting with multiple IVs and
observed confounders. Given iid observations of treatment Ai , outcome Yi , instrumental
variables Zi , observed confounders Xi , and unobserved confounders Ui , i = 1, . . . , n, we
assume the structural equations for A and outcome Y are given by
T T
A = β0A + βZA Z + βXA X + βUT A U + A , (9.1)
T
Y = β0Y + βAY A + βXY X + βUT Y U + Y . (9.2)

See Figure 9.4 for an example with one treatment and two IVs. The other linear structural equations can be derived from Definition 3.8 and are omitted. In fact, we do not need the other equations to be structural (causal) in the derivations below.

[Diagram: Z1 → A, Z2 → A, A → Y; X and U point into both A and Y.]

Figure 9.4: An example showing two IVs, Z1 and Z2, and one treatment A. There is one measured confounder X and one unmeasured confounder U. A bi-directional arrow indicates that the variables can be dependent.

By using Z ⊥⊥ (U, εA, εY), we obtain from (9.1) and (9.2) that

E[A | Z, X] = β̃0A + βZAᵀ Z + β̃XAᵀ X,    (9.3)
E[Y | Z, X] = β̃0Y + βAY E[A | Z, X] + β̃XYᵀ X,    (9.4)

for some β̃0A, β̃XA, β̃0Y, β̃XY.


The key observation is that the treatment effect βAY, which is confounded in a naive regression of Y on A, appears as the coefficient of E[A | Z, X] in (9.4). This motivates the two-stage least squares estimator of βAY:

(i) Estimate E[A | Z, X] by a least squares regression of A on Z and X. Let the fitted
model be Ê[A | Z, X].

(ii) Fit another regression of Y on Ê[A | Z, X] and X by least squares, and let β̂AY be
the coefficient of Ê[A | Z, X].
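The two stages can be written directly with least squares; below is a minimal numpy sketch (the names are ours). Note that naively reusing the second-stage standard errors is not valid, so in practice one would use a dedicated package.

import numpy as np

def two_stage_least_squares(A, Y, Z, X):
    n = len(A)
    ones = np.ones((n, 1))
    Z = np.asarray(Z, dtype=float).reshape(n, -1)
    X = np.asarray(X, dtype=float).reshape(n, -1)
    # Stage (i): fitted values of E[A | Z, X] by least squares.
    D1 = np.hstack([ones, Z, X])
    A_hat = D1 @ np.linalg.lstsq(D1, A, rcond=None)[0]
    # Stage (ii): regress Y on the fitted values and X; the coefficient
    # of A_hat estimates beta_AY.
    D2 = np.hstack([ones, A_hat.reshape(-1, 1), X])
    return np.linalg.lstsq(D2, Y, rcond=None)[0][1]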

9.3 Instrumental variables: Method of moments

To study the asymptotic properties of the two-stage least squares estimator, we consider a more general counterfactual setup. We will skip the observed confounders X and consider the diagram in Figure 9.1 with a possibly multi-dimensional instrumental variable Z.

9.2 Assumption (Core IV assumptions). We make the following assumptions:

(i) Relevance: Z ⊥̸⊥ A;

(ii) Exogeneity: Z ⊥⊥ {A(z), Y(z, a)} for all z, a;

(iii) Exclusion restriction: Y(z, a) = Y(a) for all z, a.

The core IV assumptions in Assumption 9.2 are structural and nonparametric. The three assumptions would follow, respectively, from (i) assuming the distribution is faithful to Figure 9.1; (ii) assuming the variables satisfy a single world causal model according to Figure 9.1; (iii) the recursive substitution of counterfactuals.
9.3 Remark. Different authors state these assumptions slightly differently, but they all
reflect the structural assumptions: Z and A are dependent; there are no unmeasured Z-Y
confounders; there is no direct effect from Z on Y .
As mentioned in Section 9.1, structural assumptions alone are generally not enough
to overcome unmeasured confounders. For the rest of this Section, we further assume the
causal effect of A on Y is a constant β:

Y (a) − Y (ã) = (a − ã)β. (9.5)

This would be satisfied if we assume the linear structural equation model (9.2).
9.4 Remark. When A is binary, this reduces to the constant treatment effect assumption
(2.8) in randomisation inference. The main distinction is that randomisation inference is
concerned with testing a given β, while the derivations below focus on the estimation of
β. But we can also use randomisation inference for instrumental variables5 and method
of moments (Z-estimation) for models like (9.5)6 .
The estimation of β in (9.5) is a semiparametric inference problem. We first express it
in terms of the observed data. Let ã be a reference level of the treatment; for simplicity,
let ã = 0. Like randomisation inference, (9.5) gives the imputation Y (0) = Y − βA.
Let α = E[Y (0)]. The exogeneity assumption Z ⊥ ⊥ Y (z, a) and exclusion restriction
Y (z, a) = Y (a) imply that, for any function g(z),
 
E (Y − α − βA) g(Z) = E[(Y (0) − α) g(Z)] = 0, (9.6)

Let α̂ = Ȳ − βĀ, where Ā = Σ_{i=1}^n Ai/n and Ȳ = Σ_{i=1}^n Yi/n. The method of moments estimator of β is given by solving the empirical version of (9.6) (with α replaced by α̂). After some algebra, we obtain

β̂g = [(1/n) Σ_{i=1}^n (Yi − Ȳ) g(Zi)] / [(1/n) Σ_{i=1}^n (Ai − Ā) g(Zi)].    (9.7)

This is an empirical estimator of Cov(Y, g(Z)) / Cov(A, g(Z)), which is the Wald ratio (5.11) with Z replaced by g(Z). In view of this, g(Z) is a one-dimensional summary statistic of all the instrumental variables.

9.5 Theorem. Under Assumption 9.2, model (9.5), iid sampling, and suitable regularity conditions including Cov(A, g(Z)) > 0, we have

√n (β̂g − β) → N(0, σg²) in distribution as n → ∞,

where
σg² = Var(Y − βA) Var(g(Z)) / Cov²(A, g(Z))    (9.8)

is minimised at g∗(z) = E[A | Z = z] by the Cauchy–Schwarz inequality.
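In code, the estimator (9.7) and a plug-in standard error based on (9.8) take only a few lines; this is a sketch, with g(z) = z as a hypothetical default for a scalar instrument.

import numpy as np

def beta_g(A, Y, Z, g=lambda z: z):
    # Method of moments estimate (9.7) with instrument summary g(Z).
    A, Y = np.asarray(A, float), np.asarray(Y, float)
    gz = np.asarray(g(np.asarray(Z)), float)
    return np.mean((Y - Y.mean()) * gz) / np.mean((A - A.mean()) * gz)

def beta_g_stderr(A, Y, Z, g=lambda z: z):
    # Plug-in standard error from the asymptotic variance (9.8).
    A, Y = np.asarray(A, float), np.asarray(Y, float)
    gz = np.asarray(g(np.asarray(Z)), float)
    b = beta_g(A, Y, Z, g)
    cov_ag = np.mean((A - A.mean()) * (gz - gz.mean()))
    sigma2 = np.var(Y - b * A) * np.var(gz) / cov_ag ** 2
    return float(np.sqrt(sigma2 / len(A)))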

9.6 Exercise. Prove Theorem 9.5 by showing the influence function of β̂g is given by

ψg(Z, A, Y) = [{Y − E(Y)} − β{A − E(A)}] [g(Z) − E{g(Z)}] / Cov(A, g(Z)).

The choice g(Z) = g∗(Z) that minimises the variance of β̂g is often called the optimal instrument. To reduce the variance, we can first estimate g∗(Z) and then plug the estimate into (9.7). Let the resulting estimator be β̂ĝ.
It is common to estimate g ∗ (Z) by a linear regression, which returns the two-stage
least squares estimator in Section 9.2. Other regression models including machine learning
methods can also be used.

9.7 Exercise. Show that β̂ĝ reduces to the two-stage least squares estimator (with no
observed covariates X), if we use a linear regression model for g ∗ (Z) and obtain ĝ(Z) by
least squares.

Suppose g(z) is the probability limit of ĝ(z) as n → ∞ (assuming it exists). If the first stage regression model is not too complex or the sample splitting technique is used (see Section 7.4), the difference between β̂ĝ and β̂g is negligible and β̂ĝ has the same asymptotic distribution as β̂g.
9.8 Remark. There is a remarkable robustness property here: Theorem 9.5 shows that regardless of the choice of g(z), β̂g is √n-consistent as long as Cov(A, g(Z)) > 0. A better model for g∗(Z) provides a more efficient estimator of β. This bears some similarities to the regression adjustment methods in Section 2.5. However, there is no free lunch (as A is not randomised here): this robustness relies on the rather optimistic constant treatment effect assumption (9.5).

9.4 Complier average treatment effect

Can we relax the constant treatment effect assumption (9.5)? The answer is yes, but
alternative assumptions need to be made. In this Section we will show how a monotonicity
assumption allows us to identify the so-called complier average treatment effect.
Before we go into any detail, let us first introduce an example to motivate the definition
of compliance classes.

9.9 Example. A common problem in randomised experiments is that not all experimental
subjects would comply with the assigned treatment. This problem may be described by
the IV diagram Figure 9.1, where

• Z is the initial treatment assignment (randomised);

• A is the actual treatment taken by the patient;

• Y is some clinical outcome.

Because A is not randomised, the effect of A on Y can be confounded.

There are two solutions to this noncompliance problem:

(i) The intention-to-treat analysis that ignores A and estimates the causal effect of Z
on Y (which is not confounded). This has been discussed in Chapter 2.

(ii) Use Z as an instrumental variable to estimate the causal effect of A on Y. This will be discussed next.

We will focus on the case of binary Z and A. As before, A = 1 refers to receiving the
treatment and A = 0 refers to the control. The same terminology is used for the levels of
Z.

9.10 Exercise. Verify that, if Z is binary, the Wald ratio can be written as

Cov(Z, Y) / Cov(Z, A) = (E[Y | Z = 1] − E[Y | Z = 0]) / (E[A | Z = 1] − E[A | Z = 0]).

Motivated by the noncompliance problem, we may define four intuitive compliance classes based on the counterfactual treatments A(0) and A(1) (the counterfactual treatments when the instrument Z is set to 0 and 1):

C = always taker (at),   if A(0) = A(1) = 1,
    never taker (nt),    if A(0) = A(1) = 0,
    complier (co),       if A(0) = 0, A(1) = 1,        (9.9)
    defier (de),         if A(0) = 1, A(1) = 0.

Notice that the definition of C involves cross-world counterfactuals. So the identification result below requires multiple-world counterfactual independence (see Section 5.2):

9.11 Assumption. We make all the core IV assumptions in Assumption 9.2 and additionally assume cross-world counterfactual independence according to Figure 9.1. That is, (ii) in Assumption 9.2 is replaced by (ii’) Z ⊥⊥ {A(z), Y(z, a) : all z, a}.

Assumption 9.2 is a structural assumption. As discussed in Section 9.1, additional assumptions are needed for causal identification. The noncompliance problem motivates the following assumption:

9.12 Assumption (Monotonicity). P(A(1) ≥ A(0)) = 1, or equivalently, P(C =
de) = 0.

For example, Assumption 9.12 is reasonable if the control patients have no access to the
new treatment drug.
The next Theorem gives the identification result under the above assumptions.

9.13 Theorem. Under Assumptions 9.11 and 9.12, the complier average treatment effect is identified by

E[Y(1) − Y(0) | C = co] = (E[Y | Z = 1] − E[Y | Z = 0]) / (E[A | Z = 1] − E[A | Z = 0]).

Proof. Let us expand the term in the numerator using the compliance classes. Using the
law of total expectation,
E[Y | Z = 1] = Σ_{c∈{at,nt,co,de}} E[Y | Z = 1, C = c] P(C = c | Z = 1).

By exclusion restriction (Assumption 9.2(iii)), Y = Y (A(1)) given Z = 1. Thus

E[Y | Z = 1] = E[Y(1) | Z = 1, C = at] P(C = at | Z = 1)
             + E[Y(0) | Z = 1, C = nt] P(C = nt | Z = 1)
             + E[Y(1) | Z = 1, C = co] P(C = co | Z = 1)
             + E[Y(0) | Z = 1, C = de] P(C = de | Z = 1).

By using the exogeneity of Z (Assumption 9.11(ii’)) and the fact that C is a deterministic function of A(0) and A(1), we can drop the conditioning on Z = 1:

E[Y | Z = 1] = E[Y (1) | C = at] P(C = at) + E[Y (0) | C = nt] P(C = nt)
+ E[Y (1) | C = co] P(C = co) + E[Y (0) | C = de] P(C = de).

Similarly, we have

E[Y | Z = 0] = E[Y (1) | C = at] P(C = at) + E[Y (0) | C = nt] P(C = nt)
+ E[Y (0) | C = co] P(C = co) + E[Y (1) | C = de] P(C = de).

Therefore
E[Y | Z = 1] − E[Y | Z = 0]
= E[Y (1) − Y (0) | C = co] P(C = co) − E[Y (1) − Y (0) | C = de] P(C = de).

Finally, by using the monotonicity assumption (Assumption 9.12), we obtain

E[Y | Z = 1] − E[Y | Z = 0] = E[Y (1) − Y (0) | C = co] P(C = co).

Using a similar argument, it can be shown that the denominator of the Wald ratio satisfies

E[A | Z = 1] − E[A | Z = 0] = P(C = co).

By dividing the two equations above, we obtain the identification formula of E[Y (1)−Y (0) |
C = co].
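For binary Z and A, Theorem 9.13 suggests a simple plug-in estimator, sketched below with hypothetical variable names.

import numpy as np

def complier_average_effect(Z, A, Y):
    Z = np.asarray(Z, dtype=bool)
    A, Y = np.asarray(A, dtype=float), np.asarray(Y, dtype=float)
    itt_y = Y[Z].mean() - Y[~Z].mean()  # E[Y | Z = 1] - E[Y | Z = 0]
    itt_a = A[Z].mean() - A[~Z].mean()  # estimates P(C = co)
    return itt_y / itt_a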

9.14 Remark. The complier average treatment effect E[Y (1)−Y (0) | C = co] is an instance
of the local average treatment effect.7 Here, local means the treatment effect is averaged
over a specific subpopulation. What is unusual about the complier average treatment
effect is that the subpopulation is defined in terms of cross-world counterfactuals (so it
can never be observed). This demonstrates the utility of the counterfactual language as
compliance class as a concept does not exist in a purely graphical setup. Pearl, 2009, page
29 discussed three layers of data queries: predictions, interventions, and counterfactuals.
The meaning of “counterfactual” in Pearl’s classification is not immediately clear. It
is helpful to base the classification on whether the query only contains factuals (can
be answered without causal inference), only contains counterfactuals in the same world
(can be answered with randomised intervention), or contains cross-world counterfactuals
(cannot be answered unless some non-verifiable cross-world independence is assumed).8

Notes
1 Hill, A. B. (1965). The environment and disease: Association or causation? Proceedings of the Royal Society of Medicine, 58 (5), 295–300. doi:10.1177/003591576505800503.
2 Doll and Hill, 1950.
3 Stock, J. H., & Trebbi, F. (2003). Retrospectives: Who invented instrumental variable regression? Journal of Economic Perspectives, 17 (3), 177–194. doi:10.1257/089533003769204416.
4 This is studied in simultaneous equations models (the variables are determined simultaneously). See, for example, Davidson, R., & MacKinnon, J. G. (1993). Estimation and inference in econometrics. Oxford University Press, Chapter 18.
5 Imbens, G. W., & Rosenbaum, P. R. (2005). Robust, accurate confidence intervals with a weak instrument: Quarter of birth and education. Journal of the Royal Statistical Society: Series A (Statistics in Society), 168 (1), 109–126. doi:10.1111/j.1467-985x.2004.00339.x.
6 Equation (9.5) belongs to a more general class of models called structural nested models proposed by Robins; for a review, see Vansteelandt, S., & Joffe, M. (2014). Structural nested models and g-estimation: The partially realized promise. Statistical Science, 29 (4), 707–731. doi:10.1214/14-sts493.
7 Imbens, G. W., & Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62 (2), 467. doi:10.2307/2951620.
8 See also Robins, J. M., & Richardson, T. S. (2010). Alternative graphical causal models and the identification of direct effects. In P. Shrout, K. Keyes, & K. Ornstein (Eds.), Causality and psychopathology: Finding the determinants of disorders and their cures (pp. 103–158). Oxford University Press.

Chapter 10

Mediation analysis

In the last Chapter, we saw how specificity (absence of causal mechanisms) could be
useful to overcome unmeasured confounding.
In many other problems, the research question is about the causal mechanism itself.
For example, instead of simply concluding that smoking causes lung cancer, it is more
informative to determine which chemicals in the cigarettes are carcinogenic.
The problem of inferring causal mechanisms is called mediation analysis. It is a
challenging problem and this Chapter will introduce you to some basic ideas.1

10.1 Linear SEM

We start with the simplest setting with three variables A (treatment), Y (outcome), and
M (mediator) and no measured or unmeasured confounder (Figure 10.1).

[Diagram: A → M → Y, with a direct edge A → Y.]

Figure 10.1: The basic mediation analysis problem with three variables and no confounder.

A linear SEM with respect to this causal diagram (Definition 3.8) assumes that the
variables are generated by

M = βAM A + εM,    (10.1)
Y = βAY A + βMY M + εY,    (10.2)

where εM, εY are mutually independent noise variables that are also independent of A.
Wright’s path analysis formula (Theorem 3.14) shows that the total effect of A on Y is βAY + βAM βMY. This can be seen directly from the reduced-form equation that plugs equation (10.1) into (10.2):

Y = (βAY + βAM βMY) A + (βMY εM + εY),

where βAY is the direct component of the coefficient of A and βAM βMY is the indirect component.

The path coefficient βAY is the direct effect of A on Y , and the product βAM βM Y is
the indirect effect of A on Y through the mediator M . They can be estimated by first
estimating the path coefficients using linear regression.
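A sketch of this product-of-coefficients approach (intercepts are included for practicality, although (10.1) and (10.2) omit them):

import numpy as np

def mediation_effects(A, M, Y):
    A, M, Y = (np.asarray(v, dtype=float) for v in (A, M, Y))
    ones = np.ones_like(A)
    # Regression (10.1): M on A gives beta_AM.
    b_AM = np.linalg.lstsq(np.column_stack([ones, A]), M, rcond=None)[0][1]
    # Regression (10.2): Y on A and M gives beta_AY and beta_MY.
    _, b_AY, b_MY = np.linalg.lstsq(np.column_stack([ones, A, M]), Y, rcond=None)[0]
    return {"direct": b_AY, "indirect": b_AM * b_MY,
            "total": b_AY + b_AM * b_MY}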

[Diagram: A → M → Y and A → Y, with covariates X pointing into A, M, and Y.]

Figure 10.2: The basic mediation analysis problem with covariates.

This approach can be easily extended to allow for covariates (Figure 10.2). In this case, the linear SEM is

M = βAM A + βXMᵀ X + εM,
Y = βAY A + βMY M + βXYᵀ X + εY.

The direct and indirect effects are still βAY and βAM βMY, though to estimate them one now needs to include X in the regression models for M and Y.
This regression approach to mediation analysis is intuitive and very popular in practice.2 The obvious drawback is the strong linearity assumption, which is what enables us to express direct and indirect effects using regression coefficients. In more sophisticated scenarios (with nonlinearities and interactions), we can no longer rely on this decomposition.

10.2 Identification of controlled direct effect

The rest of this Chapter will develop a counterfactual approach to mediation analysis.
For simplicity, we will again focus on the case of binary treatment. The counterfactuals
Y (a, m), Y (a), and Y (m) are defined as before via a nonparametric SEM (see Chapter 5).
The controlled direct effect (CDE) of A on Y when M is fixed at m is defined as

CDE(m) = E[Y (1, m) − Y (0, m)]. (10.3)

This quantity is of practical interest if we can intervene on both A and M .


CDE is a single-world quantity and can be identified using the g-formula if all the
confounders are measured (Figure 10.2).

10.1 Theorem. In a single-world causal model for the graph in Figure 10.2, we
have

CDE(m) = E{E[Y | A = 1, M = m, X]} − E{E[Y | A = 0, M = m, X]}.

Proof. By the law of total expectation and the single-world independence assumptions Y(a, m) ⊥⊥ A | X and Y(m) ⊥⊥ M | A, X, we have, for any a and m,

E[Y(a, m)] = E{E[Y(a, m) | X]}
           = E{E[Y(a, m) | A = a, X]}
           = E{E[Y(m) | A = a, X]}
           = E{E[Y(m) | A = a, M = m, X]}
           = E{E[Y | A = a, M = m, X]}.

The third and the last equalities used consistency of the counterfactual.

Estimating CDE(m) is similar to estimating the ATE in Chapter 7 (with the M = m subsample).
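For instance, an outcome-regression (g-formula) estimate of CDE(m) can be sketched as follows, with a linear outcome model as a placeholder.

import numpy as np
from sklearn.linear_model import LinearRegression

def cde_estimate(A, M, Y, X, m):
    n = len(Y)
    X = np.asarray(X, dtype=float).reshape(n, -1)
    # Fit E[Y | A, M, X], then average predictions at (A = 1, M = m, X)
    # and (A = 0, M = m, X) over the empirical distribution of X.
    model = LinearRegression().fit(np.column_stack([A, M, X]), Y)
    m_col = np.full(n, float(m))
    pred1 = model.predict(np.column_stack([np.ones(n), m_col, X]))
    pred0 = model.predict(np.column_stack([np.zeros(n), m_col, X]))
    return float((pred1 - pred0).mean())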
However, the controlled direct effect does not motivate a definition of indirect effects.

10.3 Natural direct and indirect effects

Another approach to mediation analysis is to consider the natural direct effect (NDE)
and natural indirect effect (NIE):3

NDE = E[Y(1, M(0)) − Y(0, M(0))],
NIE = E[Y(1, M(1)) − Y(1, M(0))].

Compared to the CDE, these new quantities allow the mediator M to vary naturally
according to some treatment level. By definition, they provide a decomposition of the
average treatment effect:

ATE = E[Y (1) − Y (0)] = NDE + NIE.

Both the NDE and NIE depend on a cross-world counterfactual, Y (1, M (0)). This
means that some cross-world independence is needed for causal identification.
To focus on this issue, let’s consider the basic mediation analysis problem with no
covariates (Figure 10.1). The single-world independence assumptions of the graph in
Figure 10.1 are
A ⊥⊥ M(a) ⊥⊥ Y(a, m), ∀ a, m.    (10.4)
Consecutive ⊥⊥ means mutual independence.
To identify E[Y (1, M (0))], we need an additional cross-world independence:

Y(1, m) ⊥⊥ M(0), ∀m.    (10.5)

10.2 Exercise. Suppose both A and M are binary. Count the number of pairwise inde-
pendences in the single-world and multiple-world independence assumptions introduced
in Definition 5.7.

10.3 Proposition. Consider a causal model corresponding to the graph in Figure 10.1,
in which the counterfactuals satisfy the multiple-world independence assumptions.
When M is discrete, we have
E[Y(1, M(0))] = Σ_m E[Y | A = 1, M = m] · P(M = m | A = 0).

In consequence,

NDE = Σ_m {E[Y | A = 1, M = m] − E[Y | A = 0, M = m]} · P(M = m | A = 0),    (10.6)
NIE = Σ_m E[Y | A = 1, M = m] · {P(M = m | A = 1) − P(M = m | A = 0)}.    (10.7)

Proof. Using the (conditional) independence assumptions and consistency of counterfactuals,

E[Y(1, M(0))] = Σ_m E[Y(1, m) | M(0) = m] · P(M(0) = m)
              = Σ_m E[Y(1, m)] · P(M(0) = m | A = 0)
              = Σ_m E[Y(1, m) | A = 1] · P(M = m | A = 0)
              = Σ_m E[Y(m) | A = 1, M = m] · P(M = m | A = 0)
              = Σ_m E[Y | A = 1, M = m] · P(M = m | A = 0).

The identification formulas for NDE and NIE can be derived accordingly.
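For binary A and discrete M, the identification formulas (10.6) and (10.7) can be estimated by plugging in empirical means and frequencies; a sketch, assuming every (a, m) cell is non-empty:

import numpy as np

def natural_effects(A, M, Y):
    A, M, Y = (np.asarray(v) for v in (A, M, Y))
    nde = nie = 0.0
    for m in np.unique(M):
        mu1 = Y[(A == 1) & (M == m)].mean()  # E[Y | A = 1, M = m]
        mu0 = Y[(A == 0) & (M == m)].mean()  # E[Y | A = 0, M = m]
        p1 = np.mean(M[A == 1] == m)         # P(M = m | A = 1)
        p0 = np.mean(M[A == 0] == m)         # P(M = m | A = 0)
        nde += (mu1 - mu0) * p0              # term of (10.6)
        nie += mu1 * (p1 - p0)               # term of (10.7)
    return {"NDE": nde, "NIE": nie}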

10.4 Remark. In the proof of Proposition 10.3 only the second equality uses the cross-world independence. Thus without this assumption, we can still interpret the right hand side of (10.6) as the expectation of the controlled direct effect Y(1, M′) − Y(0, M′), where M′ is a randomised interventional analogue in the sense that M′ is an independent random variable with the same distribution as M(0).4

10.4 Observed confounders

To extend the identification results to more complex situations with covariates X, let’s examine the proof of Proposition 10.3. The first equality is just the law of total expectation. The second equality uses the cross-world independence Y(1, m) ⊥⊥ M(0) and the independence M(0) ⊥⊥ A. The third equality uses Y(1, m) ⊥⊥ A and consistency. The fourth equality uses consistency and Y(m) ⊥⊥ M | A. The last equality uses consistency.

To extend this proof, we can assume all the independences we used still hold when conditioning on X:

10.5 Assumption (No unmeasured confounders). We assume there are

(i) No unmeasured treatment-outcome confounders: Y(a, m) ⊥⊥ A | X, ∀a, m;

(ii) No unmeasured mediator-outcome confounders: Y(m) ⊥⊥ M | A, X, ∀m;

(iii) No unmeasured treatment-mediator confounders: M(a) ⊥⊥ A | X, ∀a.

10.6 Assumption. Assumption 10.5 is strengthened to include the cross-world independence Y(a, m) ⊥⊥ M(a′) | X, ∀a ≠ a′, m.

The above assumptions are stated in terms of the counterfactuals. Assumption 10.5
can be checked by d-separation in the corresponding SWIGs. Assumption 10.6 cannot be
checked in SWIGs because the counterfactuals are in different “worlds”.
Importantly, Assumption 10.6 is not implied by Assumption 10.5 and multiple-world
independence, as illustrated in the next example.

[Diagram: A → L → M → Y, together with A → M, A → Y, and L → Y; the A-induced variable L confounds M and Y.]

Figure 10.3: An illustration of treatment-induced mediator-outcome confounding.

10.7 Example. Consider the causal diagram in Figure 10.3, where the mediator M and
the outcome Y are confounded by an observed variable L that is a descendant of A. In
other words, L is another mediator that precedes M . The NPSEM corresponding to
Figure 10.3 is

A = fA (A ),
L = fL (A, L ),
M = fM (A, L, M ),
Y = fY (A, L, M, Y ).

The counterfactuals are defined according to Definition 5.2 and the multiple-world independence assumptions (Definition 5.7) assert that the noise variables εA, εL, εM, εY are mutually independent. In this case, Assumption 10.5 is satisfied but Assumption 10.6 is generally not. To see this, the counterfactuals are, by definition,

M(a′) = fM(a′, L(a′), εM),
Y(a, m) = fY(a, L(a), m, εY).

So Y(a, m) ⊥⊥ M(a′) | L(a), L(a′). However, the variable L = fL(A, εL) does not contain as much information as L(a) = fL(a, εL) and L(a′) = fL(a′, εL) together, unless fL(a, εL) does not depend on a (in which case L ⊥⊥ A and the graph in Figure 10.3 is not faithful).
This example motivates the following structural assumption for X:

10.8 Assumption (No treatment-induced mediator-outcome confounding). X ∩ de(A) = ∅.

10.9 Lemma. Under the multiple-world independence assumptions and Assumption 10.8,
if Y (m) and M are d-separated by {A, X} in G(m), then Assumption 10.6 is satisfied.
Proof. Assumption 10.8 implies that X does not contain a descendant of M or Y. Given X, the randomness of M(a) comes from (the noise variables of) all the ancestors of M(a) which have a directed path to M that is not blocked by {A, X}; denote those ancestors as an(M | A, X). Similarly, let an(Y | A, M, X) be the ancestors of Y that are d-connected with Y given {A, M, X}. So given X, the randomness of Y(a′, m) comes from (the noise variables of) the variables in an(Y | A, M, X).

We claim that an(M | A, X) and an(Y | A, M, X) are d-separated by X. Otherwise, say U ∈ an(M | A, X) and V ∈ an(Y | A, M, X) are d-connected given X; we can then append that d-connected path with the directed paths from U to M and from V to Y to create a d-connected path from M to Y, which contradicts the assumption that Y(m) ⊥⊥ M | A, X [G(m)] (see Figure 10.4 for an illustration).

[Diagram with nodes U, X, V in the top row and A, M, Y in the bottom row.]

Figure 10.4: Illustration for the proof of Lemma 10.9. Adding the edge from U to X or from U to V creates a d-connected path from M to Y given X.

Thus, the noise variables corresponding to an(M | A, X) and an(Y | A, M, X) are independent given X. In consequence, Y(a, m) ⊥⊥ M(a′) | X.
Our proof of Proposition 10.3 and Lemma 10.9 then imply the following identification
result.

10.10 Theorem. Suppose M and X are discrete. Under Assumptions 10.5 and 10.8,

E[Y(1, M(0))] = Σ_{m,x} E[Y | A = 1, M = m, X = x] · P(M = m | A = 0, X = x) · P(X = x).

10.5 Extended graph interpretation

The cross-world independence in Assumption 10.6 is concerning. It is impossible to verify it empirically, because we can never observe Y(a, m) and M(a′) together if a ≠ a′.
To move beyond the impasse, one proposal is to consider an extension to the causal
graph.5 Consider the following example.

10.11 Example. Suppose a new process can completely remove the nicotine from tobacco,
allowing the production of a nicotine-free cigarette to begin next year. The goal is to use
the collected data on smoking status A, hypertensive status M and heart disease status
Y from a randomised smoking cessation trial to estimate the incidence of heart disease
in smokers were all smokers to change to nicotine-free cigarettes. Suppose a scientific
theory tells us that the entire effect of nicotine on heart disease is through changing
the hypertensive status, while the non-nicotine toxins in cigarettes have no effect on
hypertension. Then, under the additional assumption that there are no confounders
(besides A) for the effect of hypertension on heart disease, the causal DAG in Figure 10.1
can be used to describe the assumptions. The heart disease incidence rate among smokers
of the new nicotine-free cigarettes is equal to E[Y (a = 1, M (a = 0))].

[Diagram: A determines the nicotine content N and the non-nicotine content O; N → M → Y and O → Y.]

Figure 10.5: Extended causal diagram for mediation analysis.

The scientific story in this example allows us to extend the graph and include two additional variables N and O to represent the nicotine and non-nicotine content in cigarettes (Figure 10.5). Since the nicotine-free cigarette is not available till next year, we have P(A = N = O) = 1 in our current data.
Using this new graph, the heart disease incidence rate among the future smokers of
nicotine-free cigarettes is given by

E[Y(a = 1, M(a = 0))] = E[Y(N = 0, O = 1)],

which no longer involves cross-world counterfactuals.

Although the event {N = 0, O = 1} has probability 0 in the current data, once the
nicotine-free cigarettes become available, it will become possible to estimate E[Y (N =
0, O = 1)] by randomising this new treatment.
What is really interesting here is that we can still identify the distribution of Y (N =
0, O = 1) with the current data, even though P(N = 0, O = 1) = 0. Using the g-formula
and P(A = N = O) = 1,

P(Y(N = 0, O = 1) = y, M(N = 0) = m)
= P(Y(N = 0, O = 1) = y | M(N = 0) = m) · P(M(N = 0) = m)
= P(Y(O = 1) = y | M = m, N = 0) · P(M = m | N = 0)
= P(Y = y | M = m, N = 0, O = 1) · P(M = m | N = 0)
= P(Y = y | M = m, O = 1) · P(M = m | N = 0)
= P(Y = y | M = m, A = 1) · P(M = m | A = 0).

Summing over m, we arrive at the formula in Proposition 10.3.

Notes
1 A more comprehensive treatment is given in VanderWeele, T. (2015). Explanation in causal inference: Methods for mediation and interaction. Oxford University Press.
2 To give you a sense of how popular the mediation question is in psychology, the paper that made this regression analysis popular is now one of the most cited of all time (close to 100,000 citations on Google Scholar): Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51 (6), 1173–1182. doi:10.1037/0022-3514.51.6.1173.
3 This terminology is due to Pearl, 2000. These quantities were first proposed under the names “pure direct effect” and “total indirect effect” by Robins, J. M., & Greenland, S. (1992). Identifiability and exchangeability for direct and indirect effects. Epidemiology, 3 (2), 143–155. doi:10.1097/00001648-199203000-00013.
4 Didelez, V., Dawid, A. P., & Geneletti, S. (2006). Direct and indirect effects of sequential treatments. In Proceedings of the twenty-second conference on uncertainty in artificial intelligence (pp. 138–146). UAI’06. Cambridge, MA, USA: AUAI Press.
5 Robins and Richardson, 2010; see also Didelez, V. (2018). Defining causal mediation with a longitudinal mediator and a survival outcome. Lifetime Data Analysis, 25 (4), 593–610. doi:10.1007/s10985-018-9449-0.
