Bayesian causal inference: a critical review
Fan Li¹, Peng Ding² and Fabrizia Mealli³
¹Duke University, Durham, NC, USA
²University of California, Berkeley, CA, USA
³University of Florence and EUI, Florence, Italy
royalsocietypublishing.org/journal/rsta
© 2023 The Author(s). Published by the Royal Society. All rights reserved.
A framework for causal inference (i) defines causal effects under general scenarios, (ii) specifies assumptions under which one can identify causation from association and (iii) assesses the sensitivity to the causal assumptions and finds ways to mitigate their potential violations.
A central challenge with observational data is confounding: variables associated with both the treatment and the outcome. For example, patients with worse health conditions may be more likely to obtain a beneficial medical treatment; then directly comparing the outcomes of the treated and control patients, without adjusting for the difference in their baseline health conditions, would bias the causal comparisons and mistakenly conclude that the treatment is harmful. Randomized experiments, known as A/B testing in industry or randomized controlled trials in medicine, are the gold standard for causal inference by eliminating confounding via randomization. But modern
causal inference has increasingly relied on observational data. The potential outcomes framework
provides the basis for identifying and estimating causal effects—quantities defined based on
counterfactuals—from the factual data in the presence of confounding, using randomized or
observational data. This framework is applicable to a wide range of problems in many disciplines
and has been increasingly adopted in the area of machine learning. Other frameworks for causal
inference, including the causal diagram [4] and invariant prediction [5], are beyond the scope of
this review.
There are three primary inferential approaches within the potential outcomes framework [6]:
Fisherian randomization test, Neymanian repeated-sampling evaluation and Bayesian inference.
The first two approaches belong to the Frequentist paradigm and have been dominant, with many
popular tools such as propensity scores, matching and weighting. The Bayesian approach has
several established advantages for general statistical analysis, including automatic uncertainty
quantification, coherently incorporating prior knowledge, and offering a rich collection of
advanced models for complex data. As causal studies increasingly involve real-world big data,
there has been a recent surge of research in Bayesian inference of causal effects [7–11], but there
is no comprehensive appraisal of the current state of the research. This paper aims to fill this gap.
Due to the space limit, we do not intend to provide a catalogue of the existing research on this
topic, but rather discuss the big picture of why and how to conduct Bayesian causal inference
in general settings. We emphasize the unique questions, challenges and opportunities that the
Bayesian approach brings to causal inference. We hope this review can stimulate broader and
deeper cross-fertilization between causal inference and Bayesian analysis.
Section 2 introduces the preliminaries of the potential outcomes framework, and briefly
discusses several Frequentist approaches to causal inference. Section 3 outlines the general structure
of Bayesian causal inference, focusing on ignorable treatment assignments at one time point.
Section 4 discusses model specification and implications in high-dimensional settings. Section
5 reviews the role and various uses of the propensity score in Bayesian causal inference. Section 6
outlines sensitivity analysis in observational studies. Section 7 describes two complex assignment
mechanisms: instrumental variable and time-varying treatments. Section 8 concludes.
sample average treatment effect (SATE): $\tau^{S} \equiv N^{-1}\sum_{i=1}^{N}\tau_i$. Furthermore, the conditional average treatment effect (CATE) is the average of the individual treatment effects of all units with covariate value $x$: $\tau(x) \equiv E\{Y_i(1) - Y_i(0) \mid X_i = x\}$.
The PATE is a function of the distribution of the potential outcomes in a population, whereas
the SATE is a function of the potential outcomes themselves. The subtle distinction in their
definitions leads to important differences in inferential and computational strategies, as will
be discussed later. Traditionally, the SATE is of interest in randomized experiments where the
target population is the specific sample, whereas the PATE is of interest in observational studies
where the target population is the population from which the sample is drawn. In general, the
choice of a causal estimand is determined by the scientific question at hand rather than by statistical considerations. Note that although both the ITE and the CATE characterize treatment effect heterogeneity, they are distinct estimands; nevertheless, the two are sometimes conflated in the literature.
The fundamental problem of causal inference [14] is that, for each unit only the potential
outcome corresponding to the actual treatment, Yiobs ≡ Yi = Yi (Zi ), is observed or factual, and
the other potential outcome, Yimis = Yi (1 − Zi ), is missing or counterfactual. Therefore, additional
assumptions are necessary to identify the causal effects. The key identifying assumptions concern
the assignment mechanism, i.e. the process that determines which units get what treatment and
hence which potential outcomes are observed or missing [15]. The vast majority of causal studies
assume certain versions of an ignorable assignment mechanism, where the treatment assignment is
independent of the potential outcomes conditional on some observed variables. Specifically, in
the simple setting of a binary treatment at one time, ignorability consists of two sub-assumptions
[15,16].
Assumption 2.1 (Ignorability). (a) Unconfoundedness: $\Pr\{Z_i \mid Y_i(0), Y_i(1), X_i\} = \Pr(Z_i \mid X_i)$, or equivalently $Z_i \perp\!\!\!\perp \{Y_i(0), Y_i(1)\} \mid X_i$. (b) Overlap: $0 < e(X_i) < 1$ for all $i$, where $e(x) \equiv \Pr(Z_i = 1 \mid X_i = x)$ is the propensity score [16].
The unconfoundedness assumption states that there is no unmeasured confounding, and the
overlap assumption states that each unit has non-zero probability of being assigned to each
treatment condition. These two assumptions together ensure that the conditional distribution of
the potential outcomes is identifiable from observed data as $\Pr\{Y_i(z) \mid X_i = x\} = \Pr(Y_i \mid Z_i = z, X_i = x)$ for $z = 0, 1$.
By combining the propensity score model and the outcome model, the estimator $\tau^{\mathrm{dr}}$ is doubly robust in the sense that it is consistent if either the propensity score model or the outcome model, but not necessarily both, is correctly specified [25]. Despite the seemingly different construction, matching estimators, with proper mathematical formulations, can be viewed as non-parametric versions of $\tau^{\mathrm{ipw}}$, $\tau^{\mathrm{reg}}$ and $\tau^{\mathrm{dr}}$ based on nearest-neighbour regressions [26]. These are the main Frequentist estimation strategies for $\tau^{P}$ under ignorability.
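For reference, commonly used sample versions of these estimators take the following standard forms (see, e.g., [23–25]), writing $\hat e(\cdot)$ for a fitted propensity score model and $\hat\mu_z(\cdot)$ for a fitted outcome regression in arm $z$:
$$\hat\tau^{\mathrm{ipw}} = \frac{1}{N}\sum_{i=1}^{N}\left\{\frac{Z_iY_i}{\hat e(X_i)} - \frac{(1-Z_i)Y_i}{1-\hat e(X_i)}\right\},\qquad \hat\tau^{\mathrm{reg}} = \frac{1}{N}\sum_{i=1}^{N}\{\hat\mu_1(X_i) - \hat\mu_0(X_i)\},$$
$$\hat\tau^{\mathrm{dr}} = \frac{1}{N}\sum_{i=1}^{N}\left[\hat\mu_1(X_i) - \hat\mu_0(X_i) + \frac{Z_i\{Y_i - \hat\mu_1(X_i)\}}{\hat e(X_i)} - \frac{(1-Z_i)\{Y_i - \hat\mu_0(X_i)\}}{1-\hat e(X_i)}\right].$$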
When the target estimand is the CATE, the primary estimation strategy is outcome modelling.
We will discuss how to specify the outcome model in §4(a).
Four quantities are associated with each unit i, {Yi (0), Yi (1), Zi , Xi }, where {Zi , Xi , Yi (Zi )} are
observed but Yi (1 − Zi ) is missing. Bayesian inference views all these quantities as random
variables and centres around specifying a model for them. Based on the Bayesian model, we can
draw inference on causal estimands—functions of the model parameters, covariates and potential
outcomes—from the posterior predictive distributions of the parameters and the unobserved
potential outcomes. Specifically, we assume the joint distribution of these random variables of
all units is governed by a parameter θ = (θX , θZ , θY ), conditional on which the random variables
for each unit are i.i.d. Then we can factorize the joint density $\Pr\{Y_i(0), Y_i(1), Z_i, X_i \mid \theta\}$ for each unit $i$ as
$$\Pr\{Z_i \mid Y_i(0), Y_i(1), X_i; \theta_Z\} \cdot \Pr\{Y_i(0), Y_i(1) \mid X_i; \theta_Y\} \cdot \Pr(X_i; \theta_X). \tag{3.1}$$
The three terms in (3.1) represent the model for the assignment mechanism, potential outcomes,
and covariates, respectively. Under ignorability, the assignment mechanism further reduces to the
propensity score model Pr(Zi | Xi ; θZ ).
Before diving into the technical details, we first clarify the subtle but important difference
between the Bayesian estimation of the PATE and SATE estimands. For the PATE, we rewrite the
outcome-model-based identification formula in §2 as $\tau^{P} = \int\{\mu_1(x; \theta_Y) - \mu_0(x; \theta_Y)\}\,F(\mathrm{d}x; \theta_X)$, which
depends only on the unknown parameters θX and θY . Therefore, Bayesian inference for the PATE
requires obtaining posterior distributions of (θX , θY ). By contrast, the SATE τ S is a function of
the potential outcomes {Yi (0), Yi (1)}Ni=1 , which involves both observed and missing quantities.
Bayesian inference for the SATE requires imputing the missing potential outcomes Yimis from
their posterior predictive distributions based on the outcome model, and consequently deriving
the posterior distribution of τ S .
However, in practice, we rarely model the possibly multi-dimensional covariates Xi , and
instead condition on the observed values of the covariates. This is equivalent to replacing F(x; θX )
with FX , the empirical distribution of the covariates. Therefore, most Bayesian causal inference
(e.g. [9,28]) in fact focuses on the mixed average treatment effect (MATE) [6]
$$\tau^{M} \equiv \int \tau(x; \theta_Y)\, F_X(\mathrm{d}x) = N^{-1}\sum_{i=1}^{N} \tau(X_i; \theta_Y), \tag{3.2}$$
where $\tau(x; \theta_Y) = \tau(x)$ highlights the dependence on the parameter $\theta_Y$. The MATE is a convenient
approximation of the PATE and is particularly natural under the Bayesian paradigm. The
difference between the MATE and SATE is subtle: the former equals the average of the CATE
whereas the latter equals the average of the ITEs over the finite sample. Based on the posterior
distributions, the PATE has the largest uncertainty, whereas the SATE has the smallest uncertainty.
The distinction between these estimands is illustrated in the following example.
Under the linear outcome model of the example, the SATE and MATE equal
$$\tau^{S} = N^{-1}\sum_{i=1}^{N}\{Y_i(1) - Y_i(0)\} \qquad\text{and}\qquad \tau^{M} = (\beta_1 - \beta_0)\,\bar{X}, \tag{3.3}$$
respectively, where $\bar{X} = N^{-1}\sum_{i=1}^{N} X_i$ is the sample mean of the covariates.
Regardless of the version of the target estimand, the following assumption is commonly adopted.
Assumption 3.2. (Prior independence). The parameters for the models of assignment mechanism θZ ,
outcome θY , and covariates θX are a priori distinct and independent.
Combining Assumption 2.1 with Assumption 3.2, the posterior distribution of $\theta$ given the observed data factorizes as
$$\Pr(\theta \mid X, Z, Y^{\mathrm{obs}}) \propto \Big\{\Pr(\theta_X)\prod_{i=1}^{N}\Pr(X_i; \theta_X)\Big\}\Big\{\Pr(\theta_Z)\prod_{i=1}^{N}\Pr(Z_i \mid X_i; \theta_Z)\Big\}\Big\{\Pr(\theta_Y)\prod_{i=1}^{N}\Pr(Y_i^{\mathrm{obs}} \mid Z_i, X_i; \theta_Y)\Big\}. \tag{3.4}$$
From (3.4), the posterior distributions of $\theta_X$ and $\theta_Y$, and consequently of $\tau^{P}$, do not depend on the
second component corresponding to the propensity score. Therefore, the propensity score model
is ignorable in Bayesian inference of τ P . The same ignorability argument applies to other estimands
such as τ S , τ M and τ (x) [6,9,15]. Furthermore, inference of τ M does not depend on the covariate
model Pr(Xi ; θX ). Because of this, it is essential to specify the outcome model Pr{Yi (1), Yi (0) | Xi ; θY }
in Bayesian causal inference.
By definition, τ P = E{Yi (1)} − E{Yi (0)} does not depend on the association between Yi (0) and
Yi (1), denoted by the parameter ρ. Similarly, τ (x) does not depend on ρ, but τ S does. So in
the inference of τ P and τ (x), we usually directly specify the marginal models Pr{Yi (z) | Xi ; θY }
or equivalently Pr(Yi | Zi = z, Xi ; θY ) under ignorability [28]. The observed-data likelihood based
on (3.4) becomes $\prod_{i: Z_i=1}\Pr(Y_i \mid Z_i=1, X_i; \theta_Y)\,\prod_{i: Z_i=0}\Pr(Y_i \mid Z_i=0, X_i; \theta_Y)$. Imposing a prior for $\theta_Y$,
we can proceed to infer θY and subsequently τ P , τ M , or τ (x) using the usual Bayesian inferential
procedures.
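To make this workflow concrete, the following minimal Python sketch (an illustration under an assumed Gaussian linear outcome model with known noise variance, not code from the paper) draws regression coefficients from their conjugate posteriors in each arm and converts the draws into posterior samples of $\tau^{M}$; all variable names and simulated values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- simulated data (illustrative only) ---
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # intercept + one covariate
Z = rng.binomial(1, 0.5, size=N)
beta_true = {0: np.array([1.0, 2.0]), 1: np.array([3.0, 2.0])}
Y = np.where(Z == 1, X @ beta_true[1], X @ beta_true[0]) + rng.normal(scale=1.0, size=N)

sigma2, tau2 = 1.0, 100.0   # assumed noise variance and prior variance (for simplicity)

def posterior_draws(Xz, Yz, ndraw=2000):
    """Gaussian posterior of regression coefficients under a N(0, tau2*I) prior."""
    prec = Xz.T @ Xz / sigma2 + np.eye(Xz.shape[1]) / tau2
    cov = np.linalg.inv(prec)
    mean = cov @ (Xz.T @ Yz) / sigma2
    return rng.multivariate_normal(mean, cov, size=ndraw)

b1 = posterior_draws(X[Z == 1], Y[Z == 1])
b0 = posterior_draws(X[Z == 0], Y[Z == 0])

# MATE draws: average the CATE tau(X_i) = X_i'(beta1 - beta0) over the observed covariates
tau_M = (X @ (b1 - b0).T).mean(axis=0)
print(tau_M.mean(), np.quantile(tau_M, [0.025, 0.975]))
```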
Bayesian inference of τ S is more complex, because it depends on both Yi (0) and Yi (1) and
thus requires posterior sampling of both θY and Ymis . The most common sampling strategy is
through data augmentation: iteratively simulate θ and Ymis given each other and the observed
data, namely from $\Pr(\theta_Y \mid Y^{\mathrm{mis}}, Y^{\mathrm{obs}}, Z, X)$ and $\Pr(Y^{\mathrm{mis}} \mid Y^{\mathrm{obs}}, Z, X; \theta_Y)$. The former, given the observed data and the imputed $Y^{\mathrm{mis}}$, can be straightforwardly obtained by a complete-data analysis based on $\Pr\{\theta_Y \mid Y(1), Y(0), X\} \propto \Pr(\theta_Y)\prod_{i=1}^{N}\Pr\{Y_i(1), Y_i(0) \mid X_i; \theta_Y\}$. The latter requires more elaboration. Specifically, we can show that $\Pr(Y^{\mathrm{mis}} \mid Y^{\mathrm{obs}}, Z, X; \theta_Y)$ is proportional to $\prod_{i: Z_i=1}\Pr\{Y_i(0) \mid Y_i(1), X_i; \theta_Y\}\,\prod_{i: Z_i=0}\Pr\{Y_i(1) \mid Y_i(0), X_i; \theta_Y\}$. Hence, imputing $Y^{\mathrm{mis}}$
depends crucially on the joint distribution of {Yi (1), Yi (0)}. Because Yi (0) and Yi (1) are never jointly
observed, the data provide no information about their association ρ. Unless the specific marginal
model places constraints on ρ, the posterior distribution of ρ would be the same as its prior.
Consequently, the posterior distribution of τ S would be sensitive to the prior of ρ.
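The sensitivity to $\rho$ can be seen in a minimal data-augmentation sketch. The Python code below assumes a bivariate normal model for $\{Y_i(0), Y_i(1)\}$ with unit variances, no covariates and flat priors on the two means, and treats $\rho$ as fixed at a prior value; comparing $\rho = 0$ with $\rho = 0.9$ shows how the posterior spread of $\tau^{S}$ depends on an association about which the data are silent. All modelling choices and simulated values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# --- illustrative data: constant true effect 2, unit outcome variances ---
N = 400
Z = rng.binomial(1, 0.5, size=N)
Y0_true = rng.normal(1.0, 1.0, size=N)
Y_obs = np.where(Z == 1, Y0_true + 2.0, Y0_true)

def sate_posterior(rho, ndraw=3000, burn=500):
    """Data augmentation for the SATE under a bivariate normal model for
    (Y(0), Y(1)) with unit variances, flat priors on the two means and the
    cross-arm correlation fixed at rho (a deliberately minimal sketch)."""
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    mu = np.array([Y_obs[Z == 0].mean(), Y_obs[Z == 1].mean()])   # initial values
    draws = []
    for _ in range(ndraw):
        # 1. impute the missing potential outcome given mu and rho
        cond_mean = mu[1 - Z] + rho * (Y_obs - mu[Z])
        Y_mis = cond_mean + np.sqrt(1.0 - rho**2) * rng.normal(size=N)
        Y0 = np.where(Z == 0, Y_obs, Y_mis)
        Y1 = np.where(Z == 1, Y_obs, Y_mis)
        # 2. update the mean vector from its complete-data Gaussian posterior
        mu = rng.multivariate_normal([Y0.mean(), Y1.mean()], Sigma / N)
        draws.append((Y1 - Y0).mean())        # current draw of the SATE
    return np.array(draws[burn:])

for rho in (0.0, 0.9):
    d = sate_posterior(rho)
    print(rho, round(d.mean(), 2), np.quantile(d, [0.025, 0.975]).round(2))
```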
The above discussion prompts us to clarify the notion of identifiability in Bayesian inference.
Under the Frequentist paradigm, a parameter is identifiable if any of its two distinct values
give two different distributions of the observed data. Under the Bayesian paradigm, there is
no consensus. For example, Lindley [29] argued that all parameters are identifiable in Bayesian
analysis because, with proper prior distributions, posterior distributions are always proper.
Figure 1. Example 4.1: estimates of the means of the potential outcomes (a) and the CATE (b), with corresponding uncertainty bands, as a function of the single covariate under a linear model, a Gaussian process and BART, respectively. Cross symbols: treated units; circles: control units. (Online version in colour.)
Example 4.1. [Choice of priors in estimating the CATE] Consider a study with 250 treated and
250 control units. Each unit has a single covariate X that follows a Gamma distribution with
mean 60 and 35 in the control (Zi = 0) and treatment (Zi = 1) group, respectively, and with s.d. 8
in both groups. To convey the main message, we consider a true outcome model with constant
treatment effects: $Y_i(z) = 10 + 5z - 0.3X_i + \epsilon_i$ with $\epsilon_i \sim N(0, 1)$, where the CATE $\tau(x) = 5$ for all $x$.
The scatterplots in the upper panel of figure 1 show that covariate overlap is good between the
groups in the middle of the range of X (around 40 to 50), but deteriorates towards the tails of X.
To estimate the CATE, we fit an outcome model separately in each group: $\mu(z, x) = f_z(x) + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma^2)$. We choose three priors for $f_z(x)$: (i) a linear model $f_z(x) = \alpha_z + \beta_z x$ with a Gaussian prior for the coefficients; (ii) a BART prior similar to [9]; (iii) a Gaussian process prior [53] with the covariance function specified by a Gaussian kernel with signal-to-noise ratio parameter $\rho$ and inverse-bandwidth parameter $\lambda$: $(f_z(x_1), f_z(x_2), \ldots, f_z(x_N)) \sim N(0, \Sigma)$, where $\Sigma_{ij} = \delta^2\rho^2\exp\{-\lambda^2\|x_i - x_j\|^2\}$. Figure 1 shows the posterior means of $\mu_z(X)$ and the CATE,
with the corresponding uncertainty bands as a function of $X$. Here, we focus on the uncertainty quantification. In the region of good overlap, all three models lead to similar point and credible-interval estimates of the CATE, but a marked difference emerges in the region of poor overlap. The linear model appears overconfident in estimating the CATE. The Gaussian process trades potential bias for wider credible intervals as overlap decreases and produces more adaptive uncertainty quantification. BART produces shorter error bars than the Gaussian process (though wider than the linear model), but the width of its credible intervals remains similar regardless of the degree of overlap, and it is thus overconfident in the presence of poor overlap.
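To make the Gaussian-process arm of this comparison concrete, the sketch below (not the authors' implementation; it uses scikit-learn's GaussianProcessRegressor with an RBF-plus-noise kernel rather than the exact parametrization above) fits a GP separately in each group and reports how the posterior standard deviation of the CATE widens outside the region of overlap. The simulated data only roughly mimic the example.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(3)

# simulate data roughly in the spirit of Example 4.1
n = 250
x0 = rng.gamma(shape=(60 / 8) ** 2, scale=8 ** 2 / 60, size=n)   # control: mean 60, s.d. 8
x1 = rng.gamma(shape=(35 / 8) ** 2, scale=8 ** 2 / 35, size=n)   # treated: mean 35, s.d. 8
y0 = 10 - 0.3 * x0 + rng.normal(scale=1.0, size=n)
y1 = 10 + 5 - 0.3 * x1 + rng.normal(scale=1.0, size=n)

kernel = 1.0 * RBF(length_scale=10.0) + WhiteKernel(noise_level=1.0)
gp0 = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(x0[:, None], y0)
gp1 = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(x1[:, None], y1)

grid = np.linspace(20, 90, 8)[:, None]
m0, s0 = gp0.predict(grid, return_std=True)
m1, s1 = gp1.predict(grid, return_std=True)

# CATE estimate and (approximate, independent-across-arms) posterior s.d.
cate, cate_sd = m1 - m0, np.sqrt(s0 ** 2 + s1 ** 2)
for g, c, s in zip(grid.ravel(), cate, cate_sd):
    print(f"x={g:5.1f}  CATE approx {c:5.2f}  sd {s:4.2f}")   # sd widens where overlap is poor
```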
Example 4.1 illustrates that, even with low-dimensional covariates, standard Bayesian priors can have markedly different operating characteristics when the two groups overlap poorly, and not all priors adaptively capture the uncertainty according to the degree of overlap.

When the covariates are high-dimensional, non-parametric estimates often have slow convergence rates, which translates into
poor finite-sample performance. A central challenge in causal inference is that covariate overlap diminishes rapidly as the covariate dimension increases, violating the overlap assumption that underpins standard causal analysis [61]. Lack of overlap exacerbates the usual inferential
challenges—such as sparsity and slow convergence—in high-dimensional analysis. Even if we
assume linear outcome models, we must carefully specify the priors on the regression coefficients.
For example, Hahn et al. [7] showed that standard Bayesian regularization on the nuisance
parameters may indirectly regularize important causal parameters and thus induce bias, namely
the regularization induced confounding. This issue was rigorously investigated in [62]. Specifically,
Linero [62] defined the selection bias as $\delta_z = E(Y_i \mid Z_i = z) - E\{Y_i(z)\}$ and showed that, under the
seemingly innocuous prior independence assumption 3.2, many Bayesian regularization priors
would a priori induce the selection bias δz to sharply concentrate around zero as the number
of covariates, p, increases, to the extent that no amount of data would overcome such a bias.
This implies that assumption 3.2 effectively acts as a strongly informative prior as p increases.
Such a phenomenon is referred to as prior dogmaticism and is the Bayesian analogue of the
aforementioned problem in Ritov et al. [63]. This line of research highlighted the importance
of incorporating the propensity score into Bayesian causal inference [7,62,64,65], which echoes the insights from the Frequentist double machine learning method [66,67]. Specifically, a regularized propensity score model or outcome model alone would not suffice for valid causal inference, but combining the two achieves desirable convergence rates and finite-sample performance in high-dimensional causal analysis.
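The following sketch illustrates this point schematically (it is not taken from [7,62,66,67]): lasso-regularized propensity-score and outcome models are combined through the doubly robust estimator $\hat\tau^{\mathrm{dr}}$ of §2. The simulated design, true effect and tuning choices are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV

rng = np.random.default_rng(4)

# sparse high-dimensional confounding: only the first 5 covariates matter
n, p = 1000, 200
X = rng.normal(size=(n, p))
ps = 1 / (1 + np.exp(-(X[:, :5].sum(axis=1) / 2)))
Z = rng.binomial(1, ps)
Y = 2.0 * Z + X[:, :5] @ np.ones(5) + rng.normal(size=n)

# regularized nuisance estimates
e_hat = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=10).fit(X, Z).predict_proba(X)[:, 1]
mu1 = LassoCV(cv=5).fit(X[Z == 1], Y[Z == 1]).predict(X)
mu0 = LassoCV(cv=5).fit(X[Z == 0], Y[Z == 0]).predict(X)

e_hat = np.clip(e_hat, 0.01, 0.99)           # guard against extreme weights
tau_dr = np.mean(mu1 - mu0
                 + Z * (Y - mu1) / e_hat
                 - (1 - Z) * (Y - mu0) / (1 - e_hat))
tau_reg = np.mean(mu1 - mu0)                 # outcome model alone, for comparison
print(f"outcome-model-only estimate: {tau_reg:.2f},  doubly robust estimate: {tau_dr:.2f}")
```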
The propensity score is a balancing score: the treatment assignment is independent of the covariates within each stratum of the propensity score. Various reparametrizations have been
proposed. One example is to specify μ(z, x, e(x)) = g1 (x, e(x)) + g2 (z, x), with g1 (·) being a non-
parametric model and g2 (·) being a parametric model. Little [70] adopted a penalized spline
model of $e(x)$ for $g_1(\cdot)$. In the aforementioned Bayesian Causal Forest, Hahn et al. [8] imposed separate BART models for $g_1(\cdot)$ and $g_2(\cdot)$, and demonstrated that adding the propensity score as an
additional predictor in g1 significantly improves the empirical estimation of the CATE.
This strategy is usually implemented in two stages: first estimate the propensity score as $\hat e(X)$ and then plug it into the Bayesian outcome model $\mu(Z, X, \hat e(X))$. Such a two-stage procedure is not dogmatically Bayesian, where being dogmatically Bayesian generically refers to specifying a model with parameters and prior distributions for those parameters and then using Bayes' theorem to obtain the posterior distributions of the parameters. A direct consequence is that this procedure
may not properly propagate the uncertainty of estimating the propensity score in the outcome
model [69]. A dogmatic Bayesian approach would jointly model e(X; θZ ) and μ(Z, X, e(X); θY ) and
draw posterior inference of θZ and θY simultaneously [71]. However, when the outcome model
is misspecified, the joint-modelling approach would introduce a feedback problem, that is, the
fit of the outcome model would inform the estimation of the propensity scores. This violates
the unconfoundedness assumption, distorts the balancing property of the propensity score, and
consequently leads to biased estimates of causal effects. A suggested remedy is to first fit a Bayesian
model for e and then plug the posterior predictive draws of e into the outcome model [11]. Such a two-stage procedure is still not dogmatically Bayesian, but empirically it provides posterior inference that is more robust to model misspecification.
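A minimal Python sketch of such a two-stage procedure (an illustration, not the exact method of [11]): propensity-score coefficients are drawn from a normal approximation to their posterior, and each draw of the propensity score is plugged into a simple outcome regression whose treatment coefficient is then sampled from its approximate posterior. Every modelling choice here is a placeholder.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)

# illustrative confounded data, true effect 1.5
n = 800
X = rng.normal(size=(n, 3))
ps_true = 1 / (1 + np.exp(-X @ np.array([0.8, -0.5, 0.3])))
Z = rng.binomial(1, ps_true)
Y = 1.5 * Z + X @ np.array([1.0, 1.0, -1.0]) + rng.normal(size=n)

# stage 1: propensity score model with a normal approximation to its posterior
Xd = sm.add_constant(X)
ps_fit = sm.Logit(Z, Xd).fit(disp=0)
coef_draws = rng.multivariate_normal(ps_fit.params, ps_fit.cov_params(), size=200)

tau_draws = []
for gamma in coef_draws:
    e_draw = 1 / (1 + np.exp(-Xd @ gamma))
    # stage 2: outcome regression including the drawn propensity score as a covariate;
    # an OLS fit with a normal draw of the treatment coefficient stands in for a full
    # Bayesian outcome model (flat-prior approximation)
    W = np.column_stack([np.ones(n), Z, X, e_draw])
    out_fit = sm.OLS(Y, W).fit()
    tau_draws.append(rng.normal(out_fit.params[1], out_fit.bse[1]))

tau_draws = np.array(tau_draws)
print(tau_draws.mean().round(2), np.quantile(tau_draws, [0.025, 0.975]).round(2))
```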
However, adding the propensity score into the outcome model is controversial conceptually,
because the outcome model reflects the nature of the generating process of the potential outcomes,
which arguably should not depend on how the treatment is assigned [72].
conditional variances, rather than the conditional means of the potential outcomes.
Carefully designed dependent priors often achieve desirable finite-sample results and are more reasonable in real-world studies. However, the specification of such priors is case-dependent, and there is no general solution.
A third strategy uses the propensity score in the Bayesian posterior predictive p-value. For the model with Fisher's sharp null hypothesis of no causal effect for any unit whatsoever (i.e. $Y_i(1) = Y_i(0)$ for all $i$), the
procedure in [76] is equivalent to the Fisher randomization test averaged over the posterior
predictive distribution of the propensity score. Simulations in [76] show the advantages of the
Bayesian p-value compared to the Frequentist analogue. This perspective offers a straightforward
and flexible strategy to integrate Bayesian modelling and common Frequentist procedures for
causal inference and enables proper uncertainty quantification.
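A schematic sketch of this idea (an illustration only, not the exact procedure of [76]): under the sharp null the outcomes are fixed, so one can redraw the treatment assignment from the posterior predictive distribution of the propensity score model and compare the observed test statistic with its induced randomization distribution. All models and data below are placeholders.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)

# illustrative observational data with no treatment effect (sharp null holds)
n = 600
X = rng.normal(size=(n, 2))
Z = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

def stat(y, z):
    return y[z == 1].mean() - y[z == 0].mean()       # difference in means

# posterior (normal approximation) for the propensity score model
Xd = sm.add_constant(X)
fit = sm.Logit(Z, Xd).fit(disp=0)
gamma_draws = rng.multivariate_normal(fit.params, fit.cov_params(), size=1000)

t_obs, t_null = stat(Y, Z), []
for gamma in gamma_draws:
    e = 1 / (1 + np.exp(-Xd @ gamma))
    Z_rep = rng.binomial(1, e)                        # re-randomize from the posited mechanism
    t_null.append(stat(Y, Z_rep))                     # Y is unchanged under the sharp null

pval = np.mean(np.abs(np.array(t_null)) >= abs(t_obs))
print(f"posterior predictive p-value: {pval:.2f}")
```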
Besides the above three strategies, another general approach is through the aforementioned
Bayesian bootstrap, which can be used to simulate the posterior distribution of any parameter that
can be formulated as M-estimation or estimating equations [41,77]. As special cases, because the IPW estimator $\tau^{\mathrm{ipw}}$ and the doubly robust estimator $\tau^{\mathrm{dr}}$, both of which involve the propensity score, are solutions to estimating equations, they can be naturally combined with the Bayesian bootstrap to devise Bayesian versions. However, such an approach may be guilty of 'Bayesian for the sake of being Bayesian', and its methodological and practical value compared with competing methods is unclear.
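For concreteness, a minimal sketch of the Bayesian bootstrap applied to the IPW estimating equation (an illustration under placeholder models, not a recommendation): draw Dirichlet weights over the units and re-solve the weighted propensity-score fit and the weighted (Hájek-type) IPW estimate for each draw.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)

# illustrative confounded data with true effect 1.0
n = 500
X = rng.normal(size=(n, 2))
Z = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = 1.0 * Z + X[:, 0] + rng.normal(size=n)

def ipw(weights):
    """Weighted propensity-score fit and weighted IPW (Hajek) estimate for one
    set of Bayesian-bootstrap (Dirichlet) weights."""
    e = (LogisticRegression(C=1e6)                   # essentially unpenalized
         .fit(X, Z, sample_weight=weights)
         .predict_proba(X)[:, 1])
    e = np.clip(e, 0.01, 0.99)
    w1, w0 = weights * Z / e, weights * (1 - Z) / (1 - e)
    return np.sum(w1 * Y) / np.sum(w1) - np.sum(w0 * Y) / np.sum(w0)

draws = np.array([ipw(rng.dirichlet(np.ones(n)) * n) for _ in range(500)])
print(draws.mean().round(2), np.quantile(draws, [0.025, 0.975]).round(2))
```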
Under the factorization (6.1), sensitivity analysis requires us to specify the models for
Pr{Y(1), Y(0) | X, U}, Pr(Z | X, U) and Pr(U | X). In the special case of a binary Z, a binary Y
and a discrete X (which can be thought of as a stratified propensity score), Rosenbaum & Rubin
[80] assumed a logistic model for Y given (Z, X, U), a logistic model for Z given (X, U), and
a Bernoulli distribution for U, and treated the logistic regression coefficients of U and the
probability parameter of U as the sensitivity parameters. They integrated out U in the complete-
data likelihood and obtained the maximum likelihood estimates of τ P over a plausible range of
values of the sensitivity parameters. This method has been extended to more general settings
in the Frequentist fashion in [81,82]. The Bayesian analogue of [80] is straightforward and can
leverage the data augmentation algorithm to impute U to simplify the computation. Dorie et al.
[83] extended this method to impose a Bayesian semiparametric model with a BART component
for Pr{Y(1), Y(0) | X, U} to allow for model flexibility.
As an extension of [79], Ding & VanderWeele [84] treated the treatment-confounder (Z and
U) and outcome-confounder (Y and U) associations as two sensitivity parameters, and derived
analytical thresholds for them in order to explain away the observed treatment-outcome (Z and
Y) association. Based on that theory, VanderWeele & Ding [85] further simplified by assuming
the two associations to be the same and called the resulting threshold the E-value, as a measure
of robustness of the causal conclusions with respect to unmeasured confounding. The E-value
framework is model-free because it avoids modelling assumptions with U; it also avoids repeating
the analysis over a range of sensitivity parameters as in the competing methods and thus is simple
to calculate.
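For completeness, the E-value has a simple closed form: for an observed risk ratio $\mathrm{RR} > 1$ (inverting the ratio if $\mathrm{RR} < 1$),
$$\text{E-value} = \mathrm{RR} + \sqrt{\mathrm{RR}(\mathrm{RR} - 1)}.$$
For example, an observed risk ratio of 2 yields an E-value of $2 + \sqrt{2} \approx 3.4$: an unmeasured confounder would need to be associated with both the treatment and the outcome by risk ratios of at least about 3.4, beyond the measured covariates, to fully explain away the observed association.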
Monotonicity assumes $W_i(1) \geq W_i(0)$ for all units and thus rules out defiers; the exclusion restriction imposes that the assignment has zero effects among never-takers and always-takers. Then the complier average causal effect is identified by
$$\tau_{\mathrm{co}} \equiv E\{Y(1) - Y(0) \mid U = \mathrm{co}\} = \frac{E(Y \mid Z = 1) - E(Y \mid Z = 0)}{E(W \mid Z = 1) - E(W \mid Z = 0)},$$
which is exactly the probability limit of the two-stage least-squares estimator [90]. Because
under monotonicity, only compliers' actual treatments are affected by the assignment, $\tau_{\mathrm{co}}$ can
be interpreted as the effect of the treatment.
We now describe the Bayesian inference of the IV set-up, first outlined in [91]. Without
additional assumptions, the observed cells of Z and W consist of a mixture of units from more
than one stratum. For example, the units who are assigned to the treatment arm and took
the treatment (Z = 1, W = 1) can be either always-takers or compliers. One must disentangle
the causal effects for different compliance types from observed data. Therefore, model-based
inference here resembles that of a mixture model. In Bayesian analysis, it is natural to impute the
missing label Ui under some model assumptions. Specifically, six quantities are now associated
with each unit, {Yi (1), Yi (0), Wi (1), Wi (0), Xi , Zi }, four of which, {Yiobs = Yi = Yi (Zi ), Wiobs = Wi =
Wi (Zi ), Zi , Xi }, are observed and the remaining two, {Yimis = Yi (1 − Zi ), Wimis = Wi (1 − Zi )}, are
unobserved. Assume the joint distribution of these random variables of all units is governed
by a parameter θ , conditional on which the random variables for each unit are iid. We assume
unconfoundedness Pr{Zi = 1 | Xi , Wi (1), Wi (0), Yi (1), Yi (0)} = Pr(Zi = 1 | Xi ), and impose a prior
distribution Pr(θ ). Then the joint posterior distribution of θ and the missing potential outcomes
is proportional to the prior times the complete-data likelihood:
$$\Pr(\theta)\prod_{i=1}^{N} \Pr\{Y_i(0), Y_i(1) \mid U_i, X_i; \theta_Y\}\,\Pr(U_i \mid X_i; \theta_U)\,\Pr(X_i \mid \theta_X). \tag{7.1}$$
Write $\pi_{\mathrm{co}}$ and $\pi_{\mathrm{nt}} = 1 - \pi_{\mathrm{co}}$ for the proportions of compliers and never-takers, $p_{\mathrm{co},z}$ for the outcome probability of compliers under assignment $z$, and $p_{\mathrm{nt}}$ for the outcome probability of never-takers (the same under both assignments by the exclusion restriction). For each unit assigned to control ($Z_i = 0$), whose compliance status is latent, we impute $U_i = \mathrm{co}$ with probability
$$\frac{\pi_{\mathrm{co}}\, p_{\mathrm{co},0}^{\,Y_i}(1 - p_{\mathrm{co},0})^{1-Y_i}}{\pi_{\mathrm{co}}\, p_{\mathrm{co},0}^{\,Y_i}(1 - p_{\mathrm{co},0})^{1-Y_i} + \pi_{\mathrm{nt}}\, p_{\mathrm{nt}}^{\,Y_i}(1 - p_{\mathrm{nt}})^{1-Y_i}}$$
and $U_i = \mathrm{nt}$ with the remaining probability. With the imputed $U_i$'s, we can sample the parameters
from standard Beta posteriors: (i) sample $\pi_{\mathrm{co}}$ from $\mathrm{Beta}\{1/2 + \sum_{i=1}^{N} 1(U_i = \mathrm{co}),\, 1/2 + \sum_{i=1}^{N} 1(U_i = \mathrm{nt})\}$ and set $\pi_{\mathrm{nt}} = 1 - \pi_{\mathrm{co}}$; (ii) sample $p_{\mathrm{co},1}$ from $\mathrm{Beta}\{1/2 + \sum_{i=1}^{N} Z_i 1(U_i = \mathrm{co})Y_i,\, 1/2 + \sum_{i=1}^{N} Z_i 1(U_i = \mathrm{co})(1 - Y_i)\}$; (iii) sample $p_{\mathrm{co},0}$ from $\mathrm{Beta}\{1/2 + \sum_{i=1}^{N} (1 - Z_i)1(U_i = \mathrm{co})Y_i,\, 1/2 + \sum_{i=1}^{N} (1 - Z_i)1(U_i = \mathrm{co})(1 - Y_i)\}$; and (iv) sample $p_{\mathrm{nt}}$ from $\mathrm{Beta}\{1/2 + \sum_{i=1}^{N} 1(U_i = \mathrm{nt})Y_i,\, 1/2 + \sum_{i=1}^{N} 1(U_i = \mathrm{nt})(1 - Y_i)\}$. We iterate until convergence and obtain the posterior distribution of $\tau_{\mathrm{co}} = p_{\mathrm{co},1} - p_{\mathrm{co},0}$. Imbens & Rubin [91] provided more detailed discussions.
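The following Python sketch implements this two-step Gibbs sampler on simulated data; the data-generating values (60% compliers, $p_{\mathrm{co},1}=0.6$, $p_{\mathrm{co},0}=0.3$, $p_{\mathrm{nt}}=0.2$) are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)

# --- simulate data under one-sided noncompliance ---
N = 1000
U_true = rng.binomial(1, 0.6, size=N)               # 1 = complier, 0 = never-taker
Z = rng.binomial(1, 0.5, size=N)
W = Z * U_true                                      # never-takers never take the treatment
pY = np.where(U_true == 1, np.where(Z == 1, 0.6, 0.3), 0.2)
Y = rng.binomial(1, pY)

# --- Gibbs sampler ---
pi_co, p_co1, p_co0, p_nt = 0.5, 0.5, 0.5, 0.5
draws = []
for it in range(4000):
    # 1. impute compliance status for control units (types of treated units are revealed by W)
    U = np.where(Z == 1, W, 0)
    num = pi_co * p_co0**Y * (1 - p_co0)**(1 - Y)
    den = num + (1 - pi_co) * p_nt**Y * (1 - p_nt)**(1 - Y)
    U[Z == 0] = rng.binomial(1, (num / den)[Z == 0])
    # 2. sample parameters from their Beta(1/2, 1/2)-prior complete-data posteriors
    co, nt = (U == 1), (U == 0)
    pi_co = rng.beta(0.5 + co.sum(), 0.5 + nt.sum())
    p_co1 = rng.beta(0.5 + (Z * co * Y).sum(), 0.5 + (Z * co * (1 - Y)).sum())
    p_co0 = rng.beta(0.5 + ((1 - Z) * co * Y).sum(), 0.5 + ((1 - Z) * co * (1 - Y)).sum())
    p_nt = rng.beta(0.5 + (nt * Y).sum(), 0.5 + (nt * (1 - Y)).sum())
    if it >= 1000:
        draws.append(p_co1 - p_co0)                 # posterior draw of tau_co

draws = np.array(draws)
print(draws.mean().round(2), np.quantile(draws, [0.025, 0.975]).round(2))
```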
Frangakis & Rubin [93] generalized the IV approach to principal stratification, a unified
framework for causal inference with post-treatment confounding. In the simplest scenario, a post-
treatment confounded variable lies in the causal pathway between the treatment and the outcome;
it cannot be adjusted for in the same fashion as a pre-treatment covariate in causal inference. A
principal stratification with respect to a post-treatment variable is the classification of units based
on the joint potential values of the post-treatment variable, and the stratum-specific effects are
called principal causal effects, of which τco is a special case. The post-treatment variable setting
includes a wide range of examples. For instance, in the non-compliance setting, the ‘treatment’
is the randomized treatment assignment, the ‘post-treatment’ variable is the actual treatment
received, and the compliance types are the principal strata [91,92,94,95]. Zeng et al. [96] connect
principal stratification to the local IV method with a continuous IV and binary treatment [97].
Other examples include censoring due to death [98], surrogate endpoints [99,100], regression
discontinuity designs [101], time-varying treatments [102], and many more. The choices of target
strata and thus estimands, interpretations, and identifying assumptions depend on specific
applications, details of which are omitted here.
Assumption 7.2 (Sequential ignorability). $\Pr\{Z_t \mid \bar{Z}_{t-1}, \bar{L}_{t-1}, Y(\bar{z}_t) \text{ for all } \bar{z}_t\} = \Pr(Z_t \mid \bar{Z}_{t-1}, \bar{L}_{t-1})$ for $t = 1, \ldots, T$.
A full Bayesian approach to time-varying treatments [105] would specify a joint model for
treatment assignment Zt and time-varying confounders Lt at all time points as well as all the
potential outcomes Y(z̄T ), and then derive the posterior predictive distributions of the missing
potential outcomes and thus of the estimands. This procedure is a straightforward extension from
the structure introduced in §3. However, the joint modelling approach is rarely used because it
quickly becomes intractable as the time T and the number of time-varying confounders increases.
Instead, most of the Bayesian methods are grounded in the g-computation. Under sequential
ignorability, the average potential outcome E{Yi (z̄T )} is identified from the observed data via the
g-formula [103]:
$$E\{Y_i(\bar{z}_T)\} = \sum_{L_0, L_1, \ldots, L_{T-1}} E(Y \mid \bar{Z}_T = \bar{z}_T, \bar{L}_{T-1}) \cdot \prod_{t=0}^{T-1} \Pr(L_t \mid \bar{Z}_t = \bar{z}_t, \bar{L}_{t-1}), \tag{7.2}$$
where the $t = 0$ factor is $\Pr(L_0)$.
To operationalize the g-formula, we can specify models for all the components of (7.2), including
an outcome model Pr(Y | Z̄T = z̄T , L̄T−1 ) and a model for the time-varying confounders Lt at each
time t, Pr(Lt | Z̄t = z̄t , L̄t−1 ). The g-formula is in essence an extension of the outcome regression
approach to time-varying treatments. The Bayesian version of the g-computation would specify a
Bayesian model for each component in the g-formula (7.2) and then combine the posterior draws
of the parameters to obtain the posterior distribution of the estimands. Below we present an
illustrative example of Bayesian g-computation due to [106].
Example 7.3. [Bayesian g-computation with two periods] Consider the simplest possible
scenario with two time periods, binary covariates and a binary outcome. Let L0 be a binary
baseline covariate, Z1 is a binary treatment at time 1, L1 is a binary time-varying covariate
between time 1 and 2, Z2 is a binary treatment, and Y is a binary outcome. To obtain the posterior
distribution of
$$E\{Y(z_1, z_2)\} = \sum_{l_0 = 0,1}\,\sum_{l_1 = 0,1} \Pr(Y = 1 \mid Z_1 = z_1, Z_2 = z_2, L_0 = l_0, L_1 = l_1)\cdot \Pr(L_1 = l_1 \mid Z_1 = z_1, L_0 = l_0)\cdot \Pr(L_0 = l_0), \tag{7.3}$$
it suffices to obtain the posterior distributions of the probabilities in (7.3). Assuming standard $\mathrm{Beta}(1/2, 1/2)$ conjugate priors, we can obtain the posteriors of the probabilities as follows: (i) sample $\Pr(Y = 1 \mid Z_1 = z_1, Z_2 = z_2, L_0 = l_0, L_1 = l_1)$ from $\mathrm{Beta}\{1/2 + \sum_{i=1}^{N} 1(Z_{i1} = z_1, Z_{i2} = z_2, L_{i0} = l_0, L_{i1} = l_1)Y_i,\, 1/2 + \sum_{i=1}^{N} 1(Z_{i1} = z_1, Z_{i2} = z_2, L_{i0} = l_0, L_{i1} = l_1)(1 - Y_i)\}$; (ii) sample $\Pr(L_1 = 1 \mid Z_1 = z_1, L_0 = l_0)$ from $\mathrm{Beta}\{1/2 + \sum_{i=1}^{N} 1(Z_{i1} = z_1, L_{i0} = l_0)L_{i1},\, 1/2 + \sum_{i=1}^{N} 1(Z_{i1} = z_1, L_{i0} = l_0)(1 - L_{i1})\}$; and (iii) sample $\Pr(L_0 = 1)$ from $\mathrm{Beta}\{1/2 + \sum_{i=1}^{N} L_{i0},\, 1/2 + \sum_{i=1}^{N}(1 - L_{i0})\}$. With these ingredients and (7.3), we can obtain the posterior distributions of the $E\{Y(z_1, z_2)\}$'s and their contrasts $\sum_{z_1, z_2} c(z_1, z_2)E\{Y(z_1, z_2)\}$.
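A minimal Python sketch of this two-period Bayesian g-computation on simulated binary data (the data-generating mechanism and all numerical values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(9)

# --- simulate a simple two-period binary data set ---
N = 2000
L0 = rng.binomial(1, 0.5, size=N)
Z1 = rng.binomial(1, 0.3 + 0.4 * L0)
L1 = rng.binomial(1, 0.3 + 0.3 * Z1 + 0.2 * L0)
Z2 = rng.binomial(1, 0.3 + 0.4 * L1)
Y = rng.binomial(1, 0.1 + 0.2 * Z1 + 0.2 * Z2 + 0.2 * L1)

ndraw = 4000
def beta_draws(successes, failures):
    """Posterior draws of a probability under a Beta(1/2, 1/2) prior."""
    return rng.beta(0.5 + successes, 0.5 + failures, size=ndraw)

pL0 = beta_draws(L0.sum(), (1 - L0).sum())            # Pr(L0 = 1), shared across regimes

def posterior_EY(z1, z2):
    """Posterior draws of E{Y(z1, z2)} via the two-period g-formula (7.3)."""
    total = np.zeros(ndraw)
    for l0 in (0, 1):
        m1 = (Z1 == z1) & (L0 == l0)
        pL1 = beta_draws((L1[m1] == 1).sum(), (L1[m1] == 0).sum())
        for l1 in (0, 1):
            m2 = m1 & (Z2 == z2) & (L1 == l1)
            pY = beta_draws(Y[m2].sum(), (1 - Y[m2]).sum())
            w = (pL0 if l0 == 1 else 1 - pL0) * (pL1 if l1 == 1 else 1 - pL1)
            total += pY * w
    return total

effect = posterior_EY(1, 1) - posterior_EY(0, 0)       # always-treat vs never-treat contrast
print(effect.mean().round(2), np.quantile(effect, [0.025, 0.975]).round(2))
```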
G-computation quickly becomes complex as the number of time periods T and time-varying
confounders increases, which requires specifying a large number of models. Then it is necessary
to impose more structural restrictions on the data-generating process. However, Robins &
Wasserman [107] showed that unsaturated models might rule out the null hypothesis of zero
causal effect a priori, a phenomenon termed the ‘g-null paradox’.
A popular alternative strategy is the marginal structural model [104], which generalizes IPW to the setting of time-varying treatments. Closely related are dynamic treatment regimes: sequences of decision rules, one per time point of intervention, that determine how to individualize treatments to units based on the evolving treatment and covariate history. Inferring optimal dynamic treatment
regimes requires combining causal inference and decision theory techniques and is closely related
to reinforcement learning. See [109] for a review. Due to the space limit, we omit the discussion
of the closely related topics of Bayesian multi-armed bandit [110] and Bayesian reinforcement
learning [111].
8. Discussion
This paper reviews the Bayesian approach to causal inference under the potential outcomes
framework. We discussed the causal estimands, identification strategies, the general structure
of Bayesian inference of causal effects, and sensitivity analysis. We highlight issues that are
unique to Bayesian causal inference, including the role of the propensity score, definition of
identifiability, the choice of priors in both low- and high-dimensional regimes. In particular, under
ignorability and prior independence, the propensity score is seemingly irrelevant for the posterior
distributions of the causal parameters. However, we pointed out that even in this setting, the
propensity score and more generally the design stage plays a central role in obtaining robust
Bayesian causal inference. Regardless of the mode of inference, a critical step in causal inference
with observational data is to ensure adequate covariate overlap and balance in the design or
analysis stages. In high dimensions, such a task is particularly challenging and what is the optimal
practice remains an open question.
The Bayesian approach offers several advantages for causal inference. First and most
importantly, by enabling imputation of all missing potential outcomes, the Bayesian paradigm
provides a unified inferential framework for any causal estimand. This is particularly appealing
for inferring complex estimands such as the conditional average treatment effects or individual
treatment effects as well as partially identifiable causal estimands such as the principal strata
causal effects. In contrast, the Frequentist approach to these problems needs to be customized for
each scenario, and the inference usually relies on bounds or asymptotic arguments, which are
often either non-informative or questionable in cases like individual treatment effects. Second,
the automatic uncertainty quantification of any estimand renders it straightforward to combine
causal inference and decision theory for dynamic decision-making, e.g. in personalized medicine.
Third, the Bayesian approach naturally incorporates prior knowledge into a causal analysis, e.g.
in evaluating spatially correlated treatments and/or outcomes. Fourth, there is a rich collection
of Bayesian models for complex data with limited Frequentist counterparts. A few examples
are (i) spatial or temporal data, (ii) functional data, and (iii) interference, i.e. when the SUTVA
assumption is violated. In these settings, special care must be taken on issues key to causal
inference such as defining relevant estimands and ensuring overlap. Moreover, it is important
to ensure that the Bayesian models are coherent to the model-free identification assumptions
such as ignorability. For example, adding spatial random effects into an outcome model may
inadvertently bias the coefficient of the treatment variable as the estimate of a causal effect [112].
Research on Bayesian analysis of these topics has been rapidly increasing [10,112–115] and is
expected to continue to grow.
Despite the above advantages, the theory and practice of causal inference have long been
dominated by non-Bayesian methods. One reason is that many popular Frequentist techniques,
such as matching and weighting, as well as Fisherian randomization test, do not require
of important scientific problems and (iii) developing and disseminating user-friendly, general-purpose software packages.
We have occasionally commented on whether a method is dogmatically Bayesian in the
discussion. However, we do not regard the conceptual purity of being dogmatically Bayesian,
per se, as advantageous, nor should it be the motivating goal in real applications. When a
quasi-Bayesian method outperforms its dogmatically Bayesian counterpart (if available) with sound methodological footing and empirical evidence, as in the example of adding the estimated propensity score to an outcome model in §5(a), we would endorse the former over the latter. We also
doubt the value of devising a Bayesian version of an established Frequentist method without
clear theoretical or practical advantages. As a general view, we believe whether to choose a
Bayesian approach should be dictated by its practical utility in a specific context rather than an
unconditional commitment to the Bayesian doctrine. For causal inference and perhaps everything
in statistics, being Bayesian should be a tool, not a goal.
Data accessibility. This article has no additional data.
Authors’ contributions. F.L.: conceptualization, formal analysis, investigation, methodology, writing—original
draft, writing—review and editing; P.D.: formal analysis, investigation, methodology, writing—original draft,
writing—review and editing; F.M.: methodology, writing—review and editing.
All authors gave final approval for publication and agreed to be held accountable for the work performed
therein.
Conflict of interest declaration. We declare we have no competing interests.
Funding. P.D.’s research is partially funded by the US National Science Foundation grant no. 1945136. F.M.
thanks the Department of Excellence 2018–2022 funding provided by the Italian Ministry of University and
Research (MUR).
Acknowledgements. The authors are grateful to Joey Antonelli, Yuansi Chen, Sid Chib, Ruobin Gong, Guido
Imbens, Zhichao Jiang, Antonio Linero, Georgia Papadogeorgou, Donald Rubin, Surya Tokdar, Mike West,
Jason Xu, Cory Zigler, Anqi Zhao and two anonymous reviewers for discussions and suggestions.
References
1. Neyman J. 1923 On the application of probability theory to agricultural experiments: essay
on principles, §9. Masters Thesis. Portions translated into English by D. Dabrowska and T.
Speed (1990) in Stat. Sci., pp. 465–472.
2. Rubin DB. 1974 Estimating causal effects of treatments in randomized and nonrandomized
studies. J. Edu. Psychol. 66, 688–701. (doi:10.1037/h0037350)
3. Rubin DB. 1975 Bayesian inference for causality: the role of randomization. In Proc. of Social
Statistics Section of Am Stat. Assoc., pp. 233–239.
4. Pearl J. 2000 Causality: models, reasoning, and inference. New York, NY: Cambridge University
Press.
5. Peters J, Bühlmann P, Meinshausen N. 2016 Causal inference by using invariant prediction:
identification and confidence intervals. J. R. Stat. Soc. B 78, 947–1012. (doi:10.1111/rssb.12167)
6. Ding P, Li F. 2018 Causal inference: a missing data perspective. Stat. Sci. 33, 214–237.
(doi:10.1214/18-STS645)
7. Hahn PR, Carvalho CM, Puelz D, He J. 2018 Regularization and confounding in linear regression for treatment effect estimation. Bayesian Anal. 13, 163–182. (doi:10.1214/16-BA1044)
(doi:10.1080/01621459.2013.869498)
12. Dawid AP. 1979 Conditional independence in statistical theory. J. R. Stat. Soc. B 41, 1–15.
13. Rubin DB. 1980 Comment on ‘Randomization analysis of experimental data: the Fisher
randomization test’ by D. Basu. J. Am. Stat. Assoc. 75, 591–593.
14. Holland PW. 1986 Statistics and causal inference. J. Am. Stat. Assoc. 81, 945–960.
(doi:10.1080/01621459.1986.10478354)
15. Rubin DB. 1978 Bayesian inference for causal effects: the role of randomization. Ann. Stat. 6,
34–58. (doi:10.1214/aos/1176344064)
16. Rosenbaum PR, Rubin DB. 1983 The central role of the propensity score in observational
studies for causal effects. Biometrika 70, 41–55. (doi:10.1093/biomet/70.1.41)
17. Lin W. 2013 Agnostic notes on regression adjustments to experimental data: reexamining
Freedman’s critique. Ann. Appl. Stat. 7, 295–318. (doi:10.1214/12-AOAS583)
18. Rubin DB. 2007 The design versus the analysis of observational studies for causal effects:
parallels with the design of randomized trials. Stat. Med. 26, 20–36. (doi:10.1002/sim.
2739)
19. Abadie A, Imbens GW. 2011 Bias corrected matching estimators for average treatment
effects. J. Bus. Econom. Stat. 29, 1–11. (doi:10.1198/jbes.2009.07333)
20. Abadie A, Imbens GW. 2006 Large sample properties of matching estimators for average
treatment effects. Econometrica 74, 235–267. (doi:10.1111/j.1468-0262.2006.00655.x)
21. Rubin DB. 2006 Matched sampling for causal effects. Cambridge, UK: Cambridge University
Press.
22. Li F, Morgan KL, Zaslavsky AM. 2018 Balancing covariates via propensity score weighting.
J. Am. Stat. Assoc. 113, 390–400. (doi:10.1080/01621459.2016.1260466)
23. Rosenbaum PR. 1987 Model-based direct adjustment. J. Am. Stat. Assoc. 82, 387–394.
(doi:10.1080/01621459.1987.10478441)
24. Robins JM, Rotnitzky A, Zhao LP. 1994 Estimation of regression coefficients when
some regressors are not always observed. J. Am. Stat. Assoc. 89, 846–866. (doi:10.1080/
01621459.1994.10476818)
25. Bang H, Robins JM. 2005 Doubly robust estimation in missing data and causal inference
models. Biometrics 61, 962–972. (doi:10.1111/j.1541-0420.2005.00377.x)
26. Lin Z, Ding P, Han F. 2021 Estimation based on nearest neighbor matching: from density
ratio to average treatment effect. Preprint (https://arxiv.org/abs/2112.13506).
27. Rubin DB. 1976 Inference and missing data. Biometrika 63, 581–592. (doi:10.1093/biomet/
63.3.581)
28. Chib S. 2007 Analysis of treatment response data without the joint distribution of potential
outcomes. J. Econom. 140, 401–412. (doi:10.1016/j.jeconom.2006.07.009)
29. Lindley DV. 1972 Bayesian statistics: a review. SIAM.
30. Gustafson P. 2015 Bayesian inference for partially identified models: exploring the limits of limited
data. New York, NY: CRC Press.
31. Lu J, Ding P, Dasgupta T. 2018 Treatment effects on ordinal outcomes: causal estimands and
sharp bounds. J. Educ. Behav. Stat. 43, 540–567. (doi:10.3102/1076998618776435)
32. Daniels MJ, Hogan JW. 2008 Missing data in longitudinal studies: strategies for Bayesian modeling
and sensitivity analysis. London, UK: Chapman and Hall/CRC.
33. Ding P, Dasgupta T. 2016 A potential tale of two-by-two tables from completely randomized
experiments. J. Am. Stat. Assoc. 111, 157–168. (doi:10.1080/01621459.2014.995796)
34. Franks AM, D’Amour A, Feller A. 2020 Flexible sensitivity analysis for observational
studies without observable implications. J. Am. Stat. Assoc. 115, 1730–1746. (doi:10.1080/0162
1459.2019.1604369)
230. (doi:10.1214/aos/1176342360)
39. Oganisian A, Mitra N, Roy JA. 2020 Hierarchical Bayesian bootstrap for heterogeneous
treatment effect estimation. (https://arxiv.org/abs/2009.10839).
40. Taddy M, Gardner M, Chen L, Draper D. 2016 A nonparametric Bayesian analysis of
heterogenous treatment effects in digital experimentation. J. Bus. Econ. Stat. 34, 661–672.
(doi:10.1080/07350015.2016.1172013)
41. Chamberlain G, Imbens GW. 2014 Nonparametric applications of Bayesian inference. J. Bus.
Econ. Stat. 21, 12–18. (doi:10.1198/073500102288618711)
42. Künzel SR, Sekhon JS, Bickel PJ, Yu B. 2019 Metalearners for estimating heterogeneous
treatment effects using machine learning. Proc. Natl Acad. Sci. 116, 4156–4165.
(doi:10.1073/pnas.1804597116)
43. Breiman L, Friedman JH, Olshen RA, Stone CJ. 2017 Classification and regression trees.
Routledge.
44. Chipman HA, George EI, McCulloch RE. 2010 BART: Bayesian additive regression trees. Ann.
Appl. Stat. 4, 266–298. (doi:10.1214/09-AOAS285)
45. Dorie V, Hill J, Shalit U, Scott M, Cervone D. 2019 Automated versus do-it-yourself methods
for causal inference: lessons learned from a data analysis competition. Stat. Sci. 4, 43–68.
(doi:10.1214/18-STS667)
46. Hu L, Ji J, Li F. 2021 Estimating heterogeneous survival treatment effect in observational data
using machine learning. Stat. Med. 40, 4691–4713. (doi:10.1002/sim.9090)
47. Ray K, van der Vaart A. 2020 Semiparametric Bayesian causal inference. Ann. Stat. 48, 2999–
3020. (doi:10.1214/19-AOS1919)
48. Chib S, Hamilton BH, 2002 Semiparametric Bayes analysis of longitudinal data treatment
models. J. Econom. 110, 67–89. (doi:10.1016/S0304-4076(02)00122-7)
49. Karabatsos G, Walker SG. 2012 A Bayesian nonparametric causal model. J. Stat. Plan. Inference
142, 925–934. (doi:10.1016/j.jspi.2011.10.013)
50. Oganisian A, Roy JA. 2021 A practical introduction to Bayesian estimation of causal effects:
parametric and nonparametric approaches. Stat. Med. 40, 518–551. (doi:10.1002/sim.8761)
51. Roy J, Lum KJ, Zeldow B, Dworkin JD, Re III VL, Daniels MJ. 2018 Bayesian nonparametric
generative models for causal inference with missing at random covariates. Biometrics 74,
1193–1202. (doi:10.1111/biom.12875)
52. Papadogeorgou G, Li F. 2020 Discussion for ‘Bayesian regression tree models for causal
inference: regularization, confounding, and heterogeneous effects’. Bayesian Anal. 15, 1007–
1013.
53. Williams CK, Rasmussen CE. 2006 Gaussian processes for machine learning. Cambridge, MA:
MIT Press.
54. Linero AR, Yang Y. 2018 Bayesian regression tree ensembles that adapt to smoothness and
sparsity. J. R. Stat. Soc. B 80, 1087–1110. (doi:10.1111/rssb.12293)
55. Antonelli JL, Parmigiani G, Dominici F. 2019 High-dimensional confounding adjustment
using continuous spike and slab priors. Bayesian Anal. 14, 805–828. (doi:10.1214/18-BA1131)
56. Park T, Casella G. 2008 The Bayesian lasso. J. Am. Stat. Assoc. 103, 681–686. (doi:10.1198/
016214508000000337)
57. Raftery AE, Madigan D, Hoeting JA. 1997 Bayesian model averaging for linear regression
models. J. Am. Stat. Assoc. 92, 179–191. (doi:10.1080/01621459.1997.10473615)
58. Wang C, Parmigiani G, Dominici F. 2012 Bayesian effect estimation accounting for
adjustment uncertainty. Biometrics 68, 661–671. (doi:10.1111/j.1541-0420.2011.01731.x)
59. Zigler CM. 2016 The central role of Bayes’ theorem for joint estimation of causal effects and
89. Angrist JD, Pischke J-S. 2009 Mostly harmless econometrics: an empiricist’s companion. Princeton
University Press.
90. Angrist JD, Imbens GW, Rubin DB. 1996 Identification of causal effects using instrumental
variables. J. Am. Stat. Assoc. 91, 444–455. (doi:10.1080/01621459.1996.10476902)
91. Imbens GW, Rubin DB. 1997 Bayesian inference for causal effects in randomized experiments
with noncompliance. Ann. Stat. 25, 305–327. (doi:10.1214/aos/1034276631)
92. Hirano K, Imbens GW, Rubin DB, Zhou XH. 2000 Assessing the effect of an influenza vaccine
in an encouragement design. Biostatistics 1, 69–88. (doi:10.1093/biostatistics/1.1.69)
93. Frangakis CE, Rubin DB. 2002 Principal stratification in causal inference. Biometrics 58, 21–29.
(doi:10.1111/j.0006-341X.2002.00021.x)
94. Frangakis CE, Rubin DB, Zhou XH. 2002 Clustered encouragement designs with individual
noncompliance: Bayesian inference with randomization, and application to advance
directive forms. Biostatistics 3, 147–164. (doi:10.1093/biostatistics/3.2.147)
95. Mealli F, Pacini B. 2013 Using secondary outcomes to sharpen inference in
randomized experiments with noncompliance. J. Am. Stat. Assoc. 108, 1120–1131.
(doi:10.1080/01621459.2013.802238)
96. Zeng S, Li F, Ding P. 2020 Is being an only child harmful to psychological health?: evidence
from an instrumental variable analysis of China's one-child policy. J. R. Stat. Soc. A 183, 1615–
1635. (doi:10.1111/rssa.12595)
97. Heckman JJ, Vytlacil EJ. 1999 Local instrumental variables and latent variable models
for identifying and bounding treatment effects. Proc. Natl Acad. Sci. USA 96, 4730–4734.
(doi:10.1073/pnas.96.8.4730)
98. Zhang JL, Rubin DB, Mealli F. 2008 Evaluating the effects of job training programs on
wages through principal stratification. In Applied Bayesian modeling and causal inference from
incomplete-data perspectives, pp. 117–145. New York, NY: John Wiley & Sons.
99. Gilbert PB, Hudgens MG. 2008 Evaluating candidate principal surrogate endpoints.
Biometrics 64, 1146–1154. (doi:10.1111/j.1541-0420.2008.01014.x)
100. Jiang Z, Ding P, Geng Z. 2016 Principal causal effect identification and surrogate end point
evaluation by multiple trials. J. R. Stat. Soc. B 78, 829–848. (doi:10.1111/rssb.12135)
101. Li F, Mattei A, Mealli F. 2015 Evaluating the causal effect of university grants on student
dropout: evidence from a regression discontinuity design using principal stratification. Ann.
Appl. Stat. 9, 1906–1931. (doi:10.1214/15-AOAS881)
102. Ricciardi F, Mattei A, Mealli F. 2020 Bayesian inference for sequential treatments under
latent sequential ignorability. J. Am. Stat. Assoc. 115, 1498–1517. (doi:10.1080/01621459.
2019.1623039)
103. Robins JM. 1986 A new approach to causal inference in mortality studies with sustained
exposure periods—Application to control of the healthy worker survivor effect. Math. Modell.
7, 1393–1512. (doi:10.1016/0270-0255(86)90088-6)
104. Robins JM, Hernán MA, Brumback B. 2000 Marginal structural models and causal inference.
Epidemiology 11, 550–560. (doi:10.1097/00001648-200009000-00011)
105. Zajonc T. 2012 Bayesian inference for dynamic treatment regimes: mobility, equity,
and efficiency in student tracking. J. Am. Stat. Assoc. 107, 80–92. (doi:10.1080/01621459.
2011.643747)
106. Gustafson P. 2015 Discussion of ‘Bayesian estimation of marginal structural models’ by
Saarela et al. Biometrics 71, 291–293. (doi:10.1111/biom.12271)
107. Robins JM, Wasserman L. 1997 Estimation of effects of sequential treatments by
reparameterizing directed acyclic graphs. In Proc. of the Thirteenth Conf. on Uncertainty in
Artificial Intelligence, pp. 409–420. Burlington, MA: Morgan Kaufmann Publishers Inc.
estimating the effect of supermarket access on cardiovascular disease deaths. Ann. Appl. Stat.
14, 2069–2095. (doi:10.1214/20-AOAS1377)
113. Daniels MJ, Roy JA, Kim C, Hogan JW, Perri MG. 2012 Bayesian inference for the causal
effect of mediation. Biometrics 68, 1028–1036. (doi:10.1111/j.1541-0420.2012.01781.x)
114. Forastiere L, Mealli F, Wu A, Airoldi E. 2022 Estimating causal effects under interference
using Bayesian generalized propensity scores. J. Mach. Learn. Res. 23, 1–61.
115. Zeng S, Rosenbaum S, Alberts SC, Archie EA, Li F. 2021 Causal mediation analysis for sparse
and irregular longitudinal data. Ann. Appl. Stat. 15, 747–767. (doi:10.1214/20-AOAS1427)
116. Stan Development Team. 2022 RStan: the R interface to Stan. R package version 2.21.5.