For Objective Causal Inference, Design Trumps Analysis
By Donald B. Rubin
Harvard University
For obtaining causal inferences that are objective, and therefore
have the best chance of revealing scientific truths, carefully designed
and executed randomized experiments are generally considered to be
the gold standard. Observational studies, in contrast, are generally
fraught with problems that compromise any claim for objectivity of
the resulting causal inferences. The thesis here is that observational
studies have to be carefully designed to approximate randomized ex-
periments, in particular, without examining any final outcome data.
Often a candidate data set will have to be rejected as inadequate be-
cause of lack of data on key covariates, or because of lack of overlap
in the distributions of key covariates between treatment and control
groups, often revealed by careful propensity score analyses. Some-
times the template for the approximating randomized experiment
will have to be altered, and the use of principal stratification can be
helpful in doing this. These issues are discussed and illustrated using
the framework of potential outcomes to define causal effects, which
greatly clarifies critical issues.
Lilienfeld and Lilienfeld (1976), Maddala (1977) and Cochran (1983). This
began to change in the 1970’s when the use of potential outcomes, commonly
used in the context of randomized experiments to define causal effects since
Neyman (1923), was used to define causal effects in both randomized exper-
iments and observational studies [Rubin (1974)]. This allowed the definition
of assignment mechanisms [Rubin (1975)], with randomized experiments as
special cases, thereby allowing both types of studies for causal effects to be
considered within a common framework sometimes called the Rubin Causal
Model [RCM–Holland (1986)]. In particular, the same underlying principles
can be used to design both types of studies, and the thesis of this article is
that for objective causal inference, those principles must be used.
process that led to some units being exposed to the treatment condition
and other units being exposed to the control condition. The careful descrip-
tion and implementation of these two “design” steps is absolutely essential
for drawing objective inferences for causal effects in practice, whether in
randomized experiments or observational studies, yet the steps are often ef-
fectively ignored in observational studies relative to details of the methods
of analysis for causal effects. One of the reasons for this misplaced empha-
sis may be that the importance of design in practice is often difficult to
convey in the context of technical statistical articles, and, as is common in
many academic fields, technical dexterity can be more valued than practical
wisdom.
This article is an attempt to refocus workers in observational studies on
the importance of design, where by “design” I mean all contemplating, col-
lecting, organizing, and analyzing of data that takes place prior to seeing any
outcome data. Thus, for example, design includes conceptualization of the
study and analyses of covariate data used to create matched treated-control
samples or to create subclasses, each with similar covariate distributions
for the treated and control subsamples, as well as the specification of the
primary analysis plan for the outcome data. However, any analysis that re-
quires final outcome data to implement is not part of design. The same point
has been emphasized in Rubin (2002, 2007) and the subsequent editorial by
D’Agostino and D’Agostino (2007).
A brief review of the two essential parts of the RCM will be given in Sec-
tion 2, which introduces terminology and notation; an encyclopedia entry
review is given by Imbens and Rubin (2008a), a chapter length review is in
Rubin (2008), and a full-length text from this perspective is Imbens and Rubin
(2008b). Section 3 focuses on the assignment mechanism, the real or hypo-
thetical rule used to assign treatments to the units, and on the importance
of trying to reconstruct the hypothetical randomized experiment that led to
the observed data, this reconstruction being conducted without examining
any final outcome data in that observational data set.
Then Section 4 illustrates the design of an observational study using
propensity scores and subclassification, first in the context of a classic example from Cochran (1968) with one background covariate. Sec-
tion 4 goes on to explain how propensity score methods allow the design of
observational studies to be extended to cases with many covariates, first
with an example comparing treatments for breast cancer to illustrate how
this extension can be applied, and second, with a marketing example to il-
lustrate the kind of balance on observed covariates that can be achieved in
practice. Section 5 uses a Karolinska Institute example to illustrate a differ-
ent point: that the same observational data set may be used to support two
(or more) different templates for underlying randomized experiments, one of which may be far more plausible than the other. The concluding Section
6 briefly summarizes major points.
2.1. Part one: units, treatments, potential outcomes. Three basic con-
cepts are used to define causal effects in the RCM. A unit is a physical ob-
ject, for example, a patient, at a particular place and point in time, say, time
t. A treatment is an action or intervention that can be initiated or withheld
from that unit at t (e.g., an anti-hypertensive drug, a job-training program);
if the active treatment is withheld, we will say that the unit has been ex-
posed to the control treatment. Associated with that unit are two potential
outcomes at a future point in time, say, t∗ > t: the value of some outcome
measurements Y (e.g., cholesterol level, income, possibly vector valued with
more than one component) if the active treatment is given at t, Y (1), and
the value of Y at the same future point in time if the control treatment is
given at t, Y (0). The causal effect of the treatment on that unit is defined
to be the comparison of the treatment and control potential outcomes at
t∗ (e.g., their difference, their ratio, the ratio of their squares). The times
t can vary from unit to unit in a population of N units, but typically the
intervals, t∗ − t, are essentially constant across the N units.
The full set of potential outcomes comprises all values of the outcome Y
that could be observed in some real or hypothetical experiment comparing
the active treatment to the control treatment in a population of N units.
Under the “Stable Unit-Treatment Value Assumption (SUTVA)” [Rubin
(1980, 1990a)], the full set of potential outcomes for two treatments and the
population of N units can be represented by an array with N rows, one for
each unit, and two “super” columns, one for Y (0) and one for Y (1), “super”
in the sense that Y can be multi-component. The fundamental problem facing causal inference [Holland (1986); Rubin (1978), Section 2.4] is that, for each unit, only one of the potential outcomes, either Y (0) or Y (1), can ever be observed. In contrast to outcome variables, covariates are
variables, X, that for each unit take the same value no matter which treat-
ment is applied to the unit, such as quantities determined (e.g., measured)
before treatments are assigned (e.g., age, pre-treatment blood pressure or
pre-treatment education). Under SUTVA, the values of all these variables form the N-row array [X, Y (0), Y (1)], which is the object of causal inference called “the science.”
A causal effect is, by definition, a comparison of treatment and control
potential outcomes on a common set of units; for example, the average Y (1)
minus the average Y (0) across all units, or the median log Y (1) versus the
median log Y (0) for those units who are female between 31 and 35 years
old, as indicated by their X values, or the median [log Y (1) − log Y (0)] for
those units whose Y (0) and Y (1) values are both positive. It is critically
important in practice to keep this definition firmly in mind.
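To make these definitions concrete, the following small simulation (an illustrative aside with invented numbers, not an analysis from this article) builds a hypothetical “science” array [X, Y (0), Y (1)] and computes two estimands of the kind just described:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000

# A hypothetical "science": one covariate X (age) and both potential
# outcomes for every unit; all numbers are invented for illustration.
age = rng.uniform(20, 70, size=N)
y0 = 50.0 + 0.5 * age + rng.normal(0, 5, size=N)  # Y(0): outcome under control
y1 = y0 + 10.0 - 0.1 * age                        # Y(1): outcome under treatment

# Causal effects are comparisons of Y(1) and Y(0) on a common set of units.
ate = np.mean(y1 - y0)                       # average causal effect, all units
in_range = (age >= 31) & (age <= 35)         # a subpopulation defined by X alone
sub_effect = np.median(y1[in_range]) - np.median(y0[in_range])

print(f"average causal effect:         {ate:+.2f}")
print(f"median difference, ages 31-35: {sub_effect:+.2f}")
```

Because both potential outcomes exist in the science, these estimands are ordinary summaries of the array; the difficulty, addressed next, is that only one column is ever observed for each unit.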
This first part of the RCM is conceptual and can, and typically should,
be conducted before seeing any data, especially before seeing any outcome
data. It forces the conceptualization of causal questions in terms of real or
hypothetical manipulations: “No causation without manipulation” [Rubin
(1975)]. The formal use of potential outcomes to define unit-level causal ef-
fects is due to Neyman in 1923 [Rubin (1990a)] in the context of randomized
experiments, and was a marvelously clarifying contribution. But evidently
this notation was not formally extended to nonrandomized settings until
Rubin (1974), as discussed in Rubin (1990a, 2005) and Imbens and Rubin
(2008a, 2008b).
The intuitive idea behind the use of potential outcomes to define causal
effects must be very old. Nevertheless, in the context of nonrandomized ob-
servational studies, prior to 1974 everyone appeared to use the “observed
outcome” notation when discussing “formal” causal inference. More explic-
itly, letting W be the column vector indicating the treatment assignments
for the units (Wi = 1 if treated, Wi = 0 if control), the observed outcome
notation replaces the array of potential outcomes [Y (0), Y (1)] with Yobs ,
where the ith component of Yobs is
(2.1) Yobs,i = Wi Yi(1) + (1 − Wi) Yi(0).
The observed outcome notation is inadequate in general, and can lead to se-
rious errors—see, for example, the discussions in Holland and Rubin (1983)
on Lord’s paradox, and in Rubin (2005), where errors are explicated that
Fisher made because (I believe) he eschewed the potential outcome notation.
The essential problem with Yobs is that it mixes up the science [i.e., Y (0)
and Y (1)] with what is done to learn about the science via the assignment
of treatment conditions to the units (i.e., Wi ).
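To see concretely how Yobs entangles the science with the assignment, consider the following sketch (again a simulation with invented numbers): when W depends on a potential outcome, equation (2.1) still defines what is observed, but the naive comparison of observed treated and control means no longer approximates the average causal effect.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

# The science: both potential outcomes for every unit; the true average
# causal effect is exactly +2 by construction (invented numbers).
y0 = rng.normal(10, 3, size=N)
y1 = y0 + 2

# A confounded assignment: units with high Y(0) are more likely treated,
# so W depends on a potential outcome, violating (2.3).
e = 1 / (1 + np.exp(-(y0 - 10)))
w = rng.binomial(1, e)

# Equation (2.1): the observed outcome mixes the science with W.
y_obs = w * y1 + (1 - w) * y0

print("true average causal effect:    ", np.mean(y1 - y0))          # 2.0
print("naive observed-mean difference:",
      y_obs[w == 1].mean() - y_obs[w == 0].mean())                  # far from 2.0
```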
2.2. Part 2: the assignment mechanism. The second part of the RCM is
the formulation, or positing, of an assignment mechanism, which describes
the reasons for the missing and observed values of Y (0) and Y (1) through
a probability model for W given the science:
(2.2) Pr(W |X, Y (0), Y (1)).
Although this general formulation, with the possible dependence of assign-
ments on the yet to be observed potential outcomes, arose first in Rubin
(1975), special cases were much discussed prior to that. For example, ran-
domized experiments [Neyman (1923, 1990), Fisher (1925)] are “uncon-
founded” [Rubin (1990b)],
(2.3) Pr(W |X, Y (0), Y (1)) = Pr(W |X),
and they are “probabilistic” in the sense that their unit level probabilities,
or propensity scores, ei, are bounded between 0 and 1:
(2.4) 0 < ei < 1,
where
(2.5) ei ≡ Pr(Wi = 1|Xi ).
When the assignment mechanism is both probabilistic [(2.4) and (2.5)] and
unconfounded (2.3), then for all assignments W that have positive proba-
bility, the assignment mechanism generally can be written as proportional
to the product of the unit level propensity scores, which emphasizes the
importance of propensity scores in design:
(2.6) Pr(W |X, Y (0), Y (1)) ∝ ∏_{i=1}^{N} ei^{Wi} (1 − ei)^{1−Wi}, or = 0.
The collection of propensity scores defined by (2.5) is the most basic in-
gredient of an unconfounded assignment mechanism because of (2.6), and
its use for objectively designing observational studies will be developed and
illustrated here, primarily in Section 4, but also in the context of a more
complex design discussed in Section 5.
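As a small sketch of how (2.6) operates, the function below computes the unnormalized probability of a particular assignment vector under an unconfounded, probabilistic mechanism with independent unit-level assignments; the propensity values are invented for illustration.

```python
import numpy as np

def assignment_prob(w, e):
    """Unnormalized probability of assignment vector w under an
    unconfounded, probabilistic mechanism with independent unit-level
    assignments, as in (2.6): each treated unit contributes e_i and
    each control unit contributes (1 - e_i)."""
    w, e = np.asarray(w), np.asarray(e)
    return float(np.prod(np.where(w == 1, e, 1 - e)))

# Invented propensity scores for five units, all strictly between 0 and 1,
# so the mechanism is probabilistic in the sense of (2.4).
e = np.array([0.2, 0.4, 0.5, 0.6, 0.8])
print(assignment_prob([1, 0, 1, 0, 1], e))  # probability of one assignment
print(assignment_prob([0, 1, 0, 1, 0], e))  # probability of another
```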
The term “propensity scores” was coined in Rosenbaum and Rubin (1983),
where an assignment mechanism satisfying (2.4) and (2.5) is called “strongly
ignorable,” a stronger version of “ignorable” assignment mechanisms, coined
in Rubin (1976a, 1978), which allows possible dependence on observed val-
ues of the potential outcomes, Yobs defined by (2.1), such as in a sequential
experiment:
Pr(W |X, Y (0), Y (1)) = Pr(W |X, Yobs ).
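A sequential experiment of this ignorable-but-confounded kind is easy to sketch (a hypothetical design of my own construction, not one from the paper): assignment probabilities depend on the outcomes already observed, but never on the missing potential outcomes.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200

# The science (invented): the active treatment is better on average.
y1_pot = rng.normal(1.0, 1.0, N)
y0_pot = rng.normal(0.0, 1.0, N)

w = np.zeros(N, dtype=int)
y_obs = np.zeros(N)
for i in range(N):
    if i < 20:
        p = 0.5  # burn-in: assign the first 20 units completely at random
    else:
        # Assignment depends only on outcomes already observed (Yobs for
        # units 0..i-1), never on missing potential outcomes: the
        # mechanism is ignorable, though not unconfounded.
        better = y_obs[:i][w[:i] == 1].mean() > y_obs[:i][w[:i] == 0].mean()
        p = 0.7 if better else 0.3
    w[i] = rng.binomial(1, p)
    y_obs[i] = y1_pot[i] if w[i] == 1 else y0_pot[i]

print("fraction assigned to the (better) treatment:", w.mean())
```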
But until Rubin (1975), randomized experiments were not defined us-
ing (2.3) and (2.4), which explicitly show such experiments’ freedom from
any dependence on observed or missing potential outcomes. Instead, ran-
domized experiments were described in such a way that the assignments
only depended on available covariates, and so implicitly did not involve
the potential outcomes themselves. But explicit mathematical notation, like
Neyman’s, can be a major advance over implicit descriptions.
Other special versions of assignment mechanisms were also discussed prior
to Rubin (1975, 1978), but without the benefit of explicit equations for the
assignment mechanism showing possible dependence on the potential out-
comes. For example, in economics, Roy (1951) described, without equations
or notation, “self-optimizing” behavior where each unit chooses the treat-
ment with the optimal outcome. And another well-known example from
economics is Haavelmo’s (1944) formulation of supply and demand behav-
ior. But these and other formulations in economics and elsewhere did not
use the notation of an assignment mechanism, nor did they have methods of
statistical inference for causal effects based on the assignment mechanism.
Instead, “regression” models were used to predict Yobs,i from Xi and Wi ,
3.1. Overview. A crucial idea when trying to estimate causal effects from
an observational dataset is to conceptualize the observational dataset as hav-
ing arisen from a complex randomized experiment, where the rules used to
assign the treatment conditions have been lost and must be reconstructed.
There are various steps that I consider essential for designing an objective
observational study. These will be described in this section and then illus-
trated in the remaining parts of this article. In practice, the steps are not
always conducted in the order given below, but often they are, especially
when facing a particular candidate data set.
3.2. What was the hypothetical randomized experiment that led to the
observed dataset? As a consequence of our conceptualization of an observa-
tional study’s data as having arisen from a hypothetical randomized experi-
ment, the first activity is to think hard about that hypothetical experiment.
To start, what exactly were the treatment conditions and what exactly were
the outcome (or response) variables? Be aware that a particular observa-
tional dataset can often be conceptualized as having arisen from a variety of
different hypothetical experiments with differing treatment and control con-
ditions and possibly differing outcome variables. For example, a dataset with
copious measurements of humans’ prenatal exposures to exogenous agents,
such as hormones or barbiturates [e.g., Rosenbaum and Rubin (1985),
Reinisch et al. (1995)], could be proposed to have arisen from a random-
ized experiment on prenatal hormone exposure, or a randomized experiment
on prenatal barbiturate exposure, or a randomized factorial experiment on
both hormone and barbiturate exposure. But the investigator must be clear
about the hypothetical experiment that is to be approximated by the ob-
servational data at hand. Running regression programs is no substitute for
careful thinking, and providing tables summarizing computer output is no
substitute for precise writing and careful interpretation.
3.3. Are sample sizes in the dataset adequate? If the step presented in Section 3.2 is successful in the limited sense that measurements of both
treatment conditions and outcomes seem to be available or obtainable from
descriptions of the observational dataset, the next step is to decide whether
the sample sizes in this dataset are large enough to learn anything of interest.
Here is where traditional power calculations are relevant; also extensions, for
example, involving the ratios of sample sizes needed to obtain well-matched
samples [Rubin (1976b), Section 5], are relevant, and should be considered
before plunging ahead. Sometimes, the sample sizes will be small, but the
data set is the only one available to address an important question. In such
3.4. Who are the decision makers for treatment assignment and what mea-
surements were available to them? The next step is to think very carefully
about why some units (e.g., medical patients) received the active treatment
condition (e.g., surgery) versus the control treatment condition (e.g., no
surgery): Who were the decision makers and what rules did they use? In a
randomized experiment, the randomized decision rules are explicitly written
down (hopefully), and in any subsequent publication, the rules are likewise
typically explicitly described. But with an observational study, we have to
work much harder to describe and justify the hypothetical approximating
randomized assignment mechanism. In common practice with observational
data, however, this step is ignored, and replaced by descriptions of the re-
gression programs used, which is entirely inadequate. What is needed is a
description of critical information in the hypothetical randomized experi-
ment and how it corresponds to the observed data.
For example, what were the background variables measured on the ex-
perimental units that were available to those making treatment decisions,
whether observed in the current dataset or not? These variables will be
called the “key covariates” for this study. Was there more than one decision
maker, and if so, is it plausible that all decision makers used the same rule,
or nearly so, to make their treatment decisions? If not, in what ways did the
decision rules possibly vary? It is remarkable to me that so many published
observational studies are totally silent on how the authors think that treat-
ment conditions were assigned, yet this is the single most crucial feature
that makes their observational studies inferior to randomized experiments.
3.5. Are key covariates measured well? Next, consider the existence and
quality of the key covariates’ measurements. If the key covariates are very
3.6. Can balance be achieved on key covariates? The next step is to try
to find subgroups (subclasses, or matched pairs) of treated and control units
such that within a subgroup, the treated and control units appear to be bal-
anced with respect to their distributions of key covariates. That is, within
such a subgroup, the treated and control units should look as if they could
have been randomly divided (usually not with equal probability) into treat-
ment and control conditions. Often, it will not be possible to achieve such
balance in an entirely satisfactory way. In that situation, we may have to
restrict inferences to a subpopulation of units where such balance can be
achieved, or we may even decide that with this dataset we cannot achieve
balance with enough units to make the study worthwhile. If so, we should
usually forgo using this dataset to address the causal question being con-
sidered. A related issue is that if there appear to be many decision makers
using differing rules (e.g., different hospitals with different rules for when to
give a more expensive drug rather than a generic version), then achieving
this balance will be more difficult because different efforts to create balance
will be required for the differing decision makers. This point will be clearer
in the context of particular examples.
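One common way to operationalize this balance check, offered here as an illustration rather than as a prescription from the paper, is the standardized difference in covariate means between treated and control units within a subclass; values near zero are consistent with the subclass having been randomly divided. The thresholds mentioned in the comment below are conventional rules of thumb, not from this article.

```python
import numpy as np

def std_diff(x, w):
    """Standardized difference in means of covariate x between treated
    (w == 1) and control (w == 0) units, using the pooled within-group
    standard deviation. Common rules of thumb (not from the paper)
    treat |std_diff| below roughly 0.1-0.25 as adequate balance."""
    xt, xc = x[w == 1], x[w == 0]
    pooled_sd = np.sqrt((xt.var(ddof=1) + xc.var(ddof=1)) / 2)
    return (xt.mean() - xc.mean()) / pooled_sd

# Invented data: check balance of one covariate within one subclass.
rng = np.random.default_rng(3)
age = rng.normal(50, 10, size=200)
w = rng.binomial(1, 0.4, size=200)
print(f"standardized difference in age: {std_diff(age, w):+.3f}")
```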
3.7. The result. These six steps combine to make for objective observa-
tional study design in the sense that the resultant designed study can be con-
ceptualized as a hypothetical, approximating randomized block (or paired
comparison) experiment, whose blocks (or matched pairs) are our balancing
groups, and where the probabilities of treatment versus control assignment
may vary dramatically across the blocks. This statement does not
mean the researcher who follows these steps will achieve an answer similar
to the one that would have been found in the analogous randomized experi-
ment, but at least the observational study has a chance of doing so, whereas
if these steps are not followed, I believe that it is only blind luck that could
lead to a similar answer as in the analogous randomized experiment.
Sometimes the design effort can be so extensive that a description of it,
with no analyses of any outcome data, can itself be publishable. For a specific
example on peer influence on smoking behaviors, see Langenskold and Rubin
(2008).
4.1. Classic example with one observed covariate. The following very simple example is taken from Cochran's (1968) classic article on subclassi-
fication in observational studies, which uses some smoking data to illustrate
ideas. Let us suppose that we want to compare death rates (the outcome
variable of primary interest) among smoking males in the U.S., where the
treatment condition is considered cigarette smoking and the control condi-
tion is cigar and pipe smoking. There exists a very large dataset with the
death rates of smoking males in the U.S., and it distinguishes between these
two types of smokers. So far, so good, in that we have a dataset with Y and
treatment indicators, and it is very large. Now we strip this dataset of all outcome data: the survival (i.e., Y) data are held out of sight until the design phase is complete.
Next we ask (in a simple minded way, because this is only an illustrative
example), who is the decision maker for treatment versus control, and what
are the key covariates used to make this decision? It is relatively obvious that
the main decision maker is the individual male smoker. It is also relatively
obvious that the dominant covariate used to make this decision is age—
most smokers start in their teens, and most start by smoking cigarettes, not
pipes or cigars. Some pipe and cigar smokers start in college, but many start
later in life. Cigarette smokers tend to have a more uniform distribution of
ages. Other possible candidate key covariates are education, socio-economic
status, occupational status, income, and so forth, all of which tend to be
correlated with age, so to illustrate, we focus on age as our only X variable.
Then our hypothetical randomized experiment starts with male smokers and
randomly assigns them to cigarette or cigar/pipe smoking, where the propen-
sity to be a cigarette smoker rather than a cigar/pipe smoker is viewed as a
function of age. In this dataset, age is very well-measured. When we compare
the age distribution of cigarette smokers and age distribution of cigar/pipe
smokers in the U.S. in this dataset, we see that the former are younger, but
that there is substantial overlap in the distributions. Before moving on to
the next step, we should worry about how people in the hypothetical ex-
periment who died prior to the assembling of the observational dataset are
represented, but, for simplicity in this illustrative example, we will move on
to the next step.
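The subclassification adjustment at the heart of Cochran's article can be sketched in a few lines; the data below are simulated stand-ins for the smoking dataset (all numbers invented), with death depending on age only, so the true causal effect is zero by construction:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 50_000

# Simulated stand-ins for the smoking data: younger men are more likely
# to be cigarette smokers (w = 1), and death depends on age only.
age = rng.uniform(20, 80, size=N)
w = rng.binomial(1, 1 / (1 + np.exp((age - 50) / 8)))
death = rng.binomial(1, np.clip(0.002 * (age - 20), 0, 1))

# Raw comparison versus subclassification on five age subclasses.
raw = death[w == 1].mean() - death[w == 0].mean()
edges = np.quantile(age, np.linspace(0, 1, 6))
sub = np.clip(np.searchsorted(edges, age, side="right") - 1, 0, 4)
adjusted = sum(
    (death[(sub == k) & (w == 1)].mean() - death[(sub == k) & (w == 0)].mean())
    * np.mean(sub == k)
    for k in range(5)
)
print(f"raw difference in death rates: {raw:+.4f}")       # confounded by age
print(f"age-subclassified difference:  {adjusted:+.4f}")  # near zero
```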
careful observational studies can (not necessarily will) reach the same general
conclusions as expensive randomized experiments. The second example is
from a large marketing study and displays the kind of balance that can
be achieved following propensity score subclassification, as well as the fact
that some units can be unmatchable. The last example, in Section 5, uses
a data set on large volume versus small volume hospitals to emphasize that
one observational data set can be used to support two (or more) differing
templates for the underlying randomized study of a particular question, and
one template may be considered far better than the other.
4.3. GAO study of treatments for breast cancer. The following example
appeared in a General Accounting Office (GAO) publication that was
summarized in Rubin (1997). In the 1960s mastectomy was the standard
treatment for many forms of breast cancer, but there was growing interest
in the possibility that for a class of less severe situations (e.g., small tumors,
node negative) a more limited surgery, which just removed the tumor, might
be just as successful as the more radical and disfiguring operation.
Several large and expensive randomized trials were done for this category
of women with less severe cancer, and the results of these trials are summa-
rized in Table 1. As can be seen there, these studies suggest that for this
class of women who are willing to participate in a randomized experiment,
and for these cancer treating centers and their doctors, who are also willing
Table 1
Estimated 5-year survival rates for node-negative patients in six randomized clinical trials
(Columns: Study; n of women, breast conservation (BC); n of women, mastectomy (Mas); estimated survival rate, BC, %; estimated survival rate, Mas, %; estimated causal effect, BC − Mas, %)
Table 2
Estimated 5-year survival rates for node-negative patients in the SEER database within each of five propensity score subclasses: from tables in U.S. GAO Report [General Accounting Office (1994)]
(Columns: Propensity score subclass; Treatment condition; n; Estimate)
mastectomy (subclasses 1, 2, 3), but the data are certainly not definitive.
Similarly, for the women and doctors relatively more likely to select breast
conserving operations, there is some slight evidence of a survival benefit to
that choice. If we believed that the treatment effect should be the same for
all women in the study, these changing results across propensity subclasses
could be viewed as evidence of a confounded and nonignorable treatment as-
signment (i.e., an omitted key covariate). Overall, however, there appears to
be no advantage to recommending one treatment over the other. It is inter-
esting to note that, consistent with expectations, the overall survival rates
in the observational dataset are not as good as those in the more specialized
centers represented in Table 1.
drug. Do the visits cause more scripts to be written, and if so, which doc-
tors should be visited with higher priority? Both of these, and other similar
questions, are causal ones.
The decision maker for whether or not to visit a doctor is essentially the sales rep, and sales reps, rather obviously, like to visit doctors who prescribe a lot, who have large practices, and who are in a specialty that prescribes a lot of the type
of drug being detailed, etc. Essentially all of these background variables, X,
and more, are available on the purchased data set, which has huge sample
sizes; the company has the indicator W for visited versus not, and next
year’s purchased data set will have the outcome variables Y on the actual
number of scripts written by these doctors in the next time period. So we are in good shape to estimate and re-estimate the propensity scores until
we achieve balanced distributions within subclasses, or we decide that there
are some types of doctors who have essentially no chance of being visited or
not being visited, and then no estimation of causal effects will be attempted
for them.
Figures 1 and 2 display the initial balance for two important covariates,
number of prior scripts written in the previous year (for drugs in the same
class as the detailed drug) on a scale from 0 (minimum) to 100 (the arbitrar-
ily scaled maximum), and the specialty. These figures reveal quite dramatic
differences between the doctors who were visited and those who were not
visited. It is not surprising that the visited doctors were the ones who wrote
many more prescriptions (per doctor) than the not visited doctors. But the
visited doctors also have a different distribution of specialties than the not
visited doctors. For example, ob-gyn doctors are visited relatively less often
than doctors with other specialties; presumably, ob-gyn doctors do not pre-
scribe weight-loss drugs for their pregnant patients, and the sales reps use
this information.
Propensity scores were estimated by logistic regression based on various
functions of all of the covariates. Figure 3 displays the histograms for the
estimated linear propensity scores (the β̂X in the logistic regression) among
the not visited and visited doctors. These histograms are shown with 15
subclasses (or bins) of propensity scores. In some bins, there are only visited
doctors, that is, in the bins with linear propensity scores larger than 1.0; in
those two bins, there are no doctors who were not visited. Presumably, they
are high prescribing doctors with large practices, etc. No causal inferences
are possible for them without making model-based assumptions relating
outcomes to covariates for which there are no data to assess the underlying
assumptions. Similarly, for the four lowest bins of propensity, with linear
scores less than 0.1, all doctors are not visited, and so, similarly, no causal
inferences about the effect of visiting this type of doctor are possible unless
based on unassessable assumptions.
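The mechanics just described can be sketched as follows, with invented data standing in for the proprietary marketing dataset (all variable names and coefficients are mine, and scikit-learn's logistic regression is used for illustration): fit the visit indicator on the covariates, bin the resulting linear propensity score into 15 subclasses, and set aside bins in which only visited or only non-visited doctors appear.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 20_000

# Invented covariates: prior-script volume on the 0-100 scale and an
# ob-gyn specialty indicator.
prior_scripts = rng.uniform(0, 100, size=n)
obgyn = rng.binomial(1, 0.15, size=n)
X = np.column_stack([prior_scripts, obgyn])

# Invented visiting behavior: reps favor high prescribers, avoid ob-gyns.
logit = -3 + 0.06 * prior_scripts - 1.5 * obgyn
w = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Fit the propensity model; decision_function returns the linear score.
clf = LogisticRegression().fit(X, w)
lin_score = clf.decision_function(X)

# Fifteen equal-width bins of the linear propensity score; bins lacking
# either group support no causal comparison and are set aside.
edges = np.linspace(lin_score.min(), lin_score.max(), 16)
bins = np.clip(np.digitize(lin_score, edges) - 1, 0, 14)
for k in range(15):
    n1 = int(np.sum(w[bins == k] == 1))
    n0 = int(np.sum(w[bins == k] == 0))
    tag = "" if (n1 > 0 and n0 > 0) else "  <- no overlap: exclude"
    print(f"bin {k:2d}: visited={n1:6d}  not visited={n0:6d}{tag}")
```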
But in the other nine bins, there are both visited and not visited doctors,
and the claim is that within each of those bins, the distributions of all
covariates that entered the propensity score estimation will be nearly the
same for the visited and not visited doctors. To be specific, let us examine the bin with linear propensity scores from 0.5 to just below 0.6. Figures 4 and 5 show the distributions of prior
number of prescriptions and specialties in this bin for the not visited and
visited doctors. These distributions are strikingly more similar than their
counterparts shown in Figures 1 and 2. In fact, they are so similar that
one could believe that, within that bin, the visited doctors are a random
sample from all doctors in that bin. And the claim is that this will hold (in
expectation) for all covariates used to estimate the propensity score and in
all bins where there are both visited and not visited doctors.
The process of assessing balance was conducted for all variables and all
bins and considered adequate in the sense that it was considered plausible
5.1. The causal effect of being treated in large volume versus small volume
hospitals. The third example illustrates the point that the design phase
in some observational studies may involve conceptualizing the hypothetical
underlying randomized experiment that led to the observed data as being
more complex than a randomized block or randomized paired comparison.
In particular, in some situations, we may have to view the hypothetical ex-
periment as being a randomized block with noncompliance to the assigned
treatment, a so-called “encouragement” design [Holland (1988)]. In many
dataset seems well-suited for studying the causal effect of home hospital type
on survival.
Propensity score analyses were done to predict diagnosing (home) hospital
type from X, including nonlinear terms in X. It was decided that the age
of patient should be limited to between 35 and 84 because the two patients
under 35 (actually both under 30) were both diagnosed in large volume
hospitals, and longer term survival in the 8 cardia cancer patients 85 and over
was considered unlikely no matter where treated, and would therefore simply
add noise to the survival data. Propensity score analyses on the remaining
148 patients led to five subclasses. Figures 6–8 are “Love plots” [Ahmed et al. (2006)] summarizing balance before and after this subclassification for binary and continuous covariates.
5.3. Treating hospital type versus home hospital type. If patients were al-
ways treated in the same hospital where they were diagnosed, estimating the
causal effects of hospital type would now be easy because of the assumed
unconfounded assignment of diagnosing hospital type. However, there are
transfers between hospital types, typically from small to large—33 of the
75 diagnosed in a small hospital transferred to a large one for treatment, but sometimes from large to small—2 of 75 transferred in this direction. The
reasons for these transfers are considered quite complex. The decisions are
made by the individual patient, but clearly with input from doctors, rela-
tives, and friends, where the issues being discussed include speculation about
the probability of success of the treatment at one versus the other, the pa-
tient’s willingness to tolerate invasive operations, the importance of being
close to relatives and friends, and a host of other reasons. Consequently,
there is no doubt that given the observed covariates and the home hospi-
tal type, the assignment of treating hospital type is confounded. Therefore,
doing a direct analysis of treating hospital type, even if propensity score
methods were used to create subclasses of patients with identical distribu-
tions of all observed covariates in large and small treating hospitals, would
be considered unsatisfactory because key covariates were not available in the
data set.
We can, however, still make progress based on the assumed unconfounded
assignment of home hospital type by using a different template for our ob-
servational study of treating hospital type: a randomized experiment with
noncompliance. That is, think of patients who transfer, or, more generally,
who would have transferred if assigned to a different hospital type, as being
noncompliers, and therefore, our template is that of a randomized encour-
agement design, where the encouragement to be treated in the diagnosing
large or small hospital is randomly assigned within propensity score strata.
The crucial idea here is then to stratify also on the bivariate “intermedi-
ate outcome,” treating hospital when assigned to a large home hospital and
treating hospital when assigned to a small home hospital. Even though only
one of these intermediate variables is actually observed, progress can still be
made. Notice that the design phase here does look at intermediate outcome data, treating hospital type, but not the outcome data on survival, on which decisions will be based. Survival data are not available at this stage!

Fig. 7. Cardia cancer: difference in means for binary covariates and propensity score.

Table 3
Cardia cancer: observed counts in observed groups and approximate counts in principal strata under monotonicity assumption—subclass 1
Denote the home hospital type by h, which takes the value ℓ when as-
signed large hospital type and s when assigned small home hospital type.
Similarly, let T denote treating hospital type, which takes the value L when
the treating hospital is large, and takes the value S when treating hospital is
small. The first three columns of Tables 3–7 summarize the observed values
of h and T within each of the five propensity score subclasses. Clearly, in all
subclasses, transfers into large hospitals are common, but only in subclass 5
are there any ℓ → S transfers. But do we estimate that there are compliers (patients who would be treated in a large hospital if assigned ℓ and in a small one if assigned s) within each subclass? If not, we will not be able to estimate the causal effect of treating
hospital type for the entire group of patients—a critical design issue with
this template.
Table 4
Cardia cancer: observed counts in observed groups and approximate counts in principal
strata under monotonicity assumption—subclass 2
Table 5
Cardia cancer: observed counts in observed groups and approximate counts in principal
strata under monotonicity assumption—subclass 3
Table 6
Cardia cancer: observed counts in observed groups and approximate counts in principal
strata under monotonicity assumption—subclass 4
Table 7
Cardia cancer: observed counts in observed groups and approximate counts in principal
strata under monotonicity assumption—subclass 5
matter where assigned, and SL can be thought of as defiers, who will transfer
no matter where assigned. The values of the principal strata are not affected
by assignment of home hospital type—which value [T (ℓ) or T (s)] is observed
is affected by treatment assignment, but the bivariate values are not, and
therefore (T (ℓ), T (s)) is, formally, a partially observed covariate.
Now, we consider what is called the “monotonicity” assumption or the
“no-defier” assumption—that is, we assume that the SL principal stratum
is empty. In our setting, this assumption is very plausible, and because it
excludes the SL principal stratum, we have only three principal strata: LL,
LS and SS. Under this assumption, the possible principal strata for each
observed combination of home hospital type and treating hospital type in
each propensity subclass are shown in the fourth columns of Tables 3–7. The
observed ℓ → S group (the second row in Tables 3–7) must be composed of SS patients: they were assigned ℓ but treated in S, and therefore are not LL patients, and there are no SL patients by the monotonicity assumption.
Similarly, the observed s → L group (the third row of Tables 3–7) must be
LL patients because they were assigned s but were treated in L.
In contrast, the observed ℓ → L subgroup (the first row of Tables 3–7) could be compliers, and so be in LS, or noncompliers who are members of the LL principal stratum (patients assigned home hospital type ℓ who would have transferred to a large hospital for treatment had they been assigned a small home hospital type). Hence, we split row 1 into two
sub-rows in the fourth column of Tables 3–7. Similarly, the observed s → S subgroup (the fourth row of Tables 3–7) could consist of compliers, and so be in LS, or of noncompliers who are members of the SS principal stratum, and so it is also split into two sub-rows.
We can approximate the proportion of patients in each principal stratum,
as shown in the fifth columns of Tables 3–7. More explicitly, from the second
row of Table 7, columns (1) and (3), we see that 2/20 are observed to be
ℓ → S. Because of the assumed random assignment into ℓ and s within
propensity score subclasses, we have that approximately 10% of the patients
belong to the principal stratum SS, as shown in the fifth column of Table 7.
Similarly, from the third row of Table 7, columns (1) and (3), we infer that
approximately 6/9 ≈ 67% of patients belong to principal stratum LL in this
subclass, as shown in the fifth column of Table 7.
Hence, we can approximate the fraction of compliers, the LS principal
stratum in this subclass, by simple subtraction: 100% − 10% − 67% = 23%.
The sixth column in Table 7 indicates the approximate number of LS pa-
tients in each of the four rows of observed patients. Analogous calculations
are summarized in Tables 3–6 for the other propensity score subclasses.
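The arithmetic in the last two paragraphs can be packaged as a short function (the packaging is mine, under the same monotonicity assumption); applied to the subclass 5 counts quoted above, it reproduces the 10%, 67% and 23% figures.

```python
def principal_stratum_proportions(n_l_to_S, n_l, n_s_to_L, n_s):
    """Approximate principal-stratum proportions under monotonicity
    (no SL defiers): the l -> S rate identifies pi_SS, the s -> L rate
    identifies pi_LL, and the complier fraction pi_LS follows by
    subtraction."""
    pi_SS = n_l_to_S / n_l
    pi_LL = n_s_to_L / n_s
    pi_LS = 1.0 - pi_SS - pi_LL
    return pi_SS, pi_LL, pi_LS

# Counts quoted above for propensity subclass 5 (Table 7): 2 of the 20
# patients assigned a large home hospital were treated in a small one,
# and 6 of the 9 assigned a small home hospital were treated in a large one.
pi_SS, pi_LL, pi_LS = principal_stratum_proportions(2, 20, 6, 9)
print(f"pi_SS = {pi_SS:.0%}, pi_LL = {pi_LL:.0%}, pi_LS = {pi_LS:.0%}")
# -> pi_SS = 10%, pi_LL = 67%, pi_LS = 23%
```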
Even if we could perfectly identify all the LS patients, which we cannot,
the sample sizes are small, and so inference for the causal effect of treating
5.5. ITT and CACE = ITTLS and their estimation. The average causal
effect of home hospital type on survival is the comparison of the potential
survival outcomes of all N patients under hi = ℓ and under hi = s,
ITT = (1/N) Σ_{i=1}^{N} [Yi(ℓ) − Yi(s)],

and its analogue restricted to the LS principal stratum is

CACE = ITTLS = (1/NLS) Σ_{i∈LS} [Yi(ℓ) − Yi(s)],

where NLS is the number of LS patients, and CACE means “Complier Average Causal Effect” [Imbens and Rubin (1997)]. ITTLS can be interpreted
as either the intention-to-treat effect of home hospital type for complying pa-
tients or the intention-to-treat effect of treating hospital type for complying
patients, because for the LS principal stratum, hi = Ti . Under monotonicity,
the LS principal stratum is the only stratum of patients where we can learn
about the causal effects of treating hospital type because the patients in
the other principal strata, LL and SS, will always be exposed to the same
treating hospital type.
CACE is easily estimated once we identify the individuals in the LS stratum, but we have not yet identified any particular member of the ℓ → L or s → S rows (rows one and four) in Tables 3–7 as being in the LS prin-
cipal stratum, and so we cannot yet compare average outcomes in this
stratum. Nevertheless, we can find a unique method-of-moments estimate
of ITTLS from the decomposition

ITT = πLS · ITTLS + πSS · ITTSS + πLL · ITTLL,

where πLS, πSS and πLL are the fractions of the sample in the LS, SS and LL principal strata, and ITTLS, ITTSS and ITTLL are the intention-to-treat effects in those strata, respectively. Because the exclusion restrictions force ITTSS and ITTLL to be identically zero, this equation becomes ITT = πLS · ITTLS, or

ITTLS = ITT/πLS.
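A sketch of this method-of-moments estimator follows, with invented data in place of the cardia cancer dataset (the function and all names are hypothetical); it is the familiar instrumental-variables ratio for randomized designs with noncompliance [Angrist, Imbens and Rubin (1996)].

```python
import numpy as np

def cace_estimate(y, h_large, t_large):
    """Method-of-moments estimate of CACE = ITT / pi_LS under
    monotonicity and the exclusion restrictions. y is the outcome,
    h_large indicates assignment to a large home hospital, and t_large
    indicates treatment in a large hospital."""
    itt = y[h_large == 1].mean() - y[h_large == 0].mean()
    pi_SS = np.mean(t_large[h_large == 1] == 0)  # l -> S rate
    pi_LL = np.mean(t_large[h_large == 0] == 1)  # s -> L rate
    pi_LS = 1.0 - pi_SS - pi_LL
    return itt / pi_LS

# Invented illustration: randomized home hospital type, three principal
# strata, and a survival benefit of 0.2 only for compliers treated large.
rng = np.random.default_rng(6)
h = rng.binomial(1, 0.5, size=2_000)
stratum = rng.choice(["LL", "LS", "SS"], size=2_000, p=[0.5, 0.3, 0.2])
t = np.where(stratum == "LL", 1, np.where(stratum == "SS", 0, h))
y = rng.binomial(1, 0.3 + 0.2 * t * (stratum == "LS"))
print(f"estimated CACE: {cace_estimate(y, h, t):.3f}")  # near 0.2
```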
REFERENCES
Ahmed, A., Husain, A., Love, T., Gambassi, G., Dell’Italia, L., Francis, G.,
Gheorghiade, M., Allman, R., Meleth, S. and Bourge, R. (2006). Heart failure,
chronic diuretic use, and increase in mortality and hospitalization: An observational
study using propensity score methods. Eur. Heart J. 27 1431–1439.
Angrist, J., Imbens, G. and Rubin, D. (1996). Identification of causal effects using
instrumental variables. J. Amer. Statist. Assoc. 91 444–472.
Barnard, J., Frangakis, C., Hill, J. and Rubin, D. (2003). Principal stratification
approach to broken randomized experiments: A case study of school choice vouchers in
New York city. J. Amer. Statist. Assoc. 98 299–323. MR1995712
Blalock, H. (1964). Causal Inference in Nonexperimental Research. Univ. North Carolina
Press, Chapel Hill.
Rubin, D. (2007). The design versus the analysis of observational studies for causal effects:
Parallels with the design of randomized trials. Stat. Med. 26 20–30. MR2312697
Rubin, D. (2008). Statistical inference for causal effects, with emphasis on applications
in epidemiology and medical statistics. II. In Handbook of Statistics: Epidemiology and
Medical Statistics (C. R. Rao, J. P. Miller and D. C. Rao, eds.). Elsevier, The Nether-
lands.
Rubin, D. and Thomas, N. (1992). Characterizing the effect of matching using linear
propensity score methods with normal covariates. Biometrika 79 797–809. MR1209479
Rubin, D. and Thomas, N. (2000). Combining propensity score matching with additional
adjustments for prognostic covariates. J. Amer. Statist. Assoc. 95 573–585.
Rubin, D., Wang, X., Yin, L. and Zell, E. (2008). Bayesian causal inference: Ap-
proaches to estimating the effect of treating hospital type on cancer survival in Sweden
using principal stratification. In Handbook of Applied Bayesian Analysis (T. O’Hagan
and M. West, eds.). Oxford Univ. Press, Oxford.
Rubin, D. and Waterman, R. (2006). Estimating causal effects of marketing interventions
using propensity score methodology. Statist. Sci. 21 206–222. MR2324079
Shadish, W. R., Cook, T. D. and Campbell, D. T. (2002). Experimental and Quasi-
Experimental Designs for Generalized Causal Inference. Houghton Mifflin Company,
Boston.
Zell, E., Kuwanda, M., Rubin, D., Cutland, C., Patel, R., Velaphi, S., Madhi, S. and Schrag, S. (2007). Conducting and analyzing a single-blind clinical trial in a developing country: Prevention of perinatal sepsis, Soweto, South Africa. In Proceedings
of the International Statistical Institute (CD-ROM).
Department of Statistics
Harvard University
Cambridge, Massachusetts 02138
USA
E-mail: [email protected]