A Practical Introduction To Regression Discontinuity Designs
Acknowledgments
This monograph collects and expands the instructional materials we prepared for more than 25
short courses and workshops on Regression Discontinuity (RD) methodology taught over the years
2014–2017. These teaching materials were used at the following institutions and programs: the
Asian Development Bank, the Philippine Institute for Development Studies, the International Food
Policy Research Institute, the ICPSR’s Summer Program in Quantitative Methods of Social Re-
search, the Abdul Latif Jameel Poverty Action Lab, the Inter-American Development Bank, the
Georgetown Center for Econometric Practice, and the Universidad Católica del Uruguay’s Winter
School in Methodology and Data Analysis. The materials were also employed for teaching at the
undergraduate and graduate level at Brigham Young University, Cornell University, Instituto Tec-
nológico Autónomo de México, Pennsylvania State University, Pontificia Universidad Católica de
Chile, University of Michigan, and Universidad Torcuato Di Tella. We thank all these institutions
and programs, as well as their many audiences, for the support, feedback and encouragement we
received over the years.
The work collected in this monograph evolved and benefited from many insightful discus-
sions with our present and former collaborators: Sebastián Calonico, Bob Erikson, Juan Carlos
Escanciano, Max H. Farrell, Yingjie Feng, Brigham Frandsen, Sebastián Galiani, Michael Jansson,
Luke Keele, Marko Klašnja, Xinwei Ma, Kenichi Nagasawa, Brendan Nyhan, Jas Sekhon, Gonzalo
Vazquez-Bare, and José Zubizarreta. Their intellectual contribution to our research program on RD
designs cannot be overestimated, and certainly made this monograph much better than it would
have otherwise been. We also thank Alberto Abadie, Josh Angrist, Ivan Canay, Richard Crump,
David Drukker, Sebastian Galiani, Guido Imbens, Pat Kline, Justin McCrary, David McKenzie,
Doug Miller, Aniceto Orbeta, Zhuan Pei, and Andres Santos for the many stimulating discussions
and criticisms we received from them over the years, which also shaped the work presented here in
important ways.
The monograph is purposely practical and hence focuses on empirical analysis of RD designs.
We do not seek to provide a comprehensive literature review on RD designs nor discuss theoretical
aspects in detail. We employ the data of Meyersson (2014) as the main running example throughout
the manuscript, and we also use the data of Lindo et al. (2010) as a second empirical illustration. We
thank these authors for making their data and codes publicly available. Accompanying the mono-
graph, we provide complete replication codes in both R and Stata. Furthermore, we provide full
replication codes for a third empirical illustration using the data of Cattaneo et al. (2015), though
this example is not discussed in the text to conserve space and because it is already analyzed in our
companion software articles. The general purpose, open-source software used in this monograph,
as well as all replication files, can be found at https://sites.google.com/site/rdpackages.
Last but not least, we gratefully acknowledge funding from the National Science Foundation
through grant SES-1357561.
1 Introduction
One important goal in the social sciences is to understand the causal effect of a treatment on some
outcome of interest. As social scientists, we are interested in questions as varied as the effect of
minimum wage increases on unemployment, the role of information dissemination on political par-
ticipation, the impact of educational reforms on student achievement, and the effects of conditional
cash transfers on children’s health. The analysis of such effects is relatively straightforward when
the treatment of interest is randomly assigned, as this ensures the comparability of units assigned
to the treatment and control conditions. However, by its very nature, many interventions of interest
to social scientists cannot be randomly assigned for either ethical or practical reasons—often both.
In this context, research designs that allow for the rigorous study of non-experimental interven-
tions are particularly promising. One of them is the regression discontinuity (RD) design, which has
emerged as one of the most credible non-experimental strategies for the analysis of causal effects.
In the simplest RD design, units are assigned a score, and a treatment is given to those units whose
value of the score exceeds a known cutoff and withheld from units whose value of the score is below
the cutoff. The key feature of the design is that the probability of receiving the treatment changes
abruptly at the known threshold. If units are not able to perfectly “sort” around this threshold,
this discontinuous change in the treatment assignment probability can be used to infer the effect
of the treatment on an outcome of interest, at least locally, because units with scores barely below
the cutoff can be used as counterfactuals for units with scores barely above it.
The first step to employ the RD design in practice is to learn how to recognize it. There are three
fundamental components in the RD design—a score, a cutoff, and a treatment. Without these three
basic defining features, RD methodology cannot be employed. Therefore, the analysis of the RD
design is not always implementable, unlike other non-experimental methods such as those based on
regression adjustments or more sophisticated selection-on-observables approaches, which can always
be used to describe the relationship between outcomes and treatments after adjusting for observed
covariates. Instead, RD is a research design that must possess certain objective features: we can
only study causal effects with a RD design when one occurs; the decision to use a RD design is not
up to the researcher. The key defining feature of any RD design is that the probability of treatment
assignment as a function of the score changes discontinuously at the cutoff—a condition that is
directly verifiable. In addition, the RD design comes with an extensive array of falsification and
related empirical approaches that can be used to offer empirical support for its validity, making it
more plausible in any specific application. These features give the RD design an objective basis for
implementation and testing that is usually lacking in other non-experimental empirical strategies,
and endow it with superior credibility among observational studies.
The popularity of the RD design has grown markedly over the last decades, and it is now used
frequently in Economics, Political Science, Education, Epidemiology, Criminology, and many other
disciplines. This recent proliferation of RD applications has been accompanied by great disparities
in how RD analysis is implemented, interpreted, and evaluated. RD applications often differ sig-
nificantly in how authors estimate the effects of interest, make statistical inferences, present their
results, evaluate the plausibility of the underlying assumptions, and interpret the estimated effects.
The lack of consensus about best practices for validation, estimation, inference, and interpretation
of RD results makes it hard for scholars and policy-makers to judge the plausibility of the evidence
and compare results from different RD studies. In this monograph, our goal is to provide an acces-
sible and practical guide for the analysis and interpretation of RD designs that encourages the use
of a common set of practices and facilitates the accumulation of RD-based empirical evidence.
In addition to the existence of a treatment assignment rule based on a score and a cutoff,
the formal interpretation, estimation and inference of RD treatment effects requires several other
assumptions. First, we need to define the parameter of interest and provide assumptions under
which such a parameter is identifiable—i.e., conditions under which it is uniquely estimable in some
objective sense (finite sample or super population). Second, we must impose additional assumptions
to ensure that the parameter can be estimated; these assumptions will naturally vary according
to the estimation method employed and the parameter under consideration. In this monograph,
we discuss two frameworks for RD analysis that define different parameters of interest, rely on
different identification assumptions, and employ different estimation and inference methods. These
two alternative frameworks also generate different testable implications, which can be used to assess
their validity in specific applications.
The first framework we discuss is based on conditions that ensure the smoothness of the re-
gression functions, and is the framework most commonly employed in practice. We call this the
standard or continuity-based framework for RD analysis. The second framework we describe is
based on conditions that ensure that the treatment can be interpreted as being randomly assigned
for units near the cutoff. We call this second setup the local randomization framework for RD
analysis. Both setups rely on the notion that units that receive very similar score values on op-
posite sides of the cutoff ought to be comparable to each other except for their treatment status.
The main distinction between both approaches is how the idea of comparability is formalized:
in the continuity-based framework, comparability is conceptualized as continuity of average (or
some other feature of) potential outcomes; in the local randomization framework, comparability
is conceptualized as conditions that mimic an experimental setting in a neighborhood around the
cutoff.
We present each approach separately, discussing in detail the required assumptions, the ade-
quate interpretation of the target parameters, the graphical illustration of the design, the appro-
priate methods to estimate effects and conduct statistical inference, and the available strategies to
evaluate the plausibility of the design. Our presentation of the topics is intentionally geared towards
practitioners: our main goal is to clarify conceptual issues in the analysis of RD designs, and offer an
accessible guide for applied researchers and policy-makers who wish to implement RD analyses. For
this reason, we omit most technical discussions—but provide references for the technically inclined reader.
To ensure that our discussion is most useful to practitioners, we illustrate all methods with two
previously published empirical applications, one of which we use as a running example throughout
the sections. Our leading example is a study conducted by Meyersson (2014), who analyzed the
effect of Islamic political representation in Turkey’s municipal elections on the educational attainment
of women. The score in this RD design is the margin of victory of the largest Islamic party in the
municipality, a continuous random variable, which makes the example suitable to illustrate both
the continuity-based and the local randomization methods.
The second example we consider is the study by Lindo et al. (2010), who investigate the effects
of placing students on academic probation on their future academic achievement. The score in
this second example is the students’ Grade Point Average (GPA); since there are many students
with the same GPA value, this variable has mass points and is therefore a discrete—rather than
continuous—random variable. RD designs with discrete running variables create some difficulties
for analysis and interpretation, because the continuity-based methods cannot be applied directly.
In the last section of this monograph we use this empirical study to illustrate how to approach
the analysis of RD designs with discrete scores. We put special emphasis on discussing issues of
neighborhood selection and causal interpretation of estimands.
To conclude, we emphasize that this monograph is not meant to offer a comprehensive review
of the literature on RD designs (though we do offer references to further readings after each topic
is presented), but rather only a succinct practical guide for empirical analysis. For early review
articles see Imbens and Lemieux (2008) and Lee and Lemieux (2010), and for an edited volume
with a contemporaneous overview of the RD literature see Cattaneo and Escanciano (2017). We
are currently working on a comprehensive literature review that complements this monograph
(Cattaneo and Titiunik, 2017).
1.1 Software for RD Analysis

As already mentioned, we use two empirical applications to illustrate the different RD methods
discussed in this monograph. All implementations of these methods are done using two leading
statistical software environments in the social sciences: R and Stata. We lead our illustrations with
R, but every time we illustrate a method we also present the equivalent Stata command. To be
specific, each numerical illustration includes an R command with its output, and the analogous
Stata command that reproduces the same analysis—though we omit the Stata output to avoid
repetition.
All the RD methods we discuss and illustrate are implemented using various user-developed
packages, which are free and available for both R and Stata. The local polynomial methods for
continuity-based RD analysis are implemented in the package rdrobust, which is presented and
illustrated in three companion software articles: Calonico et al. (2014a), Calonico et al. (2015b)
and Calonico et al. (2017d). This package has three functions specifically designed for continuity-
based RD analysis: rdbwselect for data-driven bandwidth selection methods, rdrobust for local
polynomial point estimation and inference, and rdplot for graphical RD analysis. In addition, the
package rddensity, discussed by Cattaneo et al. (2017b), provides manipulation tests of density
discontinuity based on local polynomial density estimation methods.
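To fix ideas, the following is a minimal R sketch of how these functions are typically invoked, assuming that an outcome vector Y and a score vector X (as defined in the running example introduced in the next subsection) are already in memory. It only illustrates the calling conventions and is not a substitute for the detailed usage discussed in later sections.

    library(rdrobust)
    library(rddensity)

    # Data-driven bandwidth selection for local polynomial methods
    summary(rdbwselect(Y, X, c = 0))

    # Local polynomial point estimation and robust inference at the cutoff (zero)
    summary(rdrobust(Y, X, c = 0))

    # Graphical RD analysis (RD plot)
    rdplot(Y, X, c = 0)

    # Manipulation test based on local polynomial density estimation
    summary(rddensity(X, c = 0))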
The local randomization methods for RD analysis are implemented in the package rdlocrand,
which is presented and illustrated by Cattaneo et al. (2016b). This package has four functions specif-
ically designed for local randomization RD analysis: rdrandinf for randomization-based estimation
and inference, rdwinselect for data-driven window selection methods based on predetermined co-
variates, and rdsensitivity and rdrbounds for different randomization-based sensitivity analyses.
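As a rough sketch of the local randomization workflow, and under the same assumptions as above (vectors Y and X in memory, plus a matrix Z of predetermined covariates; the window endpoints below are purely hypothetical), these functions can be called as follows.

    library(rdlocrand)

    # Data-driven window selection around the cutoff based on predetermined covariates
    rdwinselect(X, Z, cutoff = 0)

    # Randomization-based estimation and inference within a hypothetical window [-2.5, 2.5]
    rdrandinf(Y, X, cutoff = 0, wl = -2.5, wr = 2.5)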
1.2 Running Empirical Example

Before concluding this section, we introduce the empirical example that we employ throughout the manuscript, originally analyzed by Meyersson (2014)—henceforth Meyersson. The example is based on a (sharp) RD design in Turkey that studies the impact of having a mayor from an Islamic party on the educational outcomes of women. This study is one of many RD applications based on close elections, as popularized by the original work of Lee (2008).
Meyersson is broadly interested in the effect of Islamic parties’ control of local governments on
women’s rights, in particular on the educational attainment of young women. The methodological
challenge is that municipalities where the support for Islamic parties is high enough to result in
the election of an Islamic mayor may differ systematically from municipalities where the support
for Islamic parties is more tenuous and results in the election of a secular mayor. (For brevity,
we refer to a mayor who belongs to one of the Islamic parties as an “Islamic mayor”, and to a
mayor who belongs to a non-Islamic party as a “secular mayor”.) If some of the characteristics
on which they differ affect (or are correlated with) the educational outcomes of women, a simple
comparison of municipalities with an Islamic versus a secular mayor will be misleading. For example,
municipalities where an Islamic mayor wins in 1994 may be on average more religiously conservative
than municipalities where a secular mayor is elected. If religious conservatism affects the educational
outcomes of women, the naive comparison between municipalities controlled by an Islamic versus
a secular mayor will not successfully isolate the effect of the Islamic party’s control of the local
government—instead, the effect of interest will be contaminated by differences in the degree of
religious conservatism between the two groups.
This challenge is illustrated in Figure 1.1, where we plot the share of young women who complete
high school by 2000 against the Islamic margin of victory in the 1994 mayoral elections (more
information on these variables is given below). These figures are examples of so-called RD plots,
which we discuss in detail in Section 3. In Figure 1.1(a), we show the scatter plot of the raw
data (i.e., each point is an observation), superimposing the overall sample mean for each group—
treated observations where an Islamic mayor is elected are located to the right of zero, and control
observations where a secular mayor is elected are located to the left of zero. The raw comparison
reveals a negative average effect: municipalities with an Islamic mayor have on average a lower share
of young women who complete high school. Figure 1.1(b) shows the scatter plot for the subset of
municipalities where the Islamic margin of victory is within 50 percentage points, a range that
includes 83% of the total observations; this second figure superimposes a fourth-order polynomial
fit separately on either side of the cutoff. Figure 1.1(b) reveals that the negative effect in Figure
1.1(a) arises because there is a general negative relationship between Islamic vote share and female
high school share for the majority of the observations. Thus, a naive comparison of treated and
control municipalities will mask systematic differences and may lead to incorrect inferences about
the effect of local Islamic political representation.
The RD design can be used in cases such as these to isolate a treatment effect of interest from all
other systematic differences between treated and control groups. Under appropriate assumptions, a
comparison of municipalities where the Islamic party barely wins the election and municipalities where the
Islamic party barely loses will reveal the causal (local) effect of Islamic party control of the local
government on female educational attainment. If parties cannot systematically manipulate the vote
share that they obtain, observations just above and just below the cutoff will tend to be comparable
in terms of all characteristics with the exception of the party that won the 1994 election. Thus,
right at the cutoff, the comparison is free of the complications introduced by systematic observed
and unobserved differences between the groups. This strategy is illustrated in Figure 1.1(b), where
we see that, despite the negative slope on either side, right near the cutoff the effect of an Islamic
victory on the educational attainment of women is positive, in stark contrast to the negative
difference-in-means in Figure 1.1(a).
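A figure in the spirit of Figure 1.1 can be produced with the rdplot function discussed in Section 3. The sketch below assumes the outcome and score have already been renamed Y and X as described later in this section; the axis labels are ours.

    library(rdrobust)

    # Raw difference in means, as in Figure 1.1(a); misleading as a causal estimate
    mean(Y[X >= 0], na.rm = TRUE) - mean(Y[X < 0], na.rm = TRUE)

    # RD plot restricted to margins within 50 percentage points, with a
    # fourth-order global polynomial fit on each side of the cutoff, as in Figure 1.1(b)
    keep <- abs(X) <= 50
    rdplot(Y[keep], X[keep], c = 0, p = 4,
           x.label = "Islamic Margin of Victory", y.label = "Female High School Share")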
Figure 1.1: Municipalities with Islamic Mayor vs. Municipalities with Secular Mayor—Meyersson
data
[Two panels plotting the Female High School Share against the Islamic Margin of Victory. Panel (a) shows the scatter plot of the raw data, with the average outcome superimposed for municipalities with an Islamic mayor (margin above zero) and for municipalities with a secular mayor (margin below zero). Panel (b) restricts the sample to margins within 50 percentage points and superimposes a fourth-order polynomial fit on each side of the cutoff.]
0.94 percent of the national vote and won in only 11 out of the 329 municipalities where an Islamic
mayor was elected. As defined, the Islamic margin of victory can be positive or negative, and the
cutoff that determines an Islamic party victory is located at zero. Given this setup, the treatment
group consists of municipalities that elect a mayor from an Islamic party in 1994, and the control
group consists of municipalities that elect a mayor from a secular party. The main outcome of
interest is the school attainment for women who were (potentially) in high school during the period
1994–2000, measured with variables extracted from the 2000 census. The particular outcome
we re-analyze is the share of the cohort of women ages 15 to 20 in 2000 who had completed high
school by 2000. For brevity, we refer to this outcome variable interchangeably as female high school
attainment share, female high school attainment, or high school attainment for women.
In order to streamline the computer code for our analysis, we rename the variables in the
following way:
• Y: high school attainment for women in 2000, measured as the share of women ages 15 to 20
in 2000 who had completed high school by 2000.
• X: vote margin obtained by the Islamic candidate for mayor in the 1994 Turkish elections,
measured as the vote percentage obtained by the Islamic candidate minus the vote percentage
obtained by its strongest opponent.
• T: electoral victory of the Islamic candidate in 1994, measured as 1 if the Islamic candidate won
the election and 0 if the candidate lost.
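For concreteness, a minimal R sketch of this renaming is given below; the file name and the original column names are placeholders, not the actual names used in the replication files.

    # Hypothetical construction of the renamed variables from the Meyersson dataset
    data <- read.csv("meyersson_data.csv")   # placeholder file name
    Y <- data$female_hs_share_1520           # placeholder column name for the outcome
    X <- data$islamic_margin_1994            # placeholder column name for the score
    T <- as.numeric(X >= 0)                  # Islamic victory indicator implied by the margin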
The Meyersson dataset also contains several predetermined covariates that we will later use
to investigate the plausibility of the RD design and also to illustrate covariate-adjusted estima-
tion methods. The covariates that we include in our analysis are the Islamic vote share in 1994
(vshr_islam1994), the number of parties receiving votes in 1994 (partycount), the logarithm of
the population in 1994 (lpop1994), an indicator equal to one if the municipality elected an Islamic
party in the previous election in 1989 (i89), a district center indicator (merkezi), a province cen-
ter indicator (merkezp), a sub-metro center indicator (subbuyuk), and a metro center indicator
(buyuk).
Table 1.1 presents descriptive statistics for the three RD variables (Y, X, and T), and the
municipality-level predetermined covariates.
The outcome of interest (Y) has a minimum of 0 and a maximum of 68.04, with a mean of
16.31. Since this variable measures the share of women between ages 15 and 20 who had completed
high school by 2000, these descriptive statistics mean that there is at least one municipality in
2000 where no women in this age cohort had completed high school, and on average 16.31% of
women in this cohort had completed high school by the year 2000. The Islamic vote margin (X)
ranges from −100 (party receives zero votes) to 100 (party receives 100% of the vote), and it has a
mean of −28.14, implying that on average the Islamic party loses by 28.14 percentage points. This
explains why the mean of the treatment variable (T) is only 0.120: in 1994, an Islamic mayor was elected in just 12.0% of the municipalities. This small proportion of victories is consistent with the negative average margin of victory, which corresponds to an electoral loss.
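These descriptive statistics can be reproduced with a few lines of base R, assuming Y, X, and T are defined as above.

    # Descriptive statistics for the three RD variables
    summary(data.frame(Y = Y, X = X, T = T))

    # Means reported in the text: roughly 16.31 for Y, -28.14 for X, and 0.120 for T
    colMeans(cbind(Y = Y, X = X, T = T), na.rm = TRUE)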
2 RD Taxonomy
In the RD design, all units in the study receive a score, also known as running variable or index,
and a treatment is assigned to those units whose score is above a known cutoff and not assigned
to those units whose score is below the cutoff. In our running example based on Meyersson’s study,
the units are municipalities and the score is the margin of victory of the Islamic party in the 1994
Turkish mayoral elections. The treatment is the Islamic party’s electoral victory, and the cutoff
is zero: municipalities elect an Islamic mayor when the Islamic vote margin is above zero, and
elect a secular mayor otherwise. In the second empirical example we present in Section 7, which is
based on Lindo et al. (2010), some students are placed on academic probation if their GPA in a given semester falls below 1.6, and the authors are interested in the effects of probation on future academic
performance. In this example, the score is the grade point average of each student, the cutoff is 1.6,
and the treatment is being placed on academic probation.
These three components—score, cutoff, and treatment—define the RD design in general, and
characterize its most important feature: in the RD design, unlike in other non-experimental studies,
the assignment of the treatment follows a rule that is known (at least to the researcher) and hence
empirically verifiable. To formalize, we assume that there are n units, indexed by i = 1, 2, . . . , n,
each unit has a score or running variable Xi , and x̄ is a known cutoff. Units with Xi ≥ x̄ are
assigned to the treatment condition, and units with Xi < x̄ are assigned to the control condition.
This assignment, denoted Ti , is defined as Ti = 1(Xi ≥ x̄), where 1(·) is the indicator function.
This treatment assignment rule implies that if we know a unit’s score, we know with certainty
whether that unit was assigned to the treatment or the control condition. This is a key defining
feature of any RD design: the probability of treatment assignment as a function of the score changes
discontinuously at the cutoff.
Being assigned to the treatment condition, however, is not the same as receiving the treatment.
As in experimental and other non-experimental settings, this distinction is important in RD designs
because non-compliance introduces complications and typically requires stronger assumptions to
learn about treatment effects of interest. We introduce the binary variable Di to denote whether
the treatment was actually received by unit i. In a RD design with perfect compliance, also known
as the Sharp RD design, units comply perfectly with their assignment, so Di = Ti for all i. In
contrast, in a RD design with imperfect compliance, also known as the Fuzzy RD design, we have Di ≠ Ti
for some units.
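When a measure of the treatment actually received is available, the degree of compliance can be inspected directly. The sketch below assumes a hypothetical binary vector D recording treatment receipt (in the Meyersson application the design is sharp, so D would simply equal T).

    # Cross-tabulate assignment and receipt: off-diagonal cells indicate non-compliance
    table(assigned = T, received = D)

    # Share of units taking the treatment on each side of the cutoff;
    # these are 1 and 0, respectively, under perfect compliance (Sharp RD)
    c(mean(D[T == 1]), mean(D[T == 0]))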
In the remainder of this section, we discuss the basic features of the most common types of
RD designs encountered in practice. In addition to the canonical Sharp RD design, which is the
focus of this monograph, we discuss important extensions of this canonical RD setup. This includes
Fuzzy and Kink RD designs, RD designs with multiple cutoffs, RD designs with multiple scores
and geographic RD designs, just to mention a few. We give references to other related designs at
the end of this section. We also include a discussion of the local nature of RD-based parameters,
and the resulting limitations to the external validity of any conclusions drawn from RD designs.
This monograph focuses almost exclusively on the practical aspects of RD analysis in Sharp
RD designs. However, as will become apparent throughout the manuscript, most methodological
discussions can be applied or easily extended to many (if not all) of the other RD designs encountered in practice. We will hint at some of these extensions in the upcoming subsections, as we discuss in
some detail the other types of RD designs.
The Sharp RD design is the canonical RD setup. In this design, all units whose score is above
the cutoff are assigned to the treatment condition and actually receive the treatment, and all
units whose score is below the cutoff are assigned to the control condition and do not receive the
treatment. This stands in contrast to the Fuzzy RD design, where some of the units fail to receive
the treatment despite having a score above the cutoff, and/or some units receive the treatment
despite having been assigned to the control condition. Every Fuzzy RD design can be analyzed as
a Sharp RD design, if the “treatment” status is downgraded to “intention-to-treat” status, as
is common in experimental settings with imperfect compliance. This fact is also applicable to the
other RD settings discussed below.
The difference between the Sharp and Fuzzy RD designs is illustrated in Figure 2.1, where
we plot the conditional probability of receiving treatment given the score, P(Di = 1|Xi = x), for
different values of the running variable Xi . As shown in Figure 2.1(a), in a Sharp RD design the
probability of receiving treatment changes exactly from zero to one at the cutoff. In contrast, in a
Fuzzy RD design, the change in the probability of being treated at the cutoff is always less than
one. Figure 2.1(b) illustrates a Fuzzy RD design where units with score below the cutoff comply
perfectly with the treatment, but compliance with the treatment is imperfect for units with score
above the cutoff. This case is sometimes called one-sided non-compliance and, of course, RD designs
can (and often will) exhibit two-sided non-compliance, where P(Di = 1|Xi = x) will be neither zero nor one for units with running variable Xi near the cutoff x̄. In the remainder of this subsection,
we focus on Sharp RD designs and thus assume that Di = Ti = 1(Xi ≥ x̄) for all units.
Following the causal inference literature, we adopt the potential outcomes framework and as-
sume that each unit has two potential outcomes, Yi (1) and Yi (0), corresponding, respectively, to
the outcomes that would be observed under treatment or control. In this framework, treatment
effects are defined in terms of comparisons between features of (the distribution of) both potential
outcomes, such as their means, variances or quantiles. Although every unit is assumed to have both
Yi (1) and Yi (0), these outcomes are called potential because only one of them is observed. If unit i
receives the treatment, we will observe Yi (1), the unit’s outcome under treatment—and Yi (0) will
remain latent or unobserved. Similarly, if i receives the control condition, we will observe Yi (0)
but not Yi (1). This results in the fundamental problem of causal inference, and implies that the
Figure 2.1: Conditional Probability of Receiving Treatment in Sharp vs. Fuzzy RD Designs
[Panel (a): in a Sharp RD design, the conditional probability of receiving treatment, P(Di = 1|Xi = x), jumps from zero to one at the cutoff. Panel (b): in a Fuzzy RD design, the probability of receiving treatment jumps at the cutoff by less than one.]
For now we adopt the usual econometric perspective that sees the data (Yi, Xi), i = 1, 2, . . . , n, as a random sample from a larger population, taking the potential outcomes (Yi(1), Yi(0)) as random vari-
ables. We consider an alternative perspective in Section 5 when we discuss inference in the local
randomization framework, employing ideas from the classical statistical literature on analysis of
experiments. In later sections we will also augment the basic models to account for pre-intervention
covariates and other empirically relevant features, which we omit at this stage to ease exposition
and ground ideas.
In the specific context of the Sharp RD design, the fundamental problem of causal inference
occurs because we only observe the outcome under control, Yi (0), for units whose score is below
the cutoff, and we only observe the outcome under treatment, Yi (1), for those units whose score
is above the cutoff. We illustrate this problem in Figure 2.2, which plots the average potential
outcomes given the score, E[Yi (1)|Xi = x] and E[Yi (0)|Xi = x], against the score. In statistics,
conditional expectation functions such as these are usually called regression functions. As shown in
Figure 2.2, the regression function E[Yi (1)|Xi ] is observed for values of the score to the right of the
cutoff—because when Xi ≥ x̄, the observed outcome Yi is equal to the potential outcome under
treatment, Yi (1), for every i. This is represented with the solid red line. However, to the left of the
cutoff, all units are untreated, and therefore E[Yi (1)|Xi ] is not observed (represented by a dashed
red line). A similar phenomenon occurs for E[Yi (0)|Xi ], which is observed for values of the score to
the left of the cutoff (solid blue line), Xi < x̄, but unobserved for Xi ≥ x̄ (dashed blue line). Thus,
the observed average outcome given the score is
E[Yi|Xi] = E[Yi(0)|Xi] if Xi < x̄, and E[Yi|Xi] = E[Yi(1)|Xi] if Xi ≥ x̄.
The Sharp RD design exhibits an extreme case of lack of common support, as units in the control (Di = Ti = 1(Xi ≥ x̄) = 0) and treatment (Di = Ti = 1(Xi ≥ x̄) = 1) groups cannot have the same value of the running variable (Xi). This feature sets RD designs apart from other non-experimental settings, and highlights one of their fundamental characteristics: extrapolation is unavoidable. As we discuss throughout this monograph, a major practical task of empirical work employing RD designs boils down to performing extrapolation in order to compare control and treatment units.
This unique feature of RD designs also makes causal interpretation of some parameters potentially
more difficult, though we do not discuss this issue further here as it does not change the practical
aspects underlying the analysis of RD designs. See Cattaneo et al. (2017d) for more discussion on
this point.
As seen in Figure 2.2, the average treatment effect at a given value of the score, E[Yi (1)|Xi =
x] − E[Yi (0)|Xi = x], is the vertical distance between the two regression curves at that value. This
distance cannot be directly estimated because we never observe both curves for the same value of
x. However, a special situation occurs at the cutoff x̄: this is the only point at which we “almost”
observe both curves. To see this, we imagine having units with score exactly equal to x̄, and units
with score barely below x̄—that is, with score x̄ − ε for a small and positive ε. The former units
would receive treatment, and the latter would receive control. Yet if the values of the average
potential outcomes at x̄ are not abruptly different from their values at points near x̄, the units
with Xi = x̄ and Xi = x̄ − ε would be very similar except for their treatment status, and we could
approximately calculate the vertical distance at x̄ using observed outcomes.
This notion of comparability between units with very similar values of the score but on opposite
sides of the cutoff is the fundamental concept on which all RD designs are based, and it was
first formalized by Hahn et al. (2001). These authors showed that, among other conditions, if the
regression functions E[Yi (1)|Xi = x] and E[Yi (0)|Xi = x], seen as functions of x, are continuous at
x = x̄, then in a Sharp RD design we have
E[Yi(1) − Yi(0)|Xi = x̄] = limx↓x̄ E[Yi|Xi = x] − limx↑x̄ E[Yi|Xi = x].    (2.1)
The result in Equation 2.1 says that, if the average potential outcomes are continuous functions
of the score at x̄, the difference between the limits of the treated and control average observed
outcomes as the score converges to the cutoff is equal to the average treatment effect at the cutoff.
[Figure 2.2: The regression functions E[Y(1)|X] and E[Y(0)|X] plotted against the score, each observed only on one side of the cutoff; the vertical distance between them at the cutoff is the Sharp RD treatment effect τSRD.]
We call this effect the Sharp RD treatment effect, defined as the right-hand side of Equation 2.1:

τSRD = limx↓x̄ E[Yi|Xi = x] − limx↑x̄ E[Yi|Xi = x] = E[Yi(1) − Yi(0)|Xi = x̄].
In words, τSRD captures the (reduced form) treatment effect for units with score values Xi = x̄. This
parameter answers the question: what would be the change in average response for control units
with score level Xi = x̄ had they received treatment? This treatment effect is, by construction, local
in nature and cannot inform about treatment effects at other levels of the score without additional
assumptions. We revisit this point further below when we discuss issues of extrapolation.
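To convey the idea behind Equation 2.1, one can crudely approximate the two one-sided limits by average outcomes within a small window around the cutoff. The sketch below uses an arbitrary window of 5 percentage points and is only meant to illustrate the logic; it is not the local polynomial estimator recommended and discussed in later sections.

    # Naive approximation of the Sharp RD effect: difference in mean outcomes
    # just above and just below the cutoff, within an arbitrary window h = 5
    h <- 5
    mean(Y[X >= 0 & X < h]) - mean(Y[X < 0 & X >= -h])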
In the Fuzzy RD design, the treatment is assigned based on whether the score exceeds the cutoff
x̄, but compliance with treatment is imperfect. As a consequence, the probability of receiving
treatment changes at x̄, but not necessarily from 0 to 1. This occurs, for example, when units with
score above the cutoff are eligible to participate in a program, but participation is optional.
Using our previous notation to distinguish between the treatment assignment, Ti , and the
treatment received, Di, in a Fuzzy RD design there are some units for which Ti ≠ Di. Because the
treatment received is not always equal to treatment assigned, now the treatment take-up variable
Di has two potential values, Di (1) and Di (0), corresponding, respectively, to the treatment taken
when the unit is assigned to treatment condition and the treatment taken when the unit is assigned
to the control condition. The observed treatment taken is Di = Ti · Di (1) + (1 − Ti ) · Di (0) and, as
occurred for the outcome Yi , the fundamental problem of causal inference now means that we do
not observe, for example, whether a unit that was assigned to the treatment condition and took the
treatment would have taken the treatment if it had been assigned to the control condition instead.
Notice that our notation also imposes additional restrictions on the potential outcomes, sometimes
called exclusion restrictions.
Under regularity conditions, the canonical parameter in the Fuzzy RD design, τFRD , is the “ratio”
between the Sharp RD parameter, τSRD, capturing the intention-to-treat effect, and the average effect of the treatment assignment on the treatment take-up, both at the cutoff, that is,

τFRD = E[Yi(1) − Yi(0)|Xi = x̄] / E[Di(1) − Di(0)|Xi = x̄].
Under additional conditions, such as monotonicity or local independence, this parameter can be
given a causal interpretation similar to the well-known Local Average Treatment Effect (LATE)
estimand in experiments with imperfect compliance (and other instrumental variable settings):
the Fuzzy RD parameter τFRD can be interpreted as a LATE at the cutoff for “compliers”. See, for
example, Imbens and Lemieux (2008) and Cattaneo et al. (2016a) for further discussion on the
interpretation of LATE-type estimands in RD designs.
Regardless of the interpretation attached to τFRD via additional identifying assumptions, this
popular parameter is identifiable and estimable from data because

τFRD = [limx↓x̄ E[Yi|Xi = x] − limx↑x̄ E[Yi|Xi = x]] / [limx↓x̄ E[Di|Xi = x] − limx↑x̄ E[Di|Xi = x]]
under continuity conditions on the regression functions. Thus, τFRD is the ratio of two Sharp RD
effects: the effect of Ti on Yi (the outcome equation or intention-to-treat effect), and the effect of
Ti on Di (the treatment equation or take-up effect). This implies that from a practical perspective,
analyzing Fuzzy RD designs is not more difficult than analyzing a ratio of Sharp RD designs,
and hence most estimation, inference and testing procedures naturally extend from sharp to fuzzy
settings under standard assumptions. Of course, specific issues for fuzzy RD designs do arise, such
as those related to small denominators (called “weak instruments” in the econometrics literature),
validity of exclusion restrictions (called “invalid instruments” in the econometrics literature) or
interpretation/extrapolation of estimands, just to mention a few. We offer some references to further
reading in the context of RD designs connected with these issues at the end of this section.
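In terms of software, this ratio structure means that a Fuzzy RD analysis only requires supplying the treatment take-up variable in addition to the outcome and the score. A minimal sketch with rdrobust is given below, where D is a hypothetical take-up indicator (the Meyersson application is sharp and has no such variable).

    library(rdrobust)

    # Fuzzy RD: the ratio of the outcome (intention-to-treat) and take-up effects
    summary(rdrobust(Y, X, c = 0, fuzzy = D))

    # The two ingredients can also be estimated separately as Sharp RD effects
    summary(rdrobust(Y, X, c = 0))   # effect of the assignment on the outcome
    summary(rdrobust(D, X, c = 0))   # effect of the assignment on the take-up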
In recent years, researchers have been particularly interested in RD parameters defined in terms of
derivatives of the regression functions at the cutoff, as opposed to the levels of the regression functions themselves. Generically, we refer to the Kink RD design as the setting where the goal is to
estimate the first derivative of the regression function, in which case the canonical parameters are
given by
τSKRD = (d/dx) E[Yi(1) − Yi(0)|Xi = x] |x=x̄

and

τFKRD = (d/dx) E[(Di(1) − Di(0))(Yi(1) − Yi(0))|Xi = x] |x=x̄ / (d/dx) E[Di(1) − Di(0)|Xi = x] |x=x̄ .
The Sharp Kink RD parameter τSKRD corresponds to the first derivative at the cutoff of τSRD ,
while the Fuzzy Kink RD parameter τFKRD corresponds to the ratio of first derivatives at the cutoff of
numerator and denominator entering τFRD . These parameters are written in this form because they
emerge generically as the probability limits of the plug-in RD estimators for the first derivatives
at the cutoff of different estimable functions. To be more precise, the Kink RD design parameters
τSKRD and τFKRD are generically identifiable and estimable from data because:
τSKRD = limx↓x̄ (d/dx) E[Yi|Xi = x] − limx↑x̄ (d/dx) E[Yi|Xi = x]

and

τFKRD = [limx↓x̄ (d/dx) E[Yi|Xi = x] − limx↑x̄ (d/dx) E[Yi|Xi = x]] / [limx↓x̄ (d/dx) E[Di|Xi = x] − limx↑x̄ (d/dx) E[Di|Xi = x]],
under appropriate regularity conditions.
[Left panel: the regression functions E[Y(1)|X] and E[Y(0)|X], with a kink in E[Y|X] at the cutoff. Right panel: their first derivatives ∂E[Y(1)|X]/∂X and ∂E[Y(0)|X]/∂X, whose difference at the cutoff is τSKRD.]
Beyond the basic (reduced form) interpretation of the kink parameters, these parameters have
featured in other related contexts. For example, τSKRD is of direct interest in Cerulli et al. (2017)
for analysis of local sensitivity of RD treatment effects. Furthermore, the parameter τSKRD up to
a known scale and the parameter τFKRD are both of interest in Card et al. (2015), where, under additional assumptions and in a different setting, they are estimated using RD methods.
The practical discussion given in this monograph also excludes Kink RD designs for space and
pedagogical reasons, but these estimands and estimators are readily available using the methods
presented in the upcoming sections. As in the case of Fuzzy RD designs, the researcher only needs
to specify additional information: the derivative of interest and, in the fuzzy case, the variable
Di . Furthermore, the Regression Kink Designs of Card et al. (2015) are readily available because,
from an implementation perspective, the variable Di can be taken to be continuous without loss of
generality. See also Card et al. (2017) for further discussion.
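For completeness, a hedged sketch of how kink estimands can be obtained with rdrobust is given below: the derivative of interest is selected with the deriv argument, and the fuzzy case additionally supplies a take-up variable D (hypothetical here). Using a local polynomial of order one above the derivative of interest, as in the calls below, is a common choice but not the only possibility.

    library(rdrobust)

    # Sharp Kink RD: change in the first derivative of the regression function at the cutoff
    summary(rdrobust(Y, X, c = 0, deriv = 1, p = 2))

    # Fuzzy Kink RD: ratio of the first-derivative changes for the outcome and the take-up
    summary(rdrobust(Y, X, c = 0, deriv = 1, p = 2, fuzzy = D))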
Another generalization of the RD design that is commonly seen in practice occurs when the treat-
ment is assigned using different cutoff values for different subgroups of units. In the standard RD
design, all units face the same cutoff value x̄; as a consequence, the treatment assignment rule
is Ti = 1(Xi ≥ x̄) for all units. In contrast, in the Multi-Cutoff RD design different units face
different cutoff values.
An example occurs in RD designs where the running variable Xi is a political party’s vote
share in a district, and the treatment is winning that district’s election. When there are only two
parties contesting the election, the cutoff for the party of interest to win the election is always 50%,
because the party’s strongest opponent always obtains (100 − Xi )% of the vote. However, when
there are three or more parties contesting the race and the winner is the party with the highest
vote share, the party can win the election barely in many different ways. For example, if there are
three parties, the party of interest could barely win with 50% of the vote against two opponents
who get, respectively, 49% and 1% of the vote; but it could also barely win with 39% of the vote
against two opponents who got 38% and 23%. Indeed, in this context, there is an infinite number
of ways in which one party can barely win the election—the party just needs to obtain a barely
higher vote share than the vote share obtained by its strongest opponent, whatever value the latter
takes.
Another common example occurs when a federal program is administered by sub-national units,
and each of the units chooses a different cutoff value to determine program eligibility. For example,
in order to target households that were most in need in a given locality, the Mexican conditional
cash transfer program Progresa determined program eligibility based on a household-level poverty
index. In rural areas, the cutoff that determined program eligibility varied geographically, with
seven distinct cutoffs used depending on the geographic location of each locality. This type of
situation arises in many other contexts where the cutoff for eligibility varies among the units in the
analysis.
Cattaneo et al. (2016a) introduced an RD framework based on potential outcomes and continuity
conditions to analyze Multi-Cutoff RD designs, and established a connection with the most
common practice of normalizing-and-pooling the information for empirical implementation. Suppose
that the cutoff is a random variable Ci , instead of a known constant, taking on J distinct values
C = {c1 , c2 , . . . , cJ }. The continuous case is discussed below, though in practice it is often hard to
implement RD designs with more than a few cutoff points due to data limitations. In a multi-cutoff
RD setting, the treatment assignment is generalized to Ti = 1(Xi ≥ Ci ), where Ci is a random
variable with support C. Of course, the single cutoff RD design is contained in this generalization
when C = {x̄} and thus P[Ci = x̄] = 1, though more generally P[Ci = c] ∈ (0, 1) for each c ∈ C.
In multi-cutoff RD settings, one approach commonly used in practice is to normalize the running
variable so that all units face the same common cutoff value at zero, and then apply the standard
RD design machinery to the normalized score and the common cutoff. To do this, researchers define
the normalized score X̃i = Xi − Ci , and pool all observations using the same cutoff of zero for all
observations in a standard RD design, with the normalized score used in place of the original
score. In this normalizing-and-pooling approach, the treatment assignment indicator is therefore
Ti = 1(Xi −Ci ≥ 0) = 1(X̃i ≥ 0) for all units. In the case of the single-cutoff RD design discussed so
far, this normalization is achieved without loss of generality, as the interpretation of the estimands remains unchanged; only the score (Xi ↦ X̃i) and cutoff (x̄ ↦ 0) change.
More generally, the normalize-and-pool strategy employing the score variable X̃i , usually called
the normalized (or centered) running variable, changes the parameters already discussed above
in an intuitive way: they become weighted averages of RD treatment effects for each cutoff value
underlying the original score variable. For example, the Sharp RD treatment effect now is
τ̄SRD = E[Yi(1) − Yi(0)|X̃i = 0] = Σc∈C τSRD(c) ω(c),   ω(c) = fX|C(c|c) P[Ci = c] / Σc∈C fX|C(c|c) P[Ci = c],
with τ̄SRD denoting the normalized-and-pooled sharp RD treatment effect, τSRD (c) denoting the
cutoff-specific sharp RD treatment effect, and fX|C (x|c) denoting the conditional density of Xi |Ci .
See Cattaneo et al. (2016a) for more details on notation and interpretation, and for analogous
results for Fuzzy and Kink RD designs.
the discussion in this monograph and/or its extension to fuzzy and kink designs applies directly:
under standard assumptions on the normalized score, we have the analogous identification result
to the standard Sharp RD design, given by

τ̄SRD = limx̃↓0 E[Yi|X̃i = x̃] − limx̃↑0 E[Yi|X̃i = x̃],
which implies that estimation and inference for τ̄SRD can proceed in the same way as in the standard
Sharp RD design with a single cutoff. Alternatively, by considering each subsample Ci = c with
c ∈ C, the methods discussed in this monograph can be applied directly to each cutoff point, and
then collected for further analysis and interpretation under additional regularity conditions. Either
way, as mentioned before, we focus exclusively on the practical aspects of implementing estimation,
inference and falsification for the single-cutoff Sharp RD design to conserve space and avoid side
discussions.
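As an illustration of the normalizing-and-pooling strategy, the sketch below assumes a hypothetical vector C recording each unit's cutoff; the cutoff value c1 used for the subsample analysis is likewise chosen only for illustration.

    library(rdrobust)

    # Normalize-and-pool: center each unit's score at its own cutoff
    Xtilde <- X - C
    summary(rdrobust(Y, Xtilde, c = 0))        # pooled RD effect at the common cutoff of zero

    # Cutoff-specific analysis: apply the same tools within each subsample
    c1 <- sort(unique(C))[1]                   # one of the cutoff values, for illustration
    summary(rdrobust(Y[C == c1], X[C == c1], c = c1))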
Yet another generalization of canonical RD designs occurs when two or more running variables
determine the treatment assignment, which by construction induces a multi-cutoff RD design with infinitely many cutoffs. For example, a grant or scholarship may be given to students who
score above a given cutoff in both a mathematics and a language exam. This leads to two running
variables—the student’s score in the mathematics exam and her score in the language exam—
and two (possibly different) cutoffs. Another popular example is related to geographic boundaries
inducing discontinuous treatment assignments. This type of design has been studied in Papay
et al. (2011), Reardon and Robinson (2012), and Wong et al. (2013) for generic multi-score RD
settings, and in Keele and Titiunik (2015) for geographic RD settings.
To allow for multiple running variables, we assume each unit’s score is a vector (instead of
a scalar as before) denoted by Xi . When there are two running variables, the score for unit i is
Xi = (X1i, X2i), and the treatment assignment is, for example, Ti = 1(X1i ≥ b1) · 1(X2i ≥ b2), where
b1 and b2 denote the cutoff points along each of the two dimensions. For simplicity, we assume the
potential outcome functions are Yi (1) and Yi (0), which implicitly imposes additional assumptions
(e.g., no spill-overs in a geographic setting). See Cattaneo et al. (2016a) for more discussion on this
type of restrictions on potential outcomes.
The parameter of interest changes, as discussed before in the context of Multi-Cutoff RD designs,
because there is no longer a single cutoff at which the probability of treatment assignment changes
discontinuously. Instead, there is a set of values at which the treatment changes discontinuously. To
continue our education example, assume that the scholarship is given to all students who score above
80 in the language exam and above 60 in the mathematics exam, letting X1i denote the language
score and X2i the math score, and b1 = 80 and b2 = 60 be the respective cutoffs. According to
this hypothetical treatment assignment rule, a student with score xi = (80, 59.9) is assigned to the
control condition, since 1(80 ≥ 80)· 1(59.9 ≥ 60) = 1·0 = 0, and misses the treatment only barely—
had she scored an additional 1/10 of a point in the mathematics exam, she would have received
the scholarship. Without a doubt, this student is very close to the cutoff criteria for receiving the
treatment. However, scoring very close to both cutoffs is not the only way for a student to be barely
assigned to treatment or control. A student with a perfect 100 score in language would still be barely
assigned to control if he scored 59.9 in the mathematics exam, and a student with a perfect math
score would be barely assigned to control if she got 79.9 points in the language exam. Thus, with
multiple running variables, there is no longer a single cutoff value at which the treatment status
of units changes from control to treated. Instead, the discontinuity in the treatment assignment
occurs along a boundary of points. This is illustrated graphically in Figure 2.5.
Figure 2.5: Example of RD Design With Multiple Scores: Treated and Control Areas
[The language score is on the horizontal axis and the mathematics score on the vertical axis; the language and mathematics cutoffs trace out an L-shaped boundary separating the treated area (both scores above their cutoffs) from the control area.]
Consider once again for simplicity a sharp RD design (or an intention-to-treat situation). The
parameter of interest in the Multi-Score RD design is therefore a generalization of the standard
Sharp RD design parameter, where the average treatment effect is calculated at all (or, more
empirically relevant, at some) points along the boundary between the treated and control areas,
that is, at points where the treatment assignment changes discontinuously from zero to one:

τSRD(b) = E[Yi(1) − Yi(0)|Xi = b],   b ∈ B,
where B denotes the boundary determining the control and treatment areas. For example, in the
hypothetical education example in Figure 2.5, B = {(x1, x2) : x1 = 80 and x2 ≥ 60} ∪ {(x1, x2) : x1 ≥ 80 and x2 = 60}. Under continuity conditions analogous to those in the single-score case, each of these effects is identifiable as

τSRD(b) = lim x→b, x∈Bt E[Yi|Xi = x] − lim x→b, x∈Bc E[Yi|Xi = x],   b ∈ B,
where Bt and Bc denote the treatment and control areas, respectively. In other words, for each
cutoff point along the boundary, the treatment effect at that point is identifiable by the observed
bivariate regression functions for each treatment group, just like in the single-score case. The only
conceptually important distinction is that Multi-Score RD designs generate a τSRD (b) family or curve
of treatment effects, one for each boundary point b ∈ B. For example, two potentially distinct sharp
RD treatment effects are τSRD (80, 70) and τSRD (90, 60).
An important special case of the RD design with multiple running variables is the Geographic
RD design, where the boundary B at which the treatment assignment changes discontinuously is
a geographic boundary that separates a geographic treated area from a geographic control area. A
typical Geographic RD design is one where the treated and control areas are adjacent administrative
units such as counties, districts, municipalities, states, etc., with opposite treatment status. In
this case, the boundary at which the treatment status changes discontinuously is the border that
separates the adjacent administrative units. For example, some counties in Colorado have all-mail
elections where voting can only be conducted by mail and in-person voting is not allowed, while
other counties have traditional in-person voting. Where the two types of counties are adjacent, the
administrative border between the counties induces a discontinuous treatment assignment between
in-person and all-mail voting, and a Geographic RD design can be used to estimate the effect of
adopting all-mail elections on voter turnout. This RD design can be formalized as a RD design
with two running variables, where the score Xi = (X1i , X2i ) contains two coordinates such as
latitude and longitude that determine the exact geographic location of unit i. In practice, the score
Xi = (X1i , X2i )—that is, the geographic location of each unit in the study—is obtained using
Geographic Information Systems (GIS) software, which allows researchers to locate each unit on
a map as well as to locate the entire treated and control areas, and all points on the boundary
between them.
For implementation, in both the geographic and non-geographic cases, there are two main
approaches mirroring the discussion for the case of Multi-Cutoff RD designs. One approach is the
equivalent of normalizing-and-pooling, while the other approach estimates many RD treatment
effects along the boundary. For example, consider first the latter approach in a sharp RD context:
the RD effect at a given boundary point b = (b1 , b2 ) ∈ B may be obtained by calculating each
unit’s distance to b, and using this one-dimensional distance as the unit’s running variable, giving
negative values to control units and positive values to treated units. Letting the distance between
a unit’s score Xi and a point x be di (x), we can re-write the above identification result as
\[
\tau_{\mathrm{SRD}}(b) = \lim_{d \downarrow 0} \mathrm{E}[\,Y_i \mid d_i(b) = d\,] \;-\; \lim_{d \uparrow 0} \mathrm{E}[\,Y_i \mid d_i(b) = d\,], \qquad b \in \mathcal{B}.
\]
The choice of distance metric di(·) depends on the particular application. A typical choice is the Euclidean distance
\[
d_i(b) = \sqrt{(X_{1i} - b_1)^2 + (X_{2i} - b_2)^2}.
\]
In practice, this approach is implemented for a finite collection of evaluation points along the boundary, and all the methods and discussion presented in this monograph can be applied to this case directly, one cutoff at a time. The
normalizing-and-pooling approach is also straightforward in the case of Multi-Score RD designs, as
the approach simply pools together all the units close to the boundary and conducts inference as in a
single-cutoff RD design.
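As a minimal illustration of this distance-based implementation, the following R sketch (not part of the monograph's replication code) collapses two scores into a signed distance for one boundary point of the hypothetical education example, assuming vectors X1 (language score), X2 (mathematics score), and an outcome Y are available:

# Hypothetical scores X1 (language) and X2 (mathematics); cutoffs 80 and 60
b1 <- 80; b2 <- 70                                # one boundary point b = (80, 70)
treated <- as.numeric(X1 >= 80 & X2 >= 60)        # sharp treatment assignment rule
dist.b  <- sqrt((X1 - b1)^2 + (X2 - b2)^2)        # Euclidean distance of each unit to b
score.b <- ifelse(treated == 1, dist.b, -dist.b)  # positive for treated, negative for control
# score.b is now a one-dimensional running variable with cutoff 0, so the
# single-cutoff methods discussed throughout this monograph apply directly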
As in the previous cases, we do not elaborate on practical issues for this specific setting to
conserve space and because all the main methodological recommendations, codes and discussions
apply directly. However, to conclude our discussion, we do highlight an important connection between RD designs with multiple running variables and RD designs with multiple cutoffs. In the Multi-Cutoff RD design, our discussion was based on a discrete set of cutoff points, which would also be the natural setting in Multi-Score RD applications. In that case, we can map
each cutoff point on the boundary to one of the cutoff points in C and each observation can be
assigned a running variable relative to each cutoff point via the distance function. With these two
simple modifications, any Multi-Score RD design can be analyzed as a Multi-Cutoff RD design over
finitely many cutoff points on the boundary. In particular, this implies that all the conclusions and
references given in the previous section apply to this case as well. See the supplemental appendix
of Cattaneo et al. (2016a) for more discussion on this idea and further generalizations.
All the RD parameters discussed in previous sections can be interpreted as causal in the sense
that they capture differences in some feature of the potential outcome under treatment, Yi(1), and
the potential outcome under control, Yi (0). However, in contrast to other causal parameters in
the potential outcomes framework, these average RD differences are calculated at a single point
on the support of a continuous random variable (Xi ) and as a result are very local causal effects
in nature. From some perspectives, the parameters cannot even be interpreted as causal as they
cannot be reproduced via manipulation (i.e., experimentation). Regardless of their status as causal parameters, RD treatment effects tend to have limited external validity, that is, limited ability to represent the treatment effects that would occur for units with scores farther
away from the cutoff. For example, in the case of the canonical Sharp RD design, the RD effect can
be interpreted graphically as the vertical difference between E[Yi (1)|Xi = x] and E[Yi (0)|Xi = x] at
the point where the score equals the cutoff, x = x̄. In the general case where the average treatment
effect varies as a function of the score Xi, as is common in applications, the RD effect may not
be informative of the average effect of treatment at values of x different from x̄. For this reason,
in the absence of specific (usually restrictive) assumptions about the global shape of the regression
functions, the effect recovered by the RD design is only the average effect of treatment for units
local to the cutoff, that is, for units with score values Xi = x̄.
How much can be learned from such local treatment effects will depend on each particular
application. For example, in the scenario illustrated in Figure 2.6(a), the vertical distance between
E[Yi (1)|Xi = x] and E[Yi (0)|Xi = x] at x = x̄ is considerably higher than at other points, such as
x = −100 and x = 100, but the effect is positive everywhere. A more heterogeneous scenario is
shown in Figure 2.6(b), where the effect is zero at the cutoff but ranges from positive to negative
at other points. Since in real examples the counterfactual (dotted) regression functions are not
observed, it is not possible to know with certainty the degree of external validity of any given RD
application. Increasing the external validity of the RD estimates is a topic of very active research
and, regardless of the approach taken, will necessarily require more assumptions. For example, extrapolation of RD treatment effects can be achieved by (i) imposing additional assumptions on the regression functions near the cutoff (Wing and Cook, 2013; Dong and Lewbel, 2015), (ii) imposing local independence assumptions (Angrist and Rokkanen, 2015), or (iii) exploiting specific features of the design such as imperfect compliance (Bertanha and Imbens, 2017) or the presence of multiple cutoffs (Cattaneo et al., 2017c). In this regard, RD designs are no different from experiments, as both require additional assumptions to map internally valid estimates into externally valid ones.
[Figure 2.6: Two scenarios for E[Y(1)|X] and E[Y(0)|X] as functions of the score, panels (a) and (b), with the Sharp RD effect τSRD marked at the cutoff.]
For an introduction to causal inference based on potential outcomes see, for example, Imbens and
Rubin (2015) and references therein. The RD design was originally proposed by Thistlethwaite and
Campbell (1960), and historical as well as early review articles are given by Cook (2008), Imbens
and Lemieux (2008) and Lee and Lemieux (2010). Lee (2008) provided an influential contribution
to the identification of RD effects and was the first to apply the RD design to elections. The
edited volume by Cattaneo and Escanciano (2017) provides a more recent overview of the RD
literature and includes several methodological and practical contributions. Most of the literature
focuses on average treatment effects, but quantile and distributional RD treatment effects have
also been considered; for example, see Shen and Zhang (2016), Chiang and Sasaki (2017) and
references therein. Finally, a recent and related literature on bunching and density discontinuities
is summarized by Kleven (2016) and Jales and Yu (2017).
3 RD Plots
An appealing feature of any RD design is that it can be illustrated graphically. This graphical
representation, in combination with the formal approaches to estimation, inference and falsification
discussed below, adds transparency to the analysis by plotting all (or a subset of) the observations
used for estimation and inference. RD plots also allow researchers to readily summarize the main
empirical findings as well as other important features of the work conducted. We now discuss the
most transparent and effective methods to plot the data and present estimated effects (and later conduct empirical falsification) in RD designs.
At first glance, it seems that one should be able to illustrate the relationship between the
outcome and the running variable by simply constructing a scatter plot of the observed outcome
against the score, clearly identifying the points above and below the cutoff. However, this strategy is
rarely useful, as it is often hard to see “jumps” or discontinuities in the outcome-score relationship
by simply looking at the raw data. We illustrate this point with the Meyersson application, plotting
female high school attainment against the Islamic vote margin using the raw observations. We create
this scatter plot in R with the plot command.
> plot (X , Y , xlab = " Running Variable " , ylab = " Outcome " , col = 1 ,
+ pch = 20)
> abline ( v = 0)
Every point in Figure 3.1 corresponds to one raw municipality-level observation in the dataset—
so there are 2,629 points in the scatter plot (see Table 1.1). Although this plot is helpful to visualize
the raw observations, detect outliers, etc., its effectiveness for visualizing the RD design is limited.
In the Meyersson application there is empirical evidence that the Islamic party’s victory translates
into a small increase in women’s educational attainment. Despite this evidence of a positive RD
treatment effect, a jump in the values of the outcome at the cutoff cannot be seen by simply looking
at the raw cloud of points around the cutoff in Figure 3.1. In general, raw scatter plots do not allow
for easy visualization of the RD effect even when the effect is large.
A more useful approach is to aggregate or “smooth” the data before plotting. The typical RD
plot presents two summaries: (i) a global polynomial fit, represented by a solid line, and (ii) local
sample means, represented by dots. The global polynomial fit is simply a smooth approximation to the unknown regression functions based on a fourth- or fifth-order polynomial regression of the outcome on the score, fitted separately above and below the cutoff using the original raw data.
In contrast, the local sample means are created by first choosing disjoint (i.e., non-overlapping)
intervals or “bins” of the score, calculating the mean of the outcome for all observations falling within each bin, and then plotting the average outcome in each bin against the midpoint of the bin, which can be interpreted as a non-smooth approximation of the unknown regression functions.

[Figure 3.1: Scatter plot of the raw observations, Outcome (female high school attainment) vs. Running Variable (Islamic margin of victory), Meyersson data.]
The combination of these two ingredients in the same plot allows researchers to visualize the global
or overall shape of the regression functions for treated and control observations, while at the same
time retaining enough information about the local behavior of the data to observe the RD treatment
effect and the variability of the data around the global fit. Importantly, in the standard RD plot,
the global polynomial is calculated using the original observations, not the binned observations.
For example, using the Meyersson data, if we use 20 bins of equal length on each side of
the cutoff, we partition the support of the Islamic margin of victory into 40 disjoint intervals of
length 5—recall that a party’s margin of victory ranges theoretically from −100 to 100, and in
practice the Islamic margin of victory ranges from −100 to 99.051. Table 3.1 shows the bins and
the corresponding average outcomes in this case, where we denote the bins by B−,1 , B−,2 , . . . , B−,20
(control group) and B+,1 , B+,2 , . . . , B+,20 (treatment group); that is, using the subscripts − and +
to indicate, respectively, bins located to the left and right of the cutoff. In that table, each local
sample average is computed as
\[
\bar{Y}_{-,j} = \frac{1}{\#\{X_i \in B_{-,j}\}} \sum_{i:\, X_i \in B_{-,j}} Y_i
\qquad \text{and} \qquad
\bar{Y}_{+,j} = \frac{1}{\#\{X_i \in B_{+,j}\}} \sum_{i:\, X_i \in B_{+,j}} Y_i .
\]
Table 3.1: Partition of Islamic Margin of Victory into 40 Bins of Equal Length—Meyersson Data

Bin                     Average Outcome in Bin    Number of Observations    Group Assignment
B−,20 = [−100, −95)     Ȳ−,20 = 4.6366            4                         Control
B−,19 = [−95, −90)      Ȳ−,19 = 10.8942           2                         Control
...                     ...                       ...                       ...
B−,3 = [−15, −10)       Ȳ−,3 = 17.0525            162                       Control
B−,2 = [−10, −5)        Ȳ−,2 = 12.9518            149                       Control
B−,1 = [−5, 0)          Ȳ−,1 = 13.8267            148                       Control
B+,1 = [0, 5)           Ȳ+,1 = 15.3678            109                       Treatment
B+,2 = [5, 10)          Ȳ+,2 = 13.9640            83                        Treatment
B+,3 = [10, 15)         Ȳ+,3 = 14.5288            56                        Treatment
...                     ...                       ...                       ...
B+,19 = [90, 95)        Ȳ+,19 = NA                0                         Treatment
B+,20 = [95, 100]       Ȳ+,20 = 10.0629           1                         Treatment
In order to create an RD plot corresponding to the binned outcome means in Table 3.1, with the addition of a fourth-order global polynomial fit estimated separately for treated and control observations, we use the rdplot command:
> out = rdplot(Y, X, nbins = c(20, 20), binselect = "esmv")
> print(out)
[Figure 3.2: RD Plot for Meyersson Data Using 40 Bins of Equal Length. Binned local means (dots) and global polynomial fits of the Outcome against the Running Variable.]
In Figure 3.2, the global fit reveals that the observed regression function seems to be non-linear—
particularly on the control side. At the same time, the binned local means let us see the variability
around the global fit. The plot also reveals a positive jump at the cutoff: the average share of
female high school attainment seems to be higher for those municipalities where the Islamic party
obtained a barely positive margin of victory than in those municipalities where the Islamic party
lost narrowly.
The type of information conveyed by Figures 3.1 and 3.2 is very different. In order to facilitate
their comparison, we reproduce them side by side in Figure 3.3. In the raw scatter plot (Figure
3.3(a)), it is difficult to see any systematic pattern, and there is no visible discontinuity in the
average outcome at the cutoff. In contrast, when we use 20 bins on each side of the cutoff to bin
the data and include the global polynomial fit (Figure 3.3(b)), the plot now allows us to see a
discontinuity at the cutoff and to better understand the shape of the underlying regression function
over the whole support of the running variable.
The differences between the plots clearly show that binning the data may reveal striking patterns
that can remain hidden in a simple scatter plot. Since binning leads to such drastic differences, a
natural question is how many bins should be chosen, and what kinds of properties are desirable in
the chosen bins. In Figure 3.2, we chose 20 bins of equal length on either side of the cutoff—but
we could have chosen 10 or 40, a decision that could have affected the conclusions drawn from the
plot. Choosing the number and type of bins in an ad-hoc manner compromises the transparency
and replicability of the RD plots, and leaves researchers uncertain about the underlying properties
of this smoothing strategy. As we now discuss, a more desirable approach is to choose the type and number of bins in a data-driven and automatic way.
[Figure 3.3: Raw scatter plot (a) and binned RD plot with 20 bins on each side of the cutoff and global polynomial fit (b), shown side by side for the Meyersson data.]
We first discuss two different types of bins that can be used in the construction of RD plots. When
we partition the running variable in bins, we may employ bins of equal length as in Table 3.1 or,
alternatively, we may employ bins that contain (roughly) the same number of observations but
whose length may differ. We refer to these two bin types as evenly-spaced bins and quantile-spaced
bins, respectively.
In order to define the bins more precisely, we assume that the running variable takes values
inside the interval [xl , xu ]. In other words, xl is the lowest value and xu the highest value that
the score may take. In the Meyersson application, xl = −100 and xu = 100. We continue to use
the subscripts + and − to denote treated and control observations, respectively. The bins are
constructed separately for treated and control observations; thus, the control bins partition [xl , x̄)
in non-overlapping intervals, and the treated bins partition [x̄, xu ] in non-overlapping intervals—
recall that x̄ is the RD cutoff. We use J− to denote the total number of bins chosen to the left of
the cutoff, and J+ to denote the total number of bins chosen to the right of the cutoff. Using this notation, the control and treated bins are defined as
\[
B_{-,j} = \begin{cases} [x_l,\, b_{-,1}) & j = 1 \\ [b_{-,j-1},\, b_{-,j}) & j = 2, \cdots, J_{-} - 1 \\ [b_{-,J_{-}-1},\, \bar{x}) & j = J_{-} \end{cases}
\qquad \text{and} \qquad
B_{+,j} = \begin{cases} [\bar{x},\, b_{+,1}) & j = 1 \\ [b_{+,j-1},\, b_{+,j}) & j = 2, \cdots, J_{+} - 1 \\ [b_{+,J_{+}-1},\, x_u] & j = J_{+}, \end{cases}
\]
with b−,0 < b−,1 < · · · < b−,J− and b+,0 < b+,1 < · · · < b+,J+ . In other words, the union of the
control and treated bins, B−,1 ∪ B−,2 ∪ · · · ∪ B−,J− ∪ B+,1 ∪ B+,2 ∪ · · · ∪ B+,J+ , forms a disjoint partition
of the support of the running variable, [xl , xu ], centered at the cutoff x̄.
Letting X−,(i) and X+,(i) denote the i-th quantile of the control and treatment subsamples, re-
spectively, and b·c denote the floor function, we can now formally define evenly-spaced and quantile-
spaced bins.
• Evenly-spaced (ES) bins: non-overlapping intervals that partition the entire support of the running variable, all of the same length within each treatment assignment status. Their break points are
\[
b_{-,j} = x_l + j \cdot \frac{\bar{x} - x_l}{J_{-}}, \quad j = 0, 1, \dots, J_{-},
\qquad\text{and}\qquad
b_{+,j} = \bar{x} + j \cdot \frac{x_u - \bar{x}}{J_{+}}, \quad j = 0, 1, \dots, J_{+}.
\]
• Quantile-spaced (QS) bins: non-overlapping intervals that partition the entire support of the running variable, all containing (roughly) the same number of observations within each treatment assignment status. Their break points are set at empirical quantiles of the score within each group,
\[
b_{-,j} = X_{-,(\lfloor j\, n_{-}/J_{-} \rfloor)}, \quad j = 1, \dots, J_{-} - 1,
\qquad\text{and}\qquad
b_{+,j} = X_{+,(\lfloor j\, n_{+}/J_{+} \rfloor)}, \quad j = 1, \dots, J_{+} - 1,
\]
with b−,0 = xl , b−,J− = x̄, b+,0 = x̄, and b+,J+ = xu , where n− and n+ denote the number of control and treated observations.
Note that the length of QS bins may differ even within treatment assignment status; the bins
will be larger in regions of the support where there are fewer observations. QS bins will be
evenly-spaced, for example, in the (unusual) case that the running variable has unique values
uniformly spaced-out over [xl , xu ].
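To fix ideas, the following R sketch (a simple illustration, not the procedure implemented in rdplot) constructs ES and QS break points by hand for the control group in the Meyersson data, using the vectors Y and X and a cutoff at zero:

# Manual construction of 20 ES and 20 QS bins for control observations (X < 0)
x.bar <- 0
X.minus <- X[X < x.bar]; Y.minus <- Y[X < x.bar]
J.minus <- 20
# ES break points: bins of equal length over the control support
es.breaks <- seq(min(X.minus), x.bar, length.out = J.minus + 1)
# QS break points: bins containing (roughly) the same number of observations
qs.breaks <- quantile(X.minus, probs = seq(0, 1, length.out = J.minus + 1))
# Local means of the outcome within each type of bin (half-open bins [a, b), as in Table 3.1)
es.means <- tapply(Y.minus, cut(X.minus, breaks = es.breaks, right = FALSE, include.lowest = TRUE), mean)
qs.means <- tapply(Y.minus, cut(X.minus, breaks = qs.breaks, right = FALSE, include.lowest = TRUE), mean)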
In practical terms, the most important difference between ES and QS bins is the underlying
variability of the local mean estimate in every bin. Although ES bins have equal length, if the
observations are not uniformly distributed in the support of the running variable [xl , xu ], each
bin may contain a different number of observations. This means that in an RD plot with evenly-
spaced bins, each of the local means represented by a dot may be computed using a different
number of observations and thus may be more or less variable than the other local means in the
plot. For example, if there are many more observations near the cutoff than far away from it,
the local mean estimates in the farthest bins will be much more variable than the local mean
estimates near the cutoff. Thus, when the data is not approximately uniformly distributed in
[xl , xu ], the dots representing the local means in an evenly-spaced RD plot may not be directly
comparable. For example, Table 3.1 shows that there are only 4 observations in [−100, −95], and
only 2 observations in [−95, −90]; thus, the variance of these local mean estimates is very high
because they are constructed with very few observations.
To illustrate the differences between binning strategies, we again use the rdplot command but
this time specifying the desired type of bin via the binselect option. We reproduce the full output
of rdplot, which includes several descriptive statistics in addition to the actual plot. First, we
reproduce the RD plot in Figure 3.2 above, using 20 evenly-spaced bins on each side, including the
full output, which we now explain in detail:
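Presumably, the output below is produced by a call of the following form, where nbins fixes 20 bins per side and binselect = "es" requests evenly-spaced bins:

> out = rdplot(Y, X, nbins = c(20, 20), binselect = "es")
> print(out)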
Method :
Left Right
Number of Obs . 2314 315
Polynomial Order 4 4
Scale 2 3
Selected Bins 20 20
Average Bin Length 5.0000 4.9526
Median Bin Length 5.0000 4.9526
[Figure 3.4: RD plot with 20 evenly-spaced bins on each side of the cutoff (Outcome vs. Running Variable), Meyersson data.]
The total number of observations is shown in the top right of the panel, where we can also see the type of weights used to plot the observations. We have 2,629 observations in total, which by default are all given equal or uniform weight, as indicated by Kernel = Uniform. The rest of the output is divided into two columns, one corresponding to observations assigned to control
and located to the left of the cutoff (indicated by c in the output), and another corresponding to
observations assigned to treatment and located to the right of the cutoff.
The top output panel shows that there are 2,314 observations to the left of the cutoff, and 315 to the right of the cutoff, consistent with our descriptive analysis indicating that the Islamic party
loses the majority of the electoral races. The third row in the top panel indicates that the global
polynomial fit used in the RD plot is of order 4 on both sides of the cutoff. The fourth row indicates
the window or bandwidth h where the global polynomial fit was conducted; the global fit uses all
observations in [c − h, c) on the control side, and all observations in [c, c + h] on the treated side.
By default, all observations to the left of the cutoff are included in the left fit, and all observations
to the right of the cutoff are included in the right fit. In this case, because the range of the Islamic
margin of victory is [−100, 99.051], the bandwidth on the right is slightly smaller than 100. This
occurs because in our data there are no observations where the Islamic party wins an uncontested
election. Finally, the last row on the top panel shows the scale selected, which is an optional factor by which the chosen number of bins can be multiplied to either increase or decrease the original number of bins used in the plot.
The lower output panel shows results on the number and type of bin selected. The top two rows
show that we have selected 20 bins to the left of the cutoff, and 20 bins to the right of the cutoff.
On the control side, the length of each bin is exactly (x̄ − xl)/J− = (0 − (−100))/20 = 100/20 = 5. However, the actual length of the ES bins to the right of the cutoff is slightly smaller than 5, as the edge of the support on the treated side is 99.051 instead of 100. The actual length of the bins to the right of the cutoff is (xu − x̄)/J+ = (99.051 − 0)/20 = 99.051/20 = 4.9526. We postpone discussion of the five bottom rows
until the next subsection where we discuss optimal bin number selection.
We now compare this plot to an RD plot that also uses 20 bins on each side, but uses quantile-
spaced bins instead of evenly-spaced bins by setting the option binselect = "qs".
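Presumably, the corresponding call is:

> out = rdplot(Y, X, nbins = c(20, 20), binselect = "qs")
> print(out)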
Method :
Left Right
Number of Obs . 2314 315
Polynomial Order 4 4
Scale 1 1
Selected Bins 20 20
Average Bin Length 4.9951 4.9575
Median Bin Length 2.9498 1.0106
For easy comparison, Figure 3.6 reproduces side-by-side the RD plots in Figures 3.4 and 3.5:
Figure 3.6(a) reproduces the evenly-spaced RD plot, while Figure 3.6(b) reproduces the quantile-
spaced RD plot. Both plots use the Meyersson data and have 20 bins on each side of the cutoff; the
only difference between them is the type of bin—ES vs. QS—that is used in each case.
One of the main differences between these two plots is that they show where the data is located.
In the evenly-spaced RD plot in Figure 3.6(a), there are five bins in the interval [-100,-75] of the
running variable. In contrast, in the quantile-spaced RD plot in Figure 3.6(b), this interval is entirely contained in the first bin.

[Figure 3.5: RD plot with 20 quantile-spaced bins on each side of the cutoff (Outcome vs. Running Variable), Meyersson data.]

The vast difference in the length of QS and ES bins occurs because, as
shown in Table 3.1, there are very few observations near −100, which leads to local mean estimates
with high variance. This problem is avoided if we choose quantile-spaced bins, since the procedure
ensures that each bin has the same number of observations.
Once a decision to use either quantile-spaced or evenly-spaced bins has been made, which determines
the position of the bins, the only remaining choice for practical implementation of RD plots is the
total number of bins on either side of the cutoff: the quantities J− and J+ . Thus, given a choice of
QS or ES bins, a data-driven and automatic RD plot can be produced as long as one has a data-
driven and automatic way for selecting J− and J+ . Below we discuss two such methods to choose
the number of bins. Both methods set up the problem of choosing J− and J+ in an automatic,
data-driven way, where the chosen values of J− and J+ are those that either optimize or satisfy a
particular criterion. To be more specific, the procedure involves constructing asymptotic expansions
of the (integrated) variance and squared bias of the local means under ES or QS bins, and then
choosing the values of J− and J+ that either minimize or satisfy particular restrictions on functions
of these expansions.
[Figure 3.6: RD plots with 20 bins on each side of the cutoff: (a) evenly-spaced bins, (b) quantile-spaced bins. Meyersson data.]
The first method we consider selects the values of J− and J+ that minimize an asymptotic approx-
imation to the integrated mean-squared error (IMSE) of the local means estimator, that is, the
sum of the expansions of the (integrated) variance and squared bias. The resulting choice of bins is
therefore IMSE-optimal, implying that the chosen values of J− and J+ optimally balance or “trade
off” squared-bias and variance of the local sample means when viewed as a local estimator of the
underlying unknown regression functions. If we choose a large number of bins, we have a small
bias because the bins are smaller and the local constant fit is better; but this reduction in bias
comes at a cost, as increasing the number of bins leads to fewer observations per bin and therefore
more variability within bin. The IMSE-optimal J− and J+ are the numbers of bins that balance
squared-bias and variance so that the IMSE is (approximately) minimized.
By construction, choosing an IMSE-optimal number of bins will result in binned sample means
that “trace out” the underlying regression function. For this reason, this method is most useful to
assess the overall shape of the regression function, perhaps to identify potential discontinuities in
these functions that occur far from the cutoff. In general, however, an IMSE-optimal number of
bins will tend to produce a very smooth plot where the local means nearly overlap with the global
polynomial fit. For this reason, this method of choosing bins is not always the most appropriate to
capture the local variability of the data near the cutoff.
More formally, the IMSE-optimal choices are
\[
J^{\mathrm{IMSE}}_{-} = \left\lceil C^{\mathrm{IMSE}}_{-} \cdot n^{1/3} \right\rceil
\qquad \text{and} \qquad
J^{\mathrm{IMSE}}_{+} = \left\lceil C^{\mathrm{IMSE}}_{+} \cdot n^{1/3} \right\rceil,
\]
where n is the total number of observations, ⌈·⌉ denotes the ceiling operator, and the exact form of the constants C−IMSE and C+IMSE depends on whether ES or QS bins are used (and some features of the underlying data generating process). In practice, the unknown constants C−IMSE and C+IMSE are estimated using preliminary, objective data-driven procedures.
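The output below presumably comes from a call in which nbins is left unspecified, so that the IMSE-optimal number of evenly-spaced bins is selected automatically:

> out = rdplot(Y, X, binselect = "es")
> print(out)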
Method :
Left Right
Number of Obs . 2314 315
Polynomial Order 4 4
Scale 1 1
Selected Bins 11 7
Average Bin Length 9.0909 14.1501
Median Bin Length 9.0909 14.1501
The output reports both the average length of the bins, and the median length of the bins. In
the ES case, since each bin has the same length, each bin has length equal to both the average and
the median bin length on each side. The IMSE criterion leads to different numbers of ES bins above
and below the cutoff. As shown in the Selected Bins row in the bottom panel, the IMSE-optimal
number of bins to the left of the cutoff is 11, while the optimal number of bins above the cutoff is only 7. As a result, the length of the bins above and below the cutoff is different: above the cutoff, each bin has a length of 14.1501 percentage points, while below the cutoff the bins are smaller, with a length of 9.0909.

[Figure 3.7: RD plot with an IMSE-optimal number of evenly-spaced bins (Outcome vs. Running Variable), Meyersson data.]

The middle rows show the optimal number of bins according to the IMSE
criterion (which coincides with the selected number of bins because this is the criterion chosen), and
the mimicking variance criterion which we discuss below. The bottom three rows show the weights
that the chosen bins give to the variance term relative to the bias term in the IMSE objective
function—when the IMSE criterion is used, these weights are always equal to 1/2.
To produce an RD plot that uses an IMSE-optimal number of quantile-spaced bins, we use the
option binselect = "qs" instead of binselect = "es".
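Presumably, the call is:

> out = rdplot(Y, X, binselect = "qs")
> print(out)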
Method :
Left Right
Number of Obs . 2314 315
Polynomial Order 4 4
Scale 1 1
Selected Bins 21 14
Average Bin Length 4.7572 7.0821
Median Bin Length 2.8327 1.4289
[Figure 3.8: RD plot with an IMSE-optimal number of quantile-spaced bins (Outcome vs. Running Variable), Meyersson data.]
Note that the IMSE-optimal number of QS bins is much larger on both sides, with 21 bins below
the cutoff and 14 above it, versus 11 and 7 in the analogous ES plot in Figure 3.7. The average bin
length is 4.7572 below the cutoff, and 7.0821 above it. As expected, the median length of the bins
is much smaller than the average length on both sides of the cutoff, particularly above. Since there
are very few observations where the Islamic vote margin is above 50%, the length of the last bin
above the cutoff must be very large in order to ensure that it contains 315/14 ≈ 22 observations.
Figure 3.9 reproduces the evenly-spaced and quantile-spaced IMSE-optimal RD plots side-by-
side. As shown, the ES IMSE-optimal bins produce local means that trace the global polynomial
fit closely, and do not reveal much variability near the cutoff. In contrast, the IMSE-optimal QS
bins give a better idea of where most of the observations are located, and because there are more
bins on each side, they produce a plot that reveals more local variability.
[Figure 3.9: RD plots with an IMSE-optimal number of bins: (a) evenly-spaced bins, (b) quantile-spaced bins. Meyersson data.]
Since one of the most important roles of the RD plot is to illustrate the behavior of the data near
the cutoff, the oversmoothed RD plot that may result from choosing an IMSE-optimal number of
bins—particularly when using ES bins—is not always desirable. An alternative is to select bins
in a way that guarantees a sufficiently large number of local means so that researchers can easily
get a graphical representation of the variability of the data near the cutoff (and elsewhere in the
support).
The second fully automatic and data-driven method to select the number of bins reaches this goal by selecting the values of J− and J+ so that the binned means have an asymptotic (integrated) variability that is approximately equal to the variability of the raw data. In other words, the number of bins is chosen so that the overall variability of the binned means “mimics” the overall variability in the raw scatter plot of the data. In the Meyersson application, this method involves choosing J− and J+ so that the binned means have a total variability approximately equal to the variability illustrated in Figure 3.1. We refer to these choices of the total number of bins as mimicking-variance (MV) choices.
The MV choices of J− and J+ again depend on the sample size n and on constants C−MV and C+MV whose exact form depends on whether ES or QS bins are used (and some features of the underlying data generating process). These constants are different from those appearing in the IMSE-optimal choices and, in practice, are also estimated using preliminary, objective data-driven procedures.
In general, $J_{-}^{\mathrm{MV}} > J_{-}^{\mathrm{IMSE}}$ and $J_{+}^{\mathrm{MV}} > J_{+}^{\mathrm{IMSE}}$. That is, the MV method to select the number of bins
leads to a larger number of bins than the IMSE method, resulting in an RD plot with more dots
representing local means and thus giving a better sense of the variability of the data. In order to
produce an RD plot with ES bins and a MV total number of bins on either side, we use the option
binselect = "esmv".
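Presumably, the corresponding call is:

> out = rdplot(Y, X, binselect = "esmv")
> print(out)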
Method :
Left Right
Number of Obs . 2314 315
Polynomial Order 4 4
Scale 4 11
Selected Bins 40 75
Average Bin Length 2.5000 1.3207
Median Bin Length 2.5000 1.3207
[Figure 3.10: RD plot with a mimicking-variance number of evenly-spaced bins (Outcome vs. Running Variable), Meyersson data.]
This produces a much higher number of bins than we obtained with both ES and QS bins under
the IMSE criterion. The MV total number of bins is 40 below the cutoff and 75 above the cutoff,
with length 2.5 and 1.3207, respectively. The difference in the chosen number of bins between the
IMSE and the MV criteria is dramatic. The middle rows in the bottom panel show the number of
bins that would have been produced according to the IMSE criterion (11 and 7) and the number
of bins that would have been produced according to the MV criterion (40 and 75). This allows for
a quick comparison between both methods. In this example, we see that the MV criterion leads to
approximately four times (below the cutoff) and ten times (above the cutoff) as many bins as the
number of bins produced by the IMSE criterion. The bottom rows indicate that the chosen number
of MV bins on both sides of the cutoff is equivalent to the number of bins that would have been
chosen according to an IMSE criterion where, instead of giving the bias and the variance each a weight of 1/2, the relative weights of the variance and bias had been, respectively, 0.0204 and 0.9796 below the cutoff, and 0.0008 and 0.9992 above the cutoff. Thus, we see that if we want to justify the
MV choice in terms of the IMSE criterion, we must weight the bias much more than the variance.
Finally, to obtain an RD plot that chooses the total number of bins according to the MV
criterion but uses QS bins, we use the option binselect = "qsmv".
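Presumably, the call is:

> out = rdplot(Y, X, binselect = "qsmv")
> print(out)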
Method :
Left Right
Number of Obs . 2314 315
Polynomial Order 4 4
Scale 2 3
Selected Bins 44 41
Average Bin Length 2.2705 2.4183
Median Bin Length 1.3755 0.5057
[Figure 3.11: RD plot with a mimicking-variance number of quantile-spaced bins (Outcome vs. Running Variable), Meyersson data.]
Below the cutoff, the MV number of bins is very similar to the MV choice for ES bins (44
versus 40). However, above the cutoff, the MV number of bins for QS bins is much lower than for
MV ES bins (41 versus 75). This occurs because, although the range of the running variable is
[−100, 99.051], there are very few observations in the intervals [−100, −50] and [50, 100] far from
the cutoff, which leads to high variability; at the same time, choosing ES bins forces the length of
the bins to be the same everywhere in the support. Thus, in order to produce small enough bins
to adequately mimic the overall variability of the scatter plot in regions with few observations, the
number of bins has to be large. In contrast, QS bins can be short near the cutoff and long away
from the cutoff, so they can mimic the overall variability by adapting their length to the density
of the data. Figure 3.12 reproduces the two MV RD plots—one using ES bins and the other using
QS bins.
[Figure 3.12: RD plots with a mimicking-variance number of bins: (a) evenly-spaced bins, (b) quantile-spaced bins. Meyersson data.]
Since the IMSE-optimal bins are better suited to trace out the regression function rather than illus-
trate the local variability, and given that the overall shape of the regression function is represented
by the global fit, we recommend choosing MV bins for depicting the overall RD design in the first
place. The IMSE-optimal number of bins will be useful to identify potential discontinuities in the underlying regression functions, especially when contrasted with the global polynomial
fits, as we will discuss in more detail in Section 6. Both QS and ES bins are useful in their own
right for describing the RD design, and hence it is useful to report both RD plots side-by-side in
applications. The QS RD plot gives a more accurate representation of the concentration of the
observations along the support of the score, while the ES RD plot provides similar information
but without being influenced by the underlying distribution of the score. We recommend using a
4th-order or 5th-order polynomial for the global fit over extended supports of the score, and lower-order polynomials for restricted supports (we explicitly discuss issues of global vs. local polynomial fitting in the upcoming section). Having said this, we caution against interpreting the specific jump at the cutoff as a valid treatment effect estimator because, as we will discuss in the upcoming sec-
tion, global polynomial fits tend to perform very poorly at boundary points. Finally, RD plots are
useful not only to present the overall RD design and motivate its empirical falsification, but also
to depict the specific RD analysis local to the cutoff, as we will illustrate in the following sections.
A detailed discussion of RD plots and formal methods for automatic data-driven bin selection are
given by Calonico et al. (2015a). That paper formalized the commonly used RD plots with evenly-
spaced binning, introduced RD Plots with quantile-spaced binning, and developed optimal choices
for the number of bins in terms of both integrated mean squared error and mimicking variance
targets. RD plots are special cases of nonparametric partitioning estimators—see, e.g., Cattaneo
and Farrell (2013) and references therein.
4 The Continuity-Based RD Approach
This section discusses empirical methods based on continuity assumptions and extrapolation for
estimation and inference in RD designs, which rely on large sample approximations with random
potential outcomes under repeated sampling. These methods offer tools useful not only for esti-
mation of and inference on main treatment effects, but also for falsification and validation of the
design—discussed in Section 6. The approach discussed here is based on formal statistical meth-
ods and hence leads to disciplined and objective empirical analysis, which typically has two related but distinct goals: point estimation of the RD treatment effect (i.e., giving a scalar estimate of the vertical distance between the regression functions at the cutoff) and statistical inference about the RD treatment effect (i.e., constructing valid hypothesis tests and confidence intervals to establish the
values of the RD parameter that are most supported by our data).
The methods discussed in this section are based on the continuity conditions underlying Equation
(2.1), and generalizations thereof. This framework for RD analysis, which we call the continuity-
based RD framework, uses methodological tools that directly rely on continuity (and differentia-
bility) assumptions and define τSRD as the parameter of interest. In this framework, estimation
typically proceeds by using (local to the cutoff) polynomial methods to model or approximate
the regression function E[Yi |Xi = x] on each side of the cutoff separately. In practical terms, this
involves using least-squares methods to fit a polynomial of the observed outcome on the score.
When all the observations are used for estimation, these polynomial fits are global or parametric in
nature, like those used in the default RD plots discussed in the previous section. In contrast, when
estimation employs only observations with scores near the cutoff, the polynomial fits are local or
“nonparametric”. Local polynomial methods are by now the standard framework for RD empirical
analysis, as they offer a good compromise between flexibility and simplicity.
A fundamental feature of the RD design is that, in general, there are no observations whose score
Xi equals the cutoff value x̄: because the running variable is assumed continuous, there are no (or at most very few) observations whose score is exactly x̄ or very nearly so. Thus, extrapolation in RD
designs is unavoidable in general. In other words, in order to form estimates of the average response
of control units at the cutoff, E[Yi (0)|Xi = x̄], and of the average response of treatment units at the
cutoff, E[Yi (1)|Xi = x̄], we must rely on observations further away from the cutoff to approximate
the unknown regression functions. In the Sharp RD design, for example, the treatment effect τSRD is
the vertical distance between the E[Yi (1)|Xi = x] and E[Yi (0)|Xi = x] at x = x̄, as shown in Figure
2.2, and thus estimation and inference proceeds by first approximating the unknown regression
functions to then compute the estimated treatment effect and/or the statistical inference procedure
of interest. In this context, the key practical issue in RD design analysis is how the approximation
of the regression functions is done, as this will have major effects on the robustness and credibility
of the empirical findings.
The problem of approximating an unknown function is well understood: any sufficiently smooth
function can be well approximated by a polynomial function, locally or globally. Applied to the RD
point estimation problem, this result suggests that the unknown regression functions E[Yi (t)|Xi =
x], t = 0, 1, can in principle be well approximated by a polynomial function of the score, up to ran-
dom sampling error. Early empirical work employed this idea globally, that is, tried to approximate
the unknown regression functions using flexible higher-order polynomials, usually 4th or 5th order,
over the entire support of the data. This global approach is still used in RD plots, as illustrated
in the previous section, because there the goal is to approximate the entire unknown regression
functions. However, it is now widely recognized that this global polynomial approach does not de-
liver point estimators and inference procedures with good properties for the main object of interest:
the RD treatment effect. The reason is that global polynomial approximations tend to do a good
job overall but a very bad job at boundary points—this problem is known as Runge's phenomenon in approximation theory. Put differently, global polynomial approximations tend to have
very erratic behavior near boundary points and induce counter-intuitive weighting schemes on the
observations when the goal is to estimate the unknown function at the boundary point. Because
RD treatment effects are defined at the boundary point of the control and treatment supports, the global polynomial approach is suspect in practice. For example, distant points and/or
outliers may severely affect the global polynomial RD point estimator. In sum, global polynomials
can lead to invalid estimates, and thus the conclusions from a global parametric RD analysis can
be highly misleading. For these reasons, we recommend against using global polynomial methods
for RD analysis.
Modern and principled RD empirical work employs local polynomial methods, which focus on
approximating the regression functions only near the cutoff. Because this approach localizes the
polynomial fit to the cutoff (discarding all other observations sufficiently further away) and employs
a low-order polynomial approximation (usually linear or quadratic), it is substantially more robust
and less sensitive to boundary-related problems. Furthermore, this approach can be viewed formally as a nonparametric local polynomial approximation, which has also aided the development of a large toolkit of statistical and econometric results. In contrast to global higher-order polynomials,
local lower-order polynomial approximations can be viewed as intuitive approximations with a
potential misspecification of the functional form of the regression function, which can be modeled
and understood formally, while at the same time are less sensitive to outliers or other extreme
features in the data generating process away from the cutoff.
Modern empirical work in RD designs employs local polynomial methods using only observations
close to the cutoff and taking them as local approximations, not necessarily as correctly specified
models. Not surprisingly, the statistical properties of local polynomial estimation and inference
crucially depend on the accuracy of the approximation near the cutoff, which is controlled by
the size of the neighborhood or bandwidth around the cutoff where the local polynomial is fit.
In the upcoming subsections, we discuss the modern local polynomial methods for RD analysis,
and explain all the steps involved in their implementation for both estimation and inference. We
also discuss several extensions, including inclusion of covariates and use of cluster-robust standard
errors.
Local polynomial methods estimate the desired polynomials using only observations near the cutoff
point, separately for control and treatment units. This approach uses only observations that are
between x̄ − h and x̄ + h, where h > 0 is some chosen bandwidth. Moreover, within this bandwidth,
observations closer to x̄ often receive more weight than observations further away, where the weights
are determined by a kernel function K(·). This local polynomial approach can be understood and
analyzed formally as nonparametric, in which case the fit is taken as an approximation of the
unknown underlying regression functions within the region determined by the bandwidth used.
1. Choose a polynomial order p and a kernel function K(·).

2. Choose a bandwidth h.
3. For observations above the cutoff (i.e., observations with score Xi ≥ x̄), fit a weighted least
squares regression of the outcome Yi on a constant and (Xi − x̄), (Xi − x̄)2 , . . . , (Xi − x̄)p ,
where p is the chosen polynomial order, with weight K((Xi − x̄)/h) for each observation. The
estimated intercept from this local weighted regression, µ̂+ , is an estimate of the point µ+ =
E[Yi (1)|Xi = x̄]. In other words, µ̂+ is the first element (intercept) of the weighted least
squares problem:
\[
\hat{\beta}_{+} = \operatorname*{arg\,min}_{\beta_{+,0},\dots,\beta_{+,p}} \; \sum_{i=1}^{n} \mathbb{1}(X_i \geq \bar{x}) \left(Y_i - \beta_{+,0} - \beta_{+,1}(X_i - \bar{x}) - \cdots - \beta_{+,p}(X_i - \bar{x})^{p}\right)^{2} K\!\left(\frac{X_i - \bar{x}}{h}\right)
\]
4. For observations below the cutoff (i.e., observations with score Xi < x̄), fit a weighted least
squares regression of the outcome Yi on a constant and (Xi − x̄), (Xi − x̄)2 , . . . , (Xi − x̄)p ,
where p is the chosen polynomial order, with weight K((Xi − x̄)/h) for each observation. The
estimated intercept from this local weighted regression, µ̂− , is an estimate of the point µ− =
E[Yi (0)|Xi = x̄]. In other words, µ̂− is the first element (intercept) of the weighted least
squares problem:
\[
\hat{\beta}_{-} = \operatorname*{arg\,min}_{\beta_{-,0},\dots,\beta_{-,p}} \; \sum_{i=1}^{n} \mathbb{1}(X_i < \bar{x}) \left(Y_i - \beta_{-,0} - \beta_{-,1}(X_i - \bar{x}) - \cdots - \beta_{-,p}(X_i - \bar{x})^{p}\right)^{2} K\!\left(\frac{X_i - \bar{x}}{h}\right)
\]
A graphical representation of local polynomial RD point estimation is given in Figure 4.1, where
a polynomial of order one (p = 1) is fit within bandwidth h1 —observations outside this bandwidth
are not used in the estimation. The RD effect is τSRD = µ+ − µ− and the local polynomial estimator
of this effect is µ̂+ − µ̂−. The dots and squares represent observed data points (because this method employs the actual raw data, not the binned data typically reported in RD plots).
[Figure 4.1: Local polynomial RD point estimation. E[Y(1)|X] and E[Y(0)|X] are plotted against the score X; local linear fits within [x̄ − h1, x̄ + h1] yield the intercept estimates µ̂+ and µ̂− of µ+ and µ− at the cutoff.]
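To make these estimation steps concrete, the following R sketch (an illustration only, not the estimator implemented in the rdrobust package) computes the sharp RD point estimate by local linear regression with a triangular kernel, using the Meyersson vectors Y and X, cutoff 0, and an arbitrarily chosen bandwidth of 20 percentage points:

# Local linear (p = 1) RD point estimation with a triangular kernel, done by hand
x.bar <- 0; h <- 20                                # cutoff and an arbitrary bandwidth
w <- pmax(0, 1 - abs((X - x.bar)/h))               # triangular weights; zero outside [x.bar - h, x.bar + h]
fit.r <- lm(Y ~ I(X - x.bar), weights = w, subset = (X >= x.bar & w > 0))  # treated side
fit.l <- lm(Y ~ I(X - x.bar), weights = w, subset = (X <  x.bar & w > 0))  # control side
mu.hat.plus  <- coef(fit.r)[1]                     # estimate of E[Y(1)|X = x.bar]
mu.hat.minus <- coef(fit.l)[1]                     # estimate of E[Y(0)|X = x.bar]
tau.hat <- mu.hat.plus - mu.hat.minus              # sharp RD point estimate

In practice, as discussed in the remainder of this section, the bandwidth should be selected in a data-driven way rather than fixed arbitrarily.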
The implementation of the local polynomial approach thus requires the choice of three main
ingredients: the kernel function K(·), the order of the polynomial p, and the bandwidth h. We now
turn to a detailed discussion of each of these choices.
The kernel function K(x) assigns a non-negative weight to each observation based on the distance of its score Xi from the cutoff x̄. The recommended choice is the triangular kernel function, K(x) = (1 − |x|)1(|x| ≤ 1), because, when coupled with a bandwidth that is optimal in a mean squared error (MSE) sense, it leads to a point estimator with optimal MSE properties. As illustrated in Figure
4.2, the triangular kernel function assigns zero weight to all observations with score outside the
interval [x̄ − h, x̄ + h], and positive weights to all observations within this interval. The weight is
maximized at Xi = x̄, and declines symmetrically and linearly as the value of the score gets farther
from the cutoff.
Despite the desirable asymptotic optimality property (from a point estimation perspective) of the triangular kernel, researchers sometimes prefer the simpler uniform kernel K(x) = 1(|x| ≤ 1), which also gives zero weight to all observations with score outside [x̄ − h, x̄ + h] but equal weight to all observations whose scores are within this interval; see Figure 4.2. Employing local linear estimation with bandwidth h and the uniform kernel is therefore equivalent to estimating a simple linear regression without weights using only observations whose distance from the cutoff is at most h, i.e., observations with Xi ∈ [x̄ − h, x̄ + h]. This second choice of kernel minimizes
the asymptotic variance of the local polynomial estimator. A third weighting scheme alternative
sometimes encountered in practice is the Epanechnikov kernel, also depicted in Figure 4.2, which
gives a quadratic decaying weighting to observations within Xi ∈ [x̄ − h, x̄ + h] and zero weight
to the rest. In practice, estimation and inference results are typically not very sensitive to the
particular choice of kernel used.
[Figure 4.2: Kernel weights assigned to observations as a function of their score's distance from the cutoff (triangular, uniform, and Epanechnikov kernels).]
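For concreteness, the three weighting schemes can be written as simple functions of the standardized distance u = (Xi − x̄)/h. The sketch below is not part of the original code, and scaling constants are omitted because they do not affect the weighted least-squares fit.
> # Sketch: the three kernels discussed above, as functions of the standardized
> # distance u = (Xi - xbar)/h; scaling constants are omitted since they leave the
> # weighted least-squares fit unchanged.
> K_triangular   = function(u) (1 - abs(u)) * (abs(u) <= 1)
> K_uniform      = function(u) as.numeric(abs(u) <= 1)
> K_epanechnikov = function(u) (1 - u^2) * (abs(u) <= 1)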
A more important issue is the choice of the order of the local polynomial used. As we discuss
next, given this choice, the accuracy of the approximation is essentially controlled by the bandwidth:
in other words, the bandwidth will be selected, given the polynomial order, taking misspecification
errors into account. As already mentioned, higher-order polynomials tend to over-fit the data and
hence produce unreliable results near boundary points. At the same time, local constant fits (p = 0)
exhibit some undesirable theoretical features and usually under-fit the data. In practice, the
recommended choices are p = 1 or p = 2, though theory and practice consider other polynomial
orders as well.
In sum, several issues need to be considered when choosing the specific order of the local poly-
nomial. First, a polynomial of order zero (a constant fit) has undesirable theoretical properties
at boundary points, which is precisely where RD estimation must occur. Second, for a given band-
width, increasing the order of the polynomial generally improves the accuracy of the approximation
but also increases the variability of the treatment effect estimator. In particular, it can be shown
that the asymptotic variances of the local constant (p = 0) and local linear (p = 1) polynomial fits
are equal, while the latter fit has smaller asymptotic bias. This fact has led researchers to prefer the
local linear RD estimator, which by now is the default point estimator in most applications. In
finite samples, of course, the ranking between different local polynomial estimators may be different,
but in general the local linear estimator seems to deliver a good trade-off between simplicity,
precision, and stability in sharp RD settings. Although it may seem at first that a linear polynomial
is not flexible enough, an appropriately chosen bandwidth will adjust to the selected polynomial
order so that the linear approximation to the unknown regression functions is reliable. We discuss
this in more detail below, when we turn to optimal bandwidth selection.
The choice of bandwidth h is fundamental for the analysis and interpretation of RD designs, and
empirical findings are often sensitive to this key tuning parameter. This choice controls the width
of the neighborhood around the cutoff that is used to fit the local polynomial that approximates the
unknown regression functions. Figure 4.3 illustrates how the error in the approximation is directly
related to the bandwidth choice. The unknown regression functions in the figure, E[Yi(1)|Xi = x]
and E[Yi(0)|Xi = x], have considerable curvature. At first, it would seem inappropriate to
approximate these functions with a linear polynomial. Indeed, inside the interval [x̄ − h2, x̄ + h2],
a linear approximation yields an estimated RD effect equal to µ̂+(h2) − µ̂−(h2), which is considerably
different from the true effect µ+ − µ−. Thus, a linear regression approximation within bandwidth h2
results in a large misspecification error. However, reducing the bandwidth from h2 to h1 improves the
linear approximation considerably, as now the estimated RD effect µ̂+(h1) − µ̂−(h1) is much closer
to the population treatment effect τSRD. The reason is that the regression functions are nearly linear
in the interval [x̄ − h1, x̄ + h1], and therefore the linear approximation results in a smaller
misspecification error. This illustrates the general principle that, given a polynomial order, the
accuracy of the approximation can always be improved by reducing the bandwidth.
[Figure 4.3: Effect of the bandwidth on the linear approximation to E[Y(1)|X] and E[Y(0)|X], comparing the intervals [x̄ − h1, x̄ + h1] and [x̄ − h2, x̄ + h2] around the cutoff; the vertical distance between the two functions at the cutoff is τSRD. Horizontal axis: Score (X).]
Choosing a smaller h will reduce the error or bias of the local polynomial approximation,
but will simultaneously tend to increase the variance of the estimated coefficients because fewer
observations will be available for estimation. On the other hand, a larger h will result in more
misspecification error (bias) if the unknown function differs considerably from the polynomial model
used for approximation, but will reduce the variance because the number of observations in the
interval [x̄ − h, x̄ + h] will be larger. For this reason, the choice of bandwidth is said to involve a
“bias-variance trade-off”.
Since empirical RD results are often sensitive to the choice of bandwidth, it is important to select
h in a data-driven, automatic way to avoid specification searching and ad-hoc decisions, as such
an approach provides (at least) a good benchmark for empirical work. Most (if not all) bandwidth
selection methods try to balance some form of bias-variance trade-off (sometimes involving other
features of the estimator as well). The most popular approach in practice seeks to minimize the
MSE of the local polynomial RD point estimator, τ̂SRD , given a choice of polynomial order and
kernel function. Since the MSE of an estimator is the sum of its squared bias and its variance,
this approach effectively chooses h to optimize a bias-variance trade-off. The precise procedure
involves using data-driven methods to choose the bandwidth that minimizes an approximation to
the asymptotic MSE of the RD point estimator because the exact MSE of the estimator is difficult
to characterize in general: this requires deriving an asymptotic MSE approximation, estimating the
unknown quantities in the resulting formula, and optimizing it with respect to h.
To describe the MSE-optimal bandwidth selection in more detail, let p denote the order of the
polynomial used to form the RD estimator τ̂SRD, with kernel K(·). Then, the general form of the
asymptotic MSE approximation is
\[
\mathrm{MSE}(\hat{\tau}_{\mathrm{SRD}}) \approx \mathrm{Bias}^2(\hat{\tau}_{\mathrm{SRD}}) + \mathrm{Variance}(\hat{\tau}_{\mathrm{SRD}}) \approx h^{2(p+1)} B^2 + \frac{1}{nh} V,
\]
where the constants B and V represent, respectively, the (leading) asymptotic bias and variance
of the RD point estimator τ̂SRD. Although we omit the technical details, we present the general
form of B and V to clarify the most important trade-offs involved in the choice of an MSE-optimal
bandwidth for the local polynomial RD estimator.
\[
B = B_+ - B_-, \qquad B_- = \mu_-^{(p+1)} \mathcal{B}_-, \qquad B_+ = \mu_+^{(p+1)} \mathcal{B}_+,
\]
where the derivatives $\mu_+^{(p+1)}$ and $\mu_-^{(p+1)}$ are related to the "curvature" of the unknown regression
functions for treatment and control units, respectively, and the known constants $\mathcal{B}_+$ and $\mathcal{B}_-$ are
related to the kernel function and the order p of the polynomial used.
The bias term B associated with the local polynomial RD point estimator of order p, τ̂SRD ,
depends on the (p + 1)-th derivatives of the regression functions E[Yi (1)|X = x] and E[Yi (0)|X = x]
with respect to the running variable. This is a more formal characterization of the phenomenon we
illustrated in Figure 4.3: when we approximate E[Yi (1)|X = x] and E[Yi (0)|X = x] with a local
polynomial of order p, that approximation has an error (unless E[Yi (1)|X = x] and E[Yi (0)|X = x]
happen to be polynomials of at most order p). The leading term of the approximation error is the
derivative of order p + 1—that is, the order following the polynomial order used to estimate τSRD .
For example, as illustrated in Figure 4.3, if we use a local linear (p = 1) polynomial to estimate
τSRD , our approximation by construction ignores the second order term (which depends on the
second derivative of the function), and all higher order terms (which depend on the higher-order
derivatives). Thus, the leading bias associated with a local linear estimator depends on the second
derivatives of the regression functions, which are the leading terms in the error of approximation
incurred when we set p = 1. Similarly, if we use a local quadratic polynomial to estimate τSRD, the
leading bias will depend on the third derivatives of the regression functions.
\[
V = V_- + V_+, \qquad V_- = \frac{\sigma_-^2}{f}\,\mathcal{V}_-, \qquad V_+ = \frac{\sigma_+^2}{f}\,\mathcal{V}_+,
\]
where
\[
\sigma_+^2 = \lim_{x \downarrow \bar{x}} \mathbb{V}[Y_i(1) \,|\, X_i = x] \qquad \text{and} \qquad \sigma_-^2 = \lim_{x \uparrow \bar{x}} \mathbb{V}[Y_i(0) \,|\, X_i = x]
\]
capture the conditional variability of the outcome given the score at the cutoff for treatment and
control units, respectively, f denotes the density of the score variable at the cutoff, and the known
constants $\mathcal{V}_-$ and $\mathcal{V}_+$ are related to the kernel function and the order p of the polynomial used.
Thus, for example, as the number of observations near the cutoff decreases (i.e., as the density f
decreases), the contribution of the variance term to the MSE increases. Similarly, as the number of
observations near the cutoff increases, the contribution of the variance term to the MSE decreases
accordingly. This captures the intuition that the variability of the RD point estimator will partly
depend on the density of observations near the cutoff. Similarly, an increase (decrease) in the
conditional variability of the outcome given the score will increase (decrease) the MSE of the RD
point estimators.
In order to obtain an MSE-optimal point estimator τ̂SRD, we choose the bandwidth that minimizes
the MSE approximation:
\[
\min_{h>0} \;\left\{ h^{2(p+1)} B^2 + \frac{1}{nh} V \right\},
\]
which leads to the MSE-optimal bandwidth choice
\[
h_{\mathrm{MSE}} = \left( \frac{V}{2(p+1)\, B^2} \right)^{1/(2p+3)} n^{-1/(2p+3)}.
\]
This formula formally incorporates the bias-variance trade-off mentioned above. It follows that hMSE
is proportional to n^{-1/(2p+3)}, and that this MSE-optimal bandwidth increases with V and decreases
with B. In other words, a larger asymptotic variance will lead to a larger MSE-optimal bandwidth;
this is intuitive, as a larger bandwidth will include more observations in the estimation and thus
reduce the variance of the resulting point estimator. In contrast, a larger asymptotic bias will lead
to a smaller bandwidth, as a smaller bandwidth will reduce the approximation error and hence
the bias of the resulting point estimator.
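As a small illustration of how the formula above translates into a number, the sketch below (not part of the original code) computes hMSE from given values of B, V, n, and p; in practice B and V are unknown and are replaced by preliminary estimates, as discussed further below.
> # Sketch: the MSE-optimal bandwidth formula as a function of the bias and variance
> # constants B and V (hypothetical inputs here; in practice they must be estimated),
> # the sample size n, and the polynomial order p.
> h_mse = function(B, V, n, p = 1) {
+   (V / (2 * (p + 1) * B^2))^(1 / (2 * p + 3)) * n^(-1 / (2 * p + 3))
+ }
> # Example with made-up constants: h_mse(B = 0.001, V = 2, n = 2629)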
Another way to see this trade-off is to note that if we chose a bandwidth h > hMSE, decreasing
h would lead to a reduction in the approximation error and an increase in the variability of the point
estimator, but the MSE reduction caused by the decrease in bias would be larger than the MSE increase
caused by the larger variance, leading to a smaller MSE overall. In other words, when h > hMSE,
it is possible to reduce the misspecification error without increasing the MSE. In contrast, when
we set h = hMSE, both increasing and decreasing the bandwidth necessarily lead to a higher MSE.
Given the quantities V and B, increasing the sample size n leads to a smaller optimal hMSE.
This is also intuitive: as more observations become available, both bias and variance can be reduced.
Thus, the larger the sample size, the better the asymptotic MSE of the RD estimator, because it
is possible to reduce the approximation error by shrinking the bandwidth without paying a penalty
in increased variability.
In some applications, it may be useful to choose different bandwidths for each group, that is, on
either side of the cutoff. Since the RD treatment effect τSRD = µ+ − µ− is simply the difference
of two (one-sided) estimates, allowing for two distinct bandwidth choices can be accomplished
by considering an MSE approximation for each estimate separately. In other words, two different
bandwidths can be selected for µ̂+ and µ̂−, and then used to form the RD treatment effect estimator.
Practically, this is equivalent to choosing an asymmetric neighborhood around the cutoff of the form
[x̄ − h−, x̄ + h+], where h− and h+ denote the control (left) and treatment (right) bandwidths,
respectively. These MSE-optimal choices are given by
\[
h_{\mathrm{MSE},-} = \left( \frac{V_-}{2(p+1)\, B_-^2} \right)^{1/(2p+3)} n^{-1/(2p+3)}
\qquad \text{and} \qquad
h_{\mathrm{MSE},+} = \left( \frac{V_+}{2(p+1)\, B_+^2} \right)^{1/(2p+3)} n^{-1/(2p+3)}.
\]
Thus, these bandwidth choices will be most practically relevant when the bias and/or variance of
the control and treatment groups differ substantially, for example because of different curvature of
the unknown regression functions or different conditional variance of the outcome given the score
near the cutoff.
In practice, the optimal bandwidth selectors described above (and all other variants thereof)
are implemented by constructing preliminary plug-in estimates of the unknown quantities entering
their formulas. For example, the misspecification biases B+ and B− are estimated by forming
preliminary "curvature" estimates µ̂−^{(p+1)} and µ̂+^{(p+1)}, which are constructed using a local polynomial
of order q ≥ p + 1 with a bias bandwidth b, not necessarily equal to h. Since the constants $\mathcal{B}_-$ and $\mathcal{B}_+$
are known, feasible bias estimates can be formed as B̂+ = µ̂+^{(p+1)} $\mathcal{B}_+$ and B̂− = µ̂−^{(p+1)} $\mathcal{B}_-$. Similarly,
the terms V− and V+, which capture the asymptotic variance of the estimates to the left and to the
right of the cutoff, respectively, can be estimated by replacing the unknown conditional variance
and density functions at the cutoff with preliminary estimates thereof. Given these ingredients, data-
driven MSE-optimal bandwidth selectors are easily constructed both for the RD treatment effect (i.e.,
one common bandwidth on both sides of the cutoff) and for each of the two regression function
estimators at the cutoff (i.e., two distinct bandwidths).
The approach described so far for bandwidth selection in RD designs is arguably the default in
most modern empirical work. A potential drawback of this approach is that in some applications
the estimated biases may be close to zero, leading to poor behavior of the resulting bandwidth
selectors.
selectors. To handle this computational issue, it is common to include a “regularization” term R to
avoid small denominators in small samples. For example, in the case of the MSE-optimal bandwidth
choice for the RD treatment effect estimator, the alternative formula is
1/(2p+3)
V
hMSE = n−1/(2p+3) ,
2(p + 1)B 2 + R
where the extra term R can be justified theoretically but requires additional preliminary estimators
when implemented. Empirically, since R enters the denominator, including a regularization term
will always lead to a smaller hMSE. This idea is also used in the case of hMSE,− and hMSE,+, and other
related bandwidth selection procedures. We discuss how to include and exclude the regularization
term in practice when we illustrate local polynomial methods in subsection 4.2.4.
Given the choice of polynomial order p and kernel function K(·), the local polynomial RD point
estimator τ̂SRD is implemented for a neighborhood around the cutoff x̄ determined by the
bandwidth h. As discussed previously, the smaller the value of h, the smaller the misspecification
bias and the larger the variability of the RD treatment effect estimator, while for larger bandwidths
the bias-variance effects are reversed. Selecting a common MSE-optimal bandwidth for τ̂SRD, or two
distinct MSE-optimal bandwidths for its ingredients µ̂− and µ̂+, leads to an MSE-optimal RD point estimator.
To be more specific, the resulting estimator is consistent and achieves the fastest rate of decay in
an MSE sense. Furthermore, it can be argued in a precise technical sense that the triangular kernel
is the MSE-optimal choice for point estimation. Because of these optimality properties, and the fact
that the procedures are data-driven and objective, modern RD empirical work routinely employs
some form of automatic MSE-optimal bandwidth selector, and reports the resulting MSE-optimal
point estimator when estimating RD treatment effects.
We now return to the Meyersson application to illustrate point estimation of RD effects using local
polynomials. First, we use standard least-squares commands to emphasize that local polynomial
point estimation is nothing more than a weighted least-squares fit when it comes to point estimation
(this is not true when the goal is inference, as we discuss below).
We start by choosing a fixed or ad-hoc bandwidth equal to h = 20, and thus postpone the
illustration of optimal bandwidth selection until further below. Within this arbitrary bandwidth
choice, we can construct the local linear (p = 1) RD point estimator with a uniform kernel
using standard least-squares routines. As mentioned above, a uniform kernel simply means that all
observations outside [x̄ − h, x̄ + h] are excluded, and all observations inside this interval are weighted
equally.
> out = lm(Y[X < 0 & X >= -20] ~ X[X < 0 & X >= -20])
> left_intercept = out$coefficients[1]
> print(left_intercept)
(Intercept) 
   12.62254 
> out = lm(Y[X >= 0 & X < 20] ~ X[X >= 0 & X < 20])
> right_intercept = out$coefficients[1]
> print(right_intercept)
(Intercept) 
   15.54961 
> difference = right_intercept - left_intercept
> print(paste("The RD estimator is", difference, sep = " "))
[1] "The RD estimator is 2.92707507543107"
The results indicate that within this ad-hoc bandwidth of 20 percentage points, the share of
women ages 15 to 20 who completed high school increases by 2.927 percentage points: 15.549 percent
of women in this age group had completed high school by 2000 in municipalities where the Islamic
party barely won the 1994 mayoral elections, while the analogous share in municipalities where the
Islamic party was barely defeated is 12.622 percent.
The same point estimator can be obtained by fitting a single linear regression that includes
an interaction between the treatment indicator and the score—both approaches are algebraically
equivalent.
> Z_X = X * Z
> out = lm(Y[X >= -20 & X <= 20] ~ X[X >= -20 & X <= 20] + Z[X >= -20 & X <= 20] +
+          Z_X[X >= -20 & X <= 20])
> print(out)

Call:
lm(formula = Y[X >= -20 & X <= 20] ~ X[X >= -20 & X <= 20] +
    Z[X >= -20 & X <= 20] + Z_X[X >= -20 & X <= 20])

Coefficients:
            (Intercept)    X[X >= -20 & X <= 20]    Z[X >= -20 & X <= 20]  Z_X[X >= -20 & X <= 20]
                12.6225                  -0.2481                   2.9271                   0.1261
57
4. CONTINUITY-BASED RD APPROACH CONTENTS
To obtain the point estimator with a triangular kernel instead of a uniform kernel, we simply
use a weighted least-squares routine. First, we create the weights according to the triangular kernel
formula.
> w = NA
> w[X < 0 & X >= -20] = 1 - abs(X[X < 0 & X >= -20] / 20)
> w[X >= 0 & X <= 20] = 1 - abs(X[X >= 0 & X <= 20] / 20)
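The weighted fits themselves are then obtained by passing these weights to lm; the exact calls are not reproduced in the excerpt above, but a minimal sketch consistent with the weights just created is:
> # Sketch (calls not shown in the original excerpt): weighted local linear fits on
> # each side of the cutoff using the triangular-kernel weights w created above; the
> # difference in intercepts is the RD point estimate with a triangular kernel and h = 20.
> out_left  = lm(Y[X < 0 & X >= -20] ~ X[X < 0 & X >= -20],
+                weights = w[X < 0 & X >= -20])
> out_right = lm(Y[X >= 0 & X <= 20] ~ X[X >= 0 & X <= 20],
+                weights = w[X >= 0 & X <= 20])
> out_right$coefficients[1] - out_left$coefficients[1]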
Note that, with h and p fixed, changing the kernel from uniform to triangular alters the point
estimator only slightly, from 2.9271 to 2.9373. This is typical; point estimates tend to be relatively
stable with respect to the choice of kernel.
We showed how to use least-squares estimation only for pedagogical purposes, that is, to clarify
the algebraic mechanics behind local polynomial point estimation. However, employing weighted
least-squares routines in practice can be misleading and is incompatible with MSE-optimal
bandwidth selection and inference, as we discuss in the upcoming sections. From this point on, we
employ software packages that are specifically tailored to RD point estimation and inference. In
particular, we focus on the rdrobust software package, which includes several functions to conduct
local polynomial bandwidth selection, RD point estimation and inference, in a fully internally
coherent methodological way.
To replicate the previous point estimators using the command rdrobust, we use the option p
to set the order of the polynomial, kernel to set the kernel function used to weight the observations,
and h to choose the bandwidth manually. By default, rdrobust sets the cutoff value to zero, but this
can be changed with the option c. We first use rdrobust to construct a local linear RD point estimator
with h = 20 and a uniform kernel.
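The rdrobust call that produces the output below is not reproduced above; given the options just described, a call along the following lines (a reconstruction on our part) would be used:
> # Reconstructed call (not shown in the original excerpt): uniform kernel, manual bandwidth h = 20.
> rdrobust(Y, X, kernel = "uniform", p = 1, h = 20)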
Summary :
Left Right
Number of Obs 2314 315
Eff . Number of Obs 608 280
Order Loc Poly ( p ) 1 1
Order Bias ( q ) 2 2
BW Loc Poly ( h ) 20.0000 20.0000
BW Bias ( b ) 20.0000 20.0000
rho ( h / b ) 1.0000 1.0000
Estimates :
Coef Std . Err . z P >| z | CI Lower CI Upper
Conventional 2.9271 1.2345 2.3710 0.0177 0.5074 5.3467
Robust 0.1018 -0.5822 6.4710
The output includes many details. The four rows in the uppermost panel indicate that the total
number of observations in the dataset is 2629, the bandwidth was chosen manually as opposed
to using an optimal data-driven algorithm, and the observations were weighted with a uniform
kernel. The final line indicates that the variance-covariance estimator (VCE) was constructed using
nearest-neighbor (NN) estimators instead of sums of squared residuals (this default behavior can
be changed with the option vce). We discuss details on variance estimation further below in the
context of RD inference.
The middle panel resembles the output of rdplot in that it is divided into two columns that
give information separately for the observations above (Right) and below (Left) the cutoff. The
first row shows that the 2,629 observations are split into 2,314 (control) observations below the
cutoff, and 315 (treated) observations above the cutoff. The second row shows the effective number
of observations, which refers to the number of observations with scores within h distance from the
cutoff and therefore effectively used in the estimation of the RD effect. In other words, these are
the observations with Xi ∈ [x̄ − h, x̄ + h]. The output indicates that there are 608 observations
with Xi ∈ [x̄ − h, x̄), and 280 observations with Xi ∈ [x̄, x̄ + h]. The third line shows the order of
the local polynomial used to estimate the main RD effect, τSRD , which in this case is equal to p = 1.
The bandwidth used to estimate τSRD is shown on the fifth line, BW Loc Poly (h), where we see
that the same bandwidth h = 20 was used to the left and right of the cutoff (below we illustrate
how to allow for different bandwidths on either side of the cutoff). We defer discussion of Order
Bias (q), BW Bias (b), and rho (h/b) until we discuss methods for inference.
Finally, the last panel shows the estimation results. The point estimator is reported in the Coef
column on the first row. The estimated RD treatment effect is τ̂SRD = 2.9271, indicating that in
municipalities where the Islamic party barely wins the female high school attainment share is about
3 percentage points higher than in municipalities where the party barely lost. As expected, this
number is identical to the number we obtained with the least-squares command lm.
The rdrobust routine also allows us to easily estimate the RD effect using triangular instead
of uniform kernel weights.
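The corresponding call, again a reconstruction consistent with the description above, would be:
> # Reconstructed call (not shown in the original excerpt): triangular kernel, same manual bandwidth.
> rdrobust(Y, X, kernel = "triangular", p = 1, h = 20)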
Summary :
Left Right
Number of Obs 2314 315
Eff . Number of Obs 608 280
Order Loc Poly ( p ) 1 1
Order Bias ( q ) 2 2
BW Loc Poly ( h ) 20.0000 20.0000
BW Bias ( b ) 20.0000 20.0000
rho ( h / b ) 1.0000 1.0000
Estimates :
Coef Std . Err . z P >| z | CI Lower CI Upper
Conventional 2.9373 1.3429 2.1872 0.0287 0.3052 5.5694
Robust 0.1680 -1.1166 6.4140
60
4. CONTINUITY-BASED RD APPROACH CONTENTS
Once again, this produces the same coefficient of 2.9373 that we found when we used the weighted
least-squares command with triangular weights. We postpone the discussion of standard errors,
confidence intervals, and the distinction between the Conventional versus Robust results until we
discuss methods for inference.
Finally, if we wanted to reduce the approximation error in the estimation of the RD effect, we
could increase the order of the polynomial and use a local quadratic fit instead of a local linear one.
This can be implemented in rdrobust by setting p=2.
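A call consistent with this description (reconstructed, not shown in the original excerpt) would be:
> # Reconstructed call (not shown in the original excerpt): local quadratic fit within the same bandwidth.
> rdrobust(Y, X, kernel = "triangular", p = 2, h = 20)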
Summary :
Left Right
Number of Obs 2314 315
Eff . Number of Obs 608 280
Order Loc Poly ( p ) 2 2
Order Bias ( q ) 3 3
BW Loc Poly ( h ) 20.0000 20.0000
BW Bias ( b ) 20.0000 20.0000
rho ( h / b ) 1.0000 1.0000
Estimates :
Coef Std . Err . z P >| z | CI Lower CI Upper
Conventional 2.6487 1.9211 1.3787 0.1680 -1.1166 6.4140
Robust 0.6743 -3.9688 6.1350
Note that the estimated effect changes from 2.9373 with p = 1 to 2.6487 with p = 2. It is not
unusual to observe a change in the point estimate as one changes the polynomial order used in
the estimation. Unless the higher order terms in the approximation are exactly zero, incorporating
those terms in the estimation will reduce the approximation error and thus lead to changes in the
estimated effect. The relevant practical question is whether such changes in the point estimator
change the conclusions of the study. For that, we need to consider inference as well as estimation
procedures, a topic we discuss in the upcoming sections.
Choosing an ad-hoc bandwidth as shown in the previous commands is not advisable. It is unclear
what the value h = 20 means in terms of bias and variance properties, or whether this is the best
approach for estimation and inference. The command rdbwselect, which is part of the rdrobust
61
4. CONTINUITY-BASED RD APPROACH CONTENTS
package, implements optimal, data-driven bandwidth selection methods. We illustrate the use of
rdbwselect by selecting a MSE-optimal bandwidth for the local linear estimator of τSRD .
> rdbwselect(Y, X, kernel = "triangular", p = 1, bwselect = "mserd")
Call:
rdbwselect(y = Y, x = X, p = 1, kernel = "triangular", bwselect = "mserd")

BW Selector              mserd
Number of Obs             2629
NN Matches                   3
Kernel Type         Triangular

                      Left   Right
Number of Obs         2314     315
Order Loc Poly (p)       1       1
Order Bias (q)           2       2
The MSE-optimal bandwidth choice depends on the choice of polynomial order and kernel
function, which is why both have to be specified in the call to rdbwselect. The first output line
indicates the type of bandwidth selector; in this case, it is MSE-optimal (mserd). The type of kernel
used is also reported, as is the total number of observations. The middle panel reports the number
of observations on each side of the cutoff, and the order of polynomial chosen for estimation of the
RD effect—the Order Loc Poly (p) row. We postpone discussion of the Order Bias (q) results
until we discuss inference.
In the bottom panel, we see the estimated optimal bandwidth choices. The bandwidth h refers to
the bandwidth used to estimate the RD effect τSRD ; we sometimes refer to it as the main bandwidth.
The bandwidth b is an additional bandwidth used to estimate a bias term that is needed for robust
inference; we omit discussion of b until the following sections.
As shown, the estimated MSE-optimal bandwidth for the local-linear RD point estimator with
triangular kernel weights is 17.23947. The option bwselect = "mserd" imposes the same band-
width h on each side of the cutoff, that is, uses the neighborhood [x̄ − h, x̄ + h]. This is why the
columns h (left) and h (right) have the same value 17.23947. If instead we wish to allow the
bandwidth to be different on each side of the cutoff and estimate the RD effect in the neighbor-
hood [x̄ − hleft , x̄ + hright ], we can choose two MSE-optimal bandwidths by using the bwselect =
"msetwo" option.
> rdbwselect(Y, X, kernel = "triangular", p = 1, bwselect = "msetwo")
Call:
rdbwselect(y = Y, x = X, p = 1, kernel = "triangular", bwselect = "msetwo")

BW Selector             msetwo
Number of Obs             2629
NN Matches                   3
Kernel Type         Triangular

                      Left   Right
Number of Obs         2314     315
Order Loc Poly (p)       1       1
Order Bias (q)           2       2
This leads to a bandwidth of 19.96678 on the control side, and a bandwidth of 17.35913 on the
treated side.
Once we select the MSE-optimal bandwidth(s), we could pass them to the function rdrobust
using the option h. But it is much easier to use the option bwselect in rdrobust. When we use
this option, rdrobust calls rdbwselect internally, selects the bandwidth as requested, and then
uses the optimally chosen bandwidth to estimate the RD effect.
In order to perform bandwidth selection and point estimation in one step, using p = 1 and
triangular kernel weights, we use the rdrobust command.
> rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "mserd")
Call:
rdrobust(y = Y, x = X, p = 1, kernel = "triangular", bwselect = "mserd")

Summary:
                        Left     Right
Number of Obs           2314       315
Eff. Number of Obs       529       266
Order Loc Poly (p)         1         1
Order Bias (q)             2         2
BW Loc Poly (h)      17.2395   17.2395
BW Bias (b)          28.5754   28.5754
rho (h/b)             0.6033    0.6033

Estimates:
                  Coef   Std. Err.        z    P>|z|   CI Lower   CI Upper
Conventional    3.0195      1.4271   2.1159   0.0344     0.2225     5.8165
Robust                                        0.0758    -0.3093     6.2758
63
4. CONTINUITY-BASED RD APPROACH CONTENTS
When the same MSE-optimal bandwidth is used on both sides of the cutoff, the effect of a bare
Islamic victory on the female educational attainment share is 3.0195, slightly larger than the 2.9373
effect that we found above when we used the ad-hoc bandwidth of 20.
We can also explore the rdrobust output to obtain the estimates of the average outcome at
the cutoff separately for treated and control observations.
> rdout = rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "mserd")
> print(names(rdout))
 [1] "tabl1.str" "tabl2.str" "tabl3.str" "N"         "N_l"       "N_r"
     "N_h_l"     "N_b_l"     "N_b_r"     "c"         "p"         "q"
     "h_l"       "h_r"       "b_l"       "b_r"       "tau_cl"    "tau_bc"
     "se_tau_cl" "se_tau_rb" "bias_l"    "bias_r"    "beta_p_l"  "beta_p_r"
[25] "V_cl_l"    "V_cl_r"    "V_rb_l"    "V_rb_r"    "coef"      "bws"
     "se"        "z"         "pv"        "ci"        "call"
> print(rdout$beta_p_r)
           [,1]
[1,] 15.6649438
[2,] -0.1460846
> print(rdout$beta_p_l)
           [,1]
[1,] 12.6454218
[2,] -0.2477231
We see that the RD effect of 3.0195 percentage points in the female high school attainment
share is the difference between a share of 15.6649438 percent in municipalities where the Islamic
party barely wins and a share of 12.6454218 percent in municipalities where the Islamic party barely
loses—that is, 15.6649438 − 12.6454218 ≈ 3.0195. By accessing the control mean at the cutoff in
this way, we learn that the RD effect represents an increase of (3.0195/12.6454218) × 100 = 23.87
percent relative to the control mean. This effect, together with the mean outcomes on either side of
the cutoff, can be easily illustrated with rdplot, using the options h, p, and kernel to match exactly
the specification used in rdrobust and produce a faithful illustration of the RD effect.
> bandwidth = rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "mserd")$h_l
> out = rdplot(Y[abs(X) <= bandwidth], X[abs(X) <= bandwidth],
+              p = 1, kernel = "triangular")
> print(out)
Call:
rdplot(y = Y[abs(X) <= bandwidth], x = X[abs(X) <= bandwidth], p = 1,
    kernel = "triangular")
64
4. CONTINUITY-BASED RD APPROACH CONTENTS
Method :
Left Right
Number of Obs . 529 266
Polynomial Order 1 1
Scale 4 6
Selected Bins 19 17
Average Bin Length 0.9066 1.0028
Median Bin Length 0.9066 1.0028
[rdplot output: binned sample means and local linear fits of the outcome on the score within the MSE-optimal bandwidth. Horizontal axis: Running Variable.]
Finally, we note that, by default, all MSE-optimal bandwidth selectors in rdrobust include in
the denominator the regularization term that we discussed in subsection 4.2.2. We can exclude the
regularization term with the option scaleregul=0 in the rdrobust (or rdbwselect) call.
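A call consistent with this description (reconstructed, not shown in the original excerpt) would be:
> # Reconstructed call (not shown in the original excerpt): MSE-optimal bandwidth without regularization.
> rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "mserd", scaleregul = 0)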
Summary :
Left Right
Number of Obs 2314 315
Eff . Number of Obs 1152 305
Order Loc Poly ( p ) 1 1
Order Bias ( q ) 2 2
BW Loc Poly ( h ) 34.9830 34.9830
BW Bias ( b ) 46.2326 46.2326
rho ( h / b ) 0.7567 0.7567
Estimates :
Coef Std . Err . z P >| z | CI Lower CI Upper
Conventional 2.8432 1.1098 2.5619 0.0104 0.6681 5.0184
Robust 0.0171 0.5964 6.1039
In this application, excluding the regularization term has a very large impact on the estimated hMSE.
With regularization, ĥMSE is 17.2395, while excluding regularization increases it to 34.98296, an
increase of roughly 100%. Nevertheless, the point estimate remains relatively stable, moving from
3.0195 with regularization to 2.8432 without it.
In addition to providing a local polynomial point estimator of the RD treatment effect, we are
interested in testing hypotheses and constructing confidence intervals. At first glance, it seems
that ordinary least squares (OLS) inference methods could be used since, as we have seen, local
polynomial estimation involves simply fitting two weighted least-squares regressions within a region
near x̄ controlled by the bandwidth. However, relying on OLS methods for inference would treat the
local polynomial regression model as correctly specified (i.e., parametric), and de facto disregard its
fundamental approximation (i.e., nonparametric) nature. To put things differently, it is intellectually
and methodologically incoherent to simultaneously select a bandwidth according to a bias-variance
trade-off but then proceed as if the bias were zero, that is, as if the local polynomial fit were correctly
specified and no misspecification error existed. In fact, if the model were indeed correctly specified,
the full support of the data should be used (i.e., a very large bandwidth), a procedure that in most
applications would lead to severely biased RD point estimators.
These considerations imply that valid inference should take into account the degree of misspeci-
fication or, alternatively, whether the bandwidth is too large for hypothesis testing and confidence
interval estimation. For example, the MSE-optimal bandwidths discussed previously (hMSE , hMSE,− ,
hMSE,+ ), and many other variants thereof, result in a RD point estimator that is both consistent
and optimal in a MSE sense. However, inferences based on hMSE present a challenge because this
bandwidth choice is by construction not “small” enough to remove the leading bias term in the stan-
dard distributional approximations used to conduct statistical inference. Heuristically, because these
bandwidth choices are developed for point estimation purposes, they pay no attention to their effects
in terms of the distributional properties of typical t-tests or related statistics. Thus, constructing
confidence intervals based on standard OLS large-sample results with the data Xi ∈ [x̄ − hMSE, x̄ + hMSE]
will result in invalid inferences.
There are several approaches that one could take to address this difficulty. One approach is to
use hMSE only for point estimation, and then choose a different bandwidth for inference purposes.
This method requires selecting a bandwidth smaller than hMSE and then recomputing the point
estimator and standard error to conduct inference. This approach is called undersmoothing; it
requires using more observations for point estimation than for inference, which may sometimes
be regarded as problematic and will certainly lead to a loss of statistical power. Another approach is
to retain the same bandwidth hMSE for both estimation and inference, but with a modified test
statistic that accounts for the effects of misspecification due to the large bandwidth being used, as well
as for the additional sampling error introduced by such a modification. This approach is called robust
bias correction and has the advantage that the same observations can be used for both estimation
and inference, thereby leading to more powerful statistical methods. We discuss both approaches in
detail below, and also briefly elaborate on further refinements and extensions to the latter approach
leading to more robust and powerful inference procedures.
The MSE-optimal bandwidth hMSE , or any of the other similar choices, is the bandwidth that mini-
mizes the asymptotic MSE of the point estimator τ̂SRD , and is by now the most popular benchmark
choice in practice. But the optimal properties of such an MSE-optimal choice for point estimation
purposes do not guarantee valid statistical inference based on large sample distributional approxi-
mations. We now discuss how to make valid inferences when the bandwidth choice is hMSE , or some
data-driven implementation thereof.
67
4. CONTINUITY-BASED RD APPROACH CONTENTS
The local polynomial RD point estimator τ̂SRD has an approximate large-sample distribution
\[
\frac{\hat{\tau}_{\mathrm{SRD}} - \tau_{\mathrm{SRD}} - B}{\sqrt{V}} \;\overset{a}{\sim}\; \mathcal{N}(0, 1),
\]
where B and V are, respectively, the asymptotic bias and variance of the RD local polynomial es-
timator of order p, discussed previously in the context of MSE expansions and bandwidth selection.
This distributional result is similar to those encountered in, for example, standard linear regression
problems with the important distinction that now the bias term B features explicitly. In fact, the
variance term V can be calculated as in (weighted) least squares problems, for instance accounting
for heteroskedasticity and/or clustered data. We do not provide the exact formulas for variance
estimation to save space and notation, but these formulas can be found in the references given at
the end of this section and are all implemented in the RD software available and discussed in the
empirical illustration further below.
Given the distributional approximation for the RD local polynomial estimator, an asymptotic
95-percent confidence interval for τSRD is approximately given by
\[
\mathrm{CI} = \left[ \left(\hat{\tau}_{\mathrm{SRD}} - B\right) \pm 1.96 \cdot \sqrt{V} \right].
\]
Such confidence intervals depend on the unknown bias or misspecification error B, and any practical
procedure that ignores it will lead to incorrect inferences unless this term is negligible (i.e., the
local linear regression model is correctly specified). In other words, the bias term arises because
the local polynomial approach is a nonparametric approximation: instead of assuming that the
underlying regression functions are p-th order polynomials (as would occur in OLS estimation),
this approach uses the polynomial to approximate the unknown regression functions. The degree
of misspecification error is controlled by the choice of bandwidth, with larger biases associated with
larger bandwidths. Thus, the large-sample distributional approximation naturally includes the
term B to highlight the trade-off between bandwidth choice and misspecification bias near the cutoff.
As already anticipated, different strategies have been proposed and employed to make infer-
ences based on asymptotic distributional approximations for τ̂SRD in the presence of nonparametric
misspecification biases. Some strategies are invalid in general, some are theoretically sound but lead
to ad-hoc procedures with poor properties and performance in applications, and others are both
theoretically valid and perform very well in practice. We now discuss some of these approaches in
detail and explain their relative merits.
When conducting inference in practice, researchers must make a choice: either an ad hoc bandwidth
is used under the assumption that it is small enough for the misspecification error to be ignored,
or the misspecification error must be accounted for explicitly when conducting inference. Notice
that the former approach leads to valid inference (because a smaller-than-MSE-optimal bandwidth
is used), but the resulting RD treatment effect point estimator is no longer MSE-optimal; thus a
sub-optimal treatment effect estimator is reported.
To be more specific, the naïve approach to statistical inference, which ignores the effects of MSE-
optimal bandwidth selection and misspecification bias, treats the local polynomial approach as
parametric within the neighborhood around the cutoff and de facto ignores the bias term, a pro-
cedure that leads to invalid inferences in all cases except when the approximation error is so small
that it can be ignored. When the bias term is zero, the approximate distribution of the RD estimator is
\[
\frac{\hat{\tau}_{\mathrm{SRD}} - \tau_{\mathrm{SRD}}}{\sqrt{V}} \;\overset{a}{\sim}\; \mathcal{N}(0, 1),
\]
and the confidence interval is
\[
\mathrm{CI}_{\mathrm{us}} = \left[ \hat{\tau}_{\mathrm{SRD}} \pm 1.96 \cdot \sqrt{V} \right].
\]
Since this is the same confidence interval that follows from parametric least-squares estimation, we
refer to it as conventional. Using the conventional confidence interval CIus is equivalent to assuming
that the chosen polynomial gives an exact approximation to the true functions E[Yi (1)|Xi ] and
E[Yi (0)|Xi ]. Since these functions are unknown, this assumption is not verifiable and will rarely
be credible. If researchers use CIus when in fact the approximation error is non-negligible, all
inferences will be incorrect, leading to under-coverage of the true treatment effect or, equivalently,
over-rejection of the null hypothesis of zero treatment effect. For this reason, we strongly discourage
researchers from using conventional inference when using local polynomial methods, unless the
misspecification bias can credibly be assumed small (ruling out, in particular, the use of MSE-
optimal bandwidth choices).
A common mistake in practice is to employ CIus even when first-order misspecification errors
are present in the distributional approximation because of the (too large) bandwidth choice used.
A theoretically sound but ad-hoc alternative is to use these conventional confidence intervals with
an "undersmoothed" bandwidth relative to the one used for point estimation (i.e., for constructing
the point estimator τ̂SRD in the first place). Practically, the procedure involves selecting a bandwidth
smaller than the MSE-optimal choice and then constructing the conventional confidence intervals
CIus with this smaller bandwidth (both a new point estimator and a new standard error are estimated).
The theoretical justification is that, for bandwidths smaller than the MSE-optimal choice, the bias
term becomes negligible in the large-sample distributional approximation. The main drawback of
this procedure is that there are no clear and transparent criteria for shrinking the bandwidth below
the MSE-optimal value: some researchers might estimate the MSE-optimal choice and divide it by
two, others may choose to divide it by three, and yet others may decide to subtract a small number
ε from it. Although these procedures can be justified in a strictly theoretical sense, they are all
ad-hoc and can result in lack of transparency and specification searching. Moreover, this general
strategy leads to a loss of statistical power because a smaller bandwidth results in fewer observations
used for estimation and inference. Finally, from a substantive perspective, some researchers may not
want to use different observations for estimation and inference, which is what any undersmoothing
approach requires.
As explained above, the bias term depends on the “curvature” of the unknown regression functions
captured via their derivative of order p + 1 at the cutoff. This unknown feature of the underlying
data generating process can be estimated with a local polynomial of order q = p + 1 or higher, and
another choice of bandwidth denoted b. Therefore, the main point estimate employs the bandwidth
h, for example chosen in an MSE-optimal way for RD point estimation, while the bias correction
estimate employs the additional bandwidth b, which can be chosen in different ways. The ratio
ρ = h/b relates to the variability of the bias correction estimate relative to the point estimator
itself, and standard bias correction methods require ρ = h/b → 0, that is, a small ρ. Note that
ρ = h/b = 1 (h = b) is not allowed by this method.
The bias-corrected confidence intervals allow for a wider range of bandwidths h and, in particu-
lar, result in valid inferences when the MSE-optimal bandwidth is used. However, these confidence
intervals typically have poor performance in applications. The reason is that the variability intro-
duced in the bias estimation step is not incorporated in the variance term used: the same standard
errors as in CIus are employed even though the additional estimated term B̂ now features in the con-
struction of the confidence intervals CIbc, which results in a poor distributional approximation and
hence important coverage distortions in practice.
A superior strategy that is both theoretically sound and leads to excellent coverage in finite
samples is to use robust bias correction for constructing confidence intervals. This approach leads
to demonstrably superior inference procedures: for example, the coverage error and average length
of these confidence intervals are improved relative to those associated with either CIus or CIbc.
Furthermore, the robust bias correction approach delivers valid inferences even when the MSE-
optimal bandwidth for point estimation is used (no undersmoothing is necessary) and remains
valid even when ρ = h/b = 1 (h = b), which implies that exactly the same data can be used for
both point estimation and statistical inference.
Robust bias-corrected confidence intervals are based on the bias correction procedure described
above, by which the estimated bias term B̂ is removed from the RD point estimator. But, in contrast
to CIbc , the derivation of the robust bias corrected confidence intervals allows the estimated bias
term to converge in distribution to a random variable and thus contribute to the distributional
approximation of the RD point estimator. This results in a new asymptotic variance Vbc that, unlike
the variance V used in CIus and CIbc, incorporates the contribution of the bias correction step
to the variability of the bias-corrected point estimator. Because the new variance Vbc incorporates
the extra variability introduced in the bias estimation step, it is larger than the conventional OLS
variance V when the same bandwidth is used. This approach leads to the robust bias-corrected
confidence intervals
\[
\mathrm{CI}_{\mathrm{rbc}} = \left[ \left(\hat{\tau}_{\mathrm{SRD}} - \hat{B}\right) \pm 1.96 \cdot \sqrt{V_{\mathrm{bc}}} \right],
\]
which are constructed by subtracting the bias estimate from the local polynomial estimator and
using the new variance formula for Studentization. Note that CIrbc is centered around the bias-
corrected point estimate, τ̂SRD − B̂, not around the uncorrected estimate τ̂SRD. These robust confi-
dence intervals result in valid inferences when the MSE-optimal bandwidth is used, because they have
smaller coverage errors and are therefore less sensitive to tuning parameter choices. In practice, the
confidence intervals can be implemented by setting ρ = h/b = 1 (h = b) and choosing h = hMSE, or by
selecting both h and b to be MSE-optimal for the corresponding estimators, in which case ρ is set
to hMSE/bMSE or to its data-driven implementation.
We summarize the differences between the three types of confidence intervals discussed in Table
4.1. The conventional OLS confidence intervals CIus ignore the bias term and are thus centered at
the local polynomial point estimator τ̂SRD, and use the conventional standard error √V̂. The bias-
corrected confidence intervals CIbc remove the bias estimate from the conventional point estimator,
and are therefore centered at τ̂SRD − B̂; these bias-corrected confidence intervals, however, ignore
the variability introduced in the bias correction step and thus continue to use the standard error
√V̂, which is the same standard error used by the conventional confidence intervals CIus. The robust
bias-corrected confidence intervals CIrbc are also centered at the bias-corrected point estimator
τ̂SRD − B̂ but, in contrast to CIbc, they employ a different standard error, √V̂bc, which is larger than
the conventional standard error √V̂ when the same bandwidth h is used. However, as discussed above,
if h = hMSE then CIus are invalid confidence intervals!
Relative to the conventional confidence intervals, the robust bias-corrected confidence intervals
are both re-centered and re-scaled. This implies that CIrbc are not centered at the conventional
point estimator τ̂SRD and, in fact, the RD point estimator does not need to be within the interval
CIrbc. This illustrates some of the fundamental conceptual differences between point estimation
and confidence interval estimation. Nevertheless, in practice, the RD point estimator will often
be covered by the robust bias-corrected confidence interval, and when it is not, this can be taken as
evidence of fundamental misspecification of the underlying local polynomial approximation.
From a practical perspective, the most important feature of the robust bias-corrected confidence
intervals CIrbc is that they can be used along with the MSE-optimal point estimator τ̂SRD when
constructed using the MSE-optimal bandwidth choice hMSE . In other words, the same observations
with score Xi ∈ [x̄ − hMSE , x̄ + hMSE ] are used for both optimal point estimation and valid statistical
inference.
Conceptually, the invalidity of the conventional confidence intervals CIus based on the MSE-optimal
bandwidth hMSE stems from using for inference a bandwidth that was optimally chosen for point
estimation purposes. Using hMSE for estimation of the RD effect τSRD results in a point estimator τ̂SRD
that is not only consistent but also has minimal asymptotic MSE. Thus, from a point estimation
perspective, hMSE leads to highly desirable properties of the RD treatment effect estimator.
In contrast, serious methodological challenges arise when researchers attempt to use hMSE for
building confidence intervals and making inferences in the standard parametric way, because the
MSE-optimal bandwidth choice is not designed with the goal of ensuring good (or even valid) dis-
tributional approximations. Robust bias correction restores a valid standard normal distributional
approximation when hMSE is used by recentering and rescaling the usual t-statistic to, respectively,
remove the underlying misspecification bias and account for additional variability introduced by
the bias correction estimate. Thus, by using the robust bias corrected confidence intervals CIrbc ,
researchers can use the same bandwidth hMSE for both point estimation and inference.
While employing the MSE-optimal bandwidth for both optimal point estimation and valid
statistical inference is certainly useful and practically relevant, it may be important to also con-
sider statistical inference that is optimal. A natural optimality criterion associated with robustness
properties of confidence intervals is the minimization of their coverage error. This is, for confidence
intervals, an analogous idea to minimization of MSE for point estimators. Thus, an alternative is to
decouple the goal of point estimation from the goal of inference, and to use a different bandwidth for
each case. In particular, this strategy involves estimating the RD effect with hMSE and constructing
confidence intervals using a different bandwidth, where the latter is specifically derived to provide
optimal inference properties. In fact, h can be chosen to minimize an approximation to the coverage
error of the confidence intervals CIrbc , that is, the discrepancy between the empirical coverage of
the confidence interval and the nominal level. For example, if a 95% confidence interval contains
the true parameter 80% of the time, the coverage error is 15 percentage points.
Therefore, while hMSE minimizes the asymptotic MSE of the point estimator τ̂SRD, the CER-
optimal bandwidth hCER minimizes the asymptotic coverage error rate of the robust bias-corrected
confidence interval for τSRD. This bandwidth cannot be obtained in closed form, but it can be shown
that it has a faster rate of decay than hMSE, which implies that for all practically relevant sample sizes
hCER < hMSE. By design, choosing h = hCER and then using that bandwidth choice to construct CIrbc
leads to confidence intervals that are not only valid but also have the fastest rate of coverage error
decay. Furthermore, it follows that using hCER for point estimation will result in a point estimator
that has too much variability relative to its bias and is therefore not MSE-optimal (though still
consistent). It is best practice to continue to use hMSE for MSE-optimal point estimation of τSRD;
researchers can then use either the same bandwidth or hCER to build the confidence intervals CIrbc,
where the resulting confidence intervals will be valid (former case) or CER-optimal (latter case).
We can now discuss the full output of our previous call to rdrobust with p = 1 and triangular
kernel, which we reproduce below.
> rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "mserd")
Call:
rdrobust(y = Y, x = X, p = 1, kernel = "triangular", bwselect = "mserd")

Summary:
                        Left     Right
Number of Obs           2314       315
Eff. Number of Obs       529       266
Order Loc Poly (p)         1         1
Order Bias (q)             2         2
BW Loc Poly (h)      17.2395   17.2395
BW Bias (b)          28.5754   28.5754
rho (h/b)             0.6033    0.6033

Estimates:
                  Coef   Std. Err.        z    P>|z|   CI Lower   CI Upper
Conventional    3.0195      1.4271   2.1159   0.0344     0.2225     5.8165
Robust                                        0.0758    -0.3093     6.2758
As reported before, the local linear RD effect estimate is 3.0195, which is estimated within the
MSE-optimal bandwidth of 17.2395. The last output panel provides all the necessary information to
make inferences. The row labeled Conventional reports, in addition to the point estimator τ̂SRD, the
conventional standard error √V̂, the standardized test statistic (τ̂SRD − τSRD)/√V̂, the corresponding
p-value, and the 95% conventional confidence interval CIus. This confidence interval ranges from
0.2225 to 5.8165 percentage points, suggesting a positive effect of an Islamic victory on the female
education share. Note that CIus is centered around the conventional point estimator τ̂SRD: the
midpoint of [0.2225, 5.8165] is exactly 3.0195.
The row labeled Robust reports the robust bias-corrected confidence interval CIrbc. In contrast
to CIus, CIrbc is centered around the point estimator τ̂SRD − B̂ (which is by default not reported),
and scaled by the robust standard error √V̂bc (not reported either). CIrbc ranges from −0.3093 to
6.2758 and, in contrast to the conventional confidence interval, it does include zero. As expected,
CIrbc is not centered at τ̂SRD. Also, its length, 6.5851, is greater than the length of CIus, 5.5940.
For a fixed common bandwidth, the length of CIrbc is always greater than the length of CIus
because √V̂bc > √V̂. However, this will not necessarily be true if different bandwidths are used
to construct each confidence interval.
The omission of the bias-corrected point estimator that lies at the center of CIrbc from the
rdrobust output is intentional: the bias-corrected estimator is suboptimal relative to τ̂SRD, in
terms of point estimation properties, when the MSE-optimal bandwidth for τ̂SRD is used. (The
bias-corrected estimator is nevertheless always consistent and valid whenever τ̂SRD is.) Practically,
it is usually desirable to report an MSE-optimal point estimator and then form valid confidence
intervals either with the same MSE-optimal bandwidth or with some other optimal choice specifically
tailored for inference.
In order to see all the ingredients that go into building the robust confidence intervals, we can
use the all option in rdrobust.
> rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "mserd", all = TRUE)
Call:
rdrobust(y = Y, x = X, p = 1, kernel = "triangular", bwselect = "mserd",
    all = TRUE)

Summary:
VCE Type                    NN

                        Left     Right
Number of Obs           2314       315
Eff. Number of Obs       529       266
Order Loc Poly (p)         1         1
Order Bias (q)             2         2
BW Loc Poly (h)      17.2395   17.2395
BW Bias (b)          28.5754   28.5754
rho (h/b)             0.6033    0.6033

Estimates:
                    Coef   Std. Err.        z    P>|z|   CI Lower   CI Upper
Conventional      3.0195      1.4271   2.1159   0.0344     0.2225     5.8165
Bias-Corrected    2.9832      1.4271   2.0905   0.0366     0.1862     5.7802
Robust            2.9832      1.6799   1.7758   0.0758    -0.3093     6.2758
The three rows in the bottom output panel are analogous to the rows in Table 4.1: the
Conventional row reports CIus, the Bias-Corrected row reports CIbc, and the Robust row reports
CIrbc. We can see that the standard error used by CIus and CIbc is the same (√V̂ = 1.4271),
while CIrbc uses a different standard error (√V̂bc = 1.6799). We also see that the conventional
confidence interval is centered at the conventional, non-bias-corrected point estimator 3.0195, while
both CIbc and CIrbc are centered at the bias-corrected point estimator 2.9832. Since we know that
τ̂SRD = 3.0195 and τ̂SRD − B̂ = 2.9832, we can deduce that the bias estimate is B̂ = 3.0195 − 2.9832 =
0.0363.
Finally, we investigate the properties of robust bias corrected inference when employing a CER-optimal bandwidth choice. This is obtained via rdrobust with the option bwselect="cerrd".
> rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "cerrd")
Call:
rdrobust(y = Y, x = X, p = 1, kernel = "triangular", bwselect = "cerrd")
Summary:
                          Left        Right
Number of Obs             2314          315
Eff. Number of Obs         360          216
Order Loc Poly (p)           1            1
Order Bias (q)               2            2
BW Loc Poly (h)        11.6288      11.6288
BW Bias (b)            28.5754      28.5754
rho (h/b)               0.4070       0.4070

Estimates:
                  Coef    Std. Err.        z     P>|z|    CI Lower    CI Upper
Conventional    2.4298       1.6824   1.4443    0.1487     -0.8676      5.7272
Robust                                          0.1856     -1.1583      5.9795
The bandwidth, common to both control and treatment units, is hCER = 11.6288, which is smaller than the MSE-optimal bandwidth previously employed, hMSE = 17.2395. The results are qualitatively similar, but now with a larger p-value, as the nominal 95% robust bias corrected confidence interval changes from [−0.3093, 6.2758] when using the MSE-optimal bandwidth to [−1.1583, 5.9795] when using the CER-optimal bandwidth. The RD point estimator changes from the MSE-optimal
value 2.9832 to the undersmoothed value 2.4298, where the latter RD estimate can be interpreted
as having less bias but more variability than the former.
Since the change in bandwidth choice from MSE-optimal to CER-optimal is practically impor-
tant, as well as whether a common bandwidth or two different bandwidths are used, we conclude
this section with a report of all the bandwidth choices available in the RD software employed. This
is obtained using the all option of the rdbwselect command.
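A sketch of the corresponding call, mirroring the options used throughout this section, is:
> rdbwselect(Y, X, kernel = "triangular", p = 1, all = TRUE)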
BW Selector                    All
Number of Obs                 2629
NN Matches                       3
Kernel Type             Triangular

                          Left        Right
Number of Obs             2314          315
Order Loc Poly (p)           1            1
Order Bias (q)               2            2
In our discussion of local polynomial methods above, we assumed that the local polynomial fit
includes only the running variable as a regressor—that is, we considered a local polynomial fit of
the outcome on the score alone. However, in some cases, researchers may want to augment their local
polynomial analysis by including, in addition to the RD score, a set of predetermined covariates
in their model specification. We now briefly discuss the most important issues related to adding
covariates, including when it is justified to include covariates, how exactly they should be included,
and how to modify the estimation and inference methods outlined above.
We let Zi (1) and Zi (0) denote two vectors of potential covariates—where Zi (1) represents the
value taken by the covariates above the cutoff (i.e., under treatment), and Zi (0) the value taken
below the cutoff (i.e., under control). For adjustment, researchers use the observed covariates, Zi ,
defined as
Zi = Zi(0) if Xi < x̄, and Zi = Zi(1) if Xi ≥ x̄.
If the covariates are truly predetermined, their values are determined before the treatment is
ever assigned and it must be the case that the potential covariate values under control will be
identical to the potential covariate values under treatment. Formally, predetermined covariates
satisfy Zi(1) = Zi(0) for all i.
Covariates can be included in multiple different ways to augment the basic RD estimation and
inference methods. The two most natural approaches are conditioning, which makes the most sense
when only a few discrete covariates are used, and partialling out via local polynomial methods. The
first approach amounts to employing all the methods presented and discussed so far, after subsetting
the data along the different subclasses generated by the interacted values of the covariates being
used. No modifications are needed, and all the methods can be applied directly.
The second approach to incorporating covariates based on augmenting the local polynomial
model allows for many covariates, which can be discrete or continuous. In this case, the idea is
to directly include as many pre-intervention covariates as possible without affecting the validity
of the point estimator, while at the same time improving its efficiency. To describe the covariate-
adjusted RD method, we retain all the previous notation and ingredients underlying the RD
local polynomial estimator, and then define the following joint estimation problem:
(ψ̂−, ψ̂+, γ̂) = arg min_{ψ−, ψ+, γ} Σ_{i=1}^{n} ( Yi − ψ−,0 − ψ−,1(Xi − x̄) − · · · − ψ−,p(Xi − x̄)^p − ψ+,0 Ti − ψ+,1 Ti(Xi − x̄) − · · · − ψ+,p Ti(Xi − x̄)^p − Z′i γ )² K((Xi − x̄)/h),

where ψ− = (ψ−,0, ψ−,1, · · · , ψ−,p)′ and ψ+ = (ψ+,0, ψ+,1, · · · , ψ+,p)′. The covariate-adjusted RD
estimator is
τ̃SRD = ψ̂+,0 ,
as this estimate captures the jump at the cutoff point in a fully interacted local polynomial regression
fit after partialling out the effect of the covariates Zi . In words, the approach is to fit a weighted least
squares regression of the outcome Yi on (i) a constant, (ii) the treatment indicator Ti , (iii) a p-order
polynomial on the running variable, (Xi − x̄), (Xi − x̄)2 , . . . , (Xi − x̄)p , (iv) a p-order polynomial on
the running variable interacted with the treatment (Xi − x̄)·Ti , (Xi − x̄)2 ·Ti , . . . , (Xi − x̄)p ·Ti , and (v)
the covariates Zi , using the weights K((Xi − x̄)/h) for all observations with x̄−h ≤ Xi ≤ x̄+h. The
approach, of course, reduces to the standard RD estimation when no covariates are included, that
is, τ̃SRD = τ̂SRD when γ = 0 is set before estimation. As should be apparent, including covariates
in a linear-in-parameters way requires the same type of choices as in the standard RD treatment
effect estimation case: the researcher needs to choose a polynomial order p, a kernel function K(·)
and a bandwidth h.
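To make the weighted least squares formulation concrete, the following sketch fits the covariate-adjusted local linear specification (p = 1) directly with lm(); it assumes the score X is already centered at the cutoff, a bandwidth h has been chosen, and Z is a matrix of predetermined covariates (in practice this fit is carried out internally by the RD software):
# Covariate-adjusted local linear RD fit via weighted least squares (sketch only).
# Assumes: Y = outcome, X = score centered at the cutoff, Z = covariate matrix, h = bandwidth.
Ti <- as.numeric(X >= 0)              # treatment indicator
w  <- pmax(1 - abs(X)/h, 0)           # triangular kernel weights
fit <- lm(Y ~ Ti * X + Z, weights = w, subset = abs(X) <= h)
unname(coef(fit)["Ti"])               # the coefficient on Ti is the jump at the cutoff after partialling out Z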
A crucial question is whether the covariate-adjusted estimator τ̃SRD estimates the same parameter
as the unadjusted estimator τ̂SRD . It can be shown that if the covariates Zi are truly predetermined,
then τ̃SRD is a consistent estimator of the sharp RD treatment effect τSRD —that is, both τ̃SRD and τ̂SRD
estimate the same parameter. If the covariates are not predetermined, in the sense that E[Zi (0)|Xi =
x] ≠ E[Zi (1)|Xi = x], then the covariate-adjusted estimator will not generally recover the RD
treatment effect τSRD .
We illustrate the inclusion of covariates using the Meyersson application. We use the predeter-
mined covariates introduced in Section 1.2 above: variables from the 1994 election (vshr islam1994,
partycount, lpop1994), and the geographic indicators (merkezi, merkezp, subbuyuk, buyuk). In order to keep the same number of observations as in the analysis without covariates, we exclude
the indicator for electing an Islamic party in the 1989 election (i89) because this variable has
missing values.
We start by using rdbwselect to choose an MSE-optimal bandwidth using the default options:
a polynomial of order one, a triangular kernel, and the same bandwidth on each side of the cutoff
(mserd option).
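A sketch of the call we have in mind is given below; the construction of the covariate matrix and the underscore-separated variable names are assumptions about how the dataset is stored:
> Z = cbind(vshr_islam1994, partycount, lpop1994, merkezi, merkezp, subbuyuk, buyuk)
> rdbwselect(Y, X, covs = Z, kernel = "triangular", p = 1, bwselect = "mserd")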
BW Selector                  mserd
Number of Obs                 2629
NN Matches                       3
Kernel Type             Triangular

                          Left        Right
Number of Obs             2314          315
Order Loc Poly (p)           1            1
Order Bias (q)               2            2
The MSE-optimal bandwidth including covariates is 14.40877, considerably different from the value of 17.2395 that we found before without covariate adjustment. This illustrates the general principle that covariate adjustment will typically change the values of the optimal bandwidths.
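The estimation results reported next come from re-running rdrobust with the covariates added; a sketch of the call, assuming the covariate matrix Z constructed above, is:
> rdrobust(Y, X, covs = Z, kernel = "triangular", p = 1, bwselect = "mserd")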
Summary:
                          Left        Right
Number of Obs             2314          315
Eff. Number of Obs         448          241
Order Loc Poly (p)           1            1
Order Bias (q)               2            2
BW Loc Poly (h)        14.4088      14.4088
BW Bias (b)            23.7310      23.7310
rho (h/b)               0.6072       0.6072

Estimates:
                  Coef    Std. Err.        z     P>|z|    CI Lower    CI Upper
Conventional    3.1080       1.2839   2.4207    0.0155      0.5915      5.6244
Robust                                          0.0368      0.1937      6.1317
The estimated RD effect is now 3.1080, similar to the unadjusted estimate of 3.0195 that we
found before. As we explained, this similarity is reassuring because if the included covariates are
truly predetermined, the unadjusted estimator and the covariate-adjusted estimator are estimating
the same parameter and thus should result in roughly the same estimate. With the inclusion of
covariates, the 95% robust confidence interval is now [0.1937 , 6.1317]. The unadjusted robust
confidence interval we estimated in the previous section is [−0.3093 , 6.2758]. Thus, including
covariates reduced the length of the confidence interval from 6.2758 − (−0.3093) = 6.5851 to
6.1317 − 0.1937 = 5.938, a length reduction of (|5.938 − 6.5851|/6.5851) × 100 = 9.82 percent. The
shorter confidence interval obtained with covariate adjustment (and the slight increase in the point
estimate) results in the robust p-value decreasing from 0.0758 to 0.0368. This exercise illustrates
the main benefit of covariate adjustment in local polynomial RD estimation: when successful, the inclusion of covariates in the analysis decreases the length of the confidence interval while leaving the point estimate (roughly) unchanged.
Finally, before closing this section, we briefly remark that all the variance estimators discussed
throughout can be extended to account for clustered data. This extension is straightforward, but
implies once again that the optimal bandwidth choices and corresponding optimal point estimators
need to be modified accordingly. Fortunately, all these modifications and extensions are readily
available in general purpose software, and we will employ them briefly in Section 7.
We offer several general recommendations for the implementation of local polynomial RD estimation
and inference. First, we recommend using a local linear polynomial with a triangular kernel as the
initial specification, and to select the bandwidth in a data-driven, automatic fashion. A natural initial
choice is the MSE-optimal bandwidth for the RD point estimator. This gives an initial benchmark
for further analysis. For point estimation purposes, it is natural to report the point estimator prior to bias correction and without covariate adjustment, constructed using the MSE-optimal bandwidth.
For inference purposes, the robust bias corrected confidence intervals can be reported using the
same MSE-optimal bandwidth used for point estimation and, in addition, the analogous confidence
intervals constructed using the CER-optimal bandwidth can be reported. Inclusion of covariates,
accounting for clustering, etc., can also be done as appropriate, and usually as a further robustness
check. If covariates are pre-intervention and the estimand of interest is the sharp RD treatment
effect, then the RD point estimator with and without covariates should not change much. Different bandwidths on each side of the cutoff can also be used, either MSE-optimal or CER-optimal, as an
additional improvement in some cases.
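In terms of software, these recommendations amount to a baseline pair of calls of the following form (a sketch using the same options employed throughout this section):
> rdrobust(Y, X, p = 1, kernel = "triangular", bwselect = "mserd")   # MSE-optimal point estimate and robust CI
> rdrobust(Y, X, p = 1, kernel = "triangular", bwselect = "cerrd")   # robust CI based on the CER-optimal bandwidth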
A textbook discussion of nonparametric local polynomial methods can be found in Fan and Gijbels
(1996). The specific application of local polynomial methods to RD estimation and inference was
first discussed by Hahn et al. (2001) and Porter (2003). Gelman and Imbens (2014) discuss the
problems associated with using global polynomial estimation for RD analysis. MSE-optimal band-
width selection for the local polynomial RD point estimator was first developed by Imbens and
Kalyanaraman (2012), and then generalized by Calonico et al. (2014b), Arai and Ichimura (2016,
2017), Bartalotti and Brummet (2017) and Calonico et al. (2017c) to different RD designs and set-
tings. Robust bias corrected confidence intervals were proposed by Calonico et al. (2014b), and their
higher-order properties as well as CER-optimal bandwidth selection for local polynomial confidence
intervals were developed by Calonico et al. (2017b,a). An overview of bandwidth selection methods
for RD analysis is provided by Cattaneo and Vazquez-Bare (2016). Bootstrap methods based on
robust bias corrected distributional approximations and inference are developed in Bartalotti et al.
(2017) and Chiang et al. (2017). Identification, estimation and inference when the local polynomial
RD analysis is performed with the addition of predetermined covariates are discussed in Calonico
et al. (2017c). Other extensions of estimation and inference using local polynomial methods and
robust bias correction inference are discussed in Xu (2017) and Dong (2017). An interesting em-
pirical example assessing the performance of robust bias correction inference methods is discussed
in Tukiainen et al. (2017). Estimation and inference with multiple or many RD cutoffs are discussed
in Cattaneo et al. (2016a) and Bertanha (2017). Further related results and references are given in
the contemporaneous edited volume Cattaneo and Escanciano (2017).
5 Local Randomization RD Approach
The continuity-based approach to RD analysis discussed in the previous section is the most com-
monly used in practice. That approach is based on assumptions of continuity (and further smooth-
ness) of the regression functions E[Yi (1)|Xi = x] and E[Yi (0)|Xi = x]. In contrast, the approach we
describe in this section is based on a formalization of the idea that the RD design can be viewed
as a randomized experiment near the cutoff x̄.
When the RD design was first introduced by Thistlethwaite and Campbell (1960), the justifica-
tion for this novel research design was not based on approximation and extrapolation of smooth
regression functions, but instead on the idea that the abrupt change in treatment status that oc-
curs at the cutoff leads to a treatment assignment mechanism that, near the cutoff, resembles the
assignment that a randomized experiment would have. Indeed, the authors described a hypothet-
ical experiment where the treatment is randomly assigned near the cutoff as an “experiment for
which the regression-discontinuity analysis may be regarded as a substitute” (Thistlethwaite and
Campbell, 1960, p. 310).
The idea that the treatment assignment is “as good as” randomly assigned in a neighborhood of
the cutoff is often invoked in the continuity-based framework to describe the required identification
assumptions in an intuitive way, and it has been used to develop formal results. However, within the
continuity-based framework, the formal derivation of identification and estimation results always
relies on continuity and differentiability of regression functions, and the idea of local randomization
is used as a heuristic device only. In contrast, what we call the local randomization approach to RD
analysis formalizes the idea that the RD design behaves like a randomized experiment near the
cutoff by imposing explicit randomization-type assumptions that are stronger than the standard
continuity-type conditions. In a nutshell, this approach imposes conditions so that units whose score
values lie in a small window around the cutoff can be analyzed as-if they were randomly assigned to
treatment or control. The local randomization approach adopts the local randomization assumption
explicitly, not as a heuristic interpretation, and builds a set of statistical tools exploiting that specific
feature.
We now introduce the local randomization approach in detail, discussing how adopting an
explicit randomization assumption near the cutoff allows for the use of new methods of estimation
and inference for RD analysis. We also discuss the differences between the standard continuity-
based approach and the local randomization approach. When the running variable is continuous,
the local randomization approach typically requires stronger assumptions than the continuity-based
approach; in these cases, it is natural to use the continuity-based approach for the main RD analysis,
and to use the local randomization approach as a robustness check. But in settings where the
running variable is discrete (with few mass points) or other departures from the canonical RD
framework occur, the local randomization approach can not only be very useful but also possibly
the only valid method for estimation and inference in practice.
When the RD analysis is based on a local randomization assumption, instead of assuming that the unknown
regression functions E[Yi (1)|Xi = x] and E[Yi (0)|Xi = x] are continuous at the cutoff, the researcher
assumes that there is a small window around the cutoff, W0 = [x̄−w0 , x̄+w0 ], such that for all units
whose scores fall in that window their placement above or below the cutoff is assigned as in a ran-
domized experiment—sometimes called as if random assignment. Formalizing the assumption that
the treatment is (locally) assigned as it would have been assigned in an experiment requires careful
consideration of the conditions that are guaranteed to hold in an actual experimental assignment.
There are important differences between the RD design and an actual randomized experiment.
To discuss such differences, we start by noting that any simple experiment can be recast as an RD
design where the score is a randomly generated number, and the cutoff is chosen to ensure a certain
treatment probability. For example, consider an experiment in a student population that randomly
assigns a scholarship with probability 1/2. This experiment can be seen as an RD design where
each student is assigned a random number with uniform distribution between 0 and 100, say, and
the scholarship is given to students whose score is above 50. We illustrate this scenario in Figure
5.1(a).
The crucial feature of a randomized experiment recast as an RD design is that the running
variable, by virtue of being a randomly generated number, is unrelated to the average potential
outcomes. This is the reason why, in Figure 5.1(a), the average potential outcomes E[Yi (1)|Xi = x]
and E[Yi (0)|Xi = x] take the same constant value for all values of x. Since the regression functions
are flat, the vertical distance between them can be recovered by the difference between the average
observed outcomes among all units in the treatment and control groups, i.e. E[Yi |Xi ≥ 50] −
E[Yi |Xi < 50] = E[Yi (1)|Xi ≥ 50] − E[Yi (0)|Xi < 50] = E[Yi (1)] − E[Yi (0)], where the last equality
follows from the assumption that Xi is a randomly generated number and thus is unrelated to Yi (1)
and Yi (0).
[Figure 5.1: average potential outcomes E[Y(1)|X] and E[Y(0)|X] plotted against the score, with the cutoff marked. Panel (a): a randomized experiment recast as an RD design, where both regression functions are flat; panel (b): a continuity-based RD design, where the vertical distance at the cutoff is the average treatment effect τSRD.]
The crucial difference between the scenarios in Figures 5.1(a) and 5.1(b) is our knowledge about
the functional form of the regression functions. As discussed in Section 4, the RD treatment effect
in 5.1(b) can be estimated by calculating the limit of the average observed outcomes as the score
approaches the cutoff for the treatment and control groups, limx↓x̄ E[Yi |Xi = x] − limx↑x̄ E[Yi |Xi =
x]. The estimation of these limits requires that the researcher approximate the regression functions,
and this approximation will typically contain an error that may directly affect estimation and
inference. This is in stark contrast to the experiment depicted in Figure 5.1(a), where the random
assignment of the score implies that the average potential outcomes are unrelated to the score and
estimation does not require functional form assumptions—by construction, the regression functions
are constant in the entire region where the score was randomly assigned.
A point often overlooked is that the known functional form of the regression functions in a true
experiment does not follow from the random assignment of the score per se, but rather from the
score being an arbitrary computer-generated number that is unrelated to the potential outcomes.
If the value of the score were randomly assigned but had a direct effect on the average outcomes,
the regression functions in Figure 5.1(a) would not necessarily be flat. Thus, a local randomization
approach to RD analysis must be based not only on the assumption that placement above or below
the cutoff is randomly assigned within a window of the cutoff, but also on the assumption that the
value of the score within this window is unrelated to the potential outcomes—a condition that is
guaranteed neither by the random assignment of Xi nor by the random assignment of the treatment Ti.
Formally, letting W0 = [x̄ − w0, x̄ + w0], the local randomization assumption can be stated as the following two conditions:
(LR1) The distribution of the running variable in the window W0 , FXi |Xi ∈W0 (x), is known, is the
same for all units, and does not depend on the potential outcomes: FXi |Xi ∈W0 (x) = F (x)
(LR2) Inside W0 , the potential outcomes depend on the running variable solely through the treat-
ment indicator Ti = 1(Xi ≥ x̄) but not directly: Yi (Xi , Ti ) = Yi (Ti ) for all i such that
Xi ∈ W0 .
Under these conditions, inside the window W0 , placement above or below the cutoff is unre-
lated to the potential outcomes, and the potential outcomes are unrelated to the running vari-
able; therefore, the regression functions are flat inside W0 . This is illustrated in Figure 5.2, where
E[Yi (1)|Xi = x] and E[Yi (0)|Xi = x] are constant for all values of x inside W0 , but have non-zero
slopes outside of it.
The contrast between Figures 5.1(a), 5.1(b), and 5.2 illustrates the differences between the
actual experiment where the score was a random number, a continuity-based RD design, and a
local randomization RD design. In the randomly assigned score experiment, the potential outcomes
are unrelated to the score for all possible score values—i.e., in the entire support of the score. In
this case, there is no uncertainty about the functional forms of E[Yi (1)|Xi = x] and E[Yi (0)|Xi = x].
In the continuity-based RD design, the potential outcomes can be related to the score everywhere;
the functions E[Yi (1)|Xi = x] and E[Yi (0)|Xi = x] are completely unknown, and estimation and
inference is based on approximating them at the cutoff. Finally, in the local randomization RD
design, the potential outcomes can be related to the running variable far from the cutoff, but
there is a window around the cutoff where this relationship ceases. In this case, the functions
E[Yi (1)|Xi = x] and E[Yi (0)|Xi = x] are unknown over the entire support of the running variable,
but inside the window W0 they are assumed to be constant functions of x.
In some applications, assuming that the score will have no effect on the (average) potential
outcomes somewhere very near the cutoff may be regarded as unrealistic or too restrictive. How-
ever, such an assumption can be taken as an approximation, at least for the very few units with
score extremely close to the RD cutoff. As we will discuss below, a key advantage of the local ran-
domization approach is that it leads to valid and powerful finite sample inference methods, which
remain valid and can be used even when only a handful of observations very close to the cutoff are
considered for (estimation and) inference.
Furthermore, the restriction that the score cannot directly affect the (average) potential out-
comes near the cutoff can be relaxed if the researcher is willing to impose more parametric assump-
tions (locally to the cutoff). As described and formalized so far, the local randomization assumption
assumes that, inside the window where the treatment is assumed to have been randomly assigned,
the potential outcomes are entirely unrelated to the running variable. This assumption, also known
as the exclusion restriction, leads to the flat regression functions in Figure 5.2. It is possible to
consider a slightly weaker version of this assumption, where condition (LR2) is relaxed. In this version,
the potential outcomes are allowed to depend on the running variable, but researchers assume that there exists a transformation that, once applied to the potential outcomes of the units inside the window where the treatment is assumed to be randomly assigned, leads to transformed potential outcomes that are unrelated to the running variable.

[Figure 5.2: average potential outcomes E[Y(1)|X] and E[Y(0)|X] plotted against the score; both functions are flat inside the window W0 around the cutoff, where the local randomization RD effect τLR is defined.]
Using the random potential outcomes notation, the exclusion restriction in (LR2) requires that,
for units with Xi ∈ W0 , the potential outcomes satisfy Yi (Xi , Ti ) = Yi (Ti )—that is, the potential
outcomes depend on the running variable only via the treatment assignment indicator and not via
the particular value taken by Xi . In contrast, the weaker alternative assumption requires that, for
units with Xi ∈ W0, there exists a transformation φ(·) such that the transformed potential outcomes, denoted Ỹi(Ti), no longer depend directly on the score.
This condition says that, although the potential outcomes are allowed to depend on the running
variable Xi directly, the transformed potential outcomes Ỹi (Ti ) depend only on the treatment
assignment indicator and thus satisfy the original exclusion restriction in (LR2). For implementa-
tion, a transformation φ(·) must be assumed; for example, one can use a polynomial of order p on
the unit’s score, with slopes that are constant for all individuals on the same side of the cutoff. This
transformation has the advantage of linking the local randomization approach to RD analysis to the
continuity-based approach discussed in the previous section. Finally, we note that these conditions
can be defined analogously for fixed (i.e., non-random) potential outcome functions yi (·).
Adopting a local randomization approach to RD analysis implies assuming that the assignment of
units above or below the cutoff was random inside the window W0 (condition LR1), and that in this
window the potential outcomes are unrelated to the score (condition LR2)—or can be somehow
transformed to be unrelated to the score.
Therefore, given knowledge of W0 , under a local randomization RD approach we can analyze the
data as we would analyze an experiment. If the number of observations inside W0 is large, researchers
can use the full menu of standard experimental methods, all of which are based on large-sample
approximations—that is, on the assumption that the number of units inside W0 is large enough to be
well approximated by large sample limiting distributions. These methods may or may not involve the
assumption of random sampling, and may or may not require LR2 per se (though removing LR2 will
change the interpretation of the RD parameter in general). In contrast, if the number of observations
inside W0 is very small, as is usually the case when local randomization methods are invoked in RD
designs, estimation and inference based on large-sample approximations may be invalid; in this case,
under appropriate assumptions, researchers can still employ randomization-based inference methods
that are exact in finite samples and do not require large-sample approximations for their validity.
These methods rely on the random assignment of treatment to construct confidence intervals and
hypothesis tests. We review both types of approaches below.
In many RD applications, a local randomization assumption will only be plausible in a very small
window around the cutoff, and by implication this small window will often contain very few ob-
servations. In this case, it is natural to employ a Fisherian inference approach, which is valid in any
finite sample, and thus leads to correct inferences even when the number of observations inside W0
is very small.
The Fisherian approach sees the potential outcomes as non-stochastic; this is in contrast to the
inference approaches used in the continuity-based RD approach, where the potential outcomes are
random variables as a consequence of random sampling. More precisely, in Fisherian inference, the
total number of units in the study, n, is seen as fixed—i.e., there is no random sampling assumption;
moreover, inferences do not rely on assuming that this number is large. This setup is then combined
with the so-called sharp null hypothesis that the treatment has no effect for any unit, HF0 : Yi(1) = Yi(0) for all i.
The combination of fixed units and the sharp null hypothesis leads to inferences that are (type-
I error) correct for any sample size because, under HF0 , both potential outcomes (i.e., Yi (1) and
Yi (0)) can be imputed for every unit and there is no missing data. In other words, under the
sharp null hypothesis, the observed outcome of each unit is equal to the unit's two potential
outcomes. Thus, when the treatment assignment is known, the fact that all potential outcomes are
observed under the null hypothesis allows us to derive the null distribution of any test statistic from
the randomization distribution of the treatment assignment alone. Since the latter distribution is
finite-sample exact, the Fisherian framework allows researchers to make inferences without relying
on large-sample approximations.
A hypothetical example
To illustrate how Fisherian inference leads to the exact distribution of test statistics, we use
a hypothetical example. We imagine that we have five units inside W0 , and we randomly assign
nW0 ,+ = 3 units to treatment and nW0 ,− = nW0 − nW0 ,+ = 5 − 3 = 2 units to control, where nW0
is the total number of units inside W0 . We choose the difference-in-means as the test-statistic. The
treatment indicator continues to be Ti , and we collect in the set TW0 all possible nW0 -dimensional
treatment assignment vectors t within the window.
Of course, a crucial difference between an actual experiment and the RD design is that, in the RD
design, the true mechanism by which units are assigned a value of the score smaller or larger than x̄
inside W0 is fundamentally unknown. Thus, the choice of the particular randomization mechanism
is best understood as an approximation. A common choice is the assumption that, within W0 ,
nW0,+ units are assigned to treatment and nW0 − nW0,+ units are assigned to control, and each possible treatment assignment vector has probability 1/(nW0 choose nW0,+) of occurring. This is commonly known as a complete randomization mechanism or a fixed margins randomization—
under this mechanism, the number of treated and control units is fixed, as all treatment assignment
vectors result in exactly nW0 ,+ treated units and nW0 − nW0 ,+ control units.
In our example, under complete randomization, the number of elements in TW0 is (5 choose 3) = 10—that is, there are ten different ways to assign five units to two groups of size three and two. We
assume that Yi (1) = 5 and Yi (0) = 2 for all units, so that the treatment effect is constant and
equal to 3 units. The top panel of Table 5.1 shows the ten possible treatment assignment vectors,
t1 , . . . , t10 , and also the two potential outcomes.
Suppose that the observed treatment assignment inside W0 is t6 , so that units 1, 4 and 5
are assigned to treatment, and units 2 and 3 are assigned to control. Given this assignment, the
vector of observed outcomes is Y = (5, 2, 2, 5, 5), and the observed value of the difference-in-means
statistic is S^obs = Ȳ+ − Ȳ− = (5 + 5 + 5)/3 − (2 + 2)/2 = 5 − 2 = 3. The bottom panel of Table 5.1 shows the
distribution of the test statistic under the null—that is, the ten different possible values that the
difference-in-means can take when HF0 is assumed to hold. The observed difference-in-means S^obs is
the largest of the ten, and the exact p-value is therefore pF = 1/10 = 0.10. Thus, we can reject HF0
with a test of level α = 0.10. Note that, since the number of possible treatment assignments is ten,
the smallest value that pF can take is 1/10. This p-value is finite-sample exact, because the null
distribution in Table 5.1 was derived directly from the randomization distribution of the treatment
assignment, and does not rely on any statistical model or large-sample approximations.
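To make the mechanics concrete, the following sketch reproduces this exact calculation by enumerating all ten treatment assignments, using the potential outcomes and the observed assignment t6 given above:
# Exact Fisherian p-value for the five-unit hypothetical example (sketch).
y <- c(5, 2, 2, 5, 5)                              # observed outcomes under t6 = (1, 0, 0, 1, 1)
treated <- combn(5, 3)                             # the 10 possible sets of 3 treated units
S <- apply(treated, 2, function(idx) mean(y[idx]) - mean(y[-idx]))
S_obs <- mean(y[c(1, 4, 5)]) - mean(y[c(2, 3)])    # observed difference in means = 3
mean(S >= S_obs)                                   # exact p-value = 1/10 = 0.10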
We can generalize the above example to provide a general formula for the exact p-value associ-
ated with a test of HF0. As before, we let TW0 denote the treatment assignment vector for the nW0 units in W0,
and collect in the set TW0 all the possible treatment assignments that can occur given the assumed
randomization mechanism. In a complete or fixed margins randomization, TW0 includes all vectors
of length nW0 such that each vector has nW0 ,+ ones and nW0 ,− = nW0 − nW0 ,+ zeros. Similarly,
YW0 collects the nW0 observed outcomes for all units with Xi ∈ W0 . We also need to choose a test
statistic, which we denote S(TW0 , YW0 ), that is a function of the treatment assignment TW0 and
the vector YW0 of observed outcomes for the nW0 units in the experiment that is assumed to occur
inside W0 .
Of all the possible values of the treatment vector TW0 that can occur, only one will have occurred
in W0; we call this value the observed treatment assignment, t^obs_W0, and we denote by S^obs the observed value of the test statistic. The associated p-value is pF = P(S(TW0, YW0) ≥ S^obs), where the probability is computed over the randomization distribution of the treatment assignment. When each of the treatment assignments in TW0 is equally likely, P(TW0 = t) = 1/#{TW0}, with #{TW0} the number of elements in TW0, and this expression simplifies to the number of times the test statistic exceeds the observed value divided by the total number of test statistics that can possibly occur,

pF = #{ S(tW0, YW0) ≥ S^obs : tW0 ∈ TW0 } / #{TW0}.
As in the hypothetical example above, under the sharp null hypothesis, all potential outcomes are
known and can be imputed. To see this, note that under HF0 we have YW0 = YW0 (1) = YW0 (0),
so that S(TW0 , YW0 ) = S(TW0 , YW0 (0)). Thus, under HF0 , the only randomness in S(TW0 , YW0 )
comes through the random assignment of the treatment, which is assumed to be known.
In practice, it often occurs that the total number of different treatment vectors tW0 that can
occur inside the window W0 is too large, and enumerating them exhaustively is infeasible. For example, assuming a fixed-margins randomization inside W0 with 15 observations on each side of the cutoff, there are (nW0 choose nW0,+) = (30 choose 15) = 155,117,520 possible treatment assignment vectors, and calculating pF by complete enumeration is computationally prohibitive. When exhaustive enumeration is infeasible, we can approximate
pF using simulations, as follows:
1. Compute the observed value of the test statistic, S^obs = S(t^obs_W0, YW0), and choose a number of replications B.
2. Draw a value t^j_W0 from the assumed randomization distribution of TW0.
3. Calculate the test statistic for the j-th draw, S(t^j_W0, YW0).
4. Repeat steps 2 and 3 for j = 1, 2, . . . , B, and approximate pF by

p̃F = (1/B) Σ_{j=1}^{B} 1( S(t^j_W0, YW0) ≥ S^obs ).
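A minimal R sketch of this simulation-based approximation, assuming yW0 and tW0 hold the outcomes and the 0/1 treatment indicators of the units inside W0 and using the difference-in-means statistic:
# Simulation-based approximation to the Fisherian p-value (sketch).
set.seed(50)
B <- 1000
S_obs <- mean(yW0[tW0 == 1]) - mean(yW0[tW0 == 0])   # observed difference in means
S_sim <- replicate(B, {
  t_j <- sample(tW0)                                 # one fixed-margins (complete) randomization draw
  mean(yW0[t_j == 1]) - mean(yW0[t_j == 0])
})
mean(S_sim >= S_obs)                                 # approximate Fisherian p-value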
Fisherian confidence intervals can be obtained by specifying sharp null hypotheses about treat-
ment effects, and then inverting these tests. In order to apply the Fisherian framework, the null
hypotheses to be inverted must be sharp—that is, under these null hypotheses, the full profile of
potential outcomes must be known. This requires specifying a treatment effect model, and testing
hypotheses about the specified parameters. A simple and common choice is a constant treatment
effect model, Yi (1) = Yi (0) + τ , which leads to the null hypothesis HFτ0 : τ = τ0 —note that HF0 is a
special case of HFτ0 when τ0 = 0. Under this model, a 1 − α confidence interval for τ is obtained
by collecting the set of all the values τ0 that fail to be rejected when we test HFτ0 : τ = τ0 with an
α-level test.
To test HFτ0 , we build test statistics based on an adjustment to the potential outcomes that
renders them constant under this null hypothesis. Under HFτ0 , the observed outcome is
Yi = Ti · Yi (1) + (1 − Ti ) · Yi (0)
= Ti · (Yi (0) + τ0 ) + (1 − Ti ) · Yi (0)
= Ti · τ0 + Yi (0).
Thus, the adjusted outcome Ÿi ≡ Yi − Ti τ0 = Yi (0) is constant under the null hypothesis HFτ0 .
A randomization-based test of HFτ0 proceeds by first calculating the adjusted outcomes Ÿi for all
the units in the window i = 1, . . . , nW0 , and then computing the test statistic using the adjusted
outcomes instead of the raw outcomes, i.e. computing S(TW0 , ŸW0 ). Once the adjusted outcomes
are used to calculate the test statistic, we have S(TW0 , ŸW0 ) = S(TW0 , YW0 (0)) as before, and a
test of HFτ0 : τ = τ0 can be implemented as a test of the sharp null hypothesis HF0 , using S(TW0 , ŸW0 ) instead of S(TW0 , YW0 ). We use pFτ0 to refer to the p-value associated with a randomization-based
test of HFτ0 .
In practice, assuming that τ takes values in [τmin, τmax], computing these confidence intervals requires building a grid Gτ0 = {τ01, τ02, . . . , τ0G}, with τ01 ≥ τmin and τ0G ≤ τmax, and collecting all τ0 ∈ Gτ0 that fail to be rejected with an α-level test of HFτ0. Thus, the Fisherian (1 − α) × 100% confidence interval is

CI^F_LR = { τ0 ∈ Gτ0 : pFτ0 > α }.
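Continuing the sketch above, the test inversion can be implemented along the following lines under the constant treatment effect model (the grid limits and the use of a two-sided difference-in-means statistic are illustrative choices):
# Fisherian confidence interval by test inversion (sketch), assuming yW0 and tW0 as before.
fisher_pval <- function(tau0, B = 1000) {
  y_adj <- yW0 - tW0 * tau0                          # adjusted outcomes: constant under the null
  S_obs <- mean(y_adj[tW0 == 1]) - mean(y_adj[tW0 == 0])
  S_sim <- replicate(B, {
    t_j <- sample(tW0)
    mean(y_adj[t_j == 1]) - mean(y_adj[t_j == 0])
  })
  mean(abs(S_sim) >= abs(S_obs))                     # two-sided randomization p-value
}
grid  <- seq(-10, 10, by = 0.25)
pvals <- sapply(grid, fisher_pval)
range(grid[pvals > 0.05])                            # endpoints of the 95% confidence interval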
Other test statistics that could be used include the Kolmogorov-Smirnov (KS) statistic and
the Wilcoxon rank sum statistic. The KS statistic is defined as SKS = supy |F̂1 (y) − F̂0 (y)|, and
measures the maximum absolute difference in the empirical cumulative distribution functions (CDF)
of the treated and control outcomes—denoted respectively by F̂1 (·) and F̂0 (·). Because SKS is the
treated-control difference in the outcome CDFs, it is well suited to detect departures from the null
hypothesis that involve not only differences in means but also differences in other moments and in
quantiles. Another test statistic commonly used is the Wilcoxon rank sum statistic, which is based
on the ranks of the outcomes, denoted R^y_i. This statistic is SWR = Σ_{i: Ti = 1} R^y_i; that is, it is the sum
of the ranks of the treated observations. Because SWR is based on ranks, it is not affected by the
particular values of the outcome, only by their ordering. Thus, unlike the difference-in-means, SWR
is insensitive to outliers.
In addition to different choices of statistics, the Fisherian approach also allows for different
randomization mechanisms. An alternative to the complete randomization mechanism discussed
above is a Bernoulli assignment, where each unit is assigned to treatment with some fixed equal
probability. For implementation, researchers can set this probability equal to 1/2 or, alternatively,
equal to the proportion of treated units in W0 . The disadvantage of a Bernoulli assignment is that
it can result in a treated or a control group with few or no observations—a phenomenon that can
never occur under complete randomization. In practice, nevertheless, complete randomization and
Bernoulli randomization often lead to very similar conclusions.
Despite the conceptual elegance of finite-sample Fisherian methods, the most frequently chosen
methods in the analysis of experiments are based on large-sample approximations. These methods
are appropriate to analyze RD designs under a local randomization assumption when the number
of observations inside W0 is sufficiently large to ensure that the moment and/or distributional
approximations are sufficiently similar to the finite-sample distributions of the statistics of interest.
A classic framework for experimental analysis is known as the Neyman approach. This approach
relies on large-sample approximations to the randomization distribution of the treatment assign-
ment, but still assumes that the potential outcomes are fixed or non-stochastic. In other words, the
Neyman approach to experimental analysis is based on approximations to the randomization dis-
tribution but does not assume that the data is a (random) sample from a larger super-population.
Since in the Neyman framework the potential outcomes are non-stochastic, the parameter of inter-
est is the finite-sample average treatment effect inside the window if point estimation is the goal,
or otherwise a given hypothesis test such as equal (sample) potential outcome means.
Specifically, the parameter of interest is

τ^LR_SRD = Ȳ(1) − Ȳ(0),   Ȳ(1) = (1/nW0) Σ_{i: Xi ∈ W0} Yi(1),   Ȳ(0) = (1/nW0) Σ_{i: Xi ∈ W0} Yi(0),
where Ȳ (1) and Ȳ (0) are the average potential outcomes inside the window. In this definition, we
have assumed that the potential outcomes are non-stochastic.
Note that the parameter τ^LR_SRD is different from the more conventional continuity-based RD parameter τSRD defined in Section 4: while τ^LR_SRD is an average effect inside an interval (the window W0), τSRD is an average at a single point (the cutoff x̄) where, by construction, the number of observations is zero. Thus, the decision to adopt a continuity-based approach versus a local randomization approach directly affects the definition of the parameter of interest. Naturally, if the window W0 is extremely small, τ^LR_SRD and τSRD become more conceptually similar.
The natural point estimator is

τ̂^LR_SRD = Ȳ+ − Ȳ−,   Ȳ+ = (1/nW0,+) Σ_{i: Xi ∈ W0} Yi·1(Xi ≥ x̄),   Ȳ− = (1/nW0,−) Σ_{i: Xi ∈ W0} Yi·1(Xi < x̄),
where Ȳ+ and Ȳ− are the average treated and control observed outcomes inside W0 , and nW0 ,+
and nW0 ,− are the number of treatment and control units inside W0 , respectively. In this case,
a conservative estimator of the variance of τ̂^LR_SRD is given by the sum of the sample variances in each group, V̂ = σ̂²+/nW0,+ + σ̂²−/nW0,−, where σ̂²+ and σ̂²− denote the sample variances of the outcome for the treatment and control units within W0, respectively. A 100(1 − α)% confidence
interval can be constructed in the usual way by relying on a normal large-sample approximation
to the randomization distribution of the treatment assignment. For example, an approximate 95%
confidence interval is

CI^N_LR = [ τ̂^LR_SRD ± 1.96 · √V̂ ].
Testing of the null hypothesis that the average treatment effect is zero can also be based on
Normal approximations. The Neyman null hypothesis is HN0 : Ȳ(1) = Ȳ(0), that is, that the average treatment effect inside W0 is zero.
In contrast to Fisher’s sharp null hypothesis HF0 , HN0 does not allow us to calculate the full profile of
potential outcomes for every possible realization of the treatment assignment vector t. Thus, unlike
the Fisherian approach, the Neyman approach to hypothesis testing must rely on approximation
and is therefore not exact. In the Neyman approach, we can construct the usual t-statistic using
the point and variance estimators just introduced, S = (Ȳ+ − Ȳ−)/√V̂, and then use a Normal approximation to its distribution. For example, for a one-sided test, the p-value associated with a test of HN0 is pN = 1 − Φ(S), where Φ(·) is the standard Normal CDF.
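A minimal sketch of these Neyman calculations, again using yW0 and tW0 for the units inside W0:
# Large-sample (Neyman) inference inside W0 (sketch).
ybar_plus  <- mean(yW0[tW0 == 1])
ybar_minus <- mean(yW0[tW0 == 0])
V_hat   <- var(yW0[tW0 == 1]) / sum(tW0 == 1) + var(yW0[tW0 == 0]) / sum(tW0 == 0)
tau_hat <- ybar_plus - ybar_minus                  # difference-in-means estimate
ci95    <- tau_hat + c(-1.96, 1.96) * sqrt(V_hat)  # approximate 95% confidence interval
S       <- tau_hat / sqrt(V_hat)
1 - pnorm(S)                                       # one-sided p-value; use 2*(1 - pnorm(abs(S))) for two-sided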
We illustrate the finite-sample and large-sample inference procedures described above using the
Meyersson application. For this, we use the function rdrandinf, which is part of the rdlocrand
package. The main arguments of rdrandinf include the outcome variable Y, the running variable X,
and the upper and lower limits of the window where inferences will be performed (wr and wl). We
choose the ad-hoc window [-2.5, 2.5], and postpone the discussion of automatic data-driven window
selection until the next section. To make inferences in W = [−2.5, 2.5], we set wl = −2.5 and
wr = 2.5. Since Fisherian methods are simulation-based, we also choose the number of simulations
via the argument reps, in this case choosing 1,000 simulations. Finally, in order to be able to
replicate the Fisherian simulation-based results at a later time, we set the random seed using the
argument seed.
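A sketch of the call just described (the argument values are those given in the text):
> out = rdrandinf(Y, X, wl = -2.5, wr = 2.5, reps = 1000, seed = 50)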
The output is divided into three panels. The top panel first presents the total number of observa-
tions in the entire dataset (that is, in the entire support of the running variable), the order of the
polynomial, and the kernel function that is used to weigh the observations. By default, rdrandinf uses a polynomial of order zero, which means the outcomes are not transformed. In order to transform the outcomes via a polynomial as explained above, users can use the option p in the call to rdrandinf. The default is also to use a uniform kernel, that is, to compute the test statistic using
the unweighted observations. This default behavior can be changed with the option kernel. The
rest of the top panel reports the number of simulations used for Fisherian inference, the method
used to choose the window, and the null hypothesis that is tested (default is τ0 = 0, i.e. a test of
HF0 and HN0 ). Finally, the last row of the top panel reports the chosen randomization mechanism,
which by default is fixed margins (i.e. complete) randomization.
The middle panel reports the number of observations to the left and right of the cutoff in both the
entire support of the running variable, and in the chosen window. As shown in the output, although
there is a total of 2314 control observations and 315 treated observations in the entire dataset, the
number of observations in the window [−2.5, 2.5] is much smaller, with only 68 municipalities
below the cutoff and 62 municipalities above the cutoff. The middle panel also reports the mean
and standard deviation of the outcome inside the chosen window.
The last panel reports the results. The first column reports the type of test statistic employed for
testing the Fisherian sharp null hypothesis (the default is the difference-in-means), and the column
labeled T reports its value. In this case, the difference-in-means is 1.072; given the information in
the Mean of outcome row in the middle panel, we see that this is the difference between a female
education share of 15.044 percentage points in municipalities where the Islamic party barely won,
and a female education share of 13.972 percentage points in municipalities where the Islamic party
barely lost. The Finite sample column reports the p-value associated with a randomization-based
test of the Fisherian sharp null hypothesis HF0 (or the alternative sharp null hypothesis HFτ0 based
on a constant treatment effect model if the user sets τ0 6= 0 via the option nulltau). This p-value
is 0.488, which means we fail to reject the sharp null hypothesis.
Finally, the Large sample columns in the bottom panel report Neyman inferences based on the
large sample approximate behavior of the (distribution of the) statistic. The p-value reported in the
large-sample columns is thus pN , the p-value associated with a test of the Neyman null hypothesis
HN0 that the average treatment effect is zero. The last column in the bottom panel reports the
power of the Neyman test to reject a true average treatment effect equal to d, where by default d
is set to one half of the standard deviation of the outcome variable for the control group, which in
this case is 4.27 percentage points. The value of d can be modified with the options d or dscale.
Like pN , the calculation of the power versus the alternative hypothesis d is based on the Normal
approximation discussed in the previous section. The large-sample p-value is 0.501, indicating that
the Neyman null hypothesis also fails to be rejected at conventional levels. The power calculation
indicates that the probability of rejecting the null hypothesis when the true effect is equal to half a
(control) standard deviation is relatively high, at 0.765. Thus, it seems that the failure to reject the
null hypothesis stems from the small size of the average treatment effect estimated in this window,
which is just 1.072/(4.27 × 2) = 1.072/8.54 = 0.126 standard deviations of the control outcome—a
very small effect.
It is also important to note the different interpretation of the difference-in-means test statistic
in the Fisherian and Neyman frameworks. In Fisherian inference, the difference-in-means is simply
one of the various test statistics that can be chosen to test the sharp null hypothesis, and should
not be interpreted as an estimated effect—remember that in the Fisherian framework, the focus is
on testing null hypotheses that are sharp. In contrast, in the Neyman framework, the focus is on
the sample average treatment effect; since the difference-in-means is an unbiased estimator of this
parameter, it can be appropriately interpreted as an estimated effect.
To illustrate how robust Fisherian inferences can be to the choice of randomization mechanism
and test statistic, we modify our call to rdrandinf to use a Bernoulli randomization mechanism, where
every unit in the ad-hoc window [−2.5, 2.5] has a 1/2 probability of being assigned to treatment. For
this, we must first create an auxiliary variable that contains the treatment assignment probability
of every unit in the window; this auxiliary variable is then passed as an argument to rdrandinf.
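A sketch of this construction follows; the name bern_prob and the bernoulli argument reflect the rdlocrand conventions as we understand them, so the exact call should be checked against the package documentation:
> bern_prob = ifelse(abs(X) <= 2.5, 1/2, NA)   # assignment probability defined only inside the window
> out = rdrandinf(Y, X, wl = -2.5, wr = 2.5, reps = 1000, seed = 50, bernoulli = bern_prob)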
The last row of the top panel now says Randomization = Bernoulli, indicating that the Fish-
erian randomization-based test of the sharp null hypothesis is assuming a Bernoulli treatment
assignment mechanism, where each unit has probability q of being placed above the cutoff—in
this case, given our construction of the bern_prob variable, q = 1/2 for all units. The Fisherian
finite-sample p-value is now 0.469, very similar to the 0.488 p-value obtained above under the
assumption of a fixed margins randomization. The conclusion of failure to reject HF0 is therefore
unchanged. This robustness of the Fisherian p-value to the choice of fixed margins versus Bernoulli
randomization is typical of most applications. Note also that the large-sample results are exactly
the same as before—this is expected, since the choice of randomization mechanism does not affect
the large-sample Neyman inferences.
We can also change the test statistic used to test the Fisherian sharp null hypothesis. For example,
to use the Kolmogorov-Smirnov (KS) test statistic instead of the difference-in-means, we set the
option statistic = "ksmirnov".
> out = rdrandinf(Y, X, wl = -2.5, wr = 2.5, seed = 50, statistic = "ksmirnov")
The bottom panel now reports the value of the KS statistic in the chosen window, which is 0.101.
The randomization-based test of the Fisherian sharp null hypothesis HF0 based on this statistic has
p-value 0.846, considerably larger than the 0.488 p-value found in the same window (and with the
same fixed-margins randomization) when the difference-in-means was chosen instead. Note that
now the large-sample results report a large-sample approximation to the KS test p-value, and not
a test of the Neyman null hypothesis HN0 . Moreover, the KS statistic has no interpretation as a
treatment effect in either case.
Finally, we illustrate how to obtain confidence intervals in our call to rdrandinf. Remember
that in the Fisherian framework, confidence intervals are obtained by inverting tests of sharp null
hypotheses. To implement this inversion, we must specify a grid of τ values; rdrandinf will then test
the null hypotheses HFτ0 : Yi (1)−Yi (0) = τ0 for all values of τ0 in the grid, and collect in the confidence
interval all the hypotheses that fail to be rejected in a randomization-based test of the desired level
(default is level α = 0.05). To calculate these confidence intervals, we create the grid, and then call
rdrandinf with the ci option. For this example, we choose a grid of values for τ0 between −10 and
10, with 0.25 increments. Thus, we test HFτ0 for all τ0 ∈ Gτ0 = {−10, −9.75, −9.50, . . . , 9.50, 9.75, 10}.
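One way to write this call is sketched below; here the first element passed to the ci option is the test level and the remaining elements are the grid, an interface we believe rdrandinf follows, though the exact syntax may differ across package versions:
> ci_grid = seq(from = -10, to = 10, by = 0.25)
> out = rdrandinf(Y, X, wl = -2.5, wr = 2.5, reps = 1000, seed = 50, ci = c(0.05, ci_grid))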
The Fisherian 95% confidence interval is [−2, 4]. As explained, this confidence interval assumes
a constant treatment effect model. The interpretation is therefore that, given the assumed ran-
domization mechanism, all values of τ between -2 and 4 in the constant treatment effect model
Yi (1) = Yi (0) + τ fail to be rejected with a randomization-based 5%-level test. In other words, in
this window, and given a constant treatment effect model, the empirical evidence based on a local
randomization RD framework is consistent with both negative and positive true effects of Islamic
victory on the female education share.
In the previous sections, we assumed that W0 was known. However, in practice, even when a
researcher is willing to assume that there exists a window around the cutoff where the treatment
is as-if randomly assigned, the location of this window will be typically unknown. This is another
fundamental difference between local randomization RD designs and actual randomized controlled
experiments, since in the latter there is no ambiguity about the population of units that were subject
to the random assignment of the treatment. Thus, the most important step in the implementation of
the local randomization RD approach is to select the window around the cutoff where the treatment
can be plausibly assumed to have been as-if randomly assigned.
One option is to choose the randomization window in an ad-hoc way, selecting a small neigh-
borhood around the cutoff where the researcher is comfortable assuming local randomization. For
example, a scholar may believe that elections decided by 0.5 percentage points or less are essentially
decided as if by the flip of a coin, and choose the window [x̄ − 0.5, x̄ + 0.5]. The obvious disadvantage
of selecting the window arbitrarily is that the resulting choice is based neither on empirical evidence
nor on a systematic procedure, and thus lacks objectivity and replicability.
A preferred alternative is to choose the window using the information provided by relevant
predetermined covariates—variables that reflect important characteristics of the units, and whose
values are determined before the treatment is assigned and received. This approach requires assum-
ing that there exists at least one important predetermined covariate of interest, Z, that is related
to the running variable everywhere except inside the window W0 . Figure 5.3 shows a hypotheti-
cal illustration, where the conditional expectation of Z given the score, E(Z|X) is plotted against
X. Outside of W0 , E(Z|X) and X are related: a mild U-shaped relationship to the left of x̄, and
monotonically increasing to the right—possibly due to correlation between the score and another
characteristic that also affects Z. However, inside the window W0 where local randomization holds,
this relationship disappears by virtue of applying conditions LR1 and LR2 to Z, taking Z as an
“outcome” variable. Moreover, because Z is a predetermined covariate, the effect of the treatment
on Z is zero by construction. In combination, these assumptions imply that there is no association
between E(Z|X) and X inside W0 , but these two variables are associated outside of W0 .
This suggests a data-driven method to choose W0 . We define a null hypothesis H0 stating that
the treatment is unrelated to Z (or that Z is “balanced” between the groups). In theory, this
hypothesis could be the Fisherian hypothesis HF0 or the Neyman hypothesis HN0 . However, since
the procedure we discuss will be based on some small windows with very few observations, we
recommend using randomization-based tests of the Fisherian hypothesis HF0 , which takes the form
HF0 : Zi(1) = Zi(0). Naturally, the effect of the treatment on Z is zero for all units inside W0 because
the covariate is predetermined. However, the window selection procedure is based on the assumption
that, outside W0 , the treatment and control groups differ systematically in Z—not because the
treatment has a causal effect on Z but rather because the running variable is correlated with Z
outside W0 . This assumption is important; without it, the window selector will not recover the true
W0 .
The procedure starts with the smallest possible window—W1 in Figure 5.3—and tests the null
hypothesis of no effect H0 . Since there is no relationship between E(Z|X) and X inside W1 , H0 will fail to be rejected. Once H0 fails to be rejected, a larger window W2 is selected, and the null
hypothesis is tested again inside W2 . The procedure keeps increasing the length of the window and
re-testing H0 in each larger window, until a window is reached where H0 is rejected at the chosen
significance level α* ∈ (0, 1). In the figure, assuming the test has perfect power, the null hypothesis
will not be rejected in W0 , nor will it be rejected in W2 or W1 . The chosen window is the largest
window such that H0 fails to be rejected inside that window, and in all windows contained in it. In
Figure 5.3, the chosen window is W0 .
In spirit, this nested procedure is analogous to the use of covariate balance tests in randomized
controlled experiments. In essence, the procedure chooses the largest window such that covariate
balance holds in that window and all smaller windows inside it. As we show in our empirical
illustration, this data-driven window selection method can be implemented with several covariates,
for example, rejecting a particular window choice when H0 is rejected for at least one covariate.
As mentioned, we recommend choosing HF0 as the null hypothesis. In addition, the practical
implementation of the procedure requires several other choices:
• Choose the relevant covariates. Researchers must decide which covariates to use in the window
selection procedure; these covariates should be related to both the outcome and the treatment
assignment. If multiple covariates are chosen, the procedure can be applied using either the
p-value of an omnibus test statistic, or by testing H0 for each covariate separately and then
making the decision to reject H0 based on the minimum p-value across all covariates—i.e.,
rejecting a particular window choice when H0 is rejected for at least one covariate.
• Choose the test statistic. Researchers must choose the statistic on which the randomization-
based test of the Fisherian null hypothesis will be based. This can be the difference-in-means,
one of the alternative statistics discussed above, or other possibilities.
• Choose the randomization mechanism. Researchers must select the randomization mechanism
that will be assumed inside the window to test the sharp null hypothesis HF0 using Fisherian
methods. In many applications, an appropriate choice is a complete randomization mechanism
in which each possible assignment of nW0,t treated units among the nW0 units in the window has probability $1/\binom{n_{W_0}}{n_{W_0,t}}$, as discussed above.
[Figure 5.3: hypothetical illustration of the window selector. E(Z|X) is plotted against the score X, with the cutoff marked and the nested windows W1 ⊂ W2 ⊂ W0 ⊂ W3 ⊂ W4 ⊂ W5 ⊂ W6 around it.]
• Choose α*. The threshold significance level α* determines when the null hypothesis is considered rejected. Since the main concern is failing to reject a null hypothesis when it is false—in contrast to the usual concern about rejecting a true null hypothesis—the level of the test should be higher than conventional levels. When we test H0 at a higher level, we tolerate a higher probability of Type I error in exchange for a lower probability of concluding that the covariate is unrelated to the treatment assignment when in fact it is related. We recommend setting α* ≥ 0.15.
Once researchers have selected the relevant covariates, the test statistic, the randomization mechanism in the window, and the threshold significance level α*, the window selection procedure can be implemented with the following algorithm:
1. Set j = 1 and let W1 = [x̄ − w1 , x̄ + w1 ] be the smallest window considered.
2. For every covariate Zk , k = 1, 2, . . . , K, use the test statistic Tk and the chosen randomization mechanism to test HF0 in a Fisherian framework, using only observations whose score values are inside Wj . Compute the associated p-value, pk . (If using an omnibus test, compute the single omnibus p-value.)
3. Compute the minimum p-value pmin = min(p1 , p2 , . . . , pK ). (If using an omnibus test, replace pmin by the single omnibus p-value.)
(a) If pmin > α*, do not reject the null hypothesis; increase the length of the window by 2wstep , setting Wj+1 = [x̄ − wj − wstep , x̄ + wj + wstep ], let j = j + 1, and go back to step (2).
(b) If pmin ≤ α*, reject the null hypothesis and conclude that the largest window in which the local randomization assumption is plausible is the previous window, Wj−1 .
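To fix ideas, the following is a minimal R sketch of this loop for a fixed-step sequence of symmetric windows, using randomization-based difference-in-means tests under complete randomization (implemented here by permuting the treatment labels). It is only illustrative: the function name select_window and its arguments are ours, and the rdwinselect function used below provides the full implementation, including alternative test statistics and the observation-based window sequence.

> select_window <- function(X, Z, wstart, wstep, wmax,
+                           alpha = 0.15, reps = 1000, seed = 50) {
+   # X: score centered at the cutoff; Z: matrix of predetermined covariates
+   set.seed(seed)
+   chosen <- NA
+   for (w in seq(wstart, wmax, by = wstep)) {
+     inside <- abs(X) <= w
+     D  <- as.numeric(X[inside] >= 0)            # treatment indicator inside the window
+     Zw <- as.matrix(Z)[inside, , drop = FALSE]
+     pvals <- apply(Zw, 2, function(z) {
+       obs  <- mean(z[D == 1], na.rm = TRUE) - mean(z[D == 0], na.rm = TRUE)
+       sims <- replicate(reps, {
+         Dstar <- sample(D)                      # re-randomize labels (complete randomization)
+         mean(z[Dstar == 1], na.rm = TRUE) - mean(z[Dstar == 0], na.rm = TRUE)
+       })
+       mean(abs(sims) >= abs(obs))               # two-sided randomization p-value
+     })
+     if (min(pvals) <= alpha) break              # imbalance detected: stop enlarging
+     chosen <- w                                 # largest half-length with covariate balance
+   }
+   chosen
+ }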
To see how the procedure works in practice, we use it to select a window in the Meyersson
application using the set of predetermined covariates described in Section 1.2: vshr_islam1994, partycount, lpop1994, i89, merkezi, merkezp, subbuyuk, buyuk. We use the function rdwinselect,
which is one of the functions in rdlocrand. The main arguments are the score variable X, the matrix
of predetermined covariates, and the sequence of nested windows; for simplicity, only symmetric
windows are considered. We also choose 1,000 simulations for the calculation of Fisherian p-values
in each window.
There are two ways to increment the length of the windows in rdwinselect. One is to increase the length of the window in fixed steps, which can be implemented with the option wstep. For example, if the first window selected is [−0.1, 0.1] and wstep = 0.1, the sequence is W1 = [−0.1, 0.1], W2 = [−0.2, 0.2], W3 = [−0.3, 0.3], and so on. The other is to increase the length of the window so that the number of observations increases by at least a minimum fixed amount at every step, which can be done via the option wobs. For example, setting wobs = 2 makes every window in the sequence the smallest symmetric window that adds at least 2 observations on each side of the cutoff relative to the prior window. By default, rdwinselect starts with the smallest window that has at least 10 observations on either side, but this default behavior can be changed with the options wmin or obsmin. Finally, rdwinselect uses the chosen level α* to recommend the chosen window; the default is α* = 0.15, but this can be modified with the level option.
We start by considering a sequence of symmetric windows whose length grows at every step by the minimum amount needed to add at least 2 observations on each side of the cutoff.
> Z = cbind(data$i89, data$vshr_islam1994, data$partycount, data$lpop1994,
+           data$merkezi, data$merkezp, data$subbuyuk, data$buyuk)
> colnames(Z) = c("i89", "vshr_islam1994", "partycount", "lpop1994",
+                 "merkezi", "merkezp", "subbuyuk", "buyuk")
> out = rdwinselect(X, Z, seed = 50, reps = 1000, wobs = 2)
Window length/2   p-value   Var. name   Bin. test   Obs<c   Obs>=c
The top and middle panels of the rdwinselect output are very similar to the corresponding panels of the rdrandinf output. One difference is the Testing method, which indicates whether randomization-based methods are used to test HF0 or Normal approximation methods are used to test HN0. The default is randomization-based methods, but this can be changed with the approximate option. The other difference is the Balance test, which indicates the type of test statistic used for testing the null hypothesis—the default is diffmeans, the difference-in-means. The option statistic allows the user to select a different test statistic.
The bottom panel shows the results of the tests of the null hypothesis for each window considered. By default, rdwinselect starts with the smallest symmetric window that has at least 10 observations on either side of the cutoff. Since we set wobs = 2, each subsequent window is the smallest (symmetric) window that adds at least 2 observations on each side of the cutoff. For every window, the column p-value reports either pmin—the minimum of the K p-values p1 , p2 , . . . , pK associated with a test of the null hypothesis of no effect for each of the covariates Z1 , Z2 , . . . , ZK—or the single p-value if an omnibus test is used to test H0 jointly for all covariates. The column Var. name reports the covariate associated with the minimum p-value—that is, the covariate Zk such that pk = pmin.
Finally, the column Bin. test uses a binomial test to calculate the probability of observing nW,+ successes out of nW trials, where nW,+ is the number of observations within the window that are above the cutoff (reported in the column Obs>=c) and nW is the total number of observations within the window (the sum of the numbers reported in the columns Obs<c and Obs>=c).
We will discuss this specific method in the following section on falsification and validation of RD
designs.
In this empirical example, the p-values are above 0.20 in all windows between the minimum window [−0.446, 0.446] and [−0.944, 0.944]. In the window immediately after [−0.944, 0.944], the p-value drops to 0.048, well below the suggested 0.15 threshold; therefore, the chosen window is W* = [−0.944, 0.944]. After this window, the p-values decrease overall, although the decrease is not necessarily monotonic at first. By default, rdwinselect shows only the first 20 windows; in order to see the sharp decrease in p-values that occurs as larger windows are considered, we set the option nwindows = 50 to display the first 50 windows. We also set the option plot = TRUE to create a plot of the minimum p-values against the length of each window considered—the plot is shown in Figure 5.4 below.
> Z = cbind(data$i89, data$vshr_islam1994, data$partycount, data$lpop1994,
+           data$merkezi, data$merkezp, data$subbuyuk, data$buyuk)
> colnames(Z) = c("i89", "vshr_islam1994", "partycount", "lpop1994",
+                 "merkezi", "merkezp", "subbuyuk", "buyuk")
> out = rdwinselect(X, Z, seed = 50, reps = 1000, wobs = 2, nwindows = 50,
+                   plot = TRUE)
Window length/2   p-value   Var. name   Bin. test   Obs<c   Obs>=c
Figure 5.4: Window selection: minimum P-value against window length—Meyersson data
From the plot, we see that, in the sequence of windows considered, the minimum p-value is
above 0.20 for all windows smaller than [-1,1], and decreases below 0.1 as the windows get larger
than [-1,1]. Although the p-values increase again for windows between [-1,1] and [-2,2], they decrease
sharply once windows larger than [-2,2] are considered.
If we want to choose the window using a sequence of symmetric windows whose length increases by a fixed amount, rather than by controlling the minimum number of added observations, we simply use the wstep option. Calling rdwinselect with wstep = 0.1 performs the covariate balance tests in a sequence of windows that starts at the minimum window and increases the length by 0.1 on each side of the cutoff.
> Z = cbind(data$i89, data$vshr_islam1994, data$partycount, data$lpop1994,
+           data$merkezi, data$merkezp, data$subbuyuk, data$buyuk)
> colnames(Z) = c("i89", "vshr_islam1994", "partycount", "lpop1994",
+                 "merkezi", "merkezp", "subbuyuk", "buyuk")
> out = rdwinselect(X, Z, seed = 50, reps = 1000, wstep = 0.1,
+                   nwindows = 25)
Window length/2   p-value   Var. name   Bin. test   Obs<c   Obs>=c
The suggested window is [−0.946, 0.946], very similar to the [−0.944, 0.944] window chosen
above with the wobs=2 option.
We can now use rdrandinf to perform a local randomization analysis in the chosen window. For this, we use the options wl and wr to input, respectively, the lower and upper limits of the chosen window W* = [−0.944, 0.944]. We also use the option d = 3.0195 to calculate the power of a Neyman test to reject the null hypothesis of a zero average treatment effect when the true average difference is 3.0195. This value is the local linear point estimate obtained in Section 4 with the continuity-based approach.
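The corresponding call is not shown in the extracted output; a minimal version, reusing the Y and X objects and the seed from the previous examples, would be:

> out = rdrandinf(Y, X, wl = -0.944, wr = 0.944, seed = 50, d = 3.0195)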
The difference in means in the chosen window W* = [−0.944, 0.944] is 2.638, fairly similar to the continuity-based local linear point estimate of 3.0195. However, using a Neyman approach, we cannot distinguish this average difference from zero, with a p-value of 0.333. Similarly, we fail to reject the Fisherian sharp null hypothesis that an electoral victory by the Islamic party has no effect on the female education share for any municipality (p-value 0.386). As shown in the last column, the large-sample power to detect a difference of around 3 percentage points is only 19.4%. Naturally, the small number of observations in the chosen window (22 and 27 below and above the cutoff, respectively) limits statistical power. In addition, the effect of 2.638, although much larger than the 1.072 estimated in the ad-hoc [−2.5, 2.5] window, is still small: it is less than a third of one standard deviation of the female education share in the control group (2.638/8.615 = 0.306). Consistent with these results, the Fisherian 95% confidence interval under a constant treatment effect model is [−2.8, 8.5], which includes both positive and negative effects.
Finally, we mention that instead of calling rdwinselect first and rdrandinf second, we can
choose the window and perform inference in one step by using the covariates option in rdrandinf.
> Z = cbind(data$i89, data$vshr_islam1994, data$partycount, data$lpop1994,
+           data$merkezi, data$merkezp, data$subbuyuk, data$buyuk)
> colnames(Z) = c("i89", "vshr_islam1994", "partycount", "lpop1994",
+                 "merkezi", "merkezp", "subbuyuk", "buyuk")
> out = rdrandinf(Y, X, covariates = Z, seed = 50, d = 3.019522)

rdwinselect complete.
However, it is usually better to first choose the window using rdwinselect and only then use rdrandinf. The reason is that calling rdwinselect by itself never shows outcome results, which reduces the possibility of choosing the window in which the outcome results go in the "expected" direction—in other words, settling on a window choice before looking at the outcome results minimizes pre-testing and specification-searching issues.
Unlike an experiment, the treatment assignment mechanism in the RD design does not logically
imply that the treatment is randomly assigned within some window. Like the continuity assumption,
the local randomization assumption must be made in addition to the RD assignment mechanism,
and is not directly testable. But the local randomization assumption is strictly stronger than the
continuity assumption, in the sense that if there is a window around x̄ in which the regression
functions are flat, then these regression functions will also be continuous at x̄—but the converse
is not true. Why, then, would researchers want to impose stronger assumptions to make their
inferences?
In order to see in what type of situations the stronger assumption of local randomization is appropriate, it is useful to remember that the local polynomial approach, although based on the weaker condition of continuity, necessarily relies on extrapolation because there are no observations exactly at the cutoff. The continuity assumption does not imply a specific functional form of the regression functions near the cutoff, as it approximates these functions using nonparametric methods; however, this approximation relies on extrapolation and introduces an approximation error that is negligible only if the sample size is large enough. This makes the continuity-based approach more appealing when there are enough observations near the cutoff to approximate the shape of the regression functions with reasonable accuracy—but possibly inadequate when the number of observations is small. In applications with few observations, the local randomization approach has the advantage of requiring minimal extrapolation and avoiding the use of smoothing methods.
Relatedly, the continuity-based approach requires a continuously distributed running variable and is not applicable, without further assumptions, when the running variable is discrete. In this case, the local randomization approach is a natural alternative. Because RD designs with discrete running variables are ubiquitous in the social sciences, we discuss them in detail in Section 7.
Textbook reviews of Fisherian and Neyman estimation and inference methods in the context of the analysis of experiments are given by Rosenbaum (2002, 2010) and Imbens and Rubin (2015). The latter book also discusses super-population approaches and their connections to finite-population inference methods. Ernst (2004) gives a nice discussion of the connections and distinctions between randomization inference and permutation inference methods. Cattaneo et al. (2015) were the first to propose Fisherian randomization-based inference to analyze RD designs based on a local randomization assumption; these authors also proposed the window selection procedure based on balance tests on predetermined covariates. Cattaneo et al. (2017d) relaxed the local randomization assumption to allow for a weaker exclusion restriction, and also compared RD analysis under the continuity-based and randomization-based approaches. The interpretation of the RD design as a local experiment and its connection to the continuity-based framework is also discussed by Sekhon and Titiunik
(2016, 2017).
6 Validation and Falsification
A main advantage of the RD design is that the mechanism by which treatment is assigned is
known and based on observable features, giving researchers an objective basis to distinguish pre-
treatment from post-treatment variables, and to identify qualitative information regarding the
treatment assignment process that can be helpful to justify assumptions. However, the known rule
that assigns treatment based on whether a score exceeds a cutoff is not by itself enough to guarantee
that the assumptions needed to recover the causal effect of interest are met.
For example, a scholarship may be assigned based on whether the grade students receive on a
test is above a cutoff, but if the cutoff is known to the students’ parents and there are mechanisms
to appeal the grade, then the RD design may be invalid whenever systematic differences among
students are present due to the appeal process. More formally, a systematically successful appeal
process could invalidate the assumption that the average potential outcomes are continuous at the
cutoff. To give a concrete example, if some parents decide to appeal the grade when their child is
barely below the cutoff and, crucially, these parents are systematically successful and different from other parents who choose not to appeal, then the RD design based on the final grade assigned to
each student would be invalid (while the RD design based on the original grade assigned would
not). The reasoning is as follows: parents who choose to appeal on behalf of students whose score is
barely below the cutoff and (systematically) manage to change their child’s score so that it reaches
the cutoff may also be more involved in other aspects of their children’s education which would
also have a direct impact on the outcome variable of interest. For instance, if parents’ involvement affects students’ future academic achievement, then, on average, the potential outcomes of those students above the cutoff may be discontinuously different from the potential outcomes of those students below the cutoff, making the RD design invalid for causal inference at the cutoff. In other words, students barely below and barely above the cutoff would be systematically different in ways that also affect the outcome of interest, making it impossible to disentangle the effect of the treatment from the effect of their other systematic underlying (usually unobserved) differences.
In general, if the cutoff that determines treatment is known to the units that will be the beneficiaries of the treatment, researchers must worry about the possibility of units actively changing or manipulating the value of their score when they barely miss the treatment. The first type of
information that can be provided is whether an institutionalized mechanism to appeal the score
exists, and if so, how often it is used to successfully change the score and which units use it. Qual-
itative data about the administrative process by which scores are assigned, cutoffs determined and
publicized, and treatment decisions appealed, is extremely useful to validate the design. To give
another empirical example, social programs are commonly assigned based on some poverty index:
if the program officers bump units with an index barely below the cutoff to the treatment group
in a systematic way (e.g., all households with small children), then the RD design would be in-
valid whenever the systematic differences between units near the cutoff have a direct effect on the
outcome of interest. This type of behavior can typically be identified using qualitative information about the program's implementation.
In many cases, however, qualitative information will be limited and the possibility of units manipulating their score cannot be ruled out. Crucially, the fact that there are no institutionalized mechanisms to appeal and change scores does not mean that there are no informal mechanisms by which this may happen. Thus, an essential step in evaluating the plausibility of the RD assumptions is to provide empirical evidence supporting the validity of the design. Naturally, the continuity and local randomization assumptions that guarantee the validity of the RD design refer to unobservable features and as such are inherently untestable. At the same time, the RD design is perhaps the one non-experimental research design that offers an array of empirical methods geared toward providing plausible evidence in favor of its validity. More precisely, there are several important empirical implications of the unobservable assumptions underlying RD designs that can be expected to hold in most cases and can provide indirect evidence about their validity. We consider three such empirical tests, based on: (i) continuity of the score density around the cutoff, (ii) null treatment effects on pre-treatment covariates or placebo outcomes, and (iii) treatment effects at artificial cutoff values, exclusion of nearby observations, and bandwidth choices. As we discuss below, the implementation of each test differs according to whether a continuity or a local randomization assumption is invoked.
The first type of falsification test examines whether, in a local neighborhood near the cutoff, the number of observations below the cutoff is "surprisingly" different from the number of observations above it. The underlying assumption is that, if individuals do not have the ability to precisely manipulate the value of the score that they receive, the number of treated observations just above the cutoff should be approximately similar to the number of control observations just below it. In other words, even if units actively attempt to manipulate their score, in the absence of precise manipulation, random chance would place roughly the same number of units on either side of the cutoff, leading to a continuous probability density function when the score is continuously distributed. Although this condition is neither necessary nor sufficient for the validity of an RD design, RD applications with an unexplained abrupt change in the number of observations right at the cutoff will tend to be less credible. This kind of test is often called a density test.
Figure 6.1 shows a histogram of the running variable in two hypothetical RD examples. In
the scenario illustrated in Figure 6.1(a), the number of observations above and below the cutoff
is very similar. In contrast, Figure 6.1(b) illustrates a case in which the density of the score right
below the cutoff is considerably lower than just above it—a finding that is compatible with units
systematically increasing the value of their original score so that they are assigned to the treatment
group instead of the control.
In addition to a graphical illustration of the density of the running variable, researchers should
test the assumption more formally. The implementation of the formal test depends on whether
one adopts a continuity-based or a local randomization approach to RD analysis. In the former
approach, the null hypothesis is that the density of the running variable is continuous at the
cutoff, and its implementation requires the estimation of the density of observations near the cutoff,
separately for observations above and below the cutoff. We employ here an implementation based
on a local polynomial density estimator that does not require pre-binning of the data and leads to
size and power improvements relative to other approaches. The null hypothesis is that there is no
“manipulation” of the density at the cutoff, formally stated as continuity of the density functions
for control and treatment units at the cutoff. Therefore, failing to reject implies that there is no
statistical evidence of manipulation at the cutoff, and thus offers support in favor of the RD design.
In order to perform this density test using the Meyersson data, we use the rddensity command,
which only needs to receive the running variable as an argument.
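Assuming the rddensity package is loaded and X is the centered score used throughout, the call is simply:

> out = rddensity(X)
> summary(out)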
[Figure 6.2: histogram of observations near the cutoff (panel (a)) and estimated density of the score (panel (b)), Meyersson data.]
The value of the test statistic is −1.394 and the associated p-value is 0.1633. This means that, under the continuity-based approach, we fail to reject the null hypothesis of no difference in the density of treated and control observations at the cutoff. Figure 6.2 provides a graphical representation of the density test, exhibiting both a histogram of the data (panel (a)) and the actual density estimate (panel (b)).
The implementation of the density test is different under the local randomization approach. In this case, the null hypothesis is that, within the window W0 where the treatment is assumed to be randomly assigned, the number of observations in the control and treatment groups is consistent with whatever assignment mechanism is assumed to have generated the treatment assignment. For example, under a simple "coin flip" or Bernoulli assignment with success probability q, we would expect the control sample size nW0,− and the treatment sample size nW0,+ within W0 to be compatible with what a sequence of Bernoulli trials with pre-specified treatment probability q would generate. The assumption is therefore that the number of treated and control units in W0 follows a binomial distribution, and the null hypothesis of the test is that the probability of success in the nW0 Bernoulli trials is q. As we have discussed, the true probability of treatment is unknown, but in practice q = 1/2 is the most natural choice in the absence of additional information (and this choice can be justified from a large-sample perspective when the score is continuous).
The binomial test is implemented in all common statistical software; it is also part of the rdlocrand package, implemented via the command rdwinselect. Using the Meyersson data, we can perform this falsification test after selecting a window W0 where the local randomization assumption is assumed to hold. Since we need the density test only in this window, we use the option nwindows = 1, and we employ W0 = [−0.944, 0.944], as selected in the previous section.
Window length/2   p-value   Var. name   Bin. test   Obs<c   Obs>=c
0.944             NA        NA          0.672       23      27
The important results in this case are the numbers of observations on each side of the cutoff (23 and 27). These counts and the probability q = 1/2 are the ingredients of the binomial test, which has a p-value of 0.672, as shown in the Bin. test column of the rdwinselect output. Based on this test, we find no evidence of "sorting" around the cutoff in the window W0 = [−0.944, 0.944]—in other words, the difference in the number of treated and control observations in this window is entirely consistent with what would be expected if municipalities were assigned to an Islamic victory or loss by the flip of an unbiased coin.
The binomial test is also implemented in the base distributions of R and Stata, so we can perform the same test by simply executing the canned binomial test command.
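In R, for example, the base function binom.test reproduces the results shown below, using the 27 observations above the cutoff out of the 50 observations in the chosen window:

> binom.test(27, 50, p = 0.5)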
data:  27 and 50
number of successes = 27, number of trials = 50, p-value = 0.6718
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.3932420 0.6818508
sample estimates:
probability of success
                  0.54
As expected, the two-sided p-value is 0.6718, which is equal (after rounding) to the p-value obtained using rdwinselect. Thus, we arrive at the same conclusion and fail to reject the null hypothesis.
Another important RD falsification test involves examining whether, near the cutoff, treated units
are similar to control units in terms of observable characteristics. The idea behind this approach
is simply that, if units lack the ability to precisely manipulate the score value they receive, there
should be no systematic differences between units with similar values of the score. Thus, except
for their treatment status, units just above and just below the cutoff should be similar in all those
characteristics that could not have been affected by the treatment. These variables can be divided
into two groups: variables that are determined before the treatment is assigned, which we call
predetermined covariates, and variables that are determined after the treatment is assigned but,
according to substantive knowledge about the treatment’s causal mechanism, could not possibly
have been affected by the treatment, which we call placebo outcomes.
Importantly, predetermined covariates are always unambiguously defined, but placebo outcomes
are application specific. Any characteristic that is determined before the treatment assignment is
a predetermined covariate. In contrast, whether a variable is a placebo outcome depends on the
particular treatment under consideration. For example, if the treatment is access to clean water and the outcome of interest is child mortality, a treatment effect is expected on mortality from water-borne illnesses but not on mortality from other causes such as car accidents. Thus, mortality from road accidents would be a reasonable placebo outcome in this example. However, mortality from road accidents would not be a placebo outcome if access to clean water occurred simultaneously with a safety program that educated parents in the proper installation of car seats.
Once again, the particular implementation of this type of falsification test depends on whether researchers adopt a continuity-based or a local randomization approach. But despite differences in implementation, a fundamental principle applies to both: all predetermined covariates and placebo outcomes should be analyzed in the same way as the outcome of interest. In the continuity-based approach, this principle means that for each predetermined covariate (and placebo outcome), researchers should first choose an optimal bandwidth, then use local polynomial techniques within that bandwidth to estimate the "treatment effect," and finally employ valid inference procedures such as the robust bias-corrected methods discussed previously. In the local randomization approach, this principle means that the effect on both covariates and placebo outcomes should be analyzed within the window where the local randomization assumption is assumed to hold. In both approaches, since the predetermined covariate (or placebo outcome) could not have been affected by the treatment, the expectation is that the null hypothesis of no treatment effect will fail to be rejected.
When using the continuity-based approach to RD analysis, this falsification test employs the local
polynomial techniques discussed in Section 4 to test if the predetermined covariates and placebo
outcomes are continuous at the cutoff—in other words, to test if the treatment has an effect on
them. We illustrate with the Meyersson application, using the set of predetermined covariates
introduced in Section 5.3 for window selection purposes. We start by presenting a graphical
analysis, creating an RD plot for every covariate using rdplot with the default options (mimicking
variance, evenly-spaced bins). The plots are presented in Figure 6.3—the specific commands are
omitted to conserve space.
[Figure 6.3: RD plots for the predetermined covariates, Meyersson data.]
The graphical analysis does not reveal obvious discontinuities at the cutoff, but of course a
statistical analysis is required before we can reach a more objective and formal conclusion. In order
to implement the analysis, an optimal bandwidth must be chosen for each test, and this bandwidth
is not necessarily the bandwidth used to analyze the original outcome of interest. As shown very
clearly in the RD plots, each covariate Zj exhibits a quite different estimated regression function,
with different curvature and overall functional form. As a result, the optimal bandwidth for local
polynomial estimation and inference will be different for every variable, and must be re-estimated
accordingly in each case. This implies that the statistical analysis is conducted separately for each
covariate, choosing the appropriate optimal bandwidth for each covariate considered.
To implement this formal falsification test, we simply run rdrobust using each covariate of
interest as the outcome variable. As an example, we analyze the RD treatment effect on the covariate
lpop1994, the logarithm of the municipality population in 1994. Since this covariate is measured
in 1994, it could not have been affected by the treatment—that is, by the party that wins the
1994 election. We estimate a local linear RD effect with triangular kernel weights and common
MSE-optimal bandwidth using rdrobust (remember that the default bandwidth selection option
is bwselect="mserd").
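The call uses the covariate in place of the outcome and all default options; a minimal version is:

> out = rdrobust(data$lpop1994, X)
> summary(out)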
Summary:
                        Left    Right
Number of Obs           2314      315
Eff. Number of Obs       400      233
Order Loc Poly (p)         1        1
Order Bias (q)             2        2
BW Loc Poly (h)      13.3186  13.3186
BW Bias (b)          21.3661  21.3661
rho (h/b)             0.6234   0.6234

Estimates:
                 Coef   Std. Err.        z    P>|z|   CI Lower   CI Upper
Conventional   0.0124      0.2777   0.0447   0.9643    -0.5319     0.5567
Robust                                       0.9992    -0.6442     0.6448
The point estimate is very close to zero and the robust p-value is 0.9992, so we find no evidence that this covariate differs systematically between treated and control municipalities at the cutoff. In other words, we find no evidence
that the size of the municipalities is discontinuous at the cutoff. In order to provide a complete
falsification test, the same estimation and inference procedure should be repeated for all important
covariates—that is, for all potential confounders. In a convincing RD design, these tests would show
that there are no discontinuities in any variable. Table 6.1 contains the local polynomial estimation
and inference results for four different pre-determined covariates available in the Meyersson dataset.
All results were obtained employing rdrobust with the default specifications, as shown for lpop1994
above.
Table 6.1: Formal Continuity-Based Analysis for Covariates
All point estimates are small and most are remarkably close to zero, all 95% confidence intervals contain zero, and the p-values range from 0.462 to 0.999. In other words, there is no empirical evidence that these predetermined covariates are discontinuous at the cutoff. Note that the number of observations used in the analysis varies between covariates because the MSE-optimal bandwidth is different for every covariate. As explained above, this is to be expected, since each optimal bandwidth depends on the particular shape of the conditional expectation of the corresponding covariate. Finally, in this illustration we employ the default options for simplicity, but for falsification purposes the CER-optimal bandwidth choice is more appropriate, because in this case only testing the null hypothesis is of interest. Nevertheless, switching to bwselect="cerrd" does not change any of the empirical conclusions.
We complement these results with a graphical illustration of the RD effect for every covariate, to provide further evidence that these covariates do not jump discretely at the cutoff. For this, we plot each covariate within its MSE-optimal bandwidth, using a polynomial of order one and a triangular kernel function to weight the observations. We create these plots with the rdplot command. Below we illustrate the specific command for the lpop1994 covariate analyzed above.
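The command is not reproduced in the extracted text; a sketch, restricting the sample to the MSE-optimal bandwidth h = 13.3186 reported by rdrobust above and using the rdplot options already introduced, is:

> h = 13.3186   # MSE-optimal bandwidth for lpop1994 from the rdrobust output above
> rdplot(data$lpop1994[abs(X) <= h], X[abs(X) <= h],
+        p = 1, kernel = "triangular",
+        x.label = "Running Variable", y.label = "", title = "")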
We follow this same procedure for each of the four covariates of interest. The resulting plots are
presented in Figure 6.4. Consistent with the formal statistical results, the graphical analysis within
the optimal bandwidth indicates that all covariates appear to be continuous at the cutoff.
Figure 6.4: Graphical Illustration of Local Linear RD Effects for Predetermined Covariates—
Meyersson data
The continuity of the estimated conditional expectations at the cutoff for each of these covariates stands in contrast to the analogous plot for the outcome of interest, the female education share, given in Figure 4.4.
In the local randomization RD approach, we can similarly employ pre-intervention covariates and placebo outcomes to evaluate whether there is evidence of potential manipulation. As before, the idea is that the null hypothesis of no treatment effect should be tested within W0 for all predetermined covariates and placebo outcomes, using the same inference procedures, the same assumptions about the treatment assignment mechanism, and the same test statistic used for the analysis of the outcome of interest. Since in this approach W0 is the window where the treatment is assumed to have been randomly assigned, all covariates and placebo outcomes should be analyzed within this window. This illustrates a fundamental difference between the continuity-based and the randomization-based approaches: in the former, in order to estimate and test hypotheses about treatment effects we need to approximate the unknown functional forms of the outcome, predetermined covariates, and placebo outcomes, which requires estimating a separate bandwidth for each variable analyzed; in the latter, since the treatment is thought of as randomly assigned within W0, all analyses occur within the same window, W0.
As discussed in Section 5, the chosen window for the local randomization approach is W0 = [−0.944, 0.944]. Therefore, in order to test whether the covariates are balanced within this window, we must compare their behavior on each side of the cutoff using randomization inference techniques. We should see that the observed differences in means are not statistically significant, since otherwise we would have evidence against the assumption that the treatment is as-if randomly assigned within W0. In order to test this formally, we can use rdrandinf with each covariate as the outcome variable. For example, in order to study the covariate vshr_islam1994 we run:
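The call itself does not appear in the extracted text; a minimal version, passing the covariate as the outcome and the chosen window through the wl and wr options, would be:

> rdrandinf(data$vshr_islam1994, X, wl = -0.944, wr = 0.944, seed = 50)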
This shows that the difference-in-means statistic is very small (0.319 − 0.328 = −0.009) and the finite-sample p-value is large (0.689). That is, there is no evidence that this covariate is imbalanced between treated and control municipalities inside W0. Table 6.2 contains a summary of this analysis for all covariates using randomization inference. We cannot conclude that the control and treatment means are different for any covariate, since the p-values are above 0.15 in all cases; there is no statistical evidence of imbalance (in terms of means) inside this window. We should also notice that the number of observations is the same in all cases. This happens because the window in which we analyze these covariates does not change (it is always W0), whereas in the continuity-based approach the MSE-optimal bandwidth depends on the particular outcome being used, and therefore the number of observations is different for every covariate.
This analysis can also be carried out visually, using rdplot restricted to W0 with p = 0 and a uniform kernel. If the window was chosen appropriately, all covariates should appear continuous at the cutoff under these specifications. This is equivalent to asserting that their means are statistically indistinguishable, since a local polynomial of order zero with a uniform kernel inside W0 simply fits the control and treatment means, so the estimated jump at the cutoff is the difference in means. For example, we can construct an RD plot with these characteristics for the covariate vshr_islam1994.
> rdplot(data$vshr_islam1994[abs(X) <= 0.944], X[abs(X) <= 0.944],
+        p = 0, kernel = "uniform", x.label = "Running Variable",
+        y.label = "", title = "")
Figure 6.5 contains analogous RD plots for the six predetermined covariates analyzed above. In
most cases, the visual inspection shows that the means of these covariates seem to be similar on
each side of the cutoff, consistent with the results from the window selection procedure discussed
in Section 5 and the formal analysis in Table 6.2.
[Figure 6.5: RD plots (p = 0, uniform kernel) within W0 for the predetermined covariates, Meyersson data.]
We close this section with a brief discussion of three other design-specific falsification approaches: (i) null treatment effects at placebo cutoffs, (ii) sensitivity of the treatment effect to units near the cutoff, and (iii) sensitivity of the treatment effect to the bandwidth choice. All these empirical tests can be conducted using either the continuity-based or the local randomization approach. To conserve space, here we discuss implementation and empirical results only using local polynomial methods within the continuity-based approach, but the accompanying replication files include analogous implementations using randomization inference methods.
The first falsification approach, based on placebo cutoff points, was hinted at already when discussing RD plots in Section 3. The motivation behind this method starts by recalling that the key identifying assumption underlying RD designs is continuity (or lack of abrupt changes) of the regression functions for treatment and control units at the cutoff in the absence of the treatment. While this condition is fundamentally untestable at the cutoff, researchers can investigate empirically whether the estimable regression functions for control and treatment units are continuous over the support of the score variable, that is, at points away from the cutoff. Evidence of continuity away from the cutoff is, of course, neither necessary nor sufficient for continuity at the cutoff, but the presence of discontinuities away from the cutoff can be interpreted as potentially casting doubt on the RD design, at the very least in cases where such discontinuities cannot be explained by substantive knowledge of the specific application.
Practically, the method replaces the true cutoff value with another value at which the treatment status does not actually change, and then performs estimation and inference using this "fake" cutoff point. The motivating idea is that a significant treatment effect should occur only at the true cutoff value and not at other values of the score where the treatment status is constant. A graphical implementation of this falsification approach follows directly from the RD plots described previously, by simply assessing whether there are jumps in the observed regression functions at points other than the true cutoff. A more formal implementation conducts statistical estimation and inference for the RD treatment effect at placebo or artificial cutoff points, using control and treatment units separately. Once again, the implementation depends on the approach adopted: in the continuity-based approach, we use local polynomial methods within an optimally chosen bandwidth around the fake cutoff to estimate treatment effects on the outcome, as explained in Section 4. In the local randomization approach, we choose a window around the fake cutoff where randomization is plausible, and make inferences for the true outcome within that window, as explained in Section 5.
In order to analyze the alternative cutoffs using the continuity-based approach, we employ rdrobust after restricting the sample to the appropriate group and specifying the artificial cutoff. For example, we consider only the treatment group with a placebo cutoff point x̄ = 1. We do not expect to find a statistically significant effect in this case, since the treatment status does not change at any value other than 0. Here is the empirical result of this exercise using the Meyersson data:
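The call (not shown in the extracted text) restricts the sample to the treated observations and sets the cutoff option c; a minimal version is:

> out = rdrobust(Y[X >= 0], X[X >= 0], c = 1)
> summary(out)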
Summary:
                       Left    Right
Number of Obs            30      285
Eff. Number of Obs       30       49
Order Loc Poly (p)        1        1
Order Bias (q)            2        2
BW Loc Poly (h)      2.3016   2.3016
BW Bias (b)          3.2845   3.2845
rho (h/b)            0.7007   0.7007

Estimates:
                  Coef   Std. Err.         z    P>|z|   CI Lower   CI Upper
Conventional   -0.9935      4.2782   -0.2322   0.8164    -9.3786     7.3917
Robust                                         0.7599    -9.8276    13.4594
To estimate at the alternative cutoff we use the option c = 1 in rdrobust, and we restrict the sample to the treatment group, since otherwise the estimation to the left of the placebo cutoff would be contaminated by the actual non-zero treatment effect at 0. This forces the program to compare the educational outcomes of municipalities where the Islamic party won by a margin of more than 1% with municipalities where it won by a margin of less than 1%. That is, on both sides of the placebo cutoff we have municipalities with an Islamic mayor, and therefore we should not find any discontinuity in the outcome at 1%. This is in fact what happens, since the p-value is much larger than 0.1; we conclude that the outcome of interest does not jump at the cutoff value of 1%. Table 6.3 summarizes this analysis for alternative cutoffs ranging from −5%
to 5% with increments of 1%, and Figure 6.6 depicts the main results from this falsification test
graphically.
[Figure 6.6: RD estimates at placebo cutoffs from −5 to 5, Meyersson data.]
The cutoff equal to 0 is included as a benchmark: zero is the true cutoff, and the results at this cutoff were discussed at length in Section 4. All other cutoffs are "fake" or placebo cutoffs, in the sense that the treatment status did not actually change at those points. We find that at all of the artificial cutoff points the RD estimate is smaller in magnitude than the true RD estimate (3.019), and the corresponding p-values are larger than 0.1 in all cases. Therefore, we can conclude that the outcome of interest does not jump discontinuously at any of the placebo cutoffs considered.
The second falsification approach, based on sensitivity to observations near the cutoff, investigates how sensitive the results are to the response of units very close to the cutoff point. If manipulation is present, it is natural to assume that the units closest to the cutoff are those most likely to have systematically manipulated their score value. Thus, the idea behind this approach is to exclude such units and then repeat the estimation and inference analysis using the remaining sample. This idea is sometimes referred to as a "donut hole" approach. Once again, it can be implemented using either continuity-based or local randomization methods, though it is most natural in the former context, because the latter setting tends to employ very few observations to begin with and does not rely as much on extrapolation. Indeed, this approach is also useful to assess the sensitivity of the extrapolation to the few observations closest to the cutoff, as they are likely to be the most influential when fitting the local polynomials.
To implement the falsification method based on excluding the units closest to the cutoff, we employ rdrobust after subsetting the data accordingly. For example, we first consider the case where units with score |Xi| < 0.25 are excluded from the analysis. As discussed before, this implies that a new optimal bandwidth will be selected and, in this case, de facto more extrapolation than before will take place. The result using the Meyersson data is as follows:
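A minimal version of the corresponding call, dropping all observations with |Xi| < 0.25, is:

> out = rdrobust(Y[abs(X) >= 0.25], X[abs(X) >= 0.25])
> summary(out)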
Summary:
                        Left    Right
Number of Obs           2308      309
Eff. Number of Obs       483      248
Order Loc Poly (p)         1        1
Order Bias (q)             2        2
BW Loc Poly (h)      16.0276  16.0276
BW Bias (b)          27.4569  27.4569
rho (h/b)             0.5837   0.5837

Estimates:
                 Coef   Std. Err.        z    P>|z|   CI Lower   CI Upper
Conventional   3.4118      1.5126   2.2556   0.0241     0.4472     6.3764
Robust                                       0.0540    -0.0594     6.9499
In practice, it is natural to repeat this exercise a few times to assess the sensitivity of the results to different numbers of excluded observations. Table 6.4 illustrates this approach, and Figure 6.7 depicts the results graphically.
Finally, the last falsification method commonly encountered in empirical work is related to sensitivity to the bandwidth choice or window length. This method complements the donut hole approach just discussed, which investigated the sensitivity of the empirical findings as units are removed from the center of the neighborhood around the cutoff. Now, instead, we investigate the sensitivity as units are added or removed at the end points of the neighborhood. The implementation of this method is also straightforward, as it requires running the same commands with different bandwidth or window length choices. However, the results must be interpreted with care: as we discussed in detail throughout this monograph, choosing the bandwidth is perhaps the most important decision in RD analysis because the results can be strongly affected by this choice. In fact, it is well understood how the bandwidth affects the results: as the bandwidth increases, the bias of the estimator increases and its variance decreases. Thus, it is natural to expect that the larger the bandwidth, the shorter the confidence intervals will be, but they will also be displaced (because of the bias).
The considerations above suggest that investigating sensitivity to bandwidth is only useful over
small ranges around the MSE-optimal bandwidth, since otherwise the results will be mechanically
determined by the statistical properties of the estimation method. To illustrate this approach in
practice, we present Figure 6.8, where we report the empirical analysis of the Meyersson data over
five bandwidth choices.
Figure 6.8: Sensitivity to Bandwidth in the Continuity-Based Approach
Figure 6.8 reports local polynomial RD point estimators and robust confidence intervals using
as bandwidth: (i) the local randomization choice hLR = 0.944, (ii) the CER-optimal choice hCER =
11.629, (iii) the MSE-optimal choice hMSE = 17.239, (iv) 2 · hCER = 23.258, and (v) 2 · hMSE = 34.478.
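One way to reproduce the estimates reported in Figure 6.8 is simply to call rdrobust repeatedly with the bandwidth option h set to each of these values (a sketch; the accompanying replication files may implement this differently):

> summary(rdrobust(Y, X, h = 0.944))
> summary(rdrobust(Y, X, h = 11.629))
> summary(rdrobust(Y, X, h = 17.239))
> summary(rdrobust(Y, X, h = 23.258))
> summary(rdrobust(Y, X, h = 34.478))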
The density test was first proposed by McCrary (2008). Cattaneo et al. (2017a) developed a local polynomial density estimator that does not require pre-binning of the data and leads to size and power improvements relative to other implementation approaches, and Frandsen (2017) developed a related manipulation test for cases where the score is discrete. The importance of falsification tests and the use of placebo outcomes is discussed generally in the literature on the analysis of experiments (e.g., Rosenbaum, 2002, 2010; Imbens and Rubin, 2015). Lee (2008) applied and extended these ideas to the context of RD designs, and Canay and Kamat (2017) developed a permutation inference approach in the same context. Ganong and Jäger (2017) developed a permutation inference approach based on the idea of placebo cutoffs for Regression Kink (kink RD) designs and related settings. Finally, falsification testing based on donut hole specifications is discussed in Bajari et al. (2011) and Barreca et al. (2016).
7 Example with Discrete Score
The canonical continuity-based RD design assumes that the score that determines treatment is a
continuous random variable. A random variable is continuous when the set of values that it can take
contains an uncountable number of elements. For example, a share such as a party’s proportion of
the vote is continuous, because it can take any value in the [0, 1] interval. In practical terms, when
a variable is continuous, all the observations in the dataset have distinct values—i.e., there are no
ties. In contrast, a discrete random variable such as date of birth can only take a finite number
of values; as a result, a random sample of a discrete variable will contain “mass points”—that is,
values that are shared by many observations.
When the continuous score assumption does not hold, some of the local polynomial methods we
described in Section 4 are not directly applicable. This is a practically relevant issue, because many
real RD applications have a discrete score that can only take a finite number of values. We now
consider an empirical RD example where the running variable has mass points in order to illustrate
some of the strategies that can be used to analyze RD designs with a discrete running variable. We
employ this empirical application to illustrate how identification, estimation, and inference can be
modified when the dataset contains multiple observations per value of the running variable.
As we illustrate and discuss below, the key issue when deciding how to analyze an RD design with a discrete running variable is the number of distinct mass points in the running variable. Local polynomial methods will behave essentially as if each mass point were a single observation. Thus, if the score is discrete but the number of mass points is sufficiently large, using local polynomial methods may still be appropriate. In contrast, if the number of mass points is very small, local polynomial methods will not be directly applicable. In this case, analyzing the RD design using the local randomization approach is a natural alternative. When the score is discrete, the local randomization approach has the advantage that the window selection procedure is no longer needed, as the smallest window is well defined. Regardless of the estimation and inference method employed, issues of interpretability and extrapolation will naturally arise. In the upcoming sections, we discuss and illustrate all these issues using an example from the education literature.
The example we re-analyze is the study by Lindo, Sanders and Oreopoulos (2010, LSO hereafter),
who use an RD design to investigate the impact of placing students on academic probation on
their future academic performance. Our choice of an education example is intentional. The origins
of the RD design can be traced to the education literature, and RD methods continue to be
used extensively in education because interventions such as scholarships or probation programs are
often assigned on the basis of a test score and a fixed threshold. Moreover, despite being
continuous in principle, it is common for test scores and grades to be discrete in practice.
LSO analyze a policy at a large Canadian university that places students on academic probation
when their grade point average (GPA) falls below a certain threshold. As explained by LSO, the
treatment of placing a student on academic probation involves setting a standard for the student’s
future academic performance: a student who is placed on probation in a given term must improve
her GPA in the next term according to campus-specific standards, or face suspension. Thus, in this
RD design, the unit of analysis is the student, the score is the student’s GPA, the treatment of
interest is placing the student on probation, and the cutoff is the GPA value that triggers probation
placement. Students come from three different campuses. In campuses 1 and 2, the cutoff is 1.5. In
campus 3 the cutoff is 1.6. In their original analysis, the authors adopt the normalizing-and-pooling
strategy we discussed in Section 2.4, centering each student’s GPA at the appropriate cutoff, and
pooling the observations from the three campuses in a single dataset. Thus, the original running
variable is the difference between the student’s GPA and the cutoff; this variable ranges from -1.6 to
2.8, with negative values indicating that the student was placed on probation, and positive values
indicating that the student was not placed on probation.
Table 7.1 contains basic descriptive statistics for the score, treatment, outcome and predeter-
mined covariates that we use in our re-analysis. There are 40,582 student-level observations coming from the 1996-2005 period. LSO focus on several outcomes that can be influenced by academic probation. In order to simplify our illustration, we focus on two of them: the student's decision to permanently leave the university, and the GPA obtained by the student in the term immediately after she was placed on probation (nextGPA). Naturally, the second outcome is only observed for students who decide to
continue at the university, and thus the effects of probation on this outcome must be interpreted
with caution, as the decision to leave the university may itself be affected by the treatment. We
also investigate some of the predetermined covariates included in the LSO dataset: an indicator for
whether the student is male (male), the student's age at entry (age), the total number of credits for which the student enrolled in the first year (totcredits_year1), an indicator for whether the student's first language is English (english), an indicator for whether the student was born in North America (bpl_north_america), the percentile of the student's average GPA in standard classes taken in high school (hsgrade_pct), and indicators for whether the student is enrolled in each of the three different campuses (loc_campus1, loc_campus2, and loc_campus3). Following LSO, we employ these covariates to study the validity of the RD design.
The crucial issue in the analysis of RD designs with discrete scores is the number of mass points
that actually occur in the dataset. When this number is large, it will be possible to apply the tools
from the continuity-based approach to RD analysis, after possibly changing the interpretation
of the treatment effect of interest. When this number is either moderately small or very small,
a local randomization approach will be most appropriate. In the latter situation, (local or global) polynomial fitting will be useful only as an exploratory device unless the researcher is willing to impose
strong parametric assumptions. Therefore, the first step in the analysis of an RD design with a
discrete running variable is to analyze the empirical distribution of the score and determine the
total number of observations, the total number of mass points, and the number of observations per mass point. We illustrate this step of the analysis using the LSO application.
Since only those students who have GPA below a certain level are placed on probation, the
treatment—the assignment to probation—is administered to students whose GPA is to the left of
the cutoff. As we discussed in Section 2, the convention is to define the RD treatment indicator
as equal to one for units whose score is above (i.e., to the right of) the cutoff. To conform to
this convention, we invert the original running variable in the LSO data. We multiply the original
running variable—the distance between GPA and the campus cutoff—by -1, so that, according to
the transformed running variable, students placed on probation (i.e. those with GPA below the
cutoff) are now above the cutoff, and students not placed on probation (i.e. those with GPA above
the cutoff) are now below the cutoff.
For example, a student who has Xi = −0.2 in the original score is placed on probation because
her GPA is 0.2 units below the threshold. The value of the transformed running variable for this
treated student is X̃i = 0.2. Moreover, since we define the treatment as 1(X̃i ≥ 0), this student
will now be placed above the cutoff. The only caveat is that we must shift slightly those control
students who were exactly at the cutoff in the original score, since multiplying the running variable
by -1 does not alter their score (the cutoff is zero). In the scale of the transformed variable, we need
these students to be below zero to continue to assign them to the control condition. Therefore, we
manually change the transformed score of students who are exactly at zero to X̃i = −0.000005. A histogram of
the transformed running variable is shown in Figure 7.1.
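For concreteness, the following minimal R sketch implements this transformation; X_orig is a hypothetical name for the original normalized score (GPA minus the campus cutoff), and X denotes the transformed score used in the remainder of this section.

> # Multiply the original score by -1 so that probation (treatment) falls above the cutoff
> X = -X_orig
> # Shift control students who are exactly at the cutoff so that they remain below it
> X[X == 0] = -0.000005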
[Figure 7.1: Histogram of the transformed running variable (Score on the horizontal axis, Number of Observations on the vertical axis).]
We first check how many total observations we have in the dataset, that is, how many observa-
tions have a non-missing value of the score.
> length(X)
[1] 44362
The total sample size in this application is large, with 44,362 observations. However, because
the running variable is discrete, the crucial step is to calculate how many mass points we have.
The 44,362 total observations in the dataset take only 430 distinct values. This means that,
on average, there are roughly 100 observations per value. To have a better idea of the density of
observations near the cutoff, Table 7.2 shows the number of observations for the five mass points
closest to the cutoff; this table also illustrates how the score is transformed. The original score ranges between -1.6 and 2.8, and our transformed score ranges from -2.799 to 1.6. Both the original
and the transformed running variables are discrete, because the GPA increases in increments of 0.01
units and there are many students with the same GPA value. For example, there are 76 students
who are 0.02 GPA units below the cutoff. Of these 76 students, 44 + 5 = 49 have a GPA of 1.48
(because the cutoff in Campuses 1 and 2 is 1.5), and 27 students have a GPA of 1.58 (because the
cutoff in Campus 3 is 1.6). The same phenomenon of multiple observations with the same value of
the score occurs at all other values of the score; for example, there are 228 students who have a
value of zero in the original score (and -0.000005 in our transformed score).
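The counts just described can be obtained with a few lines of base R; the following is a minimal sketch, where the tabulation of the values closest to the cutoff (used to construct Table 7.2) is purely illustrative.

> # Total number of mass points (distinct values) in the transformed score
> length(unique(X))
> # Average number of observations per mass point
> length(X) / length(unique(X))
> # Number of observations at the mass points within 0.02 of the cutoff
> table(X[abs(X) <= 0.02])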
7.3 Using the Continuity-Based Approach when the Number of Mass Points is Large
When the number of mass points in the discrete running variable is sufficiently large, it is possible to
use the tools from the continuity-based approach to RD analysis. The LSO application illustrates a
case in which a continuity-based analysis might be possible, since the total number of mass points
is 430, a moderately large value. Because there are mass points, extrapolation between them is
unavoidable, but this is empirically no different from analyzing a (finite sample) dataset with a
sample size of n = 430.
We start with a falsification analysis, doing a continuity-based density test and a continuity-
based analysis of the effect of the treatment on predetermined covariates. First, we use rddensity
to test whether the density of the score is continuous at the cutoff.
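The density test output reproduced below was generated with a call of the following form, a minimal sketch that relies on the default options of rddensity (the exact call is not shown in this excerpt).

> library(rddensity)
> summary(rddensity(X))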
Order est. (p)                 2           2
Order bias (q)                 3           3
BW est. (h)                    0.706       0.556
The p-value is 0.6496, so we fail to reject the null hypothesis that the density of the score is continuous at the cutoff point; that is, we find no evidence of a discontinuous change in the density at the cutoff.
Next, we use rdrobust to use local polynomial methods to estimate the RD effect of being placed
on probation on several predetermined covariates. We use the default specifications in rdrobust,
that is, an MSE-optimal bandwidth that is equal on each side of the cutoff, a triangular kernel, a polynomial of order one, and a regularization term.
For example, we can estimate the RD effect of probation on hsgrade_pct, the measure of high
school performance.
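The summary that follows corresponds to a call of this form, a sketch that simply relies on the rdrobust defaults described above.

> library(rdrobust)
> rdrobust(data$hsgrade_pct, X)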
Summary:
                         Left       Right
Number of Obs            37211      7151
Eff. Number of Obs       6115       3665
Order Loc Poly (p)       1          1
Order Bias (q)           2          2
BW Loc Poly (h)          0.4647     0.4647
BW Bias (b)              0.7590     0.7590
rho (h/b)                0.6123     0.6123

Estimates:
                  Coef    Std. Err.        z    P>|z|    CI Lower    CI Upper
Conventional    1.3282       1.0104   1.3145   0.1887     -0.6522      3.3087
Robust                                         0.1943     -0.7880      3.8780
> rdplot(data$hsgrade_pct, X, x.label = "Running Variable", y.label = "",
+   title = "")
[RD plot of hsgrade_pct (vertical axis) against the Running Variable (horizontal axis), produced by the rdplot call above.]
Both the formal analysis and the graphical analysis indicate that, according to this continuity-
based local polynomial analysis, the students right above and below the cutoff are similar in terms
of their high school performance.
We repeat this analysis for the nine predetermined covariates mentioned above. Table 7.3
presents a summary of the results, and Figure 7.3 shows the associated RD plots for six of the
nine covariates.
Table 7.3: RD Effects on Predetermined Covariates—LSO data, Continuity-Based Approach
[Figure 7.3: RD plots of six predetermined covariates against the Running Variable; the visible panel labels include (e) Male Indicator and (f) Born in North America Indicator.]
Overall, the results indicate that the probation treatment has no effect on the predetermined
covariates, with two exceptions. First, the effect on totcredits_year1 has an associated p-value of 0.004, rejecting the hypothesis of no effect at standard levels. However, the point estimate is small: treated students take an additional 0.081 credits in the first year, while the average value of totcredits_year1 in the overall sample is 4.43, with a standard deviation of roughly 0.5. Second, students who are placed on probation are 3.5 percentage points less likely to speak English as a first language, an effect that is significant at the 10% level. This difference is potentially more worrisome,
and is also somewhat noticeable in the RD plot in Figure 7.3(d).
Next, we analyze the effect of being placed on probation on the outcome of interest, nextGPA,
the GPA in the following academic term. We first use rdplot to visualize the effect.
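The rdplot summary and figure reproduced below come from a call of the following form, a sketch analogous to the rdplot call used earlier for the covariates.

> rdplot(nextGPA_nonorm, X, x.label = "Running Variable", y.label = "Outcome",
+   title = "")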
Method:
                         Left       Right
Number of Obs.           34854      5728
Polynomial Order         4          4
Scale                    16         28
[RD plot of the Outcome (next-term GPA) against the Running Variable.]
Overall, the plot shows a very clear negative relationship between the running variable and the
outcome: students who have a low GPA in the current term (and thus a higher value of the running
variable) tend to also have a low GPA in the following term. The plot also shows that students
with scores just above the cutoff (who are just placed on probation) tend to have a higher GPA in
the following term relative to students who are just below the cutoff and just avoided probation.
These results are confirmed when we use a local linear polynomial and robust inference to provide
a formal statistical analysis of the RD effect.
> rdrobust(nextGPA_nonorm, X, kernel = "triangular", p = 1, bwselect = "mserd")
Call:
rdrobust(y = nextGPA_nonorm, x = X, p = 1, kernel = "triangular",
    bwselect = "mserd")
Summary:
                         Left       Right
Number of Obs            34854      5728
Eff. Number of Obs       5249       3038
Order Loc Poly (p)       1          1
Order Bias (q)           2          2
BW Loc Poly (h)          0.4375     0.4375
BW Bias (b)              0.7171     0.7171
rho (h/b)                0.6101     0.6101

Estimates:
As shown, students who are just placed on probation improve their GPA in the following term
by 0.2221 additional points, relative to students who just missed probation. The robust p-value
is less than 0.00005, and the robust 95% confidence interval ranges from 0.1217 to 0.3044. Thus,
the evidence indicates that, conditional on not leaving the university, being placed on academic
probation translates into an increase in future GPA. The point estimate of 0.2221, obtained with rdrobust within an MSE-optimal bandwidth of 0.4375, is very similar to the effect of 0.23 grade points found by LSO within an ad-hoc bandwidth of 0.6.
To better understand this effect, we may be interested in the point estimates for the control and treated students separately. To see them, we explore the output returned by rdrobust.
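A minimal sketch of how to inspect these quantities is shown below; it assumes that the fitted model is stored in an object called out (a hypothetical name) and that the estimated coefficients are available under the names beta_p_l and beta_p_r mentioned in the text.

> out = rdrobust(nextGPA_nonorm, X, kernel = "triangular", p = 1, bwselect = "mserd")
> out$beta_p_l   # intercept and slope of the local fit to the left of the cutoff
> out$beta_p_r   # intercept and slope of the local fit to the right of the cutoff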
This output shows the estimated intercept and slope from the two local regressions estimated
separately to the right (beta_p_r) and left (beta_p_l) of the cutoff. At the cutoff, the average GPA in the following term for control students who just avoid probation is 1.8450372, while the average future GPA for treated students who are just placed on probation is 2.0671526. The difference between the two is the estimated RD effect reported above, 2.0671526 − 1.8450372 = 0.2221154. This represents
approximately a 12% GPA increase relative to the control group, a considerable effect.
An alternative to the simplest use of rdrobust illustrated above is to cluster the standard
errors by every value of the score—this is the approach recommended by Lee and Card (2008), as
we discuss in the Further Readings section below. We implement this using the cluster option in
rdrobust.
> clustervar = X
> rdrobust(nextGPA_nonorm, X, kernel = "triangular", p = 1, bwselect = "mserd",
+   vce = "hc0", cluster = clustervar)
Call:
rdrobust(y = nextGPA_nonorm, x = X, p = 1, kernel = "triangular",
    bwselect = "mserd", vce = "hc0", cluster = clustervar)
Summary:
                         Left       Right
Number of Obs            34854      5728
Eff. Number of Obs       4357       2709
Order Loc Poly (p)       1          1
Order Bias (q)           2          2
BW Loc Poly (h)          0.3774     0.3774
BW Bias (b)              0.6270     0.6270
rho (h/b)                0.6019     0.6019

Estimates:
                  Coef    Std. Err.        z    P>|z|    CI Lower    CI Upper
Conventional    0.2149       0.0332   6.4625   0.0000      0.1497      0.2800
Robust                                         0.0000      0.1316      0.2818
The conclusions remain essentially unaltered, as the 95% robust confidence interval changes
only slightly from [0.1217, 0.3044] to [0.1316, 0.2818]. Note that the point estimate moves slightly
from 0.2221 to 0.2149 because the MSE-optimal bandwidth with clustering shrinks to 0.3774 from
0.4375, and the bias bandwidth also decreases.
Provided that the number of mass points in the score is reasonably large, it is possible to analyze
an RD design with a discrete score using the tools from the continuity-based approach. However,
it is important to understand how to correctly interpret the results from such analysis. We now
analyze the LSO application further, with the goal of clarifying these issues.
When there are mass points in the running variable, local polynomial methods for RD analysis
behave essentially as if we had as many observations as mass points, and therefore the method
implies extrapolation from the closest mass point on either side to the cutoff. In other words,
when applied to an RD design with a discrete score, the effective number of observations used by
continuity-based methods is the number of mass points or distinct values, not the total number of
observations. Thus, in practical terms, fitting a local polynomial to the raw data with mass points
is roughly equivalent to fitting a local polynomial to a “collapsed” version of the data, where we
aggregate the original observations by the discrete score values, calculating the average outcome
for all observations that share the same score value. Thus, the total number of observations in the
collapsed dataset is equal to the number of mass points in the running variable.
To illustrate this procedure with the LSO data, we calculate the average outcome for each of
the 430 mass points of the score. The resulting dataset has 430 observations, where each observation consists of a score-outcome pair: every score value is paired with the average outcome across all students in the original dataset whose score is equal to that value. We then use rdrobust to estimate the RD effect with a local polynomial.
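A minimal sketch of this aggregation step, using the base R function aggregate(), is shown below; the names of the collapsed variables are purely illustrative.

> # Average the outcome within each mass point of the score
> collapsed = aggregate(nextGPA_nonorm, by = list(X), FUN = mean, na.rm = TRUE)
> colnames(collapsed) = c("X_agg", "Y_agg")
> # Local polynomial estimation on the collapsed dataset
> rdrobust(collapsed$Y_agg, collapsed$X_agg)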
Summary:
                         Left       Right
Number of Obs            274        155
Eff. Number of Obs       51         50
Order Loc Poly (p)       1          1
Order Bias (q)           2          2
BW Loc Poly (h)          0.5057     0.5057
BW Bias (b)              0.8053     0.8053
rho (h/b)                0.6280     0.6280

Estimates:
                  Coef    Std. Err.        z    P>|z|    CI Lower    CI Upper
Conventional    0.2456       0.0321   7.6400   0.0000      0.1826      0.3085
Robust                                         0.0000      0.1659      0.3165
The estimated effect is 0.2456, with robust p-value less than 0.00005. This is similar to the 0.2221
point estimate obtained with the raw dataset. The similarity between the two point estimates is
remarkable, but not unusual, considering that the former was calculated using 430 observations, while the latter was calculated using 40,582 observations, more than a ninety-fold increase. Indeed, the inferential conclusions from both analyses are extremely consistent, as the robust 95% confidence interval using the raw data is [0.1217, 0.3044], while the robust confidence interval for the collapsed
data is [0.1659, 0.3165], both indicating that the plausible values of the effect are in roughly the
same positive range.
This analysis shows that the seemingly large number of observations in the raw dataset is
effectively much smaller, and that the behavior of the continuity-based results is governed by the
average behavior of the data at every mass point. Thus, a natural point of departure for researchers
who wish to study a discrete RD design with many mass points is to collapse the data and estimate
the effects using the aggregated data. As a second step, these aggregate results can be compared to the results obtained with the raw data; in most cases, both sets of results should lead to the same conclusions.
While the mechanics of local polynomial fitting with a discrete running variable are now clear, the actual relevance and interpretation of the treatment effect may change. As we will discuss below, researchers may want to change the parameter of interest altogether when the score is discrete. Alternatively, (parametric) extrapolation is unavoidable for point identification. To be more precise, because the score is discrete, it is not possible to nonparametrically point identify the vertical distance τSRD = E[Yi (1)|Xi = x̄] − E[Yi (0)|Xi = x̄], even asymptotically,
because conceptually the lack of denseness in Xi makes it impossible to appeal to large sample
approximations. Put differently, if researchers insist on retaining the same parameter of interest as
in the canonical RD design, then extrapolation from the closest mass point to the cutoff will be
needed, no matter how large the sample size is.
Of course, there is no reason why the same RD treatment effect would be of interest when the running variable is discrete nor, if it is, why any particular extrapolation method would be preferred over another. That said, the continuity-based approach, that is, simple local linear extrapolation towards the cutoff point, is natural and intuitive. When only a few mass points are present, bandwidth selection makes little sense, and the researcher may simply conduct linear (parametric) extrapolation globally, as this is essentially the only possibility if the goal is to retain the same canonical treatment effect parameter.
A natural alternative to analyze an RD design with a discrete running variable is to use the local
randomization approach, which effectively changes the parameter of interest from the RD treatment
effect at the cutoff to the RD treatment effect in the neighborhood around the cutoff where local
randomization is assumed to hold. A key advantage of this alternative conceptual framework is
that, unlike the continuity-based approach illustrated above, it can be used even when there are
very few mass points in the running variable: indeed, it can be used with as few as two mass points.
To compare the change in RD parameter of interest, consider the extreme case where the score
takes five values −2, −1, 0, 1, 2 and the RD cutoff is x̄ = 0. Then, the continuity-based parameter
of interest is τSRD = E[Yi (1)|Xi = 0] − E[Yi (0)|Xi = 0], which is not nonparametrically identifiable,
but the local randomization parameter will be τLR = E[Yi (1)|Xi = 0] − E[Yi (0)|Xi = −1] when
W0 = [−1, 0], say, which is nonparametrically identifiable under the conditions discussed in Section
5. Going from τLR to τSRD requires extrapolating from E[Yi (0)|Xi = −1] to E[Yi (0)|Xi = 0], which is
impossible without additional assumptions even in large samples because of the intrinsic discreteness
of the running variable. In some specific applications, additional features may allow researchers to
extrapolate (e.g., rounding), but in general extrapolation will require additional restrictions on the
data generating process. Furthermore, from a conceptual point of view, it can be argued that the
parameter τLR is more interesting and policy relevant than the parameter τSRD when the running
variable is discrete.
When the score is discrete, using the local randomization approach for inference does not require
choosing a window in most applications. In other words, with a discrete running variable the
researcher knows the exact location of the minimum window around the cutoff: this window is the
interval of the running variable that contains the two mass points, one on each side of the cutoff,
that are immediately consecutive to the cutoff value. Crucially, if local randomization holds, then
it must hold for the smallest window in the absence of design failures such as manipulation of the
running variable. To illustrate, as shown in Table 7.2, in the LSO application the original score
has a mass point at zero where all observations are control (because they reach the minimum GPA
required to avoid probation), and the mass point immediately below it occurs at -0.01, where all
students are placed on probation because they fall short of the threshold to avoid probation. Thus,
the smallest window around the cutoff in the scale of the original score is W0 = [−0.01, 0.00].
Analogously, in the scale of the transformed score, the minimum window is W0 = [−0.000005, 0.01].
Regardless of the scale used, the important point is that the minimum window around the cutoff
in a local randomization analysis of an RD with a discrete score is precisely the interval between
the two consecutive mass points where the treatment status changes from zero to one. Note that
the particular values taken by the score are irrelevant, as the analysis will proceed to assume that
the treated and control groups were assigned to treatment as-if randomly, and will typically make
the exclusion restriction assumption that the particular value of the score has no direct impact on
the outcome of interest. Moreover, the location of the cutoff is no longer meaningful, as any value
of the cutoff between the minimum value of the score on the treated side and the maximum value
of the score on the control side will produce identical treatment and control groups.
Once the researcher finds the treated and control observations located at the two mass points
around the cutoff, the local randomization analysis can proceed as explained in Section 5. We first
conduct a falsification analysis, to determine whether the assumption of local randomization in the
window [−0.000005, 0.01] seems consistent with the empirical evidence. We conduct a density test with the rdwinselect function, using the option nwindows=1 to see results only for this window,
to test whether the density of observations in this window is consistent with the density that would
have been observed in a series of unbiased coin flips.
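The output below was produced by a call of the following form, a sketch that mirrors the rdwinselect call with covariates shown later in this section.

> library(rdlocrand)
> rdwinselect(X, wmin = 0.01, nwindows = 1, cutoff = 5e-06)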
Window length/2    p-value    Var. name    Bin. test    Obs<c    Obs>=c
0.01               NA         NA           0            228      77
As shown in the rdwinselect output and also shown previously in Table 7.2, there are 228
control observations immediately below the cutoff, and 77 above the cutoff. In other words, there
are 228 students who get exactly the minimum GPA needed to avoid probation, and 77 students
who get the maximum possible GPA that still places them on probation. The number of control
observations is roughly three times higher than the number of treated observations, a ratio that
is inconsistent with the assumption that the probability of treatment assignment in this window
was 1/2—the p-value of the Binomial test reported in column Bin. test is indistinguishable from
zero.
We can also obtain this result by using the Binomial test commands directly.
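For instance, the following sketch uses base R's binom.test() to check whether the 77 treated and 228 control observations are compatible with an assignment probability of 1/2.

> binom.test(x = 77, n = 228 + 77, p = 1/2)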
Although these results alone do not imply that the local randomization RD assumptions are
violated, the fact that there are many more control than treated students is consistent with what
one would expect if students were actively avoiding an undesirable outcome. The results raise some
concern that students may have been aware of the probation cutoff, and may have tried to appeal
their final GPA in order to avoid being placed on probation.
Strictly speaking, an imbalanced number of observations would not pose any problems if the
types of students in the treated and control groups were on average similar. To establish whether
treated and control students at the cutoff are similar in terms of observable characteristics, we
use rdrandinf to estimate the RD effect of probation on the predetermined covariates introduced
above.
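For example, the effect on hsgrade_pct can be examined with a call of the following form; this is a sketch in which wl and wr delimit the window containing the two mass points closest to the cutoff, and the cutoff value matches the one used in the rdwinselect calls.

> rdrandinf(data$hsgrade_pct, X, cutoff = 5e-06, wl = -0.000005, wr = 0.01)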
We repeat this analysis for all predetermined covariates, but do not present the individual
runs to conserve space. A summary of the results is reported in Table 7.4. As shown, treated and
control students seem indistinguishable in terms of prior high school performance, total number of
credits, age, sex, and place of birth.
On the other hand, the Fisherian sharp null hypothesis that the treatment has no effect on the English-as-first-language indicator is rejected with a p-value of 0.009. The average difference in this variable is very large: 75% of control students speak English as a first language, but only 62.3% of treated students do. This difference is consistent with the local polynomial results we reported
for this variable in Table 7.3, although the difference is much larger (an average difference of -3.5
percentage points in the continuity-based analysis, versus an average difference of -15 percentage
points in the local randomization analysis). A similar phenomenon occurs for the Campus 2 and
Campus 3 indicators, which appear imbalanced in the local randomization analysis (with Fisherian
p-values of 0.075 and 0.009) but appear balanced with a continuity-based analysis.
These differences illustrate how a continuity-based analysis of a discrete RD design can mask
differences that occur at the mass points closest to the cutoff. In general, when analyzing an RD design
with a discrete running variable, it is advisable to perform falsification tests with the two mass points
closest to the cutoff in order to detect phenomena of sorting or selection that may go unnoticed
when a continuity-based approach is used.
Finally, we also investigate the extent to which the particular window around the cutoff, includ-
ing only two mass points, is driving the empirical results by repeating the analysis using different
nested windows. This exercise is easily implemented using the command rdwinselect:
> Z = cbind(data$hsgrade_pct, data$totcredits_year1, data$age_at_entry,
+   data$male, data$bpl_north_america, data$english, data$loc_campus1,
+   data$loc_campus2, data$loc_campus3)
> colnames(Z) = c("hsgrade_pct", "totcredits_year1", "age_at_entry",
+   "male", "bpl_north_america", "english", "loc_campus1", "loc_campus2",
+   "loc_campus3")
> out = rdwinselect(X, Z, p = 1, seed = 50, wmin = 0.01, wstep = 0.01,
+   cutoff = 5e-06)

Window length/2    p-value    Var. name    Bin. test    Obs<c    Obs>=c
The empirical results continue to provide evidence of imbalance in at least one pre-intervention covariate for each window considered, using randomization inference methods for the difference-in-means test statistic.
Remarkably, the difference-in-means between the 208 control students and the 67 treated stu-
dents in the smallest window around the cutoff is 0.234 grade points, extremely similar to the
continuity-based local polynomial effects of 0.2221 and 0.2456 that we found using the raw and
aggregated data, respectively. (The discrepancy between the treated and control sample sizes of
77 and 228 reported in Table 7.2 and the sample sizes of 67 and 208 reported in the rdrandinf output occurs because there are missing values in the nextGPA outcome, as students who leave the university do not have any future GPA.) Moreover, we can reject the null hypothesis of no effect at the 10% level using both the Fisherian and the Neyman inference approaches. This shows that the results for next term GPA are remarkably robust: we found very similar results using the 208 + 67 = 275 observations closest to the cutoff in a local randomization analysis, the total 40,582 observations using a continuity-based analysis, and the 429 aggregated observations in a
continuity-based analysis.
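The local randomization results for nextGPA summarized above can be reproduced with a call of this form, a sketch using the same window as in the covariate analysis.

> rdrandinf(nextGPA_nonorm, X, cutoff = 5e-06, wl = -0.000005, wr = 0.01)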
Lee and Card (2008) proposed alternative assumptions under which the local polynomial methods
in the continuity-based RD framework can be applied when the running variable is discrete. Their
method requires assuming a random specification error that is orthogonal to the score, and mod-
ifying inferences by using standard errors that are clustered at each of the different values taken
by the score. Similarly, Dong (2015) discusses the issue of rounding in the running variable. Both
approaches have in common that the score is assumed to be inherently continuous, but somehow
imperfectly measured—perhaps because of rounding errors—in such a way that the dataset avail-
able to the researcher contains mass points. Cattaneo et al. (2015, Section 6.2) discuss explicitly
the connections between discrete scores and the local randomization approach; see also Cattaneo
et al. (2017d).
8 Final Remarks
Bibliography
Angrist, J. D., and Rokkanen, M. (2015), “Wanna get away? Regression discontinuity estimation
of exam school effects away from the cutoff,” Journal of the American Statistical Association,
110, 1331–1344.
Arai, Y., and Ichimura, H. (2016), “Optimal bandwidth selection for the fuzzy regression disconti-
nuity estimator,” Economics Letters, 141, 103–106.
(2017), “Simultaneous Selection of Optimal Bandwidths for the Sharp Regression Discon-
tinuity Estimator,” Quantitative Economics, forthcoming.
Bajari, P., Hong, H., Park, M., and Town, R. (2011), “Regression Discontinuity Designs with an
Endogenous Forcing Variable and an Application to Contracting in Health Care,” NBER Working
Paper No. 17643.
Barreca, A. I., Lindo, J. M., and Waddell, G. R. (2016), “Heaping-Induced Bias in Regression-
Discontinuity Designs,” Economic Inquiry, 54, 268–293.
Bartalotti, O., and Brummet, Q. (2017), “Regression Discontinuity Designs with Clustered Data,”
in Regression Discontinuity Designs: Theory and Applications (Advances in Econometrics, vol-
ume 38), eds. M. D. Cattaneo and J. C. Escanciano, Emerald Group Publishing, pp. 383–420.
Bartalotti, O., Calhoun, G., and He, Y. (2017), “Bootstrap Confidence Intervals for Sharp Re-
gression Discontinuity Designs,” in Regression Discontinuity Designs: Theory and Applications
(Advances in Econometrics, volume 38), eds. M. D. Cattaneo and J. C. Escanciano, Emerald
Group Publishing, pp. 421–453.
Bertanha, M. (2017), “Regression Discontinuity Design with Many Thresholds,” Working paper,
University of Notre Dame.
Bertanha, M., and Imbens, G. W. (2017), “External Validity in Fuzzy Regression Discontinuity
Designs,” National Bureau of Economic Research, working paper 20773.
Calonico, S., Cattaneo, M. D., and Farrell, M. H. (2017a), “Coverage Error Optimal Confidence
Intervals for Regression Discontinuity Designs,” working paper, University of Michigan.
(2017b), “On the Effect of Bias Estimation on Coverage Accuracy in Nonparametric Infer-
ence,” Journal of the American Statistical Association, forthcoming.
Calonico, S., Cattaneo, M. D., Farrell, M. H., and Titiunik, R. (2017c), “Regression Discontinuity
Designs Using Covariates,” working paper, University of Michigan.
(2017d), “rdrobust: Software for Regression Discontinuity Designs,” Stata Journal, forth-
coming.
Calonico, S., Cattaneo, M. D., and Titiunik, R. (2014a), “Robust Data-Driven Inference in the
Regression-Discontinuity Design,” Stata Journal, 14, 909–946.
Canay, I. A., and Kamat, V. (2017), “Approximate Permutation Tests and Induced Order Statistics
in the Regression Discontinuity Design,” Working paper, Northwestern University.
Card, D., Lee, D. S., Pei, Z., and Weber, A. (2015), “Inference on Causal Effects in a Generalized
Regression Kink Design,” Econometrica, 83, 2453–2483.
Card, D., Lee, D. S., Pei, Z., and Weber, A. (2017), “Regression Kink Design: Theory and Prac-
tice,” in Regression Discontinuity Designs: Theory and Applications (Advances in Econometrics,
volume 38), eds. M. D. Cattaneo and J. C. Escanciano, Emerald Group Publishing, pp. 341–382.
Cattaneo, M. D., and Escanciano, J. C. (2017), Regression Discontinuity Designs: Theory and
Applications (Advances in Econometrics, volume 38), Emerald Group Publishing.
Cattaneo, M. D., and Farrell, M. H. (2013), “Optimal convergence rates, Bahadur representation,
and asymptotic normality of partitioning estimators,” Journal of Econometrics, 174, 127–143.
Cattaneo, M. D., Frandsen, B., and Titiunik, R. (2015), “Randomization Inference in the Regression
Discontinuity Design: An Application to Party Advantages in the U.S. Senate,” Journal of Causal
Inference, 3, 1–24.
Cattaneo, M. D., Jansson, M., and Ma, X. (2017a), “Simple Local Regression Distribution Estima-
tors with an Application to Manipulation Testing,” working paper, University of Michigan.
Cattaneo, M. D., Keele, L., Titiunik, R., and Vazquez-Bare, G. (2016a), “Interpreting Regression
Discontinuity Designs with Multiple Cutoffs,” Journal of Politics, 78, 1229–1248.
Cattaneo, M. D., and Titiunik, R. (2017), “Regression Discontinuity Designs: A Review of Recent
Methodological Developments,” manuscript in preparation, University of Michigan.
Cattaneo, M. D., Titiunik, R., and Vazquez-Bare, G. (2016b), “Inference in Regression Discontinuity
Designs under Local Randomization,” Stata Journal, 16, 331–367.
Cattaneo, M. D., and Vazquez-Bare, G. (2016), “The Choice of Neighborhood in Regression Dis-
continuity Designs,” Observational Studies, 2, 134–146.
Cerulli, G., Dong, Y., Lewbel, A., and Poulsen, A. (2017), “Testing Stability of Regression Dis-
continuity Models,” in Regression Discontinuity Designs: Theory and Applications (Advances in
Econometrics, volume 38), eds. M. D. Cattaneo and J. C. Escanciano, Emerald Group Publishing,
pp. 317–339.
Chiang, H. D., Hsu, Y.-C., and Sasaki, Y. (2017), “A Unified Robust Bootstrap Method for Sharp/-
Fuzzy Mean/Quantile Regression Discontinuity/Kink Designs,” Working paper, Johns Hopkins
University.
Chiang, H. D., and Sasaki, Y. (2017), “Causal Inference by Quantile Regression Kink Designs,”
Working paper, Johns Hopkins University.
Cook, T. D. (2008), ““Waiting for Life to Arrive”: A history of the regression-discontinuity design
in Psychology, Statistics and Economics,” Journal of Econometrics, 142, 636–654.
Dong, Y. (2015), “Regression Discontinuity Applications with Rounding Errors in the Running
Variable,” Journal of Applied Econometrics, 30, 422–446.
(2017), “Regression Discontinuity Designs with Sample Selection,” Journal of Business &
Economic Statistics, forthcoming.
Dong, Y., and Lewbel, A. (2015), “Identifying the Effect of Changing the Policy Threshold in
Regression Discontinuity Models,” Review of Economics and Statistics, 97, 1081–1092.
Ernst, M. D. (2004), “Permutation Methods: A Basis for Exact Inference,” Statistical Science, 19,
676–685.
Fan, J., and Gijbels, I. (1996), Local polynomial modelling and its applications: monographs on
statistics and applied probability 66, Vol. 66, CRC Press.
Frandsen, B. (2017), “Party Bias in Union Representation Elections: Testing for Manipulation
in the Regression Discontinuity Design When the Running Variable is Discrete,” in Regression
Discontinuity Designs: Theory and Applications (Advances in Econometrics, volume 38), eds.
M. D. Cattaneo and J. C. Escanciano, Emerald Group Publishing, pp. 281–315.
Ganong, P., and Jäger, S. (2017), “A Permutation Test for the Regression Kink Design,” Journal
of the American Statistical Association, forthcoming.
Gelman, A., and Imbens, G. W. (2014), “Why High-Order Polynomials Should Not be Used in
Regression Discontinuity Designs,” NBER Working Paper 20405, National Bureau of Economic Research.
Hahn, J., Todd, P., and van der Klaauw, W. (2001), “Identification and Estimation of Treatment
Effects with a Regression-Discontinuity Design,” Econometrica, 69, 201–209.
Imbens, G., and Lemieux, T. (2008), “Regression Discontinuity Designs: A Guide to Practice,”
Journal of Econometrics, 142, 615–635.
Imbens, G., and Rubin, D. B. (2015), Causal Inference in Statistics, Social, and Biomedical Sci-
ences, Cambridge University Press.
Imbens, G. W., and Kalyanaraman, K. (2012), “Optimal Bandwidth Choice for the Regression
Discontinuity Estimator,” Review of Economic Studies, 79, 933–959.
Jales, H., and Yu, Z. (2017), “Identification and Estimation using a Density Discontinuity Ap-
proach,” in Regression Discontinuity Designs: Theory and Applications (Advances in Economet-
rics, volume 38), eds. M. D. Cattaneo and J. C. Escanciano, Emerald Group Publishing, pp.
29–72.
Keele, L. J., and Titiunik, R. (2015), “Geographic Boundaries as Regression Discontinuities,” Po-
litical Analysis, 23, 127–155.
Lee, D. S. (2008), “Randomized Experiments from Non-random Selection in U.S. House Elections,”
Journal of Econometrics, 142, 675–697.
Lee, D. S., and Card, D. (2008), “Regression discontinuity inference with specification error,”
Journal of Econometrics, 142, 655–674.
Lee, D. S., and Lemieux, T. (2010), “Regression Discontinuity Designs in Economics,” Journal of
Economic Literature, 48, 281–355.
Lindo, J. M., Sanders, N. J., and Oreopoulos, P. (2010), “Ability, Gender, and Performance Stan-
dards: Evidence from Academic Probation,” American Economic Journal: Applied Economics,
2, 95–117.
Ludwig, J., and Miller, D. L. (2007), “Does Head Start Improve Children’s Life Chances? Evidence
from a Regression Discontinuity Design,” Quarterly Journal of Economics, 122, 159–208.
McCrary, J. (2008), “Manipulation of the running variable in the regression discontinuity design:
A density test,” Journal of Econometrics, 142, 698–714.
Meyersson, E. (2014), “Islamic Rule and the Empowerment of the Poor and Pious,” Econometrica,
82, 229–269.
Papay, J. P., Willett, J. B., and Murnane, R. J. (2011), “Extending the regression-discontinuity
approach to multiple assignment variables,” Journal of Econometrics, 161, 203–207.
Porter, J. (2003), “Estimation in the Regression Discontinuity Model,” working paper, University
of Wisconsin.
Reardon, S. F., and Robinson, J. P. (2012), “Regression discontinuity designs with multiple rating-
score variables,” Journal of Research on Educational Effectiveness, 5, 83–104.
Sekhon, J. S., and Titiunik, R. (2016), “Understanding Regression Discontinuity Designs as Obser-
vational Studies,” Observational Studies, 2, 174–182.
Shen, S., and Zhang, X. (2016), “Distributional Regression Discontinuity: Theory and Applica-
tions,” Review of Economics and Statistics, 98, 685–700.
Tukiainen, J., Saarimaa, T., Hyytinen, A., Meriläinen, J., and Toivanen, O. (2017), “When Does Re-
gression Discontinuity Design Work? Evidence from Random Election Outcomes,” VATT Work-
ing Papers 59.
Wing, C., and Cook, T. D. (2013), “Strengthening The Regression Discontinuity Design Using Ad-
ditional Design Elements: A Within-Study Comparison,” Journal of Policy Analysis and Man-
agement, 32, 853–877.
Wong, V. C., Steiner, P. M., and Cook, T. D. (2013), “Analyzing Regression-Discontinuity De-
signs With Multiple Assignment Variables: A Comparative Study of Four Estimation Methods,”
Journal of Educational and Behavioral Statistics, 38, 107–141.
Xu, K.-L. (2017), “Regression Discontinuity with Categorical Outcomes,” Working Paper, Indiana
University.