Markus Frölich is Director of the Center for Evaluation and Development (C4ED)
in Germany and Switzerland, Professor of Econometrics at Universität Mannheim,
Germany and a J-PAL Affiliate. He has twenty years of experience in impact evalu-
ation, including the development of new econometric methods and numerous applied
impact evaluations for organisations such as Green Climate Fund, International Fund
for Agricultural Development (IFAD), International Labour Organization (ILO), UNDP,
UNICEF, World Food Programme (WFP) and the World Bank.
Stefan Sperlich is Full Professor of Statistics and Econometrics at the Université de
Genève, has about fifteen years of experience as a consultant, and was co-founder of the
‘Poverty, Equity and Growth in Developing Countries’ Research Centre, Göttingen. He
has published in various top-ranked journals and was awarded the 2000–2002 Tjalling
C. Koopmans Econometric Theory Prize.
Impact Evaluation
Treatment Effects and Causal Analysis
MARKUS FRÖLICH
University of Mannheim, Center for Evaluation and Development
STEFAN SPERLICH
University of Geneva
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
79 Anson Road, #06–04/06, Singapore 079906
www.cambridge.org
Information on this title: www.cambridge.org/9781107042469
DOI: 10.1017/9781107337008
© Markus Frölich and Stefan Sperlich 2019
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2019
Printed in the United Kingdom by TJ International Ltd, Padstow, Cornwall
A catalogue record for this publication is available from the British Library.
ISBN 978-1-107-04246-9 Hardback
ISBN 978-1-107-61606-6 Paperback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
Contents
Acknowledgement
Foreword
Introduction
Bibliography
Index
Acknowledgement
This book is the result of many years of teaching Impact Evaluation at numerous univer-
sities and research institutes worldwide. We thank all of our colleagues, PhD and Master
students who contributed with discussions, comments, recommendations or simply their
encouragement. We would like to mention in particular Michael Lechner and Manuel
Arellano.
A special thanks goes to our families, Lili, Yvonne, Bianca and Daniel, and to our
parents, for all the support we experienced during the last seven years while finishing
this work.
Foreword
The treatment effect approach to policy evaluation focuses on evaluating the impact
of a yes-or-no policy in place. This approach has inspired a huge amount of empirical
work and has become the centerpiece of the econometric toolkit in important branches
of applied economics. Markus Frölich and Stefan Sperlich’s book is a comprehensive
graduate-level treatise on the econometrics of treatment effects. It will appeal to students
and researchers who want to have more than a cursory understanding of the whats, whys
and hows of treatment effect estimation.
There is much in the book to commend. I like the fact that the authors pay seri-
ous attention to both identification and estimation problems. In applied work treatment
effects are not identified; they are estimated. Reading this book reminds us that it is
not always the case that, given identification, an obvious estimation method follows.
This is not to the detriment of the book’s attention to identification. Formal assump-
tions involving potential outcomes are discussed alongside Pearl graphical displays of
causal models. Causal graphs and the many examples spread over the text help develop
intuition in an effective way.
The estimation of treatment effects from non-experimental data – the focus of
this book – typically involves conditional arguments, be they conditional exogeneity
as in regression and matching approaches, conditional instrumental variable assump-
tions or conditional difference-in-differences. Conditioning often involves non-trivial
choices and trade-offs beyond those associated with identification arrangements. One
has to choose the set of variables on which to condition, a statistical approach and its
implementation. Here the benefits of the in-depth treatment provided by the Frölich–
Sperlich partnership are clearly visible. In line with the literature, the authors emphasise
non-parametric approaches, providing the reader with an excellent self-contained
introduction to local non-parametric methods.
The method of instrumental variables is the central tool for the estimation of endoge-
nous treatments and so it features prominently in this book. Monotonicity of the
first-stage equation is required to identify local average treatment effects. Such local
effects may or may not be policy-relevant treatment effects. However, the fact that they
can all be expressed as weighted averages of marginal treatment effects opens up the
possibility of learning from the former about the latter. This is a promising avenue of
progress, and the book provides the essential elements to understand the interconnec-
tions between local, marginal and other treatment effects. An important lesson from
the literature is the major role that the first-stage equation plays in the identification of
causal effects. The instrumental-variable method only allows us to identify averages of
heterogeneous treatment effects provided that the first-stage equations are not heteroge-
neous to the same extent. This fact naturally leads to considering structural approaches
to modelling treatment choice. This is specially so in the case of non-binary treatments,
another situation that is also addressed in the book.
The treatment effect approach has made remarkable progress by boldly focusing on
a binary static treatment setting. However, there are economic policies that provide fun-
damentally dynamic incentives so that their effects cannot be understood in the absence
of a dynamic framework of analysis. It is good that Frölich and Sperlich have included
a final chapter in which a dynamic potential outcomes model and duration models are
discussed.
The authors cover most of the standard tools in the econometrics of treatment
effects from a coherent perspective using a common notation. Besides selection
on observables and instrumental variables, the book discusses linear and non-linear
difference-in-differences methods, regression discontinuity designs and quantile mod-
els. The econometrics of treatment effects remains an active literature in which new
developments abound. High-dimensional regression, Bayesian approaches, bounds and
networks are among the many areas of current research in causal inference, so there will
soon be material for a second volume.
Frölich and Sperlich’s book will be of interest whether you are an applied economist
who wants to understand what you are doing or you just want to understand what others
do. Estimating the causal effect of a policy from non-experimental data is challenging.
This book will help us to better understand and use the existing tools to deal with the
challenges. I warmly congratulate Markus and Stefan on their remarkable achievement.
Manuel Arellano
Madrid
April 2018
Manuel Arellano has been a Professor of Economics at CEMFI in Madrid since 1991.
Prior to that, he held appointments at the University of Oxford (1985–89) and the Lon-
don School of Economics (1989–91). He is a graduate of the University of Barcelona
and holds a PhD from the London School of Economics. He has served as Editor of the
Review of Economic Studies (1994–98), Co-Editor of the Journal of Applied Economet-
rics (2006–08) and Co-Chair of the World Congress of the Econometric Society (2010).
He is a Fellow of the Econometric Society and a Foreign Honorary Member of the
American Academy of Arts and Sciences. He has been President of the Spanish Eco-
nomic Association (2003), President of the European Economic Association (2013) and
President of the Econometric Society (2014). He has published many research papers
on topics in econometrics and labour economics, in particular on the analysis of panel
data, being named a Highly Cited Researcher by Thomson ISI (2010). He is the author of
Panel Data Econometrics (2003). He is a recipient of the Rey Jaime I Prize in Economics
(2012).
Introduction
This book on advanced econometrics is intended to familiarise the reader with techni-
cal developments in the area of econometrics known as treatment effect estimation, or
impact or policy evaluation. In this book we try to combine intuitive reasoning in identi-
fication and estimation with econometric and statistical rigour. This holds especially for
the complete list of stochastic assumptions and their implications in practice. Moreover,
for both identification and estimation, we focus mostly on non-parametric methods (i.e.
our methods are not based on specific pre-specified models or functional forms) in order
to provide approaches that are quite generally valid. Graphs and a number of examples
of evaluation studies are used to explain how sources of exogenous variation can be
exploited when disentangling causality from correlation.
What makes the analysis of treatment effects different from more conventional econo-
metric analysis methods, such as those covered, for example, in the textbooks of
Cameron and Trivedi (2005), Greene (1997) or Wooldridge (2002)? A first major dif-
ference is that the three steps – definition of parameter of interest, identification and
statistical modelling – are clearly separated. This helps first to define the objects one
is interested in, and to clearly articulate the definition and interpretation of counterfac-
tual outcomes. A second major difference is the focus on non-parametric identification
and estimation. Even though parametric models might eventually be used in the empir-
ical analysis, discussing identification without the need to impose – usually arbitrary –
functional forms helps us to understand where the identifying power comes from. This
permits us to link the identification strategy very tightly to the particular policy evalu-
ation problem. A third, and also quite important, difference is the acknowledgement of
possible treatment effect heterogeneity. Even though it would be interesting to model
this heterogeneity of treatment effects, in line with the standard literature we take it
as being of unknown form: some individuals may benefit greatly from a certain inter-
vention whereas some may benefit less, while others may even be harmed. Although
treatment effects are most likely heterogeneous, we typically do not know the form of
this heterogeneity. Nonetheless, the practitioner should always be aware of this het-
erogeneity, whereas (semi-)parametric regression models either do not permit it or do
not articulate it clearly. For example, most of the instrumental variable (IV) literature
simply ignores the problem of heterogeneity, and often people are not aware of the con-
sequences of particular model or IV choices in their data analysis. This can easily render
the presented interpretation invalid.
The book is oriented towards the main strands of recent developments, and it empha-
sises the reading of original articles by leading scholars. It does not and cannot substitute
for the reading of original articles, but it seeks to summarise most of the central aspects,
harmonising notation and (hopefully) providing a coherent road map. Unlike some
handbooks on impact evaluation, this book aims to impart a deeper understanding of the
underlying ideas, assumptions and methods. This includes such questions as: what are
the necessary conditions for the identification and application of the particular methods?;
what is the estimator doing to the data?; what are the statistical properties, asymptoti-
cally and in finite samples, advantages and pitfalls, etc.? We believe that only a deeper
understanding of all these issues (the economic theory that identifies the parameters of
interest, the conditions of the chosen estimator or test and the behaviour of the statistical
method) can finally lead to a correct inference and interpretation.
Quite comprehensive review articles, summarising a good part of the theoretical work
that has been published in the last 15 years in the econometric literature,1 include,
for example, Imbens (2004), Heckman and Vytlacil (2007a), Heckman and Vytlacil
(2007b), Abbring and Heckman (2007) and Imbens and Wooldridge (2009). See also
Angrist and Pischke (2008). The classical area of application in economics was that of
labour market research, where some of the oldest econometric reviews on this topic can
be found; see Angrist and Krueger (1999) and Heckman, LaLonde and Smith (1999).
Nowadays, the topic of treatment effect estimation and policy evaluation is especially
popular in the field of poverty and development economics, as can be seen from the
reviews of Duflo, Glennerster and Kremer (2008) and Ravallion (2008). Blundell and
Dias (2009) try to reconcile these methods with the structural model approach that is
standard in microeconometrics. Certainly, this approach has to be employed with care,
as students could easily get the impression that treatment effect estimators are just semi-
parametric extensions of the well-known parameter estimation problems in structural
models.
Before starting, we should add that this book considers randomised controlled tri-
als (RCTs) only in the first chapter, and just as a general principle rather than in detail.
The book by Guido W. Imbens and Donald B. Rubin, Causal Inference for Statistics,
Social, and Biomedical Sciences: An Introduction, has appeared quite recently and deals
with this topic in considerable detail. (See also the book by Glennerster and Takavarasha
(2013) on practical aspects of running RCTs.) Instead, we have added to the chapters
on the standard methods of matching, instrumental variable approach, regression dis-
continuity design and difference-in-differences more detailed discussions about the use
of propensity scores, and we introduce in detail quantile and distributional effects and
give an overview of the analysis of dynamic treatment effects, including sequential treat-
ments and duration analysis. Furthermore, unlike the standard econometrics literature,
we introduce (for the identification of causality structures) graph theory from the statis-
tics literature, and give a (somewhat condensed) review of non-parametric estimation
that is applied later on in the book.
1 There exists an even more abundant statistical literature that we neither cite nor review here simply for the
sake of brevity.
1 Basic Definitions, Assumptions and Randomised Experiments
In econometrics, one often wants to learn the causal effect of a variable on some other
variable, be it a policy question or some mere ‘cause and effect’ question. Although, at
first glance, the problem might look trivial, it can become tricky to talk about causal-
ity when the real cause is masked by several other events. In this chapter we will
present the basic definitions and assumptions about causal models; in the com-
ing chapters you will learn the different ways of answering questions about causality.
So this chapter is intended to set up the framework for the content of the rest of the
book.
We start by assuming we have a variable D which causes variable Y to change. Our
principal aim here is not to find the best fitting model for predicting Y or to analyse the
covariance of Y ; we are interested in the impact of this treatment D on the outcome of
interest (which is Y ). You might be interested in the total effect of D on Y , or in the
effect of D on Y in a particular environment where other variables are held fixed (the
so-called ceteris paribus case). In the latter case, we again have to distinguish carefully
between conditional and partial effects. Variable Y could indicate an outcome later in
life, e.g. employment status, earnings or wealth, and D could be the amount of education
an individual has received, measured as ‘years of schooling’. This setup acknowledges
the literature on treatment evaluation, where D ∈ {0, 1} is usually binary and indicates
whether or not an individual received a particular treatment. Individuals with D = 1
will often be called participants or treated, while individuals with D = 0 are referred
to as non-participants or controls. A treatment D = 1 could represent, for example,
receiving a vaccine or a medical treatment, participating in an adult literacy training
programme, participating in a public works scheme, attending private versus public sec-
ondary school, attending vocational versus academic secondary schooling, attending a
university, etc. A treatment could also be a voucher (or receiving the entitlement to a
voucher or even a conditional cash transfer) to attend a private school. Examples of this
are the large conditional cash transfer programmes in several countries in Latin America.
Certainly, D could also be a non-binary variable, perhaps representing different subjects
of university degrees, or even a continuous variable such as subsidy payments, fees or
tax policies.
Example 1.1 The Mexican programme PROGRESA, which has been running under
the name Oportunidades since 2002, is a government social assistance programme that
started in 1997. It was designed to alleviate poverty through raising human capital.
It has been providing cash payments to families in exchange for regular school atten-
dance, health clinic visits and also nutritional support, to encourage co-responsibility.
There was a rigorous (pre-)selection of recipients based on geographical and socioe-
conomic factors, but at the end of 2006 around one-quarter of Mexico’s population
had participated in it. One might be interested to know how these cash payments
helped the recipient families or households, or whether there has been any pos-
itive impact on their living conditions. These are typical questions that
policy makers need to answer on a regular basis. One key feature of PROGRESA
is its system of evaluation and statistical controls to ensure its effectiveness. For
this reason and given its success, Oportunidades has recently become a role model
for programmes instituted in many other countries, especially in Latin America and
Africa.
Let us set up the statistical setting that we will use in this book. All variables will be
treated as random. This is a notational convenience, but it does not exclude deterministic
variables. As measure-theory will not help you much in understanding the econometrics
discussed here, we assume that all these random variables are defined in a common prob-
ability space. The population of this probability space will often be the set of individuals,
firms, households, classrooms, etc. of a certain country, province, district, etc. We are
thinking not only of the observed values but of all possible values that the considered
variable can take. Similarly, we are not doing finite population theory but thinking rather
of a hyper-population; so one may think of a population containing infinitely many indi-
viduals from which individuals are sampled randomly (maybe organised in strata or
blocks). Furthermore, unlike the situation where we discuss estimation problems, for the
purpose of identification one typically starts from the idea of having an infinitely large
sample. From here, one can obtain estimators for the joint distribution of all (observed)
variables. But as samples are finite in practice, it is important to understand that you
can obtain good estimators and reasonable inference only when putting both together:
that is, a good identification strategy and good estimation methods. Upper-case letters
will represent random variables or random vectors, whereas lower-case letters will rep-
resent (realised) numbers or vectors, or simply an unspecified argument over which we
integrate.
In most chapters the main interest is first to identify the impact of D on Y from an
infinitely large sample of independently sampled observations, and afterwards to esti-
mate it. We will see that, in many situations, once the identification problem is solved, a
natural estimator is immediately available (efficiency and further inference issues aside).
We will also examine what might be estimated under different identifying assumptions.
The empirical researcher has then to decide which set of assumptions is most adequate
for the situation. Before we do so, we have to introduce some notation and definitions.
This is done in the (statistically probably) ‘ideal’ situation of having real experimental
data, such as in a laboratory.
For each individual i and each possible treatment value d, define the potential outcome

Y_i^d = ϕ(d, X_i, U_i),

which is the outcome that individual i would experience if (X_i, U_i) were held fixed but
D_i were set externally to the value d (the so-called treatment). The point here is not
to enforce the treatment D_i = d, but rather to highlight that we are not interested in a
ϕ that varies with the individual's decision to get treated. The case where changing D_i
also has an impact on (X_i, U_i) will be discussed later.
1 In contrast, a variable like wages would only be observed for those who are actually working. The case is
slightly different for those who are not working; clearly it's a latent variable then. This might introduce a
(possibly additional) selection problem.
Example 1.2 Let Y_i denote a person's wealth at the age of 50, and let D_i be a dummy
indicating whether or not he was randomly selected for a programme promoting his
education. Further, let X_i be his observable external (starting) conditions, which were
not affected by D_i, and U_i his (remaining) unobserved abilities and facilities. Here, D_i
was externally set to d when deciding about the kind of treatment. If we think of a D_i
that can only take 0 and 1, then for two values d = 1 (he gets the treatment) and d = 0
(he doesn't get the treatment), the same individual can have the two different potential
outcomes Y_i^1 and Y_i^0 respectively. But of course in reality we observe only one. We
denote the realised outcome as Y_i.
This brings us to the notion of a counterfactual exercise: this simply means that you
observe Y_i = Y_i^d for the realised d = D_i but use your model ϕ(·) to predict Y_i^d for a d
of your choice.
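For a binary treatment, this link between realised and potential outcomes can be written compactly; the identity below simply restates the definitions above and is not an additional assumption:

Y_i = D_i Y_i^1 + (1 − D_i) Y_i^0,

so that Y_i = Y_i^1 whenever D_i = 1 and Y_i = Y_i^0 whenever D_i = 0, while the other potential outcome remains the unobserved counterfactual.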
Example 1.3 Let Y_i be as before and let D_i be the dummy indicating whether person i
graduated from a university or not. Further, let X_i and U_i be the external conditions as
in Example 1.2. In practice, X_i and U_i may impact on D_i for several individuals i, such
that those who graduate form a different subpopulation from those who do not and might
hardly be comparable. Note that setting D_i externally to d is a theoretical exercise; it
does not necessarily mean that we can effectively enforce a 'treatment' on individual
i; rather, it allows us to predict how the individual would perform under treatment (or
non-treatment), generating the potential outcomes Y_i^1 and Y_i^0. In reality, we only observe
either Y_i^1 or Y_i^0 for each individual, calling it Y_i.
Notice that the relationship (1.2) is assumed to hold on the individual level for a given,
unchanged environment: only variation in D for individual i is considered, but not vari-
ation in D for other individuals, which may impact on Y_i or might generate feedback
cycles. We will formalise this assumption in Section 1.1.3. In this sense, the approach is
more focused on a microeconometric effect: a policy that changes D for every individ-
ual or for a large number of individuals (like a large campaign to increase education
or computer literacy) might change the entire equilibrium, and therefore the function ϕ
might change then, too. Such kinds of macro effects, displacement effects or general
equilibrium effects are not considered here, though they have been receiving more
and more attention in the treatment evaluation literature. Certainly, i could be cities,
regions, counties or even states.2 In this sense, the methods introduced here also apply
to problems in macroeconomics.
2 Card and Krueger (1994), for example, studied the impact of the increase of the minimum wage in 1992 in
New Jersey.
Example 1.4 An example of changing ϕ could be observed when a large policy is pro-
viding employment or wage subsidies for unemployed workers. This may lower the
labour market chances of individuals not eligible for such subsidies. These are known
as substitution or displacement effects, and they are expected to change the entire
labour market: the cost of labour decreases for the firms, the disutility from unem-
ployment decreases for the workers, which in turn impacts on efficiency wages, search
behaviour and the bargaining power of trade unions. In total, we are changing our
function ϕ.
Let us get back to the question of causality in the microcosm and look at the different
outcomes for an exogenous change in treatment from d to d′. The difference
Y_i^{d′} − Y_i^d
is obviously the individual treatment effect. It tells us how the realised outcome for
the ith individual would change if we changed the treatment status. This turns out to
be almost impossible to estimate or predict. Fortunately, most of the time we are more
interested in either the expected treatment effect or an aggregate of treatment effects for
many individuals. This brings us to the average treatment effect (ATE).
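To make the distinction concrete, here is a minimal simulation sketch (ours, not from the book; the data-generating process and all numbers are invented purely for illustration). It shows that individual effects exist but cannot be recovered from observed data, whereas their average is a well-defined target:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulated potential outcomes with heterogeneous individual effects
y0 = rng.normal(10.0, 2.0, n)        # outcome without treatment
y1 = y0 + rng.normal(1.5, 1.0, n)    # outcome with treatment; the effect varies over i

d = rng.integers(0, 2, n)            # treatment indicator
y = np.where(d == 1, y1, y0)         # only one potential outcome is ever realised

# Feasible only inside the simulation, never with real data:
individual_effects = y1 - y0
print(individual_effects.mean())     # their average, here about 1.5
```

With real data only (Y_i, D_i) and some covariates would be available, so the line computing individual_effects could never be run; the remainder of the book is about the assumptions under which averages of such effects can nevertheless be recovered.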
Example 1.5 As in the last two examples, let D_i ∈ {0, 1} indicate whether or not per-
son i graduated from university, and let Y_i denote their wealth at the age of 50. Then,
Y_i^1 − Y_i^0 is the effect of university graduation on wealth for person i. It is the wealth
obtained if this same individual had attended university minus the wealth this individ-
ual would have obtained without attending university. Notice that the ‘same individual’
is not equivalent to the ceteris paribus assumption in regression analysis. We explic-
itly want to allow for changes in other variables if they were caused by the university
graduation. While this is doubtless of intimate interest for this particular person, politi-
cians might be more interested in the gain in wealth on average or for some parts of the
population.
Example 1.6 Let D_i indicate whether individual i attended private or public secondary
school, whereas X_i indicates whether the individual afterwards went to university or
not. Here, we might be interested in that part of the effect of private versus public school
on wealth that is not channelled via university attendance. Clearly, attending private or
public school (D) is likely to have an effect on the likelihood of attending university (X),
which in turn is going to affect wealth. But one might instead be interested in a potential
direct effect of D on wealth, even if university attendance is externally fixed. From this
example we can easily see that how the treatment parameter is defined depends heavily
on the question of interest.
Example 1.7 Consider the Mincer earnings functions in labour economics, which are
often used to estimate the returns to education. To determine them, in many empiri-
cal studies log wages are regressed on the job experience, years of schooling and a
measure of ability (measured in early childhood, if available). The reasoning is that all
these are important determinants of wages. We are not so interested in the effects of
ability on wages, and merely include ability in the regression to deal with the selec-
tion problem discussed later on. The ceteris paribus analysis examines, hypothetically,
how wages would change if years of schooling (D) were changed while experience
(X ) remained fixed. Since on-the-job experience usually accumulates after the comple-
tion of education, schooling (D) may have different effects: one plausible possibility
is that schooling affects the probability and duration of unemployment or repeated
unemployment, which reduces the accumulation of job experience. Schooling outcomes
may also affect the time out of the labour force, which also reduces job experience. In
some countries it may decrease the time spent in prison. Hence, D affects Y indirectly
via X . Another possibility is that years of schooling are likely to have a direct positive
effect on wages. Thus, by including X in the regression, we control for the indirect effect
and measure only the direct effect of schooling. So, including X in the regression may
or may not be a good strategy, depending on what we are trying to identify. Sometimes
we want to identify only the total effect, but not the direct effect, and sometimes vice
versa.
two groups differ in observed and unobserved characteristics, and might even define
different populations while we are now exclusively focusing on the subpopulation for
which D = 1. Of course, if rolling out the policy to the entire population is intended for
the future, then the ATE, or maybe even the average treatment effect on the non-treated
(ATEN) E[Y^1 − Y^0 | D = 0], would be more interesting.
In the university graduation example, the difference between ATET and ATE is often
referred to as the sorting gain. The decision whether to attend university or not is likely
to depend on some kind of individual expectation about their wage gains from attending
university. This leads to a sorting of the population. Those who gain most from uni-
versity are more likely to attend it, whereas those who have little to gain from it will
most likely abstain. This could lead to an ATET being much higher than ATE. Hence,
the average wage gain for students is higher in the sorted subpopulation than in a world
without sorting. This difference between ATET and ATE could be due to differences
in observed as well as unobserved characteristics. Hence, the observed difference in
outcomes among students and non-students can be decomposed as
E[Y | D = 1] − E[Y | D = 0] = \underbrace{ATE}_{\text{average return to schooling}} + \underbrace{ATET − ATE}_{\text{sorting gain}} + \underbrace{E[Y^0 | D = 1] − E[Y^0 | D = 0]}_{\text{selection bias}} .
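This decomposition can be checked numerically. The sketch below is our own illustration (the data-generating process, parameter values and variable names are invented and not taken from any study); it simulates self-selection into university based on expected gains and verifies that the naive difference in means equals the sum of the three terms:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

ability = rng.normal(0.0, 1.0, n)
y0 = 20 + 2 * ability + rng.normal(0.0, 1.0, n)    # wealth without a degree
gain = 5 + 3 * ability + rng.normal(0.0, 1.0, n)   # heterogeneous return to university
y1 = y0 + gain

# Self-selection: those expecting larger gains are more likely to attend
d = (gain + rng.normal(0.0, 2.0, n) > 5).astype(int)
y = np.where(d == 1, y1, y0)

ate = gain.mean()
sorting_gain = gain[d == 1].mean() - ate                # ATET - ATE
selection_bias = y0[d == 1].mean() - y0[d == 0].mean()  # E[Y0|D=1] - E[Y0|D=0]

naive = y[d == 1].mean() - y[d == 0].mean()
print(naive, ate + sorting_gain + selection_bias)       # identical up to rounding
```

In this invented setting both the sorting gain and the selection bias are positive, because high-ability individuals gain more from university and would also be wealthier without it.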
Example 1.8 Consider an example of formal and informal labour markets. This example
will help us to understand that typically ATET > ATE if a larger Y_i means something
positive for i. In many parts of the developing and developed world, individuals work in
the informal sector (typically consisting of activities at firms without formal registration
or without employment contract or without compliance with required social security
contributions). Roughly, one can distinguish four different activities: self-employed in
the formal sector, i.e. owner of a registered firm; self-employed in the informal sec-
tor, i.e. owner of a business without formal registration;3 worker in the formal sector;
and, lastly, worker in the informal sector. Firms in the formal sector pay taxes, have
access to courts and other public services but also have to adhere to certain legislation,
e.g. adhering to worker protection laws, providing medical and retirement benefits, etc.
Informal firms do not have access to public services such as police and courts, and have
to purchase private protection or rely on networks. Similarly, employees in the formal
sector have a legal work contract and are, at least in principle, covered by worker protec-
tion laws and usually benefit from medical benefits like accident insurance, retirement
benefits, job dismissal rules, etc.
The early literature on this duality sometimes associated the formal sector with the
modern industrialist sector and the informal sector with technologically backward or
rural areas. The formal sector was considered to be superior. Those individuals migrat-
ing from the rural to the urban areas in search of formal sector jobs who do not find
formal employment accept work in the urban informal sector until they find formal
employment. Jobs in the formal sector are thus rationed, and employment in the infor-
mal sector is a second-best choice.4 Therefore, a formal and an informal sector coexist,
with higher wages and better working conditions in the formal one. Everyone would
thus prefer to be working in the latter.
3 This includes, for example, family firms or various types of street vendors.
On the contrary, there may be good reasons why some firms and workers voluntarily
prefer informality, particularly when taxes and social security contributions are high,
licences for registration are expensive or difficult to obtain, public services are of poor
quality and returns to firm size (economies of scale) are low. To run a large firm usu-
ally means a switch to the formal sector. Similarly, the medical and retirement benefits
to formal employees (and worker protection) may be of limited value, and in some
countries access to these benefits already exists if a family member is in formal employ-
ment. In addition, official labour market restrictions relating to working hours, paid
holidays, notice period, severance pay, maternity leave, etc. may not provide the flexi-
bility that firms and workers desire. Under certain conditions, workers and firms could
then voluntarily choose informal employment. Firms may also prefer informality, as
this may guard them against the development of strong unions or worker representation,
e.g. regarding reorganisations, dismissals, social plans for unemployed or precarious
workers. Hence, costs (taxes, social security) and state regulations provide incentives
for remaining informal.
Now think about individual i who does or does not seek treatment, i.e. employment in
the formal or the informal sector, respectively. Let Y_i^1 be his wage if he goes to the formal
sector, and Y_i^0 his wage in the informal sector. This outcome may also include non-
wage benefits. If individuals self-select their sector, they would choose the formal sector
when

D_i = 1{Y_i^1 > Y_i^0},
i.e. decide for treatment (or against it), depending on their potential outcomes and ignor-
ing for a moment the uncertainty here. This model is often referred to as the Roy (1951)
model.
Under the hypothesis of informality being only an involuntary choice because of the
limited size of the formal sector, it should be that Y_i^1 − Y_i^0 > 0 for almost everyone. In
this case, some individuals would like to join the formal sector but are not successful.
But thinking of an efficient allocation, and taking the size of the formal sector as given,
we would find that
ATET = E[Y^1 − Y^0 | D = 1] > E[Y^1 − Y^0 | D = 0] = ATEN,   (1.5)

that is, those who obtained a formal sector job should have a larger gain vis-à-vis
non-formal employment than those who did not obtain a formal sector job (D = 0).5
4 Recall also the efficiency wage theory: if a worker's effort in the formal sector cannot be monitored
perfectly, or only at a considerable cost, some incentives are required to promote workers' effort. An
implication of efficiency wage theory is that firms pay higher wages to promote effort, which leads to
unemployment. The risk of becoming unemployed in the case of shirking provides the incentives for the
worker to increase effort. Because most developing countries do not provide generous unemployment
insurance schemes, and because the value of money is larger than the utility from leisure, these
unemployed enter into low-productivity informal activities where they are either self-employed or where
monitoring is less costly.
Note that the inequality (1.5) is a result of economic theory, not a general result
from statistics. Further, as

ATEN = {ATE − ATET · P(D = 1)}/P(D = 0),

inequality (1.5) is equivalent to ATET > ATE.
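To see where this relation comes from, recall the law of total expectation; the following short derivation is ours but entirely standard:

ATE = E[Y^1 − Y^0] = E[Y^1 − Y^0 | D = 1] P(D = 1) + E[Y^1 − Y^0 | D = 0] P(D = 0) = ATET · P(D = 1) + ATEN · P(D = 0),

which rearranges to the expression for ATEN above. Substituting it into ATET > ATEN and using P(D = 1) + P(D = 0) = 1 shows that (1.5) holds if and only if ATET > ATE.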
Example 1.9 Typical examples where spillover effects are quite obvious are (medi-
cal) treatments to combat contagious diseases. Therefore, studies in which medical
treatment is randomised at the individual level potentially underestimate the benefits
of treatment. They typically miss externality benefits to the comparison group from
reduced disease transmission. Consequently, one fails to estimate the counterfactual sit-
uation of no-treatment. Miguel and Kremer (2004) evaluated a Kenyan project in which
school-based mass treatment with deworming drugs was randomly phased into schools,
rather than to individuals, allowing estimation of overall programme effects. Individuals
at the selected school could nonetheless decide not to participate. When accounting for
the mentioned spillover effects, they found that the programme reduced school absen-
teeism in treatment schools by one-quarter. Not surprisingly, deworming substantially
improved health and school participation also among the untreated children, both in
treatment schools and even in neighbouring schools. However, they found no statistical
evidence that the deworming programme improved, for example, academic test scores.
where D_i and D_i′ denote the ith elements of the vectors D and D′, respectively. In other
words, it is assumed that the observed outcome Y_i depends only on the treatment to
which individual i is assigned, and not on the allocation of other individuals. If we
change the allocation of other individuals, keeping the ith allocation fixed, then the
outcome of the ith individual should not change.6
The SUTVA assumption might be invalidated if individuals interact, either directly or
through markets. Let’s see some examples.
Example 1.10 Let’s assume a firm wants to give training to build a skilled workforce
and it needs to evaluate how effective the training is, so that training materials can also
be used in future. If the firm wants to see how this training creates an impact on the
production or output, it really needs to make sure that lessons from the training do
not get to the workers in the control group. It can take two groups from very different
parts of the factory, so that they have little or no chance to interact, but then we have
different structural setups and it would make little sense to compare them. But if it takes
employees from the same part of the production process then there is a possibility that
the people who were intentionally not given the training might be interested to know
about the contents, ask treated workers and try to implement the ideas. For example, if
the training teaches the use of some kind of waste management technique, then some
people in the control group might be tempted to use the ideas, too.
6 For an excellent discussion about the history of potential outcomes and SUTVA, please have a look at
chapters 1 and 2 of Imbens and Rubin (2015). There they mention two assumptions related to SUTVA: 'No
interference', which is the same as our no-spillover condition, and 'No hidden variations of treatments', which
means that each treatment level represents the same, well-defined treatment for all observations.
Market and general equilibrium effects often depend on the scale of the policy, i.e.
on the number of participants in the programmes. In fact, departures from SUTVA are
likely to be small if only a few individuals participate in the policy, but with an increasing
number of participants we expect larger spillover effects (or other externalities).
Example 1.11 If active labour market programmes change the relative supply of skilled
and unskilled labour, all individuals may be affected by the resulting changes in the
wage structure. In addition, programmes which affect the labour cost structure, e.g.
through wage subsidies, may lead to displacement effects, where unsubsidised work-
ers are laid off and are replaced by subsidised programme participants. Individuals
might further be affected by the taxes raised for financing the policy. It is obvious that
these interaction or spillover effects can be pretty small if one focuses only on a small
economic sector, for example in order to alleviate social hardships when a structural
break happens, as was the case for the European coal mining sector or the shipbuilding
industry.
A quite different form of interference between individuals can arise due to supply con-
straints. If the number of programme slots is limited, the availability of the programme
for a particular individual depends on how many participants have already been allocated
to this programme. Such interaction does not directly affect the potential outcomes and,
thus, does not invalidate the microeconometric evaluation approaches discussed subse-
quently. However, it restricts the set of feasible allocations D and could become relevant
when trying to change the allocation of participants in order to improve the overall effec-
tiveness of the policy. Supply constraints are often (at least partly) under the control of
the programme administration and could be moderated if necessary.
Henceforth, the validity of SUTVA is assumed. Consequently, it is no longer nec-
essary to take account of the full treatment allocation vector D, since the outcome of
individual i depends only on the treatment received by himself, which is denoted by a
scalar variable Di in the following.
Example 1.12 Suppose you want to see whether increased sanitation coverage has any
impact on health. In many parts of the developing world, open defecation is still a big
problem and the government might be interested in seeing the impact of this policy.
Assume we start with a group of households. We seek the households with the worst
latrines, or no latrines, and install hygienic latrines there. Then we take the difference
in the average of some health measure between those who got the latrines and those
who didn't. As we gave the treatment to those who were worst off, it might be the case
that initially (before treatment) they were already in a worse state for other reasons
(people who didn't have latrines might be poor and their health status already rather
bad). So even if they hadn't received the treatment, their average health status
would be relatively low, i.e. E[Y^0 | D = 1] might be a lot smaller than E[Y^0 | D = 0].
In this case, just taking the difference of simple averages would not reveal the ATE,
because selection bias would mask the actual treatment effect.
7 This essentially means that there is no selection on unobservables that are also affecting the outcome.
them entered university. To identify the individual return to schooling, one would like
to compare individuals with the same observed and unobserved characteristics but with
different levels of schooling. This argument is actually not that different from the ceteris
paribus and exogeneity discussion in structured regression. The particular interpretation,
however, depends essentially on the assumption made about causal chains.
Example 1.13 Consider again the return to schooling on earnings. Even if one identifies
the individual return to schooling, the economic interpretation still depends on the causal
channels one has in mind. This can easily be seen when contrasting the human capital
theory versus the signalling theory of schooling. The human capital theory posits that
schooling increases human capital, which increases wages. The signalling theory pre-
sumes that attainment of higher education (e.g. a degree) simply signals high unobserved
ability to potential employers, even if the content of education was completely useless.
In the latter case, from an individual perspective, schooling may well have a high return.
On the other hand, if years of schooling were increased for everyone, the overall return
would be zero since the ranking between individuals would not change. Then a clear
violation of the SUTVA occurs, because now the individual potential outcomes depend
on the treatment choices of other individuals. This is also referred to as ‘peer effects’
or ‘externalities’. Individual-level regressions would identify only the private marginal
return, not the social return.
Example 1.14 Beegle, Dehejia and Gatti (2006) analyse the effects of transitory income
shocks on the extent of child labour, using household panel data in rural western Tan-
zania collected from 1991 to 1994. Their hypothesis is that transitory income shocks
due to crop losses may induce families to use, at least temporarily, more child labour.
This effect is expected to be mitigated by family wealth. In other words, the impact
will be quite heterogeneous with respect to the wealth of each individual family. If the
(relative) size of the transitory income shock depends on this wealth, then we expect
ATET > ATE > ATEN.
Other examples are the effects of the tax system on labour supply, the public–private
sector wage differential or the effects of class size on students’ outcomes. Distinguishing
the true causal effect from differences in unobservables is the main obstacle to non-
parametric identification of the function ϕ or of features of it such as treatment effects.
The challenge will be to work out the assumptions that permit non-parametric identi-
fication. While this has always been of concern in econometrics, in recent years much
more emphasis has been placed on trying to verify these assumptions and finding weaker
assumptions for identification.
Example 1.15 One of the best-known randomised experiments is the 'Student/Teacher
Achievement Ratio' (STAR) experiment in Tennessee. This experiment took place
around the mid-1980s. It was designed to obtain credible evidence on the hotly debated
issue of whether smaller classes support student learning and lead to better student out-
comes. Because reducing class size would imply hiring more teachers and more
investment, this experiment was important for assessing whether any gains would justify
the costs of reducing class sizes. Although there were many observational studies before
STAR, their results were highly disputed. Overall, the non-experimental results suggested that
there was very little or no effect of class size on the performance of the students. But
class size can be endogenous and there are many observed and unobserved character-
istics that can make the students in smaller classes quite different from the students
in larger classes. On the one hand, class size may be smaller in richer areas or where
parents are very interested in a good education for their children. On the other hand,
more disruptive children, and those with learning difficulties, are often placed in smaller
classes. Randomised experiments help here to balance the two groups in both observed
and unobserved variables. In the STAR experiment, each participating school assigned
children to one of three types of classrooms: small classes with a targeted enrolment
of 13–17; regular classes with a targeted enrolment of 22–25; and a third type with a
targeted regular enrolment of 22–25 but with an additional full-time teacher's aide in the room.
The design of these experiments ensures that treated and controls have the same
distribution of observed and unobserved characteristics, such that the ATE is identified
by the simple difference in means,

E[Y | D = 1] − E[Y | D = 0].
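The identification argument behind this difference in means can be spelled out explicitly; the following two lines are standard and use only that randomisation makes D independent of the potential outcomes (Y^1, Y^0):

E[Y | D = 1] − E[Y | D = 0] = E[Y^1 | D = 1] − E[Y^0 | D = 0] = E[Y^1] − E[Y^0] = ATE,

where the first equality uses that the realised outcome coincides with the corresponding potential outcome, and the second uses the independence created by random assignment.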
8 Compare with the common support condition discussed in the next chapters.
programme was launched, due to budgetary limits it was introduced only in several
pilot regions, which were randomly selected (randomised phasing-in). The unit of
randomisation was the community level and data were collected not only for these ran-
domly selected communities but also in several randomly selected non-participating
communities. In fact, half of the communities participated in the programme and the
other half did not. Participation in the programme was designed as a two-step procedure.
In the first step, a number of localities with a high degree of marginality were selected,
of which about half were randomised into the programme. In the second step, only poor
households living in pilot localities were considered eligible for the programme, on
the basis of a region-specific poverty index at the household level. Data were collected
at baseline, i.e. before the introduction of the programme, and in subsequent waves
afterwards.
Besides evaluating the impact of different programmes, randomisation can also
help us to identify the proper group of beneficiaries. Proper targeting is a common
screening problem when implementing conditional cash transfers or
other welfare programmes. A government needs to separate the poor from the rich and
enrol only the former in the programme. But as you might guess, this is not a straightforward
problem because rich individuals can always sneak in to get benefits. One way to
avoid this is to use a self-selection mechanism, i.e. to incorporate requirements that are
costly for the rich, like a manual labour requirement, or to provide low-quality food so
that rich people are not interested. But this can often produce inefficient outcomes
because, just to disincentivise the rich, poor people have to bear unnecessary costs through
painful labour or bad-quality aid. Another way is 'automatic screening', which
typically proceeds by some kind of asset test or proxy means test: for example, inter-
viewing the individuals, observing their present living conditions (residence quality, ownership
of motorbikes, etc.) and then asking neighbours. But again, this process can
be misleading and lengthy. So the question is whether we can do something better than
these suggestions and, if so, what the alternatives might be.
Example 1.17 Alatas, Banerjee, Hanna, Olken, Purnamasari and Wai-poi (2013) used
randomised evaluations to see whether it is possible to incorporate some self-targeting
mechanism to screen the poor. The idea was to see what happens if the individuals
were asked to apply for the test. For the Indonesian conditional cash transfer
programme PKH, they experimentally varied the enrolment process used to select
beneficiaries across 400 villages. So they compared villages where households had to
actively apply for the test with villages where an automatic screening or proxy
means test was conducted directly by PKH. In the self-targeting villages, the households
were asked to go to the registration office first, and only afterwards was the asset test
conducted by PKH. In the automatic-screening villages, PKH conducted the usual proxy
means test to determine eligibility. They found that villages where the households had to
apply for the test ended up with much poorer groups of beneficiaries. A possible explanation is
that when households have to apply, many of those who probably didn't need the
aid didn't go for the test.
Like the STAR experiment that we mentioned in Example 1.15, many experimental
designs include the interaction of different treatments. In many cases you may
start with one specific treatment in mind, but then find out that interactions work even
better.
Example 1.18 Two major health risks for teenage girls in the sub-Saharan countries
are early (adolescent) pregnancy and sexually transmitted infections (STIs) (particu-
larly HIV). According to recent WHO reports, more than 50 per cent of adolescent births
take place in sub-Saharan countries. Both early pregnancy and STIs have negative
health effects and social consequences for teenage girls. Often, girls attending primary
school have to leave the school, and in many cases adolescent births can lead to fur-
ther health problems. Duflo, Dupas and Kremer (2015) did an experimental study to see
how teenage pregnancy and STI prevalence are affected by two important policy instru-
ments and their interaction: (a) education subsidies and (b) HIV prevention (focused on
abstinence until marriage). The experiment started in 2003 with students aged between
13.5 and 20.5, enrolled in grade 6 at 328 schools located in the western part of
Kenya. The study followed the participants, 9500 girls and 9800 boys, for seven years.
Schools were randomly assigned to one of four groups: (1) Control (82 schools); (2)
Stand-Alone Education Subsidy programme (83 schools); (3) Stand-Alone HIV Edu-
cation programme (83 schools); and (4) Joint Programme (80 schools). The education
subsidy treatment was a simple subsidy programme that provided two free
school uniforms (given to the same students, one at the beginning of 2003 and
the other in late 2004) over the last three years of primary school. The HIV education
programme provided education about sexually transmitted infections with
an emphasis on abstinence until marriage. In every school three teachers were trained
by the government to help them deliver Kenya’s national HIV/AIDS curriculum. Short,
medium and long-term impacts of these two programmes and their interaction were
observed on outcome variables like sexual behaviour, fertility and infection with HIV
and another STI (Herpes Simplex Virus type 2 [HSV2]). They found that only the education
subsidy reduced adolescent girls' dropout, pregnancy and marriage; the HIV prevention
programme did not reduce pregnancy or STIs. The combined programme reduced STIs more,
but dropout and pregnancy less, than the education subsidy alone.
\widehat{SATE} = \frac{1}{n/2} \sum_{D_i = 1} Y_i − \frac{1}{n/2} \sum_{D_i = 0} Y_i .   (1.10)

The hope is to have data such that the SATE can be consistently estimated by \widehat{SATE}.
9 See also Imai, King and Stuart (2008) and King and Zeng (2006).
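In code, (1.10) is just a difference of group means when exactly half of the sample is treated; a minimal sketch with hypothetical data (function and variable names are ours):

```python
import numpy as np

def sate_hat(y: np.ndarray, d: np.ndarray) -> float:
    """Difference-in-means estimator; with exactly n/2 treated units this equals (1.10)."""
    return y[d == 1].mean() - y[d == 0].mean()

# Hypothetical balanced experiment with n = 10
y = np.array([3.1, 2.9, 3.5, 3.0, 3.2, 2.0, 2.1, 1.9, 2.2, 2.0])
d = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
print(sate_hat(y, d))   # about 1.10
```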
the population distribution and the difference would not vanish with increasing sample
size. For example, the individuals who (actively) apply to participate in the experiment
are often different from the population we would like to target. This issue is often
referred to as external versus internal validity. Randomised controlled trials have
the advantage of high internal validity in the sense that the SATE is consistently esti-
mated, since any difference in observables and unobservables between treated
and controls vanishes with increasing sample size. On the other hand, external validity may be
low in the sense that the SATE is not a consistent indicator of the population ATE when the
participants in the experiment (treated and controls) are not randomly sampled from
the population of interest; in other words, the sample may not be representative
of the population.
Let us formalise the idea. To understand the difference between SATE and ATE better,
for now it will be more illuminating to switch to the finite population case (of size N).
We can always return to infinite populations later by considering N → ∞.10
We start by specifying the sampling-related differences. Let's make use of the
separability in (1.11) to obtain

S(X) = \frac{N − n}{N} \int {m_1(X) − m_0(X)} d{F̂(X | S = 0) − F̂(X | S = 1)},   (1.13)

S(U) = \frac{N − n}{N} \int {ξ_1(U) − ξ_0(U)} d{F̂(U | S = 0) − F̂(U | S = 1)},   (1.14)
where S = 1 indicates that the individual is in the sample, S = 0 otherwise, and F̂ refers
to the empirical cumulative conditional distribution of either X or U , respectively.
The expressions can be better understood if we focus on each part separately. Let’s
interpret S(X ) . We have two distributions for X , conditional on whether we are looking
at the people in the sample or not. If we focus on F̂(X |S = 1), this is the empirical
cdf of X for the people who are present in the sample, and accordingly, ∫{m_1(X) −
m_0(X)} dF̂(X | S = 1) is the part of the ATE related to the observed variables for the people in the sample.
Similarly, it is possible to consider F̂(X |S = 0) for the people who are not in the
sample. Potential differences are due to the difference in the distribution of X in the
two samples. You can think about the term (N − n)/N as some finite population correction
term; as it equals 1 − n/N, the correction disappears for an infinite population with N → ∞. Using the
definition of empirical cdf, Equation 1.13 can also be written as
\frac{N − n}{N} \left[ \frac{1}{N − n} \sum_{i: S_i = 0} {m_1(X_i) − m_0(X_i)} − \frac{1}{n} \sum_{i: S_i = 1} {m_1(X_i) − m_0(X_i)} \right] .
In a similar fashion you can also interpret S(U ) . But this portion of the treatment effect
is related to the unobserved variables.
10 You may argue that the populations you have in mind are finite, too. This, however, is often not really the
case as e.g. the population of a country changes every second, and you want to make a more general
statement than one to that resolution. Therefore, it can be quite useful to abstract to an infinite
hyperpopulation that might be described by a distribution, and your specific population (of a country,
right now) is just a representative sample of it.
Also note that for random(ised) samples, when sample size increases F̂(X |S = 0)
should converge to F̂(X |S = 1), and F̂(U |S = 0) to F̂(U |S = 1). So in the limit both
S(X ) and S(U ) will approach zero.
Randomisation Method
A second issue refers to the random treatment assignment itself. The simplest strategy,
which is often used when treatment decisions have to be made immediately, is to assign
each individual with probability 50 per cent either to treatment 1 or 0. Although being
a valid randomisation design, this is usually associated with a rather high variance. The
intuition is simple, as can be seen from the following example.
Example 1.19 Suppose n = 100, of which 50 are men and 50 are women. We randomly
assign 50 of these individuals to treatment and 50 to control. By chance it could happen
that 40 men and 10 women are assigned to treatment, with the remaining 10 men and 40
women being in the control group. In this case, men are highly overrepresented among
the treated, which of course could affect the estimated treatment effect
$$\frac{1}{50}\sum_{i:\,D_i=1} Y_i \;-\; \frac{1}{50}\sum_{i:\,D_i=0} Y_i.$$
Although gender would be balanced in treatment and control group when the sample
size goes to infinity, in any given sample it will usually not be. To obtain a quantitative
intuition, consider a sample which contains only 0.3n women.11 Half of the sample is
randomly allocated to treatment and the other half to the control group. When n = 50,
in 38 per cent of these experiments the difference in the fraction of women between the
treatment and the control group will be larger than 0.1. When n = 100, this occurs in
only 27 per cent of the experiments. Fortunately, when n = 400, such large differences
occur only very rarely, namely in 2 per cent of the experiments.
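The order of magnitude of such chance imbalances can be checked by simulation. The sketch below assumes complete randomisation into two equal halves and defines imbalance as an absolute difference in the share of women of more than 0.1; the figures reported by Kernan et al. may rest on a slightly different definition or assignment scheme, so the simulated frequencies need not coincide exactly with the numbers quoted above.

```python
import numpy as np

def prob_imbalance(n, share_women=0.3, threshold=0.1, reps=20_000, seed=1):
    """Share of complete randomisations of n units (a fraction share_women of
    them women) into two equal halves in which the difference in the fraction
    of women between the two groups exceeds the threshold."""
    rng = np.random.default_rng(seed)
    women = np.zeros(n, dtype=int)
    women[: int(round(share_women * n))] = 1
    count = 0
    for _ in range(reps):
        perm = rng.permutation(women)
        p_treat, p_control = perm[: n // 2].mean(), perm[n // 2:].mean()
        if abs(p_treat - p_control) > threshold:
            count += 1
    return count / reps

for n in (50, 100, 400):
    print(n, round(prob_imbalance(n), 3))
```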
Let us again formalise the balancing issue. Analogously to (1.13) and (1.14), one
obtains from the separability (1.11) for our estimation bias (1.12)
$$T(X) = \frac{1}{2}\int \{m_1(X)+m_0(X)\}\, d\{\hat F(X\mid D=0,S=1) - \hat F(X\mid D=1,S=1)\}, \qquad (1.15)$$
$$T(U) = \frac{1}{2}\int \{\xi_1(U)+\xi_0(U)\}\, d\{\hat F(U\mid D=0,S=1) - \hat F(U\mid D=1,S=1)\}. \qquad (1.16)$$
Note that we only look at the empirical distributions inside the sample. Looking again
at the differences in distributions at the end of each formula, it becomes clear that we
have an asymptotic balance in X (and U ) between treatment and control group. That is,
for increasing samples, T (X ) (and T (U ) ) disappear.
Taking all together, if we can combine, for example, random sampling with random
treatment assignment, we could consistently estimate the ATE simply by appropriate
averaging. Otherwise, if random sampling from the population of interest is not possible,
11 The following example is taken from Kernan, Viscoli, Makuch, Brass and Horwitz (1999).
12 Or a corresponding imputation such as matching; see the next chapter, and also Exercise 3.
Usually, one would like to stratify on some variables X that are closely related to the
outcome variable Y (or one of the several outcome variables of interest) and on variables
for which a subgroup analysis is planned (e.g. estimation of treatment effects separately
for men and women). Stratification is most helpful when future values of Y can be
predicted reasonably well from baseline data. Important predictors are often the lagged
values Yt=0 of the outcome variable, which should be collected as part of a baseline
survey. These variables are most relevant when Y is highly persistent, e.g. when one is
interested in school test scores, education, height, wealth, etc. On the other hand, for
very volatile outcome variables such as firm profits, lagged values may not predict very
well.
The way randomisation was performed has to be taken into account when conduct-
ing inference. A large biostatistics literature has examined this issue for clinical trials.
Exercises 3 and 4 study how an appropriate weighting modifies $\widehat{SATE}$ so that it becomes a consistent estimator of the ATE, and how this weighting changes the variance of the estimator. The latter has to be taken into account when estimating the standard error. For given weights $w_x$ (the proportion with which x occurs in the population of interest) and independent observations, this is straightforward: the variance expression (1.22) in Exercise 4 can be estimated by $\frac{2}{n}\sum_{x\in\mathcal{X}} w_x\{\widehat{Var}(Y^1\mid X=x) + \widehat{Var}(Y^0\mid X=x)\}$, where the conditional variances are estimated separately from the samples of the treated and the untreated, respectively. This can be done parametrically or non-parametrically.13 Note that we assume random samples stratified (or blocked) along X, which are therefore not necessarily representative of the population. Knowing the population weights $w_x$, however, allows us to correct for this stratification (or blocking).
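A minimal sketch of this weighting correction, assuming discrete strata, known population weights $w_x$ and the variance approximation just described (all data below are simulated):

```python
import numpy as np

def stratified_ate(y, d, x, pop_weights):
    """ATE reweighted by known population stratum weights w_x, with a standard
    error based on (2/n) * sum_x w_x {Var(Y^1|x) + Var(Y^0|x)}."""
    n = len(y)
    ate, var_sum = 0.0, 0.0
    for stratum, w in pop_weights.items():
        y1 = y[(x == stratum) & (d == 1)]
        y0 = y[(x == stratum) & (d == 0)]
        ate += w * (y1.mean() - y0.mean())
        var_sum += w * (y1.var(ddof=1) + y0.var(ddof=1))
    return ate, np.sqrt(2.0 / n * var_sum)

# Toy data: binary stratification variable X, oversampling the x = 1 stratum.
rng = np.random.default_rng(2)
x = np.repeat([0, 1], [200, 200])
d = np.concatenate([rng.permutation(np.repeat([0, 1], 100)) for _ in range(2)])
y = 2.0 * d * x + rng.normal(size=400)          # effect 2 in stratum 1, 0 in stratum 0
print(stratified_ate(y, d, x, pop_weights={0: 0.8, 1: 0.2}))   # population ATE = 0.4
```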
In order to be able to afterwards correct the estimator for the bias, one should always
choose strata or blocks X for which the population weights wx are provided or at least
can be obtained.14 Then, the ATE estimate, standard error and its estimate are as above.
In case of using a parametric estimate for the standard error, many authors (compare with Bruhn and McKenzie, 2009) advise correcting the degrees of freedom (d.o.f.) by the number of strata or blocks used. The procedure becomes evident when thinking in terms of a simple linear regression model; compare, for example, with Duflo, Glennerster and Kremer (2008): for J blocks15 $B_j$ with $\mathcal{X} = \cup_{j=1}^{J} B_j$ and $n_j$ individuals in block j, of which half (let $n_j$ be even) are treated, consider
$$Y_{ij} = \beta_0 + \beta D_i + \gamma_j + \varepsilon_{ij}, \quad i = 1,\dots,n_j,\; j = 1,\dots,J, \qquad (1.17)$$
where $\gamma_j$ are fixed effects. Let $w_j$ be the population block weights, $w_j = \sum_{x\in B_j} w_x$. If the sample is representative of the population of interest, then the OLS estimate of $\beta$ is consistent for the ATE. Otherwise, one has to use GLS with weights $w_j \cdot n/n_j$. Further
13 While we generally recommend doing this non-parametrically, in practice this will depend on factors like
sample size and the nature or dimension of X .
14 In the above-described procedure, treatment is balanced inside each stratum or block, but we did not say
that sampling had to be done along strata, so it might easily be that wx = 1.
15 You may want to define one block for each potential value x that can be taken by X or to define larger
blocks that entail a range of X .
inference should automatically correct the standard error for the degrees of freedom; it always remains a question whether to use block-robust standard errors or to assume homoskedasticity.
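A sketch of how (1.17) might be estimated with statsmodels: OLS with block dummies, and WLS with weights $w_j \cdot n/n_j$ when the sample block shares differ from (here hypothetical) population weights $w_j$; the data are simulated.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical blocked experiment: J blocks, half of each block treated, model (1.17).
rng = np.random.default_rng(3)
J, nj = 20, 10
block = np.repeat(np.arange(J), nj)
d = np.concatenate([rng.permutation(np.repeat([0, 1], nj // 2)) for _ in range(J)])
y = 0.5 * d + 0.3 * block / J + rng.normal(scale=0.5, size=J * nj)
df = pd.DataFrame({"y": y, "d": d, "block": block})

# OLS with block fixed effects; the block dummies use up degrees of freedom.
ols = smf.ols("y ~ d + C(block)", data=df).fit()

# If block sample shares n_j/n differ from population weights w_j, reweight
# each observation by w_j * n / n_j (hypothetical unequal population weights).
w_j = np.linspace(1.0, 2.0, J)
w_j = w_j / w_j.sum()
df["w"] = w_j[df["block"]] * len(df) / nj
wls = smf.wls("y ~ d + C(block)", data=df, weights=df["w"]).fit()
print(round(ols.params["d"], 3), round(wls.params["d"], 3))
```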
Obviously, exact stratification is not tractable for continuous variables such as income
or wealth. There, only stratification on coarsely defined intervals of those variables is
possible (e.g. low, medium and high income). This is defining blocks or strata compris-
ing intervals of the support X . If X is multidimensional, containing some continuous
variables, this procedure gets unwieldy. Then an alternative ‘randomisation’ approach
which permits near balance to be achieved on many variables – in contrast to exact bal-
ances on very few variables – is more appropriate. A popular approach is the so-called
matched pairs.
Matched Pairs
If not only gender but also other covariates are known beforehand, one should include
these in the randomisation protocol. The more covariates X are observed and included
in the blocking, the smaller the variance of the estimated treatment effect will be. One
would thus like to block for many covariates and then assign treatment randomly within
each stratum or block. When X contains more than one or two covariates, more complex
randomisation routines are available. The basic idea of many of these approaches is the
use of matched pairs. Suppose the treatment is binary, and a number of pre-treatment
covariates X are observed. One proceeds to match pairs of individuals such that the two
individuals within each pair have very similar X variables. One individual of each pair is
randomly chosen and assigned to treatment. If one has three treatment arms, one would
construct triplets instead of pairs.
The more difficult part is the construction of these pairs. Suppose there are 2n individ-
uals, and define the distance between individual i and j with respect to their covariates
by the Mahalanobis distance16
$$(X_i - X_j)^{\top}\, \Sigma^{-1}\, (X_i - X_j), \qquad (1.18)$$
where $\Sigma$ is the covariance matrix of X, which might be estimated from the sample. One
seeks to construct pairs such that the sum of the within-pair distance over all pairs is
minimised. This gives the optimal matching of 2n subjects into n pairs of two subjects.
The problem is that the sequencing in which pairs are matched matters, as examined e.g.
in Greevy, Lu, Silver and Rosenbaum (2004). A naive ‘greedy’ algorithm would first
pair the two individuals with the smallest distance, thereafter pair the two individuals with the smallest remaining distance, etc. Such greedy algorithms, however, usually do not
produce optimal matches.
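A minimal sketch of the distance in (1.18), with $\Sigma$ estimated by the sample covariance matrix of hypothetical covariates (library routines such as scipy's cdist with the 'mahalanobis' metric typically return the square root of this quadratic form):

```python
import numpy as np

def mahalanobis_matrix(X):
    """Pairwise squared Mahalanobis distances (1.18), with the covariance
    matrix of X estimated from the sample itself."""
    Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X[:, None, :] - X[None, :, :]           # all pairwise differences
    return np.einsum("ijk,kl,ijl->ij", diff, Sigma_inv, diff)

# Toy covariates for 6 individuals (e.g. age and log income).
X = np.array([[24, 9.1], [35, 9.8], [40, 10.2], [41, 9.9], [45, 10.6], [56, 10.4]])
print(np.round(mahalanobis_matrix(X), 2))
```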
16 This is a natural extension of the Euclidean distance, the latter being probably the most intuitive number
people can imagine and understand to describe distances in a multidimensional space. In an Euclidean
space, however, people subliminally presume orthonormality (90◦ angles and same scales) for the axes.
As this is typically not the case when looking at social economic indicators subsumed in X , the
Mahalanobis transformation will first put them in such shape before calculating the Euclidean distance.
Example 1.20 Consider a simple numerical example, with one particular variable
X (say ‘age’) as the only covariate. Suppose we have eight individuals with ages:
{24, 35, 39, 40, 40, 41, 45, 56}. The greedy algorithm would choose 40 : 40 as the first
pair, followed by 39 : 41, etc. The sum of all within-pair differences is 0+2+10+32 =
44. In contrast, if we were to match adjacent values, i.e. 24 : 35, 39 : 40, 40 : 41, 45 : 56,
the sum of the differences is 11 + 1 + 1 + 11 = 24, which is also the optimal pairing.
Finding the optimal pairing with multivariate matching is far more complex. A scalar distance measure such as (1.18) is therefore needed to reduce the multivariate comparison to a one-dimensional problem.
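The numbers in Example 1.20 can be reproduced with a few lines of code: a greedy pairing versus a brute-force search over all pairings (the latter is only feasible for very small samples; in practice one would use a dedicated optimal non-bipartite matching routine):

```python
from itertools import combinations

ages = [24, 35, 39, 40, 40, 41, 45, 56]
dist = lambda i, j: abs(ages[i] - ages[j])

def greedy_pairs(idx):
    """Repeatedly pair the two closest remaining individuals."""
    idx, pairs = list(idx), []
    while idx:
        i, j = min(combinations(idx, 2), key=lambda p: dist(*p))
        pairs.append((i, j)); idx.remove(i); idx.remove(j)
    return pairs

def best_pairs(idx):
    """Brute-force optimal pairing: minimal total within-pair distance."""
    if not idx:
        return 0, []
    i, rest = idx[0], idx[1:]
    best = None
    for j in rest:
        cost, pairs = best_pairs([k for k in rest if k != j])
        cost += dist(i, j)
        if best is None or cost < best[0]:
            best = (cost, [(i, j)] + pairs)
    return best

g = greedy_pairs(range(len(ages)))
print(sum(dist(*p) for p in g))                  # 44 for the greedy pairing
print(best_pairs(list(range(len(ages))))[0])     # 24 for the optimal pairing
```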
The Mahalanobis distance is probably the most common distance metric used, but
other distance metrics could be used as well. Instead of applying the Mahalanobis dis-
tance to the covariates themselves, one could alternatively apply them to their ranks to
limit the impact of a few extreme observations. The Mahalanobis distance has the advan-
tage of requiring only the covariance matrix of X without requiring any knowledge or
conjectures as to how these X are related to interesting outcome variables Y . This may
be appropriate when multiple and rather diverse outcome variables Y are measured later
in the trial. On the other hand, if one is mostly interested in one specific outcome mea-
sure, e.g. income or consumption, and has some prior subjective knowledge about the
relevance of the X covariates as predictors for Y , one may want to give larger weights
in the distance metric to those covariates that are more important.17
For inference and hypothesis tests about the estimated treatment effects one should
take the method of randomisation into account, i.e. the degrees of freedom. If one does
not, the standard errors are underestimated. Again, the simplest solution is to include
stratum dummies or pair dummies in the regression model (1.17). Hence, if Mahalanobis
matching was used to construct pairs, a dummy for each pair should be included in the
linear regression. Clearly, these pair dummies replace the block dummies in (1.17). In
other words, for making an inference, one could use what we learnt in the paragraph on
blocking and stratification.
An alternative approach, which might either be interpreted as blocking or as matching
pairs, is the following. In order to avoid introducing more notation, we redefine now the
J blocks to be the different matched pairs or blocks with $n_{1j}$ treated and $n_{0j}$ untreated individuals for $j = 1,\dots,J$. Then, an obvious direct estimator for the ATE is
$$\hat\alpha_d = \sum_{j=1}^{J} w_j \left( \frac{1}{n_{1j}}\sum_{i=1}^{n_{1j}} Y_{ij}^{1} \;-\; \frac{1}{n_{0j}}\sum_{i=1}^{n_{0j}} Y_{ij}^{0} \right), \qquad (1.19)$$
17 If one considers, for example, gender to be a very important variable, then one could require exact
matching on gender, by modifying the distance metric such that it takes the value infinity between any
two individuals of opposite gender. Similarly, if one wants to ensure that matched individuals differ at
most by four years in age, one could simply define the distance to be infinity between individuals who
differ in age by more than four years.
There exist several proposals for a variance estimator of $\hat\alpha_d$; a most intuitive one, which is consistent under weak conditions (see Imai, King and Nall, 2009, for details), is
$$\frac{J}{J-1}\sum_{j=1}^{J}\left( w_j\left[\frac{1}{n_{1j}}\sum_{i=1}^{n_{1j}} Y_{ij}^{1} - \frac{1}{n_{0j}}\sum_{i=1}^{n_{0j}} Y_{ij}^{0}\right] - \frac{\hat\alpha_d}{J}\right)^{2}. \qquad (1.20)$$
It is clear that, due to the weighting, S(X) is again zero if the weights are exact, and zero in expectation with a variance going to zero if the weights are estimated, or simply if we are provided with a random (i.e. representative) sample. The same is true for S(U). The random treatment assignment within the blocks ensures that T(X) = 0 and T(U) = 0 hold asymptotically. Then $\hat\alpha_d$ is asymptotically unbiased.
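A sketch of the estimator (1.19) and the variance estimator (1.20), assuming the data are already organised into J matched pairs or blocks with weights $w_j$ summing to one (toy data below):

```python
import numpy as np

def pair_estimator(y1_groups, y0_groups, w):
    """ATE estimator (1.19) over J matched pairs/blocks and the variance (1.20).
    y1_groups[j] / y0_groups[j]: treated / control outcomes in pair or block j;
    w[j]: population weight of block j (summing to one)."""
    diffs = np.array([np.mean(y1) - np.mean(y0)
                      for y1, y0 in zip(y1_groups, y0_groups)])
    J = len(w)
    alpha = np.sum(w * diffs)                                   # (1.19)
    var = J / (J - 1) * np.sum((w * diffs - alpha / J) ** 2)    # (1.20)
    return alpha, np.sqrt(var)

# Toy example: 6 matched pairs (one treated, one control each), equal weights.
rng = np.random.default_rng(4)
y1 = [np.array([1.0 + rng.normal()]) for _ in range(6)]
y0 = [np.array([rng.normal()]) for _ in range(6)]
print(pair_estimator(y1, y0, np.full(6, 1 / 6)))
```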
18 Various diagnostics for assessing overall balance are discussed in the section on propensity score
matching later in this book.
simulations they suggest that one may want to include rather more than fewer covariates
in the stratification/matching, as long as one thinks that they may add additional power
in explaining the future outcome. But the theoretical guidance is not unambiguous,
because, while adding more covariates is likely to increase the explanatory power in
the sample, adding more strata dummies to the regression decreases the d.o.f.
Note that one should not conduct a test of equality of X between the two groups,
but rather examine the standardised differences in X . The equality-in-means test is a
function of the sample size and for a sufficiently low sample size would (almost) always
indicate that there are no significant imbalances in X . The concern with pair matching is
to reduce relative differences in X and not absolute differences due to the sample size.19
The following criteria are often suggested instead.20 Take the propensity score function $\Pr(D=1\mid X=x)$, which usually has first to be estimated (a minimal sketch of these diagnostics follows the list):
(a) The standardised difference in the mean propensity scores between the two groups
should be close to zero.
(b) The ratio of the variance of the propensity score between the two groups should be
close to one.
(c) The standardised difference in X should be close to zero.
(d) The ratio of the variance in X between the two groups should be close to one.
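A minimal sketch of diagnostics (a)-(d), with the propensity score estimated by a (here purely illustrative) logistic regression using scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def std_diff(a, b):
    """Standardised difference in means between two groups."""
    return (a.mean(0) - b.mean(0)) / np.sqrt((a.var(0, ddof=1) + b.var(0, ddof=1)) / 2)

def balance_report(X, d):
    # (c)/(d): standardised differences and variance ratios of the covariates
    sd_x = std_diff(X[d == 1], X[d == 0])
    vr_x = X[d == 1].var(0, ddof=1) / X[d == 0].var(0, ddof=1)
    # (a)/(b): the same diagnostics for an estimated propensity score
    ps = LogisticRegression(max_iter=1000).fit(X, d).predict_proba(X)[:, 1]
    sd_p = std_diff(ps[d == 1].reshape(-1, 1), ps[d == 0].reshape(-1, 1))[0]
    vr_p = ps[d == 1].var(ddof=1) / ps[d == 0].var(ddof=1)
    return sd_x, vr_x, sd_p, vr_p

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 3))
d = rng.permutation(np.repeat([0, 1], 200))      # randomised treatment
print(balance_report(X, d))
```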
Otherwise, i.e. if these criteria are not met and a parametric propensity score (estimate) is used, one respecifies the model and repeats the checks. Note that at this stage we have not yet looked at the outcome data Y. These various diagnostics thus do not depend on the outcome data.
Consequently, the pre-specification cannot be influenced by the true treatment effects.
Ideally, all the planned analyses should already be specified before any outcome data
is examined in order to avoid the temptation of data mining during the evaluation phase.
In practice, however, missing data and partial or non-compliance (e.g. dropout) may
nevertheless still require substantial econometric modelling.
Next, organising and conducting an experimental trial can be expensive and may
receive a lot of resistance. Heckman and Smith (1995) discuss a variety of resulting
problems along the experiment with random assignment to the JTPA training pro-
gramme in the USA. They also discuss many other sources that may invalidate the
experimental evaluation results. If participation in this programme is voluntary, ran-
domisation can only be implemented with respect to the individuals who applied for
the programme, which are then randomised in or randomised out. However, these appli-
cants are maybe different from the population of interest. If randomisation covers only
parts of the population, the experimental results may not be generalisable to the broader
population. In other words, although internal validity is often plausible, external validity
may be limited if the selected units are not representative of the population at large. We
may speak then of a sample bias.
19 Earlier work by Rosenbaum and Rubin had put emphasis on significance testing. Significance testing,
however, confuses successful balance with low power. What is relevant for pair matching is the size of the
imbalance and not the size of the confidence interval.
20 See, for example, Lechner (1999), Imai, King and Stuart (2008), Rubin (2001) and Imbens and Rubin
(2015).
Even if a policy is mandatory such that all individuals can be randomly assigned to
the treatments, full compliance is often difficult to achieve if participants must exercise
some effort during the participation and may refuse their cooperation.
One speaks of a randomisation bias if the prospect of randomised allocation alters
the pool of potential participants because individuals may be reluctant to apply at all or
reduce (or increase) any preparatory activities such as complementary training due to
the fear of being randomised out (threat of service denial).
A substitution bias occurs if members of the control group (the randomised-out non-
participants) obtain some treatment or participate in similar programmes, e.g. identical
or similar training obtained from private providers. In this case, the experimental eval-
uation measures only the incremental value of the policy relative to the programmes
available otherwise.
A so-called drop-out bias occurs if individuals assigned to a particular programme
do not participate, or participate only partly. This bias, like the substitution bias, is the result of non-compliance.
As randomised experiments can be expensive and face political obstacles, one often
proposes to first perform pilot studies before implementing the actual study. But the
pilot-study character of an experiment may change the behaviour of the participants,
who may put in additional effort to show that the pilot study works (or does not). This
is called the Hawthorne effect.
If randomisation proceeds not on the individual but a higher level, endogenous
sample selection problems may occur. For example, if programme schools receive addi-
tional resources, this might attract more parents to send their children to these schools,
withdrawing their children from the control schools. Consequently, the resulting
allocation is not representative anymore.
Example 1.21 A small number of schools in Kenya received additional inputs such as
uniforms and textbooks. This reduced the drop-out rate in the treatment schools. In
addition, several students from nearby control schools were transferred to the treatment
schools. These two aspects led to a substantial increase in class size in the treatment
schools. A large increase in class size leads to downwardly biased treatment effects.
The treatment being estimated thus corresponded to a provision of additional school
inputs combined with an increase in class size. This had to be taken into account in the
cost–benefit calculation, since the increase in class size may be associated, for example,
with a cost saving, since teacher salaries usually represent the most expensive input into
education.
Example 1.22 During the Vietnam war, young American men were drafted to the army
on the basis of their month and day of birth, where a certain number of birth dates had
been randomly determined to be draft eligible: see Angrist (1998). Hence, the indicator
whether being born on a draft-eligible day or not satisfies the above requirements and
would deliver the ITT effect. But the main research interest is in the effect of partic-
ipating in the army on later outcomes. As we will see later, the lottery of birth dates
can function as an instrument. The pure participation, however, is no longer random as
people could voluntarily enrol or avoid their enrolment in various ways.
Obviously, the potential treatment itself can lead to differential attrition or non-
response in the treatment and/or the comparison group. Take our examples about
performance in school: if one obtains outcome data only for those children who are
in school on the day a test is administered, the data will be affected by selection bias.
One should try to avoid differential non-response or attrition by tracing all students. This
may not always be feasible so that non-response (or attrition on collecting longer-term
outcomes) may still be high. For such cases, methods to deal with this selection bias21
are needed.
Often experimental evaluations (randomised controlled trials) are considered as
unethical or unfair since some individuals are denied access to the treatment. Yet, if
public budgets or administrative capacity are insufficient to cover the entire country
at once, it appears fair to choose the participants in the pilot programmes at random.
But publicly provided or mandated programmes may partly overcome this problem as
follows.
A randomised phasing-in will only temporarily deny participation in the programme.
In some situations it might even be possible to let all units participate but treat only
different subsamples within each unit. Consider, for example, the provision of additional
schoolbooks. In some schools, additional books could be provided to the third grade
only, and in some other schools to the fifth grade only. Hence, all schools participate
to the same degree in the programme (which thus avoids feelings of being deprived of
resources relative to others), but the fifth graders from the first half of schools can be
used as a control group for the second half of schools and vice versa for the third graders.
Marginal randomisation is sometimes used when the number of available places in
a programme or a school is limited, such that those admitted are randomly drawn from
the applicants. Consider the application of this method to a particular public school or
university, which might (be forced to) choose randomly from the applicants if oversub-
scribed. In such a situation, those randomised out and randomised in should not differ
from each other in their distributions of observable and unobservable characteristics.
However, such marginal groups may represent only a very tiny fraction of the entire population of interest, and the estimated effects may not generalise to the population at large.
21 If one can assume, for example, that it is the weaker students who remain in school when treated but
would have dropped out otherwise, the experimental estimates are downward biased.
Hence, randomised assignment can be very helpful for credible evaluation. But not
all questions can be answered by experiments (e.g. the effects of constitutions or institu-
tions) and experimental data are often not available. Experimental data alone may also
not allow the entire function ϕ(d, x, u) to be determined, for which additional assump-
tions will be required. Even if a proper experiment is conducted, it might still occur by
chance that the treatment and control groups differ substantially in their characteristics,
in particular if the sample sizes are small. Although the differences in sample means
provide unbiased estimates of average treatment effects, adjusting for the differences in
the covariates, as discussed below, can reduce the variance of the estimates; see Rubin
(1974).
In practice, randomised experiments hardly ever turn out to be perfect. For example,
in the STAR experiment, children who skipped a grade or who repeated a class left
the experiment. Also, some pupils entered the school during the trial. Some kind of
reassignment happened during the trial, etc. This implies that one needs to know all
those details when evaluating the trial, and estimating treatment effects. One should not
only know the experimental protocol but also the (smaller and larger) problems that
happened during the experimental phase.
Other problems may appear when collecting follow-up data. E.g. an educational inter-
vention may have taken place in kindergarten and we would like to estimate its effects
several years later. Attrition and non-response in follow-up surveys may lead to selected
samples; e.g. it is harder to trace and survey individuals who have moved. (In many
health interventions, mortality may also be an important reason behind attrition.) Non-
experimental methods are needed to deal with this. Nevertheless, it is helpful to keep the
ideal setup of a randomised trial in mind when designing or choosing a non-experimental
method, since some non-experimental designs are in a sense superior to others. As a
rule of thumb: collecting pre-treatment data and collecting data from similar but non-
treated control observations, e.g. from the same family (twins, siblings), neighbourhood
or local labour market is often helpful. In addition, the same survey designs and def-
initions of the outcome variable should be used for both control and treated, and one
should obtain detailed information about the selection process.
As we have seen in the previous subsection, experiments can be very helpful for credible
identification of the average treatment effect. If possible, one should nearly always strive
to incorporate some randomised element in an intervention. In many situations, how-
ever, we have only access to observational (= non-experimental) data. In addition, even
with a perfectly designed experiment, problems such as non-compliance, non-response
and attrition often occur in practice, calling for more complex econometric modelling.
The source of problems that can arise then for identification and estimation is typically
the heterogeneity of individuals, first in their endowments and interests, second in the
(resulting) returns. For part of the heterogeneity we can control or at least account for,
e.g. via the observed endowments X . We have seen this already when doing blocking
or matching. Much more involved is the handling of heterogeneity due to the unob-
served part, represented by U in our model. We learnt from the last subsection that
randomisation can avoid biased inference. But what happens if we cannot randomly
assign treatments? Or, what if heterogeneity is of the first order? Evidently, in the latter
case it is much more insightful to study treatment effects conditioned on X or, if it is
heterogeneity due to U that dominates, the distributions or quantiles of Y d .
Consequently, the literature on non-experimental estimators covers a wide array of
different parameters (or functions) you might be interested in, and some of these are
discussed in the following chapters. Different strategies to estimate them from non-
experimental data will be examined there.
22 For the sake of notation we have set the error U to be equal to the outcome produced by the
unobservables, called ξ(U ) before. This does not entail a simplification of the model but just of the
notation as ξ(·) is not identified anyway due to the unobservability of its argument.
make these models more realistic and delineate more clearly the nature of the identifying
assumptions to be used. On the other hand, it also makes identification more difficult.
Heterogeneity in the responses might itself be of policy interest, and it might there-
fore often be interesting to try to identify the entire function ϕ(d, x, u). In the familiar
linear model Y = α + Dβ + X γ + U a common treatment effect β is assumed. It
prohibits not only effect heterogeneity conditional on X but also effect heterogeneity
in general. This is certainly in line with the practitioners’ wish to obtain a param-
eter that does not depend on U , since U is unobserved and its effect is usually not
identified. The average treatment effect is a parameter where the unobserved variables
have been averaged out. For the observed X , however, we may want to study the con-
ditional ATE or the conditional ATET for a given set of observed characteristics x,
namely
$$ATE(x) = \int \big(\varphi(1,x,U) - \varphi(0,x,U)\big)\, dF_U,$$
$$ATET(x) = \int \big(\varphi(1,x,U) - \varphi(0,x,U)\big)\, dF_{U\mid D=1}.$$
These could also be interpreted as partial treatment effects, and ATE and ATET are just
their averages (or integrals).
Sometimes, in the econometric literature, expected potential outcome (for par-
tial and total effects) is also referred to as the average structural function (ASF);
see Blundell and Powell (2003). More specifically, there we are interested in par-
tial effects where we also fix some other (treatment) variable X at some value x, namely
$$ASF(d,x) = E[Y_x^d] = \int \varphi(d,x,U)\, dF_U.$$
If U and X are uncorrelated, which is often assumed, $E[Y_x^d]$ and $E[Y^d\mid X=x]$ are
identical, but otherwise they are not. Both have their justification and interpretation, and
one should be careful to not mix them up. Another important point is that these two
functions can be much more insightful if the treatment effect varies a lot with X . If the
outcome Y depends mainly on X, then this information is much more relevant for policy
than the average treatment effect over all U and X .
Having defined the ASF, we could imagine various policy scenarios with different dis-
tributions of d and x. Consider a policy which assigns d and x according to a weighting
function f ∗ (d, x). To obtain the expected outcome of such a policy, one has to calculate
the integral
$$\int\!\!\int ASF(d,x)\cdot f^{*}(d,x)\, dx\, dd.^{23}$$
23 where dd is the differential with respect to continuous d, or else imagine a sum running over the support
of D.
Example 1.23 Let us consider the increasing wage inequality. Juhn, Murphy and Pierce
(1993) analysed individual wage data from 27 years of the US Current Population Surveys. Real wages increased by 20 per cent between 1963 and 1989, but the gains were unequally distributed: wages at the 10th percentile (typically less-skilled workers) fell by 5 per cent, whereas wages at the 90th percentile increased by 40 per cent. When they repeated
these calculations by categories of education and experience, then they observed that
wage inequality also increased within categories, especially during the 80s, and that
between-group wage differences increased substantially. They interpreted these changes
as the result of increased returns to observable and unobservable components of skills
(education, experience and ability), e.g. due to the resulting productivity. This, however,
was just speculation. It is not clear that this increasing wage gap really reflects productivity; it might equally well reflect an increase in the bargaining power of skilled workers, e.g. due to globalisation or weakened trade unions.
The following equations are defined with respect to the two variables D and X (i.e.
the included observables), but we could consider X to be the empty set in order to obtain
total effects. The distributional structural function is the distribution function of ϕ(·) for
given x and d:
$$DSF(d,x;a) \equiv \Pr\big[\varphi(d,x,U)\le a\big] = \int \mathbb{1}\big[\varphi(d,x,u)\le a\big]\, dF_U(u).$$
The quantile structural function (QSF) is the inverse of the DSF. It is the τ th quantile of
the outcome for externally set d and x:
$$QSF(d,x;\tau) = Q^{\tau}\big[\varphi(d,x,U)\big] = Q^{\tau}\big[Y_x^d\big], \qquad (1.21)$$
where the quantile refers to the marginal distribution of U.25 The symbol $Q^{\tau}(A)$ represents the $\tau$th quantile of A, i.e. $Q_A^{\tau} \equiv Q^{\tau}(A) \equiv \inf\{q : F_A(q)\ge\tau\}$. While this is the $\tau$th quantile of Y if D and X are fixed externally for every individual, in practice it is much easier to estimate from the data the following quantile:
$$Q^{\tau}[Y\mid D=d, X=x] = Q^{\tau}\big[\varphi(D,X,U)\mid D=d, X=x\big]$$
24 In the international organisations it has become customary to speak then of an integrated approach.
25 Quantile and distributional effects will be discussed in detail in Chapter 7.
But for the following discussion it is easier to work with (1.21), supposing that U can be condensed to a scalar. It is usually assumed that $\varphi$ is strictly increasing in this unobserved argument u. This greatly simplifies identification and interpretation.26 Then we can write
$$Q^{\tau}\big(\varphi(d,x,U)\big) = \varphi\big(d,x,Q_U^{\tau}\big),$$
where $Q_U^{\tau}$ represents the quantile in the 'fortune' distribution in the population. Hence, $QSF(d,x;0.9)$ is the outcome for different values of d and x for an individual at the 90th percentile of the fortune distribution. On the other hand, the observed quantile is
$$Q^{\tau}[Y\mid D=d, X=x] = \varphi\big(d,x,Q^{\tau}_{U\mid D=d,X=x}\big),$$
where $Q^{\tau}_{U\mid D=d,X=x} = Q^{\tau}[U\mid D=d, X=x]$ is the quantile in the 'fortune' distribution among those who chose d years of schooling and have characteristics x.
Note that since the QSF describes the whole distribution, the ASF can be recovered
from the QSF by noticing that
$$ASF(d,x) = E[Y_x^d] = \int_0^1 QSF(d,x;\tau)\, d\tau.$$
Hence, if the QSF is identified at all quantiles τ , so is the ASF, but not vice versa. As
stated, we will more often be interested in
$$E[Y^d\mid X=x] = \int_0^1 Q^{\tau}\big[\varphi(d,X,U)\mid X=x\big]\, d\tau.$$
So, when in the following chapters you see a lower-case x, it simply refers to a realisation of X, i.e. to $\cdot\mid X=x$, or to an argument you are integrating out. The estimation of
distributional effects will be studied in detail in Chapter 7.
So far we have discussed which types of objects we would like to estimate. The next
step is to examine under which conditions they can be identified. That is: suppose we knew the distribution function $F_{Y,D,X,Z}$ (e.g. through an infinite amount of data); would this be sufficient to identify the above parameters? Without further assumptions, it
is actually not, since the unobserved variables can generate any statistical association
between Y , X and D, even if the true impact of D and/or X on Y is zero. Hence,
data alone are not sufficient to identify treatment effects. Conceptual causal models
are required, which entail identifying assumptions about the process through which the
individuals were assigned to the treatments. The corresponding minimal identifying
assumptions cannot be tested formally with observational data, and their plausibility
must be assessed through prior knowledge of institutional details, the allocation pro-
cess and behavioural theory. As we will discuss in the next chapter, the necessary
assumptions and their implications are by no means trivial in practice.
Barrios (2013) points out that his approach allows for a large number of variables for
balancing while maintaining simple inference techniques since only pair-dummies have
to be used for proper inference. The author shows that his approach is optimal in the
sense that it minimises the variance of the difference in means. Such a randomisa-
tion approach might further be very credible, since researchers have to decide before
the experiment what they define as their ‘outcome of interest’. Barrios (2013) further
points out that he only defines optimality with respect to the mean squared error cri-
terion. Further research might focus on alternative criteria like minimising the mean
absolute value of the error if one is interested in estimating a conditional quantile
function.
The randomisation method is usually not applicable when treatment decisions need
to be made immediately every time a new individual enters the trial. Yet, treatment
assignment algorithms exist that assign treatments sequentially taking into account
the covariate information of the previously assigned individuals, see e.g. Pocock and
Simon (1975). Alternative pair-matching algorithms to those being discussed here can
be found e.g. in King, Gakidou, Ravishankar, Moore, Lakin, Vargas, Tellez-Rojo and
Avila (2007).
After having constructed matched pairs, one can examine the remaining average dif-
ferences in X between the treated and non-treated group. If these differences appear
relatively large, one may start afresh from the beginning with a new randomisation
and see whether, after applying the pair-matching process, one would obtain a smaller
average imbalance. Of course, such re-randomisation is only possible if treatment has
not yet started. If time permits it may be most effective to draw independently a num-
ber of randomisation vectors (e.g. 100 times) and choose the assignment vector which
gives the smallest imbalance in X . Some re-randomisation methods are also examined
in Bruhn and McKenzie (2009). A problem is the correct inference afterwards as our
final observations are a result of conditional drawing and therefore follow a conditional
distribution. For example, if we re-randomise until we obtain a sample where the Maha-
lanobis distance between the means of X of the treated subjects and their controls is smaller than a given threshold ε > 0 in each block, then we should be aware that the
variance of our ATE estimate is also conditioned on this.
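A sketch of such a re-randomisation step: draw a number of candidate assignment vectors and keep the one with the smallest Mahalanobis imbalance between the covariate means of the two groups (simulated covariates; in a blocked design the same check would be carried out block by block):

```python
import numpy as np

def rerandomise(X, n_draws=100, seed=6):
    """Draw several complete-randomisation vectors and keep the one with the
    smallest Mahalanobis distance between the covariate means of the groups."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
    base = np.repeat([0, 1], n // 2)
    best_d, best_dist = None, np.inf
    for _ in range(n_draws):
        d = rng.permutation(base)
        diff = X[d == 1].mean(0) - X[d == 0].mean(0)
        dist = diff @ Sigma_inv @ diff
        if dist < best_dist:
            best_d, best_dist = d, dist
    return best_d, best_dist

X = np.random.default_rng(7).normal(size=(100, 4))
d, imbalance = rerandomise(X)
print(round(imbalance, 4))
```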
For calculating standard errors in randomised trials we presented regression-based
estimators corrected for d.o.f. and potential heteroskedasticity over blocks or strata.
An alternative approach to do inference for estimators can be based on randomisation
inference. This is mostly based on bootstraps and requires somewhat more complex
programming, but has the advantage of providing exact finite sample inference: see Car-
penter, Goldstein and Rasbash (2003), Field and Welsh (2007), Have and Rosenbaum
(2008), or, for a general introduction, Politis, Romano and Wolf (1999).
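One common variant of randomisation inference is Fisher's randomisation (permutation) test of the sharp null of no treatment effect; the sketch below is only meant to convey the idea and does not follow any particular one of the references above.

```python
import numpy as np

def randomisation_test(y, d, reps=10_000, seed=8):
    """P-value for the sharp null of no effect: re-randomise the treatment
    vector and compare the observed difference in means with its
    randomisation distribution."""
    rng = np.random.default_rng(seed)
    obs = y[d == 1].mean() - y[d == 0].mean()
    null = np.empty(reps)
    for r in range(reps):
        dp = rng.permutation(d)
        null[r] = y[dp == 1].mean() - y[dp == 0].mean()
    return np.mean(np.abs(null) >= abs(obs))

rng = np.random.default_rng(9)
d = rng.permutation(np.repeat([0, 1], 50))
y = 0.6 * d + rng.normal(size=100)
print(randomisation_test(y, d))
```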
More recent is the practice to use hypothesis tests to evaluate balance; see, for exam-
ple, Lu, Zanuto, Hornik and Rosenbaum (2001), Imai (2005) or Haviland and Nagin
(2005). However, Imai, King and Stuart (2008) pointed out the fallacy of these methods when matching is mainly based on dropping and duplicating observations to
reach balance. For further reading on matched sampling we refer to Rubin (2006). A
well-known compendium on observational studies in general is Rosenbaum (2002).
1.5 Exercises
$$\frac{2}{n}\sum_{x\in\mathcal{X}} w_x\big\{Var(Y^1\mid X=x) + Var(Y^0\mid X=x)\big\}. \qquad (1.22)$$
Data are just either dependent or independent, and such a relation is perfectly symmetric.
It is therefore often impossible to draw conclusions about causality from a purely explorative data analysis. In fact, in order to conclude on a causal effect, one has to have
an idea about the causal chain. In other words, you need to have a model. Sometimes
it is very helpful to include the time dimension; this leads to the concept of Granger-
causality. But even this concept is based on a model which assumes that the leading
series (the one being ahead in time) is exogenous in the sense of ‘no anticipation’. You
just have to remind yourself that the croaking of frogs does not cause rain, though it
might come first, and is therefore Granger-causal for rain.
In the last chapter, i.e. for randomised experiments, we saw that you actually do not
have to specify all details of the model. It was enough to have the ignorability of D
for $(Y^0, Y^1)$, i.e. $(Y^0, Y^1)$ ⊥⊥ D. This is equivalent to the 'no anticipation' assump-
tion for Granger-causality: whether someone participates or not is not related to the
potential outcome. But we did not only introduce basic definitions, assumptions and
the direct estimators for randomised experiments; we discussed potential problems of
heterogeneity and selection bias, i.e. the violation of the ignorability assumption. And
it has been indicated how controlling for characteristics that drive the selection might
help. We continue in this line, giving a brief introduction to non-parametric identifica-
tion via controlling for covariates, mainly the so-called confounders (or confounding
variables). We call those variables X confounders that have an impact on the difference
in the potential outcomes Y d and – often therefore – also on the selection process, i.e.
on the decision to participate (D = 1). In addition, we discuss some general rules on
which variables you want to control for and for which ones you do not. We do this
along causal graphs, as they offer quite an illustrative approach to the understanding of
non-parametric identification.
The set of control variables used in the classic linear and generalised linear regression
analysis often includes variables for mainly two purposes: to control for confounders
to eliminate selection bias and/or to control for (filter out) certain covariates in order
to obtain the partial effect of D on Y instead of the total effect. In fact, in the clas-
sic econometric literature one often does not distinguish between them but includes
the consequence of their exclusion under the notion of omitted variable bias. We will
see, however, that for the identification and estimation of treatment effects, it is typi-
cally not appropriate to include all available information (all potential control variables
X ), even if they exhibit some correlation with Y and/or D. Actually, the inclusion of
all those variables does not automatically allow for the identification of partial effects.
Unfortunately, in most cases one can only argue, but not prove, which conditioning is necessary to obtain total or partial effects, respectively.
The first step is to form a clear idea about the causal chain you are willing to believe,
and to think of potential disturbances. This guides us to the econometric model to be
analysed. The second step is the estimation. Even though today, in economics and econo-
metrics, most of the effort is put on the identification, i.e. the first step, there is actually
no reason why a bad estimate of a neatly identified parameter should contain more (or
more helpful) information than a good estimate of an imperfectly (i.e. ‘up to a small
bias’) identified parameter. Even if this ‘bad’ estimator is consistent, this does not nec-
essarily help much in practice. Recall that in empirical research, good estimators are
those that minimise the mean squared error (MSE), i.e. the expected squared distance to
the parameter of interest, for the given sample. Unbiasedness is typically emphasised a
lot but is actually a poor criterion; even consistency is only an asymptotic property that
tells us what happens if n ≈ ∞. Therefore, as we put a lot of effort into the identification,
it would be a pity if it was all in vain because of the use of a bad estimator.
In sum, the first part of this chapter is dedicated to the identification strategies (form-
ing an idea of the causal chain), and the second part to estimation. The former will
mainly happen via conditioning strategies on either confounders or instruments. As this
does not, however, tell us much about the functional forms of the resulting models, the
second part of the chapter is dedicated to estimation without knowledge of the func-
tional forms of dependencies or distributions. This is commonly known as non- and
semi-parametric estimation.
assigned. On the other hand, if the entire information set on which the selection process
or assignment mechanism D is based were observed, then the CIA would hold.
The causal assumptions are to be distinguished from statistical associations. While
causal statements can be asymmetric, stochastic associations are typically symmetric: if
D is statistically dependent on X , then X is also statistically dependent on D. Exactly
the same can be said about independence. This can easily lead to some confusion.
Example 2.1 As an example of such confusion in the literature, take the situation in
which some variables X are supposed to be exogenous for potential outcomes, in the sense that D does not cause X. When formalising this, the distribution of X is sometimes assumed to be independent of D given the potential outcomes $(Y^0, Y^1)$, i.e. $F(X\mid Y^0,Y^1,D) = F(X\mid Y^0,Y^1)$. However, X ⊥⊥ D | $(Y^0, Y^1)$ is the same as D ⊥⊥ X | $(Y^0, Y^1)$, and does not entail any structural assumption on whether X causes D or D causes X. However, the idea in these papers is to use X as a confounder. But then it is quite questionable whether you want to assume that D is (conditionally on $Y^0, Y^1$) independent of X. What the authors intend to say is that D has no (causal) impact on X when conditioning on the potential outcomes $(Y^0, Y^1)$. But the use of $F(X\mid Y^0,Y^1,D) = F(X\mid Y^0,Y^1)$ in subsequent steps or proofs renders the identification strategy of little help when the core idea is the inclusion of X as a confounder.
[Two causal graphs (cf. Figures 2.1 and 2.2): one over the nodes V1, V2, X, D, U and outcome Y, and one over V1, V2, D, U and Y.]
other. Plainly, the missing and the directed arcs encode our a priori assumptions used
for identifying the (total) impact of D on Y .
Consequently, a causal structure is richer than the notation of (in)dependence, because
X can causally affect Y without X being causally affected by Y . Later on we will be
interested in estimating the causal effect of D on Y , i.e. which outcomes would be
observed if we were to set D externally. In the non-experimental world, D is also deter-
mined by its antecedents in the causal model, here by V1 and X , and thus indirectly by
the exogenous variables V1 and V2 . When we consider an external intervention that sets
D to a specific value d to identify the distribution of Y d , then this essentially implies
that the graph is stripped of all arrows pointing to D.
The graph in Figure 2.2 incorporates only triangular structure and causal chains.
Such a triangular structure is often not sufficient to describe the real world. Like all
models it is a simplification. For example, in a market with Q (quantity) and P (price),
both variables will have a direct impact on each other, as indicated in the graph of
Figure 2.3. This can be handled by simultaneous equations with the (possible) inclusion
of further variables. However, for the ease of presentation we will concentrate in this
chapter on graphs that do not entail such feedback or (direct) reverse causality.
A graph where all edges are directed (i.e. a graph without bi-directed dashed arcs)
and which contains no cycles is called a directed acyclic graph (DAG). Although the
requirement of acyclicity rules out many interesting cases, several results for DAG are
useful to form our intuition. Note that bi-directed dashed arcs can usually be eliminated
by introducing additional unobserved variables in the graph, e.g. in order to obtain a
DAG. For example, the left graph in Figure 2.4 can be expressed equivalently by the
right graph. In fact, in a DAG you are obliged to specify (or ‘model’) all relations. This
is not always necessary but can make things much easier.3
Before coming to the identification of causal relationships, we first discuss explic-
itly some basic findings on conditional independence to better understand the con-
ditional independence assumption (CIA). We start with some helpful definitions,
speaking henceforth of a path between two variables when referring to a sequence of consecutive arcs connecting them, irrespective of the direction of the arrows.
Figure 2.3 The mutual effects of quantity and price cannot be presented by triangular structures or
causality chains
Example 2.2 Let us consider the admission to a certain graduate school which is based
on either good grades or high talent in sport. Then we will find a negative correlation
between these two characteristics in the school even if these are independent in the
entire population. To illustrate this, suppose that both grades and sports were binary
variables and independent in the population. There are thus four groups: strong in sports
and strong in academic grades, weak in sports and strong in grades, strong in sports and
weak in grades, and those being weak in both fields. The first three groups are admitted
to the university, which thus implies a negative correlation in the student population.
Conditioning on m could also happen inadvertently through the data collection pro-
cess. In fact, if we obtain our data set from the school register, then we have implicitly
4 For formal details, see definition 1.2.3 and theorem 1.2.4 of Pearl (2000).
conditioned on the event that all observations in the data set have been admitted to the
school.
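The phenomenon in Example 2.2 is easy to reproduce by simulation; the admission rule and distributions below are of course hypothetical.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 100_000
grades = rng.normal(size=n)
sports = rng.normal(size=n)                      # independent of grades
admitted = (grades > 1) | (sports > 1)           # admission on either criterion

print(round(np.corrcoef(grades, sports)[0, 1], 3))                      # ~ 0 in the population
print(round(np.corrcoef(grades[admitted], sports[admitted])[0, 1], 3))  # negative among the admitted
```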
effect. If the former is true, the latter is most likely to be true. But in certain situations
it could happen that, despite $Y^d$ not being independent of D (given X), we still observe Y ⊥⊥ D | X. This would be the (quite unlikely) case when a non-zero treatment effect and a non-zero selection bias cancel each other. In sum, in general one has
$$Y ⊥⊥ D \mid X \;\;⇎\;\; Y^d ⊥⊥ D \mid X.$$
Example 2.3 In his analysis of the effects of voluntary participation in the military on
civilian earnings, Angrist (1998) takes advantage of the fact that the military is known
to screen applicants to the armed forces on the basis of particular characteristics, say X ,
primarily on the basis of age, schooling and test scores. Hence, these characteristics are
the principal factors guiding the acceptance decision, and he assumes that among appli-
cants with the same observed characteristics, those who finally enter the military and
those who do not are not systematically different with respect to some outcome variable
Y measured later in life.6 A similar reasoning applies to the effects of schooling if it is
known that applicants to a school are screened on the basis of certain characteristics, but
that conditional on these characteristics, selection is on a first come, first served basis.
Theorem 2.3 provides us the tools we need for identifying causal impacts of treatment
D on outcome Y . If, for example, due to a conditioning on X or X d , independence of
D from the potential outcomes Y d is achieved, then the causal impact of D on Y is
identifiable. More specifically, one obtains the causal effect of D on Y (i.e. setting D
externally from 0 to 1) by
$$E[Y^1 - Y^0] = \int E[Y^1 - Y^0 \mid X]\, dF_X \qquad (2.7)$$
$$= \int E[Y^1 \mid X, D=1]\, dF_X - \int E[Y^0 \mid X, D=0]\, dF_X$$
$$= \int E[Y \mid X, D=1]\, dF_X - \int E[Y \mid X, D=0]\, dF_X.$$
That is, one first calculates the expected outcome conditional on D = d and X , to
afterwards integrate out X . In practice, the expectations in the last line of (2.7) can
be estimated from the sample of the treated (d = 1) and the non-treated (d = 0),
respectively, to afterwards average over these with respect to the distribution of X (but
careful: over the entire population and not just to the respective conditional distributions
of X |D = d, d = 0, 1). This method will be discussed in detail in the next chapter.
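A minimal sketch of the last line of (2.7) on simulated data in which selection is driven by the observed X only: estimate E[Y | X, D = d] by a separate regression in each treatment arm and average the fitted contrast over the full-sample distribution of X.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 5_000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-x))                     # selection driven by the observed confounder X
d = rng.binomial(1, p)
y = 1.0 * d + 2.0 * x + rng.normal(size=n)   # true ATE = 1

def fit_line(xs, ys):                        # simple OLS of y on (1, x)
    return np.polyfit(xs, ys, 1)

b1, b0 = fit_line(x[d == 1], y[d == 1]), fit_line(x[d == 0], y[d == 0])
m1 = np.polyval(b1, x)                       # estimated E[Y | X, D = 1] over the full sample
m0 = np.polyval(b0, x)                       # estimated E[Y | X, D = 0] over the full sample
print(round(y[d == 1].mean() - y[d == 0].mean(), 2))   # naive contrast, biased
print(round((m1 - m0).mean(), 2))                      # regression adjustment, ~ 1
```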
Figure 2.2 was a simple though typical situation of identifiability. Let us turn to an
example where we cannot identify the effect of D on Y . In Figure 2.5, the original graph
and the subgraph needed to apply Theorem 2.3 are given. Not conditioning at all leaves
the path $D \longleftrightarrow X_2 \longrightarrow Y$ unblocked. But conditioning on $X_2$ unblocks the path $D \longleftrightarrow X_2 \longleftrightarrow X_1 \longrightarrow Y$. Conditioning on $X_1$ (or on $X_1$ and $X_2$) would block a part of the causal effect of D on Y since $X_1$ is a descendant of D, i.e. here we do not have $X_1^d = X_1$.
Figure 2.6 A model where we must not condition on X to identify the impact of D on Y
With this basic intuition developed, we can already imagine which variables need to
be included (or not) in order to identify a causal effect of D on Y . The easiest way
of thinking about this is to suppose that the true effect is zero, and ascertain whether
the impacts of the unobserved variables could generate a dependence between D and
Y . Before you continue, try to solve Exercise 3 at the end of this chapter. Then let us
conclude this consideration with an example.
Example 2.4 Take a Bernoulli variable D (treatment ‘yes’ or ‘no’) with p = 0.5. Let
the outcome be Y = D + U and further X = Y + V . Suppose now that (U, V ) are
independent and jointly standard normal, and both independent from D, which implies
that the support of Y and X is the entire real line. We thus obtain that E[Y |D = 1] −
E[Y |D = 0] = 1. However, if we condition on X = 1 then it can be shown (see
Exercise 4) that $E[Y\mid X=1, D=1] - E[Y\mid X=1, D=0] = 1/2$. The same holds for other values of X, showing that the estimated impact of D on Y is (in absolute value) downward biased when conditioning on X.
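A quick simulation check of Example 2.4 (the value 1/2 follows from E[U | U + V = s] = s/2): the unconditional contrast is about 1, while a linear regression that also conditions on X pulls the D coefficient towards 1/2.

```python
import numpy as np

rng = np.random.default_rng(12)
n = 200_000
d = rng.binomial(1, 0.5, size=n)
u, v = rng.normal(size=n), rng.normal(size=n)
y = d + u
x = y + v

print(round(y[d == 1].mean() - y[d == 0].mean(), 2))   # ~ 1.0

# Linear regression of Y on (1, D, X): the D coefficient drops to ~ 0.5
Z = np.column_stack([np.ones(n), d, x])
beta = np.linalg.lstsq(Z, y, rcond=None)[0]
print(round(beta[1], 2))
```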
In Example 2.4 we have seen that conditioning on third variables is not always appro-
priate, even if they are highly correlated with D. This becomes also evident in Figure 2.6,
where X is neither causally affected by, nor affecting D or Y . Yet, it can still be highly
correlated with both variables. The effect of D on Y is well identified if not conditioning
on X . Conditioning on X would unblock the path via V and U , and thus confound the
effect of D.
According to Theorem 2.3 and the proceeding discussion, the effect of D on Y can
often be identified by adjusting for a set of variables X , such that X does not contain
any descendant of D, and that X blocks every path between D and Y which contains
an arrow pointing to D. Pearl (2000) denoted this as the back-door adjustment. This
set of variables, however, is not necessarily unique. In Figure 2.7, for example, the set
{X 3 , X 4 } meets this back-door criterion, as does the set {X 4 , X 5 }. The set {X 4 }, however,
does not meet the criterion because it unblocks the path from D via X 3 , X 1 , X 4 , X 2 , X 5
to Y ; neither does {X 1 , X 2 }.
Before turning to another identification method, let us recall the structural func-
tion notation introduced in equations (1.1) and (1.2). Thinking of classical regression
analysis with response Y , regressors (D, X ), and the remainder U often called the ‘error
Figure 2.7 Different sets can be used for blocking the paths between D and Y
term’, an interesting question to ask would be: what happens if there is also a relation
between X and U ? Consider Figure 2.8 and note that this graph implies that U ⊥⊥ D|X .
It is not hard to see that nonetheless one has
$$E[Y^d] = \int \varphi(d,X,U)\, dF_{UX} = \int\!\!\int \varphi(d,X,U)\, dF_{U\mid X}\, dF_X$$
$$= \int\!\!\int \varphi(d,X,U)\, dF_{U\mid X,D=d}\, dF_X = \int E\big[Y^d \mid X, D=d\big]\, dF_X$$
$$= \int E[Y \mid X, D=d]\, dF_X = E_X\big[E[Y \mid X, D=d]\big]. \qquad (2.8)$$
Similarly to (2.7), the inner expectation of the last expression can be estimated from the
respective subsamples of each treatment group (d) to afterwards average (or integrate)
out the X . Thus, the method for identifying the impact of D on Y is the same as in Equa-
tion 2.7; it is the so-called matching and propensity score method discussed in Chapter 3.
Figure 2.9 Front door identification: left: not identifiable; centre and right: identifiable
One should note, however, that a different formula for identification has to be used then.
One example is the so-called front-door adjustment.
Example 2.5 Pearl (2000, section 3.3.3) gives an example of front-door identification
to estimate the effect of smoking on the occurrence of lung cancer. The advocates of
the tobacco industry attributed the observed positive correlation between smoking and
lung cancer to some latent genetic differences. According to their theory, some indi-
viduals are more likely to enjoy smoking or become addicted to nicotine, and the same
individuals are also more susceptible to develop cancer, but not because of smoking.
If we were to find a mediating variable Z not caused by these genetic differences, the
previously described strategy could be used. The amount of tar deposited in a person’s
lungs would be such a variable, if we could assume that (1) smoking has no effect on
the production of lung cancer except as mediated through tar deposits (i.e. the effect of
smoking on cancer is channelled entirely via the mediating variable), (2) that the unob-
served genotype has no direct effect on the accumulation of tar, and (3) that there are
no other factors that affect the accumulation of tar deposits and (at the same time) have
another path to smoking or cancer. This identification approach shows that sometimes it
can be appropriate to adjust for a variable that is causally affected by D.7 Note that our set of assumptions is designed to identify the total impact; for establishing just the existence of some effect, you may be able to relax them.
7 Pearl (2000) continues with an insightful and amusing example for the kinds of problems and risks this
strategy entails. Suppose for simplicity that all variables are binary with 50% of the population being
smokers and the other 50% being non-smokers. Suppose 95% of smokers have accumulated high levels of
tar, whereas only 5% of non-smokers have high levels of tar. This implies population shares of 47.5% (smokers with tar), 2.5% (smokers without tar), 2.5% (non-smokers with tar) and 47.5% (non-smokers without tar). For each group consider also the fraction of individuals who have developed lung cancer; for example, 10% of non-smokers without tar have lung cancer.
Such numbers can be interpreted in two ways: overall, smokers seem to have higher lung cancer rates than non-smokers. One could argue though that this relation is spurious and driven by unobservables. On the
other hand, we see that high values of tar seem to have a protective effect. Non-smokers with tar deposits
experience less lung cancer than non-smokers without tar. In addition, smokers with tar also have less lung
cancer than smokers without tar. Hence, tar is an effective protection against lung cancer such that one
should aim to build up tar deposits. At the same time, smoking indeed seems to be a very effective method
to develop these protective tar deposits. Following the second interpretation, smoking would even help to
reduce lung cancer.
Let us return to the identification of the treatment effect (impact of D on Y ) with such
a mediating variable in a more general setting. Consider the graph in Figure 2.10. For
simplicity we abstract from further covariates X , but as usual, we permit each variable
to be further affected by some additional unobservables which are independent of each
other. This is made explicit in the left graph. Usually one suppresses these independent
unobservables in the graphs, and only shows the simplified graph on the right-hand side.
The graph implies that Z^d ⊥⊥ U and that Z^d ⊥⊥ D. In terms of the cumulative distribution function F, the first statement can also be written as F_{Z^d,U} = F_{Z^d} F_U, while the second statement implies that F_{Z^d} = F_{Z^d|D=d} = F_{Z|D=d}.
We make use of these implications further below when expressing the potential out-
comes in terms of observed variables. The potential outcome depends on Z and U in
that
Y^d = ϕ(Z^d, U).
Note that this calculus holds for continuous and discrete variables. It follows that
E[Y^d] = ∫ E_D[ E[Y | D, Z = z] ] dF_{Z|D=d}(z),   (2.9)
The formula (2.9) shows that we can express the expected potential outcome in terms of
observable random variables; so it is identifiable. If D and Z are discrete, (2.9) can be
written as
E[Y^d] = Σ_z Pr(Z = z | D = d) Σ_{d′} E[Y | D = d′, Z = z] Pr(D = d′).   (2.10)
To obtain an intuition for this, recall that we separately identify the effect of D on Z ,
and the effect of Z on Y . First, consider the effect of Z on Y , and note that the graph
implies Y z ⊥⊥ Z |D such that
E[Y^z] = ∫ E[Y | D, Z = z] dF_D = E_D[ E[Y | D, Z = z] ].   (2.11)
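For concreteness, here is a small numerical sketch of the discrete front-door formula (2.10) for a binary D and Z. All numbers below are made up purely for illustration.

```python
import numpy as np

# Hypothetical joint distribution summarised by
# p_dz[d, z]  = Pr(D=d, Z=z)
# ey_dz[d, z] = E[Y | D=d, Z=z]
p_dz  = np.array([[0.40, 0.10],
                  [0.05, 0.45]])
ey_dz = np.array([[0.10, 0.05],
                  [0.90, 0.85]])

p_d = p_dz.sum(axis=1)                    # Pr(D=d)
p_z_given_d = p_dz / p_d[:, None]         # Pr(Z=z | D=d)

def front_door(d):
    """E[Y^d] via (2.10): sum_z Pr(Z=z|D=d) sum_d' E[Y|D=d',Z=z] Pr(D=d')."""
    return sum(p_z_given_d[d, z] * sum(ey_dz[dd, z] * p_d[dd] for dd in range(2))
               for z in range(2))

print("E[Y^1] - E[Y^0] =", front_door(1) - front_door(0))
```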
noise. The treatment-effect philosophy is interested in the effect of only one (or perhaps
two) variables D and chooses the other regressors for reasons of identification, accord-
ing to knowledge about the data-generating process. Then, additional covariates could
nonetheless be included or not on the basis of efficiency considerations.
Example 2.6 Often one is interested in the effects of some school inputs, our D (e.g.
computer training in school), on ‘productivity’ Y in adult life (e.g. wages). In the
typical Mincer-type equation one regresses wages on a constant, experience (X ) and
school inputs (D). Here, experience is included to obtain only the direct effect of D on Y, by blocking the indirect effect that D may have via experience (X). This is an
example where including an additional variable in the regression may cause problems.
Suppose the computer training programme was introduced in some randomly selected
pilot schools. Clearly, due to the randomisation the total effect of D on Y is identified.
However, introducing experience (X ) in the regression is likely to lead to identifica-
tion problems when being interested in the total effect. Evidently, the amount of labour
market experience depends on the time in unemployment or out of the labour force,
which is almost certainly correlated with some unobserved productivity characteristics
that also affect Y . Hence, introducing X destroys the advantages that could be reaped
from the experiment. Most applied labour econometricians are well aware of this prob-
lem and use potential experience instead. This, however, does not fully separate the
direct from the indirect effect because a mechanistic relationship is imposed, i.e. if edu-
cation is increased by one year, potential experience decreases automatically by one
year.
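The following sketch mimics the situation of Example 2.6 under an assumed data-generating process (all coefficients and variable names are hypothetical): D is randomised, experience X depends on D and on unobserved productivity U, and U also affects wages Y. Regressing Y on D alone recovers the total effect, while adding X yields a coefficient that is neither the direct nor the total effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

D = rng.binomial(1, 0.5, size=n).astype(float)        # randomised school input
U = rng.normal(size=n)                                  # unobserved productivity
X = 0.5 * D + U + rng.normal(size=n)                    # labour market experience
Y = 1.0 * D + 0.8 * X + 2.0 * U + rng.normal(size=n)    # wages; total effect = 1 + 0.8*0.5 = 1.4

def ols(y, *regs):
    Z = np.column_stack((np.ones_like(y),) + regs)
    return np.linalg.lstsq(Z, y, rcond=None)[0][1:]

print("total effect (no X):", ols(Y, D)[0])        # about 1.4, identified by randomisation
print("coefficient with X:", ols(Y, D, X)[0])      # about 0.5: neither the direct (1.0) nor the total (1.4) effect
```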
Whether we are interested in identifying the total or the partial impact is not a question
of econometrics but depends on the economic question under study. It is important here
to understand the differences when identifying, estimating and interpreting the effect.
In practice it can easily happen that we are interested in the total but can identify only
the partial effect or vice versa. One might also be in the fortunate situation where we
can identify both or in the unfortunate one where we are unable to identify any of them.
To illustrate these various cases, let us deviate from the DAGs and examine the more
involved analysis in the presence of cycles or feedback.
Figure 2.11 Direct and indirect effects, examples (a) to (c) from left to right

Consider Figure 2.11 graph (a), where D affects X, and X affects D. This could be due to a direct feedback or simultaneous determination of both variables. It could also be that for some (unknown) subpopulation treatment D affects X, and for the other individuals X affects D. Finally, there is the possibility that the causal influence is in fact
unidirectional, but that we simply do not know the correct direction, and therefore do
not want to restrict this relationship. Not conditioning on X would lead to confounding.
On the other hand, conditioning on X would block the back-door path but would also
block the effect of D on Y which mediates through X . By conditioning on X we might
be able to estimate the direct effect of D on Y , i.e. the total effect minus the part that is
channelled through X. In other words, conditioning on X permits us to estimate a partial (here the direct) effect.
Example (b) illustrates once again that conditioning on X does not always guarantee
the identification of an (easily) interpretable effect. In this situation, conditioning on X
unblocks the path between D and Y via the dashed arc. Hence, even if the true direct
effect of D on Y is zero, we still might find a non-zero association between D and
Y after conditioning on X . This simple graph demonstrates that attempting to identify
direct effects via conditioning can fail.
Graph (c) demonstrates that sometimes, while the direct effect cannot be identified,
the total effect might well be. The total effect of D on Y is identified without condi-
tioning. However, the direct effect of D on Y is not because conditioning on X would
unblock the path via the dashed arc. Not conditioning would obviously fail, too. A
heuristic way to see this is that we could identify the effect of D on X but not that
of X on Y ; hence we could never know how much of the total effect is channelled by X .
Example 2.7 Consider the example where a birth-control pill is suspected of increasing
the risk of thrombosis, but at the same time reduces the rate of pregnancies, which
are known to provoke thrombosis. Here you are not interested in the total effect of the pill on thrombosis but rather in its direct impact. Suppose the pill is introduced
in a random drug-placebo trial and suppose further that there is an unobserved vari-
able affecting the likelihood of pregnancy as well as of thrombosis. This corresponds
to the graph in example (c) in Figure 2.11. The total effect of the pill is immediately
identified since it is a random trial. On the other hand, measuring the effect separately
among pregnant women and non-pregnant women could lead to spurious associations
due to the unobserved confounding factor. Therefore, to measure the direct effect, alter-
native approaches are required, e.g. to start the randomised trial only after women
became pregnant or among women who prevented pregnancy by means other than this
drug.
Let us have another look at the different meanings of a ceteris paribus effect,
depending on whether we look at the treatment or the production function literature.
In the analysis of gender discrimination one frequently observes the claim that women
are paid less than men or that women are less likely to be hired. Women and men differ
in many respects, yet the central claim is that there is a direct effect of gender on hiring
or pay, even if everything else is held equal. This is exemplified in the graph displayed
in Figure 2.12. There, gender may have an effect on education (subject of university
degree and type of programme in vocational school), on labour market experience or
preferred occupation, and many other factors, in addition to a possible direct effect on
wage.9

Figure 2.12 A causal graph with nodes Gender, Education/Skills, Experience and Wage

In order to attain a real ceteris paribus interpretation in the production function
thinking, one would like to disentangle the direct effect from the other factors. Even if
we abstract from the fact that a large number of unobservables is missing in that graph,
it is obvious that gender also has many indirect effects on wages. It has actually turned
out to be pretty hard to measure correctly the indirect and therefore the total effect of
gender on wages, and many different models have been proposed in the past to solve
this problem.10
Example 2.8 Rose and Betts (2004) consider the effects of the number and type of maths
courses during secondary school on adult earnings. Maths courses are likely to have
two effects: First, they could affect the likelihood of continuing with further education.
Second, they may have a direct effect on earnings, i.e. given total years of education.
Therefore, regressions are examined of the type
wages on maths courses, years of schooling and other controls,
where ‘years of schooling’ contains the total years of education including tertiary edu-
cation. The main interest is in the coefficient on maths courses, controlling for the
post-treatment variable total schooling. Rose and Betts (2004) also considered a variant
where they control for the two post-treatment variables college major (i.e. field of study
in university) and occupation. In all cases, direct positive effects of maths courses on
wages were found. In a similar spirit, they examined the effects of the credits completed
during secondary school on wages, controlling for total years of education. The motiva-
tion for that analysis was to investigate whether the curriculum during secondary school
mattered. Indeed, in classical screening models, education serves only as a screening
device such that only the number of years of education (or the degree obtained) should
determine wages, while the content of the courses should not matter.
What should become clear from all these examples and discussions is that identifying direct (or partial) effects requires the identification of the distribution of Y_x^d, whereas the total effect only requires the distribution of Y^d.
9 See Moral-Arce, Sperlich, Fernandez-Sainz and Roca (2012), Moral-Arce, Sperlich and Fernandez-Sainz
(2013) and references therein for recent non-parametric identification and estimation of the gender wage
gap.
10 A further problem is that we might be able to identify direct and indirect effects of gender on wages, but not all of them can automatically be referred to as discrimination. For example, if women voluntarily chose an education that leads to low-paid jobs, one would in a next step have to investigate whether those jobs are low paid precisely because they are dominated by women; we could not automatically conclude so.
While the latter can usually be identified from a perfect randomised experiment, identi-
fication of the former will require more assumptions. In order to simplify the problem,
let us again concentrate on DAGs and establish some rules helping us to identify the
distribution of Yxd .
Let Y, D, X, V be arbitrary disjoint sets of nodes in a causal DAG, where each of these sets may be empty. Let Pr(Y^d) denote the distribution of Y if D is externally set to the value d. Similarly, Pr(Y_x^d) represents the distribution of Y if D and X are both externally set. In contrast, Pr(Y^d | X^d = x) is the outcome distribution when D is externally set and x is observed subsequently. In our previous notation, this refers to X^d, i.e. the potential outcome of X when D is fixed externally. Note that as usual Pr(Y^d | X^d) = Pr(Y^d, X^d) / Pr(X^d). As before, let G_{\underline{D}} be the subgraph obtained by deleting all arrows emerging from nodes in D. Analogously, G_{\overline{D}} is the graph obtained by deleting all arrows pointing to nodes in D. Then, the rules are summarised in11
THEOREM 2.5 For DAGs and the notation introduced above one has
1. Insertion and deletion of observations:
   Pr(Y^d | X^d, V^d) = Pr(Y^d | V^d)   if (Y ⊥⊥ X | D, V) in G_{\overline{D}}
2. Action or observation exchange:
   Pr(Y_x^d | V_x^d) = Pr(Y^d | X^d, V^d)   if (Y ⊥⊥ X | D, V) in G_{\overline{D}\underline{X}}
3. Insertion or deletion of actions:
   Pr(Y_x^d | V_x^d) = Pr(Y^d | V^d)   if (Y ⊥⊥ X | D, V) in G_{\overline{D}\,\overline{X(V)}}
where X(V) is the set of X-nodes that are not ancestors of any V-node in the subgraph G_{\overline{D}}.
We illustrate the use of the rules in Theorem 2.5 by applying them to the graphs in
Figure 2.13. In fact, we can show by this theorem that the direct effects are identified.
In graph (a) we can apply rule 2 twice: first to obtain
Pr(Y_x^d) = Pr(Y^d | X^d)   because (Y ⊥⊥ X | D) in G_{\overline{D}\underline{X}},
and afterwards to show that
Pr(Y^d | X^d) = Pr(Y | D, X)   as (Y ⊥⊥ D | X) in G_{\underline{D}}.
Figure 2.13 How to identify the direct effects in graph (a) [left] and (b) [right]
You can check that both conditions are satisfied in graph (a) such that we finally have Pr(Y_x^d) = Pr(Y | D, X). In this situation, conditioning on D and X clearly identifies potential outcomes. Hence, in conventional regression jargon, X can be added as an additional regressor in a regression to identify that part of the effect of D which is not channelled via X. Note, however, that the total effect of D on Y cannot be identified. Also in graph (b), still Figure 2.13, we fail to identify the total effect of D on Y. Instead, with Theorem 2.5, we can identify the distributions of Y_x^d and Y_v^d. For example,
by applying rule 2 jointly to D and X we obtain
Furthermore, with V_x^d = V we have Pr(Y_x^d | V_x^d) = Pr(Y_x^d | V) = Pr(Y | D, X, V) (cf. rule 3), and finally
It has to be admitted that most of the treatment effect identification and estimation
methods we present in the following chapters were introduced as methods for studying
the total effect. It is, however, obvious that, if the variation of confounders X is not
purely exogenous but has a mutual effect or common driver with D (recall graph (a)
of Figure 2.11), then we may want to identify the direct (or partial) effect of D on Y
instead.
In later chapters we will also consider the so-called instrumental variable estimation,
where identification via causal graphs of the kind in Figure 2.14 is applied. A variable
Z has a direct impact on D, but is not permitted to have any path to or from Y other
than the mediating link via D. So the question is not just to exclude a causal impact
of Z on Y ; some more assumptions are necessary. The prototypical example for such
a situation is the randomised encouragement design. We are, for example, interested in
the effect of smoking (D) on health outcomes (Y ). One would suspect that the smoking
behaviour is not independent of unobservables affecting health outcomes. A randomised
trial where D is set randomly is impossible as we cannot force individuals to smoke or
not to smoke. In the encouragement design, different doses of encouragement Z ‘to
stop smoking’ are given. For example, individuals could be consulted by their physician
about the benefits of stopping smoking or receive a discount from their health insurance
provider. These different doses could in principle be randomised. In the simplest design,
Z contains only two values: encouragement yes or no. Half of the physicians could be
randomly selected to provide stop-smoking encouragement to their patients, while the
other half does not. This way, Z would be randomly assigned and thus independent of all
unobservables. The resulting graph is given in Figure 2.14. One can immediately obtain
the intention-to-treat effect of Z on Y. Identification of the treatment effect of D on
Y , however, will require further assumptions, as will be discussed later. One assumption
is that Z has no further link with Y , i.e. the stop-smoking campaign should only lead
to a reduction in smoking (among those who receive the encouragement) but provide
no other health information (e.g. about the harm of obesity) that could also affect Y . In
addition, some kind of monotonicity of the effect of Z on D is required, e.g. that the
stop-smoking campaign does not induce anyone to start or increase smoking. Clearly, it
is only permitted to have an impact (direct or indirect) on those individuals to whom the
anti-smoking incentives are offered, but not on anyone else.
2.2 Non- and Semi-Parametric Estimation

Unlike the rest of the book, this section is basically a condensed summary. We intro-
duce here non- and semi-parametric estimation methods that we will frequently apply
in the following chapters. The focus is on presenting the main ideas and results (sta-
tistical properties), so that you get a feeling for which types of estimators exist and
learn their properties. For a deeper understanding of them we recommend that you con-
sult more specific literature on non- and semi-parametric inference, which is now quite
abundant.12 Especially if encountering these methods for the first time, you might find
this section a bit too dense and abstract.
In the last section on non-parametric identification we controlled for covariates by
means of conditional expectations: recall for example Equation 2.7. Crucial ingredi-
ents are conditional mean functions E[Y |X ] or E[Y |X, D] and estimates thereof; recall
Equation 2.8. Similarly, in the case of a front-door identification with a mediating vari-
able Z , we need to predict conditional expectations in certain subpopulations to apply
Equation 2.9. The results for the estimated treatment effects will thus depend on the
way we estimate these conditional expectations. Once we have succeeded in identifying
treatment effects non-parametrically, i.e. without depending on a particular parametric
specification, it would be nice if we could also estimate them without such a restrictive
specification. This is the topic of the remaining part of this chapter. The focus is here
on the basic ideas. Readers who are already familiar with local polynomial regression
can skip the next two sections. To Master-level and PhD students we nonetheless
recommend reading Section 2.2.3 where we review exclusively estimators and results
of semi-parametric estimation that will be referred to in the subsequent chapters.
12 See for example Härdle, Müller, Sperlich and Werwatz (2004), Henderson and Parmeter (2015), Li and
Racine (2007), Yatchew (2003) or Pagan and Ullah (1999).
13 That is, including some observations for which we only have x_i ≈ x but not equality.
14 Typically one adds smoothness conditions for the density of the continuous covariates. This is mainly done for technical convenience but also avoids that the 'left' or 'right' neighbourhood of x is much more represented in the sample than the other side.
Lipschitz continuity, see below), and sometimes also by boundedness conditions. The
idea is pretty simple: if there were a downward jump of m(·) right before x, then a
weighted average of the yi (for xi being neighbour of x) would systematically overesti-
mate m(x); a ‘smoother’ is simply not the right approach to estimate functions that are
not smooth.15
It is useful to present these smoothness concepts in the language of real analysis. Let m : IR^q → IR be a real-valued function. This function m is called Lipschitz continuous over a set X ⊂ IR^q if there is a non-negative constant c such that for any two values x_1, x_2 ∈ X
|m(x_1) − m(x_2)| ≤ c · ‖x_1 − x_2‖,
where ‖·‖ is the Euclidean norm. Loosely speaking, the smallest value of c for which this condition is satisfied represents the 'steepest slope' of the function in the set X. If there is a c such that the Lipschitz condition is satisfied over its entire domain, one says that the function is uniformly Lipschitz continuous.
Example 2.9 Consider q = 1, then the function m(x) = |x| is uniformly Lipschitz con-
tinuous over the entire real line. On the other hand, it is not differentiable at zero. Note
that according to the theorem of Rademacher, a Lipschitz continuous function is dif-
ferentiable almost everywhere but not necessarily everywhere.16 As a second example,
the function m(x) = x 2 is differentiable, but not Lipschitz continuous over IR. Hence,
neither does differentiability imply Lipschitz continuity nor the other way around. See
also Exercise 6.
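A quick numerical illustration of Example 2.9, under the obvious discretisation caveats: the largest difference quotient of |x| stays bounded by one on any interval, whereas that of x² grows with the length of the interval, so no global Lipschitz constant exists.

```python
import numpy as np

def max_difference_quotient(m, xs):
    """max |m(x1)-m(x2)| / |x1-x2| over a grid: a lower bound for the Lipschitz constant."""
    x1, x2 = np.meshgrid(xs, xs)
    mask = x1 != x2
    return np.max(np.abs(m(x1[mask]) - m(x2[mask])) / np.abs(x1[mask] - x2[mask]))

for bound in (1, 10, 100):
    xs = np.linspace(-bound, bound, 201)
    print(bound,
          "abs:", round(max_difference_quotient(np.abs, xs), 2),       # stays around 1
          "square:", round(max_difference_quotient(np.square, xs), 2)) # grows roughly like 2*bound
```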
15 There certainly exist modifications that account for jumps and edges if their location is known.
16 This means that if one picks a point at random from the support (according to a continuous distribution), the function is differentiable at this point with probability one.
17 If α = 1 one often writes C k and refers to Lipschitz continuity.
The class C^{k,α} (for a non-negative integer k and 0 < α ≤ 1) consists of functions m that are k times continuously differentiable and whose partial derivatives of order k,
D^λ m(x) = ∂^{|λ|} m(x) / ( ∂x_1^{λ_1} ⋯ ∂x_q^{λ_q} )   with |λ| = λ_1 + … + λ_q = k,
satisfy a Hölder condition of order α,
Σ_{|λ| = k} | D^λ m(x_1) − D^λ m(x_2) | ≤ c · ‖x_1 − x_2‖^α,
for some non-negative c. Note that the summation runs over all permutations of the q-tuple λ.
Based on this type of property it is possible to derive general results on optimal convergence rates for non-parametric estimators when the only information or restriction on the true function m(x) over some set X ⊂ IR^q is that it belongs to the class C^{k,α}.18
To examine the properties of non-parametric estimators, one also needs to define what convergence of an estimator m̂(x) to m(x) over some set X ⊂ IR^q means.19 Different ways to measure the distance between two functions can be used. A popular one20 is the L_p norm ‖·‖_p,
‖m̂ − m‖_p = [ ∫_X | m̂(x) − m(x) |^p dκ(x) ]^{1/p}   for 1 ≤ p < ∞,
where κ is a measure on X; for simplicity imagine the identity or the cumulative distribution function of the covariate X. For p = 2 you obtain the Euclidean norm. Also quite useful is the sup-norm ‖·‖_∞, which is defined by
‖m̂ − m‖_∞ = sup_{x ∈ X} | m̂(x) − m(x) |.
The Sobolev norm ‖·‖_{a,p} also accounts for distances in the derivatives,
‖m̂ − m‖_{a,p} = [ Σ_{0 ≤ |k| ≤ a} ∫_X | D^k ( m̂(x) − m(x) ) |^p dκ(x) ]^{1/p}.
The Sobolev norms include the L p and the sup-norm for a = 0. These norms express the
distance between two functions by a real-valued number so that they can be used for the
standard concepts of convergence (plim, mean square, almost sure, and in distribution).
In the classic regression literature we are used to specifying a parametric model and estimating its parameters, say θ. We then speak of an unbiased estimator θ̂ if its expectation equals θ. For non-parametric estimators such exact unbiasedness is generally unattainable, and one studies rates of convergence instead. For a function m belonging to the class C^{k,α}, the optimal rate for estimating its vth derivative (v = 0 giving the function itself) is of order n^{−((k+α)−v)/(2(k+α)+q)} in the L_p norm for any 0 < p < ∞, and for the sup norm (necessary for getting an idea of the uniform convergence of the function as a whole)
( n / log n )^{ −((k+α)−v) / (2(k+α)+q) }.
One can see now that convergence is faster the smoother the function m(·) is. But we also see that non-parametric estimators can never achieve the convergence rate of n^{−1/2} (which is the typical rate in the parametric world) unless the class of functions is very much restricted. The convergence becomes slower when derivatives are estimated (v > 0); in addition, the convergence rate becomes slower for increasing dimension q of X, an effect which is known as the curse of dimensionality.
Non-Parametric Smoother
As stated previously, kernel and kNN based methods for non-parametric regression are
based on a local estimation approach. The common idea is that only data within a small
neighbourhood are used (except for kernels with infinite support). The concept of kernel regression can be introduced via the simple local average
m̂(x_0; h) = Σ_{i=1}^n Y_i · 1{|X_i − x_0| ≤ h} / Σ_{i=1}^n 1{|X_i − x_0| ≤ h},   (2.13)
where the weight is simply a trimming giving a constant weight to the h-neighbourhood of x_0. A weighted average in which different weights are assigned to observations
(Yi , X i ) depending on the distance from X i to x0 would look like
m̂(x_0; h) = Σ_{i=1}^n Y_i · K( (X_i − x_0)/h ) / Σ_{i=1}^n K( (X_i − x_0)/h ),   (2.14)
where K(u) is the weighting function, called the kernel. Intuitively appealing kernels are, for example, the Epanechnikov or the quartic kernel presented in Figure 2.15; they give more weight to observations closer to x_0 and no weight to those far away (unlike kernels with infinite support such as the Gaussian). As K(·) almost always appears together with the bandwidth h, often the notation K_h(u) := K(u/h)/h is used.
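For concreteness, here is a minimal implementation of the estimator (2.14) with the Epanechnikov kernel; the simulated data, the regression function and the bandwidth are arbitrary choices for illustration only.

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1 - u**2) * (np.abs(u) < 1)

def nadaraya_watson(x0, X, Y, h):
    """m_hat(x0; h) = sum_i Y_i K((X_i - x0)/h) / sum_i K((X_i - x0)/h), see (2.14)."""
    w = epanechnikov((X - x0) / h)
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=2000)
Y = np.sin(X) + 0.3 * rng.normal(size=X.size)

grid = np.linspace(-1.5, 1.5, 7)
print([round(nadaraya_watson(x0, X, Y, h=0.3), 2) for x0 in grid])
print([round(float(np.sin(x0)), 2) for x0 in grid])   # true values for comparison
```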
Figure 2.15 Examples of kernel functions: uniform as in (2.13), Epanechnikov, Quartic, Gaussian kernel (all 2nd order), a 6th-order and a boundary (correcting) kernel
Usually, K is positive (but not always, see the 6th-order kernel), has a maximum at 0 and integrates to one. Commonly used kernels are the mentioned Epanechnikov kernel K(v) = (3/4)(1 − v²)·1{−1 < v < 1} or the Gaussian kernel K(v) = φ(v). While the former is compactly supported, the latter has unbounded support. For the Gaussian kernel the bandwidth h corresponds to the standard deviation of a normal distribution with centre x_0. An also quite popular alternative is the quartic (or biweight) kernel K(v) = (15/16)(1 − v²)²·1{−1 < v < 1}, which is similar to the Epanechnikov but differentiable at the boundaries of its support. In Figure 2.15 we also see the uniform kernel used in Formula 2.13; further, a so-called higher-order kernel K(v) = (35/256)(15 − 105v² + 189v⁴ − 99v⁶)·1{−1 < v < 1} (discussed later) with some negative weights, and finally a boundary kernel. In this example it is a truncated quartic kernel which is set to zero outside the support of X (here for all values ≤ x_0 − 0.5h) but re-normed such that it integrates to one.
It is easy to show that the choice of the kernel is of less importance, whereas the choice
of the bandwidth is essential for the properties of the estimator. If h were infinitely large,
m̂(·; h) in (2.14) would simply be the sample mean of Y . For h coming close to zero,
m̂ is the interpolation of the Yi . As we know, interpolation gives a quite wiggly, rough
idea of the functional form of m(·) but is inconsistent as its variance does not go to
zero; therefore we need smoothing, i.e. including some neighbours. In fact, for consis-
tent estimation, the number of these neighbours must go to infinity with n → ∞ even
when h → 0. Obviously, a necessary condition for identification of E[Y |X = x0 ] is
that observations at x0 (if X is discrete) or around it (if continuous) are available. For
a continuous density f (·) of X the condition amounts to f (x0 ) > 0, so that asymptot-
ically we have an infinite number of X i around x0 . The estimator (2.14) is called the
Nadaraya (1965)–Watson (1964) kernel regression estimator. The extension to a multi-
variate Nadaraya–Watson estimator is immediate; you only have to define multivariate
kernel weights K : IR q → IR accordingly. The same holds for the next coming esti-
mator (local polynomials), but there you additionally need to use the Taylor expansion
(2.12) for q > 1 which is notationally (and also computationally) cumbersome. For the
sake of presentation we therefore continue for a while with q = 1.21
Instead of taking a simple weighted average, one could also fit a local model in the
neighbourhood around x0 and take this as a local estimator for E[Y |X = x0 ]. For
example, the local polynomial estimator takes advantage of the fact that any continu-
ous function can be approximated arbitrarily well by its Taylor expansion, and applies
the idea of weighted least squares by setting
( m̂(x_0; h), m̂′(x_0; h), …, m̂^{(p)}(x_0; h) )   (2.15)
:= arg min_{m, m′, …, m^{(p)}} Σ_{i=1}^n [ Y_i − m − m′·(X_i − x_0) − … − (m^{(p)}/p!)·(X_i − x_0)^p ]² K( (X_i − x_0)/h )
21 Note that this is not a particular pitfall for kernel methods but a difficulty that other non-parametric
estimators share in different ways.
for some integer p > 0. In the previous expression, m′ refers to the first derivative and m^{(p)} to the pth derivative. Thus, the local polynomial estimator simultaneously estimates the function m(·) and its derivatives. Fitting a local constant, i.e. setting p = 0, gives exactly the Nadaraya–Watson estimator (2.14). According
to the polynomial order p, the local polynomial estimator is also called local linear
( p = 1), local quadratic ( p = 2) or local cubic ( p = 3) estimator. Nadaraya–Watson
and local linear regression are the most common versions in econometrics. Local poly-
nomial regression of order two or three is more suited when estimating derivatives or
strongly oscillating functions in larger samples, but it often is unstable in small sam-
ples since more data points in each smoothing interval are required. When the main
interest lies in the vth derivative (including v = 0, i.e. the function itself), choosing p
such that p − v > 0 is odd, ensures that the smoothing bias in the boundary region is
of the same order as in the interior. If p − v is even, this bias will be of higher order
at the boundary and will also depend on the density of X . Finally, it has been shown
that the local linear estimator attains full asymptotic efficiency (in a minimax sense)
among all linear smoothers, and has high efficiency among all smoothers.22 By defining β = (β_0, β_1, …, β_p)′, Xi = (1, (X_i − x_0), …, (X_i − x_0)^p)′, K_i = K((X_i − x_0)/h), X = (X1, X2, …, Xn)′, K = diag(K_1, K_2, …, K_n) and Y = (Y_1, …, Y_n)′, we can write the local polynomial estimator as
β̂ = arg min_β (Y − Xβ)′ K (Y − Xβ) = (X′KX)^{−1} X′KY   (2.16)
with m̂^{(l)}(x_0) := l! · β̂_l, 0 ≤ l ≤ p. Note that we are still in the setting where X is one-dimensional and that, for ease of exposition, we have suppressed the dependence on h. We thus obtain
m̂(x_0) = e_1′ (X′KX)^{−1} X′KY = (1/n) Σ_{i=1}^n w_i Y_i   with   w_i = e_1′ ( (1/n) X′KX )^{−1} Xi K_i.   (2.17)
22 For details see for example Fan and Gijbels (1996), Loader (1999a) or Seifert and Gasser (1996, 2000).
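The weighted least squares representation in (2.16) and (2.17) translates almost literally into code. The sketch below (illustrative data, Epanechnikov kernel, p = 1 by default) returns the first element of β̂ as the estimate of m(x_0).

```python
import numpy as np

def local_polynomial(x0, X, Y, h, p=1,
                     kernel=lambda u: 0.75 * (1 - u**2) * (np.abs(u) < 1)):
    """Solve the weighted least squares problem (2.16); beta_hat[0] estimates m(x0)."""
    K = kernel((X - x0) / h)                               # diagonal of the weight matrix
    Xmat = np.vander(X - x0, N=p + 1, increasing=True)     # columns 1, (X_i-x0), ..., (X_i-x0)^p
    XtK = Xmat.T * K                                       # X'K
    beta_hat = np.linalg.solve(XtK @ Xmat, XtK @ Y)        # (X'KX)^{-1} X'KY
    return beta_hat[0]

rng = np.random.default_rng(4)
X = rng.uniform(-2, 2, size=2000)
Y = np.sin(X) + 0.3 * rng.normal(size=X.size)
print(round(local_polynomial(0.7, X, Y, h=0.3), 2), round(float(np.sin(0.7)), 2))
```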
The expressions of m̂(x_0) up to polynomial order three are (suppressing the dependence of all T_l and Q_l on x_0 and h, with T_l = Σ_{i=1}^n (X_i − x_0)^l K((X_i − x_0)/h) Y_i and Q_l = Σ_{i=1}^n (X_i − x_0)^l K((X_i − x_0)/h))
m̂_{p=0}(x_0) = T_0 / Q_0 = Σ_{i=1}^n Y_i K((X_i − x_0)/h) / Σ_{i=1}^n K((X_i − x_0)/h)   (2.18)
m̂_{p=1}(x_0) = ( Q_2 T_0 − Q_1 T_1 ) / ( Q_2 Q_0 − Q_1² )
m̂_{p=2}(x_0) = [ (Q_2 Q_4 − Q_3²) T_0 + (Q_2 Q_3 − Q_1 Q_4) T_1 + (Q_1 Q_3 − Q_2²) T_2 ] / [ Q_0 Q_2 Q_4 + 2 Q_1 Q_2 Q_3 − Q_2³ − Q_0 Q_3² − Q_1² Q_4 ]
m̂_{p=3}(x_0) = ( A_0 T_0 + A_1 T_1 + A_2 T_2 + A_3 T_3 ) / ( A_0 Q_0 + A_1 Q_1 + A_2 Q_2 + A_3 Q_3 ),
where A_0 = Q_2 Q_4 Q_6 + 2 Q_3 Q_4 Q_5 − Q_4³ − Q_2 Q_5² − Q_3² Q_6, A_1 = Q_3 Q_4² + Q_1 Q_5² + Q_2 Q_3 Q_6 − Q_1 Q_4 Q_6 − Q_2 Q_4 Q_5 − Q_3² Q_5, A_2 = Q_1 Q_3 Q_6 + Q_2 Q_4² + Q_2 Q_3 Q_5 − Q_3² Q_4 − Q_1 Q_4 Q_5 − Q_2² Q_6, A_3 = Q_3³ + Q_1 Q_4² + Q_2² Q_5 − Q_1 Q_3 Q_5 − 2 Q_2 Q_3 Q_4.
Using the above formulae, we can write the local linear estimator equivalently as
m̂_{p=1}(x_0) = Σ_{i=1}^n K_i* Y_i / Σ_{i=1}^n K_i*,   where   K_i* = { Q_2 − Q_1 (X_i − x_0) } K( (X_i − x_0)/h ).   (2.19)
Hence, the local linear estimator is a Nadaraya–Watson kernel estimator with kernel
function K i∗ . This kernel K i∗ is negative for some values of X i and sometimes also called
equivalent kernel. This may help our intuition to understand what a kernel function with
negative values means. Similarly, every local polynomial regression estimator can be
written in the form (2.19) with different equivalent kernels K i∗ . They all sometimes take
negative values, except for the case p = 0, the Nadaraya–Watson estimator.
Ridge Regression
Ridge regression is basically a kernel regression with a penalisation for the roughness of
the resulting regression in order to make it more stable (more robust). The name ridge
originates from the fact that in a simple linear regression context this penalisation term is
added to the ridge of the correlation matrix of X when calculating the projection matrix.
How this applies to local linear regression is shown below.
In different simulation studies the so-called ridge regression has exhibited quite attractive performance qualities, such as being less sensitive to the bandwidth choice, having a small finite-sample bias, and being numerically robust against irregular designs (like, for example, data sparseness in some regions of the support of X). A simple pre-
sentation and implementation, however, is only known for the one-dimensional case
(q = 1). To obtain a quick and intuitive idea, one might think of a kind of linear
combination of Nadaraya–Watson and local linear regression. More specifically, for
K_h(u) = (1/h) K(u/h) consider
min_{β_0, β_1} Σ_{j=1}^n { Y_j − β_0 − β_1 (X_j − x̃_i) }² K_h(X_j − x_i) + r β_1²,
where x̃_i = Σ_{j=1}^n X_j K_h(X_j − x_i) / Σ_{j=1}^n K_h(X_j − x_i) and r is the so-called ridge parameter. So x̃_i is a weighted average of the neighbours of x_i. Define s_α(i, j) = (X_j − x̃_i)^α K_h(X_j − x_i), α = 0, 1, 2. Then the ridge regression estimate is
m̂(x_i) = β̂_0 + β̂_1 (x_i − x̃_i) = Σ_{j=1}^n w(i, j) Y_j   (2.20)
with w(i, j) = s_0(i, j) / { Σ_{j=1}^n s_0(i, j) } + s_1(i, j) · (x_i − x̃_i) / { r + Σ_{j=1}^n s_2(i, j) }. Defining S_α(i) = Σ_{j=1}^n s_α(i, j), T_α(i) = Σ_{j=1}^n s_α(i, j) Y_j, and r̄ = S_2(i) / { r + S_2(i) }, we see that
m̂(x_i) = (1 − r̄) T_0/S_0 + r̄ { T_0/S_0 + (x_i − x̃_i) T_1/S_2 },
being thus a linear combination of the local constant (i.e. Nadaraya–Watson) estimator with weight (1 − r̄) and the local linear estimator with weight r̄. The mean squared error minimising r is quite complex with many unknown functions and parameters. A simple rule of thumb suggests setting r = h · |x_i − x̃_i| · c_r with c_r = max_v{K(v)} / {4 κ̄_0}, which is c_r = 5/16 for the Epanechnikov kernel and c_r ≈ 0.35 for the Gaussian one, cf. (2.23).23
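A sketch of this local ridge idea, under the rule of thumb just quoted (c_r = 5/16 for the Epanechnikov kernel): the penalty rβ_1² simply adds r to the slope entry of the local X′KX matrix. Data and bandwidth are again only illustrative.

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1 - u**2) * (np.abs(u) < 1)

def ridge_local_linear(x0, X, Y, h):
    K = epanechnikov((X - x0) / h) / h                    # K_h weights around x0
    x_tilde = np.sum(X * K) / np.sum(K)                   # local weighted mean of the neighbours
    r = h * abs(x0 - x_tilde) * (5 / 16)                  # rule-of-thumb ridge parameter from the text
    Xmat = np.column_stack([np.ones_like(X), X - x_tilde])
    XtK = Xmat.T * K
    A = XtK @ Xmat + np.diag([0.0, r])                    # penalise only the slope
    b0, b1 = np.linalg.solve(A, XtK @ Y)
    return b0 + b1 * (x0 - x_tilde)                       # evaluate the local line at x0, cf. (2.20)

rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=2000)
Y = np.sin(X) + 0.3 * rng.normal(size=X.size)
print(round(ridge_local_linear(0.7, X, Y, h=0.3), 2))
```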
The weights defined in (2.17) satisfy
(1/n) Σ_{i=1}^n w_i Xi = (1, 0, …, 0)′,   (2.21)
which can immediately be seen by inserting the definition of the weights (2.17); see
Exercise 10. The previous expression can be rewritten as
(1/n) Σ_{i=1}^n (X_i − x_0)^l · w_i = 1 for l = 0,   and   = 0 for 1 ≤ l ≤ p.   (2.22)
These orthogonality conditions imply an exactly zero finite-sample bias for all polynomial terms up to order p. This also
implies that if the true function m(x) happened indeed to be a polynomial function of
order p or less, the local polynomial estimator would be exactly unbiased, i.e. in finite
samples for any h > 0 and thereby also asymptotically. In this case, one would like to
choose the bandwidth h = ∞ to minimise the variance. You arrive then in the paramet-
ric world with parametric convergence rates etc. because h is no longer supposed to go
to zero.
Now we consider the expression as a linear smoother (2.17) and derive the expected
value of the estimator. Note that the expected value of the estimator could be undefined if
the denominator of the weights is zero. In other words, there could be local collinearity
23 For further details we refer to Seifert and Gasser (2000) or Busso, DiNardo and McCrary (2009).
which impedes the calculation of the estimator. Ruppert and Wand (1994) therefore
proposed to examine the expected value conditional on the observations X 1 , . . . , X n :
E[ m̂(x_0) | X_1, …, X_n ] = E[ (1/n) Σ_{j=1}^n w_j Y_j | X_1, …, X_n ] = (1/n) Σ_{j=1}^n w_j m(X_j)
= (1/n) Σ_{i=1}^n w_i { m(x_0) + (X_i − x_0) ∂m(x_0)/∂x + … + (1/p!) (X_i − x_0)^p ∂^p m(x_0)/∂x^p + R(X_i, x_0) }
= m(x_0) + (1/n) Σ_{i=1}^n w_i R(X_i, x_0),
where the other terms are zero up to order p because of (2.21). We thus obtain that
E[ m̂(x_0) − m(x_0) | X_1, …, X_n ] = (1/n) Σ_{i=1}^n w_i R(X_i, x_0).
As an intuitive argument note that compact kernels are zero outside the interval [−1, 1].
Hence, for every i where |X i − x0 | > h, the kernel function K i will be zero. This
implies that the remainder term is at most of order O_p(h^{p+α}). We will show later that the expression (1/n) X′KX is O_p(1). Therefore, the entire expression is O_p(h^{p+α}). Since h
will always be assumed to converge to zero as n → ∞, the higher the polynomial order
p, the lower the order of the finite sample bias (or, say, the faster the bias goes to zero
for h → 0).
Before taking a closer look at the asymptotic properties of these estimators, it is useful
to work with the following definitions for the one-dimensional kernel function:
κ_λ = ∫ v^λ K(v) dv   and   κ̄_λ = ∫ v^λ K(v)² dv.   (2.23)
A kernel is said to be of order r if κ_0 = 1, κ_λ = 0 for 1 ≤ λ ≤ r − 1, and κ_r ≠ 0; a kernel of order r > 2 must necessarily be negative for some values of its support. Higher-order kernels are often used in theoretical derivations, particularly for reducing the bias in semi-parametric estimators. They have rarely been used in non-parametric applications, but may be particularly helpful for average treatment effect estimators.
To explicitly calculate bias and variance of non-parametric (kernel) regression estimators we consider first the Nadaraya–Watson estimator for dimension q = 1 with a 2nd-order kernel (r = 2),
m̂(x_0; h) = [ (1/(nh)) Σ_{i=1}^n Y_i · K( (X_i − x_0)/h ) ] / [ (1/(nh)) Σ_{i=1}^n K( (X_i − x_0)/h ) ].
The expected value of the numerator can be rewritten for independent observations (where we also make use of a Taylor expansion) as
E[ (1/(nh)) Σ_{i=1}^n Y_i · K( (X_i − x_0)/h ) ] = (1/h) ∫ m(x) · K( (x − x_0)/h ) f(x) dx
= ∫ m(x_0 + uh) f(x_0 + uh) · K(u) du
= m(x_0) f(x_0) ∫ K(v) dv + h · { m′(x_0) f(x_0) + m(x_0) f′(x_0) } ∫ u K(u) du
  + h² · { (m″(x_0)/2) f(x_0) + m(x_0) (f″(x_0)/2) + m′(x_0) f′(x_0) } ∫ u² K(u) du + O(h³)   (2.24)
= m(x_0) f(x_0) + h² · { (m″(x_0)/2) f(x_0) + m(x_0) (f″(x_0)/2) + m′(x_0) f′(x_0) } κ_2 + O(h³)
for κ_0 = ∫ K(v) dv = 1 and κ_1 = ∫ v K(v) dv = 0. Analogously, the expected value of the denominator is24
E[ (1/(nh)) Σ_{i=1}^n K( (X_i − x_0)/h ) ] = f(x_0) + h² · (f″(x_0)/2) κ_2 + O(h³).
A weak law of large numbers gives the limit in probability for fixed h and n → ∞, by showing that the variance converges to zero and applying Chebyshev's inequality. Under some regularity conditions like the smoothness of m(·) or Var(Y|x) < ∞, we obtain
24 The expected value of the denominator may be zero if a kernel with compact support is used. So the expected value of the Nadaraya–Watson estimator may not exist. Therefore the asymptotic analysis is usually done by estimating m(·) at the design points {X_i}_{i=1}^n as in Ruppert and Wand (1994) or by adding a small number to the denominator that tends to zero as n → ∞; see Fan (1993).
plim m̂(x_0, h) − m(x_0) = [ h² { (m″(x_0)/2) f(x_0) + m′(x_0) f′(x_0) } κ_2 + O(h³) ] / [ f(x_0) + h² (f″(x_0)/2) κ_2 + O(h³) ]
= h² { m″(x_0)/2 + m′(x_0) f′(x_0)/f(x_0) } κ_2 + O(h³).
Hence, the bias is proportional to h 2 . Exercise 11 asks you to derive the bias result-
ing from higher-order kernels by revisiting the calculations in (2.24). It is easy to
see that in general, the bias term is then of order h r with r being the order of the
kernel.
Obtaining an idea of the conditional variance is more tedious but not much more difficult; one basically needs to calculate
Var( m̂(x_0, h) ) = E[ ( m̂(x_0, h) − E[m̂(x_0, h)] )² ]
≈ E[ { (1/(nh f(x_0))) Σ_{i=1}^n ( Y_i − m(X_i) ) K( (X_i − x_0)/h ) }² ],
obtaining approximately (i.e. up to higher-order terms) ( Var[Y|x_0] / (nh f(x_0)) ) ∫ K²(v) dv.
The derivations made implicit use of the Dominated (Bounded) Convergence Theorem along Pagan and Ullah (1999, p. 362). It says that for a Borel measurable function g(x) on IR and some function f(x) (not necessarily a density) with ∫ |f(x)| dx < ∞,
(1/h^q) ∫ g(x/h) f(x_0 − x) dx −→ f(x_0) ∫ g(x) dx   as h → 0   (2.25)
at every point x_0 of continuity of f, if ∫ |g(x)| dx < ∞, ‖x‖ · |g(x)| → 0 as ‖x‖ → ∞, and sup |g(x)| < ∞. Furthermore, if f is uniformly continuous, then the convergence is uniform. For g being a kernel function, this theorem gives for example that E[ (1/(nh)) Σ_j K( (X_j − x_0)/h ) ] −→ f(x_0) ∫ K(v) dv. This result extends also to x ∈ IR^q for q > 1.
Let us recall some of the assumptions which have partly been discussed above:
(A2) a kernel K that is of second order (r = 2) and integrates to one, and a bandwidth
h → 0 with nh → ∞ for n → ∞.
Here we can see that the vector-valued estimator converges more slowly in its second component, i.e. for the derivative, by a factor h^{−1} than it does in the first component (the regression function itself).
The last term in (2.27) characterises the conditional variance, which is given by
(X′KX)^{−1} (X′KKX) (X′KX)^{−1}.   (2.29)
Hence, the expressions are such that the bias is at least of order h^{p+1}, while the variance is of order 1/(nh) (similarly to what we saw for higher-order kernels). In sum, we have seen
that the analogue to Theorem 2.6, still for q = 1, can be written as
THEOREM 2.7 Assume that we are provided with a sample {X_i, Y_i}_{i=1}^n coming from a model fulfilling (A1). Then, for x_0 being an interior point of the support of X ∈ IR, the local linear estimator m̂(x_0) of m(x_0) with kernel and bandwidth as in (A2) has bias and variance
Bias( m̂(x_0) ) = h² (κ_2/2) m″(x_0) + O( 1/(nh) ) + o(h²),
Var( m̂(x_0) ) = κ̄_0 σ²(x_0) / ( nh f(x_0) ) + o( 1/(nh) ).
Notice that these results hold only for interior points. The following table gives the
rates of the bias for interior as well as for boundary points. As already mentioned ear-
lier, for odd-order polynomials the local bias is of the same order in the interior as the
boundary, whereas it is of lower order in the interior for even-order polynomials.
Bias and variance in the interior and at boundary points, dim(X) = 1

                     p = 0    p = 1    p = 2    p = 3
  Bias in interior   O(h²)    O(h²)    O(h⁴)    O(h⁴)
  Bias at boundary   O(h¹)    O(h²)    O(h³)    O(h⁴)
The variance is always of order (nh)−1 . To achieve the fastest rate of convergence with
respect to the mean squared error, the bandwidth h could be chosen to balance squared
bias and variance, which leads to the following optimal convergence rates:
Optimal convergence rates in the interior and at boundary points, dim(X) = 1

  Convergence rate   p = 0      p = 1      p = 2      p = 3
  in the interior    n^{−2/5}   n^{−2/5}   n^{−4/9}   n^{−4/9}
  at the boundary    n^{−1/3}   n^{−2/5}   n^{−3/7}   n^{−4/9}
There exist various proposals for how to reduce the bias at the boundary (or say, correct
the boundary effects). Especially for density estimation and local constant (Nadaraya–
Watson) estimation, the use of boundary kernels (recall Figure 2.15) is quite popular.
For the one-dimensional ridge regression (q = 1) asymptotic statements are available
for the case where the asymptotically optimal ridge parameter for point x0 has been
used. Though in practice people will rather choose the same (probably a rule-of-thumb)
ridge-parameter for all points, this gives us at least an idea of the statistical performance
of this method. As it had been proposed as an improvement of the local linear estimator,
we give here the variance and mean squared error for m̂ ridge (xi ) compared to those of
m̂ loc.lin. (xi ):
T H E O R E M 2.8 Under the same assumptions as for the local linear estimator, see
Theorem 2.7, q = 1, second order kernel K (·), x0 being an interior point of X , f the
density, and using the asymptotically optimal ridge parameter,
This theorem shows that we indeed improve in the variance by having made the estima-
tor more stable, but we may pay for this in the bias. Whether asymptotic bias and mean
squared error are smaller or larger than those of the local linear estimator depends on the
derivatives of the underlying regression function m(·) and those of the (true) density f .
without being specific. The derivations of its properties are also analogous, although
some care in the notation is required.
A multivariate kernel function is needed. Of particular convenience for multivariate regression problems are the so-called product kernels, where the multivariate kernel function K(v) = K(v_1, …, v_q) is defined as a product of univariate kernel functions
K(v_1, …, v_q) = ∏_{l=1}^q K(v_l),   (2.30)
see Exercise 8, Theorem 2.9 and Subsection 2.2.2. For such product kernels, higher-
order kernels are easy to implement.
Further, a q × q bandwidth matrix H determines the shape of the smoothing window, such that the multivariate analogue of K_h(v) becomes K_H(v) = det(H)^{−1} K(H^{−1} v).
This permits smoothing in different directions and can take into account the correla-
tion structure among covariates. Selecting this q × q bandwidth matrix by a data-driven
bandwidth selector can be inconvenient and time-consuming, especially when q is large.
Typically, only diagonal bandwidth matrices H are used. This is not optimal but is
done for convenience (computational reasons, interpretation, etc.), such that in prac-
tice one bandwidth is chosen for each covariate – or even the same for all. As a practical
device, one often just rescales all covariates inside the kernel such that their sample
variance is one, but one ignores their potential correlation. After the rescaling simply
H := diag{h, h . . . , h} is used; for details see the paragraph on bandwidth choice in
Section 2.2.2.
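A minimal multivariate Nadaraya–Watson sketch along these lines: each covariate is rescaled to unit standard deviation (ignoring correlations, as in the simple device described above), and a product kernel as in (2.30) with a single bandwidth is used. The data and the bandwidth are illustrative assumptions.

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1 - u**2) * (np.abs(u) < 1)

def nw_multivariate(x0, X, Y, h):
    """Nadaraya-Watson with a product kernel after rescaling each covariate to unit variance."""
    scale = X.std(axis=0)                       # ignores correlations, as in the simple device above
    U = (X - x0) / (scale * h)                  # corresponds to H = h * diag(sd(X_1), ..., sd(X_q))
    w = np.prod(epanechnikov(U), axis=1)        # product kernel (2.30)
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(6)
n, q = 5000, 2
X = rng.uniform(-1, 1, size=(n, q))
Y = X[:, 0] ** 2 + X[:, 1] + 0.2 * rng.normal(size=n)
x0 = np.array([0.3, -0.2])
print(round(nw_multivariate(x0, X, Y, h=0.4), 2), round(0.3**2 - 0.2, 2))
```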
We will see that in such a setting, the bias of the local polynomial estimator at an interior point is still of order h^{p+1} if the order of the polynomial (p) is odd, and of order h^{p+2} if p is even.26 Hence, these results are the same as in the univariate setting and do not depend on the dimension q. In contrast, the variance is now of order 1/(nh^q), i.e. it goes to zero more slowly for increasing dimension q of X.27 Recall that it does not depend on p or r.
The reason why multivariate non-parametric regression nevertheless becomes difficult
is the sparsity of data in higher-dimensional spaces.
Example 2.10 Consider a relatively large sample of size n, and start with a uniformly distributed X ∈ [0, 1]. If we choose a smoothing window of size 0.01 (e.g. a bounded symmetric kernel with h = 0.01/2), one expects about 1% of the observations to lie in this smoothing window. Then consider the situation where the dimension of X is 2, and X is uniformly distributed on [0, 1]². With the same bandwidth h = 0.005 you obtain windows of volume 0.01² = 0.0001, containing on average only 0.01% of all data, etc. If we have dim(X) = 10 and want to find a smoothing area that contains 1% of the observations on average, then this requires a 10-dimensional cube with edge length 0.01^{1/10} ≈ 0.63. Hence, for each component X_l (l = 1, …, q) the smoothing area covers almost two thirds of the support of X_l, whereas it was only 0.01 in the one-dimensional case.
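The arithmetic of Example 2.10 in a few lines of code, under the stated uniform-design assumption.

```python
# Edge length of a q-dimensional cube on [0,1]^q that contains a fraction p of uniformly distributed data
p = 0.01
for q in (1, 2, 10):
    print(q, round(p ** (1 / q), 3))   # 0.01, 0.1, 0.631: the required edge grows quickly with q

# Conversely, keeping the one-dimensional window size 0.01 in q = 2 dimensions
print(0.01 ** 2)                       # 0.0001, i.e. only 0.01% of the data on average
```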
This example illustrates that in higher dimensions we need h (or, in the case of using a
non-trivial bandwidth matrix H , all its elements at a time) to go to zero much slower than
in the univariate case to control the variance. This in turn implies that the bias will be
much larger. Supposing sufficient smoothness of m(x) one could use local polynomials
of higher order to reduce the bias. But when dim(X ) = q is large, then a high order
of p can become very inconvenient in practice since the number of (interaction) terms
proliferates quickly. This could soon give rise to problems of local multicollinearity in
small samples. A computationally more convenient alternative is to combine local linear
regression with higher-order kernels for bias reduction.
First we need to clarify the properties of kernel functions for q > 1. Let λ be a q-tuple of non-negative integers and define |λ| = λ_1 + … + λ_q and v^λ = v_1^{λ_1} v_2^{λ_2} ⋯ v_q^{λ_q}. Define
κ_λ = ∫ ⋯ ∫ v^λ K(v_1, …, v_q) dv_1 ⋯ dv_q   (2.31)
and κ̄_λ = ∫ ⋯ ∫ v^λ K²(v_1, …, v_q) dv_1 ⋯ dv_q.
m̂(x_0) = e_1′ (X′KX)^{−1} Σ_{i=1}^n Xi K_i Y_i = e_1′ (X′KX)^{−1} Σ_{i=1}^n Xi K_i ( Y_i − m(X_i) + m(X_i) ),
where e1 is a column vector of zeros with first element being 1. A series expansion gives
= e_1′ (X′KX)^{−1} Σ_{i=1}^n Xi K_i ( Y_i − m(X_i) )
+ e_1′ (X′KX)^{−1} Σ_{i=1}^n Xi K_i { m(x_0) + (X_i − x_0)′ ∂m(x_0)/∂x + (1/2) (X_i − x_0)′ ( ∂²m(x_0)/∂x∂x′ ) (X_i − x_0) + R_i },
where ∂m(x_0)/∂x is the q × 1 vector of first derivatives, ∂²m(x_0)/∂x∂x′ the q × q matrix of second derivatives, and R_i the remainder term of all third- and higher-order derivatives multiplied with the respective higher-order (interaction) terms of X_i − x_0. We can
multiplied with the respective higher-order (interaction) terms of X i − x0 . We can
see now what an r th order kernel will do: it will let pass m(x0 ) (because the ker-
nel integrates to one) but it turns all further additive terms equal to zero until we
reach the r th order terms in the Taylor expansion. Let us assume we used a ker-
nel of the most typical order r = 2. Since K i has bounded support, for x0 being
an interior point the remainder term multiplied with K i is of order O(h 3max ), where
h max is the largest diagonal element of bandwidth matrix H . We obtain after some
calculations
m̂(x_0) = e_1′ (X′KX)^{−1} Σ_{i=1}^n Xi K_i ( Y_i − m(X_i) ) + m(x_0)   (2.32)
+ e_1′ (X′KX)^{−1} Σ_{i=1}^n Xi K_i ( Σ_{1≤|λ|≤k} (1/λ!) D^λ m(x_0) (X_i − x_0)^λ + R(x_0, X_i − x_0) ),   (2.33)
where the remainder R(x_0, X_i − x_0) is of order ‖X_i − x_0‖^{k+α} for some 0 < α ≤ 1. The first term gives the variance of the estimator, the second (m(x_0)) is the wanted quantity, and the two remainder terms in (2.33) give the bias. As for the one-dimensional case, for an r th-order kernel the bias is of order O(h^r) and contains all r th-order partial derivatives but not those of smaller order.
Note that (2.33) divided by n can be approximated by the expectation taken over X_i. Then, by applying the kernel properties, all summands up to |λ| = r with (X_i − x_0)^λ will integrate to zero (do not forget to count also the ones in Xi). Then you obtain for (2.33)
e_1′ ( (1/n) X′KX )^{−1} ( f(x_0) (κ_r/r!) Σ_{l=1}^q h_l^r ∂^r m(x_0)/∂x_l^r ,   o(h_max^r) · 1_q′ )′,   (2.34)
(1/n) X′KX = (1/n) Σ_{i=1}^n Xi Xi′ K_i
= [ f(x_0) + O_p(h_max^r)                                        h_1^r (κ_r/(r−1)!) ∂^{r−1} f(x_0)/∂x_1^{r−1} + o_p(h_max^r)   ⋯
    h_1^r (κ_r/(r−1)!) ∂^{r−1} f(x_0)/∂x_1^{r−1} + o_p(h_max^r)   h_1^r (κ_r/(r−2)!) ∂^{r−2} f(x_0)/∂x_1^{r−2} + o_p(h_max^r)   ⋯
    ⋮                                                             ⋮                                                            ⋱ ].   (2.35)
You may imagine the last matrix as a 2 × 2 block matrix ( a  b′ ; b  c ) with a being mainly the density f at point x_0, b a q-dimensional vector proportional to its (r − 1)th partial derivatives times h^r, and c being proportional to the symmetric q × q matrix of all its (mixed) derivatives of (total) order r. This can be shown element-wise via mean square convergence. Let us illustrate this along the (2, 2) element. The derivations for the other elements work analogously. For x_0 = (x_{0,1}, x_{0,2}, …, x_{0,q}) we have
[ (1/n) X′KX ]_{2,2} = ( 1 / (n det(H)) ) Σ_{i=1}^n ( X_{i1} − x_{0,1} )² K( H^{−1} (X_i − x_0) ).
Putting this together with (2.34) we obtain the bias, and similar calculation would give
the variance of the multivariate local linear estimator with higher-order kernels. We
summarise:
THEOREM 2.9 Assume that we are provided with a sample {X_i, Y_i}_{i=1}^n coming from a model fulfilling (A1) with X_i ∈ IR^q, m : IR^q → IR. Then, for x_0 ∈ IR^q being an interior point of the support of X, the local linear estimator m̂(x_0) of m(x_0) with a multivariate symmetric r th-order kernel (r ≥ 2) and bandwidth matrix H = diag{h_1, …, h_q} such that h_max → 0, n det(H) → ∞ for n → ∞ has
Bias( m̂(x_0) ) = (κ_r / r!) Σ_{l=1}^q h_l^r ∂^r m(x_0)/∂x_l^r + o(h_max^r),
Var( m̂(x_0) ) = κ̄_0 σ²(x_0) / ( n det(H) f(x_0) ) + o( 1/(n det(H)) ).
From the calculations above we obtained an idea of at least three things: how higher dimensions increase the variance in local polynomial kernel regression, its asymptotic performance, and how higher-order kernels can reduce the bias for local linear regression. When q is large, local linear estimation with higher-order kernels is easier to implement than higher-order local polynomial regression. The optimal rate of convergence for non-parametric estimation of a k times continuously differentiable regression function m(x), x ∈ IR^q, in L_2-norm is n^{−k/(2k+q)}.
2.2.2 Extensions: Bandwidth Choice, Bias Reduction, Discrete Covariates and Estimating
Conditional Distribution Functions
Throughout this subsection we keep the definition and notation of kernel moments as
introduced in (2.23). Where we have different kernels, say L and K , we specify the
moments further by writing e.g. κ j (K ) and κ j (L), respectively.
Bandwidth Choice
You can interpret the bandwidth choice as the fine-tuning of model selection: you have
avoided choosing a functional form but the question of smoothness is still open. Like in
model selection, once you have chosen a bandwidth h (or matrix H ), it is taken as given
for any further inference. This is standard practice even if it contradicts the philosophy of purely non-parametric analysis. The reason is that accounting in any further inference for the randomness of data-adaptively estimated bandwidths is often just too complex. It is actually not even clear whether valid inference is possible without the assumption of having the correct bandwidth.28
To simplify the presentation let us start with the one-dimensional and local constant
regressor case with second-order kernel: q = 1, p = 0, r = 2. Actually, if we just
follow the idea of minimising the mean squared error (MSE), then Theorems 2.6 and 2.7
indicate how the bandwidth should be chosen optimally. Suppose we aim to minimise
the asymptotic M S E(m̂(x0 )). Along with our Theorems, the first-order approximation
to the MSE of the Nadaraya–Watson estimator is
{ (h²/2) κ_2 [ m″(x_0) f(x_0) + 2 f′(x_0) m′(x_0) ] / f(x_0) }² + κ̄_0 σ² / ( nh f(x_0) ).
Considering this as a function of h for fixed n, the optimal bandwidth choice is obtained
by minimising it with respect to h. The first order condition gives
( h³ / f²(x_0) ) κ_2² { m″(x_0) f(x_0) + 2 f′(x_0) m′(x_0) }² − κ̄_0 σ² / ( nh² f(x_0) ) = 0
⟹ h_opt = n^{−1/5} [ σ² f(x_0) κ̄_0 / ( κ_2 { m″(x_0) f(x_0) + 2 f′(x_0) m′(x_0) } )² ]^{1/5}.   (2.36)
Hence, the optimal bandwidth for a one-dimensional regression problem under the assumptions of Theorems 2.6 or 2.7 is proportional to n^{−1/5}.
Unfortunately, asymptotic properties of non-parametric estimators are often of little
guidance for choosing the bandwidth for a particular data set in practice because they
contain many unknown terms, and because for your sample size ‘higher-order terms’
may still be dominant or at least important. A more versatile approach to bandwidth
selection is the hitherto very popular cross-validation (Stone 1974), based on the princi-
ple of maximising the out-of-sample predictive performance. If a quadratic loss function
(= L 2 error criterion) is used to assess the performance of an estimator of m(x0 ) at a
particular point x0 , a bandwidth value h should be selected to minimise
28 There is a large literature on model and variable selection already in the parametric world discussing the
problems of valid inference after preselection or testing.
E[ { m̂(x_0; h) − m(x_0) }² ].
Moreover, when a single bandwidth value is used to estimate the entire function m(·) at all points, we would like to choose the (global) bandwidth as the minimiser of the mean integrated squared error (MISE), typically weighted by the density f:
MISE(h; n) = E[ ∫ { m̂(x; h) − m(x) }² f(x) dx ].
In practice, it is more common to look at the minimiser of the integrated squared error
ISE(h; n) = ∫ { m̂(x; h) − m(x) }² f(x) dx,
as this gives you the optimal bandwidth for your sample, while minimising the MISE means looking for a bandwidth that minimises the ISE on average (i.e. independently of the sample). Since m(x) is unknown, a computable approximation to minimising the ISE is minimising the average squared error (ASE)
29 For properties of cross-validation bandwidth selection see Härdle and Marron (1987).
A computationally simpler criterion that is often used is the so-called generalised
cross-validation. A linear smoother for the data points Y = (Y_1, ..., Y_n)' can be written as
(Ŷ_1, ..., Ŷ_n)' = AY, where A is the n × n so-called hat, smoothing or projection matrix.
Letting a_{ii} denote the (i,i) element of A, the generalised cross-validation criterion is
$$GCV(h) = \frac{\frac{1}{n}\|(I_n - A)Y\|^2}{\left(\frac{1}{n}\mathrm{tr}(I_n - A)\right)^2} = \frac{\frac{1}{n}\sum_{i=1}^{n}\{Y_i - \hat m(X_i; h)\}^2}{\left(\frac{1}{n}\sum_{i=1}^{n}(1 - a_{ii})\right)^2}, \qquad I_n \text{ the identity matrix}, \qquad (2.39)$$
which does not require estimating the leave-one-out estimates. However, the approxima-
tion that is used here for estimating the degrees of freedom is not generally valid when
we turn to more complex estimators.
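To make the leave-one-out idea behind these criteria concrete, the following minimal R sketch selects the bandwidth of a Nadaraya–Watson estimator by cross-validation over a grid; the simulated data, the Gaussian kernel and the grid are illustrative choices of ours, not taken from the text.

# Leave-one-out cross-validation for a Nadaraya-Watson bandwidth (one regressor).
set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

nw <- function(x0, x, y, h) {            # local constant fit at x0
  w <- dnorm((x - x0) / h)
  sum(w * y) / sum(w)
}

cv <- function(h) {                      # CV(h): mean squared leave-one-out error
  pred <- sapply(seq_len(n), function(i) nw(x[i], x[-i], y[-i], h))
  mean((y - pred)^2)
}

grid <- seq(0.02, 0.5, by = 0.01)
h.cv <- grid[which.min(sapply(grid, cv))]
h.cv                                      # data-driven bandwidth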
Usually a single bandwidth h is considered for a given sample to estimate m(·) at dif-
ferent locations x. However, permitting the bandwidth to vary with x (a so-called local
bandwidth h(x)) may yield a more precise estimation if the smoothing window adapts
to the density of the available data. One such approach is the kNN regression. In the
kNN approach the ‘local bandwidth’ h(x) is chosen such that exactly k observations fall
in the window. I.e. only the k nearest neighbours to x0 are used for estimating m(x0 ).30
Generally, when dim(X ) > 1, we have to smooth in various dimensions. This would
require the choice of a q × q-dimensional bandwidth matrix H (recall our paragraph
on multivariate kernel smoothing), which also defines the spatial properties of the ker-
nel, e.g. ellipsoidal support of the kernel. To better understand what a bandwidth matrix
plus multivariate kernel is doing, just imagine that for determining nearness a multidimensional
distance metric is required. One common choice is the Mahalanobis distance
$\sqrt{(X_i - x_0)'\,Var^{-1}[X]\,(X_i - x_0)}$, which is a quadratic form in (X_i − x_0), weighted by
the inverse of the covariance matrix of X. More specifically, it is the Euclidean dis-
tance^{31} after having passed all variables to a comparable scale (by normalisation). In
other words, the simplest solution to deal with this situation is to scale and turn the X i
data beforehand such that each regressor has variance one and covariance zero. This
is actually done by the Mahalanobis transformation X̃ := V̂ar[X]^{-1/2} X, with V̂ar[X]
being any reasonable estimator for the variance–covariance matrix of the covariate
vector (typically the sample covariance); recall our discussion of pair matching. Note
that then all regressors X̃ are on the same scale (standard deviation = 1) and uncorrelated.
So you basically use H := h · V̂ar[X]^{1/2}. Then, using a single value h for all
dimensions combined with a product kernel is convenient.^{32}
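A small R illustration of this standardisation (data and bandwidth are invented; all object names are ours):

# Sphere the regressors so that a single bandwidth h can be used in all directions.
X <- matrix(rnorm(200 * 3), ncol = 3) %*% matrix(c(1, .5, 0, 0, 1, .3, 0, 0, 1), 3)
S <- cov(X)                                   # estimated variance-covariance matrix
E <- eigen(S)
S.inv.half <- E$vectors %*% diag(1 / sqrt(E$values)) %*% t(E$vectors)   # Var[X]^(-1/2)
X.tilde <- X %*% S.inv.half                   # standardised, uncorrelated regressors
h <- 0.4
H <- h * (E$vectors %*% diag(sqrt(E$values)) %*% t(E$vectors))          # H = h * Var[X]^(1/2)
round(cov(X.tilde), 2)                        # approximately the identity matrix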
30 The basic difference between kNN and kernel-based techniques is that the latter estimates m(x_0) by
smoothing the data in a window around x_0 of fixed size 2h, whereas the former smoothes the data in a
neighbourhood of stochastic size containing exactly the k nearest neighbours. Furthermore, a kNN
assigns the same weight to all neighbours like the uniform kernel does.
31 Typically the Euclidean distance is understood to be just the square root of the sum of squared distances
in each dimension, but supposes linear independence of the dimensions. In our case we have to account
for the correlation structure of the regressors spanning a non-right-angled space.
32 An important exception applies when the optimal bandwidth would be infinity for one of the regressors,
which is e.g. the case with Nadaraya–Watson regression when one of the regressors is irrelevant in the
conditional mean function. Then separate bandwidths for each regressor would have to be used, such that
automatic bandwidth selectors could smooth out irrelevant variables via choosing infinitely large
bandwidths, see e.g. section 2.2.4 of Li and Racine (2007).
One should point out that all known approaches to choose h – see Köhler, Schindler
and Sperlich (2014) – are constructed to optimise the estimation of E[Y |·] = m(·) which
is not necessarily optimal for the matching or propensity score-based treatment effect
estimators. For these, the issue of optimal bandwidth choice is not yet fully resolved, but
the results of Frölich (2004) and Frölich (2005) indicate that bandwidth selectors
developed for estimating E[Y|X = x] may not perform too badly in this
context.
Bias Reduction
Various approaches have been suggested to reduce the asymptotic bias of the non-
parametric regression estimator. Unfortunately, most of these approaches have mainly
theoretical appeal and seem not really to work well in finite samples. However, as we
will see later, for many semi-parametric regression estimators the bias problem is of a
different nature, since variance can be reduced through averaging, whereas bias cannot.
Then, the reduction of the bias term can be crucial for obtaining asymptotically better
properties.
When introducing local polynomials and higher-order kernels, we could already see
their bias-reducing properties: their bias was a multiple of h^δ, with h being the band-
width and δ increasing with the order of the polynomial and/or the kernel. Typically, the
bandwidth should be chosen to balance variance and squared bias. Nonetheless, if the
bandwidth matrix converges to zero such that the squared bias goes faster to zero than
the variance, then the former can be neglected in further inference. This reduction of the
bias comes at the price of a larger variance and a lower convergence rate, a price we are
often willing to pay in the semi-parametric context. This strategy is called undersmooth-
ing as we smooth the data less than the smallest MSE would suggest. Note, however,
that without further bias reduction (by increasing p or r ), this works only for q ≤ 3 (at
most).
An alternative approach to bias reduction is based on the idea of ‘jackknifing’ (to
eliminate the first-order bias term). A jackknife kernel estimator for q = dim(X ) = 1
is defined by
$$\tilde m(x_0) = \frac{\hat m(x_0; h) - \frac{1}{c^2}\,\hat m(x_0; c\cdot h)}{1 - \frac{1}{c^2}},$$
where c > 1 is a constant,^{33} m̂(x_0; h) is the kernel estimator with bandwidth h, and
m̂(x_0; c·h) the one with bandwidth c·h. The intuition behind this estimator is as follows: the
first-order approximation to the expected value of the kernel estimator is
$$E[\hat m(x_0; c\cdot h)] = m(x_0) + \frac{c^2 h^2}{2}\,\kappa_2\,\frac{m''(x_0)f(x_0) + 2f'(x_0)m'(x_0)}{f(x_0)} .$$
Inserting this into the above expression shows that the bias of m̃(x_0) contains terms only
of order h^3 or higher. This is easy to implement for q = 1 but is otherwise rarely used in practice.
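A minimal R sketch of this jackknife construction; the data, the bandwidth and the choice c = 1.05 (one admissible value within the range quoted in the footnote) are illustrative assumptions of ours.

# Jackknife bias reduction for a one-dimensional Nadaraya-Watson estimator.
set.seed(1)
n <- 300
x <- runif(n); y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)
nw <- function(x0, h) { w <- dnorm((x - x0) / h); sum(w * y) / sum(w) }

h  <- 0.05
cc <- 1.05
x0 <- 0.3
m.tilde <- (nw(x0, h) - nw(x0, cc * h) / cc^2) / (1 - 1 / cc^2)
c(naive = nw(x0, h), jackknife = m.tilde, truth = sin(2 * pi * x0))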
33 For example, 1 < c < 1.1 is suggested e.g. in Pagan and Ullah (1999).
where X l,i and xl denote the lth element of X i and x, respectively, K is a standard (i.e.
as before) kernel with bandwidth h, δ and λ positive smoothing parameters satisfying
0 ≤ δ, λ ≤ 1. This kernel function K h,δ,λ (X i − x) measures the distance between
X i and x through three components: the first term is the standard product kernel for
continuous regressors, with h defining the size of the local neighbourhood. The second
term measures the distance between the ordered discrete regressors and assigns geo-
metrically declining weights to observations that are further away. The third term measures the
(mis-)match between the unordered discrete regressors. Thus, δ controls the amount of
smoothing for the ordered and λ for the unordered discrete regressors. For example, the
multiplicative weight contribution of the last regressor is 1 if the last elements of X_i and
x are identical, and λ if they are different. The larger δ and/or λ are, the more smoothing
takes place with respect to the discrete regressors. If δ and λ are both 1, then the
discrete regressors would not affect the kernel weights and the non-parametric estima-
tor would ‘smooth globally’ over the discrete regressors. On the other hand, if δ and
λ are both zero, then smoothing would proceed only within each of the cells defined
by the discrete regressors but not between them. If in such a situation X contained no
continuous regressors, then this would correspond to the frequency estimator, where Y
is estimated by the average of the observations within each cell. Any value between 0
and 1 for δ and λ thus corresponds to some smoothing over the discrete regressors. By
noting that
$$\prod_{l=1}^{q}\lambda^{1\!\!1\{X_{l,i}\neq x_l\}} = \lambda^{\sum_{l=1}^{q} 1\!\!1\{X_{l,i}\neq x_l\}},$$
it can be seen that the weight contribution of the unordered discrete regressors
depends only on λ and the number of regressors that are distinct between X_i and x.
where the regressors 1, . . . , q1 contain the continuous and the ordered discrete variables.
An important aspect in practice is how the information contained in unordered discrete
regressors should enter a local model, for example when the same value of λ is used
for all.
Example 2.11 Suppose we have two unordered discrete regressors: gender and region,
where region takes values in {1=North, 2=South, 3=East, 4=West, 5=North-East,
6=North-West, 7=South-East, 8=South-West} while the dummy variable ‘gender’
would enter as one regressor in a PLM or in (2.41). The situation with region is more
difficult. First, comprising the information on region in one regressor variable in the
PLM makes no sense because the values 1 to 8 have no logical meaning. Instead, one
would use seven dummy variables for the different regions. However, when using the kernel
function (2.41) one can use a single regressor variable. If one were to use seven dummy
variables instead, then the effective kernel weight used for 'region' would be λ^7 but only
λ for gender. The reason is that if two observations j and i live in different regions, they
will be different on all seven regional dummies. Hence, the implicit bandwidth would
be dramatically smaller for region than it is for gender. This would either require using
separate smoothness parameters λ1 , λ2 for region and gender or a rescaling of them by
the number of corresponding dummy variables.
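The following R sketch illustrates such generalised product-kernel weights for mixed regressors in the spirit of the discussion above: a Gaussian kernel for the continuous part, δ^{|difference|} for ordered discrete regressors and λ^{1{not equal}} for unordered ones. The function, its argument names and the toy data are our own illustrative constructions, not the book's code.

# Kernel weights for mixed data types; Xc, Xo, Xu are n x (.) matrices of continuous,
# ordered discrete and unordered discrete regressors; xc, xo, xu the evaluation point.
mixed.weights <- function(Xc, xc, Xo, xo, Xu, xu, h, delta, lambda) {
  wc <- apply(dnorm(sweep(Xc, 2, xc) / h), 1, prod)               # continuous part
  wo <- apply(delta^abs(sweep(Xo, 2, xo)), 1, prod)               # ordered discrete part
  wu <- apply(lambda^(sweep(Xu, 2, xu, FUN = "!=") * 1), 1, prod) # unordered part
  wc * wo * wu
}

# toy usage: two continuous, one ordered and one unordered regressor
set.seed(1); n <- 100
Xc <- matrix(runif(2 * n), n); Xo <- matrix(sample(0:4, n, TRUE), n)
Xu <- matrix(sample(1:8, n, TRUE), n)
w  <- mixed.weights(Xc, c(.5, .5), Xo, 2, Xu, 3, h = .2, delta = .5, lambda = .3)
# a local constant fit would then be sum(w * y) / sum(w) for an outcome vector y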
$$\hat F(y) = \hat E\big[1\!\!1\{Y \le y\}\big] = \frac{1}{n}\sum_{i=1}^{n} 1\!\!1\{Y_i \le y\}. \qquad (2.42)$$
$$\hat F(y|x) = \hat E\big[1\!\!1\{Y \le y\}\,\big|\,x\big] = \frac{1}{n}\sum_{i=1}^{n} 1\!\!1\{Y_i \le y\}\,\frac{K_h(X_i - x)}{\frac{1}{n}\sum_{j=1}^{n} K_h(X_j - x)} , \qquad (2.43)$$
$$\min_{\beta_0,\beta_1}\ \frac{1}{n}\sum_{i=1}^{n}\big\{L_\delta(Y_i - y) - \beta_0 - \beta_1'(X_i - x)\big\}^2 K_h(X_i - x) \qquad (2.44)$$
$$\frac{1}{nh\delta}\,\bar\kappa_0(K)\cdot\bar\kappa_0(L)\,\frac{f(y|x)}{f(x)} .$$
A more direct way is to recall that f (y|x) = f (y, x)/ f (x) and to derive standard
kernel density estimators for f (y, x), f (x). This actually results in an estimator being
equivalent to the local constant estimator of E[L δ (Y − y)|x].
34 This is actually much closer to the original idea of ‘kernels’ than their use as weight functions.
Example 2.12 Nadaraya–Watson regression can perform poorly because it makes only
limited use of the covariate information: the covariates enter solely through the distance metric
in the kernel function but not in the extrapolation plane. Consider a simple example
where only two binary X characteristics are observed: gender (male/female) and pro-
fessional qualification (skilled/unskilled) and both coded as 0–1 variables. Expected
wages shall be estimated. Suppose that, for instance, the cell skilled males contains no
observations. The Nadaraya–Watson estimate with h > 1 of the expected wage for
skilled male workers would be a weighted average of the observed wages for unskilled
male, skilled female and unskilled female workers, and would thus be lower than the
expected wage for skilled female workers, which is in contrast to theory and reality.
For h < 1 the Nadaraya–Watson estimator is not defined for skilled males, as the
cell is empty, and h < 1 with bounded kernels assigns weight zero to all observa-
tions. Now, if the a priori beliefs sustain that skilled workers earn higher wages than
unskilled workers and that male workers earn higher wages than female workers, then a
monotonic ‘additive’ extrapolation would be more adequate than simply averaging the
observations in the neighbourhood (even if down-weighting more distant observations).
Under these circumstances a linear extrapolation e.g. in form of local linear regression
would be more appropriate, which would add up the gender wage difference and the
wage increment due to the skill level to estimate the expected wage for skilled male
workers. Although the linear specification is not true, it is still closer to the true shape
than the flat extrapolation plane of Nadaraya–Watson regression. Here, a priori information
from economic theory becomes useful for selecting a suitable parametric hyperplane
that incorporates the covariate information more thoroughly and thereby yields better
extrapolations.
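A tiny R simulation in the spirit of this example (the wage numbers and cell sizes are invented) contrasts the flat Nadaraya–Watson extrapolation with a locally linear one when the cell of skilled males is empty.

# Two 0/1 regressors; the cell (male = 1, skilled = 1) contains no observations.
set.seed(1)
d <- expand.grid(male = 0:1, skilled = 0:1)[rep(1:3, each = 50), ]   # drop skilled males
d$wage <- 10 + 3 * d$male + 4 * d$skilled + rnorm(nrow(d))

x0 <- c(male = 1, skilled = 1)            # evaluation point with no observations
h  <- 1.5                                  # h > 1 so all cells receive positive weight
w  <- dnorm((d$male - x0["male"]) / h) * dnorm((d$skilled - x0["skilled"]) / h)

nw <- sum(w * d$wage) / sum(w)                          # flat (local constant) extrapolation
ll <- predict(lm(wage ~ male + skilled, d, weights = w),
              newdata = as.data.frame(t(x0)))           # local linear extrapolation
c(NadarayaWatson = nw, LocalLinear = unname(ll))        # NW lies below the skilled-female mean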
A direct extension of the local linear towards a local parametric estimator seems to
be a natural answer to our problem. Moreover, if we think of local linear (or, more gen-
erally, local parametric) estimators as kernel weighted least squares, one could equally
well localise the parametric maximum-likelihood estimator by convoluting it with a ker-
nel function. These will be the first semi-parametric estimators we introduce below.
Unfortunately, this does not necessarily mitigate the curse of dimensionality if the
imposed parametric structure is not used inside the kernel function. Therefore, other
models and methods have been proposed. Among them, the most popular ones are the
partial linear models (PLM); see Speckman (1988),
$$E[Y|X = x] = x_1'\beta + m(x_2), \qquad x = (x_1', x_2')' \in \mathbb{R}^{q_1+q_2},\ \beta \in \mathbb{R}^{q_1}, \qquad (2.45)$$
where x1 contains all dummy variables and those covariates whose impact can be
restricted to a linear one for whatever reason. Although the method contains non-parametric
steps, the β can often^{35} be estimated at the parametric convergence rate √n.
Also quite popular are the additive partial linear models; see Hastie and Tibshirani
(1990),
$$E[Y|X = x] = x_1'\beta + \sum_{\alpha=q_1+1}^{q} m_\alpha(x_\alpha), \qquad (2.46)$$
$$x = (x_1, x_2, \ldots, x_q) \in \mathbb{R}^{q_1+q_2},\quad \beta \in \mathbb{R}^{q_1},\quad x_\alpha \in \mathbb{R}\ \ \forall\,\alpha > q_1 .$$
The advantage is that when applying an appropriate estimator, each additive component
m α can be estimated at the optimal one-dimensional non-parametric convergence rate.
In other words, this model overcomes the curse of dimensionality. Another class that
achieves this is the single index model; see Powell, Stock and Stoker (1989) or Härdle,
Hall and Ichimura (1993),
$$E[Y|X = x] = G(x'\beta), \qquad x, \beta \in \mathbb{R}^q,\ \ G: \mathbb{R}\to\mathbb{R}\ \text{unknown}, \qquad (2.47)$$
35 Required is a set of assumptions on the smoothness of m, the distribution of X, the dimension of x_2, etc.
which is an extension of the well-known generalised linear models but allowing for an
unknown link function G. Under some regularity assumptions, the β can be estimated
at the optimal parametric rate, and G at the optimal one-dimensional non-parametric
convergence rate. A less popular but rather interesting generalisation of the parametric
linear model is the varying coefficient model, see Cleveland, Grosse and Shyu (1991).
Closely related is the idea of local parametric modelling: one specifies a local model
$$g(x, \theta_x), \qquad (2.49)$$
where the function g is known but the coefficients θ_x are unknown, and fits this local
model to the data in a neighbourhood of x. The estimate of the regression function m(x)
is then calculated as
$$\hat m(x) = g(x, \hat\theta_x).$$
The function g should be chosen according to economic theory, taking into account the
properties of the outcome variable Y .
Example 2.13 If Y is binary or takes only values between 0 and 1, a local logit
specification would be appealing, i.e.
$$g(x, \theta_x) = \frac{1}{1 + e^{\theta_{0,x} + x'\theta_{1,x}}},$$
where θ0,x refers to the constant and θ1,x to the other coefficients corresponding to the
regressors in x. This local logit specification has the advantage vis-à-vis a local linear
one that all the estimated values m̂(x) are automatically between 0 and 1. Furthermore,
it may also help to reduce the high variability of local linear regression in finite samples.
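A minimal R sketch of such a local logit fit, obtained by maximising a kernel-weighted likelihood at each evaluation point. The data are simulated and the use of glm() with a quasibinomial family (to allow non-integer kernel weights without warnings) is a practical choice of ours, not the book's implementation; the point estimates coincide with the binomial ones.

# Local logit: maximise the kernel-weighted likelihood at x0 and report m.hat(x0).
set.seed(1)
n <- 500
x <- runif(n, -2, 2)
m <- plogis(1 + 2 * x - x^2)               # true conditional probability
y <- rbinom(n, 1, m)

local.logit <- function(x0, h) {
  w   <- dnorm((x - x0) / h)               # kernel weights
  fit <- glm(y ~ I(x - x0), family = quasibinomial, weights = w)
  plogis(coef(fit)[1])                     # m.hat(x0) from the local intercept
}

x.grid <- seq(-1.5, 1.5, by = 0.5)
cbind(x = x.grid, m.hat = sapply(x.grid, local.logit, h = 0.4),
      truth = plogis(1 + 2 * x.grid - x.grid^2))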
The function g should be chosen to incorporate also other properties that one might
expect for the true function m, such as convexity or monotonicity. These properties,
however, only apply locally when fitting the function g at location x. It does not imply
that m̂(·) is convex or monotone over the entire support of X. The reason for this is
that the coefficients θx are re-estimated for every location x: for two different values x1
and x2 the function estimates are g(x1 , θ̂x1 ) and g(x2 , θ̂x2 ), where not only x changes
but also θ̂x .36
One should note that the local coefficients θx may not be uniquely identified, although
g(x, θ̂x ) may still be. E.g. if some of the regressors are collinear, θx is not unique, but
all solutions lead to the same value of g(x, θ̂x ). This was discussed in detail in Gozalo
and Linton (2000).
There are several ways to estimate the local model. Local least squares regression
estimates the vector of local coefficients θx as
$$\hat\theta_x = \arg\min_{\theta_x}\ \sum_{i=1}^{n}\{Y_i - g(X_i, \theta_x)\}^2\cdot K(X_i - x). \qquad (2.50)$$
It is embedded in the class of local likelihood estimation (see Tibshirani and Hastie 1987,
and Staniswalis 1989), which estimates θ̂x by
$$\hat\theta_x = \arg\max_{\theta_x}\ \sum_{i=1}^{n}\ln L\big(Y_i, g(X_i, \theta_x)\big)\cdot K(X_i - x), \qquad (2.51)$$
36 Note that when one is interested in the first derivative, there are two different ways of estimating it: either
as ∂ m̂(x)/∂ x, or from inside the model via ∂g(x, θ̂x )/∂ x. These are different estimators and may have
different properties. E.g. when a local logit model is used, the first derivative ∂g(x, θ̂x )/∂ x is always
between 0 and 0.25, whereas ∂ m̂(x)/∂ x is not restricted but can take any value between −∞ and ∞.
Local least squares (2.50) and local likelihood (2.51) can be solved by setting the
first derivative to zero. Therefore they can also be written as
$$\sum_{i=1}^{n}\varphi\big(Y_i, g(X_i, \theta_x)\big)\cdot K(X_i - x) = 0 \qquad (2.52)$$
for some function φ that is defined by the first-order condition. They can thus also be
embedded in the framework of local estimating equations (Carroll, Ruppert and Welsh
1998), which can be considered as a local GMM estimation but for a more general
setup.37
Gozalo and Linton (2000) showed uniform consistency of these estimators under
quite general assumptions. Simplifying, one could summarise them as follows: to the
assumptions used for the local linear regression you have to add assumptions on the
behaviour of the criterion function and the existence of (unique) solutions θ̂x , respec-
tively. Asymptotic normality can be shown when the ‘true’ vector38 θx0 is uniquely
identified. This again depends on the regularity assumptions applied.
Another interesting result – see Carroll, Ruppert and Welsh (1998) – is that the asymp-
totic theory becomes quite similar to the results for local polynomial regression when
an adequate reparametrisation is conducted. The reparametrisation is necessary as oth-
erwise some (or all) elements of the vector θ_x^0 contain (asymptotically) derivatives m^{(l)} of
different order l, including order 0, i.e. the function m(·) itself. A proper reparametrisation sepa-
rates terms of different convergence rates such that their scores are orthogonal to each
other. For example, one wants to achieve that θ_{0,x}^0 contains only m(x), and θ_{1,x}^0 only the
gradient of m(x), with the score of θ_{1,x}^0 being orthogonal to the score of θ_{0,x}, giving independent
estimates with different convergence rates. This canonical parametrisation is setting
θ_{0,x}^0 = m(x) and θ_{1,x}^0 = ∇m(x). To get from the original parametrisation of g(·) to a
canonical one, to be used in (2.52), we look for a g(X_i, γ) that solves the system of partial
differential equations g(x, γ) = θ_{0,x}, ∂g(x, γ)/∂x = θ_{1,x}, where γ depends on θ and x.
In accordance with the Taylor expansion, the final orthogonal canonical parametrisation
is then given by g(X_i − x, γ), as will also be seen in the examples below.
Example 2.14 For index models like in Example 2.13 an orthogonal reparametrisation
is already given if we use F{θ_{0,x} + θ_{1,x}'(X_i − x)}. But the canonical parametrisation to
be used in (2.52) is of the much more complex form
$$F\Big[F^{-1}(\theta_{0,x}) + \theta_{1,x}'(X_i - x)\big/F'\{F^{-1}(\theta_{0,x})\}\Big].$$
For such specifications one obtains, for the one-dimensional case with a second-
order kernel in (2.50), the bias
37 Local least squares, local likelihood and local estimating equations are essentially equivalent approaches.
However, local least squares and local likelihood have the practical advantage over local estimating
equations that they can distinguish between multiple optima of the objective function through their
objective function value, whereas local estimating equations would treat them all alike.
38 This refers to the solution for the asymptotic criterion function.
$$E\big[\hat m(x) - m(x)\big] = \frac{1}{2}\,\kappa_2 h^2\,\big\{m''(x) - g''(x, \theta_x^0)\big\}, \qquad (2.53)$$
where θ_x^0 satisfies (2.52) in expectation. The bias is of order h^2 as for the local linear
estimator. In addition, the bias is no longer proportional to m'' but rather to m'' − g''.
When the local model is linear, g'' is zero and the result is the one we obtained for local
linear regression. If we use a different local model, the bias will be smaller than for local
linear regression if
$$\big|m''(x) - g''(x, \theta_x^0)\big| < \big|m''(x)\big| .$$
Example 2.15 Recall Example 2.13 with Y being binary, and function g(·) being a logit
specification. A quadratic extension would correspond to
$$\frac{1}{1 + e^{\theta_{0,x} + (X_i - x)'\theta_{1,x} + (X_i - x)'\theta_{2,x}(X_i - x)}},$$
where θ2,x contains also coefficients for mixed terms, i.e. local interactions. This local
logit specification with quadratic extensions has the advantage of being bias-reducing, but it
requires more assumptions and is more complex to calculate. In fact, with a second-order
kernel the bias would be of order h^3 without changing the variance.
Admittedly, the discussion has been a bit vague so far since some further restrictions
are required on the local parametric model. If, e.g., the local model were the trivial
local constant one, then we should obtain the same results as for Nadaraya–Watson
regression, such that (2.53) cannot apply. Roughly speaking, (2.53) applies if the number
of coefficients in g is the same as the number of regressors in X plus one (excluding the
local constant case). Before we consider the local logit estimator in more detail, we can
generally summarise for dim(X ) = q and order(K ) = r :
T H E O R E M 2.10 Under the assumptions for the local linear regression (and some
additional assumptions on the criterion function – see Gozalo and Linton 2000) for
all interior points x of the support of X , the local parametric estimator defined as the
solution of (2.50) with a kernel of order r ≥ 2 is uniformly consistent with
$$\sqrt{nh^q}\,\{g(x, \hat\theta_x) - m(x)\} \to N\!\left(c_h\,\frac{\kappa_r(K)}{r!}\sum_{l=1}^{q}\big\{m^{(r)}_l(x) - g^{(r)}_l(x, \theta_x^0)\big\},\ \ \bar\kappa_0^{\,q}(K)\,\frac{\sigma^2(x)}{f(x)}\right),$$
where θ_x^0 is as before, m^{(r)}_l and g^{(r)}_l are the partial derivatives of order r, f(·) the density of
X, σ²(x) the conditional variance of Y, and c_h = lim_{n→∞} h^r √(nh^q) < ∞.
To be concrete, consider the local logit estimator obtained by maximising the local log-likelihood
$$\ln L(x_0, a, b) = \frac{1}{n}\sum_{i=1}^{n}\Big[Y_i\ln\Lambda\{a + b'(X_i - x_0)\} + (1 - Y_i)\ln\big(1 - \Lambda\{a + b'(X_i - x_0)\}\big)\Big]\cdot K_i ,$$
where Λ(x) = 1/(1 + e^{-x}) and K_i = K_h(X_i − x_0). We will denote derivatives of Λ(x)
by Λ'(x), Λ''(x), Λ^{(3)}(x), etc. and note that Λ'(x) = Λ(x)·{1 − Λ(x)}. Let â and b̂ be
the maximisers of ln L(x_0, a, b), with a_0, b_0 being the values that maximise the expected
value of the likelihood function E[ln L(x_0, a, b)]. Note that we are interested only in
â, and include b̂ only to appeal to the well-known property that local likelihood or
local estimating equations perform better if more than a constant term is included in the
local approximation. We estimate m(x_0) by m̂(x_0) = Λ(â). For clarity we may also
write m̂(x_0) = Λ(â(x_0)) because the value of â varies for different x_0. Similarly, a_0 is
a function of x_0, that is a_0 = a_0(x_0). The same applies to b̂(x_0) and b_0(x_0). Most of
the time we suppress this dependence to ease notation and focus on the properties at a
particular x_0.
In what follows we will also see that Λ(a_0(x_0)) is identical to m(x_0) up to an O(h^r)
term. To derive this, note that since the likelihood function is globally concave, the max-
imisers are obtained by setting the first-order conditions to zero. The values of a_0(x_0)
and b_0(x_0) are thus implicitly defined by the (1 + dim(X)) moment conditions
$$E\left[\Big\{Y_i - \Lambda\big(a_0 + b_0'(X_i - x_0)\big)\Big\}K_i\begin{pmatrix}1\\ X_i - x_0\end{pmatrix}\right] = 0$$
$$\Longleftrightarrow\quad E\left[\Big\{m(X_i) - \Lambda\big(a_0 + b_0'(X_i - x_0)\big)\Big\}K_i\begin{pmatrix}1\\ X_i - x_0\end{pmatrix}\right] = 0, \qquad (2.54)$$
where u = (X_i − x_0)/h. Assuming that m is r times differentiable, and noting that the kernel
is of order r, we obtain by Taylor expansion that Λ(a_0(x_0)) = m(x_0) + O(h^r).
Let us now examine â in more detail. We denote (a_0, b_0')' by β_0, its estimate by
β̂ = (â, b̂')', and set 𝒳_i = (1, (X_i − x_0)')'. The first-order condition of the estimator is
given by
$$0 = \sum_{i=1}^{n}\Big\{Y_i - \Lambda(\hat\beta'\mathcal X_i)\Big\}K_i\,\mathcal X_i .$$
Expanding Λ(β̂'𝒳_i) around β_0 yields for the derivative term
$$\frac{1}{n}\sum_{i=1}^{n}\Big\{\Lambda'(\beta_0'\mathcal X_i) + \Lambda''(\beta_0'\mathcal X_i)\,\mathcal X_i'(\hat\beta - \beta_0) + O_p\big(\|\hat\beta - \beta_0\|^2\big)\Big\}\mathcal X_i\mathcal X_i'\,K_i$$
$$= \begin{pmatrix} f(x_0)\Lambda'(a_0) + O_p(h^r) & h^r\frac{\kappa_r}{(r-1)!}\frac{\partial^{r-1}(\Lambda f)(x_0)}{\partial x_1^{r-1}} + o_p(h^r) & \cdots\\[4pt] h^r\frac{\kappa_r}{(r-1)!}\frac{\partial^{r-1}(\Lambda f)(x_0)}{\partial x_1^{r-1}} + o_p(h^r) & h^r\frac{\kappa_r}{(r-2)!}\frac{\partial^{r-2}(\Lambda f)(x_0)}{\partial x_1^{r-2}} & \cdots\\ \vdots & \vdots & \ddots \end{pmatrix},$$
where ∂^r(Λ f)(x_0)/∂x_l^r is a shortcut notation for all the cross derivatives of Λ and f(x_0):
$$\frac{\partial^r(\Lambda f)(x_0)}{\partial x_l^r} \equiv \sum_{j=1}^{r}\Lambda^{(j+1)}(a_0(x_0))\cdot\frac{\partial^{r-j} f(x_0)}{\partial x_l^{r-j}} . \qquad (2.57)$$
The derivations are similar to those for the local linear estimator and therefore omitted
here. An additional complication compared to the derivations for the local linear esti-
mator are the second-order terms, which however are all of lower order when (â − a0 )
and (b̂ − b0 ) are o p (1).
Similarly to the derivations for the local linear estimator one can now derive
$$e_1'\left[\frac{1}{n}\sum_{i=1}^{n}\Big\{\Lambda'(\beta_0'\mathcal X_i) + \Lambda''(\beta_0'\mathcal X_i)\,\mathcal X_i'(\hat\beta - \beta_0) + O_p\big(\|\hat\beta - \beta_0\|^2\big)\Big\}\mathcal X_i\mathcal X_i' K_i\right]^{-1}$$
$$= \frac{1}{f(x_0)\Lambda'(a_0(x_0))}\left(1,\ -h\,\frac{(r-2)!}{(r-1)!}\,\frac{\partial^{r-1}(\Lambda f)(x_0)/\partial x_1^{r-1}}{\partial^{r-2}(\Lambda f)(x_0)/\partial x_1^{r-2}},\ \ldots,\ -h\,\frac{(r-2)!}{(r-1)!}\,\frac{\partial^{r-1}(\Lambda f)(x_0)/\partial x_q^{r-1}}{\partial^{r-2}(\Lambda f)(x_0)/\partial x_q^{r-2}}\right)\big(1 + o_p(1)\big).$$
$$(2.58)$$
Putting together (2.56) and (2.58) you obtain for (2.55) that
$$\hat m(x_0) - m(x_0) = \Lambda'(a_0(x_0))\cdot e_1'\left[\frac{1}{n}\sum_{i=1}^{n}\Big\{\Lambda'(\beta_0'\mathcal X_i) + \Lambda''(\beta_0'\mathcal X_i)\,\mathcal X_i'(\hat\beta - \beta_0) + O_p\big(\|\hat\beta - \beta_0\|^2\big)\Big\}\mathcal X_i\mathcal X_i' K_i\right]^{-1}$$
$$\times\ \frac{1}{n}\sum_{i=1}^{n}\Big\{Y_i - m_i + m_i - \Lambda\big(a_0 + b_0'(X_i - x_0)\big)\Big\}K_i\,\mathcal X_i\cdot\big(1 + o_p(1)\big) + O_p(h^r)$$
$$= \frac{1}{f(x_0)}\left(1,\ -h\,\frac{(r-2)!}{(r-1)!}\,\frac{\partial^{r-1}(\Lambda f)(x_0)/\partial x_1^{r-1}}{\partial^{r-2}(\Lambda f)(x_0)/\partial x_1^{r-2}},\ \ldots,\ -h\,\frac{(r-2)!}{(r-1)!}\,\frac{\partial^{r-1}(\Lambda f)(x_0)/\partial x_q^{r-1}}{\partial^{r-2}(\Lambda f)(x_0)/\partial x_q^{r-2}}\right)$$
$$\times\ \frac{1}{n}\sum_{i=1}^{n}\Big\{Y_i - m_i + m_i - \Lambda\big(a_0 + b_0'(X_i - x_0)\big)\Big\}K_i\,\mathcal X_i\cdot\big(1 + o_p(1)\big) + O_p(h^r),$$
where m_i = m(X_i) and ∂^r(Λ f)(x_0)/∂x_1^r is as defined in (2.57). All in all, we have veri-
fied parts of Theorem 2.10 for the local logit case. The calculation of the variance is more
tedious, and the normality of the estimator can be derived by the delta method. With
similar calculations one could also derive the statistical properties for the derivatives.
$$Y = m(X_2) + X_1'\beta + U, \qquad X_1 \in \mathbb{R}^{q_1},\ X_2 \in \mathbb{R}^{q_2},$$
where the relationship between the budget share and income is left completely unspec-
√
ified. Speckman (1988) introduced several estimators for β which were n consistent
under some smoothness assumptions. The idea is to condition on X 2 and consider
Clearly, the second summand and E[U |X 2 ] equal zero. Hence, one could estimate β by
n "−1
% &% &
X 1,i − Ê [X 1 |X 2i ] X 1i − Ê [X 1 |X 2i ]
i=1
n %
&% &
× X 1i − Ê [X 1 |X 2i ] Yi − Ê [Y |X 2i ] , (2.59)
i=1
T H E O R E M 2.11 Under the assumptions of Theorem 2.7 applied to the local linear
predictors Ê[X 1 |X 2i ] and Ê[Y |X 2i ], some additional regularity conditions, and 2r >
dim(X 2 ) for the kernel order, we have for the semi-parametric estimator defined in
(2.59) that
√ d
% &
n(β̂ − β) −→ N 0, σ 2 ϕ −1 (2.60)
! !
with ϕ = E (X 1 − E[X 1 |X 2 ])(X 1 − E[X 1 |X 2 ]) and σ 2 = E (Y − E[Y |X 2 ])2 .
Consistent estimates for the variance σ²φ^{-1} are given by
$$\hat\sigma^2\left[\frac{1}{n}\sum_{i=1}^{n}\big(X_{1i} - \hat E[X_1|X_{2i}]\big)\big(X_{1i} - \hat E[X_1|X_{2i}]\big)'\right]^{-1}$$
$$\text{with}\quad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\Big\{Y_i - \hat E[Y|X_{2i}] - \big(X_{1i} - \hat E[X_1|X_{2i}]\big)'\hat\beta\Big\}^2 .$$
Alternative but less efficient estimators are those based on partialling out, i.e.
$$\hat\beta_{PO} = \left[\sum_{i=1}^{n}\big(X_{1i} - \hat E[X_1|X_{2i}]\big)\big(X_{1i} - \hat E[X_1|X_{2i}]\big)'\right]^{-1}\sum_{i=1}^{n}\big(X_{1i} - \hat E[X_1|X_{2i}]\big)\,Y_i ,$$
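A minimal R sketch of the double-residual (Speckman-type) estimator (2.59), using simple Nadaraya–Watson first steps; the data-generating process, the bandwidth and all object names are illustrative choices of ours.

# Double-residual estimator of beta in Y = X1'beta + m(X2) + U.
set.seed(1)
n  <- 500
x2 <- runif(n)
x1 <- cbind(x2^2 + rnorm(n, sd = .5), rbinom(n, 1, .5))    # two 'parametric' regressors
beta <- c(1, -2)
y  <- x1 %*% beta + sin(2 * pi * x2) + rnorm(n, sd = .3)

nw.fit <- function(v, h = 0.1)             # E.hat[v | X2] evaluated at the sample points
  sapply(x2, function(z) { w <- dnorm((x2 - z) / h); sum(w * v) / sum(w) })

e.y  <- y  - nw.fit(y)                     # Y  - E.hat[Y  | X2]
e.x1 <- x1 - apply(x1, 2, nw.fit)          # X1 - E.hat[X1 | X2], column by column
beta.hat <- solve(crossprod(e.x1), crossprod(e.x1, e.y))   # estimator (2.59)
sigma2   <- mean((e.y - e.x1 %*% beta.hat)^2)
V.hat    <- sigma2 * solve(crossprod(e.x1) / n) / n        # estimated variance of beta.hat
cbind(beta.hat, se = sqrt(diag(V.hat)))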
where β0 and ζ0 are determined by F and the expectation operator is taken with respect
to F. A semi-parametric moment estimator β̂ solves the moment equation
$$\frac{1}{n}\sum_{i=1}^{n} M(W_i, \beta, \hat\zeta) = 0. \qquad (2.61)$$
If this has a unique solution, under some regularity conditions, the estimator β̂ converges
to β0 because for n → ∞ also ζ̂ converges to ζ0 , and by the law of large numbers the
sample moment converges to the population moment.
For moment estimators it is well established that in the parametric world, i.e. for
ζ_0 known, the influence function ψ of the moment estimator β̂, i.e. the function for which
$\sqrt{n}(\hat\beta - \beta_0) = \frac{1}{\sqrt{n}}\sum_i \psi(W_i) + o_p(1)$, is given by
$$\psi(W) := -\left[\frac{\partial E[M(W, \beta, \zeta_0)]}{\partial\beta}\bigg|_{\beta_0}\right]^{-1} M(W, \beta_0, \zeta_0), \qquad E[\psi(W)] = 0. \qquad (2.62)$$
the bias and higher-order terms. For the local linear estimator we have ψ(Y_i, X_i, x) =
{Y_i − m(X_i)} K_h(X_i − x)/f(x), and it is easy to see that indeed E[ψ(W)ψ(W)']/n =
σ²(x) κ̄_0(K)/{nh f(x)}. For our semi-parametric estimators of a finite dimensional β with
infinite dimensional ζ all this looks a bit more complex. Yet, in practice it often has a
quite simple meaning as can be seen from the following example.
Example 2.16 For calculating an average treatment effect we often need to predict the
expected counterfactual outcome E[Y d ] for a given (externally set) treatment D = d.39
An example of a semi-parametric estimator is the so-called matching estimator, see
Chapter 3:
$$\hat E[Y^d] = \frac{1}{n}\sum_{i=1}^{n}\hat m_d(X_i),$$
which solves the sample moment condition
$$\frac{1}{n}\sum_{i=1}^{n}\big(\hat m_d(X_i) - \beta\big) = 0 \qquad\text{with population analogue}\qquad E\big[m_d(X_i) - \beta_0\big] = 0,$$
where β_0 = E[Y^d] and ζ_0 = m_d. For more details see the next chapter.
39 Here Y d denotes the potential outcome Y given D is set externally to d; recall Chapter 1.
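As a small illustration of this kind of estimator, the following R sketch estimates E[Y^1] by fitting m_1(x) = E[Y|X = x, D = 1] non-parametrically on the treated subsample and averaging the fits over all X_i. The data-generating process, bandwidth and kernel are our own illustrative assumptions.

# Regression-adjustment estimate of E[Y^1].
set.seed(1)
n <- 1000
x <- runif(n)
d <- rbinom(n, 1, plogis(2 * x - 1))      # treatment probability depends on x
y <- 1 + 2 * x + d * (1 + x) + rnorm(n)   # here E[Y^1] = 2 + 3 * E[X] = 3.5

m1 <- function(x0, h = 0.1) {             # Nadaraya-Watson fit on the treated only
  w <- dnorm((x[d == 1] - x0) / h)
  sum(w * y[d == 1]) / sum(w)
}
EY1.hat <- mean(sapply(x, m1))            # (1/n) sum_i m1.hat(X_i)
EY1.hat                                    # should be close to 3.5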
In this example the problem of using a non-parametric predictor for the estimation of a
finite dimensional parameter is almost eliminated by averaging it to a one-dimensional
number. We say almost because we actually need an adjustment term, say α(·), like
b(·) + R(·) for the non-parametric regression above. It is called adjustment term as it
adjusts for the nuisance term (or its estimation). For the semi-parametric estimators we
consider, this adjustment directly enters the influence function, so that we get an estimator of the
kind (2.61) with variance E[ψ(W)ψ(W)'] divided by n, where
$$\psi(W) = -\left[\frac{\partial E[M(W, \beta, \zeta_0)]}{\partial\beta}\bigg|_{\beta_0}\right]^{-1}\big\{M(W, \beta_0, \zeta_0) + \alpha(W)\big\} \qquad (2.63)$$
is the influence function, and α(W ) the adjustment term for the non-parametric estima-
tion of ζ0 . If ζ0 contains several components (subsequently or simultaneously), then the
adjustment factor is the sum of the adjustment factors relating to each component being
previously estimated. This gives the general form of how the asymptotic variance of β̂
would usually look. One still has to specify precise regularity conditions under which the
estimator actually achieves √n consistency (without first-order bias). It should also be
mentioned that there may exist situations where √n estimation of the finite-dimensional
parameter is not achievable.^{40}
How do these adjustment factors look? At least for the case where the nuisance
ζ_0 consists of ∂^λ m(x) = ∂^{|λ|}m(x)/∂x_1^{λ_1}···∂x_q^{λ_q}, i.e. partial derivatives of m(x) =
E[·|X = x] (including |λ| = 0, the conditional expectation itself), there exists a general
formula.^{41} In fact, under some (possibly quite strong) regularity assumptions it holds that
$$\alpha(w) = (-1)^{|\lambda|}\cdot\frac{\partial^\lambda\{\bar T(x)\cdot f(x)\}}{f(x)}\cdot\{y - m(x)\}, \qquad f(x)\ \text{the density of}\ X, \qquad (2.64)$$
$$\text{where}\quad \bar T(x) = E[T(W)|X = x] \quad\text{with}\quad T(w) = \frac{\partial M(w, \beta_0, \zeta)}{\partial\zeta}\bigg|_{\zeta = \partial^\lambda m(x)} . \qquad (2.65)$$
We will use this frequently, e.g. for the prediction of the expected potential outcome
E[Y d ], where d indicates the treatment.
40 A popular example is the binary fixed effects panel data M-score estimator of Manski.
41 It also exists if ζ_0 consists of the density f(x) or its derivatives, but we skip it here as we won't use that.
developed by Koshevnik and Levit (1976) and Bickel, Klaassen, Ritov and Wellner
(1993). If such a semi-parametric variance bound exists, no semi-parametric estima-
tor can have lower variance than this bound, and any estimator that attains this bound is
semi-parametrically efficient. Furthermore, a variance bound that is infinitely large tells
us that no √n consistent estimator exists.
Not surprisingly, the derivation of such bounds can easily be illustrated for the
likelihood context. Consider the log-likelihood
$$\ln L_n(\beta, \zeta) = \frac{1}{n}\sum_{i=1}^{n}\ln L(W_i, \beta, \zeta),$$
that is maximised at the values β0 and ζ0 where the derivative has expectation zero.
When the nuisance parameter ζ_0 is finite dimensional, then the information matrix
provides the Cramér–Rao lower bound for β; using partitioned inversion,
$$V^* = \big(I_{\beta\beta} - I_{\beta\zeta}I_{\zeta\zeta}^{-1}I_{\zeta\beta}\big)^{-1}, \qquad (2.66)$$
where Iββ ,Iβζ , Iζ ζ ,Iζβ are the respective submatrices of the information matrix for
(β, ζ ). For maximum likelihood (ML) estimation we obtain
$$\sqrt{n}(\hat\beta - \beta)\ \stackrel{d}{\longrightarrow}\ N(0, V^*).$$
A non-zero Iβζ indicates that there is an efficiency loss when ζ is unknown.
Now let ζ0 be non-parametric, i.e. an infinite-dimensional parameter. Then, loosely
speaking, the semi-parametric variance bound V ∗∗ is the largest variance V ∗ over all
possible parametric models that nest ln L n (β, ζ0 ) for some value of ζ .42 An estimator
that attains the semi-parametric variance bound
$$\sqrt{n}(\hat\beta - \beta)\ \stackrel{d}{\longrightarrow}\ N(0, V^{**}) \qquad (2.67)$$
is called semi-parametrically efficient. In some situations, the semi-parametric estimator
may even obtain the variance V ∗ , which means that considering ζ as a non-parametric
function does not lead to an efficiency loss in the first-order approximation. Such estimators are
called adaptive.
42 This is why in profiled likelihood estimation the estimators of the infinite-dimensional nuisance parameter
are often called the least favourable curve.
It remains to discuss how to obtain V**. Let β denote the object of interest, which
depends on the true distribution function F(w) of the data W. Let f(w) be the density of
the data. Let F be a general family of distributions and {F_θ : F_θ ∈ F} a one-dimensional
subfamily (θ ∈ ℝ) of F, with F_{θ=θ_0} being the true distribution function and F_{θ≠θ_0} the
other distributions from class F. The pathwise derivative δ(·) of β(F) is a vector of
functions defined by
$$\frac{\partial\beta(F_\theta)}{\partial\theta}\bigg|_{\theta=\theta_0} = E\big[\delta(W)\cdot S(W)\big]\Big|_{\theta=\theta_0}, \qquad (2.68)$$
such that E[δ(W)] = 0 and E[δ(W)²] < ∞, with S(w) = ∂ ln f(w|θ)/∂θ the score
function. Clearly, the latter has expectation zero for θ = θ_0 as
$$E_{\theta_0}[S(W)] = \int\frac{\partial\ln f(w|\theta)}{\partial\theta}\bigg|_{\theta=\theta_0} f(w|\theta_0)\,dw = \frac{\partial}{\partial\theta}\int f(w|\theta)\,dw\bigg|_{\theta=\theta_0} = 0,$$
provided the conditions for interchanging integration and differentiation hold. The semi-
parametric variance bound V ∗∗ /n for β̂ is then given by V ar [δ(W )]/n.43 Not surpris-
ingly, under some regularity conditions, δ(·) is the influence-function ψ(·) introduced
in the preceding paragraph.
We will see some more examples in the next chapter. A particularity there is that the
binary treatment indicator D acts as a trigger that may change the true joint distribution
f of W_i = (Y_i, X_i), where treatment occurs with probability p(x|θ) := Pr(D = 1|x; θ).
For finding δ(W), it then helps a lot to decompose the score S(w) along the three
cases d = 0, d = 1, and d − p(x). Suppressing the θ inside the functions, you use
$$f(Y, X, D) = f(Y|D, X)\,f(D|X)\,f(X) = \{f_1(Y|X)\,p(X)\}^D\,\{f_0(Y|X)\,(1 - p(X))\}^{1-D}\,f(X),$$
where f_d(Y|X) ≡ f(Y|D = d, X), d = 0, 1. This leads to the score function
$$S(w) = d\,\frac{\partial\ln f_1(y|x)}{\partial\theta} + (1 - d)\,\frac{\partial\ln f_0(y|x)}{\partial\theta} + \frac{d - p(x)}{1 - p(x)}\,\frac{\partial\ln p(x)}{\partial\theta} + \frac{\partial\ln f(x)}{\partial\theta}, \qquad (2.69)$$
giving us the set of zero-mean functions spanning the (proper) tangent space.
43 More specifically, the semi-parametric efficiency bound is equal to the expectation of the squared
projection of function δ(·) on the tangent space of the model F (for more details see Bickel, Klaassen,
Ritov and Wellner 1993) which is the space spanned by the partial derivatives of the log-densities with
respect to θ .
44 For the mathematical details see the work of Schwarz and Krivobokova (2016).
45 As always, you will certainly find examples that might be considered as exceptions like e.g. wavelet
estimators with a Haar basis and high-resolution levels.
46 This popularity is boosted by the common practice in econometrics (not so in statistics, biometrics, etc.)
to resort to the corresponding parametric inference tools, though then it no longer has much to do with
non- or semi-parametric analysis.
Throughout this subsection we keep the introduced notation, considering the problem
of estimation of the regression function m(·) in a model of the type
$$Y = m(X) + \varepsilon, \qquad E[\varepsilon|X] = E[\varepsilon] = 0, \quad Var[\varepsilon|X] = \sigma^2(X) < \infty, \qquad (2.70)$$
and having observed an i.i.d. sample {Y_i, X_i}_{i=1}^n with Y ∈ ℝ and X ∈ ℝ^q. Again, the
function m(·) is assumed to be smooth, and X a vector of continuous variables.
$$m(x) = \sum_{l=1}^{L} b_l\cdot B_l(x), \qquad \hat m(x) = \sum_{l=1}^{L}\hat b_l\cdot B_l(x), \qquad (2.71)$$
for a particular choice of smoothing (or tuning) parameter L. The coefficients b_l can be
estimated by ordinary least squares, i.e.
$$\hat b = (\hat b_1, \ldots, \hat b_L)' = \big(\mathbf B_L'\mathbf B_L\big)^{-1}\mathbf B_L'\mathbf Y,$$
where B_L denotes the n × L matrix with entries B_l(X_i) and Y = (Y_1, ..., Y_n)'.
47 Actually, there is a general confusion about the notion of what ‘non-parametric’ means as it actually
refers to an infinite-dimensional parameter or, in other words, an infinite number of parameters rather than
to ‘no parameters’.
Splines
The term spline originates from ship building, where it denoted a flexible strip of wood
used to draw smooth curves through a set of points on a section of the ship. There, the
spline (curve) passes through all the given points and is therefore referred to as an ‘inter-
polating spline’. In the regression context, interpolation is obviously not the objective;
you rather look for a smooth version of such an interpolation. Today, splines have been
widely studied in the statistics literature (see Rice 1986, Heckman 1986 or Wahba 1990
for early references) and are extensively used in different domains of applied statistics
including biometrics and engineering, but less so in econometrics. Splines are basi-
cally piecewise polynomials that are joined at certain knots which, in an extreme case,
can be all the xi observations. Therefore, they are also quite popular for non-linear
interpolation.
There exist many different versions of spline estimators even when using the same
functional basis. Take cubic polynomials, called cubic splines. One may differentiate
between the three classes: regression splines, smoothing splines and penalised splines
(also called P-splines, especially when combined with the B-spline basis – see below).
The latter ones are basically compromises between the first two and belong asymp-
totically to either one or the other class, depending on the rate at which the number
of knots increases with the sample size: see Claeskens, Krivobokova and Opsomer
(2009).
Regression Splines
One starts by defining L values ξ_l, so-called knots, that separate the interval [a, b] into
L + 1 non-overlapping intervals, i.e. a < ξ_1 < ··· < ξ_L < b with a ≤
x_{min}, b ≥ x_{max}. One could introduce the notation ξ_0 = a, ξ_{L+1} = b. Fitting a cubic
polynomial in each interval has at least two obvious drawbacks: one has to estimate
4(L + 1) parameters in total, and the function is not continuous as it may exhibit jumps
at each knot. Both can be overcome at once by imposing restrictions on the smoothness
of the estimate m̂(x) of E[Y |x]. Making m̂ continuous requires L linear restrictions, and
the same holds true for making m̂ smooth by imposing linear restrictions that also make
the first and the second derivative continuous. Then we have only 4(L + 1) − 3L = L + 4
parameters to be estimated, with a piecewise (i.e. in each interval) constant third derivative m̂'''. One
can further reduce the number of parameters to only L + 2 by imposing restrictions at
the boundaries like making m̂ to be a straight line outside [a, b]. The result is called a
natural cubic spline.
Example 2.19 Imagine we choose a single knot ξ = 0, so that we consider only two
polynomials. Then the conditional expectation of Y given x is represented as
$$m(x) = \begin{cases} m_1(x) = \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \alpha_3 x^3 & \text{for } x \le 0\\ m_2(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 & \text{for } x > 0 .\end{cases}$$
The smoothness restrictions impose: continuity of m̂(x) at x = 0 such that m_1(0) =
m_2(0), requiring β_0 = α_0; continuity of m̂'(x) at x = 0 such that m_1'(0) = m_2'(0),
requiring β_1 = α_1; and continuity of m̂''(x) at x = 0 such that m_1''(0) = m_2''(0), requiring
β_2 = α_2. So we end up with
$$m(x) = \begin{cases} m_1(x) = \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \alpha_3 x^3 & \text{for } x \le 0\\ m_2(x) = \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \alpha_3 x^3 + \theta_1 x^3 & \text{for } x > 0\end{cases}$$
with θ1 = β3 − α3 .
The idea of Example 2.19 extends to any number L > 0 of knots so that we can
generally write a cubic regression spline as
$$m(x) = \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \alpha_3 x^3 + \sum_{l=1}^{L}\theta_l\,(x - \xi_l)^3_+ \qquad\text{where } z_+ = z\cdot 1\!\!1\{z > 0\}. \qquad (2.72)$$
The coefficients can be estimated by ordinary least squares, regressing Y on the polynomial
and (x − ξ_l)^3_+ terms. However, procedures based on this simple representation are often
unstable, as for many knots (large L) the projection matrix is often almost singular. In
As the estimator is a parametric approximation of the true function but without penal-
ising wiggliness or imposing other smoothness than continuity (of the function and
some derivatives), the final estimator now has a so-called ‘approximation bias’ but no
‘smoothing bias’. Nonetheless, the number of knots L plays a similar role as the band-
width h for kernel regression or the number of neighbours in the kNN estimator. For
consistency L must converge to infinity but at a slower rate than n does. For L close to
n you interpolate (like for h = 0), whereas for L = 0 you obtain a simple cubic poly-
nomial estimate (like for h = ∞ in local cubic regression). One might use generalised
cross-validation (2.39) to choose a proper L.
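A minimal R sketch of such a cubic regression spline, built from the truncated power basis in (2.72) and fitted by OLS; the number of knots, their placement and the simulated data are ad hoc illustrative choices of ours.

# Cubic regression spline with equidistant interior knots, fitted by OLS.
set.seed(1)
n <- 300
x <- sort(runif(n)); y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

L     <- 8
knots <- seq(min(x), max(x), length.out = L + 2)[2:(L + 1)]   # interior knots
tp    <- outer(x, knots, function(u, k) pmax(u - k, 0)^3)     # (x - xi_l)_+^3 terms
fit   <- lm(y ~ x + I(x^2) + I(x^3) + tp)                     # OLS on the spline basis
m.hat <- fitted(fit)                                          # estimated regression curve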
Smoothing Splines
As both the number and location of knots is subject to the individual choice of the
empirical researcher, the so-called smoothing splines gained rapidly in popularity. They,
in the end, are a generalisation of the original interpolation idea based on cubic splines.
The motivation for this generalisation is twofold: in model (2.70) one does not want
to interpolate the Y with respect to the X but smooth out the errors ε to identify the
mean function. This way one also gets rid of the problem that arises when several (but
different) responses Y for the same X are observed (so-called bins). Pure interpolation
is not possible there, and the natural solution would be to predict for those X the average
of the corresponding responses. The smoothing now automatically tackles this problem.
Smoothness is related to ‘penalisation’ if smoothing is a result of keeping the dth
derivative m (d) (·) under control. More specifically, one penalises for high oscillations
by minimising
$$\sum_{i=1}^{n}\big\{y_i - m(x_i)\big\}^2 + \lambda\int_a^b\big\{m^{(d)}(x)\big\}^2\,dx , \qquad (2.73)$$
with m(·) typically being a polynomial and λ the smoothing parameter corresponding to
the bandwidth. It controls the trade-off between optimal fit to the data (first part) and the
roughness penalty (second part). Evidently, for λ = 0 the minimising function would be
the interpolation of all data points, and for λ → ∞, the function becomes a straight line
with m (d) ≡ 0 that passes through the data as the least squares fit. As above, it can be
chosen e.g. by (generalised) cross-validation.
Reinsch (1967) considered the Sobolev space of C 2 functions with square integrable
second derivatives (d = 2). Then the solution to (2.73) is a piecewise cubic polynomial
whose third derivative jumps at a set of points of measure zero. The knots are the data
points {x_i}_{i=1}^n. Hence, the solution itself, its first and its second derivative are contin-
uous everywhere. The third derivative is continuous almost everywhere and jumps at
the knots. The fourth derivative is zero almost everywhere. These conditions provide
a finite dimensional set of equations, for which explicit solutions are available. Actu-
ally, smoothing splines yield a linear smoother, i.e. the fitted values are linear in Y.
48 See, for example, chapter 2 of Hastie and Tibshirani (1990) for further details.
For a particular case (thin plate splines), see below. Similar to kernel estimators, the
method is a penalised (i.e. smoothed) interpolation. Therefore these estimators have
only a smoothing (also called shrinkage) but no approximation bias. It disappears with
λ going to zero (while n → ∞ as otherwise its variance would go to infinity). Again,
cross validation is a popular method for choosing λ.
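In R, a cubic smoothing spline of this kind is available in the base stats package; the small example below (simulated data, GCV for λ) is an illustration of ours.

# Cubic smoothing spline with lambda chosen by generalised cross-validation.
set.seed(1)
x <- runif(300); y <- sin(2 * pi * x) + rnorm(300, sd = 0.3)
fit <- smooth.spline(x, y, cv = FALSE)        # cv = FALSE selects lambda by GCV
plot(x, y, col = "grey")
lines(predict(fit, seq(0, 1, by = 0.01)), lwd = 2)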
Penalised Splines
Eilers and Marx (1996) introduced a mixture of smoothing and regression splines. The
idea is to use many knots (i.e. large L) such that one does not have to care much about
their location and approximation error. For example, for the set of knots one often takes
every fifth, tenth or twentieth observation xi (recall that we assume them to be ordered).
As many knots typically lead to a large variance of the coefficients that correspond to
the highest order, our θl in (2.72), one introduces a penalisation like for the smoothing
splines. More specifically, one still considers a regression problem like in (2.72) but
restricting the variation of the coefficients θl . This can be thought of as a mixed effects
model where the α_k, k = 0, 1, 2, 3 are fixed effects, and the θ_l, l = 1, ..., L are treated
like random effects. Then, λ from (2.73) equals the ratio of the error variance σ_ε² to the
variance σ_θ² of the θ. For a stable implementation one often does not simply use the
polynomials from (2.72) but a more complex spline basis; see below for some examples.
The final estimator is the minimiser of
$$\sum_{i=1}^{n}\Big\{y_i - \sum_l b_l B_l(x_i)\Big\}^2 + \lambda\int_a^b\Big[\Big\{\sum_l b_l B_l(x)\Big\}^{(d)}\Big]^2 dx , \qquad (2.74)$$
where [. . .](d) indicates the dth derivative. In Equation 2.74 we have not specified the
limits for index l as they depend on the chosen spline basis. How in general a penalised
regression spline can be transformed into a mixed effects model in which the penalisa-
tion simply converts into an equilibration of σ_θ² vs σ_ε² is outlined in Currie and Durban
(2002) and Wand (2003). Clearly, the bias of this kind of estimator is a combination of
approximation and shrinkage bias.
While for the regression splines the main (but in practice often unsolved) question
was the choice of number and placing of knots, for smoothing and penalised splines the
proper choice of parameter λ is the focus of interest. Today, the main two competing
procedures to choose λ (once L is fixed) are generalised cross-validation and the so-
called restricted (or residual, or reduced) maximum likelihood (REML) which estimates
the variances of ε and of the ‘random effects’ θ simultaneously. Which of these methods
performs better depends on the constellation of pre-fixed L and the smoothness of m(·).
Not surprisingly, depending on whether L is relatively large or whether λ is relatively
small, either the approximation or the shrinkage bias is of smaller order.
Let p denote the degree of the polynomial pieces, e.g. p = 3 for cubic ones. Define knots as before from a = ξ_0 to ξ_{L+1} = b, and
set ξ_j = ξ_0 for j < 0, ξ_j = ξ_{L+1} for j > L such that the interval over which the
spline is to be evaluated lies within [a, b]. Recall representation (2.71), but for nota-
tional convenience and only for the next formula let us provide the basis functions B_l
with a hyperindex indicating the polynomial order, i.e. B_l^p. Then, a B-spline of order p
is defined recursively as
$$B_l^p(x) = \frac{x - \xi_l}{\xi_{l+p} - \xi_l}\,B_l^{p-1}(x) + \frac{\xi_{l+p+1} - x}{\xi_{l+p+1} - \xi_{l+1}}\,B_{l+1}^{p-1}(x)$$
$$\text{with}\quad B_l^0(x) = 1\!\!1\{\xi_l \le x < \xi_{l+1}\}. \qquad (2.75)$$
The use of a B-spline basis within penalised splines led to the expression P-splines. They
are particularly popular for non-parametric additive models. The often praised simplicity
of P-splines gets lost, however, when more complex knot spacing or interactions are
required.
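For completeness, a short R sketch of a penalised spline fit with a B-spline basis and many knots; it relies on the mgcv package (an assumption of this sketch) and selects the smoothing parameter by REML, one of the two criteria discussed above.

# Penalised (P-)spline fit; simulated data for illustration only.
library(mgcv)
set.seed(1)
x <- runif(400); y <- sin(2 * pi * x) + rnorm(400, sd = 0.3)
fit <- gam(y ~ s(x, bs = "ps", k = 40), method = "REML")   # k = size of the basis
fit$sp                                                     # estimated smoothing parameter
plot(fit, residuals = TRUE)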
Thin plate splines were invented to avoid the allocation of knots, and to facilitate an
easy extension of splines to multivariate regression. They are often used for smoothing
splines, i.e. to estimate the vector of E[Y|x_1, x_2, ..., x_n] by minimising
$$\|\mathbf Y - \mathbf m\|^2 + \lambda\,J_r(\mathbf m), \qquad \mathbf m = \big(m(x_1), m(x_2), \ldots, m(x_n)\big)',\quad x_i \in \mathbb{R}^q, \qquad (2.76)$$
where J_r is a penalty for wiggliness, J_r(m) = ∫(∂^r m/∂u^r)² du in the univariate case, and
otherwise
$$J_r(m) = \sum_{\nu_1+\cdots+\nu_q = r}\frac{r!}{\nu_1!\cdots\nu_q!}\int\!\!\cdots\!\!\int\left(\frac{\partial^r m}{\partial u_1^{\nu_1}\cdots\partial u_q^{\nu_q}}\right)^2 du_1\cdots du_q .$$
For visually smooth results 2r > q + 1 is required. The calculation of thin plates is com-
putationally costly, so that today only approximations, the so-called thin plate regression
splines are in use. It can be shown that the solution of them (or their simplification) to
estimate m(·) at a given point depends only on the Euclidean distances between the
observations xi , i = 1, . . . , n and that point. Therefore, for q > 1 one also speaks of
isotropic thin plate smoothing.
An extension of the simple regression cubic splines or B- or P-splines to higher
dimensions is much less obvious. The method that’s probably most frequently used
is applying the so-called tensor products. The idea is pretty simple: for each variable
X_j, j = 1, ..., q calculate the spline basis functions B_{j,l}(x_{j,i}), l = 1, ..., L_j, for all
observations i = 1, ..., n. Then expression (2.71) becomes (for a given point x_0 ∈ ℝ^q)
$$m(x_0) = \sum_{l_1=1}^{L_1}\cdots\sum_{l_q=1}^{L_q} b_{l_1\cdots l_q}\prod_{j=1}^{q}B_{j,l_j}(x_{0,j}), \qquad b_{l_1\cdots l_q}\ \text{unknown}$$
(here for simplicity without penalisation). This looks quite complex though it is just
crossing each basis function from one dimension with all basis functions of all the other
dimensions. Already for q = 3 this gets a bit cumbersome. Unfortunately, using thin
plates or tensor products can lead to quite different figures, depending on the choice of
knots and basis functions. These problems lead us to favour kernel based estimation,
although splines are attractive alternatives when only one or two continuous covariates
are involved or additivity is imposed.
2.3 Bibliographic and Computational Notes
The identification strategy discussed in Section 2.1.3 was justified by a reasoning illustrated in the graphs of Figure 2.9 (except the left
one, which is a counter-example). To our knowledge this strategy was first discussed
in detail in Baron and Kenny (1986). In that article they put their main emphasis on
the distinction between moderator and mediator variables, respectively, followed by
some statistical considerations. More than twenty years later Hayes (2009) revisited this
strategy and gave a brief review of the development and potentials. More recent method-
ological contributions to this approach are, for example, Imai, Keele and Yamamoto
(2010) and Albert (2012); consult them also for further references.
Regarding more literature on the identification of treatment effects via the back door,
we refer to a paper that tries to link structural regression and treatment effect analysis
by discussing how each ATE or ATET estimator relates to a regression estimator in a
(generalised) linear model. This was done in Blundell and Dias (2009).
We only gave a quite selective and narrow introduction to non- and semi-parametric
regression. The literature is so abundant that we only give some general references and
further reading to related literature that could be interesting in the context of treatment
effect estimation. For a general introduction to non- and semi-parametric methods for
econometricians see, for example, Härdle, Müller, Sperlich and Werwatz (2004), Li and
Racine (2007), Henderson and Parmeter (2015), Yatchew (2003) or Pagan and Ullah
(1999).
Semi-parametric efficiency bounds were introduced by Stein (1956) and developed
by Koshevnik and Levit (1976). Further developments were added by Pfanzagl and
Wefelmeyer (1982), Begun, Hall, Huang and Wellner (1983) and Bickel, Klaassen,
Ritov and Wellner (1993). You might also consult the survey of Newey (1990), or the
same ideas reloaded for the econometrics audience in Newey (1994). Chen, Linton and
van Keilegom (2003) extended these results to non-smooth criterion functions, which
are helpful e.g. for quantile estimators.
Interesting for estimating the propensity score is also the literature on single index
models, see for example Härdle and Stoker (1989) and Powell, Stock and Stoker
(1989) for average derivative-based estimators, Klein and Spady (1993) for a semi-
parametric maximum-likelihood-based one, and Ichimura (1993) for a semi-parametric
least squares approach.
More references to additive and generalised additive (or related) models can be
skipped here as they are treated in the mentioned compendia above. Typically not treated
there are estimators that guarantee monotonicity restrictions. One approach is to modify
the estimator to incorporate the monotonicity restriction in the form of constrained opti-
misation; see e.g. Mammen (1991), Hall, Wolff and Yao (1999) or Neumeyer (2007),
among others. Alternatively, one could rearrange the estimated function; see e.g. Dette,
Neumeyer and Pilz (2006), Dette and Pilz (2006) or Chernozhukov, Fernandez-Val and
Galichon (2007).
In R, the package np extends the non-parametric methods that are already available in the
basic installation (e.g. density) and in the somewhat older package KernSmooth.
Almost all mentioned packages allow the estimation of various kinds of non- and semi-
parametric regression models, univariate and multivariate, and are able to compute data-
driven bandwidths. The np package uses the discussed kernel extensions for treating
discrete and quantitative variables at once; recall Equation 2.40.
Among the different options present in the np package, there is the possibility to
estimate semi-parametric partial linear models with the function npplreg. Also, the
package gplm is able to estimate models of the form E(Y|X_1, X_2) = G{X_1'β + m(X_2)}.
Both the PLM (2.45) and the generalised PLM with link can be estimated using a
Speckman-type estimator or backfitting (setting the kgplm option to speckman or
backfit), and partial linear additive models (2.46) can be estimated with gam and
mgcv. For single-index models (2.47) and varying coefficient models (2.48), the func-
tions npindex and npscoef are available. Clearly, when using splines or other sieves,
these models can also be estimated with the aid of other packages. While the np package
uses kernel-based estimators also for semiparametric regression, the SemiPar package
uses a (penalised) spline. It has a somewhat larger variety as it includes, for example,
mixed effects models via the packages mgcv and lmeSplines, which are both con-
structed to fit smoothing splines; see also smooth.spline. For more details consult
the help files of the respective commands and package descriptions.
Also, Stata offers the possibility to fit several non- and semi-parametric mod-
els with different commands. It allows to compute and plot local regression
via kernel-weighted local polynomial smoothing (lpoly) but also applies splines
(mkspline, bsplines and mvrs), penalised splines (pspline), fractional poly-
nomials (fracpoly, mfp) or lowess (the latter two methods were not discussed
here). For (generalised or partial linear) additive models you may use gam.
2.4 Exercises
1. Consider the example graphs in Figure 2.16. Which one is a DAG? Can we d-
separate X and Y by conditioning? For which variables W does X ⊥ Y |W hold?
Justify your answers.
2. Consider the graphs in Figure 2.17 and decide whether conditioning on X is nec-
essary or not in order to identify the (total and/or direct) causal impact of treatment
D on outcome Y. Note that in all these graphs the arrows pointing to Y, D and X are
omitted if they come from some unobservables U.
3. Prove the statement made in Example 2.4.
Figure 2.17 Example graphs (a) to (h) from the upper left to the lower right
Figure 2.19 Three examples, (a), (b) and (c) from left to right
4. Note that in equation (2.8) the central assumption is U ⊥⊥ D|X . In which of the
graphs of Figure 2.18 is this assumption satisfied? Justify your answers.
5. Consider the graph (a) in Figure 2.19. Discuss the identifiability of direct and indi-
rect effects in all three graphs. How could you test Y ⊥⊥ D|X when comparing (a)
with (b), and what are the potential problems when looking at (c)?
6. Note first that a differentiable function is Lipschitz continuous if its first derivative is bounded. Based on this information, discuss to what extent the functions $x^2$ and $\sqrt{x}$ are Lipschitz continuous. Discuss also if they are Hölder continuous (and on which support).
7. Derive the Nadaraya–Watson estimator from the definition of conditional expectations, using the fact that $\frac{1}{nh}\sum_{i=1}^{n} K\{(x - X_i)/h\}$ and $\frac{1}{nh^2}\sum_{i=1}^{n} K\{(x - X_i)/h, (y - Y_i)/h\}$ are kernel estimators for the densities $f(x)$ and $f(x, y)$, respectively. Here, $K(\cdot,\cdot)$ stands for a bivariate kernel $K: \mathbb{R}^2 \to \mathbb{R}$.
8. Recall the definition of multiplicative kernels (2.30). Show that $\prod_{l=1}^{q} K(v_l)$ is an $r$th-order kernel function if each of the one-dimensional kernels $K(v_l)$ is so.
9. Derive the local quadratic estimator for a two-dimensional regression problem. Give
the expressions you obtain for the estimators of the partial first and second deriva-
tives of the regression function. How could this estimator be simplified if we knew
that the impact of the two covariates were additively separable?
10. Prove Equation 2.21 by inserting the definition of the weights given in (2.17).
11. Recall the calculations that lead to the result in Equation 2.24. What would have
happened if a third-order kernel (instead of a second-order one) had been used?
More generally, what bias would result from an r th order kernel (given that nothing
else changed)?
12. Imagine you tried to approximate an unknown one-dimensional function by a poly-
nomial of arbitrary degree p < n when the true underlying functional form is
a simple log-linear one. Simulate such a regression function E[Y |X ] with X ∼
U [0.1, 10], n = 50 and Y = log(X ) + e, where e ∼ N (0, 1). Then repeat the
exercise with a simple local linear estimator, alternately setting h = 0.5, 1 and 5.
The kernel function K might be the Epanechnikov, Quartic or Gaussian kernel. If
you take the last one, divide the proposed values for h by 2. For details see Härdle,
Müller, Sperlich and Werwatz (2004).
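One possible way to set up the simulation asked for in Exercise 12 is sketched below (one simulation draw only, Gaussian kernel, hence the halved bandwidths); the hand-coded local linear estimator and all concrete choices are our own illustrative suggestions, not the only valid solution.

```r
# One way to tackle Exercise 12: polynomial vs local linear fit (Gaussian kernel).
set.seed(42)
n <- 50
x <- runif(n, 0.1, 10)
y <- log(x) + rnorm(n)

p     <- 5                                           # polynomial of 'arbitrary' degree p < n
fit_p <- lm(y ~ poly(x, degree = p, raw = TRUE))

loclin <- function(x0, x, y, h) {                    # local linear fit at x0
  w <- dnorm((x - x0) / h)                           # Gaussian kernel weights
  X <- cbind(1, x - x0)
  b <- solve(t(X) %*% (w * X), t(X) %*% (w * y))
  b[1]                                               # intercept = fitted value at x0
}

grid <- seq(0.1, 10, length.out = 200)
fits <- sapply(c(0.25, 0.5, 2.5),                    # h = 0.5, 1, 5 divided by 2
               function(h) sapply(grid, loclin, x = x, y = y, h = h))

plot(x, y)
lines(grid, log(grid), lwd = 2)                      # true regression function
matlines(grid, fits, lty = 2)                        # local linear fits
lines(grid, predict(fit_p, newdata = data.frame(x = grid)), lty = 3)  # polynomial fit
```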
13. Recall the canonical reparametrisation introduced in the context of local parametric estimation. Consider the Cobb–Douglas production function $g(z, \gamma_x) = \gamma_0 \prod_{l=1}^{q} z_l^{\gamma_l}$ and derive its canonical reparametrisation $g(z, \theta_x)$.
14. Let D be binary. Imagine we want to estimate $E[Y^1]$ from the sample $\{(Y_i, X_i, D_i)\}_{i=1}^{n}$ by solving $\frac{1}{n}\sum_{i=1}^{n}\left\{Y_i D_i\, p^{-1}(X_i) - \beta\right\} = 0$ with $p(\cdot) := E[D|\cdot]$ the propensity score, such that the solution $\hat\beta$ is our estimator.49 Recall Equation 2.61: show that the influence function (2.63) is equal to
$$\psi(W) = \frac{D\,(Y - m_1(X))}{p(X)} + m_1(X) - \beta$$
by finding the correct adjustment factor.
We start with an introduction to the practical use of the CIA and modified assumptions
to identify the average or conditional treatment effect. This is done through the use
and introduced above as the conditional independence assumption (CIA). In some liter-
ature it is also called the selection on observables assumption, which essentially means
that there is no further selection on unobservables that is also affecting the outcome Y .
This assumption (3.1) implies the conditional mean independence $E[Y^d|X, D] = E[Y^d|X]$, which is a much weaker assumption but often sufficient for our purposes. Both assumptions are most easily understood in the treatment evaluation context where the treatment
variable D is only binary. Then, by this assumption, we can identify average potential
outcomes as
$$E[Y^d] = \int E[Y|X, D = d]\, dF_X.$$
The adjustment for the distribution of the covariate vector X (i.e. integrating with respect to $dF_X$) is just an application of the law of large numbers to $g(X) := E[Y|X, D = d]$. As long as the samples are representative of the population regard-
ing the distribution of X , such an integral can be approximated sufficiently well by the
sample average. The remaining statistical task is limited to the prediction of the condi-
tional expectations E[Y |X, D = d] for all combinations of X and D. This approach is
also known as the nonparametric regression method.
Example 3.1 Let D ∈ {0, 1} indicate whether or not an individual continues to univer-
sity after secondary school graduation. Suppose that the decision to enrol in a university
depends on only two factors: the examination results when finishing secondary school
and the weather on that particular day. Without controlling for the secondary school
With an analogous derivation as in (2.7), cf. Exercise 1, we can identify the ATET
also by
$$E[Y^1 - Y^0|D = 1] = E[Y|D = 1] - \int E[Y|X, D = 0]\, dF_{X|D=1}, \qquad (3.4)$$
where we used E[Y 1 |D = 1] = E[Y |D = 1], i.e. that the observed outcome is identical
to the potential outcome Y 1 among those actually being treated. We observe a possibly
important difference to the identification of the ATE. For the ATET we only need
(AT1) Y 0 ⊥⊥ D|X
So for identification of ATET we do not need that Y 1 ⊥⊥ D|X and thus also do not need
that (Y 1 −Y 0 ) ⊥⊥ D|X . Hence, we can permit that Y 1 as well as the individual treatment
effects may differ between treated and controls, where such differences might be due to
unobservables. We could, for example, permit that individuals might have chosen their
treatment status D on the basis of their (expected) treatment gains (Y 1 − Y 0 ) but only
if we can rule out that this depends on Y 0 (i.e. that their choice of treatment status
was based on Y 0 ). This is different from identification of the ATE, and this difference
could be relevant in applications when we have good predictors for the individual non-
treatment outcome Yi0 , such that by controlling for their X i we can eliminate selection
bias for Yi0 , even when we know little about the treatment gains (Yi1 − Yi0 ) themselves.
The latter may largely reflect unobservables that are possibly known to the individuals
1 The whole impact of the weather on that day on future earnings is channelled through the enrolment D.
but not to the econometrician. This is not permitted for ATE. In the same way you can
argue that for identifying ATEN we only need conditional mean independence of the
form E[Y 1 |X, D] = E[Y 1 |X ] whereas we do not need this for Y 0 .
This difference can be a relevant relaxation in some applications.2 The selection-on-
observables assumption required for ATE rules out the possibility that individuals can
guess their potential outcomes and then choose the treatment with the highest (poten-
tial) outcome. In other words, in Chapter 1 we required that the probability of choosing
a particular programme must not be affected by the potential outcomes. For CIA now,
treatment selection is allowed to depend on anticipated potential outcomes as long as
these are anticipated exclusively on the basis of observed characteristics X . But if look-
ing at ATET, we can take advantage of the fact that for the (sub-)population of the
treated, their average outcome of Y 1 is the average of their observed outcome Y . Hence,
one has only a problem with the prediction of E[Y 0 |D = 1]. It is just for the non-
treatment state Y 0 where one has to control for all relevant factors to estimate its mean.
We do not need to predict E[Y 1 |D = 0] or E[(Y 1 − Y 0 )| D = 0].
To gain some intuition as to what the non-parametric regression treatment effect estimator does, suppose you have a few different values x for X but a reasonably large number of people for each x in all groups. Then we can perform a step-wise averaging: first predict for any observed vector x the conditional expectations $E[Y^d|X = x]$ by $\frac{1}{n_{d,x}}\sum_{i: D_i = d,\, X_i = x} Y_i$, with $n_{d,x}$ being the number of individuals in group $D = d$ with characteristics $X = x$. Secondly, you set, for $n_d = \sum_x n_{d,x}$, $d = 0, 1$,
$$\widehat{ATE} = \frac{1}{n}\sum_x (n_{0,x} + n_{1,x})\left(\hat E[Y^1|X = x] - \hat E[Y^0|X = x]\right),$$
$$\widehat{ATET} = \frac{1}{n_1}\sum_x n_{1,x}\left(\hat E[Y^1|X = x] - \hat E[Y^0|X = x]\right),$$
$$\widehat{ATEN} = \frac{1}{n_0}\sum_x n_{0,x}\left(\hat E[Y^1|X = x] - \hat E[Y^0|X = x]\right).$$
In practice you often have too many different values of x to use such simple cell averaging; therefore you include the neighbours. Although this requires more sophisticated non-parametric estimators, the idea stays the same. So we obtain estimates for the ATE by first estimating the regression functions $E[Y|X, D]$, then predicting $E[Y|X_i, D = d]$ for all individuals $i = 1, \ldots, n$ and all $d$, and finally calculating the difference of their sample averages. For the ATET, cf. (3.4), it is sufficient to do this just for $d = 0$ and to compare it with the average of observed outcomes $Y^1$. A regression estimator for the ATE is therefore of the form
$$\widehat{ATE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat m_1(X_i) - \hat m_0(X_i)\right), \qquad (3.5)$$
$$\widehat{ATET} = \frac{1}{n_1}\sum_{i: D_i = 1}\left(Y_i - \hat m_0(X_i)\right) \quad\text{with } n_1 = \sum_{i=1}^{n} 1\!1\{D_i = 1\}, \qquad (3.6)$$
bias reduction approaches. The practical difference is that the regression approach is
not necessarily non-parametric as we will briefly discuss in the next subsection. There-
fore, one advantage often brought forward is that matching estimators (in contrast to
regression-based ones) are entirely non-parametric and thus do not rely on functional
form assumptions like linearity. This permits in particular a treatment effect heterogene-
ity of any form. This advantage is of special relevance as the distribution of X can
– and typically will – be very different inside the treatment and non-treatment group,
respectively. In parametric estimation the distribution of X has an essential impact on
the parameter estimates (typically ignored in parametric econometrics). Therefore, pre-
diction typically works worse the more the distribution of X differs between treatment
and non-treatment group. This, however, is exactly the case in the treatment effect esti-
mation context as only those characteristics X can be confounders that differ (a lot)
in distribution between the two groups. In fact, only variables X showing a significant
variation between D = 0 and D = 1 can identify selection.
Note finally that matching will be more efficient the more observations we use for pre-
dicting the counterfactual outcome. In other words, matching becomes efficient when it
collapses with the non-parametric regression. Therefore we will often use the notation of
matching and regression estimation synonymously and only distinguish between them
where necessary. Most importantly, whenever we refer to parametric regression, this
will be made explicit as this is different from matching and non-parametric regression
in several aspects.
The first condition is essentially non-testable. Although it can be tested whether some
variables do affect D or Y , it is impossible to ascertain by statistical means whether
there is no omitted (unobserved) variable which, consciously or unconsciously, affected the process determining the choice of D but otherwise has no impact on Y . In practice, identi-
fication by (CIA) is easier to achieve the more bureaucratic, rule-based and deterministic
the programme selection process is, provided the common support condition applies.
In contrast, our CSC assumption (A2) can be tested, and if rejected, the object of estimation can be adapted by redefining the population of interest such that the CSC holds. How does this work? Let X0, X1 be the supports of X within the control and the treatment group, respectively, and let X01 = X0 ∩ X1 (i.e. the intersection) be the common support of the treatment and control group. Note that assumption (A2) is equivalent to requiring X0 = X1 (= X01). Hence, if (A2) fails for your original data, then one can still identify the treatment effects for all people having characteristics from the common support X01. So we can simply
declare this subpopulation to be our population of interest. It cannot be answered gener-
ally whether this ‘solution’ always satisfies our curiosity, but at least the subpopulation
and its treatment effect are well defined. One therefore speaks of the common support
condition (CSC) though it is often expressed in terms of the propensity score like in
(3.7).
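In practice the common support is often examined via the estimated propensity score. The following sketch (simulated data, a simple logit model, and a crude min–max overlap rule) is one possible way to flag and discard observations outside the estimated overlap region; it is an illustration under these assumptions, not a prescribed procedure.

```r
# Common-support check via an estimated propensity score (simulated data).
set.seed(2)
n   <- 500
x1  <- rnorm(n); x2 <- rnorm(n)
d   <- rbinom(n, 1, plogis(0.8 * x1 - 0.5 * x2))
dat <- data.frame(d, x1, x2)

# estimated propensity score from a simple logit (a modelling choice, not prescribed)
dat$ps <- fitted(glm(d ~ x1 + x2, family = binomial, data = dat))

# crude min-max overlap rule: keep scores that lie within both groups' range
lo <- max(min(dat$ps[dat$d == 1]), min(dat$ps[dat$d == 0]))
hi <- min(max(dat$ps[dat$d == 1]), max(dat$ps[dat$d == 0]))

dat01 <- subset(dat, ps >= lo & ps <= hi)   # redefined population of interest
table(kept = dat$ps >= lo & dat$ps <= hi, treated = dat$d)
```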
In practice, if the common support condition is violated, the problem is often that
the support of X within the treated is just a subset of that within the control group.
The reason is that the projects typically target certain subpopulations but on a voluntary
basis. It is quite likely that we observe all kinds of people among the non-treated, whereas among the treated we observe only those who were eligible for the project. The good news is that this reduced common support is all we need to identify the ATET.3 We define assumption (AT2) for the ATET accordingly as requiring P(D = 1|X = x) < 1 for all x.
Example 3.2 When applying for a master programme, the university may require a minimum grade (in some examination results). We therefore can find individuals with very low grades in the population not attending university, but we may not find university graduates for particularly low values of X . Hence, we will not be able to find a proper comparison group for such levels of X .4 If we know the rules to enrol in university, we would know exactly which x values cannot be observed in the D = 1 population.
3 In addition, recall that for the identification of ATET it was sufficient to have $Y^0 \perp\!\!\!\perp D|X$ instead of
requiring the complete CIA. So both necessary conditions are relaxed for identifying ATET compared to
ATE.
4 For example, in active labour market programmes ‘being unemployed’ is usually a central condition for
eligibility. Thus, employed persons cannot be participants as they are not eligible and, hence, no
counterfactual outcome is identified for them.
In addition to such formal rules, there are often many other factors unknown to us that
make the choice of D = 1 or D = 0 extremely unlikely. For example, parental income
may matter a lot for attending university whereas for very low incomes it might be
impossible to attend university. However, we do not know the threshold a priori and
would thus not know the common support ex-ante.
Here we have seen another advantage of matching: it highlights the importance of the
support condition; although matching does not solve a support problem, it visualises it.
We already gave some intuitive explanations and examples for the selection bias prob-
lem we might face in treatment effect estimation. Let us briefly revisit this problem more
formally. We are interested in estimating average potential or differences in outcomes,
i.e.
$$E[Y^d], \qquad E[Y^{d_2} - Y^{d_1}],$$
where the outcome could be for example wages or wealth after the treatments d1 and d2 .
The endogeneity of D due to (self-)selection implies that
$$E[Y^d|D = d] \neq E[Y^d],$$
so that a simple estimation of $E[Y^d|D = d]$ will not identify the mean potential out-
come. The literature on matching estimators largely evolved around the identification
and estimation of treatment effects with a binary variable D. Following this discussion
we consider the problem of estimating the ATET, i.e. E[Y 1 − Y 0 |D = 1]. Recall that a
naive estimator would build upon
E[Y |D = 1] − E[Y |D = 0]
by simply comparing the observed outcomes among the treated and the non-treated.
With non-experimental data (where D is not randomly distributed), this estimator is
usually biased due to differences in observables and unobservables among those who
chose D = 1 and those who chose D = 0. This bias is
E[Y 0 |D = 1] − E[Y 0 |D = 0] ,
The third part (3.12) is the bias due to differences in the expected outcomes between the participants (D = 1) and the non-participants (D = 0) conditional on X inside the population of the participants.5 This component is zero if there are no systematic unobserved differences after controlling for X , because in case that X includes all confounding variables we have $E[Y^0|X, D = 1] = E[Y^0|X, D = 0]$.
This third part is what is traditionally understood by selection bias. Nevertheless, the first and the second part also form part of the bias, showing that there are still some other issues, namely differences in the conditional distributions of observed covariates as well as different supports of these covariates.
The first component (3.10) is due to differences in the support of X in the partic-
ipant and non-participant subpopulation. When using the simple estimator E[Y |D =
1] − E[Y |D = 0] we partly compare individuals to each other for whom no counterfac-
tual could ever be identified simply because X1 \X01 is non-empty. There are participants with characteristics x for whom no counterpart in the non-participant (D = 0) subpopulation could ever be observed. Analogously, if X0 \X01 is non-empty, there will be
non-participants with characteristics for whom no participant with identical characteris-
tics could be found. In other words, part (3.10) is zero if the CSC for ATE (A2) holds,
but only the first term of (3.10) is zero if CSC holds just for (AT2). The second term in
(3.10) disappears by not using individuals from X1 \X01 .
Example 3.3 If it happened that individuals with characteristics X1 \X01 have on aver-
age large outcomes Y 0 , and those with characteristics X0 \X01 have on average small
outcomes Y 0 , then the first bias component of the experimental estimator would be pos-
itive. The reason is that the term E[Y |D = 1] contains these high-outcome individuals
(i.e. X1 \X01 ), which are missing in the D = 0 population. Analogously, E[Y |D = 0]
contains individuals with low outcome (i.e. X0 \X01 ) whose characteristics have zero
density in the D = 1 population. Therefore the term E[Y |D = 1] would be too large
as it contains the individuals with high outcome, and the term E[Y |D = 0] would be
too small as it contains those low-outcome individuals. In the case of randomised exper-
iments the supports are identical, X0 = X1 , and common support is guaranteed. With
observational studies this is typically not the case.
The second part of the bias (3.11) is due to differences in the distributions of the X
characteristics among participants and non-participants (on the common support). An
adequate estimator will have to adjust for this difference. For example, to deal with the
5 We switch here from the notion of ‘treatment group’ to ‘participants’ intentionally, though, admittedly, the two are often used synonymously. This is to emphasise a frequent reason for selection biases in practice:
people might be assigned to a treatment (or the control) group but decide (voluntarily or not) afterwards to
change the group. For the estimation, however, the treatment (i.e. participation) itself is crucial, not the
assignment. The ATE for D = ‘assignment’ instead of D = ‘actual participation’ is called
intention-to-treat effect.
second component, one has to weight the non-parametric estimates of $E[Y^0|X, D = 0]$ with the appropriate distribution of X |D = d.
Note that all this discussion could be repeated now for
E[Y 1 |D = 1] − E[Y 1 |D = 0]
which adds to the bias above if the objective was to estimate the ATE. It has a similar
decomposition as that of (3.10) to (3.12). You can try as an exercise, and you will note
that these terms do not cancel those of (3.10) to (3.12) when calculating the bias of ATE.
U ⊥⊥ D|X , (3.15)
which we might call conditional exogeneity. For estimating average treatment effects
it would be sufficient to ask for conditional linear independence or conditional zero-
correlation. Condition (3.15) implies $E[U|X, D] = E[U|X]$. The assumption typically invoked for OLS in (3.14) is actually stronger, namely $E[U_i|D_i, X_i] = 0$.
Indeed, for estimating the linear model we ask that U is mean-independent from D
and X , or at least from those elements of X which are correlated with D. For the non-
parametric identification as well as for the treatment effect estimators, we have seen
that this assumption is not needed. So the news is that U is allowed to be correlated
with X . More generally, in the matching approach for treatment effect estimation, the
confounders X are permitted to be endogenous in (3.14).
How is the above-introduced matching approach related to an ordinary least squares (OLS) regression of (3.14)? This is easiest to see when starting with parametric matching, also based on simple linear models for $m_0$ and $m_1$, i.e. $\hat m_d(x) = \hat a_d + x\hat b_d$, where $\hat a_d$, $\hat b_d$ are the coefficients estimated from group $\{i : D_i = d\}$. The average potential outcome is then
$$\hat E\left[Y^d\right] = \hat a_d + \bar X \hat b_d,$$
where $\bar X$ are the average characteristics in the entire sample. The ATE estimate is then
$$\hat E\left[Y^1 - Y^0\right] = \hat a_1 - \hat a_0 + \bar X(\hat b_1 - \hat b_0). \qquad (3.16)$$
Instead, an OLS estimation of (3.14) would deliver $\hat\alpha + d\hat\beta + \bar x\hat\gamma$, where $\hat\alpha$, $\hat\beta$, $\hat\gamma$ were obtained from the entire sample. The corresponding direct estimate of the ATE is then $\hat\beta$. The first thing that has to be recalled is that, in both cases, one must only use covariates X that are confounders. Second, (3.16) used only assumption (3.15), whereas a stronger assumption is needed for OLS. Third, the matching approach automatically accounts for possible interactions of D and X on $(Y^0 - Y^1)$, whereas in (3.14) one would have to model this explicitly. It is clear that this is also true for any other functional modification or extension; try e.g. any polynomial extension of (3.14). An immediate conclusion is that, while for the partial, marginal ceteris paribus effect of D one might still argue that an OLS estimate $\hat\beta$ from (3.14) is a consistent estimate of the linear part of this effect, no such clear interpretation is available when the parameter of interest is the ATE. However, when introducing doubly robust estimators we will see that neglecting the different distributions of X in the two groups harms less in (3.16) than it does in (3.14), while it causes no problems when using local estimators. This partly explains the importance of non-parametric estimates for treatment effect estimation: the parametric simplification complicates the correct interpretation instead of simplifying it.
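The following sketch (simulated data with effect heterogeneity) contrasts the 'parametric matching' estimate (3.16), built from two group-wise linear regressions, with the coefficient on D from a pooled OLS regression as in (3.14); it is only meant to illustrate that the two numbers can differ when the distribution of X differs across the groups, and all data-generating choices are ours.

```r
# Parametric matching (3.16) versus pooled OLS (3.14) on simulated data.
set.seed(3)
n <- 2000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(x))                 # X distribution differs by group
y <- 1 + x + d * (1 + 0.8 * x) + rnorm(n)    # treatment effect heterogeneous in X

fit1 <- lm(y ~ x, subset = d == 1)           # group-wise linear regressions
fit0 <- lm(y ~ x, subset = d == 0)
xbar <- mean(x)
ATE_match <- (coef(fit1)[1] - coef(fit0)[1]) + xbar * (coef(fit1)[2] - coef(fit0)[2])

fit_ols <- lm(y ~ d + x)                     # pooled OLS of (3.14)
c(parametric_matching = unname(ATE_match),
  OLS_beta            = unname(coef(fit_ols)["d"]))
# The true ATE here is 1 + 0.8 * E[X] = 1; the pooled OLS coefficient need not equal it.
```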
We have seen what kind of biases can emerge from a direct mean comparison. They
reflect an identification problem due to (auto-)selection of the different treatment groups.
We saw how CIA and CSC help to identify ATE and ATET if all important confounders
were observed and X01 is the population of interest. A simple comparison of classical
structural equation analysis and the matching-based approach has further illustrated why a misspecification of the functional form may have even more severe consequences for the correct interpretation than it typically has in the classical regression context.
The CIA is basically used in two different ways to estimate treatment effects: either
for a direct matching of individuals being treated with those not being treated, or via their
propensity (expressed in probability) to be treated or not. The second approach opens
different ways of how to continue: using the propensity either for matching or for read-
justing the distributions of subjects in the two subpopulations (treated vs non-treated) to
make them comparable. We will see that matching and propensity score weighting can
even be combined to increase the robustness of the treatment effect estimator.
3.2 ATE and ATET Estimation Based on CIA
$$\hat E\left[Y^0|D = 1\right] = \frac{1}{n_1}\sum_{i: D_i = 1}\hat m_0(X_i),$$
which gives our proposal (3.6), $\hat E\left[Y^1 - Y^0|D = 1\right] = \frac{1}{n_1}\sum_{i: D_i = 1}\left(Y_i - \hat m_0(X_i)\right)$. As this matching estimator automatically ‘integrates empirically’ over $F_{X|D=1}$ (i.e. averages), we have to replace $F_{X|D=0}$ by $F_{X|D=1}$ in (3.10) and (3.11). This eliminates (3.11).
Concerning the second component of (3.10), recall that we redefine the ATET by restricting it to the region X01.6 As $F_{X|D=1}(x) = 0$ for $x \in X_0\setminus X_{01}$, the second component in (3.10) is also zero. Thus, restricting to the common support region, our ATET estimate is actually
$$\hat E\left[Y^1 - Y^0|D = 1\right] = \frac{1}{n_{01}}\sum_{i: D_i = 1,\, X_i \in X_{01}}\left(Y_i^1 - \hat m_0(X_i)\right), \qquad n_{01} = \sum_{i: D_i = 1} 1\!1\{X_i \in X_{01}\}, \qquad (3.17)$$
with m̂ 1 being an estimate of the expected outcome Y under treatment. The next step is
to find an appropriate predictor m̂ 0 (and m̂ 1 in case we want to estimate ATE); afterwards
one can study the statistical properties of the final estimators.
Popular non-parametric methods in this context are the kernel regression estimator,
local polynomial regression, and kNN estimators. A very popular version of the latter
is the simple first-nearest-neighbour regression: for predicting m 0 (X i ) for an individual
i taken from the treated, the individual from the control group with characteristics X j
being the closest to the characteristics X i is selected and its value Y j is taken as pre-
dictor: m̂ 0 (X i ) := Y j . The use of the nearest-neighbour regression estimators provides
actually the origin of the name matching: ‘pairs’ or ‘matches’ of similar participants
and non-participants are formed, and the average of their outcome difference is taken to
estimate the treatment effect. There has been some discussion on whether controls can be matched (i.e. used) repeatedly or only once. In case of ATET estimation, for example, the latter requires $n_0 \geq n_1$ and leads to a larger bias but reduces the variance. One may wonder why the simple one-to-one matching estimators have been so popular.
One reason is that it can help to reduce the cost of data collection if matching is used
ex-ante.
Example 3.4 Suppose we have a data set from medical records on 50 individuals who
were exposed to a certain drug treatment and 5000 individuals who were not exposed.
For the 5000 controls some basic X variables are available but not the Y variable of
interest. We would thus have to still collect data on Y . Collecting these Y data is often
costly, and may e.g. require a blood test with prior consent of the physician and the
individual. Thus, instead of following-up all 5000 individuals, it makes sense to use the
available X data to choose a smaller number of control observations, e.g. 50, who are
most similar to the 50 treated individuals (in terms of X ) and to collect additional data
(namely their Y ) only on these individuals.
Example 3.4 gives a reason for why the one-to-one matching is helpful before data
collection is done. Nevertheless, after data collection has been completed it does not
preclude the use of estimators that use a larger smoothing area. Obviously, using a
single-nearest neighbour for predicting m 0 (x) leads (asymptotically) to the lowest bias
but rather high variance. Therefore a wider window (larger k = ‘number of neighbours’
for kNN or larger bandwidth for kernel and local polynomial smoothers) might be appro-
priate. Having said this, it is clear that in such cases several individuals will be used
repeatedly for matches. Matching with kNN methods or kernel regression with band-
width h are likely to perform very similarly if k and h are chosen optimally. Some
people argue that in practice, k nearest neighbour matching may perform somewhat
better since the smoothing region automatically adapts to the density and thus ensures
that never less than k observations are in the smoothing region. Recall that this corre-
sponds to local bandwidths in kernel regression. However, ‘matching’ based on local
polynomial regression or with higher-order kernels can reduce the bias of the matching
estimator, which is not possible with kNN regression.
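As an illustration of the simplest case, a one-to-one nearest-neighbour matching estimator of the ATET with a single covariate can be hand-coded in a few lines (simulated data; matching is done with replacement). This is a sketch of the idea only, not the authors' implementation.

```r
# One-to-one nearest-neighbour matching (with replacement) for the ATET.
set.seed(4)
n <- 1000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(x))
y <- x + d * 1 + rnorm(n)

xt <- x[d == 1]; yt <- y[d == 1]             # treated
xc <- x[d == 0]; yc <- y[d == 0]             # controls

# for each treated unit, take the outcome of the closest control as m0-hat(X_i)
m0_hat <- sapply(xt, function(xi) yc[which.min(abs(xc - xi))])
ATET   <- mean(yt - m0_hat)
ATET                                          # the true ATET is 1 in this design
```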
Let us come back to the CSC in theory and practice. In theory, m 0 is simply not well
defined outside X0 . So if there exist x in X1 \X01 , then their potential outcome Y 0 is not
defined (or say ‘identified’) and consequently not their treatment effect. Then, neither
ATE nor ATET are defined for a population that includes individuals with those charac-
teristics. The same story could be told exchanging subindices 0, 1 and we conclude that
neither ATE nor ATEN were defined. This is the theoretical part. In practice, we simply
cannot (or should not try to) extrapolate non-parametrically too far. For example, if there
is no individual j in the control group exhibiting an x j close to xi for some i from the
treatment group, then there is no match. With kernels it is similar; if there is no match
for x in the h-neighbourhood (h being the bandwidth), the prediction of m 0 (x) is not
possible. Here we see the practical meaning of the CSC for non-parametric matching
and regression estimators.
In the K-nearest-neighbour matching estimators below, $\hat Y_i(1)$ and $\hat Y_i(0)$ denote the observed, respectively imputed, potential outcomes, where the imputation uses the K nearest neighbours of i from the opposite treatment group, $j_k(i)$ denotes the index of the kth such neighbour, and $R(i)$ counts how often observation i is used as a match. Then
$$\widehat{ATE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat Y_i(1) - \hat Y_i(0)\right) = \frac{1}{n}\sum_{i=1}^{n}(2D_i - 1)\left(1 + \frac{R(i)}{K}\right) Y_i, \qquad (3.20)$$
$$\widehat{ATET} = \frac{1}{n_1}\sum_{i: D_i = 1}\left(Y_i(1) - \hat Y_i(0)\right) = \frac{1}{n_1}\sum_{i=1}^{n}\left(D_i - (1 - D_i)\frac{R(i)}{K}\right) Y_i. \qquad (3.21)$$
These estimators can be decomposed as
$$\widehat{ATE} - ATE = \left(ATE(X) - ATE\right) + B^K + S^K, \quad\text{with}$$
$$\text{average conditional treatment effect}\quad ATE(X) = \frac{1}{n}\sum_{i=1}^{n}\left(m_1(X_i) - m_0(X_i)\right),$$
$$\text{conditional bias}\quad B^K = \frac{1}{n}\sum_{i=1}^{n}(2D_i - 1)\,\frac{1}{K}\sum_{k=1}^{K}\left(m_{1-D_i}(X_i) - m_{1-D_i}(X_{j_k(i)})\right),$$
$$\text{and stochastic term}\quad S^K = \frac{1}{n}\sum_{i=1}^{n}(2D_i - 1)\left(1 + \frac{R(i)}{K}\right)\varepsilon_i,$$
where $\varepsilon_i = Y_i - m_{D_i}(X_i)$, and analogously
$$\widehat{ATET} - ATET = \left(ATET(X) - ATET\right) + BT^K + ST^K, \quad\text{with}$$
$$\text{conditional bias}\quad BT^K = \frac{1}{n_1}\sum_{i: D_i = 1}\frac{1}{K}\sum_{k=1}^{K}\left(m_0(X_i) - m_0(X_{j_k(i)})\right),$$
$$\text{and stochastic term}\quad ST^K = \frac{1}{n_1}\sum_{i=1}^{n}\left(D_i - (1 - D_i)\frac{R(i)}{K}\right)\varepsilon_i.$$
These decompositions show nicely what drives potential biases and variance of the treat-
ment effect estimates. Obviously, the main difficulty in calculating the bias and variance
for these estimators is the handling of the stochastic matching discrepancies X i − X jk (i) .
Recalling the common support assumption, it is clear that for discrete variables, fixed K but $n \to \infty$, these discrepancies will become zero, and so will $B^K$ and $BT^K$. For continuous variables in X , Abadie and Imbens (2006) gave their explicit distribution (densities and the first two moments). These enabled them to derive the asymptotics for (3.20) and (3.21) as given below. As the continuous confounders will dominate the asymptotic behaviour, let us assume without loss of generality that X is a vector of q continuous variables. Adding discrete ones is asymptotically for free. Let us first summarise the assumptions to be made:
(A1) and (A2) We use the CIA and the common support, i.e. there exists an $\epsilon > 0$ such that $\epsilon < P(D = 1|X = x) < 1 - \epsilon$ for all x.
(A3) We are provided with a random sample $\{(Y_i, X_i, D_i)\}_{i=1}^{n}$.
Recall that if the common support condition is not fulfilled, or if we cannot find rea-
sonable matches for some of the observed x, then the population of interest has to be
redefined restricting the analysis on a set, say X , where this condition holds. As already
discussed, for estimating the ATET we need to assume a little bit less, specifically
(AT1) and (AT2) $Y^0 \perp\!\!\!\perp D|X$ and $P(D = 1|X = x) < 1 - \epsilon$ for all x.
(AT3) Conditional on D = d the sample consists of independent draws from (Y, X)|D = d for d = 0, 1, and for some $r \geq 1$, $n_1^r/n_0 \to \rho$ with $0 < \rho < \infty$.
With these we can state
THEOREM 3.1 Under assumptions (A1) to (A3) and with $m_1(\cdot)$, $m_0(\cdot)$ Lipschitz, $B^K = O_p(n^{-1/q})$, and the order of the bias term $E[B^K]$ is in general not lower than $n^{-2/q}$. Furthermore, $Var[\widehat{ATE}|X, D] = \frac{1}{n^2}\sum_{i=1}^{n}\left(1 + \frac{R(i)}{K}\right)^2 Var[Y|X_i, D_i]$.
Set $f_d := f_{X|D=d}$. Under assumptions (AT1) to (AT3) and with $m_0(\cdot)$ Lipschitz, one has $BT^K = O_p(n_1^{-r/q})$, and for $X_{01}$ being a compact subset of the interior of $X_0$, with $m_0(\cdot)$ having bounded third derivatives and $f_0(x)$ having bounded first derivatives, one has
$$E[BT^K] = n_1^{-2r/q}\,\rho^{2/q}\,\frac{1}{K}\sum_{k=1}^{K}\frac{\Gamma\!\left(k + \frac{2}{q}\right)}{(k-1)!}\left(\frac{\pi^{q/2}}{\Gamma(1 + q/2)}\right)^{-2/q}\!\int f_0^{-2/q}(x)\left\{f_0^{-1}(x)\,\frac{\partial f_0}{\partial x}(x)^{\!\top}\frac{\partial m_0}{\partial x}(x) + \frac{1}{2}\,\mathrm{tr}\!\left(\frac{\partial^2 m_0}{\partial x\,\partial x^{\top}}(x)\right)\right\} f_1(x)\,dx + o\!\left(n_1^{-2r/q}\right).$$
Furthermore, $Var[\widehat{ATET}|X, D] = \frac{1}{n_1^2}\sum_{i=1}^{n}\left(D_i - (1 - D_i)\frac{R(i)}{K}\right)^2 Var[Y|X_i, D_i]$.
If additionally $Var[Y|X, D]$ is Lipschitz and bounded away from zero, and the fourth moments of the conditional distribution of $Y|(x, d)$ exist and are uniformly bounded in x, then
$$\sqrt{n}\;\frac{\widehat{ATE} - ATE - B^K}{\left(E\big[(ATE(X) - ATE)^2\big] + n\,Var[\widehat{ATE}|X, D]\right)^{1/2}} \;\xrightarrow{d}\; N(0, 1),$$
$$\sqrt{n_1}\;\frac{\widehat{ATET} - ATET - BT^K}{\left(E\big[(ATET(X) - ATET)^2\big] + n_1\,Var[\widehat{ATET}|X, D]\right)^{1/2}} \;\xrightarrow{d}\; N(0, 1).$$
In the bias expressions we see what we called the curse of dimensionality in non-
parametric estimation: the larger the number q of continuous conditioning variables x,
the larger the bias, and the slower the convergence rate. What is somewhat harder to see is the impact of the number of neighbours K , and therefore also that of the replicates R(i). However, if we let K increase with n, then we are in the (non-parametric) regression context, which we will study later.8
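The role of K and of the replicates R(i) can be made concrete with a small sketch that computes the K-nearest-neighbour matching estimators both by direct imputation and via the weighted representations (3.20) and (3.21). The data are simulated, ties are ignored, and for (3.21) we count only the uses of a unit as a match for a treated unit, which is our reading of R(i) in that formula; this is an illustration, not the authors' code.

```r
# K-NN matching: imputation form versus weighted representation (3.20)/(3.21).
set.seed(5)
n <- 400; K <- 3
x <- rnorm(n); d <- rbinom(n, 1, plogis(x)); y <- x + d + rnorm(n)

# indices of the K closest units taken from the opposite treatment group
match_of <- lapply(seq_len(n), function(i) {
  pool <- which(d != d[i])
  pool[order(abs(x[pool] - x[i]))[1:K]]
})

yhat_opp <- sapply(match_of, function(idx) mean(y[idx]))   # imputed Y_i(1 - D_i)
ATE_imp  <- mean(ifelse(d == 1, y - yhat_opp, yhat_opp - y))
ATET_imp <- mean((y - yhat_opp)[d == 1])

R  <- tabulate(unlist(match_of), nbins = n)                # uses as a match (all units)
Rt <- tabulate(unlist(match_of[d == 1]), nbins = n)        # uses as a match for a treated unit
ATE_w  <- mean((2 * d - 1) * (1 + R / K) * y)              # weighted form of (3.20)
ATET_w <- sum((d - (1 - d) * Rt / K) * y) / sum(d)         # weighted form of (3.21)
round(c(ATE_imp, ATE_w, ATET_imp, ATET_w), 6)              # pairs should coincide
```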
One might argue that Theorem 3.1 indicates that the fewer (continuous) conditioning
variables we include, the better the performance of the estimator. However, the correct
statement is that ‘the fewer (continuous) conditioning variables are necessary, the easier
the estimation’. Actually, without an excellent estimator for the bias (that one would have to use for bias reduction by subtracting it from the treatment effect estimate) we only get the parametric $\sqrt{n}$ convergence rate for $q \leq 2$ when estimating the ATE. To ignore the bias we even need q = 1. Not surprisingly, for the ATET the convergence rate depends on $n_1$ and on the ratio $n_1/n_0$; recall assumption (AT3). Consequently, even with more than one conditioning variable (i.e. q > 1) one might reach a $\sqrt{n_1}$ convergence rate if $n_0$ increased accordingly faster ($n_1/n_0 \to 0$). The good news is that in
both cases, the inclusion of covariates that are discrete with finite support has asymp-
totically no impact on the bias. It should be said, however, that in finite samples the
inclusion of many discrete variables, and in particular of those with ‘large support’ (rel-
ative to sample size), does have an impact. Unfortunately, little is known about the ‘how
much’.
It is important to keep in mind that the Theorem holds only under the assumptions
(A1) to (A3) or (AT1) to (AT3), respectively. If the CIA fails because we did not include
enough conditioning variables, then an additional bias term adds to $B^K$ (or $BT^K$ when estimating the ATET); this term does not vanish asymptotically and therefore yields an inconsistent estimator. But as in practice we are only provided with finite samples, $B^K$ ($BT^K$) is indeed always present, so that we have at least two trade-offs to handle:
8 The appearance of the other expressions, like the number π or the Gamma function, comes directly from the density and moments of the distribution of the matching discrepancy $X_i - X_{j_k(i)}$ in $\mathbb{R}^q$. When looking for the closest neighbours in the Euclidean sense, the volume of the unit q-sphere, which is in fact $2\pi^{q/2}/\Gamma(q/2)$, is of particular interest. This explains the appearance of these terms in the bias.
the bias–bias trade-off when choosing the number of conditioning variables9 , and a
bias–variance trade-off, especially when choosing K and therewith R(i).10
We should add at this point a comment regarding these trade-offs. For about two decades the identification aspect has dominated a large part of the economics and econometrics literature. It basically puts most of the emphasis on identifying exactly the parameter of interest. For theory and academic papers this might be fair enough. For empirical research it can be misleading, because there people face finite samples and have to estimate the parameter with the data and information at hand. Unbiasedness can only be attained thanks to untestable (and mostly disputable) assumptions. The potential bias when these are violated is little studied. Furthermore, unbiasedness is often just an asymptotic phenomenon, while in practice the finite sample bias and variance (i.e. the finite sample mean squared error) are quantities that matter and should worry us as well. An empirical researcher should always look for a compromise between all potential biases and variances of the estimators and data at hand; the objective must be to minimise the finite sample mean squared error.
In practice the correction for bias is often much harder than the estimation of the
variance. One tries therefore to use bias-reducing methods, and in particular under-
smoothing, such that the squared bias becomes negligible compared to the variance. But
what about the variance? Theorem 3.1 gives explicit formulae for $Var[\widehat{ATE}|X, D]$ and $Var[\widehat{ATET}|X, D]$, which can be used directly when replacing $Var[Y|X, D]$ by non-parametric estimates. For doing inference on our treatment effect estimates we need
$$Var[\widehat{ATE}] = n^{-1}\, E\left[n\,Var(\widehat{ATE}|X, D) + (ATE(X) - ATE)^2\right]$$
and
$$Var[\widehat{ATET}] = n_1^{-1}\, E\left[n_1\,Var(\widehat{ATET}|X, D) + (ATET(X) - ATET)^2\right],$$
where the conditional-variance parts can be handled via the matched residuals, noting for instance that
$$\frac{1}{n^2}\sum_{i=1}^{n} E\left[\left(\varepsilon_i + \frac{1}{K}\sum_{k=1}^{K}\varepsilon_{j_k(i)}\right)^{2}\,\Big|\,X, D\right] = \frac{1}{n^2}\sum_{i=1}^{n}\left(1 + \frac{R(i)}{K^2}\right) Var[Y|X_i, D_i].$$
9 Choosing too many confounders increases $B^K$ or $BT^K$ unnecessarily, but choosing too few confounders leads to a violation of the CIA and thereby to an additional (selection) bias.
10 The number of confounders has an impact on both the total bias ($B^K$ or $BT^K$ plus selection bias) and the variance, but their choice is mostly driven by the first-mentioned concern. The ‘smoothing’ bias $B^K$ ($BT^K$) is increasing with K , while a small K increases the variance.
$$\widehat{Var}[\widehat{ATE}] = \frac{1}{n^2}\sum_{i=1}^{n}\left\{\left(\hat Y_i(1) - \hat Y_i(0) - \widehat{ATE}\right)^2 + \left(\left(\frac{R(i)}{K}\right)^2 + \frac{(2K-1)R(i)}{K^2}\right)\widehat{Var}[Y|X_i, D_i]\right\},$$
$$\widehat{Var}[\widehat{ATET}] = \frac{1}{n_1^2}\sum_{i: D_i = 1}\left(Y_i - \hat Y_i(0) - \widehat{ATET}\right)^2 + \frac{1}{n_1^2}\sum_{i: D_i = 0}\frac{R(i)\{R(i) - 1\}}{K^2}\,\widehat{Var}[Y|X_i, D_i].$$
11 Abadie and Imbens (2006) show the consistency of these estimators for reasonable estimators $\widehat{Var}[Y|X_i, D_i]$.
12 Again as a reminder: discrete covariates with finite support do not affect the asymptotic properties. However, depending on their number and support size, they can essentially affect the finite sample performance and thus are important in practice. This is why we set ‘continuous’ in parentheses.
13 Readers who are more familiar with non- and semi-parametric regression might be somewhat confused, as for semi-parametric estimators the so-called curse of dimensionality starts at dimension q > 3 and not for q > 1. This is true for all generally used methods like kernels, kNN, splines or any other sieve estimator – but here the K is fixed. A further difference to estimation problems which are subject to the less restrictive rule (q < 4) is that in our case – take, for example, the ATE estimation problem – we consider the average of differences of predictors from two non-parametrically estimated functions, $m_0$ and $m_1$, estimated from two different independent samples with probably different densities. This is a somewhat more complex problem than the classical semi-parametric estimation problems.
We are not going to examine the different regularity conditions in detail here. They
are hard or impossible to check anyway, and therefore simply tell us what we have to
believe. In some approaches, the regularity conditions may look very strong.14 In brief,
either higher-order local polynomial regression or higher-order kernels are required if
√
we want to make the bias negligible in order to get n-convergence. It is often stated
that this can also be achieved – or even done better – by sieves. Unfortunately, this is not
true, especially not for the ‘global’ ones, recall our discussions in Section 2.2. There,
people just work with much stronger assumptions on the m d (·).
For the non-parametric treatment effect estimators there exist asymptotic variance bounds, always assuming sufficient smoothness for all unknown functions.15 We will later see that there exist several estimators that indeed meet these bounds and can therefore be called ‘efficient’.
THEOREM 3.2 Under the CIA and CSC, i.e. assumptions (A1) and (A2), for a binary treatment D the asymptotic variance bound for the ATE is generally
$$E\left[\left(E[Y^1 - Y^0|X] - ATE\right)^2 + \frac{Var[Y^1|X]}{\Pr(D=1|X)} + \frac{Var[Y^0|X]}{1 - \Pr(D=1|X)}\right].$$
Analogously, under the modified CIA and CSC (AT1) and (AT2), for a binary treatment D the asymptotic variance bound for the ATET is generally
$$\Pr^{-2}(D=1)\cdot E\left[\Pr(D=1|X)\left(E[Y^1 - Y^0|X] - ATET\right)^2 + \Pr(D=1|X)\,Var[Y^1|X] + \frac{\Pr^2(D=1|X)\,Var[Y^0|X]}{1 - \Pr(D=1|X)}\right].$$
In the special case when the propensity score is known, the efficiency bound for the ATE stays the same, whereas for the ATET estimation it changes to
$$\Pr^{-2}(D=1)\cdot E\left[\Pr^2(D=1|X)\left(E[Y^1 - Y^0|X] - ATET\right)^2 + \Pr(D=1|X)\,Var[Y^1|X] + \frac{\Pr^2(D=1|X)\,Var[Y^0|X]}{1 - \Pr(D=1|X)}\right].$$
In order to prove these statements one can resort to the ideas of pathwise derivatives in Section 2.2.3; recall Equation 2.68. There we already calculated the score function S(Y, D, X), Equation 2.69, which gives the tangent space of our model as the set of mean-zero functions that exhibit the additive structure of the score,
$$s(y, d, x) = d\cdot s_1(y|x) + (1 - d)\cdot s_0(y|x) + (d - p(x))\cdot s_p(x) + s_x(x). \qquad (3.22)$$
14 Hirano, Imbens and Ridder (2003) assume that the propensity score is at least 7q times continuously
differentiable. Others work with infinitely many continuous derivatives for the m d , f d , p functions. This
is still less restrictive than directly working with a purely parametric approach with a fixed functional
specification.
15 Here we follow mainly Hahn (1998).
Then we know that the variance of its projection on this tangent space, $E[\delta^2(Y, D, X)] = Var[\delta(Y, D, X)]$, is the variance bound for the ATE estimators. Consider now
$$\delta(y, d, x) = \left\{m_1(x) - m_0(x) - ATE\right\} + d\,\frac{y - m_1(x)}{p(x)} + (1 - d)\,\frac{m_0(x) - y}{1 - p(x)}$$
and verify (3.24), and that it lies in the tangent space, i.e. is identical to its projection on it. Calculating $E[\delta^2(Y, D, X)]$ is then straightforward.
What can we see from the obtained results? At first glance the importance of the propensity score in these bounds might be surprising. But it is not when you realise that
we speak of binary treatments and thus E[D|X ] = Pr(D = 1|X ). Furthermore, the
treatment effect estimation problem conditioned on X is affected by ‘selection on X ’,
and therefore must depend on Pr(D = 1|X ). A corollary is that for constant propensity
scores, Pr(D = 1|X ) = E[Pr(D = 1|X )] = P, i.e. when we are back in the situation
of random treatment assignment with AT E = AT E T , we have the variance bound
$$E\left[\left(E[Y^1 - Y^0|X] - ATE\right)^2 + \frac{Var[Y^1|X]}{P} + \frac{Var[Y^0|X]}{1 - P}\right]. \qquad (3.25)$$
This would not change if we knew P, and therefore knew also that we are in the case of
random assignment. It tells us that for estimating ATE one does not asymptotically gain
in efficiency by knowing that random assignment has taken place.
Why does knowledge of the propensity score (like, for example, in a controlled exper-
iment) not change the variance bound for the ATE but reduces that of ATET? The main
16 I.e. a model with the parameters belonging to an open set, non-singular Fisher information and some more
regularity conditions.
reason for this is that knowledge of the propensity score helps to improve the estimation
of f 1 := f X |D=1 which is needed for the ATET but not for the ATE. The propensity
score provides information about the ratio of the density in the control and the treated
population and thus allows control observations to identify the density of X in the treated
population and vice versa. The estimation of E[Y 0 |D = 1] can therefore be improved.
The (Y, X ) observations of both treatment groups identify the conditional expectation.
This conditional expectation is weighted by the distribution of X among the treated, say
f 1 , which can be estimated from the treated group. Usually, the non-participant obser-
vations are not informative for estimating that distribution. If, however, the relationship
between the distribution of X among the treated and the one among the controls was
known, then the X observations of the controls would be useful for estimating f 1 . The
propensity score ratio provides exactly this information, as it equals the density ratio times the relative size of the subpopulations: $\frac{p(X)}{1 - p(X)} = \frac{f_1(X)}{f_0(X)}\,\frac{\Pr(D=1)}{\Pr(D=0)}$, with $f_0 := f_{X|D=0}$. Since the relative size of the treated subpopulation, $\Pr(D=1) = 1 - \Pr(D=0)$, can be estimated precisely, for known p(x) the observations of both the treated and the controls can be used to estimate $f_1$.
Example 3.5 In the case of random assignment with $p(x) = \frac{1}{2}$ for all x, the distribution of X is the same among the treated and the non-participants, and using only the treated observations to estimate $f_1$ would neglect half of the informative observations. But as
we know that f 1 = f 0 you can use all observations. In fact, with knowledge of the
propensity score the counterfactual outcome for the treated E[Y 0 |D = 1] could be
predicted even without observing the treated.
This example demonstrates heuristically that for estimating the ATET we expect an
improvement when knowing the propensity score. For estimating ATE, this knowl-
edge is not helpful: the (Y, X ) observations of the treated sample are informative for
estimating E[Y 1 |X ], whereas the (Y, X ) observations of the controls are informative
for estimating E[Y 0 |X ]. Since the joint distribution of Y 1 , Y 0 is not identified, the
observations of the treated sample cannot assist in estimating E[Y 0 |X ] and vice versa.
Knowledge of the propensity score is of no use here. Theorem 3.2 has some practical
use. Sometimes we know an estimation procedure coming from a different context but
which may be applied to our problem. We would like to check, then, if this leads to an
efficient estimator in our setting or not.
Example 3.6 Imagine we have experimental data and can separate the treatment impact
from the confounders impact in an additive way: E[Y |X = x, D = d] = d α + m(x).
Obviously, we then face a partial linear model as discussed in the section on non- and
semi-parametric estimation. Recall now the estimator (2.59) of Speckman (1988) to get
α, i.e.
$$\hat\alpha = \sum_{i=1}^{n}\left(y_i - \hat E[Y|x_i]\right)\left(d_i - \hat E[D|x_i]\right) \Bigg/ \sum_{i=1}^{n}\left(d_i - \hat E[D|x_i]\right)^2,$$
The next direct corollary from Theorem 3.2 is that for estimators of the sample analogues, i.e. estimators of the SATE (sample ATE) and the SATET, we obtain the same lower bounds for the variances minus the respective first term. Take for example the ATE: the first term is $Var\!\left[E[Y^1 - Y^0|X] - ATE\right]$, which only describes the contribution of the sampling variance, and therefore $Var\!\left[E[Y^1 - Y^0|X] - SATE\right] = 0$.
(B1) Both the density of X , f(x), and the function $m_0(x)$ have Hölder continuous derivatives up to order p > q.
(B2) Let $K(\cdot)$ be a Lipschitz continuous kernel function of order p with compact support and Hölder continuous derivatives of order at least 1.
It should be evident to readers of Section 2.2 that this condition is needed as a bias reduction tool. This can also be seen from the next condition, which can only hold if 2p > q – automatically fulfilled by the condition p > q in (B1). As discussed in the section about non-parametric kernel regression, there is always the possibility of using higher-order local polynomials instead of higher-order kernels (or even a mix of both).
17 Here we follow the lines of Heckman, Ichimura and Todd (1998). In their article, however, they mix estimators with and without a prior estimation of the unknown propensity score, which led Hahn and Ridder (2013) to the conjecture that their derivation was wrong. Note that our result refers to a special case which is not affected by this criticism.
There exist many versions of regression estimators for the ATET; a most intuitive one is
$$\widehat{ATET} = \frac{1}{n_1}\sum_{i: D_i = 1}\left(Y_i - \hat m_0(X_i)\right), \qquad \hat m_0(x) = \frac{\sum_{j: D_j = 0} Y_j\, K_h(X_j - x)}{\sum_{j: D_j = 0} K_h(X_j - x)}. \qquad (3.28)$$
You may equally well replace $\hat m_0(\cdot)$ by a local polynomial estimator. This allows you to accordingly relax assumption (B2). Let us consider only continuous confounders, for the reasons we discussed. Then one can state
$$B(x) = h^p f^{-1}(x)\sum_{l=1}^{p}\sum_{j=1}^{q}\frac{1}{l!\,(p-l)!}\int u_j^{\,p} K(u)\,du \;\cdot\; \frac{\partial^l m_0(x)}{\partial x_j^l}\,\frac{\partial^{p-l} f(x)}{\partial x_j^{p-l}}.$$
The bias term can be reduced by the use of higher-order polynomials: the general rule is that the higher the order of the local polynomial, the later the first sum $\sum_{l=1}^{p}$ starts; e.g. when using local linear estimation we obtain B(x) as above but with $\sum_{l=2}^{p}$. Moreover, we can choose p large enough to extend (B3) such that $nh^{2p} \to 0$, leading to an asymptotically negligible bias.
18 This says that X1 is not only a subset of X0 (which is the ATET analogue of the CSC); it demands the sometimes unrealistic assumption that the h-neighbourhood of every point of X1 lies in X0. In the articles this assumption is sometimes weakened by introducing a trimming function to get rid of boundary effects of the non-parametric estimator. Although for microdata sets, where h (and therefore the boundary region) is relatively tiny, the boundary effects become negligible, this is necessary for exact asymptotic theory. The trimming also allows you to directly define a subpopulation S for which one wants to estimate the treatment effect. In practice people do this automatically and thereby actually redefine the population under consideration. We have therefore decided to present the version without trimming to simplify notation and formulae, but we apply (B4) for mathematical correctness.
Both terms, Var and B(x), can be straightforwardly derived from standard results known in non-parametric regression; see our Section 2.2 and Exercise 8. The difference between estimator (3.28) and the true ATET can be rewritten as
$$\frac{1}{n_1}\sum_{i: D_i = 1}\Big[\{Y_i - m_1(X_i)\} + \{m_1(X_i) - m_0(X_i) - ATET\} + \{m_0(X_i) - \hat m_0(X_i)\}\Big].$$
It is clear that the expectation of the first two terms is zero while from the last term we
get the smoothing bias as given in Theorem 3.3; compare also with Section 2.2.
To obtain now the variance, note that under our assumptions the first two terms converge to the first two terms of Var. For the last term it is sufficient to consider the random part of $\hat m_0(\cdot)$. From Section 2.2 we know that it is asymptotically equivalent19 to
$$\frac{1}{n_0}\sum_{i: D_i = 0}\varepsilon_i\, K_h(X_i - X)\, f_x^{-1}(X|D=0), \qquad\text{with } \varepsilon_i = Y_i - m_0(X_i),$$
where all (Yi , X i ) are taken from the control group {i : Di = 0}. Their average over all
X = X i with Di = 1 converges to the conditional expectation
$$E\left[\frac{1}{n_0}\sum_{i: D_i = 0}\varepsilon_i\, K_h(X_i - X)\, f_x^{-1}(X|D=0)\;\Big|\; D = 1,\,(Y_i, X_i, D_i = 0)\right] = \frac{1}{n_0}\sum_{i: D_i = 0}\varepsilon_i \int K_h(X_i - w)\, f_x^{-1}(w|D=0)\, f_x(w|D=1)\, dw$$
$$\frac{1}{n_1}\sum_{i: D_i = 1}\left(Y_i - \hat m_0(X_i)\right), \qquad\text{and}\qquad \frac{1}{n}\sum_{i=1}^{n}\left(\hat m_1(X_i) - \hat m_0(X_i)\right), \qquad (3.29)$$
19 We say ‘asymptotically equivalent’ because – without loss of generality – we substituted the true density
f x for its estimate in the denominator.
20 Check term by term and note that the second term of Var in Theorem 3.3 is
$$E[Var(Y^1|X, D=1)|D=1] = \int Var[Y^1|x, D=1]\, f_x(x|D=1)\, dx = \int Var[Y^1|x, D=1]\, \Pr(D=1|x)\,\Pr^{-1}(D=1)\, f_x(x)\, dx$$
$$= E\left[\int (y^1 - m_1(X))^2 f(y^1|X)\, dy^1\;\Pr(D=1|X)\,\Pr^{-1}(D=1)\right]$$
140 Selection on Observables: Matching, Regression and Propensity Score Estimators
respectively, with $m_1(\cdot)$, $m_0(\cdot)$ being estimated from the subsets of the treated and the non-treated, respectively. Again, where authors claim a presumed superiority of such series estimators, one will note that it always comes at the cost of stronger assumptions which, heuristically stated, shall guarantee that the chosen series approximates the functions $m_0(\cdot)$, $m_1(\cdot)$ sufficiently well. Then the bias reduction is automatically given.21 As discussed at the end of
Chapter 2, a general handicap of too-simple sieves like the popular power series is that
they are so-called ‘global’ estimators. They do not adapt locally and depend strongly
on the density of X in the estimation sample. This makes them particularly inadequate
for extrapolation (prediction), especially when extrapolating to a (sub-)population with
a density of X that is different from the one used for estimation. Remember that this is
exactly what we expect to be the case for the confounders X (our situation).
3.3 Propensity Score-Based Estimator

$$Y^d \perp\!\!\!\perp D\,|\,P, \qquad\text{where } P = p(X). \qquad (3.30)$$
The proof is very simple: to show that (3.30) holds, i.e. that
the distribution of D does not depend on Y d given p(X ), it needs to be shown that
Pr(D = 1|Y d , p(X )) = Pr(D = 1| p(X )), and analogously for D = 0. Because
Pr(D = 1|·) and Pr(D = 0|·) have to add to one for binary D, it suffices to show
this relationship for one of them. Now, Pr(D = 1|Y d , p(X )) = E[D|Y d , p(X )] =
E[E[D|X, Y d , p(X )]|Y d , p(X )] by iterated expectation. As p(X ) is deterministic
given X , by the CIA this equals
21 Hahn (1998) does this for a sequence of polynomials which in practice is hardly available, whereas Imbens, Newey and Ridder (2005) propose the use of power series, which in practice should not be used; recall our earlier discussions.
So we see that the justification of propensity score matching does not depend on any
property of the potential outcomes. Note, however, that (3.30) does not imply CIA.
Propensity score matching and matching on covariates X will always converge to the
same limit since it is a mechanical property of iterated integration.22 Hence, in order
to eliminate selection bias due to observables x, it is indeed not necessary to compare
individuals that are identical in all x; it suffices that they are identical in the propensity
score. This suggests to match on the one-dimensional propensity score p(x), because
E Y 0 |D = 1 = E X E[Y 0 | p(X ), D = 1]|D = 1
!
= E X E[Y 0 | p(X ), D = 0]|D = 1 = E X E[Y | p(X ), D = 0]|D = 1 ,
where the subindex X emphasises that the outer expectation is integrating over X .
Finally, it is not hard to see from (3.30) that you also can obtain
Y d ⊥⊥ D|δ(P) (3.31)
for any function δ(·) that is bijective on the interval (0, 1). While this knowledge is use-
less for propensity score weighting, it can directly be used for propensity score matching
noticing that then
$$E\left[Y^0|D=1\right] = E_X\Big[E[Y^0|\delta\{p(X)\}, D=1]\,\Big|\,D=1\Big] = E_X\Big[E[Y|\delta\{p(X)\}, D=0]\,\Big|\,D=1\Big].$$
In practice the propensity score is almost always unknown and has to be estimated
first.23 Estimating the propensity score non-parametrically is usually as difficult as
estimating the conditional expectation function m 0 (x) since they have the same dimen-
sionality.24 Whether matching on x or on p̂(x) yields better estimates depends on the
particular problem and data; for example on whether it is easier to model and estimate
p(x) or the m d (x).
So what are the advantages of doing propensity score matching? Isn’t it just one more estimation step that otherwise gives the same results? There are actually some potential advantages of propensity score matching: first, as indicated, it might be that the modelling and estimation of the multivariate propensity score regression is easier than that of the $m_d(x)$. Second, it relaxes the common support restriction in practice: we only need to find matches for people’s propensity (which is much easier than finding a match regarding a high-dimensional vector of characteristics). Moreover, if we can estimate the
propensity score semi-parametrically, then this two-step procedure does indeed lead to a
dimensionality reduction. If, however, also the propensity score has to be estimated non-
parametrically, then the dimensionality problem has just been shifted from the matching
to the propensity score estimation – and concerning the theoretical convergence rate
nothing is gained.
The most important advantage of propensity score based estimation is that it avoids
model selection based on the outcome variable: one can specify the model of the selec-
tion process without involving the outcome variable Y . Hence, one can respecify a probit
model several times e.g. via omitted-variables tests, balancing tests or the inclusion of
several interaction terms until a good fit is obtained, without this procedure being driven
by Y or the treatment effects themselves. This is in contrast to the conventional regres-
sion approach: if one were to estimate a regression of Y on D and X , all diagnostics would be influenced by Y or the treatment effect, such that a re-specification of the model would already depend on Y and thus on the treatment effect, and would therefore be endogenous by construction. In an ideal analysis, one would specify and analyse the propensity
score without ever looking at the Y data. This can already be used for designing an
observational study, where one could try to balance groups such that they have the same
support or (even better) distribution of the propensity score. Also in the estimation of the
propensity score diagnostic analysis for assessing the balance of covariate distributions
is crucial and should be done without looking at the outcome data Y . If the outcome is
not used at all, the true treatment effects cannot influence the modelling process for bal-
ancing covariates. The key advantage of propensity score analysis is that one conducts
the design, analysis and balancing of the covariates before ever seeing the outcome data.
Another point one should mention is that once a good fit of the propensity score
has been obtained, it can be used to estimate the treatment effect on several different
outcome variables Y , e.g. employment states at different times in the future, various
measures of earnings, health indicators etc., i.e. as the final outcome is not involved
in finding p̂(x), the latter can be used for the analysis of any outcome Y for which
Y d ⊥⊥ D|P can be supposed.
We still permit heterogeneity in treatment effects of arbitrary form: if we are inter-
ested in the ATET and not the ATE, we only need that Y 0 ⊥⊥ D|P but do not require
Y 1 ⊥⊥ D|P or {Y 1 − Y 0 } ⊥⊥ D|P. In other words, we can permit that individuals
endogenously select into treatment. The analogue holds for identifying ATEN. Like
before in the simple matching or regression context, endogenous control variables X are
permitted, i.e. correlation between X and U is allowed to be non-zero.
We turn now to the actual implementation of the estimator. To guarantee that we
calculate a treatment effect only for population X01 , the propensity score estimate p̂
will be used for both matching and trimming: for
$$\mu_d(p) := E[Y^d\,|\,P = p]$$
(analogously to the definition of $m_d$ with argument X), and all for the population X01:
$$\widehat{ATET} = \hat E[Y^1|D=1] - \hat E[Y^0|D=1] = \frac{\sum_{i:D_i=1}\{Y_i - \hat\mu_0(\hat p_i)\}\,1\!1\{\hat p_i < 1\}}{\sum_{i:D_i=1} 1\!1\{\hat p_i < 1\}}$$
$$\widehat{ATE} = \hat E[Y^1] - \hat E[Y^0] = \frac{\sum_{i=1}^n \{\hat\mu_1(\hat p_i) - \hat\mu_0(\hat p_i)\}\,1\!1\{1 > \hat p_i > 0\}}{\sum_{i=1}^n 1\!1\{1 > \hat p_i > 0\}}.$$
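To fix ideas, the following minimal sketch illustrates the ATET formula above on simulated data. It is not the authors' implementation: the logit first step (scikit-learn's LogisticRegression) merely stands in for any parametric propensity model, and the Nadaraya–Watson smoother with an ad hoc bandwidth stands in for any regression of Y on the estimated propensity score among the controls.

```python
# Sketch: ATET via regression on the estimated propensity score (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))                                   # observed confounders
p_true = 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.5 * X[:, 1])))
D = rng.binomial(1, p_true)                                   # treatment indicator
Y = X[:, 0] + 0.5 * X[:, 1] + 1.0 * D + rng.normal(size=n)    # simulated ATET = 1

# First step: parametric propensity score (any consistent estimator would do here)
p_hat = LogisticRegression(C=1e6, max_iter=1000).fit(X, D).predict_proba(X)[:, 1]

def mu0_hat(p_eval, p_ctrl, y_ctrl, h=0.05):
    """Nadaraya-Watson regression of Y on p-hat among the controls (ad hoc bandwidth)."""
    k = np.exp(-0.5 * ((p_eval[:, None] - p_ctrl[None, :]) / h) ** 2)
    return (k * y_ctrl).sum(axis=1) / k.sum(axis=1)

treated = D == 1
keep = treated & (p_hat < 1)                                  # trimming as in the formula above
mu0 = mu0_hat(p_hat[keep], p_hat[D == 0], Y[D == 0])
atet_hat = np.mean(Y[keep] - mu0)
print("ATET estimate:", atet_hat)
```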
25 For more details see Frölich (2007b). See Sperlich (2009) and Hahn and Ridder (2013) for general results
on non- and semi-parametric regression with generated regressors.
26 These approaches are especially pronounced in different oeuvres of Heckman. Here we refer to ideas of
Heckman, Ichimura and Todd (1998).
Example 3.7 Let X 2 comprise only (a few) components like gender and age. Hence, if
we were interested in estimating the average potential outcome separately for men and
women at different age categories, we could use (assuming full common support)
$$E[Y^0\,|\,D=1, \text{gender}, \text{age}] = E\big\{E[Y^0\,|\,p(X_1), D=1, \text{gender}, \text{age}]\,\big|\,D=1, \text{gender}, \text{age}\big\}$$
$$= E\big\{E[Y^0\,|\,p(X_1), D=0, \text{gender}, \text{age}]\,\big|\,D=1, \text{gender}, \text{age}\big\} = E\big\{E[Y\,|\,p(X_1), D=0, \text{gender}, \text{age}]\,\big|\,D=1, \text{gender}, \text{age}\big\},$$
where the outer expectation integrates over p(X 1 ). Interestingly, we can use the same
propensity score to estimate the potential outcome for both genders and all ages. Thus,
we can use the same estimated propensity score for estimating the average potential
outcome in the entire population; predicting the propensity scores only once will suf-
fice. Obviously, the analysis of common support has to be done separately for each
subpopulation.
We see an additional advantage of this structural approach here: as one would typi-
cally expect both X 1 and X 2 to be of smaller dimension than X , for both, p(·) and the
m d (·) fewer smoothness conditions (and fewer bias reducing methods) are necessary
than were needed before. That is, the structural modelling entails dimension reduction
by construction.
The propensity score serves mainly an ex-post balancing function, whereas the regression carries the matching interpretation. It could therefore happen that, from a regression point of view, we match only on noise (the m d (·) being almost constant functions), while an unbalanced sample still shows variation of p(·)
in X . It is often helpful to make use of (3.32) and to match not only on the propen-
sity score but also on those characteristics that we deem to be particularly important (or
interesting) with respect to the outcome variable. Including some covariates in addition
to the propensity score in the matching estimator can improve, apart from the advan-
tage of interpretability, also the finite sample performance since a better balancing of
these covariates is obtained. Further advantages of combining regression and propensity
weighting will be discussed in Subsection 3.3.3.
We conclude the subsection with a remark that might be obvious for some readers but
less so for others. Propensity score matching can also be used for estimating counter-
factual distribution functions. Furthermore, its applicability is not confined to treatment
evaluation. It can be used more generally to adjust for differences in the distribution
of covariates between the populations we compare. For this, certainly, propensity score
weighting, discussed next, can be used as well.
Example 3.9 Frölich (2007b) studied the gender wage gap in the labour market with
the use of propensity score matching. The fact that women are paid substantially lower
wages than men may be the result of wage discrimination in the labour market. On
the other hand, part of this wage gap may be due to differences in education, expe-
rience and other skills, whose distribution differs between men and women. Most of
the literature on discrimination has attempted to estimate how much of the gender
wage gap would remain if men and women had the same distributions of observable
characteristics.27 Not unexpectedly, the conclusion drawn from his study depends on
which and how many characteristics are observed. For individuals with tertiary edu-
cation (university, college, polytechnic) the choice of subject (or college major) may
be an important characteristic of subsequent wages. A wide array of specialisations is
available, ranging from mathematics, engineering, economics to philosophy, etc. One
observes that men and women choose rather different subjects, with mathematical and
technical subjects more often chosen by men. At the same time ‘subject of degree’
(= field of major) is not available in most data sets. In Frölich (2007b) the additional explanatory power of ‘subject of degree’ for the gender wage gap was examined. Propensity score matching was applied to analyse the gender wage gap of college graduates in the UK to see to what extent this gap could be explained by observed characteris-
tics. He also simulated the entire wage distributions to examine the gender wage gap
at different quantiles. It turned out that subject of degree contributed substantially to
reducing the unexplained wage gap, particularly in the upper tail of the wage distribu-
tion. The huge wage differential between high-earning men and high-earning women
was thus to a large extent the result of men and women choosing different subjects in
university.
27 If one attempts to phrase this in the treatment evaluation jargon, one would like to measure the direct
effect of gender on wage when holding skills and experience fixed.
weighting along the propensity score. This idea is exactly what one traditionally does in
(regression) estimation with missing values or with sample weights when working with
strata.28 Therefore, there exists already a huge literature on this topic but we limit our
considerations to what is specific to the treatment effect estimation.
For the sake of presentation we again focus first on the ATET estimation and even
start with the simpler case of known propensity scores and common support. Certainly,
all calculations are based on the CIA in the sense of Y d ⊥⊥ D| p(X ). The main challenge
is to predict the average potential outcome Y 0 for the participants. As by Bayes’ law we
have, again with the notation $f_d := f_{X|D=d}$,
$$\hat E[Y^0|D=1] = \frac{\Pr(D=0)}{\Pr(D=1)}\,\frac{1}{n_0}\sum_{i:D_i=0} Y_i\cdot\frac{p(x_i)}{1-p(x_i)} \;\approx\; \frac{1}{n_1}\sum_{i:D_i=0} Y_i\cdot\frac{p(x_i)}{1-p(x_i)}$$
with $\Pr(D=0)/\Pr(D=1) \approx n_0/n_1$. Note that this estimator uses only the observations $Y_i$ from the
controls, cf. Example 3.5. All you need is a (‘good’) estimator for the propensity score.
Comparing the average outcome of the treated with this predictor gives a consistent
ATET estimator.
It is obvious that along similar steps we can obtain predictions of the potential treat-
ment outcomes Y 1 for the non-participants for an ATEN estimator. Putting both together
can be used to get an ATE estimator. Specifically, with a consistent predictor p̂ we
estimate the ATET by
$$\frac{1}{n_1}\sum_{i=1}^n \left\{ Y_i D_i - Y_i(1-D_i)\,\frac{\hat p(X_i)}{1-\hat p(X_i)} \right\}. \qquad (3.33)$$
28 How is this problem related to ours? Directly, because you can simply think of participants being the
missings when estimating m 0 (·) and the controls being the missings when estimating m 1 (·).
and the ATE by
$$\frac{1}{n}\sum_{i=1}^n \left\{\frac{Y_i D_i}{\hat p(X_i)} - \frac{Y_i(1-D_i)}{1-\hat p(X_i)}\right\}. \qquad (3.34)$$
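A minimal sketch of (3.33) and (3.34) on simulated data may clarify the mechanics. The logit first step is again only a placeholder for whatever propensity model one trusts; data, seeds and coefficients are purely illustrative.

```python
# Sketch: the weighting estimators (3.33) for ATET and (3.34) for ATE.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
X = rng.normal(size=(n, 2))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = X[:, 0] + X[:, 1] + 2.0 * D + rng.normal(size=n)          # simulated effect = 2

p_hat = LogisticRegression(C=1e6, max_iter=1000).fit(X, D).predict_proba(X)[:, 1]
n1 = D.sum()

# (3.33): ATET by weighting the controls with p/(1-p)
atet_hat = (Y * D - Y * (1 - D) * p_hat / (1 - p_hat)).sum() / n1
# (3.34): ATE by inverse probability weighting
ate_hat = (Y * D / p_hat - Y * (1 - D) / (1 - p_hat)).mean()
print(atet_hat, ate_hat)
```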
What are the advantages or disadvantages of this estimator compared to the match-
ing? It has the advantage that it only requires a first step estimation of p(x) and does
not require m 0 (x) or m 1 (x). Hence, we would avoid their explicit non-parametric esti-
mation. In small samples, however, the estimates can have a rather high variance if
some propensity scores $p_i$ are close to zero or one. In the latter case, the term $\frac{p_i}{1-p_i}$ can get arbitrarily large and lead to highly variable estimates. In practice, it is then recommended to impose a cap on $\frac{p_i}{1-p_i}$. One could either trim (i.e. delete) those observations or censor them by replacing $\frac{p_i}{1-p_i}$ with $\min\big\{\frac{p_i}{1-p_i},\ \text{a prefixed upper bound}\big\}$. The typical solution to
the problem is to remove (or rescale) observations with very large weights and check the
sensitivity of the final results with respect to the trimming rules applied. We will discuss
the general problem of trimming or capping somewhat later in this chapter.
The reason for the high variance when pi is close to one is of course related to the
common support problem. The remedies and consequences are somewhat different from
that in the matching estimator, though. In the matching setup discussed before, if we are
interested in the ATET we would delete the D = 1 observations with high propensity
scores. Then we could compare the descriptive statistics of the deleted D = 1 obser-
vations with the remaining observations to understand the implications of this deletion
and to assess external validity of our findings. If e.g. the deleted observations are low-
income individuals compared to the remaining D = 1 observations, we know that our
results do mainly hold for high-income individuals.
Applying some kind of trimming or capping with the weighting estimator also
changes the population for which the effect is estimated. But depending on the imple-
mentation of this capping, the implications might be less obvious. To simplify, consider
only the ATET comparing the average of observed Y 1 with the above proposed predic-
tor for $E[Y^0|D=1]$. Trimming would only happen in the latter term, which uses only observations from the control group. Now, if any of the $D=0$ observations with large values of $\frac{p}{1-p}$ are trimmed or censored, we do not directly see how this changes the treatment group (for which the ATET is calculated). A simple solution could be to trim (i.e. delete) the $D=0$ observations with large values of $\frac{p}{1-p}$ in the calculation of $\hat E[Y^0|D=1]$ and to use the same trimming rule for the treated when averaging over
the Yi1 . You may then compare those D = 1 observations that have been deleted with
those D = 1 observations that have not been deleted to obtain an understanding of the
implications of this trimming.
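The following short sketch illustrates this symmetric trimming idea on simulated data; the cap of 10 on the odds $\hat p/(1-\hat p)$ is an arbitrary illustrative choice, not a recommendation from the text.

```python
# Sketch: symmetric trimming for the weighting-based ATET, applying the same
# propensity-score rule to treated and controls, then profiling the dropped treated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(10)
n = 4000
X = rng.normal(size=(n, 2))
D = rng.binomial(1, 1 / (1 + np.exp(-1.5 * X[:, 0])))
Y = X[:, 0] + X[:, 1] + 1.0 * D + rng.normal(size=n)

p_hat = LogisticRegression(C=1e6, max_iter=1000).fit(X, D).predict_proba(X)[:, 1]
cap = 10.0                                                    # ad hoc bound on p/(1-p)
keep = p_hat / (1 - p_hat) <= cap                             # same rule for both groups

t_keep, c_keep = keep & (D == 1), keep & (D == 0)
atet_hat = Y[t_keep].mean() \
    - (Y[c_keep] * p_hat[c_keep] / (1 - p_hat[c_keep])).sum() \
      / (p_hat[c_keep] / (1 - p_hat[c_keep])).sum()

# Compare dropped vs retained treated to see what population the ATET now refers to
dropped_t = (~keep) & (D == 1)
print("ATET on trimmed sample:", atet_hat)
print("mean X, retained treated:", X[t_keep].mean(axis=0))
print("mean X, dropped treated :", X[dropped_t].mean(axis=0))
```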
Concerning the asymptotic properties of estimators (3.33) and (3.34), there exist sev-
eral results in the literature deriving them for different estimators of the propensity
scores, see for example Hirano, Imbens and Ridder (2003), Huber, Lechner and Wunsch
(2013) and references therein. The non-parametric versions are typically calculating the
asymptotics for series estimators, namely power series. Applying slightly different (non-
testable) conditions they all show asymptotic efficiency, i.e. that the estimators reach the
variance bounds presented in Theorem 3.2 with asymptotically ignorable (smoothing
or approximation) bias. Although the asymptotics are admittedly important, they alone hardly give recommendations for practical estimation and inference. A main problem is that practitioners underestimate the leverage of an estimation error in the propensity score on the final treatment effect estimate: a small estimation error for p(·) may have a large impact on the treatment effect estimate. As p(·) is usually a smooth monotone function, its estimation errors appear small. This is also a reason, though not a good one, why propensity score based methods are so enticing. Above all, do not forget that we
already have seen in preceding discussions that in a semi-parametric estimation pro-
cedure you need to keep the non-parametric bias small, i.e. you have to undersmooth.
When suffering from the curse of dimensionality you even have to use bias-reducing
methods.
How should one proceed if the true propensity score p(·) is known, as it is for example in an experiment where the treatment assignment is under control? From Theorem 3.2 we see that the answer is different for ATE and ATET because the asymptotics change only for ATET when p(·) is known. Nonetheless it might be surprising that
for ATE it is asymptotically more efficient to weight by the estimated than by the true
propensity score if the used propensity estimators are consistent and fulfil certain effi-
ciency conditions.29 Recalling the discussion that followed Theorem 3.2 we must realise
that the knowledge of the propensity score only provides important information for the
ATET, because there we need the conditional distribution F(X |D = 1), and p(·) pro-
vides information about F(X |D = 1). Theorem 3.2 reveals that the knowledge of p(·)
reduces the variance part that comes from the sampling; it does not reduce the variance
parts coming from the prediction of m 0 (xi ) or m 1 (xi ). So the variance part coming from
sampling (referring to the difference between the sample distribution and the population
distribution) can be reduced for the ATET estimates. The possibly surprising thing is
that replacing p̂ by p in the ATE estimator does lead to a larger variance. The reason is
quite simple: the weighting with p̂(X i ) (the sample propensity score) is used to ex-post
(re-)balance the participants in your sample. Using p(X i ) does this asymptotically (i.e.
for the population) but not so for your sample.30
Both aspects, namely that the knowledge of p only helps to reduce the sampling variance related to the conditioning in F(X|D = 1) but does not help with the balancing, become obvious when looking at the following three ATET estimators using p, p̂ or both:
$$\frac{1}{n_1}\sum_{i=1}^n \hat p(X_i)\left\{\frac{Y_i D_i}{\hat p(X_i)} - \frac{Y_i(1-D_i)}{1-\hat p(X_i)}\right\}$$
$$\frac{1}{n_1}\sum_{i=1}^n p(X_i)\left\{\frac{Y_i D_i}{p(X_i)} - \frac{Y_i(1-D_i)}{1-p(X_i)}\right\}$$
29 See Robins and Rotnitzky (1995) for the case where the propensity score is estimated parametrically, and
Hirano, Imbens and Ridder (2003) where it is estimated non-parametrically.
30 See also Hirano, Imbens and Ridder (2003) who show that including the knowledge of p(·) as an
additional moment condition leads to exactly the same estimator for ATE as if one uses a direct ATE
estimator with estimated p(·).
$$\frac{1}{n_1}\sum_{i=1}^n p(X_i)\left\{\frac{Y_i D_i}{\hat p(X_i)} - \frac{Y_i(1-D_i)}{1-\hat p(X_i)}\right\}. \qquad (3.35)$$
In the first one we do the balancing well but do not appreciate the information contained
in p to improve the estimation of the integral dF(X |D = 1); in the second we do this
but worsen the balancing; in the last one we use p for estimating dF(X |D = 1) but
keep p̂ for the right sample balancing. Consequently, the last one is an efficient estima-
tor of ATET whereas the others are not. Still, in practice one should be careful when
estimating p(·) and keep the bias small. You get rewarded by asymptotically reaching
the efficiency bound.
We conclude with a practical note. It might happen that the weights $\frac{\hat p(x_i)}{1-\hat p(x_i)}$ do not sum up to one in the respective (sub-)sample. It is therefore recommended to normalise by the sum of the weights, i.e. to actually use, for the ATET,
$$\frac{\sum_{i=1}^n Y_i D_i}{\sum_{i=1}^n D_i} - \frac{\sum_{i=1}^n Y_i(1-D_i)\frac{\hat p(X_i)}{1-\hat p(X_i)}}{\sum_{i=1}^n (1-D_i)\frac{\hat p(X_i)}{1-\hat p(X_i)}} \qquad (3.36)$$
and, for the ATE,
$$\frac{\sum_{i=1}^n \frac{Y_i D_i}{\hat p(X_i)}}{\sum_{i=1}^n \frac{D_i}{\hat p(X_i)}} - \frac{\sum_{i=1}^n \frac{Y_i(1-D_i)}{1-\hat p(X_i)}}{\sum_{i=1}^n \frac{1-D_i}{1-\hat p(X_i)}}. \qquad (3.37)$$
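The normalised versions (3.36) and (3.37) only require dividing by the realised sums of weights. A minimal, self-contained sketch (simulated data, scikit-learn logit as a stand-in for any propensity model):

```python
# Sketch: normalised weighting estimators (3.36) for ATET and (3.37) for ATE.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=(n, 2))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = X[:, 0] + X[:, 1] + 2.0 * D + rng.normal(size=n)

p_hat = LogisticRegression(C=1e6, max_iter=1000).fit(X, D).predict_proba(X)[:, 1]
odds = p_hat / (1 - p_hat)

# (3.36): ATET with weights renormalised to sum to one in each group
atet_hat = (Y * D).sum() / D.sum() \
    - (Y * (1 - D) * odds).sum() / ((1 - D) * odds).sum()
# (3.37): ATE with renormalised inverse probability weights
ate_hat = (Y * D / p_hat).sum() / (D / p_hat).sum() \
    - (Y * (1 - D) / (1 - p_hat)).sum() / ((1 - D) / (1 - p_hat)).sum()
print(atet_hat, ate_hat)
```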
$$\widehat{ATE} = \frac{1}{n}\sum_{i=1}^n \left\{\hat m_1(X_i) - \hat m_0(X_i)\right\}$$
based on the conditional expectations $m_1$ and $m_0$,
which can be estimated non-parametrically from the sample. Both treatment effect esti-
mators are efficient under sufficient regularity conditions (mainly to keep the biases
small for p̂, m̂0 and m̂1). When using non-parametric estimators for the conditional expectations, we do not gain in efficiency by weighting (afterwards) with p̂, but even need more assumptions and non-parametric estimators than before. Therefore this estimator only becomes interesting when you do not want to use non-parametric estimators for the conditional expectations or p(·), and therefore risk running into misspecification problems. So we try to find a way of combining propensity score weighting
and regression in a way that we can model m d (·) and/or p(·) parametrically or semi-
parametrically, and get consistency if either the m d (·) or p(·) are correctly specified.
This would really be a helpful tool in practice as it simplifies interpretation and
estimation in the prior step.
In order to do this, let us rewrite the propensity score weighting ATET estimator:
$$E[Y^1 - Y^0\,|\,D=1] = \frac{1}{\Pr(D=1)}\, E\left[Y_i\cdot\left\{D_i - p(X_i)\,\frac{1-D_i}{1-p(X_i)}\right\}\right] = \frac{1}{\Pr(D=1)}\, E\Big[p(X)\cdot E[Y^1 - Y^0\,|\,X]\Big].$$
Alternatively, one can show that the weighting estimator can be written as a linear
regression:
regress Yi on constant, Di
using weighted least squares (WLS) with weights
$$\omega_i = D_i + (1-D_i)\,\frac{p(X_i)}{1-p(X_i)} \qquad (3.38)$$
to obtain an ATET estimate. The ATE estimation works similarly but with weights32
$$\omega_i = \frac{D_i}{p(X_i)} + \frac{1-D_i}{1-p(X_i)}. \qquad (3.39)$$
One can extend this idea to include further covariates at least in a linear way in this
regression. For estimating ATE we
regress Y on constant, D, X − X̄ and (X − X̄ )D (3.40)
using weighted least squares (WLS) with the weights ωi (3.39) ( X̄ denoting the
sample mean of X ). Basically, this is a combination of weighting and regres-
sion. An interesting property of these estimators is the so-called ‘double robust-
ness property’, which implies that the estimator is consistent if either the para-
metric (i.e. linear) specification (3.40) or the specification of p(·) in the weights
ωi is correct, i.e. that the propensity score is consistently estimated. The notion
of being robust refers only to model misspecification, not to robustness against
outliers.
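The following sketch implements the weighted regression (3.40) with weights (3.39) as a closed-form weighted least squares in numpy; the simulated data and the logit first step are illustrative assumptions, not part of the text. For the ATET one would instead use the weights (3.38) and centre X at the treated-sample mean.

```python
# Sketch: ATE via the weighted regression (3.40) with weights (3.39),
# implemented as closed-form weighted least squares.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 5000
X = rng.normal(size=(n, 2))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = X[:, 0] - X[:, 1] + 1.5 * D + rng.normal(size=n)          # simulated ATE = 1.5

p_hat = LogisticRegression(C=1e6, max_iter=1000).fit(X, D).predict_proba(X)[:, 1]
w = D / p_hat + (1 - D) / (1 - p_hat)                         # weights (3.39)

Xc = X - X.mean(axis=0)                                       # X - X_bar
Z = np.column_stack([np.ones(n), D, Xc, Xc * D[:, None]])     # regressors of (3.40)
beta = np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (w * Y)) # weighted least squares
ate_hat = beta[1]                                             # coefficient on D
print("ATE estimate:", ate_hat)
```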
Before we discuss this double robustness property for a more general case, let
us consider (3.40). Suppose we can estimate the weights (3.39) consistently (either
parametrically or non-parametrically). To see that the estimated coefficient of D esti-
mates the ATE consistently even if the linear model (3.40) is misspecified, note that
the plim of the coefficient of D, using weights (3.39) and setting X̃ := X − X̄ is
indeed
32 See Exercise 12.
$$e_2'\begin{bmatrix} E[\omega] & E[\omega D] & E[\omega\tilde X] & E[\omega\tilde X D] \\ E[\omega D] & E[\omega D^2] & E[\omega\tilde X D] & E[\omega\tilde X D^2] \\ E[\omega\tilde X] & E[\omega\tilde X D] & E[\omega\tilde X^2] & E[\omega\tilde X^2 D] \\ E[\omega\tilde X D] & E[\omega\tilde X D^2] & E[\omega\tilde X^2 D] & E[\omega\tilde X^2 D^2] \end{bmatrix}^{-1}\begin{bmatrix} E[\omega Y] \\ E[\omega D Y] \\ E[\omega\tilde X Y] \\ E[\omega\tilde X D Y]\end{bmatrix}
= e_2'\begin{bmatrix} 2 & 1 & 0 & 0\\ 1 & 1 & 0 & 0 \\ 0 & 0 & 2\,Var(X) & Var(X) \\ 0 & 0 & Var(X) & Var(X)\end{bmatrix}^{-1}\begin{bmatrix} E[\omega Y]\\ E[\omega DY] \\ E[\omega\tilde X Y] \\ E[\omega \tilde X DY]\end{bmatrix}$$
$$= -E[\omega Y] + 2E[\omega DY] = E\left[\frac{D}{p(X)}\,Y - \frac{1-D}{1-p(X)}\,Y\right] = E[Y^1 - Y^0] = ATE.$$
To estimate the ATET we need to use the weights (3.38) and run the regression of Y on a constant, D, $X - \bar X_1$ and $(X - \bar X_1)D$, where $\bar X_1$ now indicates the average of X among the D = 1 observations. With this
scaling of the regressors one can show analogously (in Exercise 11) that the ATET
is consistently estimated even if the linear regression specification was wrong. This
double robustness also holds when permitting non-linear specifications. Let $m_d(x) = E[Y|D=d, X=x]$ and let $\hat m_i^d := m_d(x_i; \hat\beta_d)$ be parametric estimators with finite-dimensional coefficient vectors $\hat\beta_1$ and $\hat\beta_0$. These parametric models can be linear or
non-linear. In addition, let p̂i := p(xi ; β̂ p ) be a parametric estimator of the propensity
score. An efficient estimator of $E[Y^1]$ is then obtained by
$$\frac{1}{n}\sum_{i=1}^n\left\{\frac{D_i Y_i}{\hat p_i} - \frac{(D_i - \hat p_i)\,\hat m_i^1}{\hat p_i}\right\}. \qquad (3.41)$$
We can easily show that it is consistent if either the parametric specification of the
propensity score or that of the outcome equation is correct. In other words, one of the
parametric models may be misspecified, but we still attain consistency. We show this
only for the estimator of E[Y 1 ] because the derivations for E[Y 0 ] are analogous. Let
β1∗ and β ∗p be the probability limits of the coefficient estimates in the outcome and the
propensity score model. Then the estimator of E[Y 1 ] in (3.41) converges to
$$E\left[\frac{DY}{p(X;\beta_p^*)} - \frac{\{D - p(X;\beta_p^*)\}\, m_1(X;\beta_1^*)}{p(X;\beta_p^*)}\right]. \qquad (3.42)$$
We only have to show that this expression equals $E[Y^1]$ if either the outcome model or the propensity score is correctly specified. Since $DY = DY^1$, expression (3.42) can be decomposed as
$$E[Y^1] + E\left[\frac{\{D - p(X;\beta_p^*)\}\{Y^1 - m_1(X;\beta_1^*)\}}{p(X;\beta_p^*)}\right], \qquad (3.43)$$
so it remains to show that the second term in (3.43) is zero. Consider first the case where the outcome model is correct, i.e. $m_1(X;\beta_1^*) = E[Y|X, D=1]$ a.s. (but $p(x;\beta_p^*)$ may not be). The second term in (3.43) can be written, after using iterated expectations with respect to D and X, as
$$E\left[\frac{\{D - p(X;\beta_p^*)\}\{Y^1 - m_1(X;\beta_1^*)\}}{p(X;\beta_p^*)}\right] = E\left[E\left[\frac{\{D - p(X;\beta_p^*)\}\{Y^1 - m_1(X;\beta_1^*)\}}{p(X;\beta_p^*)}\,\Big|\, D, X\right]\right] = E\left[\frac{D - p(X;\beta_p^*)}{p(X;\beta_p^*)}\,\Big\{E[Y^1|D,X] - m_1(X;\beta_1^*)\Big\}\right],$$
which is zero because the conditional independence assumption implies $E[Y^1|D,X] = E[Y^1|X] = E[Y|X, D=1] = m_1(X;\beta_1^*)$ when the outcome model is correctly specified. Consider now the case where the propensity score model is correct (but $m_1(X;\beta_1^*)$ may be misspecified). Iterating expectations with respect to $(Y^1, X)$ instead, the second term in (3.43) is again zero because $E[D|Y^1, X] = E[D|X] = \Pr(D=1|X) = p(X;\beta_p^*)$ by the conditional independence assumption and because the propensity score model is correctly specified.
It should be mentioned once again that in addition to the double robustness, these esti-
mators attain also the efficiency bound. Hence, if one intends to use parametric models to
estimate treatment effects, the combination of weighting and regression is very appeal-
ing due to efficiency and robustness considerations. When using fully non-parametric
approaches, then both methods, weighting and matching, can achieve efficiency on their
own; the combination cannot improve this, yet Firpo and Rothe still show advantages
with respect to regularity conditions.
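A minimal sketch of the doubly robust estimator (3.41) and its analogue for $E[Y^0]$ follows. Everything in it (the simulated data, the deliberately nonlinear outcome, the linear outcome models and the logit propensity) is an illustrative assumption; the point is only that the combination remains consistent when one of the two parametric models is wrong.

```python
# Sketch: doubly robust (AIPW) estimation of E[Y^1] and E[Y^0] as in (3.41),
# with possibly misspecified linear outcome models and a logit propensity score.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(4)
n = 5000
X = rng.normal(size=(n, 2))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = np.exp(0.5 * X[:, 0]) + X[:, 1] + 1.0 * D + rng.normal(size=n)

p_hat = LogisticRegression(C=1e6, max_iter=1000).fit(X, D).predict_proba(X)[:, 1]
m1_hat = LinearRegression().fit(X[D == 1], Y[D == 1]).predict(X)   # parametric m_1
m0_hat = LinearRegression().fit(X[D == 0], Y[D == 0]).predict(X)   # parametric m_0

# (3.41) and its analogue for E[Y^0]
ey1 = np.mean(D * Y / p_hat - (D - p_hat) * m1_hat / p_hat)
ey0 = np.mean((1 - D) * Y / (1 - p_hat) + (D - p_hat) * m0_hat / (1 - p_hat))
print("ATE estimate:", ey1 - ey0)
```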
where the weights w(i, j) are determined by the applied method like for example kNN,
Nadaraya–Watson and local linear regression (which we discussed in Section 2.2.1 and
above). Alternatively used (or recommended) methods are blocking (as an extension of
kNN), ridge regression or radius matching (as a special case of kernel regression).34
A particularity of the radius matching is that the worst match, i.e. the largest distance
from a participant to a control, determines the bandwidth size. The weights w(i, j) may
refer to the distance of either the vectors of confounders, or of the propensity scores
$p(X_i) - p(X_j)$, or even a mixture of both. As discussed, sometimes it is also proposed to include variables that have strong predictive power for the outcome Y but are not really confounders, i.e. variables having no impact on the propensity to participate.
33 Remember that you can analogously estimate ATEN by simply replacing D by (1 − D ), i.e. declaring
i i
the treatment group to be the controls and vice versa.
34 See Lechner, Miquel and Wunsch (2011).
For the (re-)weighting estimators, there are quite a few proposals. Consider
$$\widehat{ATET} = \frac{1}{n_1}\sum_{i:D_i=1} Y_i - \frac{1}{n_0}\sum_{j:D_j=0} w(j)\,Y_j, \qquad (3.45)$$
with $w(j)$ given by
$$\frac{n_0}{n_1}\,\frac{\hat p(X_j)}{1-\hat p(X_j)}, \qquad\text{or}\qquad \frac{\hat p(X_j)}{1-\hat p(X_j)}\,\frac{n_0}{\sum_{i:D_i=0}\frac{\hat p(X_i)}{1-\hat p(X_i)}}, \qquad\text{or}\qquad \frac{(1-c_j)\,\hat p(X_j)}{1-\hat p(X_j)}\,\frac{n_0}{\sum_{i:D_i=0}(1-c_i)\frac{\hat p(X_i)}{1-\hat p(X_i)}}$$
$$\text{with}\quad c_i = \left(1 - \frac{n\,\hat p(X_i)}{n_1}A_i\right)\frac{\frac{1}{n}\sum_{j=1}^n\left(1 - \frac{n\,\hat p(X_j)}{n_1}A_j\right)}{\frac{1}{n}\sum_{j=1}^n\left(1 - \frac{n\,\hat p(X_j)}{n_1}A_j\right)^2}, \qquad\text{in which}\quad A_j = \frac{1-D_j}{1-\hat p(X_j)}.$$
The latter results from a variance minimising linear combination of the former
weights.35
Another alternative is inverse probability tilting, which starts from the critique that the propensity score estimate p̂ used in the (re-)weighting estimators may maximise the likelihood for estimating the propensity score but is not optimal for estimating treatment effects. A method tailored towards treatment effect estimation is to re-estimate (after having calculated p̂) two propensity functions, say (p̃0, p̃1), by solving the moment conditions36
$$1 = \frac{1}{n}\sum_{i=1}^n \frac{1-D_i}{\frac{1}{n}\sum_{j=1}^n \hat p(X_j)}\,\frac{\hat p(X_i)}{1-\tilde p_0(X_i)} \qquad\text{and}$$
$$\frac{1}{n}\sum_{i=1}^n \frac{\hat p(X_i)}{\frac{1}{n}\sum_{j=1}^n \hat p(X_j)}\, X_i = \frac{1}{n}\sum_{i=1}^n \frac{1-D_i}{\frac{1}{n}\sum_{j=1}^n \hat p(X_j)}\,\frac{\hat p(X_i)}{1-\tilde p_0(X_i)}\, X_i,$$
and the same way p̃1 by substituting Di for 1 − Di and 1 − p̃0 (X i ) by p̃1 (X i ).37 Then,
the following ATET estimator is suggested:
$$\widehat{ATET} = \sum_{i:D_i=1}\frac{\hat p(X_i)}{\tilde p_1(X_i)\sum_{j=1}^n \hat p(X_j)}\,Y_i \;-\; \sum_{j:D_j=0}\frac{\hat p(X_j)}{\{1-\tilde p_0(X_j)\}\sum_{i=1}^n \hat p(X_i)}\,Y_j. \qquad (3.46)$$
There exist some proposals for correcting these kinds of estimators for their finite sample bias.38 This bias correction may be attractive if a simple but reasonable estimate of the bias is available. It is not hard to see that for $w(j)$ as in (3.45), or setting $w(j) = \frac{n_0}{n_1}\sum_{i:D_i=1} w(i,j)$ with $w(i,j)$ as in (3.44), the bias of the above estimator of $E[Y^0|D=1]$ can be approximated by (3.47), where the $\hat Y_i^0$ are the predictors for the non-treatment outcome in (3.45) or (3.44), respectively.
In order to do further inference, even more important than estimating the bias is
the problem of estimating the standard error of the estimators. There exist few explicit
variance estimators in the literature but many different proposals how to proceed in prac-
tice. A popular but coarse approach is to take an asymptotically efficient estimator for
the wanted treatment effect, and to (non-parametrically) estimate the efficiency bounds
given in Theorem 3.2. These bounds, however, can be far from the true finite sample
variances. Therefore it is common practice to approximate variances via simple39 boot-
strapping due to convenience and seemingly improved small sample results.40 Potential
alternative resampling methods are wild bootstrap41 and subsampling,42 but this is still
an open field for further research.
There exists, however, a generally accepted method for estimating the variance of
linear estimators, i.e. those that can be written in terms of $\sum_{i=1}^n w(i)Y_i$ when the observations are independent. Let us consider the ATET estimator
$$\widehat{ATET} = \frac{1}{n_1}\sum_{i:D_i=1}\left\{Y_i - \hat m_0(X_i)\right\}, \qquad\text{with } Y_i^d = m_d(X_i) + U_i^d.$$
For all kinds of estimators we have considered so far, we have (for some weights, say $w(j,i)$)
$$\hat m_0(X_i) = \sum_{j:D_j=0} w(j,i)\,Y_j^0 = \sum_{j:D_j=0} w(j,i)\,\{m_0(X_j) + U_j^0\} = \sum_{j:D_j=0} w(j,i)\,m_0(X_j) + \sum_{j:D_j=0} w(j,i)\,U_j^0,$$
38 See for example Abadie and Imbens (2011) or Huber, Lechner and Steinmayr (2013).
39 We call simple bootstrap the resampling procedure where random samples {(Y , X , D )∗ }n
i i i i=1 are drawn
with replacement directly from the original sample, maybe stratified along treatment.
40 Moreover, Abadie and Imbens (2008) showed that bootstrapping is inconsistent for the kNN matching
estimator.
41 See Mammen (1992). In wild bootstrap one relies on the original design {(X , D )}n
i i i=1 but generates
{Yi∗ }i=1
n from estimates m̂ d and some random errors. Note that generally, naive bootstrap is estimating
the variance of the conditional treatment effects, say AT E(x), inconsistently.
42 See Politis, Romano and Wolf (1999).
which equals $m_0(X_i)$ plus a smoothing bias $b(X_i)$ and the random term $\sum_{j:D_j=0} w(j,i)U_j^0$. Therefore we can write
$$\widehat{ATET} = \frac{1}{n_1}\sum_{i:D_i=1} Y_i^1 + \sum_{j:D_j=0}\Big(\sum_{i:D_i=1}\frac{-w(j,i)}{n_1}\Big) Y_j^0 = \frac{1}{n_1}\sum_{i:D_i=1}\{m_1(X_i)+U_i^1\} + \sum_{j:D_j=0}\Big(\sum_{i:D_i=1}\frac{-w(j,i)}{n_1}\Big)\{m_0(X_j)+U_j^0\},$$
so that, conditionally on the covariates and treatment status, only the $U$ terms are random and
$$Var[\widehat{ATET}\,|\,x_1,\dots,x_n,d_1,\dots,d_n] = \sum_{i=1}^n w(i)^2\, Var[U^D|X=x_i, D=d_i]. \qquad (3.48)$$
Generally, it is not hard to show that conditional on the covariates, i.e. on confounders X and treatment D, Formula 3.48 applies to basically all the estimators presented here.
Nonetheless, there are two points to be discussed. The first is that we still need to
estimate the V ar [U D |X = xi , D = di ]; the second is that we have conditioned on
the sample design. This implies that we neglect variation caused by potential differences between the sample distribution of (X, D) and the population distribution.
Whether this makes a big difference or not depends on several factors like whether we
used global or local smoothers (the impact is worse for the former ones) and also on the
variance of AT E T (X ). Some resampling methods are supposed to offer a remedy here.
Coming back to (3.48) and knowing the w(i), for the prediction of V ar [U D |X =
xi , D = di ] different methods have been proposed in the past.43 It might be helpful to
realise first that in order to get a consistent estimator in (3.48), we only need asymptot-
ically unbiased predictors. This is similar to what we discussed in Section 2.2.3 in the
context of root-n-consistent semi-parametric estimators: the (albeit weighted) averaging over i provides the variance with a rate of 1/n such that only the bias has to be shrunk to $O(n^{-1/2})$ for obtaining root-n convergence. Consequently, you may in (3.48) simply replace $Var[U^D|X = x_i, D = d_i]$ by $(Y_i - \hat m_{d_i}(X_i))^2$. A quite attrac-
tive and intuitive procedure is to go ahead with exactly the same smoother m̂ d used for
obtaining the treatment effect estimate. Certainly, as for ATET you only needed m̂ 0 (or
for ATEN only m̂ 1 ), the lacking regression m 1−d has also to be estimated for ATE. But
it is still the same procedure.
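The following sketch puts (3.48) to work for a kernel-matching ATET estimator on the propensity score. It is only an illustration under simplified assumptions: simulated data, an ad hoc bandwidth, and within-group Nadaraya–Watson smoothers (including each own observation) as crude residual-based proxies for the conditional variances.

```python
# Sketch: conditional variance of a kernel-matching ATET estimator via (3.48),
# plugging squared residuals in for Var[U^D | X = x_i, D = d_i].
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 2000
X = rng.normal(size=(n, 2))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = X[:, 0] + X[:, 1] + 1.0 * D + rng.normal(size=n)

p_hat = LogisticRegression(C=1e6, max_iter=1000).fit(X, D).predict_proba(X)[:, 1]

def nw_weights(p_eval, p_ref, h=0.05):
    """Kernel weights: rows are evaluation points, columns the reference sample."""
    k = np.exp(-0.5 * ((p_eval[:, None] - p_ref[None, :]) / h) ** 2)
    return k / k.sum(axis=1, keepdims=True)

t, c = D == 1, D == 0
n1 = t.sum()
W = nw_weights(p_hat[t], p_hat[c])           # w(j, i): weight of control j for treated i
m0_t = W @ Y[c]                               # m0-hat at the treated points
atet_hat = np.mean(Y[t] - m0_t)

# Linear-estimator weights w(i): 1/n1 for treated, -(1/n1)*sum_i w(j,i) for controls
w_lin = np.empty(n)
w_lin[t] = 1.0 / n1
w_lin[c] = -W.sum(axis=0) / n1

# Residual-based proxies for the conditional variances (same smoother within groups)
m1_t = nw_weights(p_hat[t], p_hat[t]) @ Y[t]
m0_c = nw_weights(p_hat[c], p_hat[c]) @ Y[c]
res2 = np.empty(n)
res2[t] = (Y[t] - m1_t) ** 2
res2[c] = (Y[c] - m0_c) ** 2

var_atet = np.sum(w_lin ** 2 * res2)          # formula (3.48)
print(atet_hat, np.sqrt(var_atet))
```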
Simulation-based comparison studies have mainly looked at the finite sample perfor-
mance of the treatment effect estimates, not on estimators of the variance or bias. These
43 See Dette, Munk and Wagner (1998) for a review of non-parametric proposals.
revealed, among others, the following findings: bias correction with (3.47) can generally be recommended but increases the variance. For the (re-)weighting estimators trimming can
be important to obtain reliable estimates of treatment effects, but the discussion about
the question of an adequate trimming is still controversial; we will discuss trimming in
the context of practical issues when using the propensity score. Moreover, there is an
interplay between trimming and bandwidth choice partly due to the so-called boundary
problems in non-parametric regression. Generally, cross validation (CV) that evaluates the prediction power of m 0 (·) seems to be a reasonable bandwidth selection criterion for our purpose. While it is true that CV aims to minimise the mean squared error of the non-parametric predictors rather than that of the ATE or ATET estimate, its tendency to undersmooth in the non-parametric part is exactly what we need for our semi-parametric estimation problem. As already mentioned, the ridge regression based estimates are less sensitive to bandwidth choice, and so are bias-corrected versions as these try to correct for the
smoothing bias. For the rest, the main findings are that – depending on the underly-
ing data generating process, the (mis-)specification of the propensity score function, or
the compliance with the common support condition – most of the introduced estimators have their advantages but also their pitfalls, so that further general recommendations can
hardly be given. Maybe surprisingly, even for a given data generating process, the rank-
ing of estimators can vary with the sample size. The main conclusion is therefore that it
is good to have different estimators, use several of them and try to understand the dif-
ferences in estimation results by the above highlighted differences in construction and
applied assumptions.44 For further results see Frölich, Huber and Wiesenfarth (2017).
Figure 3.1 Density $f_{P|D=0}$ for men (left), and density $f_{P|D=1}$ for women (right); both panels show the density against p(X).
not randomly allocated over the entire population, the distributions for the different D
would be more dissimilar, for example with most of the mass to the left for the controls
and most of the mass to the right for the treated.
45 Similarly, Black and Smith (2004) define the ‘thick support’ region as 0.33 < P < 0.67 and examine an
additional analysis for this region and give arguments for this choice.
most of the data are. Additional reasons are for example that a very high value of Pi
for individual i with recorded Di = 0 could be an indication of measurement error in
Di or X i . There may be less reason for suspecting measurement errors when Pi takes
intermediate values. Another reason is that, under certain assumptions, the bias due to
any remaining selection-on-unobservables is largest in the tails of the distribution of
P.46 Finally, trimming at the boundaries typically improves the performance of the non-
parametric estimator. There is certainly always a bias-variance trade-off; the trick is that,
as the bias is the expected distance to the parameter of interest, a simple redefinition
of this parameter of interest can make the bias disappear. Specifically, we declare the
parameter of interest to be the ATE or ATET for the finally chosen set X01 . Trimming
changes this set towards a set on which the non-parametric estimator works pretty well
(has small variance) while the theoretical bias increases due to the suppression of certain
observations. This is eliminated by our (re-)definition of the parameter of interest. Con-
sequently, trimming can achieve a (seemingly free-lunch) variance reduction. However,
as trimming is only used to improve the finite sample variance, we should be aware of
the fact that for increasing sample size the estimation improves even where the propen-
sity score is extremely low or high. For this reason, alternative trimming procedures
were proposed in the literature; see Section 3.5.1.
Example 3.10 If in the true model there is only one strong predictor of D, estimating
the propensity score with only this variable would ensure that we compare only obser-
vations with the same characteristic. If, on the other hand, we include many additional
insignificant variables in X , the estimated propensity scores would then contain a lot
of noise and it would be more or less random which control individual is matched to a
given treated individual.
46 See Black and Smith (2004, pp. 111–113) for an illustrative example of such a situation.
On the other hand, if they are good predictors for Y , then they can equally well reduce
the variance of treatment effect estimates. If pre-programme outcome data Yt=0 or even
Yt=−1 , Yt=−2 , etc. are available, it is also helpful to examine a regression of Yt=0 on
various X variables. If we expect Y to be rather persistent over time, this provides us
with guidance on likely important predictors of the outcome variable, which should
be included in X even if they affect D only little. The reason for this, though, is a
different one. It actually refers exactly to the problem which we discussed in the context
of randomised experiments: it is about variance reduction (while the inclusion of real
confounders is about bias reduction, or say ‘identification’).
Example 3.11 When analyzing effects of some treatment on incomes, gender might be a
good predictor of income. Even when gender is balanced between treatment and control
(e.g. RCT), i.e., it is not a confounder, controlling for gender reduces variance, as we
estimate treatment effects by gender with subsequent averaging across gender by their
proportions.
This example shows nicely the pros and cons of including ‘additional’ (in the sense of
not being confounders in its strict definition) covariates. Obviously, it is not always easy
to decide which of those variables one should include or not. This can only be found out
by analysing the impact of X on Y and D. The proper set of confounders does not change
when switching from matching to propensity score matching or weighting.47 Similarly,
also the choice of smoothing parameter (like the number of knots for splines or the order
of the polynomial for power series) is not trivial. In order to decide this it is helpful to
remember that in the context of experimental designs we used the propensity function
to assess balance in covariates. What we want to reach in the context of matching and
regression is a conditional balance: we require
$$X \perp\!\!\!\perp D \,|\, p(X). \qquad (3.49)$$
What does this mean in practice and how can we make use of it? Imagine all confounders were discrete. Then it simply says that for all x ∈ X01 you should get
$$\frac{n_{1x}}{p(x)} \approx \frac{n_{0x}}{1-p(x)}, \qquad (3.50)$$
where $n_{dx}$ is the number of individuals i with $(D_i = d, X_i = x)$. If we have also con-
tinuous confounders, one has to build strata and blocks accordingly to perform a similar analysis. Similarly to what has been said in the context of randomised experiments, formal testing of conditional balance is not recommended, particularly if the sample size varied after trimming. Especially attractive for continuous confounders, one could proceed
along p(X ) instead of checking along X : for any value of p(X ) or a subset of values, the
variables X should be balanced between the D = 1 and the D = 0 group in the sense
47 However, as we have seen, from a structural modelling point of view one could ask which confounders
should be used for modelling the selection, and which (only) for the regression part.
48 Although Y d ⊥
⊥ D| p(X ) is the real goal. But this is not testable.
that the number of observations are very similar when inversely weighted by p(X ) and
(1 − p(X )), cf. Equation 3.50. If this is not the case, the propensity score model is likely
to be misspecified and has to be respecified until balance is achieved.49 One way to pro-
ceed is to sort the estimated propensity scores and group them into five or ten strata i.e.
using quintiles or deciles. By the balancing property of the propensity score we have
$$E\left[\frac{X\cdot D}{p(X)}\,\Big|\; a\le p(X)\le b\right] = E\left[\frac{X\cdot(1-D)}{1-p(X)}\,\Big|\; a\le p(X)\le b\right].$$
Then in each block the absolute difference of the weighted (by p(X )) average of X
in the D = 1 and the D = 0 group is examined, standardised by the standard devia-
tion of X . If the absolute difference is large, the propensity score model is respecified
by making it more flexible (less smooth by decreasing the bandwidth, increasing the
knots or the order of the polynomial, etc.). In any case we are looking for a weighted
(by the inverse of Pr(D = d|X = x)) balance in X between the different treatment
groups. A test tells us only if we were able to statistically prove (weighted) imbalances.
Already a generous number of confounders can reduce the power of any such test to an extent that it hardly ever finds significant imbalances and may therefore lead to wrong conclusions.
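A simplified sketch of such a diagnostic follows: it sorts the estimated propensity scores into decile blocks and reports standardised absolute differences of the covariates between the D = 1 and D = 0 groups within each block. The unweighted within-block means and the decile grid are illustrative simplifications, not a prescription from the text.

```python
# Sketch: a covariate balance check within propensity-score strata (deciles),
# reporting standardised absolute differences of X between D=1 and D=0.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 4000
X = rng.normal(size=(n, 3))
D = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1]))))

p_hat = LogisticRegression(C=1e6, max_iter=1000).fit(X, D).predict_proba(X)[:, 1]
edges = np.quantile(p_hat, np.linspace(0, 1, 11))             # decile blocks

for b in range(10):
    block = (p_hat >= edges[b]) & (p_hat <= edges[b + 1])
    if D[block].sum() == 0 or (1 - D[block]).sum() == 0:
        continue                                              # no comparison possible
    diff = X[block & (D == 1)].mean(axis=0) - X[block & (D == 0)].mean(axis=0)
    sdiff = np.abs(diff) / X[block].std(axis=0)
    print(f"block {b}: max standardised difference = {sdiff.max():.3f}")
```

If large standardised differences show up in some blocks, the propensity model would be respecified (made more flexible) and the check repeated, all without ever touching the outcome data.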
49 We say here ‘likely’ because it can well be that very different X values predict the same or similar
treatment propensity allowing for those imbalances.
50 We do not refer to local smoothing parameters but to local smoothers. Global smoothers will also be
seriously affected by sampling along X whereas local smoothers like kNN, kernels or splines will not.
51 In the example mentioned above, the relevant condition is Pr (individual is in sample|X, D = 1) ∝
Pr (individual is in sample|X, D = 0). Notice that this proportionality condition refers to the marginal
sampling probability with respect to X only.
As $E[\pi(d)|c_i] = E[\varphi(d, e_i) - c_i d\,|\,c_i]$, where the expectation runs only over $e_i$ (the rest is given), one has $D_i = 1\!1\{E[\varphi(1, e_i) - \varphi(0, e_i)\,|\,c_i] \ge c_i\}$, which is just a deterministic (though unknown) function of $c_i$. If $c_i$ is independent of the $e_i$, then you get
(ϕ(1, ei ), ϕ(0, ei )) ⊥⊥ ci . In this case we have even unconfoundedness without condi-
tioning. So one could identify the treatment effect on production without observing the
ci although the firms knew their ci and use it for the selection decision. But we could
not identify the treatment effect on profits without observing all ci . If X comprises all
information about ci , it is sufficient to condition on X . See also Exercise 9.
In any case, the conditional mean independence assumption might be the minimal
identifying assumption and cannot be validated from the data. Its assertion must be based
on economic theory, institutional knowledge and beliefs. It could only be rigorously
tested if one were willing to impose additional assumptions. With such over-identifying
assumptions it can be tested whether given a certain set of assumptions, the remaining
assumptions are valid. If under these conditions the latter assumptions were not rejected,
the identification strategy would be considered as being credible.
Nonetheless, simply claiming that one believes in the independence assumption might
be unsatisfactory. In order to get some more insight, it is common to conduct falsifica-
tion tests, also called pseudo-treatment tests. For example, an indirect test of the CIA
is to examine whether we would obtain a zero treatment effect when comparing sub-
populations for which we knew that they were either both treated or both untreated. In
these test situations we know that the estimated effects should be zero if the CIA were
true. Hence, if nonetheless the estimated treatment effect is significantly different from
zero, one concludes that the CIA fails. Examples are: split the controls into two groups,
and see whether you find AT E = 0 comparing them; another example is given below.
If such a falsification test fails, one would be doubtful about the CIA. If one is able to
conduct different falsification tests and hardly any of them fails, one would be more
inclined to believe the CIA.52 Let us consider another example.
Example 3.13 Access to social programmes often depends on certain eligibility criteria.
This leads to three groups: ineligibles, eligible non-participants and (eligible) partic-
ipants. We are only interested in the ATET as the ineligibles will never be treated.
52 For a good example of how falsification analysis helps to increase the credibility of the findings see Bhatt
and Koedel (2010).
Therefore it is sufficient to check Y 0 ⊥⊥ D|X . The first two groups are non-participants
and their Y 0 outcome is thus observed. Usually both groups have different distributions
of X characteristics. If one strengthens the conditional independence assumption to independence of Y 0 from both eligibility and participation conditional on X, then the (average) outcome of Y among the ineligibles and among the eligible non-
participants should be about identical when adjusting for differences in the distribution
of X . This is testable and might indicate whether Y 0 ⊥⊥ D|X holds.
So we see that you may simply split the control group into two (T ∈ {0, 1}, e.g.
eligibles vs. non-eligibles) for testing Y 0 ⊥⊥ T |X or the treatment group for testing
$Y^1 \perp\!\!\!\perp T|X$. These tests are then interpreted as indicators for the validity of $Y^0 \perp\!\!\!\perp D|X$ and $Y^1 \perp\!\!\!\perp D|X$, respectively.
An especially interesting situation for this pseudo-treatment approach is the case
where information on the outcome variable before the treatment happened is available,
e.g. in the form of panel data. One can then examine differences between participants
and non-participants before the treatment actually happened (and hopefully before the
participants knew about their participation status as this might have generated anticipa-
tion effects). Since the treatment has not yet happened, there should be no (statistically
significant) difference in the outcomes between the subpopulation that is later taking
treatment, and the subpopulation that is later not taking treatment (at least after control-
ling for confounders X ). This is known as the pre-programme test or pseudo-treatment
test.
Let us discuss such a situation more in detail. Suppose that longitudinal data on par-
ticipants and non-participants are available for up to k + 1 periods before treatment
started (at t = 0). As an example think of an adult literacy programme that starts at time
t = 0 and where we measure the outcome at time 1. Let us consider the CIA condi-
tion for ATET. Before time 0, all individuals are in the non-treatment state. We assume
0
that there are no anticipation effects, such that Yt=0 = Yt=0 is fulfilled. Assume that
conditional independence holds at time t = 1:
0
Yt=1 ⊥⊥ Dt=0 |X, Yt=0
0
, Yt=−1
0
, Yt=−2
0
, . . . , Yt=−k
0
, (3.51)
0
Yt=l ⊥⊥ Dt=0 |X, Yt=l−1
0
, Yt=l−2
0
, Yt=l−3
0
, . . . , Yt=−(k+1)
0
, l = 0, −1, −2, . . .
(3.52)
This assumption is testable, because at time t = 0 we observe the non-treatment outcome $Y^0$ both for those with $D_{t=0} = 0$ and for those with $D_{t=0} = 1$, i.e. those who will later participate in treatment. Assumption 3.51 is untestable because at time 1 the outcome $Y^0_{t=1}$ for those with $D_{t=0} = 1$ is counterfactual (it could never be observed because these individuals received treatment); in other words, only $Y^1_{t=1}$ can be observed. Hence,
if we are willing to accept equivalence of (3.51) and (3.52), we could estimate the treat-
ment effects in those previous periods and test whether they were zero. If they are
statistically different from zero, participants and non-participants were already differ-
ent in their unobserved confounders before the treatment started, even conditional on X .
To be able to use this test, we need to have additional lags $Y^0_{t=-l}$, $l > k$, that were
not included as control variables in (3.51). To implement this test it is useful to think of
it as if some pseudo-treatment had happened at time zero or earlier. Hence, we retain
the observed indicator Dt=0 as defining the participants and non-participants and pre-
tend that the treatment had started already at time −1. Since we know that actually no
treatment had happened, we expect treatment effect to be zero. Statistically significant
non-zero estimates would be an indication for CIA violation. A simple and obvious case
is that where you check $Y^0_{t=1} \perp\!\!\!\perp D_{t=0}|X$ by testing $Y^0_{t=0} \perp\!\!\!\perp D_{t=0}|X$. Here $k = 0$, such that the lagged outcome is not included in the original conditioning set but is only used for the pre- or pseudo-treatment test.
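A minimal sketch of such a pseudo-treatment test on simulated data follows; the logit first step, the weighting estimator and the crude bootstrap (which keeps p̂ fixed across draws rather than re-estimating it) are illustrative simplifications.

```python
# Sketch: a pseudo-treatment (pre-programme) test with one pre-treatment outcome.
# We estimate the 'effect' of the later treatment on the pre-treatment outcome,
# which should be statistically indistinguishable from zero if the CIA is plausible.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 4000
X = rng.normal(size=(n, 2))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))               # treatment at t = 0
Y_pre = X[:, 0] + X[:, 1] + rng.normal(size=n)                # outcome measured before treatment

p_hat = LogisticRegression(C=1e6, max_iter=1000).fit(X, D).predict_proba(X)[:, 1]

def placebo_effect(p, d, y):
    """Normalised IPW contrast of the pre-treatment outcome between D=1 and D=0."""
    return (y * d / p).sum() / (d / p).sum() \
        - (y * (1 - d) / (1 - p)).sum() / ((1 - d) / (1 - p)).sum()

est = placebo_effect(p_hat, D, Y_pre)

# crude standard error via the simple bootstrap (p-hat is not re-estimated here)
boot = []
for _ in range(200):
    idx = rng.integers(0, n, n)
    boot.append(placebo_effect(p_hat[idx], D[idx], Y_pre[idx]))
print("pseudo-treatment effect:", est, "bootstrap s.e.:", np.std(boot))
```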
So far we tried to check the CIA for ATET but can we extend this idea in order to
check also
$$Y^1_{t=1} \perp\!\!\!\perp D_{t=0}\,\big|\,X, Y^0_{t=0}, Y^0_{t=-1}, Y^0_{t=-2},\dots,Y^0_{t=-k},$$
i.e. the assumption we need for ATE and ATEN? In fact, we cannot, since pre-treatment periods only provide information about $Y^0$ but not about $Y^1$.
What if we find significant pseudo-treatment effects? We might consider them as an
estimate of the bias due to unobserved confounders and be willing to assume that this
bias is constant over time. Then we could first proceed as if the CIA held and afterwards correct the resulting estimate by subtracting the bias estimate. This is the basic idea of
difference-in-difference (DiD) estimators and DiD-Matching to be discussed in a later
chapter of this book, and it is also essentially the intuition behind fixed effects estimators
for panel data models.
These do not have to be ordered.55 The average treatment effect for two different treatments m and l would thus be $E[Y^m - Y^l]$. With these assumptions we can identify and estimate all ATE or ATET for any combination $m \neq l$ of programmes.
More specifically: if D is discrete with only a few mass points M, M << n, we could
estimate m d (x) separately in each sub-population, i.e. for each value of d in Supp(D).
But if D takes on many different values, e.g. M large or D being continuous (or multi-
variate) and thus ordered, then the estimator of m d (x) also has to smooth over D. Hence,
we could still use
$$\hat E[Y^0] = \frac{1}{n}\sum_{i=1}^n \hat m_0(X_i),$$
55 I.e. treatments 0, 1 and 2 can be different training programmes with arbitrary ordering. If, however, they
represented different dosages or intensities of the same treatment, then one would like to invoke
additional assumptions such as monotonicity, which would help to identify and improve the precision of
the estimates.
The latter result is obtained by showing that the conditional independence also implies
Y d ⊥⊥ D| p m|ml , D ∈ {m, l}.
Instead of conditioning on p m|ml it is also possible to jointly condition on p m and pl ,
because p m|ml is a function of them. Hence, we also could consider
Y d ⊥⊥ D|( p m , pl ), D ∈ {m, l}.
These results suggest different estimation strategies via propensity score matching. If
one is interested in all pairwise treatment effects, one could estimate a discrete choice
model such as multinomial probit (MNP) (or a multinomial logit (MNL) if the differ-
ent treatment categories are very distinct),56 which delivers consistent estimates of the
marginal probabilities pl (x) for all treatment categories.
If computation time for the MNP is too demanding, an alternative is to estimate all
the M(M − 1)/2 propensity scores p m|ml by using binary probits for all pairwise com-
parisons separately. From a modelling perspective, the MNP model might be preferred
because if the model is correct, all marginal and conditional probabilities would be con-
sistently estimated. The estimation of pairwise probits, on the other hand, does not seem
to be consistent with any well-known discrete choice model.57 On the other hand, spec-
ification tests and verification of balancing are often easier to perform with respect to
binary probits to obtain a well-fitting specification. Using separate binary probits has
also the advantage that misspecification of one of the binary probit models does not
imply that all propensity scores are misspecified (as would be the case with an MNP
model). So far, comparison studies of these various methods have found little difference
in their relative performance.58 Overall, estimating separate binary probit models seems
to be a flexible and convenient approach.
Whichever way one chooses to estimate the propensity scores, one should define the
common support with respect to all the propensity scores. Although it would suffice for
the estimation of E[Y m − Y l |D = m] to examine only p m|ml for the support region, the
interpretation of various effects such as E[Y m − Y l |D = m] and E[Y m − Y k |D = m]
56 The MNL is based on stronger assumptions than the MNP. A well-known implication is the Independence
of Irrelevant Alternatives (IIA), which is often not plausible if some of the choice options are more similar
than others. A nested logit approach might be an alternative, e.g. if the first decision is whether to attend
training or not, and the exact type of training is determined only as a second decision. This, however,
requires a previous grouping of the categories. For semi-parametric MNL see Langrock, Heidenreich and
Sperlich (2014). MNP is therefore a more flexible approach if computational power permits its use.
57 I.e. the usual discrete choice model would assume that all choices made and the corresponding
characteristics X have to be taken into account for estimation. A pairwise probit of m versus l, and one for
l versus k, etc. would not be consistent with this model.
58 Compare for example the studies and applications in Gerfin and Lechner (2002), Lechner (2002a) or
Gerfin, Lechner and Steiger (2005).
would be more difficult if they were defined for different subpopulations due to the
common support restriction. A comparison of the estimates could not disentangle dif-
ferences coming from different supports compared to differences coming from different
effects.
One (relatively strict) way to implement a joint common support is to delete all observations for which at least one of the estimated probabilities is larger than the smallest maximum or smaller than the largest minimum over the subsamples defined by D. For
an individual who satisfies this restriction we can thus be sure that we find at least one
comparison observation (a match) in each subgroup defined by D. Instead of matching,
a propensity score weighting approach is also possible. In fact, it is straightforward to
show that, recall (3.55),
!
E[Y |D = l] = E Y m |X = x, D = m dF(x|l)
m
. /
m 1− p
m|ml (X ) Pr (D = m)
=E Y · |D = m .
p m|ml (X ) Pr (D = l)
The latter can be estimated by
n
1 − p̂ m|ml (X i ) n m
Yi with n k = 11{Di = k}.
nm p̂ m|ml (X i ) nl
i:Di =m i=1
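A small sketch of this pairwise weighting approach with three unordered programmes follows; the multinomial assignment mechanism, the binary logit for $p^{m|ml}$ and the simulated effects are all illustrative assumptions.

```python
# Sketch: pairwise propensity-score weighting with three unordered treatments.
# p^{m|ml} is estimated by a binary logit on the subsample with D in {m, l}.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
n = 6000
X = rng.normal(size=(n, 2))
score = np.column_stack([np.zeros(n), X[:, 0], 0.5 * X[:, 1]])
prob = np.exp(score) / np.exp(score).sum(axis=1, keepdims=True)
D = np.array([rng.choice(3, p=pr) for pr in prob])            # programmes 0, 1, 2
Y = X[:, 0] + X[:, 1] + 1.0 * (D == 1) + 2.0 * (D == 2) + rng.normal(size=n)

def mean_Ym_given_l(m, l):
    """Estimate E[Y^m | D = l] by the pairwise weighting formula above."""
    sub = (D == m) | (D == l)
    p_ml = LogisticRegression(C=1e6, max_iter=1000) \
        .fit(X[sub], (D[sub] == m).astype(int)).predict_proba(X)[:, 1]
    n_l = (D == l).sum()
    i = D == m
    return (Y[i] * (1 - p_ml[i]) / p_ml[i]).sum() / n_l       # (1/n_m) sum * n_m/n_l

# e.g. the ATET of programme 2 versus 1 for the participants of programme 2:
atet_21 = Y[D == 2].mean() - mean_Ym_given_l(1, 2)
print(atet_21)
```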
A difference between the evaluation of a single programme and that of multiple pro-
grammes is that some identification strategies that we will learn for the evaluation of a
single programme are less useful for the evaluation of multiple treatments.
As stated, different authors have made a major effort to compare all kinds of proposed matching, regression, (re-)weighting and doubly robust estimators: see, for example,
Lunceford and Davidian (2004), Frölich (2004), Zhao (2004) for early studies, Busso,
DiNardo and McCrary (2009), Huber, Lechner and Wunsch (2013) or Busso, DiNardo
and McCrary (2014), Frölich, Huber and Wiesenfarth (2017), Frölich and Huber (2017,
J RSS B) for more recent ones. Frölich (2005) contributed a study on the bandwidth
choice. A number of recipes for one-to-one propensity score matching have been sug-
gested, e.g. in Lechner (1999), Brookhart, Schneeweiss, Rothman, Glynn, Avorn and
Stürmer (2006) and Imbens and Rubin (2015) among many others. Some other esti-
mators proposed in the literature use the propensity score just to get estimates for the
functions m 0 , m 1 . For example you may take
$$\hat m_1(X_i) := \hat E[D_i Y_i\,|\,X_i]\,/\,\hat p(X_i), \qquad \hat m_0(X_i) := \hat E[(1-D_i)Y_i\,|\,X_i]\,/\,\{1-\hat p(X_i)\} \qquad (3.56)$$
with $\hat p(X_i) := \hat E[D_i\,|\,X_i]$. When using in (3.56) proper non-parametric estimators of the conditional expectation, then it can be shown that for those the estimator
$$\widehat{ATE} = \frac{1}{n}\sum_{i=1}^n \left\{\hat m_1(x_i) - \hat m_0(x_i)\right\}$$
is asymptotically linear with influence function $\psi(Y, X, D)$.
As V ar [ AT E] = V ar [ψ(Y, X, D)]/n, it is easy to see that these estimators reach the
lower bound of variance. It is evident then how this procedure can be extended to ATET
or ATEN.
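A compact sketch of the construction in (3.56) with a single confounder follows; the Nadaraya–Watson smoother and its bandwidth are ad hoc illustrative choices rather than the series estimators discussed next.

```python
# Sketch: the ATE estimator built from (3.56), with Nadaraya-Watson estimates of
# E[DY|X], E[(1-D)Y|X] and E[D|X].
import numpy as np

rng = np.random.default_rng(9)
n = 3000
X = rng.normal(size=n)                                        # one confounder for simplicity
D = rng.binomial(1, 1 / (1 + np.exp(-X)))
Y = X + 1.0 * D + rng.normal(size=n)                          # simulated ATE = 1

def nw(target, x, h=0.2):
    """Nadaraya-Watson estimate of E[target | X] at all sample points."""
    k = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (k * target[None, :]).sum(axis=1) / k.sum(axis=1)

p_hat = nw(D.astype(float), X)
m1_hat = nw(D * Y, X) / p_hat                                 # (3.56)
m0_hat = nw((1 - D) * Y, X) / (1 - p_hat)
ate_hat = np.mean(m1_hat - m0_hat)
print("ATE estimate:", ate_hat)
```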
These again have typically been proposed using power series estimators. There-
fore, let us briefly comment on a common misunderstanding. Recall that reducing
the bias is reducing this approximation error; ‘undersmoothing’ is thus equivalent to
including so many basis functions that the variance (the difference between sam-
ple and population coefficients) clearly dominates the squared approximation error.
Note that this cannot be checked by simple t or F tests – and not only because of
the well-known pre-testing problem that invalidates further inference. For example, for
propensity score based estimation with (power) series it is stated that efficiency could
be reached when the number L of basis functions is in the interval (n 2(δ/q−2) , n 1/9 )
with δ/q ≥ 7, and δ being the number of times the propensity score is continuously
differentiable.59 This would mean that even for n = 10,000 one would take only two basis functions; for n = 60,000 just three, etc. For power series the proposed basis is $1, x_1, x_2, \dots, x_q, x_1^2, x_2^2, \dots, x_q^2, x_1x_2, \dots$ etc.60 Along this reasoning you might
conclude that using a linear model with L = q + 1 you strongly undersmooth (actually,
more than admitted), which obviously does not make much sense. Even if you inter-
preted the series in the sense that L − 1 should be the order of the used polynomial,
then for n = 10,000 you would still work with a linear model, or for n = 60,000 with
a quadratic one. However, these will typically have very poor fitting properties. A more
appropriate way to understand the rate related statement is to imagine that one needs
L = n ν · C where ν is just about the rate but C is fixed and depends on the adaptiveness
of the used series, the true density and the true function, and should be much larger
than 1. But this does still not solve the problem of poor extrapolation (or prediction)
to other populations and thus the inappropriateness for the estimation of counterfactual
outcomes.
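As a quick check of the orders of magnitude implied by the $n^{1/9}$ upper end of this interval (our own illustration):

for n in (10_000, 60_000, 1_000_000):
    print(n, int(n ** (1 / 9)))   # implied upper bound on the number of basis functions
# prints 2, 3 and 4 respectively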
Concerning further discussion on trimming, especially that based on the propensity
score, Crump, Hotz, Imbens and Mitnik (2009) propose choosing the subset of the sup-
port of X that minimises the variance of the estimated treatment effect. Since the exact
variances of the estimators are unknown, their approach is based on the efficiency bound,
i.e. the asymptotic variance of an efficient non-parametric estimator. This solution only
depends on the propensity score and conditional variances of Y . Under homoskedastic-
ity, a simpler formula is obtained which depends only on the marginal distribution of
the propensity score. Trimming all observations with pi ≤ 0.1 or pi ≥ 0.9 works as a
useful rule of thumb.
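In code, this rule of thumb amounts to nothing more than the following helper (a sketch of ours; the function name and default thresholds are not taken from any particular package):

import numpy as np

def trim_rule_of_thumb(p_hat, lo=0.1, hi=0.9):
    # keep only observations whose estimated propensity score lies inside [lo, hi]
    p_hat = np.asarray(p_hat)
    return (p_hat >= lo) & (p_hat <= hi)

# usage: keep = trim_rule_of_thumb(p_hat); then re-estimate the effect on y[keep], d[keep], x[keep]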
Huber, Lechner and Wunsch (2013) criticise the various trimming proposals since
they all ignore the asymptotic perspective: unless the propensity score is in fact 0 or 1 for
some values, the need for trimming vanishes when the sample size increases. Trim-
ming is only used as a small-sample tool to make the estimator less variable when n
is small. Therefore, the proportion of observations being trimmed should go to zero
when n increases. They suggest a trimming scheme based on the sum of weights each
observation receives in the implicit weighting of the matching estimator. Observations
with very large weights are discarded. Since each weight is obtained by dividing by
the sample size, the weights automatically decrease with sample size and thus the pro-
portion of trimmed observations decreases to zero unless the treatment probability is
really 0 or 1 for that x. If the latter were true, we could suspect this from knowledge
of the institutional details and exclude those x values before estimating the propensity
score.
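The following sketch conveys the idea for an inverse-probability-weighting type estimator; the cap of 5% on any single observation's weight share is an arbitrary choice of ours for illustration, not the rule proposed by the authors:

import numpy as np

def trim_large_weights(p_hat, d, max_share=0.05):
    # implicit IPW weight of each observation and its share within its treatment arm
    w = np.where(d == 1, 1.0 / p_hat, 1.0 / (1.0 - p_hat))
    share = np.where(d == 1, w / w[d == 1].sum(), w / w[d == 0].sum())
    # discard observations that would dominate the weighted average;
    # since individual shares shrink with n, the trimmed proportion vanishes asymptotically
    return share <= max_share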
Much more recent is literature on treatment effect estimation with high-dimensional
data. Given the increasing size of data sets on the one hand, and the fear that one might
not have included enough confounders to reach CIA validity on the other, it would be
59 Hirano, Imbens and Ridder (2003) give slightly different bounds in their theorem but the ones given here
coincide with those in their proof and some other, unpublished work of theirs.
60 Sometimes the notation is careless if not wrong when constructing these series such that they are of little
use for the practitioner. Most of them additionally require rectangle supports of X , i.e. that the support of
X is the Cartesian product of the q intervals [min(X j ), max(X j )]. This basically excludes confounders
with important correlation like for example ‘age’ and ‘tenure’ or ‘experience’.
3.5 Bibliographic and Computational Notes 171
interesting to know how to do inference on the ATE estimate after having performed
an extensive selection of potential confounders. This problem is studied in Belloni,
Chernozhukov and Hansen (2014), who allow the number of potential confounders
to be larger than the sample size. Certainly, you need the number q of truly relevant confounders to be much
smaller than n, and any selection errors committed along the way to be (first-order) orthogonal
to the main estimation problem (i.e. the estimation of the ATE). So far, this has been
shown to work at least for some (generalised) partial linear models.
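A stylised sketch of the double-selection idea follows; it is our own simplification (cross-validated lasso penalties instead of the plug-in penalties of the original proposal, and a homogeneous-effect OLS final step), intended only to convey the logic:

import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def post_double_selection_ate(y, d, X):
    # select confounders that predict the outcome and those that predict the treatment
    sel_y = np.abs(LassoCV(cv=5).fit(X, y).coef_) > 1e-8
    sel_d = np.abs(LassoCV(cv=5).fit(X, d).coef_) > 1e-8
    keep = sel_y | sel_d
    # final OLS of y on d and the union of selected confounders;
    # the coefficient on d is the treatment effect estimate
    W = np.column_stack([d, X[:, keep]])
    return LinearRegression().fit(W, y).coef_[0]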
What has been discussed less is the identification and estimation of multivalued treatments
(which will be considered in later chapters), i.e. when we have various treatments
but each individual can participate in at most one. As was indicated, the idea and procedure
are the same as for binary D; see e.g. Cattaneo (2010), who introduced a double robust
estimator for multivalued treatment effects. Related is a double robust
estimator that makes use of generalised additive models (GAMs) for both the treatment assignment and the outcome
model.
Finally, ATE is a very recent R package to estimate the ATE or the ATET based on
a quite recent estimation idea; see Chan, Yam and Zhang (2016). This function uses a
covariate balancing method which creates weights for each subject without the need to
specify a propensity score or an outcome regression model.
Until recently Stata didn’t have many explicit built-in commands for propensity
score based methods or other non-experimental methods that produced control groups
with distributions of confounders similar to that of the treated group. However, there are
several user-written modules, of which perhaps the most popular are psmatch2
and pscore and, more recently, nnmatch. All three modules support pair-matching
as well as subclassification. In addition, ivqte also permits estimation of distributional
effects and quantile treatment effects.
The command psmatch2 – see Leuven and Sianesi (2014) – has been the preferred
tool to perform propensity score matching. It performs full Mahalanobis and propensity
score matching, common support graphing (psgraph) and covariate imbalance test-
ing (pstest). It allows kNN matching, kernel weighting, Mahalanobis matching and
includes built-in diagnostics. It further includes procedures for estimating ATET or ATE.
The default matching method is single nearest-neighbour (without caliper). However,
standard errors are calculated by naive bootstrapping which is known to be inconsistent
in this context. The common option imposes a common support by dropping treatment
observations based on their propensity score; see the help file for details.
The command pscore estimates treatment effects by the use of propensity score
matching techniques. Additionally, the program offers balancing tests based on strati-
fication. The commands to estimate the average treatment effect on the treated group
using kNN matching are attnd and attnw. For radius matching, the average treat-
ment effect on the treated is calculated with the module attr. In the programs attnd,
attnw, and attr, standard errors are estimated analytically or approximated by boot-
strapping using the bootstrap option. Kernel matching is implemented in attk.
Users can choose the default Gaussian or the Epanechnikov kernel. Stratification can be
used in atts. By construction, in each block defined by this procedure, the covariates
are balanced and the assignment to treatment can be considered as random. No weights
are allowed.
If you want to directly apply nearest-neighbour matching instead of estimating the
propensity score equation first, you may use nnmatch. This command does kNN
matching with the option of choosing between several different distance metrics. It
allows for exact matching (or as close as possible) on a subset of variables, bias correc-
tion of the treatment effect and estimation of either the sample or population variance
with or without assuming a constant treatment effect.
However, as we have seen, the two main problems in practice are the choice of the
proper method for the present data set, and an appropriate estimator for the standard
error. So it is advisable to always try not just different methods but also different
implementations. For instance, if pscore and nnmatch give similar results, then the
findings can be considered quite reliable; if not, then you have a problem. For a review
see Becker and Ichino (2002) and Nichols (2007). A related implementation of the
reweighting propensity score estimator is, for example, the Stata routine treatrew
(Cerulli, 2012).
With Stata 13 the command teffects was introduced. This command
takes into account that the propensity score used for matching was esti-
mated in a first stage. Adjusted standard errors are implemented in the com-
mand teffects psmatch. This command also provides regression adjust-
ment (teffects ra), inverse probability weighting (teffects ipw), aug-
mented inverse probability weighting (teffects aipw), inverse probability
weighted regression adjustment (teffects ipwra), and nearest neighbour match-
ing (teffects nnmatch). There is also a rich menu of post-estimation inference
tools. However, many (if not most) methods are purely parametric, typically based on
linear regression techniques. For more details visit the manual and related help files. For
extensions to multivalued treatment effects see Cattaneo, Drucker and Holland (2013),
who discuss in detail the related poparms command. The command ivqte permits
estimation of quantile treatment effects and distributional effects.
3.6 Exercises
8. Calculate the bias in Theorem 3.3 for the case p = 2, q = 1. How would it change
when using a local linear estimator for m 0 (·)?
9. Prove the statement from Section 3.3.1, where it was said that taking (3.58) with
E[Y |xi , di = d] replaced by μd ( p(xi )) plus the term
$\{p(x_i) - d_i\} \left[ \frac{E[Y|x_i, d_i = 1] - \mu_1(p(x_i))}{p(x_i)} + \frac{E[Y|x_i, d_i = 0] - \mu_0(p(x_i))}{1 - p(x_i)} \right] \qquad (3.59)$
would again give the original influence function (3.58).
10. Let us extend Example 3.12 taken from Imbens (2004). Imagine we were provided
with a vector of firm characteristics xi that affected production and costs, and con-
sequently also potential profits. Then the production is still a stochastic function
$Y_i = \varphi(D_i, x_i, e_i)$, influenced by the technological innovation $D_i$, random factors $e_i$ not
under the firm's control, and some observable(s) $x_i$. Profits are again mea-
sured by output minus costs: πi = Yi − c(xi , vi ) · Di , where c is the cost function
depending also on xi and unknown random factors vi . Discuss the (in-)validity of
the CIA along the same lines as we discussed the unconfoundedness in Example
3.12.
11. Recall the double robustness of the ATE estimator presented in Subsection 3.3.3.
Show that running a WLS regression with weights (3.38)
regress Y on constant, D, X − X̄ 1 and (X − X̄ 1 )D,
where X̄ 1 is the average of X among the D = 1 observations, gives a propensity
score weighting ATET estimator.
12. For the double robust estimator recall the weights for ATE in (3.39). Show then that
$e_2^{\top} \left( \begin{array}{cc} \sum_i \omega_i & \sum_i \omega_i D_i \\ \sum_i \omega_i D_i & \sum_i \omega_i D_i^2 \end{array} \right)^{-1} \left( \begin{array}{c} \sum_i \omega_i Y_i \\ \sum_i \omega_i D_i Y_i \end{array} \right) = \frac{\sum_{i=1}^{n} \frac{Y_i D_i}{\hat p(X_i)}}{\sum_{i=1}^{n} \frac{D_i}{\hat p(X_i)}} - \frac{\sum_{i=1}^{n} \frac{Y_i (1 - D_i)}{1 - \hat p(X_i)}}{\sum_{i=1}^{n} \frac{1 - D_i}{1 - \hat p(X_i)}}\,.$
4 Selection on Unobservables: Non-Parametric IV and Structural Equation Approaches
In many situations we may not be able to observe all confounding variables, perhaps
because data collection has been too expensive or simply because some variables are
hard or impossible to measure. This may be less of a concern with detailed administrative
data; more often, however, only a limited set of covariates is available, and these may
even have been measured with substantial error, e.g. if obtained by telephone
surveys. Often, data on some obviously important confounders have not been collected
because the responsible agency did not consider this information relevant for the project.
In these kinds of situations the endogeneity of D can no longer be controlled for by
conditioning on the set of observed covariates X . In the classic econometric literature
the so-called instrumental variable (IV) estimation is the most frequently used tech-
nique to deal with this problem. An instrument, say Z , is a variable that affects the
endogenous variable D but is unrelated to the potential outcome Y d . In fact, in the
selection-on-observables approach considered in the previous chapter we also required
the existence of instrumental variables, but without the need to observe them explicitly:
in order to fulfil the common support condition (CSC) you need some variation
in D|X (i.e. variation in D that cannot be explained by X) that is independent
of $Y^d$.
We first stress the point that instruments Z are supposed to affect the observed outcome
Y only indirectly through the treatment D. Hence, any observed impact of Z on Y must
have been mediated via D. A variation in Z then permits us to observe changes in D
without any change in the unobservables, allowing us to identify and estimate the effect
of D on Y .
Example 4.1 A firm can choose between adopting a new production technology (D = 1)
or not (D = 0). Our interest is in the effect of technology on production output Y . The
firm, on the other hand, chooses D in order to maximise profits, i.e.
$D_i = \arg\max_{d \in \{0,1\}} \; p \cdot Y_i^d - c_i(d),$
where p is the price of a unit of output. This is common to all firms and not influenced
by the firm’s decision. Here the firm is a price-taker without market power. As before,
ci (d) is the cost of adopting the new technology. A valid instrument Z could be a subsidy
or a regulatory feature of the environment the firm operates in. It typically will change
the costs and thus the profits without affecting the production output directly. Suppose
that the cost function of adopting the new technology is the same for every firm, i.e.
ci (·) = c(·) and that it only depends on d and the value of the subsidy z or regulation.
Hence, the cost function is c(d, z) and the firm's decision problem becomes
$D_i = \arg\max_{d \in \{0,1\}} \; p \cdot Y_i^d - c(d, Z_i).$
Notice that the costs enter the choice problem of the firm but the potential outputs
$Y^d$ (d = 0, 1) are not affected by them. This is important for identification. We may be
able to use the subsidy as an instrument to identify the effect of technology on output.
However, we cannot use it to identify the effect of technology on profits or stock prices,
since the subsidy itself changes the profits.
Unfortunately, while many users of IVs emphasise the exogeneity of their instrument
regarding the economic process, they ignore the fact that they actually need to assume
its stochastic independence from the potential outcomes, which is often hard to justify.
This idea is pretty much the same as what is known and used in classical econometric
regression analysis. There are mainly two differences we should have in mind: first, we
are still interested in the total impact of D on Y , not in a marginal one. Second, we
consider non-parametric identification and estimation. Thus, we will allow for hetero-
geneous returns to treatment D. This reveals another fundamental problem one has with
the IV approach. In the treatment effect literature the latter is reflected in the notion
of local average treatment effects (LATE), which will be explained in this chapter.
To get around that, one needs to either make more assumptions or resort to structural
modelling.
To better highlight these issues we again start with Y , D and Z being scalar variables,
with the latter two being just binary. Somewhat later we will reintroduce confounders X
and discuss extensions to discrete and continuous instruments. We begin with prelimi-
nary considerations for illustration before formalising the identification and estimation
procedure.
$Y = \alpha_0 + D\alpha_1 + U \qquad (4.1)$
$\text{cov}(U, Z) = 0, \qquad \text{cov}(D, Z) \neq 0,$
i.e. Z is not correlated with the unobservables but is correlated with D. This leads to
some versions of parametric (standard) IV estimators, which will be discussed in the
next sections. This procedure does not change if D is, for example, continuous. Then,
however, the linearity in model (4.1) is chosen for convenience, but does not necessarily
emanate from economic theory.
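A minimal sketch of ours (with simulated data) of the resulting IV estimate of $\alpha_1$ as cov(Y, Z)/cov(D, Z):

import numpy as np

def iv_slope(y, d, z):
    # standard IV estimator of alpha_1 in model (4.1): cov(Y, Z) / cov(D, Z)
    return np.cov(y, z)[0, 1] / np.cov(d, z)[0, 1]

rng = np.random.default_rng(1)
n = 5000
z = rng.binomial(1, 0.5, n)                                  # binary instrument
u = rng.normal(size=n)                                       # unobservable
d = (1.5 * z + u + rng.normal(size=n) > 0).astype(float)     # D endogenous through u
y = 2.0 + 1.5 * d + u + rng.normal(size=n)                   # alpha_1 = 1.5
print(iv_slope(y, d, z))   # roughly 1.5; OLS of y on d would be biased upwards here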
It is most helpful to better understand the merits and limits of instrumental variables
by analysing non-parametric identification. A simple illustration of the situation without
control variables in which the effect of Z on Y is channelled by D is given in Figure
4.1. We are going to see why we often need three assumptions: first, that the instrument
Z has no direct effect on Y and, second, that the instrument itself is not confounded.
The meaning of these assumptions can be seen by comparing Figure 4.2 with 4.1. The
assumption that Z has no direct effect on Y requires that there be no direct arc from Z to Y,
nor a reverse one. Furthermore, the assumption of no confounding
requires that there are no dashed arcs between Z and V, and none between Z and
U; more generally, there must be no further (dashed or solid) arcs between Z
and Y. In practice one mainly finds arguments for why there is no direct impact
Z → Y, while the dashed arcs are ignored. The third assumption is that Z has predictive power
for D. (Fourth, we will need some monotonicity assumption.)
Example 4.2 The determinants of civil wars are an important research topic. A num-
ber of contributions have stressed that civil wars, particularly in Africa, are more often
driven by business opportunities than by political grievances. The costs of recruiting
fighters may therefore be an important factor for triggering civil wars. In this respect,
poor young men are more willing to be recruited as fighters when their income oppor-
tunities in agriculture or in the formal labour market are worse. Therefore, it would be
interesting to estimate the impact of economic growth or GDP per capita on the likeli-
hood of civil war in Africa. It is quite obvious that at the same time civil wars heavily
effect GDP p.c. and economic growth. A popular causal chain assumption1 is that, in
countries with a large agricultural sector (which is mainly rain fed), negative weather
shocks reduce GDP, which, as a proxy for the economic situation, increases the risk of
civil war. The price for recruiting fighters may be one of the channels. But another could
be the reduced state military strength or road coverage. Whereas weather shocks are
arguably exogenously driven, the absence of any other stochastic dependence between
weather and civil war incidence is quite a debatable assumption.
In the next section we will begin with the simplest situation where D and Z are both
binary and no other covariates are included. Later we will relax these assumptions by
permitting conditioning on additional covariates X , hoping that this conditioning makes
the necessary assumptions hold. Hence, observed X will then allow us to ‘block’ any
further arc between Z and Y .
Example 4.3 Edin, Fredriksson and Aslund (2003) studied the effect of living in a highly
concentrated ethnic area (D = 1) on labour market success Y in Sweden. Traditionally, the
expected outcome was ambiguous: on the one hand, residential segregation should
lower the acquisition of local skills, preventing access to good jobs; on the other
hand, ethnic enclaves also offer an opportunity to build networks by disseminating
information to immigrants. The raw data say that immigrants in ethnic enclaves
have 5% lower earnings, even after controlling for age, education, gender, family back-
ground, country of origin and year of immigration. However, the resulting negative
association may not be causal if the decision to live in such an enclave depended on
one’s expected opportunities related to unobserved abilities.
From 1985 to 1991 the Swedish government assigned initial areas of residence to all
refugees, motivated by the belief that dispersing immigrants would promote integration.
Let now Z indicate the initial assignment eight years before measuring D, with Z = 1
meaning that one was – though randomly – assigned (close) to an ethnic enclave. It
seems to be plausible to assume that Z was independent of potential earnings Y 0 , Y 1
but affected D (eight years later). Then all impact from Z on Y is coming through
D. One might, however, want to control for some of the labour market conditions X
of the region people were originally assigned to. Formally stated, X should contain
all relevant information about the government’s assignment policy that could confound
our analysis. This way one could ensure that there was no further relation between Z
and Y .
1 See, for example, Miguel, Satyanath and Sergenti (2004) or Collier and Höffler (2002).
Example 4.4 An individual may choose to attend college or not, and the outcome Y
is earnings or wealth later in the life cycle. The individual’s decision depends on the
expected payoff, i.e. better employment chances or higher wages, and also on the costs
of attending college, which includes travel costs, tuition, commuting time but also fore-
gone earnings. Only some of them are covered by confounders X . A popular though
problematic instrument Z is, for example, the distance to college. Suppose the indi-
vidual chooses college if $Y_i^1$ is larger than $Y_i^0$. Although knowing X, he may not be able
to forecast the potential outcomes perfectly as he has only a noisy signal of ability,
reflected in U. The empirical researcher has the same problem: he observes X and Z
but not the proneness to schooling (reflected in V), which influences the cost function. The
participation decision is (most likely)
$D_i = 1\!\!1\left\{ E\big[Y_i^1 \,\big|\, U_i, X_i, Z_i\big] - c(1, X_i, V_i, Z_i) > E\big[Y_i^0 \,\big|\, U_i, X_i, Z_i\big] - c(0, X_i, V_i, Z_i) \right\}.$
Here is a difference between the objective function of the individual (outcomes minus
costs) and the production function that the econometrician is interested in (namely Yi1 −
Yi0 ). The tricky point is that the instruments should shift the objective function of the
individual without shifting the production function.
This situation can be summarised by the triangular system
$D_i = \zeta(Z_i, X_i, V_i), \qquad Y_i = \varphi(D_i, X_i, U_i), \qquad (4.2)$
where the endogeneity of D arises from statistical dependence between U and V, both
being vectors of unobserved variables, suppressing potential model misspecifica-
tion. In this triangular system U could be unobserved cognitive and non-cognitive skills,
talents, ability, etc., while V could be dedication to academic study or any other factor
affecting the costs of schooling. That is, one might think of U as fortune in the labour
market and V as ability in schooling. Most of the time we will consider only cases where
we are interested in identifying the second equation.
We know from the classical literature on triangular or simultaneous equation systems
that given (4.2) with unobserved U, V we identify and estimate the impact of those D on
Y |X that are predicted by (Z , X ); or say, the impact of that variation in D (on variation
in Y |X ) which is driven by the variation of (Z , X ).
In the introduction to this chapter we have already used the notion of LATE but with-
out explaining it further. The question is what this ‘local’ stands for. Remember that we
distinguished between the general ATE and the ATET or ATEN referring to the ATE
for different (sub-)populations. This distinction always made sense when we allowed
for heterogeneous returns to treatment as otherwise they would all be equal. Calling it
local makes explicit that the identified treatment effect refers (again) only to a certain
subpopulation. One could actually also say that ATET and ATEN are local. The notion
LATE is typically used when this subpopulation is defined by another variable, here
instrument Z or (Z , X ). The interpretability or say the usefulness of LATE depends
thus on the extent to which the particular subpopulation is a reasonable target to look
at. In the statistics literature, LATE is usually referred to as Complier Average Causal
Effect (CACE), which makes it very explicit that we are referring to the average effect
for the subpopulation defined as Compliers.
You can think of Example 4.4 where D indicates ‘attending’ college and Z being an
indicator of living close to or far from a college. The latter was commonly considered
to be a valid instrument as living close to a college during childhood may induce some
children to go to college but is unlikely to directly affect the wages earned in their
adulthood. So one argues with Figure 2.14 and ignores potential problems coming from
the dashed lines in Figure 4.2.
According to the reaction of D to an external intervention on Z (the family moves further
away from or closer to a college, but for reasons not related to the college),3 the units
i can be distinguished into different types: for some units, D would remain unchanged
if Z were changed from 0 to 1, whereas for others D would change. With D and Z
binary, four different latent types T ∈ {n, c, d, a} are possible:
always-takers (T = a): $D_{i,0} = D_{i,1} = 1$;
never-takers (T = n): $D_{i,0} = D_{i,1} = 0$;
compliers (T = c): $D_{i,0} = 0,\, D_{i,1} = 1$;
defiers (T = d): $D_{i,0} = 1,\, D_{i,1} = 0$.
2 This is based on the ideas outlined in Angrist, Imbens and Rubin (1996), Imbens (2001) and Frölich
(2007a).
3 This excludes the cases where people move closer to a college because the children are supposed to
attend it.
Example 4.5 In this situation it is quite easy to imagine the first two groups: people
who will definitely go to college, no matter how far they live from one. The second
group is the exact counterpart, giving us the subpopulation of people who will not go
to college regardless of the distance. The third group consists exactly of those
who go to college because it is close but would not have done so if it were far away.
The last group is composed of people who go to college because it is far from home
but might not have done so if it were close by, or vice versa. But who are these people
and why should that group exist at all? First, one has to see that they have one thing in
common with the former group, the so-called compliers: both groups together present
people who are basically indifferent to attending a college but are finally driven by the
instrument ‘distance’. The last group just differs in the sense that their decision seems
counter-intuitive to what we expect. But if you imagine someone living far away, who
had to stay home when deciding for an apprenticeship but could leave and move to a new
place when choosing ‘college’, then we can well imagine that this latter group exists and
may not even be negligibly small compared to the group of compliers.
We might say that compliers and defiers are generally indifferent to D (get treated
or not) but their final decision is induced by instrument Z . Note that we have the same
problem as we discussed at the beginning for Y d : we observe each individual’s Dz only
under either z = 0 or z = 1. Consequently we cannot assign the individuals uniquely to
one of the four types. For example, individual i with Di,0 = 1 might be an always-taker
or a defier, and Di,1 = 1 might be either an always-taker or a complier. Furthermore,
since units of the always-taker and never-taker types cannot be induced to change
D through a variation in the instrumental variable, the impact of D on Y can at most
be ascertained for the subpopulations of compliers and defiers. Unfortunately, since
changes in the instrument Z would trigger changes in D for the compliers as well as
for the defiers, but with the opposite sign, any causal effect on the compliers could be
offset by opposite flows of defiers. The most obvious strategy is to rule out the exis-
tence of subpopulations that are affected by the instrument in an opposite direction (i.e.
assume ‘no defiers’ are observed). It is also clear – and will be seen in further discussion
below – that we need compliers for identification. In sum, we assume:
Assumption (A1), Monotonicity: The subpopulation of defiers has probability measure
zero:
$\Pr\left( D_{i,0} > D_{i,1} \right) = 0.$
Monotonicity ensures that the effect of Z on D has the same direction for all units.
The monotonicity and the existence assumption together ensure that Di,1 ≥ Di,0 for all
i and that the instrument has an effect on D, such that Di,1 > Di,0 for at least some
units (with positive measure). These assumptions are not testable but are essential.
Example 4.6 Thinking of Example 4.4, where college proximity was used as an instru-
ment to identify the returns to attending college, monotonicity requires that any child
which would not have attended college if living close to a college, would also not have
done so if living far from a college. Analogously, any person attending college living far
away would also have attended if living close to one. The existence assumption requires
that the college attendance decision depends at least for some children on the proximity
to the nearest college (in both directions).
Example 4.7 Recall Example 4.1 on adopting (or not) a new production technology
with Z being subsidies for doing so. For identification we need a variation in the level
of Z . The unconfoundedness assumption requires that the mechanism that generated this
variation in Z should not be related to the production function of the firms nor to their
decision rule. A violation of these assumptions could arise, e.g. if particular firms are
granted a more generous subsidy after lobbying for favourable environments. If firms
that are more likely to adopt the new technology only if subsidised are able to lobby
for a higher subsidy, then the fraction of compliers would be higher among firms that
obtained a higher subsidy than among those that did not, violating Assumption (A3).
The monotonicity assumption is satisfied if the cost function c(d, z) is not increasing
in z. The LATE is the effect of technology on those firms which only adopt the new
technology because of the subsidy. It could be plausible that the effect for the always-
takers is larger than LATE, and that the effect on never-takers would be smaller. While
for engineers who want to know the total technology impact, this LATE is uninteresting,
for policymakers it is probably the parameter of interest.
Assumption (A4), Mean exclusion restriction: The potential outcomes are mean
independent of the instrumental variable Z in each subpopulation:
$E\big[Y_{i,Z_i}^{0} \,\big|\, Z_i = 0, T_i = t\big] = E\big[Y_{i,Z_i}^{0} \,\big|\, Z_i = 1, T_i = t\big] \quad \text{for } t \in \{n, c\}$
$E\big[Y_{i,Z_i}^{1} \,\big|\, Z_i = 0, T_i = t\big] = E\big[Y_{i,Z_i}^{1} \,\big|\, Z_i = 1, T_i = t\big] \quad \text{for } t \in \{a, c\}.$
In order to keep things easy we restricted ourselves here to the equality of condi-
tional means instead of invoking stochastic independence. Exclusion restrictions are
also imposed in classical IV regression estimation, even though that is not always clearly
stated. Here it is slightly different in the sense that it includes the conditioning on the type T.
It rules out any path from Z to Y other than the one passing through D. This is necessary as in
treatment effect estimation we are interested in identifying and estimating the total effect
of D. Any effect of Z must therefore be channelled through D such that the potential
outcomes (given D) are not correlated with the instrument.
To gain a better intuition, one could think of Assumption (A4) as actually containing
two assumptions: an unconfounded instrument and an exclusion restriction. Take the
first condition
$E\big[Y_{i,0}^{0} \,\big|\, Z_i = 0, T_i = t\big] = E\big[Y_{i,1}^{0} \,\big|\, Z_i = 1, T_i = t\big] \quad \text{for } t \in \{n, c\}$
and consider splitting it up into two parts, say Assumptions (A4a) and (A4b):4
$E\big[Y_{i,0}^{0} \,\big|\, Z_i = 0, T_i = t\big] = E\big[Y_{i,1}^{0} \,\big|\, Z_i = 0, T_i = t\big] = E\big[Y_{i,1}^{0} \,\big|\, Z_i = 1, T_i = t\big] \quad \text{for } t \in \{n, c\}.$
The first part is like an exclusion restriction on the individual level and would be satisfied
e.g. if $Y_{i,0}^0 = Y_{i,1}^0$. It is assumed that the potential outcome for unit i is unaffected by
an exogenous change in $Z_i$. The second part represents an unconfoundedness assumption
on the population level. It assumes that the potential outcome $Y_{i,1}^0$ is identically
distributed in the subpopulation of units for whom the instrument Z i is observed to have
the value zero, and in the subpopulation of units where Z i is observed to be one. This
assumption rules out selection effects that are related to the potential outcomes.
Example 4.8 Continuing our Examples 4.4 to 4.6, where D is college attendance and Y
the earnings or wealth later in the life cycle, if we have for the potential outcomes $Y_{i,0}^1 = Y_{i,1}^1$,
then college proximity Z itself has no direct effect on the child's wages in its later
career. So it rules out any relation of Z with the potential outcomes on a unit level, cf.
Assumption (A4a).5 Assumption (A4b) now requires that those families who decided
to reside close to a college should be identical in all characteristics (that affect their
children’s subsequent wages) to the families who decided to live far from a college.
Thus, whereas the second part refers to the composition of units for whom Z = 1 or
4 Obviously, the following assumption is stronger than the previous and not strictly necessary. It helps,
though, to gain intuition into what these assumptions mean and how they can be justified in applications.
5 This implies the assumption that living in an area with higher educational level has no impact on later
earnings except via your choice to attend or not a college.
Z = 0 is observed, the first part of the assumption refers to how the instrument affects
the outcome Y of a particular unit.
Note that the second part of the assumption is trivially satisfied if the instrument
Z is randomly assigned. Nevertheless randomisation of Z does not guarantee that the
exclusion assumption holds on the unit level (Exercises 1 and 2). On the other hand, it is
rather obvious that if Z is chosen by the unit itself, selection effects may often invalidate
Assumption (A4b). In our college example this assumption is invalid if families who
decide to reside nearer to or farther from a college are different. This might be the
case due to the job opportunities in districts with colleges (especially for academics) or
because of the opportunity for the children to visit a college. In this case it is necessary to
also condition on the confounders X , i.e. all variables that affect the choice of residence
Z as well as the potential outcomes Yi,Z 0 1 . How to include them is the topic
and Yi,Z
i i
of the next section. As typically Z is assumed to fulfil Z ⊥⊥ Y z |X , we could calculate
the AT E Z , clearly related to the intention to treat effect (ITT), the total effect of Z on
Y . More interestingly, note that one implication of the mean exclusion restriction is that
it implies unconfoundedness of D in the complier subpopulation. As Di = Z i for a
complier you have
$E\big[Y_{i,Z_i}^{0} \,\big|\, D_i = 0, T_i = c\big] = E\big[Y_{i,Z_i}^{0} \,\big|\, D_i = 1, T_i = c\big]$
$E\big[Y_{i,Z_i}^{1} \,\big|\, D_i = 0, T_i = c\big] = E\big[Y_{i,Z_i}^{1} \,\big|\, D_i = 1, T_i = c\big].$
Hence, conditioning on the complier subpopulation, D is not confounded with the poten-
tial outcomes. If one were able to observe the type T , one could retain only the complier
subpopulation and use a simple means comparison (as with experimental data discussed
in Chapter 1) to estimate the treatment effect. The IV Z simply picks a subpopulation
for which we have a randomised experiment with Y d ⊥⊥ D (or conditioned on X ). In
other words, Z picks the compliers; for them the CIA holds and we can calculate their
ATE. This is the $LATE_Z$ inside the population. However, we do not observe the type.
The ATE on the compliers is obtained by noting that both the ITT as well as the size of
the complier subpopulation can be estimated.
How do we now get the ITT? First note that
$E[Y_i \,|\, Z_i = z]$
$= E\big[Y_{i,Z_i}^{D_i} \,\big|\, Z_i = z, T_i = n\big] \cdot \Pr(T_i = n | Z_i = z) + E\big[Y_{i,Z_i}^{D_i} \,\big|\, Z_i = z, T_i = c\big] \cdot \Pr(T_i = c | Z_i = z)$
$\quad + E\big[Y_{i,Z_i}^{D_i} \,\big|\, Z_i = z, T_i = d\big] \cdot \Pr(T_i = d | Z_i = z) + E\big[Y_{i,Z_i}^{D_i} \,\big|\, Z_i = z, T_i = a\big] \cdot \Pr(T_i = a | Z_i = z)$
$= E\big[Y_{i,Z_i}^{0} \,\big|\, Z_i = z, T_i = n\big] \cdot \Pr(T_i = n) + E\big[Y_{i,Z_i}^{D_i} \,\big|\, Z_i = z, T_i = c\big] \cdot \Pr(T_i = c)$
$\quad + E\big[Y_{i,Z_i}^{D_i} \,\big|\, Z_i = z, T_i = d\big] \cdot \Pr(T_i = d) + E\big[Y_{i,Z_i}^{1} \,\big|\, Z_i = z, T_i = a\big] \cdot \Pr(T_i = a)$
by Assumption (A3) and the definition of the types T . By the mean exclusion restriction
(A4) the potential outcomes are independent of Z in the always- and in the never-taker
subpopulation. Hence, when taking the difference E[Y |Z = 1] − E[Y |Z = 0] the
respective terms for the always- and for the never-takers cancel, such that
$E[Y_i \,|\, Z_i = 1] - E[Y_i \,|\, Z_i = 0]$
$= \Big\{ E\big[Y_{i,Z_i}^{D_i} \,\big|\, Z_i = 1, T_i = c\big] - E\big[Y_{i,Z_i}^{D_i} \,\big|\, Z_i = 0, T_i = c\big] \Big\} \cdot \Pr(T_i = c)$
$\quad + \Big\{ E\big[Y_{i,Z_i}^{D_i} \,\big|\, Z_i = 1, T_i = d\big] - E\big[Y_{i,Z_i}^{D_i} \,\big|\, Z_i = 0, T_i = d\big] \Big\} \cdot \Pr(T_i = d)$
$= \Big\{ E\big[Y_{i,Z_i}^{1} \,\big|\, Z_i = 1, T_i = c\big] - E\big[Y_{i,Z_i}^{0} \,\big|\, Z_i = 0, T_i = c\big] \Big\} \cdot \Pr(T_i = c)$
$\quad + \Big\{ E\big[Y_{i,Z_i}^{0} \,\big|\, Z_i = 1, T_i = d\big] - E\big[Y_{i,Z_i}^{1} \,\big|\, Z_i = 0, T_i = d\big] \Big\} \cdot \Pr(T_i = d).$
Exploiting the mean exclusion restriction for the compliers (and defiers) gives
$= E\big[Y_{i,Z_i}^{1} - Y_{i,Z_i}^{0} \,\big|\, T_i = c\big] \cdot \Pr(T_i = c) - E\big[Y_{i,Z_i}^{1} - Y_{i,Z_i}^{0} \,\big|\, T_i = d\big] \cdot \Pr(T_i = d). \qquad (4.3)$
Under the monotonicity assumption (A1) the defier term in (4.3) vanishes, and the size of the complier subpopulation is identified by $\Pr(T = c) = E[D|Z = 1] - E[D|Z = 0]$, so that the LATE is obtained by dividing the left-hand side of (4.3) by this complier share, which yields the so-called Wald estimator.
The variance can easily be estimated by replacing the unknown moments by sam-
ple estimates. The problem of weak instruments is visible in the formula of the Wald
estimator and its variance; we are dividing the intention to treat effect E [Y |Z = 1] −
E [Y |Z = 0] by E [D|Z = 1] − E [D|Z = 0]. If the instrument has only a weak corre-
lation with D, then the denominator is close to zero, leading to very imprecise estimates
with a huge variance.
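To illustrate what the Wald estimator identifies, here is a small simulation of ours with explicit compliance types and heterogeneous effects (all numbers are made up for illustration):

import numpy as np

rng = np.random.default_rng(2)
n = 200_000
# latent types: always-takers, never-takers, compliers (no defiers, Assumption (A1))
types = rng.choice(["a", "n", "c"], size=n, p=[0.2, 0.4, 0.4])
z = rng.binomial(1, 0.5, n)                                     # randomised binary instrument
d = np.where(types == "a", 1, np.where(types == "n", 0, z))     # only compliers follow Z
# heterogeneous effects: 2 for always-takers, 1 for compliers, 0 for never-takers
effect = np.select([types == "a", types == "c", types == "n"], [2.0, 1.0, 0.0])
y = rng.normal(size=n) + effect * d

wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
print(wald)   # approx 1.0, the complier effect (LATE); the population ATE here would be 0.8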
Clearly, if the treatment effect is homogeneous over the different types T , then LATE,
ATE, ATET and ATEN are all the same. Then we do not even need Assumption (A1),
i.e. the non-existence of defiers, as we get then
$\frac{E[Y|Z=1] - E[Y|Z=0]}{E[D|Z=1] - E[D|Z=0]} = \frac{E\big[Y^1 - Y^0 \,\big|\, T = c\big] \cdot \Pr(T = c) - E\big[Y^1 - Y^0 \,\big|\, T = d\big] \cdot \Pr(T = d)}{\Pr(T = c) + \Pr(T = a) - \Pr(T = d) - \Pr(T = a)}.$
In fact, we only need the complier and defier treatment effects to be identical and the
two subpopulations not to be of equal size; then the Wald estimator is consistent. Note that
6 See Exercise 3.
all the statements made in this paragraph become invalid or have to be modified if
conditioning on some additional confounders X is necessary.
We are thus identifying a parameter of an abstract subpopulation. Moreover, this sub-
population is defined by the choice of instruments, because the compliers are those who
react positively to this specific set. That is, different IVs lead to different parameters
even under instrument validity. Note that we are not just speaking of numerical differ-
ences in the estimates; different instruments identify and estimate different parameters.
So the question is to what extent the parameter identified by a particular instrument is
of political or economic relevance. This could partly have been answered already by
introducing the whole IV story in a different way, namely by using the propensity
score again. This becomes clear in the later sections of this chapter. In any case, the most relevant
LATEs are those based on policy instruments like subsidies, the imposition of regulations,
college fees, or eligibility rules for being treated. The latter can even be of such
a kind that only those people who were randomly assigned can participate in the treatment
(without enforcement, but with random assignment as an eligibility criterion).
Example 4.9 Individuals in a clinical trial are randomly assigned to a new treatment
against cancer or to a control treatment. Individuals assigned to the treatment group may
refuse the new treatment. But individuals assigned to the control group cannot receive
the new treatment. Hence, individuals in the treatment group may or may not comply,
but individuals in the control group cannot get access to the treatment. This is called
one-sided non-compliance. The decision of individuals to decline the new treatment
may be related to their health status at that time. Individuals in particularly bad health
at the time when being administered the new drug may refuse to take it. As the decision
to take the drug may be related with health status at that time (which is likely to be
related to the health status later) D is endogenous. Nevertheless, the random assignment
could be used as an instrumental variable Z . The unconfoundedness of this instrument
is guaranteed by formal randomisation. People who are assigned but refuse the treatment
are not defiers: since taking the treatment without being assigned is impossible here, they are technically never-takers. If all
individuals complied with their assignment, the treatment effect could be estimated
by simple means comparisons. With non-compliance, the intention-to-treat effect
of Z on Y can still be estimated, but this does not correspond to any treatment effect of D on
Y . The exclusion restriction requires that the assignment status itself has no direct effect
on health, which could well arise e.g. through psychological effects on the side of the
patient or the physician because of the awareness of assignment status. This is actually
the reason for double-blind placebo trials in medicine.
If the treatment is not binary but ordered, say D ∈ {0, 1, . . . , J}, the same Wald ratio identifies
$\sum_{j=1}^{J} w_j \cdot E\big[Y^j - Y^{j-1} \,\big|\, D^1 \ge j > D^0\big], \qquad w_j = \frac{\Pr(D^1 \ge j > D^0)}{\sum_{k=1}^{J} \Pr(D^1 \ge k > D^0)}, \qquad (4.7)$
which implies $\sum_{j=1}^{J} w_j = 1$, and delivers us a weighted average per-treatment-unit
effect. So, while the estimator and inference do not change compared to above, the inter-
pretation does, cf. also the literature on partial identification. A more precise discussion
is given in section 4.4.
7 More complex situations are discussed during, and in particular at the end of, this chapter.
What if, for binary treatment, our instrument is discrete or even continu-
ous? Then look at the identification strategy we used for the Wald estimator
Cov(Y, Z )/Cov(D, Z ). This could be interpreted as the weighted average over all
LATEs for marginal changes in the instrument Z (e.g. the incentive for D). Let us imag-
ine Z to be discrete with finite support {z 1 , . . . , z K } of K values with z k ≤ z k+1 . Then
we need to assume that there are no defiers at any increase (or decrease if Z and D are
negatively correlated) of Z . In such a case we could explicitly set
$LATE = \sum_{k=2}^{K} w_k\, \alpha_{k-1 \to k} \qquad (4.8)$
where αk−1→k is the LATE for the subpopulation of compliers that decide to switch
from D = 0 to D = 1 if their Z is set from z k−1 to z k . The weights wk are constructed
from the percentage of compliers and the conditional expectation of Z :
$w_k = \frac{\{\Pr(D = 1|z_k) - \Pr(D = 1|z_{k-1})\} \sum_{l=k}^{K} \Pr(Z = z_l)(z_l - E[Z])}{\sum_{j=2}^{K} \{\Pr(D = 1|z_j) - \Pr(D = 1|z_{j-1})\} \sum_{l=j}^{K} \Pr(Z = z_l)(z_l - E[Z])}$
It is not hard to see that the variance is the analogue to the one of the Wald estimator
given in (4.6), Theorem 4.1, and can therefore be estimated the same way. The extension
to continuous instruments Z is now obtained by substituting integrals for the sums, and
densities for the probabilities of Z . For details on LATE identification and estimation
with continuous Z see Section 4.2.4.
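A small sketch of ours that computes these weights from (estimated) probabilities $\Pr(Z = z_k)$ and $\Pr(D = 1|Z = z_k)$; the function name and the numbers in the example are invented for illustration:

import numpy as np

def late_weights(z_values, pr_z, p_d_given_z):
    # weights w_k (k = 2,...,K) of the LATE decomposition for a discrete instrument,
    # following the formula above
    z_values = np.asarray(z_values, float)
    pr_z = np.asarray(pr_z, float)
    p = np.asarray(p_d_given_z, float)
    ez = np.sum(pr_z * z_values)                                      # E[Z]
    tail = np.array([np.sum(pr_z[k:] * (z_values[k:] - ez))           # sum over l >= k
                     for k in range(len(z_values))])
    num = (p[1:] - p[:-1]) * tail[1:]                                 # one term per step z_{k-1} -> z_k
    return num / num.sum()

# e.g. three support points with increasing take-up probabilities:
print(late_weights([0, 1, 2], [0.5, 0.3, 0.2], [0.2, 0.5, 0.9]))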
Another question is how to make use of a set of instruments, say Z ∈ IR δ ; δ > 1. This
is particularly interesting when a single instrument is too weak or does not provide a sen-
sible interpretation of the corresponding LATE. Again, the extension is pretty straightforward;
you may take the propensity score Pr(D = 1|Z ) instead of Z . It is, however, sufficient
to take any function g : Z → IR such that Pr(D = 1|Z = z) ≤ Pr(D = 1|Z = z̃)
implies g(z) ≤ g(z̃) for all z, z̃ from the Supp(Z ).8 Then we can work with
$LATE = \sum_{k=2}^{K} w_k\, \alpha_{k-1 \to k} \qquad (4.9)$
In cases where the propensity score (or g) has to be estimated, the variance will change.9
For combinations of these different extensions, potentially combined with (additional)
confounders, see the sections at the end of this chapter. Note finally that this instrumental
8 Alternatively, Pr(D = 1|Z = z) ≤ Pr(D = 1|Z = z̃) implies g(z) ≥ g(z̃) for all z, z̃ from the Supp(Z ).
9 To our knowledge, no explicit literature exists on the variance and its estimation, but an appropriate wild
bootstrap procedure should work here.
variable setup permits us not only to estimate the ATE for the compliers but also the
distributions of the potential outcomes for the compliers, namely
$F_{Y^1|T=c} \quad \text{and} \quad F_{Y^0|T=c}. \qquad (4.10)$
As alluded to several times above, often the IV assumptions are not valid in general but
may become so only after conditioning on certain confounders X . In fact, it is very much
the same problem as the one we considered when switching from the randomized trials
with Y d ⊥⊥ D to the CIA Y d ⊥⊥ D|X . Now we switch from Y d ⊥⊥ Z in the last section
to Y d ⊥⊥ Z |X . Certainly, the other assumptions will have to be modified accordingly
such that for example D still exhibits variation with respect to instrument Z even when
conditioned on confounders X . You may equally well say that Z is still relevant for D
even when knowing X .
Figure 4.3 A case where neither matching nor IV will help to identify the impact of D on Y ; for
ease of illustration we suppressed here the arrowheads of the dashed-dotted lines
Figure 4.4 Left: (a) example of exogenous X ; Right: (b) confounder X might be endogenous
selection on observables. A more thorough discussion follows later. This graph shows
the situation where neither matching estimation nor IV identification is possible. IV
identification (without blocking) is not possible since Z has a direct impact on Y,
highlighted by the dashed-dotted line, but also because Z has other paths to Y (e.g. it is correlated
with U).
A crucial assumption for identification will be our new CIA analogue:
$Y^d \perp\!\!\!\perp Z \,|\, X \quad \text{for } d = 0, 1, \qquad (4.11)$
although some kind of mean independence would suffice. This is the conditional inde-
pendence assumption for instruments (CIA-IV). Consider first a situation where Z has
no direct impact on Y , i.e. skipping the dashed-dotted line. Then there are still paths left
that could cause problems. Some of them can be blocked by X , but let us go step by
step. Note that for the sake of simplicity we neglect the independence condition w.r.t. T
for now.
In Figure 4.4 both graphs (a) and (b) satisfy the independence assumption (4.11) con-
ditional on X . The difference between these two graphs is that in (a) X is exogenous
whereas in (b) X is correlated with V and U . We will see that non-parametric identi-
fication can be obtained in both situations. Note that in classical two-stage least squares
(2SLS), situations like (b) with endogenous X are not permitted.10
In Figure 4.5 we have added the possibility that Z might have another effect on Y
via the variables X 2 . In the left graph (a) we can achieve (CIA-IV) if we condition on
10 Think of the situation where ϕ and ζ of Equation 4.2 are parametric with additive errors U, V where you
first estimate ζ to then estimate ϕ using ζ̂ instead of D.
Figure 4.5 Left: (a) exogenous X 2 with endogenous X 1 ; Right: (b) X 1 and X 2 are both
endogenous
X 1 and X 2 . Hence, we can control for variables that confound the instrument and also
for those which lie on a mediating causal path other than via D. There is one further
distinction between X 1 and X 2 , though. Whereas X 1 is permitted to be endogenous, X 2
is not permitted to be so. This can be seen well in graph (b). If we there condition on
X 2 , we unblock the path Z → X 2 ← U and also the path Z → X 2 ← W2 . Hereby
we introduce another confounding link between Z and the outcome variable Y . On the
other hand, if we do not condition on X 2 , the instrument Z has an effect on Y via X 2
and would thus not satisfy (4.11). Hence, while the X 1 are permitted to be endogenous,
the X 2 (those being affected by Z ) must be exogenous.
Z → D → Y,  Z → X → Y
Hong and Nekipelov (2012) consider an empirical auction model in which they are inter-
ested in the effect of early bidding (D) in an internet auction on eBay on the variance of
the bids (Y ). Their concern is that the two variables D and Y may be correlated due to
the visibility of the auctioned object. To overcome this endogeneity problem, the authors
artificially increase the supply of the auctioned object by themselves auctioning addi-
tional objects on eBay. Z = 0 refers to the period with normal supply before, whereas
Z = 1 refers to the period with enlarged supply. The authors argue that the larger sup-
ply should have an effect on D but no direct effect on Y . Since the authors themselves
create the larger supply (Z = 1), they also changed the average characteristics X of the
auctioned objects. Relevant characteristics X in eBay auctions are the seller’s reliability
(as perceived by previous buyers), and the geographical location of the seller (which
affects shipping costs). These variables have been affected by the authors’ supply infla-
tion in the market, in particular the geographic location of the auctioned objects. These
X variables have thus been caused by the instrument Z , and should be controlled for.
In sum we have seen that introducing covariates, say confounders X , may serve four
different purposes here:
Controlling for such X , however, may not always be a valid approach. Let us consider
the setup where the unobservable U affects Y but also X , and perhaps (but not neces-
sarily) D. More specifically, you may think of our Example 4.11 but being enlarged by
unobserved U :
Z → D → Y,  Z → X → Y,  U → X,  U → Y
Before we continue let us revise and summarise the notation to describe the relation
between variables more formally. To keep things simple we may think first of the case
where both, the endogenous regressor D and the instrument Z are binary. Extensions to
non-binary D and Z are discussed later. We incorporate a vector of covariates X :
$Y_i = \varphi(D_i, Z_i, X_i, U_i) \quad \text{with} \quad Y_{i,z}^{d} = \varphi(d, z, X_i, U_i) \quad \text{and} \quad Y_{i,Z_i}^{D_i} = Y_i,$
$D_i = \zeta(Z_i, X_i, V_i) \quad \text{with} \quad D_{i,z} = \zeta(z, X_i, V_i) \quad \text{and} \quad D_{i,Z_i} = D_i.$
Recall that if D has also an effect on X in the sense that changing D would imply a
change in X , only the direct effect of D on Y would be recovered with our identification
strategy, but not the total effect, as discussed in Chapter 2.
The previous instrumental variable conditions are assumed to hold conditional on X .
Note that this also requires that conditioning on X does not introduce any dependencies
and new confounding paths. The extension to incorporate covariates is assumed not to
affect the decision of the compliance types T , which is as before. More specifically, we
modify all assumptions but keep the same numbering.
As before, these assumptions rule out the existence of subpopulations that are affected
by the instrument in an opposite direction, and guarantee that Z is relevant for D|X.
This has not changed compared to the case when we ignored covariates X . Monotonicity
just ensures that the effect of Z on D has the same direction for all individuals with
the same X . Later we will see that the assumption can be weakened by dropping the
conditioning on X .
Assumption (A3C), Unconfounded instrument: The relative size of the subpopulations
always-takers, never-takers and compliers is independent of the instrument: for all x ∈
Supp(X):
$\Pr(T_i = t \,|\, X_i = x, Z_i = 0) = \Pr(T_i = t \,|\, X_i = x, Z_i = 1) \quad \text{for } t \in \{a, n, c\}.$
Validity of Assumption (A3C) requires that the vector X contains all variables that
affect (simultaneously) the choice of Z and T . Without conditioning on covariates X
this assumption may often be invalid because of selection effects.
Example 4.12 Recall the college attendance example, where we already discussed the problem
that parents who want their children to attend college later on quite likely
tend to live closer to a college than those who care less. This would imply that more
compliers live close to a college than far away, which would violate our former, i.e. unconditional,
Assumption (A3), where no X was included. In this case, the subpopulation living close
to a college would contain a higher fraction of compliers than the one living far away. If
this effect is captured by variables X (i.e. variables that control for these kinds of parents) we
would satisfy the new version of Assumption (A3), namely our (A3C).
We further need to rule out a relation of Z with Y |X not channelled by D. This time,
however, it suffices to do this conditional on X . In other words, conditional on X any
effect of Z should be channelled through D such that the potential outcomes are not
related with the instrument.
Assumption (A4C), Mean exclusion restriction: Conditional on X the potential out-
comes are mean independent of the instrumental variable Z in each subpopulation: for
all x ∈ Supp (X )
$E\big[Y_{i,Z_i}^{0} \,\big|\, X_i = x, Z_i = 0, T_i = t\big] = E\big[Y_{i,Z_i}^{0} \,\big|\, X_i = x, Z_i = 1, T_i = t\big] \quad \text{for } t \in \{n, c\}$
$E\big[Y_{i,Z_i}^{1} \,\big|\, X_i = x, Z_i = 0, T_i = t\big] = E\big[Y_{i,Z_i}^{1} \,\big|\, X_i = x, Z_i = 1, T_i = t\big] \quad \text{for } t \in \{a, c\}.$
Again, without conditioning on X , this assumption may often be invalid. However, recall
from Chapter 2 that conditioning can also create dependency for variables that without
this conditioning had been independent.
Often you see in the literature the Assumptions (A2C) and (A4C) replaced by modifications
of the CIA, namely asking for $Z \not\!\perp\!\!\!\perp D \,|\, X$ and $Y^d \perp\!\!\!\perp Z \,|\, X$, where the latter
obviously corresponds to (A4C), and the former to (A2C), which is also called the
relevance condition. Assumption (A3C) is often ignored.
Finally, since we are going to be interested in estimating some kind of average
complier effect (LATE) we will impose an additional assumption:
Assumption (A5C), Common support: The support of X is identical in both subpopu-
lations:
Supp (X |Z = 1) = Supp (X |Z = 0) .
Assumption (A5C) requires that for any value of X (in its support) both values of
the instrument can be observed. Clearly, an equivalent representation of the common
support condition is that 0 < Pr(Z = 1|X = x) < 1 ∀x with f x (x) > 0. As for the
CSC we are certainly free to (re-)define our population of interest such that χ fulfils
Assumptions (A1C) to (A5C).
With these assumptions, the LATE is identified for all x with Pr (T = c|X = x) > 0
by
$LATE(x) = E\big[Y^1 - Y^0 \,\big|\, X = x, T = c\big] = \frac{E[Y|X = x, Z = 1] - E[Y|X = x, Z = 0]}{E[D|X = x, Z = 1] - E[D|X = x, Z = 0]}.$
If we could restrict to the subpopulation of compliers, this IV method is simply the
matching method. This is by no means surprising: as in the case with binary Z , one could
think of compliers being exactly those for whom always D = Z . The proof is analogous
to the case without covariates X . So for our crucial assumption for identification
$Y^d \perp\!\!\!\perp Z \,|\, X, T = c$
we may equally well write
$Y^d \perp\!\!\!\perp D \,|\, X \quad \text{restricted to the subpopulation } T = c \qquad (4.12)$
being exactly the selection on observables assumption (CIA) but restricted to compliers,
saying that conditional on X , the compliers were randomly selected into D = 0 or
D = 1. Again, as the CIA does not hold for the entire population, the IV picks from the
population a subpopulation for which it does hold.
certain parts of it. Particularly if X contains many variables, there would be many dif-
ferent LATE(x) to be interpreted. Moreover, if X contains continuous variables, the
estimates might be rather imprecise and we would also not be able to attain $\sqrt{n}$ convergence
for our LATE(x) estimators. In these cases we are interested in some kind of
average effects.
One possibility would be to weight LATE(x) by the population distribution of x,
which would give us an average treatment effect of the form
$\int LATE(x)\, dF_X = \int \frac{E[Y|X = x, Z = 1] - E[Y|X = x, Z = 0]}{E[D|X = x, Z = 1] - E[D|X = x, Z = 0]}\, dF_X. \qquad (4.13)$
However, this approach may be problematic in two respects. First, the estimates of
$\frac{E[Y|X, Z = 1] - E[Y|X, Z = 0]}{E[D|X, Z = 1] - E[D|X, Z = 0]}$
will sometimes be quite imprecise, especially if X contains continuous variables. The
non-parametrically estimated denominator Ê [D|X, Z = 1] − Ê [D|X, Z = 0] might
often be close to zero, thus leading to very large estimates of L AT E(x). In addition,
the above weighting scheme represents a mixture between the effects on compliers and
always-/never-takers that might be hard to interpret: L AT E(x) refers only to the effect
for compliers exhibiting x, whereas d Fx refers to the distribution of x in the entire
population (consisting of compliers, always- and never-takers – defiers do not exist by
assumption). That is, (4.13) mixes different things.
An alternative is to examine the effect in the subpopulation of all compliers, which is
in fact the largest subpopulation for which a treatment effect is identified without further
assumptions. This treatment effect over all compliers is
$E\big[Y^1 - Y^0 \,\big|\, T = c\big] = \int E\big[Y^1 - Y^0 \,\big|\, X = x, T = c\big]\, dF_{X|T=c} = \int LATE(x)\, dF_{X|T=c}\,, \qquad (4.14)$
where m̂ z (x) and p̂z (x) are corresponding non-parametric regression estimators. Alter-
natively, we could use the observed values Yi and Di as predictors of E [Yi |X i , Z = z]
and E [Di |X i , Z = z], whenever Z i = z. This gives the estimator:
$\widehat{LATE} = \frac{\sum_{i: Z_i = 1} \big\{ Y_i - \hat m_0(X_i) \big\} - \sum_{i: Z_i = 0} \big\{ Y_i - \hat m_1(X_i) \big\}}{\sum_{i: Z_i = 1} \big\{ D_i - \hat p_0(X_i) \big\} - \sum_{i: Z_i = 0} \big\{ D_i - \hat p_1(X_i) \big\}}\,. \qquad (4.17)$
(i) $f_{X|Z=1}$, $m_z(\cdot)$ and $p_z(\cdot)$, $z = 0, 1$, are s-times continuously differentiable with the
s-th derivative being Hölder continuous, with $s > q = \dim(X)$
(ii) $K(\cdot)$ is a compact and Lipschitz continuous kernel of order $(s + 1)$
(iii) the bandwidth h satisfies $n_0 h^q / \ln(n_0) \to \infty$ and $n_0 h^{2s} \to 0$ for $n_0 \to \infty$,
where $n_0$ is the smallest subsample size out of the following four: $\sum_{i=1}^{n} 1\!\!1\{z_i = 0\}$,
$\sum_{i=1}^{n} 1\!\!1\{z_i = 1\}$, $\sum_{i=1}^{n} 1\!\!1\{d_i = 0\}$, $\sum_{i=1}^{n} 1\!\!1\{d_i = 1\}$.
Then, if the m d (x) and pd (x) are obtained by local polynomial regression of order < s,
one obtains for the estimator given in (4.17)
√
n( L
AT E − L AT E) −→ N (0, V )
with variance V reaching the efficiency bound for semi-parametric LATE estimators, which is given by
V = (1/γ²) · E[ { m_1(X) − m_0(X) − α (p_1(X) − p_0(X)) }² + Σ_{z=0}^{1} ( σ_Y²(X, z) − 2α σ_{Y,D}(X, z) + α² σ_D²(X, z) ) / Pr(Z = z|X) ],
where α = LATE, γ = ∫ {p_1(x) − p_0(x)} dF_X, σ_{Y,D}(X, z) = Cov(Y, D|X, Z = z), σ_Y²(X, z) = Var[Y|X, Z = z], and σ_D²(X, z) analogously.
Let π(x) = Pr(Z = 1|X = x) denote the propensity score of the (binary) instrument. It has been shown that the efficiency bound given in Theorem 4.2 can be reached. In many applications the propensity score π(x) is unknown and needs to be estimated. But due to the efficiency results for the propensity-score-based estimators in Chapter 3, it is to be expected that even if it were known, using an estimated propensity score would be preferable.
As in Chapter 3 one might use the propensity score – here the one for Z given X –
not for weighting but as a substitute for the regressor X . Due to this analogy one speaks
again of propensity score matching though it refers to the propensity score for the binary
instrument. Let us first derive the identification of the LATE via μz ( p) := E[Y |π(X ) =
p, Z = z] and νz ( p) := E[D|π(X ) = p, Z = z] for z = 0, 1. Obviously, for a given
(or predicted) π these four functions can be estimated non-parametrically, e.g. by kernel
regression. Now reconsider equation (4.18) noting that
E[ YZ / π(X) ] = E_ρ{ E[ YZ / π(X) | π(X) = ρ ] } = E_ρ{ (1/ρ) · E[Y | π(X) = ρ, Z = 1] · Pr(Z = 1 | π(X) = ρ) } = E_ρ{ E[Y | π(X) = ρ, Z = 1] } = ∫ μ_1(ρ) dF_π ,
where F_π is the c.d.f. of ρ = π(X) in the population. Similarly we obtain E[ Y(1 − Z)/(1 − π(X)) ] = ∫ μ_0(ρ) dF_π, E[ DZ/π(X) ] = ∫ ν_1(ρ) dF_π, and E[ D(1 − Z)/(1 − π(X)) ] = ∫ ν_0(ρ) dF_π. Replacing the expectations by sample averages and the μ_z, ν_z by non-parametric estimates, we can estimate (4.18) by
\widehat{LATE} = Σ_{i=1}^{n} { μ̂_1(π̂(X_i)) − μ̂_0(π̂(X_i)) } / Σ_{i=1}^{n} { ν̂_1(π̂(X_i)) − ν̂_0(π̂(X_i)) } .
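A possible implementation of this propensity-score-based version is sketched next; the data frame df is again hypothetical, the logit first stage is only an illustrative choice for estimating π(x), and μ_z, ν_z are estimated by local linear regressions on the fitted propensity score.

    # Sketch: match on the instrument propensity score pi(x) instead of on X itself.
    library(np)

    late_pscore <- function(df) {
      df$pihat <- fitted(glm(Z ~ X, family = binomial, data = df))  # pi-hat(X_i)
      d1 <- subset(df, Z == 1)
      d0 <- subset(df, Z == 0)

      mu1 <- npreg(npregbw(Y ~ pihat, data = d1, regtype = "ll"))
      mu0 <- npreg(npregbw(Y ~ pihat, data = d0, regtype = "ll"))
      nu1 <- npreg(npregbw(D ~ pihat, data = d1, regtype = "ll"))
      nu0 <- npreg(npregbw(D ~ pihat, data = d0, regtype = "ll"))

      # sample analogue of int (mu1 - mu0) dF_pi / int (nu1 - nu0) dF_pi
      num <- mean(predict(mu1, newdata = df) - predict(mu0, newdata = df))
      den <- mean(predict(nu1, newdata = df) - predict(nu0, newdata = df))
      num / den
    }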
Asymptotically, however, this estimator is inefficient as its variance does not meet the efficiency bound of Theorem 4.2 unless some very particular conditions are met.11 In fact, its variance is
(1/γ²) · E[ { μ_1(π) − μ_0(π) − α (ν_1(π) − ν_0(π)) }² + Σ_{z=0}^{1} ( σ_Y²(π, z) − 2α σ_{Y,D}(π, z) + α² σ_D²(π, z) ) / ( z + (−1)^z π ) ].
which follows from the exclusion restriction, Assumption 4. Then, following Bayes’
rule we have
dF_{X|Z=1,T=c}(x) = Pr(Z = 1, T = c|x) dF_X(x) / Pr(Z = 1, T = c)
= Pr(T = c|x, Z = 1) π(x) dF_X(x) / ∫ Pr(Z = 1, T = c|x) dF_X(x) = Pr(T = c|x) π(x) dF_X(x) / ∫ Pr(T = c|x) π(x) dF_X(x)
by the unconfoundedness condition, Assumption 3. Consequently the effect is now
identified as
E[Y^1 − Y^0 | D = 1, T = c]
= ∫ ( E[Y|X = x, Z = 1] − E[Y|X = x, Z = 0] ) π(x) dF_X / ∫ ( E[D|X = x, Z = 1] − E[D|X = x, Z = 0] ) π(x) dF_X ,   (4.21)
and in terms of propensity scores as
E[Y^1 − Y^0 | D = 1, T = c] = ∫ ( μ_1(ρ) − μ_0(ρ) ) ρ dF_π / ∫ ( ν_1(ρ) − ν_0(ρ) ) ρ dF_π .   (4.22)
As usual, you replace the unknown functions μz , νz , π (and thus ρ) by (non-)parametric
predictions and the integrals by sample averages. A weighting type estimator could be
derived from these formulae as well, see Exercise 4.
Why is this interesting? In the situation of one-sided non-compliance, i.e. where you
may say that the subpopulations of always-takers and defiers do not exist, the treated
compliers are the only individuals that are treated.12 The ATET is then identified as
E Y 1 − Y 0 |D = 1 = E Y 1 − Y 0 |D = 1, T = c .
Note that formula (4.21) is different from (4.16). Hence, with one-sided non-compliance
the ATET is the LATET (4.21) but not the LATE. This is different from the situa-
tion without confounders X . Simply check by setting X constant; then the formulae
(4.21) and (4.16) are identical in the one-sided non-compliance design such that
ATET = LATE.
What can be said about the (local) treatment effect for the always- and the never-takers? With similar arguments as above we can identify E[Y^1|T = a] and E[Y^0|T = n]. More specifically, from (4.3) combined with (A4C) we get that
E[Y^1|T = a] · Pr(T = a) = ∫ E[Y D|X, Z = 0] dF_X   with   Pr(T = a) = ∫ E[D|X, Z = 0] dF_X ,
E[Y^0|T = n] · Pr(T = n) = ∫ E[Y(1 − D)|X, Z = 1] dF_X   with   Pr(T = n) = ∫ E[1 − D|X, Z = 1] dF_X .
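A minimal sketch of how these four quantities could be estimated is the following; the data frame df with columns Y, D, Z and a scalar confounder X is hypothetical, and the kernel regressions could be replaced by any other smoother.

    # Mean potential outcomes of always- and never-takers, plus their shares.
    library(np)

    d1 <- subset(df, Z == 1)
    d0 <- subset(df, Z == 0)
    d0$YD  <- d0$Y * d0$D              # Y*D     is used with Z = 0
    d1$YnD <- d1$Y * (1 - d1$D)        # Y*(1-D) is used with Z = 1

    r_yd  <- npreg(npregbw(YD ~ X,  data = d0, regtype = "ll"))   # E[YD|X, Z=0]
    r_d0  <- npreg(npregbw(D ~ X,   data = d0, regtype = "ll"))   # E[D|X, Z=0]
    r_ynd <- npreg(npregbw(YnD ~ X, data = d1, regtype = "ll"))   # E[Y(1-D)|X, Z=1]
    r_d1  <- npreg(npregbw(D ~ X,   data = d1, regtype = "ll"))   # E[D|X, Z=1]

    pr_a  <- mean(predict(r_d0,  newdata = df))          # Pr(T = a)
    ey1_a <- mean(predict(r_yd,  newdata = df)) / pr_a   # E[Y^1 | T = a]
    pr_n  <- mean(1 - predict(r_d1, newdata = df))       # Pr(T = n)
    ey0_n <- mean(predict(r_ynd, newdata = df)) / pr_n   # E[Y^0 | T = n]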
Following the same strategy, can we also identify E[Y^0|T = a] and E[Y^1|T = n]? For this we would need to suppose that selection on observables holds not only for the compliers but also for the always- and never-takers. But in such a case we would have the CIA for the entire population, and would use the IV only for splitting the population along their types T without any need.13 In other words, in such a case the IV Z might be of interest on its own (eligibility, subsidies, incentives, . . . ) but would not be needed for ATE or ATET identification. On the other hand, in some situations such a strategy could be helpful in other respects: first, we obtain the average treatment effects Y^1 − Y^0 separately for the compliers, the always-participants and the never-participants. This gives some indication about treatment effect heterogeneity. Second, the comparison between E[Y^0|T = c], E[Y^0|T = a] and E[Y^0|T = n] may be helpful to obtain some understanding of what kind of people these groups actually represent. Note that this still requires (A1C) to (A5C) to hold.
Example 4.13  Imagine Y is employment status and we find that E[Y^0|T = a] < E[Y^0|T = c] < E[Y^0|T = n]. This could be interpreted as saying that the never-takers have the best labour market chances (even without treatment) and that the always-takers have worse labour market chances than the compliers. This would help us to understand which kind of people belong to a, c and n for a given incentive Z. In addition, we can also identify the distributions of X among the always-takers, the never-takers and the compliers, which provides us with additional insights into the labour market.
Chapter 7 will discuss how the same identification strategy can help us to recover the
entire hypothetical distributions of Y 0 , Y 1 , and therefore also the quantiles.
Assumption (A5C’), Common support: The support of X is identical for z min and z max
Supp (X |Z = z min ) = Supp (X |Z = z max ) .
Given these assumptions it can be shown that the LATE for the subpopulation of
compliers is non-parametrically identified as
E[Y^1 − Y^0 | T = c] = ∫ ( E[Y|X = x, Z = z_max] − E[Y|X = x, Z = z_min] ) dF_X / ∫ ( E[D|X = x, Z = z_max] − E[D|X = x, Z = z_min] ) dF_X .   (4.23)
This formula is analogous to (4.16) with Z = 0 and Z = 1 replaced with the endpoints
of the support of Z. If Z is discrete with finite support, previous results would apply and √n consistency could be attained. This is certainly just a statement about the asymptotic behaviour; it actually throws away all information in-between z_min and z_max. In practice you might therefore prefer to estimate the LATE for each increase in Z and then
average over them. This is actually the idea of the next section. For a continuous instrument, √n-consistency can no longer be achieved, unless it is mixed continuous-discrete
with mass points at z min and z max . The intuitive reason for this is that with continuous
Z the probability of observing individuals with Z i = z max or z min is zero. Therefore
we also have to use observations with Z i a little bit smaller than z max , and for non-
parametric regression to be consistent we will need the bandwidth to converge to zero.
(A similar situation will appear in the following situation on regression discontinuity
design.)14
Now consider the situation with multiple instrumental variables, i.e. Z being vector
valued. There is a way to extend the above assumptions and derivations accordingly. Set
the sign of all IVs such that they are all positively correlated with D and ask the selection
function ζ to be convex and proceed as before. Another, simpler way is to recall the
idea of propensity score matching. The different instrumental variables act through their
effect on D, so the different components of Z can be summarised conveniently by using
p(z, x) = Pr(D = 1|X = x, Z = z) as instrument. If D follows an index structure
in the sense that Di depends on Z i only via p(Z i , X i ),15 and Assumptions (A1C’) to
(A5C’) are satisfied with respect to p(z, x), then the LATE is identified as
E[Y^1 − Y^0 | T = c]
= ∫ ( E[Y|X = x, p(Z, X) = p̄_x] − E[Y|X = x, p(Z, X) = p_x] ) dF_X / ∫ ( E[D|X = x, p(Z, X) = p̄_x] − E[D|X = x, p(Z, X) = p_x] ) dF_X ,   (4.24)
where p̄_x = max_z p(z, x) and p_x = min_z p(z, x). This is equivalent to
E[Y^1 − Y^0 | T = c] = ∫ ( E[Y|X = x, p(Z, X) = p̄_x] − E[Y|X = x, p(Z, X) = p_x] ) dF_X / ∫ ( p̄_x − p_x ) dF_X .   (4.25)
Again, this formula is analogous to (4.16). The two groups of observations on which estimation is based are those with p(z, x) = p̄_x and those with p(z, x) = p_x. In the first representation (4.24), exact knowledge of p(z, x) is in fact not needed; it is sufficient to identify the set of observations for which p(Z, X) is highest and lowest, respectively, and compare their values of Y and D. Only the ranking with respect to p(z, x) matters, but not the values of p(z, x) themselves.16 For example, if Z contains two binary variables (Z_1, Z_2) which for any value of X are known to have a positive effect on D,
14 From (4.23) a bias-variance trade-off in the estimation of the LATE with non-binary Z becomes visible.
Although (4.23) incorporates the proper weighting of the different complier subgroups and leads to an
unbiased estimator, only observations with Z i equal (or close) to z min or z max are used for estimation.
Observations with Z i between the endpoints z min and z max are neglected, which might lead to a large
variance. Variance could be reduced, at the expense of a larger bias, by weighting the subgroups of
compliers differently or by choosing larger bandwidth values.
15 So D_{i,z} = D_{i,z′} if p(z, X_i) = p(z′, X_i). In other words, D_i does not change if Z_i is varied within a set where p(·, X_i) remains constant; see also the next section.
16 In Equation 4.25 the consistent estimation of p(z, x) matters, though.
An often-expressed criticism is that the LATE identifies a parameter that is not of interest. Since it is the effect on the complier subpopulation, and this subpopulation is induced by the instrument, any LATE is directly tied to its instrument and cannot be interpreted on its own. For example, if Z represents the size of a programme (the number of available slots), the LATE would represent the impact of the programme, if it were extended from size z to size z′, on the subpopulation which would participate only in the enlarged programme. So is it interesting for decision-makers? As we discussed in the previous sections, this depends on the context, and in particular on the applied instrument Z. Especially if Z represents a political instrument (fees, taxes, eligibility rules, subventions, etc.), the LATE might actually be even more interesting than the ATE or ATET themselves, as it tells us the average effect for those who reacted to this policy intervention.
This interpretation becomes more complex if we face non-binary treatments or instruments. On the other hand, if we directly think of continuous instruments, which in practice should often be the case, the interpretation becomes simpler, as this will allow us to study the marginal treatment effect (MTE). Contrary to what we are used to from the common notion of marginal effects, the MTE refers to the treatment effect for a marginal change in the propensity to participate and therefore a marginal change in the instrument. Most interestingly, we will see that this will enable us to express the ATE, ATET, ATEN and LATE as functions of the MTE and to link it (more generally) to what is sometimes called policy-relevant treatment effects (PRTE). As stated, in order to do so it is necessary from now on to have a continuous instrument (or a vector of instruments with at least one continuous element).
17 In the literature the technical, non-restrictive (i.e. in practice typically satisfied) assumption that Y^0 and Y^1 have finite first moments is often added.
18 To see this, consider (for a strictly increasing distribution function) Pr(F_V(V) ≤ c) = Pr(V ≤ F_V^{−1}(c)) = F_V(F_V^{−1}(c)) = c. Hence, the distribution is uniform. The same applies conditional on X, i.e. Pr(F_{V|X}(V) ≤ c | X) = Pr(V ≤ F_{V|X}^{−1}(c) | X) = F_{V|X}(F_{V|X}^{−1}(c)) = c.
[Figure: illustration, everything conditional on X = x – V is uniform on the unit interval, the indifference line at P separates D = 1 from D = 0, the compliers are those with V between ρ′ and ρ′′, over whom MTE(x, v) averages to the LATE.]
who would be induced to switch if the instrument were changed from z′ to z′′. For these compliers we have
LATE(x, ρ′, ρ′′) = E[Y^1 − Y^0 | X = x, ρ′ ≤ V ≤ ρ′′] = ( E[Y|x, ρ′′] − E[Y|x, ρ′] ) / ( ρ′′ − ρ′ )   (4.28)
for ρ′ < ρ′′. We used that E[D|X = x, P = ρ] = E[ E[D|X, Z, P(Z, X) = ρ] | X = x, P = ρ ], which is equal to E[ E[D|X, Z] | X = x, P = ρ ] = ρ. To see this, notice that
E[Y|X = x, P = ρ] = E[Y|X = x, P = ρ, D = 1] Pr(D = 1|X = x, P = ρ) + E[Y|X = x, P = ρ, D = 0] Pr(D = 0|X = x, P = ρ)
= ρ · ∫_0^ρ E[Y^1|X = x, V = v] dv/ρ + (1 − ρ) · ∫_ρ^1 E[Y^0|X = x, V = v] dv/(1 − ρ).   (4.29)
This gives the surplus for setting Z from z′ to z′′:
E[Y|X = x, P = ρ′′] − E[Y|X = x, P = ρ′]
= ∫_{ρ′}^{ρ′′} E[Y^1|X = x, V = v] dv − ∫_{ρ′}^{ρ′′} E[Y^0|X = x, V = v] dv
= ∫_{ρ′}^{ρ′′} E[Y^1 − Y^0|X = x, V = v] dv = (ρ′′ − ρ′) · E[Y^1 − Y^0|X = x, ρ′ ≤ V ≤ ρ′′].
So the surplus refers to the expected return to the treatment for the (sub)population with X = x. In case you are interested in the LATE return for the participants induced by this change in Z, you will have to divide this expression by (ρ′′ − ρ′).
Once again we notice that if Z takes on many different values, a different LATE could be defined for any two values of Z; recall (4.8). If Z is continuous, we can take the derivative of (4.29):
∂E[Y|X = x, P = ρ]/∂ρ = ∂/∂ρ ∫_0^ρ E[Y^1|X = x, V = v] dv + ∂/∂ρ ∫_ρ^1 E[Y^0|X = x, V = v] dv
= E[Y^1|X = x, V = ρ] − E[Y^0|X = x, V = ρ] = E[Y^1 − Y^0|X = x, V = ρ],
provided that E[Y|X = x, P = p] is differentiable in its second argument at the location p. This is the average treatment effect among those with characteristics X = x and unobserved characteristic V = p. For this reason the MTE is often written as MTE(x, v), where v refers to the unobserved characteristic in the selection equation. So we are talking about the individuals who are indifferent between participating and not participating if P = p.
It can be obtained by estimating the derivative of E[Y|X, P] with respect to P, which is why this is often called the local instrumental variable estimator (LIVE). An evident non-parametric estimator is the local linear regression estimator with respect to (X_i − x) and (P_i − p), where the coefficient of (P_i − p) gives the estimate of the partial derivative at point p for X = x. This will certainly be a non-parametric function in p and x. Only a parametric specification, or integrating (afterwards) over x and p, will provide us with a √n-consistent estimator. This will be briefly discussed later.
If the instrument can induce only few individuals to change treatment status, then only little is identified. This shows also that a strong impact of Z on D is important. Recall that extrapolation has to be done with care – if at all – and is only possible in parametric models. So the question is whether, for any given x, you get estimates of MTE(x, p) over the whole range of p from 0 to 1, and the same can certainly be questioned for F_{P|X=x}.
So you may say that we should at least attempt to estimate the treatment effect for the largest subpopulation for which it is identified. Let S_{ρ|x} = Supp(p(Z, X)|X = x) be the support of ρ given X, and let p_x and p̄_x be the inf and sup of S_{ρ|x}. Then the treatment effect on the largest subpopulation with X = x is LATE(x, p_x, p̄_x). Certainly, if p_x = 0 and p̄_x = 1, the ATE conditional on X could be obtained. So we are again in the typical dilemma of IV estimation: on the one hand we would like to have a strong instrument Z such that p(Z, X), conditional on X, has a large support. On the other hand, the stronger the instrument, the less credible the necessary assumptions are. Moreover, if we would like to average the obtained treatment effects over various values of x, only the effect for the range between sup_x p_x and inf_x p̄_x over this set of values x is identified, which reduces the identification set even further. However, if X is exogenous, i.e. independent of U^1 and U^0, and our interest is the average effect over all values of x, then we can increase our identification region, see below.
An interesting question is when LATE(x) (as a function of x) equals ATE(x); this boils down to the question of whether (for a given IV) MTE(x, p) is constant in p. Many ways can be found to give answers to this. What you basically need is that for a given IV and x, the gain or return to participation Y^1 − Y^0 = ϕ_1(x, U^1) − ϕ_0(x, U^0), recall (4.27), does not vary with the unobserved heterogeneity V in the participation decision. How can this be formalised? First let us assume additive separability of the unobserved part in the outcome equation, redefining ϕ_d(x) = E[Y^d|X = x] and U^d := Y^d − E[Y^d|X = x] for d = 0, 1. Then you would ask for (U^1 − U^0) ⊥⊥ V | X = x. Recalling that Y = DY^1 + (1 − D)Y^0 = Y^0 + D(Y^1 − Y^0) we have
E[Y|P = p, X = x] = E[Y^0|P = p, X = x] + E[ {ϕ_1(x) − ϕ_0(x) + U^1 − U^0} 1{p > V} ]
= E[Y^0|P = p, X = x] + p · ATE(x) + ∫_0^p E[U^1 − U^0|V = v] dv,
keeping in mind that V ∼ U [0, 1]. The MTE is the derivative with respect to p, therefore
MTE(x, p) = ∂E[Y|P = p, X = x]/∂p = ATE(x) + E[U^1 − U^0|V = p].
If (U^1 − U^0) ⊥⊥ V | X = x, then E[U^1 − U^0|V = p] cannot be a function of p because of this (conditional) independence. Therefore, if E[Y|P = p, X = x] is a linear function of p, then one concludes that MTE(x, p) = ATE(x). Then it also holds that MTE(x) = ATE(x) = LATE(x). In other words, the heterogeneity of the treatment effect can be explained sufficiently well by x. There exists a large set of non-parametric specification tests to check for linearity, see Gonzalez-Manteiga and Crujeiras (2013)
for a review on non-parametric testing. Which of these tests is the most appropriate for
which situation depends on the smoothing method you plan to use for the estimation.19
As we recommended a local linear or local quadratic estimator for E[Y|P = p, X = x] to obtain directly an estimate of MTE(x, p), a straightforward strategy would be to check whether MTE(x, p) is constant in p or not. The obvious problem is that one would have to check this for any x, ending up in a complex multiple testing problem. A simple solution can only be given for models with a parametric impact of x and p.
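As a crude illustration (and not one of the formal tests just cited), one could compare a specification that is linear in p with one that is smooth in p; if the smooth adds nothing, a constant MTE(x, p) in p is at least not contradicted. The data frame df with the outcome Y, a confounder X and the estimated participation probability P is hypothetical.

    # Heuristic check whether E[Y|P, X] is linear in P (i.e. MTE constant in p).
    library(mgcv)

    fit_lin    <- gam(Y ~ s(X) + P,    data = df)   # linear in P
    fit_smooth <- gam(Y ~ s(X) + s(P), data = df)   # flexible in P
    anova(fit_lin, fit_smooth, test = "F")          # informal comparison of the two fits
    summary(fit_smooth)                             # an edf of s(P) near 1 points to linearity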
As the MTE defines the gain in Y for a marginal change in 'participation' induced by an instrument Z, it would be interesting to see what the general formula for a social welfare gain caused by a policy change looks like.
Example 4.14 A policy could increase the incentives for taking up or extending school-
ing through financial support (without directly affecting the remuneration of education
in the labour market 20 years later). If the policy only operates through changing Z
without affecting any of the structural relationships, the impact of the policy can be
identified by averaging over the MTE appropriately. As usual, a problem occurs if Z is
also correlated with a variable that has a relation with the potential remuneration, except
if you can observe all those and condition on them.
Consider two potential policies denoted as a and b, which differ in that they affect
the participation inclination, but where the model remains valid under both policies, in
particular the independence of the instrument. Denote by Pa and Pb the participation
probabilities under the respective policy a and b. If the distributions of the potential
outcomes and of V (conditional on X ) are the same under policy a and b, the MTE
remains the same under both policies and is thus invariant to it. Any utilitarian welfare
function (also called a Benthamite welfare function) sums the utility of each individual
in order to obtain society’s overall welfare. All people are treated the same, regard-
less of their initial level of utility. For such a social welfare function of Y, say U, the impact of a move from policy a to policy b is obtained by integrating the MTE against the difference between F_{P_b|X} and F_{P_a|X}, the respective distributions of the participation probability under the two policies. In
the literature one often speaks of policy relevant treatment parameters. If the distribution
of P can be forecasted for the different policies, it gives us the appropriate weighting of
the MTE for calculating the impact of the policy.
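To illustrate the weighting idea, the following sketch uses the weighting commonly employed in the MTE literature: the MTE is integrated against the difference of the two distributions of P and normalised by the change in average participation. The inputs mte_hat (an already estimated MTE, here averaged over X), p_a and p_b (forecasted participation probabilities under the two policies) are hypothetical.

    # Policy-relevant treatment effect as a weighted average of the MTE.
    prte <- function(mte_hat, p_a, p_b, grid = seq(0.01, 0.99, by = 0.01)) {
      Fa <- ecdf(p_a)
      Fb <- ecdf(p_b)
      w  <- Fa(grid) - Fb(grid)                         # F_{P_a}(v) - F_{P_b}(v)
      total <- sum(mte_hat(grid) * w) * diff(grid)[1]   # approximates E[Y_b] - E[Y_a]
      total / (mean(p_b) - mean(p_a))                   # per net person shifted into D = 1
    }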
∂E[Y(D − 1)|X, P = ρ] / ∂ρ = E[Y^0|X, V = ρ].
Hence, the mean potential outcomes Y 0 and Y 1 are identified separately. Therefore, we
can analogously identify the potential outcome distributions by substituting 11{Y ≤ c}
(for any c ∈ (0, 1)) for Y to obtain FY 1 |X,V =ρ and FY 0 |X,V =ρ .
We can estimate (4.31) by non-parametric regression of Y D on X and P. In order
to avoid a sample with many zeros when regressing the product of Y and D on the
regressors you may rewrite this as
E[Y^1|X, V = ρ] = ∂E[Y D|X, P = ρ]/∂ρ = ∂/∂ρ ( E[Y|X, P = ρ, D = 1] · ρ )
= ρ · ∂E[Y|X, P = ρ, D = 1]/∂ρ + E[Y|X, P = ρ, D = 1].   (4.32)
Hence, one can estimate the potential outcome from the conditional mean of Y and its
derivative in the D = 1 population.
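For instance, using the np package on the treated subsample, (4.32) could be computed as follows; the data frame df with columns Y, D, X and the estimated participation probability P is assumed.

    # Sketch of (4.32): level and derivative of E[Y|X, P, D = 1] on the treated.
    library(np)

    d1  <- subset(df, D == 1)
    bw  <- npregbw(Y ~ P + X, data = d1, regtype = "ll")
    reg <- npreg(bw, gradients = TRUE)
    ey1 <- d1$P * reg$grad[, 1] + fitted(reg)   # estimate of E[Y^1 | X_i, V = P_i]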
The distribution functions FY 1 |X,V and FY 0 |X,V can be estimated in two ways. One
approach is substituting 11{Y ≤ c} for Y as mentioned above. Alternatively one can use
the structure of the additively separable model with
Y_i^1 = ϕ_1(X_i) + U_i^1   and   Y_i^0 = ϕ_0(X_i) + U_i^0,   (4.33)
which implies for the conditional densities
f_{Y^d|X,V}(c|x, v) = f_{U^d|X,V}(c − ϕ_d(x)|x, v) = f_{U^d|V}(c − ϕ_d(x)|v).
The latter can be obtained as a density estimate after having estimated ϕd (x).
Almost as a by-product, the above calculations reveal how an increase of the identi-
fication region could work. Above we briefly discussed the problem of treatment effect
identification when instrument Z does not cause much variation in the propensity score
once X has been fixed. In fact, E[Y 1 |X, V = ρ] and FY 1 |X,V =ρ are only identified
for those values of ρ which are in the support of the conditional distribution of P, i.e.
P|(X, D = 1), whereas E[Y 0 |X, V = ρ] and FY 0 |X,V =ρ are only identified for those
values of ρ which are in the support of P|(X, D = 0). Since P is a deterministic func-
tion of X and Z only, any variation in P|X can only be due to variation in Z . Unless the
instruments Z have strong predictive power such that for each value of X they generate
substantial variation in P|X , the set of values of (X, V ) where FY 1 |X,V and FY 0 |X,V are
identified may be small for given x. A remedy would be if we could integrate over X ,
enlarging the identification region substantially.
Though extensions to non-separable cases might be thinkable, we continue with the
more restrictive model (4.33). Much more restrictive and hard to relax is the next
assumption, namely that the errors U d , V are jointly independent from Z and X , i.e.
(U 0 , V ) ⊥⊥ (Z , X ) and (U 1 , V ) ⊥⊥ (Z , X ). Repeating then the calculations from above
we get
E[Y|X, P = ρ, D = 1] = ∫_0^ρ E[Y^1|X, V = v] dv/ρ = ϕ_1(X) + ∫_0^ρ E[U^1|V = v] dv/ρ
= ϕ_1(X) + λ_1(ρ),   with   λ_1(ρ) := ∫_0^ρ E[U^1|V = v] dv/ρ .   (4.34)
Note that we can identify the function ϕ_1(X) by examining E[Y|X, P = ρ, D = 1] for different values of X while keeping ρ constant. Analogously we can proceed for ϕ_0. These results are helpful but do not yet provide us with the marginal treatment outcomes.
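For illustration, since (4.34) is additive in ϕ_1(X) and λ_1(ρ), an additive model fitted on the treated subsample recovers ϕ_1 up to the usual location normalisation; the sketch below uses mgcv and a hypothetical data frame df with columns Y, D, X and the estimated P.

    # Sketch of (4.34): additive regression of Y on X and P in the D = 1 subsample.
    library(mgcv)

    d1   <- subset(df, D == 1)
    fit1 <- gam(Y ~ s(X) + s(P), data = d1)
    phi1 <- predict(fit1, type = "terms")[, "s(X)"]   # phi_1(X_i), up to a constant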
The previous sections examined identification for a scalar binary endogenous regressor D. This is a simple setting, though sufficient in many applications. In the case with D
discrete, the methods introduced above can often be extended. If now the treatment D
is continuous and confounders X are included, non-parametric identification becomes
more complex. To keep things relatively easy, the models examined here are based
on restrictions in the (selectivity or) choice equation. Specifically, we still work with
triangularity
Y = ϕ(D, X, U ) with D = ζ (Z , X, V ), (4.35)
where it is assumed that Y does not affect D. In other words, we still impose a causal
chain in that D may affect Y but not vice versa. Such a model may be appropriate
because of temporal ordering, e.g. if D represents schooling and Y represents some
outcome 20 years later. In other situations, like that of a market equilibrium where Y represents supply and D demand, such a triangular model is no longer appropriate.
Example 4.15 Let D be the years of schooling. An individual i with value vi larger than
v j for an individual j (with identical characteristics X and getting assigned the same z)
will always receive at least as much schooling as individual j regardless of the value of
the instrument Z. This assumption may often appear more plausible when we include many variables in X, as all the heterogeneity these can capture is no longer contained in V.
Y = ϕ(D, X, U ) , D = ζ (Z , X, V, Y ),
i.e. where D is also a function of Y . We could insert the first equation into the second to
obtain
Y = ϕ(D, X, U ) , D = ζ (Z , X, V, ϕ(D, X, U ))
20 More often you may find the notion of control function; this typically refers to the special case where the
effect of V appears as a separate function in the model.
which implies that D depends on two unobservables. Now we see that the unobservables affecting D are two-dimensional, such that we cannot write the model in terms of an invertible function of a one-dimensional unobservable. Consequently, the problem can only be solved simultaneously and by imposing more structure.
Let us turn back to the triangular model. As stated, there exists a simple and straightforward approach, the so-called control function approach. The idea is to condition on V when studying the impact of D on Y, as V should capture all the endogeneity inherited by D. Let us go step by step. Assumption IN.2 implies that the inverse function of ζ with respect to its third argument exists: v = ζ^{−1}(z, x, d), such that ζ(z, x, ζ^{−1}(z, x, d)) = d. Hence, if ζ were known, the unobserved V would be identified by z, x, d. For ζ unknown, with Assumption IN.1 you still have
FD|Z X (d|z, x) = Pr(D ≤ d|X = x, Z = z) = Pr(ζ (z, x, V ) ≤ d|X = x, Z = z)
= Pr(V ≤ ζ −1 (z, x, d)) = FV (ζ −1 (z, x, d)) = FV (v).
If V is continuously distributed, FV (v) is a one-to-one function of v. Thus, controlling
for FV (v) is identical to controlling for V .21 Hence, two individuals with the same value
of FD|Z X (Di |Z i , X i ) have the same V . Since FD|Z X (d|z, x) depends only on observed
covariates, it is identified. We know from Chapter 2 that this can be estimated by non-
parametric regression noting that FD|Z X (d|z, x) = E [11 (D ≤ d) |Z = z, X = x].
After conditioning on V , observed variation in D is stochastically independent of
variation in U such that the effect of D on the outcome variable can be separated from
the effect of U . But it is required that there is variation in D after conditioning on V and
X , which is thus generated by the instrumental variable(s) Z . The endogeneity of D is
therefore controlled for in a similar way as in the selection on observables approach, i.e.
the matching approach.
To simplify notation, define the random variable
V̄ ≡ FV (V ) = FD|Z X (D|Z , X )
and let v̄ be a realisation of it. V̄ can be thought of as a rank-preserving transformation of
V to the unit interval. For example, if V were uniformly [0, 1] distributed, then V̄ = V
(this is basically equivalent to what we did in Section 4.3.1). In the context of treatment
effect estimation one often finds the notation of the average structural function (ASF)
which is the average outcome Y for given x and treatment d. To identify the ASF, notice
that conditional on V̄ , the endogeneity is controlled by
fU |D,X,V̄ = fU |X,V̄ = fU |V̄ .
As we have
E[Y|D = d, X = x, V̄ = v̄] = ∫ ϕ(d, x, u) · f_{U|D X V̄}(u|d, x, v̄) du = ∫ ϕ(d, x, u) f_{U|V̄}(u|v̄) du ,
21 If V is not continuously distributed, F_V(v) contains steps, and the set {v : F_V(v) = a} of values v with
the same FV (v) is not a singleton. Nevertheless, only one element of this set, the smallest, has a positive
probability, and therefore conditioning on FV (v) is equivalent to conditioning on this element with
positive probability.
assuming that all the conditional moments in the expressions are finite, and provided the
term E[Y |D = d, X = x, V̄ = v̄] is identified for all v̄ where f V̄ (v̄) is non-zero. The
latter requires that the support of V̄ |(D, X ) is the same as the support of V̄ , which in
practice can be pretty restrictive. This certainly depends on the context.
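A minimal sketch of this control function strategy with the np package could look as follows; the data frame df with a continuous treatment D, one instrument Z, one confounder X and outcome Y is hypothetical.

    # Control function Vbar_i = F_hat(D_i | Z_i, X_i) and a plug-in ASF estimate.
    library(np)

    cdf_bw  <- npcdistbw(D ~ Z + X, data = df)     # conditional CDF of D given (Z, X)
    df$Vbar <- fitted(npcdist(cdf_bw))             # Vbar_i = F_hat(D_i | Z_i, X_i)

    out_bw  <- npregbw(Y ~ D + X + Vbar, data = df, regtype = "ll")
    out_reg <- npreg(out_bw)

    asf <- function(d, x) {                        # ASF(d, x) = int E[Y|d, x, v] dF_Vbar(v)
      nd <- data.frame(D = d, X = x, Vbar = df$Vbar)
      mean(predict(out_reg, newdata = nd))
    }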
Example 4.16 Take once again our schooling example. If we want to identify the ASF
for d = 5 years of schooling and suppose the distribution of ‘ability in schooling’ V̄
ranges from 0 to 1, it would be necessary to observe individuals of all ability levels with
d = 5 years of schooling. If, for example, the upper part of the ability distribution always chose to have more than five years of schooling, then E[Y|D = 5, X, V̄] would not be identified for the large ability values V̄. In other words, in the sub-population
observed with five years of schooling, the high-ability individuals would be missing.
If this were the case, then we could never infer from data what these high-ability
individuals would have earned if they had received only five years of schooling.
From this example it becomes obvious that we need such an assumption, formally
Assumption IN.3 (Full range condition): For all (d, x) where the ASF shall be
identified,
Supp(V̄ |X = x, D = d) = Supp(V̄ ).
So, as in the sections before, since the support of V̄ given d and x depends only on
the instrument Z , this requires a large amount of variation in the instrument. Regarding
Example 4.16, this requires that the instrument is sufficiently powerful to move any
individual to 5 years of schooling. By varying Z , the individuals with the highest ability
for schooling and the lowest ability have to be induced to choose 5 years of schooling.
An analogous derivation shows the identification of the distribution structural
function (DSF) and thus the quantile structural function (QSF):
∫ E[1{Y ≤ a}|D = d, X = x, V̄ = v̄] · f_V̄(v̄) dv̄ = ∫∫ 1{ϕ(d, x, u) ≤ a} · f_{U|V̄}(u|v̄) du f_V̄(v̄) dv̄ = ∫ 1{ϕ(d, x, u) ≤ a} f_U(u) du,
which is identified as
DSF(d, x; a) = ∫ F_{Y|D X V̄}[a|D = d, X = x, V̄ = v̄] · f_V̄(v̄) dv̄.   (4.37)
If we are only interested in the expected potential outcomes E[Y d ], i.e. the ASF as a
function of d only and not of x, we could relax the previous assumptions somewhat.
Notice that the expected potential outcome is identified by
E[Y^d] = ∫∫ E[Y|D = d, X = x, V̄ = v̄] · f_{X V̄}(x, v̄) dx dv̄,   (4.38)
see Exercise 9. For this result we could even relax Assumption IN.1 to (U, V ) ⊥⊥ Z |X
and would no longer require (U, V ) ⊥⊥ X . We would have to change notation somewhat
in that we should permit the distribution function F_V to depend on X. Furthermore, the common support Assumption IN.3 changes to: for all d where E[Y^d] shall be identified we need
Supp(V̄ , X |D = d) = Supp(V̄ , X ).
The first part is in some sense weaker than Assumption IN.3 in that Supp(V̄ |X =
x, D = d) needs to contain only those ‘ability’ values V̄ that are also observed in the
X = x population instead of all values observed in the population at large. Hence, a
less powerful instrument could be admitted. However, this assumption is not necessarily
strictly weaker than Assumption IN.3 since this assumption is required to hold for all
values of X . The second part of the above assumption is new and was not needed before.
Nonetheless, Assumption IN.3 is quite strong and may not be satisfied. It is not
needed, however, for identifying Average Derivatives. Suppose ϕ is continuously
differentiable in the first element with probability one. Recall again the equality
E[Y|D = d, X = x, V̄ = v̄] = ∫ ϕ(d, x, u) · f_{U|V̄}(u|v̄) du,
Example 4.18 Suppose D is years of schooling and Z an instrument that influences the
schooling decision. If Z was changed exogenously, some individuals might respond by
increasing school attendance by an additional year. Other individuals might increase
school attendance by two or three years. But have in mind that even if Z was set to zero
for all individuals, they would ‘choose’ different numbers of years of schooling.
Here we consider the situation when only a single binary instrument is available like
for example a random assignment to drug versus placebo. Only a weighted average of
the effects can then be identified. According to their reaction to a change in Z from 0 to 1, the population can be partitioned into the types c_{0,0}, c_{0,1}, . . ., c_{K,K}, where the treatment choice made by individual i under Z = z is denoted by D_{i,z} and individual i is of type τ = c_{k,l} if D_{i,0} = k and D_{i,1} = l.   (4.39)
Assuming monotonicity, the defier-types c_{k,l} for k > l do not exist. The types c_{k,k} represent those units that do not react to a change in Z. In the setup where D is binary these
are the always-takers and the never-takers. The types ck,l for k < l are the compliers,
which comply by increasing Di from k to l. These compliers comply at different base
levels k and with different intensities l − k. In order to simplify identification you might want to restrict your study to the average returns accounting for the intensities (l − k).
Example 4.19 In our returns to schooling example, E[Y k+1 − Y k |X, τ = ck,k+1 ] mea-
sures the return to one additional year of schooling for the ck,k+1 subpopulations.
E[Y k+2 − Y k |X, τ = ck,k+2 ] measures the return to two additional years of schooling,
which can be interpreted as twice the average return of one additional year. Simi-
larly, E[Y k+3 − Y k |X, τ = ck,k+3 ] is three times the average return to one additional
year. Hence, the effective weight contribution of the ck,l subpopulation to the measure-
ment of the return to one additional year of schooling is (l − k) · Pr τ = ck,l . Then
a weighted L AT E(x), say γw (x), for all compliers with characteristics x could be
defined as
γ_w(x) = [ Σ_k^K Σ_{l>k}^K E[Y^l − Y^k | x, τ = c_{k,l}] · Pr(τ = c_{k,l}|x) ] / [ Σ_k^K Σ_{l>k}^K (l − k) · Pr(τ = c_{k,l}|x) ].   (4.40)
The problem is now triple: to estimate E[Y^l − Y^k|X, τ = c_{k,l}] and Pr(τ = c_{k,l}|X) for unobserved τ (you again have only treated and controls; you know neither to which partition c_{k,l} the individuals belong nor their proportions), and the integration of γ_w(x). This function is the effect of the induced treatment change for given x, averaged over the different complier groups and normalised by the intensity of compliance. To obtain the weighted average effect for the subpopulation of all compliers (i.e. all subpopulations c_{k,l} with k < l), one would need to weight γ_w(x) by the distribution of X in the complier subpopulation:
∫ γ_w(x) dF_{x|complier}(x),   (4.41)
Example 4.20 Imagine, for D taking values in {0, 1, 2}, the population can be partitioned
in the subpopulations: {c0,0 , c0,1 , c0,2 , c1,1 , c1,2 , c2,2 } with the all-compliers subpopu-
lation consisting of {c0,1 , c0,2 , c1,2 }. The two partitions with proportions {0.1, 0.1, 0.3,
0.3, 0.1, 0.1} and {0.1, 0.2, 0.2, 0.2, 0.2, 0.1}, respectively, generate the same distri-
bution of D given Z ; namely Pr(D = 0|Z = 0) = 0.5, Pr(D = 1|Z = 0) = 0.4,
Pr(D = 2|Z = 0) = 0.1, Pr(D = 0|Z = 1) = 0.1, Pr(D = 1|Z = 1) = 0.4,
Pr(D = 2|Z = 1) = 0.5. But already the size of the all-compliers subpopulation
is different for the two partitions (0.5 and 0.6, respectively). Hence the size of the
all-compliers subpopulation is not identified from the observable variables.
Now, if one defines the all-compliers subpopulation together with compliance inten-
sities (l − k), the distribution of X becomes identifiable. Each complier is weighted
by its compliance intensity. In the case of Example 4.20 where D ∈ {0, 1, 2}, the sub-
population c0,2 receives twice the weight of the subpopulation c0,1 . In the general case
one has
f^w_{x|complier}(x) = [ Σ_k^K Σ_{l>k}^K (l − k) · f_{x|τ=c_{k,l}}(x) Pr(τ = c_{k,l}) ] / [ Σ_k^K Σ_{l>k}^K (l − k) · Pr(τ = c_{k,l}) ].   (4.42)
Example 4.21 Considering the years-of-schooling example, the subpopulation c_{0,2} complies with intensity l − k = 2, i.e. two additional years of schooling. If the returns to a year of
schooling are the same for each year of schooling, an individual who complies with
two additional years can be thought of as an observation that measures twice the
effect of one additional year of schooling or as two (correlated) measurements of
the return to a year of schooling. Unless these two measurements are perfectly cor-
related, the individual who complies with two additional years contributes more to
the estimation of the return to schooling than an individual who complies with only
one additional year. Consequently, the individuals who comply with more than one
year should receive a higher weight when averaging the return to schooling over the
distribution of X . If each individual is weighted by its number of additional years,
the weighted distribution function of X in the all-compliers subpopulation, where
D ∈ {0, 1, 2}, is
f^w_{x|complier} = [ f_{x|τ=c_{0,1}} Pr(τ = c_{0,1}) + f_{x|τ=c_{1,2}} Pr(τ = c_{1,2}) + 2 f_{x|τ=c_{0,2}} Pr(τ = c_{0,2}) ] / [ Pr(τ = c_{0,1}) + Pr(τ = c_{1,2}) + 2 Pr(τ = c_{0,2}) ].
Suppose that D is discrete with finite support, the instrument Z is binary and Assump-
tions (A1C), (A2C) and (A5C) are satisfied as well as (A3C), (A4C) with respect to all
types t ∈ {ck,l : k ≤ l}, defined in (4.39). It can be shown (Exercise 10) that the
weighted LATE for the subpopulation of compliers is non-parametrically identified as
∫ γ_w(x) · f^w_{x|complier}(x) dx = ∫ ( E[Y|X = x, Z = 1] − E[Y|X = x, Z = 0] ) dF_X / ∫ ( E[D|X = x, Z = 1] − E[D|X = x, Z = 0] ) dF_X .   (4.43)
This is actually not hard to estimate (even non-parametrically). All we have to do is to
replace the conditional expectations by non-parametric predictors – which is not very
difficult given that these involve only observables; and the integrals with d FX can be
replaced by sample averages.
models. We will study the former in more detail in Chapter 7. Furthermore, as already
mentioned in the context of matching, there exists some quite recent research on post-
confounder-selection inference. In particular, Belloni, Chernozhukov, Fernández-Val
and Hansen (2017) consider the case with binary treatment and a binary instrument
Z but a huge vector of potential counfounders. The number of confounders on which
you indeed have to condition on in order to make the IV assumptions hold has to be
sparse (q is much smaller than n), and potential selection errors have to be ignorable
(first-order orthogonal). Then you can reach valid post-selection inference on treatment
effects with a binary IV in high-dimensional data.
In the spirit of the control function approach, Imbens and Newey (2009) consider
extensions of the ASF approach that allow for the simulation of an alternative treatment
regime where the variable D is replaced by some known function l(D, X ) of D and/or
X . The potential outcome of this policy is ϕ(l(D, X ), X, U ) and the average treatment
effect compared to the status quo is
E [ϕ(l(D, X ), X, U )] − E[Y ]. (4.44)
As an example, they consider a policy which imposes an upper limit on the choice variable D. Hence, l(D, X) = min(D, d̄), where d̄ is the limit.
Identification in simultaneous equations with monotonicity in both equations, namely
the outcome and the selection equation, say
Y = ϕ(D, X, U, V ) , D = ζ (Y, X, Z , U, V ),
is discussed in various articles by Chesher (Chesher 2005, Chesher 2007, Chesher
2010). For non-separable models see Chesher (2003) and Hoderlein and Mammen
(2007). Why is monotonicity in both equations of interest? Because then we can write
by differential calculus (using the chain rule):
∂y/∂z = ( ∂ϕ(d, x, u, v)/∂d ) · ( ∂d/∂z ) + ∂ϕ(d, x, u, v)/∂z ,   where the last term equals zero,   and
∂d/∂z = ( ∂ζ(y, x, z, u, v)/∂y ) · ( ∂y/∂z ) + ∂ζ(y, x, z, u, v)/∂z .
And with the exclusion restriction, you obtain
∂ϕ(d, x, u, v)/∂d = ( ∂y/∂z ) / ( ∂d/∂z ) |_{d,x,z,u,v} ,
where the right-hand side depends only on the variables d, x, z, u, v but no longer on
the unknown function. But as u, v are unobserved, you need monotonicity. The implicit
rank invariance was relaxed to rank similarity by Chernozhukov and Hansen (2005).
Back to some modelling for easier identification and estimation. A quite useful com-
promise between parametric and non-parametric modelling are the varying coefficient models
E[Y|X = x_i] = x_i′ β_i ,   x_i ∈ ℝ^q ,   β_i ∈ ℝ^q non- or semi-parametric,
regression is more appropriate than logit or probit. The Stata command eteffects
and its extensions offer further alternatives. These can also be used in order to apply
for example a probit in the first stage, and a simple linear one in the second step. If we
assume that the treatment effect is different for the treated and non-treated individuals
(grouped heterogeneity), ivtreatreg is another alternative; see Cerulli (2014). It
is unfortunately not always very clear what the particular differences between these
commands are and which one should be used.
For estimating the different LATE or MTE non-parametrically, it is recommended
to switch to R. In Exercises 6, 7 and 8 you are asked to implement general estimators
out of kernel and/or spline smoothers. When mainly the MTE is of interest, then the
estimated ∂E(Y_i|P_i, X_i)/∂P_i can be obtained from local polynomial regression. For example, you obtain it from reg$grad[,1] after running h = npregbw(Y ~ P + X) and reg = npreg(h, gradients = TRUE), where P is the vector of propensity scores. In this case, the local polynomial kernel estimator of at least order 2 is a privileged method because it provides the estimated gradient required to infer the MTE.
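Putting these pieces together, a minimal end-to-end sketch could look as follows; the data frame df, the probit first stage for the propensity score and the local linear smoother are only illustrative choices.

    # Sketch: MTE(x, p) as the gradient of E[Y|P, X] with respect to P.
    library(np)

    df$P <- fitted(glm(D ~ Z + X, family = binomial(link = "probit"), data = df))

    h   <- npregbw(Y ~ P + X, data = df, regtype = "ll")   # local linear in (P, X)
    reg <- npreg(h, gradients = TRUE)

    mte_hat <- reg$grad[, 1]                     # MTE(X_i, P_i) at the observations
    plot(df$P, mte_hat, xlab = "p", ylab = "MTE(x, p)")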
Only very broadly discussed was the control function approach; recall Section 4.4.1.
Now, in additive models one can always include a control function, i.e. a non-parametric
function of the residual of the selection model, in order to switch from a standard match-
ing or propensity score weighting model to an IV model. There are no specific packages
or commands available for this approach. However, a simple implementation for binary D is provided if in the second stage you assume an additively separable model (also for the control function), i.e. you regress (using e.g. a command from the gam or np package in R)
with m_inter(·), m_X(·) and m_V(·) being non-parametric functions. When allowing for
individual heterogeneity of the treatment effect as well as complex interactions between
covariates and treatment, a conditional LATE on grouped observations of the values
taken by the confounders X is more appropriate.
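One possible reading of such a specification is sketched below, where the control function is a smooth of the first-stage residual and treatment effect heterogeneity is captured by a treatment–covariate interaction smooth; the data frame df, the logit first stage and this particular specification are illustrative assumptions only.

    # Control function regression for binary D with an additive second stage.
    library(mgcv)

    df$Vhat <- residuals(glm(D ~ Z + X, family = binomial, data = df),
                         type = "response")              # first-stage residual D - p_hat
    fit <- gam(Y ~ s(X) + s(X, by = D) + s(Vhat), data = df)
    summary(fit)   # s(Vhat) is the control function, s(X, by = D) the treatment term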
Finally a general remark: many commands were constructed for linear additive mod-
els. You certainly know that you can always define polynomials and interactions in
a straightforward way. Similarly you can extend this to non-parametric additive models or semi-parametric varying coefficient models using splines. R in particular almost always allows you to build a spline basis out of a variable, such that you get a non-parametric additive model simply by substituting this spline basis for the original covariate. See also npLate and Frölich and Melly.
4.6 Exercises
1. Recall Assumptions (A3), (A4) of the LATE estimator in Section 4.1.2. Show that
the second part of Assumption (A4), which we also called Assumption (A4b), is
trivially satisfied if the instrument Z is randomly assigned.
2. Again recall Assumptions (A3), (A4) of the LATE estimator in Section 4.1.2. Show
that randomisation of Z does not guarantee that the exclusion assumption holds on
the unit level, i.e. Assumption (A4a).
3. Recall the Wald estimator for LATE (4.5) in Section 4.1.2. Show that this estimator
is identical to the 2SLS estimator with D and Z being binary variables.
4. Analogously to Chapter 3, derive for (4.21) and (4.22) propensity weighting
estimators of the LATE.
5. Discuss the validity of the necessary assumptions for the following example: The
yearly quarter of birth for estimating returns to schooling, as e.g. in Angrist and
Krueger (1991). They estimated the returns to schooling using the quarter of birth
as an instrumental variable for educational attainment. According to US compul-
sory school attendance laws, compulsory education ends when the pupil reaches a
certain age, and thus, the month in which termination of the compulsory education
is reached depends on the birth date. Since the school year starts for all pupils in
summer/autumn, the minimum education varies with the birth date, which can be
exploited to estimate the impact of an additional year of schooling on earnings. The
authors show that the instrument birth quarter Z has indeed an effect on the years
of education D. On the other hand, the quarter of birth will in most countries also
have an effect on age at school entry and relative age in primary school. In most
countries, children who are born, for example, before 1st September enter school
at age six, whereas children born after this date enter school in the following year.
Although there are usually deviations from this regulation, there are still many chil-
dren who comply with it. Now compare two children, one born in August and one
born in September of the same year. Although the first child is only a few weeks
older, it will tend to enter school about one year earlier than the second one. The
first child therefore starts schooling at a younger age and in addition will tend to be
younger relative to his classmates during elementary (and usually also secondary)
school. Discuss now the validity of the exclusion restriction.
6. Write R code for calculating a L AT E estimator by averaging over non-
parametrically predicted L AT E(X i ). Use local linear kernel regression for the
E[Y |Z = z, X i ] and either Nadaraya–Watson estimators or local logit estimation
for E[D|Z = z, X i ]. Start with the case where X is a continuous one-dimensional
variable; then consider two continuous confounders, and then a general partial linear
model for E[Y |Z = z, X i ], and a generalised partial linear model with logit link
for E[D|Z = z, X i ].
7. As Exercise 6, but now use for E[Y|Z = z, X_i] some additive (P-) splines and for E[D|Z = z, X_i] a logit estimator with additive (P-) splines in the index, i.e. for the latent model.
8. Repeat Exercise 6 for estimating the MTE. As you need the first derivatives, you
should again use either local linear or local quadratic estimators.
Σ_k^K Σ_{l>k} (l − k) · Pr(τ = c_{k,l}) = E[D|X = x, Z = 1] − E[D|X = x, Z = 0].
5 Difference-in-Differences Estimation:
Selection on Observables and
Unobservables
The methods discussed in previous sections could be applied with data observed for a
treated and a control group at a single point in time. In this section we discuss methods
that can be used if data are observed at several points in time and/or if several control
groups are available. We discuss first the case where data are observed at two points in
time for the control group and for the treated group. This could for example be panel
data, i.e. where the same individuals or households are observed repeatedly. But it could
also be independent cross-section observations from the same populations at different
points in time. Longitudinal data on the same observations is thus not always needed for
these methods to be applied fruitfully, which is particularly relevant in settings where
attrition in data collection could be high.
Data on cohorts (or even panels) from before and after a treatment has taken place are
often available as in many projects data collections took place at several points in time.
The obvious reason is that before a project is implemented, one already knows if at some
point in time an evaluation will be required or not. The most natural idea is then to either
try to implement a randomised design (Chapter 1) or at least to collect information on
Y (and potentially also on X ) before the project starts. Therefore, as before, we have
{Y_i}_{i=1}^n for a treatment and a control group, but in addition to the information on final outcomes you have the same information also for the time before treatment.
Example 5.1 Card and Krueger (1994) are interested in the employment effects of a
change in the legal minimum wage in one state, and take a neighbouring state, where
no change in the minimum wage occurred, as a comparison state. The effects of the
minimum wage change are examined over time, and the variation in employment over
time in the comparison state is used to identify the time trend that presumably would
have occurred in the absence of the raise in the minimum wage.
In this chapter we will see how this information over time can be exploited to relax
the assumptions necessary for identification of the treatment effect. In the setting with
two groups observed at two points in time, there are different ways to look at the
difference-in-differences (DiD henceforth) idea introduced here. The crucial insight is
that for the control group we observe the non-treatment outcome Y 0 before and after the
intervention, because the control group is not affected by the intervention. On the other
hand, for the treatment group we observe the potential outcome Y 1 after the interven-
tion, but before the intervention we observe the non-treatment outcome Y 0 also for the
treatment group, because the intervention had not yet started.
Thinking of the regression or matching approach, one might think of the situation where at time t the additionally available information on which we plan to condition are the individuals' past outcomes Y_{i,t−1}, i.e. before treatment started. Let D be the
indicator for affiliation of the individual to either the treatment (Di = 1) or control
group (Di = 0). Being provided with this information, a simple way to predict the
average Y_t^0 for the treated individuals is
Ê[Y_t^0|D = 1] := (1/n_1) Σ_{i:D_i=1} Y_{i,t−1} + (1/n_0) Σ_{i:D_i=0} ( Y_{i,t} − Y_{i,t−1} )   (5.1)
with n_1 = Σ_i D_i = n − n_0. An alternative way to look at it is to imagine that we are interested in the average return increase due to treatment, i.e. E[Y_t^1 − Y_{t−1}^0] − E[Y_t^0 − Y_{t−1}^0], rather than the difference between treated and non-treated. This is actually the same, because E[Y_t^1 − Y_{t−1}^0] − E[Y_t^0 − Y_{t−1}^0] = E[Y_t^1 − Y_t^0], and this also shows where the name difference-in-differences comes from. Obviously you only need to assume (Y_t^d − Y_{t−1}^0) ⊥⊥ D to apply a most simple estimator as in randomised experiments.
Recall that treatment effect estimation is a prediction problem. Having observed
outcomes from the time before treatment started will help us to predict the potential
non-treatment outcome, in particular E[Y 0 |D = 1]. But this does not necessarily pro-
vide additional information for predicting the treatment outcome for the control group.
Therefore we focus throughout on identification of the ATET E[Y_t^1 − Y_t^0|D = 1]. The treatment outcome E[Y_t^1|D = 1] can be directly estimated from the observed outcomes; the focus will be on finding assumptions under which E[Y_t^0|D = 1] is
identified.
We will first discuss non-parametric identification of E[Y 0 |D = 1] but also examine
linear models and the inherent assumptions imposed on E[Y 0 |D = 1]. While linear
models impose stronger assumptions on the functional form, they provide the useful
link to well known results and estimators in panel data analysis. After the different
possibilities to combine the DiD idea with RDD or matching we have to think about the
possibility that the development of Y had already been different for the treatment group
before treatment took place. Related to this problem is that the DiD is scale-dependent:
if the trend of Y 0 is the same for both treatment groups, this is no longer true for the
log of Y 0 . A remedy to this problem is to look at the entire distribution of Y 0 (and Y 1 )
resulting in the so-called changes-in-changes approach we introduce later on.
Suppose a (policy) change at time t affects only parts of the population, e.g. only some specific geographical area of a country. We could examine differences
in the outcomes between the affected and unaffected parts of the population after this
(policy) change, but we might be worried that these differences in outcomes might,
at least partly, also reflect other, say unobserved, differences between these regions.
This may generate a spurious correlation between treatment status and outcomes. If we
have outcome data for the period before or until t, i.e. when the population was not
yet affected by the policy change, we could examine whether differences between these
regions already existed before. If these differences are time-invariant, we could subtract
them from the differences observed after t. This is nothing other than taking differences
in differences.
Assumption 1 Common trend (CT) or bias stability (BS): During the period [t − 1, t] (or t_0 to t_1) the potential non-treatment outcomes Y^0 followed the same linear trend in the treatment group as in the control group. Formally,
Common Trend:   E[Y^0_{t=1} − Y^0_{t=0} | D = 1] = E[Y^0_{t=1} − Y^0_{t=0} | D = 0]   or
Bias Stability:   E[Y^0_{t=0} | D = 1] − E[Y^0_{t=0} | D = 0] = E[Y^0_{t=1} | D = 1] − E[Y^0_{t=1} | D = 0].
Often the common trend is synonymously called the parallel path. The main difference is that parallel path always refers to the development of Y, whereas the notion of common trend is sometimes maintained when people actually refer to parallel growth, i.e. a common trend of the growth (or first difference) of Y.
With the CT or BS assumption we can identify the counterfactual non-treatment outcome as
E[Y^0_{t=1} | D = 1] = E[Y^0_{t=0} | D = 1] + E[Y^0_{t=1} − Y^0_{t=0} | D = 0],
and since the potential outcome Y^0 corresponds to the observed outcome Y if being in the non-treatment state, we obtain
E[Y^0_{t=1} | D = 1] = E[Y_{t=0} | D = 1] + E[Y_{t=1} − Y_{t=0} | D = 0].
We can now estimate the counterfactual outcome by replacing expected values with sample averages,
Ê[Y^0_{t=1} | D = 1] = Ê[Y | D = 1, T = 0] + Ê[Y | D = 0, T = 1] − Ê[Y | D = 0, T = 0],
and
Ê[Y^1_{t=1} | D = 1] = Ê[Y | D = 1, T = 1].
The CT, BS or parallel path assumption can easily be visualised, especially if we are
provided with data that contain observations from several time points before and after
treatment. This is illustrated in two examples in Figure 5.1. In both panels we have
three time points (t = −2, −1, 0) before, and three after (t = 1, 2, 3) treatment. The
thin black line represents the unaffected development of the control group, the thick
black line the one of the treatment group. In both panels they run in parallel and hence
fulfil the CT assumption. The right panel simply illustrates that the trend neither has to be linear nor monotone, e.g. it may follow a seasonal pattern such as an unemployment rate. After treatment many developments are thinkable for the treatment group: a different, e.g. steeper, trend (dashed), a parallel trend but on a different level than before treatment (dotted), or an unaffected one (semi-dashed). We do not observe or know the exact development between t = 0 and t = 1; in the left panel we may speculate about it due to the linearity, but we preferred to suppress it in the right panel.
An alternative, but numerically identical, estimator of the ATET can be obtained via a linear regression model. Such a representation can be helpful to illustrate the link to linear
[Figure 5.1: two panels plotting E[Y|D] against time t = −2, . . . , 3]
Figure 5.1 The CT, BS or parallel path assumption. Until t = 0, before treatment takes place, E[Y|D] = E[Y^0|D] (solid lines) develops in parallel in both D groups, with the thin line for the control group E[Y|D = 0] and the thick lines indicating different scenarios for the treatment group, since after treatment has taken place E[Y|D = 1] may develop in different ways
panel models and also exemplifies how diff-in-diff estimation can be expressed in linear
models. More precisely, we can write the DiD estimator in the regression representation
by including the interaction term of the group and time dummies,
Y = β_0 + β_1 · 1{T = 1} + β_2 · 1{D = 1} + γ · 1{D = 1} · 1{T = 1} + U,
from which we observe the relationship to linear panel data models. However, we actually do not
need individual-level or panel data, but only city averages of Y taken from cohorts.
This is a quite important advantage, since panel data is often plagued by attrition, panel
mortality, etc.
Example 5.2 Duflo (2001) took advantage of a rapid school expansion programme that
occurred in Indonesia in the 1970s to estimate the impact of building schools on school-
ing and subsequent wages. Identification is made possible by the fact that the allocation
rule for the school is known – more schools were built in places with low initial enrol-
ment rates – and by the fact that the cohorts participating in the programme are easily
identified. Children of 12 years or older when the programme started did not partic-
ipate in the programme. The increased growth of education across cohorts in regions
that received more schools suggests that access to schools contributed to increased edu-
cation. The trends were quite parallel before the programme and shifted clearly for
the first cohort that was exposed to the programme, thus reinforcing confidence in the
identification assumption.
Certainly, in practice regression models at the individual level are more frequently
used. Suppose a policy change at t = 1 in the unemployment insurance law. Individuals
becoming unemployed are affected only if they are older than 50 at the time of unemployment
registration. Let Y be some outcome measure, e.g. employment status after one
year, whereas time period t = 0 refers to a period before the change. We could run the
regression
Y_t = β_0 + β_1 · 1{time = 1} + β_2 · 1{age > 50} + γ · 1{age > 50} · 1{time = 1} + U_t,   (5.2)
where the selection age > 50 refers to t = 0 so that it does not have a time index.
Here, γ measures the treatment effect of the policy change, β1 captures (time constant)
differences between the two age groups, and β2 captures time trends (in the absence of
the policy change) that are assumed to be identical for both age groups.
It is easy to see that the OLS estimate of γ in (5.2) can be written as
γ̂ = (ȳ_{50+,t=1} − ȳ_{50+,t=0}) − (ȳ_{50−,t=1} − ȳ_{50−,t=0})   (5.3)
or equivalently as
γ̂ = (ȳ_{50+,t=1} − ȳ_{50−,t=1}) − (ȳ_{50+,t=0} − ȳ_{50−,t=0}),   (5.4)
where ȳ is the group average outcome, 50+ refers to the group older than 50 years, and
50− are those below or equal to 50 years.
What, then, is the difference between these representations? Only the way of thinking: in
representation (5.4) the DiD estimate compares the outcomes in time period 1 and subtracts
the bias from permanent (time-constant) differences between the two groups. In representation
(5.3) the average outcome gain for age group 50+ is estimated and a possible
bias from a general trend is removed. The latter works only under the assumption that the
trend was the same in the 50− group. Both representations give the same number.
Note again that for (5.2) cohorts are all you need for estimation. In fact, not even
individual data is needed since only group averages are required, as is seen from (5.3)
and (5.4). For estimation, the four averages ȳ50+,t=1 , ȳ50−,t=1 , ȳ50+,t=0 and ȳ50−,t=0
would be sufficient.
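The numerical identity of (5.3), (5.4) and the OLS coefficient on the interaction term in (5.2) is easy to verify. Below is a minimal sketch in R with simulated data; all variable names are hypothetical and only serve the illustration.

```r
# Minimal sketch (simulated data, hypothetical names): the 2x2 DiD from the four
# group-time averages coincides with the OLS coefficient on the interaction term.
set.seed(1)
n    <- 4000
old  <- rbinom(n, 1, 0.5)   # 1{age > 50}
post <- rbinom(n, 1, 0.5)   # 1{time = 1}
y    <- 1 + 0.5 * post + 0.8 * old + 0.3 * old * post + rnorm(n)

m         <- tapply(y, list(old, post), mean)                            # four cell averages
did_means <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])   # as in (5.3)/(5.4)
did_ols   <- coef(lm(y ~ post * old))["post:old"]                        # gamma-hat in (5.2)

all.equal(unname(did_ols), unname(did_means))                            # TRUE
```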
An alternative way of writing (5.2) is to represent the potential non-treatment outcome
Y 0 as
Y_i^0 = β_0 + β_1 T_i + β_2 G_i + U_i   with U_i ⊥⊥ (G_i, T_i),
where G_i ∈ {0, 1} is the group indicator, taking the value one for the older (50+) group and
zero for the younger (50−) group, and T_i ∈ {0, 1} is the time indicator.
Treatment status is defined as D = G · T . That is, only the older group is treated and
only in the later time period. In the earlier time period this group is untreated.
Instead of having a group fixed effect β2 G i we could consider a model with an
individual time-invariant fixed effect Ci
Yi0 = β0 + β1 Ti + Ci + Ui with Ui ⊥⊥ (Ci , Ti ),
where the Ci could be correlated with G i . If we are interested in the ATET then we
do not need a model for Y^1, because without any assumptions E[Y^1 − Y^0 | D = 1] =
E[Y^1 | D = 1] − E[Y^0 | D = 1], and the first term is directly identified by the observed
outcomes of the treated.
Example 5.3 Chay, McEwan and Urquiola (2005) consider a policy in Chile where
poorly performing schools were given additional financial resources. The DiD estima-
tor compares average school outcomes between treated and control schools before and
after the intervention. The school outcomes are measured in the same grade before and
after the intervention (i.e. these are therefore different pupils). The treated schools are
selected according to the average performance of their pupils on an achievement test. All
schools with such a test-based ranking that is below a certain threshold receive a subsidy.
Test scores, however, are noisy measures of the true performance; also because different
pupils are tested before and after the intervention. Imagine two schools with identical
true average performance, which is close to the threshold. Suppose testing takes place
in grade 3. One of the schools happens to have a bad test-based ranking in this year (e.g.
due to a cohort of unusually weak students, bad weather, disruptions during the test etc.).
This school thus falls below the threshold and receives the subsidy. The other school’s
test-based ranking is above the threshold and no subsidy is awarded. Suppose the true
effect of the subsidy is zero. In the next year, another cohort enters grade 3 and is tested.
We would expect both schools to have the same test-based ranking (apart from random
variations). The DiD estimate, however, would give us a positive treatment effect esti-
mate because the school with the bad shock in the previous year is in the treated group.
This result is also often referred to as ‘regression to the mean’. The spurious DiD esti-
mate is due to the random noise or measurement error of the test-based ranking. If this is
just within the usual variation of test outcomes, then a correctly estimated standard error
of our ATET estimate should warn us that this effect is not significant. But if this pattern
is stronger, then it is hard to tell from the data whether it was just random variation,
and the common trend assumption is no longer valid. Since this ranking is based on the
average performance of all pupils in grade 3, we expect the variance of this error to be
larger in small classes.
Sometimes we may have several pre-treatment waves of data for treated and con-
trol group, which would permit us to examine the trends for both groups before the
intervention. We discuss this further below.
where D denotes the treatment group and X represents characteristics that are not affected
by the treatment. Therefore one often considers only predetermined X, or covariates that do
(or did) not change from t = 0 to t = 1 so that their time index can be skipped.
This corresponds to assuming that, conditional on X t , the distribution of Yt0 does not
differ between treated and controls. The MDiD approach now essentially replaces this
assumption by
(Y_1^0 − Y_0^0) ⊥⊥ D | X   (5.6)
where X may comprise information about both time points t. Moreover, we need the
common support condition (CSC) in the sense that1
Pr (T D = 1|X = x, (T, D) ∈ {(t, d), (1, 1)}) > 0 ∀x ∈ X , ∀(t, d) ∈ {(0, 0), (1, 0), (0, 1)} .
(5.8)
Hence again, we permit differences in levels but assume that the trends (i.e. the
change over time) are the same among treated and controls, or simply assumption CT
or BS conditional on X. Note that for a visual check we are now looking for a parallel
path of E[Y^0 | X] instead of just Y^0 in the control and treatment
1 Note that we can identify with DiD at most the ATET anyway, so that we do not need the symmetric
assumptions for Y 1 , and just a strictly positive propensity score.
group. This implies that the conditional ATET, say α(X) = E[Y_1^1 − Y_1^0 | X, D = 1], is
identified as
α(X) = E[Y_1 | X, D = 1] − E[Y_1 | X, D = 0] − {E[Y_0 | X, D = 1] − E[Y_0 | X, D = 0]}.   (5.9)
Integrating X out with respect to the distribution of X |D = 1, i.e. among the treated,
will provide the
ATET = E[α(X) | D = 1].   (5.10)
This approach is based on a similar motivation as the pre-programme test: if the
assumption (5.5) is not valid, we would also expect systematic differences in the pre-programme
outcomes between treated and controls (unless we have conditioned on X).
By having pre-programme outcomes, we could, in a sense, test whether the outcomes Y_i^0
are on average identical between treated and controls. If we detect differences, these dif-
ferences may be useful to predict the magnitude of selection bias in the post-programme
outcomes. Estimating this bias and subtracting it leads to the DiD estimator.
If X does not contain all confounding variables, i.e. Assumption (5.5) was not valid,
adjusting for X via matching will not yield a consistent estimate of the ATET because
E[Y_t^1 | D = 1] − ∫ E[Y_t^0 | X_t, D = 0] dF_{X_t|D=1}
≠ E[Y_t^1 | D = 1] − ∫ E[Y_t^0 | X_t, D = 1] dF_{X_t|D=1} = E[Y_t^1 − Y_t^0 | D = 1],
where the difference between the two expressions,
B_{t,t} = ∫ {E[Y_t^0 | X_t, D = 1] − E[Y_t^0 | X_t, D = 0]} dF_{X_t|D=1},
is the systematic bias in the potential outcome Y_t^0 in period t that still remains even
after adjusting for the different distributions of X . The conditional BS assumption says
that pre-programme outcomes permit us to estimate this systematic bias, as for a period τ
before treatment
B_{τ,t} = ∫ {E[Y_τ^0 | X_τ, D = 1] − E[Y_τ^0 | X_τ, D = 0]} dF_{X_t|D=1}   (5.11)
is equal to Bt,t .
Example 5.4 Consider the evaluation of training programmes. If the individuals who
decided to participate have on average a higher ability to increase Y, it is likely that their
labour market outcomes would also have been better even without participation in the
programme. In this case, the average selection bias Bτ,t would be positive. If the poten-
tial outcome in the case of non-participation Yt0 is related over time, it is likely that
these differences between the treatment groups would also persist in other time peri-
ods including periods before the start of the programme. In other words, the more able
persons would also have enjoyed better labour market outcomes in periods prior to
treatment.
It now becomes clear that even the BS assumption is not strictly necessary. It suffices
that B_{t,t} can be estimated consistently from the average selection biases in
pre-programme periods; this is also called the predictable-bias assumption. If several periods
with pre-programme outcomes are observed, the average selection bias can be estimated in
each period: B̂_{τ,t}, B̂_{τ−1,t}, B̂_{τ−2,t}. Any pattern in these estimates may
lead to improved predictions of B_{t,t}. A nice example: their average is expected
to mitigate potential biases due to the regression-to-the-mean problem mentioned in
Example 5.3.
It is also clear now that the classic CIA requires Bt,t = 0 whereas for the MDiD we
require that Bt,t is estimable from pre-programme periods. Note that these assumptions
are not nested. For example, when imposing CIA we often include pre-programme out-
comes Yτ as potential confounders in X . However, when using the DiD approach we
cannot include the lags of the outcome variable Y since we have to be able to calculate
Bτ,t , cf. also Section 5.1.3.
The non-parametric estimation of (5.9) and (5.10) is not really a challenge, unless
X is of dimension larger than 3 and therefore affected by the curse of dimensionality.
We simply replace all conditional expectations in (5.9) by local polynomial estima-
tors for all X i for which Di = 1, and then average over them to obtain an estimator
for the ATET; see (5.10). As discussed in Chapter 3, one could alternatively pre-
estimate the propensity scores Pi for all X i , and condition the expectations in (5.9)
on them instead of conditioning on the vector X ; see propensity score matching.
The justification is exactly the same as in Chapter 3. Note that, as we can sepa-
rate the four conditional expectations and estimate each independently from the other,
we again do not need panel data; repeated cross sections, i.e. cohort data would do
equally well.
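As a simple illustration of this estimation strategy, the following R sketch replaces the four conditional expectations in (5.9) by linear regressions of Y on X (a convenience choice standing in for the local polynomial or propensity score based estimators discussed above) and then averages α(X) over the treated as in (5.10). The data frame dat and its columns y, d, t, x are hypothetical.

```r
# A sketch of (5.9)-(5.10); linear regressions stand in for the non-parametric
# estimators mentioned in the text. 'dat' with columns y, d, t, x is hypothetical.
cond_did_atet <- function(dat) {
  # fit E[Y | X] separately in each of the four (d, t) cells
  fit <- function(dd, tt) lm(y ~ x, data = dat[dat$d == dd & dat$t == tt, ])
  m11 <- fit(1, 1); m10 <- fit(1, 0); m01 <- fit(0, 1); m00 <- fit(0, 0)
  # evaluate all four fits at the covariate values of the treated (D = 1)
  xt  <- dat[dat$d == 1, ]
  a_x <- (predict(m11, newdata = xt) - predict(m01, newdata = xt)) -
         (predict(m10, newdata = xt) - predict(m00, newdata = xt))   # alpha(X) as in (5.9)
  mean(a_x)                                                          # ATET as in (5.10)
}
```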
Recall Assumption 1x and define ψ_1 = {D − Pr(D = 1|X)}{Pr(D = 1|X) Pr(D = 0|X)}^{-1}
for Pr(D = 1|X) > 0 (you may set ψ_1 = 1 otherwise). Then
E[Y_1^1 − Y_1^0 | D = 1] = ∫ E[ψ_1 (Y_1 − Y_0) | x] f(x | D = 1) dx
= E[ψ_1 (Y_1 − Y_0) Pr(D = 1|X) / Pr(D = 1)]
= E[{(Y_1 − Y_0)/Pr(D = 1)} · {D − Pr(D = 1|X)}/Pr(D = 0|X)],
cf. Exercise 6. Once we have predictors for the propensity score, the ATET can
be obtained by weighted averages of outcomes Y before and after treatment. When
using cohorts instead of panels, we need to modify the formulae as follows:
define ψ_2 = ψ_1 · {T − λ}{λ(1 − λ)}^{-1} with λ being the proportion of observations
sampled in the post-treatment period. We then get the conditional ATET
α(X) = E[ψ_2 · Y | X], where the expectation is taken over the distribution of the
entire sample. Finally, and analogously to above, the unconditional ATET is obtained as
α = E[ψ_2 · Y · Pr(D = 1|X) Pr^{-1}(D = 1)].
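For panel data, the weighting representation above suggests a simple plug-in estimator: estimate the propensity score, form the weights and average. The sketch below uses a logit model for Pr(D = 1|X), which is our own illustrative choice and not prescribed by the text; the data frame and its column names are hypothetical.

```r
# Plug-in version of E[psi_1 (Y_1 - Y_0) Pr(D=1|X) / Pr(D=1)] for panel data.
# 'dat' with columns d, y0, y1 (outcomes before/after treatment) and covariates
# x1, x2 is hypothetical; the logit propensity score is an illustrative assumption.
ipw_did_atet <- function(dat) {
  ps   <- fitted(glm(d ~ x1 + x2, family = binomial(), data = dat))  # Pr(D = 1 | X)
  psi1 <- (dat$d - ps) / (ps * (1 - ps))
  mean(psi1 * (dat$y1 - dat$y0) * ps) / mean(dat$d)
}
```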
2 In other words, it corresponds to the interaction term T D of our initial simple panel model.
3 When working with cohorts you must skip C_i but can still include time-invariant confounders.
In several applications we may have several groups and/or several time periods. Suppose
different states of a country are affected by a policy change at different time periods.
We may have panel data on these states for several years. The policy change occurs
at the state level, yet for reasons of estimation precision we may sometimes also want
to add individual characteristics to the regression. We need to acknowledge that entire
groups of individuals are affected simultaneously by the policy change: for example all
individuals aged 50 years or older in a certain state. Hence, the treatment indicator does
not vary at the individual but rather at the group level. In the following, let g index
different groups (e.g. 50+, 50− in regions A and B). The model for the mean outcome
Ȳgt can be written as
Ȳgt = δgt + Dgt β + Vgt
where δgt is a set of group by time period constants and Dgt is one if (already) treated,
and zero otherwise. As we just consider group averages, the model is completely general
so far. Without further restrictions it is not identified, though.
Possible identifying restrictions in this case are to assume
Ȳgt = αg + λt + Dgt β + Vgt
together with uncorrelatedness of Dgt and Vgt . With this restriction, the model is iden-
tified as long as we have more than four observations and at least two time periods and
two groups. One can use panel data analysis with the appropriate asymptotic inference
depending on whether groups or time go to infinity.
While the previous discussion only requires observation of the group level averages
Ȳgt , this changes when including covariates – no matter whether this is done for effi-
ciency reasons or for making the Common Trend assumption more plausible. Clearly, if
the observed characteristics X changed over time, this assumption is less plausible. We
would thus like to take changes in X into account and assume only that the differences
due to unobservables are constant over time. In a linear model one could simply include
the group by time averages X̄ gt in the model
Ȳgt = αg + λt + Dgt β + X̄ gt γ + Vgt .
This is an example of a multilevel model, where the regressors and error terms are
measured at different aggregation levels. Simply calculating standard errors by the con-
ventional formula for i.i.d. errors and thereby ignoring the group structure in the error
term Vgt + Uigt usually leads to bad standard error estimates and wrong t-values. There-
fore one might want to combine treatment effect estimation with methods from small
area statistics. For calculating the standard errors one would like to permit serial cor-
relation and within-group correlation, while assuming that the errors are independent
across groups or modelling the dependency.
How to do inference now? Consider Equation 5.16 and ignore any covariates X and
Z . The error term has the structure Vgt +Uigt . Suppose that Uigt and Vgt are both mean-
zero i.i.d. and neither correlated between groups nor over time. Consider the case with
two groups and two time periods. The DiD estimator of β is then
β̂ = (Ȳ_{11} − Ȳ_{10}) − (Ȳ_{01} − Ȳ_{00}).
With a large number of individuals in each group, the group-time averages Ȳgt will
converge to αg + λt + Dgt β + Vgt by the law of large numbers. The DiD estimator will
thus asymptotically have mean
β + V_{11} − V_{10} − V_{01} + V_{00}
and is therefore inconsistent. Unbiasedness would require V_{11} − V_{10} − V_{01} + V_{00} = 0,
which is assumed by the simple DiD estimator. But we cannot conduct inference since
we cannot estimate σv2 which in turn is not going to zero. If we assumed that there were
only individual errors Uigt and no group errors Vgt (or, in other words, that the group
error Vgt is simply the average of the individual errors) then the estimates would usually
be consistent. If we further assume that Uigt is neither correlated over time nor between
individuals, we obtain that
√n (β̂ − β) →^d N(0, Var),
Var = σ²_{U,11}/Pr(G = 1, T = 1) + σ²_{U,10}/Pr(G = 1, T = 0) + σ²_{U,01}/Pr(G = 0, T = 1) + σ²_{U,00}/Pr(G = 0, T = 0),
where the variances σ²_{U,gt} = Var(U_{igt}) are estimable from the data.
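A direct plug-in of this variance formula could look as follows; this is a sketch with hypothetical names, where y, g and t are the individual outcomes, group indicators and time indicators.

```r
# Standard error of the 2x2 group-level DiD estimator under purely individual,
# uncorrelated errors: plug cell variances and cell shares into the formula above.
did_se <- function(y, g, t) {
  cell <- interaction(g, t)
  s2   <- tapply(y, cell, var)         # sigma^2_{U,gt}
  pr   <- table(cell) / length(y)      # Pr(G = g, T = t)
  sqrt(sum(s2 / pr) / length(y))       # sqrt(Var / n)
}
```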
With multiple groups and time periods we can consider other approaches, e.g. consid-
ering the number of groups G and time periods T to go to infinity, when the sample size
increases (Hansen 2007a and Hansen 2007b). The analysis is then akin to conventional
linear panel data analysis with grouped and individual errors and one could also permit
a richer serial correlation structure. The relevance of this is for example documented
in the Monte Carlo study of Bertrand, Duflo and Mullainathan (2004), who found that
inference based on simple DiD estimation can exhibit severely biased standard errors, for example
when regions are affected by time-persistent shocks (i.e. auto-correlated errors) that
may look like programme effects. This is also discussed in the next sections.
So unless we impose some more assumptions and restrictions, it seems that the
inclusion of several time periods is a bane rather than a boon. That including several time
periods before and after the treatment may be problematic (without further assumptions)
becomes clear already when reconsidering only the idiosyncratic shocks U_{igt} – although
we could equally well find similar arguments when looking at the group shocks V_{gt}. For
d = g ∈ {0, 1} and neglecting potential confounders X (or Z), the change of the group average
outcome over time contains the term E[U_{ig1} − U_{ig0} | g = d], so that the DiD estimator
converges to
β + E[U_{ig1} − U_{ig0} | g = 1] − E[U_{ig1} − U_{ig0} | g = 0].
A low U among the treated in the past period may often be the very cause that triggered
the policy change, i.e. bad shocks may have prompted it. Unless these
bad shocks are extremely persistent, the DiD estimator would overestimate the ATET
because we expect E[U_{ig1} − U_{ig0} | g = 1] > E[U_{ig1} − U_{ig0} | g = 0]. This is the
so-called Ashenfelter's dip, yielding a positive bias of the DiD estimator for the ATET.
The idea is that among the treated those individuals are over-represented that have
Uig0 < 0 and analogously individuals with Uig0 > 0 are over-represented in the
control group. This is not a problem if this is also true for Uig1 (i.e. if shocks are
persistent). However, the regression-to-the-mean effect says that generally all have the
tendency to converge to the (regression) mean so that for individuals with negative resid-
uals we expect a different trend than for individuals with positive residuals. In other
words, the idea of a regression-to-the-mean effect combined with the Ashenfelter dip
contradicts the Common Trend assumption.4 Having said this, two obvious solutions
are conceivable. Either we include and average over several periods before and after the
treatment so that this ‘dip’ is smoothed out, or we have to correct for different trends
in the control group compared to the treatment group. The former, simpler solution can be
4 It is important that only this combination causes a problem: neither the Ashenfelter dip nor the regression
to the mean principle alone can cause a problem.
carried out by considering longer panels, the latter can be handled by the so-called
difference-in-differences-in-differences estimator which we consider next.
5 See Lalive (2008), from which this example is taken, for details.
In order to prove that the population equivalent, i.e. the expected value of (5.17), is
identical to γ , rewrite the above regression equation in order to express the expected
value of ȳ A,50+,t=1 as β0 +β1 +β2 +β3 +β4 +β5 +β6 +γ . With analogous calculations
for the other groups, and plugging these expressions into (5.17), one obtains that the
expected value corresponds indeed to γ .
A similar idea can be used when three time periods, say t = −1, 0, 1 are available of
which two are measured before the policy change. If the assumption of identical time
trends for both groups were valid, the following expression should have mean zero:
(ȳ_{50+,t=0} − ȳ_{50−,t=0}) − (ȳ_{50+,t=−1} − ȳ_{50−,t=−1}).
If not, we could use this expression to measure the difference in the time trend before
the treatment. Hence, the slope of the time trend is permitted to differ between the 50+
and the 50− group (as before treatment). If we assume that the change of the slope, i.e.
the second difference or acceleration, is the same in both groups, then we could predict
the counterfactual average outcome for ȳ50+,t=1 in the absence of a policy change. The
DiDiD estimate is
[(ȳ_{50+,t=1} − ȳ_{50−,t=1}) − (ȳ_{50+,t=0} − ȳ_{50−,t=0})] − [(ȳ_{50+,t=0} − ȳ_{50−,t=0}) − (ȳ_{50+,t=−1} − ȳ_{50−,t=−1})]
= [(ȳ_{50+,t=1} − ȳ_{50+,t=0}) − (ȳ_{50+,t=0} − ȳ_{50+,t=−1})] − [(ȳ_{50−,t=1} − ȳ_{50−,t=0}) − (ȳ_{50−,t=0} − ȳ_{50−,t=−1})],
i.e. the difference between the second differences of the two groups.
Generally, with more than two time periods, we can use second differences to eliminate
not only ‘individual fixed effects’ but also ‘individual time trends’. This concept can
certainly be extended to higher order differences; see Mora and Reggio (2012).
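A small numerical sketch may help to see the two equivalent readings of the DiDiD estimate; the group averages below are made up for illustration.

```r
# Hypothetical group-time averages for the 50+ and 50- groups at t = -1, 0, 1.
ybar <- list(
  `50+` = c(`-1` = 10.0, `0` = 10.6, `1` = 11.9),
  `50-` = c(`-1` =  8.0, `0` =  8.4, `1` =  8.8)
)
# difference-in-differences after and before the policy change ...
did_after  <- (ybar$`50+`["1"] - ybar$`50-`["1"]) - (ybar$`50+`["0"] - ybar$`50-`["0"])
did_before <- (ybar$`50+`["0"] - ybar$`50-`["0"]) - (ybar$`50+`["-1"] - ybar$`50-`["-1"])
didid      <- unname(did_after - did_before)          # 0.9 - 0.2 = 0.7
# ... equals the difference of the second differences of the two groups
d2_treat   <- (11.9 - 10.6) - (10.6 - 10.0)           # 0.7
d2_control <- ( 8.8 -  8.4) - ( 8.4 -  8.0)           # 0.0
d2_treat - d2_control                                 # 0.7 as well
```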
The basic idea in all these situations is that we have only one treated group in one time
period, and several6 non-treated groups in earlier time periods. We thus use all the non-
treated observations to predict the counterfactual outcome for that time period in which
the treated group was affected by the policy change. For predicting the counterfactual
outcome we could also use more elaborate modelling approaches.
The DiDiD goes one step further than (5.7) by permitting differences in levels and
trends, but requires that the acceleration (second difference) is the same for treated and
controls. Sometimes one speaks also of parallel path instead of common trend (CT),
and of parallel growth instead of common acceleration. More specifically, let ΔY_t^0 =
Y_t^0 − Y_τ^0 be the first difference and Δ²Y_t^0 = ΔY_t^0 − ΔY_τ^0 be the second difference. Both
DiDiD extensions can be further developed to the case where we additionally condition
on potential confounders X to make the underlying assumptions more credible. Then
the CIA for the DiD approach requires that
ΔY_t^0 ⊥⊥ D | X,   (5.18)
whereas the corresponding assumption for the DiDiD approach is
Δ²Y_t^0 ⊥⊥ D | X.   (5.19)
The so-called pre-programme tests in the DiD approach test whether there are differ-
ences in levels between treated and controls. The pre-programme test in the DiDiD
approach tests whether there are differences in trends between treated and controls.
If one has several periods after treatment, one could test for parallel paths or parallel
growth after the treatment. However, without having comparable information about the
periods before treatment, the correct interpretation remains unclear. Recall finally that,
as long as we only work with averages or conditional averages, we do not need to be
provided with panel data; cohorts would do as well.
In the previous sections we introduced the DiD idea and studied different scenarios of
alternative assumptions and resulting modifications of our ATET estimator. However, so far we
have not studied the problem that Assumptions (5.18) and (5.19) are not scale-invariant.
In fact, the parallel path or growth assumptions are by nature intrinsically
related to the scale of Y: if, for example, the Y^0 in group D = 1 follow a parallel path
to the Y^0 in group D = 0, then for log Y^0 this can no longer be the case, and vice versa.
This is often presented as a major disadvantage, since the choice of scale can hardly be
justified on the basis of economic theory alone. On the other hand, it could also
be considered an advantage if you have observations from at least two periods before
treatment started, because in this case you just have to find the scale on which the necessary
assumptions apply. After such a prior study, which finds the transformation of Y for
which either the assumptions required for DiD or those required for DiDiD hold, you can
apply the respective method. All you need are data from several pre-treatment periods.
A quite different approach is to get rid of the scale issue by no longer focusing
directly on the mean but on the cumulative distribution function of Y, for the simple reason
that the latter is scale-invariant. It also has the advantage that we can reveal the impact of D on
the entire distribution of Y. This is certainly much more informative than just looking at
the mean, and it remains useful when the treatment effect is quite heterogeneous.
For that reason we also dedicate an entire chapter (Chapter 8) to quantile treatment
effect estimation.
This particular extension of the DiD approach is known as changes-in-changes (CiC).
As stated, it does not just allow for treatment effect heterogeneity but even explores it
by looking at the identification and estimation of distributional effects. The effects of
time and of treatment are permitted to differ systematically across individuals. In order
to simplify we still discuss only the situation with two groups g ∈ {0, 1} and two time
periods t ∈ {0, 1}. Group 1 is subject to the policy change in the second time period.
For this group, the outcome Y^1_{G=1,T=1} is observed, but the counterfactual outcome Y^0_{G=1,T=1}
is not. The focus is on estimation of (a kind of) ATET. As in DiD, to estimate the
counterfactual outcome Y 0 in case of non-treatment we can use the information from
the other three group-by-time combinations. We use the fact that Y 0 is observed for the
Example 5.6 Consider as groups G the cohort of 60-year-old males and females. We
may be willing to assume that the distribution of U is the same for males and females.
But even when conditioning on all kinds of observables, we may still want to allow the
outcome function ϕ_G to differ by gender. As age is fixed at 60, we have different
cohorts over time, and thus the distribution of U should be allowed to change over time,
whereas the ϕG functions should not. One may think here of a medical intervention; the
health production function(s) ϕG for Y 0 (i.e. without treatment) may depend on U and
also on group membership (i.e. gender), but it does not change over time.
Hence, the model applies when either T or G does not enter in the production func-
tion ϕ(U, T, G) and the distribution of U (i.e. the quantiles) remains the same in the
other dimension (i.e. in the one which enters in ϕ). Whichever of these two potential
model assumptions is more appropriate depends on the particular empirical applica-
tion. The estimates can be different. However, since the model does not contain any
overidentifying restrictions, neither of these two models can be tested for validity.
Note that we have placed no restrictions on Yi1 . This implies that we permit arbitrary
treatment effect heterogeneity Yi1 − Yi0 , thereby also permitting (as indicated above)
that individuals were partly selected into treatment on the basis of their individual
gain.
We first sketch an intuitive outline of the identification for the counterfactual distri-
bution. As stated, the basic idea is that in time period T = 0, the production function
ϕ is the same in both groups G. Different outcome distributions of Y in the G = 0
and G = 1 groups can be attributed to different distributions of U in the two groups.
Therefore, while, from time period 0 to 1 the production function changes, the distribu-
tion of U remains the same. This means that someone at quantile q of U will remain at
quantile q in time period 1. The inverse distribution function (i.e. quantile function) will
frequently be used and is defined for a random variable Y as
F_Y^{-1}(q) = inf{y : F_Y(y) ≥ q}.
This implies that F_Y(F_Y^{-1}(q)) ≥ q. This relation holds with equality if Y is continuous
or, when Y is discrete, at discontinuity points of FY−1 (q). Similarly, FY−1 (FY (y)) ≤ y.
This relation holds with equality at all y ∈ Supp(Y ) for continuous or discrete Y but
not necessarily if Y is mixed.
Consider an individual i in the G = 1 group, and suppose we knew the value of Ui .
We use the notation of ‘individual’ only for convenience. In fact, only the quantile in
the U distribution is important. So whenever we refer to an individual, we actually
refer to any individual at a particular quantile of U. One would like to know ϕ(U_i, 1), for
which only the group G = 0 and T = 1 is informative, because the G = 1, T = 1 group
is observed only in the treatment state, and because the G = 0, T = 0 or G = 1, T = 0
group is only informative for ϕ(Ui , 0). We do not observe Ui in the G = 0 group, but
by assuming monotonicity we can relate quantiles of Y to quantiles of U .
We start from an individual of the (G = 1, T = 0) group with a particular value
Ui . We map this individual first into the G = 0, T = 0 group and relate it then to
the G = 0, T = 1 group. Define FU |gt = FU |G=g,T =t and note that FU |gt = FU |g
by Assumption CiC.2. Suppose the value Ui corresponds to the quantile q in the
(G = 1, T = 0) group
FU |10 (Ui ) = q.
We observe the outcomes Y 0 in the non-treatment state for both groups in the 0 period.
In the G = 0, T = 0 group, the value of U_i is associated with a different quantile q′, i.e.
F_{U|00}(U_i) = q′,
or in other words, the individual with U_i is at rank q′ in the G = 0, T = 0 group, such that
q′ = F_{U|00}(F^{-1}_{U|10}(q)).   (5.20)
More precisely, the observation at rank q in the G = 1, T = 0 group has the same value
of U as the observation at rank q′ in the G = 0, T = 0 group.
Because the function ϕ(Ui , t) is strictly increasing in its first element (Assumption
CiC 1), the rank transformation is the same with respect to U or with respect to Y , and
from (5.20) follows
q′ = F_{Y|00}(F^{-1}_{Y|10}(q)).   (5.21)
Now use Assumption CiC 2, which implies that the quantile q′ in the G = 0 group is
the same in T = 0 as in T = 1. Then the outcome for rank q′ in the U distribution in
T = 1 is
F^{-1}_{Y|01}(q′).
Because the function ϕ depends only on U and T but not on G (Assumption CiC 1)
this implies that this is the counterfactual outcome for an individual with Ui of group 1
in time period T = 1. In addition, by Assumption CiC 2 this individual would also be
at rank q in time period 1. More formally, the counterfactual outcome F^{-1}_{Y^0|11}(q) for an
individual with U_i that corresponds to rank q in the G = 1 and T = 0 population is
F^{-1}_{Y^0|11}(q) = F^{-1}_{Y|01}(q′) = F^{-1}_{Y|01}(F_{Y|00}(F^{-1}_{Y|10}(q))).   (5.22)
From the above derivations it is obvious that for every value of U ∈ Supp(U |G = 1)
we need to have also observations with U in the G = 0 group, which is made precise in
Assumption CiC 3.
A formal derivation can be obtained as follows. One first shows that
F_{Y^0|gt}(y) = Pr(ϕ(U, t) ≤ y | G = g, T = t) = Pr(U ≤ ϕ^{-1}(y, t) | G = g, T = t)
= Pr(U ≤ ϕ^{-1}(y, t) | G = g) = F_{U|g}(ϕ^{-1}(y, t)).
This implies F_{Y|00}(y) = F_{U|0}(ϕ^{-1}(y, 0)), and replacing y by ϕ(u, 0) we obtain
F_{Y|00}(ϕ(u, 0)) = F_{U|0}(u), from which follows, provided u ∈ Supp(U|G = 0),
ϕ(u, 0) = F^{-1}_{Y|00}(F_{U|0}(u)).   (5.25)
With similar derivations for G = 0 and T = 1 one obtains
F_{Y|01}(y) = F_{U|0}(ϕ^{-1}(y, 1))  ⟹  F^{-1}_{U|0}(F_{Y|01}(y)) = ϕ^{-1}(y, 1).   (5.26)
Now starting from (5.25), substituting u = ϕ^{-1}(y, 1) and entering (5.26) gives
ϕ(ϕ^{-1}(y, 1), 0) = F^{-1}_{Y|00}(F_{Y|01}(y)).   (5.27)
Further,
F_{Y|10}(y) = F_{U|1}(ϕ^{-1}(y, 0))  ⟹  F_{Y|10}(ϕ(ϕ^{-1}(y, 1), 0)) = F_{U|1}(ϕ^{-1}(y, 1)),   (5.28)
where we substituted y with ϕ(ϕ^{-1}(y, 1), 0). Combining (5.28) with (5.27) gives
F_{Y^0|11}(y) = F_{U|1}(ϕ^{-1}(y, 1)) = F_{Y|10}(ϕ(ϕ^{-1}(y, 1), 0)) = F_{Y|10}(F^{-1}_{Y|00}(F_{Y|01}(y))).
This can be used to identify the ATET. Consider an individual i from the G = 1
population with outcome Yi,t=0 in the first period and Yi,t=1 after the treatment. As
derived in (5.21), the rank of this individual in the G = 0 population is q′ = F_{Y|00}(Y_{t=0}),
so that by (5.22) its counterfactual non-treatment outcome in period 1 is
F^{-1}_{Y|01}(F_{Y|00}(Y_{t=0})),
which is thus the counterfactual outcome for this individual. By conditioning only on
population G = 1 we obtain the ATET (making again use of Assumption CiC 2)
ATET = E[Y | G = 1, T = 1] − E[F^{-1}_{Y|01}(F_{Y|00}(Y)) | G = 1, T = 0].
We use now the shortcut notation ATET = E[Y^1_{11}] − E[Y^0_{11}] and α^{CiC} = E[Y_{11}] −
E[F^{-1}_{Y|01}(F_{Y|00}(Y_{10}))], which are identical if the identification assumptions hold. One
may estimate the distribution functions F and their inverse simply by the use of the
empirical counterparts
F̂_{Y|gt}(y) = (1/n_{gt}) Σ_{i=1}^{n_{gt}} 1{Y_{gt,i} ≤ y},
so that F̂^{-1}_{Y|gt}(0) equals the smallest observation in cell (g, t). With these one can obtain
α̂^{CiC} = (1/n_{11}) Σ_{i=1}^{n_{11}} Y_{11,i} − (1/n_{10}) Σ_{i=1}^{n_{10}} F̂^{-1}_{Y|01}(F̂_{Y|00}(Y_{10,i})).   (5.31)
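A compact R sketch of (5.31) is given below: ecdf() provides the empirical distribution function and quantile(..., type = 1) its (left-continuous) empirical inverse; the four outcome vectors are hypothetical.

```r
# CiC point estimator (5.31): map each Y_{10,i} to its rank in the (0,0) cell
# and read off the corresponding quantile of the (0,1) cell.
cic_atet <- function(y00, y01, y10, y11) {
  F00    <- ecdf(y00)                                               # F-hat_{Y|00}
  Finv01 <- function(q) quantile(y01, probs = q, type = 1, names = FALSE)
  cf     <- Finv01(F00(y10))       # counterfactual Y^0 for the (1,1) cell
  mean(y11) - mean(cf)
}

# illustrative use with simulated data
set.seed(3)
y00 <- rnorm(300); y01 <- rnorm(300, 0.5); y10 <- rnorm(300, 1); y11 <- rnorm(300, 2)
cic_atet(y00, y01, y10, y11)   # should be close to 2 - 1.5 = 0.5 here
```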
α̂^{CiC} − α^{CiC} = O_p(n^{-1/2}) and
√n (α̂^{CiC} − α^{CiC}) → N(0, V_p/p_{00} + V_q/p_{01} + V_r/p_{10} + V_s/p_{11}).
The idea is to linearise the estimator and decompose it into α^{CiC} and some mean-zero terms:
α̂^{CiC} = α^{CiC} + (1/n_{00}) Σ_{i=1}^{n_{00}} p(Y_{00,i}) + (1/n_{01}) Σ_{i=1}^{n_{01}} q(Y_{01,i}) + (1/n_{10}) Σ_{i=1}^{n_{10}} r(Y_{10,i}) + (1/n_{11}) Σ_{i=1}^{n_{11}} s(Y_{11,i}) + o_p(n^{-1/2}).
Note that the variance of the CiC estimator is neither generally larger nor generally smaller
than the variance of the standard DiD estimator; it might even be equal. To
estimate the asymptotic variance of α̂^{CiC} one has to replace expectations with sample
averages, use empirical distribution functions and their inverses, and use any uniformly
consistent non-parametric estimator for the density functions to obtain estimates
of P(y, z), Q(y, z), r(y), s(y), p(y) and q(y). Finally, one has to calculate
V̂_p = (1/n_{00}) Σ_{i=1}^{n_{00}} p̂(Y_{00,i})²,   V̂_q = (1/n_{01}) Σ_{i=1}^{n_{01}} q̂(Y_{01,i})²,
V̂_r = (1/n_{10}) Σ_{i=1}^{n_{10}} r̂(Y_{10,i})²,   V̂_s = (1/n_{11}) Σ_{i=1}^{n_{11}} ŝ(Y_{11,i})²,   (5.32)
and estimate the p_{gt} by (1/n) Σ_{i=1}^{n} 1{G_i = g, T_i = t}. It can be shown that combining
these estimators gives a consistent one for the variance of α̂^{CiC}.
Fortunately, in order to estimate the treatment effect α_q^{CiC} for a given quantile q of the
distribution of Y, see (5.23), we can use almost the same notation and method: replace
in (5.23) all distribution functions by their empirical counterparts and define the resulting
estimator accordingly for min_{y∈Y_{00}} F_{Y|10}(y) < q < q̄ = max_{y∈Y_{00}} F_{Y|10}(y).
Figure 5.3 The distribution function F_{Y|00}(y) of a discrete outcome Y, plotted against y (a step function)
u = Pr (U ≤ u|G = 0) = Pr (U ≤ u|G = 0, T = 0)
≤ Pr (ϕ (U, 0) ≤ ϕ (u, 0) |G = 0, T = 0) . (5.33)
The inequality follows because U ≤ u implies ϕ (U, 0) ≤ ϕ (u, 0) but not vice versa.
Let Q denote the set of all values of q ∈ [0, 1] such that ∃y ∈ Y00 with FY |00 (y) = q.
If u ∈ Q, then the statements U ≤ u and ϕ (U, 0) ≤ ϕ (u, 0) imply each other. We thus
obtain for u ∈ Q
u = Pr (U ≤ u|G = 0) = Pr (U ≤ u|G = 0, T = 0)
= Pr (ϕ (U, 0) ≤ ϕ (u, 0) |G = 0, T = 0)
= Pr (Y ≤ ϕ (u, 0) |G = 0, T = 0) = FY |00 (ϕ (u, 0)). (5.34)
252 Difference-in-Differences Estimation: Selection on Observables and Unobservables
Hence, for u ∈ Q we have ϕ (u, 0) = FY−1 |00 (u). However, all values of U in
(FY |00 (λl−1 ), FY |00 (λl )] will be mapped onto Y = y. Define a second inverse function
F^{-1}_{Y|00}(q) = inf{y ∈ Y_{00} : F_{Y|00}(y) ≥ q},   and
F^{(-1)}_{Y|00}(q) = sup{y ∈ Y_{00} ∪ {−∞} : F_{Y|00}(y) ≤ q},
where Y_{00} = Supp(Y|G = 0, T = 0). These two inverse functions also permit us to
describe the interval of values of U that are mapped onto the same value of Y. Consider
a value q such that F^{-1}_{Y|00}(q) = y. Then all values of U_i = u with
F_{Y|00}(F^{(-1)}_{Y|00}(q)) < u ≤ F_{Y|00}(F^{-1}_{Y|00}(q))
will be mapped on Y_i = y.
Regarding the two inverse functions, we note that for values of q such that ∃y ∈ Y_{00}
with F_{Y|00}(y) = q it follows that F^{(-1)}_{Y|00}(q) = F^{-1}_{Y|00}(q). Let Q denote the set of all
values of q ∈ [0, 1] that satisfy this relationship. These are the jump points in Figure 5.3.
For all other values of q ∉ Q we have that F^{(-1)}_{Y|00}(q) < F^{-1}_{Y|00}(q). For all values of
q it therefore follows that
F_{Y|00}(F^{(-1)}_{Y|00}(q)) ≤ q ≤ F_{Y|00}(F^{-1}_{Y|00}(q)),   (5.35)
and for q ∈ Q even F_{Y|00}(F^{(-1)}_{Y|00}(q)) = q = F_{Y|00}(F^{-1}_{Y|00}(q)). Likewise, we can show
that F_{U|G=1}(u) is identified only for u ∈ Q. We derived in (5.34) above that for those,
F_{Y|00}(ϕ(u, 0)) = u and ϕ(u, 0) = F^{-1}_{Y|00}(u). Now consider F_{U|G=1}(u) for a given
value of u ∈ Q:
F_{U|G=1}(u) = Pr(U ≤ u | G = 1) = Pr(U ≤ u | G = 1, T = 0)
= Pr(ϕ(U, 0) ≤ ϕ(u, 0) | G = 1, T = 0) = F_{Y|10}(ϕ(u, 0)) = F_{Y|10}(F^{-1}_{Y|00}(u)).
Consequently, F_{U|G=1}(u) is point identified only for u ∈ Q. For all other values,
FU |G=1 (u) can only be bounded, similarly to (5.33), as is shown further below.
To illustrate the identification area of FU |G=1 (u), let us consider an example where
Y ∈ {1, 2, 3, 4}, and imagine we had observed the frequencies
FY |00 FY |10 FY |01
y =1 0.1 0.3 0.2
y =2 0.4 0.5 0.6 (5.36)
y =3 0.7 0.9 0.8
y =4 1 1 1.
Figure 5.4 shows the distribution function FU |G=1 (u) as a function of u. Note also that
FU |G=0 (u) = u because u has been normalised to be uniform in the G = 0 group. The
graph on the left indicates the values of FU |G=1 (u) where it is identified from FY |00 and
FY |10 . Since distribution functions are right-continuous and non-decreasing, the shaded
areas in the graph on the right show the lower and upper bounds for FU |G=1 (u), i.e.
function FU |G=1 must lie in the shaded areas.
Figure 5.4 The distribution function F_{U|G=1}(u) plotted against u: identified values (left panel) and lower and upper bounds (right panel)
Having (partly) identified the function FU |G=1 , we can proceed with identifying the
distribution of the counterfactual outcome FY 0 |11 . Note first that
This implies
FY 0 |11 (y) = FU |G=1 (FY |01 (y)). (5.38)
Hence, we can derive FY 0 |11 (y) from the distribution of FU |G=1 . For the numerical
example (5.36) given above we obtain
FY 0 |11 (1) = FU |G=1 (FY |01 (1)) = FU |G=1 (0.2) ∈ [0.3; 0.5]
FY 0 |11 (2) = FU |G=1 (FY |01 (2)) = FU |G=1 (0.6) ∈ [0.5; 0.9]
FY 0 |11 (3) = FU |G=1 (FY |01 (3)) = FU |G=1 (0.8) ∈ [0.9; 1]
FY 0 |11 (4) = FU |G=1 (FY |01 (4)) = FU |G=1 (1) = 1.
We thus obtain the lower and upper bound (Lb and Ub) distributions7 for y ∈ Y_{01}
F^{Lb}_{Y^0|11}(y) := F_{Y|10}(F^{(-1)}_{Y|00}(F_{Y|01}(y))) ≤ F_{Y^0|11}(y) ≤ F_{Y|10}(F^{-1}_{Y|00}(F_{Y|01}(y))) =: F^{Ub}_{Y^0|11}(y)   (5.39)
which bound the distribution of the counterfactual outcome of interest FY 0 |11 (y).
Both the upper and the lower bound c.d.f. can be estimated by replacing in Equation
5.39 the different distribution functions by their empirical counterparts, and applying
numerical inversion. The upper and lower bound of the ATET can be estimated by
α̂_{Ub} = (1/n_{11}) Σ_{i=1}^{n_{11}} Y_{11,i} − (1/n_{10}) Σ_{i=1}^{n_{10}} F̂^{-1}_{Y|01}(F̲̂_{Y|00}(Y_{10,i}))   (5.40)
α̂_{Lb} = (1/n_{11}) Σ_{i=1}^{n_{11}} Y_{11,i} − (1/n_{10}) Σ_{i=1}^{n_{10}} F̂^{-1}_{Y|01}(F̂_{Y|00}(Y_{10,i})),   (5.41)
where F̲_{Y|00}(y) = Pr(Y_{00} < y), which can be estimated by (1/n_{00}) Σ_{i=1}^{n_{00}} 1{Y_{00,i} < y},
whereas F̂_{Y|00}(y) is estimated as always, i.e. by (1/n_{00}) Σ_{i=1}^{n_{00}} 1{Y_{00,i} ≤ y}.
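The bound estimators can be sketched in R along the same lines; below they are applied to samples generated to match the frequencies in (5.36). The function names and the sample construction are hypothetical.

```r
# Bound estimators (5.40)-(5.41): the strict-inequality cdf Pr(Y00 < y) is used
# for the upper bound, the usual cdf Pr(Y00 <= y) for the lower bound.
F_leq <- function(y0, y) mean(y0 <= y)
F_lt  <- function(y0, y) mean(y0 <  y)
Finv  <- function(y1, q) quantile(y1, probs = q, type = 1, names = FALSE)

cic_bounds_atet <- function(y00, y01, y10, y11) {
  cf_lb <- sapply(y10, function(y) Finv(y01, F_leq(y00, y)))
  cf_ub <- sapply(y10, function(y) Finv(y01, F_lt(y00, y)))
  c(lower = mean(y11) - mean(cf_lb), upper = mean(y11) - mean(cf_ub))
}

# samples whose empirical cdfs reproduce the example (5.36); y11 is made up
y00 <- rep(1:4, times = c(10, 30, 30, 30))
y10 <- rep(1:4, times = c(30, 20, 40, 10))
y01 <- rep(1:4, times = c(20, 40, 20, 20))
y11 <- rep(1:4, times = c(10, 30, 30, 30))
cic_bounds_atet(y00, y01, y10, y11)
```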
7 Cf. Theorem 4.1 of Athey and Imbens (2006), who also show that these bounds are tight, i.e. that no
narrower bounds exist.
(Two graphs of F_{U|G=1}(u) against u: the bounds derived above on the left, and the function under Assumption CiC 4.1 on the right.)
THEOREM 5.3 With the same assumptions and p_{gt}, V_s as for the continuous case
(Theorem 5.1) we obtain for the estimators defined in (5.40) and (5.41) that
√n (α̂_{Ub} − α_{Ub}) → N(0, V_s/p_{11} + V̲/p_{10}),
√n (α̂_{Lb} − α_{Lb}) → N(0, V_s/p_{11} + V/p_{10}),
with V̲ = Var(F^{-1}_{Y|01}(F̲_{Y|00}(Y_{10}))) and V = Var(F^{-1}_{Y|01}(F_{Y|00}(Y_{10}))).
But with Assumption CiC 4.1 we can reach point identification for FU |G=1 (u) for all
values of u as U is then also uniformly distributed in the G = 1, T = 0, Y = y
population. Hence, the distribution function FU |G=1 (u) has to be a diagonal between
the bounds on FU |G=1 (u) derived above. These bounds are replicated in the left graph
below, while the graph on the right shows FU |G=1 (u) with Assumption CiC 4.1. For a
formal proof you need some more assumptions. Let us discuss here only the proof for
binary Y .8 The assumptions for this binary Y case are, in addition to U ⊥⊥ T |G,
Assumption CiC 4.2 The random variable YG=0,T =0 is discrete with possible out-
comes Y00 = {0, 1}.
Assumption CiC 4.3 The function ϕ(u, t) is non-decreasing in u.
Assumption CiC 4.4 The variables U |G = 1 and U |G = 0 are continuously
distributed.
Still assume that U|G = 0 is normalised to be uniform. Define ũ(t) =
sup{u ∈ [0, 1] : ϕ(u, t) = 0} as the largest value of u such that ϕ(u, t) is still zero.
This implies that E[Y^0 | G = g, T = t] = Pr(U > ũ(t) | G = g, T = t). Now consider
Pr (U ≤ u|U ≤ ũ(t), G = 1) = Pr (U ≤ u|U ≤ ũ(t), G = 1, T = t)
because of U ⊥⊥ T |G. By the definition of ũ(t), conditioning on U ≤ ũ implies Y = 0,
such that
= Pr (U ≤ u|U ≤ ũ(t), G = 1, T = t, Y = 0)
= Pr (U ≤ u|U ≤ ũ(t), G = 0, T = t, Y = 0)
because of Assumption (A4.4). Now again using the definition of ũ(t) we obtain
= Pr(U ≤ u | U ≤ ũ(t), G = 0, T = t) = Pr(U ≤ u | U ≤ ũ(t), G = 0)
= min(u/ũ(t), 1)   (5.42)
because of U ⊥⊥ T |G; the last equality follows because U |G = 0 is uniform.
Analogously one can show that
Pr(U > u | U > ũ(t), G = 1) = min((1 − u)/(1 − ũ(t)), 1).
Recall the following equalities:
E [Y |G = 1, T = 0] = Pr (U > ũ(0)|G = 1)
E [Y |G = 0, T = t] = Pr (U > ũ(t)|G = 0, T = t) = Pr (U > ũ(t)|G = 0) = 1 − ũ(t).
With them we get
E[Y^0 | G = 1, T = 1] = Pr(U > ũ(1) | G = 1, T = 1) = Pr(U > ũ(1) | G = 1)
= Pr(U > ũ(1) | U > ũ(0), G = 1) Pr(U > ũ(0) | G = 1)
+ Pr(U > ũ(1) | U ≤ ũ(0), G = 1) Pr(U ≤ ũ(0) | G = 1).   (5.43)
8 The general case can be found for example in Athey and Imbens (2006).
Consider first the matching approach. If Yt=0 is the only confounding variable, then by
applying the logic of selection-on-observables identification we can write
F_{Y^0_{t=1}|G=1}(y) = Pr(Y^0_{t=1} ≤ y | D = 1) = E[1{Y^0_{t=1} ≤ y} | G = 1]
= E[ E[1{Y^0_{t=1} ≤ y} | Y_{t=0}, G = 1] | G = 1]
= E[ E[1{Y^0_{t=1} ≤ y} | Y_{t=0}, G = 0] | G = 1]
= E[ F_{Y_{t=1}|Y_{t=0},G=0}(y | Y_{t=0}) | G = 1].   (5.45)
This result is different from the above CiC method. With the selection on observables
approach it is assumed that conditional on Yt=0 the unobservables are identically dis-
tributed in both groups (in the second period). The above introduced CiC method did
not assume that the unobservables are identically distributed between groups (condi-
tional on Yt=0 ), but rather required that the unobservables were identically distributed
over time, cf. with (5.44). Hence, as we already showed for the DiD model, the CiC
method is not nested with the selection-on-observables approach.
However, selection on observables and CiC are identical when U_{i,t=0} = U_{i,t=1}. To see this,
note first that the conditional distribution
F_{Y_{t=1}|Y_{t=0},G=0}(y|v)
is degenerate if U_{i,t=0} = U_{i,t=1}. Assuming this implies perfect rank correlation: for i with
U_{i,t=0} such that Y_{i,t=0} = v we have Y^0_{i,t=1} = F^{-1}_{Y|01}(F_{Y|00}(v)), which is the mapping of
ranks. This implies
F_{Y_{t=1}|Y_{t=0},G=0}(y|v) = 0  if y < F^{-1}_{Y|01}(F_{Y|00}(v)),   (5.46)
F_{Y_{t=1}|Y_{t=0},G=0}(y|v) = 1  if y ≥ F^{-1}_{Y|01}(F_{Y|00}(v)),
which is identical to (5.24). Hence, Assumptions CiC 1 to CiC 3 are valid and also
Ui,t=1 = Ui,t=0 . Therefore CiC and matching (selection-on-observables) deliver the
same results.
To enhance our understanding of the relationship between the CiC and the selection-
on-observables approach, note that the latter only requires
Y^0_{t=1} ⊥⊥ D | Y_{t=0},
5.3 The Changes-in-Changes Concept 259
(or at least mean independence if interest is in average effects). If Y^0_{it} = ϕ(U_{it}, t) and ϕ
is strictly monotonic in its first element, this is identical to
Ui,t=1 ⊥⊥ G i |Ui,t=0 . (5.47)
The selection-on-observables approach thus requires that all information that affects
Ut=1 and the treatment decision is incorporated in Ut=0 . This assumption (5.47) is, for
example, not satisfied in a fixed-effects specification U_{it} = v_i + ε_{it} where v_i is related
to the treatment decision and ε_{it} is some independent noise. For (5.47) to be satisfied
would require that Ui,t=0 contains all information about vi because it is the confounding
element. However, in the fixed-effect model our Ui,t=0 reveals vi only partly since the
noise εi,t=0 is also contained.
Example 5.7 Consider the simple example where G_i = 1{v_i − η_i > 0} with η_i
some noise. For (5.47) we need for identification
E[U_{i,t=1} | G_i = 1, U_{i,t=0}] − E[U_{i,t=1} | G_i = 0, U_{i,t=0}] = 0,
which is not true here since
E[U_{i,t=1} | G_i = 1, U_{i,t=0} = a] = E[v_i + ε_{i,t=1} | v_i > η_i, v_i + ε_{i,t=0} = a] = E[v_i | v_i > η_i, ε_{i,t=0} = a − v_i],
which is larger than E[v_i | v_i ≤ η_i, ε_{i,t=0} = a − v_i]. This is similar to situations with
measurement errors in the confounder or the treatment variable.
The CiC model requires, in addition to the monotonicity and the support assumption,
that
U ⊥⊥ T |G.
This does not permit the distribution of U to change over time. It does not permit,
e.g., an increase in the variance of U, which would not be a concern in the selection-on-
observables approach. In the CiC method, an increase in the variance of U or any other
change in the distribution of U , is not permitted because we attribute any change in the
observed outcomes Y (over time) to a change in the function from ϕ(u, 0) to ϕ(u, 1). If
the distribution of U changed between the time periods, we could not disentangle how
much of the changes in Y is due to changes in U and how much due to changes in the
function ϕ.
Another difference between the selection-on-observables approach and the CiC
method is the assumption that ϕ(u, t) is monotonic in u, an assumption which is
not required for matching. Hence, the CiC approach requires that the unobservables
in the outcome equation are one-dimensional, i.e. all individuals can be ranked on
a one-dimensional scale with respect to their outcomes, irrespective of the value of
the treatment. In the selection-on-observables approach, on the other hand, unobserv-
ables are permitted to be multi-dimensional. This emphasises once again that both
approaches rest on different assumptions which cannot be nested. Only in the case where
the joint distribution of U_{i,t=1} and U_{i,t=0} is degenerate, which trivially implies (5.47),
does the selection-on-observables approach rest on weaker assumptions. One example is
Ui,t=0 = Ui,t=1 .
Finally, note that the CiC method can also be used to analyse the effects of changes
in the distribution of U over time if these occur only in one of the two groups. This can
be applied e.g. to the analysis of wage discrimination.
Example 5.8 Suppose we are interested in the wage differential between Black and
White workers, after having purged the effects due to differences in some pre-specified
observables X. Let U be an unobserved skill and ϕ(U, T) the equilibrium wage function,
which may change over time, but is assumed to be identical for the two groups,
G = 1 for Black and G = 0 for White workers. Suppose that the distribution of U did not
change over time for White workers but that it did change for Black workers. The treatment
effect of interest here is not the effect of a particular intervention, but rather the
impact on the wage distribution due to the change in the unobservables for Black workers.
We observe the wage distribution for Black workers after the change in the distribution of
U had taken place. The counterfactual is the wage distribution that would have been
observed if the distribution of U had remained constant over time for Black workers. Under
the maintained assumption that the distribution of U for White workers was constant over
time, this situation fits exactly the CiC model assumptions for Y^0 with U ⊥⊥ T | G.
The difference between the observed wage distribution for Black workers and their counterfactual
is thus attributed to the change in the distribution of U over time for Black workers
(under the maintained assumption that the distribution of U did not change for White
workers).
It is not hard to imagine that there are many situations in which the assumptions necessary
for the CiC method are to some extent credible. As always, we never know whether
all model assumptions hold perfectly. In fact, as models are always a simplification,
often they will not be 100% true; what we hope for is that the simplification
is not too strong, i.e. that potential deviations from the assumptions made are not too
strong and are to a good part accounted for by the (estimated) standard errors.
treatment group and zero otherwise. Then generate the post-period treatment dummy,
gen pt = post*treatment, and run the regression of interest with the three dummies
(the post and treatment dummies plus their interaction). The coefficient on pt then represents
the ATET estimator. To test the difference between the two groups, use the t-statistic on
that coefficient. A popular correction for potential heteroscedasticity is to cluster standard
errors at the group level by adding the cl("group var") option to the regression
command.
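A rough R counterpart of this Stata recipe, using the sandwich and lmtest packages for group-clustered standard errors (one common choice among several), might look as follows; the simulated data are purely illustrative.

```r
# 2x2 DiD via OLS with an interaction term and cluster-robust standard errors.
library(sandwich)
library(lmtest)

set.seed(2)
dat <- data.frame(group = rep(1:40, each = 25))          # 40 clusters of 25 units
dat$treat <- as.integer(dat$group > 20)                  # treatment-group dummy
dat$post  <- rbinom(nrow(dat), 1, 0.5)                   # post-period dummy
dat$y     <- 1 + 0.5 * dat$post + 0.8 * dat$treat + 0.3 * dat$post * dat$treat +
             rnorm(40)[dat$group] + rnorm(nrow(dat))     # group shock + individual noise

fit <- lm(y ~ post * treat, data = dat)
coeftest(fit, vcov = vcovCL(fit, cluster = ~ group))     # inference on post:treat
```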
For combining fixed effects model estimation with weighting in R, see Imai and Kim
(2015). They show how weighted linear fixed effects estimators can be used to estimate
the average treatment effects (for treated) using different identification strategies. These
strategies include stratified randomised experiments, matching and stratification for
observational studies, difference-in-differences, and a method they call first differenc-
ing. Their R package wfe provides a computationally efficient way of fitting weighted
linear fixed effects estimators for causal inference with various weighting schemes. The
package also provides various robust standard errors and a specification test for standard
linear fixed effects estimators.
In Stata you will further find the user-written ado commands diff and diffbs,
presently available at econpapers.repec.org/software/bocbocode/s457083.htm.
According to the description of its 2015 release, it performs several diff-in-diff
estimations of the treatment effect on a given outcome variable from a pooled baseline
and follow-up dataset(s): single Diff-in-Diff, Diff-in-Diff controlling for covariates,
kernel-based propensity score matching diff-in-diff, and quantile Diff-in-Diff; see
Chapter 7. It is also suitable for estimating repeated cross-section Diff-in-Diff, except
for the kernel option. Note that this command ignores the grouping variable and does
not take the pairing of the observations into account, as is usual when you xtset your
data before using xtreg.
5.5 Exercises
1. In a simple DiD without confounders, how can you test the validity of the assumptions
made before and after treatment when provided with additional panel waves
or cohorts? How can you make these assumptions hold?
2. Now think of conditional DiD with confounders and the accordingly modified par-
allel path assumption. Again imagine you are provided with data from at least two
waves before and after the treatment has taken place. How can you (a) test these
assumptions, (b) select an appropriate set of confounders, and (c) if necessary, find
the right scale for Y ?
3. Think about the difference between DiD with panels vs with cohorts. What is the
advantage of having panels compared to cohorts?
4. No matter whether you do DiD with panels or with cohorts, when including covariates
(e.g. necessary when they are confounders) – why is it now no longer sufficient
to have the cohort or panel aggregates for each group?
5. Show that parallel path fails in the log-linear model (looking at log(Y )) when it holds
for the linear model. Discuss, how to choose the scale if you do not have data from
several time points before treatment was implemented.
6. Recall DiD-matching/propensity weighting: prove the steps of (5.13) using Bayes’
rule.
7. Have you thought about DiD with instruments (IVs)? Recall that it is very hard to
find IVs that indeed fulfil the necessary conditions and at the same time improve
the finite-sample mean squared error of your treatment effect estimate (compared
to matching or propensity score weighting). Furthermore, as you only identify the
LATE, only a reasonable structural model provides a useful estimate. Having the
time dimension in the DiD already, you may work with lagged variables as IV. What
are you identifying and estimating then? Which estimators do you know already from
panel data analytics – even if maybe just in the linear model context?
8. Above we have discussed how to check the parallel path assumption and how to adapt
if it is not fulfilled (change scale, condition on an appropriate set of covariates, etc.).
You may, however, end up with a data transformation or set of confounders that are
hard to justify with economic theory or that even contradict it. An alternative is to
change to the parallel growth model. Write down the new model and answer for it
Exercises 5.1 to 5.5.
9. You may end up with the question as to whether you should use parallel path, par-
allel growth, CiC, etc. Discuss the various possibilities of checking or testing which
assumptions are most likely to hold.
6 Regression Discontinuity Design
Example 6.1 Consider a summer school remediation programme for poorly perform-
ing school children. Participation in this mandatory remediation programme is based
on a grade in Mathematics. Students with low scores on the math test are obliged to
attend the summer school programme during the holidays. On the other hand, students
with high scores are not eligible for the programme. We want to learn whether this
remedial education programme during the summer break actually helped the children,
e.g. in performing better in school in the following years. Treatment D is defined as
participation in the programme. All students with math test score Z below the thresh-
old z 0 are assigned to treatment, whereas those with Z above the threshold z 0 are not.
Clearly, Z cannot be a valid instrumental variable since the test score Z is most likely
related to (unobserved) ability and skills, which will also affect school performance in
the future. Yet, perhaps we can use it if we restricted ourselves only to students in the
neighbourhood of z 0 .
We will see that such rules sometimes generate a local instrumental variable, i.e. an
instrumental variable that is valid only at a particular threshold (not for the entire popu-
lation). We will exploit this local behaviour at the margin of z 0 . But one should always
keep in mind that the identification is obtained only for the individuals at (or close to)
threshold value z 0 , which often may not be the primary population of interest. Some-
times it may be, e.g. when the policy of interest is a marginal change of the threshold
z 0 . In sum, the identification around this threshold may provide internal validity, but not
external.
Example 6.2 Leuven, Lindahl, Oosterbeek and Webbink (2007) examined a programme
in the Netherlands, where schools with at least 70% disadvantaged minority pupils
received extra funding. Schools slightly above this threshold would qualify for extra
funding whereas schools slightly below the threshold would not be eligible. While
comparing schools with 0% disadvantaged pupils to schools with 100% disadvantaged
pupils is unlikely to deliver the true treatment effect since these schools are likely to also
differ in many other unobserved characteristics, comparing only schools slightly below
70% to those slightly above 70% could be a valid approach since both groups of schools
are very similar in their student composition even though only one group qualifies for
the extra funding.
Note that one could say that the expectation of D, i.e. the probability of getting
treated, depends in a discontinuous way on the test score Z, while there is no reason
to assume that the conditional expectations E[Y^d | Z = z], d = 0, 1, should be
discontinuous at z_0.
Example 6.3 Lalive (2008) studied the effects of maximum duration of unemployment
benefits in Austria. In clearly defined regions of Austria the maximum duration of
receiving unemployment benefits was substantially extended for job seekers aged 50
or older at entry into unemployment. Basically, two control group comparisons can be
examined: those slightly younger than 50 to those being 50 and slightly above, and those
living in the treatment region but close to a border to a non-treatment region to those on
the other side of the border.
The age-based strategy would compare the 50-year-old to 49-year-old individuals.
This way we would compare groups of workers who are very similar in age (and in
other characteristics like health and working experience), but where only one group gets
the benefit of extension. To increase sample size, in practice we would compare job
seekers e.g. in the age bracket 45 to 49 to those of age 50 to 54. Similar arguments apply
to the strategy based on comparing people from different administrative regions but
living very close to each other and therefore sharing the same labour market.
Whether these strategies indeed deliver a consistent estimate of the treatment effect
depends on further conditions that are discussed below.
As seen in the last example, such a threshold could also be given by a geographical
or administrative border; so whether you get treatment or not depends on which side
of the border you reside. Then these geographical borders can also lead to regression
discontinuity. For example, two villages can be very close to an administrative border
but located on different sides of the border. If commuting times are short between these
two villages, they might share many common features. But administrative regulations
can differ a lot between these villages due to their belonging to different provinces.
Such kinds of geographic or administrative borders provide opportunities for evaluation
of interventions. Think about individuals living close to each other but on different sides of an administrative border: they may share the same labour market, but if they become unemployed they have to attend different employment offices with potentially rather different types of support or training programmes.
Example 6.4 Frölich and Lechner (2010) analyse the impact of participation in an active
labour market training programme on subsequent employment chances. They use the
so-called ‘minimum quota’ as an instrument for being assigned to a labour market pro-
gramme. When active labour market programmes were introduced on a large scale in
Switzerland, the central government wanted to ensure that all regions (so-called ‘can-
tons’) would get introduced to these new programmes at the same time. The fear was
that otherwise (at least some of) the cantons might have been reluctant to introduce
these new programmes and prefer a wait-and-see strategy (as they enjoyed a very high
degree of autonomy in the implementation of the policy). To avoid such behaviour, the
central government demanded that each canton had to provide a minimum number of
programme places (minimum quota). Since the calculation of these quotas was based partly on population shares and partly on unemployment shares, it introduced a differential in the likelihood of being assigned to treatment between neighbouring cantons. This means that people living close to a cantonal border but on different sides of it faced
essentially the same labour market environment, but their chances of being assigned to
treatment in case of becoming unemployed depended on their side of the border.
Thinking about Example 6.4 you will probably agree that there is no particular reason
why the potential employment chances should be discontinuous at the frontier of a can-
ton, but the chance to be involved in an active labour market training programme might
be discontinuous, and this happens because of the different quotas. In evaluating the impacts of policies, it has become a frequently used strategy to exploit interventions where certain rules, especially bureaucratic ones (less often natural1 ), cause the likelihood of D = 1 to change discontinuously.
1 ‘Natural’ borders like mountain chains or language borders may cause a discontinuity in E[Y^d|Z] at Z = z_0 for at least one d, and might therefore not be helpful.
Example 6.5 Black (1999) used this idea to study the impact of school quality on the
prices of houses. In many countries, admission to primary school is usually based on
the residency principle. Someone living in a particular school district is automatically
assigned to a particular school. If the quality of school varies from school to school,
parents have to relocate to the school district where they want their child to attend the
school. Houses in areas with better schools would thus have a higher demand and thus
be more expensive. If the school district border runs, for example, through the middle
of a street, houses on the left-hand side of the street might be more expensive than those on the right-hand side because they belong to a different school district.
The original idea was that around such a threshold you observe something like a random experiment. Some units, firms or individuals happen to lie on the side of the
threshold at which a treatment is administered, whereas others lie on the other side
of the threshold. Units close to the threshold but on different sides can be compared to
estimate the average treatment effect. Often the units to the left of the threshold differ
in their observed characteristics from those to the right of the threshold. Then, as in the
CIA case, accounting for these observed differences can be important to identify the
treatment effect.
So we have two ways to relate RDD to preceding chapters and methods: either one
argues that the threshold acts like a random assignment mechanism, i.e. you are ‘by
chance’ right above or right below z 0 ; or we can argue that such rules generate a
local instrumental variable, i.e. an instrument that is valid only at or around a par-
ticular threshold z 0 . In the former case we consider the observations around z 0 like data
obtained from a randomised experiment, but in both cases it is obvious that our argument loses its validity as we move away from z_0: in Example 6.1 it is not pure chance that places a student far above or far below the threshold, as we can always argue that ability has played an important role. Similarly, in Example 6.4 people living away from the frontier inside one or the other province most likely face different labour markets.2
For the ease of presentation we first consider the case without further covariates. How
can we employ RDD to identify and estimate treatment effects?
6.1 Regression Discontinuity Design without Covariates

Reconsider Example 6.2, where the assignment rule is that schools with Z ≥ z_0 receive some additional funding but schools with Z < z_0 receive nothing. We are interested in the effect of this extra funding D on some student outcomes Y. As stated, the basic idea of RDD is to compare the outcomes of schools with Z just below z_0 to those with Z just above z_0. Note that we cannot use Z as an instrumental variable, as we suspect that Z has a direct impact on school average outcomes Y (the fraction of immigrant children Z is expected to have a negative direct effect on Y). But when we compare only schools very close to this threshold, this direct effect of Z should not really matter.
Generally, RDD can be used when a continuous variable3 Z , which we will call
assignment score, influences an outcome variable Y and also the treatment indicator D,
which itself affects the outcome variable Y . Hence, Z has a direct impact on Y as well as
an indirect impact on Y via D. This latter impact, however, represents the causal effect
of D on Y . This can only be identified if the direct and the indirect (via D) impact of Z
on Y can be told apart. Think about the cases where the direct impact of Z on Y is known
to be smooth but the relationship between Z and D is discontinuous. Then any disconti-
nuity (i.e. a jump) in the observed relationship between Z and Y at locations where the
relation of Z to D is discontinuous, can be attributed to the indirect impact of D.
The graphs in Figures 6.1 and 6.2 give an illustration of this idea. While the two
functions E[Y 0 |Z ] and E[Y 1 |Z ] are continuous in Z , the function E[D|Z ] jumps at
a particular value. For values of Z smaller than z 0 the E[D|Z = z] is very small, for
3 Mathematically it has to be continuous around z_0 in a strict sense. In practice it is sufficient that the distance to z_0 is measured on a reasonable scale such that the ideas and arguments presented here still apply, and the assumptions presented later on make at least intuitive sense. As an exercise you might discuss why ‘years of age’ for adults often might work whereas ‘number of children’ with z_0 ≤ 2 often would not.
Figure 6.2 The observed outcomes and the treatment effect at the threshold
values of Z larger than z 0 the E[D|Z = z] is large. This discontinuity will generate
a jump in E[Y |Z ]. A special case is the ‘sharp design’ where E[D|Z ] jumps from 0
to 1 as in the examples discussed earlier. Hence, although Z is not ‘globally’ a valid instrumental variable since it has a direct impact on Y^0 and Y^1, visible in the graphs, it
can ‘locally’ be a valid instrument if we compare only those observations slightly below
(control group) with those slightly above z 0 (treatment group).
For the moment we distinguish two different situations (or designs): the sharp design
where Di changes for all i (i.e. everyone) at the threshold, and the fuzzy design,
where Di changes only for some individual i. In the former, the participation status
is determined by
\[ D_i = \mathbb{1}\{Z_i \ge z_0\}, \tag{6.1} \]
Example 6.6 Van der Klaauw (2002) analyses the effect of financial aid offered to col-
lege applicants on their probability of subsequent enrolment. College applicants are
ranked according to their test score achievements into a small number of categories.
The amount of financial aid offered depends largely on this classification. Yet, he finds
that the financial aid officer also took other characteristics into account, which are
not observed by the econometrician. Hence the treatment assignment is not a deter-
ministic function of the test score Z , but the conditional expectation function E[D|Z ]
nonetheless displays clear jumps because of the test-score rule.
Obviously, for sharp designs this difference between the right and left limits of E[D|Z] at z_0 is exactly equal to 1. As the fuzzy design therefore includes the sharp design as a special case, much of the following discussion focuses on the more general fuzzy design but implicitly covers the sharp design (as a trivial case).
A third case you may observe from time to time, is a mixed design, which is a
mixture of sharp and fuzzy design or, more specifically a design with only one-sided
non-compliance. This occurs if the threshold is strictly applied only on one side. A
frequent case arises when eligibility depends strictly on observed characteristics but
participation in treatment is voluntary. Obvious examples are all means-tested projects, like food stamp programmes, with a strict eligibility threshold z_0, but where take-up of the treatment is typically less than 100 percent (people who got the stamps might not go for the treatment). Consequently we expect
Example 6.7 Think about eligibility for a certain labour market programme. It may depend on the duration of unemployment or on the age of individuals. The ‘New Deal
for Young People’ in the UK offers job-search assistance (and other programmes) to all
individuals aged between eighteen and twenty-four who have been claiming unemploy-
ment insurance for six months. Accordingly, the population consists of three subgroups
(near the threshold): ineligibles, eligible non-participants and participants. Often data
on all three groups is available.
You can also find mixed designs where theoretically everyone is allowed to get treated
but some people below (or above) threshold z_0 are permitted to opt out. Then you
would get
(depending on the sign of Z ). Note, however, that (6.3) and (6.4) are equivalent; you
simply have to redefine the treatment indicator as 1 − D. To simplify the discussion we
can therefore always refer to (6.3) without loss of generality.
Like in the sharp design, the setup in mixed designs rules out the existence of (local) defiers4 close to z_0, i.e. individuals i who would enter treatment for Z_i < z_0 but stay out of it otherwise. In the sharp design they are not allowed to choose, and in the mixed design, by definition (6.3), potential defiers are either not allowed to participate or they coincide with never-takers (recall the one-sided compliance case). For the rest one could say that all the discussion of fuzzy designs also applies to mixed designs, though with somewhat simpler formulae and fewer assumptions. For example, in (6.3) the group of (local) always-takers is also redundant as they either are not eligible or become local compliers.5
The adjective local refers now to the fact that we are only looking at the location
around z 0 .6
We will always use the Assumption RDD-1 (6.2), which is therefore supposed to
be fulfilled for this entire chapter. Later on we will also discuss that we need the non-
existence of defiers. Further, in many of our examples we have seen that Z may also
be linked to the potential outcomes Y d directly, so that the treatment effect cannot
be identified without further assumptions. Supposing that the direct influence of Z on
the potential outcomes is continuous, the potential outcomes hardly change with Z within a small neighbourhood, e.g. around z_0. So identification essentially relies on analysing the outcomes of those individuals located around the threshold, and on the conditional mean functions being continuous at the threshold:
because if there were a jump in Y 0 or Y 1 at z 0 anyway, then the underlying idea of the
RDD identification and estimation would no longer apply. This again is assumed to be
fulfilled for the entire chapter. The previous assumptions are sufficient for identifying
average treatment effects, but if we are interested in distributional or quantile treat-
ment effects (Chapter 7), one often finds the stronger condition in terms of conditional
independence, namely
violated. If the manipulation is not monotonic but goes in both directions, in the sense that the smartest pupils whose score was below z_0 are upgraded while the score of some poorly performing pupils right above z_0 is lowered, then, even though the density f_Z may remain continuous at z_0, assumption (6.5) is still violated.
Now, instead of manipulating scores, in practice it is more likely that one simply relaxes the selection rules in the sense that people around z_0 are allowed or encouraged to switch into or out of treatment. Then we are in the fuzzy design introduced above.
But then Assumption RDD-2 is no longer sufficient. In fact, we need additionally to
assume
Assumption RDD-3: (Y_i^1 − Y_i^0) ⊥⊥ D_i | Z_i for Z_i near z_0   (6.8)
which excludes selection into treatment according to the individual gain (Y_i^1 − Y_i^0) near the threshold. An alternative, Assumption RDD-3* (6.9), consists of two conditions and provides the identification of a LATE for the local compliers: the first is very similar to the instrument exclusion restriction of Chapter 4, whereas the second represents a type of local monotonicity restriction, requiring the absence of defiers in a neighbourhood of z_0.
It has been argued that in many applications this assumption would be easier to justify
(but is not testable anyway). Its handicap is that it is some kind of instrumental variable
approach and therefore only identifies the treatment effect for a group of local compliers
induced by the chosen instrument and threshold, i.e.
\[ LATE(z_0) = \lim_{\varepsilon \to 0} E\!\left[\, Y^1 - Y^0 \mid D(z_0+\varepsilon) > D(z_0-\varepsilon),\ Z = z_0 \right]. \]
7 We now slowly switch from the mean-independence notation to the distributional one because later on, especially in Chapter 7, we study not just mean but also distributional effects.
8 This is often still supposed to be equal to ATET and ATEN at z_0 because this assumption says that conditioning on Z near z_0 gives a randomised trial.
Like in Chapter 4 it can be shown (Exercise 2) that the ATE on the local compliers is
identified as
\[ LATE(z_0) = \frac{\lim_{\varepsilon\to 0} E[Y \mid Z = z_0+\varepsilon] - \lim_{\varepsilon\to 0} E[Y \mid Z = z_0-\varepsilon]}{\lim_{\varepsilon\to 0} E[D \mid Z = z_0+\varepsilon] - \lim_{\varepsilon\to 0} E[D \mid Z = z_0-\varepsilon]}. \tag{6.10} \]
It has the property of being ‘local’ twice: first for Z = z 0 and second for compliers, i.e.
the group of individuals whose Z value lies in a small neighbourhood of z 0 and whose
treatment status D would change from 0 to 1 if Z were changed exogenously from
z 0 − ε to z 0 + ε. Now you see why we called this a handicap: depending on the context,
this subpopulation and parameter might be helpful and easy to interpret or it might not.
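For intuition, consider a purely hypothetical numerical illustration of (6.10) (the numbers are invented): suppose the conditional mean outcome jumps by 0.8 at z_0 while the treatment probability jumps from 0.2 to 0.6. Then
\[ LATE(z_0) = \frac{0.8}{0.6 - 0.2} = 2, \]
i.e. the observed jump in mean outcomes is scaled up by the 40-percentage-point share of local compliers at the threshold.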
The good news is, whichever of the two alternative assumptions, i.e. RDD-3 or RDD-
3*, is invoked, the existing estimators are actually the same under both identification
strategies. So there is no doubt for us what we have to do regarding the data analysis; we
might only hesitate when it comes to interpretation. Moreover, as in Chapter 4, the fact
that we can only estimate the treatment effect for the compliers need not necessarily be
a disadvantage: sometimes this may just be the parameter one is interested in:
Example 6.8 Anderson, Dobkin and Gross (2012) examined the effect of health insur-
ance coverage on the use of medical services. They exploited a sharp drop in insurance
coverage rates at age 19, i.e. when children ‘age out’ of their parents’ insurance plans.
Many private health insurers in the USA cover dependent children up to age 18. When
these children turn 19, many drop out of their parents’ insurance cover. In fact, about
five to eight percent of teenagers become uninsured shortly after the nineteenth birthday.
The authors exploited this age discontinuity to estimate the effect of insurance coverage
on the utilisation of medical services and find a huge drop in emergency department
visits and inpatient hospital admissions. The estimated treatment effects represent the
response of ‘compliers’, i.e. individuals who become uninsured when turning 19. The
parameter of interest for policy purposes would be the average effect of insurance coverage for these uninsured, since most current policies focus on expanding rather than reducing health insurance coverage. The ‘compliers’ represent a substantial fraction of uninsured young adults, so providing insurance coverage to this population would be of significant policy relevance. In addition, there are also local never-takers, yet
the authors argue that their treatment effects should be similar to those of the compliers
since the pre-19 insurance coverage is mostly an artifact of their parents’ insurance plans
rather than a deliberate choice based on unobserved health status. Therefore the typical
adverse selection process is unlikely to apply in their context. Indeed they did not find
evidence that never-takers were significantly less healthy or consumed less health care
services than uninsured ‘compliers’.
Only in the fuzzy design is the local monotonicity of D in Z (i.e. the absence of defiers near z_0) effectively an assumption, while for the other designs defiers are not an issue by construction. Notice that the RDD assumption implies the existence of compliers, as otherwise there would be no discontinuity. Fuzzy designs allow for never- and/or always-takers, though this ‘never’ and ‘always’ refers to ‘nearby z_0’. In the mixed design, Assumptions RDD-1 and RDD-2 are sufficient if you aim to estimate the ATET (if it is sharp with respect to treatment admission but fuzzy in the sense that you may refuse the treatment, cf. Equation (6.3)).9 Then ATET and LATE are even the same at the threshold. To obtain this, recall
\[ ATET(z_0) = E[Y^1 - Y^0 \mid D = 1, Z = z_0], \]
and note that
\begin{align*}
& \lim_{\varepsilon\to 0} E[Y \mid Z = z_0+\varepsilon] - \lim_{\varepsilon\to 0} E[Y \mid Z = z_0-\varepsilon] \\
&= \lim_{\varepsilon\to 0} E\!\left[ D(Y^1-Y^0) + Y^0 \mid Z = z_0+\varepsilon \right] - \lim_{\varepsilon\to 0} E\!\left[ D(Y^1-Y^0) + Y^0 \mid Z = z_0-\varepsilon \right] \\
&= \lim_{\varepsilon\to 0} E\!\left[ D(Y^1-Y^0) \mid Z = z_0+\varepsilon \right] - \lim_{\varepsilon\to 0} E\!\left[ D(Y^1-Y^0) \mid Z = z_0-\varepsilon \right] \\
&= \lim_{\varepsilon\to 0} E\!\left[ D(Y^1-Y^0) \mid Z = z_0+\varepsilon \right] \\
&= \lim_{\varepsilon\to 0} E\!\left[ Y^1-Y^0 \mid D = 1, Z = z_0+\varepsilon \right] \cdot \lim_{\varepsilon\to 0} E[D \mid Z = z_0+\varepsilon]. \tag{6.11}
\end{align*}
The second equality follows because the left and right limits for Y 0 are identical by
Assumption RDD-2, the third equality follows because D = 0 on the left of the
threshold, and the last equality follows by RDD-1 and because D is binary. We thus
obtain
\[ \frac{\lim_{\varepsilon\to 0} E[Y \mid Z = z_0+\varepsilon] - \lim_{\varepsilon\to 0} E[Y \mid Z = z_0-\varepsilon]}{\lim_{\varepsilon\to 0} E[D \mid Z = z_0+\varepsilon] - \lim_{\varepsilon\to 0} E[D \mid Z = z_0-\varepsilon]} = \lim_{\varepsilon\to 0} E\!\left[ Y^1 - Y^0 \mid D = 1, Z = z_0+\varepsilon \right], \tag{6.12} \]
which is the average treatment effect on the treated for those near the threshold, i.e.
ATET(z 0 ).
Note finally that if pre-treatment data on Y is available, we can also consider a DiD-
RDD approach, which we discuss further below.
9 Analogously, if rules are inverted such that you can switch from control to treatment but not vice versa,
then these assumptions are sufficient for estimating ATENT.
In the sharp design we estimate m_+ := \lim_{\varepsilon\to 0} E[Y|Z = z_0 + \varepsilon] by local linear regression,
\[ (\hat m_+, \hat\beta_+) = \arg\min_{m,\beta} \sum_{i=1}^{n} \{Y_i - m - \beta (Z_i - z_0)\}^2 \, K\!\left(\frac{Z_i - z_0}{h_+}\right) \cdot \mathbb{1}\{Z_i \ge z_0\} \tag{6.13} \]
with a bandwidth h_+, and analogously m_- := \lim_{\varepsilon\to 0} E[Y|Z = z_0 - \varepsilon] by
\[ (\hat m_-, \hat\beta_-) = \arg\min_{m,\beta} \sum_{i=1}^{n} \{Y_i - m - \beta (Z_i - z_0)\}^2 \, K\!\left(\frac{Z_i - z_0}{h_-}\right) \cdot \mathbb{1}\{Z_i < z_0\}, \tag{6.14} \]
with bandwidth h_-, to finally obtain an estimator for E[Y^1 - Y^0 \mid Z = z_0], namely
\[ \widehat{ATE}(z_0) = \hat m_+ - \hat m_-. \tag{6.15} \]
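To make these estimation steps concrete, here is a minimal sketch in Python of (6.13)–(6.15) under the sharp design. It is only an illustration under our own choices: the triangular kernel, the bandwidths, the toy data and all function names (llr_boundary, rdd_sharp_ate) are ours, not the authors'.

```python
# Minimal sketch of (6.13)-(6.15): one-sided local linear fits at z0 (sharp design).
import numpy as np

def llr_boundary(y, z, z0, h, side):
    """Kernel-weighted linear fit of y on (1, z - z0), using one side of z0 only.
    Returns the intercept, i.e. the estimated limit of E[Y|Z=z] as z -> z0 from that side."""
    mask = (z >= z0) if side == "right" else (z < z0)
    u = (z[mask] - z0) / h
    w = np.maximum(1.0 - np.abs(u), 0.0)                  # triangular kernel K(u) = (1-|u|)_+
    X = np.column_stack([np.ones(mask.sum()), z[mask] - z0])
    WX = X * w[:, None]                                   # solve X'W X b = X'W y
    return np.linalg.solve(WX.T @ X, WX.T @ y[mask])[0]

def rdd_sharp_ate(y, z, z0, h_plus, h_minus):
    m_plus = llr_boundary(y, z, z0, h_plus, "right")      # (6.13)
    m_minus = llr_boundary(y, z, z0, h_minus, "left")     # (6.14)
    return m_plus - m_minus                               # (6.15)

# toy data with a jump of size 2 at z0 = 0
rng = np.random.default_rng(0)
z = rng.uniform(-1, 1, 2000)
y = 0.5 * z + 2.0 * (z >= 0) + rng.normal(0, 1, z.size)
print(rdd_sharp_ate(y, z, z0=0.0, h_plus=0.3, h_minus=0.3))   # roughly 2
```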
The exact list of necessary assumptions and the asymptotic behaviour of this estimator are given below, together with those for the estimator when facing a fuzzy design. (Recall
that the sharp design can be considered as a special – and actually the simplest – case
of fuzzy designs.) But before we come to an ATE estimator for the fuzzy design, let us
briefly discuss some modifications of (6.14) and (6.15), still in the sharp design context.
We can rewrite the above expressions to estimate AT E(z 0 ) in a single estimation step.
Suppose that we use the same bandwidth left and right of z_0, i.e. h_- = h_+ = h. Define further 1_i^+ = \mathbb{1}\{Z_i \ge z_0\}, 1_i^- = \mathbb{1}\{Z_i < z_0\}, noticing that 1_i^+ + 1_i^- = 1. The previous
two local linear expressions can also be expressed as minimisers of quadratic objective
functions. Since m̂ + and m̂ − are estimated from separate subsamples, these solutions
are numerically identical to the minimisers of the sum of the two objective functions.
To obtain the following formula, we just add the objective functions of the previous
two local linear regressions. We obtain a joint objective function, which is minimised at
(m̂ + , β̂+ ) and (m̂ − , β̂− ):
\begin{align*}
& \sum_{i=1}^{n} \left( Y_i - m_+ - \beta_+ (Z_i - z_0) \right)^2 K\!\left(\frac{Z_i - z_0}{h}\right) \cdot 1_i^+ \\
&\quad + \sum_{i=1}^{n} \left( Y_i - m_- - \beta_- (Z_i - z_0) \right)^2 K\!\left(\frac{Z_i - z_0}{h}\right) \cdot 1_i^- \\
&= \sum_{i=1}^{n} \left\{ Y_i 1_i^+ - m_+ 1_i^+ - \beta_+ (Z_i - z_0) 1_i^+ + Y_i 1_i^- - m_- 1_i^- - \beta_- (Z_i - z_0) 1_i^- \right\}^2 K\!\left(\frac{Z_i - z_0}{h}\right).
\end{align*}
Noting that in the sharp design 1_i^+ implies D_i = 1 and 1_i^- implies D_i = 0, we obtain
\begin{align*}
&= \sum_{i=1}^{n} \left\{ Y_i - m_+ 1_i^+ - m_- (1 - 1_i^+) - \beta_+ (Z_i - z_0) D_i - \beta_- (Z_i - z_0)(1 - D_i) \right\}^2 K\!\left(\frac{Z_i - z_0}{h}\right) \\
&= \sum_{i=1}^{n} \left\{ Y_i - m_- - (m_+ - m_-) D_i - \beta_+ (Z_i - z_0) D_i - \beta_- (Z_i - z_0)(1 - D_i) \right\}^2 K\!\left(\frac{Z_i - z_0}{h}\right) \\
&= \sum_{i=1}^{n} \left\{ Y_i - m_- - (m_+ - m_-) D_i - \beta_- (Z_i - z_0) - (\beta_+ - \beta_-)(Z_i - z_0) D_i \right\}^2 K\!\left(\frac{Z_i - z_0}{h}\right). \tag{6.16}
\end{align*}
Since this function is minimised at (m̂ + , β̂+ ) and (m̂ − , β̂− ), the coefficient on D would
be estimated by (m̂ + − m̂ − ). It gives a local linear estimator for AT E(z 0 ) that is equiv-
alent to the one above if h = h + = h − . We can thus obtain the treatment effect directly
by a local linear regression of Yi on a constant, Di , (Z i − z 0 ) and (Z i − z 0 ) Di , which
is identical to the separate regressions given above.
That is, we regress Y_i on a constant, D_i, (Z_i − z_0) and (Z_i − z_0)D_i using weighted least squares with kernel weights as in (6.17). The coefficient of D_i corresponds to the estimator (6.15). If for convenience one used a uniform kernel with equal bandwidths, then the estimator would correspond to a simple (unweighted) OLS regression in which all observations further away from z_0 than h are deleted.
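The following short sketch (our own illustration, with hypothetical variable names) implements exactly this special case: a uniform kernel with equal bandwidths, i.e. plain OLS of Y on a constant, D, (Z − z_0) and (Z − z_0)D within the window |Z − z_0| ≤ h.

```python
# Sketch: single-regression version of the sharp RDD estimator with a uniform kernel.
import numpy as np

def rdd_window_ols(y, z, d, z0, h):
    keep = np.abs(z - z0) <= h                   # delete observations further than h from z0
    zc = z[keep] - z0
    X = np.column_stack([np.ones(keep.sum()), d[keep], zc, zc * d[keep]])
    coef, *_ = np.linalg.lstsq(X, y[keep], rcond=None)
    return coef[1]                               # coefficient on D, i.e. m_plus - m_minus
```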
In some applications, the restriction is imposed that the derivative of E [Y |Z ] is
identical on the two sides of the threshold, i.e. that
\[ \lim_{\varepsilon\to 0} \frac{\partial E[Y \mid Z = z_0+\varepsilon]}{\partial z} = \lim_{\varepsilon\to 0} \frac{\partial E[Y \mid Z = z_0-\varepsilon]}{\partial z}. \]
This assumption appears particularly natural if one aims to test the hypothesis of a zero treatment effect, i.e. the null hypothesis that E[Y^1 − Y^0 | Z = z_0] = 0. In other words,
if the treatment has no effect on the level, it appears plausible that it also has no effect on
the slope. This can easily be implemented in (6.16) by imposing that β− = β+ . In the
implementation we would then estimate the treatment effect by a local linear regression
on a constant, Di and (Z i − z 0 ) without interacting the last term with Di . If one is not
testing for a null effect, this restriction is less appealing because a non-zero treatment
effect may not only lead to a jump in the mean outcome but possibly also in its slope.
Note moreover that if we do not impose the restriction β− = β+ and estimate expression
(6.16) including the interaction term (Z i − z 0 )Di , we ensure that only data points to
the left of z 0 are used for estimating the potential outcome E[Y 0 |Z = z 0 ] while only
points to the right of z 0 are used for estimating the potential outcome E[Y 1 |Z = z 0 ]. In
contrast, when we impose the restriction β− = β+ , then data points from both sides of
z 0 are always used for estimating the average potential outcomes. Consequently, some
Y 0 outcomes are used to estimate E[Y 1 |Z = z 0 ], and analogously, some Y 1 outcomes
are used to estimate E[Y^0|Z = z_0], which is counter-intuitive unless the treatment effect is zero everywhere.
In the fuzzy design we implement the Wald type estimator along identification
strategy (6.10) by estimating (6.13) and (6.14), once with respect to the outcome Y ,
and once with respect to D. For notational convenience set m(z) = E[Y |Z = z],
p(z) = E[D|Z = z] with m + , m − , p+ , p− being the limits from above and below,
respectively, when z → z 0 . Imagine now that all these are estimated by local linear
regression. In the same way we define their first and second derivatives m'_+, m'_-, p'_+, p'_- and m''_+, m''_-, p''_+, p''_-. Let us further define σ_+^2 := \lim_{\varepsilon\to 0} Var(Y|Z = z_0+\varepsilon) and ρ_+ := \lim_{\varepsilon\to 0} Cov(Y, D|Z = z_0+\varepsilon), and σ_-^2, ρ_- analogously, being the limits from below. Then we can state the asymptotic
behaviour of the Wald type RDD-(L)ATE estimator10
THEOREM 6.1 Suppose that Assumptions RDD-1, RDD-2 and RDD-3 or RDD-3* are
fulfilled. Furthermore, assume that m and p are twice continuously differentiable for
z > z 0 . For consistent estimation we need the following regularity assumptions:
(i) There exists some ε > 0 such that |m_+|, |m'_+|, |m''_+| and |p_+|, |p'_+|, |p''_+| are uniformly bounded on (z_0, z_0 + ε], and |m_-|, |m'_-|, |m''_-| and |p_-|, |p'_-|, |p''_-| are uniformly bounded on [z_0 − ε, z_0).
(ii) The limits of m_+, m_-, p_+, p_- at z_0 exist and are finite. The same holds for their first and second derivatives.
(iii) The conditional variance σ^2(z_i) = Var(Y_i|z_i) and covariance ρ(z_i) = Cov(Y_i, D_i|z_i) are uniformly bounded near z_0. Their limits σ_+^2, σ_-^2, ρ_+, ρ_- exist and are finite.
(iv) The limits of E[|Y_i − m(Z_i)|^3 | z_i = z] exist and are finite for z approaching z_0 from above or below.
(v) The density f_z of z is continuous, bounded, and bounded away from zero near z_0.
(vi) The kernel function K(·) is continuous, of second order, and positive with compact support. For the bandwidth we have h = n^{-1/5}.
Then, with \hat m_+, \hat m_-, \hat p_+ and \hat p_- being local linear estimators of m_+, m_-, p_+ and p_- respectively, we have for the RDD-LATE estimator
\[ n^{2/5} \left( \frac{\hat m_+ - \hat m_-}{\hat p_+ - \hat p_-} - \frac{m_+ - m_-}{p_+ - p_-} \right) \longrightarrow N(B, V) \]
where bias and variance are given by
\[ B = \frac{v_+ m''_+ - v_- m''_-}{p_+ - p_-} - \frac{(m_+ - m_-)(v_+ p''_+ - v_- p''_-)}{(p_+ - p_-)^2} \]
with
\[ v_+ = \frac{\left( \int_0^\infty u^2 K(u)\, du \right)^2 - \int_0^\infty u K(u)\, du \int_0^\infty u^3 K(u)\, du}{2 \left[ \int_0^\infty K(u)\, du \int_0^\infty u^2 K(u)\, du - \left( \int_0^\infty u K(u)\, du \right)^2 \right]}, \]
\[ V = \frac{w_+ \sigma_+^2 + w_- \sigma_-^2}{(p_+ - p_-)^2} - 2\,\frac{m_+ - m_-}{(p_+ - p_-)^3} \left( w_+ \rho_+ + w_- \rho_- \right) + \frac{(m_+ - m_-)^2}{(p_+ - p_-)^4} \left( w_+ p_+ \{1 - p_+\} + w_- p_- \{1 - p_-\} \right) \]
with
\[ w_+ = \frac{\int_0^\infty \left( \int_0^\infty s^2 K(s)\, ds - u \int_0^\infty s K(s)\, ds \right)^2 K^2(u)\, du}{f_z(z_0) \left[ \int_0^\infty u^2 K(u)\, du \cdot \int_0^\infty K(u)\, du - \left( \int_0^\infty u K(u)\, du \right)^2 \right]^2}. \]
10 See Hahn, Todd and van der Klaauw (1999) for further details and proof.
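As a complement, here is a compact sketch of the Wald-type estimator of (6.10) that the theorem describes. It is only an illustration: the triangular kernel, the bandwidth constant c and the function names are our own assumptions, while the rate n^{-1/5} mirrors condition (vi).

```python
# Sketch of the fuzzy-design estimator (6.10): ratio of the estimated jumps in E[Y|Z] and E[D|Z].
import numpy as np

def _one_sided_llr(v, z, z0, h, right):
    """Intercept of a kernel-weighted linear fit of v on (z - z0), one side of z0 only."""
    mask = (z >= z0) if right else (z < z0)
    u = (z[mask] - z0) / h
    w = np.maximum(1.0 - np.abs(u), 0.0)          # triangular kernel
    X = np.column_stack([np.ones(mask.sum()), z[mask] - z0])
    WX = X * w[:, None]
    return np.linalg.solve(WX.T @ X, WX.T @ v[mask])[0]

def rdd_fuzzy_late(y, d, z, z0, c=1.0):
    h = c * len(z) ** (-1 / 5)                    # bandwidth rate as in Theorem 6.1(vi)
    num = _one_sided_llr(y, z, z0, h, True) - _one_sided_llr(y, z, z0, h, False)
    den = (_one_sided_llr(d.astype(float), z, z0, h, True)
           - _one_sided_llr(d.astype(float), z, z0, h, False))
    return num / den                              # estimate of LATE(z0)
```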
Example 6.9 Angrist and Lavy (1999) used that in Israel ‘class size’ is usually deter-
mined by a rule that splits classes when class size would be larger than 40. This policy
generates discontinuities in class size when the enrolment in a grade grows from 40 to
41 – as class size changes from one class of 40 to two classes of sizes 20 and 21. The same
applies then to 80–81, etc. Enrolment (Z ) has thus a discontinuous effect on class size
(D) at these different cut-off points. Since Z may directly influence student achievement
(e.g. via the size or popularity of the school), it is not a valid instrumental variable as it
clearly violates the exclusion restriction. But it produces thresholds at 41, 81, 121, etc.
such that, if we compared only classes with enrolment of size 40 to those with 41, those
of size 80 to those with 81, etc. we could apply the RDD idea. Furthermore, it is plausible to assume the same average treatment effect at each threshold. The authors
imposed more structure in form of a linear model to estimate the impact of class size on
student achievement. Nevertheless, the justification for their approach essentially relied
on the considerations above.
We discuss the case of multiple thresholds for the sharp design first. For further sim-
plification, imagine that around z 0 we can work with a more or less constant treatment
effect β. If we had just one threshold z 0 :
\[ Y_i = \beta_0 + \beta D_i + U_i, \tag{6.20} \]
where endogeneity arises because of dependency between D_i and U_i. In the sharp design one has D_i = \mathbb{1}\{Z_i \ge z_0\}, such that we obtain
\[ Y_i = \beta_0 + \beta D_i + E[U_i|Z_i] + W_i, \qquad W_i := Y_i - E[Y_i \mid Z_i, D_i]. \]
The ‘error’ term Wi has the nice properties E[Wi ] = 0 for all i, cov(Wi , Di ) = 0
and cov(Wi , E[Ui |Z i ]) = 0. This can be shown by straightforward calculations using
iterated expectations. Suppose further that E[Ui |Z i ] belongs to a parametric family of
functions, e.g. polynomial functions, which we denote by ϒ(z, δ) (with δ a vector of
unknown parameters; infinite number if ϒ is non-parametric) and is continuous in z at
z 0 . You must suppress the intercept in the specification of ϒ(·) because we already have
β_0 in the above equation as a constant. Hence, we cannot identify another intercept (which is not a problem, as we are only interested in β). We assume that there is a true vector
δ such that E[Ui |Z i ] = ϒ(Z i , δ) almost surely.11 If E[Ui |Z i ] is sufficiently smooth, it
can always be approximated to arbitrary precision by a polynomial of sufficiently large
order. The important point is to have the number of terms in ϒ(z, δ) sufficiently large.12
By using E[Ui |Z i ] = ϒ(Z i , δ) we can rewrite the previous expression as
\[ Y_i = \beta_0 + \beta D_i + \Upsilon(Z_i, \delta) + W_i, \qquad W_i = Y_i - E[Y_i \mid Z_i, D_i], \tag{6.22} \]
11 There is a vector δ such that for all values z ∈ R\A, where Pr(Z ∈ A) = 0, it holds
E[U |Z = z] = ϒ(z, δ).
12 You may take a second-order polynomial δ_1 z + δ_2 z^2, but with a high risk of misspecification.
Alternatively, you may take a series and, theoretically, include a number of basis functions that increases
with sample size. This would result in a non-parametric sieve estimator for E[U |Z ].
where we now consider the terms in ϒ(Z i , δ) as additional regressors, which are all
uncorrelated with Wi . Hence, ϒ(Z i , δ) is supposed to control for any impact on Yi that
is correlated with Z i (and this way with Di without being caused by Di ). Then the
treatment effect β could be consistently estimated.
Interestingly, the regression in (6.22) does not make any use of z 0 itself. The identi-
fication nevertheless comes from the discontinuity at z 0 together with the smoothness
assumption on E[U |Z ]. To see this, consider what would happen if we used only data
to the left (respectively, right) side of z 0 . In this case, Di would be the same for all data
points, such that β could not be identified. Actually, in (6.22) the variable Z_i has two functions: the discontinuity of D in Z at z_0 identifies β, and its inclusion via ϒ(·) avoids an omitted variable bias. Moreover, with the treatment effect being constant around z_0, the endogeneity of D in (6.20) is caused only by the omission of Z. In sum, from the
derivations leading to (6.22) it is not hard to see that the regression (6.22) would be the
same if there were multiple thresholds z 0 j , j = 1, 2, . . . But we have to redefine Di
accordingly; see below and Exercise 4.
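A minimal sketch of how (6.22) could be estimated by OLS is given below, assuming a cubic polynomial for ϒ(z, δ); the polynomial order and all names are illustrative choices on our part, not the authors' implementation.

```python
# Sketch of (6.22): OLS of Y on a constant, D and a polynomial in (Z - z0), with no second intercept.
import numpy as np

def rdd_polynomial_ols(y, d, z, z0, order=3):
    zc = z - z0
    poly = np.column_stack([zc ** k for k in range(1, order + 1)])   # Upsilon(z, delta) without intercept
    X = np.column_stack([np.ones(len(y)), d, poly])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]                                                   # beta, the constant treatment effect
```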
Example 6.10 Recall Example 6.9 of splitting school classes in Israel if class size exceeds 40 pupils, thus having thresholds at 41, 81, 121, etc. We could either just ask whether they have been split, or divide the set of all positive integers into non-overlapping sets on which D_i is either equal to one (school is considered as being treated) or zero. This shows our ‘dilemma’: is a school with 60 enrolments considered treated (since 60 > 40) or not (since 60 < 81)?
A remedy to this and the above-mentioned problems is to use only observations that
are close to a threshold z 0 j , j = 1, 2, . . . This also makes the necessary assumptions
more credible. Firstly, for sharp designs, near the thresholds it is clear whether Di takes
the value zero or one. Secondly, approximating E[U_i|Z_i] by (different) local parametric functions in (each) threshold neighbourhood should be a valid simplification. Recall
also that we are interested in the average of all treatment effects, i.e. the average over
all individuals over all thresholds. If β is constant, E[Ui |Z i ] should be almost constant
around a given Z i = z as otherwise the assumptions we made on Ui above might
become implausible.13 If you want to allow for different treatment effects at different
thresholds, then you would estimate them separately. In sum, obtaining β̂ by estimating
(6.22) with a partial linear model (recall Chapter 2) only using data around the thresholds
is a valid strategy.
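A sketch of this ‘data near the thresholds’ strategy with multiple cutoffs, as in the class-size example, is given below; the thresholds, the window h, the assignment to the nearest cutoff and all names are our illustrative choices.

```python
# Sketch: keep only observations within h of some threshold and define D locally (sharp design).
import numpy as np

def near_threshold_sample(z, thresholds, h):
    thresholds = np.asarray(thresholds, dtype=float)
    dist = np.abs(z[:, None] - thresholds[None, :])    # distance of each unit to each cutoff
    nearest = thresholds[dist.argmin(axis=1)]          # the cutoff closest to each unit
    keep = dist.min(axis=1) <= h                       # keep units near some threshold only
    d = (z >= nearest).astype(float)                   # treated if at or above the nearest cutoff
    return keep, d, z - nearest                        # centred running variable for (6.22)

# e.g. keep, d, zc = near_threshold_sample(enrolment, [41, 81, 121], h=5)
# then regress y[keep] on a constant, d[keep] and a polynomial in zc[keep] as in (6.22)
```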
The case of multiple thresholds becomes more complex when facing a fuzzy design.
We still work with a constant treatment effect as above. Recall Assumption RDD-3, i.e. (Y_i^1 − Y_i^0) ⊥⊥ D_i | Z_i for Z_i near z_0, see (6.8): generally, we do not permit that individuals select into treatment according to their
gain (Yi1 − Yi0 ) from it. Note that the assumption of a constant treatment effect automat-
ically implies that this is satisfied, because then (Yi1 − Yi0 ) is the same for everyone. As
13 Note that U represents all deviations from the mean model, including those that are caused by potential
heterogeneous returns to treatment.
stated, an alternative is to work with Assumption RDD-3*, resulting in the same estimator but with a more complex interpretation. We start again from Equation (6.20) with only one threshold z_0 and aim to rewrite it such that we could estimate it by OLS. Because D_i is no longer a deterministic function of Z_i, we consider only expected values conditional on Z_i, i.e. we do not condition on Z_i and D_i jointly:
\[ Y_i = \beta_0 + \beta E[D_i|Z_i] + E[U_i|Z_i] + W_i, \qquad W_i := Y_i - E[Y_i \mid Z_i], \]
where Wi is not correlated with any of the other terms on the right-hand side of the
equation. As before we suppose that E[U_i|Z_i] = ϒ(Z_i, δ) belongs to a parametric family of functions that are continuous at z_0, and write
\[ Y_i = \beta_0 + \beta E[D_i|Z_i] + \Upsilon(Z_i, \delta) + W_i. \tag{6.23} \]
If we knew the function E[Di |Z i ] = Pr(Di = 1|Z i ), we could estimate the previous
equation by (weighted) OLS to obtain β. Since we do not know E[Di |Z i ] we could
pursue a two-step approach in that we first estimate it and plug the predicted E[Di |Z i ]
in (6.23). What is new here is that for an efficient estimation of E[Di |Z i ] one could
and should use the a priori knowledge of a discontinuity at z_0. In practice, people just use linear probability models with an indicator function \mathbb{1}\{Z_i > z_0\}; see also Exercise 3. In
such cases one could invoke the following specification
where ϒ̄(·, δ̄) is a parametric family of functions indexed by δ̄, e.g. a polynomial.
In (6.24) one uses the knowledge of having a discontinuity at z 0 . (It is, however,
well known that linear probability models are inappropriate; see also our discussion
in Chapter 3.)
What would happen if we chose the same polynomial order for ϒ and ϒ̄, e.g. a third-order polynomial? In this exactly identified case, IV and 2SLS are identical, because the solution to (6.23) is identical to an IV regression of Y_i on a constant, D_i, Z_i, Z_i^2, Z_i^3 with instruments a constant, Z_i, Z_i^2, Z_i^3 and \mathbb{1}\{Z_i \ge z_0\}, where the latter is the excluded instrument.
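The following hand-rolled sketch illustrates this two-step procedure (equivalently, 2SLS in the exactly identified case): a linear probability model with the jump indicator in the first stage, then the plug-in regression (6.23). The polynomial order, all names, and the use of plain least squares are our own assumptions.

```python
# Sketch: fuzzy RDD via a two-step / 2SLS regression with 1{Z >= z0} as the excluded instrument.
import numpy as np

def rdd_fuzzy_2sls(y, d, z, z0, order=3):
    zc = z - z0
    poly = np.column_stack([zc ** k for k in range(1, order + 1)])
    jump = (z >= z0).astype(float)                                   # excluded instrument
    # first stage: linear probability model for E[D|Z] with a discontinuity at z0
    Z1 = np.column_stack([np.ones(len(y)), jump, poly])
    d_hat = Z1 @ np.linalg.lstsq(Z1, d.astype(float), rcond=None)[0]
    # second stage: plug the fitted E[D|Z] into (6.23)
    X2 = np.column_stack([np.ones(len(y)), d_hat, poly])
    return np.linalg.lstsq(X2, y, rcond=None)[0][1]                  # beta
```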
If we have multiple thresholds, e.g. z_0, z_1, z_2, we replace (6.24) by a corresponding specification (6.25) with an indicator function for each threshold. A well-known application in which (6.25) is estimated and its predicted values are plugged into (6.23) is the class size rule in Angrist and Lavy (1999).
Matsudaira (2008) considered a mandatory summer school programme for pupils with poor performance in school. Pupils with low scores on maths and reading tests were obliged to attend a summer school programme during the holidays. Students who scored below a certain threshold on either of these tests had to attend the programme, i.e.
\[ D = \mathbb{1}\{Z_{math} < z_{0,math} \ \text{or} \ Z_{reading} < z_{0,reading}\}. \tag{6.26} \]
The structure with these two test scores thus permits controlling for maths ability while using the RDD with respect to the reading score, and vice versa.
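A tiny sketch of how one might exploit this two-score rule: among pupils who passed the maths test, only the reading score drives assignment, so the reading-score discontinuity can be analysed on that subsample. Column names and thresholds are hypothetical.

```python
# Sketch: treatment rule (6.26) and the subsample for an RDD in the reading score.
import numpy as np

def reading_rdd_sample(math, reading, z0_math, z0_read):
    d = ((math < z0_math) | (reading < z0_read)).astype(float)   # (6.26)
    passed_math = math >= z0_math          # for these pupils only the reading score matters
    return d[passed_math], reading[passed_math]
```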
6.2 Regression Discontinuity Design with Covariates

As discussed in earlier chapters, it often pays to include covariates that are good predictors of the outcome variable Y. For the RDD case, a typical example is when an educational support programme is offered to children from poor families. Children can participate in this programme if their parents' income Z falls below a certain threshold z_0. The outcome of interest Y is the maths test one
year later. Good predictors of the maths test outcome Y are usually the maths tests in the
previous years. These could be added as additional control variables X to obtain more
precise estimates. So already in this example there are at least two reasons to include
X ; first, for a better control of heterogeneous returns to treatment and thus making the
RDD assumptions more likely to hold; and second, the reduction of the standard errors.
The first point is less important if all covariates are perfectly balanced between treated
and non-treated in the sample used for estimation. Certainly, one might argue that this
should be the case anyway if all subjects are very close or equal to z 0 . Note that all
mentioned arguments are also valid if we first include X in the regression, but later on integrate them out to obtain an unconditional treatment effect.
Maybe more frequently, covariates X are added for robustness when moving away from z_0. In many applications we might have only few observations close to the threshold at our disposal. In practice we might thus be forced to also include observations with
values of Z not that close to z 0 (in other words, choose a rather large bandwidth). While
it appears plausible that locally pre-treatment covariates should be randomly distributed
about z 0 (such that each value of X is equally likely observed on the left and on the
right of z 0 ), further away from z 0 there is no reason why the distributions of X should
be balanced. Consequently, the omission of X could lead to sample biases akin to omitted variable bias. Although this problem would vanish asymptotically (when data become abundant close to z_0), the small-sample imbalances in X can be serious in practice. In sum, we see why including covariates X can then help to reduce the risk of a bias when using observations (far) away from z_0.
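A sketch of what this amounts to in the simple window regression used earlier: the covariates X enter linearly, so that imbalances away from z_0 are absorbed by the X-terms rather than by the estimated jump. The linear specification and all names are our illustrative choices.

```python
# Sketch: window OLS for the RDD with pre-treatment covariates X added linearly.
import numpy as np

def rdd_window_ols_with_x(y, z, d, x, z0, h):
    """x may be a vector or an n-by-k matrix of pre-treatment covariates."""
    keep = np.abs(z - z0) <= h
    zc = z[keep] - z0
    X = np.column_stack([np.ones(keep.sum()), d[keep], zc, zc * d[keep], x[keep]])
    coef, *_ = np.linalg.lstsq(X, y[keep], rcond=None)
    return coef[1]                                   # jump at z0, linearly adjusted for X
```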
Example 6.11 Black, Galdo and Smith (2005) evaluate the finite sample performance
of the regression discontinuity design. They are interested in the impact of a train-
ing programme D on annual earnings Y and thereby note that ‘annual earnings in the
previous year’ is a very important predictor. They examine a randomised experiment
which also contains an RDD and conclude that controlling for covariates is important
for finite-sample performance. Their result highlights the importance of using pre-
treatment covariates in the estimation of conditional mean counterfactuals. In their case
for example, ignoring the pre-treatment covariate ‘past earnings’ causes a large bias in
the conventional RDD estimates without covariates.
While efficiency gains and selection or sample biases are the main reasons for incor-
porating covariates, there may also be situations where the distribution of F(X |Z ) is
truly discontinuous at z 0 for some variables X . In most cases, this may be an indica-
tion of a failure of the RDD assumptions. Sometimes, but not always, conditioning on
these covariates restores the validity. In other words, like in the previous chapters, con-
ditioning the necessary RDD-assumptions on X might render them more plausible (or at
least help to make them less implausible). Certainly, this argument is not that different
from the ‘balancing’ argument above. Here we just say that even at z 0 (and not only
when moving away from it) you may face what we called confounders in the previous
chapters.
There are two major reasons why the distribution of (some) covariates may be discontinuous: confounding, or a direct impact of Z on X. We first discuss the latter case. Here, covariates may help to distinguish direct from total treatment effects; recall Chapter 2. Note that many of the cases discussed imply different distributions of X for treated and non-treated, respectively. Such a situation is sketched in Figure 6.3, where the inclusion of covariates X helps to distinguish total from direct (or partial) effects. Note, though, that this approach only works if there are no unobservables that affect X and Y simultaneously.
Example 6.12 Recall Example 6.5. Black (1999) analysed the impact of school quality
on housing prices by comparing houses adjacent to school–attendance district bound-
aries. School quality varies across the border, which should be reflected in the prices of
apartments. Consider two plots of land of the same size which are adjacent to a school
district boundary but on opposite sides of it. The school on the left happens by chance
to be a good school. The school on the right happens by chance to be of poor quality, supposing a completely random process. One is interested in the impact of school
quality on the market price of a flat. Using the RDD approach, we compare the prices
of houses left and right of the border. So this is an RDD with geographical borders.
To use this approach, we must verify the assumptions. As with all geographical bor-
ders used in an RDD approach, one might be concerned that there could also be other
changes in regulations when moving from the left-hand side to the right-hand side of the
street. It seems, though, that school district boundaries in some states do not coincide with other administrative boundaries, so that these concerns can be dispelled. In this
example there is a different concern, though: although, in contrast to individual location
decisions, houses cannot move, the construction companies might have decided to build
different types of houses on the left-hand side and the right-hand side of the road. If
school quality was indeed valued by parents, developers would build different housing
structures on the two sides of the boundary: on the side with the good school, they would construct larger flats with many bedrooms for families with children. On the side with the bad school, they would construct flats suitable for individuals or families with no or
fewer children (of school age), i.e. smaller flats with fewer bedrooms. Hence, the houses
on the two sides of the border may be different such that differences in prices not only
reflect the valuation of school quality but also the differences in housing structures. Let
i index a flat, where Z_i denotes the distance to the border. Y_i is the market price of the
flat. Di is the school quality associated with the region where the flat is located, and X i
are characteristics of the flat (number of bedrooms, size, garden, etc.). If school quality
evolved completely randomly, Di is not confounded. However, school quality Di has
two effects. Firstly, it has a direct effect on the value of the flat i. Secondly, it has an
indirect effect via X i . As discussed, because houses are built (or refurbished) differently
on the two sides of the border, school quality has an effect on the characteristics X i of
the flat (number of bedrooms, size), which by itself has an effect on the market price. If
we are interested in the valuation of school quality, we need to disentangle these effects.
As Black (1999) wants to know the impact of school quality on market price for a flat
of identical characteristics, he controls for the number of bedrooms, square footage and
other characteristics of the apartments. This approach corresponds to Figure 6.3 and is
only valid if there are no other unobservables related to X and Y .
Now let us consider an example where it is less clear whether to condition on X or not.
Example 6.13 Reconsider the impact of a summer school programme for poorly per-
forming children, cf. Matsudaira (2008). The fact that some pupils performed poorly
but nevertheless were just above the cutoff z 0 for participation in the publicly subsidised
summer school programme could lead their parents to provide some other kind of educa-
tional activities over the summer months. Let these be measured by X variables. Again,
the X variables are intermediate outcomes and we might be interested in both: the total
effect of the summer school programme and the direct effect after controlling for supple-
mentary but privately paid activities X . In this example, conditioning on X is unlikely
to work, though, because these activities are likely to be related to some unobservables
that reflect parental interest in education, which itself is likely to be also related with the
outcome variable Y . This makes the interpretation even harder.
Figure 6.4 indicates a situation where a change in Z also affects Y indirectly via X . In
such a situation controlling for X is necessary since the ‘instrumental variable’ Z would
otherwise have an effect on Y that is not channelled via D. Such a situation often occurs
when geographical borders are used to delineate a discontinuity. Without loss of gener-
ality, in the following example we look at a discretised but not necessarily binary D.
Example 6.14 Brügger, Lalive and Zweimüller (2008) use the language border within
Switzerland to estimate the effects of culture on unemployment. The language border
(German and French) is a cultural divide within Switzerland, with villages on the left and the right side of the border having different attitudes. The authors use highly dis-
aggregated data (i.e. for each village) on various national referenda on working time
regulations. The voting outcomes per community are used to define an indicator of the
‘taste for leisure’ as one particular indicator of the local culture. When plotting the
‘taste for leisure’ of a community/village against the distance to the language border,
they find a discontinuous change at the language border. The ‘taste for leisure’ (treatment D) may in turn have an effect on the intensity of job search efforts and thus on the duration of unemployment spells Y. They use commuting distance to the language border from
each village as an instrument Z . A crucial aspect of their identification strategy is thus
that changing the location of the village (e.g. from the German speaking to the French
speaking side) only changes Y via the ‘taste for leisure’ D. Very importantly, the lan-
guage border is different from administrative state borders, which implies that the same
unemployment laws and regulations apply to the left and right side of the border. They
also find that the distribution of many other community covariates X is continuous at
the border: local taxes, labour demand (vacancies etc.), age and education structure etc.
On the other hand, they also find discontinuities at the language border in the distri-
bution of some other community characteristics X , mainly in the use of active labour
market programmes and sanctions by the public employment services as well as in the
number of firms. To avoid allowing these covariates to bias the estimate, they control
for them.
Example 6.15 In most countries, the year when a child enters school depends on whether
a child was born before or after a fixed cut-off date, e.g. 1 July. A child born before 1
July would enter school in this school year, whereas a child born after 1 July would enter
school in the next school year. Comparing two children born close to the cut-off date, the
child born before the cut-off enters school now, whereas the other child born a few days
later enters school next year. The ‘age of entry’ in school thereby differs nearly a year.
Usually, the assignment according to this regular school starting age is not strict and
parents can advance or delay their child. Nevertheless, in most countries one observes a
clear discontinuity in ‘age of entry’ around the cut-off, corresponding to a fuzzy design.
This school-entry rule has been used in several research articles to estimate the returns
to the years of education: in many countries, pupils have to stay in school compulso-
rily until a specific age, e.g. until their 16th birthday, after which they can drop out of
education voluntarily. Children who entered school effectively one year later thus can
drop out with less schooling than those who entered school at the younger age. This
discontinuity is also visible in the data. One problem with this identification strategy,
though, is that the birth cut-off date has several effects: not only is there an effect on the
number of school years attended, but also on the age of school entry, which in itself not only affects the absolute age of the child at school entry but also the relative age within
the class, i.e. the age compared to the schoolmates: children born before the cut-off date
tend to be the youngest in the class, whereas those born after the cut-off are the oldest
in the class. The relative age may be an important factor in their educational develop-
ment. Hence, the birth date has several channels, and attribution of the observed effects
to these channels is not possible without further assumptions. Fredriksson and Öckert
(2006) aim to disentangle the effects of absolute and relative age at school entry. They
are mainly interested in the effect of absolute age, without a change in relative age,
because the policy question they are interested in is a nationwide reduction in school
starting age, which obviously would reduce the school starting age for everyone without
affecting the relative age distribution. They assume that the relative age effect is fully
captured by the rank order in the age distribution within school and exploit the within
school variation in the age composition across cohorts to estimate the relative age effect.
Because of natural fluctuations in the age composition of the local school population and
postponed or early entry of some school children, it is possible that children with the
same age rank have quite different absolute ages (particularly for small schools in rural
areas). They thus estimate the effect of changes in absolute age while keeping the age
rank (X ) constant. Fully non-parametric identification is not possible in this approach
and their estimates therefore rely on extrapolations from their applied parametric
model.
Now we consider the case of confounding. Figure 6.5 shows the classical case of
confounding where there are variables X that determine Z and D or Y . An interesting
example is when looking at dynamic treatment assignment. Past treatment receipt may
affect the outcome as well as current treatment receipt, and the past value of the eligi-
bility variable Z t−1 may be correlated with the current one. This scenario is depicted in
Figure 6.6, which is a special case of Figure 6.5 for setting X = Z t−1 .
Example 6.16 Van der Klaauw (2008) analyses a policy where schools with a poverty
rate above a certain threshold z 0,t in year t receive additional subsidies, whereas schools
below the threshold do not. The threshold z 0,t changes from year to year. In addition
to this simple assignment rule, there is one additional feature: schools which received
a subsidy in the previous year continue to receive a subsidy for another year even if
their poverty rate drops below z 0,t . This is called the ‘hold-harmless’ provision. Hence,
treatment status Dt in time t depends on Z t and the threshold z 0,t as well as Z t−1 and
the threshold of last year z 0,t−1 . At the same time it is reasonable to expect that past
poverty Z t−1 is related to current poverty Z t .
In this situation of dynamic treatment assignment, one would like to control for Dt−1 .
If data on Dt−1 is not available, one would like to control for Z t−1 . By this we ensure
that individuals with the same values of the control variables have the same treatment
history. Otherwise, we do not know whether we estimate the effect of subsidies for
‘one-year’ or the ‘cumulative effect’ of subsidies over several years. This, of course, is
important for interpreting the results and for assessing the costs and benefits of the programme.
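To make this bookkeeping concrete, here is a small sketch that constructs treatment status and treatment-history controls for a school panel under one reading of the hold-harmless rule of Example 6.16 (a one-year grace period); the column names, the data layout and that particular reading of the rule are our hypothetical assumptions.

```python
# Sketch: treatment under a one-year hold-harmless rule plus lagged controls, per school.
import pandas as pd

def build_treatment_history(df):
    """df columns: 'school', 'year', 'Z' (poverty rate), 'z0' (that year's cutoff)."""
    df = df.sort_values(["school", "year"]).copy()
    df["eligible"] = (df["Z"] >= df["z0"]).astype(int)
    df["eligible_lag"] = (df.groupby("school")["eligible"].shift(1)
                            .fillna(0).astype(int))
    # one reading of the hold-harmless provision: subsidised if eligible now or eligible last year
    df["D"] = ((df["eligible"] == 1) | (df["eligible_lag"] == 1)).astype(int)
    df["D_lag"] = df.groupby("school")["D"].shift(1)   # control if past treatment is observed
    df["Z_lag"] = df.groupby("school")["Z"].shift(1)   # fallback control if D_lag is unavailable
    return df
```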
Example 6.17 Continuing with Example 6.16, consider a scenario where poverty rates
Z t are time-constant and also z 0,t is time-constant. In this scenario, schools with
Z t > z 0,t also had Z t−1 > z 0,t−1 and Z t−2 > z 0,t−2 etc. In other words, these schools
qualified for the school subsidies in every year, whereas schools with Z t < z 0,t did not
receive any subsidies in the past. In this situation, the simple RDD would measure the
cumulative effects of subsidies over many years. Note that the distribution of past treat-
ment receipt is discontinuous at z 0,t . On the other hand, if school poverty rates vary a
lot over time (or z 0,t varies over time), then it is more or less random whether schools
with Z t slightly above z 0,t in t had been above or below z 0,t−1 in the past year. Hence,
schools slightly above z 0,t in t are likely to have had a similar treatment history in the
past as those schools slightly below z 0,t in t. In this case, the simple RDD measures the
effect of one year of subsidy, and the treatment history is not discontinuous at z 0,t .
Hence, in one case we estimate the effect of current subsidies, whereas in the other
case we estimate the effect of current and previous subsidies. To distinguish between
these scenarios, we can control for Dt−1 , Dt−2 , etc. If data on past treatment status is
not available we could control for Z t−1 , Z t−2 , etc. More complex treatment assignment
rules are conceivable where controlling for past Z and/or D becomes important, e.g. a school may be entitled to a subsidy if Z_t > z_{0,t}, if it has been above the poverty cutoff in at least five of the past ten years, and if it has received subsidies for not more than three years in the past ten years. Such kinds of rules can lead to discontinuities in treatment histories at z_{0,t}.
Confounding as in Figure 6.5 may also occur in other settings. Recall Example 6.9,
the example of splitting school classes in Israel if class size exceeds 40 pupils. It can
very well be that apart from class size there are also other differences, say in observable characteristics.
Covariates X can be incorporated in several ways, yet in almost all applications they are added rather ad hoc in the linear regression (OLS or 2SLS), with a linear or (at most) second-order polynomial in Z and just a linear term in X. Below we discuss an alternative approach that explains how covariates X can be included fully non-parametrically.
14 It is difficult to say which version is more restrictive. For example, it might very well be that RDD-3 is fine, but conditioned on X_i, the variables D_i and (Y_i^1 − Y_i^0) become dependent; recall the examples in Chapter 2.
It should not be hard to identify the unconditional effect for all compliers at z 0 :
$$\lim_{\varepsilon\to 0} E\left[ Y^1 - Y^0 \,\middle|\, D(z_0+\varepsilon) > D(z_0-\varepsilon), Z = z_0 \right]. \qquad (6.29)$$
We identify this effect by first controlling for X and thereafter averaging over it.
Recall that for sharp designs, the population consists only of compliers by defini-
tion, at least at z_0. For fuzzy designs, however, one must ensure to integrate only over
f(x | compliers, z_0).
As discussed in Chapter 4, there are at least three reasons why also the uncondi-
tional effect (6.29) is interesting. First, for the purpose of evidence-based policymaking
a small number of summary measures can be more easily conveyed to the policymak-
ers and public than a large number of estimated effects for each possible X . Second,
unconditional effects can be estimated more precisely than conditional effects. Third,
the definition of the unconditional effects does not depend on the variables included in
X (if it contains only pre-treatment variables). One can therefore consider different sets
of control variables X and still estimate the same object, which is useful for examining
robustness of the results.
It is typically assumed that the covariates X are continuously distributed, but this is an
assumption made only for convenience to ease the exposition, particularly in the deriva-
tion of the asymptotic distributions later on. Discrete covariates can easily be included
in X at the expense of a more cumbersome notation. Note that identification does not
require any of the variables in X to be continuous. Only Z has to be continuous near z 0 .
We will see below that the derivation of the asymptotic distribution only depends on the
number of continuous regressors in X as discrete covariates do not affect the asymptotic
properties. As before, we must assume that only compliers, never- and always-takers
exist. Assumptions RDD-1 and RDD-2 are assumed to hold conditional on X . We can
summarise the additional assumptions for conditional RDD as follows:
Assumption RDD-4 Let $N_\varepsilon$ be a symmetric $\varepsilon$-neighbourhood about $z_0$ and partition $N_\varepsilon$ into $N_\varepsilon^+ = \{z : z \geq z_0, z \in N_\varepsilon\}$ and $N_\varepsilon^- = \{z : z < z_0, z \in N_\varepsilon\}$. Then we need the following three conditions:
(i) Common support: $\lim_{\varepsilon\to 0} \operatorname{Supp}(X|Z \in N_\varepsilon^+) = \lim_{\varepsilon\to 0} \operatorname{Supp}(X|Z \in N_\varepsilon^-)$.
(ii) Density at threshold: $f_Z(z_0) > 0$; $\lim_{\varepsilon\to 0} F_{X|Z \in N_\varepsilon^+}(x)$ and $\lim_{\varepsilon\to 0} F_{X|Z \in N_\varepsilon^-}(x)$ exist and are differentiable in $x \in \mathcal{X}$ with pdf $f^+(x|z_0)$ and $f^-(x|z_0)$, respectively.
(iii) Bounded moments: $E[Y^d|X,Z]$ is bounded away from $\pm\infty$ a.s. over $N_\varepsilon$, $d \in \{0,1\}$.
Assumption RDD-4 (i) corresponds to the well-known common support assumption
we discussed, e.g. for matching. It is necessary because we are going to integrate over
the support of X in (6.27). If it is not satisfied, one has to restrict the LATE to be the
local average treatment on the common support. Assumption RDD-4 (ii) requires that
there is positive density at z 0 such that observations close to z 0 exist. We also assume
the existence of the limit density functions f + (x|z 0 ) and f − (x|z 0 ) at the threshold z 0 .
So far we have not assumed their continuity; in fact, the conditional density could be
discontinuous, i.e. $f^+(x|z_0) \neq f^-(x|z_0)$, in which case controlling for X might even
be important for identification and thus for consistent estimation. Assumption RDD-4
(iii) requires the conditional expectation functions to be bounded from above and below
in a neighbourhood of z 0 . It is invoked to permit interchanging the operations of inte-
gration and taking limits via the Dominated Convergence Theorem. This assumption
could be replaced with some other kind of smoothness conditions on E[Y d |X, Z ] in a
neighbourhood of z 0 .
Adding Assumption RDD-4 to Assumptions RDD-3* (or RDD-3) conditioned on
X , the LATE for the subpopulation of local (at z 0 ) compliers is non-parametrically
identified as
$$\lim_{\varepsilon\to 0} E\left[ Y^1 - Y^0 \,\middle|\, Z \in N_\varepsilon, \text{complier} \right] = \lim_{\varepsilon\to 0} \int E\left[ Y^1 - Y^0 \,\middle|\, X, Z \in N_\varepsilon, \text{complier} \right] dF(X \mid Z \in N_\varepsilon, \text{complier}) \qquad (6.30)$$
the fraction of compliers at z 0 . So the ratio of integrals gives the ITT effect of Z on Y
multiplied by the inverse of the proportion of compliers. This identifies the treatment
effect for the compliers in the fuzzy design. Without any restrictions on treatment effect
heterogeneity, it is impossible to identify the effects for always- and never-takers since
they would never change treatment status in a neighbourhood of z 0 .
For the estimation one proceeds as usual, starting with the non-parametric estimation
of m + (·), m − (·), p+ (·) and p− (·) at all points X i . This can be done by local linear
estimation; e.g. an estimate of m + (x) is the value of a that solves
$$\arg\min_{a,\, a_z,\, a_x} \sum_{i=1}^{n} \left( Y_i - a - a_z (Z_i - z_0) - a_x (X_i - x) \right)^2 \cdot K_i \, \mathbb{1}_i^+ \qquad (6.33)$$
where K h∗ (u) is a boundary kernel function, see below for details. For establishing the
asymptotic properties of our non-parametric estimator we need some assumptions which
we have seen in similar form in Chapter 2.
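For illustration, a minimal sketch of (6.33) in R (the function name m_plus, the Gaussian kernel, and all bandwidths and data are illustrative assumptions, not the estimator used for the asymptotic results below): a weighted least-squares fit using only observations with Z at or above z_0, whose intercept estimates m^+(x).

```r
# Sketch of (6.33): local linear estimate of m+(x) = lim E[Y | X = x, Z -> z0 from the right].
# Weighted least squares with a product kernel; only observations with Z >= z0 get weight.
m_plus <- function(y, z, x, z0, x0, hz, hx) {
  w <- dnorm((z - z0) / hz) * dnorm((x - x0) / hx) * (z >= z0)   # K_i * 1_i^+
  fit <- lm(y ~ I(z - z0) + I(x - x0), weights = w)
  unname(coef(fit)[1])                    # the intercept 'a' is the estimate of m+(x0)
}

# toy data (hypothetical): true jump of 0.5 at z0 = 0.5, linear effect of x
set.seed(2)
n <- 2000; z <- runif(n); x <- rnorm(n)
y <- 1 + 0.5 * (z >= 0.5) + 0.3 * x + rnorm(n)
m_plus(y, z, x, z0 = 0.5, x0 = 0, hz = 0.1, hx = 0.5)   # should be close to 1.5
```

The same sketch with the indicator reversed (Z < z_0) gives m^-(x), and replacing Y by D gives p^+(x) and p^-(x).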
Assumption RDD-5
THEOREM 6.2 Under Assumptions RDD-1, 2, 3 (or 4) and 5 without (v), the bias and
variance terms of $\hat{\alpha}_{CRDD}$ are of order
$$\operatorname{Bias}(\hat{\alpha}_{CRDD}) = O\!\left(h^2 + h_z^2 + h_x^{\lambda}\right), \qquad \operatorname{Var}(\hat{\alpha}_{CRDD}) = O\!\left(\frac{1}{nh} + \frac{1}{n h_z}\right).$$
Adding Assumption RDD-5 (v), the estimator is asymptotically normally distributed and
converges at the univariate non-parametric rate
$$\sqrt{nh}\,\left(\hat{\alpha}_{CRDD} - \alpha\right) \to N(\mathcal{B}, \mathcal{V})$$
where the bias $\mathcal{B}$ contains terms of the form
$$-\alpha \int \left[ \frac{\partial^r p^+(x)}{r!\,\partial x_l^r} + \sum_{s=1}^{r-1} \omega_s^+ \frac{\partial^s p^+(x)}{\partial x_l^s} - \frac{\partial^r p^-(x)}{r!\,\partial x_l^r} - \sum_{s=1}^{r-1} \omega_s^- \frac{\partial^s p^-(x)}{\partial x_l^s} \right] \times \frac{f^-(x, z_0) + f^+(x, z_0)}{2 f(z_0)}\, dx,$$
with weights $\omega_s^+$ (and analogously $\omega_s^-$) built from higher-order derivatives of $f^+(\cdot, z_0)$, involving terms such as $\frac{1}{s!(r-s)!}\frac{\partial^{r-s} f^+(X_i, z_0)}{\partial x_l^{r-s}}$ and $\frac{(r-2)!}{(r-1)!\,s!\,(r-1-s)!}\frac{\partial^{r-1-s} f^+(X_i, z_0)}{\partial x_l^{r-1-s}}$.
$$\int \left( \,\cdots\, - E\left[ Y \mid D=0, X=x, Z=z_0 \right] \right) \frac{f^+(x|z_0) + f^-(x|z_0)}{2}\, dx.$$
Here the E [Y |D, X, Z = z 0 ] can be estimated by a combination of the left- and right-
hand side limits. This approach no longer relies only on comparing observations
across the threshold but also uses variation within either side of it. This has
a structure similar to (6.31) and (6.36).
Note that assuming (6.37), one can also estimate the entire potential outcome
quantiles and distributions, cf. Chapter 7.
The main appeal of the RDD approach rests on the idea of a local randomised experiment.
This interpretation suggests some checks and diagnostic tools for judging the
plausibility of the identification assumptions. An obvious one is to obtain data from a
time point before treatment was implemented (or even announced, to exclude anticipa-
tion effects) to see whether there was already a significant difference between groups
(of type E[Y|X, D = 1] − E[Y|X, D = 0]) before the treatment started. This brings
us back to the idea of DiD; see Section 6.3.3 for bias stability and plausibility checks.
Recall also the discussion on checking for pseudo-treatment effects, e.g. in Chapter 4.
In Section 6.1.1 we already gave some ideas about potential manipulation of Z; we start
this section by explaining that issue in more detail. But before we begin, first notice
that the concerns about self-selection, manipulation etc., in brief, most of the potential
sources of identification problems due to sample selection biases, have their origin in
potentially heterogeneous treatment effects, (Y_i^1 − Y_i^0) ≠ constant. Consequently,
people might want to manipulate their Z_i or the threshold z_0 in line with their expectations.
If the treatment effect is expected to be positive for everybody, then one would expect a
discontinuity in f_Z at z_0 in the form of an upward jump.
own interests, but even after such modifications we still have some randomness left so
that FZ |U (z 0 |u) is neither zero nor one (i.e. the individual may manipulate Z i but does
not have full control over it). In addition, f Z (z 0 ) > 0 indeed implies that for some
individuals it was a random event whether their Z happened to be larger or smaller than
z 0 . You can see that it is important to know whether individuals have information about
z 0 in advance. If z 0 is unknown at the time of manipulation, then it is more likely that
it will be random whether a Z ends up on the left or right of z 0 . On the other hand,
if z 0 is known, it is more likely that strategic manipulation around the threshold is not
random.
Consider the situation where students have to attend a summer school if they fail on
a certain math test. Some students may want to avoid summer school (and therefore
aim to perform very well on the test), whereas others like to attend summer school
(and therefore want to perform poorly on the test). The important point here is that the
students are unlikely to sort exactly around the threshold. The reason is that even when
they purposefully answer some of the test items correctly or incorrectly, they may not
know with certainty what their final score will be and/or may not know the threshold
value z_0. Hence, although the score Z_i may not truly reflect the ability of student i (and
true ability may not even be monotone in Z), among those with final score Z close to
z 0 it is still random who is above and who is below.
On the other hand, the situation is different for those who grade the exam. They have
control over the outcome and they can manipulate the test scores. Nonetheless, there is
still no need to worry as long as they do not know the value of z 0 . For example, grading
might be done independently by several people and z 0 is set such that, say, 20% of
all pupils fail the test. In this case, exact manipulation around z 0 is nearly impossible.
Certainly, if they know z 0 in advance, they can manipulate scores around the threshold.
We distinguish two types of manipulations: (a) random manipulations and (b) selection
on unobservables. As an example, suppose the graders attempt to reduce the class size of the
summer school programme, so they increase the scores of a few individuals who
had scored slightly below z_0 so that these students now end up above z_0. If they select
these students independently of their treatment effect, the RDD would still be
valid. But if the manipulation of the exam grading is based on the teacher’s expectation
(unobserved to the econometrician) of the individual treatment effects, then we expect
this to lead to inconsistent estimates. An interesting observation is that such
manipulation often goes in one direction only, which would imply a discontinuity of f_Z
at z_0. Consequently, if we detect a discontinuity of f_Z at z_0 in the data, this might be a
sign of possible manipulation.
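A rough descriptive version of this check is easy to code; the sketch below (function name and all numbers are hypothetical) simply compares counts in narrow bins just below and just above z_0. A formal alternative is the McCrary density test, e.g. DCdensity in the R package rdd.

```r
# Rough bunching check: compare the number of observations in narrow bins just below
# and just above z0. Under a continuous density the split should be close to 50/50;
# strong asymmetry may signal manipulation of Z around the threshold.
density_jump_check <- function(z, z0, binwidth = 0.01, nbins = 5) {
  below <- sum(z >= z0 - nbins * binwidth & z <  z0)
  above <- sum(z >= z0                    & z <  z0 + nbins * binwidth)
  binom.test(above, above + below, p = 0.5)
}

set.seed(3)
z <- rnorm(10000)                       # hypothetical assignment variable, cutoff z0 = 0
push <- z > -0.05 & z < 0               # simulate 'pushing' of scores just below z0 above it
z[push] <- z[push] + 0.05
density_jump_check(z, z0 = 0)           # small p-value flags possible manipulation
```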
Example 6.18 In Example 6.8, Anderson, Dobkin and Gross (2012) exploited the dis-
continuity around age 19 to estimate effects of insurance coverage. Clearly, individuals
cannot manipulate their age but they can react in anticipation of their birthday, i.e.
individuals could shift the timing of health care visits across the age of 19. Hence,
individuals may shift the timing of healthcare visits from the uninsured period to the
insured period. So they may ‘stockpile’ healthcare shortly before coverage expires. Such
behaviour would confound the RDD estimates as they would capture mostly short-
term inter-temporal substitution responses. The authors, however, found no evidence
that individuals would shift the timing of healthcare visits in anticipation of gaining or
losing insurance coverage.
Example 6.19 In earlier examples, the class size rule in Israel had been used to estimate
the effects of small classes in school on later outcomes. A similar class size rule existed
in Chile, which mandated a maximum class size of 45. This rule should lead to large
drops in average class size at grade-specific enrolment levels of 45, 90, 135 etc. stu-
dents. Histograms of school enrolment levels, however, show clear spikes, with higher
numbers of schools at or just below these thresholds. This shows clear evidence for (at
least some) precise sorting of schools around these thresholds: in order to avoid splitting
classes as mandated by law (which would require more teachers and more classrooms),
schools appear to be able to discourage some students from enrolling in their school.
Such patterns raise doubts about the validity of the RDD assumptions since the schools
close to the left of the thresholds also contain those schools that deliberately
intervened to avoid splitting classes. These might differ in observables and unobserv-
ables from those to the right of the threshold. One might nevertheless hope that controlling for
covariates X might solve or at least ameliorate this problem. One could inspect if the
conspicuous spikes in school enrolment remain after controlling for some covariates X
or if they remain only in some subgroups.
So we have seen in several examples that when using RDD as the identification strat-
egy, it is important to check if there is sorting or clumping around the threshold that
separates the treated and untreated. This is particularly important when the thresholds
that are used for selecting people are known to the public or to politicians, and peo-
ple can easily shift their Z from below z 0 to above or vice versa. If individuals have
control over the assignment variable Z or if administrators can strategically choose the
assignment variable or the cut-off point, the observations may be strategically sorted
around the threshold such that comparing outcomes left and right will no longer be
a valid approach. Whether such behaviour might occur depends on the incentives and
abilities to affect the values of Z or even z_0 (no matter whether it is the potentially
treated or the agents responsible for conferring admission). Generally, such sorting is
unlikely if the assignment rule is unknown or if the threshold is unknown or uncer-
tain, or if agents have insufficient time for manipulating Z . Generally, manipulation is
only a concern if people have (perfect) control over the placing of their Z below or
above z 0 .
Example 6.20 Another example is a university entrance admission test (or GRE test)
which can be taken repeatedly. If individuals know the threshold test score z 0 , those
scoring slightly below z 0 might retake the test, hoping for a better test result. Unless
the outcomes of repeated tests are perfectly correlated, this will lead to much lower
density f_Z at locations slightly below z_0 and much higher density above z_0. We might
then be comparing people who took the test only once with those who took it repeatedly,
who might also differ in other characteristics. Hence, the RDD would most likely be
invalid. Even using only the people who took the test just once could be invalid, as this
would result in a very selective sample, where selection is likely to
be related to the unknown treatment effect. The correct way to proceed is to use all
observations and to define Z for each individual as the score obtained the first time the
test was taken. Clearly, this will lead to a fuzzy design, where the first test score basically
serves as an instrument for treatment, e.g. for obtaining the GRE. See Jepsen, Mueser
and Troske (2009).
Let us finally come back to Example 6.3 and consider the problem of potential
manipulation based on mutual agreement or on anticipation.
Example 6.21 Recall the example of the policy reform in Austria that provided a longer
unemployment benefit duration in certain regions of Austria but only for individuals
who became unemployed at age 50 or older. A clear concern is that employers and
employees might collude to manipulate age at entry into unemployment. Firms could
offer to postpone laying off their employees until they reach the age of 50, provided
the employees are also willing to share some of their gains, e.g. through higher effort in
their final years. In this case, the group becoming unemployed at the age of 49 might
be rather different from those becoming unemployed at age 50. Therefore Lalive (2008)
examines the histogram of age at entry into unemployment. If firms and workers agreed
to delay a layoff until the age of 50, then the histogram should show substantially more
entries into unemployment at age 50 than just below. A discontinuity of the density at the
threshold may indicate that employers and employees actively changed their behaviour
because of this policy. This could induce a bias in the RDD if the additional layoffs
were selective, i.e. if they had different counterfactual unemployment duration. Indeed,
Lalive (2008) finds an abnormal reaction at the age threshold for women.
Another way to address the above concerns about manipulation is to examine the exact
process by which the policy change was enacted. If the change in the legislation was passed
rather unexpectedly, i.e. rapidly and without much public discussion, it may have come as a
surprise to the public. Similarly, if the new rules apply retrospectively, e.g. to all cases
that had become unemployed six months earlier, these early cases might not have been
aware of the change in the law at the time they became unemployed; for more details on
this see also Example 6.21.
this threshold. Even more difficult is the situation where other law changes happen not at but
close to z_0, e.g. a different rule applying to firms with more than 8 employees: to obtain a sufficient
sample size we would often like to include firms with 7, 8, 9 and 10 employees in our control group,
but in order to do so we need that there is no such break at 8.
Next, simple graphical tools can be helpful for finding possible threats to the validity
of the RDD. First, there should indeed be a discontinuity in the probability of treatment
at z 0 . Therefore, one can plot the functions E [D|Z = z 0 + ε] and E [D|Z = z 0 − ε]
for ε ∈ (0, ∞). One way of doing this is to plot averages of D for equally sized
non-overlapping bins along Z , on either side of the cut-off. It is important that these
bins are either completely left or right of the cut-off z 0 , such that there should be no
bin that includes points from both sides of z 0 . This is to avoid smoothing over the
discontinuity at z 0 , which, if the jump really existed, would be blurred by pooling obser-
vations from left and right. Similarly, we could plot the functions E [Y |Z = z 0 + ε] and
E [Y |Z = z 0 − ε] for ε ∈ (0, ∞). If the true treatment effect is different from zero, the
plot should reveal a similar discontinuity at the same cut-off in the average outcomes.
There should be only one discontinuity at z 0 . If there happen to be other discontinuities
for different values of Z , they should be much smaller than the jump at z 0 , otherwise
the RDD method will not work.
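A minimal sketch of such a plot in R (function name, binwidth and data are hypothetical): bin edges are chosen so that z_0 is itself an edge, hence no bin straddles the cut-off.

```r
# Binned means of D (or Y) against Z, with z0 as a bin edge so that no bin
# pools observations from both sides of the cut-off.
binned_means <- function(v, z, z0, binwidth) {
  bin <- z0 + binwidth * floor((z - z0) / binwidth)   # left edge of each bin
  aggregate(v, by = list(bin = bin), FUN = mean)
}

set.seed(4)
n <- 4000; z <- runif(n); z0 <- 0.5
d <- rbinom(n, 1, 0.1 + 0.7 * (z >= z0))              # fuzzy design (hypothetical)
y <- 2 + 1 * d + 0.5 * z + rnorm(n)
b_d <- binned_means(d, z, z0, binwidth = 0.05)
plot(b_d$bin + 0.025, b_d$x, type = "b", xlab = "Z", ylab = "mean of D per bin")
abline(v = z0, lty = 2)                               # jump in treatment probability at z0
b_y <- binned_means(y, z, z0, binwidth = 0.05)
plot(b_y$bin + 0.025, b_y$x, type = "b", xlab = "Z", ylab = "mean of Y per bin")
abline(v = z0, lty = 2)                               # corresponding jump in the outcome
```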
If one has access to data on additional covariates that are related to Y , say X , one
can plot the functions E[X |Z = z 0 + ε] and E[X |Z = z 0 − ε] for ε ∈ (0, ∞). An
implication of the local randomised experiment interpretation is that the distribution of
all pre-treatment variables should be continuous at z 0 . Individuals on either side of the
threshold should be observationally similar in terms of observed as well as unobserved
characteristics. Hence, if we observe pre-treatment variables in our data, we can test
whether they are indeed continuously distributed at z 0 . If they are discontinuous at z 0 ,
the plausibility of the RDD is reduced. One should note, though, that this last implication
is a particular feature of Lee (2008, Condition 2b) and not of the RDD per se. But ideally,
X should not have any discontinuity at z 0 . If a discontinuity at z 0 is observed, one might
be concerned about potential confounding and has to apply the RDD with covariates,
i.e. one has to include (condition on) X .
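A simple parametric version of this covariate check (a sketch only; function name, bandwidth and data are illustrative): within a window around z_0, regress a pre-treatment covariate on a side indicator and side-specific linear trends and look at the estimated jump.

```r
# Discontinuity check for a pre-treatment covariate X at z0: the coefficient on
# 'above' estimates the jump of E[X | Z = z] at the cut-off; it should be near zero.
covariate_jump <- function(x, z, z0, h) {
  keep  <- abs(z - z0) <= h
  above <- as.numeric(z >= z0)
  summary(lm(x ~ above + I(z - z0) + above:I(z - z0), subset = keep))$coefficients["above", ]
}

set.seed(5)
z <- runif(5000)
x <- 10 + 2 * z + rnorm(5000)              # hypothetical covariate, continuous at z0
covariate_jump(x, z, z0 = 0.5, h = 0.1)    # estimate, std. error, t value, p value
```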
Example 6.22 In the class size Example 6.19 in Chile, clear differences in student char-
acteristics left and right of the thresholds were observed. Private school students to the
left of the thresholds (who had larger classes) had lower average family incomes than
the students right of the thresholds (in smaller classes). Hence, students were not only
exposed to different class sizes; they were also different in background characteristics.
Example 6.23 Recall again Example 6.3. Lalive (2008) has the advantage of having
control regions available that were not affected by the policy change. This permits
comparing the histogram of age at entry into unemployment in the treated regions
with that in the non-treated regions. We could also look at an alternative Z,
one that measures distance to the regional border (with z 0 = 0) to an adjacent region
that is not subject to the policy. Now you could use either threshold (age 50 and/or
border between regions) to estimate the treatment effect and compare the outcomes.
Recall further the concern mentioned in Example 6.21 that manipulation has taken
place via anticipation. The implementation of the reform is strongly related to the his-
tory of the Austrian steel sector. After the Second World War, Austria nationalised its
iron, steel and oil industries into a large holding company, the Oesterreichische Industrie
AG (OeIAG). In 1986 a large restructuring plan was envisioned with huge lay-offs due to
plant closures and downsizing, particularly in the steel industry. With such large public
mass lay-offs planned, a social plan with extended unemployment benefit durations was
enacted, but only in those regions that were severely hit by the restructuring and only for
workers of age 50 and older with a continuous work history of at least 780 employment
weeks during the last 25 years prior to the current unemployment spell. Only work-
ers who lived since at least 6 months prior to the lay-off in the treatment regions were
eligible for the extended benefits. In his analysis, only individuals who entered unem-
ployment from a non-steel job were examined. The reason for focusing on non-steel jobs is that these workers
should only be affected by the change in the unemployment benefit system, whereas
individuals entering unemployment from a job in the steel industry were additionally
affected by the restructuring of the sector. The identification strategy uses as threshold
the border between treated and control regions. The ‘region of residence’ was harder to
manipulate as the law provided access to extended benefits only if the person had lived
in that region since, as stated, at least 6 months prior to the claim. Selective migration
is still possible, but workers would have to move from control to treated regions well in
advance.
Example 6.24 Lee (2008) examines the effect of incumbency on winning the next elec-
tions in the USA for the House of Representatives (1900 to 1990). He shows graphically
that if in an electoral district the vote share margin of victory for the democratic party
was positive at time t, it has a large effect on winning the election in t + 1. On the other
hand, if it was close to zero and thus more or less random whether the vote share hap-
pened to be positive or negative in t, conditional on being close to zero (our z_0), it should
not be related to earlier election outcomes, e.g. in t − 1. In other words, for these districts the
sign of the vote share margin in t should have no correlation with earlier periods. Again,
this was examined graphically by plotting the Democratic Party’s probability of victory in
election t − 1 against the margin of victory in election t.
A different way to use observations from earlier periods is discussed in the next sub-
section. An alternative diagnostic test is suggested in Kane (2003) to inspect whether the
RDD treatment effect estimates captured a spurious relationship. His idea is to analyse
the threshold z_0 itself, i.e. where the discontinuity is actually located and whether for some
individuals it was not at z_0. He suggests examining whether the actual threshold z_0 fits
the data better than an alternative threshold nearby. If we express the estimator in a
likelihood context, we obtain a log likelihood value of the model when exploiting the
threshold z 0 and would similarly obtain a log likelihood value if we pretended that the
threshold was z 0 + c for some positive or negative value of c. Repeating this exercise
for many different values of c, we can plot the log-likelihood value as a function of c.
A conspicuous spike at c = 0 would indicate that the discontinuity is indeed where
we thought it to be. Similarly, we could apply break-point tests from time series econo-
metrics to estimate the exact location of the discontinuity point. Finding only a single
break-point which in addition happened to be close to z 0 would be reassuring.
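A least-squares analogue of this placebo-cutoff idea is easy to sketch (a likelihood version would proceed in the same spirit; all names and numbers below are hypothetical): re-estimate the jump at pretended cutoffs z_0 + c and plot how pronounced the estimated discontinuity is as a function of c.

```r
# Placebo-cutoff check: estimate the discontinuity at pretended cutoffs z0 + c.
# A pronounced peak at c = 0 suggests the jump is indeed located at z0.
jump_tstat <- function(y, z, cut, h) {
  keep  <- abs(z - cut) <= h
  above <- as.numeric(z >= cut)
  coef(summary(lm(y ~ above + I(z - cut) + above:I(z - cut), subset = keep)))["above", "t value"]
}

set.seed(6)
z <- runif(8000)
y <- 1 + 0.4 * (z >= 0.5) + z + rnorm(8000)                    # true cutoff at z0 = 0.5
cs <- seq(-0.2, 0.2, by = 0.02)
tv <- sapply(cs, function(cc) jump_tstat(y, z, cut = 0.5 + cc, h = 0.1))
plot(cs, abs(tv), type = "b", xlab = "shift c of the pretended cutoff",
     ylab = "|t value| of the estimated jump")                 # spike expected at c = 0
```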
Finally, consider the case of mixed designs where nobody below z 0 is treated but some
people above z 0 decide against treatment. Imagine you would like to estimate ATET for
all people being treated and not only for the subpopulation at threshold z 0 . In order to
do so you need to assume in addition that
$$\lim_{\varepsilon\to 0} E\left[ Y \mid D=0, X=x, Z=z_0+\varepsilon \right] = \lim_{\varepsilon\to 0} E\left[ Y \mid X=x, Z=z_0-\varepsilon \right].$$
Hence, one can test (6.39) and thereby the joint validity of the RDD and the selection-
on-observables assumption at z 0 . Of course, non-rejection at z 0 does not ensure that
selection-on-observables is valid at other values of z. We would nevertheless feel more
confident in using assumption (6.38) to estimate ATET for the entire population. These
derivations can immediately be extended to the case where Z is a proper instrumental
variable, i.e. not only at a limit point. In other words, if Pr(D = 0|Z ≤ z̃) = 1 for some
value z̃, the ATET can be identified.
We can also rewrite this common trend assumption as a bias stability assumption in the
neighbourhood of z 0 , i.e.
$$\lim_{\varepsilon\to 0} E\left[ Y_{t=1}^d \mid Z = z_0+\varepsilon \right] - \lim_{\varepsilon\to 0} E\left[ Y_{t=1}^d \mid Z = z_0-\varepsilon \right] = \lim_{\varepsilon\to 0} E\left[ Y_{t=0}^d \mid Z = z_0+\varepsilon \right] - \lim_{\varepsilon\to 0} E\left[ Y_{t=0}^d \mid Z = z_0-\varepsilon \right].$$
In the sharp design, we showed in (6.16) that the (kernel weighted) regression of Yt=1
on a constant, D, (Z − z 0 ) D and (Z − z 0 ) (1 − D) non-parametrically estimates the
effect in the period t = 1. With two time periods t = 0, 1 and Assumption DiD-RDD,
we would regress
Example 6.25 Recall Examples 6.3 and 6.23 of Lalive (2008). In his study he gives a
nice application to study the effects of maximum duration of unemployment benefits
in Austria combining RDD with difference-in-differences (DiD) estimation. We already
discussed that he actually had two discontinuities he could exploit for estimating the
treatment effect of extended unemployment benefits: the one at age z_0 = 50, and the
one at administrative borders, as this law was applied only in certain regions. In addition,
Lalive (2008) also has access to the same administrative data for the time period
before the introduction of the policy change. If the identification strategy is valid for that
period, we should observe a difference neither at the age nor at the regional threshold before
the policy change. So we can estimate pseudo-treatment effects like in the DiD case.
In this example also pre-programme data could be used for a pseudo-treatment anal-
ysis. The RDD compares either individuals on both sides of the age 50 threshold or
geographically across the border between affected and unaffected regions. Using the
same definitions of treatment and outcome with respect to a population that became
unemployed well before the reform, one would expect a pseudo-treatment effect of zero,
because the treatment was not yet enacted. If the estimate is different from zero, it may
indicate that differences in unobserved characteristics are present even in a small neigh-
bourhood across the border. On the one hand, this would reduce the appeal of the RDD
assumptions. On the other hand, one would like to account for such differences in a
DiD-RDD approach, i.e. by subtracting the pseudo-treatment effect from the treatment
effect.
Analogous results can be obtained for DiD-RDD with a fuzzy design. A Wald-type
estimator in the DiD-RDD setting is
$$\frac{\lim_{\varepsilon\to 0} E\left[ Y_{t=1} - Y_{t=0} \mid Z = z_0+\varepsilon \right] - \lim_{\varepsilon\to 0} E\left[ Y_{t=1} - Y_{t=0} \mid Z = z_0-\varepsilon \right]}{\lim_{\varepsilon\to 0} E\left[ D \mid Z = z_0+\varepsilon \right] - \lim_{\varepsilon\to 0} E\left[ D \mid Z = z_0-\varepsilon \right]}. \qquad (6.43)$$
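A minimal sketch of (6.43) in R with the limits replaced by local means in a small window around z_0 (local linear fits would be the refined version; the function name, bandwidth and simulated data are hypothetical):

```r
# Wald-type DiD-RDD estimator (6.43), limits replaced by local means in a window of width h.
did_rdd_wald <- function(y1, y0, d, z, z0, h) {     # y1, y0: outcomes in t = 1 and t = 0
  right <- z >= z0 & z <= z0 + h
  left  <- z <  z0 & z >= z0 - h
  dy    <- y1 - y0                                  # within-unit difference over time
  (mean(dy[right]) - mean(dy[left])) / (mean(d[right]) - mean(d[left]))
}

set.seed(7)
n <- 6000; z <- runif(n); z0 <- 0.5
d  <- rbinom(n, 1, 0.2 + 0.6 * (z >= z0))           # fuzzy first stage (hypothetical)
y0 <- 1 + 2 * z + rnorm(n)                          # pre-programme outcome
y1 <- y0 + 0.5 + 1.5 * d + rnorm(n)                 # common trend 0.5, treatment effect 1.5
did_rdd_wald(y1, y0, d, z, z0, h = 0.05)            # should be close to 1.5
```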
Example 6.26 Leuven, Lindahl, Oosterbeek and Webbink (2007) consider a programme
in the Netherlands, where schools with at least 70% disadvantaged minority pupils
received extra funding. The 70% threshold was maintained nearly perfectly, which
would imply a sharp design. The existence of a few exceptions nevertheless makes the design
fuzzy, where the threshold indicator can be used as an instrument for
treatment. Given the availability of pre-programme data on the same schools, difference-
in-differences around the threshold can be used. The programme was announced in
February 2000 and eligibility was based on the percentage of minority pupils in the
school in October 1998, i.e. well before the programme started. This reduces the usual
concern that schools might have manipulated their shares of disadvantaged pupils to
become eligible. In that situation, schools would have to have anticipated the subsidy
about one to one-and-a-half years prior to the official announcements. As a check of
such potential manipulation, one can compare the density of the minority share across
schools around the 70% cutoff. In case of manipulation, one would expect a drop in the
number of schools which are slightly below 70% and a larger number above the cut-
off. Data on individual test scores is available for pre-intervention years 1999 and 2000,
and for post-intervention years 2002 and 2003, permitting a DiD-RDD. As a pseudo-
treatment test the authors further examine the estimated effects when assuming that the
relevant threshold was 10%, 30%, 50% or 90%. In all these cases the estimated effects
should be zero since no additional subsidy was granted at those thresholds.
Finally, you might even have data that allow for a mixture of experimental, RDD
and DiD approach. For example, in the pilot phase of PROGRESA, the participating
households were selected in a two-stage design. First, communities were geographically
selected in several states of Mexico. These communities were then randomly allocated
either as treatment or control community. A baseline household survey was collected in
all these communities. From these data a poverty score was calculated for each house-
hold and only households below this poverty score were eligible for conditional cash
transfers. This provides a sharp RDD. Because of the collection of baseline data, i.e.
data from the time before the conditional cash transfer programme started, it is possible
to use DiD, experimental evaluation and RDD separately for the identification. The pro-
gramme was later extended, and the calculation of the poverty score was also changed,
such that various groups might have become beneficiaries later.
Example 6.27 Buddelmeyer and Skoufias (2003) exploit this possibility to judge the
reliability of the RDD regression approach. The experimental data permit a clean
estimation approach, with the baseline data also permitting a test for differences even
before the programme started. At the same time, one could also pretend that no data were
available for the untreated, randomly selected control communities, and estimate
effects by RDD using only the treatment communities (a pseudo non-treatment test).
By comparing this to the experimental estimates, one can judge whether a simple non-
experimental estimator can obtain results similar to those of an experimental design. One would
usually consider the experimental results more credible. However, when comparing the
results one has to bear in mind that they refer to different populations, which may limit
the comparability of the estimates. Nevertheless, one could even conceive a situation
where the RDD can help the experimental design. Suppose the households in the control
communities expected that the pilot programme would also be extended to them in the
near future such that they might have changed their behaviour in anticipation. Clearly,
only the households below the poverty score should change their behaviour (unless there
was belief that the poverty scores would be recalculated on the basis of future data
collection) such that the RDD in the control communities would indicate such kind of
anticipation effects.
Y = α D + g(Z ) + U,
with ∇E[·|Z = z] denoting the first derivative with respect to z. This can certainly be
repeated for more complex models, again introducing fuzzy and mixed designs. An
extensive discussion and overview is given in Card, Lee, Pei and Weber (2015).
In this chapter we have allowed for a situation where the density f (X |Z ) is dis-
continuous at z 0 . However, as stated, if X contains only pre-treatment variables, such a
discontinuity may indicate a failure of the RDD assumptions, see Lee (2008). We will
briefly discuss below his approach assuming continuity of f (X |Z ) at z 0 . Nevertheless,
there could also be situations where f (X |Z ) is discontinuous and all conditions of RDD
still apply. For example, such discontinuity can occur due to attrition, non-response
or other missing data problems. Non-response and attrition are common problems in
many datasets, particularly if one is interested in estimating long-term effects. Assum-
ing ‘missing at random’ (MAR, or conditional on covariates X ) is a common approach
to deal with missing data; see e.g. Little and Rubin (1987). Although controlling for
observed covariates X may not always fully solve these problems, it is nevertheless
helpful to compare the estimated treatment effects with and without X . If the results
turn out to be very different, one certainly would not want to classify the missing-data
problem as fully innocuous.
While the MAR assumption requires that the missing data process depends only on
observables, we could also permit that data might be missing on the basis of unobserv-
able or unobserved variables, c.f. Frangakis and Rubin (1999) or Mealli, Imbens, Ferro
and Biggeri (2004). The methods proposed in this chapter could be extended to allow
for such missing data processes.
Further, differences in X could also be due to different data collection schemes, espe-
cially if different collection schemes were used for individuals above the
threshold z_0 versus those below it. Why should this happen? In practice it is quite
common: treated people are often monitored during the treatment and for a certain
period afterwards, whereas data on control groups are often collected ad hoc at the moment
when a treatment evaluation is requested.
Another reason why one might want to control for X is to distinguish direct from
indirect effects; recall the earlier chapters, especially Chapter 2.
Example 6.28 For further discussion on separating direct from indirect effects see also
Rose and Betts (2004). They examine the effects of the number and types of math
courses during secondary school on earnings. They are particularly interested in sep-
arating the indirect effect of math on earnings, e.g. via increasing the likelihood of
obtaining further education, from the direct effect that math might have on earnings.
They also separate the direct effect of maths from the indirect effects via the choice of
college major. See also Altonji (1995).
all the variables X that were affected, we can still apply the RDD after controlling for
X . In such a situation controlling for X is necessary since the ‘instrumental variable’
Z would otherwise have a direct effect on Y . Such a situation often occurs when geo-
graphical borders are used to delineate a discontinuity. Recall Example 6.14 that was
looking at the language border(s) within Switzerland to estimate the effects of culture
on unemployment. In that example it turned out that the distribution of some community
covariates X, other than language, is also discontinuous at the language borders. To
prevent these covariates from biasing the instrumental variable estimate, one needs to control
for X .
Notice that Example 6.28 refers to treatments D that are no longer binary. We have
discussed this problem before and will come back to it later. The ideas outlined
in the different chapters of this book typically carry over to the RDD case. This brings
us to the question of what happens if Z is discrete. Lee and Card (2008) examine this sit-
uation, where Z is measured only as a discrete variable, for example Z being the
number of children. In such a case, non-parametric identification is not plausible and a
parametric specification is appropriate.
Let us now consider in a little more detail the approach of Lee (2008) assuming con-
tinuous f (X |Z ) in z 0 . He gives an intuitive discussion of assumption (6.5) describing
a selection mechanism under which it is true. Let Ui be unobservable characteristics of
individual i and suppose that treatment allocation depends on some score Z i such that
Di = 11{Z i ≥ z 0 }. Let FZ |U be the conditional distribution function and f Z the marginal
density of Z . He proposes the conditions f Z (z 0 ) > 0, 0 < FZ |U (z 0 |u) < 1 for every
u ∈ Supp(U ), and that its derivative f Z |U (z 0 |u) exists. The intuition is that every indi-
vidual i may attempt to modify or adjust the value of Z i in his own interest, but that even
after such modification there is still some randomness left in that FZ |U (z 0 |u) is neither
zero nor one. In other words, (defiers excluded) each individual may manipulate his Z i
but does not have full control. Actually, f Z (z 0 ) > 0 implies that for some individuals it
was a random event whether Z happened to be larger or smaller than z 0 .
Under this condition it follows that
$$E\left[ Y^1 - Y^0 \mid Z = z_0 \right] = \int \left( Y^1(u) - Y^0(u) \right) \frac{f_{Z|U}(z_0|u)}{f_Z(z_0)} \, dF_U(u), \qquad (6.45)$$
which says that the treatment effect at z 0 is a weighted average of the treatment effect
for all individuals (represented by their value of U ), where the weights are the density
at the threshold z 0 . Those individuals who are more likely to have a value z 0 (large
f Z |U (z 0 |u)) receive more weight, whereas individuals whose score is extremely unlikely
to fall close to the threshold receive zero weight. Hence, this representation (6.45) gives
us a nice interpretation of what the effect $E\left[ Y^1 - Y^0 \mid Z = z_0 \right]$ represents.
The selection mechanism of Lee (2008) permits that individuals may partly self-select
or even manipulate their desired value of Z , but that the final value of it still depends on
some additional randomness. It permits some kind of endogenous sorting of individuals
as long as they are not able to sort precisely around z 0 . Recall the example in which
individuals have to attend a summer school if they fail on a certain mathematics test.
Some students may want to avoid summer school and therefore aim to perform well on
the test, whereas others like to attend summer school and therefore perform poorly on
the test. The important point is that students, however, are unlikely to sort exactly about
the threshold.
15 z_0 can be a scalar in case of a single cut-off point or a vector when we have multiple thresholds.
16 In case of a fuzzy design, E[D|Z = z] is typically expected to be smoother (as a function of z) than E[Y|Z = z]; therefore the bandwidth for the former regression must be larger than for the latter: h_d > h_y.
If one is interested in performing the estimation manually, outside the aforesaid packages,
recall first that the estimate of the treatment effect (in either sharp or fuzzy design) is
the difference of two local regressions at the boundaries of the threshold. In R there
are numerous functions that offer a local polynomial fit, such as locpoly from the
package KernSmooth or npreg from the package np. Since in the RDD context we are
mainly interested in the fit at the boundaries, it is advisable to use a local linear or higher-
degree local polynomial fit. In Stata one can use the command lpoly or
locpoly as a counterpart to fit a local polynomial regression. In any case, the standard
errors or confidence intervals must then be obtained by bootstrap.
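For illustration, a minimal base-R sketch of such a manual computation (uniform kernel within a bandwidth h, local linear fit on each side, bootstrap for the standard error; all names, bandwidths and data are illustrative):

```r
# Manual sharp-design RDD: local linear fit on each side of z0 within bandwidth h;
# the treatment effect is the difference of the two boundary fits at z0.
rdd_sharp <- function(y, z, z0, h) {
  fit_at_z0 <- function(side) {
    keep <- side & abs(z - z0) <= h
    unname(coef(lm(y ~ I(z - z0), subset = keep))[1])   # intercept = fit at z0
  }
  fit_at_z0(z >= z0) - fit_at_z0(z < z0)
}

set.seed(8)
n <- 5000; z <- runif(n)
y <- 1 + 0.7 * (z >= 0.5) + z + rnorm(n)                # true jump 0.7 at z0 = 0.5
est  <- rdd_sharp(y, z, z0 = 0.5, h = 0.1)
boot <- replicate(500, { i <- sample(n, replace = TRUE)
                         rdd_sharp(y[i], z[i], z0 = 0.5, h = 0.1) })
c(estimate = est, bootstrap_se = sd(boot))
```

For a fuzzy design one would compute the analogous difference for D and take the ratio, bootstrapping the ratio as a whole.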
Moreover, one can construct weights around the cut-off point by using directly the
function kernelwts from the rdd package in R. This is useful especially in cases
where there is a mixed design. The choice of kernel can be set by the user in both soft-
ware languages, but the estimation of the treatment effect is not very sensitive to this choice. In
the rdd package and rd command, the default is to take the triangular kernel. For fur-
ther practical guidance of the use and implementation see Imbens and Lemieux (2008),
Lee and Lemieux (2010) or Jacob and Zhu (2012).
6.5 Exercises
1. For Figures 6.3 to 6.6 discuss the different conditional distributions and expectations.
For which do you expect discontinuities at z 0 ?
2. Check the derivation (identification) of the LATE in Chapter 4. Then prove the equal-
ity (6.10); first under Assumption 2, then under Assumption 2’. You may also want
to consult Section 2 of Imbens and Angrist (1994).
3. Recall the parametric model (6.23) for RDD with fuzzy designs. Imagine now one
would model the propensity score as
$$E[D|Z] = \gamma + \bar{\Upsilon}(Z) + \lambda \cdot \mathbb{1}\{Z \ge z_0\}$$
with a parametrically specified function ϒ̄. What would happen if we chose the same
polynomial order for ϒ and ϒ̄, e.g. a third-order polynomial? Show that the solution
to (6.23) is identical to instrumental variable regression of Y on a constant, D, Z , Z 2 ,
Z 3 . What are the excluded instruments?
4. Imagine we face several thresholds z 0 j , j = 1, 2, 3, . . . at which treatment takes
place (as discussed in Section 6.1.3). Imagine that for all these we can suppose to
have sharp design. Consider now equation (6.22). How do you have to redefine D
and/or the sample to be used such that we can still identify and estimate the ATE by
a standard estimator for β?
5. Derive the asymptotics given in Theorem 6.1 for the case of sharp designs, i.e. when
the denominator is not estimated (because it is known to be equal to 1).
6. Revisit Section 6.1.2. Ignoring further covariates X , give an estimator for the
LATE(z 0 ) as in (6.10) in terms of a two-step least-squares estimator using always
the same bandwidth and uniform kernels throughout. Do this first for sharp designs,
then for the case of fuzzy, and finally for mixed designs.
7. Take Assumption (6.8) but conditioned on X . Show that (6.28) holds.
8. Revisit Section 6.3: Make a list of the different plausibility checks, and discuss their
pros and cons.
9. Revisit Section 6.2: Give and discuss at least two reasons (with examples) why the
inclusion of additional covariates might be helpful in the RDD context.
7 Distributional Policy Analysis and
Quantile Treatment Effects
Example 7.1 For studying the union wage premium, Chamberlain (1994) regressed the
log hourly wage on a union dummy for men with 20 to 29 years of work experience and
other covariates. He estimated this premium first for the mean (by OLS), and then for
different income quantiles (τ = 0.1, 0.25, 0.5, 0.75, 0.9). The results were as follows:
For the moment we abstract from a causal interpretation. The results show that on aver-
age the wage premium is 16%, which in this example is similar to the premium for the
median earner. For the lower quantiles it is very large and for the upper quantiles it is
close to zero. Figure 7.1 shows a (hypothetical) distribution of conditional log wages in
the union and non-union sector, which shall illustrate the above estimates.
Figure 7.1 Hypothetical distributions of conditional log wages in the union (solid line) vs
non-union sector (dashed line) along Example 7.1
The main conclusion we can draw from this table and figure is: as expected, the
biggest impact is found for the low-income group. But maybe more importantly, het-
erogeneity of the impact seems to dominate, i.e. the change of the distribution is more
dramatic than the change of the simple mean.
Although there exists a literature on unconditional QTEs without including any con-
founders, we will often treat this as a special, simplified case. Indeed, we have seen in
the previous sections that methods without covariates (X ) require rather strong assump-
tions: either on the experimental design, i.e. assuming that treatment D (participation)
is independent of the potential outcomes, or on the instrument Z , i.e. assuming that Z
is independent of the potential outcomes but relevant for the outcome of D. And as
before, even if one of these assumptions is indeed fulfilled, the inclusion of covariates
can still be very helpful for increasing both the interpretability and the efficiency of the
estimators.
Before we come to the specific estimation of quantile treatment effects let us briefly
recall what we have learnt so far about the estimation of distributional effects. In Chap-
ter 2 we introduced the non-parametric estimators of conditional cumulative distribution
functions (cdf) and densities in a quite unconventional way. We presented them as spe-
cial cases of non-parametric regression, namely by writing F(y|x) = E[11{Y ≤ y}|X =
x], i.e. regressing 11{Y ≤ y} on X by smoothing around x with kernel weights K h (X −x)
in order to estimate conditional cdfs, and by writing f (y|x) = E[L δ (Y − y)|X = x]
for estimating conditional densities (with a given kernel function L δ , see Chapter 2 for
details). The main advantage of this approach has been (in our context) that in all fol-
lowing chapters we could easily extend the identification and estimation of the potential
mean outcomes E[Y d ] (or E[Y d |X = x]) to those of the potential outcome distributions
F(y d ) (or F(y d |x)). This has been explicitly done only in some of the former chapters;
therefore let us revisit this along the example of instrumental variable estimation of
treatment effects.
We will re-discuss in detail the exact assumptions needed for IV estimation in Section
7.2.2. For the moment it is sufficient to remember that our population must be composed
only of so-called always takers T = a (always participate, Di ≡ 1), never takers T = n
(never participate, Di ≡ 0), and compliers T = c (do exactly what the instrument
indicates, Di = 11{Z i > 0}). IV methods never work if defiers exist (or indifferent
subjects that by chance act contrary to common sense). For the cases where these can
be assumed not to exist, one can identify treatment effects at least for the compliers:
$F_{Y^1|T=c}$ and $F_{Y^0|T=c}$.
For identifying distributions (and not ‘just’ the mean), we need the independence
assumption
$$(Y^d, T) \perp\!\!\!\perp Z \quad \text{a.s. for } d = 0, 1. \qquad (7.1)$$
This requires that Z is not confounded with D 0 , D 1 nor with the potential outcomes
Y 0 , Y 1 . Using basically the same derivations as in Chapter 4, it is easy to show that the
potential outcome distributions for the compliers are identified then by the Wald-type
estimator, i.e.
$$F_{Y^1|c}(u) = \frac{E\left[ \mathbb{1}\{Y \le u\} \cdot D \mid Z=1 \right] - E\left[ \mathbb{1}\{Y \le u\} \cdot D \mid Z=0 \right]}{E[D \mid Z=1] - E[D \mid Z=0]},$$
$$F_{Y^0|c}(u) = \frac{E\left[ \mathbb{1}\{Y \le u\} \cdot (D-1) \mid Z=1 \right] - E\left[ \mathbb{1}\{Y \le u\} \cdot (D-1) \mid Z=0 \right]}{E[D \mid Z=1] - E[D \mid Z=0]}.$$
Extensions to the case where we need to include some confounders X such that the
assumptions above are fulfilled at least ‘conditional on X ’ are straightforward. Then, by
using similar derivations one can also show that the potential outcome distributions are
identified by
$$F_{Y^1|c}(u) = \frac{\int \left( E\left[ \mathbb{1}\{Y \le u\} \cdot D \mid X, Z=1 \right] - E\left[ \mathbb{1}\{Y \le u\} \cdot D \mid X, Z=0 \right] \right) dF_X}{\int \left( E[D \mid X, Z=1] - E[D \mid X, Z=0] \right) dF_X},$$
$$F_{Y^0|c}(u) = \frac{\int \left( E\left[ \mathbb{1}\{Y \le u\} \cdot (D-1) \mid X, Z=1 \right] - E\left[ \mathbb{1}\{Y \le u\} \cdot (D-1) \mid X, Z=0 \right] \right) dF_X}{\int \left( E[D \mid X, Z=1] - E[D \mid X, Z=0] \right) dF_X}.$$
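For the simpler case without covariates, the sample analogues of these expressions are straightforward; a sketch (binary instrument, hypothetical data; the version with X would replace the means by non-parametric regressions and then average over X):

```r
# Sample analogues of the Wald-type expressions for the compliers' potential-outcome cdfs
# (binary instrument, no covariates), evaluated at a point u.
F1_c <- function(u, y, d, z) {
  (mean(((y <= u) * d)[z == 1]) - mean(((y <= u) * d)[z == 0])) /
    (mean(d[z == 1]) - mean(d[z == 0]))
}
F0_c <- function(u, y, d, z) {
  (mean(((y <= u) * (d - 1))[z == 1]) - mean(((y <= u) * (d - 1))[z == 0])) /
    (mean(d[z == 1]) - mean(d[z == 0]))
}

set.seed(9)
n <- 10000
type <- sample(c("a", "n", "c"), n, replace = TRUE, prob = c(0.2, 0.2, 0.6))
z <- rbinom(n, 1, 0.5)
d <- ifelse(type == "a", 1, ifelse(type == "n", 0, z))   # always-takers, never-takers, compliers
y <- rnorm(n, mean = (type == "c") + 2 * d)              # treatment shifts Y by 2
F1_c(2, y, d, z)     # roughly pnorm(2, mean = 3) = 0.16 for compliers
F0_c(2, y, d, z)     # roughly pnorm(2, mean = 1) = 0.84 for compliers
```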
also recommend consulting some introductory literature on quantile regression; see our
bibliographical notes.
i.e. we obtain the entire distribution function by estimating (7.2) via mean regression of
11{Y ≤ a} for a grid over supp(Y ).
Recall then that a quantile of a variable Y is defined as
$$Q_Y^\tau = \inf\left\{ a : F_Y(a) \ge \tau \right\}, \qquad \tau \in (0,1).$$
So, in principle one could invert the estimated cdf F̂. However, in practice this can be
quite cumbersome. Therefore, a substantial literature has been developed which aims
to estimate the quantiles directly. We will later see that from a non-parametric view-
point, though, these approaches are quite related. In contrast, for parametric models the
estimation procedures are rather different.
If Y is continuous with a strictly increasing cdf, there will be one unique
value a that satisfies F_Y(a) = τ. This is the
case if FY has a first derivative f Y (the density) with f Y (Q τY ) > 0. Otherwise, the
smallest value is chosen. Note that even if you allow for jumps in F, the cdf is typically
still assumed to be right continuous, and thus the quantile function is left continuous.
Consequently, given a random i.i.d. sample $\{Y_i\}_{i=1}^{n}$, one could estimate the quantile by
$$\hat{Q}_Y^\tau = \inf\left\{ a : \hat{F}_Y(a) \ge \tau \right\},$$
and plug in the empirical distribution function of Y . Such an approach bears a close
similarity with sorting the observed values Yi in an ascending order. A main problem
of this is that its extension to conditional quantiles, i.e. including covariates, is some-
what complex, especially if they are continuous. Fortunately there are easier ways to do
so. Before we consider the most popular alternative quantile estimation strategy, let us
discuss a few important properties of quantiles which will be used in the following.
First, according to the remark above about continuous Y , quantile functions Q τY are
always non-decreasing in τ . They can nevertheless be constant over some intervals.
Second, if Y has cdf F, then $F^{-1}(\tau)$ gives the quantile function, whereas the quantile
function of −Y is given by $Q^\tau_{-Y} = -F^{-1}(1-\tau)$. Furthermore, if h(·) is a non-
decreasing function on $\mathbb{R}$, then $Q^\tau_{h(Y)} = h(Q^\tau_Y)$. This is called equivariance to
monotone transformations. Note that the mean does not share this property because gen-
erally $E[h(Y)] \neq h(E[Y])$ except for some special h(·) such as linear functions. On the
other hand, for quantiles there exists no equivalent to the so-called iterated expectation
E[Y ] = E[E[Y |X ]]. Finally, recall that median regression is more robust to outliers
than mean regression is.
Let us start now with the interpretation and estimation of parametric quantile regres-
sion. As stated, in most of the cases confounders are involved, so that we will examine
conditional quantiles Q τY |X instead of unconditional ones. How can we relate this
quantile function to the well-known (and well-understood) mean and variance (or
scedasticity) function? The idea is as follows: imagine a variable U =
Y − μ(X) capturing the subjects’ unobserved heterogeneity, with distribution function F(·). If
the conditional distribution function of Y depends on X only via the location μ(·), then
F(y|x) = F(y − μ(x)) such that
$$\tau = F\left( Q^\tau_{Y|x} \mid x \right) = F\left( Q^\tau_{Y|x} - \mu(x) \right), \quad \text{i.e.} \quad Q^\tau_{Y|x} = \mu(x) + F^{-1}(\tau).$$
Before continuing with the discussion and estimation consider another example:
Example 7.2 When studying the demand for alcohol, Manning, Blumberg and Moulton
(1995) estimated the model
log consumption i = α + β1 log pricei + β2 log incomei + U
at different quantiles. Here, incomei is the annual income of individual i, consumptioni
his annual alcohol consumption, and pricei a price index for alcoholic beverages, com-
puted for the place of residence of individual i. Hence, the latter varies only between
individuals that live in different locations. For about 40% of the observations con-
sumption was zero, such that price and income responses were zero for low quantiles.
For larger quantiles the income elasticity was relatively constant at about 0.25. The
price elasticity β1 showed more variation. Its value became largest in absolute terms at
τ ≥ 0.7, and very inelastic for low levels of consumption τ ≤ 0.4, but also for high
levels of consumption τ ≈ 1. Hence, individuals with very low demand and also those
with very high demand were insensitive to price changes, whereas those with average
consumption showed a stronger price response. A conventional mean regression would
not detect this kind of heterogeneity.
Consider the three examples of quantile curves given in Figure 7.2. For all three the
line in the centre shall represent the median regression. Obviously, they are all symmet-
ric around the median. To ease the following discussion, imagine that for all moments
of order equal to or larger than three the distribution of Y is independent from X .
The first example (on the left) exhibits parallel quantile curves for different τ . This
actually indicates homoscedasticity for U . The second example (in the centre) shows a
situation with a linear scedasticity, i.e. of the form
Y = α + Xβ + (γ + X δ) U with U ⊥⊥ X . (7.5)
Clearly, for δ > 0, the simple linear quantile models would cross if the X variable could
take negative values. For example, if γ = 0 and δ > 0, all conditional quantiles will
pass through the point (0, α). A more adequate version of such a quantile model is then
$$Q^\tau_{Y|x} = \begin{cases} \alpha + x\beta + (\gamma + x\delta)\, F^{-1}(\tau) & \text{if } \gamma + x\delta \ge 0 \\ \alpha + x\beta + (\gamma + x\delta)\, F^{-1}(1-\tau) & \text{else.} \end{cases} \qquad (7.6)$$
It is further clear that we can generate quantiles as indicated on the right side of Figure
7.2 by extending the scedasticity function in model (7.5) from a linear to a quadratic one.
But nonetheless, generally a polynomial quantile function might give crossing quantiles
as well.
Regarding estimation, if covariates are involved, then the estimation procedure is
based on optimisation instead of using ordering. Nonetheless, for the sake of presen-
tation let us first consider the situation without covariates. Define the asymmetric loss
(or check) function
ρτ (u) = u · (τ − 11 {u < 0}) , (7.7)
In mean regression one usually examines the squared loss function u², which leads to the
least squares estimator. For the median, τ = 1/2, the loss function (7.7) is the absolute loss
function. For values τ ≠ 1/2 it gives an asymmetric absolute loss function.1
Suppose that a density exists and is positive at the value Q τY , i.e. f Y (Q τY ) > 0.
Then it can be shown that the minimiser in (7.8) is in fact Q τY . To see this, suppose
that the quantile Q τY is unique. The interior solution to arg minβ E [ρτ (Y − β)] is given
by the first-order condition, i.e. setting the first-derivative to zero. Note that the first
derivative is
$$\frac{\partial}{\partial\beta} \int_{-\infty}^{\infty} (Y-\beta)\left(\tau - \mathbb{1}\{(Y-\beta) < 0\}\right) dF_Y = \frac{\partial}{\partial\beta}\left[ (\tau-1)\int_{-\infty}^{\beta} (Y-\beta)\, dF_Y + \tau \int_{\beta}^{\infty} (Y-\beta)\, dF_Y \right] = F_Y(\beta) - \tau,$$
which is zero for FY (β) = τ . Hence, minimising E [ρτ (Y − β)] leads to an estimator
of the quantile. An alternative interpretation is that β is chosen such that the τ -quantile
of (Y − β) is set to zero. Or in other words, it follows that
$$E\left[ \mathbb{1}\left\{ (Y - Q_Y^\tau) < 0 \right\} - \tau \right] = 0.$$
The sample analogue is
$$\hat{\beta}^\tau = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau(Y_i - \beta). \qquad (7.9)$$
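A tiny numerical illustration (hypothetical data): minimising the check loss (7.7) over β indeed returns the empirical τ-quantile, up to the numerical tolerance of the optimiser.

```r
# Minimising the check loss (7.7) numerically reproduces the empirical tau-quantile.
rho <- function(u, tau) u * (tau - (u < 0))     # check function (7.7)

set.seed(10)
y   <- rexp(501)
tau <- 0.25
opt <- optimize(function(b) sum(rho(y - b, tau)), interval = range(y))
c(check_loss_minimiser = opt$minimum,
  sorting_based       = unname(quantile(y, probs = tau, type = 1)))
```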
In order to develop an intuition for this loss or objective function, let us illustrate the
trivial cases of τ = 1/2 and sample sizes n = 1, 2, 3 and 4. An example of this situation
is given in Figure 7.3. As the figure shows, the objective function is not differentiable
everywhere. It is differentiable except at the points at which one or more residuals are
zero.2 The figures also show that the objective function is flat at its minimum when (τ n)
is an integer. The solution is typically at a vertex. To verify the optimality one needs
only to verify that the objective function is non-decreasing along all edges.
1 Hence, the following estimators are not only for quantile regression, but can also be used for other
situations where an asymmetric loss function is appropriate. For example, a financial institution might
value the risk of large losses higher (or lower) than the chances of large gains.
2 At such points, it has only so-called directional derivatives.
Figure 7.3 Objective function of (7.9) with ρ_τ as in (7.7), for τ = 1/2 and sample sizes n = 1, 2, 3 and
4 (from left to right)
Similarly to the above derivation, if there is a unique interior solution one can show
that when including covariates X , our estimator can be defined as
$$\arg\min_{\beta} E\left[ \rho_\tau(Y - X\beta) \right] = \arg\operatorname{zero}_{\beta}\; E\left[ \left( \tau - \mathbb{1}\{Y < X\beta\} \right) \cdot X \right]. \qquad (7.10)$$
Suppose now the linear quantile model
$$Y = X\beta_0^\tau + U^\tau \quad \text{with} \quad Q^\tau_{U|X} = 0.$$
In other words, at the true values β0τ the quantile of (Y − X β0τ ) should be zero. This
suggests the linear quantile regression estimator
$$\hat{\beta}^\tau = \arg\min_{\beta} \frac{1}{n} \sum_{i=1}^{n} \rho_\tau\left( Y_i - X_i\beta \right) \quad \text{with } \rho_\tau \text{ as in (7.7)}. \qquad (7.11)$$
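In practice (7.11) is computed by linear programming; a minimal sketch, e.g. with the quantreg package (the data-generating process is a hypothetical linear location-scale model as in (7.5)):

```r
# Linear quantile regression (7.11) for several tau, e.g. via the 'quantreg' package.
library(quantreg)

set.seed(11)
n <- 2000; x <- runif(n)
y <- 1 + 2 * x + (0.5 + x) * rnorm(n)        # location-scale model: slopes differ across tau
fit <- rq(y ~ x, tau = c(0.25, 0.5, 0.75))   # one fit per quantile
coef(fit)                                    # slopes increase with tau (approx 1.33, 2, 2.67)
```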
Using the relationship (7.10) we could choose β to set the moment conditions
$$\left\|\ \frac{1}{n}\sum_{i=1}^{n}\left(\tau - \mathbb{1}\{Y_i < X_i\beta\}\right)\cdot X_i\ \right\|$$
to zero. In finite samples, it will usually not be possible to set this exactly equal to zero,
so we set it as close to zero as possible.
Example 7.3 Consider the situation for $\tau = 0.25$ without $X$. Suppose we have three data points. It will be impossible to find a $\beta$ such that $\sum_{i=1}^{3}\left(\tau - \mathbb{1}\{Y_i < \beta\}\right) = 0$. To see this, rewrite this equation as $\sum_{i=1}^{3}\mathbb{1}\{Y_i < \beta\} = 0.75$, which cannot be satisfied.
But certainly, for $n \to \infty$ the distance from zero will vanish. Note that for finite $n$ the objective function (7.11) is not differentiable, whereas $E\left[\rho_\tau(Y - X\beta)\right]$ usually is. The objective function (7.11) is piecewise linear and continuous. It is differentiable everywhere except at those values of $\beta$ where $Y_i - X_i\beta = 0$ for at least one sample observation. As stated, at those points the objective function has directional derivatives which depend on the direction of evaluation. If at a point $\hat\beta$ all directional derivatives are non-negative, then $\hat\beta$ minimises the objective function (7.11).3
$$\frac{1}{n}\sum_{i=1}^{n}\left(\tau - \mathbb{1}\{Y_i < X_i\beta\}\right)\cdot X_i$$
4 Rigorous proofs usually exploit the convexity of (7.11) and apply the convexity lemma of Pollard.
Figure 7.4 Example of 0.75 quantiles for model $Y_i = X_iU_i$ with $U$ standard normal (see left) and uniform (see right)
Since the standard normal distribution is symmetric about zero, the values of $F_U^{-1}(\tau)$ and $-F_U^{-1}(1-\tau)$ are the same, and therefore the absolute value of the slope is the same to the left and right of zero in the left graph. In the right graph, the sign of the slope does not change, but its magnitude does, though the conditional median would still be linear.
Once again, quantile crossing can occur for the estimated quantiles for many reasons. Interestingly, it is nevertheless ensured5 that even if we estimate all quantile functions separately with the simple linear model (7.11), at least at the centre of the design points $\bar X = \frac{1}{n}\sum_i X_i$, the estimated quantile function $\hat Q^\tau_{Y|X}(\bar X) = \bar X\hat\beta^\tau$ is non-decreasing in $\tau \in [0,1]$. On the other hand, if the assumed model were indeed correct, estimates that exhibit crossing quantiles would not be efficient since they do not incorporate the information that $Q^\tau_{Y|X}$ must be non-decreasing in $\tau$. Algorithms exist that estimate parametric (linear) quantile regressions simultaneously for all quantiles while modifying the objective function such that $\hat Q^\tau_{Y|X}$ is non-decreasing in $\tau$.
$$\hat\beta^\tau = \arg\min_{\beta}\ \sum_{i=1}^{n}\rho_\tau\left(Y_i - X_i\beta\right)
= \arg\min_{\beta}\ \sum_{i=1}^{n}\left[\tau\left(Y_i - X_i\beta\right)\mathbb{1}\{Y_i > X_i\beta\} - (1-\tau)\left(Y_i - X_i\beta\right)\mathbb{1}\{Y_i < X_i\beta\}\right].$$
Equivalently,
$$\hat\beta^\tau = \arg\min_{\beta}\ \sum_{i=1}^{n}\left[\tau r_{1i} + (1-\tau)r_{2i}\right] \quad\text{with } r_{1i} - r_{2i} = Y_i - X_i\beta,\; r_{1i}, r_{2i} \ge 0, \qquad (7.12)$$
with only one of the two residuals r1i , r2i being non-zero given i. It can be shown that
the solution is identical to the solution to an LP problem where minimisation is over β,
r1 and r2 . Now define the following LP problem:
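The book's formal LP statement is not reproduced here; as an illustrative sketch (not the book's own formulation), the programme in (7.12) can be handed to a generic LP solver once $\beta$ is split into a positive and a negative part so that all variables are non-negative. The use of the R package lpSolve and the simulated data are assumptions.

# Minimal sketch: quantile regression as a linear programme (7.12),
# variable order: (beta_plus, beta_minus, r1, r2), all non-negative.
library(lpSolve)
library(quantreg)   # only used to cross-check the LP solution

set.seed(1)
n <- 100; tau <- 0.25
x <- cbind(1, runif(n))                       # design matrix incl. intercept
y <- as.vector(x %*% c(1, 2) + rnorm(n))
p <- ncol(x)

obj  <- c(rep(0, 2 * p), rep(tau, n), rep(1 - tau, n))
Amat <- cbind(x, -x, diag(n), -diag(n))       # X b+ - X b- + r1 - r2 = y
sol  <- lp("min", obj, Amat, rep("=", n), y)
beta <- sol$solution[1:p] - sol$solution[(p + 1):(2 * p)]

rbind(LP = beta, rq = coef(rq(y ~ x[, 2], tau = tau)))  # should essentially coincide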
$$Y = X\beta_0^\tau + U\,, \qquad Q^\tau_{U|X} = 0\,.$$
Let β̂ τ be an estimator obtained from the minimisation problem (see above) with τ ∈
(0, 1). Under the following assumptions one can establish its consistency and statistical
properties:
Assumption Q1 Let Fi be the conditional cdf of Yi (or simply the cdf of Ui ), allowing
for heteroskedasticity. Then we assume that for any ε > 0
6 An introduction to these algorithms is given for example in Koenker (2005, chapter 6).
$$\sqrt{n}\left[\frac{1}{n}\sum_{i=1}^{n}F_i\!\left(X_i\beta_0^\tau - \varepsilon\right) - \tau\right] \xrightarrow{\ n\to\infty\ } -\infty \quad\text{and}\quad
\sqrt{n}\left[\frac{1}{n}\sum_{i=1}^{n}F_i\!\left(X_i\beta_0^\tau + \varepsilon\right) - \tau\right] \xrightarrow{\ n\to\infty\ } \infty\,.$$
This condition requires that the density of the error term U at point 0 is bounded away
from zero at an appropriate rate. If the density of U was zero in an ε neighbourhood,
the two previous expressions would be exactly zero. The conditions require a positive
density and are thus simple identification conditions. The next assumptions concern the
data matrix X .
Assumption Q2 There exist real numbers $d > 0$ and $\bar d > 0$ such that
$$\liminf_{n\to\infty}\ \inf_{\|\beta\|=1}\ \frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\left\{|X_i\beta| < d\right\} = 0\,, \qquad
\limsup_{n\to\infty}\ \sup_{\|\beta\|=1}\ \frac{1}{n}\sum_{i=1}^{n}\left(X_i\beta\right)^2 \le \bar d\,.$$
These conditions ensure that the $X_i$ observations are not collinear, i.e. that there is no $\beta$ such that $X_i\beta = 0$ for every observed $X_i$. The second part controls the rate of growth of the $X_i$ and is satisfied when $\frac{1}{n}\sum_i X_i'X_i$ tends to a positive definite matrix. Alternative sets of conditions can be used to prove consistency, e.g. by trading off some conditions on the density of $U$ against conditions on the design $X$.
Assumptions Q1 and Q2 are typically sufficient for obtaining consistency. For exam-
ining the asymptotic distribution of the estimator, stronger conditions are required.
We still suppose the $Y_i$ to be i.i.d. observations with conditional distribution function $F_i = F_{Y_i|X_i}$. For notational convenience we set $\xi_i^\tau = Q^\tau_{Y_i|X_i}$. Then we need to impose
Assumption Q3 The cdf $F_i$ are absolutely continuous with continuous densities $f_i$ that are uniformly bounded away from zero and infinity at the points $\xi_i^\tau$ for all $i$.
Assumption Q4 There exist positive definite matrices $D_0$ and $D_1$ such that
$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}X_i'X_i = D_0\,, \qquad \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}f_i(\xi_i^\tau)\,X_i'X_i = D_1 \qquad\text{and}\qquad \lim_{n\to\infty}\max_i\frac{1}{\sqrt n}\|X_i\| = 0\,.$$
Then, with Assumptions Q3 and Q4, the estimated coefficients converge in distribution as
$$\sqrt{n}\left(\hat\beta^\tau - \beta_0^\tau\right) \xrightarrow{\ d\ } N\!\left(0,\ \tau(1-\tau)\,D_1^{-1}D_0D_1^{-1}\right).$$
The proof consists of three steps. First, it is shown that the function
$$Z_n(\delta) = \sum_{i=1}^{n}\left[\rho_\tau\!\left(U_i - \frac{X_i\delta}{\sqrt n}\right) - \rho_\tau(U_i)\right] \quad\text{with } U_i = Y_i - X_i\beta_0^\tau$$
converges in distribution,
$$Z_n(\delta) \xrightarrow{\ d\ } -\delta'W + \frac12\delta'D_1\delta \quad\text{with } W \sim N\!\left(0,\ \tau(1-\tau)D_0\right).$$
Since the left- and right-hand sides are convex in $\delta$ with a unique minimiser, it follows that
$$\arg\min Z_n(\delta) \xrightarrow{\ d\ } \arg\min\left(-\delta'W + \frac12\delta'D_1\delta\right) = D_1^{-1}W \sim N\!\left(0,\ \tau(1-\tau)D_1^{-1}D_0D_1^{-1}\right). \qquad (7.14)$$
Finally, the function $Z_n(\delta)$ is shown to be minimised at the value $\sqrt n\left(\hat\beta^\tau - \beta_0^\tau\right)$. To see this, note that with a few calculations it can be checked that $Z_n\!\left(\sqrt n(\hat\beta^\tau - \beta_0^\tau)\right) = \sum_{i=1}^{n}\left[\rho_\tau(Y_i - X_i\hat\beta^\tau) - \rho_\tau(U_i)\right]$. The first term achieves its minimum by the very definition of the linear quantile regression estimator, and the second term does not depend on $\delta$ at all. Hence, $\arg\min Z_n(\delta) = \sqrt n\left(\hat\beta^\tau - \beta_0^\tau\right)$, which gives the asymptotic distribution of $\hat\beta^\tau$ thanks to (7.14).
Consider the simple case where $X$ contains only a constant, which represents the case of univariate quantile regression. Then
$$\sqrt n\left(\hat\beta^\tau - \beta_0^\tau\right) \xrightarrow{\ d\ } N\!\left(0,\ \frac{\tau(1-\tau)}{f_Y^2(Q_Y^\tau)}\right). \qquad (7.15)$$
The variance is large when $\tau(1-\tau)$ is large, which attains its maximum at $\tau = 0.5$. Hence, this part of the variance decreases in the tails, i.e. for $\tau$ small or large. On the other hand, the variance is large when the density $f_Y(Q_Y^\tau)$ is small, which usually increases the variance in the tails. If the density $f_Y(Q_Y^\tau)$ is very small, the effective rate of convergence can be slower than $\sqrt n$ because we expect to observe very few observations there.
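A small Monte Carlo sketch (not from the book) may help to see (7.15) at work: for the median of standard normal data, $\tau(1-\tau)/f_Y^2(Q_Y^\tau) = 0.25/\varphi(0)^2 \approx 1.571$, which the simulated variance of the sample median should approach. The simulation design is an assumption.

# Minimal sketch: Monte Carlo check of the asymptotic variance in (7.15)
# for the sample median of N(0,1) data.
set.seed(1)
tau <- 0.5; n <- 1000; R <- 2000
qhat <- replicate(R, quantile(rnorm(n), tau))
c(simulated = n * var(qhat),
  formula   = tau * (1 - tau) / dnorm(qnorm(tau))^2)   # approx. 1.571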
Due to the normality in (7.15) one can easily extend the previous derivations to obtain the joint distribution of several quantiles, say $\hat\beta^\tau = (\hat\beta^{\tau_1}, \ldots, \hat\beta^{\tau_m})'$:
$$\sqrt n\left(\hat\beta^\tau - \beta_0^\tau\right) \xrightarrow{\ d\ } N(0, \Omega)\,, \qquad \Omega = \{\Omega_{ij}\}_{i,j}^{m,m}\,, \qquad \Omega_{ij} = \frac{\min\{\tau_i, \tau_j\} - \tau_i\tau_j}{f\!\left(F^{-1}(\tau_i)\right)\cdot f\!\left(F^{-1}(\tau_j)\right)}\,.$$
With similar derivations as those being asked for in Exercise 3 one can calculate the
influence function representation of the linear quantile regression estimator which is
(for τ ∈ (0, 1))
$$\sqrt n\left(\hat\beta^\tau - \beta_0^\tau\right) = D_1^{-1}\frac{1}{\sqrt n}\sum_{i=1}^{n}X_i\cdot\left(\tau - \mathbb{1}\{Y_i - \xi_i^\tau < 0\}\right) + O\!\left(n^{-\frac14}(\ln\ln n)^{\frac34}\right);$$
see Bahadur (1966) and Kiefer (1967). It can further be shown that this representation
holds uniformly over an interval τ ∈ [ε, 1 − ε] for some 0 < ε < 1.
These $\hat\beta^\tau$ allow us to predict the unconditional quantiles of $Y$ for a different distribution of $X$ if the $\beta^\tau$ remain unchanged (that is, the returns to $X$ and the distribution of $U$ stay the same). In fact, as one has (Exercise 2)
$$Q_Y^\tau = F_Y^{-1}(\tau) \iff \int\!\!\int \mathbb{1}\{Y \le Q_Y^\tau\}\,dF_{Y|X}\,dF_X = \int\!\!\int_0^1 \mathbb{1}\left\{F_{Y|X}^{-1}(t|X) \le Q_Y^\tau\right\}dt\,dF_X = \tau\,, \qquad (7.16)$$
you can predict $Q_Y^\tau$ from consistent estimates $\hat\beta^{t_j}$ for $F_{Y|X}^{-1}(t_j|X)$ with $0 \le t_1 < \cdots < t_J \le 1$, $t_1$ being close to zero and $t_J$ close to one. You only need to apply the empirical counterpart of (7.16), i.e.
$$\hat Q_Y^\tau = \inf\left\{q\ :\ \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{J}(t_j - t_{j-1})\,\mathbb{1}\{x_i\hat\beta^{t_j} \le q\} \ge \tau\right\}. \qquad (7.17)$$
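As an illustration of (7.17) — a sketch under assumed simulated data, not the book's code — one can fit linear quantile regressions on a grid of quantiles and invert the implied distribution function:

# Minimal sketch: predicting the unconditional quantile of Y via (7.17).
library(quantreg)

set.seed(1)
n <- 500
x <- runif(n)
y <- 1 + 2 * x + (0.5 + x) * rnorm(n)

taus <- seq(0.05, 0.95, by = 0.05)              # grid 0 < t_1 < ... < t_J < 1
beta <- coef(rq(y ~ x, tau = taus))             # p x J matrix of beta-hat^{t_j}
yhat <- cbind(1, x) %*% beta                    # n x J matrix of x_i beta-hat^{t_j}
wts  <- diff(c(0, taus))                        # (t_j - t_{j-1})

Fhat  <- function(q) mean((yhat <= q) %*% wts)  # empirical counterpart of (7.16)
qs    <- seq(min(y), max(y), length.out = 400)
Fvals <- sapply(qs, Fhat)
Qhat  <- function(tau) qs[which(Fvals >= tau)[1]]       # (7.17)
c(predicted = Qhat(0.5), empirical = unname(quantile(y, 0.5)))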
Example 7.4 Melly (2005) used linear quantile regression methods to replicate the decomposition of Juhn, Murphy and Pierce (1993) for the entire distribution and not just for the mean. This allowed him to study the development of wage inequality by gender over the years 1973 to 1989 in the USA. More specifically, he simulated the (hypothetical) wage distribution in 1973 for a population with the characteristics' distribution observed in 1989 but quantile returns $\beta^{t_j}$ ($j = 1, \ldots, J$) as in 1973. As a result, he could quantify how much of the change in income inequality was caused by the change in characteristics over these 16 years. Then he calculated the changes of the deviations of $\beta^{\tau_j}$ from the median returns, i.e. $\beta^{0.5}_{89} - \beta^{0.5}_{73} + \beta^{\tau_j}_{73}$, so that he could estimate the distribution that would have prevailed if the median return had been as in 1989 with the residuals distributed as in 1973. Taking all of this together, he could under these conditions calculate how much of the change from 1973 to 1989 in income inequality was due to changes in returns and/or changes in $X$.
Alternatively, instead of using (7.16) and (7.17), one could simply generate an artificial sample $\{y_j^*\}_{j=1}^m$ of a 'target population' with covariates $\{x_j^*\}_{j=1}^m$ (say, in the above example you take the $x$ from 1989) by first drawing $t_j$ randomly from $U[0,1]$, $j = 1, \ldots, m$, estimating from a real sample $\{(y_i, x_i)\}_{i=1}^n$ of the 'source population' (in the above example the population in 1973) the corresponding $\hat\beta^{t_j}$, and setting afterwards $y_j^* := x_j^*\hat\beta^{t_j}$. Then the distribution of $Y$ in the target population can be revealed via this artificial sample $\{y_j^*\}_{j=1}^m$; see Machado and Mata (2005).
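A minimal sketch of the Machado–Mata device, under an assumed simulated source sample and target covariate distribution (not the book's own implementation), could look as follows:

# Minimal sketch: artificial sample for a target population with covariates
# xstar, using quantile coefficients estimated on the source sample.
library(quantreg)

set.seed(1)
n <- 500
x <- runif(n); y <- 1 + 2 * x + (0.5 + x) * rnorm(n)   # source sample
xstar <- rbeta(n, 2, 2)                                 # target covariate values

m     <- 200
tdraw <- runif(m)                                       # t_j ~ U(0, 1)
xs    <- sample(xstar, m, replace = TRUE)
ystar <- sapply(seq_len(m), function(j) {
  b <- coef(rq(y ~ x, tau = tdraw[j]))                  # beta-hat^{t_j} from source
  sum(c(1, xs[j]) * b)                                  # y*_j = x*_j beta-hat^{t_j}
})
quantile(ystar, c(0.1, 0.5, 0.9))   # counterfactual quantiles in the target population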
Although the linear quantile regression presented above is doubtless the presently
most widely used approach, we finally also introduce non-parametric quantile regres-
sion. Extensions to nonlinear parametric models have been developed, but it might be
most illuminating to proceed directly to non-parametric approaches, namely to local
quantile regression. Similar to the previous chapters, we will do this via local polyno-
mial smoothing. Let us start with the situation where for a one-dimensional X i we aim
to estimate the conditional quantile function $Q^\tau_{Y|X}(x)$ at location $X = x$. A local linear quantile regression estimator is given by the solution $a$ to
$$\min_{a,b}\ \sum_{i}\rho_\tau\left\{Y_i - a - b(X_i - x)\right\}\cdot K\!\left(\frac{X_i - x}{h}\right).$$
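Since $\rho_\tau$ is positively homogeneous, the local linear objective above is just a weighted quantile regression with kernel weights, so it can be computed with rq and its weights argument. The following sketch (assumed simulated data, not from the book) estimates $Q^\tau_{Y|X}(x_0)$ at a single point $x_0$:

# Minimal sketch: local linear quantile regression at a point x0,
# i.e. weighted rq with kernel weights K((X_i - x0)/h).
library(quantreg)

set.seed(1)
n <- 500
x <- runif(n)
y <- sin(2 * pi * x) + (0.3 + 0.3 * x) * rnorm(n)

x0  <- 0.5; h <- 0.1; tau <- 0.75
K   <- dnorm((x - x0) / h)                        # Gaussian kernel weights
fit <- rq(y ~ I(x - x0), tau = tau, weights = K)  # local linear fit around x0
coef(fit)[1]                                      # estimate of Q^tau_{Y|X}(x0)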
Extensions to higher polynomials are obvious but seem to be rarely used in practice.
Note that local constant regression quantiles won’t cross whereas (higher-order) local
polynomials might do, depending on the bandwidth choice. Regarding the asymptotics,7
7 They were introduced by Chaudhuri (1991) in a rather technical paper. Asymptotic bias and variance were
not given there.
the convergence rate is known (see below), whereas the variance is typically estimated via simulation methods (namely jackknife, bootstrap or subsampling). Suppose that
$$Y = g(X) + U \quad\text{with}\quad Q_U^\tau = 0\,,$$
and $g$ belongs to the class of Hölder continuous functions $C^{k,\alpha}$ (i.e. $g$ is $k$ times continuously differentiable with the $k$th derivative being Hölder continuous with exponent $\alpha$). It has been shown that the estimate $\hat g$ of the function, when choosing a bandwidth $h$ proportional to $n^{-\frac{1}{2(k+\alpha)+\dim(X)}}$, converges to $g$ almost surely as follows:
$$\left\|\hat g - g\right\| = O\!\left(n^{-\frac{k+\alpha}{2(k+\alpha)+\dim(X)}}\cdot\sqrt{\ln n}\right).$$
and the quantiles can be obtained by inverting the distribution function, i.e.
$$Q^\tau_{Y^d} = F^{-1}_{Y^d}(\tau)\,,$$
provided that $F_{Y^d}$ is invertible. The latter is identical to saying that the quantile $Q^\tau_{Y^d}$ is well defined, and is therefore not a restriction but a basic assumption for QTE estimation. In order to estimate (7.22) you can now predict $E\left[\mathbb{1}\{Y \le a\}\,|\,X = x_i, D = d\right]$ for all observed individuals $i = 1, \ldots, n$ and their $x_i$ by any consistent (e.g. non-parametric regression) method, and then take the average over the sample that represents your population of interest (namely all, only the treated, or only the non-treated). Alternatively,
weighting by the propensity score gives
$$F_{Y^d}(a) = E\left[\frac{\mathbb{1}\{Y \le a\}\cdot\mathbb{1}\{D = d\}}{\Pr(D = d|X)}\right], \quad\text{see (7.22)}\,,$$
which can be estimated by
$$\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\{Y_i \le a\}\cdot\mathbb{1}\{D_i = d\}\big/\widehat{\Pr}(D_i = d|X_i)\,.$$
These may be less precise in finite samples if the estimated probabilities Pr(D = d|X )
are very small for some observed xi . For the weighting by the propensity score, recall
also Chapter 3 of this book.
Example 7.5 Frölich (2007b) used a propensity score matching estimator to analyse the gender wage gap in the UK. Using his data set, he studied the impact of parametric vs non-parametric estimators, of estimators that accounted for the sampling scheme at different stages of estimation, and how sensitive the results were to bandwidth and kernel choice (including pair matching). He controlled for various confounders like age,
9 Compare also with Melly (2005), Firpo (2007) and Frölich (2007b).
full- or part-time employment, private or public sector, and the subject of degree of the individuals' professional education. He could show that the subject of degree explained the largest fraction of the wage gap. But even when controlling for all observable characteristics, 33% of the gap still remained unexplained.10 Secondly, as expected, the gap increased with the income quantile, i.e. both $Q^\tau_{Y^m} - Q^\tau_{Y^f}$ (with $m$ = male, $f$ = female) and $Q^\tau_{Y^m}/Q^\tau_{Y^f}$ increased with $\tau$.
These considerations are interesting for some applications and for identification in general. But as already discussed at the beginning of this chapter, if one is particularly interested in the difference between two quantiles and its asymptotic properties, then direct estimation of the quantiles might be more convenient than first estimating the entire distribution function. In order to do so, notice that the last equation implies that
$$\tau = F_{Y^d}\!\left(Q^\tau_{Y^d}\right) = E\left[\frac{\mathbb{1}\{Y \le Q^\tau_{Y^d}\}\cdot\mathbb{1}\{D = d\}}{\Pr(D = d|X)}\right].$$
Hence, no matter whether $(Y, X)$ comes from the treatment or the control group, we can identify $Q^\tau_{Y^d}$ for any $d$ by
$$Q^\tau_{Y^d} = \arg\operatorname{zero}_{\beta}\ E\left[\frac{\mathbb{1}\{Y < \beta\}\cdot\mathbb{1}\{D = d\}}{\Pr(D = d|X)} - \tau\right]. \qquad (7.23)$$
Therefore, once we have estimated the propensity score $\Pr(D = d|X)$, we could use a conventional univariate quantile regression estimation routine with weights $\frac{\mathbb{1}\{D = d\}}{\Pr(D = d|X)}$. Note that all weights are positive, so that our problem is convex and can be solved by linear programming.
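A minimal sketch of the empirical counterpart of (7.23) — with an assumed parametric (logit) propensity score and simulated data, so only an illustration, not the book's procedure — solves the weighted moment condition by a direct search over $\beta$:

# Minimal sketch: Q^tau_{Y^1} via the empirical counterpart of (7.23),
# i.e. the smallest beta with (1/n) sum_i 1{D_i=1} 1{Y_i<=beta}/p(X_i) >= tau.
set.seed(1)
n <- 2000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(0.5 * x))              # selection on X
y <- 1 + x + d + rnorm(n)                       # hence Y^1 = 2 + x + error

ps  <- fitted(glm(d ~ x, family = binomial))    # estimated Pr(D = 1 | X)
w   <- d / ps                                   # weights 1{D=1}/Pr(D=1|X)
tau <- 0.5
o   <- order(y)
Fw  <- cumsum(w[o]) / n                         # weighted empirical cdf of Y^1
Q1  <- y[o][which(Fw >= tau)[1]]
Q1                                              # true median of Y^1 is 2 in this design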
All we need in practice is to predict the propensity score function $\Pr(D = d|X = x_i)$ for all observed $x_i$, and to replace the expectation in (7.23) by the corresponding sample mean. Again, you could alternatively consider (7.22), substitute $Q^\tau_{Y^d}$ for $a$, and estimate the unconditional quantile by
$$Q^\tau_{Y^d} = \arg\operatorname{zero}_{\beta}\ \left[\int E\left[\mathbb{1}\{Y \le \beta\}\,|\,X, D = d\right]dF_X - \tau\right],$$
replacing $\int[\ldots]\,dF_X$ by averaging over consistent predictors of $E[\ldots|x_i, D = d]$.
10 It should be mentioned that the 'explained' part of the gap could also be due to discrimination. It is known that branches or areas of knowledge that are dominated by men are systematically better paid than those dominated by women, independently of the years of education, demand, etc.
Differentiation with respect to $\beta$, using the Leibniz rule, assuming that the order of integration and differentiation can be interchanged, and some simple algebra give the first-order condition
$$0 = -\tau E\left[\frac{\mathbb{1}\{D = d\}}{\Pr(D = d|X)}\right] + \int_{-\infty}^{\beta}\frac{\mathbb{1}\{D = d\}}{\Pr(D = d|X)}\,dF_{YXD}
= -\tau + E\left[\frac{\mathbb{1}\{D = d\}}{\Pr(D = d|X)}\,\mathbb{1}\{Y < \beta\}\right]
= E\left[\frac{\mathbb{1}\{D = d\}}{\Pr(D = d|X)}\left(\mathbb{1}\{Y < \beta\} - \tau\right)\right],$$
which in turn gives (7.23).
When discussing the statistical properties of the estimators of the QTE $\Delta^\tau$ for binary $D$ (recall (7.19)), based on the empirical counterparts of (7.23) or (7.24), we will directly consider the case in which the propensity scores $p(X_i)$ are predicted with the aid of a consistent non-parametric estimator. Let us use what we learnt about semi-parametric estimation in Section 2.2.3. The influence function $\psi$ from (2.63) for estimating (7.23) with known $p(x)$ is
$$g_d^\tau(Y, X, D) = -\frac{(1-d) - D}{(1-d) - p(X)}\left(\mathbb{1}\{Y \le Q_d^\tau\} - \tau\right)\big/ f_{Y^d}\,, \qquad d = 0, 1.$$
Note that this is just the influence function of a common quantile regression times the
(necessary) propensity weights. If we can non-parametrically estimate p(x) sufficiently
well (i.e. with a sufficiently fast convergence rate11 ), then (2.63) to (2.64) applies with
|λ| = 0, m(x) := p(x) and y := d (in the last equation). We get that the adjustment
factor for the non-parametric prior estimation is
$$\alpha_d^\tau(D, X) = -\frac{D - p(X)}{(1-d) - p(X)}\,E\left[g_d^\tau(Y)\,|\,X, D = d\right].$$
Consequently our influence function is
$$\psi_d^\tau(Y, D, X) = g_d^\tau(Y, X, D) - \alpha_d^\tau(D, X)\,,$$
11 As we discussed in previous chapters this requires either dim(x) ≤ 3 or bias reducing methods based on
higher-order smoothness assumptions on p(·).
$$\hat Q^\tau_{Y^1|D=1} = \arg\min_{q}\ \sum_{i=1}^{n}\frac{D_i}{\sum_{l=1}^{n}D_l}\,\rho_\tau(Y_i - q)\,, \quad\text{and}\quad
\hat Q^\tau_{Y^0|D=1} = \arg\min_{q}\ \sum_{i=1}^{n}\frac{1 - D_i}{\sum_{l=1}^{n}D_l}\,\frac{p(X_i)}{1 - p(X_i)}\,\rho_\tau(Y_i - q)\,,$$
(for $\rho_\tau$ recall Equation 7.7) where again, in practice, the propensity scores $p(x_i)$ have to be predicted.
As we already know from the previous chapters, the exclusion restriction in (7.26) is not sufficient to obtain identification. As we discussed in Chapter 4, one additionally needs the monotonicity assumption, saying that the function $\zeta$ is weakly monotone in $z$. Without loss of generality we normalise it to be increasing, i.e. we assume that an exogenous increase in $Z$ can never decrease the value of $D$ (otherwise work with $-Z$).
Supposing that $D$ is binary, let us define
$$z_{min} = \arg\min_{z\in\mathcal Z}\Pr\left(D^z = 1\right) \quad\text{and}\quad z_{max} = \arg\max_{z\in\mathcal Z}\Pr\left(D^z = 1\right).$$
By virtue of the monotonicity assumption, $D_i^{z_{min}} < D_i^{z_{max}}$ for $i$ being a complier, whereas $D_i^{z_{min}} = D_i^{z_{max}}$ for $i$ being an always- or never-taker. (If $Z$ is binary, clearly $z_{min} = 0$ and $z_{max} = 1$.) Identification of the effect on all compliers is obtained from those observations with $z_i = z_{min}$ and $z_i = z_{max}$, irrespective of the number of instrumental variables or whether they are discrete or continuous. The asymptotic theory requires that there are positive mass points, i.e. that $\Pr(Z = z_{min}) > 0$ and $\Pr(Z = z_{max}) > 0$. This rules out continuous instrumental variables, unless they are mixed discrete-continuous and have positive mass at $z_{min}$ and $z_{max}$.
Again, the subgroup of ‘compliers’ is the largest subpopulation for which the effect
is identified. If the instruments Z were sufficiently powerful to move everyone from
D = 0 to D = 1, this would lead to the average treatment effect (ATE) in the entire
population (but at the same time indicate that either D|X is actually exogenous or Z |X
is endogenous, too). If Y is bounded, we can derive bounds on the overall treatment
effects because the size of the subpopulation of compliers is identified as well. We focus
on the QTE for the compliers:
$$\Delta_c^\tau = Q^\tau_{Y^1|c} - Q^\tau_{Y^0|c} \qquad (7.28)$$
where $Q^\tau_{Y^1|c} = \inf_q\left\{q : \Pr\left(Y^1 \le q\,|\,\text{complier}\right) \ge \tau\right\}$.
Summarising, identification and estimation is based only on those observations with
Z i ∈ {z min , z max }.12 In the following we will assume throughout that z min and z max
are known (and not estimated) and that Pr(Z = z min ) > 0 and Pr(Z = z max ) > 0.
To simplify the notation we will use the values 0 and 1 subsequently instead of z min
and z max , respectively. Furthermore, we will only refer to the effectively used sample
{i : Z i ∈ {0, 1}} or in other words, we assume that Pr(Z = z min ) + Pr(Z = z max ) = 1.
This is clearly appropriate for applications where the single instruments Z are binary.
In other applications, where Pr(Z = z min ) + Pr(Z = z max ) < 1, the results apply with
reference to the subsample {i : Z i ∈ {0, 1}}.
By considering only the endpoints of the support of Z , recoding Z as 0
and 1, and with D being a binary treatment variable, we can define the same
kind of partition of the population as in Chapter 4, namely into the four groups
12 Differently from the Marginal Treatment Estimator in Chapter 4 we are not exploring variations in Y or of
the complier population over the range of Z .
T ∈ {a, n, c, d} (always treated, never treated, compliers, defiers) for which we need to
assume
Assumption IV-1
with $\pi(x) = \Pr(Z = 1|X = x)$, which we will again refer to as the 'propensity score', although it refers to the instrument $Z$ and not to the treatment $D$.
Assumption IV-1 (i) requires that the instruments have some power in that there are at
least some individuals who react to it. The strength of the instrument can be measured by the probability mass of the compliers. The second assumption reflects the monotonicity
as it requires that D z weakly increases with z for all individuals (or decreases for all indi-
viduals). The third part of the assumption implicitly requires an exclusion restriction (⇒
triangularity) and an unconfounded instrument restriction. In other words, Z i must be
independent from the potential outcomes of individual i; and those individuals for whom
Z i = z is observed should not differ in their relevant unobserved characteristics from
individuals j with Z j = z. As discussed in Chapter 4, unless the instrument has been
randomly assigned, these restrictions are very unlikely to hold. However, conditional on
a large set of covariates X , these conditions can be made more plausible.
Note that we permit X to be endogenous. X can be related to U and V in (7.26) in any
way. This may be important in many applications, especially where X contains lagged
(dependent) variables that may well be related to unobserved ability U . The fourth
assumption requires that the support of X is identical in the Z = 0 and the Z = 1 sub-
population. This assumption is needed since we first condition on X to make the instru-
mental variables assumption valid but then integrate X out to obtain the unconditional
treatment effects.13 Let us also assume that the quantiles are unique and well defined;
this is not needed for identification, but very convenient for the asymptotic theory.
Assumption IV-2 The random variables Y 1 |c and Y 0 |c are continuous with positive
density in a neighbourhood of Q τY 1 |c and Q τY 0 |c , respectively.
Under these two Assumptions IV-1 and IV-2, a natural starting point to identify the QTE is to look again at the distribution functions of the potential outcomes, which could then be inverted to obtain the QTE for compliers, say $\Delta^\tau_c = Q^\tau_{Y^1|c} - Q^\tau_{Y^0|c}$. It can be shown that the potential outcome distributions are identified by
$$F_{Y^1|c}(u) = \frac{\int\left(E[\mathbb{1}\{Y \le u\}D|X, Z=1] - E[\mathbb{1}\{Y \le u\}D|X, Z=0]\right)dF(x)}{\int\left(E[D|X, Z=1] - E[D|X, Z=0]\right)dF(x)} = \frac{E\left[\mathbb{1}\{Y < u\}DW\right]}{E\left[DW\right]} \qquad (7.29)$$
13 An alternative set of assumptions, which leads to the same estimators later, replaces monotonicity with
the assumption that the average treatment effect is identical for compliers and defiers, conditional on X .
$$F_{Y^0|c}(u) = \frac{\int\left(E[\mathbb{1}\{Y \le u\}(D-1)|X, Z=1] - E[\mathbb{1}\{Y \le u\}(D-1)|X, Z=0]\right)dF(x)}{\int\left(E[D|X, Z=1] - E[D|X, Z=0]\right)dF(x)} = \frac{E\left[\mathbb{1}\{Y < u\}(1-D)W\right]}{E\left[DW\right]} \qquad (7.30)$$
with weights
$$W = \frac{Z - \pi(X)}{\pi(X)\left(1 - \pi(X)\right)}\,(2D - 1)\,. \qquad (7.31)$$
Here we have made use of the fact that for the proportion of compliers, say $P_c$, one has
$$P_c = \int\left(E[D|X, Z=1] - E[D|X, Z=0]\right)dF(x) = E\left[\frac{E[DZ|X]}{\pi(X)} - \frac{E[D(1-Z)|X]}{1 - \pi(X)}\right],$$
which with some algebra can be shown to equal $E\!\left[D\,\frac{Z - \pi(X)}{\pi(X)\{1 - \pi(X)\}}\right]$. Hence, one could estimate the QTE by the difference $q_1 - q_0$ of the solutions of the two moment conditions
$$E\left[\mathbb{1}\{Y < q_1\}DW\right] = \tau E\left[(1-D)W\right] \quad\text{and}\quad E\left[\mathbb{1}\{Y < q_0\}(1-D)W\right] = \tau E\left[DW\right] \qquad (7.32)$$
or equivalently (Exercise 4)
$$E\left[\left\{\mathbb{1}\{Y < q_1\} - \tau\right\}WD\right] = 0 \quad\text{and}\quad E\left[\left\{\mathbb{1}\{Y < q_0\} - \tau\right\}W(1-D)\right] = 0\,. \qquad (7.33)$$
To see the equivalence to QTE estimation note that these moment conditions are equivalent to a weighted quantile regression representation, namely the solution of the following optimisation problem
$$(\alpha, \beta) = \arg\min_{a,b}\ E\left[\rho_\tau(Y - a - bD)\cdot W\right], \qquad (7.34)$$
whose sample analogue is
$$\left(\hat Q^\tau_{Y^0|c},\ \hat\Delta^\tau_c\right) = \arg\min_{a,b}\ \frac{1}{n}\sum_{i=1}^{n}\rho_\tau(Y_i - a - bD_i)\,\hat W_i \qquad (7.35)$$
with $\hat W_i$ being as in (7.31) but with the predicted $\pi(X_i)$ for individual $i$. A problem in practice is that the sample objective (7.35) is typically non-convex since $W_i$ is negative for $Z_i \neq D_i$, and so will be $\hat W_i$. This complicates the optimisation problem because local optima could exist. The problem is not very serious here because we need to estimate only a scalar in the $D = 1$ population, and another one in the $D = 0$ population. In other words, we can write (7.35) equivalently as
$$\hat Q^\tau_{Y^1|c} = \arg\min_{q_1}\ \frac{1}{n}\sum_{i=1}^{n}\rho_\tau(Y_i - q_1)\,D_i\hat W_i \quad\text{and} \qquad (7.36)$$
$$\hat Q^\tau_{Y^0|c} = \arg\min_{q_0}\ \frac{1}{n}\sum_{i=1}^{n}\rho_\tau(Y_i - q_0)\,(1 - D_i)\hat W_i\,.$$
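To illustrate (7.36), the following sketch (assumed simulated data and a parametric logit for $\pi(x)$, not the book's implementation) evaluates the possibly non-convex weighted objectives on a grid of candidate values $q$, as suggested by the discussion of negative weights:

# Minimal sketch: QTE for compliers via the weighted objectives (7.36);
# because the weights can be negative, a simple grid search over q is used.
set.seed(1)
n <- 5000
x <- rnorm(n)
z <- rbinom(n, 1, plogis(x))                    # instrument, depends on X
v <- rnorm(n)
d <- as.integer(0.5 * z + 0.3 * x + v > 0)      # endogenous treatment
y <- 1 + d + 0.5 * x + v + rnorm(n)             # outcome shares v with d

pi.hat <- fitted(glm(z ~ x, family = binomial)) # pi(X) = Pr(Z = 1 | X)
w      <- (2 * d - 1) * (z - pi.hat) / (pi.hat * (1 - pi.hat))  # weights (7.31)

rho  <- function(u, tau) u * (tau - (u < 0))
tau  <- 0.5
grid <- seq(min(y), max(y), length.out = 400)
obj1 <- sapply(grid, function(q) mean(rho(y - q, tau) * d * w))
obj0 <- sapply(grid, function(q) mean(rho(y - q, tau) * (1 - d) * w))
q1   <- grid[which.min(obj1)]
q0   <- grid[which.min(obj0)]
c(Q1 = q1, Q0 = q0, QTE = q1 - q0)              # QTE for compliers at tau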
Here
$$f_{Y^1|c}(u) = \frac{1}{P_c}\int\left[f_{Y|X,D=1,Z=1}(u)\,p(X,1) - f_{Y|X,D=1,Z=0}(u)\,p(X,0)\right]dF_X\,,$$
$$f_{Y^0|c}(u) = \frac{-1}{P_c}\int\left[f_{Y|X,D=0,Z=1}(u)\{1 - p(X,1)\} - f_{Y|X,D=0,Z=0}(u)\{1 - p(X,0)\}\right]dF_X$$
are the marginal densities of the potential outcomes for the compliers. The variance contributions stem from two parts: first from the weighting by $W$ if the weights were known, and second from the fact that the weights are estimated. To attain $\sqrt n$ consistency, higher-order kernels are required if $X$ contains more than three continuous regressors; otherwise conventional kernels can be used. More precisely, the order of the kernel should be larger than $\dim(X)/2$. It can be shown that the estimator then reaches the semi-parametric efficiency bound, irrespective of whether $\pi(x)$ is known or estimated with a bias of order $o(n^{-1/2})$.
Now remember that these weights $W$ might sometimes be negative in practice, which leads to a non-convex optimisation problem. Alternatively one could work with modified, positive weights. These are obtained by applying an iterated-expectations argument to (7.34):
$$(\alpha, \beta) = \arg\min_{a,b}\ E\left[\rho_\tau(Y - a - bD)\cdot W\right] = \arg\min_{a,b}\ E\left[\rho_\tau(Y - a - bD)\,E[W|Y, D]\right].$$
Hence, the weights $W^+ = E[W|Y, D]$ can be used to develop an estimator with a linear programming representation. The sample objective function with $W^+$ instead of $W$ is globally convex in $(a, b)$ since it is a sum of convex functions, and the global optimum can be obtained in a finite number of iterations. However, we would need to estimate $W^+$ first. Although $W^+ = E[W|Y, D]$ is always non-negative, some predicted $\hat W_i^+$ can happen to be negative. In practice, the objective function would then be non-convex again. Since the probability that $\hat W_i^+$ is negative goes to zero as the sample size goes to infinity, one can use the weights $\max(0, \hat W_i^+)$ instead. In other words, negative $\hat W_i^+$ are discarded in the further estimation.
Similar to arguments discussed in the previous chapters, the covariates X are usu-
ally included to make the instrumental variable assumptions (exclusion restriction and
unconfoundedness of the instrument) more plausible. In addition, including covariates
X can also lead to more efficient estimates. Generally, with a causal model in mind,
we could think of four different cases for the covariates. A covariate X can (1) causally
influence Z and also D or Y , it can (2) influence Z but neither D nor Y , it can (3) influ-
ence D or Y but not Z , and finally (4) it may neither influence Z nor D nor Y .14 In
case (1), the covariate should be included in the set of regressors X because otherwise
the estimates would generally be inconsistent. In cases (2) and (4), the covariate should
14 There are also other possibilities where X might itself be on the causal path from Z or D or Y .
usually not be included in X as it would decrease efficiency and might also lead to com-
mon support problems. In case (3), however, inclusion of the covariate can reduce the
asymptotic variance.15
Let us finally comment on the estimation of the conditional QTE, i.e. the quantile treatment effect conditional on $X$. When looking at non-parametric estimators, these should again be local estimators. Therefore, when $X$ contains continuous regressors, fully non-parametric estimation will converge more slowly than the $\sqrt n$ rate. The early contributions to the estimation of conditional QTE usually imposed functional form assumptions. They often imposed restrictions on treatment effect heterogeneity, e.g. that the QTE does not vary with $X$, which in fact often implies equality of conditional and unconditional QTE. With those kinds of strong assumptions one can again reach $\sqrt n$ consistency.
Let us briefly consider one popular version which can easily be extended to semi-
and even non-parametric estimators.16 We still apply the assumption of facing a mono-
tone treatment choice decision function and can only identify the conditional QTE for
compliers, i.e.
$$Q^\tau_{Y^1|X,c} - Q^\tau_{Y^0|X,c}\,.$$
Let us assume that conditional on $X$ the $\tau$ quantile of $Y$ in the subpopulation of compliers is linear, i.e.
$$Q^\tau(Y|X, T = c) = \alpha_0^\tau D + X\beta_0^\tau\,, \qquad (7.39)$$
where $D$ and $Z$ are binary. If the subpopulation of compliers were known, the parameters $\alpha$ and $\beta$ of such a simple linear quantile regression could be estimated via
$$\arg\min_{a,b}\ E\left[\rho_\tau\left(Y - aD - Xb\right)\,|\,T = c\right]. \qquad (7.40)$$
Evidently, since we do not know in advance which observations belong to the compliers,
this is not directly achievable. But as before, with an appropriate weighting function
containing the propensity π(X ) = Pr(Z = 1|X ) it becomes a feasible task. First note
that for any absolutely integrable function ξ(·) you have
E [ξ(Y, D, X )|T = c] Pc = E [W · ξ(Y, D, X )] , (7.41)
with $P_c$ being the proportion of compliers, and the weight
$$W = 1 - \frac{D(1 - Z)}{1 - \pi(X)} - \frac{(1 - D)Z}{\pi(X)}\,. \qquad (7.42)$$
To see the equality of (7.41), realise that with D and Z binary and monotonicity in the
participation decision (having excluded defiers and indifferent individuals), for Pa the
proportion of always-participants, Pn the proportion of never-participants, and Pc that
of compliers, we obtain
$$E\left[W\cdot\xi(Y, D, X)\right] = E\left[W\cdot\xi(Y, D, X)\,|\,T = c\right]P_c + E\left[W\cdot\xi(Y, D, X)\,|\,T = a\right]P_a + E\left[W\cdot\xi(Y, D, X)\,|\,T = n\right]P_n\,.$$
15 Frölich and Melly (2013) show that the semi-parametric efficiency bound decreases in this situation.
16 The version here presented is basically the estimator of Abadie, Angrist and Imbens (2002).
where Pc has been ignored as it does not affect the values where this function is
minimised.
Since (7.40) is globally convex in $(a, b)$, the function (7.43) is also convex, as the objective function is identical apart from the multiplication by $P_c$. But again, the weights $W_i$ for individuals $i$ with $D_i \neq Z_i$ are negative, and consequently the sample analogue
$$\arg\min_{a,b}\ \frac{1}{n}\sum_{i}W_i\cdot\rho_\tau\left(Y_i - aD_i - X_ib\right) \qquad (7.44)$$
may not be globally convex in (a, b). Algorithms for such piecewise linear but non-
convex objective functions may not find the global optimum and (7.44) does not have
a linear programming (LP) representation. As in the case of IV estimation of the
unconditional QTE, one may use the weights W + := E [W |Y, D, X ] instead of W ,
which can be shown to be always non-negative. This permits the use of conven-
tional LP algorithms, but the estimation of the weights E [W |Y, D, X ] requires either
additional parametric assumptions or high-dimensional non-parametric regression.17
Unfortunately, since the estimates of $W^+ = E[W|Y, D, X]$ could again be negative, another
modification is necessary before an LP algorithm can be used. One could also use
the weights (7.31) instead of (7.42), both would lead to consistent estimation of α
and β, but it is not clear which ones will be more efficient. For compliers, W varies
with X whereas W + equals 1 for them. In any case, both types of weights would be
generally inefficient since they do not incorporate the conditional density function of
the error term at the τ quantile. Hence, if one was mainly interested in estimating
a conditional QTE with a parametric specification, more efficient estimators could be
developed.
At the beginning of this chapter on quantile treatment effects we mentioned that even if
one is not primarily interested in the distributional impacts, one may still use the quantile
method to reduce susceptibility to outliers. This argument is particularly relevant for the
regression discontinuity design (RDD) method since the number of observations close
to the discontinuity threshold is often relatively small. This is why we dedicate here
more space to QTE with RDD.
17 Note that the weights W + = E[W |Y, D] cannot be used, as conditioning on X is necessary here.
On the other hand, in Chapter 6 we came to know the so-called RDD approach as
an alternative to the instrumental approach.18 So one could keep this section short by
simply defining the RDD as our instrument and referring to the last (sub)section. Instead,
we decided to take this as an opportunity to outline a different estimation method,
namely via numerically inverting the empirical cdf, while presenting some more details
on the RDD-QTE estimation.
Let us first recall the definition of the two designs and definitions we are using in
the RDD approach. One speaks of a sharp design if the treatment indicator D changes
for everyone at the threshold z 0 of a given variable Z which typically represents the
distance to a natural border (administrative, geographical, cultural, age limit, etc.). One
could then write
$$D = \mathbb{1}\{Z \ge z_0\}\,. \qquad (7.45)$$
In this sharp design, all individuals change programme participation status exactly at
z 0 . In many applications, however, the treatment decision contains some elements of
discretion. Caseworkers may have some latitude about whom they offer a programme,
or they may partially base their decision on criteria that are unobserved to the econome-
trician. In this case, known as the fuzzy design, $D$ is permitted to depend also on other (partly observed or entirely unobserved) factors, but the treatment probability nonetheless changes discontinuously at $z_0$, i.e.
$$\lim_{z\downarrow z_0}E[D|Z = z] - \lim_{z\uparrow z_0}E[D|Z = z] \neq 0\,. \qquad (7.46)$$
The fuzzy design includes the sharp design as a special case when the left-hand side of
(7.46) is equal to one. Therefore the following discussion focusses on the more general
fuzzy design.
Let Nε be a symmetric ε neighbourhood about z 0 and partition Nε into Nε+ = {z :
z ≥ z 0 , z ∈ Nε } and Nε− = {z : z < z 0 , z ∈ Nε }. According to the reaction to the
distance $z$ over $N_\varepsilon$ we can partition the population into five (to us already well-known) subpopulations: always-treated, never-treated, compliers, defiers, and an 'indefinite' group.
We have already discussed the groups at different places in this book. The fifth group
(labelled indefinite) contains all units that react non-monotonously over the Nε neigh-
bourhood, e.g. they may first switch from D = 0 to 1 and switch then back for increasing
values of z. Clearly, for binary IVs the definition of such a group would not have made
18 Some may argue that this is not a different approach as one could interpret the RDD as a particular
instrument. As we discussed in that chapter, the main promoters of this method, however, prefer to
interpret it as a particular case of randomised experiments.
much sense. Since in the RDD $Z$ itself is not the instrument but (if anything) $\mathbb{1}\{Z \ge z_0\}$ is, such a group might exist. For identification reasons, however, we must exclude them
by assumption together with the defiers. Note that in the sharp design, everyone is a
complier (by definition) for any ε > 0. We work with the following basic assumptions:
Assumption RDD-1 There exists some positive ε̄ such that for every positive ε ≤ ε̄
These assumptions require that for every sufficiently small neighbourhood, the thresh-
old acts like a local IV. Assumption RDD-1 (i) requires that E [D|Z ] is in fact
discontinuous at z 0 , i.e. we assume that some units change their treatment status exactly
at z 0 . Then, (ii) requires that in a very small neighbourhood of z 0 , the instrument has a
weakly monotonous impact on D(z). Further, (iii) and (iv) impose the continuity of the
types and the distribution of the potential outcomes as a function of Z at z 0 . Finally, (v)
requires that observations close to z 0 exist.
Under Assumption RDD-1 the distribution functions of the potential outcomes for local compliers are identified. Define $F_{Y^d|c}(u) = \lim_{\varepsilon\to0}F_{Y^d|Z\in N_\varepsilon, T_\varepsilon = c}(u)$ and, as in Chapter 6, $1^+ = \mathbb{1}\{Z \ge z_0\} = 1 - 1^-$. Then we get that the distributions of the potential outcomes for the local compliers are identified as
$$F_{Y^1|c}(u) = \lim_{\varepsilon\to0}\frac{E\left[\mathbb{1}\{Y \le u\}\left(1^+ - p_\varepsilon\right)|Z\in N_\varepsilon, D = 1\right]}{E\left[\left(1^+ - p_\varepsilon\right)|Z\in N_\varepsilon, D = 1\right]}\,, \quad\text{and}$$
$$F_{Y^0|c}(u) = \lim_{\varepsilon\to0}\frac{E\left[\mathbb{1}\{Y \le u\}\left(1^+ - p_\varepsilon\right)|Z\in N_\varepsilon, D = 0\right]}{E\left[\left(1^+ - p_\varepsilon\right)|Z\in N_\varepsilon, D = 0\right]}\,, \qquad (7.47)$$
In Monte Carlo simulations, however, it turned out that the estimators for the potential outcome distributions performed better when a non-parametric estimator for $p_\varepsilon$ was used. The reasons for this could be many and can therefore not be discussed here.19
Since in the sharp design everyone with $1^+ = 1$ has $D = 1$ and vice versa, i.e. everyone is a complier at $z_0$, the distributions of the potential outcomes in the population are identified as
Analogously to the above, for the potential outcome cdfs one obtains from Assumption RDD-1 also identification formulae for the quantiles of the potential outcomes for the local compliers, namely
$$Q^\tau_{Y^1|c} = \lim_{\varepsilon\to0}\arg\min_{q}\ E\left[\rho_\tau(Y - q)\left(1^+ - p_\varepsilon\right)|Z\in N_\varepsilon, D = 1\right], \quad\text{and}$$
$$Q^\tau_{Y^0|c} = \lim_{\varepsilon\to0}\arg\min_{q}\ E\left[\rho_\tau(Y - q)\left(p_\varepsilon - 1^+\right)|Z\in N_\varepsilon, D = 0\right],$$
where $\rho_\tau(u) = u\cdot(\tau - \mathbb{1}\{u < 0\})$ is the check function. Again one could try with $p_\varepsilon = 0.5$.
Regarding the quantile treatment effect (QTE) $\Delta^\tau_{QTE} = Q^\tau_{Y^1|c} - Q^\tau_{Y^0|c}$, we could identify it directly as
$$\left(Q^\tau_{Y^0|c},\ \Delta^\tau_{QTE}\right) = \lim_{\varepsilon\to0}\arg\min_{a,b}\ E\left[\rho_\tau(Y - a - bD)\left(1^+ - p_\varepsilon\right)(2D - 1)\,|\,Z\in N_\varepsilon\right], \qquad (7.49)$$
which corresponds to a local linear quantile regression. Hence, the quantiles can
be obtained by univariate weighted quantile regressions. Despite its simplicity one
should note that the objective function of the weighted quantile regression is not
convex if some of the weights are negative. Conventional linear programming
algorithms will typically not work. Instead of repeating the discussion and pro-
cedures from the last sections for this modified context, we will briefly study
the non-parametric estimators for the distribution functions, and give the corre-
sponding quantile estimators resulting from inverting these distribution functions
afterwards.
The distribution functions can be estimated by local regression in a neighbourhood of
z 0 . More specifically, let K i be some kernel weights depending on the distance between
$Z_i$ and $z_0$, and on a bandwidth $h$ that converges to zero. Then, with a consistent estimator for $p_\varepsilon$, e.g. $\hat p_\varepsilon = \sum_i 1_i^+ K_i\big/\sum_i K_i$, a natural estimator for the distribution function $F_{Y^1|c}$ is (Exercise 7)
19 Note that having $1^+$ in both the numerator and denominator of (7.47), but $1^-$ in (7.48), is not an erratum; see Exercise 6.
$$\hat F_{Y^1|c}(u) = \frac{\sum_{i=1}^{n}\mathbb{1}\{Y_i \le u\}\,D_i\left(1_i^+ - \hat p_\varepsilon\right)K_i}{\sum_{i=1}^{n}D_i\left(1_i^+ - \hat p_\varepsilon\right)K_i}
= \frac{\dfrac{\sum_{i:1_i^+=1}\mathbb{1}\{Y_i \le u\}D_iK_i}{\sum_{i:1_i^+=1}K_i} - \dfrac{\sum_{i:1_i^+=0}\mathbb{1}\{Y_i \le u\}D_iK_i}{\sum_{i:1_i^+=0}K_i}}{\dfrac{\sum_{i:1_i^+=1}D_iK_i}{\sum_{i:1_i^+=1}K_i} - \dfrac{\sum_{i:1_i^+=0}D_iK_i}{\sum_{i:1_i^+=0}K_i}}\,. \qquad (7.50)$$
This is certainly just a modified version of the Wald estimator. Let us define for a random variable $V$ the right limit $m_V^+ = \lim_{\varepsilon\to0}E[V|Z = z_0 + \varepsilon]$ and the left limit $m_V^- = \lim_{\varepsilon\to0}E[V|Z = z_0 - \varepsilon]$. Imagine now that in (7.50) the variable $V$ represents either $\mathbb{1}\{Y \le u\}\cdot D$, or $\mathbb{1}\{Y \le u\}\cdot(1 - D)$, or $(1 - D)$, or $D$. In all cases $V$ has bounded support such that the previously defined limit functions are bounded, too. The suggested estimator is
$$\hat F_{Y^1|c}(u) = \frac{\hat m^+_{\mathbb{1}\{Y\le u\}D} - \hat m^-_{\mathbb{1}\{Y\le u\}D}}{\hat m^+_D - \hat m^-_D}\,.$$
Similarly, for the non-treatment outcome we can use
$$\hat F_{Y^0|c}(u) = \frac{\hat m^+_{\mathbb{1}\{Y\le u\}(1-D)} - \hat m^-_{\mathbb{1}\{Y\le u\}(1-D)}}{\hat m^+_{1-D} - \hat m^-_{1-D}}\,.$$
If we want to apply local linear weights, which appears appropriate here since we are effectively estimating conditional means at boundary points (from the left and right side of $z_0$), each of our $m_V^+$ is estimated as the value of $a$ that solves
$$\arg\min_{a,b}\ \sum_{i=1}^{n}\left\{V_i - a - b(Z_i - z_0)\right\}^2 1_i^+\,K\!\left(\frac{Z_i - z_0}{h}\right).$$
Analogously, $m_V^-$ can be estimated by using only observations to the left of $z_0$. This can be applied to all four above-discussed versions of $V$.
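A minimal sketch of this boundary estimation (assumed simulated data, not the book's code): each one-sided limit is obtained from a kernel-weighted local linear fit, and the limits are then combined as in the Wald-type formula above:

# Minimal sketch: local linear estimation of m_V^+ and m_V^- at z0, plugged
# into the estimator of F_{Y^1|c}(u).
set.seed(1)
n  <- 4000
z  <- runif(n, -1, 1); z0 <- 0
d  <- rbinom(n, 1, plogis(-1 + 2 * (z >= z0)))      # jump in Pr(D=1) at z0
y  <- d + z + rnorm(n)

h  <- 0.2
K  <- dnorm((z - z0) / h)                           # kernel weights
mhat <- function(v, side) {                         # local linear fit, one side
  s <- if (side == "+") z >= z0 else z < z0
  coef(lm(v ~ I(z - z0), weights = K, subset = s))[1]
}

u   <- 1.5
num <- mhat((y <= u) * d, "+") - mhat((y <= u) * d, "-")
den <- mhat(d, "+")           - mhat(d, "-")
num / den                                           # F-hat_{Y^1|c}(u)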
As usual, in order to use the estimator, draw conclusions, construct confidence inter-
vals, etc. it is quite helpful to know the statistical properties of the estimator(s). In order
to state them, we first have to specify some more regularity conditions.
Assumption RDD-2 The following conditions are assumed to hold.
(i) The data {(Yi , Di , Z i )} are i.i.d. with X being a compact set.
(ii) Smoothness and existence of limits: the left and right limits of the functions
E[11 {Y ≤ u} |Z , D = 0], E[11 {Y ≤ u} |Z , D = 1] and E[D|Z ] exist at z 0 , and
these functions are twice continuously differentiable with respect to Z at z 0 with
second derivatives being Hölder continuous in a left and a right ε-neighbourhood
of z 0 , and uniformly on a compact subset of IR, say Y.
(iii) The density f Z is bounded away from zero, and is twice continuously differentiable
at z 0 with a second derivative being Hölder continuous in an ε-neighbourhood of z 0 .
(iv) The fraction of compliers Pc = m + −
D − m D is bounded away from zero.
(v) For the bandwidth $h$ it holds that $nh \to \infty$ and $\sqrt{nh}\cdot h^2 \to \Lambda < \infty$.
(vi) Kernel K is symmetric, bounded, zero outside a compact set and integrates to one.
These conditions were already discussed in Chapter 6. Recall that condition (iv) is
equivalent to assuming that we have a strong instrument in an IV context, and condition
(v) balances bias and variance of the estimator. This way, for $\Lambda > 0$, squared bias and variance are of the same order. One may want to modify this to obtain faster rates for the bias.
To simplify the notation, the same bandwidth is used for all functions on both sides
of the threshold. The method does certainly also allow for different bandwidths as long
as the convergence rates of the bandwidths are the same. Recall the definitions of the
kernel constants: $\mu_l = \int_{-\infty}^{\infty}u^lK(u)\,du$, $\bar\mu_l = \int_0^{\infty}u^lK(u)\,du$, $\tilde\mu = \bar\mu_2\bar\mu_0 - \bar\mu_1^2$, and $\ddot\mu_l = \int_0^{\infty}u^lK^2(u)\,du$. Then we can state:
THEOREM 7.1 If Assumptions RDD-1 and 2 are satisfied, the estimators $\hat F_{Y^0|c}(u)$ and $\hat F_{Y^1|c}(u)$ of the distribution functions for the compliers, i.e. $F_{Y^0|c}(u)$ and $F_{Y^1|c}(u)$, jointly converge in law such that
$$\sqrt{nh_n}\left(\hat F_{Y^j|c}(u) - F_{Y^j|c}(u)\right) \longrightarrow G^j(u)\,, \qquad j\in\{0,1\}\,,$$
in the set of all uniformly bounded real functions on $\mathcal Y$, sometimes denoted by $\ell^\infty(\mathcal Y)$, where the $G^j(u)$ are Gaussian processes with mean functions
$$b_j(u) = \frac{\bar\mu_2^2 - \bar\mu_1\bar\mu_3}{2\tilde\mu}\,\frac{\Lambda}{P_c}\left[\frac{\partial^2 m^+_{\mathbb{1}\{Y\le u\}(D+j-1)}}{\partial z^2} - F_{Y^j|c}(u)\frac{\partial^2 m^+_D}{\partial z^2} - \frac{\partial^2 m^-_{\mathbb{1}\{Y\le u\}(D+j-1)}}{\partial z^2} + F_{Y^j|c}(u)\frac{\partial^2 m^-_D}{\partial z^2}\right],$$
where $\frac{\partial^2 m_V^+}{\partial z^2} = \lim_{\varepsilon\to0}\frac{\partial^2 E[V|Z=z_0+\varepsilon]}{\partial z^2}$ for a random variable $V$, and $\frac{\partial^2 m_V^-}{\partial z^2}$ is the analogous left limit,20 and covariance functions21 ($j,k\in\{0,1\}$)
$$v_{j,k}(u,\tilde u) = \frac{\bar\mu_2^2\ddot\mu_0 - 2\bar\mu_2\bar\mu_1\ddot\mu_1 + \bar\mu_1^2\ddot\mu_2}{\tilde\mu^2}\,\frac{1}{P_c^2\,f_Z(z_0)}\left(\omega^+_{j,k}(u,\tilde u) + \omega^-_{j,k}(u,\tilde u)\right),$$
with
$$\omega^+_{j,k}(u,\tilde u) = \lim_{\varepsilon\to0}\operatorname{Cov}\!\left((D+j-1)\left[\mathbb{1}\{Y\le u\} - F_{Y^j|c}(u)\right],\ (D+k-1)\left[\mathbb{1}\{Y\le\tilde u\} - F_{Y^k|c}(\tilde u)\right]\ \Big|\ Z\in N_\varepsilon^+\right)$$
20 Note that the first fraction is a constant that depends only on the kernel, e.g. $-\tfrac{11}{190}$ for the Epanechnikov.
21 Note that the first fraction is a constant that depends only on the kernel, e.g. $\tfrac{56832}{12635}$ for the Epanechnikov.
The bias functions $b_j(u)$ disappear if we choose $\Lambda = 0$, and thereby choose an undersmoothing bandwidth for the functions to be estimated. This has the advantage of simplifying the asymptotics. The asymptotic covariances are the sum of the covariances of the estimated functions rescaled by $P_c^2 f_Z(z_0)$.
A possible way to characterise the treatment effect on the outcome $Y$ consists in estimating the distribution treatment effect (DTE) for compliers, say $\Delta^u_{DTE}$, by $F_{Y^1|c}(u) - F_{Y^0|c}(u)$. A natural estimator is $\hat\Delta^u_{DTE} = \hat F_{Y^1|c}(u) - \hat F_{Y^0|c}(u)$, for which it can be shown that under Assumptions RDD-1 and 2 it converges in $\ell^\infty(\mathcal Y)$ to the Gaussian process
$$\sqrt{nh_n}\left(\hat\Delta^u_{DTE} - \Delta^u_{DTE}\right) \longrightarrow G^1(u) - G^0(u)$$
with mean function $b_1(u) - b_0(u)$ and covariance function $v_{1,1}(u, \tilde u) + v_{0,0}(u, \tilde u) - 2v_{0,1}(u, \tilde u)$.
Let us turn to the quantile treatment effects. They have a well-defined asymptotic
distribution only if the outcome is continuous with continuous densities. One therefore
needs the additional
Assumption RDD-3 $F_{Y^0|c}(u)$ and $F_{Y^1|c}(u)$ are both continuously differentiable with continuous density functions $f_{Y^0|c}(u)$ and $f_{Y^1|c}(u)$ that are bounded from above and away from zero on $\mathcal Y$.
One could estimate the quantile treatment effects by the sample analogue of (7.49). But this minimisation problem, too, is non-convex because some weights are positive while others are negative. This requires grid searches or algorithms for non-convex problems, which are not guaranteed to find a global optimum. Instead, one can follow a more direct strategy by inverting the estimated distribution function. There one might face a similar problem, in particular that the estimated distribution function is non-monotone, i.e. $\hat F_{Y^j|c}(u)$ may decrease when we increase $u$. But this is only a small-sample problem because the assumed monotonicity ensures that the estimated distribution function is asymptotically strictly increasing. A quick and simple method to monotonise the estimated distribution functions is to perform a rearrangement. This does not affect the asymptotic properties of the estimator but allows us to invert it. These procedures typically consist of a sequence of closed-form steps and are very quick.
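A minimal sketch of such a rearrangement (with an assumed, artificially perturbed cdf estimate; not the book's implementation): sorting the estimated cdf values over a grid restores monotonicity and allows inversion:

# Minimal sketch: monotonising an estimated cdf on a grid by rearrangement
# (sorting), then inverting it to obtain quantiles.
u.grid <- seq(-3, 3, length.out = 101)
set.seed(1)
F.hat  <- pnorm(u.grid) + rnorm(101, sd = 0.02)   # noisy, possibly non-monotone estimate
F.mono <- sort(pmin(pmax(F.hat, 0), 1))           # rearranged, monotone version

Q.hat  <- function(tau) u.grid[which(F.mono >= tau)[1]]  # generalised inverse
c(Q.hat(0.25), Q.hat(0.5), Q.hat(0.75))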
THEOREM 7.2 If Assumptions RDD-1 to 3 are satisfied, the estimators $\hat Q_{Y^0|c}(\tau)$ and $\hat Q_{Y^1|c}(\tau)$ jointly converge in $\ell^\infty((0,1))$ to the Gaussian processes
$$\sqrt{nh_n}\left(\hat Q_{Y^j|c}(\tau) - Q_{Y^j|c}(\tau)\right) \longrightarrow -f_{Y^j|c}^{-1}\!\left(Q_{Y^j|c}(\tau)\right)G^j\!\left(Q_{Y^j|c}(\tau)\right) =: \tilde G^j(\tau)\,, \qquad j\in\{0,1\}\,,$$
with mean function $\tilde b_j(\tau) = -f_{Y^j|c}^{-1}\!\left(Q_{Y^j|c}(\tau)\right)b_j\!\left(Q_{Y^j|c}(\tau)\right)$ and covariance function $\tilde v_{j,k}(\tau,\tilde\tau) = f_{Y^j|c}^{-1}\!\left(Q_{Y^j|c}(\tau)\right)f_{Y^k|c}^{-1}\!\left(Q_{Y^k|c}(\tilde\tau)\right)v_{j,k}\!\left(Q_{Y^j|c}(\tau), Q_{Y^k|c}(\tilde\tau)\right)$, with $b_j$ and $v_{j,k}$ as in Theorem 7.1. Furthermore, for the estimator $\hat\Delta^\tau_{QTE}$ of the QTE for the compliers one has
$$\sqrt{nh_n}\left(\hat\Delta^\tau_{QTE} - \Delta^\tau_{QTE}\right) \longrightarrow \tilde G^1(\tau) - \tilde G^0(\tau)$$
with mean function $\tilde b_1(\tau) - \tilde b_0(\tau)$ and covariance function $\tilde v_{1,1}(\tau, \tilde\tau) + \tilde v_{0,0}(\tau, \tilde\tau) - 2\tilde v_{0,1}(\tau, \tilde\tau)$.
It can also be shown that smooth functionals of both distribution functions satisfy
a functional central limit theorem. This is very helpful in practice as we will see in
Example 7.6. First let us state the theory:
THEOREM 7.3 Let $\xi\left(u, F_{Y^0|c}, F_{Y^1|c}\right)$ be a functional taking values in $\ell^\infty(\mathcal Y)$ that is differentiable in $\left(F_{Y^0|c}, F_{Y^1|c}\right)$ tangentially to the set of continuous functions, with derivative $\left(\xi_0, \xi_1\right)$.22 If Assumptions RDD-1 and 2 are satisfied, then the plug-in estimator $\hat\xi(u) \equiv \xi\left(u, \hat F_{Y^0|c}, \hat F_{Y^1|c}\right)$ converges in $\ell^\infty((0,1))$ as follows:
$$\sqrt{nh_n}\left(\hat\xi(u) - \xi(u)\right) \longrightarrow \xi_0(u)\,G^0(u) + \xi_1(u)\,G^1(u)\,.$$
This is very useful in many situations where the interest is directed to a derivative or
a parameter of the distribution. Let us look at cases where the Lorenz curve or the Gini
coefficient of the income distribution are the objects of interest.
Example 7.6 We apply Theorem 7.3 in order to derive the limiting distribution of the
estimators of the Lorenz curves and the Gini coefficients of the potential outcomes.
Their estimates are defined as
$$L^j(\tau) = \frac{\int_0^\tau Q_{Y^j|c}(t)\,dt}{\int_0^1 Q_{Y^j|c}(t)\,dt}\,, \qquad \hat L^j(\tau) = \frac{\int_0^\tau \hat Q_{Y^j|c}(t)\,dt}{\int_0^1 \hat Q_{Y^j|c}(t)\,dt}\,.$$
The Hadamard derivative of the map from the distribution function to the Lorenz curve can be found e.g. in Barrett and Donald (2009). Using their result one obtains the limiting distribution for a simple plug-in estimator, i.e.
$$\sqrt{nh_n}\left(\hat L^j(\tau) - L^j(\tau)\right) \longrightarrow \frac{\int_0^\tau \tilde G^j(t)\,dt - L^j(\tau)\int_0^1 \tilde G^j(t)\,dt}{\int_0^1 Q_{Y^j|c}(t)\,dt} =: \tilde G^j_L(\tau) \qquad (7.51)$$
22 What is exactly demanded is the so-called Hadamard or compact differentiability; see, for example, Gill
(1989), page 100.
For the same reasons discussed in the previous chapters and sections it can be useful to
incorporate additional covariates X . We recommend to do this in a fully non-parametric
way and then suppose that Assumption RDD-1 holds conditionally on X . Even if one
believes that the RDD is valid without conditioning, one might want to check the robust-
ness of the results when covariates are included. As before, including covariates might
increase the precision of the estimates. Another reason for incorporating covariates applies when the threshold crossing at $z_0$ itself affects them. Under certain conditions we can then separate the direct from the indirect effects by controlling for $X$, but we first obtain the conditional treatment effect. The common support restriction will then identify the unconditional effects, which are obtained as usual by integrating the conditional treatment effect over $X$. We thus need
Assumption RDD-4 Suppose Assumption RDD-1 (i), (ii) and (v). Suppose further that
Assumption RDD-1 (iii) and (iv) are true conditionally on X . Further assume:
(vi) Common support: $\lim_{\varepsilon\to0}\operatorname{Supp}(X|Z\in N_\varepsilon^+) = \lim_{\varepsilon\to0}\operatorname{Supp}(X|Z\in N_\varepsilon^-)$
Under these assumptions, similar expressions as in the Theorems above are obtained,
but the weights are now functions of pε (x) = Pr (Z ≥ z 0 |X = x, Z ∈ Nε ), and one has
$$\left(Q^\tau_{Y^0|c},\ \Delta^\tau_{QTE}\right) = \lim_{\varepsilon\to0}\arg\min_{a,b}\ E\left[\rho_\tau(Y - a - bD)\,\frac{1^+ - p_\varepsilon(X)}{p_\varepsilon(X)\left(1 - p_\varepsilon(X)\right)}\,(2D - 1)\,\Big|\,Z\in N_\varepsilon\right].$$
This shows that the unconditional QTE can be estimated via a simple weighted quan-
tile regression where the covariates X only enter in the weights via pε (x). Again,
the weights in the previous expression are sometimes positive and sometimes nega-
tive such that conventional linear programming algorithms fail because of the potential
non-convexity.
heterogeneous effects, its theoretical properties have been studied extensively, and it has
been used in many empirical studies. Chaudhuri (1991) analysed non-parametric estima-
tion of conditional QTE. A more recent contribution is Hoderlein and Mammen (2007),
who consider marginal effects in non-separable models.
Linear instrumental variable quantile regression estimates have been proposed for
example by Abadie, Angrist and Imbens (2002), Chernozhukov and Hansen (2005),
and Chernozhukov and Hansen (2006). Chernozhukov, Imbens and Newey (2007) and
Horowitz and Lee (2007) have considered non-parametric IV estimation of conditional
quantile functions. Furthermore, instead of exploiting monotonicity in the relationship
predicting D, alternative approaches assume a monotonicity in the relationship deter-
mining the Y variable. Finally, in a series of papers, Chesher examines non-parametric
identification of conditional distributional effects with structural equations, see Chesher
(2010) and references therein.
Regarding the bandwidth choice, note that for semi-parametric estimators the first-order asymptotics often do not depend on the bandwidth value, at least as long as sufficient smoothness conditions are fulfilled and all necessary bias reduction methods were applied in the non-parametric step. This has the obvious implication that the first-order asymptotics are not helpful for selecting bandwidth values. Therefore, on the one hand, such methods would have to be based on second-order approximations. On the other hand, it is well known that in practice these approximations are of little help for finite samples. Taking all of this together, it must be said that the bandwidth choice problem remains an open field.
Frölich and Melly (2013) discuss the relationship between existing estimators. For
example, Abadie, Angrist and Imbens (2002) are interested in parametrically estimating
conditional QTE (with a simple linear model). One could be tempted to adapt that
approach to estimating unconditional QTE by using the weights (7.42) but no X in that
parametric specification. However, this approach would not lead to consistent estimation
as it would converge to the difference between the τ quantiles of the treated compliers
and non-treated compliers, respectively:
$$F^{-1}_{Y^1|c, D=1}(\tau) - F^{-1}_{Y^0|c, D=0}(\tau)\,.$$
This difference is not very meaningful as one compares the Y 1 outcomes among the
treated with the Y 0 outcomes among the non-treated. Therefore, in the general case the
weights (7.42) are only useful to estimate conditional quantile effects. If one is interested
in non-parametric estimation of the unconditional QTE, one should use the weights in
(7.31) but not those in (7.42). When X is the empty set, e.g. in the case where Z is
randomly assigned, then the weights (7.31) and those in (7.42) are proportional such
that both approaches converge to the same limit.
Often, when people speak about distributional effects, they are thinking of changes
in the distribution of Y = ϕ(X, U ) caused by a new distribution of X but keeping the
distribution of U unchanged. That is, we are in the situation where the impact of D on
Y happens exclusively through X . Note that in such a situation you are not necessarily
interested in studying a causal effect of X on Y ; you are rather interested in the change
of $F_Y$ to $F_Y^*$ caused by a change from $F_X$ to $F_X^*$. This implies that you take the latter change (i.e. $F_X^*$) as known or at least as predictable. Often one speaks also of $F_X$ and $F_Y$
as being the distributions of the source population, whereas FX∗ , FY∗ denote those of the
target population. Of interest are certainly only those target distributions whose changes
(from FY to FY∗ ) are exclusively caused by the change from FX to FX∗ .
In Section 7.1.2 we already saw the two approaches proposed by Machado and Mata
(2005) and Melly (2005), recall equations (7.16), (7.17) for the latter one. For a related
approach compare also with Gosling, Machin and Meghir (2000). Firpo, Fortin and
Lemieux (2009) aim to estimate the partial effects on FY caused by marginal changes of
FX . For the case of quantiles these could be approximated via the regression of (under
certain conditions analytically) equivalent expressions containing the re-centred influ-
ence function of the Y -quantiles on X . They study this approach for parametric and
non-parametric estimation methods. Chernozhukov, Fernández-Val and Melly (2013)
review the problem, summarise the different approaches in a joint formal framework,
and discuss inference theory under general conditions.
Having in mind that $F_Y(y) = \int F(y, x)\,dx = \int F(y|x)\,dF_X$ can be well approximated by $\frac{1}{n}\sum_{i=1}^{n}F(y|x_i)$, all you need for predicting $F_Y^*(y) = \int F^*(y|x)\,dF_X^* \approx \frac{1}{n^*}\sum_{i=1}^{n^*}F^*(y|x_i^*)$ is a reasonable predictor for $F^*(y|x)$, together with either a given distribution $F_X^*$ or a sample $\{x_i^*\}_{i=1}^{n^*}$ from the target population. In the existing methods it is assumed that $F^*(y|x)$ can be estimated from the available data, or simply that $F_Y^*(y) = E[F(y|X^*)]$, which is, for example, the case if for $Y = \varphi(X, U)$ you have also $Y^* = \varphi(X^*, U)$ with $U$ independent from $X$ and $X^*$. Note that this does not exclude
the dependence of the conditional moments of Y on X , but the moment functions must
be the same for the pair (Y ∗ , X ∗ ). Some might argue that this would be a strong restric-
tion; others might say that this is exactly what counterfactual distributions are. For a
simple though quite flexible and effective way to implement this idea, see Dai, Sperlich
and Zucchini (2016). The asymptotic properties for a purely non-parametric predictor
of FY∗ (y) based on this idea are studied in Rothe (2010).
For some of the routines one first needs to obtain predictions for the propensity scores (as also explained in earlier chapters). These propensity score predictions are used to calculate the weights $W = \frac{\mathbb{1}\{D=d\}}{\Pr(D=d|x)}$. To obtain a quantile $Q^\tau_{Y^d}$ you run a univariate quantile regression using rq and setting the option weights=W. To construct the standard errors and confidence intervals you can use the bootstrap function boot.rq from the same package (quantreg).
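A compact sketch of this workflow with quantreg (simulated data assumed; the propensity model and variable names are illustrative, not the book's):

# Minimal sketch: propensity score, weights 1{D=1}/Pr(D=1|X), weighted
# univariate quantile regression with bootstrap standard errors.
library(quantreg)

set.seed(1)
n <- 1000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(x))
y <- 1 + x + d + rnorm(n)

ps  <- fitted(glm(d ~ x, family = binomial))   # Pr(D = 1 | X)
w   <- d / ps                                  # weights 1{D=1}/Pr(D=1|X)
fit <- rq(y ~ 1, tau = 0.5, weights = w, subset = w > 0)
summary(fit, se = "boot")                      # bootstrap SEs (see also boot.rq)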
The corresponding commands for these functions in Stata are qreg, iqreg, sqreg and bsqreg, which provide quantile regression, interquartile range regression, simultaneous quantile regression and bootstrapped quantile regression, respectively. Among these commands only qreg accepts the use of weights, and sqreg and bsqreg calculate a variance-covariance estimator (via bootstrapping).
Since the quantile treatment effect under endogeneity and in the presence of a plausible instrumental variable, say $Z$, is equivalent to the solution of $(\hat{Q}^{\tau}_{Y^0}, \hat{\Delta}^{\tau}) = \arg\min_{a,b} \frac{1}{n}\sum_{i=1}^{n} \rho_\tau(Y_i - a - b D_i)\,\hat{W}_i$, one can first calculate some weights, say $\hat{W} = \frac{Z - \pi(X)}{\pi(X)(1 - \pi(X))}\,(2D - 1)$, and then proceed with the univariate estimation of the weighted quantiles with the techniques mentioned above. Moreover, the function qregspiv from the R package McSpatial allows one to run the quantile IV estimation for any model with one endogenous explanatory variable; the function was originally created to deal with spatial AR models. In Stata the command ivreg can handle up to two endogenous treatment variables.
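A minimal R sketch of this IV-weighted minimisation (illustrative, not the implementation behind qregspiv or the Stata routines): it assumes a data frame dat with outcome y, binary treatment d, binary instrument z and covariates x1, x2, and estimates pi(x) = Pr(Z = 1 | X) by a logit. Because the weights can be negative, the sketch minimises the weighted check-function objective by a simple grid search rather than calling rq.

rho <- function(u, tau) u * (tau - (u < 0))                         # check function

pi_hat <- fitted(glm(z ~ x1 + x2, family = binomial, data = dat))   # pi(x) = Pr(Z = 1 | X)
W <- (dat$z - pi_hat) / (pi_hat * (1 - pi_hat)) * (2 * dat$d - 1)   # weights W-hat

qte_iv <- function(tau, a_grid, b_grid) {
  obj  <- function(a, b) mean(W * rho(dat$y - a - b * dat$d, tau))  # weighted objective
  vals <- outer(a_grid, b_grid, Vectorize(obj))
  idx  <- which(vals == min(vals), arr.ind = TRUE)[1, ]
  c(Q_Y0 = a_grid[idx[1]], QTE = b_grid[idx[2]])                    # intercept and slope
}

qte_iv(0.5, a_grid = seq(-5, 5, by = 0.05), b_grid = seq(-3, 3, by = 0.05))

The grid bounds and step size are arbitrary choices and should be adapted to the scale of the outcome.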
For quantile Diff-in-Diff estimation see the corresponding section in Chapter 5. For the case of estimating the quantile treatment effect in the regression discontinuity design, the corresponding Stata code exists under the command rddqte; see Frölich and Melly (2008), Frölich and Melly (2010), and Frandsen, Frölich and Melly (2012) for more details. To install the ado and help files go to http://froelich.vwl.uni-mannheim.de/1357.0.html. To use similar techniques in R one can make use of the function lprq mentioned above.
7.5 Exercises
1. Consider the estimation of β τ in the linear quantile regression problem; recall Equa-
tion 7.11. One may often be interested in estimating β0τ for various different values
of τ , e.g. for all deciles or all percentiles. Show that with a finite number of obser-
vations, only a finite number of estimates will be numerically distinct. You may start
with a sample of just two observations. Then try to estimate the median and the
quartiles.
2. Prove Equation 7.16 using substitution.
3. Asymptotics using the GMM framework: Under certain regularity conditions, the GMM framework can be used to show
$$\sqrt{n}\,\big(\hat{\beta}^{\tau} - \beta_0^{\tau}\big) \;\overset{d}{\longrightarrow}\; N\big(0, \Sigma^{\tau}\big),$$
with
$$\Sigma^{\tau} = \tau(1 - \tau)\cdot E\big[f_{U|X}(0|X)\, X X'\big]^{-1} \cdot E\big[X X'\big] \cdot E\big[f_{U|X}(0|X)\, X X'\big]^{-1}.$$
If one is willing to strengthen the assumption $Q^{\tau}_{U|X} = 0$ to be satisfied for every quantile $\tau \in (0, 1)$, which implies full independence between $U$ and $X$, the variance matrix simplifies to
$$\Sigma^{\tau} = \frac{\tau(1 - \tau)}{f_U(0)^2}\cdot E\big[X X'\big]^{-1}.$$
Derive the asymptotic variance using the results for exactly identified GMM estimators:
$$E\big[\big(\tau - 1\{Y < X'\beta_0\}\big)\cdot X\big] = 0.$$
4. Show that estimators resulting from conditions (7.33) are equivalent to those
resulting from conditions (7.32).
5. Show that the weights W + defined in (7.38) are indeed positive.
6. Note that in (7.47) you have 1+ in the numerator and denominator. Therefore, in (7.48) you would also expect 1+ , and this would be correct. Prove that (7.48) with 1− substituting 1+ is equivalent to the present formula.
7. Derive the estimator and formulae given in (7.50).
8. Discuss standard problems that occur in parametric quantile regression that disap-
pear when using local constant estimators. Which of these problems can also occur
(locally) when using local linear estimation?
8 Dynamic Treatment Evaluation
8.1 Motivation and Introduction
While in several settings the framework of the previous chapters may well apply, in
others a more careful treatment of time and dynamic treatment allocation is needed. As
an example, consider a few issues that come up with the evaluation of active labour
market policy:
Example 8.1 The time t0 when a programme starts might itself be related to unob-
served characteristics of the unemployed person. Therefore, t0 might often itself be an
important control variable. Here t0 might reflect calendar time (e.g. seasonal effects) as
well as process time (e.g. current unemployment duration).
The time t1 when participation in a programme ends might often be already an out-
come of the programme itself. A person who finds employment while being in a training
programme would naturally have a shorter programme duration t1 −t0 than a person who
did not find a job during that period. In other words, if someone finds a job during train-
ing he would stop training earlier than planned. The fact of having found a job is then
the reason for the short treatment duration, and not its outcome. That means we cannot
say that because of the short treatment duration he found a job.
It can be seen from this example that it is often advisable to measure the impact of the beginning of a treatment t0 and not of its end t1 . Nevertheless,
one might also be interested in the effects of the duration t1 − t0 of the programme.
A possible shortcut is to use the length of intended programme duration as a measure
that may be more likely to be exogenous, conditional on X t0 . The confounding variables
often include time-varying variables as well. Then, a more explicit modelling may be
necessary.
Example 8.2 Ashenfelter (1978) noted that the decision to participate in active labour
market programmes is highly dependent on the individual’s previous earnings and
employment histories. Recent negative employment shocks often induce individuals
to participate in training programmes. Hence the employment situation in the months
before the programme starts is an important determinant of the programme participation
decision but is also likely to be correlated with the potential employment outcomes.
Example 8.3 Recall Example 8.2. Since usually no explicit start time can be observed
for the ‘non-participation’ treatment, the employment situation in the months before the
programme started is undefined for them. To solve this problem, Lechner (1999) sug-
gested drawing hypothetical start times for the ‘non-participants’ from the distribution
of start times among the participants, and to delete the ‘non-participant’ observations
for whom the assigned start time implies an inconsistency. Thus, if unemployment is a
basic eligibility condition for participation in an active labour market programme, indi-
viduals with an assigned start time after the termination of their unemployment spell
are discarded, because participation could not have been possible at that date. Lechner
(2002b) analysed the assignment of hypothetical start times further. Instead of drawing
dates from the unconditional distribution of start times, he also considered drawing from
the distribution conditional on the confounding variables. This conditional distribution
can be simulated by regressing the start times on the covariates and fitting the mean of
Example 8.4 Recall Examples 8.2 and 8.3, and think of an unemployed person regis-
tered at a certain day. A treatment definition window of length ‘1 day’ would define an
individual as treated if a programme started on the first day of unemployment. Every-
one else would be defined as non-treated. Similarly, a treatment definition window of
length ‘1 day’ applied to day 29 of their unemployment would define as treated every-
one who starts treatment on day 29 and as untreated everyone who does not (certainly only using
the individuals who are still registered unemployed at day 29). Treatment is undefined
for those not in the risk set, i.e. those individuals that are no longer unemployed or
already started training. The risk set contains only those individuals who are eligible
and could potentially be assigned to a programme.
For an extremely short treatment definition window, e.g. of one day like in this exam-
ple, there would be only very few treated observations such that estimation might be
very imprecise. In addition, the treatment effects are likely to be very small and may not
be of main interest: they would measure the effect of starting a programme today versus
‘not today but perhaps tomorrow’. Many of the non-treated might actually receive treat-
ment a few days later so that this situation would be similar to a substitution bias in an
experimental setting where people in the control group get a compensation or a different
treatment. In certain situations, however, the effect of treatment today versus ‘not today
but perhaps tomorrow’, may indeed be the effect of interest.
Example 8.5 In Frölich (2008) this is the case. There the choice problem of a caseworker
in the employment office is considered. At every meeting with the unemployed person,
the caseworker aims to choose the optimal action plan including e.g. the choice among
active labour market programmes. In the next meeting, the situation is reconsidered and
different actions might be taken. The caseworker might choose ‘no programme’ today,
but if the unemployed person is still unemployed four weeks later, a different action (i.e.
different treatment) might be appropriate then.
A very large treatment definition window of e.g. one year (that would define as treated everyone who started a programme in the first year and as untreated everyone who did not enter a programme during the entire first year) might deliver the treatment effect of most interest.
The problem for identification, however, is that the definition is ‘conditioning on the
future’, using the language of Fredriksson and Johansson (2008). From the onset of the
treatment definition window, one could imagine two competing processes: the one of
being sent to a programme, and the one of finding a job. Even for two persons exactly
identical in all their characteristics, it may happen by chance that the first person finds a
job after eight weeks whereas the other person would have found a job after ten weeks
but was sent to a programme already after nine weeks. In this case, the first person would
be defined as non-treated, whereas the second would be defined as treated. This clearly
introduces a problem because for otherwise identical individuals – and supposing for
the moment a zero (or say, no) treatment effect – the untreated are those who were lucky
in finding a job early, whereas the treated are the unlucky who did not find a job so
soon. In the extreme case of a very long treatment definition window, you may even
imagine a case where all the non-treated could be those who found a job before the
programme started, whereas all the treated would be those who could have found a job
at some time but the programme happened to start before. Clearly, such a situation leads
to biased estimates, in favour of the so-defined non-treated. Hence, if there were no
differences in unobservables between treated and non-treated, apart from differences in
luck in the dynamic assignment process, the estimated treatment effects are downward
biased. This bias is likely to be most severe if every unemployed person eventually has
to participate in some programme, unless he finds a job before then. On the other hand,
the bias would be expected to be smaller if the probability of eventually ending up in
treatment (if no other event happened) is clearly below one.
In most applications, the sign of the bias is unclear since there might also be other
systematic differences between treated and non-treated in addition to differences in luck
in the dynamic assignment process. That is, there might be other unobserved reasons for
why individuals did not get treated even though they have not found a job.
To overcome this problem of conditioning on the future, one has to shorten the length
of the treatment definition window. But this is likely to introduce again the problem that
many of those defined as non-treated may have actually been treated shortly thereafter,
as discussed above. One solution is to analyse the sensitivity of the final estimates to
alternative definitions of this window. If the length of the window is shortened, bias
due to conditioning on the future decreases but variance increases. At the same time,
however, many non-treated may shortly thereafter have become treated, which blurs the
treatment effect definition as discussed above. If the data available permits, one can
measure how many people have been affected. For the interpretation of the estimated
effects, one should therefore always examine which share of the non-treated actually
received treatment in the period thereafter (i.e. how many people that were classified
as non-treated actually received treatment thereafter). If this fraction is small, we are
more confident that we measure the effect of treatment versus no-treatment and not of
treatment today versus ‘not today but perhaps tomorrow’.1
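As a small illustration of this check (purely hypothetical variable names: t_start is the observed programme start in months, NA if no programme was ever started, and months_unemployed is the length of the unemployment spell):

window  <- 3                                               # 3-month treatment definition window
started <- !is.na(dat$t_start)
treated <- started & dat$t_start <= window                 # programme start within the window
at_risk <- dat$months_unemployed >= window                 # still unemployed at the window's end
nontrt  <- at_risk & !treated                              # classified as non-treated
soon    <- nontrt & started & dat$t_start <= window + 3    # but treated within the next 3 months
mean(soon[nontrt])                                         # share treated shortly thereafter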
We will discuss two possible approaches to deal with this conditioning-on-the-future
problem. First we discuss discrete-time dynamic models, which mitigate the problem.
However, when we return to treatment definition windows of a very short length like
one day, a week or maybe a month, then this could be handled by a continuous-time
model approach that attempts to aggregate the effects over time. This latter approach
seems particularly natural if the outcome variable Y is the survival probability, e.g. in
single-spell unemployment data.
Example 8.6 Fredriksson and Johansson (2008) suggest a non-parametric hazard model
to estimate the effects of treatment start for each day, from which the survival functions
of the potential outcomes can be derived. The intuition is as follows. Consider again a
training programme for unemployed people. At every day t (in process time) the risk
set consists of those people who are still unemployed and have not yet entered train-
ing. These people are at risk of entering training on day t and of finding a job (or say,
exiting unemployment) on day t. It is assumed that these are random events with equal
probability for all individuals still in the risk set, perhaps conditional on some observed
covariates X t . I.e. after controlling for X t and conditional on still being in the risk set,
selection into treatment is only based on white noise. In other words, it is assumed that
there are no unobserved confounders after controlling for X t and the risk set. Hence, the
hazard rates into treatment and into employment can be estimated non-parametrically,
from which the potential survival functions can be deduced.
Continuous time models often avoid the conditioning on the future problem. However,
they require certain restrictions on treatment effect heterogeneity, which are not needed
in discrete time models. This will become obvious from the following discussion of
problems in which the number of possible treatment sequences would be infinite in
continuous time.
Before starting, let us add one more reason why in various situations static mod-
els could be insufficient for treatment effect evaluation. As already indicated at the
beginning of this chapter, in many applications we might be interested in the effects of
sequences of programmes; a first programme, say A, is followed by another programme,
say B. You also might want to compare that sequence with its inverse, i.e. starting with
programme B followed by programme A. Since purely the fact that a second treatment
was applied may already be an outcome of the (successful or unsuccessful) first pro-
gramme, disentangling such effects is very difficult or simply impossible in a static
model. To avoid these kinds of problems, one could focus on estimating the effects of
the first programme (measured from the beginning of the first programme) while con-
sidering the second programme as an endogenously evolving outcome of this first one.
One would thereby estimate the total effect of the first programme together with any
possibly following subsequent programme. From this example we already notice that
intermediate outcome variables, i.e. Yt for some values of t, might be important vari-
ables that affect the sequence of treatment choice. But as discussed in previous chapters,
a general rule from the static model is that one should usually never control for variables
already affected by the treatment. We will see below that some type of controlling for
these variables is nonetheless important or even unavoidable here. If we further want to
disentangle the effects of each programme (e.g. A and B), then we certainly need a more
complex model setup.
Introducing a time dimension into the evaluation framework can be done in two ways:
either by considering sequences of treatments over a number of discrete time periods (of
finite lengths), or by considering time as continuous. We start by examining a modelling
framework for discrete time periods, which permits a wide range of possible treatment
sequences, different start times, different treatment durations etc. This model may often
be directly applicable if treatments can start only at certain fixed points in time, e.g.
quarterly,2 or when data is observed only for discrete time periods.3 When treatments
can start in (almost) continuous time, this model may nevertheless have several advan-
tages over an explicit incorporation of continuous time in that it does not impose strong
restrictions on treatment effect heterogeneity. Time is partitioned into discrete periods
where different sequences of treatments can be chosen.
Example 8.7 Lechner and Miquel (2010) study the impact of government sponsored
training in Western Germany on finding a job during the nineties. They define the first
month of unemployment between January 1992 and December 1993 as the refer-
ence period (i.e. their period zero). Since in the data there is not enough variation over
time to analyse monthly movements they aggregate the monthly information to quar-
terly information. They consider the following three possible states until finding a job:
participating in a vocational training programme paid by the employment office (T),
participating in a retraining programme paid by the employment office to obtain a voca-
tional degree in a different occupation (R), or simply remaining unemployed receiving
benefits and services (U). Observing a single unemployment spell over one year, many sequences are possible, for example UUUU, RRRR, TTTT, but also UTTT, UURR, etc., as well as shorter sequences if the individual has found a job after less
than four quarters. Lechner and Miquel (2010) study only the differences between the
effects of RRRR, TTTT and UUUU on being employed one (respectively four) year(s)
later.
Because the treatment effects are not restricted across treatment sequences, the model
cannot be directly extended to continuous time as there would be an infinite number
of different sequences. Hence, for most of these sequences the number of observa-
tions would be zero. Clearly, for continuous time more restrictions will be required,
as will be discussed later. In applications, time could almost always be considered as
discrete because information is typically aggregated over periods (hours, days, weeks,
months, etc.). The important points are how many observations are observed entering
treatment in a particular time period, and how many different treatment sequences can
be examined.
2 In the evaluation of school education policies, each school year would be a discrete time period.
3 Similarly, primary education, lower secondary and upper secondary education can be considered as a
sequence.
8.2 Dynamic Potential Outcomes Model
Flexible multiperiod extensions of the potential outcomes model have been developed for quite a while in biometrics.4 In this chapter we focus on the exposition and extensions of Lechner and Miquel (2001), Lechner and Miquel (2010), and Lechner (2008), which are much closer in spirit and notation to the rest of this book as they are directed towards applications in the social sciences. Identification is based on sequential conditional independence assumptions, which could also be called sequential ‘selection on observables’ assumptions. As we will see, for identification it is often important to be able to observe intermediate outcome variables. Such information may be available in administrative data of unemployment registers. In many other applications this information does not usually exist, and it would therefore be important to collect such data as part of the evaluation strategy.
To introduce the basic ideas, suppose that there are time periods τ = 1, 2, . . . and that in each time period either treatment 0 or 1 can be chosen. From this setup, the extensions
to many time periods and multiple treatments will be straightforward. The outcome
variable is measured at some time t later. In addition, there is an initial period for which
information on covariates is available before any treatment has started. I.e. a time period
zero exists where none of the treatments of interest has yet started, and where we
can measure potentially confounding variables (before treatment). More precisely, we
define a period 0. Treatments could have happened before, but we will not be able to
identify their effects.
Recall Example 8.7 studying labour market programmes: at the beginning of the spell,
every observed person is unemployed, and we have some information measured at that
time about the person and previous employment histories. Let $D_\tau \in \{0, 1\}$ be the treatment chosen in period $\tau$, and let $\underline{D}_\tau$ be the sequence of treatments up to time $\tau$, with $\underline{d}_\tau$ being a particular realisation of this random variable. The set of possible realisations of $\underline{D}_1$ is $\{0, 1\}$. The set of possible realisations of $\underline{D}_2$ is $\{00, 10, 01, 11\}$. The possible realisations of $\underline{D}_3$ are $000, 001, 010, 011, 100, 101, 110, 111$, etc. We define potential outcomes as $Y_T^{\underline{d}_\tau}$, which is the outcome that would be observed at some time $T$ if the particular sequence $\underline{d}_\tau$ had been chosen. In the following we use the symbols $t$ and $\tau$ to refer to treatment sequences, and the symbol $T$ for the time when the outcome is measured. Hence, with two treatment periods we distinguish between $Y_T^{\underline{d}_1}$ and $Y_T^{\underline{d}_2}$. The observed outcome $Y_T$ is the one that corresponds to the sequence actually chosen. To be specific about the timing when we measure these variables, we will assume that treatment starts at the beginning of a period, whereas the outcome $Y$ (and also other covariates $X$ introduced later) is measured at the end of a period. We thus obtain the observation rule, i.e. the rule linking potential outcomes to observed outcomes:
$$Y_1 = D_1 Y_1^{1} + (1 - D_1) Y_1^{0},$$
$$Y_2 = D_1 Y_2^{1} + (1 - D_1) Y_2^{0} = D_1 D_2 Y_2^{11} + (1 - D_1) D_2 Y_2^{01} + D_1 (1 - D_2) Y_2^{10} + (1 - D_1)(1 - D_2) Y_2^{00}.$$
4 See, for example, Robins (1986), Robins (1989), Robins (1997), Robins (1999), Robins, Greenland and Hu
(1999) for discrete treatments, and Robins (1998), Gill and Robins (2001) for continuous treatments.
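To fix ideas, a tiny R illustration of this observation rule for the two-period case (all potential outcomes are made up numbers):

set.seed(1)
n  <- 5
D1 <- c(0, 1, 1, 0, 1);  D2 <- c(0, 0, 1, 1, 1)                 # chosen treatment sequences
Y2_00 <- rnorm(n); Y2_01 <- rnorm(n); Y2_10 <- rnorm(n); Y2_11 <- rnorm(n)

Y2 <- D1 * D2 * Y2_11 + (1 - D1) * D2 * Y2_01 +                 # the observed outcome picks the
      D1 * (1 - D2) * Y2_10 + (1 - D1) * (1 - D2) * Y2_00       # potential outcome of the sequence
                                                                # actually chosen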
To be clear about the difference between $Y_2^{11}$ and $Y_2^{1}$: the potential outcome $Y_2^{11}$ is the outcome that a particular individual $i$ would have realised at the end of the second period if by external intervention this person was sent to the sequence 11. The potential outcome $Y_2^{1}$ is the outcome that this individual $i$ would have realised at the end of the second period if by external intervention this person was sent first to the programme 1 and thereafter chose for the second period whatever this person was about to choose. I.e. the first period is set by external intervention whereas the treatment in the second period is determined according to the selection process of the individual or the caseworker, given the assignment of the first programme. Note that the choice of the second programme may be influenced by the first programme. This means
$$Y_T^{1} = D_2^{1}\, Y_T^{11} + (1 - D_2^{1})\, Y_T^{10},$$
where $D_2^{1}$ is the potential treatment choice in the second period if the programme in the first period $D_1$ was set to 1. Analogously, $D_3^{1}$ is the programme in period three if the first programme was set to one, and $D_3^{11}$ the programme in period three if the treatment in the first two periods was set to one. By analogy we obtain
$$Y_T^{1} = D_2^{1} D_3^{1}\, Y_T^{111} + (1 - D_2^{1}) D_3^{1}\, Y_T^{101} + D_2^{1}(1 - D_3^{1})\, Y_T^{110} + (1 - D_2^{1})(1 - D_3^{1})\, Y_T^{100},$$
or, as another example,
$$Y_T^{11} = D_3^{11}\, Y_T^{111} + (1 - D_3^{11})\, Y_T^{110}.$$
The observed outcome $Y_T$ corresponds to the outcome if the person herself selected the entire sequence of programmes.
denote as covariates X . For example, the type of secondary school a child attends clearly
depends on the schooling outcomes (grades, test scores) at the end of primary school,
and without observing these grades or scores, identification would be very difficult.5
5 Another example in Lechner (2004) considers the labour supply effects of different fertility sequences, e.g.
two children in the first period and zero in the second period versus one child in each period.
6 This last example is used in Sianesi (2004) and Fredriksson and Johansson (2008) and is applied in Frölich
(2008).
7 One way to circumvent this problem in the static model is to consider the effects of planned durations only,
e.g. in Lechner, Miquel and Wunsch (2011).
with a minimum duration of at least one or two periods, the latter comparison refers to
treatments with a duration of exactly one or two periods.
Finally, one might want to study the effect of sequences of treatments: We could
be interested in various sequences of treatments, e.g. 010001 versus 0101. Particularly
when we extend the previous setup to allow for several treatment options, e.g. {0, 1, 2, 3}
in each period, for example, no assistance, job search assistance, training and employ-
ment programmes, it is interesting to compare a sequence 123 to 132 or 101 to 1001.
Should one start with training or with an employment programme? If one programme
has been completed, should one start with the next one, or leave some time in between
in order to permit individuals to focus on their own job search activities? The applica-
tion of the static model, as covered in the previous section, breaks down when selection
into the second and any subsequent programmes is influenced by the outcome of the
previous programmes. Then these intermediate outcomes have to be included to control
for selection.
Hence, a large number of sequences could be interesting. However, when specify-
ing such sequences, one should keep in mind that the longer the treatment sequences
specified, the fewer observations will be in the data that have exactly followed this
sequence. Hence, one could run into small sample problems even with a data set of
several thousand observations. An additional complication will arise when comparing
two rather different sequences, e.g. 1110 to 00001110. It is quite likely that those indi-
viduals who followed a very specific sequence such as 00001110 may be relatively
homogenous in their X characteristics. If also the participants in 1110 are relatively
homogenous, the common support between these two participant groups will be rela-
tively small. After deleting observations out of common support, the treatment effect
between 00001110 and 1110 might thus depend on only a very specific subpopulation,
which reduces external validity.
Another concern with very long sequences is that, in case we get identification via some (sequential) conditional independence assumptions, we have to include the vector of covariates $X_0, X_1, \ldots$, up to $X_{\tau-1}$ for identifying the effect of a sequence $\underline{d}_\tau$, which may contain an increasing number of variables when $\tau$ is increasing. If the number of covariates becomes too large, one may perhaps include only, say, four lags $X_{t-1}, X_{t-2}, X_{t-3}, X_{t-4}$, as they may pick up most of the information contained in the past $X$.
As a further component of the model, one often wants to include covariates $X_t$ which are time-varying; we denote by $\underline{X}_t$ the collection of $X_t$ variables up to time period $t$. The $\underline{X}_t$ may also include the outcome variables up to $Y_t$. Hence, we permit that the variables $X_t$ are already causally influenced by the treatments, and we could even define potential values $X_t^{\underline{d}_\tau}$ for these. Remember that we observe $X_t$ at the end of a period. Hence, at the beginning of a period $\tau$, the values of $X_t$ up to $\tau - 1$ are observed. In the examples on active labour market policies given above, $X_t$ could be (among other variables) the employability of the unemployed person. The caseworker assesses the employability of his unemployed client, and this assessment can change over time. If training
programmes are effective, one would expect that the employability should increase
after having participated in training. Certainly, other issues such as motivation, psychological status or family composition can also change over time; see for example Lechner and Wiehler (2011) on the interactions between labour market programmes and fertility.
We now can define a large number of different average treatment effects. Let $\underline{d}_\tau$, $\underline{d}'_{\tau'}$ and $\underline{d}''_{\tau''}$ be three sequences of possibly different lengths $\tau$, $\tau'$, $\tau''$. Define the treatment effect by
$$\alpha_T^{\underline{d}_\tau, \underline{d}'_{\tau'}}\big(\underline{d}''_{\tau''}\big) = E\big[\,Y_T^{\underline{d}_\tau} - Y_T^{\underline{d}'_{\tau'}} \,\big|\, \underline{d}''_{\tau''}\,\big] \quad \text{for } \tau'' \le \tau, \tau',$$
which is the treatment effect between sequence $\underline{d}_\tau$ and $\underline{d}'_{\tau'}$ for the subpopulation that is observed to have taken sequence $\underline{d}''_{\tau''}$. Note that the three sequences $\underline{d}_\tau$, $\underline{d}'_{\tau'}$ and $\underline{d}''_{\tau''}$ can differ in the length and in the types of the treatments. Hence, we could be comparing two sequences of the same length, e.g. 01 versus 10, as well as sequences of different lengths, e.g. 01 versus 1. The latter example corresponds to the effect of a delayed treatment start, i.e. the treatment starting in period 2 versus period 1. The sequence $\underline{d}''_{\tau''}$ defines the subgroup for which the effect is defined. We supposed $\tau'' \le \tau, \tau'$ since there is little interest in the effect for a (sub-)population which is more finely defined than the two sequences for which the causal effect is to be determined. The identification conditions would also be stronger.
If $\tau'' = 0$, this gives the dynamic average treatment effect (DATE)
$$\alpha_T^{\underline{d}_\tau, \underline{d}'_{\tau'}} = E\big[\,Y_T^{\underline{d}_\tau} - Y_T^{\underline{d}'_{\tau'}}\,\big],$$
whereas the dynamic average treatment effect on the treated (DATET) would be obtained when $\underline{d}''_{\tau''} = \underline{d}_{\tau''}$,
$$\alpha_T^{\underline{d}_\tau, \underline{d}'_{\tau'}}\big(\underline{d}_{\tau''}\big) = E\big[\,Y_T^{\underline{d}_\tau} - Y_T^{\underline{d}'_{\tau'}} \,\big|\, \underline{d}_{\tau''}\,\big],$$
and the dynamic average treatment effect on the non-treated (DATEN) would be obtained when $\underline{d}''_{\tau''} = \underline{d}'_{\tau''}$,
$$\alpha_T^{\underline{d}_\tau, \underline{d}'_{\tau'}}\big(\underline{d}'_{\tau''}\big) = E\big[\,Y_T^{\underline{d}_\tau} - Y_T^{\underline{d}'_{\tau'}} \,\big|\, \underline{d}'_{\tau''}\,\big].$$
Without any restrictions on effect heterogeneity, these effects could be very different.
Further, we only consider the case where $T \ge \max(\tau, \tau')$, which means that we only consider as final outcome variables the periods after the completion of the sequence. It would not make sense to consider explicitly the case $T < \max(\tau, \tau')$ because we assume that treatments can have an effect only on future periods but not on earlier ones. We will refer to this as the assumption of no anticipation effects. If we were expecting anticipation effects, we would have to re-define the treatment start to the point where the
anticipation started. For example, if we observed in the data that an unemployed person
started a training programme in June, but we also know that this person was already
informed by early May about this programme, then we could consider May as the date
where treatment started. If the date of referral and the date of programme start are very
close together, and the date of referral is not observed, the possible anticipation effects
can hopefully be ignored.
With this assumption we suppose that, accounting for the information X 0 observed at time zero, the entire treatment sequence taken later is independent of the potential outcomes. This includes that all important information the agent has about his future potential outcomes (and which therefore influences his decisions on treatment participation) is already contained in X 0 . In other words, we assume that the researcher has enough information at the beginning of the initial period so that treatment assignment in every period can be treated as random conditional on X 0 . Such an assumption is
reasonable for example for a scheme where the assignment of all treatments is made in
the initial period and is not changed subsequently. Or, more precisely, any revision of
the original treatment plan has not been triggered by the arrival of new information that
is related to the potential outcomes. Hence, the choices do not depend on time varying
X and also not on the outcomes of the treatments in the previous periods, because the
complete treatment sequence is chosen9 initially based on the information contained
in X 0 .
For many situations this assumption can be rather strong and will therefore be relaxed
in the next subsection(s). But it is helpful to understand the implications of this assump-
tion. As shown in Lechner and Miquel (2001), with the above assumptions all treatment
effects up to period τ are identified, including DATET and DATEN as well as for coarser
subpopulations. It also includes identification of effects of the type
9 What is actually meant is that if the complete treatment sequence had been chosen initially, we would not
get systematically different treatment sequences than those observed.
Example 8.9 Recall our examples on active labour market policy, but now think of a training programme that prohibits repeated participation. Then the eligibility status (included in the vector of confounders $X_1$) will never be one if $D_1 = 1$, whereas it has positive probability of being one if $D_1 = 0$. Hence, $\Pr\big(D_2 = 1 \mid \underline{X}_1 = \text{eligible}, D_1 = 1\big)$ is zero, and the event $(\underline{X}_1 = \text{eligible}, D_1 = 1)$ has probability zero. On the other hand, (8.8) would not be satisfied because $\Pr\big(D_1 = D_2 = 1 \mid \underline{X}_1 = \text{eligible}\big) = 0$ but $\underline{X}_1 = \text{eligible}$ happens with positive probability.
Still, the common support assumption may be rather restrictive in many applications. Suppose participation in treatment is permitted only for unemployed persons. Then $\Pr\big(D_2 = 1 \mid X_1 = \text{employed}, D_1\big) = 0$, which implies that it is impossible to observe individuals with $D_2 = 1$ for those who found a job after the first training.
To better understand what is identified by WDCIA (8.6), consider $E[Y_T^{11} \mid D_1 = 0]$ in the simple two-period model example above. Using iterated expectations and WDCIA with respect to the first period, we can write
$$\begin{aligned}
E\big[Y_T^{11} \mid D_1 = 0\big] &= E\Big[\, E\big[Y_T^{11} \mid X_0, D_1 = 0\big] \,\Big|\, D_1 = 0 \Big] \\
&= E\Big[\, E\big[Y_T^{11} \mid X_0, D_1 = 1\big] \,\Big|\, D_1 = 0 \Big] \\
&= E\Big[\, E\big[\, E[Y_T^{11} \mid X_0, X_1, D_1 = 1] \,\big|\, X_0, D_1 = 1 \big] \,\Big|\, D_1 = 0 \Big] \\
&= E\Big[\, E\big[\, E[Y_T^{11} \mid X_0, X_1, \underline{D}_2 = 11] \,\big|\, X_0, D_1 = 1 \big] \,\Big|\, D_1 = 0 \Big] \\
&= E\Big[\, E\big[\, E[Y_T \mid X_0, X_1, \underline{D}_2 = 11] \,\big|\, X_0, D_1 = 1 \big] \,\Big|\, D_1 = 0 \Big] \\
&= \int\!\!\int E\big[\,Y_T \mid \underline{X}_1, \underline{D}_2 = 11\,\big] \, dF_{X_1 \mid X_0, D_1 = 1}\, dF_{X_0 \mid D_1 = 0}.
\end{aligned}$$
This result shows on the one hand that this potential outcome is identified and also suggests a way of estimating it. We first need to estimate $E\big[Y_T \mid \underline{X}_1, \underline{D}_2 = 11\big]$ non-parametrically and then to adjust it sequentially for the distributions $dF_{X_1 \mid X_0, D_1 = 1}$ and $dF_{X_0 \mid D_1 = 0}$. As discussed later, this adjustment can be done via matching or weighting. The estimator is more complex than in the static model as we have to adjust for differences in the $X$ distribution twice. Generally, if we were to consider treatment sequences of length $\tau$ we would have to adjust $\tau$ times.
More generally, under the WDCIA assumption the population average potential outcomes
$$E\big[\,Y_T^{\underline{d}_\tau}\,\big]$$
are identified for any sequence $\underline{d}_\tau$ of length $\tau \le T$ if the necessary conditioning variables are observed. Also all average outcomes for any sequence $\underline{d}_\tau$ in the subpopulation of individuals who participated in treatment 0 or 1 in the first period,
$$E\big[\,Y_T^{\underline{d}_\tau} \mid D_1 = d_1\,\big],$$
are identified then. The situation becomes more difficult, however, if we are interested
in the average effect for a subpopulation that is defined by a longer sequence (especially
with D1 = D2 = 0). The relevant distinction between the populations defined by treat-
ment states in the first and, respectively, subsequent periods is that in the first period,
treatment choice is random conditional on exogenous variables, which is the result of
the initial condition stating that D0 = 0 holds for everybody. In the second and later
periods, randomisation into these treatments is conditional on endogenous variables, i.e.
variables already influenced by the first part of the treatment. WDCIA has an appeal for
applied work as a natural extension of the static framework. However, WDCIA does not
identify the classical treatment effects on the treated if the sequences of interest differ in
the first period.
In contrast to the stronger assumption (8.3) of the previous subsection, the SCIA, not all treatment effects are identified any more. Observing the information set that influences the allocation to the next treatment in a sequence together with the outcome of interest is sufficient to identify average treatment effects (DATE) even if this information is based on endogenous variables. However, this assumption is not sufficient to identify the treatment effect on the treated (DATET). To understand why it is not identified, it is a useful exercise to attempt to identify $E\big[Y_T^{00} \mid \underline{D}_2 = 11\big]$ by iterated expectations, see Exercise 4. The reason is that the subpopulation of interest (i.e. the participants who
complete the sequence) has evolved (i.e. been selected) based on the realised intermedi-
ate outcomes of the sequence. This result is quite different from the static model, where
identification of ATET is often considered to be even easier than identification of ATE.
Nevertheless some effects can also be identified for finer subpopulations. The first result refers to comparisons of sequences that differ only with respect to the treatment in the last period, i.e. they have the same initial subsequence until $\tau - 1$ and differ only in period $\tau$. This is basically the same result as before, but with time period $\tau - 1$ playing the role of time period 0 before, the period up to which the treatment sequence still coincides. In this case the endogeneity problem is not really harmful, because the potentially endogenous variables $\underline{X}_{\tau-1}, \underline{Y}_{\tau-1}$, which are the crucial ones to condition on for identification, have been influenced by the same past treatment sequence at time $\tau - 1$ when comparing the two sequences. It can be shown10 that, given WDCIA, the potential outcome is identified if the sequences $(\underline{d}_{\tau-1}, d_\tau)$ and $(\underline{d}_{\tau-1}, d'_\tau)$ are identical except for the last period, i.e.
$$E\big[\,Y_T^{(\underline{d}_{\tau-1}, d_\tau)} \,\big|\, \underline{D}_\tau = (\underline{d}_{\tau-1}, d'_\tau)\,\big] \qquad (8.9)$$
is identified. By the result (8.2) for coarser subpopulations, this also implies that
$$E\big[\,Y_T^{(\underline{d}_{\tau-1}, d_\tau)} \,\big|\, \underline{D}_{\tau-1} = \underline{d}_{\tau-1}\,\big] \qquad (8.10)$$
is identified. To give some examples, $E[Y_T^{11}]$, $E[Y_T^{11} \mid D_1 = 0]$, $E[Y_T^{11} \mid D_1 = 1]$, $E[Y_T^{11} \mid \underline{D}_2 = 10]$ and $E[Y_T^{11} \mid \underline{D}_2 = 11]$ are identified, but neither $E[Y_T^{11} \mid \underline{D}_2 = 00]$
10 See, for example, Lechner and Miquel (2001, theorem 3b).
nor $E[Y_T^{11} \mid \underline{D}_2 = 01]$. Hence, the ATET between the sequences 10 and 01 is not identified.
The result given in (8.10) actually extends to the cases where we consider longer sequences for the outcome $Y$. Once the WDCIA is given, all that is needed is that the initial (sub-)sequence of $Y$ is identical to the sequence of $D$ we condition on. Formally spoken, for a sequence $\underline{d}_{\tau-w}$ where $1 \le w < \tau$, and a longer sequence that starts with the same subsequence, $(\underline{d}_{\tau-w}, d_{\tau-w+1}, \ldots, d_\tau)$, given WDCIA, the average potential outcome
$$E\big[\,Y_T^{(\underline{d}_{\tau-w}, d_{\tau-w+1}, \ldots, d_\tau)} \,\big|\, \underline{D}_{\tau-w} = \underline{d}_{\tau-w}\,\big] \qquad (8.11)$$
is identified. Of course, the relevant subpopulations for which identification is obtained could be coarser, but not finer. Compared to (8.9) the conditioning set for the expected value is ‘one period shorter’. The identification of sequences that differ for more than one period is more difficult: the conditioning variables $\underline{X}_{\tau-1}, \underline{Y}_{\tau-1}$ needed to make participants comparable to non-participants in the specific sequence might be influenced by all events during the sequence. However, since the sequences differ, these events can also differ, leading to some additional loss of identification.
Example 8.10 Recall the two-period examples from above. It is clear that the WDCIA implies that $Y_T^{11} \perp\!\!\!\perp D_2 \mid X_1, X_0, D_1$. Together with
$$Y_T^{11} \perp\!\!\!\perp D_1 \mid X_1, X_0 \qquad (8.12)$$
one could conclude $Y_T^{11} \perp\!\!\!\perp (D_1, D_2) \mid X_1, X_0$. However, the WDCIA (8.6) does (and shall) not imply (8.12). The implication of (8.12) is clearly visible from the graph in Figure 8.1 (where for ease of exposition we ignored $D_0$ and $X_0$). Generally, we would like to permit $X_1$ to be potentially affected by $D_1$, since $X$ is measured at the end of the period whereas treatment $D$ starts at the beginning of the period; but conditioning on $X_1$ as in (8.12) ‘blocks’ a part of the total effect of $D_1$ on $Y_T$. In other words, $X_1$ is an outcome variable of $D_1$, and conditioning on it is therefore problematic: (8.12) can only be true if there is no causal effect running through $X_1$. In other words, $D_1$ is not permitted to have any effect on $X_1$.
For the identification of $E[Y_T^{11} \mid D_1]$ this was no problem, but for example for $E[Y_T^{11} \mid \underline{D}_2]$, cf. also Exercise 8.6, this becomes important because $X_1$ determines the population of interest in the second period. Hence, on the one hand, we would have to condition on $X_1$ to control for the selection in the second period. On the other hand, we are not permitted to condition on this variable as this could invalidate independence for the selection in the first period.
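A tiny simulation (with arbitrary coefficients) illustrates the point that conditioning on $X_1$, itself an outcome of $D_1$, ‘blocks’ part of the total effect of $D_1$ on $Y_T$:

set.seed(1)
n  <- 1e5
D1 <- rbinom(n, 1, 0.5)
X1 <- 0.8 * D1 + rnorm(n)              # X_1 is affected by D_1
YT <- 1.0 * D1 + 0.5 * X1 + rnorm(n)   # total effect of D_1 on Y_T is 1 + 0.5 * 0.8 = 1.4

coef(lm(YT ~ D1))["D1"]                # close to 1.4: the total effect
coef(lm(YT ~ D1 + X1))["D1"]           # close to 1.0: only the direct effect remains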
In order to identify not ‘just’ (8.11) but also DATET, DATEN, or other treatment effects, one has to restrict the potential endogeneity of $X_\tau$, resulting in a stronger sequential independence assumption. In fact, recall (8.11); for $w = 1$ we are aiming for a condition that allows us to identify all effects up to period $\tau$.
Assumption SDCIA Strong dynamic conditional independence assumption
(a) $\big(Y_T^{\underline{d}_\tau}, X_t\big) \perp\!\!\!\perp D_t \mid \underline{X}_{t-1}, \underline{D}_{t-1}$ for all $t \le \tau - 1$ and all sequences $\underline{d}_\tau$,
(b) $Y_T^{\underline{d}_\tau} \perp\!\!\!\perp D_t \mid \underline{X}_{t-1}, \underline{D}_{t-1}$ for $t = \tau$ and all sequences $\underline{d}_\tau$,
(c) $0 < \Pr\big(\underline{D}_t = \underline{d}_t \mid \underline{X}_{t-1}\big) < 1$ a.s. for all $t \le \tau$ and all sequences $\underline{d}_\tau$.
Compare now SDCIA with the two-period presentation of WDCIA (given directly below the original assumption). Note that assumption (a) implies that $Y_T^{\underline{d}_2} \perp\!\!\!\perp D_1 \mid X_0, X_1$, as can be shown by simple calculations. Together with assumption (b) we thus have that $Y_T^{\underline{d}_2} \perp\!\!\!\perp (D_1, D_2) \mid \underline{X}_1$. This follows because $A \perp\!\!\!\perp (B, C)$ is equivalent to $A \perp\!\!\!\perp B \mid C$ together with $A \perp\!\!\!\perp C$. With this assumption we can derive for $E[Y_T^{11} \mid \underline{D}_2 = 00]$ that
$$\begin{aligned}
E\big[Y_T^{11} \mid \underline{D}_2 = 00\big] &= E\Big[\, E\big[Y_T^{11} \mid X_1, X_0, \underline{D}_2 = 00\big] \,\Big|\, \underline{D}_2 = 00 \Big] \\
&= E\Big[\, E\big[Y_T^{11} \mid X_1, X_0, D_1 = 0\big] \,\Big|\, \underline{D}_2 = 00 \Big] = E\Big[\, E\big[Y_T^{11} \mid X_1, X_0, D_1 = 1\big] \,\Big|\, \underline{D}_2 = 00 \Big] \\
&= E\Big[\, E\big[Y_T^{11} \mid X_1, X_0, \underline{D}_2 = 11\big] \,\Big|\, \underline{D}_2 = 00 \Big] = E\Big[\, E\big[Y_T \mid X_1, X_0, \underline{D}_2 = 11\big] \,\Big|\, \underline{D}_2 = 00 \Big].
\end{aligned}$$
Clearly, the same can be done for $E\big[Y_T^{00} \mid \underline{D}_2 = 11\big]$. This result has two implications:
First, the DATET is identified. Second, we simply have to adjust for the distribution of
X 1 and X 0 simultaneously, and can therefore use the methods we learnt for the static
model with multiple treatments. In other words, we do not have to resort to more com-
plex sequential matching or weighting methods (that are discussed in detail later when
only using WDCIA).
Part (a) of the SDCIA further implies that X 1 ⊥⊥ D1 |X 0 , i.e. the variable X 1 which
is observed at the end of the first period is not influenced by D1 which in turn starts
at the beginning of the first period. Hence, the X t still have to be exogenous in the
sense that Dt has no effect on X t . This essentially prohibits including intermediate outcomes in X t . In other words, treatment assignment is typically decided each period
based on initial information, treatment history and new information that is revealed up
to that period. But it is not permitted that the information revealed has been caused by
past treatments. The participation decision may be based on the values of time varying
confounders observable at the beginning of the period, as long as they are not influenced
by the treatments of this period. Hence, X t is still exogenous, which thus does not allow
Yt to be included in X t .
Note that this is a statement in terms of observed variables and its implication can be
related to causality concepts in time series econometrics. It says that X 1 is not Granger-
caused by previous treatments. This condition is a testable implication of SDCIA, which
on the one hand is an advantage, but on the other hand suggests that SDCIA may be
stronger than strictly necessary.
We will discuss alternative representations in terms of potential values $X_1^{d_1}$, i.e. of the values of $X_1$ that would be observed if a particular treatment had been applied. Some might think that SDCIA (a) says that $X_1^{d_1} = X_1^{d_1'}$, but these two statements are definitely not equal. To examine alternative representations of the CIA assumptions in terms of potential values, we first turn back to the WDCIA. When using WDCIA, no explicit exogeneity condition is required for the control variables. This may be surprising, because it is a well-known fact that if we include, for example, the outcome in the list of control variables, we will always estimate a zero effect.11 Obviously, a CIA based on observable control variables which are potentially influenced by the treatment is not the ‘best’ representation (in terms of a representation whose plausibility can be judged most intuitively and easily in a given application) of the identifying conditions, because it confounds selection effects with other endogeneity issues. Sometimes it helps to clear one’s mind by expressing the conditions really needed in terms of potential confounders. For example, the WDCIA implies for the second period
$$E\big[\,Y_T^{\underline{d}_\tau} \mid \underline{X}_1, D_1 = 1\,\big] = E\big[\,Y_T^{\underline{d}_\tau} \mid \underline{X}_1, \underline{D}_2 = 11\,\big].$$
Equivalently, one could work with an expression in terms of potential confounders, i.e.
$$E\big[\,Y_T^{\underline{d}_\tau} \mid \underline{X}_1^{d_1 = 1}, D_1 = 1\,\big] = E\big[\,Y_T^{\underline{d}_\tau} \mid \underline{X}_1^{\underline{d}_2 = 11}, \underline{D}_2 = 11\,\big].$$
This shows that the WDCIA is in fact a set of joint assumptions about selection and endogeneity bias.
We close this subsection by discussing alternative sets of assumptions for WDCIA and SDCIA expressed in terms of potential confounders (called WDCIA-P and SDCIA-P, respectively). We concentrate only on versions for the simple two-period model to focus on the key issues. It can be shown that these assumptions are strongly related to the original versions given above. Nevertheless, neither does WDCIA directly imply this new WDCIA-P, nor vice versa. The same applies to SDCIA and SDCIA-P. One can show that the same treatment effects are identified under WDCIA and WDCIA-P. All in all, the following assumptions are not exactly equivalent to our previous discussion, but almost. They provide an intuition into how we might interpret WDCIA and SDCIA, but are not testable.
Assumption WDCIA-P Weak dynamic conditional independence based on potential confounders
(a) $Y_T^{\underline{d}_2} \perp\!\!\!\perp D_t \mid \underline{X}_{t-1}^{\underline{d}_2}, \underline{D}_{t-1}$ for all $t \le 2$ and all sequences $\underline{d}_2$,
(b) $F\big(X_0^{\underline{d}_2} \mid D_1 = d_1\big) = F\big(X_0^{d_1} \mid D_1 = d_1\big)$ for all $\underline{d}_2$, \qquad (8.13)
(c) $F\big(X_1^{d_1, d_2} \mid X_0^{d_1, d_2}, D_1 = d_1\big) = F\big(X_1^{d_1} \mid X_0^{d_1}, D_1 = d_1\big)$ for all $\underline{d}_2$,
where $X_t$ may include $Y_t$. The common support requirement remains the same as before.
11 See, for example, Rosenbaum (1984), Rubin (2004) and Rubin (2005) on this so-called endogeneity bias.
The conditional independence condition (a) looks as before but is now formulated in terms of potential confounders. What is new are the exogeneity conditions given afterwards. Intuitively, (8.13) states that, given $D_1$, $D_2$ should have no effect on (the distribution of) the confounders in period 0, and if also given $X_0^{d_1}$, then $D_2$ should have no effect on the confounders in period 1, cf. (c). A somewhat stronger assumption which implies this is that the treatment has no effect on the confounders before it starts, i.e. $X_0^{\underline{d}_2} = X_0^{\underline{d}'_2}$ for any $\underline{d}_2$ and $\underline{d}'_2$, and also $X_1^{d_1, d_2} = X_1^{d_1, d_2'}$ for any $d_2$ and $d_2'$. This rules
out anticipation effects on the confounders. In the jargon of panel data econometrics,
the values of X t are ‘pre-determined’. They may depend on past values of the treatment
sequence, but not on the current value or future values of Dt . Overall, this implies that
we do not only rule out anticipation effects on the outcome variable, as this would not
permit identification anyhow, but also anticipation effects on the confounders X .
The requirements for the strong dynamic CIA have a nearly equivalent representation in terms of potential confounders:
Assumption SDCIA-P Strong conditional independence based on potential confounders
(a) $\big(Y_T^{\underline{d}_2}, X_1^{\underline{d}_2}\big) \perp\!\!\!\perp D_1 \mid X_0^{\underline{d}_2}$ for all $\underline{d}_2$,
(b) $Y_T^{\underline{d}_2} \perp\!\!\!\perp D_2 \mid \underline{X}_1^{\underline{d}_2}, D_1$ for all $\underline{d}_2$,
(c) $F\big(X_0^{\underline{d}_2} \mid \underline{D}_2 = \underline{d}_2\big) = F\big(X_0^{\underline{d}_2} \mid \underline{D}_2 = \underline{d}'_2\big)$ for all $\underline{d}_2, \underline{d}'_2$, \qquad (8.14)
(d) $F\big(X_1^{\underline{d}_2} \mid X_0^{\underline{d}_2}, \underline{D}_2 = \underline{d}_2\big) = F\big(X_1^{\underline{d}_2} \mid X_0^{\underline{d}_2}, \underline{D}_2 = \underline{d}'_2\big)$ for all $\underline{d}_2, \underline{d}'_2$.
In contrast to WDCIA-P, the above exogeneity conditions require that $X_1^{d_1, d_2} = X_1^{d_1', d_2'}$ for any values of $d_1, d_1', d_2, d_2'$. This means not only that the causal effect of $D_2$ on $X_1$ is zero as before (no anticipation), but also that the causal effect of $D_1$ on $X_1$ is zero. Hence, $X_t$ is assumed not to be affected by current or future values of $D_t$. This assumption goes much beyond the no-anticipation condition required for WDCIA-P by ruling out the use of intermediate outcomes as conditioning variables. Hence, as already remarked when discussing SDCIA before, the identification essentially boils down to the static model with multiple treatments, which, if deemed reasonable, makes estimation much simpler. In many applications SDCIA is likely to be too strong. However, in cases where the new information $X_t$ does influence outcomes as well as the choice of treatment in the next period, and this new information is not itself influenced by the evolution of the treatment history, then SDCIA can be plausible.
One can express the means of all needed potential outcomes in terms of expectations of observed outcomes. We can then proceed according to the main estimation ideas of matching and propensity score weighting, only extended now by the hyper-indices indicating to which treatment sequence the (observed) variables refer. In fact, all the effects identified above can be considered as weighted averages of the observed outcomes in the subgroup experiencing the treatment sequence of interest. As an example we have
already shown that
$$E\big[Y_T^{11} \mid D_1 = 0\big] = \int\!\!\int E\big[\,Y_T \mid \underline{X}_1, \underline{D}_2 = 11\,\big]\, dF_{X_1 \mid X_0, D_1 = 1}\, dF_{X_0 \mid D_1 = 0},$$
or
$$E\big[Y_T^{11}\big] = \int\!\!\int E\big[\,Y_T \mid \underline{X}_1, \underline{D}_2 = 11\,\big]\, dF_{X_1 \mid X_0, D_1 = 1}\, dF_{X_0}. \qquad (8.15)$$
From here on we can obviously apply the same non-parametric (matching) estimators
as in the static case.
In practice, though, there might be some complications which can pose problems
(given that we are only provided with finite samples). We already mentioned that if
we consider very long sequences, e.g. 10 versus 0000010, then the number of obser-
vations who actually experienced these sequences can be very small. We have further
discussed that the observations in very long sequences are likely to be more homoge-
nous such that the common support for the comparison of two sequences may be rather
small. Another potential problem is that often we will have to control for continuous
variables in our sequential matching estimation: While we can estimate d FX 0 in (8.15)
simply by the empirical distribution function of X 0 , this would not be possible for
d FX 1 |X 0 ,D1 =1 if X 0 contains a continuous variable. If one were to impose paramet-
ric forms for d FX 1 |X 0 ,D1 =1 and d FX 0 , this would trivially become much simpler. This
problem is actually not present if one were to assume SDCIA. In that case, one could
identify
!
E YT11 |D2 = 00 = E YT |X1 , D2 = 11 d FX 1 ,X 0 |D =00 ,
¯ ¯ ¯ ¯2
!
E YT11 |D1 = 0 = E YT |X1 , D2 = 11 d FX 1 ,X 0 |D1 =0 and
¯ ¯
!
E YT = E YT |X1 , D2 = 11 d FX 1 ,X 0 ,
11
¯ ¯
where D1 = 0 and D2 = 00 have positive probability mass. Hence, with SDCIA we
¯
obtain a simpler estimator. Of course, since the SDCIA implies the WDCIA, the meth-
ods (outlined below) for WDCIA are also applicable here. This could in fact be used
as a specification check for those parameters that are identified under SDCIA but also
under WDCIA.
In the sections on propensity score matching and/or weighting we discussed that
these approaches are often taken as a semi-parametric device to improve the estimators’
performance.12 For the problem considered here this is even more attractive, if not nec-
essary, due to the above-mentioned problem that arises when continuous confounders
12 It is semi-parametric if, as often done in practice, the propensity score is estimated parametrically.
are present. Similarly to matching, the propensity score weighting estimator is straightforward; only the notation becomes a bit more complicated. Defining $p^{d_1}(x_0) = \Pr(D_1 = d_1 \mid X_0 = x_0)$ and $p^{d_2|d_1}(\underline{x}_1) = \Pr(D_2 = d_2 \mid \underline{X}_1 = \underline{x}_1, D_1 = d_1)$ we have
$$\begin{aligned}
E\left[\frac{Y_T}{p^{1|1}(\underline{X}_1)\, p^{1}(X_0)} \,\Big|\, \underline{D}_2 = 11\right] \cdot \Pr(\underline{D}_2 = 11)
&= \int \frac{\Pr(\underline{D}_2 = 11)}{p^{1|1}(\underline{X}_1)\, p^{1}(X_0)}\, E\big[\,Y_T \mid \underline{X}_1, \underline{D}_2 = 11\,\big]\, dF_{X_1, X_0 \mid \underline{D}_2 = 11} \\
&= \int \frac{\Pr(\underline{D}_2 = 11)}{p^{1|1}(\underline{X}_1)\, p^{1}(X_0)}\, E\big[\,Y_T \mid \underline{X}_1, \underline{D}_2 = 11\,\big]\, \frac{\Pr(D_2 = 1 \mid X_1, X_0, D_1 = 1)}{\Pr(D_2 = 1 \mid D_1 = 1)}\, dF_{X_1, X_0 \mid D_1 = 1} \\
&= \int \frac{\Pr(D_1 = 1)}{p^{1}(X_0)}\, E\big[\,Y_T \mid \underline{X}_1, \underline{D}_2 = 11\,\big]\, dF_{X_1 \mid X_0, D_1 = 1}\, dF_{X_0 \mid D_1 = 1} \\
&= \int \frac{\Pr(D_1 = 1)}{p^{1}(X_0)}\, E\big[\,Y_T \mid \underline{X}_1, \underline{D}_2 = 11\,\big]\, dF_{X_1 \mid X_0, D_1 = 1}\, \frac{\Pr(D_1 = 1 \mid X_0)\, dF_{X_0}}{\Pr(D_1 = 1)} \\
&= \int E\big[\,Y_T \mid \underline{X}_1, \underline{D}_2 = 11\,\big]\, dF_{X_1 \mid X_0, D_1 = 1}\, dF_{X_0} \;=\; E\big[Y_T^{11}\big],
\end{aligned}$$
which is identical to (8.15). Hence, a natural estimator is
$$\Bigg(\sum_{i:\, \underline{D}_{2,i} = 11} \hat{w}_i\, Y_{T,i}\Bigg) \Bigg/ \Bigg(\sum_{i:\, \underline{D}_{2,i} = 11} \hat{w}_i\Bigg), \qquad \text{where } \hat{w}_i = \frac{1}{\hat{p}^{1|1}(\underline{X}_{1,i})\, \hat{p}^{1}(X_{0,i})}.$$
The conditional probabilities can be estimated non-parametrically. But when the sequences become very long, parametric estimation might be more advisable because the number of observations that have followed exactly this sequence decreases while the list of control variables $\underline{X}_\tau$ gets longer. Similarly,
$$E\big[Y_T^{11} \mid D_1 = 0\big] = E\left[\frac{Y_T}{p^{1|1}(\underline{X}_1)\, p^{1}(X_0)}\cdot p^{0}(X_0) \,\Big|\, \underline{D}_2 = 11\right] \cdot \frac{\Pr(\underline{D}_2 = 11)}{\Pr(D_1 = 0)}.$$
Though we have derived here expressions for the means of the potential outcome for
‘two times treated’, i.e. sequence 11, the procedure works the same for sequences
00, 01 and 10. Various matching estimators based on nearest-neighbour regression are
examined in Lechner (2008).
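A minimal R sketch of the weighting estimator for $E[Y_T^{11}]$ above (not the authors' code): it assumes a data frame dat with the outcome yT, the two period treatments d1 and d2, a baseline covariate x0 and an intermediate covariate x1 measured at the end of period 1; the two scores are estimated by simple logits and all names are illustrative.

p1_fit  <- glm(d1 ~ x0,      family = binomial, data = dat)                   # p^1(x_0)
p11_fit <- glm(d2 ~ x0 + x1, family = binomial, data = subset(dat, d1 == 1))  # p^{1|1}(x_1-bar)

i11 <- which(dat$d1 == 1 & dat$d2 == 1)                                       # sequence 11
w11 <- 1 / (predict(p11_fit, newdata = dat[i11, ], type = "response") *
            fitted(p1_fit)[i11])                                              # hat-w_i

sum(w11 * dat$yT[i11]) / sum(w11)                                             # estimate of E[Y_T^{11}]

Standard errors can be obtained by bootstrapping the whole two-step procedure; with long sequences, parametric score models as above become almost unavoidable, as noted in the text.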
Turning to propensity score matching, it can be shown that the propensity scores also satisfy a balancing property which can make sequential matching estimation somewhat simpler. (Else you might match directly with the conditioning variables.) The idea is as follows: note that for the two-period case under the WDCIA one has
$$Y_T^{\underline{d}_2} \perp\!\!\!\perp D_1 \mid p^{1}(X_0) \quad \text{and} \quad Y_T^{\underline{d}_2} \perp\!\!\!\perp D_2 \mid p^{1|D_1}(\underline{X}_1) \qquad (8.16)$$
(cf. Exercise 5) but also13
$$Y_T^{\underline{d}_2} \perp\!\!\!\perp D_2 \mid p^{1|D_1}(\underline{X}_1), D_1 \quad \text{and} \quad Y_T^{\underline{d}_2} \perp\!\!\!\perp D_2 \mid p^{1|D_1}(\underline{X}_1), p^{1}(X_0), D_1. \qquad (8.17)$$
13 In fact, instead of $p^{d_i}(x_i)$ one can also use any balancing score $b_i(x_i)$ with the property that $E\big[p^{d_i}(X_i) \mid b_i(X_i)\big] = p^{d_i}(X_i)$.
Hence, we can augment the propensity score with additional control variables that we deem to be particularly important for the outcome variable, with the aim to improve small sample properties. In addition to that, it means that we can use the same propensity score when estimating the effects separately by gender or age groups, for example. We obtain
$$E\big[Y_T^{11}\big] = \int\!\!\int E\big[\,Y_T \mid p^{1|1}, p^{1}, \underline{D}_2 = 11\,\big]\, dF_{p^{1|1} \mid p^{1}, D_1 = 1}\, dF_{p^{1}},$$
so that a potential estimator would be
$$\frac{1}{n}\sum_{i=1}^{n} \frac{\displaystyle\sum_{j:\, D_{1,j} = 1} m^{11}\big(p^{1|1}_j, p^{1}_j\big)\cdot K\!\left(\frac{p^{1}_j - p^{1}_i}{h}\right)}{\displaystyle\sum_{j:\, D_{1,j} = 1} K\!\left(\frac{p^{1}_j - p^{1}_i}{h}\right)},$$
where $m^{11}\big(p^{1|1}, p^{1}\big) = E\big[\,Y_T \mid p^{1|1}, p^{1}, \underline{D}_2 = 11\,\big]$.
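A minimal R sketch of this two-step estimator (illustrative only): it assumes that the scores p1 = $p^{1}(X_0)$ and p11 = $p^{1|1}(\underline{X}_1)$ have already been estimated for every observation, that d1, d2 and yT are the treatment indicators and the outcome, and it uses loess for $m^{11}$ and a Gaussian kernel with an arbitrary bandwidth h.

d11      <- data.frame(yT = yT, p11 = p11, p1 = p1)[d1 == 1 & d2 == 1, ]   # sequence 11
treated1 <- data.frame(p11 = p11, p1 = p1)[d1 == 1, ]                      # all with D_1 = 1

m11fit <- loess(yT ~ p11 + p1, data = d11)            # step 1: m^11(p^{1|1}, p^1)
m11    <- predict(m11fit, newdata = treated1)         # evaluated at the D_1 = 1 scores
ok     <- !is.na(m11)                                 # loess does not extrapolate

h    <- 0.05                                          # illustrative bandwidth
EY11 <- mean(sapply(p1, function(p1i) {               # step 2: smooth over p^1 and average
  w <- dnorm((treated1$p1[ok] - p1i) / h)             # kernel weights K((p^1_j - p^1_i)/h)
  sum(w * m11[ok]) / sum(w)
}))
EY11                                                  # estimate of E[Y_T^{11}]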
If more than two time periods are examined, more propensity scores are needed. This means that the dimension of the (non-parametric) matching estimator is increasing with the length of the treatment sequence even if we use a parametrically estimated propensity score. The reason is that when we are interested in $Y^{11}$, then we have to control for $p^{1|1}$ and $p^{1}$ (in the matching). When $Y^{111}$ is of interest, we will need $p^{1|11}$, $p^{1|1}$ and $p^{1}$. In fact, the minimum number of propensity scores needed corresponds to the length of the treatment sequence; actually, the number of propensity scores needed for a treatment sequence $\underline{d}_\tau$ equals $\tau$.
A crucial assumption in the above model was the common support, i.e. that $0 < \Pr\big(D_2 = 1 \mid \underline{X}_1, D_1\big) < 1$ almost surely. In other words, for every value of $X_0$ and $X_1$ there should be a positive probability that either $D_2 = 1$ or $D_2 = 0$ is chosen. In some applications, the set of possible treatments might, however, depend on the value of $X_1$. For example, if we examine a particular training programme for the unemployed, the treatment option $D_2 = 1$ might not exist for someone for whom $X_1$ indicates that this person is no longer unemployed. Here, the set of available treatment options in a given time period $t$ varies with $X_{t-1}$, and the model discussed so far would have to be adjusted to this setting.
8.3 Duration Models and the Timing of Treatments
What happens if the outcome of interest is a duration, say the time it takes to change from a given state (e.g. unemployment) to another (e.g. employment)? One might study, for example, the patterns of unemployment duration, or the factors which influence the length of an unemployment spell. Other examples are the duration from a political decision to its implementation, or the impact of tuition fees on the duration of university studies. These are the sorts of questions that duration (or survival) analysis is designed to address. Statistical methods have been developed over a long time in other disciplines (biometrics, technometrics, statistics in medicine, etc.). In econometrics, however, duration analysis is still much less used.
For this reason, before we start to study the impact of (dynamic) treatment on dura-
tion, we first give a brief introduction (in Section 8.3.1) to some basic definitions and
concepts in duration analysis. Afterwards, in Section 8.3.2 we introduce the concept of
competing risks which is fundamental for our analysis of treatment effect estimation on
durations.
For a discrete duration Y we define the cumulative distribution function
$$F(t)=\Pr(Y\le t)=\sum_{l=1}^{t}\Pr(Y=l)=\Pr(Y<t+1),$$
the survival function
$$S(t)=\Pr(Y>t)=1-\sum_{l=1}^{t}\Pr(Y=l)=\Pr(Y\ge t+1),$$
and the hazard rate (cf. Exercise 6) as
$$\lambda(t)=\Pr(Y=t\mid Y\ge t),$$
i.e. the conditional exit probability at t given survival up to t. If λ(t) decreases in t one speaks of negative duration dependence, if it increases in t of positive duration dependence. Clearly, the potential patterns of duration dependence depend on the form of λ(t), which is therefore often considered as the main building block of duration analysis. The simplest hazard rate has a constant exit rate (zero duration dependence), but in general λ(t) need be neither constant nor monotonic.
As part of the audience might not be familiar with duration analysis, it is worth mentioning some basic relationships between these functions. From above we have that
$$\lambda(t)=\frac{f(t)}{S(t)}=\frac{1}{S(t)}\frac{dF(t)}{dt}=-\frac{1}{S(t)}\frac{dS(t)}{dt}=-\frac{d\log S(t)}{dt}=-\frac{d\log[1-F(t)]}{dt}.$$
So, integrating λ(·) up to t gives
$$\Lambda(t)=\int_{s=0}^{t}\lambda(s)\,ds=-\int_{s=0}^{t}\frac{d\log[1-F(s)]}{ds}\,ds=\big[-\log[1-F(s)]\big]_0^{t}=-\log[1-F(t)]+\log[1-F(0)]=-\log[1-F(t)]=-\log S(t),$$
since F(0) = 0.
You may think of Λ(t) as the sum of the risks you face going from duration 0 to t. We can thus express the survival function and the density in terms of the hazard rate by rearranging:
$$S(t)=\exp\left[-\int_{s=0}^{t}\lambda(s)\,ds\right],\tag{8.19}$$
$$f(t)=\exp\left[-\int_{s=0}^{t}\lambda(s)\,ds\right]\lambda(t).\tag{8.20}$$
It is obvious then that for continuous time $1-F(t)=\exp[-\Lambda(t)]$, indicating that Λ(t) has an exponential distribution with parameter 1, and log Λ(t) an extreme value Type 1 (or Gumbel) distribution with density $f(\epsilon)=\exp\{\epsilon-\exp(\epsilon)\}$. Similarly to the above calculations it can be shown that for discrete time the probability Pr(Y = t) can be expressed in terms of the hazard. Let us consider some typical examples of distributions used in duration analysis.
Example 8.11 The classical example in basic statistics courses is the exponential distribution. It has a constant hazard rate specification where λ(t) = λ_0 for some λ_0 > 0. To derive S(t), note first that $d\log S(t)/dt=-\lambda_0$. This implies $\log S(t)=k-\lambda_0 t$ for some constant k, i.e. $S(t)=K\exp(-\lambda_0 t)$ for a given K > 0. Since S(0) = 1 we must have K = 1, so that $S(t)=\exp(-\lambda_0 t)$ and $f(t)=\lambda_0\exp(-\lambda_0 t)$.
Example 8.12 Another classical example is the Weibull distribution. The Weibull hazard rate is defined as
$$\lambda(t)=\lambda_0\gamma(\lambda_0 t)^{\gamma-1}\quad\text{or}\quad a\gamma t^{\gamma-1}\ \text{ for }a=\lambda_0^{\gamma}$$
with λ_0, γ > 0 (as negative exit rates do not exist), giving survival function and derivative of the hazard
$$S(t)=\exp[-(\lambda_0 t)^{\gamma}],\qquad\frac{d\lambda(t)}{dt}=\lambda_0^{2}\gamma(\gamma-1)(\lambda_0 t)^{\gamma-2}.$$
The median duration can be calculated from
$$S(M)=\exp[-(\lambda_0 M)^{\gamma}]=0.5\;\Longrightarrow\;M=\frac{\log(2)^{1/\gamma}}{\lambda_0}.$$
The derivative of the hazard is positive for γ > 1 and negative for γ < 1, so the parameter γ determines the sign and degree of duration dependence. Note, however, that the Weibull hazard is monotonic in t: monotonically increasing for γ > 1 and monotonically decreasing for γ < 1.
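As a quick numerical sanity check of the median formula, R's built-in Weibull quantile function can be used; note that R parameterises the Weibull with shape = γ and scale = 1/λ_0, and the parameter values below are arbitrary.

```r
## Check of the median formula M = log(2)^(1/gamma)/lambda_0 from Example 8.12.
lam0 <- 0.9; gam <- 1.5
c(formula  = log(2)^(1 / gam) / lam0,
  qweibull = qweibull(0.5, shape = gam, scale = 1 / lam0))  # both give the median
```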
Example 8.13 The log-logistic distribution is another, though still one-parameter, generalisation of the exponential distribution, with hazard rate
$$\lambda(t)=\frac{\lambda_0\gamma(\lambda_0 t)^{\gamma-1}}{1+(\lambda_0 t)^{\gamma}}\quad\text{for }\lambda_0,\gamma>0,\quad\text{or}\quad a\gamma t^{\gamma-1}\big(1+at^{\gamma}\big)^{-1}\ \text{ for }a=\lambda_0^{\gamma},$$
giving
$$S(t)=\frac{1}{1+(\lambda_0 t)^{\gamma}},\qquad f(t)=a\gamma t^{\gamma-1}(1+at^{\gamma})^{-2}.$$
One conclusion is that log(t) has a logistic distribution (Exercise 7) with density
$$g(y)=\gamma\exp\{\gamma(y-\mu)\}\big/\big[1+\exp\{\gamma(y-\mu)\}\big]^{2},\qquad\mu=-\log\lambda_0.$$
In Figure 8.2 you see graphical examples of the exponential, Weibull and log-logistic distributions. It can now also be seen why it is much easier to detect and understand the differences by looking at the hazard rates than by looking at the survival (or cumulative distribution) functions.
For estimating the unknown parameters of a particular distribution we can obviously resort to maximum likelihood methods. Even if we start out from the specification of the hazard, thanks to (8.20) we immediately get the density (for the continuous case) or the probability (for the discrete case, see (8.18)) that we need. More specifically, take t as continuous and consider a sample of n observed (completed) durations t_1, t_2, ..., t_n within a sample period. Given a parametric form for λ(·) that is fixed up to an unknown finite-dimensional parameter θ, the density for t_i is f(t_i; θ) = λ(t_i; θ) · S(t_i; θ), which yields a likelihood and corresponding log-likelihood of
$$L(\theta)=\prod_{i=1}^{n}f(t_i;\theta)=\prod_{i=1}^{n}\lambda(t_i;\theta)\,S(t_i;\theta),$$
$$l(\theta)=\sum_{i=1}^{n}\ln f(t_i;\theta)=\sum_{i=1}^{n}\ln\lambda(t_i;\theta)+\sum_{i=1}^{n}\ln S(t_i;\theta).\tag{8.21}$$
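The log-likelihood (8.21) is easily maximised numerically. The following minimal R sketch does this for the Weibull hazard of Example 8.12 with simulated, completed durations (the parameter values happen to match the dashed curve of Figure 8.2); it is an illustration under these assumptions rather than a recommended implementation.

```r
## Maximum likelihood for the Weibull hazard lambda(t) = lam0*gam*(lam0*t)^(gam-1)
## using the log-likelihood (8.21); completed (uncensored) durations only.
set.seed(1)
t.obs <- rweibull(500, shape = 1.5, scale = 1 / 0.9)   # S(t) = exp(-(0.9 t)^1.5)

loglik <- function(par) {                  # par = (log lam0, log gam) to keep both > 0
  lam0 <- exp(par[1]); gam <- exp(par[2])
  ln.lambda <- log(lam0 * gam) + (gam - 1) * log(lam0 * t.obs)  # ln lambda(t_i)
  ln.S      <- -(lam0 * t.obs)^gam                              # ln S(t_i)
  sum(ln.lambda) + sum(ln.S)                                    # l(theta) as in (8.21)
}
fit <- optim(c(0, 0), loglik, control = list(fnscale = -1), hessian = TRUE)
exp(fit$par)                               # estimates of (lam0, gam); true values (0.9, 1.5)
```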
Figure 8.2 Hazard rates (left figure) and survival functions (right figure) for the exponential with λ_0 = 1 (dotted), the Weibull with λ_0 = 1.4, γ = 0.5 (solid) and with λ_0 = 0.9, γ = 1.5 (dashed), and the log-logistic with λ_0 = 1.9, γ = 2.7
If some durations are right-censored, i.e. we only observe $t_i=\min\{t_i^*,c_i\}$ for a censoring time $c_i$ together with the indicator $\delta_i$ of a completed duration ($\delta_i=1$ if $t_i^*\le c_i$, $\delta_i=0$ otherwise), then censored observations contribute only their survival up to $t_i$, and the log-likelihood becomes
$$l_r(\theta)=\sum_{\delta_i=1}\ln f(t_i;\theta)+\sum_{\delta_i=0}\ln S(t_i;\theta)=\sum_{\delta_i=1}\ln\lambda(t_i;\theta)+\sum_{i=1}^{n}\ln S(t_i;\theta).\tag{8.22}$$
For the consistency of this maximum likelihood estimator it is needed that the latent duration is distributed independently of $c_i$ and of the starting point of the initial state, say $a_i$.
Left-censoring can be treated equivalently, but for left-truncation we need some more information. Imagine now that durations $t_i$ are only observed for people who are (still) in the initial status at time b, i.e. observed conditional on $t_i>l_i$ (for $l_i>0$), where $l_i=b-a_i$ with $a_i$ being, as above, the starting point. When $l_i$ is known one can work with the conditional density
$$f(t_i\mid t_i>l_i;\theta)=\frac{\lambda(t_i;\theta)\,S(t_i;\theta)}{S(l_i;\theta)}$$
to calculate the log-likelihood (in the absence of right-censoring). In sum this gives the log-likelihood
$$l_l(\theta)=\sum_{i=1}^{n}\ln\lambda(t_i;\theta)+\sum_{i=1}^{n}\big[\ln S(t_i;\theta)-\ln S(l_i;\theta)\big].\tag{8.23}$$
Of course, the problem is the need to know $l_i$ (respectively b and $a_i$).
Combining right-censoring with left-truncation analogously yields
$$l_{lr}(\theta)=\sum_{\delta_i=1}\ln f(t_i;\theta)+\sum_{\delta_i=0}\ln S(t_i;\theta)-\sum_{i=1}^{n}\ln S(l_i;\theta).\tag{8.24}$$
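A minimal numerical sketch of (8.24), again for the Weibull hazard of Example 8.12, follows. The sampling scheme (uniform truncation points, exponential censoring after the truncation point) is purely an illustrative assumption.

```r
## Log-likelihood (8.24): right-censoring (delta_i) plus left-truncation at l_i,
## for the Weibull hazard of Example 8.12 with true (lam0, gam) = (1, 1.5).
set.seed(1)
n    <- 3000
tlat <- rweibull(n, shape = 1.5, scale = 1)   # latent durations, S(t) = exp(-t^1.5)
l    <- runif(n, 0, 0.5)                      # truncation points l_i = b - a_i
keep <- tlat > l                              # only subjects still at risk at b are observed
tlat <- tlat[keep]; l <- l[keep]
cens  <- l + rexp(length(tlat), rate = 1)     # right-censoring times (> l_i by construction)
t.obs <- pmin(tlat, cens)
delta <- as.numeric(tlat <= cens)             # 1 = completed, 0 = right-censored

loglik <- function(par) {
  lam0 <- exp(par[1]); gam <- exp(par[2])
  ln.f <- log(lam0 * gam) + (gam - 1) * log(lam0 * t.obs) - (lam0 * t.obs)^gam
  ln.S <- -(lam0 * t.obs)^gam
  sum(ifelse(delta == 1, ln.f, ln.S)) - sum(-(lam0 * l)^gam)   # l_lr(theta) of (8.24)
}
fit <- optim(c(0, 0), loglik, control = list(fnscale = -1))
exp(fit$par)                                  # estimates of (lam0, gam); true (1, 1.5)
```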
When having grouped data, one typically uses methods that differ from the discrete-time or parametric continuous-time models above; one could say that people often tend to apply non-parametric methods directly. The idea is actually pretty simple. Having grouped data means that the timeline is divided into M + 1 intervals: [0, b_1), [b_1, b_2), ..., [b_M, ∞), where the b_m are given (in practice: chosen by the empirical researcher) for all m. We record the observations in terms of the exits E_m in the m-th interval [b_{m−1}, b_m). Let N_m be the number of people still at risk in that interval, i.e. still in the initial state. Then a trivial estimator for the exit rate is obviously, for all m, $\hat\lambda_m=E_m/N_m$. Similarly,
$$\widehat{\Pr}(Y>b_m\mid Y>b_{m-1})=(N_m-E_m)/N_m\quad\text{and}\quad\hat S(b_m)=\prod_{r=1}^{m}(N_r-E_r)/N_r.\tag{8.25}$$
This is the so-called Kaplan–Meier estimator. It is consistent when assuming that with increasing sample size the number of observations in each interval increases, too. In practice this means that in each interval we have a 'reasonably' large number N_r.
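The product-limit (Kaplan–Meier) estimator (8.25) is readily available in the R survival package mentioned in the computational notes (Section 8.4); the simulated data and censoring mechanism below are illustrative assumptions.

```r
## Kaplan-Meier estimate hat S(t) = prod_r (N_r - E_r)/N_r with survival::survfit().
library(survival)
set.seed(1)
tlat  <- rexp(300, rate = 1)               # latent durations
cens  <- rexp(300, rate = 0.3)             # independent right-censoring
y     <- pmin(tlat, cens)                  # observed duration
delta <- as.numeric(tlat <= cens)          # 1 = exit observed, 0 = censored
km <- survfit(Surv(y, delta) ~ 1)
summary(km, times = c(0.5, 1, 2))          # survival estimates at chosen durations
plot(km, xlab = "duration", ylab = "survival function")  # cf. Figure 8.2 (right panel)
```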
As already discussed in other chapters, for our purpose it is quite helpful – if not
necessary – to include covariates in the model. We will see that basically all we have
learnt above does still apply though the notation changes. The most crucial points are
the definition of a conditional hazard, and the assumptions on the covariates that are to
be included.
An important first distinction is to separate the covariates into time-invariant covariates x_i, i.e. those that do not depend on the period of duration, and time-varying covariates x_{it}. Typical examples of the former are the gender of the individual or the level of school qualification (for adults). The time-varying covariates, however, have to be handled with care. As we did in the previous chapters, one typically prefers to assume that the included covariates are not influenced by the considered process; they have to be exogenous. Roughly speaking, problems start when there is a feedback from the duration on them (as, e.g., for marital status). If, however, there is only a feedback from a change in X on Y, then it is manageable.
For the ease of presentation we start with the inclusion of time-invariant covariates.
Furthermore, we assume that the conditional distribution of the latent duration ti∗ |xi is
independent of ci , ai (potential censoring and the starting point). Then the most pop-
ular modelling approach, at least in continuous time, is the proportional hazard (PH)
specification: for some parametric baseline hazard λ0 (t) consider
$$\lambda(t;x)=g(x)\cdot\lambda_0(t),\qquad g(x),\lambda_0>0\tag{8.26}$$
with an unknown (maybe pre-specified up to a vector β of unknown parameters) function g(·), which is called the systematic part. A typical choice is g(x) = exp(x′β). Then log{λ(t; x)} = x′β + ln λ_0(t), and the elements of β measure the semi-elasticity of the hazard with respect to their corresponding element in the vector x. The definition of the survival function becomes
$$S(t)=\exp\left[-\int_{s=0}^{t}\exp(x'\beta)\lambda_0(s)\,ds\right]=\exp\left[-\exp(x'\beta)\int_{s=0}^{t}\lambda_0(s)\,ds\right]=\exp\big[-\exp(x'\beta)\Lambda_0(t)\big].$$
This way we again get standard formulations for (log-)likelihoods which can be used for maximum likelihood estimation of the PH. One reason for its popularity is that for proportional hazards, Cox (1972) derived the partial maximum likelihood estimation method for β. Its advantage is that it does not require knowledge of λ_0(t), i.e. no further specification of the exact distribution of the duration: it is defined (for completed observations) as the maximiser of
$$\prod_{i}\frac{\exp(x_i'\beta)}{\sum_{j\in R_i}\exp(x_j'\beta)}\,,$$
where R_i is the set of individuals under risk at time t_i, and i is the individual with the event at t_i.
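The Cox partial likelihood estimator is implemented in the R survival package as coxph(). The sketch below simulates data from a proportional hazard with a Weibull baseline (all numbers are illustrative assumptions) and recovers β without ever specifying λ_0(t).

```r
## Cox partial likelihood estimation of beta; lambda(t|x) = lambda_0(t) exp(0.7 x)
## with a Weibull baseline Lambda_0(t) = t^1.5 (simulated by inversion).
library(survival)
set.seed(1)
n  <- 500
x  <- rnorm(n)
t0 <- (-log(runif(n)) / exp(0.7 * x))^(1 / 1.5)   # S(t|x) = exp(-t^1.5 exp(0.7 x))
cc <- rexp(n, rate = 0.2)                         # right-censoring
y  <- pmin(t0, cc); delta <- as.numeric(t0 <= cc)
cox <- coxph(Surv(y, delta) ~ x)                  # no specification of lambda_0(t) needed
summary(cox)                                      # hat beta should be close to 0.7
head(basehaz(cox, centered = FALSE))              # Breslow-type cumulative baseline hazard
```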
Example 8.15 Recall Example 8.12 and the Weibull hazard specification λ(t) = f(t)/S(t) = aγt^{γ−1}. If we substitute g(x) = exp{β_0 + x′β} for a in order to model the dependency on x, then we obtain a proportional hazard with λ_0(t) = γt^{γ−1}.
Even more generally, with g(x) = exp{β_0 + x′β} we obtain for the baseline hazard
$$\log\left\{\int_0^{t}\lambda_0(s)\,ds\right\}=-\beta_0-x'\beta+\epsilon\tag{8.28}$$
with ε having the extreme value Type 1 distribution; recall the discussion after Equations 8.19 and 8.20.
This also shows a link to regression analysis of durations. However, in Equation 8.28 the ε only represents the purely random variation in the duration outcome; it does not
capture any other individual heterogeneity. Therefore we will later on introduce the
so-called mixed proportional hazard.
An alternative to the popular proportional hazard is the idea of accelerated hazard functions (AHF), also called accelerated failure time models. For a given parametric hazard model (like the exponential or Weibull) one simply replaces λ_0 by λ_0(x) = exp(x′β).
Example 8.16 For the Weibull distribution, cf. Example 8.12, this gives λ(t; x) = exp(x′β)γ{exp(x′β)t}^{γ−1} and S(t|x) = exp[−{exp(x′β)t}^γ]. For the exponential hazard with no duration dependence, cf. Example 8.11, we have simply λ(t) = λ_0(x) = exp(x′β). This gives expected duration time
$$E[Y|X=x]=\frac{1}{\exp(x'\beta)},$$
which for 'completed' durations is often estimated by a linear regression of the log-durations on the covariates.
Other distributions suitable for AHF models include the log-normal, generalised
gamma, inverse Gaussian distributions, etc. Among them, the generalised gamma dis-
tribution is quite flexible as it is a three-parameter distribution that includes the Weibull,
log-normal and the gamma. Their popularity, however, depends less on their flexibility than on the availability of software packages, or on whether the survival function has an analytic closed form.
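Parametric accelerated failure time models are implemented in the R survival package via survreg(); which distributions are available (Weibull, log-normal, log-logistic, ...) is exactly the software issue alluded to above. The data-generating process below is an illustrative assumption.

```r
## Accelerated failure time fit with survreg(); the covariate scales duration itself.
library(survival)
set.seed(1)
n  <- 500
x  <- rnorm(n)
t0 <- rweibull(n, shape = 1.5, scale = exp(0.5 + 0.8 * x))  # true AFT structure
cc <- rexp(n, rate = 0.1)
y  <- pmin(t0, cc); delta <- as.numeric(t0 <= cc)
aft <- survreg(Surv(y, delta) ~ x, dist = "weibull")  # also "lognormal", "loglogistic", ...
summary(aft)   # coefficients act on log-duration; here the x-coefficient is about 0.8
```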
In Example 8.15 we saw that it would be desirable to also include unobserved heterogeneity between individuals in the PH model. This can be done straightforwardly and leads to the so-called mixed proportional hazard (MPH) models:
$$\lambda(t;x,v)=v\cdot g(x)\cdot\lambda_0(t)$$
with a time-invariant (i.e. only individual-specific) random effect v with distribution F_v(v) and E[v] < ∞.^{14} For identification (and estimation) it is typically assumed that the observed covariates x are independent of the unobserved heterogeneity v. For a complete set of technical assumptions in order to non-parametrically identify the MPH, see for example van den Berg (2001). Compared to the regression equation in Example 8.15 we now have
$$\log\left\{\int_0^{t}\lambda_0(s)\,ds\right\}=-\beta_0-x'\beta-\log v+\epsilon,$$
14 Sometimes it is also set E[v] = 1 if one wants to identify a scale for g(x) and/or λ_0(t). Otherwise one has to normalise these functions.
which is much more flexible than that resulting from the PH, but still more restrictive
than a regression with an arbitrary (finite variance) error term.
For calculating the maximum likelihood, note that when $t_i\,|\,(x_i,v_i)\sim F(t|x_i,v_i;\theta)$ and $v\sim F_v(v;\delta)$ with a finite-dimensional unknown parameter δ, then
$$t_i\,|\,x_i\;\sim\;H(t|x_i;\theta,\delta)=\int_0^{\infty}F(t|x_i,v;\theta)\,dF_v(v;\delta).$$
This means one would work with H or h = dH/dt (instead of F, f) for the construction of the likelihood. As a byproduct we have (suppressing δ)
$$f(t|x)=\lambda(t;x)S(t|x)=\int_0^{\infty}\lambda(t;x,v)\,S(t|x,v)\,dF_v(v).$$
It can be shown that the duration dependence of the resulting hazard λ(t; x) is more negative than that of λ(t; x, v).
But how to choose the distribution of the unobserved heterogeneity Fv among sur-
vivors? Heckman and Singer (1984) study the impact of the choice on the parameter
estimates and propose a semi-parametric estimator. Abbring and van den Berg (2007)
show that for a large class of MPH models this distribution converges to a gamma
distribution.
Example 8.17 If we use gamma-distributed heterogeneity, then we can find the distribution of completed t_i|x_i for a broad class of hazard functions with multiplicative heterogeneity. Set λ(t; x, v) = v · g(t|x) without further specification, and v ∼ Γ(δ, δ), such that E[v] = 1 and Var[v] = δ^{−1}. Recall the density of the Gamma distribution, $\delta^{\delta}v^{\delta-1}\exp\{-\delta v\}/\Gamma(\delta)$, and that for t|x, v we have
$$F(t|x_i,v_i)=1-\exp\left\{-v_i\int_0^{t}g(s|x_i)\,ds\right\}=1-\exp\{-v_i\Lambda(t;x_i)\},$$
where $\Lambda(t;x_i)=\int_0^{t}g(s|x_i)\,ds$. Set $\Lambda_i=\Lambda(t;x_i)$; then plugging in gives
$$H(t_i|x_i;\theta,\delta)=\int_0^{\infty}\big[1-\exp\{-v\Lambda_i\}\big]\,\delta^{\delta}v^{\delta-1}\exp\{-\delta v\}/\Gamma(\delta)\,dv$$
$$=1-\big[\delta/(\delta+\Lambda_i)\big]^{\delta}\int_0^{\infty}(\delta+\Lambda_i)^{\delta}v^{\delta-1}\exp\{-v(\delta+\Lambda_i)\}/\Gamma(\delta)\,dv$$
$$=1-\big[\delta/(\delta+\Lambda_i)\big]^{\delta}=1-\big[1+\Lambda_i/\delta\big]^{-\delta}.$$
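A quick Monte Carlo check of the closed form just derived (with arbitrary, purely illustrative values for δ and Λ_i):

```r
## Integrating F(t|x,v) = 1 - exp(-v*Lambda) over v ~ Gamma(delta, delta) should
## reproduce the closed form 1 - (1 + Lambda/delta)^(-delta) of Example 8.17.
set.seed(1)
delta  <- 2
Lambda <- 0.8                                        # Lambda(t; x) at some fixed (t, x)
v      <- rgamma(1e6, shape = delta, rate = delta)   # E[v] = 1, Var[v] = 1/delta
c(monte.carlo = mean(1 - exp(-v * Lambda)),
  closed.form = 1 - (1 + Lambda / delta)^(-delta))   # agree up to simulation error
```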
For grouped duration data, cf. the Kaplan–Meier estimator introduced above, the inclusion of time-invariant covariates is quite straightforward for hazards λ(t; x) parametrically specified up to a finite-dimensional parameter θ. For a moment let us assume that we do not suffer from any censoring. Again, we have the timeline divided into M + 1 intervals [0, b_1), [b_1, b_2), ..., [b_M, ∞), where the b_m are given. Then we can estimate θ by maximising the likelihood
$$\prod_{i=1}^{n}\big\{1-\tilde b_{m_i}(x_i;\theta)\big\}\prod_{l=1}^{m_i-1}\tilde b_l(x_i;\theta),\quad\text{where}\quad\tilde b_l(x_i;\theta):=\exp\left\{-\int_{b_{l-1}}^{b_l}\lambda(s;x_i)\,ds\right\}$$
and m_i denotes the interval in which individual i exits.
That is, we sum up the exits in each time interval. For right censored observations i, we
can simply drop {1 − b̃m i (xi ; θ )} from the likelihood.
In the Kaplan–Meier estimator without covariates we considered constant exit rates in each interval. Here we have allowed for a continuous function λ(·) but nonetheless use only its integral $\tilde b_m$. Quite popular are the piecewise-constant proportional hazards with λ(t; x) = g(x)λ_m for m = 1, ..., M and g(·) > 0 to be specified. Such a specification causes discontinuities in the time dimension, but only in theory, because even if λ_0 is continuous, for proportional hazards we will only work with $\int_{b_{m-1}}^{b_m}\lambda_0(s)\,ds$. For the common specification g(x) = exp(x′β) one obtains $\tilde b_m(x;\beta)=\exp[-\exp(x'\beta)\lambda_m(b_m-b_{m-1})]$.
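A minimal sketch of this grouped-data likelihood with a piecewise-constant proportional hazard, maximised numerically, is given below; the interval bounds, data-generating process and starting values are illustrative assumptions.

```r
## Grouped durations with piecewise-constant PH: lambda(t; x) = exp(x*beta)*lambda_m
## on [b_{m-1}, b_m); the exit-interval likelihood from the text is maximised by optim.
set.seed(1)
n <- 800; x <- rnorm(n)
t.obs <- rexp(n, rate = 0.8 * exp(0.5 * x))      # continuous durations (constant baseline)
b <- c(0, 0.5, 1, 2, Inf); M <- 3                # M bounded intervals plus [b_M, Inf)
m.i <- cut(t.obs, b, labels = FALSE)             # interval in which i exits

negll <- function(par) {                         # par = (beta, log lambda_1, ..., log lambda_M)
  beta <- par[1]; lam <- exp(par[-1])
  ll <- 0
  for (i in 1:n) {
    s <- exp(-exp(x[i] * beta) * lam * diff(b[1:(M + 1)]))   # b-tilde_l(x_i), l = 1..M
    m <- m.i[i]
    ll <- ll + if (m <= M) log(1 - s[m]) + sum(log(s[seq_len(m - 1)])) else sum(log(s))
  }
  -ll
}
fit <- optim(c(0, rep(0, M)), negll, method = "BFGS")
c(beta = fit$par[1], lambda = exp(fit$par[-1]))  # beta close to 0.5, lambdas close to 0.8
```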
We finally turn to hazard models conditional on time-varying covariates. For notational convenience we therefore denote the covariates by x(t), t ≥ 0. There exist various definitions of the hazard function with slight modifications of the conditioning set. For our purposes it is opportune to work with the following: we (re)define the hazard by
$$\lambda\{t;X(t)\}=\lim_{dt\downarrow 0}\Pr\big(t\le Y<t+dt\,\big|\,Y\ge t,\,X\{u\}_0^{t}\big)\big/dt,\tag{8.30}$$
where $X\{u\}_0^{t}$ denotes the path of X over the time period [0, t]. This requires that the
entire path of X is well defined whether or not the individual is in the initial state.
A further necessary condition for a reasonable interpretation (at least in our context)
of the model and its parameters is to rule out feedback from the duration to (future)
values of X . One would therefore assume that for all t and s > t
$$\Pr\big(X\{u\}_t^{s}\,\big|\,Y\ge s,X(t)\big)=\Pr\big(X\{u\}_t^{s}\,\big|\,X(t)\big).\tag{8.31}$$
One then also speaks of strictly exogenous covariates.^{15} One speaks of external covariates if the path of X is independent of whether or not the agent is in or has left the initial state. Note that these always have well-defined paths and fulfil Assumption (8.31).
In the context of (mixed) proportional hazard models, however, it is more common to say that X(t) is a predictable process. This does not mean that we can predict the whole future realisation of X; it basically means that all values of the covariates entering the hazard at t must be known and observable just before t. In other words, the covariates
15 The covariates are then also sequentially exogenous, because by specification of λ{t, X (t)} we are
conditioning on current and past covariates.
at time t are influenced only by events that have occurred up to time t, and these events
are observable.16
Example 8.18 Time-invariant covariates like gender and race are obviously predictable. All covariates with a fully known path are predictable. A trivial example is age, though one might argue that this is equivalent to knowing the birth date and thus to a time-invariant covariate. Another example is unemployment benefits as a function of the elapsed unemployment duration: if these are institutionally fixed, then the path is perfectly known and thus predictable. It is less obvious for processes that can be considered as random. A stochastic X is predictable in the above defined sense if its present value depends only on past and outside random variation. A counterexample is any situation where the individual has inside information (that the researcher does not have) on future realisations of X that affect the present hazard. In other words, predictability of X does not necessarily mean that the empirical researcher or the individual can predict its future values; it means that both are on the same level of information as far as it is relevant for the hazard.
16 For people being familiar with time series and panel data analysis it might be quite helpful to know that
predictability of process X is basically the same as weak exogeneity of X .
For the cause-specific case we simply add an indicator for the cause, say k ∈ {1, 2, ..., M}:
$$\lambda_k(t;x)=\lim_{dt\to 0}\Pr\big(t\le Y<t+dt,\,K=k\,\big|\,Y\ge t,x\big)\big/dt.\tag{8.32}$$
From the law of total probability one immediately has $\lambda(t;x)=\sum_{k=1}^{M}\lambda_k(t;x)$. Accordingly you get
$$S_k(t|x)=\exp\{-\Lambda_k(t;x)\},\qquad\Lambda_k(t;x)=\int_0^{t}\lambda_k(s;x)\,ds,\tag{8.33}$$
from which we can conclude $S(t|x)=\prod_{k=1}^{M}S_k(t|x)$. Analogously, the cause-specific density is
$$f_k(t|x)=\lim_{dt\to 0}\Pr\big(t\le Y<t+dt,\,K=k\,\big|\,x\big)\big/dt=\lambda_k(t;x)S(t|x)\tag{8.34}$$
with $f(t|x)=\sum_{k=1}^{M}f_k(t|x)$. Similarly we can proceed when time is discrete. Consequently, with the density at hand, all parameters specified in this model can be estimated by maximum (log-)likelihood, no matter whether you directly start with the specification of the density or with modelling the hazard or the cumulative incidence function. And again, in the case of right-censoring you include the density for completed observations and only the survival function for censored observations. For left-truncation you can again derive the density conditional on not being truncated, as we did before.
Typically used names for this model are multiple exit or multivariate duration models. But often one speaks of competing risks models because the different causes of failure are competing to occur first. In case we are only interested in one of them, a natural approach is to classify all the others as censored observations. This decision certainly depends very much on the context, i.e. whether one is interested in the overall or just one or two cause-specific exit rates. The way of modelling the overall and/or the cause-specific hazards can be crucial. For example, by understanding the duration effects of a therapy on different subgroups, interventions can be targeted at those who most likely benefit at a reasonable expense. The most obvious approach is to apply (M)PH models to the cause-specific hazards.^{17} If one uses the Cox (1972) PH specification for each cause-specific hazard function, then the overall partial likelihood is just the product of the M partial likelihoods one would obtain by treating all other causes of failure like censored cases. The extension to mixed proportional hazards, i.e. including unobserved individual (time-invariant) heterogeneity V that may vary over individuals and/or causes, works as before. But it can easily render the estimation problem infeasible when the assumed dependence structure among the V_{ik} gets complex. The extension to more complex dependence structures or to more flexible functional forms is generating a still-growing literature on competing risks models, their modelling, estimation and implementation.
17 Another quite popular approach is to model the cumulative incidence function explicitly; see Fine and Gray (1999).
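A minimal sketch of the cause-specific approach just described, using the R survival package: for each cause a Cox PH is fitted in which exits due to the other cause are treated like censored cases. The simulated latent-time setup and all numbers are illustrative assumptions.

```r
## Cause-specific Cox PH estimation in a two-cause competing risks setting.
library(survival)
set.seed(1)
n  <- 1000
x  <- rnorm(n)
t1 <- rexp(n, rate = 0.5 * exp( 0.6 * x))   # latent time to exit for cause 1
t2 <- rexp(n, rate = 0.3 * exp(-0.4 * x))   # latent time to exit for cause 2
cc <- rexp(n, rate = 0.2)                   # independent right-censoring
y     <- pmin(t1, t2, cc)
cause <- ifelse(cc <= pmin(t1, t2), 0, ifelse(t1 <= t2, 1, 2))  # 0 = censored
cox1 <- coxph(Surv(y, cause == 1) ~ x)      # cause 1: cause-2 exits treated as censored
cox2 <- coxph(Surv(y, cause == 2) ~ x)      # cause 2: cause-1 exits treated as censored
rbind(cause1 = coef(cox1), cause2 = coef(cox2))   # approx. 0.6 and -0.4
```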
In what follows we only concentrate on the problem of identifying the treatment effect
in a bivariate (M = 2) competing risks model. We consider the population being in
the initial status, e.g. unemployment, and are interested in measuring the effect of a
treatment on the exit rate, e.g. the duration to find a job. The complication is that those
people leaving the initial status are no longer eligible for treatment. In other words, four observed situations are thinkable: you never get treated and never find a job; you get treated but do not find a job; you find a job after treatment; you find a job before a treatment has taken place. For each you observe the duration Y of 'staying in the initial state' and the duration D of 'waiting in the initial state for treatment'.
So we denote by D the treatment time which typically refers to the point in time
the treatment is initiated. Like in the previous chapters we are interested in Y d , i.e. the
potential duration to change the state (e.g. find a job) given a certain d ≥ 0. As usual,
for the observed duration we have Y = Y^D. What we can identify from a reasonably large data set without further specifying a model are the probabilities
Pr(Y > y, D > d, Y > D) = Pr(Y > y, D > d|Y > D) · Pr(Y > D) and
Pr(Y > y, Y < D) = Pr(Y > y|Y < D) · Pr(Y < D).
In our example Pr(Y < D) = 1 − Pr(Y > D) just indicates the proportion of people who found a job before starting a training programme, and Pr(Y > y|Y < D) describes the (survival) distribution of Y within this group. Now, if the treatment effect can be recovered from these probabilities, then one says it is identifiable. Similarly to the previous chapters, the causal model is given by the pair ({Y^d; d ≥ 0}, D). The hazard for the potential outcomes is defined in the same way as we defined hazard functions before, namely by
$$\lambda_{Y^d}(t)=\lim_{dt\to 0}\Pr\big(t\le Y^d<t+dt\,\big|\,Y^d\ge t\big)\big/dt,\tag{8.35}$$
with its integral $\Lambda_{Y^d}(t)=\int_0^{t}\lambda_{Y^d}(s)\,ds$.
Abbring and van den Berg (2003b) showed that, unfortunately, for each causal model specification there exists an observationally equivalent specification, say ({Ỹ^d; d ≥ 0}, D̃), that satisfies randomised assignment, i.e. {Ỹ^d; d ≥ 0} ⊥⊥ D̃, and no anticipation. In
other words, the two probabilities Pr(Y > y, D > d, Y > D), Pr(Y > y, Y < D)
could be produced equally well from models with and without a treatment effect. Con-
sequently, without a structural model and a clear rule of no anticipation, one cannot
detect a treatment effect from observational data. In fact, in order to be able to iden-
tify a treatment effect under plausible assumptions, we need to have some observable
variation over individuals or strata. This can either be multiple spells or observable
characteristics X .
Clearly, after all we have seen above, the possibly most appealing structure seems to
be the mixed proportional hazard with unobserved heterogeneity V and either multiple
spells or observable covariates X . We proceed as we did in the former chapters, i.e. start
without observable characteristics X . We assume to have observed for each individual
or strata at least two spells, say (Y1 , D1 ), (Y2 , D2 ), and for the ease of notation only use
these two. While the unobserved heterogeneity is allowed to change over time, it must be unique for an individual or strata (and is therefore indexed below by Y); the treatment effect, in contrast, can differ across spells. A useful model is then given by
$$\lambda_{Y_k}(t;D_k,V)=\begin{cases}\lambda_{0,Y_k}(t)\,V_Y(t)&\text{if }t\le D_k\\ \lambda_{0,Y_k}(t)\,\alpha_k(t,D_k)\,V_Y(t)&\text{if }t>D_k\end{cases}\qquad k=1,2,\tag{8.36}$$
where the λ_{0,Y_k}(·), V_Y(·) are integrable on bounded intervals. For identification one has to normalise either the baseline hazards or V_Y; for convenience (but without loss of generality) let us set λ_{0,Y_1} = 1. Note that the model is restrictive for the treatment effects α_k in the sense that they must not depend on individual characteristics except those captured by D_k.
Then the identification strategy is very similar to what we have seen in former chapters; in particular, we need a kind of conditional independence assumption:
Assumption CIA-PH1 (conditional independence assumption for competing risks
models with multiple spells): Y1 ⊥⊥ (Y2 , D2 )|(D1 , VY ) and Y2 ⊥⊥ (Y1 , D1 )|(D2 , VY ).
This looks weaker than what we had so far as we no longer ask for something like
Yk ⊥⊥ Dk or Yk ⊥⊥ Dk |VY . The latter, however, is already quite a weak requirement as
it does not specify what we can put in VY and what not. Assumption CIA-PH1 seems
to be even weaker, but again we will only be able to identify treatment effects if each
individual or strata was exposed twice to the same experiment. This is a situation we
never considered in earlier chapters.
Now define N_d as our treatment indicator, i.e. N_d(y) = 0 if y < d and N_d(y) = 1 otherwise, and {N_d(t) : 0 ≤ t ≤ Y} as our treatment history up to Y. Then it can be shown that for Y_{(1)} := min{Y_1, Y_2} in model (8.36) it holds that
$$\Pr\big(Y_1=Y_{(1)}\,\big|\,Y_{(1)},V_Y,\{N_{d_1}(t):0\le t\le Y_{(1)}\},\{N_{d_2}(t):0\le t\le Y_{(1)}\}\big)$$
$$=\begin{cases}\big[1+\lambda_{0,Y_2}(Y_{(1)})\big]^{-1}&\text{if }D_1,D_2>Y_{(1)}\\ \big[1+\lambda_{0,Y_2}(Y_{(1)})/\alpha_1(Y_{(1)},D_1)\big]^{-1}&\text{if }D_1<Y_{(1)}<D_2\\ \big[1+\lambda_{0,Y_2}(Y_{(1)})\,\alpha_2(Y_{(1)},D_2)\big]^{-1}&\text{if }D_1>Y_{(1)}>D_2\\ \big[1+\lambda_{0,Y_2}(Y_{(1)})\,\alpha_2(Y_{(1)},D_2)/\alpha_1(Y_{(1)},D_1)\big]^{-1}&\text{if }D_1,D_2<Y_{(1)}\end{cases}$$
$$=\Pr\big(Y_1=Y_{(1)}\,\big|\,Y_{(1)},\{N_{d_1}(t):0\le t\le Y_{(1)}\},\{N_{d_2}(t):0\le t\le Y_{(1)}\}\big),$$
recalling that we set λ0,Y1 = 1. The last equation is obvious as none of the expressions
depends on VY . This is extremely helpful because the last expression can be directly
estimated from the data by the observed proportions with Y1 = Y(1) in the sub-samples
defined by the conditioning set. Then, λ0,Y2 (·) can be obtained for all observed y(1) in the
first group, afterwards α1 (·) for all (y(1) , d1 ) observed in the second group, etc. Actually,
having four equations for only three functions, they might even be overidentified.18 This
gives the estimator; for further inference one may use for example wild bootstrap in
order to get variance estimates and confidence intervals.
In practice one is often interested in studying the hazard function of treatment(s) D.
At the same time, when thinking of models with common factors in the unobservable
18 We say ‘might be’ because this also depends on the availability of observations in each group and further
specification of the unknown functions.
parts (VY , VD ) one typically drops the time dependence of V . This leads us to the next
model:
Take the hazard from (8.36) but replace VY (t) by VY , and add
λ Dk (t; VD ) = λ0,Dk (t)VD , k = 1, 2. (8.37)
One assumes that $(V_D,V_Y)\in\mathbb{R}_+^2$ have finite expectations but are not ≡ 0 and come from a joint (often specified up to some parameter) distribution G; further $\Lambda_{0,D_k}(t)=\int_0^{t}\lambda_{0,D_k}(s)\,ds<\infty$ and $\Lambda_{0,Y_k}(t)=\int_0^{t}\lambda_{0,Y_k}(s)\,ds<\infty$ for all $t\in\mathbb{R}_+$. Suppressing the index k, the treatment effect $\alpha:\mathbb{R}_+^2\to(0,\infty)$ is such that $A(t,d)=\int_d^{t}\alpha(s,d)\,ds<\infty$ and $\bar A(t,d)=\int_d^{t}\lambda_{0,Y}(s)\alpha(s,d)\,ds<\infty$ exist and are continuous on $\{(t,d)\in\mathbb{R}_+^2:t>d\}$. Again, to have full identification and not just identification up to a multiplicative scale, you also need to normalise some of the functions; e.g. you may set $\Lambda_{0,D}(t_0)=\Lambda_{0,Y}(t_0)=1$ for a given $t_0$, instead of setting $\lambda_{0,Y_1}=1$.^{19} It is clear that adding (8.37) to (8.36) will simplify rather than complicate the identification of the treatment effect. The estimation strategy for (8.36) does not change. Depending on the estimation procedure one might or might not modify the conditional independence assumption as follows:
Assumption CIA-PH2 (conditional independence assumption for competing risks
models with multiple spells): (Y1 , D1 ) ⊥⊥ (Y2 , D2 )|V .
In the model considered so far we allowed the treatment effect to be a function of time that may vary with D_k. Alternatively, one could allow the treatment effect to depend on some unobserved heterogeneity but not on D_k (apart from the fact that treatment only occurs if the time spent in the initial status has exceeded D_k), for example of the form
$$\lambda_{Y_k}(t;D_k,V)=\begin{cases}\lambda_{0,Y_k}(t)\,V_Y&\text{if }t\le D_k\\ \lambda_{0,\alpha k}(t)\,V_\alpha&\text{if }t>D_k\end{cases}\qquad k=1,2,\tag{8.38}$$
with normalisation $\Lambda_{0,\alpha k}(t_0)=1$ for an a priori fixed $t_0\in(0,\infty)$, and $V=(V_Y,V_D,V_\alpha)$ has joint distribution G̃. Then the treatment effects α_k are obtained from the ratio $\{\lambda_{0,\alpha k}(t)V_\alpha\}/\{\lambda_{0,Y_k}(t)V_Y\}$.
It can be shown that under Assumption CIA-PH2 all functions in either model, (8.37)
or (8.38), can be identified in the sense that they can be expressed in terms of the
following four probabilities:
Pr (Y1 > y1 , Y2 > y2 , D1 > d1 , D2 > d2 , Y1 > D1 , Y2 > D2 )
Pr (Y1 > y1 , Y2 > y2 , D2 > d2 , Y1 < D1 , Y2 > D2 )
Pr (Y1 > y1 , Y2 > y2 , D1 > d1 , Y1 > D1 , Y2 < D2 )
Pr (Y1 > y1 , Y2 > y2 , Y1 < D1 , Y2 < D2 ) .
Note that these are equivalent to the following four expressions (in the same order)
Pr (Y1 > y1 , Y2 > y2 , D1 > d1 , D2 > d2 |Y1 > D1 , Y2 > D2 ) · Pr (Y1 > D1 , Y2 > D2 )
Pr (Y1 > y1 , Y2 > y2 , D2 > d2 |Y1 < D1 , Y2 > D2 ) · Pr (Y1 < D1 , Y2 > D2 )
19 So far we had to restrict the hazard over the entire timescale because heterogeneity V was allowed to vary
over time.
Pr (Y1 > y1 , Y2 > y2 , D1 > d1 |Y1 > D1 , Y2 < D2 ) · Pr (Y1 > D1 , Y2 < D2 )
Pr (Y1 > y1 , Y2 > y2 |Y1 < D1 , Y2 < D2 ) · Pr (Y1 < D1 , Y2 < D2 )
which can all be estimated from a sufficiently rich data set. In fact, we have simply
separated the sample into the four groups (Y1 > D1 , Y2 > D2 ), (Y1 < D1 , Y2 > D2 ),
(Y1 > D1 , Y2 < D2 ), (Y1 < D1 , Y2 < D2 ) whose proportions are the probabilities
in the second column. In the first column we have probabilities that also correspond
to directly observable proportions inside each corresponding group. We said ‘from a
sufficiently rich data set’ because it requires that for all values of possible combinations
of (y1 , y2 , d1 , d2 ) we are provided with sufficiently many observations (or you merge
them in appropriate intervals) to obtain reliable estimates of these probabilities (i.e.
proportions). To avoid this problem one would typically specify parametric functions for
the baseline hazards and distribution G, and G̃ respectively, in order to apply maximum
likelihood estimation.
Abbring and van den Berg (2003c) give some indications of how this could be estimated non-parametrically. Otherwise one simply takes parametrically specified hazard models and applies maximum likelihood estimation. The same can basically be said about the next approaches.
Quite often we do not have multiple spells for most of the individuals or strata but
mostly single spells. In order to simplify presentation, imagine we use only one spell
per individual (or strata) from now on. Then we need to observe and explore some of
the heterogeneity; say we observe characteristics X . Obviously, to include them in the
above models gives the original mixed proportional hazard model with observable (X )
and non-observable (V ) covariates. The potential outcomes would be durations Y x,v,d
and D^{x,v}, with Y = Y^{X,V,D}, Y^d = Y^{X,V,d}, and D = D^{X,V}. When using those characteristics as control variables one arrives at a kind of conditional independence assumption
that is much closer to what we originally called the CIA. Specifically,
Assumption CIA-PH3 (conditional independence assumption for competing risks with single spells): $Y^{x,v,d}\perp\!\!\!\perp D^{x,v}$, and the distribution of $(Y^{x,v,D^{x,v}},D^{x,v})$ is absolutely continuous on $\mathbb{R}_+^2$ for all (x, v) in supp(X, V).
On the one hand, this seems to be more general than all CIA versions we have seen so far, as it allows conditioning on the unobservables V. On the other hand, recall that for the
MPH one typically needs to assume independence between X and V ; something that
was not required for matching, propensity score weighting, etc. What also looks new is
the second part of the assumption. It is needed to allow for remaining variation, or say,
randomness for treatment and outcome: while (X, V ) include all joint determinants of
outcomes and assignment, like information that triggers relevant behaviour responses,
they fully determine neither Y nor D.
An easy way to technically specify the no-anticipation property we need in this context is to do it via the integrated hazard:
Assumption NA (no anticipation): For all $d_1,d_2\in[0,\infty]$ we have $\Lambda_{Y^{x,v,d_1}}(t)=\Lambda_{Y^{x,v,d_2}}(t)$ for all $t\le\min\{d_1,d_2\}$ and all (x, v) in supp(X, V).
Again we give two examples of MPH competing risks models that allow for the
identification of the treatment effect; one allowing the treatment effect to be a function
of time that may depend on (X, D), and one allowing it to depend on (X, V ) but not on
D. The first model has a standard mixed proportional hazard rate for D, but a two-case
one for Y :
with $V=(V_D,V_\alpha,V_Y)\in\mathbb{R}_+^3$, $E[V_DV_Y]<\infty$ but not V ≡ 0, with a joint (typically specified up to some parameter) distribution G̃ independent of X, and otherwise the same regularity conditions and normalisations as for model (8.39)–(8.40), here also applied to $g_\alpha$, $\lambda_{0,\alpha}$.
Clearly, the treatment effect is now
$$\alpha(t,x,V_Y,V_\alpha):=\frac{\lambda_{0,\alpha}(t)\,g_\alpha(x)\,V_\alpha}{\lambda_{0,Y}(t)\,g_Y(x)\,V_Y}$$
and, integrating over the unobservables,
$$\alpha(t,x)=\int_0^{\infty}\frac{\lambda_{0,\alpha}(t)\,g_\alpha(x)\,u}{\lambda_{0,Y}(t)\,g_Y(x)\,v}\,d\tilde G(u,v).\tag{8.43}$$
20 Technically one would say: $\{(g_Y(x),g_D(x));x\in\mathcal{X}\}$ contains a non-empty open two-dimensional set in $\mathbb{R}^2$.
Assumption SP2 The image of the systematic part $g_\alpha$, $\{g_\alpha(x);x\in\mathcal{X}\}$, contains a non-empty open interval in $\mathbb{R}$.
It can be shown that Assumptions CIA-PH3, NA, SP1 and SP2 together guarantee the identification of $\Lambda_{0,D}$, $\Lambda_{0,Y}$, $\Lambda_{0,\alpha}$, $g_D$, $g_Y$, $g_\alpha$, G̃ from the probabilities Pr(Y > y, D > d, Y > D) and Pr(Y > y, Y < D), and therefore of the treatment effect (8.43).
While it is true that these are non-parametric identification results – some would speak
of semi-parametric ones because we imposed clear separability structures – most of the
estimators available in practice are based on parametric specifications of these MPH
models. Consequently, once you have specified the hazard functions and G, respectively
G̃, you can also write down the explicit likelihood function. Then a (fully) parametric
maximum likelihood estimator can be applied with all the standard tools for further
inference. The main problems here are typical duration-data problems like censoring, truncation, etc. These, however, are not specific to the treatment effect estimation literature but common to any duration analysis, and are therefore not treated further in this book. For
the simpler problems we already indicated how to treat censoring and truncation, recall
Section 8.3.1.
Example 8.19 Abbring, van den Berg and van Ours (2005) study the impact of unem-
ployment insurance sanctions on the duration to find a job. In the theoretical part of the
article they construct the Bellman equations for the expected present values of income
before and after the imposition of a sanction as a result of the corresponding optimal job
search intensities s1 (before), s2 (when sanction is imposed). Under a set of assumptions
on functional forms and agent’s rational behaviour they arrive at hazard rates
for given reservation wages w_1 and w_2. In the empirical study they specify these hazards for a large set of observable covariates x with an exponential systematic part, so that the difference between the hazards before and after treatment reduces to just a constant treatment effect for all individuals, treatments and durations. The model is completed by the hazard function for treatment, specified as
$$\lambda_D=\lambda_{0,D}(t)\exp\{x'\beta_D\}V_D.$$
For the baseline hazards λ_{0,Y}(t), λ_{0,D}(t) they take piecewise-constant specifications with prefixed time intervals, and for G, the joint distribution of V_Y and V_D, a bivariate discrete distribution with four unrestricted mass points. When they estimated their model (no matter whether for the entire sample or separately by sector), they found a significantly positive α throughout, i.e. in all cases the imposition of sanctions significantly increased the re-employment rate.
The research on non- and semi-parametric estimation is still in progress. But even the parametric models are so far not much used in empirical economics, as indicated at the beginning. Actually, most of the empirical studies with a competing risks structure can be found in the biometrics and technometrics literature.
Lechner (2006) shows that neither of these two definitions of non-causality implies the other. However, if the W-CIA holds (including the common support assumption), then each of these two definitions of non-causality implies the other. Hence, if we can assume the W-CIA, both definitions can be used to test for non-causality, and they can be interpreted in whichever perspective seems more intuitive.
Turning to Section 8.3, nowadays the literature on duration analysis is quite abundant. A general, excellent introduction to the analysis of failure time data is given in Kalbfleisch and Prentice (2002) and Crowder (1978), of which the latter puts the main emphasis on multivariate survival analysis and competing risks, i.e. what has been considered here. A detailed overview of duration analysis in economics was given in Lancaster (1990). A more recent review of duration analysis in econometrics can be found in van den Berg (2001).
Competing risks models have been in the focus of biometrical research for many decades, see for example David and Moeschberger (1978). You can find a recent, smooth introduction in Beyersmann and Scheike (2013). Mixture models and Cox regression were applied to competing risks models very early on, see e.g. Larson and Dinse (1985) or Lunn and McNeil (1995). However, as stated, there is still a lot of research going on. A recent contribution to the inclusion of time-varying covariates is
research going on. A recent contribution to the inclusion of time-varying covariates is
e.g. Cortese and Andersen (2010), a recent contribution to semi-parametric estimation
is e.g. Hernandez-Quintero, Dupuy and Escarela (2011). Abbring and van den Berg
(2003a) study the extension of the hazard function-based duration analysis to the use
of instrumental variables. Although they argue that they generally doubt the existence
of instruments fulfilling the necessary exogeneity conditions, they provide methods in
case an intention to treat is randomised but compliance is incomplete. The identification
presented in Abbring and van den Berg (2003b) was essentially based on the results for
competing risks models introduced in Abbring and van den Berg (2003c) and Heckman
and Honoré (1989).
In this chapter we divided the dynamic treatment effect estimation almost strictly into two sections: one section considered discrete time, in which several treatments, different treatment durations and their timing were analysed regarding their impact on any kind of outcome; the other section considered only durations, namely the impact of the duration until treatment takes place on the duration until leaving an initial state. In the first section (Section 8.2) we presented matching and propensity score estimators extended to the dynamic case and multiple treatments; in the second section (Section 8.3) we worked only with tools known from duration analysis. The estimator proposed by Fredriksson and Johansson (2008) uses elements of both approaches. They consider a discrete-time framework but are interested in the impact of the timing of treatment on duration, i.e. the estimation problem considered in Section 8.3. However, for this they use (generalised) matching estimators.
further assumptions see also its help file and description. Similarly, a rapidly increasing number of packages and commands is available in R; see e.g. the survival, OIsurv and KMsurv packages.
As we have seen, also in duration analysis the causal inference basically resorts to
existing methods, in this case developed for competing risks and multi-state models.
We therefore refer mainly to the paper of de Wreede, Fiocco and Putter (2011) and
the book of Beyersmann, Allignol and Schumacher (2012), and the tutorial of the R
package mstate by Putter, Fiocco and Geskus (2006) and Putter (2014); all publications explicitly dedicated to the estimation of those types of models with the statistics software R.
It should be mentioned, however, that this is presently a very dynamic research area
on which every year appear several new estimation methods, programme codes and
packages, so that it is hardly possible to give a comprehensive review at this stage.
8.5 Exercises
1. Give examples in practice where we cannot estimate the treatment effects by any
method of the previous chapters.
2. Give the explicit formula of (8.2) for τ = 2, δ = 3 in the binary treatment case and
discuss examples.
3. Give examples of identification problems with the SCIA in Subsection 8.2.1 (i.e.
potential violations or when and why it could hold).
4. Show that the WDCIA is not sufficient to predict $E[Y_T^{00}\mid\bar D_2=11]$.^{21} What follows for the DATET? Give additional assumptions that would allow it to be identified.
5. Show that (8.16) and (8.17) follow from the WDCIA.
6. Show for the discrete case that the probability Pr(T = t) and the cumulative
distribution function F(t) can be expressed in terms of the hazard rate λ(t).
7. Recall Example 8.13 and show that for the given hazard rate, log(t) has a logistic
distribution. Calculate also the mean.
8. Discuss how you could estimate non-parametrically the probabilities given below
model (8.38).
9. Several of the estimation procedures proposed or indicated here were based on
sequential (or multi-step) estimation. Discuss how to apply resampling methods in
order to estimate the final (over all steps) variance of the treatment effect estimator.
21 Hint: Show that it cannot be written in terms of $E[Y_T^{11}\mid\cdot\,,\bar D_2=11]$, which would correspond to the observable outcome $E[Y_T\mid\cdot\,,\bar D_2=11]$.
Bibliography
Angrist, J. (1998): ‘Estimating Labour Market Impact of Voluntary Military Service using Social
Security Data’, Econometrica, 66, 249–288.
Angrist, J., G. Imbens and D. Rubin (1996): ‘Identification of Causal Effects using Instrumental
Variables’, Journal of American Statistical Association, 91, 444–472 (with discussion).
Angrist, J. and A. Krueger (1991): ‘Does Compulsory School Attendance Affect Schooling and
Earnings?’, Quarterly Journal of Economics, 106, 979–1014.
(1999): ‘Empirical Strategies in Labor Economics’, in Handbook of Labor Economics, ed. by
O. Ashenfelter and D. Card, pp. 1277–1366. Amsterdam: North-Holland.
Angrist, J. and V. Lavy (1999): ‘Using Maimonides Rule to Estimate the Effect of Class Size on
Scholastic Achievement’, Quarterly Journal of Economics, 114, 533–575.
Angrist, J. and J.-S. Pischke (2008): Mostly Harmless Econometrics: An Empiricist's Companion. Princeton, NJ: Princeton University Press.
Arias, O. and M. Khamis (2008): ‘Comparative Advantage, Segmentation and Informal Earnings:
A Marginal Treatment Effects Approach’, IZA discussion paper, 3916.
Arpino, B. and A. Aassve (2013): ‘Estimation of Causal Effects of Fertility on Economic
Wellbeing: Evidence from Rural Vietnam’, Empirical Economics, 44, 355–385.
Ashenfelter, O. (1978): ‘Estimating the Effect of Training Programmes on Earnings’, Review of
Economics and Statistics, 6, 47–57.
Athey, S. and G. Imbens (2006): ‘Identification and Inference in Nonlinear Difference-in-
Differences Models’, Econometrica, 74, 431–497.
Bahadur, R. (1966): ‘A Note on Quantiles in Large Samples’, Annals of Mathematical Statistics,
37, 577–580.
Bailey, R. (2008): Design of Comparative Experiments. Cambridge: Cambridge University Press.
Baron, R. and D. Kenny (1986): ‘The Moderator-Mediator Variable Distinction in Social Psy-
chological Research: Conceptual, Strategic, and Statistical Considerations’, Journal of
Personality and Social Psychology, 6, 1173–1182.
Barrett, G. and S. Donald (2009): ‘Statistical Inference with Generalized Gini Indices of
Inequality and Poverty’, Journal of Business & Economic Statistics, 27, 1–17.
Barrios, T. (2013): ‘Optimal Stratification in Randomized Experiments’, discussion paper,
Harvard University OpenScholar.
Becker, S. and A. Ichino (2002): ‘Estimation of Average Treatment Effects Based on Propensity
Scores’, The Stata Journal, 2, 358–377.
Beegle, K., R. Dehejia and R. Gatti (2006): ‘Child Labor and Agricultural Shocks’, Journal of
Development Economics, 81, 80–96.
Begun, J., W. Hall, W. Huang and J. Wellner (1983): ‘Information and Asymptotic Efficiency in
Parametric-Nonparametric Models’, Annals of Statistics, 11, 432–452.
Belloni, A., V. Chernozhukov, I. Fernández-Val and C. Hansen (2017): ‘Program Evaluation and
Causal Inference with High-Dimensional Data’, Econometrica, 85, 233–298.
Belloni, A., V. Chernozhukov and C. Hansen (2014): ‘Inference on Treatment Effects after
Selection among High-Dimensional Controls’, Review of Economic Studies, 81, 608–650.
Benini, B., S. Sperlich and R. Theler (2016): ‘Varying Coefficient Models Revisited: An Econo-
metric View’, in Proceedings of the Second Conference of the International Society for
Nonparametric Statistics. New York, NY: Springer.
Benini, G. and S. Sperlich (2017): ‘Modeling Heterogeneity by Structural Varying Coefficients
Models’, Working paper.
Bertrand, M., E. Duflo and S. Mullainathan (2004): ‘How Much Should We Trust Differences-in-
Differences Estimates?’, Quarterly Journal of Economics, 119, 249–275.
Beyersmann, J., A. Allignol and M. Schumacher (2012): Competing Risks and Multistate Models
with R. New York, NY: Springer.
Beyersmann, J. and T. Scheike (2013): ‘Classical Regression Models for Competing Risks’, in
Handbook of Survival Analysis, pp. 157–177. CRC Press Taylor & Francis Group.
Bhatt, R. and C. Koedel (2010): ‘A Non-Experimental Evaluation of Curricular Effectiveness in
Math’, mimeo.
Bickel, P., C. Klaassen, Y. Ritov and J. Wellner (1993): Efficient and Adaptive Estimation for
Semiparametric Models. Baltimore, MD: John Hopkins University Press.
Black, D., J. Galdo and J. Smith (2005): ‘Evaluating the Regression Discontinuity Design using
Experimental Data’, mimeo, Ann Arbor, MI: University of Michigan.
Black, D. and J. Smith (2004): ‘How Robust is the Evidence on the Effects of College Quality?
Evidence from Matching’, Journal of Econometrics, 121, 99–124.
Black, S. (1999): ‘Do “Better” Schools Matter? Parental Valuation of Elementary Education’,
Quarterly Journal of Economics, 114, 577–599.
Blundell, R. and M. C. Dias (2009): ‘Alternative Approaches to Evaluation in Empirical
Microeconomics’, Journal of Human Resources, 44, 565–640.
Blundell, R. and J. Powell (2003): ‘Endogeneity in Nonparametric and Semiparametric Regres-
sion Models’, in Advances in Economics and Econometrics, ed. by L. H. M. Dewatripont
and S. Turnovsky, pp. 312–357. Cambridge: Cambridge University Press.
Bonhomme, S. and U. Sauder (2011): ‘Recovering Distributions in Difference-in-Differences
Models: A Comparison of Selective and Comprehensive Schooling’, The Review of Eco-
nomics and Statistics, 93, 479–494.
Brookhart, M., S. Schneeweiss, K. Rothman, R. Glynn, J. Avorn and T. Stürmer (2006): ‘Variable
Selection for Propensity Score Models’, American Journal of Epidemiology, 163, 1149–
1156.
Brügger, B., R. Lalive and J. Zweimüller (2008): ‘Does Culture Affect Unemployment? Evidence
from the Barriere des Roestis’, mimeo, Zürich: University of Zürich.
Bruhn, M. and D. McKenzie (2009): ‘In Pursuit of Balance: Randomization in Practice in
Development Field Experiments’, Policy Research Paper 4752, World Bank.
Buddelmeyer, H. and E. Skoufias (2003): ‘An evaluation of the Performance of Regression
Discontinuity Design on PROGRESA’, IZA discussion paper, 827.
Busso, M., J. DiNardo and J. McCrary (2009): ‘Finite Sample Properties of Semiparametric Esti-
mators of Average Treatment Effects’, Unpublished manuscript, University of Michigan and
University of Californa-Berkeley.
(2014): ‘New Evidence on the Finite Sample Properties of Propensity Score Matching and
Reweighting Estimators’, Review of Economics and Statistics, 58, 347–368.
Cox, D. (1972): ‘Regression Models and Life-Tables’, Journal of the Royal Statistical Society (B),
34, 187–220.
Cameron, C. and P. Trivedi (2005): Microeconometrics: Methods and Applications. Cambridge:
Cambridge University Press.
Card, D., J. Kluve and A. Weber (2010): ‘Active Labour Market Policy Evaluations: A Meta-
Analysis’, Economic Journal, 120(548), F452–F477.
Card, D. and A. Krueger (1994): ‘Minimum Wages and Employment: A Case Study of the
Fast-Food Industry in New Jersey and Pennsylvania’, American Economic Review, 84,
772–793.
Card, D., D. Lee, Z. Pei and A. Weber (2015): ‘Inference on Causal Effects in a Generalized
Regression Kink Design’, IZA discussion paper No 8757.
Carpenter, J., H. Goldstein and J. Rasbash (2003): 'A Novel Bootstrap Procedure for Assessing the Relationship between Class Size and Achievement', Applied Statistics, 52, 431–443.
Carroll, R., D. Ruppert and A. Welsh (1998): ‘Local Estimating Equations’, Journal of American
Statistical Association, 93, 214–227.
Cattaneo, M. (2010): ‘Efficient Semiparametric Estimation of Multi-Valued Treatment Effects
under Ignorability’, Journal of Econometrics, 155, 138–154.
Cattaneo, M., D. Drucker and A. Holland (2013): ‘Estimation of Multivalued Treatment Effects
under Conditional Independence’, The Stata Journal, 13, 407–450.
Cerulli, G. (2012): ‘treatrew: A User-Written STATA Routine for Estimating Average Treatment
Effects by Reweighting on Propensity Score’, discussion paper, National Research Council
of Italy, Institute for Economic Research on Firms and Growth.
(2014): ‘ivtreatreg: A Command for Fitting Binary Treatment Models with Heterogeneous
Response to Treatment and Unobservable Selection’, The Stata Journal, 14, 453–480.
Chamberlain, G. (1994): ‘Quantile Regression, Censoring and the Structure of Wages’, in
Advances in Econometrics, ed. by C. Sims. Amsterdam: Elsevier.
Chan, K., S. Yam and Z. Zhang (2016): ‘Globally Efficient Nonparametric Inference of Aver-
age Treatment Effects by Empirical Balancing Calibration Weighting’, Journal of the Royal
Statistical Society (B), 78, 673–700.
Chaudhuri, P. (1991): ‘Global Nonparametric Estimation of Conditional Quantile Functions and
their Derivatives’, Journal of Multivariate Analysis, 39, 246–269.
Chay, K., P. McEwan and M. Urquiola (2005): ‘The Central Role of Noise in Evaluating
Interventions that Use Test Scores to Rank Schools’, American Economic Review, pp.
1237–1258.
Chen, X., O. Linton and I. van Keilegom (2003): ‘Estimation of Semiparametric Models when the
Criterion Function is Not Smooth’, Econometrica, 71, 1591–1608.
Chernozhukov, V., I. Fernandez-Val and A. Galichon (2007): ‘Quantile and Probability Curves
Without Crossing’, MIT working paper.
Chernozhukov, V., I. Fernández-Val and B. Melly (2013): ‘Inference on Counterfactual Distribu-
tions’, Econometrica, 81, 2205–2268.
Chernozhukov, V. and C. Hansen (2005): ‘An IV Model of Quantile Treatment Effects’,
Econometrica, 73, 245–261.
(2006): ‘Instrumental Quantile Regression Inference for Structural and Treatment Effect
models’, Journal of Econometrics, 132, 491–525.
Chernozhukov, V., G. Imbens and W. Newey (2007): ‘Instrumental Variable Estimation of
Nonseparable Models’, Journal of Econometrics, 139, 4–14.
Chesher, A. (2003): ‘Identification in Nonseparable Models’, Econometrica, 71, 1405–1441.
(2005): ‘Nonparametric Identification under Discrete Variation’, Econometrica, 73, 1525–
1550.
(2007): ‘Identification of Nonadditive Structural Functions’, in Advances in Economics and
Econometrics, ed. by R. Blundell, W. Newey and T. Persson, pp. 1–16. Cambridge:
Cambridge University Press.
(2010): ‘Instrumental Variable Models for Discrete Outcomes’, Econometrica, 78, 575–601.
Claeskens, G., T. Krivobokova and J. Opsomer (2009): 'Asymptotic Properties of Penalized Spline Estimators', Biometrika, 96, 529–544.
Cleveland, W., E. Grosse and W. Shyu (1991): ‘Local Regression Models’, in Statistical Models
in S, ed. by J. Chambers and T. Hastie, pp. 309–376. Pacific Grove: Wadsworth & Brooks.
Collier, P. and A. Höffler (2002): ‘On the Incidence of Civil War in Africa’, Journal of Conflict
Resolution, 46, 13–28.
Cortese, G. and P. Andersen (2010): ‘Competing Risks and Time-Dependent Covariates’,
Biometrical Journal, 52, 138–158.
Cox, D. (1958): Planning of Experiments. New York: Wiley.
Croissant, Y. and G. Millo (2008): ‘Panel Data Econometrics in R: The plm Package’, Journal of
Statistical Software, 27(2).
Crowder, M. (1978): Multivariate Survival Analysis and Competing Risks. CRC Press Taylor &
Francis Group.
Crump, R., J. Hotz, G. Imbens and O. Mitnik (2009): ‘Dealing with Limited Overlap in Estimation
of Average Treatment Effects’, Biometrika, 96, 187–199.
Curie, I. and M. Durban (2002): ‘Flexible Smoothing with P-Splines: A Unified Approach’,
Statistical Science, 2, 333–349.
Dai, J., S. Sperlich and W. Zucchini (2016): ‘A Simple Method for Predicting Distributions by
Means of Covariates with Examples from Welfare and Health Economics’, Swiss Journal of
Economics and Statistics, 152, 49–80.
Darolles, S., Y. Fan, J. Florens and E. Renault (2011): ‘Nonparametric Instrumental Regression’,
Econometrica, 79:5, 1541–1565.
Daubechies, I. (1992): Ten Lectures on Wavelets. Philadelphia, PA: SIAM.
David, H. and M. Moeschberger (1978): The Theory of Competing Risks, Griffins Statistical
Monograph No. 39. New York, NY: Macmillan.
de Wreede, L., M. Fiocco and H. Putter (2011): ‘mstate: An R Package for the Analy-
sis of Competing Risks and Multi-State Models’, Journal of Statistical Software, 38,
1–30.
Dette, H., A. Munk and T. Wagner (1998): ‘Estimating the Variance in Nonparametric Regression
– What is a Reasonable Choice?’, Journal of the Royal Statistical Society, B, 60, 751–764.
Dette, H., N. Neumeyer and K. Pilz (2006): ‘A Simple Nonparametric Estimator of a Strictly
Monotone Regression Function’, Bernoulli, 12, 469–490.
Dette, H. and K. Pilz (2006): ‘A Comparative Study of Monotone Nonparametric Kernel
Estimates’, Journal of Statistical Computation and Simulation, 76, 41–56.
Donald, S. and K. Lang (2007): ‘Inference with Difference-in-Differences and Other Panel Data’,
Review of Economics and Statistics, 89, 221–233.
Duflo, E. (2001): ‘Schooling and Labor Market Consequences of School Construction in Indone-
sia: Evidence from an Unusual Policy Experiment’, American Economic Review, 91,
795–813.
Duflo, E., P. Dupas and M. Kremer (2015): ‘Education, HIV, and Early Fertility: Experimental
Evidence from Kenya’, The American Economic Review, 105, 2757–2797.
Duflo, E., R. Glennerster and M. Kremer (2008): ‘Using Randomization in Development Eco-
nomics Research: A Toolkit’, in Handbook of Development Economics, ed. by T. Schultz
and J. Strauss, pp. 3895–3962. Amsterdam: North-Holland.
Edin, P.-A., P. Fredriksson and O. Aslund (2003): ‘Ethnic Enclaves and the Economic Success of
Immigrants – Evidence from a Natural Experiment’, The Quarterly Journal of Economics,
118, 329–357.
Eilers, P. and B. Marx (1996): ‘Flexible Smoothing with B-Splines and Penalties’, Statistical
Science, 11, 89–121.
Engel, E. (1857): ‘Die Produktions- und Konsumtionsverhältnisse des Königreichs Sachsen’,
Zeitschrift des statistischen Büros des Königlich Sächsischen Ministeriums des Inneren, 8,
1–54.
Fan, J. (1993): ‘Local Linear Regression Smoothers and their Minimax Efficiency’, Annals of
Statistics, 21, 196–216.
Fan, J. and I. Gijbels (1996): Local Polynomial Modeling and its Applications. London: Chapman
and Hall.
Field, C. and A. Welsh (2007): ‘Bootstrapping Clustered Data’, Journal of the Royal Statistical
Society (B), 69, 366–390.
Fine, J. and R. Gray (1999): ‘A Proportional Hazards Model for the Subdistribution of a
Competing Risk’, Journal of the American Statistical Association, 94:446, 496–509.
Firpo, S. (2007): ‘Efficient Semiparametric Estimation of Quantile Treatment Effects’, Economet-
rica, 75, 259–276.
Firpo, S., N. Fortin and T. Lemieux (2009): ‘Unconditional Quantile Regressions’, Econometrica,
77, 935–973.
Florens, J. (2003): ‘Inverse Problems and Structural Econometrics: The Example of Instrumental
Variables’, in Advances in Economics and Econometrics, ed. by M. Dewatripont, L. Hansen
and S. Turnovsky, pp. 284–311. Cambridge: Cambridge University Press.
Florens, J., J. Heckman, C. Meghir and E. Vytlacil (2008): ‘Identification of Treatment Effects
Using Control Functions in Models With Continuous, Endogenous Treatment and Heteroge-
neous Effects’, Econometrica, 76:5, 1191–1206.
Frandsen, B., M. Frölich and B. Melly (2012): ‘Quantile Treatment Effects in the Regression
Discontinuity Design’, Journal of Econometrics, 168, 382–395.
Frangakis, C. and D. Rubin (1999): ‘Addressing Complications of Intention-to-Treat Analysis in
the Combined Presence of All-or-None Treatment-Noncompliance and Subsequent Missing
Outcomes’, Biometrika, 86, 365–379.
(2002): ‘Principal Stratification in Causal Inference’, Biometrics, 58, 21–29.
Fredriksson, P. and P. Johansson (2008): ‘Dynamic Treatment Assignment: The Consequences
for Evaluations using Observational Data’, Journal of Business and Economic Statistics,
26, 435–445.
Fredriksson, P. and B. Öckert (2006): ‘Is Early Learning Really More Productive? The Effect
of School Starting Age on School and Labor Market Performance’, IFAU Discussion Paper
2006:12.
Frölich, M. (2004): ‘Finite Sample Properties of Propensity-Score Matching and Weighting
Estimators’, Review of Economics and Statistics, 86, 77–90.
(2005): ‘Matching Estimators and Optimal Bandwidth Choice’, Statistics and Computing, 15/3,
197–215.
(2007a): ‘Nonparametric IV Estimation of Local Average Treatment Effects with Covariates’,
Journal of Econometrics, 139, 35–75.
(2007b): ‘Propensity Score Matching without Conditional Independence Assumption –
with an Application to the Gender Wage Gap in the UK’, Econometrics Journal, 10,
359–407.
(2008): ‘Statistical Treatment Choice: An Application to Active Labour Market Programmes’,
Journal of the American Statistical Association, 103, 547–558.
Frölich, M. and M. Lechner (2010): ‘Exploiting Regional Treatment Intensity for the Evaluation
of Labour Market Policies’, Journal of the American Statistical Association, 105, 1014–
1029.
Frölich, M. and B. Melly (2008): ‘Quantile Treatment Effects in the Regression Discontinuity
Design’, IZA Discussion Paper, 3638.
(2010): ‘Estimation of Quantile Treatment Effects with STATA’, Stata Journal, 10, 423–457.
(2013): ‘Unconditional Quantile Treatment Effects under Endogeneity’, Journal of Business &
Economic Statistics, 31, 346–357.
Gautier, E. and S. Hoderlein (2014): ‘A Triangular Treatment Effect Model with Random
Coefficients in the Selection Equation’, Working Paper at Boston College, Department of
Economics.
Gerfin, M. and M. Lechner (2002): ‘Microeconometric Evaluation of the Active Labour Market
Policy in Switzerland’, Economic Journal, 112, 854–893.
Gerfin, M., M. Lechner and H. Steiger (2005): ‘Does Subsidised Temporary Employment Get the
Unemployed Back to Work? An Econometric Analysis of Two Different Schemes’, Labour
Economics, 12, 807–835.
Gill, R. (1989): ‘Non- and Semi-Parametric Maximum Likelihood Estimators and the von Mises
Method (Part 1)’, Scandinavian Journal of Statistics, 16, 97–128.
Gill, R. and J. Robins (2001): ‘Causal Inference for Complex Longitudinal Data: The Continuous Case’, Annals of Statistics, 29, 1785–1811.
Glennerster, R. and K. Takavarasha (2013): Running Randomized Evaluations: A Practical Guide.
Princeton, NJ: Princeton University Press.
Glewwe, P., M. Kremer, S. Moulin and E. Zitzewitz (2004): ‘Retrospective vs. Prospective Analy-
ses of School Inputs: The Case of Flip Charts in Kenya’, Journal of Development Economics,
74, 251–268.
Glynn, A. and K. Quinn (2010): ‘An Introduction to the Augmented Inverse Propensity Weighted
Estimator’, Political Analysis, 18, 36–56.
Gonzalez-Manteiga, W. and R. Crujeiras (2013): ‘An Updated Review of Goodness-of-Fit Tests
for Regression Models’, Test, 22, 361–411.
Gosling, A., S. Machin and C. Meghir (2000): ‘The Changing Distribution of Male Wages in the
U.K.’, Review of Economic Studies, 67, 635–666.
Gozalo, P. and O. Linton (2000): ‘Local Nonlinear Least Squares: Using Parametric Information
in Nonparametric Regression’, Journal of Econometrics, 99, 63–106.
Graham, B., C. Pinto and D. Egel (2011): ‘Efficient Estimation of Data Combination Mod-
els by the Method of Auxiliary-to-Study Tilting (AST)’, NBER Working Papers No.
16928.
(2012): ‘Inverse Probability Tilting for Moment Condition Models with Missing Data’, Review
of Economic Studies, 79, 1053–1079.
Greene, W. (1997): Econometric Analysis, 3rd edn. New Jersey: Prentice Hall.
Greevy, R., B. Lu, J. Silver and P. Rosenbaum (2004): ‘Optimal Multivariate Matching Before
Randomization’, Biostatistics, 5, 263–275.
Gruber, S. and M. van der Laan (2012): ‘tmle: An R Package for Targeted Maximum Likelihood
Estimation’, Journal of Statistical Software, 51(13).
Hahn, J. (1998): ‘On the Role of the Propensity Score in Efficient Semiparametric Estimation of
Average Treatment Effects’, Econometrica, 66(2), 315–331.
Hahn, J. and G. Ridder (2013): ‘Asymptotic Variance of Semiparametric Estimators with
Generated Regressors’, Econometrica, 81(1), 315–340.
Hahn, J., P. Todd and W. van der Klaauw (1999): ‘Evaluating the Effect of an Antidiscrimination
Law Using a Regression-Discontinuity Design’, NBER working paper, 7131.
Hall, P., R. Wolff and Q. Yao (1999): ‘Methods for Estimating a Conditional Distribution
Function’, Journal of American Statistical Association, 94(445), 154–163.
Ham, J. and R. LaLonde (1996): ‘The Effect of Sample Selection and Initial Conditions
in Duration Models: Evidence from Experimental Data on Training’, Econometrica, 64,
175–205.
Ham, J., X. Li and P. Reagan (2011): ‘Matching and Nonparametric IV Estimation, A Distance-
Based Measure of Migration, and the Wages of Young Men’, Journal of Econometrics, 161,
208–227.
Hansen, C. (2007a): ‘Asymptotic Properties of a Robust Variance Matrix Estimator for Panel Data
when T is Large’, Journal of Econometrics, 141, 597–620.
(2007b): ‘Generalized Least Squares Inference in Panel and Multilevel Models with Serial
Correlation and Fixed Effects’, Journal of Econometrics, 140, 670–694.
Härdle, W., P. Hall and H. Ichimura (1993): ‘Optimal Smoothing in Single-Index Models’, Annals
of Statistics, 21, 157–193.
Härdle, W. and S. Marron (1987): ‘Optimal Bandwidth Selection in Nonparametric Regression
Function Estimation’, Annals of Statistics, 13, 1465–1481.
Härdle, W., M. Müller, S. Sperlich and A. Werwatz (2004): Nonparametric and Semiparametric
Models. Heidelberg: Springer Verlag.
Härdle, W. and T. Stoker (1989): ‘Investigating Smooth Multiple Regression by the Method of
Average Derivatives’, Journal of American Statistical Association, 84, 986–995.
Hastie, T. and R. Tibshirani (1990): Generalized Additive Models. London: Chapman and Hall.
Small, D., T. Ten Have and P. Rosenbaum (2008): ‘Randomization Inference in a Group-Randomized
Trial of Treatments for Depression: Covariate Adjustment, Noncompliance, and Quantile
Effects’, Journal of the American Statistical Association, 103, 271–279.
Haviland, A. and D. Nagin (2005): ‘Causal Inferences with Group Based Trajectory Models’,
Psychometrika, 70, 557–578.
Hayes, A. (2009): ‘Beyond Baron and Kenny: Statistical Mediation Analysis in the New
Millennium’, Communication Monographs, 76, 408–420.
Heckman, J. (2001): ‘Micro Data, Heterogeneity, and the Evaluation of Public Policy: Nobel
Lecture’, Journal of Political Economy, 109, 673–748.
(2008): ‘Econometric Causality’, International Statistical Review, 76, 1–27.
Heckman, J. and B. Honoré (1989): ‘The Identifiability of the Competing Risks Model’,
Biometrika, 76, 325–330.
Heckman, J., H. Ichimura and P. Todd (1998): ‘Matching as an Econometric Evaluation
Estimator’, Review of Economic Studies, 65, 261–294.
Heckman, J., R. LaLonde and J. Smith (1999): ‘The Economics and Econometrics of Active
Labour Market Programs’, in Handbook of Labor Economics, ed. by O. Ashenfelter and
D. Card, pp. 1865–2097. Amsterdam: North-Holland.
Heckman, J. and B. Singer (1984): ‘A Method for Minimizing the Impact of Distributional
Assumptions in Econometric Models for Duration Data’, Econometrica, 52, 277–320.
Heckman, J. and J. Smith (1995): ‘Assessing the Case for Social Experiments’, Journal of
Economic Perspectives, 9, 85–110.
Heckman, J. and E. Vytlacil (1999): ‘Local Instrumental Variables and Latent Variable Models
for Identifying and Bounding Treatment Effects’, Proceedings National Academic Sciences
USA, Economic Sciences, 96, 4730–4734.
(2007a): ‘Econometric Evaluation of Social Programs Part I: Causal Models, Structural Models
and Econometric Policy Evaluation’, in Handbook of Econometrics, ed. by J. Heckman and
E. Leamer, pp. 4779–4874. Amsterdam and Oxford: North-Holland.
(2007b): ‘Econometric Evaluation of Social Programs Part II: Using the Marginal Treatment
Effect to Organize Alternative Econometric Estimators to Evaluate Social Programs, and
to Forecast their Effects in New Environments’, in Handbook of Econometrics, ed. by
J. Heckman and E. Leamer, pp. 4875–5143. Amsterdam and Oxford: North-Holland.
Heckman, N. (1986): ‘Spline Smoothing in a Partly Linear Model’, Journal of the Royal Statistical
Society, B, 48, 244–248.
Henderson, D., D. Millimet, C. Parmeter and L. Wang (2008): ‘Fertility and the Health of
Children: A Nonparametric Investigation’, Advances in Econometrics, 21, 167–195.
Henderson, D. and C. Parmeter (2015): Applied Nonparametric Econometrics. Cambridge:
Cambridge University Press.
Hernán, M., B. Brumback and J. Robins (2001): ‘Marginal Structural Models to Estimate the
Joint Causal Effect of Nonrandomized Treatments’, Journal of American Statistical Association,
96, 440–448.
Hernandez-Quintero, A., J. Dupuy and G. Escarela (2011): ‘Analysis of a Semiparametric Mixture
Model for Competing Risks’, Annals of the Institute of Statistical Mathematics, 63, 305–329.
Hirano, K., G. Imbens and G. Ridder (2003): ‘Efficient Estimation of Average Treatment Effects
Using the Estimated Propensity Score’, Econometrica, 71, 1161–1189.
Hoderlein, S. and E. Mammen (2007): ‘Identification of Marginal Effects in Nonseparable Models
without Monotonicity’, Econometrica, 75, 1513–1518.
(2010): ‘Analyzing the Random Coefficient Model Nonparametrically’, Econometric Theory,
26, 804–837.
Hoderlein, S. and Y. Sasaki (2014): ‘Outcome Conditioned Treatment Effects’, working paper at
Johns Hopkins University.
Holland, P. (1986): ‘Statistics and Causal Inference’, Journal of American Statistical Association,
81, 945–970.
Hong, H. and D. Nekipelov (2012): ‘Efficient Local IV Estimation of an Empirical Auction
Model’, Journal of Econometrics, 168, 60–69.
Horowitz, J. and S. Lee (2007): ‘Nonparametric Instrumental Variables Estimation of a Quantile
Regression Model’, Econometrica, 75, 1191–1208.
Huber, M., M. Lechner and A. Steinmayr (2013): ‘Radius Matching on the Propensity Score with
Bias Adjustment: Tuning Parameters and Finite Sample Behaviour’, Discussion paper at the
University of St Gallen.
Huber, M., M. Lechner and C. Wunsch (2013): ‘The Performance of Estimators Based on the
Propensity Score’, Journal of Econometrics, 175, 1–21.
Ichimura, H. (1993): ‘Semiparametric Least Squares (SLS) and Weighted SLS Estimation of
Single-Index Models’, Journal of Econometrics, 58, 71–120.
Imai, K. (2005): ‘Do Get-Out-of-the-Vote Calls Reduce Turnout?’, American Political Science
Review, 99, 283–300.
Imai, K., L. Keele and T. Yamamoto (2010): ‘Identification, Inference and Sensitivity Analysis
for Causal Mediation Effects’, Statistical Science, 25, 51–71.
Imai, K. and I. Kim (2015): ‘On the Use of Linear Fixed Effects Regression Estimators for Causal
Inference’, working paper at Princeton.
Imai, K., G. King and C. Nall (2009): ‘The Essential Role of Pair Matching in Cluster-
Randomized Experiments, with Application to the Mexican Universal Health Insurance
Evaluation’, Statistical Science, 24, 29–53.
Imai, K., G. King and E. Stuart (2008): ‘Misunderstandings between Experimentalists and Obser-
vationalists about Causal Inference’, Journal of the Royal Statistical Society (A), 171,
481–502.
Imbens, G. (2000): ‘The Role of the Propensity Score in Estimating Dose-Response Functions’,
Biometrika, 87, 706–710.
(2001): ‘Some Remarks on Instrumental Variables’, in Econometric Evaluation of Labour
Market Policies, ed. by M. Lechner and F. Pfeiffer, pp. 17–42. Heidelberg: Physica/Springer.
Lee, D. (2008): ‘Randomized Experiments from Non-Random Selection in U.S. House Elections’,
Journal of Econometrics, 142, 675–697.
Lee, D. and D. Card (2008): ‘Regression Discontinuity Inference with Specification Error’,
Journal of Econometrics, 142, 655–674.
Lee, D. and T. Lemieux (2010): ‘Regression Discontinuity Designs in Economics’, Journal of
Economic Literature, 48, 281–355.
Leuven, E., M. Lindahl, H. Oosterbeek and D. Webbink (2007): ‘The Effect of Extra Funding for
Disadvantaged Pupils on Achievement’, Review of Economics and Statistics, 89, 721–736.
Leuven, E. and B. Sianesi (2014): ‘PSMATCH2: Stata Module to Perform Full Mahalanobis and
Propensity Score Matching, Common Support Graphing, and Covariate Imbalance Testing’,
Statistical Software Components.
Li, Q. and J. Racine (2007): Nonparametric Econometrics – Theory and Practice. Princeton, NJ:
Princeton University Press.
Little, R. and D. Rubin (1987): Statistical Analysis with Missing Data. New York, NY: Wiley.
Loader, C. (1999a): ‘Bandwidth Selection: Classical or Plug-In?’, Annals of Statistics, 27,
415–438.
(1999b): Local Regression and Likelihood. New York, NY: Springer.
Lu, B., E. Zanutto, R. Hornik and P. Rosenbaum (2001): ‘Matching with Doses in an Observa-
tional Study of a Media Campaign against Drug Abuse’, Journal of the American Statistical
Association, 96, 1245–1253.
Lunceford, J. and M. Davidian (2004): ‘Stratification and Weighting via the Propensity Score in
Estimation of Causal Treatment Effects: A Comparative Study’, Statistics in Medicine, 23,
2937–2960.
Lunn, M. and D. McNeil (1995): ‘Applying Cox Regression to Competing Risks’, Biometrics, 51,
524–532.
Machado, J. and J. Mata (2005): ‘Counterfactual Decomposition of Changes in Wage Distribu-
tions Using Quantile Regression’, Journal of Applied Econometrics, 20, 445–465.
Mammen, E. (1991): ‘Estimating a Smooth Monotone Regression Function’, Annals of Statistics,
19, 724–740.
(1992): When Does Bootstrap Work: Asymptotic Results and Simulations. Lecture Notes in
Statistics 77. New York, NY and Heidelberg: Springer Verlag.
Manning, W., L. Blumberg and L. Moulton (1995): ‘The Demand for Alcohol: The Differential
Response to Price’, Journal of Health Economics, 14, 123–148.
Frölich, M. and M. Huber (2017): ‘Direct and Indirect Treatment Effects: Causal Chains and
Mediation Analysis with Instrumental Variables’, Journal of the Royal Statistical Society (B),
79, 1645–1666.
(2018): ‘Including Covariates in the Regression Discontinuity Design’, Journal of Business &
Economic Statistics, forthcoming, DOI: 10.1080/07350015.2017.142154.
Frölich, M., M. Huber and M. Wiesenfarth (2017): ‘The Finite Sample Performance of Semi- and
Non-Parametric Estimators for Treatment Effects and Policy Evaluation’, Computational
Statistics and Data Analysis, 115, 91–102.
Matsudaira, J. (2008): ‘Mandatory Summer School and Student Achievement’, Journal of
Econometrics, 142, 829–850.
McCrary, J. (2008): ‘Manipulation of the Running Variable in the Regression Discontinuity
Design: A Density Test’, Journal of Econometrics, 142, 698–714.
Mealli, F., G. Imbens, S. Ferro and A. Biggeri (2004): ‘Analyzing a Randomized Trial on Breast
Self-Examination with Noncompliance and Missing Outcomes’, Biostatistics, 5, 207–222.
Melly, B. (2005): ‘Decomposition of Differences in Distribution Using Quantile Regression’,
Labour Economics, 12, 577–590.
Meyer, B. (1995): ‘Natural and Quasi-Experiments in Economics’, Journal of Business and
Economic Statistics, 13, 151–161.
Miguel, E. and M. Kremer (2004): ‘Worms: Identifying Impacts on Education and Health in the
Presence of Treatment Externalities’, Econometrica, 72, 159–217.
Miguel, E., S. Satyanath and E. Sergenti (2004): ‘Economic Shocks and Civil Conflict: An
Instrumental Variables Approach’, Journal of Political Economy, 112, 725–753.
Moffitt, R. (2004): ‘The Role of Randomized Field Trials in Social Science Research: A Per-
spective from Evaluations of Reforms of Social Welfare Programs’, American Behavioral
Scientist, 47, 506–540.
(2008): ‘Estimating Marginal Treatment Effects in Heterogeneous Populations’, Annales
d’Economie et de Statistique, 91/92, 239–261.
Mora, R. and I. Reggio (2012): ‘Treatment Effect Identification Using Alternative Parallel
Assumptions’, Working Paper, Universidad Carlos III de Madrid, Spain.
Moral-Arce, I., S. Sperlich and A. Fernandez-Sainz (2013): ‘The Semiparametric Juhn-Murphy-
Pierce Decomposition of the Gender Pay Gap with an application to Spain’, in Wages and
Employment: Economics, Structure and Gender Differences, ed. by A. Mukherjee, pp. 3–20.
Hauppauge, New York, NY: Nova Science Publishers.
Moral-Arce, I., S. Sperlich, A. Fernandez-Sainz and M. Roca (2012): ‘Trends in the Gender Pay
Gap in Spain: A Semiparametric Analysis’, Journal of Labor Research, 33, 173–195.
Nadaraya, E. (1965): ‘On Nonparametric Estimates of Density Functions and Regression Curves’,
Theory of Probability and its Applications, 10, 186–190.
Neumeyer, N. (2007): ‘A Note on Uniform Consistency of Monotone Function Estimators’,
Statistics and Probability Letters, 77, 693–703.
Newey, W. (1990): ‘Semiparametric Efficiency Bounds’, Journal of Applied Econometrics, 5,
99–135.
(1994): ‘The Asymptotic Variance of Semiparametric Estimators’, Econometrica, 62, 1349–
1382.
Newey, W. and J. Powell (2003): ‘Instrumental Variable Estimation of Nonparametric Models’,
Econometrica, 71, 1565–1578.
Nichols, A. (2007): ‘Causal Inference with Observational Data’, The Stata Journal, 7, 507–541.
(2014): ‘rd: Stata Module for Regression Discontinuity Estimation. Statistical Software
Components’, discussion paper, Boston College Department of Economics.
Pagan, A. and A. Ullah (1999): Nonparametric Econometrics. Cambridge: Cambridge University
Press.
Pearl, J. (2000): Causality: Models, Reasoning, and Inference. Cambridge: Cambridge University
Press.
Pfanzagl, J. and W. Wefelmeyer (1982): Contributions to a General Asymptotic Statistical Theory.
Heidelberg: Springer Verlag.
Pocock, S. and R. Simon (1975): ‘Sequential Treatment Assignment with Balancing for Prognos-
tic Factors in the Controlled Clinical Trial’, Biometrics, 31, 103–115.
Politis, D., J. Romano and M. Wolf (1999): Subsampling. New York, NY: Springer.
Powell, J., J. Stock and T. Stoker (1989): ‘Semiparametric Estimation of Index Coefficients’,
Econometrica, 57, 1403–1430.
Putter, H. (2014): ‘Tutorial in Biostatistics: Competing Risks and Multi-State Models Analyses
Using the mstate Package’, discussion paper, Leiden University Medical Center.
Putter, H., M. Fiocco and R. Geskus (2006): ‘Tutorial in Biostatistics: Competing Risks and Multi-
State Models’, Statistics in Medicine, 26, 2389–2430.
Racine, J. and Q. Li (2004): ‘Nonparametric Estimation of Regression Functions with Both
Categorical and Continuous Data’, Journal of Econometrics, 119, 99–130.
Ravallion, M. (2008): ‘Evaluating Anti-Poverty Programs’, in Handbook of Development Eco-
nomics, ed. by T. Schultz and J. Strauss, pp. 3787–3846. Amsterdam: North-Holland.
Reinsch, C. (1967): ‘Smoothing by Spline Functions’, Numerische Mathematik, 10, 177–183.
Rice, J. (1986): ‘Convergence Rates for Partially Splined Estimates’, Statistics and Probability
Letters, 4, 203–208.
Robins, J. (1986): ‘A New Approach to Causal Inference in Mortality Studies with Sus-
tained Exposure Periods – Application to Control of the Healthy Worker Survivor Effect’,
Mathematical Modelling, 7, 1393–1512.
(1989): ‘The Analysis of Randomized and Nonrandomized AIDS Treatment Trials Using a
New Approach to Causal Inference in Longitudinal Studies’, in Health Service Research
Methodology: A Focus on Aids, ed. by L. Sechrest, H. Freeman and A. Mulley, pp.
113–159. Washington, DC: Public Health Service, National Center for Health Services
Research.
(1997): ‘Causal Inference from Complex Longitudinal Data. Latent Variable Modelling and
Applications to Causality’, in Lecture Notes in Statistics 120, ed. by M. Berkane, pp. 69–117.
New York, NY: Springer.
(1998): ‘Marginal Structural Models’, Proceedings of the American Statistical Association,
1997, 1–10.
(1999): ‘Association, Causation, and Marginal Structural Models’, Synthese, 121, 151–179.
Robins, J., S. Greenland and F. Hu (1999): ‘Estimation of the Causal Effect of a Time-varying
Exposure on the Marginal Mean of a Repeated Binary Outcome’, Journal of the American
Statistical Association, 94, 687–700.
Robins, J. and A. Rotnitzky (1995): ‘Semiparametric Efficiency in Multivariate Regression
Models with Missing Data’, Journal of American Statistical Association, 90, 122–129.
Robins, J., A. Rotnitzky and L. Zhao (1995): ‘Analysis of Semiparametric Regression Models
for Repeated Outcomes in the Presence of Missing Data’, Journal of American Statistical
Association, 90, 106–121.
Robins, J., A. Rotnitzky and L. Zhao (1994): ‘Estimation of Regression Coefficients
When Some Regressors Are Not Always Observed’, Journal of the American Statistical
Association, 89, 846–866.
Roodman, D. (2009a): ‘How to Do xtabond2: An Introduction to Difference and System GMM in
Stata’, The Stata Journal, 9, 86–136.
(2009b): ‘A Note on the Theme of Too Many Instruments’, Oxford Bulletin of Economics and
Statistics, 71, 135–158.
Rose, H. and J. Betts (2004): ‘The Effect of High School Courses on Earnings’, Review of
Economics and Statistics, 86, 497–513.
Rosenbaum, P. (1984): ‘The Consequences of Adjustment for a Concomitant Variable That Has
Been Affected by the Treatment’, Journal of Royal Statistical Society (A), 147, 656–666.
(2002): Observational Studies. Heidelberg: Springer Verlag.
Rothe, C. (2010): ‘Nonparametric Estimation of Distributional Policy Effects’, Journal of
Econometrics, 155, 56–70.
Rothe, C. and S. Firpo (2013): ‘Semiparametric Estimation and Inference Using Doubly Robust
Moment Conditions’, IZA Discussion Paper 7564, Institute for the Study of Labor (IZA).
Rotnitzky, A. and J. Robins (1995): ‘Semiparametric Regression Estimation in the Presence of
Dependent Censoring’, Biometrika, 82, 805–820.
(1997): ‘Analysis of Semiparametric Regression Models with Non-Ignorable Non-Response’,
Statistics in Medicine, 16, 81–102.
Rotnitzky, A., J. Robins and D. Scharfstein (1998): ‘Semiparametric Regression for Repeated Out-
comes With Nonignorable Nonresponse’, Journal of the American Statistical Association,
93, 1321–1339.
Roy, A. (1951): ‘Some Thoughts on the Distribution of Earnings’, Oxford Economic Papers, 3,
135–146.
Rubin, D. (1974): ‘Estimating Causal Effects of Treatments in Randomized and Nonrandomized
Studies’, Journal of Educational Psychology, 66, 688–701.
(1980): ‘Comment on “Randomization Analysis of Experimental Data: The Fisher Randomiza-
tion Test” by D. Basu’, Journal of American Statistical Association, 75, 591–593.
(2001): ‘Using Propensity Scores to Help Design Observational Studies: Application
to the Tobacco Litigation’, Health Services and Outcomes Research Methodology, 2,
169–188.
(2004): ‘Direct and Indirect Causal Effects via Potential Outcomes’, Scandinavian Journal of
Statistics, 31, 161–170.
(2005): ‘Causal Inference Using Potential Outcomes: Design, Modeling, Decisions’, Journal
of American Statistical Association, 100, 322–331.
(2006): Matched Sampling for Causal Effects. Cambridge: Cambridge University Press.
Ruppert, D. and M. Wand (1994): ‘Multivariate Locally Weighted Least Squares Regression’,
Annals of Statistics, 22, 1346–1370.
Särndal, C.-E., B. Swensson and J. Wretman (1992): Model Assisted Survey Sampling. New York,
NY, Berlin, Heidelberg: Springer.
Schwarz, K. and T. Krivobokova (2016): ‘A Unified Framework for Spline Estimators’,
Biometrika, 103, 121–131.
Seifert, B. and T. Gasser (1996): ‘Finite-Sample Variance of Local Polynomials: Analysis and
Solutions’, Journal of American Statistical Association, 91, 267–275.
(2000): ‘Data Adaptive Ridging in Local Polynomial Regression’, Journal of Computational
and Graphical Statistics, 9, 338–360.
Shadish, W., M. Clark and P. Steiner (2008): ‘Can Nonrandomized Experiments Yield Accurate
Answers? A Randomized Experiment Comparing Random and Nonrandom Assignments’,
Journal of the American Statistical Association, 103, 1334–1344.
Sianesi, B. (2004): ‘An Evaluation of the Swedish System of Active Labor Market Programs in
the 1990s’, Review of Economics and Statistics, 86, 133–155.
Speckman, P. (1988): ‘Kernel Smoothing in Partial Linear Models’, Journal of the Royal
Statistical Society (B), 50, 413–436.
Sperlich, S. (2009): ‘A Note on Nonparametric Estimation with Predicted Variables’, The
Econometrics Journal, 12, 382–395.
(2014): ‘On the Choice of Regularization Parameters in Specification Testing: A Critical
Discussion’, Empirical Economics, 47, 427–450.
Sperlich, S. and R. Theler (2015): ‘Modeling Heterogeneity: A Praise for Varying-coefficient
Models in Causal Analysis’, Computational Statistics, 30, 693–718.
Index

accelerated hazard functions (AHF), 385
adjustment term, 101
always takers, 317, 341
anticipation effects, 164
approximation bias, 64
Ashenfelter’s dip, 241
asymmetric loss function, 320
attrition, 32, 292
average direct effect, 58
average structural function (ASF), 35
average treatment effect (ATE), 10
  conditional, 35
  for treated compliers, 199
  on the non-treated (ATEN), 11
  on the treated (ATET), 10

back-door approach, 54
bandwidth, 65
  local, 84
baseline hazard, 384
bias stability (BS), see common trend (CT)
bins, 108
blocking, 25, 28
bootstrap
  naive, 155
  wild, 155

canonical parametrisation, 93
causal chains, 46
causal effect, 3
cause-specific hazards, 388
censoring
  left-, 382
  right-, 382
changes-in-changes (CiC), 228, 244
  reversed, 246
choice-based sampling, 161
common support
  condition (CSC), 19, 121
  problem, 45, 121
common trend (CT), 230, 243
competing risks, 378, 389
compliance intensity, 218
compliers, 271, 317, 341
conditional DiD (CDiD), see matching DiD (MDiD)
conditional independence assumption (CIA), 15, 43, 117
  for instruments (CIA-IV), 191
conditional mean independence, 117
conditioning on the future, 358
confounders, 42, 51
continuity
  Hölder, 62
  Lipschitz, 62
continuous-time model, 359
control function, 214
control variable approach, 214
control variables, see confounders
convergence, 63
counterfactual distribution functions, 145
counterfactual exercise, 6
cross-validation, 82
  generalised, 84
crude incidence, see cumulative incidence function
cumulative incidence function, 388
curse of dimensionality, 64, 81, 131

defiers, 271, 317
DiD-RDD approach, 275
difference-in-differences (DiD), 227
difference-in-differences-in-differences, 242
direct effect, 7, 54
directed acyclic graph, 46
directional derivatives, 321
discrete-time dynamic models, 358
displacement effect, see substitution effect
distributional structural function, 36
Do-validation, 83
Dominated (Bounded) Convergence Theorem, 72
double robust estimator, 168
drop-out bias, 31

efficiency wage theory, 12
eligibility, 270
endogeneity, 337
endogenous sample selection, 31
equivariance to monotone transformations, 318
exact balance, 25
exit rate, see hazard function
exogeneity, 17