Semiparametric Theory and Missing Data
Anastasios A. Tsiatis
Springer Series in Statistics, Springer, 2006
Advisors:
P. Bickel, P. Diggle, S. Fienberg, U. Gather,
I. Olkin, S. Zeger
Springer Series in Statistics
Alho/Spencer: Statistical Demography and Forecasting.
Andersen/Borgan/Gill/Keiding: Statistical Models Based on Counting Processes.
Atkinson/Riani: Robust Diagnostic Regression Analysis.
Atkinson/Riani/Cerioli: Exploring Multivariate Data with the Forward Search.
Berger: Statistical Decision Theory and Bayesian Analysis, 2nd edition.
Borg/Groenen: Modern Multidimensional Scaling: Theory and Applications,
   2nd edition.
Brockwell/Davis: Time Series: Theory and Methods, 2nd edition.
Bucklew: Introduction to Rare Event Simulation.
Cappé/Moulines/Rydén: Inference in Hidden Markov Models.
Chan/Tong: Chaos: A Statistical Perspective.
Chen/Shao/Ibrahim: Monte Carlo Methods in Bayesian Computation.
Coles: An Introduction to Statistical Modeling of Extreme Values.
David/Edwards: Annotated Readings in the History of Statistics.
Devroye/Lugosi: Combinatorial Methods in Density Estimation.
Efromovich: Nonparametric Curve Estimation: Methods, Theory, and Applications.
Eggermont/LaRiccia: Maximum Penalized Likelihood Estimation, Volume I: Density
   Estimation.
Fahrmeir/Tutz: Multivariate Statistical Modelling Based on Generalized Linear
   Models, 2nd edition.
Fan/Yao: Nonlinear Time Series: Nonparametric and Parametric Methods.
Farebrother: Fitting Linear Relationships: A History of the Calculus of Observations
   1750-1900.
Federer: Statistical Design and Analysis for Intercropping Experiments, Volume I:
   Two Crops.
Federer: Statistical Design and Analysis for Intercropping Experiments, Volume II:
   Three or More Crops.
Ferraty/Vieu: Nonparametric Functional Data Analysis: Models, Theory, Applications,
   and Implementation.
Ghosh/Ramamoorthi: Bayesian Nonparametrics.
Glaz/Naus/Wallenstein: Scan Statistics.
Good: Permutation Tests: Parametric and Bootstrap Tests of Hypotheses, 3rd edition.
Gouriéroux: ARCH Models and Financial Applications.
Gu: Smoothing Spline ANOVA Models.
Györfi/Kohler/Krzyżak/Walk: A Distribution-Free Theory of Nonparametric
   Regression.
Haberman: Advanced Statistics, Volume I: Description of Populations.
Hall: The Bootstrap and Edgeworth Expansion.
Härdle: Smoothing Techniques: With Implementation in S.
Harrell: Regression Modeling Strategies: With Applications to Linear Models,
   Logistic Regression, and Survival Analysis.
Hart: Nonparametric Smoothing and Lack-of-Fit Tests.
Hastie/Tibshirani/Friedman: The Elements of Statistical Learning: Data Mining,
   Inference, and Prediction.
Hedayat/Sloane/Stufken: Orthogonal Arrays: Theory and Applications.
Heyde: Quasi-Likelihood and its Application: A General Approach to Optimal
   Parameter Estimation.
                                                              (continued after index)
Anastasios A. Tsiatis
Semiparametric Theory
and Missing Data
Anastasios A. Tsiatis
Department of Statistics
North Carolina State University
Raleigh, NC 27695
USA
[email protected]
ISBN-10: 0-387-32448-8
ISBN-13: 978-0387-32448-7
9 8 7 6 5 4 3 2 1
springer.com
     To
My Mother, Anna
My Wife, Marie
     and
 My Son, Greg
Preface
Missing data are prevalent in many studies, especially when the studies in-
volve human beings. Not accounting for missing data properly when analyzing
data can lead to severe biases. For example, most software packages, by de-
fault, delete records for which any data are missing and conduct the so-called
“complete-case analysis”. In many instances, such an analysis will lead to an
incorrect inference. Since the 1980s there has been a serious attempt to un-
derstand the underlying issues involved with missing data. In this book, we
study the different mechanisms for missing data and some of the different
analytic strategies that have been suggested in the literature for dealing with
such problems. A special case of missing data includes censored data, which
occur frequently in the area of survival analysis. Some discussion of how the
missing-data methods that are developed will apply to problems with censored
data is also included.
    Underlying any missing-data problem is the statistical model for the data
if none of the data were missing (i.e., the so-called full-data model). In this
book, we take a very general approach to statistical modeling. That is, we
consider statistical models where interest focuses on making inference on a
finite set of parameters when the statistical model consists of the parame-
ters of interest as well as other nuisance parameters. Unlike most traditional
statistical models, where the nuisance parameters are finite-dimensional, we
consider the more general problem of infinite-dimensional nuisance parame-
ters. This allows us to develop theory for important statistical methods such
as regression models that model the conditional mean of a response variable
as a function of covariates without making any additional distributional as-
sumptions on the variables and the proportional hazards regression model for
survival data. Models where the parameters of interest are finite-dimensional
and the nuisance parameters are infinite-dimensional are called semiparamet-
ric models.
    The first five chapters of the book consider semiparametric models when
there are no missing data. In these chapters, semiparametric models are de-
fined and some of the theoretical developments for estimators of the parame-
ters in these models are reviewed. The semiparametric theory and the proper-
ties of the estimators for parameters in semiparametric models are developed
from a geometrical perspective. Consequently, in Chapter 2, a quick review of
the geometry of Hilbert spaces is given. The geometric ideas are first devel-
oped for finite-dimensional parametric models in Chapter 3 and then extended
to infinite-dimensional models in Chapters 4 and 5.
     A rigorous treatment of semiparametric theory is given in the book Ef-
ficient and Adaptive Estimation for Semiparametric Models by Bickel et al.
(1993). (Johns Hopkins University Press: Baltimore, MD). My experience has
been that this book is too advanced for many students in statistics and bio-
statistics even at the Ph.D. level. The attempt here is to be more expository
and heuristic, trying to give an intuition for the basic ideas without going into
all the technical details. Although the treatment of this subject is not rigorous,
it is not trivial either. Readers should not be frustrated if they don’t grasp
all the concepts at first reading. This first part of the book that deals only
with semiparametric models (absent missing data) and the geometric theory
of semiparametrics will be important in its own right. It is a beautiful theory,
where the geometric perspective gives a new insight and deeper understanding
of statistical models and the properties of estimators for parameters in such
models.
     The remainder of the book focuses on missing-data methods, building on
the semiparametric techniques developed in the earlier chapters. In Chapter
6, a discussion and overview of missing-data mechanisms is given. This in-
cludes the definition and motivation for the three most common categories of
missingness, namely
•      missing completely at random (MCAR)
•      missing at random (MAR)
•      not missing at random (NMAR)
These ideas are extended to the broader class of coarsened data. We show how
statistical models for full data can be integrated with missingness or coarsen-
ing mechanisms that allow us to derive likelihoods and models for the observed
data in the presence of missingness. The geometric ideas for semiparametric
full-data models are extended to missing-data models. This treatment will
give the reader a deep understanding of the underlying theory for missing and
coarsened data. Methods for estimating parameters with missing or coars-
ened data in as efficient a manner as possible are emphasized. This theory
leads naturally to inverse probability weighted complete-case (IPWCC) and
augmented inverse probability weighted complete-case (AIPWCC) estimators,
which are discussed in great detail in Chapters 7 through 11. As we will see,
some of the proposed methods can become computationally challenging if not
infeasible. Therefore, in Chapter 12, we give some approximate methods for
obtaining more efficient estimators with missing data that are easier to im-
plement. Much of the theory developed in this book is taken from a series of
Preface  vii

4   Semiparametric Models  53
    4.1 GEE Estimators for the Restricted Moment Model  54
        Asymptotic Properties for GEE Estimators  55
        Example: Log-linear Model  57
    4.2 Parametric Submodels  59
    4.3 Influence Functions for Semiparametric RAL Estimators  61
    4.4 Semiparametric Nuisance Tangent Space  63
        Tangent Space for Nonparametric Models  68
        Partitioning the Hilbert Space  69
    4.5 Semiparametric Restricted Moment Model  73
        The Space Λ2s  77
        The Space Λ1s  79
        Influence Functions and the Efficient Influence Function for the Restricted Moment Model  83
        The Efficient Influence Function  85
        A Different Representation for the Restricted Moment Model  87
        Existence of a Parametric Submodel for the Arbitrary Restricted Moment Model  91
    4.6 Adaptive Semiparametric Estimators for the Restricted Moment Model  93
        Extensions of the Restricted Moment Model  97
    4.7 Exercises for Chapter 4  98

References  375
Index  381
1
Introduction to Semiparametric Models
Statistical problems are described using probability models. That is, data
are envisioned as realizations of a vector of random variables Z1 , . . . , Zn ,
where Zi itself is a vector of random variables corresponding to the data
collected on the i-th individual in a sample of n individuals chosen from some
population of interest. We will assume throughout the book that Z1 , . . . , Zn
are independent and identically distributed (iid) with density belonging to
some probability (or statistical) model, where a model consists of a class of
densities that we believe might have generated the data. The densities in
a model are often identified through a set of parameters; i.e., a real-valued
vector used to describe the densities in a statistical model. The problem is
usually set up in such a way that the value of the parameters or, at the least,
the value of some subset of the parameters that describes the density that
generates the data, is of importance to the investigator. Much of statistical
inference considers how we can learn about this “true” parameter value from
a sample of observed data. Models that are described through a vector of a
finite number of real values are referred to as finite-dimensional parametric
models. For finite-dimensional parametric models, the class of densities can
be described as
                           P = {p(z, θ), θ ∈ Ω ⊂ R^p},
where the dimension p is some finite positive integer.
    For many problems, we are interested in making inference only on a sub-
set of the parameters. Nonetheless, the entire set of parameters is necessary
to properly describe the class of possible distributions that may have gen-
erated the data. Suppose, for example, we are interested in estimating the
mean response of a variable, which we believe follows a normal distribution.
Typically, we conduct an experiment where we sample from that distribution
and describe the data that result from that experiment as a realization of the
random vector
                x^j = Σ_{ℓ=0}^{j−1} a_ℓ x^ℓ    for all x ∈ R
for some constants a_0, . . . , a_{j−1}. If this were the case, then the derivatives of
x^j of all orders would have to be equal to the corresponding derivatives of
Σ_{ℓ=0}^{j−1} a_ℓ x^ℓ. But the j-th derivative of x^j is equal to j! ≠ 0, whereas the j-th
derivative of Σ_{ℓ=0}^{j−1} a_ℓ x^ℓ is zero, leading to a contradiction and implying that
x^0, . . . , x^m are linearly independent.
     Consequently, the space S cannot be spanned by any finite number, say
m elements of S, because, if this were possible, then the space of polynomials
of order greater than m could also be spanned by the m elements. But this is
impossible since such spaces of polynomials have dimension greater than m.
Hence, S is infinite-dimensional.
     From the arguments above, we can easily show that the space of arbitrary
densities pZ (z) for a continuous random variable Z defined on the closed
finite interval [0, 1] (i.e., the so-called nonparametric model for such a random
variable) spans a space that is infinite-dimensional. This follows by noticing
that the functions p_{Z_j}(z) = (j + 1) z^j, 0 ≤ z ≤ 1, j = 1, 2, . . . are densities
that are linearly independent.
Remark 2. We will use the convention that random variables are denoted by
capital letters such as Y and X, whereas realizations of those random variables
will be denoted by lowercase letters such as y and x. One exception to this
is that the random variable corresponding to the error term Y − µ(X, β) is
denoted by the Greek lowercase ε. This is in keeping with the usual notation
for such error terms used in statistics. The realization of this error term will
also be denoted by the Greek lowercase ε. The distinction between the random
variable and the realization of the error term will have to be made in the
context it is used and should be obvious in most cases. For example, when we
refer to pε,X (ε, x), the subscript ε is a random variable and the argument ε
inside the parentheses is the realization.   
Remark 3. νX (x) is a dominating measure for which densities for the ran-
dom vector X are defined. For the most part, we will consider ν(·) to be the
Lebesgue measure for continuous random variables and the counting measure
for discrete random variables. The random variable Y and hence ε will be
taken to be continuous random variables dominated by Lebesgue measure dy
or dε, respectively. 
                h^{(1)}(ε, x) = h^{(0)}(ε, x) / ∫ h^{(0)}(ε, x) dε ;
    i.e.,
                ∫ h^{(1)}(ε, x) dε = 1    for all x.
Since the class of all such conditional densities η1 (ε, x) was derived from arbi-
trary positive functions h(0) (ε, x) (subject to regularity conditions), and since
the space of positive functions is infinite-dimensional, then the set of such
resulting conditional densities is also infinite-dimensional.
    Similarly, we can construct densities for X where pX (x) = η2 (x) such that
                η_2(x) > 0,
                ∫ η_2(x) dν_X(x) = 1.
The set of all such functions η2 (x) will also be infinite-dimensional as long as
the support of X is infinite.
   Therefore, the restricted moment model is characterized by the parameter of
interest β together with the infinite-dimensional nuisance parameters η_1(·) and
η_2(·). Contrast this with the parametric model in which
                Y_i = μ(X_i, β) + ε_i,    i = 1, . . . , n,
where the ε_i are iid N(0, σ²); that is,
                p_{Y|X}(y|x; β, σ²) = (2πσ²)^{−1/2} exp[ −{y − μ(x, β)}² / (2σ²) ].
This model is much more restrictive than the semiparametric model defined
earlier.
where
        p_{T|X}{t|x; β, λ(·)} = λ(t) exp(β^T x) exp{ − exp(β^T x) ∫_0^t λ(u) du },
                η_2(x) ≥ 0,
                ∫ η_2(x) dν_X(x) = 1,
for all x. The proportional hazards model has gained a great deal of popularity
because it is more flexible than a finite-dimensional parametric model that
assumes the hazard function for T has a specific functional form in terms
of a few parameters; e.g., a constant (exponential) hazard or a Weibull hazard.
In the two previous examples, the probability models were written in terms
of an infinite-dimensional parameter θ, which was partitioned as {β T , η(·)},
where β was the finite-dimensional parameter of interest and η(·) was the
infinite-dimensional nuisance parameter. We now consider the problem of es-
timating the moments of a single random variable Z where we put no re-
striction on the distribution of Z except that the moments of interest exist.
That is, we denote the density
                                 of Z by θ(z), where θ(z) can be any posi-
tive function of z such that ∫ θ(z) dν_Z(z) = 1, subject to any additional restrictions
necessary for the moments of interest to exist. Clearly, the class of all θ(·) is
infinite-dimensional as long as the support of Z is infinite. Suppose we were
interested in estimating some functional of θ(·), say β(θ) (for example, the
first or second moment, E(Z) or E(Z²), where β(θ) is equal to ∫ z θ(z) dν_Z(z)
or ∫ z² θ(z) dν_Z(z), respectively). For such a problem, it is not convenient to
try to partition the parameter space in terms of the parameter β of interest
and a nuisance parameter but rather to work directly with the functional β(θ).
for all densities “p{z, β, η(·)}” within some semiparametric family, where
−→^{P{β,η(·)}} denotes convergence in probability and −→^{D{β,η(·)}} denotes conver-
gence in distribution when the density of the random variable Z is p{z, β, η(·)}.
   We know, for example, that the solution to the linear estimating equations
                Σ_{i=1}^n A^{q×1}(X_i, β̂_n) {Y_i − μ(X_i, β̂_n)} = 0^{q×1},
In this section, we will introduce a Hilbert space without going into much of
the technical details. We will focus primarily on the Hilbert space whose ele-
ments are random vectors with mean zero and finite variance that will be used
throughout the book. For more details about Hilbert spaces, we recommend
that the reader study Chapter 3 of Luenberger (1969).
Consider the space of all q-dimensional measurable random functions h(Z) of Z satisfying
                          (i) E{h(Z)} = 0,
                          (ii) E{h^T(Z) h(Z)} < ∞.
Since the elements of this space are random functions, when we refer to an
element as h, we implicitly mean h(Z). Clearly, the space of all such h that
satisfy (i) and (ii) is a linear space. By linear, we mean that if h1 , h2 are
elements of the space, then for any real constants a and b, ah1 + bh2 also
belongs to the space.
    In the same way that we consider points in Euclidean space as vectors from
the origin, here we will consider the q-dimensional random functions as points
in a space. The intuition we have developed in understanding the geometry of
two- and three-dimensional Euclidean space will aid us in understanding the
geometry of more complex spaces through analogy. The random function
h(Z) = 0^{q×1} will serve as the origin of this space.
Note 1. In some cases, the function ·, · may satisfy conditions 1–3 above and
the first part of condition 4, but h1 , h1 = 0 may not imply that h1 = 0. In
that case, we can still define a Hilbert space by identifying equivalence classes
where individual elements in our space correspond to different equivalence
classes.
This definition of inner product clearly satisfies the first three conditions of
the definition given above. As for condition 4, we can define an equivalence
class where h1 is equivalent to h2 ,
h1 ≡ h 2 ,
Remark 1. Technically speaking, the definitions above are those for a pre-
Hilbert space. In order to be a Hilbert space, we also need the space to be
complete (i.e., every Cauchy sequence has a limit point that belongs to the
space). That the space of q-dimensional random functions with mean zero
and bounded second moments is complete follows from the L2 -completeness
theorem (see Loève 1963, p. 161) and hence is a Hilbert space. 
                                                                
Geometrically, this is illustrated in the accompanying figure (showing an element u_0 and the origin).
                ⟨h_1, h_2⟩ = E(h_1 h_2)
for h1 (Z), h2 (Z) ∈ H. Let u1 (Z), . . . , uk (Z) be arbitrary elements of this space
and U be the linear subspace spanned by {u1 , · · · , uk }. That is,
U = {aT u; for a ∈ Rk },
where
                u^{k×1} = (u_1, . . . , u_k)^T.
or
                Σ_{j=1}^k a_j ⟨h − a_0^T u, u_j⟩ = 0    for all a_j , j = 1, . . . , k.
                ⟨h_1, h_2⟩ = E(h_1^T h_2).
Let v(Z) be an r-dimensional random function with mean zero and E(v^T v)
< ∞. Consider the linear subspace U spanned by v(Z); that is,
and hence
Since the Bij are arbitrary, we can minimize the sum in (2.5) by minimizing
each of the elements in the sum separately. 
The norm-squared of this projection is given by
                E[ v^T {E(vv^T)}^{−1} E(vh^T) E(hv^T) {E(vv^T)}^{−1} v ],
and, by the Pythagorean theorem (see Figure 2.2 for an illustration), the norm-
squared of the residual (h − B_0 v) is
        E(h^T h) − E[ v^T {E(vv^T)}^{−1} E(vh^T) E(hv^T) {E(vv^T)}^{−1} v ]
            = tr{ E(hh^T) − E(hv^T){E(vv^T)}^{−1} E(vh^T) }.
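As a quick numerical illustration (not from the book), the coefficient matrix B_0 = E(hv^T){E(vv^T)}^{−1} of the projection can be approximated by replacing the expectations with sample averages over simulated data. The sketch below uses hypothetical choices of h(Z) and v(Z) and checks that the residual h − B_0 v is empirically orthogonal to v and that the Pythagorean decomposition of the norms holds.

```python
import numpy as np

# Minimal sketch (not from the book): approximate the projection of h(Z) onto the
# linear subspace spanned by v(Z), replacing expectations with sample averages.
rng = np.random.default_rng(0)
n = 100_000
Z = rng.normal(size=(n, 2))                       # hypothetical data Z = (Z1, Z2)

h = Z[:, [0]] ** 2 - 1.0                          # h(Z): mean zero, finite variance (q = 1)
v = np.column_stack([Z[:, 0], Z[:, 1],            # v(Z): mean-zero functions spanning U (r = 3)
                     Z[:, 0] * Z[:, 1]])

E_vvT = v.T @ v / n                               # estimate of E(v v^T)
E_vhT = v.T @ h / n                               # estimate of E(v h^T)
B0 = E_vhT.T @ np.linalg.inv(E_vvT)               # B0 = E(h v^T){E(v v^T)}^{-1}

proj = v @ B0.T                                   # projection B0 v of h onto U
resid = h - proj                                  # residual h - B0 v

print("E{(h - B0 v) v^T} ~ 0:", (resid.T @ v / n).round(4))
print("||h||^2 vs ||B0 v||^2 + ||h - B0 v||^2:",
      float((h ** 2).mean()),
      float((proj ** 2).mean() + (resid ** 2).mean()))
```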
   There are other properties of Hilbert spaces that will be used throughout
the book. Rather than giving all the properties in this introductory chapter,
we will instead define these as they are needed. There is, however, one very
important result that we do wish to highlight. This is the Cauchy-Schwarz
inequality given in Theorem 2.3.
      Σ11 = E(Y1 Y1T ), Σ12 = E(Y1 Y2T ), Σ21 = E(Y2 Y1T ), Σ22 = E(Y2 Y2T ).
    Let H be the Hilbert space of all q-dimensional measurable functions of Z
    with mean zero, finite variance, and equipped with the covariance inner
    product. Let U be the linear subspace spanned by Y2 ; i.e., U consists of
    all the elements
                { B^{q×(p−q)} Y_2 : for all q × (p − q) matrices B }.
ϕ(Z, θ) to emphasize that this random function will vary according to the
value of θ in the model. Unless otherwise stated, it will be assumed that ϕ(Z)
is evaluated at the truth and expectations are taken with respect to the truth.
Therefore, E{ϕ(Z)} is shorthand for
                E_{θ_0}{ϕ(Z, θ_0)}.
     The random vector ϕ(Zi ) in (3.1) is referred to as the i-th influence func-
tion of the estimator β̂n or the influence function of the i-th observation of
the estimator β̂n . The term influence function comes from the robustness lit-
erature, where, to first order, ϕ(Zi ) is the influence of the i-th observation on
β̂n ; see Hampel (1974).
        n^{1/2}(σ̂_n² − σ_0²) = n^{−1/2} Σ_{i=1}^n {(Z_i − µ_0)² − σ_0²} + n^{1/2}(µ̂_n − µ_0)².
Since n1/2 (µ̂n −µ0 ) converges to a normal distribution and (µ̂n −µ0 ) converges
in probability to zero, this implies that n1/2 (µ̂n −µ0 )2 converges in probability
to zero (i.e., is op (1)). Consequently, we have demonstrated that σ̂n2 is an
asymptotically linear estimator for σ 2 whose i-th influence function is given
by ϕ(Zi ) = {(Zi − µ0 )2 − σ02 }. 
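A quick Monte Carlo check of this asymptotic linearity (not from the book, and assuming σ̂_n² = n^{−1} Σ_{i=1}^n (Z_i − Z̄_n)² with Z_i iid N(µ_0, σ_0²)) is sketched below; the remainder term concentrates near zero, as the theory predicts.

```python
import numpy as np

# Minimal sketch (not from the book): check numerically that
# n^{1/2}(sigma_hat^2 - sigma0^2) - n^{-1/2} sum {(Z_i - mu0)^2 - sigma0^2} is small,
# assuming sigma_hat_n^2 = n^{-1} sum (Z_i - Zbar_n)^2 and Z_i iid N(mu0, sigma0^2).
rng = np.random.default_rng(1)
mu0, sigma0 = 2.0, 1.5
n, reps = 5_000, 2_000

remainders = []
for _ in range(reps):
    Z = rng.normal(mu0, sigma0, size=n)
    sigma2_hat = np.mean((Z - Z.mean()) ** 2)
    lhs = np.sqrt(n) * (sigma2_hat - sigma0 ** 2)
    linear_term = np.sqrt(n) * np.mean((Z - mu0) ** 2 - sigma0 ** 2)
    remainders.append(lhs - linear_term)          # should behave like o_p(1)

print("mean and sd of the remainder:", np.mean(remainders), np.std(remainders))
```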
        n^{1/2}(β̂_n − β_0) = n^{−1/2} Σ_{i=1}^n ϕ(Z_i) + o_p(1),
Proof (by contradiction). Suppose not. Then there exists another influence function ϕ*(Z) such that
E{ϕ∗ (Z)} = 0,
and
        n^{1/2}(β̂_n − β_0) = n^{−1/2} Σ_{i=1}^n ϕ*(Z_i) + o_p(1).
Since n^{1/2}(β̂_n − β_0) is also equal to n^{−1/2} Σ_{i=1}^n ϕ(Z_i) + o_p(1), this implies that
        n^{−1/2} Σ_{i=1}^n {ϕ(Z_i) − ϕ*(Z_i)} = o_p(1).
3.1 Super-Efficiency
Example Due to Hodges
Let Z1 , . . . , Zn be iid N (µ, 1), µ ∈ R. For this simple model, we know that
the maximum likelihood estimator (MLE) of µ is given by the sample mean
Z̄_n = n^{−1} Σ_{i=1}^n Z_i and that
        n^{1/2}(Z̄_n − µ) −→^{D(µ)} N(0, 1).
   Now, consider the estimator µ̂n given by Hodges in 1951 (see LeCam,
1953):
                µ̂_n = Z̄_n    if |Z̄_n| > n^{−1/4},
                µ̂_n = 0       if |Z̄_n| ≤ n^{−1/4}.
Some of the properties of this estimator are as follows.
If µ ≠ 0, then with increasing probability, the support of Z̄_n moves away from
0 (see Figure 3.1).
[Figure 3.1: tick marks at −n^{−1/4}, 0, n^{−1/4}, and µ]
Therefore n^{1/2}(Z̄_n − µ) = n^{1/2}(µ̂_n − µ) + o_p(1) and n^{1/2}(µ̂_n − µ) −→^{D(µ)} N(0, 1).
   If µ = 0, then the support of Z̄n will be concentrated in an O(n−1/2 )
neighborhood about the origin and hence, with increasing probability, will be
within ±n−1/4 (see Figure 3.2).
[Figure 3.2: tick marks at −n^{−1/4} and n^{−1/4}]
Therefore, this implies that P_0(µ̂_n = 0) → 1. Hence P_0(n^{1/2} µ̂_n = 0) → 1, and
n^{1/2}(µ̂_n − 0) −→^{P_0} 0, or equivalently −→^{D(0)} N(0, 0). Consequently, the asymptotic variance of
n^{1/2}(µ̂_n − µ) is equal to 1 for all µ ≠ 0, as it is for the MLE Z̄_n, but for µ = 0,
the asymptotic variance of n^{1/2}(µ̂_n − µ) equals 0; thus µ̂_n is super-efficient.
    Although super-efficiency, at the surface, may seem like a good property
for an estimator to possess, upon further study we find that super-efficiency
is gained at the expense of poor estimation in a neighborhood of zero. To
[Figure 3.3: tick marks at −n^{−1/4}, n^{−1/3}, and n^{−1/4}]
Therefore,
        P_{µ_n}{ n^{1/2}(µ̂_n − µ_n) = −n^{1/2} µ_n } → 1,
where −n^{1/2} µ_n → −∞.
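A small simulation (not from the book) makes this behavior concrete. The sketch below, using the definition of µ̂_n above, compares n times the mean squared error of the Hodges estimator with that of the MLE, both at µ = 0 and at a local value such as µ = n^{−1/3}, where the Hodges estimator behaves poorly.

```python
import numpy as np

# Minimal sketch (not from the book): Monte Carlo comparison of the Hodges estimator
# mu_hat_n with the MLE Zbar_n for iid N(mu, 1) data, at mu = 0 (super-efficiency)
# and at the local value mu = n^{-1/3} (poor behavior near zero).
rng = np.random.default_rng(4)
n, reps = 10_000, 5_000

def hodges(zbar, n):
    return np.where(np.abs(zbar) > n ** (-0.25), zbar, 0.0)

for mu in (0.0, n ** (-1.0 / 3.0)):
    zbar = mu + rng.normal(size=reps) / np.sqrt(n)        # sampling distribution of Zbar_n
    mu_hat = hodges(zbar, n)
    n_mse_mle = np.mean(n * (zbar - mu) ** 2)             # roughly 1 for every mu
    n_mse_hodges = np.mean(n * (mu_hat - mu) ** 2)        # ~0 at mu = 0, large at mu = n^{-1/3}
    print(f"mu = {mu:.5f}:  n*MSE(MLE) = {n_mse_mle:.2f},  n*MSE(Hodges) = {n_mse_hodges:.2f}")
```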
where
        θ_n = (β_n^T, η_n^T)^T,    θ* = (β*^T, η*^T)^T.
An estimator β̂n , more specifically β̂n (Z1n , . . . , Znn ), is said to be regular if,
for each θ∗ , n1/2 (β̂n − βn ) has a limiting distribution that does not depend on
the LDGP.     
where
        Z_{1n}, . . . , Z_{nn}  are iid  p(z, θ*),    for all n,
then
        n^{1/2}{ β̂_n(Z_{1n}, . . . , Z_{nn}) − β_n } −→^{D(θ_n)} N(0, Σ*),
where
                            Z1n , . . . , Znn     are iid p(z, θn ),
and n^{1/2}(θ_n − θ*) → τ^{p×1}, where τ is an arbitrary constant vector.
    It is easy to see that, in our previous example, the MLE Z̄n is a regular
estimator, whereas the super-efficient estimator µ̂n , given by Hodges, is not.
    From now on, we will restrict ourselves to regular estimators; in fact,
we will only consider estimators that are regular and asymptotically linear
(RAL). Although most reasonable estimators are RAL, regular estimators do
exist that are not asymptotically linear. However, as a consequence of Hájek’s
(1970) representation theorem, it can be shown that the most efficient regular
estimator is asymptotically linear; hence, it is reasonable to restrict attention
to RAL estimators.
    In Theorem 3.2 and its subsequent corollary, given below, we present a
very powerful result that allows us to describe the geometry of influence func-
tions for regular asymptotically linear (RAL) estimators. This will aid us in
defining and visualizing efficiency and will also help us generalize ideas to
semiparametric models.
    First, we define the score vector for a single observation Z in a parametric
model, where Z ∼ pZ (z, θ), θ = (β T , η T )T , by Sθ (Z, θ0 ), where
        S_θ(z, θ_0) = ∂ log p_Z(z, θ)/∂θ |_{θ=θ_0}                                  (3.3)
where
        S_β(z, θ_0) = { ∂ log p_Z(z, θ)/∂β }^{q×1} |_{θ=θ_0}
and
        S_η(z, θ_0) = { ∂ log p_Z(z, θ)/∂η }^{r×1} |_{θ=θ_0}.
Corollary 1.
(i)
                             E{ϕ(Z)SβT (Z, θ0 )} = I q×q
and
(ii)
                             E{ϕ(Z)SηT (Z, θ0 )} = 0q×r ,
where I q×q denotes the q ×q identity matrix and 0q×r denotes the q ×r matrix
of zeros.
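For a simple parametric model, the two conditions of Corollary 1 can be checked numerically. The sketch below (not from the book) takes Z ~ N(µ, σ²) with β = µ as the parameter of interest, η = σ² as the nuisance parameter, and the sample mean as the estimator of µ, whose influence function is ϕ(Z) = Z − µ_0.

```python
import numpy as np

# Minimal sketch (not from the book): Monte Carlo check of Corollary 1 for the model
# Z ~ N(mu, sigma^2), with beta = mu (interest), eta = sigma^2 (nuisance), and the
# sample mean as estimator, whose influence function is phi(Z) = Z - mu0.
rng = np.random.default_rng(5)
mu0, sigma0 = 1.0, 2.0
Z = rng.normal(mu0, sigma0, size=1_000_000)

phi = Z - mu0                                                  # influence function of the sample mean
S_beta = (Z - mu0) / sigma0 ** 2                               # score for mu at the truth
S_eta = ((Z - mu0) ** 2 - sigma0 ** 2) / (2 * sigma0 ** 4)     # score for sigma^2 at the truth

print("E{phi S_beta} ~ 1:", np.mean(phi * S_beta))             # condition (i) of Corollary 1
print("E{phi S_eta}  ~ 0:", np.mean(phi * S_eta))              # condition (ii) of Corollary 1
```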
    Theorem 3.2 follows from the definition of regularity together with suffi-
cient smoothness conditions that make a local data generating process con-
tiguous (to be defined shortly) to the sequence of distributions at the truth.
For completeness, we will give an outline of the proof. Before giving the gen-
eral proof of Theorem 3.2, which is complicated and can be skipped by the
reader not interested in all the technical details, we can gain some insight by
first showing how Corollary 1 could be proved for the special (and important)
case of the class of m-estimators.
Eθ {mT (Z, θ)m(Z, θ)} < ∞, and Eθ {m(Z, θ)mT (Z, θ)} is positive definite for
all θ ∈ Ω. Additional regularity conditions are also necessary and will be
defined as we need them.
    The m-estimator θ̂n is defined as the solution (assuming it exists) of
        Σ_{i=1}^n m(Z_i, θ̂_n) = 0
from a sample
        Z_1, . . . , Z_n  iid  p_Z(z, θ),    θ ∈ Ω ⊂ R^p.
where Sθ (z, θ) is the score vector (i.e., the derivative of the log-density) defined
in (3.3). Since the score vector Sθ (Z, θ), under suitable regularity conditions,
has the property that E_θ{S_θ(Z, θ)} = 0 (see, for example, equation (7.3.8)
of Casella and Berger, 2002), this implies that the MLE is an example of
an m-estimator.
    In order to prove the consistency and asymptotic normality of m-estimators,
we need to assume certain regularity conditions. Some of the conditions that
are discussed in Chapter 36 of the Handbook of Econometrics by Newey
and McFadden (1994) include that E{∂m(Z, θ_0)/∂θ^T} be nonsingular, where
∂m(Z_i, θ)/∂θ^T is defined as the p × p matrix of all partial derivatives of the ele-
ments of m(·) with respect to the elements of θ, and that
        n^{−1} Σ_{i=1}^n ∂m(Z_i, θ)/∂θ^T −→^P E_{θ_0}{ ∂m(Z, θ)/∂θ^T },
        sup_{θ ∈ N(θ_0)} || ∂m(Z, θ)/∂θ^T || ≤ g(Z),    E{g(Z)} < ∞,
Therefore,
        n^{1/2}(θ̂_n − θ_0) = −[ n^{−1} Σ_{i=1}^n ∂m(Z_i, θ*_n)/∂θ^T ]^{−1} { n^{−1/2} Σ_{i=1}^n m(Z_i, θ_0) }
                           = −[ E{∂m(Z, θ_0)/∂θ^T} ]^{−1} n^{−1/2} Σ_{i=1}^n m(Z_i, θ_0) + o_p(1).
        n^{1/2}(θ̂_n − θ_0) −→^D
          N( 0, [E{∂m(Z, θ_0)/∂θ^T}]^{−1} var{m(Z, θ_0)} [E{∂m(Z, θ_0)/∂θ^T}]^{−1 T} ),        (3.7)
where
        var{m(Z, θ_0)} = E{ m(Z, θ_0) m^T(Z, θ_0) }.
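The sandwich formula in (3.7) is straightforward to compute in practice by replacing the expectations with sample averages. The sketch below (not from the book) uses the hypothetical estimating function m(Z, θ) = (Z − µ, (Z − µ)² − σ²) for θ = (µ, σ²), with the derivative matrix obtained by numerical differentiation.

```python
import numpy as np
from scipy import optimize

# Minimal sketch (not from the book): an m-estimator for theta = (mu, sigma^2) based on
# m(Z, theta) = (Z - mu, (Z - mu)^2 - sigma^2), with the sandwich variance of (3.7)
# estimated by replacing expectations with sample averages.
rng = np.random.default_rng(2)
Z = rng.gamma(shape=2.0, scale=1.5, size=4_000)      # hypothetical data; normality is not required

def m(Z, theta):
    mu, sig2 = theta
    return np.column_stack([Z - mu, (Z - mu) ** 2 - sig2])   # n x p matrix of m(Z_i, theta)

def estimating_eq(theta):
    return m(Z, theta).mean(axis=0)                  # (1/n) sum_i m(Z_i, theta)

theta_hat = optimize.root(estimating_eq, x0=np.array([1.0, 1.0])).x

# Sandwich: A = E{dm/dtheta^T}, B = E{m m^T}, avar = A^{-1} B A^{-T}, all at theta_hat.
eps = 1e-5
A = np.column_stack([
    (estimating_eq(theta_hat + eps * np.eye(2)[j]) -
     estimating_eq(theta_hat - eps * np.eye(2)[j])) / (2 * eps)
    for j in range(2)
])                                                   # numerical derivative matrix, p x p
B = m(Z, theta_hat).T @ m(Z, theta_hat) / len(Z)     # estimate of var{m(Z, theta_0)}
avar = np.linalg.inv(A) @ B @ np.linalg.inv(A).T

print("theta_hat:", theta_hat)
print("standard errors:", np.sqrt(np.diag(avar) / len(Z)))
```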
I(θ0 ) = Eθ0 {−Sθθ (Z, θ0 )} = Eθ0 {Sθ (Z, θ0 )SθT (Z, θ0 )}. (3.11)
θ = (β T , η T )T
and
Therefore,
        ∂/∂θ^T ∫ m(z, θ) p(z, θ) dν(z) = 0.
where I p×p denotes the p×p identity matrix. Recall that the influence function
for θ̂n , given by (3.6), is
        ϕ_{θ̂_n}(Z_i) = −[ E{∂m(Z, θ_0)/∂θ^T} ]^{−1} m(Z_i, θ_0)
and can be partitioned as { ϕ_{β̂_n}^T(Z_i), ϕ_{η̂_n}^T(Z_i) }^T.
    The covariance of the influence function ϕθ̂n (Zi ) and the score vector
Sθ (Zi , θ0 ) is
        E{ ϕ_{θ̂_n}(Z_i) S_θ^T(Z_i, θ_0) }
            = −[ E{∂m(Z, θ_0)/∂θ^T} ]^{−1} E{ m(Z, θ_0) S_θ^T(Z, θ_0) },        (3.14)
Consequently,
(i)  E{ ϕ_{β̂_n}(Z_i) S_β^T(Z_i, θ_0) } = I^{q×q}    (the q × q identity matrix)
and
(ii) E{ ϕ_{β̂_n}(Z_i) S_η^T(Z_i, θ_0) } = 0^{q×r}.
Thus, we have verified that the two conditions of Corollary 1 hold for influence
functions of m-estimators.
Definition 2. Let Vn be a sequence of random vectors and let P1n and P0n be
sequences of probability measures with densities p1n (vn ) and p0n (vn ), respec-
tively. The sequence of probability measures P1n is contiguous to the sequence
of probability measures P0n if, for any sequence of events An defined with re-
spect to Vn , P0n (An ) → 0 as n → ∞ implies that P1n (An ) → 0 as n → ∞.
To illustrate that (3.15) holds for LDGPs under sufficient smoothness and
regularity conditions, we sketch out the following heuristic argument. Define
(ii) Since θn∗ → θ0 and Sθθ (Zin , θ0 ), i = 1, . . . , n are iid random matrices with
     mean −I(θ0 ), then, under sufficient smoothness conditions,
        n^{−1} Σ_{i=1}^n { S_θθ(Z_{in}, θ*_n) − S_θθ(Z_{in}, θ_0) } −→^P 0,
     hence
        n^{−1} Σ_{i=1}^n S_θθ(Z_{in}, θ*_n) −→^P −I(θ_0).
By assumption, n^{1/2}(θ_n − θ_0) → τ. Therefore, (i), (ii), and Slutsky's theorem
imply that
        log{L_n(V_n)} −→^{D(P_{0n})} N( −τ^T I(θ_0) τ / 2,  τ^T I(θ_0) τ ).
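For a concrete one-dimensional illustration (a standard example, not taken from this excerpt), let p(z, θ) be the N(θ, 1) density with θ_0 = 0 and θ_n = τ/n^{1/2}, so that I(θ_0) = 1. Then

        log{L_n(V_n)} = Σ_{i=1}^n { −(Z_{in} − τ/n^{1/2})²/2 + Z_{in}²/2 }
                      = (τ/n^{1/2}) Σ_{i=1}^n Z_{in} − τ²/2 −→^{D(P_{0n})} N(−τ²/2, τ²),

which agrees with the general limit above.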
        n^{1/2}{ β̂_n − β(θ_0) } = n^{−1/2} Σ_{i=1}^n ϕ(Z_{in}) + o_{P_{1n}}(1).
Also, under P1n , [ϕ(Zin ) − Eθn {ϕ(Z)}], i = 1, . . . , n are iid mean-zero random
vectors with variance matrix Eθn (ϕϕT ) − Eθn (ϕ)Eθn (ϕT ). By the smoothness
assumption, Eθn (ϕϕT ) → Eθ0 (ϕϕT ) and Eθn (ϕ) → 0 as n → ∞. Hence, by
the CLT, we obtain
        n^{−1/2} Σ_{i=1}^n [ ϕ(Z_{in}) − E_{θ_n}{ϕ(Z)} ] −→^{D(P_{1n})} N( 0, E_{θ_0}(ϕϕ^T) ).        (3.20)
By a simple Taylor series expansion, we deduce that β(θn ) ≈ β(θ0 )+Γ(θ0 )(θn −
θ0 ), where Γ(θ0 ) = ∂β(θ0 )/∂θT . Hence,
Finally,
        n^{1/2} E_{θ_n}{ϕ(Z)} = n^{1/2} ∫ ϕ(z) p(z, θ_n) dν(z)
            = n^{1/2} ∫ ϕ(z) p(z, θ_0) dν(z) + n^{1/2} ∫ ϕ(z) { ∂p(z, θ*_n)/∂θ }^T (θ_n − θ_0) dν(z)
            −→ (n → ∞)  0 + ∫ ϕ(z) [ { ∂p(z, θ_0)/∂θ }^T / p(z, θ_0) ] p(z, θ_0) dν(z) τ
            = E_{θ_0}{ ϕ(Z) S_θ^T(Z, θ_0) } τ,                                    (3.22)
where θn∗ is some intermediate value between θn and θ0 . The only way that
(3.19) and (3.20) can hold is if the limit of (3.18), as n → ∞, is identically
equal to zero. By (3.21) and (3.22), this implies that
        [ E_{θ_0}{ ϕ(Z) S_θ^T(Z, θ_0) } − Γ(θ_0) ] τ = 0^{q×1}.
B^{q×p} S_θ(Z, θ_0)
Constructing Estimators
Let ϕ(Z) be a q-dimensional measurable function with zero mean and finite
variance that satisfies conditions (i) and (ii) of Corollary 1. Define
Assume that we can find a root-n consistent estimator for the nuisance pa-
rameter η̂n (i.e., where n1/2 (η̂n −η0 ) is bounded in probability). In many cases
the estimator η̂n will be β-dependent (i.e., η̂n (β)). For example, we might use
the MLE for η, or the restricted MLE for η, fixing the value of β.
   We will now argue that the solution to the equation
        Σ_{i=1}^n m{ Z_i, β, η̂_n(β) } = 0,                                    (3.24)
or
        ∫ m(z, β_0, η) p(z, β_0, η) dν(z) = 0.
Consequently,
        ∂/∂η^T |_{η=η_0} ∫ m(z, β_0, η) p(z, β_0, η) dν(z) = 0,
or
        ∫ { ∂m(z, β_0, η_0)/∂η^T } p(z, β_0, η_0) dν(z)
            + ∫ m(z, β_0, η_0) S_η^T(z, β_0, η_0) p(z, β_0, η_0) dν(z) = 0.        (3.25)
        0 = Σ_{i=1}^n m{ Z_i, β̂_n, η̂_n(β̂_n) }
          = Σ_{i=1}^n m{ Z_i, β_0, η̂_n(β̂_n) }
            + [ Σ_{i=1}^n ∂m/∂β^T { Z_i, β*_n, η̂_n(β̂_n) } ] (β̂_n − β_0),        (3.28)
where the term η̂_n(β̂_n) inside the derivative is held fixed.
        n^{1/2}(β̂_n − β_0)
          = −[ n^{−1} Σ_{i=1}^n ∂m/∂β^T { Z_i, β*_n, η̂_n(β̂_n) } ]^{−1} [ n^{−1/2} Σ_{i=1}^n m{ Z_i, β_0, η̂_n(β̂_n) } ],        (3.29)
where the first bracketed term converges in probability to
        [ E{ ∂m(Z, β_0, η_0)/∂β^T } ]^{−1} = −I^{q×q}   by (3.27).
Let us consider the second term of (3.29); namely, n^{−1/2} Σ_{i=1}^n m{ Z_i, β_0, η̂_n(β̂_n) }.
By expansion, this equals
        n^{−1/2} Σ_{i=1}^n m(Z_i, β_0, η_0)
            + [ n^{−1} Σ_{i=1}^n ∂m(Z_i, β_0, η*)/∂η^T ] [ n^{1/2}{ η̂_n(β̂_n) − η_0 } ],        (3.30)
where the first bracketed factor converges in probability to E{ ∂m(Z, β_0, η_0)/∂η^T } = 0 by (3.26),
and the second factor is bounded in probability.
        n^{1/2}(β̂_n − β_0) = n^{−1/2} Σ_{i=1}^n m(Z_i, β_0, η_0) + o_p(1)
                           = n^{−1/2} Σ_{i=1}^n ϕ(Z_i) + o_p(1),
which illustrates that ϕ(Zi ) is the influence function for the i-th observation
of the estimator β̂n above.
Remark 3. This argument was independent of the choice of the root-n consis-
tent estimator for the nuisance parameter η. 
                                             
Remark 4. In the derivation above, the asymptotic distribution of the esti-
mator obtained by solving the estimating equation, which uses the estimating
function m(Z, β, η̂n ), is the same as the asymptotic distribution of the estima-
tor solving the estimating equation using the estimating function m(Z, β, η0 )
had the true value of the nuisance parameter η0 been known to us. This
fact follows from the orthogonality of the estimating function (evaluated at
the truth) to the nuisance tangent space. This type of robustness, where the
asymptotic distribution of an estimator is independent of whether the true
value of the nuisance parameter is known or whether (and how) the nuisance
parameter is estimated in an estimating equation, is one of the bonuses of
working with estimating equations with estimating functions that are orthog-
onal to the nuisance tangent space.     
Remark 5. We want to make it clear that the estimator we just presented is
for theoretical purposes only and not of practical use. The starting point was
the choice of a function satisfying the conditions of Lemma 3.1. To find such
a function necessitates knowledge of the truth, which, of course, we don’t
have. Nonetheless, starting with some truth, say θ0 , and some function ϕ(Z)
satisfying the conditions of Corollary 1 (under the assumed true model), we
constructed an estimator whose influence function is ϕ(Z) when θ0 is the
truth. If, however, the data were generated, in truth, by some other value of
the parameter, say θ∗ , then the estimator constructed by solving (3.24) would
have some other influence function ϕ∗ (Z) satisfying the conditions of Lemma
3.1 at θ∗ . 
Thus, by Corollary 1, all RAL estimators have influence functions that belong
to the subspace of our Hilbert space satisfying
(i) E{ϕ(Z)SβT (Z, θ0 )} = I q×q
and
(ii) E{ϕ(Z)SηT (Z, θ0 )} = 0q×r ,
and, conversely, any element in the subspace above is the influence function
of some RAL estimator.
                              H = M ⊕ M ⊥. 
                                           
                ⟨h − a_0, a⟩ = 0    for all a ∈ Λ.
can be written as the direct sum of the nuisance tangent space and the tangent
space generated by the score vector with respect to the parameter of interest
“β”. That is, if we define Tβ as the space {B q×q Sβ (Z, θ0 ) for all B q×q }, then
T = Tβ ⊕ Λ.
if and only if
                        var {aT ϕ(1) (Z)} ≤ var {aT ϕ(2) (Z)}
for all q × 1 constant vectors a. Equivalently,
              a^T E{ ϕ^{(1)}(Z) ϕ^{(1)T}(Z) } a ≤ a^T E{ ϕ^{(2)}(Z) ϕ^{(2)T}(Z) } a.
This means that, for such cases, the variance matrix of ℓ + h, for q-dimensional
ℓ and h, is larger (in the multidimensional sense defined above) than either
the variance matrix of ℓ or the variance matrix of h.
    In many of the arguments that follow, we will be decomposing elements
of the Hilbert space as the projection to a tangent space or a nuisance tan-
gent space plus the residual after the projection. For such problems, because
the tangent space or nuisance tangent space is a q-replicating linear space,
we now know that we can immediately apply the multivariate version of the
Pythagorean theorem where the variance matrix of any element is always
larger than the variance matrix of the projection or the variance matrix of the
residual after projection. Consequently, we don’t have to distinguish between
the Hilbert space of one-dimensional random functions and q-dimensional ran-
dom functions.
Before describing the geometry of influence functions, we first give the defini-
tion of a linear variety (sometimes also called an affine space).
Definition 7. A linear variety is the translation of a linear subspace away
from the origin; i.e., a linear variety V can be written as V = x_0 + M, where
x_0 ∈ H, x_0 ∉ M, x_0 ≠ 0, and M is a linear subspace (see Figure 3.4).
[Figure 3.4: a linear variety as the translation of a linear subspace M away from the origin]
Theorem 3.4. The set of all influence functions, namely the elements of H
that satisfy condition (3.4) of Theorem 3.2, is the linear variety ϕ∗ (Z) + T ⊥ ,
where ϕ∗ (Z) is any influence function and T ⊥ is the space perpendicular to
the tangent space.
Proof. Any element l(Z) ∈ T ⊥ must satisfy
Therefore, if we take
The efficient influence function ϕ_eff(Z), if it exists, is the influence func-
tion with the smallest variance matrix; that is, for any influence function
ϕ(Z) ≠ ϕ_eff(Z), var{ϕ(Z)} − var{ϕ_eff(Z)} is nonnegative definite. That an ef-
ficient influence function exists and is unique is now easy to see from the
geometry of the problem.
Theorem 3.5. The efficient influence function is given by
or
                       Beff E{Sθ (Z, θ0 )SθT (Z, θ0 )} = Γ(θ0 ),
which implies
                               Beff = Γ(θ0 )I −1 (θ0 ),
where I(θ_0) = E{S_θ(Z, θ_0) S_θ^T(Z, θ_0)} is the information matrix. Conse-
quently, the efficient influence function is given by
        ϕ_eff(Z) = B_eff S_θ(Z, θ_0) = Γ(θ_0) I^{−1}(θ_0) S_θ(Z, θ_0).
Definition 8. The efficient score is the residual of the score vector with re-
spect to the parameter of interest after projecting it onto the nuisance tangent
space; i.e.,
                  Seff (Z, θ0 ) = Sβ (Z, θ0 ) − Π(Sβ (Z, θ0 )|Λ).
Recall that
Therefore, if we define
then
                        (i) E[ϕeff (Z, θ0 )SβT (Z, θ0 )] = I q×q
and
        β ∈ R^q,   η ∈ R^r,   θ = (β^T, η^T)^T,   θ ∈ R^p,   p = q + r,
        S_β(z) = ∂ log p(z, θ)/∂β |_{θ_0},
        S_η(z) = ∂ log p(z, θ)/∂η |_{θ_0},
        S_θ(z) = ∂ log p(z, θ)/∂θ |_{θ_0} = { S_β^T(z), S_η^T(z) }^T.
Linear subspaces
Tangent space:
• Efficient score
       for any root-n consistent estimator η̂n∗ (β), yields an estimator that is
       asymptotically linear with the efficient influence function.
3. Assume Y1 , . . . , Yn are iid with distribution function F (y) = P (Y ≤ y),
    which is differentiable everywhere with density f(y) = dF(y)/dy. The median
    is defined as β = F^{−1}(1/2). The sample median is defined as
                β̂_n ≈ F̂_n^{−1}(1/2),
    where F̂_n(y) = n^{−1} Σ_{i=1}^n I(Y_i ≤ y) is the empirical distribution function.
    Equivalently, β̂_n is the solution to the m-estimating equation
                Σ_{i=1}^n { I(Y_i ≤ β) − 1/2 } ≈ 0.
(a) Find the influence function for the sample median β̂n .
    Hint: You may assume the following to get your answer.
     (i) β̂_n is consistent; i.e., β̂_n → β_0 = F^{−1}(1/2).
    (ii) Stochastic equicontinuity:
                n^{1/2}{ F̂_n(β̂_n) − F(β̂_n) } − n^{1/2}{ F̂_n(β_0) − F(β_0) } −→^P 0.
(b) Let Y1 , . . . , Yn be iid N (µ, σ 2 ), µ ∈ R, σ 2 > 0. Clearly, for this model, the
    median β is equal to µ. Verify, by direct calculation, that the influence
    function for the sample median satisfies the two conditions of Corollary
    1.
4
Semiparametric Models
θ = (β^T, η^T)^T,   β ∈ R^q,   η ∈ R^r,   p = q + r,
β being the parameter of interest and η the nuisance parameter. In this chap-
ter, we will extend this theory to semiparametric models, where the parameter
space for θ is infinite-dimensional.
    For most of the exposition in this book, as well as most of the exam-
ples used throughout, we will consider semiparametric models that can be
represented using the class of densities p(z, β, η), where β, the parameter of
interest, is finite-dimensional (q-dimensional); η, the nuisance parameter, is
infinite-dimensional; and β and η are variationally independent – that is, any
choice of β and η in a neighborhood about the true β0 and η0 would result
in a density p(z, β, η) in the semiparametric model. This will allow us, for
example, to explicitly define partial derivatives
                ∂p(z, β, η_0)/∂β |_{β=β_0} = ∂p(z, β_0, η_0)/∂β.
Keep in mind, however, that some problems lend themselves more naturally
to models represented by the class of densities p(z, θ), where θ is infinite-
dimensional and the parameter of interest, β q×1 (θ), is a smooth q-dimensional
function of θ. When the second representation is easier to work with, we will
make the distinction explicit.
   In Chapter 1, we gave two examples of semiparametric models:
(i) Restricted moment model
                                     Yi = µ(Xi , β) + εi ,
                                        E(εi |Xi ) = 0,
     or equivalently
                                  E(Yi |Xi ) = µ(Xi , β).
(ii) Proportional hazards model
     The hazard of failing at time t, conditional on covariates X, is λ(t|X) = λ(t) exp(β^T X), where the baseline hazard λ(t) is left unspecified.
for all densities “p(·, β, η)” within some semiparametric model and “good”
refers to estimators with small asymptotic variance. All of these ideas will be
made precise shortly.
The asymptotic properties of the GEE estimator follow from the expansion

        0 = Σ_{i=1}^n A(X_i, β̂_n){Y_i − µ(X_i, β̂_n)}
          = Σ_{i=1}^n A(X_i, β_0){Y_i − µ(X_i, β_0)}
            + [ Σ_{i=1}^n Q(Y_i, X_i, β_n^*) − Σ_{i=1}^n A(X_i, β_n^*) D(X_i, β_n^*) ] (β̂_n − β_0),        (4.2)
where
        D(X, β) = ∂µ(X, β)/∂β^T        (4.3)
is the gradient matrix (d × q), made up of all partial derivatives of the d-
elements of µ(X, β) with respect to the q-elements of β, and βn∗ denotes some
intermediate value between β̂n and β0 .
    If we denote the rows of A(Xi , β) by {A1 (Xi , β), . . . , Aq (Xi , β)}, then
Q^{q×q}(Y_i, X_i, β) is the q × q matrix defined by

        Q^{q×q}(Y_i, X_i, β) = ⎛ {Y_i − µ(X_i, β)}^T ∂A_1^T(X_i, β)/∂β^T ⎞
                               ⎜                    ⋮                    ⎟
                               ⎝ {Y_i − µ(X_i, β)}^T ∂A_q^T(X_i, β)/∂β^T ⎠ .
Because

        n^{-1} Σ_{i=1}^n Q(Y_i, X_i, β_n^*) →^P E{Q(Y, X, β_0)} = 0

and

        n^{-1} Σ_{i=1}^n A(X_i, β_n^*) D(X_i, β_n^*) →^P E{A(X, β_0) D(X, β_0)},
we obtain

        n^{1/2}(β̂_n − β_0) = n^{-1/2} Σ_{i=1}^n [E{A(X, β_0) D(X, β_0)}]^{-1} A(X_i, β_0){Y_i − µ(X_i, β_0)} + o_p(1).

The variance of the term A(X_i, β_0){Y_i − µ(X_i, β_0)} is

        var[A(X_i, β_0){Y_i − µ(X_i, β_0)}] = E{A(X_i, β_0) V(X_i) A^T(X_i, β_0)},        (4.5)

where V(X_i) = var(Y_i | X_i).
    In order to use the results above for data analytic applications, such as
constructing confidence intervals for β or for some components of β, we must
also be able to derive consistent estimators for the asymptotic variance of β̂n
given by (4.6). Without going into all the technical details, we now outline the
arguments for constructing such an estimator. These arguments are similar
to those that resulted in the sandwich variance estimator for the asymptotic
variance of an m-estimator given by (3.10) in Chapter 3.
    Suppose, for the time being, that the true value β_0 were known to us. Then, by the law of large numbers, a consistent estimator for E(AD) is given by

        Ê_0(AD) = n^{-1} Σ_{i=1}^n A(X_i, β_0) D(X_i, β_0),        (4.7)
where the subscript “0” is used to emphasize that this statistic is com-
puted with β0 known. As we showed in (4.5), the variance of A(Xi , β0 ){Yi −
µ(Xi , β0 )} is given by E{A(Xi , β0 )V (Xi )AT (Xi , β0 )}, which, by the law of
large numbers, can be consistently estimated by

        Ê_0(AVA^T) = n^{-1} Σ_{i=1}^n A(X_i, β_0){Y_i − µ(X_i, β_0)}{Y_i − µ(X_i, β_0)}^T A^T(X_i, β_0).        (4.8)
Of course, the value β0 is not known to us. But since β̂n is a consistent
estimator for β0 , a natural estimator for the asymptotic variance of β̂n is
given by

        {Ê(AD)}^{-1} Ê(AVA^T) [{Ê(AD)}^{-1}]^T,        (4.9)
where Ê(AD) and Ê(AV AT ) are computed as in equations (4.7) and (4.8),
respectively, with β̂n substituted for β0 . The estimator (4.9) is referred to as
the sandwich estimator for the asymptotic variance. More details about this
methodology can be found in Liang and Zeger (1986).
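As an informal illustration of the construction just described (not code from the text), the following numpy sketch assembles the sandwich matrix (4.9) from per-observation quantities evaluated at β̂_n; the function name, the list-based input format, and the final division by n to obtain the variance of β̂_n itself are our own conventions.

```python
# Hedged sketch of the sandwich variance (4.9): inputs are lists of per-observation
# matrices A_i = A(X_i, beta_hat) (q x d), D_i = D(X_i, beta_hat) (d x q), and
# residual vectors r_i = Y_i - mu(X_i, beta_hat) (length d).
import numpy as np

def sandwich_variance(A_list, D_list, resid_list):
    """Return {E_hat(AD)}^{-1} E_hat(A V A^T) [{E_hat(AD)}^{-1}]^T."""
    n = len(A_list)
    bread = sum(A @ D for A, D in zip(A_list, D_list)) / n            # E_hat(AD), q x q
    meat = sum((A @ r[:, None]) @ (A @ r[:, None]).T                  # E_hat(A V A^T), q x q
               for A, r in zip(A_list, resid_list)) / n
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ meat @ bread_inv.T                             # divide by n for var(beta_hat)
```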
    The results above did not depend on any specific parametric assumptions
beyond the moment restriction and regularity conditions. Consequently, the
estimator, given as the solution to equation (4.1), is a semiparametric estima-
tor for the restricted moment model.
The log transformation guarantees that the conditional mean response, given
the covariates, is always positive. Consequently, this model puts no restrictions
on the possible values that β can take.
    With a sample of iid data (Yi , Xi ), i = 1, . . . , n, a semiparametric estima-
tor for β can be obtained as the solution to a generalized estimating equation
given by (4.1). In this example, the response variable Y is a single random vari-
able; hence d = 1. If we take Aq×1 (X, β) in (4.1) to equal (1, X1 , . . . , Xq−1 )T ,
then the corresponding GEE estimator β̂_n = (α̂_n, . . . , δ̂_{(q−1)n})^T is the solution to

        Σ_{i=1}^n (1, X_i^T)^T {Y_i − exp(α + δ_1 X_{1i} + · · · + δ_{q−1} X_{(q−1)i})} = 0^{q×1}.        (4.11)

        Ê(AVA^T) = n^{-1} Σ_{i=1}^n (1, X_i^T)^T {Y_i − µ(X_i, β̂_n)}^2 (1, X_i^T).        (4.13)
Remark 1. The asymptotic variance is the variance matrix of the limiting nor-
mal distribution to which n1/2 (β̂n − β0 ) converges. That is, the asymptotic
variance is equal to the (q × q) matrix Σ, where n^{1/2}(β̂_n − β_0) →^D N(0, Σ).
The estimator for the asymptotic variance is denoted by Σ̂_n and is given by

        Σ̂_n = {Ê(AD)}^{-1} Ê(AVA^T) [{Ê(AD)}^{-1}]^T,        (4.14)

where Ê(AD) and Ê(AVA^T) are defined by (4.12) and (4.13), respectively.
    For practical applications, say, when we are constructing confidence inter-
vals for δj , the regression coefficient for the j-th covariate Xj , j = 1, . . . , q −1,
we must be careful to use the appropriate scaling factor when computing the
estimated standard error for δ̂jn . That is, the 95% confidence interval for δj
is given by
        δ̂_jn ± 1.96 se(δ̂_jn),

and se(δ̂_jn) = {n^{-1}(Σ̂_n)_{(j+1)(j+1)}}^{1/2}, where (·)_{(j+1)(j+1)} denotes the (j + 1)-th diagonal element of the q × q matrix (·)^{q×q}.
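To make the preceding recipe concrete, here is a self-contained sketch (ours, not the book's) for the log-linear model: it solves the estimating equation (4.11) numerically, forms the sandwich matrix from Ê(AD) and Ê(AVA^T) as in (4.13), and reports a 95% interval for one regression coefficient. The simulated Poisson responses, the use of scipy's root finder, and all variable names are illustrative assumptions.

```python
# Hedged end-to-end sketch: log-linear GEE fit (4.11) plus sandwich confidence interval.
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(1)
n, q = 500, 3                                    # intercept plus q - 1 = 2 covariates
X = rng.normal(size=(n, q - 1))
Xstar = np.column_stack([np.ones(n), X])         # rows are (1, X_i^T)
beta_true = np.array([1.0, 0.5, -0.3])
Y = rng.poisson(np.exp(Xstar @ beta_true))       # any distribution with this mean would do

mu = lambda b: np.exp(Xstar @ b)
gee = lambda b: Xstar.T @ (Y - mu(b))            # estimating equation (4.11)
beta_hat = root(gee, x0=np.zeros(q)).x

r = Y - mu(beta_hat)                             # residuals at beta_hat
E_AD = (Xstar * mu(beta_hat)[:, None]).T @ Xstar / n     # E_hat(AD); here A = (1, X^T)^T, D = mu (1, X^T)
E_AVA = (Xstar * (r ** 2)[:, None]).T @ Xstar / n        # E_hat(A V A^T) as in (4.13)
Sigma_hat = np.linalg.inv(E_AD) @ E_AVA @ np.linalg.inv(E_AD).T

j = 1                                            # first regression coefficient delta_1
se = np.sqrt(Sigma_hat[j, j] / n)                # note the n^{-1} scaling discussed above
print(beta_hat[j] - 1.96 * se, beta_hat[j] + 1.96 * se)
```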
    The GEE estimator for β in the log-linear model, given as the solution to
equation (4.11), is just one example of many possible semiparametric estima-
tors. Clearly, we would be interested in finding a semiparametric estimator
that is as efficient as possible; i.e., with as small an asymptotic variance as
possible. We will address this issue later in this chapter.
    Some natural questions that arise for semiparametric models are:
p0 (z) = p(z, β0 , γ0 ).
Remark 3. The terms parametric submodel and parametric model can be con-
fusing. A parametric model is a model whose probability densities are charac-
terized through a finite number of parameters that the data analyst believes
will suffice in identifying the probability distribution that generates the data.
For example, we may be willing to assume that our data follow the model
Yi = µ(Xi , β) + εi , (4.15)
We note that:
• In this model, only the (q + r) parameters (β^T, γ^T)^T are left unspecified. Hence, this model is indeed a finite-dimensional (parametric) model.
• For any choice of β and γ, the resulting density follows a proportional
  hazards model and is therefore contained in the semiparametric model;
  i.e.,
                                 Pβ,γ ⊂ P.
• The truth is obtained by setting β = β0 and γ = 0.
• This parametric submodel is defined using λ0 (t), “the truth,” which is not
  known to us; consequently, such a model is not useful for data analysis.
Contrast this with the case where we are willing to consider the parametric
model; namely
                   λ(t|X) = λ exp(β T X), λ, β unknown.
That is, we assume that the underlying baseline hazard function is constant
over time; i.e., conditional on X, the survival distribution follows an expo-
nential distribution. If we are willing to assume that our data are generated
from some distribution within this parametric model, then we only need to
estimate the parameters λ and β and use this for any subsequent data analy-
sis. Of course, the disadvantage of such a parametric model is that if the data
are not generated from any density within this class, then the estimates we
obtain may be meaningless.
    where
        S_γ^{r×1} = ∂ log p(z, β_0, γ_0)/∂γ.
(ii) The efficient influence function for the parametric submodel is given by

         ϕ^{eff}_{β,γ}(Z) = [E{S^{eff}_{β,γ}(Z) S^{eff T}_{β,γ}(Z)}]^{-1} S^{eff}_{β,γ}(Z, β_0, γ_0),

     where S^{eff}_{β,γ}(Z, β_0, γ_0), the parametric submodel efficient score, is

         S^{eff}_{β,γ}(Z, β_0, γ_0) = S_β(Z, β_0, γ_0) − Π{S_β(Z, β_0, γ_0) | Λ_γ}

     and

         S_β^{q×1}(Z, β_0, η_0) = ∂ log p(z, β_0, η_0)/∂β.
(iii) The smallest asymptotic variance among such RAL estimators for β in
      the parametric submodel is
        [E{S^{eff}_{β,γ}(Z) S^{eff T}_{β,γ}(Z)}]^{-1}.
for all p(z; β, γ) ∈ Pβ,γ ⊂ P. However, the converse may not be true. Conse-
quently, the class of semiparametric estimators must be contained within the
class of estimators for a parametric submodel. Therefore:
Hence, the variance of the influence function for any semiparametric estimator
for β must be greater than or equal to
        sup_{all parametric submodels} [E{S^{eff}_{β,γ} S^{eff T}_{β,γ}}]^{-1}.        (4.16)
        ‖h(Z) − B_j S_{γ_j}(Z)‖^2 → 0 as j → ∞

for a sequence of parametric submodels indexed by j], where ‖h(Z)‖^2 = E{h^T(Z) h(Z)}. □
Remark 4. The Hilbert space H is also a metric space (i.e., a set of elements
where a notion of distance between elements of the set is defined). For any
two elements h_1, h_2 ∈ H, we can define the distance between them as
‖h_2 − h_1‖ = [E{(h_2 − h_1)^T (h_2 − h_1)}]^{1/2}. The closure of a set S,
where, in this setting, a set consists of q-dimensional random functions with
mean zero and finite variance, is defined as the smallest closed set that contains
S, or equivalently, as the set of all elements in S together with all the limit
points of S. The closure of S is denoted by S̄. Thus the closure of a set is itself
a closed set. Limits must be defined in terms of a distance between elements.
The word mean-square is used because limits are taken with respect to the
distance, which in this case is the square root of the expected sum of squared
differences between the q-components of the two elements (i.e., between the
two q-dimensional random functions). The mean-square closure is therefore larger than, and contains, the union of all parametric submodel nuisance tangent spaces. Hence, if we denote by S the union of all parametric submodel nuisance tangent spaces, then Λ = S̄ is the semiparametric nuisance tangent space. □
There is no difficulty with this definition since the nuisance tangent space Λ
is a closed linear subspace and therefore the projection Π{Sβ (Z, β0 , η0 )|Λ}
exists and is unique. 
Proof. For simplicity, we take β to be a scalar (i.e., q = 1), although this can
be extended to q > 1 using arguments in Section 3.4, where a generalization of
the Pythagorean theorem to dimension q > 1 was derived (see (3.31)). Denote
by V the semiparametric efficiency bound, which, when q = 1, is defined by
        sup_{all parametric submodels P_{β,γ}} ‖S^{eff}_{β,γ}(Z)‖^{-2} = V,
where
        S^{eff}_{β,γ}(Z) = S_β(Z) − Π(S_β(Z)|Λ_γ).

Since Λ_γ ⊂ Λ, this implies that ‖S_eff(Z)‖ ≤ ‖S^{eff}_{β,γ}(Z)‖ for all parametric submodels P_{β,γ}. Hence

        ‖S_eff(Z)‖^{-2} ≥ sup_{all P_{β,γ}} ‖S^{eff}_{β,γ}(Z)‖^{-2} = V.        (4.17)
To complete the proof of the theorem, we need to show that ‖S_eff(Z)‖^{-2} is also less than or equal to V. But because Π(S_β(Z)|Λ) ∈ Λ, this means that there exists a sequence of parametric submodels P_{β,γ_j} with nuisance score vectors S_{γ_j}(Z) such that

        ‖Π(S_β(Z)|Λ) − B_j S_{γ_j}(Z)‖^2 → 0 as j → ∞.
        V^{-1} ≤ ‖S^{eff}_{β,γ_j}(Z)‖^2 = ‖S_β(Z) − Π(S_β(Z)|Λ_{γ_j})‖^2 ≤ ‖S_β(Z) − B_j S_{γ_j}(Z)‖^2
               = ‖S_β(Z) − Π(S_β(Z)|Λ)‖^2 + ‖Π(S_β(Z)|Λ) − B_j S_{γ_j}(Z)‖^2.        (4.18)
Because Sβ (Z) − Π(Sβ (Z)|Λ) is orthogonal to Λ and Π(Sβ (Z)|Λ) − Bj Sγj (Z)
is an element of Λ, the last equality in (4.18) follows from the Pythagorean
theorem. Taking j → ∞ implies
        ‖S_β(Z) − Π(S_β(Z)|Λ)‖^2 = ‖S_eff(Z)‖^2 ≥ V^{-1}

or

        ‖S_eff(Z)‖^{-2} ≤ V.

Together with (4.17), we conclude that ‖S_eff(Z)‖^{-2} = V. □
                                                                       
        ⟨ϕ, h⟩ = ⟨ϕ, B_j S_{γ_j}⟩ + ⟨ϕ, h − B_j S_{γ_j}⟩.

        |⟨ϕ, h⟩| ≤ ‖ϕ‖ ‖h − B_j S_{γ_j}‖.
Clearly, ϕeff (Z, β0 , η0 ) satisfies conditions (i) and (ii) above and moreover
has variance matrix E{ϕ_eff(Z) ϕ_eff^T(Z)} = V, where V is the semiparametric
efficiency bound.      
Theorem 4.3. If a semiparametric RAL estimator for β exists, then the influ-
ence function of this estimator must belong to the space of influence functions, the linear variety {ϕ(Z) + T^⊥}, where ϕ(Z) is the influence function of any
semiparametric RAL estimator for β and T is the semiparametric tangent
space, and if an RAL estimator for β exists that achieves the semiparametric
efficiency bound (i.e., a semiparametric efficient estimator), then the influence
function of this estimator must be the unique and well-defined element
Theorem 4.4. The tangent space (i.e., the mean-square closure of all para-
metric submodel tangent spaces) is the entire Hilbert space H.
where
                      ∂ log p(z, θ0 )
           Sθ (z) =                   .
                            ∂θ
Denote the truth as p0 (z) = p(z, θ0 ). From the usual properties of score vec-
tors, we know that
                             E{Sθ (Z)} = 0s×1 .
Consequently, the linear subspace Λθ ⊂ H.
Reminder: When we write E{Sθ (Z)}, we implicitly mean that the expectation
is computed with respect to the truth; i.e.,
This guarantees that p(z, θ), for θ in some neighborhood of the truth, is a
proper density function. For this parametric submodel, the score vector is
If we choose B q×q to be I q×q (i.e., the q ×q identity matrix), then h(Z), which
also equals I q×q h(Z), is an element of this parametric submodel tangent space.
    Thus we have shown that the tangent space contains all bounded mean-
zero random vectors. The proof is completed by noting that any element of
H can be approximated by a sequence of bounded h.         
where
                 p_{Z^(j)|Z^(1),...,Z^(j−1)}(z^(j)|z^(1), . . . , z^(j−1))        (4.20)
is the conditional density of Z (j) given Z (1) , . . . , Z (j−1) , defined with respect
to the dominating measure νj . If we put no restrictions on the density of Z
(i.e., the nonparametric model) or, equivalently, put no restrictions on the
conditional densities above, then the j-th conditional density (4.20) is any
positive function η_j(z^(1), . . . , z^(j)) such that

        ∫ η_j(z^(1), . . . , z^(j)) dν_j(z^(j)) = 1
Tγ = Tγ1 ⊕ . . . ⊕ Tγm ,
where
      Tγj = {B q×sj Sγj (Z (1) , . . . , Z (j) ), for all constant matrices B q×sj }.
    It is now easy to verify that the tangent space T , the mean-square closure
of all parametric submodel tangent spaces, is equal to
                                         T = T1 ⊕ . . . ⊕ Tm ,
where Tj , j = 1, . . . , m, is the mean-square closure of parametric submodel
tangent spaces for ηj (·), where a parametric submodel for ηj (·) is given by the
class of conditional densities
        P_{γ_j} = {p(z^(j)|z^(1), . . . , z^(j−1), γ_j), γ_j say s_j-dimensional}
and the parametric submodel tangent space Tγj is the linear space spanned
by the score vector Sγj (Z (1) , . . . , Z (j) ).
   We now are in a position to derive the following results regarding the
partition of the Hilbert space H into a direct sum of orthogonal subspaces.
Theorem 4.5. The tangent space T for the nonparametric model, which we
showed in Theorem 4.4 is the entire Hilbert space H, is equal to
                                     T = H = T1 ⊕ . . . ⊕ Tm ,
where

        T_1 = {α_1^{q×1}(Z^(1)) ∈ H : E{α_1^{q×1}(Z^(1))} = 0^{q×1}}

and

        T_j = {α_j^{q×1}(Z^(1), . . . , Z^(j)) ∈ H : E{α_j^{q×1}(Z^(1), . . . , Z^(j)) | Z^(1), . . . , Z^(j−1)} = 0^{q×1}}, j = 2, . . . , m.        (4.21)
Proof. That the partition of the tangent space Tj associated with the nuisance
parameter ηj (·) is the set of elements given by (4.21) follows by arguments
similar to those for the proof of Theorem 4.4. That is, because of properties of
score functions for parametric models of conditional densities, the score vector
Sγj (·) must be a function only of Z (1) , . . . , Z (j) and must have conditional
expectation
pj (z (j) |z (1) , . . . , z (j−1) , θj ) = p0j (z (j) |z (1) , . . . , z (j−1) ){1+θjT αj (z (1) , . . . , z (j) )},
                                                                                                            (4.24)
where p0j (z (j) |z (1) , . . . , z (j−1) ) denotes the true conditional density of Z (j)
given Z (1) , . . . , Z (j−1) and θj is a q-dimensional parameter chosen sufficiently
small to guarantee that pj (z (j) |z (1) , . . . , z (j−1) , θj ) is positive. This class of
functions is clearly a parametric submodel since

        ∫ p_{0j}(z^(j)|z^(1), . . . , z^(j−1)){1 + θ_j^T α_j(z^(1), . . . , z^(j))} dν_j(z^(j)) = 1,
where (4.25) is equal to θjT E{αj (Z (1) , . . . , Z (j) )|Z (1) , . . . , Z (j−1) }, which must
equal zero by the definition of Tj . The score vector for the parametric sub-
model (4.24) is
Thus we have shown that the tangent space for every parametric submodel
of ηj (·) is contained in Tj and every bounded element in Tj belongs to the
tangent space for some parametric submodel of ηj (·). The argument is com-
pleted by noting that every element of Tj is the limit of bounded elements of
Tj .
     That the projection of any element h ∈ H onto Tj is given by hj (·),
defined by (4.23), can be verified directly. Clearly hj (·) ∈ Tj . Therefore, by
the projection theorem for Hilbert spaces, we only need to verify that h − hj
is orthogonal to every element of T_j. Consider an arbitrary element l_j ∈ T_j.
Then, because hj and lj are functions of Z (1) , . . . , Z (j) , we can use the law of
iterated conditional expectations to obtain
Note that (4.26) follows from the definition of hj . The equality in (4.27) follows
because lj ∈ Tj , which, in turn, implies that E(lj |Z (1) , . . . , Z (j−1) ) = 0.
    Finally, in order to prove that T_j, j = 1, . . . , m, are mutually orthogonal subspaces, we must show that h_j is orthogonal to h_{j'}, where h_j ∈ T_j, h_{j'} ∈ T_{j'} and j ≠ j', j, j' = 1, . . . , m. This follows using the law of iterated conditional expectations, which we leave for the reader to verify. □
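As a concrete illustration of this partition (ours, not part of the text's proof), take m = 2. Any mean-zero h ∈ H decomposes into its T_1 and T_2 pieces as

\[
h(Z^{(1)},Z^{(2)}) \;=\; \underbrace{E\{h \mid Z^{(1)}\}}_{\in\,T_1} \;+\; \underbrace{h - E\{h \mid Z^{(1)}\}}_{\in\,T_2},
\]

and the two pieces are orthogonal because, by iterated conditional expectations,

\[
E\!\left[\{h - E(h \mid Z^{(1)})\}^{T}\,\alpha_1(Z^{(1)})\right]
 = E\!\left[E\{h - E(h \mid Z^{(1)}) \mid Z^{(1)}\}^{T}\,\alpha_1(Z^{(1)})\right] = 0
\quad\text{for every } \alpha_1 \in T_1 .
\]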
Y = µ(X, β) + ε,
where
                                    E(ε|X) = 0.
    We will assume, for the time being, that the d-dimensional response vari-
able Y is continuous; i.e., the dominating measure is the Lebesgue measure,
which we will denote by ℓ_Y. It will be shown later how this can be generalized
to more general dominating measures that will also allow Y to be discrete.
The covariates X may be continuous, discrete, or mixed, and we will denote
the dominating measure by νX .
    The observed data are assumed to be realizations of the iid random vec-
tors (Z1 , . . . , Zn ), where Zi = (Yi , Xi ). Our aim is to find semiparametric
estimators for β and identify, if possible, the most efficient semiparametric
estimator.
    The density of a single observation, denoted by p(z), belongs to the semi-
parametric model

                         P = [ p{z, β, η(·)}, z = (y, x) ],
defined with respect to the dominating measure ℓ_Y × ν_X. The truth (i.e., the
density that generates the data) is denoted by p0 (z) = p{z, β0 , η0 (·)}. Because
there is a one-to-one transformation of (Y, X) and (ε, X), we can express the
density
                       pY,X (y, x) = pε,X {y − µ(x, β), x},                  (4.28)
where p_{ε,X}(ε, x) is a density with respect to the dominating measure ℓ_ε × ν_X.
  The restricted moment model only makes the assumption that
E(ε|X) = 0.
    The functions η_1(ε, x) and η_2(x), satisfying the constraints (4.30), (4.31), and (4.32), are infinite-dimensional and can be used to characterize the semiparametric model as
for

        (β^T, γ_1^T, γ_2^T)^T ∈ Ω_{β,γ} ⊂ R^{q+r}}.
Therefore,
and
ε = y − µ(x, β0 ).
and
        Λ_{γ_2} = {B^{q×r_2} S_{γ_2}(X) for all B^{q×r_2}}.        (4.35)
   It is easy to show that the space Λγ1 is orthogonal to the space Λγ2 , as we
demonstrate in the following lemma.
Lemma 4.1. The space Λγ1 defined by (4.34) is orthogonal to the space Λγ2
defined by (4.35).
E{Sγ2 (X)} = 0.
Consequently,
   Convince yourself that (4.36) suffices to show that every element of Λγ1 is
orthogonal to every element of Λ_{γ_2}. □
   We now show how to explicitly derive the spaces Λ1s , Λ2s and the space
orthogonal to the nuisance tangent space Λ⊥ .
Theorem 4.6. The space Λ2s consists of all q-dimensional mean-zero func-
tions of X with finite variance.
We illustrate below.
E{Sγ2 (X)} = 0.
   Therefore, any element of Λγ2 = {B q×r2 Sγ2 (X) for all B} is a q-dimensional
function of X with mean zero. It may be reasonable to guess that Λ2s , the
mean-square closure of all Λγ2 , is the linear subspace of all q-dimensional
mean-zero functions of X.
Theorem 4.7. The space Λ1s is the space of all q-dimensional random func-
tions a(ε, x) that satisfy
and
Proof. The space Λ1s is the mean-square closure of all parametric submodel
nuisance tangent spaces Λγ1 , where
and
                                       ∂ log pε|X (ε|x, γ10 )
                        Sγ1 (ε, x) =                          .
                                                ∂γ1
Recall: We use ε to denote {y − µ(x, β_0)}. □

Consequently, any element of Λ_{γ_1} = {B^{q×r_1} S_{γ_1}(ε, X)}, say a(ε, X), must satisfy
and
for some bounded function a(ε, X) satisfying (4.38) and (4.39) and a q-
dimensional parameter γ1 chosen sufficiently small so that
This parametric submodel contains the truth; i.e., (γ1 = 0). Also, the class of
densities in this submodel consists of proper densities,

        ∫ p_{ε|X}(ε|x, γ_1) dε = 1 for all x, γ_1,
and E(ε|X) = 0,

        ∫ ε p_{ε|X}(ε|x, γ_1) dε = 0^{d×1} for all x, γ_1.
   The nuisance tangent space for the semiparametric model is Λ = Λ1s ⊕Λ2s .
Note that Λ_{1s} is the intersection of two linear subspaces; namely,

        Λ_{1sa} = {a_a^{q×1}(ε, X) : E{a_a^{q×1}(ε, X)|X} = 0^{q×1}},

and

        Λ_{1sb} = {a_b^{q×1}(ε, X) : E{a_b^{q×1}(ε, X) ε^T |X} = 0^{q×d}}.
Lemma 4.3.
                                 Λ1sa = Λ⊥
                                         2s .
Lemma 4.4.
                                 Λ2s ⊂ Λ1sb .
and
Lemma 4.5.
                       Λ = Λ2s ⊕ (Λ1sa ∩ Λ1sb ) = Λ1sb .
To complete the proof, we must show that any element h ∈ H can be written
as h = h1 ⊕ h2 , where h1 ∈ Λ2s and h2 ∈ Λ1sa . We write h = E(h|X) + {h −
E(h|X)}. That E(h|X) ∈ Λ2s and {h − E(h|X)} ∈ Λ1sa follow immediately.
The result above also implies that Π(h|Λ2s ) = E(h|X) and Π(h|Λ1sa ) =
{h − E(h|X)}.    
which follows from the model restriction E(εT |X) = 01×d . Hence α(X) ∈ Λ1sb .
Conversely, let h be any element of Λ1sb . Since by Lemmas 4.3 and 4.4
E(h|X) ∈ Λ2s ⊂ Λ1sb , this implies that {h − E(h|X)} ∈ Λ1sb since Λ1sb is a
linear space. Therefore h can be written as E(h|X) + {h − E(h|X)}, where
E(h|X) ∈ Λ2s and {h − E(h|X)} ∈ Λ1sb . But by Lemma 4.3, {h − E(h|X)} is
also an element of Λ1sa and hence {h − E(h|X)} ∈ (Λ1sa ∩ Λ1sb ), completing
the proof. 
   Consequently, we have shown that the nuisance tangent space Λ for the
semiparametric restricted moment model is given by
The key to deriving the space of influence functions is first to identify elements
of the Hilbert space that are orthogonal to Λ. Equivalently, the space Λ⊥ is
the linear space of residuals
for all
                                  h(ε, X) ∈ H.
Using (4.40), this equals
But, by the definition of Λ_{1sb}, E{a_b(ε, X) ε^T |X} = 0^{q×d} or, equivalently, E{a_{bj}(ε, X) ε_{j'} |X} = 0 for all j = 1, . . . , q and j' = 1, . . . , d, where a_{bj}(ε, X) is the j-th element of a_b(ε, X) and ε_{j'} is the j'-th element of ε. Consequently,
the inner expectation of (4.46), which can be written as

        Σ_{j,j'} A_{jj'}(X) E{a_{bj}(ε, X) ε_{j'} |X},

where A_{jj'}(X) is the (j, j')-th element of A(X), must also equal zero. This,
in turn, proves (4.45).
    Now that we have shown the orthogonality of the spaces Λ1sb and the space
(4.42), in order to prove that the space (4.42) is the orthogonal complement
of Λ1sb , it suffices to show that any h ∈ H can be written as h1 + h2 , where
h1 ∈ (4.42) and h2 ∈ Λ1sb . Or, equivalently, for any h ∈ H, there exists
g q×d (X) such that
                           {h(ε, X) − g(X)ε} ∈ Λ1sb .                   (4.47)
That such a function g(X) exists follows by solving the equation
or
                         E(hεT |X) − g(X)E(εεT |X) = 0,
which yields
                         g(X) = E(hεT |X){E(εεT |X)}−1 ,
where, to avoid any technical difficulties, we will assume that the conditional
variance matrix E(εεT |X) is positive definite and hence invertible. 
   We have thus demonstrated that, for the semiparametric restricted mo-
ment model, any element of the Hilbert space perpendicular to the nuisance
tangent space is given by
Influence functions of RAL estimators for β (i.e., ϕ(ε, X)) are normalized
versions of elements perpendicular to the nuisance tangent space. That is,
the space of influence functions, as well as being orthogonal to the nuisance
tangent space, must also satisfy condition (i) of Theorem 4.2, namely that
where Sβ (ε, X) is the score vector with respect to the parameter β and I q×q
is the q × q identity matrix. If we start with any A(X), and define ϕ(ε, X) =
C q×q A(X)ε, where C q×q is a q ×q constant matrix (i.e., normalization factor),
then condition (i) of Theorem 4.2 is satisfied by solving
or
                        C = [E{A(X)εSβT (ε, X)}]−1 .                      (4.49)
   Since a typical element orthogonal to the nuisance tangent space is given
by A(X){Y − µ(X, β0 )}, and since a typical influence function is given by
CA(X){Y − µ(X, β0 )}, where C is defined by (4.49), this motivates us to
consider an m-estimator for β of the form

        Σ_{i=1}^n C A(X_i){Y_i − µ(X_i, β)} = 0.
where
and
η2 (x) = pX (x).
If we fix the nuisance parameter at the truth (i.e., η_{10}(ε, x) and η_{20}(x)), then we obtain

        ∂/∂β^T [ ∫ {y − µ(x, β)} η_{10}{y − µ(x, β), x} dy ] |_{β=β_0} = 0.
where
        D(X) = ∂µ(X, β_0)/∂β^T.

By (4.50) and (4.52), we obtain that the efficient score is

        S_eff(Y, X, β_0) = D^T(X) V^{-1}(X){Y − µ(X, β_0)},        (4.53)

leading to the optimal estimating equation

        Σ_{i=1}^n D^T(X_i) V^{-1}(X_i){Y_i − µ(X_i, β)} = 0        (4.54)

(optimal GEE).
    We also note that the normalization constant matrix C given in (4.49) in-
volves the expectation E{A(X)εSβT (ε, X)}, which by a conditioning argument
can be derived as E[A(X)E{εSβT (ε, X)|X}], which equals E{A(X)D(X)} by
(4.52). Hence, C = [E{A(X)D(X)}]−1 . This implies that a typical influence
function is given by
which is the influence function for the GEE estimator given in (4.4). Simi-
larly, the efficient influence function can be obtained by using the appropriate
normalization constant with the efficient score (4.53) to yield
This, of course, is also the variance of the efficient influence function (4.54).
variable. Strictly speaking, the response variable CD4 count used in the log-
linear model example of Section 4.1 is not a continuous variable. A more
obvious example is when we have a binary response variable Y taking on
values 1 (response) or 0 (nonresponse). A popular model for modeling the
probability of response as a function of covariates X is the logistic regression
model. In such a model, we assume
                                                exp(β T X ∗ )
                        P (Y = 1|X) =                           ,
                                              1 + exp(β T X ∗ )
where X ∗ = (1, X T )T , allowing the introduction of an intercept term. Since
Y is a binary indicator, this implies that E(Y |X) = P (Y = 1|X), and hence
the logistic regression model is just another example of a restricted moment
model with µ(X, β) = exp(β^T X*)/{1 + exp(β^T X*)}.
    The difficulty that occurs when the response variable Y is not a continuous
random variable is that the transformed variable Y − µ(X, β) may no longer
have a dominating measure that allows us to define densities. In order to
address this problem, we will work directly with densities defined on (Y, X),
namely p(y, x) with respect to some dominating measure νY × νX . As you will
see, many of the arguments developed previously will carry over to this more
general setting.
    We start by first deriving the nuisance tangent space. As before, we need to
find parametric submodels. Let p(y, x) be written as p(y|x) p(x), where p(y|x)
is the conditional density of Y given X and p(x) is the marginal density of X,
and denote the truth as p0 (y, x) = p0 (y|x)p0 (x). The parametric submodel
can be written generically as
                                  p(y|x, β, γ1 )p(x, γ2 ),
where for some β0 , γ10 , γ20
                                p0 (y|x) = p(y|x, β0 , γ10 )
and
                                    p0 (x) = p(x, γ20 ).
The parametric submodel nuisance tangent space is the space spanned by
the score vector with respect to the nuisance parameters γ1 and γ2 . As in
the previous section, the parametric submodel nuisance tangent space can be
written as the direct sum of two orthogonal spaces
                     Λ γ1 ⊕ Λ γ2 , Λ γ1 ⊥ Λ γ2 ,
where
        Λ_{γ_1} = {B^{q×r_1} S_{γ_1}(Y, X) for all B^{q×r_1}},
        Λ_{γ_2} = {B^{q×r_2} S_{γ_2}(X) for all B^{q×r_2}},
                                ∂ log p(y|x, β0 , γ10 )
                Sγ1 (y, x) =                            ,
                                        ∂γ1
and
                                         ∂ log p(x, γ20 )
                             Sγ2 (x) =                    .
                                               ∂γ2
   Hence, the semiparametric nuisance tangent space Λ equals Λ1s ⊕ Λ2s ,
Λ1s ⊥ Λ2s , where
and
and

        ∫ y p(y|x, β_0, γ_1) dν(y) = µ(x, β_0)    for all x, γ_1.        (4.58)
Using standard arguments, where we take derivatives of (4.57) and (4.58) with
respect to γ1 , interchange integration and differentiation, divide and multiply
by p(y|x, β_0, γ_1), and set γ_1 at the truth, we obtain

        ∫ S_{γ_1}(y, x) p_0(y|x) dν(y) = 0^{r_1×1} for all x

and

        ∫ y S_{γ_1}^T(y, x) p_0(y|x) dν(y) = 0^{d×r_1}    for all x.
That is, E{Sγ1 (Y, X)|X} = 0r1 ×1 and E{Y SγT1 (Y, X)|X} = 0d×r1 . This
implies that any element of Λγ1 , namely B q×r1 Sγ1 (Y, X), would satisfy
E{B q×r1 Sγ1 (Y, X)|X} = 0q×1 and E{B q×r1 Sγ1 (Y, X)Y T |X} = 0q×d . This
leads us to the conjecture that

        Λ_{1s}^{(conj)} = {a^{q×1}(Y, X) : E{a(Y, X)|X} = 0^{q×1} and E{a(Y, X) Y^T |X} = 0^{q×d}}.
for all y, x. This parametric submodel contains the truth when γ1 = 0 and
satisfies the constraints of the restricted moment model, namely

        ∫ p(y|x, γ_1) dν(y) = 1 for all x, γ_1,

        ∫ y p(y|x, γ_1) dν(y) = µ(x, β_0) for all x, γ_1.
The score vector S_{γ_1}(Y, X) for this parametric submodel is a(Y, X). Also, any element of Λ_{1s}^{(conj)} can be derived as a limit of bounded elements in Λ_{1s}^{(conj)}. Therefore, we have established that any element of a parametric submodel nuisance tangent space Λ_{γ_1} is an element of Λ_{1s}^{(conj)}, and any element of Λ_{1s}^{(conj)} is either an element of a parametric submodel nuisance tangent space or a limit of such elements. Thus, we conclude that Λ_{1s} = Λ_{1s}^{(conj)}.
     Note that Λ1s can be expressed as Λ1sa ∩ Λ1sb , where
and
Therefore the nuisance tangent space Λ = Λ2s ⊕ (Λ1sa ∩ Λ1sb ). This repre-
sentation is useful because Λ2s ⊕ Λ1sa = H (the whole Hilbert space) and
Λ2s ⊂ Λ1sb . Consequently, we use Lemmas 4.3–4.5 to show that the semipara-
metric nuisance tangent space Λ = Λ1sb .
   Using the exact same proof as for Theorem 4.8, we can show that
Since
                             E(Y |X = x) = µ(x, β),
then for any parametric submodel where p(y|x, β, γ_1) satisfies

        ∫ y p(y|x, β, γ_1) dν(y) = µ(x, β) for all x, γ_1        (4.60)
and
where
        D(X) = ∂µ(X, β_0)/∂β^T.
   Equation (4.61) follows because
It still remains to show that a parametric submodel exists that satisfies (4.60).
This is addressed by the following argument.
A class of joint densities for (Y, X) can be defined with dominating measure
νY ×νX that satisfy E(Y |X) = µ(X, β) by considering the conditional density
        E_0{Y exp(cY)|X = x} / E_0{exp(cY)|X = x},        (4.62)

with the constant c chosen, for each x, so that

        E_0{Y exp(cY)|X = x} / E_0{exp(cY)|X = x} = µ(x, β).
where ξ0 and ξ1 are scalar constants. Again, we may not believe that this
captures the true functional relationship of the conditional variance to the
conditional mean but may serve as a useful approximation. Nonetheless, if,
for the time being, we accepted V (x, ξ, β) as a working model, then the param-
eters ξ in V (x, ξ, β) can be estimated separately using the squared residuals
{Yi − µ(Xi , β̂ninitial )}2 , i = 1, . . . , n, where β̂ninitial is some initial consistent
estimator for β. For instance, we can find an initial estimator for β by solving
equation (4.1) using A(X, β) = D(X, β) (which is equivalent to a working vari-
ance V (X) proportional to the identity matrix). Using this initial estimator,
we can then find an estimator for ξ by solving the equation

        Σ_{i=1}^n Q(X_i, ξ, β̂_n^{initial}) [{Y_i − µ(X_i, β̂_n^{initial})}^2 − V(X_i, ξ, β̂_n^{initial})] = 0,
This estimator would only be a consistent estimator for the asymptotic vari-
ance of β̂n if the working variance contained the truth. Otherwise, it would
be asymptotically biased. A consistent estimator for the asymptotic vari-
ance can be obtained by using the sandwich variance given by (4.14) with
A(X) = D^T(X, β̂_n) V^{-1}(X, ξ̂_n, β̂_n^{initial}).
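The two-stage scheme just described can be sketched as follows; this is a hedged illustration, not the book's code. The log-linear mean, the working variance V(x, ξ, β) = ξ_0 + ξ_1 µ^2(x, β), the least-squares choice of Q in the ξ-equation, and all variable names are assumptions made for the example.

```python
# Hedged sketch of the adaptive (two-stage) GEE estimator for a scalar log-linear mean.
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(2)
n = 400
X = rng.normal(size=(n, 2))
Xstar = np.column_stack([np.ones(n), X])
beta0 = np.array([0.5, 0.4, -0.2])
mu = lambda b: np.exp(Xstar @ b)
Y = rng.poisson(mu(beta0)).astype(float)

D = lambda b: mu(b)[:, None] * Xstar                     # rows D(X_i, b) = mu_i (1, X_i^T)

# Stage 1: initial estimator with A(X, b) = D(X, b) (working variance proportional to identity).
beta_init = root(lambda b: D(b).T @ (Y - mu(b)), x0=np.zeros(3)).x

# Stage 2: fit the working-variance parameters xi from squared residuals; regressing
# r_i^2 on (1, mu_i^2) by least squares is one choice of Q in the xi-equation.
r2 = (Y - mu(beta_init)) ** 2
W = np.column_stack([np.ones(n), mu(beta_init) ** 2])
xi = np.linalg.lstsq(W, r2, rcond=None)[0]
V = np.maximum(W @ xi, 1e-6)                             # estimated working variances V(X_i, xi_hat)

# Stage 3: locally efficient GEE with A(X) = D^T(X) V^{-1}(X).
beta_hat = root(lambda b: D(b).T @ ((Y - mu(b)) / V), x0=beta_init).x
print(beta_hat)
```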
for arbitrary Aq×1 (X), and the efficient estimator is obtained by choosing
A(X) = D^T(X) V^{-1}(X), where D(X) = ∂µ(X, β_0)/∂β^T and V(X) = var(Y|X).
Because Y is binary,
        V(X) = var(Y|X) = µ(X, β_0){1 − µ(X, β_0)} = exp(β_0^T X*)/{1 + exp(β_0^T X*)}^2.
Taking derivatives, we also obtain that DT (X) = X ∗ V (X). Hence the optimal
estimator for β can be derived by choosing A(X) = DT (X)V −1 (X) = X ∗ ,
leading us to the optimal estimating equation

        Σ_{i=1}^n X_i* {Y_i − exp(β^T X_i*)/(1 + exp(β^T X_i*))} = 0.        (4.65)
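A minimal numerical sketch of solving (4.65) is given below (ours, not the book's); the simulated design and all names are assumptions. Equation (4.65) coincides with the familiar maximum-likelihood score equation for logistic regression.

```python
# Hedged sketch: solve sum_i X_i* {Y_i - expit(beta^T X_i*)} = 0, i.e. equation (4.65).
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(3)
n = 1000
X = rng.normal(size=(n, 2))
Xstar = np.column_stack([np.ones(n), X])                 # X* = (1, X^T)^T
beta0 = np.array([-0.5, 1.0, 0.75])
expit = lambda t: 1.0 / (1.0 + np.exp(-t))
Y = rng.binomial(1, expit(Xstar @ beta0))

estimating_eq = lambda b: Xstar.T @ (Y - expit(Xstar @ b))
beta_hat = root(estimating_eq, x0=np.zeros(3)).x
print(beta_hat)
```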
where the gradient matrix D(X, β) was derived in Section 4.1 for the log-linear model to be D(X, β) = ∂µ(X, β)/∂β^T = µ(X, β)(1, X^T). Although the semipara-
metric restricted moment model makes no assumptions about the variance
function V (X) = var(Y |X), we argued that to find a locally efficient estima-
tor for β we might want to make some assumptions regarding the function
V (X) and derive an adaptive estimator.
    Since CD4 count is a count, we might be willing to assume that it fol-
lows a Poisson distribution. If indeed the distribution of Y given X fol-
lows a Poisson distribution with mean µ(X, β), then we immediately know
that V (X) = µ(X, β). Although this is probably too strong an assump-
tion to make in general, we may believe that a good approximation is that
the variance function V (X) is at least proportional to the mean; i.e., that
V (X) = σ 2 µ(X, β), where σ 2 is some unknown scale factor. In that case,
DT (X, β)V −1 (X) = σ −2 (1, X T )T and the locally efficient estimator for β
would be the solution to (4.66), which, up to a proportionality constant, would
be the solution to the estimating equation

        Σ_{i=1}^n (1, X_i^T)^T {Y_i − µ(X_i, β)} = 0,        (4.67)
which is the same as the estimator proposed in Section 4.1; see (4.11).
    Thus, we have shown that the locally efficient semiparametric estimator
for β, when the conditional distribution of the response variable given the
covariates follows a Poisson distribution or, more generally, if the conditional
variance of the response variable given the covariates is proportional to the
conditional mean of Y given the covariates, is given by (4.67). For a more
detailed discussion on log-linear models, see McCullagh and Nelder (1989,
Chapter 6).
    If the conditional variance of Y given X is not proportional to the con-
ditional mean µ(X, β), then the estimator (4.67) is no longer semiparametric
ε1 (Y, X, β, ξ) = Y − µ(X, β)
and
                 ε2 (Y, X, β, ξ) = {Y − µ(X, β)}2 − V (X, β, ξ).
With such a representation, it is clear that our model for the conditional mean
and conditional variance of Y given X is equivalent to E{ε(Y, X, β, ξ)|X} = 0.
   This representation also allows us to consider models for the conditional
quantiles of Y as a function of X. For example, suppose we wanted to consider
a model for the median of a continuous random variable Y as a function of
X. Say we wanted a model where we assumed that the conditional median
of Y given X was equal to µ(X, β) and we wanted to estimate β using a
sample of data (Yi , Xi ), i = 1, . . . , n. This could be accomplished by consider-
ing ε(Y, X, β) = I{Y ≤ µ(X, β)} − .5 because the conditional expectation of
ε(Y, X, β) given X is given by
and, by definition, the conditional median is the value µ(X, β) such that the
conditional probability that Y is less than or equal to µ(X, β), given X, is
equal to .5, which would imply that E{ε(Y, X, β)|X} = 0.
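As an illustration of this median formulation (ours, not the book's), a conditional-median model can be fit by minimizing the least-absolute-deviations loss, whose estimating function is exactly ε(Y, X, β) = I{Y ≤ µ(X, β)} − .5; the linear median model, the simulated data, and all names below are assumptions.

```python
# Hedged sketch: median regression via the least-absolute-deviations (check) loss.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 1000
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 0.5 * x + rng.standard_t(df=3, size=n)        # error has conditional median zero

def lad_loss(b):
    return np.abs(y - (b[0] + b[1] * x)).sum()          # sum of absolute residuals

beta_hat = minimize(lad_loss, x0=np.zeros(2), method="Nelder-Mead").x
resid = y - (beta_hat[0] + beta_hat[1] * x)
print(beta_hat, np.mean((resid <= 0) - 0.5))            # sample version of E{eps(Y, X, beta)} ~ 0
```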
for arbitrary A(X) and that the efficient semiparametric estimator for β falls
within this class. For models where we include both the conditional mean and
conditional variance, this leads to the so-called quadratic estimating equations
or GEE2. We give several exercises along these lines.
Y = µ(X, β) + ε,
and
Yi = µ(Xi , β) + εi , i = 1, . . . , n,
α1 + µ(X, β1 ) = α2 + µ(X, β2 ),
where α1 , α2 are any scalar constants and β1 , β2 are values in the parameter
space contained in Rq .
   For example, if we consider a linear model where µ(Xi , β) = XiT β, we
must make sure not to include an intercept term, as this will be absorbed into
the error term εi ; i.e.,
Yi = α + XiT β + εi , i = 1, . . . , n, (5.2)
with mean zero and some common variance σ 2 . What is often not made clear
is that there is an implicit assumption that the covariates Xi , i = 1, . . . , n
are fixed. Consequently, the error terms εi being iid implies that the distri-
bution of εi is independent of Xi . Thus, such models are examples of what
we are now calling the location-shift regression models. In contrast, a linear
restricted moment model can also be written as (5.2), where εi , i = 1, . . . , n
are iid random variables. However, the restricted moment model makes the
assumption that E(εi |Xi ) = 0, which then implies that E(εi ) = 0 but does
not necessarily assume that εi is independent of Xi .
    The location-shift regression model, although semiparametric, is more re-
strictive than the restricted moment model considered in Chapter 4. For ex-
ample, if we consider the linear restricted moment model
Yi = α + β1 Xi1 + · · · + βq Xiq + εi ,
where
                                 E (εi |Xi ) = 0,
or equivalently
                     Yi = β1 Xi1 + · · · + βq Xiq + (α + εi ),
where
                               E{α + εi |Xi } = α,
then this model includes a larger class of probability distributions than the
linear location-shift regression model; namely,
The key to finding semiparametric RAL estimators for β and identifying the
efficient such estimator is to derive the space of influence functions of RAL
estimators for β. This will require us to find the space orthogonal to the
nuisance tangent space. The nuisance tangent space is defined as the mean-
square closure of all parametric submodel nuisance tangent spaces. The nui-
sance tangent space and its orthogonal complement for the semiparametric
location-shift regression model are given as follows.
Theorem 5.1. Using the convention that ε(β) = Y −µ(X, β) and ε = ε(β0 ) =
Y −µ(X, β0 ), the nuisance tangent space for the location-shift regression model
is given by
                                Λ = Λ1s ⊕ Λ2s ,
where

        Λ_{1s} = {a_1^{q×1}(ε) : E{a_1^{q×1}(ε)} = 0^{q×1}},

        Λ_{2s} = {a_2^{q×1}(X) : E{a_2^{q×1}(X)} = 0^{q×1}},
where γ10 and γ20 denote the “truth.” If we fix β at the truth β0 , then pε (ε, γ1 )
allows for an arbitrary marginal density of ε. Consequently, using logic devel-
oped in Chapter 4, the mean-square closure for parametric submodel nuisance
tangent spaces
                      Λγ1 = {B q×r1 Sγ1 (ε) for all B q×r1 },
where S_{γ_1}(ε) = ∂ log p_ε(ε, γ_{10})/∂γ_1, is the space Λ_{1s}, defined as

        Λ_{1s} = {a_1^{q×1}(ε) : E{a_1^{q×1}(ε)} = 0^{q×1}}.
Λ = Λ1s ⊕ Λ2s .
Proof. The projection of any element h ∈ H onto the closed linear space Λ is
the unique element Π(h|Λ) such that the residual h − Π(h|Λ) is orthogonal to
every element in
To verify that (5.4) is true, we first note that Π(h|Λ1s ) ∈ Λ1s and Π(h|Λ2s ) ∈
Λ2s , implying that Π(h|Λ1s ) + Π(h|Λ2s ) ∈ Λ. To complete the proof, we must
show that the inner product
for all a1 ∈ Λ1s and a2 ∈ Λ2s . The inner product (5.5) can be written as
        ⟨h − Π(h|Λ_{1s}), a_1⟩        (5.6)
        − ⟨Π(h|Λ_{2s}), a_1⟩        (5.7)
        + ⟨h − Π(h|Λ_{2s}), a_2⟩        (5.8)
        − ⟨Π(h|Λ_{1s}), a_2⟩.        (5.9)
    In the proof of Lemma 4.3, we showed that for any h(ε, X) ∈ H the projec-
tion of h(ε, X) onto Λ2s is given by Π(h|Λ2s ) = E{h(ε, X)|X}. Similarly, we
can show that Π(h|Λ1s ) = E{h(ε, X)|ε}. Consequently, the space orthogonal
to the nuisance tangent space is given by

        Λ^⊥ = { [h(ε, X) − E{h(ε, X)|ε} − E{h(ε, X)|X}] for all h ∈ H }.        (5.10)
where Gε (X) = E{g(ε, X)|X}, GX (ε) = E{g(ε, X)|ε}, and the function
g(ε, X) should not equal g1 (ε) + g2 (X) in order to ensure that the residual
(5.11) is not trivially equal to zero.
    Because of the independence of ε and X, we obtain, for fixed X = x,
that Gε (x) = E{g(ε, x)} and, for fixed ε = ε∗ , that GX (ε∗ ) = E{g(ε∗ , X)}.
Consequently, consistent and unbiased estimators for Gε (x) and GX (ε∗ ) are
given by

        Ĝ_ε(x) = n^{-1} Σ_{i=1}^n g(ε_i, x),        (5.12)

        Ĝ_X(ε*) = n^{-1} Σ_{i=1}^n g(ε*, X_i),        (5.13)

respectively.
    Because influence functions of RAL estimators for β are proportional to
elements orthogonal to the nuisance space, if we knew Gε (x), GX (ε∗ ), and
E{g(ε, X)}, then a natural estimator for β would be obtained by solving the
estimating equation

        Σ_{i=1}^n [ g{ε_i(β), X_i} − G_ε(X_i) − G_X{ε_i(β)} + E{g(ε, X)} ] = 0^{q×1},
where εi (β) = Yi − µ(Xi , β). Since Gε (x), GX (ε∗ ), and E{g(ε, X)} are not
known, a natural strategy for obtaining an estimator for β is to substitute
estimates of these quantities in the preceding estimating equation, leading to
the estimating equation
        Σ_{i=1}^n [ g{ε_i(β), X_i} − Ĝ_{ε(β)}(X_i) − Ĝ_X{ε_i(β)} + Ê[g{ε(β), X}] ] = 0^{q×1},        (5.14)

where Ĝ_{ε(β)}(X_i) and Ĝ_X{ε_i(β)} are defined by (5.12) and (5.13), respectively, and Ê[g{ε(β), X}] = n^{-1} Σ_{i=1}^n g{ε_i(β), X_i}.
    The estimator for β that solves (5.14) should be a consistent and asymp-
totically normal semiparametric estimator for β with influence function pro-
portional to (5.11). Rather than trying to prove the asymptotic properties
of this estimator, we will instead focus on deriving a class of locally efficient
estimators for β and investigate the asymptotic properties of this class of
estimators.
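For readers who want to see (5.14) in action, here is a hedged numerical sketch with a scalar covariate, a linear mean, and one arbitrary choice g(ε, x) = x·tanh(ε); the data-generating mechanism and all names are assumptions, and no efficiency claim is attached to this particular g.

```python
# Hedged sketch of the plug-in estimating equation (5.14) for mu(x, b) = b*x, q = 1.
import numpy as np
from scipy.optimize import root_scalar

rng = np.random.default_rng(5)
n = 400
x = rng.uniform(0.5, 2.0, size=n)
y = 1.5 * x + rng.laplace(size=n)                # errors independent of x (location-shift model)

g = lambda e, xx: xx * np.tanh(e)                # one choice of g(eps, X)

def eq_514(b):
    e = y - b * x                                # eps_i(beta)
    G_eps = g(e[None, :], x[:, None]).mean(axis=1)   # Ghat_{eps(beta)}(X_i): average over eps_k
    G_X = g(e[:, None], x[None, :]).mean(axis=1)     # Ghat_X{eps_i(beta)}: average over X_k
    E_g = g(e, x).mean()                             # Ehat[g{eps(beta), X}]
    return np.sum(g(e, x) - G_eps - G_X + E_g)

beta_hat = root_scalar(eq_514, bracket=[0.0, 3.0]).root
print(beta_hat)
```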
   Because E{Sε (ε)} = 0, we use (5.11) to deduce that the efficient score is
given by
      Sβ (ε, X) − Π [Sβ |Λ] = −DT (X, β0 )Sε (ε) + E{DT (X, β0 )}Sε (ε)
                                = −{DT (X, β0 ) − E{DT (X, β0 )}}Sε (ε).   (5.16)
Note 1. The sign in (5.16) was reversed, but this is not important, as the
estimator remains the same. 
However, since E{DT (X, β)} is not known to us, a natural strategy is to
substitute an estimator for E{DT (X, β)} in (5.17), leading to the estimator
for β that solves the estimating equation

        Σ_{i=1}^n {D^T(X_i, β) − D̄^T(β)} S_ε{Y_i − µ(X_i, β)} = 0,        (5.18)

where D̄^T(β) = n^{-1} Σ_{i=1}^n D^T(X_i, β).
The function Sε (ε) depends on the underlying density pε (ε), which is unknown
to us. Consequently, we may posit some underlying density for ε to start with,
some working density pε (ε) that may or may not be correct. Therefore, in what
follows, we consider the asymptotic properties of the estimator for β, denoted
by β̂n , which is the solution to equation (5.18) for an arbitrary function of ε,
Sε (ε), which may not be a score function or, for that matter, may not have
mean zero at the truth (i.e., E{S_ε(ε)} = 0 need not hold). To emphasize the fact that we may
not be using the correct score function Sε (ε), we will substitute an arbitrary
function of ε, which we denote by κ(ε), for Sε (ε) in equation (5.18).
    We now investigate (heuristically) the asymptotic properties of the esti-
mator β̂n , which solves (5.18) for an arbitrary function κ(·) substituted for
Sε (·).
Theorem 5.4. The estimator β̂n , which is the solution to the estimating
equation

        Σ_{i=1}^n {D^T(X_i, β) − D̄^T(β)} κ{Y_i − µ(X_i, β)} = 0,        (5.19)
where

        Σ = [E{∂κ(ε)/∂ε}]^{-2} var{κ(ε)} [var{D^T(X, β_0)}]^{-1}.
   Under suitable regularity conditions, the sample average (5.21) will converge in probability to

        E[ {D^T(X, β_0) − E{D^T(X, β_0)}}^{q×1} {∂κ(ε)/∂ε} D(X, β_0)^{1×q} ]        (5.23)
        + E[ {∂D^T(X, β_0)/∂β^T − E{∂D^T(X, β_0)/∂β^T}} κ(ε) ].        (5.24)
Note that

        n^{-1/2} Σ_{i=1}^n {D^T(X_i, β_0) − D̄^T(β_0)} κ(ε_i)
            = n^{-1/2} Σ_{i=1}^n {D^T(X_i, β_0) − D̄^T(β_0)} {κ(ε_i) − κ̄}
            = n^{-1/2} Σ_{i=1}^n {D^T(X_i, β_0) − µ_D} {κ(ε_i) − µ_κ}
              − n^{1/2} {D̄^T(β_0) − µ_D} {κ̄ − µ_κ}
            = n^{-1/2} Σ_{i=1}^n {D^T(X_i, β_0) − µ_D} {κ(ε_i) − µ_κ} + o_p(1),        (5.26)

where

        κ̄ = n^{-1} Σ_{i=1}^n κ(ε_i),   µ_D = E{D^T(X_i, β_0)},   and   µ_κ = E{κ(ε_i)}.
is orthogonal to the nuisance tangent space. We also notice that this is pro-
portional to the influence function of β̂n given by (5.27).
    If, however, we choose the true density pε (ε), then we obtain an efficient es-
timator. Consequently, the estimator for β given by (5.19) is a locally efficient
semiparametric estimator for β.    
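A small sketch of the locally efficient estimating equation (5.19) follows (ours, not the book's): with a linear mean and the working choice κ(ε) = tanh(ε) it solves Σ_i (X_i − X̄) κ(Y_i − X_i^T β) = 0, and with κ(ε) proportional to ε it would reduce to the centered least-squares estimator (5.29). The simulated data and all names are assumptions.

```python
# Hedged sketch of (5.19) for the linear location-shift model with kappa(e) = tanh(e).
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(6)
n, q = 500, 2
X = rng.normal(size=(n, q))
beta0 = np.array([1.0, -0.5])
Y = 0.5 + X @ beta0 + rng.logistic(size=n)       # intercept is absorbed into the error term

kappa = np.tanh                                  # working "score" function of the residual

def eq_519(b):
    e = Y - X @ b                                # residuals eps_i(beta); no intercept in mu
    Xc = X - X.mean(axis=0)                      # D^T(X_i, b) - Dbar^T(b) = X_i - Xbar
    return Xc.T @ kappa(e)

beta_hat = root(eq_519, x0=np.zeros(q)).x
print(beta_hat)
```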
                                  Yi = XiT β + εi ,
D^T(X_i, β) = X_i^{q×1}.

        Σ_{i=1}^n (X_i − X̄) S_ε(Y_i − X_i^T β) = 0.        (5.28)
If, for instance, we take the working density p_ε(ε) to be normal with mean μ_ε and variance σ², then
    S_ε(ε) = −(ε − μ_ε)/σ².
Substituting into (5.28) yields
    Σ_{i=1}^n (X_i − X̄) (Y_i − X_i^T β − μ_ε)/σ² = 0.
Since σ² is a multiplicative constant and Σ_{i=1}^n (X_i − X̄) = 0, the estimator for β is equivalent to the solution of
    Σ_{i=1}^n (X_i − X̄)(Y_i − X_i^T β) = 0.
This gives the usual least-squares estimator for the regression coefficients in
a linear model with an intercept term; namely,
    β̂_n = { Σ_{i=1}^n (X_i − X̄)(X_i − X̄)^T }^{−1} Σ_{i=1}^n (X_i − X̄) Y_i.                    (5.29)
For comparison, consider the restricted moment model
    Y_i = α + β_1 X_{i1} + · · · + β_q X_{iq} + ε_i,
for which the optimal generalized estimating equation is
    Σ_{i=1}^n D^T(X_i) V^{−1}(X_i) {Y_i − μ(X_i, β)} = 0.
With μ(X_i, β) = α + X_i^T β, D^T(X_i) = (1, X_i^T)^T, and constant working variance V(X_i) = σ², this equation becomes
    σ^{−2} Σ_{i=1}^n (1, X_i^T)^T (Y_i − α − X_i^T β) = 0,
which is identical to the estimator (5.29), the locally efficient estimator for β
in the location-shift regression model.
Remarks
1. The estimator β̂n given by (5.29) is the efficient estimator for β among
   semiparametric estimators for the location-shift regression model if the
   error distribution is indeed normally distributed.
2. If, however, the error distribution were not normally distributed, other
   semiparametric estimators (semiparametric for the location-shift model)
   would be more efficient; namely, the solution to the equation
        Σ_{i=1}^n (X_i − X̄) S_ε(Y_i − X_i^T β) = 0,                    (5.30)
5.2 Proportional Hazards Regression Model with Censored Data
The proportional hazards model is particularly useful for survival data that are right censored, as is often the case in many
clinical trials where the primary endpoint is “time to an event” such as time
to death, time to relapse, etc. As we will see shortly, the proportional hazards
model is a semiparametric model that can be represented naturally with a
finite number of parameters of interest and an infinite-dimensional nuisance
parameter. Because of this natural representation with a parametric and non-
parametric component, it was one of the first models to be studied using
semiparametric theory, in the landmark paper of Begun et al. (1983).
    In order to follow the arguments in this section, the reader must be familiar
with the theory of counting processes and associated martingale processes that
have been developed for studying the theoretical properties of many statistics
used in censored survival analysis. Some excellent books for studying counting
processes and their application to censored survival-data models include those
by Fleming and Harrington (1991) and Andersen et al. (1992). The reader who
has no familiarity with this area can skip this section; it is self-contained, and
skipping it will not affect the understanding of the remainder of the book.
    The primary goal of the proportional hazards model is to model the rela-
tionship of “time to event” as a function of a vector of covariates X. Through-
out this section, we will often refer to the “time to event” as the “survival
time” since, in many applications where these models are used, the primary
endpoint is time to death. But keep in mind that the time to any event could
be used. Let T denote the underlying survival time of an arbitrary individual
in our population. The random variable T will be assumed to be a contin-
uous positive random variable. Unlike previous models where we considered
the conditional mean of the response variable, or shifts in the location of
the distribution of the response variable as a function of covariates, in the
proportional hazards model it is the hazard rate of failure that is modeled
as a function of covariates X (q-dimensional). Specifically, the proportional
hazards model of Cox (1972) assumes that
    λ(u|X) = lim_{h→0} P(u ≤ T < u + h | T ≥ u, X)/h = λ(u) exp(β^T X),                    (5.31)
and
    Λ_{C|X}(v|x) = ∫_0^v λ_{C|X}(u|x) du,
and
    ∫_v^∞ p_{C|X}(u|x) du = exp{−Λ_{C|X}(v|x)},
The model is characterized by the parameter (β, η), where β denotes the q-dimensional regression parameters of interest and the nuisance parameter η consists of the baseline hazard function λ(·), the conditional hazard function λ_{C|X}(·|·) of the censoring time given X, and the marginal density p_X(·) of X.
For this problem, the Hilbert space H consists of all q-dimensional measurable
functions h(v, δ, x) of (V, ∆, X) with mean zero and finite variance equipped
with the covariance inner product. In order to define the nuisance tangent
space within the Hilbert space H, we must derive the mean-square closure of
all parametric submodel nuisance tangent spaces. Toward that end, we define
an arbitrary parametric submodel by substituting λ(v, γ1 ), λC|X (v|x, γ2 ) and
pX (x, γ3 ) into (5.33), where γ1 , γ2 , and γ3 are finite-dimensional nuisance
parameters of dimension r1 , r2 , and r3 , respectively. Since the nuisance pa-
rameters γ1 , γ2 , and γ3 are variationally independent and separate from each
other in the log-likelihood of (5.34), this implies that the parametric submodel
nuisance tangent space, and hence the nuisance tangent space, can be written
as a direct sum of three orthogonal spaces,
    Λ = Λ_{1s} ⊕ Λ_{2s} ⊕ Λ_{3s},
where Λ_{js} denotes the mean-square closure of spaces of the form {B^{q×r_j} S_{γ_j}(V, Δ, X) for all constant matrices B^{q×r_j}}, and S_{γ_j}(v, δ, x) = ∂ log p_{V,Δ,X}(v, δ, x, γ)/∂γ_j are the parametric submodel score vectors, for j = 1, 2, 3.
   Since the space Λ3s is associated with arbitrary marginal densities pX (x)
of X, we can use arguments already developed in Chapter 4 to show that
    Λ_{3s} = { α^{q×1}(X) : E{α^{q×1}(X)} = 0^{q×1} }                    (5.35)
and that, for any h ∈ H, the projection onto Λ_{3s} is Π(h|Λ_{3s}) = E(h|X).
We will now show in a series of lemmas how to derive Λ1s and Λ2s followed
by the key theorem that derives the space orthogonal to the nuisance tangent
space; i.e., Λ⊥ .
Let N_C(u) = I(V ≤ u, Δ = 0) denote the counting process for censoring, let Y(u) = I(V ≥ u) denote the at-risk indicator, and let dM_C(u, X) denote the martingale increment dN_C(u) − λ_{0C|X}(u|X)Y(u) du, where λ_{0C|X}(u|x) denotes the true conditional hazard function of C at time u given X = x.
Lemma 5.1. The space Λ2s associated with the arbitrary conditional density
of C given X is given as the class of elements
                                                           
               q×1                                  q×1
              α (u, X)dMC (u, X) for all functions α (u, x) ,
Proof. In order to derive the space Λ_{2s}, consider the parametric submodel
    λ_{C|X}(v|x, γ_2) = λ_{0C|X}(v|x) exp{γ_2^T α(v, x)}.
That this is a valid parametric submodel follows because the truth is contained within this model (i.e., when γ_2 = 0) and the right-hand side is a positive function of v, as a hazard function must be.
   The contribution to the log-likelihood (5.34) for this parametric submodel is
    (1 − δ){log λ_{0C|X}(v|x) + γ_2^T α(v, x)} − ∫_0^v λ_{0C|X}(u|x) exp{γ_2^T α(u, x)} du.
Taking the derivative with respect to γ2 and evaluating it at the truth (γ2 = 0),
we obtain the nuisance score vector
    S_{γ_2} = (1 − Δ) α(V, X) − ∫_0^V α(u, X) λ_{0C|X}(u|X) du
            = (1 − Δ) α(V, X) − ∫ α(u, X) λ_{0C|X}(u|X) I(V ≥ u) du
            = ∫ α(u, X) dM_C(u, X).
From this last result, we conjecture that the space Λ2s consists of all elements
in the class
                                                                
               αq×1 (u, X)dMC (u, X) for all functions αq×1 (u, x) .
We have already demonstrated that any element in the class above is an ele-
ment of a parametric submodel nuisance tangent space. Therefore, to complete
our argument and verify our conjecture, we need to show that the linear space
spanned by the score vector with respect to γ2 for any parametric submodel
belongs to the space above.
   Consider any arbitrary parametric submodel λC|X (u|x, γ2 ), with γ20 de-
noting the truth. The score vector is given by
    ∂/∂γ_2 [ (1 − Δ) log λ_{C|X}(V|X, γ_2) − ∫ λ_{C|X}(u|X, γ_2) I(V ≥ u) du ] |_{γ_2 = γ_{20}}
        = (1 − Δ) {∂λ_{C|X}(V|X, γ_{20})/∂γ_2} / λ_{C|X}(V|X, γ_{20})
          − ∫ [{∂λ_{C|X}(u|X, γ_{20})/∂γ_2} / λ_{C|X}(u|X, γ_{20})] λ_{C|X}(u|X, γ_{20}) I(V ≥ u) du
        = ∫ {∂ log λ_{C|X}(u|X, γ_{20})/∂γ_2} dM_C(u, X).
Let N (u) denote the counting process that counts whether an individual was
observed to die before or at time u (i.e., N (u) = I(V ≤ u, ∆ = 1)) and, as
before, Y (u) = I(V ≥ u) is the “at risk” indicator at time u. Let dM (u, X) de-
note the martingale increment dN (u) − λ0 (u) exp(β0T X)Y (u)du, where λ0 (u)
denotes the true underlying hazard rate in the proportional hazards model.
Lemma 5.2. The space Λ1s , the part of the nuisance tangent space associ-
ated with the nuisance parameter λ(·), the underlying baseline hazard rate of
failure, is
    Λ_{1s} = { ∫ a^{q×1}(u) dM(u, X) for all q-dimensional functions a^{q×1}(u) }.
Proof. Consider the parametric submodel λ(v, γ_1) = λ_0(v) exp{γ_1^T a(v)} for an arbitrary q-dimensional function a^{q×1}(v) of v. For this parametric submodel, the contribution to the log-density is
    δ{log λ_0(v) + γ_1^T a(v) + β^T x} − ∫_0^v λ_0(u) exp{γ_1^T a(u) + β^T x} du.                    (5.36)
The nuisance tangent space is thus Λ = Λ_{1s} ⊕ Λ_{2s} ⊕ Λ_{3s}, where Λ_{2s} and Λ_{1s} were derived in Lemmas 5.1 and 5.2, respectively, and Λ_{3s}
is given in (5.35). Influence functions of RAL estimators for β belong to the
space orthogonal to Λ. We now consider finding the orthogonal complement
to Λ.
Theorem 5.5. The space orthogonal to the nuisance tangent space is given
by
    Λ^⊥ = { ∫ {α(u, X) − a*(u)} dM(u, X) for all α^{q×1}(u, x) },                    (5.38)
where
    a*(u) = E{α(u, X) exp(β_0^T X) Y(u)} / E{exp(β_0^T X) Y(u)}.                     (5.39)
Proof. We begin by noting some interesting geometric properties for the nui-
sance tangent space that we can take advantage of in deriving its orthogonal
complement.
    If we put no restrictions on the densities pV,∆,X (v, δ, x) that generate our
data (completely nonparametric), then it follows from Theorem 4.4 that the
corresponding tangent space would be the space of all q-dimensional mea-
surable functions of (V, ∆, X) with mean zero; i.e., the entire Hilbert space
H.
    The proportional hazards model we are considering puts no restrictions
on the marginal distribution of X, pX (x) or the conditional distribution of C
given X. Therefore, the only restrictions on the class of densities for (V, ∆, X)
come from those imposed on the conditional distribution of T given X via the
proportional hazards model. Suppose, for the time being, we put no restriction
on the conditional hazard of T given X and denote this by λT |X (v|x). If
this were the case, then there would be no restrictions on the distribution of
(V, ∆, X).
    The distribution of (V, Δ, X), given by the density p_{V,Δ,X}(v, δ, x), can be written as p_{V,Δ|X}(v, δ|x) p_X(x), where the conditional density p_{V,Δ|X}(v, δ|x) can also be characterized through the cause-specific hazard functions
    λ*_δ(v|x) = lim_{h→0} P(v ≤ V < v + h, Δ = δ | V ≥ v, X = x)/h,
That (5.40) is true follows from results in Tsiatis (1998). Therefore, putting no
restrictions on λT |X (v|x) or λC|X (v|x) implies that no restrictions are placed
on the conditional distribution of (V, ∆) given X. Hence, the log-density for
such a saturated (nonparametric) model could be written analogously to (5.34), with the conditional hazard λ(v) exp(β^T x) replaced by the arbitrary conditional hazard λ_{T|X}(v|x).
The tangent space for this model can be written as a direct sum of three
orthogonal spaces,
                            Λ∗1s ⊕ Λ2s ⊕ Λ3s ,
where Λ2s and Λ3s are defined as before, but now Λ∗1s is the space associated
with λT |X (v|x), which is now left arbitrary.
    Arguments that are virtually identical to those used to find the space Λ2s
in Lemma 5.1 can be used to show that
    Λ*_{1s} = { ∫ α^{q×1}(u, X) dM(u, X) for all α^{q×1}(u, x) },
where
    dM(u, X) = dN(u) − λ_{0T|X}(u|X) Y(u) du.
Note 2. Notice the difference: For Λ_{1s} we used a^{q×1}(u) (i.e., a function of u only), whereas for Λ*_{1s} we used α^{q×1}(u, X) (i.e., a function of both u and X) in the stochastic integral above.
Because the tangent space Λ∗1s ⊕ Λ2s ⊕ Λ3s is that for a nonparametric model
(i.e., a model that allows for all densities of (V, ∆, X)), and because the tan-
gent space for a nonparametric model is the entire Hilbert space, this implies
that
                             H = Λ∗1s ⊕ Λ2s ⊕ Λ3s ,                       (5.41)
where Λ∗1s , Λ2s , and Λ3s are mutually orthogonal subspaces.
   Since the nuisance tangent space Λ = Λ1s ⊕ Λ2s ⊕ Λ3s , this implies that
Λ_{1s} ⊂ Λ*_{1s}. Also, the orthogonal complement Λ^⊥ must be orthogonal to Λ_{2s} ⊕ Λ_{3s} = (Λ*_{1s})^⊥; i.e., Λ^⊥ ⊂ Λ*_{1s}. Λ^⊥ must also be orthogonal to Λ_{1s}; consequently,
Λ⊥ consists of elements of Λ∗1s that are orthogonal to Λ1s .
   In order to identify elements of Λ⊥ (i.e., elements of Λ∗1s that are orthogonal
to Λ_{1s}), it suffices to take an arbitrary element of Λ*_{1s}, namely
    ∫ α^{q×1}(u, X) dM(u, X),
and find its residual after projecting it onto Λ_{1s}. To find the projection, we must derive a*(u) so that
    ∫ α(u, X) dM(u, X) − ∫ a*(u) dM(u, X)
is orthogonal to every element ∫ a(u) dM(u, X) of Λ_{1s}. Using the covariance formula for stochastic integrals with respect to the martingale M(u, X), this inner product can be written as
    ∫ a^T(u) E[{α(u, X) − a*(u)} exp(β_0^T X) Y(u)] λ_0(u) du,                    (5.43)
which must equal zero for all q-dimensional functions a(·). This requires
    E[{α(u, X) − a*(u)} exp(β_0^T X) Y(u)] = 0                                    (5.44)
for all u. We can prove (5.44) by contradiction because if (5.44) was not equal
to zero, then we could make the integral (5.43) nonzero by choosing a(u) to
be equal to whatever the expectation is in (5.44).
    Solving (5.44), we obtain
    E{α(u, X) exp(β_0^T X) Y(u)} = a*(u) E{exp(β_0^T X) Y(u)},
or
    a*(u) = E{α(u, X) exp(β_0^T X) Y(u)} / E{exp(β_0^T X) Y(u)}.
Therefore, the space orthogonal to the nuisance tangent space is given by
    Λ^⊥ = { ∫ {α(u, X) − a*(u)} dM(u, X) for all α^{q×1}(u, x) },
where
    a*(u) = E{α(u, X) exp(β_0^T X) Y(u)} / E{exp(β_0^T X) Y(u)}.
Note that
    n^{−1/2} Σ_{i=1}^n ∫ {α(u, X_i) − a*(u)} dM_i(u, X_i)
        = n^{−1/2} Σ_{i=1}^n ∫ {α(u, X_i) − â*(u)} dM_i(u, X_i) + o_p(1).                    (5.45)
is identical to
    Σ_{i=1}^n ∫ { α(u, X_i) − [Σ_{j=1}^n α(u, X_j) exp(β_0^T X_j) Y_j(u)] / [Σ_{j=1}^n exp(β_0^T X_j) Y_j(u)] } dN_i(u).
By considering all functions α^{q×1}(u, x), the corresponding estimators define a class of semiparametric estimators whose influence functions include all the influence functions of RAL estimators for β.
Efficient Estimator
To find the efficient estimator, we must derive the efficient score. This entails
computing
                              Sβ (Vi , ∆i , Xi )
and projecting this onto the nuisance tangent space. Going back to the log-
density (5.34) and taking the derivative with respect to β, it is straightforward
to show that
    S_β = ∫ X^{q×1} dM(u, X).
The estimator for β, which has an efficient influence function (i.e., proportional
to S_eff), is given by substituting X_i for α(u, X_i) in (5.46); namely,
    Σ_{i=1}^n Δ_i [ X_i − { Σ_{j=1}^n X_j exp(β^T X_j) Y_j(V_i) } / { Σ_{j=1}^n exp(β^T X_j) Y_j(V_i) } ] = 0.                    (5.47)
The estimator for β, given as the solution to (5.47), is the estimator proposed
by Cox for maximizing the partial likelihood, where the notion of partial like-
lihood was first introduced by Cox (1975). The martingale arguments above
are essentially those used by Andersen and Gill (1982), where the theoreti-
cal properties of the proportional hazards model are derived in detail. The
argument above shows that Cox’s maximum partial likelihood estimator is a
globally efficient semiparametric estimator for β for the proportional hazards
model.
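As an illustration (not from the text), the score equation (5.47) can be solved directly by a root search when there is a single covariate. The Python sketch below is self-contained and uses hypothetical simulated data; with several covariates one would instead use a multivariate Newton scheme or an existing Cox regression routine.

    # Minimal sketch: solving the partial likelihood score equation (5.47), q = 1
    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(2)
    n = 500
    X = rng.normal(size=n)
    T = rng.exponential(scale=1.0 / (0.2 * np.exp(0.7 * X)))   # hypothetical PH truth
    C = rng.exponential(scale=5.0, size=n)
    V, Delta = np.minimum(T, C), (T <= C).astype(int)

    def cox_score(beta):
        w = np.exp(beta * X)
        total = 0.0
        for Vi, Di, Xi in zip(V, Delta, X):
            if Di:
                at_risk = V >= Vi                               # Y_j(V_i) = I(V_j >= V_i)
                total += Xi - np.dot(X[at_risk], w[at_risk]) / w[at_risk].sum()
        return total

    beta_hat = brentq(cox_score, -5.0, 5.0)                     # solves (5.47)
    print(beta_hat)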
Consider a randomized study comparing two treatments in which baseline covariates are collected on all the individuals in the study. The goal is to estimate the differ-
ence in the mean response between the two treatments. One simple estimator
for this treatment difference is the difference in the treatment-specific sample
average response between the two treatments. A problem that has received
considerable interest is whether and how we can use the baseline covariates
that are collected prior to randomization to increase the efficiency of the es-
timator for treatment difference.
    Along the same line, we also consider the randomized pretest-posttest
design. In this design, a random sample of subjects (patients) are chosen from
some population of interest and, for each patient, a pretest measurement, say
Y1 , is made, and then the patient is randomized to one of two treatments,
which we denote by the treatment indicator A, taking the value 1 or 0 with probability δ
or (1 − δ), respectively, and after some prespecified time period, a posttest measurement
Y2 is made. The goal of such a study is to estimate the effect of treatment
intervention on the posttest measurement, or equivalently to estimate the
effect of treatment on the change score, which is the difference between the
pretest response and posttest response.
    We will focus the discussion on the pretest-posttest design. However, we
will see later that the results derived for the pretest-posttest study can be
generalized to the problem of covariate adjustment in a randomized study.
    As an example of a pretest-posttest study, suppose we wanted to compare
the effect of some treatment intervention on the quality of life for patients
with some disease. We may be interested in comparing the effect that a new
treatment has to a placebo or comparing a new treatment to the current best
standard treatment. Specifically, such a design is carried out by identifying a
group of patients with disease that are eligible for either of the two treatments.
These patients are then given a questionnaire to assess their baseline quality
of life. Typically, in these studies, patients answer several questions, each of
which is assigned a score (predetermined by the investigator), and the quality
of life score is a sum of these scores, denoted by Y1 . Afterward, they are
randomized to one of the two treatments (A = 0 or A = 1) and then followed for
some period of time and asked to complete the quality of life questionnaire
again, where their quality of life score Y2 is computed.
    The goal in such studies is to estimate the treatment effect, defined as
    β = μ_2^{(1)} − μ_2^{(0)},   where μ_2^{(j)} = E(Y_2|A = j), j = 0, 1,
or, equivalently (because Y_1 is independent of A by randomization), the mean effect of treatment on the change score,
    β = E(Y_2 − Y_1|A = 1) − E(Y_2 − Y_1|A = 0).
Consequently, the influence function for the i-th observation of β̂_n (the difference in the treatment-specific sample averages of Y_2) equals
    φ(Z_i) = (A_i/δ)(Y_{2i} − μ_2^{(1)}) − {(1 − A_i)/(1 − δ)}(Y_{2i} − μ_2^{(0)}).                    (5.51)
    Now that we have identified one influence function for an RAL estimator
for β, we can identify the linear variety of the space of influence functions by
deriving the tangent space T and its orthogonal complement T ⊥ .
Let us now construct the tangent space T . The data from a single observa-
tion in a randomized pretest-posttest study are given by (Y1 , A, Y2 ). The only
restriction that is placed on the density of the data is that induced by the
randomization itself, specifically that the pretest measurement Y1 is indepen-
dent of the treatment indicator A and that the distribution of the Bernoulli
variable A is given by P (A = 1) = δ and P (A = 0) = 1 − δ, where δ is
the randomization probability, which is known by design. Other than these
restrictions, we will allow the density of (Y1 , A, Y2 ) to be arbitrary.
    To derive the tangent space and its orthogonal complement, we will take
advantage of the results of partitioning the Hilbert space for a nonparametric
model given by Theorem 4.5 of Chapter 4.
    First note that the density of the data for a single observation can be
factored as
    p_{Y_1,A,Y_2}(y_1, a, y_2) = p_{Y_1}(y_1) p_{A|Y_1}(a|y_1) p_{Y_2|Y_1,A}(y_2|y_1, a).                    (5.52)
By Theorem 4.5, the Hilbert space H of mean-zero functions of (Y_1, A, Y_2) can then be decomposed as
    H = T_1 ⊕ T_2 ⊕ T_3,                    (5.53)
where, corresponding to the three factors in (5.52),
    T_1 = {h_1(Y_1) : E{h_1(Y_1)} = 0},
    T_2 = {h_2(Y_1, A) : E{h_2(Y_1, A)|Y_1} = 0},
    T_3 = {h_3(Y_1, A, Y_2) : E{h_3(Y_1, A, Y_2)|Y_1, A} = 0}.
The only restriction imposed by the model is that, by randomization, p_{A|Y_1}(a|y_1) = δ^a (1 − δ)^{1−a};
that is, the conditional density is completely known to us and not a function
of unknown parameters. Otherwise, the density of Y1 , which is pY1 (y1 ), and
the conditional density of Y2 given Y1 , A; i.e., pY2 |Y1 ,A (y2 |y1 , a), are arbitrary.
Consequently, the tangent space for semiparametric models in the randomized
pretest-posttest design is given by
T = T1 ⊕ T3 .
Remark 2. The contribution to the tangent space that is associated with the
nuisance parameters for pA|Y1 (a|y1 ), which is T2 , is left out because, by the
model restriction, this conditional density is completely known to us by design.
   The orthogonality of T1 , T2 , T3 together with (5.53) implies that the space
orthogonal to the tangent space is given by
T ⊥ = T2 .
Using (4.22) from Theorem 4.5, we can represent elements of T2 as h∗2 (Y1 , A)−
E{h∗2 (Y1 , A)|Y1 } for any arbitrary function h∗2 (·) of Y1 and A. Because A is
a binary indicator function, any function h∗2 (·) of Y1 and A can be expressed
as h*_2(Y_1, A) = A h_1(Y_1) + h_2(Y_1), where h_1(·) and h_2(·) are arbitrary functions of Y_1. Therefore,
    h*_2(Y_1, A) − E{h*_2(Y_1, A)|Y_1} = A h_1(Y_1) + h_2(Y_1) − {E(A|Y_1) h_1(Y_1) + h_2(Y_1)}
                                       = (A − δ) h_1(Y_1).
Since T^⊥ = T_2, we can use (4.23) of Theorem 4.5 to show that, for any function α(Y_1, A, Y_2),
    Π{α(Y_1, A, Y_2)|T^⊥} = E{α(Y_1, A, Y_2)|Y_1, A} − E{α(Y_1, A, Y_2)|Y_1}.
Therefore
    Π{(A/δ)(Y_2 − μ_2^{(1)}) | T^⊥} = E{(A/δ)(Y_2 − μ_2^{(1)}) | Y_1, A} − E{(A/δ)(Y_2 − μ_2^{(1)}) | Y_1},                    (5.56)
where
    E{(A/δ)(Y_2 − μ_2^{(1)}) | Y_1, A} = (A/δ){E(Y_2|Y_1, A = 1) − μ_2^{(1)}},                                               (5.57)
and, by the law of iterated conditional expectations, E{(A/δ)(Y_2 − μ_2^{(1)}) | Y_1} can be computed by taking the conditional expectation of (5.57) given Y_1, yielding
    E{(A/δ)(Y_2 − μ_2^{(1)}) | Y_1} = E(Y_2|Y_1, A = 1) − μ_2^{(1)}.                                                          (5.58)
Combining (5.56)–(5.58), we obtain
    Π{(A/δ)(Y_2 − μ_2^{(1)}) | T^⊥} = {(A − δ)/δ}{E(Y_2|Y_1, A = 1) − μ_2^{(1)}}.                    (5.59)
Similarly, we obtain
    Π{((1 − A)/(1 − δ))(Y_2 − μ_2^{(0)}) | T^⊥} = −{(A − δ)/(1 − δ)}{E(Y_2|Y_1, A = 0) − μ_2^{(0)}}.                    (5.60)
   Using (5.59) and (5.60) and after some algebra, we deduce that the efficient
influence function (5.55) equals
    (A/δ)Y_2 − {(A − δ)/δ} E(Y_2|A = 1, Y_1)
        − [ {(1 − A)/(1 − δ)} Y_2 + {(A − δ)/(1 − δ)} E(Y_2|A = 0, Y_1) ]
        − {μ_2^{(1)} − μ_2^{(0)}}.                    (5.61)
    In order to construct an efficient RAL estimator for β (that is, an RAL
estimator for β whose influence function is the efficient influence function
given by (5.61)), we would need to know E(Y2 |A = 1, Y1 ) and E(Y2 |A = 0, Y1 ),
which, of course, we don’t. One strategy is to posit models for E(Y2 |A = 1, Y1 )
and E(Y2 |A = 0, Y1 ) in terms of a finite number of parameters ξ1 and ξ0 ,
respectively. That is,
                    E(Y2 |A = j, Y1 ) = ζj (Y1 , ξj ), j = 0, 1.
These posited models are restricted moment models for the subset of individ-
uals in treatment groups A = 1 and A = 0, respectively. For such models, we
can use generalized estimating equations to obtain estimators ξ̂_{1n} and ξ̂_{0n} for ξ_1 and ξ_0 using patients {i : A_i = 1} and {i : A_i = 0}, respectively. Such estimators are consistent for ξ_1 and ξ_0 if the posited models are correctly specified but, even if they are incorrectly specified, under suitable regularity conditions, ξ̂_{jn} would converge in probability to some constant ξ_j^*, j = 0, 1. With this in mind, we use the functional form of the efficient influence function given by (5.61) to motivate the
estimator β̂n for β given by
    β̂_n = n^{−1} Σ_{i=1}^n [ (A_i/δ) Y_{2i} − {(A_i − δ)/δ} ζ_1(Y_{1i}, ξ̂_{1n})
            − { ((1 − A_i)/(1 − δ)) Y_{2i} + {(A_i − δ)/(1 − δ)} ζ_0(Y_{1i}, ξ̂_{0n}) } ].                    (5.62)
   After some algebra (Exercise 3 at the end of this Chapter), we can show
that the influence function of β̂_n is
    (A/δ)(Y_2 − μ_2^{(1)}) − {(A − δ)/δ}{ζ_1(Y_1, ξ_1^*) − μ_2^{(1)}}
        − {(1 − A)/(1 − δ)}(Y_2 − μ_2^{(0)}) − {(A − δ)/(1 − δ)}{ζ_0(Y_1, ξ_0^*) − μ_2^{(0)}}.                    (5.63)
    H^{WZ} = T_1 ⊕ H^Z,   T_1 ⊥ H^Z.
    Because the density in (5.66) factors into terms involving the parameter η ∗
and terms involving the parameters (β, η), where η ∗ and (β, η) are variation-
ally independent, it is then straightforward to show that the tangent space
for the model on (W, Z) is given by
    T^{WZ} = T_1 ⊕ T^Z,   T_1 ⊥ T^Z.
implies
    α = α′ and β = β′,
implies
    σ = σ′ and β = β′.
      For this model, describe how you would derive a locally efficient estimator
      for β from a sample of data
(Yi , Xi ), i = 1, . . . , n.
2. Consider the linear model
       Y_{2i} = η_0 + β A_i + η_1 Y_{1i} + ε_i                    (5.67)
   for the iid data (Y_{1i}, A_i, Y_{2i}), i = 1, . . . , n, where the parameters η_0, β, η_1 are
   estimated using least squares.
    a) Show that the least-squares estimator β̂n in the model above is a
        semiparametric estimator for β = E(Y2 |A = 1) − E(Y2 |A = 0); that
        is, that n1/2 (β̂n − β) converges to a normal distribution with mean
        zero whether the linear model (5.67) is correct or not.
    b) Find the influence function for β̂n and show that it is in the class of
        influence functions given in (5.54).
3. In (5.62) we considered locally efficient adaptive estimators for the treat-
   ment difference β in a randomized pretest-posttest study. Suppose the
   estimators ξ̂_{jn}, j = 0, 1 were root-n consistent estimators; that is, that
   there exist ξ_j^*, j = 0, 1 such that n^{1/2}(ξ̂_{jn} − ξ_j^*) = O_p(1). Show that the influence function of the resulting estimator β̂_n in (5.62) is given by (5.63).
6
Models and Methods for Missing Data
6.1 Introduction
In many practical situations, although we may set out in advance to collect
data according to some “nice” plan, things may not work out quite as in-
tended. Some examples of this follow.
Surrogate measurements
For some studies, the response of interest or some important covariate may
be very expensive to obtain. For example, suppose we are interested in the
daily average percentage fat intake of a subject. An accurate measurement
requires a detailed “food diary” over a long period, which is both expensive
and time consuming. A cheaper measurement (surrogate) is to have subjects
recall the food they ate in the past 24 hours. Clearly, this cheaper measurement
will be correlated with the expensive one but not perfectly. To reduce costs,
a study may be conducted where only a subsample of participants provide
the expensive measurement (validation sample), whereas everyone provides
data on the inexpensive measurement. The expensive measurement would be
missing for all individuals not in the validation sample. Unlike the previous
examples, here the missingness was by design rather than by happenstance.
    In almost all studies involving human subjects, some important data may
be missing for some subjects for a variety of reasons, from oversight or mis-
takes by the study personnel to refusal or inability of the subjects to provide
information.
    Objective: Usually, interest focuses on making an inference about some
aspect (parameter) of the distribution of the “full data” (i.e., the data that
would have been observed if no data were missing).
    Problem: When some of the data are missing, it may be that, depending
on how and why they are missing, our ability to make an accurate inference
may be compromised.
    This problem can be defined using the following notation. For individual
i in our sample, i = 1, · · · , n, let A_i = 1 or 0 denote the treatment assignment, and let the treatment-specific sample averages of the response Y_i be
    μ̂_1 = Σ_{i=1}^n A_i Y_i / Σ_{i=1}^n A_i ,    μ̂_0 = Σ_{i=1}^n (1 − A_i) Y_i / Σ_{i=1}^n (1 − A_i),
                           (Ri , Ri Y1i ), i = 1, · · · , n1 .
   A natural estimator for µ1 , using the observed data, is the complete-case
sample average; namely,
    μ̂_{1c} = Σ_{i=1}^{n_1} R_i Y_{1i} / Σ_{i=1}^{n_1} R_i .
(Ri , Y1i ), i = 1, . . . , n1 ,
to specify the conditional density because p_{R|Y_1}(r_i|y_{1i}) = {π(y_{1i})}^{r_i} {1 − π(y_{1i})}^{1−r_i}.
   Again, the observed data are denoted by
(R_i, R_i Y_{1i}), i = 1, . . . , n_1.
Note that
    Σ_{i=1}^{n_1} R_i Y_{1i} / Σ_{i=1}^{n_1} R_i = { n_1^{−1} Σ_{i=1}^{n_1} R_i Y_{1i} } / { n_1^{−1} Σ_{i=1}^{n_1} R_i }  →^P  E(RY_1)/E(R),
and, if R is independent of Y_1,
    E(RY_1)/E(R) = E(R)E(Y_1)/E(R) = E(Y_1) = μ_1.
Therefore, if the data are missing completely at random (MCAR), then the
complete-case estimator is unbiased, as our intuition would suggest. If, how-
ever, the probability of missingness depends on Y1 , which we will refer to as
nonmissing at random (NMAR) (a formal definition will be given in later chapters), then the complete-case estimator satisfies
    Σ_{i=1}^{n_1} R_i Y_{1i} / Σ_{i=1}^{n_1} R_i  →^P  E(RY_1)/E(R) = E{E(RY_1|Y_1)}/E{E(R|Y_1)}                    (6.1)
                                                 = E{Y_1 π(Y_1)}/E{π(Y_1)},
which does not necessarily equal E(Y_1) = μ_1.
In fact, if π(y) is an increasing function in y (i.e., probability of not being
missing increases with y), then this suggests that individuals with larger values
of Y would be overrepresented in the observed data and hence
    E{Y_1 π(Y_1)} / E{π(Y_1)} > μ_1.
We leave the proof of this last result as an exercise for the reader.
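A small simulation (not from the text, and not a substitute for the exercise) illustrates the point numerically: when the probability of being observed increases with the response (NMAR), the complete-case average overestimates μ_1, whereas under MCAR it does not. All values below are hypothetical.

    # Minimal sketch: complete-case bias under NMAR versus MCAR
    import numpy as np

    rng = np.random.default_rng(4)
    n = 100_000
    Y1 = rng.normal(0.0, 1.0, size=n)                # true mu_1 = 0
    pi_nmar = 1.0 / (1.0 + np.exp(-Y1))              # pi(y) increasing in y
    R_nmar = rng.binomial(1, pi_nmar)
    R_mcar = rng.binomial(1, 0.5, size=n)

    print(Y1[R_mcar == 1].mean())                    # approximately 0 under MCAR
    print(Y1[R_nmar == 1].mean())                    # biased upward under NMAR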
    The difficulty with NMAR is that there is no way of estimating π(y) =
P (R = 1|Y1 = y) from the observed data because if R = 0 we don’t get
to observe Y1 . In fact, there is no way that we can distinguish whether the
missing data were MCAR or NMAR from the observed data. That is, there
is an inherent nonidentifiability problem here.
    There is, however, a third possibility to consider. Suppose, in addition
to the response variable, we measured additional covariates on the i-th in-
dividual, denoted by Wi , which are not missing. For example, some baseline
characteristics may also be collected, including baseline blood pressure or pos-
sibly some additional variables collected between the initiation of treatment
and six months, when the follow-up response is supposed to get collected. Such
variables Wi , i = 1, . . . , n are sometimes referred to as auxiliary covariates, as
they represent variables that are not of primary interest for inference. The
observable data in such a case are denoted by
(Ri , Ri Y1i , Wi ), i = 1, · · · , n1 .
Suppose that
    R_i ⊥⊥ Y_{1i} | W_i.                    (6.2)
Remark 1. It may be that Wi is related to both Y1i and Ri , in which case, even
though (6.2) is true, a dependence between Y1i and Ri would be induced. For
example, consider the hypothetical scenario where Wi denotes the blood pres-
sure for the i-th individual at an interim examination, say at three months,
measured on all n individuals in the sample. After observing this response,
individuals whose blood pressure was still elevated would be more likely to
drop out. Therefore, Ri would be correlated with Wi . Since individuals could
not possibly know what their blood pressure is at the end of the study (i.e.,
at six months), it may be reasonable to assume that Ri ⊥     ⊥ Y1i |Wi . It may
also be reasonable to assume that the three-month blood pressure reading is
correlated with the six-month blood pressure reading. Under these circum-
stances, dropping out of the study (R_i) would be correlated with the response outcome Y_{1i}, but only because both are related to the interim three-month blood pressure reading.
    Therefore, without conditioning additionally on Wi , we would obtain that
    P(R_i = 1 | Y_{1i}) = π(Y_{1i})
depends on the value Y1i . Consequently, if we didn’t collect the additional
data W_i, then we would be back in the impossible NMAR situation.
    The assumption (6.2) is an example of what is referred to as missing at
random, or MAR (not to be confused with MCAR). Basically, missing at
random means that the probability of missingness depends on variables that
are observed. A general and more precise definition of MAR will be given
later.
    The MAR assumption alleviates the identifiability problems that were en-
countered with NMAR because the probability of missingness depends on
variables that are observed on all subjects. The available data could also be
used to model the relationship for the probability of missingness, or, equiva-
lently, the probability of a complete case, as a function of the covariates Wi .
For example, we can posit a model for
                            P (R = 1|W = w) = π(w, γ)
(say, logistic regression) in terms of a parameter vector γ and estimate the
parameter γ from the observed data (Ri , Wi ), which are measured on all indi-
viduals i = 1, . . . , n, using, say, maximum likelihood. That is, the maximum
likelihood estimator γ̂ would be obtained by maximizing
    Π_{i=1}^n π(W_i, γ)^{R_i} {1 − π(W_i, γ)}^{1−R_i}.
    R ⊥⊥ Y_1 | W.
    μ_1 = E(Y_1) = E{E(Y_1|W)} = ∫∫ y p_{Y_1|W}(y|w, γ_1) p_W(w, γ_2) dν_Y(y) dν_W(w).                    (6.4)
or
    {p_{Y_1|W,R}(y_1|w, r = 1) p_{R|W}(r = 1|w) p_W(w)}^{I(r=1)} {p_{R|W}(r = 0|w) p_W(w)}^{I(r=0)}.
Because of MAR (i.e., R ⊥⊥ Y_1 | W),
where
                              pR|W (r = 1|w) = π(w, γ3 ).
Consequently, the likelihood for n independent sets of data is given as
    { Π_{i=1}^n p_{Y_1|W}(y_{1i}|w_i, γ_1)^{I(r_i=1)} } × { Π_{i=1}^n p_W(w_i, γ_2) } × {a function of γ_3}.
Because of the way the likelihood factorizes, we find the MLE for γ1 , γ2 by
separately maximizing
    Π_{{i : R_i = 1}} p_{Y_1|W}(y_{1i}|w_i, γ_1)                    (6.5)
and
    Π_{i=1}^n p_W(w_i, γ_2).                    (6.6)
Note 2. We only include complete cases to find the MLE for γ_1, whereas we use all the data to find the MLE for γ_2.
   The estimates for γ1 and γ2 , found by maximizing (6.5) and (6.6), can
then be substituted into (6.4) to obtain the MLE for µ1 .
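As an illustration (not from the text), if the marginal distribution of W is handled through its empirical distribution, the integral in (6.4) reduces to averaging the fitted regression E(Y_1|W = W_i, γ̂_1) over all n subjects. The Python sketch below assumes a hypothetical normal linear working model for Y_1 given W; all names and data are made up.

    # Minimal sketch: estimating mu_1 = E{E(Y1|W)} under MAR
    import numpy as np

    rng = np.random.default_rng(6)
    n = 5000
    W = rng.normal(size=n)
    Y1 = 2.0 + 1.5 * W + rng.normal(size=n)
    R = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.3 + W))))         # MAR: depends on W only

    Xcc = np.column_stack([np.ones(R.sum()), W[R == 1]])          # complete cases only
    gamma1_hat = np.linalg.lstsq(Xcc, Y1[R == 1], rcond=None)[0]  # MLE under the normal working model
    mu1_hat = np.mean(gamma1_hat[0] + gamma1_hat[1] * W)          # average fitted mean over all W_i
    print(mu1_hat)                                                # approximately 2.0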
Remark 2. Although likelihood methods are certainly feasible and the corre-
sponding estimators enjoy the optimality properties afforded to an MLE, they
can be difficult to compute in some cases. For example, the integral given in
(6.4) can be numerically challenging to compute, especially if W involves many
covariates.
6.3 Imputation
Since some of the Y1i ’s are missing, a natural strategy is to impute or “es-
timate” a value for such missing data and then estimate the parameter of
interest behaving as if the imputed values were the true values.
    For example, if there were no missing values of Y1i , i = 1, . . . , n1 , then we
would estimate μ_1 by using
    μ̂_1 = n_1^{−1} Σ_{i=1}^{n_1} Y_{1i}.                    (6.7)
However, for values of i such that Ri = 0, we don’t observe such Y1i .
    Suppose we posit some relationship for the distribution of Y1 given W ;
e.g., we may assume a parametric model p_{Y_1|W}(y|w, γ_1) as we did in (6.3).
By the MAR assumption, we can estimate γ1 using the complete cases, say,
by maximizing (6.5) to derive the MLE γ̂_1. This would allow us to estimate
    E(Y_1 | W = w)  by  ∫ y p_{Y_1|W}(y|w, γ̂_1) dν_Y(y),
Remarks
 1. We could have modeled the conditional expectation directly, say as
                                   E(Y1 |W ) = µ(W, γ),
    and estimated γ by using GEEs with complete cases.
may be biased. Horvitz and Thompson (1952), and later Robins, Rotnitzky,
and Zhao (1994), suggested using inverse weighting of complete cases as a
method of estimation. Let us denote the probability of observing a complete
case by π(W) = P(R = 1 | W).
The basic intuition is as follows. For any individual randomly chosen from a
population with covariate value W , the probability that such an individual
will have complete data is π(W). Therefore, any individual with covariate W with complete data can be thought of as representing 1/π(W) individuals chosen at random from the population, some of whom may have missing data. This
suggests using
    μ̂_1 = n_1^{−1} Σ_{i=1}^{n_1} R_i Y_{1i} / π(W_i)
as an estimator for μ_1. By the WLLN, μ̂_1 →^P E{R Y_1/π(W)}, which, by conditioning, equals
    E[ E{ R Y_1/π(W) | Y_1, W } ] = E[ {Y_1/π(W)} E(R | Y_1, W) ]
                                  = E[ {Y_1/π(W)} π(W) ] = E(Y_1) = μ_1.
In practice, π(W) is unknown, so we posit a model π(W, γ_3) for P(R = 1|W) and maximize
    Π_{i=1}^{n_1} {π(W_i, γ_3)}^{R_i} {1 − π(W_i, γ_3)}^{1−R_i}
to obtain the estimator γ̂3 . (Often, logistic regression models are used.) The
resulting inverse probability weighted complete-case (IPWCC) estimator is
given by
    n_1^{−1} Σ_{i=1}^{n_1} R_i Y_{1i} / π(W_i, γ̂_3).                    (6.10)
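A hedged Python sketch (not from the text) of the IPWCC estimator (6.10) follows, assuming a hypothetical logistic model for π(W, γ_3) fitted by maximum likelihood; in the simulation Y_1 is generated for everyone but used only for the complete cases.

    # Minimal sketch of the inverse probability weighted complete-case estimator (6.10)
    import numpy as np
    from scipy.special import expit
    from scipy.optimize import minimize

    rng = np.random.default_rng(7)
    n = 5000
    W = rng.normal(size=n)
    Y1 = 2.0 + 1.5 * W + rng.normal(size=n)
    R = rng.binomial(1, expit(0.3 + W))                       # MAR missingness

    def neg_loglik(g):                                        # fit pi(W, gamma_3) by ML
        p = expit(g[0] + g[1] * W)
        return -np.sum(R * np.log(p) + (1 - R) * np.log(1 - p))

    g_hat = minimize(neg_loglik, x0=np.zeros(2)).x
    pi_hat = expit(g_hat[0] + g_hat[1] * W)
    mu1_ipwcc = np.mean(R * Y1 / pi_hat)                      # estimator (6.10)
    print(mu1_ipwcc)                                          # approximately 2.0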
and for
(a) Suppose the model (6.12) is correctly specified but the model (6.11) might
    not be. Then the estimator is approximated by
        n_1^{−1} Σ_{i=1}^{n_1} [ Y_{1i} + {R_i − P(R = 1|W_i)}/P(R = 1|W_i) · {Y_{1i} − μ(W_i, γ_1^*)} ] + o_p(1),
    which converges to
        E[ Y_1 + {R − P(R = 1|W)}/P(R = 1|W) · {Y_1 − μ(W, γ_1^*)} ]
          = E(Y_1) + E[ {R − P(R = 1|W)}/P(R = 1|W) · {Y_1 − μ(W, γ_1^*)} ].
    Conditioning on (Y_1, W) in the second expectation (i.e., writing it as E[E{· | Y_1, W}]) and noting that E(R|Y_1, W) = P(R = 1|W) under MAR, this second term equals
        E[ {E(R|Y_1, W) − P(R = 1|W)}/P(R = 1|W) · {Y_1 − μ(W, γ_1^*)} ]
          = E[ {P(R = 1|W) − P(R = 1|W)}/P(R = 1|W) · {Y_1 − μ(W, γ_1^*)} ] = 0,
    so the limit is E(Y_1) = μ_1.
(b) Suppose the model (6.11) is correctly specified but the model (6.12) might
    not be. Then the estimator is approximated by
        n_1^{−1} Σ_{i=1}^{n_1} [ Y_{1i} + {R_i − π(W_i, γ_3^*)}/π(W_i, γ_3^*) · {Y_{1i} − E(Y_1|W_i)} ] + o_p(1),
    which converges to
        E[ Y_1 + {R − π(W, γ_3^*)}/π(W, γ_3^*) · {Y_1 − E(Y_1|W)} ]
          = E(Y_1) + E[ {R − π(W, γ_3^*)}/π(W, γ_3^*) · {Y_1 − E(Y_1|W)} ].
    Conditioning on (R, W) in the second expectation (i.e., writing it as E[E{· | R, W}]) and noting that E(Y_1|R, W) = E(Y_1|W) under MAR, this second term equals
        E[ {R − π(W, γ_3^*)}/π(W, γ_3^*) · {E(Y_1|R, W) − E(Y_1|W)} ]
          = E[ {R − π(W, γ_3^*)}/π(W, γ_3^*) · {E(Y_1|W) − E(Y_1|W)} ] = 0,
    so, again, the limit is E(Y_1) = μ_1.
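As a numerical illustration (not from the text), the Python sketch below computes the double robust (augmented inverse probability weighted) estimator in the standard form n^{-1}Σ{R_iY_{1i}/π̂(W_i) − (R_i − π̂(W_i))/π̂(W_i) · μ̂(W_i)}, which is algebraically equivalent to the form displayed above when the terms are rearranged; here the outcome regression is deliberately misspecified while the missingness model is correct, and all data are hypothetical.

    # Minimal sketch of the double robust (AIPW) estimator for mu_1
    import numpy as np
    from scipy.special import expit
    from scipy.optimize import minimize

    rng = np.random.default_rng(8)
    n = 20000
    W = rng.normal(size=n)
    Y1 = 2.0 + 1.5 * W + W**2 + rng.normal(size=n)       # true E(Y1|W) is quadratic
    R = rng.binomial(1, expit(0.3 + W))                  # MAR missingness

    def neg_loglik(g):                                   # correctly specified pi(W, gamma_3)
        p = expit(g[0] + g[1] * W)
        return -np.sum(R * np.log(p) + (1 - R) * np.log(1 - p))
    g_hat = minimize(neg_loglik, x0=np.zeros(2)).x
    pi_hat = expit(g_hat[0] + g_hat[1] * W)

    Xcc = np.column_stack([np.ones(R.sum()), W[R == 1]]) # misspecified linear mu(W, gamma_1)
    g1 = np.linalg.lstsq(Xcc, Y1[R == 1], rcond=None)[0]
    mu_w = g1[0] + g1[1] * W

    mu1_dr = np.mean(R * Y1 / pi_hat - (R - pi_hat) / pi_hat * mu_w)
    print(mu1_dr)                                        # approximately E(Y1) = 3.0 here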
    E{Y_1 π(Y_1)} / E{π(Y_1)},
    E{Y_1 π(Y_1)} / E{π(Y_1)} > μ_1.
7
Missing and Coarsening at Random for
Semiparametric Models
where the nuisance function η2 allows for any arbitrary conditional density of
Z2 given Z1 . By so doing, auxiliary variables can be introduced as part of a
full-data semiparametric model.
    Again, we emphasize that the underlying objective is to make inference
on parameters that describe important aspects of the distribution of the data
Z had Z not been missing. That is, had Z been completely observed for the
entire sample, then the data would be realizations of the iid random quanti-
ties Z_1, . . . , Z_n, each with density p_Z(z, β, η), where β^{q×1} is assumed finite-
dimensional, and η denotes the nuisance parameter, which for semiparametric
models is infinite-dimensional. It is the parameter β in this model that is of
primary interest. The fact that some of the data are missing is a difficulty that
we have to deal with by thinking about and modeling the missingness pro-
cess. The model for missingness, although important for conducting correct
inference, is not of primary inferential interest.
    Although we have only discussed missing data thus far, it is not any more
difficult to consider the more general notion of “coarsening” of data. The
concept of coarsened data was first introduced by Heitjan and Rubin (1991)
and studied more extensively by Heitjan (1993) and Gill, van der Laan, and
Robins (1996). When we think of missing data, we generally consider the case
where the data on a single individual can be represented by a random vector
with, say, l components, where a subset of these components may be missing
for some of the individuals in the sample. When we refer to coarsened data,
we consider the case where we observe a many-to-one function of Z for some
of the individuals in the sample. Just as we allow that different subsets of the
data Z may be missing for different individuals in the sample, we allow the
possibility that different many-to-one functions may be observed for differ-
ent individuals. Specifically, we will define a coarsening (missingness) variable
C such that, when “C = r,” we only get to see a many-to-one function of
the data, which we denote by Gr (Z), and different r correspond to different
Having missing data is equivalent to the case where a subset of the elements of
(Z (1) , . . . , Z (l) ) are observed and the remaining elements are missing. This can
be represented using the coarsening notation {C, GC (Z)}, where, the many-to-
one function Gr (Z) maps the vector Z to a subset of elements of this vector
whenever C = r.
    For example, let Z = (Z (1) , Z (2) )T be a vector of two random variables.
Define
    C          G_C(Z)
    1          Z^{(1)}
    2          Z^{(2)}
    ∞          (Z^{(1)}, Z^{(2)})^T
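A tiny Python illustration (not from the text) of this notation is given below; it simply encodes the many-to-one functions G_r(·) for the two-variable example, with C = "infinity" represented by math.inf. Everything here is hypothetical.

    # Minimal sketch: the coarsening map {C, G_C(Z)} for Z = (Z1, Z2)
    from math import inf

    def G(r, z):
        z1, z2 = z
        if r == 1:
            return (z1,)            # observe Z1 only
        if r == 2:
            return (z2,)            # observe Z2 only
        return (z1, z2)             # r = infinity: full data

    z = (3.2, 7.5)
    observed = [(r, G(r, z)) for r in (1, 2, inf)]
    print(observed)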
Remark 2. If we were dealing only with missing data, say, where different
subsets of an l-dimensional random vector may be missing, it may be more
convenient to define the missingness variable to be an l-dimensional vector
of 1’s and 0’s to denote which element of the vector is observed or missing.
If it is convenient to switch to such notation, we will use R to denote such
missingness indicators. This was the notation used to represent missing data,
for example, in Chapter 6.
   The theory developed in this book will apply to missing and coarsened data
problems where it is assumed that there is a positive probability of observing
the full data. That is, P(C = ∞ | Z = z) ≥ ε > 0 for all z.
Coarsened-Data Mechanisms
To specify models for coarsened data, we must specify the probability distri-
bution of the coarsening process together with the probability model for the
full data (i.e., the data had there not been any coarsening). As with missing-
ness, we can define different coarsening mechanisms that can be categorized
as coarsening completely at random (CCAR), coarsening at random (CAR),
and noncoarsening at random (NCAR).
    These are defined as follows.
1. Coarsening completely at random (CCAR):
   The probability of coarsening does not depend on the data Z at all; that is, P(C = r | Z) is a constant depending only on r.
2. Coarsening at random:
                            P(C = r | Z) = ϖ{r, G_r(Z)}.
    The probability of coarsening depends on Z only as a function of the
    observed data.
 3. Noncoarsening at random:
    Noncoarsening at random (NCAR) corresponds to models where coarsen-
    ing at random fails to hold. That is, the probability of coarsening depends
    on Z, possibly as a function of unobserved parts of Z; i.e., if there exists
    z1 , z2 such that
                                 Gr (z1 ) = Gr (z2 )
    for some r and
                             P(C = r | z_1) ≠ P(C = r | z_2),
    then the coarsening is NCAR.
• Full data are the data Z1 , . . . , Zn that are iid with density pZ (z, β, η) and
  that we would like to observe. That is, full data are the data that we
  would observe had there been no coarsening. With such data we could
  make inference on the parameter β using standard statistical techniques
  developed for such parametric or semiparametric models.
• Because of coarsening, full data are not observed; instead, the observed
  data are denoted by the iid random quantities {C_i, G_{C_i}(Z_i)}, i = 1, . . . , n,
  where Ci denotes the coarsening variable and GCi (Zi ) denotes the corre-
  sponding coarsened data for the i-th individual in the sample. It is observed
  data that are available to us for making inference on the parameter β.
• Finally, when Ci = ∞, then the data for the i-th individual are not coars-
  ened (i.e., when Ci = ∞, we observe the data Zi ). Therefore, we denote
  by complete data the data only for individuals i in the sample such that
  Ci = ∞ (i.e., complete data are {Zi : for all i such that Ci = ∞}). Com-
  plete data are often used for statistical analysis in many software packages
  when there are missing data.
That is, the joint density and likelihood of the coarsening variable and the data are deduced from the
density and likelihood of Z, pZ (z, β, η), and the density and likelihood for
the coarsening mechanism (i.e., the probability of coarsening given Z). We
emphasize that the density for the coarsening mechanism may also be from a
model that is described through the parameter ψ.
Remark 3. Since the coarsening variable C is discrete, the dominating mea-
sure for C is the counting measure that puts unit mass on each of the finite
values that C can take including C = ∞. The dominating measure for Z is,
as before, defined to be νZ (generally the Lebesgue measure for a continuous
random variable, the counting measure for a discrete random variable, or a
combination when Z is a random vector of continuous and discrete random
variables). Consequently, the dominating measure for the densities of (C, Z)
is just the product of the counting measure for C by ν_Z.
Discrete Data
For simplicity, we will first consider the case when Z itself is a discrete random
vector. Consequently, the dominating measure is the counting measure over
the discrete combinations of C and Z, and integrals with respect to such a
dominating measure are just sums. Although this is overly simplistic, it will
be instructive in describing the probabilistic structure of the problem. We will
also indicate how this can be generalized to continuous Z as well.
    Thus, with discrete data, the probability density of the observed data
{C, GC (Z)} is derived as
        P {C = r, GC (Z) = gr } = Σ_{z: Gr (z) = gr} P (C = r, Z = z)
                                = Σ_{z: Gr (z) = gr} P (C = r|Z = z) P (Z = z).
Remark 4. Rather than developing one set of notation for discrete Z and an-
other set of notation (using integrals) for continuous Z, we will, from now on,
use integrals with respect to the appropriate dominating measure. Therefore,
when we have discrete Z, and νZ is the corresponding counting measure, we
will denote
                          P (Z ∈ A) = Σ_{z∈A} P (Z = z)
as
                          ∫_{z∈A} pZ (z) dνZ (z).
    With this convention in mind, we write the density and likelihood of the
observed data, when Z is discrete, as
        pC,GC (Z) (r, gr , ψ, β, η) = ∫_{z: Gr (z) = gr} pC,Z (r, z, ψ, β, η) dνZ (z)
                                    = ∫_{z: Gr (z) = gr} P (C = r|Z = z, ψ) pZ (z, β, η) dνZ (z).        (7.1)
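To make (7.1) concrete for a small discrete example, the following sketch (an illustration only; the distribution of Z, the coarsening probabilities, and all names are invented here) computes P {C = 1, G1 (Z) = g1 } by summing P (C = 1|Z = z)P (Z = z) over {z : G1 (z) = g1 }, where G1 (Z) = Z1 for a bivariate binary Z = (Z1 , Z2 ).

    # hypothetical discrete full data Z = (Z1, Z2), each coordinate binary
    p_z = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

    # coarsening: when C = 1 we observe only G1(Z) = Z1; C = "infinity" observes all of Z.
    # CAR: P(C = 1 | Z) depends on Z only through Z1, which is part of the observed data G1(Z).
    def pr_c1(z):
        return 0.3 + 0.2 * z[0]

    # observed-data probability from (7.1): sum over {z : G1(z) = g1}
    def p_obs(g1):
        return sum(pr_c1(z) * p for z, p in p_z.items() if z[0] == g1)

    print(p_obs(0), p_obs(1))   # 0.09 and 0.35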
Continuous Data
It will be instructive to indicate how the density of the observed data would
be derived if Z was a continuous random vector and the relationship of this
density to (7.1). For example, consider the case where Z = (Z1 , . . . , Zl )T
is continuous. Generally, Gr (Z) is a dimensional-reduction transformation,
unless r = ∞. This is certainly the case for missing-data mechanisms.
    Let Gr (z) be lr -dimensional, lr < l for r ≠ ∞, and assume there exists a
function Vr (z) that is (l − lr )-dimensional so that the mapping z ↔ {Gr (z), Vr (z)}
is one-to-one with inverse transformation z = Hr (gr , vr ). Letting J(gr , vr ) denote
the Jacobian of this inverse transformation, the density of the observed data
{C, GC (Z)} is obtained by a change of variables from z to (gr , vr ) followed by
integration over vr . Consequently, including the parameters ψ, β, and η, we can
write this density as
      pC,GC (Z) (r, gr , ψ, β, η)
         = ∫ P {C = r|Z = Hr (gr , vr ), ψ} pZ {Hr (gr , vr ), β, η} J(gr , vr ) dvr .        (7.5)
Therefore, the only difference between (7.1) for discrete Z and (7.5) for con-
tinuous Z is the Jacobian that appears in (7.5). Since the Jacobian does not
involve parameters in the model, it will not have an effect on the subsequent
likelihood.
The likelihood we derived in (7.1) was general, as it allowed for any coarsening
mechanism, including NCAR. As we indicated earlier, we will restrict
attention to coarsening at random (CAR) mechanisms, where P (C = r|Z =
z) = ϖ{r, Gr (z)} for all r, z. Coarsening completely at random (CCAR) is
just a special case of this. We remind the reader that another key assumption
being made throughout is that P (C = ∞|Z = z) ≥ ε > 0 for all z.
    Now that we have shown how to derive the likelihood of the observed
data from the marginal density of the desired data Z and the coarsening
mechanism, P (C = r|Z = z), we can now derive the likelihood of the observed
data when coarsening is CAR. To do so, we need to consider a model for the
coarsening density in terms of parameters. For the time being, we will be very
general and denote such a model by P (C = r|Z = z, ψ) = ϖ{r, Gr (z), ψ}. Under
CAR, substituting this into (7.1), the likelihood of the observed data for a single
observation becomes
        pC,GC (Z) (r, gr , ψ, β, η) = ϖ(r, gr , ψ) ∫_{z: Gr (z) = gr} pZ (z, β, η) dνZ (z).        (7.6)
Notice that the parameter ψ for the coarsening process separates from the
parameters (β, η) that describe the full-data model. Also notice that if Z were
continuous and we used formula (7.5) to derive the density, then, under CAR,
we obtain
                                 
 pC,GC (Z) (r, gr , ψ, β, η) = ∫ ϖ(r, gr , ψ) pZ {Hr (gr , vr ), β, η} J(gr , vr ) dvr
                             = ϖ(r, gr , ψ) ∫ pZ {Hr (gr , vr ), β, η} J(gr , vr ) dvr ,      (7.7)
where we note that the Jacobian J(gr , vr ) does not involve any of the parameters.
Therefore, the MLE for (β, η) only involves maximizing the function
                          Π_{i=1}^{n} pGri (Z) (gri , β, η),                           (7.10)
where
                 pGr (Z) (gr , β, η) = ∫_{z: Gr (z) = gr} pZ (z, β, η) dνZ (z).
Therefore, as long as we believe the CAR assumption, we can find the MLE
for β and η without having to specify a model for the coarsening process. If the
parameter space for (β, η) is finite-dimensional, this is especially attractive, as
the MLE for β, under suitable regularity conditions, is an efficient estimator.
Moreover, the coarsening probabilities, subject to the CAR assumption, play
no role in either the estimation of β (or η for that matter) or the efficiency
of such an estimator. This has a great deal of appeal, as it frees the analyst
from making modeling assumptions for the coarsening probabilities.
    Maximizing functions such as (7.10) to obtain the MLE may involve inte-
grals and complicated expressions that may not be easy to implement. Nev-
ertheless, there has been a great deal of progress in developing optimization
techniques involving quadrature or Monte Carlo methods, as well as other
maximization routines such as the EM algorithm, which may be useful for
this purpose. Since likelihood methods for missing (coarsened) data have been
studied in great detail by others, there will be relatively little discussion of
these methods in this book. We refer the reader to Allison (2002), Little and
Rubin (1987), Schafer (1997), and Verbeke and Molenberghs (2000) for more
details on likelihood methods for missing data.
Return to Example 1
Let us return to Example 1 of Section 7.1. Recall that in this example two
blood samples of equal volume were taken from each of n individuals in a
study that measured the blood concentration of some biological marker. Some
of the individuals, chosen at random, had concentration measurements made
on both samples. These are denoted as X1 and X2 . The remaining individuals
had their blood samples combined and only one concentration measurement
was made, equaling (X1 +X2 )/2. Although concentrations must be positive, let
us, for simplicity, assume that the distribution of these blood concentrations is
well approximated by a normal distribution. To assess the variability of these
concentrations between and within individuals, we assume that Xj = α + ej ,
where α is normally distributed with mean µα and variance σα2 , and ej , j = 1, 2
are independently normally distributed with mean zero and variance σe2 , also
independent of α. With this representation, σα2 represents the variation be-
tween individuals and σe2 the variation within an individual. From this model,
we deduce that Z = (X1 , X2 )T follows a bivariate normal distribution with
common mean µα and common variance σα2 + σe2 and covariance σα2 .
    Since the individuals chosen to have their blood samples combined were
chosen at random, this is an example of coarsening completely at random
(CCAR). Thus P (C = 1|Z) = ϖ1 , where the constant ϖ1 is the probability of being selected
in the subsample, and P (C = ∞|Z) = 1 − ϖ1 .
    The data for this example can be represented as realizations of the iid
random vectors {Ci , GCi (Zi )}, i = 1, . . . , n, where, if Ci = ∞, then we observe
G∞ (Zi ) = (Xi1 , Xi2 ), whereas if Ci = 1, then we observe G1 (Zi ) = (Xi1 +
Xi2 )/2. Under the assumptions of the model,
                 (Xi1 , Xi2 )T ∼ N {(µα , µα )T , Σ},                     (7.11)
where
                 Σ = ( σα2 + σe2     σα2
                       σα2           σα2 + σe2 ).
Consequently, the MLE for (µα , σα2 , σe2 ) is obtained by maximizing the coarsened-
data likelihood (7.10), which, for this example, is given by
  Π_{i=1}^{n} [ |Σ|−1/2 exp{ −(1/2)(Xi1 − µα , Xi2 − µα ) Σ−1 (Xi1 − µα , Xi2 − µα )T } ]^{I(Ci =∞)}
            × [ (σα2 + σe2 /2)−1/2 exp{ −((Xi1 + Xi2 )/2 − µα )2 / (2(σα2 + σe2 /2)) } ]^{I(Ci =1)} .   (7.12)
    We leave the calculation of the MLE for this example as an exercise at the
end of the chapter.
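To illustrate how the coarsened-data likelihood (7.12) can be maximized numerically, the following sketch (not part of the exercise solution; the simulated values, the pooling probability, and the log-variance parameterization are assumptions made purely for illustration) fits (µα , σα2 , σe2 ) to simulated CCAR data.

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_lik(theta, x_full, x_pooled):
        # theta = (mu, log sigma_alpha^2, log sigma_e^2); the log scale keeps variances positive
        mu, sa2, se2 = theta[0], np.exp(theta[1]), np.exp(theta[2])
        Sigma = np.array([[sa2 + se2, sa2], [sa2, sa2 + se2]])
        Sinv = np.linalg.inv(Sigma)
        d = x_full - mu
        # complete cases: bivariate normal contribution, the first factor of (7.12)
        ll_full = -0.5 * (np.log(np.linalg.det(Sigma)) + np.einsum('ij,jk,ik->i', d, Sinv, d))
        # pooled cases: (X1 + X2)/2 ~ N(mu, sigma_alpha^2 + sigma_e^2/2), the second factor of (7.12)
        v = sa2 + se2 / 2
        ll_pool = -0.5 * (np.log(v) + (x_pooled - mu) ** 2 / v)
        return -(ll_full.sum() + ll_pool.sum())

    rng = np.random.default_rng(0)
    n, mu0, sa0, se0 = 500, 10.0, 4.0, 1.0
    alpha = rng.normal(mu0, np.sqrt(sa0), n)
    X = alpha[:, None] + rng.normal(0.0, np.sqrt(se0), (n, 2))
    pooled = rng.random(n) < 0.4                      # CCAR: samples combined at random
    x_full, x_pooled = X[~pooled], X[pooled].mean(axis=1)

    fit = minimize(neg_log_lik, x0=np.array([x_pooled.mean(), 0.0, 0.0]),
                   args=(x_full, x_pooled), method='BFGS')
    print(fit.x[0], np.exp(fit.x[1]), np.exp(fit.x[2]))   # estimates of (mu_alpha, sigma_alpha^2, sigma_e^2)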
    Although maximizing the likelihood is the preferred method for obtaining
estimators for the parameters in finite-dimensional parametric models of the
full data Z, it may not be a feasible approach for semiparametric models when
the parameter space of the full data is infinite-dimensional. We illustrate some
of the difficulties through an example where the parameter of interest is easily
estimated using likelihood techniques if the data are not coarsened but where
likelihood methods become difficult when the data are coarsened.
    Specifically, suppose the full data consist of a binary response Y and a vector
of covariates X, and that interest focuses on the logistic regression model
                 P (Y = 1|X) = exp(β T X ∗ )/{1 + exp(β T X ∗ )},
where X ∗ denotes the covariate vector augmented by a 1 to accommodate an
intercept. With full data, the MLE for β solves the score equation
                 Σ_{i=1}^{n} Xi∗ [Yi − exp(β T Xi∗ )/{1 + exp(β T Xi∗ )}] = 0.
This was also derived in (4.65). This indeed is the standard analytic strategy
for obtaining estimators for β in a logistic regression model implemented in
most statistical software packages.
    If, however, we had coarsened data (CAR), then the likelihood contribution
for the part of the likelihood that involves β for a single observation is
   ∫_{(y,x): Gr (y,x) = gr} [ exp{(β T x∗ )y}/{1 + exp(β T x∗ )} ] pX {x, η(·)} dνY,X (y, x).    (7.16)
                              SγF (Z) = ∂ log pZ (Z, β0 , γ0 )/∂γ,
      and the full-data parametric submodel nuisance tangent space is the space
      spanned by SγF ; namely,
                       {B q×r SγF (Z) for all q × r matrices B}.
• The full-data score vector with respect to β is
                              SβF (Z) = ∂ log pZ (Z, β0 , η0 )/∂β.
• The efficient full-data score is
                    SeffF (Z) = SβF (Z) − Π{SβF (Z)|ΛF },
   the residual of the score vector after projecting it onto the full-data nuisance
   tangent space ΛF .
With coarsened data, two questions naturally arise:
(i) What is the class of observed-data influence functions, and how are they
    related to full-data influence functions?
(ii) How can we characterize the most efficient influence function and the
     semiparametric efficiency bound for coarsened-data semiparametric mod-
     els?
Both of these, as well as many other issues regarding semiparametric esti-
mators with coarsened data, will be studied carefully over the next several
chapters.
    We remind the reader that observed-data influence functions of RAL es-
timators for β must be orthogonal to the observed-data nuisance tangent
space, which we denote by Λ (without the superscript F ). Therefore, we will
demonstrate how to derive the observed-data nuisance tangent space and its
orthogonal complement.
    When the data are discrete and coarsening is CAR, the observed-data
likelihood for a single observation is given by (7.6); namely,
                                                   
         pC,GC (Z) (r, gr , ψ, β, η) = ϖ(r, gr , ψ) ∫_{z: Gr (z) = gr} pZ (z, β, η) dνZ (z).
                        Λ = Λψ ⊕ Λη ,    Λψ ⊥ Λη .                      (7.18)
For a parametric submodel pZ (z, β, γ) for the full data, where the truth γ = γ0 is the
same as η = η0 , the observed-data score vector with respect to γ, evaluated at
{C = r, Gr (Z) = gr }, is
   Sγ (r, gr ) = { ∫_{Gr (z) = gr} ∂pZ (z, β0 , γ0 )/∂γ dνZ (z) } / { ∫_{Gr (z) = gr} pZ (z, β0 , γ0 ) dνZ (z) }.   (7.20)
Hence,
                    Sγ (r, gr ) = E{SγF (Z)|Gr (Z) = gr }.                  (7.21)
In general, (7.21) will not equal (7.19); however, as we will now show, these
are equal under the assumption of CAR.     
    pZ|C,GC (Z) (z|r, gr ) = pC,Z (r, z) / ∫_{Gr (u) = gr} pC,Z (r, u) dνZ (u)
                          = pZ (z) / ∫_{Gr (u) = gr} pZ (u) dνZ (u) = pZ|Gr (Z) (z|gr ).        (7.25)
The linear subspace Λη consists of elements B q×r E{SγF (Z)|C, GC (Z)} for
some parametric submodel, or limits (as n → ∞) of elements
                        Bn q×rn E{Sγn F (Z)|C, GC (Z)}
for a sequence of parametric submodels and conformable matrices. This is the
same as the space consisting of elements
                          E{B q×r SγF (Z)|C, GC (Z)}
or limits of elements
                         E{Bn q×rn Sγn F (Z)|C, GC (Z)}.
But the space of elements B q×r SγF (Z), or limits of elements Bn q×rn Sγn F (Z),
for parametric submodels is precisely the definition of the full-data nuisance
tangent space ΛF . Consequently, the space Λη can be characterized as the
space of elements
              Λη = [ E{αF (Z)|C, GC (Z)} for all αF (Z) ∈ ΛF ].
    In the special case where data are missing or coarsened by design (i.e.,
when there are no additional parameters ψ necessary to define a model for
the coarsening probabilities), then the observed-data nuisance tangent space
is Λ = Λη . We know that an influence function of an observed-data RAL
estimator for β must be orthogonal to Λ. Toward that end, we now characterize
the space orthogonal to Λη (i.e., Λη ⊥).
                         E[h{C, GC (Z)}|Z] ∈ ΛF ⊥ .
Accordingly, we define the mapping
                                K : H → HF
to be
                           K(h) = E[h{C, GC (Z)}|Z]                        (7.30)
for h ∈ H. Because of the linear properties of conditional expectations, the
mapping K, given by (7.30), is a linear mapping or linear operator.
    By Lemma 7.3, the space Λη ⊥ can be defined as
                               Λη ⊥ = K−1 (ΛF ⊥ ),
Lemma 7.4. For any ϕ∗F (Z) ∈ ΛF ⊥ , let K−1 {ϕ∗F (Z)} denote the space of
elements h̃{C, GC (Z)} ∈ H such that
                         E[h̃{C, GC (Z)}|Z] = ϕ∗F (Z);
then
(i) find, for each ϕ∗F (Z) ∈ ΛF ⊥ , one element h{C, GC (Z)} ∈ H such that
    E[h{C, GC (Z)}|Z] = ϕ∗F (Z), and
(ii) find Λ2 = K−1 (0).
   We now derive the space Λη ⊥ in the following theorem.
the space Λη ⊥ consists of all elements that can be written as
    I(C = ∞)ϕ∗F (Z)/ϖ(∞, Z) + L2 {C, GC (Z)},   ϕ∗F (Z) ∈ ΛF ⊥ ,  L2 {C, GC (Z)} ∈ Λ2 .      (7.32)
An element L2 {C, GC (Z)} = I(C = ∞)L2∞ (Z) + Σ_{r≠∞} I(C = r)L2r {Gr (Z)} belongs
to Λ2 = K−1 (0) if and only if
    ϖ(∞, Z)L2∞ (Z) + Σ_{r≠∞} ϖ{r, Gr (Z)}L2r {Gr (Z)} = 0.        (7.36)
Solving (7.36) for L2∞ (Z) involves dividing by ϖ(∞, Z),
where, again, assumption (7.31) is needed to guarantee that we are not di-
viding by zero. Hence, for any L2r {Gr (Z)}, r ≠ ∞, we can define a typical
element of Λ2 as
    [I(C = ∞)/ϖ(∞, Z)] [ Σ_{r≠∞} ϖ{r, Gr (Z)}L2r {Gr (Z)} ] − Σ_{r≠∞} I(C = r)L2r {Gr (Z)}.
                                                                           (7.37)
    Combining the results from (7.34) and (7.37), we are now able to explicitly
define the linear space Λη ⊥ to be that consisting of all elements given by (7.32).
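As a quick numerical check of the construction in (7.37), the sketch below (a toy example with two levels of coarsening; the probabilities and the function L21 are arbitrary choices made here) verifies that the resulting element has conditional mean zero given Z, which is the defining property of Λ2 .

    import numpy as np

    # two levels of coarsening: C = 1 observes G1(Z) = Z1; C = "infinity" observes Z = (Z1, Z2)
    def pi_1(z1):                  # P(C = 1 | Z), a function of Z1 only (CAR)
        return 0.3 + 0.2 * z1

    def L21(z1):                   # an arbitrary function of the coarsened data G1(Z)
        return 5.0 * z1 - 1.0

    # typical element of Lambda_2 as in (7.37), specialized to the two-level case
    def L2(c, z1):
        if c == np.inf:
            return pi_1(z1) * L21(z1) / (1.0 - pi_1(z1))
        return -L21(z1)

    # E[L2{C, G_C(Z)} | Z = z] = (1 - pi_1) L2(inf, z1) + pi_1 L2(1, z1), which should be zero
    for z1 in (0.0, 1.0):
        print(z1, (1 - pi_1(z1)) * L2(np.inf, z1) + pi_1(z1) * L2(1, z1))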
    Since the missingness probabilities are known by design for this example,
the nuisance tangent space for the observed data is Λη , and the space or-
thogonal to the nuisance tangent space, Λη ⊥, derived in (7.32) of Theorem 7.2,
is
    { Rϕ∗F (Z)/π(Y, X (1) ) + L2 {C, GC (Z)} : ϕ∗F (Z) ∈ ΛF ⊥ , L2 {C, GC (Z)} ∈ Λ2 }.   (7.39)
Since a typical element ϕ∗F (Z) ∈ ΛF ⊥ for the restricted moment model is
A(X){Y − µ(X, β)}, this motivates the estimating equation (7.41),
  Σ_{i=1}^{n} [ Ri /π(Yi , Xi(1) ) A(Xi ){Yi − µ(Xi , β)}
             + {Ri − π(Yi , Xi(1) )}/π(Yi , Xi(1) ) L(Yi , Xi(1) ) ] = 0.
Expanding µ(Xi , β̂n ) about β0 in this equation, we obtain
0 = Σ_{i=1}^{n} [ Ri /π(Yi , Xi(1) ) A(Xi ){Yi − µ(Xi , β̂n )} + {Ri − π(Yi , Xi(1) )}/π(Yi , Xi(1) ) L(Yi , Xi(1) ) ]
  = Σ_{i=1}^{n} [ Ri /π(Yi , Xi(1) ) A(Xi ){Yi − µ(Xi , β0 )} + {Ri − π(Yi , Xi(1) )}/π(Yi , Xi(1) ) L(Yi , Xi(1) ) ]
    − [ Σ_{i=1}^{n} Ri /π(Yi , Xi(1) ) A(Xi )D(Xi , βn∗ ) ] (β̂n − β0 ),
where D(X, β) = ∂µ(X, β)/∂β T and βn∗ is an intermediate value between β̂n
and β0 . Therefore,
   n1/2 (β̂n − β0 ) = [ n−1 Σ_{i=1}^{n} Ri /π(Yi , Xi(1) ) A(Xi )D(Xi , βn∗ ) ]−1
                     × n−1/2 Σ_{i=1}^{n} [ Ri /π(Yi , Xi(1) ) A(Xi ){Yi − µ(Xi , β0 )}
                                          + {Ri − π(Yi , Xi(1) )}/π(Yi , Xi(1) ) L(Yi , Xi(1) ) ].
Under suitable regularity conditions,
   n−1 Σ_{i=1}^{n} Ri /π(Yi , Xi(1) ) A(Xi )D(Xi , βn∗ )  →P  E[ R/π(Y, X (1) ) A(X)D(X, β0 ) ],
which, by iterated conditional expectations and the fact that E(R|Y, X) = π(Y, X (1) ), equals
E{A(X)D(X, β0 )}.
Consequently,
   n1/2 (β̂n − β0 ) = n−1/2 Σ_{i=1}^{n} [E{A(X)D(X, β0 )}]−1
       × [ Ri /π(Yi , Xi(1) ) A(Xi ){Yi − µ(Xi , β0 )}
           + {Ri − π(Yi , Xi(1) )}/π(Yi , Xi(1) ) L(Yi , Xi(1) ) ] + op (1).        (7.44)
Thus the influence function of β̂n is the i-th summand in (7.44),
which we demonstrated has mean zero, in (7.42) and (7.43), using iterated
conditional expectations.
    We note that this influence function is proportional to the element in Λη ⊥
that motivated the corresponding m-estimator. Also, because this estimator is
asymptotically linear, as shown in (7.44), we immediately deduce that this es-
timator is asymptotically normal with asymptotic variance being the variance
of its influence function. Other than regularity conditions, no assumptions
were made on the distribution of (Y, X), beyond that of the restricted mo-
ment assumption, to obtain asymptotic normality. Therefore, the resulting
estimator is a semiparametric estimator.
    Standard methods using a sandwich variance can be used to derive an
estimator for the asymptotic variance of β̂n , the solution to (7.41). Such a
sandwich estimator was derived for the full-data GEE estimator in (4.9) of
Section 4.1. We leave the details as an exercise at the end of the chapter.
    Hence, for the restricted moment model with missing data that are miss-
ing by design, we have derived the space orthogonal to the nuisance tangent
space (i.e., Λη ⊥) and have constructed an m-estimator with influence function
proportional to any element of Λη ⊥. Since all influence functions of RAL es-
timators for β must belong to Λη ⊥, this means that any RAL estimator for
β must be asymptotically equivalent to one of the estimators given by the
solution to (7.41).
Remark 10. The estimator for β, given as the solution to (7.41), is referred
to as an augmented inverse probability weighted complete-case (AIPWCC)
estimator. If L(Y, X (1) ) is chosen to be identically equal to zero, then the
estimating equation in (7.41) becomes
          Σ_{i=1}^{n} Ri /π(Yi , Xi(1) ) A(Xi ){Yi − µ(Xi , β)} = 0.                 (7.45)
    The choice of the influence function and hence the corresponding class
of estimators depends on the arbitrary functions A(X) and L(Y, X (1) ). With
full data, the class of estimating equations is characterized by (7.38).
We gave an example in Section 7.2 where we argued that with coarsened data it
was difficult to obtain estimators for β using likelihood methods. Specifically,
we considered the logistic regression model for the probability of response
Y = 1 as a function of covariates X, where Y denotes a binary response
variable. Let us consider the likelihood for such a model if we had missing
data by design as described above; that is, where X = (X (1)T , X (2)T )T and
where we always observe Y and X (1) on everyone in the sample but only
observe X (2) on a subset chosen at random with probability π(Y, X (1) ) by
design. Also, to allow for an intercept term in the logistic regression model,
we define X ∗ = (1, X (1)T , X (2)T )T and X (1∗) = (1, X (1)T )T . The density of
the full data (Y, X) for this problem can be written as
   pY,X (y, x, β, η1 , η2 ) = pY |X (y|x, β) pX (2) |X (1) (x(2) |x(1) , η1 ) pX (1) (x(1) , η2 )
     = [ exp{(β1T x(1∗) + β2T x(2) )y} / {1 + exp(β1T x(1∗) + β2T x(2) )} ]
       × pX (2) |X (1) (x(2) |x(1) , η1 ) pX (1) (x(1) , η2 ),
where β is partitioned as β = (β1T , β2T )T , pX (2) |X (1) (x(2) |x(1) , η1 ) denotes the
conditional density of X (2) given X (1) , specified through the parameter η1 ,
and pX (1) (x(1) , η2 ) denotes the marginal density of X (1) , specified through
the parameter η2 . Because the parameter of interest β separates from the
parameters η1 and η2 in the density above, finding the MLE for β with full
data only involves maximizing the part of the likelihood above involving β
and is easily implemented in most software packages.
    In contrast, the density of the observed data (R, Y, X (1) , RX (2) ) is given
by the product of the terms
   [ exp{(β1T x(1∗) + β2T x(2) )y} / {1 + exp(β1T x(1∗) + β2T x(2) )} × pX (2) |X (1) (x(2) |x(1) , η1 ) ]r      (7.46)
and
   [ ∫ exp{(β1T x(1∗) + β2T u)y} / {1 + exp(β1T x(1∗) + β2T u)} pX (2) |X (1) (u|x(1) , η1 ) dνX (2) (u) ]1−r      (7.47)
× pX (1) (x(1) , η2 ).
Because the parameters β and η1 do not separate in the density above, de-
riving the MLE for β involves maximizing, as a function of β and η1 , the
product (over i = 1, . . . , n) of terms (7.46) × (7.47). Even if we were willing
to make simplifying parametric assumptions about the conditional distribu-
tion of X (2) given X (1) in terms of a finite number of parameters η1 , this
would be a complicated maximization, but if we wanted to be semiparametric
(i.e., put no restrictions on the conditional distribution of X (2) given X (1) ),
then this problem would be impossible as it would suffer from the curse of
dimensionality. Notice that in the likelihood formulation above, nowhere do
the probabilities π(Y, X (1) ) come into play, even though they are known to us
by design.
    Since the logistic regression model is just a simple example of a restricted
moment model, estimators for the parameter β for the semiparametric model,
which puts no restrictions on the joint distribution of (X (1) , X (2) ), can be
found easily by solving the estimating equation (7.41), where µ(Xi , β) =
exp(β T Xi∗ )/{1 + exp(β T Xi∗ )} and for some choice of A(X) and L(Y, X (1) ).
    With no missing data, we showed in (4.65) that the optimal choice for
A(X) is X ∗ . Consequently, one easy way of obtaining an estimator for β is
by solving (7.41) using A(Xi ) = Xi∗ and L(Yi , Xi(1) ) = 0, leading to the
estimating equation
      Σ_{i=1}^{n} Ri /π(Yi , Xi(1) ) Xi∗ [ Yi − exp(β T Xi∗ )/{1 + exp(β T Xi∗ )} ] = 0.         (7.48)
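As an illustration of (7.48), the sketch below (a simulation written for this discussion; the data-generating values, the known design probabilities π(Y, X (1) ), and all variable names are assumptions made here) solves the inverse probability weighted complete-case estimating equation by a weighted Newton-Raphson iteration.

    import numpy as np

    rng = np.random.default_rng(1)
    n, beta_true = 2000, np.array([-0.5, 1.0, 0.8])       # intercept, X1, X2
    X1, X2 = rng.normal(size=n), rng.normal(size=n)
    Xstar = np.column_stack([np.ones(n), X1, X2])
    Y = rng.binomial(1, 1 / (1 + np.exp(-Xstar @ beta_true)))

    # missingness by design: X2 is observed (R = 1) with known probability pi(Y, X1)
    pi = 1 / (1 + np.exp(-(0.5 + Y - 0.5 * X1)))
    R = rng.binomial(1, pi)

    # solve (7.48): sum_i (R_i/pi_i) Xstar_i {Y_i - expit(beta' Xstar_i)} = 0
    w = R / pi
    beta = np.zeros(3)
    for _ in range(25):
        mu = 1 / (1 + np.exp(-Xstar @ beta))
        score = Xstar.T @ (w * (Y - mu))
        info = (Xstar * (w * mu * (1 - mu))[:, None]).T @ Xstar
        beta = beta + np.linalg.solve(info, score)
    print(beta)            # approximately recovers beta_true as n grows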
• Full data are denoted by Z, the data we would like to observe had there
  been no coarsening.
• Coarsened data are denoted by {C, GC (Z)}, where the coarsening variable
  C is a discrete random variable taking on finitely many values together with ∞.
  When C = r, r ≠ ∞, we observe the many-to-one transformation Gr (Z).
  C = ∞ is reserved to denote complete data; i.e., G∞ (Z) = Z.
• We distinguish among three types of coarsening mechanisms:
  – Coarsening completely at random (CCAR): The coarsening probabili-
      ties do not depend on the data.
  – Coarsening at random (CAR): The coarsening probabilities only de-
      pend on the data as a function of the observed data.
  – Noncoarsening at random (NCAR): The coarsening probabilities de-
      pend on the unobserved part of the data.
• Throughout, we assume that ϖ(∞, z) = P (C = ∞|Z = z) ≥ ε > 0 for all z
  in the support of Z.
• H denotes the observed-data Hilbert space of q-dimensional, mean-zero,
  finite-variance, measurable functions of {C, GC (Z)} equipped with the co-
  variance inner product.
• Because C takes on a finite set of values, a typical function h{C, GC (Z)}
  can be written as
      h{C, GC (Z)} = I(C = ∞)h∞ (Z) + Σ_{r≠∞} I(C = r)hr {Gr (Z)}.
• The observed-data nuisance tangent space decomposes as Λ = Λψ ⊕ Λη , Λψ ⊥ Λη ,
  where Λψ is associated with the coarsening model and Λη with the full-data
  nuisance parameter η.
• In this chapter, we did not consider models for the coarsening probabilities;
  rather, we assumed they are known by design. Therefore, we didn’t need to
  consider the space Λψ , in which case the observed-data nuisance tangent
  space Λ = Λη .
• Observed-data estimating equations, when coarsening is by design, are
  motivated by considering elements in the space Λη ⊥, where
                        Λη ⊥ = I(C = ∞)ΛF ⊥ /ϖ(∞, Z) ⊕ Λ2
  and
                    Λ2 = [ L2 {C, GC (Z)} : E[L2 {C, GC (Z)}|Z] = 0 ].
                                  I(C = ∞)ΛF ⊥ /ϖ(∞, Z)
is an element of Λ2 .
      b) Using the full-data influence function that was derived in 1(c) above
         (or, equivalently, using the full-data score vector), write out a set of
         AIPWCC estimating equations that can be used to obtain observed-
         data estimators for (µα , σα2 , σe2 ).
4. Derive an estimator for the asymptotic variance of β̂n , the AIPWCC es-
   timator for β in the restricted moment model given by the solution to (7.41),
   where data were missing by design.
8
The Nuisance Tangent Space and Its
Orthogonal Complement
                    π(Z1 , ψ) = exp(ψ0 + ψ1T Z1 )/{1 + exp(ψ0 + ψ1T Z1 )},                (8.1)
and the parameter ψ = (ψ0 , ψ1T )T needs to be estimated from the observed
data. Although this illustration assumed a logistic regression model that was
linear in Z1 , we could easily have considered more complex models where we
include higher-order terms, interactions, regression splines, or whatever else
the data analyst deems appropriate.
When there are more than two levels of missingness or coarsening of the data,
we distinguish between monotone and nonmonotone coarsening.
    A form of missingness that often occurs in practice is monotone missing-
ness. Because of its importance, we now describe monotone missingness, or
more generally monotone coarsening, in more detail and discuss methods for
developing models for such monotone missingness mechanisms.
    For some problems, we can order the coarsening variable C in such a way
that the coarsened data Gr (Z) when C = r are a coarsened version of Gr′ (Z)
for all r′ > r. In such a case, Gr (Z) is a many-to-one function of Gr+1 (Z);
that is,
                           Gr (z) = fr {Gr+1 (z)},
where fr (·) denotes a many-to-one function that depends on r. In other words,
G1 (Z) is the most coarsened data, G2 (Z) less so, and G∞ (Z) = Z is not
coarsened at all. For example, with longitudinal data, suppose we intend to
measure an individual at l different time points so that Z = (Y1 , . . . , Yl ),
where Yj denotes the measurement at the j-th time point, j = 1, . . . , l. For
such longitudinal studies, it is not uncommon for some individuals to drop
out during the course of the study, in which case we would observe the data
up to the time they dropped out and all subsequent measurements would be
missing. This pattern of missingness is monotone and can be described by
                               r         Gr (Z)
                               1          (Y1 )
                               2        (Y1 , Y2 )
                               ..
                                .
                             l − 1 (Y1 , . . . , Yl−1 )
                               ∞ (Y1 , . . . , Yl )
    When data are CAR, we consider models for the coarsening probabilities,
which, in general, are denoted by ϖ{r, Gr (Z), ψ}. For monotone coarsening, it is
convenient to express these probabilities through the discrete hazard function
                    λr {Gr (Z)} = P (C = r | C ≥ r, Z).        (8.2)
That λr (·) is a function of Gr (Z) follows by noting that the right-hand side
of (8.2) equals
      P (C = r|Z)/P (C ≥ r|Z) = ϖ{r, Gr (Z)} / [ 1 − Σ_{r′ ≤ r−1} ϖ{r′, Gr′ (Z)} ],      (8.3)
which, under monotone coarsening, depends on Z only through Gr (Z).
Equations (8.3), (8.4), and (8.5) demonstrate that there is a one-to-one re-
lationship between the coarsening probabilities ϖ{r, Gr (Z)} and the discrete
hazard functions λr {Gr (Z)}. Using discrete hazards, the probability of a complete
case (i.e., C = ∞) is given by
                        ϖ(∞, Z) = Π_{r≠∞} [ 1 − λr {Gr (Z)} ].                (8.6)
    The use of discrete hazards provides a natural way of thinking about mono-
tone coarsening. For example, suppose we were asked to design a longitudi-
nal study with monotone missingness. We can proceed as follows. First, we
would collect G1 (Z) = Y1 . Then, with probability λ1 {G1 (Z)} (that is, with
probability depending on Y1 ), we would stop collecting additional data. How-
ever, with probability 1 − λ1 {G1 (Z)}, we would collect Y2 , in which case
we now have G2 (Z) = (Y1 , Y2 ). If we collected (Y1 , Y2 ), then with probabil-
ity λ2 {G2 (Z)} we would stop collecting additional data, but with probabil-
ity 1 − λ2 {G2 (Z)} we would collect Y3 , in which case we would have col-
lected G3 (Z) = (Y1 , Y2 , Y3 ). We continue in this fashion, either stopping at
stage r′ after collecting Gr′ (Z) = (Y1 , . . . , Yr′ ) or continuing, with probability
λr′ {Gr′ (Z)} or 1 − λr′ {Gr′ (Z)}, respectively. When monotone missingness is
viewed in this fashion, it is clear that, conditional on having reached stage r′,
there are two choices: either stop or continue to the next stage with condi-
tional probability λr′ {Gr′ (Z)} or 1 − λr′ {Gr′ (Z)}. Therefore, when we build
models for the coarsening probabilities of monotone coarsened data, it is nat-
ural to consider individual models for each of the discrete hazards. Because
of the binary choice made at each stage, logistic regression models for the
discrete hazards are often used. For example, for the longitudinal data given
above, we may consider a model where it is assumed that
  λr {Gr (Z)} = exp(ψ0r + ψ1r Y1 + . . . + ψrr Yr )/{1 + exp(ψ0r + ψ1r Y1 + . . . + ψrr Yr )},
  r = 1, . . . , l − 1.        (8.7)
    Missing or coarsened data can also come about in a manner that is non-
monotone. For the longitudinal data example given above, suppose patients
didn’t necessarily drop out of the study but rather missed visits from time
to time. In such a case, some of the longitudinal data might be missing but
not necessarily in a monotone fashion. In the worst-case scenario, any of the
2l − 1 combinations of (Y1 , . . . , Yl ) might be missing for different patients in
the study. Building coherent models for the missingness probabilities for such
nonmonotone missing data, even under the assumption that missingness is
MAR, is challenging. There have been some suggestions for developing non-
monotone missingness models given by Robins and Gill (1997) using what they
call randomized monotone missingness (RMM) models. Because of the com-
plexity of nonmonotone missingness models, we will not discuss such models
specifically in this book. In what follows, we will develop the semiparametric
theory assuming that coherent missingness or coarsening models were used.
Specific examples with two levels of missingness or monotone missingness will
be used to illustrate the results.
Models for the coarsening probabilities are described through the parameter
ψ. Specifically, it is assumed that P (C = r|Z = z, ψ) = ϖ{r, Gr (z), ψ}, where
ψ is often assumed to be a finite-dimensional parameter. Estimates for the
parameter ψ can be obtained using maximum likelihood. We remind the reader
that, because of the factorization of the observed-data likelihood given by (7.6),
the maximum likelihood estimator ψ̂n for ψ is obtained by maximizing
                                Π_{i=1}^{n} ϖ{Ci , GCi (Zi ), ψ}.                  (8.8)
For two levels of missingness, with Ri the complete-case indicator and π(Z1i , ψ)
the probability of a complete case as in (8.1), this likelihood becomes
                     Π_{i=1}^{n} {π(Z1i , ψ)}Ri {1 − π(Z1i , ψ)}1−Ri .             (8.9)
So, for example, if we entertained the logistic regression model (8.1), then the
maximum likelihood estimator for (ψ0 , ψ1T )T would be obtained by maximiz-
ing (8.9) or, more specifically, by maximizing
        Π_{i=1}^{n} exp{(ψ0 + ψ1T Z1i )Ri } / {1 + exp(ψ0 + ψ1T Z1i )}.        (8.10)
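In practice, maximizing (8.10) is nothing more than an ordinary logistic regression of the complete-case indicator R on Z1 . The sketch below (with a simulated missingness mechanism and parameter values chosen only for illustration) obtains ψ̂n this way.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(6)
    n = 2000
    Z1 = rng.normal(size=n)
    psi_true = np.array([0.2, 0.8])                        # assumed (psi_0, psi_1)
    p_complete = 1 / (1 + np.exp(-(psi_true[0] + psi_true[1] * Z1)))
    R = rng.binomial(1, p_complete)                        # R = 1 indicates a complete case

    # maximizing (8.10) is an ordinary logistic-regression fit of R on Z1
    fit = sm.Logit(R, sm.add_constant(Z1)).fit(disp=0)
    print(fit.params)                                      # estimates of (psi_0, psi_1)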
Substituting the right-hand side of (8.11) for ϖ(·, ψ) into (8.8) and rearranging
terms, we obtain that the likelihood for monotone coarsening can be expressed
as
   Π_{r≠∞} Π_{i: Ci ≥ r} [ λr {Gr (Zi ), ψ} ]I(Ci =r) [ 1 − λr {Gr (Zi ), ψ} ]I(Ci >r) .    (8.12)
   So, for example, if we consider the logistic regression models used to model
the discrete hazards for the monotone missing longitudinal data given by (8.7),
then the likelihood is given by
   Π_{r=1}^{l−1} Π_{i: Ci ≥ r} exp{(ψ0r + ψ1r Y1i + . . . + ψrr Yri )I(Ci = r)} / {1 + exp(ψ0r + ψ1r Y1i + . . . + ψrr Yri )}.    (8.13)
Because the likelihood in (8.13) factors into a product of l−1 logistic regression
likelihoods, standard logistic regression software can be used to maximize
(8.13).
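To illustrate this point, the sketch below (a simulated longitudinal example; the dropout mechanism, parameter values, and variable names are assumptions made here, not taken from the text) generates monotone dropout through stage-specific discrete hazards and then fits each hazard in (8.13) by a separate logistic regression among subjects still under observation.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n, l = 1000, 4
    Y = rng.normal(size=(n, l)) + np.arange(l)             # hypothetical longitudinal outcomes
    C = np.full(n, np.inf)                                  # inf denotes a complete case

    # monotone dropout: the hazard at stage r depends on the observed history through Y_r
    psi_true = (-2.0, 0.3)
    for r in range(1, l):                                   # stages r = 1, ..., l - 1
        at_risk = C == np.inf
        haz = 1 / (1 + np.exp(-(psi_true[0] + psi_true[1] * Y[:, r - 1])))
        C[at_risk & (rng.random(n) < haz)] = r

    # fit the stage-r discrete-hazard logistic regression among {i : C_i >= r}, as in (8.13)
    for r in range(1, l):
        risk = C >= r
        design = sm.add_constant(Y[risk, :r])               # observed history (Y_1, ..., Y_r)
        outcome = (C[risk] == r).astype(float)              # indicator of dropping out at stage r
        print(r, sm.Logit(outcome, design).fit(disp=0).params)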
Recall that the observed-data nuisance tangent space can be written as
                                Λ = Λψ ⊕ Λη ,
where Λψ is the space associated with the coarsening model parameter ψ and
Λη is the space associated with the infinite-dimensional nuisance parameter
η. In Chapter 7, we derived the space Λη and its orthogonal complement. We
now consider the space Λψ and some of its properties. Because the space Λψ
will play an important role in deriving RAL estimators for β with coarsened
data, when the coarsening probabilities are not known and must be modeled,
we will refer to this space as the coarsening model tangent space and give a
formal definition as follows.
The coarsening model tangent space Λψ is the linear space spanned by the
score vector for ψ; that is, the space of elements B q×s Sψ {C, GC (Z), ψ0 } for all
constant matrices B q×s , where
                     Sψ s×1 = ∂ log ϖ{C, GC (Z), ψ0 }/∂ψ,              (8.15)
and ψ0 denotes the true value of ψ.
Taking the partial derivative inside the sum, dividing and multiplying by
ϖ{r, Gr (z), ψ}, and setting ψ = ψ0 yields
            Σ_r Sψ {r, Gr (z), ψ0 } ϖ{r, Gr (z), ψ0 } = 0     for all z,
or
            E[ Sψ {C, GC (Z), ψ0 } | Z = z ] = 0     for all z.
Hence
            E[ B q×s Sψ {C, GC (Z), ψ0 } | Z ] = 0
for a typical element B q×s Sψ {C, GC (Z), ψ0 } of Λψ ; that is, Λψ is contained in Λ2 .
Moreover, for any h{C, GC (Z)} ∈ Λ2 and any element E{αF (Z)|C, GC (Z)} of Λη ,
            E[ hT {C, GC (Z)} E{αF (Z)|C, GC (Z)} ] = E[ hT {C, GC (Z)} αF (Z) ]
                                                   = E[ E{h{C, GC (Z)}|Z}T αF (Z) ] = 0,
since h ∈ Λ2 . Since Λψ is contained in Λ2 , this implies that Λψ is orthogonal to Λη .
We are now in a position to derive the space orthogonal to the nuisance
tangent space.
Before discussing how these results can be used to derive RAL estimators
for β when data are CAR and when the coarsening probabilities need to be
modeled and estimated, which will be deferred to the next chapter, we close
this chapter by defining the space of observed-data influence functions. Recall
that an observed-data influence function ϕ{C, GC (Z)} of an RAL estimator for β
must be orthogonal to the observed-data nuisance tangent space Λ and must satisfy
            E[ ϕ{C, GC (Z)} SβT {C, GC (Z)} ] = I q×q ,
where Sβ {C, GC (Z)} is the observed-data score vector with respect to β. For
completeness, we will now define the space of observed-data influence func-
tions.
Theorem 8.3. When data are coarsened at random (CAR) with coarsening
probabilities P (C = r|Z) = ϖ{r, Gr (Z), ψ}, where Λψ is the space spanned
by the score vector Sψ {C, GC (Z)} (i.e., the coarsening model tangent space),
then the space of observed-data influence functions, also denoted by (IF ), is
the linear variety contained in H, which consists of elements
  ϕ{C, GC (Z)} = I(C = ∞)ϕF (Z)/ϖ(∞, Z, ψ0 ) + L2 {C, GC (Z)} − Π[{·}|Λψ ],      (8.19)
where {·} denotes the sum of the first two terms, ϕF (Z) is a full-data influence
function, and L2 {C, GC (Z)} ∈ Λ2 .
Proof. We first note that we can use the exact same arguments as used in
lemmas 7.1 and 7.2 to show that the observed-data score vector with respect
to β is                              '                (
                  Sβ {C, GC (Z)} = E SβF (Z)|C, GC (Z) ,
where SβF (Z) is the full-data score vector with respect to β,
                                     ∂ log pZ (z, β0 , η0 )
                         SβF (z) =                          .
                                             ∂β
    I q×q = E[ { I(C = ∞)ϕ∗F (Z)/ϖ(∞, Z) + L2 {C, GC (Z)} − Π[{ · }|Λψ ] }
               × E{SβF (Z)|C, GC (Z)}T ],
and
                        Λ = Λψ ⊕ Λη ,   Λψ ⊥ Λη ,
                        (IF ) = I(C = ∞)(IF )F /ϖ(∞, Z) + Λ2 .
   model for the coarsening process where we modeled the discrete hazard
   function using equation (8.7) and derived the likelihood contribution for
   the coarsening model in (8.13). Let ψr denote the vector of parameters
   (ψ0r , . . . , ψrr )T for r = 1, . . . , l − 1, and let ψ denote the entire vector of
   parameters for the coarsening probabilities; that is, ψ = (ψ1T , . . . , ψl−1T )T .
    a) Derive the score vector
where the estimating function evaluated at the truth, m(Z, β0 ), was chosen so
that m(Z, β0 ) = ϕ∗F (Z) ∈ ΛF ⊥ . For example, we take m(Z, β) = A(X){Y −
µ(X, β)} for the restricted moment model. The influence function of such a
full-data estimator for β was derived in Chapter 3, formula (3.6), and is given
by
                  −[ E{∂m(Zi , β0 )/∂β T } ]−1 m(Zi , β0 ) = ϕF (Zi ).        (9.2)
The estimating equation (9.3) is not a sum of iid quantities and hence the
resulting estimator is not, strictly speaking, an m-estimator. However, in many
situations, and certainly in all cases considered in this book, the estimator β̂n
that solves (9.3) will be asymptotically equivalent to the m-estimator β̂n∗ that
solves the equation
                          Σ_{i=1}^{n} m(Zi , β, η0 ) = 0,
with η0 known, in the sense that n1/2 (β̂n − β̂n∗ ) → 0 in probability. Without going into detail,
                                                   → 0. Without going into detail,
this asymptotic equivalence occurs because m(Z, β0 , η0 ) = ϕ∗F (Z) is orthogo-
nal to the nuisance tangent space. We illustrated this asymptotic equivalence
for parametric models in Section 3.3 using equation (3.30) (also see Remark
4 of this section). Therefore, from here on, with a slight abuse of notation,
we will still refer to estimators such as those that solve (9.3) as m-estimators
with estimating function m(Z, β).     
    The estimator that is the solution to the estimating equation (9.4) is re-
ferred to as an AIPWCC estimator. If we take the element L2 (·) to be iden-
tically equal to zero, then the estimating equation becomes
            Σ_{i=1}^{n} I(Ci = ∞)m(Zi , β)/ϖ(∞, Zi , ψ0 ) = 0,
and the resulting estimator is an IPWCC estimator since only complete cases
are considered in the sum above (i.e., {i : Ci = ∞}), weighted by the inverse
probability of being a complete case. The term L2 {Ci , GCi (Zi ), ψ0 } allows con-
tributions to the sum by observations that are not complete (i.e., coarsened),
and this is referred to as the augmented term.
    Using standard Taylor series expansions for m-estimators (which we leave
as an exercise for the reader), we can show that the influence function of the
estimator, derived by solving (9.4), is equal to
  −[ E{∂m(Zi , β0 )/∂β T } ]−1 [ I(Ci = ∞)m(Zi , β0 )/ϖ(∞, Zi , ψ0 ) + L2 {Ci , GCi (Zi ), ψ0 } ]
     = I(Ci = ∞)ϕF (Zi )/ϖ(∞, Zi , ψ0 ) + L∗2 {Ci , GCi (Zi ), ψ0 },                       (9.5)
where ϕF (Zi ) was defined in (9.2) and
                   L∗2 = −[ E{∂m(Zi , β0 )/∂β T } ]−1 L2 ∈ Λ2 .                  (9.6)
    Therefore, we can now summarize these results. If coarsening of the data
were by design with known coarsening probabilities ϖ{r, Gr (Z), ψ0 }, for all
r, and we wanted to obtain an observed-data RAL estimator for β in a semi-
parametric model, we would proceed as follows.
 1. Choose a full-data estimating function m(Z, β).
 2. Choose an element of the augmentation space Λ2 as follows.
    a) For each r ≠ ∞, choose a function L2r {Gr (Z)}.
    b) Construct L2 {C, GC (Z), ψ0 } to equal
         [I(C = ∞)/ϖ(∞, Z, ψ0 )] [ Σ_{r≠∞} ϖ{r, Gr (Z), ψ0 }L2r {Gr (Z)} ]
            − Σ_{r≠∞} I(C = r)L2r {Gr (Z)}.
 3. Solve the resulting AIPWCC estimating equation (9.4),
         Σ_{i=1}^{n} [ I(Ci = ∞)m(Zi , β)/ϖ(∞, Zi , ψ0 ) + L2 {Ci , GCi (Zi ), ψ0 } ] = 0,
    for β (a numerical sketch of these steps is given below).
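The following sketch makes steps 1 to 3 concrete for two levels of missingness (everything here, including the linear working model, the particular choice of L2r , the known probabilities π, and the simulated data, is an assumption introduced for illustration and not a prescription from the text). It solves an AIPWCC estimating equation of the form (9.4) with m(Z, β) = X ∗ (Y − β T X ∗ ).

    import numpy as np
    from scipy.optimize import root

    rng = np.random.default_rng(3)
    n = 4000
    X1, X2 = rng.normal(size=n), rng.normal(size=n)
    Xstar = np.column_stack([np.ones(n), X1, X2])
    Y = Xstar @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

    # two levels of coarsening: X2 observed (C = infinity) with known probability pi(Y, X1)
    pi = 1 / (1 + np.exp(-(1.0 + 0.5 * Y - 0.5 * X1)))
    R = rng.binomial(1, pi)

    def L2r(beta):
        # step 2a: an arbitrary (not optimal) function of the observed data G_r(Z) = (Y, X1)
        resid_obs = Y - (beta[0] + beta[1] * X1)
        return np.column_stack([resid_obs, resid_obs * X1, np.zeros(n)])

    def aipwcc_eq(beta):
        m = Xstar * (Y - Xstar @ beta)[:, None]        # step 1: full-data estimating function
        aug = (R / pi - 1)[:, None] * L2r(beta)        # step 2b: augmentation, mean zero given Z
        return ((R / pi)[:, None] * m + aug).sum(axis=0)

    beta_hat = root(aipwcc_eq, x0=np.zeros(3)).x       # step 3: solve the estimating equation
    print(beta_hat)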
ψ unknown
The development above shows how we can take results regarding semipara-
metric estimators for the parameter β for full-data models and modify them to
estimate the parameter β with coarsened data (CAR) when the coarsen-
ing probabilities are known to us by design. In most problems, the coarsening
probabilities are not known and must be modeled using the unknown param-
eter ψ. We discussed models for the coarsening process and estimators for the
parameters in these models in Chapter 8. We also showed in Chapter 8 the im-
pact that such models have on the observed-data nuisance tangent space, its
orthogonal complement, and the space of observed-data influence functions.
    If the parameter ψ is unknown, two issues emerge:
 (i) The unknown parameter ψ must be estimated.
(ii) The influence function of an observed-data RAL estimator for β must be
     an element in the space defined by (8.19) (i.e., involving a projection onto
     the coarsening model tangent space Λψ ).
    One obvious strategy for estimating β with coarsened data when ψ is
unknown is to find a consistent estimator for ψ and substitute this estimator
for ψ0 in the estimating equation (9.4). A natural estimator for ψ is obtained
by maximizing the coarsening model likelihood (8.8). The resulting MLE is
denoted by ψ̂n . The influence function of the estimator for β, obtained by
substituting the maximum likelihood estimator ψ̂n for ψ0 in equation (9.4),
is given by the following important theorem.
Theorem 9.1. If the coarsening process follows a parametric model, and if ψ
is estimated using the maximum likelihood estimator, say ψ̂n , or any efficient
estimator of ψ, then the solution to the estimating equation
                 #                                               $
             n
                   I(Ci = ∞)m(Zi , β)
                                      + L2 {Ci , GCi (Zi ), ψ̂n } = 0   (9.7)
             i=1      (∞, Zi , ψ̂n )
is an RAL estimator for β whose influence function is
   I(C = ∞)ϕF (Z)/ϖ(∞, Z, ψ0 ) + L∗2 {C, GC (Z), ψ0 }
      − Π[ I(C = ∞)ϕF (Z)/ϖ(∞, Z, ψ0 ) + L∗2 {C, GC (Z), ψ0 } | Λψ ],        (9.8)
where ϕF (·) and L∗2 (·) are defined by (9.2) and (9.6). We note that such an
influence function is indeed a member of the class of observed-data influence
functions given by (8.19).
   For notational convenience, we denote a typical influence function, if the
parameter ψ were known, by
      ϕ̃{C, GC (Z), ψ} = I(C = ∞)ϕF (Z)/ϖ(∞, Z, ψ) + L∗2 {C, GC (Z), ψ},        (9.9)
so that the influence function (9.8) can be written as
ϕ{Ci , GCi (Zi ), ψ} = ϕ̃{Ci , GCi (Zi ), ψ} − Π[ϕ̃{Ci , GCi (Zi ), ψ}|Λψ ].
Before giving the proof of the theorem above we present the following lemma.
Lemma 9.1.
   E[ ∂ ϕ̃{C, GC (Z), ψ0 }/∂ψ T ] = −E[ ϕ̃{C, GC (Z), ψ0 }SψT {C, GC (Z), ψ0 } ].        (9.10)
Proof. Because of the definition of ϕ̃{C, GC (Z), ψ} given by (9.9) and the fact that
L∗2 {C, GC (Z), ψ} ∈ Λ2 , we obtain, for any ψ, that
                        Eψ [ ϕ̃{C, GC (Z), ψ} | Z ] = ϕF (Z),
which does not involve ψ. Consequently,
      ∂/∂ψ T [ Σ_r ϕ̃{r, Gr (z), ψ}ϖ{r, Gr (z), ψ} ] = 0         for all z, ψ.            (9.12)
Differentiating the product inside the sum in (9.12) and setting ψ = ψ0 yields
   Σ_r [ ∂ ϕ̃{r, Gr (z), ψ0 }/∂ψ T ] ϖ{r, Gr (z), ψ0 }
      + Σ_r ϕ̃{r, Gr (z), ψ0 } [ ∂ϖ{r, Gr (z), ψ0 }/∂ψ T / ϖ{r, Gr (z), ψ0 } ] ϖ{r, Gr (z), ψ0 } = 0,
or
   E[ ∂ ϕ̃{C, GC (Z), ψ0 }/∂ψ T | Z ] = −E[ ϕ̃{C, GC (Z), ψ0 }SψT {C, GC (Z), ψ0 } | Z ],
and taking expectations of both sides yields (9.10).
          n1/2 (β̂n − β0 ) =
     n−1/2 Σ_{i=1}^{n} [ I(Ci = ∞)ϕF (Zi )/ϖ(∞, Zi , ψ̂n ) + L∗2 {Ci , GCi (Zi ), ψ̂n } ] + op (1),     (9.13)
where ϕF (Zi ) is given by (9.2) and L∗2 {Ci , GCi (Zi ), ψ} is given by (9.6). Now
we expand ψ̂n about ψ0 to obtain
   n1/2 (β̂n − β0 ) = n−1/2 Σ_{i=1}^{n} ϕ̃{Ci , GCi (Zi ), ψ0 }
       + [ n−1 Σ_{i=1}^{n} ∂ ϕ̃{Ci , GCi (Zi ), ψn∗ }/∂ψ T ] n1/2 (ψ̂n − ψ0 ) + op (1),        (9.14)
where ψn∗ is some intermediate value between ψ̂n and ψ0 . Since under usual
regularity conditions ψn∗ converges in probability to ψ0 , we obtain
   n−1 Σ_{i=1}^{n} ∂ ϕ̃{Ci , GCi (Zi ), ψn∗ }/∂ψ T  →P  E[ ∂ ϕ̃{Ci , GCi (Zi ), ψ0 }/∂ψ T ].     (9.15)
Because ψ̂n is an efficient (maximum likelihood) estimator for the parametric
coarsening model, its influence function has the standard form
   n1/2 (ψ̂n − ψ0 ) = n−1/2 Σ_{i=1}^{n} [ E{Sψ (C, GC (Z), ψ0 )SψT (C, GC (Z), ψ0 )} ]−1 Sψ {Ci , GCi (Zi ), ψ0 } + op (1),   (9.16)
where Sψ {C, GC (Z), ψ0 } is the score vector with respect to ψ defined in (8.15).
    The influence function of ψ̂n given by (9.16), together with (9.15) and
(9.10) of Lemma 9.1, can be used to deduce that (9.14) is equal to
  n1/2 (β̂n − β0 )
   = n−1/2 Σ_{i=1}^{n} [ ϕ̃{Ci , GCi (Zi ), ψ0 }
       − E[ ϕ̃{C, GC (Z), ψ0 }SψT {C, GC (Z), ψ0 } ] [ E{Sψ (C, GC (Z), ψ0 )SψT (C, GC (Z), ψ0 )} ]−1 Sψ {Ci , GCi (Zi ), ψ0 } ]
   + op (1).                                                             (9.17)
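In other words (a connecting observation added here, using only the standard formula for projecting onto the linear space Λψ spanned by Sψ ), the summand in (9.17) is ϕ̃ minus its projection onto Λψ :
   ϕ̃{Ci , GCi (Zi ), ψ0 } − E[ ϕ̃ SψT ] [ E(Sψ SψT ) ]−1 Sψ {Ci , GCi (Zi ), ψ0 }
       = ϕ̃{Ci , GCi (Zi ), ψ0 } − Π[ ϕ̃{Ci , GCi (Zi ), ψ0 } | Λψ ],
which is precisely the form of the influence function given in (9.8).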
Interesting Fact
The asymptotic variance of the RAL estimator β̂n is the variance of the influ-
ence function (9.8), which we denote by Σ. An estimator for the asymptotic
variance, Σ̂n , can be obtained using a sandwich variance estimator. For com-
pleteness, we now describe how to construct this estimator:
   Σ̂n = [ Ê{∂m(Z, β̂n )/∂β T } ]−1
         × [ n−1 Σ_{i=1}^{n} g{Ci , GCi (Zi ), ψ̂n , β̂n } g T {Ci , GCi (Zi ), ψ̂n , β̂n } ]
         × ( [ Ê{∂m(Z, β̂n )/∂β T } ]−1 )T ,                                      (9.19)
where
   Ê{∂m(Z, β̂n )/∂β T } = n−1 Σ_{i=1}^{n} I(Ci = ∞) [∂m(Zi , β̂n )/∂β T ] / ϖ(∞, Zi , ψ̂n )
and
   Ê(Sψ SψT ) = n−1 Σ_{i=1}^{n} Sψ {Ci , GCi (Zi ), ψ̂n }SψT {Ci , GCi (Zi ), ψ̂n }.
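For a simple special case, namely an inverse probability weighted complete-case estimator with coarsening probabilities known by design (so that no ψ needs to be estimated), the sandwich calculation reduces to the familiar form below. The sketch (a simulation constructed here; the linear model, the missingness mechanism, and all names are assumptions for illustration) computes sandwich standard errors for a weighted least-squares IPWCC estimator.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 3000
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
    pi = 1 / (1 + np.exp(-(0.3 + 0.4 * X[:, 1])))          # known design probabilities
    R = rng.binomial(1, pi)                                # Y used only when R = 1

    w = R / pi
    WX = X * w[:, None]
    beta_hat = np.linalg.solve(WX.T @ X, WX.T @ Y)         # solves sum_i (R_i/pi_i) X_i (Y_i - X_i'beta) = 0

    g = WX * (Y - X @ beta_hat)[:, None]                   # estimating functions g_i
    bread = np.linalg.inv((WX.T @ X) / n)                  # inverse of the estimated derivative matrix
    meat = (g.T @ g) / n
    Sigma_hat = bread @ meat @ bread.T / n                 # estimated variance of beta_hat
    print(beta_hat, np.sqrt(np.diag(Sigma_hat)))           # estimates and sandwich standard errors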
where λr {Gr (Z)} and Kr {Gr (Z)} are defined by (8.2) and (8.4), respectively,
and Lr {Gr (Z)} denotes an arbitrary function of Gr (Z) for r ≠ ∞.
Remark 2. Equation (9.20) is made up of a sum of mean-zero conditionally
uncorrelated terms; i.e., it has a martingale structure. We will take advantage
of this structure in the next chapter when we derive the more efficient double
robust estimators.   
Proof. Using (7.37), we note that a typical element of Λ2 can be written as
   Σ_{r≠∞} [ I(C = r) − I(C = ∞)ϖ{r, Gr (Z)}/ϖ(∞, Z) ] L2r {Gr (Z)},        (9.21)
and conversely
                                         
$$L_r=K_{r-1}L_{2r}+\sum_{j=1}^{r-1}\varpi_jL_{2j}\qquad\text{for all }r\neq\infty.\qquad(9.23)$$
Therefore,
$$L_{r+1}=K_rL_{2(r+1)}+\sum_{i=1}^{r}\frac{K_r\lambda_i}{K_i}L_i$$
$$=K_rL_{2(r+1)}+\sum_{i=1}^{r}\frac{K_r\lambda_i}{K_i}\left(K_{i-1}L_{2i}+\sum_{j=1}^{i-1}\varpi_jL_{2j}\right)$$
$$=K_rL_{2(r+1)}+\sum_{i=1}^{r}\frac{K_r\varpi_i}{K_i}L_{2i}+\sum_{i=1}^{r}\sum_{j=1}^{i-1}\frac{K_r\lambda_i\varpi_j}{K_i}L_{2j}$$
$$=K_rL_{2(r+1)}+\sum_{i=1}^{r}\frac{K_r\varpi_i}{K_i}L_{2i}+\sum_{j=1}^{r-1}K_r\varpi_jL_{2j}\sum_{i=j+1}^{r}\frac{\lambda_i}{K_i}$$
$$=K_rL_{2(r+1)}+\sum_{j=1}^{r}K_r\varpi_jL_{2j}\left(\frac{1}{K_j}+\sum_{i=j+1}^{r}\frac{\lambda_i}{K_i}\right).$$
Note that
$$\frac{1}{K_j}+\frac{\lambda_{j+1}}{K_{j+1}}=\frac{1-\lambda_{j+1}}{K_{j+1}}+\frac{\lambda_{j+1}}{K_{j+1}}=\frac{1}{K_{j+1}},$$
and hence
$$\frac{1}{K_j}+\sum_{i=j+1}^{r}\frac{\lambda_i}{K_i}=\frac{1}{K_r}.$$
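The second identity follows by applying the first repeatedly, using only the relation $K_{i+1}=K_i(1-\lambda_{i+1})$; for instance,
$$\frac{1}{K_j}+\frac{\lambda_{j+1}}{K_{j+1}}+\frac{\lambda_{j+2}}{K_{j+2}}=\frac{1}{K_{j+1}}+\frac{\lambda_{j+2}}{K_{j+2}}=\frac{1}{K_{j+2}},$$
and continuing the argument up to $i=r$ yields $1/K_r$.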
Therefore,
                                                      
$$L_{r+1}=K_rL_{2(r+1)}+\sum_{j=1}^{r}\varpi_jL_{2j},$$
which is exactly (9.23) with r replaced by r + 1. Similarly, substituting (9.22) into (9.21) yields (9.20). □
E(Yji |Xi ) = β1 + β2 tj + β3 Xi tj .
     In this study, however, some patients dropped out during the course of the
study, in which case we would observe the data up to the time they dropped
out but all subsequent CD4 count measurements would be missing. This is
an example of monotone coarsening as described in Section 8.1, where there
are l − 1 levels of coarsening and where Gr(Zi) = (Xi, Y1i, . . . , Yri)T, r = 1, . . . , l − 1, and G∞(Zi) = (Xi, Y1i, . . . , Yli)T. Because dropout was not by
design, we need to model the coarsening probabilities in order to derive AIP-
WCC estimators for β. Since the data are monotonically coarsened, it is more
convenient to model the discrete hazard function. Assuming the coarsening is
CAR, we consider a series of logistic regression models similar to (8.7), namely
Note 1. The only difference between equations (8.7) and (9.25) is the inclusion
of the treatment indicator X. 
Therefore, the parameter ψ in this example is ψ = (ψ0r , . . . , ψ(r+1)r , r =
1, . . . , l − 1)T . The MLE ψ̂n is obtained by maximizing the likelihood (8.13).
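With monotone dropout, the likelihood (8.13) factors into level-specific Bernoulli contributions, so ψ̂n can be computed by pooled logistic regression on "person-period" records (one record for each subject still at risk of dropping out at level r). The sketch below assumes, purely for illustration, a linear predictor containing level-specific intercepts, the treatment indicator, and the most recently observed CD4 count; the exact covariates of (9.25) may differ.

import numpy as np

def person_period_records(C, X, Y, n_levels):
    """One record per subject i and level r with C_i >= r; response is I(C_i = r).
    C: dropout level (1,...,l-1) or np.inf for complete cases; Y: (n x l) CD4 counts."""
    rows, resp = [], []
    for i in range(len(C)):
        for r in range(1, n_levels + 1):
            if C[i] >= r:                      # still at risk of dropping out at level r
                z = np.zeros(n_levels + 2)
                z[r - 1] = 1.0                 # level-specific intercept
                z[n_levels] = X[i]             # treatment indicator
                z[n_levels + 1] = Y[i, r - 1]  # most recent observed response
                rows.append(z)
                resp.append(1.0 if C[i] == r else 0.0)
    return np.array(rows), np.array(resp)

def fit_logistic(Z, d, n_iter=25):
    """Newton-Raphson for an (unpenalized) logistic regression of d on Z."""
    psi = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ psi))
        W = p * (1.0 - p)
        psi += np.linalg.solve(Z.T @ (Z * W[:, None]) + 1e-8 * np.eye(Z.shape[1]),
                               Z.T @ (d - p))
    return psi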
     We now have most of the components necessary to construct an AIP-
WCC estimator for β. We still need to define an element of the augmen-
tation space Λ2 . In accordance with Theorem 9.2, for monotonically coars-
ened data, we must choose a function Lr {Gr (Z)}, r = 1, . . . , l − 1 (that is,
a function Lr (X, Y1 , . . . , Yr )) and then use (9.20) to construct an element
L2 {C, GC (Z)} ∈ Λ2 .
     Putting all these different elements together, we can derive an observed-
data RAL estimator for β by using the results of Theorem 9.1, equation (9.7),
by solving the estimating equation
$$\sum_{i=1}^{n}\left(\frac{I(C_i=\infty)H^T(X_i)\{Y_i-H(X_i)\beta\}}{\varpi(\infty,Z_i,\hat\psi_n)}+\sum_{r=1}^{l-1}\left[\frac{I(C_i=r)-\lambda_r\{G_r(Z_i),\hat\psi_n\}I(C_i\geq r)}{K_r\{G_r(Z_i),\hat\psi_n\}}\right]L_r\{G_r(Z_i)\}\right)=0,\qquad(9.26)$$
where
and
$$\varpi(\infty,Z_i,\hat\psi_n)=K_{l-1}\{G_{l-1}(Z_i),\hat\psi_n\}.$$
The estimator β̂n, the solution to (9.26), will be an observed-data RAL estimator for β that is consistent and asymptotically normal, assuming, of course, that the model for the discrete hazard functions is correctly specified. An estimator for the asymptotic variance of β̂n can be obtained by using the sandwich variance estimator given by (9.19).
    The efficiency of the estimator will depend on the choice for m(Zi , β) and
the choice for Lr (Xi , Y1i , . . . , Yri ), r = 1, . . . , l − 1. For illustration, we chose
m(Zi , β) = H T (Xi ){Yi − H(Xi )β}, but this may or may not be a good choice.
Also, we did not discuss choices for Lr (Xi , Y1i , . . . , Yri ), r = 1, . . . , l − 1. If
Lr (·) were set equal to zero, then the corresponding estimator would be the
IPWCC estimator. A more detailed discussion on choices for these functions
and how they affect the efficiency of the resulting estimator will be given in
Chapters 10 and 11.     
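As a small illustration of the structure of (9.26), note that with m(Z, β) linear in β and with augmentation functions Lr that do not involve β, the equation is linear in β and can be solved directly. The sketch below (Python; the inputs and names are assumptions made for illustration) solves the unaugmented case Lr ≡ 0, which is the IPWCC estimator mentioned above; a nonzero, β-free choice of Lr would simply add a constant vector to the right-hand side.

import numpy as np

def ipwcc_beta(H, Y, complete, w_inf):
    """H: list of (l x 3) design matrices H(X_i); Y: (n x l) CD4 counts;
    complete: I(C_i = infinity); w_inf: estimated varpi(infinity, Z_i, psi_hat)."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for i in range(len(H)):
        if complete[i]:
            wt = 1.0 / w_inf[i]
            A += wt * H[i].T @ H[i]
            b += wt * H[i].T @ Y[i]
    return np.linalg.solve(A, b)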
which has influence function X(T ) − E{X(T )}. If, however, the duration of
illness is right censored for some individuals, then this problem becomes more
difficult. We will use this example for illustration as we develop censored data
estimators.
    In actuality, we often don’t observe the full data because of censoring, pos-
sibly because of incomplete follow-up of the patients due to staggered entry
and finite follow-up, or because the patients drop out of the study prema-
turely. To accommodate censoring, we introduce a censoring variable C̃ that corresponds to the time at which an individual would be censored from the study.
where pT,X̄(T ) {t, x̄(t)} denotes the density of the full data had we been able
to observe them.
    The observed (coarsened) data for this problem are denoted as
where
                  [C = r, Gr {T, X̄(T )}] = {C̃ = r, T > r, X̄(r)}
and
                  [C = ∞, G∞ {T, X̄(T )}] = {T ≤ C̃, T, X̄(T )}.
With monotone coarsening, we argued that it is more convenient to work with
hazard functions in describing the coarsening probabilities. With a slight abuse
of notation, the coarsening hazard function is given as
Therefore,
    λr {T, X̄(T )} = P [C̃ = r, T ≥ r|{(C̃ ≥ r, T > C̃) ∪ (T < C̃)}, T, X̄(T )].
                                                                            (9.29)
If T < r, then (9.29) must equal zero, whereas if T ≥ r, then {(C̃ ≥ r, T >
C̃) ∪ (T < C̃)} ∩ (T ≥ r) = (C̃ ≥ r). Consequently,
then
           λr [Gr {T, X̄(T )}] = P {C̃ = r|C̃ ≥ r, T ≥ r, X̄(r)}I(T ≥ r).
Then
                   λr [Gr {T, X̄(T )}] = λC̃ {r, X̄(r)}I(T ≥ r).              (9.30)
    In order to construct estimators for a full-data parameter using coars-
ened data, such as those given by (9.4), we need to compute the probability of a complete case, $\varpi[\infty,G_\infty\{T,\bar X(T)\}]=P\{C=\infty|T,\bar X(T)\}=P\{\Delta=1|T,\bar X(T)\}$, and a typical element of the augmentation space, Λ2. We now
show how these are computed with censored data using the hazard function
for the censoring time and counting process notation.
where dNC̃ (r) denotes the increment of the counting process. Using (9.30),
we obtain
Letting Y (r) denote the at-risk indicator, Y (r) = I(U ≥ r) = I(T ≥ r, C̃ ≥ r),
we obtain
                λr [Gr {T, X̄(T )}]I(C ≥ r) = λC̃ {r, X̄(r)}Y (r).
Because the elements in the sum of (9.32) are nonzero only if T ≥ r and C̃ ≥ r,
it suffices to define Lr [Gr {T, X̄(T )}] and Kr [Gr {T, X̄(T )}] as Lr {X̄(r)} and
Kr {X̄(r)}, respectively. Moreover,
$$K_r\{\bar X(r)\}=\prod_{u\leq r}\big[1-\lambda_{\tilde C}\{u,\bar X(u)\}\,du\big]=\exp\left\{-\int_0^r\lambda_{\tilde C}\{u,\bar X(u)\}\,du\right\}.\qquad(9.33)$$
   Consequently, a typical element of Λ2 , given by (9.32), can be written
using stochastic integrals of counting process martingales, namely
$$\int_0^\infty\frac{dM_{\tilde C}\{r,\bar X(r)\}}{K_r\{\bar X(r)\}}\,L_r\{\bar X(r)\},$$
where
                dMC̃ {r, X̄(r)} = dNC̃ (r) − λC̃ {r, X̄(r)}Y (r)dr
is the usual counting process martingale increment and Kr {X̄(r)} is given by
(9.33).
accumulated medical costs of a patient during the time T of his or her illness.
As noted previously, with full data there is only one influence function of
RAL estimators for β, namely X(T ) − β. Consequently, the obvious full-data
estimating function for this problem is m(Zi , β) = Xi (Ti ) − β. With censored
data, we use (9.34) to derive an arbitrary estimator for β as the solution to
$$\sum_{i=1}^{n}\left[\frac{\Delta_i}{K_{U_i}\{\bar X_i(U_i)\}}\{X_i(U_i)-\beta\}+\int_0^\infty\frac{dM_{\tilde C_i}\{r,\bar X_i(r)\}}{K_r\{\bar X_i(r)\}}\,L_r\{\bar X_i(r)\}\right]=0.\qquad(9.35)$$
    Notice that, unlike the case where there was only one influence function
with the full data for this problem, there are many influence functions with
censored data. The observed (censored) data influence functions depend on the
choice of Lr {X̄(r)} for r ≥ 0, which, in turn, affects the asymptotic variance
of the corresponding estimator in (9.35). Clearly, we want to choose such
functions in order to minimize the asymptotic variance of the corresponding
estimator. This issue will be studied carefully in the next chapter.
    We also note that the estimator given above assumes that we know the
hazard function for censoring, λC̃ {r, X̄(r)}. In practice, this will be unknown
to us and must be estimated from the observed data. If the censoring time
C̃ is assumed independent of {T, X̄(T )}, then one can estimate Kr using
the Kaplan-Meier estimator for the censoring time C̃; see Kaplan and Meier
(1958). If the censoring time is related to the time-dependent covariates, then
a model has to be developed. A popular model for this purpose is Cox’s
proportional hazards model (Cox, 1972, 1975).
    If Lr (·) is taken to be identically equal to zero, then we obtain the IPWCC
estimator. This estimator, referred to as the simple weighted estimator for
estimating the mean medical cost with censored cost data, was studied by
Bang and Tsiatis (2000), who also derived the large-sample properties. More
efficient estimators with a judicious choice of Lr (·) were also proposed in that
paper.
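A minimal sketch of this simple weighted estimator, assuming independent censoring so that Kr can be estimated by the Kaplan-Meier estimator of the censoring distribution (variable names are illustrative; this is not the authors' code):

import numpy as np

def censoring_km(U, delta, t):
    """Kaplan-Meier estimate of P(Ctilde >= t); censorings are the 'events'.
    U = min(T, Ctilde), delta = I(T <= Ctilde)."""
    K = 1.0
    for u in np.unique(U[U < t]):
        at_risk = np.sum(U >= u)
        cens = np.sum((U == u) & (delta == 0))
        K *= 1.0 - cens / at_risk
    return K

def simple_weighted_mean_cost(U, delta, cost):
    """IPWCC estimate of the mean cost; cost[i] = X_i(U_i)."""
    w = np.array([delta[i] / censoring_km(U, delta, U[i]) for i in range(len(U))])
    return np.sum(w * cost) / np.sum(w)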
    where ψ̂n denotes the MLE for ψ. The i-th influence function of the resulting estimator is
$$\tilde\varphi\{C_i,G_{C_i}(Z_i),\psi_0\}-E(\tilde\varphi S_\psi^T)\{E(S_\psi S_\psi^T)\}^{-1}S_\psi\{C_i,G_{C_i}(Z_i),\psi_0\}=\tilde\varphi\{C_i,G_{C_i}(Z_i),\psi_0\}-\Pi\big[\tilde\varphi\{C_i,G_{C_i}(Z_i),\psi_0\}\,\big|\,\Lambda_\psi\big].$$
•   When the observed data are monotonically coarsened, then it will prove
    to be convenient to express the elements of the augmentation space using
    discrete hazard functions, specifically an element L2 {C, GC (Z), ψ} ∈ Λ2 ,
$$\sum_{r\neq\infty}\left[\frac{I(C=r)-\lambda_r\{G_r(Z),\psi\}I(C\geq r)}{K_r\{G_r(Z),\psi\}}\right]L_r\{G_r(Z)\},$$
Proof. We begin by noting that the space of elements in (10.1), for a fixed
ϕF (Z), is a linear variety as defined by Definition 7 of Chapter 3 (i.e., a
translation of a linear space away from the origin). Specifically, this space is
given as V = x0 + M , where the element
                                                              
$$x_0=\frac{I(C=\infty)\varphi^F(Z)}{\varpi(\infty,Z,\psi_0)}-\Pi\left[\frac{I(C=\infty)\varphi^F(Z)}{\varpi(\infty,Z,\psi_0)}\,\Big|\,\Lambda_\psi\right]$$
x0 − Π[x0 |M ].
                                                  
$$\Pi\left[\frac{I(C=\infty)\varphi^F(Z)}{\varpi(\infty,Z,\psi_0)}\,\Big|\,\Lambda_\psi\right]=\Pi\left[\Pi\left\{\frac{I(C=\infty)\varphi^F(Z)}{\varpi(\infty,Z,\psi_0)}\,\Big|\,\Lambda_2\right\}\Big|\,\Lambda_\psi\right],\qquad(10.4)$$
by (IF)DR. □
The subscript “DR” is used to denote double robustness. Therefore, the space
(IF )DR ⊂ (IF ) is defined as the set of double-robust observed-data influence
functions. The term double robust was first introduced in Section 6.5. Why we
refer to these as double-robust influence functions will become clear later in
the chapter. Since the space of full-data influence functions is a linear variety
in HF (see Theorem 4.3), (IF)F = ϕF(Z) + T F⊥, where ϕF(Z) is an arbitrary
full-data influence function and T F is the full-data tangent space, it is clear
that (IF )DR = J {(IF )F } = J (ϕF ) + J (T F ⊥ ) is a linear variety in H.
    This now gives us a prescription for how to find observed-data RAL estima-
tors for β whose influence function belongs to (IF )DR . We start by choosing
a full-data estimating function m(Z, β) so that m(Z, β0 ) = ϕ∗F (Z) ∈ ΛF ⊥ .
We then construct the observed-data estimating function m{C, GC (Z), β} =
J {m(Z, β)}, where
                                                                   
$$\mathcal J\{m(Z,\beta)\}=\frac{I(C=\infty)m(Z,\beta)}{\varpi(\infty,Z,\psi_0)}-\Pi\left[\frac{I(C=\infty)m(Z,\beta)}{\varpi(\infty,Z,\psi_0)}\,\Big|\,\Lambda_2\right].$$
where
$$L_2\{C,G_C(Z),\beta,\psi\}=-\Pi\left[\frac{I(C=\infty)m(Z,\beta)}{\varpi(\infty,Z,\psi)}\,\Big|\,\Lambda_2\right].$$
If the coarsening probabilities were not known and had to be modeled using
the unknown parameter ψ, then we would derive an estimator for β by solving
the estimating equation
$$\sum_{i=1}^{n}\left[\frac{I(C_i=\infty)m(Z_i,\beta)}{\varpi(\infty,Z_i,\hat\psi_n)}+L_2\{C_i,G_{C_i}(Z_i),\beta,\hat\psi_n\}\right]=0,\qquad(10.7)$$
where ψ̂n is the MLE for the parameter ψ and L2 (·) is defined as above.
    Finding projections onto the augmentation space Λ2 is not necessarily easy.
Later we will discuss a general procedure for finding such projections that
involves an iterative process. However, in the case when there are two levels
of coarsening or when the coarsening is monotone, a closed-form solution for
the projection onto Λ2 exists. We will study these two scenarios more carefully.
We start by illustrating how to find improved estimators using these results
when there are two levels of missingness.
however, we can make the model as flexible as necessary to fit the data. For
example, we can include higher-order polynomial terms, interaction terms,
splines, etc.
    As always, there is an underlying full-data model Z ∼ p(z, β, η) ∈ P,
where β is the q-dimensional parameter of interest and η is the nuisance
parameter (possibly infinite-dimensional), and our goal is to estimate β using
the observed data Oi = (Ri , Z1i , Ri Z2i ), i = 1, . . . , n.
    In order to use either estimating equation (10.6) or (10.7), when ψ is known or unknown, respectively, to obtain an observed-data RAL estimator for β whose influence function is an element of (IF)DR, we must find the projection of $I(C=\infty)\varphi^{*F}(Z)/\varpi(\infty,Z)$ onto the augmentation space Λ2, where ϕ∗F(Z) = m(Z, β0) ∈ ΛF⊥. Using the notation for two levels of missingness, we now consider how to derive the projection of $R\varphi^{*F}(Z)/\pi(Z_1)$ onto the augmentation space. In (7.40) of Chapter 7, we showed that, with two levels of missingness, Λ2 consists of the set of elements
$$L_2(O)=\left\{\frac{R-\pi(Z_1)}{\pi(Z_1)}\right\}h_2(Z_1),$$
where $h_2^{q\times1}(Z_1)$ is an arbitrary q-dimensional function of Z1.
Because
$$E\left[\frac{R\{R-\pi(Z_1)\}}{\pi^2(Z_1)}\,\Big|\,Z\right]=E\left[\left\{\frac{R-\pi(Z_1)}{\pi(Z_1)}\right\}^2\Big|\,Z\right]=\frac{1-\pi(Z_1)}{\pi(Z_1)},$$
we write (10.12) as
$$E\left(\frac{1-\pi(Z_1)}{\pi(Z_1)}\left[\varphi^{*F}(Z)-h_2^0(Z_1)\right]^Th_2(Z_1)\right).\qquad(10.13)$$
   Therefore, we must find the function h02 (Z1 ) such that (10.13) is equal
to zero for all h2 (Z1 ). We derive (10.13) by again using the law of iterated
conditional expectations, where we first condition on Z1 to obtain
                                                       T       
$$E\left(\frac{1-\pi(Z_1)}{\pi(Z_1)}\left[E\{\varphi^{*F}(Z)|Z_1\}-h_2^0(Z_1)\right]^Th_2(Z_1)\right).\qquad(10.14)$$
Adaptive Estimation
pZ2 |Z1 ,R (z2 |z1 , r) = pZ2 |Z1 ,R=1 (z2 |z1 , r = 1) = pZ2 |Z1 (z2 |z1 ).
If, however, our posited model did contain the truth, then we denote this by
taking ξ ∗ to equal ξ0 , where
   With such a posited model for the conditional density of Z2 given Z1 and
an estimator ξˆn∗ , we are able to estimate h02 (Z1 ) = E{ϕ∗F (Z)|Z1 } by using
                               
$$h_2^*(Z_1,\hat\xi_n^*)=\int\varphi^{*F}(Z_1,u)\,p^*_{Z_2|Z_1}(u|Z_1,\hat\xi_n^*)\,d\nu_{Z_2}(u).$$
where h∗2(Z1, ξ∗) is a function of Z1, but it is not necessarily the case that h02(Z1) = h∗2(Z1, ξ∗) unless the posited model for the conditional density of Z2 given Z1 is correct.
     With this as background, we now give a step-by-step algorithm on how to
derive an improved estimator. In so doing, we consider the scenario where the
parameter ψ in our missingness model is unknown and must be estimated.
$$\prod_{i=1}^{n}\{\pi(Z_{1i},\psi)\}^{R_i}\{1-\pi(Z_{1i},\psi)\}^{1-R_i}.$$
      where
                                                      
$$h_2^*(Z_{1i},\beta,\xi)=E\{m(Z_i,\beta)|Z_{1i},\xi\}=\int m(Z_{1i},u,\beta)\,p^*_{Z_2|Z_1}(u|Z_{1i},\xi)\,d\nu_{Z_2}(u).$$
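When this integral has no closed form, it can be approximated numerically. The following sketch (Python) evaluates h2∗(Z1, β, ξ) by Monte Carlo under a posited normal linear model for Z2 given Z1; the posited model and all names are assumptions made for illustration, not part of the text.

import numpy as np

def h2_star(Z1, beta, xi, m, n_draws=2000, rng=None):
    """Approximate E{m((Z1, Z2), beta) | Z1} under a posited model
    Z2 | Z1 ~ N(xi0 + xi1'Z1, sigma2), with xi = (xi0, xi1, sigma2)."""
    rng = np.random.default_rng(rng)
    xi0, xi1, sigma2 = xi
    mean = xi0 + Z1 @ xi1
    draws = rng.normal(mean, np.sqrt(sigma2), size=n_draws)
    return np.mean([m(Z1, z2, beta) for z2 in draws], axis=0)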
The semiparametric theory that we have developed for coarsened data implic-
itly assumes that the model for the coarsening probabilities is correctly speci-
fied. With two levels of missingness, this means that P (R = 1|Z1 ) = π0 (Z1 ) is
contained within the model π(Z1 , ψ), in which case, under suitable regularity
conditions, π(Z1 , ψ̂n ) → π0 (Z1 ), where ψ̂n denotes the MLE for ψ. The fact
that we used the augmented term
$$-\left\{\frac{R_i-\pi(Z_{1i},\hat\psi_n)}{\pi(Z_{1i},\hat\psi_n)}\right\}h_2^*(Z_{1i},\beta,\hat\xi_n^*)\qquad(10.18)$$
in equation (10.17) was an attempt to gain efficiency from the data that are
incomplete (i.e., {i : Ri = 0}). To get the greatest gain in efficiency, the
augmented term must equal
$$-\left\{\frac{R_i-\pi(Z_{1i},\hat\psi_n)}{\pi(Z_{1i},\hat\psi_n)}\right\}h_2^0(Z_{1i},\beta),$$
where
Theorem 10.3.
$$n^{1/2}(\hat\beta_n-\hat\beta_n^*)\xrightarrow{P}0.$$
Proof. If we return to the proof of Theorem 9.1, we note that the asymptotic
approximation given by (9.13), applied to the estimator β̂n , the solution to
(10.17) would yield
$$n^{1/2}(\hat\beta_n-\beta_0)=-\left[E\left\{\frac{\partial m(Z,\beta_0)}{\partial\beta^T}\right\}\right]^{-1}\times n^{-1/2}\sum_{i=1}^{n}\left[\frac{R_i\,m(Z_i,\beta_0)}{\pi(Z_{1i},\hat\psi_n)}-\left\{\frac{R_i-\pi(Z_{1i},\hat\psi_n)}{\pi(Z_{1i},\hat\psi_n)}\right\}h_2^*(Z_{1i},\hat\beta_n,\hat\xi_n^*)\right]+o_p(1),\qquad(10.22)$$
The proof is complete if we can show that the term in the square brackets
above converges in probability to zero. This follows by expanding
$$n^{-1/2}\sum_{i=1}^{n}\left\{\frac{R_i-\pi(Z_{1i},\hat\psi_n)}{\pi(Z_{1i},\hat\psi_n)}\right\}h_2^*(Z_{1i},\hat\beta_n,\hat\xi_n^*)$$
where βn∗ and ξn∗ are intermediate values between β̂n and β0 and between ξ̂n∗ and ξ∗, respectively. Let us consider (10.26). Since ψ̂n → ψ0, βn∗ → β0, and ξn∗ → ξ∗, all in probability, then, under suitable regularity conditions, the sample average in equation (10.26) satisfies
$$n^{-1}\sum_{i=1}^{n}\left\{\frac{R_i-\pi(Z_{1i},\hat\psi_n)}{\pi(Z_{1i},\hat\psi_n)}\right\}\frac{\partial h_2^*(Z_{1i},\beta_n^*,\xi_n^*)}{\partial\xi^{*T}}\;\xrightarrow{P}\;E\left[\left\{\frac{R-\pi(Z_1,\psi_0)}{\pi(Z_1,\psi_0)}\right\}\frac{\partial h_2^*(Z_1,\beta_0,\xi^*)}{\partial\xi^{*T}}\right].$$
This would then imply that the term inside the square brackets "[ · ]" in (10.21) is an element orthogonal to Λ2 (i.e., [ · ] ∈ Λ2⊥), in which case Π{[ · ] | Λψ} (i.e., the last term of (10.21)) is equal to zero because Λψ ⊂ Λ2. The resulting estimator would have influence function
$$\frac{R\varphi^F(Z)}{\pi(Z_1,\psi_0)}-\Pi\left[\frac{R\varphi^F(Z)}{\pi(Z_1,\psi_0)}\,\Big|\,\Lambda_2\right]=\mathcal J(\varphi^F)\in(IF)_{DR},\qquad(10.27)$$
where $\varphi^F(Z)=-\left[E\left\{\partial m(Z,\beta_0)/\partial\beta^T\right\}\right]^{-1}m(Z,\beta_0)$, and this influence function is
within the class of the so-called double-robust influence functions. The vari-
ance of this influence function represents the smallest asymptotic variance
among observed-data estimators that used m(Z, β0 ) = ϕ∗F (Z) as the ba-
sis for an augmented inverse probability weighted complete-case estimating
equation.
To estimate the asymptotic variance of β̂n , where β̂n is the solution to (10.17),
we propose using the sandwich variance estimator given in Chapter 9, equation
(9.19). For completeness, this estimator is given by
$$\hat\Sigma_n=\left[\hat E\left\{\frac{\partial m(Z,\hat\beta_n)}{\partial\beta^T}\right\}\right]^{-1}\left[n^{-1}\sum_{i=1}^{n}g(O_i,\hat\psi_n,\hat\beta_n,\hat\xi_n^*)\,g^T(O_i,\hat\psi_n,\hat\beta_n,\hat\xi_n^*)\right]\left[\hat E\left\{\frac{\partial m(Z,\hat\beta_n)}{\partial\beta^T}\right\}^T\right]^{-1},\qquad(10.28)$$
where
$$\hat E\left\{\frac{\partial m(Z,\hat\beta_n)}{\partial\beta^T}\right\}=n^{-1}\sum_{i=1}^{n}\frac{R_i\,\partial m(Z_i,\hat\beta_n)/\partial\beta^T}{\pi(Z_{1i},\hat\psi_n)},$$
and
$$\hat E(S_\psi S_\psi^T)=n^{-1}\sum_{i=1}^{n}S_\psi(O_i,\hat\psi_n)\,S_\psi^T(O_i,\hat\psi_n).$$
Thus far, we have taken the point of view that the missingness model was
correctly specified; that is, that π0 (Z1 ) = P (R = 1|Z1 ) is contained in the
model π(Z1 , ψ) for some value of ψ. If this is the case, we denote the true value
of ψ by ψ0 and π0 (Z1 ) = π(Z1 , ψ0 ). However, unless the missingness was by
design, we generally don’t know the true missingness model, and therefore the
possibility exists that we have misspecified this model. Even if the missing-
ness model is misspecified, under suitable regularity conditions, the maximum
likelihood estimator ψ̂n will converge in probability to some constant ψ ∗ , but
π(Z1, ψ∗) ≠ π0(Z1). That is,
actually gives us the extra protection of double robustness, which was briefly
introduced in Section 6.5. We now explore this issue further.
    Using standard asymptotic approximations, we can show that the esti-
mator β̂n , which is the solution to the estimating equation (10.17), will be
consistent and asymptotically normal if
$$E\left[\frac{R\,m(Z,\beta_0)}{\pi(Z_1,\psi^*)}-\left\{\frac{R-\pi(Z_1,\psi^*)}{\pi(Z_1,\psi^*)}\right\}h_2^*(Z_1,\beta_0,\xi^*)\right]=0,\qquad(10.29)$$
where ψ̂n → ψ∗ and ξ̂n∗ → ξ∗ in probability.
   In developing our estimator for β, we considered two models, one for the
missingness probabilities π(Z1, ψ) and another for the conditional density of Z2 given Z1, p∗Z2|Z1(z2|z1, ξ). If the missingness model is correctly specified,
then
                     π(Z1 , ψ ∗ ) = π(Z1 , ψ0 ) = P (R = 1|Z1 ).           (10.30)
If the model for the conditional density of Z2 given Z1 is correctly specified,
then
                      h∗2 (Z1 , β0 , ξ ∗ ) = E0 {m(Z, β0 )|Z1 }.       (10.31)
   We now show the so-called double-robustness property; that is, the esti-
mator β̂n , the solution to (10.17), is consistent and asymptotically normal (a
result that follows under suitable regularity conditions when (10.29) is satis-
fied) if either (10.30) or (10.31) is true.
   After adding and subtracting similar terms, we write the expectation in
(10.29) as
                                                                        
$$E\left(m(Z,\beta_0)+\left\{\frac{R-\pi(Z_1,\psi^*)}{\pi(Z_1,\psi^*)}\right\}\left[m(Z,\beta_0)-h_2^*(Z_1,\beta_0,\xi^*)\right]\right)$$
$$=0+E\left(\left\{\frac{R-\pi(Z_1,\psi^*)}{\pi(Z_1,\psi^*)}\right\}\left[m(Z,\beta_0)-h_2^*(Z_1,\beta_0,\xi^*)\right]\right).\qquad(10.32)$$
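This algebra is easy to check numerically. The following small simulation (a sketch under assumed data-generating models, not an example from the text) estimates β0 = E(Z2) with the augmented inverse probability weighted estimating function above, misspecifying one of the two models at a time:

import numpy as np

rng = np.random.default_rng(0)
n = 200_000
Z1 = rng.normal(size=n)
Z2 = 1.0 + Z1 + rng.normal(size=n)               # truth: E(Z2 | Z1) = 1 + Z1, beta0 = 1
pi_true = 1.0 / (1.0 + np.exp(-(0.5 + Z1)))       # true missingness probabilities
R = rng.binomial(1, pi_true)

def aipw(pi_hat, h2_hat):
    return np.mean(R * Z2 / pi_hat - (R - pi_hat) / pi_hat * h2_hat)

print("both correct  :", aipw(pi_true, 1.0 + Z1))
print("pi wrong      :", aipw(np.full(n, R.mean()), 1.0 + Z1))   # outcome model correct
print("outcome wrong :", aipw(pi_true, np.zeros(n)))             # missingness model correct

All three printed estimates should be close to β0 = 1; misspecifying both models at once would generally bias the estimator.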
where, for ease of illustration, we let X2 be a single random variable. Also, for
this example, we let all the covariates in X be continuous random variables.
Specifically, we consider the model
$$P(Y=1|X)=\frac{\exp(\beta^TX^*)}{1+\exp(\beta^TX^*)},$$
$$\pi(Y,X_1,\psi)=\frac{\exp(\psi_0+\psi_1Y+\psi_2^TX_1)}{1+\exp(\psi_0+\psi_1Y+\psi_2^TX_1)}.\qquad(10.38)$$
$$\sum_{i=1}^{n}\left[\frac{R_i}{\pi(Y_i,X_{1i},\hat\psi_n)}X_i^*\left\{Y_i-\frac{\exp(\beta^TX_i^*)}{1+\exp(\beta^TX_i^*)}\right\}-\left\{\frac{R_i-\pi(Y_i,X_{1i},\hat\psi_n)}{\pi(Y_i,X_{1i},\hat\psi_n)}\right\}L(Y_i,X_{1i})\right]=0,\qquad(10.39)$$
     Therefore, one strategy is to posit the model (10.40) for the condi-
tional distribution of X2 given X1 and Y and estimate the parameters
ξ = (ξ0 , ξ1T , ξ2 , ξσ2 )T by maximizing (10.16). This is especially attractive be-
cause the model (10.40) is a traditional normally distributed linear model.
Therefore, using the complete cases (i.e., {i : Ri = 1}), the MLE for
(ξ0 , ξ1T , ξ2 )T can be obtained using standard least squares and the MLE for
the variance parameter ξσ2 can be obtained as the average of the squared
residuals. Denote this estimator by ξˆn∗ .
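A sketch of this fitting step (ordinary least squares on the complete cases plus the average squared residual), written for a scalar X1 and with illustrative variable names; this mirrors the description above but is not the authors' code:

import numpy as np

def fit_xi_star(X1, Y, X2, R):
    """Fit the posited normal linear model for X2 given (X1, Y) among complete cases."""
    cc = R == 1
    design = np.column_stack([np.ones(cc.sum()), X1[cc], Y[cc]])   # (1, X1, Y)
    coef, *_ = np.linalg.lstsq(design, X2[cc], rcond=None)
    resid = X2[cc] - design @ coef
    sigma2 = np.mean(resid ** 2)                                    # MLE of the variance parameter
    return coef, sigma2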
    Finally, we must compute
where the discrete hazard function λr{Gr(Z)} and Kr{Gr(Z)} are defined by (8.2) and (8.4), respectively, and Lr{Gr(Z)} denotes an arbitrary function of Gr(Z) for r ≠ ∞.
    By the projection theorem, if we want to find
$$\Pi\left[\frac{I(C=\infty)m(Z,\beta)}{\varpi(\infty,Z)}\,\Big|\,\Lambda_2\right],$$
then we must derive the functions L0r{Gr(Z)}, r ≠ ∞, such that
$$E\left(\left[\frac{I(C=\infty)m^T(Z,\beta)}{\varpi(\infty,Z)}-\sum_{r\neq\infty}\frac{I(C=r)-\lambda_r\{G_r(Z)\}I(C\geq r)}{K_r\{G_r(Z)\}}L_{0r}^T\{G_r(Z)\}\right]\right.$$
$$\left.\times\left[\sum_{r'\neq\infty}\frac{I(C=r')-\lambda_{r'}\{G_{r'}(Z)\}I(C\geq r')}{K_{r'}\{G_{r'}(Z)\}}L_{r'}\{G_{r'}(Z)\}\right]\right)$$
                                                         
$$E\left(\left[\frac{I(C=r)-\lambda_r\{G_r(Z)\}I(C\geq r)}{K_r\{G_r(Z)\}}\right]L_{0r}^T\{G_r(Z)\}\left[\frac{E\{I(C=r')|\mathcal F_{r'}\}-\lambda_{r'}\{G_{r'}(Z)\}I(C\geq r')}{K_{r'}\{G_{r'}(Z)\}}\right]L_{r'}\{G_{r'}(Z)\}\right).\qquad(10.45)$$
But
$$E\{I(C=r')|\mathcal F_{r'}\}=P(C=r'|C\geq r',Z)\,I(C\geq r'),\qquad(10.46)$$
which by the coarsening at random assumption and the definition of a discrete hazard, given by (8.2), is equal to λr′{Gr′(Z)}I(C ≥ r′). Substituting (10.46) into (10.45) proves (10.44). □
Lemma 10.3.
$$E\left[\frac{I(C=\infty)m^T(Z,\beta)}{\varpi(\infty,Z)}\,\frac{I(C=r)-\lambda_r\{G_r(Z)\}I(C\geq r)}{K_r\{G_r(Z)\}}\,L_r\{G_r(Z)\}\right]=-E\left[\frac{\lambda_r\{G_r(Z)\}}{K_r\{G_r(Z)\}}\,m^T(Z,\beta)\,L_r\{G_r(Z)\}\right].\qquad(10.51)$$
$$-\sum_{r\neq\infty}E\left(\frac{\lambda_r\{G_r(Z)\}}{K_r\{G_r(Z)\}}\left[m(Z,\beta)+L_{0r}\{G_r(Z)\}\right]^TL_r\{G_r(Z)\}\right)=0,\qquad(10.52)$$
Clearly, (10.53) follows when (10.54) holds. Conversely, if (10.54) were not
true, then we could choose Lr {Gr (Z)} = E{m(Z, β)| Gr (Z)} + L0r {Gr (Z)}
for all r = ∞ to get a contradiction.
    Therefore, we have demonstrated that, with monotone CAR,
                            
$$\Pi\left[\frac{I(C=\infty)m(Z,\beta)}{\varpi(\infty,Z)}\,\Big|\,\Lambda_2\right]=-\sum_{r\neq\infty}\left[\frac{I(C=r)-\lambda_r\{G_r(Z)\}I(C\geq r)}{K_r\{G_r(Z)\}}\right]E\{m(Z,\beta)|G_r(Z)\}.\qquad(10.55)$$
    In order to take advantage of the results above, we need to compute
E{m(Z, β)|Gr (Z)}. This requires us to estimate the distribution of Z, or
at least enough of the distribution to be able to compute these conditional
expectations.
Remark 4. This last statement almost seems like circular reasoning. That is,
we argue that to gain greater efficiency, we would need to estimate the distri-
bution of Z. However, if we had methods to estimate the distribution of Z,
then we wouldn’t need to develop this theory in the first place. The rationale
for considering semiparametric theory for coarsened data is that it led us to
augmented inverse probability weighted complete-case estimators, which, we
argued, build naturally on full-data estimators and are easier to derive than,
say, likelihood methods with coarsened data. However, as we saw in the case
with two levels of missingness, we will still obtain consistent asymptotically
normal estimators using this inverse weighted methodology even if we con-
struct estimators for the distribution of Z that are incorrect. This gives us
greater flexibility and robustness and suggests the use of an adaptive approach,
as we now describe.   
Adaptive Estimation
$$L_2\{C,G_C(Z)\}=\sum_{r\neq\infty}\left[\frac{I(C=r)-\lambda_r\{G_r(Z)\}I(C\geq r)}{K_r\{G_r(Z)\}}\right]E_{INC}\{m(Z,\beta)|G_r(Z)\}\qquad(10.56)$$
in the estimating equation (10.42). Even though (10.56) is not the correct projection of $I(C=\infty)m(Z,\beta)/\varpi(\infty,Z)$ onto Λ2 when the posited model for Z is incorrect, it is still an element of the augmentation space Λ2, which implies that the solution to (10.42), using the augmented term L2(·) as given by (10.56), would still be an AIPWCC estimator and hence a consistent, asymptotically normal semiparametric estimator for β. This protection against misspecified models argues in favor of using an adaptive approach.
     In an adaptive strategy, to improve efficiency, we start by positing a simpler
and possibly incorrect model for the distribution of the full data Z. Say we
                                              
$$E\{m(Z,\beta)|G_r(Z),\xi\}=\frac{\displaystyle\int_{\{G_r(z)=G_r(Z)\}}m(z,\beta)\,p_Z^*(z,\xi)\,d\nu_Z(z)}{\displaystyle\int_{\{G_r(z)=G_r(Z)\}}p_Z^*(z,\xi)\,d\nu_Z(z)}.\qquad(10.57)$$
$$\prod_{i=1}^{n}p^*_{G_{r_i}(Z_i)}(g_{r_i},\xi),\qquad(10.58)$$
where
$$p^*_{G_r(Z)}(g_r,\xi)=\int_{\{z:\,G_r(z)=g_r\}}p_Z^*(z,\xi)\,d\nu_Z(z),$$
where gr = Gr (z), p∗G1 (Z)|G0 (Z) (g1 |g0 ) = p∗G1 (Z) (g1 ),
                                                                                      
            p∗Gr (Z)|Gr −1 (Z) (gr |gr −1 ) = p∗Z|G (Z) (z|g ) when r = ∞,
   Thus, the adaptive approach to finding estimators when data are mono-
tonically coarsened is as follows.
 1. We first consider the full-data problem. That is, how would semiparamet-
    ric estimators for β be obtained if we had full data? For example, we may
    use a full-data m-estimator for β, which is the solution to
                                         
$$\sum_{i=1}^{n}m(Z_i,\beta)=0,$$
Even though the posited model p∗Z(z, ξ) may not be correctly specified, under suitable regularity conditions the estimator ξ̂n∗ will converge in probability to a constant ξ∗, and n1/2(ξ̂n∗ − ξ∗) will be bounded in probability. Also, even if the posited model is incorrect, the function L2{Ci, GCi(Zi), β, ψ0, ξ∗} ∈ Λ2,
where
The asymptotic variance of β̂n can also be obtained by using the sandwich
variance estimator for AIPWCC estimators given by (9.19).
will converge to
$$\Pi\left[\frac{I(C_i=\infty)m(Z_i,\beta_0)}{\varpi(\infty,Z_i,\psi_0)}\,\Big|\,\Lambda_2\right].$$
For the case of a correctly specified model, the influence function is
$$\frac{I(C_i=\infty)\varphi^F(Z_i)}{\varpi(\infty,Z_i,\psi_0)}-\Pi\left[\frac{I(C_i=\infty)\varphi^F(Z_i)}{\varpi(\infty,Z_i,\psi_0)}\,\Big|\,\Lambda_2\right]-\Pi\left[\,\cdot\,\Big|\,\Lambda_\psi\right],$$
where the argument " · " of the last projection is orthogonal to Λ2 and, since Λψ ⊂ Λ2, this last term must equal 0.
by positing a model p∗Z (z, ξ) often leads to more efficient estimators even if
the model was incorrect. In fact, this attempt to gain efficiency also gives us
the extra protection of double robustness similar to that seen in the previous
section when we considered two levels of missingness. We now explore this
double-robustness relationship for the case of monotone coarsening.
Throughout this section, we have taken the point of view that the model for
the coarsening probabilities was correctly specified. That is, for some ψ0 ,
   It will be convenient to show first that the expression inside the expectation
on the left-hand side of (10.62) can be written as
$$m(Z,\beta_0)-\sum_{r\neq\infty}\left[\frac{I(C=r)-\lambda_r\{G_r(Z),\psi^*\}I(C\geq r)}{K_r\{G_r(Z),\psi^*\}}\right]\times\left[m(Z,\beta_0)-E\{m(Z,\beta_0)|G_r(Z),\xi^*\}\right].\qquad(10.63)$$
Lemma 10.4.
$$\sum_{r\neq\infty}\frac{I(C=r)-\lambda_r\{G_r(Z),\psi^*\}I(C\geq r)}{K_r\{G_r(Z),\psi^*\}}=1-\frac{I(C=\infty)}{\varpi(\infty,Z,\psi^*)}.\qquad(10.65)$$
By the definitions of λr (·) and Kr (·) given by (8.2) and (8.4), respectively, we
obtain that
$$\frac{\lambda_r(\cdot)}{K_r(\cdot)}=\frac{1}{K_r(\cdot)}-\frac{1}{K_{r-1}(\cdot)},$$
where K0(·) = 1. Consequently,
$$-\sum_{r\neq\infty}\frac{\lambda_r(\cdot)I(C\geq r)}{K_r(\cdot)}=I(C\neq\infty)\sum_{r\leq C}\left\{\frac{1}{K_{r-1}(\cdot)}-\frac{1}{K_r(\cdot)}\right\}+I(C=\infty)\sum_{r\neq\infty}\left\{\frac{1}{K_{r-1}(\cdot)}-\frac{1}{K_r(\cdot)}\right\}$$
$$=I(C\neq\infty)\left[1-\frac{1}{K_C\{G_C(Z),\psi^*\}}\right]+I(C=\infty)\left[1-\frac{1}{K_\ell\{G_\ell(Z),\psi^*\}}\right],\qquad(10.67)$$
where ℓ denotes the number of different coarsening levels (i.e., the largest integer r < ∞) and
$$K_\ell\{G_\ell(Z),\psi^*\}=\prod_{r\neq\infty}\left[1-\lambda_r\{G_r(Z),\psi^*\}\right]=\varpi(\infty,Z,\psi^*).\qquad(10.68)$$
Taking the sum of (10.66) and (10.67) and substituting ϖ(∞, Z, ψ∗) for Kℓ{Gℓ(Z), ψ∗} (see (10.68)), we obtain
$$I(C\neq\infty)+I(C=\infty)\left\{1-\frac{1}{\varpi(\infty,Z,\psi^*)}\right\}=1-\frac{I(C=\infty)}{\varpi(\infty,Z,\psi^*)},$$
Theorem 10.5.
$$E\left(m(Z,\beta_0)-\sum_{r\neq\infty}\left[\frac{I(C=r)-\lambda_r\{G_r(Z),\psi^*\}I(C\geq r)}{K_r\{G_r(Z),\psi^*\}}\right]\left[m(Z,\beta_0)-E\{m(Z,\beta_0)|G_r(Z),\xi^*\}\right]\right)=0$$
if either the model for λr{Gr(Z), ψ}, r ≠ ∞, or the posited model p∗Z(z, ξ) is correctly specified.
Proof. By construction, E{m(Z, β0 )} = 0. Therefore, to prove Theorem 10.5,
we must show that
                                                       
$$E\left(\left[\frac{I(C=r)-\lambda_r\{G_r(Z),\psi^*\}I(C\geq r)}{K_r\{G_r(Z),\psi^*\}}\right]\left[m(Z,\beta_0)-E\{m(Z,\beta_0)|G_r(Z),\xi^*\}\right]\right)=0\qquad(10.69)$$
for all r ≠ ∞, if either the model for λr{Gr(Z), ψ}, r ≠ ∞, or the posited model p∗Z(z, ξ) is correctly specified.
    We first consider the case when the model for the coarsening probabilities is
correctly specified (i.e., λr {Gr (Z), ψ ∗ } = λr {Gr (Z)} = P0 (C = r|C ≥ r, Z)),
whether the posited model p∗Z (z, ξ) is correct or not. Defining the random
vector Fr = {I(C = 1), . . . , I(C = r − 1), Z}, as we did in the proof of Lemma
10.1, and deriving the expectation of (10.69) by first conditioning on Fr , we
obtain
$$E\left(\left[\frac{E\{I(C=r)|\mathcal F_r\}-\lambda_r\{G_r(Z)\}I(C\geq r)}{K_r\{G_r(Z)\}}\right]\left[m(Z,\beta_0)-E\{m(Z,\beta_0)|G_r(Z),\xi^*\}\right]\right).$$
But
E{I(C ≥ r)m(Z, β0 )|I(C ≥ r), Gr (Z)} = I(C ≥ r)E0 {m(Z, β0 )|Gr (Z)}
and hence (10.72) and (10.71) are equal to zero. A similar argument, where
we condition on {I(C = r), Gr (Z)}, can be used to show that (10.70) is equal
to zero. This then implies that (10.69) is equal to zero, which, in turn, implies
that (10.62) is true, thus demonstrating that β̂n is a consistent estimator for
β when the posited model p∗Z (z, ξ) is correctly specified.  
We return to Example 1, given in Section 9.2, where the interest was in esti-
mating parameters that described the mean CD4 count over time, as a func-
tion of treatment, in a randomized study where CD4 counts were measured
longitudinally at fixed time points. Specifically, we considered two treatments:
(X = 1) was the new treatment and (X = 0) was the control treatment. The
response Y = (Y1 , . . . , Yl )T was a vector of CD4 counts that were measured
on each individual at times 0 = t1 < . . . < tl . The full data are denoted by
Z = (Y, X). It was assumed that CD4 counts follow a linear trajectory whose
slope may be treatment-dependent. Thus the model was given by (9.24) and
assumes that
                    E(Yji |Xi ) = β1 + β2 tj + β3 Xi tj .
Therefore, the problem was to estimate the parameter β = (β1 , β2 , β3 )T from
a sample of data Zi = (Yi , Xi ), i = 1, . . . , n, where Yi = (Y1i , . . . , Yli )T are
the longitudinally measured CD4 counts for subject i.
     In this study, some patients dropped out, and for those patients we ob-
served the CD4 count data prior to dropout, whereas all subsequent CD4
counts are missing. This is an example of monotone coarsening with l − 1 levels of coarsening. We introduce the notation Y r to denote the vector of data (Y1, . . . , Yr)T, r = 1, . . . , l, and Y r̄ to denote the vector of data (Yr+1, . . . , Yl)T, r = 1, . . . , l − 1. Therefore, when the coarsening variable
is Ci = r, we observe Gr (Zi ) = (Xi , Yir ), r = 1, . . . , l − 1, and, when Ci = ∞,
we observe the complete data G∞ (Zi ) = Zi = (Xi , Yi ).
     With such coarsened data {Ci , GCi (Zi )}, i = 1, . . . , n, we considered esti-
mating the parameter β using an AIPWCC estimator. To accommodate this,
we introduced a logistic regression model for the discrete hazard of coarsening
probabilities given by (9.25) in terms of a parameter ψ, which was estimated
by maximizing the likelihood (8.13). The resulting estimator was denoted by
ψ̂n . An AIPWCC estimator for β was proposed by solving the estimating
equation (9.26), where, for simplicity, we chose
                         m(Z, β) = H T (X){Y − H(X)β}.
The definition of the design matrix H(X) is given subsequent to (9.24), and
the rationale for this estimating equation is given in Section 9.2. Notice, how-
ever, that we defined the augmentation term in equation (9.26) generically
using arbitrary functions Lr {Gr (Z)}, r = ∞. We now know that to improve
efficiency as much as possible, we should choose
$$L_r\{G_r(Z)\}=E\{m(Z,\beta)|G_r(Z)\},$$
which requires adaptive estimation using a posited model for p∗Z (z, ξ). We
propose the following.
    Assume that the distribution of Y given X follows a multivariate normal
distribution whose mean and variance matrix may depend on treatment X;
that is,
                         Y |X = 1 ∼ M V N (ξ1 , Σ1 )
and
                            Y |X = 0 ∼ M V N (ξ0 , Σ0 ),
where ξk is an l-dimensional vector and Σk is an l × l matrix for k = 0, 1. The
parameter ξ will denote (ξ0 , ξ1 , Σ0 , Σ1 ).
Remark 7. Even though our model puts restrictions on the mean vectors ξk ,
in terms of the parameter β, we will let these be unrestricted and, as it turns
out, these parameters will not come into play in our estimating equation.    
$$\prod_{i=1}^{n}\prod_{k=0}^{1}\left(\prod_{r=1}^{l-1}\left[\{(2\pi)^r|\Sigma_k^{rr}|\}^{-1/2}\exp\left\{-\tfrac12(Y_i^r-\xi_k^r)^T(\Sigma_k^{rr})^{-1}(Y_i^r-\xi_k^r)\right\}\right]^{I(C_i=r,\,X_i=k)}\right.$$
$$\left.\times\left[\{(2\pi)^l|\Sigma_k|\}^{-1/2}\exp\left\{-\tfrac12(Y_i-\xi_k)^T\Sigma_k^{-1}(Y_i-\xi_k)\right\}\right]^{I(C_i=\infty,\,X_i=k)}\right).\qquad(10.75)$$
$$\sum_{i=1}^{n}\left(\frac{I(C_i=\infty)H^T(X_i)\{Y_i-H(X_i)\beta\}}{\varpi(\infty,Z_i,\hat\psi_n)}+\sum_{r=1}^{l-1}\left[\frac{I(C_i=r)-\lambda_r\{G_r(Z_i),\hat\psi_n\}I(C_i\geq r)}{K_r\{G_r(Z_i),\hat\psi_n\}}\right]\times H^T(X_i)\,q(r,X_i,Y_i^r,\beta,\hat\xi_n^*)\right)=0,$$
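Assuming, for illustration, that q(r, X, Y r, β, ξ) denotes E{Y − H(X)β | X, Y r} computed under the posited multivariate normal model, it can be evaluated with the usual conditional normal formula, as in the sketch below (Python; names and the exact definition of q are assumptions, not taken from the text).

import numpy as np

def conditional_mean_Y(y_r, xi_k, Sigma_k):
    """E(Y | Y^r = y_r) for Y ~ N(xi_k, Sigma_k), where r = len(y_r)."""
    r = len(y_r)
    mu1, mu2 = xi_k[:r], xi_k[r:]
    S11 = Sigma_k[:r, :r]                      # leading r x r block
    S21 = Sigma_k[r:, :r]
    cond_tail = mu2 + S21 @ np.linalg.solve(S11, y_r - mu1)
    return np.concatenate([y_r, cond_tail])    # observed part, then conditional mean of the rest

def q(r, H_X, y_r, beta, xi_k, Sigma_k):
    return conditional_mean_Y(y_r, xi_k, Sigma_k) - H_X @ beta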
where Z denotes the full-data (i.e., Z = {T, X̄(T )}), and m(Z, β) denotes a
full-data estimating function that would have been used to obtain estimators
for β had there been no censoring. The definition of dNC̃ (r), λC̃ {r, X̄(r)},
Y (r), dMC̃ {r, X̄(r)} = dNC̃ (r) − λC̃ {r, X̄(r)}Y (r)dr, and Kr {X̄(r)} were all
defined in Section 9.3.
    By analogy between the censored-data estimating equation (10.76) and
the monotonically coarsened data estimating equation (10.42), we can show
that the most efficient augmented inverse probability weighted complete-case
estimator for β that uses the full-data estimating function m(Z, β) is obtained
by choosing
                     Lr {X̄(r)} = E{m(Z, β)|T ≥ r, X̄(r)}.
      To actually implement these methods with censored data, we need to
 1. develop models for the censoring distribution λC̃ {r, X̄(r), ψ} and find es-
    timators for ψ, and
 2. estimate the conditional expectation E{m(Z, β)|T ≥ r, X̄(r)}.
    A popular model for the censoring hazard function λC̃ {r, X̄(r), ψ} is the
semiparametric proportional hazards regression model (Cox, 1972) using max-
imum partial likelihood estimators to estimate the regression parameters and
Breslow’s (1974) estimator to estimate the underlying cumulative hazard func-
tion.
    In order to estimate the conditional expectation E{m(Z, β)|T ≥ r, X̄(r)},
we can posit a simpler full-data model, say p∗Z (z, ξ) = p∗T,X̄(T ) {t, x̄(t), ξ},
and then estimate ξ using the observed data {Ui , ∆i , X̄i (Ui )}, i = 1, . . . , n by
maximizing the observed-data likelihood
$$\prod_{i=1}^{n}\left[p^*_{T,\bar X(T)}\{U_i,\bar X_i(U_i),\xi\}\right]^{\Delta_i}\times\left[\int_{\{t,\bar x(t)\}:\,t\geq U_i,\;x(s)=X_i(s),\,s\leq U_i}p^*_{T,\bar X(T)}\{t,\bar x(t),\xi\}\,d\nu_{T,\bar X(T)}\{t,\bar x(t)\}\right]^{1-\Delta_i}.\qquad(10.77)$$
    Building models for p∗T,X̄(T){t, x̄(t), ξ} with time-dependent covariates and maximizing (10.77) can be a daunting task. Nonetheless, the theory that has
been developed can often be useful in developing more efficient estimators
even if we don’t necessarily derive the most efficient one.
    In the example of censored medical cost data that was described in Exam-
ple 2 of Section 9.3, Bang and Tsiatis (2000) used augmented inverse prob-
ability weighted complete-case estimators to estimate the mean medical cost
and showed various methods for gaining efficiency by judiciously choosing the
augmented term.
    Other examples where this methodology was used include Robins, Rot-
nitzky, and Bonetti (2001), who used AIPWCC estimators of the survival
distribution under double sampling with follow-up of dropouts. Hu and Tsi-
atis (1996) and van der Laan and Hubbard (1998) constructed estimators of
the survival distribution from survival data that are subject to reporting de-
lays. Zhao and Tsiatis (1997, 1999, 2000) and van der Laan and Hubbard
(1999) derived estimators of the quality-adjusted-lifetime distribution from
right-censored data. Bang and Tsiatis (2002) derived estimators for the pa-
rameters in a median regression model of right-censored medical costs. Straw-
derman (2000) used these methods to derive an estimator of the mean of an
increasing stopped stochastic process. Van der Laan, Hubbard, and Robins
(2002) and Quale, van der Laan and Robins (2003) constructed locally effi-
cient estimators of a multivariate survival distribution when failure times are
subject to a common censoring time and to failure-time-specific censoring.
have been worked out for nonmonotone coarsened data using the general the-
ory developed by Robins, Rotnitzky, and Zhao (1994). For completeness, we
present these results in this section, but again we caution the reader that there
are many challenges yet to be tackled before these methods can be feasibly
implemented.
$$\mathcal L\{h^F(\cdot)\}=\sum_{r=1}^{\infty}I(C=r)\,E\{h^F(Z)|C=r,G_r(Z)\},$$
    The projection of $I(C=\infty)h^F(Z)/\varpi(\infty,Z)$ onto Λ2 is now given by the following theorem.
Theorem 10.6.
(i) The inverse mapping M−1 exists and is uniquely defined.
(ii) The projection is
                     
$$\Pi\left[\frac{I(C=\infty)h^F(Z)}{\varpi(\infty,Z)}\,\Big|\,\Lambda_2\right]=\frac{I(C=\infty)h^F(Z)}{\varpi(\infty,Z)}-\mathcal L[\mathcal M^{-1}\{h^F(\cdot)\}].\qquad(10.82)$$
where the last equality follows because M−1(hF) ∈ HF and hence is a function of Z, allowing it to come outside the inner conditional expectation, and because L2(·) ∈ Λ2, which implies that E{L2(·)|Z} = 0, thus proving (b). □
                                                                    
In order to complete the proof of Theorem 10.6, we must show that the linear
operator M has a unique inverse. We will prove the existence and uniqueness
of the inverse of the linear mapping M by showing that the linear operator
(I − M) is a contraction mapping, where I denotes the identity mapping
HF → HF ; i.e., I{hF (Z)} = hF (Z) for hF ∈ HF . For more details on
these methods, we refer the reader to Kress (1989).
    We begin by first defining what we mean by a contraction mapping and
proving why (I − M) being a contraction mapping implies that M has a
unique inverse.
$$\sup_{h^F\in\mathcal H^F}\frac{\|(I-\mathcal M)(h^F)\|}{\|h^F\|},$$
or equivalently, ‖I − M‖ ≤ (1 − ε), if
i.e.,
$$\varphi_0(Z)=h^F(Z),$$
$$\varphi_1(Z)=(I-\mathcal M)h^F(Z)+h^F(Z),$$
$$\varphi_2(Z)=(I-\mathcal M)^2h^F(Z)+(I-\mathcal M)h^F(Z)+h^F(Z),$$
$$\vdots$$
and $\varphi_n(Z)\rightarrow\mathcal M^{-1}\{h^F(\cdot)\}$.
However,
$$\mathcal M\,\mathcal S\{h^F(Z)\}=\mathcal M\left\{\sum_{k=0}^{\infty}(I-\mathcal M)^kh^F(Z)\right\}=\{I-(I-\mathcal M)\}\left\{\sum_{k=0}^{\infty}(I-\mathcal M)^kh^F(Z)\right\}$$
In that case,
and
\[
S(h^F) - S^*(h^F) = 0. \qquad \blacksquare
\]
and
\[
\Pi\big[h_{CZ}\,\big|\,\mathcal{H}^F\big] = E\big\{h_{CZ}(C, Z)\,\big|\,Z\big\}, \qquad (10.86)
\]
where equations (10.85) and (10.86) can be easily shown to hold by checking
that the definitions of a projection are satisfied.
   Therefore, deriving M{h(·)} = Π[Π[h|H]|HF ] corresponds to finding two
subsequent projections onto these two linear subspaces. What we want to
prove is that (I − M) is a contraction mapping from HF to HF . First note
that
\[
(I - \mathcal{M})h^F(Z) = \Pi\Big[\big\{h^F(Z) - \Pi[h^F(Z)\,|\,\mathcal{H}]\big\}\,\Big|\,\mathcal{H}^F\Big].
\]
Hence, by the Pythagorean theorem,
Hence,
\[
\begin{aligned}
\big\|\Pi[h^F(Z)\,|\,\mathcal{H}]\big\|^2
&= E\Bigg(\Big[I(C=\infty)h^F(Z) + \sum_{r\neq\infty} I(C=r)\,E\{h^F(Z)\,|\,G_r(Z)\}\Big]^T\\
&\qquad\quad\ \Big[I(C=\infty)h^F(Z) + \sum_{r\neq\infty} I(C=r)\,E\{h^F(Z)\,|\,G_r(Z)\}\Big]\Bigg)\\
&= E\Big[I(C=\infty)\{h^F(Z)\}^T h^F(Z) + \sum_{r\neq\infty} I(C=r)\,E\{h^F(Z)\,|\,G_r(Z)\}^T E\{h^F(Z)\,|\,G_r(Z)\}\Big]\\
&\geq E\Big[I(C=\infty)\{h^F(Z)\}^T h^F(Z)\Big]. \qquad (10.89)
\end{aligned}
\]
of the DR linear space J (ΛF ⊥ ) (see Definition 3) or, equivalently, the space
L{M−1 (ΛF ⊥ )}.
    Consequently, if we defined a full-data estimating function m(Z, β) such
that m(Z, β0 ) ∈ ΛF ⊥ , then we should use L[M−1 {m(Z, β)}] as our observed-
data estimating function. That is, the estimator for β would be the solution
to the estimating equation
\[
\sum_{i=1}^{n} \mathcal{L}_i\big[\mathcal{M}^{-1}_i\{m(Z_i, \beta)\}\big] = 0. \qquad (10.92)
\]
We will now demonstrate that the estimator for β that solves (10.93) is an
example of an AIPWCC estimator. This follows because of the following the-
orem.
Theorem 10.7. Let $d^F(Z, \beta, \psi, \xi) = \mathcal{M}^{-1}\{m(Z, \beta), \psi, \xi\}$. Then the corresponding observed-data estimating function
can be written as
\[
\mathcal{L}\{d^F(Z, \beta, \psi, \xi), \xi\} = \frac{I(C = \infty)\,m(Z, \beta)}{\varpi(\infty, Z, \psi)} + L_2^*\{C, G_C(Z), \beta, \psi, \xi\}, \qquad (10.94)
\]
where
Proof. By definition,
Because
\[
\frac{I(C = \infty)\,m(Z, \beta_0)}{\varpi(\infty, Z, \psi_0)} + L^*_{2(j)}\{C, G_C(Z), \beta_0, \psi_0, \xi^*\}
\]
\[
L^{**}_{2(j)}\{C, G_C(Z), \beta_0, \psi_0, \xi^*\}
= -\,E\left\{\frac{\partial m(Z, \beta_0)}{\partial \beta^T}\right\}^{-1} L^*_{2(j)}\{C, G_C(Z), \beta_0, \psi_0, \xi^*\},
\]
Double Robustness
and
\[
\text{(ii)}\quad E_{\xi_0, \psi_0}\Big(\mathcal{L}\big[\mathcal{M}^{-1}\{m(Z, \beta_0), \psi^*, \xi_0\}, \xi_0\big]\Big) = 0.
\]
Proof of (i)
Because the conditional distribution of C|Z involves the parameter ψ only and
the marginal distribution of Z involves ξ only, we can write (i) as
\[
E_{\xi_0}\Big[E_{\psi_0}\Big\{\mathcal{L}\big[\mathcal{M}^{-1}\{m(Z, \beta_0), \psi_0, \xi^*\}, \xi^*\big] \,\Big|\, Z\Big\}\Big]. \qquad (10.103)
\]
But, for any function q(Z), Eψ0 [L{q(Z), ξ ∗ }|Z] is, by definition, equal to
$\mathcal{M}\{q(Z), \psi_0, \xi^*\}$. Therefore, (10.103) equals
\[
E_{\xi_0}\Big[\mathcal{M}\big[\mathcal{M}^{-1}\{m(Z, \beta_0), \psi_0, \xi^*\}, \psi_0, \xi^*\big]\Big]
= E_{\xi_0}\{m(Z, \beta_0)\} = 0. \qquad \blacksquare
\]
Proof of (ii)
Because L{q(Z), ξ0 } = Eξ0 {q(Z)|C, GC (Z)}, then
Notice that the argument above did not involve the parameter ψ; hence
for any parameter ψ ∗ . Applying this to the left-hand side of equation (ii), we
obtain
\[
\begin{aligned}
E_{\xi_0, \psi_0}\Big(\mathcal{L}\big[\mathcal{M}^{-1}\{m(Z, \beta_0), \psi^*, \xi_0\}, \xi_0\big]\Big)
&= E_{\xi_0, \psi^*}\Big(\mathcal{L}\big[\mathcal{M}^{-1}\{m(Z, \beta_0), \psi^*, \xi_0\}, \xi_0\big]\Big)\\
&= E_{\xi_0}\Big[E_{\psi^*}\Big\{\mathcal{L}\big[\mathcal{M}^{-1}\{m(Z, \beta_0), \psi^*, \xi_0\}, \xi_0\big] \,\Big|\, Z\Big\}\Big]\\
&= E_{\xi_0}\Big[\mathcal{M}\big[\mathcal{M}^{-1}\{m(Z, \beta_0), \psi^*, \xi_0\}, \psi^*, \xi_0\big]\Big]\\
&= E_{\xi_0}\{m(Z, \beta_0)\} = 0. \qquad \blacksquare
\end{aligned}
\]
•     Let the full data be given by Z = (Z1T , Z2T )T , where Z1 is always ob-
      served but Z2 may be missing. Let R denote the complete-case indicator, and let P(R = 1|Z) = π(Z1, ψ) be a model describing the complete-case
      probabilities. Then, for $\varphi^{*F}(Z) \in \Lambda^{F\perp}$,
\[
\mathcal{J}(\varphi^{*F}) = \frac{R\,\varphi^{*F}(Z)}{\pi(Z_1, \psi_0)} - \frac{R - \pi(Z_1, \psi_0)}{\pi(Z_1, \psi_0)}\,E\{\varphi^{*F}(Z)\,|\,Z_1\}.
\]
      $\hat{\xi}^*_n$ is an estimator for the parameter ξ in a posited model $p^*_Z(z, \xi)$, and
\[
h_2^*(Z_{1i}, \beta, \xi) = E\big\{m(Z_i, \beta)\,\big|\,Z_{1i}, \xi\big\}.
\]
Monotone coarsening
    where
\[
K_r\{G_r(Z), \psi_0\} = P(C > r \,|\, Z, \psi_0) = \prod_{r'=1}^{r}\big[1 - \lambda_{r'}\{G_{r'}(Z), \psi_0\}\big], \quad r \neq \infty,
\]
    and
\[
\varpi(\infty, Z, \psi_0) = \prod_{r \neq \infty}\big[1 - \lambda_r\{G_r(Z), \psi_0\}\big].
\]
    where m(Z, β) is a full-data estimating function, ψ̂n is the MLE for ψ, and
    ξˆn∗ is an estimator for the parameter ξ in a posited model p∗Z (z, ξ).
Nonmonotone coarsening
    where
      $\mathcal{L}(h^F) = E\{h^F(Z)\,|\,C, G_C(Z)\}$ is equal to
\[
\sum_{r} I(C = r)\,E\{h^F(Z)\,|\,G_r(Z)\},
\]
         and where the inverse operator M−1 (hF ) exists and can be obtained
         by successive approximation, where
      where m(Z, β) is a full-data estimating function, ψ̂n is the MLE for ψ, ξˆn∗
      is an estimator for the parameter ξ in a posited model $p^*_Z(z, \xi)$, and $d^F_{(j)}(Z, \beta, \psi, \xi)$ is the approximation to $\mathcal{M}^{-1}\{m(Z, \beta), \psi, \xi\}$ after j iterations of successive approximation.
• As j → ∞, the AIPWCC estimator becomes a double-robust estimator.
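When Z takes finitely many values, the operators above reduce to matrices, and the successive-approximation (Neumann series) construction of $\mathcal{M}^{-1}$ can be carried out directly. The following Python sketch is purely illustrative: the joint probabilities, the nonmonotone coarsening probabilities, and the function $h^F$ are invented toy values, chosen so that coarsening is at random and $\varpi(\infty, Z)$ is bounded away from zero.

    import numpy as np
    import itertools

    # Toy discrete full data Z = (Z1, Z2), each binary, with an assumed joint pmf.
    support = list(itertools.product([0, 1], [0, 1]))
    p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.3}

    # Nonmonotone CAR coarsening: C=1 observes Z1 only, C=2 observes Z2 only,
    # C=infinity observes all of Z; each probability depends only on what is observed.
    def w1(z1): return 0.20 + 0.10 * z1            # P(C = 1 | Z)
    def w2(z2): return 0.20 + 0.10 * z2            # P(C = 2 | Z)
    def winf(z): return 1.0 - w1(z[0]) - w2(z[1])  # P(C = infinity | Z) >= 0.4

    # Conditional-expectation operators E{h(Z) | Z1} and E{h(Z) | Z2} as 4 x 4 matrices.
    def cond_exp(coord):
        E = np.zeros((4, 4))
        for i, z in enumerate(support):
            marg = sum(p[zz] for zz in support if zz[coord] == z[coord])
            for j, zz in enumerate(support):
                if zz[coord] == z[coord]:
                    E[i, j] = p[zz] / marg
        return E

    # M{h}(Z) = winf(Z) h(Z) + w1(Z1) E{h | Z1} + w2(Z2) E{h | Z2}.
    M = (np.diag([winf(z) for z in support])
         + np.diag([w1(z[0]) for z in support]) @ cond_exp(0)
         + np.diag([w2(z[1]) for z in support]) @ cond_exp(1))

    h = np.array([1.0, -2.0, 0.5, 3.0])            # an arbitrary h^F on the support

    # Successive approximation: phi_k = (I - M) phi_{k-1} + h, as in Lemma 10.5.
    phi = h.copy()
    for _ in range(200):
        phi = (np.eye(4) - M) @ phi + h

    print(np.allclose(M @ phi, h))                 # True: phi approximates M^{-1}(h)
    print(np.allclose(phi, np.linalg.solve(M, h))) # agrees with a direct solve

Because $\varpi(\infty, Z) \geq 0.4$ in this toy example, $(I - \mathcal{M})$ is a contraction and the iteration converges geometrically; with coarsening probabilities and conditional expectations estimated from data, the same recursion is the computational core of the estimating equation described above.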
3. Consider the simple linear regression restricted moment model where with
   full data (Yi , X1i , X2i ), i = 1, . . . , n, we assume
                          E(Yi |X1i , X2i ) = β0 + β1 X1i + β2 X2i .
  In such a model, we can estimate the parameters (β0 , β1 , β2 )T using ordi-
   nary least squares; that is, the solution to the estimating equation
\[
\sum_{i=1}^{n} (1, X_{1i}, X_{2i})^T (Y_i - \beta_0 - \beta_1 X_{1i} - \beta_2 X_{2i}) = 0. \qquad (10.104)
\]
  In fact, this estimator is locally efficient when var(Yi |X1i , X2i ) is constant.
  The data, however, are missing at random with a monotone missing pat-
  tern. That is, Yi is observed on all individuals in the sample; however, for
  some individuals, only X2i is missing, and for others both X1i and X2i
  are missing. Therefore, we define the missingness indicator
                          (Ci = 1) if we only observe Yi ,
                          (Ci = 2) if we only observe (Yi , X1i ),
and
and
\[
\frac{I(C_i = \infty)}{\varpi(\infty, Y_i, X_{1i}, X_{2i}, \psi_0)}\,\varphi^{*F}(Y_i, X_{1i}, X_{2i})
- \Pi\left[\frac{I(C_i = \infty)\,\varphi^{*F}(Y_i, X_{1i}, X_{2i})}{\varpi(\infty, Y_i, X_{1i}, X_{2i}, \psi_0)}\,\Big|\,\Lambda_2\right],
\]
      i) With the observed data, how would you estimate the parameters in
         the multivariate normal?
      j) Assuming the simplifying multivariate normal model and the esti-
         mates derived in (i), estimate the projection in (h).
      k) Write out the estimating equation that needs to be solved to get an
         improved estimator.
      l) Find a consistent estimator for the asymptotic variance of the estima-
         tor in (k). (Keep in mind that the simplifying model of multivariate
         normality may not be correct.)
11
Locally Efficient Estimators for
Coarsened-Data Semiparametric Models
where L∗2 {C, GC (Z), β, ψ, ξ} is equal to minus the projection onto the augmen-
tation space; i.e.,
\[
-\,\Pi\left[\frac{I(C = \infty)\,m(Z, \beta)}{\varpi(\infty, Z)}\,\Big|\,\Lambda_2\right].
\]
To compute this projection, we need estimates for the parameter ψ that de-
scribes the coarsening probabilities and an estimate for the marginal distri-
bution of Z. The latter is accomplished by positing a simpler, and possibly
incorrect, model for the density of Z as p∗Z (z, ξ) and deriving an estimator ξˆn∗
for ξ.
     Among the class of double-robust estimators is the efficient estimator,
the estimator that achieves the semiparametric efficiency bound. Finding this
efficient estimator within this class of double-robust estimators entails de-
riving the proper choice of the full-data estimating function m(Z, β), where
m(Z, β0 ) = ϕ∗F ∈ ΛF ⊥ . In this chapter, we will study how to find the efficient
estimator and the appropriate choice for m(Z, β).
     As we will see, the efficient estimator will depend on the true marginal dis-
tribution of Z, which, of course, is unknown to us. Consequently, we will de-
velop adaptive methods where the efficient estimator will be computed based
on a posited model p∗Z (z, ξ) for the density of Z. Hence, the proposed methods
will lead to a locally efficient estimator, an estimator for β that will achieve
the semiparametric efficiency bound if the posited model is correct but will
still be a consistent, asymptotically normal RAL semiparametric estimator
for β even if the posited model does not contain the truth.
     As we indicated in Chapter 10, finding improved double-robust estimators
often involves computationally intensive methods. In fact, when the coars-
ening of the data is nonmonotone, these computational challenges could be
overwhelming. Similarly here, deriving locally efficient estimators involves nu-
merical difficulties. Nonetheless, the theory developed by Robins, Rotnitzky,
and Zhao (1994) gives us a prescription for how to derive locally efficient
estimators. We present this theory in this chapter and discuss strategies for
finding locally efficient estimators. The methods build on the full-data semi-
parametric theory. Therefore, it will be assumed that we have a good un-
derstanding of the full-data semiparametric model. That is, we can identify
the space orthogonal to the full-data nuisance tangent space ΛF ⊥ , the class
of full-data influence functions $(IF)^F$, the full-data efficient score $S^F_{\mathrm{eff}}(Z, \beta_0)$, and the full-data influence function $\varphi^F_{\mathrm{eff}}(Z)$.
     However, we caution the reader that these methods may be very difficult
to implement in practice, and we believe a great deal of research still needs to
be done in developing feasible computational algorithms. In Chapter 12, we
will discuss approximations that may be used to derive AIPWCC estimators
for β that although not locally efficient are easier to implement and can result
in substantial gains in efficiency.
     There is, however, one class of problems where locally efficient estimators
are obtained readily, and this is the case when only one full-data influence
function exists. This occurs, for example, when the full-data tangent space is
the entire Hilbert space HF , as is the case when no restrictions are put on the
class of densities for Z; i.e., the nonparametric problem (see Theorem 4.4).
In Section 5.3, we showed that only one full-data influence function exists
when we are interested in estimating the mean of a random variable in a
nonparametric problem and that this estimator can be obtained using the
sample average. When only one full-data influence function, ϕF (Z), exists,
then the class of observed-data influence functions is given by
\[
\frac{I(C = \infty)\,\varphi^F(Z)}{\varpi(\infty, Z, \psi_0)} + L_2\{C, G_C(Z)\} - \Pi\big[\{\cdot\}\,\big|\,\Lambda_\psi\big], \qquad L_2(\cdot) \in \Lambda_2,
\]
which is also the efficient influence function. Consequently, when there is only
one full-data influence function, then the adaptive double-robust estimator
outlined in Chapter 10 will lead to a locally efficient estimator. We illustrate
with a simple example.
Denote the least-squares estimator for ξ by ξˆn∗ . Then, the adaptive observed-
data estimator for β is given as the solution to
\[
\sum_{i=1}^{n}\left[\frac{R_i}{\pi(Y_i, X_{1i})}\,(X_{2i} - \beta) - \frac{R_i - \pi(Y_i, X_{1i})}{\pi(Y_i, X_{1i})}\,\big(\hat{\xi}^*_{1n} + \hat{\xi}^*_{2n} Y_i + \hat{\xi}^{*T}_{3n} X_{1i} - \beta\big)\right] = 0,
\]
whether the distribution is normal or not, the locally efficient estimator β̂n is
also fully efficient whenever (11.6) is satisfied.
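Solving the displayed estimating equation for β gives an explicit augmented inverse probability weighted average. The short Python sketch below is only an illustration: the simulated data, the logistic form of π(Y, X1) (treated as known here rather than modeled through ψ), and the linear working regression for E(X2 | Y, X1) are all assumptions made for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000

    # Simulated data (illustration only); beta = E(X2) is the target.
    X1 = rng.normal(size=n)
    Y = 1.0 + 0.5 * X1 + rng.normal(size=n)
    X2 = 2.0 + 0.8 * Y - 0.3 * X1 + rng.normal(size=n)

    # MAR complete-case indicator: R depends only on the always-observed (Y, X1).
    pi = 1.0 / (1.0 + np.exp(-(0.5 + 0.4 * Y - 0.3 * X1)))
    R = rng.binomial(1, pi)

    # Posited linear working model for E(X2 | Y, X1), fit by least squares
    # among the complete cases; this plays the role of xi-hat-star_n.
    W = np.column_stack([np.ones(n), Y, X1])
    xi_hat = np.linalg.lstsq(W[R == 1], X2[R == 1], rcond=None)[0]
    m_hat = W @ xi_hat

    # Solving the estimating equation for beta yields the AIPWCC average:
    beta_hat = np.mean(R / pi * X2 - (R - pi) / pi * m_hat)

    print(beta_hat, X2.mean(), X2[R == 1].mean())  # AIPWCC vs. full-data vs. biased complete-case mean

The resulting estimator is consistent if either π(Y, X1) or the working regression is correctly specified, which is the double-robustness property discussed in Chapter 10.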
11.1 The Observed-Data Efficient Score
Representation 1 (Likelihood-Based)
We remind the reader that the efficient observed-data estimator for β has an
influence function that is proportional to the observed-data efficient score and
that the efficient score is unique and equal to
              Seff {C, GC (Z)} = Sβ {C, GC (Z)} − Π[Sβ {C, GC (Z)}|Λ],
where Sβ {C, GC (Z)} = E{SβF (Z)|C, GC (Z)}, and Λ = Λψ ⊕ Λη , Λψ ⊥ Λη .
Because Λψ ⊥ Λη , this implies that
     Π[Sβ {C, GC (Z)}|Λ] = Π[Sβ {C, GC (Z)}|Λψ ] + Π[Sβ {C, GC (Z)}|Λη ].
    The same argument that was used to show that Λη ⊥ Λ2 in Theorem
8.2 can also be used to show that Sβ {C, GC (Z)} ⊥ Λ2 . Since Λψ ⊂ Λ2 , this
implies that Sβ {C, GC (Z)} is orthogonal to Λψ . Therefore,
                            Π[Sβ {C, GC (Z)}|Λψ ] = 0,                    (11.7)
which implies that
                   Π[Sβ {C, GC (Z)}|Λ] = Π[Sβ {C, GC (Z)}|Λη ],
and the efficient score is
        Seff {C, GC (Z)} = Sβ {C, GC (Z)} − Π[Sβ {C, GC (Z)}|Λη ].         (11.8)
Recall that
\[
\Lambda_\eta = \Big[E\big\{\alpha^F(Z)\,\big|\,C, G_C(Z)\big\} \ \text{for all}\ \alpha^F(Z) \in \Lambda^F\Big].
\]
This means that the unique projection Π[Sβ {C, GC (Z)}|Λη ] corresponds to
some element in Λη , which we will denote by
\[
E\big\{\alpha^F_{\mathrm{eff}}(Z)\,\big|\,C, G_C(Z)\big\}, \qquad \alpha^F_{\mathrm{eff}}(Z) \in \Lambda^F.
\]
With this representation,
\[
\begin{aligned}
S_{\mathrm{eff}}\{C, G_C(Z)\} &= E\big\{S^F_\beta(Z)\,\big|\,C, G_C(Z)\big\} - E\big\{\alpha^F_{\mathrm{eff}}(Z)\,\big|\,C, G_C(Z)\big\}\\
&= E\big[\{S^F_\beta(Z) - \alpha^F_{\mathrm{eff}}(Z)\}\,\big|\,C, G_C(Z)\big]. \qquad (11.9)
\end{aligned}
\]
Representation 2 (AIPWCC-Based)
which we denote as the space (IF )DR = J {(IF )F }. Since the observed-data
efficient score is defined up to a proportionality constant matrix times the
efficient influence function, this implies that the observed-data efficient score
must be an element in the DR linear space $\mathcal{J}(\Lambda^{F\perp})$,
\[
\left[\frac{I(C = \infty)B^F(Z)}{\varpi(\infty, Z, \psi_0)} - \Pi\left\{\frac{I(C = \infty)B^F(Z)}{\varpi(\infty, Z, \psi_0)}\,\Big|\,\Lambda_2\right\} : \ \text{for}\ B^F(Z) \in \Lambda^{F\perp}\right], \qquad (11.11)
\]
with the corresponding element denoted by $B^F_{\mathrm{eff}}(Z)$.
Thus, we have shown that there are two equivalent representations for the
observed-data efficient score, which are given by (11.9) and (11.11):
(i) $S_{\mathrm{eff}}\{C, G_C(Z)\} = E\big[\{S^F_\beta(Z) - \alpha^F_{\mathrm{eff}}(Z)\}\,\big|\,C, G_C(Z)\big]$, where $\alpha^F_{\mathrm{eff}}(Z) \in \Lambda^F$,
and
(ii) $S_{\mathrm{eff}}\{C, G_C(Z)\} = \dfrac{I(C = \infty)B^F_{\mathrm{eff}}(Z)}{\varpi(\infty, Z)} - \Pi\left[\dfrac{I(C = \infty)B^F_{\mathrm{eff}}(Z)}{\varpi(\infty, Z)}\,\Big|\,\Lambda_2\right]$, where $B^F_{\mathrm{eff}}(Z) \in \Lambda^{F\perp}$.
Proof. Because of the equivalence of the two representations (i) and (ii) above,
the efficient score can be written as
\[
E\big[\{S^F_\beta(Z) - \alpha^F_{\mathrm{eff}}(Z)\}\,\big|\,C, G_C(Z)\big]
= \frac{I(C = \infty)B^F_{\mathrm{eff}}(Z)}{\varpi(\infty, Z)} - \Pi\left[\frac{I(C = \infty)B^F_{\mathrm{eff}}(Z)}{\varpi(\infty, Z)}\,\Big|\,\Lambda_2\right] \qquad (11.13)
\]
for some $\alpha^F_{\mathrm{eff}}(Z) \in \Lambda^F$ and $B^F_{\mathrm{eff}}(Z) \in \Lambda^{F\perp}$. Taking the conditional expectation of both sides of equation (11.13) with respect to Z, we obtain the equation
\[
\mathcal{M}\big\{S^F_\beta(Z) - \alpha^F_{\mathrm{eff}}(Z)\big\} = B^F_{\mathrm{eff}}(Z). \qquad (11.14)
\]
  In Theorem 10.6 and Lemma 10.5, we showed that the linear operator
M(·) has a unique inverse, M−1 . Therefore, we can write (11.14) as
\[
\mathcal{M}^{-1}\big\{B^F_{\mathrm{eff}}(Z)\big\} = \big\{S^F_\beta(Z) - \alpha^F_{\mathrm{eff}}(Z)\big\}. \qquad (11.15)
\]
Uniqueness of $B^F_{\mathrm{eff}}(Z)$

Lemma 11.1. There exists a unique $B^F_{\mathrm{eff}}(Z) \in \Lambda^{F\perp}$ that solves the equation
\[
\Pi\big[\mathcal{M}^{-1}\{B^F_{\mathrm{eff}}(Z)\}\,\big|\,\Lambda^{F\perp}\big] = S^F_{\mathrm{eff}}(Z), \qquad (11.17)
\]
Proof. Notice that (11.17) involves a mapping from the linear subspace ΛF ⊥ ⊂
HF to ΛF ⊥ ⊂ HF . The way we will prove this lemma is by defining another
linear mapping (I −Q){M−1 }(·) : HF → HF , which maps the entire full-data
Hilbert space to the entire full-data Hilbert space in such a way that
(i) (I − Q){M−1 }(·) coincides with the mapping Π[M−1 (hF )|ΛF ⊥ ] whenever
     hF ∈ ΛF ⊥ , and
(ii) (I − Q) is a contraction mapping and hence has a unique inverse.
    Define
\[
D^F_{\mathrm{eff}}(Z) = \mathcal{M}^{-1}\big\{B^F_{\mathrm{eff}}(Z)\big\}.
\]
Because of the existence and uniqueness of $\mathcal{M}^{-1}$, if $B^F_{\mathrm{eff}}(Z)$ exists, then so does $D^F_{\mathrm{eff}}(Z)$ such that
\[
\mathcal{M}\big\{D^F_{\mathrm{eff}}(Z)\big\} \in \Lambda^{F\perp}. \qquad (11.18)
\]
    Motivated by the fact that $D^F_{\mathrm{eff}}(Z)$ must satisfy
or, equivalently,
\[
\Pi\big[\mathcal{M}(\cdot)\,\big|\,\Lambda^F\big]\big\{D^F_{\mathrm{eff}}(Z)\big\} = 0, \qquad (11.21)
\]
\[
\mathcal{Q}\big\{D^F_{\mathrm{eff}}(Z)\big\} = \Pi\big[(I - \mathcal{M})\big\{D^F_{\mathrm{eff}}(Z)\big\}\,\big|\,\Lambda^F\big].
\]
We first argue that the solution, hF (Z) ∈ HF , to equation (11.23) exists and
is unique and then argue that this solution hF (Z) must equal Deff F
                                                                    (Z).
    According to Lemma 10.5, the linear operator (I −Q)(·) will have a unique
inverse if we can show that the linear operator “Q” is a contraction mapping.
Also, if Q is a contraction mapping, then by Lemma 10.5, the unique inverse
is equal to
\[
(I - \mathcal{Q})^{-1} = \sum_{i=0}^{\infty} \mathcal{Q}^i.
\]
    To complete the proof, we must show that the unique solution $h^F(Z)$ to equation (11.23) or, equivalently, (11.22), is identical to $D^F_{\mathrm{eff}}(Z)$ satisfying (11.19) and (11.21).
    Clearly, any element $D^F_{\mathrm{eff}}(Z)$ satisfying (11.19) and (11.21) must satisfy (11.22). Conversely, since $S^F_{\mathrm{eff}}(Z) \in \Lambda^{F\perp}$, the solution $h^F(Z)$ of (11.22) must be such that $\Pi[\mathcal{M}\{h^F(Z)\}\,|\,\Lambda^F] = 0$ and $\Pi[h^F(Z)\,|\,\Lambda^{F\perp}] = S^F_{\mathrm{eff}}(Z)$; that is, $h^F(Z)$ satisfies (11.21) and (11.19), respectively. This completes the proof that $D^F_{\mathrm{eff}}(Z)$ exists and is the unique element satisfying equations (11.19) and (11.21) or, equivalently, that $B^F_{\mathrm{eff}}(Z) = \mathcal{M}\{D^F_{\mathrm{eff}}(Z)\}$ exists and is the unique solution to (11.17).
    In Lemma 10.5, we showed that the solution $D^F_{\mathrm{eff}}(Z)$ can be obtained by successive approximation; that is,
\[
D^{(i+1)}(Z) = \Pi\big[(I - \mathcal{M})D^{(i)}(Z)\,\big|\,\Lambda^F\big] + S^F_{\mathrm{eff}}(Z), \qquad (11.25)
\]
and
\[
D^{(i)}(Z) \rightarrow D^F_{\mathrm{eff}}(Z) \ \text{as}\ i \rightarrow \infty.
\]
    If we define
\[
B^{(i)}(Z) = \Pi\big[\mathcal{M}\{D^{(i)}(Z)\}\,\big|\,\Lambda^{F\perp}\big],
\]
where, by construction, $B^{(i)}(Z) \in \Lambda^{F\perp}$, then
\[
B^{(i)}(Z) = \mathcal{M}\{D^{(i)}(Z)\} - \Pi\big[\mathcal{M}\{D^{(i)}(Z)\}\,\big|\,\Lambda^F\big] \;\rightarrow\; B^F_{\mathrm{eff}}(Z) \ \text{as}\ i \rightarrow \infty, \qquad (11.26)
\]
since $\mathcal{M}\{D^{(i)}(Z)\} \rightarrow \mathcal{M}\{D^F_{\mathrm{eff}}(Z)\} = B^F_{\mathrm{eff}}(Z)$ and $\Pi\big[\mathcal{M}\{D^{(i)}(Z)\}\,\big|\,\Lambda^F\big] \rightarrow \Pi\big[B^F_{\mathrm{eff}}(Z)\,\big|\,\Lambda^F\big] = 0$.
and
\[
K_r\{G_r(Z)\} = P(C \geq r + 1 \,|\, Z) = \prod_{j=1}^{r}\big[1 - \lambda_j\{G_j(Z)\}\big].
\]
\[
a^F(Z) = \mathcal{M}^{-1}\{h^F(Z)\} = \frac{h^F(Z)}{\varpi(\infty, Z)} - \sum_{r \neq \infty} \frac{\lambda_r}{K_r}\,E\{h^F(Z)\,|\,G_r(Z)\}, \qquad (11.27)
\]
where we use the shorthand notation $\lambda_r = \lambda_r\{G_r(Z)\}$ and $K_r = K_r\{G_r(Z)\}$. An equivalent representation is also given by
\[
\mathcal{M}^{-1}\{h^F(Z)\} = h^F(Z) + \sum_{r \neq \infty} \frac{\lambda_r}{K_r}\Big[h^F(Z) - E\{h^F(Z)\,|\,G_r(Z)\}\Big]. \qquad (11.28)
\]
Proof. In Theorem 10.6, we showed that M−1 exists and is uniquely defined.
Therefore, we only need to show that M{aF (Z)} = hF (Z), where aF (Z) is
defined by (11.27) and
\[
\mathcal{M}(a^F) = \varpi(\infty, Z)\,a^F + \sum_{r \neq \infty} \varpi\{r, G_r(Z)\}\,E(a^F\,|\,G_r).
\]
We note that
\[
\frac{\lambda_r}{K_r} = \left(\frac{1}{K_r} - \frac{1}{K_{r-1}}\right) \qquad (11.29)
\]
and
\[
\varpi(\infty, Z) = K_{\ell},
\]
where $\ell$ denotes the number of coarsening levels; i.e., $\ell$ denotes the largest value of C less than ∞. After substituting $(1/K_r - 1/K_{r-1})$ for $\lambda_r/K_r$ and $K_\ell$ for $\varpi(\infty, Z)$ in (11.27) and rearranging terms, we obtain
\[
a^F = \sum_{r} \frac{1}{K_{r-1}}\Big[E(h^F\,|\,G_r) - E(h^F\,|\,G_{r-1})\Big], \qquad (11.30)
\]
and applying $\mathcal{M}$ to this telescoping representation yields
\[
\mathcal{M}(a^F) = E(h^F\,|\,G_\infty) - E(h^F\,|\,G_0) = h^F.
\]
or
where λC̃ {r, X̄(r)} was defined in (9.30) and Kr {X̄(r)} was defined in (9.33).
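The closed form (11.28) is easy to verify numerically in a discrete setting. In the sketch below (a toy example; the joint pmf, the single dropout hazard λ1, and hF are all invented values), level C = 1 reveals only G1(Z) = Z1 and C = ∞ reveals Z = (Z1, Z2), and we check that aF from (11.28) satisfies M{aF} = hF.

    import numpy as np
    import itertools

    support = list(itertools.product([0, 1], [0, 1]))        # Z = (Z1, Z2)
    p = np.array([0.25, 0.15, 0.35, 0.25])                    # assumed joint pmf

    lam = np.array([0.3 if z[0] == 0 else 0.5 for z in support])  # lambda_1{G_1(Z)} = P(C=1 | Z)
    K1 = 1.0 - lam                                             # K_1 = P(C > 1 | Z) = varpi(infinity, Z)

    def cond_exp_z1(h):
        """E{h(Z) | Z1} evaluated at each support point."""
        out = np.empty(4)
        for i, z in enumerate(support):
            idx = [j for j, zz in enumerate(support) if zz[0] == z[0]]
            out[i] = np.dot(h[idx], p[idx]) / p[idx].sum()
        return out

    h = np.array([1.0, -1.0, 2.0, 0.5])                        # arbitrary h^F on the support

    # Equation (11.28): a^F = h^F + (lambda_1 / K_1) [h^F - E{h^F | G_1(Z)}]
    a = h + (lam / K1) * (h - cond_exp_z1(h))

    # M{a}(Z) = varpi(infinity, Z) a(Z) + varpi{1, G_1(Z)} E{a | G_1(Z)}, with varpi{1, .} = lambda_1
    Ma = K1 * a + lam * cond_exp_z1(a)

    print(np.allclose(Ma, h))                                  # True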
\[
D^{(i)}(Z, \beta, \hat\psi_n, \hat\xi^*_n) = \Pi\big[(I - \mathcal{M})D^{(i-1)}(Z, \beta, \hat\psi_n, \hat\xi^*_n)\,\big|\,\Lambda^F\big] + m^{(0)}(Z, \beta), \qquad (11.34)
\]
where we index D(·) by ψ and ξ to make clear that we need these parameters
when computing the projection Π[(·)|ΛF ] and the linear operator M(·). After,
say, j iterations, we compute
where dF (Z) = M−1 {m(Z, β)}. Since D(j) (Z, β, ψ̂n , ξˆn∗ ) is an approximation
to M−1 {m(Z, β, ψ̂n , ξˆn∗ )}, we propose the following adaptive estimating equa-
tion:
\[
\begin{aligned}
\sum_{i=1}^{n}\Bigg[\frac{I(C_i = \infty)\,m(Z_i, \beta, \hat\psi_n, \hat\xi^*_n)}{\varpi(\infty, Z_i, \hat\psi_n)}
&- \frac{I(C_i = \infty)}{\varpi(\infty, Z_i, \hat\psi_n)}\Bigg\{\sum_{r \neq \infty} \varpi\{r, G_r(Z_i), \hat\psi_n\}\,E\big\{D^{(j)}(Z, \beta, \hat\psi_n, \hat\xi^*_n)\,\big|\,G_r(Z_i), \hat\xi^*_n\big\}\Bigg\}\\
&+ \sum_{r \neq \infty} I(C_i = r)\,E\big\{D^{(j)}(Z, \beta, \hat\psi_n, \hat\xi^*_n)\,\big|\,G_r(Z_i), \hat\xi^*_n\big\}\Bigg] = 0. \qquad (11.37)
\end{aligned}
\]
m(Z, β0 , ψ ∗ , ξ ∗ ) ∈ ΛF ⊥
and
\[
-\,\frac{I(C = \infty)}{\varpi(\infty, Z, \psi_0)}\Bigg[\sum_{r \neq \infty} \varpi\{r, G_r(Z), \psi_0\}\,E\big\{D^{(j)}(Z, \beta, \psi_0, \xi^*)\,\big|\,G_r(Z), \xi^*\big\}\Bigg]
+ \sum_{r \neq \infty} I(C = r)\,E\big\{D^{(j)}(Z, \beta, \psi_0, \xi^*)\,\big|\,G_r(Z), \xi^*\big\} \;\in\; \Lambda_2
\]
                                    Z = (Y, X),
and the model assumes that E(Y |X) = µ(X, β), or, equivalently,
where V(X) = var(Y|X). The full-data efficient score was given by (4.53),
\[
S^F_{\mathrm{eff}}(Z) = D^T(X)\,V^{-1}(X)\,\varepsilon, \qquad (11.39)
\]
where
\[
D(X) = \frac{\partial \mu(X, \beta)}{\partial \beta^T}.
\]
    Suppose the data are coarsened at random with a monotone coarsening
pattern. An example with monotone missing longitudinal data was given in
Example 1 in Section 9.2 and also studied further in Section 10.3, where
double-robust estimators were proposed. The question is, how do we go about
finding a locally efficient estimator for β with a sample of monotonically coars-
ened data {Ci , GCi (Zi )}, i = 1, . . . , n?
    We first develop a model for the coarsening probabilities in terms of
a parameter ψ. Because in this example we are assuming that coarsening
is monotone, it is convenient to develop models for the discrete hazards λr{Gr(Z), ψ}, r ≠ ∞, and obtain estimators for ψ by maximizing (8.12).
We denote these estimators by ψ̂n .
    We also posit a simpler, possibly incorrect model for the full data Z =
(Y, X), Z ∼ p∗Z (z, ξ), and obtain an estimator for ξ, say ξˆn∗ , by maximizing
the observed-data likelihood; see, for example, (10.58),
\[
\prod_{i=1}^{n} p^*_{G_{r_i}(Z_i)}(g_{r_i}, \xi).
\]
This model also gives us an estimate for var(Y |X, ξˆn∗ ) = V (X, ξˆn∗ ).
   If the data are coarsened at random, then by (11.16) and Theorem 11.1,
the efficient observed-data score is given by
\[
\frac{I(C = \infty)B^F_{\mathrm{eff}}(Z, \beta, \psi, \xi)}{\varpi(\infty, Z, \psi)} - \Pi\left[\frac{I(C = \infty)B^F_{\mathrm{eff}}(Z, \beta, \psi, \xi)}{\varpi(\infty, Z, \psi)}\,\Big|\,\Lambda_2\right], \qquad (11.40)
\]
where $B^F_{\mathrm{eff}}(Z, \beta, \psi, \xi) \in \Lambda^{F\perp}$ must satisfy
                                                
\[
\Pi\big[\mathcal{M}^{-1}\{B^F_{\mathrm{eff}}(Z, \beta, \psi, \xi)\}\,\big|\,\Lambda^{F\perp}\big] = S^F_{\mathrm{eff}}(Z, \beta, \xi).
\]
For the restricted moment model, $B^F_{\mathrm{eff}}(Z, \beta, \psi, \xi) = A(X, \beta, \psi, \xi)\,\varepsilon(\beta)$, where $\varepsilon(\beta) = Y - \mu(X, \beta)$, and the matrix $A(X, \beta, \psi, \xi)$ is obtained by solving the equation
\[
\Pi\big[\mathcal{M}^{-1}\{A(X, \beta, \psi, \xi)\varepsilon(\beta)\}\,\big|\,\Lambda^{F\perp}\big] = S^F_{\mathrm{eff}}(Z, \beta, \xi),
\]
which by (11.38) and (11.39) is equal to
\[
E\big[\mathcal{M}^{-1}\{A(X, \beta, \psi, \xi)\varepsilon(\beta)\}\,\varepsilon^T(\beta)\,\big|\,X, \xi\big]\,V^{-1}(X, \xi)\,\varepsilon(\beta)
= D^T(X, \beta)\,V^{-1}(X, \xi)\,\varepsilon(\beta),
\]
or, equivalently,
\[
E\big[\mathcal{M}^{-1}\{A(X, \beta, \psi, \xi)\varepsilon(\beta)\}\,\varepsilon^T(\beta)\,\big|\,X, \xi\big] = D^T(X, \beta). \qquad (11.41)
\]
  If, in addition, the coarsening is monotone, then using the results from
Theorem 11.2, equation (11.27), we obtain
\[
\mathcal{M}^{-1}\{A(X, \beta, \psi, \xi)\varepsilon(\beta)\} = \frac{A(X, \beta, \psi, \xi)\varepsilon(\beta)}{\varpi(\infty, Y, X, \psi)}
- \sum_{r \neq \infty} \frac{\lambda_r\{G_r(Y, X), \psi\}}{K_r\{G_r(Y, X), \psi\}}\,E\big\{A(X, \beta, \psi, \xi)\varepsilon(\beta)\,\big|\,G_r(Y, X), \xi\big\}.
\]
Substituting this expression into the left-hand side of (11.41) and rearranging, we obtain
\[
\begin{aligned}
A(X, \beta, \psi, \xi)\,&E\left\{\frac{\varepsilon(\beta)\varepsilon^T(\beta)}{\varpi(\infty, Y, X, \psi)}\,\Big|\,X, \xi\right\}\\
&- \sum_{r \neq \infty} E\left[\frac{\lambda_r\{G_r(Y, X), \psi\}}{K_r\{G_r(Y, X), \psi\}}\,E\big\{A(X, \beta, \psi, \xi)\varepsilon(\beta)\,\big|\,G_r(Y, X), \xi\big\}\,\varepsilon^T(\beta)\,\Big|\,X, \xi\right]
= D^T(X, \beta). \qquad (11.42)
\end{aligned}
\]
Remarks
    For instance, this was the case in Example 1 of Section 9.2, which was
    further developed in Section 10.3, where the responses Y = (Y1 , . . . , Yl )T
    were longitudinal data intended to be measured at times t1 < . . . < tl but
    were missing for some subjects in the study in a monotone fashion due to
    patient dropout. For this example, the covariate X (treatment assignment)
    was always observed but some of the longitudinal measurements that made
    up Y were missing. The coarsening was described as Gr (Z) = (X, Y r ),
    where Y r = (Y1 , . . . , Yr )T , r = 1, . . . , l − 1. Equation (11.42) can now be
    written as
                                            
                         ε(β)εT (β)
          A(X, β, ψ, ξ)E                X, ξ
                        (∞, Y, X, ψ)
                         λr (X, Y r , ψ) '                   ( T
                                                                            
        −A(X, β, ψ, ξ)  E                  E   ε(β) X, Y r
                                                           , ξ  ε  (β) X, ξ
                           Kr (X, Y r , ψ)
                          r=∞
                          = DT (X, β).
     Therefore, the solution is given by
\[
A(X, \beta, \psi, \xi) = D^T(X, \beta)\,\tilde{V}^{-1}(X, \beta, \psi, \xi),
\]
     where
\[
\begin{aligned}
\tilde{V}(X, \beta, \psi, \xi) ={}& E\left\{\frac{\varepsilon(\beta)\varepsilon^T(\beta)}{\varpi(\infty, Y, X, \psi)}\,\Big|\,X, \xi\right\}\\
&- \sum_{r \neq \infty} E\left[\frac{\lambda_r(X, \bar{Y}_r, \psi)}{K_r(X, \bar{Y}_r, \psi)}\,E\big\{\varepsilon(\beta)\,\big|\,X, \bar{Y}_r, \xi\big\}\,\varepsilon^T(\beta)\,\Big|\,X, \xi\right]
\end{aligned}
\]
     (a small numerical sketch of this construction is given after these remarks).
 (ii) Except for special cases, such as the example above, the equation for
      solving A(X) in (11.42) is generally a complicated integral equation. Ap-
      proximate methods for solving such integral equations are given in Kress
      (1989). However, these computations may be so difficult as not to be fea-
      sible in practice.
(iii) In Chapter 12, we will give some approximate methods for obtaining im-
      proved estimators that although not locally efficient do have increased
      efficiency and are easier to implement.   
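As mentioned in remark (i), here is a small numerical sketch of the construction of Ṽ(X) and A(X) = DT(X)Ṽ−1(X) for a two-time-point version of the longitudinal example. Everything in it is an assumption made only for illustration: the posited bivariate normal working model for Y given X, the logistic dropout hazard λ1(X, Y1), the parameter values, and the use of Monte Carlo to evaluate the conditional expectations.

    import numpy as np

    rng = np.random.default_rng(1)

    beta = np.array([1.0, 0.5, 0.8])                 # working values of the regression parameters
    Sigma = np.array([[1.0, 0.6],
                      [0.6, 1.0]])                   # posited covariance of Y = (Y1, Y2) given X

    def mu(x):                                       # mu(X, beta) at the two measurement times
        return np.array([beta[0] + beta[1] * x, beta[0] + beta[2] * x])

    def D(x):                                        # D(X) = d mu / d beta^T  (2 x 3)
        return np.array([[1.0, x, 0.0],
                         [1.0, 0.0, x]])

    def lam1(x, y1):                                 # assumed dropout hazard: P(C = 1 | Z)
        return 1.0 / (1.0 + np.exp(-(-1.0 + 0.7 * (y1 - mu(x)[0]))))

    def V_tilde(x, draws=200_000):
        """Monte Carlo evaluation of V-tilde(X) under the posited model."""
        Y = rng.multivariate_normal(mu(x), Sigma, size=draws)
        eps = Y - mu(x)
        lam = lam1(x, Y[:, 0])
        w_inf = 1.0 - lam                            # varpi(infinity, Y, X) = K_1
        # E{eps | X, Y1} under the posited bivariate normal working model
        e_given_y1 = np.column_stack([eps[:, 0], (Sigma[0, 1] / Sigma[0, 0]) * eps[:, 0]])
        term1 = (eps[:, :, None] * eps[:, None, :] / w_inf[:, None, None]).mean(axis=0)
        term2 = ((lam / w_inf)[:, None, None] * e_given_y1[:, :, None] * eps[:, None, :]).mean(axis=0)
        return term1 - term2

    for x in (0.0, 1.0):
        A = D(x).T @ np.linalg.inv(V_tilde(x))       # A(X) = D^T(X) V-tilde^{-1}(X)  (3 x 2)
        print(x, np.round(A, 3))

With A(X) in hand, A(X)ε(β) would be used as the full-data estimating function in the locally efficient procedure; in practice the posited model and the dropout hazards would first be fitted to the observed data.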
    Suppose we were able to overcome these numerical difficulties and ob-
tain an approximate solution for A(X, β, ψ, ξ). Denoting this solution by
Aimp (X, β, ψ̂n , ξˆn∗ ) and going back to (11.40), we approximate the efficient
score by
\[
\frac{I(C = \infty)\,A_{\mathrm{imp}}(X, \beta, \hat\psi_n, \hat\xi^*_n)\{Y - \mu(X, \beta)\}}{\varpi(\infty, Y, X, \hat\psi_n)}
- \Pi\left[\frac{I(C = \infty)\,A_{\mathrm{imp}}(X, \beta, \hat\psi_n, \hat\xi^*_n)\{Y - \mu(X, \beta)\}}{\varpi(\infty, Y, X, \hat\psi_n)}\,\Big|\,\Lambda_2\right]. \qquad (11.43)
\]
Because the coarsening is monotone, we can use (10.55) to estimate the pro-
jection onto Λ2 by −L2 {C, GC (Y, X), β, ψ̂n , ξˆn∗ }, where
as m(Zi , β), where ψ ∗ and ξ ∗ are the limits (in probability) of ψ̂n and ξˆn∗ , would
result in an asymptotically equivalent estimator for β. What is important to
note here is that
regardless of what the converging values ψ ∗ and ξ ∗ are, and therefore the
solution to (11.45) is an example of an AIPWCC estimator for β.
    Because of (11.47), the estimator given as the solution to (11.45) with L2 (·)
computed by (11.44) is an example of an improved estimator as described in
Section 10.4. As such, this estimator is a double robust estimator in the sense
that it will be consistent and asymptotically normal if either the coarsening
model or the model for the posited marginal distribution of Z is correctly
specified. This double-robustness property holds regardless of whether
or not.
   Finally, if both models are correctly specified, and if
                                 F
full-data estimating function, Beff (Z) ∈ ΛF ⊥ , as well as the optimal augmen-
tation. It is not clear whether the efficiency gains of such an estimator would
make such a complicated procedure attractive for practical use.
    Because of the complexity of these methods, we offer in the next chapter
some simpler methods for gaining efficiency that are easier to implement.
These methods will not generally result in locally efficient estimators, but
they are, however, more feasible.
       where
\[
\mathcal{J}\{h^F(Z)\} = \frac{I(C = \infty)h^F(Z)}{\varpi(\infty, Z)} - \Pi\left[\frac{I(C = \infty)h^F(Z)}{\varpi(\infty, Z)}\,\Big|\,\Lambda_2\right],
\]
       $B^F_{\mathrm{eff}}(Z)$ is the unique element in $\Lambda^{F\perp}$ that satisfies
\[
\Pi\big[\mathcal{M}^{-1}\{B^F_{\mathrm{eff}}(Z)\}\,\big|\,\Lambda^{F\perp}\big] = S^F_{\mathrm{eff}}(Z),
\]
       and $S^F_{\mathrm{eff}}(Z)$ is the full-data efficient score vector.
•     We can derive $B^F_{\mathrm{eff}}(Z)$ by solving the equation
\[
(I - \mathcal{Q})\big[\mathcal{M}^{-1}\{B^F_{\mathrm{eff}}(Z)\}\big] = S^F_{\mathrm{eff}}(Z),
\]
\[
D^{(i+1)}(Z) = \Pi\big[(I - \mathcal{M})D^{(i)}(Z)\,\big|\,\Lambda^F\big] + S^F_{\mathrm{eff}}(Z), \qquad (11.48)
\]
       and
\[
D^{(i)}(Z) \rightarrow D^F_{\mathrm{eff}}(Z) \ \text{as}\ i \rightarrow \infty.
\]
   If we define
\[
B^{(i)}(Z) = \Pi\big[\mathcal{M}\{D^{(i)}(Z)\}\,\big|\,\Lambda^{F\perp}\big],
\]
   where, by construction, $B^{(i)}(Z) \in \Lambda^{F\perp}$, then
   when the data are monotonically coarsened. Similarly, outline the steps
   necessary to obtain a locally efficient estimator for β if there are two levels
   of missingness.
12
Approximate Methods for Gaining Efficiency
     Rather than searching for the optimal AIPWCC estimator, which involves finding the optimal $L_{2\mathrm{eff}}(\cdot) \in \Lambda_2$ and the optimal $B^F_{\mathrm{eff}}(Z) \in \Lambda^{F\perp}$, we instead restrict the search to linear subspaces of $\Lambda_2 \subset \mathcal{H}$ and $\Lambda^{F\perp} \subset \mathcal{H}^F$. That is,
we will only consider AIPWCC estimators that are the solution to (12.1) for
m(Z, β0 ) = ϕ∗F (Z) ∈ G F and L2 (·) ∈ G2 , where G F is a q-replicating linear
subspace of ΛF ⊥ and G2 is a q-replicating linear subspace of Λ2 , where a
q-replicating linear space is defined by Definition 6 of Chapter 3.
We will always assume that the t1 elements of J F (Z) and the t2 elements
of J2 {C, GC (Z)} are linearly independent; that is, cT J F (Z) = 0 implies that
c = 0, where c is a t1 vector of constants (similarly for J2 {C, GC (Z)}). By con-
struction, the finite-dimensional linear spaces defined above are q-replicating
linear spaces.
    Therefore, we will restrict attention to estimators for β that are the solu-
tion to the estimating equation
\[
\sum_{i=1}^{n}\left[\frac{I(C_i = \infty)\,A^{F\,(q \times t_1)}\,m^*(Z_i, \beta)}{\varpi(\infty, Z_i, \hat\psi_n)} + L_2\{C_i, G_{C_i}(Z_i), \hat\psi_n\}\right] = 0, \qquad (12.3)
\]
\[
\Lambda^{\perp}_{\eta} = \frac{I(C = \infty)\Lambda^{F\perp}}{\varpi(\infty, Z)} \oplus \Lambda_2,
\]
and hence
\[
\Xi \subset \Lambda^{\perp}_{\eta}. \qquad (12.6)
\]
    The elements in the linear subspace Ξ are associated with estimating func-
tions that will lead to the restricted class of estimating equations (12.3) that
we are considering. However, in Theorem 9.1, we showed that substituting the
MLE ψ̂n for the parameter ψ in an AIPWCC estimating equation resulted in
an influence function that subtracts off a projection onto the space Λψ , where
Λψ denotes the linear subspace spanned by the score vector
\[
S_\psi\{C, G_C(Z)\} = \frac{\partial \log \varpi\{C, G_C(Z), \psi_0\}}{\partial \psi}.
\]
    We remind the reader that when we introduce a coarsening model with
parameter ψ, the nuisance tangent space is given by Λ = Λη ⊕ Λψ . Since
estimating functions need to be associated with elements in Λ⊥ , we should
only consider elements of Ξ that are also orthogonal to Λψ ; i.e., Π[Ξ|Λ⊥   ψ ].
Consequently, it will prove desirable that the space G2 ⊂ Λ2 also contain Λψ .
For the restricted (class 2) estimators where G2 = Λ2, this is automatically true; however, for the restricted (class 1) estimators, we will always include
the elements Sψ {C, GC (Z)} that span Λψ as part of the vector J2 {C, GC (Z)}
that spans G2 to ensure that Λψ ⊂ G2 .
    Because the variance of an RAL estimator for β is the variance of its
influence function, when we consider finding the optimal estimator within the
restricted class of estimators (12.3), then we are looking for the estimator
whose influence function has the smallest variance matrix. Recall that an
influence function in addition to being orthogonal to the nuisance tangent
space must also satisfy the property that
where Sβ {C, GC (Z)} is the observed-data score vector with respect to β, which
also equals the conditional expectation of the full-data score vector with re-
spect to β given the observed data (i.e., Sβ {C, GC (Z)} = E{SβF (Z)|C, GC (Z)}),
and ϕ{C, GC (Z)} denotes an influence function. One can always normalize any
element $\varphi^*\{C, G_C(Z)\} \in \Xi$ to ensure that it satisfies (12.7) by choosing
\[
\varphi\{C, G_C(Z)\} = \Big(E\big[\varphi^*\{C, G_C(Z)\}\,S^T_\beta\{C, G_C(Z)\}\big]\Big)^{-1}\varphi^*\{C, G_C(Z)\}. \qquad (12.8)
\]
IF (Ξ). In Theorem 12.1 below, we derive the optimal influence function within
IF (Ξ); i.e., the element within IF (Ξ) that has the smallest variance matrix.
Theorem 12.1. Among the elements ϕ{C, GC (Z)} ∈ IF (Ξ), the one with
the smallest variance matrix is given by the normalized version of the pro-
jection of Sβ {C, GC (Z)} onto Ξ. Specifically, if we define ϕ∗opt {C, GC (Z)} =
Π[Sβ {C, GC (Z)}|Ξ], then the element in IF (Ξ) with the smallest variance
matrix is given by
\[
\varphi_{\mathrm{opt}}\{C, G_C(Z)\} = \Big(E\big[\varphi^*_{\mathrm{opt}}\{C, G_C(Z)\}\,S^T_\beta\{C, G_C(Z)\}\big]\Big)^{-1}\varphi^*_{\mathrm{opt}}\{C, G_C(Z)\}. \qquad (12.9)
\]
and hence
\[
E\big[\varphi^*_{\mathrm{opt}}(\cdot)\{\varphi(\cdot) - \varphi_{\mathrm{opt}}(\cdot)\}^T\big] = 0.
\]
Premultiplying by the constant matrix $\big(E\{\varphi^*_{\mathrm{opt}}(\cdot)\,S^T_\beta(\cdot)\}\big)^{-1}$, we obtain
Consequently,
\[
\begin{aligned}
E\{\varphi(\cdot)\varphi^T(\cdot)\} &= E\big[\{\varphi_{\mathrm{opt}}(\cdot) + \varphi(\cdot) - \varphi_{\mathrm{opt}}(\cdot)\}\{\varphi_{\mathrm{opt}}(\cdot) + \varphi(\cdot) - \varphi_{\mathrm{opt}}(\cdot)\}^T\big]\\
&= E\{\varphi_{\mathrm{opt}}(\cdot)\varphi^T_{\mathrm{opt}}(\cdot)\} + E\big[\{\varphi(\cdot) - \varphi_{\mathrm{opt}}(\cdot)\}\{\varphi(\cdot) - \varphi_{\mathrm{opt}}(\cdot)\}^T\big] + 0.
\end{aligned}
\]
Corollary 1. The optimal element ϕopt {C, GC (Z)} ∈ IF (Ξ), derived in The-
orem 12.1, is orthogonal to Λψ and is an element of the space of observed-data
influence functions (IF ).
But since Sβ {C, GC (Z)} is orthogonal to Λψ (see equation (11.7)), this implies
that ϕ∗opt {C, GC (Z)} must be orthogonal to Λψ . Therefore ϕopt {C, GC (Z)},
defined by (12.9), is also orthogonal to Λψ . Hence,
where the last two relationships follow from (12.6) and (8.16), respectively.
Therefore, ϕopt {C, GC (Z)} is an element orthogonal to the observed-data nui-
sance tangent space and by construction (see (12.9)) satisfies (12.7). Hence,
ϕopt {C, GC (Z)} is an element of the space of observed-data influence functions
(IF ).  
Therefore, finding the projection onto this linear subspace of H is exactly the
same as Example 2 of Chapter 2, which was solved by using equation (2.2).
Applying this result to our problem, ϕ∗opt (·) of Theorem 12.1 is obtained by
(12.10)
   Before deriving the solution to equation (12.10), we first give some results
through a series of lemmas that will simplify the equation.
Lemma 12.1.
\[
E\left[S_\beta\{C, G_C(Z)\}\left\{\frac{I(C = \infty)\,m^{*T}(Z, \beta_0)\,A^{F\,T}}{\varpi(\infty, Z)}\right\}\right]
= E\big\{S^F_\beta(Z)\,m^{*T}(Z, \beta_0)\,A^{F\,T}\big\} \qquad (12.11)
\]
\[
= -\left[A^F\,E\left\{\frac{\partial m^*(Z, \beta_0)}{\partial \beta^T}\right\}\right]^T. \qquad (12.12)
\]
Proof. We prove (12.11) using a series of iterated conditional expectations,
where
\[
\begin{aligned}
E&\left[S_\beta\{C, G_C(Z)\}\left\{\frac{I(C = \infty)\,m^{*T}(Z, \beta_0)\,A^{F\,T}}{\varpi(\infty, Z)}\right\}\right]\\
&= E\left[E\big\{S^F_\beta(Z)\,\big|\,C, G_C(Z)\big\}\left\{\frac{I(C = \infty)\,m^{*T}(Z, \beta_0)\,A^{F\,T}}{\varpi(\infty, Z)}\right\}\right]\\
&= E\left[E\left\{S^F_\beta(Z)\,\frac{I(C = \infty)\,m^{*T}(Z, \beta_0)\,A^{F\,T}}{\varpi(\infty, Z)}\,\bigg|\,C, G_C(Z)\right\}\right]\\
&= E\left[S^F_\beta(Z)\,\frac{I(C = \infty)\,m^{*T}(Z, \beta_0)\,A^{F\,T}}{\varpi(\infty, Z)}\right]\\
&= E\left[E\left\{S^F_\beta(Z)\,\frac{I(C = \infty)\,m^{*T}(Z, \beta_0)\,A^{F\,T}}{\varpi(\infty, Z)}\,\bigg|\,Z\right\}\right]\\
&= E\left[S^F_\beta(Z)\,E\left\{\frac{I(C = \infty)\,m^{*T}(Z, \beta_0)\,A^{F\,T}}{\varpi(\infty, Z)}\,\bigg|\,Z\right\}\right]\\
&= E\big\{S^F_\beta(Z)\,m^{*T}(Z, \beta_0)\,A^{F\,T}\big\}.
\end{aligned}
\]
Equation (12.12) follows from the usual expansion for m-estimators, where
the influence function for the estimator that solves the equation
\[
\sum_{i=1}^{n} A^F m^*(Z_i, \beta) = 0
\]
The result in (12.12) now follows after taking the transpose of both sides of
the equation above.  
Lemma 12.2.
\[
E\big[S_\beta\{C, G_C(Z)\}\,J_2^T\{C, G_C(Z)\}\big] = 0^{q \times t_2}. \qquad (12.13)
\]
Lemma 12.3.
\[
E\left\{\frac{I(C = \infty)}{\varpi^2(\infty, Z)}\,m^*(Z, \beta_0)\,m^{*T}(Z, \beta_0)\right\}
= E\left\{\frac{m^*(Z, \beta_0)\,m^{*T}(Z, \beta_0)}{\varpi(\infty, Z)}\right\}. \qquad (12.14)
\]
       Using the results from Lemmas 12.1–12.3, we obtain the solution to (12.10) as
\[
[A^F_{\mathrm{opt}}, A_{2\mathrm{opt}}]\begin{pmatrix} U_{11} & U_{12}\\ U_{12}^T & U_{22}\end{pmatrix} = [H_1, H_2], \qquad (12.15)
\]
where
\[
\begin{aligned}
U_{11} &= E\left\{\frac{m^*(Z, \beta_0)\,m^{*T}(Z, \beta_0)}{\varpi(\infty, Z)}\right\}^{t_1 \times t_1},\\
U_{12} &= E\left[\frac{I(C = \infty)}{\varpi(\infty, Z)}\,m^*(Z, \beta_0)\,J_2^T\{C, G_C(Z)\}\right]^{t_1 \times t_2},\\
U_{22} &= E\big[J_2\{C, G_C(Z)\}\,J_2^T\{C, G_C(Z)\}\big]^{t_2 \times t_2},\\
H_1 &= -\,E\left\{\frac{\partial m^*(Z, \beta_0)}{\partial \beta^T}\right\}^{T\;(q \times t_1)},\\
H_2 &= 0^{q \times t_2}. \qquad (12.16)
\end{aligned}
\]
       Therefore, solving (12.15) yields
\[
[A^F_{\mathrm{opt}}, A_{2\mathrm{opt}}] = [H_1, 0]\begin{pmatrix} U_{11} & U_{12}\\ U_{12}^T & U_{22}\end{pmatrix}^{-1}
= [H_1, 0]\begin{pmatrix} U^{11} & U^{12}\\ U^{12\,T} & U^{22}\end{pmatrix}
= [H_1 U^{11},\; H_1 U^{12}].
\]
    Using standard results for the inverse of partitioned symmetric matrices
(see, for example, Rao, 1973, p. 33), we obtain
\[
A^F_{\mathrm{opt}} = H_1^{q \times t_1}\,U^{11\,(t_1 \times t_1)} \qquad (12.17)
\]
and
\[
A_{2\mathrm{opt}} = H_1^{q \times t_1}\,U^{12\,(t_1 \times t_2)}, \qquad (12.18)
\]
where
\[
U^{11} = \big(U_{11} - U_{12}U_{22}^{-1}U_{12}^T\big)^{-1} \qquad (12.19)
\]
and
\[
U^{12} = -\,U^{11}U_{12}U_{22}^{-1}. \qquad (12.20)
\]
    Thus we have shown that the optimal influence function ϕopt {C, GC (Z)}
in IF (Ξ), given by Theorem 12.1, is obtained by choosing ϕ∗ {C, GC (Z)} ∈ Ξ
to be
\[
\varphi^*_{\mathrm{opt}}\{C, G_C(Z)\} = \frac{I(C = \infty)\,A^F_{\mathrm{opt}}\,m^*(Z, \beta_0)}{\varpi(\infty, Z)} + A_{2\mathrm{opt}}\,J_2\{C, G_C(Z)\}, \qquad (12.21)
\]
where $A^F_{\mathrm{opt}}$ and $A_{2\mathrm{opt}}$ are defined by (12.17) and (12.18), respectively.
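The partitioned-inverse identities (12.19) and (12.20) can be checked numerically; the following sketch (random matrices, for illustration only) confirms that U^11 and U^12 computed from them agree with the corresponding blocks of the full inverse.

    import numpy as np

    rng = np.random.default_rng(2)
    t1, t2 = 3, 4

    # A random symmetric positive-definite matrix partitioned as in (12.15).
    G = rng.normal(size=(t1 + t2, t1 + t2))
    U = G @ G.T + (t1 + t2) * np.eye(t1 + t2)
    U11, U12, U22 = U[:t1, :t1], U[:t1, t1:], U[t1:, t1:]

    # Equations (12.19) and (12.20).
    U11_sup = np.linalg.inv(U11 - U12 @ np.linalg.inv(U22) @ U12.T)
    U12_sup = -U11_sup @ U12 @ np.linalg.inv(U22)

    Uinv = np.linalg.inv(U)
    print(np.allclose(U11_sup, Uinv[:t1, :t1]))      # True
    print(np.allclose(U12_sup, Uinv[:t1, t1:]))      # True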
  We also note the following interesting relationship.
Lemma 12.4. The projection of $\dfrac{I(C=\infty)A^F_{\mathrm{opt}}m^*(Z,\beta_0)}{\varpi(\infty,Z)}$ onto the space $G_2$ (i.e., the space spanned by $J_2\{C, G_C(Z)\}$) is equal to
\[
\Pi\left[\frac{I(C = \infty)\,A^F_{\mathrm{opt}}\,m^*(Z, \beta_0)}{\varpi(\infty, Z)}\,\Big|\,G_2\right] = -\,A_{2\mathrm{opt}}\,J_2\{C, G_C(Z)\}, \qquad (12.22)
\]
which implies that
\[
\varphi^*_{\mathrm{opt}}\{C, G_C(Z)\} = \frac{I(C = \infty)\,A^F_{\mathrm{opt}}\,m^*(Z, \beta_0)}{\varpi(\infty, Z)} + A_{2\mathrm{opt}}\,J_2\{C, G_C(Z)\}
\]
is orthogonal to $G_2$; that is,
which are all matrices whose elements are expectations. For practical appli-
cations, these must be estimated from the data. We propose the following
empirical averages:
\[
\hat{H}_1(\beta) = -\,n^{-1}\sum_{i=1}^{n}\frac{I(C_i = \infty)}{\varpi(\infty, Z_i, \hat\psi_n)}\left\{\frac{\partial m^*(Z_i, \beta)}{\partial \beta^T}\right\}^T, \qquad (12.24)
\]
\[
\hat{U}_{11}(\beta) = n^{-1}\sum_{i=1}^{n}\frac{I(C_i = \infty)}{\varpi^2(\infty, Z_i, \hat\psi_n)}\,m^*(Z_i, \beta)\,m^{*T}(Z_i, \beta), \qquad (12.25)
\]
\[
\hat{U}_{12}(\beta) = n^{-1}\sum_{i=1}^{n}\frac{I(C_i = \infty)}{\varpi(\infty, Z_i, \hat\psi_n)}\,m^*(Z_i, \beta)\,J_2^T\{C_i, G_{C_i}(Z_i), \hat\psi_n\}, \qquad (12.26)
\]
and
\[
\hat{U}_{22} = n^{-1}\sum_{i=1}^{n} J_2\{C_i, G_{C_i}(Z_i), \hat\psi_n\}\,J_2^T\{C_i, G_{C_i}(Z_i), \hat\psi_n\}. \qquad (12.27)
\]
where $\hat{U}^{11}(\beta)$ and $\hat{U}^{12}(\beta)$ are obtained by substituting the empirical estimates for $U_{11}$, $U_{12}$, and $U_{22}$ into equations (12.19) and (12.20).
Using a sample of observed data {Ci , GCi (Zi )}, i = 1, . . . , n, we propose esti-
mating β by solving the equation
\[
\sum_{i=1}^{n}\left[\frac{I(C_i = \infty)\,\hat{A}^F_{\mathrm{opt}}(\beta)\,m^*(Z_i, \beta)}{\varpi(\infty, Z_i, \hat\psi_n)} + \hat{A}_{2\mathrm{opt}}(\beta)\,J_2\{C_i, G_{C_i}(Z_i), \hat\psi_n\}\right] = 0. \qquad (12.29)
\]
We denote this estimator as β̂nopt and now prove the fundamental result for
restricted optimal estimators.
Theorem 12.2. Among the restricted class of AIPWCC estimators (12.3),
the optimal estimator (i.e., the estimator with the smallest asymptotic vari-
ance matrix) is given by β̂nopt , the solution to (12.29).
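To make Theorem 12.2 concrete, the sketch below computes the empirical quantities (12.24)–(12.27), forms the optimal coefficient matrices through (12.17)–(12.20), and solves (12.29) for a scalar mean β = E(Z2) with Z2 missing at random given an always-observed Z1. The data-generating mechanism, the choices m∗(Z, β) = Z2 − β and J2 = {1 − R/π(Z1)}(1, Z1)T, and the treatment of π as known (so that ψ̂n and the scores Sψ drop out) are all assumptions made only for this illustration.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 20_000

    # Simulated data (illustration only): Z1 always observed, Z2 MAR given Z1.
    Z1 = rng.normal(size=n)
    Z2 = 1.0 + 0.8 * Z1 + rng.normal(size=n)
    pi = 1.0 / (1.0 + np.exp(-(0.2 + 1.0 * Z1)))          # complete-case probability (known here)
    R = rng.binomial(1, pi)

    # Restricted class: m*(Z, beta) = Z2 - beta (q = t1 = 1);
    # G2 spanned by J2 = (1 - R/pi) * (1, Z1)^T (t2 = 2).
    J2 = (1.0 - R / pi)[:, None] * np.column_stack([np.ones(n), Z1])

    # Evaluate the U-hat matrices at a preliminary (simple IPWCC) estimate of beta.
    beta_prelim = (R / pi * Z2).sum() / (R / pi).sum()
    m = Z2 - beta_prelim                                   # m*(Z_i, beta_prelim)

    H1 = np.array([[np.mean(R / pi)]])                     # (12.24): note dm*/dbeta^T = -1
    U11 = np.array([[np.mean(R / pi**2 * m**2)]])          # (12.25)
    U12 = np.mean((R / pi * m)[:, None] * J2, axis=0)[None, :]        # (12.26), 1 x 2
    U22 = (J2[:, :, None] * J2[:, None, :]).mean(axis=0)              # (12.27), 2 x 2

    U11_sup = np.linalg.inv(U11 - U12 @ np.linalg.inv(U22) @ U12.T)   # (12.19)
    U12_sup = -U11_sup @ U12 @ np.linalg.inv(U22)                     # (12.20)
    A_F = (H1 @ U11_sup)[0, 0]                                        # (12.17), scalar here
    A_2 = (H1 @ U12_sup)[0]                                           # (12.18), length 2

    # Estimating equation (12.29): sum_i [ R_i A_F (Z2_i - beta)/pi_i + A_2 J2_i ] = 0,
    # which is linear in beta and so can be solved explicitly.
    beta_hat = ((R / pi * A_F * Z2).sum() + (J2 @ A_2).sum()) / (A_F * (R / pi).sum())

    print(beta_hat, Z2.mean(), Z2[R == 1].mean())

In practice one would also include the score for the missingness model in J2 and could iterate the evaluation of the Â matrices in β.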
    Before sketching out the proof of Theorem 12.2, we give another equivalent
representation for the class of influence functions IF (Ξ) defined by (12.8) that
will be useful.
Lemma 12.5. The class of influence functions IF(Ξ) can also be defined by
\[
\varphi\{C, G_C(Z)\} = -\left[A^F\,E\left\{\frac{\partial m^*(Z, \beta_0)}{\partial \beta^T}\right\}\right]^{-1}\varphi^*\{C, G_C(Z)\}, \qquad (12.30)
\]
Proof. Using (12.12) of Lemma 12.1 and Lemma 12.2, we can show that
\[
E\big[\varphi^*\{C, G_C(Z)\}\,S^T_\beta\{C, G_C(Z)\}\big] = -\,A^F\,E\left\{\frac{\partial m^*(Z, \beta_0)}{\partial \beta^T}\right\}. \qquad (12.31)
\]
The lemma now follows by substituting the right-hand side of (12.31) into (12.8). ∎
estimators. In the same manner that we found the influence function of the
estimator in (9.7) of Theorem 9.1, we can show that the influence function of
the estimator for β that solves (12.3) is given by
\[
\begin{aligned}
\varphi\{C, G_C(Z)\} = -\left[A^F\,E\left\{\frac{\partial m^*(Z, \beta_0)}{\partial \beta^T}\right\}\right]^{-1}
\Bigg[&\frac{I(C = \infty)\,A^F m^*(Z, \beta_0)}{\varpi(\infty, Z)} + A_2 J_2\{C, G_C(Z)\}\\
&- \Pi\big[\{\cdot\}\,\big|\,\Lambda_\psi\big]\Bigg]. \qquad (12.32)
\end{aligned}
\]
Because we constructed the space G2 so that Λψ ⊂ G2 , this implies that
Π[[·]|Λψ ] ∈ G2 , which in turn implies that (12.32) is an element of Ξ. Therefore,
as a consequence of Lemma 12.5, the influence function ϕ{C, GC (Z)} defined
above is an element of IF (Ξ). By Theorem 12.1, we know that
var[ϕopt{C, GC(Z)}] ≤ var[ϕ{C, GC(Z)}].
Hence, if we can show that the influence function of the estimator (12.29) is
equal to ϕopt {C, GC (Z)}, then we would complete the proof of the theorem.
    An expansion of the estimating equation in (12.29) about β = β0 , keeping
ψ̂n fixed, yields
\[
n^{1/2}(\hat\beta_{n\,opt} - \beta_0) = -\left[ A^F_{opt}\, E\left\{\frac{\partial m^*(Z,\beta_0)}{\partial\beta^T}\right\} \right]^{-1}
n^{-1/2}\sum_{i=1}^n \left[ \frac{I(C_i=\infty)\,\hat A^F_{opt}(\beta_0)\, m^*(Z_i,\beta_0)}{\varpi(\infty, Z_i, \hat\psi_n)}
+ \hat A_{2opt}(\beta_0)\, J_2\{C_i, G_{C_i}(Z_i), \hat\psi_n\} \right] + o_p(1). \qquad (12.33)
\]
We now show that estimating AF      opt and A2opt only has a negligible effect on
the asymptotic properties of the estimator by noting that (12.33) equals
\[
n^{-1/2}\sum_{i=1}^n \left[ \frac{I(C_i=\infty)\, A^F_{opt}\, m^*(Z_i,\beta_0)}{\varpi(\infty, Z_i, \hat\psi_n)}
+ A_{2opt}\, J_2\{C_i, G_{C_i}(Z_i), \hat\psi_n\} \right] \qquad (12.34)
\]
\[
+\; n^{1/2}\{\hat A^F_{opt}(\beta_0) - A^F_{opt}\}\, n^{-1}\sum_{i=1}^n \frac{I(C_i=\infty)\, m^*(Z_i,\beta_0)}{\varpi(\infty, Z_i, \hat\psi_n)} \qquad (12.35)
\]
\[
+\; n^{1/2}\{\hat A_{2opt}(\beta_0) - A_{2opt}\}\, n^{-1}\sum_{i=1}^n J_2\{C_i, G_{C_i}(Z_i), \hat\psi_n\}. \qquad (12.36)
\]
The terms n^{1/2}{ÂFopt(β0) − AFopt} and n^{1/2}{Â2opt(β0) − A2opt} are bounded in probability, whereas
\[
n^{-1}\sum_{i=1}^n \frac{I(C_i=\infty)\, m^*(Z_i,\beta_0)}{\varpi(\infty, Z_i, \hat\psi_n)} \xrightarrow{P} E\{m^*(Z,\beta_0)\} = 0
\]
and
\[
n^{-1}\sum_{i=1}^n J_2\{C_i, G_{C_i}(Z_i), \hat\psi_n\} \xrightarrow{P} E[J_2\{C, G_C(Z), \psi_0\}] = 0.
\]
Hence (12.35) and (12.36) will converge in probability to zero. Using Theorem
9.1, we can expand ψ̂n about ψ0 in (12.34) to obtain that (12.34) equals
\[
n^{-1/2}\sum_{i=1}^n \left( \frac{I(C_i=\infty)\, A^F_{opt}\, m^*(Z_i,\beta_0)}{\varpi(\infty, Z_i, \psi_0)}
+ A_{2opt}\, J_2\{C_i, G_{C_i}(Z_i), \psi_0\}
- \Pi\left[ \frac{I(C_i=\infty)\, A^F_{opt}\, m^*(Z_i,\beta_0)}{\varpi(\infty, Z_i, \psi_0)}
+ A_{2opt}\, J_2\{C_i, G_{C_i}(Z_i), \psi_0\} \,\Big|\, \Lambda_\psi \right] \right) + o_p(1). \qquad (12.37)
\]
Combining all the results from (12.33) through (12.37), we obtain that the
influence function of β̂nopt , the solution to (12.29), is given by
\[
-\left[ A^F_{opt}\, E\left\{\frac{\partial m^*(Z,\beta_0)}{\partial\beta^T}\right\} \right]^{-1}
\left( \varphi^*_{opt}\{C, G_C(Z)\} - \Pi[\varphi^*_{opt}\{C, G_C(Z)\}\,|\,\Lambda_\psi] \right), \qquad (12.38)
\]
where ϕ∗opt {C, GC (Z)} is defined by (12.21). We proved that ϕ∗opt {C, GC (Z)}
is orthogonal to Λψ in Corollary 1. We also demonstrated in (12.23) of Lemma
12.4 that ϕ∗opt {C, GC (Z)} is orthogonal to G2 . Since Λψ ⊂ G2 , this, too, implies
that
                           Π[ϕ∗opt {C, GC (Z)}|Λψ ] = 0.
Therefore, the influence function (12.38) is equal to
\[
\varphi_{opt}\{C, G_C(Z)\} = -\left[ A^F_{opt}\, E\left\{\frac{\partial m^*(Z,\beta_0)}{\partial\beta^T}\right\} \right]^{-1}
\varphi^*_{opt}\{C, G_C(Z)\}, \qquad (12.39)
\]
thus proving that the estimator that is the solution to (12.29) is the optimal
restricted estimator. 
                      
Using the matrix relationships (12.16) and (12.17), we note that the leading
term in (12.39) can be written as the symmetric matrix
\[
-\left[ A^F_{opt}\, E\left\{\frac{\partial m^*(Z,\beta_0)}{\partial\beta^T}\right\} \right]^{-1}
= \left( H_1 U^{11} H_1^T \right)^{-1}. \qquad (12.40)
\]
After a little algebra, we can also show that the covariance matrix of
ϕ∗opt {C, GC (Z)} is equal to
\[
E\left\{ \varphi^*_{opt}(\cdot)\, \varphi^{*T}_{opt}(\cdot) \right\} = H_1 U^{11} H_1^T. \qquad (12.41)
\]
Consequently, the asymptotic variance of β̂nopt can be estimated by {Ĥ1(β̂nopt) Û^{11}(β̂nopt) Ĥ1^T(β̂nopt)}^{-1},
where Ĥ1(β) and Û^{11}(β) were defined by (12.24) and (12.28).
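To make the computational recipe concrete, the following is a minimal numerical sketch (not taken from the book) of how the class 1 calculation might be organized in Python for two levels of missingness. The helper names m_star, dm_star, J2, and pi_hat are hypothetical placeholders for the problem-specific ingredients, and the coefficients ÂFopt(β) = Ĥ1(β)Û^{11}(β) and Â2opt(β) = Ĥ1(β)Û^{12}(β) are an assumed reading of (12.19) and (12.20) that is consistent with (12.40) and the variance estimate above.

```python
import numpy as np
from scipy.optimize import root

# Hypothetical helper signatures (not from the book):
#   m_star(z, beta)   -> (t1,) full-data estimating function m*(Z, beta)
#   dm_star(z, beta)  -> (t1, q) derivative  d m*(Z, beta) / d beta^T
#   J2(r, obs, psi)   -> (t2,) basis of the augmentation space evaluated at psi_hat
#   pi_hat(obs)       -> estimated probability of a complete case
# R[i] = 1 indicates a complete case (C_i = infinity); Z[i] holds the full data
# and is used only when R[i] = 1.

def empirical_matrices(beta, Z, R, obs, pi_hat, m_star, dm_star, J2, psi_hat):
    """Compute H1_hat, U11_hat, U12_hat, U22_hat of (12.24)-(12.27)."""
    n, q = len(R), len(beta)
    cc = next(i for i in range(n) if R[i] == 1)
    t1 = len(m_star(Z[cc], beta))
    t2 = len(J2(R[cc], obs[cc], psi_hat))
    H1 = np.zeros((q, t1))
    U11, U12, U22 = np.zeros((t1, t1)), np.zeros((t1, t2)), np.zeros((t2, t2))
    for i in range(n):
        j2 = J2(R[i], obs[i], psi_hat)
        U22 += np.outer(j2, j2) / n
        if R[i] == 1:
            w = 1.0 / pi_hat(obs[i])
            m = m_star(Z[i], beta)
            H1 += -w * dm_star(Z[i], beta).T / n      # (12.24), as a q x t1 matrix
            U11 += w ** 2 * np.outer(m, m) / n        # (12.25)
            U12 += w * np.outer(m, j2) / n            # (12.26)
    return H1, U11, U12, U22

def optimal_coefficients(H1, U11, U12, U22):
    """Assumed form A^F_opt = H1 U^{11}, A_2opt = H1 U^{12},
    with U^{11}, U^{12} the corresponding blocks of the inverse of the stacked U."""
    Uinv = np.linalg.inv(np.block([[U11, U12], [U12.T, U22]]))
    t1 = U11.shape[0]
    return H1 @ Uinv[:t1, :t1], H1 @ Uinv[:t1, t1:]

def estimating_equation(beta, Z, R, obs, pi_hat, m_star, dm_star, J2, psi_hat):
    """Left-hand side of the estimating equation (12.29)."""
    H1, U11, U12, U22 = empirical_matrices(beta, Z, R, obs, pi_hat,
                                           m_star, dm_star, J2, psi_hat)
    AF, A2 = optimal_coefficients(H1, U11, U12, U22)
    total = np.zeros(len(beta))
    for i in range(len(R)):
        total += A2 @ J2(R[i], obs[i], psi_hat)
        if R[i] == 1:
            total += AF @ m_star(Z[i], beta) / pi_hat(obs[i])
    return total

# beta_hat = root(estimating_equation, beta_init, args=(Z, R, obs, pi_hat,
#                 m_star, dm_star, J2, psi_hat)).x
# The sandwich variance estimate is inv(H1 @ U^{11} @ H1.T) / n at beta_hat.
```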
Remark 4. The method we proposed for estimating the parameter β using re-
stricted optimal (class 1) estimators only needs that the model for the coars-
ening probabilities be correctly specified. The appeal of this method is its
simplicity. We did not have to use adaptive methods, where a simpler model
p∗Z (z, ξ) had to be posited and an estimator for ξ had to be derived. Yet
the resulting estimator is guaranteed to have the smallest asymptotic vari-
ance within the class of estimators considered. It would certainly be more
efficient than the simple inverse probability weighted complete-case estima-
tor (IPWCC), which discards information from data that are not completely
observed. However, this estimator is not efficient. How close the variance of
such an estimator will be to the semiparametric efficiency bound will depend
on how close the optimal element Beff^F(Z) ∈ Λ^{F⊥} is to the subspace G^F and
how close the optimal element in Λ2, namely Π[{I(C = ∞)Beff^F(Z)/ϖ(∞, Z)} | Λ2], is to G2.
    If the missingness were by design, then restricted optimal (class 1) esti-
mators would be guaranteed (subject to regularity conditions) to yield con-
sistent, asymptotically normal estimators for β. However, if the coarsening
probabilities are modeled and not correctly specified, then such estimators
will be biased. There is no double-robustness protection guaranteed for such
estimators. Therefore, in the next section, we consider what we refer to as
(class 2) estimators. These estimators, although more complicated, will re-
sult in double-robust estimators that avoid the necessity to solve complicated
integral equations. 
where D(X, β) = ∂µ(X, β)/∂β^T. Therefore, for this example, we would choose m*(Y, X, β) as given in (12.43).
   Finding the optimal estimator within this restricted class and deriving
the asymptotic variance are now just a matter of plugging into the formulas
derived in the previous section.
   Taking the partial derivative of (12.43) with respect to β yields
\[
E\left\{\frac{\partial m^*(Y, X, \beta_0)}{\partial\beta^T}\right\}
= -f^F(X)\exp(\beta_0^T X^*)\, D(X, \beta_0)
= -f^F(X)\exp(2\beta_0^T X^*)\, X^{*T}.
\]
\[
\hat U_{12}(\beta) = n^{-1}\sum_{i=1}^n \frac{R_i\{R_i - \pi(Y_i, X_i^{(1)})\}}{\pi(Y_i, X_i^{(1)})}\,
\{Y_i - \exp(\beta^T X_i^*)\}\, f^F(X_i)\, f_2^T(Y_i, X_i^{(1)}), \qquad (12.48)
\]
and
\[
\hat U_{22} = n^{-1}\sum_{i=1}^n \{R_i - \pi(Y_i, X_i^{(1)})\}^2\, f_2(Y_i, X_i^{(1)})\, f_2^T(Y_i, X_i^{(1)}). \qquad (12.49)
\]
and the estimating equation (12.29) for this example becomes
\[
\sum_{i=1}^n \left[ \frac{R_i\, \hat A^F_{opt}(\beta)\, \{Y_i - \exp(\beta^T X_i^*)\}\, f^F(X_i)}{\pi(Y_i, X_i^{(1)})}
+ \{R_i - \pi(Y_i, X_i^{(1)})\}\, \hat A_{2opt}(\beta)\, f_2(Y_i, X_i^{(1)}) \right] = 0, \qquad (12.54)
\]
which we denote by β̂nopt . Moreover, the asymptotic variance for β̂nopt can
be estimated using
\[
\left\{ \hat H_1(\hat\beta_{n\,opt})\, \hat U^{11}(\hat\beta_{n\,opt})\, \hat H_1^T(\hat\beta_{n\,opt}) \right\}^{-1},
\]
where Ĥ1(β) and Û^{11}(β) were defined by (12.46) and (12.50), respectively.
In the example above, it was assumed that the missing values of X (2) were by
design, where the investigator had control of the missingness probabilities. We
now consider how the methods would be modified if these probabilities were
not known and had to be estimated from the data. We might, for instance,
consider the logistic regression model where
\[
P(R = 1 \mid Y, X^{(1)}) = \pi(Y, X^{(1)}, \psi) = \frac{\exp(\psi_0 + \psi_1 Y + \psi_2 X^{(1)})}{1 + \exp(\psi_0 + \psi_1 Y + \psi_2 X^{(1)})}, \qquad (12.55)
\]
and the resulting MLE is denoted by ψ̂n = (ψ̂0n, ψ̂1n, ψ̂2n)^T. Using standard
results for the logistic regression model, we note that the score vector Sψ(·)
is given by {R − π(Y, X^(1), ψ)}(1, Y, X^(1))^T.
Remark 5. We point out that the second term in the estimating equation
(12.54) (i.e., the augmented term) can be written as
\[
\hat A_{2opt} \sum_{i=1}^n f_2(Y_i, X_i^{(1)})\, \{R_i - \pi(Y_i, X_i^{(1)}, \hat\psi_n)\}.
\]
Since the estimating equations used to obtain ψ̂n are Σ_{i=1}^n Sψ(Ri, Yi, Xi^(1), ψ̂n) = 0,
then as a consequence of (12.57), the first three elements of the vector (12.58) are equal to zero. This
observation may result in some modest savings in computation.
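As an illustration of Remark 5, the following hedged sketch (simulated data, not from the book) fits the missingness model (12.55) by maximum likelihood and checks numerically that the components of the augmentation sum Σ_i f2(Yi, Xi^(1)){Ri − π(Yi, Xi^(1), ψ̂n)} corresponding to (1, Y, X^(1)) vanish at ψ̂n, since they coincide with the logistic-regression score equations.

```python
import numpy as np
from scipy.optimize import minimize

# Assumed setup: fit the logistic missingness model (12.55) by ML, then evaluate
# the augmentation sum for an f2 whose first three components are (1, Y, X^(1)).

def fit_missingness(R, Y, X1):
    D = np.column_stack([np.ones_like(Y), Y, X1])        # design (1, Y, X^(1))
    def negloglik(psi):
        eta = D @ psi
        return np.sum(np.log1p(np.exp(eta)) - R * eta)   # -log likelihood
    psi_hat = minimize(negloglik, np.zeros(D.shape[1]), method="BFGS").x
    return psi_hat, D

rng = np.random.default_rng(0)
n = 2000
X1 = rng.normal(size=n)
Y = rng.binomial(1, 0.4, size=n).astype(float)
pi_true = 1 / (1 + np.exp(-(0.5 + 0.8 * Y - 0.6 * X1)))
R = rng.binomial(1, pi_true)

psi_hat, D = fit_missingness(R, Y, X1)
pi_hat = 1 / (1 + np.exp(-(D @ psi_hat)))

# f2 containing (1, Y, X1) plus one extra basis function (X1^2, illustrative)
f2 = np.column_stack([D, X1 ** 2])
augmentation = f2.T @ (R - pi_hat)
print(augmentation)   # first three entries are ~0 by the ML score equations
```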
Theorem 12.3. Among the elements ϕ{C, GC(Z)} ∈ IF(Ξ), the one with the
smallest variance matrix is given by ϕopt{C, GC(Z)} = J{ϕF_opt(Z)},
where the linear operator J (·) is given by Definition 1 of Chapter 10; that is,
\[
\mathcal{J}\{\varphi^F_{opt}(Z)\} = \frac{I(C=\infty)\,\varphi^F_{opt}(Z)}{\varpi(\infty, Z)}
- \Pi\left[ \frac{I(C=\infty)\,\varphi^F_{opt}(Z)}{\varpi(\infty, Z)} \,\Big|\, \Lambda_2 \right],
\]
\[
\varphi^F_{opt}(Z) = \left[ E\{\varphi^{*F}_{opt}(Z)\, S_\beta^{FT}(Z)\} \right]^{-1} \varphi^{*F}_{opt}(Z), \qquad (12.62)
\]
and ϕ*F_opt(Z) is the unique element in G^F that solves the equation
\[
\Pi\bigl[ S_\beta^F(Z) - \mathcal{M}^{-1}\{\varphi^{*F}_{opt}(Z)\} \,\big|\, \mathcal{G}^F \bigr] = 0. \qquad (12.63)
\]
Proof. We first will prove that IF (Ξ) consists of the class of elements
Because Sβ {C, GC (Z)} = E{SβF (Z)|C, GC (Z)}, we use a series of iterated con-
ditional expectations to obtain
\[
\begin{aligned}
E[\varphi^*\{C, G_C(Z)\}\, S_\beta^T\{C, G_C(Z)\}]
&= E\bigl(E[\varphi^*\{C, G_C(Z)\}\, S_\beta^{FT}(Z) \mid C, G_C(Z)]\bigr)\\
&= E[\varphi^*\{C, G_C(Z)\}\, S_\beta^{FT}(Z)]\\
&= E\bigl(E[\varphi^*\{C, G_C(Z)\}\, S_\beta^{FT}(Z) \mid Z]\bigr)\\
&= E\bigl(E[\varphi^*\{C, G_C(Z)\} \mid Z]\, S_\beta^{FT}(Z)\bigr)\\
&= E\{\varphi^{*F}(Z)\, S_\beta^{FT}(Z)\},
\end{aligned}
\]
where ϕ*F(Z) ∈ G^F. Hence, if the goal is to find the optimal estimator in IF(Ξ),
we can equivalently restrict the search to the elements in the linear subspace
J(G^F) ⊂ Ξ that satisfy (12.7).
    In Theorem 10.6, we proved that J {hF (Z)} = L[M−1 {hF (Z)}], where the
linear operator L was defined by Definition 4 of Chapter 10 and, we remind
the reader, is given by
for all ϕ∗F (Z) ∈ G F . Recalling that Sβ {C, GC (Z)} = E{SβF (Z)|C, GC (Z)}
and L[M−1 {ϕ∗F (Z)}] = E{M−1 (ϕ∗F )|C, GC (Z)}, we use a series of iterated
conditional expectations to write (12.69) as
\[
\begin{aligned}
0 &= E\Bigl( E\bigl[\{S_\beta^F - \mathcal{M}^{-1}(\varphi^{*F}_{opt})\}^T \mathcal{L}\{\mathcal{M}^{-1}(\varphi^{*F})\} \mid C, G_C(Z)\bigr] \Bigr)\\
&= E\bigl[ \{S_\beta^F - \mathcal{M}^{-1}(\varphi^{*F}_{opt})\}^T \mathcal{L}\{\mathcal{M}^{-1}(\varphi^{*F})\} \bigr]\\
&= E\Bigl( E\bigl[\{S_\beta^F - \mathcal{M}^{-1}(\varphi^{*F}_{opt})\}^T \mathcal{L}\{\mathcal{M}^{-1}(\varphi^{*F})\} \mid Z\bigr] \Bigr)\\
&= E\Bigl( \{S_\beta^F - \mathcal{M}^{-1}(\varphi^{*F}_{opt})\}^T E\bigl[\mathcal{L}\{\mathcal{M}^{-1}(\varphi^{*F})\} \mid Z\bigr] \Bigr)\\
&= E\bigl[ \{S_\beta^F - \mathcal{M}^{-1}(\varphi^{*F}_{opt})\}^T \mathcal{M}\{\mathcal{M}^{-1}(\varphi^{*F})\} \bigr]\\
&= E\bigl[ \{S_\beta^F - \mathcal{M}^{-1}(\varphi^{*F}_{opt})\}^T \varphi^{*F} \bigr]. \qquad (12.70)
\end{aligned}
\]
Therefore, (12.70) being true for all ϕ*F ∈ G^F implies that SβF − M^{-1}(ϕ*F_opt)
must be orthogonal to G^F. Since the projection exists and is unique, this
implies that there must exist a unique element ϕ*F_opt ∈ G^F such that (12.70)
holds for all ϕ*F ∈ G^F, or equivalently
Consequently,
where ϕ∗Fopt (Z) satisfies (12.71), or (12.63) of the theorem. The proof is com-
plete if we can show that
is the same as E{ϕ*F_opt(Z) Sβ^{FT}(Z)}. This can be shown by using the same
iterated expectations argument that led to (12.70). □
(that is, elements of Ξ that satisfy (12.7)), the one with the smallest variance
matrix is obtained by choosing
where ϕ*F_opt(Z) is the unique element in Λ^{F⊥} that satisfies
\[
\Pi\bigl[ \mathcal{M}^{-1}\{\varphi^{*F}_{opt}(Z)\} \,\big|\, \Lambda^{F\perp} \bigr]
= \Pi\bigl[ S_\beta^F(Z) \,\big|\, \Lambda^{F\perp} \bigr] = S^F_{eff}(Z).
\]
where
\[
\underset{q\times t_1}{A^F_{opt}} = -\left[ E\left\{\frac{\partial m^*(Z,\beta_0)}{\partial\beta^T}\right\} \right]^T
\left( E\bigl[ \mathcal{M}^{-1}\{m^*(Z,\beta_0)\}\, m^{*T}(Z,\beta_0) \bigr] \right)^{-1}. \qquad (12.72)
\]
Proof. Using (12.63), we are looking for ϕ*F_opt = A^F_opt m*(Z, β0), with A^F_opt of dimension q × t1, such that
\[
\Pi\bigl[ S_\beta^F(Z) - \mathcal{M}^{-1}\{A^F_{opt}\, m^*(Z,\beta_0)\} \,\big|\, \mathcal{G}^F \bigr] = 0. \qquad (12.73)
\]
or when
\[
A^F_{opt}\, E\{\mathcal{M}^{-1}(m^*)\, m^{*T}\} = E(S_\beta^F\, m^{*T}). \qquad (12.76)
\]
Using the result from equation (12.12) of Lemma 12.1, we obtain that
\[
E\{S_\beta^F(Z)\, m^{*T}(Z,\beta_0)\} = -\left[ E\left\{\frac{\partial m^*(Z,\beta_0)}{\partial\beta^T}\right\} \right]^T.
\]
Substituting this last result into equation (12.76) and solving for AF
                                                                     opt leads
to (12.72), thus proving the corollary. 
   The results of Corollary 3 are especially useful when the inverse operator
M−1 (·) can be derived explicitly such as the case when there are two levels of
coarsening or when the coarsening is monotone. The following algorithm can
now be used to derive improved adaptive double-robust estimators for β:
1. If the coarsening of the data is not by design, we develop a model for the
   coarsening probabilities, say ϖ{r, Gr(Z), ψ}, and estimate ψ by maximum
   likelihood, obtaining ψ̂n. A simpler parametric model p*_Z(z, ξ) is also
   posited for the full data, and ξ is estimated by maximizing the observed-data
   likelihood
\[
\prod_{i=1}^n p^*_{G_{r_i}(Z_i)}(g_{r_i}, \xi),
\]
   yielding the estimator ξ̂*n.
4. We compute A^F_opt(β, ψ̂n, ξ̂*n) by using
\[
A^F_{opt}(\beta, \psi, \xi) = -\left[ E\left\{\frac{\partial m^*(Z,\beta)}{\partial\beta^T}, \xi\right\} \right]^T
\left( E\bigl[ \mathcal{M}^{-1}\{m^*(Z,\beta), \psi, \xi\}\, m^{*T}(Z,\beta), \xi \bigr] \right)^{-1},
\]
   where the notation E(·, ξ) indicates that the expectation is computed under the posited model indexed by ξ.
\[
\sum_{i=1}^n A^F_{opt}(\beta, \hat\psi_n, \hat\xi^*_n)
\left[ \frac{I(C_i=\infty)\, m^*(Z_i,\beta)}{\varpi(\infty, Z_i, \hat\psi_n)}
+ L_2\{C_i, G_{C_i}(Z_i), \beta, \hat\psi_n, \hat\xi^*_n\} \right] = 0. \qquad (12.77)
\]
\[
P(Y = 1 \mid X) = \frac{\exp(\beta^T X^*)}{1 + \exp(\beta^T X^*)},
\]
\[
P(R = 1 \mid Y, X) = \pi(Y, X_1, \psi) = \frac{\exp(\psi_0 + \psi_1 Y + \psi_2^T X_1)}{1 + \exp(\psi_0 + \psi_1 Y + \psi_2^T X_1)}.
\]
    We also posited a model for the full data (Y, X) by assuming that the con-
ditional distribution of X given Y follows a multivariate normal distribution
with a mean that depends on Y but with a variance matrix that is indepen-
dent of Y . Let us denote the mean vector of X given Y = 1 and Y = 0 as µ1
and µ0 , respectively, and the common covariance matrix as Σ. We also denote
the mean vector of X1 given Y = 1 and Y = 0 as µ11 and µ10 , respectively,
and the mean of X2 given Y = 1 and Y = 0 as µ21 and µ20 , respectively.
Similarly, we denote the variance matrix of X1 by Σ11 , the variance of the
single covariate X2 by Σ22 , and the covariance of X1 and X2 by Σ12 . The
parameter ξ for this posited model can be represented by ξ = (µ1 , µ0 , Σ, τ ),
where τ denotes P (Y = 1). Since Y is observed for everyone, the estimate for
τ is obtained by the sample proportion
\[
\hat\tau_n = n^{-1}\sum_{i=1}^n Y_i.
\]
      With full data we know that the optimal estimating function is given by
\[
m(Y, X, \beta) = X^*\left\{ Y - \frac{\exp(\beta^T X^*)}{1 + \exp(\beta^T X^*)} \right\}.
\]
Since this may no longer be the optimal choice with coarsened data, we now
consider an expanded set of estimating functions, namely m∗ (Y, X, β). For
example, we might take
\[
m^*(Y, X, \beta) = X^{**}\left\{ Y - \frac{\exp(\beta^T X^*)}{1 + \exp(\beta^T X^*)} \right\},
\]
Finally, with two levels of missingness, we use the results from Theorem 10.2
to obtain
\[
L_2\{C, G_C(Z), \beta, \psi, \xi\} = -\Pi\left[ \frac{I(C=\infty)\, m^*(Z,\beta)}{\varpi(\infty, Z, \psi)} \,\Big|\, \Lambda_2, \psi, \xi \right]
= -\frac{R - \pi(Y, X_1, \psi)}{\pi(Y, X_1, \psi)}\, E\{m^*(Y, X, \beta) \mid Y, X_1, \xi\}.
\]
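The conditional expectation E{m*(Y, X, β)|Y, X1, ξ} appearing in L2 generally has no closed form under the posited normal model, but it can be approximated numerically. The sketch below is only one illustrative device (the helper names, the expanded basis X**, and the Monte Carlo approximation are assumptions, not the book's prescription): it draws X2 from its posited conditional normal distribution given (Y, X1) and averages m*.

```python
import numpy as np

# Hedged sketch of the augmentation term
#   L2 = -[{R - pi(Y, X1, psi)} / pi(Y, X1, psi)] * E{ m*(Y, X, beta) | Y, X1, xi },
# with the conditional expectation approximated by Monte Carlo under the posited
# normal model for X given Y.  xi collects (mu1, mu0, S11, S12, S22), where the
# last component of each mean vector corresponds to the possibly missing X2.

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def m_star(y, x_star, x_dstar, beta):
    # expanded estimating function m*(Y, X, beta) = X** {Y - expit(beta^T X*)}
    return x_dstar * (y - expit(x_star @ beta))

def cond_expect_mstar(y, x1, beta, xi, n_draws=500, rng=None):
    rng = rng or np.random.default_rng(0)
    mu = xi["mu1"] if y == 1 else xi["mu0"]
    S11, S12, S22 = xi["S11"], xi["S12"], xi["S22"]
    mean2 = mu[-1] + S12 @ np.linalg.solve(S11, x1 - mu[:-1])   # E(X2 | Y, X1)
    var2 = S22 - S12 @ np.linalg.solve(S11, S12)                # var(X2 | Y, X1)
    draws = rng.normal(mean2, np.sqrt(var2), size=n_draws)
    total = 0.0
    for x2 in draws:
        x_star = np.concatenate([[1.0], x1, [x2]])       # (1, X1, X2)
        x_dstar = np.concatenate([x_star, [x2 ** 2]])    # illustrative expanded basis
        total = total + m_star(y, x_star, x_dstar, beta)
    return total / n_draws

def L2_term(r, y, x1, beta, psi, xi):
    pi = expit(np.concatenate([[1.0], [y], x1]) @ psi)
    return -((r - pi) / pi) * cond_expect_mstar(y, x1, beta, xi)
```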
      when the data are monotonically coarsened. This methodology led to the
      integral equation (11.42), which in general is very difficult if not impossible
      to solve. Only consider the case when Y is a univariate random variable.
      For this same problem, outline the steps that would be necessary to obtain
      the optimal restricted (class 2) estimator for β. For this exercise, take
\[
\hat\mu_1 = n_1^{-1}\sum_{i=1}^n A_i Y_i, \qquad \hat\mu_0 = n_0^{-1}\sum_{i=1}^n (1 - A_i) Y_i,
\]
and n1 = Σ Ai and n0 = Σ(1 − Ai) denote the treatment-specific sample sizes.
     Typically, such an associational analysis does not answer the causal ques-
tion of interest. If the treatments were not assigned to the patients at ran-
dom, then one can easily imagine that individuals who receive the statin
drugs may be inherently different from those who do not. They may be
wealthier, younger, smoke less, etc. Consequently, the associational param-
eter ∆ = µ1 − µ0 may reflect these inherent differences as well as any effect
due to treatment. In the study of epidemiology, such factors are referred to as
confounders, as they may confound the relationship between treatment and
response.
     Thus, we have argued that statistical associations may not be adequate to
describe causal effects. Therefore, how might we describe causal effects? The
point of view we will adopt is that proposed by Neyman (1923) and Rubin
(1974), where causal effects are defined through potential outcomes or counter-
factual random variables. Specifically, for each level of the treatment A = a,
we will assume that there exists a potential outcome Y ∗ (a), where Y ∗ (a)
denotes the response of a randomly selected individual had that individual,
possibly contrary to fact, been given treatment A = a. In our illustration,
we only include two treatments and hence we define the potential outcomes
Y ∗ (1) and Y ∗ (0). Again, we emphasize that these are referred to as potential
outcomes or counterfactual random variables, as it is impossible to observe
both Y ∗ (1) and Y ∗ (0) simultaneously. Nonetheless, using the notion of poten-
tial outcomes, we would define the causal treatment effect by Y ∗ (1) − Y ∗ (0).
Remark 1. Rubin (1978a) refers to the assumption (13.1) as the Stable Unit
Treatment Value Assumption, or SUTVA. Although this assumption may
seem straightforward at first, there are some philosophical subtleties that
need to be considered in order to fully accept. For one thing, there must
not be any interference in the response from other subjects. That is, the ob-
served response for the i-th individual in the sample should not be affected
by the response of the other individuals in the sample. Thus, for example,
this assumption may not be reasonable in a vaccine intervention trial for an
infectious disease, where the response of an individual is clearly affected by
the response of others in the study. That is, whether or not an individual
contracts an infectious disease will depend, to some extent, on whether and
how many other individuals in the population are infected. From here on, we
will assume the SUTVA assumption but caution that the plausibility of this
assumption needs to be evaluated on a case-by-case basis.   
    Because of assumption (13.1), we see that the observed data are a many-
to-one transformation of the full data. We also note that the treatment as-
signment indicator A plays a role similar to that of the missingness indicator
in missing-data problems. This analogy to missing-data problems will be use-
ful as we develop the theory that enables us to estimate the average causal
treatment effect.
where (13.3) follows from the SUTVA assumption and (13.4) follows from
assumption (13.2). Similarly, we can show that E(Y |A = 0) = E{Y ∗ (0)}.
Consequently, the difference in the sample's average response between treatments,
n1^{-1} Σ_{i=1}^n Ai Yi − n0^{-1} Σ_{i=1}^n (1 − Ai) Yi, which is an unbiased estimator
for ∆, the associational treatment effect, is also an unbiased estimator for the
average causal treatment effect δ in a randomized intervention study.
    As we mentioned earlier, the treatment indicator A serves a role analogous
to the missingness indicator R, which was used to denote missing data with
two levels of missingness. That is, when A = 1, we observe Y ∗ (1), in which
case Y ∗ (0) is missing, and when A = 0, we observe Y ∗ (0) and Y ∗ (1) is missing.
The assumption (13.2), which is induced because of randomization, is similar
to missing completely at random; that is, the probability that A is equal to 1
or 0 is independent of all the data {Y ∗ (1), Y ∗ (0), X}.
   We now give an argument to show that the average causal treatment effect
can be identified through the distribution of the observable data (Y, A, X) if
assumptions (13.1) and (13.5) hold. This follows because
We now consider the estimation of the average causal treatment effect from a
sample of observed data (Yi , Ai , Xi ), i = 1, . . . , n. The first approach, which
we refer to as regression modeling, is motivated by equation (13.8). Here, we
consider a restricted moment model for the conditional expectation of Y given
(A, X) in terms of a finite-dimensional parameter, say ξ. That is,
µ(A, X, ξ) = ξ0 + ξ1 A + ξ2 X + ξ3 AX,
Under the assumption that the restricted moment model is correct, a natural
estimator for E(Y |A = 1, X) − E(Y |A = 0, X) would then be µ(1, X, ξˆn ) −
µ(0, X, ξˆn ). Substituting this into (13.12) yields the estimator for the average
causal treatment effect,
\[
\hat\delta_n = n^{-1}\sum_{i=1}^n \{\mu(1, X_i, \hat\xi_n) - \mu(0, X_i, \hat\xi_n)\}. \qquad (13.13)
\]
So, for example, if we posited the log-linear model (13.10), then the estimator
for the average causal treatment effect would be given by
\[
\hat\delta_n = n^{-1}\sum_{i=1}^n \left[ \exp\{\hat\xi_{0n} + \hat\xi_{1n} + (\hat\xi_{2n} + \hat\xi_{3n})X_i\}
- \exp(\hat\xi_{0n} + \hat\xi_{2n} X_i) \right].
\]
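As a small worked illustration of (13.13), the following sketch fits an outcome regression and averages the difference in its predictions over the sample. For simplicity it uses a linear specification of µ(A, X, ξ) rather than the log-linear model (13.10), and the simulated data are purely illustrative.

```python
import numpy as np

# Regression-modeling estimator (13.13): fit mu(A, X, xi), then average
# mu(1, X_i, xi_hat) - mu(0, X_i, xi_hat) over the sample.
rng = np.random.default_rng(1)
n = 5000
X = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-0.5 * X)))            # confounded treatment
Y = 1.0 + 2.0 * A + 1.5 * X + 0.5 * A * X + rng.normal(size=n)

design = np.column_stack([np.ones(n), A, X, A * X])        # xi0 + xi1*A + xi2*X + xi3*A*X
xi_hat, *_ = np.linalg.lstsq(design, Y, rcond=None)

def mu(a, x, xi):
    return xi[0] + xi[1] * a + xi[2] * x + xi[3] * a * x

delta_hat = np.mean(mu(1, X, xi_hat) - mu(0, X, xi_hat))   # equation (13.13)
print(delta_hat)   # close to the average causal effect 2.0 + 0.5*E(X) = 2.0
```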
vector of baseline covariates. The joint density of the full data can be written
as
          p{y ∗ (1), y ∗ (0), x, a} = p{a|y ∗ (1), y ∗ (0), x}p{y ∗ (1), y ∗ (0), x}
                                   = p(a|x)p{y ∗ (1), y ∗ (0), x},                     (13.14)
where (13.14) follows from the strong ignorability assumption (13.5). We will
put no restrictions on the joint density p{y ∗ (1), y ∗ (0), x} of {Y ∗ (1), Y ∗ (0), X}
(i.e., a nonparametric model). Consequently, using the same logic as in Sec-
tion 5.3, we argue that there is only one full-data influence function of RAL
estimators for δ = E{Y ∗ (1) − Y ∗ (0)}. Letting δ0 denote the true value of δ,
the full-data influence function is given by
                 ϕF {Y ∗ (1), Y ∗ (0), X} = {Y ∗ (1) − Y ∗ (0) − δ0 },                 (13.15)
which, of course, is the influence function for the full-data estimator
\[
\hat\delta_n^F = n^{-1}\sum_{i=1}^n \{Y_i^*(1) - Y_i^*(0)\}.
\]
h(Y, A, X) + Λ2 ,
To verify this, consider the conditional expectation of the first term on the
right-hand side of (13.18); that is,
\[
\begin{aligned}
E\left\{ \frac{AY}{\pi(X)} \,\Big|\, Y^*(1), Y^*(0), X \right\}
&= E\left\{ \frac{A\,Y^*(1)}{\pi(X)} \,\Big|\, Y^*(1), Y^*(0), X \right\}\\
&= \frac{Y^*(1)}{\pi(X)}\, E\{A \mid Y^*(1), Y^*(0), X\}\\
&= \frac{Y^*(1)}{\pi(X)}\, E(A \mid X) \qquad (13.19)\\
&= \frac{Y^*(1)}{\pi(X)}\, \pi(X) = Y^*(1), \qquad (13.20)
\end{aligned}
\]
where (13.19) follows from the strong ignorability assumption. Also, in order
for (13.20) to hold, we must not be dividing 0 by 0. Therefore, we will need
the additional assumption that the propensity score P (A = 1|X) = π(X)
is strictly greater than zero almost everywhere. Similarly, we can show that
E[(1 − A)Y/{1 − π(X)} | Y*(1), Y*(0), X] = Y*(0) as long as 1 − π(X) is strictly greater
than zero almost everywhere. Therefore, we have proved that the function h(·)
defined in (13.18) satisfies the relationship (13.16) as long as the propensity
score 0 < π(x) < 1, for all x in the support of X.
    To derive the augmentation space Λ2 , we must find all functions L2 (·) of
Y, A, X that satisfy (13.17). Because A is a binary indicator, any function of
Y, A, X can be written as
for arbitrary functions L21 (·) and L20 (·) of Y, X. Hence the conditional ex-
pectation of L2 (·), given {Y ∗ (1), Y ∗ (0), X}, is
where (13.22) follows from the SUTVA assumption and (13.23) follows from
the strong ignorability assumption. Therefore, in order for (13.17) to hold, we
need
\[
L_{20}\{Y^*(0), X\} = -\left\{ \frac{\pi(X)}{1 - \pi(X)} \right\} L_{21}\{Y^*(1), X\}. \qquad (13.24)
\]
Since both the left- and right-hand sides of (13.24) denote functions of
Y ∗ (1), Y ∗ (0), X, the only way that equation (13.24) can hold is if both L20 (·)
and L21 (·) are functions of X alone. In that case, any element of Λ2 must
satisfy
\[
L_{20}(X) = -\left\{ \frac{\pi(X)}{1 - \pi(X)} \right\} L_{21}(X). \qquad (13.25)
\]
Substituting the relationship given by (13.25) into (13.21), we conclude that
the space Λ2 consists of functions
\[
\left\{ \frac{A - \pi(X)}{1 - \pi(X)} \right\} L_{21}(X) \quad \text{for arbitrary functions } L_{21}(X).
\]
Since 1 − π(X) is strictly greater than zero almost everywhere and L21 (X) is
an arbitrary function of X, we can write Λ2 as
                                                           
               Λ2 = {A − π(X)}h2 (X) for arbitrary h2 (X) .          (13.26)
for all functions h2 (X). We denote this as Π[ϕ(Y, A, X)|Λ2 ] = {A−π(X)}h02 (X).
As indicated earlier, any function ϕ(·) of Y, A, X can be written as ϕ(Y, A, X) =
Aϕ1 (Y, X) + (1 − A)ϕ0 (Y, X). Because projections are linear operators,
or equivalently
\[
E\bigl[ A\{A - \pi(X)\}\, h(X)\, \varphi_1(Y, X) - \{A - \pi(X)\}^2\, h_1^0(X)\, h(X) \bigr] = 0, \qquad (13.30)
\]
The first term on the left-hand side of (13.30) is also computed through a
series of iterated conditional expectations; namely,
\[
\begin{aligned}
E\bigl[ A\{A - \pi(X)\}\, h(X)\, \varphi_1(Y, X) \bigr]
&= E\Bigl( E\bigl[ A\{A - \pi(X)\}\, h(X)\, \varphi_1(Y, X) \mid A, X \bigr] \Bigr)\\
&= E\bigl[ A\{A - \pi(X)\}\, h(X)\, E\{\varphi_1(Y, X) \mid A = 1, X\} \bigr]\\
&= E\Bigl( E\bigl[ A\{A - \pi(X)\}\, h(X)\, E\{\varphi_1(Y, X) \mid A = 1, X\} \mid X \bigr] \Bigr)\\
&= E\bigl[ \pi(X)\{1 - \pi(X)\}\, h(X)\, E\{\varphi_1(Y, X) \mid A = 1, X\} \bigr]. \qquad (13.32)
\end{aligned}
\]
for all h(X). Since π(X) and 1 − π(X) are both bounded away from zero almost
surely, and E{ϕ1(Y, X)|A = 1, X} − h_1^0(X) is a function of X, (13.33) can hold
for all h(X) only if E{ϕ1(Y, X)|A = 1, X} − h_1^0(X) is identically equal to zero;
i.e., h_1^0(X) must equal E{ϕ1(Y, X)|A = 1, X}, thus proving (13.28). The proof of
(13.29) follows similarly. □
   Returning to the class of influence functions of RAL estimators for δ given
by (13.27), the efficient influence function in this class is given by
\[
\frac{AY}{\pi(X)} - \frac{(1 - A)Y}{1 - \pi(X)} - \delta_0
- \Pi\left[ \frac{AY}{\pi(X)} - \frac{(1 - A)Y}{1 - \pi(X)} - \delta_0 \,\Big|\, \Lambda_2 \right],
\]
which by Theorem 13.1 is equal to
\[
\left[ \frac{AY}{\pi(X)} - \frac{\{A - \pi(X)\}\, E(Y \mid A = 1, X)}{\pi(X)}
- \frac{(1 - A)Y}{1 - \pi(X)} - \frac{\{A - \pi(X)\}\, E(Y \mid A = 0, X)}{1 - \pi(X)} - \delta_0 \right]. \qquad (13.34)
\]
where µ(A, X) = E(Y |A, X). Of course, µ(A, X) is not known to us but can
be estimated by positing a model where E(Y |A, X) = µ(A, X, ξ) as we did in
(13.9). The parameter ξ can be estimated by solving the estimating equation
(13.11), and then µ(1, Xi , ξˆn ) and µ(0, Xi , ξˆn ) can be substituted into (13.35)
to obtain a locally efficient estimator for δ.
    The development above assumed that the propensity score π(X) was
known to us. In observational studies, this will not be the case. Consequently,
we must posit a model for the propensity score, say assuming that
\[
\pi(X, \psi) = \frac{\exp(\psi_0 + \psi_1^T X)}{1 + \exp(\psi_0 + \psi_1^T X)}.
\]
The estimator for ψ would be obtained using maximum likelihood; i.e., ψ̂n is
the value of ψ that maximizes the likelihood
\[
\prod_{i=1}^n \pi(X_i, \psi)^{A_i}\, \{1 - \pi(X_i, \psi)\}^{(1 - A_i)}.
\]
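Putting the pieces together, the following hedged sketch (illustrative variable names and simulated data, not the book's code) computes the augmented inverse propensity weighted estimator implied by (13.34): a logistic model is fit for the propensity score, an outcome regression is fit for µ(A, X, ξ), and the two are combined in the doubly robust form µ1 + A(Y − µ1)/π − µ0 − (1 − A)(Y − µ0)/(1 − π), averaged over the sample.

```python
import numpy as np
from scipy.optimize import minimize

def fit_logistic(A, D):
    # maximum likelihood for a logistic model with design matrix D
    def negloglik(psi):
        eta = D @ psi
        return np.sum(np.log1p(np.exp(eta)) - A * eta)
    return minimize(negloglik, np.zeros(D.shape[1]), method="BFGS").x

rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 0.7 * X))))
Y = 1.0 + 2.0 * A + 1.5 * X + rng.normal(size=n)

# propensity score model pi(X, psi)
D_ps = np.column_stack([np.ones(n), X])
psi_hat = fit_logistic(A, D_ps)
pi_hat = 1 / (1 + np.exp(-(D_ps @ psi_hat)))

# outcome regression mu(A, X, xi) by least squares
D_out = np.column_stack([np.ones(n), A, X])
xi_hat, *_ = np.linalg.lstsq(D_out, Y, rcond=None)
mu1 = xi_hat[0] + xi_hat[1] + xi_hat[2] * X
mu0 = xi_hat[0] + xi_hat[2] * X

# doubly robust (augmented inverse propensity weighted) estimator
delta_hat = np.mean(mu1 + A * (Y - mu1) / pi_hat
                    - mu0 - (1 - A) * (Y - mu0) / (1 - pi_hat))
print(delta_hat)   # approximately 2.0, the average causal treatment effect
```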
Double Robustness
Since E{Y ∗ (1)−Y ∗ (0)} of (13.41) is equal to δ, the proof of double robustness
will follow if we can show that the expectations in (13.42) and (13.43) are equal
to zero if either ψ ∗ = ψ0 or ξ ∗ = ξ0 .
    Let us first assume that the propensity score is correctly specified; i.e.,
ψ ∗ = ψ0 . We compute the expectation of (13.42) by iterated conditional
expectations, where we first condition on Y ∗ (1) and X to obtain that (13.42)
equals
\[
E\left( \frac{[E\{A \mid Y^*(1), X\} - \pi(X, \psi_0)]\, \{Y^*(1) - \mu(1, X, \xi^*)\}}{\pi(X, \psi_0)} \right).
\]
Remark 6. The connection of the propensity score to causality has been stud-
ied carefully by Rosenbaum, Rubin, and others; see, for example, Rosenbaum
and Rubin (1983, 1984, 1985), Rosenbaum (1984, 1987), and Rubin (1997).
Different methods for estimating the average causal treatment effect using
propensity scores have been advocated. These include stratification, match-
ing, and inverse propensity weighting. The locally efficient estimator derived
above is sometimes referred to as the augmented inverse propensity weighted
estimator. This estimator was compared with other commonly used estima-
tors of the average causal treatment effect in a series of numerical simulations
by Lunceford and Davidian (2004), who generally found that this estimator
performs the best across a wide variety of scenarios.   
A popular approach for dealing with missing data is the use of multiple im-
putation, which was first introduced by Rubin (1978b). Although most of
this book has focused on semiparametric models, where the model includes
infinite-dimensional nuisance parameters, this chapter will only consider finite-
dimensional parametric models, as in Chapter 3. Because of its importance in
missing-data problems, we conclude with a discussion of this methodology.
    Imputation methods, where we replace missing values by some “best guess”
and then analyze the data as if complete, have a great deal of intuitive appeal.
However, unless one is careful about how the imputation is carried out and
how the subsequent inference is made, imputation methods may lead to biased
estimates with estimated confidence intervals that are too narrow. Rubin’s
proposal for multiple imputation allowed the use of this intuitive idea in a
manner that results in correct inference. Rubin’s justification is based on a
Bayesian paradigm. In this chapter, we will consider the statistical properties
of multiple-imputation estimators from a frequentist point of view using large
sample theory. Much of the development is based on two papers, by Wang and
Robins (1998) and Robins and Wang (2000). Although most of the machinery
developed in the previous chapters that led to AIPWCC estimators will not
be used here, we feel that this topic is of sufficient importance to warrant
study in its own right.
    As we have all along, we will consider a full-data model, where the full
data are denoted by Z1 , . . . , Zn assumed iid with density pZ (z, β), where β is
a finite-dimensional parameter, say q-dimensional. The observed (coarsened)
data will be assumed coarsened at random and denoted by {Ci, GCi(Zi)}, i = 1, . . . , n.
Remark 1. When data are coarsened at random and the full-data parameter
β is finite-dimensional, then β can be estimated using maximum likelihood.
The coarsened-data likelihood was derived in (7.10), where we showed that β
could be estimated by maximizing
\[
\prod_{i=1}^n p_{G_{r_i}(Z_i)}(g_{r_i}, \beta), \qquad (14.1)
\]
where
\[
p_{G_r(Z)}(g_r, \beta) = \int_{\{z:\, G_r(z) = g_r\}} p_Z(z, \beta)\, d\nu_Z(z).
\]
\[
\sum_{i=1}^n S^F(Z_i, \beta) = 0.
\]
Remark 2. On notation
In this chapter, we only consider the parameter β with no additional nuisance
parameters. Therefore, the full-data score vector will be denoted by S F (Z, β)
(without the subscript β used in the previous chapters). As usual, when we use
the notation S F (Z, β), we are considering a q-dimensional vector of functions
of Z and β. If the score vector is evaluated at the truth, β = β0 , then this will
often be denoted by S F (Z) = S F (Z, β0 ). A similar convention will be used
when we consider the observed-data score vector, which will be denoted as
S{C, GC (Z), β}, and, at the truth, as S{C, GC (Z), β0 } or S{C, GC (Z)}.   
      Under suitable regularity conditions, which will be assumed throughout,
\[
n^{1/2}(\hat\beta_n^F - \beta_0) \xrightarrow{D} N\bigl(0, V^F_{eff}(\beta_0)\bigr),
\]
where {V^F_eff(β0)}^{-1} is the full-data information matrix, which we denote by
I^F(β0) = E{S^F(Z) S^{FT}(Z)}. That is,
\[
V^F_{eff}(\beta_0) = \bigl\{ I^F(\beta_0) \bigr\}^{-1}.
\]
\[
P(Y = 1 \mid X) = \frac{\exp(\theta^T X)}{1 + \exp(\theta^T X)}. \qquad (14.2)
\]
\[
P(Y = 1 \mid X, W) = P(Y = 1 \mid X) = \frac{\exp(\theta^T X)}{1 + \exp(\theta^T X)}.
\]
(Ri , Yi , Wi , Ri Xi ), i = 1, . . . , n.
Although the primary focus is the parameter θ, since the model is finite-
dimensional, we will not differentiate between the parameters of interest and
the nuisance parameters. Hence
\[
S\{C_i, G_{C_i}(Z_i), \beta\} = E\{S^F(Z_i, \beta) \mid C_i, G_{C_i}(Z_i), \beta\}. \qquad (14.4)
\]
variance of the observed-data MLE β̂n . In this section, we give a formal proof
of this result.
Let var{S^F(Z)} = E{S^F(Z) S^{FT}(Z)} denote the variance matrix of
S^F(Z). Then the asymptotic variance of the full-data MLE β̂nF is given by
V^F_eff(β0) = [var{S^F(Z)}]^{-1}. Similarly, the asymptotic variance of the observed-
data MLE β̂n is Veff(β0) = (var[S{C, GC(Z)}])^{-1}.
    We now give two results regarding the full-data and observed-data esti-
mators for β.
Theorem 14.1. The observed-data information matrix is smaller than or
equal to the full-data information matrix; that is,
\[
\operatorname{var}[S\{C, G_C(Z)\}] \le \operatorname{var}\{S^F(Z)\},
\]
where the notation “≤” means that var{S F (Z)} − var[S{C, GC (Z)}] is non-
negative definite.
Proof. By the law of conditional variance,
\[
\begin{aligned}
\operatorname{var}\{S^F(Z)\} &= \operatorname{var}\bigl( E[S^F(Z) \mid C, G_C(Z)] \bigr) + E\bigl( \operatorname{var}[S^F(Z) \mid C, G_C(Z)] \bigr)\\
&= \operatorname{var}[S\{C, G_C(Z)\}] + E\bigl( \operatorname{var}[S^F(Z) \mid C, G_C(Z)] \bigr). \qquad (14.6)
\end{aligned}
\]
Proof. This follows from results about influence functions; namely, if we define
\[
\varphi^F_{eff}(Z) = \bigl[ E\{S^F(Z)\, S^{FT}(Z)\} \bigr]^{-1} S^F(Z)
\]
and
\[
\varphi_{eff}\{C, G_C(Z)\} = \bigl[ E\bigl( S\{C, G_C(Z)\}\, S^T\{C, G_C(Z)\} \bigr) \bigr]^{-1} S\{C, G_C(Z)\},
\]
then
\[
\varphi_{eff}\{C, G_C(Z)\} = \varphi^F_{eff}(Z) + \bigl[ \varphi_{eff}\{C, G_C(Z)\} - \varphi^F_{eff}(Z) \bigr].
\]
or
\[
\operatorname{var}[\varphi_{eff}\{C, G_C(Z)\}] - \operatorname{var}\{\varphi^F_{eff}(Z)\}
\]
and
\[
\operatorname{var}\{\varphi^F_{eff}(Z)\} = \bigl[ \operatorname{var}\{S^F(Z)\} \bigr]^{-1} = V^F_{eff}(\beta_0). \qquad \Box
\]
   For many problems, working with the observed (coarsened) data likelihood
may be difficult, with no readily available software. For such instances, it may
be useful to find methods where we can use the simpler full-data inference to
analyze coarsened data. This is what motivated much of the inverse weighted
methodology discussed in the previous chapters. Multiple imputation is
another such methodology popularized by Rubin (1978b, 1987). See also the
excellent overview paper by Rubin (1996).
Zij , j = 1, . . . , m, i = 1, . . . , n.
These are the imputation of the full data from the observed (coarsened) data.
The “j”-th set of imputed full data, denoted by Zij , i = 1, . . . , n, is used to
obtain the j-th estimator β̂*nj by solving the full-data likelihood equation:
\[
\sum_{i=1}^n S^F(Z_{ij}, \hat\beta^*_{nj}) = 0. \qquad (14.9)
\]
That is, we use the j-th imputed data set and treat these data as if they were
full data to obtain the full-data MLE β̂*nj.
    The proposed multiple-imputation estimator is
\[
\hat\beta^*_n = m^{-1}\sum_{j=1}^m \hat\beta^*_{nj}. \qquad (14.10)
\]
Rubin (1987) argues that, under appropriate conditions, this estimator is con-
sistent and asymptotically normal. That is,
                                                 D
                            n1/2 (β̂n∗ − β0 ) −→ N (0, Σ∗ ),
We note that the first term in the sum is an average of the estimators of
the full-data asymptotic variance using the inverse of the full-data observed
information matrix over the imputed full-data sets and the second term is
the sample variance of the imputation estimators multiplied by a finite “m”
correction factor.
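The following sketch collects the multiple-imputation point estimate (14.10) together with the combined variance just described (the average of the within-imputation variance estimates plus the between-imputation sample covariance inflated by the finite-m factor 1 + 1/m, i.e., Rubin's standard combining rule). The callables impute_full_data and full_data_mle are placeholders for the problem-specific steps of drawing Z from p_{Z|C,GC(Z)} and solving (14.9).

```python
import numpy as np

def multiple_imputation(observed_data, impute_full_data, full_data_mle, m=20,
                        rng=None):
    """Sketch of the MI point estimate (14.10) and Rubin's combined variance.

    impute_full_data(observed_data, rng) -> one imputed full-data set
    full_data_mle(Z) -> (beta_hat, var_hat), the full-data MLE and the inverse
                        of the full-data observed information matrix
    """
    rng = rng or np.random.default_rng(0)
    betas, within = [], []
    for j in range(m):
        Z_j = impute_full_data(observed_data, rng)   # j-th imputed full data set
        beta_j, var_j = full_data_mle(Z_j)           # solves (14.9)
        betas.append(beta_j)
        within.append(var_j)
    betas = np.asarray(betas)
    beta_star = betas.mean(axis=0)                   # equation (14.10)
    W = np.mean(np.asarray(within), axis=0)          # average within-imputation variance
    B = np.cov(betas, rowvar=False, ddof=1)          # between-imputation variance
    total_var = W + (1.0 + 1.0 / m) * B              # Rubin's combining rule
    return beta_star, total_var
```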
where β must be estimated from the data in some fashion. Such imputed
values are referred to as Zij (β). 
                                   
the complete cases since these are a representative sample (albeit smaller)
from the population. If the data are missing at random (MAR), we might use
an inverse probability weighted complete-case estimator as an initial value.
    The initial estimator for β will be assumed to be an RAL estimator; that
is,
\[
n^{1/2}(\hat\beta_n^I - \beta_0) = n^{-1/2}\sum_{i=1}^n q\{C_i, G_{C_i}(Z_i)\} + o_p(1),
\]
where q{Ci , GCi (Zi )} is the i-th influence function of the estimator β̂nI . We
remind the reader of the following properties of influence functions for RAL
estimators:
(a) The efficient influence function, denoted by ϕeff{Ci, GCi(Zi)}, equals
\[
\bigl[ E\bigl( S\{C_i, G_{C_i}(Z_i)\}\, S^T\{C_i, G_{C_i}(Z_i)\} \bigr) \bigr]^{-1} S\{C_i, G_{C_i}(Z_i)\}. \qquad (14.11)
\]
(b) For any influence function of an RAL estimator q{Ci, GCi(Zi)},
\[
E\bigl[ q\{C_i, G_{C_i}(Z_i)\}\, S^T\{C_i, G_{C_i}(Z_i)\} \bigr] = I^{q\times q}, \text{ the identity matrix.} \qquad (14.12)
\]
\[
q\{C_i, G_{C_i}(Z_i)\} = \varphi_{eff}\{C_i, G_{C_i}(Z_i)\} + h\{C_i, G_{C_i}(Z_i)\}, \qquad (14.13)
\]
Hence
\[
\begin{aligned}
\operatorname{var}[q\{C_i, G_{C_i}(Z_i)\}] &= \operatorname{var}[\varphi_{eff}\{C_i, G_{C_i}(Z_i)\}] + \operatorname{var}[h\{C_i, G_{C_i}(Z_i)\}]\\
&= \bigl[ E\bigl( S\{C_i, G_{C_i}(Z_i)\}\, S^T\{C_i, G_{C_i}(Z_i)\} \bigr) \bigr]^{-1} + \operatorname{var}[h\{C_i, G_{C_i}(Z_i)\}]. \qquad (14.14)
\end{aligned}
\]
and the use of empirical processes that are beyond the scope of this book. We
will, however, provide some heuristic justification of the results.
    As we have emphasized repeatedly throughout this book, the key to the
asymptotic properties of RAL estimators is being able to derive their influence
function. We will derive the influence function of the multiple-imputation
estimator through a series of approximations. We begin by giving the first
such approximation.
\[
n^{1/2}(\hat\beta^*_n - \beta_0) = n^{-1/2}\sum_{i=1}^n \bigl\{ I^F(\beta_0) \bigr\}^{-1}
\left[ m^{-1}\sum_{j=1}^m S^F\{Z_{ij}(\hat\beta_n^I), \beta_0\} \right] + o_p(1), \qquad (14.15)
\]
where I F (β0 ) is the full-data information matrix; namely,
\[
E\left\{ \frac{-\partial S^F(Z, \beta_0)}{\partial\beta^T} \right\} = E\bigl\{ S^F(Z, \beta_0)\, S^{FT}(Z, \beta_0) \bigr\}.
\]
\[
\sum_{i=1}^n S^F\{Z_{ij}(\hat\beta_n^I), \hat\beta^*_{nj}\} = 0,
\]
where β̃nj is an intermediate value between β̂*nj and β0. Under suitable regularity conditions,
\[
-n^{-1}\sum_{i=1}^n \frac{\partial S^F\{Z_{ij}(\hat\beta_n^I), \tilde\beta_{nj}\}}{\partial\beta^T}
\xrightarrow{P} E\left[ -\frac{\partial S^F\{Z_{ij}(\beta_0), \beta_0\}}{\partial\beta^T} \right], \qquad (14.16)
\]
Therefore,
\[
n^{1/2}\bigl( \hat\beta^*_{nj} - \beta_0 \bigr) = n^{-1/2}\sum_{i=1}^n \bigl\{ I^F(\beta_0) \bigr\}^{-1} S^F\{Z_{ij}(\hat\beta_n^I), \beta_0\} + o_p(1), \qquad (14.17)
\]
However, this does not give us the desired influence function for β̂n∗ because
Zij (β̂nI ) is evaluated at a random quantity (β̂nI ) that involves data from all
individuals and therefore (14.18) is not a sum of iid terms. In order to find
the influence function, we write (14.18) as the sum of
\[
n^{-1/2}\sum_{i=1}^n \left[ m^{-1}\sum_{j=1}^m S^F\{Z_{ij}(\beta_0), \beta_0\} \right] \qquad (14.19)
\]
\[
+\; n^{-1/2}\sum_{i=1}^n \left[ m^{-1}\sum_{j=1}^m S^F\{Z_{ij}(\hat\beta_n^I), \beta_0\}
- m^{-1}\sum_{j=1}^m S^F\{Z_{ij}(\beta_0), \beta_0\} \right] \qquad (14.20)
\]
and then show how these two pieces can be expressed approximately as a sum
of iid terms.
Remark 5. We remind the reader that, for a fixed β, Zij (β), j = 1, . . . , m, i =
1, . . . , n are obtained through a two-stage process. For any i, nature (sam-
pling from the population) provides us with the data {Ci , GCi (Zi )} de-
rived from the distribution pC,GC (Z) (r, gr , β0 ). Then, for j = 1, . . . , m, the
data analyst draws m values at random from the conditional distribution
pZ|C,GC (Z) {z|Ci , GCi (Zi ), β}. Consequently, the vector {Zi1 (β), . . . , Zim (β)} is
made up of correlated random variables, but these random vectors across i
are iid random vectors. Also, because of the way the data are generated, the
marginal distribution of Zij (β) is the same for all i and j. If β = β0 , then
the marginal distribution of Zij (β0 ) has density pZ (z, β0 ) (i.e., the density for
the full data at the truth). However, if β = β0 , then the marginal density for
Zij (β) is more complex.       
   Based on the discussion in Remark 5, (14.19) is made up of a sum of n iid
elements, where the i-th element of the sum is equal to
\[
m^{-1}\sum_{j=1}^m S^F\{Z_{ij}(\beta_0), \beta_0\}, \qquad (14.21)
\]
which implies that (14.21) has mean zero. Hence, (14.19) is a normalized sum
of mean-zero iid random vectors that will converge to a normal distribution
by the central limit theorem.
   We now consider the approximation of (14.20) as a sum of iid random
vectors. This is given by the following theorem.
Theorem 14.3. The expression (14.20) is equal to
\[
n^{-1/2}\sum_{i=1}^n \bigl\{ I^F(\beta_0) - I(\beta_0) \bigr\}\, q\{C_i, G_{C_i}(Z_i)\} + o_p(1), \qquad (14.22)
\]
where I(β0) denotes the observed-data information matrix and q{Ci, GCi(Zi)} is the i-th influence function of the initial estimator β̂nI.
      Theorem 14.3 will be proved using a series of lemmas.
Lemma 14.2. Let Zij (β) denote a random draw from the conditional distri-
bution with conditional density pZ|C,GC (Z) {z|Ci , GCi (Zi ), β}. If we define
\[
\lambda(\beta, \beta_0) = E\bigl[ S^F\{Z_{ij}(\beta), \beta_0\} \bigr], \qquad (14.23)
\]
then
\[
\left. \frac{\partial\lambda(\beta, \beta_0)}{\partial\beta^T} \right|_{\beta = \beta_0} = I^F(\beta_0) - I(\beta_0).
\]
Because Zij(β) is a random draw from the conditional distribution with conditional density pZ|C,GC(Z){z|Ci, GCi(Zi), β},
Because
\[
p_{Z|C,G_C(Z)}(z \mid r, g_r, \beta)\, p_{C,G_C(Z)}(r, g_r, \beta)\, d\nu_{Z|C,G_C(Z)}(z \mid r, g_r)\, d\nu_{C,G_C(Z)}(r, g_r)
= p_{C,Z}(r, z, \beta)\, d\nu_{C,Z}(r, z),
\]
Therefore,
\[
\left. \frac{\partial}{\partial\beta^T} \log\left\{ \frac{p_{C,Z}(r, z, \beta)}{p_{C,G_C(Z)}(r, g_r, \beta)} \right\} \right|_{\beta = \beta_0}
= \frac{\partial \log p_Z(z, \beta_0)}{\partial\beta^T} - \frac{\partial \log p_{G_r(Z)}(g_r, \beta_0)}{\partial\beta^T}. \qquad (14.29)
\]
Substituting this last result into (14.27) and rearranging some terms yields
\[
\int S^F(z, \beta_0)\bigl\{ S^{FT}(z, \beta_0) - S^T(r, g_r, \beta_0) \bigr\}\, p_{C,Z}(r, z, \beta_0)\, d\nu_{C,Z}(r, z)
= E\bigl\{ S^F(Z, \beta_0)\, S^{FT}(Z, \beta_0) \bigr\} - E\bigl[ S^F(Z, \beta_0)\, S^T\{C, G_C(Z), \beta_0\} \bigr]. \qquad (14.30)
\]
Remark 6. The result of Lemma 14.2 can be deduced, with slight modification,
from equation (6) on page 480 of Oakes (1999). The term I F (β0 ) − I(β0 ) is
also referred to as the “missing information,” as this is the information that
is lost due to the coarsening of the data. 
                                           
    Before giving the next lemma, we first give a short description of the notion
of “stochastic equicontinuity.”
Stochastic Equicontinuity

Consider the normalized process $W_n(\beta)$, formed from the imputed full-data scores centered at their expectation, as a process in $\beta$, where $\lambda(\beta,\beta_0)=E\big[S^F\{Z_{ij}(\beta),\beta_0\}\big]$.
    Using the theory of empirical processes (see van der Vaart and Wellner, 1996), under suitable regularity conditions the process $W_n(\beta)$ converges weakly to a tight Gaussian process. When this is the case, we have stochastic equicontinuity; that is, for every $\varepsilon,\eta>0$, there exist a $\delta>0$ and an $n_0$ such that
$$
P\Big[\sup_{\beta',\beta''\,:\,\|\beta'-\beta''\|<\delta}\big\|W_n(\beta')-W_n(\beta'')\big\|\ge\varepsilon\Big]\le\eta
$$
for all $n>n_0$, where $\|\cdot\|$ denotes the usual Euclidean norm or distance.
Using this stochastic equicontinuity property, one can show that (14.20) is equal to
$$
n^{1/2}\{\lambda(\hat\beta_n^I,\beta_0)-\lambda(\beta_0,\beta_0)\}+o_p(1).
$$
Lemma 14.5.
$$
n^{1/2}\{\lambda(\hat\beta_n^I,\beta_0)-\lambda(\beta_0,\beta_0)\}=\frac{\partial\lambda(\beta,\beta_0)}{\partial\beta^T}\bigg|_{\beta=\beta_0}\,n^{1/2}(\hat\beta_n^I-\beta_0)+o_p(1).
$$
Combining this with Lemma 14.2 and the influence-function expansion $n^{1/2}(\hat\beta_n^I-\beta_0)=n^{-1/2}\sum_{i=1}^{n}q\{C_i,G_{C_i}(Z_i)\}+o_p(1)$ establishes Theorem 14.3.
Theorem 14.4. Let $\hat\beta_n^*$ denote the multiple-imputation estimator, and denote the i-th influence function of the initial estimator by (14.13); i.e.,
$$
q\{C_i,G_{C_i}(Z_i)\}=\varphi_{\mathrm{eff}}\{C_i,G_{C_i}(Z_i)\}+h\{C_i,G_{C_i}(Z_i)\}.
$$
Then $n^{1/2}(\hat\beta_n^*-\beta_0)$ is asymptotically normal with mean zero and asymptotic variance $\Sigma^*$, where $\Sigma^*$ is equal to
$$
\{I(\beta_0)\}^{-1}+m^{-1}\{I^F(\beta_0)\}^{-1}\{I^F(\beta_0)-I(\beta_0)\}\{I^F(\beta_0)\}^{-1}
$$
$$
+\;\{I^F(\beta_0)\}^{-1}\{I^F(\beta_0)-I(\beta_0)\}\,\mathrm{var}\big[h\{C_i,G_{C_i}(Z_i)\}\big]\,\{I^F(\beta_0)-I(\beta_0)\}\{I^F(\beta_0)\}^{-1}. \qquad (14.31)
$$
Combining these results, we can write
$$
n^{1/2}(\hat\beta_n^*-\beta_0)=n^{-1/2}\sum_{i=1}^{n}\{I^F(\beta_0)\}^{-1}\bigg(\Big[m^{-1}\sum_{j=1}^{m}S^F\{Z_{ij}(\beta_0),\beta_0\}\Big]+\{I^F(\beta_0)-I(\beta_0)\}\,q\{C_i,G_{C_i}(Z_i)\}\bigg)+o_p(1). \qquad (14.32)
$$
    This is a key result, as we have identified the influence function for the
multiple-imputation estimator as the i-th element in the summand in (14.32).
Asymptotic normality follows from the central limit theorem. The asymptotic
variance of n1/2 (β̂n∗ − β0 ) is the variance of the influence function, which we
will now derive in a series of steps.
    Toward that end, we first compute
$$
\mathrm{var}\Big[m^{-1}\sum_{j=1}^{m}S^F\{Z_{ij}(\beta_0),\beta_0\}\Big]. \qquad (14.33)
$$
Computing (14.33)
Using the law of iterated conditional variance, (14.33) can be written as
$$
E\Big(\mathrm{var}\Big[m^{-1}\sum_{j=1}^{m}S^F\{Z_{ij}(\beta_0),\beta_0\}\,\Big|\,C_i,G_{C_i}(Z_i)\Big]\Big) \qquad (14.34)
$$
$$
+\;\mathrm{var}\Big(E\Big[m^{-1}\sum_{j=1}^{m}S^F\{Z_{ij}(\beta_0),\beta_0\}\,\Big|\,C_i,G_{C_i}(Z_i)\Big]\Big). \qquad (14.35)
$$
Because, conditional on $\{C_i,G_{C_i}(Z_i)\}$, the $Z_{ij}(\beta_0)$, $j=1,\dots,m$, are independent draws from the conditional distribution with conditional density $p_{Z|C,G_C(Z)}\{z|C_i,G_{C_i}(Z_i),\beta_0\}$, this means that the conditional variance
$$
\mathrm{var}\Big[m^{-1}\sum_{j=1}^{m}S^F\{Z_{ij}(\beta_0),\beta_0\}\,\Big|\,C_i,G_{C_i}(Z_i)\Big]
$$
is equal to
$$
m^{-1}\,\mathrm{var}\big[S^F(Z_i,\beta_0)\,\big|\,C_i,G_{C_i}(Z_i)\big].
$$
Therefore, (14.34) is equal to
$$
m^{-1}E\Big(\mathrm{var}\big[S^F(Z_i,\beta_0)\,\big|\,C_i,G_{C_i}(Z_i)\big]\Big).
$$
In equation (14.6), we showed
$$
E\Big(\mathrm{var}\big[S^F(Z_i,\beta_0)\,\big|\,C_i,G_{C_i}(Z_i)\big]\Big)=I^F(\beta_0)-I(\beta_0).
$$
Consequently,
$$
(14.34)=m^{-1}\{I^F(\beta_0)-I(\beta_0)\}. \qquad (14.36)
$$
Similar logic can be used to show
$$
E\Big[m^{-1}\sum_{j=1}^{m}S^F\{Z_{ij}(\beta_0),\beta_0\}\,\Big|\,C_i,G_{C_i}(Z_i)\Big]
=E\big[S^F(Z_i,\beta_0)\,\big|\,C_i,G_{C_i}(Z_i)\big]=S\{C_i,G_{C_i}(Z_i)\}. \qquad (14.37)
$$
Therefore, (14.35) is equal to
$$
\mathrm{var}\big[S\{C_i,G_{C_i}(Z_i)\}\big]=I(\beta_0). \qquad (14.38)
$$
Combining (14.36) and (14.38) gives us that
$$
(14.33)=m^{-1}\{I^F(\beta_0)-I(\beta_0)\}+I(\beta_0). \qquad (14.39)
$$
   Next we compute the covariance matrix
$$
E\bigg(\Big[m^{-1}\sum_{j=1}^{m}S^F\{Z_{ij}(\beta_0),\beta_0\}\Big]\Big[\{I^F(\beta_0)-I(\beta_0)\}\,q\{C_i,G_{C_i}(Z_i)\}\Big]^T\bigg). \qquad (14.40)
$$
Computing (14.40)
Expression (14.40) can be written as
$$
m^{-1}\sum_{j=1}^{m}E\big[S^F\{Z_{ij}(\beta_0),\beta_0\}\,q^T\{C_i,G_{C_i}(Z_i)\}\big]\,\{I^F(\beta_0)-I(\beta_0)\},
$$
where
$$
E\big[S^F\{Z_{ij}(\beta_0),\beta_0\}\,q^T\{C_i,G_{C_i}(Z_i)\}\big]
=E\Big(E\big[S^F\{Z_{ij}(\beta_0),\beta_0\}\,q^T\{C_i,G_{C_i}(Z_i)\}\,\big|\,C_i,G_{C_i}(Z_i)\big]\Big)
$$
$$
=E\Big(E\big[S^F\{Z_{ij}(\beta_0),\beta_0\}\,\big|\,C_i,G_{C_i}(Z_i)\big]\,q^T\{C_i,G_{C_i}(Z_i)\}\Big)
$$
(we showed in (14.37) that the inner conditional expectation equals $S\{C_i,G_{C_i}(Z_i)\}$)
$$
=E\big[S\{C_i,G_{C_i}(Z_i)\}\,q^T\{C_i,G_{C_i}(Z_i)\}\big]=I^{q\times q}\ \text{(identity matrix) by (14.12)}.
$$
Therefore,
$$
(14.40)=I^F(\beta_0)-I(\beta_0). \qquad (14.41)
$$
Finally,
$$
\mathrm{var}\big[\{I^F(\beta_0)-I(\beta_0)\}\,q\{C_i,G_{C_i}(Z_i)\}\big]
=\{I^F(\beta_0)-I(\beta_0)\}\,\mathrm{var}\big[q\{C_i,G_{C_i}(Z_i)\}\big]\,\{I^F(\beta_0)-I(\beta_0)\} \qquad (14.42)
$$
$$
=\{I^F(\beta_0)-I(\beta_0)\}\Big(I^{-1}(\beta_0)+\mathrm{var}\big[h\{C_i,G_{C_i}(Z_i)\}\big]\Big)\{I^F(\beta_0)-I(\beta_0)\}. \qquad (14.43)
$$
Putting these results together, the variance of the influence function in (14.32), and hence the asymptotic variance of $n^{1/2}(\hat\beta_n^*-\beta_0)$, is
$$
\Sigma^*=\{I(\beta_0)\}^{-1}+m^{-1}\{I^F(\beta_0)\}^{-1}\{I^F(\beta_0)-I(\beta_0)\}\{I^F(\beta_0)\}^{-1}
$$
$$
+\;\{I^F(\beta_0)\}^{-1}\{I^F(\beta_0)-I(\beta_0)\}\,\mathrm{var}\big[h\{C_i,G_{C_i}(Z_i)\}\big]\,\{I^F(\beta_0)-I(\beta_0)\}\{I^F(\beta_0)\}^{-1},
$$
which is (14.31), as stated in Theorem 14.4.
    Examining (14.32), we conclude that the influence function for the multiple-imputation estimator, with m imputation draws, is equal to
$$
\{I^F(\beta_0)\}^{-1}\bigg(m^{-1}\sum_{j=1}^{m}S^F\{Z_j(\beta_0),\beta_0\}+\{I^F(\beta_0)-I(\beta_0)\}\,q\{C,G_C(Z)\}\bigg), \qquad (14.44)
$$
where $Z_j(\beta_0)$ denotes the j-th random draw from the conditional distribution with conditional density $p_{Z|C,G_C(Z)}\{z|C,G_C(Z),\beta_0\}$. As a consequence of the law of large numbers,
we would expect, under suitable regularity conditions, that, as $m\to\infty$,
$$
m^{-1}\sum_{j=1}^{m}S^F\{Z_j(\beta_0),\beta_0\}\;\xrightarrow{\;P\;}\;E\{S^F(Z)\,|\,C,G_C(Z)\}=S\{C,G_C(Z)\}. \qquad (14.45)
$$
                                                '                 (
So, for example, if the conditional variance var S F (Z)|C, GC (Z) is bounded
almost surely, then (14.45) would hold. Consequently, as m → ∞, the influence
function of the multiple-imputation estimator (14.44) converges to
                                                                
               −1
      {I (β0 )}
        F
                   S{C, GC (Z)} + {I (β0 ) − I(β0 )}q{C, GC (Z)} .
                                     F
                                                                       (14.46)
We now show the relationship between the EM algorithm and the multiple-imputation estimator in the following theorem.

Theorem 14.5. Let the one-step updated EM estimator, $\hat\beta_n^{EM}$, be the solution to
$$
\sum_{i=1}^{n}E\big[S^F(Z,\hat\beta_n^{EM})\,\big|\,C_i,G_{C_i}(Z_i),\hat\beta_n^I\big]=0, \qquad (14.48)
$$
where $\hat\beta_n^I$ is the initial estimator for $\beta$, whose i-th influence function is $q\{C_i,G_{C_i}(Z_i)\}$, which was used for imputation. The influence function of $\hat\beta_n^{EM}$ is identically equal to (14.46), the limiting influence function for the multiple-imputation estimator as $m\to\infty$.
Proof. A simple expansion of $\hat\beta_n^{EM}$ about $\beta_0$, keeping $\hat\beta_n^I$ fixed, in (14.48) yields
$$
0=n^{-1/2}\sum_{i=1}^{n}E\big[S^F(Z,\beta_0)\,\big|\,C_i,G_{C_i}(Z_i),\hat\beta_n^I\big]
+\bigg[n^{-1}\sum_{i=1}^{n}E\Big\{\frac{\partial S^F(Z,\beta_0)}{\partial\beta^T}\,\Big|\,C_i,G_{C_i}(Z_i),\beta_0\Big\}\bigg]\,n^{1/2}(\hat\beta_n^{EM}-\beta_0)+o_p(1).
$$
Because
$$
n^{-1}\sum_{i=1}^{n}E\Big\{\frac{\partial S^F(Z,\beta_0)}{\partial\beta^T}\,\Big|\,C_i,G_{C_i}(Z_i),\beta_0\Big\}
\xrightarrow{\;P\;}E\bigg[E\Big\{\frac{\partial S^F(Z,\beta_0)}{\partial\beta^T}\,\Big|\,C_i,G_{C_i}(Z_i),\beta_0\Big\}\bigg]
=E\Big\{\frac{\partial S^F(Z,\beta_0)}{\partial\beta^T}\Big\}=-I^F(\beta_0),
$$
we obtain
$$
n^{1/2}(\hat\beta_n^{EM}-\beta_0)=\{I^F(\beta_0)\}^{-1}\,n^{-1/2}\sum_{i=1}^{n}E\big[S^F(Z,\beta_0)\,\big|\,C_i,G_{C_i}(Z_i),\hat\beta_n^I\big]+o_p(1). \qquad (14.49)
$$
An expansion of $\hat\beta_n^I$ about $\beta_0$ on the right-hand side of (14.49) yields
$$
n^{-1/2}\sum_{i=1}^{n}E\big[S^F(Z,\beta_0)\,\big|\,C_i,G_{C_i}(Z_i),\hat\beta_n^I\big]
=n^{-1/2}\sum_{i=1}^{n}E\big[S^F(Z,\beta_0)\,\big|\,C_i,G_{C_i}(Z_i),\beta_0\big] \qquad (14.50)
$$
$$
+\bigg[n^{-1}\sum_{i=1}^{n}\frac{\partial}{\partial\beta^T}E\big\{S^F(Z,\beta_0)\,\big|\,C_i,G_{C_i}(Z_i),\beta\big\}\Big|_{\beta=\beta_0}\bigg]\,n^{1/2}(\hat\beta_n^I-\beta_0)+o_p(1),
$$
where
$$
\frac{\partial}{\partial\beta^T}E\big\{S^F(Z,\beta_0)\,\big|\,C_i,G_{C_i}(Z_i),\beta\big\}\Big|_{\beta=\beta_0}
=\frac{\partial}{\partial\beta^T}\left[\frac{\int_{z:G_{C_i}(z)=G_{C_i}(Z_i)}S^F(z,\beta_0)\,p(z,\beta)\,d\nu(z)}{\int_{z:G_{C_i}(z)=G_{C_i}(Z_i)}p(z,\beta)\,d\nu(z)}\right]_{\beta=\beta_0}. \qquad (14.52)
$$
where the last equality follows from (14.6). Using (14.4), we obtain that (14.50) is equal to
$$
n^{-1/2}\sum_{i=1}^{n}S\{C_i,G_{C_i}(Z_i)\}. \qquad (14.54)
$$
   Recall from (14.13) that the influence function of the initial estimator can be written as
$$
q\{C,G_C(Z)\}=\varphi_{\mathrm{eff}}\{C,G_C(Z)\}+h\{C,G_C(Z)\}, \qquad (14.56)
$$
where $h\{C,G_C(Z)\}$ is orthogonal to $S\{C,G_C(Z)\}$; i.e., $E(hS^T)=0^{q\times q}$. Substituting (14.56) for $q\{C,G_C(Z)\}$ in (14.46), we obtain another representation of the influence function for $\hat\beta_n^{EM}$ as
$$
\varphi_{\mathrm{eff}}\{C,G_C(Z)\}+\{I^F(\beta_0)\}^{-1}\{I^F(\beta_0)-I(\beta_0)\}\,h\{C,G_C(Z)\}, \qquad (14.57)
$$
where
$$
\varphi_{\mathrm{eff}}\{C,G_C(Z)\}=\{I(\beta_0)\}^{-1}S\{C,G_C(Z)\}
$$
is the efficient observed-data influence function.
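To make the contrast between the multiple-imputation estimator and the one-step EM update concrete, the following is a minimal sketch (in Python), written under the assumption that the imputed data sets have already been generated; the names z_imp, full_data_score, mi_estimator, and one_step_em_estimator are hypothetical and not from the text. Type B multiple imputation solves the full-data estimating equation separately on each imputed data set and averages the m solutions, whereas the one-step EM update (14.48) solves a single equation in which the conditional expectation of the full-data score is approximated by averaging over the same imputed draws.

import numpy as np
from scipy import optimize

def mi_estimator(z_imp, full_data_score, beta_init):
    """Type B multiple imputation: solve the full-data estimating equation
    on each imputed data set, then average the m solutions.
    z_imp is a list of m arrays, each holding one completed data set;
    full_data_score(z, b) returns the n x q matrix of per-subject scores."""
    betas = []
    for z_j in z_imp:
        sol = optimize.root(lambda b: full_data_score(z_j, b).sum(axis=0),
                            beta_init)
        betas.append(sol.x)
    return np.mean(betas, axis=0)

def one_step_em_estimator(z_imp, full_data_score, beta_init):
    """One-step EM update in the spirit of (14.48): approximate the
    conditional expectation of the full-data score by averaging over the
    m imputed data sets, then solve the single pooled equation."""
    def pooled_score(b):
        return np.mean([full_data_score(z_j, b).sum(axis=0) for z_j in z_imp],
                       axis=0)
    return optimize.root(pooled_score, beta_init).x

For large m the two recipes behave similarly, in line with Theorem 14.5.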
    We note that the influence function in (14.57) is that for a one-step updated EM estimator that started with the initial estimator $\hat\beta_n^I$. Using similar logic, we would conclude that the EM algorithm after j iterations would yield an estimator $\hat\beta_n^{EM(j)}$ with influence function
$$
\varphi_{\mathrm{eff}}\{C,G_C(Z)\}+J^{\,j}\,h\{C,G_C(Z)\},
$$
where J is the $q\times q$ matrix $\{I^F(\beta_0)\}^{-1}\{I^F(\beta_0)-I(\beta_0)\}$. For completeness, we now show that $J^{\,j}$ will converge to zero (in the sense that all elements in the matrix will converge to zero) as j goes to infinity, thus demonstrating that the EM algorithm will converge to the efficient observed-data estimator. This proof is taken from Lemma A.1 of Wang and Robins (1998).
Because $I(\beta_0)$ is positive definite, this implies that all the elements on the diagonal of $\Lambda$ must be strictly less than 1. Consequently, writing $J^{\,j}=R^{-1}\Lambda^{\,j}R$, all elements of $J^{\,j}$ converge to zero as $j\to\infty$, and the EM iterates converge to the efficient observed-data estimator.
    Finally, using the relationship $\mathrm{var}[q\{C_i,G_{C_i}(Z_i)\}]=\{I(\beta_0)\}^{-1}+\mathrm{var}[h\{C_i,G_{C_i}(Z_i)\}]$, the asymptotic variance (14.31) of the type B multiple-imputation estimator can equivalently be written as
$$
\{I^F(\beta_0)\}^{-1}+\Big(\frac{m+1}{m}\Big)\{I^F(\beta_0)\}^{-1}\{I^F(\beta_0)-I(\beta_0)\}\{I^F(\beta_0)\}^{-1}
$$
$$
+\{I^F(\beta_0)\}^{-1}\{I^F(\beta_0)-I(\beta_0)\}\,\mathrm{var}\big[q\{C_i,G_{C_i}(Z_i)\}\big]\,\{I^F(\beta_0)-I(\beta_0)\}\{I^F(\beta_0)\}^{-1}. \qquad (14.59)
$$
In this section, we will consider estimators for the asymptotic variance of the
frequentist (type B) multiple-imputation estimator. Although Rubin (1987)
refers to this type of imputation as improper and does not advocate using his
intuitive variance estimator in such cases, my experience has been that, in
practice, many statisticians do not distinguish between proper and improper
imputation and will often use Rubin’s variance formula. Therefore, we begin
this section by studying the properties of Rubin’s variance formula when used
with frequentist multiple imputation.
    Rubin suggested the following estimator for the asymptotic variance of $n^{1/2}(\hat\beta_n^*-\beta_0)$:
$$
m^{-1}\sum_{j=1}^{m}\bigg[n^{-1}\sum_{i=1}^{n}\frac{-\partial S^F\{Z_{ij}(\hat\beta_n^I),\hat\beta_{nj}^*\}}{\partial\beta^T}\bigg]^{-1}
+\Big(\frac{m+1}{m}\Big)\sum_{j=1}^{m}\frac{n\,(\hat\beta_{nj}^*-\hat\beta_n^*)(\hat\beta_{nj}^*-\hat\beta_n^*)^T}{m-1}. \qquad (14.60)
$$
   Results regarding the second term of (14.60) are given in the following
theorem.
Theorem 14.6.
$$
E\bigg\{n\sum_{j=1}^{m}\big(\hat\beta_{nj}^*-\hat\beta_n^*\big)\big(\hat\beta_{nj}^*-\hat\beta_n^*\big)^T\bigg\}
\;\xrightarrow{\,n\to\infty\,}\;(m-1)\{I^F(\beta_0)\}^{-1}\{I^F(\beta_0)-I(\beta_0)\}\{I^F(\beta_0)\}^{-1}.
$$
Proof. From the expansion (14.32), applied to the individual imputed estimators $\hat\beta_{nj}^*$ and to their average $\hat\beta_n^*$, we have
$$
n^{1/2}\big(\hat\beta_{nj}^*-\hat\beta_n^*\big)=n^{-1/2}\sum_{i=1}^{n}\{I^F(\beta_0)\}^{-1}\big[S^F\{Z_{ij}(\beta_0),\beta_0\}-\bar S_i^F(\beta_0)\big]+o_p(1),
$$
where
$$
\bar S_i^F(\beta_0)=m^{-1}\sum_{j=1}^{m}S^F\{Z_{ij}(\beta_0),\beta_0\}.
$$
(Notice that the terms involving $q\{C_i,G_{C_i}(Z_i)\}$ in (14.32) cancel out.) Therefore,
$$
n\big(\hat\beta_{nj}^*-\hat\beta_n^*\big)\big(\hat\beta_{nj}^*-\hat\beta_n^*\big)^T=
n^{-1}\bigg[\sum_{i=1}^{n}\{I^F(\beta_0)\}^{-1}\big[S^F\{Z_{ij}(\beta_0),\beta_0\}-\bar S_i^F(\beta_0)\big]\bigg]
\bigg[\sum_{i=1}^{n}\big[S^F\{Z_{ij}(\beta_0),\beta_0\}-\bar S_i^F(\beta_0)\big]^T\{I^F(\beta_0)\}^{-1}\bigg]+o_p(1).
$$
Because the quantities inside the two sums above are independent across $i=1,\dots,n$, we obtain
$$
E\big\{n\big(\hat\beta_{nj}^*-\hat\beta_n^*\big)\big(\hat\beta_{nj}^*-\hat\beta_n^*\big)^T\big\}\;\to\;
E\Big(\{I^F(\beta_0)\}^{-1}\big[S^F\{Z_{ij}(\beta_0),\beta_0\}-\bar S_i^F(\beta_0)\big]\big[S^F\{Z_{ij}(\beta_0),\beta_0\}-\bar S_i^F(\beta_0)\big]^T\{I^F(\beta_0)\}^{-1}\Big)
$$
and
$$
E\bigg\{n\sum_{j=1}^{m}\big(\hat\beta_{nj}^*-\hat\beta_n^*\big)\big(\hat\beta_{nj}^*-\hat\beta_n^*\big)^T\bigg\}\;\to\;
\{I^F(\beta_0)\}^{-1}\,E\bigg(\sum_{j=1}^{m}\big[S^F\{Z_{ij}(\beta_0),\beta_0\}-\bar S_i^F(\beta_0)\big]\big[S^F\{Z_{ij}(\beta_0),\beta_0\}-\bar S_i^F(\beta_0)\big]^T\bigg)\{I^F(\beta_0)\}^{-1}. \qquad (14.61)
$$
The middle expectation in (14.61) equals
$$
E\bigg(\sum_{j=1}^{m}\big[S^F\{Z_{ij}(\beta_0),\beta_0\}-\bar S_i^F(\beta_0)\big]\big[S^F\{Z_{ij}(\beta_0),\beta_0\}-\bar S_i^F(\beta_0)\big]^T\bigg)
=E\bigg(\sum_{j=1}^{m}S^F\{Z_{ij}(\beta_0),\beta_0\}S^{F\,T}\{Z_{ij}(\beta_0),\beta_0\}\bigg) \qquad (14.62)
$$
$$
-\frac{1}{m}\,E\bigg(\sum_{j=1}^{m}\sum_{j'=1}^{m}S^F\{Z_{ij}(\beta_0),\beta_0\}S^{F\,T}\{Z_{ij'}(\beta_0),\beta_0\}\bigg).
$$
When $j=j'$,
$$
E\big[S^F\{Z_{ij}(\beta_0),\beta_0\}S^{F\,T}\{Z_{ij'}(\beta_0),\beta_0\}\big]=E\big[S^F(Z_i,\beta_0)S^{F\,T}(Z_i,\beta_0)\big]=I^F(\beta_0),
$$
whereas when $j\neq j'$,
$$
E\big[S^F\{Z_{ij}(\beta_0),\beta_0\}S^{F\,T}\{Z_{ij'}(\beta_0),\beta_0\}\big]
=\mathrm{cov}\big[S^F\{Z_{ij}(\beta_0),\beta_0\},\,S^F\{Z_{ij'}(\beta_0),\beta_0\}\big]
$$
$$
=E\Big(\mathrm{cov}\big[S^F\{Z_{ij}(\beta_0),\beta_0\},\,S^F\{Z_{ij'}(\beta_0),\beta_0\}\,\big|\,C_i,G_{C_i}(Z_i)\big]\Big) \qquad (14.63)
$$
$$
+\,\mathrm{cov}\Big(E\big[S^F\{Z_{ij}(\beta_0),\beta_0\}\,\big|\,C_i,G_{C_i}(Z_i)\big],\,E\big[S^F\{Z_{ij'}(\beta_0),\beta_0\}\,\big|\,C_i,G_{C_i}(Z_i)\big]\Big).
$$
    Because, conditional on $C_i,G_{C_i}(Z_i)$, the $Z_{ij}(\beta_0)$ are independent draws, for different j, from the conditional distribution with conditional density $p_{Z|C,G_C(Z)}\{z|C_i,G_{C_i}(Z_i)\}$, the first term of (14.63) (the conditional covariance) is zero. Because $E\big[S^F\{Z_{ij}(\beta_0),\beta_0\}\,\big|\,C_i,G_{C_i}(Z_i)\big]=S\{C_i,G_{C_i}(Z_i)\}$ for all $j=1,\dots,m$, the second term of (14.63) is
$$
\mathrm{var}\big[S\{C_i,G_{C_i}(Z_i)\}\big]=I(\beta_0).
$$
    This follows directly from (14.63). Another estimator for $\{I^F(\beta_0)-I(\beta_0)\}$ is motivated from the relationship (14.6), which states that
$$
I^F(\beta_0)-I(\beta_0)=E\Big(\mathrm{var}\big[S^F\{Z_{ij}(\beta_0),\beta_0\}\,\big|\,C_i,G_{C_i}(Z_i)\big]\Big).
$$
Remark 8. This logic seems a bit circular since Bayesian inference is based on
deriving the posterior distribution of β given the observed data. Under suitable
regularity conditions on the choice of the prior distribution, the posterior mean
or mode of β is generally an efficient estimator for β. Therefore, using proper
imputation, where we draw from the posterior distribution of β, we start with
an efficient estimator and, after imputing m full data sets, we end up with an
estimator that is not efficient. □
    When the sample size is large, the posterior distribution of the parameter and the sampling distribution of the estimator closely approximate each other. The initial estimator $\hat\beta_n^I$ was assumed to be asymptotically normal; that is,
$$
n^{1/2}(\hat\beta_n^I-\beta_0)\sim N\big(0,\,\mathrm{var}[q\{C,G_C(Z)\}]\big),
$$
where $q\{C,G_C(Z)\}$ denotes the influence function of $\hat\beta_n^I$. Therefore, mimicking the idea of Bayesian imputation, instead of fixing the value at $\hat\beta_n^I$ for each of the m imputations, at the j-th imputation we sample $\beta^{(j)}$ from
$$
N\bigg(\hat\beta_n^I,\;\frac{\widehat{\mathrm{var}}[q\{C,G_C(Z)\}]}{n}\bigg),
$$
where $\widehat{\mathrm{var}}[q\{C,G_C(Z)\}]$ is a consistent estimator for the asymptotic variance, and then randomly choose $Z_{ij}(\beta^{(j)})$ from the conditional distribution with conditional density $p_{Z|C,G_C(Z)}\{z|C_i,G_{C_i}(Z_i),\beta^{(j)}\}$.
Remark 9. If $\hat\beta_n^I$ were efficient, say the MLE, then this would approximate sampling the $\beta$'s from the posterior distribution and the Z's from the predictive distribution. □
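A minimal sketch of the two imputation schemes side by side (Python; impute_full_data and solve_full_data_equation are hypothetical helpers standing in for the problem-specific imputation and estimation steps):

import numpy as np

def multiple_imputation(obs_data, beta_init, var_q_hat, n, m,
                        impute_full_data, solve_full_data_equation,
                        proper=False, rng=np.random.default_rng(0)):
    """Type B (improper) imputation fixes the initial estimate beta_init for
    every imputation; type A (proper) imputation redraws the parameter
    beta_j ~ N(beta_init, var_q_hat / n) before each imputation."""
    estimates = []
    for _ in range(m):
        beta_draw = (rng.multivariate_normal(beta_init, var_q_hat / n)
                     if proper else beta_init)
        completed = impute_full_data(obs_data, beta_draw, rng)
        estimates.append(solve_full_data_equation(completed))
    return np.mean(estimates, axis=0)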
Using this approach, the j-th imputed estimator is the solution to the equation
$$
\sum_{i=1}^{n}S^F\{Z_{ij}(\beta^{(j)}),\hat\beta_{nj}^*\}=0.
$$
Using the same expansion that led us to (14.15), we again obtain that
$$
n^{1/2}(\hat\beta_n^*-\beta_0)=n^{-1/2}\sum_{i=1}^{n}\{I^F(\beta_0)\}^{-1}\bigg[m^{-1}\sum_{j=1}^{m}S^F\{Z_{ij}(\beta^{(j)}),\beta_0\}\bigg]+o_p(1). \qquad (14.69)
$$
(Note here the dependence on j through $\beta^{(j)}$.)
Also, the same logic that was used for the multiple-imputation (improper) estimator leads us to the relationship
$$
n^{-1/2}\sum_{i=1}^{n}\bigg[m^{-1}\sum_{j=1}^{m}S^F\{Z_{ij}(\beta^{(j)}),\beta_0\}\bigg]
=n^{-1/2}\sum_{i=1}^{n}\bigg[m^{-1}\sum_{j=1}^{m}S^F\{Z_{ij}(\beta_0),\beta_0\}\bigg]
$$
$$
+\{I^F(\beta_0)-I(\beta_0)\}\,m^{-1}\sum_{j=1}^{m}n^{1/2}(\beta^{(j)}-\beta_0)+o_p(1). \qquad (14.70)
$$
Writing $n^{1/2}(\beta^{(j)}-\beta_0)=n^{1/2}(\beta^{(j)}-\hat\beta_n^I)+n^{1/2}(\hat\beta_n^I-\beta_0)$, expression (14.70) equals
$$
n^{-1/2}\sum_{i=1}^{n}\bigg[m^{-1}\sum_{j=1}^{m}S^F\{Z_{ij}(\beta_0),\beta_0\}\bigg]
+\{I^F(\beta_0)-I(\beta_0)\}\,m^{-1}\sum_{j=1}^{m}n^{1/2}(\beta^{(j)}-\hat\beta_n^I)
$$
$$
+\{I^F(\beta_0)-I(\beta_0)\}\,n^{1/2}(\hat\beta_n^I-\beta_0)+o_p(1). \qquad (14.71)
$$
Because $n^{1/2}(\hat\beta_n^I-\beta_0)=n^{-1/2}\sum_{i=1}^{n}q\{C_i,G_{C_i}(Z_i)\}+o_p(1)$, we can write (14.71) as
$$
n^{-1/2}\sum_{i=1}^{n}\bigg(\bigg[m^{-1}\sum_{j=1}^{m}S^F\{Z_{ij}(\beta_0),\beta_0\}\bigg]+\{I^F(\beta_0)-I(\beta_0)\}\,q\{C_i,G_{C_i}(Z_i)\}\bigg) \qquad (14.72)
$$
$$
+\;m^{-1}\sum_{j=1}^{m}\{I^F(\beta_0)-I(\beta_0)\}\,n^{1/2}(\beta^{(j)}-\hat\beta_n^I)+o_p(1). \qquad (14.73)
$$
Note that (14.72) is a term that was derived when we considered type B multiple-imputation estimators in the previous section (improper imputation), whereas (14.73) is an additional term due to sampling the $\beta^{(j)}$'s from
$$
N\bigg(\hat\beta_n^I,\;\frac{\widehat{\mathrm{var}}[q\{C_i,G_{C_i}(Z_i)\}]}{n}\bigg).
$$
Therefore,
$$
n^{1/2}(\hat\beta_n^*-\beta_0)=\{I^F(\beta_0)\}^{-1}\{(14.72)+(14.73)\};
$$
that is,
$$
n^{1/2}(\hat\beta_n^*-\beta_0)=n^{-1/2}\sum_{i=1}^{n}\{I^F(\beta_0)\}^{-1}\bigg(\bigg[m^{-1}\sum_{j=1}^{m}S^F\{Z_{ij}(\beta_0),\beta_0\}\bigg]+\{I^F(\beta_0)-I(\beta_0)\}\,q\{C_i,G_{C_i}(Z_i)\}\bigg) \qquad (14.74)
$$
$$
+\;m^{-1}\sum_{j=1}^{m}\{I^F(\beta_0)\}^{-1}\{I^F(\beta_0)-I(\beta_0)\}\,V_j \qquad (14.75)
$$
$$
+\;o_p(1),
$$
where $V_j=n^{1/2}(\beta^{(j)}-\hat\beta_n^I)$, $j=1,\dots,m$, denote the standardized parameter draws.
Because (14.74) converges to a normal distribution with mean zero and variance matrix (14.59), and (14.75) is distributed as normal with mean zero and variance matrix
$$
m^{-1}\{I^F(\beta_0)\}^{-1}\{I^F(\beta_0)-I(\beta_0)\}\,\mathrm{var}[q\{C,G_C(Z)\}]\,\{I^F(\beta_0)-I(\beta_0)\}\{I^F(\beta_0)\}^{-1} \qquad (14.76)
$$
and is independent of (14.74), this implies that $n^{1/2}(\hat\beta_n^*-\beta_0)$ is asymptotically normal with mean zero and asymptotic variance equal to (14.59) + (14.76), which equals
$$
\{I^F(\beta_0)\}^{-1}+\Big(\frac{m+1}{m}\Big)\{I^F(\beta_0)\}^{-1}\{I^F(\beta_0)-I(\beta_0)\}\{I^F(\beta_0)\}^{-1}
$$
$$
+\Big(\frac{m+1}{m}\Big)\{I^F(\beta_0)\}^{-1}\{I^F(\beta_0)-I(\beta_0)\}\,\mathrm{var}[q\{C,G_C(Z)\}]\,\{I^F(\beta_0)-I(\beta_0)\}\{I^F(\beta_0)\}^{-1}. \qquad (14.77)
$$
Comparing (14.77) with (14.59), we see that the estimator using “proper”
imputation has greater variance (is less efficient) than the corresponding “im-
proper” imputation estimator, which fixes β̂nI at each imputation. This makes
intuitive sense since we are introducing additional variability by sampling the
β’s from some distribution at each imputation. The increase in the variance
is given by (14.76). The variances of the two methods converge as m goes to
infinity.
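As a small numerical illustration with hypothetical scalar values (not an example from the text), take $q=1$, $I^F(\beta_0)=2$, $I(\beta_0)=1$, $m=5$, and suppose the initial estimator is efficient, so that $\mathrm{var}[q\{C,G_C(Z)\}]=I^{-1}(\beta_0)=1$. Then
$$
(14.59)=\tfrac12+\tfrac{6}{5}\cdot\tfrac12\cdot1\cdot\tfrac12+\tfrac12\cdot1\cdot1\cdot1\cdot\tfrac12=0.5+0.3+0.25=1.05,
$$
$$
(14.77)=\tfrac12+\tfrac{6}{5}\cdot\tfrac12\cdot1\cdot\tfrac12+\tfrac{6}{5}\cdot\tfrac12\cdot1\cdot1\cdot1\cdot\tfrac12=0.5+0.3+0.3=1.10,
$$
so the increase (14.76) equals $\tfrac15\cdot\tfrac12\cdot1\cdot1\cdot1\cdot\tfrac12=0.05$. Both variances exceed the efficient observed-data variance $I^{-1}(\beta_0)=1$ and, because the initial estimator was taken to be efficient ($h=0$), both decrease toward it as $m\to\infty$.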
    Let us now study the properties of Rubin's formula for the asymptotic variance when applied to type A (proper imputation) multiple-imputation estimators. From (14.74) and (14.75),
$$
n^{1/2}\big(\hat\beta_{nj}^*-\hat\beta_n^*\big)=n^{-1/2}\sum_{i=1}^{n}\{I^F(\beta_0)\}^{-1}\big[S^F\{Z_{ij}(\beta_0),\beta_0\}-\bar S_i^F(\beta_0)\big]
+\{I^F(\beta_0)\}^{-1}\{I^F(\beta_0)-I(\beta_0)\}(V_j-\bar V)+o_p(1),
$$
where $\bar V=m^{-1}\sum_{j=1}^{m}V_j$. Therefore,
$$
E\bigg\{n\sum_{j=1}^{m}\big(\hat\beta_{nj}^*-\hat\beta_n^*\big)\big(\hat\beta_{nj}^*-\hat\beta_n^*\big)^T\bigg\}\;\xrightarrow{\,n\to\infty\,}\;\{I^F(\beta_0)\}^{-1}\times
$$
$$
\bigg\{E\bigg(\sum_{j=1}^{m}\big[S^F\{Z_{ij}(\beta_0),\beta_0\}-\bar S_i^F(\beta_0)\big]\big[S^F\{Z_{ij}(\beta_0),\beta_0\}-\bar S_i^F(\beta_0)\big]^T\bigg)
$$
$$
+\{I^F(\beta_0)-I(\beta_0)\}\,E\bigg(\sum_{j=1}^{m}(V_j-\bar V)(V_j-\bar V)^T\bigg)\{I^F(\beta_0)-I(\beta_0)\}\bigg\}\times\{I^F(\beta_0)\}^{-1}. \qquad (14.78)
$$
Remark 10. The term involving $q\{C_i,G_{C_i}(Z_i)\}$ is common to all j and hence drops out when considering $\{\hat\beta_{nj}^*-\hat\beta_n^*\}$. Also, the additivity of the expectations in (14.78) is a consequence of the fact that the $V_j$ are generated independently of all the data. □
Also,
$$
E\bigg\{\sum_{j=1}^{m}(V_j-\bar V)(V_j-\bar V)^T\bigg\}=(m-1)\,\mathrm{var}[q\{C,G_C(Z)\}]. \qquad (14.80)
$$
14.7 Surrogate Marker Problem Revisited
Summary

The observed data are
$$
(R_i,\,Y_i,\,W_i,\,R_iX_i),\quad i=1,\dots,n,
$$
and the parameters are to be estimated using the observed data. We will now illustrate how we would use multiple imputation. First, we need to derive an initial estimator $\hat\beta_n^I$. One possibility is to use an inverse weighted complete-case estimator. This is particularly attractive because we know the probability of being included in the validation set by design; that is, $P[R=1|Y,W]=\pi(Y,W)$.
    If we had full data, then we could estimate β using standard likelihood estimating equations; namely,
$$
\sum_{i=1}^{n}X_i\{Y_i-\mathrm{expit}(\theta^TX_i)\}=0,
$$
$$
\sum_{i=1}^{n}(X_i-\mu_X)=0, \qquad \sum_{i=1}^{n}(W_i-\mu_W)=0,
$$
$$
\sum_{i=1}^{n}\{(X_i-\mu_X)(X_i-\mu_X)^T-\sigma_{XX}\}=0,
$$
$$
\sum_{i=1}^{n}\{(W_i-\mu_W)(W_i-\mu_W)^T-\sigma_{WW}\}=0,
$$
$$
\sum_{i=1}^{n}\{(X_i-\mu_X)(W_i-\mu_W)^T-\sigma_{XW}\}=0, \qquad (14.83)
$$
where
$$
\mathrm{expit}(u)=\frac{\exp(u)}{1+\exp(u)}.
$$
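As a sketch of how (14.83) might be solved with full data (Python, with hypothetical array names X, W, Y): the logistic part uses a few Newton-Raphson steps, and the remaining equations have closed-form solutions as sample moments.

import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def solve_full_data_equations(X, W, Y, n_iter=25):
    """Solve the full-data estimating equations (14.83): logistic regression
    of Y on X for theta, and sample moments for mu_X, mu_W, sigma_XX,
    sigma_WW, sigma_XW."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iter):                      # Newton-Raphson for theta
        mu = expit(X @ theta)
        score = X.T @ (Y - mu)
        hess = -(X * (mu * (1 - mu))[:, None]).T @ X
        theta -= np.linalg.solve(hess, score)
    mu_X, mu_W = X.mean(axis=0), W.mean(axis=0)
    Xc, Wc = X - mu_X, W - mu_W
    return dict(theta=theta,
                mu_X=mu_X, mu_W=mu_W,
                sigma_XX=Xc.T @ Xc / n,
                sigma_WW=Wc.T @ Wc / n,
                sigma_XW=Xc.T @ Wc / n)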
   Using the observed data, an inverse weighted complete-case estimator for β can be obtained by solving the equations
$$
\sum_{i=1}^{n}\frac{R_i}{\pi(Y_i,W_i)}\,X_i\{Y_i-\mathrm{expit}(\theta^TX_i)\}=0,
$$
$$
\sum_{i=1}^{n}\frac{R_i}{\pi(Y_i,W_i)}\,(X_i-\mu_X)=0,
$$
$$
\vdots\quad\text{etc.} \qquad (14.84)
$$
Remark 11. If we are interested only in the parameter θ of the logistic regression model, then we only need to solve (14.84) and not rely on any assumption of normality of X, W. However, to use multiple imputation, we need to derive initial estimators for all the parameters. □
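A sketch of the inverse weighted complete-case version of the same computation (again with hypothetical names, reusing the expit helper from the previous sketch; pi_i holds the known validation probabilities π(Y_i, W_i) and R flags the complete cases) restricts to the validation set and weights each estimating-equation contribution by $R_i/\pi(Y_i,W_i)$.

import numpy as np

def solve_ipw_complete_case(X, W, Y, R, pi_i, n_iter=25):
    """Weighted analogue of (14.84): keep only complete cases (R == 1) and
    weight each contribution by 1 / pi_i; the weighted moment equations
    have closed-form solutions."""
    keep = R == 1
    w = 1.0 / pi_i[keep]
    Xv, Wv, Yv = X[keep], W[keep], Y[keep]
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):                      # weighted Newton-Raphson
        mu = expit(Xv @ theta)
        score = Xv.T @ (w * (Yv - mu))
        hess = -(Xv * (w * mu * (1 - mu))[:, None]).T @ Xv
        theta -= np.linalg.solve(hess, score)
    sw = w.sum()
    mu_X = (w[:, None] * Xv).sum(axis=0) / sw
    mu_W = (w[:, None] * Wv).sum(axis=0) / sw
    Xc, Wc = Xv - mu_X, Wv - mu_W
    return dict(theta=theta, mu_X=mu_X, mu_W=mu_W,
                sigma_XX=(w[:, None] * Xc).T @ Xc / sw,
                sigma_WW=(w[:, None] * Wc).T @ Wc / sw,
                sigma_XW=(w[:, None] * Xc).T @ Wc / sw)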
Clearly, when $R_i=1$ (the complete case), we just use the observed value $X_i$. However, when $R_i=0$, we must sample from the conditional distribution of X given $(Y_i,W_i)$, with density $p_{X|Y,W}(x|Y_i,W_i,\hat\beta_n^I)$.
How Do We Sample?

We now describe the use of rejection sampling to obtain random draws from the conditional distribution with density $p_{X|Y,W}(x|Y_i,W_i,\hat\beta_n^I)$. Using Bayes's rule and the surrogacy assumption (14.82), the conditional density is derived as
$$
p_{X|Y,W}(x|y,w)=\frac{p_{Y|X}(y|x)\,p_{X|W}(x|w)}{\int p_{Y|X}(y|x)\,p_{X|W}(x|w)\,dx}.
$$
Under the assumed joint normality of (X, W), the density $p_{X|W}(x|W_i)$ is evaluated at the initial estimates as a normal density with mean
$$
\hat\mu^I_{X_n}+\hat\sigma^I_{XW_n}\big[\hat\sigma^I_{WW_n}\big]^{-1}(W_i-\hat\mu^I_{W_n})
$$
and variance
$$
\mathrm{var}(X|W)=\hat\sigma^I_{XX_n}-\hat\sigma^I_{XW_n}\big[\hat\sigma^I_{WW_n}\big]^{-1}\hat\sigma^{I\,T}_{XW_n}. \qquad (14.86)
$$
We generate a candidate X from this normal density together with an independent uniform(0, 1) random variable U, and we keep the candidate whenever
$$
U\le\frac{\exp(\hat\theta_n^{I\,T}X\,Y_i)}{1+\exp(\hat\theta_n^{I\,T}X)};
$$
otherwise, we keep repeating this process until we "keep" an X that we use for the j-th imputation $X_{ij}(\hat\beta_n^I)$. This rejection sampling scheme guarantees a random draw from
$$
p_{X|Y,W}(x|Y_i,W_i).
$$
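A minimal sketch of this rejection sampler (Python; est is a hypothetical dictionary of initial estimates, for example the output of the solve_ipw_complete_case sketch above):

import numpy as np

def draw_missing_x(Y_i, W_i, est, rng):
    """Rejection sampling from p_{X|Y,W}(x | Y_i, W_i): propose X from the
    normal p_{X|W}(x | W_i) implied by the initial estimates, and accept
    with probability exp(theta'X * Y_i) / {1 + exp(theta'X)}."""
    cond_mean = est['mu_X'] + est['sigma_XW'] @ np.linalg.solve(
        est['sigma_WW'], W_i - est['mu_W'])
    cond_var = est['sigma_XX'] - est['sigma_XW'] @ np.linalg.solve(
        est['sigma_WW'], est['sigma_XW'].T)
    while True:
        x = rng.multivariate_normal(cond_mean, cond_var)
        lin = est['theta'] @ x
        accept_prob = np.exp(lin * Y_i) / (1.0 + np.exp(lin))
        if rng.uniform() <= accept_prob:
            return x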
Therefore, at the j-th imputation, together with $Y_i$ and $W_i$, which we always observe, we use $X_i$ if $R_i=1$ and $X_{ij}(\hat\beta_n^I)$ if $R_i=0$ to create the j-th pseudo-full data set. This j-th imputed data set is then used to obtain the estimator $\hat\beta_{nj}^*$ by solving (14.83). Standard software packages will do this.
    The final estimate is
$$
\hat\beta_n^*=m^{-1}\sum_{j=1}^{m}\hat\beta_{nj}^*.
$$
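Putting the pieces together, a sketch of the whole procedure (reusing the hypothetical helpers solve_ipw_complete_case, solve_full_data_equations, and draw_missing_x from the sketches above):

import numpy as np

def surrogate_marker_mi(X, W, Y, R, pi_i, m=10, seed=0):
    """Frequentist (type B) multiple imputation for the surrogate marker
    problem: impute X for R == 0 by rejection sampling at the initial
    estimates, refit (14.83) on each completed data set, and average."""
    rng = np.random.default_rng(seed)
    est_init = solve_ipw_complete_case(X, W, Y, R, pi_i)   # initial estimator
    thetas = []
    for _ in range(m):
        X_imp = X.copy()
        for i in np.where(R == 0)[0]:
            X_imp[i] = draw_missing_x(Y[i], W[i], est_init, rng)
        thetas.append(solve_full_data_equations(X_imp, W, Y)['theta'])
    return np.mean(thetas, axis=0)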
  Fleming, T.R. and Harrington, D.P. (1991). Counting Processes and Sur-
       vival Analysis. Wiley, New York.
  Gill, R.D., van der Laan, M.J., and Robins, J.M. (1997). Coarsening at
       random: Characterizations, conjectures and counterexamples. Proceed-
       ings of the First Seattle Symposium in Biostatistics: Survival Analysis,
       Springer, New York, pp. 255–294.
  Hájek, J. (1970). A characterization of limiting distributions of regular es-
       timates. Zeitschrift Wahrscheinlichkeitstheorie und Verwandte Gebiete
       14, 323–330.
  Hájek, J. and Sidak, Z. (1967). Theory of Rank Tests. Academic Press, New
       York.
  Hampel, F.R. (1974). The influence curve and its role in robust estimation.
       Journal of the American Statistical Association 69, 383–393.
  Heitjan, D.F. (1993). Ignorability and coarse data: Some biomedical exam-
ples. Biometrics 49, 1099–1109.
  Heitjan, D.F. and Rubin, D.B. (1991). Ignorability and coarse data. Annals
       of Statistics 19, 2244–2253.
  Holland, P.W. (1986). Statistics and causal inference (with discussion).
       Journal of the American Statistical Association 81, 945–970.
  Horvitz, D.G. and Thompson, D.J. (1952). A generalization of sampling
       without replacement from a finite universe. Journal of the American
       Statistical Association 47, 663–685.
  Hu, P. and Tsiatis, A.A. (1996). Estimating the survival distribution when
       ascertainment of vital status is subject to delay. Biometrika 83, 371–
       380.
  Kaplan, E.L. and Meier, P. (1958). Nonparametric estimation from incom-
       plete observations. Journal of the American Statistical Association 53,
       457–481.
  Kress, R. (1989). Linear Integral Equations. Springer-Verlag, Berlin.
  LeCam, L. (1953). On some asymptotic properties of maximum likelihood
       estimates and related Bayes estimates. University of California Publi-
       cations in Statistics 1, 227–330.
  Leon, S., Tsiatis, A.A., and Davidian, M. (2003). Semiparametric estima-
       tion of treatment effect in a pretest-posttest study. Biometrics 59,
       1046–1055.
  Liang, K-Y. and Zeger, S.L. (1986). Longitudinal data analysis using gen-
       eralized linear models. Biometrika 73, 13–22.
  Lipsitz, S.R., Ibrahim, J.G., and Zhao, L.P. (1999). A weighted estimating
       equation for missing covariate data with properties similar to maximum
       likelihood. Journal of the American Statistical Association 94, 1147–
       1160.
  Littell, R.C., Milliken, G.A., Stroup, W.W., and Wolfinger, R.D. (1996).
       SAS System for Mixed Models. SAS Institute, Inc., Cary, NC.
  Little, R.J.A. and Rubin, D.B. (1987). Statistical Analysis with Missing
       Data. Wiley, New York.
Rubin, D.B. (1978a). Bayesian inference for causal effects: The role of ran-
    domization. Annals of Statistics 6, 34–58.
Rubin, D.B. (1978b). Multiple imputations in sample surveys: A phe-
    nomenological Bayesian approach to nonresponse (with discussion).
    American Statistical Association Proceedings of the Section on Sur-
    vey Research Methods, American Statistical Association, Alexandria,
    VA, pp. 20–34.
Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. Wi-
    ley, New York.
Rubin, D.B. (1990). Comment: Neyman (1923) and causal inference in
    experiments and observational studies. Statistical Science 5, 472–480.
Rubin, D.B. (1996). Multiple imputation after 18+ years (with discussion).
    Journal of the American Statistical Association 91, 473–520.
Rubin, D.B. (1997). Estimating causal effects from large data sets using
    propensity scores. Annals of Internal Medicine 127, 757–763.
Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. Chapman
    and Hall, London.
Scharfstein, D.O., Rotnitzky, A., and Robins, J.M. (1999). Adjusting for
    nonignorable drop-out using semiparametric nonresponse models (with
     discussion). Journal of the American Statistical Association 94, 1096–
    1146.
Stefanski, L.A. and Boos, D.D. (2002). The calculus of M-estimation.
    American Statistician 56, 29–38.
Strawderman, R.L. (2000). Estimating the mean of an increasing stochastic
    process at a censored stopping time. Journal of the American Statistical
    Association 95, 1192–1208.
Tsiatis, A.A. (1998). Competing risks. In Encyclopedia of Biostatistics.
    Wiley, New York, pp. 824–834.
van der Laan, M.J. and Hubbard, A.E. (1998). Locally efficient estimation
    of the survival distribution with right-censored data and covariates
    when collection of data is delayed. Biometrika 85, 771–783.
van der Laan, M.J. and Hubbard, A.E. (1999). Locally efficient estimation
    of the quality-adjusted lifetime distribution with right-censored data
    and covariates. Biometrics 55, 530–536.
van der Laan, M.J., Hubbard, A.E., and Robins, J.M. (2002). Locally ef-
    ficient estimation of a multivariate survival function in longitudinal
    studies. Journal of the American Statistical Association 97, 494–507.
van der Laan, M.J. and Robins, J.M. (2003). Unified Methods for Censored
    Longitudinal Data and Causality. Springer-Verlag, New York.
van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and
    Empirical Processes with Applications to Statistics. Springer-Verlag,
    New York.
Verbeke, G. and Molenberghs, G. (2000). Linear Mixed Models for Longi-
    tudinal Data. Springer-Verlag, New York.