
Springer Series in Statistics

Advisors:
P. Bickel, P. Diggle, S. Fienberg, U. Gather,
I. Olkin, S. Zeger
Springer Series in Statistics
Alho/Spencer: Statistical Demography and Forecasting.
Andersen/Borgan/Gill/Keiding: Statistical Models Based on Counting Processes.
Atkinson/Riani: Robust Diagnostic Regression Analysis.
Atkinson/Riani/Cerioli: Exploring Multivariate Data with the Forward Search.
Berger: Statistical Decision Theory and Bayesian Analysis, 2nd edition.
Borg/Groenen: Modern Multidimensional Scaling: Theory and Applications,
2nd edition.
Brockwell/Davis: Time Series: Theory and Methods, 2nd edition.
Bucklew: Introduction to Rare Event Simulation.
Cappé/Moulines/Rydén: Inference in Hidden Markov Models.
Chan/Tong: Chaos: A Statistical Perspective.
Chen/Shao/Ibrahim: Monte Carlo Methods in Bayesian Computation.
Coles: An Introduction to Statistical Modeling of Extreme Values.
David/Edwards: Annotated Readings in the History of Statistics.
Devroye/Lugosi: Combinatorial Methods in Density Estimation.
Efromovich: Nonparametric Curve Estimation: Methods, Theory, and Applications.
Eggermont/LaRiccia: Maximum Penalized Likelihood Estimation, Volume I: Density
Estimation.
Fahrmeir/Tutz: Multivariate Statistical Modelling Based on Generalized Linear
Models, 2nd edition.
Fan/Yao: Nonlinear Time Series: Nonparametric and Parametric Methods.
Farebrother: Fitting Linear Relationships: A History of the Calculus of Observations
1750-1900.
Federer: Statistical Design and Analysis for Intercropping Experiments, Volume I:
Two Crops.
Federer: Statistical Design and Analysis for Intercropping Experiments, Volume II:
Three or More Crops.
Ferraty/Vieu: Nonparametric Functional Data Analysis: Models, Theory, Applications,
and Implementation
Ghosh/Ramamoorthi: Bayesian Nonparametrics.
Glaz/Naus/Wallenstein: Scan Statistics.
Good: Permutation Tests: Parametric and Bootstrap Tests of Hypotheses, 3rd edition.
Gouriéroux: ARCH Models and Financial Applications.
Gu: Smoothing Spline ANOVA Models.
Györfi/Kohler/Krzyżak/Walk: A Distribution-Free Theory of Nonparametric
Regression.
Haberman: Advanced Statistics, Volume I: Description of Populations.
Hall: The Bootstrap and Edgeworth Expansion.
Härdle: Smoothing Techniques: With Implementation in S.
Harrell: Regression Modeling Strategies: With Applications to Linear Models,
Logistic Regression, and Survival Analysis.
Hart: Nonparametric Smoothing and Lack-of-Fit Tests.
Hastie/Tibshirani/Friedman: The Elements of Statistical Learning: Data Mining,
Inference, and Prediction.
Hedayat/Sloane/Stufken: Orthogonal Arrays: Theory and Applications.
Heyde: Quasi-Likelihood and its Application: A General Approach to Optimal
Parameter Estimation.
(continued after index)
Anastasios A. Tsiatis

Semiparametric Theory
and Missing Data
Anastasios A. Tsiatis
Department of Statistics
North Carolina State University
Raleigh, NC 27695
USA
[email protected]

Library of Congress Control Number: 2006921164

ISBN-10: 0-387-32448-8
ISBN-13: 978-0387-32448-7

Printed on acid-free paper.

© 2006 Springer Science+Business Media, LLC


All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring
Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or
scholarly analysis. Use in connection with any form of information storage and retrieval,
electronic adaptation, computer software, or by similar or dissimilar methodology now known
or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if
they are not identified as such, is not to be taken as an expression of opinion as to whether or
not they are subject to proprietary rights.

Printed in the United States of America. (MVY)

9 8 7 6 5 4 3 2 1

springer.com
To
My Mother, Anna
My Wife, Marie
and
My Son, Greg
Preface

Missing data are prevalent in many studies, especially when the studies in-
volve human beings. Not accounting for missing data properly when analyzing
data can lead to severe biases. For example, most software packages, by de-
fault, delete records for which any data are missing and conduct the so-called
“complete-case analysis”. In many instances, such an analysis will lead to an
incorrect inference. Since the 1980s there has been a serious attempt to un-
derstand the underlying issues involved with missing data. In this book, we
study the different mechanisms for missing data and some of the different
analytic strategies that have been suggested in the literature for dealing with
such problems. A special case of missing data includes censored data, which
occur frequently in the area of survival analysis. Some discussion of how the
missing-data methods that are developed will apply to problems with censored
data is also included.
Underlying any missing-data problem is the statistical model for the data
if none of the data were missing (i.e., the so-called full-data model). In this
book, we take a very general approach to statistical modeling. That is, we
consider statistical models where interest focuses on making inference on a
finite set of parameters when the statistical model consists of the parame-
ters of interest as well as other nuisance parameters. Unlike most traditional
statistical models, where the nuisance parameters are finite-dimensional, we
consider the more general problem of infinite-dimensional nuisance parame-
ters. This allows us to develop theory for important statistical methods such
as regression models that model the conditional mean of a response variable
as a function of covariates without making any additional distributional as-
sumptions on the variables and the proportional hazards regression model for
survival data. Models where the parameters of interest are finite-dimensional
and the nuisance parameters are infinite-dimensional are called semiparamet-
ric models.
The first five chapters of the book consider semiparametric models when
there are no missing data. In these chapters, semiparametric models are de-
fined and some of the theoretical developments for estimators of the parame-
ters in these models are reviewed. The semiparametric theory and the proper-
ties of the estimators for parameters in semiparametric models are developed
from a geometrical perspective. Consequently, in Chapter 2, a quick review of
the geometry of Hilbert spaces is given. The geometric ideas are first devel-
oped for finite-dimensional parametric models in Chapter 3 and then extended
to infinite-dimensional models in Chapters 4 and 5.
A rigorous treatment of semiparametric theory is given in the book Ef-
ficient and Adaptive Estimation for Semiparametric Models by Bickel et al.
(1993) (Johns Hopkins University Press, Baltimore, MD). My experience has
been that this book is too advanced for many students in statistics and bio-
statistics even at the Ph.D. level. The attempt here is to be more expository
and heuristic, trying to give an intuition for the basic ideas without going into
all the technical details. Although the treatment of this subject is not rigorous,
it is not trivial either. Readers should not be frustrated if they don’t grasp
all the concepts at first reading. This first part of the book, which deals only
with semiparametric models (absent missing data) and the geometric theory
of semiparametrics, is important in its own right. It is a beautiful theory,
where the geometric perspective gives a new insight and deeper understanding
of statistical models and the properties of estimators for parameters in such
models.
The remainder of the book focuses on missing-data methods, building on
the semiparametric techniques developed in the earlier chapters. In Chapter
6, a discussion and overview of missing-data mechanisms is given. This in-
cludes the definition and motivation for the three most common categories of
missingness, namely
• missing completely at random (MCAR)
• missing at random (MAR)
• not missing at random (NMAR)
These ideas are extended to the broader class of coarsened data. We show how
statistical models for full data can be integrated with missingness or coarsen-
ing mechanisms that allow us to derive likelihoods and models for the observed
data in the presence of missingness. The geometric ideas for semiparametric
full-data models are extended to missing-data models. This treatment will
give the reader a deep understanding of the underlying theory for missing and
coarsened data. Methods for estimating parameters with missing or coars-
ened data in as efficient a manner as possible are emphasized. This theory
leads naturally to inverse probability weighted complete-case (IPWCC) and
augmented inverse probability weighted complete-case (AIPWCC) estimators,
which are discussed in great detail in Chapters 7 through 11. As we will see,
some of the proposed methods can become computationally challenging if not
infeasible. Therefore, in Chapter 12, we give some approximate methods for
obtaining more efficient estimators with missing data that are easier to im-
plement. Much of the theory developed in this book is taken from a series of
ground-breaking papers by Robins and Rotnitzky (together with colleagues),
who developed this elegant semiparametric theory for missing-data problems.
A short discussion on how missing-data semiparametric methods can be
applied to estimate causal treatment effects in a point exposure study is given
in Chapter 13 to illustrate the broad applicability of these methods. In Chap-
ter 14, the final chapter, we deviate slightly from semiparametric models to
discuss some of the theoretical properties of multiple-imputation estimators
for finite-dimensional parametric models. However, even here, the theory de-
veloped throughout the book will be useful in understanding the properties
of such estimators.

Anastasios (Butch) Tsiatis


Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

1 Introduction to Semiparametric Models . . . . . . . . . . . . . . . . . . . 1


1.1 What Is an Infinite-Dimensional Space? . . . . . . . . . . . . . . . . . . . . 2
1.2 Examples of Semiparametric Models . . . . . . . . . . . . . . . . . . . . . . . 3
Example 1: Restricted Moment Models . . . . . . . . . . . . . . . . . . . . . 3
Example 2: Proportional Hazards Model . . . . . . . . . . . . . . . . . . . . 7
Example 3: Nonparametric Model . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Semiparametric Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Hilbert Space for Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . 11


2.1 The Space of Mean-Zero q-dimensional
Random Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
The Dimension of the Space of Mean-Zero
Random Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Hilbert Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Linear Subspace of a Hilbert Space and
the Projection Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Projection Theorem for Hilbert Spaces . . . . . . . . . . . . . . . . . . . . . 14
2.4 Some Simple Examples of the Application of
the Projection Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Example 1: One-Dimensional Random Functions . . . . . . . . . . . . 15
Example 2: q-dimensional Random Functions . . . . . . . . . . . . . . . 16
2.5 Exercises for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3 The Geometry of Influence Functions . . . . . . . . . . . . . . . . . . . . . . 21


3.1 Super-Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Example Due to Hodges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 m-Estimators (Quick Review) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Estimating the Asymptotic Variance of an m-Estimator . . . . . . 31
Proof of Theorem 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Geometry of Influence Functions for Parametric Models . . . . . . 38


Constructing Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Efficient Influence Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Asymptotic Variance when Dimension Is Greater than One . . . 43
Geometry of Influence Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Deriving the Efficient Influence Function . . . . . . . . . . . . . . . . . . . 46
3.5 Review of Notation for Parametric Models . . . . . . . . . . . . . . . . . . 49
3.6 Exercises for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4 Semiparametric Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1 GEE Estimators for the Restricted Moment Model . . . . . . . . . . 54
Asymptotic Properties for GEE Estimators . . . . . . . . . . . . . . . . . 55
Example: Log-linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Parametric Submodels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3 Influence Functions for Semiparametric
RAL Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Semiparametric Nuisance Tangent Space . . . . . . . . . . . . . . . . . . . 63
Tangent Space for Nonparametric Models . . . . . . . . . . . . . . . . . . . 68
Partitioning the Hilbert Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 Semiparametric Restricted Moment Model . . . . . . . . . . . . . . . . . . 73
The Space Λ2s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
The Space Λ1s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Influence Functions and the Efficient Influence Function for
the Restricted Moment Model . . . . . . . . . . . . . . . . . . . . . . . 83
The Efficient Influence Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
A Different Representation for the Restricted
Moment Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Existence of a Parametric Submodel for the Arbitrary
Restricted Moment Model . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.6 Adaptive Semiparametric Estimators for the Restricted
Moment Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Extensions of the Restricted Moment Model . . . . . . . . . . . . . . . . 97
4.7 Exercises for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

5 Other Examples of Semiparametric Models . . . . . . . . . . . . . . . . 101


5.1 Location-Shift Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . . 101
The Nuisance Tangent Space and Its Orthogonal Complement
for the Location-Shift Regression Model . . . . . . . . . . . . . . 103
Semiparametric Estimators for β . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Efficient Score for the Location-Shift Regression Model . . . . . . . 107
Locally Efficient Adaptive Estimators . . . . . . . . . . . . . . . . . . . . . . 108
Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.2 Proportional Hazards Regression Model with
Censored Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
The Nuisance Tangent Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
The Space Λ2s Associated with λC|X (v|x) . . . . . . . . . . . . . . . . . . 117


The Space Λ1s Associated with λ(v) . . . . . . . . . . . . . . . . . . . . . . . 119
Finding the Orthogonal Complement of the Nuisance
Tangent Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Finding RAL Estimators for β . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Efficient Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.3 Estimating the Mean in a Nonparametric Model . . . . . . . . . . . . . 125
5.4 Estimating Treatment Difference in a Randomized
Pretest-Posttest Study or with Covariate Adjustment . . . . . . . . 126
The Tangent Space and Its Orthogonal Complement . . . . . . . . . 129
5.5 Remarks about Auxiliary Variables . . . . . . . . . . . . . . . . . . . . . . . . 133
5.6 Exercises for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

6 Models and Methods for Missing Data . . . . . . . . . . . . . . . . . . . . . 137


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.2 Likelihood Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.3 Imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.4 Inverse Probability Weighted
Complete-Case Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.5 Double Robust Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.6 Exercises for Chapter 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

7 Missing and Coarsening at Random for Semiparametric


Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.1 Missing and Coarsened Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Missing Data as a Special Case of Coarsening . . . . . . . . . . . . . . . 153
Coarsened-Data Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.2 The Density and Likelihood of Coarsened Data . . . . . . . . . . . . . . 156
Discrete Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Continuous Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Likelihood when Data Are Coarsened at Random . . . . . . . . . . . . 158
Brief Remark on Likelihood Methods . . . . . . . . . . . . . . . . . . . . . . 160
Examples of Coarsened-Data Likelihoods . . . . . . . . . . . . . . . . . . . 161
7.3 The Geometry of Semiparametric
Coarsened-Data Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
The Nuisance Tangent Space Associated with the Full-Data
Nuisance Parameter and Its Orthogonal Complement . . 166
7.4 Example: Restricted Moment Model with Missing Data by
Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
The Logistic Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
7.5 Recap and Review of Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
7.6 Exercises for Chapter 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
8 The Nuisance Tangent Space and Its Orthogonal


Complement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
8.1 Models for Coarsening and Missingness . . . . . . . . . . . . . . . . . . . . . 185
Two Levels of Missingness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Monotone and Nonmonotone Coarsening for more than
Two Levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
8.2 Estimating the Parameters in the Coarsening Model . . . . . . . . . 188
MLE for ψ with Two Levels of Missingness . . . . . . . . . . . . . . . . . 188
MLE for ψ with Monotone Coarsening . . . . . . . . . . . . . . . . . . . . . 189
8.3 The Nuisance Tangent Space when Coarsening Probabilities
Are Modeled . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
8.4 The Space Orthogonal to the
Nuisance Tangent Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
8.5 Observed-Data Influence Functions . . . . . . . . . . . . . . . . . . . . . . . . 193
8.6 Recap and Review of Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
8.7 Exercises for Chapter 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

9 Augmented Inverse Probability Weighted Complete-Case


Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
9.1 Deriving Semiparametric Estimators for β . . . . . . . . . . . . . . . . . . 199
Interesting Fact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
Estimating the Asymptotic Variance . . . . . . . . . . . . . . . . . . . . . . . 206
9.2 Additional Results Regarding Monotone Coarsening . . . . . . . . . 207
The Augmentation Space Λ2 with Monotone Coarsening . . . . . . 207
9.3 Censoring and Its Relationship to
Monotone Coarsening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Probability of a Complete Case with Censored Data . . . . . . . . . 216
The Augmentation Space, Λ2 , with Censored Data . . . . . . . . . . . 216
Deriving Estimators with Censored Data . . . . . . . . . . . . . . . . . . . 217
9.4 Recap and Review of Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
9.5 Exercises for Chapter 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220

10 Improving Efficiency and Double Robustness with


Coarsened Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
10.1 Optimal Observed-Data Influence Function Associated with
Full-Data Influence Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
10.2 Improving Efficiency with Two
Levels of Missingness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
Finding the Projection onto the Augmentation Space . . . . . . . . 226
Adaptive Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Algorithm for Finding Improved Estimators with
Two Levels of Missingness . . . . . . . . . . . . . . . . . . . . . . . . . . 229
Remarks Regarding Adaptive Estimators . . . . . . . . . . . . . . . . . . . 230
Estimating the Asymptotic Variance . . . . . . . . . . . . . . . . . . . . . . . 233
Double Robustness with Two Levels of Missingness . . . . . . . . . . 234
Remarks Regarding Double-Robust Estimators . . . . . . . . . . . . . . 236


Logistic Regression Example Revisited . . . . . . . . . . . . . . . . . . . . . 236
10.3 Improving Efficiency with Monotone Coarsening . . . . . . . . . . . . . 239
Finding the Projection onto the Augmentation Space . . . . . . . . 239
Adaptive Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
Double Robustness with Monotone Coarsening . . . . . . . . . . . . . . 248
Example with Longitudinal Data . . . . . . . . . . . . . . . . . . . . . . . . . . 251
10.4 Remarks Regarding Right Censoring . . . . . . . . . . . . . . . . . . . . . . . 254
10.5 Improving Efficiency when Coarsening
Is Nonmonotone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
Finding the Projection onto the Augmentation Space . . . . . . . . 256
Uniqueness of M−1 (·) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Obtaining Improved Estimators with Nonmonotone
Coarsening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
Double Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
10.6 Recap and Review of Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
10.7 Exercises for Chapter 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270

11 Locally Efficient Estimators for Coarsened-Data


Semiparametric Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Example: Estimating the Mean with Missing Data . . . . . . . . . . . 275
11.1 The Observed-Data Efficient Score . . . . . . . . . . . . . . . . . . . . . . . . . 277
Representation 1 (Likelihood-Based) . . . . . . . . . . . . . . . . . . . . . . . 277
Representation 2 (AIPWCC-Based) . . . . . . . . . . . . . . . . . . . . . . . . 278
Relationship between the Two Representations . . . . . . . . . . . . . . 278
M−1 for Monotone Coarsening . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
M−1 with Right Censored Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
11.2 Strategy for Obtaining Improved Estimators . . . . . . . . . . . . . . . . 285
Example: Restricted Moment Model with Monotone
Coarsening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
Some Brief Remarks Regarding Robustness . . . . . . . . . . . . . . . . . 290
11.3 Concluding Thoughts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
11.4 Recap and Review of Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
11.5 Exercises for Chapter 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293

12 Approximate Methods for Gaining Efficiency . . . . . . . . . . . . . . 295


12.1 Restricted Class of AIPWCC Estimators . . . . . . . . . . . . . . . . . . . 295
12.2 Optimal Restricted (Class 1) Estimators . . . . . . . . . . . . . . . . . . . . 300
Deriving the Optimal Restricted (Class 1) AIPWCC
Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
Estimating the Asymptotic Variance . . . . . . . . . . . . . . . . . . . . . . . 307
12.3 Example of an Optimal Restricted
(Class 1) Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
Modeling the Missingness Probabilities . . . . . . . . . . . . . . . . . . . . . 312
12.4 Optimal Restricted (Class 2) Estimators . . . . . . . . . . . . . . . . . . . . 313
Logistic Regression Example Revisited . . . . . . . . . . . . . . . . . . . . . 319


12.5 Recap and Review of Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
12.6 Exercises for Chapter 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322

13 Double-Robust Estimator of the Average Causal


Treatment Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
13.1 Point Exposure Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
13.2 Randomization and Causality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
13.3 Observational Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
13.4 Estimating the Average Causal Treatment Effect . . . . . . . . . . . . 328
Regression Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
13.5 Coarsened-Data Semiparametric Estimators . . . . . . . . . . . . . . . . 329
Observed-Data Influence Functions . . . . . . . . . . . . . . . . . . . . . . . . 331
Double Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
13.6 Exercises for Chapter 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337

14 Multiple Imputation: A Frequentist Perspective . . . . . . . . . . . 339


14.1 Full- Versus Observed-Data Information Matrix . . . . . . . . . . . . . 342
14.2 Multiple Imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
14.3 Asymptotic Properties of the
Multiple-Imputation Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
Stochastic Equicontinuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
14.4 Asymptotic Distribution of the
Multiple-Imputation Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
14.5 Estimating the Asymptotic Variance . . . . . . . . . . . . . . . . . . . . . . . 362
Consistent Estimator for the Asymptotic Variance . . . . . . . . . . . 365
14.6 Proper Imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
Asymptotic Distribution of n1/2 (β̂n∗ − β0 ) . . . . . . . . . . . . . . . . . . . 367
Rubin’s Estimator for the Asymptotic Variance . . . . . . . . . . . . . 370
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
14.7 Surrogate Marker Problem Revisited . . . . . . . . . . . . . . . . . . . . . . . 371
How Do We Sample? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
1
Introduction to Semiparametric Models

Statistical problems are described using probability models. That is, data
are envisioned as realizations of a vector of random variables Z1 , . . . , Zn ,
where Zi itself is a vector of random variables corresponding to the data
collected on the i-th individual in a sample of n individuals chosen from some
population of interest. We will assume throughout the book that Z1 , . . . , Zn
are identically and independently distributed (iid) with density belonging to
some probability (or statistical) model, where a model consists of a class of
densities that we believe might have generated the data. The densities in
a model are often identified through a set of parameters; i.e., a real-valued
vector used to describe the densities in a statistical model. The problem is
usually set up in such a way that the value of the parameters or, at the least,
the value of some subset of the parameters that describes the density that
generates the data, is of importance to the investigator. Much of statistical
inference considers how we can learn about this “true” parameter value from
a sample of observed data. Models that are described through a vector of a
finite number of real values are referred to as finite-dimensional parametric
models. For finite-dimensional parametric models, the class of densities can
be described as
P = {p(z, θ), θ ∈ Ω ⊂ Rp },
where the dimension p is some finite positive integer.
For many problems, we are interested in making inference only on a sub-
set of the parameters. Nonetheless, the entire set of parameters is necessary
to properly describe the class of possible distributions that may have gen-
erated the data. Suppose, for example, we are interested in estimating the
mean response of a variable, which we believe follows a normal distribution.
Typically, we conduct an experiment where we sample from that distribution
and describe the data that result from that experiment as a realization of the
random vector

Z1 , . . . , Zn assumed iid N (µ, σ 2 ); µ ∈ R, σ 2 > 0; θ = (µ, σ 2 ) ∈ R × R+ ⊂ R2 .


2 1 Introduction to Semiparametric Models

Here we are interested in estimating µ, the mean of the distribution, but σ 2 ,


the variance of the distribution, is necessary to properly describe the possible
probability distributions that might have generated the data. It is useful to
write the parameter θ as (β T , η T )T , where β q×1 (a q-dimensional vector) is
the parameter of interest and η r×1 (an r-dimensional vector) is the nuisance
parameter. In the previous example, β = µ and η = σ 2 . The entire parameter
space Ω has dimension p = q + r.
In some cases, we may want to consider models where the class of densi-
ties is so large that the parameter θ is infinite-dimensional. Examples of this
will be given shortly. For such models, we will consider the problem where
we are interested in estimating β, which we still take to be finite-dimensional,
say q-dimensional. For some problems, it will be natural to partition the pa-
rameter θ as (β, η), where β is the q-dimensional parameter of interest and η
is the nuisance parameter, which is infinite-dimensional. In other cases, it is
more natural to consider the parameter β as the function β(θ). These models
are referred to as semiparametric models in the literature because, generally,
there is both a parametric component β and a nonparametric component η
that describe the model. By allowing the space of parameters to be infinite-
dimensional, we are putting fewer restrictions on the probabilistic constraints
that our data might have (compared with finite-dimensional parametric mod-
els). Therefore, solutions, if they exist and are reasonable, will have greater
applicability and robustness.
Because the notion of an infinite-dimensional parameter space is so im-
portant in the subsequent development of this book, we start with a short
discussion of infinite-dimensional spaces.

1.1 What Is an Infinite-Dimensional Space?


The parameter spaces that we will consider in this book will always be subsets
of linear vector spaces. That is, we will consider a parameter space Ω ⊂ S,
where S is a linear space. A space S is a linear space if, for θ1 and θ2 elements
of S, aθ1 + bθ2 will also be an element of S for any scalar constants a and
b. Such a linear space is finite-dimensional if it can be spanned by a finite
number of elements in the space. That is, S is a finite-dimensional linear
space if elements θ1 , . . . , θm exist, where m is some finite positive integer such
that any element θ ∈ S is equal to some linear combination of θ1 , . . . , θm ; i.e.,
θ = a1 θ1 + . . . + am θm for some scalar constants a1 , . . . , am . The dimension
of a finite-dimensional linear space is defined by the minimum number of
elements in the space that span the entire space or, equivalently, the number
of linearly independent elements that span the entire space, where a set of
elements are linearly independent if no element in the set can be written as a
linear combination of the other elements. Parameter spaces that are defined in
p-dimensional Euclidean spaces are clearly finite-dimensional spaces. A linear
1.2 Examples of Semiparametric Models 3

space S that cannot be spanned by any finite set of elements is called an


infinite-dimensional parameter space.
An example of an infinite-dimensional linear space is the space of contin-
uous functions defined on the real line. Consider the space S = {f (x), x ∈ R}
for all continuous functions f (·). Clearly S is a linear space. In order to show
that this space is infinite-dimensional, we must demonstrate that it cannot be
spanned by any finite set of elements in S. This can be accomplished by noting
that the space S contains the linear subspaces made up of the class of polynomials
of order m; that is, the space Sm = {f (x) = Σ_{j=0}^{m} a_j x^j } for all constants
a_0 , . . . , a_m . Clearly, the space Sm is finite-dimensional (i.e., spanned by the
elements x^0 , x^1 , . . . , x^m ). In fact, this space is exactly an (m + 1)-dimensional
linear space since the elements x^0 , . . . , x^m are linearly independent.
Linear independence follows because x^j cannot be written as a linear combination
of x^0 , . . . , x^{j−1} for any j = 1, 2, . . .. If it could, then

x^j = Σ_{ℓ=0}^{j−1} a_ℓ x^ℓ for all x ∈ R

for some constants a_0 , . . . , a_{j−1} . If this were the case, then the derivatives of
x^j of all orders would have to be equal to the corresponding derivatives of
Σ_{ℓ=0}^{j−1} a_ℓ x^ℓ . But the j-th derivative of x^j is equal to j! ≠ 0, whereas the j-th
derivative of Σ_{ℓ=0}^{j−1} a_ℓ x^ℓ is zero, leading to a contradiction and implying that
x^0 , . . . , x^m are linearly independent.
Consequently, the space S cannot be spanned by any finite number, say
m elements of S, because, if this were possible, then the space of polynomials
of order greater than m could also be spanned by the m elements. But this is
impossible since such spaces of polynomials have dimension greater than m.
Hence, S is infinite-dimensional.
From the arguments above, we can easily show that the space of arbitrary
densities pZ (z) for a continuous random variable Z defined on the closed
finite interval [0, 1] (i.e., the so-called nonparametric model for such a random
variable) spans a space that is infinite-dimensional. This follows by noticing
that the functions pZj (z) = (j + 1) z^j , 0 ≤ z ≤ 1, j = 1, 2, . . . are densities
that are linearly independent.

1.2 Examples of Semiparametric Models


Example 1: Restricted Moment Models

A common statistical problem is to model the relationship of a response


variable Y (possibly vector-valued) as a function of a vector of covariates
X. Throughout, we will use the convention that a vector of random vari-
ables Z that is not indexed will correspond to a single observation, whereas
Zi , i = 1, . . . , n will denote a sample of n iid random vectors. Consider a
4 1 Introduction to Semiparametric Models

family of probability distributions for Z = (Y, X) that satisfy the regression


relationship
E(Y |X) = µ(X, β),
where µ(X, β) is a known function of X and the unknown q-dimensional pa-
rameter β.
The function µ(X, β) may be linear or nonlinear in β, and it is assumed
that β is finite-dimensional. For example, we might consider a linear model
where µ(X, β) = β T X or a nonlinear model, such as a log-linear model, where
µ(X, β) = exp(β T X). No other assumptions will be made on the class of prob-
ability distributions other than the constraint given by the conditional expec-
tation of Y given X stated above. As we will demonstrate shortly, such models
are semiparametric, as they will be defined through an infinite-dimensional
parameter space. We will refer to such semiparametric models as “restricted
moment models.” These models were studied by Chamberlain (1987) and
Newey (1988) in the econometrics literature. They were also popularized in
the statistics literature by Liang and Zeger (1986).
For illustration, we will take Y to be a one-dimensional random variable
that is continuous on the real line. This model can also be written as
Y = µ(X, β) + ε,
where E(ε|X) = 0. The data are realizations of (Y1 , X1 ), . . . , (Yn , Xn ) that
are iid with density for a single observation given by
pY,X {y, x; β, η(·)},
where η(·) denotes the infinite-dimensional nuisance parameter function char-
acterizing the joint distribution of ε and X, to be defined shortly. Knowledge
of β and the joint distribution of (ε, X) will induce the joint distribution of
(Y, X). Since
ε = Y − µ(X, β),
pY,X (y, x) = pε,X {y − µ(x, β), x}.
The restricted moment model only makes the assumption that
E(ε|X) = 0.
That is, we will allow any joint density pε,X (ε, x) = pε|X (ε|x)pX (x) such that
pε|X (ε|x) ≥ 0 for all ε, x,

∫ pε|X (ε|x) dε = 1 for all x,

∫ ε pε|X (ε|x) dε = 0 for all x,

pX (x) ≥ 0 for all x,

∫ pX (x) dνX (x) = 1.
1.2 Examples of Semiparametric Models 5

Remark 1. When we refer to the density, joint density, or conditional density


of one or more random variables, to avoid confusion, we will often index the
variables being used as part of the notation. For example, pY,X (y, x) is the
joint density of the random variables Y and X evaluated at the values (y, x).
This notation will be suppressed when the variables are obvious.  

Remark 2. We will use the convention that random variables are denoted by
capital letters such as Y and X, whereas realizations of those random variables
will be denoted by lowercase letters such as y and x. One exception to this
is that the random variable corresponding to the error term Y − µ(X, β) is
denoted by the Greek lowercase ε. This is in keeping with the usual notation
for such error terms used in statistics. The realization of this error term will
also be denoted by the Greek lowercase ε. The distinction between the random
variable and the realization of the error term will have to be made in the
context it is used and should be obvious in most cases. For example, when we
refer to pε,X (ε, x), the subscript ε is a random variable and the argument ε
inside the parentheses is the realization.  

Remark 3. νX (x) is a dominating measure for which densities for the ran-
dom vector X are defined. For the most part, we will consider ν(·) to be the
Lebesgue measure for continuous random variables and the counting measure
for discrete random variables. The random variable Y and hence ε will be
taken to be continuous random variables dominated by Lebesgue measure dy
or dε, respectively. 


Without going into the measure-theoretical technical details, the class of


conditional densities for ε given X, such that E(ε|X) = 0, can be constructed
through the following steps.

(a) Choose any arbitrary positive function of ε and x (subject to regularity


conditions):
h(0) (ε, x) > 0.
(b) Normalize this function to be a conditional density:

h(1) (ε, x) = h(0) (ε, x) / ∫ h(0) (ε, x) dε ;

i.e.,

∫ h(1) (ε, x) dε = 1 for all x.

(c) Center it:


A random variable ε∗ whose conditional density, given X = x, is
h(1) (ε′, x) = p(ε∗ = ε′ |X = x), has mean

µ(x) = ∫ ε′ h(1) (ε′, x) dε′.

In order to construct a random variable ε whose conditional density, given


X = x, has mean zero, we consider ε = ε∗ − µ(X) or ε∗ = ε + µ(X). It is
clear that E(ε|X = x) = E(ε∗ |X = x) − µ(x) = 0. Since the transforma-
tion from ε to ε∗ , given X = x, has Jacobian equal to 1, the conditional
density of ε given X, defined by η1 (ε, x), is given by

η1 (ε, x) = h(1) { ε + ∫ ε′ h(1) (ε′, x) dε′, x },

which, by construction, satisfies ∫ ε η1 (ε, x) dε = 0 for all x.
Thus, any function η1 (ε, x) constructed as above will satisfy η1 (ε, x) > 0,

∫ η1 (ε, x) dε = 1 for all x,

∫ ε η1 (ε, x) dε = 0 for all x.

Since the class of all such conditional densities η1 (ε, x) was derived from arbi-
trary positive functions h(0) (ε, x) (subject to regularity conditions), and since
the space of positive functions is infinite-dimensional, then the set of such
resulting conditional densities is also infinite-dimensional.
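To make the normalize-and-center construction of steps (a)–(c) concrete, here is a minimal numerical sketch; the particular h(0) and covariate value are arbitrary illustrations rather than anything prescribed by the model.

```python
import numpy as np

# Grid over epsilon; a finite range stands in for the real line.
eps = np.linspace(-10.0, 10.0, 4001)
d_eps = eps[1] - eps[0]
x = 1.5                                   # a fixed covariate value (arbitrary)

# Step (a): an arbitrary positive function of (eps, x).
h0 = np.exp(-np.abs(eps - 0.3 * x)) + 0.1 * np.exp(-eps**2)

# Step (b): normalize over eps so that it integrates to 1 (a conditional density).
h1 = h0 / (h0.sum() * d_eps)

# Step (c): center.  mu(x) is the conditional mean under h1; the density of
# eps = eps* - mu(x) is eta1(eps, x) = h1(eps + mu(x), x).
mu_x = (eps * h1).sum() * d_eps
eta1 = np.interp(eps + mu_x, eps, h1, left=0.0, right=0.0)

print("integral of eta1:", eta1.sum() * d_eps)           # ~ 1
print("conditional mean:", (eps * eta1).sum() * d_eps)   # ~ 0
```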
Similarly, we can construct densities for X where pX (x) = η2 (x) such that

η2 (x) > 0,

∫ η2 (x) dνX (x) = 1.

The set of all such functions η2 (x) will also be infinite-dimensional as long as
the support of X is infinite.
Therefore, the restricted moment model is characterized by

{β, η1 (ε, x), η2 (x)},

where β ∈ Rq is finite-dimensional and η1 (ε, x) = pε|X (ε|x), η2 (x) = pX (x)


are infinite-dimensional. Consequently, the joint density of (Y, X) is given by

pY,X {y, x; β, η1 (·), η2 (·)} = pY |X {y|x; β, η1 (·)}pX {x; η2 (·)}


= η1 {y − µ(x, β), x}η2 (x).

This is an example of a semiparametric model because the parametrization


is through a finite-dimensional parameter of interest β ∈ Rq and infinite-
dimensional nuisance parameters {η1 (·), η2 (·)}.
Contrast this semiparametric model with the more traditional parametric
model where
1.2 Examples of Semiparametric Models 7

Yi = µ(Xi , β) + εi , i = 1, . . . , n,

where εi are iid N (0, σ^2 ). That is,

pY |X (y|x; β, σ^2 ) = (2πσ^2 )^{−1/2} exp[ −{y − µ(x, β)}^2 / (2σ^2 ) ].

This model is much more restrictive than the semiparametric model defined
earlier.

Example 2: Proportional Hazards Model

In many biomedical applications, we are often interested in modeling the sur-


vival time of individuals as functions of covariates. Let the response variable
be the survival time of an individual, denoted by T , whose distribution de-
pends on explanatory variables X. A popular model in survival analysis is
Cox’s proportional hazards model, which was first introduced in the seminal
paper by Cox (1972). This model assumes that the conditional hazard rate,
as a function of X, is given by

λ(t|X) = lim_{h→0} [ P (t ≤ T < t + h | T ≥ t, X) / h ]
       = λ(t) exp(β^T X).

The proportional hazards model is especially convenient when survival times


may be right censored, as we will discuss in greater detail in Chapter 5.
Interest often focuses on the finite-dimensional parameter β, as this de-
scribes the magnitude of the effect that the covariates have on the survival
time. The underlying hazard function λ(t) is left unspecified and is considered
a nuisance parameter. Since this function can be any positive function in t,
subject to some regularity conditions, it, too, is infinite-dimensional. Using
the fact that the density of a positive random variable is related to the hazard
function through

pT (t) = λ(t) exp{ − ∫_0^t λ(u) du },

then the density of a single observation Z = (T, X) is given by

pT,X {t, x; β, λ(·), η2 (·)} = pT |X {t|x; β, λ(·)}η2 (x),

where

pT |X {t|x; β, λ(·)} = λ(t) exp(β^T x) exp{ − exp(β^T x) ∫_0^t λ(u) du },

and exactly as in Example 1, η2 (x) is defined as a function of x such that


η2 (x) ≥ 0,

∫ η2 (x) dνX (x) = 1,

for all x. The proportional hazards model has gained a great deal of popularity
because it is more flexible than a finite-dimensional parametric model, which
assumes that the hazard function for T has a specific functional form in terms
of a few parameters; e.g.,

λ(t, η) = η (constant hazard over time – exponential model),

or

λ(t, η) = η1 t^{η2} (Weibull model).
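One practical consequence of the relation pT (t) = λ(t) exp{−∫_0^t λ(u) du}, or equivalently S(t|X) = exp{−exp(β^T X) ∫_0^t λ(u) du}, is that survival times under a proportional hazards model can be simulated by inverting the conditional survival function. A minimal sketch, assuming a Weibull-type baseline hazard λ(t) = η1 t^{η2} and arbitrary parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative values.
beta = np.array([0.5, -1.0])
eta1, eta2 = 2.0, 0.5              # baseline hazard lambda(t) = eta1 * t**eta2

n = 100_000
X = rng.normal(size=(n, 2))

# Cumulative baseline hazard Lambda(t) = eta1 * t**(eta2 + 1) / (eta2 + 1), so
# S(t | X) = exp{-exp(beta'X) * Lambda(t)} and, with U ~ Uniform(0, 1),
# T = Lambda^{-1}( -log(U) / exp(beta'X) ) has the desired distribution.
U = rng.uniform(size=n)
cum_hazard = -np.log(U) / np.exp(X @ beta)
T = ((eta2 + 1) * cum_hazard / eta1) ** (1.0 / (eta2 + 1))

print("median simulated survival time:", np.median(T))
```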

Example 3: Nonparametric Model

In the two previous examples, the probability models were written in terms
of an infinite-dimensional parameter θ, which was partitioned as {β T , η(·)},
where β was the finite-dimensional parameter of interest and η(·) was the
infinite-dimensional nuisance parameter. We now consider the problem of es-
timating the moments of a single random variable Z where we put no re-
striction on the distribution of Z except that the moments of interest exist.
That is, we denote the density of Z by θ(z), where θ(z) can be any positive
function of z such that ∫ θ(z)dνZ (z) = 1 and any additional restrictions
necessary for the moments of interest to exist. Clearly, the class of all θ(·) is
infinite-dimensional as long as the support of Z is infinite. Suppose we were
interested in estimating some functional of θ(·), say β(θ) (for example, the
first or second moment E(Z) or E(Z^2 ), where β(θ) is equal to ∫ zθ(z)dνZ (z)
or ∫ z^2 θ(z)dνZ (z), respectively). For such a problem, it is not convenient to
try to partition the parameter space in terms of the parameter β of interest
and a nuisance parameter but rather to work directly with the functional β(θ).

1.3 Semiparametric Estimators


In a semiparametric model, a semiparametric estimator for β, say β̂n , is one
that, loosely speaking, has the property that it is consistent and asymptoti-
cally normal in the sense that
(β̂n − β) −→^{P {β,η(·)}} 0,

n^{1/2} (β̂n − β) −→^{D{β,η(·)}} N (0, Σ_{q×q} {β, η(·)}),

for all densities “p{z, β, η(·)}” within some semiparametric family, where
−→^{P {β,η(·)}} denotes convergence in probability and −→^{D{β,η(·)}} denotes conver-
gence in distribution when the density of the random variable Z is p{z, β, η(·)}.
We know, for example, that the solution to the linear estimating equations

Σ_{i=1}^{n} A^{q×1} (Xi , β̂n ) {Yi − µ(Xi , β̂n )} = 0^{q×1} ,

under suitable regularity conditions, leads to an estimator for β that is con-


sistent and asymptotically normal for the semiparametric restricted moment
model of Example 1. In fact, this is the basis for “generalized estimating
equations” (GEE) proposed by Liang and Zeger (1986).
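For concreteness, here is a minimal sketch of solving such a linear estimating equation for the log-linear model µ(X, β) = exp(β^T X), with A(X, β) taken to be ∂µ(X, β)/∂β = µ(X, β)X (one common choice, not the only one). The simulated data and parameter values are arbitrary, and the errors are deliberately non-normal, since the restricted moment model only constrains the conditional mean.

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(1)

# Simulated data with E(Y | X) = exp(beta'X); the multiplicative gamma errors
# have mean 1, so only the conditional mean is specified correctly.
n, beta_true = 2000, np.array([0.5, -0.3])
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, n)])
Y = np.exp(X @ beta_true) * rng.gamma(shape=2.0, scale=0.5, size=n)

def estimating_eq(beta):
    """sum_i A(X_i, beta){Y_i - mu(X_i, beta)} with A(X, beta) = mu(X, beta) X."""
    mu = np.exp(X @ beta)
    return X.T @ ((Y - mu) * mu)

beta_hat = root(estimating_eq, x0=np.zeros(2)).x
print("beta_hat:", beta_hat)      # close to beta_true for large n
```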
The maximum partial likelihood estimator proposed by Cox (1972, 1975)
is an example of a semiparametric estimator for β in the proportional hazards
model given in Example 2. Also, in Example 3, a semiparametric estimator for
the first and second moments is given by n^{−1} Σ Zi and n^{−1} Σ Zi^2 , respectively.
Some issues that arise when studying semiparametric models are:
(i) How do we find semiparametric estimators, or do they even exist?
(ii) How do we find the best estimator among the class of semiparametric
estimators?
Both of these problems are difficult. Understanding the geometry of estima-
tors, more specifically the geometry of the influence function of estimators,
will help us in this regard.
Much of this book will rely heavily on geometric constructions. We will de-
fine the influence function of an asymptotically linear estimator and describe
the geometry of all possible influence functions for a statistical model. We
will start by looking at finite-dimensional parametric models and then gener-
alize the results to the more complicated infinite-dimensional semiparametric
models.
Since the geometry that is considered is the geometry of Hilbert spaces,
we begin with a quick review of Hilbert spaces, the notion of orthogonality,
minimum distance, and how this relates to efficient estimators (i.e., estimators
with the smallest asymptotic variance).
2
Hilbert Space for Random Vectors

In this section, we will introduce a Hilbert space without going into much of
the technical details. We will focus primarily on the Hilbert space whose ele-
ments are random vectors with mean zero and finite variance that will be used
throughout the book. For more details about Hilbert spaces, we recommend
that the reader study Chapter 3 of Luenberger (1969).

2.1 The Space of Mean-Zero q-dimensional


Random Functions
As stated earlier, data are envisioned as realizations of the random vectors
Z1 , Z2 , . . ., Zn , assumed iid. Let Z denote the random vector for a single
observation. As always, there is an underlying probability space (Z , A, P ),
where Z denotes the sample space, A the corresponding σ-algebra, and P
the probability measure. For the time being, we will not consider a statistical
model consisting of a family of probability measures, but rather we will assume
that P is the true probability measure that generates the realizations of Z.
Consider the space consisting of q-dimensional mean-zero random func-
tions of Z,
h : Z → Rq ,
where h(Z) is measurable and also satisfies

(i) E{h(Z)} = 0,
(ii) E{hT (Z)h(Z)} < ∞.

Since the elements of this space are random functions, when we refer to an
element as h, we implicitly mean h(Z). Clearly, the space of all such h that
satisfy (i) and (ii) is a linear space. By linear, we mean that if h1 , h2 are
elements of the space, then for any real constants a and b, ah1 + bh2 also
belongs to the space.

In the same way that we consider points in Euclidean space as vectors from
the origin, here we will consider the q-dimensional random functions as points
in a space. The intuition we have developed in understanding the geometry of
two- and three-dimensional Euclidean space will aid us in understanding the
geometry of more complex spaces through analogy. The random function

h(Z) = 0q×1

will denote the origin of this space.

The Dimension of the Space of Mean-Zero Random Functions

An element of the linear space defined above is a q-dimensional function of


Z. This should not be confused with the dimensionality of the space itself.
To illustrate this point more clearly, let us first consider the space of one-
dimensional random functions of Z (random variables), where Z is a discrete
variable with finite support. Specifically, let Z be allowed to take on one
of a finite number of values z1 , . . . , zk with positive probabilities π1 , . . . , πk ,
where Σ_{i=1}^{k} πi = 1. For such a case, any one-dimensional random function
of Z can be defined as h(Z) = a1 I(Z = z1 ) + . . . + ak I(Z = zk ) for any
real valued constants a1 , . . . , ak , where I(·) denotes the indicator function.
The space of all such random functions is a linear space spanned by the k
linearly independent functions I(Z = zi ), i = 1, . . . , k. Hence this space is a
k-dimensional linear space. If we put the further constraint that the mean
must be zero (i.e., E{h(Z)} = 0), then this implies that Σ_{i=1}^{k} ai πi = 0,
or equivalently that ak = −(Σ_{i=1}^{k−1} ai πi )/πk . Some simple algebra leads us
to conclude that the space of one-dimensional mean-zero random functions
of Z is a linear space spanned by the k − 1 linearly independent functions
{I(Z = zi ) − (πi /πk ) I(Z = zk )}, i = 1, . . . , k − 1. Hence this space is a
(k − 1)-dimensional linear space.
Similarly, the space of q-dimensional mean-zero random functions of Z,
where Z has finite support at the k values z1 , . . . , zk , can be shown to be a
linear space with dimension q × (k − 1). Clearly, as the number of support
points k for the distribution of Z increases, so does the dimension of the linear
space of q-dimensional mean-zero random functions of Z.
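A small numerical check of this dimension count (an added illustration, using a three-point support with arbitrary probabilities): the k − 1 functions I(Z = zi ) − (πi /πk ) I(Z = zk ) have mean zero, and an arbitrary mean-zero function of Z can be written as a linear combination of them.

```python
import numpy as np

# Arbitrary discrete distribution with k = 3 support points.
z_vals = np.array([-1.0, 0.0, 2.0])
pi = np.array([0.2, 0.5, 0.3])
k = len(z_vals)

# Basis functions b_i(Z) = I(Z = z_i) - (pi_i / pi_k) I(Z = z_k), i = 1, ..., k - 1,
# stored as their values on the k support points.
B = np.zeros((k - 1, k))
for i in range(k - 1):
    B[i, i] = 1.0
    B[i, k - 1] = -pi[i] / pi[k - 1]

print("means of the basis functions:", B @ pi)        # both ~ 0

# Any mean-zero function h(Z) lies in their span: take an arbitrary h,
# subtract its mean, and recover the coefficients by least squares.
h = np.array([3.0, -1.0, 4.0])
h = h - h @ pi
a, *_ = np.linalg.lstsq(B.T, h, rcond=None)
print("reconstruction error:", np.max(np.abs(a @ B - h)))   # ~ 0
```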
If the support of the random vector Z is infinite, as would be the case
if any element of the random vector Z was a continuous random variable,
then the space of measurable functions that make up the Hilbert space will be
infinite-dimensional. As we indicated in Section 1.1, the set of one-dimensional
continuous functions of Z is infinite-dimensional. Consequently, the set of q-
dimensional continuous functions will also be infinite-dimensional. Clearly, the
set of q-dimensional measurable functions is a larger class and hence must also
be infinite-dimensional.

2.2 Hilbert Space


A Hilbert space, denoted by H, is a complete normed linear vector space
equipped with an inner product. As well as being a linear space, a Hilbert
space also allows us to consider distance between elements and angles and
orthogonality between vectors in the space. This is accomplished by defining
an inner product.

Definition 1. Corresponding to each pair of elements h1 , h2 belonging to a


linear vector space H, an inner product, defined by ⟨h1 , h2 ⟩, is a function that
maps to the real line. That is, ⟨h1 , h2 ⟩ is a scalar that satisfies
1. ⟨h1 , h2 ⟩ = ⟨h2 , h1 ⟩,
2. ⟨h1 + h2 , h3 ⟩ = ⟨h1 , h3 ⟩ + ⟨h2 , h3 ⟩, where h1 , h2 , h3 belong to H,
3. ⟨λh1 , h2 ⟩ = λ⟨h1 , h2 ⟩ for any scalar constant λ,
4. ⟨h1 , h1 ⟩ ≥ 0 with equality if and only if h1 = 0.

Note 1. In some cases, the function ⟨·, ·⟩ may satisfy conditions 1–3 above and
the first part of condition 4, but ⟨h1 , h1 ⟩ = 0 may not imply that h1 = 0. In
that case, we can still define a Hilbert space by identifying equivalence classes
where individual elements in our space correspond to different equivalence
classes.

Definition 2. For the linear vector space of q-dimensional measurable ran-


dom functions with mean zero and finite second moments, we can define the
inner product
⟨h1 , h2 ⟩ by E(h1^T h2 ).
We shall refer to this inner product as the “covariance inner product.”

This definition of inner product clearly satisfies the first three conditions of
the definition given above. As for condition 4, we can define an equivalence
class where h1 is equivalent to h2 ,

h1 ≡ h2 ,

if h1 = h2 a.e. or P (h1 ≠ h2 ) = 0. In this book, we will generally not concern
ourselves with such measure-theoretical subtleties.
Once an inner product is defined, we then define the norm or “length” of
any vector (i.e., element of H ) (distance from any point h ∈ H to the origin)
as
‖h‖ = ⟨h, h⟩^{1/2} .
Hilbert spaces also allow us to define orthogonality; that is, h1 , h2 ∈ H are
orthogonal if ⟨h1 , h2 ⟩ = 0.
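A quick Monte Carlo illustration of the covariance inner product (an added sketch, not from the text): for Z ∼ N(0, 1), the mean-zero functions h1(Z) = Z and h2(Z) = Z^2 − 1 satisfy ⟨h1 , h2 ⟩ = E(Z^3 − Z) = 0, so they are orthogonal elements of this Hilbert space.

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.normal(size=1_000_000)

h1 = Z                # a mean-zero function of Z
h2 = Z**2 - 1         # another mean-zero function of Z

inner = np.mean(h1 * h2)            # Monte Carlo estimate of <h1, h2> = E(h1 h2)
norm_h1 = np.sqrt(np.mean(h1**2))   # estimate of ||h1|| = <h1, h1>^{1/2}
print("<h1, h2> ~", inner)          # close to 0: h1 and h2 are orthogonal
print("||h1||  ~", norm_h1)         # close to 1
```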

Remark 1. Technically speaking, the definitions above are those for a pre-
Hilbert space. In order to be a Hilbert space, we also need the space to be
complete (i.e., every Cauchy sequence has a limit point that belongs to the
space). That the space of q-dimensional random functions with mean zero
and bounded second moments is complete follows from the L2 -completeness
theorem (see Loève 1963, p. 161) and hence is a Hilbert space. 


2.3 Linear Subspace of a Hilbert Space and


the Projection Theorem
A space U ⊂ H is a linear subspace if u1 , u2 ∈ U implies that au1 + bu2 ∈ U
for all scalar constants a, b. A linear subspace must contain the origin. This
is clear by letting the scalars be a = b = 0.
A simple example of a linear subspace is obtained by taking h1 , . . . , hk to
be arbitrary elements of a Hilbert space. Then the space a1 h1 + · · · + ak hk for
all scalars (a1 , . . . , ak ) ∈ Rk is a linear subspace spanned by {h1 , . . . , hk }.
One of the key results for Hilbert spaces, which we will use repeatedly
throughout this book, is given by the projection theorem.

Projection Theorem for Hilbert Spaces

Theorem 2.1. Let H be a Hilbert space and U a linear subspace that is closed
(i.e., contains all its limit points). Corresponding to any h ∈ H, there exists a
unique u0 ∈ U that is closest to h; that is,
‖h − u0 ‖ ≤ ‖h − u‖ for all u ∈ U.
Furthermore, h − u0 is orthogonal to U; that is,
⟨h − u0 , u⟩ = 0 for all u ∈ U.
We refer to u0 as the projection of h onto the space U, and this is denoted as
Π(h|U). Moreover, u0 is the only element u ∈ U such that h − u is orthogonal
to U (see Figure 2.1).
The proof of the projection theorem for arbitrary Hilbert spaces is not
much different or more difficult than for a finite-dimensional Euclidean space.
The condition that a Hilbert space be complete is necessary to guarantee the
existence of the projection. A formal proof can be found in Luenberger (1969,
Theorem 2, p. 51). The intuition of orthogonality and distance carries over
very nicely from simple Euclidean spaces to more complex Hilbert spaces.
A simple consequence of orthogonality is the Pythagorean theorem, which
we state for completeness.
Theorem 2.2. Pythagorean Theorem
If h1 and h2 are orthogonal elements of the Hilbert space H (i.e., ⟨h1 , h2 ⟩ = 0),
then
‖h1 + h2 ‖^2 = ‖h1 ‖^2 + ‖h2 ‖^2 .

[Figure 2.1 depicts an element h ∈ H, its projection u0 onto a linear subspace U, and the
residual (h − u0 ), which, shifted to the origin, is orthogonal to the subspace.]

Fig. 2.1. Projection onto a linear subspace

2.4 Some Simple Examples of the Application of the Projection Theorem
Example 1: One-Dimensional Random Functions

Consider the Hilbert space H of one-dimensional random functions, h(Z),
with mean zero and finite variance equipped with the inner product

⟨h1 , h2 ⟩ = E(h1 h2 )

for h1 (Z), h2 (Z) ∈ H. Let u1 (Z), . . . , uk (Z) be arbitrary elements of this space
and U be the linear subspace spanned by {u1 , . . . , uk }. That is,

U = {a^T u; for a ∈ R^k },

where u^{k×1} = (u1 , . . . , uk )^T .

The space U is an example of a finite-dimensional linear subspace since it


is spanned by the finite number of elements u1 (Z), . . . , uk (Z). This subspace
is contained in the infinite-dimensional Hilbert space H. Moreover, if the ele-
ments u1 (Z), . . . , uk (Z) are linearly independent, then the dimension of U is
identically equal to k.
Let h be an arbitrary element of H. Then the projection of h onto the
linear subspace U is given by the unique element a0^T u that satisfies

⟨h − a0^T u, a^T u⟩ = 0 for all a = (a1 , . . . , ak )^T ∈ R^k ,



or

Σ_{j=1}^k aj ⟨h − a0^T u, uj ⟩ = 0 for all aj , j = 1, . . . , k.

Equivalently, ⟨h − a0^T u, uj ⟩ = 0 for all j = 1, . . . , k,

or

E{(h − a0^T u)u^T } = 0^{(1×k)} ,

or

E(hu^T ) − a0^T E(uu^T ) = 0^{(1×k)} .

Any solution a0 such that

a0^T E(uu^T ) = E(hu^T )

would lead to the unique projection a0^T u.

If E(uu^T ) is positive definite, and therefore has a unique inverse, then

a0^T = E(hu^T ){E(uu^T )}^{−1} ,

in which case the unique projection will be

u0 = a0^T u = E(hu^T ){E(uu^T )}^{−1} u.

The norm-squared of this projection is equal to

E(hu^T ){E(uu^T )}^{−1} E(uh).

By the Pythagorean theorem,

‖h − a0^T u‖^2 = E(h − a0^T u)^2
             = E(h^2 ) − E(hu^T ){E(uu^T )}^{−1} E(uh).
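
The formulas above can also be checked numerically. The sketch below is an illustration
only: it assumes Z ~ N(0, 1), h(Z) = Z^3 , and u(Z) = {Z, Z^2 − 1}^T , and approximates
a0 , the projection, the orthogonality of the residual, and the Pythagorean identity by
Monte Carlo.

# Projection of h onto the subspace spanned by u1, u2 (Example 1):
# a0 = {E(u u^T)}^{-1} E(u h), so that Pi(h | U) = a0^T u.
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal(1_000_000)

h = Z**3                                  # a mean-zero, finite-variance function of Z
u = np.vstack([Z, Z**2 - 1.0])            # u(Z) is 2 x n; rows are u1(Z), u2(Z)

E_uh = (h * u).mean(axis=1)               # approximates E(u h)
E_uuT = (u @ u.T) / Z.size                # approximates E(u u^T)
a0 = np.linalg.solve(E_uuT, E_uh)         # approximately (3, 0), so Pi(h | U) = 3 Z

proj = a0 @ u                             # the projection a0^T u
resid = h - proj                          # the residual h - a0^T u
print(a0)
print((resid * u).mean(axis=1))                               # approx. 0: residual is orthogonal to U
print(np.mean(h**2), np.mean(proj**2) + np.mean(resid**2))    # Pythagorean theorem: 15 = 9 + 6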

Example 2: q-dimensional Random Functions

Let H be the Hilbert space of mean-zero q-dimensional measurable random
functions with finite second moments equipped with the inner product

⟨h1 , h2 ⟩ = E(h1^T h2 ).

[Figure 2.2 depicts a right triangle with hypotenuse of length {E(h^2 )}^{1/2} (the norm of h),
one leg of length [E(hv^T ){E(vv^T )}^{−1} E(vh)]^{1/2} (the norm of the projection), and the
other leg of length [E(h^2 ) − E(hv^T ){E(vv^T )}^{−1} E(vh)]^{1/2} (the norm of the residual).]

Fig. 2.2. Geometric illustration of the Pythagorean theorem

Let v(Z) be an r-dimensional random function with mean zero and E(v^T v)
< ∞. Consider the linear subspace U spanned by v(Z); that is,

U = {B^{q×r} v, where B is any arbitrary q × r matrix of real numbers}.

The linear subspace U defined above is a finite-dimensional linear sub-
space contained in the infinite-dimensional Hilbert space H. If the elements
v1 (Z), . . . , vr (Z) are linearly independent, then the dimension of U is q × r.
This can easily be seen by noting that U is spanned by the q × r linearly
independent elements uij (Z), i = 1, . . . , q, j = 1, . . . , r, of H, where, for any
i = 1, . . . , q, we take the element u_{ij}^{q×1} (Z) ∈ H to be the q-dimensional func-
tion of Z whose elements are all equal to 0 except the i-th element, which is
equal to vj (Z), for j = 1, . . . , r.
We now consider the problem of finding the projection of an arbitrary
element h ∈ H onto U. By the projection theorem, such a projection B0 v is
unique and must satisfy

E{(h − B0 v)T Bv} = 0 for all B ∈ Rq×r . (2.1)

The statement above being true for all B is equivalent to

E{(h − B0 v)v T } = 0q×r (matrix of all zeros). (2.2)

To establish (2.2), we write

E{(h − B0 v)^T Bv} = Σ_i Σ_j Bij E{(h − B0 v)_i vj }, (2.3)

where (h − B0 v)_i denotes the i-th element of the q-dimensional vector (h −
B0 v), vj denotes the j-th element of the r-dimensional vector v, and Bij
denotes the (i, j)-th element of the matrix B.
If we take Bij = 1 for i = i′ and j = j′, and 0 otherwise, it becomes clear
from (2.3) that

E{(h − B0 v)_{i′} v_{j′} } = 0 for all i′, j′

and hence

E{(h − B0 v)v^T } = 0^{q×r} .

Conversely, if (2.2) is true, then for any matrix B we have (2.1) being true.
Consequently, by (2.2), we deduce that E(hv^T ) = B0 E(vv^T ). Therefore, as-
suming E(vv^T ) is nonsingular (i.e., positive definite),
B0 = E(hv^T ){E(vv^T )}^{−1} .
Hence, the unique projection is
Π(h|U) = E(hv^T ){E(vv^T )}^{−1} v. (2.4)
Remark 2. Finding the projection of the q-dimensional vector h onto the sub-
space U is equivalent to taking each element of h and projecting it individ-
ually to the subspace spanned by {v1 , . . . , vr } for the Hilbert space of one-
dimensional random functions considered in Example 1 and then stacking
these individual projections into a vector. This can be deduced by noting that
minimizing ‖h − Bv‖^2 corresponds to minimizing

E{(h − Bv)^T (h − Bv)}
= Σ_{i=1}^q E(h − Bv)_i^2
= Σ_{i=1}^q E(hi − Σ_{j=1}^r Bij vj )^2 . (2.5)

Since the Bij are arbitrary, we can minimize the sum in (2.5) by minimizing
each of the elements in the sum separately. 
The norm-squared of this projection is given by

E[v^T {E(vv^T )}^{−1} E(vh^T )E(hv^T ){E(vv^T )}^{−1} v],

and, by the Pythagorean theorem (see Figure 2.2 for illustration), the norm-
squared of the residual (h − B0 v) is

E(h^T h) − E[v^T {E(vv^T )}^{−1} E(vh^T )E(hv^T ){E(vv^T )}^{−1} v]
= tr{E(hh^T ) − E(hv^T ){E(vv^T )}^{−1} E(vh^T )}.
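
The projection formula (2.4) and the coordinatewise interpretation of Remark 2 can also
be illustrated numerically. In the sketch below the choices Z = (Z1 , Z2 ) with iid standard
normal components, h(Z) = (Z1^3 , Z1 + Z2 )^T , and v(Z) = (Z1 , Z2 )^T are assumptions
made only for the illustration.

# Pi(h | U) = E(h v^T){E(v v^T)}^{-1} v for a q = 2 dimensional h and r = 2 dimensional v.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
Z1, Z2 = rng.standard_normal(n), rng.standard_normal(n)

h = np.vstack([Z1**3, Z1 + Z2])           # two mean-zero coordinates of h(Z)
v = np.vstack([Z1, Z2])                   # mean-zero functions spanning U

E_hvT = (h @ v.T) / n                     # approximates E(h v^T), a q x r matrix
E_vvT = (v @ v.T) / n                     # approximates E(v v^T), an r x r matrix
B0 = E_hvT @ np.linalg.inv(E_vvT)         # approximately [[3, 0], [1, 1]]

resid = h - B0 @ v
print(B0)
print((resid @ v.T) / n)                  # approx. the q x r zero matrix, as required by (2.2)

# Remark 2: the first row of B0 agrees with the coefficients obtained by projecting
# the first coordinate of h onto span{v1, v2}, exactly as in Example 1.
print(np.linalg.solve(E_vvT, (h[0] * v).mean(axis=1)))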

There are other properties of Hilbert spaces that will be used throughout
the book. Rather than giving all the properties in this introductory chapter,
we will instead define these as they are needed. There is, however, one very
important result that we do wish to highlight. This is the Cauchy-Schwartz
inequality given in Theorem 2.3.

Theorem 2.3. Cauchy-Schwartz inequality
For any two elements h1 , h2 ∈ H,
|⟨h1 , h2 ⟩|^2 ≤ ‖h1 ‖^2 ‖h2 ‖^2 ,
with equality holding if and only if h1 and h2 are linearly related; i.e., h1 = ch2
for some scalar constant c.

2.5 Exercises for Chapter 2


1. Prove the projection theorem (Theorem 2.1) for Hilbert spaces.
2. Let Z = (Z1 , . . . , Zp )^T be a p-dimensional multivariate normally dis-
   tributed random vector with mean zero and covariance matrix Σ^{p×p} . We
   also write Z as the partitioned vector Z = (Y1^T , Y2^T )^T , where Y1^{q×1} =
   (Z1 , . . . , Zq )^T and Y2^{(p−q)×1} = (Zq+1 , . . . , Zp )^T , q < p, and

   Σ = ( Σ11^{q×q}        Σ12^{q×(p−q)}
         Σ21^{(p−q)×q}    Σ22^{(p−q)×(p−q)} ), where

   Σ11 = E(Y1 Y1^T ), Σ12 = E(Y1 Y2^T ), Σ21 = E(Y2 Y1^T ), Σ22 = E(Y2 Y2^T ).
   Let H be the Hilbert space of all q-dimensional measurable functions of Z
   with mean zero, finite variance, and equipped with the covariance inner
   product. Let U be the linear subspace spanned by Y2 ; i.e., U consists of
   all the elements
   {B^{q×(p−q)} Y2 : for all q × (p − q) matrices B}.
   (a) Find the projection of Y1 onto U.
   (b) Compute the norm of the residual ‖Y1 − Π(Y1 |U)‖.
3. Let Z = (Z1 , Z2 )^T be a bivariate normally distributed vector with mean
   zero and covariance matrix
   ( σ1^2    σ12
     σ12    σ2^2 ).
   Consider the Hilbert space of all one-dimensional measurable functions of
   Z with mean zero, finite variance, and covariance inner product. Let U
   denote the linear subspace spanned by Z2 and (Z1^2 − σ1^2 ); i.e., the space
   whose elements are
   a1 (Z1^2 − σ1^2 ) + a2 Z2 for all a1 , a2 .
   (a) Find the projection of Z1 onto the space U.
   (b) Find the variance of the residual (i.e., var{Z1 − Π(Z1 |U)}).
3
The Geometry of Influence Functions

As we will describe shortly, most reasonable estimators for the parameter


β, in either parametric or semiparametric models, are asymptotically linear
and can be uniquely characterized by the influence function of the estimator.
The class of influence functions for such estimators belongs to the Hilbert
space of all mean-zero q-dimensional random functions with finite variance
that was defined in Chapter 2. As such, this construction will allow us to view
estimators or, more specifically, the influence function of estimators, from a
geometric point of view. This will give us intuitive insight into the construction
of such estimators and a geometric way of assessing the relative efficiencies of
the various estimators.
As always, consider the statistical model where Z1 , . . . , Zn are iid ran-
dom vectors and the density of a single Z is assumed to belong to the class
{pZ (z; θ), θ ∈ Ω} with respect to some dominating measure νZ . The parameter
θ can be written as (β T , η T )T , where β q×1 is the parameter of interest and
η, the nuisance parameter, may be finite- or infinite-dimensional. The truth
will be denoted by θ0 = (β0T , η0T )T. For the remainder of this chapter, we will
only consider parametric models where θ = (β T , η T )T and the vector θ is
p-dimensional, the parameter of interest β is q-dimensional, and the nuisance
parameter η is r-dimensional, with p = q + r.
An estimator β̂n of β is a q-dimensional measurable random function of
Z1 , . . . , Zn . Most reasonable estimators for β are asymptotically linear; that
is, there exists a random vector (i.e., a q-dimensional measurable random
function) ϕ^{q×1} (Z), such that E{ϕ(Z)} = 0^{q×1} ,

n^{1/2} (β̂n − β0 ) = n^{−1/2} Σ_{i=1}^n ϕ(Zi ) + op (1), (3.1)

where op (1) is a term that converges in probability to zero as n goes to infinity


and E(ϕϕT ) is finite and nonsingular.
Remark 1. The function ϕ(Z) is defined with respect to the true distribu-
tion p(z, θ0 ) that generates the data. Consequently, we sometimes may write

ϕ(Z, θ) to emphasize that this random function will vary according to the
value of θ in the model. Unless otherwise stated, it will be assumed that ϕ(Z)
is evaluated at the truth and expectations are taken with respect to the truth.
Therefore, E{ϕ(Z)} is shorthand for

Eθ0 {ϕ(Z, θ0 )}.



The random vector ϕ(Zi ) in (3.1) is referred to as the i-th influence func-
tion of the estimator β̂n or the influence function of the i-th observation of
the estimator β̂n . The term influence function comes from the robustness lit-
erature, where, to first order, ϕ(Zi ) is the influence of the i-th observation on
β̂n ; see Hampel (1974).

Example 1. As a simple example, consider the model where Z1 , . . . , Zn are
iid N (µ, σ^2 ). The maximum likelihood estimators for µ and σ^2 are given by
µ̂n = n^{−1} Σ_{i=1}^n Zi and σ̂n^2 = n^{−1} Σ_{i=1}^n (Zi − µ̂n )^2 , respectively. That the
estimator µ̂n for µ is asymptotically linear follows immediately because

n^{1/2} (µ̂n − µ0 ) = n^{−1/2} Σ_{i=1}^n (Zi − µ0 ).

Therefore, µ̂n is an asymptotically linear estimator for µ whose i-th influence


function is given by ϕ(Zi ) = (Zi − µ0 ).
After some straightforward algebra, we can express the estimator σ̂n^2 minus
the estimand as

(σ̂n^2 − σ0^2 ) = n^{−1} Σ_{i=1}^n {(Zi − µ0 )^2 − σ0^2 } − (µ̂n − µ0 )^2 . (3.2)

Multiplying (3.2) by n^{1/2} , we obtain

n^{1/2} (σ̂n^2 − σ0^2 ) = n^{−1/2} Σ_{i=1}^n {(Zi − µ0 )^2 − σ0^2 } − n^{1/2} (µ̂n − µ0 )^2 .

Since n^{1/2} (µ̂n − µ0 ) converges to a normal distribution and (µ̂n − µ0 ) converges
in probability to zero, this implies that n^{1/2} (µ̂n − µ0 )^2 converges in probability
to zero (i.e., is op (1)). Consequently, we have demonstrated that σ̂n^2 is an
asymptotically linear estimator for σ^2 whose i-th influence function is given
by ϕ(Zi ) = {(Zi − µ0 )^2 − σ0^2 }.
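
The asymptotic linearity of σ̂n^2 can be checked by simulation. The sketch below is
illustrative only; the true values µ0 = 0 and σ0^2 = 1, the sample size, and the number
of replications are arbitrary assumptions.

# Check that n^{1/2}(sigma_hat^2 - sigma0^2) agrees with the normalized sum of its
# influence functions up to an op(1) remainder, with asymptotic variance 2 sigma0^4.
import numpy as np

rng = np.random.default_rng(0)
mu0, sig2_0, n, reps = 0.0, 1.0, 2000, 5000

Z = rng.normal(mu0, np.sqrt(sig2_0), size=(reps, n))
mu_hat = Z.mean(axis=1)
sig2_hat = ((Z - mu_hat[:, None])**2).mean(axis=1)

lhs = np.sqrt(n) * (sig2_hat - sig2_0)                     # n^{1/2}(sigma_hat^2 - sigma0^2)
rhs = ((Z - mu0)**2 - sig2_0).sum(axis=1) / np.sqrt(n)     # n^{-1/2} sum_i phi(Z_i)
print(np.mean((lhs - rhs)**2))                             # small: the remainder term is op(1)
print(lhs.var(), 2.0 * sig2_0**2)                          # asymptotic variance approx. 2 sigma0^4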

When considering the asymptotic properties of an asymptotically linear


estimator (e.g., asymptotic normality and asymptotic variance), it suffices
to consider the influence function of the estimator. This follows as a simple
consequence of the central limit theorem (CLT). Since, by definition,
n^{1/2} (β̂n − β0 ) = n^{−1/2} Σ_{i=1}^n ϕ(Zi ) + op (1),

then, by the central limit theorem,

n^{−1/2} Σ_{i=1}^n ϕ(Zi ) −→D N(0^{q×1} , E(ϕϕ^T )),

and, by Slutsky’s theorem,

n^{1/2} (β̂n − β0 ) −→D N(0, E(ϕϕ^T )).

In an asymptotic sense, an asymptotically linear estimator can be iden-


tified through its influence function, as we now demonstrate in the following
theorem.
Theorem 3.1. An asymptotically linear estimator has a unique (a.s.) influ-
ence function.

Proof. By contradiction
Suppose not. Then there exists another influence function ϕ∗ (Z) such that

E{ϕ∗ (Z)} = 0,

and

n^{1/2} (β̂n − β0 ) = n^{−1/2} Σ_{i=1}^n ϕ∗ (Zi ) + op (1).

Since n^{1/2} (β̂n − β0 ) is also equal to n^{−1/2} Σ_{i=1}^n ϕ(Zi ) + op (1), this implies that

n^{−1/2} Σ_{i=1}^n {ϕ(Zi ) − ϕ∗ (Zi )} = op (1).

However, by the CLT,

n^{−1/2} Σ_{i=1}^n {ϕ(Zi ) − ϕ∗ (Zi )} −→D N(0, E{(ϕ − ϕ∗ )(ϕ − ϕ∗ )^T }).

In order for this limiting normal distribution to be op (1), we would require
that the covariance matrix

E{(ϕ − ϕ∗ )(ϕ − ϕ∗ )^T } = 0^{q×q} ,

which implies that ϕ(Z) = ϕ∗ (Z) a.s.




The representation of estimators through their influence function lends it-


self nicely to geometric interpretations in terms of Hilbert spaces, discussed
in Chapter 2. Before describing this geometry, we briefly comment on some
regularity conditions that will be imposed on the class of estimators we will
consider.

Reminder. We know that the variance of any unbiased estimator must be
greater than or equal to the Cramér-Rao lower bound; see, for example,
Casella and Berger (2002, Section 7.3). When considering asymptotic theory,
where we let the sample size n go to infinity, most reasonable estimators are
asymptotically unbiased. Thus, we might expect the asymptotic variance of
such asymptotically unbiased estimators also to be no smaller than the Cramér-
Rao lower bound. This indeed is the case for the most part, and estimators
whose asymptotic variance equals the Cramér-Rao lower bound are referred
to as asymptotically efficient. For parametric models, with suitable regular-
ity conditions, the maximum likelihood estimator (MLE) is an example of
an efficient estimator. One of the peculiarities of asymptotic theory is that
asymptotically unbiased estimators can be constructed that have asymptotic
variance equal to the Cramér-Rao lower bound for most of the parameter val-
ues in the model but have smaller variance than the Cramér-Rao lower bound
at the remaining parameter values. Such estimators are referred to as super-efficient,
and for completeness we give the construction of such an estimator (Hodges)
as an example.

3.1 Super-Efficiency
Example Due to Hodges

Let Z1 , . . . , Zn be iid N (µ, 1), µ ∈ R. For this simple model, we know that
the maximum likelihood estimator (MLE) of µ is given by the sample mean
Z̄n = n^{−1} Σ_{i=1}^n Zi and that

n^{1/2} (Z̄n − µ) −→D(µ) N (0, 1).

Now, consider the estimator µ̂n given by Hodges in 1951 (see LeCam,
1953):
      µ̂n = Z̄n if |Z̄n | > n^{−1/4} ,
      µ̂n = 0   if |Z̄n | ≤ n^{−1/4} .
Some of the properties of this estimator are as follows.

If µ ≠ 0, then with increasing probability, the support of Z̄n moves away from
0 (see Figure 3.1).

[Figure 3.1 shows the sampling distribution of Z̄n centered at µ, lying outside the
interval (−n^{−1/4} , n^{−1/4} ).]

Fig. 3.1. When µ ≠ 0, Pµ (µ̂n ≠ Z̄n ) → 0

Therefore n^{1/2} (Z̄n − µ) = n^{1/2} (µ̂n − µ) + op (1) and n^{1/2} (µ̂n − µ) −→D(µ) N (0, 1).
If µ = 0, then the support of Z̄n will be concentrated in an O(n−1/2 )
neighborhood about the origin and hence, with increasing probability, will be
within ±n−1/4 (see Figure 3.2).

[Figure 3.2 shows the sampling distribution of Z̄n concentrated near 0, well inside the
interval (−n^{−1/4} , n^{−1/4} ).]

Fig. 3.2. When µ = 0, P0 {|Z̄n | < n^{−1/4} } → 1

 
Therefore, this implies that P0 (µ̂n = 0) → 1. Hence P0 {n^{1/2} µ̂n = 0} → 1, and
n^{1/2} (µ̂n − 0) −→P0 0, or −→D(0) N (0, 0). Consequently, the asymptotic variance of
n^{1/2} (µ̂n − µ) is equal to 1 for all µ ≠ 0, as it is for the MLE Z̄n , but for µ = 0,
the asymptotic variance of n^{1/2} (µ̂n − µ) equals 0 and thus is super-efficient.
Although super-efficiency, on the surface, may seem like a good property
for an estimator to possess, upon further study we find that super-efficiency
is gained at the expense of poor estimation in a neighborhood of zero. To

illustrate this point, consider the sequence µn = n^{−1/3} , which converges to
zero, the value at which the estimator µ̂n is super-efficient. The MLE has the
property that

n^{1/2} (Z̄n − µn ) −→D(µn) N (0, 1).

However, Z̄n concentrates its mass in an O(n^{−1/2} ) neighborhood
about µn = n^{−1/3} , which eventually, as n increases, will be completely con-
tained within the range ±n^{−1/4} with probability converging to one (see Figure
3.3).

[Figure 3.3 shows the sampling distribution of Z̄n centered at µn = n^{−1/3} , eventually
lying entirely inside the interval (−n^{−1/4} , n^{−1/4} ).]

Fig. 3.3. When µn = n^{−1/3} , Pµn (µ̂n = 0) → 1

Therefore,

Pµn {n^{1/2} (µ̂n − µn ) = −n^{1/2} µn } → 1.

Consequently, if µn = n^{−1/3} , then

−n^{1/2} µn = −n^{1/6} → −∞.

Therefore, n^{1/2} (µ̂n − µn ) diverges to −∞.
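
The two regimes described above are easy to reproduce by simulation. The sketch below
is only an illustration; the sample sizes and number of replications are arbitrary, and the
distribution of Z̄n is drawn directly as N (µ, 1/n).

# The Hodges estimator: super-efficient at mu = 0, badly behaved along mu_n = n^{-1/3}.
import numpy as np

rng = np.random.default_rng(0)

def hodges(zbar, n):
    # mu_hat equals zbar when |zbar| > n^{-1/4} and 0 otherwise
    return np.where(np.abs(zbar) > n**(-0.25), zbar, 0.0)

reps = 20_000
for n in (100, 10_000, 1_000_000):
    for mu in (0.0, n**(-1.0 / 3.0)):
        zbar = rng.normal(mu, 1.0 / np.sqrt(n), size=reps)   # sampling distribution of Z_bar_n
        scaled = np.sqrt(n) * (hodges(zbar, n) - mu)
        print(n, round(mu, 4), round(scaled.mean(), 2), round(scaled.var(), 2))
# At mu = 0 the variance of n^{1/2}(mu_hat - mu) collapses toward 0 (super-efficiency);
# at mu_n = n^{-1/3} its mean drifts toward -n^{1/6}, reflecting the divergence noted above.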


Although super-efficient estimators exist, they are unnatural and have un-
desirable local properties associated with them. Therefore, in order to avoid
problems associated with super-efficient estimators, we will impose some ad-
ditional regularity conditions on the class of estimators that will exclude such
estimators. Specifically, we will require that an estimator be regular, as we
now define.

Definition 1. Consider a local data generating process (LDGP), where, for


each n, the data are distributed according to “θn ,” where n1/2 (θn − θ∗ ) con-
verges to a constant (i.e., θn is close to some fixed parameter θ∗ ). That is,

Z1n , Z2n , . . . , Znn are iid p(z, θn ),



where
θn = (βn^T , ηn^T )^T , θ∗ = (β∗^T , η∗^T )^T .
An estimator β̂n , more specifically β̂n (Z1n , . . . , Znn ), is said to be regular if,
for each θ∗ , n^{1/2} (β̂n − βn ) has a limiting distribution that does not depend on
the LDGP.

For our purposes, this will ordinarily mean that if

n^{1/2} {β̂n (Z1n , . . . , Znn ) − β∗ } −→D(θ∗) N (0, Σ∗ ),

where
Z1n , . . . , Znn are iid p(z, θ∗ ), for all n,
then
n^{1/2} {β̂n (Z1n , . . . , Znn ) − βn } −→D(θn) N (0, Σ∗ ),

where
Z1n , . . . , Znn are iid p(z, θn ),
and n^{1/2} (θn − θ∗ ) → τ^{p×1} , where τ is any arbitrary constant vector.
It is easy to see that, in our previous example, the MLE Z̄n is a regular
estimator, whereas the super-efficient estimator µ̂n , given by Hodges, is not.
From now on, we will restrict ourselves to regular estimators; in fact,
we will only consider estimators that are regular and asymptotically linear
(RAL). Although most reasonable estimators are RAL, regular estimators do
exist that are not asymptotically linear. However, as a consequence of Hájek’s
(1970) representation theorem, it can be shown that the most efficient regular
estimator is asymptotically linear; hence, it is reasonable to restrict attention
to RAL estimators.
In Theorem 3.2 and its subsequent corollary, given below, we present a
very powerful result that allows us to describe the geometry of influence func-
tions for regular asymptotically linear (RAL) estimators. This will aid us in
defining and visualizing efficiency and will also help us generalize ideas to
semiparametric models.
First, we define the score vector for a single observation Z in a parametric
model, where Z ∼ pZ (z, θ), θ = (β T , η T )T , by Sθ (Z, θ0 ), where

Sθ (z, θ0 ) = ∂ log pZ (z, θ)/∂θ |_{θ=θ0} (3.3)

is the p-dimensional vector of derivatives of the log-likelihood with respect


to the elements of the parameter θ and θ0 denotes the true value of θ that
generates the data.
This vector can be partitioned according to β (the parameters of interest)
and η (the nuisance parameters) as

Sθ (Z, θ0 ) = {Sβ^T (Z, θ0 ), Sη^T (Z, θ0 )}^T ,

where

Sβ^{q×1} (z, θ0 ) = ∂ log pZ (z, θ)/∂β |_{θ=θ0}

and

Sη^{r×1} (z, θ0 ) = ∂ log pZ (z, θ)/∂η |_{θ=θ0} .

Although, in many applications, we can naturally partition the parameter


space θ as (β T , η T )T , we will first give results for the more general represen-
tation where we define the q-dimensional parameter of interest as a smooth
q-dimensional function of the p-dimensional parameter θ; namely, β(θ). As we
will show later, especially when we consider infinite-dimensional semiparamet-
ric models, in some applications this will be a more natural representation.
For parametric models, this is really not a great distinction, as we can always
reparametrize the problem so that there is a one-to-one relationship between
{β T (θ), η T (θ)}T and θ for some r-dimensional nuisance function η(θ).

Theorem 3.2. Let the parameter of interest β(θ) be a q-dimensional func-


tion of the p-dimensional parameter θ, q < p, such that Γq×p (θ) = ∂β(θ)/∂θT ,
the q × p-dimensional matrix of partial derivatives, exists, has rank q, and is
continuous in θ in a neighborhood of the truth θ0 . Also let β̂n be an asymptot-
ically linear estimator with influence function ϕ(Z) such that Eθ (ϕT ϕ) exists
and is continuous in θ in a neighborhood of θ0 . Then, if β̂n is regular, this will
imply that
E{ϕ(Z)SθT (Z, θ0 )} = Γ(θ0 ). (3.4)
In the special case where θ can be partitioned as (β T , η T )T , we obtain the
following corollary.

Corollary 1.
(i)
E{ϕ(Z)SβT (Z, θ0 )} = I q×q
and
(ii)
E{ϕ(Z)SηT (Z, θ0 )} = 0q×r ,
where I q×q denotes the q ×q identity matrix and 0q×r denotes the q ×r matrix
of zeros.

Theorem 3.2 follows from the definition of regularity together with suffi-
cient smoothness conditions that make a local data generating process con-
tiguous (to be defined shortly) to the sequence of distributions at the truth.
For completeness, we will give an outline of the proof. Before giving the gen-
eral proof of Theorem 3.2, which is complicated and can be skipped by the
reader not interested in all the technical details, we can gain some insight by
first showing how Corollary 1 could be proved for the special (and important)
case of the class of m-estimators.

3.2 m-Estimators (Quick Review)


In order to define an m-estimator, we consider a p × 1-dimensional function
of Z and θ, m(Z, θ), such that

Eθ {m(Z, θ)} = 0p×1 ,

Eθ {mT (Z, θ)m(Z, θ)} < ∞, and Eθ {m(Z, θ)mT (Z, θ)} is positive definite for
all θ ∈ Ω. Additional regularity conditions are also necessary and will be
defined as we need them.
The m-estimator θ̂n is defined as the solution (assuming it exists) of

Σ_{i=1}^n m(Zi , θ̂n ) = 0

from a sample

Z1 , . . . , Zn iid pZ (z, θ),
θ ∈ Ω ⊂ R^p .

Under suitable regularity conditions, the maximum likelihood estimator


(MLE) of θ is an m-estimator. The MLE is defined as the value of θ that maxi-
mizes the likelihood

∏_{i=1}^n pZ (Zi , θ),

or, equivalently, the value of θ that maximizes the log-likelihood

Σ_{i=1}^n log pZ (Zi , θ).

Under suitable regularity conditions, the maximum is found by taking the
derivative of the log-likelihood with respect to θ and setting it equal to zero.
That is, solving the score equation in θ,

Σ_{i=1}^n Sθ (Zi , θ) = 0, (3.5)

where Sθ (z, θ) is the score vector (i.e., the derivative of the log-density) defined
in (3.3). Since the score vector Sθ (Z, θ), under suitable regularity conditions,
has the property that Eθ {Sθ (Z, θ)} = 0 – see, for example, equation (7.3.8)
of Casella and Berger (2002) – , this implies that the MLE is an example of
an m-estimator.
In order to prove the consistency and asymptotic normality of m-estimators,
we need to assume certain regularity conditions. Some of the conditions that
are discussed in Chapter 36 of the Handbook of Econometrics by Newey
and McFadden (1994) include that E{∂m(Z, θ0 )/∂θ^T } be nonsingular, where
∂m(Zi , θ)/∂θ^T is defined as the p × p matrix of all partial derivatives of the ele-
ments of m(·) with respect to the elements of θ, and that

n^{−1} Σ_{i=1}^n ∂m(Zi , θ)/∂θ^T −→P Eθ0 {∂m(Z, θ)/∂θ^T }

uniformly in θ in a neighborhood of θ0 . For example, uniform convergence
would be satisfied if the sample paths of ∂m(Z, θ)/∂θ^T are continuous in θ about
θ0 almost surely and

sup_{θ∈N (θ0 )} ‖∂m(Z, θ)/∂θ^T ‖ ≤ g(Z), E{g(Z)} < ∞,

where N (θ0 ) denotes a neighborhood in θ about θ0 . In fact, these regularity
conditions would suffice to prove that the estimator θ̂n is consistent; that is,
θ̂n −→P θ0 .
Therefore, assuming that these regularity conditions hold, the influence
function for θ̂n is found by using the expansion

0 = Σ_{i=1}^n m(Zi , θ̂n ) = Σ_{i=1}^n m(Zi , θ0 ) + {Σ_{i=1}^n ∂m(Zi , θn∗ )/∂θ^T } (θ̂n − θ0 ),

where θn∗ is an intermediate value between θ̂n and θ0 .
Because we have assumed sufficient regularity conditions to guarantee the
consistency of θ̂n ,

n^{−1} Σ_{i=1}^n ∂m(Zi , θn∗ )/∂θ^T −→P E{∂m(Z, θ0 )/∂θ^T },

and by the nonsingularity assumption

[n^{−1} Σ_{i=1}^n ∂m(Zi , θn∗ )/∂θ^T ]^{−1} −→P [E{∂m(Z, θ0 )/∂θ^T }]^{−1} .

Therefore,

n^{1/2} (θ̂n − θ0 ) = −[n^{−1} Σ_{i=1}^n ∂m(Zi , θn∗ )/∂θ^T ]^{−1} {n^{−1/2} Σ_{i=1}^n m(Zi , θ0 )}
                = −[E{∂m(Z, θ0 )/∂θ^T }]^{−1} {n^{−1/2} Σ_{i=1}^n m(Zi , θ0 )} + op (1).

Since, by definition, E{m(Z, θ0 )} = 0, we immediately deduce that the influ-
ence function of θ̂n is given by

−[E{∂m(Z, θ0 )/∂θ^T }]^{−1} m(Zi , θ0 ) (3.6)

and

n^{1/2} (θ̂n − θ0 ) −→D N(0, [E{∂m(Z, θ0 )/∂θ^T }]^{−1} var{m(Z, θ0 )} [E{∂m(Z, θ0 )/∂θ^T }]^{−1T} ), (3.7)

where
var{m(Z, θ0 )} = E{m(Z, θ0 )m^T (Z, θ0 )}.

Estimating the Asymptotic Variance of an m-Estimator

In order to use an m-estimator for the parameter θ for practical applications,


such as constructing confidence intervals for the parameter θ or a subset of the
parameter, we must be able to derive a consistent estimator for the asymp-
totic variance of θ̂n . Under suitable regularity conditions, a consistent estima-
tor for the asymptotic variance of θ̂n can be derived intuitively using what
is referred to as the “sandwich” variance estimator. This estimator is moti-
vated by considering the asymptotic variance derived in (3.7). The following
heuristic argument is used.
If θ0 (the truth) is known, then a simple application of the weak law of
large numbers can be used to obtain a consistent estimator for E{∂m(Z, θ0 )/∂θ^T },
namely

Ê{∂m(Z, θ0 )/∂θ^T } = n^{−1} Σ_{i=1}^n ∂m(Zi , θ0 )/∂θ^T , (3.8)

and a consistent estimator for var{m(Z, θ0 )} can be obtained by using

v̂ar{m(Z, θ0 )} = n^{−1} Σ_{i=1}^n m(Zi , θ0 )m^T (Zi , θ0 ). (3.9)

Since θ0 is not known, we instead substitute θ̂n for θ0 in equations (3.8)
and (3.9) to obtain the sandwich estimator for the asymptotic variance, (3.7),
of θ̂n , given by

[Ê{∂m(Z, θ̂n )/∂θ^T }]^{−1} v̂ar{m(Z, θ̂n )} [Ê{∂m(Z, θ̂n )/∂θ^T }]^{−1T} . (3.10)

The estimator (3.10) is referred to as the sandwich variance estimator as
we see the term v̂ar(·) sandwiched between two terms involving Ê(·). The
sandwich variance will be discussed in greater detail in Chapter 4 when we
introduce estimators that solve generalized estimating equations (i.e., the so-
called GEE estimators). For more details on m-estimators, we refer the reader
to the excellent expository article by Stefanski and Boos (2002).
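
As a concrete (and purely illustrative) example of the sandwich estimator (3.10), consider
the m-estimator that stacks the sample mean and sample variance, as in Example 1, with
m(Z, θ) = {Z − θ1 , (Z − θ1 )^2 − θ2 }^T . The Python sketch below assumes an arbitrary
non-normal data-generating distribution simply to have data to work with.

# Sandwich variance estimator A_hat^{-1} B_hat A_hat^{-1T} for theta = (mean, variance).
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_exponential(5000)          # illustrative data; any distribution would do

theta1 = Z.mean()
theta2 = ((Z - theta1)**2).mean()           # (theta1, theta2) solves sum_i m(Z_i, theta) = 0

m = np.vstack([Z - theta1, (Z - theta1)**2 - theta2])     # m(Z_i, theta_hat), stacked as 2 x n

# A_hat = E_hat{dm(Z, theta_hat)/dtheta^T}: rows hold the derivatives of m1 and m2
A_hat = np.array([[-1.0, 0.0],
                  [-2.0 * (Z - theta1).mean(), -1.0]])
B_hat = (m @ m.T) / Z.size                  # var_hat{m(Z, theta_hat)}
A_inv = np.linalg.inv(A_hat)
sandwich = A_inv @ B_hat @ A_inv.T          # estimated asymptotic variance of n^{1/2}(theta_hat - theta0)
print(sandwich)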
When we consider the special case where the m-estimator is the MLE
of θ (i.e., where m(Z, θ) = Sθ (Z, θ); see (3.5)), we note that −∂m(Z, θ)/∂θ^T =
−∂Sθ (Z, θ)/∂θ^T corresponds to minus the p × p matrix of second partial derivatives
of the log-likelihood with respect to θ, which we denote by −Sθθ (Z, θ). Under
suitable regularity conditions (see Section 7.3 of Casella and Berger, 2002),
the information matrix, which we denote by I(θ0 ), is given by

I(θ0 ) = Eθ0 {−Sθθ (Z, θ0 )} = Eθ0 {Sθ (Z, θ0 )Sθ^T (Z, θ0 )}. (3.11)

As a consequence of (3.6) and (3.7), we obtain the well-known results that


the i-th influence function of the MLE is given by {I(θ0 )}−1 Sθ (Zi , θ0 ) and the
asymptotic distribution is normal with mean zero and variance matrix equal
to I −1 (θ0 ) (i.e., the inverse of the information matrix).
Returning to the general m-estimator, since

θ = (β T , η T )T

and

θ̂n = (β̂nT , η̂nT )T ,

the influence function of β̂n is made up of the first q elements of the p-


dimensional influence function for θ̂n given above.
We will now illustrate why Corollary 1 applies to m-estimators. By defi-
nition,
Eθ {m(Z, θ)} = 0^{p×1} .
That is,
∫ m(z, θ)p(z, θ)dν(z) = 0 for all θ.

Therefore,

∂/∂θ^T ∫ m(z, θ)p(z, θ)dν(z) = 0.

Assuming suitable regularity conditions that allow us to interchange integra-
tion and differentiation, we obtain

∫ {∂m(z, θ)/∂θ^T } p(z, θ)dν(z) + ∫ m(z, θ) {∂p(z, θ)/∂θ^T / p(z, θ)} p(z, θ)dν(z) = 0, (3.12)

where the term in the second set of braces is Sθ^T (z, θ), the transpose of the
score vector.

At θ = θ0 , we deduce from equation (3.12) that

E{∂m(Z, θ0 )/∂θ^T } = −E{m(Z, θ0 )Sθ^T (Z, θ0 )},

which can also be written as

I^{p×p} = −[E{∂m(Z, θ0 )/∂θ^T }]^{−1} E{m(Z, θ0 )Sθ^T (Z, θ0 )}, (3.13)

where I^{p×p} denotes the p × p identity matrix. Recall that the influence function
for θ̂n , given by (3.6), is

ϕθ̂n (Zi ) = −[E{∂m(Z, θ0 )/∂θ^T }]^{−1} m(Zi , θ0 )

and can be partitioned as {ϕβ̂n^T (Zi ), ϕη̂n^T (Zi )}^T .
The covariance of the influence function ϕθ̂n (Zi ) and the score vector
Sθ (Zi , θ0 ) is

E{ϕθ̂n (Zi )Sθ^T (Zi , θ0 )} = −[E{∂m(Z, θ0 )/∂θ^T }]^{−1} E{m(Z, θ0 )Sθ^T (Z, θ0 )}, (3.14)

which by (3.13) is equal to I^{(q+r)×(q+r)} , the identity matrix. This covariance
matrix (3.14) can be partitioned as

E( ϕβ̂n (Zi )Sβ^T (Zi , θ0 )   ϕβ̂n (Zi )Sη^T (Zi , θ0 )
   ϕη̂n (Zi )Sβ^T (Zi , θ0 )   ϕη̂n (Zi )Sη^T (Zi , θ0 ) ).

Consequently,

(i) E{ϕβ̂n (Zi )Sβ^T (Zi , θ0 )} = I^{q×q} (the q × q identity matrix)

and

(ii) E{ϕβ̂n (Zi )Sη^T (Zi , θ0 )} = 0^{q×r} .

Thus, we have verified that the two conditions of Corollary 1 hold for influence
functions of m-estimators.
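
Conditions (i) and (ii) can also be verified by Monte Carlo in a concrete case. The sketch
below assumes the N (µ, σ^2 ) model of Example 1, with β = µ as the parameter of interest,
η = σ^2 as the nuisance parameter, and ϕ(Z) = Z − µ0 as the influence function of the
sample mean; all numerical values are illustrative.

# Numerical check of E{phi S_beta} = 1 and E{phi S_eta} = 0 for the sample mean.
import numpy as np

rng = np.random.default_rng(0)
mu0, sig2_0 = 1.0, 2.0
Z = rng.normal(mu0, np.sqrt(sig2_0), size=2_000_000)

phi = Z - mu0                                               # influence function of mu_hat
S_beta = (Z - mu0) / sig2_0                                 # score for mu
S_eta = -0.5 / sig2_0 + (Z - mu0)**2 / (2.0 * sig2_0**2)    # score for sigma^2

print(np.mean(phi * S_beta))    # approx. 1, i.e., condition (i)
print(np.mean(phi * S_eta))     # approx. 0, i.e., condition (ii)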

Proof of Theorem 3.2

In order to prove Theorem 3.2, we must introduce the theory of contiguity,


which we now review briefly. An excellent overview of contiguity theory can be
found in the Appendix of Hájek and Sidak (1967). Those readers not interested
in the theoretical details can skip the remainder of this section.

Definition 2. Let Vn be a sequence of random vectors and let P1n and P0n be
sequences of probability measures with densities p1n (vn ) and p0n (vn ), respec-
tively. The sequence of probability measures P1n is contiguous to the sequence
of probability measures P0n if, for any sequence of events An defined with re-
spect to Vn , P0n (An ) → 0 as n → ∞ implies that P1n (An ) → 0 as n → ∞.



In our applications, we let Vn = (Z1n , . . . , Znn ), where Z1n , . . . , Znn are
iid random vectors and

p0n (vn ) = ∏_{i=1}^n p(zin , θ0 ),
p1n (vn ) = ∏_{i=1}^n p(zin , θn ),

where n^{1/2} (θn − θ0 ) → τ , τ being a p-dimensional vector of constants.


Letting the parameter θ0 denote the true value of the parameter that
generates the data, then p1n (·) is an example of a local data generating process
(LDGP) as given by Definition 1. If we could show that the sequence P1n is
contiguous to the sequence P0n , then a sequence of random variables Tn (Vn )
that converges in probability to zero under the truth (i.e., for every ε > 0,
P0n (|Tn | > ε) → 0) would also satisfy that P1n (|Tn | > ε) → 0; hence, Tn (Vn )
would converge in probability to zero for the LDGP. This fact can be very
useful because in some problems it may be relatively easy to show that a
sequence of random variables converges in probability to zero under the truth,
in which case convergence in probability to zero under the LDGP follows
immediately from contiguity.
LeCam, in a series of lemmas (see Hájek and Sidak, 1967), proved some
important results regarding contiguity. One of LeCam’s results that is of par-
ticular use to us is as follows.

Lemma 3.1. LeCam
If
log{p1n (Vn )/p0n (Vn )} −→D(P0n) N (−σ^2 /2, σ^2 ), (3.15)
then the sequence P1n is contiguous to the sequence P0n .

Heuristic justification of contiguity for LDGP

To illustrate that (3.15) holds for LDGPs under sufficient smoothness and
regularity conditions, we sketch out the following heuristic argument. Define

Ln (Vn ) = p1n (Vn )/p0n (Vn ) = ∏_{i=1}^n p(Zin , θn )/p(Zin , θ0 ).

By a simple Taylor series expansion, we obtain

log{Ln (Vn )} = Σ_{i=1}^n {log p(Zin , θn ) − log p(Zin , θ0 )}
            = (θn − θ0 )^T {Σ_{i=1}^n Sθ (Zin , θ0 )}
            + (θn − θ0 )^T {Σ_{i=1}^n Sθθ (Zin , θn∗ )}(θn − θ0 ) / 2, (3.16)

where Sθ (z, θ0 ) is the p-dimensional score vector defined as ∂ log p(z, θ0 )/∂θ,
Sθθ (z, θn∗ ) is the p × p matrix ∂^2 log p(z, θn∗ )/∂θ∂θ^T , and θn∗ is some interme-
diate value between θn and θ0 .
The expression (3.16) can be written as

n^{1/2} (θn − θ0 )^T {n^{−1/2} Σ_{i=1}^n Sθ (Zin , θ0 )}
+ n^{1/2} (θn − θ0 )^T {n^{−1} Σ_{i=1}^n Sθθ (Zin , θn∗ )} n^{1/2} (θn − θ0 ) / 2.
Under P0n :
(i) Sθ (Zin , θ0 ), i = 1, . . . , n are iid mean zero random vectors with variance
matrix equal to the information matrix I(θ0 ) defined by (3.11). Conse-
quently, by the CLT,

n^{−1/2} Σ_{i=1}^n Sθ (Zin , θ0 ) −→D(P0n) N(0, I(θ0 )).

(ii) Since θn∗ → θ0 and Sθθ (Zin , θ0 ), i = 1, . . . , n are iid random matrices with
mean −I(θ0 ), then, under sufficient smoothness conditions,

n^{−1} Σ_{i=1}^n {Sθθ (Zin , θn∗ ) − Sθθ (Zin , θ0 )} −→P 0,

and by the weak law of large numbers

n^{−1} Σ_{i=1}^n Sθθ (Zin , θ0 ) −→P −I(θ0 ),

hence

n^{−1} Σ_{i=1}^n Sθθ (Zin , θn∗ ) −→P −I(θ0 ).

By assumption, n^{1/2} (θn − θ0 ) → τ . Therefore, (i), (ii), and Slutsky’s theorem
imply that

log{Ln (Vn )} −→D(P0n) N(−τ^T I(θ0 )τ / 2, τ^T I(θ0 )τ ).

Consequently, by LeCam’s lemma, the sequence P1n is contiguous to the se-
quence P0n .
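
The limiting behavior in (3.15) is easy to visualize by simulation in the simplest case. The
sketch below assumes the N (θ, 1) model, for which I(θ0 ) = 1, so that under P0n the log
likelihood ratio of θn = θ0 + τ /√n versus θ0 should be approximately N (−τ^2 /2, τ^2 );
the numerical settings are arbitrary.

# Log likelihood ratio log{p_1n(V_n)/p_0n(V_n)} under P_0n for the N(theta, 1) model.
import numpy as np

rng = np.random.default_rng(0)
theta0, tau, n, reps = 0.0, 2.0, 5000, 20_000
theta_n = theta0 + tau / np.sqrt(n)

Z = rng.normal(theta0, 1.0, size=(reps, n))               # data generated under the truth theta0
logLn = (-0.5 * (Z - theta_n)**2 + 0.5 * (Z - theta0)**2).sum(axis=1)

print(logLn.mean(), -tau**2 / 2.0)     # approx. -2 = -tau^2/2
print(logLn.var(), tau**2)             # approx.  4 =  tau^2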

Now we are in a position to prove Theorem 3.2.


Proof. Theorem 3.2
Consider the sequence of densities p0n (vn ) = ∏ p(zin , θ0 ) and the LDGP
p1n (vn ) = ∏ p(zin , θn ), where n^{1/2} (θn − θ0 ) → τ . By the definition of asymp-
totic linearity,

n^{1/2} {β̂n − β(θ0 )} = n^{−1/2} Σ_{i=1}^n ϕ(Zin ) + oP0n (1), (3.17)

where oP0n (1) is a sequence of random vectors that converge in probability to
zero with respect to the sequence of probability measures P0n . Consider the
LDGP defined by θn . By contiguity, terms that are oP0n (1) are also oP1n (1).
Consequently, by (3.17),

n^{1/2} {β̂n − β(θ0 )} = n^{−1/2} Σ_{i=1}^n ϕ(Zin ) + oP1n (1).

By adding and subtracting common terms, we obtain

n^{1/2} {β̂n − β(θn )} = n^{−1/2} Σ_{i=1}^n [ϕ(Zin ) − Eθn {ϕ(Z)}]
                    + n^{1/2} Eθn {ϕ(Z)} − n^{1/2} {β(θn ) − β(θ0 )} (3.18)
                    + oP1n (1).

By assumption, the estimator β̂n is regular; that is,

n^{1/2} {β̂n − β(θn )} −→D(P1n) N(0, Eθ0 (ϕϕ^T )). (3.19)

Also, under P1n , [ϕ(Zin ) − Eθn {ϕ(Z)}], i = 1, . . . , n are iid mean-zero random
vectors with variance matrix Eθn (ϕϕ^T ) − Eθn (ϕ)Eθn (ϕ^T ). By the smoothness
assumption, Eθn (ϕϕ^T ) → Eθ0 (ϕϕ^T ) and Eθn (ϕ) → 0 as n → ∞. Hence, by
the CLT, we obtain

n^{−1/2} Σ_{i=1}^n [ϕ(Zin ) − Eθn {ϕ(Z)}] −→D(P1n) N(0, Eθ0 (ϕϕ^T )). (3.20)

By a simple Taylor series expansion, we deduce that β(θn ) ≈ β(θ0 ) + Γ(θ0 )(θn −
θ0 ), where Γ(θ0 ) = ∂β(θ0 )/∂θ^T . Hence,

n^{1/2} {β(θn ) − β(θ0 )} → Γ(θ0 )τ. (3.21)

Finally,

n^{1/2} Eθn {ϕ(Z)} = n^{1/2} ∫ ϕ(z)p(z, θn )dν(z)
= n^{1/2} ∫ ϕ(z)p(z, θ0 )dν(z) + n^{1/2} ∫ ϕ(z){∂p(z, θn∗ )/∂θ}^T (θn − θ0 )dν(z)
−→_{n→∞} 0 + ∫ ϕ(z){∂p(z, θ0 )/∂θ / p(z, θ0 )}^T p(z, θ0 )dν(z) τ
= Eθ0 {ϕ(Z)Sθ^T (Z, θ0 )}τ, (3.22)

where θn∗ is some intermediate value between θn and θ0 . The only way that
(3.19) and (3.20) can both hold is if the limit, as n → ∞, of the nonstochastic
terms in (3.18), namely n^{1/2} Eθn {ϕ(Z)} − n^{1/2} {β(θn ) − β(θ0 )}, is identically
equal to zero. By (3.21) and (3.22), this implies that

[Eθ0 {ϕ(Z)Sθ^T (Z, θ0 )} − Γ(θ0 )] τ = 0^{q×1} .

Since τ is arbitrary, this implies that

Eθ0 {ϕ(Z)Sθ^T (Z, θ0 )} = Γ(θ0 ),

which proves Theorem 3.2.



We now show how the results of Theorem 3.2 lend themselves to a geomet-
ric interpretation that allows us to compare the efficiency of different RAL
estimators using our intuition of minimum distance and orthogonality.

3.3 Geometry of Influence Functions for Parametric Models
Consider the Hilbert space H of all q-dimensional measurable functions of
Z with mean zero and finite variance equipped with the inner product
⟨h1 , h2 ⟩ = E(h1^T h2 ). We first note that the score vector Sθ (Z, θ0 ), under suit-
able regularity conditions, has mean zero (i.e., E{Sθ (Z, θ0 )} = 0p×1 ). Similar
to Example 2 of Chapter 2, we can define the finite-dimensional linear sub-
space T ⊂ H spanned by the p-dimensional score vector Sθ (Z, θ0 ) as the set
of all q-dimensional mean-zero random vectors consisting of

B q×p Sθ (Z, θ0 )

for all q × p matrices B. The linear subspace T is referred to as the tangent


space.
In the case where θ can be partitioned as (β T , η T )T , consider the linear
subspace spanned by the nuisance score vector Sη (Z, θ0 ),

B q×r Sη (Z, θ0 ), (3.23)

for all q × r matrices B. This space is referred to as the nuisance tangent


space and will be denoted by Λ. We note that condition (ii) of Corollary 1 is
equivalent to saying that the q-dimensional influence function ϕβ̂n (Z) for β̂n
is orthogonal to the nuisance tangent space Λ.
In addition to being orthogonal to the nuisance tangent space, the influence
function of β̂n must also satisfy condition (i) of Corollary 1; namely,
 
E ϕβ̂n (Z)SβT (Z, θ0 ) = I q×q .

Although influence functions of RAL estimators for β must satisfy condi-


tions (i) and (ii) of Corollary 1, a natural question is whether the converse
is true; that is, for any element of the Hilbert space satisfying conditions (i)
and (ii) of Corollary 1, does there exist an RAL estimator for β with that
influence function?
Remark 2. To prove this in full generality, especially later when we consider
infinite-dimensional nuisance parameters, is difficult and requires that some
careful technical regularity conditions hold. Nonetheless, it may be instruc-
tive to see how one may, heuristically, construct estimators that have influence
functions corresponding to elements in the subspace of the Hilbert space sat-
isfying conditions (i) and (ii). 


Constructing Estimators

Let ϕ(Z) be a q-dimensional measurable function with zero mean and finite
variance that satisfies conditions (i) and (ii) of Corollary 1. Define

m(Z, β, η) = ϕ(Z) − Eβ,η {ϕ(Z)}.

Assume that we can find a root-n consistent estimator for the nuisance pa-
rameter η̂n (i.e., where n1/2 (η̂n −η0 ) is bounded in probability). In many cases
the estimator η̂n will be β-dependent (i.e., η̂n (β)). For example, we might use
the MLE for η, or the restricted MLE for η, fixing the value of β.
We will now argue that the solution to the equation

n
m{Zi , β, η̂n (β)} = 0, (3.24)
i=1

which we denote by β̂n , will be an asymptotically linear estimator with influ-


ence function ϕ(Z).
By construction, we have

Eβ0 ,η {m(Z, β0 , η)} = 0,

or

∫ m(z, β0 , η)p(z, β0 , η)dν(z) = 0.

Consequently,

∂/∂η^T ∫ m(z, β0 , η)p(z, β0 , η)dν(z) |_{η=η0} = 0,

or

∫ {∂m(z, β0 , η0 )/∂η^T } p(z, β0 , η0 )dν(z) + ∫ m(z, β0 , η0 )
× Sη^T (z, β0 , η0 )p(z, β0 , η0 )dν(z) = 0. (3.25)

By definition, ϕ(Z) = m(Z, β0 , η0 ) must satisfy

E{ϕ(Z)Sη^T (Z, θ0 )} = 0.

(This is condition (ii) of Corollary 1.) Consequently, by (3.25), we obtain

E{∂m(Z, β0 , η0 )/∂η^T } = 0. (3.26)

Similarly, we can show that

E{∂m(Z, β0 , η0 )/∂β^T } = −I^{q×q} . (3.27)

A standard expansion yields




0 = Σ_{i=1}^n m{Zi , β̂n , η̂n (β̂n )}
  = Σ_{i=1}^n m{Zi , β0 , η̂n (β̂n )}
  + [Σ_{i=1}^n ∂m/∂β^T {Zi , βn∗ , η̂n (β̂n )}] (β̂n − β0 ), (3.28)

where βn∗ is an intermediate value between β̂n and β0 (notice that the nuisance
estimator η̂n (β̂n ) is held fixed in the last term). Therefore,

n^{1/2} (β̂n − β0 )
= −[n^{−1} Σ_{i=1}^n ∂m/∂β^T {Zi , βn∗ , η̂n (β̂n )}]^{−1} [n^{−1/2} Σ_{i=1}^n m{Zi , β0 , η̂n (β̂n )}], (3.29)

where the first factor in brackets converges in probability to
E{∂m(Z, β0 , η0 )/∂β^T } = −I^{q×q} by (3.27).
Let us consider the second term of (3.29); namely, n^{−1/2} Σ_{i=1}^n m{Zi , β0 , η̂n (β̂n )}.
By expansion, this equals

n^{−1/2} Σ_{i=1}^n m(Zi , β0 , η0 )
+ [n^{−1} Σ_{i=1}^n ∂m(Zi , β0 , ηn∗ )/∂η^T ] [n^{1/2} {η̂n (β̂n ) − η0 }], (3.30)

where ηn∗ is an intermediate value between η̂n (β̂n ) and η0 . The first factor in
brackets converges in probability to E{∂m(Z, β0 , η0 )/∂η^T } = 0 by (3.26), while
the second is bounded in probability, so their product is op (1).
Combining (3.29) and (3.30), we obtain

n^{1/2} (β̂n − β0 ) = n^{−1/2} Σ_{i=1}^n m(Zi , β0 , η0 ) + op (1)
                = n^{−1/2} Σ_{i=1}^n ϕ(Zi ) + op (1),

which illustrates that ϕ(Zi ) is the influence function for the i-th observation
of the estimator β̂n above.
Remark 3. This argument was independent of the choice of the root-n consis-
tent estimator for the nuisance parameter η. 

Remark 4. In the derivation above, the asymptotic distribution of the esti-
mator obtained by solving the estimating equation, which uses the estimating
function m(Z, β, η̂n ), is the same as the asymptotic distribution of the estima-
tor solving the estimating equation using the estimating function m(Z, β, η0 )
had the true value of the nuisance parameter η0 been known to us. This
fact follows from the orthogonality of the estimating function (evaluated at
the truth) to the nuisance tangent space. This type of robustness, where the
asymptotic distribution of an estimator is independent of whether the true
value of the nuisance parameter is known or whether (and how) the nuisance
parameter is estimated in an estimating equation, is one of the bonuses of
working with estimating equations with estimating functions that are orthog-
onal to the nuisance tangent space.  
Remark 5. We want to make it clear that the estimator we just presented is
for theoretical purposes only and not of practical use. The starting point was
the choice of a function satisfying the conditions of Corollary 1. To find such
a function necessitates knowledge of the truth, which, of course, we don’t
have. Nonetheless, starting with some truth, say θ0 , and some function ϕ(Z)
satisfying the conditions of Corollary 1 (under the assumed true model), we
constructed an estimator whose influence function is ϕ(Z) when θ0 is the
truth. If, however, the data were generated, in truth, by some other value of
the parameter, say θ∗ , then the estimator constructed by solving (3.24) would
have some other influence function ϕ∗ (Z) satisfying the conditions of Corollary
1 at θ∗ .
Thus, by Corollary 1, all RAL estimators have influence functions that belong
to the subspace of our Hilbert space satisfying
(i) E{ϕ(Z)SβT (Z, θ0 )} = I q×q
and
(ii) E{ϕ(Z)SηT (Z, θ0 )} = 0q×r ,
and, conversely, any element in the subspace above is the influence function
of some RAL estimator.

Why Is this Important?

RAL estimators are asymptotically normally distributed; i.e.,

n^{1/2} (β̂n − β0 ) −→D N(0, E(ϕϕ^T )).

Because of this, we can compare competing RAL estimators for β by looking


at the asymptotic variance, where clearly the better estimator is the one with
smaller asymptotic variance. We argued earlier, however, that the asymptotic
variance of an RAL estimator is the variance of its influence function. There-
fore, it suffices to consider the variance of influence functions. We already
illustrated that influence functions can be viewed as elements in a subspace
of a Hilbert space. Moreover, in this Hilbert space the distance to the origin
(squared) of any element (random function) is the variance of the element.
Consequently, the search for the best estimator (i.e., the one with the small-
est asymptotic variance) is equivalent to the search for the element in the
subspace of influence functions that has the shortest distance to the origin.

Remark 6. We want to emphasize again that Hilbert spaces are characterized


by both the elements that make up the space (random functions in our case)
and the inner product, h1 , h2 = E(hT1 h2 ), where expectation is always taken
with respect to the truth (θ0 ). Therefore, for different θ0 , we have different
Hilbert spaces. This also means that the subspace that defines the class of
influence functions is θ0 -dependent. 

3.4 Efficient Influence Function


We will show how the geometry of Hilbert spaces will allow us to identify the
most efficient influence function (i.e., the influence function with the smallest
variance). First, however, we give some additional notation and definitions
regarding operations on linear subspaces that will be needed shortly.

Definition 3. We say that M ⊕ N is a direct sum of two linear subspaces


M ⊂ H and N ⊂ H if M ⊕ N is a linear subspace in H and if every element
x ∈ M ⊕ N has a unique representation of the form x = m + n, where m ∈ M
and n ∈ N . 


Definition 4. The set of elements of a Hilbert space that are orthogonal to a


linear subspace M is denoted by M ⊥ . The space M ⊥ is also a linear subspace
(referred to as the orthogonal complement of M ) and the entire Hilbert space

H = M ⊕ M ⊥. 


Condition (ii) of Corollary 1 can now be stated as follows: If ϕ(Z) is an


influence function of an RAL estimator, then ϕ ∈ Λ⊥ , where Λ denotes the
nuisance tangent space defined by (3.23).

Definition 5. If we consider any arbitrary element h(Z) ∈ H, then by the
projection theorem, there exists a unique element a0 (Z) ∈ Λ such that h − a0
has the minimum norm, and a0 must uniquely satisfy the relationship

⟨h − a0 , a⟩ = 0 for all a ∈ Λ.

The element a0 is referred to as the projection of h onto the space Λ and is


denoted by Π(h|Λ). The element with the minimum norm, h−a0 , is sometimes
referred to as the residual of h after projecting onto Λ, and it is easy to show
that h − a0 = Π(h|Λ⊥ ).  

As we discussed earlier, condition (ii) of Corollary 1 is equivalent to an


element h(Z) in our Hilbert space H being orthogonal to the nuisance tangent
space; i.e., the linear subspace generated by the nuisance score vector, namely

Λ = {B q×r Sη (Z, θ0 ) for all B q×r }.

If we want to identify all elements orthogonal to the nuisance tangent space,


we can consider the set of elements h − Π(h|Λ) for all h ∈ H, where using the
results in Example 2 of Chapter 2,

Π(h|Λ) = E(hSηT ){E(Sη SηT )}−1 Sη (Z, θ0 ).

It is also straightforward to show that the tangent space

T = {B q×p Sθ (Z, θ0 ) for all B q×p }

can be written as the direct sum of the nuisance tangent space and the tangent
space generated by the score vector with respect to the parameter of interest
“β”. That is, if we define Tβ as the space {B q×q Sβ (Z, θ0 ) for all B q×q }, then
T = Tβ ⊕ Λ.

Asymptotic Variance when Dimension Is Greater than One

When the parameter of interest β has dimension ≥ 2, we must be careful


about what we mean by smaller asymptotic variance for an estimator or its
influence function. Consider two RAL estimators for β with influence function
ϕ(1) (Z) and ϕ(2) (Z), respectively. We say that

var {ϕ(1) (Z)} ≤ var {ϕ(2) (Z)}

if and only if
var {aT ϕ(1) (Z)} ≤ var {aT ϕ(2) (Z)}
for all q × 1 constant vectors a. Equivalently,
a^T E{ϕ^{(1)} (Z)ϕ^{(1)T} (Z)}a ≤ a^T E{ϕ^{(2)} (Z)ϕ^{(2)T} (Z)}a.

This means that

a^T [E{ϕ^{(2)} (Z)ϕ^{(2)T} (Z)} − E{ϕ^{(1)} (Z)ϕ^{(1)T} (Z)}] a ≥ 0,

or E(ϕ^{(2)} ϕ^{(2)T} ) − E(ϕ^{(1)} ϕ^{(1)T} ) is nonnegative definite.

If H(1) is the Hilbert space of one-dimensional mean-zero random func-


tions of Z, where we use the superscript (1) to emphasize one-dimensional
random functions, and if h1 and h2 are elements of H(1) that are or-
thogonal to each other, then, by the Pythagorean theorem, we know that
var(h1 + h2 ) = var(h1 ) + var(h2 ), making it clear that var(h1 + h2 ) is greater
than or equal to var(h1 ) or var(h2 ). Unfortunately, when H consists of q-
dimensional mean-zero random functions, there is no such general relationship
with regard to the variance matrices. However, there is an important special
case when this does occur, which we now discuss.
Definition 6. q-replicating linear space
A linear subspace U ⊂ H is a q-replicating linear space if U is of the form
U (1) × . . . × U (1) or {U (1) }q , where U (1) denotes a linear subspace in H(1) and
{U (1) }q ⊂ H represents the linear subspace in H that consists of elements
h = (h(1) , . . . , h(q) )T such that h(j) ∈ U (1) for all j = 1, . . . , q; i.e., {U (1) }q
consists of q-dimensional random functions, where each element in the vector
is an element of U (1) , or the space U (1) stacked up on itself q times.  
The linear subspace spanned by an r-dimensional vector of mean zero finite
variance random functions v r×1 (Z), namely the subspace
S = {B q×r v(Z) : for all constant matrices B q×r },
is such a subspace. This is easily seen by defining U (1) to be the space
{b^T v(Z) : for all constant r-dimensional vectors b^{r×1} },
in which case S = {U (1) }q . Since tangent spaces and nuisance tangent spaces
are linear subspaces spanned by score vectors, these are examples of q-
replicating linear spaces.
Theorem 3.3. Multivariate Pythagorean theorem
If h ∈ H is an element of a q-replicating linear space U, and l ∈ H is
orthogonal to U, then
var(l + h) = var(l) + var(h), (3.31)
where var(h) = E(hh^T ). As a consequence of (3.31), we obtain a multivariate
version of the Pythagorean theorem; namely, for any h∗ ∈ H,
var(h∗ ) = var(Π[h∗ |U]) + var(h∗ − Π[h∗ |U]). (3.32)
Proof. It is easy to show that an element l = (l^{(1)} , . . . , l^{(q)} )^T ∈ H is orthogo-
nal to U = {U^{(1)} }^q if and only if each element l^{(j)} , j = 1, . . . , q is orthogonal
to U^{(1)} . Consequently, such an element l is not only orthogonal to h ∈ {U^{(1)} }^q
in the sense that E(l^T h) = 0 but also in that E(lh^T ) = E(hl^T ) = 0^{q×q} . This
is important because for such an l and h, we obtain
var(l + h) = var(l) + var(h),
where var(h) = E(hh^T ).


This means that, for such cases, the variance matrix of l + h, for q-dimensional
l and h, is larger (in the multidimensional sense defined above) than either
the variance matrix of l or the variance matrix of h.
In many of the arguments that follow, we will be decomposing elements
of the Hilbert space as the projection to a tangent space or a nuisance tan-
gent space plus the residual after the projection. For such problems, because
the tangent space or nuisance tangent space is a q-replicating linear space,
we now know that we can immediately apply the multivariate version of the
Pythagorean theorem where the variance matrix of any element is always
larger than the variance matrix of the projection or the variance matrix of the
residual after projection. Consequently, we don’t have to distinguish between
the Hilbert space of one-dimensional random functions and q-dimensional ran-
dom functions.

Geometry of Influence Functions

Before describing the geometry of influence functions, we first give the defini-
tion of a linear variety (sometimes also called an affine space).
Definition 7. A linear variety is the translation of a linear subspace away
from the origin; i.e., a linear variety V can be written as V = x0 + M , where
x0 ∈ H, x0 ∉ M , x0 ≠ 0, and M is a linear subspace (see Figure 3.4).



[Figure 3.4 depicts a linear subspace M through the origin and its translation x0 + M ,
which does not pass through the origin.]

Fig. 3.4. Depiction of a linear variety

Theorem 3.4. The set of all influence functions, namely the elements of H
that satisfy condition (3.4) of Theorem 3.2, is the linear variety ϕ∗ (Z) + T ⊥ ,
where ϕ∗ (Z) is any influence function and T ⊥ is the space perpendicular to
the tangent space.
Proof. Any element l(Z) ∈ T ⊥ must satisfy

E{l(Z)SθT (Z, θ0 )} = 0q×p . (3.33)

Therefore, if we take

ϕ(Z) = ϕ∗ (Z) + l(Z),


then

E{ϕ(Z)Sθ^T (Z, θ0 )} = E[{ϕ∗ (Z) + l(Z)} Sθ^T (Z, θ0 )]
                    = E{ϕ∗ (Z)Sθ^T (Z, θ0 )} + E{l(Z)Sθ^T (Z, θ0 )}
                    = Γ(θ0 ) + 0^{q×p} = Γ(θ0 ).

Hence, ϕ(Z) is an influence function satisfying condition (3.4) of Theorem 3.2.


Conversely, if ϕ(Z) is an influence function satisfying (3.4) of Theorem
3.2, then
ϕ(Z) = ϕ∗ (Z) + {ϕ(Z) − ϕ∗ (Z)}.
It is a simple exercise to verify that {ϕ(Z) − ϕ∗ (Z)} ∈ T ⊥ . 


Deriving the Efficient Influence Function

The efficient influence function ϕeff (Z), if it exists, is the influence func-
tion with the smallest variance matrix; that is, for any influence function
ϕ(Z) ≠ ϕeff (Z), var{ϕ(Z)} − var{ϕeff (Z)} is nonnegative definite. That an ef-
ficient influence function exists and is unique is now easy to see from the
geometry of the problem.
Theorem 3.5. The efficient influence function is given by

ϕeff (Z) = ϕ∗ (Z) − Π(ϕ∗ (Z)|T ⊥ ) = Π(ϕ∗ (Z)|T ), (3.34)

where ϕ∗ (Z) is an arbitrary influence function and T is the tangent space,


and can explicitly be written as

ϕeff (Z) = Γ(θ0 )I −1 (θ0 )Sθ (Z, θ0 ). (3.35)

Proof. By Theorem 3.4, the class of influence functions is a linear variety,


ϕ∗ (Z) + T ⊥ . Let ϕeff = ϕ∗ − Π(ϕ∗ |T ⊥ ) = Π(ϕ∗ |T ). Because Π(ϕ∗ |T ⊥ ) ∈
T ⊥ , this implies that ϕeff is an influence function and, moreover, is orthogonal
to T ⊥ . Consequently, any other influence function can be written as ϕ = ϕeff +
l, with l ∈ T ⊥ . The tangent space T and its orthogonal complement T ⊥ are
examples of q-replicating linear spaces as defined by Definition 6. Therefore,
because of Theorem 3.3, equation (3.31), we obtain var(ϕ) = var(ϕeff )+var(l),
which demonstrates that ϕeff , constructed as above, is the efficient influence
function.
We deduce from the argument above that the efficient influence function
ϕeff = Π(ϕ∗ |T ) is an element of the tangent space T and hence can be ex-
q×p q×p
pressed as ϕeff (Z) = Beff Sθ (Z, θ0 ) for some constant matrix Beff . Since
ϕeff (Z) is an influence function, it must also satisfy relationship (3.4) of The-
orem 3.2; i.e.,
E{ϕeff (Z)SθT (Z, θ0 )} = Γ(θ0 ),

or
Beff E{Sθ (Z, θ0 )SθT (Z, θ0 )} = Γ(θ0 ),
which implies
Beff = Γ(θ0 )I −1 (θ0 ),
where I(θ0 ) = E{Sθ (Z, θ0 )SθT (Z, θ0 )} is the information matrix. Conse-
quently, the efficient influence function is given by

ϕeff (Z) = Γ(θ0 )I −1 (θ0 )Sθ (Z, θ0 ). 




It is instructive to consider the special case θ = (β T , η T )T . We first define


the important notion of an efficient score vector and then show the relationship
of the efficient score to the efficient influence function.

Definition 8. The efficient score is the residual of the score vector with re-
spect to the parameter of interest after projecting it onto the nuisance tangent
space; i.e.,
Seff (Z, θ0 ) = Sβ (Z, θ0 ) − Π(Sβ (Z, θ0 )|Λ).
Recall that

Π(Sβ (Z, θ0 )|Λ) = E(Sβ SηT ){E(Sη SηT )}−1 Sη (Z, θ0 ). 




Corollary 2. When the parameter θ can be partitioned as (β T , η T )T , where β


is the parameter of interest and η is the nuisance parameter, then the efficient
influence function can be written as

ϕeff (Z, θ0 ) = {E(Seff Seff^T )}^{−1} Seff (Z, θ0 ). (3.36)

Proof. By construction, the efficient score vector is orthogonal to the nuisance


tangent space; i.e., it satisfies condition (ii) of being an influence function.
By appropriately scaling the efficient score, we can construct an influence
function, which we will show is the efficient influence function. We first note
that E{Seff (Z, θ0 )Sβ^T (Z, θ0 )} = E{Seff (Z, θ0 )Seff^T (Z, θ0 )}. This follows because

E{Seff (Z, θ0 )Sβ^T (Z, θ0 )} = E{Seff (Z, θ0 )Seff^T (Z, θ0 )} + E{Seff (Z, θ0 )Π(Sβ |Λ)^T },

where the second term on the right-hand side equals zero since Seff (Z, θ0 ) ⊥ Λ.
Therefore, if we define

ϕeff (Z, θ0 ) = {E(Seff Seff^T )}^{−1} Seff (Z, θ0 ),

then
(i) E[ϕeff (Z, θ0 )SβT (Z, θ0 )] = I q×q
and

(ii) E[ϕeff (Z, θ0 )SηT (Z, θ0 )] = 0q×r ;


i.e., ϕeff (Z, θ0 ) satisfies conditions (i) and (ii) of Corollary 1 and thus is an
influence function.
As argued above, the efficient influence function is the unique influ-
ence function belonging to the tangent space T . Since both Sβ (Z, θ0 ) and
Π(Sβ (Z, θ0 )|Λ) are elements of T , so is
ϕeff (Z, θ0 ) = {E(Seff Seff^T )}^{−1} {Sβ (Z, θ0 ) − Π(Sβ |Λ)},
thus demonstrating that (3.36) is the efficient influence function for RAL
estimators of β. 

Remark 7. When the parameter θ can be partitioned as (β T , η T )T , then Γ(θ0 )
can be partitioned as [I q×q : 0q×r ], and it is a straightforward exercise to show
that (3.35) leads to (3.36).  
Remark 8. If we denote by (β̂n^{MLE} , η̂n^{MLE} ) the values of β and η that maximize
the likelihood

∏_{i=1}^n p(Zi , β, η),

then under suitable regularity conditions, the estimator β̂n^{MLE} of β is an RAL
estimator whose influence function is the efficient influence function given by
(3.36). See Exercise 3.2 below.
Remark 9. If the parameter of interest is given by β(θ) and we define by θ̂_n^{MLE}
the value of θ that maximizes the likelihood

∏_{i=1}^n p(Zi , θ),

then, under suitable regularity conditions, the estimator β(θ̂_n^{MLE} ) of β is an
RAL estimator with efficient influence function (3.35).  
Remark 10. By definition,

ϕeff (Z, θ0 ) = {E(Seff Seff^T )}^{-1} Seff (Z, θ0 )

has variance equal to

{E(Seff Seff^T )}^{-1} ,

the inverse of the variance matrix of the efficient score. If we define Iββ =
E(Sβ Sβ^T ), Iηη = E(Sη Sη^T ), and Iβη = E(Sβ Sη^T ), then we obtain the well-
known result that the minimum variance for the most efficient RAL estimator is

{Iββ − Iβη Iηη^{-1} Iβη^T }^{-1} ,

where Iββ , Iβη , Iηη are elements of the information matrix used in likelihood
theory. 
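The matrix in Remark 10 is the inverse of a Schur complement of the full information matrix. As a quick numerical sanity check (added here for illustration; the matrix below is arbitrarily generated and is not from the text), the following Python sketch verifies the standard identity that {Iββ − Iβη Iηη^{-1} Iβη^T }^{-1} coincides with the (β, β) block of I^{-1}(θ0 ), i.e., with the variance obtained from the efficient influence function (3.35) when Γ(θ0 ) is partitioned as [I^{q×q} : 0^{q×r} ] (Remark 7).

    import numpy as np

    # Illustrative check (not from the text): for a partitioned positive-definite
    # information matrix I = [[I_bb, I_be], [I_be^T, I_ee]], the Schur-complement
    # inverse {I_bb - I_be I_ee^{-1} I_be^T}^{-1} equals the (beta, beta) block of
    # I^{-1}, the asymptotic variance of the MLE of beta.
    rng = np.random.default_rng(0)
    q, r = 2, 3
    A = rng.normal(size=(q + r, q + r))
    I = A @ A.T + (q + r) * np.eye(q + r)   # an arbitrary positive-definite matrix

    I_bb, I_be, I_ee = I[:q, :q], I[:q, q:], I[q:, q:]
    schur_inv = np.linalg.inv(I_bb - I_be @ np.linalg.inv(I_ee) @ I_be.T)
    top_block_of_I_inverse = np.linalg.inv(I)[:q, :q]

    print(np.allclose(schur_inv, top_block_of_I_inverse))   # True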


3.5 Review of Notation for Parametric Models


We now give a quick review of some of the notation and ideas developed in
Chapter 3 as a useful reference.
– Z1 , . . . , Zn iid p(z, β, η),

β ∈ Rq ,
η ∈ Rr ,
θ = (β T, η T )T , θ ∈ Rp , p = q + r.

– Truth is denoted as θ0 = (β0T , η0T )T .


– n1/2 (β̂n −β0 ) = n−1/2 Σϕ(Zi )+op (1), where ϕ(Zi ) is the influence function
for the i-th observation of β̂n .
– Hilbert space: q-dimensional measurable functions of Z with mean zero and
finite variance equipped with the covariance inner product E{hT1 (Z)h2 (Z)}.
– Score vector: For θ = (β T , η T )T ,

Sβ (z) = ∂ log p(z, θ)/∂β |_{θ=θ0} ,
Sη (z) = ∂ log p(z, θ)/∂η |_{θ=θ0} ,
Sθ (z) = ∂ log p(z, θ)/∂θ |_{θ=θ0} = {Sβ^T (z), Sη^T (z)}^T .

Linear subspaces

Nuisance tangent space:

Λ = {B q×r Sη : for all B q×r }.

Tangent space:

T = {B q×p Sθ : for all B q×p },


T = Tβ ⊕ Λ, where Tβ = {B q×q Sβ : for all B q×q },

and ⊕ denotes the direct sum of linear subspaces.

Influence functions ϕ must satisfy

(i) E{ϕSβT } = I q×q


and
(ii) E{ϕSηT } = 0q×r ; ϕ ⊥ Λ, ϕ ∈ Λ⊥ .

• Efficient score

Seff (Z, θ0 ) = Sβ (Z, θ0 ) − Π(Sβ |Λ);


Π(Sβ |Λ) = E(Sβ SηT ){E(Sη SηT )}−1 Sη (Z, θ0 ).

• Efficient influence function

ϕeff (Z) = {E(Seff Seff^T )}^{-1} Seff (Z, θ0 ).

• Any influence function is equal to

ϕ(Z) = ϕeff (Z) + l(Z), l(Z) ∈ T ⊥ .

That is, influence functions lie on a linear variety and



E(ϕϕT ) = E(ϕeff ϕTeff ) + E(llT ).

3.6 Exercises for Chapter 3


1. Prove that the Hodges super-efficient estimator µ̂n , given in Section 3.1,
is not asymptotically regular.
2. Let Z1 , . . . , Zn be iid p(z, β, η), where β ∈ Rq and η ∈ Rr . Assume all the
usual regularity conditions that allow the maximum likelihood estimator
to be a solution to the score equation,
Σ_{i=1}^n {Sβ^T (Zi , β, η), Sη^T (Zi , β, η)}^T = 0^{(q+r)×1} ,

and be consistent and asymptotically normal.


a) Show that the influence function for β̂n is the efficient influence func-
tion.
b) Sketch out an argument that shows that the solution to the estimating
equation
Σ_{i=1}^n Seff^{q×1} {Zi , β, η̂n∗ (β)} = 0^{q×1} ,

for any root-n consistent estimator η̂n∗ (β), yields an estimator that is
asymptotically linear with the efficient influence function.
3. Assume Y1 , . . . , Yn are iid with distribution function F (y) = P (Y ≤ y),
which is differentiable everywhere with density f (y) = dF (y)/dy. The median
is defined as β = F^{-1} (1/2). The sample median is defined as

β̂n ≈ F̂n^{-1} (1/2),
where F̂n (y) = n^{-1} Σ_{i=1}^n I(Yi ≤ y) is the empirical distribution function.
Equivalently, β̂n is the solution to the m-estimating equation

Σ_{i=1}^n {I(Yi ≤ β) − 1/2} ≈ 0.

Remark 11. We use “≈” to denote approximately because the estimating


equation is not continuous in β and therefore will not always yield a solu-
tion. However, for large n, you can get very close to zero, the difference being
asymptotically negligible.  

(a) Find the influence function for the sample median β̂n .
Hint: You may assume the following to get your answer.
 
(i) β̂n is consistent; i.e., β̂n → β0 = F^{-1} (1/2).
(ii) Stochastic equicontinuity:

n^{1/2} {F̂n (β̂n ) − F (β̂n )} − n^{1/2} {F̂n (β0 ) − F (β0 )} → 0 in probability.

(b) Let Y1 , . . . , Yn be iid N (µ, σ 2 ), µ ∈ R, σ 2 > 0. Clearly, for this model, the
median β is equal to µ. Verify, by direct calculation, that the influence
function for the sample median satisfies the two conditions of Corollary
1.
4 Semiparametric Models

In Chapter 3, we developed theoretical results for estimators of parameters


in finite-dimensional parametric models where Z1 , . . . , Zn are iid {p(z, θ), θ ∈
Ω ⊂ Rp }, p finite, and where θ can be partitioned as

θ = (β T , η T )T , β ∈ R^q , η ∈ R^r , p = q + r,

β being the parameter of interest and η the nuisance parameter. In this chap-
ter, we will extend this theory to semiparametric models, where the parameter
space for θ is infinite-dimensional.
For most of the exposition in this book, as well as most of the exam-
ples used throughout, we will consider semiparametric models that can be
represented using the class of densities p(z, β, η), where β, the parameter of
interest, is finite-dimensional (q-dimensional); η, the nuisance parameter, is
infinite-dimensional; and β and η are variationally independent – that is, any
choice of β and η in a neighborhood about the true β0 and η0 would result
in a density p(z, β, η) in the semiparametric model. This will allow us, for
example, to explicitly define partial derivatives
∂p(z, β, η0 )/∂β |_{β=β0} = ∂p(z, β0 , η0 )/∂β .

Keep in mind, however, that some problems lend themselves more naturally
to models represented by the class of densities p(z, θ), where θ is infinite-
dimensional and the parameter of interest, β q×1 (θ), is a smooth q-dimensional
function of θ. When the second representation is easier to work with, we will
make the distinction explicit.
In Chapter 1, we gave two examples of semiparametric models:
(i) Restricted moment model

Yi = µ(Xi , β) + εi ,
E(εi |Xi ) = 0,

or equivalently
E(Yi |Xi ) = µ(Xi , β).
(ii) Proportional hazards model
The hazard of failing at time t is

λ(t|Xi ) = λ(t) exp(β T Xi ).

The major aim is to find “good” semiparametric estimators for β, where


“loosely speaking” a semiparametric estimator has the property that
n^{1/2} (β̂n − β) → N{0, Σ^{q×q} (β, η)}  in distribution under p(·, β, η)

for all densities “p(·, β, η)” within some semiparametric model and “good”
refers to estimators with small asymptotic variance. All of these ideas will be
made precise shortly.

4.1 GEE Estimators for the Restricted Moment Model


The restricted moment model was introduced briefly in Section 1.2. In this
model, we are primarily interested in studying the relationship of a response
variable Y , possibly vector-valued, as a function of covariates X. Specifically,
the restricted moment model considers the conditional expectation of Y given
X; that is,
E(Y d×1 |X) = µd×1 (X, β)
through the function µ(X, β) of X and the q-dimensional parameter β. Here,
“d” denotes the dimension of the response variable Y . Therefore, the restricted
moment model allows the modeling of multivariate and longitudinal response
data as a function of covariates (i.e., d > 1) as well as more traditional regres-
sion models for a univariate response variable (i.e., d = 1).
An example of a semiparametric estimator for the restricted moment model
is the solution to the linear estimating equation

Σ_{i=1}^n A^{q×d} (Xi , β̂n ) {Yi^{d×1} − µ^{d×1} (Xi , β̂n )} = 0^{q×1} ,     (4.1)

where A(Xi , β) is an arbitrary (q × d) matrix of functions of the covariate Xi


and the parameter β.
Subject to suitable regularity conditions, β̂n is a consistent, asymptotically
normal estimator for β and thus is an example of a semiparametric estimator
for the restricted moment model. Such an estimator is an example of a solution
to a generalized estimating equation, or GEE, as defined by Liang and Zeger
(1986). It is also an example of an m-estimator as defined in Chapter 3. For
completeness, we sketch out a heuristic argument for the asymptotic normality
of β̂n and describe how to estimate its asymptotic variance.

Asymptotic Properties for GEE Estimators

The asymptotic properties of the GEE estimator follow from the expansion

0 = Σ_{i=1}^n A(Xi , β̂n ){Yi − µ(Xi , β̂n )}
  = Σ_{i=1}^n A(Xi , β0 ){Yi − µ(Xi , β0 )}
    + { Σ_{i=1}^n Q(Yi , Xi , βn∗ ) − Σ_{i=1}^n A(Xi , βn∗ ) D(Xi , βn∗ ) } (β̂n − β0 ),     (4.2)

where

D(X, β) = ∂µ(X, β)/∂β^T     (4.3)
is the gradient matrix (d × q), made up of all partial derivatives of the d-
elements of µ(X, β) with respect to the q-elements of β, and βn∗ denotes some
intermediate value between β̂n and β0 .
If we denote the rows of A(Xi , β) by {A1 (Xi , β), . . . , Aq (Xi , β)}, then
Qq×q (Yi , Xi , β) is the q × q matrix defined by
Q^{q×q} (Yi , Xi , β) = [ {Yi − µ(Xi , β)}^T ∂A_1^T (Xi , β)/∂β^T ; . . . ; {Yi − µ(Xi , β)}^T ∂A_q^T (Xi , β)/∂β^T ],

a matrix whose j-th row is {Yi − µ(Xi , β)}^T ∂A_j^T (Xi , β)/∂β^T .

This matrix, although complicated, is made up of a linear combination of


functions of Xi multiplied by elements of {Yi − µ(Xi , β)}, which, as we will
demonstrate shortly, drops out of consideration for the asymptotic theory.
Using (4.2), we obtain


n^{1/2} (β̂n − β0 ) = { −n^{-1} Σ_{i=1}^n Q(Yi , Xi , βn∗ )
    + n^{-1} Σ_{i=1}^n A(Xi , βn∗ ) D(Xi , βn∗ ) }^{-1} n^{-1/2} Σ_{i=1}^n A(Xi , β0 ){Yi − µ(Xi , β0 )}.

Because

n^{-1} Σ_{i=1}^n Q(Yi , Xi , βn∗ ) → E{Q(Y, X, β0 )} = 0 in probability

and

n^{-1} Σ_{i=1}^n A(Xi , βn∗ ) D(Xi , βn∗ ) → E{A(X, β0 )D(X, β0 )} in probability,

we obtain
n^{1/2} (β̂n − β0 ) = n^{-1/2} Σ_{i=1}^n [E{A(X, β0 )D(X, β0 )}]^{-1} A(Xi , β0 ){Yi − µ(Xi , β0 )} + op (1).

Consequently, the influence function for the i-th observation of β̂n is

{E(AD)}−1 A(Xi , β0 ){Yi − µ(Xi , β0 )}. (4.4)

As we demonstrated in Chapter 3, the asymptotic variance of an RAL esti-


mator for β is the variance of its influence function. Therefore, the asymptotic
variance of the GEE estimator β̂n is the variance of (4.4). We first compute
the variance of A(Xi , β0 ){Yi − µ(Xi , β0 )}, which equals
   
var[A(Xi , β0 ){Yi − µ(Xi , β0 )}] = E( var[A(Xi , β0 ){Yi − µ(Xi , β0 )}|Xi ] )
    + var( E[A(Xi , β0 ){Yi − µ(Xi , β0 )}|Xi ] )
    = E{A(Xi , β0 )V (Xi )A^T (Xi , β0 )},     (4.5)

where V (Xi ) = var(Yi |Xi ) is the d × d conditional variance matrix of Yi given
Xi ; the second term above vanishes because E[A(Xi , β0 ){Yi − µ(Xi , β0 )}|Xi ] = 0.
Consequently, the asymptotic variance of β̂n is

{E(AD)}^{-1} E{AV (X)A^T } [{E(AD)}^{-1} ]^T .     (4.6)

In order to use the results above for data analytic applications, such as
constructing confidence intervals for β or for some components of β, we must
also be able to derive consistent estimators for the asymptotic variance of β̂n
given by (4.6). Without going into all the technical details, we now outline the
arguments for constructing such an estimator. These arguments are similar
to those that resulted in the sandwich variance estimator for the asymptotic
variance of an m-estimator given by (3.10) in Chapter 3.
Suppose, for the time being, that the true value β0 were known to us. Then,
by the law of large numbers, a consistent estimator for
E(AD) is given by

Ê0 (AD) = n^{-1} Σ_{i=1}^n A(Xi , β0 )D(Xi , β0 ),     (4.7)

where the subscript “0” is used to emphasize that this statistic is com-
puted with β0 known. As we showed in (4.5), the variance of A(Xi , β0 ){Yi −
µ(Xi , β0 )} is given by E{A(Xi , β0 )V (Xi )AT (Xi , β0 )}, which, by the law of
large numbers, can be consistently estimated by

Ê0 (AV A^T ) = n^{-1} Σ_{i=1}^n A(Xi , β0 ){Yi − µ(Xi , β0 )}{Yi − µ(Xi , β0 )}^T A^T (Xi , β0 ).     (4.8)
Of course, the value β0 is not known to us. But since β̂n is a consistent
estimator for β0 , a natural estimator for the asymptotic variance of β̂n is
given by
{Ê(AD)}^{-1} Ê(AV A^T ) [{Ê(AD)}^{-1} ]^T ,     (4.9)
where Ê(AD) and Ê(AV AT ) are computed as in equations (4.7) and (4.8),
respectively, with β̂n substituted for β0 . The estimator (4.9) is referred to as
the sandwich estimator for the asymptotic variance. More details about this
methodology can be found in Liang and Zeger (1986).
The results above did not depend on any specific parametric assumptions
beyond the moment restriction and regularity conditions. Consequently, the
estimator, given as the solution to equation (4.1), is a semiparametric estima-
tor for the restricted moment model.

Example: Log-linear Model

Consider the problem where we want to model the relationship of a response


variable Y , which is positive, as a function of covariates X. For example, in
the study of HIV disease, CD4 count is a measure of the degree of destruction
that HIV disease has on the immune system. Therefore, it may be of interest
to model CD4 count as a function of covariates such as treatment, age, race,
etc. Let us denote by X = (X1 , . . . , Xq−1 )T the q − 1 vector of covariates
that we are considering. Because CD4 count is a positive random variable, a
popular model is the log-linear model where it is assumed that

log{E(Y |X)} = α + δ1 X1 + . . . + δq−1 Xq−1 .

Here, the parameter of interest is given by β q×1 = (α, δ1 , . . . , δq−1 )T .


This is an example of a restricted moment model where

E(Y |X) = µ(X, β) = exp(α + δ1 X1 + . . . + δq−1 Xq−1 ). (4.10)

The log transformation guarantees that the conditional mean response, given
the covariates, is always positive. Consequently, this model puts no restrictions
on the possible values that β can take.
With a sample of iid data (Yi , Xi ), i = 1, . . . , n, a semiparametric estima-
tor for β can be obtained as the solution to a generalized estimating equation
given by (4.1). In this example, the response variable Y is a single random vari-
able; hence d = 1. If we take Aq×1 (X, β) in (4.1) to equal (1, X1 , . . . , Xq−1 )T ,

then the corresponding GEE estimator β̂n = (α̂n , . . . , δ̂(q−1)n )T is the solution
to

Σ_{i=1}^n (1, Xi^T )^T {Yi − exp(α + δ1 X1i + . . . + δq−1 X(q−1)i )} = 0^{q×1} .     (4.11)

The estimator β̂n is consistent and asymptotically normal with a vari-


ance matrix that can be estimated using (4.9), where the derivative matrix
D1×q (X, β), defined by (4.3), is equal to µ(X, β)(1, X T ), and Ê(AD) and
Ê(AV AT ) are given by equations (4.7) and (4.8), respectively, with β̂n sub-
stituted for β0 . Specifically,

Ê(AD) = n^{-1} Σ_{i=1}^n (1, Xi^T )^T µ(Xi , β̂n )(1, Xi^T ),     (4.12)

Ê(AV A^T ) = n^{-1} Σ_{i=1}^n (1, Xi^T )^T {Yi − µ(Xi , β̂n )}^2 (1, Xi^T ).     (4.13)

Remark 1. The asymptotic variance is the variance matrix of the limiting nor-
mal distribution to which n^{1/2} (β̂n − β0 ) converges. That is, the asymptotic
variance is equal to the (q × q) matrix Σ, where n^{1/2} (β̂n − β0 ) → N (0, Σ) in distribution.
The estimator for the asymptotic variance is denoted by Σ̂n and is given by

Σ̂n = {Ê(AD)}−1 Ê(AV AT ){Ê(AD)}−1 , (4.14)

where Ê(AD) and Ê(AV AT ) are defined by (4.12) and (4.13), respectively.
For practical applications, say, when we are constructing confidence inter-
vals for δj , the regression coefficient for the j-th covariate Xj , j = 1, . . . , q −1,
we must be careful to use the appropriate scaling factor when computing the
estimated standard error for δ̂jn . That is, the 95% confidence interval for δj
is given by
δ̂jn ± 1.96se(δ̂jn ),
and se(δ̂jn ) = {n^{-1} (Σ̂n )(j+1)(j+1) }^{1/2} , where (·)(j+1)(j+1) denotes the (j + 1)-th
diagonal element of the q × q matrix (·)^{q×q} .  
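To make the calculations above concrete, the following short Python sketch (added for illustration; the data-generating mechanism is invented and not part of the text) solves the estimating equation (4.11) by Newton-Raphson, forms the sandwich variance estimator (4.12)-(4.14), and reports a 95% confidence interval for one regression coefficient.

    import numpy as np

    rng = np.random.default_rng(1)
    n, q = 500, 3                                    # intercept plus q - 1 = 2 covariates
    X = rng.normal(size=(n, q - 1))
    Xtil = np.column_stack([np.ones(n), X])          # rows are (1, X_i^T)
    beta_true = np.array([0.5, 0.3, -0.2])
    mu_true = np.exp(Xtil @ beta_true)
    # Any conditional distribution with mean mu(X, beta) fits the restricted
    # moment model; a heteroscedastic gamma is used here purely for illustration.
    Y = rng.gamma(shape=2.0, scale=mu_true / 2.0)

    # Solve the GEE (4.11) by Newton-Raphson.
    beta = np.zeros(q)
    for _ in range(25):
        mu = np.exp(Xtil @ beta)
        score = Xtil.T @ (Y - mu)                    # left-hand side of (4.11)
        EAD = (Xtil * mu[:, None]).T @ Xtil / n      # equation (4.12)
        beta = beta + np.linalg.solve(n * EAD, score)

    mu = np.exp(Xtil @ beta)
    resid = Y - mu
    EAD = (Xtil * mu[:, None]).T @ Xtil / n                      # (4.12) at beta-hat
    EAVA = (Xtil * (resid ** 2)[:, None]).T @ Xtil / n           # (4.13)
    Sigma_hat = np.linalg.inv(EAD) @ EAVA @ np.linalg.inv(EAD)   # (4.14)

    # 95% confidence interval for delta_1, using the scaling noted in Remark 1.
    se = np.sqrt(Sigma_hat[1, 1] / n)
    print("estimate:", beta[1], "95% CI:", (beta[1] - 1.96 * se, beta[1] + 1.96 * se))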

The GEE estimator for β in the log-linear model, given as the solution to
equation (4.11), is just one example of many possible semiparametric estima-
tors. Clearly, we would be interested in finding a semiparametric estimator
that is as efficient as possible; i.e., with as small an asymptotic variance as
possible. We will address this issue later in this chapter.
Some natural questions that arise for semiparametric models are:

(i) How do we find semiparametric estimators, or do they even exist?


(ii) How do we find the best semiparametric estimator?

Although both of these questions are difficult to resolve in general, under-


standing the geometry of influence functions for semiparametric estimators
is often helpful in constructing estimators and assessing efficiency. The ideas
and geometry we developed in Chapter 3 for finite-dimensional parametric
models will now be generalized to semiparametric models.

4.2 Parametric Submodels


As is often the case in mathematics, infinite-dimensional problems are tackled
by first working with a finite-dimensional problem as an approximation and
then taking limits to infinity. Therefore, the first step in dealing with a semi-
parametric model is to consider a simpler finite-dimensional parametric model
contained within the semiparametric model and use the theory and methods
developed in Chapter 3. Toward that end, we define a parametric submodel.

Recall: In a semiparametric model, the data Z1 , . . . , Zn are iid random vectors


with a density that belongs to the class
 
P = p{z, β, η(·)}, where β is q-dimensional and η(·) is infinite-dimensional

with respect to some dominating measure νZ . As we illustrated in some of


the examples of semiparametric models in Chapter 1, the infinite-dimensional
nuisance parameter η is itself often a function and hence denoted as η(·). 
We will denote the “truth” (i.e., the density that generates the data) by
p0 (z) ∈ P, namely
p0 (z) = p{z, β0 , η0 (·)}.
A parametric submodel, which we will denote by Pβ,γ = {p(z, β, γ)}, is a
class of densities characterized by the finite-dimensional parameter (β T , γ T )T
such that
(i) Pβ,γ ⊂ P (i.e., every density in Pβ,γ belongs to the semiparametric
model P) and
(ii) p0 (z) ∈ Pβ,γ (i.e., the parametric submodel contains the truth). Another
way of saying this is that there exists a density identified by the parameter
(β0 , γ0 ) within the parametric submodel such that

p0 (z) = p(z, β0 , γ0 ).

In keeping with the notation of Chapter 3, we will denote the dimension of γ


by “r,” although, in this case, the value r depends on the choice of parametric
submodel.
Remark 2. When we developed the geometry of influence functions for para-
metric models in Chapter 3, certain regularity conditions were implicitly as-
sumed. For example, the parametric model had to have sufficient regularity

conditions to allow the interchange of differentiation and integration of the


density with respect to the parameters. This is necessary, for example, when
we want to prove that the score vector has mean zero. Consequently, the para-
metric model has to satisfy certain smoothness conditions. Similarly, the para-
metric submodels that we will consider must also satisfy certain smoothness
conditions. Appropriate smoothness and regularity conditions on the likeli-
hoods are given in Definition A.1 of the appendix in Newey (1990). Thus,
from here on, when we refer to parametric submodels, we implicitly are as-
suming smooth and regular parametric submodels.  

Remark 3. The terms parametric submodel and parametric model can be con-
fusing. A parametric model is a model whose probability densities are charac-
terized through a finite number of parameters that the data analyst believes
will suffice in identifying the probability distribution that generates the data.
For example, we may be willing to assume that our data follow the model

Yi = µ(Xi , β) + εi , (4.15)

where εi are iid N (0, σ 2 ), independent of Xi . This model is contained within


the semiparametric restricted moment model discussed previously.
In contrast, a parametric submodel is a conceptual idea that is used to
help us develop theory for semiparametric models. The reason we say it is
conceptual is that we require a parametric submodel to contain the truth. But
since we don’t know what the truth is, we can only describe such submodels
generically and hence such models are not useful for data analysis. The para-
metric model given by (4.15) is not a parametric submodel if, in truth, the
data are not normally distributed.  

We now illustrate how a parametric submodel can be defined using the


proportional hazards model as an example. In the proportional hazards model,
we assume
λ(t|X) = λ(t) exp(β T X),
where X = (X1 , . . . , Xq )T denotes a q-dimensional vector of covariates, λ(t)
is some arbitrary hazard function of time that is left unspecified and hence
is infinite-dimensional, and β is the q-dimensional parameter of interest. We
denote (conceptually) the truth by λ0 (t); t ≥ 0 and β0 .
An example of a parametric submodel is as follows. Let h1 (t), . . . , hr (t)
be r different functions of time that are specified by the data analyst. (Any
smooth functions will do.) Consider the model

Pβ,γ = {class of densities with hazard function


λ(t|X) = λ0 (t) exp{γ1 h1 (t) + · · · + γr hr (t)} exp(β T X)},

where γ = (γ1 , . . . , γr )T ∈ Rr and β ∈ Rq .



We note that:
• In this model, the (q +r) parameters (β T , γ T )T are left unspecified. Hence,
this model is indeed a finite-dimensional model.
• For any choice of β and γ, the resulting density follows a proportional
hazards model and is therefore contained in the semiparametric model;
i.e.,
Pβ,γ ⊂ P.
• The truth is obtained by setting β = β0 and γ = 0.
• This parametric submodel is defined using λ0 (t), “the truth,” which is not
known to us; consequently, such a model is not useful for data analysis.
Contrast this with the case where we are willing to consider the parametric
model; namely
λ(t|X) = λ exp(β T X), λ, β unknown.
That is, we assume that the underlying baseline hazard function is constant
over time; i.e., conditional on X, the survival distribution follows an expo-
nential distribution. If we are willing to assume that our data are generated
from some distribution within this parametric model, then we only need to
estimate the parameters λ and β and use this for any subsequent data analy-
sis. Of course, the disadvantage of such a parametric model is that if the data
are not generated from any density within this class, then the estimates we
obtain may be meaningless.

4.3 Influence Functions for Semiparametric RAL Estimators
In Chapter 3, we studied RAL estimators for β for finite-dimensional para-
metric models and derived their asymptotic properties through their influence
function. We also described the geometry of the class of influence functions for
RAL estimators. Consequently, from this development, we know that influence
functions of RAL estimators for β for a parametric submodel:
(i) Belong to the subspace of the Hilbert space H of q-dimensional mean-
zero finite-variance measurable functions (equipped with the covariance
inner product) that are orthogonal to the parametric submodel nuisance
tangent space

Λγ = {B q×r Sγ (Z, β0 , γ0 ), for all B q×r },

where
Sγ^{r×1} = ∂ log p(z, β0 , γ0 )/∂γ .

(ii) The efficient influence function for the parametric submodel is given by

ϕ_{β,γ}^{eff} (Z) = {E(S_{β,γ}^{eff} S_{β,γ}^{eff T} )}^{-1} S_{β,γ}^{eff} (Z, β0 , γ0 ),

where S_{β,γ}^{eff} (Z, β0 , γ0 ), the parametric submodel efficient score, is

Sβ (Z, β0 , η0 ) − Π(Sβ (Z, β0 , η0 )|Λγ ),

and

Sβ^{q×1} (Z, β0 , η0 ) = ∂ log p(z, β0 , η0 )/∂β .

(iii) The smallest asymptotic variance among such RAL estimators for β in
the parametric submodel is

[E{S_{β,γ}^{eff} (Z) S_{β,γ}^{eff T} (Z)}]^{-1} .

An estimator for β is an RAL estimator for a semiparametric model if it


is an RAL estimator for every parametric submodel. Therefore, any influence
function of an RAL estimator in a semiparametric model must be an influence
function of an RAL estimator within a parametric submodel; i.e.,
   
{class of influence functions of RAL estimators for β for P} ⊂ {class of influence functions of RAL estimators for Pβ,γ }.

A heuristic way of looking at this is as follows. If β̂n is a semiparametric


estimator, then we want
 
n^{1/2} (β̂n − β) → N{0, Σ(β, η)}  in distribution under p(·, β, η)

for all p(z, β, η) ∈ P. Such an estimator would necessarily satisfy

n^{1/2} (β̂n − β) → N{0, Σ(β, γ)}  in distribution under p(·, β, γ)

for all p(z; β, γ) ∈ Pβ,γ ⊂ P. However, the converse may not be true. Conse-
quently, the class of semiparametric estimators must be contained within the
class of estimators for a parametric submodel. Therefore:

(i) Any influence function of an RAL semiparametric estimator for β must


be orthogonal to all parametric submodel nuisance tangent spaces.
(ii) The variance of any RAL semiparametric influence function must be
greater than or equal to
[E{S_{β,γ}^{eff} (Z) S_{β,γ}^{eff T} (Z)}]^{-1}

for all parametric submodels Pβ,γ .



Hence, the variance of the influence function for any semiparametric estimator
for β must be greater than or equal to

sup_{all parametric submodels} [E{S_{β,γ}^{eff} S_{β,γ}^{eff T} }]^{-1} .     (4.16)

This supremum is defined to be the semiparametric efficiency bound. Any


semiparametric RAL estimator β̂n with asymptotic variance achieving this
bound for p0 (z) = p(z, β0 , η0 ) is said to be locally efficient at p0 (·). If the
same estimator β̂n is semiparametric efficient regardless of p0 (·) ∈ P, then
we say that such an estimator is globally semiparametric efficient.
Geometrically, the parametric submodel efficient score is the residual of
Sβ (Z, β0 , η0 ) after projecting it onto the parametric submodel nuisance tan-
gent space. For a one-dimensional parameter β, the inverse of the norm-
squared of this residual is the smallest variance of all influence functions for
RAL estimators for the parametric submodel. This analogy can be extended
to q-dimensional β as well by considering the inverse of the variance matrix
of the residual q-dimensional vector.
As we increase the complexity of the parametric submodel, or consider
the linear space spanned by the nuisance tangent spaces of all the parametric
submodels, the corresponding space becomes larger and therefore the norm
of the residual becomes smaller. Hence, the inverse of the variance of the
residual grows larger. This gives a geometric perspective to the observation
that the efficient semiparametric estimator has a variance larger than the
efficient estimator for any parametric submodel.

4.4 Semiparametric Nuisance Tangent Space


Let us be a bit more formal.
Definition 1. The nuisance tangent space for a semiparametric model, de-
noted by Λ, is defined as the mean-square closure of parametric submodel
nuisance tangent spaces, where a parametric submodel nuisance tangent space
is the set of elements
{B q×r Sγr×1 (Z, β0 , η0 )},
Sγ (Z, β0 , η0 ) is the score vector for the nuisance parameter γ for some para-
metric submodel, and B q×r is a conformable matrix with q-rows. Specifically,
the mean-square closure of the spaces above is defined as the space Λ ⊂ H,
where Λ = [hq×1 (Z) ∈ H such that E{hT (Z)h(Z)} < ∞ and there exists a
sequence Bj Sγj (Z) such that

||h(Z) − Bj Sγj (Z)||^2 → 0 as j → ∞

for a sequence of parametric submodels indexed by j], where ||h(Z)||^2 =
E{hT (Z)h(Z)}.  

Remark 4. The Hilbert space H is also a metric space (i.e., a set of elements
where a notion of distance between elements of the set is defined). For any
two elements h1 , h2 ∈ H, we can define the distance between them as
||h2 − h1 || = [E{(h2 − h1 )T (h2 − h1 )}]^{1/2} . The closure of a set S,
where, in this setting, a set consists of q-dimensional random functions with
mean zero and finite variance, is defined as the smallest closed set that contains
S, or equivalently, as the set of all elements in S together with all the limit
points of S. The closure of S is denoted by S̄. Thus the closure of a set is itself
a closed set. Limits must be defined in terms of a distance between elements.
The word mean-square is used because limits are taken with respect to the
distance, which in this case is the square root of the expected sum of squared
differences between the q-components of the two elements (i.e., between the
two q-dimensional random functions). Therefore, the mean-square closure is
larger and contains the union of all parametric submodel nuisance tangent
spaces. Therefore, if we denote by S the union of all parametric submodel
nuisance tangent spaces, then Λ = S̄ is the semiparametric nuisance tangent
space.  

Remark 5. Although the space Λ is closed, it may not necessarily be a linear


space. However, in most applications it is a linear space, and certainly is in
any of the examples used in this book. Therefore, from now on, we will always
assume the space Λ is a linear space and, by construction, is also a closed
space. This is important because in order for the projection theorem to be
guaranteed to apply (i.e., that a unique projection of an element to a linear
subspace exists), that linear subspace must be a closed linear subspace of the
Hilbert space. 

Before deriving the semiparametric efficient influence function, we first


define the semiparametric efficient score vector and give some results regarding
the semiparametric efficiency bound.

Definition 2. The semiparametric efficient score for β is defined as

Seff (Z, β0 , η0 ) = Sβ (Z, β0 , η0 ) − Π{Sβ (Z, β0 , η0 )|Λ}.

There is no difficulty with this definition since the nuisance tangent space Λ
is a closed linear subspace and therefore the projection Π{Sβ (Z, β0 , η0 )|Λ}
exists and is unique. 

Theorem 4.1. The semiparametric efficiency bound, defined by (4.16), is


equal to the inverse of the variance matrix of the semiparametric efficient
score; i.e.,
[E{Seff (Z)Seff^T (Z)}]^{-1} .

Proof. For simplicity, we take β to be a scalar (i.e., q = 1), although this can
be extended to q > 1 using arguments in Section 3.4, where a generalization of

the Pythagorean theorem to dimension q > 1 was derived (see (3.31)). Denote
by V the semiparametric efficiency bound, which, when q = 1, is defined by
sup_{all parametric submodels Pβ,γ} ||S_{β,γ}^{eff} (Z)||^{-2} = V,

where

S_{β,γ}^{eff} (Z) = Sβ (Z) − Π(Sβ (Z)|Λγ ).

Since Λγ ⊂ Λ, this implies that ||Seff (Z)|| ≤ ||S_{β,γ}^{eff} (Z)|| for all parametric
submodels Pβ,γ . Hence

||Seff (Z)||^{-2} ≥ sup_{all Pβ,γ} ||S_{β,γ}^{eff} (Z)||^{-2} = V.     (4.17)

To complete the proof of the theorem, we need to show that ||Seff (Z)||^{-2} is
also less than or equal to V. But because Π(Sβ (Z)|Λ) ∈ Λ, this means that
there exists a sequence of parametric submodels Pβ,γj with nuisance score
vectors Sγj (Z) such that

2 j→∞
Π(Sβ (Z)|Λ) − Bj Sγj (Z) −−−→ 0

for conformable matrices Bj .


By the definition of V, for any Pβ,γj , we obtain V −1 ≤ eff
Sβ,γj
(Z) 2 .
Therefore

V −1 ≤ Sβ,γ
eff
j
(Z) 2
= Sβ (Z) − Π(Sβ (Z)|Λγj ) 2
≤ Sβ (Z) − Bj Sγj (Z) 2

2
= Sβ (Z) − Π(Sβ (Z)|Λ) + Π(Sβ (Z)|Λ) − Bj Sγj (Z) 2 . (4.18)

Because Sβ (Z) − Π(Sβ (Z)|Λ) is orthogonal to Λ and Π(Sβ (Z)|Λ) − Bj Sγj (Z)
is an element of Λ, the last equality in (4.18) follows from the Pythagorean
theorem. Taking j → ∞ implies
||Sβ (Z) − Π(Sβ (Z)|Λ)||^2 = ||Seff (Z)||^2 ≥ V^{-1}

or

||Seff (Z)||^{-2} ≤ V.

Together with (4.17), we conclude that ||Seff (Z)||^{-2} = V. 


Definition 3. The efficient influence function is defined as the influence func-


tion of a semiparametric RAL estimator, if it exists (see remark below), that
achieves the semiparametric efficiency bound.  

Remark 6. In the development that follows, we will construct a unique element


of H that always exists, which will be defined as the efficient influence function.
We will prove that if a semiparametric RAL estimator exists that has an
influence function whose variance is the semiparametric efficiency bound, then
this influence function must be the efficient influence function. There is no
guarantee, however, that such an RAL estimator can be derived.  
Theorem 4.2. Any semiparametric RAL estimator for β must have an influ-
ence function ϕ(Z) that satisfies
(i) E{ϕ(Z)Sβ^T (Z, β0 , η0 )} = E{ϕ(Z)Seff^T (Z, β0 , η0 )} = I^{q×q}
and
(ii) Π{ϕ(Z)|Λ} = 0; i.e., ϕ(Z) is orthogonal to the nuisance tangent space.
The efficient influence function is now defined as the unique element satisfying
conditions (i) and (ii) whose variance matrix equals the efficiency bound and
is equal to
ϕeff (Z, β0 , η0 ) = {E(Seff Seff^T )}^{-1} Seff (Z, β0 , η0 ).

Proof. We first prove condition (ii). To show that ϕ(Z) is orthogonal to Λ, we


must prove that ⟨ϕ, h⟩ = 0 for all h ∈ Λ. By the definition of Λ, there exists
a sequence Bj Sγj (Z) such that

||h(Z) − Bj Sγj (Z)|| → 0 as j → ∞

for a sequence of parametric submodels indexed by j. Hence

⟨ϕ, h⟩ = ⟨ϕ, Bj Sγj ⟩ + ⟨ϕ, h − Bj Sγj ⟩.

Because any influence function of a semiparametric RAL estimator for β must


be an influence function for an RAL estimator in a parametric submodel, then
condition (ii) of Corollary 1 of Theorem 3.2 implies that ϕ is orthogonal to
Λγj and hence the first term in the sum above is equal to zero. By the Cauchy-
Schwartz inequality, we obtain

|⟨ϕ, h⟩| ≤ ||ϕ|| ||h − Bj Sγj ||.

Taking limits as j → ∞ gives us the desired result.


To prove condition (i) above, we note that by condition (i) of Corollary
1 of Theorem 3.2, ϕ(Z) must satisfy E{ϕ(Z)SβT (Z, β0 , η0 )} = I q×q . Since
E{ϕ(Z)Seff^T (Z, β0 , η0 )} = E{ϕ(Z)Sβ^T (Z, β0 , η0 )} − E[ϕ(Z)Π{Sβ (Z, β0 , η0 )|Λ}^T ],
the result follows because Π{Sβ (Z, β0 , η0 )|Λ} ∈ Λ and since, by condition (ii),
ϕ(Z) is orthogonal to Λ, this implies that E[ϕ(Z)Π{Sβ (Z, β0 , η0 )|Λ}^T ] = 0.
It is now easy to show that the efficient influence function is given by
ϕeff (Z, β0 , η0 ) = {E(Seff Seff^T )}^{-1} Seff (Z, β0 , η0 ).

Clearly, ϕeff (Z, β0 , η0 ) satisfies conditions (i) and (ii) above and moreover
has variance matrix E{ϕeff (Z)ϕTeff (Z)} = V, where V is the semiparametric
efficiency bound.  

The results of Theorem 4.2 apply specifically to semiparametric models


that can be parametrized through (β, η), where β is the parameter of interest,
η is the infinite-dimensional nuisance parameter, and for which the nuisance
tangent space Λ can be readily computed. For some semiparametric models,
it may be more convenient to parametrize the model through an infinite-
dimensional parameter θ for which the tangent space can be readily computed
and define the parameter of interest as β(θ); i.e., a smooth q-dimensional
function of θ.
For parametric models, we showed in Chapter 3, Theorem 3.4, that the
space of influence functions for RAL estimators for β lies on a linear variety
defined as ϕ(Z) + Tθ⊥ , where ϕ(Z) is the influence function of any RAL esti-
mator for β and Tθ is the parametric model tangent space, the space spanned
by the score vector Sθ (Z). Similarly, for semiparametric models, we could
show that the influence functions of semiparametric RAL estimators for β
must lie on the linear variety ϕ(Z) + T ⊥ , where ϕ(Z) is the influence func-
tion of any semiparametric RAL estimator for β and T is the semiparametric
tangent space, defined as the mean-square closure of all parametric submodel
tangent spaces. The element in this linear variety with the smallest norm is
the unique element ϕ(Z) − Π{ϕ(Z)|T ⊥ } = Π{ϕ(Z)|T }, which can be shown
to be the efficient influence function ϕeff (Z).
We give these results in the following theorem, whose proof is analogous
to that in Theorems 3.4 and 3.5 and is therefore omitted.

Theorem 4.3. If a semiparametric RAL estimator for β exists, then the influ-
ence function of this estimator must belong to the space of influence functions,
the linear variety ϕ(Z) + T ⊥ , where ϕ(Z) is the influence function of any
semiparametric RAL estimator for β and T is the semiparametric tangent
space, and if an RAL estimator for β exists that achieves the semiparametric
efficiency bound (i.e., a semiparametric efficient estimator), then the influence
function of this estimator must be the unique and well-defined element

ϕeff (Z) = ϕ(Z) − Π{ϕ(Z)|T ⊥ } = Π{ϕ(Z)|T }.

What is not clear is whether there exist semiparametric estimators that


will have influence functions corresponding to the elements of the Hilbert
space satisfying conditions (i) and (ii) of Theorem 4.2 or Theorem 4.3 (al-
though we might expect that arguments similar to those in Section 3.3, used
to construct estimators for finite-dimensional parametric models, will extend
to semiparametric models as well).
In many cases, deriving the space of influence functions, or even the space
orthogonal to the nuisance tangent space, for semiparametric models, will

suggest how semiparametric estimators may be constructed and even how to


find locally or globally efficient semiparametric estimators. We will illustrate
this for the semiparametric restricted moment model. But before doing so,
it will be instructive to develop some methods and tools for finding infinite-
dimensional tangent spaces. We start with the nonparametric model, where we
put no restrictions on the class of densities and show that the corresponding
tangent space is the entire Hilbert space H.

Tangent Space for Nonparametric Models

Suppose we are interested in estimating some q-dimensional parameter β for


a nonparametric model. That is, let Z1 , . . . , Zn be iid random vectors with
arbitrary density p(z) with respect to a dominating measure νZ , where the
only restriction on p(z) is that p(z) ≥ 0 and

∫ p(z) dν(z) = 1.

Theorem 4.4. The tangent space (i.e., the mean-square closure of all para-
metric submodel tangent spaces) is the entire Hilbert space H.

Proof. Consider any parametric submodel Pθ = {p(z, θ), θ, say s-dimensional}.


The parametric submodel tangent space is
Λθ = {B^{q×s} Sθ (Z), for all constant matrices B^{q×s} },

where

Sθ (z) = ∂ log p(z, θ0 )/∂θ .
Denote the truth as p0 (z) = p(z, θ0 ). From the usual properties of score vec-
tors, we know that
E{Sθ (Z)} = 0s×1 .
Consequently, the linear subspace Λθ ⊂ H.

Reminder: When we write E{Sθ (Z)}, we implicitly mean that the expectation
is computed with respect to the truth; i.e.,

E0 {Sθ (Z, θ0 )} or Eθ0 {Sθ (Z, θ0 )}. 




To complete the proof, we need to show that any element of H can be


written as an element of Λθ for some parametric submodel or a limit of such
elements. Choose an arbitrary element of H, say h(Z) that is a bounded mean-
zero q-dimensional measurable function with finite variance. Consider the
parametric submodel p(z, θ) = p0 (z){1 + θT h(z)}, where θ is a q-dimensional
vector sufficiently small so that

{1 + θT h(z)} ≥ 0 for all z. (4.19)

Condition (4.19) is necessary to guarantee that p(z, θ) is a proper density.


Because h(·) is a bounded function, the set of θ satisfying (4.19) contains an
open set in Rq . This must be the case in order to ensure that the partial
derivatives of p(z, θ) with respect to θ exist. Moreover, every element p(z, θ)
in the parametric submodel satisfies
 
∫ p(z, θ) dν(z) = ∫ p0 (z){1 + θT h(z)} dν(z)
               = ∫ p0 (z) dν(z) + θT ∫ h(z)p0 (z) dν(z) = 1,

where the first integral on the right-hand side equals 1 and the second equals 0.

This guarantees that p(z, θ), for θ in some neighborhood of the truth, is a
proper density function. For this parametric submodel, the score vector is

Sθ (z) = ∂ log[p0 (z){1 + θT h(z)}]/∂θ |_{θ=0} = h(z).

If we choose B q×q to be I q×q (i.e., the q ×q identity matrix), then h(Z), which
also equals I q×q h(Z), is an element of this parametric submodel tangent space.
Thus we have shown that the tangent space contains all bounded mean-
zero random vectors. The proof is completed by noting that any element of
H can be approximated by a sequence of bounded h.  
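As a small numerical illustration of this construction (added here; the specific choices are assumptions made only for the example), take p0 to be the standard normal density and h(z) = tanh(z), which is bounded with mean zero under p0 . The Python sketch below checks by quadrature that p(z, θ) = p0 (z){1 + θ h(z)} integrates to one for a small θ and that the submodel score at θ = 0, namely h(Z), has mean zero under the truth.

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    # Illustrative choices (not from the text): p0 = standard normal,
    # h(z) = tanh(z), so |h| <= 1 and p(z, theta) = p0(z){1 + theta*h(z)}
    # is a proper density whenever |theta| < 1.
    h = np.tanh
    theta = 0.3
    p = lambda z: norm.pdf(z) * (1.0 + theta * h(z))

    total_mass, _ = quad(p, -np.inf, np.inf)                             # ~ 1
    mean_score, _ = quad(lambda z: h(z) * norm.pdf(z), -np.inf, np.inf)  # ~ 0

    # The score of the submodel at theta = 0 is h(z), so mean_score is the
    # expectation of the score under the truth, which must be zero.
    print(total_mass, mean_score)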

Remark 7. On the dimension of θ


When we defined an arbitrary parametric submodel through the parameter
θ, the dimension of θ was taken to be s-dimensional, where s was arbitrary.
The corresponding score vector, which also is s-dimensional, had to be pre-
multiplied by some arbitrary constant matrix B q×s to obtain an element in the
Hilbert space. However, when we were constructing a parametric submodel
that led to the score vector h(Z), we chose θ to be q-dimensional to conform
to the dimension of h(Z); i.e., for this particular parametric submodel, we
took s to equal q.  

Partitioning the Hilbert Space

Suppose Z is an m-dimensional random vector, say Z = Z (1) , . . . , Z (m) . Then


the density of Z can be expressed as the product of conditional densities,
namely

pZ (z) =pZ (1) (z (1) ) × pZ (2) |Z (1) (z (2) |z (1) )


× . . . × pZ (m) |Z (1) ,...,Z (m−1) (z (m) |z (1) , . . . , z (m−1) ),

where
pZ (j) |Z (1) ,...,Z (j−1) (z (j) |z (1) , . . . , z (j−1) ) (4.20)
is the conditional density of Z (j) given Z (1) , . . . , Z (j−1) , defined with respect
to the dominating measure νj . If we put no restrictions on the density of Z
(i.e., the nonparametric model) or, equivalently, put no restrictions on the
conditional densities above, then the j-th conditional density (4.20) is any
positive function ηj (z (1) , . . . , z (j) ) such that

∫ ηj (z (1) , . . . , z (j) ) dνj (z (j) ) = 1

for all z (1) , . . . , z (j−1) , j = 1, . . . , m. With this representation, we note that


the nonparametric model can be represented by the m variationally indepen-
dent infinite-dimensional nuisance parameters η1 (·), . . . , ηm (·). By variation-
ally independent, we mean that the product of any combination of arbitrary
η1 (·), . . . , ηm (·) can be used to construct a valid density for Z (1) , . . . , Z (m) in
the nonparametric model.
The tangent space T is the mean-square closure of all parametric sub-
model tangent spaces, where a parametric submodel is given by the class of
densities
p(z (1) , γ1 ) ∏_{j=2}^m p(z (j) |z (1) , . . . , z (j−1) , γj ),

γj , j = 1, . . . , m are sj -dimensional parameters that are variationally indepen-


dent, and p(z (j) |z (1) , . . . , z (j−1) , γ0j ) denotes the true conditional density of
Z (j) given Z (1) , . . . , Z (j−1) . The parametric submodel tangent space is defined
as the space spanned by the score vectors Sγj (Z (1) , . . . , Z (m) ), j = 1, . . . , m,
where Sγj (z (1) , . . . , z (m) ) = ∂ log p(z (1) , . . . , z (m) , γ1 , . . . , γm )/∂γj . Because
the density of Z (1) , . . . , Z (m) is a product of conditional densities, each
parametrized through parameters γj that are variationally independent, this
implies that the log-density is a sum of log-conditional densities with respect
to variationally independent parameters γj . Hence, Sγj (z (1) , . . . , z (m) ), which
is defined as ∂ log p(z (j) |z (1) , . . . , z (j−1) , γj )/∂γj , is a function of z (1) , . . . , z (j)
only. Moreover, the parametric submodel tangent space can be written as the
space
Tγ = {B1^{q×s1} Sγ1 (Z (1) ) + . . . + Bm^{q×sm} Sγm (Z (1) , . . . , Z (m) )
for all constant matrices B1^{q×s1} , . . . , Bm^{q×sm} }. Consequently, the parametric sub-
model tangent space is equal to

Tγ = Tγ1 ⊕ . . . ⊕ Tγm ,

where

Tγj = {B q×sj Sγj (Z (1) , . . . , Z (j) ), for all constant matrices B q×sj }.
It is now easy to verify that the tangent space T , the mean-square closure
of all parametric submodel tangent spaces, is equal to
T = T1 ⊕ . . . ⊕ Tm ,
where Tj , j = 1, . . . , m, is the mean-square closure of parametric submodel
tangent spaces for ηj (·), where a parametric submodel for ηj (·) is given by the
class of conditional densities
Pγj = {p(z (j) |z (1) , . . . , z (j−1) , γj ), γj − say sj − dimensional}
and the parametric submodel tangent space Tγj is the linear space spanned
by the score vector Sγj (Z (1) , . . . , Z (j) ).
We now are in a position to derive the following results regarding the
partition of the Hilbert space H into a direct sum of orthogonal subspaces.
Theorem 4.5. The tangent space T for the nonparametric model, which we
showed in Theorem 4.4 is the entire Hilbert space H, is equal to
T = H = T1 ⊕ . . . ⊕ Tm ,
where
 
T1 = {α1^{q×1} (Z (1) ) ∈ H : E{α1^{q×1} (Z (1) )} = 0^{q×1} }

and

Tj = {αj^{q×1} (Z (1) , . . . , Z (j) ) ∈ H :     (4.21)
      E{αj^{q×1} (Z (1) , . . . , Z (j) )|Z (1) , . . . , Z (j−1) } = 0^{q×1} }, j = 2, . . . , m,

and Tj , j = 1, . . . , m are mutually orthogonal spaces. Equivalently, the linear


space Tj can be defined as the space
 
[h∗j^{q×1} (Z (1) , . . . , Z (j) ) − E{h∗j^{q×1} (Z (1) , . . . , Z (j) )|Z (1) , . . . , Z (j−1) }]     (4.22)

for all square-integrable functions h∗j^{q×1} (·) of Z (1) , . . . , Z (j) .
In addition, any element h(Z (1) , . . . , Z (m) ) ∈ H can be decomposed into
orthogonal elements
h = h 1 + . . . + hm ,
where
h1 (Z (1) ) = E{h(·)|Z (1) },
hj (Z (1) , . . . , Z (j) ) = E{h(·)|Z (1) , . . . , Z (j) } − E{h(·)|Z (1) , . . . , Z (j−1) },     (4.23)
for j = 2, . . . , m, and hj (·) is the projection of h onto Tj ; i.e., hj (·) =
Π{h(·)|Tj }.

Proof. That the partition of the tangent space Tj associated with the nuisance
parameter ηj (·) is the set of elements given by (4.21) follows by arguments
similar to those for the proof of Theorem 4.4. That is, because of properties of
score functions for parametric models of conditional densities, the score vector
Sγj (·) must be a function only of Z (1) , . . . , Z (j) and must have conditional
expectation

E{Sγj (Z (1) , . . . , Z (j) )|Z (1) , . . . , Z (j−1) } = 0sj ×1 .

Consequently, any q-dimensional element spanned by Sγj (·) must belong to


Tj . Conversely, for any bounded element αj (Z (1) , . . . , Z (j) ) in Tj , consider
the parametric submodel

pj (z (j) |z (1) , . . . , z (j−1) , θj ) = p0j (z (j) |z (1) , . . . , z (j−1) ){1+θjT αj (z (1) , . . . , z (j) )},
(4.24)
where p0j (z (j) |z (1) , . . . , z (j−1) ) denotes the true conditional density of Z (j)
given Z (1) , . . . , Z (j−1) and θj is a q-dimensional parameter chosen sufficiently
small to guarantee that pj (z (j) |z (1) , . . . , z (j−1) , θj ) is positive. This class of
functions is clearly a parametric submodel since

∫ p0j (z (j) |z (1) , . . . , z (j−1) ){1 + θjT αj (z (1) , . . . , z (j) )} dνj (z (j) ) = 1,

which follows because

∫ p0j (z (j) |z (1) , . . . , z (j−1) ) dνj (z (j) ) = 1,

∫ p0j (z (j) |z (1) , . . . , z (j−1) ){θjT αj (z (1) , . . . , z (j) )} dνj (z (j) ) = 0,     (4.25)

where (4.25) is equal to θjT E{αj (Z (1) , . . . , Z (j) )|Z (1) , . . . , Z (j−1) }, which must
equal zero by the definition of Tj . The score vector for the parametric sub-
model (4.24) is

Sθj (·) = ∂ log[p0j (z (j) |z (1) , . . . , z (j−1) ){1 + θjT αj (z (1) , . . . , z (j) )}]/∂θj |_{θj =0}
        = αj (z (1) , . . . , z (j) ).

Thus we have shown that the tangent space for every parametric submodel
of ηj (·) is contained in Tj and every bounded element in Tj belongs to the
tangent space for some parametric submodel of ηj (·). The argument is com-
pleted by noting that every element of Tj is the limit of bounded elements of
Tj .
That the projection of any element h ∈ H onto Tj is given by hj (·),
defined by (4.23), can be verified directly. Clearly hj (·) ∈ Tj . Therefore, by
the projection theorem for Hilbert spaces, we only need to verify that h − hj
is orthogonal to every element of Tj . Consider any arbitrary element lj ∈ Tj .

Then, because hj and lj are functions of Z (1) , . . . , Z (j) , we can use the law of
iterated conditional expectations to obtain

E{(h − hj )T lj } = E[E{(h − hj )T lj |Z (1) , . . . , Z (j) }]
                 = E[{E(h|Z (1) , . . . , Z (j) ) − hj }T lj ]
                 = E[{E(h|Z (1) , . . . , Z (j−1) )}T lj ]     (4.26)
                 = E( E[{E(h|Z (1) , . . . , Z (j−1) )}T lj |Z (1) , . . . , Z (j−1) ] )
                 = E[{E(h|Z (1) , . . . , Z (j−1) )}T E(lj |Z (1) , . . . , Z (j−1) )] = 0.     (4.27)

Note that (4.26) follows from the definition of hj . The equality in (4.27) follows
because lj ∈ Tj , which, in turn, implies that E(lj |Z (1) , . . . , Z (j−1) ) = 0.
Finally, in order to prove that Tj , j = 1, . . . , m are mutually orthogonal
subspaces, we must show that hj is orthogonal to hj′ , where hj ∈ Tj , hj′ ∈ Tj′ ,
and j ≠ j′ , j, j′ = 1, . . . , m. This follows using the law of iterated conditional
expectations, which we leave for the reader to verify.  
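As a concrete illustration of this decomposition (added here for illustration; the distributional choices are assumptions, not part of the text), take m = 2 with Z (1) ~ N(0, 1) and Z (2) |Z (1) ~ N(Z (1) , 1), and consider the mean-zero function h(Z) = Z (1) Z (2) − 1. Then h1 = E{h|Z (1) } = (Z (1) )^2 − 1 and h2 = h − h1 = Z (1) {Z (2) − Z (1) }, and a short Monte Carlo check in Python confirms that h1 and h2 are essentially uncorrelated and that var(h) ≈ var(h1 ) + var(h2 ).

    import numpy as np

    rng = np.random.default_rng(2)
    N = 200_000
    Z1 = rng.normal(size=N)
    Z2 = Z1 + rng.normal(size=N)        # Z(2) | Z(1) ~ N(Z(1), 1)

    h = Z1 * Z2 - 1.0                   # a mean-zero element of the Hilbert space
    h1 = Z1 ** 2 - 1.0                  # E{h | Z(1)}, the projection onto T1
    h2 = h - h1                         # = Z1 * (Z2 - Z1), the projection onto T2

    print(np.mean(h1 * h2))                       # ~ 0: h1 and h2 are orthogonal
    print(np.var(h), np.var(h1) + np.var(h2))     # ~ 3 and ~ 3: variances add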

4.5 Semiparametric Restricted Moment Model


We will focus a great deal of attention on studying the geometric properties
of influence functions of estimators for parameters in the restricted moment
model because of its widespread use in statistical modeling. The relationship of
the response variable Y , which may be univariate or multivariate, and covari-
ates X is modeled by considering the conditional expectation of Y given X as a
function of X and a finite number of parameters β. This includes linear as well
as nonlinear models. For example, if Y is a univariate response variable, then,
in a linear model, we would assume that E(Y |X) = X T β, where the dimen-
sions of X and β were, say, equal to q. We also gave an example of a log-linear
model earlier, in Section 4.1. More generally, the response variable Y may be
multivariate, say d-dimensional, and the relationship of the conditional expec-
tation of Y given X may be linear or nonlinear, say E(Y |X) = µ(X, β), where
µ(x, β) is a d-dimensional function of the covariates X and the q-dimensional
parameter β. Therefore, such models are particularly useful when modeling
longitudinal and/or multivariate response data. The goal is to estimate the
parameter β using a sample of iid data (Yi , Xi ), i = 1, . . . , n. As we argued in
Chapter 1, without any additional restrictions on the joint probability distri-
bution of Y and X, this is a semiparametric model. The restricted moment
model can also be expressed as

Y = µ(X, β) + ε,

where

E(ε|X) = 0.

We will assume, for the time being, that the d-dimensional response vari-
able Y is continuous; i.e., the dominating measure is the Lebesgue measure,
which we will denote by Y . It will be shown later how this can be generalized
to more general dominating measures that will also allow Y to be discrete.
The covariates X may be continuous, discrete, or mixed, and we will denote
the dominating measure by νX .
The observed data are assumed to be realizations of the iid random vec-
tors (Z1 , . . . , Zn ), where Zi = (Yi , Xi ). Our aim is to find semiparametric
estimators for β and identify, if possible, the most efficient semiparametric
estimator.
The density of a single observation, denoted by p(z), belongs to the semi-
parametric model  
P= p{z, β, η(·)}, z = (y, x) ,

defined with respect to the dominating measure Y × νX . The truth (i.e., the
density that generates the data) is denoted by p0 (z) = p{z, β0 , η0 (·)}. Because
there is a one-to-one transformation of (Y, X) and (ε, X), we can express the
density
pY,X (y, x) = pε,X {y − µ(x, β), x}, (4.28)
where pε,X (ε, x) is a density with respect to the dominating measure ε × νX .
The restricted moment model only makes the assumption that

E(ε|X) = 0.

As illustrated in Example 1 of Section 1.2, the density of (ε, X) can be ex-


pressed as

pε,X (ε, x) = η1 (ε, x) η2 (x), (4.29)

where η1 (ε, x) = pε|X (ε|x) is any nonnegative function such that



∫ η1 (ε, x) dε = 1 for all x,     (4.30)

∫ ε η1 (ε, x) dε = 0 for all x,     (4.31)

and pX (x) = η2 (x) is a nonnegative function of x such that



∫ η2 (x) dν(x) = 1.     (4.32)

The set of functions η1 (ε, x) and η2 (x), satisfying the constraints (4.30),
(4.31), and (4.32), are infinite-dimensional and can be used to characterize
the semiparametric model as

p{z, β, η1 (·), η2 (·)} = η1 {y − µ(x, β), x}η2 (x).



The true density generating the data is denoted by

p0 (z) = η10 {y − µ (x, β0 ), x} η20 (x)


= p{z, β0 , η10 (·), η20 (·)}.

To develop the semiparametric theory and define the semiparametric nui-


sance tangent space, we first consider parametric submodels. Instead of ar-
bitrary functions η1 (ε, x) and η2 (x) for pε|X (ε|x) and pX (x) satisfying con-
straints (4.30), (4.31), and (4.32), we will consider parametric submodels

pε|X (ε|x, γ1 ) and pX (x, γ2 ),

where γ1 is an r1 -dimensional vector and γ2 is an r2 -dimensional vector. Thus


γ = (γ1T , γ2T )T is an r-dimensional vector, r = r1 + r2 .
This parametric submodel is given as

Pβ,γ = {p(z, β, γ1 , γ2 ) = pε|X {y − µ(x, β)|x, γ1 }pX (x, γ2 ) (4.33)

for (β^T , γ1^T , γ2^T )^T ∈ Ωβ,γ ⊂ R^{q+r} }.

Also, to be a parametric submodel, Pβ,γ must contain the truth; i.e.,

p0 (z) = pε|X {y − µ(x, β0 )|x, γ10 }pX (x, γ20 ).

We begin by defining the parametric submodel nuisance tangent space and


will show how this leads us to the semiparametric model nuisance tangent
space. The parametric submodel nuisance score vector is given as
Sγ (z, β0 , γ0 ) = [{∂ log p(z, β, γ)/∂γ1 }^T , {∂ log p(z, β, γ)/∂γ2 }^T ]^T evaluated at β = β0 , γ = γ0
                 = {Sγ1^T (z, β0 , γ0 ), Sγ2^T (z, β0 , γ0 )}^T .

By (4.33), we note that

log p(z, β, γ1 , γ2 ) = log pε|X {y − µ(x, β)|x, γ1 } + log pX (x, γ2 ).

Therefore,

Sγ1 (z, β0 , γ0 ) = ∂ log pε|X {y − µ(x, β0 )|x, γ10 }/∂γ1 ,

and

Sγ2 (z, β0 , γ0 ) = ∂ log pX (x, γ20 )/∂γ2 .
Since we are taking derivatives with respect to γ1 and γ2 and leaving β fixed
for the time being at “β0 ,” we will use the simplifying notation that

ε = y − µ(x, β0 ).

Also, unless stated otherwise, if parameters are omitted in an expression, they


will be understood to be evaluated at the truth. So, for example,

Sγ1 (z, β0 , γ0 ) = ∂ log pε|X {y − µ(x, β0 )|x, γ10 }/∂γ1

will be denoted as Sγ1 (ε, x).


A typical element in the parametric submodel nuisance tangent space is
given by
B^{q×r} Sγ (ε, X) = B1^{q×r1} Sγ1 (ε, X) + B2^{q×r2} Sγ2 (X),
where B^{q×r} is a matrix of constants.

Therefore, the parametric submodel nuisance tangent space


Λγ = {B^{q×r} Sγ for all B^{q×r} }

can be written as the direct sum Λγ1 ⊕ Λγ2 , where

Λγ1 = {B^{q×r1} Sγ1 (ε, X) for all B^{q×r1} }     (4.34)

and

Λγ2 = {B^{q×r2} Sγ2 (X) for all B^{q×r2} }.     (4.35)

It is easy to show that the space Λγ1 is orthogonal to the space Λγ2 , as we
demonstrate in the following lemma.
Lemma 4.1. The space Λγ1 defined by (4.34) is orthogonal to the space Λγ2
defined by (4.35).

Proof. Since pε|X (ε|x, γ1 ) is a conditional density, then by properties of score


vectors
E{Sγ1 (ε, X)|X} = 0. (4.36)
(Equation (4.36) is also derived explicitly later in (4.37).) Similarly,

E{Sγ2 (X)} = 0.

Consequently,

E{Sγ1 (ε, X)Sγ2^T (X)} = E[E{Sγ1 (ε, X)Sγ2^T (X)|X}]
                       = E[E{Sγ1 (ε, X)|X} Sγ2^T (X)] = 0^{r1 ×r2} ,

where the inner conditional expectation E{Sγ1 (ε, X)|X} equals zero by (4.36).

Convince yourself that (4.36) suffices to show that every element of Λγ1 is
orthogonal to every element of Λγ2 . 


By definition, the semiparametric nuisance tangent space


 
Λ = {mean-square closure of all parametric submodel nuisance tangent spaces}
  = {mean-square closure of Λγ1 ⊕ Λγ2 }.

Because γ1 , γ2 are variationally independent – that is, proper densities in the


parametric submodel can be defined by considering any combination of γ1
and γ2 – this implies that Λ = Λ1s ⊕ Λ2s , where

Λ1s = {mean-square closure of all Λγ1 },


Λ2s = {mean-square closure of all Λγ2 }.

We now show how to explicitly derive the spaces Λ1s , Λ2s and the space
orthogonal to the nuisance tangent space Λ⊥ .

The Space Λ2s

Since here we are considering marginal distributions of X with no restrictions,


finding the space Λ2s is similar to finding the nuisance tangent space for
the nonparametric model given in Section 4.4. For completeness, we give the
arguments so that the reader can become more facile with the techniques used
in such exercises.

Theorem 4.6. The space Λ2s consists of all q-dimensional mean-zero func-
tions of X with finite variance.

Remark 8. In many cases, the structure of the parametric submodel nuisance


tangent space will allow us to make an educated guess for the semiparametric
nuisance tangent space; i.e., the mean-square closure of parametric submodel
nuisance tangent spaces. After we make such a guess, we then need to verify
that our guess is correct. 

We illustrate below.

Proof. For any parametric submodel,



(i) Sγ2 (Z) = Sγ2 (X) is a function only of X


and
(ii) Any score vector has mean zero

E{Sγ2 (X)} = 0.

Therefore, any element of Λγ2 = {B q×r2 Sγ2 (X) for all B} is a q-dimensional
function of X with mean zero. It may be reasonable to guess that Λ2s , the
mean-square closure of all Λγ2 , is the linear subspace of all q-dimensional
mean-zero functions of X.

Recall: The Hilbert space is made up of all q-dimensional mean-zero functions


of Z = (Y, X); hence, the conjectured space Λ2s ⊂ H.  

We denote the conjectured space as


Λ2s^{(conj)} = {all q-dimensional mean-zero functions of X}.

In order to verify that our conjecture is true, we must demonstrate:


(a) Any element of Λγ2 , for any parametric submodel, belongs to Λ2s^{(conj)} , and
conversely,
(b) any element of Λ2s^{(conj)} is either an element of Λγ2 for some parametric
submodel or a limit of such elements.
(a) is true because
E{B q×r2 Sγ2 (X)} = 0q×1 .
To verify (b), we start by choosing a bounded element α(X) ∈ Λ2s^{(conj)} ; i.e.,

E{αq×1 (X)} = 0q×1 ,



∫ α(x)p0 (x) dν(x) = 0.

Consider the parametric submodel with density pX(x, γ2) = p0(x){1 + γ2^T α(x)}, where γ2 is a q-dimensional vector taken sufficiently small so that

{1 + γ2^T α(x)} ≥ 0 for all x.
This is necessary to guarantee that pX (x, γ2 ) is a proper density in a neigh-
borhood of γ2 around zero and is true because α(x) is bounded.
The function pX(x, γ2), in x, is a density function since pX(x, γ2) ≥ 0 for all x and

∫ pX(x, γ2) dν(x) = ∫ p0(x){1 + γ2^T α(x)} dν(x)
                  = ∫ p0(x) dν(x) + γ2^T ∫ α(x) p0(x) dν(x) = 1,

since the first integral equals 1 and the second equals 0.

For this parametric submodel, the score vector is

Sγ2(x) = ∂ log[p0(x){1 + γ2^T α(x)}]/∂γ2 |_{γ2=0} = α(x).

Hence, by choosing the constant matrix B^{q×q} to be the q × q identity matrix, we deduce that α(X) is an element of this particular parametric submodel nuisance tangent space. Since arbitrary α(X) ∈ Λ2s^{(conj)} can always be taken as a limit of bounded mean-zero functions of X, we have thus shown that all elements of Λ2s^{(conj)} are either elements of a parametric submodel nuisance tangent space or a limit of such elements; hence, our conjecture has been verified and

Λ2s = {all q-dimensional mean-zero functions of X}. □




The Space Λ1s

Theorem 4.7. The space Λ1s is the space of all q-dimensional random func-
tions a(ε, x) that satisfy

(i) E{a(ε, X)|X} = 0q×1

and

(ii) E{a(ε, X)εT |X} = 0q×d .

Proof. The space Λ1s is the mean-square closure of all parametric submodel
nuisance tangent spaces Λγ1 , where

Λγ1 = {B q×r1 Sγ1 (ε, X) for all B q×r1 }

and

Sγ1(ε, x) = ∂ log pε|X(ε|x, γ10)/∂γ1.

Recall: We use ε to denote {y − µ(x, β0)}. □


Note the following relationships:

(i) ∫ pε|X(ε|x, γ1) dε = 1 for all x, γ1, implies ∂/∂γ1 ∫ pε|X(ε|x, γ1) dε = 0 for all x and γ1. Interchanging integration and differentiation, dividing and multiplying by pε|X(ε|x, γ10), and evaluating at γ1 = γ10, we obtain

∫ [{∂pε|X(ε|x, γ10)/∂γ1} / pε|X(ε|x, γ10)] pε|X(ε|x, γ10) dε = 0

for all x; i.e.,

E{Sγ1(ε, X)|X} = 0.   (4.37)

(ii) The model restriction E(ε|X) = 0 is equivalent to ∫ pε|X(ε|x, γ1) ε^T dε = 0^{1×d} for all x, γ1. Using arguments similar to (i), where we differentiate with respect to γ1, interchange integration and differentiation, divide and multiply by pε|X(ε|x, γ10), and set γ1 to γ10, we obtain

∫ Sγ1(ε, x) ε^T pε|X(ε|x, γ10) dε = 0 for all x;

i.e., E{Sγ1(ε, X) ε^T|X} = 0^{r1×d}.

Consequently, any element of Λγ1 = B q×r1 Sγ1 (ε, X), say a(ε, X), must satisfy

(i) E{a(ε, X)|X} = 0 (4.38)

and

(ii) E{a(ε, X)εT |X} = 0q×d . (4.39)

A reasonable conjecture is that Λ1s is the space of all q-dimensional functions


of (ε, X) that satisfy (4.38) and (4.39).
To verify this, consider the parametric submodel

pε|X (ε|x, γ1 ) = p0ε|X (ε|x){1 + γ1T a(ε, x)}

for some bounded function a(ε, X) satisfying (4.38) and (4.39) and a q-
dimensional parameter γ1 chosen sufficiently small so that

{1 + γ1T a(ε, x)} ≥ 0 for all ε, x.

This parametric submodel contains the truth (i.e., γ1 = 0). Also, the class of densities in this submodel consists of proper densities,

∫ pε|X(ε|x, γ1) dε = 1 for all x, γ1,

and satisfies E(ε|X) = 0,

∫ ε pε|X(ε|x, γ1) dε = 0^{d×1} for all x, γ1.

For this parametric submodel, the score vector is

Sγ1(ε, x) = ∂ log pε|X(ε|x, γ10)/∂γ1 = a(ε, x).
Premultiplying this score vector by the q × q identity matrix leads us to the
conclusion that a(ε, x) is an element of this parametric submodel nuisance
tangent space.
Finally, any a(ε, X) ∈ H satisfying (4.38) and (4.39) can be obtained, in
the limit, by a sequence of bounded a(ε, X) satisfying (4.38) and (4.39). 
Recap:
Λ1s = {aq×1 (ε, X) such that
E{a(ε, X)|X} = 0 and
E{a(ε, X)εT |X} = 0},
Λ2s = {αq×1 (X) such that
E{α(X)} = 0}.  
It is now easy to demonstrate that Λ1s is orthogonal to Λ2s .
Lemma 4.2. Λ1s ⊥ Λ2s
Proof.
E{α^T(X) a(ε, X)} = E[E{α^T(X) a(ε, X)|X}]
                  = E[α^T(X) E{a(ε, X)|X}] = 0,

since the inner conditional expectation E{a(ε, X)|X} is zero. □
The nuisance tangent space for the semiparametric model is Λ = Λ1s ⊕ Λ2s.
Note that Λ1s is the intersection of two linear subspaces; namely,

Λ1sa = {a_a^{q×1}(ε, X) : E{a_a(ε, X)|X} = 0},

and

Λ1sb = {a_b^{q×1}(ε, X) : E{a_b(ε, X) ε^T|X} = 0^{q×d}}.

Therefore, the nuisance tangent space Λ = Λ2s ⊕ (Λ1sa ∩ Λ1sb).


There are certain relationships between the spaces Λ2s , Λ1sa , and Λ1sb
that will allow us to simplify the representation of the nuisance tangent space
Λ and also allow us to derive the space orthogonal to the nuisance tangent
space Λ⊥ more easily. We give these relationships through a series of lemmas.

Lemma 4.3.
Λ1sa = Λ2s^⊥.

Lemma 4.4.
Λ2s ⊂ Λ1sb.

and

Lemma 4.5.
Λ = Λ2s ⊕ (Λ1sa ∩ Λ1sb) = Λ1sb.

Proof. Lemma 4.3


We first show that the space Λ1sa is orthogonal to Λ2s . Let a(ε, X) be an
arbitrary element of Λ1sa and α(X) be an arbitrary element of Λ2s . Then

E{α^T(X) a(ε, X)} = E[E{α^T(X) a(ε, X)|X}] = E[α^T(X) E{a(ε, X)|X}] = 0.
To complete the proof, we must show that any element h ∈ H can be written
as h = h1 ⊕ h2 , where h1 ∈ Λ2s and h2 ∈ Λ1sa . We write h = E(h|X) + {h −
E(h|X)}. That E(h|X) ∈ Λ2s and {h − E(h|X)} ∈ Λ1sa follow immediately.
The result above also implies that Π(h|Λ2s ) = E(h|X) and Π(h|Λ1sa ) =
{h − E(h|X)}.  

Proof. Lemma 4.4


Consider any element α(X) ∈ Λ2s . Then

E{α(X)εT |X} = α(X)E(εT |X) = 0q×d ,

which follows from the model restriction E(εT |X) = 01×d . Hence α(X) ∈ Λ1sb .



Proof. Lemma 4.5


Consider any element h1 ∈ Λ2s . By Lemma 4.4, h1 ∈ Λ1sb . Let h2 be any ele-
ment in (Λ1sa ∩ Λ1sb ). Then, by definition, h2 ∈ Λ1sb . Hence (h1 + h2 ) ∈ Λ1sb
since Λ1sb is a linear space.

Conversely, let h be any element of Λ1sb . Since by Lemmas 4.3 and 4.4
E(h|X) ∈ Λ2s ⊂ Λ1sb , this implies that {h − E(h|X)} ∈ Λ1sb since Λ1sb is a
linear space. Therefore h can be written as E(h|X) + {h − E(h|X)}, where
E(h|X) ∈ Λ2s and {h − E(h|X)} ∈ Λ1sb . But by Lemma 4.3, {h − E(h|X)} is
also an element of Λ1sa and hence {h − E(h|X)} ∈ (Λ1sa ∩ Λ1sb ), completing
the proof. 

Consequently, we have shown that the nuisance tangent space Λ for the
semiparametric restricted moment model is given by

Λ = Λ1sb = {h(ε, X) such that E{h(ε, X)εT |X} = 0q×d }. (4.40)

Influence Functions and the Efficient Influence Function for the


Restricted Moment Model

The key to deriving the space of influence functions is first to identify elements
of the Hilbert space that are orthogonal to Λ. Equivalently, the space Λ⊥ is
the linear space of residuals

h(ε, X) − Π(h(ε, X)|Λ)

for all
h(ε, X) ∈ H.
Using (4.40), this equals

[h(ε, X) − {Π(h|Λ1sb )}] . (4.41)

Theorem 4.8. The space orthogonal to the nuisance tangent space, Λ^⊥, or equivalently Λ1sb^⊥, is

{A^{q×d}(X) ε for all A^{q×d}(X)},   (4.42)

where A^{q×d}(X) is a matrix of arbitrary q × d-dimensional functions of X.


Moreover, the projection of any arbitrary element h(ε, X) ∈ H onto Λ1sb
satisfies
h(ε, X) − Π(h|Λ1sb ) = g q×d (X)ε, (4.43)
where
g(X) = E{h(ε, X)εT |X} {E(εεT |X)}−1 ,
which implies that

Π[h|Λ1sb ] = h − E(hεT |X){E(εεT |X)}−1 ε. (4.44)

Proof. Theorem 4.8


In order to prove that the space given by (4.42) is the orthogonal complement
of Λ1sb , we first prove that this space is orthogonal to Λ1sb . That is, for any
A(X)ε, we must show

E{aTb (ε, X)A(X)ε} = 0 for all ab ∈ Λ1sb . (4.45)

By a conditioning argument, this expectation equals



E[E{aTb (ε, X)A(X)ε|X}]. (4.46)

But, by the definition of Λ1sb, E{a_b(ε, X) ε^T|X} = 0^{q×d} or, equivalently, E{a_{bj}(ε, X) ε_{j′}|X} = 0 for all j = 1, . . . , q and j′ = 1, . . . , d, where a_{bj}(ε, X) is the j-th element of a_b(ε, X) and ε_{j′} is the j′-th element of ε. Consequently, the inner expectation of (4.46), which can be written as

Σ_{j,j′} A_{jj′}(X) E{a_{bj}(ε, X) ε_{j′}|X},

where A_{jj′}(X) is the (j, j′)-th element of A(X), must also equal zero. This, in turn, proves (4.45).
Now that we have shown the orthogonality of the spaces Λ1sb and the space
(4.42), in order to prove that the space (4.42) is the orthogonal complement
of Λ1sb , it suffices to show that any h ∈ H can be written as h1 + h2 , where
h1 ∈ (4.42) and h2 ∈ Λ1sb . Or, equivalently, for any h ∈ H, there exists
g q×d (X) such that
{h(ε, X) − g(X)ε} ∈ Λ1sb . (4.47)
That such a function g(X) exists follows by solving the equation

E[{h − g(X)ε}εT |X] = 0q×d ,

or
E(hεT |X) − g(X)E(εεT |X) = 0,
which yields
g(X) = E(hεT |X){E(εεT |X)}−1 ,
where, to avoid any technical difficulties, we will assume that the conditional
variance matrix E(εεT |X) is positive definite and hence invertible. 
We have thus demonstrated that, for the semiparametric restricted mo-
ment model, any element of the Hilbert space perpendicular to the nuisance
tangent space is given by

Aq×d (X) ε or A(X){Y − µ(X, β0 )}. (4.48)

Influence functions of RAL estimators for β (i.e., ϕ(ε, X)) are normalized
versions of elements perpendicular to the nuisance tangent space. That is,
the space of influence functions, as well as being orthogonal to the nuisance
tangent space, must also satisfy condition (i) of Theorem 4.2, namely that

E{ϕ(ε, X)SβT (ε, X)} = I q×q ,

where Sβ (ε, X) is the score vector with respect to the parameter β and I q×q
is the q × q identity matrix. If we start with any A(X), and define ϕ(ε, X) =
C q×q A(X)ε, where C q×q is a q ×q constant matrix (i.e., normalization factor),
then condition (i) of Theorem 4.2 is satisfied by solving

E{CA(X)εSβT (ε, X)} = I q×q

or
C = [E{A(X)εSβT (ε, X)}]−1 . (4.49)
Since a typical element orthogonal to the nuisance tangent space is given
by A(X){Y − µ(X, β0 )}, and since a typical influence function is given by
CA(X){Y − µ(X, β0 )}, where C is defined by (4.49), this motivates us to
consider an m-estimator for β of the form

Σ_{i=1}^{n} C A(Xi){Yi − µ(Xi, β)} = 0.

Because C is a multiplicative constant matrix, as long as C is invertible this is equivalent to solving the equation

Σ_{i=1}^{n} A(Xi){Yi − µ(Xi, β)} = 0.

This logic suggests that estimators can often be motivated by identifying


elements orthogonal to the nuisance tangent space, a theme that will be used
frequently throughout the remainder of the book.
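To make this concrete, here is a small numerical sketch (not from the text) of how such a linear estimating equation could be solved with a general-purpose root finder in Python. The log-linear mean function, its gradient, and the particular choice of A (each A(Xi) taken equal to D^T(Xi)) are illustrative assumptions only; any q × d matrix of functions of X could be substituted for A.

import numpy as np
from scipy.optimize import root

# Illustrative restricted moment model: E(Y|X) = mu(X, beta) = exp(beta_1 + beta_2 * X),
# with Y univariate (d = 1) and beta two-dimensional (q = 2).
def mu(X, beta):
    return np.exp(beta[0] + beta[1] * X)

def D(X, beta):
    # Rows are d mu(X_i, beta) / d beta^T (here 1 x 2 for each observation)
    m = mu(X, beta)
    return np.column_stack([m, m * X])

def A(X, beta):
    # One admissible choice of A(X); since d = 1, each row is the q-vector A(X_i)
    return D(X, beta)

def estimating_equation(beta, Y, X):
    # sum_i A(X_i){Y_i - mu(X_i, beta)}
    return A(X, beta).T @ (Y - mu(X, beta))

# Simulated data, purely for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, 500)
Y = rng.poisson(np.exp(0.5 + 1.0 * X)).astype(float)
beta_hat = root(estimating_equation, x0=np.zeros(2), args=(Y, X)).x
print(beta_hat)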
We showed in Section 4.1 that solutions to such linear estimating equations
result in semiparametric GEE estimators with influence function (4.4) that
is proportional to A(X){Y − µ(X, β)}, where the proportionality constant is
given by {E(AD)}−1 . In the next section, we will show that this proportional-
ity constant satisfies equation (4.49). This will be a consequence of the results
obtained in deriving the efficient influence function, as we now demonstrate.

The Efficient Influence Function

To derive an efficient semiparametric estimator, we must find the efficient


influence function. For this, we need to derive the efficient score (i.e., the
residual after projecting the score vector with respect to β onto the nuisance
tangent space Λ) Seff (ε, X) = Sβ (ε, X) − Π{Sβ (ε, X)|Λ}, which by (4.44)
equals
Seff (ε, X) = E{Sβ (ε, X)εT |X}V −1 (X)ε, (4.50)
where V (X) denotes E(εεT |X).
Recall that the restricted moment model was characterized by {β, η1 (ε, x),
η2 (x)}, where β is the parameter of interest and η1 (·), η2 (·) are infinite-
dimensional nuisance parameters. When studying the nuisance tangent space,
we fixed β at the truth and varied η1 (·) and η2 (·). To compute Sβ , we fix the
nuisance parameters at the truth and vary β.
A typical density in our model was given as

η1 {y − µ(x, β), x}η2 (x),



where

η1 (ε, x) = pε|X (ε|x)

and

η2 (x) = pX (x).

If we fix the nuisance parameter at the truth (i.e., η10(ε, x) and η20(x)), then

Sβ(y, x, β0, η0(·)) = ∂ log η10{y − µ(x, β), x}/∂β |_{β=β0}
                    = [{∂η10(ε, x)/∂β} / η10(ε, x)] |_{β=β0}.   (4.51)

Due to the model restriction,

∫ {y − µ(x, β)} η10{y − µ(x, β), x} dy = 0 for all x, β,

we obtain

∂/∂β^T [∫ {y − µ(x, β)} η10{y − µ(x, β), x} dy] |_{β=β0} = 0.

Taking the derivative inside the integral, we obtain

− ∫ {∂µ(x, β0)/∂β^T}^{d×q} η10(ε, x) dε + ∫ ε Sβ^T{ε, x, β0, η0(·)} η10(ε, x) dε = 0

for all x, or

−∂µ(X, β0)/∂β^T + E{ε Sβ^T(ε, X)|X} = 0.
After solving the preceding equation and taking the transpose, we obtain

DT (X) = E{Sβ (ε, X)εT |X}, (4.52)

where

D(X) = ∂µ(X, β0)/∂β^T.
By (4.50) and (4.52), we obtain that the efficient score is

Seff (ε, X) = DT (X)V −1 (X)ε, (4.53)

and the optimal estimator is obtained by solving the estimating equation

Σ_{i=1}^{n} D^T(Xi) V^{-1}(Xi){Yi − µ(Xi, β)} = 0   (4.54)

(the optimal GEE).
We also note that the normalization constant matrix C given in (4.49) in-
volves the expectation E{A(X)εSβT (ε, X)}, which by a conditioning argument
can be derived as E[A(X)E{εSβT (ε, X)|X}], which equals E{A(X)D(X)} by
(4.52). Hence, C = [E{A(X)D(X)}]−1 . This implies that a typical influence
function is given by

[E{A(X)D(X)}]−1 A(X){Y − µ(X, β)},

which is the influence function for the GEE estimator given in (4.4). Simi-
larly, the efficient influence function can be obtained by using the appropriate
normalization constant with the efficient score (4.53) to yield

[E{DT (X)V −1 (X)D(X)}]−1 DT (X)V −1 (X){Y − µ(X, β)}. (4.55)

The semiparametric efficiency bound is given as

V = [E{Seff(ε, X) Seff^T(ε, X)}]^{-1},

which by (4.53) is equal to

[E{D^T(X) V^{-1}(X) ε ε^T V^{-1}(X) D(X)}]^{-1}
  = [E{D^T(X) V^{-1}(X) E(ε ε^T|X) V^{-1}(X) D(X)}]^{-1}
  = [E{D^T(X) V^{-1}(X) D(X)}]^{-1}.   (4.56)

This, of course, is also the variance of the efficient influence function (4.55).
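As a small illustration (not from the text), if the gradient D(x) and a working conditional variance V(x) can be evaluated at each observed covariate value, the bound (4.56) can be estimated by replacing the outer expectation with a sample average. The functions D_fun and V_fun below are hypothetical placeholders that a user would supply.

import numpy as np

def efficiency_bound(X, D_fun, V_fun):
    # Estimate [E{D^T(X) V^{-1}(X) D(X)}]^{-1} by a sample average over the X_i.
    # D_fun(x) returns the d x q gradient matrix and V_fun(x) the d x d
    # conditional variance matrix at covariate value x (both user supplied).
    terms = []
    for x in X:
        Dx = np.atleast_2d(D_fun(x))                  # d x q
        Vx = np.atleast_2d(V_fun(x))                  # d x d, assumed positive definite
        terms.append(Dx.T @ np.linalg.solve(Vx, Dx))  # q x q summand
    return np.linalg.inv(np.mean(terms, axis=0))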

A Different Representation for the Restricted Moment Model

Up to now, we have defined probability models for the restricted moment


model when the response variable Y was continuous. This allowed us to con-
sider conditional densities for ε given X, where ε = Y −µ(X, β). We were able
to do this because we assumed that Y was a continuous variable with respect to
Lebesgue measure and hence the transformed variable ε was also a continuous
variable with respect to Lebesgue measure. This turned out to be useful, as we
could then describe the semiparametric model through the finite-dimensional
parameter of interest β together with the infinite-dimensional nuisance pa-
rameters η1 (ε, x) and η2 (x), where β was variationally independent of η1 (ε, x)
and η2 (x). That is, any combination of β, η1 (ε, x), and η2 (x) would lead to a
valid density describing our semiparametric model.
There are many problems, however, where we want to use the restricted
moment model with a dependent variable Y that is not a continuous random

variable. Strictly speaking, the response variable CD4 count used in the log-
linear model example of Section 4.1 is not a continuous variable. A more
obvious example is when we have a binary response variable Y taking on
values 1 (response) or 0 (nonresponse). A popular model for modeling the
probability of response as a function of covariates X is the logistic regression
model. In such a model, we assume

P(Y = 1|X) = exp(β^T X*)/{1 + exp(β^T X*)},

where X* = (1, X^T)^T, allowing the introduction of an intercept term. Since Y is a binary indicator, this implies that E(Y|X) = P(Y = 1|X), and hence the logistic regression model is just another example of a restricted moment model with µ(X, β) = exp(β^T X*)/{1 + exp(β^T X*)}.

The difficulty that occurs when the response variable Y is not a continuous
random variable is that the transformed variable Y − µ(X, β) may no longer
have a dominating measure that allows us to define densities. In order to
address this problem, we will work directly with densities defined on (Y, X),
namely p(y, x) with respect to some dominating measure νY × νX . As you will
see, many of the arguments developed previously will carry over to this more
general setting.
We start by first deriving the nuisance tangent space. As before, we need to
find parametric submodels. Let p(y, x) be written as p(y|x) p(x), where p(y|x)
is the conditional density of Y given X and p(x) is the marginal density of X,
and denote the truth as p0 (y, x) = p0 (y|x)p0 (x). The parametric submodel
can be written generically as
p(y|x, β, γ1 )p(x, γ2 ),
where for some β0 , γ10 , γ20
p0 (y|x) = p(y|x, β0 , γ10 )
and
p0 (x) = p(x, γ20 ).
The parametric submodel nuisance tangent space is the space spanned by
the score vector with respect to the nuisance parameters γ1 and γ2 . As in
the previous section, the parametric submodel nuisance tangent space can be
written as the direct sum of two orthogonal spaces
Λ γ1 ⊕ Λ γ2 , Λ γ1 ⊥ Λ γ2 ,

where

Λγ1 = {B^{q×r1} Sγ1(Y, X) for all B^{q×r1}},
Λγ2 = {B^{q×r2} Sγ2(X) for all B^{q×r2}},
Sγ1(y, x) = ∂ log p(y|x, β0, γ10)/∂γ1,

and

Sγ2(x) = ∂ log p(x, γ20)/∂γ2.
Hence, the semiparametric nuisance tangent space Λ equals Λ1s ⊕ Λ2s ,
Λ1s ⊥ Λ2s , where

Λ1s = {mean-square closure of all Λγ1 }

and

Λ2s = {mean-square closure of all Λγ2 }.

We showed previously that

Λ2s = {all q-dimensional mean-zero measurable functions of X with finite second moments; i.e., α^{q×1}(X) : E{α(X)} = 0}.

We now consider the space Λ1s . Again, we proceed by making educated


guesses for the nuisance tangent space by considering the structure of the
parametric submodel nuisance tangent space and then verify that our guess is
correct. For the restricted moment model, if we fix β at the truth, (β0 ), then
the conditional densities p(y|x, β0 , γ1 ) must satisfy

∫ p(y|x, β0, γ1) dν(y) = 1 for all x, γ1   (4.57)

and

∫ y p(y|x, β0, γ1) dν(y) = µ(x, β0) for all x, γ1.   (4.58)

Using standard arguments, where we take derivatives of (4.57) and (4.58) with respect to γ1, interchange integration and differentiation, divide and multiply by p(y|x, β0, γ1), and set γ1 at the truth, we obtain

∫ Sγ1(y, x) p0(y|x) dν(y) = 0^{r1×1} for all x

and

∫ y Sγ1^T(y, x) p0(y|x) dν(y) = 0^{d×r1} for all x.

That is, E{Sγ1(Y, X)|X} = 0^{r1×1} and E{Y Sγ1^T(Y, X)|X} = 0^{d×r1}. This
implies that any element of Λγ1 , namely B q×r1 Sγ1 (Y, X), would satisfy

E{B^{q×r1} Sγ1(Y, X)|X} = 0^{q×1} and E{B^{q×r1} Sγ1(Y, X) Y^T|X} = 0^{q×d}. This leads us to the conjecture that

Λ1s^{(conj)} = {a^{q×1}(Y, X) : E{a(Y, X)|X} = 0^{q×1} and E{a(Y, X) Y^T|X} = 0^{q×d}}.

To verify this conjecture, we consider an arbitrary bounded element a^{q×1}(Y, X) ∈ Λ1s^{(conj)}. We construct the parametric submodel p0(y|x){1 + γ1^T a(y, x)} with γ1 chosen sufficiently small to ensure {1 + γ1^T a(y, x)} ≥ 0 for all y, x. This parametric submodel contains the truth when γ1 = 0 and satisfies the constraints of the restricted moment model, namely

∫ p(y|x, γ1) dν(y) = 1 for all x, γ1,
∫ y p(y|x, γ1) dν(y) = µ(x, β0) for all x, γ1.

The score vector Sγ1(Y, X) for this parametric submodel is a(Y, X). Also, any element of Λ1s^{(conj)} can be derived as a limit of bounded elements in Λ1s^{(conj)}. Therefore, we have established that any element of a parametric submodel nuisance tangent space Λγ1 is an element of Λ1s^{(conj)}, and any element of Λ1s^{(conj)} is either an element of a parametric submodel nuisance tangent space or a limit of such elements. Thus, we conclude that Λ1s = Λ1s^{(conj)}.
Note that Λ1s can be expressed as Λ1sa ∩ Λ1sb , where

Λ1sa = {aq×1 (Y, X) : E{a(Y, X)|X} = 0q×1 }

and

Λ1sb = {a^{q×1}(Y, X) : E[a(Y, X){Y − µ(X, β0)}^T|X] = 0^{q×d}}.

Therefore the nuisance tangent space Λ = Λ2s ⊕ (Λ1sa ∩ Λ1sb ). This repre-
sentation is useful because Λ2s ⊕ Λ1sa = H (the whole Hilbert space) and
Λ2s ⊂ Λ1sb . Consequently, we use Lemmas 4.3–4.5 to show that the semipara-
metric nuisance tangent space Λ = Λ1sb .
Using the exact same proof as for Theorem 4.8, we can show that

h(Y, X) − Π[h|Λ1sb], or Π[h|Λ1sb^⊥],
  = E{h(Y, X)(Y − µ(X, β0))^T|X} V^{-1}(X){Y − µ(X, β0)}.

To complete the development of the semiparametric theory, we still need to derive the efficient score, or

Seff(Y, X) = Π[Sβ(Y, X)|Λ^⊥]
           = E[Sβ(Y, X){Y − µ(X, β0)}^T|X] V^{-1}(X){Y − µ(X, β0)}.   (4.59)

Since E(Y|X = x) = µ(x, β), then for any parametric submodel where p(y|x, β, γ1) satisfies

∫ y p(y|x, β, γ1) dν(y) = µ(x, β) for all x, γ1   (4.60)

and

p(y|x, β0, γ10) = p0(y|x),

assuming at least one such parametric submodel exists, we can differentiate


both sides of (4.60) with respect to β T , interchange integration and differen-
tiation, divide and multiply by p0 (y|x), and set β and γ1 equal to β0 and γ10 ,
respectively, to obtain

E{Y SβT (Y, X)|X}


= E[{Y − µ(X, β0 )}SβT (Y, X)|X]
= D(X), (4.61)

where
∂µ(X, β0 )
D(X) = .
∂β T
Equation (4.61) follows because

E{µ(X, β0 )SβT (Y, X)|X} = µ(X, β0 )E{SβT (Y, X)|X} = 0.

Taking transposes yields

E[Sβ (Y, X){Y − µ(X, β0 )}T |X] = DT (X).

Consequently, the efficient score (4.59) is given by

Seff (Y, X) = DT (X)V −1 (X){Y − µ(X, β0 )}.

It still remains to show that a parametric submodel exists that satisfies (4.60).
This is addressed by the following argument.

Existence of a Parametric Submodel for the Arbitrary Restricted


Moment Model

A class of joint densities for (Y, X) can be defined with dominating measure
νY ×νX that satisfy E(Y |X) = µ(X, β) by considering the conditional density

of Y given X multiplied by the marginal density of X, where a class of con-


ditional densities for Y given X can be constructed using exponential tilting.
To illustrate, let us consider, for simplicity, the case where Y is a univariate
bounded random variable. We will assume that, at the truth, the conditional
density of Y given X is given by p0 (y|x), where yp0 (y|x)dν(y) = µ(x, β0 )
for all x. The question is whether we candefine a class of conditional densities
p(y|x, β) with respect to νY such that yp(y|x, β)dν(y) = µ(x, β) for all x
and for β in a neighborhood of β0 . We consider the conditional densities

p0 (y|x) exp{c(x, β)y}


p(y|x, β) =  ,
exp{c(x, β)y}p0 (y|x)dν(y)

where c(x, β), if possible, is chosen so that yp(y|x, β)dν(y) = µ(x, β) for β
in a neighborhood of β0 . We first  note that we can take c(x, β0 ) to be equal
to zero because, by definition, yp0 (y|x)dν(y) = µ(x, β0 ). To illustrate that
c(x, β) exists and can be uniquely defined in a neighborhood of β0 , we fix the
value of x and consider the function

p0 (y|x) exp(cy)
y dν(y)
exp(cy)p0 (y|x)dν(y)

as a function in c, which can also be written as

E0{Y exp(cY)|X = x} / E0{exp(cY)|X = x},   (4.62)

where the conditional expectation E0 (·|X = x) is taken with respect to the


conditional density p0 (y|x). Taking the derivative of (4.62) with respect to c,
we obtain

E0{Y² exp(cY)|X = x} / E0{exp(cY)|X = x} − [E0{Y exp(cY)|X = x} / E0{exp(cY)|X = x}]².

This derivative, being the conditional variance of Y given X = x with respect to the conditional density p0(y|x) exp(cy) / ∫ exp(cy) p0(y|x) dν(y), must therefore be positive, implying that the function (4.62) is strictly monotonically increasing in c. Hence, in a neighborhood of β about the value β0, a unique inverse for c exists in a neighborhood of zero that satisfies the equation

E0{Y exp(cY)|X = x} / E0{exp(cY)|X = x} = µ(x, β).

We define this solution as c(x, β).


The arguments above can be generalized to multivariate Y by using the
inverse function theorem.
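As a numerical sketch (illustrative assumptions only), the solution c(x, β) can be computed for a bounded, discrete Y by evaluating the tilted mean (4.62) on the support points and exploiting its strict monotonicity in c with a bracketing root finder:

import numpy as np
from scipy.optimize import brentq

def tilted_mean(c, y_vals, p0_vals):
    # E_0{Y exp(cY)|X = x} / E_0{exp(cY)|X = x} for a discrete conditional density p0(y|x)
    w = p0_vals * np.exp(c * y_vals)
    return np.sum(y_vals * w) / np.sum(w)

def solve_c(target_mu, y_vals, p0_vals, c_max=50.0):
    # The tilted mean is strictly increasing in c, so a sign change over a wide
    # bracket recovers the unique root once target_mu lies inside the support range.
    return brentq(lambda c: tilted_mean(c, y_vals, p0_vals) - target_mu, -c_max, c_max)

# Illustration: p0(y|x) uniform on {0, 1, 2, 3}, tilted to have conditional mean 2.2
y_vals = np.arange(4.0)
p0_vals = np.full(4, 0.25)
c = solve_c(2.2, y_vals, p0_vals)
print(c, tilted_mean(c, y_vals, p0_vals))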

4.6 Adaptive Semiparametric Estimators for the


Restricted Moment Model
Using the theory we have developed for the semiparametric restricted moment
model, we now know that the class of all influence functions for semiparametric
RAL estimators for β must be of the form

[E{A(X)D(X)}]−1 A(X){Y − µ(X, β0 )}

for any arbitrary q × d matrix, A(X), of functions of X. We also showed that


the solution to the estimating equation

Σ_{i=1}^{n} A(Xi){Yi − µ(Xi, β)} = 0   (4.63)

results in an estimator for β that is RAL with influence function given by


(4.4). The estimating equation (4.63) is an example of what Liang and Zeger
(1986) referred to as a linear estimating equation or a GEE estimator. Using
semiparametric theory, we showed that this class of estimators encompasses,
at least asymptotically, all possible semiparametric RAL estimators for β in a
restricted moment model. That is, any semiparametric RAL estimator must
have an influence function that is contained within the class of influence func-
tions for GEE estimators. Consequently, it is reasonable to restrict attention
to only such estimators and, moreover, the efficient RAL estimator must be
asymptotically equivalent to the efficient GEE estimator given by the solution
to (4.54).
It is important to note that the optimal estimating equation depends on
using the correct V (X), where

V (X) = E(εεT |X)

is the conditional variance of ε given X, or equivalently, the conditional vari-


ance of Y given X. Since, in a semiparametric model, the distribution of ε
given X is left unspecified, the function V (x) is unknown to us. We can try
to estimate this conditional variance of Y as a function of X = x using the
data, but generally, without additional assumptions, this requires smoothing
estimators that are not very stable, especially if X is multidimensional. Con-
sequently, substituting such nonparametric type smoothing estimators V̂ (X)
into equation (4.54) leads to estimators for β that perform poorly with mod-
erate sample sizes.
Another strategy is to posit some relationship for V (x), either completely
specified or as a function of a finite (small) number of additional parameters
ξ as well as the parameters β in the restricted moment model. This is referred
to as a working variance assumption because the model V (x, ξ, β) for the
conditional variance of Y given X = x may not contain the truth. For example,

suppose, for simplicity, we take the response variable Y to be one-dimensional


(i.e., d = 1) and assume the variance function

V (x, ξ) = exp(ξ0 + ξ1T x),

where ξ1 is a vector of dimension equal to the number of covariates that make


up the vector X. For this illustration, we chose a log-linear relationship to
ensure a positive variance function. We might choose such a model for the
variance not necessarily because we believe this is the true relationship but
rather because if we believe that the variance is related to the covariates, then
this model may capture some of this relationship, at least to a first order.
Another model, if one believes that the conditional variance of Y given X
may be related to the conditional mean of Y given X, is to assume that

V (x, ξ, β) = ξ02 {µ(x, β)}ξ1 , (4.64)

where ξ0 and ξ1 are scalar constants. Again, we may not believe that this
captures the true functional relationship of the conditional variance to the
conditional mean but may serve as a useful approximation. Nonetheless, if,
for the time being, we accepted V (x, ξ, β) as a working model, then the param-
eters ξ in V (x, ξ, β) can be estimated separately using the squared residuals
{Yi − µ(Xi , β̂ninitial )}2 , i = 1, . . . , n, where β̂ninitial is some initial consistent
estimator for β. For instance, we can find an initial estimator for β by solving
equation (4.1) using A(X, β) = D^T(X, β) (which is equivalent to a working vari-
ance V (X) proportional to the identity matrix). Using this initial estimator,
we can then find an estimator for ξ by solving the equation

Σ_{i=1}^{n} Q(Xi, ξ, β̂_n^{initial}) [{Yi − µ(Xi, β̂_n^{initial})}² − V(Xi, ξ, β̂_n^{initial})] = 0,

where Q(X, ξ, β) is an arbitrary vector of functions of X, ξ, and β of dimen-


sion equal to the dimension of ξ. One possibility is to choose Q(X, ξ, β) =
∂V (X, ξ, β)/∂ξ. Denote the resulting estimator by ξˆn . Under weak regularity
conditions, ξˆn will converge in probability to some constant ξ ∗ whether the
variance function was correctly specified or not. Without going into the tech-
nicalities, it can be shown that substituting V (Xi , ξˆn , β̂ninitial ) into equation
(4.54) will result in an RAL estimator for β, namely β̂n (not to be confused
with β̂ninitial ), with influence function [E{A(X)D(X)}]−1 A(X){Y − µ(X, β)},
where A(X) = DT (X)V −1 (X, ξ ∗ , β0 ). Consequently, solutions to such esti-
mating equations, where we use a working variance assumption, will lead to
what is called a locally efficient estimator for β. That is, if the posited re-
lationship for V (x, ξ, β) is indeed true (i.e., if the true conditional variance
of Y given X = x is contained in the model V (x, ξ, β), with V (x, ξ0 , β0 )
denoting the truth), then V (x, ξˆn , β̂ninitial ) will converge to V (x, ξ0 , β0 ) =
var(Y |X = x) and the resulting estimator is semiparametric efficient; oth-
erwise, it is not. However, even if the posited model is not correct (i.e.,

V(x, ξ*, β0) ≠ var(Y|X = x)), V(X, ξ*, β0) is still a function of X and the re-


sulting estimator for β is consistent and asymptotically normal. Such adaptive
estimators for β have been shown empirically to have high relative efficiency
compared with the optimal semiparametric efficiency bound if the working
variance model provides a good approximation to the truth.
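A sketch of the full adaptive procedure under illustrative assumptions (d = 1, a log-linear mean, and the power-of-the-mean working variance (4.64)): an initial GEE fit with working variance proportional to the identity, a moment fit of ξ to the squared residuals using Q = ∂V/∂ξ, and a final solve of (4.54) with the fitted working variance plugged in. All model choices below are assumptions for the example, not prescriptions.

import numpy as np
from scipy.optimize import root

def mu(X, beta):                       # illustrative mean model: E(Y|X) = exp(beta_1 + beta_2*X)
    return np.exp(beta[0] + beta[1] * X)

def D(X, beta):                        # d mu / d beta^T for each observation (n x q)
    m = mu(X, beta)
    return np.column_stack([m, m * X])

def V_work(X, xi, beta):               # working variance (4.64): xi0^2 * mu^xi1
    return xi[0] ** 2 * mu(X, beta) ** xi[1]

def gee(beta, Y, X, weights):          # sum_i D^T(X_i) w_i {Y_i - mu(X_i, beta)}
    return D(X, beta).T @ (weights * (Y - mu(X, beta)))

def xi_equation(xi, Y, X, beta):       # moment equation for xi with Q = dV/dxi
    r2 = (Y - mu(X, beta)) ** 2
    V = V_work(X, xi, beta)
    Q = np.column_stack([2 * xi[0] * mu(X, beta) ** xi[1],      # dV/dxi0
                         V * np.log(mu(X, beta))])              # dV/dxi1
    return Q.T @ (r2 - V)

def adaptive_gee(Y, X):
    beta_init = root(gee, np.zeros(2), args=(Y, X, np.ones_like(Y))).x   # working V = identity
    xi_hat = root(xi_equation, np.array([1.0, 1.0]), args=(Y, X, beta_init)).x
    w = 1.0 / V_work(X, xi_hat, beta_init)                               # plug-in inverse variance
    beta_hat = root(gee, beta_init, args=(Y, X, w)).x                    # locally efficient GEE (4.54)
    return beta_hat, xi_hat

# Illustration with Poisson-like data (true variance equals the mean)
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 1000)
Y = rng.poisson(np.exp(0.5 + 1.0 * X)).astype(float)
print(adaptive_gee(Y, X))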
When using a working variance, one must be careful in estimating the
asymptotic variance of the estimator β̂n . Since the asymptotic variance
of the efficient influence function is given by (4.56), there is a natural
temptation to estimate the asymptotic variance by using an estimator for
[E{D^T(X) V^{-1}(X, ξ*, β0) D(X)}]^{-1}, namely

[n^{-1} Σ_{i=1}^{n} D^T(Xi, β̂n) V^{-1}(Xi, ξ̂n, β̂_n^{initial}) D(Xi, β̂n)]^{-1}.

This estimator would only be a consistent estimator for the asymptotic vari-
ance of β̂n if the working variance contained the truth. Otherwise, it would
be asymptotically biased. A consistent estimator for the asymptotic vari-
ance can be obtained by using the sandwich variance given by (4.14) with
A(X) = DT (X, β̂n )V −1 (X, ξˆn , β̂ninitial ).
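For concreteness, here is a sketch of the usual empirical sandwich form for the solution of Σ A(Xi){Yi − µ(Xi, β)} = 0, written for d = 1; the functions A_fun, D_fun, and mu_fun are user-supplied placeholders, and this is only an illustration of the general recipe rather than a transcription of (4.14).

import numpy as np

def sandwich_variance(Y, X, beta_hat, A_fun, D_fun, mu_fun):
    # A_fun(x, beta) returns the q-vector A(x) (d = 1 case), D_fun(x, beta) the
    # 1 x q gradient of mu, and mu_fun(x, beta) the conditional mean.
    # The estimated asymptotic variance of beta_hat itself is B^{-1} M B^{-T} / n,
    # with B = n^{-1} sum_i A(X_i) D(X_i) and M = n^{-1} sum_i A(X_i) A(X_i)^T residual_i^2.
    n = len(Y)
    q = len(np.atleast_1d(A_fun(X[0], beta_hat)))
    B = np.zeros((q, q))
    M = np.zeros((q, q))
    for yi, xi in zip(Y, X):
        a = np.atleast_1d(A_fun(xi, beta_hat)).reshape(q, 1)
        d = np.atleast_1d(D_fun(xi, beta_hat)).reshape(1, q)
        r = yi - mu_fun(xi, beta_hat)
        B += a @ d / n
        M += (a @ a.T) * r ** 2 / n
    Binv = np.linalg.inv(B)
    return Binv @ M @ Binv.T / n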

Example 1. Logistic regression model


We argued earlier that the logistic regression model is an example of a restricted moment model where the response variable Y is a binary variable taking on the values 1 or 0 and µ(X, β) = exp(β^T X*)/{1 + exp(β^T X*)}, where X* = (1, X^T)^T.
Accordingly, from the theory just developed, we know that the influence functions of all RAL estimators for β correspond to solutions of the generalized estimating equations

Σ_{i=1}^{n} A(Xi){Yi − µ(Xi, β)} = 0

for arbitrary A^{q×1}(X), and the efficient estimator is obtained by choosing A(X) = D^T(X) V^{-1}(X), where D(X) = ∂µ(X, β0)/∂β^T and V(X) = var(Y|X). Because Y is binary,

V(X) = var(Y|X) = µ(X, β0){1 − µ(X, β0)} = exp(β0^T X*)/{1 + exp(β0^T X*)}².

Taking derivatives, we also obtain that DT (X) = X ∗ V (X). Hence the optimal
estimator for β can be derived by choosing A(X) = DT (X)V −1 (X) = X ∗ ,
leading us to the optimal estimating equation

Σ_{i=1}^{n} Xi* [Yi − exp(β^T Xi*)/{1 + exp(β^T Xi*)}] = 0.   (4.65)

Because the conditional distribution of Y given X is fully described with


the finite number of parameters β, we could also estimate the parameter β
using maximum likelihood. It is an easy exercise to show that the solution to
(4.65) also leads to the maximum likelihood estimator for β.  
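As a quick numerical illustration (simulated data, not from the text), (4.65) can be solved by Newton-Raphson; because (4.65) is also the logistic likelihood score, the iteration below is ordinary Fisher scoring for logistic regression.

import numpy as np

def fit_logistic(Y, Xstar, n_iter=25):
    # Solve sum_i Xstar_i {Y_i - expit(beta^T Xstar_i)} = 0 by Newton-Raphson
    beta = np.zeros(Xstar.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xstar @ beta))       # mu(X_i, beta)
        score = Xstar.T @ (Y - p)                     # left-hand side of (4.65)
        W = p * (1.0 - p)                             # V(X_i)
        hessian = Xstar.T @ (Xstar * W[:, None])
        beta = beta + np.linalg.solve(hessian, score)
    return beta

# Simulated illustration
rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=n)
Xstar = np.column_stack([np.ones(n), X])              # X* = (1, X^T)^T
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * X)))
Y = rng.binomial(1, p_true).astype(float)
print(fit_logistic(Y, Xstar))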

Example 2. Log-linear model


In Section 4.1, we considered a log-linear model to model CD4 count as a
function of covariates; see (4.10). We also proposed an ad hoc semiparametric
GEE estimator for β as the solution to (4.11) without any motivation. Let us
study this problem more carefully. We know that the optimal semiparametric
estimator for β for model (4.10) is given as the solution to the equation

n
DT (Xi , β)V −1 (Xi ){Yi − µ(Xi , β)} = 0, (4.66)
i=1

where the gradient matrix D(X, β) was derived in Section 4.1 for the log-linear model to be D(X, β) = ∂µ(X, β)/∂β^T = µ(X, β)(1, X^T). Although the semipara-
metric restricted moment model makes no assumptions about the variance
function V (X) = var(Y |X), we argued that to find a locally efficient estima-
tor for β we might want to make some assumptions regarding the function
V (X) and derive an adaptive estimator.
Since CD4 count is a count, we might be willing to assume that it fol-
lows a Poisson distribution. If indeed the distribution of Y given X fol-
lows a Poisson distribution with mean µ(X, β), then we immediately know
that V (X) = µ(X, β). Although this is probably too strong an assump-
tion to make in general, we may believe that a good approximation is that
the variance function V (X) is at least proportional to the mean; i.e., that
V (X) = σ 2 µ(X, β), where σ 2 is some unknown scale factor. In that case,
DT (X, β)V −1 (X) = σ −2 (1, X T )T and the locally efficient estimator for β
would be the solution to (4.66), which, up to a proportionality constant, would
be the solution to the estimating equation

Σ_{i=1}^{n} (1, Xi^T)^T {Yi − µ(Xi, β)} = 0,   (4.67)

which is the same as the estimator proposed in Section 4.1; see (4.11).
Thus, we have shown that the locally efficient semiparametric estimator
for β, when the conditional distribution of the response variable given the
covariates follows a Poisson distribution or, more generally, if the conditional
variance of the response variable given the covariates is proportional to the
conditional mean of Y given the covariates, is given by (4.67). For a more
detailed discussion on log-linear models, see McCullagh and Nelder (1989,
Chapter 6).
If the conditional variance of Y given X is not proportional to the con-
ditional mean µ(X, β), then the estimator (4.67) is no longer semiparametric

efficient; nonetheless, it will still be a consistent, asymptotically normal es-


timator for β with an asymptotic variance that can be estimated using the
sandwich estimator (4.14).
Another possibility, which would give somewhat greater flexibility, is to
model the variance function using (4.64), as this model contains the Poisson
variance structure mentioned above, and use the adaptive methods described
in this section to estimate β. 
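A minimal sketch (simulated data, purely illustrative) of solving (4.67) for the log-linear model with µ(X, β) = exp(β^T X*) and X* = (1, X^T)^T:

import numpy as np
from scipy.optimize import root

def loglinear_ee(beta, Y, Xstar):
    # sum_i (1, X_i^T)^T {Y_i - exp(beta^T X_i^*)} = 0, i.e., equation (4.67)
    return Xstar.T @ (Y - np.exp(Xstar @ beta))

rng = np.random.default_rng(3)
n = 800
X = rng.uniform(0, 2, n)
Xstar = np.column_stack([np.ones(n), X])
Y = rng.poisson(np.exp(1.0 + 0.5 * X)).astype(float)   # variance proportional to the mean

beta_hat = root(loglinear_ee, x0=np.zeros(2), args=(Y, Xstar)).x
print(beta_hat)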

Extensions of the Restricted Moment Model

When considering the restricted moment model, we have concentrated on


models where E(Y |X) = µ(X, β) or, equivalently, E{ε(Y, X, β)|X} = 0, where
ε(Y, X, β) = Y − µ(X, β). Using this second representation, the theory de-
veloped for the restricted moment problem can be applied (or extended) to
models where E{ε(Y, X, β)|X} = 0 for arbitrary functions ε(Y, X, β).
This allows us to consider models, for example, where we model both the
conditional variance and the conditional mean of Y given X. Say we want a
model where we assume that E(Y |X) = µ(X, β) and var(Y |X) = V (X, β, ξ)
and our interest is in estimating the parameters β and ξ using a sample of data
that are realizations of (Yi , Xi ), i = 1, . . . , n. For simplicity, take Y to be a
univariate response random variable, although this can be easily generalized
to multivariate response random vectors as well. We could then define the
bivariate vector ε(Y, X, β, ξ) = {ε1 (Y, X, β, ξ), ε2 (Y, X, β, ξ)}T , where

ε1 (Y, X, β, ξ) = Y − µ(X, β)

and
ε2 (Y, X, β, ξ) = {Y − µ(X, β)}2 − V (X, β, ξ).
With such a representation, it is clear that our model for the conditional mean
and conditional variance of Y given X is equivalent to E{ε(Y, X, β, ξ)|X} = 0.
This representation also allows us to consider models for the conditional
quantiles of Y as a function of X. For example, suppose we wanted to consider
a model for the median of a continuous random variable Y as a function of
X. Say we wanted a model where we assumed that the conditional median
of Y given X was equal to µ(X, β) and we wanted to estimate β using a
sample of data (Yi , Xi ), i = 1, . . . , n. This could be accomplished by consider-
ing ε(Y, X, β) = I{Y ≤ µ(X, β)} − .5 because the conditional expectation of
ε(Y, X, β) given X is given by

E{ε(Y, X, β)|X} = P {Y ≤ µ(X, β)|X} − .5

and, by definition, the conditional median is the value µ(X, β) such that the
conditional probability that Y is less than or equal to µ(X, β), given X, is
equal to .5, which would imply that E{ε(Y, X, β)|X} = 0.

Using arguments similar to those developed in this chapter, we can show


that we can restrict attention to semiparametric estimators that are solutions
to the estimating equations

Σ_{i=1}^{n} A(Xi) ε(Yi, Xi, β) = 0

for arbitrary A(X) and that the efficient semiparametric estimator for β falls
within this class. For models where we include both the conditional mean and
conditional variance, this leads to the so-called quadratic estimating equations
or GEE2. We give several exercises along these lines.
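To illustrate the median-regression case (simulated data; the equivalence used here is the standard least-absolute-deviation characterization of the median, not a result stated in the text), the estimating equation Σ Xi*{I(Yi ≤ Xi*^T β) − 1/2} = 0 obtained with the simple choice A(X) = X* is, up to a factor of two, the subgradient condition for minimizing Σ |Yi − Xi*^T β|, so a derivative-free minimizer can be used despite the non-smooth indicator:

import numpy as np
from scipy.optimize import minimize

def lad_fit(Y, Xstar):
    # Median (least absolute deviation) regression: med(Y|X) = Xstar^T beta.
    # The subgradient of sum_i |Y_i - Xstar_i^T beta| is proportional to
    # sum_i Xstar_i {I(Y_i <= Xstar_i^T beta) - 1/2}, i.e., the estimating
    # equation with epsilon(Y, X, beta) = I{Y <= mu(X, beta)} - 0.5 and A(X) = Xstar.
    beta0 = np.linalg.lstsq(Xstar, Y, rcond=None)[0]          # least-squares start
    obj = lambda b: np.sum(np.abs(Y - Xstar @ b))
    return minimize(obj, beta0, method="Nelder-Mead").x

rng = np.random.default_rng(4)
n = 500
X = rng.normal(size=n)
Xstar = np.column_stack([np.ones(n), X])
Y = 1.0 + 2.0 * X + rng.standard_t(df=3, size=n)              # heavy-tailed, median-zero errors
print(lad_fit(Y, Xstar))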

4.7 Exercises for Chapter 4


1. Prove that the linear subspaces Tj, j = 1, . . . , m are mutually orthogonal subspaces, where Tj is defined by (4.21) of Theorem 4.5. That is, show that hj is orthogonal to hj′, where hj ∈ Tj, hj′ ∈ Tj′ and j ≠ j′, j, j′ = 1, . . . , m.
2. Let Y be a one-dimensional response random variable. Consider the model

Y = µ(X, β) + ε,

where β ∈ Rq , and E{h(ε)|X} = 0 for some arbitrary function h(·). Up


to now, we considered the identity function h(ε) = ε, but this can be
generalized to arbitrary h(ε). For example, if we define h(ε) = {I(ε ≤
0) − 1/2}, then this is the median regression model. That is, if we define
F (y|x) = P (Y ≤ y|X = x), then med (Y |x) = F −1 (1/2, x), the value
m(x) such that F (m(x)|x) = 1/2. Therefore, the model with this choice
of h(·) is equivalent to

med (Y |X) = µ(X, β).

Assume no other restrictions are placed on the model but E{h(ε)|X} = 0


for some function h(·). For simplicity, assume h is differentiable, but this
can be generalized to nondifferentiable h such as in median regression.
a) Find the space Λ⊥ (i.e., the space perpendicular to the nuisance tan-
gent space).
b) Find the efficient score vector for this problem.
c) Describe how you would construct a locally efficient estimator for β
from a sample of data (Yi , Xi ), i = 1, . . . , n.
d) Find an estimator for the asymptotic variance of the estimator defined
in part (c).
3. Letting Y be a one-dimensional response variable, consider the semipara-
metric model where

E(Y |X) = µ(X, β)

and

Var (Y |X) = V (X, β), β ∈ Rq ,

where V (x, β) > 0 for all x and β.


a) Derive the space Λ⊥ (i.e., the space orthogonal to the nuisance tangent
space).
b) Find the efficient score vector.
c) Derive a locally efficient estimator for β using a sample of data
(Yi , Xi ), i = 1, . . . , n.
5
Other Examples of Semiparametric Models

In this chapter, we will consider other widely used semiparametric models. We


begin with the “location-shift” regression model and the proportional hazards
regression model. These models, like the restricted moment regression model
discussed in detail in Chapter 4, are best represented through a parameter of
interest β and an infinite-dimensional nuisance parameter η. This being the
case, the class of influence functions of RAL estimators for β will be defined as
elements of a Hilbert space orthogonal to the nuisance tangent space. Later in
the chapter, we will also discuss the problem of estimating the mean treatment
difference between two treatments in a randomized pretest-posttest study or,
more generally, in a randomized study with covariate adjustment. We will
show that this seemingly easy problem can be cast as a semiparametric prob-
lem for which the goal is to find an efficient estimator for the mean treatment
difference. We will see that this problem is best represented by defining the
model through an infinite-dimensional parameter θ, where the parameter of
interest, the mean treatment difference, is given by β(θ), a function of θ. With
this representation, it will be more convenient to define the tangent space and
its orthogonal complement and then find the efficient estimator by considering
the residual after projecting the influence function of a simple, but inefficient,
RAL estimator onto the orthogonal complement of the tangent space. This
methodology will be described in detail in Section 5.4 that follows.

5.1 Location-Shift Regression Model


Let Yi denote a continuous response variable and Xi a vector of covariates
measured on the i-th individual of an iid sample i = 1, . . . , n. For simplicity,
we will only consider a univariate response variable here, but all the arguments
that follow could be generalized to multivariate response variables as well.
A popular way of modeling the relationship of Yi as a function of Xi is
with a location-shift regression model. That is, consider the model where

Yi = µ(Xi , β) + εi , i = 1, . . . , n,

(Yi , Xi ), i = 1, . . . , n are iid, Yi is assumed to be continuous, and εi (also


continuous) is independent of Xi (denoted as εi ⊥ ⊥ Xi ). These models may
be linear or nonlinear models depending on whether µ(Xi , β) is linear in β or
not. The semiparametric properties of this model were studied by Bickel et
al. (1993, Section 4.3) and Manski (1984).
In this model, there is an assumed basic underlying distributional shape
that the density of the response variable Yi follows, that of the distribution
of εi , but where the location of this distribution is determined according to
the value of the covariates Xi (i.e., shifted by µ(Xi , β)). The q-dimensional
parameter β determines the magnitude of the shift in the location of the
distribution as a function of the covariates, and it is this parameter that is
of primary interest. We make no additional assumptions on the distribution
of εi or Xi . To avoid identifiability problems, we will also assume that if
(α1, β1) ≠ (α2, β2), then

α1 + µ(X, β1) ≠ α2 + µ(X, β2),

where α1 , α2 are any scalar constants and β1 , β2 are values in the parameter
space contained in Rq .
For example, if we consider a linear model where µ(Xi , β) = XiT β, we
must make sure not to include an intercept term, as this will be absorbed into
the error term εi ; i.e.,

Yi = β1 Xi1 + · · · + βq Xiq + εi , i = 1, . . . , n. (5.1)

The location-shift regression model is semiparametric because no restrictions


are placed on the distribution of εi or Xi . Such a model can be characterized
by
{β, pε (ε), pX (x)},
where pε(ε) and pX(x) are arbitrary densities of ε and X, respectively. We assume that ε, and hence Y, is a continuous random variable dominated by Lebesgue measure. The dominating measure for X is denoted by νX. An arbitrary density in this model for a single observation is given by

pY,X(y, x) = pε{y − µ(x, β)} pX(x)

with respect to the dominating measure (Lebesgue measure in y) × νX.
It has been my experience that there is confusion between the location-
shift regression model and the restricted moment model. In many introductory
courses in statistics, a linear regression model is defined as

Yi = α + XiT β + εi , i = 1, . . . , n, (5.2)

where εi , i = 1, . . . , n are assumed to be iid mean-zero random variables.


Sometimes, in addition, it may be assumed that εi are normally distributed

with mean zero and some common variance σ 2 . What is often not made clear
is that there is an implicit assumption that the covariates Xi , i = 1, . . . , n
are fixed. Consequently, the error terms εi being iid implies that the distri-
bution of εi is independent of Xi . Thus, such models are examples of what
we are now calling the location-shift regression models. In contrast, a linear
restricted moment model can also be written as (5.2), where εi , i = 1, . . . , n
are iid random variables. However, the restricted moment model makes the
assumption that E(εi |Xi ) = 0, which then implies that E(εi ) = 0 but does
not necessarily assume that εi is independent of Xi .
The location-shift regression model, although semiparametric, is more re-
strictive than the restricted moment model considered in Chapter 4. For ex-
ample, if we consider the linear restricted moment model

Yi = α + β1 Xi1 + · · · + βq Xiq + εi ,

where
E (εi |Xi ) = 0,
or equivalently
Yi = β1 Xi1 + · · · + βq Xiq + (α + εi ),
where
E{α + εi |Xi } = α,
then this model includes a larger class of probability distributions than the
linear location-shift regression model; namely,

Yi = β1 Xi1 + . . . + βq Xiq + ε∗i ,

where ε∗i is independent of Xi . Because of independence, we obtain E(ε∗i |Xi ) =


E(ε∗i ) = constant, which satisfies the linear moment restriction, but, con-
versely, E(α + εi |Xi ) = α, which holds true for the restricted moment model,
does not imply that (α + εi ) = ε∗i ⊥ ⊥ Xi .
Since the location-shift regression model is more restrictive than the re-
stricted moment model, we would expect the class of semiparametric RAL es-
timators for β for the location-shift regression model to be larger than the class
of semiparametric RAL estimators for β for the restricted moment model and
the semiparametric efficiency bound for the location-shift regression model to
be smaller than the semiparametric efficiency bound for the restricted moment
model.

The Nuisance Tangent Space and Its Orthogonal Complement for


the Location-Shift Regression Model

The key to finding semiparametric RAL estimators for β and identifying the
efficient such estimator is to derive the space of influence functions of RAL
estimators for β. This will require us to find the space orthogonal to the

nuisance tangent space. The nuisance tangent space is defined as the mean-
square closure of all parametric submodel nuisance tangent spaces. The nui-
sance tangent space and its orthogonal complement for the semiparametric
location-shift regression model are given as follows.
Theorem 5.1. Using the convention that ε(β) = Y −µ(X, β) and ε = ε(β0 ) =
Y −µ(X, β0 ), the nuisance tangent space for the location-shift regression model
is given by
Λ = Λ1s ⊕ Λ2s ,
where

Λ1s = {a1^{q×1}(ε) : E{a1^{q×1}(ε)} = 0^{q×1}},
Λ2s = {a2^{q×1}(X) : E{a2^{q×1}(X)} = 0^{q×1}},

and Λ1s ⊥ Λ2s .

Proof. Consider the parametric submodel with density

pε {y − µ(x, β), γ1 }pX (x, γ2 ), (5.3)

where γ10 and γ20 denote the “truth.” If we fix β at the truth β0 , then pε (ε, γ1 )
allows for an arbitrary marginal density of ε. Consequently, using logic devel-
oped in Chapter 4, the mean-square closure for parametric submodel nuisance
tangent spaces
Λγ1 = {B q×r1 Sγ1 (ε) for all B q×r1 },
where Sγ1(ε) = ∂ log pε(ε, γ10)/∂γ1, is the space Λ1s, defined as

Λ1s = {a1^{q×1}(ε) : E{a1^{q×1}(ε)} = 0^{q×1}}.

Similarly, the mean-square closure for parametric submodel nuisance tangent


spaces
Λγ2 = {B q×r2 Sγ2 (X) for all B q×r2 },
where Sγ2(x) = ∂ log pX(x, γ20)/∂γ2, is the space Λ2s, defined as

Λ2s = {a2^{q×1}(X) : E{a2^{q×1}(X)} = 0^{q×1}}.

Since the density (5.3) is a product involving variationally independent pa-


rameters γ1 and γ2 , the nuisance tangent space (i.e., the mean-square closure
of all parametric submodel nuisance tangent spaces associated with arbitrary
nuisance parameters γ1 and γ2 ) is given by

Λ = Λ1s ⊕ Λ2s .

Because ε is independent of X, it is easy to verify that Λ1s ⊥ Λ2s . 




Influence functions of RAL estimators for β lie in the space orthogonal to


the nuisance tangent space. We can find all elements of Λ⊥ by considering

Λ⊥ = [{h − Π (h|Λ)} for all h ∈ H] .

Because Λ1s ⊥ Λ2s , we obtain the following intuitive result.


Theorem 5.2. Let Λ = Λ1s ⊕ Λ2s , where the closed linear subspaces Λ1s and
Λ2s are orthogonal; i.e., Λ1s ⊥ Λ2s . Then

Π(h|Λ) = Π(h|Λ1s ) + Π(h|Λ2s ). (5.4)

Proof. The projection of any element h ∈ H onto the closed linear space Λ is
the unique element Π(h|Λ) such that the residual h − Π(h|Λ) is orthogonal to
every element in

Λ = {(a1 + a2 ), where a1 ∈ Λ1s and a2 ∈ Λ2s }.

To verify that (5.4) is true, we first note that Π(h|Λ1s ) ∈ Λ1s and Π(h|Λ2s ) ∈
Λ2s , implying that Π(h|Λ1s ) + Π(h|Λ2s ) ∈ Λ. To complete the proof, we must
show that the inner product

⟨h − Π(h|Λ1s) − Π(h|Λ2s), a1 + a2⟩ = 0   (5.5)

for all a1 ∈ Λ1s and a2 ∈ Λ2s. The inner product (5.5) can be written as

⟨h − Π(h|Λ1s), a1⟩   (5.6)
− ⟨Π(h|Λ2s), a1⟩   (5.7)
+ ⟨h − Π(h|Λ2s), a2⟩   (5.8)
− ⟨Π(h|Λ1s), a2⟩.   (5.9)

Since h − Π(h|Λ1s ) is orthogonal to Λ1s and h − Π(h|Λ2s ) is orthogonal to Λ2s ,


this implies that the inner products (5.6) and (5.8) are equal to zero. Also,
since Π(h|Λ2s ) ∈ Λ2s and Π(h|Λ1s ) ∈ Λ1s , then Λ1s ⊥ Λ2s implies that the
inner products (5.7) and (5.9) are also equal to zero. 

In the proof of Lemma 4.3, we showed that for any h(ε, X) ∈ H the projec-
tion of h(ε, X) onto Λ2s is given by Π(h|Λ2s ) = E{h(ε, X)|X}. Similarly, we
can show that Π(h|Λ1s ) = E{h(ε, X)|ε}. Consequently, the space orthogonal
to the nuisance tangent space is given by

Λ^⊥ = ([h(ε, X) − E{h(ε, X)|ε} − E{h(ε, X)|X}] for all h ∈ H).   (5.10)

Semiparametric Estimators for β

Since influence functions of RAL estimators for β are, up to a proportionality


constant, defined by the elements orthogonal to the nuisance tangent space,
we now use the result in (5.10) that defines elements orthogonal to the nui-
sance tangent space to aid us in finding semiparametric estimators of β in the
location-shift regression model.
We begin by considering any arbitrary q-dimensional function of ε, X, say
g q×1 (ε, X). In order that this be an arbitrary element of H, it must have mean
zero. Therefore, we center g(ε, X) and define

h(ε, X) = g(ε, X) − E{g(ε, X)}.

We now consider an arbitrary element of Λ⊥ , which by (5.10) is given by


{h − Π (h|Λ)}, which equals

g(ε, X) − Gε (X) − GX (ε) + E{g(ε, X)}, (5.11)

where Gε (X) = E{g(ε, X)|X}, GX (ε) = E{g(ε, X)|ε}, and the function
g(ε, X) should not equal g1 (ε) + g2 (X) in order to ensure that the residual
(5.11) is not trivially equal to zero.
Because of the independence of ε and X, we obtain, for fixed X = x,
that Gε (x) = E{g(ε, x)} and, for fixed ε = ε∗ , that GX (ε∗ ) = E{g(ε∗ , X)}.
Consequently, consistent and unbiased estimators for Gε (x) and GX (ε∗ ) are
given by

Ĝε(x) = n^{-1} Σ_{i=1}^{n} g(εi, x),   (5.12)
ĜX(ε*) = n^{-1} Σ_{i=1}^{n} g(ε*, Xi),   (5.13)

respectively.
Because influence functions of RAL estimators for β are proportional to
elements orthogonal to the nuisance space, if we knew Gε (x), GX (ε∗ ), and
E{g(ε, X)}, then a natural estimator for β would be obtained by solving the
estimating equation

Σ_{i=1}^{n} [g{εi(β), Xi} − Gε(Xi) − GX{εi(β)} + E{g(ε, X)}] = 0^{q×1},

where εi (β) = Yi − µ(Xi , β). Since Gε (x), GX (ε∗ ), and E{g(ε, X)} are not
known, a natural strategy for obtaining an estimator for β is to substitute
estimates of these quantities in the preceding estimating equation, leading to
the estimating equation
Σ_{i=1}^{n} [g{εi(β), Xi} − Ĝε(β)(Xi) − ĜX{εi(β)} + Ê[g{ε(β), X}]] = 0^{q×1},   (5.14)

where Ĝε(β)(Xi), ĜX{εi(β)} are defined by (5.12) and (5.13), respectively, and Ê[g{ε(β), X}] = n^{-1} Σ_{i=1}^{n} g{εi(β), Xi}.
The estimator for β that solves (5.14) should be a consistent and asymp-
totically normal semiparametric estimator for β with influence function pro-
portional to (5.11). Rather than trying to prove the asymptotic properties
of this estimator, we will instead focus on deriving a class of locally efficient
estimators for β and investigate the asymptotic properties of this class of
estimators.
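A sketch (illustrative assumptions only) of how the left-hand side of (5.14) can be computed for a user-supplied g(ε, x): the terms Ĝε(β)(Xi), ĜX{εi(β)}, and Ê[g{ε(β), X}] are the averages (5.12) and (5.13), which require evaluating g on all n² pairs (εi, Xj).

import numpy as np

def location_shift_ee(beta, Y, X, mu, g):
    # Left-hand side of (5.14) for the location-shift model Y = mu(X, beta) + eps.
    # mu(X, beta) returns the n-vector of means and g(eps, x) a q-vector; both
    # are user-supplied placeholders, not functions defined in the text.
    n = len(Y)
    eps = Y - mu(X, beta)                                             # eps_i(beta)
    G = np.array([[np.atleast_1d(g(e, x)) for x in X] for e in eps])  # G[i, j] = g(eps_i, X_j)
    G_eps = G.mean(axis=0)                  # hat G_eps(X_j): average over the eps index, as in (5.12)
    G_X = G.mean(axis=1)                    # hat G_X(eps_i): average over the X index, as in (5.13)
    diag = G[np.arange(n), np.arange(n)]    # g{eps_i(beta), X_i}
    E_g = diag.mean(axis=0)                 # hat E[g{eps(beta), X}]
    return (diag - G_eps - G_X + E_g).sum(axis=0)

Passing this function to any multivariate root finder in β then gives the estimator solving (5.14).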

Efficient Score for the Location-Shift Regression Model


The efficient estimator for β has an influence function proportional to the
efficient score (i.e., the residual after projecting the score vector with respect
to β onto the nuisance tangent space). To obtain the score vector with respect
to β, we consider the density for a single observation (Y, X), which is given
by
pY,X (y, x, β) = pε {y − µ(x, β)}pX (x).
Therefore,

Sβ^{q×1}(y, x) = ∂ log pY,X(y, x, β)/∂β |_{β=β0} = −D^T(x, β0) Sε(ε),

where

D(x, β0) = ∂µ(x, β0)/∂β^T

and

Sε(ε) = ∂ log pε(ε)/∂ε.   (5.15)

We first prove that E{Sε (ε)} = 0.
Theorem 5.3. If the random variable ε is continuous with support on the
real line, then
E{Sε (ε)} = 0,
where Sε (ε) is defined by (5.15).
Proof. Because ε is a continuous random variable whose distribution is dominated by Lebesgue measure and has support on the real line, this means that ∫ pε(ε − µ) dε = 1 for all scalar constants µ. This implies that d/dµ {∫ pε(ε − µ) dε} = 0 for all µ. Interchanging differentiation and integration, we obtain ∫ −pε′(ε − µ) dε = 0 for all µ, where pε′(x) = dpε(x)/dx. Multiplying and dividing by pε(ε) and taking µ = 0, we obtain −∫ Sε(ε) pε(ε) dε = 0, or E{Sε(ε)} = 0. □

Because E{Sε (ε)} = 0, we use (5.11) to deduce that the efficient score is
given by

Sβ (ε, X) − Π [Sβ |Λ] = −DT (X, β0 )Sε (ε) + E{DT (X, β0 )}Sε (ε)
= −{DT (X, β0 ) − E{DT (X, β0 )}}Sε (ε). (5.16)

If the distributions of X, pX (x), and ε, pε (ε), were known to us, then we


would also know E{DT (X, β)} and Sε (ε). If this were the case, then (5.16)
suggests finding the efficient estimator for β by solving the equation

Σ_{i=1}^{n} [D^T(Xi, β) − E{D^T(X, β)}] Sε{Yi − µ(Xi, β)} = 0.   (5.17)

Note 1. The sign in (5.16) was reversed, but this is not important, as the
estimator remains the same. 
However, since E{DT (X, β)} is not known to us, a natural strategy is to
substitute an estimator for E{DT (X, β)} in (5.17), leading to the estimator
for β that solves the estimating equation

Σ_{i=1}^{n} {D^T(Xi, β) − D̄^T(β)} Sε{Yi − µ(Xi, β)} = 0,   (5.18)

where D̄^T(β) = n^{-1} Σ_{i=1}^{n} D^T(Xi, β).

Locally Efficient Adaptive Estimators

The function Sε (ε) depends on the underlying density pε (ε), which is unknown
to us. Consequently, we may posit some underlying density for ε to start with,
some working density pε (ε) that may or may not be correct. Therefore, in what
follows, we consider the asymptotic properties of the estimator for β, denoted
by β̂n , which is the solution to equation (5.18) for an arbitrary function of ε,
Sε (ε), which may not be a score function or, for that matter, may not have
mean zero at the truth; i.e., possibly E{Sε(ε)} ≠ 0. To emphasize the fact that we may
not be using the correct score function Sε (ε), we will substitute an arbitrary
function of ε, which we denote by κ(ε), for Sε (ε) in equation (5.18).
We now investigate (heuristically) the asymptotic properties of the esti-
mator β̂n , which solves (5.18) for an arbitrary function κ(·) substituted for
Sε (·).

Theorem 5.4. The estimator β̂n, which is the solution to the estimating equation

Σ_{i=1}^{n} {D^T(Xi, β) − D̄^T(β)} κ{Yi − µ(Xi, β)} = 0,   (5.19)

is asymptotically normal. That is,

n^{1/2}(β̂n − β0) →D N(0, Σ),

where

Σ = [E{∂κ(ε)/∂ε}]^{-2} var{κ(ε)} [var{D^T(X, β0)}]^{-1}.

Proof. To derive the asymptotic distribution of β̂n , we expand β̂n about β0


in equation (5.19) and, after multiplying by n−1/2 , we obtain

0 = n^{-1/2} Σ_{i=1}^{n} {D^T(Xi, β̂n) − D̄^T(β̂n)} κ{Yi − µ(Xi, β̂n)}
  = n^{-1/2} Σ_{i=1}^{n} {D^T(Xi, β0) − D̄^T(β0)} κ{Yi − µ(Xi, β0)}   (5.20)
    + [n^{-1} Σ_{i=1}^{n} ∂/∂β^T [{D^T(Xi, βn*) − D̄^T(βn*)} κ{Yi − µ(Xi, βn*)}]]   (5.21)
      × n^{1/2}(β̂n − β0),

where βn* is an intermediate value between β̂n and β0. Consequently,

n^{1/2}(β̂n − β0) = −[n^{-1} Σ_{i=1}^{n} ∂/∂β^T [{D^T(Xi, βn*) − D̄^T(βn*)} κ{Yi − µ(Xi, βn*)}]]^{-1}
                   × n^{-1/2} Σ_{i=1}^{n} {D^T(Xi, β0) − D̄^T(β0)} κ{Yi − µ(Xi, β0)}.   (5.22)

Under suitable regularity conditions, the sample average (5.21) will converge in probability to

E[{D^T(X, β0) − E{D^T(X, β0)}}^{q×1} {∂κ(ε)/∂ε D(X, β0)}^{1×q}]   (5.23)
  + E[{∂D^T(X, β0)/∂β^T − E{∂D^T(X, β0)/∂β^T}} κ(ε)].   (5.24)

Because of the independence of ε and X, (5.23) is equal to

E{∂κ(ε)/∂ε} var{D^T(X, β0)},

where

var{D^T(X, β0)} = E{D^T(X, β0) D(X, β0)} − E{D^T(X, β0)} E{D(X, β0)}



is the variance matrix of D^T(X, β0), and (5.24) = 0. Therefore (5.22) can be written as

n^{1/2}(β̂n − β0) = −[E{∂κ(ε)/∂ε} var{D^T(X, β0)}]^{-1}
                   × n^{-1/2} Σ_{i=1}^{n} {D^T(Xi, β0) − D̄^T(β0)} κ(εi) + op(1).   (5.25)

Note that

n^{-1/2} Σ_{i=1}^{n} {D^T(Xi, β0) − D̄^T(β0)} κ(εi)
  = n^{-1/2} Σ_{i=1}^{n} {D^T(Xi, β0) − D̄^T(β0)} {κ(εi) − κ̄}
  = n^{-1/2} Σ_{i=1}^{n} {D^T(Xi, β0) − µD} {κ(εi) − µκ}
    + n^{1/2} {D̄^T(β0) − µD} {κ̄ − µκ}
  = n^{-1/2} Σ_{i=1}^{n} {D^T(Xi, β0) − µD} {κ(εi) − µκ} + op(1),   (5.26)

where

κ̄ = n^{-1} Σ_{i=1}^{n} κ(εi), µD = E{D^T(Xi, β0)}, and µκ = E{κ(εi)}.

Therefore, by (5.25) and (5.26), we obtain that

n^{1/2}(β̂n − β0) = −[E{dκ(ε)/dε} var{D^T(X, β0)}]^{-1}
                   × n^{-1/2} Σ_{i=1}^{n} {D^T(Xi, β0) − µD} {κ(εi) − µκ} + op(1).

Consequently, the influence function of β̂n is

−[E{dκ(ε)/dε} var{D^T(X, β0)}]^{-1} {D^T(Xi, β0) − µD} {κ(εi) − µκ},   (5.27)

and n^{1/2}(β̂n − β0) is asymptotically normal with mean zero and variance matrix equal to the variance matrix of the influence function, which equals

[E{∂κ(ε)/∂ε}]^{-2} var{κ(ε)} [var{D^T(X, β0)}]^{-1}. □

Remark 1. If we start with an arbitrary function

$$g(\varepsilon, X) = D^T(X,\beta_0)\kappa(\varepsilon),$$

regardless of whether κ(ε) = Sε (ε) = ∂ log pε (ε)/∂ε or some other function of ε, then by (5.11),

$$D^T(X,\beta_0)\kappa(\varepsilon) - D^T(X,\beta_0)\mu_\kappa - \mu_D\kappa(\varepsilon) + \mu_D\mu_\kappa = \{D^T(X,\beta_0)-\mu_D\}\{\kappa(\varepsilon)-\mu_\kappa\}$$

is orthogonal to the nuisance tangent space. We also notice that this is pro-
portional to the influence function of β̂n given by (5.27).
If, however, we choose the true density pε (ε), then we obtain an efficient es-
timator. Consequently, the estimator for β given by (5.19) is a locally efficient
semiparametric estimator for β.  

If we wanted to derive a globally efficient semiparametric estimator, then


we would need to estimate the score function ∂ log pε (ε)/∂ε nonparametri-
cally and substitute this for κ(ε) in the estimating equation (5.19). Although
this may be theoretically possible (see Bickel et al., 1993, Section 7.8), gen-
erally such nonparametric estimators are unstable (unless the sample size is
very large). Another strategy would be to posit some parametric model, say
pε (ε, ξ), for the density of ε in terms of a finite number of parameters ξ. The
parameter ξ can be estimated, say, by using pseudo-likelihood techniques; i.e.,
by maximizing the pseudo-likelihood,
$$\prod_{i=1}^n p_\varepsilon\{Y_i-\mu(X_i,\hat\beta_n^{I}),\,\xi\},$$

as a function of ξ using some initial consistent estimator β̂nI of β. Letting
ξ̂n denote such an estimator, we can estimate β by substituting Sε (ε, ξ̂n ) =
∂ log pε (ε, ξ̂n )/∂ε into equation (5.18). Such an adaptive estimator is locally
efficient in the sense that if the true density of ε is an element of the posited
parametric model, then the resulting estimator is efficient; otherwise, it will
still be a consistent asymptotically normal semiparametric estimator for β.
The idea here is that with a flexible parametric model, we can get a reason-
ably good approximation for the underlying density of ε, so even if we don’t
estimate this density consistently, it would hopefully be close enough so that
the resulting estimator will have good efficiency properties.
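
The adaptive procedure just described can be sketched concretely. The Python sketch below assumes a linear mean µ(X, β) = X^T β and takes a Student-t family (location, scale, degrees of freedom) as the posited working density pε(ε, ξ); the function names, the choice of working family, and the simulated data are illustrative assumptions, not part of the text.

```python
# Minimal sketch of the adaptive (locally efficient) estimator for the
# location-shift model with a linear mean mu(X, beta) = X' beta.
import numpy as np
from scipy import stats, optimize

def adaptive_estimator(Y, X):
    Xc = X - X.mean(axis=0)                       # centered covariates (X_i - X_bar)

    # Step 1: initial consistent estimator (ordinary least squares).
    beta_init = np.linalg.lstsq(Xc, Y - Y.mean(), rcond=None)[0]

    # Step 2: pseudo-likelihood -- fit the working t-density to the residuals.
    df, loc, scale = stats.t.fit(Y - X @ beta_init)

    # Score of the fitted working density: d log p_eps(eps, xi_hat) / d eps.
    def score(eps):
        z = (eps - loc) / scale
        return -(df + 1.0) * z / (scale * (df + z ** 2))

    # Step 3: solve the estimating equation (5.18) with the estimated score.
    estimating_eq = lambda beta: Xc.T @ score(Y - X @ beta)
    return optimize.root(estimating_eq, beta_init).x

# Illustrative use with heavy-tailed errors:
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
Y = X @ np.array([1.0, -0.5]) + rng.standard_t(df=3, size=500)
print(adaptive_estimator(Y, X))
```

When the errors really are heavy-tailed, the estimated t-score typically gives a noticeably smaller variance than least squares, while the estimator remains consistent even if the working family is wrong, in line with the local efficiency property described above.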
In the following example, we consider the linear model and use this to
contrast the estimators obtained for the restricted moment model and the
location-shift regression model.
Example 1. Consider the linear model (5.1)

Yi = XiT β + εi ,

where Yi is a one-dimensional random variable. For this model,

$$D^T(X_i,\beta) = X_i^{\,q\times 1}.$$

The estimator for β given by (5.18) is the solution to


$$\sum_{i=1}^n (X_i-\bar X)\,S_\varepsilon(Y_i-X_i^T\beta) = 0. \qquad (5.28)$$

Suppose we believe our data are normally distributed; i.e., $\varepsilon_i \sim N(\mu_\varepsilon,\sigma^2)$, or
$$p_\varepsilon(\varepsilon) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}(\varepsilon-\mu_\varepsilon)^2\right\}.$$
Then
$$S_\varepsilon(\varepsilon) = -\left(\frac{\varepsilon-\mu_\varepsilon}{\sigma^2}\right).$$
Substituting into (5.28) yields
$$\sum_{i=1}^n (X_i-\bar X)\,\frac{(Y_i-X_i^T\beta-\mu_\varepsilon)}{\sigma^2} = 0.$$
Since $\sigma^2$ is a multiplicative constant and $\sum_{i=1}^n(X_i-\bar X)=0$, the estimator for β is equivalent to solving
$$\sum_{i=1}^n (X_i-\bar X)(Y_i-X_i^T\beta) = 0.$$

This gives the usual least-squares estimator for the regression coefficients in
a linear model with an intercept term; namely,
$$\hat\beta_n = \left\{\sum_{i=1}^n (X_i-\bar X)(X_i-\bar X)^T\right\}^{-1}\sum_{i=1}^n (X_i-\bar X)Y_i. \qquad (5.29)$$
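
As a quick numerical sanity check (with simulated data; the constants are arbitrary illustrations), solving (5.28) with the normal working score reproduces the closed-form expression (5.29):

```python
# Verify numerically that the normal working score in (5.28) yields (5.29).
import numpy as np
from scipy import optimize

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))
Y = 0.5 + X @ np.array([2.0, -1.0]) + rng.normal(scale=1.5, size=n)
Xc = X - X.mean(axis=0)

# Closed-form centered least-squares estimator (5.29).
beta_ls = np.linalg.solve(Xc.T @ Xc, Xc.T @ Y)

# Estimating equation (5.28) with the normal working score, any mu_eps and sigma^2 > 0.
mu_eps, sigma2 = 0.3, 2.0
eq = lambda b: Xc.T @ (-(Y - X @ b - mu_eps) / sigma2)
beta_ee = optimize.root(eq, np.zeros(2)).x

print(beta_ls, beta_ee)   # the two solutions coincide
```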

We mentioned previously that the location-shift regression model is con-


tained within the restricted moment model

Yi = α + β1 Xi1 + · · · + βq Xiq + εi ,

where E(εi |Xi ) = 0. For the location-shift regression model, εi is independent


of Xi , which implies that the variance function

V (Xi ) = var (Yi |Xi ) = σ 2 (a constant independent of Xi ).

The efficient estimator for β among semiparametric estimators for β in the


restricted moment model, given by (4.54), is
$$\sum_{i=1}^n D^T(X_i)V^{-1}(X_i)\{Y_i-\mu(X_i,\beta)\} = 0.$$

If V (Xi ) is assumed constant, then the efficient restricted moment estimator for (α, β^T)^T is given as the solution to

$$\sum_{i=1}^n \sigma^{-2}(1, X_i^T)^T(Y_i-\alpha-X_i^T\beta) = 0.$$

Standard matrix algebra yields

$$\hat\beta_n = \left\{\sum_{i=1}^n (X_i-\bar X)(X_i-\bar X)^T\right\}^{-1}\sum_{i=1}^n (X_i-\bar X)Y_i,$$

which is identical to the estimator (5.29), the locally efficient estimator for β
in the location-shift regression model.

Remarks

1. The estimator β̂n given by (5.29) is the efficient estimator for β among
semiparametric estimators for the location-shift regression model if the
error distribution is indeed normally distributed.
2. If, however, the error distribution was not normally distributed, other
semiparametric estimators (semiparametric for the location-shift model)
would be more efficient; namely, the solution to the equation

$$\sum_{i=1}^n (X_i-\bar X)\,S_\varepsilon(Y_i-X_i^T\beta) = 0, \qquad (5.30)$$

where $S_\varepsilon(\varepsilon) = d\log p_\varepsilon(\varepsilon)/d\varepsilon$.


3. In contrast, among the estimators for β for the restricted moment model,
when the variance function V (Xi ) is assumed constant, the efficient esti-
mator for β is given by (5.29).
4. The estimator for β given by (5.30) may not be consistent and asymp-
totically normal if εi is not independent of Xi but E(εi |Xi ) = 0, whereas
(5.29) is.

5.2 Proportional Hazards Regression Model with Censored Data
The proportional hazards model, which was introduced by Cox (1972), is the
most widely used model for analyzing survival data. This model is especially

useful for survival data that are right censored, as is often the case in many
clinical trials where the primary endpoint is “time to an event” such as time
to death, time to relapse, etc. As we will see shortly, the proportional hazards
model is a semiparametric model that can be represented naturally with a
finite number of parameters of interest and an infinite-dimensional nuisance
parameter. Because of this natural representation with a parametric and non-
parametric component, it was one of the first models to be studied using
semiparametric theory, in the landmark paper of Begun et al. (1983).
In order to follow the arguments in this section, the reader must be familiar
with the theory of counting processes and associated martingale processes that
have been developed for studying the theoretical properties of many statistics
used in censored survival analysis. Some excellent books for studying counting
processes and their application to censored survival-data models include those
by Fleming and Harrington (1991) and Andersen et al. (1992). The reader who
has no familiarity with this area can skip this section, as it is self-contained
and will not affect the understanding of the remainder of the book.
The primary goal of the proportional hazards model is to model the rela-
tionship of “time to event” as a function of a vector of covariates X. Through-
out this section, we will often refer to the “time to event” as the “survival
time” since, in many applications where these models are used, the primary
endpoint is time to death. But keep in mind that the time to any event could
be used. Let T denote the underlying survival time of an arbitrary individual
in our population. The random variable T will be assumed to be a contin-
uous positive random variable. Unlike previous models where we considered
the conditional mean of the response variable, or shifts in the location of
the distribution of the response variable as a function of covariates, in the
proportional hazards model it is the hazard rate of failure that is modeled
as a function of covariates X (q-dimensional). Specifically, the proportional
hazards model of Cox (1972) assumes that
 
$$\lambda(u|X) = \lim_{h\to 0}\left\{\frac{P(u\le T<u+h\,|\,T\ge u, X)}{h}\right\} = \lambda(u)\exp(\beta^T X), \qquad (5.31)$$

where λ(u|x) is the conditional hazard rate of failure at time u given X = x.


The baseline hazard function λ(u) can be viewed as the underlying hazard
rate if all the covariates X are equal to zero and this underlying hazard rate
is left unspecified (nonparametric). This baseline hazard rate is assumed to
be increased or decreased proportionately (the same proportionality constant
through time) as a function of the covariates X through the relationship
exp(β T X). Consequently, the parameters β in this model measure the strength
of the association of this proportionality increase or decrease in the hazard
rate as a function of the covariates, and it is these parameters that will be of
primary interest to us.

In many experimental settings where survival data are obtained, such as


in clinical trials, not all the survival data are available for the individuals
in the study. Some of the survival data may be right censored. That is, for
some individuals, we may only know that they survived to some time. For
example, in a clinical trial where patients enter the study during some accrual
period and where the study is analyzed before all the patients die, a patient
who has not died has a survival time that is right censored; that is, we only
know that this patient survived the period of time between their entry into
the study and the time the study was analyzed. This is an example of what
is referred to as administrative censoring. Other reasons for censoring include
patient dropout, where we only know that a patient was still alive at the time
they dropped out. To accommodate censoring, we define the random variable
C that corresponds to the potential time that an individual is followed for
survival. We denote the conditional density of C given X as pC|X (c|x). In this
setting, we observe, for each individual, the variables
V = min(T, C) : (time on study),
∆ = I(T ≤ C) : (failure indicator),
and X = q-dimensional covariate vector.
In addition, we assume that T and C are conditionally independent given
X. This is denoted as T ⊥ ⊥ C|X. This assumption is necessary to allow for
the identifiability of the conditional distribution of T given X when we have
censored data; see Tsiatis (1998) for more details. Otherwise, no other as-
sumptions are made on the conditional distribution of C given X.
The data from a survival study are represented as (Z1 , . . . , Zn ), iid, where
Zi = (Vi , ∆i , Xi ), i = 1, . . . , n.
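
To make the data structure concrete, the following sketch generates data of exactly this form, (Vi, ∆i, Xi), from a proportional hazards model. The constant baseline hazard and the exponential censoring distribution are assumptions made only for this illustration; they are not required by the model.

```python
# Simulate (V_i, Delta_i, X_i) from a proportional hazards model (5.31) with a
# constant baseline hazard lambda0 (illustrative assumption).
import numpy as np

rng = np.random.default_rng(2)
n, beta, lambda0 = 1000, np.array([0.7, -0.4]), 0.1

X = rng.normal(size=(n, 2))                 # covariates
rate = lambda0 * np.exp(X @ beta)           # conditional hazard lambda0 * exp(beta' X)
T = rng.exponential(1.0 / rate)             # survival times implied by the model

C = rng.exponential(20.0, size=n)           # censoring times, independent of T given X
V = np.minimum(T, C)                        # time on study
Delta = (T <= C).astype(int)                # failure indicator
```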
The goal is to find semiparametric consistent, asymptotically normal, and ef-
ficient estimators for β using the sample (Vi , ∆i , Xi ), i = 1, . . . , n without
making any additional assumptions on the underlying baseline hazard func-
tion λ(u), on the conditional distribution of C given X, or on the distribution
of X. To do so, we will use the semiparametric theory that we have developed.
Specifically, we will find the space of influence functions of RAL estimators
for β for this model, which, in turn, will motivate a class of RAL estimators
among which we will derive the efficient estimator. This will be accomplished
by deriving the semiparametric nuisance tangent space, its orthogonal com-
plement, where influence functions lie, and the efficient score. We begin by
first considering the density of the observable data.
The density of a single data item is given by
$$p_{V,\Delta,X}(v,\delta,x) = \{\lambda(v)\exp(\beta^T x)\}^{\delta}\exp\{-\Lambda(v)\exp(\beta^T x)\}\,\{p_{C|X}(v|x)\}^{1-\delta}\left\{\int_v^{\infty}p_{C|X}(u|x)\,du\right\}^{\delta}p_X(x), \qquad (5.32)$$

where the cumulative baseline hazard function is defined as


$$\Lambda(v) = \int_0^v \lambda(u)\,du,$$

and pX (x) denotes the marginal density of X.


It will be convenient to use a hazard representation for the conditional
distribution of C given X. Define
 
$$\lambda_{C|X}(u|x) = \lim_{h\to 0}\left\{\frac{P(u\le C<u+h\,|\,C\ge u, X=x)}{h}\right\}$$
and
$$\Lambda_{C|X}(v|x) = \int_0^v \lambda_{C|X}(u|x)\,du.$$
Using the fact that
$$p_{C|X}(v|x) = \lambda_{C|X}(v|x)\exp\{-\Lambda_{C|X}(v|x)\}$$
and
$$\int_v^{\infty}p_{C|X}(u|x)\,du = \exp\{-\Lambda_{C|X}(v|x)\},$$

we write the density (5.32) as


$$p_{V,\Delta,X}(v,\delta,x) = \{\lambda(v)\exp(\beta^T x)\}^{\delta}\exp\{-\Lambda(v)\exp(\beta^T x)\}\times\{\lambda_{C|X}(v|x)\}^{1-\delta}\exp\{-\Lambda_{C|X}(v|x)\}\times p_X(x). \qquad (5.33)$$

The model is characterized by the parameter (β, η), where β denotes the q-
dimensional regression parameters of interest and the nuisance parameter

η = {λ(v), λC|X (v|x), pX (x)},

where λ(v) is an arbitrary positive function of v, λC|X (v|x) is an arbitrary


positive function of v and x, and pX (x) is any density such that

$$\int p_X(x)\,d\nu_X(x) = 1.$$

The log of the density in (5.33) is


$$\delta\{\log\lambda(v)+\beta^T x\} - \Lambda(v)\exp(\beta^T x) + (1-\delta)\log\lambda_{C|X}(v|x) - \Lambda_{C|X}(v|x) + \log p_X(x). \qquad (5.34)$$

The Nuisance Tangent Space

For this problem, the Hilbert space H consists of all q-dimensional measurable
functions h(v, δ, x) of (V, ∆, X) with mean zero and finite variance equipped
with the covariance inner product. In order to define the nuisance tangent
space within the Hilbert space H, we must derive the mean-square closure of
all parametric submodel nuisance tangent spaces. Toward that end, we define
an arbitrary parametric submodel by substituting λ(v, γ1 ), λC|X (v|x, γ2 ) and
pX (x, γ3 ) into (5.33), where γ1 , γ2 , and γ3 are finite-dimensional nuisance
parameters of dimension r1 , r2 , and r3 , respectively. Since the nuisance pa-
rameters γ1 , γ2 , and γ3 are variationally independent and separate from each
other in the log-likelihood of (5.34), this implies that the parametric submodel
nuisance tangent space, and hence the nuisance tangent space, can be written
as a direct sum of three orthogonal spaces,

Λ = Λ1s ⊕ Λ2s ⊕ Λ3s ,

where

Λjs = mean-square closure of {B q×rj Sγj (V, ∆, X) for all B q×rj }, j = 1, 2, 3,

and Sγj (v, δ, x) = ∂ log pV,∆,X (v, δ, x, γ)/∂γj are the parametric submodel
score vectors, for j = 1, 2, 3.
Since the space Λ3s is associated with arbitrary marginal densities pX (x)
of X, we can use arguments already developed in Chapter 4 to show that
 
$$\Lambda_{3s} = \left\{\alpha^{q\times 1}(X) : E\{\alpha^{q\times 1}(X)\} = 0^{q\times 1}\right\} \qquad (5.35)$$

and that the projection of an arbitrary element h(V, ∆, X) ∈ H is given by

Π(h|Λ3s ) = E(h|X).

We will now show in a series of lemmas how to derive Λ1s and Λ2s followed
by the key theorem that derives the space orthogonal to the nuisance tangent
space; i.e., Λ⊥ .

The Space Λ2s Associated with λC|X (v|x)

Using counting process notation, let NC (u) = I(V ≤ u, ∆ = 0) denote the counting process indicating whether a single individual is observed to be censored at or before time u, and let Y (u) = I(V ≥ u) denote the indicator of being at risk at time u. The corresponding martingale increment is

dMC (u, x) = dNC (u) − λ0C|X (u|x)Y (u)du,

where λ0C|X (u|x) denotes the true conditional hazard function of C at time
u given X = x.

Lemma 5.1. The space Λ2s associated with the arbitrary conditional density
of C given X is given as the class of elements
 
$$\left\{\int\alpha^{q\times 1}(u,X)\,dM_C(u,X)\ \text{for all functions}\ \alpha^{q\times 1}(u,x)\right\},$$

where αq×1 (u, x) is any arbitrary q-dimensional function of u and x.

Proof. In order to derive the space Λ2s , consider the parametric submodel

λC|X (v|x, γ2 ) = λ0C|X (v|x) exp{γ2T αq×1 (v, x)}.

That this is a valid parametric submodel follows because the truth is contained
within this model (i.e., when γ2 = 0) and a hazard function must be some
positive function of v.
The contribution to the log-likelihood (5.34) for this parametric submodel
is
$$(1-\delta)\left\{\log\lambda_{0C|X}(v|x) + \gamma_2^T\alpha(v,x)\right\} - \int_0^v \lambda_{0C|X}(u|x)\exp\{\gamma_2^T\alpha(u,x)\}\,du.$$

Taking the derivative with respect to γ2 and evaluating it at the truth (γ2 = 0),
we obtain the nuisance score vector
$$S_{\gamma_2} = (1-\Delta)\alpha(V,X) - \int_0^V \alpha(u,X)\lambda_{0C|X}(u|X)\,du = (1-\Delta)\alpha(V,X) - \int \alpha(u,X)\lambda_{0C|X}(u|X)I(V\ge u)\,du.$$

Using counting process notation, $S_{\gamma_2}$ can be written as a stochastic integral,

$$S_{\gamma_2} = \int\alpha(u,X)\,dM_C(u,X).$$

From this last result, we conjecture that the space Λ2s consists of all elements
in the class
 
$$\left\{\int\alpha^{q\times 1}(u,X)\,dM_C(u,X)\ \text{for all functions}\ \alpha^{q\times 1}(u,x)\right\}.$$

We have already demonstrated that any element in the class above is an ele-
ment of a parametric submodel nuisance tangent space. Therefore, to complete
our argument and verify our conjecture, we need to show that the linear space
spanned by the score vector with respect to γ2 for any parametric submodel
belongs to the space above.
Consider any arbitrary parametric submodel λC|X (u|x, γ2 ), with γ20 de-
noting the truth. The score vector is given by
$$\left[\frac{\partial}{\partial\gamma_2}\left\{(1-\Delta)\log\lambda_{C|X}(V|X,\gamma_2) - \int\lambda_{C|X}(u|X,\gamma_2)I(V\ge u)\,du\right\}\right]_{\gamma_2=\gamma_{20}}$$
$$= (1-\Delta)\frac{\partial\lambda_{C|X}(V|X,\gamma_{20})/\partial\gamma_2}{\lambda_{C|X}(V|X,\gamma_{20})} - \int\frac{\partial\lambda_{C|X}(u|X,\gamma_{20})/\partial\gamma_2}{\lambda_{C|X}(u|X,\gamma_{20})}\,\lambda_{C|X}(u|X,\gamma_{20})I(V\ge u)\,du$$
$$= \int\left\{\frac{\partial\log\lambda_{C|X}(u|X,\gamma_{20})}{\partial\gamma_2}\right\}dM_C(u,X).$$

Multiplying this score vector by a conformable matrix results in an element of the form
$$\int\alpha(u,X)\,dM_C(u,X).$$


The Space Λ1s Associated with λ(v)

Let N (u) denote the counting process that counts whether an individual was
observed to die before or at time u (i.e., N (u) = I(V ≤ u, ∆ = 1)) and, as
before, Y (u) = I(V ≥ u) is the “at risk” indicator at time u. Let dM (u, X) de-
note the martingale increment dN (u) − λ0 (u) exp(β0T X)Y (u)du, where λ0 (u)
denotes the true underlying hazard rate in the proportional hazards model.

Lemma 5.2. The space Λ1s , the part of the nuisance tangent space associ-
ated with the nuisance parameter λ(·), the underlying baseline hazard rate of
failure, is
 
$$\Lambda_{1s} = \left\{\int a^{q\times 1}(u)\,dM(u,X)\ \text{for all $q$-dimensional functions}\ a^{q\times 1}(u)\right\},$$

where aq×1 (u) denotes an arbitrary q-dimensional function of u.

Proof. Consider the parametric submodel

λ(v, γ1 ) = λ0 (v) exp{γ1T aq×1 (v)}

for any arbitrary q-dimensional function aq×1 (v) of v. For this parametric
submodel, the contribution to the log-density is
$$\delta\{\log\lambda_0(v) + \gamma_1^T a(v) + \beta^T x\} - \int_0^v \lambda_0(u)\exp\{\gamma_1^T a(u) + \beta^T x\}\,du. \qquad (5.36)$$

Taking derivatives of (5.36) with respect to γ1 , setting γ1 = 0 and β = β0 , we


obtain the score function

$$S_{\gamma_1} = \int a^{q\times 1}(u)\,dM(u,X).$$

From this, we conjecture that Λ1s is


 
$$\Lambda_{1s} = \left\{\int a^{q\times 1}(u)\,dM(u,X)\ \text{for all $q$-dimensional functions}\ a^{q\times 1}(u)\right\}.$$

We have already demonstrated that any element in Λ1s is an element of a para-


metric submodel nuisance tangent space. Therefore, to complete our argument
and verify our conjecture, we need to show that the linear space spanned by
the score vector with respect to γ1 for any parametric submodel belongs to
Λ1s . Consider the parametric submodel with the nuisance parameter γ1 , which
appears in the log-density (5.34) as
$$\delta\{\log\lambda(v,\gamma_1)+\beta^T x\} - \Lambda(v,\gamma_1)\exp(\beta^T x), \qquad (5.37)$$
where $\Lambda(v,\gamma_1) = \int_0^v \lambda(u,\gamma_1)\,du$ and γ10 denotes the truth; i.e., λ(v, γ10 ) =
λ0 (v). Taking the derivative of (5.37) with respect to γ1 , setting γ1 = γ10 and
β = β0 , we deduce that the score vector with respect to γ1 is
  
$$S_{\gamma_1} = \int\left\{\frac{\partial\log\lambda(u,\gamma_{10})}{\partial\gamma_1}\right\}dM(u,X).$$

Multiplying this score vector by a conformable matrix leads to an element in


Λ1s , thus verifying our conjecture. 


Finding the Orthogonal Complement of the Nuisance Tangent Space

We have demonstrated that the nuisance tangent space can be written as a


direct sum of three orthogonal linear spaces, namely

Λ = Λ1s ⊕ Λ2s ⊕ Λ3s ,

where Λ2s and Λ1s were derived in Lemmas 5.1 and 5.2, respectively, and Λ3s
is given in (5.35). Influence functions of RAL estimators for β belong to the
space orthogonal to Λ. We now consider finding the orthogonal complement
to Λ.
Theorem 5.5. The space orthogonal to the nuisance tangent space is given
by
 
$$\Lambda^{\perp} = \left\{\int\{\alpha(u,X)-a^*(u)\}\,dM(u,X)\ \text{for all}\ \alpha^{q\times 1}(u,x)\right\}, \qquad (5.38)$$
where
$$a^*(u) = \frac{E\{\alpha(u,X)\exp(\beta_0^T X)Y(u)\}}{E\{\exp(\beta_0^T X)Y(u)\}}. \qquad (5.39)$$

Proof. We begin by noting some interesting geometric properties for the nui-
sance tangent space that we can take advantage of in deriving its orthogonal
complement.
If we put no restrictions on the densities pV,∆,X (v, δ, x) that generate our
data (completely nonparametric), then it follows from Theorem 4.4 that the
corresponding tangent space would be the space of all q-dimensional mea-
surable functions of (V, ∆, X) with mean zero; i.e., the entire Hilbert space
H.
The proportional hazards model we are considering puts no restrictions
on the marginal distribution of X, pX (x) or the conditional distribution of C
given X. Therefore, the only restrictions on the class of densities for (V, ∆, X)
come from those imposed on the conditional distribution of T given X via the
proportional hazards model. Suppose, for the time being, we put no restriction
on the conditional hazard of T given X and denote this by λT |X (v|x). If
this were the case, then there would be no restrictions on the distribution of
(V, ∆, X).
The distribution of (V, ∆, X) given by the density PV,∆,X (v, δ, x) can be
written as PV,∆|X (v, δ|x)pX (x), where the conditional density PV,∆|X (v, δ|x)
can also be characterized through the cause-specific hazard functions
 
$$\lambda^{*}_{\Delta}(v|x) = \lim_{h\to 0}\left\{\frac{P(v\le V<v+h,\,\Delta=\delta\,|\,V\ge v, X=x)}{h}\right\},$$

for ∆ = 0, 1. Under the assumption T ⊥ ⊥ C|X, the cause-specific hazard


functions equal the net-specific hazard functions; namely,

λ∗1 (v|x) = λT |X (v|x), λ∗0 (v|x) = λC|X (v|x). (5.40)

That (5.40) is true follows from results in Tsiatis (1998). Therefore, putting no
restrictions on λT |X (v|x) or λC|X (v|x) implies that no restrictions are placed
on the conditional distribution of (V, ∆) given X. Hence, the log-density for
such a saturated (nonparametric) model could be written (analogously to
(5.34)) as

$$\delta\log\lambda_{T|X}(v|x) - \Lambda_{T|X}(v|x) + (1-\delta)\log\lambda_{C|X}(v|x) - \Lambda_{C|X}(v|x) + \log p_X(x).$$

The tangent space for this model can be written as a direct sum of three
orthogonal spaces,
Λ∗1s ⊕ Λ2s ⊕ Λ3s ,
where Λ2s and Λ3s are defined as before, but now Λ∗1s is the space associated
with λT |X (v|x), which is now left arbitrary.
Arguments that are virtually identical to those used to find the space Λ2s
in Lemma 5.1 can be used to show that
 
$$\Lambda^{*}_{1s} = \left\{\int\alpha^{q\times 1}(u,X)\,dM(u,X)\ \text{for all}\ \alpha^{q\times 1}(u,x)\right\},$$

where
dM (u, X) = dN (u) − λ0T |X (u|X)Y (u)du.

Note 2. Notice the difference: For Λ1s we used aq×1 (u) (i.e., a function of u
only), whereas for Λ∗1s we used αq×1 (u, X) (i.e., a function of both u and X)
in the stochastic integral above. 


Because the tangent space Λ∗1s ⊕ Λ2s ⊕ Λ3s is that for a nonparametric model
(i.e., a model that allows for all densities of (V, ∆, X)), and because the tan-
gent space for a nonparametric model is the entire Hilbert space, this implies
that
H = Λ∗1s ⊕ Λ2s ⊕ Λ3s , (5.41)
where Λ∗1s , Λ2s , and Λ3s are mutually orthogonal subspaces.
Since the nuisance tangent space Λ = Λ1s ⊕ Λ2s ⊕ Λ3s , this implies that Λ1s ⊂ Λ∗1s . Also, the orthogonal complement Λ⊥ must be orthogonal to Λ2s ⊕ Λ3s and, by (5.41), (Λ2s ⊕ Λ3s )⊥ = Λ∗1s ; i.e., Λ⊥ ⊂ Λ∗1s . Λ⊥ must also be orthogonal to Λ1s ; consequently, Λ⊥ consists of elements of Λ∗1s that are orthogonal to Λ1s .
In order to identify elements of Λ⊥ (i.e., elements of Λ∗1s that are orthogonal
to Λ1s ), it suffices to take an arbitrary element of Λ∗1s , namely

$$\int\alpha^{q\times 1}(u,X)\,dM(u,X)$$

and find its residual after projecting it onto Λ1s . To find the projection, we must derive a∗ (u) so that

$$\int\alpha(u,X)\,dM(u,X) - \int a^*(u)\,dM(u,X)$$

is orthogonal to every element of Λ1s . That is,

$$E\left[\left\{\int\{\alpha(u,X)-a^*(u)\}\,dM(u,X)\right\}^T\int a(u)\,dM(u,X)\right] = 0$$

for all a(u).


The covariance of martingale stochastic integrals such as those above can
be computed by finding the expectation of the predictable covariation process;
see Fleming and Harrington (1991). Namely,
 
$$E\left[\int\{\alpha(u,X)-a^*(u)\}^T a(u)\,\lambda_0(u)\exp(\beta_0^T X)Y(u)\,du\right] \qquad (5.42)$$
$$= \int E\left[\{\alpha(u,X)-a^*(u)\}^T\exp(\beta_0^T X)Y(u)\right]\lambda_0(u)\,a(u)\,du = 0 \qquad (5.43)$$

for all a(u). Since a(u) is arbitrary, this implies that


 
$$E\left[\{\alpha(u,X)-a^*(u)\}\exp(\beta_0^T X)Y(u)\right] = 0^{q\times 1} \qquad (5.44)$$

for all u. We can prove (5.44) by contradiction because if (5.44) was not equal
to zero, then we could make the integral (5.43) nonzero by choosing a(u) to
be equal to whatever the expectation is in (5.44).
Solving (5.44), we obtain
$$E\{\alpha(u,X)\exp(\beta_0^T X)Y(u)\} = a^*(u)\,E\{\exp(\beta_0^T X)Y(u)\}$$
or
$$a^*(u) = \frac{E\{\alpha(u,X)\exp(\beta_0^T X)Y(u)\}}{E\{\exp(\beta_0^T X)Y(u)\}}.$$
Therefore, the space orthogonal to the nuisance tangent space is given by
 
$$\Lambda^{\perp} = \left\{\int\{\alpha(u,X)-a^*(u)\}\,dM(u,X)\ \text{for all}\ \alpha^{q\times 1}(u,x)\right\},$$
where
$$a^*(u) = \frac{E\{\alpha(u,X)\exp(\beta_0^T X)Y(u)\}}{E\{\exp(\beta_0^T X)Y(u)\}}.$$

Finding RAL Estimators for β

As we have argued repeatedly, influence functions of RAL estimators for β are


defined, up to a proportionality constant, through the elements orthogonal
to the nuisance tangent space. For the proportional hazards model, these
elements are defined through (5.38) and (5.39). Often, knowing the functional
form of such elements leads us to estimating equations that will result in RAL
estimators for β with a particular influence function. We now illustrate.
Since a∗ (u), given by (5.39), is defined through a ratio of expectations, it
is natural to estimate this using a ratio of sample averages, namely
$$\hat a^*(u) = \frac{n^{-1}\sum_{i=1}^n\alpha(u,X_i)\exp(\beta_0^T X_i)Y_i(u)}{n^{-1}\sum_{i=1}^n\exp(\beta_0^T X_i)Y_i(u)}.$$

Note that
$$n^{-1/2}\sum_{i=1}^n\int\{\alpha(u,X_i)-a^*(u)\}\,dM_i(u,X_i) = n^{-1/2}\sum_{i=1}^n\int\{\alpha(u,X_i)-\hat a^*(u)\}\,dM_i(u,X_i) + o_p(1). \qquad (5.45)$$

This follows because


$$n^{-1/2}\sum_{i=1}^n\int\{\hat a^*(u)-a^*(u)\}\,dM_i(u,X_i) \xrightarrow{\ P\ } 0,$$

a consequence of Lenglart’s inequality; see Fleming and Harrington (1991).


Using straightforward algebra, we obtain that
⎧ n ⎫
 ⎪
⎪ α(u, X ) exp(β T
X )Y (u) ⎪

n ⎨ j 0 j j ⎬
j=1
α(u, Xi ) −  dMi (u, Xi )


n

⎪ * +, -
i=1 ⎩ exp(β0T Xj )Yj (u) ⎭
j=1 ||
dNi (u) − λ0 (u) exp(β0T Xi )Yi (u)

is identical to
⎧ n ⎫

⎪ T ⎪

n  ⎨ α(u, X j ) exp(β0 X j )Yj (u) ⎬
j=1
α(u, Xi ) −  dNi (u).


n


i=1 ⎩ exp(β0T Xj )Yj (u). ⎭
j=1

Using standard expansions of the estimating equation (which we leave to the


reader as an exercise), where we expand the estimating equation about the
truth, we can show that the estimator for β, which is the solution to the
estimating equation
$$\sum_{i=1}^n\int\left\{\alpha(u,X_i) - \frac{\sum_{j=1}^n\alpha(u,X_j)\exp(\beta^T X_j)Y_j(u)}{\sum_{j=1}^n\exp(\beta^T X_j)Y_j(u)}\right\}dN_i(u) = 0,$$

will have an influence function “proportional” to the element of Λ⊥ ,



$$\int\{\alpha(u,X_i)-a^*(u)\}\,dM_i(u,X_i).$$

That is, we can find a semiparametric estimator for β by choosing any q-


dimensional function α(u, x) of u and x and solving the estimating equation
$$\sum_{i=1}^n\Delta_i\left\{\alpha(V_i,X_i) - \frac{\sum_{j=1}^n\alpha(V_i,X_j)\exp(\beta^T X_j)Y_j(V_i)}{\sum_{j=1}^n\exp(\beta^T X_j)Y_j(V_i)}\right\} = 0. \qquad (5.46)$$

By considering all functions αq×1 (u, x), the corresponding estimators will define a class of semiparametric estimators whose influence functions include all the influence functions of RAL estimators for β.

Efficient Estimator

To find the efficient estimator, we must derive the efficient score. This entails
computing
Sβ (Vi , ∆i , Xi )
and projecting this onto the nuisance tangent space. Going back to the log-
density (5.34) and taking the derivative with respect to β, it is straightforward
to show that
$$S_\beta = \int X^{q\times 1}\,dM(u,X).$$

We note that Sβ is an element of Λ∗1s , with α(u, Xi ) = Xi .


Therefore, the efficient score, derived as the residual after projecting Sβ
onto Λ (or in this case Λ1s ), is given as
  
$$S_{\mathrm{eff}} = \int\left[X - \frac{E\{X\exp(\beta_0^T X)Y(u)\}}{E\{\exp(\beta_0^T X)Y(u)\}}\right]dM(u,X).$$

The estimator for β, which has an efficient influence function (i.e., proportional
to Seff ), is given by substituting Xi for α(u, Xi ) in (5.46); namely,
$$\sum_{i=1}^n\Delta_i\left[X_i - \frac{\sum_{j=1}^n X_j\exp(\beta^T X_j)Y_j(V_i)}{\sum_{j=1}^n\exp(\beta^T X_j)Y_j(V_i)}\right] = 0. \qquad (5.47)$$

The estimator for β, given as the solution to (5.47), is the estimator proposed
by Cox for maximizing the partial likelihood, where the notion of partial like-
lihood was first introduced by Cox (1975). The martingale arguments above
are essentially those used by Andersen and Gill (1982), where the theoreti-
cal properties of the proportional hazards model are derived in detail. The
argument above shows that Cox’s maximum partial likelihood estimator is a
globally efficient semiparametric estimator for β for the proportional hazards
model.
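
A minimal sketch of solving (5.47) numerically is given below. The left-hand side of (5.47) is the score of Cox's partial likelihood, so in practice one would rely on a packaged Cox regression routine; the direct computation simply makes the form of the estimating equation explicit. The simulated data and constants are illustrative assumptions.

```python
# Solve the estimating equation (5.47) directly for data (V_i, Delta_i, X_i).
import numpy as np
from scipy import optimize

def cox_score(beta, V, Delta, X):
    """Left-hand side of (5.47): the partial-likelihood score at beta."""
    total = np.zeros(X.shape[1])
    for i in np.flatnonzero(Delta):                    # only failures contribute
        at_risk = V >= V[i]                            # Y_j(V_i) = I(V_j >= V_i)
        w = np.exp(X[at_risk] @ beta)
        total += X[i] - (w @ X[at_risk]) / w.sum()     # X_i minus weighted risk-set average
    return total

# Illustrative data from a proportional hazards model with constant baseline hazard.
rng = np.random.default_rng(3)
n, beta_true = 500, np.array([0.7, -0.4])
X = rng.normal(size=(n, 2))
T = rng.exponential(1.0 / (0.1 * np.exp(X @ beta_true)))
C = rng.exponential(20.0, size=n)
V, Delta = np.minimum(T, C), (T <= C).astype(int)

beta_hat = optimize.root(cox_score, np.zeros(2), args=(V, Delta, X)).x
print(beta_hat)
```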

5.3 Estimating the Mean in a Nonparametric Model


Up to this point, we have identified influence functions and estimators by
considering elements in the Hilbert space H that are orthogonal to the nui-
sance tangent space Λ. In Theorems 3.4 and 4.3, we also defined the space of
influence functions of RAL estimators for β for parametric and semiparamet-
ric models, respectively, by considering the linear variety ϕ(Z) + T ⊥ , where
ϕ(Z) is any influence function of an RAL estimator for β and T is the tangent
space. For some problems, this representation of influence functions may be
more useful in identifying semiparametric estimators, including the efficient

estimator; i.e., when the parameter of interest is represented as β(θ), some


function of the parameter θ, where the infinite-dimensional parameter θ de-
scribes the entire parameter space, when the tangent space T is easier to
derive, and when some simple semiparametric but (possibly) inefficient RAL
estimator, with influence function ϕ(Z), exists. We will illustrate this with
two examples.
In the first example, we consider the problem of finding estimators for the
mean of a random variable Z (i.e., µ = E(Z)) with a sample of iid random
variables Z1 , . . . , Zn with common density p(z) with respect to some domi-
nating measure νZ , where p(z) could be any density (i.e., p(z) can be any
positive function of z such that ∫ p(z)dν(z) = 1 and µ = ∫ z p(z)dν(z) < ∞);
that is, the so-called nonparametric model for the distribution of Z.

Goal. We want to find the class of RAL semi(non)parametric estimators for


µ = E(Z).

Basically, the model above puts no restriction on the density of Z other


than finite moments. This is referred to as the nonparametric model. In The-
orem 4.4, we proved that the tangent space T was the entire Hilbert space
H. Consequently, T ⊥ , the orthogonal complement of T , is the single “zero”
element of the Hilbert space corresponding to the origin. For this model, the
space of influence functions, ϕ(Z) + T ⊥ , consists of, at most, one influence
function, and if an influence function exists for an RAL estimator for µ, then
this influence function must be the efficient influence function and the corre-
sponding estimator has to be semiparametric efficient.
An obvious and natural RAL estimator for µ = E(Z) is the sample mean $\bar Z = n^{-1}\sum_{i=1}^n Z_i$. This estimator can be trivially shown to be an RAL estimator
for µ with influence function (Z − µ0 ). Therefore, Z − µ0 is the unique and hence
efficient influence function among RAL estimators for µ for the nonparametric
model and Z̄ is the semiparametric efficient estimator.
We now consider the more complicated problem where the primary objec-
tive is to estimate the treatment effect in a randomized study with covariate
adjustment.

5.4 Estimating Treatment Difference in a Randomized Pretest-Posttest Study or with Covariate Adjustment
The randomized study is commonly used to compare the effects of two treat-
ments on response in clinical trials and other experimental settings. In this
design, patients are randomized to one of two treatments with probability δ
or 1 − δ, where the randomization probability δ is chosen by the investigator.
Together with the response variable, other baseline covariate information is

collected on all the individuals in the study. The goal is to estimate the differ-
ence in the mean response between the two treatments. One simple estimator
for this treatment difference is the difference in the treatment-specific sample
average response between the two treatments. A problem that has received
considerable interest is whether and how we can use the baseline covariates
that are collected prior to randomization to increase the efficiency of the es-
timator for treatment difference.
Along the same line, we also consider the randomized pretest-posttest
design. In this design, a random sample of subjects (patients) are chosen from
some population of interest and, for each patient, a pretest measurement, say
Y1 , is made, and then the patient is randomized to one of two treatments,
which we denote by the treatment indicator A = (1, 0) with probability δ
and (1 − δ), and after some prespecified time period, a posttest measurement
Y2 is made. The goal of such a study is to estimate the effect of treatment
intervention on the posttest measurement, or equivalently to estimate the
effect of treatment on the change score, which is the difference between the
pretest response and posttest response.
We will focus the discussion on the pretest-posttest design. However, we
will see later that the results derived for the pretest-posttest study can be
generalized to the problem of covariate adjustment in a randomized study.
As an example of a pretest-posttest study, suppose we wanted to compare
the effect of some treatment intervention on the quality of life for patients
with some disease. We may be interested in comparing the effect that a new
treatment has to a placebo or comparing a new treatment to the current best
standard treatment. Specifically, such a design is carried out by identifying a
group of patients with disease that are eligible for either of the two treatments.
These patients are then given a questionnaire to assess their baseline quality
of life. Typically, in these studies, patients answer several questions, each of
which is assigned a score (predetermined by the investigator), and the quality
of life score is a sum of these scores, denoted by Y1 . Afterward, they are
randomized to one of the two treatments A = (0, 1) and then followed for
some period of time and asked to complete the quality of life questionnaire
again, where their quality of life score Y2 is computed.
The goal in such studies is to estimate the treatment effect, defined as

β = E(Y2 |A = 1) − E(Y2 |A = 0),

or equivalently

β = E(Y2 − Y1 |A = 1) − E(Y2 − Y1 |A = 0),

where Y2 − Y1 is the difference from baseline in an individual’s response and


is sometimes referred to as the change score.
Note 3. This last equivalence follows because of randomization which guar-
antees that Y1 and A are independent; i.e., E(Y1 |A = 1) = E(Y1 |A = 0).



The data from such a randomized pretest-posttest study can be repre-


sented as
Zi = (Y1i , Ai , Y2i ), i = 1, . . . , n.
Some estimators for β that are commonly used include the difference in
the treatment-specific sample averages of posttest response, namely the two-
sample comparison

$$\hat\beta_n = \frac{\sum_{i=1}^n A_iY_{2i}}{\sum_{i=1}^n A_i} - \frac{\sum_{i=1}^n(1-A_i)Y_{2i}}{\sum_{i=1}^n(1-A_i)}, \qquad (5.48)$$

or possibly the difference in the treatment-specific sample averages of posttest


minus pretest response, namely the two-sample change-score comparison
 
$$\hat\beta_n = \frac{\sum_i A_i(Y_{2i}-Y_{1i})}{\sum_i A_i} - \frac{\sum_i(1-A_i)(Y_{2i}-Y_{1i})}{\sum_i(1-A_i)}. \qquad (5.49)$$
An analysis-of-covariance model has also been proposed for such designs,
where it is assumed that

Y2i = η0 + βAi + η1 Y1i + εi

and β, η0 , and η1 are estimated using least squares.


It is clear that the estimators β̂n given by (5.48) and (5.49) are semipara-
metric estimators for β in the sense that these are consistent and asymptot-
ically normal estimators for β without having to make any additional para-
metric assumptions. It also turns out that the least-squares estimator for β
in the analysis-of-covariance model is semiparametric, although this is not as
obvious.
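
All three estimators are easy to compute from data (Y1i , Ai , Y2i ); the following sketch does so on simulated data (all distributions and constants are illustrative assumptions).

```python
# Compute the two-sample estimator (5.48), the change-score estimator (5.49),
# and the ANCOVA least-squares coefficient of A from pretest-posttest data.
import numpy as np

rng = np.random.default_rng(4)
n, delta = 400, 0.5
Y1 = rng.normal(size=n)                          # pretest score
A = rng.binomial(1, delta, size=n)               # randomized treatment indicator
Y2 = 0.8 * Y1 + 1.0 * A + rng.normal(size=n)     # posttest score; true beta = 1

beta_two_sample = Y2[A == 1].mean() - Y2[A == 0].mean()            # (5.48)
D = Y2 - Y1
beta_change_score = D[A == 1].mean() - D[A == 0].mean()            # (5.49)

design = np.column_stack([np.ones(n), A, Y1])                      # ANCOVA: Y2 ~ 1 + A + Y1
beta_ancova = np.linalg.lstsq(design, Y2, rcond=None)[0][1]        # coefficient of A

print(beta_two_sample, beta_change_score, beta_ancova)
```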
A study of the relative efficiency of these estimators was made in Yang
and Tsiatis (2001). In this paper, the parameter for treatment difference β
in the pretest-posttest model was also cast using a simple restricted moment
model for which the efficient generalized estimating equation (GEE) could be
derived. This GEE estimator was compared with the other more commonly
used estimators for treatment difference in the pretest-posttest design.
Rather than considering ad hoc semiparametric estimators for β, we now
propose to look at this problem using the semiparametric theory that has
been developed in this book. Toward that end, we will show how to derive the
space of influence functions of RAL estimators for β. We will use the linear
variety ϕ(Z) + T ⊥ to represent the space of influence functions.
We begin by finding the influence function of some RAL estimator for β.
For simplicity, we consider the influence function of β̂n (two-sample difference)
from (5.48). This can be derived as
   
$$n^{1/2}(\hat\beta_n-\beta_0) = n^{1/2}\left\{\frac{\sum_i A_iY_{2i}}{\sum_i A_i}-\mu_2^{(1)}\right\} - n^{1/2}\left\{\frac{\sum_i(1-A_i)Y_{2i}}{\sum_i(1-A_i)}-\mu_2^{(0)}\right\},$$
where $\beta_0 = E(Y_2|A=1)-E(Y_2|A=0) = \mu_2^{(1)}-\mu_2^{(0)}$. After some simple algebra, we obtain
$$n^{-1/2}\sum_i\frac{A_i(Y_{2i}-\mu_2^{(1)})}{(\sum_i A_i/n)} - n^{-1/2}\sum_i\frac{(1-A_i)(Y_{2i}-\mu_2^{(0)})}{(\sum_i(1-A_i)/n)}$$
(the denominators converge in probability to $\delta$ and $1-\delta$, respectively)
$$= n^{-1/2}\sum_i\left\{\frac{A_i}{\delta}(Y_{2i}-\mu_2^{(1)}) - \frac{(1-A_i)}{(1-\delta)}(Y_{2i}-\mu_2^{(0)})\right\} + o_p(1). \qquad (5.50)$$

Consequently, the influence function for the i-th observation of β̂n equals

$$\varphi(Z_i) = \frac{A_i}{\delta}(Y_{2i}-\mu_2^{(1)}) - \frac{(1-A_i)}{(1-\delta)}(Y_{2i}-\mu_2^{(0)}). \qquad (5.51)$$

Now that we have identified one influence function for an RAL estimator
for β, we can identify the linear variety of the space of influence functions by
deriving the tangent space T and its orthogonal complement T ⊥ .

The Tangent Space and Its Orthogonal Complement

Let us now construct the tangent space T . The data from a single observa-
tion in a randomized pretest-posttest study are given by (Y1 , A, Y2 ). The only
restriction that is placed on the density of the data is that induced by the
randomization itself, specifically that the pretest measurement Y1 is indepen-
dent of the treatment indicator A and that the distribution of the Bernoulli
variable A is given by P (A = 1) = δ and P (A = 0) = 1 − δ, where δ is
the randomization probability, which is known by design. Other than these
restrictions, we will allow the density of (Y1 , A, Y2 ) to be arbitrary.
To derive the tangent space and its orthogonal complement, we will take
advantage of the results of partitioning the Hilbert space for a nonparametric
model given by Theorem 4.5 of Chapter 4.
First note that the density of the data for a single observation can be
factored as

pY1 ,A,Y2 (y1 , a, y2 ) = pY1 (y1 )pA|Y1 (a|y1 )pY2 |Y1 ,A (y2 |y1 , a). (5.52)

With no restrictions on the distribution of Y1 , A, Y2 (i.e., the nonparametric


model), we can use the results from Theorem 4.5 to show that the entire
Hilbert space can be written as

H = T1 ⊕ T2 ⊕ T3 , (5.53)

where

T1 = {α1 (Y1 ) : E{α1 (Y1 )} = 0},
T2 = {α2 (Y1 , A) : E{α2 (Y1 , A)|Y1 } = 0},
T3 = {α3 (Y1 , A, Y2 ) : E{α3 (Y1 , A, Y2 )|Y1 , A} = 0},

and T1 , T2 , T3 are mutually orthogonal linear spaces.


As mentioned above, in the semiparametric randomized pretest-posttest
design, the only restrictions placed on the class of densities for Y1 , A, Y2 are
those induced by randomization itself. Specifically, because of the indepen-
dence of Y1 and A and the known distribution of A, we obtain

pA|Y1 (a|y1 ) = PA (a) = δ a (1 − δ)(1−a) ;

that is, the conditional density is completely known to us and not a function
of unknown parameters. Otherwise, the density of Y1 , which is pY1 (y1 ), and
the conditional density of Y2 given Y1 , A; i.e., pY2 |Y1 ,A (y2 |y1 , a), are arbitrary.
Consequently, the tangent space for semiparametric models in the randomized
pretest-posttest design is given by

T = T1 ⊕ T3 .

Remark 2. The contribution to the tangent space that is associated with the
nuisance parameters for pA|Y1 (a|y1 ), which is T2 , is left out because, by the
model restriction, this conditional density is completely known to us by design.


The orthogonality of T1 , T2 , T3 together with (5.53) implies that the space
orthogonal to the tangent space is given by

T ⊥ = T2 .

Using (4.22) from Theorem 4.5, we can represent elements of T2 as h∗2 (Y1 , A)−
E{h∗2 (Y1 , A)|Y1 } for any arbitrary function h∗2 (·) of Y1 and A. Because A is
a binary indicator function, any function h∗2 (·) of Y1 and A can be expressed
as h∗2 (Y1 , A) = Ah1 (Y1 ) + h2 (Y1 ), where h1 (·) and h2 (·) are arbitrary functions of Y1 . Therefore

$$h_2^*(Y_1,A) - E\{h_2^*(Y_1,A)|Y_1\} = Ah_1(Y_1) + h_2(Y_1) - \{E(A|Y_1)h_1(Y_1) + h_2(Y_1)\} = (A-\delta)h_1(Y_1).$$

Consequently, we have shown that the orthogonal complement of the tan-


gent space is all elements {(A − δ)h∗ (Y1 )} for any arbitrary function h∗ (·) of
Y1 and therefore the space of all influence functions, {ϕ(Z) + T ⊥ }, is
 
$$\frac{A}{\delta}(Y_2-\mu_2^{(1)}) - \frac{(1-A)}{(1-\delta)}(Y_2-\mu_2^{(0)}) + (A-\delta)h^*(Y_1) \qquad (5.54)$$

for any arbitrary function h∗ (Y1 ).



An estimator for β with this influence function is given by


$$\frac{\sum_i A_iY_{2i}}{\sum_i A_i} - \frac{\sum_i(1-A_i)Y_{2i}}{\sum_i(1-A_i)} + n^{-1}\sum_i\left(A_i - n^{-1}\sum_j A_j\right)h^*(Y_{1i}).$$

The estimator above will be a consistent, asymptotically normal semipara-


metric estimator for β. Moreover, the class of estimators above, indexed by
the functions h∗ (Y1 ), are RAL estimators with influence functions that in-
clude the entire class of influence functions. Although all the estimators given
above are asymptotically normal, the asymptotic variance will vary according
to the choice of h∗ (·).
We showed in Theorem 4.3 that the efficient influence function is given by

ϕ(Z) − Π{ϕ (Z)|T ⊥ },

or for our problem


 
$$\frac{A}{\delta}(Y_2-\mu_2^{(1)}) - \Pi\left\{\frac{A}{\delta}(Y_2-\mu_2^{(1)})\,\Big|\,\mathcal T^{\perp}\right\} - \frac{1-A}{1-\delta}(Y_2-\mu_2^{(0)}) + \Pi\left\{\frac{1-A}{1-\delta}(Y_2-\mu_2^{(0)})\,\Big|\,\mathcal T^{\perp}\right\}. \qquad (5.55)$$

Since T ⊥ = T2 , we can use (4.23) of Theorem 4.5 to show that for any
function α(Y1 , A, Y2 ),

Π{α(·)|T ⊥ } = Π{α(·)|T2 } = E{α(·)|Y1 , A} − E{α(·)|Y1 }.

Therefore
     
$$\Pi\left\{\frac{A}{\delta}(Y_2-\mu_2^{(1)})\,\Big|\,\mathcal T^{\perp}\right\} = E\left\{\frac{A}{\delta}(Y_2-\mu_2^{(1)})\,\Big|\,Y_1,A\right\} - E\left\{\frac{A}{\delta}(Y_2-\mu_2^{(1)})\,\Big|\,Y_1\right\}, \qquad (5.56)$$
where
$$E\left\{\frac{A}{\delta}(Y_2-\mu_2^{(1)})\,\Big|\,Y_1,A\right\} = \frac{A}{\delta}\left\{E(Y_2|Y_1,A=1)-\mu_2^{(1)}\right\}, \qquad (5.57)$$
and, by the law of iterated conditional expectations, $E\{\tfrac{A}{\delta}(Y_2-\mu_2^{(1)})|Y_1\}$ can be computed by taking the conditional expectation of (5.57) given $Y_1$, yielding
$$E\left\{\frac{A}{\delta}(Y_2-\mu_2^{(1)})\,\Big|\,Y_1\right\} = E(Y_2|Y_1,A=1)-\mu_2^{(1)}. \qquad (5.58)$$
Therefore, by (5.56)–(5.58), we obtain
$$\Pi\left\{\frac{A}{\delta}(Y_2-\mu_2^{(1)})\,\Big|\,\mathcal T^{\perp}\right\} = \frac{A-\delta}{\delta}\left\{E(Y_2|Y_1,A=1)-\mu_2^{(1)}\right\}. \qquad (5.59)$$

Similarly, we obtain
 
$$\Pi\left\{\frac{1-A}{1-\delta}(Y_2-\mu_2^{(0)})\,\Big|\,\mathcal T^{\perp}\right\} = -\frac{A-\delta}{1-\delta}\left\{E(Y_2|Y_1,A=0)-\mu_2^{(0)}\right\}. \qquad (5.60)$$
Using (5.59) and (5.60) and after some algebra, we deduce that the efficient
influence function (5.55) equals
 
$$\left\{\frac{A}{\delta}Y_2 - \frac{(A-\delta)}{\delta}E(Y_2|A=1,Y_1)\right\} - \left\{\frac{(1-A)}{(1-\delta)}Y_2 + \frac{(A-\delta)}{(1-\delta)}E(Y_2|A=0,Y_1)\right\} - \underbrace{\left\{\mu_2^{(1)}-\mu_2^{(0)}\right\}}_{\beta}. \qquad (5.61)$$
In order to construct an efficient RAL estimator for β (that is, an RAL
estimator for β whose influence function is the efficient influence function
given by (5.61)), we would need to know E(Y2 |A = 1, Y1 ) and E(Y2 |A = 0, Y1 ),
which, of course, we don’t. One strategy is to posit models for E(Y2 |A = 1, Y1 )
and E(Y2 |A = 0, Y1 ) in terms of a finite number of parameters ξ1 and ξ0 ,
respectively. That is,
E(Y2 |A = j, Y1 ) = ζj (Y1 , ξj ), j = 0, 1.
These posited models are restricted moment models for the subset of individ-
uals in treatment groups A = 1 and A = 0, respectively. For such models, we
can use generalized estimating equations to obtain estimators ξ̂1n and ξ̂0n for ξ1 and ξ0 using patients {i : Ai = 1} and {i : Ai = 0}, respectively. Such estimators are consistent for ξ1 and ξ0 if the posited models were correctly specified, but, even if they were incorrectly specified, under suitable regularity conditions, ξ̂jn would converge to some ξj∗ for j = 0, 1. With this in mind, we use the functional form of the efficient influence function given by (5.61) to motivate the estimator β̂n for β given by
$$\hat\beta_n = n^{-1}\sum_{i=1}^n\left[\left\{\frac{A_i}{\delta}Y_{2i} - \frac{(A_i-\delta)}{\delta}\zeta_1(Y_{1i},\hat\xi_{1n})\right\} - \left\{\frac{(1-A_i)}{(1-\delta)}Y_{2i} + \frac{(A_i-\delta)}{(1-\delta)}\zeta_0(Y_{1i},\hat\xi_{0n})\right\}\right]. \qquad (5.62)$$
After some algebra (Exercise 3 at the end of this Chapter), we can show
that the influence function of β̂n is
$$\frac{A}{\delta}(Y_2-\mu_2^{(1)}) - \frac{(A-\delta)}{\delta}\{\zeta_1(Y_1,\xi_1^*)-\mu_2^{(1)}\} - \frac{(1-A)}{(1-\delta)}(Y_2-\mu_2^{(0)}) - \frac{(A-\delta)}{(1-\delta)}\{\zeta_0(Y_1,\xi_0^*)-\mu_2^{(0)}\}. \qquad (5.63)$$

This influence function is in the class of influence functions of RAL estimators


for β given by (5.54) regardless of whether we posited the correct model for
E(Y2 |Y1 , A = 1) and E(Y2 |Y1 , A = 0) or not. Consequently, the estimator β̂n
given by (5.62) is locally efficient in the sense that this estimator leads to a
consistent asymptotically normal semiparametric RAL estimator for β even
if the posited models for the treatment-specific conditional expectations of
Y2 given Y1 are incorrect, but this estimator is semiparametric efficient if the
posited model is correct.
Our experience working with such adaptive methods is that a reason-
able attempt at modeling the conditional expectations using the usual model-
building techniques that statisticians learn will lead to estimators with very
high efficiency even if the model is not exactly correct. For more details, see
Leon, Tsiatis, and Davidian (2003).
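
A minimal sketch of computing (5.62) is given below. It assumes, purely for illustration, linear working models ζj (Y1 , ξj ) = ξj0 + ξj1 Y1 fitted by least squares within each treatment arm; with δ known by design, no other inputs are needed.

```python
# Locally efficient adaptive estimator (5.62) with linear working models for
# E(Y2 | Y1, A = j), fitted separately in each arm (illustrative choice).
import numpy as np

def locally_efficient_estimator(Y1, A, Y2, delta):
    def fitted_arm_mean(a):
        # Working model zeta_a(Y1, xi_a) = xi_a0 + xi_a1 * Y1, fitted within arm A = a.
        idx = (A == a)
        design = np.column_stack([np.ones(idx.sum()), Y1[idx]])
        xi = np.linalg.lstsq(design, Y2[idx], rcond=None)[0]
        return xi[0] + xi[1] * Y1              # evaluated at every subject's Y1

    zeta1, zeta0 = fitted_arm_mean(1), fitted_arm_mean(0)
    term1 = A / delta * Y2 - (A - delta) / delta * zeta1
    term0 = (1 - A) / (1 - delta) * Y2 + (A - delta) / (1 - delta) * zeta0
    return np.mean(term1 - term0)              # estimator (5.62)

# Illustrative use on pretest-posttest data with known randomization probability delta:
rng = np.random.default_rng(5)
n, delta = 400, 0.5
Y1 = rng.normal(size=n)
A = rng.binomial(1, delta, size=n)
Y2 = 0.8 * Y1 + 1.0 * A + rng.normal(size=n)
print(locally_efficient_estimator(Y1, A, Y2, delta))
```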
Although the development above was specific to the pretest-posttest study,
this can be easily generalized to the more general problem of covariate ad-
justment in a two-arm randomized study. Letting Y2 be the response variable
of interest and A the treatment indicator, assigned at random to patients,
the primary goal of a two-arm randomized study is to estimate the parameter
β = E(Y2 |A = 1) − E(Y2 |A = 0). Let Y1 be a vector of baseline covariates
that are also collected on all individuals prior to randomization. One of the
elements of Y1 could be the same measurement that is used to evaluate re-
sponse, as was the case in the pretest-posttest study. In such a study, the data
are realizations of the iid random vectors (Y1i , Ai , Y2i ), i = 1, . . . , n. As before,
in a semiparametric model, the only restriction on the class of densities for
(Y1 , A, Y2 ) is that induced by randomization itself; namely, that A is inde-
pendent of Y1 and that P (A = 1) = δ. Consequently, the factorization of
the data (Y1 , A, Y2 ) is identical to that given by (5.52) and hence the efficient
influence function is given by (5.61), with the only difference being that Y1
now represents the vector of baseline covariates rather than the single pretest
measurement made at baseline in the pretest-posttest study. In fact, the exact
same strategy for obtaining locally efficient adaptive semiparametric estima-
tors for β as given by (5.62) for the pretest-posttest study can be used for this
more general problem.

5.5 Remarks about Auxiliary Variables


In the randomized pretest-posttest problem presented in the previous section,
the parameter of interest, β = E(Y2 |A = 1) − E(Y2 |A = 0), is obtained from
the joint distribution of (Y2 , A). Nonetheless, we were able to use the random
variable Y1 , either the pretest measurement in a pretest-posttest study or a
vector of baseline characteristics, to obtain a more efficient estimator of the
treatment difference β. One can view the variables making up Y1 as auxiliary
variables, as they are not needed to define the parameter of interest.

Therefore, a natural question, when we are considering estimating the


parameter β in a model p(z, β, η) for the random vector Z, is whether we
can gain efficiency by collecting additional auxiliary variables W and deriving
estimators for β using the data from both Z and W . We will now argue that
if we put no additional restrictions on the distribution of (W, Z) other than
those from the marginal model for Z, then the class of RAL estimators is the
same as if we never considered W .
To see this, let us first make a distinction between the Hilbert space HZ
(i.e., all q-dimensional mean-zero square-integrable functions of Z) and HW Z
(i.e., all q-dimensional mean-zero square-integrable functions of W and Z). If
we consider estimators for β based only on the data from Z, then we have
developed a theory that shows that the class of influence functions of RAL
estimators for β can be represented as

ϕ(Z) + T Z , (5.64)

where ϕ(Z) is the influence function of any RAL estimator for β and T Z is
the space orthogonal to the tangent space defined in HZ .
However, if, in addition, we consider auxiliary variables W , then the space
of influence functions of RAL estimators for β is given by

$$\varphi(Z) + \mathcal T^{WZ\perp}, \qquad (5.65)$$

where T^{WZ⊥} is the space orthogonal to the tangent space T^{WZ} defined in H^{WZ}.
We now present the key result that demonstrates that, without additional
assumptions on the conditional distribution of auxiliary covariates W given
Z, the class of influence functions of RAL estimators for β remains the same.
Theorem 5.6. If no restrictions are put on the conditional density pW |Z (w|z),
where the marginal density of Z is assumed to be from the semiparametric
model pZ (z, β, η), then the orthogonal complement of the tangent space T^{WZ} for the semiparametric model for the joint distribution of (W, Z) (i.e., T^{WZ⊥}) is equal to the orthogonal complement of the tangent space T^Z for the semiparametric model of the marginal distribution for Z alone (i.e., T^{Z⊥}).

Proof. With no additional restrictions placed on the joint distribution of


(W, Z), the model can be represented by the class of densities

pW,Z (w, z, β, η, η ∗ ) = pW |Z (w|z, η ∗ )pZ (z, β, η), (5.66)

where pW |Z (w|z, η ∗ ) can be any arbitrary conditional density of W given Z; i.e., any positive function η ∗ (w, z) such that ∫ η ∗ (w, z)dνW (w) = 1 for all z.
In Theorem 4.5, we showed that the part of the tangent space associated with
the infinite-dimensional parameter η ∗ was the set of random functions
 
$$\mathcal T_1 = \left\{a^{q}(W,Z) : E\{a^{q}(W,Z)|Z\} = 0^{q\times 1}\right\}$$

and, moreover, that the entire Hilbert space HW Z can be partitioned as

HW Z = T1 ⊕ HZ , T1 ⊥ HZ .

Because the density in (5.66) factors into terms involving the parameter η ∗
and terms involving the parameters (β, η), where η ∗ and (β, η) are variation-
ally independent, it is then straightforward to show that the tangent space
for the model on (W, Z) is given by

T W Z = T1 ⊕ T Z , T1 ⊥ T Z .

It is also straightforward to show that the space orthogonal to the tangent


space T^{WZ} is the space that is orthogonal both to T1 and to T^Z ; i.e., it must be the space within H^Z that is orthogonal to T^Z . In other words, the space T^{WZ⊥} is the same as the space T^{Z⊥}.
Because T^{WZ⊥} = T^{Z⊥}, this means that the space of influence functions
of RAL estimators for β given by (5.64), using estimators that are functions
of Z alone, is identical to the space of influence functions of RAL estimators
for β given by (5.65), using estimators that are functions of W and Z.
This formally validates the intuitive notion that if we are not willing to
make any additional assumptions regarding the relationship of the auxiliary
variables W and the variables of interest Z, then we need not consider auxil-
iary variables when estimating β.
The reason we gained efficiency in the pretest-posttest problem was be-
cause a relationship was induced between Y1 and the treatment indicator A
due to randomization; namely, that Y1 was independent of A.

5.6 Exercises for Chapter 5


1. Heteroscedastic models
Consider the semiparametric model for which, for a one-dimensional re-
sponse variable Y , we assume

Y = µ(X, β) + V 1/2 (X, β)ε, β ∈ Rq ,

where ε is an arbitrary continuous random variable such that ε is indepen-


dent of X. To avoid identifiability problems, assume that for any scalars α, α′,
$$\alpha + \mu(x,\beta) = \alpha' + \mu(x,\beta')\ \text{for all}\ x$$
implies
$$\alpha = \alpha'\ \text{and}\ \beta = \beta',$$
and for any scalars σ, σ′ > 0 that
$$\sigma\{V(x,\beta)\} = \sigma'\{V(x,\beta')\}\ \text{for all}\ x$$
implies
$$\sigma = \sigma'\ \text{and}\ \beta = \beta'.$$

For this model, describe how you would derive a locally efficient estimator
for β from a sample of data

(Yi , Xi ), i = 1, . . . , n.

2. In the pretest-posttest study, one estimator for the treatment difference β


that has been proposed is the analysis of covariance (ANCOVA) estimator.
That is, an analysis-of-covariance model is assumed, where

Y2i = η0 + βAi + η1 Y1i + εi (5.67)

for the iid data (Y1i , Ai , Y2i ), i = 1, . . . , n, and the parameters η0 , β, η1 are
estimated using least squares.
a) Show that the least-squares estimator β̂n in the model above is a
semiparametric estimator for β = E(Y2 |A = 1) − E(Y2 |A = 0); that
is, that n1/2 (β̂n − β) converges to a normal distribution with mean
zero whether the linear model (5.67) is correct or not.
b) Find the influence function for β̂n and show that it is in the class of
influence functions given in (5.54).
3. In (5.62) we considered locally efficient adaptive estimators for the treat-
ment difference β in a randomized pretest-posttest study. Suppose the
estimators ξ̂jn , j = 0, 1 were root-n consistent estimators; that is, that
there exist ξj∗ , j = 0, 1 such that

$$n^{1/2}(\hat\xi_{jn}-\xi_j^*)$$

are bounded in probability for j = 0, 1 and that the functions ζj (x1 , ξj ), j =


0, 1 as functions in ξj were differentiable in a neighborhood of ξj∗ , j = 0, 1
for all x1 . Then show (heuristically) that the influence function for the
estimator β̂n given in (5.62) is that given in (5.63).
6 Models and Methods for Missing Data

6.1 Introduction
In many practical situations, although we may set out in advance to collect
data according to some “nice” plan, things may not work out quite as in-
tended. Some examples of this follow.

Nonresponse in sample surveys


We send out questionnaires to a sample of (randomly chosen) individuals.
However, some may provide only a partial answer or no answer to some ques-
tions, or, perhaps, may not return the questionnaire at all.

Dropout or noncompliance in clinical trials


A study is conducted to compare two or more treatments. In a randomized
clinical trial, subjects are enrolled into the study and then randomly assigned
to one of the treatments. Suppose, in such a clinical trial, the subjects are
supposed to return to the clinic weekly to provide response measurement Yij
(for subject i, week j). However, some subjects “drop out” of the study, failing
to show up for any clinic visit after a certain point. Still others may miss clinic
visits sporadically or quit taking their assigned treatment.

Surrogate measurements
For some studies, the response of interest or some important covariate may
be very expensive to obtain. For example, suppose we are interested in the
daily average percentage fat intake of a subject. An accurate measurement
requires a detailed “food diary” over a long period, which is both expensive
and time consuming. A cheaper measurement (surrogate) is to have subjects
recall the food they ate in the past 24 hours. Clearly, this cheaper measurement
will be correlated with the expensive one but not perfectly. To reduce costs,
a study may be conducted where only a subsample of participants provide
the expensive measurement (validation sample), whereas everyone provides
data on the inexpensive measurement. The expensive measurement would be

missing for all individuals not in the validation sample. Unlike the previous
examples, here the missingness was by design rather than by happenstance.
In almost all studies involving human subjects, some important data may
be missing for some subjects for a variety of reasons, from oversight or mis-
takes by the study personnel to refusal or inability of the subjects to provide
information.
Objective: Usually, interest focuses on making an inference about some
aspect (parameter) of the distribution of the “full data” (i.e., the data that
would have been observed if no data were missing).
Problem: When some of the data are missing, it may be that, depending
on how and why they are missing, our ability to make an accurate inference
may be compromised.

Example 1. Consider the following (somewhat contrived) problem. A study


is conducted to assess the efficacy of a new drug in reducing blood pressure
for patients that have hypertension using a randomized design, where half of
the patients recruited with hypertension are randomized to receive the new
treatment and the other half are given a placebo. The endpoint of interest is
the decrease in blood pressure after six months.
Let µ1 denote the mean decrease in blood pressure (after six months) if
all patients in a population were given the new treatment and µ0 the popula-
tion mean decrease if given a placebo. The parameter of interest is the mean
treatment difference,
δ = µ1 − µ0 .
If all the patients randomized into this study are followed for six months
and have complete measurements, then an unbiased estimator for δ can be
computed easily using the difference in sample averages of blood pressure
decrease from the patients in the two treatment groups. Suppose, however,
that some of the data are missing. Consider the scenario where all the data for
patients randomized to the placebo were collected, but some of the patients
assigned treatment dropped out and hence their six-month blood pressure
reading is missing. For such a problem, what would we do?

This problem can be defined using the following notation. For individual
i in our sample i = 1, · · · , n, let

\[
A_i =
\begin{cases}
1 & \text{if assigned the new treatment}\\
0 & \text{if assigned placebo}
\end{cases}
\]
denotes treatment assignment,

and

Yi = reduction in blood pressure after six months.

If we had “full” data (i.e., if we observed {Ai , Yi }, i = 1, · · · , n), then




\[
\hat{\mu}_1 = \frac{\sum_{i=1}^{n} A_i Y_i}{\sum_{i=1}^{n} A_i}, \qquad
\hat{\mu}_0 = \frac{\sum_{i=1}^{n} (1 - A_i) Y_i}{\sum_{i=1}^{n} (1 - A_i)},
\]

would be unbiased estimators for µ1 and µ0 , respectively.


Let Ri denote the indicator of complete data for the i-th individual; i.e.,
\[
R_i =
\begin{cases}
1 & \text{if the six-month blood pressure measurement was taken}\\
0 & \text{if it was missing.}
\end{cases}
\]
In such a case, we would only observe (i.e., observed data) (Ai , Ri , Ri Yi ), i =
1, · · · , n; that is, we always observe Ai and Ri but only observe Yi if Ri = 1.
It is important to emphasize the distinction between full data, observed
data, and complete data, as we will be using this terminology throughout the
remainder of the book. Full data are the data that we would want to have
collected on all the individuals in the sample i = 1, . . . , n. Observed data are
the data that are actually observed on the individuals in the study, some of
which are missing. Complete data are the data from only the subset of patients
with no missing data.
In this hypothetical scenario, data may be missing only if Ai = 1. There-
fore, among patients randomized to the placebo, there are no missing data and
therefore we can consistently (unbiasedly) estimate µ0 using the treatment-
specific sample average
\[
\hat{\mu}_0 = \frac{\sum_{i=1}^{n} (1 - A_i) Y_i}{\sum_{i=1}^{n} (1 - A_i)}.
\]
Let us therefore focus our attention on estimating µ1 . Since we will only be
considering this one sample for the time being, we will use the notation
\[
(R_i, Y_{1i}), \quad i = 1, \cdots, n_1,
\]
to represent the full data for the n1 individuals randomized to treatment 1, where $n_1 = \sum_{i=1}^{n} A_i$, and focus on this subset of data. The observed data are
\[
(R_i, R_i Y_{1i}), \quad i = 1, \cdots, n_1.
\]
A natural estimator for µ1 , using the observed data, is the complete-case
sample average; namely,
\[
\hat{\mu}_{1c} = \frac{\sum_{i=1}^{n_1} R_i Y_{1i}}{\sum_{i=1}^{n_1} R_i}.
\]

We now examine whether this estimator is reasonable or not under various


circumstances.
Intuitively, if we believed that missing observations occurred by chance
alone, then we might expect that the complete cases will still be representative
of the population from which they were sampled and consequently would still
give us an unbiased estimator. If, however, there was a systematic bias, where,
say, patients with a worse prognosis (i.e., lower values of Y ) were more likely
to drop out, then we might expect the complete-case estimator to be overly
optimistic.
This can be formalized as follows. Let the full data be denoted by

(Ri , Y1i ), i = 1, . . . , n1 ,

assumed to be iid with E(Y1i ) = µ1 .


Note 1. In actuality, we cannot observe Y1i whenever Ri = 0; nonetheless, this
conceptualization is useful in understanding the consequences of missingness.


The joint density of (Ri , Y1i ) can be characterized by

pR,Y1 (ri , y1i ) = pR| Y1 (ri |y1i ) pY1 (y1i ).

Since Ri is binary, we only need

P (Ri = 1|Y1i = y1i ) = π(y1i )

to specify the conditional density because $p_{R|Y_1}(r_i \mid y_{1i}) = \{\pi(y_{1i})\}^{r_i}\{1 - \pi(y_{1i})\}^{1 - r_i}$.
Again, the observed data are denoted by

(Ri , Ri Y1i ), i = 1, . . . , n1 .

If the data are missing completely at random, denoted as MCAR (i.e., P(R = 1|Y1) = π(Y1) = π), then the probability of being observed (or missing) does
not depend on Y1i. That is, R and Y1 are independent; R ⊥⊥ Y1.
The complete-case estimator


\[
\frac{\sum_{i=1}^{n_1} R_i Y_{1i}}{\sum_{i=1}^{n_1} R_i}
= \frac{n_1^{-1}\sum_{i=1}^{n_1} R_i Y_{1i}}{n_1^{-1}\sum_{i=1}^{n_1} R_i}
\xrightarrow{P} \frac{E(R Y_1)}{E(R)},
\]

and by the independence of R and Y1 ,

\[
\frac{E(R Y_1)}{E(R)} = \frac{E(R)E(Y_1)}{E(R)} = E(Y_1) = \mu_1.
\]

Therefore, if the data are missing completely at random (MCAR), then the
complete-case estimator is unbiased, as our intuition would suggest. If, how-
ever, the probability of missingness depends on Y1 , which we will refer to as
nonmissing at random, (NMAR) (a formal definition will be given in later
chapters), then the complete-case estimator is written

\[
\frac{\sum_{i=1}^{n_1} R_i Y_{1i}}{\sum_{i=1}^{n_1} R_i}
\xrightarrow{P} \frac{E(R Y_1)}{E(R)}
= \frac{E\{E(R Y_1 \mid Y_1)\}}{E\{E(R \mid Y_1)\}}
= \frac{E\{Y_1 \pi(Y_1)\}}{E\{\pi(Y_1)\}} \neq E(Y_1) \text{ (necessarily)}. \tag{6.1}
\]
In fact, if π(y) is an increasing function in y (i.e., probability of not being
missing increases with y), then this suggests that individuals with larger values
of Y would be overrepresented in the observed data and hence
\[
\frac{E\{Y_1 \pi(Y_1)\}}{E\{\pi(Y_1)\}} > \mu_1.
\]
We leave the proof of this last result as an exercise for the reader.
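To make these limits concrete, the following small Python simulation (not part of the book; the parameter values and variable names are hypothetical) contrasts the complete-case sample average under MCAR with an NMAR mechanism in which π(y) is increasing in y. Under MCAR the complete-case average is close to µ1, whereas under NMAR it overestimates µ1, as argued above.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    mu1 = 10.0
    y = rng.normal(loc=mu1, scale=2.0, size=n)      # full-data responses, E(Y1) = 10

    # MCAR: response observed with constant probability 0.7
    r_mcar = rng.binomial(1, 0.7, size=n)

    # NMAR: probability of being observed increases with y (logistic in y)
    p_nmar = 1.0 / (1.0 + np.exp(-(y - mu1)))       # pi(y), increasing in y
    r_nmar = rng.binomial(1, p_nmar)

    print("true mu1            :", mu1)
    print("complete-case, MCAR :", y[r_mcar == 1].mean())   # approximately 10
    print("complete-case, NMAR :", y[r_nmar == 1].mean())   # noticeably larger than 10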
The difficulty with NMAR is that there is no way of estimating π(y) =
P (R = 1|Y1 = y) from the observed data because if R = 0 we don’t get
to observe Y1 . In fact, there is no way that we can distinguish whether the
missing data were MCAR or NMAR from the observed data. That is, there
is an inherent nonidentifiability problem here.
There is, however, a third possibility to consider. Suppose, in addition
to the response variable, we measured additional covariates on the i-th in-
dividual, denoted by Wi , which are not missing. For example, some baseline
characteristics may also be collected, including baseline blood pressure or pos-
sibly some additional variables collected between the initiation of treatment
and six months, when the follow-up response is supposed to get collected. Such
variables Wi , i = 1, . . . , n are sometimes referred to as auxiliary covariates, as
they represent variables that are not of primary interest for inference. The
observable data in such a case are denoted by

(Ri , Ri Y1i , Wi ), i = 1, · · · , n1 .

Although the auxiliary covariates are not of primary interest, suppose we


believe that the reason for missingness depends on Wi , and, moreover, condi-
tional on Wi , Y1i has no additional effect on the probability of missingness.
Specifically,
\[
P(R_i = 1 \mid Y_{1i}, W_i) = \pi(W_i). \tag{6.2}
\]
That is, conditional on Wi , Y1i is independent of Ri ;

\[
R_i \perp\!\!\!\perp Y_{1i} \mid W_i.
\]

Remark 1. It may be that Wi is related to both Y1i and Ri , in which case, even
though (6.2) is true, a dependence between Y1i and Ri would be induced. For
example, consider the hypothetical scenario where Wi denotes the blood pres-
sure for the i-th individual at an interim examination, say at three months,
measured on all n individuals in the sample. After observing this response,
individuals whose blood pressure was still elevated would be more likely to
drop out. Therefore, Ri would be correlated with Wi . Since individuals could
not possibly know what their blood pressure is at the end of the study (i.e.,
at six months), it may be reasonable to assume that Ri ⊥ ⊥ Y1i |Wi . It may
also be reasonable to assume that the three-month blood pressure reading is
correlated with the six-month blood pressure reading. Under these circum-
stances, dropping out of the study Ri would be correlated with the response
outcome Y1i but in this case, because they were both related to the interim
three-month blood pressure reading.
Therefore, without conditioning additionally on Wi , we would obtain that
\[
P(R_i = 1 \mid Y_{1i}) = \pi(Y_{1i})
\]
depends on the value Y1i . Consequently, if we didn’t collect the additional
data Wi , then we would be back in the impossible NMAR situation.  
The assumption (6.2) is an example of what is referred to as missing at
random, or MAR (not to be confused with MCAR). Basically, missing at
random means that the probability of missingness depends on variables that
are observed. A general and more precise definition of MAR will be given
later.
The MAR assumption alleviates the identifiability problems that were en-
countered with NMAR because the probability of missingness depends on
variables that are observed on all subjects. The available data could also be
used to model the relationship for the probability of missingness, or, equiva-
lently, the probability of a complete case, as a function of the covariates Wi .
For example, we can posit a model for
P (R = 1|W = w) = π(w, γ)
(say, logistic regression) in terms of a parameter vector γ and estimate the
parameter γ from the observed data (Ri , Wi ), which are measured on all indi-
viduals i = 1, . . . , n, using, say, maximum likelihood. That is, the maximum
likelihood estimator γ̂ would be obtained by maximizing
\[
\prod_{i=1}^{n} \{\pi(W_i, \gamma)\}^{R_i} \{1 - \pi(W_i, \gamma)\}^{1 - R_i}.
\]

This probability of a complete case as a function of the covariates, together


with the MAR assumption, will prove useful for computing inverse probability
weighted complete-case estimators, to be described later.
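As a concrete illustration of this maximization, the following short Python sketch (not from the book; the function name and data layout are hypothetical) computes γ̂ by Newton-Raphson under a logistic working model for π(w, γ) with an intercept, which is the "logistic regression" mentioned above.

    import numpy as np

    def fit_missingness_model(W, R, n_iter=25):
        """Logistic-regression MLE of gamma in P(R=1|W), obtained by maximizing
        prod_i pi(W_i,gamma)^{R_i} {1 - pi(W_i,gamma)}^{1-R_i} via Newton-Raphson.
        W: (n,) or (n,p) array of covariates; R: (n,) array of 0/1 indicators."""
        Wstar = np.column_stack([np.ones(len(R)), W])          # add an intercept
        gamma = np.zeros(Wstar.shape[1])
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-Wstar @ gamma))           # pi(W_i, gamma)
            score = Wstar.T @ (R - p)                          # gradient of log-likelihood
            hess = -(Wstar * (p * (1.0 - p))[:, None]).T @ Wstar
            gamma -= np.linalg.solve(hess, score)              # Newton step
        return gamma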
There are several methods for estimating parameters when data are miss-
ing at random. These methods include:

(a) likelihood methods;


(b) imputation methods;
(c) inverse probability weighting of complete cases.
Likelihood and imputation methods have been studied in great detail, and
some excellent references include the books by Rubin (1987), Little and Rubin
(1987), and Schafer (1997). The main theory of inverse probability weighted
methods is given in the seminal paper by Robins, Rotnitzky, and Zhao (1994)
and will be the primary focus of this book.
To give a flavor of how these methods work, let us return to the problem
of estimating µ1 = E(Y1 ) when auxiliary variables W are collected and the
response data are missing at random; i.e.,

\[
R \perp\!\!\!\perp Y_1 \mid W.
\]

6.2 Likelihood Methods


Consider a parametric model with density

pY1 ,W (y1 , w) = pY1 |W (y1 |w, γ1 ) pW (w, γ2 ), (6.3)

where γ1 , γ2 are unknown parameters describing the conditional distribution


of Y1 given W and the marginal distribution of W , respectively. The parameter
µ1 we are interested in can be written as

\[
\mu_1 = E(Y_1) = E\{E(Y_1 \mid W)\}
= \int\!\!\int y\, p_{Y_1|W}(y \mid w, \gamma_1)\, p_W(w, \gamma_2)\, d\nu_Y(y)\, d\nu_W(w).
\]

If we could obtain estimators for γ1 , γ2 , say γ̂1 , γ̂2 , respectively, then we


could estimate µ1 by
\[
\hat{\mu}_1 = \int\!\!\int y\, p_{Y_1|W}(y \mid w, \hat{\gamma}_1)\, p_W(w, \hat{\gamma}_2)\, d\nu_Y(y)\, d\nu_W(w). \tag{6.4}
\]

A popular way of obtaining estimators is by maximizing the likelihood. The


density of the observed data for one individual can be written as
\[
p_{RY_1, W, R}(r y_1, w, r) = \{p_{Y_1, W, R}(y_1, w, r = 1)\}^{I(r=1)} \{p_{W, R}(w, r = 0)\}^{I(r=0)}
\]
or
\[
\left\{ p_{Y_1|W,R}(y_1 \mid w, r = 1)\, p_{R|W}(r = 1 \mid w)\, p_W(w) \right\}^{I(r=1)}
\left\{ p_{R|W}(r = 0 \mid w)\, p_W(w) \right\}^{I(r=0)}.
\]

Because of MAR (i.e., R ⊥⊥ Y1|W),
\[
p_{Y_1|W,R}(y_1 \mid w, r = 1) = p_{Y_1|W}(y_1 \mid w).
\]

Therefore, the likelihood for one individual in our sample is given by


\[
\left\{ p_{Y_1|W}(y_1 \mid w, \gamma_1) \right\}^{I(r=1)} p_W(w, \gamma_2)
\times \{\pi(w, \gamma_3)\}^{I(r=1)} \{1 - \pi(w, \gamma_3)\}^{I(r=0)},
\]
where
\[
p_{R|W}(r = 1 \mid w) = \pi(w, \gamma_3).
\]
Consequently, the likelihood for n independent sets of data is given as
\[
\left[ \prod_{i=1}^{n} \left\{ p_{Y_1|W}(y_{1i} \mid w_i, \gamma_1) \right\}^{I(r_i=1)} \right]
\left[ \prod_{i=1}^{n} p_W(w_i, \gamma_2) \right] \times \{\text{function of } \gamma_3\}.
\]

Because of the way the likelihood factorizes, we find the MLE for γ1 , γ2 by
separately maximizing
\[
\prod_{\{i\,:\,R_i = 1\}} p_{Y_1|W}(y_{1i} \mid w_i, \gamma_1) \tag{6.5}
\]

and
\[
\prod_{i=1}^{n} p_W(w_i, \gamma_2). \tag{6.6}
\]

Note 2. We only include complete cases to find the MLE for γ1 , whereas we
use all the data to find the MLE for γ2 . 

The estimates for γ1 and γ2 , found by maximizing (6.5) and (6.6), can
then be substituted into (6.4) to obtain the MLE for µ1 .

Remark 2. Although likelihood methods are certainly feasible and the corre-
sponding estimators enjoy the optimality properties afforded to an MLE, they
can be difficult to compute in some cases. For example, the integral given in
(6.4) can be numerically challenging to compute, especially if W involves many
covariates.
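Under simple working models, however, the integral in (6.4) collapses to a closed form. The following Python sketch (a minimal illustration under assumed normal-linear working models with a single covariate W, not the general recipe of this section; names are hypothetical) fits E(Y1|W) = a + bW on the complete cases, fits the marginal mean of W on everyone, and combines them, since (6.4) then reduces to µ̂1 = â + b̂ W̄.

    import numpy as np

    def mle_mu1_normal_linear(Y, R, W):
        """Likelihood-based estimate of mu1 = E(Y1) under the working models
        Y1|W ~ N(a + b*W, s2), fit on complete cases as in (6.5), and
        W ~ N(mu_W, s2_W), fit on all subjects as in (6.6).
        Then (6.4) gives mu1 = a + b*mu_W."""
        obs = R == 1
        X = np.column_stack([np.ones(obs.sum()), W[obs]])
        a, b = np.linalg.lstsq(X, Y[obs], rcond=None)[0]   # MLE of (a, b) under normality
        return a + b * W.mean()                            # plug in the MLE of E(W)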

6.3 Imputation
Since some of the Y1i ’s are missing, a natural strategy is to impute or “es-
timate” a value for such missing data and then estimate the parameter of
interest behaving as if the imputed values were the true values.
For example, if there were no missing values of Y1i , i = 1, . . . , n1 , then we
would estimate µ1 by using


\[
\hat{\mu}_1 = n_1^{-1} \sum_{i=1}^{n_1} Y_{1i}. \tag{6.7}
\]
However, for values of i such that Ri = 0, we don’t observe such Y1i .
Suppose we posit some relationship for the distribution of Y1 given W ;
e.g., we may assume a parametric model pY1|W(y1|w, γ1) as we did in (6.3).
By the MAR assumption, we can estimate γ1 using the complete cases, say,
by maximizing (6.5) to derive the MLE γ̂1 . This would allow us to estimate

\[
E(Y_1 \mid W = w) \quad \text{by} \quad \int y\, p_{Y_1|W}(y \mid w, \hat{\gamma}_1)\, d\nu_Y(y),
\]
which we denote by µ(w, γ̂1).


We then propose to impute the missing data for any individual i such that
Ri = 0 by substituting the value µ(Wi , γ̂1 ) for the missing Yi in (6.7). The
resulting imputed estimator is

\[
n_1^{-1} \sum_{i=1}^{n_1} \left\{ R_i Y_{1i} + (1 - R_i)\, \mu(W_i, \hat{\gamma}_1) \right\}. \tag{6.8}
\]

A heuristic argument why this should yield a consistent estimator is as follows.


Assuming γ̂1 converges to the truth, then (6.8) should be well approximated
by

\[
n_1^{-1} \sum_{i=1}^{n_1} \left\{ R_i Y_{1i} + (1 - R_i)\, \mu(W_i, \gamma_{10}) \right\} + o_p(1), \tag{6.9}
\]
where op(1) is a term converging in probability to zero and, at the truth γ10, µ(Wi, γ10) = E(Y1i | Wi).
Therefore, by the weak law of large numbers (WLLN), (6.9) converges in
probability to
\[
E\{R Y_1 + (1 - R)E(Y_1 \mid W)\}.
\]
By a conditioning argument, this equals
\begin{align*}
E\left[ E\{R Y_1 + (1 - R)E(Y_1 \mid W) \mid R, W\} \right]
&= E\{R\, \underbrace{E(Y_1 \mid R, W)}_{=\,E(Y_1 \mid W) \text{ by MAR}} + (1 - R)E(Y_1 \mid W)\}\\
&= E\{R\, E(Y_1 \mid W) + (1 - R)E(Y_1 \mid W)\}\\
&= E\{E(Y_1 \mid W)\} = E(Y_1) = \mu_1.
\end{align*}


Remarks
1. We could have modeled the conditional expectation directly, say as
E(Y1 |W ) = µ(W, γ),
and estimated γ by using GEEs with complete cases.

2. Later, we will consider other imputation techniques, where we impute a


missing Y1i by using a random draw from the conditional distribution of
pY1 |Wi (y1 |Wi , γ̂1 ) or possibly using more than one draw (multiple impu-
tation).
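The following minimal Python sketch implements the imputation estimator (6.8) under a linear working model for E(Y1|W) fit by least squares on the complete cases (the names and the choice of working model are illustrative, not prescribed by the text).

    import numpy as np

    def imputation_estimator(Y, R, W):
        """Estimate mu1 = E(Y1) as in (6.8): impute mu(W_i, gamma1_hat) whenever
        R_i = 0, using a linear working model for E(Y1|W) fit on complete cases.
        Y: responses (entries with R=0 may be arbitrary), R: 0/1, W: covariates."""
        Wstar = np.column_stack([np.ones(len(R)), W])
        obs = R == 1
        gamma1, *_ = np.linalg.lstsq(Wstar[obs], Y[obs], rcond=None)  # complete-case fit
        mu_hat = Wstar @ gamma1                                       # mu(W_i, gamma1_hat)
        return np.mean(np.where(obs, Y, mu_hat))                      # (6.8)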

6.4 Inverse Probability Weighted Complete-Case Estimator

When data are MAR (i.e., Y1 ⊥⊥ R|W ), we have already argued that Y1 may
not be independent of R. Therefore, the naive complete-case estimator (6.1)

\[
\frac{\sum_{i=1}^{n_1} R_i Y_{1i}}{\sum_{i=1}^{n_1} R_i}
\]

may be biased. Horvitz and Thompson (1952), and later Robins, Rotnitzky,
and Zhao (1994), suggested using inverse weighting of complete cases as a
method of estimation. Let us denote the probability of observing a complete
case by

\[
P(R = 1 \mid W, Y_1) = P(R = 1 \mid W) = \pi(W),
\]
where the first equality follows by MAR.

The basic intuition is as follows. For any individual randomly chosen from a
population with covariate value W , the probability that such an individual
will have complete data is π(W). Therefore, any individual with covariate W with complete data can be thought of as representing 1/π(W) individuals
at random from the population, some of which may have missing data. This
suggests using
\[
\hat{\mu}_1 = n_1^{-1} \sum_{i=1}^{n_1} \frac{R_i Y_{1i}}{\pi(W_i)}
\]
as an estimator for µ1. By the WLLN, $\hat{\mu}_1 \xrightarrow{P} E\left\{\frac{R Y_1}{\pi(W)}\right\}$, which, by conditioning, equals
\[
E\left[ E\left\{ \frac{R Y_1}{\pi(W)} \,\Big|\, Y_1, W \right\} \right]
= E\left\{ \frac{Y_1}{\pi(W)}\, E(R \mid Y_1, W) \right\}
= E\left\{ \frac{Y_1}{\pi(W)}\, \pi(W) \right\} = E(Y_1) = \mu_1.
\]

In many practical situations, π(W ) = P (R = 1|W ) would not be known to


us. In such cases, we may posit a model P (R = 1|W = w) = π(w, γ3 ), and γ3
can be estimated by maximizing the likelihood

\[
\prod_{i=1}^{n_1} \{\pi(W_i, \gamma_3)\}^{R_i} \{1 - \pi(W_i, \gamma_3)\}^{1 - R_i}
\]

to obtain the estimator γ̂3 . (Often, logistic regression models are used.) The
resulting inverse probability weighted complete-case (IPWCC) estimator is
given by
\[
n_1^{-1} \sum_{i=1}^{n_1} \frac{R_i Y_{1i}}{\pi(W_i, \hat{\gamma}_3)}. \tag{6.10}
\]

Remark 3. There is a technical condition that π(w) be strictly greater than


zero for all values of w in the support of W in order that the IPWCC estima-
tor (6.10) be consistent and asymptotically normal. This will be discussed in
greater detail in later chapters. A cautionary note, however, is that even if this technical condition holds true, if π(Wi, γ̂3) is very small, then the i-th observation receives undue influence in the IPWCC estimator (6.10). This
could result in a very unstable estimator with poor performance with small
to moderate sample sizes.  
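A minimal sketch of the IPWCC estimator (6.10) follows; it takes the fitted probabilities π(Wi, γ̂3) as input (for example, from the earlier logistic-regression sketch) and raises an error when a fitted probability is essentially zero, in the spirit of the cautionary note above. Names are hypothetical.

    import numpy as np

    def ipwcc_estimator(Y, R, pi_hat, eps=1e-6):
        """Inverse probability weighted complete-case estimator (6.10) of mu1.
        Y: responses (entries with R=0 may be arbitrary), R: 0/1 indicators,
        pi_hat: fitted P(R=1|W_i); requires pi_hat bounded away from zero."""
        obs = R == 1
        if np.any(pi_hat[obs] < eps):
            raise ValueError("some fitted probabilities are near zero: unstable weights")
        terms = np.zeros(len(R), dtype=float)
        terms[obs] = Y[obs] / pi_hat[obs]          # R_i Y_1i / pi(W_i, gamma3_hat)
        return terms.mean()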

6.5 Double Robust Estimator


For both the likelihood estimator and the imputation estimator, we had to
specify a model for pY1 |W (y1 |w, γ1 ). If this model was incorrectly specified,
then both of these estimators would be biased. For the IPWCC estimator, we
had to specify a model for the probability of missingness P (R = 1|W = w) =
π(w, γ3 ). If this model was incorrectly specified, then the IPWCC estima-
tor would be biased. More recently, augmented inverse probability weighted
complete-case estimators have been suggested that are doubly robust. Scharf-
stein, Rotnitzky, and Robins (1999) first introduced the notion of double
robust estimators. Double robust estimators were also studied by Lipsitz,
Ibrahim, and Zhao (1999), Robins (1999), Robins, Rotnitzky, and van der
Laan (2000), Lunceford and Davidian (2004), Neugebauer and van der Laan
(2005), Robins and Rotnitzky (2001), and van der Laan and Robins (2003);
the last two references give the theoretical justification for these methods. An
excellent overview is also given by Bang and Robins (2005). To obtain such
an estimator, a model is specified for

E(Y1 |W ) = µ(W, γ1 ) (6.11)

and for

P (R = 1|W = w) = π(w, γ3 ), (6.12)


where the parameters γ1 and γ3 can be estimated as we have already discussed.
Denote these estimators by γ̂1 and γ̂3 .
If (6.11) is correctly specified, then γ̂1 will converge in probability to γ10
(the truth), in which case
\[
\mu(W, \hat{\gamma}_1) \xrightarrow{P} \mu(W, \gamma_{10}) = E(Y_1 \mid W),
\]
whereas if (6.11) is incorrectly specified, then $\hat{\gamma}_1 \xrightarrow{P} \gamma_1^*$ and
\[
\mu(W, \hat{\gamma}_1) \xrightarrow{P} \mu(W, \gamma_1^*) \neq E(Y_1 \mid W).
\]

Similarly, if (6.12) is correctly specified, then


\[
\pi(W, \hat{\gamma}_3) \xrightarrow{P} \pi(W, \gamma_{30}) = P(R = 1 \mid W),
\]
and if incorrectly specified,
\[
\pi(W, \hat{\gamma}_3) \xrightarrow{P} \pi(W, \gamma_3^*) \neq P(R = 1 \mid W).
\]

Consider the estimator


\[
\hat{\mu}_1 = n_1^{-1} \sum_{i=1}^{n_1} \left[ \frac{R_i Y_{1i}}{\pi(W_i, \hat{\gamma}_3)}
- \frac{\{R_i - \pi(W_i, \hat{\gamma}_3)\}}{\pi(W_i, \hat{\gamma}_3)}\, \mu(W_i, \hat{\gamma}_1) \right]. \tag{6.13}
\]

This estimator is referred to as an augmented inverse probability weighted


complete-case (AIPWCC) estimator. Notice that the first term in (6.13) is the
inverse probability weighted complete-case estimator derived in Section 6.4.
This term only involves contributions from complete cases. The second term
includes additional contributions from individuals with some missing data.
This is the so-called augmented term. We will now show that this AIPWCC
estimator is doubly robust in the sense that it is consistent if either the model
(6.11) for Y1 given W is correctly specified (i.e., µ(W, γ10 ) = E(Y1 |W )) or the
missingness model (6.12) is correctly specified (i.e., π(W, γ30 ) = P (R = 1|W )).
The estimator (6.13) can be reexpressed as
\[
n_1^{-1} \sum_{i=1}^{n_1} \left[ Y_{1i} + \frac{\{R_i - \pi(W_i, \hat{\gamma}_3)\}}{\pi(W_i, \hat{\gamma}_3)} \{Y_{1i} - \mu(W_i, \hat{\gamma}_1)\} \right].
\]

(a) Suppose the model (6.12) is correctly specified but the model (6.11) might
not be. Then the estimator is approximated by
\[
n_1^{-1} \sum_{i=1}^{n_1} \left[ Y_{1i} + \frac{\{R_i - P(R = 1 \mid W_i)\}}{P(R = 1 \mid W_i)} \{Y_{1i} - \mu(W_i, \gamma_1^*)\} \right] + o_p(1),
\]
which converges to
\[
E\left[ Y_1 + \frac{R - P(R = 1 \mid W)}{P(R = 1 \mid W)} \{Y_1 - \mu(W, \gamma_1^*)\} \right]
= E(Y_1) + E\left[ \frac{R - P(R = 1 \mid W)}{P(R = 1 \mid W)} \{Y_1 - \mu(W, \gamma_1^*)\} \right].
\]
Conditioning on (Y1, W) (i.e., taking E[E{· | Y1, W}]), the second term equals
\[
E\left[ \frac{E(R \mid Y_1, W) - P(R = 1 \mid W)}{P(R = 1 \mid W)} \{Y_1 - \mu(W, \gamma_1^*)\} \right]
= E\left[ \frac{P(R = 1 \mid W) - P(R = 1 \mid W)}{P(R = 1 \mid W)} \{Y_1 - \mu(W, \gamma_1^*)\} \right] = 0,
\]
since E(R | Y1, W) = P(R = 1 | W) under the correctly specified model (6.12) and MAR. Hence the limit equals E(Y1) = µ1.

(b) Suppose the model (6.11) is correctly specified but the model (6.12) might
not be. Then the estimator is
\[
n_1^{-1} \sum_{i=1}^{n_1} \left[ Y_{1i} + \frac{\{R_i - \pi(W_i, \gamma_3^*)\}}{\pi(W_i, \gamma_3^*)} \{Y_{1i} - E(Y_1 \mid W_i)\} \right] + o_p(1),
\]
which converges to
\[
E\left[ Y_1 + \frac{R - \pi(W, \gamma_3^*)}{\pi(W, \gamma_3^*)} \{Y_1 - E(Y_1 \mid W)\} \right]
= E(Y_1) + E\left[ \frac{R - \pi(W, \gamma_3^*)}{\pi(W, \gamma_3^*)} \{Y_1 - E(Y_1 \mid W)\} \right].
\]
Conditioning on (R, W) (i.e., taking E[E{· | R, W}]), the second term equals
\[
E\left[ \frac{R - \pi(W, \gamma_3^*)}{\pi(W, \gamma_3^*)} \{E(Y_1 \mid R, W) - E(Y_1 \mid W)\} \right]
= E\left[ \frac{R - \pi(W, \gamma_3^*)}{\pi(W, \gamma_3^*)} \{E(Y_1 \mid W) - E(Y_1 \mid W)\} \right] = 0,
\]
since E(Y1 | R, W) = E(Y1 | W) under MAR. Hence the limit equals E(Y1) = µ1.

Consequently, if either the model for E(Y1 |W ) or P (R = 1|W ) is correctly


specified, then the double robust estimator consistently estimates µ1 .
We will demonstrate later that if both models are correctly specified, then
the double robust estimator is more efficient than the IPWCC estimator.

Remark 4. Another advantage to the double robust estimator is that, based


on some of our experience, this estimator obviates some of the instability problems that can result when the estimated probabilities π(Wi, γ̂3) are very small, as noted in Remark 3 for the IPWCC estimator.
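The double robustness argument translates directly into the following Python sketch of the AIPWCC estimator (6.13); here pi_hat and mu_hat would come from fitted versions of the working models (6.12) and (6.11), for instance via the earlier sketches (names hypothetical).

    import numpy as np

    def aipwcc_estimator(Y, R, pi_hat, mu_hat):
        """Augmented IPWCC (doubly robust) estimator (6.13) of mu1.
        pi_hat: fitted P(R=1|W_i) from model (6.12); mu_hat: fitted E(Y1|W_i)
        from model (6.11). Consistent if either working model is correct."""
        obs = R == 1
        ipw = np.zeros(len(R), dtype=float)
        ipw[obs] = Y[obs] / pi_hat[obs]                    # IPWCC part
        aug = -((R - pi_hat) / pi_hat) * mu_hat            # augmentation term
        return np.mean(ipw + aug)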

6.6 Exercises for Chapter 6


1. In (6.1) we showed that the sample average of the response variable Y1
among complete cases, when the missingness mechanism is NMAR, will
converge in probability to

\[
\frac{E\{Y_1 \pi(Y_1)\}}{E\{\pi(Y_1)\}},
\]

where π(Y1 ) denotes the probability of being included in the sample as a


function of the response variable Y1 (i.e., P (R = 1|Y1 ) = π(Y1 )). If π(y)
is an increasing function in y (i.e., the probability of not being missing
increases with y), then prove that

\[
\frac{E\{Y_1 \pi(Y_1)\}}{E\{\pi(Y_1)\}} > \mu_1.
\]
7 Missing and Coarsening at Random for Semiparametric Models

7.1 Missing and Coarsened Data


In Chapter 6, we described three different missing-data mechanisms:

1. MCAR (missing completely at random): The probability of missingness is


independent of the data.
2. MAR (missing at random): The probability of missingness depends only
on the observed data.
3. NMAR (nonmissing at random): The probability of missingness may also
depend on the unobservable part of the data.

NMAR is clearly the most problematic. Since missingness may depend on


data that are unobserved, we run into nonidentifiability problems. Because
of these difficulties, we do not find methods that try to model the missing-
ness mechanism for NMAR models very attractive since the correctness of the
model cannot be verified using the observed data. Another approach, which
we believe is more useful, is to use NMAR models as part of a sensitivity anal-
ysis. There has been some excellent work along these lines; see, for example,
Scharfstein, Rotnitzky, and Robins (1999), Robins, Rotnitzky, and Scharfstein
(2000), and Rotnitzky et al. (2001).
In this book, we will not consider NMAR models; instead, we will focus
our attention on models for missing data that are either MAR or MCAR.
Restricting attention only to such models still allows us a great deal of flexi-
bility. Although the primary interest is to make inference on parameters that
involve the distribution of Z had Z been observed on the entire sample, the
MAR assumption allows us to consider cases where the probability of miss-
ingness may also depend on other auxiliary variables W that are collected on
the sample. If the reasons for missingness depend on the observed data, in-
cluding the auxiliary variables, then the MAR assumption may be tenable for
a wide variety of problems. We gave an example of this when we introduced
the notion of MAR in Chapter 6.

Remark 1. Although we have made a distinction between auxiliary variables


W and primary variables Z, for ease of notation, we will not introduce ad-
ditional notation unless absolutely necessary. That is, we can define the full
data Z to be all the data that are collected on individuals in our sample. This
may include auxiliary as well as primary variables. For example, the full data
Z may be partitioned as (Z1T , Z2T )T , where Z1 are the primary variables and
Z2 are the auxiliary variables. The model for the primary variables can be
denoted by a semiparametric model pZ1 (z1 , β, η1 ), where β is the parameter
of interest and η1 are nuisance parameters. Since the auxiliary variables Z2
are not of primary interest, we might not want to put any additional restric-
tions on the conditional distribution of the auxiliary variables Z2 given Z1 .
This situation can be easily accommodated by considering the semiparametric
model pZ (z, β, η), where η = (η1 , η2 ) and

pZ (z, β, η) = pZ1 (z1 , β, η1 )pZ2 |Z1 (z2 |z1 , η2 ),

where the nuisance function η2 allows for any arbitrary conditional density of
Z2 given Z1 . By so doing, auxiliary variables can be introduced as part of a
full-data semiparametric model.  
Again, we emphasize that the underlying objective is to make inference
on parameters that describe important aspects of the distribution of the data
Z had Z not been missing. That is, had Z been completely observed for the
entire sample, then the data would be realizations of the iid random quanti-
ties Z1 , . . . , Zn , each with density pZ (z, β, η), where β q×1 is assumed finite-
dimensional, and η denotes the nuisance parameter, which for semiparametric
models is infinite-dimensional. It is the parameter β in this model that is of
primary interest. The fact that some of the data are missing is a difficulty that
we have to deal with by thinking about and modeling the missingness pro-
cess. The model for missingness, although important for conducting correct
inference, is not of primary inferential interest.
Although we have only discussed missing data thus far, it is not any more
difficult to consider the more general notion of “coarsening” of data. The
concept of coarsened data was first introduced by Heitjan and Rubin (1991)
and studied more extensively by Heitjan (1993) and Gill, van der Laan, and
Robins (1996). When we think of missing data, we generally consider the case
where the data on a single individual can be represented by a random vector
with, say, l components, where a subset of these components may be missing
for some of the individuals in the sample. When we refer to coarsened data,
we consider the case where we observe a many-to-one function of Z for some
of the individuals in the sample. Just as we allow that different subsets of the
data Z may be missing for different individuals in the sample, we allow the
possibility that different many-to-one functions may be observed for differ-
ent individuals. Specifically, we will define a coarsening (missingness) variable
C such that, when “C = r,” we only get to see a many-to-one function of
the data, which we denote by Gr (Z), and different r correspond to different

many-to-one functions. Therefore, C will be a single discrete random variable


made up of positive integers r = 1, . . . , ℓ and ∞, where ℓ denotes the number
of different many-to-one functions considered. We reserve C = ∞ to denote
“no coarsening.”

Example 1. A simple illustration of coarsened data is given by the following


hypothetical example. An investigator is interested in studying the relation-
ship between the concentration of some biological marker in some fixed volume
of an individual’s blood serum and some outcome. However, the investigator is
also interested in determining the within-person variability in serum concen-
trations. Therefore, two blood samples of equal volume are drawn from each
individual in a study. Denote by X1 and X2 the serum concentrations for
these two samples and by Y the response variable. The full data for this sce-
nario are given as (Y, X1 , X2 ). To save on expense, the investigator measures
the concentrations separately on the two samples from only a subset of the
patients chosen at random. For the remaining patients, the two blood samples
are combined and one measurement is made to obtain the concentration from
the combined samples. Since the blood volumes are the same, the combined
concentration would be (X1 + X2 )/2. Hence, in this example, there are two
levels of coarsening. When C = ∞, we observe the full data (Y, X1 , X2 ) (i.e.,
G∞ (Y, X1 , X2 ) = (Y, X1 , X2 )) whereas when C = 1, we observe the coarsened
data {Y, (X1 + X2 )/2} (i.e., G1 (Y, X1 , X2 ) = {Y, (X1 + X2 )/2}). 
We now illustrate how missing data is just a special case of the more
general concept of coarsening.

Missing Data as a Special Case of Coarsening

Suppose the full data Z for a single individual is made up of an l-dimensional


vector of random variables, say
\[
Z = \left( Z^{(1)}, \ldots, Z^{(l)} \right)^T.
\]

Having missing data is equivalent to the case where a subset of the elements of
(Z (1) , . . . , Z (l) ) are observed and the remaining elements are missing. This can
be represented using the coarsening notation {C, GC (Z)}, where, the many-to-
one function Gr (Z) maps the vector Z to a subset of elements of this vector
whenever C = r.
For example, let Z = (Z (1) , Z (2) )T be a vector of two random variables.
Define
C        G_C(Z)
1        Z^(1)
2        Z^(2)
∞        (Z^(1), Z^(2))^T

That is, if C = 1, then we only observe Z (1) , and Z (2) is missing; if C = 2,


then we observe Z (2) , and Z (1) is missing; and if (C = ∞), then there are no
missing data and we observe Z = (Z (1) , Z (2) )T .
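To fix ideas, the observed data {C, GC(Z)} can also be represented programmatically; the following tiny Python sketch (purely illustrative, with hypothetical values) encodes the three coarsening functions in the table above and shows what is actually recorded for one individual.

    # Coarsening functions for Z = (Z1, Z2): C = 1 observes Z1 only,
    # C = 2 observes Z2 only, and C = "inf" corresponds to no coarsening.
    coarsening_maps = {
        1: lambda z: (z[0],),
        2: lambda z: (z[1],),
        "inf": lambda z: tuple(z),
    }

    z = (3.1, 0.7)                           # a full-data realization (hypothetical)
    c = 2                                    # its coarsening level
    observed = (c, coarsening_maps[c](z))    # what we get to see: {C, G_C(Z)}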

Remark 2. If we were dealing only with missing data, say, where different
subsets of an l-dimensional random vector may be missing, it may be more
convenient to define the missingness variable to be an l-dimensional vector
of 1’s and 0’s to denote which element of the vector is observed or missing.
If it is convenient to switch to such notation, we will use R to denote such
missingness indicators. This was the notation used to represent missing data,
for example, in Chapter 6.  

The theory developed in this book will apply to missing and coarsened data
problems where it is assumed that there is a positive probability of observing
the full data. That is,

P (C = ∞|Z = z) ≥ ε > 0 for z a.e.

Therefore, some problems that may be thought of as missing-data problems


but where no complete data are ever observed would not be part of the theory
we will consider. For example, measurement error problems that do not include
a validation set (i.e., when the true underlying covariate is never observed for
any individual in our sample but instead only some mismeasured version of
the covariate is available) cannot be covered by the theory developed in this
book.

Coarsened-Data Mechanisms

In problems where the data are coarsened or missing, it is assumed that we


get to observe the coarsening variable C and the corresponding coarsened data
GC (Z). Thus, the observed data are realizations of the iid random quantities

{Ci , GCi (Zi )}, i = 1, . . . , n.

To specify models for coarsened data, we must specify the probability distri-
bution of the coarsening process together with the probability model for the
full data (i.e., the data had there not been any coarsening). As with missing-
ness, we can define different coarsening mechanisms that can be categorized
as coarsening completely at random (CCAR), coarsening at random (CAR),
and noncoarsening at random (NCAR).
These are defined as follows:

1. Coarsening completely at random:

\[
P(C = r \mid Z) = \varpi(r) \quad \text{for all } r, Z; \quad \text{i.e., } C \perp\!\!\!\perp Z.
\]

2. Coarsening at random:

\[
P(C = r \mid Z) = \varpi\{r, G_r(Z)\}.
\]
The probability of coarsening depends on Z only as a function of the
observed data.
3. Noncoarsening at random:
Noncoarsening at random (NCAR) corresponds to models where coarsen-
ing at random fails to hold. That is, the probability of coarsening depends
on Z, possibly as a function of unobserved parts of Z; i.e., if there exists
z1 , z2 such that
Gr (z1 ) = Gr (z2 )
for some r and
P (C = r|z1 ) ≠ P (C = r|z2 ),
then the coarsening is NCAR.

As with nonmissing at random (NMAR), when coarsening is NCAR, we


run into nonidentifiability problems. Therefore, we focus our attention on
models where the coarsening mechanism is either CCAR or CAR.
When working with coarsened data, we distinguish among three types of
data, namely full data, observed data, and complete data, which we now define:

• Full data are the data Z1 , . . . , Zn that are iid with density pZ (z, β, η) and
that we would like to observe. That is, full data are the data that we
would observe had there been no coarsening. With such data we could
make inference on the parameter β using standard statistical techniques
developed for such parametric or semiparametric models.
• Because of coarsening, full data are not observed; instead, the observed
data are denoted by iid random quantities

[{C1 , GC1 (Z1 )}, . . . , {Cn , GCn (Zn )}],

where Ci denotes the coarsening variable and GCi (Zi ) denotes the corre-
sponding coarsened data for the i-th individual in the sample. It is observed
data that are available to us for making inference on the parameter β.
• Finally, when Ci = ∞, then the data for the i-th individual are not coars-
ened (i.e., when Ci = ∞, we observe the data Zi ). Therefore, we denote
by complete data the data only for individuals i in the sample such that
Ci = ∞ (i.e., complete data are {Zi : for all i such that Ci = ∞}). Com-
plete data are often used for statistical analysis in many software packages
when there are missing data.

7.2 The Density and Likelihood of Coarsened Data


In order to find observed (coarsened)-data estimators of the parameter of
interest β using likelihood methods, or, for that matter, in order to derive
the underlying semiparametric theory, we need to derive the likelihood of the
observed data in terms of the parameter β and other nuisance parameters.
To derive the density of the observed data, we first consider the unobservable
random vectors
{(Ci , Zi ), i = 1, . . . , n} assumed iid.
We emphasize that these data are unobservable because, when Ci = r, r = ∞,
we only get to observe the many-to-one transformation Gr (Zi ) and not Zi it-
self. Nonetheless, working with such data will make the assumptions necessary
to obtain the likelihood of the observed data transparent. It is sometimes con-
venient to denote the data {(Ci , Zi ), i = 1, . . . , n} as the full data, whereas pre-
viously we said that the full data will be defined as {Zi , i = 1, . . . , n}. There-
fore, in what follows, we will sometimes refer to full data by {Zi , i = 1, . . . , n}
and other times by {(Ci , Zi ), i = 1, . . . , n}, and the distinction should be clear
by the context.
Since the observed data {C, GC (Z)} are a known function of the full data
(C, Z), this means that the density of the observed data is induced by the
density of the full data. The density of the full data and the corresponding
likelihood, in terms of the parameters β, η, and ψ, are given by

pC,Z (r, z, ψ, β, η) = P (C = r|Z = z, ψ)pZ (z, β, η).

That is, the density and likelihood of the full data are deduced from the
density and likelihood of Z, pZ (z, β, η), and the density and likelihood for
the coarsening mechanism (i.e., the probability of coarsening given Z). We
emphasize that the density for the coarsening mechanism may also be from a
model that is described through the parameter ψ.
Remark 3. Since the coarsening variable C is discrete, the dominating mea-
sure for C is the counting measure that puts unit mass on each of the finite
values that C can take including C = ∞. The dominating measure for Z is,
as before, defined to be νZ (generally the Lebesgue measure for a continuous
random variable, the counting measure for a discrete random variable, or a
combination when Z is a random vector of continuous and discrete random
variables). Consequently, the dominating measure for the densities of (C, Z)
is just the product of the counting measure for C by νZ . 


Discrete Data

For simplicity, we will first consider the case when Z itself is a discrete random
vector. Consequently, the dominating measure is the counting measure over
the discrete combinations of C and Z, and integrals with respect to such a

dominating measure are just sums. Although this is overly simplistic, it will
be instructive in describing the probabilistic structure of the problem. We will
also indicate how this can be generalized to continuous Z as well.
Thus, with discrete data, the probability density of the observed data
{C, GC (Z)} is derived as

\[
P\{C = r, G_C(Z) = g_r\} = \sum_{\{z:\, G_r(z) = g_r\}} P(C = r, Z = z)
= \sum_{\{z:\, G_r(z) = g_r\}} P(C = r \mid Z = z)\, P(Z = z).
\]

Remark 4. Rather than developing one set of notation for discrete Z and an-
other set of notation (using integrals) for continuous Z, we will, from now on,
use integrals with respect to the appropriate dominating measure. Therefore,
when we have discrete Z, and νZ is the corresponding counting measure, we
will denote
\[
P(Z \in A) = \sum_{z \in A} P(Z = z)
\]
as
\[
\int_{z \in A} p_Z(z)\, d\nu_Z(z).
\]

With this convention in mind, we write the density and likelihood of the
observed data, when Z is discrete, as

\[
p_{C, G_C(Z)}(r, g_r, \psi, \beta, \eta)
= \int_{\{z:\, G_r(z) = g_r\}} p_{C,Z}(r, z, \psi, \beta, \eta)\, d\nu_Z(z)
= \int_{\{z:\, G_r(z) = g_r\}} P(C = r \mid Z = z, \psi)\, p_Z(z, \beta, \eta)\, d\nu_Z(z). \tag{7.1}
\]

Continuous Data

It will be instructive to indicate how the density of the observed data would
be derived if Z was a continuous random vector and the relationship of this
density to (7.1). For example, consider the case where Z = (Z1 , . . . , Zl )T
is continuous. Generally, Gr (Z) is a dimensional-reduction transformation,
unless r = ∞. This is certainly the case for missing-data mechanisms.
Let Gr(z) be lr-dimensional, lr < l for r ≠ ∞, and assume there exists a function Vr(z) that is (l − lr)-dimensional so that the mapping
\[
z \leftrightarrow \{G_r^T(z), V_r^T(z)\}^T
\]
is one-to-one for all r. Define the inverse transformation by
\[
z = H_r(g_r, v_r).
\]

Then, by the standard formula for change of variables, the density

pGr ,Vr (gr , vr ) = pZ {Hr (gr , vr )}J(gr , vr ), (7.2)

where J is the Jacobian (more precisely, the Jacobian determinant) of Hr


with respect to (gr , vr ). If we want to find the density of the observed data,
namely pC,GC (r, gr ), we can use

\[
p_{C, G_C}(r, g_r) = \int p_{C, G_C, V_C}(r, g_r, v_r)\, dv_r, \tag{7.3}
\]

where

pC,GC ,VC (r, gr , vr ) = P (C = r|Gr = gr , Vr = vr )pGr ,Vr (gr , vr ).

Note that

P (C = r|Gr = gr , Vr = vr ) = P {C = r|Z = Hr (gr , vr )}. (7.4)

Consequently, using (7.2) and (7.4), we can write (7.3), including the param-
eters ψ, β, and η, as

\[
p_{C, G_C}(r, g_r, \psi, \beta, \eta)
= \int P\{C = r \mid Z = H_r(g_r, v_r), \psi\}\, p_Z\{H_r(g_r, v_r), \beta, \eta\}\, J(g_r, v_r)\, dv_r. \tag{7.5}
\]

Therefore, the only difference between (7.1) for discrete Z and (7.5) for con-
tinuous Z is the Jacobian that appears in (7.5). Since the Jacobian does not
involve parameters in the model, it will not have an effect on the subsequent
likelihood.
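As a concrete illustration, consider the blood-sample setting of Example 1 of Section 7.1 with the response suppressed for brevity, so that Z = (X1, X2)^T and G1(Z) = (X1 + X2)/2. One valid choice of the complementary function is V1(Z) = (X1 − X2)/2, giving
\[
H_1(g_1, v_1) = (g_1 + v_1,\; g_1 - v_1)^T, \qquad
J(g_1, v_1) = \left| \det \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \right| = 2,
\]
so that (7.5) becomes
\[
p_{C, G_C}(1, g_1, \psi, \beta, \eta) = \int P\{C = 1 \mid Z = H_1(g_1, v_1), \psi\}\, p_Z\{H_1(g_1, v_1), \beta, \eta\} \cdot 2\, dv_1.
\]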

Likelihood when Data Are Coarsened at Random

The likelihood we derived in (7.1) was general, as it allowed for any coars-
ening mechanism, including NCAR. As we indicated earlier, we will restrict
attention to coarsening at random mechanisms (CAR), where P (C = r|Z =
z) = {r, Gr (z)} for all r, z. Coarsening completely at random (CCAR) is
just a special case of this. We remind the reader that another key assumption
being made throughout is that P (C = ∞|Z = z) ≥ ε > 0 for all r, z.
Now that we have shown how to derive the likelihood of the observed
data from the marginal density of the desired data Z and the coarsening
mechanism, P (C = r|Z = z), we can now derive the likelihood of the observed
data when coarsening is CAR. To do so, we need to consider a model for the
coarsening density in terms of parameters. For the time being, we will be very
general and denote such a model by
\[
P(C = r \mid Z = z) = \varpi\{r, G_r(z), \psi\},
\]

where ψ is an unknown parameter that is functionally independent of (β, η),


the parameters for the full-data model.
Remark 5. The coarsening or missingness of the data can be by design where
the probability ϖ{r, Gr(z)} is known to the investigator. For such problems,
additional parameters ψ are not necessary. In other cases, where coarsening
or missingness occur by happenstance, we may introduce parametric models
with a finite-dimensional parameter ψ or semiparametric models with infinite-
dimensional parameter ψ, where ψ needs to be estimated. The exercise of find-
ing reasonable and coherent models for ϖ{r, Gr(z), ψ}, even under the MAR
or CAR assumption, may not be straightforward and may require special con-
siderations. Examples of such models will be given throughout the remainder
of the book but, for the time being, it will be assumed that the model for
ϖ{r, Gr(z), ψ}, as a function of ψ, is known and has been correctly specified.


We now see that the observed data can be viewed as realizations of the iid
random quantities {Ci, GCi(Zi)}, i = 1, . . . , n, with density from a statistical
model described through the parameter of interest β and nuisance parameters
η and ψ. The CAR assumption will allow simplification of the likelihood, as
we now demonstrate.
When data are CAR, the likelihood of the observed data for a single obser-
vation given by (7.1) for discrete Z (now considered as functions of (ψ, β, η))
is

\begin{align*}
p_{C, G_C(Z)}(r, g_r, \psi, \beta, \eta)
&= \int_{\{z:\, G_r(z) = g_r\}} P(C = r \mid Z = z, \psi)\, p_Z(z, \beta, \eta)\, d\nu_Z(z)\\
&= \int_{\{z:\, G_r(z) = g_r\}} \varpi(r, g_r, \psi)\, p_Z(z, \beta, \eta)\, d\nu_Z(z)\\
&= \varpi(r, g_r, \psi) \int_{\{z:\, G_r(z) = g_r\}} p_Z(z, \beta, \eta)\, d\nu_Z(z). \tag{7.6}
\end{align*}

Notice that the parameter ψ for the coarsening process separates from the
parameters (β, η) that describe the full-data model. Also notice that if Z were
continuous and we used formula (7.5) to derive the density, then, under CAR,
we obtain

\begin{align*}
p_{C, G_C(Z)}(r, g_r, \psi, \beta, \eta)
&= \int \varpi(r, g_r, \psi)\, p_Z\{H_r(g_r, v_r), \beta, \eta\}\, \underbrace{J(g_r, v_r)}_{\substack{\text{The Jacobian does not involve}\\ \text{any of the parameters.}}}\, dv_r\\
&= \varpi(r, g_r, \psi) \int p_Z\{H_r(g_r, v_r), \beta, \eta\}\, J(g_r, v_r)\, dv_r. \tag{7.7}
\end{align*}

In both (7.6) and (7.7),

\[
p_{C, G_C(Z)}(r, g_r, \psi, \beta, \eta) = \varpi(r, g_r, \psi)\, p_{G_r(Z)}(g_r, \beta, \eta). \tag{7.8}
\]

Brief Remark on Likelihood Methods

Because the parameter ψ describing the coarsening mechanism separates from


the parameters (β, η) describing the distribution of Z in the observed data
likelihood, when the data are coarsened at random, likelihood methods are
often proposed for estimating the parameters β and η.
That is, suppose we posit a parametric model for full data (i.e., pZ (z, β, η))
and the aim is to estimate the parameter β using coarsened data

{Ci , GCi (Zi )}, i = 1, . . . , n.

The likelihood for a realization of such data, (ri , gri ), i = 1, . . . , n, as a function


of the parameters, when the coarsening mechanism is CAR, is (by (7.8)) equal
to
\[
\left\{ \prod_{i=1}^{n} \varpi(r_i, g_{r_i}, \psi) \right\}
\left\{ \prod_{i=1}^{n} p_{G_{r_i}(Z)}(g_{r_i}, \beta, \eta) \right\}. \tag{7.9}
\]

Therefore, the MLE for (β, η) only involves maximizing the function

\[
\prod_{i=1}^{n} p_{G_{r_i}(Z)}(g_{r_i}, \beta, \eta), \tag{7.10}
\]
where
\[
p_{G_r(Z)}(g_r, \beta, \eta) = \int_{\{z:\, G_r(z) = g_r\}} p_Z(z, \beta, \eta)\, d\nu_Z(z).
\]

Therefore, as long as we believe the CAR assumption, we can find the MLE
for β and η without having to specify a model for the coarsening process. If the
parameter space for (β, η) is finite-dimensional, this is especially attractive, as
the MLE for β, under suitable regularity conditions, is an efficient estimator.
Moreover, the coarsening probabilities, subject to the CAR assumption, play
no role in either the estimation of β (or η for that matter) or the efficiency

of such an estimator. This has a great deal of appeal, as it frees the analyst
from making modeling assumptions for the coarsening probabilities.
Maximizing functions such as (7.10) to obtain the MLE may involve inte-
grals and complicated expressions that may not be easy to implement. Nev-
ertheless, there has been a great deal of progress in developing optimization
techniques involving quadrature or Monte Carlo methods, as well as other
maximization routines such as the EM algorithm, which may be useful for
this purpose. Since likelihood methods for missing (coarsened) data have been
studied in great detail by others, there will be relatively little discussion of
these methods in this book. We refer the reader to Allison (2002), Little and
Rubin (1987), Schafer (1997), and Verbeke and Molenberghs (2000) for more
details on likelihood methods for missing data.

Examples of Coarsened-Data Likelihoods

Return to Example 1
Let us return to Example 1 of Section 7.1. Recall that in this example two
blood samples of equal volume were taken from each of n individuals in a
study that measured the blood concentration of some biological marker. Some
of the individuals, chosen at random, had concentration measurements made
on both samples. These are denoted as X1 and X2 . The remaining individuals
had their blood samples combined and only one concentration measurement
was made, equaling (X1 +X2 )/2. Although concentrations must be positive, let
us, for simplicity, assume that the distribution of these blood concentrations is
well approximated by a normal distribution. To assess the variability of these
concentrations between and within individuals, we assume that Xj = α + ej ,
where α is normally distributed with mean µα and variance σα2 , and ej , j = 1, 2
are independently normally distributed with mean zero and variance σe2 , also
independent of α. With this representation, σα2 represents the variation be-
tween individuals and σe2 the variation within an individual. From this model,
we deduce that Z = (X1 , X2 )T follows a bivariate normal distribution with
common mean µα and common variance σα2 + σe2 and covariance σα2 .
Since the individuals chosen to have their blood samples combined were
chosen at random, this is an example of coarsening completely at random
(CCAR). Thus P(C = 1|Z) = ϖ, where ϖ is the probability of being selected in the subsample, and P(C = ∞|Z) = 1 − ϖ.
The data for this example can be represented as realizations of the iid
random vectors {Ci , GCi (Zi )}, i = 1, . . . , n, where, if Ci = ∞, then we observe
G∞ (Zi ) = (Xi1 , Xi2 ), whereas if Ci = 1, then we observe G1 (Zi ) = (Xi1 +
Xi2 )/2. Under the assumptions of the model,
\[
\begin{pmatrix} X_{i1} \\ X_{i2} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_\alpha \\ \mu_\alpha \end{pmatrix}, \Sigma \right), \tag{7.11}
\]

where
\[
\Sigma = \begin{pmatrix} \sigma_\alpha^2 + \sigma_e^2 & \sigma_\alpha^2 \\ \sigma_\alpha^2 & \sigma_\alpha^2 + \sigma_e^2 \end{pmatrix}.
\]

It is also straightforward to show that
\[
(X_{i1} + X_{i2})/2 \sim N\left( \mu_\alpha,\; \sigma_\alpha^2 + \sigma_e^2/2 \right).
\]

Consequently, the MLE for (µα , σα2 , σe2 ) is obtained by maximizing the coarsened-
data likelihood (7.10), which, for this example, is given by

\[
\prod_{i=1}^{n} \left[ |\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2}\, (X_{i1} - \mu_\alpha,\, X_{i2} - \mu_\alpha)\, \Sigma^{-1}\, (X_{i1} - \mu_\alpha,\, X_{i2} - \mu_\alpha)^T \right\} \right]^{I(C_i = \infty)}
\]
\[
\times \left[ (\sigma_\alpha^2 + \sigma_e^2/2)^{-1/2} \exp\left\{ -\frac{\{(X_{i1} + X_{i2})/2 - \mu_\alpha\}^2}{2(\sigma_\alpha^2 + \sigma_e^2/2)} \right\} \right]^{I(C_i = 1)}. \tag{7.12}
\]
We leave the calculation of the MLE for this example as an exercise at the
end of the chapter.
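Although the MLE is left as an analytic exercise, the maximization of (7.12) is easy to check numerically. The following Python sketch (simulation settings and function names are hypothetical; it simply hands the coarsened-data log-likelihood to a general-purpose optimizer rather than solving the exercise analytically) parameterizes the variances on the log scale to keep them positive.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import multivariate_normal, norm

    def neg_loglik(theta, X_full, Xbar_coarse):
        """Negative coarsened-data log-likelihood corresponding to (7.12).
        theta = (mu_alpha, log sigma_a^2, log sigma_e^2);
        X_full: (m,2) array of (X1,X2) for C = infinity;
        Xbar_coarse: array of (X1+X2)/2 for C = 1."""
        mu, s2a, s2e = theta[0], np.exp(theta[1]), np.exp(theta[2])
        Sigma = np.array([[s2a + s2e, s2a], [s2a, s2a + s2e]])
        ll = multivariate_normal.logpdf(X_full, mean=[mu, mu], cov=Sigma).sum()
        ll += norm.logpdf(Xbar_coarse, loc=mu, scale=np.sqrt(s2a + s2e / 2.0)).sum()
        return -ll

    # hypothetical simulated data: X_ij = alpha_i + e_ij, coarsening by design (CCAR)
    rng = np.random.default_rng(1)
    n = 500
    alpha = rng.normal(2.0, 1.0, n)                        # sigma_a^2 = 1
    X = alpha[:, None] + rng.normal(0.0, 0.5, (n, 2))      # sigma_e^2 = 0.25
    coarse = rng.binomial(1, 0.4, n).astype(bool)          # C_i = 1 chosen at random

    fit = minimize(neg_loglik, x0=np.zeros(3),
                   args=(X[~coarse], X[coarse].mean(axis=1)))
    mu_hat, s2a_hat, s2e_hat = fit.x[0], np.exp(fit.x[1]), np.exp(fit.x[2])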
Although maximizing the likelihood is the preferred method for obtaining
estimators for the parameters in finite-dimensional parametric models of the
full data Z, it may not be a feasible approach for semiparametric models when
the parameter space of the full data is infinite-dimensional. We illustrate some
of the difficulties through an example where the parameter of interest is easily
estimated using likelihood techniques if the data are not coarsened but where
likelihood methods become difficult when the data are coarsened.

The logistic regression model


Let Y be a binary response variable {0, 1}, and let X be a vector of covariates.
A popular model for modeling the probability of response as a function of the
covariates X is the logistic regression model where

\[
P(Y = 1 \mid X) = \frac{\exp(\beta^T X^*)}{1 + \exp(\beta^T X^*)},
\]

where X ∗ = (1, X T )T , allowing us to consider an intercept term. We make


no assumptions on X. With full data, (Yi , Xi ), i = 1, . . . , n, the likelihood of
a single observation is
\[
p_{Y|X}(y \mid x)\, p_X(x) = \left[ \frac{\exp\{(\beta^T x^*) y\}}{1 + \exp(\beta^T x^*)} \right] p_X\{x, \eta(\cdot)\}, \tag{7.13}
\]

where the parameter η(·) is the infinite-dimensional nuisance function allowing


all nonparametric densities for the marginal distribution of X.

If we use maximum likelihood to estimate β with full data, then because


the parameters β and η(·) separate in (7.13), it suffices to maximize
\[
\prod_{i=1}^{n} \frac{\exp\{(\beta^T X_i^*) Y_i\}}{1 + \exp(\beta^T X_i^*)} \tag{7.14}
\]

without any regard to the nuisance parameter η. Equivalently, we can derive


the maximum likelihood estimator for β by solving the score equations
\[
\sum_{i=1}^{n} X_i^* \left\{ Y_i - \frac{\exp(\beta^T X_i^*)}{1 + \exp(\beta^T X_i^*)} \right\} = 0. \tag{7.15}
\]

This was also derived in (4.65). This indeed is the standard analytic strategy
for obtaining estimators for β in a logistic regression model implemented in
most statistical software packages.
If, however, we had coarsened data (CAR), then the likelihood contribution
for the part of the likelihood that involves β for a single observation is
\[
\int_{\{(y,x):\, G_r(y,x) = g_r\}} \left[ \frac{\exp\{(\beta^T x^*) y\}}{1 + \exp(\beta^T x^*)} \right] p_X\{x, \eta(\cdot)\}\, d\nu_{Y,X}(y, x). \tag{7.16}
\]

Whereas maximizing the likelihood in β in (7.14) for noncoarsened data in-


volved only the parameter β, finding the MLE for β with coarsened data now
involves maximizing a function in both β and the infinite-dimensional param-
eter η(·) in a likelihood that involves a product over i of terms like (7.16). Such
maximization may be much more difficult if not impossible. We will return to
this example later.
Consequently, it is important to consider alternatives to likelihood meth-
ods for estimating parameters with coarsened data. Before providing such
alternatives, it is useful to go back to first principles and study the geometry
of influence functions of estimators for the parameter β with coarsened data
when the coarsening is CAR.

7.3 The Geometry of Semiparametric Coarsened-Data Models

The key to deriving the class of influence functions of RAL estimators for the
parameter β and the corresponding geometry with coarsened data is to build
on the corresponding theory of influence functions of estimators for β and its
geometry had the data not been coarsened (i.e., with full data). In so doing,
we need to distinguish between the geometry of full-data Hilbert spaces and
that of observed-data Hilbert spaces.
We denote the full-data Hilbert space of all q-dimensional, mean-zero mea-
surable functions of Z with finite variance equipped with the covariance inner

product by HF . This is contrasted with the observed-data Hilbert space of all


q-dimensional, mean-zero, finite variance, measurable functions of {C, GC (Z)}
equipped with the covariance inner product, which we denote by H (without
the superscript F ). In some cases, we may also consider the Hilbert space
HCZ of all q-dimensional, mean-zero, finite variance, measurable functions of
(C, Z) equipped with the covariance inner product. We note that HF and H
are both linear subspaces of HCZ .
Because influence functions lie in a subspace of H orthogonal to the nui-
sance tangent space, the key in identifying influence functions is first to find
the nuisance tangent space and its orthogonal complement. We remind the
reader that
• The full-data nuisance tangent space is the mean-square closure of all full-
data parametric submodel nuisance tangent spaces. The full-data nuisance
tangent space is denoted by ΛF .
• For a full-data parametric submodel pZ (z, β, γ), where β is q-dimensional
and γ is r-dimensional, the nuisance score vector is

\[
S_\gamma^F(Z) = \frac{\partial \log p_Z(Z, \beta_0, \gamma_0)}{\partial \gamma},
\]
and the full-data parametric submodel nuisance tangent space is the space spanned by SγF; namely,
\[
\left\{ B^{q \times r} S_\gamma^F(Z) \text{ for all } q \times r \text{ matrices } B \right\}.
\]

• The class of full-data influence functions are the elements ϕF (Z) ∈ HF


such that
(i) ϕF (Z) ∈ ΛF ⊥ (i.e., orthogonal to the full-data nuisance tangent space)
T
(ii) E{ϕF (Z)SβF (Z)} = I q×q (identity matrix), where

∂ log pZ (Z, βo , ηo )
SβF = .
∂β
• The efficient full-data score is
F
Seff (Z) = SβF (Z) − Π{SβF (Z)|ΛF },

and the efficient full-data influence function is


  −1
FT
ϕF F
eff (Z) = E Seff (Z)Seff (Z)
F
Seff (Z).

When considering missing or coarsened data, two issues need to be addressed:

(i) What is the class of observed-data influence functions, and how are they
related to full-data influence functions?

(ii) How can we characterize the most efficient influence function and the
semiparametric efficiency bound for coarsened-data semiparametric mod-
els?
Both of these, as well as many other issues regarding semiparametric esti-
mators with coarsened data, will be studied carefully over the next several
chapters.
We remind the reader that observed-data influence functions of RAL es-
timators for β must be orthogonal to the observed-data nuisance tangent
space, which we denote by Λ (without the superscript F ). Therefore, we will
demonstrate how to derive the observed-data nuisance tangent space and its
orthogonal complement.
When the data are discrete and coarsening is CAR, the observed-data
likelihood for a single observation is given by (7.6); namely,

\[
p_{C, G_C(Z)}(r, g_r, \psi, \beta, \eta) = \varpi(r, g_r, \psi) \int_{\{G_r(z) = g_r\}} p_Z(z, \beta, \eta)\, d\nu_Z(z).
\]

A similar expression for continuous variables, involving Jacobians, was given


by (7.7). From here on, we will use the representation of likelihood for dis-
crete data, realizing that these results can be easily generalized to continuous
variables using Jacobians.
The log-likelihood for a single observation is given by

\[
\log \varpi(r, g_r, \psi) + \log \int_{\{G_r(z) = g_r\}} p_Z(z, \beta, \eta)\, d\nu_Z(z). \tag{7.17}
\]

The coarsened-data likelihood and log-likelihood are functions of the param-


eters β, η, and ψ, where β is the parameter of interest and hence η and ψ
are nuisance parameters. Since the parameters η and ψ separate out in the
likelihood for the observed data, we would expect that the nuisance tangent
space will be the direct sum of two orthogonal spaces, one involving the space
generated by the score vector with respect to ψ, which we denote by Λψ , and
the other space generated by the score vector with respect to η, which we
denote by Λη . That is,

\[
\Lambda = \Lambda_\psi \oplus \Lambda_\eta, \qquad \Lambda_\psi \perp \Lambda_\eta. \tag{7.18}
\]

We will give a formal proof of (7.18) later.


For the remainder of this chapter, we will only consider the space Λη
and its complement. When the coarsening of the data is by design, where the
coarsening probabilities ϖ{r, Gr(z)} are known to the investigator, then there
is no need to introduce an additional parameter ψ or the space Λψ . In that
case, the observed-data nuisance tangent space Λ is the same as Λη . Examples
where the data are missing by design will be given in Section 7.4. We restrict
ourselves to this situation for the time being. In the next chapter, we will

discuss what to do when the coarsening probabilities are not known to us by


design and models for ϖ{r, Gr(z), ψ}, as a function of the parameter ψ, have
to be developed.

The Nuisance Tangent Space Associated with the Full-Data Nuisance Parameter and Its Orthogonal Complement

The nuisance tangent space

The space Λη is defined as the mean-square closure of parametric submodel


tangent spaces associated with the nuisance parameter η. Therefore, we begin
by first considering the parametric submodel for the full-data Z given by
pZ (z, β q×1 , γ r×1 ) and compute the corresponding observed-data score vector.
Lemma 7.1. The parametric submodel observed-data score vector with re-
spect to γ is given by
\[
S_\gamma(r, g_r) = E\left\{ S_\gamma^F(Z) \mid G_r(Z) = g_r \right\}. \tag{7.19}
\]
Proof. The log-likelihood for the observed data (at least the part that involves
γ), given by (7.17), is
\[
\log \left\{ \int_{\{G_r(z) = g_r\}} p_Z(z, \beta, \gamma)\, d\nu_Z(z) \right\}.
\]

The score vector with respect to γ is


\[
S_\gamma(r, g_r) = \frac{\partial}{\partial \gamma} \left[ \log \left\{ \int_{\{G_r(z) = g_r\}} p_Z(z, \beta, \gamma)\, d\nu_Z(z) \right\} \right]
\Bigg|_{\beta = \beta_0,\; \gamma = \gamma_0 \,(\text{same as } \eta = \eta_0)}
= \frac{\displaystyle \int_{\{G_r(z) = g_r\}} \frac{\partial}{\partial \gamma}\, p_Z(z, \beta_0, \gamma_0)\, d\nu_Z(z)}
{\displaystyle \int_{\{G_r(z) = g_r\}} p_Z(z, \beta_0, \gamma_0)\, d\nu_Z(z)}. \tag{7.20}
\]

Dividing and multiplying by pZ (z, β0 , γ0 ) in the integral of the numerator of


(7.20) yields

\[
\frac{\displaystyle \int_{\{G_r(z) = g_r\}} S_\gamma^F(z, \beta_0, \gamma_0)\, p_Z(z, \beta_0, \gamma_0)\, d\nu_Z(z)}
{\displaystyle \int_{\{G_r(z) = g_r\}} p_Z(z, \beta_0, \gamma_0)\, d\nu_Z(z)}.
\]
Hence,
\[
S_\gamma(r, g_r) = E\left\{ S_\gamma^F(Z) \mid G_r(Z) = g_r \right\}. \qquad \square
\]


Remark 6. Equation (7.19) is a conditional expectation that only involves the


conditional probability distribution of Z given Gr (Z) for a fixed value of r. It
will be important for the subsequent theoretical development that we compare
and contrast (7.19) with the conditional expectation
\[
E\left\{ S_\gamma^F(Z) \mid C = r, G_C(Z) = g_r \right\}. \tag{7.21}
\]

In general, (7.21) will not equal (7.19); however, as we will now show, these
are equal under the assumption of CAR.  

Lemma 7.2. When the coarsening mechanism is CAR, then


\[
S_\gamma(r, g_r) = E\left\{ S_\gamma^F(Z) \mid G_r(Z) = g_r \right\} = E\left\{ S_\gamma^F(Z) \mid C = r, G_C(Z) = g_r \right\}. \tag{7.22}
\]

Proof. Equation (7.22) will follow if we can prove that

\[
p_{Z \mid C, G_C(Z)}(z \mid r, g_r) = p_{Z \mid G_r(Z)}(z \mid g_r), \tag{7.23}
\]

which is true because when Gr (z) = gr

\begin{align*}
p_{Z \mid C, G_C(Z)}(z \mid r, g_r)
&= \frac{p_{C,Z}(r, z)}{\displaystyle \int_{\{G_r(u) = g_r\}} p_{C,Z}(r, u)\, d\nu_Z(u)}\\
&= \frac{p_{C|Z}(r \mid z)\, p_Z(z)}{\displaystyle \int_{\{G_r(u) = g_r\}} p_{C|Z}(r \mid u)\, p_Z(u)\, d\nu_Z(u)}\\
&= \frac{\varpi(r, g_r)\, p_Z(z)}{\displaystyle \varpi(r, g_r) \int_{\{G_r(u) = g_r\}} p_Z(u)\, d\nu_Z(u)} \tag{7.24}\\
&= \frac{p_Z(z)}{\displaystyle \int_{\{G_r(u) = g_r\}} p_Z(u)\, d\nu_Z(u)} = p_{Z \mid G_r(Z)}(z \mid g_r). \tag{7.25} \qquad \square
\end{align*}

Remark 7. In order to prove (7.23), it was necessary that ϖ(r, gr) cancel in


the numerator and denominator of (7.24), which is only true because of CAR.



Consequently, when the coarsening mechanism is CAR, the corresponding


observed-data nuisance score vector for the parametric submodel pZ (z, β, γ)
is

Sγ{C, GC(Z)} = E{SγF(Z)|C, GC(Z)}.   (7.26)
We are now in a position to define the observed-data nuisance tangent
space associated with the full-data nuisance parameter η.
Theorem 7.1. The space Λη (i.e., the mean square closure of parametric
submodel nuisance tangent spaces spanned by Sγ {C, GC (Z)}) is the space of
elements

Λη = [ E{αF(Z)|C, GC(Z)} for all αF ∈ ΛF ],   (7.27)
where ΛF denotes the full-data nuisance tangent space. We will also denote
this space by the shorthand notation
Λη = E{ΛF|C, GC(Z)}.
Proof. Using (7.26), we note that the linear subspace, within H, spanned by
the parametric submodel score vector Sγ {C, GC (Z)} is
[ Bq×r E{SγF(Z)|C, GC(Z)} for all Bq×r ]
= [ E{Bq×r SγF(Z)|C, GC(Z)} for all Bq×r ].

The linear subspace Λη consists of elements Bq×r E{SγF(Z)|C, GC(Z)} for some parametric submodel, or limits (as n → ∞) of elements Bnq×rn E{SγnF(Z)|C, GC(Z)}
for a sequence of parametric submodels and conformable matrices. This is the
same as the space consisting of elements
E{Bq×r SγF(Z)|C, GC(Z)}

or limits of elements

E{Bnq×rn SγnF(Z)|C, GC(Z)}.

But the space of elements Bq×r SγF(Z) or limits of elements Bnq×rn SγnF(Z)
for parametric submodels is precisely the definition of the full-data nuisance
tangent space ΛF . Consequently, the space Λη can be characterized as the
space of elements
Λη = [ E{αF(Z)|C, GC(Z)} for all αF ∈ ΛF ].  □
In the special case where data are missing or coarsened by design (i.e.,
when there are no additional parameters ψ necessary to define a model for
the coarsening probabilities), then the observed-data nuisance tangent space
is Λ = Λη . We know that an influence function of an observed-data RAL
estimator for β must be orthogonal to Λ. Toward that end, we now characterize
the space orthogonal to Λη (i.e., Λη⊥).

The orthogonal complement of the nuisance tangent space

Lemma 7.3. The space Λη⊥ consists of all elements hq×1{C, GC(Z)} ∈ H such that

E[h{C, GC(Z)}|Z] ∈ ΛF⊥,
where ΛF ⊥ is the space orthogonal to the full-data nuisance tangent space.

Proof. The space Λη⊥ consists of all elements hq×1{C, GC(Z)} ∈ H that are
orthogonal to Λη . By Theorem 7.1, this corresponds to the set of elements
h(·) ∈ H such that
 
E[ hT{C, GC(Z)} E{αF(Z)|C, GC(Z)} ] = 0 for all αF(Z) ∈ ΛF.   (7.28)

Using the law of iterated conditional expectations repeatedly, we obtain the


following relationship for (7.28):
 
0 = E( E[hT{C, GC(Z)} αF(Z) | C, GC(Z)] )
= E[hT{C, GC(Z)} αF(Z)]
= E( E[hT{C, GC(Z)} αF(Z) | Z] )
= E( E[hT{C, GC(Z)} | Z] αF(Z) )  for all αF(Z) ∈ ΛF.   (7.29)

Thus (7.29) implies that h{C, GC(Z)} ∈ H belongs to Λη⊥ if and only if E[h{C, GC(Z)}|Z] is orthogonal to every element αF(Z) ∈ ΛF; i.e., that

E[h{C, GC(Z)}|Z] ∈ ΛF⊥.  □


To explore this relationship further, it will be convenient to introduce the


notion of a mapping, and more specifically a linear mapping, from one Hilbert
space to another Hilbert space.

Definition 1. A mapping, also sometimes referred to as an operator, K, is a


function that maps each element of some linear space to an element of another
linear space. In all of our applications, the linear spaces will be well-defined
Hilbert spaces. So, for example, if H(1) and H(2) denote two Hilbert spaces,
then the mapping K : H(1) → H(2) means that for any h ∈ H(1) , K(h) ∈ H(2) .
A linear mapping also has the property that K(ah1 + bh2 ) = aK(h1 ) + bK(h2 )
for any two elements h1 , h2 ∈ H(1) and any scalar constants a and b. A many-
to-one mapping means that more than one element h ∈ H(1) will map to the
same element in H(2) . For more details regarding linear operators, we refer
the reader to Chapter 6 of Luenberger (1969).  

Define the many-to-one mapping



K : H → HF

to be
K(h) = E[h{C, GC (Z)}|Z] (7.30)
for h ∈ H. Because of the linear properties of conditional expectations, the
mapping K, given by (7.30), is a linear mapping or linear operator.
By Lemma 7.3, the space Λη⊥ can be defined as

Λη⊥ = K−1(ΛF⊥),

where K−1 is the inverse operator.


Definition 2. Inverse operator
For any element hF ∈ HF , K−1 (hF ) corresponds to the set of all elements
(assuming at least one exists) h ∈ H such that K(h) = hF . Similarly, the
space K−1 (ΛF ⊥ ) corresponds to all elements of h ∈ H such that K(h) ∈ ΛF ⊥ .


Since K is a linear operator and ΛF ⊥ is a linear subspace of HF , it is easy
to show that K−1 (ΛF ⊥ ) is a linear subspace of H.
Let us consider the construction of the space Λη⊥ = K−1(ΛF⊥) element by element. Consider a single element ϕ∗F(Z) ∈ ΛF⊥. In the following theorem, we show how the inverse K−1(ϕ∗F) is computed.

Remark 8. Notation convention


When we refer to elements of the space ΛF ⊥ , we will use the notation ϕ∗F (Z).
This is in contrast to the notation ϕF (Z) (without the ∗), which we use to
denote a full-data influence function. The space perpendicular to the full-
data nuisance tangent space ΛF ⊥ is the space in which the class of full-data
influence functions belongs. In order to be a full-data influence function, an element of ΛF⊥ must also satisfy the property that E{ϕF(Z)SeffFT(Z)} = Iq×q.
We remind the reader that, for any ϕ∗F (Z) ∈ ΛF ⊥ , we can construct an
influence function that equals ϕ∗F (Z), up to a multiplicative constant, by
taking
ϕF(Z) = [ E{ϕ∗F(Z)SeffFT(Z)} ]−1 ϕ∗F(Z).  □


Lemma 7.4. For any ϕ∗F (Z) ∈ ΛF ⊥ , let K−1 {ϕ∗F (Z)} denote the space of
elements h̃{C, GC (Z)} ∈ H such that

K[h̃{C, GC (Z)}] = E[h̃{C, GC (Z)}|Z] = ϕ∗F (Z).

If we could identify any element h{C, GC (Z)} such that

K(h) = ϕ∗F (Z),

then

K−1 {ϕ∗F (Z)} = h{C, GC (Z)} + Λ2 ,


where Λ2 is the linear subspace in H consisting of elements L2 {C, GC (Z)} such
that
E[L2 {C, GC (Z)}|Z] = 0;
that is, Λ2 = K−1 (0).

Proof. The proof is straightforward. If h̃{C, GC(Z)} is an element of the space h{C, GC(Z)} + Λ2, then h̃{C, GC(Z)} = h{C, GC(Z)} + L2{C, GC(Z)} for some L2{C, GC(Z)} ∈ Λ2, in which case

E[h̃{C, GC (Z)}|Z] = E[h{C, GC (Z)}|Z] + E[L2 {C, GC (Z)}|Z]


= ϕ∗F (Z) + 0 = ϕ∗F (Z).

Conversely, if E[h̃{C, GC (Z)}|Z] = ϕ∗F (Z), then

h̃{C, GC (Z)} = h{C, GC (Z)} + [h̃{C, GC (Z)} − h{C, GC (Z)}],

where clearly [h̃{C, GC (Z)} − h{C, GC (Z)}] ∈ Λ2 . 




Therefore, in order to construct Λη⊥ = K−1(ΛF⊥), we must, for each ϕ∗F(Z) ∈ ΛF⊥,

(i) identify one element h{C, GC (Z)} such that

E[h{C, GC (Z)}|Z] = ϕ∗F (Z),

and
(ii) find Λ2 = K−1 (0).
We now derive the space Λη⊥ in the following theorem.

Theorem 7.2. Under the assumption that

E{I(C = ∞)|Z} = (∞, Z) > 0 for all Z (a.e.), (7.31)

the space Λη⊥ consists of all elements that can be written as

I(C = ∞)ϕ∗F(Z)/(∞, Z) + [I(C = ∞)/(∞, Z)] Σ_{r≠∞} {r, Gr(Z)} L2r{Gr(Z)} − Σ_{r≠∞} I(C = r) L2r{Gr(Z)},   (7.32)

where, for r ≠ ∞, L2r{Gr(Z)} is an arbitrary q × 1 measurable function of Gr(Z) and ϕ∗F(Z) is an arbitrary element of ΛF⊥.

Proof. In accordance with the proof of Lemma 7.4, we begin by:

(i) Identifying h such that E[h{C, GC (Z)}|Z] = ϕ∗F (Z)

A single solution to the equation


E[h{C, GC (Z)}|Z] = ϕ∗F (Z)
is motivated by the idea of an inverse probability weighted complete-case es-
timator, which was first introduced in Section 6.4. Recall that C = ∞ denotes
the case when the data Z are completely observed and (∞, Z) = P (C =
∞|Z). Now consider h{C, GC (Z)} to be
I(C = ∞)ϕ∗F(Z)/(∞, Z).
This is clearly a function of the observed data. Moreover,
 
E[ I(C = ∞)ϕ∗F(Z)/(∞, Z) | Z ] = {ϕ∗F(Z)/(∞, Z)} E{I(C = ∞)|Z} = ϕ∗F(Z),
where, in order for the equation above to hold, we must make sure we are not
dividing 0 by 0; hence the need for assumption (7.31).
Consequently, Λη⊥ = K−1(ΛF⊥) can be written as the direct sum of two linear subspaces; namely,

Λη⊥ = I(C = ∞)ΛF⊥/(∞, Z) ⊕ Λ2,   (7.33)

which is the linear subspace of H with elements

Λη⊥ = [ I(C = ∞)ϕ∗F(Z)/(∞, Z) + L2{C, GC(Z)} : ϕ∗F(Z) ∈ ΛF⊥, L2{C, GC(Z)} ∈ Λ2; i.e., E[L2{C, GC(Z)}|Z] = 0 ].   (7.34)

To complete the proof, we need to derive the linear space Λ2 .

(ii) The space Λ2 = K−1 (0)

Because we are assuming that the coarsening variable C is discrete, we can express any function h{C, GC(Z)} ∈ H as

I(C = ∞)h∞(Z) + Σ_{r≠∞} I(C = r)hr{Gr(Z)},   (7.35)

where h∞ (Z) denotes an arbitrary q × 1 function of Z and hr {Gr (Z)} denotes


an arbitrary q × 1 function of Gr (Z). The space of functions L2 {C, GC (Z)} ∈
Λ2 ⊂ H must satisfy

E[L2 {C, GC (Z)}|Z] = 0;


that is,
E[ I(C = ∞)L2∞(Z) + Σ_{r≠∞} I(C = r)L2r{Gr(Z)} | Z ] = 0,

or

(∞, Z)L2∞(Z) + Σ_{r≠∞} {r, Gr(Z)}L2r{Gr(Z)} = 0.   (7.36)

Consequently, to obtain an arbitrary element L2{C, GC(Z)} ∈ Λ2, we can define any set of q-dimensional measurable functions L2r{Gr(Z)}, r ≠ ∞, and, for any such set of functions, (7.36) will be satisfied by taking

L2∞(Z) = −{(∞, Z)}−1 Σ_{r≠∞} {r, Gr(Z)}L2r{Gr(Z)},

where, again, assumption (7.31) is needed to guarantee that we are not dividing by zero. Hence, for any L2r{Gr(Z)}, r ≠ ∞, we can define a typical element of Λ2 as

[I(C = ∞)/(∞, Z)] Σ_{r≠∞} {r, Gr(Z)}L2r{Gr(Z)} − Σ_{r≠∞} I(C = r)L2r{Gr(Z)}.   (7.37)
Combining the results from (7.34) and (7.37), we are now able to explicitly
define the linear space Λ⊥
η to be that consisting of all elements given by (7.32).



Identifying the space Λ⊥ will often guide us in deriving semiparametric estimators. When data are coarsened by design, Λ⊥ = Λη⊥.
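To make the construction in (7.32) and (7.37) concrete, the following sketch evaluates a typical element of the augmentation space Λ2 for a single observation, once the coarsening levels, the known coarsening probabilities, and the chosen functions L2r have been specified. It is a minimal illustration only; the names coarsen_prob, L2r, and levels are hypothetical placeholders, not part of any particular model.

import numpy as np

def lambda2_element(c, g_c, z, coarsen_prob, L2r, levels):
    """Evaluate a typical element of the augmentation space Lambda_2, eq. (7.37).

    c            : observed coarsening level (np.inf denotes a complete case)
    g_c          : the observed coarsened data G_c(Z) (used when c is finite)
    z            : the full data Z (available only for a complete case, which is
                   exactly when the first term of (7.37) is needed)
    coarsen_prob : function (r, g_r) -> coarsening probability {r, G_r(Z)}
    L2r          : dict mapping each finite level r to a chosen q x 1 function of G_r(Z)
    levels       : list of pairs (r, G_r), with G_r a function computing G_r(z)
    """
    if np.isinf(c):
        # complete case: inverse-probability-weighted sum over the finite levels
        weighted_sum = sum(coarsen_prob(r, G_r(z)) * L2r[r](G_r(z)) for r, G_r in levels)
        return weighted_sum / coarsen_prob(np.inf, z)
    # coarsened case: minus the chosen function evaluated at the observed data
    return -L2r[c](g_c)

# By construction this quantity has conditional mean zero given Z, so adding it
# to an inverse probability weighted complete-case term keeps the estimating
# function unbiased while letting coarsened observations contribute.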

Remark 9. The representation of Λη⊥ given by (7.33) as a direct sum of two linear spaces will give us insight on how to construct estimating equations whose solution will yield semiparametric RAL estimators for β with coarsened data.
In Chapters 4 and 5, we showed how to use elements of ΛF ⊥ (i.e., the space
orthogonal to the full-data nuisance tangent space) to construct estimating
equations whose solution resulted in full-data RAL estimators for β. Since
the first space in the direct sum (7.33), namely I(C = ∞)ΛF⊥/(∞, Z), consists of the
inverse probability weighted complete-case elements of ΛF ⊥ , this suggests that
observed-data estimators for β can be constructed by using inverse probability
weighted complete-case (IPWCC) full-data estimating equations. This would
lead to what are called IPWCC estimators. We gave a simple example of this
in Section 6.4.

The second space, Λ2 in (7.33) will be referred to as the augmentation


space. Estimators for β that include elements of Λ2 as part of the estimator
will be referred to as augmented inverse probability weighted complete-case
(AIPWCC) estimators. In Section 6.5, we introduced such an estimator and
showed how the augmentation term can help us gain efficiency and, in some
cases, leads to estimators with the property of double robustness. 


Therefore, we will formally define the two linear subspaces as follows.

Definition 3. The linear subspace contained in H consisting of elements

[ I(C = ∞)ϕ∗F(Z)/(∞, Z) for all ϕ∗F(Z) ∈ ΛF⊥ ],

also denoted as I(C = ∞)ΛF⊥/(∞, Z), will be defined to be the IPWCC space.  □


Definition 4. The linear space Λ2 ⊂ H will be defined to be the augmentation


space. 


Before continuing to more complicated situations, it will be instructive to


see how the geometry we have developed so far will aid us in constructing
estimators in several examples when data are missing at random by design.

7.4 Example: Restricted Moment Model with Missing Data by Design
Consider the semiparametric restricted moment model that assumes that

E(Y |X) = µ(X, β),

where Y is the response variable and X is a vector of covariates. Here, Z =


(Y, X) denotes full data. We studied the semiparametric properties of this
model in great detail in Sections 4.5 and 4.6, where we also showed, in (4.48),
that a typical element of ΛF ⊥ is given as

A(X){Y − µ(X, β0 )}.

This motivates the generalized estimating equation (GEE), or m-estimator,


which is the solution to

Σ_{i=1}^{n} A(Xi){Yi − µ(Xi, β)} = 0   (7.38)

using a sample of data (Yi , Xi ), i = 1, . . . , n.
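As a computational aside, the full-data GEE (7.38) is simply a system of q equations in β that can be handed to any root finder. The sketch below is illustrative only; the helper name solve_gee and the use of scipy are convenient choices, not prescribed by the text.

import numpy as np
from scipy.optimize import root

def solve_gee(Y, X, mu, A, beta_init):
    """Solve sum_i A(X_i){Y_i - mu(X_i, beta)} = 0, eq. (7.38).

    A(x) should return a q x d matrix (q = dim(beta), d = dimension of the
    response), and mu(x, beta) the d-dimensional conditional mean.
    """
    def estimating_equation(beta):
        return sum(A(x) @ np.atleast_1d(y - mu(x, beta)) for y, x in zip(Y, X))
    return root(estimating_equation, beta_init).x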


Suppose, by design, we coarsen the data at random. For example, let the
vector of covariates X for a single observation be partitioned into two sets of
variables, X = (X(1)T, X(2)T)T, where X(1) are variables that are relatively
inexpensive to collect, whereas X (2) are expensive to collect. For example,
X (2) may be genetic markers that are expensive to process, whereas X (1)
may be descriptive variables such as age, race, sex, etc. In such a case, we
might decide to collect the response variable Y and the inexpensive covariates
X (1) on everyone in the sample but collect the expensive covariates X (2) only
on a subset of individuals. Moreover, we let the probability of collecting X (2)
depend on the values of Y and X (1) . This might be the case if, say, we want to
overrepresent some values of Y and X (1) in the subset where all the data are
collected. This is an example of missing data by design. That is, the full data are denoted by Zi = (Yi, Xi(1), Xi(2)), i = 1, . . . , n. Yi and Xi(1) are observed on everyone, whereas Xi(2) may be missing for some individuals. To implement such a design, we would collect the data (Yi, Xi(1)) for all patients i = 1, . . . , n,
as well as blood samples that could be used to obtain the expensive genetic
markers. For patient i we then choose, at random, the complete-case binary indicator Ri taking the value 1 or 0 with probability π(Yi, Xi(1)) and 1 − π(Yi, Xi(1)), respectively, where the function 0 < π(y, x(1)) ≤ 1 is a known function of the response Y = y and the covariates X(1) = x(1) chosen by the investigator. If Ri = 1, then we process the blood sample and obtain the genetic markers Xi(2); otherwise, we let those data be missing.
Since there are only two levels of coarsening in this problem, it is convenient
to work with the binary indicator R to denote whether the observation was
complete or whether some of the data (X (2) in this case) were missing. The
relationship to the notation we have been using is as follows: R = (0, 1), where
R is not scripted, is equivalent to the coarsening variable C = (1, ∞), G1 (Z) =
(Y, X (1) ), G∞ (Z) = Z = (Y, X (1) , X (2) ), P (C = 1|Z) = {1, G1 (Z)} =
1 − π(Y, X (1) ), and P (C = ∞|Z) = {∞, G∞ (Z)} = π(Y, X (1) ).
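To fix ideas, the mechanism just described is easy to simulate: generate the full data, then draw the complete-case indicator R with the investigator-chosen probability π(Y, X(1)). The sketch below is purely illustrative; the data-generating model and the particular form of π are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
n = 1000

# hypothetical full data: Y, inexpensive covariate X1, expensive covariate X2
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)                        # e.g., an expensive genetic marker
Y = 1.0 + 0.5 * X1 - 0.3 * X2 + rng.normal(size=n)

def pi_design(y, x1):
    # investigator-chosen complete-case probability, known by design and
    # bounded away from zero (an illustrative logistic-type form)
    return 0.1 + 0.8 / (1.0 + np.exp(-(y - x1)))

R = rng.binomial(1, pi_design(Y, X1))          # complete-case indicator
X2_obs = np.where(R == 1, X2, np.nan)          # X2 is observed only when R = 1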

Note 1. On notation for missingness probabilities


In keeping with much of the notation in the literature, we denote the probability of a complete case, P(R = 1|Y, X(1)), by π(Y, X(1)). This should not
be confused with coarsening probabilities, which are denoted as P (C = r|Z) =
{r, Gr (Z)}.  

Since the missingness probabilities are known by design for this example,
the nuisance tangent space for the observed data is Λη , and the space or-
thogonal to the nuisance tangent space, Λη⊥, derived in (7.32) of Theorem 7.2, is

[ Rϕ∗F(Z)/π(Y, X(1)) + L2{C, GC(Z)} : ϕ∗F(Z) ∈ ΛF⊥, L2{C, GC(Z)} ∈ Λ2 ].   (7.39)

Because L21 {G1 (Z)} is an arbitrary q × 1 function of (Y, X (1) ), which we


denote by L(Y, X (1) ), we can use formula (7.37) to show, after some algebra,
that any element L2 {C, GC (Z)} ∈ Λ2 can be expressed as
{(R − π(Y, X(1)))/π(Y, X(1))} L(Y, X(1)).   (7.40)

Since a typical element ϕ∗F (Z) ∈ ΛF ⊥ for the restricted moment model is

A(X){Y − µ(X, β0 )},

for arbitrary A(X), then by (7.39) and (7.40), a typical element of Λη⊥ is

R[A(X){Y − µ(X, β0)}]/π(Y, X(1)) + {(R − π(Y, X(1)))/π(Y, X(1))} L(Y, X(1))

for arbitrary A(X) and L(Y, X (1) ).


We have shown that identifying elements orthogonal to the nuisance tan-
gent space and using these as estimating functions (i.e., functions of the data
and the parameter) may guide us in constructing estimating equations whose
solution would yield a consistent, asymptotically normal estimator for β.
Therefore, for this problem, we might consider estimating β with a sample
of coarsened data
(Ri, Yi, Xi(1), Ri Xi(2)), i = 1, . . . , n,

by using the m-estimator that solves

Σ_{i=1}^{n} [ {Ri/π(Yi, Xi(1))} A(Xi){Yi − µ(Xi, β)} + {(Ri − π(Yi, Xi(1)))/π(Yi, Xi(1))} L(Yi, Xi(1)) ] = 0.   (7.41)
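In practice, (7.41) can be solved numerically. The sketch below stacks, for each subject, the inverse probability weighted complete-case term and the augmentation term and passes the resulting system to a root finder; the functions A, L, mu, and pi_design are user-supplied choices (hypothetical here), and a scalar response is assumed for simplicity.

import numpy as np
from scipy.optimize import root

def aipwcc_solve(Y, X1, X2_obs, R, pi_design, mu, A, L, beta_init):
    """Solve the AIPWCC estimating equation (7.41) for beta (scalar response)."""
    def estimating_equation(beta):
        total = np.zeros(len(np.atleast_1d(beta_init)))
        for y, x1, x2, r in zip(Y, X1, X2_obs, R):
            p = pi_design(y, x1)
            if r == 1:
                x = (x1, x2)                              # full covariates, observed
                total = total + (1.0 / p) * A(x) * (y - mu(x, beta))
            total = total + ((r - p) / p) * L(y, x1)       # augmentation term
        return total
    return root(estimating_equation, beta_init).x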

If this estimator is to be consistent, at the least we would need that, at


the truth,
   
E[ {R/π(Y, X(1))} A(X){Y − µ(X, β0)} + {(R − π(Y, X(1)))/π(Y, X(1))} L(Y, X(1)) ] = 0.

Using the law of iterated conditioning, where we first condition on Y, X, we obtain

E[ E(R|Y, X) {A(X){Y − µ(X, β0)}/π(Y, X(1))} + {(E(R|Y, X) − π(Y, X(1)))/π(Y, X(1))} L(Y, X(1)) ].   (7.42)
Since
E(R|Y, X) = P (R = 1|Y, X) = P (R = 1|Y, X (1) , X (2) ),
which, by design, equals π(Y, X (1) ), we obtain that (7.42) is equal to

E[A(X){Y − µ(X, β0 )} + 0] = 0. (7.43)

Also, the usual expansion of m-estimators can be used to derive asymptotic


normality. That is,

0 = Σ_{i=1}^{n} [ {Ri/π(Yi, Xi(1))} A(Xi){Yi − µ(Xi, β̂n)} + {(Ri − π(Yi, Xi(1)))/π(Yi, Xi(1))} L(Yi, Xi(1)) ]

= Σ_{i=1}^{n} [ {Ri/π(Yi, Xi(1))} A(Xi){Yi − µ(Xi, β0)} + {(Ri − π(Yi, Xi(1)))/π(Yi, Xi(1))} L(Yi, Xi(1)) ]

− [ Σ_{i=1}^{n} {Ri/π(Yi, Xi(1))} A(Xi) D(Xi, βn∗) ] (β̂n − β0),

where D(X, β) = ∂µ(X, β)/∂β T and βn∗ is an intermediate value between β̂n
and β0 . Therefore,
n^{1/2}(β̂n − β0) = [ n^{−1} Σ_{i=1}^{n} {Ri/π(Yi, Xi(1))} A(Xi) D(Xi, βn∗) ]^{−1}

× n^{−1/2} Σ_{i=1}^{n} [ {Ri/π(Yi, Xi(1))} A(Xi){Yi − µ(Xi, β0)} + {(Ri − π(Yi, Xi(1)))/π(Yi, Xi(1))} L(Yi, Xi(1)) ].
Under suitable regularity conditions,

n^{−1} Σ_{i=1}^{n} {Ri/π(Yi, Xi(1))} A(Xi) D(Xi, βn∗) →P E[ {R/π(Y, X(1))} A(X) D(X, β0) ].

Using iterated conditioning, where first we condition on Y, X, we obtain

E{A(X)D(X, β0 )}.

Consequently,

n^{1/2}(β̂n − β0) = n^{−1/2} Σ_{i=1}^{n} [E{A(X)D(X, β0)}]^{−1} [ {Ri/π(Yi, Xi(1))} A(Xi){Yi − µ(Xi, β0)} + {(Ri − π(Yi, Xi(1)))/π(Yi, Xi(1))} L(Yi, Xi(1)) ] + op(1).   (7.44)

Therefore, the i-th influence function for β̂n is


[E{A(X)D(X, β0)}]^{−1} [ {Ri/π(Yi, Xi(1))} A(Xi){Yi − µ(Xi, β0)} + {(Ri − π(Yi, Xi(1)))/π(Yi, Xi(1))} L(Yi, Xi(1)) ],

which we demonstrated has mean zero, in (7.42) and (7.43), using iterated
conditional expectations.
We note that this influence function is proportional to the element in Λ⊥η
that motivated the corresponding m-estimator. Also, because this estimator is
asymptotically linear, as shown in (7.44), we immediately deduce that this es-
timator is asymptotically normal with asymptotic variance being the variance
of its influence function. Other than regularity conditions, no assumptions
were made on the distribution of (Y, X), beyond that of the restricted mo-
ment assumption, to obtain asymptotic normality. Therefore, the resulting
estimator is a semiparametric estimator.
Standard methods using a sandwich variance can be used to derive an
estimator for the asymptotic variance of β̂n , the solution to (7.41). Such a
sandwich estimator was derived for the full-data GEE estimator in (4.9) of
Section 4.1. We leave the details as an exercise at the end of the chapter.
Hence, for the restricted moment model with missing data that are miss-
ing by design, we have derived the space orthogonal to the nuisance tangent
space (i.e., Λ⊥
η ) and have constructed an m-estimator with influence function
proportional to any element of Λ⊥ η . Since all influence functions of RAL es-
timators for β must belong to Λ⊥ η , this means that any RAL estimator for
β must be asymptotically equivalent to one of the estimators given by the
solution to (7.41).

Remark 10. The estimator for β, given as the solution to (7.41), is referred
to as an augmented inverse probability weighted complete-case (AIPWCC)
estimator. If L(Y, X (1) ) is chosen to be identically equal to zero, then the
estimating equation in (7.41) becomes


Σ_{i=1}^{n} {Ri/π(Yi, Xi(1))} A(Xi){Yi − µ(Xi, β)} = 0.   (7.45)

The solution to (7.45) is referred to as an inverse probability weighted


complete-case (IPWCC) estimator. The second term in (7.41), which involves
the arbitrary function L(Y, X (1) ), allows contributions from individuals with
missing data into the estimating equation. Properly chosen augmentation will
result in an estimator with greater efficiency.  

The choice of the influence function and hence the corresponding class
of estimators depends on the arbitrary functions A(X) and L(Y, X (1) ). With
full data, the class of estimating equations is characterized by (7.38). This

requires us to choose the function A(X). In Chapter 4, we proved that the


optimal choice for A(X) was DT (X)V −1 (X), where V (X) = var (Y |X), and
suggested adaptive strategies for finding locally efficient estimators for β in
Section 4.6.
With missing data by design, we also want to find the optimal RAL estima-
tor for β; i.e., the RAL estimator for β with the smallest asymptotic variance.
This means that we must derive the functions A(X) and L(Y, X (1) ), which
yields an estimator in (7.41) with the smallest asymptotic variance. Finding
the optimal estimator with coarsened data will require special considerations
that will be the focus of later chapters. In general, the optimal choice of A(X)
with coarsened data is not necessarily the same as it is for full data. These
issues will be studied more carefully.

The Logistic Regression Model

We gave an example in Section 7.2 where we argued that with coarsened data it
was difficult to obtain estimators for β using likelihood methods. Specifically,
we considered the logistic regression model for the probability of response
Y = 1 as a function of covariates X, where Y denotes a binary response
variable. Let us consider the likelihood for such a model if we had missing data by design as described above; that is, where X = (X(1)T, X(2)T)T and where we always observe Y and X(1) on everyone in the sample but only observe X(2) on a subset chosen at random with probability π(Y, X(1)) by design. Also, to allow for an intercept term in the logistic regression model, we define X∗ = (1, X(1)T, X(2)T)T and X(1∗) = (1, X(1)T)T. The density of
the full data (Y, X) for this problem can be written as

pY,X(y, x, β, η1, η2) = pY|X(y|x, β) pX(2)|X(1)(x(2)|x(1), η1) pX(1)(x(1), η2)

= [ exp{(β1T x(1∗) + β2T x(2))y} / {1 + exp(β1T x(1∗) + β2T x(2))} ] pX(2)|X(1)(x(2)|x(1), η1) pX(1)(x(1), η2),

where β is partitioned as β = (β1T , β2T )T , pX (2) |X (1) (x(2) |x(1) , η1 ) denotes the
conditional density of X (2) given X (1) , specified through the parameter η1 ,
and pX (1) (x(1) , η2 ) denotes the marginal density of X (1) , specified through
the parameter η2 . Because the parameter of interest β separates from the
parameters η1 and η2 in the density above, finding the MLE for β with full
data only involves maximizing the part of the likelihood above involving β
and is easily implemented in most software packages.
In contrast, the density of the observed data (R, Y, X (1) , RX (2) ) is given
by

{pY|X(y|x, β)}^r {pX(2)|X(1)(x(2)|x(1), η1)}^r

× [ ∫ pY|X(y|x, β) pX(2)|X(1)(x(2)|x(1), η1) dνX(2)(x(2)) ]^{1−r} pX(1)(x(1), η2)

= [ exp{(β1T x(1∗) + β2T x(2))y} / {1 + exp(β1T x(1∗) + β2T x(2))} ]^r {pX(2)|X(1)(x(2)|x(1), η1)}^r   (7.46)

× [ ∫ {exp{(β1T x(1∗) + β2T x(2))y} / {1 + exp(β1T x(1∗) + β2T x(2))}} pX(2)|X(1)(x(2)|x(1), η1) dνX(2)(x(2)) ]^{1−r}   (7.47)

× pX(1)(x(1), η2).

Because the parameters β and η1 do not separate in the density above, de-
riving the MLE for β involves maximizing, as a function of β and η1 , the
product (over i = 1, . . . , n) of terms (7.46) × (7.47). Even if we were willing
to make simplifying parametric assumptions about the conditional distribu-
tion of X (2) given X (1) in terms of a finite number of parameters η1 , this
would be a complicated maximization, but if we wanted to be semiparametric
(i.e., put no restrictions on the conditional distribution of X (2) given X (1) ),
then this problem would be impossible as it would suffer from the curse of
dimensionality. Notice that in the likelihood formulation above, nowhere do
the probabilities π(Y, X (1) ) come into play, even though they are known to us
by design.
Since the logistic regression model is just a simple example of a restricted
moment model, estimators for the parameter β for the semiparametric model,
which puts no restrictions on the joint distribution of (X (1) , X (2) ), can be
found easily by solving the estimating equation (7.41), where µ(Xi , β) =
exp(β T Xi∗ )/{1 + exp(β T Xi∗ )} and for some choice of A(X) and L(Y, X (1) ).
With no missing data, we showed in (4.65) that the optimal choice for
A(X) is X∗. Consequently, one easy way of obtaining an estimator for β is by solving (7.41) using A(Xi) = Xi∗ and L(Yi, Xi(1)) = 0, leading to the
estimating equation

Σ_{i=1}^{n} {Ri/π(Yi, Xi(1))} Xi∗ [ Yi − exp(βT Xi∗)/{1 + exp(βT Xi∗)} ] = 0.   (7.48)

This estimator is an inverse probability weighted complete case (IPWCC)


estimator for β. Although this estimator is a consistent, asymptotically normal
semiparametric estimator for β, it is by no means efficient. Since this estimator
only uses the complete cases (i.e., the data from individual i : Ri = 1), it is
intuitively clear that additional efficiency can be gained by using the data from
individuals i : Ri = 0, where only some of the data are missing. Therefore, it
would be preferable to use an AIPWCC estimator given by (7.41); namely,
Σ_{i=1}^{n} [ {Ri/π(Yi, Xi(1))} Xi∗ { Yi − exp(βT Xi∗)/(1 + exp(βT Xi∗)) } + {(Ri − π(Yi, Xi(1)))/π(Yi, Xi(1))} L(Yi, Xi(1)) ] = 0,   (7.49)

with some properly chosen L(Y, X (1) ).


This illustrates the usefulness of understanding the semiparametric theory
for missing and coarsened data. Of course, the choice of A(X) and L(Y, X (1) )
that will result in efficient estimators for β still needs to be addressed.
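As a rough illustration of how (7.48) might be solved, the sketch below evaluates the inverse probability weighted logistic score and hands it to a root finder; the augmented equation (7.49) would be obtained by simply adding the {Ri − π}/π term with a chosen L(Y, X(1)). Variable names and the use of scipy are illustrative only.

import numpy as np
from scipy.optimize import root

def ipwcc_logistic(Y, Xstar, R, pi_vals, beta_init):
    """Solve eq. (7.48): sum_i {R_i/pi_i} X*_i {Y_i - expit(beta' X*_i)} = 0.

    Xstar   : n x p design matrix X* = (1, X1, X2); rows with R = 0 may hold any
              placeholder for the missing X2, since they receive zero weight
    pi_vals : pi(Y_i, X1_i), known by design, for every subject
    """
    Xs = np.nan_to_num(np.asarray(Xstar, dtype=float))
    w = R / pi_vals                                     # inverse probability weights

    def score(beta):
        resid = Y - 1.0 / (1.0 + np.exp(-(Xs @ beta)))  # Y - expit(X* beta)
        return Xs.T @ (w * resid)                       # weighted logistic score
    return root(score, beta_init).x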

7.5 Recap and Review of Notation


Before continuing, we believe it is worthwhile to review some of the basic ideas
and notation that have been developed thus far.

Full data

• Full data are denoted by Z with density from a semiparametric model


pZ (z, β, η), where β denotes the q-dimensional parameter of interest and
η denotes the infinite-dimensional nuisance parameter.
• HF denotes the full-data Hilbert space defined as all mean-zero, q-
dimensional measurable functions of Z with finite variance equipped with
the covariance inner product.
• ΛF is the full-data nuisance tangent space spanned by the full-data nui-
sance score vectors for parametric submodels and their mean-square clo-
sure.
• ΛF ⊥ = {set of elements ϕ∗F (Z) that are orthogonal to ΛF }. This is the
space on which influence functions lie. Identifying this space helps motivate
full-data m-estimators.

Observed (coarsened) data

• Coarsened data are denoted by {C, GC(Z)}, where the coarsening variable C is a discrete random variable taking on a finite number of values together with ∞. When C = r ≠ ∞, we observe the many-to-one transformation Gr(Z). C = ∞ is reserved to denote complete data; i.e., G∞(Z) = Z.
• We distinguish among three types of coarsening mechanisms:
– Coarsening completely at random (CCAR): The coarsening probabili-
ties do not depend on the data.
– Coarsening at random (CAR): The coarsening probabilities only de-
pend on the data as a function of the observed data.
– Noncoarsening at random (NCAR): The coarsening probabilities de-
pend on the unobserved part of the data.

• When coarsening is CAR, we denote the coarsening probabilities by

P (C = r|Z) = {r, Gr (Z)}.

• A key assumption is that there is a positive probability of observing com-


plete data; that is,

P (C = ∞|Z = z) = (∞, z) > ε > 0 for all z

in the support of Z.
• H denotes the observed-data Hilbert space of q-dimensional, mean-zero,
finite-variance, measurable functions of {C, GC (Z)} equipped with the co-
variance inner product.
• Because C takes on a finite set of values, a typical function h{C, GC (Z)}
can be written as

h{C, GC(Z)} = I(C = ∞)h∞(Z) + Σ_{r≠∞} I(C = r)hr{Gr(Z)}.

• The observed-data nuisance tangent space

Λ = Λ ψ ⊕ Λη , Λψ ⊥ Λη ,

where Λψ is spanned by the score vector with respect to the parameter ψ


used to describe the coarsening process and Λη is spanned by the observed-
data nuisance score vectors for parametric submodels and their mean-
square closures. Specifically,
 
Λη = [ E{αF(Z)|C, GC(Z)} : αF(Z) ∈ ΛF ] = E{ΛF|C, GC(Z)}.

• In this chapter, we did not consider models for the coarsening probabilities;
rather, we assumed they are known by design. Therefore, we didn’t need to
consider the space Λψ , in which case the observed-data nuisance tangent
space Λ = Λη .
• Observed-data estimating equations, when coarsening is by design, are
motivated by considering elements in the space Λη⊥, where

Λη⊥ = I(C = ∞)ΛF⊥/(∞, Z) ⊕ Λ2

and

Λ2 = [ L2{C, GC(Z)} : E[L2{C, GC(Z)}|Z] = 0 ].

• The two linear spaces that make up Λη⊥ are the IPWCC space I(C = ∞)ΛF⊥/(∞, Z) and the augmentation space Λ2.
• In order to construct a typical element of Λ2, for each r ≠ ∞, choose an arbitrary function L2r{Gr(Z)}. Then

[I(C = ∞)/(∞, Z)] Σ_{r≠∞} {r, Gr(Z)}L2r{Gr(Z)} − Σ_{r≠∞} I(C = r)L2r{Gr(Z)}

is an element of Λ2.

7.6 Exercises for Chapter 7


Returning to Example 1, introduced in Section 7.1, recall that two blood
samples were taken from each individual in a study where one of the objectives
was to assess the variability within and between persons in the concentration
of some biological marker. As part of the design of this study, a subset of
individuals was chosen at random with probability . These individuals had
their two blood samples combined and the concentration was obtained on
the pooled blood, whereas for the remaining individuals in the study, the
concentration was obtained separately for each of the two blood samples. In
Section 7.2, we introduced a bivariate normal model for the full data given by
(7.11) in terms of parameters (µα , σα2 , σe2 ), and in equation (7.12) we derived
the likelihood of the observed coarsened data. The first three exercises below
relate to this example.
1. Let us first consider only the full data for the time being.
a) What is the likelihood of the full data (Xi1 , Xi2 ), i = 1, . . . , n?
b) Find the MLE for the parameters (µα , σα2 , σe2 ).
c) Derive the influence function of the full-data MLE.
2. Return to the coarsened data whose likelihood for the parameters (µα , σα2 , σe2 )
is given by (7.12).
a) Derive the observed data MLE using the coarsened data.
b) What is the relative efficiency between the coarsened-data MLE and
the full-data MLE? Derive this separately for µα , σα2 , and σe2 .
3. We now consider AIPWCC estimators for this problem.
a) Derive the augmentation space Λ2 .

b) Using the full-data influence function that was derived in 1(c) above
(or, equivalently, using the full-data score vector), write out a set of
AIPWCC estimating equations that can be used to obtain observed-
data estimators for (µα , σα2 , σe2 ).
4. Derive an estimator for the asymptotic variance of β̂n , the AIPWCC es-
timator for β for the restricted moment given by the solution to (7.41)
where data were missing by design.
8
The Nuisance Tangent Space and Its
Orthogonal Complement

8.1 Models for Coarsening and Missingness


Two Levels of Missingness

In the previous chapter, we gave an example where the missingness (coarsen-


ing) mechanism was known to us by design. For most missing-data problems,
this is not the case, and we must consider models (either parametric or semi-
parametric) for the coarsening probabilities. For example, suppose the full-
data Z is a random vector that can be partitioned as Z = (Z1T , Z2T )T , where
Z1 is always observed but Z2 may be missing on a subset of individuals. This
scenario occurs frequently in practice where, say, one of the variables being
collected is missing on some individuals or where a set of variables that are
collected at the same time may be missing on some individuals. In this ex-
ample, there are two levels of missingness; either all the data are available on
an individual, which is denoted by letting the complete-case indicator R = 1
(unscripted), or only the data Z1 are available, which is denoted by letting
R = 0. Using the coarsening notation, this would correspond to C = ∞ or
C = 1, respectively. If we assume that missingness is MAR, then this would im-
ply that P (R = 1|Z) = {1 − P (R = 0|Z)} = {1 − P (R = 0|Z1 )} = π(Z1 ) and
P (R = 0|Z1 ) = 1−π(Z1 ). We remind the reader that using the coarsening no-
tation, the probability π(Z1 ) = {∞, G∞ (Z)} and 1 − π(Z1 ) = {1, G1 (Z)},
where G1 (Z) = Z1 and G∞ (Z) = Z. If the missingness was not by design,
then the probability of a complete case π(Z1 ), as a function of Z1 , is unknown
to us and must be estimated from the data. This is generally accomplished
by positing a model in terms of parameters ψ. Since, in this simplest example
of missing data, the complete-case indicator is a binary variable, a natural
model is the logistic regression model, where

π(Z1, ψ) = exp(ψ0 + ψ1T Z1) / {1 + exp(ψ0 + ψ1T Z1)},   (8.1)

and the parameter ψ = (ψ0 , ψ1T )T needs to be estimated from the observed
data. Although this illustration assumed a logistic regression model that was
linear in Z1 , we could easily have considered more complex models where we
include higher-order terms, interactions, regression splines, or whatever else
the data analyst deems appropriate.

Monotone and Nonmonotone Coarsening for More than Two Levels

When there are more than two levels of missingness or coarsening of the data,
we distinguish between monotone and nonmonotone coarsening.
A form of missingness that often occurs in practice is monotone missing-
ness. Because of its importance, we now describe monotone missingness, or
more generally monotone coarsening, in more detail and discuss methods for
developing models for such monotone missingness mechanisms.
For some problems, we can order the coarsening variable C in such a way that the coarsened data Gr(Z) when C = r is a coarsened version of Gr′(Z) for all r′ > r. In such a case, Gr(Z) is a many-to-one function of Gr+1(Z); that is,

Gr(z) = fr{Gr+1(z)},
where fr (·) denotes a many-to-one function that depends on r. In other words,
G1 (Z) is the most coarsened data, G2 (Z) less so, and G∞ (Z) = Z is not
coarsened at all. For example, with longitudinal data, suppose we intend to
measure an individual at l different time points so that Z = (Y1 , . . . , Yl ),
where Yj denotes the measurement at the j-th time point, j = 1, . . . , l. For
such longitudinal studies, it is not uncommon for some individuals to drop
out during the course of the study, in which case we would observe the data
up to the time they dropped out and all subsequent measurements would be
missing. This pattern of missingness is monotone and can be described by
r        Gr(Z)
1        (Y1)
2        (Y1, Y2)
...      ...
l − 1    (Y1, . . . , Yl−1)
∞        (Y1, . . . , Yl)
When data are CAR, we consider models for the coarsening probabilities,
which, in general, are denoted by

P (C = r|Z = z, ψ) = {r, Gr (z), ψ}

in terms of the unknown parameters ψ. However, with monotone coarsening,


it is more convenient to consider models for the discrete hazard function,
defined as

λr{Gr(Z)} = P(C = r | C ≥ r, Z), r ≠ ∞,
= 1, r = ∞.   (8.2)

That λr (·) is a function of Gr (Z) follows by noting that the right-hand side
of (8.2) equals
P(C = r|Z)/P(C ≥ r|Z) = {r, Gr(Z)} / [ 1 − Σ_{r′≤r−1} {r′, Gr′(Z)} ]   (8.3)

and by the definition of monotone coarsening, where Gr′(Z) is a function of Gr(Z) for all r′ < r. We also define

Kr{Gr(Z)} = P(C > r|Z) = ∏_{r′=1}^{r} [1 − λr′{Gr′(Z)}], r ≠ ∞.   (8.4)

Consequently, we can equivalently express the coarsening probabilities in


terms of the discrete hazard functions; namely,

{r, Gr(Z)} = Kr−1{Gr−1(Z)} λr{Gr(Z)} for r > 1, and λ1{G1(Z)} for r = 1.   (8.5)

Equations (8.3), (8.4), and (8.5) demonstrate that there is a one-to-one re-
lationship between coarsening probabilities {r, Gr (Z)} and discrete hazard
functions λr {Gr (Z)}. Using discrete hazards, the probability of a complete
case (i.e., C = ∞) is given by
(∞, Z) = ∏_{r≠∞} [1 − λr{Gr(Z)}].   (8.6)

The use of discrete hazards provides a natural way of thinking about mono-
tone coarsening. For example, suppose we were asked to design a longitudi-
nal study with monotone missingness. We can proceed as follows. First, we
would collect G1 (Z) = Y1 . Then, with probability λ1 {G1 (Z)} (that is, with
probability depending on Y1 ), we would stop collecting additional data. How-
ever, with probability 1 − λ1 {G1 (Z)}, we would collect Y2 , in which case
we now have G2 (Z) = (Y1 , Y2 ). If we collected (Y1 , Y2 ), then with probabil-
ity λ2 {G2 (Z)} we would stop collecting additional data, but with probabil-
ity 1 − λ2 {G2 (Z)} we would collect Y3 , in which case we would have col-
lected G3 (Z) = (Y1 , Y2 , Y3 ). We continue in this fashion, either stopping at

stage r after collecting Gr (Z) = (Y1 , . . . , Yr ) or continuing with probability
λr {Gr (Z)} or 1 − λr {Gr (Z)}, respectively. When monotone missingness is

viewed in this fashion, it is clear that, conditional on having reached stage r ,
there are two choices: either stop or continue to the next stage with condi-
tional probability λr {Gr (Z)} or 1 − λr {Gr (Z)}. Therefore, when we build
models for the coarsening probabilities of monotone coarsened data, it is nat-
ural to consider individual models for each of the discrete hazards. Because

of the binary choice made at each stage, logistic regression models for the
discrete hazards are often used. For example, for the longitudinal data given
above, we may consider a model where it is assumed that

λr{Gr(Z)} = exp(ψ0r + ψ1r Y1 + . . . + ψrr Yr) / {1 + exp(ψ0r + ψ1r Y1 + . . . + ψrr Yr)}, r ≠ ∞.   (8.7)
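The identities (8.4)-(8.6) translate directly into a short computation: given the discrete hazards evaluated for one subject, cumulative products give the coarsening probabilities and the complete-case probability. The sketch below, with arbitrary illustrative hazard values, is one way to carry this out.

import numpy as np

def coarsening_probs_from_hazards(hazards):
    """Given lambda_r{G_r(Z)}, r = 1, ..., l-1, for one subject, return the
    coarsening probabilities {r, G_r(Z)} and the complete-case probability,
    using eqs. (8.4)-(8.6)."""
    hazards = np.asarray(hazards, dtype=float)
    K = np.cumprod(1.0 - hazards)                 # K_r{G_r(Z)}, eq. (8.4)
    probs = hazards.copy()
    probs[1:] = K[:-1] * hazards[1:]              # eq. (8.5): K_{r-1} * lambda_r
    prob_complete = K[-1]                         # eq. (8.6)
    return probs, prob_complete

# illustration with three monitoring times and arbitrary hazards
p, p_inf = coarsening_probs_from_hazards([0.2, 0.3, 0.1])
assert np.isclose(p.sum() + p_inf, 1.0)           # the probabilities sum to one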

Missing or coarsened data can also come about in a manner that is non-
monotone. For the longitudinal data example given above, suppose patients
didn’t necessarily drop out of the study but rather missed visits from time
to time. In such a case, some of the longitudinal data might be missing but
not necessarily in a monotone fashion. In the worst-case scenario, any of the
2l − 1 combinations of (Y1 , . . . , Yl ) might be missing for different patients in
the study. Building coherent models for the missingness probabilities for such
nonmonotone missing data, even under the assumption that missingness is
MAR, is challenging. There have been some suggestions for developing non-
monotone missingness models given by Robins and Gill (1997) using what they
call randomized monotone missingness (RMM) models. Because of the com-
plexity of nonmonotone missingness models, we will not discuss such models
specifically in this book. In what follows, we will develop the semiparametric
theory assuming that coherent missingness or coarsening models were used.
Specific examples with two levels of missingness or monotone missingness will
be used to illustrate the results.

8.2 Estimating the Parameters in the Coarsening Model

Models for the coarsening probabilities are described through the parameter
ψ. Specifically, it is assumed that P (C = r|Z = z, ψ) = {r, Gr (z), ψ}, where
ψ is often assumed to be a finite-dimensional parameter. Estimates for the
parameter ψ can be obtained using maximum likelihood. We remind the reader
that because of the factorization of the observed-data likelihood given by (7.6),
the maximum likelihood estimator ψ̂n for ψ is obtained by maximizing
∏_{i=1}^{n} {Ci, GCi(Zi), ψ}.   (8.8)

MLE for ψ with Two Levels of Missingness

With two levels of missingness, the likelihood (8.8) simplifies to

∏_{i=1}^{n} {π(Z1i, ψ)}^{Ri} {1 − π(Z1i, ψ)}^{1−Ri}.   (8.9)

So, for example, if we entertained the logistic regression model (8.1), then the
maximum likelihood estimator for (ψ0 , ψ1T )T would be obtained by maximiz-
ing (8.9) or, more specifically, by maximizing
∏_{i=1}^{n} [ exp{(ψ0 + ψ1T Z1i)Ri} / {1 + exp(ψ0 + ψ1T Z1i)} ].   (8.10)

This can be easily implemented in most available statistical software packages.
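For example, maximizing (8.10) amounts to an ordinary logistic regression of the complete-case indicator R on Z1; a minimal sketch using statsmodels (any logistic regression routine would serve equally well, and the function name is illustrative) is:

import statsmodels.api as sm

def fit_missingness_model(R, Z1):
    """Maximize the likelihood (8.10): logistic regression of R on Z1."""
    design = sm.add_constant(Z1)                  # intercept psi_0 plus psi_1' Z1
    fit = sm.Logit(R, design).fit(disp=0)
    psi_hat = fit.params                          # estimates of (psi_0, psi_1)
    pi_hat = fit.predict(design)                  # fitted pi(Z1_i, psi_hat)
    return psi_hat, pi_hat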

MLE for ψ with Monotone Coarsening

Because monotone missingness (or monotone coarsening) is prevalent in many


studies, we now consider how to estimate the parameter ψ in this special case.
As indicated in Section 8.1, it is more convenient to work with the discrete
hazard function to describe monotone coarsening. The discrete hazard, de-
noted by λr {Gr (Z)}, was defined by (8.2). Therefore, it is natural to consider
models for the discrete hazard in terms of parameters ψ, which we denote by
λr {Gr (Z), ψ}. An example of such a model for monotone missing longitudinal
data was given by (8.7). We also showed in Section 8.1 that there is a one-
to-one relationship between the coarsening probabilities {r, Gr (Z)} and the
discrete hazard functions. Using (8.5), we see that the coarsening probability
can be deduced through the discrete hazards leading to the model

{r, Gr(Z), ψ} = λ1{G1(Z), ψ} for r = 1,

{r, Gr(Z), ψ} = ∏_{r′=1}^{r−1} [1 − λr′{Gr′(Z), ψ}] λr{Gr(Z), ψ} for r > 1.   (8.11)

Substituting the right-hand side of (8.11) for (·, ψ) into (8.8) and rearranging
terms, we obtain that the likelihood for monotone coarsening can be expressed
as
∏_{r≠∞} ∏_{i:Ci≥r} [λr{Gr(Zi), ψ}]^{I(Ci=r)} [1 − λr{Gr(Zi), ψ}]^{I(Ci>r)}.   (8.12)

So, for example, if we consider the logistic regression models used to model
the discrete hazards for the monotone missing longitudinal data given by (8.7),
then the likelihood is given by

∏_{r=1}^{l−1} ∏_{i:Ci≥r} exp{(ψ0r + ψ1r Y1i + . . . + ψrr Yri) I(Ci = r)} / {1 + exp(ψ0r + ψ1r Y1i + . . . + ψrr Yri)}.   (8.13)

Because the likelihood in (8.13) factors into a product of l−1 logistic regression
likelihoods, standard logistic regression software can be used to maximize
(8.13).
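Concretely, this can be done one stage at a time: for each r, restrict attention to the subjects with Ci ≥ r, regress the dropout indicator I(Ci = r) on (1, Y1, . . . , Yr), and collect the stagewise coefficients. The sketch below (names illustrative; statsmodels used for convenience) maximizes (8.13) in exactly this way.

import numpy as np
import statsmodels.api as sm

def fit_monotone_hazards(C, Ydata):
    """Fit the discrete-hazard logistic models (8.7) by maximizing (8.13).

    C     : n-vector of coarsening levels; np.inf denotes a fully observed subject
    Ydata : n x l matrix of longitudinal responses (columns Y1, ..., Yl)
    """
    n, l = Ydata.shape
    psi_hat = {}
    for r in range(1, l):                          # stages r = 1, ..., l-1
        at_risk = C >= r                           # subjects with C_i >= r
        dropout = (C[at_risk] == r).astype(float)  # I(C_i = r)
        design = sm.add_constant(Ydata[at_risk, :r])   # (1, Y1, ..., Yr)
        psi_hat[r] = sm.Logit(dropout, design).fit(disp=0).params
    return psi_hat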

8.3 The Nuisance Tangent Space when Coarsening Probabilities Are Modeled
Our ultimate goal is to derive semiparametric estimators for the parameter
β of a semiparametric model when the data are coarsened at random. As al-
ways, the key to deriving such estimators is to find elements orthogonal to
the nuisance tangent space that in turn can be used to guide us in construct-
ing estimating equations. Toward that end, we now return to the problem of
finding the nuisance tangent space for semiparametric models with coarsened
data when the coarsening probabilities are modeled using additional parame-
ters ψ. We described in the previous sections how such coarsening probability
models can be developed and estimated.
Therefore, as a starting point, we will assume that such a model for the
coarsening probabilities has already been developed as a function of unknown
parameters ψ and is denoted by P (C = r|Z) = {r, Gr (Z), ψ}. When the
observed data are CAR, we showed in (7.6) that the likelihood can be factored
as

(r, gr, ψ) ∫_{z:Gr(z)=gr} pZ(z, β, η) dνZ(z),   (8.14)

where the parameter ψ is finite-dimensional, say with dimension s.


As shown in (7.18), the observed-data nuisance tangent space can be writ-
ten as a direct sum of two linear subspaces, namely

Λ = Λ ψ ⊕ Λη ,

where Λψ is the space associated with the coarsening model parameter ψ and
Λη is the space associated with the infinite-dimensional nuisance parameter
η. In Chapter 7, we derived the space Λη and its orthogonal complement. We
now consider the space Λψ and some of its properties. Because the space Λψ
will play an important role in deriving RAL estimators for β with coarsened
data, when the coarsening probabilities are not known and must be modeled,
we will denote this space as the coarsening model tangent space and give a
formal definition as follows.

Definition 1. The space Λψ , which we denote as the coarsening model tan-


gent space, is defined as the linear subspace, within H, spanned by the score
vector with respect to ψ. That is,
 
Λψ = [ Bq×s Sψs×1{C, GC(Z), ψ0} for all Bq×s ],

where

Sψs×1 = ∂ log {C, GC(Z), ψ0} / ∂ψ,   (8.15)
and ψ0 denotes the true value of ψ. 


One of the properties of the space Λψ is that it is contained in the augmen-


tation space Λ2 , as we now prove.
Theorem 8.1. Λψ ⊂ Λ2
Proof. Since {r, Gr (z), ψ} = P (C = r|Z = z, ψ) is a conditional density,
then this implies that

Σr {r, Gr(z), ψ} = 1 for all ψ, z.

Hence, for a fixed “z,”


(∂/∂ψ) Σr {r, Gr(z), ψ} = 0.

Taking the partial derivative inside the sum, dividing and multiplying by
{r, Gr (z), ψ}, and setting ψ = ψ0 yields

Σr Sψ{r, Gr(z), ψ0} {r, Gr(z), ψ0} = 0 for all z,

or

E [Sψ {C, GC (Z), ψ0 }|Z] = 0.

Hence

E[ Bq×s Sψ{C, GC(Z), ψ0} | Z ] = 0,

where Bq×s Sψ{C, GC(Z), ψ0} is a typical element of Λψ.

Consequently, if h{C, GC (Z)} ∈ Λψ , then E[h{C, GC (Z)}|Z] = 0. This implies


that Λψ ⊂ Λ2 .  
Because the parameter ψ and the parameter η separate out in the likeli-
hood (7.6), this should imply that the corresponding spaces Λψ and Λη are
orthogonal. We now prove this property more formally.
Theorem 8.2. Λψ ⊥ Λη
Proof. Recall that the space Λη is given by
Λη = [ E{αF(Z)|C, GC(Z)} for all αF(Z) ∈ ΛF ].

We first demonstrate that Λη ⊥ Λ2 .


Choose an arbitrary element h{C, GC (Z)} ∈ Λ2 (i.e., E(h|Z) = 0) and an
arbitrary element of Λη , say E{αF (Z)|C, GR (Z)}, for some αF (Z) ∈ ΛF . The
inner product of these two elements is
E[ hT{C, GC(Z)} E{αF(Z)|C, GC(Z)} ]
= E( E[hT{C, GC(Z)} αF(Z) | C, GC(Z)] )
= E[ hT{C, GC(Z)} αF(Z) ]
= E( E[hT{C, GC(Z)} αF(Z) | Z] )
= E( E[hT{C, GC(Z)} | Z] αF(Z) )    (the inner conditional expectation is 0 since h ∈ Λ2)
= 0.
Since Λψ is contained in Λ2 , then this implies that Λψ is orthogonal to Λη . 

We are now in a position to derive the space orthogonal to the nuisance
tangent space.

8.4 The Space Orthogonal to the Nuisance Tangent Space
Since influence functions of RAL estimators for β belong to the space orthog-
onal to the nuisance tangent space, it is important to derive the space Λ⊥ ,
where Λ = Λψ ⊕ Λη and Λψ ⊥ Λη .
Because the nuisance tangent space Λ is the direct sum of two orthogonal
spaces, we can show that the orthogonal complement
Λ⊥ = Π(Λη⊥ | Λψ⊥) = Π(Λψ⊥ | Λη⊥).   (8.16)

(We leave this as an exercise for the reader.) Using the first equality above, a typical element of Λ⊥ can be found by taking an arbitrary element h ∈ Λη⊥ and computing

h − Π(h|Λψ) = Π(h|Λψ⊥).
In Chapter 7, we showed how to find elements orthogonal to Λη . In fact, in
Theorem 7.2, we showed the important result that Λη⊥ can be written as the direct sum of the IPWCC space and the augmentation space. That is,

Λη⊥ = I(C = ∞)ΛF⊥/(∞, Z) ⊕ Λ2.
Specifically, a typical element of Λη⊥ is given by formula (7.34); namely,

[ I(C = ∞)ϕ∗F(Z)/(∞, Z, ψ0) + L2{C, GC(Z)} : ϕ∗F(Z) ∈ ΛF⊥ and L2{C, GC(Z)} ∈ Λ2 ].

Therefore, a typical element of Λ⊥ is given by

I(C = ∞)ϕ∗F(Z)/(∞, Z, ψ0) + L2{C, GC(Z)} − Π[ I(C = ∞)ϕ∗F(Z)/(∞, Z, ψ0) + L2{C, GC(Z)} | Λψ ]

for ϕ∗F(Z) ∈ ΛF⊥ and L2{C, GC(Z)} ∈ Λ2.   (8.17)

Before discussing how these results can be used to derive RAL estimators
for β when data are CAR and when the coarsening probabilities need to be
modeled and estimated, which will be deferred to the next chapter, we close
this chapter by defining the space of observed-data influence functions of RAL
observed-data estimators for β.

8.5 Observed-Data Influence Functions


Because of condition (i) of Theorem 4.2, observed-data influence functions
ϕ{C, GC (Z)} not only must belong to the space Λ⊥ , but must satisfy
 
E[ ϕ{C, GC(Z)} SβT{C, GC(Z)} ] = Iq×q,   (8.18)

where Sβ {C, GC (Z)} is the observed-data score vector with respect to β. For
completeness, we will now define the space of observed-data influence func-
tions.
Theorem 8.3. When data are coarsened at random (CAR) with coarsening
probabilities P (C = r|Z) = {r, Gr (Z), ψ}, where Λψ is the space spanned
by the score vector Sψ {C, GC (Z)} (i.e., the coarsening model tangent space),
then the space of observed-data influence functions, also denoted by (IF ), is
the linear variety contained in H, which consists of elements
 
ϕ{C, GC(Z)} = [ I(C = ∞)ϕF(Z)/(∞, Z, ψ0) + L2{C, GC(Z)} ] − Π{[ · ]|Λψ},   (8.19)

where ϕF(Z) is a full-data influence function and L2{C, GC(Z)} ∈ Λ2.

Proof. We first note that we can use the exact same arguments as used in
lemmas 7.1 and 7.2 to show that the observed-data score vector with respect
to β is

Sβ{C, GC(Z)} = E{SβF(Z)|C, GC(Z)},

where SβF(Z) is the full-data score vector with respect to β,

SβF(z) = ∂ log pZ(z, β0, η0)/∂β.

In order for an element of Λ⊥ , given by (8.17), to be an observed-data


influence function, it must satisfy (8.18); that is,

Iq×q = E[ ( I(C = ∞)ϕ∗F(Z)/(∞, Z) + L2{C, GC(Z)} − Π[{ · }|Λψ] ) E{SβFT(Z)|C, GC(Z)} ],

which, by the law of iterated conditional expectations, used repeatedly, is equal to

= E[ E{[ · ] SβFT(Z) | C, GC(Z)} ]
= E{[ · ] SβFT(Z)} = E[ E{[ · ] SβFT(Z) | Z} ]
= E[ E{[ · ] | Z} SβFT(Z) ]
= E[ E{ I(C = ∞)ϕ∗F(Z)/(∞, Z) + L2{C, GC(Z)} − Π[{ · }|Λψ] | Z } SβFT(Z) ].
Because
 
E[ I(C = ∞)ϕ∗F(Z)/(∞, Z) | Z ] = ϕ∗F(Z),

E[L2{C, GC(Z)}|Z] = 0 since L2 ∈ Λ2,

and

E{Π({ · }|Λψ)|Z} = 0 since Λψ ⊂ Λ2,

this implies that

E{ϕ∗F(Z) SβFT(Z)} = Iq×q.   (8.20)

Equation (8.20) is precisely the condition necessary for a typical element


ϕ∗F ∈ ΛF ⊥ to be a full-data influence function.
Therefore, the space of observed-data influence functions consists of ele-
ments
 
ϕ{C, GC(Z)} = [ I(C = ∞)ϕF(Z)/(∞, Z, ψ0) + L2{C, GC(Z)} ] − Π{[ · ]|Λψ},

where ϕF(Z) is a full-data influence function and L2{C, GC(Z)} ∈ Λ2.  □


When data are coarsened by design, then the coarsening probabilities


{r, Gr (Z)} are known to us. We will sometimes refer to this as the pa-
rameter ψ = ψ0 being known. When this is the case, there is no need to
introduce the space Λψ . Therefore, we obtain the following simple corollary.

Corollary 1. When the coarsening probabilities {r, Gr (Z)} are known to


us by design, then the space of observed-data influence functions is the linear
variety consisting of elements
 
I(C = ∞)ϕF (Z)
ϕ{C, GC (Z)} = + L2 {C, GC (Z)} , (8.21)
(∞, Z, ψ0 )

where ϕ (Z) is a full-data influence function and L2 {C, GC (Z)} ∈ Λ2 .
F

Remark 1. Notational convention


The space of observed-data influence functions will be denoted by (IF ) and
the space of full-data influence functions will be denoted by (IF )F . As a
result of Corollary 1, when data are coarsened by design, we can write the
space (linear variety) of observed-data influence functions, using shorthand
notation, as
I(C = ∞)(IF )F
(IF ) = + Λ2 , (8.22)
(∞, Z)
and, by Theorem 8.3, when the coarsening probabilities have to be modeled,
as  
I(C = ∞)(IF )F
(IF ) = + Λ2 − Π({ · }|Λψ ).   (8.23)
(∞, Z)

8.6 Recap and Review of Notation


Monotone coarsening

• An important special case of coarsened data is when the coarsening is


monotone; that is, the coarsening variable can be ordered in such a
way that Gr (Z) is a many-to-one function of Gr+1 (Z) (i.e., Gr (Z) =
fr {Gr+1 (Z)}).
• When coarsening is monotone, it is convenient to denote coarsening prob-
abilities through the discrete hazard function. The definition and the re-
lationship to coarsening probabilities are
λr{Gr(Z)} = P(C = r | C ≥ r, Z), r ≠ ∞,
Kr{Gr(Z)} = P(C > r|Z) = ∏_{r′=1}^{r} [1 − λr′{Gr′(Z)}], r ≠ ∞,
{r, Gr(Z)} = Kr−1{Gr−1(Z)} λr{Gr(Z)}, and
the probability of a complete case is

P(C = ∞|Z) = (∞, Z) = ∏_{r≠∞} [1 − λr{Gr(Z)}].



The geometry of semiparametric models with coarsened data

• (IF )F denotes the space of full-data influence functions where a typical


element is denoted by ϕF (Z). This space is a linear variety where
(i) ϕF(Z) ∈ ΛF⊥,
(ii) E{ϕF(Z) SβFT(Z)} = Iq×q.
• The observed-data nuisance tangent space

Λ = Λ ψ ⊕ Λη , Λψ ⊥ Λη ,

where Λψ , denoted as the coarsening model tangent space, is spanned


by the score vector with respect to the parameter ψ used to describe
the coarsening process, and Λη is spanned by the observed-data nuisance
score vectors for parametric submodels and their mean-square closures.
Specifically,
Λη = E{ΛF |C, GC (Z)}.
• Λψ ⊂ Λ 2 , Λη ⊥ Λ2
• Λη⊥ = I(C = ∞)ΛF⊥/(∞, Z) ⊕ Λ2.
• When the coarsening probabilities are unknown to us and need to be mod-
eled, then
  
Λ⊥ = [ I(C = ∞)ΛF⊥/(∞, Z) ⊕ Λ2 ] − Π[ I(C = ∞)ΛF⊥/(∞, Z) ⊕ Λ2 | Λψ ].

(IF) = [ space of observed-data influence functions ϕ{C, GC(Z)} ].

When the coarsening probabilities are unknown to us and need to be modeled, then

(IF) = [ I(C = ∞)(IF)F/(∞, Z) + Λ2 ] − Π[ I(C = ∞)(IF)F/(∞, Z) + Λ2 | Λψ ].

When the coarsening probabilities are known to us by design, then

(IF) = I(C = ∞)(IF)F/(∞, Z) + Λ2.

8.7 Exercises for Chapter 8


1. In Section 8.1, we described how data may be monotonically missing us-
ing longitudinal data (Y1 , . . . , Yl ) as an illustration. We also described a

model for the coarsening process where we modeled the discrete hazard
function using equation (8.7) and derived the likelihood contribution for
the coarsening model in (8.13). Let ψr denote the vector of parameters
(ψ0r , . . . , ψrr )T for r = 1, . . . , l − 1, and let ψ denote the entire parameter
space for the coarsening probabilities; that is, ψ = (ψ1T , . . . , ψl−1 T
)T .
a) Derive the score vector

Sψr{C, GC(Z)} = ∂ log[{C, GC(Z), ψ}] / ∂ψr, r = 1, . . . , l − 1.

b) Show that the coarsening model tangent space Λψ is equal to the


direct sum
Λψ1 ⊕ . . . ⊕ Λψl−1 ,
where Λψr is the linear space spanned by the vector Sψr {C, GC (Z)},
r = 1, . . . , l − 1, and that these spaces are mutually orthogonal to each
other.
2. Give a formal proof of (8.16).
9
Augmented Inverse Probability Weighted
Complete-Case Estimators

9.1 Deriving Semiparametric Estimators for β


ψ known
We begin by assuming the parameter ψ0 that defines the coarsening model is
known to us by design. Semiparametric estimators for β can be obtained by
deriving elements orthogonal to the nuisance tangent space and using these to
motivate estimating functions that can be used to construct estimating equa-
tions whose solution will lead to semiparametric estimators for β. In Chapter
7, we showed that when the parameter ψ0 is known, the space orthogonal to
the nuisance tangent space is the direct sum of the IPWCC space and the
augmentation space, namely
Λ⊥ = I(C = ∞)ΛF⊥/(∞, Z, ψ0) ⊕ Λ2,

where a typical element of this space is

I(C = ∞)ϕ∗F(Z)/(∞, Z, ψ0) + L2{C, GC(Z), ψ0},   (9.1)
where ϕ∗F (Z) is an arbitrary element orthogonal to the full-data nuisance tan-
gent space (i.e., ϕ∗F (Z) ∈ ΛF ⊥ ), and L2 {C, GC (Z), ψ0 } is an arbitrary element
of Λ2 that can be constructed by taking arbitrary functions L2r{Gr(Z)}, r ≠ ∞, and then using (7.37).
In Section 7.4, we gave examples of how these results could be used to
derive augmented inverse probability weighted complete-case (AIPWCC) es-
timators for the regression parameters in a restricted moment model with
missing data by design when there were two levels of missingness. We now
expand this discussion to more general coarsening by design mechanisms.
If we want to obtain a semiparametric estimator for β, we would proceed
as follows. We start with a full-data estimating equation that yields a full-
data RAL estimator for β. It will be assumed that we know how to construct

such estimating equations for full-data semiparametric models. For example,


a full-data m-estimator could be derived by solving the estimating equation

     Σ_{i=1}^{n} m(Zi , β) = 0,

where the estimating function evaluated at the truth, m(Z, β0 ), was chosen so
that m(Z, β0 ) = ϕ∗F (Z) ∈ ΛF ⊥ . For example, we take m(Z, β) = A(X){Y −
µ(X, β)} for the restricted moment model. The influence function of such a
full-data estimator for β was derived in Chapter 3, formula (3.6), and is given
by

     [ −E{ ∂m(Zi , β0 )/∂β^T } ]^{−1} m(Zi , β0 ) = ϕF (Zi ).     (9.2)

Remark 1. An estimating function m(Z, β) is a function of the random vari-


able Z and the parameter β being estimated, and the corresponding m-
estimator is the solution to the estimating equation made up of a sum of
iid quantities, Σ_{i=1}^{n} m(Zi , β) = 0. However, in many cases, elements orthog-
onal to the nuisance tangent space are defined as ϕ∗F (Z) = m(Z, β0 , η0 ), where η0
denotes the true value of the full-data nuisance parameter. Consequently, it
may not be possible to define an estimating function m∗ (Z, β) that is only a
function of Z and β that satisfies Eβ,η {m∗ (Z, β)} = 0 for all β and η, as would
be necessary to obtain consistent asymptotically normal estimators for all β
and η. However, if we could find a consistent estimator for η, say η̂n , then a
natural strategy would be to derive an estimator for β that is the solution to
the estimating equation

     Σ_{i=1}^{n} m(Zi , β, η̂n ) = 0.     (9.3)

The estimating equation (9.3) is not a sum of iid quantities and hence the
resulting estimator is not, strictly speaking, an m-estimator. However, in many
situations, and certainly in all cases considered in this book, the estimator β̂n
that solves (9.3) will be asymptotically equivalent to the m-estimator β̂n∗ that
solves the equation
     Σ_{i=1}^{n} m(Zi , β, η0 ) = 0,

with η0 known, in the sense that n1/2 (β̂n − β̂n∗ ) → 0 in probability. Without going into detail,
this asymptotic equivalence occurs because m(Z, β0 , η0 ) = ϕ∗F (Z) is orthogo-
nal to the nuisance tangent space. We illustrated this asymptotic equivalence
for parametric models in Section 3.3 using equation (3.30) (also see Remark
4 of this section). Therefore, from here on, with a slight abuse of notation,
we will still refer to estimators such as those that solve (9.3) as m-estimators
with estimating function m(Z, β).  

With coarsened data by design, we use (9.1) to motivate the following


estimating equation:

     Σ_{i=1}^{n} [ I(Ci = ∞)m(Zi , β) / π(∞, Zi , ψ0 ) + L2 {Ci , GCi (Zi ), ψ0 } ] = 0.     (9.4)

The estimator that is the solution to the estimating equation (9.4) is re-
ferred to as an AIPWCC estimator. If we take the element L2 (·) to be iden-
tically equal to zero, then the estimating equation becomes

     Σ_{i=1}^{n} I(Ci = ∞)m(Zi , β) / π(∞, Zi , ψ0 ) = 0,

and the resulting estimator is an IPWCC estimator since only complete cases
are considered in the sum above (i.e., {i : Ci = ∞}), weighted by the inverse
probability of being a complete case. The term L2 {Ci , GCi (Zi ), ψ0 } allows con-
tributions to the sum by observations that are not complete (i.e., coarsened),
and this is referred to as the augmented term.
Using standard Taylor series expansions for m-estimators (which we leave
as an exercise for the reader), we can show that the influence function of the
estimator, derived by solving (9.4), is equal to

     [ −E{ ∂m(Zi , β0 )/∂β^T } ]^{−1} [ I(Ci = ∞)m(Zi , β0 ) / π(∞, Zi , ψ0 ) + L2 {Ci , GCi (Zi ), ψ0 } ]
          = I(Ci = ∞)ϕF (Zi ) / π(∞, Zi , ψ0 ) + L∗2 {Ci , GCi (Zi ), ψ0 },     (9.5)

where ϕF (Zi ) was defined in (9.2) and

     L∗2 = [ −E{ ∂m(Zi , β0 )/∂β^T } ]^{−1} L2 ∈ Λ2 .     (9.6)
Therefore, we can now summarize these results. If coarsening of the data
were by design with known coarsening probabilities π{r, Gr (Z), ψ0 }, for all
r, and we wanted to obtain an observed-data RAL estimator for β in a semi-
parametric model, we would proceed as follows.
1. Choose a full-data estimating function m(Z, β).
2. Choose an element of the augmentation space Λ2 as follows.
   a) For each r ≠ ∞, choose a function L2r {Gr (Z)}.
   b) Construct L2 {C, GC (Z), ψ0 } to equal

        [ I(C = ∞) / π(∞, Z, ψ0 ) ] Σ_{r≠∞} π{r, Gr (Z), ψ0 }L2r {Gr (Z)}
             − Σ_{r≠∞} I(C = r)L2r {Gr (Z)}.

3. Obtain the estimator for β by solving equation (9.4).


The resulting estimator, β̂n , under suitable regularity conditions, will be a
consistent, asymptotically normal RAL estimator for β with influence function
given by (9.5). The asymptotic variance of this estimator is, of course, the
variance of the influence function. Estimators for the asymptotic variance can
be obtained using the sandwich variance estimator (3.10) derived in Chapter
3 specifically for m-estimators.
We see from this construction that the estimator depends on the choice of
m(Z, β) and the functions L2r {Gr (Z)}, r ≠ ∞. Of course, we would want to
choose these functions so that the resulting estimator is as efficient as possible.
This issue will be the focus of Chapters 10 and 11.
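Since (9.4) is simply a system of q estimating equations in β, any standard root-finding routine can be used once its ingredients are supplied. The following is a minimal numerical sketch (ours, not from the text); the names data, m_fun, pi_fun, and L2_fun are illustrative placeholders for user-supplied implementations of the data structure, m(Z, β), π(∞, Z, ψ0 ), and L2 {C, GC (Z), ψ0 }.

```python
# Minimal sketch (not from the text): numerically solve the AIPWCC estimating
# equation (9.4) when the coarsening probabilities are known by design.
# `data` is a list of per-subject tuples (c, gc), where c is the coarsening level
# ("inf" for a complete case) and gc is the observed G_C(Z) (the full Z when c == "inf").
# m_fun and L2_fun return length-q arrays; pi_fun returns a scalar probability.
import numpy as np
from scipy.optimize import root

def aipwcc_equation(beta, data, m_fun, pi_fun, L2_fun):
    total = np.zeros(len(beta))
    for c, gc in data:
        if c == "inf":
            total += m_fun(gc, beta) / pi_fun(gc)   # IPWCC term for complete cases
        total += L2_fun(c, gc)                      # augmentation term L2{C, G_C(Z)}
    return total

def solve_aipwcc(beta_init, data, m_fun, pi_fun, L2_fun):
    sol = root(aipwcc_equation, np.asarray(beta_init, dtype=float),
               args=(data, m_fun, pi_fun, L2_fun))
    return sol.x
```

Taking L2_fun to return zeros recovers the IPWCC estimator discussed above.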

ψ unknown

The development above shows how we can take results regarding semipara-
metric estimators for the parameter β for full-data models and modify them to
estimate the parameter β with coarsened data (CAR) when the coarsen-
ing probabilities are known to us by design. In most problems, the coarsening
probabilities are not known and must be modeled using the unknown param-
eter ψ. We discussed models for the coarsening process and estimators for the
parameters in these models in Chapter 8. We also showed in Chapter 8 the im-
pact that such models have on the observed-data nuisance tangent space, its
orthogonal complement, and the space of observed-data influence functions.
If the parameter ψ is unknown, two issues emerge:
(i) The unknown parameter ψ must be estimated.
(ii) The influence function of an observed-data RAL estimator for β must be
an element in the space defined by (7.37) (i.e., involving a projection onto
the coarsening model tangent space Λψ ).
One obvious strategy for estimating β with coarsened data when ψ is
unknown is to find a consistent estimator for ψ and substitute this estimator
for ψ0 in the estimating equation (9.4). A natural estimator for ψ is obtained
by maximizing the coarsening model likelihood (8.8). The resulting MLE is
denoted by ψ̂n . The influence function of the estimator for β, obtained by
substituting the maximum likelihood estimator ψ̂n for ψ0 in equation (9.4),
is given by the following important theorem.
Theorem 9.1. If the coarsening process follows a parametric model, and if ψ
is estimated using the maximum likelihood estimator, say ψ̂n , or any efficient
estimator of ψ, then the solution to the estimating equation

     Σ_{i=1}^{n} [ I(Ci = ∞)m(Zi , β) / π(∞, Zi , ψ̂n ) + L2 {Ci , GCi (Zi ), ψ̂n } ] = 0     (9.7)

will be an estimator whose influence function is

     I(Ci = ∞)ϕF (Zi ) / π(∞, Zi , ψ0 ) + L∗2 {Ci , GCi (Zi ), ψ0 }
          − Π[ I(Ci = ∞)ϕF (Zi ) / π(∞, Zi , ψ0 ) + L∗2 {Ci , GCi (Zi ), ψ0 } | Λψ ],     (9.8)

where ϕF (·) and L∗2 (·) are defined by (9.2) and (9.6). We note that such an
influence function is indeed a member of the class of observed-data influence
functions given by (8.19).
For notational convenience, we denote a typical influence function, if the
parameter ψ is known, by

     ϕ̃{Ci , GCi (Zi ), ψ} = I(Ci = ∞)ϕF (Zi ) / π(∞, Zi , ψ) + L∗2 {Ci , GCi (Zi ), ψ},     (9.9)

and a typical influence function, if ψ is unknown, by

ϕ{Ci , GCi (Zi ), ψ} = ϕ̃{Ci , GCi (Zi ), ψ} − Π[ϕ̃{Ci , GCi (Zi ), ψ}|Λψ ].

Before giving the proof of the theorem above we present the following lemma.
Lemma 9.1.

     E[ ∂ ϕ̃{C, GC (Z), ψ0 } / ∂ψ^T ] = −E[ ϕ̃{C, GC (Z), ψ0 }SψT {C, GC (Z), ψ0 } ].     (9.10)

Proof. Lemma 9.1


We first note that the conditional expectation E[h{C, GC (Z)}|Z] for a typical
function h, as given by (7.35), only depends on the parameter ψ. Namely, this
conditional expectation equals

     Eψ (h|Z) = Σ_r hr {Gr (Z)}π{r, Gr (Z), ψ}.     (9.11)

Because of the definition of ϕ̃{C, GC (Z), ψ} given by (9.9) and the fact that
L∗2 {C, GC (Z), ψ} ∈ Λ2 , we obtain for any ψ, that

     Eψ [ ϕ̃{C, GC (Z), ψ} | Z ] = ϕF (Z),

which, by equation (9.11), equals

     Σ_r ϕ̃{r, Gr (z), ψ}π{r, Gr (z), ψ} = ϕF (z)   for all z, ψ,

where ϕF (z) does not include the parameter ψ. Therefore

     ∂/∂ψ^T [ Σ_r ϕ̃{r, Gr (z), ψ}π{r, Gr (z), ψ} ] = 0   for all z, ψ.     (9.12)

Differentiating the product inside the sum (9.12) and setting ψ = ψ0 yields

     Σ_r [ ∂ ϕ̃{r, Gr (z), ψ0 } / ∂ψ^T ] π{r, Gr (z), ψ0 }
          + Σ_r ϕ̃{r, Gr (z), ψ0 } [ {∂π{r, Gr (z), ψ0 }/∂ψ^T} / π{r, Gr (z), ψ0 } ] π{r, Gr (z), ψ0 } = 0,

or

     E[ ∂ ϕ̃{C, GC (Z), ψ0 } / ∂ψ^T | Z ] = −E[ ϕ̃{C, GC (Z), ψ0 }SψT {C, GC (Z), ψ0 } | Z ],

which, after taking unconditional expectations, implies

     E[ ∂ ϕ̃{C, GC (Z), ψ0 } / ∂ψ^T ] = −E[ ϕ̃{C, GC (Z), ψ0 }SψT {C, GC (Z), ψ0 } ].

We are now in a position to prove Theorem 9.1.


Proof. Theorem 9.1
The usual expansion of (9.7) about β0 , but keeping ψ̂n fixed, yields

     n1/2 (β̂n − β0 ) =
          n−1/2 Σ_{i=1}^{n} [ I(Ci = ∞)ϕF (Zi ) / π(∞, Zi , ψ̂n ) + L∗2 {Ci , GCi (Zi ), ψ̂n } ] + op (1),     (9.13)

where ϕF (Zi ) is given by (9.2) and L∗2 {Ci , GCi (Zi ), ψ} is given by (9.6). Now
we expand ψ̂n about ψ0 to obtain

     n1/2 (β̂n − β0 ) = n−1/2 Σ_{i=1}^{n} ϕ̃{Ci , GCi (Zi ), ψ0 }
          + [ n−1 Σ_{i=1}^{n} ∂ ϕ̃{Ci , GCi (Zi ), ψn∗ } / ∂ψ^T ] n1/2 (ψ̂n − ψ0 ) + op (1),     (9.14)

where ψn∗ is some intermediate value between ψ̂n and ψ0 . Since under usual
regularity conditions ψn∗ converges in probability to ψ0 , we obtain

     n−1 Σ_{i=1}^{n} ∂ ϕ̃{Ci , GCi (Zi ), ψn∗ } / ∂ψ^T  → E[ ∂ ϕ̃{Ci , GCi (Zi ), ψ0 } / ∂ψ^T ]  in probability.     (9.15)

Standard results for finite-dimensional parametric models, as derived in


Chapter 3, can be used to show that the influence function of the MLE ψ̂n is
given by
     [ E{ Sψ {C, GC (Z), ψ0 }SψT {C, GC (Z), ψ0 } } ]^{−1} Sψ {C, GC (Z), ψ0 },     (9.16)

where Sψ {C, GC (Z), ψ0 } is the score vector with respect to ψ defined in (8.15).
The influence function of ψ̂n given by (9.16), together with (9.15) and
(9.10) of Lemma 9.1, can be used to deduce that (9.14) is equal to

     n1/2 (β̂n − β0 )
          = n−1/2 Σ_{i=1}^{n} ( ϕ̃{Ci , GCi (Zi ), ψ0 } − E[ ϕ̃{Ci , GCi (Zi ), ψ0 }SψT {Ci , GCi (Zi ), ψ0 } ]
               × [ E{ Sψ {Ci , GCi (Zi ), ψ0 }SψT {Ci , GCi (Zi ), ψ0 } } ]^{−1} Sψ {Ci , GCi (Zi ), ψ0 } )
          + op (1).     (9.17)

The space Λψ is the linear space spanned by Sψ {C, GC (Z), ψ0 }. Therefore,


using results from Chapter 2 (equation (2.4)) for finding projections onto
finite-dimensional linear subspaces, we obtain that

     Π[ ϕ̃{C, GC (Z)} | Λψ ] = E(ϕ̃SψT ){E(Sψ SψT )}^{−1} Sψ {C, GC (Z)}.     (9.18)

Consequently, as a result of (9.17) and (9.18), the influence function of β̂n is


given by
ϕ̃{C, GC (Z), ψ0 } − Π[ϕ̃{C, GC (Z), ψ0 }|Λψ ],
where

     ϕ̃{C, GC (Z), ψ0 } = I(C = ∞)ϕF (Z) / π(∞, Z, ψ0 ) + L∗2 {C, GC (Z), ψ0 }.

Theorem 9.1 is important because it gives us a prescription for deriving


observed-data RAL estimators for β when coarsening probabilities need to be
modeled and estimated. Summarizing these results, if the data are coarsened
at random with coarsening probabilities that need to be modeled, and we
want to obtain an observed-data RAL estimator for β in a semiparametric
model, we would proceed as follows.
1. Choose a model for the coarsening probabilities in terms of a parameter
ψ, namely π{r, Gr (Z), ψ}, for all r. In Section 8.1, we discussed how such
models can be derived when there are two levels of coarsening or when
the coarsening is monotone.
2. Using the observed data, estimate the parameter ψ using maximum like-
lihood (that is, by maximizing (8.8)), and denote the estimator by ψ̂n .
3. Choose a full-data estimating function m(Z, β).
4. Choose an element of the augmentation space Λ2 by
   a) for each r ≠ ∞, choosing a function L2r {Gr (Z)} and
   b) constructing L2 {C, GC (Z), ψ̂n } to equal

        [ I(C = ∞) / π(∞, Z, ψ̂n ) ] Σ_{r≠∞} π{r, Gr (Z), ψ̂n }L2r {Gr (Z)}
             − Σ_{r≠∞} I(C = r)L2r {Gr (Z)}.

5. Obtain the estimator for β by solving equation (9.7).


The resulting estimator, β̂n , under suitable regularity conditions, will be a
consistent, asymptotically normal RAL estimator for β with influence function
given by (9.8).

Interesting Fact

If the parameter ψ0 was known to us and we used equation (9.4) to derive


an estimator for β, then, since this estimator is asymptotically linear, the
asymptotic variance would be the variance of the influence function (9.5),
namely

     var[ I(C = ∞)ϕF (Z) / π(∞, Z, ψ0 ) + L∗2 {C, GC (Z), ψ0 } ].
If, however, ψ was estimated using the MLE ψ̂n and equation (9.7) was used to
derive an estimator for β, then the asymptotic variance would be the variance
of its influence function, namely

     var[ I(C = ∞)ϕF (Z) / π(∞, Z, ψ0 ) + L∗2 {C, GC (Z), ψ0 }
          − Π{ I(C = ∞)ϕF (Z) / π(∞, Z, ψ0 ) + L∗2 {C, GC (Z), ψ0 } | Λψ } ].

By the Pythagorean theorem, the variance of the second estimator must be


less than or equal to the variance of the first. This leads to the interesting and
perhaps unintuitive result that, even if we know the coarsening probabilities
(say we coarsen the data by design), we would still obtain a more efficient
estimator for β by estimating the parameter ψ in a model that contains the
truth and substituting this estimator for ψ in equation (9.4) rather than using
the true ψ0 itself.
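The following small simulation (ours, not from the text; the logistic form of π, the normal data, and the sample sizes are all illustrative assumptions) shows this phenomenon in the simplest setting: two levels of missingness, β = E(Z2 ), and the IPWCC estimator with no augmentation term.

```python
# A minimal simulation sketch (not from the text).  Z = (Z1, Z2), beta = E(Z2),
# and P(R = 1 | Z1) = pi(Z1) is known by design.  The IPWCC estimator that plugs
# in the *estimated* pi (via a fitted logistic model) is compared with the one
# that uses the *true* pi.
import numpy as np

rng = np.random.default_rng(0)

def true_pi(z1):
    return 1.0 / (1.0 + np.exp(-(0.5 + z1)))      # P(R = 1 | Z1), known by design

def one_replication(n=2000):
    z1 = rng.normal(size=n)
    z2 = z1 + rng.normal(size=n)                  # full data; beta0 = E(Z2) = 0
    r = rng.binomial(1, true_pi(z1))              # complete-case indicator

    beta_true_pi = np.sum(r * z2 / true_pi(z1)) / n   # IPWCC with true weights

    # MLE of psi = (psi0, psi1) for the logistic missingness model (Newton steps)
    psi = np.zeros(2)
    X = np.column_stack([np.ones(n), z1])
    for _ in range(25):
        p = 1.0 / (1.0 + np.exp(-X @ psi))
        psi += np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X, X.T @ (r - p))
    pi_hat = 1.0 / (1.0 + np.exp(-X @ psi))

    beta_est_pi = np.sum(r * z2 / pi_hat) / n     # IPWCC with estimated weights
    return beta_true_pi, beta_est_pi

sims = np.array([one_replication() for _ in range(500)])
print("var using true pi:     ", sims[:, 0].var())
print("var using estimated pi:", sims[:, 1].var())   # typically smaller
```

Informally, because Z1 is correlated with Z2 , the fitted weights recover some of the information in the incomplete cases; this is the projection onto Λψ that the known-ψ estimator ignores.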

Estimating the Asymptotic Variance

The asymptotic variance of the RAL estimator β̂n is the variance of the influ-
ence function (9.8), which we denote by Σ. An estimator for the asymptotic
variance, Σ̂n , can be obtained using a sandwich variance estimator. For com-
pleteness, we now describe how to construct this estimator:
     Σ̂n = [ Ê{ ∂m(Z, β̂n )/∂β^T } ]^{−1}
          × [ n−1 Σ_{i=1}^{n} g{Ci , GCi (Zi ), ψ̂n , β̂n }g^T {Ci , GCi (Zi ), ψ̂n , β̂n } ]
          × ( [ Ê{ ∂m(Z, β̂n )/∂β^T } ]^{−1} )^T ,     (9.19)

where

     Ê{ ∂m(Z, β̂n )/∂β^T } = n−1 Σ_{i=1}^{n} I(Ci = ∞){ ∂m(Zi , β̂n )/∂β^T } / π(∞, Zi , ψ̂n ),

     g{Ci , GCi (Zi ), ψ̂n , β̂n } =
          q{Ci , GCi (Zi ), ψ̂n , β̂n } − Ê(qSψT ){Ê(Sψ SψT )}^{−1} Sψ {Ci , GCi (Zi ), ψ̂n },

     q{Ci , GCi (Zi ), ψ̂n , β̂n } = I(Ci = ∞)m(Zi , β̂n ) / π(∞, Zi , ψ̂n ) + L2 {Ci , GCi (Zi ), ψ̂n },

     Ê(qSψT ) = n−1 Σ_{i=1}^{n} q{Ci , GCi (Zi ), ψ̂n , β̂n }SψT {Ci , GCi (Zi ), ψ̂n },

and

     Ê(Sψ SψT ) = n−1 Σ_{i=1}^{n} Sψ {Ci , GCi (Zi ), ψ̂n }SψT {Ci , GCi (Zi ), ψ̂n }.
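To make the bookkeeping in (9.19) concrete, here is a small sketch (ours, not from the text) that assembles Σ̂n from per-subject quantities; the array names and shapes are our own conventions.

```python
# Sketch (ours): sandwich variance estimator (9.19).
# q_arr  : (n, q) array, rows q{C_i, G_Ci(Z_i), psihat, betahat}
# S_psi  : (n, p) array, rows S_psi{C_i, G_Ci(Z_i), psihat}
# dm_arr : (n, q, q) array, rows I(C_i = infinity) {dm(Z_i, betahat)/dbeta^T} / pi_i
import numpy as np

def sandwich_variance(q_arr, S_psi, dm_arr):
    n = q_arr.shape[0]
    E_dm = dm_arr.mean(axis=0)                         # Ê{dm/dbeta^T}
    E_qS = q_arr.T @ S_psi / n                         # Ê(q S_psi^T)
    E_SS = S_psi.T @ S_psi / n                         # Ê(S_psi S_psi^T)
    g = q_arr - S_psi @ np.linalg.solve(E_SS, E_qS.T)  # g_i = q_i - Ê(qS^T)Ê(SS^T)^{-1}S_i
    meat = g.T @ g / n                                 # n^{-1} sum_i g_i g_i^T
    bread = np.linalg.inv(E_dm)
    return bread @ meat @ bread.T                      # Sigma_hat_n; var(betahat) ~ Sigma_hat_n / n
```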

9.2 Additional Results Regarding Monotone Coarsening


The Augmentation Space Λ2 with Monotone Coarsening

In deriving arbitrary semiparametric estimators for β with coarsened data,


whether the coarsening process was by design, ψ0 known, or whether the
parameter ψ describing the coarsening process had to be estimated, a key
component of the estimating equation, either (9.4) or (9.7), was the augmen-
tation term L2 {C, GC (Z)} ∈ Λ2 . In (7.37), we gave a general representation
for such an arbitrary element of Λ2 when the data are coarsened at random.
However, in the special case when we have monotone coarsening, it will be
convenient to derive another equivalent representation for the elements in Λ2 .
This representation uses discrete hazards as defined in (8.2) and is given in
the following theorem.

Theorem 9.2. Under monotone coarsening, a typical element of Λ2 can be


written as

     Σ_{r≠∞} [ {I(C = r) − λr {Gr (Z)}I(C ≥ r)} / Kr {Gr (Z)} ] Lr {Gr (Z)},     (9.20)

where λr {Gr (Z)} and Kr {Gr (Z)} are defined by (8.2) and (8.4), respectively,
and Lr {Gr (Z)} denotes an arbitrary function of Gr (Z) for r ≠ ∞.
Remark 2. Equation (9.20) is made up of a sum of mean-zero conditionally
uncorrelated terms; i.e., it has a martingale structure. We will take advantage
of this structure in the next chapter when we derive the more efficient double
robust estimators.  
Proof. Using (7.37), we note that a typical element of Λ2 can be written as

     Σ_{r≠∞} [ I(C = r) − I(C = ∞)π{r, Gr (Z)} / π(∞, Z) ] L2r {Gr (Z)},     (9.21)

where L2r {Gr (Z)} denotes an arbitrary function of Gr (Z) for r ≠ ∞. To
simplify notation, let λr = λr {Gr (Z)}, Kr = Kr {Gr (Z)}, Lr = Lr {Gr (Z)},
and L2r = L2r {Gr (Z)}; also write πr = π{r, Gr (Z)}, and note that πr = λr Kr−1 .
We will prove that there is a one-to-one relationship
between the elements in (9.20) and the elements of (9.21), specifically that

     L2r = Lr / Kr−1 − Σ_{j=1}^{r−1} (λj / Kj ) Lj   for all r ≠ ∞,     (9.22)

and conversely

     Lr = Kr−1 L2r + Σ_{j=1}^{r−1} πj L2j   for all r ≠ ∞.     (9.23)

We prove (9.23) by induction. For r = 1, notice that K0 = 1, and hence by


(9.22), L1 = L21 . Now suppose that (9.23) holds for any i ≤ r. Then for r + 1,
we have from (9.22) that

     L2(r+1) = Lr+1 / Kr − Σ_{i=1}^{r} (λi / Ki ) Li .

Therefore,

     Lr+1 = Kr L2(r+1) + Σ_{i=1}^{r} (Kr λi / Ki ) Li
          = Kr L2(r+1) + Σ_{i=1}^{r} (Kr λi / Ki ) ( Ki−1 L2i + Σ_{j=1}^{i−1} πj L2j )
          = Kr L2(r+1) + Σ_{i=1}^{r} (Kr πi / Ki ) L2i + Σ_{i=1}^{r} Σ_{j=1}^{i−1} (Kr λi πj / Ki ) L2j .

Interchange the order of summation,

     Lr+1 = Kr L2(r+1) + Σ_{i=1}^{r} (Kr πi / Ki ) L2i + Kr Σ_{j=1}^{r−1} πj L2j Σ_{i=j+1}^{r} (λi / Ki )
          = Kr L2(r+1) + Kr Σ_{j=1}^{r} πj L2j ( 1/Kj + Σ_{i=j+1}^{r} λi /Ki ).

Note that

     1/Kj + λj+1 /Kj+1 = (1 − λj+1 )/Kj+1 + λj+1 /Kj+1 = 1/Kj+1 ,

and hence

     1/Kj + Σ_{i=j+1}^{r} λi /Ki = 1/Kr .

Therefore,

     Lr+1 = Kr L2(r+1) + Σ_{j=1}^{r} πj L2j .

Substituting (9.23) into (9.20) yields

     Σ_{r≠∞} [ {I(C = r) − λr I(C ≥ r)} / Kr ] ( Kr−1 L2r + Σ_{j=1}^{r−1} πj L2j )
          = Σ_{r≠∞} [ {I(C = r) − λr I(C ≥ r)} / Kr ] Kr−1 L2r
               + Σ_{j≠∞} Σ_{j+1≤r≠∞} [ {I(C = r) − λr I(C ≥ r)} / Kr ] πj L2j
          = Σ_{r≠∞} [ {I(C = r) − λr I(C ≥ r)} / Kr ] Kr−1 L2r
               + Σ_{r≠∞} Σ_{r+1≤j≠∞} [ {I(C = j) − λj I(C ≥ j)} / Kj ] πr L2r .

But since I(C = j) = I(C ≥ j) − I(C ≥ j + 1), we have that

     Σ_{r+1≤j≠∞} [ {I(C = j) − λj I(C ≥ j)} / Kj ]
          = Σ_{r+1≤j≠∞} [ I(C ≥ j)/Kj−1 − I(C ≥ j + 1)/Kj ]
          = I(C ≥ r + 1)/Kr − I(C = ∞)/π∞ .

Therefore, (9.20) can be written as

     Σ_{r≠∞} [ I(C = r)Kr−1 /Kr − I(C ≥ r)πr /Kr + I(C ≥ r + 1)πr /Kr − I(C = ∞)πr /π∞ ] L2r
          = Σ_{r≠∞} [ I(C = r)Kr−1 /Kr − I(C = r)πr /Kr − I(C = ∞)πr /π∞ ] L2r
          = Σ_{r≠∞} [ I(C = r) − I(C = ∞)πr /π∞ ] L2r ,

which is exactly (9.21). Similarly, substituting (9.22) into (9.21) yields (9.20).
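As a concrete illustration of the representation (9.20) just established, the following sketch (ours, not from the text; the coding of levels, the scalar L values, and the variable names are our own) evaluates the augmentation term for one subject from fitted discrete hazards and chosen functions Lr .

```python
# Sketch (ours): evaluate the augmentation term (9.20) for one subject under
# monotone coarsening with levels r = 1, ..., l-1; a complete case is coded C = None
# (standing in for C = infinity).  lam[r] and L[r] hold the subject-specific values
# of lambda_r{G_r(Z)} and the chosen L_r{G_r(Z)} (scalars here for simplicity).
def augmentation_term(C, lam, L, l):
    term = 0.0
    K = 1.0                                               # K_0 = 1
    for r in range(1, l):
        if C is not None and r > C:
            break                                         # remaining terms are all zero
        K *= 1.0 - lam[r]                                 # K_r = prod_{j <= r} (1 - lambda_j)
        at_risk = 1.0 if (C is None or C >= r) else 0.0   # I(C >= r)
        jump = 1.0 if C == r else 0.0                     # I(C = r)
        term += (jump - lam[r] * at_risk) / K * L[r]
    return term
```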



Example 1. Longitudinal data with monotone missingness


Consider the following problem. A promising new drug for patients with HIV
disease is to be evaluated against a control treatment in a randomized clinical
trial. A sample of n patients with HIV disease are randomized with equal
probability (.5) to receive either the new treatment (X = 1) or the control
treatment (X = 0). The primary endpoint used to evaluate the treatments
is change in CD4 counts measured over time. Specifically, CD4 counts are to
be measured at time points t1 < . . . < tl , where t1 = 0 denotes the time of
entry into the study when a baseline CD4 count measurement is taken and
t2 < . . . < tl are the subsequent times after treatment when CD4 counts will
be measured. For example, CD4 count measurements may be taken every six
months, with a final measurement at two years. The data for individual i
can be summarized as Zi = (YiT , Xi )T , where Yi = (Y1i , . . . , Yli )T , with Yji
denoting the CD4 count measurement for patient i at time tj and Xi denoting
the treatment indicator to which patient i was assigned.
Suppose that it is generally believed that, after treatment is initiated,
CD4 counts will roughly follow a linear trajectory over time. Therefore, a
linear model
     E(Yi^{l×1} |Xi ) = H^{l×3} (Xi )β^{3×1} ,     (9.24)

is used to describe the data, where the design matrix H(Xi ) is an l × 3
matrix with elements Hjj′ (Xi ), j = 1, . . . , l, j′ = 1, 2, 3, and Hj1 (Xi ) = 1,
Hj2 (Xi ) = tj , and Hj3 (Xi ) = Xi tj . This model implies that the mean CD4
count at time tj is given by

E(Yji |Xi ) = β1 + β2 tj + β3 Xi tj .

Hence, if Xi = 0, then the expected CD4 count is β1 +β2 tj , whereas if Xi = 1,


then the expected CD4 count is β1 + (β2 + β3 )tj . This reflects the belief that
CD4 response follows a linear trajectory after treatment is initiated. Because
of randomization, the mean baseline response at time t1 = 0 equals β1 , the
same for both treatments, but the slope of the trajectory may depend on

treatment. The parameter β3 reflects the strength of the treatment effect,


where β3 = 0 corresponds to the null hypothesis of no treatment effect.
The model (9.24) is an example of a restricted moment model, where
E(Y |X) = µ(X, β) = H(X)β. If the data were collected on everyone (i.e.,
full data), then according to the results developed in Section 4.5, a full-data
estimating function would in general be based on A3×l (X){Y − µ(X, β)},
and the optimal estimating function would be DT (X)V −1 (X){Y − µ(X, β)},
where D(X) = ∂µ(X, β)/∂β^T and V (X) = var(Y |X). In this example, D(X) =
H(X). Although we would expect the longitudinal CD4 count measurements
to be correlated, for simplicity we use a working variance function V (X) =
σ 2 I l×l . This leads to the full-data estimating function m(Z, β) = H T (X){Y −
H(X)β}, and the corresponding full-data estimator would be obtained by
solving the estimating equation

     Σ_{i=1}^{n} H^T (Xi ){Yi − H(Xi )β} = 0.

In this study, however, some patients dropped out during the course of the
study, in which case we would observe the data up to the time they dropped
out but all subsequent CD4 count measurements would be missing. This is
an example of monotone coarsening as described in Section 8.1, where there
are l − 1 levels of coarsening and where Gr (Zi ) = (Xi , Y1i , . . . , Yri )T , r =
1, . . . , l − 1 and G∞ (Zi ) = (Xi , Y1i , . . . , Yli )T . Because dropout was not by
design, we need to model the coarsening probabilities in order to derive AIP-
WCC estimators for β. Since the data are monotonically coarsened, it is more
convenient to model the discrete hazard function. Assuming the coarsening is
CAR, we consider a series of logistic regression models similar to (8.7), namely

     λr {Gr (Z)} = exp(ψ0r + ψ1r Y1 + . . . + ψrr Yr + ψ(r+1)r X) / {1 + exp(ψ0r + ψ1r Y1 + . . . + ψrr Yr + ψ(r+1)r X)}.     (9.25)

Note 1. The only difference between equations (8.7) and (9.25) is the inclusion
of the treatment indicator X. 
Therefore, the parameter ψ in this example is ψ = (ψ0r , . . . , ψ(r+1)r , r =
1, . . . , l − 1)T . The MLE ψ̂n is obtained by maximizing the likelihood (8.13).
We now have most of the components necessary to construct an AIP-
WCC estimator for β. We still need to define an element of the augmen-
tation space Λ2 . In accordance with Theorem 9.2, for monotonically coars-
ened data, we must choose a function Lr {Gr (Z)}, r = 1, . . . , l − 1 (that is,
a function Lr (X, Y1 , . . . , Yr )) and then use (9.20) to construct an element
L2 {C, GC (Z)} ∈ Λ2 .
Putting all these different elements together, we can derive an observed-
data RAL estimator for β by using the results of Theorem 9.1, equation (9.7),
by solving the estimating equation

     Σ_{i=1}^{n} [ I(Ci = ∞)H^T (Xi ){Yi − H(Xi )β} / π(∞, Zi , ψ̂n )     (9.26)
          + Σ_{r=1}^{l−1} [ {I(Ci = r) − λr {Gr (Zi ), ψ̂n }I(Ci ≥ r)} / Kr {Gr (Zi ), ψ̂n } ] Lr {Gr (Zi )} ] = 0,

where

     λr {Gr (Zi ), ψ̂n } = exp(ψ̂0r + ψ̂1r Y1i + . . . + ψ̂rr Yri + ψ̂(r+1)r Xi ) / {1 + exp(ψ̂0r + ψ̂1r Y1i + . . . + ψ̂rr Yri + ψ̂(r+1)r Xi )},

     Kr {Gr (Zi ), ψ̂n } = ∏_{r′=1}^{r} 1 / {1 + exp(ψ̂0r′ + ψ̂1r′ Y1i + . . . + ψ̂r′r′ Yr′i + ψ̂(r′+1)r′ Xi )},

and

     π(∞, Zi , ψ̂n ) = Kl−1 {Gl−1 (Zi ), ψ̂n }.
The estimator β̂n , the solution to (9.26), will be an observed-data RAL esti-
mator for β that is consistent and asymptotically normal, assuming, of course,
that the model for the discrete hazard functions was correctly specified. An
estimator for the asymptotic variance of β̂n can be obtained
by using the sandwich variance estimator given by (9.19).
The efficiency of the estimator will depend on the choice for m(Zi , β) and
the choice for Lr (Xi , Y1i , . . . , Yri ), r = 1, . . . , l − 1. For illustration, we chose
m(Zi , β) = H T (Xi ){Yi − H(Xi )β}, but this may or may not be a good choice.
Also, we did not discuss choices for Lr (Xi , Y1i , . . . , Yri ), r = 1, . . . , l − 1. If
Lr (·) were set equal to zero, then the corresponding estimator would be the
IPWCC estimator. A more detailed discussion on choices for these functions
and how they affect the efficiency of the resulting estimator will be given in
Chapters 10 and 11.  
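To fix ideas, here is a sketch (ours, not from the text) of the simplest version of this estimator, the IPWCC estimator obtained by setting Lr (·) = 0 in (9.26), with the discrete hazards (9.25) fit by separate logistic regressions at each dropout level. The variable names, the NaN coding of dropout, and the optimizer are our own assumptions; an augmentation term could be added along the lines of the sketch given after the proof of Theorem 9.2.

```python
# Sketch (ours): IPWCC estimator for the CD4 example, i.e., (9.26) with L_r = 0.
# Y: (n, l) array of CD4 counts with NaN after dropout (Y_1 always observed,
#    missingness monotone); X: (n,) treatment indicator; t: (l,) measurement times.
import numpy as np
from scipy.optimize import minimize

def fit_logistic(W, d):
    """ML fit of P(d = 1 | W) = expit(W @ psi)."""
    def negloglik(psi):
        eta = W @ psi
        return np.sum(np.log1p(np.exp(eta)) - d * eta)
    return minimize(negloglik, np.zeros(W.shape[1]), method="BFGS").x

def ipwcc_cd4(Y, X, t):
    n, l = Y.shape
    # dropout level: C_i = r if Y_{r+1} is the first missing response; complete cases get C_i = l
    C = np.where(np.isnan(Y).any(axis=1), np.isnan(Y).argmax(axis=1), l)
    K = np.ones(n)                               # running K_r = prod_{j<=r} (1 - lambda_j)
    for r in range(1, l):                        # one discrete-hazard model (9.25) per level
        at_risk = C >= r
        W = np.column_stack([np.ones(at_risk.sum()), Y[at_risk, :r], X[at_risk]])
        d = (C[at_risk] == r).astype(float)
        psi_hat = fit_logistic(W, d)
        lam = 1.0 / (1.0 + np.exp(-(W @ psi_hat)))
        K[at_risk] *= 1.0 - lam
    complete = C == l
    w = complete / K                             # I(C_i = infinity) / pi(infinity, Z_i, psihat)
    H = np.stack([np.column_stack([np.ones(l), t, x * t]) for x in X])   # (n, l, 3)
    A = sum(w[i] * H[i].T @ H[i] for i in range(n))
    b = sum(w[i] * H[i].T @ Y[i] for i in range(n) if complete[i])
    return np.linalg.solve(A, b)                 # solves sum_i w_i H_i^T (Y_i - H_i beta) = 0
```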

Since we are on the topic of monotone coarsening, we take this opportunity


to note that in survival analysis the survival data are often right censored.
The notion of right censoring was introduced in Section 5.2, where we derived
semiparametric estimators for the proportional hazards model. Right censor-
ing can be viewed as a specific example of monotone coarsening. That is, if
data in a survival analysis are right censored, then we don’t observe any data
subsequent to the censoring time. The main difference between right censor-
ing for survival data and the monotone coarsening presented thus far is that
the censoring time for survival data is continuous, taking on an uncountably
infinite number of values, whereas we have only considered coarsening models
with a finite number of coarsened configurations. Nonetheless, we can make
the analogy from monotone coarsening to censoring in survival analysis by

considering continuous-time hazard rates instead of discrete hazard probabil-


ities.
In the next section, we give some of the analogous results for censored
data. We show how to cast a censored-data problem as a monotone coarsen-
ing problem and we derive a typical influence function of an observed-data
(censored-data) estimator for a parameter β in terms of the influence
function of a full-data (uncensored) estimator for β. Much of the exposition
that follows is motivated by the work in the landmark paper of Robins and
Rotnitzky (1992). To follow the results in the next section, the reader must be
familiar with counting processes and the corresponding martingale processes
used in an advanced course in censored survival analysis. If the reader does
not have this background, then this section can be skipped without having it
affect the reading of the remainder of the book.

9.3 Censoring and Its Relationship to Monotone Coarsening
In survival analysis, full data for a single individual can be summarized as
{T, X̄(T )}, where T denotes the survival time, X(u) denotes the value of
covariates (possibly time-dependent) measured at time u, and X̄(T ) is the
history of time-dependent covariates X(u), u ≤ T . As always, the primary
focus is to estimate parameters β that characterize important aspects of the
distribution of {T, X̄(T )}. We will assume that we know how to find estimators
for the parameter of interest if we had full data {Ti , X̄i (Ti )}, i = 1, . . . , n.
Example 2. As an example, consider the following problem. Suppose we are
interested in estimating the mean medical costs for patients with some illness
during the duration of their illness. For patient i in our sample, let Ti denote
the duration of their illness and let Xi (u) denote the accumulated hospital
costs incurred for patient i at time u (measured as the time from the be-
ginning of illness). Clearly, Xi (u) is a nondecreasing function of u, and the
total medical cost for patient i is Xi (Ti ). The parameter of interest is given
by β = E{X(T )}. If we make no assumptions regarding the joint distribu-
tion of {T, X̄(T )}, we showed in Theorem 4.4 that the tangent space for this
nonparametric model is the entire Hilbert space. Consequently, using argu-
ments in Section 5.3, where we derived the influence function for the mean
of a random variable under a nonparametric model, there can be at most
one influence function of an RAL estimator for β. With a sample of iid data
{Ti , X̄i (Ti )}, i = 1, . . . , n, the obvious estimator for the mean medical costs
is

     β̂n = n−1 Σ_{i=1}^{n} Xi (Ti ),

which has influence function X(T ) − E{X(T )}. If, however, the duration of
illness is right censored for some individuals, then this problem becomes more

difficult. We will use this example for illustration as we develop censored data
estimators. 


In actuality, we often don’t observe the full data because of censoring, pos-
sibly because of incomplete follow-up of the patients due to staggered entry
and finite follow-up, or because the patients drop out of the study prema-
turely. To accommodate censoring, we introduce a censoring variable C̃ that
corresponds to the time at which an individual would be censored from the
study.

Remark 3. Notational convention


Since we have been using C (scripted C) to denote the different levels of
coarsened data and because censoring is typically denoted by the variable C
(unscripted), the difference may be difficult to discern. Therefore, we will use
C̃ (i.e., C with a tilde over it) to denote the censoring variable from here on.



We assume that underlying any problem in survival analysis there are


unobservable latent random variables

{C̃i , Ti , X̄i (Ti )}, i = 1, . . . , n.

The joint density pC̃,T,X̄(T ) {c, t, x̄(t)} can be written as

pC̃|T,X̄(T ) {c|t, x̄(t)}pT,X̄(T ) {t, x̄(t)},

where pT,X̄(T ) {t, x̄(t)} denotes the density of the full data had we been able
to observe them.
The observed (coarsened) data for this problem are denoted as

{U, ∆, X̄(U )},

where U = min(T, C̃) (observed time on study), ∆ = I(T ≤ C̃) (censoring


indicator), and X̄(U ) is the history of the time-dependent covariates while
on study. Coarsening of the data in this case is related to the fact that we
don’t observe the full data because of censoring. We will show that censoring
can be mapped to a form of monotone coarsening. The coarsening variable C,
which we took previously to be discrete, is now a continuous variable because
of continuous-time censoring. A complete case, C = ∞, corresponds to ∆ = 1
or, equivalently, to (T ≤ C̃).
To make the connection between censoring and the coarsening notation
used previously, we define (C = r) to be (C̃ = r, T > r) and, when C =
r, we observe Gr {T, X̄(T )} = [X̄{min(r, T )}, T I(T ≤ r)] for r < ∞ and
G∞ {T, X̄(T )} = {T, X̄(T )}. With this notation

Gr {T, X̄(T )} = G∞ {T, X̄(T )} = {T, X̄(T )} whenever r ≥ T.



Nonetheless, this still satisfies all the assumptions of monotone coarsening.


Therefore, the observed data can be expressed as

[C, GC {T, X̄(T )}],

where
[C = r, Gr {T, X̄(T )}] = {C̃ = r, T > r, X̄(r)}
and
[C = ∞, G∞ {T, X̄(T )}] = {T ≤ C̃, T, X̄(T )}.
With monotone coarsening, we argued that it is more convenient to work with
hazard functions in describing the coarsening probabilities. With a slight abuse
of notation, the coarsening hazard function is given as

λr {T, X̄(T )} = P {C = r|C ≥ r, T, X̄(T )}, r < ∞. (9.27)

Remark 4. Because the censoring variable C̃ is a continuous random variable,


so is the corresponding coarsening variable C. Therefore, the hazard function
given by (9.27), which strictly speaking is used for discrete coarsening, has to
be defined in terms of a continuous-time hazard function; namely,

     λr {T, X̄(T )} = lim_{h→0} h−1 P {r ≤ C < r + h|C ≥ r, T, X̄(T )}, r < ∞.

However, unless we need further clarification, it will be convenient to continue


with this abuse of notation. 

The event C ≥ r, which includes C = ∞, is equal to

(C ≥ r) = (C̃ ≥ r, T > C̃) ∪ (T < C̃). (9.28)

Therefore,

λr {T, X̄(T )} = P [C̃ = r, T ≥ r|{(C̃ ≥ r, T > C̃) ∪ (T < C̃)}, T, X̄(T )].
(9.29)
If T < r, then (9.29) must equal zero, whereas if T ≥ r, then {(C̃ ≥ r, T >
C̃) ∪ (T < C̃)} ∩ (T ≥ r) = (C̃ ≥ r). Consequently,

     λr {T, X̄(T )} = P {C̃ = r|C̃ ≥ r, T, X̄(T )} I(T ≥ r),

where P {C̃ = r|C̃ ≥ r, T, X̄(T )} is the hazard function of censoring at time r given T, X̄(T ).

If, in addition, we make the coarsening at random (CAR) assumption,

λr {T, X̄(T )} = λr [Gr {T, X̄(T )}],

then
λr [Gr {T, X̄(T )}] = P {C̃ = r|C̃ ≥ r, T ≥ r, X̄(r)}I(T ≥ r).

Let us denote the hazard function for censoring by

λC̃ {r, X̄(r)} = P {C̃ = r|C̃ ≥ r, T ≥ r, X̄(r)}.

Then
λr [Gr {T, X̄(T )}] = λC̃ {r, X̄(r)}I(T ≥ r). (9.30)
In order to construct estimators for a full-data parameter using coars-
ened data, such as those given by (9.4), we need to compute the probability
of a complete case π[∞, G∞ {T, X̄(T )}] = P {C = ∞|T, X̄(T )} = P {∆ =
1|T, X̄(T )} and a typical element of the augmentation space, Λ2 . We now
show how these are computed with censored data using the hazard function
for the censoring time and counting process notation.

Probability of a Complete Case with Censored Data

For discrete monotone coarsening, we showed in (8.6) how the probability of


a complete case C = ∞ can be written in terms of the discrete hazards. For a
continuous-time hazard function, the analogous relationship is given by

     π[∞, G∞ {T, X̄(T )}] = P {∆ = 1|T, X̄(T )} = ∏_{r<∞} ( 1 − λr [Gr {T, X̄(T )}]dr )
          = exp( − ∫_0^∞ λr [Gr {T, X̄(T )}]dr )
          = exp( − ∫_0^∞ λC̃ {r, X̄(r)}I(T ≥ r)dr )
          = exp( − ∫_0^T λC̃ {r, X̄(r)}dr ).     (9.31)

The Augmentation Space, Λ2 , with Censored Data

In equation (9.20) of Theorem 9.2, we showed that an arbitrary element of


the augmentation space, Λ2 , with monotone coarsening can be written as

     Σ_{r≠∞} [ {I(C = r) − λr [Gr {T, X̄(T )}]I(C ≥ r)} / Kr [Gr {T, X̄(T )}] ] Lr [Gr {T, X̄(T )}].     (9.32)

Using counting process notation, we denote the counting process correspond-


ing to the number of observed censored observations up to and including time
r by NC̃ (r) = I(U ≤ r, ∆ = 0) = I(C̃ ≤ r, T > C̃). Consequently,

I(C = r) = I(C̃ = r, T > r) = dNC̃ (r),

where dNC̃ (r) denotes the increment of the counting process. Using (9.30),
we obtain

λr [Gr {T, X̄(T )}]I(C ≥ r) = λC̃ {r, X̄(r)}I(T ≥ r)I(C̃ ≥ r).

Letting Y (r) denote the at-risk indicator, Y (r) = I(U ≥ r) = I(T ≥ r, C̃ ≥ r),
we obtain
λr [Gr {T, X̄(T )}]I(C ≥ r) = λC̃ {r, X̄(r)}Y (r).
Because the elements in the sum of (9.32) are nonzero only if T ≥ r and C̃ ≥ r,
it suffices to define Lr [Gr {T, X̄(T )}] and Kr [Gr {T, X̄(T )}] as Lr {X̄(r)} and
Kr {X̄(r)}, respectively. Moreover,

     Kr {X̄(r)} = ∏_{u≤r} ( 1 − λC̃ {u, X̄(u)}du ) = exp( − ∫_0^r λC̃ {u, X̄(u)}du ).     (9.33)
Consequently, a typical element of Λ2 , given by (9.32), can be written
using stochastic integrals of counting process martingales, namely

     ∫_0^∞ [ dMC̃ {r, X̄(r)} / Kr {X̄(r)} ] Lr {X̄(r)},

where
dMC̃ {r, X̄(r)} = dNC̃ (r) − λC̃ {r, X̄(r)}Y (r)dr
is the usual counting process martingale increment and Kr {X̄(r)} is given by
(9.33).

Deriving Estimators with Censored Data

Suppose we want to estimate the parameter β in a survival problem with


censored data {Ui , ∆i , X̄i (Ui )}, i = 1, . . . , n. Let us assume that we know
how to estimate β if we had full data Zi = {Ti , X̄i (Ti )}, i = 1, . . . , n; say,
for example, we had an unbiased estimating function m(Zi , β) that we could
use as a basis for deriving an m-estimator. In addition, let us assume that
we knew the hazard function for censoring, λC̃ {r, X̄(r)}. Since censoring is a
special case of coarsened data, we could use (9.4) to obtain estimators for β
with censored data.
Specifically, using the notation developed above, an AIPWCC estimator
for β can be derived by solving the estimating equation

     Σ_{i=1}^{n} [ ∆i m(Zi , β) / KUi {X̄i (Ui )} + ∫_0^∞ [ dMC̃i {r, X̄i (r)} / Kr {X̄i (r)} ] Lr {X̄i (r)} ] = 0,     (9.34)

where Kr {X̄(r)} is given by (9.33).


We now return to Example 2, where the goal was to estimate the mean
medical costs of patients with some illness during the course of their illness.
We defined the parameter of interest as β = E{X(T )}, where X(T ) was the

accumulated medical costs of a patient during the time T of his or her illness.
As noted previously, with full data there is only one influence function of
RAL estimators for β, namely X(T ) − β. Consequently, the obvious full-data
estimating function for this problem is m(Zi , β) = Xi (Ti ) − β. With censored
data, we use (9.34) to derive an arbitrary estimator for β as the solution to

     Σ_{i=1}^{n} [ ∆i {Xi (Ui ) − β} / KUi {X̄i (Ui )} + ∫_0^∞ [ dMC̃i {r, X̄i (r)} / Kr {X̄i (r)} ] Lr {X̄i (r)} ] = 0.

This leads to the estimator

     β̂n = [ Σ_{i=1}^{n} ( ∆i Xi (Ui ) / KUi {X̄i (Ui )} + ∫_0^∞ [ dMC̃i {r, X̄i (r)} / Kr {X̄i (r)} ] Lr {X̄i (r)} ) ]
               / [ Σ_{i=1}^{n} ∆i / KUi {X̄i (Ui )} ].     (9.35)

Notice that, unlike the case where there was only one influence function
with the full data for this problem, there are many influence functions with
censored data. The observed (censored) data influence functions depend on the
choice of Lr {X̄(r)} for r ≥ 0, which, in turn, affects the asymptotic variance
of the corresponding estimator in (9.35). Clearly, we want to choose such
functions in order to minimize the asymptotic variance of the corresponding
estimator. This issue will be studied carefully in the next chapter.
We also note that the estimator given above assumes that we know the
hazard function for censoring, λC̃ {r, X̄(r)}. In practice, this will be unknown
to us and must be estimated from the observed data. If the censoring time
C̃ is assumed independent of {T, X̄(T )}, then one can estimate Kr using
the Kaplan-Meier estimator for the censoring time C̃; see Kaplan and Meier
(1958). If the censoring time is related to the time-dependent covariates, then
a model has to be developed. A popular model for this purpose is Cox’s
proportional hazards model (Cox, 1972, 1975).
If Lr (·) is taken to be identically equal to zero, then we obtain the IPWCC
estimator. This estimator, referred to as the simple weighted estimator for
estimating the mean medical cost with censored cost data, was studied by
Bang and Tsiatis (2000), who also derived the large-sample properties. More
efficient estimators with a judicious choice of Lr (·) were also proposed in that
paper.
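To illustrate, the following sketch (ours, not from the text; it assumes censoring independent of {T, X̄(T )} and uses illustrative variable names and toy data) computes the simple weighted estimator, that is, (9.35) with Lr ≡ 0 and K estimated by the Kaplan–Meier estimator of the censoring-time distribution.

```python
# Sketch (ours): simple weighted estimator of beta = E{X(T)} with censored cost data,
# i.e., (9.35) with L_r = 0 and censoring assumed independent of {T, Xbar(T)}.
# u = observed time on study, delta = I(T <= Ctilde), cost = accumulated cost X(U).
import numpy as np

def km_censoring_survival(u, delta):
    """Kaplan-Meier estimate of the censoring survival function, evaluated at each u_i.
       The 'events' for this curve are the censored observations (delta = 0)."""
    order = np.argsort(u)
    d_s = delta[order]
    n = len(u)
    at_risk = n - np.arange(n)                  # number at risk just before each ordered time
    surv = np.cumprod(1.0 - (d_s == 0) / at_risk)
    K = np.empty(n)
    K[order] = surv
    return K

def simple_weighted_mean_cost(u, delta, cost):
    K = km_censoring_survival(u, delta)
    w = np.zeros(len(u))
    w[delta > 0] = 1.0 / K[delta > 0]           # Delta_i / K_{U_i}; zero for censored subjects
    return np.sum(w * cost) / np.sum(w)

# toy usage: T ~ Exp(1), independent censoring, cost X(u) = 10u, so E{X(T)} = 10
rng = np.random.default_rng(1)
T, Ct = rng.exponential(1.0, 500), rng.exponential(2.0, 500)
u, delta = np.minimum(T, Ct), (T <= Ct).astype(float)
print(simple_weighted_mean_cost(u, delta, 10 * u))
```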

9.4 Recap and Review of Notation

Constructing AIPWCC estimators

• Let m(Z, β) be a full-data estimating function, chosen so that m(Z, β0 ) ∈


ΛF ⊥ .

• If coarsening probabilities are known by design (i.e., ψ0 known), then an


observed-data RAL AIPWCC estimator for β is obtained as the solution
to the estimating equation

          Σ_{i=1}^{n} [ I(Ci = ∞)m(Zi , β) / π(∞, Zi , ψ0 ) + L2 {Ci , GCi (Zi ), ψ0 } ] = 0,

and the i-th influence function of the resulting estimator β̂n is

     ϕ̃{Ci , GCi (Zi ), ψ0 } =
          [ −E{ ∂m(Zi , β0 )/∂β^T } ]^{−1} [ I(Ci = ∞)m(Zi , β0 ) / π(∞, Zi , ψ0 ) + L2 {Ci , GCi (Zi ), ψ0 } ],
where L2 (·) is an element of the augmentation space given by

          [ I(C = ∞) / π(∞, Z, ψ0 ) ] Σ_{r≠∞} π{r, Gr (Z), ψ0 }L2r {Gr (Z)}
               − Σ_{r≠∞} I(C = r)L2r {Gr (Z)},

for arbitrarily chosen functions L2r {Gr (Z)}, r ≠ ∞.


• If the coarsening probabilities need to be modeled (i.e., ψ is unknown),
then an observed-data RAL AIPWCC estimator for β is obtained as the
solution to the estimating equation

          Σ_{i=1}^{n} [ I(Ci = ∞)m(Zi , β) / π(∞, Zi , ψ̂n ) + L2 {Ci , GCi (Zi ), ψ̂n } ] = 0,

where ψ̂n denotes the MLE for ψ. The i-th influence function of the re-
sulting estimator is

ϕ̃{Ci , GCi (Zi ), ψ0 } − E(ϕ̃SψT ){E(Sψ SψT )}−1 Sψ {Ci , GCi (Zi ), ψ0 }
= ϕ̃{Ci , GCi (Zi ), ψ0 } − Π[ϕ̃{Ci , GCi (Zi ), ψ0 }|Λψ ].

The augmentation space with monotone coarsening

• When the observed data are monotonically coarsened, then it will prove
to be convenient to express the elements of the augmentation space using
discrete hazard functions, specifically an element L2 {C, GC (Z), ψ} ∈ Λ2 ,

          Σ_{r≠∞} [ {I(C = r) − λr {Gr (Z), ψ}I(C ≥ r)} / Kr {Gr (Z), ψ} ] Lr {Gr (Z)},

     for arbitrarily chosen functions Lr {Gr (Z)}, r ≠ ∞.



9.5 Exercises for Chapter 9


1. In equation (9.4), we proposed an m-estimator (AIPWCC) for β when the
coarsening probabilities are known by design.
a) Derive the influence function for this estimator and demonstrate that
it equals (9.5).
b) Derive a consistent estimator for the asymptotic variance of this esti-
mator.
10
Improving Efficiency and Double Robustness
with Coarsened Data

Thus far, we have described the class of observed-data influence functions


when data are coarsened at random (CAR) by taking advantage of results
obtained for a full-data semiparametric model. We also illustrated how these
results can be used to derive estimators using augmented inverse probabil-
ity weighted complete-case estimating equations. The results were geometric,
relying on our ability to define the spaces ΛF ⊥ ⊂ HF and Λ2 ⊂ H. Ulti-
mately, the goal is to derive as efficient an estimator for β as is possible using
coarsened data.
As we will see, this exercise will be primarily theoretical, as it will most
often be the case that we cannot feasibly construct the most efficient esti-
mator. Nonetheless, the study of efficiency with coarsened data will aid us in
constructing more efficient estimators even if we are not able to derive the
most efficient one.

10.1 Optimal Observed-Data Influence Function


Associated with Full-Data Influence Function
We have already shown in (8.19) that all observed-data influence functions of
RAL estimators for β can be written as

     ϕ{C, GC (Z)} = [ I(C = ∞)ϕF (Z) / π(∞, Z, ψ0 ) + L2 {C, GC (Z)} ] − Π{[ · ]|Λψ },     (10.1)

where ϕF (Z) is a full-data influence function and L2 {C, GC (Z)} ∈ Λ2 .

We know that the asymptotic variance of an RAL estimator for β is the


variance of its influence function. Therefore, we now consider how to derive the
optimal observed-data influence function within the class of influence func-
tions (10.1) for a fixed full-data influence function ϕF (Z) ∈ (IF )F , where
optimal refers to the element with the smallest variance matrix.

Theorem 10.1. The optimal observed-data influence function among the


class of observed-data influence functions given by (10.1) for a fixed ϕF (Z) ∈
(IF )F is obtained by choosing L2 {C, GC (Z)} = −Π[ I(C = ∞)ϕF (Z) / π(∞, Z, ψ0 ) | Λ2 ], in which
case the optimal influence function is given by

     I(C = ∞)ϕF (Z) / π(∞, Z, ψ0 ) − Π[ I(C = ∞)ϕF (Z) / π(∞, Z, ψ0 ) | Λ2 ].     (10.2)

Proof. We begin by noting that the space of elements in (10.1), for a fixed
ϕF (Z), is a linear variety as defined by Definition 7 of Chapter 3 (i.e., a
translation of a linear space away from the origin). Specifically, this space is
given as V = x0 + M , where the element

     x0 = I(C = ∞)ϕF (Z) / π(∞, Z, ψ0 ) − Π[ I(C = ∞)ϕF (Z) / π(∞, Z, ψ0 ) | Λψ ]

and the linear subspace

     M = Π[Λ2 |Λψ⊥ ].

To prove that this space is a linear variety, we must show that x0 ∉ M .
By Theorem 8.1, we know that Λψ ⊂ Λ2 . Therefore, it suffices to show that
x0 ∉ Λ2 . This follows because E(x0 |Z) = ϕF (Z) ≠ 0.
It is also straightforward to show that the linear space M is a q-replicating
linear space as defined by Definition 6 of Chapter 3. (We leave this as an
exercise for the reader.) Consequently, as a result of Theorem 3.3, the element
in this linear variety with the smallest variance matrix is given as

x0 − Π[x0 |M ].

The theorem will follow if we can prove that

     Π[x0 |M ] = Π[ I(C = ∞)ϕF (Z) / π(∞, Z, ψ0 ) | Λ2 ] − Π[ I(C = ∞)ϕF (Z) / π(∞, Z, ψ0 ) | Λψ ].     (10.3)

In order to prove that (10.3) is a projection, we must show that


(a) (10.3) is an element of M and
(b) x0 − Π[x0 |M ] is orthogonal to M .

(a) follows because Π[ I(C = ∞)ϕF (Z) / π(∞, Z, ψ0 ) | Λ2 ] ∈ Λ2 and

     Π[ I(C = ∞)ϕF (Z) / π(∞, Z, ψ0 ) | Λψ ] = Π[ Π{ I(C = ∞)ϕF (Z) / π(∞, Z, ψ0 ) | Λ2 } | Λψ ],     (10.4)

where (10.4) follows because Λψ ⊂ Λ2 .

(b) follows because


     x0 − Π[x0 |M ] = I(C = ∞)ϕF (Z) / π(∞, Z, ψ0 ) − Π[ I(C = ∞)ϕF (Z) / π(∞, Z, ψ0 ) | Λ2 ],
which is an element orthogonal to Λ2 and hence orthogonal to M because
Λψ ⊂ Λ2 . 
Remark 1. At the end of Section 9.1, we made the observation that an
observed-data estimator for β would be more efficient (i.e., have a smaller
asymptotic variance) if the parameter ψ in a model for the coarsening prob-
abilities were estimated, even if in fact this parameter was known by design. This stemmed
from the fact that estimating the parameter ψ resulted in an influence func-
tion that subtracted off the projection of the term in the square brackets
of (10.1) onto Λψ . However, we now realize that if we constructed the esti-
mator for β as efficiently as possible when ψ is known (that is, if we chose
F
(Z)
L2 {C, GC (Z)} = −Π[ I(C=∞)ϕ
(∞,Z,ψ0 ) |Λ2 ]), then the term in the square brackets
of (10.1) would equal
 
I(C = ∞)ϕF (Z) I(C = ∞)ϕF (Z)
−Π Λ2 ,
(∞, Z, ψ0 ) (∞, Z, ψ0 )
which is orthogonal to Λ2 and hence orthogonal to Λψ , in which case the
additional projection onto Λψ that comes from estimating ψ would equal zero
and therefore would not result in any gain in efficiency. 

A linear operator was defined in Chapter 7 (see Definition 1). It will be
convenient to define the mapping from a full-data influence function to the
corresponding optimal observed-data influence function given by Theorem
10.1 using a linear operator.
Definition 1. The linear operator J : HF → H is defined so that for any
element hF (Z) ∈ HF ,

     J (hF ) = I(C = ∞)hF (Z) / π(∞, Z, ψ0 ) − Π[ I(C = ∞)hF (Z) / π(∞, Z, ψ0 ) | Λ2 ].     (10.5)
Using this definition, we note that the optimal observed-data influence func-
tion within the class (10.1), for a fixed ϕF (Z) ∈ (IF )F , is given by J (ϕF ).
Since any observed-data influence function of an RAL estimator for β must
be an element within the class (10.1) for some ϕF (Z) ∈ (IF )F , if we want to
find the efficient influence function, it suffices to restrict attention to the class
of influence functions J (ϕF ) for ϕF (Z) ∈ (IF )F . We define the space of such
influence functions as follows.
Definition 2. We denote the space J {(IF )F }, the space whose elements are

     { J (ϕF ) for all ϕF (Z) ∈ (IF )F },

by (IF )DR . 


The subscript “DR” is used to denote double robustness. Therefore, the space
(IF )DR ⊂ (IF ) is defined as the set of double-robust observed-data influence
functions. The term double robust was first introduced in Section 6.5. Why we
refer to these as double-robust influence functions will become clear later in
the chapter. Since the space of full-data influence functions is a linear variety
in HF (see Theorem 4.3), (IF )F = ϕF (Z) + T F ⊥ , where ϕF (Z) is an arbitrary
full-data influence function and T F is the full-data tangent space, it is clear
that (IF )DR = J {(IF )F } = J (ϕF ) + J (T F ⊥ ) is a linear variety in H.

Remark 2. As we have argued repeatedly, because an influence function of an


observed-data RAL estimator for β can be obtained, up to a proportionality
constant, from an element orthogonal to the observed-data nuisance tangent
space, this has motivated us to define estimating functions m{C, GC (Z), β},
where m{C, GC (Z), β0 } ∈ Λ⊥ .
If, however, we are interested in deriving observed-data RAL estimators
for β whose influence function belongs to (IF )DR , then we should choose an
estimating function m{C, GC (Z), β} so that m{C, GC (Z), β0 } is an element of
the linear space J (ΛF ⊥ ). This follows because it suffices to define estimating
functions that at the truth are proportional to influence functions. Since any
element ϕ∗F (Z) ∈ ΛF ⊥ properly normalized will lead to a full-data influence
function ϕF (Z) = Aq×q ϕ∗F (Z), where A = {E(ϕ∗F SβT )}−1 , and because J (·)
is a linear operator, this implies that a typical element J (ϕF ) of (IF )DR is
equal to J (Aϕ∗F ) = AJ (ϕ∗F ); i.e., it is proportional to an element J (ϕ∗F ) ∈
J (ΛF ⊥ ). We define the space J (ΛF ⊥ ) to be the DR linear space.  

Definition 3. The linear subspace J (ΛF ⊥ ) ⊂ Λ⊥ ⊂ H, the space that con-


sists of elements

     { J (ϕ∗F ) : ϕ∗F (Z) ∈ ΛF ⊥ },

will be referred to as the DR linear space. 




This now gives us a prescription for how to find observed-data RAL estima-
tors for β whose influence function belongs to (IF )DR . We start by choosing
a full-data estimating function m(Z, β) so that m(Z, β0 ) = ϕ∗F (Z) ∈ ΛF ⊥ .
We then construct the observed-data estimating function m{C, GC (Z), β} =
J {m(Z, β)}, where

     J {m(Z, β)} = I(C = ∞)m(Z, β) / π(∞, Z, ψ0 ) − Π[ I(C = ∞)m(Z, β) / π(∞, Z, ψ0 ) | Λ2 ].

If the observed data were coarsened by design (i.e., ψ0 known), then we


would derive an estimator for β by solving the estimating equation

     Σ_{i=1}^{n} [ I(Ci = ∞)m(Zi , β) / π(∞, Zi , ψ0 ) + L2 {Ci , GCi (Zi ), β, ψ0 } ] = 0,     (10.6)

where

     L2 {C, GC (Z), β, ψ} = −Π[ I(C = ∞)m(Z, β) / π(∞, Z, ψ) | Λ2 ].
If the coarsening probabilities were not known and had to be modeled using
the unknown parameter ψ, then we would derive an estimator for β by solving
the estimating equation

     Σ_{i=1}^{n} [ I(Ci = ∞)m(Zi , β) / π(∞, Zi , ψ̂n ) + L2 {Ci , GCi (Zi ), β, ψ̂n } ] = 0,     (10.7)

where ψ̂n is the MLE for the parameter ψ and L2 (·) is defined as above.
Finding projections onto the augmentation space Λ2 is not necessarily easy.
Later we will discuss a general procedure for finding such projections that
involves an iterative process. However, in the case when there are two levels
of coarsening or when the coarsening is monotone, a closed-form solution for
the projection onto Λ2 exists. We will study these two scenarios more carefully.
We start by illustrating how to find improved estimators using these results
when there are two levels of missingness.

10.2 Improving Efficiency with Two Levels of Missingness
Let the data for a single observation Z be partitioned as (Z1T , Z2T )T , where
we always observe Z1 but Z2 may be missing for some individuals. For this
problem, it is convenient to use R to denote the complete-case indicator; that
is, if R = 1, we observe Z = (Z1T , Z2T )T , whereas if R = 0 we only observe Z1 .
The observed data are given by Oi = (Ri , Z1i , Ri Z2i ), i = 1, . . . , n.
The assumption of missing at random implies that P (R = 0|Z1 , Z2 ) =
P (R = 0|Z1 ), which, in turn, implies that P (R = 1|Z1 , Z2 ) = P (R = 1|Z1 ),
which we denote by π(Z1 ). Such complete-case probabilities may be known to
us by design or may have to be estimated using a model for the missingness
probabilities that includes the additional parameter ψ. In the latter case, since
R is binary, a logistic regression model is often used; for example,
     π(Z1 , ψ) = exp(ψ0 + ψ1T Z1 ) / {1 + exp(ψ0 + ψ1T Z1 )}.     (10.8)
The parameter ψ = (ψ0 , ψ1T )T can be estimated using a maximum likelihood
estimator; that is, by maximizing

     ∏_{i=1}^{n} exp{(ψ0 + ψ1T Z1i )Ri } / {1 + exp(ψ0 + ψ1T Z1i )}     (10.9)

using the data (Ri , Z1i ), i = 1, . . . , n, which are available on everyone. In


equation (10.8), we used a logistic regression model that was linear in Z1 ;

however, we can make the model as flexible as necessary to fit the data. For
example, we can include higher-order polynomial terms, interaction terms,
splines, etc.
As always, there is an underlying full-data model Z ∼ p(z, β, η) ∈ P,
where β is the q-dimensional parameter of interest and η is the nuisance
parameter (possibly infinite-dimensional), and our goal is to estimate β using
the observed data Oi = (Ri , Z1i , Ri Z2i ), i = 1, . . . , n.
In order to use either estimating equation (10.6) or (10.7), when ψ is known
or unknown, respectively, to obtain an observed-data RAL estimator for β
whose influence function is an element of (IF )DR , we must find the projection
of I(C = ∞)ϕ∗F (Z) / π(∞, Z) onto the augmentation space Λ2 , where ϕ∗F (Z) = m(Z, β0 ) ∈
ΛF ⊥ . Using the notation for two levels of missingness, we now consider how
to derive the projection of Rϕ∗F (Z) / π(Z1 ) onto the augmentation space. In
(7.40) of Chapter 7, we showed that, with two levels of missingness, Λ2
consists of the set of elements

     L2 (O) = [ {R − π(Z1 )} / π(Z1 ) ] h2 (Z1 ),

where h2^{q×1} (Z1 ) is an arbitrary q-dimensional function of Z1 .

Finding the Projection onto the Augmentation Space


Theorem 10.2. The projection of Rϕ∗F (Z) / π(Z1 ) onto the augmentation space Λ2
is the unique element [{R − π(Z1 )} / π(Z1 )] h02 (Z1 ) ∈ Λ2 , where

     h02 (Z1 ) = E{ϕ∗F (Z)|Z1 }.     (10.10)


Proof. The projection of Rϕ∗F (Z) / π(Z1 ) onto the space Λ2 is the unique element
[{R − π(Z1 )} / π(Z1 )] h02 (Z1 ) ∈ Λ2 such that the residual

     Rϕ∗F (Z) / π(Z1 ) − [{R − π(Z1 )} / π(Z1 )] h02 (Z1 )

is orthogonal to every element in Λ2 ; that is,

     E( [ Rϕ∗F (Z) / π(Z1 ) − {R − π(Z1 )} / π(Z1 ) h02 (Z1 ) ]^T [ {R − π(Z1 )} / π(Z1 ) ] h2 (Z1 ) ) = 0     (10.11)
(10.11)
for all functions h2 (Z1 ). We derive the expectation in (10.11) by using the
law of iterated conditional expectations, where we first condition on Z =
(Z1T , Z2T )T to obtain

     E( [ E{ R(R − π(Z1 )) / π^2 (Z1 ) | Z } ϕ∗F (Z) − E{ (R − π(Z1 ))^2 / π^2 (Z1 ) | Z } h02 (Z1 ) ]^T h2 (Z1 ) ).     (10.12)

Because

     E[ R{R − π(Z1 )} / π^2 (Z1 ) | Z ] = E[ {R − π(Z1 )}^2 / π^2 (Z1 ) | Z ] = {1 − π(Z1 )} / π(Z1 ),

we write (10.12) as

     E( [ {1 − π(Z1 )} / π(Z1 ) ] [ ϕ∗F (Z) − h02 (Z1 ) ]^T h2 (Z1 ) ).     (10.13)

Therefore, we must find the function h02 (Z1 ) such that (10.13) is equal
to zero for all h2 (Z1 ). We derive (10.13) by again using the law of iterated
conditional expectations, where we first condition on Z1 to obtain

     E( [ {1 − π(Z1 )} / π(Z1 ) ] [ E{ϕ∗F (Z)|Z1 } − h02 (Z1 ) ]^T h2 (Z1 ) ).     (10.14)

By assumption, π(Z1 ) ≥ ε > 0 for all Z1 , which implies that {1 − π(Z1 )}/π(Z1 ) is bounded
away from zero and ∞ for all Z1 . Therefore, equation (10.14) will be equal to
zero for all h2 (Z1 ) if and only if

h02 (Z1 ) = E{ϕ∗F (Z)|Z1 }. 




Thus, among estimators whose estimating functions are based on elements


of Λ⊥ ,

     Rϕ∗F (Z) / π(Z1 ) − [{R − π(Z1 )} / π(Z1 )] h2 (Z1 ) − Π{[·]|Λψ },
for a fixed ϕ∗F (Z) ∈ ΛF ⊥ , the one that gives the optimal answer (i.e., the
estimator with the smallest asymptotic variance) is obtained by choosing
h2 (Z1 ) = E{ϕ∗F (Z)|Z1 }.
This result, although interesting, is not of practical use since we don’t know
the conditional expectation E{ϕ∗F (Z)|Z1 }. The result does, however, suggest
that we consider an adaptive approach where we use the data to estimate this
conditional expectation, as we now illustrate.

Adaptive Estimation

Since Z = (Z1T , Z2T )T , in order to compute the conditional expectation
E{ϕ∗F (Z)|Z1 }, we need to know the conditional density of Z2 given Z1 . Of

course, this conditional distribution depends on the unknown full-data pa-


rameters β and η. As we have argued, estimating β and η using likelihood
methods for semiparametric models (i.e., when the parameter η is infinite-
dimensional) may be difficult, if not impossible, using coarsened data. This
is the reason we introduced AIPWCC estimators in the first place. Another
approach, which we now advocate, is to be adaptive. That is, we posit a
working model where the density of Z is assumed to be within the model
p∗Z1 ,Z2 (z1 , z2 , ξ) ∈ Pξ ⊂ P, where the parameter ξ is finite-dimensional. The
conditional density of Z2 given Z1 as a function of the parameter ξ would be
     p∗Z2 |Z1 (z2 |z1 , ξ) = p∗Z1 ,Z2 (z1 , z2 , ξ) / ∫ p∗Z1 ,Z2 (z1 , u, ξ)dνZ2 (u).

For such a model, the parameter ξ can be estimated by maximizing the


observed-data likelihood (7.10), which, with two levels of missingness, can
be written as

     ∏_{i=1}^{n} p∗Z1 ,Z2 (Z1i , Z2i , ξ)^{Ri} { ∫ p∗Z1 ,Z2 (Z1i , u, ξ)dνZ2 (u) }^{1−Ri} .     (10.15)

Since we only need the conditional distribution of Z2 given Z1 to derive the


desired projection, an even simpler approach would be to posit a parametric
model for the conditional density of Z2 given Z1 in terms of a parametric model
with a finite number of parameters ξ. Let us denote such a posited model by
p∗Z2 |Z1 (z2 |z1 , ξ). Because of the missing at random (MAR) assumption, R is
conditionally independent of Z2 given Z1 . This is denoted as R ⊥ ⊥ Z2 |Z1 .
Consequently,

pZ2 |Z1 ,R (z2 |z1 , r) = pZ2 |Z1 ,R=1 (z2 |z1 , r = 1) = pZ2 |Z1 (z2 |z1 ).

Therefore, it suffices to consider only the complete cases {i : Ri = 1} because


the conditional distribution of Z2 given Z1 among the complete cases is the
same as the conditional distribution of Z2 given Z1 in the population. A
natural estimator for ξ is obtained by maximizing the conditional likelihood of
Z2 given Z1 in ξ among the complete cases. Namely, we would estimate ξ by
maximizing

     ∏_{i:Ri =1} p∗Z2 |Z1 (z2i |z1i , ξ).     (10.16)

The resulting estimator is denoted by ξˆn∗ .


Remark 3. Whether we posit a simpler model for the density of Z = (Z1T , Z2T )T
or the conditional density of Z2 given Z1 , we must keep in mind that this is
a posited model and that the true conditional density p0Z2 |Z1 (z2 |z1 ) may not
be contained in this model. Moreover, if we develop a model directly for the
conditional density of Z2 given Z1 , then we must be careful that such a model
is consistent with the underlying semiparametric model.  

Nonetheless, with the adaptive approach, we proceed as if the posited


model were correct and estimate ξ using the observed data. Under suitable
regularity conditions, this estimator will converge in probability to some con-
P
stant ξ ∗ ; i.e., ξˆn∗ → ξ ∗ and n1/2 (ξˆn∗ − ξ ∗ ) will be bounded in probability. In
general, the posited model will not contain the truth, in which case

p∗Z2|Z1(z2|z1, ξ∗) ≠ p0Z2|Z1(z2|z1).

If, however, our posited model did contain the truth, then we denote this by
taking ξ ∗ to equal ξ0 , where

p∗Z2 |Z1 (z2 |z1 , ξ0 ) = p0Z2 |Z1 (z2 |z1 ).

With such a posited model for the conditional density of Z2 given Z1 and
an estimator ξˆn∗ , we are able to estimate h02 (Z1 ) = E{ϕ∗F (Z)|Z1 } by using

\[ h_2^*(Z_1, \hat\xi_n^*) = \int \varphi^{*F}(Z_1, u)\, p^*_{Z_2|Z_1}(u \mid Z_1, \hat\xi_n^*)\, d\nu_{Z_2}(u). \]

Again, keep in mind that
\[ h_2^*(Z_1, \hat\xi_n^*) \xrightarrow{P} h_2^*(Z_1, \xi^*) = \int \varphi^{*F}(Z_1, u)\, p^*_{Z_2|Z_1}(u \mid Z_1, \xi^*)\, d\nu_{Z_2}(u), \]
where h2∗(Z1, ξ∗) is a function of Z1, but it is not necessarily true that h02(Z1) = h2∗(Z1, ξ∗) unless the posited model for the conditional density of Z2 given Z1 is correct.
With this as background, we now give a step-by-step algorithm on how to
derive an improved estimator. In so doing, we consider the scenario where the
parameter ψ in our missingness model is unknown and must be estimated.

Algorithm for Finding Improved Estimators with


Two Levels of Missingness

1. We first consider how the parameter β would be estimated if there were no


missing data (i.e., the full-data problem). That is, we choose an estimat-
ing function, say m(Z, β), where m(Z, β0 ) = ϕ∗F (Z) ∈ ΛF ⊥ . We might
consider using an estimating function that leads to an efficient or locally
efficient full-data estimator for β. However, we must keep in mind that
the efficient full-data influence function may not be the one that leads to
an efficient observed-data influence function. A detailed discussion of this
issue will be deferred until the next chapter.
2. We posit a model for the complete-case (missingness) probabilities in terms of the parameter ψ, say P(R = 1|Z) = π(Z1, ψ), and, using the data (Ri, Z1i), i = 1, . . . , n, which are available on the entire sample, we derive the maximum likelihood estimator ψ̂n for ψ. For example, we might use the logistic regression model in (10.8), in which case the estimator is obtained by maximizing (10.9). In general, however, we would maximize
\[ \prod_{i=1}^{n} \{\pi(Z_{1i}, \psi)\}^{R_i} \{1 - \pi(Z_{1i}, \psi)\}^{1-R_i}. \]

We denote the MLE by ψ̂n .


3. We posit a model for either the distribution of Z = (Z1T , Z2T )T or the
conditional distribution of Z2 given Z1 in terms of the parameter ξ. Either
way, this results in a model for p∗Z2 |Z1 (z2 |z1 , ξ) in terms of the parameter
ξ. Using the observed data, we derive the MLE ξˆn∗ for ξ by maximizing
either (10.15) or (10.16).
4. The estimator for β is obtained by solving the estimating equation
\[ \sum_{i=1}^{n} \left[ \frac{R_i\, m(Z_i, \beta)}{\pi(Z_{1i}, \hat\psi_n)} - \left\{ \frac{R_i - \pi(Z_{1i}, \hat\psi_n)}{\pi(Z_{1i}, \hat\psi_n)} \right\} h_2^*(Z_{1i}, \beta, \hat\xi_n^*) \right] = 0, \tag{10.17} \]
where
\[ h_2^*(Z_{1i}, \beta, \xi) = E\{m(Z_i, \beta) \mid Z_{1i}, \xi\} = \int m(Z_{1i}, u, \beta)\, p^*_{Z_2|Z_1}(u \mid Z_{1i}, \xi)\, d\nu_{Z_2}(u). \]
A small numerical sketch of these four steps is given below.
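To make the algorithm concrete, here is a minimal Python sketch of steps 1-4 under purely illustrative assumptions that are not part of the text: the full-data parameter is β = E(Z2) with estimating function m(Z, β) = Z2 − β, the missingness model is logistic in a scalar Z1, and the working model for Z2 given Z1 is a normal linear regression fit to the complete cases. The simulated data and all variable names are hypothetical.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# simulated data (illustration only): Z = (Z1, Z2), beta0 = E(Z2) = 1
n = 2000
Z1 = rng.normal(size=n)
Z2 = 1.0 + 0.5 * Z1 + rng.normal(size=n)
pi_true = 1 / (1 + np.exp(-(0.3 + 0.8 * Z1)))        # MAR: depends on Z1 only
R = rng.binomial(1, pi_true)                          # R = 1: complete case

# step 2: logistic model pi(Z1, psi), fit by maximum likelihood
def neg_loglik(psi):
    lin = psi[0] + psi[1] * Z1
    return -np.sum(R * lin - np.logaddexp(0.0, lin))
psi_hat = minimize(neg_loglik, np.zeros(2)).x
pi_hat = 1 / (1 + np.exp(-(psi_hat[0] + psi_hat[1] * Z1)))

# step 3: working model Z2 | Z1 ~ N(xi0 + xi1*Z1, sigma^2),
# fit by least squares on the complete cases only
X_cc = np.column_stack([np.ones(R.sum()), Z1[R == 1]])
xi_hat, *_ = np.linalg.lstsq(X_cc, Z2[R == 1], rcond=None)
h2 = xi_hat[0] + xi_hat[1] * Z1                       # E(Z2 | Z1, xi_hat)

# step 4: solve (10.17) with m(Z, beta) = Z2 - beta; because
# R/pi - (R - pi)/pi = 1 for every subject, the equation is linear
# in beta here and the solution is the simple average below
beta_hat = np.mean(R * Z2 / pi_hat - (R - pi_hat) / pi_hat * h2)
print("adaptive AIPWCC estimate of E(Z2):", beta_hat)

With other choices of m(Z, β), the same construction applies, with the closed-form average replaced by a numerical root of the estimating equation.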

Remarks Regarding Adaptive Estimators

The semiparametric theory that we have developed for coarsened data implic-
itly assumes that the model for the coarsening probabilities is correctly speci-
fied. With two levels of missingness, this means that P (R = 1|Z1 ) = π0 (Z1 ) is
contained within the model π(Z1 , ψ), in which case, under suitable regularity
conditions, π(Z1 , ψ̂n ) → π0 (Z1 ), where ψ̂n denotes the MLE for ψ. The fact
that we used the augmented term
 "
Ri − π(Z1i , ψ̂n )
− h∗2 (Z1i , β, ξˆn∗ ) (10.18)
π(Z1i , ψ̂n )

in equation (10.17) was an attempt to gain efficiency from the data that are
incomplete (i.e., {i : Ri = 0}). To get the greatest gain in efficiency, the
augmented term must equal
 "
Ri − π(Z1i , ψ̂n )
− h02 (Z1i , β),
π(Z1i , ψ̂n )

where
10.2 Improving Efficiency with Two Levels of Missingness 231

h02 (Z1i , β) = E0 {m(Zi , β)|Z1i }, (10.19)


where the conditional expectation on the right-hand side of (10.19) is with
respect to the true conditional density of Z2 given Z1 .
Because we are using a posited model, the function h∗2 (Z1i , β, ξˆn∗ ) will con-
verge to h∗2 (Z1i , β, ξ ∗ ), which is not equal to the desired h02 (Z1i , β) unless the
posited model was correct. Nonetheless, h∗2 (Z1i , β, ξ ∗ ) is a function of Z1i , in
which case
\[ -\left\{ \frac{R_i - \pi(Z_{1i}, \psi_0)}{\pi(Z_{1i}, \psi_0)} \right\} h_2^*(Z_{1i}, \beta_0, \xi^*) \in \Lambda_2. \]
By Theorem 9.1, the solution to the estimating equation
\[ \sum_{i=1}^{n} \left[ \frac{R_i\, m(Z_i, \beta)}{\pi(Z_{1i}, \hat\psi_n)} - \left\{ \frac{R_i - \pi(Z_{1i}, \hat\psi_n)}{\pi(Z_{1i}, \hat\psi_n)} \right\} h_2^*(Z_{1i}, \beta_0, \xi^*) \right] = 0, \tag{10.20} \]
where we note that β is fixed at β0 and ξ∗ is fixed in h2∗(·), is an AIPWCC estimator for β whose influence function is
\[ \left\{ -E\!\left( \frac{\partial m(Z,\beta_0)}{\partial\beta^T} \right) \right\}^{-1} \times \left[ \frac{R\, m(Z,\beta_0)}{\pi(Z_1, \psi_0)} - \left\{ \frac{R - \pi(Z_1, \psi_0)}{\pi(Z_1, \psi_0)} \right\} h_2^*(Z_1, \beta_0, \xi^*) - \Pi\{[\,\cdot\,] \mid \Lambda_\psi\} \right]. \tag{10.21} \]

Let us denote the estimator that solves (10.20) by β̂n∗ .


In the following theorem and proof, we give a heuristic justification to
show that estimating ξ using ξˆn∗ will have a negligible effect on the resulting
estimator. That is, β̂n , the solution to (10.17), is asymptotically equivalent
to β̂n∗ , the solution to (10.20). By so doing, we deduce that the adaptive
estimator β̂n is a consistent, asymptotically normal estimator for β whose
influence function is given by (10.21).

Theorem 10.3.
\[ n^{1/2}(\hat\beta_n - \hat\beta_n^*) \xrightarrow{P} 0. \]

Proof. If we return to the proof of Theorem 9.1, we note that the asymptotic approximation given by (9.13), applied to the estimator β̂n, the solution to (10.17), would yield
\[ n^{1/2}(\hat\beta_n - \beta_0) = \left\{ -E\!\left( \frac{\partial m(Z,\beta_0)}{\partial\beta^T} \right) \right\}^{-1} \times n^{-1/2}\sum_{i=1}^{n} \left[ \frac{R_i\, m(Z_i, \beta_0)}{\pi(Z_{1i}, \hat\psi_n)} - \left\{ \frac{R_i - \pi(Z_{1i}, \hat\psi_n)}{\pi(Z_{1i}, \hat\psi_n)} \right\} h_2^*(Z_{1i}, \hat\beta_n, \hat\xi_n^*) \right] + o_p(1), \tag{10.22} \]
whereas, applied to β̂n∗, the solution to (10.20), it would yield
\[ n^{1/2}(\hat\beta_n^* - \beta_0) = \left\{ -E\!\left( \frac{\partial m(Z,\beta_0)}{\partial\beta^T} \right) \right\}^{-1} \times n^{-1/2}\sum_{i=1}^{n} \left[ \frac{R_i\, m(Z_i, \beta_0)}{\pi(Z_{1i}, \hat\psi_n)} - \left\{ \frac{R_i - \pi(Z_{1i}, \hat\psi_n)}{\pi(Z_{1i}, \hat\psi_n)} \right\} h_2^*(Z_{1i}, \beta_0, \xi^*) \right] + o_p(1). \tag{10.23} \]

Taking differences between (10.22) and (10.23), we obtain that
\[ n^{1/2}(\hat\beta_n - \hat\beta_n^*) = \left\{ -E\!\left( \frac{\partial m(Z,\beta_0)}{\partial\beta^T} \right) \right\}^{-1} \left[ n^{-1/2}\sum_{i=1}^{n} \left\{ \frac{R_i - \pi(Z_{1i}, \hat\psi_n)}{\pi(Z_{1i}, \hat\psi_n)} \right\} h_2^*(Z_{1i}, \beta_0, \xi^*) - n^{-1/2}\sum_{i=1}^{n} \left\{ \frac{R_i - \pi(Z_{1i}, \hat\psi_n)}{\pi(Z_{1i}, \hat\psi_n)} \right\} h_2^*(Z_{1i}, \hat\beta_n, \hat\xi_n^*) \right] + o_p(1). \]

The proof is complete if we can show that the term in the square brackets
above converges in probability to zero. This follows by expanding
\[ n^{-1/2}\sum_{i=1}^{n} \left\{ \frac{R_i - \pi(Z_{1i}, \hat\psi_n)}{\pi(Z_{1i}, \hat\psi_n)} \right\} h_2^*(Z_{1i}, \hat\beta_n, \hat\xi_n^*) \]
about β0 and ξ∗, while keeping ψ̂n fixed, to obtain
\[ n^{-1/2}\sum_{i=1}^{n} \left\{ \frac{R_i - \pi(Z_{1i}, \hat\psi_n)}{\pi(Z_{1i}, \hat\psi_n)} \right\} h_2^*(Z_{1i}, \beta_0, \xi^*) \tag{10.24} \]
\[ + \left[ n^{-1}\sum_{i=1}^{n} \left\{ \frac{R_i - \pi(Z_{1i}, \hat\psi_n)}{\pi(Z_{1i}, \hat\psi_n)} \right\} \frac{\partial h_2^*(Z_{1i}, \beta_n^*, \xi_n^*)}{\partial\beta^T} \right] n^{1/2}(\hat\beta_n - \beta_0) \tag{10.25} \]
\[ + \left[ n^{-1}\sum_{i=1}^{n} \left\{ \frac{R_i - \pi(Z_{1i}, \hat\psi_n)}{\pi(Z_{1i}, \hat\psi_n)} \right\} \frac{\partial h_2^*(Z_{1i}, \beta_n^*, \xi_n^*)}{\partial\xi^{*T}} \right] n^{1/2}(\hat\xi_n^* - \xi^*), \tag{10.26} \]

where βn∗ and ξn∗ are intermediate values between β̂n and β0 and between ξ̂n∗ and ξ∗, respectively. Let us consider (10.26). Since ψ̂n, βn∗, and ξn∗ converge in probability to ψ0, β0, and ξ∗, respectively, then, under suitable regularity conditions, the sample average in equation (10.26) satisfies
\[ n^{-1}\sum_{i=1}^{n} \left\{ \frac{R_i - \pi(Z_{1i}, \hat\psi_n)}{\pi(Z_{1i}, \hat\psi_n)} \right\} \frac{\partial h_2^*(Z_{1i}, \beta_n^*, \xi_n^*)}{\partial\xi^{*T}} \xrightarrow{P} E\left[ \left\{ \frac{R - \pi(Z_1, \psi_0)}{\pi(Z_1, \psi_0)} \right\} \frac{\partial h_2^*(Z_1, \beta_0, \xi^*)}{\partial\xi^{*T}} \right]. \]

This last expectation can be shown to equal zero by a simple conditioning


argument where we first find the conditional expectation given Z1 . Since,

under suitable regularity conditions, n1/2 (ξˆn∗ − ξ ∗ ) is bounded in probability,


then a simple application of Slutsky’s theorem can be used to show that
(10.26) converges in probability to zero. A similar argument can be used also
to show that (10.25) converges in probability to zero.  

If, in addition, the model for the conditional distribution of Z2 given Z1


was correctly specified, then
\[ \left\{ \frac{R - \pi(Z_1, \psi_0)}{\pi(Z_1, \psi_0)} \right\} h_2^*(Z_1, \beta_0, \xi^*) = \left\{ \frac{R - \pi(Z_1, \psi_0)}{\pi(Z_1, \psi_0)} \right\} h_{02}(Z_1, \beta_0) = \Pi\left[ \frac{R\, m(Z,\beta_0)}{\pi(Z_1, \psi_0)} \,\Big|\, \Lambda_2 \right]. \]
This would then imply that the term inside the square brackets "[ · ]" in (10.21) is an element orthogonal to Λ2 (i.e., [ · ] ∈ Λ2⊥), in which case Π{[ · ] | Λψ} (i.e., the last term of (10.21)) is equal to zero because Λψ ⊂ Λ2. The resulting estimator would have influence function
\[ \frac{R\, \varphi^F(Z)}{\pi(Z_1, \psi_0)} - \Pi\left[ \frac{R\, \varphi^F(Z)}{\pi(Z_1, \psi_0)} \,\Big|\, \Lambda_2 \right] = \mathcal{J}(\varphi^F) \in (IF)_{DR}, \tag{10.27} \]
where ϕF(Z) = {−E(∂m(Z, β0)/∂βT)}−1 m(Z, β0), and this influence function is
within the class of the so-called double-robust influence functions. The vari-
ance of this influence function represents the smallest asymptotic variance
among observed-data estimators that used m(Z, β0 ) = ϕ∗F (Z) as the ba-
sis for an augmented inverse probability weighted complete-case estimating
equation.

Estimating the Asymptotic Variance

To estimate the asymptotic variance of β̂n , where β̂n is the solution to (10.17),
we propose using the sandwich variance estimator given in Chapter 9, equation
(9.19). For completeness, this estimator is given by
\[ \hat\Sigma_n = \left\{ \hat E\!\left( \frac{\partial m(Z, \hat\beta_n)}{\partial\beta^T} \right) \right\}^{-1} \left\{ n^{-1}\sum_{i=1}^{n} g(O_i, \hat\psi_n, \hat\beta_n, \hat\xi_n^*)\, g^T(O_i, \hat\psi_n, \hat\beta_n, \hat\xi_n^*) \right\} \left[ \left\{ \hat E\!\left( \frac{\partial m(Z, \hat\beta_n)}{\partial\beta^T} \right) \right\}^{-1} \right]^T, \tag{10.28} \]
where
\[ \hat E\!\left( \frac{\partial m(Z, \hat\beta_n)}{\partial\beta^T} \right) = n^{-1}\sum_{i=1}^{n} \frac{R_i\, \partial m(Z_i, \hat\beta_n)/\partial\beta^T}{\pi(Z_{1i}, \hat\psi_n)}, \]
\[ g(O_i, \hat\psi_n, \hat\beta_n, \hat\xi_n^*) = q(O_i, \hat\psi_n, \hat\beta_n, \hat\xi_n^*) - \hat E(qS_\psi^T)\{\hat E(S_\psi S_\psi^T)\}^{-1} S_\psi(O_i, \hat\psi_n), \]
\[ q(O_i, \hat\psi_n, \hat\beta_n, \hat\xi_n^*) = \frac{R_i\, m(Z_i, \hat\beta_n)}{\pi(Z_{1i}, \hat\psi_n)} - \left\{ \frac{R_i - \pi(Z_{1i}, \hat\psi_n)}{\pi(Z_{1i}, \hat\psi_n)} \right\} h_2^*(Z_{1i}, \hat\beta_n, \hat\xi_n^*), \]
\[ \hat E(qS_\psi^T) = n^{-1}\sum_{i=1}^{n} q(O_i, \hat\psi_n, \hat\beta_n, \hat\xi_n^*)\, S_\psi^T(O_i, \hat\psi_n), \]
and
\[ \hat E(S_\psi S_\psi^T) = n^{-1}\sum_{i=1}^{n} S_\psi(O_i, \hat\psi_n)\, S_\psi^T(O_i, \hat\psi_n). \]
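Continuing the illustrative Python sketch given earlier (same simulated data and fitted models, all of which are assumptions introduced for illustration only), the sandwich formula (10.28) can be computed directly; β is scalar there, and Sψ is the usual logistic-regression score for ψ.

# sandwich variance (10.28) for the scalar beta_hat of the earlier sketch
D = -np.mean(R / pi_hat)                         # Ehat of d m/d beta with m = Z2 - beta
q = R * (Z2 - beta_hat) / pi_hat - (R - pi_hat) / pi_hat * (h2 - beta_hat)
S_psi = (R - pi_hat)[:, None] * np.column_stack([np.ones(n), Z1])   # score for psi
E_qS = q @ S_psi / n                             # Ehat(q S_psi^T), a length-2 vector
E_SS = S_psi.T @ S_psi / n                       # Ehat(S_psi S_psi^T), 2 x 2
g = q - S_psi @ np.linalg.solve(E_SS, E_qS)      # residual after projecting on the psi-score
Sigma_hat = np.mean(g ** 2) / D ** 2             # sandwich formula, scalar case
se_beta = np.sqrt(Sigma_hat / n)
print("estimated standard error of beta_hat:", se_beta)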

Double Robustness with Two Levels of Missingness

Thus far, we have taken the point of view that the missingness model was
correctly specified; that is, that π0 (Z1 ) = P (R = 1|Z1 ) is contained in the
model π(Z1 , ψ) for some value of ψ. If this is the case, we denote the true value
of ψ by ψ0 and π0 (Z1 ) = π(Z1 , ψ0 ). However, unless the missingness was by
design, we generally don’t know the true missingness model, and therefore the
possibility exists that we have misspecified this model. Even if the missing-
ness model is misspecified, under suitable regularity conditions, the maximum
likelihood estimator ψ̂n will converge in probability to some constant ψ∗, but π(Z1, ψ∗) need not equal π0(Z1). That is,
\[ \pi(Z_1, \hat\psi_n) \xrightarrow{P} \pi(Z_1, \psi^*) \neq \pi_0(Z_1). \]

As we will demonstrate, the attempt to gain efficiency by positing a model


for the conditional distribution of Z2 given Z1 and estimating

\[ h_2^*(Z_1, \beta, \hat\xi_n^*) = E\{m(Z, \beta) \mid Z_1, \xi\}\big|_{\xi = \hat\xi_n^*} \]

actually gives us the extra protection of double robustness, which was briefly
introduced in Section 6.5. We now explore this issue further.
Using standard asymptotic approximations, we can show that the esti-
mator β̂n , which is the solution to the estimating equation (10.17), will be
consistent and asymptotically normal if
\[ E\left[ \frac{R\, m(Z,\beta_0)}{\pi(Z_1, \psi^*)} - \left\{ \frac{R - \pi(Z_1, \psi^*)}{\pi(Z_1, \psi^*)} \right\} h_2^*(Z_1, \beta_0, \xi^*) \right] = 0, \tag{10.29} \]
where ψ̂n and ξ̂n∗ converge in probability to ψ∗ and ξ∗, respectively.
In developing our estimator for β, we considered two models, one for the
missingness probabilities π(Z1 , ψ) and another for the conditional density of
Z2 given Z1, p∗Z2|Z1(z2|z1, ξ). If the missingness model is correctly specified,
then
π(Z1 , ψ ∗ ) = π(Z1 , ψ0 ) = P (R = 1|Z1 ). (10.30)
If the model for the conditional density of Z2 given Z1 is correctly specified,
then
h∗2 (Z1 , β0 , ξ ∗ ) = E0 {m(Z, β0 )|Z1 }. (10.31)
We now show the so-called double-robustness property; that is, the esti-
mator β̂n , the solution to (10.17), is consistent and asymptotically normal (a
result that follows under suitable regularity conditions when (10.29) is satis-
fied) if either (10.30) or (10.31) is true.
After adding and subtracting similar terms, we write the expectation in
(10.29) as
\[ E\left[ m(Z,\beta_0) + \left\{ \frac{R - \pi(Z_1,\psi^*)}{\pi(Z_1,\psi^*)} \right\} \{ m(Z,\beta_0) - h_2^*(Z_1,\beta_0,\xi^*) \} \right] = 0 + E\left[ \left\{ \frac{R - \pi(Z_1,\psi^*)}{\pi(Z_1,\psi^*)} \right\} \{ m(Z,\beta_0) - h_2^*(Z_1,\beta_0,\xi^*) \} \right]. \tag{10.32} \]

If (10.30) is true, whether (10.31) holds or not, we write (10.32) as
\[ E\left[ \left\{ \frac{R - P(R=1|Z_1)}{P(R=1|Z_1)} \right\} \{ m(Z,\beta_0) - h_2^*(Z_1,\beta_0,\xi^*) \} \right]. \tag{10.33} \]
We derive the expectation of (10.33) by using the law of conditional iterated expectations, where we first condition on Z = (Z1, Z2) to obtain
\[ E\left[ \left\{ \frac{E(R|Z_1,Z_2) - P(R=1|Z_1)}{P(R=1|Z_1)} \right\} \{ m(Z,\beta_0) - h_2^*(Z_1,\beta_0,\xi^*) \} \right]. \tag{10.34} \]

Because of MAR, R ⊥⊥ Z2 | Z1, which implies that E(R|Z1, Z2) = E(R|Z1) = P(R = 1|Z1). This then implies that (10.34) equals zero, which, in turn, implies that β̂n is consistent when (10.30) is true.
If (10.31) is true, whether (10.30) holds or not, we write (10.32) as
\[ E\left[ \left\{ \frac{R - \pi(Z_1,\psi^*)}{\pi(Z_1,\psi^*)} \right\} \big[ m(Z,\beta_0) - E_0\{m(Z,\beta_0)|Z_1\} \big] \right]. \tag{10.35} \]

Again, we evaluate (10.35) by using the law of conditional iterated expecta-


tions, where we first condition on (R, Z1 ) to obtain
\[ E\left[ \left\{ \frac{R - \pi(Z_1,\psi^*)}{\pi(Z_1,\psi^*)} \right\} \big[ E\{m(Z,\beta_0)|R, Z_1\} - E_0\{m(Z,\beta_0)|Z_1\} \big] \right]. \tag{10.36} \]

Because of MAR, R ⊥ ⊥ Z2 |Z1 , which implies that E{m(Z, β0 )|R, Z1 } =


E0 {m(Z, β0 )|Z1 }. This then implies that (10.36) equals zero, which, in turn,
implies that β̂n is consistent when (10.31) is true.

Remarks Regarding Double-Robust Estimators

In developing the adaptive strategy that led to double-robust estimators, we


had to posit a simplifying model for p∗Z (z, ξ) or for p∗Z2 |Z1 (z2 |z1 , ξ) and then
estimate the parameter ξ. We originally took the point of view that the model
for the coarsening (missingness) probabilities π(Z1 , ψ) was correctly specified
and hence the posited model for the distribution of Z was used to compute
projections onto the augmentation space, which, in turn, gained us efficiency
while still leading to a consistent and asymptotically normal estimator for
β even if the posited model was misspecified. To estimate the parameter ξ,
we suggested using likelihood methods such as maximizing (10.15) or (10.16).
However, any estimator that would lead to a consistent estimator of ξ (assum-
ing the posited model was correct) could have been used. In fact, we might
use an IPWCC or AIPWCC estimator for ξ since, in some cases, these may
be easier to implement. For example, if a full-data estimating function for ξ
was easily obtained (i.e., m(Z, ξ) such that Eξ {m(Z, ξ)} = 0), then a simple
IPWCC estimator for ξ, using observed data, could be obtained by solving
\[ \sum_{i=1}^{n} \frac{R_i\, m(Z_i, \xi)}{\pi(Z_{1i}, \hat\psi_n)} = 0. \tag{10.37} \]

Such a strategy is perfectly reasonable as long as we believe that the model


for the coarsening probabilities is correctly specified because this is necessary
to guarantee that the solution to (10.37) would lead to a consistent estimator
of ξ if the posited model for the distribution of Z was correct.
However, if our goal is to obtain a double-robust estimator for β, then we
must make sure that the estimator for ξ is a consistent estimator if the posited
model for the distribution of Z was correctly specified, regardless of whether
the model for the coarsening probabilities was correctly specified or not. This
means that we could use likelihood methods such as maximizing (10.15) or
(10.16), as such estimators do not involve the coarsening probabilities, but we
could not use IPWCC or AIPWCC estimators such as (10.37).

Logistic Regression Example Revisited

In Chapter 7, we gave an example where we were interested in estimating


the parameter β in a logistic regression model used to model the relationship
of a binary outcome variable Y as a function of covariates X = (X1T , X2 )T ,

where, for ease of illustration, we let X2 be a single random variable. Also, for
this example, we let all the covariates in X be continuous random variables.
Specifically, we consider the model
\[ P(Y = 1 \mid X) = \frac{\exp(\beta^T X^*)}{1 + \exp(\beta^T X^*)}, \]

where X ∗ = (1, X1T , X2 )T is introduced to allow for an intercept term. We


also denote the full data Z = (Y, X). We considered the case where there
were two levels of missingness, specifically where (Y, X1 ) was always observed
but where the single variable X2 was missing on some of the individuals. The
observed data are denoted by Oi = (Ri , Yi , X1i , Ri X2i ), i = 1, . . . , n.
In the example in Chapter 7, we assumed that missingness was by design,
but here we will assume that the missing data were not by design and hence
the probability of missingness has to be modeled. We assume that the data are
missing at random (MAR) and the missingness probability, or more precisely
the probability of a complete case, follows the logistic regression model (10.8).

\[ \pi(Y, X_1, \psi) = \frac{\exp(\psi_0 + \psi_1 Y + \psi_2^T X_1)}{1 + \exp(\psi_0 + \psi_1 Y + \psi_2^T X_1)}. \tag{10.38} \]

Estimates of the parameter ψ = (ψ0 , ψ1 , ψ2T )T are obtained by maximizing the


likelihood (10.9). This can be easily accomplished using standard statistical
software. The resulting estimator is denoted by ψ̂n and the estimator for the
probability of a complete case is denoted by π(Yi , X1i , ψ̂n ).
We next consider the choice for the full-data estimating function m(Z, β).
With full data, the optimal choice for m(Z, β) is
\[ m(Y, X_1, X_2, \beta) = X^*\left\{ Y - \frac{\exp(\beta^T X^*)}{1 + \exp(\beta^T X^*)} \right\}. \]
Although this may not be the optimal choice with the introduction of missing
data, for the time being we will use this to construct observed-data AIP-
WCC estimators. In Chapter 7, AIPWCC estimators were introduced for this
problem using equation (7.49) when the data were missing by design. When
the missingness probability is modeled using (10.38), then we would consider
AIPWCC estimators that are the solution to
\[ \sum_{i=1}^{n} \left[ \frac{R_i}{\pi(Y_i, X_{1i}, \hat\psi_n)}\, X_i^*\left\{ Y_i - \frac{\exp(\beta^T X_i^*)}{1 + \exp(\beta^T X_i^*)} \right\} - \left\{ \frac{R_i - \pi(Y_i, X_{1i}, \hat\psi_n)}{\pi(Y_i, X_{1i}, \hat\psi_n)} \right\} L(Y_i, X_{1i}) \right] = 0, \tag{10.39} \]

where L(Y, X1 ) is some arbitrary q-dimensional function of Y and X1 . We


now realize that in order to obtain as efficient an estimator as possible
for β among the estimators in (10.39), we should choose L(Y, X1) to equal E{m(Y, X1, X2, β)|Y, X1}; that is,
\[ L_2^*(Y, X_1) = E\left[ X^*\left\{ Y - \frac{\exp(\beta^T X^*)}{1 + \exp(\beta^T X^*)} \right\} \,\Big|\, Y, X_1 \right]. \]

Because the function L(Y, X1 ) depends on the conditional distribution of X2


given Y and X1 , which is unknown to us, we posit a model for this conditional
distribution and use adaptive methods.
The model we consider is motivated by the realization that a logistic re-
gression model for Y given X would be obtained if the conditional distribution
of X given Y followed a multivariate normal distribution with a mean that
depends on Y but with a variance matrix that is independent of Y ; that is,
the conditional distribution of X given Y = 1 would be M V N (µ1 , Σ) and
would be M V N (µ0 , Σ) given Y = 0; see, for example, Cox and Snell (1989).
For this scenario, the conditional distribution of X2 given X1 and Y would
follow a normal distribution; i.e.,
\[ X_2 \mid X_1, Y \sim N\big( \xi_0 + \xi_1^T X_1 + \xi_2 Y,\ \xi_\sigma^2 \big). \tag{10.40} \]

Therefore, one strategy is to posit the model (10.40) for the condi-
tional distribution of X2 given X1 and Y and estimate the parameters
ξ = (ξ0 , ξ1T , ξ2 , ξσ2 )T by maximizing (10.16). This is especially attractive be-
cause the model (10.40) is a traditional normally distributed linear model.
Therefore, using the complete cases (i.e., {i : Ri = 1}), the MLE for
(ξ0 , ξ1T , ξ2 )T can be obtained using standard least squares and the MLE for
the variance parameter ξσ2 can be obtained as the average of the squared
residuals. Denote this estimator by ξˆn∗ .
Finally, we must compute
\[ L_2^*(Y, X_1, \beta, \hat\xi_n^*) = E\{m(Y, X_1, X_2, \beta) \mid Y, X_1, \xi\}\big|_{\xi = \hat\xi_n^*} = \int m(Y, X_1, u, \beta)\, (2\pi \hat\xi_{\sigma n}^{*2})^{-1/2} \exp\left[ -\frac{\{u - (\hat\xi_{0n}^* + \hat\xi_{1n}^{*T} X_1 + \hat\xi_{2n}^* Y)\}^2}{2 \hat\xi_{\sigma n}^{*2}} \right] du. \]

This can be accomplished using numerical integration or Monte Carlo tech-


niques.
The estimator for β is then obtained by solving the equation
\[ \sum_{i=1}^{n} \left[ \frac{R_i}{\pi(Y_i, X_{1i}, \hat\psi_n)}\, X_i^*\left\{ Y_i - \frac{\exp(\beta^T X_i^*)}{1 + \exp(\beta^T X_i^*)} \right\} - \left\{ \frac{R_i - \pi(Y_i, X_{1i}, \hat\psi_n)}{\pi(Y_i, X_{1i}, \hat\psi_n)} \right\} L_2^*(Y_i, X_{1i}, \beta, \hat\xi_n^*) \right] = 0. \tag{10.41} \]

This estimator is doubly robust; it is a consistent asymptotically normal RAL


estimator for β if either the missingness model (10.38) or the model for the
conditional distribution of X2 given X1 and Y (10.40) is correct.
The asymptotic variance for β̂n can be obtained by using the sandwich
variance estimator (10.28).
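The following Python sketch assembles the doubly robust estimator (10.41) on simulated data. It is an illustration only: the data-generating mechanism, the Monte Carlo approximation of L2∗, and all variable names are assumptions introduced here, not part of the text.

import numpy as np
from scipy.optimize import minimize, root

rng = np.random.default_rng(1)
n = 3000
expit = lambda u: 1 / (1 + np.exp(-u))

# simulated data (illustration only); X1 is scalar here for brevity
X1 = rng.normal(size=n)
X2 = 0.5 + 0.7 * X1 + rng.normal(size=n)
Y = rng.binomial(1, expit(-0.5 + 1.0 * X1 + 0.8 * X2))
R = rng.binomial(1, expit(1.0 + 0.5 * Y - 0.7 * X1))       # X2 observed iff R = 1

def design(x1, x2):                                         # X* = (1, X1, X2)
    return np.column_stack([np.ones(len(x1)), x1, x2])

# missingness model (10.38): logistic regression of R on (1, Y, X1)
W = np.column_stack([np.ones(n), Y, X1])
psi_hat = minimize(lambda p: -np.sum(R * (W @ p) - np.logaddexp(0.0, W @ p)),
                   np.zeros(3)).x
pi_hat = expit(W @ psi_hat)

# working model (10.40): X2 | X1, Y normal linear, fit on the complete cases
V_cc = np.column_stack([np.ones(R.sum()), X1[R == 1], Y[R == 1]])
xi_hat, *_ = np.linalg.lstsq(V_cc, X2[R == 1], rcond=None)
sigma_hat = np.sqrt(np.mean((X2[R == 1] - V_cc @ xi_hat) ** 2))
mu_hat = xi_hat[0] + xi_hat[1] * X1 + xi_hat[2] * Y         # fitted E(X2 | X1, Y)

# L2*(Y, X1, beta): conditional expectation of m under the fitted normal,
# approximated by Monte Carlo draws of X2 given (X1, Y)
u_draws = rng.normal(size=(200, n)) * sigma_hat + mu_hat
def L2_star(beta):
    acc = np.zeros((n, 3))
    for u in u_draws:
        Xs = design(X1, u)
        acc += Xs * (Y - expit(Xs @ beta))[:, None]
    return acc / len(u_draws)

# doubly robust estimating equation (10.41); unobserved X2 values enter the
# complete-case term only multiplied by R, so they never contribute
def ee(beta):
    Xs = design(X1, X2)
    m = Xs * (Y - expit(Xs @ beta))[:, None]
    term = (R / pi_hat)[:, None] * m - ((R - pi_hat) / pi_hat)[:, None] * L2_star(beta)
    return term.mean(axis=0)

beta_hat = root(ee, np.zeros(3)).x
print("doubly robust estimate of beta:", beta_hat)

Deliberately misspecifying either the working model (10.40) or the missingness model (10.38) in this sketch is a convenient way to examine the double-robustness property numerically.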

10.3 Improving Efficiency with Monotone Coarsening

Finding the Projection onto the Augmentation Space

Monotone missingness, or more generally monotone coarsening, occurs often


in practice and is worth special consideration. We have shown that a natural
way to obtain semiparametric coarsened-data estimators for β is to consider
augmented inverse probability weighted complete-case estimators, estimators
based on
\[ \sum_{i=1}^{n} \left[ \frac{I(C_i = \infty)\, m(Z_i, \beta)}{\wp(\infty, Z_i, \hat\psi_n)} + L_2\{C_i, G_{C_i}(Z_i)\} \right] = 0, \tag{10.42} \]
where m(Zi, β) is an estimating function such that m(Z, β0) = ϕ∗F(Z) ∈ ΛF⊥ and L2{C, GC(Z)} ∈ Λ2. We also proved in Section 10.1 that we should choose L2{C, GC(Z)} = −Π[ I(C = ∞)m(Z, β)/℘(∞, Z, ψ) | Λ2 ] in order to gain the greatest efficiency among estimators that solve (10.42); see (10.7). Therefore, to obtain estimators for β with improved efficiency, we now consider how to find the projection of I(C = ∞)m(Z, β)/℘(∞, Z, ψ) onto the augmentation space Λ2 when the coarsening is monotone.
We remind the reader that a typical element of Λ2 , written in terms of
the coarsening probabilities, is given by (7.37). However, under monotone
coarsening, we showed (in Theorem 9.2) that a typical element of Λ2 can be
reparametrized, in terms of discrete hazard functions, as
\[ \sum_{r \neq \infty} \left[ \frac{I(C = r) - \lambda_r\{G_r(Z)\}\, I(C \geq r)}{K_r\{G_r(Z)\}} \right] L_r\{G_r(Z)\}, \]
where the discrete hazard function λr{Gr(Z)} and Kr{Gr(Z)} are defined by (8.2) and (8.4), respectively, and Lr{Gr(Z)} denotes an arbitrary function of Gr(Z) for r ≠ ∞.
By the projection theorem, if we want to find
\[ \Pi\left[ \frac{I(C = \infty)\, m(Z,\beta)}{\wp(\infty, Z)} \,\Big|\, \Lambda_2 \right], \]
then we must derive the functions L0r{Gr(Z)}, r ≠ ∞, such that
\[ E\left( \left[ \frac{I(C = \infty)\, m^T(Z,\beta)}{\wp(\infty, Z)} - \sum_{r \neq \infty} \left\{ \frac{I(C = r) - \lambda_r\{G_r(Z)\}\, I(C \geq r)}{K_r\{G_r(Z)\}} \right\} L_{0r}^T\{G_r(Z)\} \right] \left[ \sum_{r \neq \infty} \left\{ \frac{I(C = r) - \lambda_r\{G_r(Z)\}\, I(C \geq r)}{K_r\{G_r(Z)\}} \right\} L_r\{G_r(Z)\} \right] \right) = 0 \quad \text{for all } L_r\{G_r(Z)\},\ r \neq \infty. \tag{10.43} \]

Theorem 10.4. The projection of I(C = ∞)m(Z, β)/℘(∞, Z) onto Λ2 (i.e., the solution to (10.43)) is obtained by taking L0r{Gr(Z)} = −E{m(Z, β)|Gr(Z)}.
Those readers who are familiar with the counting process notation and
stochastic integral martingale processes that are used in an advanced course
in survival analysis (see, for example, Fleming and Harrington, 1991) will find
that the proof of Theorem 10.4 uses similar methods. We first derive some
relationships in the following three lemmas that will simplify the calculations
in (10.43).
Lemma 10.1. For r ≠ r′,
\[ E\left[ \left\{ \frac{I(C = r) - \lambda_r\{G_r(Z)\}\, I(C \geq r)}{K_r\{G_r(Z)\}} \right\} L_{0r}^T\{G_r(Z)\} \times \left\{ \frac{I(C = r') - \lambda_{r'}\{G_{r'}(Z)\}\, I(C \geq r')}{K_{r'}\{G_{r'}(Z)\}} \right\} L_{r'}\{G_{r'}(Z)\} \right] = 0. \tag{10.44} \]
Proof. For a single observation, define F_{r'} to be the random vector {I(C = 1), I(C = 2), . . . , I(C = r' − 1), Z}. Without loss of generality, take r' > r. The expectation in (10.44) can be evaluated as E{E(· | F_{r'})}. Conditional on F_{r'}, however, the only term in (10.44) that is not known is I(C = r'). Consequently, (10.44) can be written as
\[ E\left[ \left\{ \frac{I(C = r) - \lambda_r\{G_r(Z)\}\, I(C \geq r)}{K_r\{G_r(Z)\}} \right\} L_{0r}^T\{G_r(Z)\} \times \left\{ \frac{E\{I(C = r') \mid \mathcal{F}_{r'}\} - \lambda_{r'}\{G_{r'}(Z)\}\, I(C \geq r')}{K_{r'}\{G_{r'}(Z)\}} \right\} L_{r'}\{G_{r'}(Z)\} \right]. \tag{10.45} \]
But
\[ E\{I(C = r') \mid \mathcal{F}_{r'}\} = P(C = r' \mid C \geq r', Z)\, I(C \geq r'), \tag{10.46} \]
which, by the coarsening at random assumption and the definition of a discrete hazard given by (8.2), is equal to λ_{r'}{G_{r'}(Z)} I(C ≥ r'). Substituting (10.46) into (10.45) proves (10.44). □


Lemma 10.2. When r = r′, the left-hand side of (10.44) equals
\[ E\left[ \left\{ \frac{I(C = r) - \lambda_r\{G_r(Z)\}\, I(C \geq r)}{K_r\{G_r(Z)\}} \right\} L_{0r}^T\{G_r(Z)\} \left\{ \frac{I(C = r) - \lambda_r\{G_r(Z)\}\, I(C \geq r)}{K_r\{G_r(Z)\}} \right\} L_r\{G_r(Z)\} \right] = E\left[ \frac{\lambda_r\{G_r(Z)\}}{K_r\{G_r(Z)\}}\, L_{0r}^T\{G_r(Z)\}\, L_r\{G_r(Z)\} \right]. \tag{10.47} \]

Proof. Computing the expectation of the left-hand side of equation (10.47) by first conditioning on Fr yields
\[ E\left[ \frac{E[\{I(C = r) - \lambda_r\{G_r(Z)\}\, I(C \geq r)\}^2 \mid \mathcal{F}_r]}{K_r^2\{G_r(Z)\}}\, L_{0r}^T\{G_r(Z)\}\, L_r\{G_r(Z)\} \right]. \tag{10.48} \]
The conditional expectation
\[ E[\{I(C = r) - \lambda_r\{G_r(Z)\}\, I(C \geq r)\}^2 \mid \mathcal{F}_r] \]
is the conditional variance of the Bernoulli indicator I(C = r) given Fr, which equals
\[ \lambda_r\{G_r(Z)\}\,[1 - \lambda_r\{G_r(Z)\}]\, I(C \geq r). \]
Hence, (10.48) equals
\[ E\left[ \frac{\lambda_r\{G_r(Z)\}\,[1 - \lambda_r\{G_r(Z)\}]\, I(C \geq r)}{K_r^2\{G_r(Z)\}}\, L_{0r}^T\{G_r(Z)\}\, L_r\{G_r(Z)\} \right]. \tag{10.49} \]

Computing the expectation of (10.49) by first conditioning on Z gives
\[ E\left\{ \frac{\lambda_r (1 - \lambda_r)\, P(C \geq r \mid Z)}{K_r^2}\, L_{0r}^T L_r \right\}. \tag{10.50} \]
Since \( P(C \geq r \mid Z) = \prod_{j=1}^{r-1}(1 - \lambda_j) \), we obtain that \( (1 - \lambda_r)\, P(C \geq r \mid Z) = \prod_{j=1}^{r}(1 - \lambda_j) = K_r \). Therefore, (10.50) equals \( E\{ (\lambda_r / K_r)\, L_{0r}^T L_r \} \), thus proving the lemma. □


Lemma 10.3.
\[ E\left[ \frac{I(C = \infty)\, m^T(Z,\beta)}{\wp(\infty, Z)} \left\{ \frac{I(C = r) - \lambda_r\{G_r(Z)\}\, I(C \geq r)}{K_r\{G_r(Z)\}} \right\} L_r\{G_r(Z)\} \right] = -E\left[ \frac{\lambda_r\{G_r(Z)\}}{K_r\{G_r(Z)\}}\, m^T(Z,\beta)\, L_r\{G_r(Z)\} \right]. \tag{10.51} \]
Proof. Since I(C = ∞)I(C = r) = 0 for r ≠ ∞ and I(C = ∞)I(C ≥ r) = I(C = ∞), the left-hand side of (10.51) equals
\[ -E\left[ \frac{I(C = \infty)}{\wp(\infty, Z)}\, \frac{\lambda_r\{G_r(Z)\}}{K_r\{G_r(Z)\}}\, m^T(Z,\beta)\, L_r\{G_r(Z)\} \right]. \]
The result (10.51) holds by first conditioning on Z and realizing that
\[ E\left\{ \frac{I(C = \infty)}{\wp(\infty, Z)} \,\Big|\, Z \right\} = 1. \qquad \square \]
Proof (Theorem 10.4). Using the results of Lemmas 10.1-10.3, equation (10.43) can be written as
\[ -\sum_{r \neq \infty} E\left( \frac{\lambda_r\{G_r(Z)\}}{K_r\{G_r(Z)\}} \big[ m(Z,\beta) + L_{0r}\{G_r(Z)\} \big]^T L_r\{G_r(Z)\} \right) = 0, \tag{10.52} \]

for all functions Lr{Gr(Z)}, r ≠ ∞.
By conditioning each expectation in the sum on the left-hand side of (10.52) on Gr(Z), this can be written as
\[ -\sum_{r \neq \infty} E\left( \frac{\lambda_r\{G_r(Z)\}}{K_r\{G_r(Z)\}} \big[ E\{m(Z,\beta) \mid G_r(Z)\} + L_{0r}\{G_r(Z)\} \big]^T L_r\{G_r(Z)\} \right) = 0. \tag{10.53} \]

We now show that equation (10.53) holds if and only if
\[ L_{0r}\{G_r(Z)\} = -E\{m(Z,\beta) \mid G_r(Z)\} \quad \text{for all } r \neq \infty. \tag{10.54} \]
Clearly, (10.53) follows when (10.54) holds. Conversely, if (10.54) were not true, then we could choose Lr{Gr(Z)} = E{m(Z, β)|Gr(Z)} + L0r{Gr(Z)} for all r ≠ ∞ to get a contradiction.
Therefore, we have demonstrated that, with monotone CAR,
\[ \Pi\left[ \frac{I(C = \infty)\, m(Z,\beta)}{\wp(\infty, Z)} \,\Big|\, \Lambda_2 \right] = -\sum_{r \neq \infty} \left[ \frac{I(C = r) - \lambda_r\{G_r(Z)\}\, I(C \geq r)}{K_r\{G_r(Z)\}} \right] E\{m(Z,\beta) \mid G_r(Z)\}. \qquad \square \tag{10.55} \]

In order to take advantage of the results above, we need to compute
E{m(Z, β)|Gr (Z)}. This requires us to estimate the distribution of Z, or
at least enough of the distribution to be able to compute these conditional
expectations.

Remark 4. This last statement almost seems like circular reasoning. That is,
we argue that to gain greater efficiency, we would need to estimate the distri-
bution of Z. However, if we had methods to estimate the distribution of Z,
then we wouldn’t need to develop this theory in the first place. The rationale
for considering semiparametric theory for coarsened data is that it led us to
augmented inverse probability weighted complete-case estimators, which, we
argued, build naturally on full-data estimators and are easier to derive than,
say, likelihood methods with coarsened data. However, as we saw in the case
with two levels of missingness, we will still obtain consistent asymptotically
normal estimators using this inverse weighted methodology even if we con-
struct estimators for the distribution of Z that are incorrect. This gives us
greater flexibility and robustness and suggests the use of an adaptive approach,
as we now describe.  

Adaptive Estimation

To take advantage of the results for increased efficiency, we consider an


adaptive approach. That is, we posit simpler models for the distribution
of Z only for the purpose of approximating the conditional expectations E{m(Z, β)|Gr(Z)}, r ≠ ∞. We do not necessarily expect that these posited models are correct, although we do hope that they may serve as a reasonable approximation. However, even if the posited model is incorrect, the resulting expectation, which we denote by EINC{m(Z, β)|Gr(Z)} because
\[ E_{INC}\{m(Z,\beta) \mid G_r(Z)\} \neq E_0\{m(Z,\beta) \mid G_r(Z)\}, \]
is still a function of Gr(Z).


Consequently, computing (10.55) under incorrectly posited models would
lead us to use

L2 {C, GC (Z)}
  I(C = r) − λr {Gr (Z)}I(C ≥ r) 
= EINC {m(Z, β)|Gr (Z)} (10.56)
Kr {Gr (Z)}
r=∞

in the estimating equation (10.42). Even though (10.56) is not the correct pro-
I(R = ∞)m(Z, β)
jection of onto Λ2 , if the posited model for Z is incorrect,
(∞, Z)
it is still an element of the augmentation space Λ2 , which implies that the so-
lution to (10.42), using the augmented term L2 (·) as given by (10.56), would
still result in an AIPWCC consistent, asymptotically normal semiparametric
estimator for β. This protection against misspecified models argues in favor
of using an adaptive approach.
In an adaptive strategy, to improve efficiency, we start by positing a simpler
and possibly incorrect model for the distribution of the full data Z. Say we
assume Z ∼ p∗Z(z, ξ), where ξ is finite-dimensional. Under this presumed model, we could compute the conditional expectations
\[ E\{m(Z,\beta) \mid G_r(Z), \xi\} = \frac{\int_{\{G_r(z) = G_r(Z)\}} m(z,\beta)\, p_Z^*(z,\xi)\, d\nu_Z(z)}{\int_{\{G_r(z) = G_r(Z)\}} p_Z^*(z,\xi)\, d\nu_Z(z)}. \tag{10.57} \]

Of course, we need to estimate the parameter ξ in our posited model. Be-


cause the parameter ξ is finite-dimensional, we can estimate ξ using likelihood
techniques as described in Section 7.1.
Using (7.10), the likelihood for a realization of such data, (ri , gri ), i =
1, . . . , n, when the coarsening mechanism is CAR, is equal to
\[ \prod_{i=1}^{n} p^*_{G_{r_i}(Z_i)}(g_{r_i}, \xi), \tag{10.58} \]
where
\[ p^*_{G_r(Z)}(g_r, \xi) = \int_{\{z : G_r(z) = g_r\}} p_Z^*(z,\xi)\, d\nu_Z(z). \]

Consequently, we would estimate E{m(Z, β)|Gr (Z), ξ} by substituting ξˆn∗


for ξ in (10.57), where ξˆn∗ is the MLE obtained by maximizing (10.58). The
estimated conditional expectation is denoted by E{m(Z, β)|Gr (Z), ξˆn∗ }.

Remark 5. Because of the monotone nature of the coarsening, it may be convenient to build models for the density of Z by considering models for the conditional density of G_{r'}(Z) given G_{r'-1}(Z). Specifically, the density of Z can be written as
\[ p_Z^*(z) = \prod_{r'} p^*_{G_{r'}(Z)|G_{r'-1}(Z)}(g_{r'} \mid g_{r'-1}), \]
where the product is over the coarsening levels r' = 1, . . . , ℓ, ∞, g_{r'} = G_{r'}(z), p∗G1(Z)|G0(Z)(g1|g0) = p∗G1(Z)(g1),
\[ p^*_{G_{r'}(Z)|G_{r'-1}(Z)}(g_{r'} \mid g_{r'-1}) = p^*_{Z|G_\ell(Z)}(z \mid g_\ell) \quad \text{when } r' = \infty, \]
and ℓ denotes the number of levels of coarsening. With this representation, we can write the density
\[ p^*_{G_r(Z)}(g_r) = \prod_{r' \leq r} p^*_{G_{r'}(Z)|G_{r'-1}(Z)}(g_{r'} \mid g_{r'-1}). \]
If, in addition, we consider models for the conditional density
\[ p^*_{G_r(Z)|G_{r-1}(Z)}(g_r \mid g_{r-1}, \xi_r), \]
in terms of a parameter ξr, where the ξr, for different r, are variationally independent, then the likelihood (10.58) can be written as
\[ \prod_{i: r_i \geq 1} p^*_{G_1(Z_i)}(g_{1i}, \xi_1) \prod_{i: r_i \geq 2} p^*_{G_2(Z_i)|G_1(Z_i)}(g_{2i} \mid g_{1i}, \xi_2) \times \cdots \times \prod_{i: r_i = \infty} p^*_{Z_i|G_\ell(Z_i)}(z_i \mid g_{\ell i}, \xi_\infty). \tag{10.59} \]
The maximum likelihood estimator for ξ = (ξ1, . . . , ξ∞)T can then be obtained by maximizing each of the terms in (10.59) separately. □

Thus, the adaptive approach to finding estimators when data are mono-
tonically coarsened is as follows.
1. We first consider the full-data problem. That is, how would semiparametric estimators for β be obtained if we had full data? For example, we may use a full-data m-estimator for β, which is the solution to
\[ \sum_{i=1}^{n} m(Z_i, \beta) = 0, \]
which has influence function
\[ \left\{ -E\!\left( \frac{\partial m(Z_i, \beta_0)}{\partial\beta^T} \right) \right\}^{-1} m(Z_i, \beta_0) = \varphi^F(Z_i). \]
2. Next, we consider the augmented inverse probability weighted complete-case estimator that is the solution to (10.42), with L2(·) being an estimator of (10.56). Specifically, we consider the estimator for β that solves the estimating equation
\[ \sum_{i=1}^{n} \left( \frac{I(C_i = \infty)\, m(Z_i, \beta)}{\wp(\infty, Z_i, \hat\psi_n)} + \sum_{r \neq \infty} \left[ \frac{I(C_i = r) - \lambda_r\{G_r(Z_i), \hat\psi_n\}\, I(C_i \geq r)}{K_r\{G_r(Z_i), \hat\psi_n\}} \right] E\{m(Z,\beta) \mid G_r(Z_i), \hat\xi_n^*\} \right) = 0, \tag{10.60} \]
where ℘(∞, Zi, ψ̂n) = ∏_{r<∞} [1 − λr{Gr(Zi), ψ̂n}] (see (8.6)), ψ̂n is the maximum likelihood estimator for ψ obtained by maximizing (8.12), and E{m(Z, β)|Gr(Zi), ξ̂n∗} is obtained using (10.57), substituting ξ̂n∗, which maximizes (10.58) or (10.59), for ξ. We denote this estimator by β̂n (see the sketch following this list).
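As a bookkeeping aid for (10.60), the following Python helper is a sketch with a hypothetical interface: the fitted discrete hazards and conditional expectations are taken as given inputs, and none of the names come from the text.

import numpy as np

def aipwcc_summand(C, lam, ce, m_full=None):
    """Summand of (10.60) for one subject (illustrative interface).
    C: observed coarsening level, 1..ell for dropout, np.inf for a complete case.
    lam: length-ell array, lam[r-1] = lambda_r{G_r(Z_i), psi_hat}.
    ce: (ell x q) array, ce[r-1] = E{m(Z, beta) | G_r(Z_i), xi_hat}.
    m_full: m(Z_i, beta), used only when C = inf."""
    lam = np.asarray(lam, dtype=float)
    ce = np.asarray(ce, dtype=float)
    K = np.cumprod(1.0 - lam)                   # K_r = prod_{j <= r} (1 - lambda_j)
    wp_inf = K[-1]                              # P(C = inf | Z) = K_ell
    total = np.zeros(ce.shape[1])
    if np.isinf(C):
        total += m_full / wp_inf                # inverse-weighted complete-case term
    for r in range(1, len(lam) + 1):            # augmentation over dropout levels
        at_risk = 1.0 if (np.isinf(C) or C >= r) else 0.0
        dM = (1.0 if C == r else 0.0) - lam[r - 1] * at_risk
        total += dM / K[r - 1] * ce[r - 1]
    return total

# beta_hat then solves sum_i aipwcc_summand(C_i, lam_i, ce_i(beta), m_i(beta)) = 0,
# for example with scipy.optimize.root over beta.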

Even though the posited model p∗Z (z, ξ) may not be correctly specified, under
suitable regularity conditions, the estimator ξˆn∗ will converge in probability to
a constant ξ ∗ and that n1/2 (ξˆn∗ −ξ ∗ ) will be bounded in probability. Also, even
if the posited model is incorrect, the function L2 {Ci , GCi (Zi ), β, ψ0 , ξ ∗ } ∈ Λ2 ,
where

L2 {Ci , GCi (Zi ), β, ψ, ξ} =


  I(Ci = r) − λr {Gr (Zi ), ψ}I(Ci ≥ r) 
E{m(Z, β)|Gr (Zi ), ξ}.
Kr {Gr (Zi ), ψ}
r=∞

Consequently, the solution to the estimating equation
\[ \sum_{i=1}^{n} \left[ \frac{I(C_i = \infty)\, m(Z_i, \beta)}{\wp(\infty, Z_i, \hat\psi_n)} + L_2\{C_i, G_{C_i}(Z_i), \beta_0, \hat\psi_n, \xi^*\} \right] = 0, \tag{10.61} \]
with β set to β0 and ξ∗ fixed in L2(·), is an AIPWCC estimator for β, which we denote by β̂n∗.
Using an argument similar to that in Theorem 10.3, when we considered two levels of missingness, we can show that, under suitable regularity conditions, the estimator β̂n, which solves equation (10.60), is asymptotically equivalent to the AIPWCC estimator β̂n∗, which solves (10.61); that is,
\[ n^{1/2}(\hat\beta_n - \hat\beta_n^*) \xrightarrow{P} 0. \]
(We leave this as an exercise for the reader.)


The resulting influence function for β̂n, which is the same as for β̂n∗, can now be derived by Theorem 9.1 and is equal to
\[ \frac{I(C_i = \infty)\, \varphi^F(Z_i)}{\wp(\infty, Z_i, \psi_0)} + L_2^*\{C_i, G_{C_i}(Z_i), \beta_0, \psi_0, \xi^*\} - \Pi\left[ \frac{I(C_i = \infty)\, \varphi^F(Z_i)}{\wp(\infty, Z_i, \psi_0)} + L_2^*\{C_i, G_{C_i}(Z_i), \beta_0, \psi_0, \xi^*\} \,\Big|\, \Lambda_\psi \right], \]
where
\[ \varphi^F(Z_i) = \left\{ -E\!\left( \frac{\partial m(Z_i, \beta_0)}{\partial\beta^T} \right) \right\}^{-1} m(Z_i, \beta_0) \]
and
\[ L_2^*\{C_i, G_{C_i}(Z_i), \beta_0, \psi_0, \xi^*\} = \left\{ -E\!\left( \frac{\partial m(Z_i, \beta_0)}{\partial\beta^T} \right) \right\}^{-1} L_2\{C_i, G_{C_i}(Z_i), \beta_0, \psi_0, \xi^*\}. \]
The asymptotic variance of β̂n can also be obtained by using the sandwich variance estimator for AIPWCC estimators given by (9.19).

Remark 6. If the posited model p∗Z(z, ξ) is correctly specified, then E{m(Zi, β0)|Gr(Zi), ξ̂n∗} will be a consistent estimator of E0{m(Zi, β0)|Gr(Zi)}, the true conditional expectation. In this case, L2{Ci, GCi(Zi), β0, ψ̂n, ξ̂n∗} will converge to
\[ -\Pi\left[ \frac{I(C_i = \infty)\, m(Z_i, \beta_0)}{\wp(\infty, Z_i, \psi_0)} \,\Big|\, \Lambda_2 \right]. \]
For the case of a correctly specified model, the influence function is
\[ \left[ \frac{I(C_i = \infty)\, \varphi^F(Z_i)}{\wp(\infty, Z_i, \psi_0)} - \Pi\left\{ \frac{I(C_i = \infty)\, \varphi^F(Z_i)}{\wp(\infty, Z_i, \psi_0)} \,\Big|\, \Lambda_2 \right\} \right] - \Pi\{\,\cdot \mid \Lambda_\psi\}, \]
where the last term, −Π{· | Λψ}, must equal zero because the quantity in square brackets is orthogonal to Λ2 and Λψ ⊂ Λ2. In this case, the influence function equals
\[ \frac{I(C = \infty)\, \varphi^F(Z)}{\wp(\infty, Z, \psi_0)} - \Pi\left\{ \frac{I(C = \infty)\, \varphi^F(Z)}{\wp(\infty, Z, \psi_0)} \,\Big|\, \Lambda_2 \right\} = \mathcal{J}\{\varphi^F(Z)\}, \]
which is the most efficient among observed-data influence functions associated with the full-data influence function ϕF(Zi) and is an element of (IF)DR. □


Generally, the attempt to estimate
\[ \Pi\left[ \frac{I(C_i = \infty)\, \varphi^F(Z_i)}{\wp(\infty, Z_i)} \,\Big|\, \Lambda_2 \right] \]

by positing a model p∗Z (z, ξ) often leads to more efficient estimators even if
the model was incorrect. In fact, this attempt to gain efficiency also gives us
the extra protection of double robustness similar to that seen in the previous
section when we considered two levels of missingness. We now explore this
double-robustness relationship for the case of monotone coarsening.

Double Robustness with Monotone Coarsening

Throughout this section, we have taken the point of view that the model for
the coarsening probabilities was correctly specified. That is, for some ψ0 ,

λr{Gr(Z), ψ0} = P0(C = r | C ≥ r, Z), r ≠ ∞,

where P0 (C = r|C ≥ r, Z) denotes the true discrete hazard rate. In actuality,


this model may also be misspecified. Nonetheless, under suitable regularity
conditions, the maximum likelihood estimator ψ̂n , which maximizes (8.12),
will converge to a constant ψ ∗ even if the model for the coarsening hazards is
not correctly specified.
When we developed the adaptive estimators for the purpose of improving efficiency, we considered the posited model p∗Z(z, ξ) and argued that the estimator ξ̂n∗ converges to a constant ξ∗, where p∗Z(z, ξ∗) may not be the correct distribution for the full data Z (i.e., possibly p∗Z(z, ξ∗) ≠ p0Z(z)), where p0Z(z) = pZ(z, β0, η0) denotes the true density of Z. We now consider the double-robustness property of the proposed adaptive estimator β̂n for β, the solution to (10.60). That is, we will prove that β̂n is a consistent estimator if either the model for λr{Gr(Z), ψ}, r ≠ ∞, or the posited model p∗Z(z, ξ) is correctly specified.
Using standard asymptotic arguments, the estimator β̂n will be consistent
and asymptotically normal if we can show that

\[ E\left( \frac{I(C = \infty)\, m(Z, \beta_0)}{\wp(\infty, Z, \psi^*)} + \sum_{r \neq \infty} \left[ \frac{I(C = r) - \lambda_r\{G_r(Z), \psi^*\}\, I(C \geq r)}{K_r\{G_r(Z), \psi^*\}} \right] E\{m(Z, \beta_0) \mid G_r(Z), \xi^*\} \right) = 0. \tag{10.62} \]

It will be convenient to show first that the expression inside the expectation on the left-hand side of (10.62) can be written as
\[ m(Z, \beta_0) - \sum_{r \neq \infty} \left[ \frac{I(C = r) - \lambda_r\{G_r(Z), \psi^*\}\, I(C \geq r)}{K_r\{G_r(Z), \psi^*\}} \right] \times \big[ m(Z, \beta_0) - E\{m(Z, \beta_0) \mid G_r(Z), \xi^*\} \big]. \tag{10.63} \]
This follows because
\[ \frac{I(C = \infty)\, m(Z, \beta_0)}{\wp(\infty, Z, \psi^*)} = m(Z, \beta_0) + \left\{ \frac{I(C = \infty)}{\wp(\infty, Z, \psi^*)} - 1 \right\} m(Z, \beta_0) \tag{10.64} \]
and by the following lemma.



Lemma 10.4.
\[ \sum_{r \neq \infty} \left[ \frac{I(C = r) - \lambda_r\{G_r(Z), \psi^*\}\, I(C \geq r)}{K_r\{G_r(Z), \psi^*\}} \right] = 1 - \frac{I(C = \infty)}{\wp(\infty, Z, \psi^*)}. \tag{10.65} \]
Proof (Lemma 10.4). Because of the discreteness of C, we can write
\[ \sum_{r \neq \infty} \frac{I(C = r)}{K_r\{G_r(Z), \psi^*\}} = \frac{I(C \neq \infty)}{K_C\{G_C(Z), \psi^*\}}. \tag{10.66} \]
By the definitions of λr(·) and Kr(·), given by (8.2) and (8.4), respectively, we obtain that
\[ \frac{\lambda_r(\cdot)}{K_r(\cdot)} = \frac{1}{K_r(\cdot)} - \frac{1}{K_{r-1}(\cdot)}, \]
where K0(·) = 1. Consequently,
\[ -\sum_{r \neq \infty} \frac{\lambda_r(\cdot)\, I(C \geq r)}{K_r(\cdot)} = I(C \neq \infty) \sum_{r \leq C} \left\{ \frac{1}{K_{r-1}(\cdot)} - \frac{1}{K_r(\cdot)} \right\} + I(C = \infty) \sum_{r \neq \infty} \left\{ \frac{1}{K_{r-1}(\cdot)} - \frac{1}{K_r(\cdot)} \right\} = I(C \neq \infty)\left[ 1 - \frac{1}{K_C\{G_C(Z), \psi^*\}} \right] + I(C = \infty)\left[ 1 - \frac{1}{K_\ell\{G_\ell(Z), \psi^*\}} \right], \tag{10.67} \]
where ℓ denotes the number of different coarsening levels (i.e., the largest integer r < ∞) and
\[ K_\ell\{G_\ell(Z), \psi^*\} = \prod_{r \neq \infty} \big[ 1 - \lambda_r\{G_r(Z), \psi^*\} \big] = \wp(\infty, Z, \psi^*). \tag{10.68} \]
Taking the sum of (10.66) and (10.67) and substituting ℘(∞, Z, ψ∗) for Kℓ{Gℓ(Z), ψ∗} (see (10.68)), we obtain
\[ I(C \neq \infty) + I(C = \infty)\left\{ 1 - \frac{1}{\wp(\infty, Z, \psi^*)} \right\} = 1 - \frac{I(C = \infty)}{\wp(\infty, Z, \psi^*)}, \]
thus proving the lemma. □




Therefore, to prove the double-robustness property of β̂n, it suffices to show that the expected value of (10.63) is equal to zero if either the model for λr{Gr(Z), ψ}, r ≠ ∞, or the posited model p∗Z(z, ξ) is correctly specified, which we establish in the following theorem.

Theorem 10.5.
\[ E\left( m(Z, \beta_0) - \sum_{r \neq \infty} \left[ \frac{I(C = r) - \lambda_r\{G_r(Z), \psi^*\}\, I(C \geq r)}{K_r\{G_r(Z), \psi^*\}} \right] \times \big[ m(Z, \beta_0) - E\{m(Z, \beta_0) \mid G_r(Z), \xi^*\} \big] \right) = 0 \]
if either the model for λr{Gr(Z), ψ}, r ≠ ∞, or the posited model p∗Z(z, ξ) is correctly specified.
Proof. By construction, E{m(Z, β0)} = 0. Therefore, to prove Theorem 10.5, we must show that
\[ E\left( \left[ \frac{I(C = r) - \lambda_r\{G_r(Z), \psi^*\}\, I(C \geq r)}{K_r\{G_r(Z), \psi^*\}} \right] \big[ m(Z, \beta_0) - E\{m(Z, \beta_0) \mid G_r(Z), \xi^*\} \big] \right) = 0, \tag{10.69} \]
for all r ≠ ∞, if either the model for λr{Gr(Z), ψ}, r ≠ ∞, or the posited model p∗Z(z, ξ) is correctly specified.
We first consider the case when the model for the coarsening probabilities is
correctly specified (i.e., λr {Gr (Z), ψ ∗ } = λr {Gr (Z)} = P0 (C = r|C ≥ r, Z)),
whether the posited model p∗Z (z, ξ) is correct or not. Defining the random
vector Fr = {I(C = 1), . . . , I(C = r − 1), Z}, as we did in the proof of Lemma
10.1, and deriving the expectation of (10.69) by first conditioning on Fr, we obtain
\[ E\left( \left[ \frac{E\{I(C = r) \mid \mathcal{F}_r\} - \lambda_r\{G_r(Z)\}\, I(C \geq r)}{K_r\{G_r(Z)\}} \right] \big[ m(Z, \beta_0) - E\{m(Z, \beta_0) \mid G_r(Z), \xi^*\} \big] \right). \]
We showed in (10.46) that E{I(C = r)|Fr} = λr{Gr(Z)} I(C ≥ r), which proves that (10.69) is equal to zero for all r ≠ ∞ when the coarsening probabilities are modeled correctly, which, in turn, proves (10.62).
Now, let’s consider the case when the posited model for the distribution
of Z is correctly specified (i.e., p∗Z (z, ξ ∗ ) = p0Z (z)), whether or not the model
for the coarsening probabilities is correct. If this model is correctly specified,
then the conditional expectation
E{m(Z, β0 )|Gr (Z), ξ ∗ } = E0 {m(Z, β0 )|Gr (Z)}.
Write the expectation (10.69) as the difference of two expectations, namely
\[ E\left( \frac{I(C = r)}{K_r\{G_r(Z), \psi^*\}} \big[ m(Z, \beta_0) - E_0\{m(Z, \beta_0) \mid G_r(Z)\} \big] \right) \tag{10.70} \]
\[ -\, E\left( \frac{\lambda_r\{G_r(Z), \psi^*\}\, I(C \geq r)}{K_r\{G_r(Z), \psi^*\}} \big[ m(Z, \beta_0) - E_0\{m(Z, \beta_0) \mid G_r(Z)\} \big] \right). \tag{10.71} \]

We compute the expectation in (10.71) by first conditioning on {I(C ≥ r), Gr(Z)} to obtain
\[ E\left( \frac{\lambda_r\{G_r(Z), \psi^*\}}{K_r\{G_r(Z), \psi^*\}} \Big[ E\{I(C \geq r)\, m(Z, \beta_0) \mid I(C \geq r), G_r(Z)\} - I(C \geq r)\, E_0\{m(Z, \beta_0) \mid G_r(Z)\} \Big] \right). \tag{10.72} \]
But
\[ E\{I(C \geq r)\, m(Z, \beta_0) \mid I(C \geq r), G_r(Z)\} = I(C \geq r)\, E\{m(Z, \beta_0) \mid C \geq r, G_r(Z)\}. \tag{10.73} \]
Because of the coarsening at random (CAR) assumption, we obtain that
\[ p_{Z|C \geq r, G_r(Z)}(z \mid g_r) = p_{Z|G_r(Z)}(z \mid g_r). \tag{10.74} \]
This follows because
\[ p_{Z|C \geq r, G_r(Z)}(z \mid g_r) = \frac{P(C \geq r \mid Z = z)\, p_{0Z}(z)}{\int_{z: G_r(z) = g_r} P(C \geq r \mid Z = z)\, p_{0Z}(z)\, d\nu_Z(z)} = \frac{K_{r-1}\{G_{r-1}(z)\}\, p_{0Z}(z)}{\int_{z: G_r(z) = g_r} K_{r-1}\{G_{r-1}(z)\}\, p_{0Z}(z)\, d\nu_Z(z)} = \frac{p_{0Z}(z)}{\int_{z: G_r(z) = g_r} p_{0Z}(z)\, d\nu_Z(z)} = p_{Z|G_r(Z)}(z \mid g_r). \]

Equation (10.74), together with (10.73), implies that

E{I(C ≥ r)m(Z, β0 )|I(C ≥ r), Gr (Z)} = I(C ≥ r)E0 {m(Z, β0 )|Gr (Z)}

and hence (10.72) and (10.71) are equal to zero. A similar argument, where
we condition on {I(C = r), Gr (Z)}, can be used to show that (10.70) is equal
to zero. This then implies that (10.69) is equal to zero, which, in turn, implies
that (10.62) is true, thus demonstrating that β̂n is a consistent estimator for
β when the posited model p∗Z (z, ξ) is correctly specified.  

Example with Longitudinal Data

We return to Example 1, given in Section 9.2, where the interest was in esti-
mating parameters that described the mean CD4 count over time, as a func-
tion of treatment, in a randomized study where CD4 counts were measured
longitudinally at fixed time points. Specifically, we considered two treatments:
(X = 1) was the new treatment and (X = 0) was the control treatment. The
response Y = (Y1 , . . . , Yl )T was a vector of CD4 counts that were measured
on each individual at times 0 = t1 < . . . < tl . The full data are denoted by
Z = (Y, X). It was assumed that CD4 counts follow a linear trajectory whose

slope may be treatment-dependent. Thus the model was given by (9.24) and
assumes that
E(Yji |Xi ) = β1 + β2 tj + β3 Xi tj .
Therefore, the problem was to estimate the parameter β = (β1 , β2 , β3 )T from
a sample of data Zi = (Yi , Xi ), i = 1, . . . , n, where Yi = (Y1i , . . . , Yli )T are
the longitudinally measured CD4 counts for subject i.
In this study, some patients dropped out, and for those patients we ob-
served the CD4 count data prior to dropout, whereas all subsequent CD4
counts are missing. This is an example of monotone coarsening with ℓ = l − 1
levels of coarsening. We introduce the notation Y r to denote the vector
of data (Y1 , . . . , Yr )T , r = 1, . . . , l and Y r̄ to denote the vector of data
(Yr+1 , . . . , Yl )T , r = 1, . . . , l − 1. Therefore, when the coarsening variable
is Ci = r, we observe Gr (Zi ) = (Xi , Yir ), r = 1, . . . , l − 1, and, when Ci = ∞,
we observe the complete data G∞ (Zi ) = Zi = (Xi , Yi ).
With such coarsened data {Ci , GCi (Zi )}, i = 1, . . . , n, we considered esti-
mating the parameter β using an AIPWCC estimator. To accommodate this,
we introduced a logistic regression model for the discrete hazard of coarsening
probabilities given by (9.25) in terms of a parameter ψ, which was estimated
by maximizing the likelihood (8.13). The resulting estimator was denoted by
ψ̂n . An AIPWCC estimator for β was proposed by solving the estimating
equation (9.26), where, for simplicity, we chose
m(Z, β) = H T (X){Y − H(X)β}.
The definition of the design matrix H(X) is given subsequent to (9.24), and
the rationale for this estimating equation is given in Section 9.2. Notice, how-
ever, that we defined the augmentation term in equation (9.26) generically
using arbitrary functions Lr{Gr(Z)}, r ≠ ∞. We now know that to improve efficiency as much as possible, we should choose
\[ L_r\{G_r(Z)\} = E\{m(Z, \beta) \mid G_r(Z)\}, \]
which requires adaptive estimation using a posited model for p∗Z(z, ξ). We
propose the following.
Assume that the distribution of Y given X follows a multivariate normal
distribution whose mean and variance matrix may depend on treatment X;
that is,
Y |X = 1 ∼ M V N (ξ1 , Σ1 )
and
Y |X = 0 ∼ M V N (ξ0 , Σ0 ),
where ξk is an l-dimensional vector and Σk is an l × l matrix for k = 0, 1. The
parameter ξ will denote (ξ0 , ξ1 , Σ0 , Σ1 ).
Remark 7. Even though our model puts restrictions on the mean vectors ξk ,
in terms of the parameter β, we will let these be unrestricted and, as it turns
out, these parameters will not come into play in our estimating equation.  

We denote ξ_k^r = (ξ_{1k}, . . . , ξ_{rk})^T and ξ_k^{r̄} = (ξ_{(r+1)k}, . . . , ξ_{lk})^T for r = 1, . . . , l − 1 and k = 0, 1. We also denote the corresponding elements of the partitioned matrix Σk by Σ_k^{rr}, Σ_k^{r r̄}, and Σ_k^{r̄ r̄}, representing the variance matrix of Y^r, the covariance matrix of Y^r and Y^{r̄}, and the variance matrix of Y^{r̄}, respectively, given X = k. An estimator for ξ can be obtained by maximizing the likelihood
\[ \prod_{i=1}^{n} \prod_{k=0}^{1} \left( \prod_{r=1}^{l-1} \left[ \{(2\pi)^r |\Sigma_k^{rr}|\}^{-1/2} \exp\left\{ -\tfrac{1}{2}(Y_i^r - \xi_k^r)^T (\Sigma_k^{rr})^{-1} (Y_i^r - \xi_k^r) \right\} \right]^{I(C_i = r, X_i = k)} \right) \left[ \{(2\pi)^l |\Sigma_k|\}^{-1/2} \exp\left\{ -\tfrac{1}{2}(Y_i - \xi_k)^T \Sigma_k^{-1} (Y_i - \xi_k) \right\} \right]^{I(C_i = \infty, X_i = k)}. \tag{10.75} \]

This likelihood can be maximized by using standard statistical software


such as SAS Proc Mixed; see Littell et al. (1996). Denote the estimators for the
variance matrix by Σ̂kn for k = 0, 1. Using standard results for the conditional
distribution of a multivariate normal distribution, we obtain that

\[ E\{m(Z, \beta) \mid G_r(Z), \xi\} = E\big[ H^T(X)\{Y - H(X)\beta\} \mid X, Y^r, \xi \big] = H^T(X)\, q(r, X, Y^r, \beta, \xi), \]
where q(r, X, Y^r, β, ξ) is an l-dimensional vector whose first r elements are {Y^r − H^r(X)β}, whose last l − r elements are
\[ (\Sigma_k^{r\bar r})^T (\Sigma_k^{rr})^{-1} \{Y^r - H^r(X)\beta\}, \quad \text{when } X = k,\ k = 0, 1, \]
and H^r(X) is the r × 3 matrix consisting of the first r rows of H(X).


Therefore, the improved double-robust estimator for β is obtained by solv-
ing the equation
\[ \sum_{i=1}^{n} \left( \frac{I(C_i = \infty)\, H^T(X_i)\{Y_i - H(X_i)\beta\}}{\wp(\infty, Z_i, \hat\psi_n)} + \sum_{r=1}^{l-1} \left[ \frac{I(C_i = r) - \lambda_r\{G_r(Z_i), \hat\psi_n\}\, I(C_i \geq r)}{K_r\{G_r(Z_i), \hat\psi_n\}} \right] H^T(X_i)\, q(r, X_i, Y_i^r, \beta, \hat\xi_n^*) \right) = 0, \]
where, for this example, ξ̂n∗ corresponds to the estimators Σ̂kn, k = 0, 1, derived by maximizing the likelihood (10.75).
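The vector q(r, X, Y^r, β, ξ) can be computed directly from the partitioned covariance matrices. The following Python sketch assumes fitted matrices Σ̂k are available; the function and variable names are illustrative only.

import numpy as np

def q_vector(r, x, y_r, beta, H, Sigma_hat):
    """q(r, X, Y^r, beta, xi) for one subject (illustrative names).
    r: number of observed CD4 measurements; x: treatment indicator (0/1);
    y_r: first r components of Y; H: the l x 3 design matrix H(X = x);
    Sigma_hat: length-2 list of fitted l x l covariance matrices."""
    Sigma = Sigma_hat[x]
    resid_r = y_r - H[:r] @ beta                       # Y^r - H^r(X) beta
    S_rr = Sigma[:r, :r]                               # var(Y^r | X = x)
    S_rrbar = Sigma[:r, r:]                            # cov(Y^r, Y^rbar | X = x)
    tail = S_rrbar.T @ np.linalg.solve(S_rr, resid_r)  # (Sigma^{r rbar})^T (Sigma^{rr})^{-1} resid
    return np.concatenate([resid_r, tail])

# A subject with C_i = r then contributes H(X_i)^T q_vector(r, X_i, Y_i^r, beta,
# H(X_i), Sigma_hat) to the augmentation term of the estimating equation above.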

10.4 Remarks Regarding Right Censoring


In Section 9.3, we showed that right censoring, which occurs frequently in
survival analysis, can be viewed as a special case of monotone coarsening with
continuous-time hazard rates representing the distribution of censoring. We
remind the reader that censored-data estimators for the parameter β can also
be written as augmented inverse probability weighted complete-case estima-
tors. Specifically, we argued in (9.34) that estimators for β can be derived by
solving the estimating equation
\[ \sum_{i=1}^{n} \left[ \frac{\Delta_i\, m(Z_i, \beta)}{K_{U_i}\{\bar X_i(U_i)\}} + \int_0^\infty \frac{dM_{\tilde C i}\{r, \bar X_i(r)\}}{K_r\{\bar X_i(r)\}}\, L_r\{\bar X_i(r)\} \right] = 0, \tag{10.76} \]

where Z denotes the full data (i.e., Z = {T, X̄(T)}), and m(Z, β) denotes a full-data estimating function that would have been used to obtain estimators for β had there been no censoring. The quantities dNC̃(r), λC̃{r, X̄(r)}, Y(r), dMC̃{r, X̄(r)} = dNC̃(r) − λC̃{r, X̄(r)}Y(r)dr, and Kr{X̄(r)} were all defined in Section 9.3.
By analogy between the censored-data estimating equation (10.76) and
the monotonically coarsened data estimating equation (10.42), we can show
that the most efficient augmented inverse probability weighted complete-case
estimator for β that uses the full-data estimating function m(Z, β) is obtained
by choosing
Lr {X̄(r)} = E{m(Z, β)|T ≥ r, X̄(r)}.
To actually implement these methods with censored data, we need to
1. develop models for the censoring distribution λC̃ {r, X̄(r), ψ} and find es-
timators for ψ, and
2. estimate the conditional expectation E{m(Z, β)|T ≥ r, X̄(r)}.
A popular model for the censoring hazard function λC̃ {r, X̄(r), ψ} is the
semiparametric proportional hazards regression model (Cox, 1972) using max-
imum partial likelihood estimators to estimate the regression parameters and
Breslow’s (1974) estimator to estimate the underlying cumulative hazard func-
tion.
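As an illustration of fitting such a censoring-hazard model, the following sketch uses the lifelines implementation of the Cox model with baseline covariates only (time-dependent covariates require a more elaborate data setup); the simulated data and column names are assumptions made for this example.

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(2)
n = 500
X1, X2 = rng.normal(size=n), rng.binomial(1, 0.5, size=n)
T = rng.exponential(2.0, size=n)                    # failure times (toy)
Ctilde = rng.exponential(3.0 * np.exp(-0.3 * X2))   # censoring times (toy)
U, Delta = np.minimum(T, Ctilde), (T <= Ctilde).astype(int)

df = pd.DataFrame({"U": U, "censor_event": 1 - Delta, "x1": X1, "x2": X2})
cph = CoxPHFitter()
cph.fit(df, duration_col="U", event_col="censor_event")   # hazard of being censored
baseline_cumhaz = cph.baseline_cumulative_hazard_          # Breslow-type estimate
# K_r{Xbar(r)} can then be approximated by exp(-LambdaC(r | X)) at each
# subject's observed times for use in (10.76).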
In order to estimate the conditional expectation E{m(Z, β)|T ≥ r, X̄(r)},
we can posit a simpler full-data model, say p∗Z (z, ξ) = p∗T,X̄(T ) {t, x̄(t), ξ},
and then estimate ξ using the observed data {Ui , ∆i , X̄i (Ui )}, i = 1, . . . , n by
maximizing the observed-data likelihood
\[ \prod_{i=1}^{n} \big[ p^*_{T, \bar X(T)}\{U_i, \bar X_i(U_i), \xi\} \big]^{\Delta_i} \times \left[ \int_{\{t, \bar x(t)\}: t \geq U_i,\ \{x(s) = X_i(s),\, s \leq U_i\}} p^*_{T, \bar X(T)}\{t, \bar x(t), \xi\}\, d\nu_{T, \bar X(T)}\{t, \bar x(t)\} \right]^{1 - \Delta_i}. \tag{10.77} \]

Building models for p∗T,X̄(T){t, x̄(t), ξ} with time-dependent covariates and
maximizing (10.77) can be a daunting task. Nonetheless, the theory that has
been developed can often be useful in developing more efficient estimators
even if we don’t necessarily derive the most efficient one.
In the example of censored medical cost data that was described in Exam-
ple 2 of Section 9.3, Bang and Tsiatis (2000) used augmented inverse prob-
ability weighted complete-case estimators to estimate the mean medical cost
and showed various methods for gaining efficiency by judiciously choosing the
augmented term.
Other examples where this methodology was used include Robins, Rot-
nitzky, and Bonetti (2001), who used AIPWCC estimators of the survival
distribution under double sampling with follow-up of dropouts. Hu and Tsi-
atis (1996) and van der Laan and Hubbard (1998) constructed estimators of
the survival distribution from survival data that are subject to reporting de-
lays. Zhao and Tsiatis (1997, 1999, 2000) and van der Laan and Hubbard
(1999) derived estimators of the quality-adjusted-lifetime distribution from
right-censored data. Bang and Tsiatis (2002) derived estimators for the pa-
rameters in a median regression model of right-censored medical costs. Straw-
derman (2000) used these methods to derive an estimator of the mean of an
increasing stopped stochastic process. Van der Laan, Hubbard, and Robins
(2002) and Quale, van der Laan and Robins (2003) constructed locally effi-
cient estimators of a multivariate survival distribution when failure times are
subject to a common censoring time and to failure-time-specific censoring.

10.5 Improving Efficiency when Coarsening Is Nonmonotone

We have discussed how to derive AIPWCC estimators with improved efficiency
when there are two levels of coarsening or when the coarsening is monotone
and have given several examples to illustrate these methods. This theory can
also be extended to the case when the coarsening is nonmonotone. However,
we must caution the reader that the use of AIPWCC estimators in this setting
is very difficult to implement. At the end of Section 8.1, we already remarked
that developing coherent models for the missingness probabilities when the missingness is nonmonotone is not trivial. There has been very little work in
this area, with the exception of the paper by Robins and Gill (1997). Even if
one were able to develop models for the missingness probabilities, finding pro-
jections onto the augmentation space, as is necessary to obtain more efficient
AIPWCC estimators, is not straightforward and requires an iterative process
that is numerically difficult to implement. Consequently, the semiparametric
theory that leads to AIPWCC estimators has not been well developed with
nonmonotone coarsened data and there is still a great deal of research that
needs to be done in this area. Nonetheless, many of the theoretical results

have been worked out for nonmonotone coarsened data using the general the-
ory developed by Robins, Rotnitzky, and Zhao (1994). For completeness, we
present these results in this section, but again we caution the reader that there
are many challenges yet to be tackled before these methods can be feasibly
implemented.

Finding the Projection onto the Augmentation Space


We have already argued that among all coarsened-data influence functions given by (10.1) with ϕF(Z) fixed, the optimal choice is given by (10.2). We have also shown how to find the projection of I(C = ∞)ϕ∗F(Z)/℘(∞, Z) onto the augmentation space Λ2 when there are two levels of coarsening or when the coarsening is monotone, where ϕ∗F(Z) ∈ ΛF⊥. We now consider how to derive the projection of I(C = ∞)ϕ∗F(Z)/℘(∞, Z) onto Λ2 in general; that is, even if the coarsening is nonmonotone.
We begin by defining two linear operators.
Definition 4. The linear operator L is a mapping from the full-data Hilbert space HF to the observed-data Hilbert space H, where L : HF → H is defined as
\[ \mathcal{L}\{h^F(\cdot)\} = E\{h^F(Z) \mid C, G_C(Z)\} \tag{10.78} \]
for hF(Z) ∈ HF. Specifically,
\[ \mathcal{L}\{h^F(\cdot)\} = \sum_{r} I(C = r)\, E\{h^F(Z) \mid C = r, G_r(Z)\}, \]
and, because of the coarsening-at-random assumption, we obtain
\[ \mathcal{L}\{h^F(Z)\} = \sum_{r} I(C = r)\, E\{h^F(Z) \mid G_r(Z)\}. \qquad \square \tag{10.79} \]

Definition 5. The linear operator M is a mapping from the full-data Hilbert space to the full-data Hilbert space, M : HF → HF, defined as
\[ \mathcal{M}\{h^F(\cdot)\} = E[\mathcal{L}\{h^F(\cdot)\} \mid Z]. \tag{10.80} \]
Using (10.79), we obtain
\[ \mathcal{M}\{h^F(\cdot)\} = E\left[ \sum_{r} I(C = r)\, E\{h^F(Z) \mid G_r(Z)\} \,\Big|\, Z \right] = \sum_{r} \wp\{r, G_r(Z)\}\, E\{h^F(Z) \mid G_r(Z)\}. \qquad \square \tag{10.81} \]

The projection of I(C = ∞)hF(Z)/℘(∞, Z) onto Λ2 is now given by the following theorem.

Theorem 10.6.
(i) The inverse mapping M−1 exists and is uniquely defined.
(ii) The projection is
\[ \Pi\left[ \frac{I(C = \infty)\, h^F(Z)}{\wp(\infty, Z)} \,\Big|\, \Lambda_2 \right] = \frac{I(C = \infty)\, h^F(Z)}{\wp(\infty, Z)} - \mathcal{L}\big[\mathcal{M}^{-1}\{h^F(\cdot)\}\big]. \tag{10.82} \]

Proof of Theorem 10.6, part (i). We defer the proof of (i) and assume for the time being that it is true.

Proof of Theorem 10.6, part (ii). If we can show that
a. I(C = ∞)hF(Z)/℘(∞, Z) − L[M−1{hF(·)}] ∈ Λ2,
b. and that
\[ \frac{I(C = \infty)\, h^F(Z)}{\wp(\infty, Z)} - \left[ \frac{I(C = \infty)\, h^F(Z)}{\wp(\infty, Z)} - \mathcal{L}\big[\mathcal{M}^{-1}\{h^F(\cdot)\}\big] \right] = \mathcal{L}\big[\mathcal{M}^{-1}\{h^F(\cdot)\}\big] \]
is orthogonal to every element in Λ2,
then, by the projection theorem, I(C = ∞)hF(Z)/℘(∞, Z) − L[M−1{hF(·)}] is the unique projection of I(C = ∞)hF(Z)/℘(∞, Z) onto Λ2.
We first note that
\[ E\left[ \frac{I(C = \infty)\, h^F(Z)}{\wp(\infty, Z)} - \mathcal{L}\big[\mathcal{M}^{-1}\{h^F(\cdot)\}\big] \,\Big|\, Z \right] = h^F(Z) - \mathcal{M}\big[\mathcal{M}^{-1}\{h^F(\cdot)\}\big] = h^F(Z) - h^F(Z) = 0, \]
thus proving (a).


If we let L2 {C, GC (Z)} be an arbitrary element of Λ2 , then the inner prod-
uct
258 10 Improving Efficiency and Double Robustness with Coarsened Data
 
 T
E L[M−1 {hF (·)}] L2 {C, GC (Z)}
  
 T
= E E M−1 {hF (·)} L2 {C, GC (Z)} C, GC (Z)
 
 T
= E M−1 {hF (·)} L2 {C, GC (Z)}
  
 −1 F T
= E E M {h (·)} L2 {C, GC (Z)} Z
  
 −1 F T
= E M {h (·)} E L2 {C, GC (Z)} Z = 0,

where the last equality follows because M−1 (hF ) ∈ HF and hence as a func-
tion of Z, allowing it to come outside the inner conditional expectation, and
L2 (·) ∈ Λ2 , which implies that E{L2 (·)|Z} = 0, thus proving (b). 


Uniqueness of M−1 (·)

In order to complete the proof of Theorem 10.6, we must show that the linear
operator M has a unique inverse. We will prove the existence and uniqueness
of the inverse of the linear mapping M by showing that the linear operator
(I − M) is a contraction mapping, where I denotes the identity mapping
HF → HF ; i.e., I{hF (Z)} = hF (Z) for hF ∈ HF . For more details on
these methods, we refer the reader to Kress (1989).
We begin by first defining what we mean by a contraction mapping and
proving why (I − M) being a contraction mapping implies that M has a
unique inverse.

Definition 6. A linear operator, say (I − M), is a contraction mapping if the norm of the operator satisfies ‖I − M‖ ≤ (1 − ε) for some ε > 0, where the norm of a linear operator, ‖I − M‖, is defined as
\[ \sup_{h^F \in \mathcal{H}^F} \frac{\| (\mathcal{I} - \mathcal{M})(h^F) \|}{\| h^F \|}, \]
or, equivalently, ‖I − M‖ ≤ (1 − ε) if
\[ \| (\mathcal{I} - \mathcal{M})(h^F) \| \leq (1 - \varepsilon)\, \| h^F \| \quad \text{for all } h^F \in \mathcal{H}^F. \qquad \square \]


Lemma 10.5. If (I − M) is a contraction mapping, then M−1 exists, is unique, and is equal to the operator
\[ \mathcal{S} = \sum_{k=0}^{\infty} (\mathcal{I} - \mathcal{M})^k. \tag{10.83} \]
Also, M−1{hF(Z)} can be obtained by successive approximation, where
\[ \varphi_{n+1}(Z) = (\mathcal{I} - \mathcal{M})\varphi_n(Z) + h^F(Z); \tag{10.84} \]
i.e.,
\[ \varphi_0(Z) = h^F(Z), \quad \varphi_1(Z) = (\mathcal{I} - \mathcal{M})h^F(Z) + h^F(Z), \quad \varphi_2(Z) = (\mathcal{I} - \mathcal{M})^2 h^F(Z) + (\mathcal{I} - \mathcal{M})h^F(Z) + h^F(Z), \quad \ldots, \]
and ϕn(Z) → M−1{hF(·)}.

Proof. To demonstrate existence, we must show

M[S{hF (Z)}] = hF (Z).

However,

      M[S{hF(Z)}] = M[ Σ_{k=0}^{∞} (I − M)^k hF(Z) ]
                  = {I − (I − M)} [ Σ_{k=0}^{∞} (I − M)^k hF(Z) ].

By a telescoping argument, this equals

      lim_{k→∞} {I − (I − M)^k} hF(Z),

but (I − M)^k hF(Z) will have a norm that converges to zero as k → ∞.
This follows because (I − M) is a contraction mapping and ‖(I − M)^k hF‖ ≤
(1 − ε)^k ‖hF‖. Therefore lim_{k→∞} (I − M)^k hF(Z) = 0 a.s. Consequently,

M[S{hF (Z)}] = hF (Z).

We will demonstrate uniqueness by contradiction. Suppose M−1 (·) were


not unique. Then there exists S ∗ {hF (Z)} such that

      M[S∗{hF(Z)}] = hF(Z)   but   S∗{hF(Z)} ≠ S{hF(Z)}.

In that case,

      (I − M)(S − S∗)hF(Z) = (I − M)[S{hF(Z)} − S∗{hF(Z)}]
                           = S{hF(Z)} − S∗{hF(Z)}

and

      ‖(I − M)[S{hF(Z)} − S∗{hF(Z)}]‖ = ‖S{hF(Z)} − S∗{hF(Z)}‖.

But since (I − M) is a contraction mapping, this implies that

      ‖(I − M)[S{hF(Z)} − S∗{hF(Z)}]‖ ≤ (1 − ε)‖S{hF(Z)} − S∗{hF(Z)}‖.

This can only happen when

      ‖S(hF) − S∗(hF)‖ = 0.   □
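To fix ideas, the successive-approximation scheme (10.84) can be sketched numerically. In the minimal Python sketch below, M is represented, purely for illustration, by a fixed matrix acting on a discretized hF; the matrix, dimensions, and iteration count are hypothetical placeholders, and any representation of M for which (I − M) is a contraction could be substituted.

    import numpy as np

    # Hypothetical stand-in for the operator M of (10.80): any linear map for
    # which (I - M) is a contraction (operator norm < 1) behaves the same way.
    A = np.array([[0.8, 0.1],
                  [0.2, 0.7]])

    def apply_M(h):
        return A @ h

    def M_inverse(h, n_iter=200):
        # Successive approximation (10.84): phi <- (I - M)phi + h converges to M^{-1}h.
        phi = h.copy()
        for _ in range(n_iter):
            phi = (phi - apply_M(phi)) + h
        return phi

    h = np.array([1.0, -2.0])
    phi = M_inverse(h)
    print(apply_M(phi) - h)   # residual is essentially zero, so phi approximates M^{-1}h

Because the spectral radius of (I − A) here is well below one, the iterates converge geometrically, mirroring the contraction argument in the proof above.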


We now complete the proof of Theorem 10.6 by showing that (I − M) is


a contraction mapping when M is defined by (10.80).

Proof. Theorem 10.6 part (i)


Consider the Hilbert space HCZ of all q-dimensional mean-zero measurable
functions of (C, Z) equipped with the covariance inner product. The observed-
data Hilbert space H and the full-data Hilbert space HF are both contained
in HCZ . That is, H ⊂ HCZ , HF ⊂ HCZ are linear subspaces within this space.
If we consider any arbitrary element hCZ (C, Z) ∈ HCZ , then
' (
Π[hCZ |H] = E hC,Z (C, Z)|C, GC (Z) (10.85)

and ' (
Π[hCZ |HF ] = E hCZ (C, Z)|Z , (10.86)
where equations (10.85) and (10.86) can be easily shown to hold by checking
that the definitions of a projection are satisfied.
Therefore, deriving M{h(·)} = Π[Π[h|H]|HF ] corresponds to finding two
subsequent projections onto these two linear subspaces. What we want to
prove is that (I − M) is a contraction mapping from HF to HF . First note
that

      (I − M)hF(Z) = Π[ {hF(Z) − Π[hF(Z)|H]} | HF ].

Hence, by the Pythagorean theorem,

      ‖(I − M)hF(Z)‖ ≤ ‖hF(Z) − Π[hF(Z)|H]‖.                                      (10.87)

Also by the Pythagorean theorem, the right-hand side of (10.87) is equal to

      { ‖hF(Z)‖² − ‖Π[hF(Z)|H]‖² }^{1/2}.                                          (10.88)

The projection Π[hF(Z)|H] = E{hF(Z)|C, GC(Z)}, which, by (10.79), equals

      I(C = ∞)hF(Z) + Σ_{r≠∞} I(C = r)E{hF(Z)|Gr(Z)}.

Hence,

      ‖Π[hF(Z)|H]‖²
      = E[ {I(C = ∞)hF(Z) + Σ_{r≠∞} I(C = r)E{hF(Z)|Gr(Z)}}T
           {I(C = ∞)hF(Z) + Σ_{r≠∞} I(C = r)E{hF(Z)|Gr(Z)}} ]
      = E[ I(C = ∞){hF(Z)}T hF(Z) + Σ_{r≠∞} I(C = r){E{hF(Z)|Gr(Z)}}T E{hF(Z)|Gr(Z)} ]
      ≥ E[ I(C = ∞){hF(Z)}T hF(Z) ].                                              (10.89)

Conditioning on Z, (10.89) equals

      E[ π(∞, Z){hF(Z)}T hF(Z) ].                                                 (10.90)

By assumption, π(∞, Z) ≥ ε∗ > 0 for all Z, and hence (10.90) is

      ≥ ε∗ E[{hF(Z)}T hF(Z)] = ε∗ ‖hF(Z)‖².

Consequently, (10.88) is less than or equal to

      { ‖hF(Z)‖² − ε∗‖hF(Z)‖² }^{1/2} = (1 − ε∗)^{1/2} ‖hF(Z)‖,   ε∗ > 0.         (10.91)

Therefore, by (10.87), (10.88), and (10.91), we have shown that

      ‖(I − M)hF(Z)‖ ≤ (1 − ε∗)^{1/2} ‖hF(Z)‖

for all hF ∈ HF. Hence (I − M) is a contraction mapping.   □




Obtaining Improved Estimators with Nonmonotone Coarsening

In (10.2), we showed that the optimal observed-data influence function


of RAL estimators for β associated with the full-data influence function
ϕF(Z) is obtained by considering the residual after projecting the inverse
probability weighted complete-case influence function I(C = ∞)ϕF(Z)/π(∞, Z) onto Λ2.
When the coarsening is nonmonotone, we demonstrated in the previous sec-
tion (see (10.82)) that this residual is equal to L[M−1 {ϕF (Z)}]; that is,
L[M−1 {ϕF (Z)}] = J {ϕF (Z)}, where J (ϕF ) was defined by (10.5) and is
an element of the space of influence functions (IF )DR (see Definition 2). We
also argued (see Remark 2) that if we are interested in deriving more efficient
estimators (i.e., estimators whose influence function is an element of (IF )DR ),
then we should consider estimating functions, which, at the truth, are elements

of the DR linear space J (ΛF ⊥ ) (see Definition 3) or, equivalently, the space
L{M−1 (ΛF ⊥ )}.
Consequently, if we defined a full-data estimating function m(Z, β) such
that m(Z, β0 ) ∈ ΛF ⊥ , then we should use L[M−1 {m(Z, β)}] as our observed-
data estimating function. That is, the estimator for β would be the solution
to the estimating equation

      Σ_{i=1}^{n} Li[M−1i{m(Zi, β)}] = 0.                                         (10.92)

Of course, deriving the estimating equation in (10.92) is not trivial. The


operator L(·), defined by (10.79), involves finding conditional expectations of
functions of Z given Gr(Z) for r ≠ ∞, and the operator M(·), defined by
(10.81), involves finding such conditional expectations as well as deriving the
coarsening probabilities π{r, Gr(Z)}. Consequently, in order to proceed, these
coarsening probabilities and conditional expectations need to be estimated.
The coarsening probabilities are modeled using a parametric model with
parameter ψ; that is, P(C = r|Z) = π{r, Gr(Z), ψ}, where ψ is estimated by
maximizing (8.8) and the resulting estimator is denoted by ψ̂n.
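As a hedged illustration of this step, once a parametric form for π{r, Gr(z), ψ} has been chosen, ψ̂n can be computed by numerically maximizing the coarsening-model likelihood. In the sketch below, log_coarsening_prob is a hypothetical placeholder for whatever posited parametric model is used; only generic optimization is assumed.

    import numpy as np
    from scipy.optimize import minimize

    def fit_coarsening_model(log_coarsening_prob, data, psi_init):
        """MLE of psi.  data is an iterable of (c_i, g_i) pairs, the coarsening level
        and the coarsened data G_{c_i}(Z_i); log_coarsening_prob(c, g, psi) returns
        log pi{c, G_c(z), psi} under the posited parametric model."""
        def neg_loglik(psi):
            return -sum(log_coarsening_prob(c, g, psi) for c, g in data)
        fit = minimize(neg_loglik, np.asarray(psi_init, dtype=float), method="Nelder-Mead")
        return fit.x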
In order to estimate the conditional expectation of functions of Z given
Gr (Z) for r ≠ ∞, we need to estimate the conditional density of Z given
Gr (Z). An adaptive strategy is to posit a simplifying parametric model for
the density of Z, namely p∗Z (z, ξ), in terms of a parameter ξ, and then estimate
ξ by maximizing (10.58), where the estimator is denoted by ξˆn∗ . We remind
the reader that the “∗” notation is used to emphasize that such a model is
not necessarily believed to contain the true distribution of Z. We assume
sufficient regularity conditions so that ξ̂n∗ converges in probability to ξ∗, where p∗Z(z, ξ∗) may or may
not be the true density of Z. If it is the true density of Z, we denote this
by taking ξ ∗ = ξ0 . With such an estimator for ξ, we can now estimate the
conditional expectation E{m(Z, β)|Gr (Z)} by using E{m(Z, β)|Gr (Z), ξˆn∗ },
where E{m(Z, β)|Gr (Z), ξ} is defined by (10.57).
The linear operator M and its inverse M−1 are functions of the parameters
ψ and ξ, and the linear operator L is a function of ξ. To make this explicit, we
define these operators as M(·, ψ, ξ), M−1 (·, ψ, ξ), and L(·, ξ). Consequently,
the improved estimator for β would be the solution to

      Σ_{i=1}^{n} Li[M−1i{m(Zi, β), ψ̂n, ξ̂n∗}, ξ̂n∗] = 0.                          (10.93)

We will now demonstrate that the estimator for β that solves (10.93) is an
example of an AIPWCC estimator. This follows because of the following the-
orem.
Theorem 10.7. Let dF (Z, β, ψ, ξ) = M−1 {m(Z, β), ψ, ξ}. Then,

      L[M−1{m(Z, β), ψ, ξ}, ξ] = L{dF(Z, β, ψ, ξ), ξ}

can be written as

      L{dF(Z, β, ψ, ξ), ξ} = I(C = ∞)m(Z, β)/π(∞, Z, ψ) + L∗2{C, GC(Z), β, ψ, ξ},  (10.94)

where

      L∗2{C, GC(Z), β, ψ, ξ}
      = − [I(C = ∞)/π(∞, Z, ψ)] Σ_{r≠∞} π{r, Gr(Z), ψ}E{dF(Z, β, ψ, ξ)|Gr(Z), ξ}
      + Σ_{r≠∞} I(C = r)E{dF(Z, β, ψ, ξ)|Gr(Z), ξ}.                               (10.95)

Proof. By definition,

      L{dF(Z, β, ψ, ξ), ξ} =
      I(C = ∞)dF(Z, β, ψ, ξ) + Σ_{r≠∞} I(C = r)E{dF(Z, β, ψ, ξ)|Gr(Z), ξ}.        (10.96)

Because

      M{dF(Z, β, ψ, ξ), ψ, ξ} = M[M−1{m(Z, β), ψ, ξ}, ψ, ξ] = m(Z, β)
      = π(∞, Z, ψ)dF(Z, β, ψ, ξ) + Σ_{r≠∞} π{r, Gr(Z), ψ}E{dF(Z, β, ψ, ξ)|Gr(Z), ξ},

this implies that

      dF(Z, β, ψ, ξ) = {π(∞, Z, ψ)}−1 [ m(Z, β)
        − Σ_{r≠∞} π{r, Gr(Z), ψ}E{dF(Z, β, ψ, ξ)|Gr(Z), ξ} ].                     (10.97)

Substituting the right-hand side of (10.97) for dF(Z, β, ψ, ξ) in (10.96) gives
us (10.94) and hence proves the theorem.   □
Let us denote L2r {Gr (Z)} to be −E{dF (Z, β, ψ0 , ξ ∗ )|Gr (Z), ξ ∗ }. This is
a function of Gr (Z) whether the model p∗Z (z, ξ ∗ ) is correctly specified or not.
Therefore, because of (7.37), where elements of the augmentation space are
defined, we note that L∗2 {C, GC (Z), β, ψ0 , ξ ∗ }, defined by (10.95), is an ele-
ment of the augmentation space Λ2 as long as the model for the coarsening
probabilities π{r, Gr(Z), ψ0} is correctly specified, regardless of whether the
posited model for the density of Z, p∗Z(z, ξ∗), is or not. Finally, because of
Theorem 10.7, the estimator (10.93) is the same as the solution to

      Σ_{i=1}^{n} [ I(Ci = ∞)m(Zi, β)/π(∞, Zi, ψ̂n) + L∗2{Ci, GCi(Zi), β, ψ̂n, ξ̂n∗} ] = 0   (10.98)

and therefore is an AIPWCC estimator.


The result above assumes that we can derive M−1 {m(Z, β), ψ, ξ}. In
Lemma 10.5, we showed that the inverse operator M−1 exists. Nonetheless,
this inverse operator is not necessarily easy to compute with nonmonotone
coarsened data, and an iterative procedure using successive approximations
was also given in Lemma 10.5. Therefore, let us denote, by dF (j) (Z, β, ψ, ξ)
the approximation of M−1 {m(Z, β), ψ, ξ} after, say, (j) iterations of succes-
sive approximations given by (10.84). Because dF (j) (Z, β, ψ, ξ) is not exactly
M−1{m(Z, β), ψ, ξ}, then L{dF(j)(Z, β0, ψ0, ξ∗), ξ∗} may not be an element of
Λ⊥ and therefore not appropriate as the basis of an estimating function. We


therefore suggest the following strategy.
Define

      L∗2(j){C, GC(Z), β, ψ, ξ}
      = − [I(C = ∞)/π(∞, Z, ψ)] Σ_{r≠∞} π{r, Gr(Z), ψ}E{dF(j)(Z, β, ψ, ξ)|Gr(Z), ξ}
      + Σ_{r≠∞} I(C = r)E{dF(j)(Z, β, ψ, ξ)|Gr(Z), ξ}.                            (10.99)

By construction, L∗2(j){C, GC(Z), β, ψ0, ξ∗} is an element of Λ2, whether
dF(j)(Z, β, ψ, ξ) equals M−1{m(Z, β), ψ, ξ} or not. This implies that

      I(C = ∞)m(Z, β0)/π(∞, Z, ψ0) + L∗2(j){C, GC(Z), β0, ψ0, ξ∗}

is guaranteed to be an element of Λ⊥ when ψ0 is correctly specified.


By defining L∗2(j) (·) in this manner, we are guaranteed that the solution
to the equation

      Σ_{i=1}^{n} [ I(Ci = ∞)m(Zi, β)/π(∞, Zi, ψ̂n) + L∗2(j){Ci, GCi(Zi), β, ψ̂n, ξ̂n∗} ] = 0   (10.100)

is an AIPWCC estimator. Moreover, because of Theorem 10.7, if we take


the number of iterations (j) to be sufficiently large so that dF (j) (Z, β, ψ, ξ) is
−1
equal (or as close as we want) to M {m(Z, β), ψ, ξ}, then solving equation
(10.100) will lead to the estimator (10.93).
As long as the model for the coarsening probabilities is correctly specified,
the estimator, (10.100), for β, under suitable regularity conditions, will be an
RAL estimator for β with influence function
   
      [ I(C = ∞)ϕF(Z)/π(∞, Z, ψ0) + L∗∗2(j){C, GC(Z), β0, ψ0, ξ∗} ] − Π{[·]|Λψ},  (10.101)
where

      ϕF(Z) = −[E{∂m(Z, β0)/∂βT}]−1 m(Z, β0),

      L∗∗2(j){C, GC(Z), β0, ψ0, ξ∗} = −[E{∂m(Z, β0)/∂βT}]−1 L∗2(j){C, GC(Z), β0, ψ0, ξ∗},

and ξ ∗ denotes the limit (in probability) of ξˆn∗ .


If, in addition, the posited model p∗Z (z, ξ) contains the truth, then the
influence function (10.101) is equal to L[M−1 {ϕF (Z)}] = J {ϕF (Z)} and
hence is an element of (IF )DR .

Double Robustness

In constructing the estimator in (10.100), we took the point of view that


the coarsening probabilities are correctly specified. We also defined a posited
model for Z, namely p∗Z (z, ξ), for the purpose of constructing more efficient es-
timators. Such a model, for instance, enabled us to construct functions dF (Z)
and E{dF (Z)|Gr (Z)}, which were used to derive projections onto the space
Λ2 . As we showed above, the model for p∗Z (z, ξ) does not need to be correctly
specified in order for our estimator to be consistent and asymptotically normal
as long as the model for the coarsening probabilities is correct.
However, as we will now demonstrate, the attempt to gain efficiency also
gives us the added protection of double robustness. That is, if the posited
model for the density of Z is correct (i.e., the true density of Z, p0 (z), is
contained in the model p∗Z (z, ξ) for some ξ, which we denote by ξ0 ), and if we
choose dF (Z, β, ψ, ξ) to be exactly M−1 {m(Z, β), ψ, ξ}, then the estimator β
that is the solution to (10.98) will be consistent and asymptotically normal
even if the model for the coarsening probabilities is not correct.
Such a double-robustness property was shown previously for two levels
of missingness or for monotone coarsening. This result now generalizes the
double-robustness property for all coarsened-data models where the coarsen-
ing mechanism is CAR.
In order for the double-robustness property to hold, we emphasize that
dF (Z, β, ψ, ξ) = M−1 {m(Z, β), ψ, ξ}, which, as we showed in the previous
section, would result in the estimating equation (10.98) being identical to the
estimating equation (10.93). With sufficient regularity conditions, the solution
to (10.93) will lead to a consistent, asymptotically normal estimator for β if the
estimating function of (10.93), L[M−1 {m(Z, β), ψ, ξ}, ξ], evaluated at β = β0 ,
ψ = ψ ∗ , ξ = ξ ∗ , where ψ ∗ and ξ ∗ are the probabilistic limits of ψ̂n and ξˆn∗
respectively, is an unbiased estimator of zero. Namely, we must show that
 
      E( L[M−1{m(Z, β0), ψ∗, ξ∗}, ξ∗] ) = 0.                                      (10.102)

Remark 8. The expectation in (10.102) is with respect to the truth. We remind


the reader that the true coarsened-data density involves both the marginal
density of Z, p0(z), and the model for the coarsening probabilities, P0(C =
r|Z) = π0{r, Gr(Z)}. We are considering the case where we posit a model for
the marginal density, namely p∗Z(z, ξ), which might be incorrect. We denote
this situation by letting the estimator ξ̂n∗ converge in probability to ξ∗, where
p∗Z(z, ξ∗) ≠ p0(z). However, in the special case where the posited model is cor-
rectly specified, we will denote this by taking ξ∗ = ξ0, where p∗Z(z, ξ0) = p0(z).
We also posit a model for the coarsening probabilities; namely, P(C = r|Z) =
π{r, Gr(Z), ψ}. Using the same convention, if this model is misspecified, we
denote this by letting the estimator ψ̂n converge in probability to ψ∗, where
π{r, Gr(Z), ψ∗} ≠ π0{r, Gr(Z)}. If this model is correctly specified, then we
take ψ∗ = ψ0, where π{r, Gr(Z), ψ0} = π0{r, Gr(Z)}. To emphasize this
notation, we write the expectation in (10.102) as

      Eξ0,ψ0 ( L[M−1{m(Z, β0), ψ∗, ξ∗}, ξ∗] ).   □


To demonstrate double robustness, we need to prove the following theorem.


Theorem 10.8.
 
(i) Eξ0,ψ0 ( L[M−1{m(Z, β0), ψ0, ξ∗}, ξ∗] ) = 0,

and

(ii) Eξ0,ψ0 ( L[M−1{m(Z, β0), ψ∗, ξ0}, ξ0] ) = 0.

Proof of (i)
Because the conditional distribution of C|Z involves the parameter ψ only and
the marginal distribution of Z involves ξ only, we can write (i) as
  
      Eξ0 [ Eψ0 { L[M−1{m(Z, β0), ψ0, ξ∗}, ξ∗] | Z } ].                           (10.103)

But, for any function q(Z), Eψ0[L{q(Z), ξ∗}|Z] is, by definition, equal to
M{q(Z), ψ0, ξ∗}. Therefore, (10.103) equals

      Eξ0 [ M[ M−1{m(Z, β0), ψ0, ξ∗}, ψ0, ξ∗ ] ] = Eξ0{m(Z, β0)} = 0.   □


Proof of (ii)
Because L{q(Z), ξ0 } = Eξ0 {q(Z)|C, GC (Z)}, then

Eξ0 ,ψ0 [L{q(Z), ξ0 }] = Eξ0 {q(Z)}.



Notice that the argument above didn’t involve the parameter ψ; hence

Eξ0 ,ψ0 [L{q(Z), ξ0 }] = Eξ0 ,ψ∗ [L{q(Z), ξ0 }] = Eξ0 {q(Z)}

for any parameter ψ ∗ . Applying this to the left-hand side of equation (ii), we
obtain
 
−1 ∗
Eξ0 ,ψ0 L[M {m(Z, β0 ), ψ , ξ0 }, ξ0 ]
 
−1 ∗
= Eξ0 ,ψ∗ L[M {m(Z, β0 ), ψ , ξ0 }, ξ0 ]
  
−1 ∗
= Eξ0 Eψ∗ L[M {m(Z, β0 ), ψ , ξ0 }, ξ0 ] Z
  
−1 ∗ ∗
= Eξ0 M M {m(Z, β0 ), ψ , ξ0 }, ψ , ξ0

= Eξ0 {m(Z, β0 )} = 0. 


Remark 9. As we indicated earlier, in order for our estimator to be double ro-


bust, we must make sure that if the posited model p∗Z (z, ξ) contains the truth,
then the estimator for ξ, namely ξˆn∗ , is a consistent estimator for ξ0 , even if
the model for the coarsening probabilities is misspecified. Therefore, likeli-
hood estimators such as those that maximize (10.58) would be appropriate
for this purpose, whereas AIPWCC estimators for ξ would not.  

10.6 Recap and Review of Notation


General results

• Among observed-data influence functions ϕ{C, GC (Z)} ∈ (IF ),


 
      ϕ{C, GC(Z)} = [ I(C = ∞)ϕF(Z)/π(∞, Z, ψ0) + L2{C, GC(Z)} ] − Π{[·]|Λψ},

  where ϕF(Z) is a full-data influence function and L2{C, GC(Z)} ∈ Λ2,
  optimal influence functions can be obtained by taking

      L2{C, GC(Z)} = −Π[ I(C = ∞)ϕF(Z)/π(∞, Z, ψ0) | Λ2 ].

• We denote such influence functions using the linear operator J(·), where
  J : HF → H is defined as

      J(ϕF) = I(C = ∞)ϕF(Z)/π(∞, Z, ψ0) − Π[ I(C = ∞)ϕF(Z)/π(∞, Z, ψ0) | Λ2 ].

• We denote the class of influence functions


 
      J{(IF)F} = { J(ϕF) : ϕF(Z) ∈ (IF)F }

by (IF )DR (i.e., the space of double-robust influence functions).


• The corresponding linear space used to derive estimating functions that
lead to observed-data estimators for β with influence functions in (IF )DR
is denoted by the DR linear space and is defined as J (ΛF ⊥ ).

Two levels of missingness

• Let the full data be given by Z = (Z1T , Z2T )T , where Z1 is always ob-
served but Z2 may be missing. Let R denote the complete-case indicator,
and P (R = 1|Z) = π(Z1 , ψ) is a model that describes the complete-case
probabilities. Then, for ϕ∗F (Z) ∈ ΛF ⊥ ,
   
      J(ϕ∗F) = Rϕ∗F(Z)/π(Z1, ψ0) − [{R − π(Z1, ψ0)}/π(Z1, ψ0)] E{ϕ∗F(Z)|Z1}.

• Adaptive double-robust AIPWCC estimators for β are obtained by solving


  the equation

      Σ_{i=1}^{n} [ Ri m(Zi, β)/π(Z1i, ψ̂n) − {Ri − π(Z1i, ψ̂n)}/π(Z1i, ψ̂n) × h∗2(Z1i, β, ξ̂n∗) ] = 0,

  where m(Z, β) is a full-data estimating function, ψ̂n is the MLE for ψ
  obtained by maximizing

      Π_{i=1}^{n} {π(Z1i, ψ)}^{Ri} {1 − π(Z1i, ψ)}^{1−Ri},

  ξ̂n∗ is an estimator for the parameter ξ in a posited model p∗Z(z, ξ), and

      h∗2(Z1i, β, ξ) = E{m(Zi, β)|Z1i, ξ}.
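A minimal numerical sketch of solving this two-level AIPWCC equation is given below; it targets the simple case m(Z, β) = Z2 − β (so β = E(Z2)), uses a simulated data set, and treats the fitted missingness probabilities and the complete-case regression as stand-ins for π(Z1, ψ̂n) and h∗2. All variable names and the data-generating choices are hypothetical.

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(0)
    n = 2000
    z1 = rng.normal(size=n)
    z2 = 1.0 + 0.5 * z1 + rng.normal(size=n)       # full data; Z2 is later "missing" when R = 0
    pi_hat = 1.0 / (1.0 + np.exp(-(0.5 + z1)))     # stand-in for the fitted pi(Z1, psi_hat)
    r = rng.binomial(1, pi_hat)                    # complete-case indicator

    # Stand-in for h2*(Z1, beta, xi_hat) = E{m(Z, beta)|Z1}: complete-case OLS of Z2 on Z1.
    design = np.column_stack([np.ones(n), z1])
    xi_hat = np.linalg.lstsq(design[r == 1], z2[r == 1], rcond=None)[0]
    fitted_z2 = design @ xi_hat

    def estimating_equation(beta):
        m = z2 - beta                              # only enters where r == 1
        h2 = fitted_z2 - beta
        return np.sum(r * m / pi_hat - (r - pi_hat) / pi_hat * h2)

    beta_hat = brentq(estimating_equation, -10.0, 10.0)
    print(beta_hat)                                # close to E(Z2) = 1.0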

Monotone coarsening

• When data are monotonically coarsened, the coarsening probabilities can


be modeled, as a function of the parameter ψ, using the discrete hazard
probability of coarsening,

      λr{Gr(Z), ψ} = P(C = r|C ≥ r, Z, ψ),   r ≠ ∞.


• For ϕ∗F(Z) ∈ ΛF⊥,

      J(ϕ∗F) = I(C = ∞)ϕ∗F(Z)/π(∞, Z, ψ0)
             + Σ_{r≠∞} [ {I(C = r) − λr{Gr(Z), ψ0}I(C ≥ r)} / Kr{Gr(Z), ψ0} ] E{ϕ∗F(Z)|Gr(Z)},

  where

      Kr{Gr(Z), ψ0} = P(C > r|Z, ψ0) = Π_{r′=1}^{r} [1 − λr′{Gr′(Z), ψ0}],   r ≠ ∞,

  and

      π(∞, Z, ψ0) = Π_{r≠∞} [1 − λr{Gr(Z), ψ0}].

• Adaptive double-robust AIPWCC estimators for β are obtained by solving


the equation
      Σ_{i=1}^{n} [ I(Ci = ∞)m(Zi, β)/π(∞, Zi, ψ̂n)
        + Σ_{r≠∞} {I(Ci = r) − λr{Gr(Zi), ψ̂n}I(Ci ≥ r)}/Kr{Gr(Zi), ψ̂n} × E{m(Z, β)|Gr(Zi), ξ̂n∗} ] = 0,

where m(Z, β) is a full-data estimating function, ψ̂n is the MLE for ψ, and
ξˆn∗ is an estimator for the parameter ξ in a posited model p∗Z (z, ξ).
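The ingredients of this monotone-coarsening equation can be assembled mechanically once the discrete hazards and the posited conditional expectations have been fitted. The sketch below computes Kr and π(∞, Z) from the hazards and returns one subject's contribution for a scalar m; all inputs and names are hypothetical placeholders for fitted quantities.

    import numpy as np

    def monotone_aipwcc_contribution(c, hazards, cond_means, m_complete=None):
        """
        c          : observed coarsening level (1, ..., l-1) or np.inf for a complete case
        hazards    : hazards[r-1] = lambda_r{G_r(Z), psi_hat}, r = 1, ..., l-1
        cond_means : cond_means[r-1] = E{m(Z, beta)|G_r(Z), xi_hat} under the posited model
        m_complete : m(Z, beta), needed only when c == np.inf
        """
        K = np.cumprod(1.0 - np.asarray(hazards, dtype=float))  # K_r = prod_{r'<=r}(1 - lambda_r')
        pi_inf = K[-1]                                           # pi(infinity, Z)
        total = (m_complete / pi_inf) if c == np.inf else 0.0
        for r, (lam, Kr, mu_r) in enumerate(zip(hazards, K, cond_means), start=1):
            at_risk = 1.0 if (c == np.inf or c >= r) else 0.0
            observed = 1.0 if c == r else 0.0
            weight = observed - lam * at_risk
            if weight != 0.0:                                    # levels beyond c contribute zero
                total += weight / Kr * mu_r
        return total

    # Example call (illustrative numbers only):
    # monotone_aipwcc_contribution(np.inf, [0.2, 0.3], [0.5, 0.4], m_complete=1.3)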

Nonmonotone coarsening

• In general, for ϕ∗F (Z) ∈ ΛF ⊥ ,

J (ϕ∗F ) = L{M−1 (ϕ∗F )},

where
      L(hF) = E{hF(Z)|C, GC(Z)} is equal to

          Σ_r I(C = r)E{hF(Z)|Gr(Z)},

      M(hF) = E{L(hF)|Z} is equal to

          Σ_r π{r, Gr(Z)}E{hF(Z)|Gr(Z)},

and where the inverse operator M−1 (hF ) exists and can be obtained
by successive approximation, where

          ϕn+1(Z) = (I − M)ϕn(Z) + hF(Z),

      and ϕn(Z) → M−1{hF(Z)} as n → ∞.
• More efficient adaptive AIPWCC estimators for β can be obtained by
solving the equation

      Σ_{i=1}^{n} [ I(Ci = ∞)m(Zi, β)/π(∞, Zi, ψ̂n) + L∗2(j){Ci, GCi(Zi), β, ψ̂n, ξ̂n∗} ] = 0,

  where m(Z, β) is a full-data estimating function, ψ̂n is the MLE for ψ, ξ̂n∗
  is an estimator for the parameter ξ in a posited model p∗Z(z, ξ),

      L∗2(j){C, GC(Z), β, ψ, ξ}
      = − [I(C = ∞)/π(∞, Z, ψ)] Σ_{r≠∞} π{r, Gr(Z), ψ}E{dF(j)(Z, β, ψ, ξ)|Gr(Z), ξ}
      + Σ_{r≠∞} I(C = r)E{dF(j)(Z, β, ψ, ξ)|Gr(Z), ξ},

  and dF(j)(Z, β, ψ, ξ) is the approximation to M−1{m(Z, β), ψ, ξ} after (j)
  iterations of successive approximations.
• As j → ∞, the AIPWCC estimator becomes a double-robust estimator.

10.7 Exercises for Chapter 10


1. In Definition 6 of Chapter 3, we defined a q-replicating linear space. In
Theorem 10.1, we considered the linear space M = Π[Λ2 |Λ⊥ ψ ] ⊂ H.
a) Prove that Λ2 is a q-replicating linear space.
b) Prove that Λψ is a q-replicating linear space. (Recall that Λψ is the
finite-dimensional linear space, contained in H, that is spanned by the
score vector Sψ {C, GC (Z)}.)
c) Prove that M is a q-replicating linear space.
2. When we considered monotone coarsening, we stated that the adaptive
estimator β̂n , which solves equation (10.60), is asymptotically equivalent
to the AIPWCC estimator β̂n∗ , which solves equation (10.61), when the
model for the coarsening probabilities is correctly specified. Give a heuris-
tic proof that
n1/2 (β̂n − β̂n∗ ) −
P
→ 0.
(You can use arguments similar to the proof of Theorem 10.3.)

3. Consider the simple linear regression restricted moment model where with
full data (Yi , X1i , X2i ), i = 1, . . . , n, we assume
E(Yi |X1i , X2i ) = β0 + β1 X1i + β2 X2i .
In such a model, we can estimate the parameters (β0 , β1 , β2 )T using ordi-
nary least squares; that is, the solution to the estimating equation

      Σ_{i=1}^{n} (1, X1i, X2i)T (Yi − β0 − β1X1i − β2X2i) = 0.                   (10.104)

In fact, this estimator is locally efficient when var(Yi |X1i , X2i ) is constant.
The data, however, are missing at random with a monotone missing pat-
tern. That is, Yi is observed on all individuals in the sample; however, for
some individuals, only X2i is missing, and for others both X1i and X2i
are missing. Therefore, we define the missingness indicator
(Ci = 1) if we only observe Yi ,
(Ci = 2) if we only observe (Yi , X1i ),

and

(Ci = ∞) if we observe (Yi , X1i , X2i ).


We will define the missingness probability model using discrete-time haz-
ards, namely
λ1 (Y ) = P (C = 1|Y ),
λ2 (Y, X1 ) = P (C = 2|C ≥ 2, Y, X1 ).
a) In terms of λ1 and λ2 , what is
P (C = ∞|Y, X1 , X2 )?
In order to model the missingness process, we assume logistic regres-
sion models; namely,
 
      logit{λ1(Y)} = ψ10 + ψ11 Y,   where logit(p) = log{p/(1 − p)},

and

logit {λ2 (Y, X1 )} = ψ20 + ψ21 X1 + ψ22 Y.


b) Using some consistent notation to describe the observed data, write
out the estimating equations that need to be solved to derive the
maximum likelihood estimator for
ψ = (ψ10 , ψ11 , ψ20 , ψ21 , ψ22 )T .

c) Describe the linear subspace Λψ .


d) Describe the linear subspace Λ2 . Verify that Λψ ⊂ Λ2 .
e) Describe the subspace Λ⊥ , the linear space orthogonal to the observed-
data nuisance tangent space. An initial estimator for β can be obtained
by using an inverse probability weighted complete-case estimator that
solves the equation

      Σ_{i=1}^{n} [I(Ci = ∞)/π(∞, Yi, X1i, X2i, ψ̂n)] (1, X1i, X2i)T (Yi − β0 − β1X1i − β2X2i) = 0,

where ψ̂n is the maximum likelihood estimator derived in (b). Denote


this estimator by β̂nI .
f) What is the i-th influence function for β̂nI ?
g) Derive a consistent estimator for the asymptotic variance of β̂nI .
In an attempt to gain efficiency, we consider

      [I(Ci = ∞)/π(∞, Yi, X1i, X2i, ψ0)] ϕ∗F(Yi, X1i, X2i)
      − Π[ I(Ci = ∞)ϕ∗F(Yi, X1i, X2i)/π(∞, Yi, X1i, X2i, ψ0) | Λ2 ],

   where ϕ∗F(·) = (1, X1i, X2i)T (Yi − β0 − β1X1i − β2X2i).


   h) Compute

          Π[ I(Ci = ∞)ϕ∗F(Yi, X1i, X2i)/π(∞, Yi, X1i, X2i, ψ0) | Λ2 ].
In practice, we need to estimate (h) using a simplifying model. For
simplicity, let us use as a working model that (Yi , X1i , X2i )T is multi-
variate normal with mean (µY , µX1 , µX2 )T and covariance matrix
          ⎡ σYY    σYX1   σYX2  ⎤
          ⎢ σYX1   σX1X1  σX1X2 ⎥ .
          ⎣ σYX2   σX1X2  σX2X2 ⎦

i) With the observed data, how would you estimate the parameters in
the multivariate normal?
j) Assuming the simplifying multivariate normal model and the esti-
mates derived in (i), estimate the projection in (h).
k) Write out the estimating equation that needs to be solved to get an
improved estimator.
l) Find a consistent estimator for the asymptotic variance of the estima-
tor in (k). (Keep in mind that the simplifying model of multivariate
normality may not be correct.)
11
Locally Efficient Estimators for
Coarsened-Data Semiparametric Models

Using semiparametric theory, we have demonstrated that RAL estimators


for the parameter β in a semiparametric model with coarsened data can be
obtained using AIPWCC estimators. That is, estimators for β can be obtained
from a sample of coarsened data {Ci , GCi (Zi )}, i = 1, . . . , n, by solving the
estimating equation
      Σ_{i=1}^{n} [ I(Ci = ∞)m(Zi, β)/π(∞, Zi, ψ̂n) + L2{Ci, GCi(Zi), β, ψ̂n} ] = 0,   (11.1)

where m(Z, β) is a full-data estimating function, L2 (·) is an element of the


augmentation space, Λ2 , and ψ̂n is an estimator for the parameters in the
coarsening model. In Chapter 10, we demonstrated that, among the AIP-
WCC estimators, improved double-robust estimators for β can be obtained
by considering observed-data estimating functions within the class J (ΛF ⊥ )
(i.e., the so-called DR linear space), where, for ϕ∗F ∈ ΛF ⊥ ,
 
      J(ϕ∗F) = I(C = ∞)ϕ∗F/π(∞, Z) − Π[ I(C = ∞)ϕ∗F/π(∞, Z) | Λ2 ].

This led us to develop adaptive estimators that were the solution to

      Σ_{i=1}^{n} [ I(Ci = ∞)m(Zi, β)/π(∞, Zi, ψ̂n) + L∗2{Ci, GCi(Zi), β, ψ̂n, ξ̂n∗} ] = 0,   (11.2)

where L∗2 {C, GC (Z), β, ψ, ξ} is equal to minus the projection onto the augmen-
tation space; i.e.,
 
      −Π[ I(C = ∞)m(Z, β)/π(∞, Z) | Λ2 ].
To compute this projection, we need estimates for the parameter ψ that de-
scribes the coarsening probabilities and an estimate for the marginal distri-
bution of Z. The latter is accomplished by positing a simpler, and possibly

incorrect, model for the density of Z as p∗Z (z, ξ) and deriving an estimator ξˆn∗
for ξ.
Among the class of double-robust estimators is the efficient estimator,
the estimator that achieves the semiparametric efficiency bound. Finding this
efficient estimator within this class of double-robust estimators entails de-
riving the proper choice of the full-data estimating function m(Z, β), where
m(Z, β0 ) = ϕ∗F ∈ ΛF ⊥ . In this chapter, we will study how to find the efficient
estimator and the appropriate choice for m(Z, β).
As we will see, the efficient estimator will depend on the true marginal dis-
tribution of Z, which, of course, is unknown to us. Consequently, we will de-
velop adaptive methods where the efficient estimator will be computed based
on a posited model p∗Z (z, ξ) for the density of Z. Hence, the proposed methods
will lead to a locally efficient estimator, an estimator for β that will achieve
the semiparametric efficiency bound if the posited model is correct but will
still be a consistent, asymptotically normal RAL semiparametric estimator
for β even if the posited model does not contain the truth.
As we indicated in Chapter 10, finding improved double-robust estimators
often involves computationally intensive methods. In fact, when the coars-
ening of the data is nonmonotone, these computational challenges could be
overwhelming. Similarly here, deriving locally efficient estimators involves nu-
merical difficulties. Nonetheless, the theory developed by Robins, Rotnitzky,
and Zhao (1994) gives us a prescription for how to derive locally efficient
estimators. We present this theory in this chapter and discuss strategies for
finding locally efficient estimators. The methods build on the full-data semi-
parametric theory. Therefore, it will be assumed that we have a good un-
derstanding of the full-data semiparametric model. That is, we can identify
the space orthogonal to the full-data nuisance tangent space ΛF ⊥ , the class
of full-data influence functions (IF)F, the full-data efficient score SFeff(Z, β0),
and the full-data influence function ϕFeff(Z).
However, we caution the reader that these methods may be very difficult
to implement in practice, and we believe a great deal of research still needs to
be done in developing feasible computational algorithms. In Chapter 12, we
will discuss approximations that may be used to derive AIPWCC estimators
for β that although not locally efficient are easier to implement and can result
in substantial gains in efficiency.
There is, however, one class of problems where locally efficient estimators
are obtained readily, and this is the case when only one full-data influence
function exists. This occurs, for example, when the full-data tangent space is
the entire Hilbert space HF , as is the case when no restrictions are put on the
class of densities for Z; i.e., the nonparametric problem (see Theorem 4.4).
In Section 5.3, we showed that only one full-data influence function exists
when we are interested in estimating the mean of a random variable in a
nonparametric problem and that this estimator can be obtained using the
sample average. When only one full-data influence function, ϕF (Z), exists,
then the class of observed-data influence functions is given by
      [ I(C = ∞)ϕF(Z)/π(∞, Z, ψ0) + L2{C, GC(Z)} ] − Π{[·]|Λψ},   L2(·) ∈ Λ2,

and because of Theorem 10.1, the optimal observed-data influence function is


 
      I(C = ∞)ϕF(Z)/π(∞, Z, ψ0) − Π[ I(C = ∞)ϕF(Z)/π(∞, Z, ψ0) | Λ2 ],

which is also the efficient influence function. Consequently, when there is only
one full-data influence function, then the adaptive double-robust estimator
outlined in Chapter 10 will lead to a locally efficient estimator. We illustrate
with a simple example.

Example: Estimating the Mean with Missing Data

In Section 7.4, we considered a problem where interest focused on estimat-


ing the relationship of a response variable Y as a function of covariates X.
Because one of the covariates X2 was expensive to collect, only a subsam-
ple of this covariate was collected with probability π(Y, X1 ), which depends
on the response Y and the other covariates X1 . Thus, for this problem,
X = (X1T , X2 )T and the full data Z = (Y, X). The probability π(Y, X1 )
was chosen by the investigator and therefore this is an example of two levels
of missingness by design. The complete-case indicator is denoted by R, where
P (R = 1|Z) = P (R = 1|Y, X1 ) = π(Y, X1 ), and the observed data are de-
noted by O = (R, Y, X1 , RX2 ). The statistical question, as originally stated
in Section 7.4, was to estimate the parameter in a restricted moment model
with a sample of observed data (Ri , Yi , X1i , Ri X2i ), i = 1, . . . , n.
However, we now want to consider the simpler problem of estimating
the mean of X2 using the observed data. Let us denote this parameter as
β = E(X2 ). Also, to be as robust to model misspecification as possible, we no
longer assume any specific relationship between the response Y and the co-
variates X. Consequently, with no assumptions on the joint distribution of the
full data (Y, X1 , X2 ) (i.e., the nonparametric problem), we know that there
is only one full-data influence function of RAL estimators for β = E(X2 ),
namely ϕF (Z) = (X2 − β0 ), and that the solution to the full-data estimating
equation is
      Σ_{i=1}^{n} (X2i − β) = 0;

i.e., the sample mean β̂Fn = n−1 Σ_{i=1}^{n} X2i is an RAL full-data estimator for
β with this influence function.
We also know that all observed-data influence functions for this problem
are given by

      RϕF(Z)/π(Y, X1) + [{R − π(Y, X1)}/π(Y, X1)] L(Y, X1),                       (11.3)

where L(Y, X1 ) is an arbitrary function of Y and X1 , and, because of Theorem


10.2, the optimal choice for L(Y, X1 ) is given by −E(X2 |Y, X1 ). Therefore,
among the class of influence functions (11.3), the one with the smallest vari-
ance is

      RϕF(Z)/π(Y, X1) − [{R − π(Y, X1)}/π(Y, X1)] E(X2|Y, X1).                    (11.4)
This also is the semiparametric efficient observed-data influence function.
Since the efficient influence function depends on E(X2 |Y, X1 ), we consider
an adaptive strategy. Using methods described in Section 10.2 for adaptive
estimation with two levels of missingness, we posit a model for the conditional
distribution of X2 given (Y, X1 ). One simple model we may consider is that
the distribution of X2 given (Y, X1 ) is normally distributed with mean ξ1 +
ξ2 Y +ξ3T X1 and variance σξ2 . This is attractive because the MLE estimator for
the parameter ξ (i.e., the estimator that maximizes (10.16)) can be obtained
using ordinary least squares among the complete cases {i : Ri = 1}. That is,
the estimator for ξ is obtained as the solution to the estimating equation

      Σ_{i=1}^{n} Ri (1, Yi, XT1i)T (X2i − ξ1 − ξ2Yi − ξT3 X1i) = 0.

Denote the least-squares estimator for ξ by ξˆn∗ . Then, the adaptive observed-
data estimator for β is given as the solution to
      Σ_{i=1}^{n} [ Ri(X2i − β)/π(Yi, X1i)
        − {Ri − π(Yi, X1i)}/π(Yi, X1i) × (ξ̂∗1n + ξ̂∗2n Yi + ξ̂∗T3n X1i − β) ] = 0,

which, after solving, yields

      β̂n = n−1 Σ_{i=1}^{n} [ Ri X2i/π(Yi, X1i)
        − {Ri − π(Yi, X1i)}/π(Yi, X1i) × (ξ̂∗1n + ξ̂∗2n Yi + ξ̂∗T3n X1i) ].          (11.5)
This estimator is a consistent, asymptotically normal observed-data RAL
estimator for β regardless of whether the posited model is correct or not.
Moreover, if the posited model is correctly specified, then this estimator is
semiparametric efficient. Therefore, (11.5) is a locally efficient semiparamet-
ric estimator for β = E(X2 ). We also note that because the least-squares
estimator leads to a consistent estimator for ξ, if the conditional expectation
of X2 given Y and X1 is linear, namely

E(X2 |Y, X1 ) = ξ1 + ξ2 Y + ξ3T X1 , (11.6)

whether the distribution is normal or not, the locally efficient estimator β̂n is
also fully efficient whenever (11.6) is satisfied.
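Because π(Y, X1) is chosen by design in this example, the estimator (11.5) can be computed directly once the complete-case least-squares fit is in hand. The following minimal Python sketch implements the closed-form formula; the function arguments and names are illustrative only.

    import numpy as np

    def locally_efficient_mean(x2, y, x1, r, pi):
        """x2: outcome (may be np.nan where R = 0); r: complete-case indicator; pi = pi(Y, X1)."""
        design = np.column_stack([np.ones_like(y), y, x1])
        cc = r == 1
        xi_hat, *_ = np.linalg.lstsq(design[cc], x2[cc], rcond=None)  # complete-case OLS
        fitted = design @ xi_hat                                      # xi1 + xi2*Y + xi3'X1
        x2_cc = np.where(cc, x2, 0.0)                                 # X2 only enters when R = 1
        return np.mean(r * x2_cc / pi - (r - pi) / pi * fitted)       # formula (11.5)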

11.1 The Observed-Data Efficient Score


As we know, the efficient RAL estimator for β will have an influence function
that is proportional to the efficient score. Therefore, it is useful to study the
properties of the efficient score with coarsened data. Toward that end, we give
two different representations for the efficient score. The first is likelihood-based
and the second is based on AIPWCC estimators. The relationship of these two
representations to each other will be key in the development of the proposed
adaptive locally efficient estimators.

Representation 1 (Likelihood-Based)

We remind the reader that the efficient observed-data estimator for β has an
influence function that is proportional to the observed-data efficient score and
that the efficient score is unique and equal to
Seff {C, GC (Z)} = Sβ {C, GC (Z)} − Π[Sβ {C, GC (Z)}|Λ],
where Sβ {C, GC (Z)} = E{SβF (Z)|C, GC (Z)}, and Λ = Λψ ⊕ Λη , Λψ ⊥ Λη .
Because Λψ ⊥ Λη , this implies that
Π[Sβ {C, GC (Z)}|Λ] = Π[Sβ {C, GC (Z)}|Λψ ] + Π[Sβ {C, GC (Z)}|Λη ].
The same argument that was used to show that Λη ⊥ Λ2 in Theorem
8.2 can also be used to show that Sβ {C, GC (Z)} ⊥ Λ2 . Since Λψ ⊂ Λ2 , this
implies that Sβ {C, GC (Z)} is orthogonal to Λψ . Therefore,
Π[Sβ {C, GC (Z)}|Λψ ] = 0, (11.7)
which implies that
Π[Sβ {C, GC (Z)}|Λ] = Π[Sβ {C, GC (Z)}|Λη ],
and the efficient score is
Seff {C, GC (Z)} = Sβ {C, GC (Z)} − Π[Sβ {C, GC (Z)}|Λη ]. (11.8)
Recall that
      Λη = [ E{αF(Z)|C, GC(Z)} for all αF(Z) ∈ ΛF ].
This means that the unique projection Π[Sβ{C, GC(Z)}|Λη] corresponds to
some element in Λη, which we will denote by
      E{αFeff(Z)|C, GC(Z)},   αFeff(Z) ∈ ΛF.
With this representation,
      Seff{C, GC(Z)} = E{SβF(Z)|C, GC(Z)} − E{αFeff(Z)|C, GC(Z)}
                     = E[{SβF(Z) − αFeff(Z)}|C, GC(Z)].                           (11.9)

Remark 1. The full-data efficient score is given by


      SFeff(Z) = SβF(Z) − Π[SβF(Z)|ΛF].

However, the element αFeff(Z) is not necessarily the same as Π[SβF(Z)|ΛF].
This means that, in general,

      Seff{C, GC(Z)} ≠ E{SFeff(Z)|C, GC(Z)}.   □                                  (11.10)

Representation 2 (AIPWCC-Based)

In (10.2), we derived the optimal (smallest variance matrix) influence function


among the class of AIPWCC influence functions associated with a fixed full-
data influence function ϕF (Z). Consequently, we can restrict our search for
the efficient observed-data influence function to the class of influence functions
   
      [ I(C = ∞)ϕF(Z)/π(∞, Z, ψ0) − Π{ I(C = ∞)ϕF(Z)/π(∞, Z, ψ0) | Λ2 } : ϕF(Z) ∈ (IF)F ],

which we denote as the space (IF)DR = J{(IF)F}. Since the observed-data
efficient score is defined up to a proportionality constant matrix times the
efficient influence function, this implies that the observed-data efficient score
must be an element in the DR linear space J(ΛF⊥),

      [ I(C = ∞)BF(Z)/π(∞, Z, ψ0) − Π{ I(C = ∞)BF(Z)/π(∞, Z, ψ0) | Λ2 } : BF(Z) ∈ ΛF⊥ ],
                                                                                  (11.11)
with the corresponding element denoted by BFeff(Z).

Relationship between the Two Representations

Thus, we have shown that there are two equivalent representations for the
observed-data efficient score, which are given by (11.9) and (11.11):
  
(i) Seff{C, GC(Z)} = E[{SβF(Z) − αFeff(Z)} | C, GC(Z)], where αFeff(Z) ∈ ΛF,

and

(ii) Seff{C, GC(Z)} = I(C = ∞)BFeff(Z)/π(∞, Z) − Π[ I(C = ∞)BFeff(Z)/π(∞, Z) | Λ2 ],
     where BFeff(Z) ∈ ΛF⊥.

Remark 2. If the full-data model were parametric (i.e., if η were finite-


dimensional), then representation (i) would correspond to the estimating func-
tion that would be used to derive the coarsened-data MLE for β. This may

be preferable, as it does not involve the parameter ψ specifying the coarsen-


ing process. However, if the model is complicated, this approach may become
formidable due to the curse of dimensionality. The second representation leads
us to augmented inverse-probability weighted complete-case (AIPWCC) es-
timating equations. These estimators, which build on full-data estimators,
may be easier to derive, even in some complicated situations. However, this
approach requires that the data analyst model the coarsening process and
estimate the parameter ψ. Also, to obtain gains in efficiency, an adaptive ap-
proach is required, where the data analyst posits simpler models for the full
data p∗Z (z, ξ), in terms of the parameter ξ that needs to be estimated. Which
method is preferable often depends on the specific application. However, be-
cause of the robustness of the AIPWCC estimators to misspecification, we
will focus attention on these estimators.
Nonetheless, the two representations will aid us in getting a better un-
derstanding of the geometry of observed-data efficient influence functions and
guide us in finding as good an AIPWCC estimator as is feasible.  

If we knew, or could reasonably approximate, the element BFeff(Z) ∈ ΛF⊥
of representation (ii) above, then we could construct an AIPWCC estimator
for β by using the full-data estimating function m(Z, β), where m(Z, β0) =
BFeff(Z), and applying the methods outlined in Chapter 10. Subject to the
accuracy of different posited models, this methodology will give us as efficient
an AIPWCC estimator as possible while still affording us maximum robustness
to misspecification. Toward that end, we now show how to derive BFeff(Z) in
the following theorem.

Theorem 11.1. The element BFeff(Z) is the unique BF(Z) ∈ ΛF⊥ that solves
the equation

      Π[M−1{BF(Z)}|ΛF⊥] = SFeff(Z),                                               (11.12)

where M(·) denotes the linear operator, given by Definition 5 of Chapter 10,
equation (10.80), which maps HF (full-data Hilbert space) to HF as

      M{hF(q×1)(Z)} = E[ E{hF(Z)|C, GC(Z)} | Z ].

Proof. Because of the equivalence of the two representations (i) and (ii) above,
the efficient score can be written as
      E[{SβF(Z) − αFeff(Z)} | C, GC(Z)]
      = I(C = ∞)BFeff(Z)/π(∞, Z) − Π[ I(C = ∞)BFeff(Z)/π(∞, Z) | Λ2 ]             (11.13)

for some αFeff(Z) ∈ ΛF and BFeff(Z) ∈ ΛF⊥. Taking the conditional expectation
of both sides of equation (11.13) with respect to Z, we obtain the equation

      M{SβF(Z) − αFeff(Z)} = BFeff(Z).                                            (11.14)

In Theorem 10.6 and Lemma 10.5, we showed that the linear operator
M(·) has a unique inverse, M−1. Therefore, we can write (11.14) as

      M−1{BFeff(Z)} = SβF(Z) − αFeff(Z).                                          (11.15)

Projecting both sides of equation (11.15) onto ΛF⊥, we obtain

      Π[M−1{BFeff(Z)} | ΛF⊥] = Π[SβF(Z) | ΛF⊥] − Π[αFeff(Z) | ΛF⊥] = SFeff(Z) − 0,

since Π[SβF(Z) | ΛF⊥] = SFeff(Z) and Π[αFeff(Z) | ΛF⊥] = 0 because αFeff(Z) ∈ ΛF.

This leads us to the important relationship that the efficient element
BFeff(Z) ∈ ΛF⊥, which is used to construct the efficient score

      Seff{C, GC(Z)} = I(C = ∞)BFeff(Z)/π(∞, Z) − Π[ I(C = ∞)BFeff(Z)/π(∞, Z) | Λ2 ],   (11.16)

must satisfy the relationship (11.12).

We still need to show that a unique BFeff(Z) ∈ ΛF⊥ exists that satisfies
(11.12) and find a method for computing BFeff(Z).   □
F


Uniqueness of BFeff(Z)

Lemma 11.1. There exists a unique BFeff(Z) ∈ ΛF⊥ that solves the equation

      Π[M−1{BFeff(Z)} | ΛF⊥] = SFeff(Z),                                          (11.17)

and this solution can be obtained through successive approximations.

Proof. Notice that (11.17) involves a mapping from the linear subspace ΛF ⊥ ⊂
HF to ΛF ⊥ ⊂ HF . The way we will prove this lemma is by defining another
linear mapping (I −Q){M−1 }(·) : HF → HF , which maps the entire full-data
Hilbert space to the entire full-data Hilbert space in such a way that
(i) (I − Q){M−1 }(·) coincides with the mapping Π[M−1 (hF )|ΛF ⊥ ] whenever
hF ∈ ΛF ⊥ , and
(ii) (I − Q) is a contraction mapping and hence has a unique inverse.
Define

      DFeff(Z) = M−1{BFeff(Z)}.

Because of the existence and uniqueness of M−1, if BFeff(Z) exists, then so
does DFeff(Z) such that

      M{DFeff(Z)} ∈ ΛF⊥.                                                          (11.18)

Motivated by the fact that DFeff(Z) must satisfy

(a) equation (11.17), namely

      Π[DFeff(Z)|ΛF⊥] = SFeff(Z),                                                 (11.19)

    or, equivalently,

      {I(·) − Π[I(·)|ΛF]}{DFeff(Z)} = SFeff(Z),                                   (11.20)

    where we view {I(·) − Π[I(·)|ΛF]}(·) as a linear operator from HF to HF
    with I(·) denoting the identity operator,
and
(b) equation (11.18), or, equivalently,

      Π[M(·)|ΛF]{DFeff(Z)} = 0,                                                   (11.21)

    where Π[M(·)|ΛF](·) is also viewed as a linear operator from HF to HF,

we combine (11.19)–(11.21) to consider the equations

      SFeff(Z) = Π[hF(Z)|ΛF⊥] + Π[M{hF(Z)}|ΛF]                                    (11.22)
               = (I − Q){hF(Z)},                                                  (11.23)

where (I − Q)(·) is a linear operator, with Q(·) defined as

      Q{DFeff(Z)} = Π[(I − M){DFeff(Z)} | ΛF].

We first argue that the solution, hF(Z) ∈ HF, to equation (11.23) exists and
is unique and then argue that this solution hF(Z) must equal DFeff(Z).
According to Lemma 10.5, the linear operator (I − Q)(·) will have a unique
inverse if we can show that the linear operator “Q” is a contraction mapping.
Also, if Q is a contraction mapping, then by Lemma 10.5, the unique inverse
is equal to

      (I − Q)−1 = Σ_{i=0}^{∞} Q^i.

In the proof of Theorem 10.6, part (i), we already showed that (I − M) is a
contraction mapping; i.e.,

      ||(I − M){hF}|| ≤ (1 − ε)||hF||.                                            (11.24)

By the Pythagorean theorem,

      ||Q(hF)|| = ||Π[(I − M)hF | ΛF]||
                ≤ ||(I − M)hF||
                ≤ (1 − ε)||hF||   by (11.24).

Hence, Q is a contraction mapping and (I − Q)−1 exists and is unique.



To complete the proof, we must show that the unique solution hF(Z) to
equation (11.23) or, equivalently, (11.22), is identical to DFeff(Z) satisfying
(11.19) and (11.21).
Clearly, any element DFeff(Z) satisfying (11.19) and (11.21) must satisfy
(11.22). Conversely, since SFeff(Z) ∈ ΛF⊥, then the solution hF(Z) of (11.22)
must be such that Π[M{hF(Z)}|ΛF] = 0 and Π[hF(Z)|ΛF⊥] = SFeff(Z); that
is, hF(Z) satisfies (11.21) and (11.19), respectively. This completes the proof
that DFeff(Z) exists and is the unique element satisfying equations (11.19) and
(11.21) or, equivalently, that BFeff(Z) = M{DFeff(Z)} exists and is the unique
solution to (11.17).
In Lemma 10.5, we showed that the solution DFeff(Z) can be obtained by
successive approximation; that is,

      D(i+1)(Z) = Π[(I − M)D(i)(Z) | ΛF] + SFeff(Z),                              (11.25)

and D(i)(Z) → DFeff(Z) as i → ∞. If we define

      B(i)(Z) = Π[M{D(i)(Z)} | ΛF⊥],

where, by construction, B(i)(Z) ∈ ΛF⊥, then

      B(i)(Z) = M{D(i)(Z)} − Π[M{D(i)(Z)} | ΛF] → BFeff(Z)   as i → ∞,            (11.26)

since M{D(i)(Z)} → M{DFeff(Z)} = BFeff(Z) and Π[M{D(i)(Z)} | ΛF] → Π[BFeff(Z)|ΛF] = 0.   □

The inverse operator M−1 plays an important role in the definition of
BFeff(Z) given by Theorem 11.1. As we showed in Chapter 10, M−1 exists,
is unique, and can be computed using successive approximations. However, a
closed-form solution exists when the coarsening is monotone. For complete-
ness, we give this result.

M−1 for Monotone Coarsening

Recall that, for monotone coarsening, we defined coarsening probabilities us-


ing discrete hazard probabilities

λr {Gr (Z)} = P (C = r|C ≥ r, Z)

and

      Kr{Gr(Z)} = P(C ≥ r + 1|Z) = Π_{j=1}^{r} [1 − λj{Gj(Z)}].

Theorem 11.2. When coarsening is monotone, the inverse operator M−1 is
given by

      aF(Z) = M−1{hF(Z)} = hF(Z)/π(∞, Z) − Σ_{r≠∞} (λr/Kr) E{hF(Z)|Gr(Z)},        (11.27)

where we use the shorthand notation λr = λr{Gr(Z)} and Kr = Kr{Gr(Z)}.
An equivalent representation is also given by

      M−1{hF(Z)} = hF(Z) + Σ_{r≠∞} (λr/Kr) [ hF(Z) − E{hF(Z)|Gr(Z)} ].            (11.28)
r=∞

Proof. In Theorem 10.6, we showed that M−1 exists and is uniquely defined.
Therefore, we only need to show that M{aF(Z)} = hF(Z), where aF(Z) is
defined by (11.27) and

      M(aF) = π∞ aF + Σ_{r≠∞} πr E(aF|Gr),

with the shorthand πr = π{r, Gr(Z)} and π∞ = π(∞, Z). We note that

      λr/Kr = 1/Kr − 1/Kr−1                                                       (11.29)

and

      π∞ = Kℓ,

where ℓ denotes the number of coarsening levels; i.e., ℓ denotes the largest
value of C less than ∞. After substituting 1/Kr − 1/Kr−1 for λr/Kr and 1/Kℓ
for 1/π∞ in (11.27) and rearranging terms, we obtain

      aF = Σ_r (1/Kr−1) { E(hF|Gr) − E(hF|Gr−1) },                                (11.30)

where K0 = 1 and E(hF|G0) = 0. Therefore,

      M(aF) = Σ_{r′} πr′ E[ Σ_r { E(hF/Kr−1|Gr) − E(hF/Kr−1|Gr−1) } | Gr′ ]
            = Σ_{r′} πr′ Σ_r E[ E(hF/Kr−1|Gr) − E(hF/Kr−1|Gr−1) | Gr′ ].          (11.31)

As a consequence of monotone coarsening and the laws of conditional expec-
tations, we obtain that

      E[ E(hF/Kr−1|Gr) − E(hF/Kr−1|Gr−1) | Gr′ ]
      = 0                                                    if r′ < r,
      = (1/Kr−1) { E(hF|Gr) − E(hF|Gr−1) }                   if r′ ≥ r.           (11.32)

Substituting (11.32) into (11.31), we obtain

      M(aF) = Σ_{r′} πr′ Σ_r (1/Kr−1) { E(hF|Gr) − E(hF|Gr−1) } I(r′ ≥ r)
            = Σ_r (1/Kr−1) { E(hF|Gr) − E(hF|Gr−1) } Σ_{r′} πr′ I(r′ ≥ r)
            = Σ_r (1/Kr−1) { E(hF|Gr) − E(hF|Gr−1) } Kr−1
            = Σ_r { E(hF|Gr) − E(hF|Gr−1) }
            = E(hF|G∞) − E(hF|G0) = hF.

The second representation (11.28) will follow if we can show that

      1 + Σ_{r≠∞} λr/Kr = 1/π∞.                                                   (11.33)

Substituting (11.29) into (11.33), we obtain

      1 + Σ_{r≠∞} (1/Kr − 1/Kr−1) = 1 + 1/Kℓ − 1/K0 = 1/Kℓ = 1/π∞.   □
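The closed form (11.27) is straightforward to evaluate once the hazards and the posited conditional expectations have been fitted. The short Python sketch below does this for a scalar hF at one realization of Z; the inputs are hypothetical stand-ins for fitted quantities.

    import numpy as np

    def M_inverse_monotone(h_full, hazards, cond_means):
        """Closed-form M^{-1}h of (11.27): hazards[r-1] = lambda_r{G_r(Z)},
        cond_means[r-1] = E{h^F(Z)|G_r(Z)}, both for r = 1, ..., l-1."""
        hazards = np.asarray(hazards, dtype=float)
        K = np.cumprod(1.0 - hazards)          # K_r{G_r(Z)}
        pi_inf = K[-1]                         # pi(infinity, Z)
        return h_full / pi_inf - np.sum(hazards / K * np.asarray(cond_means))

    # Example call (illustrative numbers only):
    # M_inverse_monotone(1.3, [0.2, 0.3], [0.5, 0.4])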
r=∞

M−1 with Right Censored Data

Because of the correspondence that was developed between monotone coarsen-


ing and right-censored data in Section 9.3, we immediately obtain the following
result, which was first given by Robins and Rotnitzky (1992).

Lemma 11.2. In a survival analysis problem, full data are represented as


{T, X̄(T )}. Using the notation of Section 9.3, we obtain that

      M−1[hF{T, X̄(T)}]
      = hF{T, X̄(T)}/KT{X̄(T)} − ∫_0^T [λC̃{r, X̄(r)}/Kr{X̄(r)}] E[hF{T, X̄(T)} | T ≥ r, X̄(r)] dr,

or

      M−1[hF{T, X̄(T)}] = hF{T, X̄(T)}
      + ∫_0^T [λC̃{r, X̄(r)}/Kr{X̄(r)}] ( hF{T, X̄(T)} − E[hF{T, X̄(T)} | T ≥ r, X̄(r)] ) dr,

where λC̃ {r, X̄(r)} was defined in (9.30) and Kr {X̄(r)} was defined in (9.33).

11.2 Strategy for Obtaining Improved Estimators


The goal of this section is to outline a method for obtaining an AIPWCC
estimator for β that is as efficient as possible while still remaining as robust
to model misspecification as possible. Many of the calculations necessary to
derive conditional expectations, projections, linear operators, etc., involve the
coarsening probabilities as well as the marginal distribution of the full-data
Z. In Chapter 10, we discussed methods for positing models for the coarsen-
ing probabilities and the distribution of Z in terms of ψ and ξ and finding
estimators for these parameters.
We first consider finding a full-data estimating function m(Z, β) such that
m(Z, β0) = ϕ∗F(Z), where ϕ∗F(Z) ∈ ΛF⊥ is an approximation to BFeff(Z).
Because in general it is too difficult to find such an estimating function ex-
plicitly, we may instead use successive approximations as given by (11.25) and
(11.26). That is, we start with m(0)(Z, β) such that m(0)(Z, β0) = SFeff(Z) and
iteratively compute

D(i) (Z, β, ψ̂n , ξˆn∗ ) = Π[(I − M)D(i−1) (Z, β, ψ̂n , ξˆn∗ )|ΛF ] + m(0) (Z, β), (11.34)

where we index D(·) by ψ and ξ to make clear that we need these parameters
when computing the projection Π[(·)|ΛF ] and the linear operator M(·). After,
say, j iterations, we compute

m(Z, β, ψ̂n , ξˆn∗ ) = Π[M{D(j) (Z, β, ψ̂n , ξˆn∗ )}|ΛF ⊥ ] (11.35)


to serve as an approximation to BFeff(Z).
Now that we’ve computed m(Z, β, ψ̂n, ξ̂n∗), where m(Z, β0, ψ∗, ξ∗) ∈ ΛF⊥,
we need to compute, in accordance with (11.16), Π[ I(C = ∞)m(Z, β)/π(∞, Z) | Λ2 ]. If we
have two levels of coarsening or monotone coarsening, then such a projection
can be defined explicitly, and in Chapter 10 we discussed how such projections
can be obtained using adaptive methods. With nonmonotone coarsened data,
we may proceed as follows.
In Chapter 10, equation (10.95), we showed that Π[ I(C = ∞)m(Z, β)/π(∞, Z) | Λ2 ]
equals

      − [I(C = ∞)/π(∞, Z)] Σ_{r≠∞} π{r, Gr(Z)}E{dF(Z)|Gr(Z)}
      + Σ_{r≠∞} I(C = r)E{dF(Z)|Gr(Z)},                                           (11.36)

where dF (Z) = M−1 {m(Z, β)}. Since D(j) (Z, β, ψ̂n , ξˆn∗ ) is an approximation
to M−1 {m(Z, β, ψ̂n , ξˆn∗ )}, we propose the following adaptive estimating equa-
tion:
      Σ_{i=1}^{n} [ I(Ci = ∞)m(Zi, β, ψ̂n, ξ̂n∗)/π(∞, Zi, ψ̂n)
        − {I(Ci = ∞)/π(∞, Zi, ψ̂n)} Σ_{r≠∞} π{r, Gr(Zi), ψ̂n}E{D(j)(Z, β, ψ̂n, ξ̂n∗)|Gr(Zi), ξ̂n∗}
        + Σ_{r≠∞} I(Ci = r)E{D(j)(Z, β, ψ̂n, ξ̂n∗)|Gr(Zi), ξ̂n∗} ] = 0.             (11.37)

The important thing to notice is that, by construction,

m(Z, β0 , ψ ∗ , ξ ∗ ) ∈ ΛF ⊥

and

      − [I(C = ∞)/π(∞, Z, ψ0)] Σ_{r≠∞} π{r, Gr(Z), ψ0}E{D(j)(Z, β, ψ0, ξ∗)|Gr(Z), ξ∗}
      + Σ_{r≠∞} I(C = r)E{D(j)(Z, β, ψ0, ξ∗)|Gr(Z), ξ∗}  ∈  Λ2

as long as the model for the coarsening probabilities is correctly specified.


In some cases, it may be possible to derive BFeff(Z) directly. We illustrate
with an example.

Example: Restricted Moment Model with Monotone Coarsening

When the coarsening of the data is monotone, we showed in Theorem 11.2,


equation (11.27), how to derive the inverse operator M−1 in closed form. Let
us now examine how we would go about finding a locally efficient estimator
for β with monotonically coarsened data. Specifically, we will consider the
restricted moment model, as the semiparametric theory for such a model has
been studied thoroughly throughout this book.
We remind the reader that, for the restricted moment model, the full data
are given by

Z = (Y, X),
and the model assumes that E(Y |X) = µ(X, β), or, equivalently,

Y = µ(X, β) + ε, where E(ε|X) = 0.

For this semiparametric full-data model, we also derived a series of results


regarding the geometry of the full-data influence functions and full-data esti-
mating functions. Specifically, we showed in (4.48) that all elements of ΛF ⊥
are given by A(X)ε, where A(X) is a conformable matrix of functions of X.
We also showed in (4.44) that
' (
Π[hF (Z)|ΛF ⊥ ] = E hF (Z)εT |X V −1 (X)ε, (11.38)

where V (X) = var(Y |X). The full-data efficient score was given by (4.53),
F
Seff (Z) = DT (X)V −1 (X)ε, (11.39)

where
∂µ(X, β)
D(X) = .
∂β T
Suppose the data are coarsened at random with a monotone coarsening
pattern. An example with monotone missing longitudinal data was given in
Example 1 in Section 9.2 and also studied further in Section 10.3, where
double-robust estimators were proposed. The question is, how do we go about
finding a locally efficient estimator for β with a sample of monotonically coars-
ened data {Ci , GCi (Zi )}, i = 1, . . . , n?
We first develop a model for the coarsening probabilities in terms of
a parameter ψ. Because in this example we are assuming that coarsening
is monotone, it is convenient to develop models for the discrete hazards
λr{Gr(Z), ψ}, r ≠ ∞, and obtain estimators for ψ by maximizing (8.12).
We denote these estimators by ψ̂n .
We also posit a simpler, possibly incorrect model for the full data Z =
(Y, X), Z ∼ p∗Z (z, ξ), and obtain an estimator for ξ, say ξˆn∗ , by maximizing
the observed-data likelihood; see, for example, (10.58),
      Π_{i=1}^{n} p∗Gri(Zi)(gri, ξ).

This model also gives us an estimate for var(Y |X, ξˆn∗ ) = V (X, ξˆn∗ ).
If the data are coarsened at random, then by (11.16) and Theorem 11.1,
the efficient observed-data score is given by
 
      I(C = ∞)BFeff(Z, β, ψ, ξ)/π(∞, Z, ψ) − Π[ I(C = ∞)BFeff(Z, β, ψ, ξ)/π(∞, Z, ψ) | Λ2 ],   (11.40)

where BFeff(Z, β, ψ, ξ) ∈ ΛF⊥ must satisfy

      Π[ M−1{BFeff(Z, β, ψ, ξ)} | ΛF⊥ ] = SFeff(Z, β, ξ).

For the restricted moment model, BFeff(Z, β, ψ, ξ) = A(X, β, ψ, ξ)ε(β), where
ε(β) = Y − µ(X, β), and the matrix A(X, β, ψ, ξ) is obtained by solving the
equation

      Π[ M−1{A(X, β, ψ, ξ)ε(β)} | ΛF⊥ ] = SFeff(Z, β, ξ),

which by (11.38) and (11.39) is equal to

      E[ M−1{A(X, β, ψ, ξ)ε(β)} εT(β) | X, ξ ] V−1(X, ξ)ε(β) = DT(X, β)V−1(X, ξ)ε(β),

or, equivalently,

      E[ M−1{A(X, β, ψ, ξ)ε(β)} εT(β) | X, ξ ] = DT(X, β).                        (11.41)

If, in addition, the coarsening is monotone, then using the results from
Theorem 11.2, equation (11.27), we obtain

      M−1{A(X, β, ψ, ξ)ε(β)} = A(X, β, ψ, ξ)ε(β)/π(∞, Y, X, ψ)
        − Σ_{r≠∞} [λr{Gr(Y, X), ψ}/Kr{Gr(Y, X), ψ}] E{A(X, β, ψ, ξ)ε(β)|Gr(Y, X), ξ}.

Combining this with (11.41), we obtain


      E[ { A(X, β, ψ, ξ)ε(β)/π(∞, Y, X, ψ)
           − Σ_{r≠∞} [λr{Gr(Y, X), ψ}/Kr{Gr(Y, X), ψ}] E{A(X, β, ψ, ξ)ε(β)|Gr(Y, X), ξ} } εT(β) | X, ξ ]
      = DT(X, β),

or
 
      A(X, β, ψ, ξ) E[ ε(β)εT(β)/π(∞, Y, X, ψ) | X, ξ ]
      − Σ_{r≠∞} E[ [λr{Gr(Y, X), ψ}/Kr{Gr(Y, X), ψ}] E{A(X, β, ψ, ξ)ε(β)|Gr(Y, X), ξ} εT(β) | X, ξ ]
      = DT(X, β).                                                                 (11.42)

Remarks

(i) In general, equation (11.42) is difficult to solve. We do, however, get a


simplification for problems where the covariates X are always observed.

For instance, this was the case in Example 1 of Section 9.2, which was
further developed in Section 10.3, where the responses Y = (Y1 , . . . , Yl )T
were longitudinal data intended to be measured at times t1 < . . . < tl but
were missing for some subjects in the study in a monotone fashion due to
patient dropout. For this example, the covariate X (treatment assignment)
was always observed but some of the longitudinal measurements that made
up Y were missing. The coarsening was described as Gr (Z) = (X, Y r ),
where Y r = (Y1 , . . . , Yr )T , r = 1, . . . , l − 1. Equation (11.42) can now be
written as

 
      A(X, β, ψ, ξ) E[ ε(β)εT(β)/π(∞, Y, X, ψ) | X, ξ ]
      − A(X, β, ψ, ξ) Σ_{r≠∞} E[ [λr(X, Y r, ψ)/Kr(X, Y r, ψ)] E{ε(β)|X, Y r, ξ} εT(β) | X, ξ ]
      = DT(X, β).
Therefore, the solution is given by
A(X, β, ψ, ξ) = DT (X, β)Ṽ −1 (X, β, ψ, ξ),
where
  
      Ṽ(X, β, ψ, ξ) = E[ ε(β)εT(β)/π(∞, Y, X, ψ) | X, ξ ]
        − Σ_{r≠∞} E[ [λr(X, Y r, ψ)/Kr(X, Y r, ψ)] E{ε(β)|X, Y r, ξ} εT(β) | X, ξ ].

(ii) Except for special cases, such as the example above, the equation for
solving A(X) in (11.42) is generally a complicated integral equation. Ap-
proximate methods for solving such integral equations are given in Kress
(1989). However, these computations may be so difficult as not to be fea-
sible in practice.
(iii) In Chapter 12, we will give some approximate methods for obtaining im-
proved estimators that although not locally efficient do have increased
efficiency and are easier to implement.  
Suppose we were able to overcome these numerical difficulties and ob-
tain an approximate solution for A(X, β, ψ, ξ). Denoting this solution by
Aimp (X, β, ψ̂n , ξˆn∗ ) and going back to (11.40), we approximate the efficient
score by
      I(C = ∞)Aimp(X, β, ψ̂n, ξ̂n∗){Y − µ(X, β)}/π(∞, Y, X, ψ̂n)
      − Π[ I(C = ∞)Aimp(X, β, ψ̂n, ξ̂n∗){Y − µ(X, β)}/π(∞, Y, X, ψ̂n) | Λ2 ].       (11.43)

Because the coarsening is monotone, we can use (10.55) to estimate the pro-
jection onto Λ2 by −L2 {C, GC (Y, X), β, ψ̂n , ξˆn∗ }, where

      L2{C, GC(Y, X), β, ψ̂n, ξ̂n∗}
      = Σ_{r≠∞} [ {I(C = r) − λr{Gr(Y, X), ψ̂n}I(C ≥ r)} / Kr{Gr(Y, X), ψ̂n} ]
        × E[ Aimp(X, β, ψ̂n, ξ̂n∗){Y − µ(X, β)} | Gr(Y, X), ξ̂n∗ ].                 (11.44)

Finally, the estimator for β is given as the solution to the estimating


equation
      Σ_{i=1}^{n} [ I(Ci = ∞)Aimp(Xi, β, ψ̂n, ξ̂n∗){Yi − µ(Xi, β)}/π(∞, Yi, Xi, ψ̂n)
        + L2{Ci, GCi(Yi, Xi), β, ψ̂n, ξ̂n∗} ] = 0.                                  (11.45)

Some Brief Remarks Regarding Robustness

The estimator for β given by the solution to equation (11.45) used

Aimp (Xi , β, ψ̂n , ξˆn∗ ){Yi − µ(Xi , β)} (11.46)

to represent the full-data estimating function m(Zi , β). Strictly speaking,


(11.46) is not a full-data estimating function, as it involves the parameter
estimators ψ̂n and ξ̂n∗. However, as we discussed in Remark 1 of Chapter 9,
had we substituted

Aimp(Xi, β, ψ∗, ξ∗){Yi − µ(Xi, β)}

as m(Zi, β) in the estimating equation (11.45), where ψ∗ and ξ∗ are the limits
(in probability) of ψ̂n and ξ̂n∗, the solution would be an asymptotically
equivalent estimator for β. What is important to
note here is that

Aimp (X, β0 , ψ ∗ , ξ ∗ ){Y − µ(X, β0 )} ∈ ΛF ⊥ (11.47)

regardless of what the converging values ψ ∗ and ξ ∗ are, and therefore the
solution to (11.45) is an example of an AIPWCC estimator for β.
Because of (11.47), the estimator given as the solution to (11.45) with L2 (·)
computed by (11.44) is an example of an improved estimator as described in
Section 10.4. As such, this estimator is a double robust estimator in the sense
that it will be consistent and asymptotically normal if either the coarsening
model or the model for the posited marginal distribution of Z is correctly
specified. This double-robustness property holds regardless of whether

Aimp(X, β0, ψ∗, ξ∗){Y − µ(X, β0)} = BeffF(Z)

or not.
Finally, if both models are correctly specified, and if

Aimp(X, β0, ψ0, ξ0){Y − µ(X, β0)} = BeffF(Z),

then the resulting estimator will be semiparametric efficient. Thus, this


methodology, assuming the numerical complexities could be overcome, would
lead to locally efficient observed-data estimators for β.

11.3 Concluding Thoughts


In the last two chapters, we have outlined methods for obtaining increasingly
efficient estimators while trying to keep them as robust as possible. The key
was always to use AIPWCC estimators. By no means do we want to give
the impression that these methods are easily implemented. Deriving param-
eter estimates for ξ in a simpler posited parametric model for the marginal
distribution of Z, which is used to obtain adaptive estimators, may require
maximizing a coarsened-data likelihood. Such maximization algorithms may
be complicated and may need specialized software. Even if these estimators
are obtained, finding projections and deriving linear operators (such as M(·) or
M−1(·)) may require evaluating complicated integrals. Therefore, although the theory
for improved adaptive estimation has been laid out, the actual implementation
needs to be considered on a case-by-case basis.
The use of this inverse weighted methodology can be thought of as a bal-
ance between simplicity of implementation and relative efficiency. The sim-
plest estimator is the inverse probability weighted complete-case estimator
based on some prespecified full-data estimating function. Since this estimator
only uses complete cases, it may be throwing away a great deal of information
from data that are coarsened. Depending on how much of the data are coars-
ened, this estimator may be inadequate. Also, the consistency of the simpler
IPWCC estimator depends on correctly modeling the coarsening probabilities.
Improving the performance of the estimator by augmentation while using
the same full-data estimating function is the next step. To implement these
methods, one needs to develop simpler and possibly incorrect models for the
marginal distribution of Z to be used as part of an adaptive approach. This can
improve the efficiency considerably but at the cost of increased computations
and model building. The attempt to gain efficiency also gives you the extra
protection of double robustness in that the resulting AIPWCC estimator will
be consistent if either the model for the coarsening probabilities or the posited
model for the marginal distribution of Z is correctly specified.
Finally, the attempt to adaptively obtain the locally efficient estimator is
the most complex numerically. Here we actually attempt to find the optimal

full-data estimating function, BeffF(Z) ∈ ΛF⊥, as well as the optimal augmen-
tation. It is not clear whether the efficiency gains of such an estimator would
make such a complicated procedure attractive for practical use.
Because of the complexity of these methods, we offer in the next chapter
some simpler methods for gaining efficiency that are easier to implement.
These methods will not generally result in locally efficient estimators, but
they are, however, more feasible.

11.4 Recap and Review of Notation


• The observed-data efficient score, which can be used to derive adaptive
AIPWCC estimators that are locally efficient, is given by

Seff{C, GC(Z)} = J{BeffF(Z)},

where

J{hF(Z)} = I(C = ∞)hF(Z) / π(∞, Z) − Π[ I(C = ∞)hF(Z) / π(∞, Z) | Λ2 ],

BeffF(Z) is the unique element in ΛF⊥ that satisfies

Π[ M−1{BeffF(Z)} | ΛF⊥ ] = SeffF(Z),

M−1(·) is the inverse of the linear operator

M{hF(Z)} = E[ E{hF(Z) | C, GC(Z)} | Z ] = Σ_r π{r, Gr(Z)} E{hF(Z) | Gr(Z)},

and SeffF(Z) is the full-data efficient score vector.
• We can derive BeffF(Z) by solving the equation

(I − Q)M−1{BeffF(Z)} = SeffF(Z),

where (I − Q)(·) is a linear operator, with Q(·) defined as

Q{hF(Z)} = Π[ (I − M){hF(Z)} | ΛF ].

(I − Q)(·) is a contraction mapping and hence has a unique inverse. Therefore,
BeffF(Z) = M{DeffF(Z)}, where DeffF(Z) = (I − Q)−1{SeffF(Z)}. The solution
DeffF(Z) can be obtained by successive approximation,

D(i+1)(Z) = Π[ (I − M)D(i)(Z) | ΛF ] + SeffF(Z),        (11.48)

and
D(i)(Z) → DeffF(Z) as i → ∞.
11.5 Exercises for Chapter 11 293

If we define

B(i)(Z) = Π[ M{D(i)(Z)} | ΛF⊥ ],

where, by construction, B(i)(Z) ∈ ΛF⊥, then

B(i)(Z) → BeffF(Z) as i → ∞.

(A small numerical sketch of this successive-approximation scheme is given below.)
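The following is a minimal numerical sketch of the successive-approximation idea, under the simplifying and purely illustrative assumption that the relevant function spaces have been discretized, so that a function hF(Z) is represented by a vector, the operator M by a matrix `M_op`, and the projections onto ΛF and ΛF⊥ by matrices `proj_LambdaF` and `proj_LambdaF_perp`. These matrices are hypothetical stand-ins; constructing them for a real problem is exactly the hard part discussed in the text.

```python
import numpy as np

def successive_approximation(S_eff, M_op, proj_LambdaF, proj_LambdaF_perp,
                             tol=1e-10, max_iter=500):
    """Iterate D(i+1) = Proj[(I - M) D(i) | Lambda^F] + S_eff  (cf. (11.48)),
    then return B = Proj[M D | Lambda^{F,perp}], a finite-dimensional analogue
    of B_eff^F(Z).  All operators are user-supplied matrices."""
    n = len(S_eff)
    I = np.eye(n)
    D = S_eff.copy()                       # starting value D(0) = S_eff
    for _ in range(max_iter):
        D_new = proj_LambdaF @ (I - M_op) @ D + S_eff
        if np.linalg.norm(D_new - D) < tol:
            D = D_new
            break
        D = D_new
    return proj_LambdaF_perp @ (M_op @ D)
```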

11.5 Exercises for Chapter 11


1. Recall that, with two levels of missingness, the data Z = (Z1T , Z2T )T ,
where Z1 is always observed and Z2 may be missing. We denote by R the
complete-case indicator and assume P (R = 1|Z) = P (R = 1|Z1 ) = π(Z1 )
(i.e., MAR). In Theorem 11.2, we derived the inverse operator M−1 when
coarsening is monotone. You should derive an explicit expression for M−1
when there are two levels of missingness.
(Note: Two levels of missingness can be viewed as a special case of mono-
tone coarsening.)
2. In Section 11.2, we outlined the steps necessary to obtain a locally efficient
estimator for β in a restricted moment model,

E(Y |X) = µ(X, β),

when the data are monotonically coarsened. Similarly, outline the steps
necessary to obtain a locally efficient estimator for β if there are two levels
of missingness.
12
Approximate Methods for Gaining Efficiency

12.1 Restricted Class of AIPWCC Estimators


In Chapters 10 and 11, we described various methods to increase the efficiency
of AIPWCC estimators. Although feasible in some situations, these methods
are often computationally very challenging. For example, to use these meth-
ods, one needs to posit simpler models for the marginal distribution of the
full-data Z in terms of a parameter ξ (i.e., p∗Z (z, ξ)) and then estimate the
parameter. This in itself can be a challenging numerical problem. But, even if
estimators for ξ can be derived, it is necessary to compute a series of condi-
tional expectations that might involve complicated integrals that are difficult
to compute numerically. Moreover, as discussed in Chapter 11, an attempt to
derive locally efficient estimators could result in complicated integral equa-
tions that are difficult to solve. Therefore, in this chapter, we explore other
methods that are numerically easier to implement but will result in gains in
efficiency.
We remind the reader that all semiparametric observed-data RAL estimators
for β, when the coarsening is CAR, are asymptotically equivalent to an
AIPWCC estimator, which is the solution to

Σ_{i=1}^n [ I(Ci = ∞)m(Zi, β) / π(∞, Zi, ψ̂n) + L2{Ci, GCi(Zi), ψ̂n} ] = 0,        (12.1)

where the estimating function m(Z, β) is chosen so that m(Z, β0) = ϕ∗F(Z) ∈
ΛF⊥ ⊂ HF and L2(·) ∈ Λ2 ⊂ H, and ψ̂n denotes the MLE for ψ obtained by
maximizing

∏_{i=1}^n π{Ci, GCi(Zi), ψ}.

Rather than searching for the optimal AIPWCC estimator, which involves
finding the optimal L2eff(·) ∈ Λ2 and the optimal BeffF(Z) ∈ ΛF⊥, we instead
restrict the search to linear subspaces of Λ2 ⊂ H and ΛF ⊥ ⊂ HF . That is,

we will only consider AIPWCC estimators that are the solution to (12.1) for
m(Z, β0 ) = ϕ∗F (Z) ∈ G F and L2 (·) ∈ G2 , where G F is a q-replicating linear
subspace of ΛF ⊥ and G2 is a q-replicating linear subspace of Λ2 , where a
q-replicating linear space is defined by Definition 6 of Chapter 3.

Remark 1. We remind the reader that a full-data Hilbert space HF consists of


mean-zero finite-variance q-dimensional functions of Z and the observed-data
Hilbert space H consists of mean-zero finite-variance q-dimensional functions
of {C, GC (Z)}. In Definition 6 of Chapter 3, we noted that a q-replicating lin-
ear subspace can be written as {U (1) }q , where U (1) is a linear subspace con-
tained in the Hilbert space H(1) of one-dimensional mean-zero finite-variance
functions and where h(·) = {h1 (·), . . . , hq (·)}T is defined to be an element of
{U (1) }q if and only if each element hj (·) ∈ U (1) , j = 1, . . . , q. Linear sub-
spaces, such as ΛF ⊥ ⊂ HF and Λ2 ⊂ H, are examples of q-replicating spaces.
The importance of defining q-replicating linear spaces is given by Theorem 3.3,
which allows a generalization of the Pythagorean theorem to q dimensions.
One of the consequences of Theorem 3.3 is that if an element h is orthogonal
to a q-replicating linear space, say {U (1) }q , then not only is E(hT u) = 0 for
all u ∈ {U (1) }q (the definition of orthogonality) but also

E(huT ) = 0q×q for all u ∈ {U (1) }q . 


 (12.2)

We will consider two specific classes of restricted estimators.


1. For the first class of restricted estimators, we will take both G F ⊂ ΛF ⊥
and G2 ⊂ Λ2 to be finite-dimensional linear subspaces.
2. For the second class of restricted estimators, we will take G F ⊂ ΛF ⊥
to be a finite-dimensional linear subspace contained in the orthogonal
complement of the full-data nuisance tangent space but will let G2 = Λ2
be the entire augmentation space.
We note that a finite-dimensional linear subspace G F of ΛF⊥ can be defined by
choosing a t1-dimensional function of Z, say J F(Z) = {J1F(Z), . . . , Jt1F(Z)}T,
where JjF(Z) ∈ ΛF⊥(1), j = 1, . . . , t1, and letting the space spanned by J F(Z) be

G F = { Aq×t1 J F(Z) for all constant matrices Aq×t1 }.

Similarly, for (class 1) restricted estimators, a finite-dimensional linear
subspace G2 of Λ2 can be defined by choosing a t2-dimensional function of
{C, GC(Z)}, say J2{C, GC(Z)} = [J21{C, GC(Z)}, . . . , J2t2{C, GC(Z)}]T, where
J2j{C, GC(Z)} ∈ Λ2(1), j = 1, . . . , t2, and letting the space spanned by
J2{C, GC(Z)} be

G2 = { Aq×t2 J2{C, GC(Z)} for all constant matrices Aq×t2 }.

We will always assume that the t1 elements of J F (Z) and the t2 elements
of J2 {C, GC (Z)} are linearly independent; that is, cT J F (Z) = 0 implies that
c = 0, where c is a t1 vector of constants (similarly for J2 {C, GC (Z)}). By con-
struction, the finite-dimensional linear spaces defined above are q-replicating
linear spaces.
Therefore, we will restrict attention to estimators for β that are the solution
to the estimating equation

Σ_{i=1}^n [ I(Ci = ∞)AF m∗(Zi, β) / π(∞, Zi, ψ̂n) + L2{Ci, GCi(Zi), ψ̂n} ] = 0,        (12.3)

where m∗(Z, β) is a t1-dimensional estimating function such that m∗(Z, β0) = J F(Z),
AF is an arbitrary q × t1 constant matrix, and L2(·) ∈ G2; either
L2{C, GC(Z)} = A2 J2{C, GC(Z)}, where A2 is an arbitrary q × t2 constant matrix
(class 1), or L2(·) ∈ Λ2 (class 2).
Remark 2. The observed-data estimating function (12.1) uses the full-data es-
timating function m(Z, β) to build estimators. Because the parameter β is
q-dimensional, the estimating function m(Z, β) is also q-dimensional, and,
at the minimum, the elements of m(Z, β) must be linearly independent.
When we consider restricted estimators that solve (12.3), the full-data esti-
mating function m(Z, β) is now equal to AF m∗(Z, β), where m∗(Z, β)
is t1 -dimensional. To ensure that the elements of m(Z, β), constructed in this
way, are linearly independent, the dimension t1 must be greater than or equal
to q. Moreover, since estimating equations, such as (12.3), can be defined up
to a proportional constant matrix (that is, multiplying the left-hand side of
(12.3) by a nonsingular q ×q matrix will not affect the resulting estimator), we
must choose the dimension t1 to be strictly greater than q so that the strategy
of choosing from this class of restricted estimators has an effect on the result-
ing estimator. Therefore, from here on, we will always assume that m∗ (Z, β),
chosen so that m∗ (Z, β0 ) = J F (Z), is made up of t1 linearly independent
elements with t1 > q.  
Let us define the q-replicating linear space Ξ ⊂ H to be

Ξ = I(C = ∞)G F / π(∞, Z) ⊕ G2;        (12.4)

that is, Ξ consists of the elements

I(C = ∞)AF m∗(Z, β0) / π(∞, Z) + L2{C, GC(Z)}        (12.5)

in H for any constant matrix AF and L2 ∈ G2 . We remind the reader that


the orthogonal complement of the nuisance tangent space associated with the
full-data nuisance parameter η is denoted by Λη⊥, which, by Theorem 7.2, is
equal to

Λη⊥ = I(C = ∞)ΛF⊥ / π(∞, Z) ⊕ Λ2,

and hence

Ξ ⊂ Λη⊥.        (12.6)
The elements in the linear subspace Ξ are associated with estimating func-
tions that will lead to the restricted class of estimating equations (12.3) that
we are considering. However, in Theorem 9.1, we showed that substituting the
MLE ψ̂n for the parameter ψ in an AIPWCC estimating equation resulted in
an influence function that subtracts off a projection onto the space Λψ , where
Λψ denotes the linear subspace spanned by the score vector
Sψ{C, GC(Z)} = ∂ log π{C, GC(Z), ψ0} / ∂ψ.
We remind the reader that when we introduce a coarsening model with
parameter ψ, the nuisance tangent space is given by Λ = Λη ⊕ Λψ . Since
estimating functions need to be associated with elements in Λ⊥ , we should
only consider elements of Ξ that are also orthogonal to Λψ ; i.e., Π[Ξ|Λ⊥ ψ ].
Consequently, it will prove desirable that the space G2 ⊂ Λ2 also contain Λψ .
For the restricted (class 2) estimators where G2 = Λ2 , this is automatically
true, however, for the restricted (class 1) estimators, we will always include
the elements Sψ {C, GC (Z)} that span Λψ as part of the vector J2 {C, GC (Z)}
that spans G2 to ensure that Λψ ⊂ G2 .
Because the variance of an RAL estimator for β is the variance of its
influence function, when we consider finding the optimal estimator within the
restricted class of estimators (12.3), then we are looking for the estimator
whose influence function has the smallest variance matrix. Recall that an
influence function in addition to being orthogonal to the nuisance tangent
space must also satisfy the property that

E[ ϕ{C, GC(Z)}SβT{C, GC(Z)} ] = E[ Sβ{C, GC(Z)}ϕT{C, GC(Z)} ] = I q×q,        (12.7)

where Sβ {C, GC (Z)} is the observed-data score vector with respect to β, which
also equals the conditional expectation of the full-data score vector with re-
spect to β given the observed data (i.e., Sβ {C, GC (Z)} = E{SβF (Z)|C, GC (Z)}),
and ϕ{C, GC (Z)} denotes an influence function. One can always normalize any
element ϕ∗{C, GC(Z)} ∈ Ξ to ensure that it satisfies (12.7) by choosing

ϕ{C, GC(Z)} = ( E[ ϕ∗{C, GC(Z)}SβT{C, GC(Z)} ] )−1 ϕ∗{C, GC(Z)}.        (12.8)

Also, because Ξ is a q-replicating linear subspace, the element ϕ{C, GC (Z)} ∈


Ξ.
Let us define the subset of elements ϕ{C, GC (Z)} ∈ Ξ that satisfy the
property (12.7) or, equivalently, the subset of elements defined by (12.8), as

IF (Ξ). In Theorem 12.1 below, we derive the optimal influence function within
IF (Ξ); i.e., the element within IF (Ξ) that has the smallest variance matrix.

Remark 3. If the coarsening probabilities are modeled using a parameter ψ


that needs to be estimated, then influence functions of RAL estimators for β
must be orthogonal to Λψ . However, as we will show shortly, as long as Λψ ⊂
G2 , then the optimal influence function within IF (Ξ) will also be orthogonal
to Λψ , as is desired. 


Theorem 12.1. Among the elements ϕ{C, GC (Z)} ∈ IF (Ξ), the one with
the smallest variance matrix is given by the normalized version of the pro-
jection of Sβ {C, GC (Z)} onto Ξ. Specifically, if we define ϕ∗opt {C, GC (Z)} =
Π[Sβ {C, GC (Z)}|Ξ], then the element in IF (Ξ) with the smallest variance
matrix is given by

ϕopt{C, GC(Z)} = ( E[ ϕ∗opt{C, GC(Z)}SβT{C, GC(Z)} ] )−1 ϕ∗opt{C, GC(Z)}.        (12.9)

Proof. Because Ξ is a closed linear subspace, then, by the projection the-


orem for Hilbert spaces, there exists a unique projection ϕ∗opt {C, GC (Z)} =
 
Π[Sβ {C, GC (Z)}|Ξ] such that the residual Sβ {C, GC (Z)}−ϕ∗opt {C, GC (Z)} is
orthogonal to every element in Ξ. Consider any element ϕ{C, GC (Z)} ∈ IF (Ξ).
Because IF (Ξ) ⊂ Ξ and, since both ϕ{C, GC (Z)} and ϕopt {C, GC (Z)} belong
to IF (Ξ), this implies that ϕ(·)−ϕopt (·) ∈ Ξ. Also, because Ξ is a q-replicating
linear space (see Remark 1), by Theorem 3.3 we obtain that

E[{Sβ (·) − ϕ∗opt (·)}{ϕ(·) − ϕopt (·)}T ] = 0q×q .

Because elements of IF (Ξ) must satisfy (12.7), this implies that

E{Sβ (·)ϕT (·)} = E{Sβ (·)ϕTopt (·)} = I q×q ,

and hence
E[ ϕ∗opt(·){ϕ(·) − ϕopt(·)}T ] = 0.

Premultiplying by the constant matrix ( E{ϕ∗opt(·)SβT(·)} )−1, we obtain

E[ϕopt (·){ϕ(·) − ϕopt (·)}T ] = 0.

Consequently,

E{ϕ(·)ϕT (·)} = E[{ϕopt (·) + ϕ(·) − ϕopt (·)}{ϕopt (·) + ϕ(·) − ϕopt (·)}T ]
= E{ϕopt (·)ϕTopt (·)} + E[{ϕ(·) − ϕopt (·)}{ϕ(·) − ϕopt (·)}T ] + 0.

Since E[{ϕ(·) − ϕopt (·)}{ϕ(·) − ϕopt (·)}T ] is a nonnegative definite matrix,


this implies that
E{ϕ(·)ϕT (·)} ≥ E{ϕopt (·)ϕTopt (·)},
giving us the desired result. 


Corollary 1. The optimal element ϕopt {C, GC (Z)} ∈ IF (Ξ), derived in The-
orem 12.1, is orthogonal to Λψ and is an element of the space of observed-data
influence functions (IF ).

Proof. By construction, Sβ {C, GC (Z)} − ϕ∗opt {C, GC (Z)} is orthogonal to Ξ,


where Ξ is defined by (12.4). Hence, it must be orthogonal to G2 . Also, by
construction, Λψ ⊂ G2 . This implies that Sβ {C, GC (Z)} − ϕ∗opt {C, GC (Z)} is
orthogonal to Λψ . In other words,

E{(Sβ − ϕ∗opt )T h} = 0 for all h ∈ Λψ .

But since Sβ {C, GC (Z)} is orthogonal to Λψ (see equation (11.7)), this implies
that ϕ∗opt {C, GC (Z)} must be orthogonal to Λψ . Therefore ϕopt {C, GC (Z)},
defined by (12.9), is also orthogonal to Λψ . Hence,

ϕopt{C, GC(Z)} ∈ Π[ Ξ | Λψ⊥ ] ⊂ Π[ Λη⊥ | Λψ⊥ ] = Λ⊥,

where the last two relationships follow from (12.6) and (8.16), respectively.
Therefore, ϕopt {C, GC (Z)} is an element orthogonal to the observed-data nui-
sance tangent space and by construction (see (12.9)) satisfies (12.7). Hence,
ϕopt {C, GC (Z)} is an element of the space of observed-data influence functions
(IF ).  

12.2 Optimal Restricted (Class 1) Estimators


We first consider how we can use the result from Theorem 12.1 to obtain
improved estimators by finding the optimal estimator within the restricted
class of estimators (12.3) where L2{C, GC(Z)} = A2 J2{C, GC(Z)} with A2 a
q × t2 constant matrix; i.e., the
so-called (class 1) estimators. For this class of estimators, we will assume that
the model for the coarsening probabilities has been correctly specified.
Examining the elements in (12.5), we note that Ξ ⊂ H is a finite-
dimensional linear subspace that is spanned by the (t1 +t2 ) vector of functions
of the observed data, namely

[ { I(C = ∞)m∗(Z, β0) / π(∞, Z) }T , J2T(·) ]T.

Therefore, finding the projection onto this linear subspace of H is exactly the
same as Example 2 of Chapter 2, which was solved by using equation (2.2).
Applying this result to our problem, ϕ∗opt (·) of Theorem 12.1 is obtained by

finding the constant matrices (AFopt)q×t1 and (A2opt)q×t2 that solve the linear
equation

E[ ( Sβ(·) − [AFopt, A2opt] [ { I(C = ∞)m∗(Z, β0) / π(∞, Z) }T , J2T(·) ]T )
   × ( I(C = ∞)m∗T(Z, β0) / π(∞, Z) , J2T(·) ) ] = 0q×(t1+t2).
                                                                         (12.10)

Before deriving the solution to equation (12.10), we first give some results
through a series of lemmas that will simplify the equation.
Lemma 12.1.

E[ Sβ{C, GC(Z)} I(C = ∞)m∗T(Z, β0)AFT / π(∞, Z) ]
   = E[ SβF(Z)m∗T(Z, β0)AFT ]        (12.11)
   = −( AF E{∂m∗(Z, β0)/∂βT} )T.        (12.12)
Proof. We prove (12.11) using a series of iterated conditional expectations,
where

E[ Sβ{C, GC(Z)} I(C = ∞)m∗T(Z, β0)AFT / π(∞, Z) ]
   = E[ E{SβF(Z)|C, GC(Z)} I(C = ∞)m∗T(Z, β0)AFT / π(∞, Z) ]
   = E( E[ SβF(Z) I(C = ∞)m∗T(Z, β0)AFT / π(∞, Z) | C, GC(Z) ] )
   = E[ SβF(Z) I(C = ∞)m∗T(Z, β0)AFT / π(∞, Z) ]
   = E( E[ SβF(Z) I(C = ∞)m∗T(Z, β0)AFT / π(∞, Z) | Z ] )
   = E( SβF(Z) E[ I(C = ∞)m∗T(Z, β0)AFT / π(∞, Z) | Z ] )
   = E[ SβF(Z)m∗T(Z, β0)AFT ].

Equation (12.12) follows from the usual expansion for m-estimators, where
the estimator that solves the equation

Σ_{i=1}^n AF m∗(Zi, β) = 0

has influence function

ϕF(Z) = −( AF E{∂m∗(Z, β0)/∂βT} )−1 AF m∗(Z, β0).

Because E{ϕF(Z)SβFT(Z)} = I q×q, this implies that

E{ AF m∗(Z, β0)SβFT(Z) } = −AF E{∂m∗(Z, β0)/∂βT}.

The result in (12.12) now follows after taking the transpose of both sides of
the equation above. □

Lemma 12.2.

E[ Sβ{C, GC(Z)}J2T{C, GC(Z)} ] = 0q×t2.        (12.13)

Proof. Using a series of iterated conditional expectations similar to the proof
of Lemma 12.1, we obtain that

E[ Sβ{C, GC(Z)}J2T{C, GC(Z)} ] = E[ SβF(Z) E{ J2T{C, GC(Z)} | Z } ].

Because each of the elements of J2{C, GC(Z)} is an element of Λ2, this implies
that

E{ J2{C, GC(Z)} | Z } = 0,

thus giving us the desired result. □




Lemma 12.3.

E[ I(C = ∞)m∗(Z, β0)m∗T(Z, β0) / π2(∞, Z) ] = E[ m∗(Z, β0)m∗T(Z, β0) / π(∞, Z) ].
                                                                         (12.14)

Proof. This follows easily by an iterated conditional expectation argument
where we first compute the conditional expectation given Z. □

where we first compute the conditional expectation given Z.  

Using the results from Lemmas 12.1–12.3, we obtain the solution to (12.10) as

[AFopt , A2opt] ( U11   U12
                 U12T  U22 ) = [H1 , H2],        (12.15)

where

U11 = E[ m∗(Z, β0)m∗T(Z, β0) / π(∞, Z) ]   (t1 × t1),
U12 = E[ I(C = ∞)m∗(Z, β0)J2T{C, GC(Z)} / π(∞, Z) ]   (t1 × t2),
U22 = E[ J2{C, GC(Z)}J2T{C, GC(Z)} ]   (t2 × t2),
H1 = −( E[ ∂m∗(Z, β0)/∂βT ] )T   (q × t1),
H2 = 0q×t2.        (12.16)

Therefore, solving (12.15) yields

[AFopt , A2opt] = [H1 , 0] ( U11   U12
                             U12T  U22 )−1
               = [H1 , 0] ( U 11   U 12
                            U 12T  U 22 )
               = [H1 U 11 , H1 U 12].

Using standard results for the inverse of partitioned symmetric matrices
(see, for example, Rao, 1973, p. 33), we obtain

AFopt = H1 U 11        (12.17)

and

A2opt = H1 U 12,        (12.18)

where

U 11 = ( U11 − U12 U22−1 U12T )−1        (12.19)

and

U 12 = −U 11 U12 U22−1.        (12.20)
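As a quick numerical illustration of (12.17)–(12.20), the sketch below computes U 11, U 12, and the optimal coefficient matrices from given U11, U12, U22, and H1 using numpy. It is only a helper for the algebra above; the matrices themselves would have to be estimated from the data, as described shortly.

```python
import numpy as np

def optimal_coefficients(H1, U11, U12, U22):
    """Return (A_F_opt, A_2_opt) from (12.17)-(12.20).

    H1 : (q, t1), U11 : (t1, t1), U12 : (t1, t2), U22 : (t2, t2)."""
    U22_inv = np.linalg.inv(U22)
    U_11 = np.linalg.inv(U11 - U12 @ U22_inv @ U12.T)   # U^{11}, eq. (12.19)
    U_12 = -U_11 @ U12 @ U22_inv                         # U^{12}, eq. (12.20)
    A_F_opt = H1 @ U_11                                  # eq. (12.17)
    A_2_opt = H1 @ U_12                                  # eq. (12.18)
    return A_F_opt, A_2_opt
```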
Thus we have shown that the optimal influence function ϕopt {C, GC (Z)}
in IF (Ξ), given by Theorem 12.1, is obtained by choosing ϕ∗ {C, GC (Z)} ∈ Ξ
to be

ϕ∗opt{C, GC(Z)} = I(C = ∞)AFopt m∗(Z, β0) / π(∞, Z) + A2opt J2{C, GC(Z)},        (12.21)

where AFopt and A2opt are defined by (12.17) and (12.18), respectively.
We also note the following interesting relationship.

Lemma 12.4. The projection of I(C = ∞)AFopt m∗(Z, β0) / π(∞, Z) onto the space G2
(i.e., the space spanned by J2{C, GC(Z)}) is equal to

Π[ I(C = ∞)AFopt m∗(Z, β0) / π(∞, Z) | G2 ] = −A2opt J2{C, GC(Z)},        (12.22)

which implies that

ϕ∗opt{C, GC(Z)} = I(C = ∞)AFopt m∗(Z, β0) / π(∞, Z) + A2opt J2{C, GC(Z)}

is orthogonal to G2; that is,

ϕ∗opt{C, GC(Z)} ⊥ G2.        (12.23)

Proof. Because G2 is a linear subspace spanned by J2{C, GC(Z)}, the projection
(12.22) is given by

E[ I(C = ∞)AFopt m∗(Z, β0)J2T{C, GC(Z)} / π(∞, Z) ] ( E[ J2{C, GC(Z)}J2T{C, GC(Z)} ] )−1 J2{C, GC(Z)}

   = AFopt U12 U22−1 J2{C, GC(Z)} = H1 U 11 U12 U22−1 J2{C, GC(Z)}

   = −H1 U 12 J2{C, GC(Z)} = −A2opt J2{C, GC(Z)}. □




The optimal constant matrices AFopt and A2opt involve the quantities

H1 , U11 , U12 , U22 ,

which are all matrices whose elements are expectations. For practical appli-
cations, these must be estimated from the data. We propose the following
empirical averages:

Ĥ1(β) = −n−1 Σ_{i=1}^n [ I(Ci = ∞) / π(∞, Zi, ψ̂n) ] { ∂m∗(Zi, β)/∂βT }T,        (12.24)

Û11(β) = n−1 Σ_{i=1}^n [ I(Ci = ∞) / π2(∞, Zi, ψ̂n) ] m∗(Zi, β)m∗T(Zi, β),        (12.25)

Û12(β) = n−1 Σ_{i=1}^n [ I(Ci = ∞) / π(∞, Zi, ψ̂n) ] m∗(Zi, β)J2T{Ci, GCi(Zi), ψ̂n},        (12.26)

and


Û22 = n−1 Σ_{i=1}^n J2{Ci, GCi(Zi), ψ̂n}J2T{Ci, GCi(Zi), ψ̂n}.        (12.27)

Consequently, we would estimate

[ÂFopt(β), Â2opt(β)] = [Ĥ1(β)Û 11(β), Ĥ1(β)Û 12(β)],        (12.28)

where Û 11 (β) and Û 12 (β) are obtained by substituting the empirical estimates
for U11 , U12 , and U22 into equations (12.19) and (12.20).
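The sketch below shows one way these empirical averages might be formed from a sample, assuming (purely for illustration) that `m_star(Z, beta)` and `dm_star(Z, beta)` return m∗ and the Jacobian ∂m∗/∂βT for one subject, that `J2_i(i)` returns J2{Ci, GCi(Zi), ψ̂n}, and that `pi_hat(Z)` returns π(∞, Z, ψ̂n); these names are ours, not the book's.

```python
import numpy as np

def empirical_U_matrices(data, complete, beta, m_star, dm_star, J2_i, pi_hat):
    """Sample analogues of H1, U11, U12, U22 from (12.24)-(12.27).

    m_star(Z, beta) -> (t1,) vector;  dm_star(Z, beta) -> (t1, q) Jacobian dm*/dbeta^T
    J2_i(i)         -> (t2,) vector J2{C_i, G_Ci(Z_i), psi_hat}
    pi_hat(Z)       -> estimated probability of a complete case
    complete        -> boolean array, complete[i] = I(C_i = infinity)"""
    n = len(data)
    H1, U11, U12, U22 = 0.0, 0.0, 0.0, 0.0
    for i, Z in enumerate(data):
        J2 = J2_i(i)
        U22 = U22 + np.outer(J2, J2)                      # (12.27): all subjects
        if complete[i]:
            w = 1.0 / pi_hat(Z)
            m = m_star(Z, beta)
            H1 = H1 - w * dm_star(Z, beta).T              # (12.24): note the transpose
            U11 = U11 + (w ** 2) * np.outer(m, m)         # (12.25)
            U12 = U12 + w * np.outer(m, J2)               # (12.26)
    return H1 / n, U11 / n, U12 / n, U22 / n
```

These estimates can then be passed to the `optimal_coefficients` sketch given after (12.20) to form ÂFopt(β) and Â2opt(β) as in (12.28).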

Deriving the Optimal Restricted (Class 1) AIPWCC Estimator

Using a sample of observed data {Ci , GCi (Zi )}, i = 1, . . . , n, we propose esti-
mating β by solving the equation
Σ_{i=1}^n [ I(Ci = ∞)ÂFopt(β)m∗(Zi, β) / π(∞, Zi, ψ̂n) + Â2opt(β)J2{Ci, GCi(Zi), ψ̂n} ] = 0.
                                                                         (12.29)
We denote this estimator as β̂nopt and now prove the fundamental result for
restricted optimal estimators.
Theorem 12.2. Among the restricted class of AIPWCC estimators (12.3),
the optimal estimator (i.e., the estimator with the smallest asymptotic vari-
ance matrix) is given by β̂nopt , the solution to (12.29).
Before sketching out the proof of Theorem 12.2, we give another equivalent
representation for the class of influence functions IF (Ξ) defined by (12.8) that
will be useful.

Lemma 12.5. The class of influence functions IF(Ξ) can also be defined by

ϕ{C, GC(Z)} = −( AF E{∂m∗(Z, β0)/∂βT} )−1 ϕ∗{C, GC(Z)},        (12.30)

where ϕ∗ {C, GC (Z)} is an element of Ξ defined by (12.5).

Proof. Using (12.12) of Lemma 12.1 and Lemma 12.2, we can show that

E[ ϕ∗{C, GC(Z)}SβT{C, GC(Z)} ] = −AF E{∂m∗(Z, β0)/∂βT}.        (12.31)

The lemma now follows by substituting the right-hand side of (12.31) into
(12.8). 


Proof of Theorem 12.2.
Since the asymptotic variance of an RAL estimator is the variance of its influ-
ence function, it suffices to consider the influence functions of the competing

estimators. In the same manner that we found the influence function of the
estimator in (9.7) of Theorem 9.1, we can show that the influence function of
the estimator for β that solves (12.3) is given by

ϕ{C, GC(Z)} = −( AF E{∂m∗(Z, β0)/∂βT} )−1
   × [ I(C = ∞)AF m∗(Z, β0) / π(∞, Z) + A2 J2{C, GC(Z)} − Π[ · | Λψ ] ].        (12.32)
Because we constructed the space G2 so that Λψ ⊂ G2 , this implies that
Π[[·]|Λψ ] ∈ G2 , which in turn implies that (12.32) is an element of Ξ. Therefore,
as a consequence of Lemma 12.5, the influence function ϕ{C, GC (Z)} defined
above is an element of IF (Ξ). By Theorem 12.1, we know that
var[ϕopt{C, GC(Z)}] ≤ var[ϕ{C, GC(Z)}].
Hence, if we can show that the influence function of the estimator (12.29) is
equal to ϕopt {C, GC (Z)}, then we would complete the proof of the theorem.
An expansion of the estimating equation in (12.29) about β = β0 , keeping
ψ̂n fixed, yields

n1/2(β̂nopt − β0) = −( AFopt E{∂m∗(Z, β0)/∂βT} )−1
   × n−1/2 Σ_{i=1}^n [ I(Ci = ∞)ÂFopt(β0)m∗(Zi, β0) / π(∞, Zi, ψ̂n)
       + Â2opt(β0)J2{Ci, GCi(Zi), ψ̂n} ] + op(1).        (12.33)

We now show that estimating AF opt and A2opt only has a negligible effect on
the asymptotic properties of the estimator by noting that (12.33) equals
n−1/2 Σ_{i=1}^n [ I(Ci = ∞)AFopt m∗(Zi, β0) / π(∞, Zi, ψ̂n) + A2opt J2{Ci, GCi(Zi), ψ̂n} ]        (12.34)

   + n1/2{ÂFopt(β0) − AFopt} n−1 Σ_{i=1}^n I(Ci = ∞)m∗(Zi, β0) / π(∞, Zi, ψ̂n)        (12.35)

   + n1/2{Â2opt(β0) − A2opt} n−1 Σ_{i=1}^n J2{Ci, GCi(Zi), ψ̂n}.        (12.36)

Under mild regularity conditions, n1/2{ÂFopt(β0) − AFopt} and n1/2{Â2opt(β0) −
A2opt} will be bounded in probability,

n−1 Σ_{i=1}^n I(Ci = ∞)m∗(Zi, β0) / π(∞, Zi, ψ̂n)  →P  E[ I(C = ∞)m∗(Z, β0) / π(∞, Z, ψ0) ]
   = E{m∗(Z, β0)} = 0,

and

n−1 Σ_{i=1}^n J2{Ci, GCi(Zi), ψ̂n}  →P  E[ J2{C, GC(Z), ψ0} ] = 0.

Hence (12.35) and (12.36) will converge in probability to zero. Using Theorem
9.1, we can expand ψ̂n about ψ0 in (12.34) to obtain that (12.34) equals

n−1/2 Σ_{i=1}^n ( [ I(Ci = ∞)AFopt m∗(Zi, β0) / π(∞, Zi, ψ0) + A2opt J2{Ci, GCi(Zi), ψ0} ]
   − Π[ I(Ci = ∞)AFopt m∗(Zi, β0) / π(∞, Zi, ψ0) + A2opt J2{Ci, GCi(Zi), ψ0} | Λψ ] )
   + op(1).        (12.37)

Combining all the results from (12.33) through (12.37), we obtain that the
influence function of β̂nopt, the solution to (12.29), is given by

−( AFopt E{∂m∗(Z, β0)/∂βT} )−1 [ ϕ∗opt{C, GC(Z)} − Π[ ϕ∗opt{C, GC(Z)} | Λψ ] ],        (12.38)
where ϕ∗opt {C, GC (Z)} is defined by (12.21). We proved that ϕ∗opt {C, GC (Z)}
is orthogonal to Λψ in Corollary 1. We also demonstrated in (12.23) of Lemma
12.4 that ϕ∗opt {C, GC (Z)} is orthogonal to G2 . Since Λψ ⊂ G2 , this, too, implies
that
Π[ϕ∗opt {C, GC (Z)}|Λψ ] = 0.
Therefore, the influence function (12.38) is equal to

ϕopt{C, GC(Z)} = −( AFopt E{∂m∗(Z, β0)/∂βT} )−1 ϕ∗opt{C, GC(Z)},        (12.39)

thus proving that the estimator that is the solution to (12.29) is the optimal
restricted estimator. 


Estimating the Asymptotic Variance

Using the matrix relationships (12.16) and (12.17), we note that the leading
term in (12.39) can be written as the symmetric matrix

−( AFopt E{∂m∗(Z, β0)/∂βT} )−1 = ( H1 U 11 H1T )−1.        (12.40)

After a little algebra, we can also show that the covariance matrix of
ϕ∗opt{C, GC(Z)} is equal to

E{ ϕ∗opt(·)ϕ∗optT(·) } = H1 U 11 H1T.        (12.41)

Because the asymptotic variance of an RAL estimator is equal to the variance


of its influence function, this means that the asymptotic variance of β̂nopt is the
variance of (12.39). Using (12.40) and (12.41), we obtain that the asymptotic
variance is equal to

( H1 U 11 H1T )−1 ( H1 U 11 H1T ) ( H1 U 11 H1T )−1 = ( H1 U 11 H1T )−1.

Consequently, a consistent estimator for the asymptotic variance of β̂nopt is
given by

( Ĥ1(β̂nopt)Û 11(β̂nopt)Ĥ1T(β̂nopt) )−1,        (12.42)

where Ĥ1 (β) and Û 11 (β) were defined by (12.24) and (12.28).
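In code, once Ĥ1 and Û 11 have been evaluated at β̂nopt (for instance with the helpers sketched above), the variance estimate (12.42) and approximate standard errors follow directly. The sketch below is illustrative only; dividing by n converts the asymptotic variance of n1/2(β̂ − β0) into an approximate variance for β̂ itself.

```python
import numpy as np

def asymptotic_variance(H1_hat, U_11_hat, n):
    """Estimate (12.42) and approximate standard errors for beta_hat.

    H1_hat   : (q, t1) estimate of H1 at beta_hat
    U_11_hat : (t1, t1) estimate of the block U^{11} from (12.19) at beta_hat
    n        : sample size"""
    avar = np.linalg.inv(H1_hat @ U_11_hat @ H1_hat.T)   # eq. (12.42)
    se = np.sqrt(np.diag(avar) / n)                      # standard errors for beta_hat
    return avar, se
```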

Remark 4. The method we proposed for estimating the parameter β using
restricted optimal (class 1) estimators requires only that the model for the
coarsening probabilities be correctly specified.
simplicity. We did not have to use adaptive methods, where a simpler model
p∗Z (z, ξ) had to be posited and an estimator for ξ had to be derived. Yet
the resulting estimator is guaranteed to have the smallest asymptotic vari-
ance within the class of estimators considered. It would certainly be more
efficient than the simple inverse probability weighted complete-case estima-
tor (IPWCC), which discards information from data that are not completely
observed. However, this estimator is not efficient. How close the variance of
such an estimator will be to the semiparametric efficiency bound will depend
on how close the optimal element Beff F
(Z) ∈ ΛF ⊥ is to the subspace G F and
I(C=∞)B F (Z)
how close the optimal element in Λ2 , Π[ (∞,Z) eff
|Λ2 ], is to G2 .
If the missingness were by design, then restricted optimal (class 1) esti-
mators would be guaranteed (subject to regularity conditions) to yield con-
sistent, asymptotically normal estimators for β. However, if the coarsening
probabilities are modeled and not correctly specified, then such estimators
will be biased. There is no double-robustness protection guaranteed for such
estimators. Therefore, in the next section, we consider what we refer to as
(class 2) estimators. These estimators, although more complicated, will re-
sult in double-robust estimators that avoid the necessity to solve complicated
integral equations. 

Before discussing restricted optimal (class 2) estimators, we first illustrate


how to derive a restricted optimal (class 1) estimator by considering a specific
example.

12.3 Example of an Optimal Restricted (Class 1) Estimator
We return to the example introduced in Section 7.4 where the goal is to
estimate the parameter β in the restricted moment model

E(Y |X) = µ(X, β).

In this example, we let Y be a univariate response variable and X =


(X (1) , X (2) )T be two univariate covariates. The second covariate X (2) is ex-
pensive to measure and therefore, by design, was only collected on a subsample
of the n individuals in the study, whereas X (1) was collected on all n individ-
uals. The subsample of individuals for which X (2) was collected was chosen at
random with a prespecified probability that depended on Y and X (1) . This is
an example of two levels of missingness, where we denote the complete-case
indicator for the i-th individual as Ri (unscripted); that is, if Ri = 1, then we
observe (Yi, Xi(1), Xi(2)), whereas if Ri = 0, then we observe (Yi, Xi(1)). The
probability of a complete case is denoted by
P(Ri = 1|Yi, Xi) = P(Ri = 1|Yi, Xi(1)) = π(Yi, Xi(1)),

where π(Yi, Xi(1)) (unscripted) is a known function of Yi and Xi(1). Since the
coarsening of the data was by design, the corresponding coarsening tangent
space Λψ = 0. Therefore, for this example, we need not require that the vector J2
that spans G2 contain the score vector Sψ.
To define the restricted class of estimators, we must first choose finite-
dimensional subsets G F ⊂ ΛF ⊥ ⊂ HF and G2 ⊂ Λ2 ⊂ H. We remind the
reader that the space ΛF⊥ consists of elements

{ h(X, β0)q×1 {Y − µ(X, β0)} for arbitrary q-dimensional functions h(·) of X }

and the space Λ2 consists of elements

{ f(Y, X (1))q×1 {R − π(Y, X (1))} for arbitrary q-dimensional functions f(·) of (Y, X (1)) }.

Recall that X, by itself, refers to (X (1) , X (2) ).


To define G F, we must choose a t1-dimensional function of the full data
(Y, X), say J F(Y, X, β) = {J1F(Y, X, β), . . . , Jt1F(Y, X, β)}T, that spans G F,
where the elements JjF(Y, X, β0) ∈ ΛF⊥(1), j = 1, . . . , t1, and where ΛF⊥(1)
denotes the linear subspace in HF(1) spanned by the first element of the q-
dimensional vector that makes up ΛF⊥; i.e., ΛF⊥ = {ΛF⊥(1)}q.

Suppose, for example, we were considering a log-linear model, where

µ(X, β) = exp(β1 + β2 X (1) + β3 X (2) ).

If we also believed the variance as a function of X was homoscedastic, then


the optimal full-data estimating function would be chosen to be

m(Y, X, β) = DT (X, β)V −1 (X){Y − exp(β1 + β2 X (1) + β3 X (2) )},

where D(X, β) = ∂µ(X, β)/∂β T . Therefore, for this example, we would choose

m(Y, X, β) = (1, X (1) , X (2) )T exp(β T X ∗ ){Y − exp(β T X ∗ )},

where X ∗ = (1, X (1) , X (2) )T and β = (β1 , β2 , β3 )T , as our optimal full-data


estimating function if the data were not coarsened. However, this choice for
m(Y, X, β) may not be optimal with coarsened data. Therefore, we consider
restricted optimal estimators, where we might choose

J F (Y, X) = f F (X) exp(β0T X ∗ ){Y − exp(β0T X ∗ )},


where f F(X) = (1, X (1), X (2), X (1)2, X (2)2, X (1)X (2))T. Such a set of basis
functions allows for a quadratic relationship in X (1) and X (2) . We also define
the t1 -dimensional vector of estimating functions as

m∗ (Y, X, β) = f F (X) exp(β T X ∗ ){Y − exp(β T X ∗ )}. (12.43)

To define G2 , we must choose a t2 -dimensional function of the observed


data (R, Y, X (1) , RX (2) ), say J2 (·) = {J21 (·), . . . , J2t2 (·)}T , that spans G2 ,
where the elements J2j(·) ∈ Λ2(1), j = 1, . . . , t2, and where Λ2(1) denotes the
linear subspace in H(1) spanned by the first element of the q-dimensional
vector that makes up Λ2 . For example, we might choose

J2 (·) = f2 (Y, X (1) ){R − π(Y, X (1) )}, (12.44)


where f2(Y, X (1)) = (1, Y, X (1), Y 2, X (1)2, Y X (1))T. Such a set of basis func-
tions allows for a quadratic relationship in Y and X (1) .
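To make these choices concrete, here is a small sketch of how one might code the basis vectors f F(X) and f2(Y, X (1)), the expanded estimating function m∗ of (12.43), and the augmentation basis (12.44). The function names are ours and purely illustrative, and π(y, x1) is treated as a known (by-design) probability supplied by the user.

```python
import numpy as np

def f_F(x1, x2):
    """Quadratic basis in (X(1), X(2)); t1 = 6."""
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

def f_2(y, x1):
    """Quadratic basis in (Y, X(1)); t2 = 6."""
    return np.array([1.0, y, x1, y**2, x1**2, y * x1])

def m_star(y, x1, x2, beta):
    """Expanded estimating function (12.43) for the log-linear model."""
    x_star = np.array([1.0, x1, x2])
    mu = np.exp(beta @ x_star)
    return f_F(x1, x2) * mu * (y - mu)

def J_2(r, y, x1, pi):
    """Augmentation basis (12.44); pi = pi(y, x1), the known design probability."""
    return f_2(y, x1) * (r - pi)
```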
Therefore, the restricted estimators that we consider are solutions to
estimating equations of the form (12.3), namely

Σ_{i=1}^n [ Ri AF f F(Xi) exp(βT Xi∗){Yi − exp(βT Xi∗)} / π(Yi, Xi(1))
   + {Ri − π(Yi, Xi(1))}A2 f2(Yi, Xi(1)) ] = 0,        (12.45)

for an arbitrary q × t1 constant matrix AF and an arbitrary q × t2 constant


matrix A2 . For this illustration, q = 3 and t1 = t2 = 6.

Finding the optimal estimator within this restricted class and deriving
the asymptotic variance are now just a matter of plugging into the formulas
derived in the previous section.
Taking the partial derivative of (12.43) with respect to β yields

E{ ∂m∗(Y, X, β0)/∂βT | X } = −f F(X) exp(β0T X ∗)D(X, β0)
   = −f F(X) exp(2β0T X ∗)X ∗T.

Therefore, we obtain a consistent estimator for H1(β) by using

Ĥ1(β) = n−1 Σ_{i=1}^n [ Ri / π(Yi, Xi(1)) ] Xi∗ exp(2βT Xi∗)f FT(Xi).        (12.46)

Next we use equations (12.25)–(12.27) to compute

Û11(β) = n−1 Σ_{i=1}^n [ Ri {Yi − exp(βT Xi∗)}2 / π2(Yi, Xi(1)) ] f F(Xi)f FT(Xi),        (12.47)

Û12(β) = n−1 Σ_{i=1}^n [ Ri {Ri − π(Yi, Xi(1))} / π(Yi, Xi(1)) ] {Yi − exp(βT Xi∗)} f F(Xi)f2T(Yi, Xi(1)),
                                                                         (12.48)

and

Û22 = n−1 Σ_{i=1}^n {Ri − π(Yi, Xi(1))}2 f2(Yi, Xi(1))f2T(Yi, Xi(1)).        (12.49)

From these, we can compute

Û 11(β) = { Û11(β) − Û12(β)Û22−1 Û12T(β) }−1,        (12.50)

Û 12(β) = −Û 11(β)Û12(β)Û22−1,        (12.51)

ÂFopt(β) = Ĥ1(β)Û 11(β),        (12.52)

and

Â2opt(β) = Ĥ1(β)Û 12(β).        (12.53)

Therefore, the optimal restricted estimator is the solution to the estimating
equation

Σ_{i=1}^n [ Ri ÂFopt(β)f F(Xi) exp(βT Xi∗){Yi − exp(βT Xi∗)} / π(Yi, Xi(1))
   + {Ri − π(Yi, Xi(1))}Â2opt(β)f2(Yi, Xi(1)) ] = 0,        (12.54)

which we denote by β̂nopt . Moreover, the asymptotic variance for β̂nopt can
be estimated using

( Ĥ1(β̂nopt)Û 11(β̂nopt)Ĥ1T(β̂nopt) )−1,

where Ĥ1 (β) and Û 11 (β) were defined by (12.46) and (12.50), respectively.

Modeling the Missingness Probabilities

In the example above, it was assumed that the missing values of X (2) were by
design, where the investigator had control of the missingness probabilities. We
now consider how the methods would be modified if these probabilities were
not known and had to be estimated from the data. We might, for instance,
consider the logistic regression model where

P(R = 1|Y, X (1)) = π(Y, X (1), ψ) = exp(ψ0 + ψ1 Y + ψ2 X (1)) / {1 + exp(ψ0 + ψ1 Y + ψ2 X (1))}.        (12.55)

The maximum likelihood estimator for ψ = (ψ0, ψ1, ψ2)T is obtained by
maximizing the likelihood given by (8.10); specifically,

∏_{i=1}^n [ exp{(ψ0 + ψ1 Yi + ψ2 Xi(1))Ri} / {1 + exp(ψ0 + ψ1 Yi + ψ2 Xi(1))} ],

and the resulting MLE is denoted by ψ̂n = (ψ̂0n, ψ̂1n, ψ̂2n)T. Using standard
results for the logistic regression model, we note that the score vector Sψ (·)
is given by

Sψ (R, Y, X (1) , ψ) = (1, Y, X (1) )T {R − π(Y, X (1) , ψ)}, (12.56)

where π(Y, X (1) , ψ) is the probability of a complete case defined in (12.55).


We also note that the score equation, evaluated at the MLE, is equal to zero;
that is,

Σ_{i=1}^n Sψ(Ri, Yi, Xi(1), ψ̂n) = 0.        (12.57)
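As an illustration, the logistic missingness model (12.55) could be fit by a few Newton–Raphson steps on the score (12.56). The sketch below does this directly with numpy rather than relying on any particular regression package; the variable names are ours, not the book's.

```python
import numpy as np

def fit_missingness_model(R, Y, X1, n_iter=25):
    """Newton-Raphson MLE for psi = (psi0, psi1, psi2) in the logistic model (12.55)."""
    W = np.column_stack([np.ones_like(Y, dtype=float), Y, X1])  # design (1, Y, X^(1))
    psi = np.zeros(W.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-W @ psi))           # pi(Y_i, X1_i, psi)
        score = W.T @ (R - p)                         # sum over i of (12.56)
        info = W.T @ (W * (p * (1.0 - p))[:, None])   # Fisher information
        psi = psi + np.linalg.solve(info, score)
    return psi
```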
As mentioned in Remark 3 of this chapter, we should choose the score
vector Sψ (·) as part of a set of basis functions J2 (·) that spans the space G2 .
Examining the set of functions J2 (·) given by (12.44) for the example above,
we note that indeed Sψ (·) makes up the first three elements of J2 (·).
The estimator for β would then be identical to that given in equation
(12.54) except that we would substitute π(Yi, Xi(1), ψ̂n) for π(Y, X (1)) in the
equation (12.54) itself and when evaluating all the quantities in equations
(12.46)–(12.53).

Remark 5. We point out that the second term in the estimating equation
(12.54) (i.e., the augmented term) can be written as

Σ_{i=1}^n Â2opt f2(Yi, Xi(1)){Ri − π(Yi, Xi(1), ψ̂n)}.

Since the first three elements of the vector

Σ_{i=1}^n f2(Yi, Xi(1)){Ri − π(Yi, Xi(1), ψ̂n)}        (12.58)

are Σ_{i=1}^n Sψ(Ri, Yi, Xi(1), ψ̂n), then as a consequence of (12.57), this means
that the first three elements of the vector (12.58) are equal to zero. This
observation may result in some modest savings in computation.

12.4 Optimal Restricted (Class 2) Estimators


(Class 2) restricted estimators are AIPWCC estimators where we restrict
attention to a finite-dimensional linear subspace G F ⊂ ΛF⊥ spanned by the
vector J F(Z), as we did for (class 1) restricted estimators, but where G2 = Λ2.
By so doing, we will show that the resulting optimal estimator for β within this
class will have an influence function that is an element of the class of double-
robust influence functions (IF )DR ; see Definition 2 of Chapter 10. Toward
that end, we define the linear space Ξ similarly to what we did in (12.4); that is,

Ξ = I(C = ∞)G F / π(∞, Z) ⊕ Λ2,        (12.59)

and Ξ consists of the elements

I(C = ∞)AF m∗(Z, β0) / π(∞, Z) + L2{C, GC(Z)}        (12.60)

in H for any q × t1 constant matrix AF and L2 ∈ Λ2, where m∗(Z, β) is a t1 × 1
vector of estimating functions such that m∗ (Z, β0 ) = J F (Z). We denote an
element within this class as ϕ∗ {C, GC (Z)} and define IF (Ξ) to be the elements
within this class that satisfy (12.7). The elements within the class IF (Ξ) are
defined as ϕ{C, GC (Z)} and are given by (12.8).
We now prove the following key theorem for (class 2) restricted estimators.

Theorem 12.3. Among the elements ϕ{C, GC (Z)} ∈ IF (Ξ), the one with the
smallest variance matrix is given by

ϕopt{C, GC(Z)} = J{ϕFopt(Z)},        (12.61)

where the linear operator J(·) is given by Definition 1 of Chapter 10; that is,

J{ϕFopt(Z)} = I(C = ∞)ϕFopt(Z) / π(∞, Z) − Π[ I(C = ∞)ϕFopt(Z) / π(∞, Z) | Λ2 ],

ϕFopt(Z) = [ E{ϕ∗Fopt(Z)SβFT(Z)} ]−1 ϕ∗Fopt(Z),        (12.62)

and ϕ∗Fopt(Z) is the unique element in G F that solves the equation

Π[ SβF(Z) − M−1{ϕ∗Fopt(Z)} | G F ] = 0,        (12.63)

opt (Z)}|G ] = 0, (12.63)

where M−1 is the inverse of the linear operator M given by Definition 5


of Chapter 10. (The inverse operator M−1 exists and is unique; see Lemma
10.5.)

Proof. We first will prove that IF(Ξ) consists of the class of elements

I(C = ∞)ϕF(Z) / π(∞, Z) + L∗2{C, GC(Z)},        (12.64)

where
ϕF(Z) = [ E{ϕ∗F(Z)SβFT(Z)} ]−1 ϕ∗F(Z),
ϕ∗F(Z) ∈ G F, and L∗2{C, GC(Z)} ∈ Λ2.
Because the elements in IF (Ξ) are defined by (12.8), (12.64) will be true
if we can show that
E[ ϕ∗{C, GC(Z)}SβT{C, GC(Z)} ] = E{ϕ∗F(Z)SβFT(Z)},        (12.65)

where ϕ∗{C, GC(Z)} ∈ Ξ is equal to

ϕ∗{C, GC(Z)} = I(C = ∞)ϕ∗F(Z) / π(∞, Z) + L2{C, GC(Z)}.

Because Sβ {C, GC (Z)} = E{SβF (Z)|C, GC (Z)}, we use a series of iterated con-
ditional expectations to obtain
E[ ϕ∗{C, GC(Z)}SβT{C, GC(Z)} ] = E( E[ ϕ∗{C, GC(Z)}SβFT(Z) | C, GC(Z) ] )
   = E[ ϕ∗{C, GC(Z)}SβFT(Z) ]
   = E( E[ ϕ∗{C, GC(Z)}SβFT(Z) | Z ] )
   = E( E[ ϕ∗{C, GC(Z)} | Z ] SβFT(Z) )
   = E{ϕ∗F(Z)SβFT(Z)},

thus proving (12.65).


By proving (12.64), we have demonstrated that all the elements in IF (Ξ)
can be written as

ϕ{C, GC(Z)} = I(C = ∞)ϕF(Z) / π(∞, Z) + L2{C, GC(Z)},        (12.66)

where ϕF (Z) is a full-data influence function and L2 ∈ Λ2 . Because of The-


orem 10.1, we know that, for a fixed ϕF (Z), the optimal element (smallest
variance matrix) among the class of elements (12.66) is given by J {ϕF (Z)}.
Therefore, we only need to restrict attention to those elements of IF (Ξ) that
are in the class

{ J{ϕF(Z)} : ϕF(Z) = [ E{ϕ∗F(Z)SβFT(Z)} ]−1 ϕ∗F(Z) },

where ϕ∗F (Z) ∈ G F , if the goal is to find the optimal estimator in IF (Ξ).
Equivalently, we can restrict the search to the elements in the linear subspace
J (G F ) ⊂ Ξ that satisfy (12.7).
In Theorem 10.6, we proved that J {hF (Z)} = L[M−1 {hF (Z)}], where the
linear operator L was defined by Definition 4 of Chapter 10 and, we remind
the reader, is given by

L{hF (Z)} = E{hF (Z)|C, GC (Z)}.

Therefore, the linear space J (G F ) = L{M−1 (G F )}.


Using the same proof as for Theorem 12.1, we can also prove that among
the elements ϕ{C, GC (Z)} ∈ J (G F ) = L{M−1 (G F )}, the one with the small-
est variance matrix is the normalized version of the projection of the observed-
data score vector onto J(G F). Specifically,

ϕopt{C, GC(Z)} = ( E[ ϕ∗opt{C, GC(Z)}SβT{C, GC(Z)} ] )−1 ϕ∗opt{C, GC(Z)},        (12.67)
where

ϕ∗opt {C, GC (Z)} = Π[Sβ {C, GC (Z)}|L{M−1 (G F )}]. (12.68)

We now show how to derive this projection.


Because of the projection theorem for Hilbert spaces, the projection of
Sβ {C, GC (Z)} onto the closed linear space L{M−1 (G F )} is the unique element
ϕ∗Fopt(Z) ∈ G F that satisfies

E[ ( Sβ{C, GC(Z)} − L[M−1{ϕ∗Fopt(Z)}] )T L[M−1{ϕ∗F(Z)}] ] = 0,        (12.69)

for all ϕ∗F (Z) ∈ G F . Recalling that Sβ {C, GC (Z)} = E{SβF (Z)|C, GC (Z)}
and L[M−1 {ϕ∗F (Z)}] = E{M−1 (ϕ∗F )|C, GC (Z)}, we use a series of iterated
conditional expectations to write (12.69) as
 
0 = E( E[ {SβF − M−1(ϕ∗Fopt)}T L{M−1(ϕ∗F)} | C, GC(Z) ] )
  = E[ {SβF − M−1(ϕ∗Fopt)}T L{M−1(ϕ∗F)} ]
  = E( E[ {SβF − M−1(ϕ∗Fopt)}T L{M−1(ϕ∗F)} | Z ] )
  = E[ {SβF − M−1(ϕ∗Fopt)}T E[ L{M−1(ϕ∗F)} | Z ] ]
  = E[ {SβF − M−1(ϕ∗Fopt)}T M{M−1(ϕ∗F)} ]
  = E[ {SβF − M−1(ϕ∗Fopt)}T ϕ∗F ].        (12.70)

Therefore, (12.70) being true for all ϕ∗F ∈ G F implies that SβF − M−1(ϕ∗Fopt)
must be orthogonal to G F. Since the projection exists and is unique, this
implies that there must exist a unique element ϕ∗Fopt ∈ G F such that (12.70)
holds for all ϕ∗F ∈ G F, or equivalently

Π[ SβF(Z) − M−1{ϕ∗Fopt(Z)} | G F ] = 0.        (12.71)

Consequently,

ϕ∗opt{C, GC(Z)} = L[M−1{ϕ∗Fopt(Z)}] = J{ϕ∗Fopt(Z)},

where ϕ∗Fopt(Z) satisfies (12.71), or (12.63) of the theorem. The proof is
complete if we can show that

E[ ϕ∗opt{C, GC(Z)}SβT{C, GC(Z)} ]

of equation (12.67), where

ϕ∗opt{C, GC(Z)} = L[M−1{ϕ∗Fopt(Z)}],

is the same as E{ϕ∗Fopt(Z)SβFT(Z)}. This can be shown by using the same
iterated expectations argument that led to (12.70). □


A corollary that follows immediately from Theorem 12.3 by taking
G F = ΛF⊥ is given as follows.

Corollary 2. Among all influence functions in


 
Ξ = I(C = ∞)ΛF⊥ / π(∞, Z) ⊕ Λ2

(that is, elements of Ξ that satisfy (12.7)), the one with the smallest variance
matrix is obtained by choosing

ϕ∗opt{C, GC(Z)} = J{ϕ∗Fopt(Z)},

where ϕ∗Fopt(Z) is the unique element in ΛF⊥ that satisfies

Π[ SβF(Z) − M−1{ϕ∗Fopt(Z)} | ΛF⊥ ] = 0

or equivalently solves the equation

Π[ M−1{ϕ∗Fopt(Z)} | ΛF⊥ ] = Π[ SβF(Z) | ΛF⊥ ] = SeffF(Z).

Remark 6. Since the space Ξ in the corollary is the same as Λη⊥, the result
above is an alternative proof of Theorem 11.1, which was used to derive the
optimal influence function among all AIPWCC estimators for β. □

Returning to the restricted (class 2) estimators, we obtain the following


corollary.

Corollary 3. Let m∗ (Z, β) be a t1 × 1 (t1 > q) vector of estimating functions


such that m∗ (Z, β0 ) = J F (Z) spans the linear space G F ⊂ ΛF ⊥ . Then, among
the class of influence functions IF (Ξ), where Ξ is defined by (12.59), the
optimal element (smallest variance matrix) is given by ϕF opt (Z), defined by
(12.62), the normalized version of
ϕ∗Fopt = AFopt m∗(Z, β0),

where

AFopt = −( E[ ∂m∗(Z, β0)/∂βT ] )T ( E[ M−1{m∗(Z, β0)}m∗T(Z, β0) ] )−1.        (12.72)
(12.72)
Proof. Using (12.63), we are looking for ϕ∗Fopt = AFopt m∗(Z, β0) such that

Π[ SβF(Z) − M−1{AFopt m∗(Z, β0)} | G F ] = 0.        (12.73)

Because G F is spanned by m∗(Z, β0), the standard results for projecting onto
a finite-dimensional linear space yield

Π[ hF(Z) | G F ] = E(hF m∗T){E(m∗ m∗T)}−1 m∗(Z, β0).        (12.74)

Using (12.74), we write equation (12.73) as

E[ {SβF − AFopt M−1(m∗)}m∗T ] {E(m∗ m∗T)}−1 m∗(Z, β0) = 0.        (12.75)

Because m∗(Z, β0) is made up of t1 linearly independent elements, this implies
that the variance matrix E(m∗ m∗T) is positive definite and hence has a unique
inverse. Consequently, equation (12.75) is true if and only if
E[ {SβF − AFopt M−1(m∗)}m∗T ] = 0

or when

AFopt E{M−1(m∗)m∗T} = E(SβF m∗T).        (12.76)

Using the result from equation (12.12) of Lemma 12.1, we obtain that

E{SβF(Z)m∗T(Z, β0)} = −( E[ ∂m∗(Z, β0)/∂βT ] )T.

Substituting this last result into equation (12.76) and solving for AFopt leads
to (12.72), thus proving the corollary. □

The results of Corollary 3 are especially useful when the inverse operator
M−1(·) can be derived explicitly, such as when there are two levels of
coarsening or when the coarsening is monotone. The following algorithm can
now be used to derive improved adaptive double-robust estimators for β:
1. If the coarsening of the data is not by design, we develop a model for the
coarsening probabilities, say

P(C = r|Z) = π{r, Gr(Z), ψ},

and estimate ψ by maximizing

∏_{i=1}^n π{Ci, GCi(Zi), ψ}.

We denote this estimator by ψ̂n .


2. We posit a simpler parametric model for the distribution of Z using
p∗Z(z, ξ) and estimate ξ by maximizing the likelihood

∏_{i=1}^n p∗Gri(Zi)(gri, ξ)

for a realization of the data (ri, gri), i = 1, . . . , n, where

p∗Gr(Z)(gr, ξ) = ∫_{z : Gr(z) = gr} p∗Z(z, ξ) dνZ(z).

We denote the estimator by ξˆn∗ .


3. We consider a t1 ×1 vector m∗ (Z, β) of estimating functions, where t1 > q.
These may include the q-dimensional efficient full-data estimating function
as a subset of m∗ (Z, β).

4. We compute AFopt(β, ψ̂n, ξ̂n∗) by using

AFopt(β, ψ, ξ) = −( E{ ∂m∗(Z, β)/∂βT, ξ } )T ( E[ M−1{m∗(Z, β), ψ, ξ} m∗T(Z, β), ξ ] )−1,

where we emphasize that M−1 is computed as a function of ψ and ξ and
expectations of functions of Z are computed as a function of ξ.
5. We compute

L2t1×1{C, GC(Z), β, ψ, ξ} = −Π[ I(C = ∞)m∗(Z, β) / π(∞, Z, ψ) | Λ2, ψ, ξ ].
6. The improved double-robust estimator for β is given as the solution to
the AIPWCC estimating equation


Σ_{i=1}^n AFopt(β, ψ̂n, ξ̂n∗) [ I(Ci = ∞)m∗(Zi, β) / π(∞, Zi, ψ̂n)
   + L2{Ci, GCi(Zi), β, ψ̂n, ξ̂n∗} ] = 0.        (12.77)

Since this is an AIPWCC estimator, the asymptotic variance can be esti-


mated using the sandwich variance estimator given by (9.19).
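The six steps above translate fairly directly into a computational skeleton. The sketch below is one hypothetical way to organize them; the callables `E_dm_star`, `E_Minv_m_mT`, and `L2_term` stand in for the model-specific quantities of steps 4 and 5 (expectations under the posited model p∗Z(z, ξ) and the fitted coarsening model), which the analyst must work out analytically or by simulation, and `pi_hat` returns the estimated complete-case probability.

```python
import numpy as np
from scipy.optimize import root

def class2_estimator(beta_init, data, complete, pi_hat, m_star,
                     E_dm_star, E_Minv_m_mT, L2_term):
    """Sketch of the (class 2) double-robust estimator solving (12.77).

    E_dm_star(beta)   -> (t1, q) estimate of E{dm*/dbeta^T, xi}
    E_Minv_m_mT(beta) -> (t1, t1) estimate of E[M^{-1}(m*) m*^T, xi]
    L2_term(i, beta)  -> (t1,) augmentation term for subject i (step 5)
    complete          -> boolean array, complete[i] = I(C_i = infinity)"""
    def estimating_equation(beta):
        A_opt = -E_dm_star(beta).T @ np.linalg.inv(E_Minv_m_mT(beta))  # step 4, (12.72)
        total = np.zeros(len(beta_init))
        for i, Z in enumerate(data):
            term = L2_term(i, beta)
            if complete[i]:
                term = term + m_star(Z, beta) / pi_hat(Z)
            total = total + A_opt @ term                               # step 6, (12.77)
        return total
    return root(estimating_equation, beta_init).x
```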

Logistic Regression Example Revisited

In Section 10.2, we developed a double-robust estimator for the parameters


in a logistic regression model when one of the covariates was missing for some
individuals. Specifically, we considered the model

P(Y = 1|X) = exp(βT X ∗) / {1 + exp(βT X ∗)},

where X = (X1T , X2 )T , X ∗ = (1, X1T , X2 )T , (Y, X1 ) were always observed,


whereas the covariate X2 may be missing for some of the individuals in a
study. A complete-case indicator was denoted by R, and it was assumed that

P(R = 1|Y, X) = π(Y, X1, ψ) = exp(ψ0 + ψ1 Y + ψ2T X1) / {1 + exp(ψ0 + ψ1 Y + ψ2T X1)}.

The estimator ψ̂n is obtained by maximizing the likelihood

∏_{i=1}^n [ exp{(ψ0 + ψ1 Yi + ψ2T X1i)Ri} / {1 + exp(ψ0 + ψ1 Yi + ψ2T X1i)} ].

We also posited a model for the full data (Y, X) by assuming that the con-
ditional distribution of X given Y follows a multivariate normal distribution
with a mean that depends on Y but with a variance matrix that is indepen-
dent of Y . Let us denote the mean vector of X given Y = 1 and Y = 0 as µ1
and µ0 , respectively, and the common covariance matrix as Σ. We also denote
the mean vector of X1 given Y = 1 and Y = 0 as µ11 and µ10 , respectively,
and the mean of X2 given Y = 1 and Y = 0 as µ21 and µ20 , respectively.
Similarly, we denote the variance matrix of X1 by Σ11 , the variance of the
single covariate X2 by Σ22 , and the covariance of X1 and X2 by Σ12 . The
parameter ξ for this posited model can be represented by ξ = (µ1 , µ0 , Σ, τ ),
where τ denotes P (Y = 1). Since Y is observed for everyone, the estimate for
τ is obtained by the sample proportion

τ̂n = n−1 Σ_{i=1}^n Yi.

The estimates for µ1, µ0, and Σ are obtained by maximizing the observed-data
likelihood

∏_{i=1}^n ∏_{k=0}^1 [ |Σ11|−1/2 exp{ −(1/2)(X1i − µ1k)T Σ11−1 (X1i − µ1k) } ](1−Ri)I(Yi=k)
   × [ |Σ|−1/2 exp{ −(1/2)(Xi − µk)T Σ−1 (Xi − µk) } ]Ri I(Yi=k).

With full data we know that the optimal estimating function is given by

m(Y, X, β) = X ∗ [ Y − exp(βT X ∗) / {1 + exp(βT X ∗)} ].

Since this may no longer be the optimal choice with coarsened data, we now
consider an expanded set of estimating functions, namely m∗(Y, X, β). For
example, we might take

m∗(Y, X, β) = X ∗∗ [ Y − exp(βT X ∗) / {1 + exp(βT X ∗)} ],

where X ∗∗ is a vector consisting of X ∗ together with all the squared terms


and cross-product terms of X.
To find the optimal estimator using m∗ (Y, X, β) defined above, we must
compute AF opt (β, ψ, ξ) in equation (12.72), as described in step 4 of the algo-
rithm. Toward that end, we first note that

−∂m∗(Y, X, β)/∂βT = X ∗∗ [ exp(βT X ∗) / {1 + exp(βT X ∗)}2 ] X ∗T.

Also, with two levels of missingness,



M−1{m∗(Y, X, β), ψ, ξ} = {π(Y, X1, ψ)}−1 m∗(Y, X, β)
   − [ {1 − π(Y, X1, ψ)} / π(Y, X1, ψ) ] E{m∗(Y, X, β)|Y, X1, ξ}.

Finally, with two levels of missingness, we use the results from Theorem 10.2
to obtain

L2{C, GC(Z), β, ψ, ξ} = −Π[ I(C = ∞)m∗(Z, β) / π(∞, Z, ψ) | Λ2, ψ, ξ ]
   = −[ {R − π(Y, X1, ψ)} / π(Y, X1, ψ) ] E{m∗(Y, X, β)|Y, X1, ξ}.

Remark 7. Because Y is a binary indicator and the distribution of X given
Y is multivariate normal, it would be easy to simulate full data (Y, X) from
such a joint distribution. Such simulated data can then be used to estimate
unconditional expectations such as E{ −∂m∗(Y, X, β)/∂βT, ξ }, which for this
example is

E[ X ∗∗ exp(βT X ∗) / {1 + exp(βT X ∗)}2 X ∗T, ξ ],

using Monte Carlo methods. Similarly, because the conditional distribution of
X2 given X1, Y is normally distributed with mean

µ21 Y + µ20 (1 − Y ) + Σ12T Σ11−1 {X1 − µ11 Y − µ10 (1 − Y )}

and variance Σ22 − Σ12T Σ11−1 Σ12, Monte Carlo methods can be used when
computing conditional expectations of functions of (Y, X) given (Y, X1), as was
necessary to compute M−1 or L2(·). □
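As a hedged illustration of the Monte Carlo device described in this remark, the sketch below approximates E{m∗(Y, X, β) | Y, X1, ξ} for one subject by repeatedly drawing X2 from its normal conditional distribution given (Y, X1) under the posited model. The function names are ours; `m_star(y, x1, x2, beta)` is assumed to evaluate m∗ for a single subject, and the conditional mean and variance are obtained by standard normal conditioning from the posited parameters.

```python
import numpy as np

def monte_carlo_cond_expectation(m_star, y, x1, beta, mu1, mu0, Sigma,
                                 n_draws=2000, rng=np.random.default_rng(0)):
    """Approximate E{m*(Y, X, beta) | Y = y, X1 = x1, xi} by simulation.

    mu1, mu0 : posited mean vectors of X = (X1, X2) given Y = 1 and Y = 0
    Sigma    : posited common covariance matrix of X, partitioned with X2 scalar."""
    p = len(x1)
    mu = mu1 if y == 1 else mu0
    Sigma11, Sigma12, Sigma22 = Sigma[:p, :p], Sigma[:p, p], Sigma[p, p]
    # standard normal conditioning of X2 on X1 (within the group Y = y)
    w = np.linalg.solve(Sigma11, Sigma12)            # Sigma11^{-1} Sigma12
    cond_mean = mu[p] + w @ (x1 - mu[:p])
    cond_var = Sigma22 - Sigma12 @ w
    draws = rng.normal(cond_mean, np.sqrt(cond_var), size=n_draws)
    return np.mean([m_star(y, x1, x2, beta) for x2 in draws], axis=0)
```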

We are finally in a position to define all the elements making up equation


(12.77), which we can use to derive the estimator for β.

12.5 Recap and Review of Notation


• In this chapter, we considered a restricted class of AIPWCC estimators
where observed-data estimating functions were chosen from the linear sub-
space Π[ Ξ | Λψ⊥ ] ⊂ Λ⊥, where

Ξ = I(C = ∞)G F / π(∞, Z) ⊕ G2,

where G F was a linear subspace contained in ΛF ⊥ , and G2 was a linear


subspace contained in the augmentation space Λ2 .
• We considered two classes of restricted estimators:

– (Class 1) were defined by letting both G F ⊂ ΛF⊥ and G2 ⊂ Λ2 be finite-
dimensional linear spaces spanned by the t1 × 1 vector J F(Z) and the
t2 × 1 vector J2{C, GC(Z)}, respectively, where t1 > q and where the
elements of the coarsening model score vector Sψ{C, GC(Z)} were elements
contained in J2{C, GC(Z)}.
– (Class 2) were defined by only letting G F ⊂ ΛF⊥ be a finite-dimensional
linear space spanned by the t1 × 1 vector J F(Z), with t1 > q, but took G2 = Λ2.
• The optimal influence functions within the class Ξ were derived for both
classes and shown to be orthogonal to Λψ . This then allowed us to derive
optimal restricted AIPWCC estimators within these two classes.
– The (class 1) restricted optimal estimators were the easiest to compute
but not double robust.
– The (class 2) restricted optimal estimators resulted in double-robust
estimators. They were, however, computationally more intensive but
not as difficult to compute as the locally efficient estimators of Chapter
11.

12.6 Exercises for Chapter 12


1. In Section 11.2, we outlined the steps that would be necessary to obtain
a locally efficient estimator for β for the restricted moment model

E(Y |X) = µ(X, β)

when the data are monotonically coarsened. This methodology led to the
integral equation (11.42), which in general is very difficult if not impossible
to solve. Only consider the case when Y is a univariate random variable.
For this same problem, outline the steps that would be necessary to obtain
the optimal restricted (class 2) estimator for β. For this exercise, take

m*(Y, X, β) = f^{t_1×1}(X, β){Y − µ(X, β)},

where f^{t_1×1}(X, β) is a t_1 × 1 vector of linearly independent functions of
X and β and t1 > q. Assume that you can estimate the parameter ψ in
the coarsening model and the parameter ξ in the posited model p∗Z (z, ξ).
13
Double-Robust Estimator of the Average
Causal Treatment Effect

Statistical inference generally focuses on the associational relationships be-


tween variables in a population. Data that are collected are assumed to be
realizations of iid random vectors Z1 , . . . , Zn where a single observation Z
is distributed according to some density in the model pZ (z, θ), where θ de-
notes parameters that describe important features of the relationships of the
variables of interest.
However, one may be interested in causal relationships. That is, does “A”
cause “B”? For example, does a treatment intervention or exposure at one
point in time have a causal effect on subsequent response? In order to for-
mulate such a question from a statistical perspective, we will take the point
of view advocated by Neyman (1923), Rubin (1974), Robins (1986), and Hol-
land (1986), who considered potential outcomes. We will illustrate that the
semiparametric theory developed in this book can be used to aid us in find-
ing efficient semiparametric estimators of the average causal treatment effect
under certain assumptions. The methods we will discuss only consider the
simplest case of point exposure; that is, exposure or treatment of individuals
at only one point in time. A much more complex and elegant theory has been
developed by Robins and colleagues that provides methods for studying the
effect of time-dependent treatments on response. We refer the reader to the
book by van der Laan and Robins (2003) for more details and references.

13.1 Point Exposure Studies


We shall denote the possible treatments or exposures that can be given or
experienced by an individual by the random variable A. For simplicity, we
will assume that there are two possible treatments that we wish to compare
and thus A will be a binary variable taking on the values 0 or 1. For example,
A may denote whether an individual with hypertension is treated with a statin
drug (A = 1) or not (A = 0). The response variable will be denoted by Y ,
say, change in blood pressure after three months. Hence, in a typical study of
this treatment, we consider a population of patients with hypertension, say


individuals whose diastolic blood pressure is greater than or equal to 140,
and identify a sample of such patients, some of which receive the statin drug
(A = 1) and others who do not (A = 0). This sample of individuals is followed
for three months, and the change in blood pressure Y is measured.
Ultimately, we are interested in establishing a causal link between treat-
ment and response. That is, does treatment with the statin drug reduce blood
pressure after three months as compared with no treatment? The data that
are available from such a study may be summarized by Zi = (Yi , Ai , Xi ), i =
1, . . . , n, where for the i-th individual Yi denotes the response, Ai the treat-
ment received, and Xi other covariates that have been measured prior to
treatment (i.e., baseline covariates).
In a typical associational analysis, we might define population parameters
µ1 = E(Y |A = 1), µ0 = E(Y |A = 0), and ∆ = µ1 − µ0 . That is, ∆ denotes
the difference between the mean response for individuals receiving treatment 1 and the
mean response for individuals receiving treatment 0. Without any additional
assumptions, we can estimate ∆ simply as the difference of the treatment-specific
sample averages of response, namely ∆̂ = µ̂_1 − µ̂_0, where

\[
\hat\mu_1 = n_1^{-1}\sum_{i=1}^n A_iY_i, \qquad
\hat\mu_0 = n_0^{-1}\sum_{i=1}^n (1-A_i)Y_i,
\]

and \(n_1 = \sum_i A_i\), \(n_0 = \sum_i (1-A_i)\) denote the treatment-specific sample sizes.
Typically, such an associational analysis does not answer the causal ques-
tion of interest. If the treatments were not assigned to the patients at ran-
dom, then one can easily imagine that individuals who receive the statin
drugs may be inherently different from those who do not. They may be
wealthier, younger, smoke less, etc. Consequently, the associational param-
eter ∆ = µ1 − µ0 may reflect these inherent differences as well as any effect
due to treatment. In the study of epidemiology, such factors are referred to as
confounders, as they may confound the relationship between treatment and
response.
Thus, we have argued that statistical associations may not be adequate to
describe causal effects. Therefore, how might we describe causal effects? The
point of view we will adopt is that proposed by Neyman (1923) and Rubin
(1974), where causal effects are defined through potential outcomes or counter-
factual random variables. Specifically, for each level of the treatment A = a,
we will assume that there exists a potential outcome Y ∗ (a), where Y ∗ (a)
denotes the response of a randomly selected individual had that individual,
possibly contrary to fact, been given treatment A = a. In our illustration,
we only include two treatments and hence we define the potential outcomes
Y ∗ (1) and Y ∗ (0). Again, we emphasize that these are referred to as potential
outcomes or counterfactual random variables, as it is impossible to observe
both Y ∗ (1) and Y ∗ (0) simultaneously. Nonetheless, using the notion of poten-
tial outcomes, we would define the causal treatment effect by Y ∗ (1) − Y ∗ (0).
This definition of causal treatment effect is at the subject-specific level and, as


we pointed out, is impossible to measure. However, it may be possible, under
certain assumptions, to estimate the population-level causal treatment effect,
i.e., the expected value of the subject-specific treatment effect denoted as

δ = E{Y ∗ (1) − Y ∗ (0)} = E{Y ∗ (1)} − E{Y ∗ (0)}.

The parameter δ is referred to as the average causal treatment effect.


Since the average causal treatment effect is defined from parameters de-
scribing the potential outcomes, which are not directly observable, the ques-
tion is whether this parameter can be deduced from parameters describing
the distribution of the observable random variables Z = (Y, A, X).
Using the ideas that were developed for missing-data problems, we con-
sider the full data to be the variables {Y ∗ (1), Y ∗ (0), A, X}, which involve the
potential outcomes as well as the treatment assignment and baseline covari-
ates, and the observed data to be (Y, A, X). We now make the reasonable
assumption that
Y = AY ∗ (1) + (1 − A)Y ∗ (0); (13.1)
that is, the observed response Y is equal to Y ∗ (1) if the subject was given
treatment A = 1 and is equal to Y ∗ (0) if the subject was given treatment
A = 0.

Remark 1. Rubin (1978a) refers to the assumption (13.1) as the Stable Unit
Treatment Value Assumption, or SUTVA. Although this assumption may
seem straightforward at first, there are some philosophical subtleties that
need to be considered in order to fully accept it. For one thing, there must
not be any interference in the response from other subjects. That is, the ob-
served response for the i-th individual in the sample should not be affected
by the response of the other individuals in the sample. Thus, for example,
this assumption may not be reasonable in a vaccine intervention trial for an
infectious disease, where the response of an individual is clearly affected by
the response of others in the study. That is, whether or not an individual
contracts an infectious disease will depend, to some extent, on whether and
how many other individuals in the population are infected. From here on, we
will assume the SUTVA assumption but caution that the plausibility of this
assumption needs to be evaluated on a case-by-case basis.  

Because of assumption (13.1), we see that the observed data are a many-
to-one transformation of the full data. We also note that the treatment as-
signment indicator A plays a role similar to that of the missingness indicator
in missing-data problems. This analogy to missing-data problems will be use-
ful as we develop the theory that enables us to estimate the average causal
treatment effect.
13.2 Randomization and Causality


Intuitively, it has been accepted that the use of a randomized intervention
study will result in an unbiased estimate of the average treatment effect with
causal interpretations. This is because patients are assigned to treatment in-
terventions according to a random mechanism that is independent of all other
factors. Therefore, individuals in the two treatment groups are similar, on av-
erage, with respect to all characteristics except for the treatment intervention
to which they were assigned. Consequently, differences in response between
the two randomized groups can be reasonably attributed to the effect of treat-
ment and not other extraneous factors.
We now formalize this notion through the use of potential outcomes.
Specifically, we will show that the observed treatment difference in a ran-
domized intervention study is an unbiased estimator of the average causal
treatment effect δ.
Together with the SUTVA assumption (13.1), we also make the assumption
that
A ⊥⊥ {Y ∗ (1), Y ∗ (0)},    (13.2)
where “⊥⊥” denotes independence. This assumption is plausible since the re-
sponse of an individual to one treatment or the other should be independent
of treatment assignment because treatment was assigned according to some
random mechanism.

Remark 2. The assumption that treatment assignment is independent of the


potential outcomes should not be confused with treatment assignment being
independent of the observed response. That is,

A ⊥⊥ {Y ∗ (1), Y ∗ (0)} does not imply that A ⊥⊥ Y = AY ∗ (1) + (1 − A)Y ∗ (0).

Generally, we reserve the notion that A ⊥⊥ Y to denote the null hypothesis
that there is no treatment effect, whereas if there is a treatment effect, then
A is not independent of Y. □

In a randomized study, the associational treatment effect ∆ = E(Y |A =


1) − E(Y |A = 0) is equal to the average causal treatment effect δ =
E{Y ∗ (1)} − E{Y ∗ (0)}, as we now demonstrate:

E(Y |A = 1) = E{AY ∗ (1) + (1 − A)Y ∗ (0)|A = 1} = E{Y ∗ (1)|A = 1}    (13.3)
            = E{Y ∗ (1)},    (13.4)

where (13.3) follows from the SUTVA assumption and (13.4) follows from
assumption (13.2). Similarly, we can show that E(Y |A = 0) = E{Y ∗ (0)}.
Consequently, the difference in the treatment-specific sample average responses,
\(n_1^{-1}\sum_{i=1}^n A_iY_i - n_0^{-1}\sum_{i=1}^n (1-A_i)Y_i\), which is an unbiased estimator
for ∆, the associational treatment effect, is also an unbiased estimator for the
average causal treatment effect δ in a randomized intervention study.
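This equivalence is easy to check numerically: simulate potential outcomes, assign treatment completely at random, and compare the difference in treatment-specific sample means with the true average causal treatment effect. The Python sketch below uses purely hypothetical potential-outcome distributions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical potential outcomes (correlated, for realism).
y1 = 2.0 + rng.normal(size=n)            # Y*(1)
y0 = y1 - 1.5 + rng.normal(size=n)       # Y*(0); true delta is 1.5 in expectation
delta_true = np.mean(y1 - y0)

# Randomized treatment assignment, independent of {Y*(1), Y*(0)}.
a = rng.binomial(1, 0.5, size=n)
y = a * y1 + (1 - a) * y0                # SUTVA: observed response

delta_hat = y[a == 1].mean() - y[a == 0].mean()
print(delta_true, delta_hat)             # the two should be close
```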
As we mentioned earlier, the treatment indicator A serves a role analogous
to the missingness indicator R, which was used to denote missing data with
two levels of missingness. That is, when A = 1, we observe Y ∗ (1), in which
case Y ∗ (0) is missing, and when A = 0, we observe Y ∗ (0) and Y ∗ (1) is missing.
The assumption (13.2), which is induced because of randomization, is similar
to missing completely at random; that is, the probability that A is equal to 1
or 0 is independent of all the data {Y ∗ (1), Y ∗ (0), X}.

13.3 Observational Studies


In an observational study, individuals are not assigned to treatment by an
experimental design but rather by choice. It may be that data from an obser-
vational study are easier or cheaper to collect, or it may be that conducting a
randomized study is infeasible or unethical. Clearly, if we want to evaluate the
effect of smoking on some health outcome, it would be unethical to randomize
individuals and force them to smoke or not smoke.
For simplicity, and without loss of generality, we again consider only
two treatments. In an observational study, individuals who receive treat-
ment A = 1 may not be prognostically comparable with those who receive
treatment A = 0. Therefore, it may no longer be reasonable to assume that
A ⊥⊥ {Y ∗ (1), Y ∗ (0)}. However, if pretreatment (baseline) prognostic factors
X can be identified that affect treatment choice and, in addition, are them-
selves prognostic (i.e., related to clinical outcome), then it may be reasonable
to assume
A⊥⊥ {Y ∗ (1), Y ∗ (0)}|X; (13.5)
that is, treatment assignment is independent of the potential outcomes given
X. Such variables X are referred to in epidemiology as confounders, and as-
sumption (13.5) is sometimes referred to as the assumption of “no unmeasured
confounders.” Rubin (1978a) also refers to this assumption as the strong ig-
norability assumption.

Remark 3. The assumption of no unmeasured confounders is key to being


able to estimate the average causal treatment effect in an observational study.
Presumably, when a patient or his or her physician is faced with a binary
treatment choice of whether to treat the patient with treatment A = 0 or
A = 1, they do not know what the patient’s potential outcome will be. Con-
sequently, treatment choice is made based on variables and characteristics of
the patient prior to or at the time of treatment. If such factors are captured
in the database, then the assumption of no unmeasured confounders may be
reasonable. Of course, there may be factors that influence treatment decisions
that are not captured in the data that are collected. If such factors are also
correlated with response, then the assumption (13.5) is no longer tenable.  
We now give an argument to show that the average causal treatment effect
can be identified through the distribution of the observable data (Y, A, X) if
assumptions (13.1) and (13.5) hold. This follows because

E{Y ∗ (1)} = EX [E{Y ∗ (1)|X}]


= EX [E{Y ∗ (1)|A = 1, X}] (13.6)
= EX {E(Y |A = 1, X)}, (13.7)

where (13.6) follows from the assumption of no unmeasured confounders and


(13.7) follows from the SUTVA assumption. Similarly, we can show that
E{Y ∗ (0)} = EX {E(Y |A = 0, X)}. Hence the average causal treatment ef-
fect is equal to

δ = EX {E(Y |A = 1, X) − E(Y |A = 0, X)}, (13.8)

which only involves the distribution of (Y, A, X).

Remark 4. It is important to note that the outer expectation of (13.6) and


(13.7) is with respect to the marginal distribution of X and not the conditional
distribution of X given A = 1. We must also be a little careful to make sure
that, in equations (13.6) and (13.7), we are not conditioning on a null event.
One way to ensure that we are not is to assume that both P (A = 1|X = x)
and P (A = 0|X = x) are bounded away from zero for all x in the support of
X. We will discuss this assumption in greater detail when we introduce the
propensity score in the next section.  

13.4 Estimating the Average Causal Treatment Effect


Regression Modeling

We now consider the estimation of the average causal treatment effect from a
sample of observed data (Yi , Ai , Xi ), i = 1, . . . , n. The first approach, which
we refer to as regression modeling, is motivated by equation (13.8). Here, we
consider a restricted moment model for the conditional expectation of Y given
(A, X) in terms of a finite-dimensional parameter, say ξ. That is,

E(Y |A, X) = µ(A, X, ξ). (13.9)

The regression model could be as complicated as is deemed necessary by the


data analyst to get a good fit. For example, we might consider the linear model
with an interaction term; that is,

µ(A, X, ξ) = ξ0 + ξ1 A + ξ2 X + ξ3 AX,

or, if the response variable Y is positive, we might consider the corresponding


log-linear model
µ(A, X, ξ) = exp(ξ0 + ξ1 A + ξ2 X + ξ3 AX). (13.10)

The parameter ξ can be estimated using the generalized estimating equa-


tions developed in Section 4.6. For example, we may solve the estimating
equation

\[
\sum_{i=1}^n \frac{\partial\mu(A_i,X_i,\xi)}{\partial\xi}\,V^{-1}(A_i,X_i)\{Y_i - \mu(A_i,X_i,\xi)\} = 0, \qquad (13.11)
\]

where V (Ai , Xi ) = var(Yi |Ai , Xi ), to obtain the estimator ξˆn .


If the conditional expectations E(Y |A = 1, X) and E(Y |A = 0, X) were
known, then a natural consistent and unbiased estimator of the average causal
treatment effect δ, given by (13.8), would be obtained using the empirical
average
\[
n^{-1}\sum_{i=1}^n \{E(Y\,|\,A=1,X_i) - E(Y\,|\,A=0,X_i)\}. \qquad (13.12)
\]

Under the assumption that the restricted moment model is correct, a natural
estimator for E(Y |A = 1, X) − E(Y |A = 0, X) would then be µ(1, X, ξˆn ) −
µ(0, X, ξˆn ). Substituting this into (13.12) yields the estimator for the average
causal treatment effect,

\[
\hat\delta_n = n^{-1}\sum_{i=1}^n \{\mu(1,X_i,\hat\xi_n) - \mu(0,X_i,\hat\xi_n)\}. \qquad (13.13)
\]

So, for example, if we posited the log-linear model (13.10), then the estimator
for the average causal treatment effect would be given by
\[
\hat\delta_n = n^{-1}\sum_{i=1}^n \Big[\exp\{\hat\xi_{0n} + \hat\xi_{1n} + (\hat\xi_{2n} + \hat\xi_{3n})X_i\} - \exp(\hat\xi_{0n} + \hat\xi_{2n}X_i)\Big].
\]

The consistency and asymptotic normality of the estimator δ̂n can be


obtained in a straightforward fashion by deriving its influence function. We
leave this as an exercise for the reader to derive. We do note, however, that
these asymptotic properties are based on the assumption that the restricted
moment model E(Y |A, X) = µ(A, X, ξ) is correctly specified.
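A minimal Python sketch of the regression-modeling estimator (13.13), assuming the linear outcome model with an interaction term displayed above and a single covariate X; the data-generating step and all parameter values are hypothetical and serve only to make the computation concrete.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Hypothetical observational data with confounded treatment assignment.
x = rng.normal(size=n)
a = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * x))))
y = 1.0 + 2.0 * a + 1.5 * x + 0.5 * a * x + rng.normal(size=n)

# Fit E(Y | A, X) = xi0 + xi1*A + xi2*X + xi3*A*X by ordinary least squares
# (a special case of the estimating equation (13.11) with constant V(A, X)).
design = np.column_stack([np.ones(n), a, x, a * x])
xi_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

# Estimator (13.13): average the fitted difference mu(1, X) - mu(0, X).
mu1 = np.column_stack([np.ones(n), np.ones(n), x, x]) @ xi_hat
mu0 = np.column_stack([np.ones(n), np.zeros(n), x, np.zeros(n)]) @ xi_hat
delta_hat = np.mean(mu1 - mu0)
print(delta_hat)   # close to 2.0 + 0.5*E(X) = 2.0 for this hypothetical model
```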

13.5 Coarsened-Data Semiparametric Estimators


We now consider how to obtain estimators for the average causal treatment
effect by casting the problem as a coarsened-data semiparametric model. To-
ward that end, we denote the full data to be {Y ∗ (1), Y ∗ (0), X, A}, where
Y ∗ (1) and Y ∗ (0) denote the potential outcomes for treatment 1 and treat-
ment 0, respectively, A denotes the treatment assignment, and X denotes the
vector of baseline covariates. The joint density of the full data can be written
as
p{y ∗ (1), y ∗ (0), x, a} = p{a|y ∗ (1), y ∗ (0), x}p{y ∗ (1), y ∗ (0), x}
= p(a|x)p{y ∗ (1), y ∗ (0), x}, (13.14)
where (13.14) follows from the strong ignorability assumption (13.5). We will
put no restrictions on the joint density p{y ∗ (1), y ∗ (0), x} of {Y ∗ (1), Y ∗ (0), X}
(i.e., a nonparametric model). Consequently, using the same logic as in Sec-
tion 5.3, we argue that there is only one full-data influence function of RAL
estimators for δ = E{Y ∗ (1) − Y ∗ (0)}. Letting δ0 denote the true value of δ,
the full-data influence function is given by
ϕF {Y ∗ (1), Y ∗ (0), X} = {Y ∗ (1) − Y ∗ (0) − δ0 }, (13.15)
which, of course, is the influence function for the full-data estimator

\[
\hat\delta_n^F = n^{-1}\sum_{i=1}^n \{Y_i^*(1) - Y_i^*(0)\}.
\]

Although the joint distribution of {Y ∗ (1), Y ∗ (0), X} can be arbitrary, we


will assume that P (A = 1|X) can be modeled as π(X, ψ) using a finite number
of parameters ψ. Because treatment assignment A is a binary indicator, the
conditional density of A given X is
p(a|x) = π(x, ψ)^a {1 − π(x, ψ)}^{1−a}.
The function P (A = 1|X) is defined as the propensity score, as it reflects
the propensity that an individual will receive one treatment or the other as
a function of the baseline covariates X. The propensity score was first intro-
duced by Rosenbaum and Rubin (1983), and the properties have been studied
extensively by Rosenbaum and Rubin; see, for example, Rosenbaum and Ru-
bin (1984, 1985), Rosenbaum (1984, 1987), and Rubin (1997). Although the
model for the propensity score is not used in defining the full-data estimator
for the average causal treatment effect δ, it will play a crucial role when we
consider observed-data estimators for δ.
In contrast with the full data {Y ∗ (1), Y ∗ (0), X, A}, the observed data are
given as the many-to-one transformation of the full data, namely (Y, X, A),
where Y = AY ∗ (1) + (1 − A)Y ∗ (0). As such, this is an example of coarsened
data, where A plays a role similar to the coarsening variable C, or, more
specifically, to the complete-case indicator R defined in the previous chapters.
Remark 5. Throughout the chapters on missing and coarsened data, we always
assumed that, with positive probability, the coarsening variable C could take
on the value ∞ to denote the case when the full data were observed. For this
problem, we never get to observe the full data {Y ∗ (1), Y ∗ (0), X}. We either
observe Y ∗ (1) when A = 1 or Y ∗ (0) when A = 0. Nonetheless, we will see
a similarity in the semiparametric estimators developed for this problem and
those for the missing-data problems developed previously.  
Observed-Data Influence Functions

As with all semiparametric models, the key to finding estimators is to derive


the nuisance tangent space and the space orthogonal to the nuisance tangent
space, which, in turn, are used to derive the space of influence functions.
We first consider the case when the propensity score P (A = 1|X) is
known to us. This may be the case, for example, if we designed a randomized
study where treatment was assigned at random with known probabilities that
could depend on pretreatment baseline covariates. Using arguments identi-
cal to those in Chapters 7 and 8, we can show that the space of observed-
data influence functions of RAL estimators for δ corresponds to the space
K−1 {(IF )F }, where (IF )F denotes the space of full-data influence functions
and K : H → HF , given by Definition 1 of Chapter 7, is the many-to-one linear
mapping from the observed-data Hilbert space to the full-data Hilbert space,
where, for any h ∈ H, K(h) = E{h(Y, A, X)|Y ∗ (1), Y ∗ (0), X}. Consequently,
K−1 {(IF )F } denotes all the functions ϕ(Y, A, X) such that

E{ϕ(Y, A, X)|Y ∗ (1), Y ∗ (0), X} = ϕF {Y ∗ (1), Y ∗ (0), X}

for any ϕF {Y ∗ (1), Y ∗ (0), X} corresponding to a full-data influence function.


In our problem, there is only one full-data influence function given by
(13.15). Hence, the class of observed-data influence functions corresponds to
functions K−1 {Y ∗ (1) − Y ∗ (0) − δ0 }, which, by Lemma 7.4, are equal to

h(Y, A, X) + Λ2 ,

where h(Y, A, X) is any function that satisfies the relationship

E{h(Y, A, X)|Y ∗ (1), Y ∗ (0), X} = {Y ∗ (1) − Y ∗ (0) − δ0 }, (13.16)

and Λ2 is the linear subspace in H consisting of elements L2 (Y, A, X) such


that
E{L2 (Y, A, X)|Y ∗ (1), Y ∗ (0), X} = 0, (13.17)
also referred to as the augmentation space.
A function h(·) satisfying equation (13.16) is motivated by the inverse
propensity weighted full-data influence function; namely,
 
\[
h(Y,A,X) = \frac{AY}{\pi(X)} - \frac{(1-A)Y}{1-\pi(X)} - \delta_0. \qquad (13.18)
\]

To verify this, consider the conditional expectation of the first term on the
right-hand side of (13.18); that is,
\[
\begin{aligned}
E\left\{\frac{AY}{\pi(X)}\,\Big|\,Y^*(1),Y^*(0),X\right\}
 &= E\left\{\frac{AY^*(1)}{\pi(X)}\,\Big|\,Y^*(1),Y^*(0),X\right\}\\
 &= \frac{Y^*(1)}{\pi(X)}\,E\{A\,|\,Y^*(1),Y^*(0),X\}\\
 &= \frac{Y^*(1)}{\pi(X)}\,E(A\,|\,X) \qquad (13.19)\\
 &= \frac{Y^*(1)}{\pi(X)}\,\pi(X) = Y^*(1), \qquad (13.20)
\end{aligned}
\]

where (13.19) follows from the strong ignorability assumption. Also, in order
for (13.20) to hold, we must not be dividing 0 by 0. Therefore, we will need
the additional assumption that the propensity score P (A = 1|X) = π(X)
is strictly greater than zero almost everywhere. Similarly, we can show that
E[(1 − A)Y/{1 − π(X)} | Y ∗ (1), Y ∗ (0), X] = Y ∗ (0) as long as 1 − π(X) is strictly greater
than zero almost everywhere. Therefore, we have proved that the function h(·)
defined in (13.18) satisfies the relationship (13.16) as long as the propensity
score 0 < π(x) < 1, for all x in the support of X.
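The function h(Y, A, X) in (13.18) immediately suggests an inverse propensity weighted estimator of δ when π(X) is known: average AY/π(X) − (1 − A)Y/{1 − π(X)} over the sample. The sketch below illustrates this with a purely hypothetical data-generating mechanism.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Hypothetical data with a known propensity score pi(X).
x = rng.normal(size=n)
pi_x = 1 / (1 + np.exp(-x))              # known P(A = 1 | X)
a = rng.binomial(1, pi_x)
y1 = 2.0 + x + rng.normal(size=n)        # Y*(1)
y0 = x + rng.normal(size=n)              # Y*(0); true delta = 2.0
y = a * y1 + (1 - a) * y0                # SUTVA

# Inverse propensity weighted estimator based on (13.18).
delta_ipw = np.mean(a * y / pi_x - (1 - a) * y / (1 - pi_x))
print(delta_ipw)                          # approximately 2.0
```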
To derive the augmentation space Λ2 , we must find all functions L2 (·) of
Y, A, X that satisfy (13.17). Because A is a binary indicator, any function of
Y, A, X can be written as

L2 (Y, A, X) = AL21 (Y, X) + (1 − A)L20 (Y, X) (13.21)

for arbitrary functions L21 (·) and L20 (·) of Y, X. Hence the conditional ex-
pectation of L2 (·), given {Y ∗ (1), Y ∗ (0), X}, is

E{L2 (Y, A, X)|Y ∗ (1), Y ∗ (0), X}


= E[AL21 {Y ∗ (1), X} + (1 − A)L20 {Y ∗ (0), X}|Y ∗ (1), Y ∗ (0), X] (13.22)
∗ ∗
= π(X)L21 {Y (1), X} + {1 − π(X)}L20 {Y (0), X}, (13.23)

where (13.22) follows from the SUTVA assumption and (13.23) follows from
the strong ignorability assumption. Therefore, in order for (13.17) to hold, we
need

\[
L_{20}\{Y^*(0),X\} = -\left\{\frac{\pi(X)}{1-\pi(X)}\right\} L_{21}\{Y^*(1),X\}. \qquad (13.24)
\]
Since both the left- and right-hand sides of (13.24) denote functions of
Y ∗ (1), Y ∗ (0), X, the only way that equation (13.24) can hold is if both L20 (·)
and L21 (·) are functions of X alone. In that case, any element of Λ2 must
satisfy

\[
L_{20}(X) = -\left\{\frac{\pi(X)}{1-\pi(X)}\right\} L_{21}(X). \qquad (13.25)
\]
Substituting the relationship given by (13.25) into (13.21), we conclude that
the space Λ2 consists of functions
\[
\left\{\frac{A-\pi(X)}{1-\pi(X)}\right\} L_{21}(X) \quad\text{for arbitrary functions } L_{21}(X).
\]

Since 1 − π(X) is strictly greater than zero almost everywhere and L_{21}(X) is
an arbitrary function of X, we can write Λ2 as

\[
\Lambda_2 = \Big[\{A-\pi(X)\}h_2(X) \ \text{ for arbitrary } h_2(X)\Big]. \qquad (13.26)
\]

Thus, we have demonstrated that the class of all influence functions of


RAL estimators for δ, when the propensity score π(X) is known, is given by
 
\[
\frac{AY}{\pi(X)} - \frac{(1-A)Y}{1-\pi(X)} - \delta_0 + \Lambda_2, \qquad (13.27)
\]

where Λ2 is defined by (13.26). The optimal influence function in this class


is the one with the smallest variance or, equivalently, the element with the
smallest norm. This is obtained by choosing the element {A − π(X)}h_2^0(X)
to be minus the projection of AY/π(X) − (1 − A)Y/{1 − π(X)} − δ0 onto Λ2.

We remind the reader that the projection of some arbitrary element
ϕ(Y, A, X) onto Λ2 is obtained by finding the unique element h_2^0(X) such
that

\[
E\Big(\big[\phi(Y,A,X) - \{A-\pi(X)\}h_2^0(X)\big]\{A-\pi(X)\}h_2(X)\Big) = 0
\]

for all functions h_2(X). We denote this as Π[ϕ(Y, A, X)|Λ2] = {A − π(X)}h_2^0(X).
As indicated earlier, any function ϕ(·) of Y, A, X can be written as ϕ(Y, A, X) =
Aϕ1 (Y, X) + (1 − A)ϕ0 (Y, X). Because projections are linear operators,

Π[ϕ(Y, A, X)|Λ2 ] = Π[Aϕ1 (Y, X)|Λ2 ] + Π[(1 − A)ϕ0 (Y, X)|Λ2 ].

We now show how to derive these projections in the following theorem.


Theorem 13.1. The projection

Π[Aϕ1(Y, X)|Λ2] = {A − π(X)}h_1^0(X),    (13.28)

where h_1^0(X) = E{ϕ1(Y, X)|A = 1, X}, and

Π[(1 − A)ϕ0(Y, X)|Λ2] = {A − π(X)}h_0^0(X),    (13.29)

where h_0^0(X) = −E{ϕ0(Y, X)|A = 0, X}.

Proof. We first consider (13.28). By definition, Π[Aϕ1 (Y, X)|Λ2 ] is defined as


the function {A − π(X)}h_1^0(X) such that

\[
E\Big(\big[A\phi_1(Y,X) - \{A-\pi(X)\}h_1^0(X)\big]\{A-\pi(X)\}h(X)\Big) = 0,
\]

or equivalently

\[
E\Big[A\{A-\pi(X)\}h(X)\phi_1(Y,X) - \{A-\pi(X)\}^2 h_1^0(X)h(X)\Big] = 0, \qquad (13.30)
\]

for all h(X). By a simple conditioning argument, where we first condition on


X, we can show that the second term on the left-hand side of (13.30) is equal
to
   
\[
E\Big[\{A-\pi(X)\}^2 h_1^0(X)h(X)\Big] = E\Big[\pi(X)\{1-\pi(X)\}h_1^0(X)h(X)\Big]. \qquad (13.31)
\]

The first term on the left-hand side of (13.30) is also computed through a
series of iterated conditional expectations; namely,
 
\[
\begin{aligned}
E\Big[A\{A-\pi(X)\}h(X)\phi_1(Y,X)\Big]
 &= E\Big(E\Big[A\{A-\pi(X)\}h(X)\phi_1(Y,X)\,\Big|\,A,X\Big]\Big)\\
 &= E\Big[A\{A-\pi(X)\}h(X)\,E\{\phi_1(Y,X)\,|\,A=1,X\}\Big]\\
 &= E\Big(E\Big[A\{A-\pi(X)\}h(X)\,E\{\phi_1(Y,X)\,|\,A=1,X\}\,\Big|\,X\Big]\Big)\\
 &= E\Big[\pi(X)\{1-\pi(X)\}h(X)\,E\{\phi_1(Y,X)\,|\,A=1,X\}\Big]. \qquad (13.32)
\end{aligned}
\]

Substituting (13.31) and (13.32) into (13.30), we obtain that


   
\[
E\Big(\pi(X)\{1-\pi(X)\}\big[E\{\phi_1(Y,X)\,|\,A=1,X\} - h_1^0(X)\big]h(X)\Big) = 0 \qquad (13.33)
\]

for all h(X). Since π(X) and 1 − π(X) are both bounded away from zero almost
surely, and because E{ϕ1(Y, X)|A = 1, X} − h_1^0(X) is a function of X,
(13.33) holds for all h(X) if and only if E{ϕ1(Y, X)|A = 1, X} − h_1^0(X) is
identically equal to zero; i.e., h_1^0(X) must equal E{ϕ1(Y, X)|A = 1, X},
thus proving (13.28). The proof of (13.29) follows similarly. □
Returning to the class of influence functions of RAL estimators for δ given
by (13.27), the efficient influence function in this class is given by
   
\[
\frac{AY}{\pi(X)} - \frac{(1-A)Y}{1-\pi(X)} - \delta_0
 - \Pi\!\left[\frac{AY}{\pi(X)} - \frac{(1-A)Y}{1-\pi(X)} - \delta_0 \,\Big|\, \Lambda_2\right],
\]

which by Theorem 13.1 is equal to

\[
\frac{AY}{\pi(X)} - \frac{\{A-\pi(X)\}E(Y\,|\,A=1,X)}{\pi(X)}
 - \frac{(1-A)Y}{1-\pi(X)} - \frac{\{A-\pi(X)\}E(Y\,|\,A=0,X)}{1-\pi(X)} - \delta_0. \qquad (13.34)
\]
Motivated by the efficient influence function (13.34), an estimator for δ


would be obtained as the solution to the estimating equation
\[
\sum_{i=1}^n \left[\frac{A_iY_i}{\pi(X_i)} - \frac{\{A_i-\pi(X_i)\}\mu(1,X_i)}{\pi(X_i)}
 - \frac{(1-A_i)Y_i}{1-\pi(X_i)} - \frac{\{A_i-\pi(X_i)\}\mu(0,X_i)}{1-\pi(X_i)} - \delta\right] = 0,
\]

or, more specifically,

\[
\hat\delta_n = n^{-1}\sum_{i=1}^n \left[\frac{A_iY_i}{\pi(X_i)} - \frac{\{A_i-\pi(X_i)\}\mu(1,X_i)}{\pi(X_i)}
 - \frac{(1-A_i)Y_i}{1-\pi(X_i)} - \frac{\{A_i-\pi(X_i)\}\mu(0,X_i)}{1-\pi(X_i)}\right], \qquad (13.35)
\]

where µ(A, X) = E(Y |A, X). Of course, µ(A, X) is not known to us but can
be estimated by positing a model where E(Y |A, X) = µ(A, X, ξ) as we did in
(13.9). The parameter ξ can be estimated by solving the estimating equation
(13.11), and then µ(1, Xi , ξˆn ) and µ(0, Xi , ξˆn ) can be substituted into (13.35)
to obtain a locally efficient estimator for δ.
The development above assumed that the propensity score π(X) was
known to us. In observational studies, this will not be the case. Consequently,
we must posit a model for the propensity score, say assuming that

P (A = 1|X) = π(X, ψ). (13.36)

Since A is binary, the logistic regression model is often used; i.e.,

\[
\pi(X,\psi) = \frac{\exp(\psi_0 + \psi_1^T X)}{1 + \exp(\psi_0 + \psi_1^T X)}.
\]

The estimator for ψ would be obtained using maximum likelihood; i.e., ψ̂n is
the value of ψ that maximizes the likelihood
\[
\prod_{i=1}^n \pi(X_i,\psi)^{A_i}\{1 - \pi(X_i,\psi)\}^{1-A_i}.
\]

Consequently, in an observational study, the locally efficient semiparametric


estimator for the average causal treatment effect would be given by
\[
\hat\delta_n = n^{-1}\sum_{i=1}^n \left[\frac{A_iY_i}{\pi(X_i,\hat\psi_n)}
 - \frac{\{A_i-\pi(X_i,\hat\psi_n)\}\mu(1,X_i,\hat\xi_n)}{\pi(X_i,\hat\psi_n)}
 - \frac{(1-A_i)Y_i}{1-\pi(X_i,\hat\psi_n)}
 - \frac{\{A_i-\pi(X_i,\hat\psi_n)\}\mu(0,X_i,\hat\xi_n)}{1-\pi(X_i,\hat\psi_n)}\right]. \qquad (13.37)
\]
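The following Python sketch assembles the estimator (13.37) for a single covariate, assuming a logistic propensity model and a linear outcome model with an interaction term. The logistic fit is carried out with a few Newton–Raphson steps so the example is self-contained; the simulated data and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000

# Hypothetical observational data.
x = rng.normal(size=n)
a = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 0.8 * x))))
y = 1.0 + 2.0 * a + 1.5 * x + 0.5 * a * x + rng.normal(size=n)

# --- Propensity model pi(X, psi): logistic regression via Newton-Raphson ---
d_ps = np.column_stack([np.ones(n), x])
psi = np.zeros(d_ps.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-d_ps @ psi))
    grad = d_ps.T @ (a - p)                              # score of the likelihood
    hess = d_ps.T @ (d_ps * (p * (1 - p))[:, None])      # observed information
    psi += np.linalg.solve(hess, grad)
pi_hat = 1 / (1 + np.exp(-d_ps @ psi))

# --- Outcome model mu(A, X, xi) = xi0 + xi1*A + xi2*X + xi3*A*X (OLS) ---
d_out = np.column_stack([np.ones(n), a, x, a * x])
xi = np.linalg.lstsq(d_out, y, rcond=None)[0]
mu1 = np.column_stack([np.ones(n), np.ones(n), x, x]) @ xi
mu0 = np.column_stack([np.ones(n), np.zeros(n), x, np.zeros(n)]) @ xi

# --- Augmented inverse propensity weighted estimator (13.37) ---
delta_aipw = np.mean(
    a * y / pi_hat - (a - pi_hat) * mu1 / pi_hat
    - (1 - a) * y / (1 - pi_hat) - (a - pi_hat) * mu0 / (1 - pi_hat)
)
print(delta_aipw)   # approximately 2.0 for these hypothetical parameters
```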
Double Robustness

We complete this chapter by demonstrating that the locally efficient semi-


parametric estimator for δ, given by (13.37), is doubly robust. That is, under
the SUTVA and strong ignorability assumptions, (13.37) is a consistent esti-
mator for δ if either the model for the propensity score π(X, ψ) or the regres-
sion model E(Y |A, X) = µ(A, X, ξ) is correct. We use the conventions that
p p
→ ψ ∗ and ξˆn −
ψ̂n − → ξ ∗ to denote that, under suitable regularity conditions,
these estimators will converge whether the corresponding model is correct or
not and denote, by letting ψ ∗ = ψ0 or ξ ∗ = ξ0 , the case when these estimators
converge to the truth; i.e., that the corresponding model is correctly specified.
Because the estimator (13.37) is a sample average, it is easy to show that
it will converge in probability to

\[
E\left[\frac{AY}{\pi(X,\psi^*)} - \frac{\{A-\pi(X,\psi^*)\}\mu(1,X,\xi^*)}{\pi(X,\psi^*)}
 - \frac{(1-A)Y}{1-\pi(X,\psi^*)} - \frac{\{A-\pi(X,\psi^*)\}\mu(0,X,\xi^*)}{1-\pi(X,\psi^*)}\right]. \qquad (13.38)
\]
By the SUTVA assumption,
\[
\frac{AY}{\pi(X,\psi^*)} = \frac{AY^*(1)}{\pi(X,\psi^*)} = Y^*(1) + \frac{\{A-\pi(X,\psi^*)\}Y^*(1)}{\pi(X,\psi^*)}, \qquad (13.39)
\]

and

\[
\frac{(1-A)Y}{1-\pi(X,\psi^*)} = \frac{(1-A)Y^*(0)}{1-\pi(X,\psi^*)} = Y^*(0) - \frac{\{A-\pi(X,\psi^*)\}Y^*(0)}{1-\pi(X,\psi^*)}. \qquad (13.40)
\]

Substituting (13.39) and (13.40) into (13.38) yields

\[
\begin{aligned}
&E\{Y^*(1) - Y^*(0)\} &(13.41)\\
&\quad + E\left[\frac{\{A-\pi(X,\psi^*)\}\{Y^*(1)-\mu(1,X,\xi^*)\}}{\pi(X,\psi^*)}\right] &(13.42)\\
&\quad + E\left[\frac{\{A-\pi(X,\psi^*)\}\{Y^*(0)-\mu(0,X,\xi^*)\}}{1-\pi(X,\psi^*)}\right]. &(13.43)
\end{aligned}
\]

Since E{Y ∗ (1)−Y ∗ (0)} of (13.41) is equal to δ, the proof of double robustness
will follow if we can show that the expectations in (13.42) and (13.43) are equal
to zero if either ψ ∗ = ψ0 or ξ ∗ = ξ0 .
Let us first assume that the propensity score is correctly specified; i.e.,
ψ ∗ = ψ0 . We compute the expectation of (13.42) by iterated conditional
expectations, where we first condition on Y ∗ (1) and X to obtain that (13.42)
equals
 
\[
E\left(\frac{[E\{A\,|\,Y^*(1),X\} - \pi(X,\psi_0)]\{Y^*(1)-\mu(1,X,\xi^*)\}}{\pi(X,\psi_0)}\right).
\]
Because of the strong ignorability assumption, E{A|Y ∗ (1), X} = E(A|X) =


π(X, ψ0 ), thus allowing us to conclude that (13.42) equals zero when ψ ∗ = ψ0 .
A similar argument can also be used to show that (13.43) equals zero when
ψ ∗ = ψ0 .
Now consider the case where the regression model E(Y |A, X) is correctly
specified; i.e., ξ ∗ = ξ0 . Again, we compute the expectation of (13.42) by
iterated conditional expectations, but now we first condition on A and X to
obtain that (13.42) equals
 
\[
E\left(\frac{\{A-\pi(X,\psi^*)\}[E\{Y^*(1)\,|\,A,X\} - \mu(1,X,\xi_0)]}{\pi(X,\psi^*)}\right).
\]

Because of SUTVA, µ(1, X, ξ0 ) = E(Y |A = 1, X) = E{Y ∗ (1)|A = 1, X}, and


because of the strong ignorability assumption,

E{Y ∗ (1)|X} = E{Y ∗ (1)|A, X} = E{Y ∗ (1)|A = 1, X} = µ(1, X, ξ0 ),

thus allowing us to conclude that (13.42) equals zero when ξ ∗ = ξ0 . A similar


argument can also be used to show that (13.43) equals zero when ξ ∗ = ξ0 .
The local semiparametric efficiency together with the double-robustness
property makes the estimator (13.37) desirable as compared with, say, the
regression estimator (13.13), whose consistency is completely dependent on
correctly modeling the regression relationship E(Y |A, X).

Remark 6. The connection of the propensity score to causality has been stud-
ied carefully by Rosenbaum, Rubin, and others; see, for example, Rosenbaum
and Rubin (1983, 1984, 1985), Rosenbaum (1984, 1987), and Rubin (1997).
Different methods for estimating the average causal treatment effect using
propensity scores have been advocated. These include stratification, match-
ing, and inverse propensity weighting. The locally efficient estimator derived
above is sometimes referred to as the augmented inverse propensity weighted
estimator. This estimator was compared with other commonly used estima-
tors of the average causal treatment effect in a series of numerical simulations
by Lunceford and Davidian (2004), who generally found that this estimator
performs the best across a wide variety of scenarios.  

13.6 Exercises for Chapter 13


1. Derive the influence function for the regression estimator δ̂n given by
equation (13.13).
2. Derive the influence function for the augmented inverse propensity weighted
estimator δ̂n given by equation (13.37).
14
Multiple Imputation: A Frequentist
Perspective

A popular approach for dealing with missing data is the use of multiple im-
putation, which was first introduced by Rubin (1978b). Although most of
this book has focused on semiparametric models, where the model includes
infinite-dimensional nuisance parameters, this chapter will only consider finite-
dimensional parametric models, as in Chapter 3. Because of its importance in
missing-data problems, we conclude with a discussion of this methodology.
Imputation methods, where we replace missing values by some “best guess”
and then analyze the data as if complete, have a great deal of intuitive appeal.
However, unless one is careful about how the imputation is carried out and
how the subsequent inference is made, imputation methods may lead to biased
estimates with estimated confidence intervals that are too narrow. Rubin’s
proposal for multiple imputation allowed the use of this intuitive idea in a
manner that results in correct inference. Rubin’s justification is based on a
Bayesian paradigm. In this chapter, we will consider the statistical properties
of multiple-imputation estimators from a frequentist point of view using large
sample theory. Much of the development is based on two papers, by Wang and
Robins (1998) and Robins and Wang (2000). Although most of the machinery
developed in the previous chapters that led to AIPWCC estimators will not
be used here, we feel that this topic is of sufficient importance to warrant
study in its own right.
As we have all along, we will consider a full-data model, where the full
data are denoted by Z1 , . . . , Zn assumed iid with density pZ (z, β), where β is
a finite-dimensional parameter, say q-dimensional. The observed (coarsened)
data will be assumed coarsened at random and denoted by

{Ci , GCi (Zi )}, i = 1, . . . , n.

Remark 1. When data are coarsened at random and the full-data parameter
β is finite-dimensional, then β can be estimated using maximum likelihood.
The coarsened-data likelihood was derived in (7.10), where we showed that β
could be estimated by maximizing
\[
\prod_{i=1}^n p_{G_{r_i}(Z_i)}(g_{r_i},\beta), \qquad (14.1)
\]

where

\[
p_{G_r(Z)}(g_r,\beta) = \int_{\{z:\,G_r(z)=g_r\}} p_Z(z,\beta)\,d\nu_Z(z).
\]

However, even for parametric models, maximizing coarsened-data likelihoods


may be difficult to implement. Multiple imputation is an attempt to use sim-
pler methods that are easier to implement. 


We will assume throughout that the full-data maximum likelihood estima-


tor for β can be derived easily using standard software. We will also assume
that the standard large-sample properties of maximum likelihood estimators
apply to the full-data model. For example, denoting the full-data score vector
by
∂ log pZ (z, β)
S F (z, β) = ,
∂β
the MLE β̂nF is obtained by solving the likelihood equation


n
S F (Zi , β) = 0.
i=1

Remark 2. On notation
In this chapter, we only consider the parameter β with no additional nuisance
parameters. Therefore, the full-data score vector will be denoted by S F (Z, β)
(without the subscript β used in the previous chapters). As usual, when we use
the notation S F (Z, β), we are considering a q-dimensional vector of functions
of Z and β. If the score vector is evaluated at the truth, β = β0 , then this will
often be denoted by S F (Z) = S F (Z, β0 ). A similar convention will be used
when we consider the observed-data score vector, which will be denoted as
S{C, GC (Z), β}, and, at the truth, as S{C, GC (Z), β0 } or S{C, GC (Z)}.  
Under suitable regularity conditions, which will be assumed throughout,
 
\[
n^{1/2}(\hat\beta_n^F - \beta_0) \xrightarrow{\ D\ } N\{0, V_{\mathrm{eff}}^F(\beta_0)\},
\]

where {V_eff^F(β0)}^{-1} is the full-data information matrix, which we denote by
I^F(β0) = E{S^F(Z)S^{FT}(Z)}. That is,

\[
V_{\mathrm{eff}}^F(\beta_0) = \{I^F(\beta_0)\}^{-1}.
\]

We present an illustrative example where multiple-imputation methods


may be used.
Example 1. Surrogate marker problem


Consider the logistic regression model used to model the probability of a
dichotomous response as a function of the covariates X. Letting Y = {1, 0}
denote a binary response variable, we assume

\[
P(Y=1\,|\,X) = \frac{\exp(\theta^T X)}{1+\exp(\theta^T X)}. \qquad (14.2)
\]

With full data (Yi , Xi ), i = 1, . . . , n, the maximum likelihood estimator for θ


can be obtained in a straightforward fashion using standard software. We wish
to consider the problem where the variable X, or a subset of X, is expensive
to obtain. For instance, X may represent some biological marker, derived from
stored plasma collected on everyone, that is expensive to measure. However,
another variable, W , is identified as a cheaper surrogate for X. That is, W is
correlated with X and satisfies the surrogacy assumption; namely,

\[
P(Y=1\,|\,X,W) = P(Y=1\,|\,X) = \frac{\exp(\theta^T X)}{1+\exp(\theta^T X)}.
\]

Let us also assume that (X, W ) follows a multivariate normal distribution


     
\[
\begin{pmatrix} X \\ W \end{pmatrix}
 \sim N\left\{\begin{pmatrix} \mu_X \\ \mu_W \end{pmatrix},
 \begin{pmatrix} \sigma_{XX} & \sigma_{XW} \\ \sigma_{XW}^T & \sigma_{WW} \end{pmatrix}\right\}.
\]

In an attempt to save costs, the inexpensive surrogate variable W will be


collected on everyone in the sample, as will the response variable Y . However,
the expensive variable X will be obtained on only a validation subsample of
individuals in the study using stored blood chosen at random with a pre-
specified probability of selection that may depend on (Y, W ) (i.e., missing at
random). Letting R denote the indicator of a complete-case, then the observed
data are denoted as

(Ri , Yi , Wi , Ri Xi ), i = 1, . . . , n.

Although the primary focus is the parameter θ, since the model is finite-
dimensional, we will not differentiate between the parameters of interest and
the nuisance parameters. Hence

β = (θ^T, µ_X^T, µ_W^T, σ_{XX}, σ_{XW}, σ_{WW})^T.

This example will be used for illustration later. 
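To make the data structure of Example 1 concrete, the sketch below simulates observed data (R_i, Y_i, W_i, R_i X_i) with scalar X and W. The parameter values and the particular validation-sampling probability are hypothetical; they merely illustrate selection depending only on (Y, W), i.e., missingness at random.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000

# Hypothetical bivariate normal (X, W): W is a cheap surrogate for X.
mu_x, mu_w = 0.0, 0.0
s_xx, s_xw, s_ww = 1.0, 0.7, 1.0
cov = np.array([[s_xx, s_xw], [s_xw, s_ww]])
xw = rng.multivariate_normal([mu_x, mu_w], cov, size=n)
x, w = xw[:, 0], xw[:, 1]

# Surrogacy: Y depends on X only (logistic model with a scalar theta).
theta = 1.2
y = rng.binomial(1, 1 / (1 + np.exp(-theta * x)))

# Validation sampling: X is measured with a prespecified probability that
# depends only on (Y, W), so the data are missing at random.
p_select = 0.2 + 0.5 * y + 0.1 * (w > 0)
r = rng.binomial(1, p_select)

# Observed data: (R_i, Y_i, W_i, R_i * X_i), i = 1, ..., n.
x_obs = np.where(r == 1, x, np.nan)
print(np.mean(r), np.nanmean(x_obs))
```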



With observed data, the parameter β can be estimated efficiently by maxi-
mizing the observed-data likelihood (14.1). That is, we would find the solution
to the observed-data likelihood equation

\[
\sum_{i=1}^n S\{C_i, G_{C_i}(Z_i), \beta\} = 0, \qquad (14.3)
\]

where the observed-data score vector is given by

S{Ci , GCi (Zi ), β} = E{S F (Zi , β)|Ci , GCi (Zi ), β}. (14.4)

Remark 3. It is important for some of the subsequent developments that we


introduce the following notation. When we write

E{S^F(Z_i, β′) | C_i, G_{C_i}(Z_i), β′′},

where we allow β′, β′′ to be arbitrary values not necessarily at the truth or
equal to each other, we mean specifically that

\[
E\{S^F(Z_i,\beta')\,|\,C_i = r_i,\, G_{C_i}(Z_i) = g_{r_i},\, \beta''\}
 = \frac{\int_{\{z:\,G_{r_i}(z)=g_{r_i}\}} S^F(z,\beta')\,p_Z(z,\beta'')\,d\nu_Z(z)}
        {\int_{\{z:\,G_{r_i}(z)=g_{r_i}\}} p_Z(z,\beta'')\,d\nu_Z(z)}. \qquad (14.5)
\]

Unless otherwise stated, expectations and conditional expectations will be


evaluated with parameter values at the truth. Consequently, the observed-data
score vector, evaluated at the truth, is

S{C, GC (Z)} = E{S F (Z)|C, GC (Z)} = E{S F (Z, β0 )|C, GC (Z), β0 }. 




Therefore, when we derive the observed-data MLE by solving (14.3), we


are specifically looking for β̂n that satisfies

\[
\sum_{i=1}^n E\{S^F(Z_i,\hat\beta_n)\,|\,C_i, G_{C_i}(Z_i), \hat\beta_n\} = 0
\]

(notice that these two occurrences of β̂_n must be equal).

Again, under suitable regularity conditions,


 
\[
n^{1/2}(\hat\beta_n - \beta_0) \xrightarrow{\ D\ } N\{0, V_{\mathrm{eff}}(\beta_0)\},
\]

where {V_eff(β0)}^{-1} is the observed-data information matrix, denoted by

\[
I(\beta_0) = E\big[S\{C_i, G_{C_i}(Z_i)\}\,S^T\{C_i, G_{C_i}(Z_i)\}\big].
\]

14.1 Full- Versus Observed-Data Information Matrix


Because coarsened data Gr (Z) represent a many-to-one transformation of the
full data Z (i.e., a data reduction), it seems intuitively clear that the asymp-
totic variance of the full-data MLE β̂nF will be smaller than the asymptotic
variance of the observed-data MLE β̂n . In this section, we give a formal proof
of this result.
Let var{S^F(Z)} = E{S^F(Z)S^{FT}(Z)} denote the variance matrix of
S^F(Z). Then the asymptotic variance of the full-data MLE β̂_n^F is given by
V_eff^F(β0) = [var{S^F(Z)}]^{-1}. Similarly, the asymptotic variance of the observed-
data MLE β̂_n is V_eff(β0) = (var[S{C, G_C(Z)}])^{-1}.
We now give two results regarding the full-data and observed-data esti-
mators for β.
Theorem 14.1. The observed-data information matrix is smaller than or
equal to the full-data information matrix; that is,

var[S{C, GC (Z)}] ≤ var{S F (Z)},

where the notation “≤” means that var{S F (Z)} − var[S{C, GC (Z)}] is non-
negative definite.
Proof. By the law of conditional variance,
\[
\begin{aligned}
\mathrm{var}\{S^F(Z)\} &= \mathrm{var}\big[E\{S^F(Z)\,|\,C,G_C(Z)\}\big] + E\big[\mathrm{var}\{S^F(Z)\,|\,C,G_C(Z)\}\big]\\
 &= \mathrm{var}\big[S\{C,G_C(Z)\}\big] + E\big[\mathrm{var}\{S^F(Z)\,|\,C,G_C(Z)\}\big]. \qquad (14.6)
\end{aligned}
\]

This implies that


var{S F (Z)} − var[S{C, GC (Z)}]
is nonnegative definite. 

Theorem 14.2. The asymptotic variance of the full-data MLE is smaller than
or equal to the asymptotic variance of the observed-data MLE; that is,
V_eff^F(β0) ≤ V_eff(β0).    (14.7)

Proof. This follows from results about influence functions; namely, if we define
\[
\varphi^F_{\mathrm{eff}}(Z) = \big[E\{S^F(Z)S^{FT}(Z)\}\big]^{-1} S^F(Z)
\]

and

\[
\varphi_{\mathrm{eff}}\{C,G_C(Z)\} = \big[E\big(S\{C,G_C(Z)\}S^T\{C,G_C(Z)\}\big)\big]^{-1} S\{C,G_C(Z)\},
\]

then

\[
\varphi_{\mathrm{eff}}\{C,G_C(Z)\} = \varphi^F_{\mathrm{eff}}(Z) + \big[\varphi_{\mathrm{eff}}\{C,G_C(Z)\} - \varphi^F_{\mathrm{eff}}(Z)\big].
\]

We also note that


    
\[
E\big[S^F(Z)S^T\{C,G_C(Z)\}\big]
 = E\Big(E\big[S^F(Z)S^T\{C,G_C(Z)\}\,\big|\,C,G_C(Z)\big]\Big)
 = E\big[S\{C,G_C(Z)\}S^T\{C,G_C(Z)\}\big]. \qquad (14.8)
\]
Equation (14.8) can be used to show that


\[
E\Big(\varphi^F_{\mathrm{eff}}(Z)\big[\varphi_{\mathrm{eff}}\{C,G_C(Z)\} - \varphi^F_{\mathrm{eff}}(Z)\big]^T\Big) = 0,
\]

which, in turn, can be used to show that

\[
\mathrm{var}\big[\varphi_{\mathrm{eff}}\{C,G_C(Z)\}\big]
 = \mathrm{var}\{\varphi^F_{\mathrm{eff}}(Z)\}
 + \mathrm{var}\big(\varphi_{\mathrm{eff}}\{C,G_C(Z)\} - \varphi^F_{\mathrm{eff}}(Z)\big),
\]

or

\[
\mathrm{var}\big[\varphi_{\mathrm{eff}}\{C,G_C(Z)\}\big] - \mathrm{var}\{\varphi^F_{\mathrm{eff}}(Z)\}
\]

is nonnegative definite. The proof is complete upon noticing that

\[
\mathrm{var}\big[\varphi_{\mathrm{eff}}\{C,G_C(Z)\}\big] = \big(\mathrm{var}\big[S\{C,G_C(Z)\}\big]\big)^{-1} = V_{\mathrm{eff}}(\beta_0)
\]

and

\[
\mathrm{var}\{\varphi^F_{\mathrm{eff}}(Z)\} = \big[\mathrm{var}\{S^F(Z)\}\big]^{-1} = V_{\mathrm{eff}}^F(\beta_0). \qquad\square
\]




For many problems, working with the observed (coarsened) data likelihood
may be difficult, with no readily available software. For such instances, it may
be useful to find methods where we can use the simpler full-data inference to
analyze coarsened data. This is what motivated much of the inverse weighted
methodology discussed in the previous chapters. Multiple imputation is
another such methodology popularized by Rubin (1978b, 1987). See also the
excellent overview paper by Rubin (1996).

14.2 Multiple Imputation


Multiple imputation is implemented as follows. For each observed-data value
{Ci , GCi (Zi )}, we sample at random from the conditional distribution

pZ|C,GC (Z) {z|Ci , GCi (Zi )}

m times to obtain random quantities

Zij , j = 1, . . . , m, i = 1, . . . , n.

These are the imputations of the full data from the observed (coarsened) data.
The j-th set of imputed full data, denoted by Z_{ij}, i = 1, . . . , n, is used to
obtain the j-th estimator β̂*_{nj} by solving the full-data likelihood equation

\[
\sum_{i=1}^n S^F(Z_{ij}, \hat\beta^*_{nj}) = 0. \qquad (14.9)
\]

That is, we use the j-th imputed data set and treat these data as if they were

full data to obtain the full-data MLE β̂*_{nj}.
The proposed multiple-imputation estimator is

\[
\hat\beta_n^* = m^{-1}\sum_{j=1}^m \hat\beta^*_{nj}. \qquad (14.10)
\]

Rubin (1987) argues that, under appropriate conditions, this estimator is con-
sistent and asymptotically normal. That is,
\[
n^{1/2}(\hat\beta_n^* - \beta_0) \xrightarrow{\ D\ } N(0, \Sigma^*),
\]

where the asymptotic variance Σ* can be estimated by

\[
\hat\Sigma^* = m^{-1}\sum_{j=1}^m\left\{n^{-1}\sum_{i=1}^n
 -\frac{\partial S^F(Z_{ij},\hat\beta^*_{nj})}{\partial\beta^T}\right\}^{-1}
 + \left(\frac{m+1}{m}\right)
 \frac{n\sum_{j=1}^m (\hat\beta^*_{nj}-\hat\beta_n^*)(\hat\beta^*_{nj}-\hat\beta_n^*)^T}{m-1}.
\]

We note that the first term in the sum is an average of the estimators of
the full-data asymptotic variance using the inverse of the full-data observed
information matrix over the imputed full-data sets and the second term is
the sample variance of the imputation estimators multiplied by a finite “m”
correction factor.
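The mechanics of the multiple-imputation estimator (14.10) and of the variance formula above can be sketched in a few lines. The example below uses a deliberately simple hypothetical setting (a normal mean with values missing completely at random, imputed from a fitted normal, i.e., improper frequentist imputation), with the sample variance standing in for the inverse observed information; it states the combining rule on the per-estimator scale, Σ̂*/n = within-imputation variance + (1 + 1/m) × between-imputation variance.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 500, 10

# Hypothetical full data: Z ~ N(beta0, 1) with beta0 = 2; some values are
# missing completely at random, so the complete cases give an initial fit.
z = rng.normal(2.0, 1.0, size=n)
observed = rng.binomial(1, 0.7, size=n).astype(bool)

beta_init = z[observed].mean()             # initial (complete-case) estimator
sigma_init = z[observed].std(ddof=1)

beta_hat = np.empty(m)
var_hat = np.empty(m)
for j in range(m):
    z_imp = z.copy()
    # Impute the missing values from the fitted distribution ("improper"
    # frequentist imputation: beta is fixed at the initial estimate).
    z_imp[~observed] = rng.normal(beta_init, sigma_init, size=(~observed).sum())
    # Full-data MLE and its estimated variance for the j-th imputed data set.
    beta_hat[j] = z_imp.mean()
    var_hat[j] = z_imp.var(ddof=1) / n

beta_star = beta_hat.mean()                       # estimator (14.10)
within = var_hat.mean()                           # average full-data variance
between = beta_hat.var(ddof=1)                    # variance across imputations
var_star = within + (1 + 1 / m) * between         # combining rule, per-estimator scale
print(beta_star, var_star)
```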

Remark 4. If we knew the true value, β0 , then we could generate a random


variable Zij (β0 ) whose distribution has density pZ (z, β0 ) by first generating
a random {Ci , GCi (Zi )} from the distribution with density pC,GC (Z) (r, gr , β0 )
and then generating a random Zij (β0 ) from the conditional distribution with
conditional density
pZ|C,GC (Z) {z|Ci , GCi (Zi ), β0 }.
This is essentially the logic that motivates multiple imputation. Assuming the
model is correct, the observed data {Ci , GCi (Zi )} are generated, by nature,
from the density pC,GC (Z) (r, gr , β0 ). However, since we don’t know the true
value of β0 , the Zij must be generated from the conditional density

pZ|C,GC (Z) {z|Ci , GCi (Zi ), β},

where β must be estimated from the data in some fashion. Such imputed
values are referred to as Zij (β). 


We will consider two cases.


(a) In what we will denote as frequentist imputation, an initial estimator β̂nI is


obtained from the coarsened data. The imputations Z_{ij}(β̂_n^I) are obtained
by sampling from

pZ|C,GC (Z) {z|Ci , GCi (Zi ), β̂nI }.

The resulting estimator is referred to by Wang and Robins (1998) as a


type B estimator. Rubin (1987) also refers to this type of imputation as
improper imputation.
(b) For Bayesian imputation (the approach advocated by Rubin), the Zij are
generated from the predictive distribution

pZ|C,GC (Z) {z|Ci , GCi (Zi )}.

The predictive distribution is given by



\[
\int p_{Z|C,G_C(Z)}\{z\,|\,C_i, G_{C_i}(Z_i), \beta\}\;
 p_{\beta|C,G_C(Z)}\{\beta\,|\,C_i, G_{C_i}(Z_i)\}\,d\nu_\beta(\beta),
\]

where p_{β|C,G_C(Z)}{β | C_i, G_{C_i}(Z_i)} is the posterior distribution of β given the observed data.

In order to implement the Bayesian approach, which Rubin refers to as


proper imputation, one needs to specify a prior distribution for β, from which
the posterior distribution pβ|C,GC (Z) {β|Ci , GCi (Zi )} is computed using Bayes’s
rule. Then
(i) we randomly select β (j) from the posterior distribution and,
(ii) conditional on β (j) , we sample from pZ|C,GC (Z) {z|Ci , GCi (Zi ), β (j) } to ob-
tain Zij (β (j) ).
A multiple-imputation estimator using this approach is referred to by Wang
and Robins (1998) as a type A estimator.

14.3 Asymptotic Properties of the Multiple-Imputation Estimator

In what follows, we will derive the asymptotic properties of the multiple-
imputation estimator. We first will consider the “frequentist” approach, which
imputes from the conditional distribution fixing the parameter β at some
initial estimator β̂nI . Some discussion of Bayesian proper imputation will also
be considered later.
In order to implement multiple imputation from a frequentist view, we
must start with some initial estimator for β, say β̂nI . For missing-data prob-
lems, an initial estimator can often be obtained using only the complete cases
(i.e., {i : Ci = ∞}). For example, when the data are missing completely at ran-
dom, we can obtain a consistent asymptotically normal estimator using only
the complete cases since these are a representative sample (albeit smaller)
from the population. If the data are missing at random (MAR), we might use
an inverse probability weighted complete-case estimator as an initial value.
The initial estimator for β will be assumed to be an RAL estimator; that
is,

\[
n^{1/2}(\hat\beta_n^I - \beta_0) = n^{-1/2}\sum_{i=1}^n q\{C_i, G_{C_i}(Z_i)\} + o_p(1),
\]

where q{Ci , GCi (Zi )} is the i-th influence function of the estimator β̂nI . We
remind the reader of the following properties of influence functions for RAL
estimators:
(a) The efficient influence function, denoted by ϕeff {Ci , GCi (Zi )}, equals
\[
\big[E\big(S\{C_i,G_{C_i}(Z_i)\}S^T\{C_i,G_{C_i}(Z_i)\}\big)\big]^{-1} S\{C_i,G_{C_i}(Z_i)\}. \qquad (14.11)
\]

(b) For any influence function of an RAL estimator q{C_i, G_{C_i}(Z_i)},

\[
E\big[q\{C_i,G_{C_i}(Z_i)\}S^T\{C_i,G_{C_i}(Z_i)\}\big] = I^{q\times q}, \qquad (14.12)
\]

the q × q identity matrix.

(c) The influence function q{Ci , GCi (Zi )} can be written as

q{Ci , GCi (Zi )} = ϕeff {Ci , GCi (Zi )} + h{Ci , GCi (Zi )}, (14.13)

where h{Ci , GCi (Zi )} has mean zero and


 
\[
E\big[h\{C_i,G_{C_i}(Z_i)\}S^T\{C_i,G_{C_i}(Z_i)\}\big] = 0^{q\times q}.
\]

Hence

\[
\begin{aligned}
\mathrm{var}\big[q\{C_i,G_{C_i}(Z_i)\}\big]
 &= \mathrm{var}\big[\varphi_{\mathrm{eff}}\{C_i,G_{C_i}(Z_i)\}\big] + \mathrm{var}\big[h\{C_i,G_{C_i}(Z_i)\}\big]\\
 &= \big[E\big(S\{C_i,G_{C_i}(Z_i)\}S^T\{C_i,G_{C_i}(Z_i)\}\big)\big]^{-1}
   + \mathrm{var}\big[h\{C_i,G_{C_i}(Z_i)\}\big]. \qquad (14.14)
\end{aligned}
\]

In studying the asymptotic properties of n1/2 (β̂n∗ − β0 ), we will consider


the following questions.
1. Is the distribution asymptotically normal?
2. What is its asymptotic variance?
3. Can we estimate the asymptotic variance? How well does Rubin’s variance
estimator work?
4. What are the efficiency properties of the multiple-imputation estimator?

We begin the study of the asymptotic properties by first introducing some


preliminary lemmas. Rigorous proofs of these lemmas require careful analysis
and the use of empirical processes that are beyond the scope of this book. We
will, however, provide some heuristic justification of the results.
As we have emphasized repeatedly throughout this book, the key to the
asymptotic properties of RAL estimators is being able to derive their influence
function. We will derive the influence function of the multiple-imputation
estimator through a series of approximations. We begin by giving the first
such approximation.

Lemma 14.1. The multiple-imputation estimator β̂n∗ defined by (14.9) and


(14.10) using the imputed values Zij (β̂nI ), j = 1, . . . , m, i = 1, . . . , n, where
β̂nI denotes some initial estimator, can be approximated as

\[
n^{1/2}(\hat\beta_n^* - \beta_0)
 = n^{-1/2}\{I^F(\beta_0)\}^{-1}\sum_{i=1}^n\left[m^{-1}\sum_{j=1}^m S^F\{Z_{ij}(\hat\beta_n^I),\beta_0\}\right] + o_p(1), \qquad (14.15)
\]

where I^F(β0) is the full-data information matrix; namely,

\[
E\left\{-\frac{\partial S^F(Z,\beta_0)}{\partial\beta^T}\right\} = E\big\{S^F(Z,\beta_0)S^{FT}(Z,\beta_0)\big\}.
\]

Proof. Heuristic sketch



The j-th imputation estimator β̂*_{nj} is the solution to the equation

\[
\sum_{i=1}^n S^F\{Z_{ij}(\hat\beta_n^I), \hat\beta^*_{nj}\} = 0,
\]

which by the mean value expansion is equal to

\[
\sum_{i=1}^n S^F\{Z_{ij}(\hat\beta_n^I), \beta_0\}
 + \left[\sum_{i=1}^n \frac{\partial S^F\{Z_{ij}(\hat\beta_n^I), \tilde\beta_{nj}\}}{\partial\beta^T}\right]
 \times (\hat\beta^*_{nj} - \beta_0),
\]


where β̃nj is an intermediate value between β̂nj and β0 . Under suitable regu-
larity conditions,

n  
∂S F {Zij (β̂ I ), β̃nj } ∂S F {Zij (β0 ), β0 }
−n−1
P
n

→E − , (14.16)
i=1
∂β T ∂β T

which, by standard results from likelihood theory, is the information matrix,


which is also equal to
\[
E\big[S^F\{Z_{ij}(\beta_0),\beta_0\}\,S^{FT}\{Z_{ij}(\beta_0),\beta_0\}\big] = I^F(\beta_0).
\]

Therefore,
\[
n^{1/2}(\hat\beta^*_{nj} - \beta_0)
 = n^{-1/2}\{I^F(\beta_0)\}^{-1}\sum_{i=1}^n S^F\{Z_{ij}(\hat\beta_n^I), \beta_0\} + o_p(1), \qquad (14.17)
\]

which implies that


\[
n^{1/2}(\hat\beta_n^* - \beta_0)
 = n^{-1/2}\{I^F(\beta_0)\}^{-1}\sum_{i=1}^n\left[m^{-1}\sum_{j=1}^m S^F\{Z_{ij}(\hat\beta_n^I),\beta_0\}\right] + o_p(1). \qquad\square
\]

As a consequence of Lemma 14.1, we have shown that n1/2 (β̂n∗ − β0 ), up


to order op (1), is proportional to
\[
n^{-1/2}\sum_{i=1}^n\left[m^{-1}\sum_{j=1}^m S^F\{Z_{ij}(\hat\beta_n^I),\beta_0\}\right]. \qquad (14.18)
\]

However, this does not give us the desired influence function for β̂n∗ because
Zij (β̂nI ) is evaluated at a random quantity (β̂nI ) that involves data from all
individuals and therefore (14.18) is not a sum of iid terms. In order to find
the influence function, we write (14.18) as the sum of
\[
n^{-1/2}\sum_{i=1}^n\left[m^{-1}\sum_{j=1}^m S^F\{Z_{ij}(\beta_0),\beta_0\}\right] \qquad (14.19)
\]

plus

\[
n^{-1/2}\sum_{i=1}^n\left[m^{-1}\sum_{j=1}^m S^F\{Z_{ij}(\hat\beta_n^I),\beta_0\}
 - m^{-1}\sum_{j=1}^m S^F\{Z_{ij}(\beta_0),\beta_0\}\right] \qquad (14.20)
\]

and then show how these two pieces can be expressed approximately as a sum
of iid terms.
Remark 5. We remind the reader that, for a fixed β, Zij (β), j = 1, . . . , m, i =
1, . . . , n are obtained through a two-stage process. For any i, nature (sam-
pling from the population) provides us with the data {Ci , GCi (Zi )} de-
rived from the distribution pC,GC (Z) (r, gr , β0 ). Then, for j = 1, . . . , m, the
data analyst draws m values at random from the conditional distribution
pZ|C,GC (Z) {z|Ci , GCi (Zi ), β}. Consequently, the vector {Zi1 (β), . . . , Zim (β)} is
made up of correlated random variables, but these random vectors across i
are iid random vectors. Also, because of the way the data are generated, the
marginal distribution of Zij (β) is the same for all i and j. If β = β0 , then
the marginal distribution of Zij (β0 ) has density pZ (z, β0 ) (i.e., the density for
the full data at the truth). However, if β ≠ β0, then the marginal density for
Zij (β) is more complex.  
Based on the discussion in Remark 5, (14.19) is made up of a sum of n iid
elements, where the i-th element of the sum is equal to
\[
m^{-1}\sum_{j=1}^m S^F\{Z_{ij}(\beta_0),\beta_0\}, \qquad (14.21)
\]

i = 1, . . . , n. In addition, the distribution of S F {Zij (β0 ), β0 } is the same as


the distribution of S F (Zi , β0 ), where Zi is the full data. Therefore,

E[S F {Zij (β0 ), β0 }] = E{S F (Zi , β0 )} = 0 for all i, j,

which implies that (14.21) has mean zero. Hence, (14.19) is a normalized sum
of mean-zero iid random vectors that will converge to a normal distribution
by the central limit theorem.
We now consider the approximation of (14.20) as a sum of iid random
vectors. This is given by the following theorem.
Theorem 14.3. The expression (14.20) is equal to

\[
n^{-1/2}\big\{I^F(\beta_0) - I(\beta_0)\big\}\sum_{i=1}^n q\{C_i, G_{C_i}(Z_i)\} + o_p(1), \qquad (14.22)
\]

where I F (β0 ) is the full-data information matrix,


   
\[
E\left\{-\frac{\partial S^F(Z,\beta_0)}{\partial\beta^T}\right\} = E\big\{S^F(Z,\beta_0)S^{FT}(Z,\beta_0)\big\},
\]

I(β0) is the observed-data information matrix,

\[
E\left[-\frac{\partial S\{C,G_C(Z),\beta_0\}}{\partial\beta^T}\right] = E\big[S\{C,G_C(Z),\beta_0\}S^T\{C,G_C(Z),\beta_0\}\big],
\]

and q{Ci , GCi (Zi )} is the i-th influence function of the initial estimator β̂nI .
Theorem 14.3 will be proved using a series of lemmas.
Lemma 14.2. Let Zij (β) denote a random draw from the conditional distri-
bution with conditional density pZ|C,GC (Z) {z|Ci , GCi (Zi ), β}. If we define
 
\[
\lambda(\beta,\beta_0) = E\big[S^F\{Z_{ij}(\beta),\beta_0\}\big], \qquad (14.23)
\]

then

\[
\left.\frac{\partial\lambda(\beta,\beta_0)}{\partial\beta^T}\right|_{\beta=\beta_0} = I^F(\beta_0) - I(\beta_0).
\]

Proof. By the law of conditional expectations, (14.23) equals


 
 
E E S F {Zij (β), β0 }|Ci , GCi (Zi ), β . (14.24)

Because Zij (β) is a random draw from the conditional distribution with con-
ditional density
p_{Z|C,G_C(Z)}{z | C_i, G_{C_i}(Z_i), β},

this means that the inner expectation of (14.24) is equal to

\[
\int S^F(z,\beta_0)\,p_{Z|C,G_C(Z)}\{z\,|\,C_i,G_{C_i}(Z_i),\beta\}\,d\nu_{Z|C,G_C(Z)}\{z\,|\,C_i,G_{C_i}(Z_i)\}
\]

(note the value β in the conditioning density here).

The outer expectation of (14.24), however, involves the distribution of the


observed data {Ci , GC (Zi )}, which are generated from the truth, β0 . Therefore,
the overall expectation is given as
  
\[
\lambda(\beta,\beta_0) = \int\!\!\int S^F(z,\beta_0)\,p_{Z|C,G_C(Z)}(z\,|\,r,g_r,\beta)\,d\nu_{Z|C,G_C(Z)}(z\,|\,r,g_r)
 \times p_{C,G_C(Z)}(r,g_r,\beta_0)\,d\nu_{C,G_C(Z)}(r,g_r). \qquad (14.25)
\]

Because

pZ|C,GC (Z) (z|r, gr , β)pC,GC (Z) (r, gr , β)dνZ|C,GC (Z) (z|r, gr )dνC,GC (Z) (r, gr )
= pC,Z (r, z, β)dνC,Z (r, z),

we can write (14.25) as


  
\[
\int S^F(z,\beta_0)\left\{\frac{p_{C,Z}(r,z,\beta)}{p_{C,G_C(Z)}(r,g_r,\beta)}\right\}
 p_{C,G_C(Z)}(r,g_r,\beta_0)\,d\nu_{C,Z}(r,z). \qquad (14.26)
\]

Taking the derivative of (14.26) with respect to β and evaluating at β0 , we


obtain
  
\[
\begin{aligned}
\left.\frac{\partial\lambda(\beta,\beta_0)}{\partial\beta^T}\right|_{\beta=\beta_0}
 &= \int S^F(z,\beta_0)\left[\frac{\partial}{\partial\beta^T}
    \left\{\frac{p_{C,Z}(r,z,\beta)}{p_{C,G_C(Z)}(r,g_r,\beta)}\right\}\right]_{\beta=\beta_0}
    p_{C,G_C(Z)}(r,g_r,\beta_0)\,d\nu_{C,Z}(r,z)\\
 &= \int S^F(z,\beta_0)\left\{\frac{p_{C,Z}(r,z,\beta_0)}{p_{C,G_C(Z)}(r,g_r,\beta_0)}\right\}
    \left[\frac{\partial}{\partial\beta^T}\log\left\{\frac{p_{C,Z}(r,z,\beta)}{p_{C,G_C(Z)}(r,g_r,\beta)}\right\}\right]_{\beta=\beta_0}
    p_{C,G_C(Z)}(r,g_r,\beta_0)\,d\nu_{C,Z}(r,z). \qquad (14.27)
\end{aligned}
\]

Because the data are coarsened at random,

    \frac{p_{C,Z}(r, z, \beta)}{p_{C,G_C(Z)}(r, g_r, \beta)} = \frac{p_Z(z, \beta)}{p_{G_r(Z)}(g_r, \beta)}.    (14.28)

Therefore

    \left[ \frac{\partial \log\{p_{C,Z}(r, z, \beta)/p_{C,G_C(Z)}(r, g_r, \beta)\}}{\partial \beta^T} \right]_{\beta=\beta_0}
        = \frac{\partial \log p_Z(z, \beta_0)}{\partial \beta^T} - \frac{\partial \log p_{G_r(Z)}(g_r, \beta_0)}{\partial \beta^T}.    (14.29)

When data are coarsened at random, we showed in Lemma 7.2 that

    \frac{\partial \log p_{G_r(Z)}(g_r, \beta_0)}{\partial \beta^T}
      = E\left\{ S^{FT}(Z, \beta_0) \mid C = r,\, G_C(Z) = g_r \right\} = S^T(r, g_r, \beta_0).

Therefore, (14.29) is equal to

    S^{FT}(z, \beta_0) - S^T(r, g_r, \beta_0).

Substituting this last result into (14.27) and rearranging some terms yields

    \int S^F(z, \beta_0) \left[ S^{FT}(z, \beta_0) - S^T(r, g_r, \beta_0) \right] p_{C,Z}(r, z, \beta_0)\, d\nu_{C,Z}(r, z)
      = E\left\{ S^F(Z, \beta_0)\, S^{FT}(Z, \beta_0) \right\} - E\left[ S^F(Z, \beta_0)\, S^T\{C, G_C(Z), \beta_0\} \right].    (14.30)

Using (14.8), we obtain that

    E\left[ S^F(Z, \beta_0)\, S^T\{C, G_C(Z), \beta_0\} \right] = I(\beta_0).

Hence, (14.30) equals I^F(β_0) − I(β_0), proving Lemma 14.2.  □




Remark 6. The result of Lemma 14.2 can be deduced, with slight modification,
from equation (6) on page 480 of Oakes (1999). The term I F (β0 ) − I(β0 ) is
also referred to as the “missing information,” as this is the information that
is lost due to the coarsening of the data. 


Before giving the next lemma, we first give a short description of the notion
of “stochastic equicontinuity.”

Stochastic Equicontinuity

Consider the centered stochastic process

    W_n(\beta) = n^{-1/2} \sum_{i=1}^{n} \left[ m^{-1} \sum_{j=1}^{m} S^F\{Z_{ij}(\beta), \beta_0\} - \lambda(\beta, \beta_0) \right]

as a process in β, where λ(β, β_0) = E[S^F{Z_{ij}(β), β_0}].
Using the theory of empirical processes (see van der Vaart and Wellner,
1996), under suitable regularity conditions, the process W_n(β) converges
weakly to a tight Gaussian process. When this is the case, we have stochastic
equicontinuity, where, for every ε, η > 0, there exist a δ > 0 and an n_0 such
that

    P\left\{ \sup_{\beta', \beta'' : \|\beta' - \beta''\| < \delta} \| W_n(\beta') - W_n(\beta'') \| \geq \varepsilon \right\} \leq \eta

for all n > n_0, where "‖·‖" denotes the usual Euclidean norm or distance.

Lemma 14.3. If stochastic equicontinuity holds, then


Wn (β̂nI ) − Wn (β0 ) = op (1)

whenever β̂nI is a consistent estimator for β0 .


Proof. Choose arbitrary ε, η > 0. Then, by stochastic equicontinuity, there
exist a δ(ε, η) and an n(ε, η) such that

    P\left\{ \sup_{\beta', \beta'' : \|\beta' - \beta''\| < \delta(\varepsilon, \eta)} \| W_n(\beta') - W_n(\beta'') \| \geq \varepsilon \right\} \leq \eta/2

for all n > n(ε, η).
By consistency, there exists for every δ and η an n^*(δ, η) such that

    P\left( \| \hat\beta_n^I - \beta_0 \| \geq \delta \right) \leq \eta/2 \quad \text{for all } n > n^*(\delta, \eta).


Note that the event

    \left\{ \|\hat\beta_n^I - \beta_0\| < \delta(\varepsilon, \eta) \ \text{and}\
      \sup_{\beta', \beta'' : \|\beta' - \beta''\| < \delta(\varepsilon, \eta)} \|W_n(\beta') - W_n(\beta'')\| < \varepsilon \right\}
      \subseteq \left\{ \|W_n(\hat\beta_n^I) - W_n(\beta_0)\| < \varepsilon \right\}.

By taking complements, we obtain

    P\left\{ \|W_n(\hat\beta_n^I) - W_n(\beta_0)\| \geq \varepsilon \right\}
      \leq P\Big[ \left\{ \|\hat\beta_n^I - \beta_0\| \geq \delta(\varepsilon, \eta) \right\}
          \cup \left\{ \sup_{\beta', \beta'' : \|\beta' - \beta''\| < \delta(\varepsilon, \eta)} \|W_n(\beta') - W_n(\beta'')\| \geq \varepsilon \right\} \Big]
      \leq P\left\{ \|\hat\beta_n^I - \beta_0\| \geq \delta(\varepsilon, \eta) \right\}
        + P\left\{ \sup_{\beta', \beta'' : \|\beta' - \beta''\| < \delta(\varepsilon, \eta)} \|W_n(\beta') - W_n(\beta'')\| \geq \varepsilon \right\}.

By choosing n > max[n(ε, η), n^*{δ(ε, η), η}], we obtain

    P\left\{ \|W_n(\hat\beta_n^I) - W_n(\beta_0)\| \geq \varepsilon \right\} \leq \eta/2 + \eta/2 = \eta.  □



Lemma 14.4. The statistic (14.20), namely

    n^{-1/2} \sum_{i=1}^{n} \left[ m^{-1} \sum_{j=1}^{m} S^F\{Z_{ij}(\hat\beta_n^I), \beta_0\}
      - m^{-1} \sum_{j=1}^{m} S^F\{Z_{ij}(\beta_0), \beta_0\} \right],

is equal to

    n^{1/2}\{\lambda(\hat\beta_n^I, \beta_0) - \lambda(\beta_0, \beta_0)\} + o_p(1).

Proof. This follows directly as a consequence of Lemma 14.3, which states


that
Wn (β̂nI ) − Wn (β0 ) = op (1),
after some simple rearrangement of terms. 


Lemma 14.5.

    n^{1/2}\{\lambda(\hat\beta_n^I, \beta_0) - \lambda(\beta_0, \beta_0)\}
      = \left. \frac{\partial \lambda(\beta, \beta_0)}{\partial \beta^T} \right|_{\beta=\beta_0} n^{1/2}(\hat\beta_n^I - \beta_0) + o_p(1).

Proof. Follows from a standard delta method argument.  □




We now return to the proof of Theorem 14.3.

Proof of Theorem 14.3


Lemmas 14.4 and 14.5 imply that (14.20) is equal to

    \left. \frac{\partial \lambda(\beta, \beta_0)}{\partial \beta^T} \right|_{\beta=\beta_0} n^{1/2}(\hat\beta_n^I - \beta_0) + o_p(1).

The proof is complete because of Lemma 14.2, which states that

    \left. \frac{\partial \lambda(\beta, \beta_0)}{\partial \beta^T} \right|_{\beta=\beta_0} = I^F(\beta_0) - I(\beta_0),

and the definition of the influence function of β̂_n^I,

    n^{1/2}(\hat\beta_n^I - \beta_0) = n^{-1/2} \sum_{i=1}^{n} q\{C_i, G_{C_i}(Z_i)\} + o_p(1).  □

14.4 Asymptotic Distribution of the Multiple-Imputation Estimator
We are now in a position to derive the asymptotic distribution of β̂n∗ , which
we present in the following theorem:

Theorem 14.4. Let β̂_n^* denote the multiple-imputation estimator, and denote
the i-th influence function of the initial estimator by (14.13); i.e.,

    q\{C_i, G_{C_i}(Z_i)\} = \varphi_{\mathrm{eff}}\{C_i, G_{C_i}(Z_i)\} + h\{C_i, G_{C_i}(Z_i)\},

where φ_eff{C_i, G_{C_i}(Z_i)} is defined by (14.11). Then

    n^{1/2}(\hat\beta_n^* - \beta_0) \xrightarrow{D} N(0, \Sigma^*),

where Σ^* is equal to

    \{I(\beta_0)\}^{-1}
      + m^{-1} \{I^F(\beta_0)\}^{-1} \{I^F(\beta_0) - I(\beta_0)\} \{I^F(\beta_0)\}^{-1}
      + \{I^F(\beta_0)\}^{-1} \{I^F(\beta_0) - I(\beta_0)\}\, \mathrm{var}[h\{C_i, G_{C_i}(Z_i)\}]\,
        \{I^F(\beta_0) - I(\beta_0)\} \{I^F(\beta_0)\}^{-1}.    (14.31)

Proof. As a result of Lemma 14.1, Theorem 14.3, and equations (14.19) and
(14.20), we have shown that

    n^{1/2}(\hat\beta_n^* - \beta_0) = n^{-1/2} \{I^F(\beta_0)\}^{-1} \sum_{i=1}^{n}
      \left( \left[ m^{-1} \sum_{j=1}^{m} S^F\{Z_{ij}(\beta_0), \beta_0\} \right]
        + \{I^F(\beta_0) - I(\beta_0)\}\, q\{C_i, G_{C_i}(Z_i)\} \right) + o_p(1).    (14.32)

This is a key result, as we have identified the influence function for the
multiple-imputation estimator as the i-th element in the summand in (14.32).
Asymptotic normality follows from the central limit theorem. The asymptotic
variance of n1/2 (β̂n∗ − β0 ) is the variance of the influence function, which we
will now derive in a series of steps.
Toward that end, we first compute
    \mathrm{var}\left[ m^{-1} \sum_{j=1}^{m} S^F\{Z_{ij}(\beta_0), \beta_0\} \right].    (14.33)

Computing (14.33)

Using the law for iterated conditional variance, (14.33) can be written as

    E\left( \mathrm{var}\left[ m^{-1} \sum_{j=1}^{m} S^F\{Z_{ij}(\beta_0), \beta_0\} \,\middle|\, C_i, G_{C_i}(Z_i) \right] \right)    (14.34)

      + \mathrm{var}\left( E\left[ m^{-1} \sum_{j=1}^{m} S^F\{Z_{ij}(\beta_0), \beta_0\} \,\middle|\, C_i, G_{C_i}(Z_i) \right] \right).    (14.35)

Because, conditional on {C_i, G_{C_i}(Z_i)}, the Z_{ij}(β_0), j = 1, . . . , m, are independent
draws from the conditional distribution with conditional density
p_{Z|C,G_C(Z)}{z | C_i, G_{C_i}(Z_i), β_0}, this means that the conditional variance

    \mathrm{var}\left[ m^{-1} \sum_{j=1}^{m} S^F\{Z_{ij}(\beta_0), \beta_0\} \,\middle|\, C_i, G_{C_i}(Z_i) \right]

is equal to

    m^{-1}\, \mathrm{var}\left\{ S^F(Z_i, \beta_0) \mid C_i, G_{C_i}(Z_i) \right\}.

Therefore, (14.34) is equal to

    m^{-1}\, E\left[ \mathrm{var}\left\{ S^F(Z_i, \beta_0) \mid C_i, G_{C_i}(Z_i) \right\} \right].
In equation (14.6), we showed that

    E\left[ \mathrm{var}\left\{ S^F(Z_i, \beta_0) \mid C_i, G_{C_i}(Z_i) \right\} \right] = I^F(\beta_0) - I(\beta_0).

Consequently,

    (14.34) = m^{-1} \left\{ I^F(\beta_0) - I(\beta_0) \right\}.    (14.36)

Similar logic can be used to show that

    E\left[ m^{-1} \sum_{j=1}^{m} S^F\{Z_{ij}(\beta_0), \beta_0\} \,\middle|\, C_i, G_{C_i}(Z_i) \right]
      = E\left\{ S^F(Z_i, \beta_0) \mid C_i, G_{C_i}(Z_i) \right\}
      = S\{C_i, G_{C_i}(Z_i)\}.    (14.37)

Therefore, (14.35) is equal to

    \mathrm{var}\left[ S\{C_i, G_{C_i}(Z_i)\} \right] = I(\beta_0).    (14.38)

Combining (14.36) and (14.38) gives us that

    (14.33) = m^{-1} \left\{ I^F(\beta_0) - I(\beta_0) \right\} + I(\beta_0).    (14.39)
Next we compute the covariance matrix

    E\left( \left[ m^{-1} \sum_{j=1}^{m} S^F\{Z_{ij}(\beta_0), \beta_0\} \right]
      \left[ \left\{ I^F(\beta_0) - I(\beta_0) \right\} q\{C_i, G_{C_i}(Z_i)\} \right]^T \right).    (14.40)

Computing (14.40)

Expression (14.40) can be written as

    m^{-1} \sum_{j=1}^{m} E\left[ S^F\{Z_{ij}(\beta_0), \beta_0\}\, q^T\{C_i, G_{C_i}(Z_i)\} \right] \left\{ I^F(\beta_0) - I(\beta_0) \right\},

where

    E\left[ S^F\{Z_{ij}(\beta_0), \beta_0\}\, q^T\{C_i, G_{C_i}(Z_i)\} \right]
      = E\left( E\left[ S^F\{Z_{ij}(\beta_0), \beta_0\}\, q^T\{C_i, G_{C_i}(Z_i)\} \,\middle|\, C_i, G_{C_i}(Z_i) \right] \right)
      = E\left( E\left[ S^F\{Z_{ij}(\beta_0), \beta_0\} \,\middle|\, C_i, G_{C_i}(Z_i) \right] q^T\{C_i, G_{C_i}(Z_i)\} \right)
      = E\left[ S\{C_i, G_{C_i}(Z_i)\}\, q^T\{C_i, G_{C_i}(Z_i)\} \right]
        \quad \text{(we showed in (14.37) that the inner expectation equals } S\{C_i, G_{C_i}(Z_i)\}\text{)}
      = I^{q \times q} \ \text{(the identity matrix), by (14.12)}.

Therefore,

    (14.40) = I^F(\beta_0) - I(\beta_0).    (14.41)
Finally,

    \mathrm{var}\left[ \{I^F(\beta_0) - I(\beta_0)\}\, q\{C_i, G_{C_i}(Z_i)\} \right]
      = \{I^F(\beta_0) - I(\beta_0)\}\, \mathrm{var}\left[ q\{C_i, G_{C_i}(Z_i)\} \right] \{I^F(\beta_0) - I(\beta_0)\}    (14.42)
      = \{I^F(\beta_0) - I(\beta_0)\} \left[ I^{-1}(\beta_0) + \mathrm{var}[h\{C_i, G_{C_i}(Z_i)\}] \right] \{I^F(\beta_0) - I(\beta_0)\},    (14.43)

where (14.43) follows from (14.14).


Using (14.39), (14.41), and (14.43), and after some straightforward algebraic
manipulation, we are able to derive the variance matrix for the influence
function of β̂_n^* (i.e., the i-th summand in (14.32)) to be

    \Sigma^* = \{I(\beta_0)\}^{-1}
      + m^{-1} \{I^F(\beta_0)\}^{-1} \{I^F(\beta_0) - I(\beta_0)\} \{I^F(\beta_0)\}^{-1}
      + \{I^F(\beta_0)\}^{-1} \{I^F(\beta_0) - I(\beta_0)\}\, \mathrm{var}[h\{C_i, G_{C_i}(Z_i)\}]\,
        \{I^F(\beta_0) - I(\beta_0)\} \{I^F(\beta_0)\}^{-1}.

This completes the proof of Theorem 14.4.  □




Examining (14.32), we conclude that the influence function for the multiple-
imputation estimator, with m imputation draws, is equal to

    \{I^F(\beta_0)\}^{-1} \left[ m^{-1} \sum_{j=1}^{m} S^F\{Z_j(\beta_0), \beta_0\}
      + \{I^F(\beta_0) - I(\beta_0)\}\, q\{C, G_C(Z)\} \right],    (14.44)

where Z_j(β_0) denotes the j-th random draw from the conditional distribution
of p_{Z|C,G_C(Z)}{z | C, G_C(Z), β_0}. As a consequence of the law of large numbers,
we would expect, under suitable regularity conditions, that

    m^{-1} \sum_{j=1}^{m} S^F\{Z_j(\beta_0), \beta_0\} \xrightarrow{P} E\{S^F(Z) \mid C, G_C(Z)\} = S\{C, G_C(Z)\}
as m → ∞. Specifically, in order to prove that

    m^{-1} \sum_{j=1}^{m} S^F\{Z_j(\beta_0), \beta_0\} - S\{C, G_C(Z)\} \xrightarrow{P} 0,

it suffices to show that

    E\left\| m^{-1} \sum_{j=1}^{m} S^F\{Z_j(\beta_0), \beta_0\} - S\{C, G_C(Z)\} \right\|^2 \xrightarrow{m \to \infty} 0.    (14.45)

Computing the expectation above by first conditioning on {C, G_C(Z)}, we
obtain that

    E\left\| m^{-1} \sum_{j=1}^{m} S^F\{Z_j(\beta_0), \beta_0\} - S\{C, G_C(Z)\} \right\|^2
      = E\left[ m^{-1}\, \mathrm{var}\left\{ S^F(Z) \mid C, G_C(Z) \right\} \right].

So, for example, if the conditional variance var{S^F(Z) | C, G_C(Z)} is bounded
almost surely, then (14.45) would hold. Consequently, as m → ∞, the influence
function of the multiple-imputation estimator (14.44) converges to

    \{I^F(\beta_0)\}^{-1} \left[ S\{C, G_C(Z)\} + \{I^F(\beta_0) - I(\beta_0)\}\, q\{C, G_C(Z)\} \right].    (14.46)

As we will now demonstrate, the limit, as m → ∞, of the multiple-


imputation estimator is related to the expectation maximization (EM) al-
gorithm. The EM algorithm, first introduced by Dempster, Laird, and Rubin
(1977) and studied carefully by Wu (1983), is a numerical method for find-
ing the MLE that is especially useful with missing-data problems. The EM
algorithm is an iterative procedure where the expectation of the full-data log-
likelihood is computed with respect to a current value of a parameter estimate
and then an updated estimate for the parameter is obtained by maximizing
the expected full-data log-likelihood. Hence, if β̂_n^{(j)} is the value of the estimator
at the j-th iteration, then the (j + 1)-th value of the estimator is obtained
by solving the estimating equation

    \sum_{i=1}^{n} E\left\{ S^F(Z, \hat\beta_n^{(j+1)}) \,\middle|\, C_i, G_{C_i}(Z_i), \hat\beta_n^{(j)} \right\} = 0.    (14.47)
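
As a rough illustration of the update (14.47), the sketch below approximates the E-step conditional expectation by Monte Carlo, averaging the full-data score over repeated draws from p_{Z|C,G_C(Z)}{z | C_i, G_{C_i}(Z_i), β̂_n^{(j)}}; this Monte Carlo version is in the spirit of the m → ∞ limit of multiple imputation discussed next, not the exact E-step. The callables draw_conditional, score_full, and solve_root are hypothetical placeholders, not part of the text.

```python
import numpy as np

def em_one_step(observed_data, beta_current, draw_conditional, score_full,
                solve_root, n_draws=500, rng=None):
    # One EM-style update in the spirit of (14.47): the conditional expectation
    # of the full-data score is approximated by Monte Carlo, using draws from
    # the imputation density evaluated at beta_current.
    rng = np.random.default_rng() if rng is None else rng

    def expected_score(beta):
        total = 0.0
        for obs_i in observed_data:
            draws = [draw_conditional(obs_i, beta_current, rng) for _ in range(n_draws)]
            total = total + np.mean([score_full(z, beta) for z in draws], axis=0)
        return total

    # solve_root is a caller-supplied root finder in beta for the expected score.
    return solve_root(expected_score)
```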

We now show the relationship between the EM algorithm and the multiple-
imputation estimator in the following theorem.
Theorem 14.5. Let the one-step updated EM estimator, β̂_n^{EM}, be the solution to

    \sum_{i=1}^{n} E\left\{ S^F(Z, \hat\beta_n^{EM}) \,\middle|\, C_i, G_{C_i}(Z_i), \hat\beta_n^I \right\} = 0,    (14.48)

where β̂nI is the initial estimator for β, whose i-th influence function is
q{Ci , GCi (Zi )}, which was used for imputation. The influence function of β̂nEM
is identically equal to (14.46), the limiting influence function for the multiple-
imputation estimator as m → ∞.

Proof. A simple expansion of β̂nEM about β0 , keeping β̂nI fixed, in (14.48) yields
    0 = n^{-1/2} \sum_{i=1}^{n} E\left\{ S^F(Z, \beta_0) \,\middle|\, C_i, G_{C_i}(Z_i), \hat\beta_n^I \right\}
        + \left[ n^{-1} \sum_{i=1}^{n} E\left\{ \frac{\partial S^F(Z, \beta_0)}{\partial \beta^T} \,\middle|\, C_i, G_{C_i}(Z_i), \beta_0 \right\} \right] n^{1/2}(\hat\beta_n^{EM} - \beta_0) + o_p(1).

Because

    n^{-1} \sum_{i=1}^{n} E\left\{ \frac{\partial S^F(Z, \beta_0)}{\partial \beta^T} \,\middle|\, C_i, G_{C_i}(Z_i), \beta_0 \right\}
      \xrightarrow{P} E\left[ E\left\{ \frac{\partial S^F(Z, \beta_0)}{\partial \beta^T} \,\middle|\, C, G_C(Z), \beta_0 \right\} \right]
      = E\left\{ \frac{\partial S^F(Z, \beta_0)}{\partial \beta^T} \right\} = -I^F(\beta_0),

we obtain

    n^{1/2}(\hat\beta_n^{EM} - \beta_0) = \{I^F(\beta_0)\}^{-1} n^{-1/2} \sum_{i=1}^{n}
      E\left\{ S^F(Z, \beta_0) \,\middle|\, C_i, G_{C_i}(Z_i), \hat\beta_n^I \right\} + o_p(1).    (14.49)
An expansion of β̂_n^I about β_0 on the right-hand side of (14.49) yields

    n^{-1/2} \sum_{i=1}^{n} E\left\{ S^F(Z, \beta_0) \,\middle|\, C_i, G_{C_i}(Z_i), \hat\beta_n^I \right\}
      = n^{-1/2} \sum_{i=1}^{n} E\left\{ S^F(Z, \beta_0) \,\middle|\, C_i, G_{C_i}(Z_i), \beta_0 \right\}    (14.50)
      + \left[ n^{-1} \sum_{i=1}^{n} \frac{\partial}{\partial \beta^T} E\{S^F(Z, \beta_0) \mid C_i, G_{C_i}(Z_i), \beta\} \Big|_{\beta=\beta_0} \right] n^{1/2}(\hat\beta_n^I - \beta_0) + o_p(1).

The sample average satisfies

    n^{-1} \sum_{i=1}^{n} \frac{\partial}{\partial \beta^T} E\{S^F(Z, \beta_0) \mid C_i, G_{C_i}(Z_i), \beta\} \Big|_{\beta=\beta_0}
      \xrightarrow{P} E\left( \frac{\partial}{\partial \beta^T} E\{S^F(Z, \beta_0) \mid C_i, G_{C_i}(Z_i), \beta\} \Big|_{\beta=\beta_0} \right),    (14.51)

where

    \frac{\partial}{\partial \beta^T} E\{S^F(Z, \beta_0) \mid C_i, G_{C_i}(Z_i), \beta\} \Big|_{\beta=\beta_0}
      = \frac{\partial}{\partial \beta^T} \left\{ \frac{\int_{z: G_{C_i}(z) = G_{C_i}(Z_i)} S^F(z, \beta_0)\, p(z, \beta)\, d\nu(z)}
                                                  {\int_{z: G_{C_i}(z) = G_{C_i}(Z_i)} p(z, \beta)\, d\nu(z)} \right\} \Bigg|_{\beta=\beta_0}.    (14.52)

Straightforward calculations yield that the derivative in (14.52) is equal to
var{S^F(Z) | C_i, G_{C_i}(Z_i)}, and hence (14.51) is equal to

    E\left[ \mathrm{var}\left\{ S^F(Z) \mid C_i, G_{C_i}(Z_i) \right\} \right] = I^F(\beta_0) - I(\beta_0),    (14.53)

where the last equality follows from (14.6). Using (14.4), we obtain that (14.50)
is equal to

    n^{-1/2} \sum_{i=1}^{n} S\{C_i, G_{C_i}(Z_i)\}.    (14.54)

By the definition of an influence function,

    n^{1/2}(\hat\beta_n^I - \beta_0) = n^{-1/2} \sum_{i=1}^{n} q\{C_i, G_{C_i}(Z_i)\} + o_p(1).    (14.55)

Combining equations (14.49)–(14.55), we conclude that the influence function


of β̂nEM is equal to (14.46), thus concluding the proof. 


In (14.13), we expressed the influence function of β̂nI as the sum of

{I(β0 )}−1 S{C, GC (Z)} + h{C, GC (Z)}, (14.56)

where h{C, GC (Z)} is orthogonal to S{C, GC (Z)}; i.e., E(hS T ) = 0q×q . Sub-
stituting (14.56) for q{C, GC (Z)} in (14.46), we obtain another representation
of the influence function for β̂nEM as

ϕeff {C, GC (Z)} + {I F (β0 )}−1 {I F (β0 ) − I(β0 )}h{C, GC (Z)}, (14.57)

where
ϕeff {C, GC (Z)} = {I(β0 )}−1 S{C, GC (Z)}
is the efficient observed-data influence function.
We note that the influence function in (14.57) is that for a one-step updated
EM estimator that started with the initial estimator β̂nI . Using similar logic,
we would conclude that the EM algorithm after j iterations would yield an
estimator β̂_n^{EM(j)} with influence function

    \varphi_{\mathrm{eff}}\{C, G_C(Z)\} + J^j h\{C, G_C(Z)\},

where J is the q × q matrix {I F (β0 )}−1 {I F (β0 ) − I(β0 )}. For completeness,
we now show that J j will converge to zero (in the sense that all elements in
the matrix will converge to zero) as j goes to infinity, thus demonstrating that
the EM algorithm will converge to the efficient observed-data estimator. This
proof is taken from Lemma A.1 of Wang and Robins (1998).

Proof. Both I F (β0 ) and {I F (β0 ) − I(β0 )} are symmetric positive-definite


matrices. From Rao (1973, p. 41), we can express I F (β0 ) = RT R and
{I F (β0 ) − I(β0 )} = RT ΛR, where R is a nonsingular matrix and Λ is a
diagonal matrix. Moreover, because

I(β0 ) = I F (β0 ) − {I F (β0 ) − I(β0 )} = RT (I q×q − Λ)R

is positive definite, this implies that all the elements on the diagonal of Λ
must be strictly less than 1. Consequently,

J j = (R)−1 Λj R

will converge to zero as j → ∞. 




Remarks regarding the asymptotic variance

1. The asymptotic variance of the observed-data efficient estimator for β is


{I(β0 )}−1 . Because (14.31) is greater than {I(β0 )}−1 , this implies that
the multiple-imputation estimator is not fully efficient.
2. Even if the initial estimator β̂nI is efficient (i.e., h{Ci , GCi (Zi )} = 0), the
resulting multiple-imputation estimator loses some efficiency due to the
contribution to the variance of the second term of (14.31), but this loss
of efficiency vanishes as m → ∞.
3. For any initial estimator β̂nI , as the number of multiple-imputation draws
“m” gets larger, the asymptotic variance decreases (from the contribution
of the second term of (14.31)). Thus the resulting estimator becomes more
efficient with increasing m.
4. We showed in Theorem 14.5 that as m goes to infinity, the resulting
multiple-imputation estimator is equivalent to a one-step update of an
EM algorithm. This suggests that one strategy is that after m imputa-
tions, we start the process again, now using β̂n∗ as the initial estimator.
By continuing this process, we can iterate toward full efficiency. To imple-
ment such a strategy, m must be chosen sufficiently large to ensure that
the new estimator is more efficient than the previous one. As we converge
toward efficiency, m must get increasingly large.
5. Other algebraically equivalent representations of the asymptotic variance,
   which we will use later, are given by

    \{I^F(\beta_0)\}^{-1} \Big[ I(\beta_0) + m^{-1}\{I^F(\beta_0) - I(\beta_0)\} + 2\{I^F(\beta_0) - I(\beta_0)\}
      + \{I^F(\beta_0) - I(\beta_0)\}\, \mathrm{var}[q\{C_i, G_{C_i}(Z_i)\}]\, \{I^F(\beta_0) - I(\beta_0)\} \Big] \{I^F(\beta_0)\}^{-1}    (14.58)

   and

    \{I^F(\beta_0)\}^{-1} + \left(\frac{m+1}{m}\right) \{I^F(\beta_0)\}^{-1} \{I^F(\beta_0) - I(\beta_0)\} \{I^F(\beta_0)\}^{-1}
      + \{I^F(\beta_0)\}^{-1} \{I^F(\beta_0) - I(\beta_0)\}\, \mathrm{var}[q\{C_i, G_{C_i}(Z_i)\}]\,
        \{I^F(\beta_0) - I(\beta_0)\} \{I^F(\beta_0)\}^{-1}.    (14.59)

14.5 Estimating the Asymptotic Variance

In this section, we will consider estimators for the asymptotic variance of the
frequentist (type B) multiple-imputation estimator. Although Rubin (1987)
refers to this type of imputation as improper and does not advocate using his
intuitive variance estimator in such cases, my experience has been that, in
practice, many statisticians do not distinguish between proper and improper
imputation and will often use Rubin’s variance formula. Therefore, we begin
this section by studying the properties of Rubin’s variance formula when used
with frequentist multiple imputation.
Rubin suggested the following estimator for the asymptotic variance of
n^{1/2}(β̂_n^* − β_0):

    \left[ m^{-1} \sum_{j=1}^{m} n^{-1} \sum_{i=1}^{n}
        \frac{-\partial S^F\{Z_{ij}(\hat\beta_n^I), \hat\beta_{nj}^*\}}{\partial \beta^T} \right]^{-1}
      + \left( \frac{m+1}{m} \right) n \sum_{j=1}^{m}
        \frac{(\hat\beta_{nj}^* - \hat\beta_n^*)(\hat\beta_{nj}^* - \hat\beta_n^*)^T}{m-1}.    (14.60)

It is easy to see that the first term in (14.60) converges in probability to

    \left[ E\left\{ \frac{-\partial S^F(Z, \beta_0)}{\partial \beta^T} \right\} \right]^{-1} = \{I^F(\beta_0)\}^{-1}.

Results regarding the second term of (14.60) are given in the following
theorem.
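
For reference, here is a minimal numpy sketch of the combining rule (14.60), assuming the j-th imputed-data analysis reports its point estimate β̂*_{nj} and the averaged observed information n^{-1} Σ_i {−∂S^F/∂β^T}; the input layout is an assumption of the sketch.

```python
import numpy as np

def rubin_variance(beta_js, info_js, n):
    # Rubin's estimator (14.60) for the asymptotic variance of n^{1/2}(beta*_n - beta_0).
    # beta_js : (m, q) array of per-imputation estimates beta*_{nj}
    # info_js : (m, q, q) array of per-imputation averaged observed information
    #           matrices, n^{-1} sum_i -dS^F/dbeta^T
    # n       : sample size
    m, q = beta_js.shape
    beta_bar = beta_js.mean(axis=0)                     # beta*_n
    first = np.linalg.inv(info_js.mean(axis=0))         # converges to {I^F(beta_0)}^{-1}
    diffs = beta_js - beta_bar
    between = n * (diffs.T @ diffs) / (m - 1)           # scaled between-imputation covariance
    return first + (m + 1) / m * between                # equation (14.60)
```

The same per-imputation estimates and information matrices reappear in the consistent variance estimator of this section, so in practice they are worth storing from each imputed-data fit.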

Theorem 14.6.

    E\left\{ n \sum_{j=1}^{m} \left( \hat\beta_{nj}^* - \hat\beta_n^* \right) \left( \hat\beta_{nj}^* - \hat\beta_n^* \right)^T \right\}
      \xrightarrow{n \to \infty} (m-1) \{I^F(\beta_0)\}^{-1} \{I^F(\beta_0) - I(\beta_0)\} \{I^F(\beta_0)\}^{-1}.

Proof. Examining the influence function of n^{1/2}(β̂_n^* − β_0), derived in (14.32),
we conclude that

    n^{1/2}(\hat\beta_{nj}^* - \hat\beta_n^*) = n^{1/2}(\hat\beta_{nj}^* - \beta_0) - n^{1/2}(\hat\beta_n^* - \beta_0)
      = n^{-1/2} \{I^F(\beta_0)\}^{-1} \sum_{i=1}^{n} \left[ S^F\{Z_{ij}(\beta_0), \beta_0\} - \bar{S}_i^F(\beta_0) \right] + o_p(1),

where

    \bar{S}_i^F(\beta_0) = m^{-1} \sum_{j=1}^{m} S^F\{Z_{ij}(\beta_0), \beta_0\}.

(Notice that the terms involving q{C_i, G_{C_i}(Z_i)} in (14.32) cancel out.)
Therefore,

    n(\hat\beta_{nj}^* - \hat\beta_n^*)(\hat\beta_{nj}^* - \hat\beta_n^*)^T
      = n^{-1} \{I^F(\beta_0)\}^{-1}
        \left[ \sum_{i=1}^{n} \left( S^F\{Z_{ij}(\beta_0), \beta_0\} - \bar{S}_i^F(\beta_0) \right) \right]
        \left[ \sum_{i=1}^{n} \left( S^F\{Z_{ij}(\beta_0), \beta_0\} - \bar{S}_i^F(\beta_0) \right) \right]^T
        \{I^F(\beta_0)\}^{-1} + o_p(1).

Because the quantities inside the two sums above are independent across
i = 1, . . . , n, we obtain

    E\left\{ n(\hat\beta_{nj}^* - \hat\beta_n^*)(\hat\beta_{nj}^* - \hat\beta_n^*)^T \right\} \rightarrow
      E\left[ \{I^F(\beta_0)\}^{-1} \left( S^F\{Z_{ij}(\beta_0), \beta_0\} - \bar{S}_i^F(\beta_0) \right)
        \left( S^F\{Z_{ij}(\beta_0), \beta_0\} - \bar{S}_i^F(\beta_0) \right)^T \{I^F(\beta_0)\}^{-1} \right]

and

    E\left\{ n \sum_{j=1}^{m} (\hat\beta_{nj}^* - \hat\beta_n^*)(\hat\beta_{nj}^* - \hat\beta_n^*)^T \right\} \rightarrow
      \{I^F(\beta_0)\}^{-1} E\left( \sum_{j=1}^{m} \left[ S^F\{Z_{ij}(\beta_0), \beta_0\} - \bar{S}_i^F(\beta_0) \right]
        \left[ S^F\{Z_{ij}(\beta_0), \beta_0\} - \bar{S}_i^F(\beta_0) \right]^T \right) \{I^F(\beta_0)\}^{-1}.    (14.61)

The expectation in (14.61) is evaluated by

    E\left( \sum_{j=1}^{m} \left[ S^F\{Z_{ij}(\beta_0), \beta_0\} - \bar{S}_i^F(\beta_0) \right]
      \left[ S^F\{Z_{ij}(\beta_0), \beta_0\} - \bar{S}_i^F(\beta_0) \right]^T \right)
      = E\left( \sum_{j=1}^{m} S^F\{Z_{ij}(\beta_0), \beta_0\}\, S^{FT}\{Z_{ij}(\beta_0), \beta_0\} \right)    (14.62)
        - \frac{1}{m} \sum_{j=1}^{m} \sum_{j'=1}^{m} E\left[ S^F\{Z_{ij}(\beta_0), \beta_0\}\, S^{FT}\{Z_{ij'}(\beta_0), \beta_0\} \right].

When j = j',

    E\left[ S^F\{Z_{ij}(\beta_0), \beta_0\}\, S^{FT}\{Z_{ij'}(\beta_0), \beta_0\} \right]
      = E\left\{ S^F(Z_i, \beta_0)\, S^{FT}(Z_i, \beta_0) \right\} = I^F(\beta_0),

whereas when j ≠ j',

    E\left[ S^F\{Z_{ij}(\beta_0), \beta_0\}\, S^{FT}\{Z_{ij'}(\beta_0), \beta_0\} \right]
      = \mathrm{cov}\left[ S^F\{Z_{ij}(\beta_0), \beta_0\},\, S^F\{Z_{ij'}(\beta_0), \beta_0\} \right]
      = E\left( \mathrm{cov}\left[ S^F\{Z_{ij}(\beta_0), \beta_0\},\, S^F\{Z_{ij'}(\beta_0), \beta_0\} \,\middle|\, C_i, G_{C_i}(Z_i) \right] \right)    (14.63)
        + \mathrm{cov}\left( E\left[ S^F\{Z_{ij}(\beta_0), \beta_0\} \,\middle|\, C_i, G_{C_i}(Z_i) \right],\,
                             E\left[ S^F\{Z_{ij'}(\beta_0), \beta_0\} \,\middle|\, C_i, G_{C_i}(Z_i) \right] \right).

Because, conditional on C_i, G_{C_i}(Z_i), the Z_{ij}(β_0) are independent draws
for different j from the conditional density p_{Z|C,G_C(Z)}{z | C_i, G_{C_i}(Z_i)}, this
means that the first term of (14.63) (the conditional covariance) is zero. Because
E[S^F{Z_{ij}(β_0), β_0} | C_i, G_{C_i}(Z_i)] = S{C_i, G_{C_i}(Z_i)} for all j = 1, . . . , m, the
second term of (14.63) is

    \mathrm{var}\left[ S\{C_i, G_{C_i}(Z_i)\} \right] = I(\beta_0).

Thus, from these results, we obtain that (14.62) equals

    m I^F(\beta_0) - \frac{1}{m}\left\{ m I^F(\beta_0) + m(m-1) I(\beta_0) \right\}
      = (m-1)\left\{ I^F(\beta_0) - I(\beta_0) \right\}.    (14.64)

Finally, using (14.61), we obtain

    E\left\{ n \sum_{j=1}^{m} \left( \hat\beta_{nj}^* - \hat\beta_n^* \right) \left( \hat\beta_{nj}^* - \hat\beta_n^* \right)^T \right\}
      \rightarrow (m-1) \{I^F(\beta_0)\}^{-1} \{I^F(\beta_0) - I(\beta_0)\} \{I^F(\beta_0)\}^{-1}.    (14.65)

This completes the proof of Theorem 14.6.  □



Consequently, Rubin’s estimator for the variance, (14.60), is an unbiased
(asymptotically) estimator for
14.5 Estimating the Asymptotic Variance 365
 
m+1 ' (−1 ' F (' (−1
{I F (β0 )}−1 + I F (β0 ) I (β0 ) − I(β0 ) I F (β0 ) .
m
Comparing this with the asymptotic variance of n1/2 (β̂n∗ −β0 ) given by (14.59),
we see that Rubin’s formula underestimates the asymptotic variance for the
frequentist type B multiple-imputation estimator.
Remark 7. We wish to note that the first term in Rubin's variance estimator
is indeed a consistent estimator for {I^F(β_0)}^{-1}; that is, it converges in
probability as n → ∞. The second term, however, namely

    \left( \frac{m+1}{m} \right) n \sum_{j=1}^{m}
      \frac{(\hat\beta_{nj}^* - \hat\beta_n^*)(\hat\beta_{nj}^* - \hat\beta_n^*)^T}{m-1},

is an asymptotically unbiased estimator for

    \left( \frac{m+1}{m} \right) \{I^F(\beta_0)\}^{-1} \{I^F(\beta_0) - I(\beta_0)\} \{I^F(\beta_0)\}^{-1}.

That is, its expectation converges as n → ∞ for m fixed. However, this
second term is not a consistent estimator but rather converges to a proper
distribution. Consistency is only obtained by also letting m → ∞.  □

Nonetheless, a consistent estimator for the asymptotic variance of n1/2 (β̂n∗ −
β0 ) can be derived as follows.

Consistent Estimator for the Asymptotic Variance


Because β̂_n^I was assumed to be an RAL estimator for β with influence function
q{C, G_C(Z)}, this implies that

    n^{1/2}(\hat\beta_n^I - \beta_0) \rightarrow N\left( 0, \mathrm{var}[q\{C, G_C(Z)\}] \right).

Suppose we have a consistent estimator for the asymptotic variance of our
initial estimator β̂_n^I, which we denote as \widehat{\mathrm{var}}[q\{C, G_C(Z)\}]. Then, if we can
construct consistent estimators for I^F(β_0) and I(β_0), we can substitute these
into (14.59) to obtain a consistent estimator of the asymptotic variance of the
multiple-imputation estimator.
As we already discussed in the context of Rubin’s variance formula, a
consistent estimator for I F (β0 ) can be derived by
    \hat{I}_n^F(\beta_0) = -m^{-1} \sum_{j=1}^{m} n^{-1} \sum_{i=1}^{n}
      \frac{\partial S^F\{Z_{ij}(\hat\beta_n^I), \hat\beta_{nj}^*\}}{\partial \beta^T}.    (14.66)

(The inner average is the observed full-data information matrix, which is often
derived for us in the j-th imputed full-data analysis.)

A consistent estimator for I(β_0) can be obtained by

    \hat{I}_n(\beta_0) = n^{-1} \sum_{i=1}^{n} \{m(m-1)\}^{-1}
      \sum_{\substack{j, j' = 1, \ldots, m \\ j \neq j'}}
      S^F\{Z_{ij}(\hat\beta_n^I), \hat\beta_{nj}^*\}\, S^{FT}\{Z_{ij'}(\hat\beta_n^I), \hat\beta_{nj'}^*\}.    (14.67)

This follows directly from (14.63). Another estimator for {I^F(β_0) − I(β_0)} is
motivated by the relationship (14.6), which states that

    I^F(\beta_0) - I(\beta_0) = E\left[ \mathrm{var}\left\{ S^F\{Z_{ij}(\beta_0), \beta_0\} \mid C_i, G_{C_i}(Z_i) \right\} \right].

This suggests using

    n^{-1} \sum_{i=1}^{n} (m-1)^{-1} \sum_{j=1}^{m}
      \left[ S^F\{Z_{ij}(\hat\beta_n^I), \hat\beta_{nj}^*\} - \bar{S}_i^F(\hat\beta_n^*) \right]
      \left[ S^F\{Z_{ij}(\hat\beta_n^I), \hat\beta_{nj}^*\} - \bar{S}_i^F(\hat\beta_n^*) \right]^T    (14.68)

as an estimator for I^F(β_0) − I(β_0).
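
A minimal numpy sketch of this plug-in variance estimator follows, assuming the imputed-data analyses return the per-subject, per-imputation scores S^F{Z_ij(β̂_n^I), β̂*_nj} and the per-imputation averaged information matrices; (14.66) and (14.67) are computed and then substituted into representation (14.59). The array shapes and the availability of a consistent estimate of var[q{C, G_C(Z)}] are assumptions of the sketch.

```python
import numpy as np

def mi_asymptotic_variance(scores, infos, var_q_hat):
    # scores    : (n, m, q) array; scores[i, j] = S^F{Z_ij(beta_I), beta*_nj}
    # infos     : (m, q, q) array of per-imputation averaged observed information
    # var_q_hat : (q, q) consistent estimate of var[q{C, G_C(Z)}]
    n, m, q = scores.shape
    I_F = infos.mean(axis=0)                               # (14.66)
    s_sum = scores.sum(axis=1)                             # sum over j, shape (n, q)
    outer_all = np.einsum('iq,ip->qp', s_sum, s_sum)       # sum_i (sum_j s_ij)(sum_j s_ij)^T
    outer_same = np.einsum('ijq,ijp->qp', scores, scores)  # sum_i sum_j s_ij s_ij^T
    I_obs = (outer_all - outer_same) / (n * m * (m - 1))   # (14.67)
    I_F_inv = np.linalg.inv(I_F)
    diff = I_F - I_obs                                     # estimates I^F - I
    # Plug the pieces into representation (14.59).
    return (I_F_inv
            + (m + 1) / m * I_F_inv @ diff @ I_F_inv
            + I_F_inv @ diff @ var_q_hat @ diff @ I_F_inv)
```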

14.6 Proper Imputation


In Bayesian imputation, the j-th imputation is obtained by sampling from
pZ|C,GC (Z) {z|Ci , GCi (Zi ), β (j) }, where β (j) itself is sampled from some distri-
bution. Rubin (1978b) suggests sampling β (j) from the posterior distribution
p{β|Ci , GCi (Zi )}. This is equivalent to drawing the imputation from the pre-
dictive distribution p{z|Ci , GCi (Zi )}.

Remark 8. This logic seems a bit circular since Bayesian inference is based on
deriving the posterior distribution of β given the observed data. Under suitable
regularity conditions on the choice of the prior distribution, the posterior mean
or mode of β is generally an efficient estimator for β. Therefore, using proper
imputation, where we draw from the posterior distribution of β, we start with
an efficient estimator and, after imputing m full data sets, we end up with an
estimator that is not efficient.  

When the sample size is large, the posterior distribution of the parameter
and the sampling distribution of the estimator are closely approximated by
each other. The initial estimator β̂nI was assumed to be asymptotically normal;
that is,

    n^{1/2}(\hat\beta_n^I - \beta_0) \sim N\left( 0, \mathrm{var}[q\{C, G_C(Z)\}] \right),

where q{C, G_C(Z)} denotes the influence function of β̂_n^I. Therefore, mimicking
the idea of Bayesian imputation, instead of fixing the value β̂_n^I for each of
the m imputations, at the j-th imputation we sample β^{(j)} from

    N\left( \hat\beta_n^I,\; \frac{\widehat{\mathrm{var}}[q\{C, G_C(Z)\}]}{n} \right),

where \widehat{\mathrm{var}}[q\{C, G_C(Z)\}] is a consistent estimator for the asymptotic variance,
and then randomly choose Z_{ij}(β^{(j)}) from the conditional distribution with
conditional density

    p_{Z|C,G_C(Z)}\{z \mid C_i, G_{C_i}(Z_i), \beta^{(j)}\}.

Remark 9. If β̂nI were efficient, say the MLE, then this would approximate
sampling the β’s from the posterior distribution and the Z’s from the predic-
tive distribution. 


Using this approach, the j-th imputed estimator β̂_{nj}^* is the solution to the equation

    \sum_{i=1}^{n} S^F\{Z_{ij}(\beta^{(j)}), \hat\beta_{nj}^*\} = 0,

and the final multiple-imputation estimator is

    \hat\beta_n^* = m^{-1} \sum_{j=1}^{m} \hat\beta_{nj}^*.
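
A minimal sketch of this "proper"-style scheme, reusing the same hypothetical helpers as the improper version but drawing a fresh β^{(j)} from N(β̂_n^I, v̂ar[q]/n) before each imputation:

```python
import numpy as np

def proper_multiple_imputation(observed_data, beta_init, var_q_hat, n,
                               draw_conditional, solve_full_data, m=10, rng=None):
    # "Proper"-style (type A) imputation: a new beta^{(j)} is drawn from
    # N(beta_init, var_q_hat / n) before imputing the j-th data set.
    # draw_conditional and solve_full_data are the same hypothetical helpers
    # as in the improper sketch (assumptions, not defined in the text).
    rng = np.random.default_rng() if rng is None else rng
    estimates = []
    for _ in range(m):
        beta_j = rng.multivariate_normal(beta_init, var_q_hat / n)
        full_data = [draw_conditional(obs_i, beta_j, rng) for obs_i in observed_data]
        estimates.append(solve_full_data(full_data))
    return np.mean(np.asarray(estimates), axis=0)          # beta*_n
```

The only change relative to the improper sketch is the extra draw of β^{(j)}, which is exactly the source of the additional variance derived below.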

Therefore, if we decide to use such an approach, the obvious questions are


1. What is the asymptotic distribution of n1/2 (β̂n∗ − β0 )?
2. How does it compare with improper imputation?
3. How do we estimate the asymptotic variance?

Asymptotic Distribution of n^{1/2}(β̂_n^* − β_0)

Using the same expansion that led us to (14.15), we again obtain that

    n^{1/2}(\hat\beta_n^* - \beta_0) = n^{-1/2} \{I^F(\beta_0)\}^{-1} \sum_{i=1}^{n}
      \left[ m^{-1} \sum_{j=1}^{m} S^F\{Z_{ij}(\beta^{(j)}), \beta_0\} \right] + o_p(1)    (14.69)

(note here the dependence of the imputed values on j through β^{(j)}). Also, the
same logic that was used for the multiple-imputation (improper) estimator leads
us to the relationship

    n^{-1/2} \sum_{i=1}^{n} \left[ m^{-1} \sum_{j=1}^{m} S^F\{Z_{ij}(\beta^{(j)}), \beta_0\} \right]
      = n^{-1/2} \sum_{i=1}^{n} \left[ m^{-1} \sum_{j=1}^{m} S^F\{Z_{ij}(\beta_0), \beta_0\} \right]
        + \{I^F(\beta_0) - I(\beta_0)\}\, m^{-1} \sum_{j=1}^{m} n^{1/2}(\beta^{(j)} - \beta_0) + o_p(1).    (14.70)

Note that (14.70) can be written as

    n^{-1/2} \sum_{i=1}^{n} \left[ m^{-1} \sum_{j=1}^{m} S^F\{Z_{ij}(\beta_0), \beta_0\} \right]
      + \{I^F(\beta_0) - I(\beta_0)\}\, m^{-1} \sum_{j=1}^{m} n^{1/2}(\beta^{(j)} - \hat\beta_n^I)
      + \{I^F(\beta_0) - I(\beta_0)\}\, n^{1/2}(\hat\beta_n^I - \beta_0) + o_p(1).    (14.71)

Because n^{1/2}(β̂_n^I − β_0) = n^{-1/2} Σ_{i=1}^n q{C_i, G_{C_i}(Z_i)} + o_p(1), we can write
(14.71) as

    n^{-1/2} \sum_{i=1}^{n} \left( \left[ m^{-1} \sum_{j=1}^{m} S^F\{Z_{ij}(\beta_0), \beta_0\} \right]
      + \{I^F(\beta_0) - I(\beta_0)\}\, q\{C_i, G_{C_i}(Z_i)\} \right)    (14.72)

      + m^{-1} \sum_{j=1}^{m} \{I^F(\beta_0) - I(\beta_0)\}\, n^{1/2}(\beta^{(j)} - \hat\beta_n^I) + o_p(1).    (14.73)

Note that (14.72) is a term that was derived when we considered type B
multiple-imputation estimators in the previous section (improper imputation),
whereas (14.73) is an additional term due to sampling the β (j) ’s from
 
    N\left( \hat\beta_n^I,\; \frac{\widehat{\mathrm{var}}[q\{C_i, G_{C_i}(Z_i)\}]}{n} \right).

Therefore,

    n^{1/2}(\hat\beta_n^* - \beta_0) = \{I^F(\beta_0)\}^{-1} \left\{ (14.72) + (14.73) \right\}.

The expression (14.72) multiplied by {I F (β0 )}−1 is, up to op (1), identical to


representation (14.32) of the type B multiple-imputation estimator and hence
is asymptotically normally distributed with mean zero and variance equal to
(14.59).

By construction, n^{1/2}(β^{(j)} − β̂_n^I), j = 1, . . . , m, are m independent draws
from a normal distribution with mean zero and variance \widehat{\mathrm{var}}[q\{C_i, G_{C_i}(Z_i)\}].
Because \widehat{\mathrm{var}}\{q(\cdot)\} is a consistent estimator for var{q(·)}, this implies that
n^{1/2}(β^{(j)} − β̂_n^I), j = 1, . . . , m, are asymptotically equivalent to V_1, . . . , V_m, which
are iid normal random variables with mean zero and variance var[q{C, G_C(Z)}]
and independent of all the data {C_i, G_{C_i}(Z_i)}, i = 1, . . . , n. Therefore,

    n^{1/2}(\hat\beta_n^* - \beta_0) =
      n^{-1/2} \{I^F(\beta_0)\}^{-1} \sum_{i=1}^{n} \left( \left[ m^{-1} \sum_{j=1}^{m} S^F\{Z_{ij}(\beta_0), \beta_0\} \right]
        + \{I^F(\beta_0) - I(\beta_0)\}\, q\{C_i, G_{C_i}(Z_i)\} \right)    (14.74)

      + m^{-1} \{I^F(\beta_0)\}^{-1} \{I^F(\beta_0) - I(\beta_0)\} \sum_{j=1}^{m} V_j    (14.75)

      + o_p(1).
Because (14.74) converges to a normal distribution with mean zero and variance
matrix (14.59), and (14.75) is distributed as normal with mean zero and
variance matrix

    m^{-1} \{I^F(\beta_0)\}^{-1} \{I^F(\beta_0) - I(\beta_0)\}\, \mathrm{var}[q\{C, G_C(Z)\}]\,
      \{I^F(\beta_0) - I(\beta_0)\} \{I^F(\beta_0)\}^{-1}    (14.76)

and is independent of (14.74), this implies that n^{1/2}(β̂_n^* − β_0) is asymptotically
normal with mean zero and asymptotic variance equal to (14.59) + (14.76),
which equals

    \{I^F(\beta_0)\}^{-1}
      + \left( \frac{m+1}{m} \right) \{I^F(\beta_0)\}^{-1} \{I^F(\beta_0) - I(\beta_0)\} \{I^F(\beta_0)\}^{-1}
      + \left( \frac{m+1}{m} \right) \{I^F(\beta_0)\}^{-1} \{I^F(\beta_0) - I(\beta_0)\}\, \mathrm{var}[q\{C, G_C(Z)\}]\,
        \{I^F(\beta_0) - I(\beta_0)\} \{I^F(\beta_0)\}^{-1}.    (14.77)
Comparing (14.77) with (14.59), we see that the estimator using “proper”
imputation has greater variance (is less efficient) than the corresponding “im-
proper” imputation estimator, which fixes β̂nI at each imputation. This makes
intuitive sense since we are introducing additional variability by sampling the
β’s from some distribution at each imputation. The increase in the variance
is given by (14.76). The variances of the two methods converge as m goes to
infinity.

Let us now study the properties of Rubin’s formula for the asymptotic
variance when applied to type A (proper imputation) multiple-imputation
estimators.

Rubin’s Estimator for the Asymptotic Variance

Using arguments that led us to (14.74) and (14.75), we obtain that

    n^{1/2}(\hat\beta_{nj}^* - \hat\beta_n^*) = n^{-1/2} \{I^F(\beta_0)\}^{-1} \sum_{i=1}^{n}
        \left[ S^F\{Z_{ij}(\beta_0), \beta_0\} - \bar{S}_i^F(\beta_0) \right]
      + \{I^F(\beta_0)\}^{-1} \{I^F(\beta_0) - I(\beta_0)\} (V_j - \bar{V}) + o_p(1),

where \bar{V} = m^{-1} \sum_{j=1}^{m} V_j. Therefore,
    E\left\{ n \sum_{j=1}^{m} \left( \hat\beta_{nj}^* - \hat\beta_n^* \right) \left( \hat\beta_{nj}^* - \hat\beta_n^* \right)^T \right\}
      \xrightarrow{n \to \infty} \{I^F(\beta_0)\}^{-1} \times
      \Bigg[ E\left( \sum_{j=1}^{m} \left[ S^F\{Z_{ij}(\beta_0), \beta_0\} - \bar{S}_i^F(\beta_0) \right]
          \left[ S^F\{Z_{ij}(\beta_0), \beta_0\} - \bar{S}_i^F(\beta_0) \right]^T \right)
        + \{I^F(\beta_0) - I(\beta_0)\}\, E\left\{ \sum_{j=1}^{m} (V_j - \bar{V})(V_j - \bar{V})^T \right\}
          \{I^F(\beta_0) - I(\beta_0)\} \Bigg]
      \times \{I^F(\beta_0)\}^{-1}.    (14.78)

Remark 10. The term involving q{Ci , GCi (Zi )} is common for all j and hence

drops out when considering {β̂nj − β̂n∗ }. Also, the additivity of the expectations
in (14.78) is a consequence of the fact that Vj are generated independently
from all the data. 

We showed in (14.64) that

    E\left( \sum_{j=1}^{m} \left[ S^F\{Z_{ij}(\beta_0), \beta_0\} - \bar{S}_i^F(\beta_0) \right]
      \left[ S^F\{Z_{ij}(\beta_0), \beta_0\} - \bar{S}_i^F(\beta_0) \right]^T \right)
      = (m-1) \left\{ I^F(\beta_0) - I(\beta_0) \right\}.    (14.79)

Also,

    E\left\{ \sum_{j=1}^{m} (V_j - \bar{V})(V_j - \bar{V})^T \right\} = (m-1)\, \mathrm{var}[q\{C, G_C(Z)\}].    (14.80)

Therefore, substituting (14.79) and (14.80) into (14.78) yields

    E\left\{ n \sum_{j=1}^{m} \left( \hat\beta_{nj}^* - \hat\beta_n^* \right) \left( \hat\beta_{nj}^* - \hat\beta_n^* \right)^T \right\} \rightarrow
      (m-1) \{I^F(\beta_0)\}^{-1} \Big[ \left\{ I^F(\beta_0) - I(\beta_0) \right\}
        + \left\{ I^F(\beta_0) - I(\beta_0) \right\} \mathrm{var}\left[ q\{C, G_C(Z)\} \right]
          \left\{ I^F(\beta_0) - I(\beta_0) \right\} \Big] \{I^F(\beta_0)\}^{-1}.    (14.81)

Consequently, Rubin’s variance estimator, given by (14.60), when used with


“proper” multiple imputation, will converge in expectation to (14.77), which
indeed is the asymptotic variance of the “proper” multiple-imputation esti-
mator.

Summary

1. Type B or "improper" imputation (i.e., holding the initial estimator fixed
   across imputations) results in a more efficient estimator than "proper"
   imputation (where the β's are sampled from
   N(β̂_n^I, \widehat{\mathrm{var}}[q\{C, G_C(Z)\}]/n) for each imputation); however, this
   difference in efficiency disappears as m → ∞.
2. Rubin’s variance estimator underestimates the asymptotic variance when
used with “improper” imputation.
3. Rubin’s variance estimator correctly estimates the asymptotic variance
when used with “proper” imputation (i.e., the variance estimator con-
verges in expectation to the asymptotic variance). As m, the number of
imputations, goes to infinity, Rubin’s estimator is also consistent as well
as asymptotically unbiased.

14.7 Surrogate Marker Problem Revisited


We now return to Example 1, which was introduced at the beginning of the
chapter. In this problem, we were interested in estimating the regression pa-
rameters θ in a logistic regression model of a binary variable Y as a function of
covariates X given by (14.2). However, because X was expensive to measure,
a cheaper surrogate variable W for X was also collected and the design was
to collect X only on a validation subsample chosen at random with prespeci-
fied probability that depended on Y and W . A surrogate variable W for X is
assumed to satisfy the property that

    P(Y = 1 \mid W, X) = P(Y = 1 \mid X).    (14.82)

In addition, we also assume that

    \begin{pmatrix} X \\ W \end{pmatrix} \sim
      N\left( \begin{pmatrix} \mu_X \\ \mu_W \end{pmatrix},
        \begin{pmatrix} \sigma_{XX} & \sigma_{XW} \\ \sigma_{XW}^T & \sigma_{WW} \end{pmatrix} \right).

Letting R denote the indicator of a complete case (i.e., if R = 1, then we
observe (Y, W, X), whereas when R = 0, we only observe (Y, W)), the observed
data can be represented as

    (R_i, Y_i, W_i, R_i X_i), \quad i = 1, . . . , n.

We are interested in obtaining an estimator for the parameter

    \beta = (\theta^T, \mu_X^T, \mu_W^T, \sigma_{XX}, \sigma_{XW}, \sigma_{WW})^T

using the observed data. We will now illustrate how we would use multiple
imputation. First, we need to derive an initial estimator β̂nI . One possibility
is to use an inverse weighted complete-case estimator. This is particularly
attractive because we know the probability of being included in the validation
set by design; that is, P [R = 1|Y, W ] = π(Y, W ).
If we had full data, then we could estimate β using standard likelihood
estimating equations; namely,

    \sum_{i=1}^{n} X_i \left\{ Y_i - \mathrm{expit}(\theta^T X_i) \right\} = 0,
    \sum_{i=1}^{n} (X_i - \mu_X) = 0,
    \sum_{i=1}^{n} (W_i - \mu_W) = 0,
    \sum_{i=1}^{n} \left\{ (X_i - \mu_X)(X_i - \mu_X)^T - \sigma_{XX} \right\} = 0,
    \sum_{i=1}^{n} \left\{ (W_i - \mu_W)(W_i - \mu_W)^T - \sigma_{WW} \right\} = 0,
    \sum_{i=1}^{n} \left\{ (X_i - \mu_X)(W_i - \mu_W)^T - \sigma_{XW} \right\} = 0,    (14.83)

where

    \mathrm{expit}(u) = \frac{\exp(u)}{1 + \exp(u)}.
Using the observed data, an inverse weighted complete-case estimator for
β can be obtained by solving the equations

    \sum_{i=1}^{n} \frac{R_i}{\pi(Y_i, W_i)}\, X_i \left\{ Y_i - \mathrm{expit}(\theta^T X_i) \right\} = 0,
    \sum_{i=1}^{n} \frac{R_i}{\pi(Y_i, W_i)}\, (X_i - \mu_X) = 0,
    \quad \vdots \quad \text{etc.}    (14.84)

Remark 11. If we are interested only in the parameter θ of the logistic regres-
sion model, then we only need to solve (14.84) and not rely on any assumption
of normality of X, W . However, to use multiple imputation, we need to derive
initial estimators for all the parameters. 
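
As one concrete possibility, the following numpy sketch solves (14.84) directly: Newton-Raphson for the weighted logistic score in θ, and weighted moments for the normal nuisance parameters. The array shapes (X of dimension n by d_X, W of dimension n by d_W) and the absence of any intercept handling are assumptions of the sketch, not requirements stated in the text.

```python
import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def ipwcc_initial_estimator(R, Y, W, X, pi, n_iter=25):
    # Inverse probability weighted complete-case estimator solving (14.84).
    # Only validation subjects (R_i = 1) contribute, each weighted by
    # 1/pi(Y_i, W_i), with pi known by design; rows of X with R_i = 0 are never used.
    cc = (R == 1)
    w = 1.0 / pi[cc]
    Xc, Yc, Wc = X[cc], Y[cc], W[cc]
    # Newton-Raphson for the weighted logistic score (first equation of (14.84)).
    theta = np.zeros(Xc.shape[1])
    for _ in range(n_iter):
        p = expit(Xc @ theta)
        score = Xc.T @ (w * (Yc - p))
        hess = (Xc * (w * p * (1 - p))[:, None]).T @ Xc
        theta = theta + np.linalg.solve(hess, score)
    # Weighted moment equations for the normal nuisance parameters.
    mu_X = np.average(Xc, axis=0, weights=w)
    mu_W = np.average(Wc, axis=0, weights=w)
    dX, dW = Xc - mu_X, Wc - mu_W
    sig_XX = (dX * w[:, None]).T @ dX / w.sum()
    sig_WW = (dW * w[:, None]).T @ dW / w.sum()
    sig_XW = (dX * w[:, None]).T @ dW / w.sum()
    return theta, mu_X, mu_W, sig_XX, sig_XW, sig_WW
```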


Solving the estimating equations above gives us an initial estimator β̂nI . In


addition, using the methods described for IPWCC estimators, say in Chap-
ter 7, we can also derive the influence function q(R, Y, W, RX) for β̂nI and a
consistent estimate of its asymptotic variance \widehat{\mathrm{var}}\{q(R, Y, W, RX)\}.
In order to carry out the imputation, we must be able to sample from
the conditional distribution of the full data given the observed data. For our
problem, we must sample from the conditional distribution

pX|R,Y,W,RX (x|Ri , Yi , Wi , Ri Xi , β̂nI ).

Clearly, when Ri = 1 (the complete case), we just use the observed value Xi .
However, when Ri = 0, then we must sample from

pX|R,Y,W,RX (x|Ri = 0, Yi , Wi , β̂nI ).

Because of the MAR assumption, this is the same as

pX|Y,W (x|Yi , Wi , β̂nI ).

How Do We Sample?

We now describe the use of rejection sampling to obtain random draws from
the conditional distribution of pX|Y,W (x|Yi , Wi , β̂nI ). Using Bayes’s rule and
the surrogacy assumption (14.82), the conditional density is derived as

    p_{X|Y,W}(x \mid y, w) = \frac{p_{Y|X}(y \mid x)\, p_{X|W}(x \mid w)}{\int p_{Y|X}(y \mid x)\, p_{X|W}(x \mid w)\, dx}.

Therefore, p_{X|Y,W}(x | y, w) equals

    \frac{\left[ \exp(\hat\theta_n^{IT} x\, y)\big/\{1 + \exp(\hat\theta_n^{IT} x)\} \right] p_{X|W}(x \mid w)}{\text{normalizing constant}},
      \qquad y = 0, 1,

where the normalizing constant equals the integral of the numerator over x.

Because (X^T, W^T)^T is multivariate normal, this implies that the conditional
distribution of X given W is also normal, with mean

    E(X \mid W) = \hat\mu_{X_n}^I + \hat\sigma_{XW_n}^I [\hat\sigma_{WW_n}^I]^{-1} (W - \hat\mu_{W_n}^I)    (14.85)

and variance

    \mathrm{var}(X \mid W) = \hat\sigma_{XX_n}^I - \hat\sigma_{XW_n}^I [\hat\sigma_{WW_n}^I]^{-1} \hat\sigma_{XW_n}^{IT}.    (14.86)

Therefore, at the j-th imputation, if R_i = 0, we can generate X_{ij}(β̂_n^I) by first
randomly sampling from a normal distribution with mean

    \hat\mu_{X_n}^I + \hat\sigma_{XW_n}^I [\hat\sigma_{WW_n}^I]^{-1} (W_i - \hat\mu_{W_n}^I)

and variance (14.86). After generating such an X in this fashion, we "keep"
this X if another randomly generated uniform random variable is less than

    \frac{\exp(\hat\theta_n^{IT} X Y_i)}{1 + \exp(\hat\theta_n^{IT} X)};

otherwise, we repeat the process until we "keep" an X, which we use for the
j-th imputation X_{ij}(β̂_n^I). This rejection sampling scheme guarantees a random
draw from

    p_{X|Y,W}(x \mid Y_i, W_i).
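
A minimal numpy sketch of this rejection step, with the initial estimates entering through (14.85) and (14.86); the argument names and shapes are assumptions of the sketch.

```python
import numpy as np

def impute_X(Yi, Wi, theta_hat, mu_X, mu_W, sig_XX, sig_XW, sig_WW, rng):
    # Rejection sampler for one draw from p_{X|Y,W}(x | Y_i, W_i):
    # propose from the normal conditional (14.85)-(14.86) and accept with
    # probability exp(theta' x Y_i) / {1 + exp(theta' x)}.
    cond_mean = mu_X + sig_XW @ np.linalg.solve(sig_WW, Wi - mu_W)   # (14.85)
    cond_var = sig_XX - sig_XW @ np.linalg.solve(sig_WW, sig_XW.T)   # (14.86)
    while True:
        x = rng.multivariate_normal(cond_mean, cond_var)             # proposal from X | W
        accept = np.exp(theta_hat @ x * Yi) / (1.0 + np.exp(theta_hat @ x))
        if rng.uniform() < accept:
            return x
```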
Therefore, at the j-th imputation, together with Y_i and W_i, which we always
observe, we use X_i if R_i = 1 and X_{ij}(β̂_n^I) if R_i = 0 to create the j-th pseudo-
full data set. This j-th imputed data set is then used to obtain the estimator β̂_{nj}^*
as described by (14.83). Standard software packages will do this.
The final estimate is

    \hat\beta_n^* = m^{-1} \sum_{j=1}^{m} \hat\beta_{nj}^*.

A consistent estimator of the asymptotic variance can be obtained by sub-


stituting consistent estimators for I F (β0 ) and I, say by using (14.66) and
(14.67), respectively, and a consistent estimator \widehat{\mathrm{var}}\{q(R, Y, W, RX)\} for
var{q(R, Y, W, RX)} in equation (14.58).
References

Allison, P.D. (2002). Missing Data. Sage, Thousand Oaks, CA.


Andersen, P.K., Borgan, O., Gill, R.D., and Keiding, N. (1992). Statistical
Models Based on Counting Processes. Springer-Verlag, Berlin.
Andersen, P.K. and Gill, R.D. (1982). Cox's regression model for counting
processes: a large sample study. Annals of Statistics 10, 1100–1120.
Bang, H. and Robins, J.M. (2005). Doubly robust estimation in missing
data and causal inference models. Biometrics 61, 962–973.
Bang, H. and Tsiatis, A.A. (2000). Estimating medical costs with censored
data. Biometrika 87, 329–343.
Bang, H. and Tsiatis, A.A. (2002). Median regression with censored cost
data. Biometrics 58, 643–649.
Begun, J.M., Hall, W.J., Huang, W., and Wellner, J.A. (1983). Information
and asymptotic efficiency in parametric-nonparametric models. Annals
of Statistics 11, 432–452.
Bickel, P.J., Klaassen, C.A.J., Ritov, Y., and Wellner, J.A. (1993). Effi-
cient and Adaptive Estimation for Semiparametric Models. The Johns
Hopkins University Press, Baltimore.
Breslow, N. (1974). Covariance analysis of censored survival data. Biomet-
rics 30, 89–99.
Casella, G. and Berger, R.L. (2002). Statistical Inference, Second Edition.
Duxbury Press, Belmont, CA.
Chamberlain, G. (1987). Asymptotic efficiency in estimation with condi-
tional moment restrictions. Journal of Econometrics 34, 305–334.
Cox, D.R. (1972). Regression models and life-tables (with discussion). Jour-
nal of the Royal Statistical Society, Series B 34, 187–220.
Cox, D.R. (1975). Partial likelihood. Biometrika 62, 269–276.
Cox, D.R. and Snell, E.J. (1989). Analysis of Binary Data. Chapman and
Hall, London.
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood
from incomplete data via the EM algorithm (with discussion). Journal
of the Royal Statistical Society, Series B 39, 1–37.

Fleming, T.R. and Harrington, D.P. (1991). Counting Processes and Sur-
vival Analysis. Wiley, New York.
Gill, R.D., van der Laan, M.J., and Robins, J.M. (1997). Coarsening at
random: Characterizations, conjectures and counterexamples. Proceed-
ings of the First Seattle Symposium in Biostatistics: Survival Analysis,
Springer, New York, pp. 255–294.
Hájek, J. (1970). A characterization of limiting distributions of regular es-
timates. Zeitschrift Wahrscheinlichkeitstheorie und Verwandte Gebiete
14, 323–330.
Hájek, J. and Sidak, Z. (1967). Theory of Rank Tests. Academic Press, New
York.
Hampel, F.R. (1974). The influence curve and its role in robust estimation.
Journal of the American Statistical Association 69, 383–393.
Heitjan, D.F. (1993). Ignorability and coarse data: Some biomedical exam-
ples. Biometrics 49, 1099–1109 .
Heitjan, D.F. and Rubin, D.B. (1991). Ignorability and coarse data. Annals
of Statistics 19, 2244–2253.
Holland, P.W. (1986). Statistics and causal inference (with discussion).
Journal of the American Statistical Association 81, 945–970.
Horvitz, D.G. and Thompson, D.J. (1952). A generalization of sampling
without replacement from a finite universe. Journal of the American
Statistical Association 47, 663–685.
Hu, P. and Tsiatis, A.A. (1996). Estimating the survival distribution when
ascertainment of vital status is subject to delay. Biometrika 83, 371–
380.
Kaplan, E.L. and Meier, P. (1958). Nonparametric estimation from incom-
plete observations. Journal of the American Statistical Association 53,
457–481.
Kress, R. (1989). Linear Integral Equations. Springer-Verlag, Berlin.
LeCam, L. (1953). On some asymptotic properties of maximum likelihood
estimates and related Bayes estimates. University of California Publi-
cations in Statistics 1, 227–330.
Leon, S., Tsiatis, A.A., and Davidian, M. (2003). Semiparametric estima-
tion of treatment effect in a pretest-posttest study. Biometrics 59,
1046–1055.
Liang, K-Y. and Zeger, S.L. (1986). Longitudinal data analysis using gen-
eralized linear models. Biometrika 73, 13–22.
Lipsitz, S.R., Ibrahim, J.G., and Zhao, L.P. (1999). A weighted estimating
equation for missing covariate data with properties similar to maximum
likelihood. Journal of the American Statistical Association 94, 1147–
1160.
Littell, R.C., Milliken, G.A., Stroup, W.W., and Wolfinger, R.D. (1996).
SAS System for Mixed Models. SAS Institute, Inc., Cary, NC.
Little, R.J.A. and Rubin, D.B. (1987). Statistical Analysis with Missing
Data. Wiley, New York.

Loève, M. (1963). Probability Theory (third edition). Springer-Verlag,


Berlin.
Luenberger, D.G. (1969). Optimization by Vector Space Methods. Wiley,
New York.
Lunceford, J.K. and Davidian, M. (2004). Stratification and weighting via
the propensity score in estimation of causal treatment effects: A com-
parative study. Statistics in Medicine 23, 2937–2960.
Manski, C.F. (1984). Adaptive estimation of non-linear regression models
(with discussion). Econometric Reviews 3, 145–210.
McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models (2nd
edition). Chapman and Hall, London.
Neugebauer, R. and van der Laan, M.J. (2005). Why prefer double robust
estimators in causal inference? Journal of Statistical Planning and In-
ference 129, 405–426.
Newey, W.K. (1988). Adaptive estimation of regression models via moment
restrictions. Journal of Econometrics 38, 301–339.
Newey, W.K. (1990). Semiparametric efficiency bounds. Journal of Applied
Econometrics 5, 99–135.
Newey, W.K. and McFadden, D. (1994). Large sample estimation and hy-
pothesis testing. Handbook of Econometrics 4, 2111–2245.
Neyman, J. (1923). Sur les applications de la théorie des probabilités aux
expériences agricoles: Essai des principes. English translation of ex-
cerpts by Dabrowska, D. and Speed, T. (1990). Statistical Science 5,
465–480.
Oakes, D. (1999). Direct calculation of the information matrix via the EM
algorithm. Journal of the Royal Statistical Society, Series B 61, 479–
482.
Quale, C.M., van der Laan, M.J., and Robins, J.M. (2003). Locally efficient
estimation with bivariate right censored data. University of California,
Berkeley, Department of Statistics Technical Report.
Rao C.R. (1973). Linear Statistical Inference and Its Applications. Wiley,
New York.
Robins, J.M. (1986). A new approach to causal inference in mortality stud-
ies with sustained exposure periods – Application to control of the
healthy worker survivor effect. Mathematical Modelling 7, 1393–1512.
Robins J.M. (1999). Robust estimation in sequentially ignorable missing
data and causal inference models. Proceedings of the American Sta-
tistical Association Section on Bayesian Statistical Science, American
Statistical Association, Alexandria, VA, pp. 6–10.
Robins, J.M. and Gill, R.D. (1997). Non-response models for the analysis
of non-monotone ignorable missing data. Statistics in Medicine 16,
39–56.

Robins, J.M. and Rotnitzky, A. (1992). Recovery of information and adjust-


ment for dependent censoring using surrogate markers. In AIDS Epi-
demiology, Methodological Issues, Jewell, N., Dietz, K., and Farewell,
W., eds. Birkhäuser, Boston, pp. 297–331.
Robins, J.M. and Rotnitzky, A. (2001). Comment on “Inference for semi-
parametric models: Some questions and an answer.” Statistica Sinica
11, 920–936.
Robins, J.M., Rotnitzky, A., and Bonetti, M. (2001). Comment on “Ad-
dressing an idiosyncrasy in estimating survival curves using double
sampling in the presence of self-selected right censoring.” Biometrics
57, 343–347.
Robins J.M., Rotnitzky A., and Scharfstein D.O. (2000). Sensitivity analy-
sis for selection bias and unmeasured confounding in missing data and
causal inference models. In Statistical Models in Epidemiology: The En-
vironment and Clinical Trials. Halloran, M.E. and Berry, D., eds. IMA
Volume 116, Springer-Verlag, New York, pp. 1–95.
Robins, J.M., Rotnitzky, A., and Zhao, L.P. (1994). Estimation of regres-
sion coefficients when some regressors are not always observed. Journal
of the American Statistical Association 89, 846–866.
Robins, J.M., Rotnitzky, A., and van der Laan, M. (2000). Comment on
“On profile likelihood.” Journal of the American Statistical Association
95, 477–482.
Robins, J.M. and Wang, N. (2000). Inference for imputation estimators.
Biometrika 87, 113–124.
Rosenbaum, P.R. (1984). Conditional permutation tests and the propen-
sity score in observational studies. Journal of the American Statistical
Association 79, 565–574.
Rosenbaum, P.R. (1987). Model-based direct adjustment. Journal of the
American Statistical Association 82, 387–394.
Rosenbaum, P.R. and Rubin, D.B. (1983). The central role of the propen-
sity score in observational studies for causal effects. Biometrika 70,
41–55.
Rosenbaum, P.R. and Rubin, D.B. (1984). Reducing bias in observational
studies using subclassification on the propensity score. Journal of the
American Statistical Association 79, 516–524.
Rosenbaum, P.R. and Rubin, D.B. (1985). Constructing a control group
using multivariate matched sampling methods that incorporate the
propensity score. American Statistician 39, 3–38.
Rotnitzky, A., Scharfstein, D.O., Su, T.L., and Robins, J.M. (2001). Meth-
ods for conducting sensitivity analysis of trials with potentially nonig-
norable competing causes of censoring. Biometrics 57, 103–113.
Rubin, D.B. (1974). Estimating causal effects of treatments in random-
ized and nonrandomized studies. Journal of Educational Psychology
66, 688–701.

Rubin, D.B. (1978a). Bayesian inference for causal effects: The role of ran-
domization. Annals of Statistics 6, 34–58.
Rubin, D.B. (1978b). Multiple imputations in sample surveys: A phe-
nomenological Bayesian approach to nonresponse (with discussion).
American Statistical Association Proceedings of the Section on Sur-
vey Research Methods, American Statistical Association, Alexandria,
VA, pp. 20–34.
Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. Wi-
ley, New York.
Rubin, D.B. (1990). Comment: Neyman (1923) and causal inference in
experiments and observational studies. Statistical Science 5, 472–480.
Rubin, D.B. (1996). Multiple imputation after 18+ years (with discussion).
Journal of the American Statistical Association 91, 473–520.
Rubin, D.B. (1997). Estimating causal effects from large data sets using
propensity scores. Annals of Internal Medicine 127, 757–763.
Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. Chapman
and Hall, London.
Scharfstein, D.O., Rotnitzky, A., and Robins, J.M. (1999). Adjusting for
nonignorable drop-out using semiparametric nonresponse models (with
discussion) Journal of the American Statistical Association 94, 1096–
1146.
Stefanski, L.A. and Boos, D.D. (2002). The calculus of M-estimation.
American Statistician 56, 29–38.
Strawderman, R.L. (2000). Estimating the mean of an increasing stochastic
process at a censored stopping time. Journal of the American Statistical
Association 95, 1192–1208.
Tsiatis, A.A. (1998). Competing risks. In Encyclopedia of Biostatistics.
Wiley, New York, pp. 824–834.
van der Laan, M.J. and Hubbard, A.E. (1998). Locally efficient estimation
of the survival distribution with right-censored data and covariates
when collection of data is delayed. Biometrika 85, 771–783.
van der Laan, M.J. and Hubbard, A.E. (1999). Locally efficient estimation
of the quality-adjusted lifetime distribution with right-censored data
and covariates. Biometrics 55, 530–536.
van der Laan, M.J., Hubbard, A.E., and Robins, J.M. (2002). Locally ef-
ficient estimation of a multivariate survival function in longitudinal
studies. Journal of the American Statistical Association 97, 494–507.
van der Laan, M.J. and Robins, J.M. (2003). Unified Methods for Censored
Longitudinal Data and Causality. Springer-Verlag, New York.
van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and
Empirical Processes with Applications to Statistics. Springer-Verlag,
New York.
Verbeke, G. and Molenberghs, G. (2000). Linear Mixed Models for Longi-
tudinal Data. Springer-Verlag, New York.

Wang, N. and Robins, J.M. (1998). Large-sample theory for parametric


multiple imputation procedures. Biometrika 85, 935–948.
Wu, C.F.J. (1983). On the convergence properties of the EM algorithm.
Annals of Statistics 11, 95–103.
Yang, L. and Tsiatis, A.A. (2001). Efficiency study of estimators for a
treatment effect in a pretest-posttest trial. The American Statistician
55, 314–321.
Zhao, H. and Tsiatis, A.A. (1997). A consistent estimator for the distribu-
tion of quality adjusted survival time. Biometrika 84, 339–348.
Zhao, H. and Tsiatis, A.A. (1999). Efficient estimation of the distribution
of quality-adjusted survival time. Biometrics 55, 1101–1107.
Zhao, H. and Tsiatis, A.A. (2000). Estimating mean quality adjusted life-
time with censored data. Sankhya, Series B 62, 175–188.