Bootstrap Methods and Their Application
Bootstrap Methods and Their Application
D . V. H in k le y
H I C a m b r id g e
U N IV E R S IT Y P R E S S
The Pitt Building, Trumpington Street, Cambridge CB2 1RP, United Kingdom
C A M B R ID G E U N IV E R S IT Y PRESS
Contents
Preface
1
Introduction
In tro d u ctio n
Param etric Sim ulation
N o n p aram etric Sim ulation
Simple Confidence Intervals
R educing E rro r
Statistical Issues
N o n p aram etric A pproxim ations for V ariance and Bias
Subsam pling M ethods
B ibliographic N otes
Problem s
Practicals
Further Ideas
3.1
3.2
3.3
3.4
3.5
3.6
3.7
3.8
3.9
In tro d u ctio n
Several Sam ples
Sem iparam etric M odels
Sm ooth E stim ates o f F
C ensoring
M issing D a ta
F inite Population Sam pling
H ierarchical D a ta
B ootstrapping the B ootstrap
ix
1
11
11
15
22
27
31
37
45
55
59
60
66
70
70
71
77
79
82
88
92
100
103
Contents
vi
3.10
3.11
3.12
3.13
3.14
B ootstrap D iagnostics
Choice o f E stim ator from the D ata
B ibliographic N otes
Problem s
Practicals
136
Tests
4.1
Intro d u ctio n
4.2
4.3
4.4
4.5
4.6
4.7
4.8
4.9
Confidence Intervals
5.1
5.2
113
120
123
126
131
Intro d u ctio n
136
140
156
161
175
180
183
184
187
191
191
193
202
211
220
223
230
231
238
243
246
247
251
Linear Regression
256
6.1
6.2
6.3
6.4
6.5
6.6
6.7
6.8
256
257
273
290
307
315
316
321
Intro d u ctio n
Least Squares L inear Regression
M ultiple L inear Regression
A ggregate Prediction E rro r and V ariable Selection
R obust Regression
B ibliographic N otes
Problem s
Practicals
vii
Contents
326
7.1
In tro d u ctio n
326
7.2
327
7.3
Survival D a ta
346
7.4
O th er N onlinear M odels
353
7.5
M isclassification E rro r
358
7.6
362
7.7
B ibliographic N otes
374
7.8
Problem s
376
7.9
Practicals
378
Complex Dependence
385
8.1
In tro d u ctio n
385
8.2
Time Series
385
8.3
Point Processes
415
8.4
B ibliographic N otes
426
8.5
Problem s
428
8.6
Practicals
432
Improved Calculation
437
9.1
In tro d u ctio n
437
9.2
Balanced B ootstraps
438
9.3
C ontrol M ethods
446
9.4
450
9.5
466
9.6
B ibliographic N otes
485
9.7
Problem s
487
9.8
Practicals
494
499
10.1 Likelihood
499
500
507
509
512
514
10.7 Problem s
516
10.8 Practicals
519
viii
11
Contents
Computer Implementation
522
11.1
11.2
11.3
11.4
11.5
11.6
In tro d u ctio n
Basic B ootstraps
F u rth er Ideas
Tests
Confidence Intervals
L inear Regression
522
525
531
534
536
537
11.7
11.8
540
543
545
549
551
555
568
572
575
Preface
The publication in 1979 of Bradley Efrons first article on bootstrap methods was a
major event in Statistics, at once synthesizing some of the earlier resampling ideas
and establishing a new framework for simulation-based statistical analysis. The idea
of replacing complicated and often inaccurate approximations to biases, variances,
and other measures of uncertainty by com puter simulations caught the imagination
of both theoretical researchers and users of statistical methods. Theoreticians
sharpened their pencils and set about establishing mathematical conditions under
which the idea could work. Once they had overcome their initial skepticism, applied
workers sat down at their terminals and began to amass empirical evidence that
the bootstrap often did work better than traditional methods. The early trickle of
papers quickly became a torrent, with new additions to the literature appearing
every month, and it was hard to see when would be a good moment to try to chart
the waters. Then the organizers o f COMPSTAT 92 invited us to present a course
on the topic, and shortly afterwards we began to write this book.
We decided to try to write a balanced account o f resampling methods, to include
basic aspects of the theory which underpinned the methods, and to show as many
applications as we could in order to illustrate the full potential of the methods
warts and all. We quickly realized that in order for us and others to understand
and use the bootstrap, we would need suitable software, and producing it led us
further towards a practically oriented treatment. Our view was cemented by two
further developments: the appearance o f two excellent books, one by Peter Hall
on the asymptotic theory and the other on basic methods by Bradley Efron and
Robert Tibshirani; and the chance to give further courses that included practicals.
O ur experience has been that hands-on computing is essential in coming to grips
with resampling ideas, so we have included practicals in this book, as well as more
theoretical problems.
As the book expanded, we realized that a fully comprehensive treatm ent was
beyond us, and that certain topics could be given only a cursory treatm ent because
too little is known about them. So it is that the reader will find only brief accounts
o f bootstrap methods for hierarchical data, missing data problems, model selection,
robust estimation, nonparam etric regression, and complex data. But we do try to
point the more ambitious reader in the right direction.
No project of this size is produced in a vacuum. The majority of work on
the book was completed while we were at the University of Oxford, and we are
very grateful to colleagues and students there, who have helped shape our work
in various ways. The experience of trying to teach these methods in Oxford and
elsewhere at the Universite de Toulouse I, Universite de Neuchatel, Universita
degli Studi di Padova, Queensland University of Technology, Universidade de
Sao Paulo, and University of Umea has been vital, and we are grateful to
participants in these courses for prompting us to think more deeply about the
ix
Preface
material. Readers will be grateful to these people also, for unwittingly debugging
some of the problems and practicals. We are also grateful to the organizers of
COMPSTAT 92 and CLAPEM V for inviting us to give short courses on our
work.
While writing this book we have asked many people for access to data, copies
of their programs, papers or reprints; some have then been rewarded by our
bombarding them with questions, to which the answers have invariably been
courteous and informative. We cannot name all those who have helped in this
way, but D. R. Brillinger, P. Hall, M. P. Jones, B. D. Ripley, H. OR. Sternberg and
G. A. Young have been especially generous. S. Hutchinson and B. D. Ripley have
helped considerably with computing matters.
We are grateful to the mostly anonymous reviewers who commented on an early
draft of the book, and to R. G atto and G. A. Young, who later read various parts
in detail. A t Cambridge University Press, A. W oollatt and D. Tranah have helped
greatly in producing the final version, and their patience has been commendable.
We are particularly indebted to two people. V. Ventura read large portions o f the
book, and helped with various aspects of the com putation. A. J. Canty has turned
our version o f the bootstrap library functions into reliable working code, checked
the book for mistakes, and has made numerous suggestions that have improved it
enormously. Both of them have contributed greatly though o f course we take
responsibility for any errors that remain in the book. We hope that readers will
tell us about them, and we will do our best to correct any future versions of the
book; see its WWW page, at U R L
http://dmawww.epf1.ch/davison.mosaic/BMA/
The book could not have been completed without grants from the U K Engineer
ing and Physical Sciences Research Council, which in addition to providing funding
for equipment and research assistantships, supported the work o f A. C. Davison
through the award o f an Advanced Research Fellowship. We also acknowledge
support from the US N ational Science Foundation.
We must also mention the Friday evening sustenance provided at the Eagle and
Child, the Lam b and Flag, and the Royal Oak. The projects of many authors have
flourished in these amiable establishments.
Finally, we thank our families, friends and colleagues for their patience while
this project absorbed our time and energy. Particular thanks are due to Claire
Cullen Davison for keeping the Davison family going during the writing of this
book.
A. C. Davison and D. V. Hinkley
Lausanne and Santa Barbara
May 1997
1
Introduction
exP (Pk),
k
where the sum is over colum ns with blanks in row j. The eventual total o f as
yet u n rep o rted diagnoses from period j can be estim ated by replacing a j and
Pk by estim ates derived from the incom plete table, and thence we obtain the
predicted to tal for period j. Such predictions are shown by the solid line in
1 Introduction
D iagnosis
period
Y ear
Q u a rte r
0+
1988
1
2
3
4
1
2
3
4
1
2
3
4
1
2
3
4
1
2
3
4
31
26
31
36
32
15
34
38
31
32
49
44
41
56
53
63
71
95
76
67
80
99
95
77
92
92
104
101
124
132
107
153
137
124
175
135
161
178
181
16
27
35
20
32
14
29
34
47
36
51
41
29
39
35
24
48
39
9
9
13
26
10
27
31
18
24
10
17
16
33
14
17
23
25
3
8
18
11
12
22
18
9
11
9
15
11
7
12
13
12
2
11
4
3
19
21
8
15
15
7
8
6
11
7
11
8
3
6
8
12
12
6
6
8
6
9
5
6
10
1989
1990
1991
1992
>14
6
3
3
2
2
1
T otal
rep o rts
to end
o f 1992
174
211
224
205
224
219
253
233
281
245
260
285
271
263
306
258
310
318
273
133
Figure 1.1, together w ith the observed to tal reports to the end o f 1992. How
good are these predictions?
It would be tedious b u t possible to p u t pen to p ap er and estim ate the
prediction uncertainty th ro u g h calculations based on the Poisson model. But
in fact the d a ta are m uch m ore variable th an th a t m odel would suggest, and
by failing to take this into account we w ould believe th at the predictions are
m ore accurate th a n they really are. Furtherm ore, a b etter approach would be
to use a sem iparam etric m odel to sm ooth out the evident variability o f the
increase in diagnoses from q u arter to q u arter; the corresponding prediction is
the dotted line in Figure 1.1. A nalytical calculations for this m odel would be
very unpleasant, and a m ore flexible line o f attack is needed. W hile m ore th an
one approach is possible, the one th a t we shall develop based on com puter
sim ulation is b o th flexible and straightforw ard.
1 Introduction
Time
which the variability o f the quantities o f interest can be assessed w ithout longwinded and error-prone analytical calculation. Because this approach involves
repeating the original d a ta analysis procedure w ith m any replicate sets o f data,
these are som etim es called computer-intensive methods. A n o th er nam e for them
is bootstrap methods, because to use the d a ta to generate m ore d a ta seems
analogous to a trick used by the fictional B aron M unchausen, who when he
found him self a t the b o tto m o f a lake got out by pulling him self up by his
b ootstraps. In the sim plest nonparam etric problem s we do literally sample
from the data, and a com m on initial reaction is th a t this is a fraud. In fact
it is not. It turns out th a t a wide range o f statistical problem s can be tackled
this way, liberating the investigator from the need to oversimplify complex
problem s. T he ap proach can also be applied in simple problem s, to check the
adequacy o f stan d ard m easures o f uncertainty, to relax assum ptions, and to
give quick approxim ate solutions. A n exam ple o f this is random sam pling to
estim ate the p erm u tatio n distribution o f a nonparam etric test statistic.
It is o f course true th a t in m any applications we can be fairly confident in
a p articu lar p aram etric m odel and the stan d ard analysis based on th a t model.
Even so, it can still be helpful to see w hat can be inferred w ithout particular
p aram etric m odel assum ptions. This is in the spirit o f robustness o f validity o f
the statistical analysis perform ed. N onparam etric b o o tstrap analysis allows us
to do this.
1 Introduction
3
5
7
18
43
85
91
98
100
130
230
487
_____________________________________________________________________
Examples
B ootstrap m ethods can be applied b o th when there is a well-defined probability
m odel for d a ta an d when there is not. In o u r initial developm ent o f the
m ethods we shall m ake frequent use o f tw o simple examples, one o f each type,
to illustrate the m ain points.
Example 1.1 (Air-conditioning data) Table 1.2 gives n = 12 times between
failures o f air-conditioning equipm ent, for which we wish to estim ate the
underlying m ean or its reciprocal, the failure rate. A simple m odel for this
problem is th a t the times are sam pled from an exponential distribution.
The dotted line in the left panel o f Figure 1.2 is the cum ulative distribution
function (C D F )
F t ) = /
\ l - e x p (-y/n),
y ~
y > 0,
for the fitted exponential distrib u tio n w ith m ean fi set equal to the sample
average, y = 108.083. The solid line on the sam e plot is the nonparam etric
equivalent, the em pirical distribution function (E D F ) for the data, which places
equal probabilities n-1 = 0.083 at each sam ple value. C om parison o f the two
curves suggests th a t the exponential m odel fits reasonably well. A n alternative
view o f this is shown in the right panel o f the figure, which is an exponential
1 Introduction
O
co
o
o
in
o
o
o
o
co
o
o
CM
O
o
0.0 0.5
Failure time y
n+ 1
= - log (1
K=1
n+ 1
A lthough these plots suggest reasonable agreem ent with the exponential
m odel, the sam ple is ra th e r too small to have m uch confidence in this. In the
d a ta source the m ore general gam m a m odel with m ean /i and index k is used;
its density is
fw (y) =
1
1
/ \ K
I K ' K-1.
y K exP ( - Ky / v l
y > o,
h, k
> o.
( i.i)
F or o u r sam ple the estim ated index is k = 0.71, which does not differ signif
icantly (P = 0.29) from the value k = 1 th a t corresponds to the exponential
m odel. O u r reason for m entioning this will becom e apparent in C h apter 2.
Basic properties o f the estim ator T = Y for fj. are easy to obtain theoretically
under the exponential model. For example, it is easy to show th at T is unbiased
and has variance fi2/n. A pproxim ate confidence intervals for n can be calculated
using these properties in conjunction with a norm al approxim ation for the
distrib u tio n o f T, alth o u g h this does n o t w ork very well: we can tell this
because Y / n has an exact gam m a distribution, which leads to exact confidence
limits. Things are m ore com plicated under the m ore general gam m a model,
because the index k is only estim ated, and so in a traditional approach we would
use approxim ations such as a norm al approxim ation for the distribution
o f T, or a chi-squared approxim ation for the log likelihood ratio statistic.
1 Introduction
The param etric sim ulation m ethods o f Section 2.2 can be used alongside these
approxim ations, to diagnose problem s w ith them , or to replace them entirely.
Example 1.2 (City population data) Table 1.3 reports n = 49 d a ta pairs, each
corresponding to a city in the U nited States o f A m erica, the p air being the 1920
and 1930 p o pulations o f the city, w hich we denote by u and x. The d a ta are
plotted in Figure 1.3. Interest here is in the ratio o f m eans, because this would
enable us to estim ate the to tal pop u latio n o f the U SA in 1930 from the 1920
figure. I f the cities form a ran d o m sam ple w ith ( U , X ) denoting the p air o f
populatio n values for a random ly selected city, then the total 1930 population
is the prod u ct o f the to tal 1920 popu latio n and the ratio o f expectations
6 = E (X )/E ([7). This ratio is the p aram eter o f interest.
In this case there is no obvious p aram etric m odel for the jo in t distribution
o f ( U , X ) , so it is n atu ral to estim ate 9 by its em pirical analog, T = X / U , the
ratio o f sam ple averages. We are then concerned w ith the uncertainty in T. If
we had a plausible param etric m odel for exam ple, th a t the pair ( U, X ) has
a bivariate lognorm al distrib u tio n then theoretical calculations like those
in Exam ple 1.1 would lead to bias an d variance estim ates for use in a norm al
approxim ation, which in tu rn would provide approxim ate confidence intervals
for 6. W ithout such a m odel we m ust use nonparam etric analysis. It is still
possible to estim ate the bias an d variance o f T, as we shall see, and this m akes
norm al approxim ation still feasible, as well as m ore com plex approaches to
setting confidence intervals.
1 Introduction
Table 13 Populations
in thousands of n 49
large US cities in 1920
(u) and in 1930 (x)
(Cochran, 1977, p. 152).
138
93
61
179
48
37
29
23
30
143
104
69
260
75
63
50
48
111
50
52
53
79
57
317
93
58
76
381
387
78
60
507
50
77
64
40
136
243
256
94
36
45
80
464
459
106
57
634
64
89
77
60
139
291
288
85
46
53
67
120
172
66
46
121
44
64
56
40
116
87
43
43
161
36
67
115
183
86
65
113
58
63
142
64
130
105
61
50
232
54
38
46
71
25
298
74
50
Figure 1J Populations
of 49 large United
States cities (in 1000s)
in 1920 and 1930.
3
Q.
O
Q.
O
CO
O)
1920 population
1 Introduction
1 Introduction
the S language to sets o f data. The practicals are intended to reinforce the
ideas in each chapter, to supplem ent the m ore theoretical problem s, and to
give exam ples on which readers can base analyses o f their own data.
It would be possible to give different sorts o f course based on this book.
O ne w ould be a theoretical course based on the problem s and an o th er an
applied course based on the practicals; we prefer to blend the two.
A lthough a library o f routines for use with the statistical package S P lu s
is bundled w ith it, m ost o f the book can be read w ithout reference to p a r
ticular softw are packages. A p art from the practicals, the exception to this is
C h ap ter 11, which is a short introduction to the m ain resam pling routines,
arran g ed roughly in the order with which the corresponding ideas ap p ear in
earlier chapters. R eaders intending to use the bundled routines will find it
useful to w ork through the relevant sections o f C h apter 11 before attem pting
the practicals.
Notation
A lthough we believe th a t o u r n o tation is largely standard, there are not enough
letters in the English and G reek alphabets for us to be entirely consistent. G reek
letters such as 6, P and v generally denote param eters or o ther unknow ns, while
a is used for error rates in connection with significance tests and confidence
sets. English letters X , Y, Z , and so forth are used for random variables, which
take values x, y, z. T hus the estim ator T has observed value t, which m ay be
an estim ate o f the unknow n p aram eter 0. The letter V is used for a variance
estim ate, an d the letter p for a probability, except for regression models, where
p is the num b er o f covariates. Script letters such as J/~ are used to denote sets.
Probability, expectation, variance and covariance are denoted Pr( ), E( ),
var(-) and cov(-, ), while the jo in t cum ulant o f Yi, Y1Y2 and Y3 is denoted
cum(Yi, Yj Y2, Y3). We use I {A} to denote the indicator random variable, which
takes values one if the event A is true and zero otherwise. A related function
is the H eaviside function
We use #{/!} to denote the nu m ber o f elem ents in the set A, and #{^4r} for the
num ber o f events A r th a t occur in a sequence A i , A 2 , __ We use = to m ean
is approxim ately equal to , usually corresponding to asym ptotic equivalence
as sam ple sizes tend to infinity, ~ to m ean is distributed as o r is distributed
according to , ~ to m ean is distributed approxim ately a s, ~ to m ean is a
sam ple o f independent identically distributed random variables from , while
s has its usual m eaning o f is equivalent to .
10
1 Introduction
We m ostly reserve Z for ran d o m variables th a t are stan d ard norm al, at least
approxim ately, an d use Q for ran d o m variables w ith o ther (approxim ately)
know n distributions. As usual N(n, a 2) represents the norm al distribution w ith
m ean \i an d variance a 2, while za is often the a quantile o f the stan d ard norm al
distribution, w hose cum ulative distrib u tio n function is ( ).
The letter R is reserved for the n u m b er o f replicate sim ulations. Sim ulated
copies o f a statistic T are denoted T ' , r = 1 ,..., R, w hose ordered values are
r ('i) ^
^ T (R)- E xpectation, variance an d probability calculated w ith respect
to the sim ulation distribution are w ritten Pr*(), E*(-) and var*(-).
W here possible we avoid boldface type, and rely on the context to m ake
it plain when we are dealing w ith vectors o r m atrices; a T denotes the m atrix
transpose o f a vector o r m atrix a.
We use PD F, C D F, an d E D F as sh o rth an d for probability density function,
cum ulative distribution function, and em pirical distribution function. The
letters F and G are used for C D F s, an d / and g are generally used for the
corresponding PD F s. A n exception to this is th a t /*; denotes the frequency
with which y; app ears in the rth resample.
We use M L E as sh o rth an d for m axim um likelihood estim ate or som etim es
m axim um likelihood estim ation.
The end o f each exam ple is m arked , an d the end o f each algorithm is
m arked .
2
The Basic Bootstraps
2.1 Introduction
In this chap ter we discuss techniques which are applicable to a single, h om o
geneous sam ple o f data, denoted by y i,...,} V T he sam ple values are thought
o f as the outcom es o f independent and identically distributed ran d o m variables
Y U . . . ,Y w hose probability density function (P D F ) and cumulative distribution
function (C D F ) we shall denote by / and F, respectively. T he sam ple is to be
used to m ake inferences ab o u t a p o p ulation characteristic, generically denoted
by 6, using a statistic T whose value in the sam ple is t. We assum e for the
m om ent th a t the choice o f T has been m ade and th a t it is an estim ate for 6,
which we take to be a scalar.
O u r atten tio n is focused on questions concerning the probability distribution
o f T. F or exam ple, w hat are its bias, its stan d ard error, or its quantiles? W hat
are likely values und er a certain null hypothesis o f interest? H ow do we
calculate confidence limits for 6 using T ?
T here are tw o situations to distinguish, the param etric and the n o n p a ra m et
ric. W hen there is a p articu lar m athem atical m odel, with adjustable constants
o r p aram eters ip th a t fully determ ine / , such a m odel is called parametric and
statistical m ethods based on this m odel are param etric m ethods. In this case
the p aram eter o f interest 6 is a com ponent o f or function o f ip. W hen no such
m athem atical m odel is used, the statistical analysis is nonparametric, and uses
only the fact th a t the ran d o m variables Yj are independent and identically
distributed. Even if there is a plausible param etric m odel, a nonparam etric
analysis can still be useful to assess the robustness o f conclusions draw n from
a p aram etric analysis.
A n im p o rta n t role is played in nonparam etric analysis by the empirical
distribution which puts equal probabilities n-1 a t each sam ple value yj. The
corresponding estim ate o f F is the empirical distribution function (E D F ) F,
11
12
n
M ore form ally
F(y) = l i Z H ^ y - y ^
j=i
y dF( y) ,
t(F) =
y 2 dF(y) ~ { J ydF(y) J
(2.2)
13
2.1 Introduction
b o th the p aram eter an d its estim ate, b u t we shall use t( ) to represent the
function, and t to represent the estim ate o f 9 based on the observed d ata
ydF(y).
j= i
because f a ( y ) d H ( y x) = a(x) for any continuous function a(-).
Example 2.2 (City population data) F or the problem outlined in Exam ple 1.2,
the p aram eter o f interest is the ratio o f m eans 9 = E (X )/E (l/). In this case F
is the bivariate C D F o f Y = (V , X ), and the bivariate E D F F puts probability
n~l at each o f the d a ta pairs (uj ,Xj). T he statistical function version o f 9 simply
uses the definition o f m ean for b o th nu m erato r and denom inator, so th at
fxdF(u,x)
f ud F( u, x)
The corresponding estim ate o f 9 is
*
[ xdF(u,x)
t = t(F) =
J udF(u,x)
w ith x = n-1 J2 x j ar*d = n_1 J 2 uj-
A quantity A is said to
be 0(nd) if
lim_00 n~dA = a for
some finite a, and o(nJ)
if lim_0Q n~dA = 0.
x
u
14
require special treatm en t is n o n p aram etric density estim ation, which we discuss
in Exam ple 5.13.)
The representation 6 = t(F) defines the p aram eter and its estim ator T in a
robust way, w ithout any assum ption ab o u t F, oth er th an th a t 6 exists. This
guarantees th a t T estim ates the right thing, no m atter w hat F is. Thus the
sam ple average y is the only statistic th a t is generally valid as an estim ate o f the
population m ean f i : only if Y is sym m etrically distributed ab o u t /i will statistics
such as trim m ed averages also estim ate fi. This property, which guarantees th at
the correct characteristic o f the underlying distribution is estim ated, w hatever
th a t distribution is, is som etim es called robustness o f specification.
2.1.2 Objectives
M uch o f statistical theory is devoted to calculating approxim ate distributions
for p articu lar statistics T , on which to base inferences ab o u t their estim ands 8.
Suppose, for exam ple, th a t we w ant to calculate a (1 2a) confidence interval
for 6. It m ay be possible to show th a t T is approxim ately norm al w ith m ean
6 + P and variance v; here P is the bias o f T. If p an d v are b o th know n, then
we can write
P r(T < 1 1 F) = O
(2-3)
where <t>() is the stan d ard norm al integral. I f the a quantile o f the standard
norm al distrib u tio n is z = <D- 1(a), then an approxim ate (1 2a) confidence
interval for 6 has limits
t - p - v ^ \ ,
(2.4)
as follows from
Pr(/? + v1/2za < T
t(F),
(2.5)
= means is
approximately equal to.
15
(2.5), th a t is
B = b(F) = E ( T \ F ) - t ( F ) ,
(2.6)
E stim ates such as those in (2.6) are b o o tstrap estim ates. H ere they have
been used in conjunction w ith a norm al approxim ation, which som etim es will
be adequate. However, the b o o tstrap approach o f substituting estim ates can
be applied m ore am bitiously to im prove upon the norm al approxim ation and
o th e r first-order theoretical approxim ations. The elaboration o f the b o o tstrap
ap proach is the purpose o f this book.
16
v ar'(Y * ) = y 2/n.
N ote th a t the estim ated bias o f Y is zero, being the difference between
E '(Y *) an d the value ji = y for the m ean o f the fitted distribution. These
m om ents were used to calculate an approxim ate norm al confidence interval in
Exam ple 2.3.
If, however, we wished to calculate the bias and variance o f T = log Y under
the fitted m odel, i.e. E* (log Y*) lo g y and v ar (lo g Y '), exact calculation is
m ore difficult. The delta m ethod o f Section 2.7.1 would give approxim ate
values (2n)~* and n-1 . But m ore accurate approxim ations can be obtained
using sim ulated sam ples o f 7* s.
Sim ilar results and com m ents would apply if instead we chose to use the
m ore general gam m a m odel (1.1) for this example. T hen Y* would be a gam m a
random variable with m ean y and index k.
m
B r = / r 1 Y , Tr ~ t = T* - 1.
(2.7)
r= 1
N ote th a t in the sim ulation t is the p aram eter value for the model, so th at
T ' t is the sim ulation analogue o f T 6. The corresponding estim ator o f
the variance o f T is
1
Vr =
R
D 7-* - f *)2
(2-8)
17
cC/>
O
in
18
index n. The sim ulation variances o f B R and F r are
t2
t4 /
6 \
nR
n2 \ R - 1 + n R . )
and we can use these to say how large R should be in order th a t the sim ulated
values have a specified accuracy. For exam ple, the coefficients o f variation
o f VR a t R = 100 and 1000 are respectively 0.16 and 0.05. However, for a
com plicated problem w here sim ulation was really necessary, such calculations
could n o t be done, an d general rules are needed to suggest how large R should
be. These are discussed in Section 2.5.2.
r=l
19
l)p is an integer.
The sim ulation approxim ation GR and the corresponding quantiles are in
principle b etter th a n results obtained by norm al approxim ation, provided th at
R is large enough, because they avoid the supposition th a t the distribution o f
T* t has a p articu lar form.
Example 2.6 (Air-conditioning data) T he sim ulation experim ents described
in Exam ple 2.5 can be used to study the sim ulation approxim ations to the
d istribution an d quantiles o f Y fi. First, Figure 2.2 shows norm al Q -Q plots
o f t* values for R = 99 (top left panel) and R = 999 (top right panel). Clearly
a norm al ap proxim ation would n o t be accurate in the tails, and this is already
fairly clear w ith R = 99. F or reference, the lower h a lf o f Figure 2.2 shows
corresponding Q -Q plots w ith exact gam m a quantiles.
T he n onnorm ality o f T * is also reasonably clear on histogram s o f t* values,
show n in Figure 2.3, at least at the larger value R = 999. C orresponding
density estim ate plots provide sm oother displays o f the same inform ation.
We look next at the estim ated quantiles o f Y p.. T he p quantile is a p
proxim ated by J'fjK+np) y for p = 0.05 and 0.95. The values o f R are
1 9 ,3 9 ,9 9 ,1 9 9 ,..., 999, chosen to ensure th a t (R + 1)p is an integer throughout.
T hus at R = 19 the 0.05 quantile is approxim ated by y ^ y and so forth. In
order to display the m agnitude o f sim ulation error, we ran four independent
sim ulations a t R = 1 9 ,3 9 ,9 9 ,...,9 9 9 . The results are plotted in Figure 2.4.
A lso shown by d o tted lines are the exact quantiles under the m odel, which the
sim ulations ap proach as R increases. T here is large variability in the approxi
m ate quantiles for R less th an 100 and it appears th a t 500 or m ore sim ulations
are required to get accurate results.
The same sim ulations can be used in o th er ways. F or example, we m ight
w ant to know a b o u t log Y log /i, in which case the em pirical properties o f
logy* lo g y are relevant.
T he illustration used here is very simple, but essentially the same m ethods
can be used in arb itrarily com plicated param etric problems. F or example,
distributions o f likelihood ratio statistics can be approxim ated when largesam ple approxim ations are inaccurate or fail entirely. In C hapters 4 and
5 respectively we show how param etric boo tstrap m ethods can be used to
calculate significance tests an d confidence sets.
It is som etim es useful to be able to look at the density o f T, for exam ple to
see if it is m ultim odal, skewed, or otherw ise differs appreciably from norm ality.
A rough idea o f the density g(u) o f U = T 6, say, can be had from a histogram
o f the values o f t ' t. A som ew hat b etter picture is offered by a kernel density
20
ooo
o
C\J
>
/*S
CD
to
Jr
o
o
o
o
o
in
/
/
O
''fr
60 80
120
160
200
50
100
150
200
r= l
<>
where w is a sym m etric P D F with zero m ean and h i s a. positive bandw idth th a t
determ ines the sm oothness o f gh. The estim ate gh is non-negative and has unit
integral. It is insensitive to the choice o f w(-), for which we use the standard
norm al density. The choice o f h is m ore im portant. T he key is to produce a
sm ooth result, while n o t flattening out significant modes. If the choice o f h
is quite large, as it m ay be if R < 100, then one should rescale the density
21
Figure 2 3 Histograms
of t* values based on
R = 99 (left) and
R = 999 (right)
simulations from the
fitted exponential model
for the air-conditioning
data.
o
o
o
r~
O
o
co
o
o
in
o
o
o
Tt
o
liB
50
100
150
t*
lb
o
200
50
100
150
200
t*
estim ate to m ake its m ean and variance agree with the estim ated m ean bR and
variance vR o f T 9; see Problem 3.8.
As a general rule, good estim ates o f density require at least R = 1000:
density estim ation is usually h ard er th an probability o r quantile estim ation.
N ote th a t the same m ethods o f estim ating density, distribution function and
quantiles can be applied to any transform ation o f T. We shall discuss this
fu rth er in Section 2.5.
22
under
; =y
j=i
and similarly
1
v a r* (Y * )= -v a r * ( Y ')
n
1
1
"
1
-E *{Y * E*(Y*)}2 = - x V - { y , y f
n
1
1
n
^
n 1
}=i
(n 1)
A p art from the factor (n 1)/n, this is the usual result for the estim ated
variance o f Y .
O ther simple statistics such as the sam ple variance and sam ple m edian are
also easy to handle (Problem s 2.3, 2.4).
To apply sim ulation w ith the E D F is very straightforw ard. Because the
E D F puts equal probabilities on the original d a ta values y i , . . . , y , each Y*
is independently sam pled a t ran d o m from those d a ta values. T herefore the
sim ulated sam ple Y(, . . . , Y* is a ran d o m sam ple taken with replacem ent from
the data. This simplicity is special to the case o f a hom ogeneous sample, but
m any extensions are straightforw ard. This resam pling procedure is called the
nonparametric bootstrap.
Example 2.8 (City population data) H ere we look at the ratio estim ate for
the problem described in Exam ple 1.2. F or convenience we consider a subset
o f the d a ta in Table 1.3, com prising the first ten pairs. This is an application
with no obvious param etric m odel, so nonparam etric sim ulation m akes good
sense. Table 2.1 shows the d a ta and the first sim ulated sample, which has been
draw n by random ly selecting subscript j ' from the set { l,...,n } w ith equal
probability and taking (w*,x*) = (uj-,xj-). In this sam ple j ' = 1 never occurs
23
.7
u
1
138
/'
u
X*
143
2
93
104
3
61
69
4
179
260
5
48
75
6
37
63
7
29
50
8
23
48
9
30
111
10
2
50
6
37
63
7
29
50
2
93
104
2
93
104
3
61
69
3
61
69
10
2
50
7
29
50
2
93
104
9
30
111
1
138
143
2
93
104
(/ -Xj*).
Table 2.2 Frequencies
with which each original
data pair appears in
each of R = 9
nonparametric
bootstrap samples for
the data on US cities.
j
u
X
3
61
69
4
179
260
5
48
75
6
37
63
7
29
50
8
23
48
9
30
111
10
2
50
1
2
4
1
1
2
1
2
1
1
2
1
1
2
1
S tatistic
t = 1.520
R eplicate r
1
2
3
4
5
6
7
8
9
an d /
1
1
3
1
1
2
1
1
2
1
1
3
2
1
1
1
1
2
2
1
1
2
1
1
2
2
3
2
1
1
1
1
2
1
1
1
1
1
2
1
1
3
1
1
t\
t*
r;
t\
t'5
t'6
t;
tj
(j
=
=
=
=
=
=
=
=
=
1.466
1.761
1.951
1.542
1.371
1.686
1.378
1.420
1.660
v = 0.03907.
vL = n~2 J ^ ( x ; - t u j f / u 1,
j=i
24
o
C
oO
<
oN
c\i
I
1 ll
J llll.-_
in
o
o
o
0.5
1.0
1.5
2.0
2.5
-8
n .llll
-6
-4
-2
z*
t*
25
26
under the best-fitting gam m a m odel w ith index k = 0.71. The agreem ent in the
second panel is strikingly good. O n reflection this is natural, because the E D F
is closer to the larger gam m a m odel th a n to the exponential model.
27
m etric resam pling, T* and related quantities will have discrete distributions,
even though they m ay be approxim ating continuous distributions. This m akes
results som ew hat fuzzy com pared to their param etric counterparts.
Example 2.10 (Air-conditioning data) For the nonparam etric sim ulation dis
cussed in the previous exam ple, the right panels o f Figure 2.9 show the scatter
plots o f sam ple stan d ard deviation versus sam ple average for R = 99 and
R = 999 sim ulated datasets. C orresponding plots for the exponential sim u
lation are shown in the left panels. T he qualitative feature to be read from
any one o f these plots is th a t d a ta stan d ard deviation is proportional to d ata
average. The discreteness o f the nonparam etric m odel (the E D F ) adds noise
whose peculiar b anded structure is evident a t R = 999, although the qualitative
structure is still apparent.
_ f i n 1\ _ (2n 1)!
\ n1)
n\(n 1)!
possible values o f t*, depending upon the sm oothness o f the statistical function
t( ). Even for m oderately small sam ples the support o f the distribution o f T*
will often be fairly dense: values o f m for n = 7 and 11 are 1716 and 352 716
(Problem 2.5). It would therefore usually be harm less to think o f there being
a P D F for T*, and to approxim ate it, either using sim ulation results as in
Figure 2.6 o r theoretically (Section 9.5). There are exceptions, however, m ost
n otably when T is a sam ple quantile. The case o f the sam ple m edian is
discussed in Exam ple 2.16; see also Problem 2.4 and Exam ple 2.15.
For m any practical applications o f the sim ulation results, the effects o f
discreteness are likely to be fairly m inim al. However, one possible problem is
th at outliers are m ore likely to occur in the sim ulation output. F or example,
in Exam ple 2.8 there were three outliers in the sim ulation, and these inflated
the estim ate v o f the variance o f T*. Such outliers should be evident on a
norm al Q -Q plot (or com parable relevant plot), and when found they should be
om itted. M ore generally, a statistic th at depends heavily on a few quantiles can
be sensitive to the repeated values th a t occur under nonparam etric sampling,
an d it can be useful to sm ooth the original d a ta when dealing with such
statistics; see Section 3.4.
28
Bootstrap average
Bootstrap average
O
O
CO
in
C\J
Q
C/)
o.
CO
to
o
o
co
Q.
(0
CsJ
o
LO
o
8
CD o
50
Bootstrap average
Bootstrap average
1 (^(R+lJa) 0-
(2.10)
=>
P r ( T - b < 6 < T - a) = 1 - 2 a .
29
We shall refer to the limits (2.10) as the basic bootstrap confidence limits. Their
accuracy depends upon R, o f course, and one would typically take R > 1000 to
be safe. But accuracy also depends upon the extent to which the distribution o f
T" t agrees w ith th a t o f T 9. Com plete agreem ent will occur if T 9 has a
distribution n o t depending on any unknow ns. This special property is enjoyed
by quantities called pivots, which we discuss in m ore detail in Section 2.5.1.
If, as is usually the case, the distribution o f T 9 does depend on unknow ns,
then we can try alternative expressions contrasting T and 6, such as differences
o f transform ed quantities, o r studentized com parisons. For the latter, we define
the studentized version o f T 9 as
where V is an estim ate o f v a r(T | F): we give a fairly general form for V in
Section 2.7.2. The idea is to mimic the Student-t statistic, which has this form,
and which elim inates the unknow n standard deviation when m aking inference
ab o u t a norm al mean. T hro u g hout this book we shall use Z to denote a
studentized statistic.
Recall th a t the S tudent-t (1 2a) confidence interval for a norm al m ean n
has limits
y - v l/2tn- i ( l - a ) ,
y - v l/2t-i(a),
where v is the estim ated variance o f the m ean and f_i(a), t_ i(l a) are
quantiles o f the Student-f distribution w ith n 1 degrees o f freedom , the
distribution o f the pivot Z . M ore generally, when Z is defined by (2.11), the
(1 2a) confidence interval limits for 9 have the analogous form
where zp denotes the p quantile o f Z . One simple approxim ation, which can
often be justified for large sam ple size n, is to take Z as being N ( 0,1). The result
would be no different in practical term s from using a norm al approxim ation
for T 9, and we know th a t this is often inadequate. It is m ore accurate
to estim ate the quantiles o f Z from replicates o f the studentized bootstrap
statistic, Z* = (T* t ) / V * 1/2, where T ' and V * are based on a sim ulated
ran d o m sample, Y , . . . , Yn'.
If the m odel is param etric, the Y ' are generated from the fitted param etric
distribution, and if the m odel is nonparam etric, they are generated from the
E D F F, as outlined in Section 2.3. In either case we use the (R + l)a th order
statistic o f the sim ulated values z \ , . . . , z ' R, nam ely z(*(K+1)(x), to estim ate z. Then
the studentized bootstrap confidence interval for 9 has limits
(2 .12)
30
Example 2.12 (City population data) F or the sam ple o f n = 10 pairs analysed
in Exam ple 2.8, o u r estim ate o f the ratio 8 is t = x / u = 1.52. The 0.025 and
0.975 quantiles o f the 999 values o f t are 1.236 and 2.059, so the 95% basic
boo tstrap confidence interval (2.10) for 8 is (0.981,1.804).
To apply the studentized interval, we use the delta m ethod approxim ation
to the variance o f T, which is (Problem 2.9)
n
VL = n ~ 2 J ^ ( x y - tU j)2/Q 2,
j =i
and base confidence intervals for 8 on ( T 0 ) / v lL[ 2, using sim ulated values
o f z ' = (t* t ) / v L . T he sim ulated values in the right panel o f Figure 2.5
show th at the density o f the studentized b o o tstrap statistic Z ' is n o t close to
norm al. The 0.025 and 0.975 quantiles o f the 499 sim ulated z ' values are -3.063
and 1.447, and since v i = 0.0325, an approxim ate 95% equitailed confidence
interval based on (2.12) is (1.260,2.072). T his is quite different from the interval
above.
The usefulness o f these confidence intervals will depend on how well F
31
32
(2.13)
where /i_1( ) is the inverse transform ation. So h~l { h(T) aa} is an upper
(1 a) confidence lim it for 8.
Parametric problems
In param etric problem s F = F# and F = Fv have the sam e form, differing
only in p aram eter values. T he n otion o f a pivot is quite simple here, m eaning
constant behaviour und er all values o f the m odel param eters. M ore formally,
we define a pivot as a function Q = q ( T , 8 ) w hose distribution does o r n o t a
p articular q uantity Q is exactly or nearly pivotal, by exam ining its behaviour
under the m odel form w ith varying p aram eter values. F or example, in the
context o f Exam ple 1.1 n o t depend on the value o f \p: for all q,
W L i { h ( T ) } { h ( 8 ) } 2 v(8),
which in tu rn implies th a t the variance is m ade approxim ately constant (equal
to 1) if
H{t) = /
M lijp '
(114)
33
<
ocD
(0
c
(0
o
o
o
>
50 60 70
90
200
theta
in conjunction w ith (2.13) will typically give m ore accurate confidence limits
th an would be obtained using direct approxim ations o f quantiles for T 6.
If such use o f the transfo rm ation is appropriate, it will som etim es be clear
from theoretical considerations, as in the exponential case. O therw ise the
tran sfo rm atio n w ould have to be identified from a scatter plot o f sim ulationestim ated variance o f T versus 6 for a range o f values o f 8.
Example 2.13 (Air-conditioning data) Figure 2.10 shows a log-log plot o f the
em pirical variances o f r* = y ' based on R = 50 sim ulations for each o f a
range o f values o f 6. T h a t is, for each value o f 0 we generate R values t
corresponding to sam ples y y " from the exponential distribution with
m ean 6, and then plot log { ( R l) -1 X)(t* r*)2} against log0. T he linearity
an d slope o f the plot confirm th at v a r(T | F ) oc 62, where 6 = E (T | F).
a
Nonparametric problems
In n o n p aram etric problem s the situation is m ore com plicated. It is now unlikely
(but n o t strictly im possible) th a t any quantity can be exactly pivotal. A lso we
cann o t sim ulate d a ta from a distribution with the same form as F, because
th a t form is unknow n. However, we can sim ulate d a ta from distributions near
to and sim ilar to F, an d this m ay be enough since F is near F. A rough idea
o f w hat is possible can be h ad from Exam ple 2.10. In the right-hand panels o f
Figure 2.9 we plotted sam ple stan d ard deviation versus sam ple average for a
series o f n o nparam etrically resam pled datasets. If the E D F s o f those datasets
are th o u g h t o f as m odels n ear both F and F, then although the pattern is
obscured by the banding, the plots suggest th a t the true m odel has standard
deviation p ro p o rtio n al to its m ean which is indeed the case for the m ost
34
likely true m odel. T here are conceptual difficulties with this argum ent, b u t
there is little question th a t the im plication draw n is correct, nam ely th at log Y
will have approxim ately the sam e variance und er sam pling from b o th F and
F.
A m ore tho ro u g h discussion o f these ideas for nonparam etric problem s will
be given in Section 3.9.2.
A m ajor focus o f research on resam pling m ethods has been the reduction
o f statistical error. This is reflected particularly in the developm ent o f accurate
confidence lim it m ethods, which are described in C h apter 5. In general it is
best to rem ove as m uch o f the statistical erro r as possible in the choice o f
procedure. However, it is possible to reduce statistical erro r by a b o o tstrap
technique described in Section 3.9.1.
35
t ; - Y.
?; - y )
vary |e * ( R ~ 1 ^
Yr* -
+ E y | var' ( r ' 5 ] y ; - y ) } ,
where E y ( - ) and vary(-) denote the m ean and variance taken with respect to
the jo in t distrib u tio n o f Y \ , . . . , Y n. F rom (2.15) this gives
v ar (Br ) = vary(O) +
Ey
a2
n 1
x .
n
nR
(2.16)
This result does not depend on norm ality o f the data. A sim ilar expression
holds for any sm ooth statistic T w ith a linear approxim ation (Section 2.7.2),
except for an 0 ( n ~ 2) term.
N ext consider the variance estim ator VR = (R I)-1 XXYr Y*)2, where
Y* = R ^ 1
Yr*. The m ean and variance o f VR across all possible sim ulations,
conditional on the data, are
+ Ey
which reduces to
(2.17)
The first term on the right o f (2.17) is due to d a ta variation, the second to
36
( zp , 2 n p ( l - p ) e \ p ( z 2) \
- | ^ + ----------------------- - j .
(2.19)
0.01
5.15
0.025
3.72
0.05
3.30
0.10
3.56
0.25
8.16
So to m ake the variance inflation factor 10% for the 0.025 quantile, for
example, we would need R = 40n. E qu atio n (2.19) m ay n o t be useful in the
centre o f the distribution, where d(p) is very large because zp is small.
Example 2.14 (Air-conditioning data) To see how well this discussion applies
in practice, we look briefly a t results for the d a ta in Exam ple 1.1. T he statistic
o f interest is T = log Y, which estim ates 8 = log fi. The true m odel for Y is
taken to be the gam m a distrib u tio n w ith index k = 0.71 and m ean p. = 108.083;
these are the d a ta estim ates. Effects due to sim ulation e rro r are approxim ated
37
Source
T ype
P
0.01
0.99
0.05
0.95
0.10
0.90
D a ta
actual
theoretical
31.0
26.6
6.9
26.6
14.0
13.3
3.6
13.3
8.3
8.1
2.2
8.1
S im ulation, R = 100
actual
theoretical
actual
theoretical
actual
theoretical
53.6
32.9
4.3
6.6
2.2
3.3
9.4
32.9
2.4
6.6
0.8
3.3
8.5
10.5
2.0
2.1
1.5
1.0
3.2
10.5
0.6
2.1 .
0.1
1.0
3.8
6.9
1.2
1.4
0.8
0.7
2.6
6.9
0.4
1.4
0.2
0.7
S im ulation, R = 500
S im ulation, R = 1000
by taking sets o f R sim ulations from one long nonparam etric sim ulation o f
9999 datasets. Table 2.3 shows the actual com ponents o f variation due to
sim ulation an d d a ta variation, together with the theoretical com ponents in
(2.19), for estim ates o f quantiles o f l o g ? log/i. O n the whole the theory
gives a fairly accurate prediction o f perform ance.
38
when an d how a b o o tstrap calculation m ight fail, and ideally how it should
be am ended to yield useful answers. This topic o f boo tstrap diagnostics is
discussed m ore fully in Section 3.10.
A second question is: u nder w hat idealized conditions will a resam pling
procedure produce results th a t are in some sense m athem atically correct?
Answ ers to questions o f this sort involve an asym ptotic fram ew ork in which
the sam ple size n>oo. A lthough such asym ptotics are ultim ately intended
to guide practical work, they often act only as a backstop, by rem oving from
consideration procedures th a t do n o t have ap p ro p riate large-sam ple properties,
and are usually n o t subtle enough to discrim inate am ong com peting procedures
according to their finite-sam ple characteristics. N evertheless it is essential to
appreciate when a naive application o f the b o o tstrap will fail.
To put the theoretical basis for the b o o tstrap in simple term s, suppose th at
we have a ran d o m sam ple
or equivalently its E D F F, from which
we wish to estim ate properties o f a standardized quantity Q = q ( YU ---, Y;F).
For exam ple, we m ight take
Q{Yu . . . , Y n\F) = n 1/2 j ? -
y d F ( y ) ^ = n ^ 2( ? - 6 ) ,
(2.20)
(2.21)
where in this case Q{Y{, . . . , Y * ; F ) = n{/1{ Y ' y). In order for G p n to approach
G f n as n*oo, three conditions m ust hold. Suppose th a t the true distribution
F is surrounded by a neighbourhood
in a suitable space o f distributions,
and th at as n*oo, F eventually falls into J f w ith probability one. T hen the
conditions are:
1
2
3
h(u)dGAy(u)
->
h(u)dGAi0D(u)
for all integrable functions h(-). U nder these conditions the b o o tstrap is con
sistent, m eaning th a t for any q and e > 0, Pr{\Gpn(q) GF^ }(q)\ > e}>0 as
nyoo.
39
T he first condition ensures th at there is a limit for Gf, to converge to, and
w ould be needed even in the happy situation where F equalled F for every
n > n', for som e ri. N ow as n increases, F changes, so the second and third
conditions are needed to ensure th at G p n approaches G fi00 along every possible
sequence o f F s. If any one o f these conditions fails, the b o o tstrap can fail.
Example 2.15 (Sample maximum) Suppose th at Y i,. . . , Yn is a random sample
from the uniform distribution on (0 ,9). T hen the m axim um likelihood estim ate
o f 9 is the largest sam ple value, T = Yln), where Y(i) < < Y(n) are the sample
order statistics. C onsider nonparam etric resam pling. The lim iting distribution
o f Q = n(9 T ) / 9 is stan d ard exponential, and this suggests th a t we take our
standardized quantity to be Q' = n(t T ' ) / t , where t is the observed value
o f T , an d T* is the m axim um o f a b o o tstrap sam ple o f size n taken from
y i , . . . , y n. As n>oo, however,
Pr(g* = 0 | F) = Pr(T* = t \ F) = 1 - (1 - n_1)"-> 1 - e_1,
an d consequently the lim iting distribution o f Q* can n o t be stan d ard exponen
tial. The problem here is th a t the second condition fails: the distributional
convergence is not uniform on useful neighbourhoods o f F. A ny fixed o r
d er statistic Y(k) suffers from the same difficulty, b u t a statistic like a sample
quantile, where we would take k = pn for some fixed 0 < p < 1, does not.
Asymptotic accuracy
Here and below we say
X n = Op{nd) when
Prfn^l-Xnl > e)-*p for
some constant p as
noo, and X = op(nd)
when
Pr(n rf|ATn| > e)-*0 as
n>cc, for any e > 0.
Consistency is a w eak property, for exam ple guaranteeing only th at the true
probability coverage o f a nom inal (1 2a) confidence interval is 1 2ot + op(l).
S tan d ard norm al approxim ation m ethods are consistent in this sense. Once
consistency is established, m eaning th at the resam pling m ethod is valid, we
need to know w hether the m ethod is good relative to o ther possible m ethods.
This involves looking at the rate o f convergence to nom inal properties. For
example, does the coverage o f the confidence interval deviate from (1 2a) by
0 p(n~l/2) or by 0 p(n-1 )? Some insight into this can be obtained by expansion
m ethods, as we now outline. M ore detailed calculations are m ade in Section 5.4.
Suppose th a t the problem is one where the lim iting distribution o f Q is stan
d ard norm al, and where an Edgeworth expansion applies. T hen the distribution
o f Q can be w ritten in the form
Pr (Q < q \ F ) = <S>(q) + n~x/1a{q)<t>(q) + 0 ( n ~ l ),
(2.22)
where <!>() an d </>{) are the C D F and P D F o f the stan d ard norm al distribution,
and a(-) is an even quad ratic polynom ial. For a wide range o f problem s it can
be shown th a t the corresponding approxim ation for the b o o tstrap version o f
Q is
Pr(2* < q \ F ) = <b(q) + n~l/2a(q)(l>(q) + 0 ^ ) ,
(2.23)
40
where a(-) is obtained by replacing unknow ns in a(-) by estim ates. Now typically
a(q) = a(q) + 0 p(n~1/2), so
P r(Q' < q \ F) Pr2 < q \ F) = Op(n~l ).
(2.24)
T hus the estim ated distrib u tio n for Q differs from the true distribution by a
term th a t is Op(n_1), provided th a t Q is constructed in such a way th a t it is
asym ptotically pivotal. A sim ilar argum ent will typically hold when Q has a
different lim iting distribution, provided it does n o t depend on unknow ns.
Suppose th a t we choose n o t to standardize Q, so th a t its lim iting distribution
is norm al w ith variance v. A n E dgew orth expansion still applies, now with
form
Pr(fi , I F) _ * ( - j ) +
( - k ) * ( J L ) + 0(n-1),
(125)
(2.26)
(2.27)
because the leading term s on the right-hand sides o f (2.25) and (2.26) are
different.
The difference betw een (2.24) and (2.27) explains o u r insistence on w orking
w ith approxim ate pivots w henever possible: use o f a pivot will m ean th at a
boo tstrap distribution function is an o rd er o f m agnitude closer to its target.
It also gives a cogent theoretical m otivation for using the b o o tstrap to set
confidence intervals, as we now outline.
We can obtain the a quantile o f the distribution o f Q by inverting (2.22),
giving the Cornish-Fisher expansion
qx = z a + n - '^ a ' ^ Z x ) + 0 ( n _1),
where za is the a quantile o f the stan d ard norm al distribution, and a"(-) is
a further polynom ial. T he corresponding b o o tstrap quantile has the property
th a t q ^ qn = Op(n~l ). F or simplicity take Q = ( T 0 ) / V l/1, where V estim ates
the variance o f T. T hen an exact one-sided confidence interval for 9 based on
Q would be I a = [T V 1/2qx, oo), an d this contains the true 6 w ith probability
a. T he corresponding b o o tstrap interval is / = [T I/1/2g ,oo), where q is
the a quantile o f the distrib u tio n o f Q* which w ould often be estim ated by
sim ulation, as we have seen. Since q'x qx = Op(n~[), we have
Pr(0 e I a) = a,
P r(0 e /* ) = a + 0 ( n ~ l ),
41
so th a t the actual probability th at / ' contains 6 differs from the nom inal
probability by only 0 ( n -1 ). In contrast, intervals based on inverting (2.25) will
contain 8 w ith probability a + 0 ( n ~ l/2). This interval is in principle no m ore
accurate th a n using the interval [T F 1/2za, oo) obtained by assum ing th at
the distribution o f Q is stan d ard norm al. Thus one-sided confidence intervals
based on quantiles o f Q have an asym ptotic advantage over the use o f a
norm al approxim ation. Sim ilar com m ents apply to tw o-sided intervals.
The practical usefulness o f such results will depend on the num erical value
o f the difference (2.24) at the values o f q o f interest, and it will always be wise
to try to decrease this statistical error, as outlined in Section 2.5.1.
T he results above based on E dgew orth expansions apply to m any com m on
statistics: sm ooth functions o f sam ple m om ents, such as m eans, variances, and
higher m om ents, eigenvalues and eigenvectors o f covariance m atrices; sm ooth
functions o f solutions to sm ooth estim ating equations, such as m ost m axim um
likelihood estim ators, estim ators in linear and generalized linear models, and
som e robust estim ators; and to m any statistics calculated from tim e series.
42
Normal
Theoretical
Empirical
M ean bootstrap
Effective df
Cauchy
f3
11
21
11
21
11
21
14.3
13.9
17.2
4.3
7.5
7.3
8.8
5.4
16.8
19.1
25.9
3.2
8.8
9.5
11.4
4.9
22.4
38.3
14000
0.002
11.7
14.6
22.8
0.5
distribution o f Y * is
p r(y * =
, \
^
;=0
"
, s
(2.28)
j=0 '*'
for k = l , . . . , n where
= k / n ; sim ulation is n o t needed in this case. The
m om ents o f this b o o tstrap distribution, including its m ean and variance,
converge to the correct values as n increases. However, the convergence can be
very slow. To illustrate this, Table 2.4 com pares the average b o o tstrap variance
w ith the em pirical variance o f the m edian for d a ta sam ples o f sizes n = 11 and
21 from the stan d ard norm al distribution, the Student-t distribution with three
degrees o f freedom , and the C auchy d istrib u tio n ; also shown are the theoretical
variance approxim ations, which are incalculable when the true distribution F
is unknow n. We see th a t the b o o tstrap variance can be very po o r for n = 11
when distributions are long-tailed. The value 1.4 x 104 for average boo tstrap
variance w ith C auchy d a ta is not a m istake: the b o o tstrap variance exceeds
100 for ab o u t 1% o f d atasets: for som e sam ples the b o o tstrap variance is
huge. The situation stabilizes when n reaches 40 o r more.
The gross discreteness o f y * could also affect the simple confidence limit
m ethod described in Section 2.4. But provided the inequalities used to justify
(2.10) are taken to be < an d > rath er th a n < and > , the m ethod w orks well.
For example, for C auchy sam ples o f size n = 11 the coverage o f the 90% basic
boo tstrap confidence interval (2.10) is 90.8% in 1000 sam ples; see Problem 2.4.
We suggest ado p tin g the sam e practice for all problem s where t* is supported
on a small nu m b er o f values.
The statistic T will certainly behave wildly under resam pling w hen t(F) does
not exist, as happens for the m ean when F is a C auchy distribution. Q uite
naturally over repeated sam ples the b o o tstrap will produce silly and useless
results in such cases. T here are two points to m ake here. First, if d a ta are
taken from a real population, then such m athem atical difficulties can n o t arise.
Secondly, the stan d ard approaches to d a ta analysis include careful screening
o f d a ta for outliers, nonnorm ality, an d so forth, which leads either to deletion
o f disruptive d a ta elem ents or to sensible and reliable choices o f estim ators
43
44
Under certain circumstances the resampling methods we have described will work, but in general it would be unwise to assume this without careful thought. Alternative methods will be described in Section 3.6.
Dependent data
In general the nonparametric resampling method that we have described will not work for dependent data. This can be illustrated quite easily in the case where the data y_1, ..., y_n form one realization of a correlated time series. For example, consider the sample average ȳ, and suppose that the data come from a stationary series {Y_j} whose marginal variance is σ² = var(Y_j) and whose autocorrelations are ρ_h = corr(Y_j, Y_{j+h}) for h = 1, 2, .... In Example 2.7 we showed that the nonparametric bootstrap estimate of the variance of Ȳ is approximately s²/n, and for large n this will approach σ²/n. But the actual variance of Ȳ is

    var(Ȳ) = (σ²/n) { 1 + 2 Σ_{h=1}^{n−1} (1 − h/n) ρ_h }.

The factor in braces here would often differ considerably from one, and then the bootstrap estimate of variance would be badly wrong.
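A small simulation illustrates the failure. The sketch below (ours; it assumes an AR(1) model, for which ρ_h = α^h) compares the naive variance estimate s²/n that independent resampling delivers with the true variance of Ȳ given above.

set.seed(2)
n <- 100; alpha <- 0.6
y <- arima.sim(list(ar = alpha), n = n)   # stationary AR(1), marginal variance 1/(1 - alpha^2)
naive <- var(y) / n                       # what i.i.d. resampling estimates
h <- 1:(n - 1)
true <- (1 / (1 - alpha^2)) / n * (1 + 2 * sum((1 - h / n) * alpha^h))
c(naive = naive, true = true)             # the naive value is far too small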
Similar problems arise with other forms of dependent data. The essence of the problem is that simple bootstrap sampling imposes mutual independence on the Y_j, effectively assuming that their joint CDF is F(y_1) × ⋯ × F(y_n), and thus sampling from its estimate F̂(y_1) × ⋯ × F̂(y_n). This is incorrect for dependent data. The difficulty is that there is no obvious way to estimate a general joint density for Y_1, ..., Y_n given one realization. We shall explore this important subject further in Chapter 8.
Weakly dependent data occur in the altogether different context of finite population sampling. Here the basic nonparametric resampling methods work reasonably well. More will be said about this in Section 3.7.
Dirty data
What if simulated resampling is used when there are outliers in the data? There is no substitute for careful data scrutiny in this or any other statistical context, and if obvious outliers are found, they should be removed or corrected. When there is a fitted parametric model, it provides a benchmark for plots of residuals and the panoply of statistical diagnostics, and this helps to detect poor model fit. When there is no parametric model, F is estimated by the EDF, and the benchmark is swept away because the data and the model are one and the same. It is then vital to look closely at the simulation output, in order to see whether the conclusions depend crucially on particular observations. We return to this question of sensitivity analysis in Section 3.10.
A statistic that is a smooth function T = g(Ū) of sample means is approximately distributed as

    N( g(ζ), n^{-1} ġ(ζ)ᵀ Ω ġ(ζ) ),    (2.29)

where ζ and Ω denote the mean vector and covariance matrix of the U_j. This rests on the Taylor expansion

    T = g(ζ) + (Ū − ζ)ᵀ ġ(ζ) + ½ (Ū − ζ)ᵀ g̈(ζ) (Ū − ζ) + ⋯,    (2.30)

and on the continuity property

    g(ζ + o_p(1)) = g(ζ) + o_p(1).    (2.31)

From the latter, we can see that the normal approximation for Ū implies that

    T = g(ζ) + n^{-1/2} ġ(ζ)ᵀ Z + o_p(n^{-1/2}),

where Z denotes the N(0, Ω) limit of n^{1/2}(Ū − ζ), which in turn entails (2.29).
Nothing has yet been said about the bias of T, which would usually be hidden in the O_p(n^{-1}) term. If we take the larger expansion (2.30), ignore the remainder term, and take expectations, we obtain

    E(T) = g(ζ) + ġ(ζ)ᵀ E(Ū − ζ) + ½ E{ (Ū − ζ)ᵀ g̈(ζ) (Ū − ζ) } = θ + O(n^{-1}).    (2.32)

For a general statistical function t(·) the same reasoning applies with the expansion

    t(G) = t(F) + ∫ L_t(y; F) dG(y) + ⋯,    (2.33)

in which the derivative L_t is defined by

    L_t(y; F) = ∂ t{ (1 − ε)F + εH_y } / ∂ε |_{ε=0},    (2.34)
with H_y(u) = H(u − y) the Heaviside or unit step function jumping from 0 to 1 at u = y. In this form the derivative satisfies ∫ L_t(y; F) dF(y) = 0, as seen on setting G = F in (2.33). Often the function L_t(y) = L_t(y; F) is called the influence function of T, and its empirical approximation l(y) = L_t(y; F̂) is called the empirical influence function. The particular values l_j = l(y_j) are called the empirical influence values.
The linear approximation to T is

    T ≐ t(F) + n^{-1} Σ_{j=1}^{n} L_t(Y_j; F),    (2.35)

whose right-hand side is approximately distributed as N(θ, v_L(F)) because ∫ L_t(y; F) dF(y) = 0, where

    v_L(F) = n^{-1} var{ L_t(Y); F } = n^{-1} ∫ L_t²(y; F) dF(y).

The corresponding nonparametric delta method variance estimate is

    v_L = v_L(F̂) = n^{-2} Σ_{j=1}^{n} l_j²;    (2.36)

note that ∫ L_t(y; F̂) dF̂(y) = n^{-1} Σ l_j = 0.
with a small value of ε such as (100n)^{-1}. The same method can be used for empirical influence values l_j = L_t(y_j; F̂). Alternative approximations to the empirical influence values l_j, which are all that are needed in (2.36), are described in the following sections.
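For concreteness, the following R sketch (ours) implements the numerical differentiation just described for any statistic written as a function of resampling weights p and the data; the interface is an assumption made for illustration.

emp.influence <- function(y, tfun, eps = 1 / (100 * length(y))) {
  n <- length(y)
  t0 <- tfun(rep(1 / n, n), y)          # statistic at the EDF weights
  sapply(1:n, function(j) {
    p <- rep((1 - eps) / n, n)          # weights for (1 - eps) Fhat + eps H_{y_j}
    p[j] <- p[j] + eps
    (tfun(p, y) - t0) / eps             # numerical derivative
  })
}
# tfun takes weights p and data y; e.g. the weighted mean:
lj <- emp.influence(rnorm(20), function(p, y) sum(p * y))
sum(lj)    # should be (essentially) zero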
Example 2.17 (Average)  Let t = ȳ, corresponding to the statistical function t(F) = ∫ y dF(y). To apply (2.34) we write

    t{ (1 − ε)F + εH_y } = (1 − ε)μ + εy,

and differentiate to obtain

    L_t(y) = ∂{ (1 − ε)μ + εy } / ∂ε |_{ε=0} = y − μ.
The corresponding variance approximation (2.36) is v_L = n^{-2} Σ (y_j − ȳ)² = (n − 1)s²/n², where s² is the unbiased sample variance of the y_j. This differs by the factor (n − 1)/n from the more usual nonparametric variance estimate s²/n for ȳ.

The mean is an example of a linear statistic, whose general form is t(F) = ∫ a(y) dF(y). As the terminology suggests, linear statistics have zero derivatives beyond the first; they have influence function a(y) − E{a(Y)}. This applies to all moments about zero; see Problem 2.10.
Complicated statistics which are functions of simple statistics can be dealt with using the chain rule. So if t(F) = a{ t_1(F), ..., t_m(F) }, then

    L_t(y; F) = Σ_{i=1}^{m} (∂a/∂t_i) L_{t_i}(y; F).    (2.38)

This can also be used to find the influence function for a transformed statistic, given the influence function for the statistic itself.
Example 2.18 (Correlation)  The sample correlation is the sample version of the product moment correlation, which for the pair Y = (U, X) can be defined in terms of μ_rs = E(U^r X^s) by

    ρ = (μ_11 − μ_10 μ_01) / { (μ_20 − μ_10²)(μ_02 − μ_01²) }^{1/2}.

Its influence function can be obtained by applying the chain rule (2.38) with the moments μ_rs as the simple statistics, each being linear with influence function u^r x^s − μ_rs; see Problem 2.10.
Table 2.5. Empirical influence values for the ratio estimate applied to the city population data (n = 10): exact values l_j and regression estimates (Example 2.23).

  Case          1      2      3      4      5     6     7     8     9    10
  Exact     -1.04  -0.58  -0.37  -0.19   0.03  0.11  0.09  0.20  1.02  0.73
  Regression -1.11 -0.44  -0.38  -0.65  -0.04  0.12  0.13  0.27  1.16  0.94
The p quantile estimate t = q̂_p has influence function

    L_{q_p}(y; F) = − { H(q_p − y) − p } / f(q_p),

so that

    v_L(F) = n^{-1} ∫ L_{q_p}²(y; F) dF(y) = p(1 − p) / { n f(q_p)² },

the empirical version of which requires an estimate of f(q_p). But since nonparametric density estimates converge much more slowly than estimates of means, variances, and so forth, estimation of variance for quantile estimates is harder and requires much larger samples.
Example 2.20 (City population data)  For the ratio estimate t = x̄/ū, calculations in Problem 2.16 lead to empirical influence values l_j = (x_j − t u_j)/ū. Numerical values for the city population data of size 10 are given in Table 2.5; the regression estimates are discussed in Example 2.23. The variance estimate is v_L = n^{-2} Σ l_j² = 0.182².
The l_j are plotted in the left panel of Figure 2.11. Values of y_j = (u_j, x_j) close to the line x = tu have little effect on the ratio t. Changing the data by giving more weight to those y_j with negative influence values, for which (u_j, x_j) lies below the line, would result in smaller values of t than that actually observed, and conversely. We discuss the right panels in Example 2.23.
For a statistic defined through an estimating function c(y, θ), with θ determined by ∫ c(y; θ) dF(y) = 0, the influence function is

    L_t(y; F) = c(y, θ) / E{ −ċ(Y; θ) },

where ċ(y; θ) = ∂c(y; θ)/∂θ. A simple illustration is Example 2.20, where t is determined by the estimating function c(y, θ) = x − θu.
For some purposes it is useful to go beyond the first derivative term in the expansion of t(F̂) and obtain the quadratic approximation

    t(G) ≐ t(F) + ∫ L_t(y; F) dG(y) + ½ ∫∫ Q_t(y, z; F) dG(y) dG(z),    (2.41)

where the second derivative is

    Q_t(y, z; F) = ∂² t{ (1 − ε_1 − ε_2)F + ε_1 H_y + ε_2 H_z } / ∂ε_1 ∂ε_2 |_{ε_1 = ε_2 = 0}.    (2.42)

A related set of quantities are the jackknife approximations

    l_jack,j = (n − 1)(t − t_{−j}),    (2.43)

where t_{−j} is the estimate calculated with y_j omitted from the data. In effect this corresponds to numerical approximation (2.37) using ε = −(n − 1)^{-1}; see Problem 2.18.
The jackknife values give the bias and variance approximations

    b_jack = −n^{-1} Σ_{j=1}^{n} l_jack,j,    v_jack = { n(n − 1) }^{-1} Σ_{j=1}^{n} ( l_jack,j − l̄_jack )².

Under sampling from F̂, the quadratic approximation (2.41) can be written in terms of the frequencies f*_j with which the y_j appear in a bootstrap sample, giving

    T* ≐ t + n^{-1} Σ_{j=1}^{n} f*_j l_j + ½ n^{-2} Σ_{j=1}^{n} Σ_{k=1}^{n} f*_j f*_k q_jk,    (2.44)

so that the bootstrap bias estimate is approximately E*(T*) − t ≐ ½ n^{-2} Σ_{j=1}^{n} q_jj.
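In R the jackknife values just defined, and the resulting bias and variance approximations, can be sketched as follows (our code, for a generic statistic tfun; not from the book's library).

jack <- function(y, tfun) {
  n <- length(y)
  t0 <- tfun(y)
  tm <- sapply(1:n, function(j) tfun(y[-j]))   # case-deletion values t_{-j}
  l <- (n - 1) * (t0 - tm)                     # jackknife influence values
  list(l = l,
       bias = -mean(l),
       var  = sum((l - mean(l))^2) / (n * (n - 1)))
}
set.seed(3)
jack(rexp(20), function(y) var(y))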
It follows that the bootstrap estimate of variance should be similar to the nonparametric delta method approximation.

Example 2.22 (City population data)  The right panels of Figure 2.11 show how 999 resampled values t* depend on f*_j for four values of j, for the data with n = 10. The lines with slope l_j summarize fairly well how t* depends on f*_j, but the correspondence is not ideal.
A different way to see this is to plot t* against the corresponding t*_L. Figure 2.12 shows this for 499 replicates. The line shows where the values for an exactly linear statistic would fall. The linear approximation is poor for n = 10, but it is more accurate for the full dataset, where n = 49. In Section 3.10 we outline how such plots may be used to find a suitable scale on which to set confidence limits.
Expression (2.44) suggests a way to approximate the l_j using the results of a bootstrap simulation. Suppose that we have simulated R samples from F̂ as described in Section 2.3. Define f*_{rj} to be the frequency with which the data value y_j occurs in the rth bootstrap sample. Then (2.44) implies that

    t*_r ≐ t + n^{-1} Σ_{j=1}^{n} f*_{rj} l_j,    r = 1, ..., R,    (2.45)
resulting in the regression equation

    d* = F* l + ε,    (2.46)

where F* is the R × (n − 1) matrix with (r, j) element n^{-1} f*_{rj}, and the rth element of the R × 1 vector d* is t*_r − t̄*. In fact (2.45) is related to an alternative, orthogonal expansion of T* in which the remainder term is uncorrelated with the linear piece.
The several different versions of influence produce different estimates of var(T). In general v_L is an underestimate, whereas use of the jackknife values or the regression estimates of the l_j will typically produce an overestimate. We illustrate this in Section 2.7.5.
Example 2.23 (City population data)  For the previous example of the ratio estimator, Table 2.5 gives regression estimates of empirical influence values, obtained from R = 1000 samples. The exact estimate v_L for var(T) is 0.036, compared to the value 0.043 obtained from the regression estimates. The bootstrap variance is 0.042. For n = 49 the corresponding values are 0.00119, 0.00125 and 0.00125.

Our experience is that R must be in the hundreds to give a good regression approximation to the empirical influence values.
    v*_{Lr} ≈ n^{-2} Σ_{j=1}^{n} { L_t(y*_{rj}; F̂) − n^{-1} Σ_{k=1}^{n} L_t(y*_{rk}; F̂) }²,    (2.49)

which is exact for a linear statistic. In effect this uses the usual formula, with l_j replaced by L_t(y*_{rj}; F̂) − n^{-1} Σ_k L_t(y*_{rk}; F̂) in the rth resample. However, the right-hand side of (2.49) can badly underestimate v*_{Lr} if the statistic is not close to linear. An improved approximation is outlined in Problem 2.20.
Example 2.24 (City population data)  Figure 2.13 compares the variance approximations for n = 10. The top left panel shows v*, computed with M = 50, plotted against the values v*_L for R = 200 bootstrap samples. The top right panel shows the values of the approximate variance on the right of (2.49), also plotted against v*_L. The lower panels show Q-Q plots of the corresponding z* values, with (t* − t)/v*^{1/2} on the horizontal axis. Plainly v*_L underestimates v*, though not so severely as to have a big effect on the studentized bootstrap statistic. But the right of (2.49) underestimates v*_L to an extent that greatly changes the distribution of the corresponding studentized bootstrap statistics.
The right-hand panels of the corresponding plots for the full data show more nearly linear relationships, so it appears that (2.49) is a better approximation at sample size n = 49. In practice the sample size cannot be increased, and it is necessary to seek a transformation of t to attain approximate linearity. The transformation outlined in Example 3.25 greatly increases the accuracy of (2.49), even with n = 10.
We briefly review three such methods here. The first two are in principle superior to resampling for certain applications, although their competitive merits in practice are largely untested. The third method provides an alternative to the nonparametric delta method for variance approximation.
Hence confidence intervals for μ can be determined. In practice one would take a random selection of R such subsets, and attach equal probability (R + 1)^{-1} to the R + 1 intervals defined by the R subsample values of the estimator. It is unclear how efficient this method is, and to what extent it can be generalized to other estimation problems.
From the earlier representation for t_r − t, each half-sample contribution has the form

    (t_r − t)² = ¼ Σ_{i=1}^{m} w_i²(y_{i1} − y_{i2})² + ¼ Σ_{i≠j} s_{ri} s_{rj} w_i w_j (y_{i1} − y_{i2})(y_{j1} − y_{j2}),

where s_{ri} = ±1 indicates which member of the ith stratum enters the rth half-sample. For a balanced set of R half-samples Σ_r s_{ri} s_{rj} = 0 for i ≠ j, and then

    R^{-1} Σ_{r=1}^{R} (t_r − t)²

equals

    ¼ Σ_{i=1}^{m} w_i²(y_{i1} − y_{i2})².
In the general case write

    t = μ̂ + Σ_{i=1}^{m} k^{-1} Σ_{j=1}^{k} a_{ij} = μ̂ + Σ_{i=1}^{m} ā_i,

say. Suppose that in the rth subsample we take one observation from each stratum, as specified by the zero-one indicators c_{rij}. Then

    t_r − t = Σ_{i,j} c_{rij} (a_{ij} − ā_i),

which is a linear regression model without error in which the a_{ij} − ā_i are coefficients and the c_{rij} are covariate values to be determined. If the a_{ij} − ā_i
can be calculated, then the usual estimate of var(T) can be calculated. The choice of c_{rij} values corresponds to selection of a fractional factorial design, with only main effects to be calculated, and this is solved by a Plackett-Burman design. Once the subsampling design is obtained, the estimate of var(T) is a formula in the subsample values t_r. The same formula works for any statistic that is approximately linear.

The same principles apply for unequal stratum sizes, although then the solution is more complicated and makes use of orthogonal arrays.
statistic for confidence intervals and significance tests, and makes the connection to Edgeworth expansions for smooth statistics. The empirical choice of scale for resampling calculations is discussed by Chapman and Hinkley (1986) and Tibshirani (1988).

Hall (1986) analyses the effect of discreteness on confidence intervals. Efron (1987) discusses the numbers of simulations needed for bias and quantile estimation, while Diaconis and Holmes (1994) describe how simulation can be avoided completely by complete enumeration of bootstrap samples; see also the bibliographic notes for Chapter 9.
Bickel and Freedman (1981) were among the first to discuss the conditions under which the bootstrap is consistent. Their work was followed by Bretagnolle (1983) and others, and there is a growing theoretical literature on modifications to ensure that the bootstrap is consistent for different classes of awkward statistics. The main modifications are smoothing of the data (Section 3.4), which can improve matters for nonsmooth statistics such as quantiles (De Angelis and Young, 1992), subsampling (Politis and Romano, 1994b), and reweighting (Barbe and Bertail, 1995). Hall (1992a) is a key reference to Edgeworth expansion theory for the bootstrap, while Mammen (1992) describes simulations intended to help show when the bootstrap works, and gives theoretical results for various situations. Shao and Tu (1995) give an extensive theoretical overview of the bootstrap and jackknife.

Athreya (1987) has shown that the bootstrap can fail for long-tailed distributions. Some other examples of failure are discussed by Bickel, Götze and van Zwet (1996).
The use of linear approximations and influence functions in the context of robust statistical inference is discussed by Hampel et al. (1986). Fernholz (1983) describes the expansion theory that underlies the use of these approximation methods. An alternative and orthogonal expansion, similar to that used in Section 2.7.4, is discussed by Efron and Stein (1981) and Efron (1982). Tail-specific approximations are described by Hesterberg (1995a).

The use of multiple-deletion jackknife methods is discussed by Hinkley (1977), Shao and Wu (1989), Wu (1990), and Politis and Romano (1994b), the last with numerous theoretical examples. The method based on all non-empty subsamples is due to Hartigan (1969), and is nicely put into context in Chapter 9 of Efron (1982). Half-sample methods for survey sampling were developed by McCarthy (1969) and extended by Wu (1991). The relevant factorial designs for half-sampling were developed by Plackett and Burman (1946).
2.10 Problems
1
Let F̂ denote the EDF (2.1). Show that E{F̂(y)} = F(y) and that var{F̂(y)} = F(y){1 − F(y)}/n. Hence deduce that provided 0 < F(y) < 1, F̂(y) has a limiting normal distribution for large n, and that Pr(|F̂(y) − F(y)| ≤ ε) → 1 as n → ∞ for any positive ε. (In fact the much stronger property sup_{−∞<y<∞} |F̂(y) − F(y)| → 0 holds with probability one.)
(Section 2.1)
2
Let Y_1, ..., Y_n be a random sample from the exponential distribution with mean μ, and let Ȳ be their average.
(a) Show that Ȳ has the gamma density (1.1) with κ = n, so its mean and variance are μ and μ²/n.
(b) Show that log Ȳ is approximately normal with mean log μ and variance n^{-1}.
(c) Compare the normal approximations for Ȳ and for log Ȳ in calculating 95% confidence intervals for μ. Use the exact confidence interval based on (a) as the baseline for the comparison, which can be illustrated with the data of Example 1.1.
(Sections 2.1, 2.5.1)
3
This specifies the exact resampling density (2.28) of the sample median. (The result can be used to prove that the bootstrap estimate of var(T) is consistent as n → ∞.)
(c) Use the resampling distribution to show that for n = 11

    Pr*(T* ≤ y_(3)) = Pr*(T* ≥ y_(9)) = 0.051,

and apply (2.10) to deduce that the basic bootstrap 90% confidence interval for the population median θ is (2y_(6) − y_(9), 2y_(6) − y_(3)).
(d) Examine the coverage of the confidence interval in (c) for samples from normal and Cauchy distributions.
(Sections 2.3, 2.4; Efron, 1979, 1982)
5
Let y_1, ..., y_n be a random sample from the gamma density f_{κ,μ}(y). Let

    ℓ_p(μ) = max_κ Σ_{j=1}^{n} log f_{κ,μ}(y_j)

be the profile log likelihood for μ, and let Q = 2{ ℓ_p(μ̂) − ℓ_p(μ) }. In theory Q should be approximately a χ²_1 variable for large n. Use simulation to examine whether or not Q is approximately pivotal for n = 10 when κ is in the range (0.5, 2).
(Section 2.5.1)
7
Show that the normal-approximation quantile estimate T̄* + z_p V*^{1/2}, computed from R resamples, has approximate variance

    (v/R) { 1 + z_p κ_3/v^{3/2} + ¼ z_p² (2 + κ_4/v²) },

where κ_3 and κ_4 are the third and fourth cumulants of T* under bootstrap resampling. If T is asymptotically normal, κ_3/v^{3/2} = O(n^{-1/2}) and κ_4/v² = O(n^{-1}). Compare this variance to that of the bootstrap quantile estimate t*_{((R+1)p)} in the special case T = Ȳ.
(Sections 2.2.1, 2.5.2; Appendix A)
8
Suppose that estimator T has expectation equal to θ(1 + γ), so that the bias is θγ. The bias factor γ can be estimated by C = E*(T*)/T − 1. Show that in the case of the variance estimate T = n^{-1} Σ (Y_j − Ȳ)², C is exactly equal to γ. If C were approximated from R resamples, what would be the simulation variance of the approximation?
(Section 2.5)
9
Suppose that the random variables U = (U_1, ..., U_m) have means ζ_1, ..., ζ_m and covariances cov(U_k, U_l) = n^{-1} ω_{kl}(ζ), and that T_1 = g_1(U), ..., T_q = g_q(U). Show that

    E(T_i) ≐ g_i(ζ) + ½ n^{-1} Σ_{k,l} ω_{kl}(ζ) ∂²g_i(ζ)/∂ζ_k ∂ζ_l,
    cov(T_i, T_j) ≐ n^{-1} Σ_{k,l} ω_{kl}(ζ) { ∂g_i(ζ)/∂ζ_k } { ∂g_j(ζ)/∂ζ_l }.

Hence show that

    v = (nū)^{-2} Σ_{i=1}^{n} (x_i − t u_i)²

is a variance estimate for t = x̄/ū, based on independent pairs (u_1, x_1), ..., (u_n, x_n).
(Section 2.7.1)
10
(a) Show that the influence function for a linear statistic t(F) = ∫ a(x) dF(x) is a(y) − t(F). Hence obtain the influence functions for a sample moment μ'_r = ∫ x^r dF(x), for the variance μ_2(F) − {μ_1(F)}², and for the correlation coefficient (Example 2.18).
(b) Show that the influence function for { t(F) − θ }/v(F)^{1/2} evaluated at θ = t(F) is v(F)^{-1/2} L_t(y; F). Hence obtain the empirical influence values l_j for the studentized quantity { t(F̂) − t(F) }/v_L(F̂)^{1/2}, and show that they have the properties Σ l_j = 0 and n^{-2} Σ l_j² = 1.
(Section 2.7.2; Hinkley and Wei, 1984)
11
12
(a) Suppose that the statistical function t(·) is defined by the estimating equation

    ∫ u{ y; t(F) } dF(y) = 0,

and let u̇(x; θ) = ∂u(x; θ)/∂θ. Show that

    L_t(x; F) = u{ x; t(F) } / [ − ∫ u̇{ y; t(F) } dF(y) ].

Hence show that with θ̂ = t(F̂) the jth empirical influence value is

    l_j = u(y_j; θ̂) / { −n^{-1} Σ_{i=1}^{n} u̇(y_i; θ̂) }.

(b) Let ψ̂ be the maximum likelihood estimator of the (possibly vector) parameter of a regular parametric model f_ψ(y) based on a random sample y_1, ..., y_n. Show that the jth empirical influence value for ψ̂ at y_j may be written as n I^{-1} S_j, where

    I = − Σ_{i=1}^{n} ∂² log f_ψ̂(y_i) / ∂ψ ∂ψᵀ,    S_j = ∂ log f_ψ̂(y_j) / ∂ψ.

Hence show that the nonparametric delta method variance estimate for ψ̂ is the so-called sandwich estimator

    v_L = I^{-1} ( Σ_{j=1}^{n} S_j S_jᵀ ) I^{-1}.

Compare this to the usual parametric approximation when y_1, ..., y_n is a random sample from the exponential distribution with mean ψ.
(Section 2.7.2; Royall, 1986)
13
The α-trimmed mean is t(F̂), where

    t(F) = (1 − 2α)^{-1} ∫_{q_α(F)}^{q_{1−α}(F)} u dF(u)

and q_α(F) is the α quantile of F. Express t(F̂) in terms of order statistics, assuming that nα is an integer. How would you extend this to deal with non-integer values of nα?
Suppose that F is a distribution symmetric about its mean μ. Use the result of Example 2.19 to show that the influence function of t(F) is

    L_t(y; F) = (1 − 2α)^{-1} × { q_α(F) − μ,  y < q_α(F);
                                  y − μ,       q_α(F) ≤ y ≤ q_{1−α}(F);
                                  q_{1−α}(F) − μ,  q_{1−α}(F) < y }.

Hence show that the variance approximation is

    v_L(F) = n^{-1}(1 − 2α)^{-2} [ ∫_{q_α(F)}^{q_{1−α}(F)} (y − μ)² dF(y) + 2α{ q_{1−α}(F) − μ }² ].

Evaluate this at F = F̂.
(Section 2.7.2)
14
Consider the biased sample variance t = n^{-1} Σ (y_j − ȳ)².
(a) Show that the empirical influence values and second derivatives are

    l_j = (y_j − ȳ)² − t,    q_jk = −2(y_j − ȳ)(y_k − ȳ).
16
Suppose that t(F̂) can be expressed as a function t(p) of the vector p = (p_1, ..., p_n) of probabilities attached to the data values, with p̄ = (n^{-1}, ..., n^{-1}) corresponding to the EDF, and write t_j(p) = ∂t(p)/∂p_j and t_jk(p) = ∂²t(p)/∂p_j ∂p_k. Show that

    l_j = t_j(p̄) − n^{-1} Σ_{k=1}^{n} t_k(p̄),

and that

    q_ij = ∂² t{ (1 − ε_1 − ε_2)p̄ + ε_1 1_i + ε_2 1_j } / ∂ε_1 ∂ε_2 |_{ε_1 = ε_2 = 0}.

Hence deduce that

    q_ij = t_ij(p̄) − n^{-1} Σ_{k=1}^{n} t_ik(p̄) − n^{-1} Σ_{k=1}^{n} t_kj(p̄) + n^{-2} Σ_{k,l=1}^{n} t_kl(p̄).
(Section 2.7.2)
17
Suppose that t = ½(y_(m) + y_(m+1)) is the median of a sample of even size n = 2m from a distribution with continuous CDF F and PDF f whose median is μ. Show that the case-deletion values t_{−j} are either y_(m) or y_(m+1), and that the jackknife variance estimate is

    v_jack = ¼ (n − 1) { y_(m+1) − y_(m) }².

By writing Y_(j) = F^{-1}{ 1 − exp(−E_(j)) }, where E_(j) is the jth order statistic of a random sample E_1, ..., E_n from the standard exponential distribution, and recalling properties of exponential order statistics, show that

    n v_jack →_d E² / { 4 f²(μ) } = ( χ²_2 )² / { 16 f²(μ) },

where E is standard exponential, as n → ∞. This confirms that the jackknife variance estimate is not consistent.
(Section 2.7.3)
18
Show that in (b) and (c) the squared distance (dF̂_ε − dF̂)ᵀ(dF̂_ε − dF̂) from F̂ to F̂_ε = (1 − ε)F̂ + εH_{y_j} is of order O(n^{-2}), but that if F̂* is generated by bootstrap sampling, E*{ (dF̂* − dF̂)ᵀ(dF̂* − dF̂) } = O(n^{-1}). Hence discuss the results you would expect from the butcher knife, which uses ε = n^{-1/2}. How would you calculate it?
(Section 2.7.3; Efron, 1982; Hesterberg, 1995a)
19
(The joint distribution of the frequencies f*_1, ..., f*_n under resampling is multinomial, with moment generating function E*{ exp(Σ_j t_j f*_j) } = { n^{-1} Σ_j exp(t_j) }^n.)
(a) Show that with p_j = n^{-1}, the first four cumulants of the f*_j are

    E*(f*_i) = 1,
    cov*(f*_i, f*_j) = δ_ij − n^{-1},
    cum*(f*_i, f*_j, f*_k) = n^{-2} { n² δ_ijk − n δ_ij[3] + 2 },
    cum*(f*_i, f*_j, f*_k, f*_l) = n^{-3} { n³ δ_ijkl − n²( δ_ijk[4] + δ_ij δ_kl[3] ) + 2n δ_ij[6] − 6 },

where δ_ij = 1 when i = j and zero otherwise, and so on, and δ_ij[3] = δ_ij + δ_ik + δ_jk, and so forth.
(b) Now consider t*_Q = t + n^{-1} Σ_j f*_j l_j + ½ n^{-2} Σ_{j,k} f*_j f*_k q_jk. Show that E*(t*_Q) = t + ½ n^{-2} Σ_j q_jj, and show that the variance of t*_Q is given by (2.51) in terms of the l_j and q_jk.
20
Show that the difference between the second derivative Q_t(x, y) and the first derivative of L_t(x) is equal to L_t(y). Hence show that the empirical influence value can be written as

    l_j = L_t(y_j) + n^{-1} Σ_{k=1}^{n} { Q_t(y_j, y_k) − L_t(y_k) }.

Use the resampling version of this result to discuss the accuracy of approximation (2.49) for v*_L.
(Sections 2.7.2, 2.7.5)
2.11 Practicals
1
attach(co.transfer)
plot(0.5*(entry+week), week-entry)
t.test(week-entry)
Are the differences normal? Is the Student-t confidence interval reliable?
For a bootstrap approach:
t0 <- cd4.boot$t0[1]
tstar <- cd4.boot$t[,1]
vL <- cd4.boot$t[,2]
zstar <- (tstar-t0)/sqrt(vL)
fisher <- function(r) 0.5*log((1+r)/(1-r))
split.screen(c(1,2))
screen(1); plot(tstar, vL)
screen(2); plot(fisher(tstar), vL/(1-tstar^2)^2)
For a studentized bootstrap confidence interval on transformed scale:
4
How many simulations are required for quantile estimation? To get some idea, we make four replicate plots with 39, 99, 399 and 999 simulations.
split.screen(c(4,4))
quantiles <- matrix(NA,16,4)
n <- c(39,99,399,999)
p <- c(0.025,0.05,0.95,0.975)
for (i in 1:4)
{ y <- rnorm(999)
  for (j in 1:4) {
    quantiles[(j-1)*4+i,] <- quantile(y[1:n[j]], probs=p)
    screen((i-1)*4+j)
    qqnorm(y[1:n[j]], ylab="y", main=paste("R =", n[j]))
    abline(h=quantile(y[1:n[j]], p), lty=2) } }
Repeat the loop a few times. How large a simulation is required to get reasonable estimates of the 0.05 and 0.95 quantiles? Of the 0.025 and 0.975 quantiles?
(Section 2.5.2)
5
close.screen(all=T); plot(tstar, linear.approx(cd4.boot, L.reg))
Find the correlation between t* and its linear approximation. Make the corresponding plots for the other empirical influence values. Are the plots better on the transformed scale?
(Section 2.7)
3
Further Ideas
3.1 Introduction
In the previous chapter we laid out the basic elements of resampling or bootstrap methods, in the context of the analysis of a single homogeneous sample of data. This chapter deals with how those ideas are extended to some more complex situations, and then turns to uses for variations and elaborations of simple bootstrap schemes.

In Section 3.2 we describe how to construct resampling algorithms for several independent samples, and then in Section 3.3 we discuss briefly the use of partial modelling, either qualitative or semiparametric, a topic explored more fully in the later chapters on regression models (Chapters 6 and 7). Section 3.4 examines when it is worthwhile to modify the statistic by using a smoothed empirical distribution function. In Sections 3.5 and 3.6 we turn to situations where data are censored or missing and therefore are incomplete. One relatively simple situation where the standard bootstrap must be modified to succeed is finite population sampling, which we consider in Section 3.7. In Section 3.8 we deal with simple situations of hierarchical variation. Section 3.9 is an account of nested bootstrapping, where we outline how to overcome some of the shortcomings of a single bootstrap calculation by a further level of simulation. Section 3.10 describes bootstrap diagnostics, which are concerned with the assessment of sensitivity of resampling analysis to individual observations, as well as the use of bootstrap output to suggest modifications to the calculations. Finally, Section 3.11 describes the use of nested bootstrapping in selecting an estimator from the data.
3.2 Several Samples
Since each of the k populations is separate, nonparametric simulation from their respective EDFs F̂_1, ..., F̂_k leads to datasets y*_1, ..., y*_k, where y*_i is generated by sampling n_i times with equal probabilities n_i^{-1}, with replacement, from the ith original sample, independently of all other simulated samples. This amounts to stratified sampling in which each of the original samples corresponds to a stratum, and n_i observations are taken with equal probability from the ith stratum. With this extension of the resampling algorithm, we proceed as outlined in Chapter 2. For example, if v = v(F̂_1, ..., F̂_k) is an estimated variance for t, confidence intervals for θ could be based on simulated values of z* = (t* − t)/v*^{1/2} just as described in Section 2.4, where now t* and v* are formed from samples generated by the simulation algorithm described above.
Example 3.1 (Difference of population means)  Suppose that we are interested in the difference of two population means, θ = t(F_1, F_2) = ∫ y dF_1(y) − ∫ y dF_2(y). The corresponding estimate of t(F_1, F_2) based on independent samples from the two distributions is the difference of the two sample averages, t = ȳ_1 − ȳ_2, whose variance would be estimated by the sum of the unbiased variance estimates, v = n_1^{-1} s_1² + n_2^{-1} s_2². This differs slightly from the delta method variance approximation, which we describe in Section 3.2.1. A simulated value of T would be t* = ȳ*_1 − ȳ*_2, where ȳ*_1 is the average of n_1 observations generated with equal probability from the first sample, y_11, ..., y_{1n_1}, and similarly for ȳ*_2.
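In the boot library used for the practicals, resampling of this stratified kind is requested through the strata argument. The following sketch (dataset and statistic invented for illustration) estimates the resampling variance of the difference of two sample averages.

library(boot)
dat <- data.frame(y = c(rnorm(15, 0), rnorm(20, 1)),
                  grp = rep(1:2, c(15, 20)))
diff.means <- function(d, i) {
  d <- d[i, ]                                   # indices permuted within strata
  mean(d$y[d$grp == 2]) - mean(d$y[d$grp == 1])
}
set.seed(5)
b <- boot(dat, diff.means, R = 999, strata = dat$grp)
var(b$t)     # resampling variance of t*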
[Table: the measurements in each of the independent series of data analysed in the example below; columns give the successive series.]
The estimate of g is the weighted average

    t = [ Σ_{i=1}^{k} μ(F̂_i)/σ²(F̂_i) ] / [ Σ_{i=1}^{k} 1/σ²(F̂_i) ],

where F̂_i is the EDF of the ith series, μ(F̂_i) is an estimate of g from F̂_i, and σ²(F̂_i) is an estimated variance for μ(F̂_i). The estimated variance of T is

    v = { Σ_{i=1}^{k} 1/σ²(F̂_i) }^{-1}.

If the data were thought to be normally distributed with mean g but different variances, we would take

    μ(F̂_i) = ȳ_i,    σ²(F̂_i) = { n_i(n_i − 1) }^{-1} Σ_{j=1}^{n_i} (y_ij − ȳ_i)²,

the average of the ith series and its estimated variance. The resulting estimator T is then an empirical version of the optimal weighted average. For our data t = 78.54 with standard error v^{1/2} = 0.59.
Figure 3.2 shows summary plots for R = 999 nonparametric simulations from this model. The top panels show normal plots for the replicates t* and for the corresponding studentized bootstrap statistics z* = (t* − t)/v*^{1/2}. Both are more dispersed than normal. There is one large negative value of z*, and the lower panels show why: on the left we see that the v* for the smallest value of t* is very small, which inflates the corresponding z*. We would certainly omit this value on the grounds that it is a simulation outlier.

The average and variance of the t* are 78.51 and 0.371, so the bias estimate for t is 78.51 − 78.54 = −0.03, and a 95% confidence interval for g based on a normal approximation is (77.37, 79.76). The 0.025 × (R + 1) and 0.975 × (R + 1) order statistics of the z* are −3.03 and 2.50, so the 95% studentized bootstrap confidence interval for g is (77.07, 80.32), slightly wider than that based on the normal approximation, as the top right panel of Figure 3.2 would suggest.
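The interval calculation is easily packaged. The sketch below (ours) forms studentized bootstrap limits from replicates tstar and vstar, given the original t and v; (R + 1)α should be an integer, as in the example above.

stud.ci <- function(t, v, tstar, vstar, alpha = 0.025) {
  zstar <- sort((tstar - t) / sqrt(vstar))
  R <- length(zstar)                           # (R + 1) * alpha assumed an integer
  c(lower = t - sqrt(v) * zstar[(R + 1) * (1 - alpha)],
    upper = t - sqrt(v) * zstar[(R + 1) * alpha])
}
# toy illustration with the sample mean:
set.seed(6)
y <- rnorm(25); t0 <- mean(y); v0 <- var(y) / 25
sim <- replicate(999, { ys <- sample(y, replace = TRUE)
                        c(mean(ys), var(ys) / 25) })
stud.ci(t0, v0, sim[1, ], sim[2, ])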
Apart from the resampling algorithm, this mimics exactly the studentized bootstrap procedure described in Section 2.4.
The multi-sample version of the linear approximation is

    t(F̂_1, ..., F̂_k) ≐ t(F_1, ..., F_k) + Σ_{i=1}^{k} n_i^{-1} Σ_{j=1}^{n_i} L_{t,i}(y_ij; F),    (3.1)

where

    L_{t,i}(y; F) = ∂ t(F_1, ..., (1 − ε)F_i + εH_y, ..., F_k) / ∂ε |_{ε=0},    (3.2)
and for brevity we write F = (F_1, ..., F_k). As in the single sample case, the influence functions have mean zero, E{L_{t,i}(Y; F)} = 0 for each i. Then the immediate consequence of (3.1) is the nonparametric delta method approximation

    T − θ ≈ N(0, v_L)

for large n_1, ..., n_k, where the variance approximation v_L is given by the variance of the second term on the right-hand side of (3.1), that is

    v_L = Σ_{i=1}^{k} n_i^{-1} var{ L_{t,i}(Y; F) | F_i }.    (3.3)
By analogy with the single sample case, empirical influence values are obtained by substituting the EDFs F̂ = (F̂_1, ..., F̂_k) for the CDFs F in (3.2), to give

    l_ij = L_{t,i}(y_ij; F̂).

These values satisfy Σ_{j=1}^{n_i} l_ij = 0 for each i. Substitution of empirical variances of the empirical influence values in (3.3) gives the variance approximation

    v_L = Σ_{i=1}^{k} n_i^{-2} Σ_{j=1}^{n_i} l_ij².    (3.4)
For the difference of two means we have

    L_{t,1}(y; F) = ∂/∂ε [ ∫ x_1 { (1 − ε) dF_1(x_1) + ε dH_y(x_1) } − ∫ x_2 dF_2(x_2) ] |_{ε=0} = y − μ_1,

just as in Example 2.17. Similarly L_{t,2}(y; F) = −(y − μ_2). In this case the linear approximation (3.1) is exact. The variance approximation formula (3.3) gives

    v_L = var(Y_1)/n_1 + var(Y_2)/n_2,
whose empirical version is

    v_L = n_1^{-2} Σ_{j=1}^{n_1} (y_1j − ȳ_1)² + n_2^{-2} Σ_{j=1}^{n_2} (y_2j − ȳ_2)².

As usual this differs slightly from the unbiased variance approximation. Note that if we could assume that the two population variances were equal, then it would be appropriate to replace v_L by

    (n_1^{-1} + n_2^{-1}) s²,

where s² is the pooled estimate of the common variance.
The various comments made about calculation in Section 2.7 apply here with obvious modifications. Thus the empirical influence values can be approximated accurately by numerical differentiation, which here means

    l_ij ≈ [ t(F̂_1, ..., (1 − ε)F̂_i + εH_{y_ij}, ..., F̂_k) − t ] / ε

for small ε. Similarly the jackknife approximations are

    l_jack,ij = (n_i − 1)(t − t_{−ij}),    (3.5)

where t_{−ij} is the estimate obtained by omitting the jth case in the ith sample. The corresponding variance approximation is

    v_jack = Σ_{i=1}^{k} { n_i(n_i − 1) }^{-1} Σ_{j=1}^{n_i} ( l_jack,ij − l̄_jack,i )².
One can also generalize the discussion of bias approximation in Section 2.7.3. However, the extension of the quadratic approximation (2.41) is not straightforward, because there are cross-population terms.

The same approximation (3.1) could be used even when the samples, and hence the F̂_i, are correlated. But this would have to be taken into account in (3.3), which as stated assumes mutual independence of the samples. In general it would be safer to incorporate dependence through the use of appropriate multivariate EDFs.
for appropriate estimates μ̂_i and σ̂_i, to verify homogeneity across samples. The common F_0 will be estimated by the EDF of all n = Σ n_i of the e_ij, or better by the EDF of the standardized residuals e_ij/(1 − n_i^{-1})^{1/2}. The resampling algorithm will then be

    Y*_ij = μ̂_i + ε*_ij,    j = 1, ..., n_i,  i = 1, ..., k,

where the ε*_ij are randomly sampled from the EDF, i.e. randomly sampled with replacement from the standardized e_ij; see Problem 3.1.
In another context, with positive data such as lifetimes, it might be appropriate to think of distributions as differing only by multiplicative effects, i.e. Y_ij = μ_i ε_ij, where the ε_ij are randomly sampled from some baseline distribution with unit mean. The exponential distribution is a parametric model of this form. The principle here would be essentially the same: estimate the ε_ij by residuals such as e_ij = y_ij/μ̂_i, then define Y*_ij = μ̂_i ε*_ij with the ε*_ij randomly sampled with replacement from the e_ij.
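A sketch of this multiplicative scheme in R (ours; the group means serve both as the multiplicative effects and as the statistic, purely for illustration) is as follows.

mult.boot <- function(ylist, R, tfun = function(y) mean(y)) {
  mu <- sapply(ylist, mean)                   # multiplicative effects muhat_i
  e <- unlist(mapply(function(y, m) y / m, ylist, mu, SIMPLIFY = FALSE))
  replicate(R, {
    ystar <- mapply(function(m, ni) m * sample(e, ni, replace = TRUE),
                    mu, sapply(ylist, length), SIMPLIFY = FALSE)
    sapply(ystar, tfun)                       # statistic for each resampled group
  })
}
set.seed(7)
out <- mult.boot(list(rexp(10, 1), rexp(15, 0.5)), R = 500)
apply(out, 1, var)    # resampling variances of the group means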
Similar ideas apply in regression situations. The parametric part of the model concerns the systematic relationship between the response y and explanatory variables x, e.g. through the mean, and the nonparametric part concerns the random variation. We consider this in detail in Chapters 6 and 7.

Resampling plans such as those just outlined will give more accurate answers when their assumptions about the relationships between the F_i are correct, but they are not robust to failure of these assumptions. Some pooling of information
across samples may be essential in order to avoid difficulties when the samples are small, but otherwise it is usually unnecessary.

If we widen the meaning of semiparametric to include any partial modelling, then features less tangible than parameters come into play. The following two examples illustrate this.

In both of these examples the resulting estimate will be more efficient than the EDF. This may be less important than producing a model which satisfies the practical assumptions and makes intuitive sense.
3.4 Smooth Estimates of F

One way to smooth the resampling model is to replace F̂ by the kernel density estimate

    f̂_h(y) = (nh)^{-1} Σ_{j=1}^{n} w{ (y − y_j)/h },    (3.6)

where w(·) is a continuous and symmetric PDF with mean zero and unit variance, and do calculations or simulations based on the corresponding CDF F̂_h, rather than on the EDF F̂. This corresponds to simulation by setting

    Y*_j = y_{I_j} + h ε_j,    j = 1, ..., n,

where the I_j are independent and uniformly distributed on the integers 1, ..., n and the ε_j are a random sample from w(·), independent of the I_j. This is the smoothed bootstrap. Note that h = 0 recovers the EDF.
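In R the smoothed bootstrap is a one-line modification of ordinary resampling; in the sketch below (ours) w(·) is taken to be the standard normal density, and the median is the statistic.

smooth.boot <- function(y, R, h, tfun = median) {
  n <- length(y)
  # resample the data and add h times N(0, 1) noise, as in (3.6)
  replicate(R, tfun(y[sample(n, n, replace = TRUE)] + h * rnorm(n)))
}
set.seed(8)
y <- rnorm(21)
c(h0 = sd(smooth.boot(y, 999, h = 0)),      # unsmoothed bootstrap
  h.5 = sd(smooth.boot(y, 999, h = 0.5)))   # smoothed, h = 0.5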
The variance of an observation generated from (3.6) is n^{-1} Σ (y_j − ȳ)² + h², and it may be preferable for the samples to have the same variance as for the unsmoothed bootstrap. This is implemented via the shrunk smoothed bootstrap, under which h smooths between F̂ and a model in which data are generated from density w(·) centred at the mean and rescaled to have the variance of F̂; see Problem 3.8.
Having decided which smoothed bootstrap is to be used, we estimate the required property of F, a(F), by a(F̂_h) rather than a(F̂). So if T is an estimator of θ = t(F), and we intend to estimate a(F) = var(T | F) by simulation, we would obtain values t*_1, ..., t*_R calculated from samples generated from F̂_h, and then estimate a(F) by (R − 1)^{-1} Σ (t*_r − t̄*)². Notice that it is a(F), not t(F), that is estimated using smoothing.

To see when a(F̂_h) is better than a(F̂), suppose that a(F) has linear approximation (2.35). Then

    a(F̂_h) − a(F̂) ≐ n^{-1} Σ_{j=1}^{n} ∫ L_a(Y_j + hε; F) w(ε) dε − n^{-1} Σ_{j=1}^{n} L_a(Y_j; F)
                 ≐ ½ h² n^{-1} Σ_{j=1}^{n} L″_a(Y_j; F) + ⋯.
Table 3.2. Root mean squared errors of bootstrap estimates of a(F), the scaled standard deviation of the transformed correlation coefficient (Example 3.7), for 200 samples of sizes n = 20 and 80: usual bootstrap (h = 0) and smoothed bootstrap with smoothing parameter h.

        Usual             Smoothed, h
  n     h = 0     0.1    0.25     0.5     1.0
  20     18.9    18.6    16.6    11.9     6.6
  80     11.4    11.2    10.4     8.5     6.4
(3.7)
Smoothing is not beneficial if the coefficient of h² is positive, but if it is negative (3.7) can be reduced by choosing a positive value of h that trades off the last two terms. The leading term in (3.7) is unaffected by the choice of h, which suggests that in large samples any effect of smoothing will be minor for such statistics.
Example 3.7 (Sample correlation)  To illustrate the discussion above, we take a(F) to be the scaled standard deviation of T = ½ log{ (1 + C)/(1 − C) }, where C is the correlation coefficient for bivariate normal data. We extend (3.6) to bivariate y by taking w(·) to be the bivariate normal density with mean zero and variance matrix equal to the sample variance matrix. For each of 200 samples, we applied the smoothed bootstrap with different values of h and R = 200 to estimate a(F).

Table 3.2 shows results for two sample sizes. For n = 20 there is a reduction in root mean squared error by a factor of about three, whereas for n = 80 the factor is about two. Results for the shrunk smoothed bootstrap are the same, because of the scale invariance of C and the form of w(·).
(3.8)
Table 3.3. Root mean squared errors of usual, smoothed, and shrunk smoothed bootstrap estimates, for 1000 samples of sizes n = 11 and 81 from the t3 and exponential distributions.

                Usual          Smoothed, h                Shrunk smoothed, h
         n      h = 0    0.1   0.25    0.5    1.0      0.1   0.25    0.5    1.0
  t3     11      2.27   2.08   2.17   3.59  10.63     2.06   2.00   2.72   4.91
         81      0.97   0.76   0.77   1.81   6.07     0.75   0.67   1.17   2.30
  Exp    11      1.32   1.15   1.02   1.18   7.53     1.13   0.92   0.76   0.93
         81      0.57   0.48   0.37   0.41   1.11     0.47   0.34   0.27   0.27
whereas it is O(n^{-1/2}) in the unsmoothed case. Thus there are advantages to smoothing here, at least in large samples. Similar results hold for other quantiles.

Table 3.3 shows results of simulation experiments where 1000 samples were taken from the exponential and t3 distributions. For each sample smoothed and shrunk smoothed bootstraps were performed with R = 200 and several values of h. Unlike in Table 3.2, the advantage due to smoothing increases with n, and the shrunk smoothed bootstrap improves on the smoothed bootstrap, particularly at larger values of h.

As predicted by the theory, as n increases the root mean squared error decreases more rapidly for smoothed than for unsmoothed bootstraps; it decreases fastest for shrunk smoothing. For the t3 data the root mean squared error is not much reduced. For the exponential data smoothing was performed on the log scale, leading to reduction in root mean squared error by a factor two or so. Too large a value of h can lead to large increases in root mean squared error, but choice of h is less critical for shrunk smoothing. Overall, a small amount of shrunk smoothing seems worthwhile here, provided the data are well-behaved. But similar experiments with Cauchy data gave very poor results made worse by smoothing, so one must be sure that the data are not pathological. Furthermore, the gains in precision are not large enough to be critical, at least for these sample sizes.
The discussion above begs the important question of how to choose the smoothing parameter for use with a particular dataset. One possibility is to treat the problem as one of choosing among possible estimators a(F̂_h) and use the nested bootstrap, as in Example 3.26. However, the use of an estimated h is not sure to give improvement. When the rate of decrease of the optimal value of h is known, another possibility is to use subsampling, as in Example 8.6.
3.5 Censoring
3.5.1 Censored data
Censoring is present when data contain a lower or upper bound for an observation rather than the value itself. Such data often arise in medical and industrial reliability studies. In the medical context, the variable of interest might represent the time to death of a patient from a specific disease, with an indicator of whether the time recorded is exact or a lower bound due to the patient being lost to follow-up or to death from other causes.

The commonest form of censoring is right-censoring, in which case the value observed is Y = min(Y⁰, C), where C is a censoring value and Y⁰ is a nonnegative failure time, which is known only if Y⁰ ≤ C. The data themselves are pairs (Y, D), where D is a censoring indicator, which equals one if Y⁰ is observed and equals zero if C is observed. Interest is usually focused on the distribution F of Y⁰, which is obscured if there is censoring.

The survivor function and the cumulative hazard function are central to the study of survival data. The survivor function corresponding to F(y) is Pr(Y⁰ > y) = 1 − F(y), and the cumulative hazard function is Λ(y) = −log{1 − F(y)}. The cumulative hazard function may be written as ∫₀^y dΛ(u), where for continuous y the hazard function dΛ(y)/dy measures the instantaneous rate of failure at time y, conditional on survival to that point. A constant hazard λ leads to an exponential distribution of failure times with survivor and cumulative hazard functions exp(−λy) and λy; departures from these simple forms are often of interest.

The simplest model for censoring is random censorship, under which C is a random variable with distribution function G, independent of Y⁰. In this case the observed variable Y has survivor function

    Pr(Y > y) = { 1 − F(y) } { 1 − G(y) }.

Other forms of censoring also arise, and these are often more realistic for applications.
Suppose that the data available are a homogeneous random sample (y_1, d_1), ..., (y_n, d_n), and that censoring occurs at random. Let y_1 < ⋯ < y_n, so there are no tied observations. A standard estimate of the failure-time survivor function, the product-limit or Kaplan-Meier estimate, may then be written as

    1 − F̂(y) = ∏_{j: y_j ≤ y} { (n − j)/(n − j + 1) }^{d_j}.    (3.9)

If there is no censoring, all the d_j equal one, and F̂(y) reduces to the EDF of y_1, ..., y_n (Problem 3.9). The product-limit estimate changes only at successive failures, by an amount that depends on the number of censored observations
between them. Ties between censored and uncensored data are resolved by assuming that censoring happens instantaneously after a failure might have occurred; the estimate is unaffected by other ties. A standard error for 1 − F̂(y) is given by Greenwood's formula,

    [ { 1 − F̂(y) }² Σ_{j: y_j ≤ y} d_j / { (n − j)(n − j + 1) } ]^{1/2}.    (3.10)

In setting confidence intervals this is usually applied on a transformed scale. Both (3.9) and (3.10) are unreliable where the numbers at risk of failure are small.
Since 1 − d_j is an indicator of censoring, the product-limit estimate of the censoring survivor function 1 − G is

    1 − Ĝ(y) = ∏_{j: y_j ≤ y} { (n − j)/(n − j + 1) }^{1 − d_j}.    (3.11)

The corresponding estimate of the cumulative hazard function is

    Λ̂(y) = Σ_{j: y_j ≤ y} d_j / (n − j + 1),    (3.12)

and the product-limit estimate can be expressed in terms of its increments as

    1 − F̂(y) = ∏_{j: y_j ≤ y} { 1 − dΛ̂(y_j) }.    (3.13)
Table 3.4. Remission times (weeks) for two groups of patients with acute myelogenous leukaemia; > indicates a right-censored observation.

  Maintained:      9   13   >13   18   23   >28   31   34   >45   48   >161
  Non-maintained:  5    5     8    8   12   >16   23   27    30   33    43   45
The left panel of Figure 3.3 shows the estimated survivor functions for the times of remission. A plus on one of the lines indicates a censored observation. There is some suggestion that maintenance prolongs the time to remission, but the samples are small and the evidence is not overwhelming. The right panel shows the estimated survivor functions for the censoring times. Only one observation in the non-maintained group is censored, but the censoring distributions seem similar for both groups.

The estimated probabilities that remission will last beyond 20 weeks are respectively 0.71 and 0.59 for the groups, with standard errors from (3.10) both equal to 0.14.
" ~ J_ ) ,
Figure 3.3  Product-limit survivor function estimates for two groups of patients with AML, one receiving maintenance chemotherapy (solid) and the other not (dots). The left panel shows estimates for the time to remission, and the right panel shows the estimates for the time to censoring. In the left panel, + indicates times of censored observations; in the right panel + indicates times of uncensored observations.
    Λ̂*(y) = Σ_{j: y_j ≤ y} N*_j / n_j,    N*_j ~ binomial{ n_j, dΛ̂(y_j) } independently,    (3.14)

where n_j = n − j + 1 is the number at risk just before y_j; this can be used to estimate the uncertainty of the original estimate Λ̂(y). In this weird bootstrap the failures at different times are unrelated, the number at risk does not depend on previous failures, there are no individuals whose simulated failure times underlie Λ̂*(y), and no explicit assumption is made about the censoring mechanism. Indeed, under this scheme the censored individuals are held fixed, but the number of failures is a sum of binomial variables (Problem 3.10).
The simulated survivor function corresponding to (3.14) is obtained by substituting Λ̂* into (3.13).
Table 3.5  Results for 499 replicates of censored data bootstraps of Group 1 of the AML data: average (standard deviation) for estimated probability of remission beyond 20 weeks, average (standard deviation) for estimated median survival time, and the number of resamples in which case 3 occurs 0, 1, 2 and 3 or more times.

Figure 3.4  Comparison of distributions of differences in median survival times for censored data bootstraps applied to the AML data. The dotted line is the line x = y.
                                              Frequency of case 3
                Probability    Median         0     1     2    >=3
  Cases         0.72 (0.14)   32.5 (8.5)    180   182    95     42
  Conditional   0.72 (0.14)   32.8 (8.5)     75   351    71      3
  Weird         0.73 (0.12)   33.3 (7.2)      0   499     0      0
of the median 21, 19, and 2 times respectively. The weird bootstrap results for the median are less variable than the others.

The last columns of the table show the numbers of samples in which the smallest censored observation appears 0, 1, 2, and 3 or more times. Under the conditional scheme the observation appears more often than under the ordinary bootstrap, and under the weird bootstrap it occurs once in each resample.

Figure 3.4 compares the distributions of the difference of median survival times between the two groups, under the three schemes. Results for the conditional and ordinary bootstraps are similar, but the weird bootstrap again gives results that are less variable than the others.

This set of data gives an extreme test of methods for censored data, because quantiles of the product-limit estimate are very discrete. The weird bootstrap also gave results less variable than the other schemes for a larger set of data. In general it seems that case resampling and conditional resampling give quite similar and reliable results, both differing from the weird bootstrap.
3.6 Missing Data

The EM or expectation-maximization algorithm is widely used in incomplete data problems.
To estimate the population mean μ we should of course use the average response ȳ = (n − m)^{-1} Σ y_j of the non-missing data, whose variance we would estimate by

    v = (n − m)^{-2} Σ_{j=1}^{n−m} (y_j − ȳ)².

But think of this as a prototype missing data problem, to which resampling methods are to be applied. Consider the following two approaches:

1  First estimate μ by t = ȳ, the average of the non-missing data. Then
(a) simulate samples y*_1, ..., y*_n by sampling with replacement from the n observations y_1, ..., y_{n−m}, NA, ..., NA; then
(b) calculate t* as the average of non-missing values.

2  First impute each missing value, so that the completed sample is y_1, ..., y_{n−m}, ŷ_{n−m+1}, ..., ŷ_n; then resample with replacement from this completed sample and calculate t* from the resample.

Assuming that we discard all resamples with m* = n (all data missing), the bootstrap variance under the first approach will overestimate var(T) by a factor which ranges from 15% for n = 10, m = 5 to 4% for n = 30, m = 15.
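The first approach is easily simulated; the sketch below (ours) uses n = 30 observations with m = 15 missing, for comparison with the overestimation factor quoted above.

set.seed(9)
y <- c(rnorm(15), rep(NA, 15))      # n = 30, m = 15 values missing
tstar <- replicate(999, mean(sample(y, replace = TRUE), na.rm = TRUE))
var(tstar)                          # compare with var(y, na.rm = TRUE) / 15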
In the second approach, the first step was to fix the data so that the complete-data estimation formula μ̂ = n^{-1} Σ_{j=1}^{n} y_j for t could be used. Then we attempted to simulate data according to the two steps in the original data-generation process. Unfortunately the EDF of y_1, ..., y_{n−m}, ŷ_{n−m+1}, ..., ŷ_n is an underdispersed estimate of the true CDF F. Even though the estimate t is not affected in this particularly simple problem, the bootstrap distribution certainly is. This is illustrated by the corresponding bootstrap variance.

Both approaches can be repaired. In the first, we can stratify the sampling with complete and incomplete data as strata. In the second approach, we can add variability to the estimates of missing values. This device is called multiple imputation.
This example suggests two lessons. First, if the complete-data estimator can be modified to work for incomplete data, then resampling cases will work reasonably well provided the proportion of missing data is small; stratified resampling would reduce variation in the amount of missingness. Secondly, the complete-data estimator and full simulation of data observation (including the data-loss step) cannot be based on single imputation estimation of missing values, but may work if we use multiple imputation appropriately.

One further point concerns the data-loss mechanism, which in the example we assumed to be completely random. If data loss is dependent upon the response value y, then resampling cases should still be valid: this is somewhat similar to the censored-data problem. But the other approach via multiple imputation will become complicated because of the difficulty of defining appropriate multiple imputations.
Example 3.12 (Bivariate missing data)  A more realistic example concerns the estimation of bivariate correlation when some cases are incomplete. Suppose that Y is bivariate with components U and X. The parameter of interest is θ = corr(U, X). A random sample of n cases is taken, such that m cases have x missing, but no cases have both u and x missing or just u missing. If it is safe to assume that X has a linear regression on U, then we can use fitted regression to make single imputations of missing values. That is, we estimate each missing x_j by

    x̂_j = x̄ + b(u_j − ū),

where x̄, ū and b are the averages and the slope of linear regression of x on u from the n − m complete pairs.

It is easy to see that it would be wrong to substitute these single imputations in the usual formula for sample correlation. The result would be biased away from zero if b ≠ 0. Only if we can modify the sample correlation formula to remove this effect will it be sensible to use simple resampling of cases.

The other strategy is to begin with multiple imputation to obtain a suitable bivariate F̂, next estimate θ with the usual sample correlation t(F̂), and then resample appropriately. Multiple imputation uses the regression residuals from
Table 3.6. Average (standard deviation) of estimators for variance σ² and correlation θ from bivariate normal data (u, x) with sample size n = 20 and m = 10 x values missing at random. True values σ² = 1 and θ = 0.7. Results from 1000 simulated datasets.

        Full data     Complete pairs   Single imputation   Multiple imputation
  σ̂²   1.00 (0.33)    1.01 (0.49)       0.79 (0.44)          0.96 (0.46)
  θ̂    0.69 (0.13)    0.68 (0.20)       0.79 (0.18)          0.70 (0.19)

3.7 Finite Population Sampling

Suppose that the population of interest has values 𝒴_1, ..., 𝒴_N, and that θ = 𝒴̄ is to be estimated by the average ȳ of y_1, ..., y_n, sampled from the population either with or without replacement. Then

    var(Ȳ) = n^{-1} γ  (with replacement),    var(Ȳ) = n^{-1}(1 − f) γ  (without replacement),    (3.15)

where f = n/N is the sampling fraction and γ = (N − 1)^{-1} Σ_{j=1}^{N} (𝒴_j − 𝒴̄)². The sample variance c = (n − 1)^{-1} Σ (y_j − ȳ)² is an unbiased estimate of γ, and the usual standard error for ȳ under without-replacement sampling is obtained from the second line of (3.15) by replacing γ with c. Normal approximation to the distribution of Ȳ then gives approximate (1 − 2α) confidence limits ȳ ± (1 − f)^{1/2} c^{1/2} n^{-1/2} z_α for θ, where z_α is the α
quantile of the standard normal distribution. Such confidence intervals are a factor (1 − f)^{1/2} shorter than for sampling with replacement.

The lack of independence affects possible resampling plans, as is seen by applying the ordinary bootstrap to Ȳ. Suppose that Y*_1, ..., Y*_n is a random sample taken with replacement from y_1, ..., y_n. Their average Ȳ* has variance var*(Ȳ*) = n^{-2} Σ (y_j − ȳ)², and this has expected value n^{-2}(n − 1)γ over possible samples y_1, ..., y_n. This only matches the second line of (3.15) if f = n^{-1}. Thus for the larger values of f generally met in practice, ordinary bootstrap standard errors for ȳ are too large and the confidence intervals for θ are systematically too wide.
the possible without-replacement samples from 𝒴̂*, and the corresponding bootstrap value is t* = t(Y*_1, ..., Y*_n).

If N/n is not an integer, we write N = kn + l, where 0 < l < n, and form 𝒴̂* by taking k copies of y_1, ..., y_n and adding to them a sample of size l taken without replacement from y_1, ..., y_n. Bootstrap samples are formed as when N = kn, but a different 𝒴̂* is used for each. We call this the population bootstrap. Under a superpopulation model, the members of the population 𝒴 are themselves a random sample from an underlying distribution 𝒫. The nonparametric maximum likelihood estimate of 𝒫 is the EDF of the sample, which suggests the following resampling plan: generate a new population of size N by sampling with replacement from y_1, ..., y_n, and then take Y*_1, ..., Y*_n without replacement from that population. As one would expect, this gives results similar to the population bootstrap.
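A sketch of the population bootstrap in R (ours; pop.boot is an invented name) rebuilds a population of N = kn + l values and then resamples without replacement.

pop.boot <- function(y, N, R, tfun = mean) {
  n <- length(y)
  k <- N %/% n
  l <- N %% n
  replicate(R, {
    pop <- c(rep(y, k), sample(y, l))   # k copies plus l values without replacement
    tfun(sample(pop, n))                # without-replacement sample of size n
  })
}
set.seed(10)
y <- rnorm(10)
var(pop.boot(y, N = 49, R = 999))       # allows for the sampling fraction f = 10/49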
Example 3.14 (Sample average)  Suppose that y_1, ..., y_n are scalars, that N = kn, and that interest focuses on θ = N^{-1} Σ 𝒴_j, as in Example 3.13. Then under the population bootstrap,

    var*(Ȳ*) = N(n − 1) / { n(N − 1) } × (1 − f) n^{-1} c,

and this is the correct formula apart from the first factor on the right, which is typically close to one. Under the superpopulation bootstrap a straightforward calculation establishes that the mean variance of Ȳ* is (n − 1)/n × (1 − f) n^{-1} c (Problem 3.12).

These sampling schemes make almost the right allowance for the sampling fraction, at least for the average.
For the mirror-match scheme we suppose that n = km for integer m, and write Ȳ* = n^{-1} Σ_{i=1}^{k} Σ_{j=1}^{m} Y*_ij, where (Y*_i1, ..., Y*_im) is the ith without-replacement resample, independent of the other without-replacement resamples. Then we can use (3.15) to establish that var*(Ȳ*) = (km)^{-1}(1 − m/n)c. Because our assumptions imply that f = m/n, this is an unbiased estimate of var(Ȳ), but it would be biased if m ≠ nf.
The studentized bootstrap confidence limits are

    t − v^{1/2} z*_{((R+1)(1−α))}    and    t − v^{1/2} z*_{((R+1)α)}.

The ratio estimate of θ and its estimated variance are

    t_rat = Ū × ( Σ_{j=1}^{n} x_j ) / ( Σ_{j=1}^{n} u_j ),
    v_rat = (1 − f) / { n(n − 1) } × (Ū/ū)² Σ_{j=1}^{n} ( x_j − u_j t_rat/Ū )²,    (3.16)

where Ū is the known population mean of u.
For our data t_rat = 156.8 and v_rat = 10.85². The regression estimate is based on the straight-line regression x = β_0 + β_1 u, fit to the data (u_1, x_1), ..., (u_n, x_n) using least squares estimates β̂_0 and β̂_1. The regression estimate of θ and its estimated variance are

    t_reg = β̂_0 + β̂_1 Ū,    v_reg = (1 − f) { n(n − 1) }^{-1} Σ_{j=1}^{n} ( x_j − β̂_0 − β̂_1 u_j )².    (3.17)
[Table 3.7: ratio and regression estimates for a sample from the city population data, with bootstrap confidence limits under the schemes compared below: normal, modified sample size (n' = 2 and n' = 11), mirror-match (m = 2), population, and superpopulation.]

[Table 3.8: empirical lower, upper, and overall coverages (%), and average lengths with standard deviations, of nominal 90% confidence intervals for the ratio and regression estimators under the same schemes, from 1000 samples of size n = 10.]

[Figure: bootstrap replicates t* plotted against sqrt(v*).]
To compare the performances of the various methods in setting confidence intervals, we conducted a numerical experiment in which 1000 samples of size n = 10 were taken without replacement from the population of size N = 49. For each sample we calculated 90% confidence intervals [L, U] for θ using R = 999 bootstrap samples. Table 3.8 contains the empirical values of Pr(θ < L), Pr(θ < U), and Pr(L < θ < U). The normal intervals are short and their coverages are much too small, while the modified intervals with n' = 2 have the opposite problem. Coverages for the modified sample size with n' = 11 and for the population and superpopulation bootstraps are close to their nominal levels, though their endpoints seem to be slightly too far left. The 80% and 95% intervals and those for the regression estimator have similar properties. In line with other studies in the literature, we conclude that the population and superpopulation bootstraps are the best of those considered here.
Stratified sampling
In most applications the population is divided into k strata, the ith of which contains N_i individuals from which a sample of size n_i is taken without replacement, independent of other strata. The ith sampling fraction is f_i = n_i/N_i, and the proportion of the population in the ith stratum is w_i = N_i/N, where N = N_1 + ⋯ + N_k. The estimate of θ and its standard error are found by combining quantities from each stratum.

Two different setups can be envisaged for mathematical discussion. In the first, the small-k case, there is a small number of large strata: the asymptotic regime takes k fixed and n_i, N_i → ∞ with f_i → π_i, where 0 < π_i < 1.
Apart from there being k strata, the same ideas and results will apply as above, with the chosen resampling scheme applied separately in each stratum. The second setup, the large-k case, is where there are many small strata; in mathematical terms we suppose that k → ∞ but that N_i and n_i are bounded. This situation is more complicated, because biases from each stratum can combine in such a way that a bootstrap fails completely.
Example 3.16 (Average) For the population average θ = Σ_{i=1}^k w_i θ_i, estimated by t = Σ_{i=1}^k w_i ȳ_i, the variance of T is

    var(T) = Σ_{i=1}^k w_i²(1 − f_i) n_i⁻¹ γ_i,    (3.18)

where γ_i is the variance of the ith stratum, with the usual unbiased estimate

    v = Σ_{i=1}^k w_i²(1 − f_i) n_i⁻¹ (n_i − 1)⁻¹ Σ_{j=1}^{n_i} (y_{ij} − ȳ_i)².    (3.19)

The bootstrap variance of T* under within-stratum resampling is

    v* = Σ_{i=1}^k w_i²(1 − f_i) n_i⁻¹ × n_i⁻¹ Σ_{j=1}^{n_i} (y_{ij} − ȳ_i)²,    (3.20)

the mean of which is obtained by replacing the last term on the right by (n_i − 1)⁻¹ Σ_j (y_{ij} − ȳ_i)². If k is fixed and N_i → ∞ while f_i → π_i, (3.20) will converge to v, but this will not be the case if n_i, N_i are bounded and k → ∞.
The bootstrap bias estimate also may fail for the same reason (Problem 3.12). For setting confidence intervals using the studentized bootstrap, the key issue is not the performance of bias and variance estimates, but the extent to which the distribution of the resampled quantity Z* = (T* − t)/V*¹/² matches that of Z = (T − θ)/V¹/². Detailed calculations show that when the population and superpopulation bootstraps are used, Z and Z* have the same limiting distribution under both asymptotic regimes, and that under the fixed-k setup the approximation is better than that using the other resampling plans.
Example 3.17 (Stratified ratio) For empirical comparison of the more promising of these finite population resampling schemes with stratified data, we generated a population with N pairs (u, x) divided into strata of sizes N₁, ..., N_k
[Table 3.9: empirical coverages (%) of nominal 90% studentized bootstrap confidence intervals under the normal, modified size, mirror-match, population, and superpopulation schemes, for the three situations k = 20, N_i = 18; k = 5, N_i = 72; and k = 3, N_i = 18.]
according to the ordered values of u. The aim was to form 90% confidence intervals for

    θ = N⁻¹ Σ_{i=1}^k Σ_{j=1}^{N_i} x_{ij},

where x_{ij} is the value of x for the jth element of stratum i.
We took independent samples (u_{ij}, x_{ij}) of sizes n_i without replacement from the ith stratum, and used these to form the ratio estimate of θ and its estimated variance, given by

    t = Σ_{i=1}^k w_i ū_i t_i,    v = Σ_{i=1}^k w_i²(1 − f_i) n_i⁻¹ (n_i − 1)⁻¹ Σ_{j=1}^{n_i} (x_{ij} − t_i u_{ij})²,

where t_i = Σ_{j=1}^{n_i} x_{ij} / Σ_{j=1}^{n_i} u_{ij} and ū_i = N_i⁻¹ Σ_{j=1}^{N_i} u_{ij} is the population mean of u in stratum i;
these extend (3.16) to stratified sampling. We used bootstrap resamples with R = 199 to compute studentized bootstrap confidence intervals for θ based on 1000 different samples from simulated datasets. Table 3.9 shows the empirical coverages of these confidence intervals in three situations: a large-k case with k = 20, N_i = 18 and n_i = 6; a small-k case with k = 5, N_i = 72 and n_i = 24; and a small-k case with k = 3, N_i = 18 and n_i = 6. The modified sampling method used sampling with replacement, giving samples of size n′ = 7 when n_i = 6 and size n′ = 34 when n_i = 24, while the corresponding values of m for the mirror-match method were 3 and 8. Throughout f_i = 1/3.

In all three cases the coverages for normal, population and modified sample size intervals are close to nominal, while the mirror-match method does poorly. The superpopulation method also does poorly, perhaps because it was applied to separate strata rather than used to construct a new population to be stratified at each replicate. Similar results were obtained for nominal 80% and 95% confidence limits. Overall the population bootstrap and modified sample size methods do best in this limited comparison, and coverage is not improved by using the more complicated mirror-match method.
3.8 Hierarchical Data

When data have a nested or hierarchical structure, the simplest model is

    y_{ij} = x_i + z_{ij},    i = 1, ..., a,  j = 1, ..., b,    (3.21)

where the x_i are randomly sampled from F_x and independently the z_{ij} are randomly sampled from F_z, with E(Z) = 0 to force uniqueness of the model. Thus there is homogeneity of variation in Z between groups, and the structure is additive. The feature of this model that complicates resampling is the correlation between observations within a group,

    var(Y_{ij}) = σ_x² + σ_z²,    cov(Y_{ij}, Y_{ik}) = σ_x²,    j ≠ k.    (3.22)

For data having this nested structure, one might be interested in parameters of F_x or F_z or some combination of both. For example, when testing for presence of variation in X the usual statistic of interest is the ratio of between-group and within-group sums of squares.
How should one resample nonparametrically for such a data structure? There are two simple strategies, for both of which the first stage is to randomly sample groups with replacement. At the second stage we randomly sample within the groups selected at the first stage, either without replacement (Strategy 1) or with replacement (Strategy 2). Note that Strategy 1 keeps selected groups intact.
To see which strategy is likely to work better, we look at the second moments of resampled data y*_{ij} to see how well they match (3.22). Consider selecting y*_{i1}, ..., y*_{ib}. At the first stage we select a random integer I* from {1, 2, ..., a}. At the second stage, we select random integers J*₁, ..., J*_b from {1, 2, ..., b}, either without replacement (Strategy 1) or with replacement (Strategy 2): the sampling without replacement is equivalent to keeping the I*th group intact.
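As a concrete illustration, here is a minimal R sketch of the two strategies for a balanced a × b data matrix y whose rows are the groups; the function and argument names are ours, not from the text:

resample.groups <- function(y, strategy = 1)
{ a <- nrow(y); b <- ncol(y)
  I <- sample(a, a, replace = TRUE)     # first stage: sample groups with replacement
  t(sapply(I, function(i)
    if (strategy == 1) y[i, sample(b, b, replace = FALSE)]  # permutes, keeps group intact
    else y[i, sample(b, b, replace = TRUE)]))               # resamples within the group
}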
Under both strategies

    E*(Y*_{ij} | I* = i′) = ȳ_{i′},    E*(Y*_{ij}² | I* = i′) = b⁻¹ Σ_{l=1}^b y_{i′l}².

However, for j ≠ k,

    E*(Y*_{ij}Y*_{ik} | I* = i′) = {b(b − 1)}⁻¹ Σ_{l≠m} y_{i′l}y_{i′m},    Strategy 1,
    E*(Y*_{ij}Y*_{ik} | I* = i′) = b⁻² Σ_l Σ_m y_{i′l}y_{i′m} = ȳ_{i′}²,    Strategy 2.

Therefore

    E*(Y*_{ij}) = ȳ,    var*(Y*_{ij}) = SS_B/a + SS_W/(ab),    (3.23)

and

    cov*(Y*_{ij}, Y*_{ik}) = SS_B/a − SS_W/{ab(b − 1)},    Strategy 1,
    cov*(Y*_{ij}, Y*_{ik}) = SS_B/a,    Strategy 2,    (3.24)

where SS_B = Σ_i (ȳ_i − ȳ)² and SS_W = Σ_i Σ_j (y_{ij} − ȳ_i)². This gives

    E{var*(Y*_{ij})} = (1 − a⁻¹)σ_x² + {1 − (ab)⁻¹}σ_z²,

and

    E{cov*(Y*_{ij}, Y*_{ik})} = (1 − a⁻¹)σ_x² − (ab)⁻¹σ_z²,    Strategy 1,
    E{cov*(Y*_{ij}, Y*_{ik})} = (1 − a⁻¹)σ_x² + (1 − a⁻¹)b⁻¹σ_z²,    Strategy 2.
On balance, therefore, Strategy 1 more closely mimics the variation properties of the data, and so is the preferable strategy. Resampling should work well so long as a is moderately large, say at least 10, just as resampling homogeneous data works well if n is moderately large. Of course both strategies would work well if both a and b were very large, but this is rarely the case.

An application of these results is given in Example 6.9.
The preceding discussion would apply to balanced data structures, but not to more complex situations, for which a more general approach is required. A direct, model-based approach would involve resampling from suitable estimates of the two (or more) data distributions, generalizing the resampling from F̂ in Chapter 2. Here we outline how this might work for the data structure (3.21).
Estimates of the two CDFs F_x and F_z can be formed by first estimating the xs and zs, and then using their EDFs. A naive version of this, which parallels standard linear model theory, is to define

    x̂_i = ȳ_i,    ẑ_{ij} = y_{ij} − ȳ_i;    (3.25)

we then choose x*₁, ..., x*_a by randomly sampling a times with replacement from x̂₁, ..., x̂_a; choose z*₁₁, ..., z*_{ab} by randomly sampling ab times with replacement from ẑ₁₁, ..., ẑ_{ab}; and finally set

    y*_{ij} = x*_i + z*_{ij},    i = 1, ..., a,  j = 1, ..., b.

Straightforward calculations (Problem 3.17) show that this approach has the same second-moment properties of y*_{ij} as Strategy 2 earlier, shown in (3.23) and (3.24), which are not satisfactory. Somewhat predictably, Strategy 1 is mimicked by choosing z*_{i1}, ..., z*_{ib} randomly without replacement from one group of residuals ẑ_{k1}, ..., ẑ_{kb}, either a randomly selected group or the group corresponding to x*_i (Problem 3.17).
What has gone wrong here is that the estimates x̂_i in (3.25) have excess variation: E{(a − 1)⁻¹ SS_B} = σ_x² + b⁻¹σ_z², relative to the target σ_x². The estimates ẑ_{ij} defined in (3.25) will be satisfactory provided b is reasonably large, although in principle they should be standardized to

    ẑ_{ij} = (y_{ij} − ȳ_i)/(1 − b⁻¹)¹/².    (3.26)

The excess variation in x̂_i can be corrected by using the shrinkage estimate

    x̂_i = cȳ + (1 − c)ȳ_i,

where c is given by

    (1 − c)² = 1 − (a − 1)SS_W / {a b(b − 1)SS_B}.
3.9 Bootstrapping the Bootstrap

Consider the bootstrap estimate of the bias of T,

    B = b(F̂) = E*{t(F̂*) | F̂} − t(F̂),    (3.27)
where F̂* denotes either the EDF of the bootstrap sample Y*₁, ..., Y*ₙ drawn from F̂, or the parametric model fitted to that sample. Thus the calculation applies to both parametric and nonparametric situations. There is both random variation and systematic bias in B in general: it is the bias with which we are concerned here.

As with T itself, so with B: the bias can be estimated using the bootstrap. If we write γ = c(F) = E(B | F) − b(F), then the simple bootstrap estimate according to the general principle laid out in Chapter 2 is C = c(F̂). From the definition of c(F) this implies

    C = E*(B* | F̂) − B,

the bootstrap estimate of the bias of B. To see just what C involves, we use the definition of B in (3.27) to obtain

    C = E*[E**{t(F̂**) | F̂*} − t(F̂*) | F̂] − [E*{t(F̂*) | F̂} − t(F̂)]    (3.28)
      = E*{E**(T** | F̂*) | F̂} − 2E*(T* | F̂) + t.    (3.29)
Here F̂** denotes the EDF of a sample drawn from F̂*, or the parametric model fitted to that sample; T** is the estimate computed with that sample; and E** denotes expectation over the distribution of that sample conditional on F̂*. There are two levels of bootstrapping in this procedure, which is therefore an instance of the nested, or double, bootstrap. Since typically bias is of order n⁻¹, the adjustment C is typically of order n⁻².
The following example gives a simple illustration of the adjustment.

Example 3.18 (Sample variance) Suppose that T = n⁻¹Σ(Y_j − Ȳ)² is used to estimate var(Y) = σ². Since E{Σ(Y_j − Ȳ)²} = (n − 1)σ², the bias of T is easily seen to be β = −n⁻¹σ², which the bootstrap estimates by B = −n⁻¹T. The bias of this bias estimate is E(B) − β = n⁻²σ², which the bootstrap estimates by C = n⁻²T. Therefore the adjusted bias estimate is

    B − C = −n⁻¹T − n⁻²T.

That this is an improvement can be checked by showing that it has expectation β(1 − n⁻²), whereas B has expectation β(1 − n⁻¹).
In practice C must be approximated by simulation. For r = 1, ..., R:

1  generate a first-level bootstrap sample y*₁, ..., y*ₙ from F̂;
2  obtain M second-level samples from it, by sampling with replacement from y*₁, ..., y*ₙ (nonparametric case) or sampling from the model fitted to y*₁, ..., y*ₙ (parametric case);
3  evaluate the estimator T for each of the M second-level samples to give t**_{r1}, ..., t**_{rM}.

Then approximate the bias adjustment C in (3.29) by

    Ĉ = (RM)⁻¹ Σ_{r=1}^R Σ_{m=1}^M t**_{rm} − 2R⁻¹ Σ_{r=1}^R t*_r + t.    (3.30)
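The following R sketch implements this nonparametric double bootstrap for a data vector y and a statistic stat (both hypothetical names); it returns the simple and adjusted bias estimates:

double.bias <- function(y, stat, R = 100, M = 100)
{ n <- length(y)
  t0 <- stat(y)
  tstar <- numeric(R); tdstar <- matrix(0, R, M)
  for (r in 1:R) {
    ystar <- sample(y, n, replace = TRUE)           # first-level sample
    tstar[r] <- stat(ystar)
    for (m in 1:M)                                  # M second-level samples
      tdstar[r, m] <- stat(sample(ystar, n, replace = TRUE))
  }
  B <- mean(tstar) - t0                             # bootstrap bias estimate
  C <- mean(tdstar) - 2*mean(tstar) + t0            # adjustment, cf. (3.30)
  c(bias = B, adjusted.bias = B - C)
}

For example, double.bias(y, function(x) mean((x - mean(x))^2)) reproduces the setting of Example 3.18.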
More generally, suppose that the quantity b(F) estimated by the bootstrap is defined through an equation of the form

    E[h{F, F̂; b(F)} | F] = 0,    (3.31)

but that the bootstrap estimate b(F̂) satisfies only

    E[h{F, F̂; b(F̂)} | F] = e(F)n⁻ᵃ    (3.32)

for some function e and constant a. To correct for this bias we introduce the ideal perturbation γ = c_n(F), which modifies b(F̂) to b(F̂, γ) in order to achieve

    E[h{F, F̂; b(F̂, γ)} | F] = 0.    (3.33)
What we want to see is the effect of substituting β̂_adj for β̂ in (3.32). First we approximate the solution to (3.33). Taylor expansion about γ = 0, together with (3.32), gives

    E[h{F, F̂; b(F̂, γ)} | F] ≐ e(F)n⁻ᵃ + d_n(F)γ,    (3.34)

where

    d_n(F) = ∂/∂γ E[h{F, F̂; b(F̂, γ)} | F] |_{γ=0}.

Typically d_n(F) → d(F) ≠ 0, so that if we write r(F) = −e(F)/d(F), then (3.33) and (3.34) together imply that

    γ = c_n(F) ≐ r(F)n⁻ᵃ.

This, together with the corresponding approximation for γ̂ = c_n(F̂), gives

    γ̂ − γ ≐ n⁻ᵃ{r(F̂) − r(F)} = −n⁻⁽ᵃ⁺¹ᐟ²⁾ Xₙ,    (3.35)

say, where Xₙ = −n¹ᐟ²{r(F̂) − r(F)} is of order one. We can now assess the effect of the adjustment from β̂ to β̂_adj. Define the conditional quantity

    k_n(Xₙ) = ∂/∂γ E[h{F, F̂; b(F̂, γ)} | Xₙ, F] |_{γ=0},

normalized so that E{k_n(Xₙ) | F} → 1; then, approximately,

    E[h{F, F̂; b(F̂, γ̂)} | F] ≐ n⁻⁽ᵃ⁺¹ᐟ²⁾ E{Xₙ k_n(Xₙ) | F}.    (3.36)
This implies that

    E{Xₙ k_n(Xₙ) | F} = n¹ᐟ² E{e(F̂) − e(F) | F} = O(n⁻¹ᐟ²).

Equation (3.36) then becomes E{T − θ − (β̂ − γ̂)} = O(n⁻³). This generalizes the conclusion of Example 3.18, that the adjusted bootstrap bias estimate β̂ − γ̂ is correct to second order.
In the parametric case we can examine how a property of T such as its variance depends on the parameter ψ by simulating R datasets from the fitted model at each of several parameter values ψ₁, ..., ψ_K, computing estimates t*_{k1}, ..., t*_{kR} for each, and forming

    v(ψ_k) = (R − 1)⁻¹ Σ_{r=1}^R (t*_{kr} − t̄*_k)²,    (3.37)

where t̄*_k = R⁻¹ Σ_{r=1}^R t*_{kr}. Plots of v(ψ_k) against components of ψ_k can then be used to see how var(T) depends on ψ. Example 2.13 shows an application of this. The same simulation results can also be used to approximate other properties, such as the bias or quantiles of T, or the variance of transformed T.
As described here the number of simulated datasets will be RK, but in fact this number can be reduced considerably, as we shall show in Section 9.4.4. The simulation can be bypassed completely if we estimate v(ψ_k) by a delta-method variance approximation v_L(ψ_k), based on the variance of the influence function under the parametric model. However, this will often be impossible.
In the nonparametric case there appears to be a major obstacle to performing calculations analogous to (3.37), namely the unavailability of models corresponding to a series of parameter values ψ₁, ..., ψ_K. But this obstacle can be overcome by using the EDFs of the bootstrap samples themselves as models: the rth bootstrap sample has EDF F̂*_r with parameter value t*_r = t(F̂*_r), and M second-level samples drawn from F̂*_r yield estimates t**_{r1}, ..., t**_{rM} and the variance estimate

    v*_r = (M − 1)⁻¹ Σ_{m=1}^M (t**_{rm} − t̄**_r)²,    (3.38)

with t̄**_r = M⁻¹ Σ_{m=1}^M t**_{rm}. The scatter plot of v*_r against t*_r will then be a proxy for the ideal plot of var(T | ψ) against θ, and similarly for other plots.
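In R the proxy plot can be produced along the following lines, for a data vector y and a statistic stat as in the earlier sketch (again illustrative names only):

R <- 200; M <- 50; n <- length(y)
tstar <- vstar <- numeric(R)
for (r in 1:R) {
  ystar <- sample(y, n, replace = TRUE)
  tstar[r] <- stat(ystar)
  tdd <- replicate(M, stat(sample(ystar, n, replace = TRUE)))
  vstar[r] <- var(tdd)                   # second-level variance estimate (3.38)
}
plot(tstar, vstar, xlab = "t*", ylab = "v*")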
Example 3.20 (City population data) Figure 3.7 shows the results of the double bootstrap procedure outlined above, for the ratio estimator applied to the data in Table 2.1, with n = 10. The left panel shows the bias b* estimated using M = 50 second-level bootstrap samples from each of R = 999 first-level bootstrap samples. The right panel shows the corresponding standard errors v*¹ᐟ²_r. The lines from applying a locally weighted robust smoother confirm the clear increase with the ratio in each panel.

The implication of Figure 3.7 is that the bias and variance of the ratio are not stable with n = 10. Confidence intervals for the true ratio θ based on normal approximations to the distribution of T − θ will therefore be poor, as will basic bootstrap confidence intervals, and those based on related quantities such as the studentized bootstrap are suspect. A reasonable interpretation of the right panel is that var(T) ∝ θ², so that log T should be more stable.
As presented here the selection of parameter values ψ* is completely random, and R would need to be moderately large (at least 50) to get a reasonable spread of values of ψ*. The total number of samples, RM + R, will then be very large. It is, however, possible to improve upon the algorithm; see Section 9.4.4. Another important problem is the roughness of the variance estimates, apparent in both of the preceding examples. This is due not just to the size of M, but also to the noise in the EDFs F̂* being used as models.
Frequency smoothing

One major difference between the parametric and nonparametric cases is that the parametric models vary smoothly with the parameter values. A simple way to inject such smoothness into the nonparametric models F̂* is to smooth them. For simplicity we consider the one-sample case.

Let w(·) be a symmetric density with mean zero and unit variance, and consider the smoothed frequencies

    f_j(θ; ε) ∝ Σ_{r=1}^R w{(θ − t*_r)/ε} f*_{rj},    j = 1, ..., n.    (3.39)

Here ε > 0 is a smoothing parameter that determines the effective range of values of t* over which the frequencies are smoothed. As is common with kernel smoothing, the value of ε is more important than the choice of w(·), which we take to be the standard normal density. Numerical experimentation suggests that close to θ = t, values of ε in the range 0.2v¹ᐟ² to 1.0v¹ᐟ² are suitable, where v is an estimated variance for t. We choose the constant of proportionality in (3.39) to ensure that Σ_j f_j(θ, ε) = n. For a given ε, the relative frequencies n⁻¹f_j(θ, ε) determine a distribution F̂_θ, for which the parameter value is θ̃ = t(F̂_θ); in general θ̃ is not equal to θ, although it is usually very close.
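A sketch of (3.39) in R, for a boot-library object bt and the standard normal kernel; the function name and arguments are ours:

smooth.freq <- function(bt, theta, eps)
{ f <- boot.array(bt)                   # R x n matrix of frequencies f_rj*
  w <- dnorm((theta - bt$t[, 1])/eps)   # kernel weights w{(theta - t_r*)/eps}
  fj <- colSums(w * f)
  ncol(f) * fj/sum(fj)                  # rescale so that sum_j f_j = n
}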
Example 3.22 (City population data) In continuation of Example 3.20, the top panels of Figure 3.9 show the frequencies f*_j for four samples with values of t* very close to 1.6. The variation in the f*_j leads to the variability in both b* and v* that shows so clearly in Figure 3.7.

The lower panels show the smoothed frequencies (3.39) for distributions F̂_θ with θ = 1.2, 1.52, 1.6, 1.9 and ε = 0.2v¹ᐟ². The corresponding values of the ratio are θ̃ = 1.23, 1.51, 1.59, and 1.89. The observations with the smallest empirical influence values are more heavily weighted when θ is less than the original value of the statistic, t = 1.52, and conversely. The third panel, for θ = 1.6, results from averaging frequencies including those shown in the upper panels, and the distribution is much smoother than those. The results are not very sensitive to the value of ε, although the tilting of the frequencies is less marked for larger ε.
The smoothed frequencies can be used to assess how the bias and variance of T vary with θ.

[Figure 3.9: frequencies f*_j for four samples with t* close to 1.6 (top panels), and smoothed frequencies for θ = 1.2, 1.52, 1.6, and 1.9 (lower panels). Figure 3.10: bootstrap bias and variance estimates plotted against θ (jittered).]
Without loss of generality, suppose that t*₁ < ··· < t*_R. One way to implement empirical variance-stabilization is to choose R₁ of the t*_r that are roughly evenly spaced and that include t*₁ and t*_R. For each of the corresponding F̂*_r we then generate M bootstrap values t**, from which we estimate the variance of t to be v*_r as defined in (3.38). We now smooth a plot of the v*_r against the t*_r, giving an estimate v(θ) of the variance var(T | F) as a function of the parameter θ = t(F), and integrate numerically to obtain the estimated variance-stabilizing transformation

    h(t) = ∫^t dθ / {v(θ)}¹ᐟ².    (3.40)
In general, but especially for small R₁, it will be better to fit a smooth curve to values of log v*_r, in part to avoid negative estimates v(θ). Provided that a suitable smoothing method is used, inclusion of t*₁ and t*_R in the set for which the v*_r are estimated implies that all the transformed values h(t*_r) can be calculated. The transformed estimator h(T) should have approximately unit variance.

Any of the common smoothers can be used to obtain v(θ), and simple integration algorithms can be used for the integral (3.40). If the nested bootstrap is used only to obtain the variances of R₁ of the t*_r, the total number of bootstrap samples required is R + MR₁. Values of R₁ and M in the ranges 50-100 and 25-50 will usually be adequate, so if R = 1000 the overall number of bootstrap samples required will be 2250-6000. If variance estimates for all the t*_r are available, for example nonparametric delta method estimates, then the delta method shows that approximate standard errors for the h(t*_r) will be v*¹ᐟ²_r / v(t*_r)¹ᐟ²; a plot of these against t*_r will provide a check on the adequacy of the transformation.

The same procedure can be applied with second-level resampling done from smoothed frequencies, as in Example 3.22.
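In R the whole construction might be sketched as follows, where tsub and vsub hold the R₁ selected values t*_r and their nested-bootstrap variances v*_r (hypothetical names):

vfit <- loess(log(vsub) ~ tsub)                     # smooth log v* to keep v(theta) positive
tgrid <- seq(min(tsub), max(tsub), length = 200)
vhat <- exp(predict(vfit, data.frame(tsub = tgrid)))
h <- cumsum(1/sqrt(vhat)) * (tgrid[2] - tgrid[1])   # crude numerical version of (3.40)
plot(tgrid, h, type = "l", xlab = "t", ylab = "h(t)")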
Example 3.23 (City population data) For the city population data of Example 2.8 the parameter of interest is the ratio θ, which is estimated by t = x̄/ū. Figure 3.7 shows that the variance of T depends strongly on θ. We used the procedure outlined above to estimate a transformation based on R = 999 bootstrap samples, with R₁ = 50 and M = 25. The transformation is shown in the left panel of Figure 3.11; the right panel shows the standard errors v*¹ᐟ²/v(t*)¹ᐟ² of the h(t*). The transformation has been largely successful in stabilizing the variance.

In this case the variances v_{L,r} based on the linear approximation are readily calculated, and the transformation could have been estimated from them rather than from the nested bootstrap.
3.10 Bootstrap Diagnostics

3.10.1 Jackknife-after-bootstrap

… compare outliers, for example. In this situation we must focus on the effect of individual observations on bootstrap calculations, to answer questions such as "would the confidence interval differ greatly if this point were removed?", or "what happens to the significance level when this observation is deleted?"
Nonparametric case

Once a nonparametric resampling calculation has been performed, a basic question is how it would have been different if an observation, y_j say, had been absent from the original data. For example, it might be wise to check whether or not a suspicious case has affected the quantiles used in a confidence interval calculation. The obvious way to assess this is to do a further simulation from the remaining observations, but this can be avoided. This is because a resample in which y_j does not appear can be thought of as a random sample from the data with y_j excluded. Expressed formally, if J* is sampled uniformly from {1, ..., n}, then the conditional distribution of J* given that J* ≠ j is the same as the distribution of I*, where I* is sampled uniformly from {1, ..., j − 1, j + 1, ..., n}. The probability that y_j is not included in a bootstrap sample is (1 − n⁻¹)ⁿ ≐ e⁻¹, so the number of simulations R₋ⱼ that do not include y_j is roughly equal to Re⁻¹ = 0.368R.

So we can measure the effect of y_j on the calculations by comparing the full simulation with the subset of t*₁, ..., t*_R obtained from bootstrap samples where y_j does not occur. In terms of the frequencies f*_{rj} which count the number of times y_j appears in the rth simulation, we simply restrict attention to replicates with f*_{rj} = 0.
[Figure 3.11: variance-stabilization for the city population ratio. The left panel shows the empirical transformation h(·), and the right panel shows the standard errors v*¹ᐟ²/{v(t*)}¹ᐟ² of the h(t*), with a smooth curve.]
Table 3.10  Head length (Len) and breadth (Brea) of the first and second adult sons in 25 families.

        First son     Second son           First son     Second son
        Len   Brea    Len   Brea           Len   Brea    Len   Brea
  1     191   155     179   145     14     190   159     195   157
  2     195   149     201   152     15     188   151     187   158
  3     181   148     185   149     16     163   137     161   130
  4     183   153     188   149     17     195   155     183   158
  5     176   144     171   142     18     186   153     173   148
  6     208   157     192   152     19     181   145     182   146
  7     189   150     190   149     20     175   140     165   137
  8     197   159     189   152     21     192   154     185   152
  9     188   152     197   159     22     174   143     178   147
 10     192   150     187   151     23     176   139     176   143
 11     179   158     186   148     24     197   167     200   158
 12     183   147     174   147     25     190   163     187   150
 13     174   150     185   152
For example, the effect of y_j on the bias estimate B can be measured by the scaled difference

    n(B₋ⱼ − B) = n [ R₋ⱼ⁻¹ Σ_{r: f*_{rj}=0} (t*_r − t₋ⱼ) − R⁻¹ Σ_{r=1}^R (t*_r − t) ],    (3.41)

where B₋ⱼ is the bias estimate from the resamples in which y_j does not appear, and t₋ⱼ is the value of t when y_j is excluded from the original data. Such calculations are applications of the jackknife method described in Section 2.7.3, so the technique applied to bootstrap results is called the jackknife-after-bootstrap. The scaling factor n in (3.41) is not essential.
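The quantities in (3.41) are easily computed from a boot object. The sketch below assumes bt came from boot(y, stat.fun, R) for a data vector y, with a statistic of the usual form stat.fun(data, i); the function name is ours:

jab.bias <- function(bt, y, stat.fun)
{ f <- boot.array(bt)                  # f[r, j]: times case j appears in resample r
  tstar <- bt$t[, 1]
  n <- ncol(f)
  B <- mean(tstar) - bt$t0[1]          # overall bootstrap bias estimate
  sapply(1:n, function(j) {
    omit <- f[, j] == 0                # roughly 0.368*R resamples omit case j
    B.mj <- mean(tstar[omit]) - stat.fun(y[-j], 1:(n - 1))
    n * (B.mj - B)                     # equation (3.41)
  })
}

The boot library's jack.after.boot function, used in the practicals below, automates the related diagnostic plot.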
A useful diagnostic is the plot of jackknife-after-bootstrap measures such as (3.41) against empirical influence values, possibly standardized. For this purpose any of the approximations to empirical influence values described in Section 2.7 can be used. The next example illustrates a related plot that shows how the distribution of t* − t changes when each observation is excluded.
Example 3.24 (Frets heads) Table 3.10 contains data on the head breadth and length of the first two adult sons in 25 families.

The correlations among the log measurements are given below the diagonal in Table 3.11. The values above the diagonal are the partial correlations. For example, the value 0.13 in the second row is the correlation between the log head breadth of the first son, b₁, and the log head length of the second son, l₂, after allowing for the other variables. In effect, this is the correlation between the residuals from separate regressions of b₁ and l₂ on the other two variables. The correlations are all large, but four of the partial correlations are small, which suggests the simple interpretation that each of the four pairs of measurements for first and second sons is independent conditionally on the values of the other two measurements.
Table 3.11  Correlations (below the diagonal) and partial correlations (above the diagonal) for the log measurements.

                          First son          Second son
                          Length  Breadth    Length  Breadth
First son     Length        .      0.43       0.21    0.17
              Breadth     0.75      .         0.13    0.22
Second son    Length      0.72    0.70         .      0.64
              Breadth     0.72    0.72       0.85      .
We focus on the partial correlation t = 0.13 between log b₁ and log l₂. The top panel of Figure 3.12 shows a jackknife-after-bootstrap plot for t, based on 999 bootstrap samples. The points at the left-hand end show the empirical 0.05, 0.1, 0.16, 0.5, 0.84, 0.9, and 0.95 quantiles of the values of t* − t̄*₋₂ for the 368 bootstrap samples in which case 2 was not selected; t̄*₋₂ is the average of t* for those samples. The dotted lines are the corresponding quantiles for all 999 values of t* − t. The distribution is clearly much more peaked when case 2 is left out. The panel also contains the corresponding quantiles when other cases are excluded. The horizontal axis shows the empirical influence values for t: clearly putting more weight on case 2 sharply decreases the value of t.

The lower left panel of the figure shows that case 2 lies somewhat away from the rest, and the plot of residuals for the regressions of log b₁ and log l₂ on (log b₂, log l₁) in the lower right panel accounts for the jackknife-after-bootstrap results. Case 2 seems outlying relative to the others: deleting it will clearly increase t substantially. The overall average and standard deviation of the t* are 0.14 and 0.23, changing to 0.34 and 0.17 when case 2 is excluded. The evidence against zero partial correlation depends heavily on case 2.
Parametric case

In the parametric case the effect of deleting a case can be approximated without further simulation, by reweighting the existing samples. If the y* are generated from a fitted model with parameter value ψ, then expectations under a different value ψ′ can be written as

    E_{ψ′}{u(Y*)} = E_ψ{u(Y*) f(y*; ψ′)/f(y*; ψ)}.    (3.42)

Suppose that the full-data estimate (e.g. maximum likelihood estimate) of the model parameter is ψ̂, and that when case j is deleted the corresponding estimate is ψ̂₋ⱼ. The idea is to use (3.42) with ψ̂ and ψ̂₋ⱼ in place of ψ and ψ′, respectively. For example,

    B₋ⱼ = R⁻¹ Σ_{r=1}^R (t*_r − t₋ⱼ) f(y*_r; ψ̂₋ⱼ)/f(y*_r; ψ̂),

where the samples y*_r are drawn from the full-data fitted model, that is with parameter value ψ̂. Similar weighted calculations apply to other features of the distribution of T*.
3.10.2 Linearity

Statistical analysis is simplified when the statistic of interest T is close to linear. In this case the variance approximation v_L will be an accurate estimate of the bootstrap variance var(T* | F̂), and saddlepoint methods (Section 9.5) can be applied to obtain accurate estimates of the distribution of T* without recourse to simulation. A linear statistic is not necessarily close to normally distributed, as Example 2.3 illustrates. Nor does linearity guarantee that T is directly related to a pivot and therefore useful in finding confidence intervals. On the other hand, experience from other areas in statistics suggests that these three properties will often occur together.

This suggests that we aim to find a transformation h(·) such that h(T) is well described by the linear approximation that corresponds to (2.35) or (3.1). For simplicity we focus on the single-sample case here. The shape of h(·) would be revealed by a plot of h(t) against t, but of course this is not available because h(·) is unknown. However, using Taylor approximation and (2.44) we do have

    h(t*) ≐ h(t*_L) = h(t) + ḣ(t) n⁻¹ Σ_{j=1}^n f*_j l_j = h(t) + ḣ(t)(t*_L − t),

where ḣ(t) = dh(t)/dt, which shows that t*_L ≐ c + d h(t*) with appropriate definitions of the constants c and d. Therefore a plot of the values of t*_L = t + n⁻¹ Σ_j f*_j l_j against the t* will look roughly like h(·), apart from a location and scale shift. We can now estimate h(·) from this plot, either by fitting a particular parametric form, or by nonparametric curve estimation.
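With the boot library this plot takes only a few lines, for a one-sample boot object bt (names assumed):

l <- empinf(bt)                          # empirical influence values l_j
f <- boot.array(bt)                      # frequencies f_j*
tL <- bt$t0[1] + (f %*% l)/length(l)     # t_L* = t + n^{-1} sum_j f_j* l_j
plot(bt$t[, 1], tL, xlab = "t*", ylab = "tL*")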
Example 3.25 (City population data) The top left panel of Figure 3.13 shows t*_L plotted against t* for 499 bootstrap replicates of the ratio t = x̄/ū for the data in Table 2.1. The plot is highly nonlinear, and the logarithmic transformation, or one even more extreme, seems appropriate. Note that the plot has a shape similar to that for the empirical variance-stabilizing transformation in Figure 3.11.

For a parametric transformation, we try a Box-Cox transformation, h(t) = (t^λ − 1)/λ, with the value of λ estimated by maximizing the log likelihood for the regression of the h(t*_r) on the t*_{L,r}. This strongly suggests that we use λ = −2, for which the fitted curve is shown as the solid line on the plot. This is close to the result for a smoothing spline, shown as the dotted line. The top right panel shows the linear approximation for h(t*), i.e. h(t) + ḣ(t)n⁻¹ Σ_{j=1}^n f*_j l_j, plotted against h(t*). This plot is close to the line with unit gradient, and confirms the results of the analysis of transformations.
[Figure 3.13: t*_L against t* with the fitted Box-Cox and spline transformations (top left); linear approximation plotted against h(t*) (top right); normal Q-Q plots of the studentized statistics z* and z̃* against quantiles of the standard normal (bottom).]
The lower panels show related plots for the studentized bootstrap statistics on the original scale and on the new scale,

    z* = (t* − t)/v_L*¹ᐟ²,    z̃* = {h(t*) − h(t)}/{ḣ(t) v_L*¹ᐟ²}.

The quantile plots show that the transformation makes the corresponding studentized bootstrap statistic roughly normal. The transformation based on the smoothing spline would give similar results.
3.11 Choice of Estimator from the Data

Suppose that several candidate estimators T(1), ..., T(K) of a parameter μ are available, and that we wish to select the one whose variance is smallest: we take t = t(i) if

    v(i) = min_{1≤k≤K} v(k).

For most simple estimators we can use the nonparametric delta method variance estimates. But in general, and for more complicated problems, we use the bootstrap to implement this procedure. Thus we generate R bootstrap samples, compute the estimates t*(1), ..., t*(K) for each sample, and then choose t to be that t(i) for which the bootstrap estimate of variance

    v(i) = (R − 1)⁻¹ Σ_{r=1}^R {t*_r(i) − t̄*(i)}²

is smallest; here t̄*(i) = R⁻¹ Σ_r t*_r(i). How we generate the bootstrap samples is important here. Having assumed symmetry of the data distribution, the resampling distribution should be symmetric so that the t*(i) are unbiased for μ. Otherwise selection based on variance alone is questionable. Further discussion of this is postponed to Example 3.26.
So far the procedure is straightforward. But now suppose that we want to estimate the variance of T, or quantiles of T − μ. For the variance, the minimum estimate v(i) used to select t = t(i) will tend to be too low: if I is the random index corresponding to the selected estimator, then E{V(I)} ≤ var{T(I)} = var(T). Similarly the resampling distribution of T* = T*(I*) will be artificially concentrated relative to that of T, so that empirical quantiles of the t*(i) values will tend to be too close to t. Whether or not this selection bias is serious depends on the context. However, the bias can be adjusted for by bootstrapping the whole procedure, as follows.
Let y*₁, ..., y*ₙ be one of the R simulated samples. Suppose that we apply the procedure for choosing among T(1), ..., T(K) to this bootstrap sample. That is, we generate M samples with equal probability from y*₁, ..., y*ₙ, and calculate the estimates t**_m(1), ..., t**_m(K) for the mth such sample. Then choose the estimator with the smallest estimated variance

    v*(i) = (M − 1)⁻¹ Σ_{m=1}^M {t**_m(i) − t̄**(i)}²,

where t̄**(i) = M⁻¹ Σ_{m=1}^M t**_m(i). That is,

    t* = t*(i)    if    v*(i) = min_{1≤k≤K} v*(k).

Doing this for each of the R samples y*₁, ..., y*ₙ gives t*₁, ..., t*_R, and the empirical distribution of the t* − t values approximates the distribution of T − μ. For example, v = (R − 1)⁻¹ Σ(t*_r − t̄*)² estimates the variance of T, and by accounting for the selection bias should be more accurate than v(i).
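A compact R sketch of the selection and its double-bootstrap adjustment, for a data vector y and a list ests of K candidate estimator functions (names illustrative):

select.var <- function(y, ests, R = 200, M = 100)
{ n <- length(y)
  tstar <- numeric(R)
  for (r in 1:R) {
    ystar <- sample(y, n, replace = TRUE)
    tdd <- replicate(M, {                  # K x M matrix of second-level estimates
      yss <- sample(ystar, n, replace = TRUE)
      sapply(ests, function(f) f(yss)) })
    i <- which.min(apply(tdd, 1, var))     # estimator with smallest v*(i)
    tstar[r] <- ests[[i]](ystar)
  }
  var(tstar)                               # variance estimate v, adjusted for selection
}

The practicals at the end of the chapter apply the same idea to trimmed averages.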
There are two byproducts of this double bootstrap procedure. One is information on how well-determined is the choice of estimator, if this is of interest, simply by examining the relative frequency with which each estimator is chosen. Secondly, the bias of v(i) can be approximated: on the log scale the bias is estimated by R⁻¹ Σ_r log v*_r − log v(i), where v*_r is the smallest value of the v*(i)s in the rth bootstrap sample.
Example 3.26 (Gravity data) Suppose that the data in Table 3.1 were only available as a combined sample of n = 81 measurements. The different dispersions of the ingredient series make the combined sample very non-normal, so that the simple average is a poor estimator of the underlying mean μ. One possible approach is to consider trimmed average estimates

    t(k) = (n − 2k)⁻¹ Σ_{j=k+1}^{n−k} y₍ⱼ₎,

which are averages after dropping the k smallest and k largest order statistics y₍ⱼ₎. The usual average and sample median correspond respectively to k = 0 and k = ½(n − 1). The left panel of Figure 3.14 plots the trimmed averages against k. The mild downward trend in the plot suggests slight asymmetry of the data distribution. Our aim is to use the bootstrap to choose among the trimmed averages.

The trimmed averages will all be unbiased if the underlying data distribution is symmetric, and estimator variance will then be a sensible criterion on which to base choice. The bootstrap procedure must build in the assumed symmetry,
and this can be done (cf. Example 3.4) by simulating samples from a symmetrized version of F̂ such as

    F̂_sym(y) = ½{F̂(y) + 1 − F̂(2t − y)}.

[Figure 3.14: trimmed averages t(k), and the corresponding variance estimates, plotted against k = 0, ..., 40.]

(An alternative criterion to v(i) is the estimated mean squared error mse(i) = R⁻¹ Σ_{r=1}^R {t*_r(i) − t}².) We generated R = 1000
samples y*₁, ..., y*₈₁ from F̂_sym. To each of these samples we then applied the original symmetric bootstrap procedure, generating M = 100 samples of size n = 81 from the symmetrized EDF of y*₁, ..., y*₈₁, and choosing t* to be that one of the 11 trimmed averages with smallest value of v*(i). The variance v of t*₁, ..., t*_R equals 0.356, which is 10% larger than the original minimum variance. If we use this variance with a normal approximation to calculate a 95% confidence interval centred on t, the interval is [77.16, 79.50]. This is very similar to the intervals obtained in Example 3.2.
The frequencies with which the different trimming proportions are chosen are:

    k           12   16   20   24   28   32   36   40
    Frequency    1   25   54   96  109  131  498   86

Thus when symmetry of the underlying distribution is assumed, a fairly heavy degree of trimming seems desirable for these data, and the value k = 36 actually chosen seems reasonably well-determined.
3.12 Bibliographic Notes

Efron (1979, 1982) suggested and studied empirically the use of smooth versions of the EDF, but the first systematic investigation of smoothed bootstraps was by Silverman and Young (1987), who studied the circumstances in which smoothing is beneficial for statistics for which there is a linear approximation. Hall, DiCiccio and Romano (1989) show that when the quantity of interest depends on a local property of the underlying CDF, as do quantiles, smoothing can give worthwhile theoretical reductions in the size of the mean squared error. Similar ideas apply to more complex situations such as L₁ regression (De Angelis, Hall and Young, 1993); see however the discussion in Section 6.5. De Angelis and Young (1992) give a useful review of bootstrap smoothing, and discuss the empirical choice of how much smoothing to apply. See also Wang (1995). Romano (1988) describes a problem, estimation of the mode of a density, where the estimator is undefined unless the EDF is smoothed; see also Silverman (1981). In a spatial data problem, Kendall and Kendall (1980) used a form of bootstrap that jitters the observed data, in order to keep the rough configuration of points constant over the simulations; this amounts to sampling without replacement when applying the smoothed bootstrap. Young (1990) concludes that although this approach can outperform the unsmoothed bootstrap, it does not perform so well as the smoothed bootstrap described in Section 3.4.
General discussions of survival data can be found in the books by Cox and Oakes (1984) and Kalbfleisch and Prentice (1980), while Fleming and Harrington (1991) and Andersen et al. (1993) give more mathematical accounts. The product-limit estimator was derived by Kaplan and Meier (1958): it and variants are widely used in practice.

Efron (1981a) proposed the first bootstrap methods for survival data, and discussed the relation between traditional and bootstrap standard errors for the product-limit estimator. Akritas (1986) compared variance estimates for the median survival time from Efron's sampling scheme and a different approach of Reid (1981), and concluded that Efron's scheme is superior. The conditional method outlined in Section 3.5 was suggested by Hjort (1985), and subsequently studied by Kim (1990), who concluded that it estimates the conditional variance of the product-limit estimator somewhat better than does resampling cases. Doss and Gill (1992) and Burr and Doss (1993) give weak convergence results leading to confidence bands for quantiles of the survival time distribution. The asymptotic behaviour of parametric and nonparametric bootstrap schemes for censored data is described by Hjort (1992), while Andersen et al. (1993) discuss theoretical aspects of the weird bootstrap.

The general approach to missing-data problems via the EM algorithm is discussed by Dempster, Laird and Rubin (1977). Bayesian methods using multiple imputation and data augmentation are described by Tanner and Wong (1987)
3.13 Problems
1  In a two-sample problem, with data y_{ij}, j = 1, ..., n_i, i = 1, 2, giving sample averages ȳ_i and variances v_i, describe models for which it would be appropriate to resample the following quantities:
(a) e_{ij} = y_{ij} − ȳ_i,
(b) e_{ij} = (y_{ij} − ȳ_i)/(1 + n_i⁻¹)¹ᐟ²,
(c) e_{ij} = (y_{ij} − ȳ_i)/{v_i(1 + n_i⁻¹)}¹ᐟ²,
(d) e_{ij} = ±(y_{ij} − ȳ_i)/{v_i(1 + n_i⁻¹)}¹ᐟ², where the signs are allocated with equal probabilities.

2  The weighted average of k sample averages is

    T = Σ_{i=1}^k w_i Ȳ_i / Σ_{i=1}^k w_i,

where w_i = n_i/σ_i², with ȳ_i = n_i⁻¹ Σ_j y_{ij} and σ̂_i² = n_i⁻¹ Σ_j (y_{ij} − ȳ_i)² estimates of the mean μ_i and variance σ_i² of the ith distribution. Show that the influence functions for T are

    L_{t,i}(y_i; F) = w_i(y_i − μ_i)/Σ_j w_j,

where w_i = n_i/σ_i². Deduce that the first-order approximation under the constraint μ₁ = ··· = μ_k for the variance of T is v_L = 1/Σ w_i, with empirical analogue v̂_L = 1/Σ ŵ_i. Compare this to the corresponding formula based on the unconstrained empirical influence values.
(Section 3.2.1)
3  Spherical data y₁, ..., yₙ are points on the sphere of unit radius. Suppose that it is assumed that these data come from a distribution that is symmetric about the unknown mean direction μ. In light of the symmetry assumption, what would be an appropriate resampling algorithm for simulating data y*₁, ..., y*ₙ?
(Section 3.3; Ducharme et al., 1985)
6  The empirical influence values can be calculated more directly as follows. Consider only distributions supported on the data values, with probabilities p_i = (p_{i1}, ..., p_{in_i}) on the values in the ith sample for i = 1, ..., k. Then write T = t(p₁, ..., p_k), so that t = t(p̂₁, ..., p̂_k) with p̂_i = (n_i⁻¹, ..., n_i⁻¹). Show that the empirical influence value l_{ij} corresponding to the jth case in sample i is given by

    l_{ij} = ∂/∂ε t{p̂₁, ..., (1 − ε)p̂_i + ε1_j, ..., p̂_k} |_{ε=0},

where 1_j is the vector with one in the jth position and zeroes elsewhere.
(Section 3.2.1)
7  Following on from the previous problem, re-express t(p₁, ..., p_k) as a function u(π) of a single probability vector π = (π₁₁, ..., π_{1n₁}, ..., π_{kn_k}). For example, for the ratio of means of two independent samples, t = ȳ₂/ȳ₁ and

    u(π) = (Σ_j π₂ⱼ y₂ⱼ / Σ_j π₂ⱼ) / (Σ_j π₁ⱼ y₁ⱼ / Σ_j π₁ⱼ).

The observed value t is then equal to u(π̂), where π̂ = (n⁻¹, ..., n⁻¹) with n = Σ_i n_i. Show that

    l̃_{ij} = ∂/∂ε u{(1 − ε)π̂ + ε1_{ij}} |_{ε=0},

where 1_{ij} has one in the position corresponding to the jth case of sample i, and zeroes elsewhere, and relate l̃_{ij} to the l_{ij} of the previous problem.
(Section 3.2.1)
8  Let w be a symmetric density with mean zero and unit variance, and let ĝ_h(x) = (nh)⁻¹ Σ_{j=1}^n w{(x − x_j)/h} be the kernel-smoothed EDF density. Show that the rescaled version

    g̃_h(x) = (nhb)⁻¹ Σ_{j=1}^n w{(x − a − bx_j)/(hb)}

will have the same first two moments as the EDF if a = (1 − b)x̄ and b = {1 + nh²/Σ(x_j − x̄)²}⁻¹ᐟ². What algorithm simulates from this smoothed EDF?
(d) Discuss the special problems that arise from using ĝ_h(x) when the range of x is [0, ∞) rather than (−∞, ∞).
(e) Extend the algorithms in (b) and (c) to multivariate x.
(Section 3.4; Silverman and Young, 1987; Wand and Jones, 1995)
9  Consider resampling cases from censored data (y₁, d₁), ..., (yₙ, dₙ), where y₁ < ··· < yₙ. Let f*_j denote the number of times that (y_j, d_j) occurs in an ordinary bootstrap sample, and let S*_j = f*₁ + ··· + f*_j.
(a) Show that when there is no censoring, the product-limit estimate puts mass n⁻¹ on each observed failure y₁ < ··· < yₙ, so that F̂ equals the EDF.
(b) Show that if B(m, p) denotes the binomial distribution with index m and probability p, then f*_j conditional on S*_{j−1} is distributed as B{n − S*_{j−1}, (n − j + 1)⁻¹}, and hence obtain the variance of the bootstrap product-limit estimator. This equals the variance from Greenwood's formula, (3.10), apart from replacement of (n − j + 1)² by (n − j)(n − j + 1).
(Section 3.5; Efron, 1981a; Cox and Oakes, 1984, Section 4.3)
10  Consider the weird bootstrap applied to a homogeneous sample of censored data, (y₁, d₁), ..., (yₙ, dₙ), in which y₁ < ··· < yₙ. Let dÂ*₀(y_j) = N*_j/(n − j + 1), where the N*_j are independent binomial variables with denominators n − j + 1 and probabilities d_j/(n − j + 1).
(a) Show that the total number of failures under this resampling scheme is distributed as a sum of independent binomial observations.
(b) Show that …
12  (a) Establish (3.15), and show that the sample variance c is an unbiased estimate of γ.
(b) Now suppose that N = kn for some integer k. Show that under the population bootstrap,

    E*(Ȳ*) = ȳ,    var*(Ȳ*) = {(N − k)/(N − 1)} × (1 − f)n⁻¹c.

(c) In the context of Example 3.16, suppose that the parameter of interest is a nonlinear function of θ, say η = g(θ), which is estimated by g(T). Use the delta method to show that the bias of g(T) is roughly ½g″(θ)var(T), and that the bootstrap bias estimate is roughly ½g″(t)var*(T*). Under what conditions on n and N does the bootstrap bias estimate converge to the true bias?
(Section 3.7; Bickel and Freedman, 1984; Booth, Butler and Hall, 1994)
13  To model the superpopulation bootstrap, suppose that the original data are y₁, ..., yₙ and that 𝒴* contains M₁, ..., Mₙ copies of y₁, ..., yₙ; the joint distribution of the M_j is multinomial with probabilities n⁻¹ and denominator N. If Y*₁, ..., Y*ₙ are sampled without replacement from 𝒴* and if Ȳ* = n⁻¹ Σ Y*_j, show that

    E*(Ȳ*) = ȳ,    E_M{var*(Ȳ* | M)} = {(n − 1)/n} × (1 − f)n⁻¹c.

14  Suppose we wish to perform mirror-match resampling with k independent without-replacement samples of size m, but that k = {n(1 − m/n)}/{m(1 − f)} is not an integer. Let K* be the random variable such that

    Pr(K* = k′) = 1 − Pr(K* = k′ + 1) = k′(1 + k′ − k)/k,

where k′ = ⌊k⌋ is the integer part of k. Show that if the mirror-match algorithm is applied for an average Ȳ* with this distribution for K*, var*(Ȳ*) = (1 − m/n)c/(mk). Show also that under mirror-match resampling with the simplifying assumption that randomization is not required because k is an integer,

    E*(c*) = c[1 − (k − 1)/{k(n − 1)}],

where c* is the sample variance of the Y*_j. What implications are there for variance estimation for more complex statistics?
(Section 3.7; Sitter, 1992)
15  Suppose that n is a large even integer and that N = 5n/2, and that instead of applying the population bootstrap we choose a population 𝒴* from which to resample according to

    𝒴* = y₁, ..., yₙ, y₁, ..., yₙ                      with probability ½,
    𝒴* = y₁, ..., yₙ, y₁, ..., yₙ, y₁, ..., yₙ          with probability ½.

Having selected 𝒴* we take a sample Y*₁, ..., Y*ₙ from it without replacement and calculate Z* = (Ȳ* − ȳ){(1 − f′)n⁻¹c}⁻¹ᐟ². Show that if f′ = n/N the approximate distribution of Z* is the normal mixture ½N(0, 5/6) + ½N(0, 10/9), but that if f′ = n/#{𝒴*} the approximate distribution of Z* is N(0, 1). Check that in the first case, E*(Z*) = 0 and var*(Z*) = 1.
Comment on the implications for the use of randomization in finite population resampling.
(Section 3.7; Bickel and Freedman, 1984; Presnell and Booth, 1994)
16  Suppose that we have data y₁, ..., yₙ, and that the bootstrap sample is taken to be

    Y*_j = ȳ + d(y_{I_j} − ȳ),    j = 1, ..., n′,

where I₁, ..., I_{n′} are independently chosen at random from {1, ..., n}. Show that when d = {n′(1 − f)/(n − 1)}¹ᐟ², we have E*(Ȳ*) = ȳ and var*(Ȳ*) = (1 − f)n⁻¹c. How might the value of n′ be chosen?
Discuss critically this resampling scheme.
(Section 3.7; Rao and Wu, 1988)
17  For the model (3.21) of Section 3.8, define estimates of the x_i and z_{ij} by

    x̂_i = cȳ + (1 − c)ȳ_i,    …

(Section 3.8)
19  Consider the double bootstrap procedure for adjusting the estimated bias of T, as described in Section 3.9, when T is the average Ȳ. Show that the variance of the simulation error for the adjusted bias estimate B − C is …
21  The p-trimmed mean of a distribution F may be written t_p(F) = (1 − 2p)⁻¹ ∫_p^{1−p} F⁻¹(u) du.
(a) If F_κ denotes the gamma distribution with index κ and unit mean, show that t_p(F_κ) = κ(1 − 2p)⁻¹{F_{κ+1}(y_{κ,1−p}) − F_{κ+1}(y_{κ,p})}, where y_{κ,p} is the p quantile of F_κ. Hence evaluate t_p(F_κ) for κ = 1, 2, 5, 10 and p = 0, 0.1, 0.2, 0.3, 0.4, 0.5.
(b) Suppose that the parameter of interest, θ = Σ_{i=1}^k c_i t_p(F_{i,κ_i}), depends on several gamma distributions F_{i,κ_i}. Let F̂_i denote the EDF of a sample of size n_i from F_{i,κ_i}. Under what circumstances is T = Σ_{i=1}^k c_i t_p(F̂_i) (i) unbiased, (ii) nearly unbiased, as an estimate of θ? Test your conclusions by a small simulation experiment.
(Section 3.11)
3.14 Practicals
1  To perform the analysis for the gravity data outlined in Example 3.2:

grav.fun <- function(data, i)
{ d <- data[i, ]
  m <- tapply(d$g, d$series, mean)
  v <- tapply(d$g, d$series, var)
  n <- table(d$series)
  c(sum(m*n/v)/sum(n/v), 1/sum(n/v)) }
grav.boot <- boot(gravity, grav.fun, R = 200, strata = gravity$series)

Plot the estimate and its variance. Is the simulation well-behaved? How normal are the bootstrapped estimates and studentized bootstrap statistics?
Now for a semiparametric analysis, as suggested in Section 3.3:

attach(gravity)
n <- table(series)
m <- rep(tapply(g, series, mean), n)
s <- rep(sqrt(tapply(g, series, var)), n)
res <- (g - m)/s
qqnorm(res); abline(0, 1, lty = 2)
grav <- data.frame(m, s, series, res)
grav.fun <- function(data, i)
{ e <- data$res[i]
  y <- data$m + data$s*e
  m <- tapply(y, data$series, mean)
  v <- tapply(y, data$series, var)
  n <- table(data$series)
  c(sum(m*n/v)/sum(n/v), 1/sum(n/v))
}
grav1.boot <- boot(grav, grav.fun, R = 200)

Do the residuals res for the different series look similar? Compare the values of t and v for the two sampling schemes. Compare also 80% confidence intervals for g.
(Section 3.2)
2  Dataframe channing contains data on the survival of 97 men and 365 women in a retirement home in California. The variables are sex, ages in months at which individuals entered and left the home, the time in months they spent there, and a censoring indicator (0/1 denoting censored due to leaving the home/died there). For details see Hyde (1980). We compare the variability of the survival probabilities at 75 and 85 years (900 and 1020 months), and of the estimated 0.75 and 0.5 quantiles of the survival distribution.
  apply(con.boot$t, 2, mean),
  apply(wei.boot$t, 2, mean),
  sqrt(apply(ord.boot$t, 2, var)),
  sqrt(apply(con.boot$t, 2, var)),
  sqrt(apply(wei.boot$t, 2, var)))
results <- rbind(results, res) }

The estimated bias and standard deviation of t₁ and the bootstrap bias estimates are

mean(results[,1]) - 1
sqrt(var(results[,1]))
bias.o <- results[,3] - results[,1]
bias.c <- results[,5] - results[,1]
bias.w <- results[,7] - results[,1]

How do they compare? What about the estimated standard deviations? How do the numbers of censored observations vary under the schemes?
(Section 3.5; Efron, 1981a; Burr, 1994)
4  The tau particle is a heavy electron-like particle which decays into various collections of other charged particles shortly after its production. The decay usually involves one charged particle, in which case it can happen in a number of modes, the main four of which are labelled ρ, π, e, and μ. It takes a major research project to measure the rate of occurrence of single-particle decay, decay₁, or any of its component rates decay_ρ, decay_π, decay_e, and decay_μ, and just one of these can be measured in any one experiment. Thus the dataframe tau on decay rates for 60 experiments represents several years of work. Here we use them to estimate and form a confidence interval for the parameter

    θ = decay₁ − decay_ρ − decay_π − decay_e − decay_μ.

Suppose that we had thought of using the 0, 12.5, 25, 37.5 and 50% trimmed averages to estimate the difference. To calculate these and to obtain bootstrap confidence intervals for the estimates of θ:
Now suppose that we want to choose the estimator from the data, by taking the trimmed average with smallest variance. For the original data this is the 25% trimmed average, so the estimate is 16.87. Its variance can be estimated by a double bootstrap, which we can implement as follows:

i <- matrix(1:5, 5, tau.boot2$R)
i <- i[t(tau.boot2$t[,6:10] == apply(tau.boot2$t[,6:10], 1, min))]
table(i)
t.best <- tau.boot2$t[cbind(1:tau.boot2$R, i)]
var(t.best)

Is the optimal degree of trimming well-determined?
How would you use the results of Problems 2.13 and 2.4 to avoid the second level of bootstrapping?
(Section 3.11; Efron, 1992)
5  Suppose that we are interested in the largest eigenvalue of the covariance matrix of the baseline and one-year CD4 counts in cd4; see Practical 2.3. To calculate this and its approximate variance using the nonparametric delta method (Problem 2.14), and to bootstrap it:

split.screen(c(1,2))
screen(1); split.screen(c(2,1))
screen(3)
plot(cd4.boot$t[,1], cd4.boot$t[,2], xlab="t*", ylab="vL*", pch=".")
screen(4)
plot(cd4[,1], cd4[,2], type="n", xlab="baseline",
  ylab="one year", xlim=c(1,7), ylim=c(1,7))
text(cd4[,1], cd4[,2], c(1:20), cex=0.7)
screen(2); jack.after.boot(cd4.boot, useJ=F, stinf=F)

What is going on here?
(Section 3.10.1; Canty, Davison and Hinkley, 1996)
4
Tests
4.1 Introduction
Many statistical applications involve significance tests to assess the plausibility of scientific hypotheses. Resampling methods are not new to significance testing, since randomization tests and permutation tests have long been used to provide nonparametric tests. Also Monte Carlo tests, which use simulated datasets, are quite commonly used in certain areas of application. In this chapter we describe how resampling methods can be used to produce significance tests, in both parametric and nonparametric settings. The range of ideas is somewhat wider than the direct bootstrap approach introduced in the preceding two chapters. To begin with, we summarize some of the key ideas of significance testing.
The simplest situation involves a simple null hypothesis H₀ which completely specifies the probability distribution of the data. Thus, if we are dealing with a single sample y₁, ..., yₙ from a population with CDF F, then H₀ specifies that F = F₀, where F₀ contains no unknown parameters. An example would be "exponential with mean 1". The more usual situation in practice is that H₀ is a composite null hypothesis, which means that some aspects of F are not determined and remain unknown when H₀ is true. An example would be "normal with mean 1", the variance of the normal distribution being unspecified.
P-values

A statistical test is based on a test statistic T which measures the discrepancy between the data and the null hypothesis. In general discussion we shall follow the convention that large values of T are evidence against H₀. Suppose for the moment that this null hypothesis is simple. If the observed value of the test statistic is denoted by t, then the level of evidence against H₀ is measured by the significance probability, or P-value,

    p = Pr(T ≥ t | H₀).    (4.1)

When T is continuous, the corresponding random variable P is uniformly distributed under H₀, that is

    Pr(P ≤ p | H₀) = p.    (4.2)

This yields the error rate interpretation of the P-value, namely that if the observed test statistic were regarded as just decisive against H₀, then this is equivalent to following a procedure which rejects H₀ with error rate p. The same is not exactly true if T is discrete, and for this reason modifications to (4.1) are sometimes suggested for discrete data problems: we shall not worry about the distinction here.
It is important in applications to give a clear idea of the degree of discrepancy between data and null hypothesis: if not giving the P-value itself, then at least indicating how it compares to several levels, say p = 0.10, 0.05, 0.01, rather than just testing at the 0.05 level.
Choice of test statistic

In the parametric setting, we have an explicit form for the sampling distribution of the data with a finite number of unknown parameters. Often the null hypothesis specifies numerical values for, or relationships between, some or all of these parameters. There is also an alternative hypothesis H_A which describes what alternatives to H₀ it is most important to detect, or what is thought likely to be true if H₀ is not. This alternative hypothesis guides the specific choice of T, usually through use of the likelihood function

    L(θ) = f_{Y₁,...,Yₙ}(y₁, ..., yₙ | θ),

i.e. the joint density of the observations. For example, when H₀ and H_A are both simple, say H₀: θ = θ₀ and H_A: θ = θ_A, then the best test statistic is the likelihood ratio

    T = L(θ_A)/L(θ₀).    (4.3)
to departure from the original model. We would then test those additional parameters. Otherwise general-purpose goodness-of-fit tests will be used, for example chi-squared tests.

In the nonparametric setting, no particular forms are specified for the distributions. Then the appropriate choice of T is less clear, but it should be based on at least a qualitative notion of what is of concern should H₀ not be true. Usually T would be based on a statistical function s(F) that reflects the characteristic of physical interest and for which the null hypothesis specifies a value. For example, suppose that we wish to test the null hypothesis H₀ that X and Y are independent, given the random sample (X₁, Y₁), ..., (Xₙ, Yₙ). The correlation s(F) = corr(X, Y) = ρ is a convenient measure of dependence, and ρ = 0 under H₀. If the alternative hypothesis is positive dependence, then a natural test statistic is T = s(F̂), the raw sample correlation; if the alternative hypothesis is just dependence, then the two-sided test statistic T = s²(F̂) could be used.
Conditional tests

In most parametric problems and all nonparametric problems, the null hypothesis H₀ is composite, that is it leaves some parameters unknown and therefore does not completely specify F. Therefore P-value (4.1) is not generally well-defined, because Pr(T ≥ t | F) may depend upon which F satisfying H₀ is taken. There are two clean solutions to this difficulty. One is to choose T carefully so that its distribution is the same for all F satisfying H₀: examples include the Student-t test for a normal mean with unknown variance, and rank tests for nonparametric problems. The second and more widely applicable solution is to eliminate the parameters which remain unknown when H₀ is true by conditioning on the sufficient statistic under H₀. If this sufficient statistic is denoted by S, then we define the conditional P-value by

    p = Pr(T ≥ t | S = s, H₀).    (4.4)

Familiar examples include the Fisher exact test for a 2 × 2 table and the Student-t test mentioned earlier. Other examples will be given in the next two sections.

A less satisfactory approach, which can nevertheless give good approximations, is to estimate F by a CDF F̂₀ which satisfies H₀ and then calculate

    p = Pr(T ≥ t | F̂₀).    (4.5)

Typically this value will not satisfy (4.2) exactly, but will deviate by an amount which may be practically negligible.
Pivot tests

When the null hypothesis concerns a particular parameter value, the equivalence between significance tests and confidence sets can be used. This equivalence implies that a test of H₀: ψ = ψ₀ can be based on a pivot such as Z = (ψ̂ − ψ)/V¹ᐟ², with P-value

    p = Pr{Z ≥ (ψ̂ − ψ₀)/v¹ᐟ² | H₀}.    (4.6)

In parametric problems with nuisance parameters λ, a quite general approach is to use the likelihood ratio

    LR = max_{H_A} L(ψ, λ) / max_{H₀} L(ψ, λ) = L(ψ̂, λ̂)/L(ψ₀, λ̂₀) = max_{ψ,λ} L(ψ, λ) / max_λ L(ψ₀, λ).    (4.7)

Of course this also applies when there is no nuisance parameter. For many models it is possible to show that T = 2 log LR has approximately the χ²_d distribution under H₀, where d is the dimension of ψ, so that

    p ≐ Pr(χ²_d ≥ t).    (4.8)
For this statistic there is a simple approximation to the null distribution, and modifications exist to improve the approximation in moderate-sized samples. The likelihood ratio method appears limited to parametric problems, but as we shall see in Chapter 10 it is possible to define analogues in the nonparametric case.

With all of the P-value calculations introduced thus far, simple approximations for p exist in many cases by appealing to limiting results as n increases. Part of the purpose of this chapter is to provide resampling alternatives to such approximations when they either fail to give appropriate accuracy or do not exist at all. Section 4.2 discusses ways in which resampling and simulation can help with parametric tests, starting with exact Monte Carlo tests. Section 4.3 briefly reviews permutation and randomization tests. This leads on to the wider topic of nonparametric bootstrap tests in Section 4.4. Section 4.5 describes a simple method for improving P-values when these are biased. Most of the examples in this chapter involve relatively simple applications. Chapters 6 and beyond contain more substantial applications.
141
(4.9)
where as usual Tlr) denotes the rth ordered value. If exactly k o f the sim ulated
t* values exceed t and none equal it, then
p = P r(T > t | H 0) = Pmc =
(4.10)
O u r strict in terp retatio n o f (4.1) would have us use the upper bound, and so
we ad o p t the general definition
#(y4) means the number
of times the event A
occurs.
P =
1 + #{t* ^ f)
R + 1
/j i n
(4 U )
Pr(F ; = 1 | x/)
PriYj = 0 | x j) ~
^ X>
then the null hypothesis is H q :\p = 0 . U nder H q the sufficient statistic for X is
S
an d T = J2 x j Yj is the n atu ral test statistic; T is in fact optim al for
the logistic m odel, b u t is also effective for m onotone transform ations o f the
odds ratio o th er th an logarithm . The significance is to be calculated according
to (4.4).
T he null distribution o f Y i,...,Y given S = s is uniform over all (")
perm u tatio n s o f y i , . . . , y . R ath er th an generate all o f these perm utations to
com pute (4.4) exactly, we can generate R random perm utations and apply
(4.11). A sim ulated sam ple will then be ( x j,y j) , . . . , (xn,y^), where y \ , . . . , y n is
a ran d o m p erm u tatio n o f y \ , . . . , y n, and the associated test statistic will be
= E x jyj.
142
4 Tests
0
0
1
4
3
1
2
1
1
1
2
0
1
2
4
3
2
1
5
3
4
4
4
2
1
3
2
1
0
0
4
3
5
3
0
2
3
2
2
2
2
4
2
1
7
1
2
3
1
0
Table 4.1 n = 50
counts o f balsam-fir
seedlings in five feet
square quadrats.
143
o
w
GO
c
o
e
0Q_
CO
20
40
60
80
Chi-squared quantiles
It seems intuitively clear th a t the sensitivity o f the M onte C arlo test increases
w ith R. We shall discuss this issue later, b u t for now we note th a t it is advisable
to take R to be a t least 99.
T here are tw o im p o rtan t aspects o f the M onte C arlo test which m ake it
widely useful. T he first is th a t we only need to be able to sim ulate d a ta under
the null hypothesis, this being relatively simple even in some very com plicated
problem s, such as those involving spatial processes (C hapter 8). Secondly,
do n o t need to be independent outcom es: the m ethod rem ains
valid so long as they are exchangeable outcom es, which is to say th a t the
jo in t density o f T,
T R u nder Ho is invariant under p erm u tatio n o f its
argum ents. This allows us to apply M onte C arlo tests to quite com plicated
problem s, as we see next.
144
4 Tests
generate each y by an independent sim ulation o f N steps w ith the same initial
state x. If the M arkov chain has equilibrium distribution equal to the null
hypothesis distribution o f Y = ( Y [ ,..., Y), then y and the R replicates o f y *
are exchangeable outcom es under Ho an d (4.11) applies.
Suppose th a t und er H q the d a ta have jo in t density fo(y) for
where
both /o and & are conditioned on sufficient statistic s if we are dealing with
a conditional test. F or simplicity suppose th a t
has \3S\ elements, which
we now regard as possible states labelled (1 ,2 ,...,\&S\) o f a M arkov chain
{Zr, t = . . . , 1 ,0 ,1 ,...} in discrete time. C onsider the d a ta y to be one
realization o f Zjy. We then have to fix an ap p ro p riate value o r state for Zo,
and w ith this initial state sim ulate the R independent values o f Z N which are
the R values o f Y \ The M arkov chain is defined so th a t /o is the equilibrium
distribution, which can be enforced by ap p ro p riate choice o f the one-step
forw ard transition probability m atrix Q, say, with elements
quv = P r(Z I+i = v | Z ( = u),
u,v &.
(4.12)
Let the final state, the realized value o f Zo, be x. N ote th at if Ho is true, so
th at y was indeed sam pled from / 0, then Pr(Zo = x) = /o(x). In the second
p art o f the sim ulation, which we repeat independently R times, we sim ulate N
forw ard steps o f the M arkov chain, starting in state x and ending up in state
y ' = (> > i,...,y '). Since und er Ho the chain starts in equilibrium ,
Pr(Y* = /
| H 0) = P r( Z N = / ) = / 0( / ) .
f ( y , y l . . . , f R | Ho) = fo(y)
Pr(Z 0 = x | Z N = y ) ] ] P r(Z N =
x
| Z 0 = x),
r= l
using the independence o f the replicate sim ulations from x. But by the definition
o f the first p a rt o f the sim ulation, where (4.12) applies,
/o (y )P r(Z 0 = x | Z N = y) = / 0(x)P r(Z ^ = y \ Z 0 = x),
145
and so
f ( y , y[, . .. , y 'R \ H 0) = J 2 /o (x ){ p r(Z N = y | Z 0 = x) [ J Pr(Z N = y*r | Z 0 = x ) \ ,
x
r= l
'
u^v,
and
Qua = muu + Y ^ max{0 , 1 - fo(v)/fo{u)}muo,
V^U
an d from this it follows th a t f o is indeed the equilibrium distribution o f the
M arkov chain, as required. In applications it is n o t necessary to calculate
the probabilities muv explicitly, although the sym m etry and irreducibility o f
the carrier chain m ust be checked. If the m atrix M is n o t sym m etric, then
the acceptance probability in the M etropolis algorithm m ust be m odified to
m in [l,fo(v)mvu/{fo(u)muv}].
T he crucial feature o f the M arkov chain m ethod is th a t fo itself is not
needed, only ratio s fo(v)/fo(u) being involved. This m eans th a t for conditional
tests, w here f o is the conditional density for Y given S = s, only ratios o f the
u nconditional null density for Y are n e ed ed :
fo(v) = P r(7 = v \ S = s , H q) = P r(7 = v | H 0)
fo(u)
P r(7 = u | S = s , H 0)
P r(Y= u\H o)'
This greatly simplifies m any applications.
The realizations o f the M arkov chain are sym m etrically tied to the artificial
starting value x, an d this induces a sym m etric correlation am ong (t,
146
4 Tests
This correlation depends upon the p articu lar construction o f Q, and reduces
to zero at a rate which depends upon Q as m increases. W hile the correlation
does not affect the validity o f the P-value calculation, it does affect the power
o f the te s t: the higher the correlation, the lower the power.
Example 4.3 (Logistic regression) We retu rn to the problem o f Exam ple 4.1,
which provides a very sim ple if artificial illustration. The d a ta y are a binary
sequence o f length n w ith s ones, and calculations are to be conditional on
Y , Yj = s. Recall th a t direct M onte C arlo sim ulation is possible, since all (")
possible d a ta sequences are equally likely und er the null hypothesis o f constant
probability o f a unit response.
One simple M arkov chain has one-step transitions which select a pair o f
subscripts i, j a t random , an d switch y t an d yj. Clearly the chain is irreducible,
since one can progress from any one binary sequence with s ones to any other.
All ratios o f null probabilities /o (u )//o ( ) are equal to one, since all binary
sequences w ith s ones are equally probable. Therefore if we run the M etropolis
algorithm , all switches are accepted. But note th a t this M arkov chain, while
simple to im plem ent, is inefficient and will require a large num ber o f steps to
induce approxim ate independence o f the ts. T he m ost effective M arkov chain
would have one-step transitions which are ran d o m p erm utations, and for this
only one step w ould be required.
Example 4.4 (AM L data) F or d a ta such as those in Exam ple 3.9, consider
testing the null hypothesis o f p ro p o rtio n al h azard functions. D enote the failure
times by z\ < z2 < < z, assum ing no ties for the m om ent, and define rtj to
be the nu m b er in group i w ho were at risk ju st p rior to zj. Further, let yj be
0 or 1 according as the failure at zj is in group 1 or 2, and denote the hazard
function a t tim e z for group i by fy(z). Then
P r(y . = l ) = _____
r*Mzj>_____
aj + 0 /
147
5
11
8
11
12
11
10
n
12
11
10
n
9
n
8
8
11
*18
9
11
12
5
11
13
10
18
8
23
7
23
7
27
6
30
5
31
5
33
4
34
4
43
3
45
3
48
2
10
8
10
7
8
6
7
6
7
5
6
5
5
4
5
3
4
3
3
2
oo
10
is simply
18
dj + Oj
7=1
;'= i
We take the carrier M arkov chain to have one-step transitions which are ra n
dom p erm u tatio n s: this guarantees fast m ovem ent over the state space. A step
which moves from x to x is then accepted with probability min ^ 1, f l j l i a]
By sym m etry the reverse chain is defined in exactly the same way.
The test statistic m ust be chosen to m atch the particular alternative hy p o th
esis th o u g h t relevant. H ere we suppose th at the alternative is a m onotone ratio
o f hazards, for which T = YljLi Yj log(Zj) seems to be a reasonable choice.
The M arkov chain sim ulation is applied with N = 100 steps back to give the
initial state x an d 100 steps forw ard to state y ' , the latter repeated R = 99
times. O f the resulting * values, 48 are less th an or equal to the observed value
t = 17.75, so the P-value is (1 + 4 8 )/(l + 99) = 0.49. Thus there appears to be
no evidence against the prop o rtional hazards model.
Average acceptance probability in the M etropolis algorithm is approxim ately
0.7, and results for N = 10 and N = 1000 ap p ear indistinguishable from those
for N = 100. This indicates unusually fast convergence for applications o f the
M arkov chain m ethod.
148
4 Tests
the null m odel Fo and use (4.5) to com pute the P-value, i.e. p = P r(T > t \ Fo).
F or exam ple, for the p aram etric m odel where we are testing Ho : ip = ipo
with X a nuisance p aram eter, Fo w ould be the C D F o f f ( y \ ipo,Xo) with Xo
the m axim um likelihood estim ator (M L E ) o f the nuisance param eter when ip
is fixed equal to ipo. C alculation o f the P-value by (4.5) is referred to as a
b o o tstrap test.
If (4.5) can n o t be com puted exactly, o r if there is no satisfactory approx
im ation (norm al or otherwise), then we proceed by sim ulation. T h at is, R
independent replicate sam ples yj,...,_y* are draw n from Fo, and for the rth
such sam ple the test statistic value t'r is calculated. T hen the significance
probability (4.5) will be approxim ated by
Pboot ~
( 4 .1 3 )
O rdinarily one would use a simple p ro p o rtio n here, but we have chosen to
m ake the definition m atch th a t for the M onte C arlo test in (4.11).
Example 4.5 (Separate families test) Suppose th a t we wish to choose between
the alternative m odel form s fo(y \ r\) and f i ( y \ ) for the P D F o f the random
sam ple y \ , . . . , y n. In some circum stances it m ay m ake sense to take one model,
say fo, as a null hypothesis, and to test this against the o th er m odel as
alternative hypothesis. In the n o tatio n o f Section 4.1, the nuisance param eter
is X = (t],C) and ip is the binary indicator o f m odel, w ith null value ipo = 0
and alternative value ipa = 1. The likelihood ratio statistic (4.7) is equivalent
to the m ore convenient form
r = - N g ^ = n- ' X > g M ^ ,
L o(rj)
fo(yj I ri)
(4.14)
149
null distribution o f T , b u t this is often quite unreliable except for very large
n. The p aram etric b o o tstrap provides a m ore reliable and simple option.
The p aram etric b o o tstrap w orks as follows. We generate R sam ples o f size
n by ran d o m sam pling from the fitted null m odel /o (y | fj). For each sample
we calculate estim ates fj* and ( by m axim izing the sim ulated log likelihoods
m) = E lo&w i
and com pute the sim ulated log likelihood ratio statistic
,
= ^
{ l o g y - ot\
p
) y > 0 -
150
4 Tests
t*
151
yt)
O/
N
CO
o
O/'
c>d in
o
CDO
o
c o
<1)
-o
D
Q--'
0
/o
.6 o o
35
0/0
o
-1
a G si.
U nder the null hypothesis, T(a), Tj*(a),. . . , T R(a) are independent and identi
cally distributed for any fixed a, so th a t (4.9) applies w ith T = T(a). T h at
is,
P r ( T ( a ) < T (})( f l ) | H o ) = ^
I .
(4.15)
152
4 Tests
from the R sim ulated plots. If t(a) exceeds the upp er value, or falls below the
lower value, then the corresponding one-sided P-value is at m ost p; the twosided test which rejects Ho if t(a) falls outside the interval [ ^ ( a ) , tJ'J?+1_fc)(a)]
has level equal to 2p. T he set o f all u p p er and lower critical values defines the
test envelope
S'1 2p = {[t(fc)(a), t(R+!_(;)()] : a e s / j .
(4.16)
Excursions o f t(a) outside S l~2p are regarded as evidence against Ho, and this
sim ultaneous com parison across all values o f a is w hat is usually m eant by the
graphical test.
Example 4.7 (Normal plot, continued) F or the norm al plot o f the previous
example, suppose we set p = 0.05. T he sm allest sim ulation size th a t works is
R = 19, and then we take k = 1 in (4.16). T he test envelope will therefore
be the lines connecting the m axim a and the m inim a. Because we are plotting
studentized sam ple values, which elim inates m ean and variance param eters, the
sim ulation can be done w ith the N ( 0,1) distribution. Each sim ulated sam ple
y \ , . . . , y u is studentized to give z* = ( y y*)/s*, i = 1 ,..., 13, whose ordered
values are then plotted against the same norm al quantiles a, = <P-1 ( ^ ) . The
left panel o f Figure 4.4 shows a set o f R = 19 norm al plots (plotted as
connecting dashed lines) and their envelope (solid curves) for studentized
values o f sim ulated sam ples o f n = 13 N{0,1) data. The right panel shows the
envelope o f these plots together w ith the original d a ta plot. N ote th a t one o f
the inner points falls ju st outside the envelope: this m ight be taken as mild
evidence against norm ality o f the data, b u t such an in terpretation m ay be
prem ature, in light o f the discussion below.
153
sam ples can be generated from any null m odel Fo. W hen unknow n model
param eters can n o t be elim inated, we would sim ulate from Fo: then (4.15) will
be approxim ately true provided n is n o t too small.
There are two aspects o f the graphical test which need careful thought,
nam ely the choice o f R and the in terpretation o f the resulting plot. It seems
clear from earlier discussion th a t for p = 0.05, say, R = 19 is too sm all:
the test envelope is too random . R = 99 would seem to be a m ore sensible
choice, provided this is not com putationally difficult. But we should consider
how form al is to be the interp retation o f the graph. As it stands the notional
one-sided significance levels p hold pointwise, and certainly the chance th a t the
envelope captures an entire plot will be far less th an 1 2p. So it would not
m ake sense to infer evidence against the null m odel if one arbitrarily placed
p o in t falls outside the envelope, as happened in Exam ple 4.7. In fact in th at
exam ple the chance is ab o u t 0.5 th at some point will fall outside the sim ulation
envelope, in co n trast to the pointw ise chance 0.1.
F or some purposes it will be useful to know the overall erro r rate, i.e.
the chance o f a point falling outside the envelope, or even to control this
rate. W hile this is difficult to do exactly, there is a simple em pirical approach
which w orks satisfactorily. G iven the R sim ulated plots which were used to
calculate the test envelope, we can sim ulate the graphical test by com paring
{t'(a),a G j / } to the envelope SlS r2p th at is obtained from the o th er R 1
sim ulated plots. If we repeat this sim ulated test for r = 1, . . . , R , then we obtain
a resam ple estim ate o f the overall two-sided erro r rate
# { r : {t(a),a G j / } exits <0Lr2pj
R
(4.17)
154
4 Tests
Figure 4.5 Normal plot
of n = 13 studentized
values for final sample
in Table 3.1, together
with simultaneous (solid
lines) and pointwise
(dashed lines) two-sided
0.10 test envelopes.
K = 199
This is easy to calculate, since {t(a),a e jtfj exits S lS r2p if and only if
rank{t*(a)} < k
or
for at least one value o f a, where as before k = p(R + 1 ) . T hus if the R plots
are represented by a R x N array, we first com pute colum nw ise ranks. T hen
we calculate the p ro p o rtio n o f rows in which either the m inim um rank is less
th an or equal to k, or the m axim um ran k is greater th a n or equal to R + 1 k,
o r both. T he corresponding one-sided erro r rates are estim ated in the obvious
way.
Example 4.8 (Normal plot, continued)
overall tw o-sided error rate o f approxim ately 0.1 requires R = 199. Figure 4.5
shows a graphical test p lo t for R = 199 w ith outer envelope corresponding to
overall tw o-sided erro r rate 0.1 and inner envelope corresponding to pointw ise
two-sided erro r rate 0.1; the em pirical error rate (4.17) for the o u ter envelope
is 0.10.
155
4.2.5 C hoice o f R
In any sim ulation-based test, relatively few sam ples could be used if it quickly
becam e clear th a t p was so large as to n o t be regarded as evidence against HoF or exam ple, if the event t* > t occurred 50 times in the first 100 samples, then
it is reasonably certain th a t p will exceed 0.25, say, for m uch larger R, so there
is little p o in t in sim ulating further. O n the other hand, if we observed t* > t
only five times, then it w ould be w orth sam pling fu rther to m ore accurately
determ ine the level o f significance.
O ne effect o f n o t com puting p exactly is to w eaken the pow er o f the test,
essentially because the critical region o f a fixed-level test has been random ly
displaced. T he effect can be quantified approxim ately as follows. C onsider
testing a t level a, which is to say reject Ho if p < a. If the integer k is chosen
equal to (R + l)a , then the test rejects Ho when t'{R+l_k) < t. F or the alternative
hypothesis H a , the pow er o f the test is
nR(a, HA) = Pr(reject H 0 \ H A) = P r(T (*R+1_k) < T \ H A).
To evaluate this probability, suppose for simplicity th a t T has a continuous
distribution, w ith P D F go(t) and C D F Go(t) under Ho, and density gA(t) under
H A. T hen from the stan d ard result for P D F o f an order statistic we have
nR( a, HA) =
J J
R ( ^ _ Q c o M ^ g o M U - Goix ) } ^ 1 gA(t)dxdt.
A fter change o f variable an d some rearrangem ent o f the integral, this becom es
nR(cc,Ha ) = [ ^ao(u, H A)hR(u;tx)du,
Jo
(4.18)
where nx (u,HA) is the pow er o f the test using the exact P-value, and hR{u;a)
is the b eta density on [0,1] w ith indices (R + l)a and (R + 1)(1 a).
T he next p a rt o f the calculation relies on n R{ot, H A) being a concave function
o f a, as is usually the case. T hen a lower bound for n ^ u , H a ) is nm[ u , H a )
which equals U7taj( a ,H a) / a for u < a and 7tx ( a ,H 4 ) for u > a. It follows by
applying (4.18) to n R(y., HA), and som e m anipulation, th at
n 00( o L , H A ) - n R( a,HA)
< nco^^A')J
\u - a \ h R(u;cc)du
7too(a, H y4)a*R+1*<x(l + 1)
(R + l ) a r ((R + l)a ) T ((R + 1)(1 - a)) '
We apply Stirlings approxim ation T(x) = (2n)l/ 2 x x~ l / 1 exp(x) for large x to
the rig h t-h an d side an d obtain the approxim ate bound
156
4 Tests
The following table gives som e num erical values o f this approxim ate bound.
sim ulation size R
power ratio for a = 0.05
power ratio for a. = 0.01
19
39
99
199
499
999
9999
0.61
0.73
0.83
0.60
0.88
0.72
0.92
0.82
0.95
0.87
0.98
0.96
157
dnan
used in (4.4) is then a uniform distribution over a set o f perm utations o f the
d a ta structure. The following exam ple illustrates this.
Example 4.9 (Correlation test) Suppose th at Y = ( U , X ) is a random pair
an d th a t n such pairs are observed. T he objective is to see if U and X are
independent, this being the null hypothesis Hq. A n illustrative dataset is plotted
in Figure 4.6, where u = d nan is a genetic m easure and x = han d is an integer
m easure o f left-handedness. T he alternative hypothesis is th a t x tends to be
larger w hen u is larger. These d ata are clearly non-norm al.
O ne simple test statistic is the sam ple correlation, T = p{F) say. N ote th at
here the E D F F puts probabilities n~ on each o f the n d ata pairs (u;,x,).
T he correlation is zero for any distribution th a t satisfies Ho. The correlation
coefficient for the d a ta in Figure 4.6 is 0.509.
W hen the form o f F is unspecified, F is m inim al sufficient for F. U nder
the null hypothesis, however, the m inim al sufficient statistic is com prised o f
the ordered us an d ordered xs, s = (M(i),...,U(n),X(i),...,X()), equivalent to
the two m arginal E D Fs. So here a conditional test can be applied, w ith (4.4)
defining the P-value, w hich will therefore be independent o f the underlying
m arginal distributions o f U and X . N ow when S is constrained to equal s,
the ran d o m sam ple ( U \ , X \ ) , ... ,(U,X) is equivalent to (u(i),X j), . . . , (u (n),X*)
w ith ( X j ,. . .,X"n) a ran d o m p erm u tatio n o f X ( i ) ,...,X ( ) . F urther, when Ho
is true all such p erm u tatio n s are equally likely, and there are n! o f them.
Therefore the one-sided P-value is
# o f perm utations such th at T* > t
In evaluating p, we can use the fact th at all m arginal sam ple m om ents
158
4 Tests
-0.5
0.0
0.5
Correlation t*
1 + # { tr* > r}
R + 1
'
159
F2(y) = G(y - n 2)
or
F\(y) = G ( y / n i),
F2(y) = G{y/(i2),
for some unknow n G. T hen the null hypothesis implies a com m on C D F F for
the two populations. In this case, the null hypothesis sufficient statistic s is the
set o f order statistics for the pooled sam ple
=
yiii >*^1
y2n2,
th a t is s = (u(i),...,H (ni+n2)).
Situations where the special form s for Fj and F 2 apply would include
com parisons o f tw o treatm ents which were both applied to a random selection
o f units from a com m on pool. The special forms would n o t necessarily apply to
sets o f physical m easurem ents taken under different experim ental conditions or
using different apparatu s, since then the sam ples could have unequal variablity
even though Ho were true.
Suppose th a t we test Ho by com paring the sam ple m eans using test statistic
t = y 2 yi, and suppose th a t the one-sided alternative H a : fi2 > ji\ is
appropriate. If we assum e th a t Ho implies a com m on distribution for the Yu
and Yzj, then the exact significance probability is given by (4.4), i.e.
p = P r(T > t | S = s,Ho).
N ow when S is constrained to equal s, the concatenation o f the two random
sam ples ( Y u ,..., Yini, Y2i , . . . , Y22) m ust form a p erm utation o f s. The first
160
4 Tests
m com ponents o f a p erm u tatio n will give the first sam ple and the last 2
com ponents will give the second sample. Further, w hen Ho is true all such
perm utatio n s are equally likely, an d there are
o f them. Therefore
#
P ~
'
(4-21)
As in the previous exam ple, this exact probability would usually be approxi
m ated by taking R ran d o m p erm u tatio n s o f the type described, and applying
(4.11).
A som ew hat m ore com plicated tw o-sam ple test problem is provided by the
following example.
Example 4.12 (AM L data) Figure 3.3 shows the product-lim it estim ates o f
the survivor function for tim es to rem ission o f tw o groups o f patients with acute
m yelogeneous leukaem ia (A M L), w ith one o f the groups receiving m aintenance
chem otherapy. D oes this treatm en t m ake a difference to survival?
A com m on test for com parison o f estim ated survivor functions is based on
the log-rank statistic, which com pares the actual n u m ber o f failures in group
1 with its expected value at each tim e a failure is observed, under the null
hypothesis th a t the survival distributions o f the two groups are equal. To be
m ore explicit, suppose th a t we pool the two groups and obtain ordered failure
times y\ < < ym, w ith m < n if there is censoring. Let / \j and r\j be the
num ber o f failures and the nu m b er a t risk o f failure in group 1 at tim e yj, and
similarly for group 2. T hen the log-rank statistic is
T = E j = i ( /U - mij)
where
( / l j + f 2 j ) r i j r 2j ( r i j + r2j - f i; - f 2J)
(fij+f2j)rij
1]
r ij + r y
lJ
1)
161
Pr*(7
> r I
Fo),
i + # K > t}
R+l
4 Tests
162
in
CM
in
o
o
6
-4
Example 4.13 (Comparison of two means, continued) C onsider the last two
series o f m easurem ents in Exam ple 3.1, which are reproduced here labelled
sam ples 1 and 2 :
sam ple 1
sam ple 2
82
84
79
86
81
85
79
82
77
77
79
76
79
77
78
80
79
83
82
81
76
78
73
78
64
78
163
The question is: do we gain or lose anything by assum ing th a t the two
distributions have the same shape?
The p articu lar null fitted m odel used in the previous exam ple was suggested
in p a rt by the p erm u tatio n test, and is clearly n o t the only possibility. Indeed,
a m ore reasonable null m odel in the context would be one which allowed
different variances for the tw o p opulations sam pled: an analogous m odel is
used in Exam ple 4.14 below. So in general there can be m any candidates for null
m odel in the nonparam etric case, each corresponding to different restrictions
im posed in ad d itio n to H q. O ne m ust judge which is m ost ap p ro p riate on the
basis o f w hat m akes sense in the practical context.
Semiparametric null models
If d a ta are described by a sem iparam etric m odel, so th a t some features o f
underlying distributions are described by param eters, then it m ay be relatively
easy to specify a null model. The following exam ple illustrates this.
Example 4.14 (Comparison of several means) F or the gravity d a ta in E xam
ple 3.2, one p o in t th a t we m ight check before proceeding w ith an aggregate
estim ation is th a t the underlying m eans for all eight series are in fact the same.
One plausible m odel for the data, as m entioned in Section 3.2, is
)fij ~
L ? I ~ I?)
where the ei; com e from a single distribution G. The null hypothesis to be
tested is Ho : p\ = = p.%, w ith general alternative. F or this an appropriate
test statistic is given by
yi and sj are the average
and sample variance for
the ith series.
t= E
Wi(yi - o)2,
Wi = Hi/sf,
i=1
w ith fo = Y wi}'i/ Y wi
null estim ate o f the com m on mean. The null
distribution o f T w ould be approxim ately yfi were it n o t for the effect o f small
sam ple sizes. So a b o o tstrap approach is sensible.
T he null m odel fit includes /to and the estim ated variances
K> = ( i
l ) s f / i + ( Pi ~ M
2-
{ ^ - ( E w , ) - 1}172
when plotted against norm al quantiles, suggest mild non-norm ality. So, to be
safe, we apply a nonparam etric bootstrap. D atasets are sim ulated under the
null m odel
y'j = fo +
164
4 Tests
10
20
30
t*
s?
1
2
3
4
5
6
7
8
66.4
89.9
77.3
81.4
75.3
78.9
77.5
80.4
370.6
233.9
248.3
68.8
13.4
34.1
22.4
11.3
40
50
w,'
474.4
339.9
222.3
67.8
23.1
31.1
21.9
13.5
0.022
0.047
0.036
0.116
0.599
0.323
0.579
1.155
60
Chi-squared quantiles
with e'jS random ly sam pled from the pooled residuals {e^, i = 1.......8, j =
l,...,n ,} . F or each such sim ulated d ataset we calculate sam ple averages and
variances, then weights, the pooled m ean, and finally t*.
Table 4.3 contains a sum m ary o f the null m odel fit, from which we calculate
f o = 78.6 an d t = 21.275.
A set o f R = 999 b o o tstrap sam ples gave the histogram o f t values in the
left panel o f Figure 4.10. O nly 29 values exceed t = 21.275, so p = 0.030. The
right panel o f the figure plots ordered t* values against quantiles o f the Xi
approxim ation, which is off by a factor o f ab o u t 1.24 and gives the distorted
P-value 0.0034. A n o rm al-error p aram etric b o o tstrap gives results very sim ilar
to the nonparam etric b o otstrap.
Figure 4.10
Resampling results for
comparison of the
means of the eight series
of gravity data. Left
panel: histogram of
R = 999 values of t*
under nonparametric
resampling from the
null model with pooled
studentized residuals;
the unshaded area to
right of observed value
t = 21.275 gives
p = 0.029. Right panel:
ordered t values versus
Xi quantiles; the dotted
line is the theoretical
approximation.
165
Example 4.15 (Ratio test) Suppose that, as in Exam ple 1.2, each observation y
is a p air (u,x), and th a t we are interested in the ratio o f m eans 8 = E ( X ) /E ( U) .
In p articu lar suppose th a t we wish to test the null hypothesis Hq : 6 = 0OThis problem could arise in a variety o f contexts, and the context would help
to determ ine the relevant null model. F or example, we m ight have a pairedcom parison experim ent where the m ultiplicative effect 0 is to be tested. H ere
do would be 1, an d the m arginal distributions o f U and X should be the same
und er Hq- O ne n atu ral null m odel Fo w ould then be the sym m etrized E D F, i.e.
the E D F o f the expanded d a ta ( u i , x i ) , . . . , (u,x),(xi,ui),. . . , ( x n,u).
(4.22)
ttj
(4.23)
4 Tests
166
and a useful alternative is the reverse inform ation distance
k
nk
Y Y Pli log(P<7/Py')-
(4-24)
r=l j= 1
Both are m inim ized by the set o f E D F s when no constraints are im posed. The
second m easure has the advantage o f autom atically providing non-negative
solutions. T he following exam ple illustrates the m ethod and som e o f its im pli
cations.
Example 4.16 (Comparison of two means, continued) F or the tw o-sam ple
problem considered in Exam ples 4.11 and 4.13, we apply (4.22) with the
discrepancy m easure (4.24). T he null hypothesis constraint is th a t the two
m eans are equal, th a t is J ^ y i j P i j = Hi = H2 = ^ y i j P i j , so th a t (4.22) becomes
2
n,
Y Y piJ
n,
1=1 y=i
>=i
\j =i
Setting derivatives w ith respect to pi; equal to zero gives the equations
1 + lo g pij ai Xyij = 0,
17,0
EkLiexp(Ayik)
exp i - X y y )
E"Li ^ p i - X y i k Y
2 j'
The specific value o f X is uniquely determ ined by the null hypothesis constraint,
which becom es
Eyijexp(/l};iv-) = E y 2jexp(-Ay2j)
E * e x p ( ^ lt)
s x p ( - X y 2k)
167
cco
0
TJ
N ull m odel
S tatistic
P-value
______________________________________
pooled E D F
n ull variances
exponential tilt
M LE
(pivot)
t and z
t
t
z
t
z
z
0.045
0.053
0.006
0.025
0.019
0.017
0.015
C alculate
1 + # { f* ^ t}
v = -------- --------- -
V
R+ 1
N um erical results for R = 999 are given in Table 4.4 in the line labelled
exponential tilt, t". R esults for other resam pling tests are also given for
com parison: z refers to a studentized version o f t, M L E refers to use o f
constrained m axim um likelihood (see Problem 4.8), null variances refers to
the sem iparam etric m ethod o f Exam ple 4.14. Clearly the choice o f null m odel
can have a strong effect on the P-value, as one m ight expect. T he studentized
test statistics z are discussed in Section 4.4.1.
The m ethod as illustrated here has strong sim ilarity to use o f em pirical
likelihood m ethods, as described in C h ap ter 10. In practice it seems wise to
168
4 Tests
check the null m odel produced by the m ethod, since resulting P-values are
generally sensitive to m odel. Thus, in the previous example, we should look at
Figure 4.11 to see if it m akes practical sense. The sm oothed versions o f the null
distributions in the right panel, which are obtained by kernel sm oothing, are
perhaps easier to interpret. One m ight well judge in this case th a t the two null
distributions are m ore different th a n seems plausible. D espite this reservation
ab o u t this exam ple, the general m ethod is a valuable tool to have in case o f
need.
There are, o f course, situations where even this quite general approach will
n o t work. N evertheless the basic idea behind the ap proach can still be applied,
as the following exam ples show.
Example 4.17 (Test for unimodality) O ne o f the difficulties w ith n o n p a ra
m etric curve estim ation is know ing w hether particu lar features are real. For
example, suppose th a t we com pute a density estim ate f ( y ) and find th a t it has
two modes. H ow do we tell if the m inor m ode is real? B ootstrap m ethods can
be helpful in such problem s. Suppose th a t a kernel density estim ate is used, so
th at
/<>; =
(4.26)
j=1
where (j> is the stan d ard norm al density. It is possible to show th at the num ber
o f m odes o f f decreases as h increases. So one way to test unim odality is to
see if an unusually large h is needed to m ake / unim odal. This suggests th a t
we take as test statistic
t = min{h : f { y , h ) is unim odal}.
A natural candidate for the null sam pling distribution is f { y , t ) , since this is
the least sm oothed version o f the E D F which satisfies the null hypothesis o f
unim odality. By the convolution p roperty o f / , random sam ple values from
f ( y ; t ) are given by
y j = yij + hep
(4.27)
169
0.19
1.00
1.83
2.46
3.48
4.36
6.19
9.29
0.28
1.16
1.91
2.51
3.79
4.53
6.45
9.78
0.29
1.17
1.97
2.89
3.83
4.97
7.13
10.15
0.45
1.29
2.05
2.89
3.94
5.02
7.35
11.32
0.64
1.31
2.10
2.90
3.95
5.13
7.77
13.21
0.65
1.34
2.17
2.92
4.11
5.75
7.80
13.27
0.78
1.55
2.28
3.03
4.14
6.03
8.81
14.39
0.85
1.60
2.41
3.19
4.19
6.19
9.22
16.26
Example 4.18 (Tuna density estimate) O ne m ethod for estim ating the ab u n
dance o f a species in a region is to traverse a straight line o f length L through
the region, an d to record the p erpendicular distances from the line to posi
tions where there are sightings. If there are n independent sightings and their
(unsigned) distances y \ , . . . , y n are presum ed to have P D F f ( y ) , y > 0, the
ab undance density can be estim ated by n /( 0 ) /( 2L), where / ( 0 ) is an estim ate
o f the density a t distance y = 0. The P D F f ( y ) is p roportional to a detection
function th a t is assum ed to decline m onotonically with increasing distance,
w ith non-m onotonic decline suggesting th a t the assum ptions th a t underlie line
transect sam pling m ust be questioned.
Table 4.5 gives d a ta from an aerial survey o f schools o f S outhern Bluefin
T una in the G reat A ustralian Bight. Figure 4.12 shows a histogram o f the data.
The figure also shows kernel density estim ates
y * 0-
( 4 -2 8 >
with h = 0.75, 1.5125, an d 3. This seemingly unusual density estim ate is used
because the probability o f detection, and hence the distribution o f signed
distances, should be sym m etric ab o u t the transect. The estim ate is obtained by
first calculating the E D F o f the reflected distances + y i , - . . , + y n, then applying
the kernel sm oother, and finally folding the result a b o u t the origin.
A lthough the estim ated density falls m onotonically for h greater th an 1.5125,
the estim ate for sm aller values suggests non-m onotonic decline. Since we
consider f ( y ; h ) for positive values o f y only, we are interested in w hether the
underlying density falls m onotonically or not. We take the sm allest h such th at
f ( y ; h ) is unim odal to be the value o f o u r test statistic t. This corresponds
to m ono tonic decline o f f ( y ; h ) for y > 0, giving no m odes for y > 0. The
observed value o f the test statistic is t = 1.5125, and we are interested in the
significance probability
P r( T >
1 Fo),
for d a ta arising from Fo, an estim ate o f F th at satisfies the null hypothesis o f
4 Tests
170
Distance (miles)
j = l,...,n ,
where the signs + are assigned random ly, the l j are random integers from
{ 1 ,2 ,...,n}, and the r.j are independent N ( 0,1) variates; cf. (4.27). T he kernel
density estim ate based on the y is f *(y;h). We now calculate the test statistic
as outlined in the previous example, an d rep eat the process R = 999 times to
obtain an approxim ate significance probability. We restrict the h u n t for m odes
to 0 < y < 10, because it does n o t seem sensible to use so small a sm oothing
param eter in the density tails.
W hen the sim ulations were perform ed for these data, the frequencies o f the
num ber o f m odes o f f ' ( y ; t ) for 0 < y < 10 were as follows.
M odes
Frequency
0
536
1 2
411
50
3
2
Like the fitted null distribution, a replicate where the full f * {y ;t ) is unim odal
will have no m odes for y > 0. I f we assum e th a t the event t* = t is impossible,
b o o tstrap d atasets w ith no m odes have t* < t, so the significance probability
is (411 + 5 0 + 2 + l)/(9 9 9 + 1) = 0.464. T here is no evidence against m onotonic
decline, giving no cause to d o u b t the assum ptions underlying line transect
m ethods.
171
(4.29)
which we can approxim ate by sim ulation w ithout having to decide on a null
m odel Fo- T he usual choice for v would be the nonparam etric delta m ethod
estim ate vL o f Section 2.7.2. T he theoretical support for the use o f Z is given in
Section 5.4; in certain cases it will be advantageous to studentize a transform ed
estim ate (Sections 5.2.2 an d 5.7). In practice it would be appropriate to check
on w hether or n o t Z is approxim ately pivotal, using techniques described in
Section 3.10.
A pplications o f this m ethod are described in Section 6.2.5 and Section 6.3.2.
T he m odifications for the oth er one-sided alternative and for the two-sided
alternative are simply p = Pr*(Z* < zo | F ) and p = Pr*(Z *2 > z \ \ F).
Example 4.19 (Comparison of two means, continued) F or the application
considered in Exam ples 4.11, 4.13 and 4.16, where we com pared two m eans
using t =
y u it w ould be reasonable to suppose th a t the usual tw o-sam ple
t statistic
z
Y2 - Y 1 - ( H 2 - H i )
(,S i / n 2 + S f / n i ) l/2
is approxim ately pivotal. H ere F in (4.29) represents the E D F s o f the two
samples, given th a t no assum ptions are m ade connecting the two distributions.
We calculate the observed value o f the test statistic,
2o =
h - h
( s \ / n 2 + S ]/ i) 1/2
4 Tests
172
whose value for these d a ta is 2.846/1.610 = 1.768. T hen R values o f
z. = f 2 ~ fi ~ ( h - h )
(s 22/ n 2 + s \ 2/ n i ) l/2
are generated, w ith each sim ulated d ataset containing n\ values sam pled with
replacem ent from sam ple 1 an d n2 values sam pled with replacem ent from
sam ple 2.
In R = 999 sim ulations we found 14 values in excess o f 1.768, so the P-value
is 0.015. This is entered in Table 4.4 in the row labelled (pivot).
(4.30)
(4.31)
w ith the obvious changes for a test based on Qo. Even though the statistic
is n o t pivotal, its use is likely to reduce the effects o f nuisance param eters,
and to give a P-value th a t is m ore nearly uniform ly distributed u n der the null
hypothesis th a n th a t calculated from T alone.
Example 4.20 (Comparison of two means, continued) In Table 4.4 all the
entries for z, except for the row labelled (pivot), were obtained using (4.30)
w ith t = y 2 yi an d vo depending on the null m odel. F or example, for the null
m odels discussed in Exam ple 4.16,
2
n,
173
where ,o = Yl'j=i yijPijfi F or the two sam ples in question, under the ex
ponential tilt null m odel b o th m eans equal 79.17 and vo = 1.195, the latter
differing considerably from the variance estim ate 2.59 used in the pivot m ethod
(Exam ple 4.19).
The associated P-values com puted from (4.31) are shown in Table 4.4 for
all null models. These P-values are less dependent upon the p articular m odel
th an those obtained w ith t unstudentized.
T hen if y i , . . . , y n is in
the P-value as
^ ,)-
174
4 Tests
F o r an exam ple o f this, see Problem 4.13. In the case o f exact tests, such as
p erm utatio n tests, the adaptive test is also exact.
F or r = 1, . . . , R,
i+ # { s r
M+ 1
175
P I
Fo),
(4.32)
where p is the observed P-value defined above. This requires b o o tstrapping the
algorithm for com puting P-values, an o th er instance o f increasing the accuracy
o f a b o o tstrap m ethod by b o o tstrapping it, an idea introduced in Section 3.9.
T he problem can be explained theoretically in either o f two ways, perturbing
the critical value o f t for a fixed nom inal erro r rate a, or adjusting for the bias
in the P-value. We take the second approach, and since we are dealing with
statistical erro r rath er th an sim ulation erro r (Section 2.5), we ignore the latter.
The P-value com puted for the d ata is w ritten po{F), where the function po(')
depends on the m ethod used to obtain Fo from F. W hen the null hypothesis
is true, suppose th a t the p articu lar null distribution Fo obtains. T hen the null
distrib u tio n function for the P-value is
G ( u , F o)
P t { Po ( F ) < u \ F o } ,
(4.33)
which w ith u = a is the true error rate corresponding to nom inal erro r rate a.
N ow (4.33) implies th at
Pr{G(p0(F), F0) < a I F0} = a,
and so G{po(F),Fo) would be the ideal adjusted P-value, having actual error
rate equal to the nom inal erro r rate. N ext notice th a t by substituting Fo for Fo
in (4.33) we can estim ate G{u,Fo) by
Pr*{po(F*) < u | Fo}.
176
4 Tests
(4.34)
po(F")
Fo,
(4.35)
177
which is the P-value o f the exact test. Therefore p a<jj is exactly uniform and the
adjustm ent is perfectly successful.
In the previous example, the same result for pa^ would be achieved if the
b o o tstrap distribution o f T were replaced by a norm al approxim ation. This
m ight suggest th a t b o o tstrap calculation o f p could be replaced by a rough
theoretical approxim ation, thus rem oving one level o f boo tstrap sam pling from
calculation o f padj- U nfortunately this is n o t always true, as is clear from the
fact th a t if an approxim ate null distribution o f T is used which does not
depend upon F at all, then pa<jj is ju st the ordinary bo o tstrap P-value.
In m ost applications it will be necessary to use sim ulation to approxim ate the
adjusted P-value (4.34). Suppose th at we have draw n R resam ples from the null
m odel Fo, w ith corresponding test statistic values r j.......t'R. The rth resam ple
has E D F F* (possibly a vector o f E D Fs), to which we fit the null model
Ko- R esam pling M times from F *0 gives sam ples from which we calculate f " ,
m = 1 ,..., M. T hen the M onte C arlo approxim ation for the adjusted P-value
is
1
+ # { p r* < p }
dj
R +1
(4.36)
where for each r
=
Pr
1 + # K m
M +l
fr )
(4 3 7 )
For r = 1
178
4 Tests
1
2
0
1
0
2
0
1
1
1
2
0
1
2
1
1
2
1
0
1
y l+
t =
1
3
2
0
1
y ,j,
0
0
7
0
0
1
0
3
1
0
Choice o f M
T he general application o f the double b o o tstrap algorithm involves sim ulation
at two levels, w ith a to tal o f R M samples. If we follow the suggestion to use
as m any as 1000 sam ples for calculation o f probabilities, then here we would
need as m any as 106 samples, which seems im practical for o th er th a n simple
problem s. As in Section 3.9, we can determ ine approxim ately w hat a sensible
choice for M would be. The calculation below o f sim ulation m ean squared
erro r suggests th a t M = 99 w ould generally be satisfactory, and M = 249
would be safe. T here are also ways o f reducing considerably the total size o f
the sim ulation, as we shall show in C h ap ter 9.
179
To calculate the sim ulation m ean squared error, we begin w ith equation
(4.37), which we rew rite in the form
I {A} is the indicator
function of the event A.
1 +Em=lJ{C ^ K}
Pr
M+ 1
In order to simplify the calculations, we suppose that, as M >oo, p ' >ur such
th a t the urs are a ran d o m sam ple from the uniform distribution on [0,1]. In
this case there is no need to adjust the b o o tstrap P-value, so padj = PU nder
this assum ption (M + l)p* is alm ost a B inom (M ,ur) random variable, so th a t
equation (4.36) can be approxim ated by
l +
r = l* r
Padj = r + t ~ '
where X r = /{B in o m (M , ur) < ( M + \)p}. We can now calculate the sim ulation
m ean and variance o f
p adj
T .
y=0
( " ; )uJ( l - u ) M^ d u =
R [ ( M + l)p]
(n + i)(Af + i)>
180
4 Tests
181
and for group 2, sam ple d a ta y 2\ , - - - , y 2Nl from F(y 8), i.e. random ly with
replacem ent from
y n + 8, . . . , yi, + 8, y 2\ + 0, . . . , y 22 + 0T hen calculate test statistic t*. W ith R repetitions o f this, the pow er o f the test
at level p is the p ro p o rtio n o f tim es th a t t* > tp, where tp is the critical value
o f the W ilcoxon test for specified N\ and N 2.
In this p articu lar case, the sim ulations show th a t the W ilcoxon test at level
p = 0.01 has pow er 0.26 for 8 = 8 and the observed sam ple sizes. A dditional
4 Tests
182
Data values
Data values
If the proposed test uses the pivot m ethod o f Section 4.4.1, then calculations
o f sample size can be done m ore simply. F or exam ple, for a scalar 9 consider
a two-sided test o f Ho : 9 = 9o w ith level 2a based on the pivot Z . The pow er
function can be w ritten
n(2a, 9) = 1 - Pr I zx>N +
< Z N < z X- ^ N +
VN
- i ,
VN
(4.39)
where the subscript N indicates sam ple size. A rough approxim ation to this
pow er function can be obtained as follows. First sim ulate R sam ples o f size N
from F , an d use these to approxim ate the quantiles za>sr and zi_a>jv. N ext set
v Jl 2 = n^^vh^2/ N 1/2, where v is the variance estim ate calculated from the pilot
data. Finally, approxim ate the probability (4.39) using the same R boo tstrap
samples.
Sequential tests
Sim ilar sorts o f calculations can be done for sequential tests, where one
im p o rtan t criterion is term inal sam ple size. In this context sim ulation can also
be used to assess the likely eventual sam ple size, given d a ta y i , . . . , y at an
interim stage o f a test, w ith a specified protocol for term ination. This can
be done by sim ulating d a ta co n tin u atio n y^+i,y^,+2 , - up to term ination, by
sam pling from fitted m odels or E D F s, as appropriate. F rom repetitions o f this
sim ulation one obtains an approxim ate distribution for term inal sam ple size N.
183
184
4 Tests
4.8 Problems
1
For the dispersion test of Example 4.2, y \ , . . . , y n are hypothetically sampled from
a Poisson distribution. In the Monte Carlo test we simulate samples from the
conditional distribution of Y i,..., Y given Y Yj s<with s = Yl yj- If the exact
multinomial simulation were not available, a Markov chain method could be used.
Construct a Markov chain Monte Carlo algorithm based on one-step transitions
from (mi,...,u) to (t>i,_,u) which involve only adding and subtracting 1 from
two randomly selected us. (Note that zero counts must not be reduced.)
Such an algorithm might be slow. Suggest a faster alternative.
(Section 4.2)
Suppose that X i , . . . , X n are continuous and have the same marginal CDF F,
although they are not independent. Let / be a random integer between 1 and n.
Show that rank(X/) has a uniform distribution on {1,2,...,n}.
Explain how to apply this result to obtain an exact Monte Carlo test using one
realization of a suitable Markov chain.
(Section 4.2.2; Besag and Clifford, 1989)
Suppose that we have a m x m contingency table with entries ytj which are counts.
(a) Consider the null hypothesis of row-column independence. Show that the
sufficient statistic So under this hypothesis is the set of row and column marginal
totals. To assess the significance of the likelihood ratio test statistic conditional
on these totals, a Markov chain Monte Carlo simulation is used. Develop a
Metropolis-type algorithm using one-step transitions which modify the contents of
a randomly selected tetrad yik,yu>yjk>yji> where i ^ j , k ^ I.
(b) Now consider the the null hypothesis of quasi-symmetry, which implies that
in the loglinear model for mean cell counts, log E(Yy) = /i + a, +
+ ytj, the
interaction parameters satisfy yy = y;i- for all /, j. Show that the sufficient statistic
So under this hypothesis is the set of totals yy+yji, i = j, together with the row and
column totals and the diagonal entries. Again a conditional test is to be applied.
Develop a Metropolis-type algorithm for Markov chain Monte Carlo simulation
using one-step transitions which involve pairs of symmetrically placed tetrads.
(Section 4.2.2; Smith et al, 1996)
4.8 Problems
185
(a) Consider the following rule for choosing the number of simulations in a Monte
Carlo test. Choose k, and generate simulations t\,t2,..., t] until the first I for which
k of the t exceed the observed value t; then declare P-value p = (k + I)/(I + 1).
Let the random variables corresponding to I and p be L and P. Show that
Pr{P < (k + 1)/(/ + 1)} = Pr(L > 1 - 1 ) = k / l ,
l = k , k + 1,. .
and deduce that L has infinite mean. Show that P has the distribution of
a t/(0, 1) random variable rounded to the nearest achievable significance level
l , k / ( k + l ) , k / ( k + 2),..., and deduce that the test is exact.
(b) Consider instead stopping immediately if k of the f* exceed t at any I < R, and
anyway stopping when I = R, at which point m values exceed t. Show that this
rule gives achievable significance levels
/ ( * + ! ) /( / + !),
P ~ \( m + l) /( K + l) ,
m = k,
m <k.
1=1
l~\
Mc+l
Suppose that n subjects are allocated randomly to each of two treatments, A and
B. In fact each subject falls in one of two relevant groups, such as gender, and the
treatment allocation frequencies differ between groups. The response y t] for the j l h
Y . ri> - Y
r>
i,j(i,j)=B
i,jM<,j)=A
where
is the residual from regression of the >>s on the group indicators.
(a) Describe how to calculate a permutation P-value for the observed value t using
the method described above Example 4.12.
(b) A different calculation of the P-value is possible which conditions on the
observed covariates, i.e. on the treatment allocation frequencies in the two groups.
The idea is to first eliminate the group effects by reducing the data to differences
djj = yij yij+i, and then to note that the joint probability of these differences
under Ho is constant under permutations of data within groups. That is, the
minimal sufficient statistic So under H0 is the set of differences
Yl(J+l), where
Yni) < % ) < are the ordered values within the ith group. Show carefully how
to calculate the P-value for t conditional on so
le) Apply the unconditional and conditional permutation tests to the following
data:
Group 1
A
B O
(Sections 4.3, 6.3.2; Welch and Fahey, 1994)
Group 2
4
186
1
4 Tests
A randomized matched-pair experiment to compare two treatments produces
paired responses
from which the paired differences dj = yij >i7 are
calculated for j = 1
The null hypothesis Ho o f no treatment difference
implies that the djs are sampled from a distribution that is symmetric with mean
zero, whereas the alternative hypothesis implies a positive mean difference. For
any test statistic t, such as d, the exact randomization P-value Pr(T* > t | H0) is
calculated under the null resampling m odel
d) = Sjdj,
j =
where the Sj are independent and equally likely to be + 1 and 1. W hat would
be the corresponding nonparametric bootstrap sampling m odel Fo? Would the
resulting bootstrap P-value differ much from the randomization P-value?
See Practical 4.4 to apply the randomization and bootstrap tests to the following
data, which are differences o f measurements in eighths o f an inch on cross- and
self-fertilized plants grown in the same pot (taken from R. A. Fishers famous
discussion o f Darwins experiment).
49
-6 7
16
23
28
41
14
29
56
24 7560 -4 8
For the two-sample problem o f Example 4.16, consider fitting the null m odel by
maximum likelihood. Show that the solution probabilities are given by
Pij,
1
.
ni (a + Xy i j) P2]'
1
n2(P - Xy2j)
where a, fi and / are the solutions to the equations Y P i j f l = 1>Y PVfl ~ U and
Y yijPij.o = Y y 2jP2j,o- Under what conditions does this solution not exist, or give
negative probabilities? Compare this null m odel with the one used in Example 4.16.
9
10
Suppose that we wish to test the reduced-rank m odel H0 : g(0) 0, where g(-) is a
Pi-dimensional reduction o f p-dimensional 6. For the studentized pivot method we
take Q = {g(T ) - g(6)}T V ~ l { g ( T ) - g(0)}, with data test value q0 = g(t)r i;g-1g(t),
where vg estimates var[g(T )}. Use the nonparametric delta method to show that
var{g(T )} = g(t)VLg ( t y , where g(0) = 8 g( 6 ) / d d T.
Show how the method can be applied to test equality o f p means given p indepen
dent samples, assuming equal population variances.
(Section 4.4.1)
187
4.9 Practicals
11
In a parametric situation, suppose that an exact test is available with test statistic
U, that S is sufficient under the null hypothesis, but that a parametric bootstrap
test is carried out using T rather than U. Will the adjusted P-value padj always
produce the exact test?
(Section 4.5)
12
In calculating the mean squared error for the simulation approximation to the
adjusted P-value, it might be more reasonable to assume that P-values u, follow
a Beta distribution with parameters a and b which are close to, but not equal to,
one. Show that in this case
where X r = /{B in om (M , ur) < ( M + l)p}. Use this result to investigate numerically
the choice o f M.
(Section 4.5)
13
For the matched-pair experiment o f Problem 4.7, suppose that we choose between
the two test statistics ty = d and t2 = (n 2m)~l J2"Z2+i ^c/)> f r som e m in the
range 2, . . . , [^n], on the basis o f their estimated variances Vi and v2, where
E (d j-h )2
1
v->
n2
=
4.9 Practicals
1
The data in dataframe dogs are from apharmacological experiment. The two
variables are cardiac oxygen consum ption (M VO) and left ventricular pressure
(LVP). D ata for n = 7 dogs are
M VO
LVP
78
32
92
33
116
45
90
30
106
38
7899
24 44
Apply a bootstrap test for the hypothesis o f zero correlation between M VO and
LVP. Use R = 499 simulations.
(Sections 4.3, 4.4)
2
188
4 Tests
(Section 4.3)
3
For a graphical test o f suitability o f the exponential m odel for the data in Table 1.2,
we generate data from the exponential distribution, and plot an envelope.
4.9 Practicals
189
v2 <- ((sum((d-t2)~2)+m*(min(d)-t2)2+m*(max(d)-t2)"2))/(n*(n-2*m))
c(tl, vl, t2, v2) }
darwln.ad <- boot(darwin$y, darwin.f, R=999, sim="parametric",
r a n .gen=darwin.g e n , mle=nrow (darwin))
darwin.ad$tO
i <- c (1:999)[darwin.ad$t[,2]>darwin.ad$t[,4]]
(1+sum(darwin.ad$t [i,3] >darwin.ad$tO [3] )) / (1+length (i))
Is a different result obtained with the adaptive version o f the bootstrap test?
(Sections 4.3, 4.4)
5
h <- 1.5
hist(paulsen$y,probability=T,breaks=c(0:30))
lines(density(paulsen$y,width=4*h,from=0,to=30))
peak.test <- function(y, h)
{dens <- density(y,width=4*h,n=100)
sum(peaks(dens$y[(dens$x>=0) k (dens$x<=20)])) }
peak.test(paulsen$y, h)
Check that h = 1.87 is the smallest value giving just one peak.
For bootstrap analysis,
190
6
4 Tests
For the cd4 data o f Practicals 2.3 and 3.6, test the hypothesis that the distribution
o f C D 4 counts after one year is the same as the baseline distribution. Test
also whether the treatment affects the counts for each individual. Discuss your
conclusions.
5
Confidence Intervals
5.1 Introduction
T he assessm ent o f uncertainty ab o ut param eter values is m ade using confidence
intervals or regions. Section 2.4 gave a brief introduction to the ways in which
resam pling can be applied to the calculation o f confidence limits. In this chapter
we u ndertake a m ore tho ro u g h discussion o f such m ethods, including m ore
sophisticated ideas th a t are potentially m ore accurate th an those m entioned
previously.
Confidence region m ethods all focus on the same target properties. T he first
is th a t a confidence region w ith specified coverage probability y should be a
set Cy(y) o f p aram eter values which depends only upon the d a ta y and which
satisfies
Pr{0 e Cy( F )} = y.
(5.1)
191
192
5 Confidence Intervals
not serious for scalar 9, which is the m ajor focus in this chapter, because in
m ost applications the confidence region will be a single interval.
A confidence interval will be defined by limits 0ai and 9 i_a2, such th a t for
any a
Pr(0 < 0) = a.
The coverage o f the interval [0a,,0 i_ a2] is y = 1 (x\ + a 2), and ai and a 2 are
respectively the left- an d right-tail error probabilities. For som e applications
only one lim it is required, either a low er confidence limit 6a o r an upper
confidence limit 9 i_a, these b o th having coverage 1 a. If a closed interval is
required, then in principle we can choose oti and a2, so long as they sum to the
overall erro r probability 2a. T he sim plest way to do this, which we ad o p t for
general discussion, is to set a.\ = a2 = a. T hen the interval is equi-tailed with
coverage probability 1 2a. In p articu lar applications, however, one m ight
well w ant to choose ai and a 2 to give approxim ately the shortest interval: this
would be analogous to having the likelihood property m entioned earlier.
A single confidence region can n o t give an adequate sum m ary o f the u n
certainty ab o u t 9, so in practice one should give regions for three or four
confidence levels betw een 0.50 and 0.99, say, together with the p o int estim ate
for 9. O ne benefit from this is th a t any asym m etry in the uncertainty ab o u t 6
will be fairly clear.
So far we have assum ed th a t a confidence region can be found to satisfy
(5.1) exactly, b u t this is n o t possible except in a few special param etric models.
The m ethods developed in this chapter are based on approxim ate probability
calculations, an d therefore involve a discrepancy betw een the nom inal or target
coverage, an d the actual coverage probability.
In Section 5.2 we review briefly the standard approximate methods for parametric and nonparametric models, including the basic bootstrap methods already described in Section 2.4. More sophisticated methods, based on what is known as the percentile method, are the subject of Section 5.3. Section 5.4 compares the various methods from a theoretical viewpoint, using asymptotic expansions, and introduces the ABC method as an alternative to simulation methods. The use of significance tests to obtain confidence limits is outlined in Section 5.5. A nested bootstrap algorithm is introduced in Section 5.6. Empirical comparisons between methods are made in Section 5.7. Confidence regions for vector parameters are described in Section 5.8. The possibility of conditional confidence regions is explored in Section 5.9 through discussion of two examples. Prediction intervals are discussed briefly in Section 5.10.

The discussion in this chapter is about how to use the results of bootstrap simulation algorithms to obtain confidence regions, irrespective of what the resampling algorithm is. The presentation supposes for the most part that we
are in the simple situation of Chapter 2, where we have a single, complete homogeneous sample. Most of the methods described can be applied to more complex data structures, provided that appropriate resampling algorithms are used, but for most sorts of highly dependent data the theoretical properties of the methods are largely unknown.
The quantiles a_α of T − θ are defined by

Pr{T − θ ≤ a_α(F)} = α,    (5.2)

and if they were known the equi-tailed confidence limits would be

θ̂_α = t − a_{1−α},  θ̂_{1−α} = t − a_α.    (5.3)

The simplest approximation takes T − θ to be normal with mean zero and variance v, giving

θ̂_α, θ̂_{1−α} = t ∓ v^{1/2} z_{1−α},    (5.4)

[ℓ̈(θ) is d²ℓ(θ)/dθdθᵀ.]
where as usual z_{1−α} = Φ⁻¹(1 − α). If T is a maximum likelihood estimator, then the approximate variance v can be computed directly from the log likelihood function ℓ(θ). If there are no nuisance parameters, then we can use the reciprocal of either the observed Fisher information, v = −1/ℓ̈(θ̂), or the estimated expected Fisher information v = 1/i(θ̂), where i(θ) = E{−ℓ̈(θ)} = var{ℓ̇(θ)}. The former is usually preferable. When there are nuisance parameters, we use the relevant element of the inverse of either −ℓ̈(θ̂) or i(θ̂). More generally, if T is given by an estimating equation, then v can be calculated by the delta method; see Section 2.7.2. Equation (5.4) is the standard form for normal approximation confidence limits, although it is sometimes augmented by a bias correction which is based on the third derivative of the log likelihood function.
With quantiles of T − θ estimated from the ordered bootstrap replicates t*_{(1)} ≤ ··· ≤ t*_{(R)},    (5.5)

the basic bootstrap 1 − α upper confidence limit is

θ̂_{1−α} = 2t − t*_{((R+1)α)},    (5.6)

and its studentized counterpart is

θ̂_{1−α} = t − v^{1/2} z*_{((R+1)α)}.    (5.7)
conventional values of α. But if for some reason (R + 1)α is not an integer, then interpolation can be used. A simple method that works well for approximately normal estimators is linear interpolation on the normal quantile scale. For example, if we are trying to apply (5.6) and the integer part of (R + 1)α is k, then we define

t*_{((R+1)α)} = t*_{(k)} + {Φ⁻¹(α) − Φ⁻¹(k/(R+1))} / {Φ⁻¹((k+1)/(R+1)) − Φ⁻¹(k/(R+1))} × (t*_{(k+1)} − t*_{(k)}),  k = [(R + 1)α].    (5.8)

The same interpolation can be applied to the z*'s. Clearly such interpolations fail if k = 0, R or R + 1.
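The interpolation (5.8) is easy to code; the following sketch is our illustration (quant.interp is not a function from the text):

# Linear interpolation of bootstrap order statistics on the normal
# quantile scale, as in (5.8).
quant.interp <- function(tstar, alpha)
{
  R <- length(tstar)
  tstar <- sort(tstar)
  k <- floor((R + 1)*alpha)
  if (k < 1 || k >= R) stop("interpolation fails: k = 0, R or R+1")
  wt <- (qnorm(alpha) - qnorm(k/(R + 1))) /
        (qnorm((k + 1)/(R + 1)) - qnorm(k/(R + 1)))
  tstar[k] + wt*(tstar[k + 1] - tstar[k])
}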
Parameter transformation
The normal approximation method may fail to work well because it is being applied on the wrong scale, in which case it should help to apply the approximation on an appropriately transformed scale. Skewness in the distribution of T is often associated with var(T) varying with θ. For this reason the accuracy of normal approximation is often improved by transforming the parameter scale to stabilize the variance of the estimator, especially if the transformed scale is the whole real line. The accuracy of the basic bootstrap confidence limits (5.6) will also tend to be improved by use of such a transformation.

Suppose that we make a monotone increasing transformation of the parameter scale from θ to η = h(θ), and then transform t correspondingly to u = h(t). Any confidence limit method can be applied for η, and untransforming the results will give confidence limits for θ. For example, consider applying the normal approximation limits (5.4) for η. By the delta method (Section 2.7.1) the variance approximation v for T transforms to ḣ(t)²v, so that the limits become

θ̂_α, θ̂_{1−α} = h⁻¹{h(t) ∓ |ḣ(t)| v^{1/2} z_{1−α}}.    (5.9)

[ḣ(θ) is dh(θ)/dθ.]

Similarly, applying the basic bootstrap limits (5.6) on the transformed scale gives, for example,

θ̂_{1−α} = h⁻¹{2h(t) − h(t*_{((R+1)α)})}.    (5.10)
which sometimes works is to make normal Q-Q plots of h(t*) for candidate transformations.

It is important to stress that the use of transformation can improve the basic bootstrap method considerably. Nevertheless it may still be beneficial to use the studentized method after transformation. Indeed there is strong empirical evidence that the studentized method is improved by working on a scale with stable approximate variance. The studentized transformed estimator is

Z = {h(T) − h(θ)} / {|ḣ(T)| V^{1/2}}.

Given R values of the bootstrap quantity z* = {h(t*) − h(t)} / {|ḣ(t*)| v*^{1/2}}, the analogue of (5.10) is given by

θ̂_α = h⁻¹{h(t) − |ḣ(t)|v^{1/2} z*_{((R+1)(1−α))}},  θ̂_{1−α} = h⁻¹{h(t) − |ḣ(t)|v^{1/2} z*_{((R+1)α)}}.    (5.11)

Note that if h(·) is given by (2.14) with no constant multiplier and V = v(T), then the denominator of z* and the multiplier |ḣ(t)|v^{1/2} in (5.11) are both unity.
Likelihood ratio methods
When likelihood estimation is used, in principle the normal approximation confidence limits (5.4) are inferior to likelihood ratio limits. Suppose that the scalar θ is the only unknown parameter in the model, and define the log likelihood ratio statistic w(θ) = 2{ℓ(θ̂) − ℓ(θ)}, where ℓ(θ) is the log likelihood function. An approximate 1 − 2α confidence region is

{θ : w(θ) ≤ c_{1,1−2α}},    (5.12)

where c_{1,p} is the p quantile of the χ²₁ distribution. This confidence region need not be a single interval, although usually it will be, and the left- and right-tail errors need not be even approximately equal. Separate lower and upper confidence limits can be defined using the signed root z(θ) = sgn(θ̂ − θ)w(θ)^{1/2}, the limits solving

z(θ̂_α) = z_{1−α}.    (5.13)

[sgn(u) = u/|u| is the sign function.]

When the model includes other unknown parameters λ, also estimated by maximum likelihood, w(θ) is calculated by replacing ℓ(θ) with the profile log likelihood ℓ_prof(θ) = sup_λ ℓ(θ, λ).

These methods are invariant with respect to parameter transformation.
In most applications the accuracy will be very good, provided the model is correct, but it may nevertheless be sensible to consider replacing the theoretical quantiles by bootstrap approximations. Whether or not this is worthwhile can be judged from a chi-squared Q-Q plot of simulated values of

w*(θ̂) = 2{ℓ*(θ̂*) − ℓ*(θ̂)},

or from a normal Q-Q plot of the corresponding values of z*(θ̂).
Example 5.1 (Air-conditioning data)  The data of Example 1.1 were used to illustrate various features of parametric resampling in Chapter 2. Here we look at confidence limit calculations for the underlying mean failure time μ under the exponential model for these data. The example is convenient in that there is an exact solution against which to compare the various approximations.

For the normal approximation method we use an estimate of the exact variance of the estimator T = Ȳ, v = n⁻¹ȳ². The observed value of ȳ is 108.083 and n = 12, so v = (31.20)². Then the 95% confidence interval limits given by (5.4) with α = 0.025 are

108.083 ∓ 31.20 × 1.96 = 46.9 and 169.2.

These contrast sharply with the exact limits 65.9 and 209.2.

Transformation to the variance-stabilizing logarithmic scale does improve the normal approximation. Application of (2.14) with v(μ) = n⁻¹μ² gives h(t) = log(t), if we drop the multiplier n^{1/2}, and the approximate variance transforms to n⁻¹. The 95% confidence interval limits given by (5.9) are

exp{log(108.083) ∓ (12)^{−1/2} × 1.96} = 61.4 and 190.3.

While a considerable improvement, the results are still not very close to the exact solution. A partial explanation for this is that there is a bias in log(T) and the variance approximation is no longer equal to the exact variance. Use of bootstrap estimates for the bias and variance of log(T), with R = 999, gives limits 58.1 and 228.8.
For the basic bootstrap confidence limits we use R = 999 simulations under the fitted exponential model, samples of size n = 12 being generated from the exponential distribution with mean 108.083; see Example 2.6. The relevant ordered values of ȳ* are the (999+1) × 0.025th and (999+1) × 0.975th, i.e. the 25th and 975th, which in our simulation were 53.3 and 176.4. The 95% confidence limits obtained from (5.6) are therefore

2 × 108.083 − 176.4 = 39.8  and  2 × 108.083 − 53.3 = 162.9.

These are no better than the normal approximation limits. However, application of the same method on the logarithmic scale gives much better results: using the same ordered values of ȳ* in (5.10) we obtain the limits

exp{2 log(108.083) − log(176.4)} = 66.2,  exp{2 log(108.083) − log(53.4)} = 218.8.

In fact these are simulation approximations to the exact limits, which are based on the exact gamma distribution of Ȳ/μ. The same results are obtained using the studentized bootstrap limits (5.7) in this case, because z = n^{1/2}(ȳ − μ)/ȳ is a monotone function of log(ȳ) − log(μ) = log(ȳ/μ). Equation (5.11) also gives these results.

Note that if we had used R = 99, then the bootstrap confidence limits would have required interpolation, because (99 + 1) × 0.025 = 2.5, which is not an integer. The application of (5.8) would be

t*_{(2.5)} = t*_{(2)} + {Φ⁻¹(0.025) − Φ⁻¹(0.020)} / {Φ⁻¹(0.030) − Φ⁻¹(0.020)} × (t*_{(3)} − t*_{(2)}).

This involves quite extreme ordered values and so is somewhat unstable.

The likelihood ratio method gives good results here, even using the chi-squared approximation.

Broadly similar comparisons among the methods apply under the more complicated gamma model for these data. As the comparisons made in Example 2.9 would predict, results for the gamma model are similar to those for nonparametric resampling, which are discussed in the next example.
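To see the numbers in Example 5.1 emerge from simulation, here is a sketch of the parametric bootstrap; this is our reconstruction rather than the book's code, the seed is arbitrary, and simulated limits will differ slightly from those quoted.

y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)   # Example 1.1 data
n <- length(y); t <- mean(y); R <- 999
set.seed(1)                                        # arbitrary
tstar <- replicate(R, mean(rexp(n, rate = 1/t)))   # exponential resampling
q <- sort(tstar)[c((R + 1)*0.025, (R + 1)*0.975)]  # 25th and 975th values
c(2*t - q[2], 2*t - q[1])                          # basic limits (5.6)
exp(c(2*log(t) - log(q[2]), 2*log(t) - log(q[1]))) # log-scale limits (5.10)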
For suitably smooth estimators the nonparametric delta method gives the normal approximation limits

t ∓ v_L^{1/2} z_{1−α},    (5.14)

with v_L based on the empirical influence values. Section 2.7 outlines various ways of calculating or approximating the influence values.

If a small nonparametric bootstrap has been run to produce bias and variance estimates b_R and v_R, as described in Section 2.3, then the corresponding approximate 1 − 2α confidence interval is

t − b_R ∓ v_R^{1/2} z_{1−α}.    (5.15)
The studentized bootstrap method can also be applied with v_L in place of v, the limits being

θ̂_α = t − v_L^{1/2} z*_{((R+1)(1−α))},  θ̂_{1−α} = t − v_L^{1/2} z*_{((R+1)α)},    (5.16)

where now z* = (t* − t)/v_L*^{1/2}. Note that the influence values must be recomputed for each bootstrap sample, because in expanded notation l_j = l(y_j; F̂) depends upon the EDF of the sample. Therefore

v_L* = n⁻² Σ_{j=1}^{n} l*_j²,  l*_j = l(y*_j; F̂*).

A simpler approximation uses

l(y*_j; F̂*) ≐ l(y*_j; F̂) − n⁻¹ Σ_{k=1}^{n} l(y*_k; F̂),

but this is unreliable unless t is approximately linear; see Section 2.7.5 and Problem 2.20.

As in the parametric case, one might consider making a bias adjustment in the numerator of z, for example based on the empirical second derivatives of t. However, this rarely seems effective, and in any event an approximate adjustment is implicitly made in the bootstrap distribution of Z*.
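For the sample mean the influence values are simply l_j = y_j − ȳ, so recomputing v_L* within each bootstrap sample is cheap. The following sketch of (5.16) is our illustration (function name and defaults assumed):

# Studentized bootstrap interval (5.16) for the mean, recomputing the
# influence-value variance estimate in every bootstrap sample.
student.ci <- function(y, R = 999, alpha = 0.025)
{
  n <- length(y); t <- mean(y)
  vL <- sum((y - t)^2)/n^2
  zstar <- numeric(R)
  for (r in 1:R) {
    ystar <- sample(y, n, replace = TRUE)
    tstar <- mean(ystar)
    vLstar <- sum((ystar - tstar)^2)/n^2   # influence values recomputed
    zstar[r] <- (tstar - t)/sqrt(vLstar)
  }
  zq <- sort(zstar)[c((R + 1)*alpha, (R + 1)*(1 - alpha))]
  c(t - sqrt(vL)*zq[2], t - sqrt(vL)*zq[1])
}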
Example 5.2 (Air-conditioning data, continued)  For the data of Example 1.1, confidence limits for the mean were calculated under an exponential model in Example 5.1. Here we apply nonparametric methods, simulated datasets being obtained by sampling with replacement from the data.

For the normal approximation, we use the nonparametric delta method
Parameter transformation
For suitably smooth statistics, the consistency of the studentized bootstrap method is essentially guaranteed by the consistency of the variance estimate V. In principle the method is more accurate than the basic bootstrap method,
[Figure 5.1  Air-conditioning data: nonparametric delta method standard errors for t = ȳ (left panel) and for log(t) (right panel) in R = 999 nonparametric bootstrap samples; horizontal axes t* and log t*.]
Figure 3.11 with the studentized bootstrap limits (5.17) leads to the 95% interval [1.23, 2.25]. This is similar to the 95% interval based on the h(t*) − h(t), [1.27, 2.21], while the studentized bootstrap interval on the original scale is [1.12, 1.88]. The effect of the transformation is to make the interval more like those from the percentile methods described in the following section.

To compare the studentized methods, we took 500 samples of size 10 without replacement from the full city population data in Table 1.3. Then for each sample we calculated 90% studentized bootstrap intervals on the original scale, and on the transformed scale with and without using the transformed standard error; this last interval is the basic bootstrap interval on the transformed scale. The coverages were respectively 90.4, 88.2, and 86.4%, to be compared to the ideal 90%. The first two are not significantly different, but the last is rather smaller, suggesting that it can be worthwhile to studentize on the transformed scale, when this is possible. The drawback is that studentized intervals that use the transformed scale tend to be longer than on the original scale, and their lengths are more variable.
5.2.3 Choice of R

What has been said about simulation size in earlier chapters, especially in Section 4.2.5, applies here. In particular, if confidence levels 0.95 and 0.99 are to be used, then it is advisable to have R = 999 or more, if practically feasible. Problem 5.5 outlines some relevant theoretical calculations.
instead of u* − u, to estimate the α and 1 − α quantiles of U. This swap would change the confidence interval limits (5.6) to

u*_{((R+1)α)},  u*_{((R+1)(1−α))},

which back-transform to the bootstrap quantiles of t itself, for example t*_{((R+1)(1−α))}.    (5.18)

(5.19)
In fact this is an improved normal approximation, after applying the (unknown) normalizing transformation which eliminates the leading term in a skewness approximation. The usual factor n⁻¹ has been taken out of the variance by scaling h(·) appropriately, so that both a and w will typically be of order n^{−1/2}. The use of a and w is analogous to the use of Bartlett correction factors in likelihood inference for parametric models.

The essence of the method is to calculate confidence limits for φ and then transform these back to the θ scale using the bootstrap distribution of T. To begin with, suppose that a and w are known, and write

U = φ + (1 + aφ)(Z − w),

where Z has the N(0,1) distribution with α quantile z_α. It follows that

log(1 + aU) = log(1 + aφ) + log{1 + a(Z − w)},

which is monotone increasing in φ. Therefore substitution of z_α for Z and u for U in this equation identifies the α confidence limit for φ, which is

φ̂_α = u + (1 + au)(w + z_α) / {1 − a(w + z_α)}.

The corresponding limit for θ is

θ̂_α = Ĝ⁻¹[Φ{w + (w + z_α)/(1 − a(w + z_α))}],    (5.20)

that is, the α̃ quantile of the bootstrap distribution of T, where

α̃ = Φ(w + (w + z_α)/{1 − a(w + z_α)}).    (5.21)
These limits are usually referred to as BCa confidence limits. Note that they share the transformation invariance property of percentile confidence limits. The use of Ĝ overcomes lack of knowledge of the transformation h. The values of a and w are unknown, of course, but they can be easily estimated. For w we can use the initial normal approximation (5.19) for U to write

Pr*(T* ≤ t | t) = Pr*(U* ≤ u | u) = Pr(U ≤ φ | φ) = Φ(w),

so that

w = Φ⁻¹{Ĝ(t)}.    (5.22)
[Ĝ(t) is estimated by #{t*_r ≤ t}/(R + 1).]
The value of a can be determined informally using (5.19). Thus if ℓ(φ) denotes the log likelihood defined by (5.19), with derivative ℓ̇(φ), then it is easy to show that

E{ℓ̇(φ)³} / var{ℓ̇(φ)}^{3/2} = 6a,

ignoring terms of order n⁻¹. But the ratio on the left of this equation is invariant under parameter transformation. So we transform back from φ to θ and deduce that, still ignoring terms of order n⁻¹,

E{ℓ̇(θ)³} / var{ℓ̇(θ)}^{3/2} = 6a.

To calculate a we approximate the moments of ℓ̇(θ) by those of ℓ̇*(θ̂) under the fitted model with parameter value θ̂, so that the skewness correction factor is

a = (1/6) E*{ℓ̇*(θ̂)³} / var*{ℓ̇*(θ̂)}^{3/2},    (5.23)

where ℓ* is the log likelihood of a set of data simulated from the fitted model. More generally a is one-sixth the standardized skewness of the linear approximation to T.

One potential problem with the BCa method is that if α̃ in (5.21) is much closer to 0 or 1 than α, then (R + 1)α̃ could be less than 1 or greater than R, so that even with interpolation the relevant quantile cannot be calculated. If this happens, and if R cannot be increased, then it would be appropriate to quote the extreme value of t* and the implied value of α. For example, if (R + 1)α̃ > R, then the upper confidence limit t*_{(R)} would be given with implied right-tail error α₂ equal to one minus the solution to α̃ = R/(R + 1).
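A compact nonparametric implementation of these formulae is sketched below; it is our illustration rather than the book's code, and it uses the usual jackknife approximation to the empirical influence values when estimating a.

# Nonparametric BCa limits: w from (5.22), a from the influence values,
# adjusted levels from (5.21). tstar holds R bootstrap replicates of t.
bca.ci <- function(y, tfun, tstar, alpha = 0.025)
{
  n <- length(y); t <- tfun(y); R <- length(tstar)
  w <- qnorm(sum(tstar < t)/(R + 1))                        # (5.22)
  l <- (n - 1)*(t - sapply(1:n, function(j) tfun(y[-j])))   # jackknife values
  a <- sum(l^3)/(6*sum(l^2)^1.5)                            # skewness correction
  za <- qnorm(c(alpha, 1 - alpha))
  atilde <- pnorm(w + (w + za)/(1 - a*(w + za)))            # adjusted levels
  r <- pmin(pmax(round((R + 1)*atilde), 1), R)              # guard the extremes
  sort(tstar)[r]
}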
Example 5.5 (Air-conditioning data, continued)  Returning to the problem of Example 5.4 and the exponential bootstrap results for R = 999, we find that the number of ȳ* values below ȳ = 108.083 is 535, so by (5.22) w = Φ⁻¹(0.535) = 0.0878. The log likelihood function is ℓ(μ) = −n log μ − μ⁻¹ Σ y_j, whose derivative is

ℓ̇(μ) = Σ y_j/μ² − n/μ.

The second and third moments of ℓ̇(μ) are nμ⁻² and 2nμ⁻³, so by (5.23) a = (1/3) n^{−1/2} = 0.0962.
206
5 Confidence Intervals
z = w + za
$ = (w + - p rjj;)
r = (/?-(- 1)5
0.025
0.975
0.050
0.950
- 1 .8 7 2
2.048
-1 .5 5 7
1.733
0.067
0.996
0.103
0.985
67.00
995.83
102.71
984.89
w
65.26
199.41
71.19
182.42
1000
= d>
(V0 0 8 7 8
.
0-0878 + ^
1 -0 .0 9 6 2 (0 .0 8 7 8 + Z !_ a2)
More generally, for a model with parameter vector ψ the same calculation is applied to the least-favourable family, giving

a = (1/6) E*{ℓ̇_LF(0)³} / var*{ℓ̇_LF(0)}^{3/2}.    (5.24)

In this expression the parameter estimates ψ̂ are regarded as fixed, and the moments are calculated under the fitted model.

A somewhat simpler expression for a can be obtained by noting that ℓ̇_LF(0) is proportional to the influence function for t. The result in Problem 2.12 shows that

L_t(y_j; F_ψ) = i¹(ψ)ᵀ ℓ̇(ψ, y_j),

[ℓ̈(ψ) is d²ℓ(ψ)/dψdψᵀ.]
Table 5.2  Calculation of adjusted percentile bootstrap confidence limits for μ with the data of Example 1.1, under the gamma parametric model with μ̂ = 108.0833, κ̂ = 0.7065 and a = 0.1145, w = 0.1372.

α                             0.025    0.975    0.050    0.950
z̃_α = w + z_α               −1.823    2.097   −1.508    1.782
α̃ = Φ(w + z̃_α/(1 − az̃_α))   0.085    0.998    0.125    0.991
r = (R + 1)α̃                 85.20   998.11   125.36   991.25
t*_{(r)}                      62.97   226.00    67.25   208.00
where i¹(ψ) is the first row of the inverse of i(ψ) and ℓ̇(ψ, y_j) is the contribution to ℓ̇(ψ) from the jth case. We can then rewrite (5.24) as

a = (1/6) E*(L*³) / var*(L*)^{3/2},    (5.25)

where

L* = i¹(ψ̂)ᵀ ℓ̇(ψ̂, Y*)

and Y* follows the fitted distribution with parameter value ψ̂. As before, to first order a is one-sixth the estimated standardized skewness of the linear approximation to t. In the form given, (5.25) will apply also to nonhomogeneous data.

The BCa method can be extended to any smooth function of the original model parameters ψ; see Problem 5.7.
Example 5.6 (Air-conditioning data, continued)  We now replace the exponential model used in the previous example, for the data of Example 2.3, with the two-parameter gamma model. The parameters are θ = μ and λ = κ, the first still being the parameter of interest. The log likelihood function is

ℓ(μ, κ) = nκ log(κ/μ) + (κ − 1) Σ log y_j − κ Σ y_j/μ − n log Γ(κ).

The information matrix is diagonal, so that the least-favourable family is the original gamma family with κ fixed at κ̂ = 0.7065. It follows quite easily that

ℓ̇_LF(μ̂) ∝ Σ (y_j − μ̂),

and so a is one-sixth of the skewness of the sample average under the fitted gamma model, that is a = (1/3)(nκ̂)^{−1/2} = 0.1145. The same result is obtained somewhat more easily via (5.25), since we know that the influence function for the mean is L_t(y; F) = y − μ.

The numerical values of a and w for these data are 0.1145 and 0.1372 respectively, the latter from R = 999 simulated samples. Using these we compute the adjusted percentile bootstrap confidence limits as in Table 5.2.
Just how flexible is the BCa method? The following example presents a difficult challenge for all bootstrap methods, and illustrates how well the studentized bootstrap and BCa methods can compensate for weaknesses in the more primitive methods.
Example 5.7 (Normal variance estimation)  Suppose that we have independent samples (y_{i1}, ..., y_{im}), i = 1, ..., k, from normal distributions with different means λ_i but common variance θ, the latter being the parameter of interest. The maximum likelihood estimator of the variance is t = n⁻¹ Σ_{i=1}^{k} Σ_{j=1}^{m} (y_{ij} − ȳ_i)², where n = mk. In practice the more usual estimate would be the pooled mean square, with denominator d = k(m − 1) rather than n, but here we leave the bias of T intact to see how well the bootstrap methods can cope.

[ȳ_i is the average of y_{i1}, ..., y_{im}.]

The distribution of T is n⁻¹θχ²_d. This exact result allows us both to avoid the use of simulation, and to calculate exact coverages for all the confidence limit methods. Denote the α quantile of the χ²_d distribution by c_{d,α}. Using the fact that T* = n⁻¹tχ²_d, we see that the upper α confidence limits for θ under the basic bootstrap and percentile methods are respectively

2t − n⁻¹t c_{d,1−α},    n⁻¹t c_{d,α}.

The coverages of these limits are calculated using the exact distribution of T. For example, for the basic bootstrap confidence limit

Pr(θ ≤ 2T − n⁻¹T c_{d,1−α}) = Pr{χ²_d ≥ n²/(2n − c_{d,1−α})}.
[Table: exact coverages (%) of confidence limits in Example 5.7.]

Nominal       1.0   2.5   5.0   95.0   97.5   99.0
Basic         0.8   2.5   4.8   35.0   36.7   38.3
Studentized   1.0   2.5   5.0   95.0   97.5   99.0
Percentile    0.0   0.0   0.0    1.6    4.4    6.9
BCa           1.0   2.5   5.0   91.5  100.0  100.0
(5.26)

The parameter of interest θ is a monotone function of η, with inverse η(θ), say. The MLE of η is η̂ = η(t) = 0, which corresponds to the EDF F̂ being the nonparametric MLE of the sampling distribution F.

The bias correction factor w is calculated as before from (5.22), but using nonparametric bootstrap simulation to obtain values of t*. The skewness correction a is given by the empirical analogue of (5.23), where now

a = (1/6) Σ l_j³ / (Σ l_j²)^{3/2}.    (5.27)

[η̇(θ) is the first derivative dη(θ)/dθ.]
The corresponding calculations for the nonparametric case (R = 999) are:

α                             0.025    0.975    0.050    0.950
z̃_α = w + z_α              −1.8872   2.0327  −1.5721   1.7176
α̃ = Φ(w + z̃_α/(1 − az̃_α))  0.0629   0.9951   0.0973   0.9830
r = (R + 1)α̃                 62.93   995.12    97.26   983.01
t*_{(r)}                      55.33   243.50    61.50   202.08
The empirical influence values also give

v_L = n⁻² Σ l_j²,  a = (1/6) Σ l_j³ / (Σ l_j²)^{3/2};    (5.29)
see Problem 3.7. This can be helpful in writing an all-purpose algorithm for the BCa method; see also the discussion of the ABC method in the next section. An example is given at the end of the next section.
(5.30)–(5.36)
This will also be second-order accurate if it agrees with (5.35), which requires that to order n^{−1/2},

a = n^{−1/2}(½m₁ − ⅙m₃),  w = −n^{−1/2}(m₁ − ⅙m₃).    (5.37)

(5.38)

The constants are

a = (1/6) n⁻³ v^{−3/2} Σ_i E{L_t³(Y_i)},  b = (1/2) n⁻² Σ_i E{Q_t(Y_i, Y_i)},  c = …,    (5.39)
Then calculations of the first and third moments of T − θ from the quadratic approximation show that

m₁ = n^{1/2} v^{−1/2} b,  m₃ = n^{1/2}(6a + 6c).    (5.40)

The influence values under F̂ satisfy

L_t(Y_i; F̂) = L_t(Y_i) − n⁻¹ Σ_{j=1}^{n} L_t(Y_j) + n⁻¹ Σ_{j=1}^{n} Q_t(Y_i, Y_j).    (5.41)

The results in (5.40) and (5.41) imply the identity for a in (5.37), after noting that the definitions of a in (5.23), (5.25) and (5.27) used in the adjusted percentile method are obtained by substituting estimates for moments of the influence function. The identity for w in (5.37) is confirmed by noting that the original definition w = Φ⁻¹{Ĝ(t)} approximates Φ⁻¹{G(θ)}, which by applying (5.30) with u = 0 agrees with (5.37).
Basic and percentile methods

Similar calculations show that the basic bootstrap and percentile confidence limits are only first-order accurate. However, they are both superior to the normal approximation limits, in the sense that their equi-tailed confidence intervals are second-order accurate. For example, consider the 1 − 2α basic bootstrap confidence interval with limits θ̂_α, θ̂_{1−α}.

(5.42)

Here v has been approximated by v_L in the definition of m₁, and we have used z_{1−α} = −z_α.
The constants a, b and c in (5.42) are defined by (5.39), in which the expectations will be estimated. Special forms of the ABC method correspond to special-case estimates of these expectations. In all cases we take v to be v_L.

Parametric case
If the estimate t is a smooth function of sample moments, as is the case for an exponential family, then the constants in (5.39) are easy to estimate. With a temporary change of notation, suppose that t = t(s), where s = n⁻¹ Σ s(y_j) has p components, and define μ = E(S), so that θ = t(μ). Then

L_t(Y_j) = ṫ(μ)ᵀ{s(Y_j) − μ},    (5.43)

[ṫ(s) = dt(s)/ds and ẗ(s) = d²t(s)/ds dsᵀ.]

Estimates for a, b and c can therefore be calculated using estimates for the first three moments of s(Y).
For the particular case where the distribution of S has the exponential family PDF

f_S(s) = exp{ηᵀs − ξ(η)},

the calculations can be simplified. First, define Σ(η) = var(S) = ξ̈(η). Then

v_L = ṫ(s)ᵀ Σ(η̂) ṫ(s).

[ξ̈(η) is ∂²ξ(η)/∂η∂ηᵀ.]

Substitution from (5.43) in (5.39), and estimation of the expectations, gives estimated constants which can be expressed simply as

a = …,  b = (1/2) tr{ẗ(s) Σ(η̂)},  c = (2 v_L^{1/2})⁻¹ [d²t(s + εk)/dε²]_{ε=0},    (5.44)

where k = Σ(η̂) ṫ(s) / v_L^{1/2}.
The confidence limit (5.42) can also be approximated by an evaluation of the statistic t, analogous to the BCa confidence limit (5.20). This follows by equating (5.42) with the right-hand side of the approximation t(s + v^{1/2}ε) ≈ t(s) + v^{1/2}εᵀṫ(s), with appropriate choice of ε. The result is

θ̃_α = t{ s + v_L^{1/2} z̃_α (1 − a z̃_α)⁻² k },    (5.45)

where

z̃_α = w + z_α,  w = a + c − b v_L^{−1/2}.

In this form the ABC confidence limit is an explicit approximation to the BCa confidence limit.

If the several derivatives in (5.44) are calculated by numerical differencing, then only 4p + 4 evaluations of t are necessary, plus one for every confidence limit calculated in the final step (5.45). Algorithms also exist for exact numerical calculation of derivatives.
Nonparametric case: single sample

If the estimate t is again a smooth function of sample moments, t = t(s), then (5.43) still applies, and substitution of empirical moments leads to

â = (1/6) Σ l_j³ / (Σ l_j²)^{3/2},  …    (5.46)

An alternative, more general formulation is possible in which s is replaced by the multinomial proportions n⁻¹(f₁, ..., f_n) attaching to the data values. Correspondingly μ is replaced by the probability vector p, and with distributions F restricted to the data values, we re-express t(F) as t(p); cf. Section 4.4. Now F̂ is equivalent to p̂ = (n⁻¹, ..., n⁻¹) and t = t(p̂). In this notation the empirical influence values and second derivatives are defined by

l_j = [d t{(1 − ε)p̂ + ε1_j}/dε]_{ε=0},    (5.47)

and

q_{jj} = [d² t{(1 − ε)p̂ + ε1_j}/dε²]_{ε=0},  Q = (I − n⁻¹J) ẗ(p̂) (I − n⁻¹J),    (5.48)
where J = 11ᵀ. For each derivative the first form is convenient for approximation by numerical differencing, while the second form is often easier for theoretical calculation.

Estimates for a and b can be calculated directly as empirical versions of their definitions in (5.39), while for c it is simplest to use the analogue of the representation in (5.44). The resulting estimates are

â = (1/6) Σ l_j³ / (Σ l_j²)^{3/2},  b̂ = (1/(2n²)) Σ_j q_{jj},
ĉ = (2 v_L^{1/2})⁻¹ [d²t(p̂ + εk̂)/dε²]_{ε=0} = (1/2) n⁻⁴ v_L^{−3/2} lᵀ ẗ(p̂) l,    (5.49)

where k̂ = n⁻² v_L^{−1/2} l, and the ABC confidence limit is

θ̃_α = t{ p̂ + v_L^{1/2} z̃_α (1 − a z̃_α)⁻² k̂ }.    (5.50)
If the several derivatives are calculated by numerical differencing, then the number of evaluations of t(p) needed is only 2n + 2, plus one for each confidence limit and the original value t. Note that the probability vector argument in (5.50) is not constrained to be proper, or even positive, so that it is possible for ABC confidence limits to be undefined.
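The following sketch carries out the 2n + 2 numerical differences and the final evaluations; it is our illustration (names t.p, eps and abc.ci are assumptions, and the formulas follow the reconstruction of (5.47)–(5.50) above), not the book's code.

# Nonparametric ABC by numerical differencing of t(p); t.p(y, p) must
# accept a probability vector p, which need not be proper at the end.
abc.ci <- function(y, t.p, alpha = 0.025, eps = 0.001)
{
  n <- length(y); p0 <- rep(1/n, n)
  t0 <- t.p(y, p0)
  l <- q <- numeric(n)
  for (j in 1:n) {                      # derivatives (5.47), (5.48)
    ej <- as.numeric(1:n == j)
    tp <- t.p(y, (1 - eps)*p0 + eps*ej)
    tm <- t.p(y, (1 + eps)*p0 - eps*ej)
    l[j] <- (tp - tm)/(2*eps)
    q[j] <- (tp - 2*t0 + tm)/eps^2
  }
  vL <- sum(l^2)/n^2
  a <- sum(l^3)/(6*sum(l^2)^1.5)
  b <- sum(q)/(2*n^2)
  k <- l/(n^2*sqrt(vL))
  cc <- (t.p(y, p0 + eps*k) - 2*t0 + t.p(y, p0 - eps*k))/(2*eps^2*sqrt(vL))
  w <- a + cc - b/sqrt(vL)
  za <- w + qnorm(c(alpha, 1 - alpha))
  lam <- sqrt(vL)*za/(1 - a*za)^2
  sapply(lam, function(e) t.p(y, p0 + e*k))   # endpoints (5.50)
}
# Example for the mean, t(p) = sum(p*y):
# abc.ci(y, function(y, p) sum(p*y))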
Example 5.9 (Air-conditioning data, continued)  The adjusted percentile method was applied to the air-conditioning data in Example 5.6 under the gamma model and in Example 5.8 under the nonparametric model. Here we examine how well the ABC method approximates the adjusted percentile confidence limits. For the mean parameter, calculations are simple under all models. For example, in the gamma case the exponential family is two-dimensional with

s = (ȳ, log y‾)ᵀ,  η₁ = −κ/μ,  η₂ = κ.

[1 is a vector of ones.]
Nominal confidence 1 − 2α      0.99           0.95           0.90

Gamma model
  BCa                      51.5, 241.6    63.0, 226.0    67.2, 208.0
  ABC                      52.5, 316.6    61.4, 240.5    66.9, 210.5

Nonparametric model
  BCa                      44.6, 268.8    55.3, 243.5    61.5, 202.1
  ABC                      46.6, 287.0    57.2, 226.7    63.6, 201.5
A B C lim its are shown in Table 5.5. The A B C m ethod appears to give rea
sonable approxim ations, except for the 99% interval under the gam m a model.
(7T11, . . . ,
7T21,. . . ,
(5.51)
E ;= i n n
First aircraft:   3   5   7  18  43  85  91  98  100  130  230  487
Second aircraft:  3   5   5  13  14  15  22  22   23   30   36   39
                 44  46  50  72  79  88  97 102  139  188  197  210
The sample means are ȳ₁ = 108.083 and ȳ₂ = 64.125, so the estimate for θ is t = ȳ₂/ȳ₁ = 0.593.

The empirical influence values are (Problem 3.5)

l_{1j} = −t (y_{1j} − ȳ₁)/ȳ₁,
219
t*
CM
is-
/
y
O
/
CVJ
/
-1.5
-0.5
0.5 1.0
-2
iog(t*)
and

l_{2j} = n₁(y_{2j} − ȳ₂)/(n₂ ȳ₁).

This leads to formulae in agreement with (5.29), which gives the values of a and v_L already calculated. It remains to calculate b and c.
For b, application of (5.48) gives the second derivatives q_{1,jj} and q_{2,jj} for the two samples, and (5.49) then yields b = 0.0720. (The bootstrap estimates of b and v are respectively 0.104 and 0.1125.) Finally, for c we apply the second form in (5.49) to u(π), that is

ĉ = (1/2) n⁻⁴ v_L^{−3/2} lᵀ ü(π̂) l,

and calculate c = 0.3032. The implied value of w is 0.0583, quite different from the bootstrap value 0.0954. The ABC formula (5.50) is now applied to u(π̂) with k̂ = n⁻² v_L^{−1/2} l. The resulting 95% confidence interval is [0.250, 1.283], which is fairly close to the BCa interval.
It seems possible th a t the ap proxim ation theory does not w ork well here,
which would explain the larger-than-usual differences betw een B C a, A B C and
studentized b o o tstrap confidence lim its; see Section 5.7.
O ne practical poin t is th a t the theoretical calculation o f derivatives is quite
tim e-consum ing, com pared to application o f num erical differencing in (5.47)(5.49).
this is the score test statistic in the Cox proportional hazards model. Large values of t(θ₀) are evidence that θ > θ₀. There are several possible resampling schemes, including those described in …, for the null hypothesis of hazard ratio θ₀. Here we use one which holds fixed the survival …; the simulated values y*₁, ..., y*ₙ are such that
[Figure 5.3: estimated P-values plotted against log(theta), with spline fit.]

r_{1j} = max{0, m − Σ_{k=1}^{j−1}(1 − y*_k) − c_{1j}},  r_{2j} = max{0, r_{21} − Σ_{k=1}^{j−1} y*_k − c_{2j}};
then on the logit scale we fitted a spline curve (in log θ), and interpolated the solutions to p(θ₀) = α, 1 − α to determine the endpoints of the (1 − 2α) confidence interval for θ. Figure 5.3 illustrates this procedure for α = 0.05, which gives the 90% confidence interval [1.07, 6.16]; the 95% interval is [0.86, 7.71] and the point estimate is 2.52. Thus there is mild evidence that θ > 1.

A more efficient approach would be to use R = 99 for the initial grid to determine rough values of the confidence limits, near which further simulation with R = 999 would provide accurate interpolation of the confidence limits. Yet more efficient algorithms are possible.
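The interpolation step is easily programmed; the sketch below is our illustration of inverting bootstrap P-values by spline smoothing on the logit scale (p.fun and theta.grid are assumed inputs, with p.fun(theta0) returning a resampling P-value).

# Confidence limits by inverting bootstrap significance tests:
# solve p(theta0) = 1 - alpha (lower) and alpha (upper) on a grid.
invert.ci <- function(p.fun, theta.grid, alpha = 0.05)
{
  p <- sapply(theta.grid, p.fun)
  p <- pmin(pmax(p, 0.001), 0.999)          # keep logits finite
  fit <- smooth.spline(log(theta.grid), log(p/(1 - p)))
  grid <- seq(min(log(theta.grid)), max(log(theta.grid)), length = 200)
  lp <- predict(fit, grid)$y
  lower <- exp(grid[which.min(abs(lp - log((1 - alpha)/alpha)))])
  upper <- exp(grid[which.min(abs(lp - log(alpha/(1 - alpha))))])
  c(lower, upper)
}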
In a more systematic development of the method, we must allow for a nuisance parameter λ, say, which also governs the data distribution but is not constrained by H₀. Then both R_α(θ) and C_{1−α}(Y₁, ..., Yₙ) must depend upon λ to make the inversion method work exactly. Under the bootstrap approach λ is replaced by an estimate.
see Problem 5.12. A more ambitious application is bootstrap adjustment of the basic bootstrap confidence limit, which we develop here.

First we recall the full notations for the quantities involved in the basic bootstrap confidence interval method. The ideal upper 1 − α confidence limit is t(F̂) − a_α(F), where

Pr{T − θ ≤ a_α(F) | F} = Pr{t(F̂) − t(F) ≤ a_α(F) | F} = α.

What is calculated, ignoring simulation error, is the confidence limit t(F̂) − a_α(F̂). The bias in the method arises from the fact that a_α(F̂) ≠ a_α(F) in general, so that

Pr{t(F) ≤ t(F̂) − a_α(F̂) | F} ≠ 1 − α.    (5.52)

We could try to eliminate the bias by adding a correction to a_α(F̂), but a more successful approach is to adjust the subscript α. That is, we replace a_α(F̂) by a_{q(α)}(F̂) and estimate what the adjusted value q(α) should be. This is in the same spirit as the BCa method.

Ideally we want q(α) to satisfy

Pr{t(F) ≤ t(F̂) − a_{q(α)}(F̂) | F} = 1 − α.    (5.53)

The solution q(α) will depend upon F, i.e. q(α) = q(α, F). Because F is unknown, we estimate q(α) by q̂(α) = q(α, F̂). This means that we obtain q̂(α) by solving the bootstrap version of (5.53), namely

Pr*{t(F̂) ≤ t(F̂*) − a_{q(α)}(F̂*) | F̂} = 1 − α.    (5.54)

This looks intimidating, but from the definition of a_α(F) we see that (5.54) can be rewritten as

Pr*{Pr**(T** ≤ 2T* − t | F̂*) ≥ q(α) | F̂} = 1 − α.    (5.55)
The same method of adjustment can be applied to any bootstrap confidence limit method, including the percentile method (Problem 5.13) and the studentized bootstrap method (Problem 5.14).

To verify that the nested bootstrap reduces the order of coverage error made by the original bootstrap confidence limit, we can apply the general discussion of Section 3.9.1. In general we find that coverage 1 − α + O(n⁻ᵃ) is corrected to 1 − α + O(n^{−a−1/2}) for one-sided confidence limits, whether a = ½ or 1. However, for equi-tailed confidence intervals coverage 1 − 2α + O(n⁻¹) is corrected to 1 − 2α + O(n⁻²); see Problem 5.15.

Before discussing how to solve equation (5.55) using simulated samples, we look at a simple illustrative example where the solution can be found theoretically.
Example 5.12 (Exponential mean)  Consider the parametric problem of exponential data with unknown mean μ. The data estimate for μ is t = ȳ, and F̂ is the fitted exponential distribution with mean ȳ, under which the probability in (5.55) can be evaluated exactly:

… = Pr{χ²₂ₙ ≥ 2n / (2 − c_{2n,q}/(2n))},    (5.56)

with q = q(α). Setting the probability on the right-hand side of (5.56) equal to 1 − α, we deduce that

2n / {2 − c_{2n,q̂(α)}/(2n)} = c_{2n,α}.

Using q̂(α) in place of α in the basic bootstrap confidence limit gives the adjusted upper 1 − α confidence limit 2nȳ/c_{2n,α}, which has exact coverage 1 − α. So in this case the double bootstrap adjustment is perfect.
Figure 5.4 shows the actual coverages of nominal 1 − α bootstrap upper confidence limits when n = 10. There are quite large discrepancies for both basic and percentile methods, which are completely removed using the double bootstrap adjustment; see Problem 5.13.
In practice (5.55) is solved by simulation, with M second-level samples drawn from each of the R first-level samples. Define

u_{Mr} = M⁻¹ Σ_{m=1}^{M} I{t**_{rm} ≤ 2t*_r − t},  r = 1, ..., R.
[Figure 5.4: actual coverage against nominal coverage (0.0–1.0).]
Then q̂(α) solves

R⁻¹ Σ_{r=1}^{R} I{u_{Mr} ≤ q̂(α)} = α,

which is to say that q̂(α) is the α quantile of the u_{Mr}. The simplest way to obtain q̂(α) is to order the values u_{Mr} into u_{M(1)} ≤ ··· ≤ u_{M(R)} and then set q̂(α) = u_{M((R+1)α)}. What this amounts to is that the (R + 1)αth ordered value is read off from a Q-Q plot of the u_{Mr} against quantiles of the U(0,1) distribution, and that ordered value is then used to give the required quantile of the t* − t. We illustrate this in the next example.
The total number of samples involved in this calculation is RM. Since we always think of simulating as many as 1000 samples to approximate probabilities, here this would suggest as many as 10⁶ samples overall. The calculations of Section 4.5 would suggest something a bit smaller, say M = 249 to be safe, but this is still rather impractical. However, there are ways of greatly reducing the overall number of simulations, two of which are described in Chapter 9.
Example 5.13 (Kernel density estimate)  Bootstrap confidence intervals for the value of a density raise some awkward issues, which we now discuss, before outlining the use of the nested bootstrap in this context.

The standard kernel estimate of the PDF f(y), given a random sample y₁, ..., yₙ, is

f̂(y; h) = (nh)⁻¹ Σ_{j=1}^{n} w{h⁻¹(y − y_j)},

where w(·) is a symmetric density with mean zero and unit variance, and h is the bandwidth. One source of difficulty is that if we consider the estimator to be t(F̂), as we usually do, then t(F) = h⁻¹ ∫ w{h⁻¹(y − x)} f(x) dx is being estimated, not f(y). The mean and variance of f̂(y; h) are approximately

f(y) + ½h²f″(y),  (nh)⁻¹ f(y) ∫ w²(u) du.    (5.57)
The corresponding approximate mean of the square-root transformed estimate is

{f(y)}^{1/2} + ¼{f(y)}^{−1/2}{h²f″(y) − ½(nh)⁻¹K},    (5.58)

where K = ∫w²(u)du. Under resampling from F̂ the bootstrap version f̂*(y; h) has mean exactly equal to f̂(y; h); the approximate variance is the same as in (5.57) except that f̂(y; h) replaces f(y). It follows that T* = {f̂*(y; h)}^{1/2} has approximate mean and variance

{f̂(y; h)}^{1/2} − ⅛{f̂(y; h)}^{−1/2}(nh)⁻¹K,  ¼(nh)⁻¹K.

The corresponding studentized quantities are

z = [{f̂(y; h)}^{1/2} − {f(y)}^{1/2}] / {½(nh)^{−1/2}K^{1/2}},  z* = [{f̂*(y; h)}^{1/2} − {f̂(y; h)}^{1/2}] / {½(nh)^{−1/2}K^{1/2}},

and when h ∝ n^{−1/5} these behave as

Z = c + ε,  Z* = ε*,    (5.59)
[Figure 5.5: box plots of z and z* for several sample sizes.]
where both ε and ε* are N(0,1). This means that quantiles of Z cannot be well approximated by quantiles of Z*, no matter how large n is. The same thing happens for the untransformed density estimate.

There are several ways in which we can try to overcome this problem. One of the simplest is to change h to be of order n^{−1/3}, when calculations similar to those above show that Z = ε and Z* = ε*. Figure 5.5 illustrates the effect. Here we estimate the density at y = 0 for samples from the N(0,1) distribution, with w(·) the standard normal density. The first two panels show box plots of 500 values of z and z* when h = n^{−1/5}, which is near-optimal for estimation in this case, for several values of n; the values of z* are obtained by resampling from one dataset. The last two panels correspond to h = n^{−1/3}. The figure confirms the key points of the theory sketched above: that Z is biased away from zero when h = n^{−1/5}, but not when h = n^{−1/3}; and that the distributions of Z and Z* are quite stable and similar when h = n^{−1/3}.

Under resampling from F̂, the studentized bootstrap applied to {f̂(y; h)}^{1/2} should be consistent if h ∝ n^{−1/3}. From a practical point of view this means considerable undersmoothing in the density estimate, relative to standard practice for estimation. A bias in Z of order n^{−1/3} or worse will remain, and this suggests a possibly useful role for the double bootstrap.
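The sketch below (ours, with illustrative names) applies the studentized bootstrap to {f̂(0; h)}^{1/2} with the undersmoothed bandwidth h = n^{−1/3}, using the variance approximation ¼(nh)⁻¹K with K = 1/(2√π) for the normal kernel.

# Studentized bootstrap interval for a density value on the root scale.
dens.student <- function(y, y0 = 0, R = 999, alpha = 0.025)
{
  n <- length(y); h <- n^(-1/3)          # undersmoothed bandwidth
  K <- 1/(2*sqrt(pi))                    # integral of the squared kernel
  fhat <- function(x) mean(dnorm((y0 - x)/h))/h
  t <- sqrt(fhat(y))
  s <- 0.5*sqrt(K/(n*h))                 # approximate standard error
  zstar <- replicate(R, (sqrt(fhat(sample(y, n, replace = TRUE))) - t)/s)
  zq <- sort(zstar)[c((R + 1)*alpha, (R + 1)*(1 - alpha))]
  (c(t - s*zq[2], t - s*zq[1]))^2        # back to the density scale
}                                        # (assumes positive endpoints)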
For a numerical example of nested bootstrapping in this context we revisit Example 4.18, where we discussed the use of a kernel density estimate in estimating species abundance. The estimated PDF is

f̂(y; h) = …,

where φ(·) is the standard normal density, and the value of interest is f̂(0; h), which is used to estimate f(0). In light of the previous discussion, we base
[Figure 5.6: empirical coverage against nominal coverage; bootstrap values t* − t.]
Method:   Basic   Basic†   Student   Student†   Percentile   BCa     Double
Upper:    0.204   0.240    0.273     0.266      0.218        0.240   0.301
Lower:    0.036   0.060    0.055     0.058      0.048        0.058   0.058
In Example 9.14 we describe how saddlepoint methods can greatly reduce the time taken to perform the double bootstrap in this problem. It might be possible to avoid the difficulties caused by the bias of the kernel estimate by using a clever resampling scheme, but it would be more complicated than the direct approach described above.
Methods compared: Exact; Normal approximation; Basic; Basic, log scale; Studentized; Studentized, log scale; Bootstrap percentile; BCa; ABC — lower and upper limits at several nominal error rates.

[Table: empirical error rates (%) for each method at nominal error rates between 0.1 and 10, for samples of sizes n₁ = n₂ = 10 and 25.]
divided by 100. The normal approximation method uses the delta method variance approximation. The results suggest that the studentized method gives the best results, provided the log scale is used. Otherwise, the studentized method and the percentile, BCa and ABC methods are comparable, but only really satisfactory at the larger sample sizes.

Figure 5.7 shows box plots of the lengths of 1000 confidence intervals for both sample sizes. The most pronounced feature for n₁ = n₂ = 10 is the long, sometimes very long, lengths for the two studentized methods, which helps to account for their good error rates. This feature is far less prominent at the larger sample sizes. It is noticeable that the normal, percentile, BCa and ABC intervals are short compared to the exact ones, and that taking logs improves the basic intervals. Similar comments apply when n₁ = n₂ = 25, but with less force.
[Figure 5.7: box plots of confidence interval lengths for n1 = n2 = 10 (log scale) and n1 = n2 = 25.]
A confidence region for a vector parameter θ should satisfy

Pr(θ ∈ C_{1−α}) = 1 − α.    (5.60)

(5.61)

One resampling approach is the vector analogue of the studentized bootstrap: with quadratic form q = (t − θ)ᵀ v⁻¹ (t − θ), the region is

{θ : (t − θ)ᵀ v⁻¹ (t − θ) ≤ q*_{((R+1)(1−α))}}.    (5.62)
As in the scalar case, a common and useful choice for v is the delta method variance estimate v_L.

The same method can be applied on any scales which are monotone transformations of the original parameter scales. For example, if h(θ) has ith component h_i(θ_i), say, and if d is the diagonal matrix with elements dh_i/dθ_j evaluated at θ = t, then we can apply (5.62) with the revised definition

q = {h(t) − h(θ)}ᵀ (dᵀvd)⁻¹ {h(t) − h(θ)}.

If corresponding ordered bootstrap values are again denoted by q*_{(r)}, then the bootstrap confidence region will be

{θ : {h(t) − h(θ)}ᵀ (dᵀvd)⁻¹ {h(t) − h(θ)} ≤ q*_{((R+1)(1−α))}}.    (5.63)

A particular choice for h(·) would often be based on diagnostic plots of components of t* and v*, the objectives being to attain approximate normality and approximately stable variance for each component.

This method will be subject to the same potential defects as the studentized bootstrap method of Section 5.2. There is no vector analogue of the adjusted percentile methods, but the nested bootstrap method can be applied.
Example 5.14 (Air-conditioning data)  For the air-conditioning data of Example 1.1, consider setting a confidence region for the two parameters θ = (μ, κ) in a gamma model. The log likelihood function is

ℓ(μ, κ) = n{κ log(κ/μ) − log Γ(κ) + (κ − 1) log y‾ − κȳ/μ},

[ȳ and log y‾ are the averages of the data and the log data.]

from which we calculate the maximum likelihood estimators T = (μ̂, κ̂). The numerical values are μ̂ = 108.083 and κ̂ = 0.7065. A straightforward calculation shows that the delta method variance approximation, equal to the inverse of the expected information matrix as in Section 5.2, is

v_L = n⁻¹ diag[ κ̂⁻¹μ̂², {d² log Γ(κ̂)/dκ² − κ̂⁻¹}⁻¹ ].    (5.64)
The standard likelihood ratio 1 − α confidence region is the set of values of (μ, κ) for which

2{ℓ(μ̂, κ̂) − ℓ(μ, κ)} ≤ c_{2,1−α},

where c_{2,1−α} is the 1 − α quantile of the χ²₂ distribution. The top left panel of Figure 5.8 shows the 0.50, 0.95 and 0.99 confidence regions obtained in this way. The top right panel is the same, except that c_{2,1−α} is replaced by a bootstrap estimate obtained from R = 999 samples simulated from the fitted gamma model. This second region is somewhat larger than, but of course has the same shape as, the first.
From the bootstrap simulation we have estimators t* = (μ̂*, κ̂*) from each sample, from which we calculate the corresponding variance approximations v_L* using (5.64), and hence the quadratic forms q* = (t* − t)ᵀ v_L*⁻¹ (t* − t). We then apply (5.62) to obtain the studentized bootstrap confidence regions shown in the bottom left panel of Figure 5.8. This is clearly nothing like the likelihood-based confidence regions above, partly because it fails completely to take account of the mild skewness in the distribution of μ̂* and the heavy skewness in the distribution of κ̂*. These features are clear in the histogram plots of Figure 5.9.

Logarithmic transformation of both μ and κ improves matters considerably: the bottom right panel of Figure 5.8 comes from applying the studentized bootstrap method after dual logarithmic transformation. Nevertheless, the solution is not completely satisfactory, in that the region is too wide on the κ axis and slightly narrow on the μ axis. This could be predicted to some extent by plotting v_L* versus t*, which shows that the log transformation of κ is not quite strong enough. Perhaps more important is that there is a substantial bias in κ̂: the bootstrap bias estimate is 0.18.

One lesson from this example is that where a likelihood is available and usable, it should be used with parametric simulation to check on, and if necessary replace, standard approximations for quantiles of the log likelihood ratio statistic.
[Figure 5.8: confidence regions for (mu, kappa) — likelihood ratio, bootstrap-calibrated likelihood ratio, studentized bootstrap, and studentized bootstrap after log transformation.]
[Figure 5.9: histograms of the bootstrap estimates mu* and kappa*.]
Lat     Long    Lat     Long    Lat     Long    Lat     Long
−26.4   324.0   −52.1    83.2   −80.5   108.4   −74.3    90.2
−32.2   163.7   −77.3   182.1   −77.7   266.0   −81.0   170.9
−73.1    51.9   −68.8   110.4    −6.9    19.1   −12.7   199.4
−80.2   140.5   −68.4   142.2   −59.4   281.7   −75.4   118.6
−71.1   267.2   −29.2   246.3    −5.6   107.4   −85.9    63.7
−58.7    32.0   −78.5   222.6   −62.6   105.3   −84.8    74.9
−40.8    28.1   −65.4   247.7   −74.7   120.2    −7.4    93.8
−14.9   266.3   −49.0    65.6   −65.3   286.6   −29.8    72.8
−66.1   144.3   −67.0   282.6   −71.6   106.4   −85.2   113.2
 −1.8   256.2   −56.7    56.2   −23.3    96.5   −53.1    51.5
−38.3   146.8   −72.7   103.1   −60.2    33.2   −63.4   154.8
−17.2    89.9   −81.6   295.6   −40.4    41.0
−56.2    35.6   −75.1    70.7   −53.6    59.1
In order to set a confidence region for the mean polar axis, or equivalently (θ, φ), we let b(θ, φ) = (sin θ cos φ, sin θ sin φ, cos θ)ᵀ, …, denote the unit vectors orthogonal to a(θ, φ). The sample values of these vectors are â, b̂ and ĉ, and the sample eigenvalues are λ̂₁ ≤ λ̂₂ ≤ λ̂₃. Let A denote the 2 × 3 matrix (b̂, ĉ)ᵀ and B the 2 × 2 matrix with (j, k)th element

… n⁻¹ Σ (b̂ᵀy_j)(ĉᵀy_j)(âᵀy_j)² ….

(5.65)
[Figures 5.11–5.13: bootstrap variance estimates (0.0008–0.0014) plotted against t* (80–160), and unconditional and conditional quantiles plotted against the ancillary d* (80–130).]

Table 5.10 (unconditional quantile levels p and estimated conditional probabilities):

p                     0.010   0.025   0.050   0.100   0.900   0.950   0.975   0.990
Pr(T ≤ a_p | D = d)   0.006   0.020   0.044   0.078   0.940   0.974   0.988   1.000
use these in (5.3). The bootstrap estimate of a_p(d) is the value â_p(d) defined by

Pr*{T* − t ≤ â_p(d) | D* = d} = p,

and the simplest way to use our simulated samples to approximate this is to use only those samples for which d* is near d. For example, we could take the R_d = 99 samples whose d* values are closest to d and approximate â_p(d) by the 100pth ordered value of t* in those samples.

Certainly stratification of the simulation results by intervals of d* values shows quite strong conditional effects, as evidenced in Figure 5.12. The difficulty is that R_d = 99 samples is not enough to obtain good estimates of conditional quantiles, and certainly not to distinguish between unconditional quantiles and the conditional quantiles given d* = d, which is near the mean. Only with an increase of R to 9999, and using strata of R_d = 499 samples, does a clear picture emerge. Figure 5.13 shows plots of conditional quantile estimates from this larger simulation.
How different are the conditional and unconditional distributions? Table 5.10 shows bootstrap estimates of the cumulative conditional probabilities Pr(T ≤ a_p | D = d), where a_p is the unconditional p quantile, for several values of p. Each estimate is the proportion of times in R_d = 499 samples that t* is less than or equal to the unconditional quantile estimate t*_{(10000p)}. The comparison suggests that conditioning does not have a large effect in this case.
A more efficient use of bootstrap samples, which takes advantage of the smoothness of quantiles as a function of d, is to estimate quantiles for interval strata of R_d samples and then for each level p to fit a smooth curve. For example, if the kth such stratum gives quantile estimates â_{p,k} and average value d̄_k for d*, then we can fit a smoothing spline to the points (d̄_k, â_{p,k}) for each p and interpolate the required value â_p(d) at the observed d. Figure 5.14 illustrates this for R = 9999 and non-overlapping strata of size R_d = 199, with p = 0.025 and 0.975. Note that interpolation is only needed at the centre of the curve. Use of non-overlapping intervals seems to give the best results.
241
o
o
o
o
CM
<D
o
o
o
_2
o
>
' l\
.
/ h m . i r
': j*
* j!
o
00
o
o
T
:
:.k
* !
:;* ;
M M
ii
m #
* ?!
iii . . * M i
1/1
* ii \Mi ; i\i * * \i. **
r ; ' U
i\.
* !:*
*
.
*
CO
i
1880
1900
1920
1940
1960
Year
S(θ) = Σ_{j=1}^{n} {…}.

Standard normal-theory likelihood analysis suggests that differences in S(θ) for θ near θ̂ are ancillary statistics. We shall reduce these differences to two particular statistics which measure skewness and curvature of S(·) near θ̂,
Table 5.11 (condensed): percentage conditional coverages classified by the statistics b* and c*.

b* \ c*     …    −0.62  −0.37  −0.17   0.17   0.37   0.62   0.87
1.64        59     62     52     88     53     81     71     83
2.44        68     79     62     82     50     68     53     81
4.62        92     91     92     97     94     93     84     91
4.87        96     96    100    100     93     91    100     89
5.12       100    100     93     95     95     98    100    100
5.49        95     89     86     96     97    100     97     92
6.06        97     95     96    100     87     92    100     97
6.94        95    100     93     95     97     96     95    100
2.45         —     50     76     76     81     85     86    100
namely

B = S(θ̂ + δ) − S(θ̂ − δ),  …
is smooth in b*, c*. We fitted a logistic regression to the proportions in the 201 non-empty cells of the complete version of Table 5.11, the result being

logit p̂(b*, c*) = 0.51 − 0.20 b*² + 0.68 c*.

The residual deviance is 223 on 198 degrees of freedom, which indicates an adequate fit for this simple model. The conditional bootstrap confidence is the fitted value of p̂ at b* = b, c* = c, which is 0.972 with standard error 0.009. So the conditional confidence attached to θ̂ = 28 ± 1 is much higher than the unconditional value.

The value of the standard error for the fitted value corresponds to a binomial standard error for a sample of size 3500, or 35% of the whole bootstrap simulation, which indicates high efficiency for this method of estimating conditional probability.
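A fit of this kind can be obtained with a binomial GLM; the sketch below is our illustration, with cells an assumed data frame of per-cell counts (columns b, c, covered, total) and b.obs, c.obs the observed values.

# Logistic-regression smoothing of conditional coverage proportions.
fit <- glm(cbind(covered, total - covered) ~ I(b^2) + c,
           family = binomial, data = cells)
# Conditional confidence: fitted value at the observed (b, c).
predict(fit, data.frame(b = b.obs, c = c.obs), type = "response")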
5.10 Prediction

Closely related to confidence regions for parameters are confidence regions for future outcomes of the response Y, more usually called prediction regions. Applications are typically in more complicated contexts involving regression models (Chapters 6 and 7) and time series models (Chapter 8), so here we give only a brief discussion of the main ideas.

In the simplest situation we are concerned with prediction of one future response Y_{n+1} given observations y₁, ..., yₙ from a distribution F. The ideal upper γ prediction limit is the γ quantile of F, which we denote by a_γ(F). The simplest approach to calculating a prediction limit is the plug-in approach, that is substituting the estimate F̂ for F to give â_γ = a_γ(F̂). But this is clearly biased in the optimistic direction, because it does not allow for the uncertainty in F̂. Resampling is used to correct for, or remove, this bias.
Parametric case

Suppose first that we have a fully parametric model, F = F_θ, say. Then the prediction limit a_γ(F) can be expressed more directly as a_γ(θ). The true coverage of this limit over repetitions of both data and predictand will not generally be γ, but rather

Pr{Y_{n+1} ≤ a_γ(θ̂) | θ} = h(γ),    (5.66)

(5.67)
where Z_{n−1} has the Student-t distribution with n − 1 degrees of freedom. This leads directly to the Student-t prediction limit.

The preceding example suggests a more direct method for special cases involving means, which makes use of a point prediction ŷ_{n+1} and the distribution of prediction error Y_{n+1} − ŷ_{n+1}: resampling can be used to estimate this distribution directly. This method will be applied to linear regression models in Section 6.3.3.
[Figure 5.16: adjustment function h(γ) for prediction with sample size n = 10 from N(μ, σ²), with quadratic logistic fit (solid) and the line h(γ) = γ (dots); horizontal axis the logit of γ.]
Nonparametric case

Now consider the nonparametric context, where F̂ is the EDF of a single sample. The calculations outlined for the parametric case apply here also. First, if r/n ≤ γ < (r + 1)/n then the plug-in prediction limit is a_γ(F̂) = y_{(r)}; equivalently, a_γ(F̂) = y_{([nγ])}, where [·] means integer part. Straightforward calculation shows that

Pr(Y_{n+1} ≤ y_{(r)}) = r/(n + 1),

which means that (5.66) becomes h(γ) = [nγ]/(n + 1). Therefore [n q(γ)]/(n + 1) = γ, so that the adjusted prediction limit is y_{([(n+1)γ])}: this is exact if (n + 1)γ is an integer.
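In code the adjusted limit is a single order statistic; a minimal sketch (ours):

# Adjusted nonparametric prediction limit y_([(n+1)gamma]).
pred.limit <- function(y, gamma = 0.95)
{
  n <- length(y)
  sort(y)[floor((n + 1)*gamma)]   # exact coverage if (n+1)*gamma is integer
}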
It seems intuitively clear th a t the efficiency o f this nonparam etric prediction
lim it relative to a param etric prediction limit would be considerably lower
th an would be the case for confidence limits on a param eter. F or example,
a com parison betw een the norm al-theory and nonparam etric m ethods for
sam ples from a norm al distribution shows the efficiency to be ab o u t j for
a = 0.05.
For semiparametric problems similar calculations apply. One general approach which makes sense in certain applications, as mentioned earlier, bases prediction limits on point predictions, and uses resampling to estimate the distribution of prediction error. For further details see Sections 6.3.3 and 7.2.4.
5.11 Bibliographic Notes
(1994) discuss the numbers of samples required when the nested bootstrap is used to calibrate a confidence interval.

Conditional methods have received little attention in the literature. Example 5.17 is taken from Hinkley and Schechtman (1987). Booth, Hall and Wood (1992) describe kernel methods for estimating the conditional distribution of a bootstrap statistic.

Confidence regions for vector parameters are almost untouched in the literature. There are no general analogues of adjusted percentile methods. Hall (1987) discusses likelihood-based shapes for confidence regions.

Geisser (1993) surveys several approaches to calculating prediction intervals, including resampling methods such as cross-validation.

References to confidence interval and prediction interval methods for regression models are given in the notes for Chapters 6 and 7; see also Chapter 8 for time series.
5.12 Problems

1
Compare the use of normal approximations for θ̂ and ζ̂ with use of a parametric bootstrap analysis to obtain confidence intervals for θ: see Practical 5.1.
(Section 5.2)

3
The gamma model (1.1) with mean μ and index κ can be applied to the data of Example 1.1. For this model, show that the profile log likelihood for μ is

ℓ_prof(μ) = nκ̂_μ log(κ̂_μ/μ) + (κ̂_μ − 1) Σ log y_j − κ̂_μ Σ y_j/μ − n log Γ(κ̂_μ),

where κ̂_μ is the solution to the estimating equation

n log(κ/μ) + n + Σ log y_j − Σ y_j/μ − nψ(κ) = 0,
with ψ(κ) the derivative of log Γ(κ).
Describe an algorithm for simulating the distribution of the log likelihood ratio statistic W(μ) = 2{ℓ_prof(μ̂) − ℓ_prof(μ)}, where μ̂ is the overall maximum likelihood estimate.
(Section 5.2)
C(R, s) pˢ (1 − p)^{R−s},

where p = p(F) = Pr*(Z* ≤ z | F̂) and C(R, s) denotes the binomial coefficient. Let P be the random variable corresponding to p(F̂), with CDF G(·). Hence show that the unconditional probability is

Pr(θ ∈ I_R) = Σ_s C(R, s) ∫ uˢ (1 − u)^{R−s} dG(u).

Note that Pr(P ≤ α) = Pr{θ ∈ [T − V^{1/2}Z*_α, ∞)}, where Z*_α is the α quantile of the distribution of Z*, conditional on Y₁, ..., Yₙ.
(b) Suppose that it is reasonable to approximate the distribution of P by the beta distribution with density u^{a−1}(1 − u)^{b−1}/B(a, b), 0 < u < 1; note that a, b → 1 as n → ∞. For some representative values of R, α, a and b, compare the coverage error of I_R with that of the interval [T − V^{1/2}Z_α, ∞).
(Section 5.2.3; Hall, 1986)
6
…
coverage (1 − 2α) is

Pr{ (n − 1)²/c_{n−1,1−α} ≤ C ≤ (n − 1)²/c_{n−1,α} },

where C has the χ²_{n−1} distribution. Give also the coverages of the basic bootstrap confidence intervals based on θ̂ and log θ̂.
Calculate these coverages for n = 25, 50, 75 and α = 0.05, 0.025, and 0.005. Which of these intervals is preferable?
(c) See Practical 5.4, in which we take δ = 2.236.
(Section 5.3.1)
7
Suppose that we have a parametric model with parameter vector tp, and that
9 = h(xp) is the parameter o f interest. The adjusted percentile ( B C a) method is
found by applying the scalar parameter method to the least-favourable family, for
which the log likelihood <f(ip) is replaced by / l f ( 0 = ($>+&), with S = i~l {rp)h(y))
and h( ) is the vector o f partial derivatives. Equations (5.21), (5.22) and (5.24) still
apply.
Show in detail how to apply this extension o f the B C a method to the problem
o f calculating confidence intervals for the ratio 9 = Hi/\i\ o f the means o f two
exponential distributions, given independent samples from those distributions. Use
a numerical example (such as Example 5.10) to compare the B C a m ethod to the
exact method, which is based on the fact that 9 / 9 has an F distribution.
(Sections 5.3.2, 5.4.2; Efron, 1987)
For the ratio o f independent means in Example 5.10, show that the matrix o f
second derivatives ii{n) has elements
n2t 1 2 ( y u - y i X y i j ~ y \ )
njyi I
yi
uu.2j =
n2
n\n2y {
and
ft
. ___
-i = ( l -Hi /Hl
\-i/fii
1/Mi
5 Confidence Intervals
250
Hence show that the A B C confidence limit is given by
~ = x + d Y X j e j / ( n 2v l / 2u)
u + d x Y l u j e j / ( n 2v l / 2u)
11
13
s is consistent for k if
s = A+ op(i) as n->oo.
cv is the a quantile of
the
distribution.
251
5.13 Practicals
15
For an equi-tailed (1 2a) confidence interval, the ideal endpoints are t + p with
values o f P solving (3.31) with
h(F, F ; P ) = I {t(F) - t(F) < 0} - a,
Suppose that the bootstrap solutions are denoted by [i? and P t- a., and that in
the language o f Section 3.9.1 the adjustments b(F, y) are /Ja+?1 and /?i_a+w. Show
how to estimate yi and y2, and verify that these adjustments modify coverage
1 2a + 0 (n _1) to 1 2a + 0(n~2).
(Sections 3.9.1, 5.6; Hall and Martin, 1988)
16
G( I d ) ,
f= i W{h-'(d;-d)}
where w( ) is a density symmetric about zero and h is an adjustable bandwidth.
Investigate the bias and variance o f this estimate in the case where ( T , D ) is ap
proximately bivariate normal and w( ) = <p(-). Show that h = R ~ i/2 is a reasonable
choice.
(Section 5.9; Booth, Hall and W ood, 1992)
17
18
19
5.13 Practicals
1
Suppose that we wish to calculate a 90% confidence interval for the correlation
9 between the two counts in the colum ns o f cd4; see Practical 2.3. To obtain
252
5 Confidence Intervals
Suppose that we wish to calculate a 90% confidence interval for the largest
eigenvalue 9 o f the covariance matrix o f the two counts in the colum ns o f cd4; see
Practicals 2.3 and 5.1. To obtain confidence intervals for 9 under nonparametric
resampling, using the empirical influence values to calculate vL :
5.13 Practicals
253
par(mfrow=c(2,2))
tsplot(capabilityly,ylim=c(5,6))
abline(h=5.79,lty=2); abline(h=5.49,lty=2)
qqnorm(capability$y)
acf(capabilitySy)
254
5 Confidence Intervals
acf(capability$y,type="partial")
To find nonparametric confidence limits for rj using the estimates given by (ii) in
Problem 5.6:
Following on from Practical 2.3, w e use a double bootstrap with M = 249 to adjust
the studentized bootstrap interval for a correlation coefficient applied to the cd4
data.
nested.corr <- function(data, w, tO, M)
{ n <- nrow(data)
i <- rep(l:n,round(n*w))
t <- corr.fun(data, w )
z <- (t[l]-t0)/sqrt(t[2])
nested.boot <- boot(data[i,], corr.fun, R=M, stype="w")
z.nested <- (nested.boot$t[,1]t [1])/sqrt(nested.boot$t[,2])
c(z, sum(z.nested<z)/(M+l)) }
cd4.boot <- boot(cd4, nested.corr, R=9, stype="w",
tO=corr(cd4), M=249)
To get som e idea how long you will have to wait if you set R = 999 you can
time the call to b o o t using u n ix .t im e or d o s . t i m e : beware o f time and memory
problems. It may be best to run a batch job, with contents
q <- c(0.975,0.025)
q.adj <- quantile(cd4.nested$t[,2],q)
tO <- corr.fun(cd4)
z <- sort(cd4.nested$t[,1])
5.13 Practicals
255
t O [1]-sqrt(tO[2])*z[floor((l+cd4.nested$R)*q)]
t O [1]-sqrt(tO[2])*z[floor((l+cd4.nested$R)*q.adj)]
Does the correction have much effect? Compare this interval with the correspond
ing ABC interval.
(Section 5.6)
6
Linear Regression
6.1 Introduction
O ne o f the m ost im p o rta n t and frequent types o f statistical analysis is re
gression analysis, in which we study the effects o f explanatory variables or
covariates on a response variable. In this chap ter we are concerned with Unear
regression, in which the m ean o f the ran d o m response Y observed at value
x = ( x i,. . . , x p)T o f the explanatory variable vector is
E ( y | x) = n(x) = x Tp.
The m odel is com pleted by specifying the natu re o f random variation, which
for independent responses am o u n ts to specifying the form o f the variance
v a r(7 | x). F or a full p aram etric analysis we would also have to specify the
distribution o f Y , be it norm al, Poisson o r w hatever. W ithout this, the m odel
is sem iparam etric.
F or linear regression w ith norm al ran d o m errors having co n stan t variance,
the least squares theory o f regression estim ation and inference provides clean,
exact m ethods for analysis. But for generalizations to non-norm al errors and
non-con stan t variance, exact m ethods rarely exist, and we are faced with
approxim ate m ethods based o n linear approxim ations to estim ators and central
lim it theorem s. So, ju s t as in the sim pler context o f C hapters 2-5, resam pling
m ethods have the poten tial to provide m ore accurate analysis.
We begin o u r discussion in Section 6.2 w ith simple least squares linear re
gression, where in ideal conditions resam pling essentially reproduces the exact
theoretical analysis, b u t also offers the p o tential to deal with non-ideal cir
cum stances such as non-co n stan t variance. Section 6.3 covers the extension
to m ultiple explanatory variables. The related topics o f aggregate prediction
erro r an d o f variable selection based on predictive ability are discussed in
Section 6.4. R obust m ethods o f regression are exam ined briefly in Section 6.5.
256
257
Body weight
Body weight
x = log(body weight).
+ Pi xj + ej,
j= l,...,n,
(6.1)
w here the EjS are uncorrelated w ith zero m eans and equal variances a 2. This
constancy o f variance, or hom oscedasticity, seems roughly right for the example
data. We refer to the d a ta (x j , y j ) as the y'th case.
In general the values Xj m ight be controlled (by design), random ly sampled,
o r m erely observed as in the example. But we analyse the d a ta as if the x,s
were fixed, because the am o u n t o f inform ation ab o u t ft = (/fo, l h ) T depends
u p o n their observed values.
The sim plest analysis o f d a ta under (6.1) is by the ordinary least squares
6 Linear Regression
258
m ethod, on which we concentrate here. The least squares estim ates for (i are
,
h = y - Pi*,
(6 .2 )
where x = n 1 Y XJ an d
= ^ = i ( x; x )2- T he conventional estim ate o f
the error variance er2 is the residual m ean square
where
ei = yj - A>
(6.3)
A/ = Po + Plxj
(6.4)
the fitted values, or estim ated m ean values, for the response at the observed x
values.
The basic properties o f the p aram eter estim ates Po, Pi, which are easily
obtained u n d er m odel (6.1), are
(6.5)
and
E(j?i) =
Pu
(6.6)
var(j?i) =
The estim ates are norm ally distributed and optim al if the errors e;- are norm ally
distributed, they are often approxim ately norm al for other erro r distributions,
b u t they are n o t robust to gross non-norm ality o f errors or to outlying response
values.
The raw residuals e} are im p o rtan t for various aspects o f m odel checking,
and potentially for resam pling m ethods since they estim ate the random errors
Ej, so it is useful to sum m arize their properties also. U nder (6.1),
n
(6.7)
k= 1
where
var(e; ) = tx2( l
-hj).
68
( . )
259
Standardized residuals
are called studentized
residuals by some
authors.
(6.9)
(i - h j W
IX
= x) =
fly
+ y{x H x ) ,
y =
0 x y / 0 x2 ,
(6-10)
260
6 Linear Regression
w ith n x = E(X ), fly = E(Y ), a 2 = \a.r(X) and axy = cov(X, Y). This condi
tional m ean corresponds to the m ean in (6.1), w ith
Po = H y - y f i x,
Pi=y-
(6.11)
L^
<612>
F> = C - S ?
(6' 13)
T he nonparam etric delta m ethod variance approxim ation (2.36) applied to [1]
gives
vl
Y, { x j x)2e2j
= -S S 2
1
(6-14)
E (xj - x)n(xj)
SS X
(6.15)
261
The influence function for the least squares estim ator is again given by
(6.12), b u t w ith fix and a \ respectively replaced by x and n~' J2(x j ~ *)2Em pirical influence values are still given by (6.13). The analogue o f linear
approxim ations (2.35) an d (3.1) is $ = fi + n~x
Lt { ( xj , y j) ; F} , w ith vari
ance n_ 2 ^ " =1 v ar [Lt{( xj, Yj) ;F}]. If the assum ed hom oscedasticity o f errors
is used to evaluate this, w ith the constant variance a 2 estim ated by n~l
ep
then the delta m ethod variance approxim ation for /?i, for example, is
'Z i.
nSSx
strictly speaking this is a sem iparam etric approxim ation. This differs by a
factor o f (n 2) / n from the stan d ard estim ate, which is given by (6.6) with
residual m ean square s2 in place o f a 2.
The stan d ard analysis for linear regression as outlined in Section 6.2.1 is the
sam e for b o th situations, provided the random errors ej have equal variances,
as w ould usually be jud g ed from plots o f the residuals.
262
6 Linear Regression
j =
(6.16)
w ith p.j =
+ [Six an d ej random ly sam pled from G. So the algorithm
to generate sim ulated datasets an d corresponding param eter estim ates is as
follows.
Algorithm 6.1 (Model-based resampling in linear regression)
For r = 1
1 F or j = 1, . . . , n ,
(a) set x j = Xj\
(b) random ly sam ple ej from r
. . , r r; then
a ,
E (* )-* > 2 f t +
SS,
'
their average h, then the m eans an d variances o f Pq and p \ are given exactly
by (6.5) an d (6.6) w ith the estim ates Pq, P i an d s2 substituted for param eter
values. T he advantage o f resam pling is im proved quantile estim ation when
norm al-theory distributions o f the estim ators Pq, P i , S 2 are n o t accurate.
Example 6.1 (M am m als) F or the d a ta plotted in the right panel o f Figure 6.1,
the simple linear regression m odel seems appropriate. S tan d ard analysis sug
gests th a t errors are approxim ately norm al, although there is a small suspicion
o f heteroscedasticity: see Figure 6.2. T he p aram eter estim ates are Po = 2.135
and Pi = 0.752.
From R = 499 b o o tstra p sim ulations according to the algorithm above, the
263
co
3
TD
tO
3
o
o
<D
o
O
0o>
"D
O
Leverage h
estim ated sta n d a rd errors o f intercept and slope are respectively 0.0958 and
0.0273, com pared to the theoretical values 0.0960 and 0.0285. The em pirical
distributions o f b o o tstra p estim ates are alm ost perfectly norm al, as they are
for the studentized estim ates. T he estim ated 0.05 and 0.95 quantiles for the
studentized slope estim ate
sE{fay
w here SE(fS\) is the stan d ard error for
obtained from (6.6), are z*25) = 1.640
an d z'475) = 1.5 89, com pared to the stan d ard norm al quantiles +1.645. So, as
expected for a m oderately large clean dataset, the resam pling results agree
closely w ith those obtained from stan d ard m ethods.
Zero intercept
In som e applications the intercept f o will n o t be included in (6.1). This affects
the estim ation o f Pi and a 2 in obvious ways, b u t the resam pling algorithm will
also differ. First, the leverage values are different, nam ely
264
6 Linear Regression
265
R = 999.
f>i
T heoretical
R o b u st theoretical
bias
sta n d a rd e rro r
0
0.096
0.0006
0.091
0.088
bias
sta n d a rd e rro r
0
0.0285
0.0002
0.0223
0.0223
x j ,...,x * are random ly sam pled. The design fixes the inform ation content o f a
sample, and in principle o u r inference should be specific to the inform ation in
o u r data. The variation in x j , . . . , x will cause some variation in inform ation,
b u t fortunately this is often u n im p o rtan t in m oderately large datasets; see,
however, Exam ples 6.4 and 6.6.
N ote th a t in general the resam pling distribution o f a coefficient estim ate
will not have m ean equal to the d a ta estim ate, contrary to the unbiasedness
property th a t the estim ate in fact possesses. However, the difference is usually
negligible.
Example 6.2 (M ammals) F or the d ata o f Exam ple 6.1, a b o o tstra p sim ulation
was run by resam pling cases with R = 999. Table 6.1 shows the bias and
stan d ard error results for b o th intercept and slope. The estim ated biases are
very small. T he striking feature o f the results is th at the stan d ard erro r for the
slope is considerably sm aller than in the previous b o o tstrap sim ulation, which
agreed w ith stan d ard theory. The last colum n o f the table gives robust versions
o f the stan d ard errors, which are calculated by estim ating the variance o f Ej to
be rj. For exam ple, the robust estim ate o f the variance o f (it is
This corresponds to the delta m ethod variance approxim ation (6.14), except
th a t rj is used in preference to e; . As we m ight have expected from previous
discussion, the b o o tstrap gives an approxim ation to the robust stan d ard error.
A
A
Figure 6.3 shows norm al Q -Q plots o f the b o o tstra p estim ates Pq and fi'.
F or the slope p aram eter the right panel shows lines corresponding to norm al
d istributions w ith the usual and the robust stan d ard errors. T he distribution
o f Pi is close to norm al, with variance m uch closer to the robust form (6.17)
th an to the usual form (6.6).
One disadvantage o f the robust stan d ard error is its inefficiency relative to
the usual stan d ard erro r when the latter is correct. A fairly straightforw ard
calculation (Problem 6.6) gives the efficiency, which is approxim ately 40% for
the slope p aram eter in the previous example. T hus the effective degrees o f
freedom for the robust stan d ard error is approxim ately 0.40 times 62, or 25.
6 Linear Regression
266
The sam e loss o f efficiency would apply approxim ately to b o o tstrap results for
resam pling cases.
Pr [T > 1 1X = x, Y = p e rm j^ .)} ],
Figure 63 Normal
plots for bootstrapped
estimates of intercept
(left) and slope (right)
for linear regression fit
to logarithms of
mammal data, with
R = 999 samples
obtained by resampling
cases. The dotted lines
give approximate
normal distributions
based on the usual
formulae (6.5) and (6.6),
while the dashed line
shows the normal
distribution for the
slope using the robust
variance estimate (6.17).
267
where perm { } denotes a perm utation. Because all perm utations are equally
likely, we have
# o f perm utations such th a t T > t
P = --------------------n!i-------------------
as in (4.20). In the present context we can take T = fii, for which p is the same
as if we used the sam ple Pearson correlation coefficient, b u t the same m ethod
applies for any ap p ro p riate slope estim ator. In practice the test is perform ed
by generating sam ples ( x j ,y j ) ,. ..,(x * ,y * ) such th a t x* = x j and (_ y j,...,y )
is a ran d o m p erm u tatio n o f ( y i , . . . , y n), and fitting the least squares slope
estim ate jSj. If this is done R times, then the one-sided P-value for alternative
H A : fi i > 0 is
P
# { fr> M + i
R + 1
x) =
xp
yj =
;0 + 8}o>
w here pjo = y an d the *0 are sam pled with replacem ent from the null m odel
residuals e^o = yj ~ y , j = 1
, The least squares slope /Jj is calculated
from the sim ulated data. A fter R repetitions o f the sim ulation, the P-value is
calculated as before.
268
6 Linear Regression
This second b o o tstrap test differs from the first b o o tstrap test only in th at
the values o f explanatory variables x are fixed at the d a ta values for every
case. N ote th a t if residuals were sam pled w ithout replacem ent, this test would
duplicate the exact p erm u tatio n test, which suggests th at this boo tstrap test
will be nearly exact.
The test could be m odified by standardizing the residuals before sam pling
from them , which here w ould m ean adjusting for the constant null m odel
leverage n-1 . This w ould affect the P-value slightly for the test as described,
b u t not if the test statistic were changed to the studentized slope estimate.
It therefore seems wise to studentize regression test statistics in general, if
m odel-based sim ulation is used; see the discussion o f b o o tstrap pivot tests
below.
Testing non-zero slope values
All o f the preceding tests can be easily modified to test a non-zero value o f
Pi. If the null value is /?i,o, say, then we apply the test to m odified responses
yj PiflXj, as in Exam ple 6.3 below.
Bootstrap pivot tests
F u rther b o o tstrap tests can be based on the studentized b o o tstrap approach
outlined in Section 4.4.1. F or simplicity suppose th at we can assum e ho m o
scedastic errors. T hen Z = ([S\ Pi)/S\ is a pivot, where Si is the usual
standard error for
As a pivot, Z has a distribution not depending upon
param eter values, an d this can be verified under the linear m odel (6.1). The null
hypothesis is Ho : Pi = 0, and as before we consider the one-sided alternative
H a : Pi > 0. T hen the P-value is
p = Pr
P i = 0, P o, c r
Pi,Po,<r),
where Z* = (j?,* Pi ) / S ' is com puted from a sam ple sim ulated according to
A lgorithm 6.1, which uses the fit from the full m odel as in (6.16). So, applying
the b o o tstrap as described in Section 6.2.3, we calculate the b o o tstrap P-value
from the results o f R sim ulated sam ples as
#
P
(6.19)
where zq = Pi/si.
The relation o f this m ethod to confidence limits is th a t if the lower 1 a
CM
X * *
* A i* ***.
i
o
CO
Ip
CM
O
CM
o
*
00
CO
d
-
0.2
0.1
0.0
so
269
0.2
.
1 .
*
w - ,*
/
t
0.1
0.0
confidence lim it for fa is above zero, then p < oc. Sim ilar interpretations apply
with upper confidence limits and confidence intervals.
T he sam e m ethod can be used with case resampling. If this were done as
a precaution against erro r heteroscedasticity, then it would be appropriate to
replace si w ith the robust stan d ard erro r defined as the square root o f (6.17).
If we wish to test a non-zero value fa$ for the slope, then in (6.18) we
simply replace f a / s \ by zo = (fa fa,o)/si, or equivalently com pare the lower
confidence lim it to fayW ith all o f these tests there are simple m odifications if a different alternative
hypothesis is appropriate. For example, if the alternative is H A : fa < 0, then
the inequalities > used in defining p are replaced by
and the two-sided
P-value is twice the sm aller o f the two one-sided P-values.
O n balance there seems little to choose am ong the various tests described.
The perm u tatio n test an d its b o o tstrap look-alike are equally suited to statis
tics other th an least squares estim ates. T he b o o tstrap pivot test with case
resam pling is the only one designed to test slope w ithout assum ing constant
erro r variance u nder the null hypothesis. But one would usually expect sim ilar
results from all the tests.
The extensions to m ultiple linear regression are discussed in Section 6.3.2.
Example 6.3 (Returns data) The d a ta plotted in Figure 6.4 are n = 60
consecutive cases o f m onthly excess returns y for a particular com pany and
excess m ark et returns x, where excess is relative to riskless rate. We shall ignore
the possibility o f serial correlation. A linear relationship appears to fit the data,
and the hypothesis o f interest is Ho : fa = 1 with alternative HA : fa > 1, the
la tte r corresponding to the com pany outperform ing the m arket.
270
6 Linear Regression
a.
z* = (fil - M/Kob
CM
obtained by resampling
cases. Unshaded area
corresponds to values in
excess of data value
20 = (ft - 1)/sr0b =
0.669.
-2
linear m odel (6.1) will apply, b u t with heteroscedasheteroscedasticity can be m odelled, then boo tstrap
errors is still possible. We assum e to begin with th at
least squares estim ates are fitted, as before.
y j-h
{V (X j)(l-h j)y/2
or
y j-h
{ F( ^. )
1/ 2
271
(6.20)
a _ 5
P0
PlXw,
272
6 Linear Regression
},
var(/?i)
Y , W j ( X j - X w)2
273
All cases
C ases
Cases, subset
W ild, ej
W ild, rj
R o b u st theoretical
0.32
0.28
0.31
0.33
0.34
44.3
38.4
37.9
37.0
39.4
W ith o u t case 22
0.42
0.39
0.37
0.41
0.40
73.2
59.1
62.5
67.2
67.2
b ootstrap , an d for the full d a ta it m akes little difference when the modified
residuals are used.
Case 22 has high leverage, and its exclusion increases the variances o f both
estim ates. T he wild b o o tstrap is again less variable th an bootstrapping cases,
with the wild b o o tstrap o f modified residuals interm ediate betw een them.
We m entioned earlier th a t the design will vary when resam pling cases. The
left panel o f Figure 6.6 shows the sim ulated slope estim ates
plotted against
the sum s o f squares X X x )2> f r 200 b o o tstrap samples. The plotting
ch aracter distinguishes the num ber o f tim es case 22 occurs in the resam ples:
we retu rn to this below. The variability o f /}j decreases sharply as the sum o f
squares increases. N ow usually we would treat the sum o f squares as fixed in
the analysis, and this suggests th at we should calculate the variance o f P\ from
those b o o tstra p sam ples for which X ( x} x*)2 is close to the original value
XXx; ~ x)2, show n by the d otted vertical line. If we take the subset between
the dashed lines, the estim ated variance is closer to th at for the wild bootstrap,
as show n the values in Table 6.2 and by the Q-Q plot in the right panel o f
Figure 6.6. This is also true when case 22 is excluded.
The m ain reason for the large variability o f XXxy x )2 is th a t case 22 has
high leverage, as its position at the b o tto m left o f Figure 6.4 shows. Figure 6.6
shows th a t it has a substantial effect on the precision o f the slope estim ate:
the m ost variable estim ates are those where case 22 does not occur, and the
least variable those w here it occurs two or m ore times.
( 6.22)
274
6 Linear Regression
;
V%:
0 (*1n ol*
:d fe
co
i
i
ii
0 i!
0.001
r i p i . ..
v Ti
*
1 ill
i
ii
ii
i
0.003
0.005
Sum of squares
Cases
where for m odels w ith an intercept Xjo = 1. In the m ore convenient vector
form the m odel is
Yj = Xj (i + j
with x j = ( x jo , Xj i, .. ., Xj P). The com bined m atrix representation for all re
sponses Y t = ( Y i , . . . , Y) is
xp + s
(6.23)
275
(6.24)
(6.25)
vl
= (Xt X )-1
(X TX ) ~ l
(6.26)
see Problem 6.1. These generalize equations (6.13) and (6.14). The variance
approxim ation is im proved by using the modified residuals
(1 - M 1/2
6 Linear Regression
276
/I
1
0
0
X =
0
\0
0\
0
0
0
0
0
0
1/
Pi = 3 (yn + y n - i ),
i=
l,...,p,
and
ej = ( ~ i y ^(yn ~ yn-i),
hj=\,
j = 2i - l , 2i,
i=l,...,p.
The E D F o f the residuals, m odified o r not, could be very unlike the true error
distribution: for example, the E D F will always be symmetric.
I f the ran d o m errors are hom oscedastic then the m odel-based b o otstrap
will give consistent estim ates o f bias and stan d ard error for all regression
coefficients. However, the b o o tstrap distributions m ust be symmetric, and so
m ay be no b etter th an norm al approxim ations if true random errors are
skewed. T here appears to be no rem edy for this. T he problem is n o t so serious
for contrasts am ong the P,. F or example, if 0 = P\ P2 then it is easy to
see th at 9 has a sym m etric distribution, as does O'. The kurtosis is, however,
A
A
different for 9 an d 6 ; see Problem 6.10.
Case resam pling will not w ork because in those sam ples where b o th y 2i+i
and y2i+2 are absent /?, is inestim able: the resam ple design is singular. The
chance o f this is 0.48 for m = 5 increasing to 0.96 for m = 20. This can be fixed
by om itting all b o o tstrap sam ples where
+ f 2i = 0 for any i. T he resulting
boo tstrap variance for P consistently overestim ates by a factor o f ab o u t 1.3.
F u rth er details are given in Problem 6.9.
The im plication for m ore general designs is th a t difficulties will arise with
com binations cTp where c is in the subspace spanned by those eigenvectors o f
X TX corresponding to sm all eigenvalues. First, m odel-based resam pling will
give adequate results for stan d ard erro r calculations, but b o o tstrap distribu
tions m ay n o t im prove on norm al approxim ations in calculating confidence
limits for the /?,-s, o r for prediction. Secondly, unconstrained case resam pling
277
1
2
3
4
5
6
7
8
9
10
11
12
13
xi
*2
X)
*4
7
1
11
11
7
11
3
1
2
21
1
11
10
26
29
56
31
52
55
71
31
54
47
40
66
68
6
15
8
8
6
9
17
22
18
4
23
9
8
60
52
20
47
33
22
6
44
22
26
34
12
12
78.5
74.3
104.3
87.6
95.9
109.2
102.7
72.5
93.1
115.9
83.8
113.3
109.4
278
6 Linear Regression
Table 6.4 Standard
fio
/?!
P2
P4
err0rS of linear
____________________________________________________________________
N o rm al-th eo ry
E rro r resam pling, R = 999
C ase resam pling, all R = 999
C ase resam pling, m iddle 500
C ase resam pling, largest 800
70.1
66.3
108.5
68.4
67.3
0.74
0.70
1.13
0.76
0.77
0.72
0.69
1.12
0.71
0.69
0.75
0.72
1.18
0.78
0.78
regression coefficients
for cement data.
Theoretical and error
resampling assume
homoscedasticity.
Resampling results use
R = 999 samples, but
0.71
0.67
1.11
0.69
0.68
--------------------------------------------------------------------------------------------------------
rv
V1
U
(0
-O
. V :?-
5 10
50
500
Smallest eigenvalue
5 10
50
500
Smallest eigenvalue
gives m ore reasonable stan d ard errors, as seen in the penultim ate row o f
Table 6.4. T he last row, corresponding to d ropping the smallest 200 values o f
f \ , gives very sim ilar results.
(6.27)
the fitted values are p. = Xfl, and the residual vector is e = (I H)y, where
now the h a t m atrix H is defined by
X ( X T WX)~lX T W,
(6.28)
279
w hose diagonal elem ents are the leverage values hj. The residual vector e has
variance var(e) = k (I H ) W ~ [, whose y'th diagonal elem ent is /c(l h j ) w j 1.
So the m odified residual is now
rj =
_ J 2J -- ------
Wj
(1 hj)1/2
(6.29)
= x j p + w j ll2j,
We shall also need the residuals from this fit, which are eo = (/ Ho)y with
Ho = X q( X q Xo)~lX q . The test statistic T will be based on the least squares
estim ate y for y in the full m odel, which can be expressed as
y (Xi-oXio) 1X[.0eo
w ith X i o = (I H q) X i. The extension o f the earlier p erm utation test is
6 Linear Regression
280
= Ao + o,
where the com ponents o f the sim ulated error vector e0 are sam pled w ithout
(perm utation) or w ith (bo o tstrap ) replacem ent from the n residuals in eo- N ote
th at this m akes use o f the assum ed hom oscedasticity o f errors. Each case keeps
its original covariate values, which is to say th a t X = X . W ith the sim ulated
d a ta we regress y on X to calculate y' and hence the sim ulated test statistic
t \ as described below. W hen this is repeated R times, the b o o tstrap P-value is
# { t; > t} + l
R + l
T he p erm u tatio n version o f the test is not exact w hen nuisance covariates X j
are present, b u t em pirical evidence suggests th a t it is close to exact.
Scalar y
W hat should t be? F or testing a single com ponent, so th a t y is a scalar, suppose
th a t the alternative hypothesis is one-sided, say H A : y > 0. T hen we could
A
1/2
take t to be y itself, o r possibly a studentized form such as zo = y / v 0 , where
Do is an ap p ro p riate estim ate o f the variance o f y. If we com pute the standard
error using the null m odel residual sum o f squares, then
v0 = ( n - q r ' e l e o i X l o X i o r 1,
where q is the ran k o f X q. T he sam e form ula is applied to every sim ulated
sam ple to get i>q an d hence z* = y*/vq1/2.
W hen there are no nuisance covariates Xo, Vq = vq in the p erm u tatio n test,
and studentizing has no effect: the sam e is true if the non-null stan d ard error
is used. Em pirical evidence suggests th a t this is approxim ately true w hen Xo is
present; see the exam ple below. Studentizing is necessary if m odified residuals
are used, w ith stan d ard izatio n based on the null m odel hat m atrix.
A n alternative b o o tstrap test can be developed in term s o f a pivot, as
described for single-variable regression in Section 6.2.5. H ere the idea is to
treat Z = (y y ) / V l/2 as a pivot, w ith V l/1 an ap propriate stan d ard error.
B ootstrap sim ulation u nder the full fitted m odel then produces the R replicates
o f z which we use to calculate the P-value. To elaborate, we first fit the full
m odel p = X f i by least squares and calculate the residuals e = y p. Still
assum ing hom oscedasticity, the stan d ard erro r for y is calculated using the
residual m ean square a simple form ula is
v = ( n - p - 1) l e Te ( X l 0Xi . 0)
281
= X p + e*,
X ' = X,
where the n errors in e* are sam pled independently w ith replacem ent from the
residuals e o r m odified versions o f these. The full regression o f y on X is then
fitted, from which we obtain y * and its estim ated variance v", these being used
to calculate z* = (y* y ) / v ' ll2. F rom R repeats o f this sim ulation we then
have the one-sided P-value
#
{ z r* >
Z q }
R + 1
6 Linear Regression
282
case
a rea
p e rim e te r
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
4990
7002
7558
7352
7943
7979
9333
8209
8393
6425
9364
8624
10651
8868
9417
8874
10962
10743
11878
9867
7838
11876
12212
8233
6360
4193
7416
5246
6509
4895
6775
7894
5980
5318
7392
7894
3469
1468
3524
5267
5048
1016
5605
8793
3475
1651
5514
9718
2792
3893
3931
3869
3949
4010
4346
4345
3682
3099
4480
3986
4037
3518
3999
3629
4609
4788
4864
4479
3429
4353
4698
3518
1977
1379
1916
1585
1851
1240
1728
1461
1427
991
1351
1461
1377
476
1189
1645
942
309
1146
2280
1174
598
1456
1486
sh a p e
0.09
0.15
0.18
0.12
0.12
0.17
0.19
0.16
0.20
0.16
0.15
0.15
0.23
0.23
0.17
0.15
0.20
0.26
0.20
0.14
0.11
0.29
0.24
0.16
0.28
0.18
0.19
0.13
0.23
0.34
0.31
0.28
0.20
0.33
0.15
0.28
0.18
0.44
0.16
0.25
0.33
0.23
0.46
0.42
0.20
0.26
0.18
0.20
p e rm e a b ility
6.3
6.3
6.3
6.3
17.1
17.1
17.1
17.1
119.0
119.0
119.0
119.0
82.4
82.4
82.4
82.4
58.6
58.6
58.6
58.6
142.0
142.0
142.0
142.0
740.0
740.0
740.0
740.0
890.0
890.0
890.0
890.0
950.0
950.0
950.0
950.0
100.0
100.0
100.0
100.0
1300.0
1300.0
1300.0
1300.0
580.0
580.0
580.0
580.0
283
co
3
T3
O
0N
?
(0
O
c
03
CO
10
12
Core number
V ariable
intercept
a r e a ( x lO - 3 )
p e r i ( x lO - 3 )
sh ap e
Coefficient
SE
f-value
3.465
0.864
-1 .9 9 0
3.518
1.391
0.211
0.400
4.838
2.49
4.09
- 4 .9 8
0.73
Vector y
F or testing several com ponents sim ultaneously, we take the test statistic to be
the quad ratic form
T = F i X l o X v 0)y,
6 *Linear Regression
284
-6
-4
-2
-6
-4
z*
-2
z0*
or equivalently the difference in residual sum s o f squares for the null and full
m odel least squares fits. This can be standardized to
n q
RSSo R S S
q
X
RSSo
where RSSo and R S S denote residual sum s o f squares under the null m odel
and full m odel respectively.
We can apply the pivot m ethod with full m odel sim ulation here also, using
Z = (y y)T ( X l 0Xi.o)(y y ) / S 2 w ith S 2 the residual m ean square. The test
statistic value is zo = y T(X[.0Xi .0) y /s 2, for w hich the P-value is given by
# {z* >
Zp}
R + 1
This would be equivalent to rejecting Ho at level a if the 1 a confidence set
for y does n o t include the point y = 0. A gain, case resam pling would provide
protection against heteroscedasticity: z would then require a robust standard
error.
6.3.3 Prediction
A fitted linear regression is often used for prediction o f a new individual
response Y+ when the explanatory variable vector is equal to x +. T hen we shall
w ant to supplem ent o u r predicted value by a prediction interval. Confidence
limits for the m ean response
can be found using the same resam pling
as is used to get confidence limits for individual coefficients, b u t limits for
the response Y+ itself usually called prediction lim its require additional
resam pling to sim ulate the variation o f 7+ ab o u t x \ j i .
285
( x l P + +)
by the distribution o f
<5* = x+/?* (x+/? + e+),
(6.30)
w here + is sam pled from G and /T is a sim ulated vector o f estim ates from the
m odel-based resam pling algorithm . This assum es hom oscedasticity o f random
error. U nconditional properties o f the prediction erro r correspond to averaging
over the distributions o f b o th + and the estim ates /?, which we do in the
sim ulation by repeating (6.30) for each set o f values o f /T. H aving obtained
the m odified residuals
from the d a ta fit, the algorithm to generate R sets
each w ith M predictions is as follows.
Algorithm 6.4 (Prediction in linear regression)
F or r = 1 ,..., R,
1 sim ulate responses y* according to (6.16);
2 obtain least squares estim ates pr = ( X TX ) ~ 1X Ty *; then
3 for m = 1 ,..., M ,
(a) sam ple ^ m from r \ f , . . . , r r, and
(b ) com pute prediction error S m = x+i?* (x/? + +m).
$+ - a*-
6 Linear Regression
286
the pooled <5*s, w hose ordered values we denote by < 5( < <
boo tstrap prediction lim its are
y+ ^((RM+l)(l-ct))
y+ ^((RM+lJa)
The
(6.31)
where y+ = *+/?. This is analogous to the basic b o o tstrap m ethod for confi
dence intervals (Section 5.2).
A som ew hat b etter ap p ro ach w hich mimics the stan d ard norm al-theory
analysis is to w ork w ith studentized prediction error
where S is the square root o f residual m ean square for the linear regression.
The corresponding sim ulated values are z*m = <5*m/s*, with s ' calculated in step
2 o f A lgorithm 6.4. T he a and (1 a) quantiles o f Z are estim ated by z*(RM+1)0,)
and
respectively, where z'{V) < < z RM) are the ordered values
o f all R M z* s. T hen the studentized b o o tstrap prediction interval for 7+ is
y+ ~ SZ((RM+l)(l-ct))
+ ~ SZ((RM+1))-
(6.32)
E xam ple 6.8 (N uclear power stations) Table 6.7 contains d a ta on the cost o f
32 light w ater reactors. T he cost (in dollars x l0 ~ 6 adjusted to a 1976 base) is
the response o f interest, an d the o th er quantities in the table are explanatory
variables; they are described in detail in the d a ta source.
We take lo g (c o s t) as the w orking response y, and fit a linear m odel with
covariates PT, CT, NE, d a te , lo g (c a p a c ity ) and log(N). T he dum m y variable PT
indicates six plants for w hich there were p artial turnkey guarantees, and it is
possible th a t some subsidies m ay be hidden in their costs.
Suppose th a t we wish to obtain 95% prediction intervals for the cost o f a
station like case 32 above, except th a t its value for d a te is 73.00. T he predicted
value o f lo g (c o s t) from the regression is x+fi = 6.72, and the m ean squared
erro r from the regression is s = 0.159. W ith a = 0.025 and a sim ulation with
R = 999 an d M = 1, ( R M + l)a = 25 an d ( R M + 1)(1 a) = 975. The values
o f 3(25) an d <5*975) are -0.539 and 0.551, so the 95% lim its (6.31) are 6.18 and
7.27, which are slightly w ider th a n the norm al-theory limits o f 6.25 and 7.19.
F or the lim its (6.32) we get z(*25) = 3.680 and z(*975) = 3.5 12, so the lim its for
lo g (c o st) are 6.13 and 7.28. T he corresponding prediction interval for c o s t is
[exp(6.13), exp(7.28)] = [459.4,1451],
The usual caveats apply a b o u t extrapolating a trend outside the range o f
the data, an d we should use these intervals w ith great caution.
The next exam ple involves an u nusual d a ta structure, where there is hierar
chical variatio n in the covariates.
It is unnecessary to
standardize also by the
square root of
1 + x l ( X TX)- ' x+,
which would make the
variance of Z close to 1.
unless bootstrap results
for different x+ are
pooled.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
287
cost
d a te
Tl
t2
c a p a c ity
PR
NE
CT
BW
PT
460.05
452.99
443.22
652.32
642.23
345.39
272.37
317.21
457.12
690.19
350.63
402.59
412.18
495.58
394.36
423.32
712.27
289.66
881.24
490.88
567.79
665.99
621.45
608.80
473.64
697.14
207.51
288.48
284.88
280.36
217.38
270.71
68.58
67.33
67.33
14
46
73
85
67
78
51
50
59
55
71
64
47
62
52
65
67
60
76
67
59
70
57
59
58
44
57
63
48
63
71
72
80
687
1065
1065
1065
1065
514
822
457
822
792
560
790
530
1050
850
778
845
530
1090
1050
913
828
786
821
538
1130
745
821
0
0
1
0
1
0
0
0
1
0
0
0
0
0
0
0
0
1
0
1
0
1
0
1
0
0
0
0
0
1
1
1
1
0
0
1
1
1
0
0
0
1
0
1
0
0
0
0
1
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
0
0
0
1
0
0
1
0
0
0
0
1
0
0
1
0
1
0
1
1
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
0
0
1
1
0
1
14
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
68.00
68.00
67.92
68.17
68.42
68.42
68.33
68.58
68.75
68.42
68.92
68.92
68.42
69.50
68.42
69.17
68.92
68.75
70.92
69.67
70.08
70.42
71.08
67.25
67.17
67.83
67.83
67.25
67.83
10
10
11
11
13
12
14
15
12
12
13
15
17
13
11
18
15
15
16
11
22
16
19
19
20
13
9
12
12
13
7
886
886
745
886
1
1
12
12
3
5
1
5
2
3
6
2
7
16
3
17
2
1
8
15
20
18
3
19
21
8
7
11
11
8
11
Example 6.9 (Rock data) F or the d a ta discussed in Exam ple 6.7, one objective
is to see how well one can predict perm eability from a single replicate o f the
three im age-based m easurem ents, as opposed to the four replicates obtained
in the study. The previous analysis suggested th a t variable sh ap e did not
contribute usefully to a linear regression relationship for the logarithm o f
perm eability, an d this is confirm ed by cross-validation analysis o f prediction
errors (Section 6.4.1). So here we concentrate on predicting perm eability from
the linear regression o f y = lo g ( p e r m e a b ility ) on a r e a and p e r i .
In Exam ple 6.7 we com m ented on the strong intra-core correlation am ong
the explanatory variables, and th a t m ust be taken into account here if we are
to correctly analyse prediction o f core perm eability from single m easurem ents
o f a r e a and p e r i . O ne way to do this is to think o f the four replicate values
o f u = ( a r e a , p e r i ) T as unbiased estim ates o f an underlying core variable ,
on which y has a linear regression. T hen the d a ta are m odelled by
yj = <x + j y + fij,
ujk =
+ sjk,
(6.33)
6 Linear Regression
288
V ariable
M eth o d
In tercep t
a r e a ( x lO - 4 )
p e r i ( x l O 4)
K = 1
D irect regression on x ^ s
N o rm al-th eo ry fit
5.746
5.694
5.144
5.300
-16.16
-16.39
K = 4
R egression on Xj. s
N o rm al-th eo ry fit
4.295
4.295
9.257
9.257
-21.78
-21.78
(6.33).
(6.34)
for K = 1 and 4; the param eters a and y in (6.33) correspond to a (x) and
when K = oo. Fortunately it turns out th a t b o th observation m odels can be fit
ted easily: for K = 4 we regress the yjs on the core averages Uj; and for K = 1
we fit linear regression w ith all 48 individual cases as tabled, ignoring the
intra-core correlation am ong the e;*s, i.e. pretending th at y; occurs four times
independently. Table 6.8 shows the coefficients for both fits, and com pares
them to corresponding estim ates based on exact norm al-theory analysis.
Suppose, then, th a t we w ant to predict the new response y + given a single
set o f m easurem ents u+. If we define x \ = (1,m+), then the point prediction Y+
is x l P \ where /?(1) are the coefficients in the fit o f m odel (6.34) with K = 1,
shown in the first row o f Table 6.8. T he E D F o f the 48 modified residuals
from this fit estim ates the m arginal distribution o f the e*1* in (6.34), and hence
o f the error e+ in
Y+ = x l ^ + s +.
O ur concern is w ith the prediction error
5 = Y+ - Y + = x l $ W -
- +,
(6.35)
289
t f .
where the *4)* are random ly sam pled from the 12 m ean-adjusted, modified
residuals r ^ rw from the regression o f the y; s on the iijS. The estim ates
are now obtained by fitting the regression to the 48 sim ulated cases ( u ^ y j ) ,
k = 1 , ...,4 and j = 1 ,..., 12.
Figure 6.10 shows typical norm al plots for prediction error y + y+ , these
for x + = (1,4000,1000) and x + = (1,10000,4000) which are near the edge o f
the observed space, from R = 999 resam ples and M = 1. The skewness o f
prediction erro r is quite noticeable. The resam pling stan d ard deviations for pre
diction errors are 0.91 an d 0.93, som ew hat larger th an the theoretical standard
deviations 0.88 and 0.87 obtained by treating the 48 cases as independent.
To calculate 95% intervals we set a = 0.025, so th at ( R M + l)a = 25 and
( R M + 1)(1 a) = 975. The sim ulation values <5(*25) and <5('975) are 1.63 and
1.93 at x+ = (1,4000,1000), and -1 .5 7 and 2.19 at x + = (1,10000,4000). The
corresponding p o in t predictions are 6.19 and 4.42, so 95% prediction intervals
are (4.26,7.82) at x+ = (1,4000,1000) and (2.23,5.99) at x+ = (1,10000,4000).
These intervals differ m arkedly from those based on norm al theory treating all
48 cases as independent, those being (4.44,7.94) and (2.68,6.17). M uch o f the
difference is due to the skewness o f the resam pling distribution o f prediction
error.
6 Linear Regression
290
X is the n x q matrix
with rows x j , . . . , x j ,
where q = p + 1 if there
are p covariate terms
and an intercept in the
model.
291
in which ft is fixed and the expectation is over y+J = x]p + e+j. We cannot
calculate D exactly, because the m odel param eters are unknow n, so we m ust
settle for an estim ate which in reality is an estim ate o f A = E(D), the
average over all possible sam ples o f size n. O ur objective is to estim ate D or A
as accurately as possible.
As stated the problem is quite simple, at least under the ideal conditions
where the linear m odel is correct and the error variance is constant, for then
D
n - l Y r ( Y +j) + n - l Y , ( X j P - x J [ l ) 2
a 2 + n - l ( p - l } ) TX TX 0 - p ) ,
(6.36)
w hose expectation is
A = <j 2(1 + ^ - 1),
(6.37)
(6.38)
However, this estim ate is very specialized, in two ways. First, it assumes th at
the linear m odel is correct and th a t erro r variance is constant, b o th unlikely to
be exactly true in practice. Secondly, the estim ate applies only to least squares
prediction and the squared erro r m easure o f accuracy, w hereas in practice we
need to be able to deal w ith other m easures o f accuracy and other prediction
rules such as robust linear regression (Section 6.5) and linear classification,
where y is binary (Section 7.2). T here are no simple analogues o f (6.38) to
cover these situations, b u t resam pling m ethods can be applied to all o f them.
In order th a t o u r discussion apply as broadly as possible, we shall use
general n o tatio n in which prediction erro r is m easured by c(y+, y +), typically
an increasing function o f |y+ y+|, and the prediction rule is y + = /i(x+, F),
where the E D F F represents the observed data. Usually n(x +>F) is an estim ate
o f the m ean response at x +, a function o f x+/? with /? an estim ate o f /?, and
the form o f this prediction rule is closely tied to the form o f c(y+,y+). We
suppose th a t the d a ta F are sam pled from distribution F, from which the
cases to be predicted are also sampled. This implies th at we are considering
x + values sim ilar to d a ta values x i ,...,x . Prediction accuracy is m easured by
the aggregate prediction error
D = D(F, F) = E + [c{ Y+, tx(X+, F)} | F],
(6.39)
(6.40)
6 Linear Regression
292
the average prediction accuracy over all possible d atasets o f size n sam pled
from F.
The m ost direct ap proach to estim ation o f A is to apply the boo tstrap
substitution principle, th a t is substituting the E D F F for F in (6.40). However,
there are o th er widely used resam pling m ethods which also m erit consideration,
in p art because they are easy to use, an d in fact the best approach involves a
com bination o f m ethods.
Apparent error
The sim plest way to estim ate D or A is to take the average prediction error
w hen the prediction rule is applied to the sam e d a ta th at was used to fit it.
This gives the apparent error, som etim es called the resubstitution error,
n
K PP = D( F, F) = n ~x ' Y ^ c { y j ,ii{xj,F)}.
7=1
(6.41)
This is n o t the sam e as the b o o tstrap estim ate A(F), which we discuss later.
It is intuitively clear th a t A app will tend to underestim ate A, because the
latter refers to prediction o f new responses. The underestim ation can be easily
A
|
checked for least squares prediction w ith squared error, when A app = n~ R S S ,
the average squared residual. If the m odel is correct with hom oscedastic
random errors, then A app has expectation a 2(l qn~ l ), w hereas from (6.37) we
know th a t A = <x2(l + qn~l ).
The difference betw een the true erro r and ap p aren t erro r is the excess error,
D( F, F) D(F,F), whose m ean is the expected excess error,
e(F) = E {D(F, F) - D(F, F)} = A(F) - E{D(F, F)},
(6.42)
293
)}>
(6-43)
jSa
w ith na the size o f Sa. T here are several variations on this estim ate, depending
on the size o f the training set, the m anner o f splitting the dataset, and the
num ber o f such splits.
The version o f cross-validation th at seems to come closest to actual use o f
o u r predictor is leave-one-out cross-validation. H ere training sets o f size n 1 are
taken, and all such sets are used, so we m easure how well the prediction rule
does when the value o f each response is predicted from the rest o f the data. If
F^j represents the n 1 observations {(xk,yk),k ^ j}, and if /u(Xy,F_; ) denotes
the value predicted for yj by the rule based on F _; , then the cross-validation
estimate o f prediction error is
n
Ac v = n~l
c{yj>
F-j)}, (6.44)
i= i
which is the average erro r when each observation is predicted from the rest o f
the sample.
In general (6.44) requires n fits o f the model, b u t for least squares linear
regression only one fit is required if we use the case-deletion result (Problem 6.2)
~
T A
Vi x j B
P - P- j = ( X TX ) ~ ' x j ^ _
where as usual hj is the leverage for the 7th case. F or squared erro r in particular
we then have
="
- ^
' 6-45>
From the natu re o f Ac v one would guess th a t this estim ate has only a small
bias, and this is so: assum ing an expansion o f the form A(F) = oq + a\ n~l +
a2n~2 + , one can verify from (6.44) th a t E(A c^) = o + a i(n I )-1 + ,
which differs from A by term s o f order n~2 unlike the expectation o f the
ap p aren t error which differs by term s o f order n_ l .
K -fold cross-validation
In general there is no reason th at training sets should be o f size n 1. For
certain m ethods o f estim ation the num ber n o f fits required for Ac v could
itself be a difficulty although not for least squares, as we have seen in
(6.45). T here is also the possibility th at the small p erturbations in fitted m odel
w hen single observations are left out m akes Ac v too variable, if fitted values
H(x,F) do n o t depend sm oothly on F o r if c(y+ ,y+ ) is n o t continuous. These
294
6 Linear Regression
Acv = R ~{
X ! c{yJ
jesv
r= 1
^v)}-
(6-46^
In principle there are (") possible splits, possibly an extrem ely large num ber,
b u t it should be adequate to take R in the range 100 to 1000. It would be in
the spirit o f resam pling to m ake the splits at random . However, consideration
should be given to balancing the splits in some way for example, it would
seem desirable th a t each case should occur w ith equal frequency over the R
assessm ent sets; see Section 9.2. D epending on the value o f nt = n m and the
num ber p o f explanatory variables, one m ight also need some form o f balance
to ensure th a t the m odel can always be fitted.
There is an efficient version o f group cross-validation th at does involve ju st
one prediction o f each response. We begin by splitting the d a ta into K disjoint
sets o f nearly equal size, w ith the corresponding sets o f case subscripts denoted
by C i , . . . , C k , say. These K sets define R = K different splits into training and
assessm ent sets, w ith S^k = Q the kt h assessm ent set and the rem ainder o f the
d a ta Stf =
|J,y* Ci the /cth training set.
F or each such
split weapply (6.43), and
then average these estim ates. The result is the K-fold cross-validation estimate
o f prediction error
n
(6.47)
where F-k{j) represents the d a ta from which the group containing the j i h
case has been deleted. N ote th a t ACvjc is equal to the leave-one-out estim ate
(6.44) when K = n. C alculation o f (6.47) requires ju st K m odel fits. Practical
experience suggests th a t a good strategy is to take K = m in{n1!1, 10}, on the
grounds th a t taking K > 10 m ay be too com putationally intensive when the
prediction rule is com plicated, while taking groups o f size at least n1/2 should
p ertu rb the d a ta sufficiently to give small variance o f the estimate.
The use o f groups will have the desired effect o f reducing variance, b u t at
the cost o f increasing bias. F or exam ple, it can be seen from the expansion
used earlier for A th a t the bias o f A Cvjc is a\{n(K l )}-1 + , which could be
substantial if K is small, unless n is very large. F ortunately the bias o f A qv ,k
can be reduced by a simple adjustm ent. In a harm less abuse o f notation, let
if n / K
=m
is an
295
(6.48)
k= 1
T his has sm aller bias th a n Acvjc and is alm ost as simple to calculate, because
it requires n o additional fits o f the model. F or a com parison betw een ACvjc
an d A acvjc in a simple situation, see Problem 6.12.
T he following algorithm sum m arizes the calculation o f AAcvji w hen the
split into groups is m ade a t random .
Algorithm 6.5 (K -fold adjusted cross-validation)
1 Fit the regression m odel to all cases, calculate predictions
m odel, an d average the values o f c(yj,yj) to get D.
2 C hoose group sizes m i,. . . ,
such th a t mi H----- + m* = n.
3 For k = 1
from th at
4 A verage the n values o f c(yj,yj) using yj from step 3(c) to give Ac vj i5 C alculate Aacvji as in (6.48) with pk = mk/n.
Bootstrap estimates
A direct ap plication o f the b o o tstrap principle to A(F) gives the estim ate
A = A(F) = E*{D(F,F*)},
w here F* denotes a sim ulated sam ple ( x j,y j) ,. . . , (x*, >) taken from the d a ta by
case resam pling. U sually sim ulation is required to approxim ate this estim ate,
as follows. F or r = 1
we random ly resam ple cases from the d ata to
obtain the sam ple (x*j,y*j) , . . . , (x*n,y'), which we represent by F*, and to this
sam ple we fit the prediction rule and calculate its predictions n ( x j , F ' ) o f the
d a ta responses yj for j = 1
The aggregate prediction erro r estim ate is
then calculated as
R
R - 1
n
Y 2 c { y j,f i{ x j,F ') } .
r= l
j=l
(6.49)
6 Linear Regression
296
eB = R
r= 1
(6.50)
j=i
T h at is, for the rth b o o tstra p sam ple we construct the prediction rule n(x, F'),
then calculate the average difference betw een the prediction errors when this
rule is applied first to the original d a ta an d secondly to the b o o tstrap sam ple
itself, an d finally average across b o o tstra p samples. We refer to the resulting
estim ate o f aggregate prediction error, Ab = $b + A app, as the bootstrap estimate
o f prediction error, given by
n
n~l E
7=1
E
r= 1
(6.51)
r= l
N ote th a t the first term o f (6.51), which is also the simple b o o tstra p estim ate
(6.49), is expressed as the average o f the contributions jR-1 ^ f = i c{yy-,
F )}
th at each original observation m akes to the estim ate o f aggregate prediction
error. These contributions are o f interest in their own right, m ost im portantly
in assessing how the perform ance o f the prediction rule changes with values
o f the explanatory variables. This is illustrated in Exam ple 6.10 below.
Hybrid bootstrap estimates
It is useful to observe th a t the naive estim ate (6.49), which is also the first term
o f (6.51), can be broken into two qualitatively different parts,
297
and
w here R - j is the n u m b er o f the R b o o tstrap sam ples F ' in which (xj ,yj ) does
n o t appear. In (6.52) yj is always predicted using d ata from which (X j , y j) is
excluded, which is analogous to cross-validation, w hereas (6.53) is sim ilar to
an a p p aren t erro r calculation because yj is always predicted using d a ta th at
contain (xj,yj).
N ow R - j / R is approxim ately equal to the constant e~l = 0.368, so (6.52) is
approxim ately p ro p o rtio n al to
A scr = n - 1E
j=1
(6'54)
J r:j out
som etim es called the leave-one-out bootstrap estimate o f prediction error. The
n o ta tio n refers to the fact th a t Abcv can be viewed as a b o o tstrap sm oothing
o f the cross-validation estim ate Acv- To see this, consider replacing the term
c {y j , n ( x j , F - j )} in (6.44) by the expectation E l j[c{yj,n(Xj,F*)}], where E lrefers to the expectation over b o o tstrap sam ples F * o f size n draw n from F-j.
T he estim ate (6.54) is a sim ulation approxim ation o f this expectation, because
o f the result n o ted in Section 3.10.1 th a t the R - j b o o tstrap sam ples in which
case j does n o t ap p ear are equivalent to random sam ples draw n from F-j.
T he sm oothing in (6.54) m ay effect a considerable reduction in variance,
com pared to Ac v , especially if c(y+, y +) is n o t continuous. B ut there will also
be a tendency tow ard positive bias. This is because the typical b o o tstrap sample
from which predictions are m ade in (6.54) includes only ab o u t (1 e~l )n =
0.632n distinct d a ta values, an d the bias o f cross-validation estim ates increases
as the size o f the train in g set decreases.
W hat we have so far is th a t the b o o tstrap estim ate o f aggregate prediction
erro r essentially involves a w eighted com bination o f Abcv and an apparent
erro r estim ate. Such a com bin atio n should have good variance properties, b u t
m ay suffer from bias. However, if we change the weights in the com bination it
m ay be possible to reduce or rem ove this bias. This suggests th at we consider
the hybrid estim ate
A w = w A b cv + (1 - w)Aapp,
(6.55)
298
6 Linear Regression
A p p a re n t
e rro r
B o o tstrap
0.632
32
16
10
2.0
3.2
3.5
3.6
3.7 (3.7)
3.8 (3.7)
4.4 (4.2)
n~l Y 2 E -j(y j -
x ] P - j ) 2>
j =i
A
w ith p _ j the least squares estim ate o f /? from a b o o tstra p sam ple w ith the j t h
case excluded. A ra th e r lengthy calculation (Problem 6.13) shows th at
E(A jjck) = c 2( l + 2 qn~l ) + 0 ( n ~ 2),
from which it follows th a t
E{wABCk + (1 - w)A app} = er2( l + 3w qn~l ) + 0 ( n ~ 2),
which agrees w ith A to term s o f o rd er n~l if w = 2/3.
It seems im possible to find an optim al choice o f w for general m easures
o f prediction erro r an d general prediction rules, b u t detailed calculations do
suggest th a t w = 1 e-1 = 0.632 is a good choice. H euristically this value
for w is equivalent to an ad justm ent for the below -average distance betw een
cases an d b o o tstra p sam ples w ithout them , com pared to w hat we expect in the
real prediction problem . T h a t the value 0.632 is close to the value 2 /3 derived
above is reassuring. T he hybrid estim ate (6.55) w ith w = 0.632 is know n as
the 0.632 estimator o f prediction error an d is denoted here by A0.632- T here is
substantial em pirical evidence favouring this estim ate, so long as the num ber
o f covariates p is n o t close to n.
Example 6.10 (Nuclear power stations) C onsider predicting the cost o f a new
pow er station based on the d a ta o f Exam ple 6.8. We base o u r prediction on
the linear regression m odel described there, so we have n(x j , F ) = x j f i , where
A
'
18 is the least squares estim ate for a m odel w ith six covariates. The estim ated
299
Figure 6.11
Components of
prediction error for
nuclear power data
based on 200 bootstrap
simulations. The top
panel shows the values
of yj n{xj,F*). The
lower left panel shows
the average error for
each case, plotted
against the residuals.
The lower right panel
shows the ratio of the
model-based to the
bootstrap prediction
standard errors.
Case
Raw residual
Case
300
6 Linear Regression
o f yj n(xj,F*) for r = 1 ,...,J ? , p lo tted against case num ber j. The variability
o f the average error corresponds to the variation o f individual observations
a b o u t their predicted values, while the variance w ithin each group reflects
param eter estim ation uncertainty. A striking feature is the small prediction
erro r for the last six pow er plants, whose variances and m eans are both small.
The lower left panel shows the average values of y_j − μ(x_j, F̂*) over the 200 simulations, plotted against the raw residuals. They agree closely, as we should expect with a well-fitting model. The lower right panel shows the ratio of the model-based prediction standard error to the bootstrap prediction standard error. It confirms that the model-based calculation described in Example 6.8 overestimates the predictive standard error for the last six plants, which have the partial turnkey guarantee. The estimated bootstrap prediction error for these plants is 0.003, while it is 0.032 for the rest. The last six cases fall into three groups determined by the values of the explanatory variables: in effect they are replicated.

It might be preferable to plot y_j − μ(x_j, F̂*) only for those bootstrap samples which exclude the jth case, and then mean prediction error would better be compared to jackknifed residuals y_j − x_j^T β̂_{−j}. For these data the plots are very similar to those we have shown.
Example 6.11 (Times on delivery suite) For a more systematic comparison of prediction error estimates in linear regression, we use data provided by E. Burns on the times taken by 1187 women to give birth at the John Radcliffe Hospital in Oxford. An appropriate linear model has response the log time spent on delivery suite and dummy explanatory variables indicating the type of labour, the use of electronic fetal monitoring, the use of an intravenous drip, the reported length of labour before arriving at the hospital, and whether or not the labour is the woman's first; seven parameters are estimated in all.

We took 200 samples of size n = 50 at random from the full data. For each of these samples we fitted the model described above, and then calculated cross-validation estimates of prediction error Δ̂_CV,K with K = 50, 10, 5 and 2 groups, the corresponding adjusted cross-validation estimates Δ̂_ACV,K, the bootstrap estimate Δ̂_B, and the hybrid estimate Δ̂₀.₆₃₂. We took R = 200 for the bootstrap calculations.

The results of this experiment are summarized in terms of estimates of the expected excess error in Table 6.10. The average apparent error and excess error were 15.7 × 10⁻² and 5.2 × 10⁻², the latter taken to be e(F) as defined in (6.42). The table shows averages and standard deviations of the differences between estimates Δ̂ and Δ̂_app. The cross-validation estimate with K = 50, the bootstrap and the 0.632 estimate have similar properties, while other choices of K give estimates that are more variable; the half-sample estimate Δ̂_CV,2 is worst. Results for cross-validation with 10 and 5 groups are almost
301
M ean
SD
M SE
B o o tstrap
0.632
50
10
4.6
1.3
0.23
5.3
1.6
0.24
5.3
1.6
0.24
6.0 (5.7)
2.3 (2.2)
0.28 (0.26)
6.2 (5.5)
2.6 (2.3)
0.30 (0.27)
9.2 (5.7)
5.4 (3.3)
0.71 (0.33)
the same. Adjustment significantly improves cross-validation when group size is not small. The bootstrap estimate is least variable, but is downwardly biased.

The final row of the table gives the conditional mean squared error, defined as (200)⁻¹ Σ_j {Δ̂_j − D_j(F̂, F)}² for each error estimate Δ̂. This measures the success of Δ̂ in estimating the true aggregate prediction error D(F̂, F) for each of the 200 samples. Again the ordinary cross-validation, bootstrap, and 0.632 estimates perform best.

In this example there is little to choose between K-fold cross-validation with 10 and 5 groups, which both perform worse than the ordinary cross-validation, bootstrap, and 0.632 estimators of prediction error. K-fold cross-validation should be used with adjustment if ordinary cross-validation or the simulation-based estimates are not feasible.
…error is average squared error. It would be a simple matter to use other prediction rules and other measures of prediction accuracy.

First we define some notation. We denote an arbitrary candidate model by M, which is one of the 2^p possible linear models. Whenever M is used as a subscript, it refers to elements of that model. Thus the n × p_M design matrix X_M contains those p_M columns of the full design matrix X that correspond to covariates included in M; the jth row of X_M is x_{Mj}^T, the least squares estimates for regression coefficients in M are β̂_M, and H_M is the hat matrix X_M(X_M^T X_M)⁻¹X_M^T that defines fitted values ŷ_M = H_M y under model M. The total number of regression coefficients in M is q_M = p_M + 1, assuming that an intercept term is always included.
Now consider prediction of single responses y₊ at each of the original design points x₁, …, x_n. The average squared prediction error using model M is

n⁻¹ Σ_{j=1}^n (y₊ⱼ − x_{Mj}^T β̂_M)²,

and its expectation under model (6.22), conditional on the data, is the aggregate prediction error

D(M) = σ² + n⁻¹ Σ_{j=1}^n (μ_j − x_{Mj}^T β̂_M)²,

where μ^T = (μ₁, …, μ_n) is the vector of mean responses for the true multiple regression model. Taking expectation over the data distribution we obtain

Δ(M) = E{D(M)} = (1 + n⁻¹q_M)σ² + n⁻¹μ^T(I − H_M)μ,    (6.56)

where μ^T(I − H_M)μ is zero only if model M is correct. The quantities D(M) and Δ(M) generalize D and Δ defined in (6.36) and (6.37).
In principle the best model would be the one that minimizes D(M), but since the model parameters are unknown we must settle for minimizing a good estimate of D(M) or Δ(M). Several resampling methods for estimating Δ were discussed in the previous subsection, so the natural approach would be to choose a good method and apply it to all possible models. However, accurate estimation of Δ(M) is not itself important: what is important is to estimate accurately the signs of differences among the Δ(M), so that we can identify which of the Δ(M) is smallest.

Of the methods considered earlier, the apparent error estimate Δ̂_app(M) = n⁻¹RSS_M was poor. Its use here is immediately ruled out when we observe that it always decreases when covariates are added to a model, so minimization always leads to the full model.
Cross-validation

One good estimate, when used with squared error, is the leave-one-out cross-validation estimate. In the present notation this is

Δ̂_CV(M) = n⁻¹ Σ_{j=1}^n {(y_j − ŷ_{Mj})/(1 − h_{Mj})}²,    (6.57)

where ŷ_{Mj} is the fitted value for model M based on all the data and h_{Mj} is the leverage for case j in model M. The bias of Δ̂_CV(M) is small, but that is not enough to make it a good basis for selecting M. To see why, note first that an expansion gives
nΔ̂_CV(M) = ε^T(I − H_M)ε + ⋯.    (6.58)

… With the cases split into a training set whose subscripts form S_{t,r} and an assessment set of size m whose subscripts form S_{v,r}, for r = 1, …, R splits, the corresponding cross-validation estimate is

Δ̂_CV(M) = R⁻¹ Σ_{r=1}^R m⁻¹ Σ_{j∈S_{v,r}} {y_j − ŷ_{Mj}(S_{t,r})}²,
where ŷ_{Mj}(S_{t,r}) = x_{Mj}^T β̂_M(S_{t,r}), and β̂_M(S_{t,r}) are the least squares estimates for coefficients in M fitted to the rth training set whose subscripts are in S_{t,r}. Note that the same R splits into training and assessment sets are used for all models. It can be shown that, provided m is chosen so that n − m → ∞ and m/n → 1 as n → ∞, minimization of Δ̂_CV(M) will give consistent selection of the true model as n → ∞ and R → ∞.
Bootstrap methods

Corresponding results can be obtained for bootstrap resampling methods. The bootstrap estimate of aggregate prediction error (6.51) becomes

Δ̂_B(M) = n⁻¹RSS_M + R⁻¹ Σ_{r=1}^R {n⁻¹ Σ_{j=1}^n (y_j − x_{Mj}^T β̂*_{M,r})² − n⁻¹RSS*_{M,r}},    (6.59)

where the second term on the right-hand side is an estimate of the expected excess error defined in (6.42). The resampling scheme can be either case resampling or error resampling, with x*_{Mj,r} = x_{Mj} for the latter.
It turns out that minimization of Δ̂_B(M) behaves much like minimization of the leave-one-out cross-validation estimate, and does not lead to a consistent choice of the true model as n → ∞. However, there is a modification of Δ̂_B(M), analogous to that made for the cross-validation procedure, which does produce a consistent model selection procedure. The modification is to make simulated datasets be of size n − m rather than n, such that m/n → 1 and n − m → ∞ as n → ∞. Also, we replace the estimate (6.59) by the simpler bootstrap estimate

Δ̂_B(M) = R⁻¹ Σ_{r=1}^R n⁻¹ Σ_{j=1}^n (y_j − x_{Mj}^T β̂*_{M,r})².    (6.60)
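A minimal R sketch of this consistent variant follows: resamples of size n.m < n are drawn, singular resample designs are skipped (as in the example below), and criterion (6.60) is computed for a candidate set of columns of the design matrix; the interface is an illustrative assumption:

# Criterion (6.60) with resamples of size n.m; cols indexes a candidate model
delta.B <- function(X, y, cols, n.m, R = 100) {
  n <- length(y); tot <- 0; used <- 0
  for (r in 1:R) {
    i <- sample(n, n.m, replace = TRUE)
    fit <- lm.fit(X[i, cols, drop = FALSE], y[i])
    if (any(is.na(fit$coefficients))) next          # singular resample design
    mu <- drop(X[, cols, drop = FALSE] %*% fit$coefficients)
    tot <- tot + mean((y - mu)^2); used <- used + 1
  }
  tot / used
}
# the selected model minimizes delta.B over the candidate subsets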
[Figure 6.12. Estimates of aggregate prediction error plotted against the number of covariates (0-10).]
Figure 6.12 plots the leave-one-out cross-validation estimates and the bootstrap estimates (6.60) with R = 100 of aggregate prediction error for the models with 0, 1, …, 10 covariates. The two estimates are very close, and both are minimized when six covariates are included (the six used in Examples 6.8 and 6.10). Selection of five or six covariates, rather than fewer, is quite clear-cut. These results bear out the rough rule-of-thumb that variables are selected by cross-validation if they are significant at roughly the 0.1 level.

As the previous discussion would suggest, use of corresponding cross-validation and bootstrap estimates from training sets of size 20 or less is precluded because for training sets of such sizes the models with more than five covariates are frequently unidentifiable. That is, the unbalanced nature of the covariates, coupled with the binary nature of some of them, frequently leads to singular resample designs. Figure 6.12 includes bootstrap estimates for models with up to five covariates and training set of size 16: these results were obtained by omitting many singular resamples. These rather fragmentary results confirm that the model should include at least five covariates.

A useful lesson from this is that there is a practical obstacle to what in theory is a preferred variable selection procedure. One way to try to overcome …
[Figure 6.13. Cross-validation and bootstrap estimates of aggregate prediction error for a sequence of six models fitted to ten datasets of size n = 50 with p = 5 covariates; the true model includes only two covariates. Legend: cv with resample sizes 10, 20, 30; leave-one-out cv; bootstrap with resample sizes 10, 20, 30, 50.]
…failure to include x₂ in the selected model occurred quite frequently even when using training sets of size 20. This degradation of variable selection procedures when coefficients are smaller than two standard errors is reputed to be typical.

The theory used to justify the consistent cross-validation and bootstrap procedures may depend heavily on the assumptions that the dimension of the true model is small compared to the number of cases, and that the non-zero regression coefficients are all large relative to their standard errors. It is possible that leave-one-out cross-validation may work well in certain situations where model dimension is comparable to the number of cases. This would be important, in light of the very clear difficulties of using small training sets with typical applications, such as Example 6.12. Evidently further work, both theoretical and empirical, is necessary to find broadly applicable variable selection methods.
The survival data: dose of radiation (rads) and percentage of cells surviving.

Dose:        117.5   117.5   235.0   235.0   470.0   470.0   470.0   705.0   705.0   940.0   940.0   1410    1410    1410
Survival %:  44.000  55.000  16.000  13.000  4.000   1.960   6.120   0.500   0.320   0.110   0.015   0.019   0.700   0.006
…the jackknife-after-bootstrap plots of Section 3.10.1 or similarly informative diagnostic plots, but such plots can fail to show the occurrence of multiple outliers.

For datasets with possibly multiple outliers, diagnosis is aided by initial use of a fitting method that is highly resistant to the effects of outliers. One preferred resistant method is least trimmed squares, which minimizes the sum of the m smallest squared residuals,

Σ_{j=1}^m e²_{(j)}(β),    (6.61)

where e²_{(1)}(β) ≤ ⋯ ≤ e²_{(n)}(β) are the ordered squared residuals.
…and y = log(survival percentage). The right panel of Figure 6.14 plots these variables. There is a clear outlier, case 13, at x = 1410. The least squares estimate of slope is −59 × 10⁻⁴ using all the data, changing to −78 × 10⁻⁴ with standard error 5.4 × 10⁻⁴ when case 13 is omitted. The least trimmed squares estimate of slope is −69 × 10⁻⁴. From the scatter plot it appears that heteroscedasticity may be present, so we resample cases. The effect of the outlier on the resample least squares estimates is illustrated in Figure 6.15, which plots R = 200 bootstrap least squares slopes β̂₁* against the corresponding values of Σ(x_j* − x̄*)², differentiated by the frequency with which case 13 appears in the resample. There are three distinct groups of bootstrapped slopes, with the lowest corresponding to resamples in which case 13 does not occur and the highest to samples where it occurs twice or more. A jackknife-after-bootstrap plot would clearly reveal the effect of case 13. The resampling standard error of β̂₁* is 15.3 × 10⁻⁴, but only 7.6 × 10⁻⁴ for
309
0s
15
> o
D
(0
O ) CM
O '
i co
D o
CO
C\J
CM
200
600
1000
1400
200
600
1000
1400
Dose
Dose
)2 ( x \ 0 5 ),
differentiated by
frequency of case 13
(appears zero, one or
more times), for case
resampling with
R = 200 from survival
data.
Sum of squares
samples without case 13. The corresponding resampling standard errors of the least trimmed squares slope are 20.5 × 10⁻⁴ and 18.0 × 10⁻⁴, showing both the resistance and the inefficiency of the least trimmed squares method.
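A short R sketch of the case resampling just described, recording for each resample the least squares slope and how often case 13 appears; the survival data frame (columns dose and surv) is as supplied with the boot package:

library(boot)
surv.fun <- function(data, i)
  c(coef(lm(log(surv) ~ dose, data = data[i, ]))[2], sum(i == 13))
surv.boot <- boot(survival, surv.fun, R = 200)
sd(surv.boot$t[, 1])                         # all resamples
sd(surv.boot$t[surv.boot$t[, 2] == 0, 1])    # resamples without case 13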
The salinity data: salinity (sal), lagged salinity (lag), trend indicator (trend) and river discharge (dis) for 28 biweekly periods.

Case   sal    lag    trend   dis
 1     7.6    8.2      4     23.01
 2     7.7    7.6      5     23.87
 3     4.3    4.6      0     26.42
 4     5.9    4.3      1     24.87
 5     5.0    5.9      2     29.90
 6     6.5    5.0      3     24.20
 7     8.3    6.5      4     23.22
 8     8.2    8.3      5     21.86
 9    13.2   10.1      0     22.27
10    12.6   13.2      1     23.83
11    10.4   12.6      2     25.14
12    10.8   10.4      3     22.43
13    13.1   10.8      4     21.79
14    12.3   13.1      5     22.38
15    10.4   13.3      0     23.93
16    10.5   10.4      1     33.44
17     7.7   10.5      2     24.86
18     9.5    7.7      3     22.69
19    12.0   10.0      0     21.79
20    12.6   12.0      1     22.04
21    13.6   12.1      4     21.03
22    14.1   13.6      5     21.01
23    13.5   15.0      0     25.87
24    11.5   13.5      1     26.29
25    12.0   11.5      2     22.93
26    13.0   12.0      3     21.31
27    14.1   13.0      4     20.77
28    15.1   14.1      5     21.39
Application of standard algorithms for least trimmed squares with default settings can give very different, incorrect solutions.
Robust methods

We suppose now that outliers have been isolated by diagnostic plots and set aside from further analysis. The problem now is whether or not that analysis should use least squares estimation: if there is evidence of a long-tailed error distribution, then we should downweight large deviations y_j − x_j^T β by using a robust method. Two main options for this are now described.

One approach is to minimize not sums of squared deviations but sums of absolute values of deviations, Σ|y_j − x_j^T β|, so giving less weight to those cases with the largest errors. This is the L₁ method, which generalizes the sample median estimate of a population mean and has comparable efficiency. There is no simple expression for the approximate variance of L₁ estimators.
More efficient is M-estimation, which is analogous to maximum likelihood estimation. Here the coefficient estimates β̂ for a multiple linear regression solve the estimating equation

Σ_{j=1}^n x_j ψ{(y_j − x_j^T β)/s} = 0,    (6.62)

where ψ(z) is a bounded replacement for z, and s is either the solution to a simultaneous estimating equation, or is fixed in advance. We choose the latter, taking s to be the median absolute deviation (divided by 0.6745) of the residuals from the least trimmed squares regression fit. The solution to (6.62) is obtained by iterative weighted least squares, for which least trimmed squares estimates are good starting values.
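In R these fits are available through the MASS library; the following sketch uses the salinity data of Example 6.16 for illustration, and note that rlm re-estimates the MAD scale at each iteration rather than fixing s in advance as in the text:

library(MASS)      # lqs (least trimmed squares) and rlm (M-estimation)
library(boot)      # salinity data
sal <- salinity[-16, ]                              # case 16 set aside
fit.lts <- lqs(sal ~ lag + trend + dis, data = sal, method = "lts")
fit.M <- rlm(sal ~ lag + trend + dis, data = sal, init = "lts",
             psi = psi.huber, scale.est = "MAD")    # k = 1.345 is the default
coef(fit.lts); coef(fit.M)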
With a careful choice of ψ(·), M-estimates should have smaller standard errors than least squares estimates for long-tailed distributions of random errors ε, yet have comparable standard errors should those errors be homoscedastic normal. One standard choice is ψ(z) = z min(1, c/|z|), Huber's winsorizing function, for which the coefficient estimates have approximate efficiency 95% relative to least squares estimates for homoscedastic normal errors when c = 1.345.

For large sample sizes M-estimates β̂ are approximately normal in distribution, with approximate variance

var(β̂) = σ² [E{ψ²(ε/σ)} / [E{ψ′(ε/σ)}]²] (X^T X)⁻¹    (6.63)
under homoscedasticity. A more robust, empirical variance estimate is provided by the nonparametric delta method. First, the empirical influence values are, analogous to (6.25),

l_j = n k̂⁻¹ s (X^T X)⁻¹ x_j ψ(e_j/s),    v_L = k̂⁻² (X^T X)⁻¹ X^T D X (X^T X)⁻¹,    (6.64)

where D = diag{s²ψ²(e_j/s)} and k̂ = n⁻¹ Σ_j ψ′(e_j/s); compare the form v_L = ns²(X^T X)⁻¹ Σ_j ψ²(e_j/s)/{Σ_j ψ′(e_j/s)}² of Problem 6.7. For model-based resampling the modified residuals r_j, which incorporate a leverage adjustment of (1 − h_j)^{1/2} type, play the role of errors: simulated errors are randomly sampled from the uncentred r₁, …, r_n. …
The scale estimate s* is obtained by the same method as s, but from the resample data. Studentization of β̂* − β̂ is possible, using the resample analogue of the delta method variance (6.64), or more simply just using s*.
Example 6.16 (Salinity data) In our previous look at the salinity data in Example 6.15, we identified case 16 as a clear outlier. We now set that case aside and re-analyse the linear regression with all three covariates. One objective is to determine whether or not the trend variable should be included in the model: the initial, incorrect least squares analysis suggested not.

A normal Q-Q plot of the modified residuals from the new least squares fit suggests somewhat long tails for the error distribution, so that robust methods may be worthwhile. We fit the model by four methods: least squares, Huber M-estimate (with c = 1.345), L₁, and least trimmed squares. Coefficient estimates are fairly similar under all methods, except for trend, whose coefficients are −0.17, −0.22, −0.18 and −0.08.
For further analysis we apply case resampling with R = 99. Figure 6.17 illustrates the results for estimates of the coefficient of trend. The dotted lines on the top two panels correspond to the theoretical normal approximations: evidently the standard variance approximation based on (6.63) for the Huber estimate is too low. Note also the relatively large resampling variance for the least trimmed squares estimate, part of which may be due to unconverged estimates: two resampling outliers have been trimmed from this plot.

To assess the significance of trend we apply the studentized pivot method of Section 6.3.2 with both least squares and M-estimates, studentizing by the theoretical standard error in each case. The corresponding values of z are −1.25 and −1.80, with respectively 23 and 12 smaller values of z* out of 99. So there appears to be little evidence of the need to include trend.
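A sketch of this case-resampling comparison in R, for the least squares fit; counting values of z* below z mimics the comparison above:

library(boot)
sal <- salinity[-16, ]
trend.fun <- function(data, i) {
  s <- summary(lm(sal ~ lag + trend + dis, data = data[i, ]))
  s$coefficients["trend", 1:2]        # estimate and theoretical standard error
}
sal.boot <- boot(sal, trend.fun, R = 99)
z <- sal.boot$t0[1] / sal.boot$t0[2]
z.star <- (sal.boot$t[, 1] - sal.boot$t0[1]) / sal.boot$t[, 2]
sum(z.star < z)                        # cf. the 23 smaller values out of 99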
If we checked diagnostic plots for any of the four regression fits, a question might be raised about whether or not case 5 should be included in the analysis. An alternative view of this is provided by jackknife-after-bootstrap plots (Section 3.10.1) of the four fits: such plots correspond to case-deletion resampling. As an illustration, Figure 6.18 shows the jackknife-after-bootstrap plot for the coefficient of trend in the M-estimation fit. This shows clearly that case 5 has an appreciable effect on the resampling distribution, and that its omission would give tighter confidence limits on the coefficient. It also raises …
[Figure 6.17. Resampling results for the estimated coefficient of trend under the four fitting methods. Figure 6.18. Jackknife-after-bootstrap plot for the coefficient of trend in the M-estimation fit.]
6.7 Problems
1 Show that for a multivariate distribution with mean vector μ and variance matrix Ω, the influence functions for the sample mean and variance are respectively

L_μ(z) = z − μ,    L_Ω(z) = (z − μ)(z − μ)^T − Ω.

Hence show that for the linear regression model derived as the conditional expectation E(y | X = x) of a multivariate CDF F, the empirical influence function values for the linear regression parameters are

l_β(x_j, y_j) = n(X^T X)⁻¹ x_j e_j,

where X is the matrix of explanatory variables. (Sections 2.7.2, 6.2.2)
2 For homogeneous data as in Chapter 2, the empirical influence values for an estimator can be approximated using case-deletion values. Use the matrix identity

(X^T X − x_j x_j^T)⁻¹ = (X^T X)⁻¹ + (X^T X)⁻¹ x_j x_j^T (X^T X)⁻¹ / {1 − x_j^T(X^T X)⁻¹ x_j}

to show that in the linear regression model with least squares fitting,

β̂ − β̂_{−j} = (X^T X)⁻¹ x_j e_j / (1 − h_j).

Compare this to the corresponding empirical influence value in Problem 6.1, and obtain the jackknife estimates of the bias and variance of β̂. (Sections 2.7.3, 6.2.2, 6.4)
3 For the linear regression model y_j = x_j β + ε_j, with no intercept, show that the least squares estimate of β is β̂ = Σx_j y_j / Σx_j². Define residuals by e_j = y_j − x_j β̂. If the resampling model is y_j* = x_j β̂ + ε_j*, with ε* randomly sampled from the e_j, show that the resample estimate β̂* has mean and variance respectively

β̂ + ē Σx_j / Σx_j²,    (n⁻¹ Σ e_j² − ē²) / Σ x_j²,

where ē is the average of the e_j.

4 For the straight-line regression model fitted by least squares, define

v* = {Σ(y_j − ȳ)² − β̂₁*² Σ(x_j − x̄)²} / {(n − 2) Σ(x_j − x̄)²},
v = {Σ(y_j − ȳ)² − β̂₁² Σ(x_j − x̄)²} / {(n − 2) Σ(x_j − x̄)²}.

Hence show that in the permutation test for zero slope, the R values of β̂₁* are in the same order as those of β̂₁*/v*^{1/2}, and that β̂₁* > β̂₁ is equivalent to β̂₁*/v*^{1/2} > β̂₁/v^{1/2}. This confirms that the P-value of the permutation test is unaffected by studentizing. (Section 6.2.5)
5 … y_j*, j = 1, …, n. Show that under this proposal β̂* has mean β̂ and variance equal to the robust variance estimate (6.26). Examine, theoretically or through numerical examples, to what extent the skewness of β̂* matches the skewness of β̂. (Section 6.3.1; Hu and Zidek, 1995)
6 For the linear regression model y = Xβ + ε, the improved version of the robust estimate of variance for the least squares estimates β̂ is

v_rob = (X^T X)⁻¹ X^T diag(r₁², …, r_n²) X (X^T X)⁻¹,

where r_j is the jth modified residual. If the errors have equal variances, then the usual variance estimate v = s²(X^T X)⁻¹ would be appropriate, and v_rob could be quite inefficient. To quantify this, examine the case where the random errors ε_j are independent N(0, σ²). Show first that E(r_j²) = σ². Hence show that the efficiency of the ith diagonal element of v_rob relative to the ith diagonal element of v, as measured by the ratio of their variances, is

b_ii² / {(n − p) g_i^T Q g_i},

where b_ii is the ith diagonal element of (X^T X)⁻¹, g_i^T = (d_{i1}², …, d_{in}²) with D = (X^T X)⁻¹ X^T, and Q has elements (1 − h_{jk})²/{(1 − h_j)(1 − h_k)}. Calculate this relative efficiency for a numerical example. (Sections 6.2.4, 6.2.6, 6.3.1; Hinkley and Wang, 1991)
7 The statistical function β(F) for M-estimation is defined by the estimating equation

∫ x ψ{(y − x^T β(F))/σ(F)} dF(x, y) = 0,

where σ(F) is typically a robust scale parameter. Assume that the model contains an intercept, so that the covariate vector x includes the dummy variable 1. Use the technique of Problem 2.12 to show that the influence function for β(F) is

L_β(x, y) = {∫ x x^T ψ′(ε) dF(x, y)}⁻¹ σ x ψ(ε),

where ψ′(u) = dψ(u)/du, whose empirical version is

l_β(x_j, y_j) = n k̂⁻¹ s (X^T X)⁻¹ x_j ψ(e_j/s),

where X is the usual covariate matrix and k̂ is the empirical version of k = E{ψ′(ε)}. Use this to verify the variance approximation

v_L = ns² (X^T X)⁻¹ Σ_j ψ²(e_j/s) / {Σ_j ψ′(e_j/s)}²,

where e_j = y_j − x_j^T β̂ and s is the estimated scale parameter. (Section 6.5)
8 Given raw residuals e₁, …, e_n, define independent random variables ε_j* by (6.21). Show that the first three moments of ε_j* are 0, e_j², and e_j³.

(a) Let e₁, …, e_n be raw residuals from the fit of a linear model y = Xβ + ε, and define bootstrap data by y* = Xβ̂ + ε*, where the elements of ε* are generated according to the wild bootstrap. Show that the bootstrap least squares estimates β̂* take at most 2ⁿ values, and that

E*(β̂*) = β̂,    var*(β̂*) = v_wild = (X^T X)⁻¹ X^T W X (X^T X)⁻¹,

where W = diag(e₁², …, e_n²).

… where m_r = n⁻¹ Σ(x_j − x̄)^r. Hence show that if the x_j are uniformly spaced and the errors have equal variances, the wild bootstrap variance estimate is too small by a factor of about 1 − 14/(5n).

(d) Show that if the e_j are replaced by r_j, the difficulties in (b) and (c) do not arise. (Sections 6.2.4, 6.2.6, 6.3.2)
9 Suppose that responses y₁, …, y_n with n = 2m correspond to m independent samples of size two, where the ith sample comes from a population with mean μ_i, and these means are of primary interest; the m population variances may differ. Use appropriate dummy variables x_i to express the responses in the linear model y = Xβ + ε, where β_i = μ_i. With parameters estimated by least squares, consider estimating the standard error of β̂_i by case resampling.

(a) Show that the probability of getting a simulated sample in which all the parameters are estimable is …

(b) Consider constrained case resampling in which each of the m samples must be represented at least once. Show that the probability that there are r resample cases from the ith sample is …
10 For the one-way model of Problem 6.9 with two observations per group, suppose that θ = β₂ − β₁. Note that the least squares estimator of θ satisfies

θ̂ = θ + ½(ε₃ + ε₄ − ε₁ − ε₂).

Suppose that we use model-based resampling with the assumption of error homoscedasticity. Show that the resample estimate can be expressed as

θ̂* = θ̂ + ½(ε₃* + ε₄* − ε₁* − ε₂*),

where the ε_j* are randomly sampled from the 2m modified residuals ±2^{−1/2}(e_{2i} − e_{2i−1}), i = 1, …, m. Use this representation to calculate the first four resampling moments of θ̂* − θ̂. Compare the results with the first four moments of θ̂ − θ, and comment. (Section 6.3)
11 Suppose that a 2^{−r} fraction of a 2⁸ factorial experiment is run, where 1 ≤ r ≤ 4. Under what circumstances would a bootstrap analysis based on case resampling be reliable? (Section 6.3)
12 … (c) Extend the calculations in (b) to show that the adjusted estimate can be written

Δ̂_ACV,K = Δ̂_CV,K − K⁻¹(K − 1)⁻² Σ_{k=1}^K ( … )².

13 Define the bootstrap cross-validation estimate of prediction error

Δ̂_BCV = n⁻¹ Σ_{j=1}^n E*_{−j}(y_j − x_j^T β̂*_{−j})²,

where β̂*_{−j} is the least squares estimate of β from a bootstrap sample with the jth case excluded and E*_{−j} denotes expectation over such samples. To calculate the mean of Δ̂_BCV, use the substitutions

E{E*_{−j}(y_j − x_j^T β̂_{−j})²} = σ²{1 + q(n − 1)⁻¹},
E[E*_{−j}{x_j^T(β̂*_{−j} − β̂_{−j})(β̂*_{−j} − β̂_{−j})^T x_j}] = σ²q(n − 1)⁻¹ + O(n⁻²),

and show that the cross term is O(n⁻²). These results combine to show that E(Δ̂_BCV) = σ²(1 + 2qn⁻¹) + O(n⁻²), which leads to the choice w = 2/3 for the estimate Δ̂_w = wΔ̂_BCV + (1 − w)Δ̂_app. (Section 6.4; Hall, 1995)
6.8 Practicals
1 Dataset catsM contains a set of data on the heart weights and body weights of 97 male cats. We investigate the dependence of heart weight (g) on body weight (kg). To see the data, fit a straight-line regression and do diagnostic plots:

catsM
plot(catsM$Bwt, catsM$Hwt, xlim=c(0,4), ylim=c(0,24))
cats.lm <- glm(Hwt~Bwt, data=catsM)
summary(cats.lm)
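The commands defining cats.boot1 are missing from this copy; a plausible case-resampling definition, consistent with how the boot package is used in these practicals, is:

cats.fun <- function(data, i) coef(glm(Hwt ~ Bwt, data = data[i, ]))
cats.boot1 <- boot(catsM, cats.fun, R = 999)
cats.boot1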
plot(cats.boot1, jack=T)
plot(cats.boot1, index=2, jack=T)

to see a summary and plots for the bootstrapped intercepts and slopes. How normal do they seem? Is the model-based standard error from the original fit accurate? To what extent do the results depend on any single observation? We can calculate the estimated standard error by the nonparametric delta method by …
2 The data of Example 6.14 are in dataframe survival. For a jackknife-after-bootstrap plot for the regression slope β̂₁:
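The commands themselves are lost here; a sketch consistent with the boot package conventions:

surv.fun <- function(data, i) coef(lm(log(surv) ~ dose, data = data[i, ]))
surv.boot <- boot(survival, surv.fun, R = 999)
jack.after.boot(surv.boot, index = 2)       # index 2 is the slope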
…
df <- as.numeric(unlist(data.anova[1]))
res.dev <- as.numeric(unlist(data.anova[4]))
res.df <- as.numeric(unlist(data.anova[3]))
(dev[4]/df[4])/(res.dev[4]/res.df[4]) }
poison.fun(poisons)
anova(glm(time~poison*treat, data=poisons), test="F")

To apply resampling analysis, using as the null model that with main effects:
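The original commands are lost; one way to set up such a null-model resampling scheme with the boot package is sketched below, resampling residuals about the main-effects fit (the details are illustrative assumptions, not the book's own code):

pois.glm <- glm(time ~ poison + treat, data = poisons)   # null model
pois.gen <- function(data, mle) {
  data$time <- mle$fit + sample(mle$res, replace = TRUE)
  data
}
poison.boot <- boot(poisons, poison.fun, R = 199, sim = "parametric",
                    ran.gen = pois.gen,
                    mle = list(fit = fitted(pois.glm),
                               res = residuals(pois.glm)))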
…
+ct+bw+log(cum.n)+pt, data=d)
predict(d.glm, d.p) - (d.p$fit + d$res[i.p]) }
nuclear.boot.pred <- boot(nuke, nuke.pred, R=199, m=1, d.p=nuke.p)

Finally the 95% prediction intervals are obtained by
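The concluding commands are lost from this copy; since the statistic returns prediction errors, the limits can be read off from their order statistics, for example:

err <- sort(nuclear.boot.pred$t[, 1])
err[c(5, 195)]    # 0.025 and 0.975 points with R = 199
# subtracting these from the point prediction gives the upper and lower
# 95% prediction limits for the new cost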
Consider predicting the log brain weight of a mammal from its log body weight, using squared error cost. The data are in dataframe mammals. For an initial model, apparent error and ordinary cross-validation estimates of aggregate prediction error:
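The code itself is lost; a sketch using the leverage form (6.57) of leave-one-out cross-validation, with mammals as supplied in the MASS library:

library(MASS)
mam.lm <- lm(log(brain) ~ log(body), data = mammals)
app <- mean(residuals(mam.lm)^2)                    # apparent error
h <- hatvalues(mam.lm)
cv <- mean((residuals(mam.lm) / (1 - h))^2)         # cf. (6.57)
c(app, cv)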
7 Further Topics in Regression
7.1 Introduction
In Chapter 6 we showed how the basic bootstrap methods of earlier chapters extend to linear regression. The broad aim of this chapter is to extend the discussion further, to various forms of nonlinear regression models, especially generalized linear models and survival models, and to nonparametric regression, where the form of the mean response is not fully specified.

A particular feature of linear regression is the possibility of error-based resampling, when responses are expressible as means plus homoscedastic errors. This is particularly useful when our objective is prediction. For generalized linear models, especially for discrete data, responses cannot be described in terms of additive errors. Section 7.2 describes ways of generalizing error-based resampling for such models. The corresponding development for survival data is given in Section 7.3. Section 7.4 looks briefly at nonlinear regression with additive error, mainly to illustrate the useful contribution that resampling methods can make to analysis of such models. There is often a need to estimate the potential accuracy of predictions based on regression models, and Section 6.4 contained a general discussion of resampling methods for this. In Section 7.5 we focus on one type of application, the estimation of misclassification rates when a binary response y corresponds to a classification.

Not all relationships between a response y and covariates x can be readily modelled in terms of a parametric mean function of known form. At least for exploratory purposes it is useful to have flexible nonparametric curve-fitting methods, and there is now a wide variety of these. In Section 7.6 we examine briefly how resampling can be used in conjunction with some of these nonparametric regression methods.
7.2 Generalized Linear Models

… The mean and linear predictor are related by

g(μ) = η,    η = x^T β,

where g(·) is a specified monotone link function which links the mean to the linear predictor η. As before, x is a (p + 1) × 1 vector of explanatory variables associated with Y. The possible combinations of different variance functions and link functions include such things as logistic and probit regression, and log-linear models for contingency tables, without making ad hoc transformations of responses.

The first extension was touched on briefly in Section 6.2.6 in connection with weighted least squares, which plays a key role in fitting generalized linear models. The second extension, to linear models for transformed means, represents a very special type of nonlinear model.

When independent responses y_j are obtained with explanatory variables x_j, the full model is usually taken to be

E(Y_j) = μ_j,    g(μ_j) = x_j^T β,    var(Y_j) = κ c_j V(μ_j),    (7.1)
Table 7.1. Leukaemia data: x = log₁₀(white blood cell count) and survival time (weeks) for two groups of patients with acute leukaemia.

Group 1
Case:  1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17
x:     3.36  2.88  3.63  3.41  3.78  4.02  4.00  4.23  3.73  3.85  3.97  4.51  4.54  5.00  5.00  4.72  5.00
Time:  65    156   100   134   16    108   121   4     39    143   56    26    22    1     1     5     65

Group 2
Case:  18    19    20    21    22    23    24    25    26    27    28    29    30    31    32    33
x:     3.64  3.48  3.60  3.18  3.95  3.72  4.00  4.28  4.43  4.45  4.49  4.41  4.32  4.90  5.00  5.00
Time:  56    65    17    7     16    22    3     4     2     3     8     4     3     30    4     43
Example 7.1 (Leukaemia data) Table 7.1 contains data on the survival times in weeks of two groups of acute leukaemia victims, as a function of their white blood cell counts.

A simple model is that within each group survival time Y is exponential with mean μ = exp(β₀ + β₁x), where x = log₁₀(white blood cell count). Thus the link function is logarithmic. The intercept is different for each group, but the slope is assumed common, so the full model for the jth response in group i is

E(Y_ij) = μ_ij,    log(μ_ij) = β_{0i} + β₁ x_ij.

The fitted means μ̂ and the data are shown in the left panel of Figure 7.1. The mean survival times for group 2 are shorter than those for group 1 at the same white blood cell count.

Under this model the ratios Y/μ are exponentially distributed with unit mean, and hence the Q-Q plot of y_ij/μ̂_ij against exponential quantiles in the right panel of Figure 7.1 would ideally be a straight line. Systematic curvature might indicate that we should use a gamma density with index ν,

f(y; μ, ν) = ν^ν y^{ν−1} / {Γ(ν) μ^ν} exp(−νy/μ),    y > 0,  μ, ν > 0,

for which the dispersion parameter is κ = 1/ν.
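For illustration, such a model can be fitted as a gamma generalized linear model in R; the data frame leuk.df with columns time, x = log₁₀(white blood cell count) and group (a factor) is an assumed layout, not an object from the text:

leuk.glm <- glm(time ~ group + x, family = Gamma(link = "log"), data = leuk.df)
summary(leuk.glm)                         # dispersion estimated as in (7.6)
z <- leuk.df$time / fitted(leuk.glm)      # ratios y/muhat, ideally ~ Exp(1)
plot(qexp(ppoints(length(z))), sort(z)); abline(0, 1)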
The estimates β̂ satisfy the estimating equation

Σ_{j=1}^n x_j (y_j − μ_j) / {c_j V(μ_j) ġ(μ_j)} = 0,    (7.2)

where ġ(μ) = dη/dμ is the derivative of the link function. Because the dispersion parameters are taken to have the form κc_j, the estimate β̂ does not depend on κ. Note that although the estimates are derived as maximum likelihood estimates, their values depend only upon the regression relationship as expressed by the assumed variance function and the link function and choice of covariates.

The usual method for solving (7.2) is iterative weighted least squares, in which at each iteration the adjusted responses z_j = η_j + (y_j − μ_j)ġ(μ_j) are regressed on the x_j with weights w_j given by

w_j⁻¹ = c_j V(μ_j) ġ²(μ_j);    (7.3)

all these quantities are evaluated at the current values of the estimates. The weighted least squares equation (6.27) applies at each iteration, with y replaced by the adjusted dependent variable z. The approximate variance matrix for β̂
is given by the analogue of (6.24), namely

var(β̂) = κ(X^T W X)⁻¹,    (7.4)

and leverage values h_j may be defined as the diagonal elements of the hat matrix

H = X(X^T W X)⁻¹ X^T W,    (7.5)

with all quantities evaluated at the fitted values and final weights W.
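A compact sketch of one such iteration in R; the variance function V, inverse link g.inv and derivative g.dot are supplied by the caller, and the c_j are taken equal to one:

iwls.step <- function(X, y, beta, V, g.inv, g.dot) {
  eta <- drop(X %*% beta)
  mu <- g.inv(eta)
  z <- eta + (y - mu) * g.dot(mu)       # adjusted responses
  w <- 1 / (V(mu) * g.dot(mu)^2)        # weights, cf. (7.3)
  lm.wfit(X, z, w)$coefficients
}
# iterate to convergence from a starting value of beta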
When the dispersion parameter κ is unknown, it is usually estimated by the analogue of the residual mean square,

κ̂ = (n − p − 1)⁻¹ Σ_{j=1}^n (y_j − μ̂_j)² / {c_j V(μ̂_j)}.    (7.6)
The fit of a model is often summarized by the deviance

D = 2κ Σ_{j=1}^n {ℓ_j(y_j) − ℓ_j(μ̂_j)},    (7.7)

which is the scaled difference between the maximized log likelihoods for the saturated model, which has a parameter for each observation, and the fitted model. The deviance corresponds to the residual sum of squares in the analysis of a linear regression model. For example, there are large reductions in the deviance when important explanatory variables are added to a model, and competing models may be compared via their deviances. When the fitted model is correct, the scaled deviance κ⁻¹D will sometimes have an approximate chi-squared distribution on n − p − 1 degrees of freedom, analogous to the rescaled residual sum of squares in a normal linear model.
Significance tests

Individual coefficients can be tested using studentized estimates, with standard errors estimated using (7.4), with κ replaced by the estimate κ̂ if necessary. The null distributions of these studentized estimates will be approximately standard normal, but the accuracy of this approximation can be open to question. Allowance for estimation of κ can be made by using the t distribution with n − p − 1 degrees of freedom. …

… (7.8)

…,    j = 1, …, n,    (7.9)

… (7.10)
where d_j = d(y_j, μ̂_j) is the signed square root of the scaled deviance contribution due to the jth case, the sign being that of y_j − μ̂_j. The deviance residual is d_j. Definition (7.7) implies that

d_j = sign(y_j − μ̂_j) [2{ℓ_j(y_j) − ℓ_j(μ̂_j)}]^{1/2}.

When ℓ is the normal log likelihood and κ = σ² is unknown, D is scaled by κ̂ = s² rather than κ before defining d_j. Similarly for the gamma log likelihood; see Example 7.2. In practice standardized deviance residuals

r_Dj = d_j / (1 − h_j)^{1/2}    (7.11)
are used. …

For model-based resampling, one approach works on the response scale, setting

y_j* = μ̂_j + {κ̂ c_j V(μ̂_j)}^{1/2} ε_j*,    j = 1, …, n,    (7.12)

where ε₁*, …, ε_n* is a bootstrap sample from the standardized Pearson residuals. A second approach works on the linear predictor scale, setting

y_j* = g⁻¹(x_j^T β̂ + κ̂^{1/2} w_j^{−1/2} ε_j*),    j = 1, …, n,    (7.13)

where g⁻¹(·) is the inverse link function and ε₁*, …, ε_n* is a bootstrap sample from the residuals r_{L1}, …, r_{Ln} defined at (7.10). Here the residuals should not be mean-adjusted unless g(·) is the identity link, in which case r_Lj = r_Pj and the two schemes (7.12) and (7.13) are the same.

A third approach is to use the deviance residuals as surrogate errors. If the deviance residual d_j is written as d(y_j, μ̂_j), then imagine that corresponding random errors ε_j are defined by ε_j = d(y_j, μ_j). The distribution of these ε_j
can be estimated by the empirical distribution of the d_j, and simulated responses obtained by solving

d(y_j*, μ̂_j) = ε_j*,    j = 1, …, n,    (7.14)

with the ε_j* sampled from d₁, …, d_n. This also gives the method of Section 6.2.3 for linear models, except for the mean adjustment of residuals.

None of these three methods is perfect. One obvious drawback is that they can all give negative or non-integer values of y* when the original data are non-negative integer counts. A simple fix for discrete responses is to round the value of y_j* from (7.12), (7.13), or (7.14) to the nearest appropriate value. For count data this is a non-negative integer, and if the response is a proportion with denominator m, it is a number in the set 0, 1/m, 2/m, …, 1. However, rounding can appreciably increase the proportion of extreme values of y* for a case whose fitted value is near the end of its range.

A similar difficulty can occur when responses are positive with V(μ) = κμ², as in Example 7.1. The Pearson residuals are κ^{−1/2}(y_j − μ̂_j)/μ̂_j, all necessarily greater than −κ^{−1/2}. But the standardized versions r_Pj are not so constrained, so that the result y_j* = μ̂_j(1 + κ̂^{1/2}ε_j*) from applying (7.12) can be negative. The obvious fix is to truncate y_j* at zero, but this may distort the distribution of y*, and so is not generally recommended.
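As a concrete sketch, here is scheme (7.12) for a fitted Poisson model in R, with the rounding and truncation just described; fit is assumed to be any Poisson glm object with the c_j equal to one:

sim.pois <- function(fit) {
  mu <- fitted(fit)
  kap <- summary(fit)$dispersion
  h <- hatvalues(fit)
  rP <- (fit$y - mu) / sqrt(kap * mu * (1 - h))   # standardized Pearson residuals
  eps <- sample(rP, length(mu), replace = TRUE)
  pmax(round(mu + sqrt(kap * mu) * eps), 0)       # round, truncate at zero
}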
Example 7.2 (Leukaemia data) For the data introduced in Example 7.1 the parametric model is gamma with log likelihood contributions

ℓ_ij(μ_ij) = −κ⁻¹{log(μ_ij) + y_ij/μ_ij},

and the regression is additive on the logarithmic scale, log(μ_ij) = β_{0i} + β₁x_ij. The deviance for the fitted model is D = 40.32 with 30 degrees of freedom, and equation (7.6) gives κ̂ = 1.09. The deviance residuals are calculated with κ set equal to κ̂,

d_ij = sign(z_ij − 1){2κ̂⁻¹(z_ij − 1 − log z_ij)}^{1/2},

where z_ij = y_ij/μ̂_ij; the corresponding Pearson residuals are κ̂^{−1/2}(z_ij − 1). The z_ij would be approximately a sample from the standard exponential distribution if in fact κ = 1, and the right-hand panel of Figure 7.1 suggests that this is a reasonable assumption.
Our basic parametric model for these data sets κ = 1 and puts Y = με, where ε has an exponential distribution with unit mean. Hence the parametric bootstrap involves simulating exponential data from the fitted model, that is setting y* = μ̂ε*, where ε* is standard exponential. A slightly more cautious …
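A sketch of this parametric scheme with the boot package, continuing the illustrative leuk.df and leuk.glm objects assumed earlier:

leuk.fun <- function(data)
  coef(glm(time ~ group + x, family = Gamma(link = "log"), data = data))
leuk.gen <- function(data, mle) { data$time <- mle * rexp(nrow(data)); data }
leuk.boot <- boot(leuk.df, leuk.fun, R = 999, sim = "parametric",
                  ran.gen = leuk.gen, mle = fitted(leuk.glm))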
Confidence limits for β₀ and β₁ for the leukaemia data under four resampling schemes.

                          β₀                β₁
                      Lower   Upper     Lower   Upper
Exponential            5.16   11.12     −1.42   −0.04
Linear predictor, r_L  3.61   10.58     −1.53    0.17
Deviance, r_D          5.00   11.10     −1.46    0.02
Cases                  0.31    8.78     −1.37    0.81
[Table 7.3. Empirical coverages (%) of confidence intervals for β₀, β₁, ψ₁ and ψ₂ in four simulation experiments, comparing standard, normal, percentile, BCa, basic and studentized methods under case resampling and model-based resampling with r_L or r_P and with r_D.]
The third experiment used the same design matrix as the first two, but linear predictor η = β₀ + β₁x, with β₀ = β₁ = 2 and Poisson responses with mean μ = exp(η). The fourth experiment used the same means as the third, but had negative binomial responses with variance function μ + μ²/10. The bootstrap schemes for these two experiments were case resampling and model-based resampling using (7.12) and (7.14).

Table 7.3 shows that while all the methods tend to undercover, the standard method can be disastrously bad when the random part of the fitted model is incorrect, as in the second and fourth experiments. The studentized method generally does better than the basic method, but the BCa method does not improve on the percentile intervals. Thus here a more sophisticated method does not necessarily lead to better coverage, unlike in Section 5.7, and in particular there seems to be no reason to use the BCa method. Use of the studentized interval on another scale might improve its performance for the ratio ψ₂, for which the simpler methods seem best. As far as the resampling schemes are concerned, there seems to be little to choose between the model-based schemes, which improve slightly on bootstrapping cases, even when the fitted variance function is incorrect.
We now consider an important caveat to these general comments.

Inhomogeneous residuals

For some types of data the standardized Pearson residuals may be very inhomogeneous. If y is Poisson with mean μ, for example, the distribution of (y − μ)/μ^{1/2} is strongly positively skewed when μ < 1, but it becomes increasingly symmetric as μ increases. Thus when a set of data contains both large and small counts, it is unwise to treat the r_P as exchangeable. One possibility for such data is to apply (7.12) but with fitted values stratified by the estimated skewness of their residuals.
… The responses have means μ_ij, approximate variances m_ij⁻¹ V(μ_ij), and a linear predictor involving block and variety effects α_i + β_j. Interest focuses on the varieties with small values of β_j, which are likely to be the most resistant to the disease.

For an adequate fit, the deviance would roughly be distributed according to a chi-squared distribution on its degrees of freedom; in fact it is 1142.8. This indicates severe overdispersion relative to the model.
339
o
o
<o
Q.
CO
Q.
&1
c
<0
>
10
20
30
40
Variety
eta
The left panel of Figure 7.3 shows estimated variety effects for block 1. Varieties 1 and 3 are least resistant to the disease, while variety 31 is most resistant. The right panel shows the residuals plotted against linear predictors. The skewness of the r_P drops as η increases.
Parametric simulation involves generating binomial observations from the fitted model. This greatly overstates the precision of conclusions, because this model clearly does not reflect the variability of the data. We could instead use the beta-binomial distribution. Suppose that, conditional on π, a response is binomial with denominator m and probability π, but instead of being fixed, π is taken to have a beta distribution. The resulting response has unconditional mean and variance

mΠ,    mΠ(1 − Π){1 + (m − 1)φ},    (7.15)

where Π = E(π) and φ > 0 controls the degree of overdispersion. Parametric simulation from this model is discussed in Problem 7.5.
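A one-line simulator consistent with (7.15), writing the beta parameters in terms of Π and φ (a sketch, not necessarily the parametrization used in Problem 7.5 itself):

rbetabin <- function(n, m, PI, phi) {
  a <- PI * (1 - phi) / phi              # beta parameters with mean PI
  b <- (1 - PI) * (1 - phi) / phi        # and var(pi) = PI * (1 - PI) * phi
  rbinom(n, m, rbeta(n, a, b))
}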
Two variance functions for overdispersed binomial data are V₁(π) = φπ(1 − π), with φ > 1, and V₂(π) = π(1 − π){1 + (m − 1)φ}, with φ > 0. The first of these gives common overdispersion for all the observations, while the second allows proportionately greater spread when m is larger. We use the first, for which φ̂ = 8.3, and perform nonparametric simulation using (7.12). The simulated responses are rounded to the nearest integer in 0, 1, …, m.
The left panel of Figure 7.4 shows box plots of the ratio of deviance to degrees of freedom for 200 simulations from the binomial model, from the beta-binomial model, for nonparametric simulation by (7.12), and for (7.12) but with residuals stratified into three groups: the fifteen varieties with the smallest values of β̂_j, the middle fifteen, and the fifteen largest. The dotted line shows the observed ratio. The binomial results are clearly quite inappropriate, those for the beta-binomial and unstratified simulation are better, and those for the stratified simulation are best.
quite inappropriate, those for the beta-binom ial an d unstratified sim ulation
are better, an d those for the stratified sim ulation are best.
To explain this, we retu rn to the right panel o f Figure 7.3. This shows th a t the
residuals are n o t hom ogeneous: residuals for observations with sm all values
o f rj are m ore positively skewed th a n those for larger values. This reflects the
varying skewness o f binom ial data, which m ust be taken into account in the
resam pling scheme.
The right panel o f Figure 7.4 shows the estim ated variety effects for the
200 sim ulations from the stratified sim ulation. Varieties 1 and 3 are m uch less
resistant th a n the others, b u t variety 31 is not m uch m ore resistant th an 11,
18, and 23; o th er varieties are close behind. As m ight be expected, results for
the binom ial sim ulation are m uch less variable. T he unstratified resam pling
scheme gives large negative estim ated variety effects, due to inappropriately
large negative residuals being used. To explain this, consider the right panel o f
Figure 7.3. In effect the unstratified scheme allows residuals from the right h alf
o f the panel to be sam pled an d placed at its left-hand end, leading to negative
sim ulated responses th a t are rounded u p to zero: the varieties for which this
happens seem spuriously resistant.
Finer stratification o f the residuals seems unnecessary for this application.
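A sketch of the stratified scheme in R; the arguments mu, rP, kap and V.mu are the fitted values, standardized Pearson residuals, dispersion estimate and variance function values, and eta is the linear predictor used to form the strata:

sim.strat <- function(mu, rP, kap, V.mu, eta, K = 3) {
  strat <- cut(eta, K)                   # strata of increasing linear predictor
  eps <- numeric(length(rP))
  for (s in levels(strat)) {
    i <- strat == s
    eps[i] <- sample(rP[i], sum(i), replace = TRUE)   # resample within stratum
  }
  pmax(round(mu + sqrt(kap * V.mu) * eps), 0)
}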
7.2.4 Prediction

In Section 6.3.3 we showed how to use resampling methods to obtain prediction intervals based on a linear regression fit. The same idea can be applied here.

… The prediction limits for y₊ are obtained by solving

δ(y₊, μ̂₊) = δ*₊,((RM+1)α),    δ(y₊, μ̂₊) = δ*₊,((RM+1)(1−α)).

In principle any of the resampling methods in Section 7.2.3 could be used. In practice homoscedasticity of these prediction-error quantities is important, and should be checked.
Example 7.4 (AIDS diagnoses) … For each row j the quantity to be predicted is the unobserved total y₊ⱼ = Σ_k y_jk, whose bootstrap analogue has mean

μ̂*₊ⱼ = exp(α̂*ⱼ) Σ_k exp(β̂*ₖ),    j = 1, …, 38,

where the summation is over the cells of row j for which y_jk was unobserved; this is step 2. Note that y*₊ⱼ is equivalent to the results of steps 3(a) and 3(b) with M = 1.

We take δ(y, μ) = (y − μ)/μ^{1/2}, corresponding to Pearson residuals for the Poisson distribution. This means that step 3(c) involves setting

δ*₊ⱼ = (y*₊ⱼ − μ̂*₊ⱼ) / μ̂*₊ⱼ^{1/2}.
We repeat this R times, to obtain ordered values δ*₊ⱼ(1) ≤ ⋯ ≤ δ*₊ⱼ(R) for each j. The final step is to obtain the bootstrap upper and lower limits for y₊ⱼ by solving the equations

(y₊ⱼ − μ̂₊ⱼ)/μ̂₊ⱼ^{1/2} = δ*₊ⱼ,((R+1)α),    (y₊ⱼ − μ̂₊ⱼ)/μ̂₊ⱼ^{1/2} = δ*₊ⱼ,((R+1)(1−α)),

whose solutions are the prediction limits for y₊ⱼ.
[Table 7.4. AIDS diagnoses in England and Wales, by quarter of diagnosis (1983-1992) and delay between diagnosis and report; the final column gives total reports to the end of 1992. Reporting-delay cells for recent quarters are unobserved.]
[Figure 7.5. Results from the fit of a Poisson two-way layout to the AIDS data. The left panel shows predicted diagnoses (solid), together with the actual totals to the end of 1992 (+), over 1984-1992. The right panel shows standardized Pearson residuals plotted against estimated skewness, μ̂^{−1/2}; the vertical lines are at skewness 0.6 and 1.]
[Figure 7.6. Left panel: ratios of deviance to degrees of freedom for simulations from the fitted models (Poisson, negative binomial, nonparametric). Right panel: pointwise 95% prediction intervals for quarterly AIDS diagnoses.]
Table 7.5. 95% prediction intervals for the total AIDS diagnoses in the last quarters of 1990, 1991 and 1992.

                             1990       1991       1992
Poisson                    296-315    294-327    356-537
Negative binomial          294-318    289-333    317-560
Nonparametric              294-318    289-333    314-547
Stratified nonparametric   292-319    288-335    310-571
are much less dispersed than the original data, for which the ratio is 716.5/413. The negative binomial simulation gives more appropriate results, which seem rather similar to those for nonparametric simulation without stratification. When stratification is used, the results mimic the overdispersion much better.

The pointwise 95% prediction intervals for the numbers of AIDS diagnoses are shown in the right panel of Figure 7.6. The intervals for simulation from the fitted Poisson model are considerably narrower than the intervals from resampling residuals, both of which are similar. The intervals for the last quarters of 1990, 1991, and 1992 are given in Table 7.5.

There is little change if intervals are based on the deviance residual formula for the Poisson distribution, δ(y, μ) = [2{y log(y/μ) + μ − y}]^{1/2}.

A serious drawback with this analysis is that predictions from the two-way layout model are very sensitive to the last few rows of the table, to the extent that the estimate for the last row is determined entirely by the bottom left cell. Some sort of temporal smoothing is preferable, and we reconsider these data in Example 7.12.
7.3 Survival Data

… the Weibull distribution, with distribution function

F(y; λ, κ) = 1 − exp{−(y/λ)^κ},    y > 0.    (7.16)
[Table: times to breakdown of specimens at each of four voltages (5, 7, 10 and 15 kV); entries marked > (e.g. >9104.25) are censored.]
in the figure we renormalize the log likelihood to have maximum zero. Under the standard large-sample likelihood asymptotics outlined in Section 5.2.1, the approximate distribution of the likelihood ratio statistic

W(θ) = 2{ℓ_prof(θ̂) − ℓ_prof(θ)}

is χ²₁, so a 1 − α confidence set for the true θ is the set such that

ℓ_prof(θ) ≥ ℓ_prof(θ̂) − ½c_{1,1−α},

where c_{ν,p} is the p quantile of the χ²_ν distribution.
[Figure: profile log likelihood plots and a chi-squared Q-Q plot of the simulated likelihood ratio statistic; axes include voltage, θ and chi-squared quantiles.]
Table 7.7. Bias estimates and standard errors for the model parameters, from standard first-order likelihood theory and from parametric and nonparametric resampling.

                       Likelihood         Parametric         Nonparametric
Parameter   MLE      Bias     SE        Bias     SE        Bias     SE
β₀         6.346      0      0.117     0.007    0.117     0.001    0.112
β₁         1.958      0      0.082     0.007    0.082     0.006    0.080
β₂         4.383      0      0.850     0.127    0.874     0.109    0.871
β₃         1.235      0      0.388     0.022    0.393     0.022    0.393
x₀         4.758      0      0.029    −0.004    0.030    −0.002    0.028
mean 1.12, and their 0.95 quantile is w*_{(950)} = 4.09, to be compared with c_{1,0.95} = 3.84. This gives as bootstrap calibrated 95% confidence interval the set of θ such that ℓ_prof(θ) ≥ ℓ_prof(θ̂) − ½ × 4.09, that is [19.62, 36.12], which is slightly wider than the standard interval.

Table 7.7 compares the bias estimates and standard errors for the model parameters using the parametric bootstrap described above and standard first-order likelihood theory, under which the estimated biases are zero, and the variance estimates are obtained as the diagonal elements of the inverse observed information matrix evaluated at the MLEs (here the observed information is minus the matrix of second derivatives of the log likelihood with respect to θ and β). The estimated biases are small but significantly different from zero. The largest differences between the standard theory and the bootstrap results are for β₂ and β₃, for which the biases are of order 2-3%. The threshold parameter x₀ is well determined; the standard 95% confidence interval based on its asymptotic normal distribution is [4.701, 4.815], whereas the normal interval with estimated bias and variance is [4.703, 4.820].

A model-based nonparametric bootstrap may be performed by using residuals e = (y/λ̂)^κ̂, three of which are censored, then resampling errors ε* from their product-limit estimate, and then making uncensored bootstrap observations λ̂ε*^{1/κ̂}. The observations with x = 5 are then modified as outlined above, and the model refitted to the resulting data. The product-limit estimate for the residuals is very close to the survivor function of the standard exponential distribution, so we expect this to give results similar to the parametric simulation, and this is what we see in Table 7.7.

For censoring at a pre-determined time c, the simulation algorithms would work as described above, except that values of y* greater than c would be replaced by c and the corresponding censoring indicators d* set equal to zero. The number of censored observations in each simulated dataset would then be random; see Practical 7.3.

Plots show that the simulated MLEs are close to normally distributed: in this case standard likelihood theory works well enough to give good confidence intervals for the parameters. The benefit of parametric simulation is that the bootstrap estimates give empirical evidence that the standard theory can
… Under the proportional hazards model the survivor function for an individual with covariate x is

1 − F(y; β, x) = {1 − F₀(y)}^{exp(x^T β)},

where 1 − F₀(y) is the baseline survivor function for the hazard dΛ(y). The regression parameters β are usually estimated by maximizing the partial likelihood, which is the product over cases with d_j = 1 of terms

exp(x_j^T β) / Σ_{k=1}^n H(y_k − y_j) exp(x_k^T β),    (7.17)

where H(u) equals zero if u < 0 and equals one otherwise. Since (7.17) is unaltered by recentring the x_j, we shall assume below that Σ x_j = 0; the baseline hazard then corresponds to the average covariate value x̄ = 0.

In terms of the estimated regression parameters the baseline cumulative hazard function is estimated by the Breslow estimator

Λ̂(y) = Σ_{j: y_j ≤ y} d_j / {Σ_{k=1}^n H(y_k − y_j) exp(x_k^T β̂)},    (7.18)

a non-decreasing function that jumps at y_j by

dΛ̂(y_j) = d_j / {Σ_{k=1}^n H(y_k − y_j) exp(x_k^T β̂)}.

One standard estimator of the baseline survivor function is

1 − F̂₀(y) = Π_{j: y_j ≤ y} {1 − dΛ̂(y_j)},    (7.19)

which generalizes the product-limit estimate (3.9), although other estimators also exist. Whichever of them is used, the proportional hazards assumption implies that

{1 − F̂₀(y)}^{exp(x_j^T β̂)}
will be the estimated survivor function for an individual with covariate values x_j.

Under the random censorship model, the survivor function of the censoring distribution G is given by (3.11).

The bootstrap methods for censored data outlined in Section 3.5 extend straightforwardly to this setting. For example, if the censoring distribution is independent of the covariates, we generate a single sample under the conditional sampling plan according to the following algorithm.
Algorithm 7.2 (Conditional resampling for censored survival data)

For j = 1, …, n:

1. generate Y_j^{0*} from the estimated survivor function {1 − F̂₀(y)}^{exp(x_j^T β̂)};

2. if d_j = 0, set C_j* = y_j, and if d_j = 1, generate C_j* from the conditional censoring distribution given that C_j > y_j, namely {Ĝ(y) − Ĝ(y_j)}/{1 − Ĝ(y_j)}; then

3. set Y_j* = min(Y_j^{0*}, C_j*), with D_j* = 1 if Y_j* = Y_j^{0*} and zero otherwise.
Under the more general model where the distribution G of C also depends upon the covariates and a proportional hazards assumption is appropriate for G, the estimated censoring survivor function when the covariate is x is

1 − Ĝ(y; γ̂, x) = {1 − Ĝ₀(y)}^{exp(x^T γ̂)},

where Ĝ₀(y) is the estimated baseline censoring distribution given by the analogues of (7.18) and (7.19), in which 1 − d_j and γ̂ replace d_j and β̂. Under model-based resampling, a bootstrap dataset is then obtained by

Algorithm 7.3 (Resampling for censored survival data)

For j = 1, …, n:

1. generate Y_j^{0*} from {1 − F̂₀(y)}^{exp(x_j^T β̂)}; …
… 1 − F₂(y; β, x) = {1 − F₂(y)}^{exp(x^T β)} …
The sharp increase in risk for small thicknesses is clearly a genuine effect, while beyond 3 mm the confidence interval for the linear predictor is roughly [0, 1], with thickness having little or no effect.

Results from model-based resampling using the fitted model and applying Algorithm 7.3, and from conditional resampling using Algorithm 7.2, are also shown; they are very similar to the results from resampling cases. In view of the discussion in Section 3.5, we did not apply the weird bootstrap.

The right panels of Figure 7.9 show how the estimated 0.2 quantile of the survival distribution, y₀.₂ = min{y : F̂₂(y; β̂, x) ≥ 0.2}, depends on tumour thickness. There is an initial sharp decrease from 3000 days to about 750 days as tumour thickness increases from 0-3 mm, but the estimate is roughly constant from then on. The individual estimates are highly variable, but the degree of uncertainty mirrors roughly that in the left panels. Once again results for the three resampling schemes are very similar.

Unlike the previous example, where resampling and standard likelihood methods led to similar conclusions, this example shows the usefulness of resampling when standard approaches would be difficult or impossible to apply.
y_j = μ(x_j, β) + ε_j,    j = 1, …, n,    (7.20)

[Figure 7.9. Linear predictor and estimated 0.2 quantile of survival time, plotted against tumour thickness (mm) and time (days).]
with μ(x, β) nonlinear in the parameter β, which may be vector or scalar. The linear algebra associated with least squares estimates for linear regression no longer applies exactly. However, least squares theory can be developed by linear approximation, and the least squares estimate β̂ can often be computed accurately by iterative linear fitting.

The linear approximation to (7.20), obtained by Taylor series expansion, gives

y_j − μ(x_j, β′) ≈ u_j^T(β − β′) + ε_j,    j = 1, …, n,    (7.21)
355
where
= 8y{xj,P)
W
i>-p
This defines an iteration that starts at \beta' using a linear regression least squares fit, and at the final iteration \beta' = \hat\beta. At that stage the left-hand side of (7.21) is simply the residual e_j = y_j - \mu(x_j, \hat\beta). Approximate leverage values and other diagnostics are obtained from the linear approximation, that is, using the definitions in previous sections but with the u_j evaluated at \beta' = \hat\beta as the values of explanatory variable vectors. This use of the linear approximation can give misleading results, depending upon the intrinsic curvature of the regression surface. In particular, the residuals will no longer have zero expectation in general, and standardized residuals r_j will no longer have constant variance under homoscedasticity of true errors.
The usual normal approximation for the distribution of \hat\beta is also based on the linear approximation. For the approximate variance, (6.24) applies with X replaced by U = (u_1, \ldots, u_n)^T evaluated at \hat\beta. So with s^2 equal to the residual mean square, we have approximately

\hat\beta - \beta \sim N\bigl(0, s^2(U^TU)^{-1}\bigr). \qquad (7.22)
[Figure 7.10: data and fitted curve (left panel), residuals (right panel); horizontal axes: time (minutes).]
Table 7.8.

              Estimate   Bootstrap bias   Theoretical SE   Bootstrap SE
\hat\beta_0    4.31       0.028            0.30             0.38
\hat\beta_1    0.209      0.004            0.039            0.040
The right panel of Figure 7.10 shows that homogeneity of variance is slightly questionable here, so we resample cases by stratified sampling. Estimated biases and standard errors for \hat\beta_0 and \hat\beta_1 based on 999 bootstrap replicates are given in Table 7.8. The main point to notice is the appreciable difference between theoretical and bootstrap standard errors for \hat\beta_0.

Figure 7.11 illustrates the results. Note the non-elliptical pattern of variation and the non-normality: the z-statistics are also quite non-normal. In this case the bootstrap should give better results for confidence intervals than normal approximations, especially for \hat\beta_0. The bottom right panel suggests that the parameter estimates are closer to normal on logarithmic scales.

Results for model-based resampling assuming homoscedastic errors are fairly similar, although the standard error for \hat\beta_0 is then 0.32. The effects of nonlinearity are negligible in this case: for example, the maximum absolute bias of residuals is about 0.012\sigma.

[Figure 7.11: scatter plots of the bootstrapped parameter estimates; horizontal axes: beta0.]

Suppose that we want confidence limits on some aspect of the curve, such as the proportion of maximum \pi = 1 - \exp(-\beta_1 x). Ordinarily one might
357
't
O
o
co
CO
. A.
"".4/ *
r.
iS w
<D
JO
V v W c g 1;
CO
I
*
^
b
3.5
4.0
4.5
5.0
betaO
5.5
6.0
5
betaO
approach this by applying the delta method together with the bivariate normal approximation for least squares estimates, but the bootstrap can deal with this using only the simulated parameter estimates. So consider the times x = 1, 5, 15, at which the estimates \hat\pi = 1 - \exp(-\hat\beta_1 x) are 0.188, 0.647 and 0.956 respectively. The top panel of Figure 7.12 shows bootstrap distributions of \pi^* = 1 - \exp(-\beta_1^* x): note the strong non-normality at x = 15.

The constraint that \pi must lie in the interval (0, 1) means that it is unwise to construct basic or studentized confidence intervals for \pi itself. For example, the basic bootstrap 95% interval for \pi at x = 15 is [0.922, 1.025]. The solution is to do all the calculations on the logit scale, as shown in the lower panel of Figure 7.12, and untransform the limits obtained at the end.

[Figure 7.12: bootstrap distributions of the proportion 1 - exp(-beta1*x) and of its logit, for x = 1, 5, 15.]

That is, we obtain
\frac{\exp(\hat\eta_1)}{1 + \exp(\hat\eta_1)}, \qquad \frac{\exp(\hat\eta_2)}{1 + \exp(\hat\eta_2)},
as the corresponding intervals for \pi. The resulting 95% intervals are [0.13, 0.26] at x = 1, [0.48, 0.76] at x = 5, and [0.83, 0.98] at x = 15. The standard linear theory gives slightly different values, e.g. [0.10, 0.27] at x = 1 and [0.83, 1.03] at x = 15.
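A minimal S sketch of this transform-and-untransform calculation, assuming a vector b1star of bootstrap replicates of \hat\beta_1 (a hypothetical name):

logit <- function(p) log(p/(1-p))
inv.logit <- function(e) exp(e)/(1+exp(e))
pi.ci <- function(b1star, b1hat, x, alpha=0.05)
{ eta <- logit(1 - exp(-b1star*x))            # replicates on logit scale
  eta.hat <- logit(1 - exp(-b1hat*x))
  q <- quantile(eta, c(1-alpha/2, alpha/2))   # basic interval: 2*t - quantiles
  inv.logit(2*eta.hat - q) }                  # untransform at the end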
359
otherwise.
(7.23)
otherwise.
360
[Table 7.9: estimates of aggregate prediction error (×10⁻²) for the urine data under the apparent error, cross-validation, bootstrap and 0.632 schemes; recovered values include 24.7, 22.1, 23.4, 23.4 (23.7), 20.8 (21.0), 26.0 (25.4) and 20.8 (20.8).]
Figure 7.13 Components of the 0.632 estimate of prediction error, y_j - \mu(x_j; \hat{F}^*), for the urine data, based on 200 bootstrap simulations. Values within the dotted lines make no contribution to prediction error. The components from cases 54 and 66 are the rightmost and the fourth from rightmost sets of errors shown; the components from case 27 are leftmost.
In this case \hat\Delta_{app} = 20.8 × 10⁻². Other estimates of aggregate prediction error are given in Table 7.9. For the bootstrap and 0.632 estimates, we used R = 200 bootstrap resamples.

The discontinuous nature of prediction error gives more variable results than for the examples with squared error in Section 6.4.1. In particular, the results for K-fold cross-validation now depend more critically on which observations fall into the groups. For example, the average and standard deviation of \hat\Delta_{CV,10} for 40 repeats were 23.0 × 10⁻² and 2.0 × 10⁻². However, the broad pattern is similar to that in Table 6.9.

Figure 7.13 shows box plots of the quantities y_j - \mu(x_j; \hat{F}^*) that contribute to the 0.632 estimate of prediction error, plotted against case j ordered by the residual; only three values of j are labelled. There are about 74 contributions at each value of j. Only values outwith the horizontal dotted lines contribute to prediction error. The pattern is broadly what we would expect: observations with residuals close to zero are generally well predicted, and make little contribution to prediction error. More extreme residuals contribute most to prediction error. Note cases 66 and 54, which are always misclassified; their standardized Pearson residuals are 2.13 and 2.54. The figure suggests that case
[Table: mean, standard deviation and mean squared error (×10⁻²) of bootstrap, 0.632, and K-fold (adjusted) cross-validation estimates of prediction error, for K = 50, 25, 10.]
54 is outlying. At the other end is case 27, whose residual is -1.84; this was misclassified 42 times out of 65 in our simulation.
y = \mu(x) + \varepsilon,

where \mu(x) has completely unknown form but would be assumed continuous in many applications, and \varepsilon is a random error with zero mean. A typical application is illustrated by the scatter plot in Figure 7.14. Here no simple parametric regression curve seems appropriate, so it makes sense to fit a smooth curve (which we do later in Example 7.10) with as few restrictions as possible.
Often nonparametric regression is used as an exploratory tool, either directly by producing a curve estimate for visual interpretation, or indirectly by providing a comparison with some tentative parametric model fit via a significance test. In some applications the rather different objective of prediction will be of interest. Whatever the application, the complicated nature of nonparametric regression methods makes it unlikely that probability distributions for statistics of interest can be evaluated theoretically, and so resampling methods will play a prominent role.

It is not possible here to describe all of the nonparametric regression methods that are now available, and in any event many of them do not yet have fully developed companion resampling methods. We shall limit ourselves to a brief discussion of some of the main methods, and to applications in generalized additive models, where nonparametric regression is used to extend the generalized linear models of Section 7.2.
[Figure 7.14: motorcycle impact data; acceleration against time (ms).]
\hat\mu(x) = \frac{\sum_j y_j\, w\{(x - x_j)/b\}}{\sum_j w\{(x - x_j)/b\}}, \qquad (7.24)

with w(·) a symmetric density function and b an adjustable bandwidth constant that determines how widely the averaging is done. This estimate is similar in many ways to the kernel density estimate discussed in Example 5.13, and as there the choice of b depends upon a trade-off between bias and variability of the estimate: small b gives small bias and large variance, whereas large b has the opposite effects. Ideally b would vary with x, to reflect large changes in the derivative of \mu(x) and heteroscedasticity, both evident in Figure 7.14.
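A small S sketch of the kernel smoother (7.24), here with a normal kernel (an assumption; any symmetric density would serve):

ksmooth.nw <- function(x0, x, y, b)
  sapply(x0, function(u)
  { w <- dnorm((u - x)/b)       # kernel weights for each observation
    sum(w*y)/sum(w) })          # locally weighted average (7.24)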
Modifications to the estimate (7.24) are needed at the ends of the x range, to avoid the inherent bias when there is little or no data on one side of x. In many ways more satisfactory are the local regression methods, where a local linear or quadratic curve is fitted using weights w\{(x - x_j)/b\} as above, and then \hat\mu(x) is taken to be the fitted value at x. Implementations of this idea include the lowess method, which also incorporates trimming of outliers. Again the choice of b is critical.

A different approach is to define a curve in terms of basis functions, such as powers of x which define polynomials. The fitted model is then a linear combination of basis functions, with coefficients determined by least squares regression. Which basis to use depends on the application, but polynomials are
to determine the best value of b. Then for the simulation model we use the corresponding curve with, say, 2b as the tuning constant. To try to eliminate bias from the simulation errors \varepsilon_j^*, we use residuals from an undersmoothed curve, say with tuning constant b/2. As with linear regression, it is appropriate to use modified residuals, where leverage is taken into account as in (6.9). This is possible for most nonparametric regression methods, since they are linear. Detailed asymptotic theory shows that something along these lines is necessary to make resampling work, but there is no clear guidance as to precise relative values for the tuning constants.
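As a rough S sketch of this scheme using smoothing splines: smooth.spline chooses its penalty lambda by cross-validation, and doubling and halving lambda here stand in for the 2b and b/2 above (an assumption, not a precise prescription):

fit   <- smooth.spline(x, y)                        # lambda chosen by CV
over  <- smooth.spline(x, y, lambda=2*fit$lambda)   # oversmoothed curve
under <- smooth.spline(x, y, lambda=fit$lambda/2)   # undersmoothed residuals
res <- y - predict(under, x)$y
res <- res - mean(res)                              # recentre
# (leverage adjustment as in (6.9) omitted for brevity)
ystar <- predict(over, x)$y + sample(res, replace=TRUE)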
Example 7.10 (Motorcycle impact data) The response y here is acceleration measured x milliseconds after impact in an accident simulation experiment. The full data were shown in Figure 7.14, but for computational reasons we eliminate replicates for the present analysis, which leaves n = 94 cases with distinct x values. The solid line in the top left panel of Figure 7.15 shows a cubic spline fit for the data of Figure 7.14, chosen by cross-validation and having approximately 12 degrees of freedom. The top right panel of the figure gives the plot of modified residuals against x for this fit. Note the heteroscedasticity, which broadly corresponds to the three strata separated by the vertical dotted lines. The estimated variances for these strata are approximately 4, 600 and 140. Reciprocals of these were used as weights for the spline fit in the left panel. Bias in these residuals is evident at times 10-15 ms, where the residuals are first mostly negative and then positive because the curve does not follow the data closely enough.

There is a rough correspondence between kernel smoothing and spline smoothing and this, together with the previous discussion, suggests that for model-based resampling we use y_j^* = \tilde\mu(x_j) + \varepsilon_j^*, where \tilde\mu is the spline fit obtained by doubling the cross-validation choice of \lambda. This fit is the dotted line in the top left panel of Figure 7.15. The random errors \varepsilon_j^* are sampled from the modified residuals for another spline fit in which \lambda is half the cross-validation value. The lower right panel of the figure displays these residuals, which show less bias than those for the original fit, though perhaps a smaller bandwidth would be better still. The sampling is stratified, to reflect the very strong heteroscedasticity.

We simulated R = 999 datasets in this way, and to each fitted the spline curve \hat\mu^*(x), with the bandwidth chosen by cross-validation each time. We then calculated 90% confidence intervals at six values of x, using the basic bootstrap method modified to equate the distributions of \hat\mu^*(x) - \tilde\mu(x) and \hat\mu(x) - \mu(x). For example, at x = 20 the estimates \hat\mu and \tilde\mu are respectively -110.8 and -106.2, and the 950th ordered value of \hat\mu^* is -87.2, so that the lower confidence limit is -110.8 - \{-87.2 - (-106.2)\} = -129.8. The resulting confidence intervals are shown in the bottom left panel of Figure 7.15, together with the original fit.
[Figure 7.15: spline fits, modified residuals, and 90% confidence limits for the motorcycle data; horizontal axes: time (ms).]

Note how the confidence limits are centred on the convex side of the fitted curve in order to account for its bias; this is most evident at x = 20.
The following example illustrates the use of nonparametric curve fits in model-checking.
Example 7.11 (Leukaemia data) For the data in Example 7.1, we originally fitted a generalized linear model with gamma variance function and linear predictor group + x with logarithmic link, where group is a factor with two levels. The fitted mean function for that model is shown as two solid curves in Figure 7.16, the upper curve corresponding to Group 1. Here we consider

Figure 7.16 Generalized linear model fits (solid) and generalized additive model fits (dashed) for the leukaemia data of Example 7.1.

whether or not the effect of x is linear. To do this, we compare the original fit to that of the generalized additive model in which x is replaced by s(x), which is a smoothing spline with three degrees of freedom. The link and variance functions are unchanged. The fitted mean function for this model is shown as dashed curves in the figure.

Is the smooth curve a significantly better fit? To answer this we use the test statistic Q defined in (7.8), where here D corresponds to the residual deviance for the generalized additive model, \kappa is the dispersion for that model, and D_0 is the residual deviance for the smaller generalized linear model. For these data D_0 = 40.32 with 30 degrees of freedom, \kappa = 0.725, and D = 30.75 with 27 degrees of freedom, so that q = (40.32 - 30.75)/0.725 = 13.2. The standard approximation for the null distribution of Q is chi-squared with degrees of freedom equal to the difference in model dimensions, here p - p_0 = 3, so the approximate P-value is 0.004. Alternatively, to allow for estimation of the dispersion, (p - p_0)^{-1}Q is compared to the F distribution with p - p_0 and n - p - 1 degrees of freedom, here 27, and this gives approximate P-value 0.012. It looks as though there is strong evidence against the simpler, loglinear model. However, the accuracies of the approximations used here are somewhat questionable, so it makes sense to apply the resampling analysis.

To calculate a bootstrap P-value corresponding to q = 13.2, we simulate the distribution of Q under the fitted null model, that is, the original generalized linear model fit, but with nonparametric resampling. The particular resampling scheme we choose here uses the linear predictor residuals r_{Lj} defined in (7.10), one advantage of which is that positive simulated responses are guaranteed.
The residuals in this case are

r_{Lj} = \frac{\log(y_j) - \log(\hat\mu_{0j})}{\hat\kappa_0^{1/2}(1 - h_{0j})^{1/2}},
Figure 7.17 Chi-squared Q-Q plot of standardized deviance differences q* for comparing generalized linear and generalized additive model fits to the leukaemia data. The lines show the theoretical \chi^2_3 approximation (dashes) and the F approximation (dots). Resampling uses Pearson residuals on the linear predictor scale, with R = 999.
where h_{0j}, \hat\mu_{0j} and \hat\kappa_0 are the leverage, fitted value and dispersion estimate for the null (generalized linear) model. These residuals appear quite homogeneous, so no stratification is used. Thus step 2 of Algorithm 7.4 consists of sampling \varepsilon_1^*, \ldots, \varepsilon_n^* randomly with replacement from r_{L1}, \ldots, r_{Ln} (without mean correction), and then generating responses y_j^* = \hat\mu_{0j}\exp(\hat\kappa_0^{1/2}\varepsilon_j^*) for j = 1, \ldots, n.
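In S this resampling step might look as follows; it is only a sketch, and the null fit glm.null, its leverages h0 and the dispersion estimate kappa0 are assumed to come from the null gamma fit (e.g. via glm and glm.diag):

mu0 <- fitted(glm.null)
rL <- (log(y) - log(mu0)) / sqrt(kappa0*(1 - h0))  # residuals (7.10)
eps <- sample(rL, replace=TRUE)                    # no mean correction
ystar <- mu0 * exp(sqrt(kappa0) * eps)             # positive responses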
Applying this algorithm with R = 999 for our data gives the P-value 0.035, larger than the theoretical approximations, but still suggesting that the linear term in x is not sufficient. The bootstrap null distribution of q* deviates markedly from the standard \chi^2_3 approximation, as the Q-Q plot in Figure 7.17 shows. The F approximation is also inaccurate.

A jackknife-after-bootstrap plot reveals that quantiles of q* are moderately sensitive to case 2, but without this case the P-value is virtually unchanged.

Very similar results are obtained under parametric resampling with the exponential model, as might be expected from the original data analysis.

Our next example illustrates the use of semiparametric regression in prediction.
Example 7.12 (AIDS diagnoses) In Example 7.4 we discussed prediction of AIDS diagnoses based on the data in Table 7.4. A smooth time trend seems preferable to fitting a separate parameter for each diagnosis period, and accordingly we consider a model where the mean number of diagnoses in period j reported with delay k, the mean for the (j, k) cell of the table, equals

\mu_{jk} = \exp\{\alpha(j) + \beta_k\}.

We take \alpha(j) to be a locally quadratic lowess smooth with bandwidth 0.5.
Figure 7.18 Generalized additive model prediction of UK AIDS diagnoses. The left panel shows the fitted curve with bandwidth 0.5 (smooth solid line), the predicted diagnoses from this fit (jagged dashed line), and the fitted curves with bandwidths 0.7 (dots) and 0.3 (dashes), together with the observed totals (+). The right panel shows the predicted quarterly diagnoses for 1989-92 (central solid line), and pointwise 95% prediction limits from the Poisson bootstrap (solid), negative binomial bootstrap (dashes), and nonparametric bootstrap without (dots) and with (dot-dash) stratification.
[Table: 95% prediction limits for quarterly AIDS diagnoses in 1990, 1991 and 1992 under the four resampling schemes.]

                            1990        1991        1992
Poisson                   295, 314    302, 336    415, 532
Negative binomial         293, 317    298, 339    407, 547
Nonparametric             294, 316    296, 337    397, 545
Stratified nonparametric  293, 315    295, 338    394, 542
in those rows. This contrasts with the Poisson two-way layout model, for which the predictions depend completely on single rows of the table and are much more variable. Compare the slight forecast drop in Figure 7.6 with the predicted increase in Figure 7.18.

The dotted lines in Figure 7.18 show pointwise 95% prediction bands for the AIDS diagnoses. The prediction intervals for the negative binomial and nonparametric schemes are similar, although the effect of stratification is smaller. Stratification has no effect on the deviances. The negative binomial deviances are typically about 90 larger than those generated under the nonparametric scheme.

The plausibility of the smooth underlying curve and its usefulness for prediction is of course central to the approach outlined here.
Mean age x, number of births m, and number of cases y for each age group:

  x      m      y        x      m      y        x      m      y
 17.0  13555   16      27.5  19202   27      37.5   5780   17
 18.5  13675   15      28.5  17450   14      38.5   4834   15
 19.5  18752   16      29.5  15685    9      39.5   3961   30
 20.5  22005   22      30.5  13954   12      40.5   2952   31
 21.5  23896   16      31.5  11987   12      41.5   2276   33
 22.5  24667   12      32.5  10983   18      42.4   1589   20
 23.5  24807   17      33.5   9825   13      43.5   1018   16
 24.5  23986   22      34.5   8483   11      44.5    596   22
 25.5  22860   15      35.5   7448   23      45.5    327   11
 26.5  21450   14      36.5   6628   13      47.0    249    7

[Figure: observed incidence against mean age x.]
case of the general model, the latter is taken to be an arbitrary convex curve for the logit of incidence rate.

If the incidence rate at age x_i is \pi(x_i) with logit\{\pi(x_i)\} = \eta(x_i) = \eta_i, say, for i = 1, \ldots, k, then the binomial log likelihood is

\ell(\eta_1, \ldots, \eta_k) = \sum_{i=1}^{k}\{y_i\eta_i - m_i\log(1 + e^{\eta_i})\}.

A convex model is one in which

\eta_i \le \frac{x_{i+1} - x_i}{x_{i+1} - x_{i-1}}\,\eta_{i-1} + \frac{x_i - x_{i-1}}{x_{i+1} - x_{i-1}}\,\eta_{i+1}, \qquad i = 2, \ldots, k-1.

The general model fit will maximize the binomial log likelihood subject to these constraints, giving estimates \tilde\eta_1, \ldots, \tilde\eta_k. The null model satisfies the constraints \eta_i \le \eta_{i+1} for i = 1, \ldots, k-1, which are equivalent to the previous convexity constraints plus the single constraint \eta_1 \le \eta_2. The null fit essentially pools adjacent age groups for which the general estimates \tilde\eta_i violate the monotonicity of the null model. If the null estimates are denoted by \hat\eta_1, \ldots, \hat\eta_k, then we take as our test statistic the deviance difference

T = 2\{\ell(\tilde\eta_1, \ldots, \tilde\eta_k) - \ell(\hat\eta_1, \ldots, \hat\eta_k)\}.
A summary of much of the theory for resampling in nonlinear and nonparametric regression is given in Chapter 8 of Shao and Tu (1995).
7.8 Problems

1   Show that the empirical influence values for the parameter estimates \hat\beta in a generalized linear model are

l(x_j, y_j) = n(X^TWX)^{-1}x_j\,\frac{y_j - \hat\mu_j}{c_j V(\hat\mu_j)\,\partial\hat\eta_j/\partial\hat\mu_j},

evaluated at the fitted model, where W is the diagonal matrix with elements given by (7.3).
Hence show that the approximate variance matrix n^{-2}\sum_j l(x_j, y_j)l(x_j, y_j)^T for \hat\beta^* for case resampling in a generalized linear model is

\kappa(X^TWX)^{-1}X^TWSX(X^TWX)^{-1},

where S = diag(r_{P1}^2, \ldots, r_{Pn}^2) with the r_{Pj} standardized Pearson residuals (7.9).
Show that for the linear model this yields the modified version of the robust variance matrix (6.26).
(Section 7.2.2; Moulton and Zeger, 1991)
2   For the gamma model of Examples 7.1 and 7.2, verify that var(Y) = \kappa\mu^2 and that the log likelihood contribution from a single observation is

\ell(\mu) = -\kappa^{-1}\{\log\mu + y/\mu\}.

Show that the unstandardized Pearson and deviance residuals are respectively \kappa^{-1/2}(z - 1) and sign(z - 1)[2\kappa^{-1}\{z - 1 - \log(z)\}]^{1/2}, where z = y/\hat\mu. If the regression is loglinear, meaning that the log link is used, verify that the unstandardized linear predictor residuals are simply \kappa^{-1/2}\log(z).
What are the possible ranges of the standardized residuals r_P, r_L and r_D? Calculate these for the model fitted in Example 7.2.
If the deviance residual is expressed as d(y, \mu), check that d(y, \mu) = d(z, 1). Hence show that the resampling scheme based on standardized deviance residuals can be expressed as y^* = \hat\mu z^*, where z_j^* is defined by d(z_j^*, 1) = \varepsilon_j^* with \varepsilon_j^* randomly sampled from r_{D1}, \ldots, r_{Dn}. What further simplification can be made?
(Sections 7.2.2, 7.2.3)
3   The figure below shows the fit to data pairs (x_1, y_1), \ldots, (x_n, y_n) of a binary logistic model

\Pr(Y = 1) = 1 - \Pr(Y = 0) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}.

(a) Under case resampling, show that the maximum likelihood estimate of \beta_1 for a bootstrap sample is infinite with probability close to e^{-2}. What effect has this on the different types of bootstrap confidence intervals for \beta_1?
(b) Bias-corrected maximum likelihood estimates are obtained by modifying response values (0, 1) to (h_j/2, 1 - h_j/2), where h_j is the jth leverage for the model fit to the original data. Do infinite parameter estimates arise when bootstrapping cases from the modified data?
(Section 7.2.3; Firth, 1993; Moulton and Zeger, 1991)
4   Investigate whether the resampling schemes given by (7.12), (7.13), and (7.14) yield Algorithm 6.1 for bootstrapping the linear model.

5   Suppose that, conditional on \pi, Y is binomial with denominator m and probability \pi, where \pi has the beta density

f(\pi \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\pi^{\alpha-1}(1-\pi)^{\beta-1}, \qquad 0 < \pi < 1, \quad \alpha, \beta > 0.

Show that Y has unconditional mean and variance (7.15) and express \pi and \phi in terms of \alpha and \beta.
Express \alpha and \beta in terms of \pi and \phi, and hence explain how to generate data with mean and variance (7.15) by generating \pi from a beta distribution, and then, conditional on the probabilities, generating binomial variables with probabilities \pi and denominators m.
How should your algorithm be amended to generate beta-binomial data with variance function \phi\Pi(1 - \Pi)?
(Example 7.3)
6   For generalized linear models the analogue of the case-deletion result in Problem 6.2 is

\hat\beta_{-j} = \hat\beta - \kappa^{1/2}(X^TWX)^{-1}x_j w_j^{1/2} r_{Pj}(1 - h_j)^{-1/2}.

(a) Use this to obtain the predicted value for y_j when the jth case is deleted.
(b) Use (a) to give an approximation for the leave-one-out cross-validation estimate of prediction error for a binary logistic regression with cost (7.23).
(Sections 6.4.1, 7.2.2)
7.9 Practicals

1   attach(remission)
    plot(LI+0.03*rnorm(27), r, pch=1, xlab="LI, jittered", xlim=c(0,2.5))
    rem.glm <- glm(r~LI, binomial, data=remission)
    summary(rem.glm)
    x <- seq(0.4, 2.0, 0.02)
    eta <- cbind(rep(1,81), x) %*% coefficients(rem.glm)
    lines(x, inv.logit(eta), lty=2)
    rem.perm <- function(data, i)
    { d <- data
      d$LI <- d$LI[i]
      d.glm <- glm(r~LI, binomial, data=d)
      coefficients(d.glm) }
    rem.boot <- boot(remission, rem.perm, R=199, sim="permutation")
    qqnorm(rem.boot$t[,2], ylab="Coefficient of LI", ylim=c(-3,3))
    abline(h=rem.boot$t0[2], lty=2)

Compare this significance level with that from using a normal approximation for the coefficient of LI in the fitted model.
Construct bootstrap tests of the hypothesis by extending the methods outlined in Section 6.2.5.
(Freeman, 1987; Hall and Wilson, 1991)
2   Dataframe breslow contains data from Breslow (1985) on death rates from heart disease among British male doctors. A standard model is that the numbers of deaths y have a Poisson distribution with mean n\lambda, where n is the number of person-years and \lambda is the death rate. The focus of interest is how death rate depends on two explanatory variables, a factor representing the age group and an indicator of smoking status, x. Two competing models are

\lambda = \exp(\alpha_{age} + \beta x), \qquad \lambda = \alpha_{age} + \beta x;

these are respectively multiplicative and additive. To fit these models we proceed as follows:

Try this with a larger value of R, but don't hold your breath.
For a full likelihood analysis for the parameter \theta, the log likelihood must be maximized over \beta_1, \ldots, \beta_4 for a given value of \theta. A little thought shows that the necessary code is

beta0 <- function(theta, mle)
{ x49 <- -log(4.9-(5-exp(mle[4])))
  x <- -log(4.9)
  log(theta*10^3) - mle[1]*x49 - lgamma(1 + exp(-mle[2]-mle[3]*x)) }
hirose.lik2 <- function(mle, data, theta)
{ x0 <- 5-exp(mle[4])
  lambda <- exp(beta0(theta,mle)+mle[1]*(-log(data$volt-x0)))
  beta <- exp(mle[2]+mle[3]*(-log(data$volt)))
  z <- (data$time/lambda)^beta
  sum(z - data$cens*log(beta*z/data$time)) }
hirose.fun2 <- function(data, start, theta)
{ d <- nlminb(start, hirose.lik2, data=data, theta=theta)
  conv <- (d$message=="RELATIVE FUNCTION CONVERGENCE")
  c(conv, d$objective, d$parameters) }
hirose.f <- function(data, i, start, theta)
  c(hirose.fun(data, i, start),
    hirose.fun2(data[i,], start[-1], theta))
R <- hirose.boot$R
i <- c(1:R)[(hirose.boot$t[,1]==1)&(hirose.boot$t[,8]==1)]
w <- 2*(hirose.boot$t[i,9]-hirose.boot$t[i,2])
qqplot(qchisq(c(1:length(w))/(1+length(w)),1), w)
abline(0,1,lty=2)

Again, try this with a larger R.
Can you see how the code would be modified for nonparametric simulation?
(Section 7.3; Hirose, 1993)
4   Dataframe nodal contains data on 53 patients with prostate cancer. For each patient there are five explanatory variables, each with two levels. These are aged (< 60, ≥ 60); stage, a measure of the seriousness of the tumour; grade, a measure of the pathology of the tumour; xray, a measure of the seriousness of an x-ray; and acid, the level of serum acid phosphatase. The higher level of each of the last four variables indicates a more severe condition. The response r indicates whether the cancer has spread to the neighbouring lymph nodes. The data were collected to see whether nodal involvement can be predicted from the explanatory variables.
Analysis of deviance for a binary logistic regression model suggests that the response depends only on stage, xray and acid, and we base our predictions on the model with these variables. Our measure of error is the average number of misclassifications n^{-1}\sum c(y_j, \hat\mu_j), where c(y, \hat\mu) is given by (7.23).
For an initial model, apparent error, and ordinary and K-fold cross-validation estimates of prediction error:

attach(nodal)
cost <- function(r, pi=0) mean(abs(r-pi)>0.5)
nodal.glm <- glm(r~stage+xray+acid, binomial, data=nodal)
nodal.diag <- glm.diag(nodal.glm)
app.err <- cost(r, fitted(nodal.glm))
cv.err <- cv.glm(nodal, nodal.glm, cost, K=53)$delta
cv.11.err <- cv.glm(nodal, nodal.glm, cost, K=11)$delta

For resampling-based estimates and plot for 0.632 errors:
5   plot(cloth$x, cloth$y)
    cloth.glm <- glm(y~offset(log(x)), poisson, data=cloth)
    lines(cloth$x, fitted(cloth.glm))
    summary(cloth.glm)
    cloth.diag <- glm.diag(cloth.glm)
    cloth.gam <- gam(y~s(log(x)), poisson, data=cloth)
    lines(cloth$x, fitted(cloth.gam), lty=2)
    summary(cloth.gam)

There is some overdispersion relative to the Poisson model with identity link, and strong evidence that the generalized additive model fit cloth.gam improves on the straight-line model in which y is Poisson with mean \beta_0 + \beta_1 x. We can try parametric simulation from the model with the linear fit (the null model) to assess the significance of the decrease; cf. Algorithm 7.4:

Here we have used resampled standardized Pearson residuals for the null model, obtained by cloth.diag$rp.
How significant is the observed drop in deviance under this resampling scheme?
(Section 7.6.2; Bissell, 1972; Firth, Glosup and Hinkley, 1991)
6   The data nitrofen are taken from a test of the toxicity of the herbicide nitrofen on the zooplankton Ceriodaphnia dubia, an important species that forms the basis of freshwater food chains for the higher invertebrates and for fish and birds. The standard test measures the survival and reproductive output of 10 juvenile C. dubia in each of four concentrations of the herbicide, together with a control in which the herbicide is not present. During the 7-day period of the test each of the original individuals produces three broods of offspring, but for illustration we analyse the total offspring.
A previous model for the data is that at concentration x the total offspring y for each individual is Poisson distributed with mean \exp(\beta_0 + \beta_1 x + \beta_2 x^2). The fit of this model to the data suggests that low doses of nitrofen augment reproduction, but that higher doses inhibit it.
One thing required from analysis is an estimate of the concentration x_{50} of nitrofen at which the mean brood size is halved, together with a 95% confidence interval for x_{50}. A second issue is posed by the surprising finding from a previous analysis that brood sizes are slightly larger at low doses of herbicide than at high or zero doses: is this true?
A wide variety of nonparametric curves could be fitted to the data, though care is needed because there are only five distinct values of x. The data do not look Poisson, but we use models with Poisson errors and the log link function to ensure that fitted values and predictions are positive. To compare the fits of the generalized linear model described above and a robustified generalized additive model with Poisson errors:
8
Complex Dependence

8.1 Introduction

In previous chapters our models have involved variables independent at some level, and we have been able to identify independent components that can be simulated. Where a model can be fitted and residuals of some sort identified, the same ideas can be applied in the more complex problems discussed in this chapter. Where that model is parametric, parametric simulation can in principle be used to obtain resamples, though Markov chain Monte Carlo techniques may be needed in practice. But in nonparametric situations the dependence may be so complex, or our knowledge of it so limited, that neither of these approaches is feasible. Of course some assumption of repeatedness within the data is essential, or it is impossible to proceed. But the repeatability may not be at the level of individual observations, but of groups of them, and there is typically dependence between as well as within groups. This leads to the idea of constructing bootstrap data by taking blocks of some sort from the original observations. The area is in rapid development, so we avoid a detailed mathematical exposition, and merely sketch key aspects of the main ideas. In Section 8.2 we describe some of the resampling schemes proposed for time series. Section 8.3 outlines some ideas useful in resampling point processes.
8.2 Time Series

A time series is strictly stationary if the joint distributions of its observations depend only on their relative positions, and not on their absolute position in the series. A weaker assumption used in data analysis is that the joint second moments of observations depend only on their relative positions; such a series is said to be second-order or weakly stationary.
Time domain

There are two basic types of summary quantities for stationary time series. The first, in the time domain, rests on the joint moments of the observations. Let \{Y_j\} be a second-order stationary time series, with zero mean and autocovariance function \gamma_j. That is, E(Y_j) = 0 and cov(Y_k, Y_{k+j}) = \gamma_j for all k and j; the variance of Y_j is \gamma_0. Then the autocorrelation function of the series is \rho_j = \gamma_j/\gamma_0, for j = 0, \pm 1, \ldots, which measures the correlation between observations at lag j apart; of course -1 \le \rho_j \le 1, \rho_0 = 1, and \rho_{-j} = \rho_j. An uncorrelated series would have \rho_j = 0 for j \ne 0, and if the data were normally distributed this would imply that the observations were independent.

For example, the stationary moving average process of order one, or MA(1) model, has

Y_j = \varepsilon_j + \beta\varepsilon_{j-1}, \qquad j = \ldots, -1, 0, 1, \ldots, \qquad (8.1)

while the stationary autoregressive process of order one, or AR(1) model, has

Y_j = \alpha Y_{j-1} + \varepsilon_j, \qquad j = \ldots, -1, 0, 1, \ldots, \quad |\alpha| < 1. \qquad (8.2)

The autocorrelation function for the AR(1) process is \rho_j = \alpha^{|j|} for j = \pm 1, \pm 2 and so forth, so large \alpha gives high correlation between successive observations. The autocorrelation function decreases rapidly for both models (8.1) and (8.2).
A close relative of the autocorrelation function is the partial autocorrelation function, defined as \tilde\rho_j = \tilde\gamma_j/\tilde\gamma_0, where \tilde\gamma_j is the covariance between Y_k and Y_{k+j} after adjusting for the intervening observations. The partial autocorrelations for the MA(1) model are

\tilde\rho_j = -(-\beta)^j(1 - \beta^2)\{1 - \beta^{2(j+1)}\}^{-1}, \qquad j = \pm 1, \pm 2, \ldots.

The AR(1) model has \tilde\rho_1 = \alpha, and \tilde\rho_j = 0 for |j| > 1; a sharp cut-off in the partial autocorrelations is characteristic of autoregressive processes.

The sample estimates of \rho_j and \tilde\rho_j are basic summaries of the structure of a time series. Plots of them against j are called the correlogram and partial correlogram of the series.
One widely used class of linear time series models is the autoregressive-moving average or ARMA process. The general ARMA(p, q) model is defined by

Y_j = \sum_{k=1}^{p}\alpha_k Y_{j-k} + \varepsilon_j + \sum_{k=1}^{q}\beta_k\varepsilon_{j-k}, \qquad (8.3)

where \{\varepsilon_j\} is a white noise process. If all the \alpha_k equal zero, \{Y_j\} is the moving average process MA(q), whereas if all the \beta_k equal zero, it is AR(p). In order for (8.3) to represent a stationary series, conditions must be placed on the coefficients. Packaged routines enable models (8.3) to be fitted readily, while series from them are easily simulated using a given innovation series \ldots, \varepsilon_{-1}, \varepsilon_0, \varepsilon_1, \ldots.
Frequency domain

The second approach to time series is based on the frequency domain. The spectrum of a stationary series with autocovariances \gamma_j is

g(\omega) = \sum_{j=-\infty}^{\infty}\gamma_j e^{ij\omega}. \qquad (8.4)

This summarizes the values of all the autocorrelations of \{Y_j\}. A white noise process has the flat spectrum g(\omega) = \gamma_0, while a sharp peak in g(\omega) corresponds to a strong periodic component in the series. For example, the spectrum for a stationary AR(1) model is g(\omega) = \sigma^2\{1 - 2\alpha\cos(\omega) + \alpha^2\}^{-1}.

The empirical Fourier transform plays a key role in data analysis in the frequency domain. The treatment is simplified if we relabel the series as y_0, \ldots, y_{n-1} and suppose that n = 2n_F + 1 is odd. Let \zeta = e^{2\pi i/n} be the nth complex root of unity, so \zeta^n = 1. Then the empirical Fourier transform of the data is the set of n complex-valued quantities

\tilde y_k = \sum_{j=0}^{n-1}\zeta^{jk}y_j, \qquad k = 0, \ldots, n-1;

note that

n^{-1}\sum_{k=0}^{n-1}\zeta^{-jk}\tilde y_k = y_j, \qquad j = 0, \ldots, n-1,

so this inverse Fourier transform retrieves the data. Now define the Fourier frequencies \omega_k = 2\pi k/n, for k = 1, \ldots, n_F. The sample analogue of the spectrum at \omega_k is the periodogram

I(\omega_k) = n^{-1}\left[\left\{\sum_{j=0}^{n-1}y_j\cos(\omega_k j)\right\}^2 + \left\{\sum_{j=0}^{n-1}y_j\sin(\omega_k j)\right\}^2\right],

and the normalized cumulative periodogram ordinates are

\frac{\sum_{j=1}^{k}I(\omega_j)}{\sum_{j=1}^{n_F}I(\omega_j)}, \qquad k = 1, \ldots, n_F - 1. \qquad (8.5)

When the data are white noise these ordinates have roughly the same joint distribution as the order statistics of n_F - 1 uniform random variables.
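As a sketch in S, the periodogram and the ordinates (8.5) can be computed from the fast Fourier transform (the series is assumed to be already centred):

periodogram <- function(y)
{ n <- length(y)
  nF <- floor((n-1)/2)
  I <- Mod(fft(y))^2 / n          # |ytilde_k|^2 / n
  I <- I[2:(nF+1)]                # drop k = 0; keep k = 1, ..., nF
  list(omega=2*pi*(1:nF)/n, I=I,
       cum=cumsum(I)/sum(I)) }    # cumulative ordinates (8.5)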
Example 8.1 (Rio Negro data) The data for our first time series example are monthly averages of the daily stages (heights) of the Rio Negro, 18 km upstream at Manaus, from 1903 to 1992, made available to us by Professors H. O'Reilly Sternberg and D. R. Brillinger of the University of California at Berkeley. Because of the tiny slope of the water surface and the lower courses of its flatland affluents, these data may be regarded as a reasonable approximation of the water level in the Amazon River at the confluence of the
Figure 8.1 Deseasonalized monthly average stage (metres) of the Rio Negro at Manaus, 1903-1992 (Sternberg, 1995).
two rivers. To remove the strong seasonal component, we subtract the average value for each month, giving the series of length n = 1080 shown in Figure 8.1.

For an initial example, we take the first ten years of observations. The top panels of Figure 8.2 show the correlogram and partial correlogram for this shorter series, with horizontal lines showing approximate 95% confidence limits for correlations from a white noise series. The shape of the correlogram and the cut-off in the partial correlogram suggest that a low-order autoregressive model will fit the data, which are quite highly correlated. The lower left panel of the figure shows the periodogram of the series, which displays the usual high variability associated with single periodogram ordinates. The lower right panel shows the cumulative periodogram, which lies well outside its overall 95% confidence band and clearly does not correspond to a white noise series.

An AR(2) model fitted to the shorter series gives \hat\alpha_1 = 1.14 and \hat\alpha_2 = -0.31, both with standard error 0.062, and estimated innovation variance 0.598. The left panel of Figure 8.3 shows a normal probability plot of the standardized residuals from this model, and the right panel shows the cumulative periodogram of the residual series. The residuals seem close to Gaussian white noise.
390
to
o>
o
o
O
Lag
Lag
omega
omega/pi
residuals into the fitted model. The residuals are typically recentred to have the same mean as the innovations of the model. About the simplest situation is when the AR(1) model (8.2) is fitted to an observed series y_1, \ldots, y_n, giving estimated autoregressive coefficient \hat\alpha and estimated innovations

e_j = y_j - \hat\alpha y_{j-1}, \qquad j = 2, \ldots, n;
391
E
2?
o>
o
-o
o
V)
co
D
O
cn
<D
cc
0Q.
3
e3
o
omega/pi
yo = ej and
y j = a yj_! + e j ,
j = l,...,n ;
(8.6)
o f course we m ust have |a| < 1. In fact the series so generated is n o t stationary,
an d it is b etter to start the series in equilibrium , o r to generate a longer series
o f innovations an d sta rt (8.6) at j = k, where the b u rn-in period k , . . . , 0
is chosen large enough to ensure th at the observations y [ , . . . , y * are essentially
statio n ary ; the values y'_k, . . . , y ' ) are discarded.
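A minimal S sketch of this model-based scheme for an AR(1) fit, with a burn-in of k values (the recursive filter plays the role of (8.6); e holds the estimated innovations and a.hat the fitted coefficient):

ar1.sim <- function(n, a.hat, e, k=100)
{ eps <- sample(e - mean(e), n + k, replace=TRUE)   # centred residuals
  ystar <- filter(eps, a.hat, method="recursive")   # y*_j = a.hat*y*_{j-1} + eps_j
  ystar[(k+1):(k+n)] }                              # discard the burn-in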
Thus model-based resampling for time series is based on applying the defining equation(s) of the series to innovations resampled from residuals. This procedure is simple to apply, and leads to good theoretical behaviour for estimates based on such data when the model is correct. For example, studentized bootstrap confidence intervals for the autoregressive coefficients \alpha_k in an AR(p) process enjoy the good asymptotic properties discussed in Section 5.4.1, provided that the model fitted is chosen correctly. Just as there, confidence intervals based on transformed statistics may be better in practice.
Example 8.2 (Wool prices) The Australian Wool Corporation monitors prices weekly when wool markets are held, and sets a minimum price just before each week's markets open. This reflects the overall price of wool for that week, but the prices actually paid can vary considerably relative to the minimum. The left panel of Figure 8.4 shows a plot of log(price paid/minimum price) for those weeks when markets were held from July 1976 to June 1984. The series does not seem stationary, having some of the characteristics of a random walk, as well as a possible overall trend.

If the log ratio in week j follows a random walk, we have Y_j = Y_{j-1} + \varepsilon_j, where the \varepsilon_j are white noise; a non-zero mean for the innovations \varepsilon_j will lead to drift in the y_j.

[Figure 8.4: log(price paid/minimum price) and its differenced series; horizontal axes: time in weeks.]

The right panel of Figure 8.4 shows the differenced series, e_j = y_j - y_{j-1}, which appears stationary apart from a change in the innovation variance at about the 100th week. In our analysis we drop the first 100 observations, leaving a differenced series of length 208.

An alternative to the random walk model is the AR(1) model

Y_j - \mu = \alpha(Y_{j-1} - \mu) + \varepsilon_j; \qquad (8.7)

this gives the random walk when \alpha = 1. If the innovations have mean zero and \alpha is close to but less than one, (8.7) gives stationary data, though subject to the climbs and falls seen in the left panel of Figure 8.4. The implications for forecasting depend on the value of \alpha, since the variance of a forecast is only asymptotically bounded when |\alpha| < 1. We test the unit root hypothesis that the data are a random walk, or equivalently that \alpha = 1, as follows.

Our test is based on the ordinary least squares estimate of \alpha in the regression Y_j = \gamma + \alpha Y_{j-1} + \varepsilon_j for j = 2, \ldots, n, using test statistic T = (1 - \hat\alpha)/S, where S is the standard error for \hat\alpha calculated using the usual formula for a straight-line regression model. Large values of T are evidence against the random walk hypothesis, with or without drift. The observed value of T is t = 1.19. The distribution of T is far from the usual standard normal, however, because of the regression of each observation on its predecessor.
Under the hypothesis that \alpha = 1 we simulate new time series Y_1^*, \ldots, Y_n^* by generating a bootstrap sample \varepsilon_2^*, \ldots, \varepsilon_n^* from the differences e_2, \ldots, e_n and then setting Y_1^* = y_1, Y_2^* = Y_1^* + \varepsilon_2^*, and Y_j^* = Y_{j-1}^* + \varepsilon_j^* for subsequent j. This is (8.6) applied with the null hypothesis value \alpha = 1. The value of T^* is then obtained from the regression of Y_j^* on Y_{j-1}^* for j = 2, \ldots, n.
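A sketch of this test in S, with the series in a vector y (lm is used for the straight-line regressions):

unit.root.stat <- function(y)
{ n <- length(y)
  fit <- summary(lm(y[-1] ~ y[-n]))$coefficients
  (1 - fit[2,1]) / fit[2,2] }          # T = (1 - alpha.hat)/S
tstar <- replicate(199,
{ eps <- sample(diff(y), replace=TRUE) # resample the differences
  unit.root.stat(y[1] + cumsum(c(0, eps))) })
p <- (1 + sum(tstar >= unit.root.stat(y))) / (1 + length(tstar))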
The left panel of Figure 8.5 shows the empirical distribution of T^* in 199 simulations. The distribution is close to normal with mean 1.17 and variance 0.88. The observed significance level for t is (97 + 1)/(199 + 1) = 0.49: there is no evidence against the random walk hypothesis.

The right panel of Figure 8.5 shows the values of t^* plotted against the inverse sum of squares for the regressor y_{j-1}^*. In a conventional regression, inference is usually conditional on this sum of squares, which determines the precision of the estimate. The dotted line shows the observed sum of squares. If the conditional distribution of t^* is thought to be appropriate here, the distribution of values of t^* close to the dotted line shows that the conditional significance level is even higher; there is no evidence against the random walk conditionally or unconditionally.
Models are commonly fitted in order to predict future values of a time series, but as in other settings, it can be difficult to allow for the various sources of uncertainty that affect the predictions. The next example shows how bootstrap methods can give some idea of the relative contributions from innovations, estimation error, and model error.
Example 8.3 (Sunspot numbers) Figure 8.6 shows the much-analysed annual sunspot numbers y_1, \ldots, y_{289} for 1700-1988. The data show a strong cycle with a period of about 11 years, and some hint of non-reversibility, which shows up as a lack of symmetry in the peaks. We use values from 1930-1979 to predict the numbers of sunspots over the next few years, based on fitting
[Figure 8.6: annual sunspot numbers, 1700-1988; horizontal axis: time in years.]

Table 8.1. Sunspot data: actual and predicted values (transformed scale) for 1980-88, and standard errors of the predictions.

            1980   81    82    83    84    85    86    87   1988
Actual      23.0  21.8  19.6  14.4  11.7   6.7   5.6   9.0  18.1
Predicted   21.6  18.9  14.9  12.2   9.1   7.5   6.8   8.8  13.6

Standard error             1980   81    82    83    84
Nominal                     2.0   2.9   3.2   3.2   3.2
Model, AR(9)                2.2   2.9   3.0   3.2   3.3
Model                       2.3   3.3   3.6   3.5   3.5
Model, conditional          2.5   3.6   4.1   3.9   3.8
Block, l = 10               7.8   7.0   6.9   6.9   6.7
Post-blackened, l = 10      2.1   3.3   3.9   4.0   3.6
AR(p) models

Y_j - \mu = \sum_{k=1}^{p}\alpha_k(Y_{j-k} - \mu) + \varepsilon_j \qquad (8.8)

to the transformed observations y_j = 2\{(y_j + 1)^{1/2} - 1\}; this transformation is chosen to stabilize the variance. The corresponding maximized log likelihoods are denoted \hat\ell_p. A standard approach to model selection is to select the model that minimizes AIC = -2\hat\ell_p + 2p, which trades off goodness of fit (measured by the maximized log likelihood) against model complexity (measured by p). Here the resulting best model is AR(9), whose predictions \hat y_j for 1980-88 and their nominal standard errors are given at the top of Table 8.1. These standard errors allow for prediction error due to the new innovations, but not for parameter estimation or model selection, so how useful are they?

To assess this we consider model-based simulation from (8.8) using centred residuals and the estimated coefficients of the fitted AR(9) model to generate series y_{r1}^*, \ldots, y_{r59}^*, corresponding to the period 1930-1988, for r = 1, \ldots, R. We then fit autoregressive models up to order p = 25 to y_{r1}^*, \ldots, y_{r50}^*, select the model giving the smallest AIC, and use this model to produce predictions \hat y_{rj}^* for j = 51, \ldots, 59. The prediction error is y_{rj}^* - \hat y_{rj}^*, and the estimated standard errors of this are given in Table 8.1, based on R = 999 bootstrap series. The orders of the fitted models were
Order   1    2    3    4    5   6   7   8   9  10  11  12
#       3  257  126  100  273  85  22  18  83  23   7   2
so the AR(9) model is chosen in only 8% of cases, and most of the models selected are less complicated. The fifth and sixth rows of Table 8.1 give the estimated standard errors of the y_j^* - \hat y_j^* using the 83 simulated series for which the selected model was AR(9), and using all the series, based on the 999 replications. There is about a 10-15% increase in standard error due to parameter estimation, and the standard errors for the AR(9) models are mostly smaller.
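In S the model-selection step inside such a bootstrap loop might look like the following sketch; ar fits autoregressions up to order.max and selects the order by AIC, and the fitted model sunspot.ar with its residuals is an assumed starting point:

res <- sunspot.ar$resid[!is.na(sunspot.ar$resid)]
res <- res - mean(res)                          # centred residuals
ystar <- arima.sim(list(ar=sunspot.ar$ar), n=59,
                   innov=sample(res, 59, replace=TRUE),
                   start.innov=sample(res, 200, replace=TRUE), n.start=200)
fit <- ar(ystar[1:50], order.max=25)            # order re-chosen by AIC
yhat <- predict(fit, ystar[1:50], n.ahead=9)$pred
err <- ystar[51:59] - yhat                      # prediction errors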
Prediction errors should take account of the values of y_j immediately prior to the forecast period, since presumably these are relevant to the predictions actually made. Predictions that follow on from the observed data can be obtained by using innovations sampled at random except for the period j = n - k + 1, \ldots, n, where we use the residuals actually observed. Taking k = n yields the original series, in which case the only variability in the y_{rj}^* is due to the innovations in the forecast period; the standard errors of the predictions will then be close to the nominal standard error. However, if k is small relative to n, the differences y_{rj}^* - \hat y_{rj}^* will largely reflect the variability due to the use of estimated parameters, although the y_{rj}^* will follow on from y_n. The conditional standard errors in Table 8.1, based on k = 9, are about 10% larger than the unconditional ones, and substantially larger than the nominal standard errors.

The distributions of the y_{rj}^* - \hat y_{rj}^* appear close to normal with zero means, and a summary of variation in terms of standard errors seems appropriate. There will clearly be difficulties with normal-based prediction intervals in 1985 and 1986, when the lower limits of 95% intervals for y are negative, and it might be better to give one-sided intervals for these years. It would be better to use a studentized version of y_{rj}^* - \hat y_{rj}^* if an appropriate standard error were readily available.
When bootstrap series are generated from the AR(9) model fitted to the data from 1700-1979, the orders of the fitted models are

Order   5    9   10   11   12   13   14   15   16   17   18   19
#       1  765   88   57   28   21   11   11    5    1    4   25
The major drawback with model-based resampling is that in practice not only the parameters of a model, but also its structure, must be identified from the data. If the chosen structure is incorrect, the resampled series will be generated from a wrong model, and hence they will not have the same statistical properties as the original data. This suggests that some allowance be made for model selection, as in Section 3.11, but it is unclear how to do this without some assumptions about the dependence structure of the process, as in the previous example. Of course this difficulty is less critical when the model selected is strongly indicated by subject-matter considerations or is well-supported by extensive data.
The simplest block resampling scheme divides the data y_1, \ldots, y_n into b non-overlapping blocks of length l, takes a bootstrap sample z_1^*, \ldots, z_b^* with replacement from these blocks, and joins them end-to-end to form the resampled series \{y_j^*\}. With n = 12 and l = 4, for example, the blocks are z_1 = (y_1, \ldots, y_4), z_2 = (y_5, \ldots, y_8) and z_3 = (y_9, \ldots, y_{12}), and a resampled series might begin y_5, y_6, y_7, y_8, y_1, y_2, y_3, y_4, \ldots. In general, the resampled series are more like white noise than the original series, because of the joins between blocks where successive independently chosen z_j^* meet.
The idea that underlies this block resampling scheme is that if the blocks are long enough, enough of the original dependence will be preserved in the resampled series that statistics t^* calculated from \{y_j^*\} will have approximately the same distribution as values t calculated from replicates of the original series. Clearly this approximation will be best if the dependence is weak and the blocks are as long as possible, thereby preserving the dependence more faithfully. On the other hand, the distinct values of t^* must be as numerous as possible to provide a good estimate of the distribution of T, and this points towards short blocks. Theoretical work outlined below suggests that a compromise in which the block length l is of order n^\gamma for some \gamma in the interval (0, 1) balances these two conflicting needs. In this case both the block length l and the number of blocks b = n/l tend to infinity as n \to \infty, though different values of \gamma are appropriate for different types of statistic t.

There are several variants on this resampling plan. One is to let the original blocks overlap, in our example giving the n - l + 1 = 9 blocks z_1 = (y_1, \ldots, y_4), z_2 = (y_2, \ldots, y_5), z_3 = (y_3, \ldots, y_6), and so forth up to z_9 = (y_9, \ldots, y_{12}). This incurs end effects, as the first and last l - 1 of the original observations appear in fewer blocks than the rest. Such effects can be removed by wrapping the data around a circle, in our example adding the blocks z_{10} = (y_{10}, y_{11}, y_{12}, y_1), z_{11} = (y_{11}, y_{12}, y_1, y_2), and z_{12} = (y_{12}, y_1, y_2, y_3). This ensures that each of the original observations has an equal chance of appearing in a simulated series. End correction by wrapping also removes the minor problem with the non-overlapping scheme that the last block is shorter than the rest if n/l is not an integer.
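A sketch of the overlapping scheme with wrapping in S (the boot library's tsboot function automates block resampling; the hand-rolled version below just makes the mechanics explicit):

block.series <- function(y, l)
{ n <- length(y)
  yy <- c(y, y[1:(l-1)])                 # wrap the data around a circle
  starts <- sample(1:n, ceiling(n/l), replace=TRUE)
  ystar <- unlist(lapply(starts, function(s) yy[s:(s+l-1)]))
  ystar[1:n] }                           # trim to the original length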
Post-blackening

The most important difficulty with resampling schemes based on blocks is that they generate series that are less dependent than the original data. In some circumstances this can lead to catastrophically bad resampling approximations, as we shall see in Example 8.4. It is clearly inappropriate to take blocks of length l = 1 when resampling dependent data, for the resampled series is then white noise, but the whitening can remain substantial for small and moderate values of l. This suggests a strategy intermediate between model-based and block resampling. The idea is to 'pre-whiten' the series by fitting a model that is intended to remove much of the dependence between the original observations. A series of innovations is then generated by block resampling of residuals from the fitted model, and the innovation series is then 'post-blackened' by applying the estimated model to the resampled innovations. Thus if an AR(1) model is used to pre-whiten the original data, new series are generated by applying (8.6) but with the innovation series \{\varepsilon_j^*\} sampled not independently but in blocks taken from the centred residual series e_2 - \bar e, \ldots, e_n - \bar e.
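Combining the two sketches above gives a minimal post-blackened resample for an AR(1) pre-whitening fit:

pb.series <- function(y, a.hat, e, l)
{ eps <- block.series(e - mean(e), l)       # innovations in blocks
  filter(eps, a.hat, method="recursive") }  # rebuild the series as in (8.6)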
Blocks of blocks

A different approach to removing the whitening effect of block resampling is to resample blocks of blocks. Suppose that the focus of interest is a statistic T which estimates \theta and depends only on blocks of m successive observations. An example is the lag k autocovariance (n - k)^{-1}\sum_{j=1}^{n-k}(y_j - \bar y)(y_{j+k} - \bar y), for which m = k + 1. Then unless l \gg m the distribution of T^* - t is typically a poor approximation to that of T - \theta, because a substantial proportion of the pairs (Y_j^*, Y_{j+k}^*) in a resampled series will lie across a join between blocks, and will therefore be independent. To implement resampling blocks of blocks we define a new m-variate process \{Y_j'\} for which Y_j' = (Y_j, \ldots, Y_{j+m-1}), rewrite T so that it involves averages of the Y_j', and resample blocks of the new data y_1', \ldots, y_{n-m+1}', each of the observations of which is a block of the original data. For the lag 1 autocovariance, for example, we set y_j' = (y_{1j}', y_{2j}') = (y_j, y_{j+1}) and write t = (n - 1)^{-1}\sum_j(y_{1j}' - \bar y_{1\cdot}')(y_{2j}' - \bar y_{2\cdot}'). The key point is that t should not compare observations adjacent in each row. With n = 12 and l = 4 a bootstrap replicate might be
y_5  y_6  y_7  y_8   y_1  y_2  y_3  y_4   y_7  y_8  y_9   y_10
y_6  y_7  y_8  y_9   y_2  y_3  y_4  y_5   y_8  y_9  y_10  y_11
Since a bootstrap version of t based on this series will only contain products of (centred) adjacent observations of the original data, the whitening due to resampling blocks will be reduced, though not entirely removed. This approach leads to a shorter series being resampled, but this is unimportant relative to the gain from avoiding whitening.
Stationary bootstrap

A further but less important difficulty with these block schemes is that the artificial series generated by them are not stationary, because the joint distribution of resampled observations close to a join between blocks differs from that in the centre of a block. This can be overcome by taking blocks of random length. The stationary bootstrap takes blocks whose lengths L are geometrically distributed, with density

\Pr(L = j) = (1 - p)^{j-1}p, \qquad j = 1, 2, \ldots.

This yields resampled series that are stationary with mean block length l = p^{-1}. Properties of this scheme are explored in Problems 8.1 and 8.2.
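A sketch of the stationary bootstrap in S (tsboot with sim="geom" in the boot library does this; the version below shows the mechanics, wrapping the series so that blocks can run past the end):

stat.series <- function(y, p)
{ n <- length(y)
  yy <- rep(y, 3)                       # crude wrap for long blocks
  ystar <- numeric(0)
  while (length(ystar) < n)
  { s <- sample(1:n, 1)                 # random start
    L <- min(rgeom(1, p) + 1, 2*n)      # geometric length, mean 1/p
    ystar <- c(ystar, yy[s:(s+L-1)]) }
  ystar[1:n] }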
Example 8.4 (Rio Negro data) To illustrate these resampling schemes we consider the shorter series of river stages, of length 120, with its average subtracted. Figure 8.7 shows the original series, followed by three bootstrap series generated by model-based sampling from the AR(2) model. The next three panels show series generated using the block bootstrap with length l = 24 and no wrapping. There are some sharp jumps at the ends of contiguous blocks in the resampled series. The bottom panels show series generated using the same blocks applied to the residuals, and then post-blackened using the AR(2) model. The jumps from using the block bootstrap are largely removed by post-blackening.
For a more systematic comparison of the methods, we generated 200 bootstrap replicates under different resampling plans. For each plan we calculated the standard error SE of the average \bar y^* of the resampled series, and the average of the first three autocorrelation coefficients; the more dependent the resampled series, the closer these should be to the original values.
[Table: standard error of \bar y^* and average values of \hat\rho_1, \hat\rho_2, \hat\rho_3 over 200 replicates under each resampling plan: model-based (AR(1), AR(2), AR(3)); blockwise, blocks of blocks, and stationary bootstrap, each with l = 2, 5, 10, 20; and post-blackened with AR(1), AR(2), AR(3) pre-whitening and l = 2. Original values: \hat\rho_1 = 0.85, \hat\rho_2 = 0.62.]
The stationary bootstrap is used with end correction. The results are similar to those for the block bootstrap, except that the varying block length preserves slightly more of the original correlation structure; this is noticeable at l = 2.

Results for the post-blackened method with AR(2) and AR(3) models are similar to those for the corresponding model-based schemes. The results for the post-blackened AR(1) scheme are intermediate between AR(1) and AR(2) model-based resampling, reflecting the fact that the AR(1) model underfits the data, and hence structure remains in the residuals. Longer blocks have little effect for the AR(2) and AR(3) models, but they bring results for the AR(1) model more into line with those for the others.
Figure 8.8. Distributions of a nonlinearity statistic for block resampling schemes applied to the sunspot data. The left panel shows R = 999 replicates of a test statistic for nonlinearity, based on detecting nonlinearity at up to 20 lags, for the block bootstrap with l = 10. The right panel shows the corresponding plot for the post-blackened bootstrap using the AR(9) model. Both panels are plotted against quantiles of the F distribution.
We then take (8.10) as the optimum block length for a series of length n, and calculate k̂(n, l). This procedure eliminates the constant of proportionality. We can check on the adequacy of l̂ by repeating the procedure with initial value l = l̂, iterating if necessary.
[Figure: estimated variances plotted against block length.]
[Table: values of k̂(m, l) for the stationary and block bootstraps, for m = 20, 30, 40, 50, 60, 70.]
var{h(Ȳ)} ≈ h′(μ)² var(Ȳ),

where Ȳ is the average of Y_1, …, Y_n and

n var(Ȳ) → ζ = γ_0 + 2 Σ_{j=1}^∞ γ_j = Σ_{j=−∞}^∞ γ_j.  (8.12)

Under block resampling with k blocks whose averages are S_1, …, S_k and S̄ = k^{−1} Σ_j S_j,

E*(Ȳ*) = S̄,   var*(Ȳ*) = k^{−2} Σ_{j=1}^k (S_j − S̄)².

For normal data,

var{(S_1 − μ)²} = 2{var(S_1 − μ)}²,   cov{(S_1 − μ)², (S_2 − μ)²} = 2{cov(S_1 − μ, S_2 − μ)}²,

and calculations along these lines lead to approximations (8.13) and (8.14) for var(B̂) and var(V̂), the variances of the bootstrap bias and variance estimates for h(Ȳ), and to an expansion (8.15) for E*(Ȳ*) − Ȳ.
In this case the leading term of the expansion for B̂ is the product of h′(Ȳ) and the right-hand side of (8.15), so the bootstrap bias estimate for Ȳ as an estimator of θ = μ is non-zero, which is clearly misleading since E(Ȳ) = μ. With overlapping blocks, the properties of the bootstrap bias estimator depend on E*(Ȳ*) − Ȳ, and it turns out that its variance is an order of magnitude larger than for non-overlapping blocks. This difficulty can be removed by wrapping Y_1, …, Y_n around a circle and using n blocks, in which case E*(Ȳ*) = Ȳ, or by re-centring the bootstrap bias estimate to B̂ = E*{h(Ȳ*)} − h{E*(Ȳ*)}. In either case (8.13) and (8.14) apply. One asymptotic benefit of using overlapping blocks when the re-centred estimator is used is that var(B̂) and var(V̂) are reduced by a factor 2/3, though in practice the reduction may not be visible for small n.
The corresponding argument for tail probabilities involves Edgeworth expansions and is considerably more intricate than that sketched above.
Apart from smoothness conditions on h(·), the key requirement for the above argument to work is that ζ and the related sums of autocovariances be finite, and that the autocovariances decrease sharply enough for the various terms neglected to be negligible. This is the case if γ_j ~ a^j for sufficiently large j and some a with |a| < 1, as is the case for stationary finite ARMA processes. However, if for large j we find that γ_j ~ j^{−δ}, where ½ < δ < 1, these sums are not finite and the argument will fail. In this case g(ω) ~ ω^{δ−1} for small ω, so long-range dependence of this sort is characterized by a pole at the origin in the spectrum g(ω), of which ζ is the value at the origin. The data counterpart of this is a sharp increase in periodogram ordinates at small values of ω. Thus a careful examination of the periodogram near the origin and of the long-range correlation structure is essential before applying the block bootstrap to data.
Ỹ_k = n^{−1/2} Σ_{j=0}^{n−1} (y_j − ȳ) ζ^{jk},   k = 0, …, n − 1,

where ζ = exp(2πi/n). Then set

ẽ*_k = 2^{−1/2} (X̃*_k + X̃*ᶜ_{n−k}),   k = 0, …, n − 1,

where superscript c denotes complex conjugate and we take X̃*_n = X̃*_0. Apply the inverse Fourier transform to ẽ*_0, …, ẽ*_{n−1} to obtain

Y*_j = ȳ + n^{−1/2} Σ_{k=0}^{n−1} ζ^{−jk} ẽ*_k,   j = 0, …, n − 1.
Step 3 guarantees that Ỹ*_k has complex conjugate Ỹ*_{n−k}, and therefore that the bootstrap series Y*_0, …, Y*_{n−1} is real. An alternative to step 2 is to resample the U_k from the observed phases.
The bootstrap series always has average ȳ, which implies that phase scrambling should be applied only to statistics that are invariant to location changes of the original series; in fact it is useful only for linear contrasts of the y_j, as we shall see below. It is straightforward to see that

Y*_j = ȳ + (2^{1/2}/n) Σ_{l=0}^{n−1} (y_l − ȳ) Σ_{k=0}^{n−1} cos{2πk(l − j)/n + U_k},   j = 0, …, n − 1,  (8.16)

from which it follows that the bootstrap data are stationary, with covariances equal to the circular covariances of the original series, and that all their odd joint cumulants equal zero (Problem 8.4). This representation also makes it clear that the resampled series will be essentially linear with normal margins.
The difference between phase scrambling and model-based resampling can be deduced from Algorithm 8.1. Under phase scrambling,

|Ỹ*_k|² = |Ỹ_k|² [1 + cos(U_k + U_{n−k})],  (8.17)

which gives

E*(|Ỹ*_k|²) = |Ỹ_k|²,   var*(|Ỹ*_k|²) = ½|Ỹ_k|⁴.

Clearly these resampling schemes will give different results unless the quantities of interest depend only on the means of the |Ỹ*_k|², i.e. are essentially quadratic in the data. Since the quantity of interest must also be location-invariant, this restricts the domain of phase scrambling to such tasks as estimating the variances of linear contrasts in the data.
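For illustration, the whole scheme can be coded in a few lines of S-style code; this is a sketch of ours following the steps above, not code from the text. The centred series is transformed, the phases are randomized, the coefficients are symmetrized so that the inverse transform is real, and the mean is restored.

phase.scramble <- function(y)
{ n <- length(y)
  z <- fft(y - mean(y))                  # empirical Fourier transform
  x <- z*exp(1i*runif(n, 0, 2*pi))       # step 2: random phases U_k
  e <- (x + Conj(x[c(1, n:2)]))/sqrt(2)  # step 3: e_k = (X_k + conj X_{n-k})/sqrt(2)
  mean(y) + Re(fft(e, inverse=T))/n }    # inverse transform; real up to rounding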
Example 8.7 (Rio Negro data)  We assess empirical properties of phase scrambling using the first 120 months of the Rio Negro data, which we saw previously were well-fitted by an AR(2) model with normal errors. Note that our statistic of interest, T = Σ a_j Y_j, has the necessary structure for phase scrambling not automatically to fail.
Figure 8.11 shows three phase scrambled datasets, which look similar to the AR(2) series in the second row of Figure 8.7.
The top panels of Figure 8.12 show the empirical Fourier transform for the original data and for one resample. Phase scrambling seems to have shrunk the moduli of the series towards zero, giving a resampled series with lower overall variability. The lower left panel shows smoothed periodograms for the original data and for 9 phase scrambled resamples, while the right panel shows corresponding results for simulation from the fitted AR(2) model. The results are quite different, and show that data generated by phase scrambling are less variable than those generated from the fitted model.
Resampling with 999 series generated from the fitted AR(2) model and by phase scrambling, the distribution of T* is close to normal under both schemes but it is less variable under phase scrambling; the estimated variances are 27.4 and 20.2. These are similar to the estimates of about 27.5 and 22.5 obtained using the block and stationary bootstraps.
Before applying phase scrambling to the full series, we must check that it shows no sign of nonlinearity or of long-range dependence, and that it is plausibly close to a linear series with normal errors. With m = 20 the nonlinearity statistic described in Example 8.3 takes value 0.015, and no value for m < 30 is greater than 0.84: this gives no evidence that the series is nonlinear. Moreover the periodogram shows no signs of a pole as ω → 0+, so long-range dependence seems to be absent. An AR(8) model fits the series well, but the residuals have heavier tails than the normal distribution, with kurtosis 1.2. The variance of T* under phase scrambling is about 51, which
411
CD
O
Tj-
O
C\J
o
o
C>
- 60
-40
- 20
20
40
60
-60
- 40
- 20
20
40
60
CG
o>
0)
e
o
o>
o
omega
omega
again is similar to the estimates from the block resampling schemes. Although this estimate may be untrustworthy, on the face of things it casts no doubt on the earlier conclusion that the evidence for trend is weak.
T = (2π/n) Σ_k a_k I_k,

where I_k = I(ω_k), a_k = a(ω_k), and ω_k is the kth Fourier frequency. For a linear process

Y_j = Σ_{i=−∞}^∞ b_i ε_{j−i},

where {ε_i} is a stream of independent and identically distributed random variables with standardized fourth cumulant κ_4, the means and covariances of the I_k are approximately

E(I_k) = g(ω_k),   cov(I_k, I_l) = g(ω_k)g(ω_l)(δ_kl + n^{−1}κ_4),

so that

E(T) ≈ ∫ a(ω)g(ω) dω,   var(T) ≈ n^{−1} { 2π ∫ a²(ω)g²(ω) dω + κ_4 ( ∫ a(ω)g(ω) dω )² }.  (8.18)
A kernel estimate of g at frequency η has the form

T = (2π/n) Σ_{k=0}^{n−1} h^{−1} K{(η − ω_k)/h} I_k,

where K(·) is a symmetric PDF with mean zero and unit variance and h is a positive smoothing parameter. Then

E(T) ≈ ∫ h^{−1} K{(η − ω)/h} g(ω) dω,
var(T) ≈ (nh)^{−1} 2π {g(η)}² ∫ K²(u) du + n^{−1} κ_4 { ∫ h^{−1} K{(η − ω)/h} g(ω) dω }².

Since we must have h → 0 as n → ∞ in order to remove the bias of T, the second term in the variance is asymptotically negligible relative to the first term, as is necessary for the resampling scheme outlined above to work with a time series for which κ_4 ≠ 0. Comparison of the variance and bias terms implies that the asymptotic form of the relative mean squared error for estimation of g(η) is minimized by taking h ∝ n^{−1/5}. However, there are two difficulties in using resampling to make inference about g(η) from T.
The first difficulty is analogous to that seen in Example 5.13, and appears on comparing T and its bootstrap analogue

T* = (2π/n) Σ_{k=1}^{n−1} h^{−1} K{(η − ω_k)/h} I*_k.

We suppose that I*_k is generated using a kernel estimate g̃(ω_k) with smoothing parameter h̃. The standardized versions of T and T* are

Z = (nhc)^{1/2} {T − g(η)}/g(η),   Z* = (nhc)^{1/2} {T* − g̃(η)}/g̃(η),

where c = {2π ∫ K²(u) du}^{−1}, and their means are

E(Z) = (nhc)^{1/2} [E(T) − g(η)]/g(η),   E*(Z*) = (nhc)^{1/2} E*[{T* − g̃(η)}/g̃(η)].
[Table: empirical means and variances of Z and Z* for time series of lengths n = 65, 129, 257, 513, 1025, and ∞, with normal and chi-squared innovations.]
Figure 8.13. Comparison of distributions of Z and Z* for time series of length 257. The left panel shows a normal plot of 1000 values of Z. The right panel compares the distributions of Z and Z*.
[Figure: the data and estimates Ẑ(t) of Z(t), plotted against distance.] The dashed lines are pointwise 95% confidence bands from R = 999 realizations of the binomial process, and the dotted lines are overall bands with level about 92%, obtained by using the method outlined after (4.17) with k = 2. Relative to a Poisson process there is a significant deficiency of pairs of points lying close together, which confirms our previous impression.
The lower right panel of the figure shows the corresponding results for simulations from the Strauss process, a parametric model of interaction that can inhibit patterns in which pairs lie close together. This models the local behaviour of the data better than the stationary Poisson process.
Figure 8.15. Neurophysiological point process. The rows of the left panel show 100 replicates of the interval surrounding the times at which a human subject was given a stimulus; each point represents the time at which the firing of a neuron was observed. The right panels show a histogram and kernel intensity estimate (×10⁻² ms⁻¹) from superposing the events on the left, which are shown by the rug in the lower right panel.
The right panels of Figure 8.15 show a histogram of the superposed data and a rescaled kernel estimate of the intensity λ(y) in units of 10⁻² ms⁻¹,

λ̂(y; h) = 100 × (Nh)^{−1} Σ_{j=1}^n w{(y − y_j)/h},

where w(·) is a symmetric density with mean zero and unit variance; we use the standard normal density with bandwidth h = 7.5 ms. Over the observation period this estimate integrates to 100n/N. The estimated intensity is highly variable and it is unclear which of its features are spurious. We can try to construct a confidence region for λ(y) at a set 𝒴 of y values of interest, but the same problems arise as in Examples 5.13 and 8.8.
Once again the key difficulty is bias: λ̂(y; h) estimates not λ(y) but ∫ w(u)λ(y − hu) du. For large n and small h this means that

E{λ̂(y; h)} ≈ λ(y) + ½h²λ″(y),   var{λ̂(y; h)} ≈ c(Nh)^{−1}λ(y),

where c = ∫ w²(u) du. As in Example 5.13, the delta method (Section 2.7.1) implies that λ̂(y; h)^{1/2} has approximately constant variance ¼c(Nh)^{−1}. We choose to work with the standardized quantities

Z(y; h) = {λ̂^{1/2}(y; h) − λ^{1/2}(y)} / {½(Nh)^{−1/2}c^{1/2}},   y ∈ 𝒴.  (8.19)
Overall confidence bands for λ(y) over 𝒴 then have limits of the form

[λ̂^{1/2}(y; h) − ½(Nh)^{−1/2}c^{1/2} z_{U,α}(h)]²,   [λ̂^{1/2}(y; h) − ½(Nh)^{−1/2}c^{1/2} z_{L,α}(h)]²,  (8.20)

where z_{L,α}(h) and z_{U,α}(h) are appropriate quantiles associated with Z(y; h). In practice we must use resampling analogues Z*(y; h) of Z(y; h) to estimate z_{L,α}(h) and z_{U,α}(h), and for this to be successful we must choose h and the resampling scheme to ensure that Z* and Z have approximately the same distributions.
In this context there are a number of possible resampling schemes. The simplest is to take n events at random from the observed events. This relies on the independence assumptions for Poisson processes. A second scheme generates n* events from the observed events, where n* has a Poisson distribution with mean n. A more robust scheme is to superpose 100 resampled intervals, though this does not hold fixed the total number of events. These schemes would be inappropriate if the estimator of interest presupposed that events could not coincide, as did the K-function of Example 8.9.
For all of these resampling schemes the bootstrap estimators λ̂*(y; h) are unbiased for λ̂(y; h). The natural resampling analogue of Z is

Z*(y; h) = [{λ̂*(y; h)}^{1/2} − {λ̂(y; h)}^{1/2}] / {½(Nh)^{−1/2}c^{1/2}},

and overall bands are based on the extremes min_{y∈𝒴} Z*(y; h) and max_{y∈𝒴} Z*(y; h).
The upper panel of Figure 8.16 shows overall 95% confidence bands for λ(y; 5), using three of the sampling schemes described above. In each case R = 999, and z_{L,0.025}(5) and z_{U,0.025}(5) are estimated by the empirical 0.025 and 0.975 quantiles of the R replicates of min{z*(y; 5), y = −250, −248, …, 250} and max{z*(y; 5), y = −250, −248, …, 250}. Results from resampling intervals and events are almost indistinguishable, while generating data from a fitted intensity gives slightly smoother results. In order to avoid problems at the boundaries, the set 𝒴 is taken to be (−230, 230). The experimental setup implies that the intensity should be about 1 × 10⁻² firings per ms, the only significant departure from which is in the range 0–130 ms, where there is strong evidence that the stimulus affects the firing rate.
421
-200
-100
100
200
Time (ms)
Time (ms)
The lower panel of the figure shows z*_{0.025}(5), z*_{0.975}(5), and the bootstrap bias estimate for λ̂*(y), for resampling intervals and for generating data from a fitted intensity function, with h = 7.5 ms. The quantile processes suggest that the variance-stabilizing transformation has worked well, but the double smoothing effect of the latter scheme shows in the bias. The behaviour of the quantile process when y ≈ −50 ms, where there are no firings, suggests that a variable bandwidth smoother might be better.
Essentially the same ideas can be applied when the data are a single realization of an inhomogeneous Poisson process (Problem 8.8).
where the suspicion is that λ(y) decreases with distance from the origin. Since the disease is rare, the number of cases at y will be well approximated by a Poisson variable with mean λ(y)μ(y), where μ(y) is the population density of susceptible persons at y. The null hypothesis is that λ(y) = λ_0, i.e. that y has no effect on the intensity of cases, other than through μ(y). A crucial difficulty is that μ(y) is unknown and will be hard to estimate from the data available.
One approach to testing for constancy of λ(y) is to compare the point pattern for 𝒟 to that of another disease 𝒟′. This disease is chosen to have the same population of susceptible individuals as 𝒟, but its incidence is assumed to be unrelated to emissions from the site and to incidence of 𝒟, and so it arises with constant but unknown rate λ′ per person-year. If 𝒟′ is also rare, it will be reasonable to suppose that the number of cases of 𝒟′ at y has a Poisson distribution with mean λ′μ(y). Hence the conditional probability of a case of 𝒟 at y, given that there is a case of 𝒟 or 𝒟′ at y, is π(y) = λ(y)/{λ′ + λ(y)}. If the disease locations are indicated by y_j, and d_j is zero or one according as the case at y_j has 𝒟′ or 𝒟, the likelihood is

Π_j π(y_j)^{d_j} {1 − π(y_j)}^{1−d_j}.

If a suitable form for λ(y) is assumed we can obtain the likelihood ratio or perhaps another statistic T to test the hypothesis that π(y) is constant. This is a test of proportional hazards for 𝒟 and 𝒟′, but unlike in Example 4.4 the alternative is specified, at least weakly.
When λ(y) ≡ λ_0 an approximation to the null distribution of T can be obtained by permuting the labels on cases at different locations. That is, we perform R random reallocations of the labels 𝒟 and 𝒟′ to the y_j, recompute T for each such reallocation, and see whether the observed value of t is extreme relative to the simulated values t*_1, …, t*_R. ■
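A sketch of such a permutation test in S-style code (ours, with t.fun standing for whatever statistic is chosen) is:

perm.test <- function(y, d, t.fun, R=999)
{ t.obs <- t.fun(y, d)                          # observed statistic
  t.star <- replicate(R, t.fun(y, sample(d)))   # random reallocations of labels
  (1 + sum(t.star >= t.obs))/(R + 1) }          # Monte Carlo significance level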
Example 8.12 (Brambles)  The upper left panel of Figure 8.17 shows the locations of 103 newly emergent and 97 one-year-old bramble canes in a 4.5 m square plot. It seems plausible that these two types of event are related, but how should this be tested? Events of both types are clustered, so a Poisson null hypothesis is not appropriate, nor is it reasonable to permute the labels attached to events, as in the previous example.
Let us denote the locations of the two types of event by y_1, …, y_n and y′_1, …, y′_{n′}. Suppose that a statistic T = t(y_1, …, y_n, y′_1, …, y′_{n′}) is available that tests for association between the event types. If the extent of the observation region were infinite, we might construct a null distribution for T by applying random translations to events of one type. Thus we would generate values T* = t(y_1 + U*, …, y_n + U*, y′_1, …, y′_{n′}), where U* is a randomly chosen location in the plane. This sampling scheme has the desirable property of fixing the internal structure of each pattern while breaking the association between the two types of event.
423
++
\ \
+:*4 + ++ *-
v*
+
;
++ **
4-
++
V.
424
8 Complex Dependence
8.3.4 Tiles
Little is known about resampling spatial processes when there is no parametric model. One nonparametric approach that has been investigated starts from a partition of the observation region ℛ into disjoint tiles 𝒜_1, …, 𝒜_n of equal size and shape. If we abuse notation by identifying each tile with the pattern it contains, we can write the original value of the statistic as T = t(𝒜_1, …, 𝒜_n). The idea is to create a resampled pattern by taking a random sample of tiles 𝒜*_1, …, 𝒜*_n from 𝒜_1, …, 𝒜_n, with corresponding bootstrap statistic T* = t(𝒜*_1, …, 𝒜*_n). The hope is that if dependence is relatively short-range, taking large tiles will preserve enough dependence to make the properties of T* close to those of T. If this is to work, the size of the tile must be chosen to trade off preserving dependence, which requires a few large tiles, and getting a good estimate of the distribution of T, which requires many tiles.
This idea is analogous to block resampling in time series, and is capable of similar variations. For example, rather than choosing the 𝒜*_j independently from the fixed tiles 𝒜_1, …, 𝒜_n, we may resample moving tiles by setting
each 𝒜*_j equal to the pattern within a tile of the same size and shape whose position in ℛ is chosen at random.
[Table 8.5: mean squared errors of the tile variance estimates for fixed and moving tiles with n = 16, 36, 64, 100, 144, 196, 256 tiles, for the Poisson process (with theoretical values), the Strauss process, and sequential spatial inhibition.]
Table 8.5 shows the results. For the Poisson process the fixed tile results broadly agree with theoretical calculations (Problem 8.9), and the moving tile results accord with general theory, which predicts that mean squared errors for moving tiles should be lower than for fixed tiles. Here the mean squared error decreases to 22 as n → ∞.
The fitted Strauss process inhibits pairs of points closer together than 12 units. The mean squared error is minimized when n = 100, corresponding to tiles of side 20; the average estimated variances from the 100 replicates are then 19.0 and 18.2. The mean squared errors for moving tiles are rather lower, but their pattern is similar.
The sequential spatial inhibition results are similar to those for the Strauss process, but with a sharper rise in mean squared error for larger n.
In this setting theory predicts that for a process with sufficiently short-range dependence, the optimal n grows in proportion to the linear dimension of ℛ. If the caveolae data were generated by a Strauss process, results from Table 8.5 would suggest that we take n = 100 × 500/200 = 250, so there would be 16 tiles along each side of ℛ. With R = 200 and fixed and moving tiles this gives variance estimates of 101.6 and 100.4, both considerably smaller than the variance for Poisson data, which would be 138.
Model-based resampling for time series was discussed by Freedman (1984), Freedman and Peters (1984a,b), Swanepoel and van Wyk (1986) and Efron and Tibshirani (1986), among others. Li and Maddala (1996) survey much of the related time domain literature, which has a somewhat theoretical emphasis; their account stresses econometric applications. For a more applied account of parametric bootstrapping in time series, see Tsay (1992). Bootstrap prediction in time series is discussed by Kabaila (1993b), while the bootstrapping of state-space models is described by Stoffer and Wall (1991). The use of model-based resampling for order selection in autoregressive processes is discussed by Chen et al. (1993).
Block resampling for time series was introduced by Carlstein (1986). In an important paper, Künsch (1989) discussed overlapping blocks in time series, although in spatial data the proposal of block resampling in Hall (1985) predates both. Liu and Singh (1992a) also discuss the properties of block resampling schemes. Politis and Romano (1994a) introduced the stationary bootstrap, and in a series of papers (Politis and Romano, 1993, 1994b) have discussed theoretical aspects of more general block resampling schemes. See also Bühlmann and Künsch (1995) and Lahiri (1995). The method for block length choice outlined in Section 8.2.3 is due to Hall, Horowitz and Jing (1995); see also Hall and Horowitz (1993). Bootstrap tests for unit roots in autoregressive models are discussed by Ferretti and Romo (1996). Hall and Jing (1996) describe a block resampling approach in which the construction of new series is replaced by Richardson extrapolation.
Bose (1988) showed that model-based resampling for autoregressive processes has good asymptotic higher-order properties for a wide class of statistics. Lahiri (1991) and Götze and Künsch (1996) show that the same is true for block resampling, but Davison and Hall (1993) point out that unfortunately, and unlike when the data are independent, this depends crucially on the variance estimate used.
Forms of phase scrambling have been suggested independently by several authors (Nordgaard, 1990; Theiler et al., 1992), and Braun and Kulperger (1995, 1997) have studied its properties. Hartigan (1990) describes a method for variance estimation in Gaussian series that involves similar ideas but needs no randomization; see Problem 8.5.
Frequency domain resampling has been discussed by Franke and Härdle (1992), who make a strong analogy with bootstrap methods for nonparametric regression. It has been further studied by Janas (1993) and Dahlhaus and Janas (1996), on which our account is based.
Our discussion of the Rio Negro data is based on Brillinger (1988, 1989), which should be consulted for statistical details, while Sternberg (1987, 1995) gives accounts of the data and background to the problem.
Models based on point processes have a long history and varied provenance.
8.5 Problems
1  Suppose that y_1, …, y_n is an observed time series, and let z_{i,j} denote the block of length j starting at y_i, where we set y_i = y_{1+((i−1) mod n)}, so that y_0 = y_n. Also let I_1, I_2, … be a stream of random numbers uniform on the integers 1, …, n and let L_1, L_2, … be a stream of random numbers having the geometric distribution Pr(L = l) = p(1 − p)^{l−1}, l = 1, 2, …. The algorithm to generate a single stationary bootstrap replicate is

Algorithm 8.2 (Stationary bootstrap)
Concatenate the blocks z_{I_1,L_1}, z_{I_2,L_2}, … until at least n observations have been generated, and set Y*_1, …, Y*_n equal to the first n of them.

Define the empirical circular autocovariances

c_k = n^{−1} Σ_{j=1}^n (y_j − ȳ)(y_{1+((j+k−1) mod n)} − ȳ),   k = 0, …, n.

Show that conditional on y_1, …, y_n,

E*(Y*_i) = ȳ,   cov*(Y*_i, Y*_{i+j}) = (1 − p)^j c_j.

2  Show that under the stationary bootstrap, conditional on the data,

n var*(Ȳ*) = c_0 + 2 Σ_{j=1}^{n−1} (1 − j/n)(1 − p)^j c_j,

where c_0, c_1, … are the empirical circular autocovariances defined in Problem 8.1, and that this approaches ζ = γ_0 + 2 Σ_j γ_j if Σ_j j|γ_j| is finite.
(Section 8.2.3; Politis and Romano, 1994a)
3  (a) Using the setup described on pages 405–408, show that Σ_j (S_j − S̄)² has mean v_jj − b^{−1}v_ij and variance v_iijj + ⋯, where v_ij = cov(S_i, S_j), v_ijk = cum(S_i, S_j, S_k) and so forth are the joint cumulants of the S_j, and summation is understood over each index.
(b) For an m-dependent normal process, show that provided l > m,

v_ij = l^{−1}ζ(l) if i = j,   v_ij = l^{−2}c(l) if |i − j| = 1,   v_ij = 0 otherwise.
4  Use representation (8.16) to show that n^{−1} Σ_j Y*_j = ȳ, that the covariances of the Y*_j are the circular autocovariances of the data, where j + m is interpreted mod n, and that all odd joint moments of the Y*_j are zero.
This last result implies that the resampled series have a highly symmetric joint distribution. When the original data have an asymmetric marginal distribution, the following procedure has been proposed: replace the ordered values of the phase scrambled series by the ordered values of the original series y_0, …, y_{n−1}, so that the marginal distribution is preserved.

5  Let b > 0 and let c_1, …, c_m be contrast vectors with elements c_{ij}, i = 1, …, m + 1, satisfying Σ_i c_{ij} = 0, Σ_i c_{ij}c_{ik} = 0 for j ≠ k, and Σ_i c_{ij}² = b for j = 1, …, m. Now suppose that Y_0, …, Y_{n−1} is a time series of length n = 2m + 1, with empirical Fourier transform Ỹ_0, …, Ỹ_{n−1} and periodogram ordinates I_k = |Ỹ_k|²/n, for k = 0, …, m. For each i = 1, …, m + 1, let the perturbed periodogram ordinates be

Ỹ_0^{(i)} = Ỹ_0,   Ỹ_k^{(i)} = (1 + c_{ik})^{1/2} Ỹ_k,   Ỹ_{n−k}^{(i)} = (1 + c_{ik})^{1/2} Ỹ_{n−k},   k = 1, …, m,

from which the ith replacement time series is obtained by the inverse Fourier transform.
Let T be the value of a statistic calculated from the original series. Explain how the corresponding resample values T^{(1)}, …, T^{(m+1)} may be used to obtain an approximately unbiased estimate of the variance of T, and say for what types of statistics you think this is likely to work.
(Section 8.2.4; Hartigan, 1990)
6  Consider a ratio statistic T = Σ_k a_k I_k / Σ_k I_k, and write

T = I_ag I_g^{−1} (1 + n^{−1/2}X_a)(1 + n^{−1/2}X_1)^{−1},

say. Use (8.18) to show that X_a and X_1 have means zero and that

var(X_a) = 2π I_aagg I_ag^{−2} + κ_4,   cov(X_1, X_a) = 2π I_agg I_ag^{−1} I_g^{−1} + κ_4,   var(X_1) = 2π I_gg I_g^{−2} + κ_4,

where I_aagg = ∫ a²(ω)g²(ω) dω, and so forth. Hence show that to first order the mean and variance of T do not involve κ_4, and deduce that periodogram resampling may be applied to ratio statistics.
Use simulation to see how well periodogram resampling performs in estimating the distribution of a suitable version of the sample estimate of the lag j autocorrelation,

ρ_j = ∫ e^{−iωj} g(ω) dω / ∫ g(ω) dω.
7  Let λ̂(y; h) denote a kernel estimate of λ(y), based on a kernel w(·) that is a PDF. Explain why the following two algorithms for generating bootstrap data from the estimated intensity are (almost) equivalent: (i) generate n* from a Poisson distribution with mean ∫ λ̂(u; h) du, and then generate n* events independently from the density λ̂(y; h)/∫ λ̂(u; h) du; (ii) generate n* from the same Poisson distribution, take events y*_1, …, y*_{n*} at random from y_1, …, y_n, and set y**_j = y*_j + hε_j, where the ε_j are independent with density w.
(Section 8.3.2)
8  Consider an inhomogeneous Poisson process of intensity λ(y) = Nμ(y), where μ(y) is fixed and smooth, observed for 0 ≤ y ≤ 1. A kernel intensity estimate based on events at y_1, …, y_n is

λ̂(y; h) = h^{−1} Σ_{j=1}^n w{(y − y_j)/h}.

(a) Find the mean and variance of λ̂(y; h); you may need the facts that the number of events n has a Poisson distribution with mean Λ = ∫_0^1 λ(u) du, and that conditional on there being n observed events, their positions are independent with density λ(u)/Λ.
(b) Now suppose that resamples are formed by taking n observations at random from y_1, …, y_n. Show that the bootstrapped intensity estimate

λ̂*(y; h) = h^{−1} Σ_{j=1}^n w{(y − y*_j)/h}

has mean E*{λ̂*(y; h)} = λ̂(y; h), and that the same is true when there are n* resampled events, provided that E*(n*) = n.
For a third resampling scheme, let n* have a Poisson distribution with mean n, and generate n* events independently from density λ̂(y; h)/∫_0^1 λ̂(u; h) du. Show that under this scheme

E*{λ̂*(y; h)} = ∫ w(u) λ̂(y − hu; h) du.

With Z(y; h) as defined at (8.19) and

Z*(y; h) = [{λ̂*(y; h)}^{1/2} − {λ̂(y; h)}^{1/2}] / {½(Nh)^{−1/2}c^{1/2}},

find conditions under which the quantiles of Z* can estimate those of Z.
(Section 8.3.2; Example 5.13; Cowling, Hall and Phillips, 1996)
9  Consider resampling tiles when the observation region ℛ is a square, the data are generated by a stationary planar Poisson process of intensity λ, and the quantity of interest is θ = var(Y), where Y is the number of events in ℛ.
Suppose that ℛ is split into n fixed tiles of equal size and shape, which are then resampled according to the usual bootstrap. Show that the bootstrap estimate of θ is T = Σ (y_j − ȳ)², where y_j is the number of events in the jth tile. Use the fact that var(T) = (n − 1)²{κ_4/n + 2κ_2²/(n − 1)}, where κ_r is the rth cumulant of the y_j, to show that the mean squared error of T is

n^{−2} μ {μ + (n − 1)(2μ + n − 1)},

where μ = λ|ℛ|. Sketch this as a function of n when μ > 1, μ = 1, and μ < 1, and explain in qualitative terms its behaviour when μ > 1.
Extend the discussion to moving tiles.
(Section 8.3)
8.6 Practicals
1  Dataframe lynx contains the Canadian lynx data, to the logarithm of which we fit the autoregressive model that minimizes AIC:

ts.plot(log(lynx))
lynx.ar <- ar(log(lynx))
lynx.ar$order
The best model is AR(11). How well determined is this, and what is the variance of the series average? We bootstrap to see, using lynx.fun (given below), which calculates the order of the fitted autoregressive model, the series average, and saves the series itself.
Here are results for fixed-block bootstraps with block length l = 20:

lynx.fun <- function(tsb)
{ ar.fit <- ar(tsb, order.max=25)
  c(ar.fit$order, mean(tsb), tsb) }
lynx.1 <- tsboot(log(lynx), lynx.fun, R=99, l=20, sim="fixed")
tsplot(ts(lynx.1$t[1,3:116], start=c(1821,1)),
       main="Block simulation, l=20")
boot.array(lynx.1)[1,]
table(lynx.1$t[,1])
var(lynx.1$t[,2])
qqnorm(lynx.1$t[,2])
abline(mean(lynx.1$t[,2]), sqrt(var(lynx.1$t[,2])), lty=2)
To obtain similar results for the stationary bootstrap with mean block length l = 20:
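lynx.2 <- tsboot(log(lynx), lynx.fun, R=99, l=20, sim="geom")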
See if the results look different from those above. Do the simulated series using blocks look like the original? Compare the estimated variances under the two resampling schemes. Try different block lengths, and see how the variances of the series average change.
For model-based resampling we need to store results from the original model:
lynx.model <- list(order=c(lynx.ar$order,0,0), ar=lynx.ar$ar)
lynx.res <- lynx.ar$resid[!is.na(lynx.ar$resid)]
lynx.res <- lynx.res - mean(lynx.res)
lynx.sim <- function(res, n.sim, ran.args)
{ rg1 <- function(n, res) sample(res, n, replace=T)
  ts.orig <- ran.args$ts
  ts.mod <- ran.args$model
  mean(ts.orig) + ts(arima.sim(model=ts.mod, n=n.sim,
       rand.gen=rg1, res=as.vector(res))) }
.Random.seed <- lynx.1$seed
lynx.3 <- tsboot(lynx.res, lynx.fun, R=99, sim="model",
       n.sim=114, ran.gen=lynx.sim,
       ran.args=list(ts=log(lynx), model=lynx.model))
Compare these results with those above, and try the post-blackened bootstrap with sim="geom".
(Sections 8.2.2, 8.2.3)
2  The data in beaver consist of a time series of n = 100 observations on the body temperature y_1, …, y_n and an indicator x_1, …, x_n of activity of a female beaver, Castor canadensis. We want to estimate and give an uncertainty measure for the body temperature of the beaver. The simplest model that allows for the clear autocorrelation of the series is

y_j = β_0 + β_1 x_j + η_j,   η_j = αη_{j−1} + ε_j,   j = 1, …, n,  (8.21)

a linear regression model in which the errors η_j form an AR(1) process, and the ε_j are independent identically distributed errors with mean zero and variance σ². Having fitted this model, estimated the parameters α, β_0, β_1, σ², and calculated the residuals e_1, …, e_n (e_1 cannot be calculated), we generate bootstrap series by the recipe

y*_j = β̂_0 + β̂_1 x_j + η*_j,   η*_j = α̂η*_{j−1} + ε*_j,   j = 1, …, n,  (8.22)

where the error series {ε*_j} is formed by taking a white noise series at random from the set {σ̂(e_2 − ē), …, σ̂(e_n − ē)} and then applying the second part of (8.22).
To fit the original model and to generate a new series:
fit <- function(data)
{ X <- cbind(rep(1,100), data$activ)
  para <- list(X=X, data=data)
  assign("para", para, frame=1)
  d <- arima.mle(x=para$data$temp, model=list(ar=c(0.8)),
       xreg=para$X)
  res <- arima.diag(d, plot=F, std.resid=T)$std.resid
  res <- res[!is.na(res)]
  list(paras=c(d$model$ar, d$reg.coef, sqrt(d$sigma2)),
       res=res-mean(res), fit=X %*% d$reg.coef) }
beaver.args <- fit(beaver)
white.noise <- function(n.sim, ts) sample(ts, size=n.sim, replace=T)
beaver.gen <- function(ts, n.sim, ran.args)
{ tsb <- ran.args$res
  fit <- ran.args$fit
  coeff <- ran.args$paras
  ts$temp <- fit + coeff[4]*arima.sim(model=list(ar=coeff[1]),
       n=n.sim, rand.gen=white.noise, ts=tsb)
  ts }
new.beaver <- beaver.gen(beaver, 100, beaver.args)
Now we are able to generate data, we can bootstrap and see the results of beaver.boot as follows:

beaver.fun <- function(ts) fit(ts)$paras
beaver.boot <- tsboot(beaver, beaver.fun, R=99, sim="model",
       n.sim=100, ran.gen=beaver.gen, ran.args=beaver.args)
names(beaver.boot)
beaver.boot$t0
beaver.boot$t[1:10,]

showing the original value of beaver.fun and its value for the first 10 replicate
series. Are the estimated mean temperatures for the R = 99 simulations normal? Use boot.ci to obtain normal and basic bootstrap confidence intervals for the resting and active temperatures.
In this analysis we have assumed that the linear model with AR(1) errors is appropriate. How would you proceed if it were not?
(Section 8.2; Reynolds, 1994)
3  Consider scrambling the phases of the sunspot data. To see the original data, two replicates generated using ordinary phase scrambling, and two phase scrambled series whose marginal distribution is the same as that of the original data:

sunspot.fun <- function(ts) ts
sunspot.1 <- tsboot(sunspot, sunspot.fun, R=2, sim="scramble")
.Random.seed <- sunspot.1$seed
sunspot.2 <- tsboot(sunspot, sunspot.fun, R=2, sim="scramble", norm=F)
split.screen(c(3,2))
yl <- c(-50,200)
screen(1); ts.plot(sunspot, ylim=yl); abline(h=0, lty=2)
screen(3); tsplot(sunspot.1$t[1,], ylim=yl); abline(h=0, lty=2)
screen(4); tsplot(sunspot.1$t[2,], ylim=yl); abline(h=0, lty=2)
screen(5); tsplot(sunspot.2$t[1,], ylim=yl); abline(h=0, lty=2)
screen(6); tsplot(sunspot.2$t[2,], ylim=yl); abline(h=0, lty=2)

What features of the original data are preserved by the two algorithms? (You may find it helpful to experiment with different shapes for the figures.)
(Section 8.2.4; Problem 8.4; Theiler et al., 1992)
9
Improved Calculation
9.1 Introduction
A few of the statistical questions in earlier chapters have been amenable to analytical calculation. However, most of our problems have been too complicated for exact solutions, and samples have been too small for theoretical large-sample approximations to be trustworthy. In such cases simulation has provided approximate answers through Monte Carlo estimates of bias, variance, quantiles, probabilities, and so forth. Throughout we have supposed that the simulation size is limited only by our impatience for reliable results.
Simulation of independent bootstrap samples and their use as described in previous chapters is usually easily programmed and implemented. If it takes up to a few hours to calculate enough values of the statistic of interest, T, ordinary simulation of this sort will be an efficient use of a researcher's time. But sometimes T is very costly to compute, or sampling is only a single component in a larger procedure (as in a double bootstrap), or the procedure will be repeated many times with different sets of data. Then it may pay to invest in methods of calculation that reduce the number of simulations needed to obtain a given precision, or equivalently increase the accuracy of an estimate based on a given simulation size. This chapter is devoted to such methods.
No lunch is free. The techniques that give the biggest potential variance reductions are usually the hardest to implement. Others yield less spectacular gains, but are more easily implemented. Thoughtless use of any of them may make matters worse, so it is essential to ensure that use of a variance reduction technique will save the investigator's time, which is much more valuable than computer time.
Most of our bootstrap estimates depend on averages. For example, in testing a null hypothesis (Chapter 4) we want to calculate the significance probability p = Pr(T ≥ t | F̂_0), where t is the observed value of test statistic T and
the fitted model F̂_0 is an estimate of F under the null hypothesis. The simple Monte Carlo estimate of p is R^{−1} Σ I{T*_r ≥ t}, where I is the indicator function and the T*_r are based on R independent samples generated from F̂_0. The variance of this estimate is cR^{−1}, where c = p(1 − p). Nothing can generally be done about the factor R^{−1}, but the constant c can be reduced if we use a more sophisticated Monte Carlo technique. Most of this chapter concerns such techniques. Section 9.2 describes methods for balancing the simulation in order to make it more like a full enumeration of all possible samples, and in Section 9.3 we describe methods based on the use of control variates. Section 9.4 describes methods based on importance sampling. In Section 9.5 we discuss one important method of theoretical approximation, the saddlepoint method, which eliminates the need for simulation.
E*(T*) = ∫ t(y*_1, …, y*_n) g(y*_1, …, y*_n) dy*_1 ⋯ dy*_n.  (9.1)

This sum over all possible samples need involve only (2n − 1)!/{n!(n − 1)!} calculations of t*, since the symmetry of t(·) with respect to the sample can be used, but even so the complete enumeration of values t* that (9.1) requires will usually be impracticable unless n is very small. So it is that, especially in nonparametric problems, we usually approximate the average in (9.1) by the average of R randomly chosen elements of 𝒴, and so approximate B by B_R = R^{−1} Σ_r T*_r − t.
This calculation with a random subset of 𝒴 has a major defect: the values y_1, …, y_n typically do not occur with equal frequency in that subset. This is illustrated in Table 9.1, which reproduces Table 2.2 but adds (penultimate row) the aggregate frequencies for the data values; the final row is explained later. In the even simpler case of the sample average t = ȳ we can see clearly
[Table 9.1: frequencies with which the n = 10 data values appear in R = 9 bootstrap samples (reproducing Table 2.2), with the aggregate frequencies in the penultimate row and the corresponding values of F̄* in the final row. The statistic values are t = 1.520 and t*_1 = 1.466, t*_2 = 1.761, t*_3 = 1.951, t*_4 = 1.542, t*_5 = 1.371, t*_6 = 1.686, t*_7 = 1.378, t*_8 = 1.420, t*_9 = 1.660.]
that the unequal frequencies completely account for the fact that B_R differs from the correct value B = 0. The corresponding phenomenon for parametric bootstrapping is that the aggregated EDF of the R samples is not as close to the CDF of the fitted parametric model as it is to the same model with different parameter values.
There are two ways to deal with this difficulty. First, we can try to change the simulation to remove the defect; and secondly we can try to adjust the results of the existing simulation. The simplest way to do the former is the balanced bootstrap: concatenate R copies of y_1, …, y_n into a single set of size Rn, permute its elements randomly, and take successive blocks of n elements to be the balanced bootstrap samples, so that each y_j occurs exactly R times in aggregate.
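A minimal sketch of this in S-style code (ours, not the book's implementation; t.fun is any statistic of a sample) is:

balanced.boot <- function(y, t.fun, R)
{ n <- length(y)
  idx <- matrix(sample(rep(1:n, R)), R, n)   # R x n balanced index array
  apply(idx, 1, function(i) t.fun(y[i])) }
# bias estimate: mean(balanced.boot(y, t.fun, 49)) - t.fun(y)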
[Table 9.2: frequencies for R = 9 balanced bootstrap samples of the same data, in which each data value has aggregate frequency 9. The statistic values are t = 1.520 and t*_1 = 1.632, t*_2 = 1.823, t*_3 = 1.334, t*_4 = 1.317, t*_5 = 1.531, t*_6 = 1.344, t*_7 = 1.730, t*_8 = 1.424, t*_9 = 1.678.]
The efficiency of balanced relative to ordinary simulation for bias estimation is measured by the ratio

var_{ord}(B_R) / var_{bal}(B_R),

where for this comparison the subscripts denote the sampling scheme under which B_R was calculated.
[Table 9.3: efficiency gains for estimating the biases of β̂_0, β̂_1, and σ̂ by balanced resampling and by post-simulation adjustment, under case, stratified, and residual resampling.]
So far we have focused on the application to bias estimation, for which the balance typically gives a big improvement. The same is not generally true for estimating higher moments or quantiles. For instance, in the previous example the balanced bootstrap has efficiency less than one for calculation of the variance estimate V_R.
The balanced bootstrap extends quite easily to more complicated sampling situations. If the data consist of several independent samples, as in Section 3.2, balanced simulation can be applied separately to each. Some other extensions are straightforward.
Example 9.2 (Calcium uptake data)  To investigate the improvement in bias estimation for the parameters of the nonlinear regression model fitted to the data of Example 7.7, we calculated 100 replicates of the estimated biases based on 49 bootstrap samples. The resulting efficiencies are given in Table 9.3 for different resampling schemes; the results labelled Adjusted are discussed in Example 9.3. For stratified resampling the data are stratified by the covariate value, so there are nine strata each with three observations. The efficiency gains under stratified resampling are very large, and those under case resampling are worthwhile. The gains when resampling residuals are not worthwhile, except for σ².
can be written in expanded notation as

B_R = R^{−1} Σ_{r=1}^R t(F̂*_r) − t(F̂),  (9.2)

where as usual F̂*_r denotes the EDF corresponding to the rth row of the array. Let F̄* denote the average of these EDFs, that is

F̄* = R^{−1}(F̂*_1 + ⋯ + F̂*_R).

For a frequency table such as Table 9.1, F̄* is the CDF of the distribution corresponding to the aggregate frequencies of data values, as shown in the final row. The resulting adjusted bias estimate is

B_{R,adj} = R^{−1} Σ_{r=1}^R t(F̂*_r) − t(F̄*).  (9.3)
This is sometimes called the re-centred bias estimate. In addition to the usual bootstrap values t(F̂*_r), its calculation requires only F̄* and t(F̄*). Note that for adjustment to work, t(·) must be in a functional form, i.e. be defined independently of sample size n. For example, a variance must be calculated with divisor n rather than n − 1.
The corresponding calculation for a parametric bootstrap is similar. In effect the adjustment compares the simulated estimates T* to the parameter value θ̄* = t(F̄*) obtained by fitting the model to data with EDF F̄* rather than F̂.
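A sketch of the nonparametric calculation in S-style code (ours; t.fun is assumed to be written in functional, weighted form, taking the data and a vector of frequencies) is:

bias.adj <- function(y, t.fun, R)
{ n <- length(y)
  f <- t(rmultinom(R, n, rep(1, n)/n))      # R x n table of frequencies
  t.star <- apply(f, 1, function(fr) t.fun(y, fr))
  mean(t.star) - t.fun(y, colMeans(f)) }    # compare with t(F bar), as in (9.3)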
Example 9.3 (Calcium uptake data)  Table 9.3 shows the efficiency gains from using B_{R,adj} in the nonparametric resampling experiment described in Example 9.2. The gains are broadly similar to those for balanced resampling, but smaller.
For parametric sampling the quantities F̂*_r in (9.3) represent sets of data generated by parametric simulation from the fitted model, and the average F̄* is the dataset of size Rn obtained by concatenating the simulated samples. Here the simplest parametric simulation is to generate data y*_j = μ̂_j + ε*_j, where the μ̂_j are the fitted values from Example 7.7 and the ε*_j are independent N(0, 0.55²) variables. In 100 replicates of this bootstrap with R = 49, the efficiency gains for estimating the biases of β̂_0, β̂_1, and σ̂ were 24.7, 42.5, and 20.7; the effect of the adjustment is much more marked for the parametric than for the nonparametric bootstraps.
The same adjustment does not apply to the variance approximation V_R, higher moments or quantiles. Rather the linear approximation is used as a conventional control variate, as described in Section 9.3.
The theory underlying these methods rests on the expansion

T* = t + n^{−1} Σ_{j=1}^n l(Y*_j; F̂) + ½ n^{−2} Σ_{j=1}^n Σ_{k=1}^n q(Y*_j, Y*_k; F̂),  (9.4)

where l_j = l(Y*_j; F̂) and q_jk = q(Y*_j, Y*_k; F̂) are values of the empirical first- and second-order derivatives of t at F̂; equation (9.4) is the same as (2.41), but with F and F̂ replaced by F̂* and F̂. We call the right-hand side of (9.4) the quadratic approximation to T*. Omission of the final term leaves the linear approximation

T*_L = t + n^{−1} Σ_{j=1}^n l(Y*_j; F̂),  (9.5)

which is the basis of the variance approximation v_L; equation (9.5) is simply a recasting of (2.44).
In terms of the frequencies f*_j with which the y_j appear in the bootstrap sample and the empirical influence values l_j = l(y_j; F̂) and q_jk = q(y_j, y_k; F̂), the quadratic approximation (9.4) is

T* = t + n^{−1} Σ_{j=1}^n f*_j l_j + ½ n^{−2} Σ_{j=1}^n Σ_{k=1}^n f*_j f*_k q_jk.  (9.7)

In the ordinary resampling scheme, the rows of frequencies (f*_{r1}, …, f*_{rn}) are independent samples from the multinomial distribution with denominator n and probability vector (n^{−1}, …, n^{−1}). This is the case in Table 9.1. In this situation the first and second joint moments of the frequencies are

E*(f*_j) = 1,   cov*(f*_j, f*_k) = δ_jk − n^{−1},
where δ_jk = 1 if j = k and zero otherwise, and so forth; the higher cumulants are given in Problem 2.19. Straightforward calculations show that approximation (9.7) has mean t + ½n^{−2} Σ_j q_jj and a variance, given by (9.8), whose leading term is (Rn²)^{−1} Σ_j l_j².
Under balanced resampling the corresponding joint moments are

E*(f*_{rj}) = 1,   cov*(f*_{rj}, f*_{sk}) = (nδ_jk − 1)(Rδ_rs − 1)/(nR − 1),  (9.9)

and the mean and variance of the bias estimate follow as in (9.10). The mean is almost the same under both schemes, but the leading term of the variance in (9.10) is smaller than in (9.8) because the term in (9.7) involving the l_j is held equal to zero by the balance constraints Σ_r f*_{rj} = R. First-order balance ensures that the linear term in the expansion for B_R is held equal to its value of zero for the complete enumeration.
Post-simulation balance is closely related to the balanced bootstrap. It is straightforward to see that the quadratic nonparametric delta method approximation of B_{R,adj} in (9.3) equals

½ n^{−2} Σ_{j=1}^n Σ_{k=1}^n { R^{−1} Σ_{r=1}^R f*_{rj} f*_{rk} − (R^{−1} Σ_{r=1}^R f*_{rj})(R^{−1} Σ_{r=1}^R f*_{rk}) } q_jk.  (9.11)
". / W
JS
c
0)
o
ifc
LU
v '
CO
in
j
V
.- j :
5.0
in
O
0)
oc
m
o
o
T
icy
445
"V*
o
"
_____ -''r'TV'V.T/*
in
d
0.1
0.5
5.0
Adjusted
0.0
0.2
0.4
0.6
0.8
1.0
Correlation
Like the balanced bootstrap estimate of bias, there are no linear terms in this expression. Re-centring has forced those terms to equal their population values of zero.
When the statistic T does not possess an expansion like (9.4), balancing may not help. In any case the correlation between the statistic and its linear approximation is important: if the correlation is low because the quadratic component of (9.4) is appreciable, then it may not be useful to reduce variation in the linear component. A rough approximation is that var*(B) is reduced by a factor equal to 1 minus the square of the correlation between T* and T*_L (Problem 9.5).
Example 9.4 (Normal eigenvalues)  For a numerical comparison of the efficiency gains in bias estimation from balanced resampling and post-simulation adjustment, we performed Monte Carlo experiments as follows. We generated n variates from the multivariate normal density with dimension 5 and identity covariance matrix, and took t to be the five eigenvalues of the sample covariance matrix. For each sample we used a large bootstrap to estimate the linear approximation t*_L for each of the eigenvalues and then calculated the correlation c between t* and t*_L. We then estimated the gains in efficiency for balanced and adjusted estimates of bias calculated using the bootstrap with R = 39, using variances estimated from 100 independent bootstrap simulations.
Figure 9.1 shows the gains in efficiency for each of the 5 eigenvalues, for 50 sets of data with n = 15 and 50 sets with n = 25; there are 500 points in each panel. The left panel compares the efficiency gains for the balanced and adjusted schemes. Balanced sampling gives better gains than post-sample adjustment, but the difference is smaller at larger gains. The right panel shows
the efficiency gains for the balanced scheme plotted against the correlation c. The solid line is the theoretical curve (1 − c²)^{−1}. Knowledge of c would enable the efficiency gain to be predicted quite accurately, at least for c > 0.8. The potential improvement from balancing is not guaranteed to be worthwhile when c < 0.7. The corresponding plot for the adjusted estimates suggests that c must be at least 0.85 for a useful efficiency gain.
This example suggests the following strategy when a good estimate of bias is required: perform a small standard unbalanced bootstrap, and use it to estimate the correlation between the statistic and its linear approximation. If that correlation exceeds about 0.7, it may be worthwhile to perform a balanced simulation, but otherwise it will not. If the correlation exceeds 0.85, post-simulation adjustment will usually be worthwhile, but otherwise it will not.
Writing T* = T*_L + D*, the mean of T* can be written

E*(T*) = E*(T*_L) + E*(D*),

the leading terms of which are known. Only terms involving D* need to be approximated by simulation. Given simulations T*_1, …, T*_R with corresponding linear approximations T*_{L,1}, …, T*_{L,R} and differences D*_r = T*_r − T*_{L,r}, the mean and variance of T* are estimated by

t + D̄*,   V_{R,con} = v_L + (2/R) Σ_{r=1}^R (T*_{L,r} − T̄*_L)(D*_r − D̄*) + R^{−1} Σ_{r=1}^R (D*_r − D̄*)²,  (9.12)

where T̄*_L = R^{−1} Σ_r T*_{L,r} and D̄* = R^{−1} Σ_r D*_r. Use of these and related approximations requires the calculation of the T*_{L,r} as well as the T*_r.
The estimated bias of T* based on (9.12) is B_{R,con} = D̄*. This is closely related to the estimate obtained under balanced simulation and to the re-centred bias estimate B_{R,adj}. Like them, it ensures that the linear component of the bias estimate equals its population value, zero. Detailed calculation shows that all three approaches achieve the same variance reduction for the bias estimate in large samples. However, the variance estimate in (9.12) based on linear approximation is less variable than the estimated variances obtained under the other approaches, because its leading term is not random.
Example 9.5 (City population data)  To see how effective control methods are in reducing the variability of a variance estimate, we consider the ratio statistic for the city population data in Table 2.1, with n = 10. For 100 bootstrap simulations with R = 50, we calculated the usual variance estimate v_R = (R − 1)^{−1} Σ (t*_r − t̄*)² and the estimate V_{R,con} from (9.12). The estimated gain in efficiency calculated from the 100 simulations is 1.92, which though worthwhile is not large. The correlation between t* and t*_L is 0.94.
For the larger set of data in Table 1.3, with n = 49, we repeated the experiment with R = 100. Here the gain in efficiency is 7.5, and the correlation is 0.99.
Figure 9.2 shows scatter plots of the estimated variances in these experiments. For both sample sizes the values of v_{R,con} are more concentrated than the values of v_R, though the main effect of control is to increase underestimates of the true variances.
The four left panels of Figure 9.3 show plots of the values of v_{R,con} against the values of v_R. No strong pattern is discernible.
To get a more systematic idea of the effectiveness of control methods in this setting, we repeated the experiment outlined in Example 9.4 and compared the usual and control estimates of the variances of the five eigenvalues. The results for the five eigenvalues and n = 15 and 25 are shown in Figure 9.3. Gains in efficiency are not guaranteed unless the correlation between the statistic and its linear approximation is 0.80 or more, and they are not large unless the correlation is close to one. The line y = (1 − x⁴)^{−1} summarizes the efficiency gain well, though we have not attempted to justify this.
Quantiles
Control methods may also be applied to quantiles. Suppose that we have the simulated values t*_1, …, t*_R of a statistic, and that the corresponding control variates and differences are available. We now sort the differences by the values of the control variates. For example, if our control variate is a linear approximation, with R = 4 and t*_{L,2} < t*_{L,1} < t*_{L,4} < t*_{L,3}, we put the differences in order d*_2, d*_1, d*_4, d*_3. The procedure now is to replace the p quantile of the linear approximation by a theoretical approximation, t_p, for p = 1/(R + 1), …, R/(R + 1), thereby replacing t*_{L,(r)} with t*_{C,r} = t_p + d*_{π(r)}, where π(r) is the rank of t*_{L,r}. In our example we would obtain t*_{C,1} = t_{0.2} + d*_2, t*_{C,2} = t_{0.4} + d*_1, t*_{C,3} = t_{0.6} + d*_4, and t*_{C,4} = t_{0.8} + d*_3. We now estimate the p quantile of the distribution of T* by t*_{C,(r)}, the rth ordered value of t*_{C,1}, …, t*_{C,R}, with p = r/(R + 1). If the control variate is highly correlated with T*, the bulk of the variability in the estimated quantiles will have been removed by using the theoretical approximation.
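A sketch of this calculation in S-style code (ours; tp is assumed to be a function returning the theoretical p quantile of the control variate, for example from a saddlepoint approximation) is:

control.quantiles <- function(t.star, tL.star, tp)
{ R <- length(t.star)
  d <- t.star - tL.star
  tC <- tp((1:R)/(R + 1)) + d[order(tL.star)]  # t*_C,r = t_p + d*_pi(r)
  sort(tC) }                                   # read quantiles off the ordered t*_C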
One desirable property of the control quantile estimates is that, unlike most other variance reduction methods, their accuracy improves with increasing n as well as R.
There are various ways to calculate the quantiles of the control variate. The preferred approach is to calculate the entire distribution of the control variate by saddlepoint approximation (Section 9.5), and to read off the required quantiles t_p. This is better than other methods, such as Cornish-Fisher expansion, because it guarantees that the quantiles of the control variate will increase with p.
Example 9.7 (Returns data) To assess the usefulness of the control method just described, we consider setting studentized bootstrap confidence intervals for the rate of return in Example 6.3. We use case resampling to estimate quantiles of T* = (β̂*_1 − β̂_1)/S*, where β̂*_1 is the estimate of the regression slope, and S*² is the robust estimated variance of β̂*_1 based on the linear approximation to β̂*_1.

For a single bootstrap simulation we calculated three estimates of the quantiles of T*: the usual estimates, the order statistics t*_{(1)} ≤ ... ≤ t*_{(R)}; the control estimates taking the control variate to be the linear approximation to T* based on exact empirical influence values; and the control estimates obtained using the linear approximation with empirical influence values estimated by regression on the frequency array for the same bootstrap. In each case the quantiles of the control variate were obtained by saddlepoint approximation, as outlined in Example 9.13 below. We used R = 999 and repeated the experiment 50 times in order to estimate the variance of the quantile estimates. We estimated their bias by comparing them with quantiles of T* obtained from 100 000 bootstrap resamples.
Figure 9.4 shows the efficiency gains of the exact control estimates relative to the usual estimates. The efficiency gain based on the linear approximation is not shown, but it is very similar. The right panel shows the biases of the two control estimates. The efficiency gains are largest for central quantiles, and are of order 1.5-3 for the quantiles of most interest, at about 0.025-0.05 and 0.95-0.975. There is some suggestion that the control estimates based on the linear approximation have the smaller bias, but both sets of biases are negligible at all but the most extreme quantiles.

The efficiency gains in this example are broadly in line with simulations reported in the literature; see also Example 9.10 below.
Suppose that the quantity of interest is an expectation of the form

   μ = ∫ m(y*) dG(y*),

for some function m(·), where y* is abbreviated notation for a simulated data set. In expression (9.1), for example, m(y*) = t(y*), and the distribution G for y* = (y*_1, ..., y*_n) puts mass n^{-n} on each element of the set {y_1, ..., y_n}^n. The ordinary simulation estimate is μ̂_G = R^{-1} Σ_{r=1}^{R} m(y*_r), with the y*_r sampled from G. Importance sampling rests on the identity

   ∫ m(y*) dG(y*) = ∫ m(y*) w(y*) dH(y*),                    (9.14)

which suggests sampling y*_1, ..., y*_R from H and using the raw importance sampling estimate

   μ̂_{H,raw} = R^{-1} Σ_{r=1}^{R} m(y*_r) w(y*_r),           (9.15)

where w(y*) = dG(y*)/dH(y*) is known as the importance sampling weight. The estimate μ̂_{H,raw} has mean μ by virtue of (9.14), so is unbiased, and has variance R^{-1} var_H{m(Y*)w(Y*)}; see (9.17).
Our aim is now to choose H so that

   p_j = exp(λ l_j) / Σ_{k=1}^{n} exp(λ l_k),   j = 1, ..., n,        (9.18)

where the l_j are the usual empirical influence values for t. The result of Problem 9.10 shows that under this distribution T* is approximately N(t + λnv_L, v_L), so the appropriate choice for λ in (9.18) is approximately λ = (t_0 − t)/(nv_L), again provided t_0 < t; in some cases it is possible to choose λ to make T* have mean exactly t_0. The choice of probabilities given by (9.18) is called an exponential tilting of the original values n^{-1}. This idea is also used in Sections 4.4, 5.3, and 10.2.2.
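A minimal R sketch of this tilting calculation follows; tilt.probs is our own illustrative name, and the boot package function exp.tilt used in the practicals below performs a more careful version of the same job.

   # Sketch: exponential tilting (9.18). l holds the empirical influence
   # values l_j, t the observed statistic, t0 the desired centre for T*.
   tilt.probs <- function(l, t, t0) {
     n <- length(l)
     vL <- sum(l^2)/n^2              # variance of the linear approximation
     lambda <- (t0 - t)/(n * vL)     # approximate tilt, lambda = (t0 - t)/(n vL)
     p <- exp(lambda * l)
     p/sum(p)                        # tilted probabilities p_j
   }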
Table 9.4 shows approximate values of the efficiency R^{-1}π(1 − π)/var_H(μ̂_{H,raw}) of near-optimal importance resampling for various values of the tail probability π. The values were calculated using normal approximations for the distributions of T* under G and H; see Problem 9.8.

Table 9.4. Approximate large-sample efficiency of near-optimal importance resampling for a tail probability π.

   π            0.01   0.025   0.05   0.2   0.5   0.8    0.95    0.975    0.99
   Efficiency   37     17      9.5    3.0   1.0   0.12   0.003   0.0005   0.00004
The entries in the table suggest that for π ≤ 0.05 we could attain the same accuracy as with ordinary resampling with R reduced by a factor larger than about 10. Also shown in the table is the result of applying the exponential tilted importance resampling distribution when t > t_0, or π > 0.5: then importance resampling will be worse, possibly much worse, than ordinary resampling.

This last observation is a warning: straightforward importance sampling can be bad if misapplied. We can see how from (9.17). If dH(y*) becomes very small where m(y*) and dG(y*) are not small, then w(y*) = dG(y*)/dH(y*) will become very large and inflate the variance. For the tail probability calculation, if t_0 > t then all samples y* with t(y*) ≤ t_0 contribute R^{-1}w(y*_r) to μ̂_{H,raw}, and some of these contributions are enormous: although rare, they wreak havoc on μ̂_{H,raw}.

A little thought shows that for t_0 > t one should apply importance sampling to estimate 1 − π = Pr*(T* > t_0) and subtract the result from 1, rather than estimate π directly.
Quantiles

To see how quantiles are estimated, suppose that we want to estimate the α quantile of the distribution of T*, and that T* is approximately N(t, v_L) under G = F̂. Then we take a tilted distribution for H such that T* is approximately N(t + z_α v_L^{1/2}, v_L). For the situation we have been discussing, the exponential tilted distribution (9.18) will be near-optimal with λ = z_α/(n v_L^{1/2}), and in large samples this will be superior to G = F̂ for any α ≠ ½. So suppose that we have used importance resampling from this tilted distribution to obtain values t*_{(1)} ≤ ... ≤ t*_{(R)} with corresponding weights w*_{(1)}, ..., w*_{(R)}. Then for α < ½ the raw quantile estimate is t*_{(M)}, where M is determined by

   (R+1)^{-1} Σ_{r=1}^{M} w*_{(r)} ≤ α < (R+1)^{-1} Σ_{r=1}^{M+1} w*_{(r)};        (9.19)

for α > ½ the analogous rule is applied to the cumulative weights Σ_{r=M+1}^{R} w*_{(r)} in the upper tail; see Problem 9.9. When there is no importance sampling we have w*_r ≡ 1, and the estimate equals the usual t*_{((R+1)α)}.
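The following R sketch implements the lower-tail rule in (9.19); imp.quant.raw is our own illustrative name (the boot package function imp.quantile, used in the practicals, provides a fuller implementation).

   # Sketch of the raw quantile estimate (9.19) for alpha < 1/2: order the
   # t*_r, carry their weights along, and find the index M at which the
   # cumulative weight (R+1)^{-1} sum w*_(r) passes alpha.
   imp.quant.raw <- function(tstar, w, alpha) {
     ord <- order(tstar)
     cw <- cumsum(w[ord])/(length(tstar) + 1)
     M <- max(1, sum(cw <= alpha))
     tstar[ord][M]
   }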
The variation in w(y*) and its implications are illustrated in the following example.

Example 9.8 (Gravity data) For the gravity data, consider estimating the significance probability for the two-sample studentized statistic

   z = {ȳ_2 − ȳ_1 − (μ_2 − μ_1)} / (s_2²/n_2 + s_1²/n_1)^{1/2},        (9.20)

with weight w(y*) = dG(y*)/dH(y*).

The choice of H is made by analogy with the single-sample case discussed earlier. The two EDFs are tilted so as to make Z* approximately N(z_0, v_L), which should be near-optimal. This is done by working with the linear approximation

   z*_L = z + n_1^{-1} Σ_{j=1}^{n_1} f*_{1j} l_{1j} + n_2^{-1} Σ_{j=1}^{n_2} f*_{2j} l_{2j},

where f*_{1j} and f*_{2j} are the bootstrap sample frequencies of y_{1j} and y_{2j}, and the empirical influence values are

   l_{1j} = −(y_{1j} − ȳ_1)/(s_2²/n_2 + s_1²/n_1)^{1/2},   l_{2j} = (y_{2j} − ȳ_2)/(s_2²/n_2 + s_1²/n_1)^{1/2}.
[Figure 9.5: importance resampling weights w* plotted against the bootstrap values z* (left), and estimates of the survivor function of Z* (right).]
With tilted probabilities

   p_{ij} = exp(λ l_{ij}/n_i) / Σ_{k=1}^{n_i} exp(λ l_{ik}/n_i),        (9.21)

the left panel of Figure 9.5 shows the weights

   w*_r = Π_{i=1}^{2} { n_i^{-1} Σ_{j=1}^{n_i} exp(λ l_{ij}/n_i) }^{n_i} exp( −λ n_i^{-1} Σ_{j=1}^{n_i} f*_{ij} l_{ij} )

plotted against the bootstrap values z*_r for the importance resamples. These values of z* are shifted to the right relative to the hollow points, which show the values of z* and w* (all equal to 1) for 99 ordinary resamples. The values of w* for the importance re-weighting vary over several orders of magnitude, with the largest values when z* ≪ z_0. But only those for z* ≥ z_0 contribute to μ̂_{H,raw}.
How well does this single importance resampling distribution work for estimating all values of the survivor function Pr*(Z* ≥ z)? The heavy solid line in the right panel shows the true survivor function of Z* estimated from 50 000 ordinary bootstrap simulations. The lighter solid line is the importance sampling estimate

   R^{-1} Σ_{r=1}^{R} w*_r I{z*_r ≥ z},

with R = 99, and the dotted line is the estimate based on 99 ordinary bootstrap samples from the null distribution. The importance resampling estimate follows the true survivor function accurately close to z_0 but does poorly for negative z*. The usual estimate does best near z* = 0 but poorly in the tail region of interest; the estimated significance probability is p̂ = 0. While the usual estimate decreases by R^{-1} at each z*, the weighted estimate decreases by much smaller jumps close to z_0; the raw importance sampling tail probability estimate is μ̂_{H,raw} = 0.015, which is very close to the true value. The weighted survivor function estimate has large jumps in its left tail, where the estimate is unreliable.
In 50 repetitions of this experiment the ordinary and raw importance resampling tail probability estimates had variances 2.09 × 10^{-4} and 2.63 × 10^{-5}. For a tail probability of 0.015 this efficiency gain of about 8 is smaller than would be predicted from Table 9.4, the reason being that the distribution of z* is rather skewed and the normal approximation to it is poor.

In general there are several ways to obtain tilted distributions. We can use exponential tilting with exact empirical influence values, if these are readily available. Or we can estimate the influence values by regression using R_0 initial ordinary bootstrap resamples, as described in Section 2.7.4. Another way of using an initial set of bootstrap samples is to derive weighted smooth distributions as in (3.39): illustrations of this are given later in Examples 9.9 and 9.11.
Ratio and regression estimators

The first improvement replaces the raw estimate by the importance resampling ratio estimate

   μ̂_{H,rat} = Σ_{r=1}^{R} m(y*_r) w(y*_r) / Σ_{r=1}^{R} w(y*_r).        (9.22)

To some extent this controls the effect of very large fluctuations in the weights. In practice it is better to treat the weight as a control variate or covariate. Since our aim in choosing H is to concentrate sampling where m(·) is largest, the values of m(Y*_r)w(Y*_r) and w(Y*_r) should be correlated. If so, and if the average weight differs from its expected value of one under simulation from H, then the estimate μ̂_{H,raw} probably differs from its expected value μ. This motivates the covariance adjustment made in the importance resampling regression estimate

   μ̂_{H,reg} = μ̂_{H,raw} − b(w̄* − 1),        (9.23)

where w̄* = R^{-1} Σ_r w(y*_r), and b is the slope of the linear regression of the m(y*_r)w(y*_r) on the w(y*_r). The estimator μ̂_{H,reg} is the predicted value for m(Y*)w(Y*) at the point w(Y*) = 1.

The adjustments made to μ̂_{H,raw} in both μ̂_{H,rat} and μ̂_{H,reg} may induce bias, but such biases will be of order R^{-1} and will usually be negligible relative to simulation standard errors. Calculations outlined in Problem 9.12 indicate that for large R the regression estimator should outperform the raw and ratio estimators, but the improvement depends on the problem, and in practice the raw estimator of a tail probability or quantile is usually the best.
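A minimal R sketch of the three estimators, for simulated values m_r = m(y*_r) and weights w_r = w(y*_r); the function name imp.est is ours.

   # Sketch of the raw (9.15), ratio (9.22) and regression (9.23) estimates.
   imp.est <- function(m, w) {
     raw <- mean(m * w)
     rat <- sum(m * w)/sum(w)
     b <- coef(lm(mw ~ w, data = data.frame(mw = m * w, w = w)))[2]
     reg <- raw - b * (mean(w) - 1)      # predicted value of m*w at w = 1
     c(raw = raw, ratio = rat, regression = unname(reg))
   }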
Defensive mixtures

A second improvement aims to prevent the weight w(y*) from varying wildly. Suppose that H is a mixture of distributions, πH_1 + (1 − π)H_2, where 0 < π < 1. The distributions H_1 and H_2 are chosen so that the corresponding probabilities are not both small simultaneously. Then the weights

   dG(y*) / { π dH_1(y*) + (1 − π) dH_2(y*) }

will vary less, because even if dH_1(y*) is very small, dH_2(y*) will keep the denominator away from zero, and vice versa. This choice of H is known as a defensive mixture distribution, and it should do particularly well if many estimates, with different m(·), are to be calculated. The mixture is applied by stratified sampling, that is by generating exactly πR observations from H_1 and the rest from H_2, and using μ̂_{H,reg} as usual.

The components of the mixture H should be chosen to ensure that the relevant range of values of t* is well covered, but beyond this the detailed choice is not critical. For example, if we are interested in quantiles of T* for probabilities between α and 1 − α, then it would be sensible to target H_1 at the α quantile and H_2 at the 1 − α quantile, most simply by the exponential tilting method described earlier. As a further precaution we might add a third component to the mixture, such as G, to ensure stable performance in the middle of the distribution. In general the mixture could have many components, but careful choice of two or three will usually be adequate. The mixture should always be applied by stratified sampling, to reduce variation; a minimal sketch of the weight calculation appears below.
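The sketch assumes the component densities have already been evaluated at each simulated point; defensive.weights is our own illustrative name.

   # Sketch: weights under a defensive mixture sum_i pi_i H_i.
   # dG: vector of dG(y*_r); dH: matrix with dH_i(y*_r) in column i;
   # probs: the mixture proportions pi_i.
   defensive.weights <- function(dG, dH, probs) {
     dG / as.vector(dH %*% probs)
   }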
Example 9.9 (Gravity data) To illustrate the above ideas, we again consider the hypothesis testing problem of Example 9.8. The left panel of Figure 9.6 shows 20 replicate estimates of the null survivor function of z*, using ordinary bootstrap resampling with R = 299. The right panel shows 20 estimates of the survivor function using the regression estimate μ̂_{H,reg} after simulations with a defensive mixture distribution. This mixture has three components, which are G (the two EDFs), and two pairs of exponential tilted distributions targeted at the 0.025 and 0.975 quantiles of Z*. From our earlier discussion these distributions are given by (9.21) with λ = ±2/v_L^{1/2}; we shall denote the first pair of distributions by probabilities p_{1j} and p_{2j}, and the second by probabilities q_{1j} and q_{2j}. The first component G was used for R_1 = 99 samples, the second component (the ps) for R_2 = 100 and the third component (the qs) for R_3 = 100: the mixture proportions were therefore π_j = R_j/(R_1 + R_2 + R_3) for j = 1, 2, 3. The importance resampling weights were then computed with the three-component mixture density in the denominator, as above.
Table 9.5. Gravity data: efficiency gains, defined as the ratio of mean squared error under ordinary resampling to that under each scheme, for estimates of the tail probability Pr*(Z* > z_0), the moments E*(Z*) and var*(Z*), and the quantiles z_{0.05} and z_{0.025}; three resampling distributions with component sizes (R_1, R_2, R_3).

   Mixture (R_1,R_2,R_3)   Estimate     Pr*(Z* > z_0)   E*(Z*)   var*(Z*)   z_0.05   z_0.025
   (0, 0, 299)             Raw          11.2            0.04     0.03       0.07     0.05
                           Ratio         3.5            0.06     0.05       0.06     0.04
                           Regression   12.4            0.18     0.07       0.06
   (99, 100, 100)          Raw           3.8            0.73     1.5        1.3      2.5
                           Ratio         3.4            0.79     1.5        0.93     1.3
                           Regression    4.0            0.93     1.6        0.87     1.2
   (19, 140, 140)          Raw           3.9            0.34     1.2        0.96     2.6
                           Ratio         2.3            0.43     0.82       0.48     1.1
                           Regression    4.3            0.69     1.3        0.44     1.3
To compare methods more systematically, we computed the ratio of mean squared error from ordinary resampling to that when using defensive mixture distributions to estimate the tail probability Pr*(Z* > z_0) with z_0 = 1.77, two quantiles, and the bias E*(Z*) and the variance var*(Z*) for sampling from the two series. The mixture distributions have the same three components as before, but with different values for the numbers of samples R_1, R_2 and R_3 from each. Table 9.5 gives the results for three resampling mixtures with a total of R = 299 resamples in each case. The mean squared errors were estimated from 100 replicate bootstraps, with true values obtained from a single bootstrap of size 50 000. The main contribution to the mean squared error is from variance rather than bias.
The first resampling distribution is not a mixture, but simply the exponential tilt to the 0.975 quantile. This gives the best estimates of the tail probability, with efficiencies for raw and regression estimates in line with Example 9.8, but it gives very poor estimates of the other quantities. For the other two mixtures the regression estimates are best for estimating the mean and variance, while the raw estimates are best for the quantiles and not really worse for the tail probability. Both mixtures are about the same for tail quantiles, while the first mixture is better for the moments.

In this case the efficiency gains for tail probabilities and quantiles predicted by Table 9.4 are unrealistic, for two reasons. First, the table compares 299 ordinary simulations with just 100 tilted to each tail of the first mixture distribution, so we would expect the variance for a tail quantity based on the mixture to be larger by a factor of about three; this is just what we see when the first distribution is compared to the second. Secondly, the distribution of Z* is quite skewed, which considerably reduces the efficiency out as far as the 0.95 quantile.

We conclude that the regression estimate is best for estimating central quantities, that the raw estimate is best for quantiles, that results for estimating quantiles are insensitive to the precise mixture used, and that theoretical gains may not be realized in practice unless a single tail quantity is to be estimated. This is in line with other studies.
Balanced importance resampling

Importance resampling may be combined with balance. A simple approach concatenates R_1 copies of y_1 with R_2 copies of y_2, and so on up to R_n copies of y_n, and then permutes the resulting string; this does not take account of the fact that sampling is without replacement.

Figure 9.7 shows the theoretical large-sample efficiencies of balanced resampling, importance resampling, and balanced importance resampling for estimating the quantiles of a normal statistic. Ordinary balance gives maximum efficiency of 2.76 at the centre of the distribution, while importance resampling works well in the lower tail but badly in the centre and upper tail of the distribution. Balanced importance resampling dominates both.
Example 9.10 (Returns data) In order to assess how well these ideas might work in practice, we again consider setting studentized bootstrap confidence intervals for the slope in the returns example. We performed an experiment like that of Example 9.7, but with the R = 999 bootstrap samples generated by balanced resampling, importance resampling, and balanced importance resampling.

Table 9.6 shows the mean squared error for the ordinary bootstrap divided by the mean squared errors of the quantile estimates for these methods, using 50 replicate simulations from each scheme. This slightly different efficiency takes into account any bias from using the improved methods of simulation, though in fact the contribution to mean squared error from bias is small. The true quantiles are estimated from an ordinary bootstrap of size 100 000.

The first two lines of the table show the efficiency gains due to using the control method when the linear approximation is used as a control variate, with empirical influence values calculated exactly and estimated by regression from the same bootstrap simulation. The results differ little. The next two rows show the gains due to balanced sampling, both without and with the control method, which gives a worthwhile improvement in performance, except in the tail.
Table 9.6. Returns data: efficiency gains for quantile estimates of the studentized slope, defined as the mean squared error for the ordinary bootstrap divided by that for each method, with R = 999.

                                              Quantile (%)
   Method                 Distn     1    2.5    5    10    50    90    95   97.5   99
   Control (exact)                 1.7   2.7   2.8   4.0  11.2   5.5   2.4   2.6   1.4
   Control (approx)                1.4   2.8   3.2   4.1  11.8   5.1   2.2   2.6   1.3
   Balance                         1.0   1.2   1.5   1.4   3.1   2.9   1.7   1.4   0.6
   Balance, with control           1.4   1.8   3.0   2.8   4.4   4.7   2.5   2.2   1.5
   Importance             H1       7.8   3.7   3.6   1.8   0.4   3.5   2.3   3.1   5.5
                          H2       4.6   2.9   3.5   1.1   0.1   2.6   3.1   4.3   5.2
                          H3       3.6   3.7   2.0   1.7   0.5   2.4   2.2   2.6   3.6
                          H4       4.3   2.6   2.5   1.8   0.9   1.6   1.6   2.2   2.3
                          H5       2.6   2.1   0.7   0.3   0.4   0.5   0.6   1.6   2.1
   Balanced importance    H1       5.0   5.7   4.1   1.9   0.5   2.6   2.2   6.3   4.5
                          H2       4.2   3.4   2.4   1.8   0.2   2.0   3.6   4.2   3.9
                          H3       5.2   4.2   3.8   1.8   0.9   3.0   2.4   4.0   4.0
                          H4       4.3   3.3   3.4   2.2   2.1   2.7   3.7   3.3   4.3
                          H5       3.2   2.8   1.0   0.4   0.9   0.9   1.4   2.1   2.1
The next five lines show the gains due to different versions of importance resampling, in each case using a defensive mixture distribution and the raw quantile estimate. In practice it is unusual to perform a bootstrap simulation with the aim of setting a single confidence interval, and the choice of importance sampling distribution H must balance various potentially conflicting requirements. Our choices were designed to reflect this. We first suppose that the empirical influence values l_j for t are known and can be used for exponential tilting of the linear approximation t*_L to t*. The first defensive mixture, H1, uses 499 simulations from a distribution tilted to the α quantile of t*_L and 500 simulations from a distribution tilted to the 1 − α quantile of t*_L, for α = 0.05. The second mixture, H2, is like this but with α = 0.025.

The third, fourth and fifth distributions are the sort that might be used in practice with a complicated statistic. We first performed an ordinary bootstrap of size R_0, which we used to estimate first the empirical influence values l_j by regression and then the tilt values for the 0.05 and 0.95 quantiles. We then performed a further bootstrap of size (R − R_0)/2 using each set of tilted probabilities, giving a total of R simulations from three different distributions, one centred and two tilted in opposite directions. We took R_0 = 199 and R_0 = 499, giving H3 and H4. For H5 we took R_0 = 499, but estimated the tilted distributions by frequency smoothing (Section 3.9.2) with bandwidth ε = 0.5v^{1/2} at the 0.05 and 0.95 quantiles of t*, where v^{1/2} is the standard error of t estimated from the ordinary bootstrap.

Balance generally improves importance resampling, which is not sensitive to the mixture distribution used. The effect of estimating the empirical influence values is not marked, while frequency smoothing does not perform so well as exponential tilting. Importance resampling estimates of the central quantiles are poor, even when the simulation is balanced. Overall, any of schemes H1-H4 leads to appreciably more accurate estimates of the quantiles usually of interest.
Bootstrap recycling

The importance sampling identity can also be used to recycle a single set of simulations. For each of K distributions G_1, ..., G_K we have

   E{m(Y*) | G_k} = ∫ m(y*) dG_k(y*) = E_H{ m(Y*) dG_k(Y*)/dH(Y*) }.

We can therefore estimate all K values using one set of samples y_1, ..., y_N simulated from H, with estimates

   μ̂_k = N^{-1} Σ_{i=1}^{N} m(y_i) dG_k(y_i)/dH(y_i),   k = 1, ..., K.        (9.24)

Both N and the choice of H depend upon the use being made of the estimates and the form of m(·).
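A minimal R sketch of (9.24); recycle.est is our own illustrative name, and the weight matrix is assumed to have been computed already.

   # Sketch of recycling (9.24): one set of N samples from H is re-weighted
   # to estimate E{m(Y*) | G_k} for each of K distributions G_k.
   # m: values m(y_i); wk: N x K matrix with dG_k(y_i)/dH(y_i) in column k.
   recycle.est <- function(m, wk) colMeans(m * wk)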
Example 9.11 (City population data) Consider again estimating the bias and variance functions for the ratio θ = t(F) of the city population data with n = 10. In Example 3.22 we estimated b(F) = E(T | F) − t(F) and v(F) = var(T | F) for a range of values of θ = t(F) using a first-level bootstrap to calculate values of t* for 999 bootstrap samples F̂*, and then doing a second-level bootstrap to estimate b(F̂*) and v(F̂*) for each of those samples. Here the second level of resampling is avoided by using importance re-weighting. At the same time, we retain the smoothing introduced in Example 3.22.

Rather than take each G_k to be one of the bootstrap EDFs F̂*, we obtain a smooth curve by using smooth distributions F̂_θ with probabilities p_j(θ) as defined by (3.39). Recall that the parameter value of F̂_θ is t(F̂_θ) = θ*, say, which will differ slightly from θ. For H we take F̂, the EDF of the original data, on the grounds that it has the correct support and covers the range of values for y* well: it is not necessarily a good choice. Then we have weights

   dG_k(y*_r)/dH(y*_r) = dF̂_θ(y*_r)/dF̂(y*_r) = Π_{j=1}^{n} { p_j(θ)/n^{-1} }^{f*_{rj}} = w*_r(θ),

say, where as usual f*_{rj} is the frequency with which y_j occurs in the rth bootstrap sample. We should emphasize that the samples y* drawn from H here replace second-level bootstrap samples.

Consider the bias estimate. The weighted sum R^{-1} Σ_r (t*_r − θ*) w*_r(θ) is an unbiased estimate of the bias E(T* | F̂_θ) − θ*, and we can plot this estimate to see how the bias varies as a function of θ* or θ. However, the weighted sum can behave badly if a few of the w*_r(θ) are very large, and it is better to use the ratio and regression estimates (9.22) and (9.23).

The top left panel of Figure 9.8 shows raw, ratio, and regression estimates of the bias, based on a single set of R = 999 simulations, with the curve obtained from the double bootstrap calculation used in Figure 3.7. For example, the ratio estimate of bias for a particular value of θ is Σ_r (t*_r − θ*)w*_r(θ) / Σ_r w*_r(θ), and this is plotted as a function of θ*. The raw and ratio estimates are rather poor, but the regression estimate agrees fairly well with the double bootstrap curve. The panel also shows the estimated bias from a defensive mixture with 499 ordinary samples mixed with 250 samples tilted to each of the 0.025 and 0.975 quantiles; this is the best estimate of those we consider. The panels below show 20 replicates of these estimated biases. These confirm the impression from the panel above: with ordinary resampling the regression estimator is best, but it is better to use the mixture distribution.
The top right panel shows the corresponding estimates for the standard deviation function, obtained by estimating

   E**(T** − θ* | F̂_θ),   E**{(T** − θ*)² | F̂_θ}

by each of the raw, ratio, and regression methods, and plotting the resulting estimate of the standard deviation as a function of θ*.

[Figure 9.8: raw, ratio, and regression estimates of the bias and standard deviation functions for the city population data, with 20 replicates of each shown in the lower panels.]
The results for the raw estimate suggest that recycling can give very variable results, and it must be used with care, as the next example vividly illustrates.

Example 9.12 (Bias adjustment) Consider the problem of adjusting the bootstrap estimate of bias of T, discussed in Section 3.9. The adjustment C in equation (3.30) involves (RM)^{-1} Σ_{r=1}^{R} Σ_{m=1}^{M} (t**_{rm} − t*_r), which uses M samples from each of the R models F̂*_r fitted to samples from F̂. The recycling method replaces each average M^{-1} Σ_{m=1}^{M} (t**_{rm} − t*_r) by a weighted average of the form (9.24), so that C is estimated by a weighted sum (9.25) over a single set of samples, where t_i is the value of T for the ith sample y_{i1}, ..., y_{in} drawn from the distribution H. If we applied recycling only to the first term of C, which estimates E(T**), then a different, and as it turns out inferior, estimate would be obtained for C.

The support of H must include all R first-level bootstrap samples, so as in the previous example a natural choice is H = F̂, the model fitted to (or the EDF of) the original sample. However, this can give highly unstable results, as one might predict from the leftmost panel in the second row of Figure 9.8. This can be illustrated by considering the case of the parametric model Y ~ N(θ, 1), with estimate T = Ȳ. Here the terms being summed in (9.25) have infinite variance; see Problem 9.15. The difficulty arises from the choice H = F̂, and can be avoided by taking H to be a mixture as described in Section 9.4.2, with at least three components.

Instability due to the choice H = F̂ does not occur with all applications of recycling. Indeed applications to bootstrap likelihood (Chapter 10) work well with this choice.
9.5 Saddlepoint Approximation

When the cumulant-generating function K(ξ) of a scalar random variable U is available, its density and distribution function can be approximated without simulation. The saddlepoint approximation to the density of U at u is

   g_s(u) = {2πK''(ξ̂)}^{-1/2} exp{K(ξ̂) − ξ̂u},        (9.26)

where the saddlepoint ξ̂ = ξ̂(u) solves

   K'(ξ̂) = u,        (9.27)

and is therefore a function of u. Here K' and K'' are respectively the first and second derivatives of K with respect to ξ. A simple approximation to the CDF of U, Pr(U ≤ u), is

   G_s(u) = Φ{ w + w^{-1} log(v/w) },        (9.28)

where

   w = sgn(ξ̂)[2{ξ̂u − K(ξ̂)}]^{1/2},   v = ξ̂{K''(ξ̂)}^{1/2},        (9.29)

for values of u such that |w| < c for some positive c; the error in the CDF approximation rises to O(n^{-1}) when u is such that |w| < cn^{1/2}. A key feature is that the error is relative, so that the ratio of the true density of U to its saddlepoint approximation is bounded over the likely range of u. A consequence is that, unlike other analytic approximations to densities and tail probabilities, (9.26), (9.28) and (9.29) are very accurate far into the tails of the density of U. If there is doubt about the accuracy of (9.28) and (9.29), G_s may be calculated by numerical integration of g_s.

The more complex formulae that are used for conditional and marginal density and distribution functions are given in Sections 9.5.2 and 9.5.3.
Table 9.7. Comparison of saddlepoint (Spoint) approximation to bootstrap α quantiles (×10⁻²) of a linear statistic for samples of sizes 10 and 15, with results from R = 49 999 simulations (Simn).

                     α (%):  0.1    0.5    1      2.5    5      95     97.5   99     99.5    99.9
   n = 10   Simn             7.8    10.9   12.8   15.4   18.1   78.5   85.1   96.0   102.1   116.4
            Spoint           7.6    10.8   12.5   15.2   17.8   78.1   85.9   95.3   101.9   115.8
   n = 15   Simn             11.8   13.6   14.5   16.0   17.4   37.4   39.7   42.3   44.4    48.2
            Spoint           11.7   13.5   14.4   15.9   17.4   37.4   39.7   42.4   44.3    48.2
Application to resampling

In the context of resampling, suppose that we are interested in the distribution of the average of a sample from y_1, ..., y_n, where y_j is sampled with probability p_j, j = 1, ..., n. Often, but not always, p_j = n^{-1}. We can write the average as U* = n^{-1} Σ f*_j y_j, where as usual (f*_1, ..., f*_n) has a joint multinomial distribution with denominator n. Then U* has cumulant-generating function

   K(ξ) = n log{ Σ_{j=1}^{n} p_j exp(ξ a_j) },        (9.30)

where a_j = y_j/n. The function (9.30) can be used in (9.26) and (9.28) to give non-random approximations to the PDF and CDF of U*. Unlike most of the methods described in this book, the error in saddlepoint approximations arises not from simulation variability, but from deterministic numerical error in using g_s and G_s rather than the exact density and distribution function.

In principle, of course, a nonparametric bootstrap statistic is discrete and so the density does not exist, but as we saw in Section 2.3.2, U* typically has so many possible values that we can think of it as continuous away from the extreme tails of its distribution. Continuity corrections can sometimes be applied, but they make little difference in bootstrap applications.

When it is necessary to approximate the entire distribution of U*, we calculate the values of G_s(u) for m values of u equally spaced between min a_j and max a_j and use a spline smoother to interpolate between the corresponding values of Φ^{-1}{G_s(u)}. Quantiles and cumulative probabilities for U* can be read off the fitted curve. Experience suggests that m = 50 is usually ample.
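The following R sketch implements (9.30) together with the approximations (9.26) and (9.28)-(9.29); saddle.lin is our own illustrative name (the boot package functions saddle and saddle.distn, used in the practicals, are the serious implementations).

   # Sketch: saddlepoint PDF (9.26) and CDF (9.28)-(9.29) for a linear
   # statistic with K(xi) = n log sum p_j exp(xi a_j), as in (9.30).
   saddle.lin <- function(u, a, p = rep(1/length(a), length(a)),
                          n = length(a), bracket = c(-100, 100)) {
     K  <- function(xi) n * log(sum(p * exp(xi * a)))
     K1 <- function(xi) { e <- p * exp(xi * a); n * sum(e * a)/sum(e) }
     K2 <- function(xi) { e <- p * exp(xi * a); m1 <- sum(e * a)/sum(e)
                          n * (sum(e * a^2)/sum(e) - m1^2) }
     xi <- uniroot(function(x) K1(x) - u, bracket)$root   # saddlepoint (9.27)
     w <- sign(xi) * sqrt(2 * (xi * u - K(xi)))
     v <- xi * sqrt(K2(xi))
     list(pdf = exp(K(xi) - xi * u)/sqrt(2 * pi * K2(xi)),
          cdf = pnorm(w + log(v/w)/w))   # breaks down as u -> E(U*), where w -> 0
   }
   # For the bootstrap average of data y: saddle.lin(u, a = y/length(y));
   # widen or narrow bracket if uniroot fails for extreme u.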
Example 9.13 (Linear approximation) A simple application of these ideas is to the linear approximation t*_L for a bootstrap statistic t*, as was used in Example 9.7. We write T*_L = t + n^{-1} Σ f*_j l_j, where as usual f*_j is the frequency of the jth case in the bootstrap sample and l_j is the jth empirical influence value. The cumulant-generating function of T*_L − t is (9.30) with a_j = l_j/n and
[Figure 9.9: saddlepoint approximations to the PDF of t*_L compared with histograms of bootstrap simulations, for n = 10 and n = 15.]
p_j = n^{-1}. The saddlepoint equation for T*_L at t*_L is

   Σ_{j=1}^{n} l_j exp(ξ l_j/n) / Σ_{j=1}^{n} exp(ξ l_j/n) = t*_L − t,

whose solution is ξ̂.
For a numerical example, we take the variance t = n^{-1} Σ (y_j − ȳ)² for exponential samples of sizes 10 and 15; the empirical influence values are l_j = (y_j − ȳ)² − t. Figure 9.9 compares the saddlepoint approximations to the PDFs of t*_L with the histogram from bootstrap calculations with R = 49 999. The saddlepoint approximation accurately reflects the skewed lower tail of the bootstrap distribution, whereas a normal approximation would not do so. However, the saddlepoint approximation does not pick up the multimodality of the density for n = 10, which arises for the same reason as in the right panels of Figure 2.9: the bulk of the variability of T*_L is due to a few observations with large values of |l_j|, while those for which |l_j| is small merely add noise. The figure suggests that with so small a sample the CDF approximation will be more useful. This is borne out by Table 9.7, which compares the simulation quantiles and quantiles obtained by fitting a spline to 50 saddlepoint CDF values.

In more complex applications the empirical influence values l_j would usually be estimated by numerical differentiation or by regression, as outlined in Sections 2.7.2, 2.7.4 and 3.2.1.
A related application was used in Example 5.13 to calibrate confidence intervals based on a kernel density estimate. This involved estimating the probabilities

   Pr**(T** ≤ 2t* − t | F̂*),        (9.31)

where t is the variance-stabilized estimate of the quantity of interest. The double bootstrap version of t can be written as t** = (2n^{-1} Σ_j f**_j a_j)^{1/2}, where the a_j are kernel terms of the form (nh)^{-1}{φ(·/h) + φ(·/h)} and f**_j is the frequency with which y_j appears in a second-level bootstrap sample. Conditional on a first-level bootstrap sample F̂* with frequencies f*_1, ..., f*_n, the f**_j are multinomial variables with mean vector (f*_1, ..., f*_n) and denominator n.

Now if 2t* − t < 0, the probability (9.31) equals zero, because T** is positive. If 2t* − t > 0, the event T** ≤ 2t* − t is equivalent to n^{-1} Σ_j f**_j a_j ≤ ½(2t* − t)², so the required probability can again be approximated using (9.28) with these a_j.
Estimating functions

One simple extension of the basic approximations is to statistics determined by monotonic estimating functions. Suppose that the value of a scalar bootstrap statistic T* based on sampling from y_1, ..., y_n is the solution to the estimating equation

   U*(t) = Σ_{j=1}^{n} a(t; y_j) f*_j = 0,        (9.32)

where for each y the function a(θ; y) is decreasing in θ. Then T* ≤ t if and only if U*(t) ≤ 0. Hence Pr*(T* ≤ t) may be estimated by G_s(0) applied with cumulant-generating function (9.30) in which a_j = a(t; y_j). A saddlepoint approximation to the density of T* is given by (9.33), which adds a Jacobian term for the change of variable from U*(t) to T*.
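A minimal sketch of the tail probability calculation, reusing saddle.lin from the earlier sketch; afun is a hypothetical vectorized version of a(t; y), not a function from any package.

   # Sketch: Pr*(T* <= t) via G_s(0) with a_j = a(t; y_j) in (9.30),
   # for t away from the observed value (at which w = 0 and (9.28) fails).
   prob.est.fun <- function(t, y, afun) saddle.lin(u = 0, a = afun(t, y))$cdf

   # Example for a ratio statistic solving sum (x_j - t z_j) f*_j = 0:
   # prob.est.fun(t, xz, function(t, d) d$x - t * d$z)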
For example, a robust M-estimate θ̂ of a location parameter solves

   Σ_{j=1}^{n} ψ{(y_j − θ)/σ} = 0,        (9.34)

where ψ(ε) is designed to downweight large values of ε. A common choice is the Huber estimate determined by

   ψ(ε) = ε,  |ε| ≤ c;   ψ(ε) = c sgn(ε),  |ε| > c.        (9.35)

With c = ∞ this gives ψ(ε) = ε and leads to the normal-theory estimate θ̂ = ȳ, but a smaller choice of c will give better behaviour when there are outliers. With c = 1.345 and σ fixed at the median absolute deviation s of the data, we obtain θ̂ = 26.45 for the maize data. How variable is this? We can get some idea by looking at replicates of θ̂ based on bootstrap samples y*_1, ..., y*_n. A bootstrap value θ̂* solves Σ_j ψ{(y*_j − θ)/s} = 0.
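A small R sketch of this calculation, with the scale fixed at the median absolute deviation as in the text; the function names are ours.

   # Sketch: Huber psi (9.35) and the M-estimate solving (9.34) by uniroot,
   # with c = 1.345 and scale fixed at the MAD.
   huber.psi <- function(e, c = 1.345) pmin(pmax(e, -c), c)
   huber.est <- function(y, c = 1.345) {
     s <- mad(y)
     uniroot(function(th) sum(huber.psi((y - th)/s, c)), range(y))$root
   }
   # Bootstrap replicates of theta-hat:
   # theta.star <- replicate(999, huber.est(sample(y, replace = TRUE)))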
Under the symmetrized resampling scheme, sampling is from the distribution putting mass (2n)^{-1} on each of the 2n values y_1, ..., y_n, 2θ̂ − y_1, ..., 2θ̂ − y_n. The right panel of Figure 9.10 compares saddlepoint and Monte Carlo approximations to the PDF of θ̂* under this symmetrized resampling scheme; the PDF of the average is shown also. All are symmetric about θ̂.

One difficulty here is that we might prefer to approximate the PDF of θ̂* when s is replaced by its bootstrap version s*, and this cannot be done in the current framework. More fundamentally, the distribution of interest will often be for a quantity such as a studentized form of θ̂* derived from θ̂*, s*, and perhaps other statistics, necessitating the more sophisticated approximations outlined in Section 9.5.3.
[Figure 9.10: Comparison of the saddlepoint approximation to the PDF of a robust M-estimate applied to the maize data (solid), with results from a bootstrap simulation with R = 50 000. The heavy curve is the saddlepoint approximation to the PDF of the average. The left panel shows results from resampling the data, and the right shows results from a symmetrized bootstrap.]

Saddlepoint approximation also applies to conditional distributions. Suppose that U = AᵀW is a linear function of a vector W of independent variables, with cumulant-generating function K(ξ) = log E exp(ξᵀAᵀW). Conditional density and distribution approximations (9.36) and (9.37) are then available; they involve ratios of determinants such as |K''(ξ̂, ξ̂_0)|, evaluated at two saddlepoints.
Now T* ≤ t if and only if Σ_j (x_j − t z_j) W_j ≤ 0, where W_j is the number of times (z_j, x_j) is included in the sample. But the relation between the Poisson and multinomial distributions (Problem 9.19) implies that the joint conditional distribution of (W_1, ..., W_n) given that Σ W_j = n is the same as that of the multinomial frequency vector (f*_1, ..., f*_n) in ordinary bootstrap sampling from a sample of size n. Thus the probability that Σ_j (x_j − t z_j)W_j ≤ 0 given that Σ W_j = n is just the probability that T* ≤ t in ordinary bootstrap sampling from the data pairs.
Table 9.8. Saddlepoint (Spoint) and simulation (Simn) quantiles of T* under unconditional resampling, resampling conditional on Σ Z*_j = Σ z_j, and sampling without replacement.

            Unconditional        Conditional          Without replacement
   α        Spoint    Simn       Spoint    Simn       Spoint    Simn
   0.001    1.150     1.149      1.216     1.215      1.070     1.070
   0.005    1.191     1.192      1.236     1.237      1.092     1.092
   0.01     1.214     1.215      1.248     1.247      1.104     1.103
   0.025    1.251     1.252      1.273     1.269      1.122     1.122
   0.05     1.286     1.286      1.301     1.291      1.139     1.138
   0.1      1.329     1.329      1.340     1.337      1.158     1.158
   0.9      1.834     1.833      1.679     1.679      1.348     1.348
   0.95     1.967     1.967      1.732     1.736      1.392     1.392
   0.975    2.107     2.104      1.777     1.777      1.436     1.435
   0.99     2.303     2.296      1.829     1.833      1.493     1.495
   0.995    2.461     2.445      1.865     1.863      1.537     1.540
   0.999    2.857     2.802      1.938     1.936      1.636     1.635
In this situation it is of course more direct to use the estimating function method with a(t; y_j) = x_j − t z_j and the simpler approximations (9.28) and (9.33). Then the Jacobian term in (9.33) is |Σ_j z_j exp{ξ̂(x_j − t z_j)} / Σ_j exp{ξ̂(x_j − t z_j)}|.

Another application is to conditional distributions for T*. Suppose that the population pairs are related by x_j = z_j θ + z_j^{1/2} ε_j, where the ε_j are a random sample from a distribution with mean zero. Then conditional on the z_j, the ratio Σ x_j / Σ z_j has variance proportional to (Σ z_j)^{-1}. In some circumstances we might want to obtain an approximation to the conditional distribution of T* given that Σ Z*_j = Σ z_j. In this case we can use the approach outlined in the previous paragraph, but with two conditioning variables: we take the W_j to be independent Poisson variables with equal means, and set

   U* = ( Σ_j (x_j − t z_j)W_j,  Σ_j z_j W_j,  Σ_j W_j )ᵀ,   u = ( 0,  Σ_j z_j,  n )ᵀ,

so that a_j = ( x_j − t z_j,  z_j,  1 )ᵀ.

Table 9.8 compares the quantiles of these saddlepoint distributions with Monte Carlo approximations based on 100 000 samples. The general agreement is excellent in each case.
A different approach is to approximate the cumulant-generating function of T* itself by

   K(ξ) ≈ κ_1 ξ + ½κ_2 ξ² + (1/6)κ_3 ξ³ + (1/24)κ_4 ξ⁴,        (9.38)

where κ_i is the ith cumulant of T*. The exact cumulants are usually unavailable, so we replace them with the cumulants of the cubic approximation to T* given by

   T* = t + n^{-1} Σ_{i=1}^{n} f*_i l_i + ½n^{-2} Σ_{i,j=1}^{n} f*_i f*_j q_{ij} + (1/6)n^{-3} Σ_{i,j,k=1}^{n} f*_i f*_j f*_k c_{ijk},

where t is the original value of the statistic, and l_j, q_{ij} and c_{ijk} are the empirical linear, quadratic and cubic influence values; see also (9.6). To the order required, the approximate cumulants κ_{C,i} are combinations of sums such as n^{-3} Σ_{ijk} l_i l_j q_{jk} and n^{-3} Σ_{ijk} l_i l_j l_k c_{ijk}, in which the quantities in parentheses are of order one.

We get an approximate cumulant-generating function K_C(ξ) by substituting the κ_{C,i} into (9.38), and then use the standard approximations (9.26) and (9.28) with K(·) replaced by K_C(·). Detailed consideration establishes that this preserves the usual asymptotic accuracy of the saddlepoint approximations. From a practical point of view it may be better to sacrifice some theoretical accuracy but reduce the computational burden by dropping from κ_{C,2} and κ_{C,4} the terms involving the c_{ijk}; with this modification both PDF and CDF approximations have error of order n^{-1}.
In principle this approach is fairly simple, but in applications there is no guarantee that K_C(ξ) is close to the true cumulant-generating function of T* except for small ξ. It may be necessary to modify K_C(ξ) to avoid multiple roots of the saddlepoint equation, or if κ_{C,4} < 0, for then K_C(ξ) cannot be convex. In these circumstances we can modify K_C(ξ) to ensure that the cubic and quartic terms do not cause trouble, for example replacing it by

   K_{C,b}(ξ) = ξκ_{C,1} + ½ξ²κ_{C,2} + { (1/6)ξ³κ_{C,3} + (1/24)ξ⁴κ_{C,4} } exp( −½nb²ξ²κ_{C,2} ),

where b is chosen to ensure that the second derivative of K_{C,b}(ξ) with respect to ξ is positive; K_{C,b}(ξ) is then convex. A suitable value is

   b = max[ 5, inf{ a : K''_{C,a}(ξ) > 0, −∞ < ξ < ∞ } ],
The resulting approximation to the distribution of the standardized variable

   Z_Q = (T* − κ_{C,1}) / κ_{C,2}^{1/2}

is (9.39), where the correction polynomials include

   p_2(z) = −z{ (1/24)κ_{C,4}κ_{C,2}^{-2}(z² − 3) + (1/72)κ_{C,3}²κ_{C,2}^{-3}(z⁴ − 10z² + 15) }.
An important special case is the studentized quantity

   W = w(F̂, F) = {t(F̂) − t(F)} / { ∫ L_t(y; F)² dF(y) }^{1/2},

with F̂ the EDF of the data. The corresponding bootstrap statistic is w(F̂*, F̂), where F̂* is the EDF corresponding to a bootstrap sample. For economy of notation below we write

   v = v(F) = ∫ L_t(y; F)² dF(y),   L_t(y_1) = L_t(y_1; F),   Q_t(y_1, y_2) = Q_t(y_1, y_2; F),

and so forth.

To obtain the linear, quadratic and cubic influence values for w(G, F) at G = F, we replace G(y) with a version contaminated at a point and differentiate; here H(x) is the Heaviside function, jumping from 0 to 1 at x = 0. The results are

   L_w(y_1) = v^{-1/2} L_t(y_1),
   Q_w(y_1, y_2) = v^{-1/2} Q_t(y_1, y_2) − ½v^{-3/2} L_t(y_1)L_v(y_2)[2],
   C_w(y_1, y_2, y_3) = v^{-1/2} C_t(y_1, y_2, y_3)
        − ½v^{-3/2}{ Q_t(y_1, y_2)L_v(y_3) + Q_v(y_1, y_2)L_t(y_3) }[3]
        + ¾v^{-5/2} L_t(y_1)L_v(y_2)L_v(y_3)[3],

where [k] after a term indicates that it should be summed over the permutations of its y_i's that give the k distinct quantities in the sum. Thus for example

   L_t(y_1)L_v(y_2)L_v(y_3)[3] = L_t(y_1)L_v(y_2)L_v(y_3) + L_t(y_2)L_v(y_1)L_v(y_3) + L_t(y_3)L_v(y_1)L_v(y_2).

The influence values for z involve linear, quadratic, and cubic influence values for t, and linear and quadratic influence values for v, the latter given by

   L_v(y_1) = L_t(y_1)² − ∫ L_t(x)² dF(x) + 2∫ L_t(x)Q_t(x, y_1) dF(x),
   ½Q_v(y_1, y_2) = L_t(y_1)Q_t(y_1, y_2)[2] − L_t(y_1)L_t(y_2)
        − ∫ { Q_t(x, y_1) + Q_t(x, y_2) } L_t(x) dF(x) + ∫ ... .
The simplest example is the average t(F) = ∫ x dF(x) = ȳ of a sample of values y_1, ..., y_n from F. Then L_t(y_i) = y_i − ȳ, Q_t(y_i, y_j) = C_t(y_i, y_j, y_k) = 0, the expressions above simplify greatly, and the required influence quantities are l_i = v^{-1/2}(y_i − ȳ), with q_{ij} and c_{ijk} built from products of the (y_i − ȳ) and terms {(y_k − ȳ)² − v}[3], where v = n^{-1} Σ (y_i − ȳ)². The influence quantities for Z are obtained from those for W by multiplication by n^{1/2}. A numerical illustration of the use of the corresponding approximate cumulant-generating function K_C(ξ) is given in Example 9.19.
Integration approach

Another approach involves extending the estimating function approximation to the multivariate case, and then approximating the marginal distribution of the statistic of interest. To see how, suppose that the quantity T of interest is a scalar, and that T and S = (S_1, ..., S_{q−1})ᵀ are determined by a q × 1 estimating function

   U(t, s) = Σ_{j=1}^{n} a(t, s_1, ..., s_{q−1}; Y_j).

Then the bootstrap quantities T* and S* are the solutions of the equations

   U*(t, s) = Σ_{j=1}^{n} a_j(t, s) f*_j = 0,        (9.40)

and the relevant cumulant-generating function is

   K(ξ; t, s) = n log[ Σ_{j=1}^{n} p_j exp{ξᵀ a_j(t, s)} ],        (9.41)

with associated tilted probabilities

   p'_j(t, s) = p_j exp{ξᵀ a_j(t, s)} / Σ_{k=1}^{n} p_k exp{ξᵀ a_k(t, s)}.        (9.42)

The marginal density approximation for T* takes the form

   g(t) ∝ | ∂²A(t, s)/∂s∂sᵀ |^{1/2} exp{A(t, s)},        (9.43)

evaluated at values of ξ and s satisfying saddlepoint equations of the form

   ∂K(ξ; t, s)/∂ξ = 0,   n Σ_{j=1}^{n} p'_j(t, s) ξᵀ ∂a_j(t, s)/∂s = 0.        (9.44)

These can be solved using packaged routines, with starting values given by noting that when t equals its sample value t_0, say, s equals its sample value and ξ = 0.

The second derivatives of A needed to calculate (9.43) may be expressed in terms of the second derivatives of K, as in

   ∂²A(t, s)/∂s∂sᵀ = ∂²K/∂s∂sᵀ − (∂²K/∂s∂ξᵀ)(∂²K/∂ξ∂ξᵀ)^{-1}(∂²K/∂ξ∂sᵀ),        (9.45)

with, for example,

   ∂²K(ξ; t, s)/∂ξ∂ξᵀ = n Σ_{j=1}^{n} p'_j(t, s) a_j(t, s) a_j(t, s)ᵀ − ... ;        (9.46)

expressions (9.47)-(9.49) give the remaining blocks, which involve the derivatives ∂a_j(z, v)/∂v, ∂²a_j(z, v)/∂z² and ∂²a_j(z, v)/∂v² of the components of the estimating function.
[Figure 9.11: Saddlepoint approximations for the bootstrap variance V* and studentized average Z* for the maize data. Top left: approximations to quantiles of Z* by integration saddlepoint (solid) and simulation using 50 000 bootstrap samples (every 20th order statistic is shown). Top right: density approximations for Z* by integration saddlepoint (heavy solid), approximate cumulant-generating function (solid), and simulation using 50 000 bootstrap samples. Bottom left: corresponding approximations for V*. Bottom right: contours of A(z, v), with local maxima along the dashed line z = 3.5 at A and at B.]
For the maize data we consider the Huber M-estimate discussed above, for which c_{ijk} = 0, and the studentized quantity

   Z = (θ̂ − θ) / (v/n)^{1/2},

which is proportional to the usual Student-t statistic when ψ(ε) = ε. In order to set studentized bootstrap confidence limits for θ, we need approximations to the bootstrap quantiles of Z*. These may be obtained by applying the marginal saddlepoint approximation outlined above with T = Z, S = (S_1, S_2)ᵀ and p_j = n^{-1}.
The resulting approximations to the quantiles of Z* are compared with simulation below (Simn: simulation; Spoint: saddlepoint).

   α (%)      0.1     1       2.5     5       10      90     95     97.5   99     99.9
   Simn       -3.81   -2.68   -2.21   -1.86   -1.49   1.25   1.62   1.94   2.35   3.49
   Spoint     -3.68   -2.60   -2.11   -1.72   -1.31   1.24   1.62   1.97   2.42   3.57
The components of the estimating function take the form

   a_j(t, s) = ( ψ(ae_j/s_1 − zd/s_2),  ψ²(ae_j/s_1 − zd/s_2) − ... ,  ψ'(ae_j/s_1 − zd/s_2) − s_2 )ᵀ,        (9.52)

together with derivatives such as ∂a_j(t, s)/∂t, ∂a_j(t, s)/∂s and ∂²a_j(t, s)/∂s∂sᵀ, which involve further combinations of ψ, ψ' and the standardized residuals e_j.
485
N o rm al
n = 20
n = 40
Slash
C hi-squared
90
95
90
95
90
95
95
90
2.5
97.5
95
91
90
95
94
91
89
96
95
91
89
95
95
14
9
97
94
83
85
9
6
97
95
88
89
For a few samples the saddlepoint equations could not be solved, so simulation would be needed to obtain confidence intervals; we simply left these samples out. Curiously there were no convergence problems for the chi-squared samples.

One complication arises from assuming that the error distribution is symmetric, in which case the discussion in Section 3.3 implies that our resampling scheme should be modified accordingly. We can do so by replacing (9.41) with a symmetrized version K(ξ; z, s_1) = n log{ ... }.
9.6 Bibliographic Notes

General accounts of simulation methods are given in books such as Hammersley and Handscomb (1964), Bratley, Fox and Schrage (1987), Ripley (1987), and Niederreiter (1992).

Balanced bootstrap simulation was introduced by Davison, Hinkley and Schechtman (1986). Ogbonmwan (1985) describes a slightly different method for achieving first-order balance. Graham et al. (1990) discuss second-order balance and the connections to classical experimental design. Algorithms for balanced simulation are described by Gleason (1988). Theoretical aspects of balanced resampling have been investigated by Do and Hall (1992b). Balanced sampling methods are related to number-theoretical methods for integration (Fang and Wang, 1994), and to Latin hypercube sampling (McKay, Conover and Beckman, 1979; Stein, 1987; Owen, 1992b). Diaconis and Holmes (1994) discuss the complete enumeration of bootstrap samples by methods based on Gray codes.

Linear approximations were used as control variates in bootstrap sampling by Davison, Hinkley and Schechtman (1986). A different approach was taken by Efron (1990), who suggested the re-centred bias estimate and the use of control variates in quantile estimation. Do and Hall (1992a) discuss the properties of this method, and provide comparisons with other approaches. Further discussion of control methods is contained in theses by Therneau (1983) and Hesterberg (1988).

Importance resampling was suggested by Johns (1988) and Davison (1988), and was exploited by Hinkley and Shi (1989) in the context of iterated bootstrap confidence intervals. Gigli (1994) outlines its use in parametric simulation for regression and certain time series problems. Hesterberg (1995b) suggests the application of ratio and regression estimators and of defensive mixture distributions in importance sampling, and describes their properties. The large-sample performance of importance resampling has been investigated by Do and Hall (1991). Booth, Hall and Wood (1993) describe algorithms for balanced importance resampling.

Bootstrap recycling was suggested by Davison, Hinkley and Worton (1992) and independently by Newton and Geyer (1994), following earlier ideas by J. W. Tukey; see Morgenthaler and Tukey (1991) for application of similar ideas to robust statistics. Properties of recycling in various applications are discussed by Ventura (1997).

Saddlepoint methods have a history in statistics stretching back to Daniels (1954), and they have been studied intensively in recent years. Reid (1988) reviews their use in statistical inference, while Jensen (1995) and Field and Ronchetti (1990) give longer accounts; see also Barndorff-Nielsen and Cox (1989). Jensen (1992) gives a direct account of the distribution function approximation we use. Saddlepoint approximation for permutation tests was proposed by Daniels (1955) and further discussed by Robinson (1982), Davison and Hinkley (1988), Daniels and Young (1991), and Wang (1993b).
9.7 Problems
1. Under the balanced bootstrap the descending product factorial moments of the f*_{rj} are given by (9.53), with u and v ranging over the distinct values of row and column subscripts on the left-hand side of (9.53).
(a) Check the first- and second-order moments for the f*_{rj} at (9.9), and verify that the values in Problem 2.19 are recovered as R → ∞.
(b) Use the results from (a) to obtain the mean of the bias estimate under balanced resampling.
(c) Now suppose that T is a linear statistic, and let V = (R − 1)^{-1} Σ_r (T*_r − T̄*)² be the estimated variance of T based on the bootstrap samples. Show that the mean of V under multinomial sampling is asymptotically equivalent to the mean under hypergeometric sampling, as R increases.
(Section 9.2.1; Appendix A; Haldane, 1940; Davison, Hinkley and Schechtman, 1986)
2. Consider the following algorithm, applied to a string a consisting of R copies of each of y_1, ..., y_n, of length nR. For I = nR, ..., 2:
(a) generate a random integer U in the range 1, ..., I;
(b) swap a_U and a_I.
...

Suppose that we estimate the bias B by parametric resampling; that is, we generate samples y*_1, ..., y*_n from the N(ȳ, t²) distribution. Show that the raw and adjusted bootstrap estimates of B can be expressed in terms of chi-squared variables X_r, with B_R involving R^{-1} Σ_r X_r^{1/2} and B_{R,adj} = n^{-1/2} R^{-1} { Σ_r X_r^{1/2} − ... } involving a further term in X_{R+1}.
6. ... Use the fact that

   f*_{rj} = Σ_{i=1}^{n} δ{ j − ξ(r, i) },

where δ(u) = 1 if u = 0 and equals zero otherwise, to show that the ξ-balanced design is balanced in terms of the f*_{rj}. Is the converse true?
(c) Suppose that we have a regression model Y_j = βx_j + ε_j, where the independent errors ε_j have mean zero and variance σ². We estimate β by T = Σ Y_j x_j / Σ x_j². Let T* = Σ (t x_j + ε*_j) x_j / Σ x_j² denote a resampled version of T, where ε*_j is selected randomly from the centred residuals e_j − ē, with e_j = y_j − t x_j and ē = n^{-1} Σ e_j. Show that the average value of T* equals t if R values of T* are generated from a ξ-balanced design, but not necessarily if the design is balanced in terms of the f*_{rj}.
(Section 9.2.1; Graham et al., 1990)
7. Show that a second-order balanced design satisfies

   Σ_{r=1}^{R} f*_{rj}² = k(2n − 1),   Σ_{r=1}^{R} f*_{rj} f*_{rk} = k(n − 1),   j, k = 1, ..., n,  j ≠ k.
8. Suppose that you wish to estimate the normal tail probability ∫ I{z ≤ a} φ(z) dz, where φ(·) is the standard normal density function and I{A} is the indicator of the event A, by importance sampling from a distribution H(·).
Let H be the normal distribution with mean μ and unit variance. Show that the maximum efficiency is

   Φ(a){1 − Φ(a)} / { exp(μ²)Φ(a + μ) − Φ(a)² },

where μ is chosen to minimize exp(μ²)Φ(a + μ). Use the fact that Φ(z) ≈ φ(z)/(−z) for z < 0 to give an approximate value for μ, and plot the corresponding approximate efficiency for −3 ≤ a ≤ 0. What happens when a > 0?
(Section 9.4.1)
9. Let T*_{(1)} ≤ ... ≤ T*_{(R)} denote the order statistics of the T*_r, set

   S_m = (R + 1)^{-1} Σ_{r=1}^{m} g(T*_{(r)}) / h(T*_{(r)}),

and let M be the random index determined by S_M ≤ p < S_{M+1}. Show that T*_{(M)} converges to the p quantile under g as R → ∞, and hence justify the estimate t*_{(M)} given at (9.19).
(Section 9.4.1; Johns, 1988)
10
Suppose that T has a linear approximation T[, and let p be the distribution on
y y n with probabilities p; oc exp { l l j / ( n v Ll / 2 ) } , where v L = n ~ 2 Y I l j - Find the
mom ent-generating function o f T[ under sampling from p, and hence show that
in this case T* is approximately N ( t -I- A v j / 2 , v L ). You may assume that T[ is
approximately N ( 0, v L ) when A = 0.
(Section 9.4.1; Johns, 1988; Hinkley and Shi, 1989)
11. Define influence values at a tilted distribution p_a by

   l*_j = ∂/∂ε t{ (1 − ε)p_a + ε δ_j } |_{ε=0}.

Show that the corresponding linear approximation satisfies t*_{L,p_a} = t(p_a) + n^{-1} Σ_j f*_j l*_j. ...
(c) For the ratio estimates in Example 2.22, compare numerically t*_L, t*_{L,p}, and the quadratic approximation

   t*_Q = t + n^{-1} Σ_{j=1}^{n} f*_j l_j + ½n^{-2} Σ_{j=1}^{n} Σ_{k=1}^{n} f*_j f*_k q_{jk}

with t*.
(Sections 2.7.4, 3.10.2, 9.4.1; Hesterberg, 1995a)
12. Show that the ratio estimator can be written

   μ̂_{H,rat} = Σ_r m(y*_r)w(y*_r) / Σ_r w(y*_r) = (μ + R^{-1/2}ε_1) / (1 + R^{-1/2}ε_0),

where ε_1 = R^{-1/2} Σ_r { m(y*_r)w(y*_r) − μ } and ε_0 = R^{-1/2} Σ_r { w(y*_r) − 1 }, and expand to compare its large-R variance with those of the raw and regression estimators.
As an illustration, suppose that the aim is to estimate

   μ = ∫ m(y) g(y) dy = ∫_0^∞ (y^k e^{−y}/k!) θe^{−θy} dy        (9.54)

by simulating from the density h(y) = βe^{−βy}, y > 0, β > 0. Give w(y) and show that E{m(Y)w(Y)} = μ for any β and θ, but that var(μ̂_{H,rat}) is only finite when 0 < β < 2θ. Calculate var{m(Y)w(Y)}, cov{m(Y)w(Y), w(Y)}, and var{w(Y)}. Plot the asymptotic efficiencies var(μ̂_{H,raw})/var(μ̂_{H,rat}) and var(μ̂_{H,raw})/var(μ̂_{H,reg}) as functions of β for θ = 2 and k = 0, 1, 2, 3. Discuss your findings.
(Section 9.4.2; Hesterberg, 1995b)
13. ...

14. ... Show that the conditional density g(i_{nR} | S_{nR} = q) factorizes as a product of terms of the form g(i_{nR−j} | S_{nR−j} = q − i_{nR−j+1} − ... − i_{nR}).
15. For the bootstrap recycling estimate of bias described in Example 9.12, consider the case T = Ȳ with the parametric model Y ~ N(θ, 1). Show that if H is taken to be the N(ȳ, a) distribution, then the simulation variance of the recycling estimate of C is approximately a sum of terms proportional to {(2a − 3)^{3/2}RN}^{-1} and {8(a − 1)^{3/2}N}^{-1}, provided a > 3/2. Compare this to the simulation variance when ordinary double bootstrap methods are used.
What are the implications for nonparametric double bootstrap calculations? Investigate the use of defensive mixtures for H in this problem.
(Section 9.4.4; Ventura, 1997)
16. Suppose that T*_L = t + Σ_{s=1}^{S} n_s^{-1} Σ_{i=1}^{n_s} f*_{si} l_{si} is the linear approximation for a stratified statistic, where the (f*_{s1}, ..., f*_{sn_s}), s = 1, ..., S, are independent sets of multinomial frequencies.
(a) Show that the cumulant-generating function of T*_L is

   K(ξ) = ξt + Σ_{s=1}^{S} n_s log{ n_s^{-1} Σ_{i=1}^{n_s} exp(ξ l_{si}/n_s) }.

...

17. For the paired comparison test of Problem 4.7, show that the relevant cumulant-generating function is

   K(ξ) = Σ_{j=1}^{n} log cosh(ξ d_j/n),

and find the saddlepoint equation and the quantities needed for saddlepoint approximation to the observed significance level. Explain how this may be fitted into the framework of a conditional saddlepoint approximation.
(b) See Practical 9.5.
(Section 9.5.1; Daniels, 1958; Davison and Hinkley, 1988)
18
For the testing problem o f Problem 4.9, use saddlepoint methods to develop an
approximation to the exact bootstrap P-value based on the exponential tilted EDF.
Apply this to the city population data with n = 10.
(Section 9.5.1)
19. Show that if W_1, ..., W_n are independent Poisson variables with equal means, then the conditional distribution of (W_1, ..., W_n) given that Σ W_j = n is the same as that of the multinomial frequency vector (f*_1, ..., f*_n) in ordinary bootstrap sampling.

20. (a) Show that the bootstrap correlation coefficient t* based on data pairs (x_j, z_j), j = 1, ..., n, may be expressed as the solution to the estimating equation (9.40) with

   a_j(t, s) = ( x_j − s_1,  z_j − s_2,  (x_j − s_1)² − s_3,  (z_j − s_2)² − s_4,  (x_j − s_1)(z_j − s_2) − t(s_3 s_4)^{1/2} )ᵀ,

where sᵀ = (s_1, s_2, s_3, s_4), and show that the Jacobian J(t, s; ξ) = n⁵(s_3 s_4)^{1/2}. Obtain the quantities needed for the marginal saddlepoint approximation (9.43) to the density of T*.
(b) What further quantities would be needed for saddlepoint approximation to the marginal density of the studentized form of T*?
(Section 9.5.3; Davison, Hinkley and Worton, 1995; DiCiccio, Martin and Young, 1994)
21. ... Deduce that when T is the sample average and F is the exponential distribution, the large-sample performance gain of antithetic resampling is 6/(12 − π²) ≈ 2.8.
(c) What happens if F is symmetric? Explain qualitatively why.
(Hall, 1989a)
22
(9-55)
where zo, zi, z2 are unknown but a is known; often a = j . Suppose that we
resample from the E D F F, but with sample sizes nQ, m , where 1 < no < n t < n,
instead o f the usual n, giving simulation estimates z ' ( n0), z ' ( n t ) o f z(n0), z( n x).
(a) Show that z*(n) can be estimated by
z(n) =
z (no) +
no
n,
(z(n0) - z > i ) }
9.8 Practicals

1. For ordinary bootstrap sampling, balanced resampling, and balanced resampling within strata:
y <- rnorm(10)
junk.fun <- function(y, i) var(y[i])
junk <- boot(y, junk.fun, R=9)
boot.array(junk)
apply(junk$t, 2, sum)
junk <- boot(y, junk.fun, R=9, sim="balanced")
boot.array(junk)
apply(junk$t, 2, sum)
junk <- boot(y, junk.fun, R=9, sim="balanced", strata=rep(1:2, c(5,5)))
boot.array(junk)
apply(junk$t, 2, sum)
Now use balanced resampling in earnest to estimate the bias for the gravity data weighted average. ...
e x p . t i l t ( t a u . L , t h e t a = c ( 1 4 , 1 8 ) ,t 0 = 1 6 .1 6 )
Function t i l t . b o o t does this automatically. Here we do 199 bootstraps without
tilting, then 100 each tilted to the 0.05 and 0.95 quantiles o f these 199 values o f t".
We then display the weights, without and with defensive mixture distributions:
t a u . t i l t < - t i l t . b o o t ( t a u , t a u . w, R=c( 1 9 9 ,1 0 0 ,1 0 0 ) ,s t r a ta = ta u $ d e c a y ,
s t y p e = "w", L = ta u . L, a lp h a = c ( 0 . 0 5 , 0 . 9 5 ) )
s p l i t . s c r e e n (c (1 ,2 ) )
s c r e e n ( l) ; p lo t ( t a u .t ilt $ t ,im p .w e ig h t s ( t a u .t ilt ) ,lo g = " y " )
s c r e e n ( 2 ) ; p l o t ( t a u . t i l t $ t , im p. w e ig h t s ( t a u . t i l t , d e f= F ), lo g = " y ")
The corresponding estimated quantiles are
i m p .q u a n t i l e ( t a u . t i l t , a l p h a = c ( 0 . 0 5 , 0 . 9 5 ) )
im p. q u a n t i l e ( t a u . t i l t , a lp h a = c ( 0 . 0 5 , 0 . 9 5 ) , def=F )
The same can be done with frequency smoothing, but then the initial value of R must be larger:
tau.freq <- tilt.boot(tau, tau.w, R=c(499,250,250),
  strata=tau$decay, stype="w", tilt=F, alpha=c(0.05,0.95))
imp.quantile(tau.freq, alpha=c(0.05,0.95))
For balanced importance resampling we simply add sim="balanced" to the arguments of tilt.boot. For a small simulation study to see the potential efficiency gains over ordinary sampling, we compare the performance of ordinary sampling and importance resampling with and without balance, in estimating the 0.1 and 0.9 quantiles of the distribution of t*:
tau.test <- NULL
for (irep in 1:10)
{ tau.boot <- boot(tau, tau.w, R=199, stype="w",
    strata=tau$decay)
  q.ord <- sort(tau.boot$t)[c(20,180)]
  tau.tilt <- tilt.boot(tau, tau.w, R=c(99,50,50),
    strata=tau$decay, stype="w", L=tau.L,
    alpha=c(0.1,0.9))
  q.tilt <- imp.quantile(tau.tilt, alpha=c(0.1,0.9))$raw
  tau.bal <- tilt.boot(tau, tau.w, R=c(99,50,50),
    strata=tau$decay, stype="w", L=tau.L,
    alpha=c(0.1,0.9), sim="balanced")
  q.bal <- imp.quantile(tau.bal, alpha=c(0.1,0.9))$raw
  tau.test <- rbind(tau.test, c(q.ord, q.tilt, q.bal)) }
sqrt(apply(tau.test, 2, var))
What are the efficiency gains of the two importance resampling methods?
Consider the bias and standard deviation functions for the correlation of the claridge data (Example 4.9). To estimate them, we perform a double bootstrap and plot the results, as follows.
clar.fun <- function(data, f)
{ r <- corr(data, f/sum(f))
  n <- nrow(data)
  d <- data[rep(1:n, f),]
  us <- (d[,1]-mean(d[,1]))/sqrt(var(d[,1]))
  xs <- (d[,2]-mean(d[,2]))/sqrt(var(d[,2]))
r5 <- apply(matrix(capability$y,15,5,byrow=T), 1,
  function(x) diff(range(x)))
m <- 300; top <- 10; bot <- 4
sad <- matrix(, m, 3)
th <- seq(bot, top, length=m)
for (i in 1:m)
{ sp <- saddle(A=psi(th[i], r5), u=0)
  sad[i,] <- c(th[i], sp$spa[1]*det.psi(th[i], r5, xi=sp$zeta.hat),
    sp$spa[2]) }
sad <- sad[!is.na(sad[,2]) & !is.na(sad[,3]),]
plot(sad[,1], sad[,2], type="l", xlab="theta hat", ylab="PDF")
To obtain the quantiles of the distribution of theta-hat*, we use the following code; here capab.t0 contains theta-hat and its standard error.
r5 <- NULL
for (j in 1:71) r5 <- c(r5, diff(range(capability$y[j:(j+4)])))
Afn <- function(t, data, k=2.236*(5.79-5.49)) cbind(k-t*data, 1)
ufn <- function(t, data, k=2.236*(5.79-5.49)) c(0, 15)
capab.spl <- saddle.distn(A=Afn, u=ufn, wdist="p",
  type="cond", t0=capab.t0, data=r5)
capab.spl$quantiles
Compare them with the quantiles above. How do they differ? Why?
5
To apply the saddlepoint approximation given in Problem 9.17 to the paired comparison data of Problem 4.7, and obtain a one-sided significance level Pr*(D* >= d):
10 Semiparametric Likelihood Inference
10.1 Likelihood
The likelihood function is central to inference in parametric statistical models. Suppose that data y are believed to have come from a distribution F_psi, where psi is an unknown p x 1 vector parameter. Then the likelihood for psi is the corresponding density evaluated at y, namely

L(psi) = f_psi(y).
in parametric models. One special feature is that the likelihood determines the shape of confidence regions when psi is a vector.
Unlike many of the confidence interval methods described in Chapter 5, likelihood provides a natural basis for the combination of information from different experiments. If we have two independent sets of data, y and z, that bear on the same parameter, the overall likelihood is simply L(psi) = f(y | psi) f(z | psi), and tests and confidence intervals concerning psi may be based on this. This type of combination is particularly useful in applications where several independent experiments are linked by common parameters; see Practical 10.1.
In applications we can often write psi = (theta, lambda), where the components of theta are of primary interest, while the so-called nuisance parameters lambda are of secondary concern. In such situations inference for theta is based on the profile likelihood

L_p(theta) = max_lambda L(theta, lambda),    (10.1)

or on the adjusted profile likelihood

L_p'(theta) = L_p(theta) |j_{lambda lambda}(theta, lambda-hat_theta)|^{-1/2},    (10.2)

where lambda-hat_theta is the MLE of lambda for fixed theta and j_{lambda lambda}(psi) is the observed information matrix for lambda, i.e. j_{lambda lambda}(psi) = -d^2 l(psi)/d lambda d lambda^T, with l = log L.
Without a parametric model the definition of a parameter is more vexed. As in Chapter 2, we suppose that a parameter theta is determined by a statistical function t(.), so that theta = t(F) is a mean, median, or other quantity determined by, but not by itself determining, the unknown distribution F. Now the nuisance parameter lambda is all aspects of F other than t(F), so that in general lambda is infinite dimensional. Not surprisingly, there is no unique way to construct a likelihood in this situation, and in this chapter we describe some of the different possibilities.
nonparametric maximum likelihood estimate t = t(F-hat) for theta (Problem 10.1). The EDF is a multinomial distribution with denominator one and probability vector (n^{-1}, ..., n^{-1}) attached to the y_j. We can think of this distribution as embedded in a more general multinomial distribution with arbitrary probability vector p = (p_1, ..., p_n) attached to the data values. If F is restricted to be such a multinomial distribution, then we can write t(p) rather than t(F) for the function which defines theta. The special multinomial probability vector (n^{-1}, ..., n^{-1}) corresponding to the EDF is p-hat, and t = t(p-hat) is the nonparametric maximum likelihood estimate of theta. This multinomial representation was used earlier in Sections 4.4 and 5.4.2.
Restricting the model to be multinomial on the data values with probability vector p, the parameter value is theta = t(p) and the likelihood for p is L(p) = prod_{j=1}^n p_j^{f_j}, with f_j equal to the frequency of value y_j in the sample. But, assuming there are no tied observations, all f_j are equal to 1, so that L(p) = p_1 x ... x p_n: this is the analogue of L(psi) in the parametric case. We are interested only in theta = t(p), for which we can use the profile likelihood

L_EL(theta) = sup_{p : t(p) = theta} prod_{j=1}^n p_j,    (10.3)
which is called the empirical likelihood for theta. Notice that the value of theta which maximizes L_EL(theta) corresponds to the value of p maximizing L(p) with only the constraint Sum p_j = 1, that is p-hat. In other words, the empirical likelihood is maximized by the nonparametric maximum likelihood estimate t.
In (10.3) we maximize over the p_j subject to the constraints imposed by fixing t(p) = theta and Sum p_j = 1, which is effectively a maximization over n - 2 quantities when theta is scalar. Remarkably, although the number of parameters over which we maximize is comparable with the sample size, the approximate distributional results from the parametric situation carry over. Let theta_0 be the true value of theta, with T the maximum empirical likelihood estimator. Then under mild conditions on F and in large samples, the empirical likelihood ratio statistic

W_EL(theta_0) = 2{log L_EL(T) - log L_EL(theta_0)}

has an approximate chi-squared distribution with d degrees of freedom. Although the limiting distribution of W_EL(theta_0) is the same as that of W_p(theta_0) under a correct parametric model, such asymptotic results are typically less useful in the nonparametric setting. This suggests that the bootstrap be used to calibrate empirical likelihood, by using quantiles of bootstrap replicates of W_EL(theta_0), i.e. quantiles of W*_EL(theta-hat). This idea is outlined below.
Example 10.1 (Air-conditioning data) We consider the empirical likelihood for the mean of the larger set of air-conditioning data in Table 5.6; n = 24.
[Figure 10.1: log empirical likelihood, exponential likelihood, and gamma profile likelihood for the mean of the air-conditioning data; x-axes: theta.]
Thus the log empirical likelihood, normalized to have maximum zero, is

l_EL(theta) = -Sum_{j=1}^n log{1 + eta_theta (y_j - theta)}.    (10.5)

This is maximized at the sample average theta-hat = y-bar, where eta_theta = 0 and p_j = n^{-1}. It is undefined outside (min y_j, max y_j), because no multinomial distribution on the y_j can have mean outside this interval.
Figure 10.1 shows L_EL(theta), which is calculated by successive solution of (10.4) to yield eta_theta at values of theta small steps apart. The exponential likelihood and gamma profile likelihood for the mean are also shown. As we should expect, the gamma profile likelihood is always higher than the exponential likelihood, which corresponds to the gamma likelihood but with shape parameter kappa = 1. Both parametric likelihoods are wider than the empirical likelihood. Direct comparison between parametric and empirical likelihoods is misleading, however, since they are based on different models, and here and in later figures
we give the gamma likelihood purely as a visual reference. The circumstances in which empirical and parametric likelihoods are close are discussed in Problem 10.3.
[Figure 10.2: Q-Q plots of the empirical likelihood ratio statistic and the gamma profile likelihood ratio statistic against chi-squared quantiles.]
The endpoints of an approximate 95% confidence interval for theta are obtained by reading off where l_EL(theta) = -(1/2) c_{1,0.95}, where c_{d,alpha} is the alpha quantile of the chi-squared distribution with d degrees of freedom. The interval is (43.3, 92.3), which compares well with the nonparametric BC_a interval of (42.4, 93.2). The likelihood ratio intervals for the exponential and gamma models are (44.1, 98.4) and (44.0, 98.6).
Figure 10.2 shows the empirical likelihood and gamma profile likelihood ratio statistics for 500 exponential samples of size 24. Though good for the parametric statistic, the chi-squared approximation is poor for W_EL, whose estimated 95% quantile is 5.92 compared to the chi-squared_1 quantile of 3.84. This suggests strongly that the empirical likelihood-based confidence interval given above is too narrow. However, the simulations are only relevant when the data are exponential, in which case we would not be concerned with empirical likelihood.
We can use the bootstrap to estimate quantiles for W_EL(theta_0), by setting theta_0 = y-bar and then calculating W*(theta_0) for bootstrap samples from the original data. The resulting Q-Q plot is less extreme than the left panel of Figure 10.2, with a 95% quantile estimate of 4.08 based on 999 bootstrap samples; the corresponding empirical likelihood ratio interval is (42.8, 93.3). With a sample of size 12, 41 of the 999 simulations gave infinite values of W*_EL(theta_0) because y-bar did not lie within the limits (min y*_j, max y*_j) of the bootstrap sample. With a sample of size 24, this problem did not arise.
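For readers who wish to experiment, a minimal S sketch of this calibration for a mean is as follows; el.loglik and w.star are ad hoc names, not library functions, and the data vector could be, for instance, aircondit$hours. The root eta_theta of (10.4) is found with uniroot:
el.loglik <- function(y, theta, eps=1e-8)
{ # log empirical likelihood (10.5) for the mean, maximum normalized to zero
  if (theta <= min(y) || theta >= max(y)) return(-Inf)
  u <- y - theta
  g <- function(eta) sum(u/(1 + eta*u))    # equation (10.4)
  lo <- -1/max(u) + eps                    # keep all 1 + eta*u_j positive
  hi <- -1/min(u) - eps
  eta <- uniroot(g, c(lo, hi))$root
  -sum(log(1 + eta*u)) }
y <- aircondit$hours
w.star <- numeric(999)
for (r in 1:999)                           # W*_EL(ybar) for each resample
  w.star[r] <- -2*el.loglik(sample(y, replace=T), mean(y))
quantile(w.star[is.finite(w.star)], 0.95)  # bootstrap 95% quantile for W_EL
Resamples whose range excludes y-bar give infinite values of W*_EL, exactly as described above, and are dropped before taking the quantile.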
Vector parameter
Suppose now that the d x 1 parameter theta is determined by the equations

Integral u_i(theta; y) dF(y) = 0,   i = 1, ..., d,

where u(theta; y) is a d x 1 vector whose ith element is u_i(theta; y). Then the estimate theta-hat is the solution to the d estimating equations

Sum_{j=1}^n u(theta; y_j) = 0.    (10.6)
An extension of the argument in Example 10.1, involving the vector of Lagrange multipliers eta_theta = (eta_{theta 1}, ..., eta_{theta d})^T, shows that the log empirical likelihood is

l_EL(theta) = -Sum_{j=1}^n log{1 + eta_theta^T u_j(theta)},    (10.7)

where u_j(theta) = u(theta; y_j). The value of eta_theta is determined by theta through the simultaneous equations

Sum_{j=1}^n u_j(theta) / {1 + eta_theta^T u_j(theta)} = 0.    (10.8)
The simplest approximate confidence region for the true theta is the set of values such that W_EL(theta) <= c_{d,1-alpha}, but in small samples it will again be preferable to replace the chi-squared_d quantile by its bootstrap estimate.
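A small S sketch of (10.7) and (10.8) for vector theta may help fix ideas; el.loglik.vec is an ad hoc name, and the damped Newton iteration assumes that 0 lies inside the convex hull of the u_j(theta), so that (10.8) has a solution:
el.loglik.vec <- function(u, tol=1e-8)
{ # u: n x d matrix whose jth row is u_j(theta); returns log EL (10.7)
  eta <- rep(0, ncol(u))
  for (it in 1:50)
  { a <- 1 + c(u %*% eta)
    g <- apply(u/a, 2, sum)             # left-hand side of (10.8)
    if (sum(abs(g)) < tol) break
    H <- -t(u/a^2) %*% u                # Jacobian of (10.8) in eta
    step <- -solve(H, g)
    while (any(1 + c(u %*% (eta + step)) <= 0)) step <- step/2
    eta <- eta + step }
  -sum(log(1 + c(u %*% eta))) }
For the mean of bivariate data y this would be called with u <- cbind(y[,1]-theta[1], y[,2]-theta[2]).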
Sum_{j=1}^n u_j(theta) exp{eta^T u_j(theta)} = 0.    (10.9)
(10.10)

denote the unit vectors orthogonal to a(theta, phi). Then since the eigenvectors of E(YY^T) may be taken to be orthogonal, the population values of (theta, phi) satisfy simultaneously the equations

b(theta, phi)^T E(YY^T) a(theta, phi) = 0,

with sample equivalents
The left panel of Figure 10.3 shows the empirical likelihood contours based on (10.7) and (10.8), in the square region shown in Figure 5.10. The corresponding contours for Q_EEF are shown on the right. The dashed lines show the boundaries of the 95% confidence regions for (theta, phi) using bootstrap calibration; these differ little from those based on the asymptotic chi-squared_2 distribution. In each panel the dotted ellipse is a 95% confidence region based on a studentized form of the sample mean polar axis, for which the contours are ellipses. The elliptical contours are appreciably tighter than those for the likelihood-based statistics.
Table 10.1 compares theoretical and bootstrap quantiles for several likelihood-based statistics and the studentized bootstrap statistic, Q, for the full data and for a random subset of size 20. For the full data, the quantiles for Q_EEF and W_EL are close to those for the large-sample chi-squared_2 distribution. For the subset, Q_EEF is close to its nominal distribution, but the other statistics seem considerably more variable. Except for Q_EEF, it would be misleading to rely on the asymptotic results for the subsample.
                        Full data, n = 50              Subset, n = 20
Quantile  chi-sq_2   W_EL  W_EEF  Q_EEF   Q       W_EL  W_EEF  Q_EEF   Q
0.80        3.22     3.23  3.40   3.37   3.15     3.67  3.70   3.61   3.15
0.90        4.61     4.77  4.81   5.05   4.69     5.39  5.66   5.36   4.45
0.95        5.99     6.08  6.18   6.94   6.43     7.17  7.99  10.82   7.03
p*_j(theta) = Sum_{r=1}^R w{(theta - t*_r)/e} p*_{rj} / Sum_{r=1}^R w{(theta - t*_r)/e},   j = 1, ..., n,    (10.12)

where typically w(.) is the standard normal density and e = v_L^{1/2}; as usual v_L is the nonparametric delta method variance estimate for t. The distribution p*(theta) will have parameter value not theta but theta' = t{p*(theta)}. With the understanding that theta is defined in this way, we shall for simplicity write p*(theta) rather than p*(theta'). For a fixed collection of R first-level samples and bandwidth e > 0, the probability vectors p*(theta) change gradually as theta varies over its range of interest.
Second-level bootstrap sampling now uses vectors p*(theta) as sampling distributions on the data values, in place of the p*_r. The second-level sample values t** are then used in (10.11) to obtain L_B(theta). Repeating this calculation for, say, 100 values of theta in the range t +/- 4v_L^{1/2}, followed by smooth interpolation, should give a good result.
Experience suggests that the value e = v^{1/2} is safe to use in (10.12) if the t*_r are roughly equally spaced, which can be arranged by weighted first-level sampling, as outlined in Problem 10.6.
A way to reduce further the amount of calculation is to use recycling, as described in Section 9.4.4. Rather than generate second-level samples from each p*(theta) of interest, one set of M samples can be generated using a distribution p-bar on the data values, and the associated values t**_1, ..., t**_M calculated. Then, following the general re-weighting method (9.24), the likelihood values are calculated as

L_B(theta) = (Me)^{-1} Sum_{m=1}^M w{(theta - t**_m)/e} prod_{j=1}^n {p*_j(theta)/p-bar_j}^{f**_mj},    (10.13)

where f**_mj is the frequency of the jth case in the mth second-level bootstrap sample. One simple choice for p-bar is the EDF p-hat. In special cases it will be possible to replace the second level of sampling by use of the saddlepoint approximation method of Section 9.5. This would give an accurate and smooth approximation to the density of T* for sampling from each p*(theta).
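A compact S sketch of how (10.12) and (10.13) combine for the sample average may clarify the bookkeeping; boot.lik and the other names are ad hoc, p-bar is taken to be the EDF, and the kernel is the standard normal density:
boot.lik <- function(y, theta, R=200, M=500, eps=sqrt(var(y)/length(y)))
{ n <- length(y)
  f <- matrix(0, R, n); tstar <- numeric(R)
  for (r in 1:R)                    # first level: frequencies and t*
  { i <- sample(n, replace=T)
    f[r,] <- tabulate(i, n); tstar[r] <- mean(y[i]) }
  fm <- matrix(0, M, n); tm <- numeric(M)
  for (m in 1:M)                    # second level: M samples from the EDF
  { i <- sample(n, replace=T)
    fm[m,] <- tabulate(i, n); tm[m] <- mean(y[i]) }
  L <- numeric(length(theta))
  for (k in 1:length(theta))
  { wr <- dnorm((theta[k] - tstar)/eps)
    p <- apply(wr*f/n, 2, sum)/sum(wr)      # smoothed probabilities (10.12)
    lw <- fm %*% log(pmax(n*p, 1e-12))      # log prod_j (p_j/pbar_j)^f**_mj
    L[k] <- mean(exp(lw)*dnorm((theta[k] - tm)/eps))/eps }   # (10.13)
  L }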
Example 10.3 (Air-conditioning data)
[Figure 10.4: bootstrap likelihoods for the mean of the air-conditioning data; x-axes: theta.]
We again use the data from Example 10.1. The solid points in the left panel of Figure 10.4
are bootstrap likelihood values for the mean theta for 200 resamples, obtained by saddlepoint approximation. This replaces the kernel density estimate (10.11) and avoids the second level of resampling, but does not remove the variation in estimated likelihood values for different bootstrap samples with similar values of t*. A locally quadratic nonparametric smoother (on the log likelihood scale) could be used to produce a smooth likelihood curve from the values of L(t*), but another approach is better, as we now describe.
The solid line in the left panel of Figure 10.4 interpolates values obtained by applying the saddlepoint approximation using probabilities (10.12) at a few values of theta. Here the values of t* are generated at random, and we have taken e = 0.5 v_L^{1/2}; the results depend little on the value of e.
The log bootstrap likelihood is very close to log empirical likelihood, with 95% confidence interval (43.8, 92.1).
L_Z(theta) = g{z(theta)} |dz(theta)/d theta|,    (10.14)

where g denotes the density of the pivot Z = z(theta, F-hat), and

L(theta) = L_{Z,2n}(theta) / L_{Z,n}(theta),    (10.15)

where the subscripts indicate sample size. Note that F-hat and t are the same for both sample sizes, but quantities such as variance estimates will depend upon sample size. Note also that the implied prior is estimated by L_{Z,n}(theta)^2 / L_{Z,2n}(theta).
Example 10.4 (Exponential mean) If data y_1, ..., y_n are sampled from an exponential distribution with mean theta, then a suitable choice for z(theta, F-hat) is y-bar/theta. The gamma distribution for Y-bar can be used to check that the original definition (10.14) gives L_Z(theta) = theta^{-n-1} exp(-n y-bar/theta), whereas the true likelihood is theta^{-n} exp(-n y-bar/theta). The true result is obtained exactly using (10.15). The implied prior is pi(theta) proportional to theta^{-1}, for theta > 0.
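The algebra is easy to check numerically; a throwaway S sketch, with simulated data and all names ad hoc, using the forms of (10.14) and (10.15) as reconstructed above:
y <- rexp(24, rate=1/70); n <- length(y); ybar <- mean(y)
theta <- seq(40, 120, length=81)
# (10.14): m*(Ybar/theta) ~ Gamma(shape m) for sample size m
Lz <- function(th, m) dgamma(m*ybar/th, shape=m)*m*ybar/th^2
L <- Lz(theta, 2*n)/Lz(theta, n)            # (10.15)
Ltrue <- theta^(-n)*exp(-n*ybar/theta)      # true exponential likelihood
range(L/Ltrue)                              # constant: agreement up to a factor
The final line should show a constant ratio across theta, confirming that (10.15) recovers the true likelihood exactly here.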
In practice the distribution of Z must be estimated, in general by bootstrap simulation, leading to

L_Z(theta) = k-hat_{2n}{z(theta)} / k-hat_n{z(theta)}.    (10.16)

In practice these values can be computed via spline smoothing from a dense set of values of the kernel density estimates k-hat(z).
There are difficulties with this method. First, just as with bootstrap likelihood, it is necessary to use a large number of simulations R. A second difficulty is that of ascertaining whether or not the chosen Z is a pivot, or else what prior transformation of T could be used to make Z pivotal; see Section 5.2.2. This is especially true if we extend (10.16) to vector theta, which is theoretically possible. Note that if the studentized bootstrap is applied to a transformation of t rather than t itself, then the factor |dz(theta)/d theta| in (10.14) can be ignored when applying (10.16).
(10.18)

[Figure 10.5: confidence likelihoods for the two sets of air-conditioning data; x-axes: theta.]

where

u(theta) = 2z(theta) / [1 + 2a z(theta) + {1 + 4a z(theta)}^{1/2}],

with z(theta) = (t - theta)/v_L^{1/2} as before. This is called the implied likelihood. Based on the discussion in Section 5.4, one would expect results similar to those from applying (10.16).
A further modification is to multiply L_ABC(theta) by exp{(c v^{1/2} - b) theta / v_L}, with b the bias estimate defined in (5.49). The effect of this modification is to make the likelihood even more compatible with the Bayesian interpretation, somewhat akin to the adjusted profile likelihood (10.2).
Example 10.5 (Air-conditioning data) Figure 10.5 shows confidence likelihoods for the two sets of air-conditioning data in Table 5.6, samples of size 12 and 24 respectively. The implied likelihoods L_ABC(theta) are similar to the empirical likelihoods for these data. The pivotal likelihood L_Z(theta), calculated from R = 9999 samples with bandwidths equal to 1.0 in (10.17), is clearly quite unstable for the smaller sample size. This also occurred with bootstrap likelihood for these data and seems to be due to the discreteness of the simulations with so small a sample.
Pr(Y = u_j | p_1, ..., p_N) = p_j,   j = 1, ..., N.

With prior density pi for (p_1, ..., p_N), the posterior density is proportional to pi(p_1, ..., p_N) prod_{j=1}^N p_j^{f_j}, and this induces a posterior density for theta. Its calculation is particularly straightforward when pi is the Dirichlet density, in which case the prior and posterior densities are respectively proportional to

prod_{j=1}^N p_j^{a_j}   and   prod_{j=1}^N p_j^{a_j + f_j};

the posterior density is Dirichlet also. Bayesian bootstrap samples and the corresponding values of theta are generated from the joint posterior density for the p_j, as follows.
Algorithm 10.1 (Bayesian bootstrap)
For r = 1, ..., R:
1. Let G_1, ..., G_N be independent gamma variables with shape parameters a_j + f_j + 1, and unit scale parameters, and for j = 1, ..., N set P*_j = G_j/(G_1 + ... + G_N).
2. Let theta*_r = t(F*_r), where F*_r = (P*_1, ..., P*_N).
Estimate the posterior density for theta by kernel smoothing of theta*_1, ..., theta*_R.
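In S the algorithm is only a few lines. A minimal sketch for the mean, assuming no ties (so each f_j = 1) and all a_j equal to a; bayes.boot is an ad hoc name:
bayes.boot <- function(y, R=999, a=1)
{ # Algorithm 10.1 for t(F) = mean: theta* = sum(P*_j y_j)
  n <- length(y)
  theta <- numeric(R)
  for (r in 1:R)
  { g <- rgamma(n, shape=a+2)      # shapes a_j + f_j + 1 with f_j = 1
    theta[r] <- sum(g*y)/sum(g) }  # normalizing the G_j gives the P*_j
  theta }
plot(density(bayes.boot(aircondit$hours, a=1)))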
[Figure 10.6: posterior densities for theta (left) and implied priors (right); x-axes: theta.]
The left panel of Figure 10.6 shows kernel density estimates of the posterior density of theta based on R = 999 simulations with all the a_j equal to a = 1, 2, 5, and 10. The increasingly strong prior information results in posterior densities that are more and more sharply peaked.
The right panel shows the implied priors on theta, obtained using the data doubling device from Section 10.4. The priors seem highly informative, even when a = 1.
The primary use of the Bayesian bootstrap is likely to be for imputation when data are missing, rather than in inference for theta per se. There are theoretical advantages to such weighted bootstraps, in which the probabilities P*_j vary smoothly, but as yet they have been little used in applications.
10.7 Problems
1

pi_j(theta) = Pr(Y = y_j) = exp{xi^T u(theta; y_j)} / Sum_{k=1}^n exp{xi^T u(theta; y_k)},   j = 1, ..., n,

where xi is determined by

Sum_{j=1}^n pi_j(theta) u(theta; y_j) = 0.    (10.19)

(a) Let Z_1, ..., Z_n be independent Poisson variables with means exp{xi^T u_j}, where u_j = u(theta; y_j); we treat theta as fixed. Write down the likelihood equation for xi and show that when the observed values of the Z_j all equal zero, it is equivalent to (10.19). Hence outline how software that fits generalized linear models may be used to find xi-hat.
(b) Show that the formulation in terms of Poisson variables suggests that the empirical exponential family likelihood ratio statistic is the Poisson deviance W'_EEF(theta_0),
while the multinomial form gives W_EEF(theta_0), where

W'_EEF(theta_0) = 2 Sum_j {1 - exp(xi-hat^T u_j)}.

(c) Plot the log likelihood functions corresponding to W_EEF and W'_EEF for the data in Example 10.1; take u_j = y_j - theta. Perform a small simulation study to compare the behaviour of W_EEF and W'_EEF when the underlying data are samples of size 24 from the exponential distribution.
(Section 10.2.2)
p_{xi,j} = exp(xi l_j) / Sum_{k=1}^n exp(xi l_k),    (10.20)

where xi = v^{-1}(theta_0 - t)/n and v = n^{-2} Sum l_j^2.
(a) Show that t(F_xi) = theta_0, where F_xi denotes the CDF corresponding to (10.20). Hence describe how to space out the values t* in the first-level resampling for a bootstrap likelihood.
(b) Rather than use the tilted probabilities (10.12) to construct a bootstrap likelihood by simulation, suppose that we use those in (10.20). For a linear statistic, show that the cumulant-generating function of T* in sampling from (10.20) is lambda t + n{K(xi + n^{-1} lambda) - K(xi)}, where K(xi) = log(Sum e^{xi l_j}). Deduce that the saddlepoint approximation to f_{T*}(t | theta) is proportional to exp{-nK(xi)}, where theta = t + K'(xi). Hence show that for the sample average, the log likelihood at theta = Sum y_j e^{xi y_j} / Sum e^{xi y_j} is n{xi t - log(Sum e^{xi y_j})}.
(c) Extend (b) to the situation where t is defined as the solution to a monotonic estimating equation.
(Section 10.3; Davison, Hinkley and Worton, 1992)
7
Consider the choice of h for the raw bootstrap likelihood values (10.11), when w(.) is the standard normal density. As is often roughly true, suppose that T* ~ N(t, v), and that conditional on T* = t', T** ~ N(t', v).
(a) Show that the mean and variance of the product of v^{1/2} with (10.11) are l_1 and M^{-1}(l_2 - l_1^2), where

l_1 = (1 + gamma^2)^{-1/2} phi{delta (1 + gamma^2)^{-1/2}},
l_2 = {2 pi gamma (2 + gamma^2)^{1/2}}^{-1} exp{-delta^2/(2 + gamma^2)},

where gamma = h v^{-1/2} and delta = v^{-1/2}(t' - t). Hence verify some of the values in the following table (all entries x 10^-2):

                     gamma = 0.2        gamma = 0.4        gamma = 0.6
 delta   Density   Bias  M x var      Bias  M x var      Bias  M x var
   0      39.9     -0.8    40.4       -2.9    13.4       -5.7     5.6
   1      24.2      0.0    28.3       -0.1    11.2       -0.5     5.7
   2       5.4      0.3     7.5        1.2     3.8        2.5     2.6
(b) If gamma is small, show that the variance of (10.11) is roughly proportional to the square of its mean, and deduce that the variance is approximately constant on the log scale.
(c) Extend the calculations in (a) to (10.13).
(Section 10.3; Davison, Hinkley and Worton, 1992)
8
(1 - theta)^{-2},   |theta| < 1.
(10.21)
(a) Show that the posterior mean and variance of theta = Sum y_j p_j are y-bar and (2n + an + 1)^{-1} m_2, where m_2 = n^{-1} Sum (y_j - y-bar)^2.
(b) Now consider the average Y-bar* of bootstrap samples generated as follows. We generate a distribution F* = (P*_1, ..., P*_n) on y_1, ..., y_n under the Bayesian bootstrap,
Are the properties of this as n -> infinity and a -> infinity what you would expect? How does this compare with samples generated by the usual nonparametric bootstrap?
(Section 10.5)
10.8 Practicals
1
We compare the empirical likelihoods and 95% confidence intervals for the mean of the data in Table 3.1, (a) pooling the eight series:
attach(gravity)
grav.EL <- EL.profile(g, tmin=70, tmax=85, n.t=51)
plot(grav.EL[,1], exp(grav.EL[,2]), type="l", xlab="mu",
  ylab="empirical likelihood")
lik.CI(grav.EL, lim=-0.5*qchisq(0.95,1))
and (b) treating the series as arising from separate distributions with the same mean and plotting eight individual likelihoods:
2
th <- ifelse(theta>180, theta-360, theta)
a.t <- function(th) c(sin(th*pi/180), cos(th*pi/180))
b.t <- function(th) c(cos(th*pi/180), -sin(th*pi/180))
y <- t(apply(matrix(theta,18,1), 1, a.t))
thetahat <- function(y)
{ m <- apply(y,2,sum)
  m <- m/sqrt(m[1]^2+m[2]^2)
  180*atan(m[1]/m[2])/pi }
thetahat(y)
u.t <- function(y, th) crossprod(b.t(th), t(y))
islay.EL <- EL.profile(y, tmin=-100, tmax=120, n.t=40, u=u.t)
plot(islay.EL[,1], islay.EL[,2], type="l", xlab="theta",
  ylab="log empirical likelihood", ylim=c(-25,0))
points(th, rep(-25,18)); abline(h=-3.84/2, lty=2)
lik.CI(islay.EL, lim=-0.5*qchisq(0.95,1))
islay.EEF <- EEF.profile(y, tmin=-100, tmax=120, n.t=40, u=u.t)
lines(islay.EEF[,1], islay.EEF[,2], lty=3)
lik.CI(islay.EEF, lim=-0.5*qchisq(0.95,1))
Discuss the shapes of the log likelihoods.
To obtain 0.95 quantiles of the bootstrap distributions of W_EL and W_EEF:
We compare posterior densities for the mean of the air-conditioning data using (a) the Bayesian bootstrap with a_j = 1:
11 Computer Implementation
11.1 Introduction
The key requirements for computer implementation of resampling methods are a flexible programming language with a suite of reliable quasi-random number generators, a wide range of built-in statistical procedures to bootstrap, and a reasonably fast processor. In this chapter we outline how to use one implementation, using the current (May 1997) commercial version Splus 3.3 of the statistical language S, although the methods could be realized in a number of other statistical computing environments.
The remainder of this section outlines the installation of the library, and gives a quick summary of features of Splus essential to our purpose. Each subsequent section describes aspects of the library needed for the material in the corresponding chapter: Section 11.2 corresponds to Chapter 2, Section 11.3 to Chapter 3, and so forth. These sections take the form of a tutorial on the use of the library functions. The outline given here is not intended to replace the help files distributed with the library, which can be viewed by typing help(boot, library="boot") within Splus. At various points below, you will need to consult these files for more details on functions.
The main functions in the library are summarized in Table 11.1.
The best way to learn to use software is to use it, and from Section 11.1.2 onwards, we assume that you, dear reader, know the basics of S, including how to write simple functions, that you are seated comfortably at your favourite computer with Splus launched and a graphics window open, and that you are working through this chapter. We do not show the Splus prompt >, nor the continuation prompt +.
Table 11.1 Functions in the Splus bootstrap library.

Function          Purpose
abc.ci            Nonparametric ABC confidence intervals
boot              Parametric and nonparametric bootstrap simulation
boot.array        Frequency or index array for bootstrap resamples
boot.ci           Bootstrap confidence intervals
censboot          Bootstrap for censored data
control           Control variate calculations
cv.glm            Cross-validation prediction error for generalized linear models
empinf            Empirical influence values
envelope          Simulation envelopes for curves
exp.tilt          Exponential tilting of resampling probabilities
glm.diag          Generalized linear model diagnostics
glm.diag.plots    Plots of generalized linear model diagnostics
imp.moments       Importance sampling estimates of moments
imp.prob          Importance sampling estimates of tail probabilities
imp.quantile      Importance sampling estimates of quantiles
imp.weights       Importance sampling weights
jack.after.boot   Jackknife-after-bootstrap plot
linear.approx     Linear approximation to bootstrap replicates
saddle            Saddlepoint approximation
saddle.distn      Saddlepoint approximation to a distribution
simplex           Simplex method for linear programming
smooth.f          Frequency smoothing of a bootstrap distribution
tilt.boot         Bootstrap with tilted resampling probabilities
tsboot            Bootstrap for time series
11.1.1 Installation
UNIX
The bootstrap library can be obtained from the home page for this book,
http://dmawww.epfl.ch/davison.mosaic/BMA/
in the form of a compressed shar file bootlib.sh.Z. This file should be uncompressed and moved to an appropriate directory. The file can then be unpacked by
sh bootlib.sh
rm bootlib.sh
You should then follow the instructions that this produces to complete the installation of the library. It is best to set up a separate directory for the library files; you may need to ask your system administrator to do this. The library is then accessed, once inside Splus, by typing
library(boot, first=T)
This will avoid cluttering your working directory with library files, and reduce the chance that you accidentally overwrite them.
Windows
The disk at the back of this book contains the library functions and documentation for use with Splus for Windows. For instructions on the installation, see the file README.TXT on the disk. The contents of the disk can also be retrieved in the form of a zip file from the home page for the book given above.
y
Here <- is the assignment symbol. To see the contents of any S object, simply type its name, as above. This is often done below, and we do not show the output.
In general quasi-random numbers from a distribution are generated by the functions rexp, rgamma, rchisq, rt, ..., with arguments to give parameters where needed. For example,
y <- rgamma(n=10, shape=2)
generates 10 gamma observations with shape parameter 2, and
y <- rgamma(n=10, shape=c(1:10))
generates a vector of ten gamma variables with shape parameters 1, 2, ..., 10.
The function sample is used to sample from a set with or without replacement. For example, to get a random permutation of the numbers 1, ..., 10, a random sample with replacement from them, a random permutation of 11, 22, 33, 44, 55, a sample of size 10 from them, and a sample of size 10 taken with unequal probabilities:
sample(10)
sample(10, replace=T)
set <- c(11,22,33,44,55)
sample(set)
sample(set, size=10, replace=T)
sample(set, size=10, replace=T, prob=c(0.1,0.1,0.1,0.1,0.6))
Subscripts
The city population data with n = 10 are
city
city$u
city$x
where the second two commands show the individual variables of city. This Splus object is a dataframe, an array of data in which rows correspond to cases, and the named columns to variables. Elements of an object are accessed by subscripts, so
city$x[1]
city$x[c(1:4)]
city$x[c(1,5,10)]
city[c(1,5,10),2]
city$x[-1]
city[c(1:3),]
give various subsets of the elements of city. To make a nonparametric bootstrap sample of the rows of city, you could type:
i <- sample(10, replace=T)
city[i,]
The row labels result from the algorithm used to give unique labels to rows, and can be ignored for our purposes.
Bootstrap objects
city.boot
prints the original statistic, its estimated bias and its standard error, while
plot(city.boot)
gives suitable summary plots.
To see the names of the elements of the bootstrap object city.boot, type
names(city.boot)
You see various names, of which city.boot$t0, city.boot$t, city.boot$R, city.boot$seed contain the original value of the statistic, the bootstrap values, the value of R, and the value of the Splus random number generation seed when boot was invoked. To see their contents, type their names.
Timing
To repeat the simulation, checking how long it takes, type
unix.time(city.boot <- boot(city, city.fun, R=50))
on a UNIX system or
f <- boot.array(city.boot)
f[1:20,]
t = Sum w*_j x_j / Sum w*_j u_j,

where w*_j is the weight put on the jth case of the dataframe in the bootstrap sample; the first line of city.w ensures that Sum w*_j = 1. Setting w in the initial line of the function gives the default value for w, which is a vector of n^{-1}s; this enables the original value of t to be obtained by city.w(city). A more complicated example is given by the library correlation function corr. Not all statistics can be written in this form, but when they can, numerical differentiation can be used to obtain empirical influence values and ABC confidence intervals.
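For instance, a one-line sketch of the latter, using the library function abc.ci from Table 11.1 with the weighted form city.w defined above:
abc.ci(city, city.w, conf=0.95)
returns the 95% ABC interval for the city ratio.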
For the third, frequency type, the arguments are data and a vector of frequencies f. For example,

n^{-1} Sum f*_j y_j,

where f*_j is the frequency with which the jth row of the dataframe occurs in the bootstrap sample. Not all statistics can be written in this form. It differs from the preceding type in that whereas weights can in principle take any positive
A new argument to boot, sim="parametric", tells boot to perform a parametric simulation: by default the simulation is nonparametric and sim="ordinary". Other possible values for sim are described below.
For example, for parametric simulation from the exponential model fitted to the air-conditioning data in Table 1.2, we set
aircondit.fun <- function(data) mean(data$hours)
aircondit.sim <- function(data, mle)
{ d <- data
  d$hours <- rexp(n=nrow(data), rate=mle)
  d }
aircondit.mle <- 1/mean(aircondit$hours)
aircondit.para <- boot(data=aircondit, statistic=aircondit.fun,
  R=20, sim="parametric", ran.gen=aircondit.sim,
  mle=aircondit.mle)
Air-conditioning data for a different aircraft are given in aircondit7. Obtain their sample average, and perform a parametric bootstrap of the average using the fitted exponential model. Give the bias and variance estimates for the average. Do the bootstrapped averages look normal for this sample size?
A more complicated example is parametric simulation based on a log bivariate normal distribution fitted to the city population data:
l.city <- log(city)
city.mle <- c(apply(l.city,2,mean), sqrt(apply(l.city,2,var)),
  corr(l.city))
city.sim <- function(data, mle)
{ n <- nrow(data)
  d <- matrix(rnorm(2*n), n, 2)
  d[,2] <- mle[2] + mle[4]*(mle[5]*d[,2] + sqrt(1-mle[5]^2)*d[,1])
  d[,1] <- mle[1] + mle[3]*d[,1]
  data$x <- exp(d[,2])
  data$u <- exp(d[,1])
  data }
city.f <- function(data) mean(data[,2])/mean(data[,1])
city.para <- boot(city, city.f, R=200, sim="parametric",
  ran.gen=city.sim, mle=city.mle)
With this definition of city.f, a nonparametric bootstrap can be performed by
city.boot <- boot(data=city,
  statistic=function(data, i) city.f(data[i,]), R=200)
This is useful when comparing parametric and nonparametric bootstraps for the same problem. Compare them for the city data.
J <- empinf(data=city, statistic=city.fun, stype="i", type="jack")
The argument type controls how the influence values are to be calculated, but this also depends on the quantities input to empinf: for details see the help file.
Variance approximations
var.linear uses empirical influence values to calculate the nonparametric delta method variance approximation for a statistic:
var.linear(L.diff)
var.linear(L.reg)
Linear approximation
linear.approx uses output from a nonparametric bootstrap simulation to calculate the linear approximations to the bootstrapped quantities. The empirical influence values can be supplied, but if not, they are estimated by a call to empinf. For the city population ratio,
calculates the linear approximation for the two sets of empirical influence values and plots the actual t* against them.
gravity
grav <- gravity[as.numeric(gravity$series)>=7,]
grav
grav.fun <- function(data, i, trim=0.125)
{ d <- data[i,]
  m <- tapply(d$g, d$series, mean, trim=trim)
  m[7]-m[8] }
grav.boot <- boot(grav, grav.fun, R=200, strata=grav$series)
Check that the expected properties of boot.array(grav.boot) hold.
Empirical influence values, linear approximations, and nonparametric delta method variance approximations are calculated by
11.3.2 Smoothing
The neatest way to perform smooth bootstrapping is to use sim="parametric". For example, to estimate the variance of the median of the data in y, using smoothing parameter h = 0.5:
y <- rnorm(99)
h <- 0.5
y.gen <- function(data, mle)
{ n <- length(data)
  i <- sample(n, n, replace=T)
  data[i] + mle*rnorm(n) }
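A call along the lines of the parametric examples above then performs the smoothed bootstrap and gives the variance estimate; this is a sketch, not the library's own listing:
y.boot <- boot(y, function(data) median(data), R=199,
  sim="parametric", ran.gen=y.gen, mle=h)
var(y.boot$t)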
city.boot <- boot(city, city.fun, R=999)
city.L <- city.boot$t0[3:12]
split.screen(c(1,2)); screen(1); split.screen(c(2,1)); screen(4)
attach(city)
plot(u, x, type="n", xlim=c(0,300), ylim=c(0,300))
text(u, x, round(city.L,2))
screen(3)
plot(u, x, type="n", xlim=c(0,300), ylim=c(0,300))
text(u, x, c(1:10)); abline(0, city.boot$t0[1], lty=2)
screen(2)
jack.after.boot(boot.out=city.boot, useJ=F, stinf=F, L=city.L)
close.screen(all=T)
The two left panels show the data with case numbers and empirical influence values as plotting symbols. The jackknife-after-bootstrap plot on the right shows the effect of deleting cases in turn: values of t* are more variable when case 4 is deleted and less variable when cases 9 and 10 are deleted. We see from the empirical influence values that the distribution of t* shifts downwards when cases with positive empirical influence values are deleted, and conversely.
This plot is also produced by setting true the jack argument to plot when applied to a bootstrap object, as in plot(city.boot, jack=T).
Other arguments for jack.after.boot control whether the influence values are standardized (by default they are, stinf=T), and whether the empirical influence values are used (by default jackknife values are used, based on the simulation, so the default values are useJ=T and L=NULL).
Most post-processing functions allow the user to specify either an index for the component of interest, or a vector of length boot.out$R to be treated as the main statistic. Thus a jackknife-after-bootstrap plot using the second component of city.boot$t, the estimated variances for t*, would be obtained by either of
jack.after.boot(city.boot, useJ=F, stinf=F, index=2)
jack.after.boot(city.boot, useJ=F, stinf=F, t=city.boot$t[,2])
Frequency smoothing
smooth.f smooths the frequencies of a nonparametric bootstrap object to give a typical distribution with expected value roughly at theta. In order to find the smoothed frequencies for theta = 1.4 for the city ratio, and to obtain the corresponding value of t, we set
city.freq <- smooth.f(theta=1.4, boot.out=city.boot)
city.w(city, city.freq)
11.4 Tests
11.4.1 Parametric tests
Simple parametric tests can be conducted using parametric simulation. For example, to perform the conditional simulation for the data in fir (Example 4.2):
z <- grav$g
z[grav$series==8] <- -z[grav$series==8]
z.tilt <- exp.tilt(L=z, theta=0, strata=grav$series)
z.tilt
where z.tilt contains the fitted probabilities (which sum to one for each stratum) and the values of lambda and theta. Other arguments can be input to exp.tilt: see its help file.
The significance probability is then obtained by using the weights argument to boot. This argument is a vector containing the probabilities with which to select the rows of data, when bootstrap sampling is to be performed with unequal probabilities. In this case the unequal probabilities are given by the tilted distribution, under which the expected value of the test statistic is zero. The code needed to perform the simulation and get the estimated significance level is:
For more complicated regressions, for example with unequal response variances, more information must be added to the new dataframe.
Wild bootstrap
The wild bootstrap can be implemented using sim="parametric", as follows:
11.6.2 Prediction
Now consider prediction of the log brain weight of new mammals with body weights equal to those for the chimpanzee and baboon. For this we introduce yet another argument to boot, m, which gives the number of e*_m to be simulated with each bootstrap sample (see Algorithm 6.4). In this case we want to predict at m = 2 new mammals, with covariates contained in d.pred. The statistic function supplied to boot must now take at least one more argument, namely the additional indices for constructing the bootstrap versions of the two new mammals. We implement this as follows:
attach(melanoma)
mel.boot <- censboot(melanoma, mel.fun, R=99, strata=ulcer)
mel.boot.mod <- censboot(melanoma, mel.fun, R=99,
  F.surv=mel.surv, G.surv=mel.cens, strata=ulcer,
  cox=mel.cox, sim="model")
mel.boot.con <- censboot(melanoma, mel.fun, R=99,
  F.surv=mel.surv, G.surv=mel.cens, strata=ulcer,
  cox=mel.cox, sim="cond")
The bootstrap results are best displayed graphically. Here is the code for the analogue of the left panels of Figure 7.9:
th <- seq(from=0.25, to=10, by=0.25)
split.screen(c(2,1))
screen(1)
plot(th, mel.boot$t0, type="n", xlab="Tumour thickness (mm)",
  xlim=c(0,10), ylim=c(-2,2), ylab="Linear predictor")
lines(th, mel.boot$t0, lwd=3)
rug(jitter(thickness))
for (i in 1:19) lines(th, mel.boot$t[i,], lwd=0.5)
screen(2)
plot(th, mel.boot$t0, type="n", xlab="Tumour thickness (mm)",
  xlim=c(0,10), ylim=c(-2,2), ylab="Linear predictor")
lines(th, mel.boot$t0, lwd=3)
mel.env <- envelope(mel.boot$t, level=0.95)
lines(th, mel.env$point[1,], lty=1)
lines(th, mel.env$point[2,], lty=1)
mel.env <- envelope(mel.boot.mod$t, level=0.95)
lines(th, mel.env$point[1,], lty=2)
lines(th, mel.env$point[2,], lty=2)
mel.env <- envelope(mel.boot.con$t, level=0.95)
lines(th, mel.env$point[1,], lty=3)
lines(th, mel.env$point[2,], lty=3)
detach("melanoma")
Note how tight the confidence envelope is relative to that for the more highly parametrized model used in the example. Try again with larger values of R, if you have the patience.
ts.mod <- ran.args$model
mean(ts.orig) + rts(arima.sim(model=ts.mod, n=n.sim, innov=res)) }
sun.lb <- tsboot(sun.res, sun.fun, R=99, l=20, sim="fixed",
  ran.gen=sun.black, ran.args=list(ts=sun, model=sun.model),
  n.sim=length(sun))
Compare these results with those above, and try it with sim="geom".
.con$L
.con$bias
.con$var
.con$quantiles
p_j = exp(lambda l_j) / Sum_j exp(lambda l_j),
Function tilt.boot
The description above relies on exponential tilting to obtain the resampling probabilities, and requires knowing where to tilt to. If this is difficult, tilt.boot can be used to avoid this, by performing an initial bootstrap with equal resampling probabilities, then using frequency smoothing to estimate appropriate tilted probabilities. For example,
city.tilt <- tilt.boot(city, city.fun, R=c(500,250,249))
performs 500 ordinary bootstraps, uses the results to estimate probability distributions tilted to the 0.025 and 0.975 points of the simulations, and then performs 250 bootstraps tilted to the 0.025 quantile, and 249 tilted to the 0.975 quantile, before assigning the result to a bootstrap object. More complex uses of tilt.boot are possible; see its help file.
Importance re-weighting
These functions allow for importance re-weighting as well as importance sampling. For example, suppose that we require to re-weight the simulated values so that they appear to have been simulated from a distribution with expected ratio close to 1.4. We then use the q= option to the importance sampling functions as follows:
q <- smooth.f(theta=1.4, boot.out=city.tilt)
city.w(city, q)
imp.moments(city.tilt, q=q)
imp.quantile(city.tilt, q=q)
where the first line calculates the smoothed distribution, the second obtains the corresponding ratio, and the third and fourth obtain the moment and quantile estimates corresponding to simulation from the distribution q.
Sum_j f*_j (x_j - t u_j) = 0,    (11.1)

where the f*_j have a joint multinomial distribution with equal probabilities and denominator n = 10, the number of rows of city, as outlined in Example 9.16. Accordingly we set
This is most useful when K(.) is not of the standard form that follows from a multinomial distribution.
Conditional approximations
Conditional saddlepoint approximation is applied by giving Afn and ufn more columns, and setting the wdist and type arguments to saddle appropriately. For example, suppose that we want to find the distribution of T*, defined as the root of (11.1), but resampling 25 rather than 49 cases of bigcity. Then we set
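something like the following sketch, patterned on the capability example of Section 9.8; bigcity.t0, containing the estimate and its standard error along the lines of capab.t0 above, is an assumption rather than the book's own listing:
Afn <- function(t, data) cbind(data$x - t*data$u, 1)
ufn <- function(t, data) c(0, 25)     # condition on resamples of 25 cases
bigcity.sp <- saddle.distn(A=Afn, u=ufn, wdist="p", type="cond",
  t0=bigcity.t0, data=bigcity)
bigcity.sp$quantiles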
APPENDIX A
Cumulant Calculations
The cumulant-generating function K(t) = log M(t) = Sum_{s>=1} kappa_s t^s/s! of a random variable Y is defined in terms of its moment-generating function

M(t) = E(e^{tY}) = Sum_{s>=0} mu'_s t^s / s!,

where mu'_s = E(Y^s) is the sth moment. A simple example is a N(mu, sigma^2) random variable, for which K(t) = t mu + (1/2) t^2 sigma^2; note the appealing fact that its cumulants of order higher than two are zero. By equating powers of t in the expansions of K(t) and log M(t) we find that kappa_1 = mu'_1 and that

kappa_2 = mu'_2 - (mu'_1)^2,                                  mu'_2 = kappa_2 + (kappa_1)^2,
kappa_3 = mu'_3 - 3 mu'_2 mu'_1 + 2 (mu'_1)^3,                mu'_3 = kappa_3 + 3 kappa_2 kappa_1 + (kappa_1)^3,    (A.1)
kappa_4 = mu'_4 - 4 mu'_3 mu'_1 - 3 (mu'_2)^2 + 12 mu'_2 (mu'_1)^2 - 6 (mu'_1)^4,
                                                              mu'_4 = kappa_4 + 4 kappa_3 kappa_1 + 3 (kappa_2)^2 + 6 kappa_2 (kappa_1)^2 + (kappa_1)^4.

The cumulants kappa_1, kappa_2, kappa_3, and kappa_4 are the mean, variance, skewness and kurtosis of Y.
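As a quick check of (A.1): for an exponential variable with mean mu the moments are mu'_s = s! mu^s, and substituting gives kappa_2 = 2mu^2 - mu^2 = mu^2 and kappa_3 = 6mu^3 - 3(2mu^2)mu + 2mu^3 = 2mu^3, the familiar variance and third cumulant of the exponential distribution.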
For vector Y it is better to drop the power notation used above and to
where summation is implied over repeated indices, so that, for example,

t_i kappa^i = t_1 kappa^1 + ... + t_n kappa^n.

Thus the n-dimensional normal distribution with means kappa^i and covariance matrix kappa^{i,j} has cumulant-generating function t_i kappa^i + (1/2) t_i t_j kappa^{i,j}. We sometimes write kappa^{i,j} = cum(Y^i, Y^j), kappa^{i,j,k} = cum(Y^i, Y^j, Y^k) and so forth for the coefficients of t_i t_j, t_i t_j t_k in K(t). The cumulant arrays kappa^{i,j}, kappa^{i,j,k}, etc. are invariant to index permutation, so for example kappa^{1,2,3} = kappa^{2,3,1}.
A key feature that simplifies calculations with cumulants as opposed to moments is that cumulants involving two or more independent random variables are zero: for independent variables, kappa^{i,j} = kappa^{i,j,k} = ... = 0 unless all the indices are equal.
The above notation extends to generalized cumulants such as

cum(Y^i Y^j Y^k) = E(Y^i Y^j Y^k) = kappa^{ijk},   cum(Y^i, Y^j Y^k) = kappa^{i,jk},   cum(Y^i Y^j, Y^k, Y^l) = kappa^{ij,k,l}.
equal to 1 if i = j, and 0 otherwise.
Table A.1 Complementary set partitions.

1:        1 [1]
12:       12 [1]; 1|2 [1]
123:      123 [1]; 12|3 [3]; 1|2|3 [1]
1234:     1234 [1]; 123|4 [4]; 12|34 [3]; 12|3|4 [6]; 1|2|3|4 [1]
1|2:      12 [1]
12|3:     123 [1]; 13|2 [2]
1|2|3:    123 [1]
123|4:    1234 [1]; 124|3 [3]; 12|34 [3]; 14|2|3 [3]
12|34:    1234 [1]; 123|4 [2]; 134|2 [2]; 13|24 [2]; 13|2|4 [4]
12|3|4:   1234 [1]; 134|2 [2]; 13|24 [2]
1|2|3|4:  1234 [1]
Bibliography
W. Sendler, volume 376 of Lecture Notes in Economics and Mathematical Systems, pp. 23-30. New York: Springer.
Beran, R. J. (1997) Diagnosing bootstrap success. Annals of the Institute of Statistical Mathematics 49, to appear.
Berger, J. O. and Bernardo, J. M. (1992) On the development of reference priors (with Discussion). In Bayesian Statistics 4, eds J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, pp. 35-60. Oxford: Clarendon Press.
Booth, J. G. (1996) Bootstrap methods for generalized linear mixed models with applications to small area estimation. In Statistical Modelling, eds G. U. H. Seeber, B. J. Francis, R. Hatzinger and G. Steckel-Berger, volume 104 of Lecture Notes in Statistics, pp. 43-51. New York: Springer.
Booth, J. G. and Butler, R. W. (1990) Randomization distributions and saddlepoint approximations in generalized linear models. Biometrika 77, 787-796.
81, 331-340.
Booth, J. G., Hall, P. and Wood, A. T. A. (1992) Bootstrap estimation of conditional distributions. Annals of Statistics 20, 1594-1610.
Booth, J. G., Hall, P. and Wood, A. T. A. (1993) Balanced importance resampling for the bootstrap. Annals of Statistics 21, 286-298.
Bose, A. (1988) Edgeworth correction by bootstrap in autoregressions. Annals of Statistics 16, 1709-1722.
Box, G. E. P. and Cox, D. R. (1964) An analysis of transformations (with Discussion). Journal of the Royal Statistical Society series B 26, 211-246.
Bratley, P., Fox, B. L. and Schrage, L. E. (1987) A Guide to Simulation. Second edition. New York: Springer.
Braun, W. J. and Kulperger, R. J. (1995) A Fourier method for bootstrapping time series. Preprint, Department of Mathematics and Statistics, University of Winnipeg.
Braun, W. J. and Kulperger, R. J. (1997) Properties of a Fourier bootstrap method for time series. Communications in Statistics - Theory and Methods 26, to appear.
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984) Classification and Regression Trees. Pacific Grove, California: Wadsworth & Brooks/Cole.
Breslow, N. (1985) Cohort analysis in epidemiology. In A Celebration of Statistics, eds A. C. Atkinson and S. E. Fienberg, pp. 109-143. New York: Springer.
Bretagnolle, J. (1983) Lois limites du bootstrap de certaines fonctionelles. Annales de l'Institut Henri Poincare, Section B 19, 281-296.
Brillinger, D. R. (1981) Time Series: Data Analysis and Theory. Expanded edition. San Francisco: Holden-Day.
Brillinger, D. R. (1988) An elementary trend analysis of Rio Negro levels at Manaus, 1903-1985. Brazilian Journal of Probability and Statistics 2, 63-79.
Brillinger, D. R. (1989) Consistent detection of a monotonic trend superposed on a stationary time series. Biometrika 76, 23-30.
Cox, D. R. and Lewis, P. A. W. (1966) The Statistical Analysis of Series of Events. London: Chapman & Hall.
Cox, D. R. and Oakes, D. (1984) Analysis of Survival Data. London: Chapman & Hall.
Cox, D. R. and Snell, E. J. (1981) Applied Statistics: Principles and Examples. London: Chapman & Hall.
Cressie, N. A. C. (1982) Playing safe with misweighted means. Journal of the American Statistical Association 77, 754-759.
Cressie, N. A. C. (1991) Statistics for Spatial Data. New York: Wiley.
Dahlhaus, R. and Janas, D. (1996) A frequency domain bootstrap for ratio statistics in time series analysis. Annals of Statistics 24, to appear.
Daley, D. J. and Vere-Jones, D. (1988) An Introduction to the Theory of Point Processes. New York: Springer.
Daniels, H. E. (1954) Saddlepoint approximations in statistics. Annals of Mathematical Statistics 25, 631-650.
Daniels, H. E. (1955) Discussion of "Permutation theory in the derivation of robust criteria and the study of departures from assumption", by G. E. P. Box and S. L. Andersen. Journal of the Royal Statistical Society series B 17, 27-28.
Daniels, H. E. (1958) Discussion of "The regression analysis of binary sequences", by D. R. Cox. Journal of the Royal Statistical Society series B 20, 236-238.
Daniels, H. E. and Young, G. A. (1991) Saddlepoint approximation for the studentized mean, with an application to the bootstrap. Biometrika 78, 169-179.
Davison, A. C. (1988) Discussion of the Royal Statistical Society meeting on the bootstrap. Journal of the Royal Statistical Society series B 50, 356-357.
Davison, A. C. and Hall, P. (1992) On the bias and variability of bootstrap and cross-validation estimates of error rate in discrimination problems. Biometrika 79, 279-284.
Davison, A. C. and Hall, P. (1993) On Studentizing and blocking methods for implementing the bootstrap with dependent data. Australian Journal of Statistics 35, 215-224.
Davison, A. C. and Hinkley, D. V. (1988) Saddlepoint approximations in resampling methods. Biometrika 75, 417-431.
Davison, A. C., Hinkley, D. V. and Schechtman, E. (1986) Efficient bootstrap simulation. Biometrika 73, 555-566.
Davison, A. C., Hinkley, D. V. and Worton, B. J. (1992) Bootstrap likelihoods. Biometrika 79, 113-130.
Davison, A. C., Hinkley, D. V. and Worton, B. J. (1995) Accurate and efficient construction of bootstrap
DiCiccio, T. J. and Romano, J. P. (1989) On adjustments based on the signed root of the empirical likelihood ratio statistic. Biometrika 76, 447-456.
DiCiccio, T. J. and Romano, J. P. (1990) Nonparametric confidence limits by resampling methods and least favorable families. International Statistical Review 58, 59-76.
Diggle, P. J. (1983) Statistical Analysis of Spatial Point
Faraway, J. J. (1992) On the cost of data analysis. Journal of Computational and Graphical Statistics 1, 213-229.
Bibliography
Bibliography
561
Hall, P. and Horowitz, J. L. (1993) Corrections and
blocking rules for the block bootstrap with dependent
data. Technical Report SRI 1-93, Centre for
Mathematics and its Applications, Australian National
University.
Hall, P., Horowitz, J. L. and Jing, B.-Y. (1995) O n blocking
rules for the bootstrap with dependent data. Biometrika
82, 561-574.
Hall, P. and Jing, B.-Y. (1996) On sample reuse methods
for dependent data. Journal of the Royal Statistical
Society series B 58, 727-737.
Hall, P. and Keenan, D. M. (1989) Bootstrap methods for
constructing confidence regions for hands.
Communications in Statistics Stochastic Models 5,
555-562.
Hall, P. and La Scala, B. (1990) Methodology and
algorithms of empirical likelihood. International
Statistical Review 58, 109-28.
Hall, P. and Martin, M. A. (1988) O n bootstrap resampling
and iteration. Biometrika 75, 661-671.
Hall, P. and Owen, A. B. (1993) Empirical likelihood
confidence bands in density estimation. Journal of
Computational and Graphical Statistics 2, 273-289.
Hall, P. and Titterington, D. M. (1989) The effect of
simulation order on level accuracy and power of Monte
Carlo tests. Journal of the Royal Statistical Society series
B 51, 459-467.
Hall, P. and Wilson, S. R. (1991) Two guidelines for
bootstrap hypothesis testing. Biometrics 47, 757-762.
Hamilton, M. A. and Collings, B. J. (1991) Determining
the appropriate sample size for nonparametric tests for
location shift. Technometrics 33, 327-337.
Hammersley, J. M. and Handscomb, D. C. (1964) Monte
Carlo Methods. London: Methuen.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and
Stahel, W. A. (1986) Robust Statistics: The Approach
Based on Influence Functions. New York: Wiley.
Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J. and
Ostrowski, E. (eds) (1994) A Handbook of Small Data
Sets. London: Chapman & Hall.
Härdle, W. and Marron, J. S. (1991) Bootstrap
simultaneous error bars for nonparametric regression.
Annals of Statistics 19, 778-796.
Hartigan, J. A. (1969) Using subsample values as typical
values. Journal of the American Statistical Association
64, 1303-1317.
Hartigan, J. A. (1971) Error analysis by replaced samples.
Journal of the Royal Statistical Society series B 33,
98-110.
Hartigan, J. A. (1975) Necessary and sufficient conditions
for asymptotic joint normality of a statistic and its
subsample values. Annals of Statistics 3, 573-580.
Hinkley, D. V. and Wei, B. C. (1984) Improvements of jackknife confidence limit methods. Biometrika 71, 331-339.
Rubin, D. B. and Schenker, N. (1986) Multiple imputation
for interval estimation from simple random samples
with ignorable nonresponse. Journal of the American
Statistical Association 81, 366-374.
Ruppert, D. and Carroll, R. J. (1980) Trimmed least
squares estimation in the linear model. Journal of the
American Statistical Association 75, 828-838.
Samawi, H. M. (1994) Power estimation for two-sample
tests using importance and antithetic resampling. Ph.D.
thesis, Department of Statistics and Actuarial Science,
University of Iowa, Ames.
Sauerbrei, W. and Schumacher, M. (1992) A bootstrap
resampling procedure for model building: application to
the Cox regression model. Statistics in Medicine 11,
2093-2109.
Schenker, N. (1985) Qualms about bootstrap confidence
intervals. Journal of the American Statistical Association
80, 360-361.
Seber, G. A. F. (1977) Linear Regression Analysis. New York: Wiley.
Shao, J. (1988) On resampling methods for variance and
bias estimation in linear models. Annals of Statistics 16,
986-1008.
Shao, J. (1993) Linear model selection by cross-validation.
Journal of the American Statistical Association 88,
486-494.
Shao, J. (1996) Bootstrap model selection. Journal of the
American Statistical Association 91, 655-665.
Shao, J. and Tu, D. (1995) The Jackknife and Bootstrap. New York: Springer.
Shao, J. and Wu, C. F. J. (1989) A general theory for
jackknife variance estimation. Annals of Statistics 17,
1176-1197.
Shorack, G. (1982) Bootstrapping robust regression.
Communications in Statistics Theory and Methods 11,
961-972.
Singh, K. (1981) On the asymptotic accuracy of Efron's bootstrap. Annals of Statistics 9, 1187-1195.
Wang, S. (1992) General saddlepoint approximations in the
bootstrap. Statistics and Probability Letters 13, 61-66.
Wang, S. (1993a) Saddlepoint expansions in finite
population problems. Biometrika 80, 583-590.
Name Index
Abelson, R. P. 403
Akaike, H. 316
Akritas, M. G. 124
Altman, D. G. 375
Amis, G. 253
Andersen, P. K. 124, 128, 353, 375
Andrews, D. F. 360
Appleyard, S. T. 417
Athreya, K. B. 60
Atkinson, A. C. 183, 315, 325
Bai, C. 315
Bai, Z. D. 427
Bailer, A. J. 384
Banks, D. L. 515
Barbe, P. 60, 516
Barnard, G. A. 183
Barndorff-Nielsen, O. E. 183, 246, 486, 514
Becher, H. 379
Beckman, R. J. 486
Benes, F. M. 428
Beran, J. 426
Beran, R. J. 125, 183, 184, 187, 246,
250, 315
Berger, J. O. 515
Bernardo, J. M. 515
Bertail, P. 60, 516
Besag, J. E. 183, 184, 185
Bickel, P. J. 60, 123, 125, 129, 315,
487, 494
Biden, E. N. 316
Bissell, A. F. 253, 383, 497
Bithell, J. F. 428
Bloomfield, P. 426
Boniface, S. J. 418, 428
Boos, D. D. 515
Booth, J. G. 125, 129, 246, 247, 251,
374, 486, 487, 488, 491, 493
Conover, W. J. 486
Cook, R. D. 125, 315, 316, 375
Corcoran, S. A. 515
Cowling, A. 428, 432, 436
Cox, D. R. 124, 128, 183, 246, 287, 323, 324, 428, 486, 514
Cressie, N. A. C. 72, 428
Dahlhaus, R. 427, 431
Daley, D. J. 428
Daly, F. 68, 182, 436, 520
Daniels, H. E. 59, 486, 492
Darby, S. C. 428
Davis, R. A. 426, 427
Davison, A. C. 66, 135, 246, 316, 374,
427, 428, 486, 487, 492, 493, 515,
517, 518
De Angelis, D. 2, 60, 124, 316, 343
Demetrio, C. G. B. 338
Dempster, A. P. 124
Diaconis, P. 60, 486
DiCiccio, T. J. 68, 124, 246, 252, 253,
487, 493, 515, 516
Diggle, P. J. 183, 392, 423, 426, 428
Do, K.-A. 486, 487
Dobson, A. J. 374
Donegani, M. 184, 187
Doss, H. 124, 374
Draper, N. R. 315
Droge, B. 316
Dubowitz, V. 417
Ducharme, G. R. 126
Easton, G. S. 487
Efron, B. ix, 59, 60, 61, 66, 68, 123, 124, 125, 128, 130, 132, 133, 134, 183, 186, 246, 249, 252, 253, 308, 315, 316, 375, 427, 486, 488, 515
Embleton, B. J. J. 236, 506
Eubank, S. 427, 430, 435
Fahey, T. J. 185
Hammersley, J. M. 486
Hampel, F. R. 60
Hand, D. J. 68, 182, 436, 520
Handscomb, D. C. 486
Hardle, W. 316, 375, 427
Harrington, D. P. 124
Hartigan, J. A. 59, 60, 427, 430
Hastie, T. J. 374, 375
Hawkins, D. M. 316
Hayes, K. G. 123
Heggelund, P. 189
Heller, G. 374
Herzberg, A. M. 360
Hesterberg, T. C. 60, 66, 486, 490, 491
Hettmansperger, T. P. 316
Hinkley, D. V. 60, 63, 66, 125, 135,
183, 246, 247, 250, 318, 383, 486,
487, 489, 490, 492, 493, 515, 517,
518
Hirose, H. 347, 381
Hjort, N. L. 124, 374
Holmes, S. 60, 246, 486
Horowitz, J. L. 427, 429
Horvath, L. 374
Hosmer, D. W. 361
Hu, F. 318
Huet, S. 375
Hyde, J. 131
Isham, V. 428
Janas, D. 427, 431
Jennison, C. 183, 184, 246
Jensen, J. L. 486
Jeong, J. 315
Jhun, M. 126
Jing, B.-Y. 427, 429, 487, 515, 517
Jöckel, K.-H. 183
John, P. W. M. 486, 489
Johns, M. V. 486, 490
Jolivet, E. 375
Jones, M. C. x, 128
Journel, A. G. 428
Kabaila, P. 246, 250, 427
Kalbfleisch, J. D. 124
Kaplan, E. L. 124
Karr, A. F. 428
Katz, R. 282
Keenan, D. M. 428
Keiding, N. 124, 128, 353
Kendall, D. G. 124
Kendall, W. S. 124
Kim, J.-H. 124
Klaassen, C. A. J. 123
Kulperger, R. J. 427, 430
Künsch, H. R. 427
Lahiri, S. N. 427
Laird, N. M. 124, 125
Lange, N. 428
La Scala, B. 514
Lawless, J. 514
Lawson, A. B. 428
Lee, S. M. S. 246
Léger, C. 125
Lehmann, E. L. 183
Lemeshow, S. 361
Leroy, A. M. 315
Lewis, P. A. W. 428
Lewis, T. 236, 506
Li, G. 514
Li, H. 427
Li, K.-C. 316
Liu, R. Y. 315, 427
Lloyd, C. J. 515
Lo, S.-H. 125, 374
Loader, C. 375
Loh, W.-Y. 183, 246
Longtin, A. 427, 430, 435
Louis, T. A. 125
Lunn, A. D. 68, 182, 436, 520
Maddala, G. S. 315, 427
Mallows, C. L. 316
Mammen, E. 60, 315, 316
Manly, B. F. J. 183
Marriott, F. H. C. 183
Marron, J. S. 375
Martin, M. A. 125, 183, 246, 251, 487, 493
McCarthy, P. J. 59, 60, 125
McConway, K. J. 68, 182, 436, 520
McCullagh, P. 66, 374, 553
McDonald, J. W. 183, 184
McKay, M. D. 486
McKean, J. W. 316
McLachlan, G. J. 375
Meier, P. 124
Messean, A. 375
Milan, L. 125
Miller, R. G. 59, 84
Monahan, J. F. 515
Monti, A. C. 514
Morgenthaler, S. 486
Moulton, L. H. 374, 376, 377
Muirhead, C. R. 428
Murphy, S. A. 515
Mykland, P. A. 515
Nelder, J. A. 374
Newton, M. A. 178, 183, 486, 515
Niederreiter, H. 486
Nordgaard, A. 427
Noreen, E. W. 184
Oakes, D. 124, 128
Ogbonmwan, S.-M. 486, 515
Olshen, R. A. 315, 316
Oris, J. T. 384
Ostrowski, E. 68, 182, 436, 520
Owen, A. B. 486, 514, 515, 550
Parzen, M. I. 250
Paulsen, O. 189
Peers, H. W. 515
Percival, D. B. 426
Perl, M. L. 123
Peters, S. C. 315, 427
Phillips, M. J. 428, 432, 436
Pitman, E. J. G. 183
Plackett, R. L. 60
Politis, D. N. 60, 125, 427, 429
Possolo, A. 428
Pregibon, D. 374
Prentice, R. L. 124
Presnell, B. 125, 129
Priestley, M. B. 426
Proschan, F. 4, 218
Qin, J. 514
Quenouille, M. H. 59
Raftery, A. E. 515
Rao, J. N. K. 125, 130
Rawlings, J. O. 356
Reid, N. 124, 486
Reynolds, P. S. 435
Richardson, S. 183
Ripley, B. D. x, 183, 282, 315, 316,
361, 374, 375, 417, 428, 486
Ritov, Y. 123
Robinson, J. 486, 487
Stone, R. A. 428
Sutherland, D. H. 316
Swanepoel, J. W. H. 427
Tanner, M. A. 124, 125
Theiler, J. 427, 430, 435
Therneau, T. 486
Tibshirani, R. J. ix, 60, 125, 246, 316, 375, 427, 515
Titterington, D. M. 183
Tong, H. 394, 426
Truong, K. N. 126
Tsai, C.-L. 269, 375
Tsay, R. S. 427
Tu, D. 60, 125, 246, 376
Tukey, J. W. 59, 403, 486
van Wyk, J. W. J. 427
van Zwet, W. R. 60
Venables, W. N. 282, 315, 361, 374,
375
Venkatraman, E. S. 374
Ventura, V. x, 428, 486, 492
Vere-Jones, D. 428
Wahrendorf, J. 379
Walden, A. T. 426
Wall, K. D. 427
Wand, M. P. 128
Wang, S. 124, 318, 486, 487
Wang, Y. 486
Wei, B. C. 63, 375, 487
Wei, L. J. 250
Weisberg, S. 125, 257, 315, 316
Welch, B. L. 515
Welch, W. J. 183, 185
Wellner, J. A. 123
Westfall, P. H. 184
Whittaker, J. 125
Wilson, S. R. 378, 379
Witkowski, J. A. 417
Wong, W. H. 124
Wood, A. T. A. 247, 251, 486, 488, 491, 515, 517
Woods, H. 277
Worton, B. J. 486, 487, 493, 515, 517,
518
Wynn, H. P. 515
Yandell, B. S. 374
Ying, Z. 250
Young, S. S. 184
Zelen, M. 328
Zidek, J. V. 318
Example index
phase scrambling, 410, 430, 435
point process data, 416, 418, 421
poisons data, 322
Poisson process, 416, 418, 422, 425,
431, 435
Poisson regression, 342, 369, 378, 382,
383
prediction, 244, 286, 287, 323, 324, 342
prediction error, 298, 300, 320, 321,
359, 361, 369, 381, 393, 401
product-limit estimator, 86, 128
proportional hazards, 146, 160, 221,
352
quantile, 48, 253, 352
quartzite data, 520
unimodality, 168, 169, 189
unit root test, 391
Subject index
abc.ci, 536
ABC method, see confidence interval
apparent error, 292
assessment set, 292
autocorrelation, 386, 431
autoregressive process, 386, 388, 389, 392, 395, 398, 399, 400, 401, 410, 414, 432, 433, 434
  simulation, 390-391
autoregressive-moving average process, 386, 408
average, 4, 8, 13, 15, 17, 19, 22, 25, 27, 30, 33, 36, 47, 51, 88, 90, 92, 94, 98, 129, 130, 197, 199, 203, 205, 207, 209, 216, 251, 501, 508, 512, 513, 516, 518, 520
  comparison of several, 163
  comparison of two, 159, 162, 166, 171, 172, 186, 454, 457, 519
  finite population, 94, 98, 129
Bayesian bootstrap, 512-514, 515, 518, 520
BCa method, see confidence interval
beaver data example, 434
bias correction, 103-107
bias estimator, 16-18
  adjusted, 104, 106-107, 130, 442, 464, 466, 492
boot
  balanced, 545
  m, 538
  mle, 528, 538, 540, 543
  parametric, 534
  ran.gen, 528, 538, 540, 543
  sim, 529, 534
  statistic, 527, 528
  strata, 531
  stype, 527
  weights, 527, 536, 546
boot.array, 526
boot.ci, 536
bootstrap
  adjustment, 103-107, 125, 130, 175-180, 223-230
  antithetic, 487, 493
  asymptotic accuracy, 39-41, 211-214
  balanced, 438-446, 486, 494-499
    algorithm, 439, 488
    bias estimate, 438-440, 488
    conditions for success, 445
    efficiency, 445, 460, 461, 495
    experimental design, 441, 486, 489
    first-order, 439, 486, 487-488
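The arguments indexed under boot above are easiest to see together in a call. The following is a minimal sketch, assuming the boot library of Chapter 11 (or the R package of the same name); the city data set, the ratio statistic and R = 999 are illustrative choices, not values taken from the index.

    ## Nonparametric case: with stype = "i" the statistic receives a
    ## vector of resampled case indices as its second argument.
    library(boot)
    city.ratio <- function(data, i) sum(data$x[i]) / sum(data$u[i])
    city.boot <- boot(city, statistic = city.ratio, R = 999, stype = "i")

    ## Parametric case: sim = "parametric" hands ran.gen the original
    ## data together with the fitted parameters supplied through mle.
    y <- rexp(20)
    exp.gen <- function(data, mle) rexp(length(data), rate = 1/mle)
    y.boot <- boot(y, statistic = mean, R = 999, sim = "parametric",
                   ran.gen = exp.gen, mle = mean(y))

    ## Confidence limits from the nonparametric replicates.
    boot.ci(city.boot, type = c("perc", "bca"))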
equal marginal distributions example, 78
error rate, 137, 153, 174, 175
estimating function, 50, 63, 105, 250, 318, 329, 470-471, 478, 483, 504, 505, 514, 516
excess error, 292, 296
exchangeability, 143, 145
expansion
Cornish-Fisher, 40, 211, 449
cubic, 475-478
Edgeworth, 39-41, 60, 411,
476-478, 487
linear, 47, 51, 69, 75, 76, 118, 443, 446, 468
notation, 39
quadratic, 50, 66, 76, 443
Taylor series, 45, 46
experimental design
relation to resampling, 58, 439, 486
exponential mean example, 15, 17, 19,
30, 61, 176, 250, 510
exponential quantile plot, 5, 188
exponential tilting, 166-167, 183,
209-210, 452-454, 456-458,
461-463, 492, 495, 504, 517, 535,
546, 547
exp.tilt, 535
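The entries for exponential tilting and exp.tilt are related: exp.tilt computes the exponentially tilted resampling probabilities used for importance resampling. A minimal sketch, again assuming the boot library; the target value theta = 1.4 and the city ratio statistic are illustrative assumptions carried over from the sketch above.

    ## Empirical influence values L for the statistic, then probabilities
    ## tilted so that the linear approximation to t has mean theta.
    library(boot)
    city.ratio <- function(data, i) sum(data$x[i]) / sum(data$u[i])
    L <- empinf(data = city, statistic = city.ratio, stype = "i")
    tlt <- exp.tilt(L, theta = 1.4)
    ## tlt$p holds the tilted probabilities; boot can resample from them
    ## through its weights argument.
    tilt.boot <- boot(city, statistic = city.ratio, R = 999,
                      stype = "i", weights = tlt$p)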
factorial experiment, 320, 322
finite population sampling, 92-100,
125, 128, 129, 130, 474
fir seedlings data, 142
Fisher information, 193, 206, 349, 516
Fourier frequencies, 387
Fourier transform, 387
empirical, 388, 408, 430
fast, 388
inverse, 387
frequency array, 23, 52, 443
frequency smoothing, 110, 456, 462,
463, 464-465, 496, 508
Frets' heads data example, 115, 447
gamma model, 5, 25, 62, 131, 149,
207, 216, 233, 247, 376
generalized additive model, 366-371,
375, 382, 383
integration
number-theoretic methods, 486
interaction example, 322
interpolation of quantiles, 195
Islay data example, 520
isotonic regression example, 371
iterative weighted least squares, 329
likelihood, 137
adjusted, 500, 512, 515
neurophysiological data example, 418,
428
Nile data example, 241
nitrofen data example, 383
nodal involvement data example, 381
nonlinear regression, 353-358
nonlinear time series, 393-396, 401,
410, 426
nonparametric delta method, 46-50,
75
balanced bootstrap, 443-444
cubic approximation, 475-478
linear approximation, 47, 51, 52, 60,
69, 76, 118, 126, 127, 205, 261,
443, 454, 468, 487, 488, 490,
492
control variate, 446
importance resampling, 452
tilted, 490
quadratic approximation, 50, 79,
212, 215, 443, 487, 490
variance approximation, 47, 50, 63, 64, 75, 76, 108, 120, 199, 260, 261, 265, 275, 312, 318, 319, 376, 477, 478, 483
nonparametric maximum likelihood,
165-166, 186, 209, 501
nonparametric regression, 362-373, 375, 382, 383
normal prediction limit, 244
normal quantile plot test, 150
notation, 9-10
nuclear power data example, 286, 298,
304, 323
null distribution, 137
null hypothesis, 136
one-way model example, 208, 276, 319, 320
outliers, 27, 307-308, 363
overdispersion, 327, 332, 338-339,
343-344, 370, 382
test for, 142
paired comparison, see matched-pair
data
parameter transformation, see
transformation of statistic
partial autocorrelation, 386
pivot, 138-139, 268-269, 280, 284, 392, 454, 486
power, 155-156, 180-184
P-value, 137, 138, 141, 148, 158, 161, 175-176
randomization, 183, 185, 186, 492, 498
separate families, 148, 378
sequential, 182
spatial data, 416, 421, 422, 428
studentized, see pivot
time series, 392, 396, 403, 410
simulated data example, 306
simulation error, 34-37, 62
simulation outlier, 73
simulation size, 17-21, 34-37, 69, 155-156, 178-180, 183, 185, 202, 226, 246, 248
size of test, 137
smooth.f, 533
smooth estimates of F, 79-81
spatial association example, 421, 428
spatial clustering example, 416
spatial data, 124, 416, 421-426, 428
spatial epidemiology, 421, 428
species abundance example, 169, 228
spectral density estimation example,
413
spectral resampling, see periodogram
resampling
spectrum, 387, 408
spherical data example, 126, 234, 505
spline smoother, 352, 364, 365, 367, 368, 371, 468
standardized residuals, see residuals,
standardized
stationarity, 385-387, 391, 398, 416
statistical error, 31-34
statistical function, 12-14, 46, 60, 75
Stirling's approximation, 61, 155