
A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation

Author(s): Bradley Efron and Gail Gong

Source: The American Statistician, Vol. 37, No. 1 (Feb. 1983), pp. 36-48
Published by: American Statistical Association
Stable URL: http://www.jstor.org/stable/2685844

A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation

BRADLEY EFRON and GAIL GONG*

This is an invited expository article for The American Statistician. It reviews the nonparametric estimation of statistical error, mainly the bias and standard error of an estimator, or the error rate of a prediction rule. The presentation is written at a relaxed mathematical level, omitting most proofs, regularity conditions, and technical details.

KEY WORDS: Bias estimation; Variance estimation; Nonparametric standard errors; Nonparametric confidence intervals; Error rate prediction.

*Bradley Efron is Professor of Statistics and Biostatistics at Stanford University. Gail Gong is Assistant Professor of Statistics at Carnegie-Mellon University. The authors are grateful to Rob Tibshirani, who suggested the final example in Section 7; to Samprit Chatterjee and Werner Stuetzle, who suggested looking at estimators like "BootAve" in Section 9; and to Dr. Peter Gregory of the Stanford Medical School, who provided the original analysis as well as the data in Section 10. This work was partially supported by the National Science Foundation and the National Institutes of Health.

1. INTRODUCTION

This article is intended to cover lots of ground, but at a relaxed mathematical level that omits most proofs, regularity conditions, and technical details. The ground in question is the nonparametric estimation of statistical error. "Error" here refers mainly to the bias and standard error of an estimator, or to the error rate of a data-based prediction rule.

All of the methods we discuss share some attractive properties for the statistical practitioner: they require very little in the way of modeling, assumptions, or analysis, and can be applied in an automatic way to any situation, no matter how complicated. (We will give an example of a very complicated prediction rule indeed.) An important theme of what follows is the substitution of raw computing power for theoretical analysis.

The references upon which this article is based (Efron 1979a,b, 1981a,b,c, 1982; Efron and Gong 1982) explore the connections between the various nonparametric methods, and also the relationship to familiar parametric techniques. Needless to say, there is no danger of parametric statistics going out of business. A good parametric analysis, when appropriate, can be far more efficient than its nonparametric counterpart. Often, though, parametric assumptions are difficult to justify, in which case it is reassuring to have available the comparatively crude but trustworthy nonparametric answers.

What are the bootstrap, the jackknife, and cross-validation? For a quick answer, before we begin the main exposition, we consider a problem where none of the three methods is necessary: estimating the standard error of a sample average.

The data set consists of a random sample of size n from an unknown probability distribution F on the real line,

    X_1, X_2, ..., X_n ~ F.    (1)

Having observed X_1 = x_1, X_2 = x_2, ..., X_n = x_n, we compute the sample average x̄ = Σ_i x_i / n for use as an estimate of the expectation of F.

An interesting fact, and a crucial one for statistical applications, is that the data set provides more than the estimate x̄. It also gives an estimate for the accuracy of x̄, namely

    σ̂ = [ Σ_{i=1}^n (x_i − x̄)² / (n(n − 1)) ]^{1/2}.    (2)

σ̂ is the estimated standard error of X̄ = x̄, the root mean squared error of estimation.

The trouble with formula (2) is that it does not, in any obvious way, extend to estimators other than X̄, for example the sample median. The jackknife and the bootstrap are two ways of making this extension. Let

    x̄_(i) = (n x̄ − x_i) / (n − 1),    (3)

the sample average of the data set deleting the ith point. Also let x̄_(.) = Σ_i x̄_(i) / n, the average of the deleted averages. (Actually x̄_(.) = x̄, but we need the dot notation below.) The jackknife estimate of standard error is

    σ̂_J = [ ((n − 1)/n) Σ_{i=1}^n (x̄_(i) − x̄_(.))² ]^{1/2}.    (4)

The reader can verify that this is the same as (2). The advantage of (4) is an easy generalizability to any estimator θ̂ = θ̂(X_1, X_2, ..., X_n). The only change is to substitute θ̂_(i) = θ̂(X_1, ..., X_{i−1}, X_{i+1}, ..., X_n) for x̄_(i), and θ̂_(.) = Σ_i θ̂_(i) / n for x̄_(.).
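The claim that (4) reproduces (2) for θ̂ = x̄ is easy to check numerically. The following is a minimal sketch (Python with numpy; the sample values are arbitrary illustrative numbers of ours, not data from the paper):

    import numpy as np

    x = np.array([3.1, 2.4, 5.6, 1.9, 4.4, 3.3, 2.8])   # any small sample
    n = len(x)

    # Formula (2): the textbook standard error of the sample average.
    se_textbook = np.sqrt(np.sum((x - x.mean())**2) / (n * (n - 1)))

    # Formula (4): the jackknife estimate built from the n leave-one-out averages.
    x_loo = np.array([np.delete(x, i).mean() for i in range(n)])      # x-bar_(i)
    se_jack = np.sqrt((n - 1) / n * np.sum((x_loo - x_loo.mean())**2))

    print(se_textbook, se_jack)   # the two values agree to rounding

Replacing the two calls to .mean() by any other statistic turns the second computation into the general jackknife recipe described above.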
The bootstrap generalizes (2) in an apparently different way. Let F̂ be the empirical probability distribution of the data, putting probability mass 1/n on each x_i, and let X*_1, X*_2, ..., X*_n be a random sample from F̂,

    X*_1, X*_2, ..., X*_n ~ F̂.    (5)

In other words each X*_i is drawn independently, with replacement and with equal probability, from the set {x_1, ..., x_n}. Then X̄* = Σ_i X*_i / n has variance

    var_* X̄* = (1/n²) Σ_{i=1}^n (x_i − x̄)²,    (6)

var_* indicating variance under sampling scheme (5). The bootstrap estimate of standard error for an estimator θ̂(X_1, X_2, ..., X_n) is

    σ̂_B = [ var_* θ̂(X*_1, X*_2, ..., X*_n) ]^{1/2}.    (7)

Comparing (7) with (2) we see that [n/(n − 1)]^{1/2} σ̂_B = σ̂ for θ̂ = X̄. We could make σ̂_B exactly equal σ̂, for θ̂ = X̄, by adjusting definition (7) with the factor [n/(n − 1)]^{1/2}, but there is no general advantage in doing so. A simple algorithm described in Section 2 allows the statistician to compute σ̂_B no matter how complicated θ̂ may be. Section 3 shows the close connection between σ̂_B and σ̂_J.

Cross-validation relates to another, more difficult, problem in estimating statistical error. Going back to (1), suppose we try to predict a new observation from F, call it X_0, using the estimator X̄ as a predictor. The expected squared error of prediction E[X_0 − X̄]² equals ((n + 1)/n) μ₂, where μ₂ is the variance of the distribution F. An unbiased estimate of ((n + 1)/n) μ₂ is

    (n + 1) σ̂².    (8)

Cross-validation is a way of obtaining nearly unbiased estimators of prediction error in much more complicated situations. The method consists of (a) deleting the points x_i from the data set one at a time; (b) recalculating the prediction rule on the basis of the remaining n − 1 points; (c) seeing how well the recalculated rule predicts the deleted point; and (d) averaging these predictions over all n deletions of an x_i. In the simple case above, the cross-validated estimate of prediction error is

    (1/n) Σ_{i=1}^n [x_i − x̄_(i)]².    (9)

A little algebra shows that (9) equals (8) times n²/(n² − 1), this last factor being nearly equal to one.

The advantage of the cross-validation algorithm is that it can be applied to arbitrarily complicated prediction rules. The connection with the bootstrap and jackknife is shown in Section 9.

2. THE BOOTSTRAP

This section describes the simple idea of the bootstrap (Efron 1979a). We begin with an example. The 15 points in Figure 1 represent various entering classes at American law schools in 1973. The two coordinates for law school i are x_i = (y_i, z_i),

    y_i = average LSAT score of entering students at school i,
    z_i = average undergraduate GPA score of entering students at school i.

(The LSAT is a national test similar to the Graduate Record Exam, while GPA refers to undergraduate grade point average.)

[Figure 1. The law school data (Efron 1979b), plotted as GPA against LSAT. The data points, beginning with School #1, are (576, 3.39), (635, 3.30), (558, 2.81), (578, 3.03), (666, 3.44), (580, 3.07), (555, 3.00), (661, 3.43), (651, 3.36), (605, 3.13), (653, 3.12), (575, 2.74), (545, 2.76), (572, 2.88), (594, 2.96).]

The observed Pearson correlation coefficient for these n = 15 pairs is ρ̂(x_1, x_2, ..., x_15) = .776. We want to attach a nonparametric estimate of standard error to ρ̂. The bootstrap idea is the following:

1. Suppose that the data points x_1, x_2, ..., x_15 are independent observations from some bivariate distribution F on the plane. Then the true standard error of ρ̂ is a function of F, indicated σ(F),

    σ(F) = [ var_F ρ̂(X_1, X_2, ..., X_n) ]^{1/2}.

(It is also a function of the sample size n, and of the functional form of the statistic ρ̂, but both of these are known to the statistician.)

2. We don't know F, but we can estimate it by the empirical probability distribution F̂,

    F̂: mass 1/n on each observed data point x_i,  i = 1, 2, ..., n.

3. The bootstrap estimate of σ(F) is

    σ̂_B = σ(F̂).    (10)

For the correlation coefficient, and for most statistics, even very simple ones, the function σ(F) is impossible to express in closed form. That is why the bootstrap is not in common use. However, in these days of fast and cheap computation, σ̂_B can easily be approximated by Monte Carlo methods:

(i) Construct F̂, the empirical distribution function, as just described.

(ii) Draw a bootstrap sample X*_1, X*_2, ..., X*_n by independent random sampling from F̂. In other words, make n random draws with replacement from {x_1, x_2, ..., x_n}. In the law school example a typical bootstrap sample might consist of 2 copies of point 1, 0 copies of point 2, 1 copy of point 3, and so on, the total number of copies adding up to n = 15. Compute the bootstrap replication ρ̂* = ρ̂(X*_1, X*_2, ..., X*_n), that is, the value of the statistic, in this case the correlation coefficient, evaluated for the bootstrap sample.

(iii) Do step (ii) some large number "B" of times, obtaining independent bootstrap replications ρ̂*1, ρ̂*2, ..., ρ̂*B, and approximate σ̂_B by

    σ̂_B ≈ [ Σ_{b=1}^B (ρ̂*b − ρ̂*·)² / (B − 1) ]^{1/2},    (11)

where ρ̂*· = Σ_b ρ̂*b / B.
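Steps (i)-(iii) are short enough to write out in full. Here is a minimal sketch (Python with numpy; the choice of B and the random seed are arbitrary choices of ours) applied to the 15 law school pairs listed in the Figure 1 caption:

    import numpy as np

    # (LSAT, GPA) pairs for the 15 law schools of Figure 1.
    data = np.array([(576, 3.39), (635, 3.30), (558, 2.81), (578, 3.03), (666, 3.44),
                     (580, 3.07), (555, 3.00), (661, 3.43), (651, 3.36), (605, 3.13),
                     (653, 3.12), (575, 2.74), (545, 2.76), (572, 2.88), (594, 2.96)])
    n = len(data)

    def corr(sample):
        # Pearson correlation of an n x 2 array.
        return np.corrcoef(sample[:, 0], sample[:, 1])[0, 1]

    rng = np.random.default_rng(0)
    B = 1000
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # step (ii): n draws with replacement
        reps[b] = corr(data[idx])          # bootstrap replication rho*_b

    # Step (iii), formula (11): sample standard deviation of the replications.
    se_boot = np.sqrt(np.sum((reps - reps.mean())**2) / (B - 1))
    print(corr(data), se_boot)             # rho-hat = .776; se_boot should land near the
                                           # .127 value quoted for these data

The function corr can be swapped for any other statistic without changing the resampling loop, which is the point made in the text about the generality of the procedure.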
As B → ∞, (11) approaches the original definition (10). The choice of B is further discussed below, but meanwhile we won't distinguish between (10) and (11), calling both estimates σ̂_B.

Figure 2 shows B = 1000 bootstrap replications ρ̂*1, ..., ρ̂*1000 for the law school data. The abscissa is plotted in terms of ρ̂* − ρ̂ = ρ̂* − .776. Formula (11) gives σ̂_B = .127. This can be compared with the normal theory estimate of standard error for ρ̂ (Johnson and Kotz 1970, p. 229),

    σ̂_NORM = (1 − ρ̂²)/(n − 3)^{1/2} = .115.

[Figure 2. Histogram of B = 1000 bootstrap replications ρ̂* for the law school data, with the 16%, 50%, and 84% histogram percentiles marked. The normal theory density curve has a similar shape, but falls off more quickly at the upper tail.]

One thing is obvious about the bootstrap procedure: it can be applied just as well to any statistic, simple or complicated, as to the correlation coefficient. In Table 1 the statistic is the 25 percent trimmed mean for a sample of size n = 15. The true distribution F (now defined on the line rather than on the plane) is the standard normal N(0, 1) for the left side of the table, or one-sided negative exponential for the right side. The true standard errors σ(F) are .286 and .232, respectively. In both cases σ̂_B, calculated with B = 200 bootstrap replications, is nearly unbiased for σ(F).

Table 1. A Sampling Experiment Comparing the Bootstrap and Jackknife Estimates of Standard Error for the 25% Trimmed Mean, Sample Size n = 15

                               F Standard Normal            F Negative Exponential
                               Ave     Sd    Coeff Var      Ave     Sd    Coeff Var
  Bootstrap σ̂_B (B = 200)     .287   .071      .25          .242   .078      .32
  Jackknife σ̂_J               .280   .084      .30          .224   .085      .38
  True [Minimum C.V.]          .286          [.19]           .232          [.27]

The jackknife estimate of standard error σ̂_J, described in Section 3, is also nearly unbiased in both cases, but has higher variability than σ̂_B, as shown by its higher coefficient of variation. The minimum possible coefficient of variation (C.V.), for a scale-invariant estimate of σ(F), assuming full knowledge of the parametric model, is shown in brackets. In the normal case, for example, .19 is the C.V. of [Σ_i (x_i − x̄)²/14]^{1/2}. The bootstrap estimate performs well by this standard, considering its totally nonparametric character and the small sample size.

Table 2 returns to the case of ρ̂, the correlation coefficient. Instead of real data we have a sampling experiment in which F is bivariate normal, true correlation ρ = .5, and the sample size is n = 14. The left side of Table 2 refers to ρ̂, while the right side refers to the statistic φ̂ = tanh⁻¹ ρ̂ = .5 log[(1 + ρ̂)/(1 − ρ̂)]. For each estimator σ̂, the root mean squared error of estimation [E(σ̂ − σ)²]^{1/2} is given in the column headed √MSE.

The bootstrap was run with B = 128 and B = 512, the latter value yielding only slightly better estimates σ̂_B. Further increasing B would be pointless. It can be shown that B = ∞ would give √MSE = .063 in the ρ̂ case, only .001 less than using B = 512. As a point of comparison, the normal theory estimate for the standard error of ρ̂, σ̂_NORM = (1 − ρ̂²)/(n − 3)^{1/2}, has √MSE = .056.

Why not generate the bootstrap observations from an estimate of F which is smoother than F̂? This is done in lines 3, 4, and 5 of Table 2. Let Σ̂ = Σ_i (x_i − x̄)(x_i − x̄)'/n be the sample covariance matrix of the observed data. The normal smoothed bootstrap draws the bootstrap sample X*_1, X*_2, ..., X*_n from F̂ ⊕ N₂(0, .25 Σ̂), ⊕ indicating convolution. This amounts to estimating F by an equal mixture of the n distributions N₂(x_i, .25 Σ̂), that is, by a normal window estimate. Smoothing makes little difference on the left side of the table, but is spectacularly effective in the φ̂ case. The latter result is suspect, since the true sampling distribution is bivariate normal, and the function φ = tanh⁻¹ ρ is specifically chosen to have nearly constant standard error in the bivariate-normal family. The uniform smoothed bootstrap samples X*_1, ..., X*_n from F̂ ⊕ U(0, .25 Σ̂), where U(0, .25 Σ̂) is the uniform distribution on a rhombus selected so U has mean vector 0 and covariance matrix .25 Σ̂. It yields moderate reductions in √MSE for both sides of the table.

The standard normal-theory estimates of line 8, Table 2, are themselves bootstrap estimates, carried out in a parametric framework. The bootstrap sample X*_1, ..., X*_n is drawn from the parametric maximum likelihood distribution

    F̂_NORM: N₂(x̄, Σ̂),

rather than the nonparametric maximum likelihood distribution F̂, and with only this change the bootstrap algorithm proceeds as previously described. In practice the bootstrap process is not actually carried out. If it were, and if B → ∞, then a high-order Taylor series analysis shows that σ̂_B would equal approximately (1 − ρ̂²)/(n − 3)^{1/2}, the formula actually used to compute line 8 for the ρ̂ side of Table 2. Notice that the normal smoothed bootstrap can be thought of as a compromise between using F̂ and F̂_NORM to begin the bootstrap process.
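The normal smoothed bootstrap of Table 2 changes only step (ii) of the algorithm: each resampled point is perturbed by an independent N₂(0, .25 Σ̂) deviate. A minimal sketch of that one change (Python with numpy; the function name, the default B, and the seed are our own illustrative choices, and `stat` plays the role of corr from the earlier sketch):

    import numpy as np

    def smoothed_bootstrap_se(data, stat, B=1000, scale=0.25, seed=0):
        # Standard error estimate drawing bootstrap samples from
        # F-hat convolved with N(0, scale * Sigma-hat).
        rng = np.random.default_rng(seed)
        n = len(data)
        cov = np.cov(data, rowvar=False, bias=True) * scale    # .25 * sample covariance
        reps = np.empty(B)
        for b in range(B):
            resample = data[rng.integers(0, n, size=n)]         # ordinary resample from F-hat
            noise = rng.multivariate_normal(np.zeros(data.shape[1]), cov, size=n)
            reps[b] = stat(resample + noise)                     # perturbed ("smoothed") sample
        return reps.std(ddof=1)

Called as smoothed_bootstrap_se(data, corr), this is the recipe of line 3 of Table 2 under the stated assumptions; dropping the noise term recovers the ordinary bootstrap.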


Table 2. Estimates of Standard Error for the Correlation Coefficient ρ̂ and for φ̂ = tanh⁻¹ ρ̂; Sample Size n = 14, Distribution F Bivariate Normal With True Correlation ρ = .5. From a Larger Table in Efron (1981b)

Summary Statistics for 200 Trials

                                           Standard Error Estimates for ρ̂     Standard Error Estimates for φ̂
                                           Ave   Std Dev   CV    √MSE          Ave   Std Dev   CV    √MSE
  1. Bootstrap B = 128                     .206   .066     .32   .067          .301   .065     .22   .065
  2. Bootstrap B = 512                     .206   .063     .31   .064          .301   .062     .21   .062
  3. Normal Smoothed Bootstrap B = 128     .200   .060     .30   .063          .296   .041     .14   .041
  4. Uniform Smoothed Bootstrap B = 128    .205   .061     .30   .062          .298   .058     .19   .058
  5. Uniform Smoothed Bootstrap B = 512    .205   .059     .29   .060          .296   .052     .18   .052
  6. Jackknife                             .223   .085     .38   .085          .314   .090     .29   .091
  7. Delta Method (Infinitesimal
     Jackknife)                            .175   .058     .33   .072          .244   .052     .21   .076
  8. Normal Theory                         .217   .056     .26   .056          .302   0        0     .003
  True Standard Error                      .218                                .299

3. THE JACKKNIFE

The jackknife estimate of standard error was introduced by Tukey in 1958 (see Miller 1974). Let θ̂_(i) = θ̂(x_1, x_2, ..., x_{i−1}, x_{i+1}, ..., x_n) be the value of the statistic when x_i is deleted from the data set, and let θ̂_(.) = (1/n) Σ_{i=1}^n θ̂_(i). The jackknife formula is

    σ̂_J = [ ((n − 1)/n) Σ_{i=1}^n (θ̂_(i) − θ̂_(.))² ]^{1/2}.

Like the bootstrap, the jackknife can be applied to any statistic that is a function of n independent and identically distributed variables. It performs less well than the bootstrap in Tables 1 and 2, and in most cases investigated by the author (see Efron 1982), but requires less computation. In fact the two methods are closely related, which we shall now show.

Suppose the statistic of interest, which we will now call θ̂(x_1, x_2, ..., x_n), is of functional form: θ̂ = θ(F̂), where θ(F) is a functional assigning a real number to any distribution F on the sample space. Both examples in Section 2 are of this form. Let P = (P_1, P_2, ..., P_n)' be a probability vector having nonnegative weights summing to one, and define the reweighted empirical distribution F̂(P): mass P_i on x_i, i = 1, 2, ..., n. Corresponding to P is a resampled value of the statistic of interest, say θ̂(P) = θ(F̂(P)). The shorthand notation θ̂(P) assumes that the data points x_1, x_2, ..., x_n are fixed at their observed values.

Another way to describe the bootstrap estimate σ̂_B is as follows. Let P* indicate a vector drawn from the rescaled multinomial distribution

    P* ~ Mult_n(n, P⁰)/n,    P⁰ = (1/n)(1, 1, ..., 1)',    (12)

meaning the observed proportions from n random draws on n categories, with equal probability 1/n for each category. Then

    σ̂_B = [ var_* θ̂(P*) ]^{1/2},    (13)

where var_* indicates variance under distribution (12). (This is true because we can take P*_i = #{X*_j = x_i}/n in step 2 of the bootstrap algorithm.)

Figure 3 illustrates the situation for the case n = 3. There are 10 possible bootstrap points. For example, the point P* = (2/3, 1/3, 0)' is the second dot from the left on the lower side of the triangle, and occurs with bootstrap probability 1/9 under (12). It indicates a bootstrap sample X*_1, X*_2, X*_3 consisting of two x_1's and one x_2. The center point P⁰ = (1/3, 1/3, 1/3)' has bootstrap probability 2/9.

The jackknife resamples the statistic at the n points

    P_(i) = (1/(n − 1)) (1, 1, ..., 1, 0, 1, ..., 1)'    (0 in ith place),

i = 1, 2, ..., n. These are indicated by the open circles in Figure 3. In general there are n jackknife points, compared with (2n − 1 choose n) bootstrap points.

[Figure 3. The bootstrap and jackknife sampling points in the case n = 3. The bootstrap points are shown with their probabilities: 1/27 at the corners of the triangle, 1/9 at the edge points, and 2/9 at the center point P⁰. The jackknife points P_(i) are shown as open circles.]
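Because the jackknife recipe at the start of this section is the same for any functional statistic, it can be stated very compactly as code. A minimal sketch (Python with numpy; the function and variable names are ours, and `data` holds one observation per row):

    import numpy as np

    def jackknife_se(data, stat):
        # Jackknife standard error of stat(data).
        n = len(data)
        theta_i = np.array([stat(np.delete(data, i, axis=0))    # theta-hat_(i)
                            for i in range(n)])
        theta_dot = theta_i.mean()                              # theta-hat_(.)
        return np.sqrt((n - 1) / n * np.sum((theta_i - theta_dot)**2))

    # e.g. jackknife_se(data, corr) applies the leave-one-out recipe to the
    # law school correlation of Section 2.

Note that this requires only n recomputations of the statistic, versus the B recomputations of the Monte Carlo bootstrap, which is the computational saving mentioned above.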


The trouble with bootstrap formula (13) is that θ̂(P) is usually a complicated function of P (think of the examples in Sec. 2), and so var_* θ̂(P*) cannot be evaluated except by Monte Carlo methods. The jackknife trick approximates θ̂(P) by a linear function of P, say θ̂_L(P), and then uses the known covariance structure of (12) to evaluate var_* θ̂_L(P*). The approximator θ̂_L(P) is chosen to match θ̂(P) at the n points P = P_(i). It is not hard to see that

    θ̂_L(P) = θ̂_(.) + (P − P⁰)' U,    (14)

where θ̂_(.) = (1/n) Σ_i θ̂_(i) = (1/n) Σ_i θ̂(P_(i)), and U is a column vector with coordinates U_i = (n − 1)(θ̂_(.) − θ̂_(i)).

Theorem. The jackknife estimate of standard error equals

    σ̂_J = [ (n/(n − 1)) var_* θ̂_L(P*) ]^{1/2},

which is [n/(n − 1)]^{1/2} times the bootstrap estimate of standard error for θ̂_L (Efron 1982).

In other words the jackknife is, almost,¹ a bootstrap itself. The advantage of working with θ̂_L rather than θ̂ is that there is no need for Monte Carlo: var_* θ̂_L(P*) = var_* (P* − P⁰)'U = Σ_i U_i²/n², using the covariance matrix for (12) and the fact that Σ_i U_i = 0. The disadvantage is (usually) increased error of estimation, as seen in Tables 1 and 2.

    ¹ The factor [n/(n − 1)]^{1/2} makes σ̂_J unbiased for σ if θ̂ is a linear statistic, e.g., θ̂ = X̄. We could multiply σ̂_B by this same factor, and achieve the same unbiasedness, but there doesn't seem to be any general advantage to doing so.

The fact that σ̂_J is almost σ̂_B for a linear approximation of θ̂ does not mean that σ̂_J is a reasonable approximation for the actual σ̂_B. That depends on how well θ̂_L approximates θ̂. In the case where θ̂ is the sample median, for instance, the approximation is very poor.

4. THE DELTA METHOD, INFLUENCE FUNCTIONS, AND THE INFINITESIMAL JACKKNIFE

There is a more obvious linear approximation to θ̂(P) than θ̂_L(P), (14). Why not use the first-order Taylor series expansion for θ̂(P) about the point P = P⁰? This is the idea of Jaeckel's infinitesimal jackknife (1972). The Taylor series approximation turns out to be

    θ̂_T(P) = θ̂(P⁰) + (P − P⁰)' U⁰,

where

    U⁰_i = lim_{ε→0} [ θ̂((1 − ε)P⁰ + ε δ_i) − θ̂(P⁰) ] / ε,

δ_i being the ith coordinate vector. This suggests the infinitesimal jackknife estimate of standard error

    σ̂_IJ = [ var_* θ̂_T(P*) ]^{1/2} = [ Σ_i (U⁰_i)² / n² ]^{1/2},    (15)

with var_* still indicating variance under (12). The ordinary jackknife can be thought of as taking ε = −1/(n − 1) in the definition of U⁰_i, while the infinitesimal jackknife lets ε → 0, thereby earning the name.

The U⁰_i are values of what Mallows (1974) calls the empirical influence function. Their definition is a nonparametric estimate of the true influence function

    IF(x) = lim_{ε→0} [ θ((1 − ε)F + ε δ_x) − θ(F) ] / ε,

δ_x being the degenerate distribution putting mass 1 on x. The right side of (15) is then the obvious estimate of the influence function approximation to the standard error of θ̂ (Hampel 1974), σ(F) ≈ [ ∫ IF(x)² dF(x) / n ]^{1/2}. The empirical influence function method and the infinitesimal jackknife give identical estimates of standard error.

How have statisticians gotten along for so many years without methods like the jackknife or the bootstrap? The answer is the delta method, which is still the most commonly used device for approximating standard errors. The method applies to statistics of the form t(Q̄_1, Q̄_2, ..., Q̄_A), where t(·, ..., ·) is a known function and each Q̄_a is an observed average, Q̄_a = Σ_{i=1}^n Q_a(x_i)/n. For example, the correlation ρ̂ is a function of A = 5 such averages: the average of the first coordinate values, the second coordinates, the first coordinates squared, the second coordinates squared, and the cross-products.

In its nonparametric formulation, the delta method works by (a) expanding t in a linear Taylor series about the expectations of the Q̄_a; (b) evaluating the standard error of the Taylor series using the usual expressions for variances and covariances of averages; and (c) substituting γ(F̂) for any unknown quantity γ(F) occurring in (b). For example, the nonparametric delta method estimates the standard error of ρ̂ by

    [ (ρ̂²/4n) ( μ̂₄₀/μ̂₂₀² + μ̂₀₄/μ̂₀₂² + 2μ̂₂₂/(μ̂₂₀ μ̂₀₂) + 4μ̂₂₂/μ̂₁₁² − 4μ̂₃₁/(μ̂₁₁ μ̂₂₀) − 4μ̂₁₃/(μ̂₁₁ μ̂₀₂) ) ]^{1/2},

where, in terms of x_i = (y_i, z_i), μ̂_{gh} = Σ_i (y_i − ȳ)^g (z_i − z̄)^h / n (Cramer 1946, p. 359).

Theorem. For statistics of the form θ̂ = t(Q̄_1, ..., Q̄_A), the nonparametric delta method and the infinitesimal jackknife give the same estimate of standard error (Efron 1981b).

The infinitesimal jackknife, the delta method, and the empirical influence function approach are three names for the same method. Notice that the results reported in line 7 of Table 2 show a severe downward bias. Efron and Stein (1981) show that the ordinary jackknife is always biased upwards, in a sense made precise in that paper. In the authors' opinion the ordinary jackknife is the method of choice if one does not want to do the bootstrap computations.
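Definition (15) can also be evaluated directly by perturbing the weight vector a small step toward each coordinate, which makes the identity of the infinitesimal jackknife, the delta method, and the empirical influence function easy to check numerically. A minimal sketch (Python with numpy; the function names and the small step eps are ours, and stat_w is assumed to evaluate the statistic under a reweighted empirical distribution):

    import numpy as np

    def infinitesimal_jackknife_se(data, stat_w, eps=1e-5):
        # Empirical-influence-function (infinitesimal jackknife) standard error, eq. (15).
        # stat_w(data, w) evaluates the statistic with mass w[i] on row i (w sums to one).
        n = len(data)
        p0 = np.full(n, 1.0 / n)
        base = stat_w(data, p0)
        U = np.empty(n)
        for i in range(n):
            delta = np.zeros(n); delta[i] = 1.0
            U[i] = (stat_w(data, (1 - eps) * p0 + eps * delta) - base) / eps   # U0_i
        return np.sqrt(np.sum(U**2)) / n

    def weighted_corr(data, w):
        # Correlation of an n x 2 array under weights w, usable as stat_w.
        m = w @ data
        cov = (w[:, None] * (data - m)).T @ (data - m)
        return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

With eps replaced by −1/(n − 1) the same difference quotient reproduces the ordinary jackknife values U_i, matching the remark above about the two choices of ε.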


5. NONPARAMETRIC CONFIDENCE INTERVALS

In applied work, the usual purpose of estimating a standard error is to set confidence intervals for the unknown parameter. These are typically of the crude form θ̂ ± z_α σ̂, with z_α being the 100(1 − α) percentile point of a standard normal distribution. We can, and do, use the bootstrap and jackknife estimates σ̂_B, σ̂_J in this way. However, in small-sample parametric situations, where we can do exact calculations, confidence intervals are often highly asymmetric about the best point estimate θ̂. This asymmetry, which is O(1/√n) in magnitude, is substantially more important than the Student's t correction (replacing θ̂ ± z_α σ̂ by θ̂ ± t_α σ̂, with t_α the 100(1 − α) percentile point of the appropriate t distribution), which is only O(1/n). This section discusses some nonparametric methods of assigning confidence intervals, which attempt to capture the correct asymmetry. It is abbreviated from a longer discussion in Efron (1981c), and also Chapter 10 of Efron (1982). All of this work is highly speculative, though encouraging.

We return to the law school example of Section 2. Suppose for the moment that we believe the data come from a bivariate normal distribution. The standard 68 percent central confidence interval (i.e., α = .16, 1 − 2α = .68) for ρ in this case is [.62, .87] = [ρ̂ − .16, ρ̂ + .09], obtained by inverting the approximation φ̂ ~ N(φ + ρ/(2(n − 1)), 1/(n − 3)). Compared to the crude interval ρ̂ ± z_.16 σ̂_NORM = [ρ̂ − .12, ρ̂ + .12], this demonstrates the magnitude of the asymmetry effect described previously.

The asymmetry of the confidence interval [ρ̂ − .16, ρ̂ + .09] relates to the asymmetry of the normal-theory density curve for ρ̂, as shown in Figure 2. The bootstrap histogram shows this same asymmetry. The striking similarity between the histogram and the density curve suggests that we can use the bootstrap results more ambitiously than simply to compute σ̂_B.

Two ways of forming nonparametric confidence intervals from the bootstrap histogram are discussed in Efron (1981c). The first, called the percentile method, uses the 100α and 100(1 − α) percentiles of the bootstrap histogram, say

    θ ∈ [ θ̂(α), θ̂(1 − α) ],    (16)

as a putative 1 − 2α central confidence interval for the unknown parameter θ. Letting

    Ĉ(t) = #{ρ̂*b < t} / B,

then θ̂(α) = Ĉ⁻¹(α), θ̂(1 − α) = Ĉ⁻¹(1 − α). In the law school example, with B = 1000 and α = .16, the 68 percent interval is ρ ∈ [.65, .91] = [ρ̂ − .12, ρ̂ + .13], almost exactly the same as the crude normal-theory interval ρ̂ ± σ̂_NORM.

Notice that the median of the bootstrap histogram is substantially higher than ρ̂ in Figure 2. In fact Ĉ(ρ̂) = .433, only 433 out of 1000 bootstrap replications having ρ̂* < ρ̂. The bias-corrected percentile method makes an adjustment for this type of bias. Let Φ(z) indicate the CDF of the standard normal distribution, so Φ(z_α) = 1 − α, and define

    z₀ = Φ⁻¹( Ĉ(θ̂) ).

The bias-corrected putative 1 − 2α central confidence interval is defined to be

    θ ∈ [ Ĉ⁻¹{Φ(2z₀ − z_α)}, Ĉ⁻¹{Φ(2z₀ + z_α)} ].    (17)

If Ĉ(θ̂) = .50, the median unbiased case, then z₀ = 0 and (17) reduces to the uncorrected percentile interval (16). Otherwise the results can be quite different. In the law school example z₀ = Φ⁻¹(.433) = −.17, and for α = .16, (17) gives ρ ∈ [Ĉ⁻¹{Φ(−1.34)}, Ĉ⁻¹{Φ(.66)}] = [ρ̂ − .17, ρ̂ + .10]. This agrees nicely with the normal-theory interval [ρ̂ − .16, ρ̂ + .09].

Table 3 shows the results of a small sampling experiment, only 10 trials, in which the true distribution F was bivariate normal, ρ = .5. The bias-corrected percentile method shows impressive agreement with the normal-theory intervals. Even better are the smoothed intervals, last column. Here the bootstrap replications were obtained by sampling from F̂ ⊕ N₂(0, .25 Σ̂), as in line 3 of Table 2, and then applying (17) to the resulting histogram.

There are some theoretical arguments supporting (16) and (17). If there exists a normalizing transformation, in the same sense as φ = tanh⁻¹ ρ is normalizing for the correlation coefficient under bivariate-normal sampling, then the bias-corrected percentile method automatically produces the appropriate confidence intervals. This is interesting since we do not have to know the form of the normalizing transformation to apply (17). Bayesian and frequentist justifications are given also in Efron (1981c). None of these arguments is overwhelming, and in fact (17) and (16) sometimes perform poorly. Some other methods are suggested in Efron (1981c), but the appropriate theory is still far from clear.

Table 3. Central 68% Confidence Intervals for ρ, 10 Trials of X_1, X_2, ..., X_15 Bivariate Normal With True ρ = .5. Each Interval Has ρ̂ Subtracted From Both Endpoints

  Trial   ρ̂     Normal         Percentile     Bias-Corrected    Smoothed and Bias-Corrected
                 Theory         Method         Percentile Method Percentile Method
   1     .16    (-.29, .26)    (-.29, .24)    (-.28, .25)       (-.28, .24)
   2     .75    (-.17, .09)    (-.05, .08)    (-.13, .04)       (-.12, .08)
   3     .55    (-.25, .16)    (-.24, .16)    (-.34, .12)       (-.27, .15)
   4     .53    (-.26, .17)    (-.16, .16)    (-.19, .13)       (-.21, .16)
   5     .73    (-.18, .10)    (-.12, .14)    (-.16, .10)       (-.20, .10)
   6     .50    (-.26, .18)    (-.18, .18)    (-.22, .15)       (-.26, .14)
   7     .70    (-.20, .11)    (-.17, .12)    (-.21, .10)       (-.18, .11)
   8     .30    (-.29, .23)    (-.29, .25)    (-.33, .24)       (-.29, .25)
   9     .33    (-.29, .22)    (-.36, .24)    (-.30, .27)       (-.30, .26)
  10     .22    (-.29, .24)    (-.50, .34)    (-.48, .36)       (-.38, .34)
  AVE    .48    (-.25, .18)    (-.21, .19)    (-.26, .18)       (-.25, .18)
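Both the percentile interval (16) and its bias-corrected version (17) are read directly off the vector of bootstrap replications. A minimal sketch (Python with numpy and the standard library; `reps` stands for a vector of replications such as those of Figure 2, and the function names and default alpha are ours):

    import numpy as np
    from statistics import NormalDist

    def percentile_interval(reps, alpha=0.16):
        # Uncorrected percentile interval, eq. (16).
        return np.quantile(reps, [alpha, 1 - alpha])

    def bc_percentile_interval(reps, theta_hat, alpha=0.16):
        # Bias-corrected percentile interval, eq. (17).
        nd = NormalDist()
        z0 = nd.inv_cdf(np.mean(reps < theta_hat))     # z0 = Phi^{-1}( C-hat(theta-hat) )
        z_alpha = nd.inv_cdf(1 - alpha)
        lo = nd.cdf(2 * z0 - z_alpha)                  # Phi(2 z0 - z_alpha)
        hi = nd.cdf(2 * z0 + z_alpha)
        return np.quantile(reps, [lo, hi])

When exactly half the replications fall below θ̂, z0 = 0 and the two functions return the same interval, mirroring the remark about the median-unbiased case.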
6. BIAS ESTIMATION

Quenouille (1949) originally introduced the jackknife as a nonparametric device for estimating bias. Let us denote the bias of a functional statistic θ̂ = θ(F̂) by

    β = E_F{ θ(F̂) − θ(F) }.

In the notation of Section 3, Quenouille's estimate is

    β̂_J = (n − 1)( θ̂_(.) − θ̂ ).    (18)

Subtracting β̂_J from θ̂, to correct the bias, leads to the jackknife estimate of θ, θ̃_J = n θ̂ − (n − 1) θ̂_(.); see Miller (1974), and also Schucany, Gray, and Owen (1971).

There are many ways to justify (18). Here we follow the same line of argument as in the justification of σ̂_J. The bootstrap estimate of β, which has an obvious motivation, is introduced, and then (18) is related to the bootstrap estimate by a Taylor series argument.

The bias can be thought of as a function of the unknown probability distribution F, β = β(F). The bootstrap estimate of bias is simply

    β̂_B = β(F̂) = E_*{ θ(F̂*) − θ(F̂) }.    (19)

Here E_* indicates expectation with respect to bootstrap sampling, and F̂* is the empirical distribution of the bootstrap sample.

In practice β̂_B must be approximated by Monte Carlo methods. The only change in the algorithm described in Section 2 is at step (iii), when instead of (or in addition to) σ̂_B we calculate

    β̂_B ≈ (1/B) Σ_{b=1}^B [ θ(F̂*b) − θ(F̂) ].

In the sampling experiment of Table 2 the true bias of ρ̂ for estimating ρ is β = −.014. The bootstrap estimate β̂_B, taking B = 128, has expectation −.014 and standard deviation .031 in this case, while β̂_J has expectation −.017, standard deviation .040. Bias is a negligible source of statistical error in this situation, compared with variability. In applications this is usually made clear by comparison of β̂_B with σ̂_B.

The estimates (18) and (19) are closely related to each other. The argument is the same as in Section 3, except that we approximate θ̂(P) with a quadratic rather than linear function of P, say θ̂_Q(P) = a + (P − P⁰)'b + ½(P − P⁰)'c(P − P⁰). Let θ̂_Q(P) be any such quadratic satisfying θ̂_Q(P⁰) = θ̂(P⁰) and θ̂_Q(P_(i)) = θ̂(P_(i)), i = 1, 2, ..., n.

Theorem. The jackknife estimate of bias equals

    β̂_J = (n/(n − 1)) E_*{ θ̂_Q(P*) − θ̂ },

which is n/(n − 1) times the bootstrap estimate of bias for θ̂_Q (Efron 1982).

Once again, the jackknife is, almost, a bootstrap estimate itself, except applied to a convenient approximation of θ̂(P).

More general problems. There is nothing special about bias and standard error as far as the bootstrap is concerned. The bootstrap procedure can be applied to almost any estimation problem. Suppose that R(X_1, X_2, ..., X_n; F) is a random variable, and we are interested in estimating some aspect of R's distribution. (So far we have taken R = θ(F̂) − θ(F), and have been interested in the expectation β, and the standard deviation σ, of R.) The bootstrap algorithm proceeds as described in Section 2, with these two changes: at step (ii), we calculate the bootstrap replication R* = R(X*_1, X*_2, ..., X*_n; F̂), and at step (iii) we calculate the distributional property of interest from the empirical distribution of the bootstrap replications R*1, R*2, ..., R*B.

For example, we might be interested in the probability that the usual t statistic √n(X̄ − μ)/S exceeds 2, where μ = E{X} and S² = Σ_i (X_i − X̄)²/(n − 1). Then R* = √n(X̄* − x̄)/S*, and the bootstrap estimate is #{R*b > 2}/B. This calculation is used in Section 9 of Efron (1981c) to get confidence intervals for the mean μ in a situation where normality is suspect.

The cross-validation problem of Sections 8 and 9 involves a different type of error random variable R. It will be useful there to use a jackknife-type approximation to the bootstrap expectation of R,

    E_*{R*} ≈ R⁰ + (n − 1)( R_(.) − R⁰ ).    (20)

Here R⁰ = R(x_1, x_2, ..., x_n; F̂) and R_(.) = (1/n) Σ_i R_(i), with R_(i) = R(x_1, x_2, ..., x_{i−1}, x_{i+1}, ..., x_n; F̂). The justification of (20) is the same as for the theorem of this section, being based on a quadratic approximation formula.

7. MORE COMPLICATED DATA SETS

So far we have considered the simplest kind of data sets, where all the observations come from the same distribution F. The bootstrap idea, and jackknife-type approximations (which are not discussed here), can be applied to much more complicated situations. We begin with a two-sample problem.

The data in our first example consist of two independent random samples,

    X_1, X_2, ..., X_m ~ F  and  Y_1, Y_2, ..., Y_n ~ G,

F and G being two possibly different distributions on the real line. The statistic of interest is the Hodges-Lehmann shift estimate

    θ̂ = median{ y_j − x_i ; i = 1, ..., m, j = 1, ..., n }.

We desire an estimate of the standard error σ(F, G). The bootstrap estimate is simply

    σ̂_B = σ(F̂, Ĝ),

Ĝ being the empirical distribution of the y_i. This is evaluated by Monte Carlo, as in Section 2, with obvious modifications: a bootstrap sample now consists of a random sample X*_1, X*_2, ..., X*_m drawn from F̂ and an independent random sample Y*_1, ..., Y*_n drawn from Ĝ. (In other words, m draws with replacement from {x_1, x_2, ..., x_m}, and n draws with replacement from {y_1, y_2, ..., y_n}.) The bootstrap replication θ̂* is the median of the mn differences y*_j − x*_i. Then σ̂_B is approximated from B independent such replications as on the right side of (11).

Table 4 shows the results of a sampling experiment in which m = 6, n = 9, and both F and G were uniform distributions on the interval [0, 1]. The table is based on 100 trials of the situation. The true standard error is σ(F, G) = .167. "Separate" refers to σ̂_B calculated exactly as described in the previous paragraph. The improvement in going from B = 100 to B = 200 is too small to show up in the table.

Table 4. Bootstrap Estimates of Standard Error for the Hodges-Lehmann Two-Sample Shift Estimate; m = 6, n = 9; True Distributions Both F and G Uniform [0, 1]

                            Expectation   St. Dev.   C.V.   √MSE
  Separate    B = 100          .165         .030      .18    .030
              B = 200          .166         .031      .19    .031
  Combined    B = 100          .145         .028      .19    .036
              B = 200          .149         .025      .17    .031
  True Standard Error          .167

"Combined" refers to the following idea: suppose we believe that G is really a translate of F. Then it wastes information to estimate F and G separately. Instead we can form the combined empirical distribution

    Ĥ: mass 1/(m + n) on each of x_1, x_2, ..., x_m, y_1 − θ̂, y_2 − θ̂, ..., y_n − θ̂.

All m + n bootstrap variates X*_1, ..., X*_m, Y*_1, ..., Y*_n are then sampled independently from Ĥ. (We could add θ̂ back to the Y*_j values, but this has no effect on the bootstrap standard error estimate, since it just adds the constant θ̂ to each bootstrap replication θ̂*.)

The combined method gives no improvement here, but it might be valuable in a many-sample problem where there are small numbers of observations in each sample, a situation that arises in stratified sampling. (See Efron 1982, Ch. 8.) The main point here is that "bootstrap" is not a well-defined verb, and that there may be more than one way to proceed in complicated situations. Next we consider regression problems, where again there is a choice of bootstrapping methods.

In a typical regression problem we observe n independent real-valued quantities Y_i = y_i,

    Y_i = g_i(β) + ε_i,  i = 1, 2, ..., n.    (21)

The functions g_i(·) are of known form, usually g_i(β) = g(β; t_i), where t_i is an observed p-dimensional vector of covariates; β is a vector of unknown parameters we wish to estimate. The ε_i are an independent and identically distributed random sample from some distribution F on the real line,

    ε_1, ε_2, ..., ε_n ~ F,

where F is assumed to be centered at zero in some sense, perhaps E{ε} = 0 or Prob{ε < 0} = .5.

Having observed the data vector Y = y = (y_1, ..., y_n)', we estimate β by minimizing some measure of distance between y and the vector of predicted values η(β) = (g_1(β), ..., g_n(β))',

    β̂: min_β D(y, η(β)).

The most common choice of D is D(y, η) = Σ_i (y_i − η_i)².

Having calculated β̂, we can modify the one-sample bootstrap algorithm of Section 2, and obtain an estimate of β̂'s variability:

(i) Construct F̂ putting mass 1/n at each observed residual,

    F̂: mass 1/n on ε̂_i = y_i − g_i(β̂),  i = 1, 2, ..., n.

(ii) Construct a bootstrap data set

    Y*_i = g_i(β̂) + ε*_i,  i = 1, 2, ..., n,

where the ε*_i are drawn independently from F̂, and calculate

    β̂*: min_β D(Y*, η(β)).

(iii) Do step (ii) some large number B of times, obtaining independent bootstrap replications β̂*1, β̂*2, ..., β̂*B, and estimate the covariance matrix of β̂ by

    Σ̂_B = Σ_{b=1}^B ( β̂*b − β̂*· )( β̂*b − β̂*· )' / (B − 1),    β̂*· = (1/B) Σ_b β̂*b.

In ordinary linear regression we have g_i(β) = t_i'β and D(y, η) = Σ_i (y_i − η_i)². Section 7 of Efron (1979a) shows that in this case the algorithm above can be carried out theoretically, B = ∞, and yields

    Σ̂_B = ( Σ_i t_i t_i' )⁻¹ σ̂²,    σ̂² = Σ_i ε̂_i² / n.    (22)

This is the usual answer, except for dividing by n instead of n − p in σ̂². Of course the advantage of the bootstrap approach is that Σ̂_B can just as well be calculated if, say, g_i(β) = exp(t_i'β) and D(y, η) = Σ_i |y_i − η_i|.

There is another, simpler way to bootstrap the regression problem. We can consider each covariate-response pair x_i = (t_i, y_i) to be a single data point obtained by random sampling from a distribution F on (p + 1)-dimensional space. Then we apply the one-sample bootstrap of Section 2 to the data set x_1, x_2, ..., x_n.

The two bootstrap methods for the regression problem are asymptotically equivalent, but can perform quite differently in small-sample situations. The simple method, described last, takes less advantage of the special structure of the regression problem. It does not give answer (22) in the case of ordinary least squares. On the other hand the simple method gives a trustworthy estimate of β̂'s variability even if the regression model (21) is not correct. For this reason we use the simple method of bootstrapping on the error rate prediction problem of Sections 9 and 10.
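The first (residual-resampling) recipe is easy to sketch for ordinary least squares. The following is a minimal illustration (Python with numpy; the design matrix, sample size, true coefficients, and seed are simulated choices of ours). With B large its covariance estimate approaches formula (22):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 40, 2
    T = np.column_stack([np.ones(n), rng.normal(size=n)])      # design matrix, rows t_i'
    beta_true = np.array([1.0, 2.0])
    y = T @ beta_true + rng.normal(size=n)                      # simulated responses

    beta_hat = np.linalg.lstsq(T, y, rcond=None)[0]             # least-squares fit
    resid = y - T @ beta_hat                                    # residuals defining F-hat

    B = 2000
    boot = np.empty((B, p))
    for b in range(B):
        e_star = resid[rng.integers(0, n, size=n)]              # step (ii): resample residuals
        y_star = T @ beta_hat + e_star                          # bootstrap responses
        boot[b] = np.linalg.lstsq(T, y_star, rcond=None)[0]     # bootstrap replication beta*_b

    cov_boot = np.cov(boot, rowvar=False)                       # approaches (sum t_i t_i')^{-1} * mean(resid^2)
    print(np.sqrt(np.diag(cov_boot)))

The simpler pairs method replaces the inner two lines of the loop by resampling whole rows (t_i, y_i) together, which is the version used for the prediction problems below.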


As a final example of bootstrapping complicated data sets, we consider a two-sample problem with censored data. The data are the leukemia remission times listed in Table 1 of Cox (1972). The sample sizes are m = n = 21. Treatment-group remission times (weeks) are 6+, 6, 6, 6, 7, 9+, 10+, 10, 11+, 13, 16, 17+, 19+, 20+, 22, 23, 25+, 32+, 32+, 34+, 35+; control-group remission times (weeks) are 1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23. Here 6+ indicates a censored remission time, known only to exceed 6 weeks, while 6 is an uncensored remission time of exactly 6 weeks. None of the control-group times were censored.

We assume Cox's proportional hazards model, the hazard rate in the control group equaling e^β times that in the treatment group. The partial likelihood estimate of β is β̂ = 1.51, and we want to estimate the standard error of β̂. (Cox gets 1.65, not 1.51. Here we are using Breslow's convention for ties (1972), which accounts for the discrepancy.)

Figure 4 shows the histogram for 1000 bootstrap replications of β̂*. Each replication was obtained by the two-sample method described for the Hodges-Lehmann estimate:

(i) Construct F̂ putting mass 1/21 at each point 6+, 6, 6, ..., 35+, and Ĝ putting mass 1/21 at each point 1, 1, ..., 23. (Notice that the "points" in F̂ include the censoring information.)

(ii) Draw X*_1, X*_2, ..., X*_21 by random sampling from F̂, and likewise Y*_1, Y*_2, ..., Y*_21 by random sampling from Ĝ. Calculate β̂* by applying the partial-likelihood method to the bootstrap data.

The bootstrap estimate of standard error for β̂, as given by (11), is σ̂_B = .42. This agrees nicely with Cox's asymptotic estimate σ̂ = .41. However, the percentile method gives quite different confidence intervals from those obtained by the usual method. For α = .05, 1 − 2α = .90, the latter interval is 1.51 ± 1.65 · .41 = [.83, 2.19]. The percentile method gives the 90 percent central interval [.98, 2.35]. Notice that (2.35 − 1.51)/(1.51 − .98) = 1.58, so that the percentile interval is considerably larger to the right of β̂ than to the left. (The bias-corrected percentile method gives almost the same answers as the uncorrected method in this case since Ĉ(β̂) = .49.)

[Figure 4. Histogram of 1000 bootstrap replications of β̂* for the leukemia data, proportional hazards model; the replications range from roughly 0.5 to 3.0. Courtesy of Rob Tibshirani, Stanford.]

There are other reasonable ways to bootstrap censored data. One of these is described in Efron (1981a), which also contains a theoretical justification for the method used to construct Figure 4.

8. CROSS-VALIDATION

Cross-validation is an old but useful idea, whose time seems to have come again with the advent of modern computers. We discuss it in the context of estimating the error rate of a prediction rule. (There are other important uses; see Stone 1974; Geisser 1975.)

The prediction problem is as follows: each data point x_i = (t_i, y_i) consists of a p-dimensional vector of explanatory variables t_i, and a response variable y_i. Here we assume y_i can take on only two possible values, say 0 or 1, indicating two possible responses, live or dead, male or female, success or failure, and so on. We observe x_1, x_2, ..., x_n, called collectively the training set, and indicated x = (x_1, x_2, ..., x_n). We have in mind a formula η(t; x) for constructing a prediction rule from the training set, also taking on values either 0 or 1. Given a new explanatory vector t_0, the value η(t_0; x) is supposed to predict the corresponding response y_0.

We assume that each x_i is an independent realization of X = (T, Y), a random vector having some distribution F on (p + 1)-dimensional space, and likewise for the "new case" X_0 = (T_0, Y_0). The true error rate err of the prediction rule η(·; x) is the expected probability of error over X_0 ~ F with x fixed,

    err = E{ Q[Y_0, η(T_0; x)] },

where Q[y, η] is the error indicator

    Q[y, η] = 1 if y ≠ η, 0 if y = η.

An obvious estimate of err is the apparent error rate

    err̄ = Ê{ Q[Y_0, η(T_0; x)] } = (1/n) Σ_{i=1}^n Q[y_i, η(t_i; x)].

The symbol Ê indicates expectation with respect to the empirical distribution F̂, putting mass 1/n on each x_i. The apparent error rate is likely to underestimate the true error rate, since we are evaluating η(·; x)'s performance on the same set of data used in its construction. A random variable of interest is the overoptimism, true minus apparent error rate,

    R(x, F) = err − err̄
            = E{ Q[Y_0, η(T_0; x)] } − Ê{ Q[Y_0, η(T_0; x)] }.    (23)

The expectation of R(X, F) over the random choice of X_1, X_2, ..., X_n from F,

    ω(F) = E R(X, F),    (24)

is the expected overoptimism.

The cross-validated estimate of err is

    err_cv = (1/n) Σ_{i=1}^n Q[ y_i, η(t_i; x_(i)) ],

η(t_i; x_(i)) being the prediction rule based on x_(i) = (x_1, x_2, ..., x_{i−1}, x_{i+1}, ..., x_n). In other words err_cv is the error rate over the observed data set, not allowing x_i = (t_i, y_i) to enter into the construction of the rule for its own prediction. It is intuitively obvious that err_cv is a less biased estimator of err than is err̄.
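The cross-validated error rate is just the deletion loop written out. A minimal sketch (Python with numpy; `fit` and `predict` stand for whatever rule-building formula η(·; x) is in use, and the names are ours):

    import numpy as np

    def cv_error_rate(t, y, fit, predict):
        # Leave-one-out cross-validated error rate of a 0/1 prediction rule.
        # fit(t_train, y_train) returns a fitted rule; predict(rule, t0) returns 0 or 1.
        n = len(y)
        errors = 0
        for i in range(n):
            keep = np.arange(n) != i                        # delete point i from the training set
            rule = fit(t[keep], y[keep])                    # recalculate the rule on n - 1 points
            errors += int(predict(rule, t[i]) != y[i])      # did it predict the deleted point?
        return errors / n

Replacing t[keep], y[keep] by the full data inside the loop gives the apparent error rate err̄ instead, so the difference of the two quantities is the estimate ω̂ discussed next.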
In what follows we consider how well err_cv estimates err, or equivalently how well

    ω̂ = err_cv − err̄

estimates R(x, F) = err − err̄. (These are equivalent problems since err_cv − err = ω̂ − R(x, F).) We have used the notation ω̂, rather than R̂, because it turns out later that it is actually ω being estimated.

We consider a sampling experiment involving Fisher's linear discriminant function. The dimension is p = 2 and the sample size of the training set is n = 14. The distribution F is as follows: Y = 0 or 1 with probability 1/2, and given Y = y the predictor vector T is bivariate normal with identity covariance matrix and mean vector (y − 1/2, 0). If F were known to the statistician, the ideal prediction rule would be to guess y_0 = 0 if the first component of t_0 was ≤ 0, and to guess y_0 = 1 otherwise. Since F is assumed unknown, we must estimate a prediction rule from the training set.

We use the prediction rule based on Fisher's estimated linear discriminant function (Efron 1975),

    η(t; x) = 1 if α̂ + t'β̂ > 0,  and 0 otherwise.

The quantities α̂ and β̂ are defined in terms of n_0 and n_1, the number of y_i equal to zero and one, respectively; t̄_0 and t̄_1, the averages of the t_i corresponding to those y_i equaling zero and one, respectively; and S = [Σ_i t_i t_i' − n_0 t̄_0 t̄_0' − n_1 t̄_1 t̄_1']/n:

    α̂ = [ t̄_0' S⁻¹ t̄_0 − t̄_1' S⁻¹ t̄_1 ] / 2,    β̂ = S⁻¹ ( t̄_1 − t̄_0 ).

Table 5 shows the results of 10 simulations ("trials") of this situation. The expected overoptimism, obtained from 100 trials, is ω = .096, so that R = err − err̄ is typically quite large. However, R is also quite variable from trial to trial, often being negative. The cross-validation estimate ω̂ is positive in all 10 cases, and does not correlate with R. This relates to the comment that ω̂ is trying to estimate ω rather than R. We will see later that ω̂ has expectation .091, and so is nearly unbiased for ω. However, ω̂ is too variable itself to be very useful for estimating R, which is to say that err_cv is not a particularly good estimate of err. These points are discussed further in Section 9, where the two other estimates of ω appearing in Table 5, ω̂_J and ω̂_B, are introduced.

Table 5. The First 10 Trials of a Sampling Experiment Involving Fisher's Linear Discriminant Function. The Training Set Has Size n = 14. The Expected Overoptimism Is ω = .096; see Table 6

                     Error Rates                        Estimates of Overoptimism
  Trial  n_0, n_1    True     Apparent   Overoptimism   Cross-validation   Jackknife   Bootstrap (B = 200)
                     err      err̄        R              ω̂                 ω̂_J         ω̂_B
   1      9, 5       .458     .286        .172           .214              .214         .083
   2      6, 8       .312     .357       -.045           .000              .066         .098
   3      7, 7       .313     .357       -.044           .071              .066         .110
   4      8, 6       .351     .429       -.078           .071              .066         .107
   5      8, 6       .330     .357       -.027           .143              .148         .102
   6      8, 6       .318     .143        .175           .214              .194         .073
   7      8, 6       .310     .071        .239           .071              .066         .087
   8      6, 8       .382     .286        .094           .071              .056         .097
   9      7, 7       .360     .429       -.069           .071              .087         .127
  10      8, 6       .335     .143        .192           .000              .010         .048

9. BOOTSTRAP AND JACKKNIFE ESTIMATES FOR THE PREDICTION PROBLEM

At the end of Section 6 we described a method for applying the bootstrap to any random variable R(X, F). Now we use that method on the overoptimism random variable (23), and obtain a bootstrap estimate of the expected overoptimism ω(F).

The bootstrap estimate of ω = ω(F), (24), is simply

    ω̂_B = ω(F̂).

As usual ω̂_B must be approximated by Monte Carlo. We generate independent bootstrap replications R*1, R*2, ..., R*B, and take

    ω̂_B ≈ (1/B) Σ_{b=1}^B R*b.

As B goes to infinity this last expression approaches E_*{R*}, the expectation of R* under bootstrap resampling, which is by definition the same quantity as ω(F̂) = ω̂_B. The bootstrap estimates ω̂_B seen in the last column of Table 5 are considerably less variable than the cross-validation estimates ω̂.

What does a typical bootstrap replication consist of in this situation? As in Section 3 let P* = (P*_1, P*_2, ..., P*_n)' indicate the bootstrap resampling proportions P*_i = #{X*_j = x_i}/n. (Notice that we are considering each vector x_i = (t_i, y_i) as a single sample point for the purpose of carrying out the bootstrap algorithm.) Following through definition (23), it is not hard to see that

    R* = R(X*, F̂) = Σ_{i=1}^n (P⁰_i − P*_i) Q[ y_i, η(t_i; X*) ],    (25)

where P⁰ = (1, 1, ..., 1)'/n as before, and η(·; X*) is the prediction rule based on the bootstrap sample.
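Expression (25) translates directly into a double loop: refit the rule on each bootstrap training set, then compare its error rate on the original points (weights P⁰) with its error rate on the bootstrap sample itself (weights P*). A minimal sketch (Python with numpy; `fit` and `predict` play the same roles as in the cross-validation sketch, and the names and defaults are ours):

    import numpy as np

    def bootstrap_overoptimism(t, y, fit, predict, B=200, seed=0):
        # Bootstrap estimate omega_B of the expected overoptimism, via eq. (25).
        rng = np.random.default_rng(seed)
        n = len(y)
        R_star = np.empty(B)
        for b in range(B):
            idx = rng.integers(0, n, size=n)                     # bootstrap training set X*
            rule = fit(t[idx], y[idx])                           # eta( . ; X*)
            loss = np.array([predict(rule, t[i]) != y[i]
                             for i in range(n)], dtype=float)    # Q[y_i, eta(t_i; X*)]
            p_star = np.bincount(idx, minlength=n) / n           # resampling proportions P*
            R_star[b] = loss.mean() - p_star @ loss              # (P0 - P*)' Q, eq. (25)
        return R_star.mean()                                     # omega_B; added to err-bar it
                                                                 # corrects the apparent error rate

This is the generic recipe; it says nothing about which rule-building formula is plugged in, which is why the same code can be reused for the stepwise logistic rule of Section 10.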


Table 6 shows the results of two simulation experiments (100 trials each) involving Fisher's linear discriminant function. The left side relates to the bivariate normal situation described in Section 8: sample size n = 14, dimension d = 2, true mean vectors (±1/2, 0) for the two randomly selected normal distributions. The right side still has n = 14, but the dimension has been raised to 5, with true mean vectors (±1/2, 0, 0, 0, 0). Fuller descriptions appear in Chapter 7 of Efron (1982).

Table 6. Two Sampling Experiments Involving Fisher's Linear Discriminant Function. The Left Side of the Table Relates to the Situation of Table 5: n = 14, d = 2, True Mean Vectors (±1/2, 0). The Right Side Relates to n = 14, d = 5, True Mean Vectors (±1/2, 0, 0, 0, 0)

                                  Dimension 2                          Dimension 5
                                  Exp.    Sd.    Corr.   √MSE          Exp.    Sd.    Corr.   √MSE
  Overoptimism R(X, F)            .096    .113                         .184    .099
  1. Ideal Constant               .096    0       0      .113          .184    0       0      .099
  2. Cross-Validation             .091    .073   -.07    .139          .170    .094   -.15    .147
  3. Jackknife                    .093    .068   -.23    .145          .167    .089   -.26    .150
  4. Bootstrap (B = 200)          .080    .028   -.64    .135          .103    .031   -.58    .145
  5. BootRand (B = 200)           .087    .026   -.55    .130          .147    .020   -.31    .114
  6. BootAve (B = 200)            .100    .036   -.18    .125          .172    .041   -.25    .118
  7. Zero                         0       0       0      .149          0       0       0      .209

Seven estimates of overoptimism were considered. In the d = 2 situation, the cross-validation estimate ω̂, for example, had expectation .091, standard deviation .073, and correlation −.07 with R. This gave root mean squared error, of ω̂ for estimating R, or equivalently of err_cv for estimating err,

    [ E(ω̂ − R)² ]^{1/2} = [ E(err_cv − err)² ]^{1/2} = .139.

The bootstrap, line 4, did only slightly better, √MSE = .135.

The zero estimate ω̂ ≡ 0, line 7, had √MSE = .149, which is also [E(err̄ − err)²]^{1/2}, the √MSE of estimating err by the apparent error rate err̄, with zero correction for overoptimism. The "ideal constant" is ω itself. If we knew ω, which we don't in genuine applications, we would use the bias-corrected estimate err̄ + ω. Line 1, left side, says that this ideal correction gives √MSE = .113.

We see that neither cross-validation nor the bootstrap is much of an improvement over making no correction at all, though the situation is more favorable on the right side of Table 6. Estimators 5 and 6, which will be described later, perform noticeably better.

The "jackknife," line 3, refers to the following idea: since ω̂_B = E_*{R*} is a bootstrap expectation, we can approximate that expectation by (20). In this case (25) gives R⁰ = 0, so the jackknife approximation is simply ω̂_J = (n − 1) R_(.). Evaluating this last expression, as in Chapter 7 of Efron (1982), gives

    ω̂_J = ((n − 1)/n) Σ_{i=1}^n [ (1/n) Σ_{i'=1}^n Q[y_{i'}, η(t_{i'}; x_(i))] − (1/(n − 1)) Σ_{i' ≠ i} Q[y_{i'}, η(t_{i'}; x_(i))] ].

This looks very much like the cross-validation estimate, which can be written

    ω̂ = (1/n) Σ_{i=1}^n { Q[y_i, η(t_i; x_(i))] − Q[y_i, η(t_i; x)] }.

As a matter of fact, ω̂_J and ω̂ have asymptotic correlation one (Gong 1982). Their nearly perfect correlation can be seen in Table 5. In the sampling experiments of Table 6, corr(ω̂_J, ω̂) = .93 on the left side, and .98 on the right side. The point here is that the cross-validation estimate ω̂ is, essentially, a Taylor series approximation to the bootstrap estimate ω̂_B.

Even though ω̂_B and ω̂ are closely related in theory and are asymptotically equivalent, they behave very differently in Table 6: ω̂ is nearly unbiased and uncorrelated with R, but has enormous variability; ω̂_B has small variability, but is biased downwards, particularly in the right-hand case, and highly negatively correlated with R. The poor performances of the two estimators are due to different causes, and there are some grounds of hope for a favorable hybrid.

"BootRand," line 5, modified the bootstrap estimate in just one way: instead of drawing the bootstrap sample X*_1, X*_2, ..., X*_n from F̂, it was drawn from

    F̂_RAND: mass π_i/n on (t_i, 1), mass (1 − π_i)/n on (t_i, 0),  i = 1, 2, ..., n.

This is a distribution supported on 2n points, the observed points x_i = (t_i, y_i) and also the complementary points (t_i, 1 − y_i). The probabilities π_i were those naturally associated with the linear discriminant function,

    π_i = 1 / [ 1 + exp(−(α̂ + t_i'β̂)) ]

(see Efron 1975), except that π_i was always forced to lie in the interval [.1, .9].

Drawing the bootstrap sample X*_1, ..., X*_n from F̂_RAND instead of F̂ is a form of smoothing, not unlike the smoothed bootstraps of Section 2. In both cases we support the estimate of F on points beyond those actually observed in the sample. Here the smoothing is entirely in the response variable y. In complicated problems, such as the one described in Section 10, t_i can have complex structure (censoring, missing values, cardinal and ordinal scales, discrete and continuous variates, etc.), making it difficult to smooth in the t space. Notice that in Table 6 BootRand is an improvement over the ordinary bootstrap in every way: it has smaller bias, smaller standard deviation, and smaller negative correlation with R. The decrease in √MSE is especially impressive on the right side of the table.

"BootAve," line 6, involves a quantity we shall call ω̂_0. Generating B bootstrap replications involves making nB predictions η(t_i, X*b), i = 1, 2, ..., n, b = 1, 2, ..., B. Let

    I_i^b = 1 if P*b_i = 0,  I_i^b = 0 if P*b_i > 0.

Then

    ω̂_0 = Σ_{i,b} I_i^b Q[y_i, η(t_i, X*b)] / Σ_{i,b} I_i^b − err̄.

In other words, ω̂_0 + err̄ is the observed bootstrap error rate for prediction of those y_i where x_i is not involved in the construction of η(·, X*). Theoretical arguments can be mustered to show that ω̂_0 will usually have expectation greater than ω, while ω̂_B usually has expectation less than ω. "BootAve" is the compromise estimator ω̂_AVE = (ω̂_B + ω̂_0)/2. It also performs well in Table 6, though there is not yet enough theoretical or numerical evidence to warrant unqualified enthusiasm.
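The BootAve idea needs only one extra bookkeeping step inside the bootstrap loop: tally the errors made on points that never entered the bootstrap training set. A minimal sketch extending the earlier overoptimism function (Python with numpy; the names are ours, and `apparent_err` is assumed to hold the apparent error rate err̄ of the original rule):

    import numpy as np

    def bootave_overoptimism(t, y, fit, predict, apparent_err, B=200, seed=0):
        # BootAve: average of omega_B (eq. 25) and the left-out-points estimate omega_0.
        rng = np.random.default_rng(seed)
        n = len(y)
        R_star = np.empty(B)
        out_errors, out_count = 0.0, 0
        for b in range(B):
            idx = rng.integers(0, n, size=n)
            rule = fit(t[idx], y[idx])
            loss = np.array([predict(rule, t[i]) != y[i]
                             for i in range(n)], dtype=float)
            p_star = np.bincount(idx, minlength=n) / n
            R_star[b] = loss.mean() - p_star @ loss       # ingredient of omega_B, eq. (25)
            left_out = p_star == 0                        # points with P*_i = 0
            out_errors += loss[left_out].sum()            # errors on points outside X*
            out_count += left_out.sum()
        omega_B = R_star.mean()
        omega_0 = out_errors / out_count - apparent_err   # omega_0 + err-bar is the left-out error rate
        return (omega_B + omega_0) / 2                    # the BootAve compromise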
The bootstrap is a general all-purpose device that can be applied to almost any problem. This is very handy, but it implies that in situations with special structure the bootstrap may be outperformed by more specialized methods. Here we have done so in two different ways. BootRand uses an estimate of F that is better than the totally nonparametric estimate F̂. BootAve makes use of the particular form of R for the overoptimism problem.

10. A COMPLICATED PREDICTION PROBLEM

We end this article with the bootstrap analysis of a genuine prediction problem, involving many of the complexities and difficulties typical of genuine problems. The bootstrap is not necessarily the best method here, as discussed in Section 9, but it is impressive to see how much information this simple idea, combined with massive computation, can extract from a situation that is hopelessly beyond traditional theoretical solutions. A fuller discussion appears in Efron and Gong (1981).

Among n = 155 acute chronic hepatitis patients, 33 were observed to die from the disease, while 122 survived. Each patient had associated a vector of 20 covariates. On the basis of this training set it was desired to produce a rule for predicting, from the covariates, whether a given patient would live or die. If an effective prediction rule were available, it would be useful in choosing among alternative treatments. For example, patients with a very low predicted probability of death could be given less rigorous treatment.

Let x_i = (t_i, y_i) represent the data for patient i, i = 1, 2, ..., 155. Here t_i is the 20-dimensional vector of covariates, and y_i equals 1 or 0 as the patient died or lived. Table 7 shows the data for the last 11 patients. Negative numbers represent missing values. Variable 1 is the constant 1, included for convenience. The meaning of the 19 other predictors, and their coding in Table 7, will not be explained here.

Table 7. The Last 11 Liver Patients. Negative Numbers Indicate Missing Values

     Cons-          Ster- Anti- Fa-    Mal- Anor- Liver Liver Spleen Spi-  As-   Var-  Bili- Alk         Albu- Pro-  Histo-
  y  tant Age Sex   oid   viral tigue  aise exia  Big   Firm  Palp   ders  cites ices  rubin Phos  SGOT  min   tein  logy    #
     1    2   3     4     5     6      7    8     9     10    11     12    13    14    15    16    17    18    19    20
  1  1    45  1     2     2     1      1    1     2     2     2      1     1     2     1.90  -1    114   2.4   -1    -3     145
  0  1    31  1     1     2     1      2    2     2     2     2      2     2     2     1.20  75    193   4.2   54     2     146
  1  1    41  1     2     2     1      2    2     2     1     1      1     2     1     4.20  65    120   3.4   -1    -3     147
  1  1    70  1     1     2     1      1    1     -3    -3    -3     -3    -3    -3    1.70  109   528   2.8   35     2     148
  0  1    20  1     1     2     2      2    2     2     -3    2      2     2     2     .90   89    152   4.0   -1     2     149
  0  1    36  1     2     2     2      2    2     2     2     2      2     2     2     .60   120   30    4.0   -1     2     150
  1  1    46  1     2     2     1      1    1     2     2     2      1     1     1     7.60  -1    242   3.3   50    -3     151
  0  1    44  1     2     2     1      2    2     2     1     2      2     2     2     .90   126   142   4.3   -1     2     152
  0  1    61  1     1     2     1      1    2     1     1     2      1     2     2     .80   95    20    4.1   -1     2     153
  0  1    53  2     1     2     1      2    2     2     2     1      1     2     1     1.50  84    19    4.1   48    -3     154
  1  1    43  1     2     2     1      2    2     2     2     1      1     1     2     1.20  100   19    3.1   42     2     155

A prediction rule was constructed in 3 steps:

1. An α = .05 test of the importance of predictor j, H_0: β_j = 0 versus H_1: β_j ≠ 0, was run separately for j = 2, 3, ..., 20, based on the logistic model

    log[ π(t_i) / (1 − π(t_i)) ] = β_1 + β_j t_{ij},    π(t_i) = Prob{patient i dies}.

Among these 19 tests, 13 predictors indicated predictive power by rejecting H_0: j = 18, 13, 15, 12, 14, 7, 6, 19, 20, 11, 2, 5, 3. These are listed in order of achieved significance level, j = 18 attaining the smallest alpha.

2. These 13 predictors were tested in a forward multiple-logistic-regression program, which added predictors one at a time (beginning with the constant) until no further single addition achieved significance level α = .10. Five predictors besides the constant survived this step, j = 13, 20, 15, 7, 2.

3. A final forward, stepwise multiple-logistic-regression program on these five predictors, stopping this time at level α = .05, retained four predictors besides the constant, j = 13, 15, 7, 20.

At each of the three steps, only those patients having no relevant data missing were included in the hypothesis tests. At step 2, for example, a patient was included only if all 13 variables were available.

The final prediction rule was based on the estimated logistic regression

    log[ π̂(t_i) / (1 − π̂(t_i)) ] = Σ_{j = 1, 13, 15, 7, 20} β̂_j t_{ij},

where β̂ was the maximum likelihood estimate in this model. The prediction rule was

    η(t; x) = 1 if Σ_j β̂_j t_j > c, and 0 otherwise,    c = log(33/122).    (26)

Among the 155 patients, 133 had none of the predictors 13, 15, 7, 20 missing. When the rule η(t; x) was applied to these 133 patients, it misclassified 21 of them, for an apparent error rate err̄ = 21/133 = .158. We would like to estimate how overoptimistic err̄ is.

To answer this question, the simple bootstrap was applied as described in Section 9. A typical bootstrap sample consisted of X*_1, X*_2, ..., X*_155, randomly drawn with replacement from the training set x_1, x_2, ..., x_155. The bootstrap sample was used to construct the bootstrap prediction rule η(·, X*), following the same three steps used in the construction of η(·, x), (26). This gives a bootstrap replication R* for the overoptimism random variable R = err − err̄, essentially as in (25), but with a modification to allow for difficulties caused by missing predictor values.
Figure 5 shows the histogram of B = 500 such replications. 95 percent of these fall in the range 0 ≤ R* ≤ .12. This indicates that the unobservable true overoptimism err − err̄ is likely to be positive. The average value is

    ω̂_B = (1/B) Σ_{b=1}^B R*b = .045,

suggesting that the expected overoptimism is about 1/3 as large as the apparent error rate .158. Taken literally, this gives the bias-corrected estimated error rate .158 + .045 = .203. There is obviously plenty of room for error in this last estimate, given the spread of values in Figure 5, but at least we now have some idea of the possible bias in err̄.

[Figure 5. Histogram of 500 bootstrap replications of overoptimism for the hepatitis problem; the replications run from about −.10 to .15.]

The bootstrap analysis provided more than just an estimate of ω(F). For example, the standard deviation of the histogram in Figure 5 is .036. This is a dependable estimate of the true standard deviation of R (see Efron 1982, Ch. 7), which by definition equals [E(err − err̄ − ω)²]^{1/2}, the √MSE of err̄ + ω as an estimate of err. Comparing line 1 with line 4 in Table 6, we expect err̄ + ω̂_B = .203 to have √MSE at least this big for estimating err.

Figure 6 illustrates another use of the bootstrap replications. The predictors chosen by the three-step selection procedure, applied to the bootstrap training set X*, are shown for the last 25 of the 500 replications. Among all 500 replications, predictor 13 was selected 37 percent of the time, predictor 15 selected 48 percent, predictor 7 selected 35 percent, and predictor 20 selected 59 percent. No other predictor was selected more than 50 percent of the time. No theory exists for interpreting Figure 6, but the results certainly discourage confidence in the casual nature of the predictors 13, 15, 7, 20.

[Figure 6. Predictors selected in the last 25 bootstrap replications for the hepatitis problem, listed one replication per line. The predictors selected by the actual data were 13, 15, 7, 20.]

[Received January 1982. Revised May 1982.]

REFERENCES

BRESLOW, N. (1972), Discussion of Cox (1972), Journal of the Royal Statistical Society, Ser. B, 34, 216-217.

COX, D.R. (1972), "Regression Models With Life Tables," Journal of the Royal Statistical Society, Ser. B, 34, 187-220.

CRAMER, H. (1946), Mathematical Methods of Statistics, Princeton: Princeton University Press.

EFRON, B. (1975), "The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis," Journal of the American Statistical Association, 70, 892-898.

EFRON, B. (1979a), "Bootstrap Methods: Another Look at the Jackknife," Annals of Statistics, 7, 1-26.

EFRON, B. (1979b), "Computers and the Theory of Statistics: Thinking the Unthinkable," SIAM Review, 21, 460-480.

EFRON, B. (1981a), "Censored Data and the Bootstrap," Journal of the American Statistical Association, 76, 312-319.

EFRON, B. (1981b), "Nonparametric Estimates of Standard Error: The Jackknife, the Bootstrap, and Other Resampling Methods," Biometrika, 68, 589-599.

EFRON, B. (1981c), "Nonparametric Standard Errors and Confidence Intervals," Canadian Journal of Statistics, 9, 139-172.

EFRON, B. (1982), The Jackknife, the Bootstrap, and Other Resampling Plans, SIAM Monograph No. 38, CBMS-NSF.

EFRON, B., and GONG, G. (1981), "Statistical Theory and the Computer," unpublished manuscript.

GEISSER, S. (1975), "The Predictive Sample Reuse Method With Applications," Journal of the American Statistical Association, 70, 320-328.

GONG, G. (1982), "Cross-Validation, the Jackknife, and the Bootstrap: Excess Error Estimation in Forward Logistic Regression," Ph.D. dissertation, Dept. of Statistics, Stanford University.

HAMPEL, F. (1974), "The Influence Curve and Its Role in Robust Estimation," Journal of the American Statistical Association, 69, 383-393.

JAECKEL, L. (1972), "The Infinitesimal Jackknife," Bell Laboratories Memorandum #MM 72-1215-11.

JOHNSON, N., and KOTZ, S. (1970), Continuous Univariate Distributions (Vol. 2), Boston: Houghton Mifflin.

MALLOWS, C.L. (1974), "On Some Topics in Robustness," Memorandum, Bell Laboratories, Murray Hill, New Jersey.

QUENOUILLE, M. (1949), "Approximate Tests of Correlation in Time Series," Journal of the Royal Statistical Society, Ser. B, 11, 68-84.

SCHUCANY, W., GRAY, H., and OWEN, D. (1971), "On Bias Reduction in Estimation," Journal of the American Statistical Association, 66, 524-533.

STONE, M. (1974), "Cross-Validatory Choice and Assessment of Statistical Predictions," Journal of the Royal Statistical Society, Ser. B, 36, 111-147.
