The Bootstrap
11.1 Introduction
Most of this volume is devoted to parametric inference. In this chapter we depart from
the parametric framework and discuss a nonparametric technique called the bootstrap.
The bootstrap is a method for estimating the variance of an estimator and for finding
approximate confidence intervals for parameters. Although the method is nonparametric,
it can be used for inference about parameters in parametric and nonparametric models, which is why we include it in this volume.
We begin by broadening what we mean by a parameter. Consider a few examples.
In the first example, $\theta$ denotes the parameter of a parametric model. In the second and third examples, we are in a nonparametric situation; in these cases we think of a "parameter" as a function of the distribution $P$ and we write $\theta = T(P)$. The bootstrap can be used in both the parametric and nonparametric settings.
Let $P_n$ be the empirical distribution. This is the discrete distribution that puts mass $1/n$ at each data point $X_i$. Hence,
$$P_n(A) = \frac{1}{n}\sum_{i=1}^{n} I(X_i \in A). \qquad (11.1)$$
In the nonparametric case, we will estimate the parameter $\theta = T(P)$ by $\hat\theta_n = T(P_n)$, which is called the plug-in estimator. For example, when $\theta = T(P) = \int x\, dP(x)$ is the mean, the plug-in estimator is
$$\hat\theta_n = T(P_n) = \int x\, dP_n(x) = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad (11.2)$$
which is the sample mean.
A bootstrap sample is an i.i.d. sample
$$X_1^*, \ldots, X_n^* \sim P_n.$$
Bootstrap samples play an important role in what follows. Note that drawing an i.i.d. sample $X_1^*, \ldots, X_n^*$ from $P_n$ is equivalent to drawing $n$ observations, with replacement, from the original data $\{X_1, \ldots, X_n\}$. Thus, bootstrap sampling is often described as "resampling the data." This can be a bit confusing and we think it is much clearer to think of a bootstrap sample $X_1^*, \ldots, X_n^*$ as $n$ draws from the empirical distribution $P_n$.
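To make the equivalence concrete, here is a minimal sketch (assuming NumPy; the data and variable names are illustrative, not part of the text): drawing $n$ points from $P_n$ is exactly drawing $n$ points with replacement from the observed data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=25)                  # the observed data X_1, ..., X_n

# A bootstrap sample: n draws from the empirical distribution P_n,
# equivalently n draws with replacement from the original data.
x_star = rng.choice(x, size=x.size, replace=True)
```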
Now we give the bootstrap algorithms for estimating the variance of $\hat\theta_n$ and for constructing confidence intervals. The explanation of why (and when) the bootstrap gives valid estimates is deferred until Section 11.5. Let $\hat\theta_n = g(X_1, \ldots, X_n)$ denote some estimator.
Bootstrap Variance Estimation

1. Draw a bootstrap sample $X_1^*, \ldots, X_n^* \sim P_n$ and compute $\hat\theta_n^* = g(X_1^*, \ldots, X_n^*)$.

2. Repeat the previous step $B$ times, yielding estimates $\hat\theta^*_{n,1}, \ldots, \hat\theta^*_{n,B}$.

3. Compute
$$\hat s = \sqrt{\frac{1}{B}\sum_{j=1}^{B}\bigl(\hat\theta^*_{n,j} - \bar\theta\bigr)^2}$$
where $\bar\theta = \frac{1}{B}\sum_{j=1}^{B}\hat\theta^*_{n,j}$.

4. Output $\hat s$.
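The algorithm translates directly into code. The following sketch (NumPy assumed; the function name `bootstrap_se` and the choice of the sample median as the estimator are ours, purely for illustration) computes $\hat s$:

```python
import numpy as np

def bootstrap_se(x, estimator, B=10_000, rng=None):
    """Bootstrap estimate of the standard error of estimator(x) (the s-hat of step 3)."""
    rng = np.random.default_rng(rng)
    n = len(x)
    theta_star = np.empty(B)
    for j in range(B):
        x_star = rng.choice(x, size=n, replace=True)   # step 1: draw from P_n
        theta_star[j] = estimator(x_star)              # step 2: collect B replicates
    return theta_star.std()   # step 3: sqrt of (1/B) sum_j (theta*_j - theta-bar)^2

rng = np.random.default_rng(1)
x = rng.standard_t(df=5, size=200)
print(bootstrap_se(x, np.median, B=2000))              # step 4: output s-hat
```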
The next theorem states that $\hat s^2$ approximates $\mathrm{Var}(\hat\theta_n)$. There are two sources of error in this approximation. The first is due to the fact that $n$ is finite and the second is due to the fact that $B$ is finite. However, we can make $B$ as large as we like. (In practice, it usually suffices to take $B = 10{,}000$.) So we ignore the error due to finite $B$.
Theorem 138. Under appropriate regularity conditions, $\dfrac{\hat s^2}{\mathrm{Var}(\hat\theta_n)} \stackrel{P}{\to} 1$ as $n \to \infty$.
Bootstrap Confidence Interval

1. Draw a bootstrap sample $X_1^*, \ldots, X_n^* \sim P_n$ and compute $\hat\theta_n^* = g(X_1^*, \ldots, X_n^*)$.

2. Repeat the previous step $B$ times, yielding estimates $\hat\theta^*_{n,1}, \ldots, \hat\theta^*_{n,B}$.

3. Let
$$\hat F(t) = \frac{1}{B}\sum_{j=1}^{B} I\Bigl(\sqrt{n}\bigl(\hat\theta^*_{n,j} - \hat\theta_n\bigr) \le t\Bigr).$$

4. Let
$$C_n = \left[\hat\theta_n - \frac{t_{1-\alpha/2}}{\sqrt{n}},\ \hat\theta_n - \frac{t_{\alpha/2}}{\sqrt{n}}\right]$$
where $t_{\alpha/2} = \hat F^{-1}(\alpha/2)$ and $t_{1-\alpha/2} = \hat F^{-1}(1-\alpha/2)$.

5. Output $C_n$.
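This interval is often called the pivotal (or basic) bootstrap interval. A sketch of the computation (NumPy assumed; the function name and the exponential example are our illustrative choices). Note that, as in step 4, the $1-\alpha/2$ bootstrap quantile forms the lower endpoint and the $\alpha/2$ quantile the upper endpoint:

```python
import numpy as np

def bootstrap_pivotal_ci(x, estimator, alpha=0.05, B=10_000, rng=None):
    """Pivotal bootstrap interval C_n from the algorithm above."""
    rng = np.random.default_rng(rng)
    n = len(x)
    theta_hat = estimator(x)
    roots = np.empty(B)
    for j in range(B):
        x_star = rng.choice(x, size=n, replace=True)
        roots[j] = np.sqrt(n) * (estimator(x_star) - theta_hat)  # sqrt(n)(theta*_j - theta_hat)
    t_lo, t_hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])  # t_{alpha/2}, t_{1-alpha/2}
    return theta_hat - t_hi / np.sqrt(n), theta_hat - t_lo / np.sqrt(n)

rng = np.random.default_rng(2)
x = rng.exponential(size=100)
print(bootstrap_pivotal_ci(x, np.mean))
```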
Figure 11.1: 50 points drawn from the model $Y_i = 1 + 2X_i - X_i^2 + \epsilon_i$ where $X_i \sim \mathrm{Uniform}(0, 2)$ and $\epsilon_i \sim N(0, 0.2^2)$. In this case, the maximum of the polynomial occurs at $\theta = 1$. The true and estimated curves are shown in the figure. At the bottom of the plot we show the 95 percent bootstrap confidence interval based on $B = 1{,}000$.
Theorem 139. Under appropriate regularity conditions, $P(\theta \in C_n) \to 1 - \alpha$ as $n \to \infty$.
11.4 Examples
Consider estimating the partial correlation between $X$ and $Y$ given $Z$,
$$\theta = \frac{\Omega_{12}}{\sqrt{\Omega_{11}\Omega_{22}}}$$
where $\Omega = \Sigma^{-1}$ and $\Sigma$ is the covariance matrix of $W = (X, Y, Z)^T$. The partial correlation measures the linear dependence between $X$ and $Y$ after removing the effect of $Z$. For illustration, suppose we generate the data as follows: we take $Z \sim N(0,1)$, $X = 10Z + \epsilon$ and $Y = 10Z + \delta$ where $\epsilon, \delta \sim N(0,1)$. The correlation between $X$ and $Y$ is very large. But the partial correlation is 0. We generated $n = 100$ data points from this model. The sample correlation was 0.99. However, the estimated partial correlation was $-0.16$, which is much closer to 0. The 95 percent bootstrap confidence interval is $[-0.33, 0.02]$, which includes the true value, namely, 0.
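A simulation along the lines of this example might look like the following sketch (NumPy assumed; the seed, helper names, and number of bootstrap replications are ours, so the numbers will not match those quoted above exactly):

```python
import numpy as np

def partial_corr(w):
    """theta = Omega_12 / sqrt(Omega_11 Omega_22) with Omega = Sigma^{-1}; w is n x 3."""
    omega = np.linalg.inv(np.cov(w, rowvar=False))
    return omega[0, 1] / np.sqrt(omega[0, 0] * omega[1, 1])

rng = np.random.default_rng(3)
n = 100
z = rng.normal(size=n)
x = 10 * z + rng.normal(size=n)
y = 10 * z + rng.normal(size=n)
w = np.column_stack([x, y, z])

theta_hat = partial_corr(w)
roots = np.empty(2000)
for j in range(roots.size):
    idx = rng.integers(0, n, size=n)          # resample rows with replacement (draws from P_n)
    roots[j] = np.sqrt(n) * (partial_corr(w[idx]) - theta_hat)
t_lo, t_hi = np.quantile(roots, [0.025, 0.975])
ci = (theta_hat - t_hi / np.sqrt(n), theta_hat - t_lo / np.sqrt(n))
print(theta_hat, ci)
```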
11.5 Why Does the Bootstrap Work?

To explain why the bootstrap works, let us begin with a heuristic. Let
$$F_n(t) = P\bigl(\sqrt{n}(\hat\theta_n - \theta) \le t\bigr)$$
and let
$$\hat F_n(t) = P\bigl(\sqrt{n}(\hat\theta_n^* - \hat\theta_n) \le t \mid X_1, \ldots, X_n\bigr).$$
The heuristic is that $\hat F_n$, which we can compute from the data, approximates $F_n$; hence quantiles of $\hat F_n$ can be used in place of the quantiles of $F_n$, which we do not know.
Now we will give more detail in a simple, special case. Suppose that $X_1, \ldots, X_n \sim P$ where $X_i$ has mean $\mu$ and variance $\sigma^2$. Suppose we want to construct a confidence interval for $\mu$. Let $\hat\mu_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ and define
$$F_n(t) = P\bigl(\sqrt{n}(\hat\mu_n - \mu) \le t\bigr). \qquad (11.3)$$
We do not know the cdf $F_n$. But suppose, for the moment, that an oracle gave us $F_n$. For any $0 < \gamma < 1$, define $z_\gamma = F_n^{-1}(\gamma)$. Define the oracle confidence interval
$$A_n = \left[\hat\mu_n - \frac{z_{1-\alpha/2}}{\sqrt{n}},\ \hat\mu_n - \frac{z_{\alpha/2}}{\sqrt{n}}\right]. \qquad (11.4)$$
We claim that $A_n$ is a $1-\alpha$ confidence interval. To see this, note that the probability that $A_n$ traps $\mu$ is
$$P(\mu \in A_n) = P\left(\hat\mu_n - \frac{z_{1-\alpha/2}}{\sqrt{n}} \le \mu \le \hat\mu_n - \frac{z_{\alpha/2}}{\sqrt{n}}\right)
= P\bigl(z_{\alpha/2} \le \sqrt{n}(\hat\mu_n - \mu) \le z_{1-\alpha/2}\bigr)
= F_n(z_{1-\alpha/2}) - F_n(z_{\alpha/2}) = \left(1 - \frac{\alpha}{2}\right) - \frac{\alpha}{2} = 1 - \alpha.$$
Unfortunately, we do not know $F_n$ but we can estimate it. The bootstrap estimate of $F_n$ is
$$\hat F_n(t) = P\Bigl(\sqrt{n}(\hat\mu_n^* - \hat\mu_n) \le t \,\Big|\, X_1, \ldots, X_n\Bigr)$$
where $\hat\mu_n^* = \frac{1}{n}\sum_{i=1}^{n} X_i^*$ and $X_1^*, \ldots, X_n^* \sim P_n$. The data $X_1, \ldots, X_n$ are treated as fixed during the bootstrap, which is why we write $\hat F_n$ as a conditional distribution.
Note that when we do the bootstrap algorithm, we are just approximating $\hat F_n(t)$ by
$$\overline F(t) = \frac{1}{B}\sum_{j=1}^{B} I\Bigl(\sqrt{n}\bigl(\hat\mu^*_{n,j} - \hat\mu_n\bigr) \le t\Bigr).$$
But
$$\sup_t |\overline F(t) - \hat F_n(t)| \to 0$$
almost surely as $B \to \infty$, so we ignore the difference between $\overline F$ and $\hat F_n$. The bootstrap confidence interval is
$$C_n = \left[\hat\mu_n - \frac{t_{1-\alpha/2}}{\sqrt{n}},\ \hat\mu_n - \frac{t_{\alpha/2}}{\sqrt{n}}\right] \qquad (11.5)$$
where $t_\gamma = \hat F_n^{-1}(\gamma)$. This is the same as the oracle confidence interval except that we have used $t_{\alpha/2}$ and $t_{1-\alpha/2}$ in place of $z_{\alpha/2}$ and $z_{1-\alpha/2}$. To show that $t_{\alpha/2} \approx z_{\alpha/2}$ and $t_{1-\alpha/2} \approx z_{1-\alpha/2}$, we need to show that $\hat F_n(t)$ approximates $F_n(t)$.
Theorem 142 (Bootstrap Theorem). Suppose that $\mu_3 = E|X_i|^3 < \infty$. Then
$$\sup_t |\hat F_n(t) - F_n(t)| = O_P\left(\frac{1}{\sqrt{n}}\right).$$
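A quick simulation is a useful sanity check on this approximation. The following sketch (NumPy assumed; the exponential population, the seed, and the number of replications are arbitrary illustrative choices) estimates the coverage of the interval $C_n$ of (11.5) for the mean:

```python
import numpy as np

rng = np.random.default_rng(4)
n, B, alpha, n_rep = 100, 1000, 0.05, 500
mu = 1.0                                        # true mean of an Exponential(1) population
covered = 0
for _ in range(n_rep):
    x = rng.exponential(size=n)
    mu_hat = x.mean()
    # sqrt(n)(mu*_j - mu_hat) for B bootstrap samples, drawn all at once
    boot_means = rng.choice(x, size=(B, n), replace=True).mean(axis=1)
    roots = np.sqrt(n) * (boot_means - mu_hat)
    t_lo, t_hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    covered += (mu_hat - t_hi / np.sqrt(n) <= mu <= mu_hat - t_lo / np.sqrt(n))
print(covered / n_rep)                          # should be close to 1 - alpha = 0.95
```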
Figure 11.2: The distribution $F_n(t) = P(\sqrt{n}(\hat\theta_n - \theta) \le t)$ is close to some limit distribution $L$. Similarly, the bootstrap distribution $\hat F_n(t) = P(\sqrt{n}(\hat\theta_n^* - \hat\theta_n) \le t \mid X_1, \ldots, X_n)$ is close to some limit distribution $\hat L$. Since $\hat L$ and $L$ are close, it follows that $F_n$ and $\hat F_n$ are close. In practice, we approximate $\hat F_n$ with its Monte Carlo version $\overline F$, which we can make as close to $\hat F_n$ as we like by taking $B$ large.
To prove this result, let us recall the Berry–Esseen Theorem from Chapter 2. For convenience, we repeat the theorem here.
Theorem 143 (Berry–Esseen Theorem). Let $X_1, \ldots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. Let $\mu_3 = E[|X_i - \mu|^3] < \infty$. Let $\overline X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ be the sample mean and let $Z_n = \sqrt{n}(\overline X_n - \mu)/\sigma$. Then
$$\sup_z \bigl|P(Z_n \le z) - \Phi(z)\bigr| \le \frac{33}{4}\,\frac{\mu_3}{\sigma^3\sqrt{n}} \qquad (11.6)$$
where $\Phi$ is the standard normal cdf.
Proof of Theorem 142. Let $\Phi_\sigma$ denote the cdf of a $N(0, \sigma^2)$ random variable and let $\Phi_{\hat\sigma}$ denote the cdf of a $N(0, \hat\sigma_n^2)$ random variable, where $\hat\sigma_n^2$ is the sample variance. Write
$$\sup_t |\hat F_n(t) - F_n(t)| \le \sup_t |F_n(t) - \Phi_\sigma(t)| + \sup_t |\Phi_\sigma(t) - \Phi_{\hat\sigma}(t)| + \sup_t |\hat F_n(t) - \Phi_{\hat\sigma}(t)| = I + II + III.$$
By the Berry–Esseen theorem, $I \le \frac{33\mu_3}{4\sigma^3\sqrt{n}}$. Applying the Berry–Esseen theorem conditionally on the data,
$$III = \sup_t |\hat F_n(t) - \Phi_{\hat\sigma}(t)| \le \frac{33\,\hat\mu_3}{4\,\hat\sigma_n^3\sqrt{n}}$$
where $\hat\mu_3 = \frac{1}{n}\sum_{i=1}^{n} |X_i - \hat\mu_n|^3$ is the empirical third moment. By the strong law of large numbers, $\hat\mu_3$ converges almost surely to $\mu_3$ and $\hat\sigma_n$ converges almost surely to $\sigma$. So, almost surely, for all large $n$, $\hat\mu_3 \le 2\mu_3$ and so $III \le \frac{33 \times 2\mu_3}{4\hat\sigma_n^3\sqrt{n}} = O_P(1/\sqrt{n})$. From the fact that $\hat\sigma_n - \sigma = O_P(\sqrt{1/n})$ it may be shown that $II = \sup_t |\Phi_\sigma(t) - \Phi_{\hat\sigma}(t)| = O_P(\sqrt{1/n})$. (This may be seen by Taylor expanding $\Phi_{\hat\sigma}(t)$ around $\sigma$.) This completes the proof. $\Box$
We have shown that $\sup_t |\hat F_n(t) - F_n(t)| = O_P\bigl(\tfrac{1}{\sqrt{n}}\bigr)$. From this, it may be shown that, for each $0 < \gamma < 1$, $t_\gamma - z_\gamma = O_P\bigl(\tfrac{1}{\sqrt{n}}\bigr)$. From this, one can prove Theorem 139.
So far we have focused on the mean. Similar theorems may be proved for more general
parameters. The details are complex so we will not discuss them here. We give a little more
information in the appendix. For a thorough treatment, we refer the reader to Chapter 23
of van der Vaart (1998).
11.6 A Few Remarks About the Bootstrap
1. The bootstrap is nonparametric but it does require some assumptions. You can’t
assume it is always valid. (See the appendix.)
2. The bootstrap is an asymptotic method. Thus the coverage of the confidence interval is $1 - \alpha + r_n$ where, typically, $r_n = C/\sqrt{n}$.
3. There is a related method called the jackknife where the standard error is estimated
by leaving out one observation at a time. However, the bootstrap is valid under
weaker conditions than the jackknife. See Shao and Tu (1995).
4. Another way to construct a bootstrap confidence interval is to set $C = [a, b]$ where $a$ is the $\alpha/2$ quantile of $\hat\theta_1^*, \ldots, \hat\theta_B^*$ and $b$ is the $1 - \alpha/2$ quantile. This is called the percentile interval. This interval seems very intuitive but does not have the same theoretical support as the interval $C_n$. However, in practice, the percentile interval and $C_n$ are often quite similar. (A small sketch comparing the two appears after this list.)
5. There are many cases where the bootstrap is not formally justified. This is especially true with discrete structures like trees and graphs. Nonetheless, the bootstrap can be used in an informal way to get some intuition about the variability of the procedure. But keep in mind that the formal guarantees may not apply in these cases. For example, see Holmes (2003) for a discussion of the bootstrap applied to phylogenetic trees.
6. There is a method related to the bootstrap called subsampling. In this case, we draw
samples of size m < n without replacement. Subsampling produces valid confidence
intervals under weaker conditions than the bootstrap. See Politis, Romano and Wolf
(1999).
7. There are many modifications of the bootstrap that lead to more accurate confidence
intervals; see Efron (1996).
8. There is also a parametric bootstrap. If $\{p(x; \theta) : \theta \in \Theta\}$ is a parametric model and $\hat\theta$ is an estimator, such as the maximum likelihood estimator, we sample $X_1^*, \ldots, X_n^*$ from $p(x; \hat\theta)$ instead of sampling from $P_n$.
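Here is the small sketch promised in remark 4 (NumPy assumed; the lognormal sample and the median are our illustrative choices), comparing the percentile interval with the pivotal interval $C_n$:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.lognormal(size=200)
n, B, alpha = len(x), 5000, 0.05
theta_hat = np.median(x)

theta_star = np.array([np.median(rng.choice(x, size=n, replace=True)) for _ in range(B)])

# Percentile interval: the alpha/2 and 1 - alpha/2 quantiles of the theta*'s.
percentile = np.quantile(theta_star, [alpha / 2, 1 - alpha / 2])

# Pivotal interval C_n, for comparison.
roots = np.sqrt(n) * (theta_star - theta_hat)
t_lo, t_hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
pivotal = (theta_hat - t_hi / np.sqrt(n), theta_hat - t_lo / np.sqrt(n))

print(percentile, pivotal)    # the two intervals are typically similar here
```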
11.7 The High-Dimensional Bootstrap

Now suppose that $X_1, \ldots, X_n \in \mathbb{R}^d$ where the dimension $d$ may be large. We assume that the distribution of $X_i$ is sub-Gaussian, although this is stronger than needed. This means that $E(e^{t^T X}) \le e^{c\|t\|^2}$ for some $c > 0$. Let $\mu = E[X_i] \in \mathbb{R}^d$. Here is a bootstrap algorithm for constructing a confidence set for $\mu$.
High-Dimensional Bootstrap Confidence Set

1. Draw a bootstrap sample $X_1^*, \ldots, X_n^* \sim P_n$ and compute $\hat\mu_n^* = \frac{1}{n}\sum_{i=1}^{n} X_i^*$.

2. Repeat the previous step $B$ times, yielding $\hat\mu^*_{n,1}, \ldots, \hat\mu^*_{n,B}$.

3. Let
$$\hat F_n(t) = \frac{1}{B}\sum_{j=1}^{B} I\Bigl(\sqrt{n}\,\bigl\|\hat\mu^*_{n,j} - \hat\mu_n\bigr\|_\infty \le t\Bigr).$$

4. Let
$$C_n = \left\{ a \in \mathbb{R}^d : \|a - \hat\mu_n\|_\infty \le \frac{t_\alpha}{\sqrt{n}} \right\}$$
where $t_\alpha = \hat F_n^{-1}(1-\alpha)$.

5. Output $C_n$.
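A sketch of this algorithm (NumPy assumed; the sample sizes, dimension, and helper name are ours for illustration):

```python
import numpy as np

def highdim_mean_confset(x, alpha=0.05, B=2000, rng=None):
    """Return mu_hat and the sup-norm radius t_alpha / sqrt(n) of the confidence set C_n."""
    rng = np.random.default_rng(rng)
    n, d = x.shape
    mu_hat = x.mean(axis=0)
    stats = np.empty(B)
    for j in range(B):
        x_star = x[rng.integers(0, n, size=n)]              # bootstrap sample from P_n
        stats[j] = np.sqrt(n) * np.max(np.abs(x_star.mean(axis=0) - mu_hat))
    t_alpha = np.quantile(stats, 1 - alpha)                  # t_alpha = F_n-hat^{-1}(1 - alpha)
    return mu_hat, t_alpha / np.sqrt(n)

rng = np.random.default_rng(6)
x = rng.normal(size=(200, 500))                              # n = 200, d = 500, true mu = 0
mu_hat, radius = highdim_mean_confset(x)
print(radius, np.max(np.abs(mu_hat)) <= radius)              # does C_n contain mu = 0?
```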
Theorem 144 (Chernozhukov, Chetverikov and Kato, 2014). Suppose that $d = o(e^{n^{1/8}})$. Then
$$P(\mu \in C_n) \ge 1 - \alpha - \frac{c \log d}{n^{1/8}}$$
for some $c > 0$.
Under the stated conditions, the same result applies to higher-order moments. If $\theta = g(\mu)$ for some function $g$ then we can get a confidence set for $\theta$ by applying $g$ to $C_n$. We call this the projected confidence set. That is, if we define $A_n = \{g(\mu) : \mu \in C_n\}$ then it follows that
$$P(\theta \in A_n) \ge 1 - \alpha - \frac{c \log d}{n^{1/8}}.$$
Alternatively, we can apply the bootstrap to $\sqrt{n}(g(\hat\mu) - g(\mu))$. However, we do not automatically get the same coverage guarantee that the projected set has.
Example 145. Let us consider constructing a confidence set for a high-dimensional covariance matrix. Let $X_1, \ldots, X_n \in \mathbb{R}^k$ be a random sample and let $\Sigma = \mathrm{Var}(X_i)$, which is a $k \times k$ matrix. There are $d = O(k^2)$ parameters here. Let $\hat\Sigma = \frac{1}{n}\sum_{i=1}^{n} (X_i - \overline X_n)(X_i - \overline X_n)^T$. Also, let $\beta = \mathrm{vec}(\Sigma)$ and $\hat\beta = \mathrm{vec}(\hat\Sigma)$, where vec takes a matrix and converts it into a vector by stacking the columns. We can then apply the bootstrap algorithm above to $\sqrt{n}(\hat\beta - \beta)$ to get the bootstrap quantile $t_\alpha$. Let $\ell_n = \hat\beta - t_\alpha/\sqrt{n}$ and $u_n = \hat\beta + t_\alpha/\sqrt{n}$. We can then unstack $\ell_n$ and $u_n$ into $k \times k$ matrices $L_n$ and $U_n$. It then follows that
$$P(L_n \le \Sigma \le U_n) \ge 1 - \alpha - \frac{c \log d}{n^{1/8}}$$
where $A \le B$ means that $A_{jk} \le B_{jk}$ for all $(j, k)$.
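A sketch of this vectorize-and-bootstrap construction (NumPy assumed; the dimensions, seed, and the standard normal population with $\Sigma = I$ are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 300, 10
x = rng.normal(size=(n, k))                                  # true Sigma is the identity
sigma_hat = np.cov(x, rowvar=False, bias=True)               # (1/n) sum (X_i - Xbar)(X_i - Xbar)^T
beta_hat = sigma_hat.ravel(order="F")                        # vec: stack the columns

B = 2000
stats = np.empty(B)
for j in range(B):
    xs = x[rng.integers(0, n, size=n)]
    beta_star = np.cov(xs, rowvar=False, bias=True).ravel(order="F")
    stats[j] = np.sqrt(n) * np.max(np.abs(beta_star - beta_hat))
t_alpha = np.quantile(stats, 0.95)

L_n = (beta_hat - t_alpha / np.sqrt(n)).reshape(k, k, order="F")   # unstack ell_n
U_n = (beta_hat + t_alpha / np.sqrt(n)).reshape(k, k, order="F")   # unstack u_n
print(np.all(L_n <= np.eye(k)) and np.all(np.eye(k) <= U_n))       # traps Sigma = I?
```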
11.9 The Permutation Test
In this section we discuss a nonparametric hypothesis testing method. The test is not based
on the bootstrap but we include it here because it is similar in spirit to the bootstrap. Let
$$X_1, \ldots, X_n \sim F, \qquad Y_1, \ldots, Y_m \sim G$$
be two independent samples and suppose we want to test the hypothesis
$$H_0: F = G \quad \text{versus} \quad H_1: F \ne G. \qquad (11.7)$$
The permutation test gives an exact (nonasymptotic), nonparametric method for testing this hypothesis. Let $Z = (X, Y)$ where $X = (X_1, \ldots, X_n)^T$ and $Y = (Y_1, \ldots, Y_m)^T$. Define a vector $W$ of length $N = n + m$ that indicates which group $Z_i$ is from. Thus, $W_i = 1$ if $i \le n$ and $W_i = 2$ if $i > n$. Let $T(Z, W)$ be a test statistic.
Permutation Test

1. Compute $t = T(Z, W)$.

2. Compute $T(Z, W_\pi)$ for every permutation $W_\pi$ of the labels $W$. (In practice, a large random sample of permutations is used.)

3. Compute the p-value
$$p = \frac{1}{N!}\sum_{\pi} I\bigl(T(Z, W_\pi) \ge t\bigr).$$

4. Reject $H_0$ if $p \le \alpha$.
The test is called exact since the probability of falsely rejecting the null hypothesis is less than or equal to $\alpha$. There is no large sample approximation here.
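A sketch of the test in code (NumPy assumed; the choice of test statistic, the absolute difference in sample means, the use of random permutations, and the add-one p-value adjustment are our choices for illustration):

```python
import numpy as np

def permutation_test(x, y, stat=lambda a, b: abs(a.mean() - b.mean()),
                     B=10_000, rng=None):
    """Permutation p-value for H0: F = G, using B random permutations of the labels."""
    rng = np.random.default_rng(rng)
    z = np.concatenate([x, y])
    n = len(x)
    t_obs = stat(x, y)
    count = 0
    for _ in range(B):
        perm = rng.permutation(z)                # permuting the pooled data permutes the labels
        count += stat(perm[:n], perm[n:]) >= t_obs
    return (1 + count) / (1 + B)                 # add-one version avoids a p-value of 0

rng = np.random.default_rng(8)
x = rng.normal(0.0, 1.0, size=50)
y = rng.normal(0.5, 1.0, size=60)
print(permutation_test(x, y))
```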
Remark: There is a bootstrap hypothesis test that is similar to the permutation test. The
advantage of the bootstrap test is that it is more general than the permutation test. The
disadvantage is that it is an approximate test, not an exact test. The bootstrap p-value based
on a statistic T = T (X) is
$$p = P_{F_0}(T^* > t) \qquad (11.10)$$
where $t = T(X)$, $T^* = T(X^*)$ and $X^*$ is drawn from the null distribution $F_0$. If the null hypothesis does not completely specify a distribution $F_0$ then we compute $p = P_{\hat F_0}(T^* > t)$ where $\hat F_0$ is an estimate of $F$ under the restriction that $F \in \mathcal{F}_0$, where $\mathcal{F}_0$ is the set of distributions
consistent with the null hypothesis. However, this is an approximate test while the permutation
test is exact.
Example 147. Gretton et al (2008) developed a two sample test based on reproducing kernel Hilbert spaces. The test statistic is
$$T = \frac{1}{n^2}\sum_{i,j=1}^{n} K(X_i, X_j) - \frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} K(X_i, Y_j) + \frac{1}{m^2}\sum_{i,j=1}^{m} K(Y_i, Y_j)$$
where $K$ is a symmetric kernel. Suppose we take $K = K_h(x, y) = e^{-\|x - y\|^2/(2h^2)}$ to be the Gaussian kernel. Rather than choosing a bandwidth $h$ we can simply define the test statistic to be the maximum over all bandwidths:
$$T = \sup_{h > 0}\left( \frac{1}{n^2}\sum_{i,j=1}^{n} K_h(X_i, X_j) - \frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} K_h(X_i, Y_j) + \frac{1}{m^2}\sum_{i,j=1}^{m} K_h(Y_i, Y_j) \right).$$
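A sketch of this statistic together with a permutation p-value (NumPy assumed; a finite grid of bandwidths stands in for the supremum over $h > 0$, and the sample sizes, seed, and add-one p-value adjustment are our illustrative choices):

```python
import numpy as np

def mmd(x, y, h):
    """Kernel two-sample statistic with Gaussian kernel K_h."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2 * h ** 2))
    n, m = len(x), len(y)
    return k(x, x).sum() / n**2 - 2 * k(x, y).sum() / (n * m) + k(y, y).sum() / m**2

def max_mmd(x, y, hs=(0.1, 0.5, 1.0, 2.0, 5.0)):
    return max(mmd(x, y, h) for h in hs)        # finite grid standing in for sup over h > 0

rng = np.random.default_rng(9)
x = rng.normal(size=(30, 2))
y = rng.normal(size=(30, 2)) + np.array([1.0, 0.0])   # shifted mean, so H0 is false
z = np.vstack([x, y])
t_obs = max_mmd(x, y)

B, count = 500, 0
for _ in range(B):
    idx = rng.permutation(len(z))
    count += max_mmd(z[idx[:len(x)]], z[idx[len(x):]]) >= t_obs
print((1 + count) / (1 + B))                    # permutation p-value
```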
Figure 11.3: Top left: X1 , . . . , Xn . Top right: Y1 , . . . , Ym . Bottom left: values of the test
statistic from 1,000 permutations.
It would be difficult to find a useful expression for the distribution of the test statistic T
under the null hypothesis H0 : F = G. However, we can compute the p-value easily using
the permutation test. Figure 11.3 shows an example. The top left plot shows $n = 10$ observations from $F$ and the top right plot shows $m = 10$ observations from $G$. (We took $F$ to be bivariate normal and $G$ to be a mixture of two normals.) The test statistic is 0.45 and the p-value, based on $B = 1{,}000$ permutations, is 0.006, suggesting that we should reject $H_0$. The bottom
left shows a histogram of the values of T from the 1,000 permutations. The vertical line is
the observed value of T . The p-value is the fraction of statistics greater than T .
11.10 Summary
The bootstrap provides nonparametric standard errors and confidence intervals. To draw a bootstrap sample we draw $n$ observations $X_1^*, \ldots, X_n^*$ from the empirical distribution $P_n$. This is equivalent to drawing $n$ observations with replacement from the original data $X_1, \ldots, X_n$. We then compute the estimator $\hat\theta^* = g(X_1^*, \ldots, X_n^*)$. If we repeat this whole process $B$ times, we can use the bootstrap replications $\hat\theta_1^*, \ldots, \hat\theta_B^*$ to estimate the variance of $\hat\theta_n$ and to construct confidence intervals.
11.11 Bibliographic Remarks

Further details on statistical functionals can be found in [51], [13], [52], [23] and [59]. The jackknife was invented by [47] and [58]. The bootstrap was invented by [20]. There are several books on these topics including [22], [13], [29] and [52]. Also, see Section 3.6 of [60].
Appendix
As another example, suppose that $\theta = T(P)$ is the variance of $X$. Let $\mu$ denote the mean. Then
$$\theta = E(X - \mu)^2 = \int (x - \mu)^2\, dP(x) = \int x^2\, dP(x) - \left( \int x\, dP(x) \right)^2.$$
For one more example, let $\theta$ be the $\alpha$ quantile of $X$. Here it is convenient to work with the cdf $F(x) = P(X \le x)$. Thus $\theta = T(P) = T(F) = F^{-1}(\alpha)$ where $F^{-1}(y) = \inf\{x : F(x) \ge y\}$. The empirical cdf is $F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(X_i \le x)$ and $\hat\theta_n = T(F_n) = \inf\{x : F_n(x) \ge \alpha\}$. In other words, $\hat\theta_n$ is just the corresponding sample quantile.
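For instance, a minimal sketch of the quantile plug-in estimator (NumPy assumed; the function name and sample are illustrative):

```python
import numpy as np

def plugin_quantile(x, alpha):
    """theta_hat = inf{ t : F_n(t) >= alpha }, i.e., the alpha sample quantile."""
    xs = np.sort(x)
    k = int(np.ceil(alpha * len(xs)))      # smallest k with F_n(x_(k)) = k/n >= alpha
    return xs[max(k - 1, 0)]

x = np.random.default_rng(10).normal(size=1000)
print(plugin_quantile(x, 0.25))
```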
Hadamard Differentiability. The key condition needed for the bootstrap is Hadamard differentiability. Let $\mathcal{P}$ denote all distributions on the real line and let $\mathcal{D}$ denote the linear space generated by $\mathcal{P}$. Write $T((1-\epsilon)P + \epsilon Q) = T(P + \epsilon D)$ where $D = Q - P \in \mathcal{D}$. The Gâteaux derivative of $T$ at $P$ in the direction $D$ is
$$L_P(D) = \lim_{\epsilon \downarrow 0} \frac{T(P + \epsilon D) - T(P)}{\epsilon}. \qquad (11.11)$$
Thus $T(P + \epsilon D) \approx T(P) + \epsilon L_P(D) + o(\epsilon)$ and the error term $o(\epsilon)$ goes to 0 as $\epsilon \to 0$. Hadamard differentiability requires that this error term be small uniformly over compact sets. Equip $\mathcal{D}$ with a metric $d$. $T$ is Hadamard differentiable at $P$ if there exists a linear functional $L_P$ on $\mathcal{D}$ such that for any $\epsilon_n \to 0$ and $\{D, D_1, D_2, \ldots\} \subset \mathcal{D}$ such that $d(D_n, D) \to 0$ and $P + \epsilon_n D_n \in \mathcal{P}$,
$$\lim_{n \to \infty} \left( \frac{T(P + \epsilon_n D_n) - T(P)}{\epsilon_n} - L_P(D_n) \right) = 0. \qquad (11.12)$$