Convergence Rates of Posterior Distributions - Ghosal, Ghosh, van der Vaart
the present paper. In particular, we are indebted to the part of Le Cam’s work
as it was extended by Birgé (1983) to general metric spaces.
Shortly after completing this paper, we learned of independent work by
Shen and Wasserman (1999), who also address rates of convergence.
The construction of prior measures on infinite-dimensional models is not
a trivial matter and has also received recent attention. This development
started with the introduction of Dirichlet processes by Ferguson (1973, 1974).
Given computing algorithms such as Markov chain Monte Carlo methods and
powerful computing machines, implementation of Bayesian methods has now
become feasible even for many complicated priors and infinite-dimensional
models.
In Section 2 we present a main result and several variations concerning the
rate of convergence of the posterior relative to the total variation, Hellinger
and $L_2$-metrics. In each case the two main elements characterizing the rate of
convergence are the size of the model (measured by covering numbers or existence
of certain tests) and the amount of prior mass given to a shrinking ball
around the true measure. Actually, the size of the model comes in only to guarantee
the existence of certain tests of the true measure versus the complement
of a shrinking ball around it, and conditions can be put in terms of such tests
instead. Conditions of this form go back to Schwartz (1965) and Le Cam (1973).
We discuss testing in Section 7, and reformulate our main result in terms of
tests in that section. The proofs of the main results are contained in Section 8,
following the discussion of the existence of tests. In Section 2 we also note that
a rate of convergence for the posterior automatically entails the existence of
point estimators with the same rate.
We apply the general result to several examples. In Section 3 we consider
discrete priors constructed on ε-nets over the model. In Section 4 we discuss
Bayes estimators based on the log-spline models for density estimation discussed
by Stone (1986). In Section 5 we consider finite-dimensional models.
In Section 6 we discuss applications to Dirichlet priors.
The notation $\lesssim$ is used to denote inequality up to a universal multiplicative
constant, or up to a constant that is fixed throughout. We define the Hellinger
distance $h(p, q)$, or $h(P, Q)$, between two probability densities or measures as
the $L_2(\mu)$-distance between the root densities $\sqrt{p}$ and $\sqrt{q}$. The total variation
distance is the $L_1(\mu)$-distance. (Some authors define these distances with an
additional factor 1/2.)
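In this notation, the two distances and the standard comparison inequalities used repeatedly below read
$$h^2(p, q) = \int \bigl(\sqrt{p} - \sqrt{q}\,\bigr)^2\, d\mu, \qquad \|p - q\|_1 = \int |p - q|\, d\mu,$$
$$h^2(p, q) \le \|p - q\|_1 \le 2\, h(p, q),$$
the second inequality following from the Cauchy–Schwarz inequality.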
$$\Pi_n\bigl(\mathcal{P} \setminus \mathcal{P}_n\bigr) \le \exp\bigl(-n\varepsilon_n^2 (C + 4)\bigr), \tag{2.3}$$
$$\Pi_n\Bigl(P : -P_0 \log \frac{p}{p_0} \le \varepsilon_n^2,\ P_0 \Bigl(\log \frac{p}{p_0}\Bigr)^2 \le \varepsilon_n^2\Bigr) \ge \exp\bigl(-n\varepsilon_n^2 C\bigr). \tag{2.4}$$
The first and third conditions of the theorem are the essential ones. Condition
(2.3) allows some additional flexibility, but should first be understood as
expressing that $\mathcal{P}_n$ is almost the support of the prior (in which case its left
side is zero and the condition is trivially satisfied).
Condition (2.2) requires that the “model” $\mathcal{P}_n$ be not too big. It is true for
every $\varepsilon_n' \ge \varepsilon_n$ as soon as it is true for $\varepsilon_n$ and can thus be seen as defining a
minimal possible value of $\varepsilon_n$. Condition (2.2) ensures the existence of certain
tests, as discussed in Section 7, and could be replaced by a testing condition.
Note that the metric d used here reappears in the assertion of the theorem.
Since the total variation metric is bounded above by twice the Hellinger metric,
the assertion of the theorem using the Hellinger metric is stronger, but also
condition (2.2) will be more restrictive, so that we really have two theorems.
In the case that the densities are uniformly bounded, we even have a third
theorem, when using the $L_2$-distance, which in that case will be bounded above
by a multiple of the Hellinger distance. If the densities are also uniformly
bounded and uniformly bounded away from zero, then these three distances
are equivalent and are also equivalent to the Kullback–Leibler number and
$L_2$-norm appearing in condition (2.4). See, for example, Lemmas 8.2 and 8.3
and (8.6).
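To record the comparison used here: if $\|p\|_\infty \vee \|q\|_\infty \le B$, then (a one-line computation, added for completeness)
$$\|p - q\|_2^2 = \int \bigl(\sqrt{p} - \sqrt{q}\,\bigr)^2 \bigl(\sqrt{p} + \sqrt{q}\,\bigr)^2\, d\mu \le 4B\, h^2(p, q),$$
so the $L_2$-distance is indeed bounded above by a multiple of the Hellinger distance.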
A rate $\varepsilon_n$ satisfying (2.2) for $\mathcal{P}_n = \mathcal{P}$ and $d$ the Hellinger metric is often
viewed as giving the “optimal” rate of convergence for estimators of $P$ relative
to the Hellinger metric, given the model $\mathcal{P}$. Under certain conditions, such
as likelihood ratios bounded away from zero and infinity, this is proved as a
theorem by Birgé (1983) and Le Cam (1973, 1986). From Birgé’s work it is clear
that condition (2.2) is the correct expression of the complexity of the model,
as relating to estimating the true density relative to the Hellinger distance,
if this is to be given in terms of metric entropy. A weaker, but more involved,
condition is in terms of the existence of certain tests. We give a generalization
of the theorem using tests in Section 7.
Condition (2.4) is the other main determinant of the posterior rate given
by the theorem. It requires that the prior measures put a sufficient amount
of mass near the true measure $P_0$. Here “near” is measured through a combination
of the Kullback–Leibler divergence of $p$ and $p_0$ and the $L_2(P_0)$-norm
of $\log(p/p_0)$. Again this condition is satisfied for $\varepsilon_n' \ge \varepsilon_n$ if it is satisfied for
$\varepsilon_n$ and thus is another restriction on a minimal value of $\varepsilon_n$. The form of this
condition can be motivated from entropy considerations. Suppose that we wish
to satisfy (2.4) for the minimal $\varepsilon_n$ satisfying (2.2) with $\mathcal{P}_n = \mathcal{P}$, that is, for
the optimal rate of convergence for the model. Furthermore, for the sake of
the argument assume that all distances used are equivalent. Then a minimal
$\varepsilon_n$-cover of $\mathcal{P}$ consists of $\exp(n\varepsilon_n^2)$ balls. If the prior $\Pi_n$ would spread
its mass uniformly over $\mathcal{P}$, then every ball would obtain mass approximately
$\exp(-Cn\varepsilon_n^2)$. (The constant $C$ expresses the constants in comparing the distances
and the fact that the balls of radius $\varepsilon_n$ may overlap.) On the other hand,
if $\Pi_n$ is not “uniform,” then we should expect (2.4) to fail for some $P_0 \in \mathcal{P}$. Here
we must admit that “uniform” priors do not exist in infinite-dimensional models
and actually condition (2.4) is stronger than needed and will be improved
ahead in Theorem 2.4. However, a rough implication of the condition is that $\Pi_n$
should be “uniformly spread” in order for the posterior distribution to attain
the optimal rate of convergence.
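The arithmetic behind this heuristic is simple: if $\mathcal{P}$ is covered by $N_n = \exp(n\varepsilon_n^2)$ balls of radius $\varepsilon_n$ and $\Pi_n$ assigns each ball mass roughly $1/N_n$, then a ball around $P_0$ receives mass at least
$$\frac{1}{N_n} = \exp\bigl(-n\varepsilon_n^2\bigr),$$
which is (2.4) with $C = 1$, up to the constants absorbed in comparing the distances.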
Condition (2.3), combined with (2.2), can be interpreted as saying that a
part of $\mathcal{P}$ that barely receives prior mass need not be small. The sets $\mathcal{P}_n$
may be thought of as “sieves” approximating the parameter space, which capture
most of the prior probability. This type of condition has received much
attention in the discussion of consistency issues [see Barron, Schervish and
Wasserman (1999)], but plays a smaller role in the present paper. Of course,
condition (2.3) is trivially satisfied for $\mathcal{P}_n = \mathcal{P}$; we can make this choice if
condition (2.2) holds with $\mathcal{P}_n = \mathcal{P}$ itself.
The assertion of the theorem is an in-probability statement that the pos-
terior mass outside a large ball of radius proportional to εn is approximately
zero. The in-probability statement can be improved to an almost sure asser-
tion, but under stronger conditions. We present two results.
Let $h$ be the Hellinger distance and write $\log_+ x$ for $\log x \vee 0$.
Theorem 2.2. Suppose that conditions (2.2) and (2.3) hold as in the preceding
theorem and in addition $\sum_n \exp(-Bn\varepsilon_n^2) < \infty$ for every $B > 0$ and
$$\Pi_n\Bigl(P : h^2(P, P_0)\, \Bigl\|\frac{p_0}{p}\Bigr\|_\infty \le \varepsilon_n^2\Bigr) \ge \exp\bigl(-n\varepsilon_n^2 C\bigr). \tag{2.5}$$
Then for sufficiently large $M$, we have that $\Pi_n\bigl(P : d(P, P_0) \ge M\varepsilon_n \mid X_1, \dots, X_n\bigr) \to 0$ almost surely $[P_0^\infty]$.
To improve this situation we must refine both the entropy condition (2.2) and
the prior mass condition (2.4). The following generalization of Theorem 2.1 is
more complicated but does yield the right result in the finite-dimensional situation.
It is essential for our examples using spline approximations in Section 4.
Theorems 2.2 and 2.3 can be generalized similarly. Let
$$B_n(\varepsilon) = \Bigl\{P : -P_0 \log \frac{p}{p_0} \le \varepsilon^2,\ P_0 \Bigl(\log \frac{p}{p_0}\Bigr)^2 \le \varepsilon^2\Bigr\}.$$
Theorem 2.4. Suppose that for a sequence $\varepsilon_n$ with $\varepsilon_n \to 0$ and such that
$n\varepsilon_n^2$ is bounded away from zero, every sufficiently large $j$ and sets $\mathcal{P}_n \subset \mathcal{P}$, we
have
$$\log D\Bigl(\frac{\varepsilon}{2},\ \bigl\{P \in \mathcal{P}_n : \varepsilon \le d(P, P_0) \le 2\varepsilon\bigr\},\ d\Bigr) \le n\varepsilon_n^2 \quad \text{for every } \varepsilon \ge \varepsilon_n, \tag{2.7}$$
$$\frac{\Pi_n\bigl(\mathcal{P} \setminus \mathcal{P}_n\bigr)}{\Pi_n\bigl(B_n(\varepsilon_n)\bigr)} = o\bigl(\exp(-2n\varepsilon_n^2)\bigr), \tag{2.8}$$
$$\frac{\Pi_n\bigl(P : j\varepsilon_n < d(P, P_0) \le 2j\varepsilon_n\bigr)}{\Pi_n\bigl(B_n(\varepsilon_n)\bigr)} \le \exp\bigl(Kn\varepsilon_n^2 j^2/2\bigr). \tag{2.9}$$
Here $K$ is the universal testing constant appearing in (7.1) and (7.2). Then for
every $M_n \to \infty$, we have that $\Pi_n\bigl(P : d(P, P_0) \ge M_n\varepsilon_n \mid X_1, \dots, X_n\bigr) \to 0$ in
$P_0^n$-probability.
for appropriate loss functions $\ell_n$. Such estimators are called formal Bayes
estimators in Le Cam (1986).
On the one hand, Theorem 2.5 shows that we can construct good estimators
from the posterior if the posterior converges at a good rate. On the other hand,
it shows that the posterior cannot converge at a rate faster than the optimal
rate of convergence for point estimators. We use this argument in a number
of examples to show that the posterior converges at the best possible rate. Of
course, our arguments have nothing to say about the best possible constants.
Furthermore, for many priors the rate may be suboptimal.
and hence the present bracketing numbers are bigger than the packing numbers
$D(\varepsilon, \mathcal{P}, h)$ defined previously [see (2.1)]. However, in many examples
there is also an inequality in the other direction, up to a constant, and bracketing
and packing numbers give equivalent results. The corresponding bracketing
entropy is defined as the logarithm of the bracketing number $N_{[\,]}(\varepsilon, \mathcal{P}, h)$.
We shall construct a discrete prior supported on densities constructed from
minimal sets of brackets for the Hellinger distance. For a given number $\varepsilon_n > 0$,
let $\Pi_n$ be the uniform discrete measure on the $N_{[\,]}(\varepsilon_n, \mathcal{P}, h)$ densities obtained
by covering $\mathcal{P}$ with a minimal set of $\varepsilon_n$-brackets and next renormalizing the
upper bounds of the brackets to integrate to 1. Thus if $(l_1, u_1), \dots, (l_N, u_N)$
are the $N = N_{[\,]}(\varepsilon_n, \mathcal{P}, h)$ brackets, then $\Pi_n$ is the uniform measure on the
$N$ functions $u_j / \int u_j\, d\mu$. Next set
$$\Pi = \sum_{n \in \mathbb{N}} \lambda_n \Pi_n$$
for a given sequence $(\lambda_n)$ with $\lambda_n \ge 0$ and $\sum_n \lambda_n = 1$.
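A minimal numerical sketch of this construction (ours, not from the paper; the grid representation, the helper name and the choice $\lambda_n \propto 2^{-n}$ are assumptions made for illustration):

```python
import numpy as np

def bracket_prior(upper_brackets, grid):
    """Uniform prior on the renormalized upper brackets u_j / int u_j dmu.

    upper_brackets: array of shape (N, len(grid)) holding the functions
    u_j sampled on `grid`. Returns the renormalized densities and their
    uniform prior weights (the measure Pi_n above).
    """
    u = np.asarray(upper_brackets, dtype=float)
    norms = np.trapz(u, grid, axis=1)          # int u_j dmu, grid approximation
    densities = u / norms[:, None]             # renormalize to integrate to 1
    weights = np.full(len(u), 1.0 / len(u))    # uniform measure on N points
    return densities, weights

# The overall prior Pi = sum_n lambda_n Pi_n, with, e.g., lambda_n = 2**(-n).
```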
Proof. The prior $\Pi$ gives probability 1 to the set $\mathcal{Q} = \bigcup_{j=1}^\infty \mathcal{Q}_j$ for $\mathcal{Q}_j$ the
$N_{[\,]}(\varepsilon_j, \mathcal{P}, h)$ functions in the support of $\Pi_j$. We claim that
$$D\bigl(8\varepsilon_n, \mathcal{Q}, h\bigr) \le \exp\bigl(2n\varepsilon_n^2\bigr). \tag{3.1}$$
To see this, we first note that, given an $\varepsilon$-bracket $(l, u)$ that contains a probability
density $p$, with $\|\cdot\|_2$ the norm in $L_2(\mu)$,
$$1 \le \Bigl(\int u\, d\mu\Bigr)^{1/2} = \|\sqrt{u}\|_2 \le \|\sqrt{u} - \sqrt{p}\|_2 + \|\sqrt{p}\|_2 = h(u, p) + 1 \le 1 + \varepsilon,$$
$$h\Bigl(p, \frac{u}{\int u\, d\mu}\Bigr) \le h(p, u) + h\Bigl(u, \frac{u}{\int u\, d\mu}\Bigr) \le 2\varepsilon.$$
By virtue of the relationship (2.1) between covering numbers and packing numbers,
we obtain (3.1). This verifies condition (2.2) with $\mathcal{P}_n$ taken equal to $\mathcal{Q}$
and $\varepsilon_n$ taken equal to eight times the present $\varepsilon_n$.
If $u$ is the upper limit of the $\varepsilon_n$-bracket containing $p_0$, then
$$\Bigl\|\frac{p_0}{u / \int u\, d\mu}\Bigr\|_\infty \le \int u\, d\mu \le (1 + \varepsilon_n)^2.$$
It follows that for large $n$ the set of points $p$ such that $h^2(p, p_0)\, \|p_0/p\|_\infty \le 8\varepsilon_n^2$
contains at least the function $u / \int u\, d\mu$ and hence has prior mass at least
$$\lambda_n\, \frac{1}{N_{[\,]}(\varepsilon_n, \mathcal{P}, h)} \ge \exp\bigl(-n\varepsilon_n^2 - O(\log n)\bigr) \ge \exp\bigl(-2n\varepsilon_n^2\bigr)$$
for large $n$. This verifies condition (2.5) for $\varepsilon_n$ a multiple of the present $\varepsilon_n$.
Since condition (2.3) is trivially satisfied for $\mathcal{P}_n = \mathcal{Q}$, the proof is complete. ✷
There are many specific examples in which the preceding theorem applies.
The situation here is similar to that in recent papers on rates of convergence
of (sieved) maximum likelihood estimators, as in Birgé and Massart (1993,
1997, 1998), Wong and Shen (1995) or Chapter 3.4 of van der Vaart and
Wellner (1996). It is interesting to note that these authors also use brack-
ets, whereas Birgé (1983), in his study of the metric entropy of statistical
models, uses ε-nets. This is because the cited papers are concerned with a
particular type of estimator (namely, minimum contrast estimators), whereas
Birgé (1983) uses special constructs, called d-estimators. It appears that for
good behavior of Bayes estimators on nets we also need some special property
of the nets, such as available from nets obtained from brackets.
We include two concrete examples.
Inspection of the proof of the theorem shows that the lower bounds of the
brackets are not really needed. The theorem can be generalized by defining
upper bracketing numbers $N_{]}(\varepsilon, \mathcal{P}, h)$ as the minimal number of functions
$u_1, \dots, u_m$ such that for every $p \in \mathcal{P}$ there exists a function $u_i$ such that
both $p \le u_i$ and $h(u_i, p) < \varepsilon$. Next we construct a prior as before. These
upper bracketing numbers are clearly smaller than the bracketing numbers
$N_{[\,]}(\varepsilon, \mathcal{P}, h)$. We have formulated the theorem using the better known bracketing
numbers, because we do not know any example where this generalization
could be useful.
The preceding theorem implicitly requires that the model $\mathcal{P}$ be totally
bounded for the Hellinger metric. A simple modification works for countable
unions of totally bounded models, provided that we use a sequence of priors.
Suppose that the bracketing numbers of $\mathcal{P}$ are infinite, but there exist subsets
$\mathcal{P}_n \uparrow \mathcal{P}$ with finite bracketing numbers. Let $\varepsilon_n$ be numbers such that
$\log N_{[\,]}(\varepsilon_n, \mathcal{P}_n, h) \le n\varepsilon_n^2$. Then we construct $\Pi_n$ as the discrete uniform distribution
on the renormalized upper brackets of a minimal set of $\varepsilon_n$-brackets over
$\mathcal{P}_n$, as before. Then the posterior relative to the prior $\Pi_n$ achieves the convergence
rate $\varepsilon_n$. (Note that this time we do not construct a fixed prior $\Pi = \sum_n \lambda_n \Pi_n$,
but use the prior $\Pi_n$ when $n$ observations are available.)
In the preceding we start with a condition on the entropies with bracketing
even though we apply Theorem 2.2, which demands control over metric
entropies only. This is because Theorem 2.2 also requires control over the likelihood
ratios. If, for instance, the densities are uniformly bounded away
from zero and infinity, so that the quotients $p_0/p$ are uniformly bounded, then
we can replace the bracketing entropy in Theorem 3.1 by ordinary entropy.
Alternatively, if the set of densities possesses an integrable envelope function,
then we can construct priors achieving the rate $\varepsilon_n$ determined by the
covering numbers up to logarithmic factors. Here we define $\varepsilon_n$ as the minimal
solution of the equation $\log N(\varepsilon, \mathcal{P}, h) \le n\varepsilon^2$, where $N(\varepsilon, \mathcal{P}, h)$ denotes the
Hellinger covering number (without bracketing). The construction, described
briefly below, parallels Theorem 6 of Wong and Shen (1995) for sieved maximum
likelihood estimators.
We assume that the set of densities $\mathcal{P}$ has a $\mu$-integrable envelope function:
a measurable function $m$ with $\int m\, d\mu < \infty$ such that $p \le m$ for every $p \in \mathcal{P}$.
Given $\varepsilon_n > 0$ let $s_{1,n}, \dots, s_{N_n,n}$ be a minimal $\varepsilon_n$-net over $\mathcal{P}$ [hence $N_n = N(\varepsilon_n, \mathcal{P}, h)$] and put
$$g_{j,n} = \bigl(s_{j,n}^{1/2} + \varepsilon_n m^{1/2}\bigr)^2 / c_{j,n},$$
where $c_{j,n}$ is a constant ensuring that $g_{j,n}$ is a probability density. Finally, let
$\Pi_n$ be the uniform discrete measure on $g_{1,n}, \dots, g_{N_n,n}$ and let $\Pi = \sum_{n=1}^\infty \lambda_n \Pi_n$
be a convex combination of the $\Pi_n$ as before.
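As a small illustration (our own sketch, with the net point and envelope represented on a grid; the helper name is hypothetical):

```python
import numpy as np

def perturbed_net_density(s, m, eps, grid):
    """Compute g = (sqrt(s) + eps*sqrt(m))^2 / c for a net point s and
    envelope m on `grid`, with c the normalizing constant c_{j,n}."""
    g = (np.sqrt(s) + eps * np.sqrt(m)) ** 2
    c = np.trapz(g, grid)                  # grid approximation of int g dmu
    return g / c
```

The point of adding $\varepsilon_n m^{1/2}$ is to keep the ratios $p/g_{j,n}$ under control, which is what the proof below exploits.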
Proof. The proof follows as before, but this time we apply Theorem 2.1,
using the observation of Wong and Shen (1995) that for any $p \in \mathcal{P}$ such that
$h(p, s_{j,n}) \le \varepsilon_n$ we have that $h(p, g_{j,n}) = O(\varepsilon_n)$ and that $p/g_{j,n}$ is bounded
above by a multiple of $\varepsilon_n^{-2}$. This verifies (2.4) with $\varepsilon_n$ replaced by a multiple of
$\varepsilon_n \log(1/\varepsilon_n)$ through a use of Theorem 5 of Wong and Shen (1995), the relevant
part of which is reproduced below as Lemma 8.6. ✷
as defined on page 108 of de Boor (1978). The exact nature of these functions
does not matter to us here, except for the following properties [cf. de Boor
(1978), pages 109 and 110]:
1. $B_j \ge 0$, $j = 1, \dots, J$;
2. $\sum_{j=1}^J B_j \equiv 1$;
3. at most $q$ of the $B_j(x)$ are nonzero at every given $x$.
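These properties are easy to check numerically; the following sketch (ours, not from the paper; the clamped knot sequence is an assumption) verifies them for a B-spline basis of order $q$ on $[0, 1]$ using scipy:

```python
import numpy as np
from scipy.interpolate import BSpline

q, K = 4, 10                                  # order q, K equal subintervals
interior = np.linspace(0.0, 1.0, K + 1)
t = np.r_[[0.0] * (q - 1), interior, [1.0] * (q - 1)]  # clamped knot sequence
x = np.linspace(0.0, 1.0, 200, endpoint=False)
B = BSpline.design_matrix(x, t, q - 1).toarray()       # J = K + q - 1 columns
assert (B >= 0).all()                                  # property 1
assert np.allclose(B.sum(axis=1), 1.0)                 # property 2: partition of unity
assert (np.count_nonzero(B, axis=1) <= q).all()        # property 3: local support
```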
Because we assume that $p_0$ is bounded away from zero (and infinity), the
function $p_0$ is in $C^\alpha[0, 1]$ if and only if $\log p_0 \in C^\alpha[0, 1]$.
It is easy to see from this, as we show in part below, that the root of the
Kullback–Leibler divergence and the Hellinger distance between $p_0$ and the
closest $p_\theta$ are of the order $J^{-\alpha}$ as well. Since a ball of radius $\varepsilon_n$ around $p_0$
must contain prior mass in order to satisfy (2.9), the rate of convergence $\varepsilon_n$
of the posterior can certainly not be faster than $J^{-\alpha}$. The minimum distance
of alternatives to allow appropriate tests, determined by (2.7), will be shown
to satisfy $n\varepsilon_n^2 \gtrsim J_n$. Together with the previous restriction on $\varepsilon_n$ this will
yield a rate of convergence of $n^{-\alpha/(2\alpha+1)}$, for $J_n \sim n^{1/(2\alpha+1)}$. This is also the
rate of convergence of the sieved maximum likelihood estimator, found by
Stone (1990). It is well known that this rate is optimal for $\alpha$-smooth densities.
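The balancing arithmetic (a one-line check, added for orientation): the two requirements are $\varepsilon_n \gtrsim J_n^{-\alpha}$ (prior mass near $p_0$) and $n\varepsilon_n^2 \gtrsim J_n$ (entropy/testing), and
$$J_n \sim n^{1/(2\alpha+1)} \quad\Longrightarrow\quad J_n^{-\alpha} \sim n^{-\alpha/(2\alpha+1)}, \qquad \sqrt{J_n/n} \sim n^{-\alpha/(2\alpha+1)},$$
so both restrictions are met by $\varepsilon_n$ a multiple of $n^{-\alpha/(2\alpha+1)}$.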
To make this precise we start with stating some lemmas that connect distances
and norms on the densities $p_\theta$ with the $J$-dimensional Euclidean norm
$\|\theta\|$ and infinity norm $\|\theta\|_\infty = \max_j |\theta_j|$. Let $\|f\|$ be the $L_2[0, 1]$-norm of $f$ and
write $a \lesssim b$ if $a \le Cb$ for a constant $C$ that is universal or depends only on $q$
(which is fixed throughout) and not on $K$. Most of these are known from or
implicit in Stone (1986, 1990) or the literature on approximation theory.
Proof. The first inequality is proved by de Boor [(1978), page 156, Corollary
3]. The second is immediate from the fact that the B-spline basis forms a
partition of unity. The third and fourth inequalities are stated in Stone [(1986),
equation (12)]. As their full proofs are not in one place, we sketch the argument
for completeness.
Let $I_i$ be the interval $\bigl[((i - q)/K) \vee 0,\ (i/K) \wedge 1\bigr]$. By (2) on page 155 of
de Boor (1978), we have
$$\sum_i \theta_i^2 \lesssim \sum_i \bigl\|\theta^T B\bigr\|_{\infty; I_i}^2 \lesssim K \sum_i \bigl\|\theta^T B\bigr\|_{2; I_i}^2.$$
The last inequality follows, because $\theta^T B$ restricted to $I_i$ consists of at most $q$ polynomial
pieces, each on an interval of length $1/K$, and the supremum norm of a
polynomial of order $q$ on an interval of length $L$ is bounded by $1/\sqrt{L}$ times
the $L_2$-norm, up to a constant depending on $q$. [To see the last: the squared
$L_2[0, 1]$-norm of the polynomial $x \mapsto \sum_{j=0}^{q-1} \alpha_j x^j$ on $[0, 1]$ is the quadratic form
$\alpha^T \bigl(EU_q U_q^T\bigr) \alpha$ for $U_q = (1, U, \dots, U^{q-1})^T$ and $U$ a uniform $[0, 1]$ variable. The
second moment matrix $EU_q U_q^T$ is nonsingular and hence the quadratic form
is bounded below by a constant times $\|\alpha\|^2 \ge \|\alpha\|_\infty^2$.] This yields the third
inequality.
By property (3) of the B-spline basis at most $q$ elements $B_j(x)$ are nonzero
for every given $x$, say for $j \in J(x)$. Therefore, by the Cauchy–Schwarz inequality,
$$\bigl(\theta^T B(x)\bigr)^2 = \Bigl(\sum_{j \in J(x)} \theta_j B_j(x)\Bigr)^2 \le q \sum_{j \in J(x)} \theta_j^2 B_j^2(x).$$
Proof. By the second inequality in Lemma 4.2 we have that $\|\theta^T B\|_\infty \le \|\theta\|_\infty$, whence $e^{c(\theta)}$ is contained
in the interval $[e^{-M}, e^{M}]$ for $M = \|\theta\|_\infty$, by its
definition, so that $|c(\theta)| \le \|\theta\|_\infty$. Consequently, by the triangle inequality,
$$\|\log p_\theta\|_\infty = \|\theta^T B - c(\theta)\|_\infty \le 2\|\theta\|_\infty.$$
This yields the inequality on the right.
For the inequality on the left, we note that, since $\theta^T \mathbf{1} = 0$,
$$|c(\theta)| = \Bigl|\bigl(\theta - c(\theta)\mathbf{1}\bigr)^T \frac{\mathbf{1}}{J}\Bigr| \le \bigl\|\theta - c(\theta)\mathbf{1}\bigr\|_\infty \frac{\|\mathbf{1}\|_1}{J} \lesssim \bigl\|\bigl(\theta - c(\theta)\mathbf{1}\bigr)^T B\bigr\|_\infty = \|\log p_\theta\|_\infty,$$
where the first inequality of Lemma 4.2 is applied to $\theta - c(\theta)\mathbf{1}$.
where the infimum and supremum are taken over all $\theta$ on the line segment
between $\theta_1$ and $\theta_2$ and all $x \in [0, 1]$.
for $\tilde{\theta}$ and $\tilde{\tilde{\theta}}$ vectors on the line segment between $\theta_1$ and $\theta_2$ and $\ddot{c}(\theta)$ the Hessian
of $c$. By the well-known properties of exponential families, we have
$$\tau^T \ddot{c}(\theta) \tau = \operatorname{var}_\theta\bigl(\tau^T B\bigr) = \inf_{\mu \in \mathbb{R}} \int_0^1 \bigl((\tau - \mu\mathbf{1})^T B(x)\bigr)^2 p_\theta(x)\, dx.$$
By combining these lemmas we see that the Hellinger distance $h(P_{\theta_1}, P_{\theta_2})$
and $1/\sqrt{J}$ times the $J$-dimensional Euclidean distance $\|\theta_1 - \theta_2\|$ are proportional,
uniformly in $J$ and in $\theta_1, \theta_2$ having uniformly bounded coordinates.
This combined with the estimate on the distance of $p_0$ to the set of $p_\theta$ given
by Lemma 4.1 reduces the verification of (2.7) and (2.9) to calculations in the
Euclidean setting.
We are now ready to prove the following theorem. By Lemma 4.3 there
exists a constant $d$ such that $d\|\theta\|_\infty \le \|\log p_\theta\|_\infty$ for every $\theta \in \mathbb{R}^J$ with $\theta^T \mathbf{1} = 0$
and every $J \in \mathbb{N}$. We shall assume that the prior is chosen as roughly uniform
on a large box $[-M, M]^J$. This corresponds to densities $p_\theta$ that are bounded
and bounded away from zero by at least a small constant.
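A sketch of such a prior draw (our own illustration; the mean-centering is just one way to impose $\theta^T \mathbf{1} = 0$, and the knot sequence `t` is as in the B-spline check above):

```python
import numpy as np
from scipy.interpolate import BSpline

def sample_log_spline_density(J, M, t, q, grid, rng):
    """Draw theta roughly uniform on [-M, M]^J with theta^T 1 = 0 and
    return p_theta = exp(theta^T B - c(theta)) evaluated on `grid`."""
    assert J == len(t) - q                     # J basis functions of order q
    theta = rng.uniform(-M, M, size=J)
    theta -= theta.mean()                      # one way to impose theta^T 1 = 0
    B = BSpline.design_matrix(grid, t, q - 1).toarray()
    f = B @ theta                              # theta^T B(x) on the grid
    c = np.log(np.trapz(np.exp(f), grid))      # c(theta): log normalizing constant
    return np.exp(f - c)
```

A draw is obtained with, for example, `rng = np.random.default_rng()` and a fine grid in $[0, 1]$.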
Theorem 4.5. Suppose that $\Pi_n$ has a density with respect to Lebesgue measure
on $\{\theta \in \mathbb{R}^{J_n} : \theta^T \mathbf{1} = 0\}$ for $J_n \sim n^{1/(2\alpha+1)}$ whose minimum and maximum
values on $[-M, M]^{J_n}$ are bounded below and above by terms of the orders $c^{J_n}$
and $C^{J_n}$, respectively, for positive constants $c, C$, and which vanishes outside
$[-M, M]^{J_n}$. Let $M \ge 1$. Then for every $p_0 \in C^\alpha[0, 1]$ for $q \ge \alpha \ge 1/2$ such that
$\|\log p_0\|_\infty \le \frac{1}{2} dM$, the conditions of Theorem 2.4 are satisfied for $\varepsilon_n$ a large
multiple of $n^{-\alpha/(2\alpha+1)}$ and $\mathcal{P}_n$ the support of $\Pi_n$, and hence the posterior rate of
convergence is $n^{-\alpha/(2\alpha+1)}$.
By Lemma 4.1 there exists $\theta^*$ such that $\|\theta^{*T} B - \log p_0\|_\infty \lesssim J^{-\alpha}$. Taking
exponentials we see that this implies that $\|\exp(\theta^{*T} B) - p_0\|_\infty \lesssim J^{-\alpha}$, and next,
by integrating this inequality, that $|\exp(c(\theta^*)) - 1| \lesssim J^{-\alpha}$, whence $|c(\theta^*)| \lesssim J^{-\alpha}$.
Next, we have, with $\mathrm{vol}_J$ the volume of the $(J-1)$-dimensional unit ball,
$$\Pi_n\bigl(P_\theta : h(P_\theta, P_0) \le 2j\varepsilon,\ \|\theta\|_\infty \le M\bigr) \le \sup_{\|\theta\|_\infty \le M} \pi_n(\theta)\, \bigl(2C\sqrt{J}\, j\varepsilon + J^{-\alpha} M\bigr)^J\, \mathrm{vol}_J.$$
By Lemma 4.3 and the assumption that $p_0$ is bounded, the norms $\|p_0/p_\theta\|_\infty$
are uniformly bounded over $\theta$ ranging over a set of bounded $\|\theta\|_\infty$. Therefore,
$$\ge \frac{\Gamma(m)}{\prod_{i=1}^{N} \Gamma(\alpha_i)} \prod_{i=1}^{N-1} \int_{\max(x_{i0} - \varepsilon^2,\, 0)}^{x_{i0}} x_i^{\alpha_i - 1}\, dx_i,$$
where $m = \sum_{i=1}^N \alpha_i$. We use here that $\bigl(1 - \sum_{i=1}^{N-1} x_i\bigr)^{\alpha_N - 1} \ge 1$, since $\alpha_N \le 1$. Similarly, since $\alpha_i \le 1$
for every $i$, we can lower bound the integrand by 1 and note that the interval of
integration contains at least an interval of length $\varepsilon^2$. Since $\alpha\Gamma(\alpha) = \Gamma(\alpha + 1) \le 1$
for $0 < \alpha \le 1$, so that $1/\Gamma(\alpha_i) \ge \alpha_i$, we can bound the last display from below by
$$\Gamma(m)\, \varepsilon^{2(N-1)} \prod_{i=1}^N \alpha_i \ge \Gamma(A)\, \varepsilon^{2(N-1)} (A\varepsilon)^N \ge C \exp\Bigl(-cN \log \frac{1}{\varepsilon}\Bigr).$$
This concludes the proof. ✷
This is true both for d equal to the total variation distance and for d equal to
the Hellinger distance. (The constant 2 has no particular interest and is not
optimal; any constant bigger than 1 is possible and would do for our purposes.)
More generally, it is known from Birgé (1984) and Le Cam (1986) (see
Lemma 4 on page 478) that given any two convex sets $\mathcal{P}_0$ and $\mathcal{P}_1$ of probability
measures, there exist tests $\phi_n$ such that
$$\sup_{P \in \mathcal{P}_0} P^n \phi_n \le \exp\bigl(n \log \rho(\mathcal{P}_0, \mathcal{P}_1)\bigr), \tag{7.3}$$
$$\sup_{P \in \mathcal{P}_1} P^n (1 - \phi_n) \le \exp\bigl(n \log \rho(\mathcal{P}_0, \mathcal{P}_1)\bigr). \tag{7.4}$$
Theorem 7.1. Suppose that for some nonincreasing function $D(\varepsilon)$, some
$\varepsilon_n \ge 0$ and every $\varepsilon > \varepsilon_n$,
$$D\Bigl(\frac{\varepsilon}{2},\ \bigl\{P : \varepsilon \le d(P, P_0) \le 2\varepsilon\bigr\},\ d\Bigr) \le D(\varepsilon).$$
Then for every $\varepsilon > \varepsilon_n$ there exist tests $\phi_n$ (depending on $\varepsilon > 0$) such that, for a
universal constant $K$ and every $j \in \mathbb{N}$,
$$P_0^n \phi_n \le D(\varepsilon)\, \frac{\exp(-Kn\varepsilon^2)}{1 - \exp(-Kn\varepsilon^2)}, \tag{7.5}$$
$$\sup_{P : d(P, P_0) > j\varepsilon} P^n (1 - \phi_n) \le \exp\bigl(-Kn\varepsilon^2 j^2\bigr). \tag{7.6}$$
Proof. For a given $j \in \mathbb{N}$ choose a maximal $j\varepsilon/2$-separated set of points $S_j'$
in $S_j = \{P : j\varepsilon < d(P, P_0) \le (j+1)\varepsilon\}$. This yields a set $S_j'$ of at most $D(j\varepsilon)$
points, and every $P \in S_j$ is within distance $j\varepsilon/2$ of at least one of these points.
(Take $S_j'$ empty and adapt the following in the obvious way if $S_j$ is empty.)
For every such point $P_1 \in S_j'$ there exists a test $\omega_n$ with the properties as in
(7.1) and (7.2). Let $\phi_n$ be the maximum of all tests attached in this way to
some point $P_1 \in S_j'$ for some $j \in \mathbb{N}$. Then
$$P_0^n \phi_n \le \sum_{j} \sum_{P_1 \in S_j'} \exp\bigl(-Knj^2\varepsilon^2\bigr) \le \sum_{j \in \mathbb{N}} D(j\varepsilon) \exp\bigl(-Knj^2\varepsilon^2\bigr),$$
$$\sup_{P \in \bigcup_{i \ge j} S_i} P^n (1 - \phi_n) \le \sup_{i \ge j} \exp\bigl(-Kni^2\varepsilon^2\bigr).$$
The right sides can be further bounded as desired. [Note that $D(j\varepsilon) \le D(\varepsilon)$
for every $j \in \mathbb{N}$, by assumption.] ✷
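For instance (a one-line completion of the first bound, using $j^2 \ge j$ and $D(j\varepsilon) \le D(\varepsilon)$),
$$\sum_{j \in \mathbb{N}} D(j\varepsilon) \exp\bigl(-Knj^2\varepsilon^2\bigr) \le D(\varepsilon) \sum_{j=1}^\infty \exp\bigl(-Knj\varepsilon^2\bigr) = D(\varepsilon)\, \frac{\exp(-Kn\varepsilon^2)}{1 - \exp(-Kn\varepsilon^2)},$$
which is exactly (7.5).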
One possible choice for $D(\varepsilon)$ is the packing number $D(\varepsilon/2, \mathcal{P}, d)$. This
is a bigger number, but in many infinite-dimensional situations this does not
appear to yield a real loss. On the other hand, the theorem is needed as stated
if $\mathcal{P}$ is finite-dimensional.
It is known from Le Cam (1973) that even for a fixed $\varepsilon$ there need not
exist a consistent sequence of tests of $P_0$ versus $\{P : d(P, P_0) > \varepsilon\}$. The
preceding theorem shows that total boundedness of $\mathcal{P}$ [which is equivalent to
$D(\varepsilon, \mathcal{P}, d)$ being finite for every $\varepsilon > 0$] is sufficient for the existence of such
a test. However, this is not necessary. One example showing this is given by
(7.1) and (7.2) applied with $\mathcal{P} = \{P_0\} \cup \{P : d(P, P_1) < d(P_0, P_1)/2\}$, because
a total variation or Hellinger ball is usually not totally bounded. A classical
example is as follows.
Lemma 7.2. Suppose that there exist tests $\omega_n$ such that for fixed sets $\mathcal{P}_0$
and $\mathcal{P}_1$ of probability measures
$$\sup_{P_0 \in \mathcal{P}_0} P_0^n \omega_n \to 0, \qquad \sup_{P \in \mathcal{P}_1} P^n (1 - \omega_n) \to 0.$$
In view of the fact that, apparently, entropy conditions are not always appro-
priate to ensure the existence of tests, it is fruitful to formulate a theorem on
rates of convergence directly in terms of existence of tests. The following is a
result of this type.
Theorem 7.3. Suppose that (2.8) and (2.9) hold for a sequence $\varepsilon_n$ with
$\varepsilon_n \to 0$ and $n\varepsilon_n^2$ bounded away from zero and sets $\mathcal{P}_n \subset \mathcal{P}$, and in addition
suppose that there exists a sequence of tests $\phi_n$ such that for some constant
$K > 0$ and for every sufficiently large $j$,
$$P_0^n \phi_n \to 0, \tag{7.7}$$
$$\sup_{P \in \mathcal{P}_n :\, j\varepsilon_n < d(P, P_0) \le 2j\varepsilon_n} P^n (1 - \phi_n) \le \exp\bigl(-Kn\varepsilon_n^2 j^2\bigr). \tag{7.8}$$
Lemma 8.1. For every $\varepsilon > 0$ and probability measure $\Pi$ on the set
$$\Bigl\{P : -P_0 \log \frac{p}{p_0} \le \varepsilon^2,\ P_0 \Bigl(\log \frac{p}{p_0}\Bigr)^2 \le \varepsilon^2\Bigr\}$$
we have, for every $C > 0$,
$$P_0^n \Bigl(\int \prod_{i=1}^n \frac{p}{p_0}(X_i)\, d\Pi(P) \le \exp\bigl(-(1 + C)n\varepsilon^2\bigr)\Bigr) \le \frac{1}{C^2 n\varepsilon^2}.$$
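The shape of this bound can be traced in a short sketch (ours, along standard lines): by Jensen's inequality and Fubini's theorem,
$$\log \int \prod_{i=1}^n \frac{p}{p_0}(X_i)\, d\Pi(P) \ge \sum_{i=1}^n \int \log \frac{p}{p_0}(X_i)\, d\Pi(P) =: Z_n,$$
where $E_0 Z_n \ge -n\varepsilon^2$ and $\operatorname{var}_0 Z_n \le n\varepsilon^2$ by the two defining inequalities of the set. Chebyshev's inequality then gives
$$P_0^n\bigl(Z_n \le -(1 + C)n\varepsilon^2\bigr) \le \frac{n\varepsilon^2}{(Cn\varepsilon^2)^2} = \frac{1}{C^2 n\varepsilon^2}.$$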
$$\le \Pi_n\bigl(\mathcal{P} \setminus \mathcal{P}_n\bigr) + \int_{P \in \mathcal{P}_n :\, d(P, P_0) > M\varepsilon_n} P^n (1 - \phi_n)\, d\Pi_n(P)$$
for $M \ge \sqrt{(C + 4)/K}$. By Lemma 8.1, we have with probability tending to 1,
with $B_n = \bigl\{P : -P_0 \log(p/p_0) \le \varepsilon_n^2,\ P_0 \bigl(\log(p/p_0)\bigr)^2 \le \varepsilon_n^2\bigr\}$,
$$\int \prod_{i=1}^n \frac{p}{p_0}(X_i)\, d\Pi_n(P) \ge \exp\bigl(-2n\varepsilon_n^2\bigr)\, \Pi_n(B_n) \ge \exp\bigl(-n\varepsilon_n^2 (2 + C)\bigr).$$
For the proof of Theorem 2.2 we need a replacement of Lemma 8.1 that gives
a faster rate of convergence in its statement. We can achieve this by controlling
the quotients p/p0 . First, if one has uniform control from below, then the
Hellinger distance and the Kullback–Leibler information are comparable. The
following lemma can be found in Birgé and Massart (1998) [see their (7.6)].
If $e^{-c} = \|p_0/p\|_\infty$, then $\log(p/p_0) \ge c$ and hence the integrand on the left side
of (8.4) is bounded above by
$$2e^{-c}\Bigl(\exp\Bigl(\tfrac{1}{2}\log\frac{p}{p_0}\Bigr) - 1\Bigr)^2 = 2e^{-c}\Bigl(\sqrt{\frac{p}{p_0}} - 1\Bigr)^2.$$
The integral of the right side with respect to $P_0$ is equal to $2e^{-c}$ times the
squared Hellinger distance. ✷
$$\Bigl\{P : h^2(P, P_0)\, \Bigl\|\frac{p_0}{p}\Bigr\|_\infty \le \varepsilon^2\Bigr\} \subset \Bigl\{P : -P_0 \log \frac{p}{p_0} \le 2\varepsilon^2,\ P_0 \Bigl(\log \frac{p}{p_0}\Bigr)^2 \le 4\varepsilon^2\Bigr\}. \tag{8.6}$$
This shows that condition (2.4) is weaker than condition (2.5), up to con-
stants. Actually controlling all moments is more than what is needed. Another
possible extension of Lemma 8.1 would be to replace the second moment of
logp/p0 by a higher moment (and use Markov’s inequality at the end of
the proof). This would give a result good enough for the proof of Theorem 2.2
provided the higher moment is chosen “high enough” (dependent on the order
εn , faster convergence to zero needing a higher moment). We have chosen
here to forego such refinements and obtain an exponential inequality under a
somewhat stronger assumption.
We are ready for an adaptation of Lemma 8.1.
Lemma 8.4. For every $\varepsilon > 0$ and probability measure $\Pi$ on the set
$$\bigl\{P : h^2(P, P_0)\, \|p_0/p\|_\infty \le \varepsilon^2\bigr\} \tag{8.7}$$
we have, for a universal constant $D > 0$,
$$P_0^n \Bigl(\int \prod_{i=1}^n \frac{p}{p_0}(X_i)\, d\Pi(P) \le \exp\bigl(-3n\varepsilon^2\bigr)\Bigr) \le \exp\bigl(-Dn\varepsilon^2\bigr). \tag{8.8}$$
Proof. Lemma 8.2 gives that $-P_0 \log(p/p_0) \le 2\varepsilon^2$ for every $P$ in the
set (8.7), which has $\Pi$-probability 1. Furthermore, by Lemma 8.3,
by Fubini’s theorem.
By the lemma below the same bound is true for $\frac{1}{2}$ times
the variable $\int \log(p/p_0)\, d\Pi$ centered at its expectation. Therefore, rewriting
the probability on the left side of (8.8) as in the proof of Lemma 8.1, we
see that it is bounded above by
$$P_0^n \Bigl(\sqrt{n}\,(\mathbb{P}_n - P_0) \int \log \frac{p}{p_0}\, d\Pi(P) \le -3\sqrt{n}\,\varepsilon^2 + 2\sqrt{n}\,\varepsilon^2\Bigr) \le \exp\Bigl(-D\, \frac{n\varepsilon^4}{\varepsilon^2 + \sqrt{n}\,\varepsilon^2/\sqrt{n}}\Bigr),$$
with $\mathbb{P}_n$ the empirical measure of $X_1, \dots, X_n$, by (the refined version of) Bernstein's inequality [see, e.g., Lemma 2.2.11 of
van der Vaart and Wellner (1996)]. ✷
Proof of Theorem 2.2. The proof of Theorem 2.2 follows the same lines
as the proof of Theorem 2.1. The difference is that we use Lemma 8.4 instead
of Lemma 8.1 to ensure that the probability of the events $A_n$ converges to 1
at an exponential rate. By inspecting the proof, we conclude that for some
$B_1, B_2 > 0$ and $M$ chosen as before,
$$P_0^n\Bigl(\Pi_n\bigl(P : d(P, P_0) > M\varepsilon_n \mid X_1, \dots, X_n\bigr) \ge \exp\bigl(-B_1 n\varepsilon_n^2\bigr)\Bigr)$$
converges to zero at the rate $\exp(-B_2 n\varepsilon_n^2)$. Since $\sum_n \exp(-B_2 n\varepsilon_n^2) < \infty$,
almost sure convergence follows by the Borel–Cantelli lemma. ✷
For the proof of Theorem 2.3 we need other variations on the preceding lem-
mas. The following lemma follows from Theorem 5 of Wong and Shen (1995).
Let $\log_+ x = \log x \vee 0$.
Lemma 8.6. For any pair of probability measures $P$ and $P_0$ such that
$h(P, P_0) \le 0.44$ and $P_0(p_0/p) < \infty$,
$$-P_0 \log \frac{p}{p_0} \le 18\, h^2(P, P_0)\Bigl(1 + \log_+ \frac{P_0(p_0/p)}{h(P, P_0)}\Bigr),$$
$$P_0 \Bigl(\log \frac{p}{p_0}\Bigr)^2 \le 5\, h^2(P, P_0)\Bigl(1 + \log_+ \frac{P_0(p_0/p)}{h(P, P_0)}\Bigr)^2.$$
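A numerical spot-check of these two bounds for a pair of normal densities (our own illustration, using numerical integration on a grid; it verifies the 0.44 condition first):

```python
import numpy as np

x = np.linspace(-12, 12, 200001)
p0 = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)           # N(0, 1)
p = np.exp(-(x - 0.3)**2 / 2) / np.sqrt(2 * np.pi)    # N(0.3, 1)
h2 = np.trapz((np.sqrt(p) - np.sqrt(p0))**2, x)       # squared Hellinger distance
h = np.sqrt(h2)
kl = np.trapz(p0 * np.log(p0 / p), x)                 # -P0 log(p/p0)
m2 = np.trapz(p0 * np.log(p / p0)**2, x)              # P0 (log(p/p0))^2
ratio = np.trapz(p0 * (p0 / p), x)                    # P0 (p0/p)
assert h <= 0.44                                      # condition of the lemma
bound = 1 + max(np.log(ratio / h), 0.0)               # 1 + log_+ (P0(p0/p)/h)
print(kl <= 18 * h2 * bound, m2 <= 5 * h2 * bound**2) # prints: True True
```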
Lemma 8.7. For any pair of probability measures $P$ and $P_0$ such that
$P_0(p_0/p) < \infty$,
$$P_0\Bigl(e^{\log(p/p_0)} - 1 - \log \frac{p}{p_0}\Bigr) \le 4\, h^2(P, P_0)\bigl(1 + \psi^{-1}\bigl(h^2(P, P_0)\bigr)\bigr)$$
for $\psi^{-1}(\varepsilon) = \sup\{M : \psi(M) \ge \varepsilon\}$ the inverse of the function $\psi(M) = P_0\bigl((p_0/p)\mathbf{1}\{p_0/p \ge M\}\bigr)/M$.
Proof. Set $m = p_0/p$. By inequality (8.5) in the proof of Lemma 8.3, the
left side is bounded above by
$$2P_0\Bigl(\sqrt{\frac{p}{p_0}} - 1\Bigr)^2 \mathbf{1}\{p \ge p_0\} + 2P_0 \frac{p_0}{p}\Bigl(\sqrt{\frac{p}{p_0}} - 1\Bigr)^2 \mathbf{1}\{p < p_0\}$$
for every $M > 0$. The function $\psi$ is left continuous and strictly decreasing
from infinity at 0 to 0 at a point $\tau \le \infty$. If we choose $M = \psi^{-1}\bigl(h^2(P, P_0)\bigr)$,
then $M\psi(M) \ge M h^2(P, P_0) \ge M\psi(M+) = P_0\bigl(m \mathbf{1}\{m > M\}\bigr)$. The right side of
the last display can now be bounded by an expression as in the lemma. ✷
Lemma 8.8. For a given function $m$ let $\psi^{-1}(\varepsilon) = \sup\{M : \psi(M) \ge \varepsilon\}$ be the
inverse function of $\psi(M) = P_0\bigl(m \mathbf{1}\{m \ge M\}\bigr)/M$. For every $\varepsilon \in (0, 0.44)$ and
probability measure $\Pi$ on the set
$$\Bigl\{P : p_0/p \le m,\ 18\, h^2(P, P_0)\Bigl(1 + \log_+ \frac{P_0 m}{h(P, P_0)} + \psi^{-1}\bigl(h^2(P, P_0)\bigr)\Bigr) \le \varepsilon^2\Bigr\}$$
we have, for a universal constant $B$,
$$P_0^n \Bigl(\int \prod_{i=1}^n \frac{p}{p_0}(X_i)\, d\Pi(P) \le \exp\bigl(-2n\varepsilon^2\bigr)\Bigr) \le \exp\bigl(-Bn\varepsilon^2\bigr). \tag{8.9}$$
Proof. This follows the same lines as the proof of Lemma 8.4, now sub-
stituting Lemmas 8.6 and 8.7 for Lemmas 8.2–8.3.
Proof of Theorem 2.4. The first part of the proof is identical to the first
part of the proof of Theorem 2.1, except that we choose the tests $\phi_n$ to satisfy
(8.1) and [instead of (8.2)] for every $j \in \mathbb{N}$,
$$\sup_{P \in \mathcal{P}_n :\, d(P, P_0) > M\varepsilon_n j} P^n (1 - \phi_n) \le \exp\bigl(-KnM^2\varepsilon_n^2 j^2\bigr). \tag{8.10}$$
REFERENCES
Barron, A., Schervish, M. J. and Wasserman, L. (1999). The consistency of posterior distribu-
tions in nonparametric problems. Ann. Statist. 27 536–561.
Birgé, L. (1983). Approximation dans les espaces métriques et théorie de l’estimation. Z. Wahrsch.
Verw. Gebiete 65 181–238.
Birgé, L. (1984). Sur un théorème de minimax et son application aux tests. Probab. Math. Statist.
3 259–282.
Birgé, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probab.
Theory Related Fields 97 113–150.
Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. In Festschrift for
Lucien Le Cam (G. Yang and D. Pollard, eds.) 55–87. Springer, New York.
Birgé, L. and Massart, P. (1998). Minimum contrast estimators on sieves: exponential bounds
and rates of convergence. Bernoulli 4 329–375.
de Boor, C. (1978). A Practical Guide to Splines. Springer, New York.
Diaconis, P. and Freedman, D. (1986). On the consistency of Bayes estimates (with discussion).
Ann. Statist. 14 1–67.
Doob, J. L. (1949). Application of the theory of martingales. Le Calcul des Probabilités et ses Applications. Coll. Int. du CNRS 13 23–27.
Dudley, R. M. (1984). A course on empirical processes. Lecture Notes in Math. 1097 2–141.
Springer, Berlin.
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1
209–230.
Ferguson, T. S. (1974). Prior distribution on the spaces of probability measures. Ann. Statist. 2
615–629.
Freedman, D. A. (1963). On the asymptotic behavior of Bayes’ estimates in the discrete case.
Ann. Math. Statist. 34 1194–1216.
Freedman, D. A. (1965). On the asymptotic behavior of Bayes’ estimates in the discrete case II.
Ann. Math. Statist. 36 454–456.
Ghosal, S., Ghosh, J. K. and Ramamoorthi, R. V. (1997). Non-informative priors via sieves
and packing numbers. In Advances in Statistical Decision Theory and Applications (S.
Panchapakesan and N. Balakrishnan, eds.) 129–140. Birkhäuser, Boston.
Ghosal, S., Ghosh, J. K. and Ramamoorthi R. V. (1999a). Posterior consistency of Dirichlet
mixtures in density estimation. Ann. Statist. 27 143–158.
Ghosal, S., Ghosh, J. K. and Ramamoorthi, R. V. (1999b). Consistency issues in Bayesian non-
parametrics. In Asymptotics, Nonparametrics and Time Series: A Tribute to Madan Lal
Puri (Subir Ghosh, ed.) 639–667. Dekker, New York.
Ibragimov, I. A. and Has’minskii, R. Z. (1981). Statistical Estimation: Asymptotic Theory.
Springer, New York.
Kolmogorov, A. N. and Tikhomirov, V. M. (1961). Epsilon-entropy and epsilon-capacity of sets
in function spaces. Amer. Math. Soc. Trans. Ser. 2 17 277–364.
Le Cam, L. M. (1973). Convergence of estimates under dimensionality restrictions. Ann. Statist.
1 38–53.
Le Cam, L. M. (1986). Asymptotic Methods in Statistical Decision Theory. Springer, New York.
Le Cam, L. M. and Yang, G. (1990). Asymptotics in Statistics: Some Basic Concepts. Springer,
New York.
Pollard, D. (1990). Empirical Processes: Theory and Applications. IMS, Hayward, CA and Amer.
Statist. Assoc., Alexandria, VA.
Schwartz, L. (1965). On Bayes procedures. Z. Wahrsch. Verw. Gebiete 4 10–26.
Shen, X. and Wasserman, L. (1999). Rates of convergence of posterior distributions. Preprint.
Stone, C. J. (1986). The dimensionality reduction principle for generalized additive models. Ann.
Statist. 14 590–606.
Stone, C. J. (1990). Large-sample inference for log-spline models. Ann. Statist. 18 717–741.
Stone, C. J. (1994). The use of polynomial splines and their tensor products in multivariate
function estimation (with discussion). Ann. Statist. 22 118–184.
van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes.
Springer, New York.
Wasserman, L. (1998). Asymptotic properties of nonparametric Bayesian procedures. Practical
Nonparametric and Semiparametric Bayesian Statistics. Lecture Notes in Statist. 133
293–304. Springer, New York.
Wong, W. H. and Shen, X. (1995). Probability inequalities for likelihood ratios and convergence
rates of sieve MLEs. Ann. Statist. 23 339–362.
S. Ghosal
A. W. van der Vaart
Department of Mathematics
Free University
De Boelelaan 1081a
1081 HV Amsterdam
Netherlands
E-mail: [email protected]

J. K. Ghosh
Statistics and Mathematics Unit
Indian Statistical Institute
203 B.T. Road
Calcutta 700 035
India