
The Annals of Statistics

2000, Vol. 28, No. 2, 500–531

CONVERGENCE RATES OF POSTERIOR DISTRIBUTIONS

By Subhashis Ghosal, Jayanta K. Ghosh


and Aad W. van der Vaart
Free University Amsterdam, Indian Statistical Institute and
Free University Amsterdam
We consider the asymptotic behavior of posterior distributions and
Bayes estimators for infinite-dimensional statistical models. We give gen-
eral results on the rate of convergence of the posterior measure. These are
applied to several examples, including priors on finite sieves, log-spline
models, Dirichlet processes and interval censoring.

1. Introduction. Suppose that we observe a random sample X_1, …, X_n from a distribution P with density p relative to some reference measure µ on the sample space (𝒳, 𝒜). The unknown distribution is known to belong to some model 𝒫, a set of probability measures on the sample space. Given some prior distribution Π_n on the set 𝒫, the posterior distribution is the random measure given by

    Π_n(B | X_1, …, X_n) = ∫_B ∏_{i=1}^n p(X_i) dΠ_n(P) / ∫ ∏_{i=1}^n p(X_i) dΠ_n(P).
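When the prior Π_n has finite support, this display can be evaluated directly. The following Python sketch is our own illustration, not part of the paper; the uniform-versus-triangular toy model and all names in it are placeholder choices.

    import numpy as np

    def posterior(densities, prior, data, B):
        """densities: list of vectorized density functions p(x);
        prior: prior weights; B: indices of the densities belonging to the set B."""
        # log of prod_i p(X_i) for each candidate density, for numerical stability
        loglik = np.array([np.sum(np.log(p(data))) for p in densities])
        w = prior * np.exp(loglik - loglik.max())   # unnormalized posterior weights
        w /= w.sum()
        return w[B].sum()                           # Pi_n(B | X_1, ..., X_n)

    data = np.random.default_rng(0).uniform(size=50)
    densities = [lambda x: np.ones_like(x), lambda x: 2 * x]   # uniform, triangular
    print(posterior(densities, np.array([0.5, 0.5]), data, [0]))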

If the distribution P is considered random and distributed according to Π_n, as it is in Bayesian inference, then the posterior distribution is the conditional distribution of P given the observations. The prior is, of course, a measure on some σ-field on 𝒫 and we must assume that the expressions in the display are well defined. In particular, we assume that the map (x, p) ↦ p(x) is measurable for the product σ-field on 𝒳 × 𝒫. It will be silently understood that the sets of which we compute prior or posterior measures are measurable.
In this paper we study the frequentist properties of the posterior distribution as n → ∞, assuming that the observations are a random sample from some fixed measure P_0. In particular, we study the rate at which this random distribution converges to P_0. The posterior is said to be consistent if, as a random measure, it concentrates on arbitrarily small neighborhoods of P_0, with probability tending to 1 or almost surely, as n → ∞. We study the rate at which such neighborhoods may decrease to zero while still capturing most of the posterior mass.
If 𝒫 = {P_θ : θ ∈ Θ} is parametrized by a parameter θ, then usually the prior is constructed by putting a measure on the parameter set Θ. If Θ is a

Received July 1998; revised December 1999.


AMS 1991 subject classifications. 62G15, 62G20, 62F25.
Key words and phrases. Infinite dimensional model, posterior distribution, rate of convergence, sieves, splines.

subset of a finite-dimensional Euclidean space and the dependence θ ↦ P_θ is sufficiently regular, then it is well known that the posterior distribution of θ achieves the optimal rate of convergence, as n → ∞ [see, for example, Le Cam (1973) and Ibragimov and Has'minskii (1981)]. In particular, if the model θ ↦ P_θ is suitably differentiable, then the rate of convergence of the posterior mean for θ is √n and the posterior distribution, when rescaled, tends to a normal distribution with covariance the inverse Fisher information, according to the Bernstein–von Mises theorem. In that case the posterior expectation is an asymptotically efficient estimator for the parameter under some integrability conditions.
Much less is known on the behavior of posterior distributions for infinite-
dimensional models. Most of the known results in this area address consis-
tency issues. A famous theorem by Doob (1949) shows that consistency obtains
on a set of prior measure 1, but his result concludes nothing on consistency at
a particular true distribution of interest. Schwartz (1965) gives results that
do apply to a particular true distribution. She shows that the posterior distri-
bution is consistent if the true distribution P0 can be suitably tested versus
the complements of neighborhoods of P0 and Kullback–Leibler neighborhoods
of P0 receive positive probabilities under the prior. Examples by, among oth-
ers, Freedman (1963, 1965) and Diaconis and Freedman (1986) show that the
situation is more complicated, even though, perhaps, these examples put too
much emphasis on situations where Bayes estimation does not work. A number of recent papers consider consistency with a particular interest
in the infinite-dimensional situation. Barron, in an unpublished paper, refines
Schwartz’s theorem [see Proposition 2 of Barron, Schervish and Wasserman
(1999)] in a way that is particularly suitable for prior measures on infinite-
dimensional spaces of densities. Ghosal, Ghosh and Ramamoorthi (1999a) use
this extension to study consistency in the variation distance for Dirichlet mix-
ture priors. For reviews on posterior consistency in infinite dimensions, see
Ghosal, Ghosh and Ramamoorthi (1999b) and Wasserman (1998).
Le Cam [(1986), pages 509–529] addresses rates of convergence of Bayes estimators in an abstract setting. Our methods are clearly related to the methods used by Le Cam. A crucial distinction appears to be that Le Cam bases his argument on the prior mass present in fairly small balls (the sets V in his Lemma 1 on page 510, later chosen such that P_V^n is close to P_0^n), whereas our result is based on having sufficient prior mass in balls of radius equal to the rate of convergence that we wish to obtain. The behavior of product densities in these bigger balls appears not to be determined by the metric distance of the marginal components alone. Instead we use a combination of Kullback–Leibler numbers and distances on the log likelihood ratios. Another distinction is that we consider rates of posterior measures, whereas Le Cam considers the rate of "formal Bayes estimators." For these reasons our results appear not to be covered by Le Cam's Theorem 1 on page 513 (in which distances and other quantities are in terms of the product measures P^n). However, we would like to acknowledge the great importance of Le Cam's work to
the present paper. In particular, we are indebted to the part of Le Cam’s work
as it was extended by Birgé (1983) to general metric spaces.
Shortly after completing this paper, we learned of independent work by
Shen and Wasserman (1998), who also address rates of convergence.
The construction of prior measures on infinite-dimensional models is not
a trivial matter and has also received recent attention. This development
started with the introduction of Dirichlet processes by Ferguson (1973, 1974).
Given computing algorithms such as Markov chain Monte Carlo methods and
powerful computing machines, implementation of Bayesian methods has now
become feasible even for many complicated priors and infinite dimensional
models.
In Section 2 we present a main result and several variations concerning the rate of convergence of the posterior relative to the total variation, Hellinger and L_2-metrics. In each case the two main elements characterizing the rate of convergence are the size of the model (measured by covering numbers or existence of certain tests) and the amount of prior mass given to a shrinking ball around the true measure. Actually, the size of the model comes in only to guarantee the existence of certain tests of the true measure versus the complement of a shrinking ball around it, and conditions can be put in terms of such tests instead. Conditions of this form go back to Schwartz (1965) and Le Cam (1973). We discuss testing in Section 7, and reformulate our main result in terms of tests in that section. The proofs of the main results are contained in Section 8, following the discussion of the existence of tests. In Section 2 we also note that a rate of convergence for the posterior automatically entails the existence of point estimators with the same rate.
We apply the general result to several examples. In Section 3 we consider discrete priors constructed on ε-nets over the model. In Section 4 we discuss Bayes estimators based on the log spline models for density estimation discussed by Stone (1986). In Section 5 we consider finite-dimensional models. In Section 6 we discuss applications to Dirichlet priors.
The notation ≲ is used to denote inequality up to a universal multiplicative constant, or up to a constant that is fixed throughout. We define the Hellinger distance h(p, q) or h(P, Q) between two probability densities or measures by the L_2-distance between the root densities √p and √q. The total variation distance is the L_1-distance. (Some authors define these distances with an additional factor 1/2.)

2. Main results. Let X_1, …, X_n be distributed according to some distribution P_0 and let Π_n be a sequence of prior probability measures supported on some set of probability measures 𝒫. Let d be either the variation or the Hellinger metric on 𝒫. If the set of densities is uniformly bounded, then we may also choose d equal to the L_2-distance. This metric is used in condition (2.2) of the following theorem and also in its assertion.
Let D(ε, 𝒫, d) denote the ε-packing number of 𝒫. This is the maximal number of points in 𝒫 such that the distance between every pair is at least ε. It is easy to see that this is related to the ε-covering number N(ε, 𝒫, d), which is the minimal number of balls of radius ε needed to cover 𝒫, by the inequalities

(2.1)    N(ε, 𝒫, d) ≤ D(ε, 𝒫, d) ≤ N(ε/2, 𝒫, d).

Because we are only interested in rates of convergence, the additional constant 2 is of no real importance in the following, and covering numbers may replace packing numbers throughout. The set of centers of a minimal set of balls of radius ε covering 𝒫 is called an ε-net.
A good early reference on entropy numbers is the paper by Kolmogorov and Tikhomirov (1961). Alternative references are Dudley (1984) and van der Vaart and Wellner (1996).
We use the notation Pf to abbreviate ∫ f dP and, later on, ℙ_n f for n^{−1} ∑_{i=1}^n f(X_i).

Theorem 2.1. Suppose that for a sequence ε_n with ε_n → 0 and nε_n² → ∞, a constant C > 0 and sets 𝒫_n ⊂ 𝒫, we have

(2.2)    log D(ε_n, 𝒫_n, d) ≤ nε_n²,

(2.3)    Π_n(𝒫 \ 𝒫_n) ≤ exp(−nε_n²(C + 4)),

(2.4)    Π_n(P : −P_0 log(p/p_0) ≤ ε_n², P_0 (log(p/p_0))² ≤ ε_n²) ≥ exp(−nε_n² C).

Then for sufficiently large M, we have that Π_n(P : d(P, P_0) ≥ Mε_n | X_1, …, X_n) → 0 in P_0^n-probability.

The first and third conditions of the theorem are the essential ones. Condition (2.3) allows some additional flexibility, but should first be understood as expressing that 𝒫_n is almost the support of the prior (in which case its left side is zero and the condition is trivially satisfied).
Condition (2.2) requires that the "model" 𝒫_n be not too big. It is true for every ε ≥ ε_n as soon as it is true for ε_n and can thus be seen as defining a minimal possible value of ε_n. Condition (2.2) ensures the existence of certain tests, as discussed in Section 7, and could be replaced by a testing condition.
Note that the metric d used here reappears in the assertion of the theorem. Since the total variation metric is bounded above by twice the Hellinger metric, the assertion of the theorem using the Hellinger metric is stronger, but then condition (2.2) is also more restrictive, so that we really have two theorems. In the case that the densities are uniformly bounded, we even have a third theorem, when using the L_2-distance, which in that case is bounded above by a multiple of the Hellinger distance. If the densities are also uniformly bounded and uniformly bounded away from zero, then these three distances
are equivalent and are also equivalent to the Kullback–Leibler number and
L2 -norm appearing in condition (2.4). See, for example, Lemmas 8.2 and 8.3
and (8.6).
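The comparison between the total variation and Hellinger metrics used above is a consequence of the Cauchy–Schwarz inequality; in brief,

    ∫ |p − q| dµ = ∫ |√p − √q| (√p + √q) dµ ≤ h(P, Q) ‖√p + √q‖_2 ≤ 2 h(P, Q).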
A rate ε_n satisfying (2.2) for 𝒫_n = 𝒫 and d the Hellinger metric is often viewed as giving the "optimal" rate of convergence for estimators of P relative to the Hellinger metric, given the model 𝒫. Under certain conditions, such as likelihood ratios bounded away from zero and infinity, this is proved as a theorem by Birgé (1983) and Le Cam (1973, 1986). From Birgé's work it is clear that condition (2.2) is the correct expression of the complexity of the model, as relating to estimating the true density relative to the Hellinger distance, if this is to be given in terms of metric entropy. A weaker, but more involved, condition is in terms of the existence of certain tests. We give a generalization of the theorem using tests in Section 7.
Condition (2.4) is the other main determinant of the posterior rate given by the theorem. It requires that the prior measures put a sufficient amount of mass near the true measure P_0. Here "near" is measured through a combination of the Kullback–Leibler divergence of p and p_0 and the L_2(P_0)-norm of log(p/p_0). Again this condition is satisfied for every ε ≥ ε_n if it is satisfied for ε_n and thus is another restriction on a minimal value of ε_n. The form of this condition can be motivated from entropy considerations. Suppose that we wish to satisfy (2.4) for the minimal ε_n satisfying (2.2) with 𝒫_n = 𝒫, that is, for the optimal rate of convergence for the model. Furthermore, for the sake of the argument assume that all distances used are equivalent. Then a minimal ε_n-cover of 𝒫 consists of exp(nε_n²) balls. If the prior Π_n would spread its mass uniformly over 𝒫, then every ball would obtain mass approximately exp(−Cnε_n²). (The constant C expresses the constants in comparing the distances and the fact that the balls of radius ε_n may overlap.) On the other hand, if Π_n is not "uniform," then we should expect (2.4) to fail for some P_0 ∈ 𝒫. Here we must admit that "uniform" priors do not exist in infinite-dimensional models and actually condition (2.4) is stronger than needed and will be improved ahead in Theorem 2.4. However, a rough implication of the condition is that Π_n should be "uniformly spread" in order for the posterior distribution to attain the optimal rate of convergence.
Condition (2.3), combined with (2.2), can be interpreted as saying that a part of 𝒫 that barely receives prior mass need not be small. The sets 𝒫_n may be thought of as "sieves" approximating the parameter space, which capture most of the prior probability. This type of condition has received much attention in the discussion of consistency issues [see Barron, Schervish and Wasserman (1999)], but plays a smaller role in the present paper. Of course, condition (2.3) is trivially satisfied for 𝒫_n = 𝒫; we can make this choice if condition (2.2) holds with 𝒫_n = 𝒫 itself.
The assertion of the theorem is an in-probability statement that the pos-
terior mass outside a large ball of radius proportional to εn is approximately
zero. The in-probability statement can be improved to an almost sure asser-
tion, but under stronger conditions. We present two results.
Let h be the Hellinger distance and write log₊ x for (log x) ∨ 0.

Theorem 2.2. Suppose that conditions (2.2) and (2.3) hold as in the preceding theorem and in addition ∑_n exp(−Bnε_n²) < ∞ for every B > 0 and

(2.5)    Π_n(P : h²(P, P_0) ‖p_0/p‖_∞ ≤ ε_n²) ≥ exp(−nε_n² C).

Then for sufficiently large M, we have that Π_n(P : d(P, P_0) ≥ Mε_n | X_1, …, X_n) → 0 almost surely [P_0^∞].

Theorem 2.3. Suppose that conditions (2.2) and (2.3) hold as in the preceding theorem and in addition ∑_n exp(−Bnε_n²) < ∞ for every B > 0 and, for a given function m with P_0 m < ∞,

(2.6)    Π_n(P : 18 h²(P, P_0) log₊(P_0 m / h(P, P_0)) + ψ^{−1}(h²(P, P_0)) ≤ ε_n², p_0/p ≤ m) ≥ exp(−nε_n² C),

where ψ^{−1}(ε) = sup{M : ψ(M) ≥ ε} is the inverse of the function ψ(M) = P_0 m 1{m ≥ M}/M. Then for sufficiently large M, we have that Π_n(P : d(P, P_0) ≥ Mε_n | X_1, …, X_n) → 0 almost surely [P_0^∞].

If the quotients p_0/p are uniformly bounded, then condition (2.5) simply requires that shrinking Hellinger balls possess a sufficient amount of prior mass. Then a fairly symmetric statement is obtained when combined with condition (2.2) for the Hellinger metric d: if we can cover the model with not too many Hellinger balls and the Hellinger ball around P_0 contains a sufficient amount of prior mass, then the rate of convergence relative to the Hellinger distance is ε_n.
Lemmas 8.2 and 8.3 in Section 8 relate the Kullback–Leibler divergence and L_2-norm of log(p/p_0) to h²(P, P_0)‖p_0/p‖_∞ and imply that the conditions of Theorem 2.2 are essentially stronger than those of Theorem 2.1.
Condition (2.6) is milder in its control of p_0/p than (2.5), by allowing a general bound m that need only satisfy a moment condition. However, in comparison with (2.5) it will be satisfied only for somewhat bigger ε_n, due to the presence of the term involving log₊ and ψ^{−1}.
In general, good control on the quotients p_0/p is needed next to the closeness of p to p_0 relative to, for example, the Hellinger metric, because the product measures P^n and P_0^n can be arbitrarily far apart as n → ∞ within balls of radii ε_n, for the values of ε_n bigger than n^{−1/2} that we are considering here. The bound on p_0/p together with the distance ensures that P^n and P_0^n are still "close" enough on an exponential scale. Only prior mass on such close alternatives helps to increase the rate of convergence of the posterior.
One deficit of the theorems as presented so far is that they do not satisfactorily cover finite-dimensional models. When applied to such models, they would yield the rate 1/√n times a logarithmic factor rather than 1/√n itself. Similarly, the theorems may also yield unnecessary logarithmic factors when applied to priors constructed on a sequence of finite-dimensional sieves. To
improve this situation we must refine both the entropy condition (2.2) and
the prior mass condition (2.4). The following generalization of Theorem 2.1 is
more complicated but does yield the right result in the finite-dimensional situ-
ation. It is essential for our examples using spline approximations in Section 4.
Theorems 2.2 and 2.3 can be generalized similarly. Let

    B_n(ε) = {P : −P_0 log(p/p_0) ≤ ε², P_0 (log(p/p_0))² ≤ ε²}.

Theorem 2.4. Suppose that for a sequence ε_n with ε_n → 0 and such that nε_n² is bounded away from zero, every sufficiently large j and sets 𝒫_n ⊂ 𝒫, we have

(2.7)    log D(ε/2, {P ∈ 𝒫_n : ε ≤ d(P, P_0) ≤ 2ε}, d) ≤ nε_n²    for every ε ≥ ε_n,

(2.8)    Π_n(𝒫 − 𝒫_n) / Π_n(B_n(ε_n)) = o(exp(−2nε_n²)),

(2.9)    Π_n(P : jε_n < d(P, P_0) ≤ 2jε_n) / Π_n(B_n(ε_n)) ≤ exp(Knε_n² j²/2).

Here K is the universal testing constant appearing in (7.1) and (7.2). Then for every M_n → ∞, we have that Π_n(P : d(P, P_0) ≥ M_n ε_n | X_1, …, X_n) → 0 in P_0^n-probability.

Convergence of the posterior distribution at the rate ε_n implies the existence of point estimators, which are Bayes in that they are based on the posterior distribution, that converge at least as fast as ε_n in the frequentist sense. One possible construction is to define P̂_n as the (near) maximizer of

    Q ↦ Π_n(P : d(P, Q) < ε_n | X_1, …, X_n).

Theorem 2.5. Suppose that Π_n(P : d(P, P_0) ≥ ε_n | X_1, …, X_n) converges to 0, almost surely (respectively, in probability) under P_0^n, and let P̂_n maximize, up to o(1), the function Q ↦ Π_n(P : d(P, Q) < ε_n | X_1, …, X_n). Then d(P̂_n, P_0) ≤ 2ε_n eventually, almost surely (respectively, in probability) under P_0^n.

Proof. By definition, the εn -ball around P̂n contains at least as much


posterior probability as the εn -ball around P0 . The latter, by posterior conver-
gence at rate εn , has posterior probability close to unity. Therefore, these two
balls cannot be disjoint, for otherwise, the total posterior mass would exceed
unity. Now apply the triangle inequality. ✷
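When the posterior is represented by simulation output, this construction can be carried out directly. The following Python sketch is our own illustration (the draws, the metric and eps are placeholders, not objects from the paper); it returns a near-maximizer of Q ↦ Π_n(P : d(P, Q) < ε_n | X_1, …, X_n) over the posterior draws themselves.

    import numpy as np

    def posterior_center(draws, dist, eps):
        """Return the draw whose eps-ball captures the most posterior mass."""
        mass = [np.mean([dist(q, p) < eps for p in draws]) for q in draws]
        return draws[int(np.argmax(mass))]

    # usage with scalar "parameters" and the Euclidean metric as a stand-in
    draws = list(np.random.default_rng(1).normal(1.0, 0.1, size=500))
    print(posterior_center(draws, lambda a, b: abs(a - b), eps=0.05))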

The preceding construction actually applies to general statistical models


and posterior distributions, and the theorem is well-known. [See, e.g., Le
Cam (1986) or Le Cam and Yang (1990).] If we use the Hellinger or total
variation metric (or some other bounded metric whose square is convex), then
an alternative is to use the posterior expectation, which typically has a similar property. By Jensen's inequality and convexity of P ↦ d²(P, P_0),

    d²(∫ P dΠ_n(P | X_1, …, X_n), P_0) ≤ ∫ d²(P, P_0) dΠ_n(P | X_1, …, X_n)
                                        ≤ ε_n² + d_∞² Π_n(P : d(P, P_0) > ε_n | X_1, …, X_n),

where d_∞ is a bound on the maximal distance (√2 and 2, respectively, for the Hellinger and variation distances). To obtain the desired result, we now need that the posterior probability of the complement of the ε_n-ball around p_0 converges to zero at least at the order ε_n². This is usually the case, in particular under the conditions of Theorems 2.2 and 2.3, whose proofs yield the exponential order exp(−Bnε_n²). (We use the square of the distance, because the Hellinger distance is not convex. With the total variation distance the argument would also work with the distance itself.)
More generally, we could use the minimizer P̂_n of

    Q ↦ ∫ ℓ_n(d(Q, P)) dΠ_n(P | X_1, …, X_n)

for appropriate loss functions ℓ_n. Such estimators are called formal Bayes estimators in Le Cam (1986).
On the one hand, Theorem 2.5 shows that we can construct good estimators
from the posterior if the posterior converges at a good rate. On the other hand,
it shows that the posterior cannot converge at a rate faster than the optimal
rate of convergence for point estimators. We use this argument in a number
of examples to show that the posterior converges at the best possible rate. Of
course, our arguments have nothing to say about the best possible constants.
Furthermore, for many priors the rate may be suboptimal.

3. Priors based on finite approximating sets. In this section, we con-


struct, under bracketing entropy conditions, priors based on uniform distri-
butions on carefully chosen finite sets for which the posterior converges at
the best possible rate. Priors based on uniform distributions on finite subsets
are introduced by Ghosal, Ghosh and Ramamoorthi (1997) as the Bayesian
default priors for nonparametric problems. They establish posterior consis-
tency for such priors under mild entropy conditions. In the present case, the
prior is constructed more carefully to achieve the optimal rate of convergence
as well.
Given two functions l, u : 𝒳 → ℝ, the bracket [l, u] is defined as the set of all functions f : 𝒳 → ℝ such that l ≤ f ≤ u everywhere. The bracket is said to be of size ε relative to the distance d if d(l, u) < ε. In this section we use the Hellinger distance h as the distance d and restrict the brackets to consist of nonnegative functions, which are assumed to be integrable relative to a reference measure µ. Let N_[](ε, 𝒫, h) be the minimal number of brackets of size ε needed to cover 𝒫. Because a bracket of size ε is contained in the ball of radius ε/2 around its midpoint, it follows that N(ε/2, 𝒫, h) ≤ N_[](ε, 𝒫, h) and hence the present bracketing numbers are bigger than the packing numbers D(ε, 𝒫, h) defined previously [see (2.1)]. However, in many examples there is also an equality in the other direction, up to a constant, and bracketing and packing numbers give equivalent results. The corresponding bracketing entropy is defined as the logarithm of the bracketing number N_[](ε, 𝒫, h).
We shall construct a discrete prior supported on densities constructed from minimal sets of brackets for the Hellinger distance. For a given number ε_n > 0, let Π_n be the uniform discrete measure on the N_[](ε_n, 𝒫, h) densities obtained by covering 𝒫 with a minimal set of ε_n-brackets and next renormalizing the upper bounds of the brackets to integrate to 1. Thus if (l_1, u_1), …, (l_N, u_N) are the N = N_[](ε_n, 𝒫, h) brackets, then Π_n is the uniform measure on the N functions u_j/∫ u_j dµ. Next set

    Π = ∑_{n∈ℕ} λ_n Π_n

for a given sequence λ_n with λ_n ≥ 0 and ∑_n λ_n = 1.
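In outline, the construction only requires the bracket upper bounds. A minimal Python sketch, assuming the ε-brackets for the model at hand are given (finding them depends on 𝒫, so they are an input here):

    import numpy as np

    def bracket_prior(upper_bounds):
        """upper_bounds: array (N, G) of bracket upper limits u_j evaluated on a
        uniform midpoint grid of [0, 1] with respect to Lebesgue measure mu."""
        masses = upper_bounds.mean(axis=1)          # approximates int u_j dmu
        densities = upper_bounds / masses[:, None]  # renormalize to integrate to 1
        weights = np.full(len(densities), 1.0 / len(densities))  # uniform Pi_n
        return densities, weights

    # the overall prior is then Pi = sum_n lambda_n Pi_n, e.g. with lambda_n ~ 2^(-n)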

Theorem 3.1. Suppose that ε_n are numbers decreasing in n such that log N_[](ε_n, 𝒫, h) ≤ nε_n² for every n and nε_n²/log n → ∞. Construct the prior Π as given previously for a sequence λ_n such that λ_n > 0 for all n and log λ_n^{−1} = O(log n). Then the conditions of Theorem 2.2 are satisfied for ε_n a sufficiently large multiple of the present ε_n and hence the corresponding posterior converges at the rate ε_n almost surely, for every P_0 ∈ 𝒫, relative to the Hellinger distance.


Proof. The prior Π gives probability 1 to the set 𝒬 = ∪_{j=1}^∞ 𝒬_j, for 𝒬_j the N_[](ε_j, 𝒫, h) functions in the support of Π_j. We claim that

(3.1)    D(8ε_n, 𝒬, h) ≤ exp(2nε_n²).

To see this, we first note that, given an ε-bracket [l, u] that contains a probability density p, with ‖·‖_2 the norm in L_2(µ),

    1 ≤ (∫ u dµ)^{1/2} = ‖√u‖_2 ≤ ‖√u − √p‖_2 + ‖√p‖_2 = h(u, p) + 1 ≤ 1 + ε,

    h(p, u/∫ u dµ) ≤ h(p, u) + h(u, u/∫ u dµ) ≤ 2ε.

Therefore, 𝒬_n is a 2ε_n-net over 𝒫: every point of 𝒫 is within distance 2ε_n of some point in 𝒬_n. Since for j > n every point of 𝒬_j is within distance 2ε_j ≤ 2ε_n of 𝒫, it follows that 𝒬_n is also a 4ε_n-net over 𝒬_j. This being true for every j > n, it follows that 𝒬_n is a 4ε_n-net over ∪_{j≥n} 𝒬_j and hence, trivially, ∪_{j≤n} 𝒬_j is a 4ε_n-net over 𝒬. The cardinality of the latter net is at most n N_[](ε_n, 𝒫, h) ≤ exp(nε_n² + log n) ≤ exp(2nε_n²) for sufficiently large n. By virtue of the relationship (2.1) between covering numbers and packing numbers, we obtain (3.1). This verifies condition (2.2) with 𝒫_n taken equal to 𝒬 and ε_n taken equal to eight times the present ε_n.
If u is the upper limit of the ε_n-bracket containing p_0, then

    ‖p_0 / (u/∫ u dµ)‖_∞ ≤ ∫ u dµ ≤ (1 + ε_n)².

It follows that for large n the set of points p such that h²(p, p_0)‖p_0/p‖_∞ ≤ 8ε_n² contains at least the function u/∫ u dµ and hence has prior mass at least

    λ_n (1 / N_[](ε_n, 𝒫, h)) ≥ exp(−nε_n² − O(log n)) ≥ exp(−2nε_n²)

for large n. This verifies condition (2.5) for ε_n a multiple of the present ε_n. Since condition (2.3) is trivially satisfied for 𝒫_n = 𝒬, the proof is complete. ✷

There are many specific examples in which the preceding theorem applies.
The situation here is similar to that in recent papers on rates of convergence
of (sieved) maximum likelihood estimators, as in Birgé and Massart (1993,
1997, 1998), Wong and Shen (1995) or Chapter 3.4 of van der Vaart and
Wellner (1996). It is interesting to note that these authors also use brack-
ets, whereas Birgé (1983), in his study of the metric entropy of statistical
models, uses ε-nets. This is because the cited papers are concerned with a
particular type of estimator (namely, minimum contrast estimators), whereas
Birgé (1983) uses special constructs, called d-estimators. It appears that for
good behavior of Bayes estimators on nets we also need some special property
of the nets, such as available from nets obtained from brackets.
We include two concrete examples.

Example 3.1 (Smooth densities). Suppose that 𝒫 consists of all measures with densities whose roots √p belong to a fixed multiple of the unit ball of the Hölder class C^α[0, 1], for some fixed α > 0. [See, e.g., van der Vaart and Wellner (1996) for a precise definition of this space of functions.] By results of Kolmogorov and Tikhomirov (1961), the ε-entropy numbers of this unit ball relative to the uniform norm are bounded by a multiple of (1/ε)^{1/α}. [Their result is reproduced in Theorem 2.7.1 of van der Vaart and Wellner (1996).] Because we can construct upper and lower brackets from uniform approximations, this shows that the bracketing Hellinger entropies grow like ε^{−1/α}, so that we can take ε_n of the order n^{−α/(2α+1)} to satisfy the relation log N_[](ε_n, 𝒫, h) ≤ nε_n². This rate is known to be the frequentist optimal rate for estimators. From Theorem 2.5, we therefore conclude that the prior constructed above achieves the optimal rate of convergence for the posterior.
Upper brackets are, in principle, available from the classical proof of Kolmogorov and Tikhomirov (1961). Alternatively, we may use more modern classes of approximating functions, such as wavelets or splines.
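The rate in this example comes from balancing the entropy bound against nε², a one-line calculation implicit in the text:

    log N_[](ε_n, 𝒫, h) ≍ ε_n^{−1/α} = nε_n²  ⟹  ε_n^{−1/α − 2} = n  ⟹  ε_n = n^{−α/(2α+1)},

since −1/α − 2 = −(2α + 1)/α.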
Example 3.2 (Monotone densities). Suppose that 𝒫 consists of all monotone decreasing densities on a compact interval in ℝ, bounded above by a fixed constant. The root of a monotone density is monotone and hence the bracketing entropy of 𝒫 for the Hellinger distance is bounded by the L_2-entropy of the set of monotone functions. This is of the order 1/ε [e.g., van der Vaart and Wellner (1996), Theorem 2.7.5], whence we obtain an n^{−1/3}-rate of convergence of the posterior. Again this rate cannot be improved.

Inspection of the proof of the theorem shows that the lower bounds of the brackets are not really needed. The theorem can be generalized by defining upper bracketing numbers N_](ε, 𝒫, h) as the minimal number of functions u_1, …, u_m such that for every p ∈ 𝒫 there exists a function u_i such that both p ≤ u_i and h(u_i, p) < ε. Next we construct a prior Π as before. These upper bracketing numbers are clearly smaller than the bracketing numbers N_[](ε, 𝒫, h). We have formulated the theorem using the better known bracketing numbers, because we do not know any example where this generalization could be useful.
The preceding theorem implicitly requires that the model 𝒫 be totally bounded for the Hellinger metric. A simple modification works for countable unions of totally bounded models, provided that we use a sequence of priors. Suppose that the bracketing numbers of 𝒫 are infinite, but there exist subsets 𝒫_n ↑ 𝒫 with finite bracketing numbers. Let ε_n be numbers such that log N_[](ε_n, 𝒫_n, h) ≤ nε_n². Then we construct Π_n as the discrete uniform distribution on renormalized upper brackets of a minimal set of ε_n-brackets over 𝒫_n, as before. Then the posterior relative to the prior Π_n achieves the convergence rate ε_n. (Note that this time we do not construct a fixed prior Π = ∑_n λ_n Π_n, but use the prior Π_n when n observations are available.)
In the preceding we start with a condition on the entropies with bracketing even though we apply Theorem 2.2, which demands control over metric entropies only. This is because Theorem 2.2 also requires control over the likelihood ratios. If, for instance, the densities were uniformly bounded away from zero and infinity, so that the quotients p_0/p are uniformly bounded, then we could replace the bracketing entropy in Theorem 3.1 by ordinary entropy. Alternatively, if the set of densities 𝒫 possesses an integrable envelope function, then we can construct priors achieving the rate ε_n determined by the covering numbers up to logarithmic factors. Here we define ε_n as the minimal solution of the inequality log N(ε, 𝒫, h) ≤ nε², where N(ε, 𝒫, h) denotes the Hellinger covering number (without bracketing). The construction, described briefly below, parallels Theorem 6 of Wong and Shen (1995) for sieved maximum likelihood estimators.
We assume that the set of densities 𝒫 has a µ-integrable envelope function: a measurable function m with ∫ m dµ < ∞ such that p ≤ m for every p ∈ 𝒫. Given ε_n > 0 let s_{1,n}, …, s_{N_n,n} be a minimal ε_n-net over 𝒫 [hence N_n = N(ε_n, 𝒫, h)] and put

    g_{j,n} = (s_{j,n}^{1/2} + ε_n m^{1/2})² / c_{j,n},

where c_{j,n} is a constant ensuring that g_{j,n} is a probability density. Finally, let Π_n be the uniform discrete measure on g_{1,n}, …, g_{N_n,n} and let Π = ∑_{n=1}^∞ λ_n Π_n be a convex combination of the Π_n as before.

Theorem 3.2. Suppose that ε_n are numbers decreasing in n such that log N(ε_n, 𝒫, h) ≤ nε_n² for every n and nε_n²/log n → ∞. Construct the prior Π as given previously for a sequence λ_n such that λ_n > 0 for all n and log λ_n^{−1} = O(log n). Assume m is a µ-integrable envelope. Then the corresponding posterior converges at the rate ε_n log(1/ε_n) in probability, relative to the Hellinger distance.

Proof. The proof follows as before, but this time we apply Theorem 2.1, using the observation of Wong and Shen (1995) that for any p ∈ 𝒫 such that h(p, s_{j,n}) ≤ ε_n we have that h(p, g_{j,n}) = O(ε_n) and that p/g_{j,n} is bounded above by a multiple of ε_n^{−2}. This verifies (2.4) with ε_n replaced by a multiple of ε_n log(1/ε_n) through a use of Theorem 5 of Wong and Shen (1995), the relevant part of which is reproduced below as Lemma 8.6. ✷

4. Log spline models. In this section we apply the general results to prior distributions on log spline models for densities. Log spline models for density estimation have been used, among others, by Stone (1990), who shows that the sieved maximum likelihood estimator attains the optimal rate of convergence for estimating a smooth density. As shown by Stone (1994) they can be extended to higher dimensions by using tensor splines, but following Stone (1990), we restrict ourselves to the one-dimensional case.
We assume that the observations are sampled from a density p_0 on the unit interval [0, 1] in the real line that is bounded away from zero and infinity. Our choice of priors will yield the optimal rate of convergence of the posterior if the density p_0 belongs to the Hölder space C^α[0, 1]. (This is the set of all functions that have α_0 derivatives, for α_0 the largest integer strictly smaller than α, with the α_0th derivative being Lipschitz of order α − α_0.)
Our prior measures will not be supported on the set of smooth functions, but on exponential families constructed from a spline basis. Fix some "order" q, a natural number, throughout this section. Let K be another natural number, which will increase with n, and partition the half-open unit interval (0, 1] into K subintervals ((k − 1)/K, k/K] for k = 1, …, K. Consider the linear space of splines of order q relative to this partition, that is, all functions f : [0, 1] → ℝ whose restriction to each of the partitioning intervals ((k − 1)/K, k/K] is a polynomial of degree strictly less than q and, in the case that q ≥ 2, that are q − 2 times continuously differentiable on [0, 1]. It can be shown that this is a J = (q + K − 1)-dimensional vector space. A convenient basis is the set of B-splines B_1, …, B_J, defined, for example, in de Boor (1978). More precisely, let B_1, …, B_J be the B-splines of order q for the knot sequence

    0, …, 0 (q times), 1/K, 2/K, …, (K − 1)/K, 1, …, 1 (q times),
as defined on page 108 of de Boor (1978). The exact nature of these functions
does not matter to us here, except for the following properties [cf. de Boor
(1978), pages 109 and 110]:

1. B_j ≥ 0, j = 1, …, J;
2. ∑_{j=1}^J B_j ≡ 1;
3. B_j is supported inside an interval of length q/K;
4. at most q functions B_j are nonzero at every given x.

The first two properties express that the basis elements form a partition of unity, and the third and fourth properties mean that their supports are close to being disjoint if K is very large relative to q.
For θ ∈ ℝ^J let θᵀB = ∑_j θ_j B_j and define

    p_θ(x) = exp(θᵀB(x) − c(θ)),    e^{c(θ)} = ∫₀¹ exp(θᵀB(x)) dx.

Thus p_θ belongs to a J-dimensional exponential family, with the B-spline functions as sufficient statistics. Since the B-splines add up to unity, the family is actually of dimension J − 1 and we could restrict θ to the subset Θ_0 = {θ ∈ ℝ^J : θᵀ1 = 0}. The true density p_0 of the observations need not be of the form p_θ for some θ. (Hence we make a difference between p_0 and p_θ for θ = 0 ∈ ℝ^J; this should not lead to confusion, as the density p_θ with θ = 0 does not play a role.) In the following we construct a prior measure Π_n on the set of probability densities on [0, 1] by choosing a prior on Θ_0, which next induces a prior on the probability densities p_θ through the map θ ↦ p_θ.
For q = 1 the linear space of splines consists of histograms with cell boundaries k/K for k = 0, 1, …, K. Since exponentials of histograms are histograms, our construction therefore contains priors constructed on histograms as a special case.
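Concretely, the map θ ↦ p_θ can be evaluated numerically. The following Python sketch is our own illustration (the helper name log_spline_density and all numerical choices are ours, not the paper's); it builds the J = q + K − 1 B-spline basis for the knot sequence above and normalizes the exponential family density by quadrature.

    import numpy as np
    from scipy.interpolate import BSpline

    def log_spline_density(theta, q=4, K=10, n_grid=2000):
        # knot sequence of this section: 0 (q times), 1/K, ..., (K-1)/K, 1 (q times)
        J = q + K - 1
        assert len(theta) == J
        knots = np.concatenate([np.zeros(q), np.arange(1, K) / K, np.ones(q)])
        grid = (np.arange(n_grid) + 0.5) / n_grid          # midpoint grid on (0, 1)
        # design matrix of the basis B_1, ..., B_J (order q, i.e., degree q - 1)
        design = np.column_stack([
            BSpline.basis_element(knots[j:j + q + 1], extrapolate=False)(grid)
            for j in range(J)
        ])
        design = np.nan_to_num(design)                     # B_j vanishes off its support
        log_p = design @ theta
        c = np.log(np.mean(np.exp(log_p)))                 # c(theta) by the midpoint rule
        return grid, log_p - c                             # log p_theta on the grid

Property 2, the partition of unity, can be checked numerically: design.sum(axis=1) is identically 1 on the grid.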
Since the true density p_0 need not belong to this "log spline model," we must ensure that it is approximated sufficiently closely by some p_θ. To approximate sufficiently many p_0 it is necessary to let the dimension J − 1 of the log spline models tend to infinity with n. Here we fix the order q and let the number K of partitioning sets tend to infinity. If we focus on α-smooth densities p_0, then the minimal rate at which J = J_n must grow is determined by the following lemma, taken from de Boor [(1978), page 170]. Let ‖f‖_∞ = sup_{0≤x≤1} |f(x)| be the supremum norm, and let ‖f‖_α be the seminorm

    ‖f‖_α = sup_{x≠y} |f^{(α_0)}(x) − f^{(α_0)}(y)| / |x − y|^{α−α_0}.

Because we assume that p_0 is bounded away from zero (and infinity) the function p_0 is in C^α[0, 1] if and only if log p_0 ∈ C^α[0, 1].
Lemma 4.1. Let q ≥ α > 0. There exists a constant C depending only on q and α such that, for every p_0 ∈ C^α[0, 1] that is bounded away from zero,

    inf_{θ∈ℝ^J} ‖θᵀB − log p_0‖_∞ ≤ C J^{−α} ‖log p_0‖_α.

It is easy to see from this, as we show in part below, that the root of the Kullback–Leibler divergence and the Hellinger distance between p_0 and the closest p_θ are of the order J^{−α} as well. Since a ball of radius ε_n around p_0 must contain prior mass in order to satisfy (2.9), the rate of convergence ε_n of the posterior can certainly not be faster than J^{−α}. The minimum distance of alternatives to allow appropriate tests, determined by (2.7), will be shown to satisfy nε_n² ≳ J_n. Together with the previous restriction on ε_n this will yield a rate of convergence of n^{−α/(2α+1)}, for J_n ∼ n^{1/(2α+1)}. This is also the rate of convergence of the sieved maximum likelihood estimator, found by Stone (1990). It is well known that this rate is optimal for α-smooth densities.
To make this precise we start with stating some lemmas that connect distances and norms on the densities p_θ with the J-dimensional Euclidean norm ‖θ‖ and infinity norm ‖θ‖_∞ = max_j |θ_j|. Let ‖f‖ be the L_2[0, 1]-norm of f and write a ≲ b if a ≤ Cb for a constant C that is universal or depends only on q (which is fixed throughout) and not on K. Most of these are known from or implicit in Stone (1986, 1990) or the literature on approximation theory.

Lemma 4.2. For any θ ∈ ℝ^J,

    ‖θ‖_∞ ≲ ‖θᵀB‖_∞ ≤ ‖θ‖_∞,

    ‖θ‖ ≲ √J ‖θᵀB‖ ≲ ‖θ‖.

Proof. The first inequality is proved by de Boor [(1978), page 156, Corollary 3]. The second is immediate from the fact that the B-spline basis forms a partition of unity. The third and fourth inequalities are stated in Stone [(1986), equation (12)]. As their full proofs are not in one place, we sketch the argument for completeness.
Let I_i be the interval [((i − q)/K) ∨ 0, (i/K) ∧ 1]. By (2) on page 155 of de Boor (1978), we have

    ∑_i θ_i² ≲ ∑_i ‖(θᵀB)|_{I_i}‖_∞² ≲ ∑_i K ‖(θᵀB)|_{I_i}‖².

The last inequality follows, because (θᵀB)|_{I_i} consists of at most q polynomial pieces, each on an interval of length 1/K, and the supremum norm of a polynomial of order q on an interval of length L is bounded by 1/√L times the L_2-norm, up to a constant depending on q. [To see the last: the squared L_2[0, 1]-norm of the polynomial x ↦ ∑_{j=0}^{q−1} α_j x^j on [0, 1] is the quadratic form αᵀ(EU_q U_qᵀ)α for U_q = (1, U, …, U^{q−1}) and U a uniform [0, 1] variable. The second moment matrix EU_q U_qᵀ is nonsingular and hence the quadratic form is bounded below by a constant times ‖α‖² ≥ ‖α‖_∞².] This yields the third inequality.
By property (3) of the B-spline basis at most q elements B_j(x) are nonzero for every given x, say for j ∈ J_x. Therefore,

    (θᵀB(x))² = (∑_{j∈J_x} θ_j B_j(x))² ≤ q ∑_{j∈J_x} θ_j² B_j²(x),

by the Cauchy–Schwarz inequality. Since each B_j is supported on an interval of length proportional to 1/J and takes its values in [0, 1], its L_2[0, 1]-norm is of the order 1/√J. Combined with the preceding display this yields

    ∫₀¹ (θᵀB(x))² dx ≲ (q/J) ‖θ‖².

This yields the fourth inequality. ✷

Lemma 4.3. For any θ ∈ ℝ^J such that θᵀ1 = 0,

    ‖θ‖_∞ ≲ ‖log p_θ‖_∞ ≤ 2‖θ‖_∞.

Proof. By the second inequality in Lemma 4.2 we have that ‖θᵀB‖_∞ ≤ ‖θ‖_∞, whence e^{c(θ)} is contained in the interval [e^{−M}, e^{M}] for M = ‖θ‖_∞, by its definition, so that |c(θ)| ≤ ‖θ‖_∞. Consequently, by the triangle inequality,

    ‖log p_θ‖_∞ = ‖θᵀB − c(θ)‖_∞ ≤ 2‖θ‖_∞.

This yields the inequality on the right.
For the inequality on the left, we note that, since θᵀ1 = 0,

    |c(θ)| = |(θ − c(θ)1)ᵀ (1/J) 1| ≤ ‖θ − c(θ)1‖_∞ ≲ ‖(θ − c(θ)1)ᵀB‖_∞ = ‖log p_θ‖_∞,

by Lemma 4.2. Consequently, by Lemma 4.2 and the triangle inequality,

    ‖θ‖_∞ ≲ ‖θᵀB‖_∞ ≤ ‖θᵀB − c(θ)‖_∞ + |c(θ)| ≲ 2‖log p_θ‖_∞.

This concludes the proof. ✷

As a consequence of the preceding lemma, a set of densities p_θ is uniformly bounded away from 0 and ∞ if and only if the norms ‖θ‖_∞ of the corresponding set of θ are bounded. This is true uniformly in J ∈ ℕ.

Lemma 4.4. For every θ_1, θ_2 such that 1ᵀ(θ_1 − θ_2) = 0,

    (inf_{x,θ} p_θ(x)) (‖θ_1 − θ_2‖²/J ∧ 1) ≲ h²(P_{θ_1}, P_{θ_2}) ≲ (sup_{x,θ} p_θ(x)) (‖θ_1 − θ_2‖²/J),

where the infimum and supremum are taken over all θ on the line segment between θ_1 and θ_2 and all x ∈ [0, 1].
Proof. By direct calculation and Taylor's theorem, we have

    h²(P_{θ_1}, P_{θ_2}) = 2(1 − exp(c(½θ_1 + ½θ_2) − ½c(θ_1) − ½c(θ_2)))
                        = 2(1 − exp(−(1/16)(θ_1 − θ_2)ᵀ(c̈(θ̃) + c̈(θ̃′))(θ_1 − θ_2))),

for θ̃ and θ̃′ vectors on the line segment between θ_1 and θ_2 and c̈(θ) the Hessian of c. By the well-known properties of exponential families, we have

    τᵀc̈(θ)τ = var_θ(τᵀB) = inf_{µ∈ℝ} ∫₀¹ (τᵀB(x) − µ1ᵀB(x))² p_θ(x) dx,

since 1ᵀB ≡ 1. Up to bounds below and above on p_θ the right side is equivalent to the infimum over µ of the squared L_2[0, 1]-norm of (τ − µ1)ᵀB. By Lemma 4.2 the latter is comparable to the infimum over µ of ‖τ − µ1‖²/J, which is equal to ‖τ‖²/J if 1ᵀτ = 0.
We can finish the proof by applying this in the first display, together with the inequalities 1 − e^{−x} ≤ x for x ≥ 0, 1 − e^{−x} ≥ ½(x ∧ 1) for x ≥ 0 and (cx) ∧ 1 ≥ c(x ∧ 1) for x ≥ 0 and c ≤ 1. ✷

By combining these lemmas we see that the Hellinger distance h(P_{θ_1}, P_{θ_2}) and 1/√J times the J-dimensional Euclidean distance ‖θ_1 − θ_2‖ are proportional, uniformly in J and in θ_1, θ_2 having uniformly bounded coordinates. This combined with the estimate on the distance of p_0 to the set of p_θ given by Lemma 4.1 reduces the verification of (2.7) and (2.9) to calculations in the Euclidean setting.
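This equivalence is easy to probe numerically with the log_spline_density sketch above (again our own illustration, with arbitrary test values):

    import numpy as np

    q, K = 4, 10
    J = q + K - 1
    rng = np.random.default_rng(1)
    theta1 = rng.uniform(-1, 1, J); theta1 -= theta1.mean()   # enforce theta^T 1 = 0
    theta2 = rng.uniform(-1, 1, J); theta2 -= theta2.mean()
    grid, lp1 = log_spline_density(theta1, q, K)
    _, lp2 = log_spline_density(theta2, q, K)
    h2 = np.mean((np.exp(lp1 / 2) - np.exp(lp2 / 2)) ** 2)    # squared Hellinger distance
    print(h2, np.sum((theta1 - theta2) ** 2) / J)             # same order, per Lemma 4.4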
We are now ready to prove the following theorem. By Lemma 4.3 there exists a constant d such that d‖θ‖_∞ ≤ ‖log p_θ‖_∞ for every θ ∈ ℝ^J with θᵀ1 = 0 and every J ∈ ℕ. We shall assume that the prior is chosen as roughly uniform on a large box [−M, M]^J. This corresponds to densities p_θ that are bounded and bounded away from zero by at least a small constant.

Theorem 4.5. Suppose that Π_n has a density with respect to Lebesgue measure on {θ ∈ ℝ^{J_n} : θᵀ1 = 0} for J_n ∼ n^{1/(2α+1)} whose minimum and maximum values on [−M, M]^{J_n} are bounded below and above by terms of the orders c^{J_n} and C^{J_n}, respectively, for positive constants c, C, and which vanishes outside [−M, M]^{J_n}. Let M ≥ 1. Then for every p_0 ∈ C^α[0, 1] for q ≥ α ≥ 1/2 such that ‖log p_0‖_∞ ≤ ½dM the conditions of Theorem 2.4 are satisfied for ε_n a large multiple of n^{−α/(2α+1)} and 𝒫_n the support of Π_n, and hence the posterior rate of convergence is n^{−α/(2α+1)}.

Proof. Let θ_0 minimize θ ↦ ‖log p_θ − log p_0‖_∞ over θ ∈ ℝ^J such that θᵀ1 = 0. We first show that, for constants C_1, C depending on p_0, α and q only,

(4.1)    h(p_{θ_0}, p_0) ≤ C_1 ‖log p_{θ_0} − log p_0‖_∞ ≤ C J^{−α}.
By Lemma 4.1 there exists θ* such that ‖(θ*)ᵀB − log p_0‖_∞ ≲ J^{−α}. Taking exponentials we see that this implies that ‖exp((θ*)ᵀB) − p_0‖_∞ ≲ J^{−α}, and next, by integrating this inequality, that |exp(c(θ*)) − 1| ≲ J^{−α}, whence |c(θ*)| ≲ J^{−α}. Consequently, ‖log p_{θ*} − log p_0‖_∞ ≲ J^{−α}, whence θ* minimizes θ ↦ ‖log p_θ − log p_0‖_∞ up to a multiple of J^{−α}. Since the set of p_θ is the same whether θ is restricted to satisfy θᵀ1 = 0 or not, the second inequality in the display follows by the definition of θ_0. The first now follows easily, since p_0 and hence p_{θ_0} is bounded away from zero and infinity.
Thus the Hellinger ball of radius ε around P_0 is contained in a multiple of the Hellinger ball of radius ε + J^{−α} around P_{θ_0}, whence by Lemma 4.4, for any ε > 0 and suitable constants A, B, C, since ‖θ_0‖_∞ ≤ ½M + o(1) by the assumption that ‖log p_0‖_∞ ≤ ½dM,

    {P_θ : h(P_θ, P_0) ≤ ε, ‖θ‖_∞ ≤ M}
        ⊂ {P_θ : A h(P_θ, P_{θ_0}) ≤ ε + J^{−α}, ‖θ‖_∞ ≤ M}
        ⊂ {P_θ : B(‖θ − θ_0‖/√J ∧ 1) ≤ ε + J^{−α}, ‖θ − θ_0‖ ≤ 2√J M, ‖θ‖_∞ ≤ M}
        ⊂ {P_θ : ‖θ − θ_0‖ ≤ C√J(ε + J^{−α})M},

since {x : ‖x‖ ∧ 1 ≤ ε, ‖x‖ ≤ M} ⊂ {x : ‖x‖ ≤ εM} for M ≥ 1. Hence, in view of Example 7.1 [or Pollard (1990), Lemma 4.1] and Lemma 4.4, for constants E, F,

    D(ε/2, {P_θ : h(P_θ, P_0) ≤ 2ε, ‖θ‖_∞ ≤ M}, h)
        ≤ D(Eε√J, {θ : ‖θ − θ_0‖ ≤ 2C√J(ε + J^{−α})M}, ‖·‖)
        ≤ (F√J(ε + J^{−α})M / (ε√J))^J.
 
Therefore, we can verify (2.7) for 𝒫_n = {P_θ : ‖θ‖_∞ ≤ M} and every ε_n such that

    J_n log(1 + J_n^{−α}/ε_n) ≲ nε_n².

Next, we have, with vol_J the volume of the (J − 1)-dimensional unit ball,

    Π_n(P_θ : h(P_θ, P_0) ≤ 2jε, ‖θ‖_∞ ≤ M) ≤ sup_{‖θ‖_∞≤M} π_n(θ) (2C√J(jε + J^{−α})M)^J vol_J.

By Lemma 4.3 and the assumption that p_0 is bounded, the norms ‖p_0/p_θ‖_∞ are uniformly bounded over θ ranging over a set of bounded ‖θ‖_∞. Therefore, in view of Lemma 4.4 and (4.1), uniformly in ‖θ‖_∞ ≤ M,

    h²(p_θ, p_0) ‖p_0/p_θ‖_∞ ≲ h²(p_θ, p_{θ_0}) + h²(p_{θ_0}, p_0) ≲ ‖θ − θ_0‖²/J + J^{−2α}.

We conclude that for ε_n bigger than a sufficiently large multiple of J^{−α},

    Π_n(P_θ : h²(p_θ, p_0)‖p_0/p_θ‖_∞ ≲ ε_n²)
        ≥ Π_n(θ : ‖θ‖_∞ ≤ M, ‖θ − θ_0‖/√J ≤ ε_n − J^{−α})
        ≥ inf_{‖θ‖_∞≤M} π_n(θ) vol(θ : ‖θ‖_∞ ≤ M, ‖θ − θ_0‖ ≤ ½ε_n√J)
        = inf_{‖θ‖_∞≤M} π_n(θ) (½ε_n√J)^J vol_J,

since ‖θ‖_∞ ≤ ‖θ_0‖_∞ + ‖θ − θ_0‖ ≤ ‖θ_0‖_∞ + ½ε_n√J ≤ M eventually, if ‖θ − θ_0‖ ≤ ½ε_n√J. By assumption, the first term is of the order c^J. Thus condition (2.9) is satisfied if, for all sufficiently large j,

    J log j ≲ nε_n² j²    and    ε_n ≳ J^{−α}.

This gives ε_n of the order 1/n^{α/(2α+1)} for J_n of the order ε_n^{−1/α}. ✷

5. Finite-dimensional models. Although in this paper we are primar-


ily interested in infinite-dimensional models, it is desirable to have a unified
theory applicable to both finite- and infinite-dimensional models. In this sec-
tion we show that Theorem 2.4 yields the right rate of convergence for finite-
dimensional models.
Let {p_θ : θ ∈ Θ} be a family of densities parametrized by a Euclidean parameter θ running through a set Θ ⊂ ℝ^d. Assume that for every θ, θ_1, θ_2 ∈ Θ and some α > 0,

    −P_{θ_0} log(p_θ/p_{θ_0}) ≲ ‖θ − θ_0‖^{2α},

    P_{θ_0} (log(p_θ/p_{θ_0}))² ≲ ‖θ − θ_0‖^{2α},

    ‖θ_1 − θ_2‖^α ≲ h(P_{θ_1}, P_{θ_2}) ≲ ‖θ_1 − θ_2‖^α.

Assume that the prior measure Π possesses a density that is uniformly bounded away from zero and infinity on Θ. In this situation the posterior rate of convergence is 1/√n relative to the Hellinger distance h. Under the assumptions, this translates into an n^{−1/(2α)}-rate of convergence of the posterior for θ in the Euclidean distance.
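In brief, since ‖θ − θ_0‖^α ≲ h(P_θ, P_{θ_0}), posterior concentration on the Hellinger ball {h(P_θ, P_{θ_0}) ≤ M/√n} forces concentration on the Euclidean ball

    {‖θ − θ_0‖ ≲ (M/√n)^{1/α}} = {‖θ − θ_0‖ ≲ M^{1/α} n^{−1/(2α)}}.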

Theorem 5.1. Under the conditions listed previously and θ_0 interior to Θ, the conditions of Theorem 2.4 are satisfied for 𝒫 = 𝒫_n = {P_θ : θ ∈ Θ}, the Hellinger distance d and ε_n a sufficiently large multiple of 1/√n.
Proof. The left side of condition (2.7) is seen to be bounded by a constant in Example 7.1 in the case that α = 1. The case of general α is not different. It follows that (2.7) is satisfied for ε_n = M/√n and sufficiently large M.
In order to verify (2.9) we calculate

    Π(P_θ : h(P_θ, P_{θ_0}) ≤ jε_n) / Π(P_θ : −P_{θ_0} log(p_θ/p_{θ_0}) ≤ ε_n², P_{θ_0}(log(p_θ/p_{θ_0}))² ≤ ε_n²)
        ≤ Π(θ : ‖θ − θ_0‖ ≤ (Ajε_n)^{1/α}) / Π(θ : ‖θ − θ_0‖ ≤ (Bε_n)^{1/α})
        ≤ C (A/B)^{d/α} j^{d/α},

for constants A, B defined by the conditions preceding the theorem, and a constant C depending on the prior density only. It follows that (2.9) is satisfied easily for ε_n = M/√n and sufficiently large M. ✷

It may be noted that our conditions preclude unbounded parameter spaces Θ: we cannot have that the Hellinger distance is bounded below by a multiple of the Euclidean distance unless the latter is bounded, since the Hellinger distance is uniformly bounded above. This could be improved by replacing condition (2.7) by a testing condition. The lower bound on the Hellinger distance is used only to verify (2.7), which in turn is used only to ensure the existence of tests of θ_0 versus the complements of balls of radius ε around θ_0. For most classical parametric models such tests exist. In fact, existence of uniformly consistent tests of the outside of a compact neighborhood of θ_0 already implies existence of tests with exponential error probabilities (see Lemma 7.2), and this would be sufficient to reduce the problem to a bounded parameter set, to which the preceding theorem applies. Note that the conditions are very reasonable for Θ equal to a small neighborhood of θ_0. See also Le Cam (1973) and Le Cam and Yang (1990).

6. Priors based on Dirichlet processes. In this section we apply the general theorems to priors based on Dirichlet processes. A major difficulty is the computation of the prior mass, as in conditions (2.4) or (2.5). We present one such computation and expect that future papers will address more problems of this sort. We shall need an estimate of the probability of an L_1-ball under a Dirichlet distribution, given by the following lemma.

Lemma 6.1. Let (X_1, …, X_N) be distributed according to the Dirichlet distribution on the N-simplex with parameters (m; α_1, …, α_N), where Aε ≤ α_i ≤ 1 and ∑_{i=1}^N α_i = m for some constant A. Let (x_{10}, …, x_{N0}) be any point on the N-simplex. There exist positive constants c and C depending only on A such that, for ε ≤ 1/N,

(6.1)    Pr(∑_{i=1}^N |X_i − x_{i0}| ≤ 2ε) ≥ C exp(−cN log(1/ε)).
Proof. Find an index i such that x_{i0} ≥ 1/N. By relabelling, we can assume that i = N. If |x_i − x_{i0}| ≤ ε² for i = 1, …, N − 1, then

    ∑_{i=1}^{N−1} x_i ≤ 1 − x_{N0} + (N − 1)ε² ≤ 1 − 1/N + (N − 1)ε² ≤ 1 − ε² < 1.

Hence there exists x = (x_1, …, x_N) in the simplex with these first N − 1 coordinates. Furthermore, ∑_{i=1}^N |x_i − x_{i0}| ≤ 2∑_{i=1}^{N−1} |x_i − x_{i0}| ≤ 2ε²(N − 1) ≤ 2ε. Therefore the probability on the left-hand side of (6.1) is bounded below by

    P(|X_i − x_{i0}| ≤ ε², i = 1, …, N − 1)
        ≥ (Γ(m) / ∏_{i=1}^N Γ(α_i)) ∏_{i=1}^{N−1} ∫_{max(x_{i0}−ε², 0)}^{min(x_{i0}+ε², 1)} x_i^{α_i−1} dx_i.

We use here that (1 − ∑_{i=1}^{N−1} x_i)^{α_N−1} ≥ 1, since α_N ≤ 1. Similarly, since α_i ≤ 1 for every i, we can lower bound the integrand by 1 and note that the interval of integration contains at least an interval of length ε². Since Γ(α)α = Γ(α + 1) ≤ 1 for 0 < α ≤ 1 we can bound the last display from below by

    Γ(m) ε^{2(N−1)} ∏_{i=1}^N α_i ≥ Γ(A) ε^{2(N−1)} (Aε)^N ≥ C exp(−cN log(1/ε)).

This concludes the proof. ✷
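The small-ball lower bound (6.1) is easy to probe by simulation; the following Python sketch uses illustrative values of N, the α_i, the center and the constant c that are our own choices, not the paper's.

    import numpy as np

    rng = np.random.default_rng(0)
    N, eps = 5, 0.05                      # eps <= 1/N as in the lemma
    alpha = np.full(N, 0.8)               # A*eps <= alpha_i <= 1 with A = 16, say
    x0 = np.full(N, 1.0 / N)              # a point on the N-simplex
    X = rng.dirichlet(alpha, size=200_000)
    ball = np.abs(X - x0).sum(axis=1) <= 2 * eps
    print("empirical P(sum |X_i - x_i0| <= 2*eps):", ball.mean())
    print("lower bound exp(-c N log(1/eps)) with c = 1:",
          np.exp(-1 * N * np.log(1 / eps)))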

Example 6.1 (Current status censoring). Let Y_1, …, Y_n be an i.i.d. sample from a distribution F and C_1, …, C_n an independent i.i.d. sample from a distribution G, both on [0, ∞). Suppose that we observe X_i = (Δ_i, C_i) for i = 1, …, n, where Δ_i = 1{Y_i ≤ C_i}, and would like to estimate F. The density function p_F of X with respect to the product of counting measure on {0, 1} and a dominating measure for G at (δ, c) is given by

    p_F(δ, c) = F(c)^δ (1 − F(c))^{1−δ} g(c).

Since this factorizes into parts depending on F and G only, if we put a product prior on the pair (F, G) and next compute the posterior for F only, then the part involving G will cancel out. Therefore, it is equivalent to treat g as a known density and put no prior on g.
We assume that G is supported on some compact interval [a, b] and that the true distribution F_0 is continuous and has support which extends to the left and the right of [a, b]. [Hence F_0(a−) > 0 and F_0(b) < 1.] As a prior measure on F we consider a Dirichlet prior with base measure α that has a positive, continuous density on a compact interval containing [a, b]. We shall show that the conditions of Theorem 2.2 are satisfied for ε_n a large multiple of n^{−1/3}(log n)^{1/3}. This is very close to the optimal rate of convergence in this model, which is n^{−1/3}. We do not exclude the possibility that this small discrepancy is due to suboptimal estimates of the prior mass in the following, and not a deficit of Dirichlet priors. We note that the priors based on ε-nets
given in Section 3 do lead to a posterior rate of convergence of n^{−1/3}, as the bracketing entropy for this model is of the order 1/ε.
Since the roots of the densities p_F are essentially pairs of two bounded monotone functions and the Hellinger distance is the L_2-distance between the root densities, the Hellinger entropy of the model {P_F : F ∈ 𝔉}, where 𝔉 is the set of all distribution functions on [0, ∞), can be estimated by the estimate of the entropy of the space of uniformly bounded monotone functions. Thus it is of the order 1/ε [see Theorem 2.7.5 of van der Vaart and Wellner (1996)]. Therefore condition (2.2) is verified for ε_n equal to or bigger than n^{−1/3}.
Under our conditions, F_0 is bounded away from 0 and 1 on the interval [a, b] that contains all observation times C_i. Consequently, the quotients p_{F_0}/p_F(x) are uniformly bounded away from zero and infinity, uniformly in F that are uniformly close to F_0 on the interval [a, b]. The squared Hellinger distance is equal to

    h²(P_F, P_{F_0}) = ∫ (F(c)^{1/2} − F_0(c)^{1/2})² dG(c) + ∫ ((1 − F(c))^{1/2} − (1 − F_0(c))^{1/2})² dG(c)
                     ≤ C sup_{c∈[a,b]} |F(c) − F_0(c)|²,

for a constant depending on F_0. Thus, to verify (2.5) it suffices to estimate the prior mass of a Kolmogorov–Smirnov ball of radius ε_n around F_0. Given ε > 0, partition the positive half line into intervals E_1, …, E_N such that F_0(E_i) ≤ ε and Aε ≤ α(E_i) ≤ 1 for every i and some fixed constant A. We can achieve this with N = O(1/ε) intervals. By Lemma 6.1, the set {F : ∑_{i=1}^N |F(E_i) − F_0(E_i)| ≤ ε} has probability of the order exp(−c(1/ε) log(1/ε)). For every F in this set, the Kolmogorov–Smirnov distance to F_0 is of the order ε. We conclude that the prior mass in a Hellinger ball of radius a large multiple of ε is of the order exp(−c(1/ε) log(1/ε)). Thus condition (2.5) is verified for ε_n a large multiple of n^{−1/3}(log n)^{1/3}.
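The rate claimed here follows by balancing this prior-mass estimate against the requirement exp(−nε_n²C) in (2.5):

    (1/ε_n) log(1/ε_n) ≍ nε_n²  ⟹  nε_n³ ≍ log(1/ε_n)  ⟹  ε_n ≍ (n^{−1} log n)^{1/3} = n^{−1/3}(log n)^{1/3}.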

7. Existence of tests. In this section we consider some results on the existence of tests of P_0 versus the complement {P : d(P, P_0) > ε} of the ball of radius ε around P_0. The existence of certain tests is a main element in the proofs of Theorems 2.1–2.3 and is guaranteed by entropy bounds. At the end of this section we state a theorem on the rate of convergence of posteriors directly in terms of tests.
Appropriate tests can be built up from tests of P_0 versus balls {P : d(P, P_1) ≤ η} for given P_1. Throughout this section we use a distance d such that for every pair P_0 and P_1 in the model 𝒫 there exist tests φ_n such that, for some universal constant K,

(7.1)    P_0^n φ_n ≤ exp(−Knd²(P_0, P_1)),

(7.2)    sup_{d(P,P_1)<d(P_0,P_1)/2} P^n(1 − φ_n) ≤ exp(−Knd²(P_0, P_1)).
This is true both for d equal to the total variation distance and for d equal to
the Hellinger distance. (The constant 2 has no particular interest and is not
optimal; any constant bigger than 1 is possible and would do for our purposes.)
More generally, it is known from Birgé (1984) and Le Cam (1986) (see Lemma 4 on page 478) that given any two convex sets 𝒫_0 and 𝒫_1 of probability measures, there exist tests φ_n such that

(7.3)    sup_{P∈𝒫_0} P^n φ_n ≤ exp(n log ρ(𝒫_0, 𝒫_1)),

(7.4)    sup_{P∈𝒫_1} P^n(1 − φ_n) ≤ exp(n log ρ(𝒫_0, 𝒫_1)),

where ρ(𝒫_0, 𝒫_1) = 1 − ½h²(𝒫_0, 𝒫_1) is the Hellinger affinity, and h(𝒫_0, 𝒫_1) is the minimum of h(P_0, P_1) over P_0 ∈ 𝒫_0 and P_1 ∈ 𝒫_1. Because log ρ ≤ −½h² this gives exponential decrease of the error probabilities, with the exponent proportional to −nh²(𝒫_0, 𝒫_1). This general result brings out the special role of the Hellinger distance (even though in some situations it may be preferable to work with the log Hellinger affinity directly).
If a distance d is bounded above by the Hellinger distance, then the ball {P : d(P, P_1) < d(P_0, P_1)/2} is at Hellinger distance at least d(P_0, P_1)/2 from P_0. Thus if this ball is a convex set of probability measures, then (7.1) and (7.2) are satisfied for d (with K = 1/8), by the general results of Birgé and Le Cam.
This argument immediately gives (7.1) and (7.2) for the Hellinger distance itself and for the total variation distance, which satisfies ∫ |dP − dQ| ≤ 2h(P, Q) by the Cauchy–Schwarz inequality. If the set of probability densities under consideration is uniformly bounded, then it also gives (7.1) and (7.2) for the L_2-distance, because this is then also bounded by a multiple of the Hellinger distance.
The next step is to combine the tests for balls (which are convex) into a test
for the complements of balls, which are nonconvex. The following result is
related to Lemma 2.1 in Birgé (1983). The number Dε in its first condition
is related to the measure of metric dimension used by Birgé and Le Cam.
The number supε≥εn Dε is almost identical to what Le Cam (1986) calls the
dimension of  for the pair d εn .

Theorem 7.1. Suppose that for some nonincreasing function Dε, some
εn ≥ 0 and every ε > εn ,
ε   
D  P ε ≤ dP P0  ≤ 2ε  d ≤ Dε
2
Then for every ε > εn there exist tests φn (depending on ε > 0) such that, for a
universal constant K and every j ∈ ,
  1
(7.5) P0n φn ≤ Dε exp −Knε2 
1 − exp−Knε2 
 
(7.6) sup Pn 1 − φn  ≤ exp −Knε2 j2 
dP P0 >jε
522 S. GHOSAL, J. K. GHOSH AND A. W. VAN DER VAART

Proof. For a given j ∈ ℕ choose a maximal jε/2-separated set S_j' of points in S_j = {P: jε < d(P, P_0) ≤ (j + 1)ε}. This yields a set S_j' of at most D(jε) points, and every P ∈ S_j is within distance jε/2 of at least one of these points. (Take S_j' empty and adapt the following in the obvious way if S_j is empty.) For every such point P_1 ∈ S_j' there exists a test ω_n with the properties as in (7.1) and (7.2). Let φ_n be the maximum of all tests attached in this way to some point P_1 ∈ S_j' for some j ∈ ℕ. Then

P_0^n φ_n ≤ ∑_j ∑_{P_1 ∈ S_j'} exp(−Knj^2 ε^2) ≤ ∑_{j ∈ ℕ} D(jε) exp(−Knj^2 ε^2),

sup_{P ∈ ∪_{i ≥ j} S_i} P^n(1 − φ_n) ≤ sup_{i ≥ j} exp(−Kni^2 ε^2).

The right sides can be further bounded as desired. [Note that D(jε) ≤ D(ε) for every j ∈ ℕ, by assumption.] ✷

One possible choice for D(ε) is the packing number D(ε/2, 𝒫, d). This is a bigger number, but in many infinite-dimensional situations this does not appear to yield a real loss. On the other hand, the theorem is needed as stated if 𝒫 is finite-dimensional.

Example 7.1. Suppose that 𝒫 = {P_θ: θ ∈ Θ ⊂ ℝ^m} and, for given constants A and B and ‖·‖ the m-dimensional Euclidean norm,

d(P_θ, P_{θ_0}) ≥ A‖θ − θ_0‖,
d(P_{θ_1}, P_{θ_2}) ≤ B‖θ_1 − θ_2‖.

(Since both the Hellinger and total variation metric are bounded, the first can be true with d one of these distances only if Θ is bounded.) The ε-packing number of the m-dimensional unit ball is bounded above by (6/ε)^m [e.g., Pollard (1990), Lemma 4.1]. Thus

D(kε, {θ ∈ ℝ^m: ‖θ − θ_0‖ ≤ lε}, ‖·‖) ≤ (6l/k)^m.

It follows that

D(ε, {P_θ: d(P_θ, P_{θ_0}) ≤ 2ε}, d) ≤ D(ε/B, {θ: ‖θ − θ_0‖ ≤ 2ε/A}, ‖·‖) ≤ (12B/A)^m.

Thus we can take D(ε) in Theorem 7.1 independent of ε, but increasing exponentially with the dimension (if A/B is fixed). In comparison, the numbers D(ε, 𝒫, h) are of the order ε^{−m}.
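
The packing bound in this example is easy to probe numerically. The following sketch (our addition; the greedy construction, sample size and constants are only illustrative) extracts an ε-separated subset of random points in the unit ball and compares its size with (6/ε)^m:

    import numpy as np

    rng = np.random.default_rng(0)

    def greedy_packing(points, eps):
        # Greedily keep points at distance >= eps from all points kept so far;
        # the result is an eps-separated set, so its size is at most D(eps).
        centers = []
        for x in points:
            if all(np.linalg.norm(x - c) >= eps for c in centers):
                centers.append(x)
        return centers

    m, eps = 3, 0.5
    pts = rng.uniform(-1.0, 1.0, size=(20000, m))
    pts = pts[np.linalg.norm(pts, axis=1) <= 1.0]   # keep points in the unit ball

    print(len(greedy_packing(pts, eps)), (6 / eps) ** m)

The greedy count is of course only a lower bound on the maximal packing; in practice it falls far below the (6/ε)^m bound, which is what the bound's role in the theory requires.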

It is known from Le Cam (1973) that even for a fixed ε there need not exist a consistent sequence of tests of P_0 versus {P ∈ 𝒫: d(P, P_0) > ε}. The preceding theorem shows that total boundedness of 𝒫 [which is equivalent to D(ε, 𝒫, d) being finite for every ε > 0] is sufficient for the existence of such a test. However, this is not necessary. One example showing this is given by (7.1) and (7.2) applied with 𝒫 = {P_0} ∪ {P: d(P, P_1) < d(P_0, P_1)/2}, because a total variation or Hellinger ball is usually not totally bounded. A classical example is as follows.

Example 7.2. The collection of all normal distributions N(θ, 1) on ℝ is not totally bounded for the Hellinger or total variation distances, but certainly there are good tests of H_0: θ = 0 versus H_1: θ > ε. [Actually, in this case the affinity satisfies log ρ(N(θ, 1), N(θ', 1)) = −(θ − θ')^2/8 and hence we could apply the general form of (7.3) and (7.4) in combination with the Euclidean distance to obtain good tests through the approach of Theorem 7.1, even for unbounded alternatives. For other parametric models the log affinity is typically not nicely related to the Euclidean distance and this approach would fail.]
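
The closed form of the normal affinity quoted in brackets can be checked by direct integration; the following fragment (our illustration, not part of the original) compares the numerically computed log affinity with −(θ − θ')^2/8:

    import numpy as np
    from scipy.integrate import quad

    def log_affinity(t1, t2):
        # log of the Hellinger affinity between N(t1, 1) and N(t2, 1)
        f = lambda x: np.sqrt(np.exp(-0.5 * (x - t1) ** 2)
                              * np.exp(-0.5 * (x - t2) ** 2)) / np.sqrt(2 * np.pi)
        rho, _ = quad(f, -np.inf, np.inf)
        return np.log(rho)

    for t in [0.5, 1.0, 3.0]:
        print(log_affinity(0.0, t), -t ** 2 / 8)   # the two columns should agree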

On the other hand, if a fixed part of 𝒫 can be uniformly consistently tested versus P_0, then it can also be tested with exponentially small error probabilities. This implies that such a fixed part can be ignored for our purposes, in that it is not a loss of generality in the main result to assume that the prior only charges the remaining part of 𝒫 (and P_0). The error probabilities of the tests φ_n given in the following lemma are of smaller order than the error probabilities of the tests in Theorem 7.1 if ε = ε_n → 0 in the latter theorem. This can be useful to reduce the model 𝒫 to a totally bounded submodel, by trimming away parts that can easily be tested by ad hoc arguments. The following lemma is a consequence of results of Le Cam (1973).

Lemma 7.2. Suppose that there exist tests ω_n such that for fixed sets 𝒫_0 and 𝒫_1 of probability measures,

sup_{P_0 ∈ 𝒫_0} P_0^n ω_n → 0,   sup_{P ∈ 𝒫_1} P^n(1 − ω_n) → 0.

Then there exist tests φ_n and a constant K > 0 such that

sup_{P_0 ∈ 𝒫_0} P_0^n φ_n ≤ e^{−Kn},   sup_{P ∈ 𝒫_1} P^n(1 − φ_n) ≤ e^{−Kn}.

In view of the fact that, apparently, entropy conditions are not always appro-
priate to ensure the existence of tests, it is fruitful to formulate a theorem on
rates of convergence directly in terms of existence of tests. The following is a
result of this type.

Theorem 7.3. Suppose that (2.8) and (2.9) hold for a sequence ε_n with ε_n → 0 and nε_n^2 bounded away from zero and sets 𝒫_n ⊂ 𝒫, and in addition suppose that there exists a sequence of tests φ_n such that for some constant K > 0 and for every sufficiently large j,

(7.7) P_0^n φ_n → 0,

(7.8) sup_{P ∈ 𝒫_n: jε_n < d(P, P_0) ≤ 2jε_n} P^n(1 − φ_n) ≤ exp(−Knε_n^2 j^2).

Then for any M_n → ∞, we have that Π_n(P: d(P, P_0) ≥ M_n ε_n | X_1, …, X_n) → 0 in P_0^n-probability.

8. Proof of Theorems 2.1–2.3. In the proof of Theorem 2.1 we use the following simple lemma, which will need to be replaced by more complicated results for the proofs of Theorems 2.2 and 2.3.

Lemma 8.1. For every ε > 0 and probability measure Π on the set

{P: −P_0 log(p/p_0) ≤ ε^2, P_0(log(p/p_0))^2 ≤ ε^2},

we have, for every C > 0,

P_0^n( ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ(P) ≤ exp(−(1 + C)nε^2) ) ≤ 1/(C^2 nε^2).

Proof. By Jensen's inequality applied to the logarithm,

log ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ(P) ≥ ∑_{i=1}^n ∫ log(p/p_0)(X_i) dΠ(P).

Thus the probability is bounded by, with 𝔾_n = √n(ℙ_n − P_0) the empirical process,

P_0^n( 𝔾_n ∫ log(p/p_0) dΠ(P) ≤ −√n(1 + C)ε^2 − √n P_0 ∫ log(p/p_0) dΠ(P) ).

By Fubini's theorem and the assumption on Π, the expression on the right of the inequality sign is bounded by −√n ε^2 C. An application of Chebyshev's inequality yields the upper bound

var( ∫ log(p/p_0)(X_1) dΠ(P) ) / (C^2 nε^4) ≤ ∫ P_0(log(p/p_0))^2 dΠ(P) / (C^2 nε^4),

by another application of Jensen's inequality. The right side is bounded by (C^2 nε^2)^{−1} by the assumption on Π. This concludes the proof. ✷
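
As an illustration of this inequality (our addition; the Bernoulli model and all constants are chosen only for the demonstration), one can take Π to be a point mass, so that the integral reduces to a single likelihood ratio, and compare the empirical frequency of the event in Lemma 8.1 with the bound 1/(C^2 nε^2):

    import numpy as np

    rng = np.random.default_rng(1)

    # P0 = Bernoulli(1/2); Pi is a point mass at P = Bernoulli(0.55).
    p0, p = 0.5, 0.55
    lr1, lr0 = np.log(p / p0), np.log((1 - p) / (1 - p0))  # log(p/p0) at x = 1, 0

    kl = -(p0 * lr1 + (1 - p0) * lr0)             # -P0 log(p/p0)
    m2 = p0 * lr1 ** 2 + (1 - p0) * lr0 ** 2      # P0 (log(p/p0))^2
    eps2 = max(kl, m2)                            # so P lies in the set of Lemma 8.1

    n, C, reps = 500, 1.0, 20000
    x = rng.integers(0, 2, size=(reps, n))        # replicated samples from P0
    loglik = np.where(x == 1, lr1, lr0).sum(axis=1)
    print(np.mean(loglik <= -(1 + C) * n * eps2), 1 / (C ** 2 * n * eps2))

The empirical frequency is far below the Chebyshev bound, as expected; the lemma only needs a crude second-moment control.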

Proof of Theorem 2.1. For every ε > 2ε_n we have, by (2.2),

log D(ε/2, 𝒫_n, d) ≤ log D(ε_n, 𝒫_n, d) ≤ nε_n^2.

Therefore, by Theorem 7.1, applied with D(ε) = exp(nε_n^2) (constant in ε) and ε = Mε_n and j = 1 in its assertion, where M ≥ 2 is a large constant to be chosen later, there exist tests φ_n that satisfy

(8.1) P_0^n φ_n ≤ exp(nε_n^2) exp(−KnM^2 ε_n^2) / (1 − exp(−KnM^2 ε_n^2)),

(8.2) sup_{P ∈ 𝒫_n: d(P, P_0) > Mε_n} P^n(1 − φ_n) ≤ exp(−KnM^2 ε_n^2).

By the first condition (8.1) it follows that, if KM^2 − 1 > K, as n → ∞,

(8.3) E_{P_0} Π_n(P: d(P, P_0) ≥ Mε_n | X_1, …, X_n)φ_n ≤ P_0^n φ_n ≤ 2e^{−Knε_n^2}.
By Fubini's theorem and the fact that P_0(p/p_0) ≤ 1,

E_{P_0} ∫_{𝒫∖𝒫_n} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(P) ≤ Π_n(𝒫 ∖ 𝒫_n).

Combining the above assertion with (8.2) we see that

E_{P_0} ∫_{P: d(P, P_0) > Mε_n} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(P)(1 − φ_n)
  ≤ Π_n(𝒫 ∖ 𝒫_n) + ∫_{P ∈ 𝒫_n: d(P, P_0) > Mε_n} P^n(1 − φ_n) dΠ_n(P)
  ≤ Π_n(𝒫 ∖ 𝒫_n) + exp(−KnM^2 ε_n^2) ≤ 2 exp(−nε_n^2(C + 4)),

for M ≥ √((C + 4)/K). By Lemma 8.1, we have with probability tending to 1, with B_n = {P: −P_0 log(p/p_0) ≤ ε_n^2, P_0(log(p/p_0))^2 ≤ ε_n^2},

∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(P) ≥ exp(−2nε_n^2)Π_n(B_n) ≥ exp(−nε_n^2(2 + C)),

by assumption (2.4). If A_n is the event that this inequality is true, so that P_0^n(A_n) → 1, then it follows that

E_{P_0} Π_n(P: d(P, P_0) > Mε_n | X_1, …, X_n)(1 − φ_n)1_{A_n}
  ≤ exp(nε_n^2(2 + C)) 2 exp(−nε_n^2(C + 4)) → 0.

This concludes the proof. ✷
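
The mechanics of this argument can be mimicked in a toy simulation (ours, purely illustrative): with a uniform prior on a grid of Bernoulli(θ) models and ε_n of the parametric order n^{−1/2}, the posterior mass outside a ball of radius Mε_n around the true θ_0 becomes negligible as n grows:

    import numpy as np

    rng = np.random.default_rng(2)

    theta0, M = 0.3, 5.0
    grid = np.linspace(0.01, 0.99, 197)       # model: Bernoulli(theta), theta on a grid
    log_prior = np.full(grid.size, -np.log(grid.size))

    for n in [100, 1000, 10000]:
        s = rng.binomial(1, theta0, size=n).sum()
        loglik = s * np.log(grid) + (n - s) * np.log(1 - grid)
        post = np.exp(log_prior + loglik - (log_prior + loglik).max())
        post /= post.sum()
        eps_n = 1 / np.sqrt(n)
        print(n, post[np.abs(grid - theta0) > M * eps_n].sum())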

For the proof of Theorem 2.2 we need a replacement of Lemma 8.1 that gives
a faster rate of convergence in its statement. We can achieve this by controlling
the quotients p/p0 . First, if one has uniform control from below, then the
Hellinger distance and the Kullback–Leibler information are comparable. The
following lemma can be found in Birgé and Massart (1998) [see their (7.6)].

Lemma 8.2. For any pair of probability measures P and P_0,

h^2(P, P_0) ≤ −P_0 log(p/p_0) ≤ 2h^2(P, P_0)(1 + log‖p_0/p‖_∞) ≤ 2h^2(P, P_0)‖p_0/p‖_∞.
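
A quick numerical check of this chain of inequalities (our addition, not part of the original) on a pair of discrete distributions, where ‖p_0/p‖_∞ is the maximum of the finitely many density ratios:

    import numpy as np

    p0 = np.array([0.4, 0.4, 0.2])
    p = np.array([0.3, 0.5, 0.2])

    h2 = np.sum((np.sqrt(p) - np.sqrt(p0)) ** 2)   # squared Hellinger distance
    kl = np.sum(p0 * np.log(p0 / p))               # -P0 log(p/p0)
    sup = np.max(p0 / p)                           # ||p0/p||_inf (discrete case)

    assert h2 <= kl <= 2 * h2 * (1 + np.log(sup)) <= 2 * h2 * sup
    print(h2, kl, 2 * h2 * (1 + np.log(sup)), 2 * h2 * sup)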

A second lemma is a comparison of a certain exponential moment and the Hellinger distance. This exponential moment [called the “Bernstein norm” in van der Vaart and Wellner (1996), even though it is not a norm] is essential in Bernstein's inequality. Birgé and Massart (1993) used this “norm” to derive results on rates of convergence of minimum contrast estimators.

Lemma 8.3. For any pair of probability measures P and P_0,

(8.4) P_0( exp|log(p/p_0)| − 1 − |log(p/p_0)| ) ≤ 2h^2(P, P_0)‖p_0/p‖_∞.


Proof. For every c ≤ 0 and x ≥ c we have the inequality

(8.5) e^{|x|} − 1 − |x| ≤ 2e^{−c}(e^{x/2} − 1)^2.

If e^{−c} = ‖p_0/p‖_∞, then log(p/p_0) ≥ c and hence the integrand on the left side of (8.4) is bounded above by

2e^{−c}( exp((1/2) log(p/p_0)) − 1 )^2 = 2e^{−c}( √(p/p_0) − 1 )^2.

The integral of the right side with respect to P_0 is equal to 2e^{−c} times the squared Hellinger distance. ✷
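
Inequality (8.5) itself can be checked on a grid (our addition; the grid and tolerance are arbitrary):

    import numpy as np

    for c in [0.0, -0.5, -2.0]:
        x = np.linspace(c, 10.0, 100001)
        lhs = np.exp(np.abs(x)) - 1 - np.abs(x)
        rhs = 2 * np.exp(-c) * (np.exp(x / 2) - 1) ** 2
        assert np.all(lhs <= rhs + 1e-9), c
    print("inequality (8.5) holds on the grid")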

The “Bernstein norm” of log(p/p_0) dominates all moments of order greater than or equal to 2 of log(p/p_0) up to constants, including the second moment up to a factor 2. Therefore, when combined, the preceding two lemmas show that

(8.6) {P: h^2(P, P_0)‖p_0/p‖_∞ ≤ ε^2} ⊂ {P: −P_0 log(p/p_0) ≤ 2ε^2, P_0(log(p/p_0))^2 ≤ 4ε^2}.


This shows that condition (2.4) is weaker than condition (2.5), up to constants. Actually, controlling all moments is more than is needed. Another possible extension of Lemma 8.1 would be to replace the second moment of log(p/p_0) by a higher moment (and use Markov's inequality at the end of the proof). This would give a result good enough for the proof of Theorem 2.2, provided the higher moment is chosen “high enough” (dependent on the order ε_n, faster convergence to zero needing a higher moment). We have chosen here to forego such refinements and obtain an exponential inequality under a somewhat stronger assumption.
We are ready for an adaptation of Lemma 8.1.

Lemma 8.4. For every ε > 0 and probability measure Π on the set

(8.7) {P: h^2(P, P_0)‖p_0/p‖_∞ ≤ ε^2},

we have, for a universal constant B > 0,

(8.8) P_0^n( ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ(P) ≤ exp(−3nε^2) ) ≤ exp(−Bnε^2).

Proof. Lemma 8.2 gives that −P_0 log(p/p_0) ≤ 2ε^2 for every P in the set (8.7), which has Π-probability 1. Furthermore, by Lemma 8.3,

P_0( exp|log(p/p_0)| − 1 − |log(p/p_0)| ) ≤ 2ε^2.

By monotonicity and convexity of the function y → e^y − 1 − y on [0, ∞) and Jensen's inequality,

P_0( exp|∫ log(p/p_0) dΠ(P)| − 1 − |∫ log(p/p_0) dΠ(P)| )
  ≤ ∫ P_0( exp|log(p/p_0)| − 1 − |log(p/p_0)| ) dΠ(P) ≤ 2ε^2,

by Fubini's theorem. By the lemma below the same bound is true for 1/2 times the variable ∫ log(p/p_0) dΠ(P) centered at its expectation. Therefore, rewriting the probability on the left side of (8.8) as in the proof of Lemma 8.1, we see that it is bounded above by

P_0^n( 𝔾_n ∫ log(p/p_0) dΠ(P) ≤ −3√n ε^2 + 2√n ε^2 ) ≤ exp( −D nε^4 / (ε^2 + √n ε^2/√n) ),

by (the refined version of) Bernstein's inequality [see, e.g., Lemma 2.2.11 of van der Vaart and Wellner (1996)]. ✷

Lemma 8.5. If ψ: [0, ∞) → ℝ is convex and nondecreasing, then Eψ(|X − EX|) ≤ Eψ(2|X|) for every random variable X.

Proof. The map y → ψ(|y|) is convex on ℝ. If X' is an independent copy of X, then the left side is equal to Eψ(|X − EX'|) ≤ Eψ(|X − X'|), by Jensen's inequality. Next bound |X − X'| ≤ |X| + |X'| and use the monotonicity and convexity of ψ again to bound the expectation by E[(1/2)ψ(2|X|) + (1/2)ψ(2|X'|)], which is the right side. ✷
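
A Monte Carlo sanity check of Lemma 8.5 (our addition), with ψ(y) = e^y − 1 − y, the function to which the lemma is applied in the proof of Lemma 8.4, and an arbitrary test distribution:

    import numpy as np

    rng = np.random.default_rng(3)
    psi = lambda y: np.exp(y) - 1 - y       # convex and nondecreasing on [0, infinity)

    x = rng.exponential(scale=0.3, size=10 ** 6)
    lhs = psi(np.abs(x - x.mean())).mean()  # E psi(|X - EX|), with EX estimated
    rhs = psi(2 * np.abs(x)).mean()         # E psi(2|X|)
    print(lhs, rhs)                         # lhs should not exceed rhs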

Proof of Theorem 2.2. The proof of Theorem 2.2 follows the same lines as the proof of Theorem 2.1. The difference is that we use Lemma 8.4 instead of Lemma 8.1 to ensure that the probability of the events A_n converges to 1 at an exponential rate. By inspecting the proof, we conclude that for some B_1, B_2 > 0 and M chosen as before,

P_0^n( Π_n(P: d(P, P_0) > Mε_n | X_1, …, X_n) ≥ exp(−B_1 nε_n^2) )

converges to zero at the rate exp(−B_2 nε_n^2). Since ∑_n exp(−B_2 nε_n^2) < ∞, almost sure convergence follows by the Borel–Cantelli lemma. ✷

For the proof of Theorem 2.3 we need other variations on the preceding lemmas. The following lemma follows from Theorem 5 of Wong and Shen (1995). Let log⁺ x = (log x) ∨ 0.

Lemma 8.6. For any pair of probability measures P and P_0 such that h(P, P_0) ≤ 0.44 and P_0(p_0/p) < ∞,

−P_0 log(p/p_0) ≤ 18 h^2(P, P_0)( 1 + log⁺ [ P_0(p_0/p) / h(P, P_0) ] ),

P_0( log(p/p_0) )^2 ≤ 5 h^2(P, P_0)( 1 + log⁺ [ P_0(p_0/p) / h(P, P_0) ] )^2.
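
These bounds, including the constants 18 and 5, can be probed numerically (our addition; the discrete pair below has h(P, P_0) well below 0.44):

    import numpy as np

    p0 = np.array([0.35, 0.35, 0.3])
    p = np.array([0.3, 0.4, 0.3])

    h = np.sqrt(np.sum((np.sqrt(p) - np.sqrt(p0)) ** 2))
    kl = np.sum(p0 * np.log(p0 / p))               # -P0 log(p/p0)
    m2 = np.sum(p0 * np.log(p / p0) ** 2)          # P0 (log(p/p0))^2
    ratio = np.sum(p0 * p0 / p)                    # P0 (p0/p)

    logplus = max(np.log(ratio / h), 0.0)          # log+ of P0(p0/p)/h(P,P0)
    assert kl <= 18 * h ** 2 * (1 + logplus)
    assert m2 <= 5 * h ** 2 * (1 + logplus) ** 2
    print(h, kl, m2)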

Lemma 8.7. For any pair of probability measures P and P_0 such that P_0(p_0/p) < ∞,

P_0( exp|log(p/p_0)| − 1 − |log(p/p_0)| ) ≤ 4h^2(P, P_0)( 1 + Φ^{−1}(h^2(P, P_0)) ),

for Φ^{−1}(ε) = sup{M: Φ(M) ≥ ε} the inverse of the function Φ(M) = P_0[ (p_0/p) 1{p_0/p ≥ M} ]/M.

Proof. Set m = p_0/p. By inequality (8.5) in the proof of Lemma 8.3, the left side is bounded above by

2P_0[ ( √(p/p_0) − 1 )^2 1{p ≥ p_0} ] + 2P_0[ (p_0/p)( √(p/p_0) − 1 )^2 1{p < p_0} ]
  ≤ 2h^2(P, P_0)(1 + M) + 2P_0[ m 1{m > M} ],

for every M > 0. The function Φ is left continuous and strictly decreasing, from infinity at 0 to 0 at a point τ ≤ ∞. If we choose M = Φ^{−1}(h^2(P, P_0)), then MΦ(M) ≥ M h^2(P, P_0) ≥ MΦ(M+) = P_0[ m 1{m > M} ]. The right side of the last display can now be bounded by an expression as in the lemma. ✷
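
To illustrate the function Φ and the resulting bound (our addition; the four-point example and the grid-based inverse are only for demonstration):

    import numpy as np

    p0 = np.array([0.25, 0.25, 0.25, 0.25])
    p = np.array([0.4, 0.3, 0.2, 0.1])
    m = p0 / p                                     # the ratio p0/p

    def Phi(M):
        # Phi(M) = P0[ m 1{m >= M} ] / M
        return np.sum(p0 * m * (m >= M)) / M

    def Phi_inv(eps):
        # Phi^{-1}(eps) = sup{ M : Phi(M) >= eps }, computed on a grid
        grid = np.linspace(1e-4, m.max(), 10 ** 5)
        ok = np.array([Phi(M) for M in grid]) >= eps
        return grid[ok].max() if ok.any() else 0.0

    h2 = np.sum((np.sqrt(p) - np.sqrt(p0)) ** 2)
    z = np.abs(np.log(p / p0))
    bern = np.sum(p0 * (np.exp(z) - 1 - z))        # the "Bernstein norm" term
    print(bern, 4 * h2 * (1 + Phi_inv(h2)))        # Lemma 8.7: left <= right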

Lemma 8.8. For a given function m let Φ^{−1}(ε) = sup{M: Φ(M) ≥ ε} be the inverse of the function Φ(M) = P_0[ m 1{m ≥ M} ]/M. For every ε ∈ (0, 0.44) and probability measure Π on the set

{P: p_0/p ≤ m, 18 h^2(P, P_0)( 1 + log⁺ [ P_0 m / h(P, P_0) ] ) + Φ^{−1}(h^2(P, P_0)) ≤ ε^2},

we have, for a universal constant B > 0,

(8.9) P_0^n( ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ(P) ≤ exp(−2nε^2) ) ≤ exp(−Bnε^2).

Proof. This follows the same lines as the proof of Lemma 8.4, now substituting Lemmas 8.6 and 8.7 for Lemmas 8.2 and 8.3. ✷

Proof of Theorem 2.3. This is identical to the proof of Theorem 2.2, except that we use Lemma 8.8 instead of Lemma 8.4. ✷

Proof of Theorem 2.4. The first part of the proof is identical to the first part of the proof of Theorem 2.1, except that we choose the tests φ_n to satisfy (8.1) and [instead of (8.2)], for every j ∈ ℕ,

(8.10) sup_{P ∈ 𝒫_n: d(P, P_0) > Mε_n j} P^n(1 − φ_n) ≤ exp(−KnM^2 ε_n^2 j^2).

We also choose M large enough to ensure that the right side of (8.1), and hence the left side of (8.3), converges to zero. Defining S_{n,j} = {P ∈ 𝒫_n: Mε_n j < d(P, P_0) ≤ Mε_n(j + 1)} and using (8.10), we obtain

E_{P_0} ∫_{S_{n,j}} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(P)(1 − φ_n) ≤ exp(−KnM^2 ε_n^2 j^2) Π_n(S_{n,j}).

Fix some C_0 ≥ 1. By Lemma 8.1, we have on an event A_n with probability at least 1 − (nε_n^2 C_0^2)^{−1},

∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(P) ≥ exp(−2C_0 nε_n^2) Π_n(B_n(ε_n)).

Hence, by assumption (2.9), for every sufficiently large J,

E_{P_0} Π_n(P ∈ 𝒫_n: d(P, P_0) > Jε_n M | X_1, …, X_n)(1 − φ_n)1_{A_n}
  ≤ ∑_{j ≥ J} exp(−KnM^2 ε_n^2 j^2) Π_n(S_{n,j}) / ( exp(−2C_0 nε_n^2) Π_n(B_n(ε_n)) )
  ≤ ∑_{j ≥ J} exp( −nε_n^2 [ KM^2 j^2 − 2C_0 − (1/2)KM^2 j^2 ] ).

This converges to zero as J → ∞ if nε_n^2 is bounded away from zero. Next,

E_{P_0} Π_n(P ∉ 𝒫_n | X_1, …, X_n)(1 − φ_n)1_{A_n} ≤ Π_n(𝒫 ∖ 𝒫_n) / ( exp(−2C_0 nε_n^2) Π_n(B_n(ε_n)) ).

We may assume that either nε_n^2 is bounded or nε_n^2 → ∞; otherwise we argue along subsequences. If nε_n^2 is bounded, then we first choose C_0 large but fixed so as to make P_0^n(A_n) as large as desired. Then the right side of the preceding display converges to zero by assumption (2.8). If nε_n^2 → ∞, then we choose C_0 = 1, in which case P_0^n(A_n) → 1 and again the right side of the preceding display converges to zero. ✷

Proof of Theorem 7.3. This is essentially contained in the proof of Theorem 2.4 (take M = 1). ✷

Acknowledgment. We thank Lucien Birgé for insightful discussions that have led to an improved presentation (and some corrections), in particular relating to Section 7.

REFERENCES
Barron, A., Schervish, M. J. and Wasserman, L. (1999). The consistency of posterior distribu-
tions in nonparametric problems. Ann. Statist. 27 536–561.
Birgé, L. (1983). Approximation dans les espaces métriques et théorie de l’estimation. Z. Wahrsch.
Verw. Gebiete 65 181–238.
Birgé, L. (1984). Sur un théorème de minimax et son application aux tests. Probab. Math. Statist.
3 259–282.
Birgé, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probab.
Theory Related Fields 97 113–150.
Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. In Festschrift for
Lucien Le Cam (G. Yang and D. Pollard, eds.) 55–87. Springer, New York.
Birgé, L. and Massart, P. (1998). Minimum contrast estimators on sieves: exponential bounds
and rates of convergence. Bernoulli 4 329–375.
de Boor, C. (1978). A Practical Guide to Splines. Springer, New York.
Diaconis, P. and Freedman, D. (1986). On the consistency of Bayes estimates (with discussion).
Ann. Statist. 14 1–67.
Doob, J. L. (1949). Application of the theory of martingales. Le Calcul des Probabilités et ses Applications. Coll. Int. du CNRS 13 23–27.
Dudley, R. M. (1984). A course on empirical processes. Lecture Notes in Math. 1097 2–141. Springer, Berlin.
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1
209–230.
Ferguson, T. S. (1974). Prior distribution on the spaces of probability measures. Ann. Statist. 2
615–629.
Freedman, D. A. (1963). On the asymptotic behavior of Bayes’ estimates in the discrete case.
Ann. Math. Statist. 34 1194–1216.
Freedman, D. A. (1965). On the asymptotic behavior of Bayes’ estimates in the discrete case II.
Ann. Math. Statist. 36 454–456.
Ghosal, S., Ghosh, J. K. and Ramamoorthi, R. V. (1997). Non-informative priors via sieves
and packing numbers. In Advances in Statistical Decision Theory and Applications (S.
Panchapakesan and N. Balakrishnan, eds.) 129–140. Birkhäuser, Boston.
Ghosal, S., Ghosh, J. K. and Ramamoorthi R. V. (1999a). Posterior consistency of Dirichlet
mixtures in density estimation. Ann. Statist. 27 143–158.
Ghosal, S., Ghosh, J. K. and Ramamoorthi, R. V. (1999b). Consistency issues in Bayesian non-
parametrics. In Asymptotics, Nonparametrics and Time Series: A Tribute to Madan Lal
Puri (Subir Ghosh, ed.) 639–667. Dekker, New York.
Ibragimov, I. A. and Has’minskii, R. Z. (1981). Statistical Estimation: Asymptotic Theory.
Springer, New York.
Kolmogorov, A. N. and Tikhomirov, V. M. (1961). Epsilon-entropy and epsilon-capacity of sets
in function spaces. Amer. Math. Soc. Trans. Ser. 2 17 277–364.
Le Cam, L. M. (1973). Convergence of estimates under dimensionality restrictions. Ann. Statist.
1 38–53.
Le Cam, L. M. (1986). Asymptotic Methods in Statistical Decision Theory. Springer, New York.
Le Cam, L. M. and Yang, G. (1990). Asymptotics in Statistics: Some Basic Concepts. Springer,
New York.
Pollard, D. (1990). Empirical Processes: Theory and Applications. IMS, Hayward, CA and Amer.
Statist. Assoc., Alexandria, VA.
Schwartz, L. (1965). On Bayes procedures. Z. Wahrsch. Verw. Gebiete 4 10–26.
Shen, X. and Wasserman, L. (1999). Rates of convergence of posterior distributions. Preprint.
Stone, C. J. (1986). The dimensionality reduction principle for generalized additive models. Ann.
Statist. 14 590–606.
Stone, C. J. (1990). Large-sample inference for log-spline models. Ann. Statist. 18 717–741.
Stone, C. J. (1994). The use of polynomial splines and their tensor products in multivariate
function estimation (with discussion). Ann. Statist. 22 118–184.
van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes.
Springer, New York.
Wasserman, L. (1998). Asymptotic properties of nonparametric Bayesian procedures. Practical
Nonparametric and Semiparametric Bayesian Statistics. Lecture Notes in Statist. 133
293–304. Springer, New York.
Wong, W. H. and Shen, X. (1995). Probability inequalities for likelihood ratios and convergence
rates of sieve MLEs. Ann. Statist. 23 339–362.

S. Ghosal J. K. Ghosh
A. W. van der Vaart Statistics and Mathematics Unit
Department of Mathematics Indian Statistical Institute
Free University 203 B.T. Road
De Boelelaan 1081a Calcutta 700 035
1081 HV Amsterdam India
Netherlands
E-mail: [email protected]
