Convergence Rates of Posterior Distributions - Ghosal, Ghosh, van der Vaart
the present paper. In particular, we are indebted to the part of Le Cam’s work
as it was extended by Birgé (1983) to general metric spaces.
Shortly after completing this paper, we learned of independent work by
Shen and Wasserman (1999), who also address rates of convergence.
The construction of prior measures on infinite-dimensional models is not
a trivial matter and has also received recent attention. This development
started with the introduction of Dirichlet processes by Ferguson (1973, 1974).
Given computing algorithms such as Markov chain Monte Carlo methods and
powerful computing machines, implementation of Bayesian methods has now
become feasible even for many complicated priors and infinite-dimensional
models.
In Section 2 we present a main result and several variations concerning the
rate of convergence of the posterior relative to the total variation, Hellinger
and $L_2$-metrics. In each case the two main elements characterizing the rate of
convergence are the size of the model (measured by covering numbers or existence
of certain tests) and the amount of prior mass given to a shrinking ball
around the true measure. Actually, the size of the model comes in only to guarantee
the existence of certain tests of the true measure versus the complement
of a shrinking ball around it, and conditions can be put in terms of such tests
instead. Conditions of this form go back to Schwartz (1965) and Le Cam (1973).
We discuss testing in Section 7, and reformulate our main result in terms of
tests in that section. The proofs of the main results are contained in Section 8,
following the discussion of the existence of tests. In Section 2 we also note that
a rate of convergence for the posterior automatically entails the existence of
point estimators with the same rate.
We apply the general result to several examples. In Section 3 we consider
discrete priors constructed on ε-nets over the model. In Section 4 we discuss
Bayes estimators based on the log-spline models for density estimation discussed
by Stone (1986). In Section 5 we consider finite-dimensional models.
In Section 6 we discuss applications to Dirichlet priors.
The notation $\lesssim$ is used to denote inequality up to a universal multiplicative
constant, or up to a constant that is fixed throughout. We define the Hellinger
distance $h(p, q)$, or $h(P, Q)$, between two probability densities or measures as
the $L_2(\mu)$-distance between the root densities $\sqrt{p}$ and $\sqrt{q}$. The total variation
distance is the $L_1(\mu)$-distance. (Some authors define these distances with an
additional factor 1/2.)
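In this notation, the two distances and the standard comparison inequalities used repeatedly below read
$$h^2(p, q) = \int \bigl(\sqrt{p} - \sqrt{q}\,\bigr)^2\, d\mu, \qquad \|p - q\|_1 = \int |p - q|\, d\mu,$$
$$h^2(p, q) \le \|p - q\|_1 \le 2\, h(p, q),$$
the second inequality following from the Cauchy–Schwarz inequality.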
$$\Pi_n\bigl(\mathcal{P} \setminus \mathcal{P}_n\bigr) \le \exp\bigl(-n\varepsilon_n^2 (C + 4)\bigr), \tag{2.3}$$
$$\Pi_n\Bigl(P : -P_0 \log \frac{p}{p_0} \le \varepsilon_n^2,\ P_0 \Bigl(\log \frac{p}{p_0}\Bigr)^2 \le \varepsilon_n^2\Bigr) \ge \exp\bigl(-n\varepsilon_n^2 C\bigr). \tag{2.4}$$
The first and third conditions of the theorem are the essential ones. Condition
(2.3) allows some additional flexibility, but should first be understood as
expressing that $\mathcal{P}_n$ is almost the support of the prior (in which case its left
side is zero and the condition is trivially satisfied).
Condition (2.2) requires that the “model” $\mathcal{P}_n$ be not too big. It is true for
every $\varepsilon_n' \ge \varepsilon_n$ as soon as it is true for $\varepsilon_n$ and can thus be seen as defining a
minimal possible value of $\varepsilon_n$. Condition (2.2) ensures the existence of certain
tests, as discussed in Section 7, and could be replaced by a testing condition.
Note that the metric d used here reappears in the assertion of the theorem.
Since the total variation metric is bounded above by twice the Hellinger metric,
the assertion of the theorem using the Hellinger metric is stronger, but also
condition (2.2) will be more restrictive, so that we really have two theorems.
In the case that the densities are uniformly bounded, we even have a third
theorem, when using the $L_2$-distance, which in that case will be bounded above
by a multiple of the Hellinger distance. If the densities are also uniformly
bounded and uniformly bounded away from zero, then these three distances
are equivalent and are also equivalent to the Kullback–Leibler number and
$L_2$-norm appearing in condition (2.4). See, for example, Lemmas 8.2 and 8.3
and (8.6).
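To record the comparison used here: if $\|p\|_\infty \vee \|q\|_\infty \le B$, then (a one-line computation, added for completeness)
$$\|p - q\|_2^2 = \int \bigl(\sqrt{p} - \sqrt{q}\,\bigr)^2 \bigl(\sqrt{p} + \sqrt{q}\,\bigr)^2\, d\mu \le 4B\, h^2(p, q),$$
so the $L_2$-distance is indeed bounded above by a multiple of the Hellinger distance.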
A rate $\varepsilon_n$ satisfying (2.2) for $\mathcal{P}_n = \mathcal{P}$ and $d$ the Hellinger metric is often
viewed as giving the “optimal” rate of convergence for estimators of $P$ relative
to the Hellinger metric, given the model $\mathcal{P}$. Under certain conditions, such
as likelihood ratios bounded away from zero and infinity, this is proved as a
theorem by Birgé (1983) and Le Cam (1973, 1986). From Birgé’s work it is clear
that condition (2.2) is the correct expression of the complexity of the model,
as relating to estimating the true density relative to the Hellinger distance,
if this is to be given in terms of metric entropy. A weaker, but more involved,
condition is in terms of the existence of certain tests. We give a generalization
of the theorem using tests in Section 7.
Condition (2.4) is the other main determinant of the posterior rate given
by the theorem. It requires that the prior measures put a sufficient amount
of mass near the true measure $P_0$. Here “near” is measured through a combination
of the Kullback–Leibler divergence of $p$ and $p_0$ and the $L_2(P_0)$-norm
of $\log(p/p_0)$. Again this condition is satisfied for $\varepsilon_n' \ge \varepsilon_n$ if it is satisfied for
$\varepsilon_n$ and thus is another restriction on a minimal value of $\varepsilon_n$. The form of this
condition can be motivated from entropy considerations. Suppose that we wish
to satisfy (2.4) for the minimal $\varepsilon_n$ satisfying (2.2) with $\mathcal{P}_n = \mathcal{P}$, that is, for
the optimal rate of convergence for the model. Furthermore, for the sake of
the argument assume that all distances used are equivalent. Then a minimal
$\varepsilon_n$-cover of $\mathcal{P}$ consists of $\exp(n\varepsilon_n^2)$ balls. If the prior $\Pi_n$ would spread
its mass uniformly over $\mathcal{P}$, then every ball would obtain mass approximately
$\exp(-Cn\varepsilon_n^2)$. (The constant $C$ expresses the constants in comparing the distances
and the fact that the balls of radius $\varepsilon_n$ may overlap.) On the other hand,
if $\Pi_n$ is not “uniform,” then we should expect (2.4) to fail for some $P_0 \in \mathcal{P}$. Here
we must admit that “uniform” priors do not exist in infinite-dimensional models
and actually condition (2.4) is stronger than needed and will be improved
ahead in Theorem 2.4. However, a rough implication of the condition is that $\Pi_n$
should be “uniformly spread” in order for the posterior distribution to attain
the optimal rate of convergence.
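The arithmetic behind this heuristic is simple: if $\mathcal{P}$ is covered by $N_n = \exp(n\varepsilon_n^2)$ balls of radius $\varepsilon_n$ and $\Pi_n$ assigns each ball mass roughly $1/N_n$, then a ball around $P_0$ receives mass at least
$$\frac{1}{N_n} = \exp\bigl(-n\varepsilon_n^2\bigr),$$
which is (2.4) with $C = 1$, up to the constants absorbed in comparing the distances.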
Condition (2.3), combined with (2.2), can be interpreted as saying that a
part of $\mathcal{P}$ that barely receives prior mass need not be small. The sets $\mathcal{P}_n$
may be thought of as “sieves” approximating the parameter space, which capture
most of the prior probability. This type of condition has received much
attention in the discussion of consistency issues [see Barron, Schervish and
Wasserman (1999)], but plays a smaller role in the present paper. Of course,
condition (2.3) is trivially satisfied for $\mathcal{P}_n = \mathcal{P}$; we can make this choice if
condition (2.2) holds with $\mathcal{P}_n = \mathcal{P}$ itself.
The assertion of the theorem is an in-probability statement that the pos-
terior mass outside a large ball of radius proportional to εn is approximately
zero. The in-probability statement can be improved to an almost sure asser-
tion, but under stronger conditions. We present two results.
Let $h$ be the Hellinger distance and write $\log_+ x$ for $\log x \vee 0$.
Theorem 2.2. Suppose that conditions (2.2) and (2.3) hold as in the preceding
theorem and in addition $\sum_n \exp(-Bn\varepsilon_n^2) < \infty$ for every $B > 0$ and
$$\Pi_n\Bigl(P : h^2(P, P_0)\, \Bigl\|\frac{p_0}{p}\Bigr\|_\infty \le \varepsilon_n^2\Bigr) \ge \exp\bigl(-n\varepsilon_n^2 C\bigr). \tag{2.5}$$
Then for sufficiently large $M$, we have that $\Pi_n\bigl(P : d(P, P_0) \ge M\varepsilon_n \mid X_1, \dots, X_n\bigr) \to 0$ almost surely $[P_0^\infty]$.
To improve this situation we must refine both the entropy condition (2.2) and
the prior mass condition (2.4). The following generalization of Theorem 2.1 is
more complicated but does yield the right result in the finite-dimensional situation.
It is essential for our examples using spline approximations in Section 4.
Theorems 2.2 and 2.3 can be generalized similarly. Let
$$B_n(\varepsilon) = \Bigl\{P : -P_0 \log \frac{p}{p_0} \le \varepsilon^2,\ P_0 \Bigl(\log \frac{p}{p_0}\Bigr)^2 \le \varepsilon^2\Bigr\}.$$
Theorem 2.4. Suppose that for a sequence $\varepsilon_n$ with $\varepsilon_n \to 0$ and such that
$n\varepsilon_n^2$ is bounded away from zero, every sufficiently large $j$ and sets $\mathcal{P}_n \subset \mathcal{P}$, we
have
$$\log D\Bigl(\frac{\varepsilon}{2},\ \bigl\{P \in \mathcal{P}_n : \varepsilon \le d(P, P_0) \le 2\varepsilon\bigr\},\ d\Bigr) \le n\varepsilon_n^2 \quad \text{for every } \varepsilon \ge \varepsilon_n, \tag{2.7}$$
$$\frac{\Pi_n\bigl(\mathcal{P} \setminus \mathcal{P}_n\bigr)}{\Pi_n\bigl(B_n(\varepsilon_n)\bigr)} = o\bigl(\exp(-2n\varepsilon_n^2)\bigr), \tag{2.8}$$
$$\frac{\Pi_n\bigl(P : j\varepsilon_n < d(P, P_0) \le 2j\varepsilon_n\bigr)}{\Pi_n\bigl(B_n(\varepsilon_n)\bigr)} \le \exp\bigl(Kn\varepsilon_n^2 j^2/2\bigr). \tag{2.9}$$
Here $K$ is the universal testing constant appearing in (7.1) and (7.2). Then for
every $M_n \to \infty$, we have that $\Pi_n\bigl(P : d(P, P_0) \ge M_n\varepsilon_n \mid X_1, \dots, X_n\bigr) \to 0$ in
$P_0^n$-probability.
for appropriate loss functions $\ell_n$. Such estimators are called formal Bayes
estimators in Le Cam (1986).
On the one hand, Theorem 2.5 shows that we can construct good estimators
from the posterior if the posterior converges at a good rate. On the other hand,
it shows that the posterior cannot converge at a rate faster than the optimal
rate of convergence for point estimators. We use this argument in a number
of examples to show that the posterior converges at the best possible rate. Of
course, our arguments have nothing to say about the best possible constants.
Furthermore, for many priors the rate may be suboptimal.
and hence the present bracketing numbers are bigger than the packing numbers
$D(\varepsilon, \mathcal{P}, h)$ defined previously [see (2.1)]. However, in many examples
there is also an inequality in the other direction, up to a constant, and bracketing
and packing numbers give equivalent results. The corresponding bracketing
entropy is defined as the logarithm of the bracketing number $N_{[\,]}(\varepsilon, \mathcal{P}, h)$.
We shall construct a discrete prior supported on densities constructed from
minimal sets of brackets for the Hellinger distance. For a given number $\varepsilon_n > 0$,
let $\Pi_n$ be the uniform discrete measure on the $N_{[\,]}(\varepsilon_n, \mathcal{P}, h)$ densities obtained
by covering $\mathcal{P}$ with a minimal set of $\varepsilon_n$-brackets and next renormalizing the
upper bounds of the brackets to integrate to 1. Thus if $(l_1, u_1), \dots, (l_N, u_N)$
are the $N = N_{[\,]}(\varepsilon_n, \mathcal{P}, h)$ brackets, then $\Pi_n$ is the uniform measure on the
$N$ functions $u_j / \int u_j\, d\mu$. Next set
$$\Pi = \sum_{n \in \mathbb{N}} \lambda_n \Pi_n$$
for a given sequence $(\lambda_n)$ with $\lambda_n \ge 0$ and $\sum_n \lambda_n = 1$.
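A minimal numerical sketch of this construction (ours, not from the paper; the grid representation, the helper name and the choice $\lambda_n \propto 2^{-n}$ are assumptions made for illustration):

```python
import numpy as np

def bracket_prior(upper_brackets, grid):
    """Uniform prior on the renormalized upper brackets u_j / int u_j dmu.

    upper_brackets: array of shape (N, len(grid)) holding the functions
    u_j sampled on `grid`. Returns the renormalized densities and their
    uniform prior weights (the measure Pi_n above).
    """
    u = np.asarray(upper_brackets, dtype=float)
    norms = np.trapz(u, grid, axis=1)          # int u_j dmu, grid approximation
    densities = u / norms[:, None]             # renormalize to integrate to 1
    weights = np.full(len(u), 1.0 / len(u))    # uniform measure on N points
    return densities, weights

# The overall prior Pi = sum_n lambda_n Pi_n, with, e.g., lambda_n = 2**(-n).
```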
Proof. The prior $\Pi$ gives probability 1 to the set $\mathcal{Q} = \bigcup_{j=1}^\infty \mathcal{Q}_j$ for $\mathcal{Q}_j$ the
$N_{[\,]}(\varepsilon_j, \mathcal{P}, h)$ functions in the support of $\Pi_j$. We claim that
$$D\bigl(8\varepsilon_n, \mathcal{Q}, h\bigr) \le \exp\bigl(2n\varepsilon_n^2\bigr). \tag{3.1}$$
To see this, we first note that, given an $\varepsilon$-bracket $(l, u)$ that contains a probability
density $p$, with $\|\cdot\|_2$ the norm in $L_2(\mu)$,
$$1 \le \Bigl(\int u\, d\mu\Bigr)^{1/2} = \|\sqrt{u}\|_2 \le \|\sqrt{u} - \sqrt{p}\|_2 + \|\sqrt{p}\|_2 = h(u, p) + 1 \le 1 + \varepsilon,$$
$$h\Bigl(p, \frac{u}{\int u\, d\mu}\Bigr) \le h(p, u) + h\Bigl(u, \frac{u}{\int u\, d\mu}\Bigr) \le 2\varepsilon.$$
By virtue of the relationship (2.1) between covering numbers and packing numbers,
we obtain (3.1). This verifies condition (2.2) with $\mathcal{P}_n$ taken equal to $\mathcal{Q}$
and $\varepsilon_n$ taken equal to eight times the present $\varepsilon_n$.
If $u$ is the upper limit of the $\varepsilon_n$-bracket containing $p_0$, then
$$\Bigl\|\frac{p_0}{u / \int u\, d\mu}\Bigr\|_\infty \le \int u\, d\mu \le (1 + \varepsilon_n)^2.$$
It follows that for large $n$ the set of points $p$ such that $h^2(p, p_0)\, \|p_0/p\|_\infty \le 8\varepsilon_n^2$
contains at least the function $u / \int u\, d\mu$ and hence has prior mass at least
$$\lambda_n\, \frac{1}{N_{[\,]}(\varepsilon_n, \mathcal{P}, h)} \ge \exp\bigl(-n\varepsilon_n^2 - O(\log n)\bigr) \ge \exp\bigl(-2n\varepsilon_n^2\bigr)$$
for large $n$. This verifies condition (2.5) for $\varepsilon_n$ a multiple of the present $\varepsilon_n$.
Since condition (2.3) is trivially satisfied for $\mathcal{P}_n = \mathcal{Q}$, the proof is complete. ✷
There are many specific examples in which the preceding theorem applies.
The situation here is similar to that in recent papers on rates of convergence
of (sieved) maximum likelihood estimators, as in Birgé and Massart (1993,
1997, 1998), Wong and Shen (1995) or Chapter 3.4 of van der Vaart and
Wellner (1996). It is interesting to note that these authors also use brack-
ets, whereas Birgé (1983), in his study of the metric entropy of statistical
models, uses ε-nets. This is because the cited papers are concerned with a
particular type of estimator (namely, minimum contrast estimators), whereas
Birgé (1983) uses special constructs, called d-estimators. It appears that for
good behavior of Bayes estimators on nets we also need some special property
of the nets, such as available from nets obtained from brackets.
We include two concrete examples.
Inspection of the proof of the theorem shows that the lower bounds of the
brackets are not really needed. The theorem can be generalized by defining
upper bracketing numbers $N_{]}(\varepsilon, \mathcal{P}, h)$ as the minimal number of functions
$u_1, \dots, u_m$ such that for every $p \in \mathcal{P}$ there exists a function $u_i$ such that
both $p \le u_i$ and $h(u_i, p) < \varepsilon$. Next we construct a prior as before. These
upper bracketing numbers are clearly smaller than the bracketing numbers
$N_{[\,]}(\varepsilon, \mathcal{P}, h)$. We have formulated the theorem using the better known bracketing
numbers, because we do not know any example where this generalization
could be useful.
The preceding theorem implicitly requires that the model $\mathcal{P}$ be totally
bounded for the Hellinger metric. A simple modification works for countable
unions of totally bounded models, provided that we use a sequence of priors.
Suppose that the bracketing numbers of $\mathcal{P}$ are infinite, but there exist subsets
$\mathcal{P}_n \uparrow \mathcal{P}$ with finite bracketing numbers. Let $\varepsilon_n$ be numbers such that
$\log N_{[\,]}(\varepsilon_n, \mathcal{P}_n, h) \le n\varepsilon_n^2$. Then we construct $\Pi_n$ as the discrete uniform distribution
on the renormalized upper brackets of a minimal set of $\varepsilon_n$-brackets over
$\mathcal{P}_n$, as before. Then the posterior relative to the prior $\Pi_n$ achieves the convergence
rate $\varepsilon_n$. (Note that this time we do not construct a fixed prior $\Pi = \sum_n \lambda_n \Pi_n$,
but use the prior $\Pi_n$ when $n$ observations are available.)
In the preceding we start with a condition on the entropies with bracketing
even though we apply Theorem 2.2, which demands control over metric
entropies only. This is because Theorem 2.2 also requires control over the likelihood
ratios. If, for instance, the densities are uniformly bounded away
from zero and infinity, so that the quotients $p_0/p$ are uniformly bounded, then
we can replace the bracketing entropy in Theorem 3.1 by ordinary entropy.
Alternatively, if the set of densities possesses an integrable envelope function,
then we can construct priors achieving the rate $\varepsilon_n$ determined by the
covering numbers up to logarithmic factors. Here we define $\varepsilon_n$ as the minimal
solution of the equation $\log N(\varepsilon, \mathcal{P}, h) \le n\varepsilon^2$, where $N(\varepsilon, \mathcal{P}, h)$ denotes the
Hellinger covering number (without bracketing). The construction, described
briefly below, parallels Theorem 6 of Wong and Shen (1995) for sieved maximum
likelihood estimators.
We assume that the set of densities $\mathcal{P}$ has a $\mu$-integrable envelope function:
a measurable function $m$ with $\int m\, d\mu < \infty$ such that $p \le m$ for every $p \in \mathcal{P}$.
Given $\varepsilon_n > 0$ let $s_{1,n}, \dots, s_{N_n,n}$ be a minimal $\varepsilon_n$-net over $\mathcal{P}$ [hence $N_n = N(\varepsilon_n, \mathcal{P}, h)$] and put
$$g_{j,n} = \bigl(s_{j,n}^{1/2} + \varepsilon_n m^{1/2}\bigr)^2 / c_{j,n},$$
where $c_{j,n}$ is a constant ensuring that $g_{j,n}$ is a probability density. Finally, let
$\Pi_n$ be the uniform discrete measure on $g_{1,n}, \dots, g_{N_n,n}$ and let $\Pi = \sum_{n=1}^\infty \lambda_n \Pi_n$
be a convex combination of the $\Pi_n$ as before.
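As a small illustration (our own sketch, with the net point and envelope represented on a grid; the helper name is hypothetical):

```python
import numpy as np

def perturbed_net_density(s, m, eps, grid):
    """Compute g = (sqrt(s) + eps*sqrt(m))^2 / c for a net point s and
    envelope m on `grid`, with c the normalizing constant c_{j,n}."""
    g = (np.sqrt(s) + eps * np.sqrt(m)) ** 2
    c = np.trapz(g, grid)                  # grid approximation of int g dmu
    return g / c
```

The point of adding $\varepsilon_n m^{1/2}$ is to keep the ratios $p/g_{j,n}$ under control, which is what the proof below exploits.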
Proof. The proof follows as before, but this time we apply Theorem 2.1,
using the observation of Wong and Shen (1995) that for any $p \in \mathcal{P}$ such that
$h(p, s_{j,n}) \le \varepsilon_n$ we have that $h(p, g_{j,n}) = O(\varepsilon_n)$ and that $p/g_{j,n}$ is bounded
above by a multiple of $\varepsilon_n^{-2}$. This verifies (2.4) with $\varepsilon_n$ replaced by a multiple of
$\varepsilon_n \log(1/\varepsilon_n)$ through a use of Theorem 5 of Wong and Shen (1995), the relevant
part of which is reproduced below as Lemma 8.6. ✷
as defined on page 108 of de Boor (1978). The exact nature of these functions
does not matter to us here, except for the following properties [cf. de Boor
(1978), pages 109 and 110]:
1. $B_j \ge 0$, $j = 1, \dots, J$;
2. $\sum_{j=1}^J B_j \equiv 1$;
3. at most $q$ of the $B_j(x)$ are nonzero at every given $x$.
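These properties are easy to check numerically; the following sketch (ours, not from the paper; the clamped knot sequence is an assumption) verifies them for a B-spline basis of order $q$ on $[0, 1]$ using scipy:

```python
import numpy as np
from scipy.interpolate import BSpline

q, K = 4, 10                                  # order q, K equal subintervals
interior = np.linspace(0.0, 1.0, K + 1)
t = np.r_[[0.0] * (q - 1), interior, [1.0] * (q - 1)]  # clamped knot sequence
x = np.linspace(0.0, 1.0, 200, endpoint=False)
B = BSpline.design_matrix(x, t, q - 1).toarray()       # J = K + q - 1 columns
assert (B >= 0).all()                                  # property 1
assert np.allclose(B.sum(axis=1), 1.0)                 # property 2: partition of unity
assert (np.count_nonzero(B, axis=1) <= q).all()        # property 3: local support
```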
Because we assume that $p_0$ is bounded away from zero (and infinity), the
function $p_0$ is in $C^\alpha[0, 1]$ if and only if $\log p_0 \in C^\alpha[0, 1]$.
It is easy to see from this, as we show in part below, that the root of the
Kullback–Leibler divergence and the Hellinger distance between $p_0$ and the
closest $p_\theta$ are of the order $J^{-\alpha}$ as well. Since a ball of radius $\varepsilon_n$ around $p_0$
must contain prior mass in order to satisfy (2.9), the rate of convergence $\varepsilon_n$
of the posterior can certainly not be faster than $J^{-\alpha}$. The minimum distance
of alternatives to allow appropriate tests, determined by (2.7), will be shown
to satisfy $n\varepsilon_n^2 \gtrsim J_n$. Together with the previous restriction on $\varepsilon_n$ this will
yield a rate of convergence of $n^{-\alpha/(2\alpha+1)}$, for $J_n \sim n^{1/(2\alpha+1)}$. This is also the
rate of convergence of the sieved maximum likelihood estimator, found by
Stone (1990). It is well known that this rate is optimal for $\alpha$-smooth densities.
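The balancing arithmetic (a one-line check, added for orientation): the two requirements are $\varepsilon_n \gtrsim J_n^{-\alpha}$ (prior mass near $p_0$) and $n\varepsilon_n^2 \gtrsim J_n$ (entropy/testing), and
$$J_n \sim n^{1/(2\alpha+1)} \quad\Longrightarrow\quad J_n^{-\alpha} \sim n^{-\alpha/(2\alpha+1)}, \qquad \sqrt{J_n/n} \sim n^{-\alpha/(2\alpha+1)},$$
so both restrictions are met by $\varepsilon_n$ a multiple of $n^{-\alpha/(2\alpha+1)}$.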
To make this precise we start with stating some lemmas that connect distances
and norms on the densities $p_\theta$ with the $J$-dimensional Euclidean norm
$\|\theta\|$ and infinity norm $\|\theta\|_\infty = \max_j |\theta_j|$. Let $\|f\|$ be the $L_2[0, 1]$-norm of $f$ and
write $a \lesssim b$ if $a \le Cb$ for a constant $C$ that is universal or depends only on $q$
(which is fixed throughout) and not on $K$. Most of these are known from or
implicit in Stone (1986, 1990) or the literature on approximation theory.
Proof. The first inequality is proved by de Boor [(1978), page 156, Corollary
3]. The second is immediate from the fact that the B-spline basis forms a
partition of unity. The third and fourth inequalities are stated in Stone [(1986),
equation (12)]. As their full proofs are not in one place, we sketch the argument
for completeness.
Let $I_i$ be the interval $\bigl[((i - q)/K) \vee 0,\ (i/K) \wedge 1\bigr]$. By (2) on page 155 of
de Boor (1978), we have
$$\sum_i \theta_i^2 \lesssim \sum_i \bigl\|\theta^T B\bigr\|_{\infty; I_i}^2 \lesssim K \sum_i \bigl\|\theta^T B\bigr\|_{2; I_i}^2.$$
The last inequality follows, because $\theta^T B$ restricted to $I_i$ consists of at most $q$ polynomial
pieces, each on an interval of length $1/K$, and the supremum norm of a
polynomial of order $q$ on an interval of length $L$ is bounded by $1/\sqrt{L}$ times
the $L_2$-norm, up to a constant depending on $q$. [To see the last: the squared
$L_2[0, 1]$-norm of the polynomial $x \mapsto \sum_{j=0}^{q-1} \alpha_j x^j$ on $[0, 1]$ is the quadratic form
$\alpha^T \bigl(EU_q U_q^T\bigr) \alpha$ for $U_q = (1, U, \dots, U^{q-1})^T$ and $U$ a uniform $[0, 1]$ variable. The
second moment matrix $EU_q U_q^T$ is nonsingular and hence the quadratic form
is bounded below by a constant times $\|\alpha\|^2 \ge \|\alpha\|_\infty^2$.] This yields the third
inequality.
By property (3) of the B-spline basis at most $q$ elements $B_j(x)$ are nonzero
for every given $x$, say for $j \in J(x)$. Therefore, by the Cauchy–Schwarz inequality,
$$\bigl(\theta^T B(x)\bigr)^2 = \Bigl(\sum_{j \in J(x)} \theta_j B_j(x)\Bigr)^2 \le q \sum_{j \in J(x)} \theta_j^2 B_j^2(x).$$
Proof. By the second inequality in Lemma 4.2 we have that $\|\theta^T B\|_\infty \le \|\theta\|_\infty$, whence $e^{c(\theta)}$ is contained
in the interval $[e^{-M}, e^{M}]$ for $M = \|\theta\|_\infty$, by its
definition, so that $|c(\theta)| \le \|\theta\|_\infty$. Consequently, by the triangle inequality,
$$\|\log p_\theta\|_\infty = \|\theta^T B - c(\theta)\|_\infty \le 2\|\theta\|_\infty.$$
This yields the inequality on the right.
For the inequality on the left, we note that, since $\theta^T \mathbf{1} = 0$,
$$|c(\theta)| = \Bigl|\bigl(\theta - c(\theta)\mathbf{1}\bigr)^T \frac{\mathbf{1}}{J}\Bigr| \le \bigl\|\theta - c(\theta)\mathbf{1}\bigr\|_\infty \frac{\|\mathbf{1}\|_1}{J} \lesssim \bigl\|\bigl(\theta - c(\theta)\mathbf{1}\bigr)^T B\bigr\|_\infty = \|\log p_\theta\|_\infty,$$
where the first inequality of Lemma 4.2 is applied to $\theta - c(\theta)\mathbf{1}$.
where the infimum and supremum are taken over all $\theta$ on the line segment
between $\theta_1$ and $\theta_2$ and all $x \in [0, 1]$.
for $\tilde{\theta}$ and $\tilde{\tilde{\theta}}$ vectors on the line segment between $\theta_1$ and $\theta_2$ and $\ddot{c}(\theta)$ the Hessian
of $c$. By the well-known properties of exponential families, we have
$$\tau^T \ddot{c}(\theta) \tau = \operatorname{var}_\theta\bigl(\tau^T B\bigr) = \inf_{\mu \in \mathbb{R}} \int_0^1 \bigl((\tau - \mu\mathbf{1})^T B(x)\bigr)^2 p_\theta(x)\, dx.$$
By combining these lemmas we see that the Hellinger distance $h(P_{\theta_1}, P_{\theta_2})$
and $1/\sqrt{J}$ times the $J$-dimensional Euclidean distance $\|\theta_1 - \theta_2\|$ are proportional,
uniformly in $J$ and in $\theta_1, \theta_2$ having uniformly bounded coordinates.
This combined with the estimate on the distance of $p_0$ to the set of $p_\theta$ given
by Lemma 4.1 reduces the verification of (2.7) and (2.9) to calculations in the
Euclidean setting.
We are now ready to prove the following theorem. By Lemma 4.3 there
exists a constant $d$ such that $d\|\theta\|_\infty \le \|\log p_\theta\|_\infty$ for every $\theta \in \mathbb{R}^J$ with $\theta^T \mathbf{1} = 0$
and every $J \in \mathbb{N}$. We shall assume that the prior is chosen as roughly uniform
on a large box $[-M, M]^J$. This corresponds to densities $p_\theta$ that are bounded
and bounded away from zero by at least a small constant.
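A sketch of such a prior draw (our own illustration; the mean-centering is just one way to impose $\theta^T \mathbf{1} = 0$, and the knot sequence `t` is as in the B-spline check above):

```python
import numpy as np
from scipy.interpolate import BSpline

def sample_log_spline_density(J, M, t, q, grid, rng):
    """Draw theta roughly uniform on [-M, M]^J with theta^T 1 = 0 and
    return p_theta = exp(theta^T B - c(theta)) evaluated on `grid`."""
    assert J == len(t) - q                     # J basis functions of order q
    theta = rng.uniform(-M, M, size=J)
    theta -= theta.mean()                      # one way to impose theta^T 1 = 0
    B = BSpline.design_matrix(grid, t, q - 1).toarray()
    f = B @ theta                              # theta^T B(x) on the grid
    c = np.log(np.trapz(np.exp(f), grid))      # c(theta): log normalizing constant
    return np.exp(f - c)
```

A draw is obtained with, for example, `rng = np.random.default_rng()` and a fine grid in $[0, 1]$.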
Theorem 4.5. Suppose that $\Pi_n$ has a density with respect to Lebesgue measure
on $\{\theta \in \mathbb{R}^{J_n} : \theta^T \mathbf{1} = 0\}$ for $J_n \sim n^{1/(2\alpha+1)}$ whose minimum and maximum
values on $[-M, M]^{J_n}$ are bounded below and above by terms of the orders $c^{J_n}$
and $C^{J_n}$, respectively, for positive constants $c, C$, and which vanishes outside
$[-M, M]^{J_n}$. Let $M \ge 1$. Then for every $p_0 \in C^\alpha[0, 1]$ for $q \ge \alpha \ge 1/2$ such that
$\|\log p_0\|_\infty \le \frac{1}{2} dM$, the conditions of Theorem 2.4 are satisfied for $\varepsilon_n$ a large
multiple of $n^{-\alpha/(2\alpha+1)}$ and $\mathcal{P}_n$ the support of $\Pi_n$, and hence the posterior rate of
convergence is $n^{-\alpha/(2\alpha+1)}$.
By Lemma 4.1 there exists $\theta^*$ such that $\|\theta^{*T} B - \log p_0\|_\infty \lesssim J^{-\alpha}$. Taking
exponentials we see that this implies that $\|\exp(\theta^{*T} B) - p_0\|_\infty \lesssim J^{-\alpha}$, and next,
by integrating this inequality, that $|\exp(c(\theta^*)) - 1| \lesssim J^{-\alpha}$, whence $|c(\theta^*)| \lesssim J^{-\alpha}$.
Next, we have, with $\mathrm{vol}_J$ the volume of the $(J-1)$-dimensional unit ball,
$$\Pi_n\bigl(P_\theta : h(P_\theta, P_0) \le 2j\varepsilon,\ \|\theta\|_\infty \le M\bigr) \le \sup_{\|\theta\|_\infty \le M} \pi_n(\theta)\, \bigl(2C\sqrt{J}\, j\varepsilon + J^{-\alpha} M\bigr)^J\, \mathrm{vol}_J.$$
By Lemma 4.3 and the assumption that $p_0$ is bounded, the norms $\|p_0/p_\theta\|_\infty$
are uniformly bounded over $\theta$ ranging over a set of bounded $\|\theta\|_\infty$. Therefore,
$$\ge \frac{\Gamma(m)}{\prod_{i=1}^{N} \Gamma(\alpha_i)} \prod_{i=1}^{N-1} \int_{\max(x_{i0} - \varepsilon^2,\, 0)}^{x_{i0}} x_i^{\alpha_i - 1}\, dx_i,$$
where $m = \sum_{i=1}^N \alpha_i$. We use here that $\bigl(1 - \sum_{i=1}^{N-1} x_i\bigr)^{\alpha_N - 1} \ge 1$, since $\alpha_N \le 1$. Similarly, since $\alpha_i \le 1$
for every $i$, we can lower bound the integrand by 1 and note that the interval of
integration contains at least an interval of length $\varepsilon^2$. Since $\alpha\Gamma(\alpha) = \Gamma(\alpha + 1) \le 1$
for $0 < \alpha \le 1$, so that $1/\Gamma(\alpha_i) \ge \alpha_i$, we can bound the last display from below by
$$\Gamma(m)\, \varepsilon^{2(N-1)} \prod_{i=1}^N \alpha_i \ge \Gamma(A)\, \varepsilon^{2(N-1)} (A\varepsilon)^N \ge C \exp\Bigl(-cN \log \frac{1}{\varepsilon}\Bigr).$$
This concludes the proof. ✷
This is true both for d equal to the total variation distance and for d equal to
the Hellinger distance. (The constant 2 has no particular interest and is not
optimal; any constant bigger than 1 is possible and would do for our purposes.)
More generally, it is known from Birgé (1984) and Le Cam (1986) (see
Lemma 4 on page 478) that given any two convex sets $\mathcal{P}_0$ and $\mathcal{P}_1$ of probability
measures, there exist tests $\phi_n$ such that
$$\sup_{P \in \mathcal{P}_0} P^n \phi_n \le \exp\bigl(n \log \rho(\mathcal{P}_0, \mathcal{P}_1)\bigr), \tag{7.3}$$
$$\sup_{P \in \mathcal{P}_1} P^n (1 - \phi_n) \le \exp\bigl(n \log \rho(\mathcal{P}_0, \mathcal{P}_1)\bigr). \tag{7.4}$$
Theorem 7.1. Suppose that for some nonincreasing function $D(\varepsilon)$, some
$\varepsilon_n \ge 0$ and every $\varepsilon > \varepsilon_n$,
$$D\Bigl(\frac{\varepsilon}{2},\ \bigl\{P : \varepsilon \le d(P, P_0) \le 2\varepsilon\bigr\},\ d\Bigr) \le D(\varepsilon).$$
Then for every $\varepsilon > \varepsilon_n$ there exist tests $\phi_n$ (depending on $\varepsilon > 0$) such that, for a
universal constant $K$ and every $j \in \mathbb{N}$,
$$P_0^n \phi_n \le D(\varepsilon)\, \frac{\exp(-Kn\varepsilon^2)}{1 - \exp(-Kn\varepsilon^2)}, \tag{7.5}$$
$$\sup_{P : d(P, P_0) > j\varepsilon} P^n (1 - \phi_n) \le \exp\bigl(-Kn\varepsilon^2 j^2\bigr). \tag{7.6}$$
Proof. For a given $j \in \mathbb{N}$ choose a maximal $j\varepsilon/2$-separated set of points $S_j'$
in $S_j = \{P : j\varepsilon < d(P, P_0) \le (j+1)\varepsilon\}$. This yields a set $S_j'$ of at most $D(j\varepsilon)$
points, and every $P \in S_j$ is within distance $j\varepsilon/2$ of at least one of these points.
(Take $S_j'$ empty and adapt the following in the obvious way if $S_j$ is empty.)
For every such point $P_1 \in S_j'$ there exists a test $\omega_n$ with the properties as in
(7.1) and (7.2). Let $\phi_n$ be the maximum of all tests attached in this way to
some point $P_1 \in S_j'$ for some $j \in \mathbb{N}$. Then
$$P_0^n \phi_n \le \sum_{j} \sum_{P_1 \in S_j'} \exp\bigl(-Knj^2\varepsilon^2\bigr) \le \sum_{j \in \mathbb{N}} D(j\varepsilon) \exp\bigl(-Knj^2\varepsilon^2\bigr),$$
$$\sup_{P \in \bigcup_{i \ge j} S_i} P^n (1 - \phi_n) \le \sup_{i \ge j} \exp\bigl(-Kni^2\varepsilon^2\bigr).$$
The right sides can be further bounded as desired. [Note that $D(j\varepsilon) \le D(\varepsilon)$
for every $j \in \mathbb{N}$, by assumption.] ✷
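For instance (a one-line completion of the first bound, using $j^2 \ge j$ and $D(j\varepsilon) \le D(\varepsilon)$),
$$\sum_{j \in \mathbb{N}} D(j\varepsilon) \exp\bigl(-Knj^2\varepsilon^2\bigr) \le D(\varepsilon) \sum_{j=1}^\infty \exp\bigl(-Knj\varepsilon^2\bigr) = D(\varepsilon)\, \frac{\exp(-Kn\varepsilon^2)}{1 - \exp(-Kn\varepsilon^2)},$$
which is exactly (7.5).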
One possible choice for $D(\varepsilon)$ is the packing number $D(\varepsilon/2, \mathcal{P}, d)$. This
is a bigger number, but in many infinite-dimensional situations this does not
appear to yield a real loss. On the other hand, the theorem is needed as stated
if $\mathcal{P}$ is finite-dimensional.
It is known from Le Cam (1973) that even for a fixed $\varepsilon$ there need not
exist a consistent sequence of tests of $P_0$ versus $\{P : d(P, P_0) > \varepsilon\}$. The
preceding theorem shows that total boundedness of $\mathcal{P}$ [which is equivalent to
$D(\varepsilon, \mathcal{P}, d)$ being finite for every $\varepsilon > 0$] is sufficient for the existence of such
a test. However, this is not necessary. One example showing this is given by
(7.1) and (7.2) applied with $\mathcal{P} = \{P_0\} \cup \{P : d(P, P_1) < d(P_0, P_1)/2\}$, because
a total variation or Hellinger ball is usually not totally bounded. A classical
example is as follows.
Lemma 7.2. Suppose that there exist tests $\omega_n$ such that for fixed sets $\mathcal{P}_0$
and $\mathcal{P}_1$ of probability measures
$$\sup_{P_0 \in \mathcal{P}_0} P_0^n \omega_n \to 0, \qquad \sup_{P \in \mathcal{P}_1} P^n (1 - \omega_n) \to 0.$$
In view of the fact that, apparently, entropy conditions are not always appro-
priate to ensure the existence of tests, it is fruitful to formulate a theorem on
rates of convergence directly in terms of existence of tests. The following is a
result of this type.
Theorem 7.3. Suppose that (2.8) and (2.9) hold for a sequence $\varepsilon_n$ with
$\varepsilon_n \to 0$ and $n\varepsilon_n^2$ bounded away from zero and sets $\mathcal{P}_n \subset \mathcal{P}$, and in addition
suppose that there exists a sequence of tests $\phi_n$ such that for some constant
$K > 0$ and for every sufficiently large $j$,
$$P_0^n \phi_n \to 0, \tag{7.7}$$
$$\sup_{P \in \mathcal{P}_n :\, j\varepsilon_n < d(P, P_0) \le 2j\varepsilon_n} P^n (1 - \phi_n) \le \exp\bigl(-Kn\varepsilon_n^2 j^2\bigr). \tag{7.8}$$
Lemma 8.1. For every $\varepsilon > 0$ and probability measure $\Pi$ on the set
$$\Bigl\{P : -P_0 \log \frac{p}{p_0} \le \varepsilon^2,\ P_0 \Bigl(\log \frac{p}{p_0}\Bigr)^2 \le \varepsilon^2\Bigr\}$$
we have, for every $C > 0$,
$$P_0^n \Bigl(\int \prod_{i=1}^n \frac{p}{p_0}(X_i)\, d\Pi(P) \le \exp\bigl(-(1 + C)n\varepsilon^2\bigr)\Bigr) \le \frac{1}{C^2 n\varepsilon^2}.$$
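The shape of this bound can be traced in a short sketch (ours, along standard lines): by Jensen's inequality and Fubini's theorem,
$$\log \int \prod_{i=1}^n \frac{p}{p_0}(X_i)\, d\Pi(P) \ge \sum_{i=1}^n \int \log \frac{p}{p_0}(X_i)\, d\Pi(P) =: Z_n,$$
where $E_0 Z_n \ge -n\varepsilon^2$ and $\operatorname{var}_0 Z_n \le n\varepsilon^2$ by the two defining inequalities of the set. Chebyshev's inequality then gives
$$P_0^n\bigl(Z_n \le -(1 + C)n\varepsilon^2\bigr) \le \frac{n\varepsilon^2}{(Cn\varepsilon^2)^2} = \frac{1}{C^2 n\varepsilon^2}.$$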
$$\le \Pi_n\bigl(\mathcal{P} \setminus \mathcal{P}_n\bigr) + \int_{P \in \mathcal{P}_n :\, d(P, P_0) > M\varepsilon_n} P^n (1 - \phi_n)\, d\Pi_n(P)$$
for $M \ge \sqrt{(C + 4)/K}$. By Lemma 8.1, we have with probability tending to 1,
with $B_n = \bigl\{P : -P_0 \log(p/p_0) \le \varepsilon_n^2,\ P_0 \bigl(\log(p/p_0)\bigr)^2 \le \varepsilon_n^2\bigr\}$,
$$\int \prod_{i=1}^n \frac{p}{p_0}(X_i)\, d\Pi_n(P) \ge \exp\bigl(-2n\varepsilon_n^2\bigr)\, \Pi_n(B_n) \ge \exp\bigl(-n\varepsilon_n^2 (2 + C)\bigr).$$
For the proof of Theorem 2.2 we need a replacement of Lemma 8.1 that gives
a faster rate of convergence in its statement. We can achieve this by controlling
the quotients p/p0 . First, if one has uniform control from below, then the
Hellinger distance and the Kullback–Leibler information are comparable. The
following lemma can be found in Birgé and Massart (1998) [see their (7.6)].
If $e^{-c} = \|p_0/p\|_\infty$, then $\log(p/p_0) \ge c$ and hence the integrand on the left side
of (8.4) is bounded above by
$$2e^{-c}\Bigl(\exp\Bigl(\tfrac{1}{2}\log\frac{p}{p_0}\Bigr) - 1\Bigr)^2 = 2e^{-c}\Bigl(\sqrt{\frac{p}{p_0}} - 1\Bigr)^2.$$
The integral of the right side with respect to $P_0$ is equal to $2e^{-c}$ times the
squared Hellinger distance. ✷
$$\Bigl\{P : h^2(P, P_0)\, \Bigl\|\frac{p_0}{p}\Bigr\|_\infty \le \varepsilon^2\Bigr\} \subset \Bigl\{P : -P_0 \log \frac{p}{p_0} \le 2\varepsilon^2,\ P_0 \Bigl(\log \frac{p}{p_0}\Bigr)^2 \le 4\varepsilon^2\Bigr\}. \tag{8.6}$$
This shows that condition (2.4) is weaker than condition (2.5), up to con-
stants. Actually controlling all moments is more than what is needed. Another
possible extension of Lemma 8.1 would be to replace the second moment of
logp/p0 by a higher moment (and use Markov’s inequality at the end of
the proof). This would give a result good enough for the proof of Theorem 2.2
provided the higher moment is chosen “high enough” (dependent on the order
εn , faster convergence to zero needing a higher moment). We have chosen
here to forego such refinements and obtain an exponential inequality under a
somewhat stronger assumption.
We are ready for an adaptation of Lemma 8.1.
Lemma 8.4. For every $\varepsilon > 0$ and probability measure $\Pi$ on the set
$$\bigl\{P : h^2(P, P_0)\, \|p_0/p\|_\infty \le \varepsilon^2\bigr\} \tag{8.7}$$
we have, for a universal constant $D > 0$,
$$P_0^n \Bigl(\int \prod_{i=1}^n \frac{p}{p_0}(X_i)\, d\Pi(P) \le \exp\bigl(-3n\varepsilon^2\bigr)\Bigr) \le \exp\bigl(-Dn\varepsilon^2\bigr). \tag{8.8}$$
Proof. Lemma 8.2 gives that $-P_0 \log(p/p_0) \le 2\varepsilon^2$ for every $P$ in the
set (8.7), which has $\Pi$-probability 1. Furthermore, by Lemma 8.3,
by Fubini’s theorem.
By the lemma below the same bound is true for $\frac{1}{2}$ times
the variable $\int \log(p/p_0)\, d\Pi$ centered at its expectation. Therefore, rewriting
the probability on the left side of (8.8) as in the proof of Lemma 8.1, we
see that it is bounded above by
$$P_0^n \Bigl(\sqrt{n}\,(\mathbb{P}_n - P_0) \int \log \frac{p}{p_0}\, d\Pi(P) \le -3\sqrt{n}\,\varepsilon^2 + 2\sqrt{n}\,\varepsilon^2\Bigr) \le \exp\Bigl(-D\, \frac{n\varepsilon^4}{\varepsilon^2 + \sqrt{n}\,\varepsilon^2/\sqrt{n}}\Bigr),$$
with $\mathbb{P}_n$ the empirical measure of $X_1, \dots, X_n$, by (the refined version of) Bernstein's inequality [see, e.g., Lemma 2.2.11 of
van der Vaart and Wellner (1996)]. ✷
Proof of Theorem 2.2. The proof of Theorem 2.2 follows the same lines
as the proof of Theorem 2.1. The difference is that we use Lemma 8.4 instead
of Lemma 8.1 to ensure that the probability of the events $A_n$ converges to 1
at an exponential rate. By inspecting the proof, we conclude that for some
$B_1, B_2 > 0$ and $M$ chosen as before,
$$P_0^n\Bigl(\Pi_n\bigl(P : d(P, P_0) > M\varepsilon_n \mid X_1, \dots, X_n\bigr) \ge \exp\bigl(-B_1 n\varepsilon_n^2\bigr)\Bigr)$$
converges to zero at the rate $\exp(-B_2 n\varepsilon_n^2)$. Since $\sum_n \exp(-B_2 n\varepsilon_n^2) < \infty$,
almost sure convergence follows by the Borel–Cantelli lemma. ✷
For the proof of Theorem 2.3 we need other variations on the preceding lem-
mas. The following lemma follows from Theorem 5 of Wong and Shen (1995).
Let $\log_+ x = \log x \vee 0$.
Lemma 8.6. For any pair of probability measures $P$ and $P_0$ such that
$h(P, P_0) \le 0.44$ and $P_0(p_0/p) < \infty$,
$$-P_0 \log \frac{p}{p_0} \le 18\, h^2(P, P_0)\Bigl(1 + \log_+ \frac{P_0(p_0/p)}{h(P, P_0)}\Bigr),$$
$$P_0 \Bigl(\log \frac{p}{p_0}\Bigr)^2 \le 5\, h^2(P, P_0)\Bigl(1 + \log_+ \frac{P_0(p_0/p)}{h(P, P_0)}\Bigr)^2.$$
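A numerical spot-check of these two bounds for a pair of normal densities (our own illustration, using numerical integration on a grid; it verifies the 0.44 condition first):

```python
import numpy as np

x = np.linspace(-12, 12, 200001)
p0 = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)           # N(0, 1)
p = np.exp(-(x - 0.3)**2 / 2) / np.sqrt(2 * np.pi)    # N(0.3, 1)
h2 = np.trapz((np.sqrt(p) - np.sqrt(p0))**2, x)       # squared Hellinger distance
h = np.sqrt(h2)
kl = np.trapz(p0 * np.log(p0 / p), x)                 # -P0 log(p/p0)
m2 = np.trapz(p0 * np.log(p / p0)**2, x)              # P0 (log(p/p0))^2
ratio = np.trapz(p0 * (p0 / p), x)                    # P0 (p0/p)
assert h <= 0.44                                      # condition of the lemma
bound = 1 + max(np.log(ratio / h), 0.0)               # 1 + log_+ (P0(p0/p)/h)
print(kl <= 18 * h2 * bound, m2 <= 5 * h2 * bound**2) # prints: True True
```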
Lemma 8.7. For any pair of probability measures $P$ and $P_0$ such that
$P_0(p_0/p) < \infty$,
$$P_0\Bigl(e^{\log(p/p_0)} - 1 - \log \frac{p}{p_0}\Bigr) \le 4\, h^2(P, P_0)\bigl(1 + \psi^{-1}\bigl(h^2(P, P_0)\bigr)\bigr)$$
for $\psi^{-1}(\varepsilon) = \sup\{M : \psi(M) \ge \varepsilon\}$ the inverse of the function $\psi(M) = P_0\bigl((p_0/p)\mathbf{1}\{p_0/p \ge M\}\bigr)/M$.
Proof. Set $m = p_0/p$. By inequality (8.5) in the proof of Lemma 8.3, the
left side is bounded above by
$$2P_0\Bigl(\sqrt{\frac{p}{p_0}} - 1\Bigr)^2 \mathbf{1}\{p \ge p_0\} + 2P_0 \frac{p_0}{p}\Bigl(\sqrt{\frac{p}{p_0}} - 1\Bigr)^2 \mathbf{1}\{p < p_0\}$$
for every $M > 0$. The function $\psi$ is left continuous and strictly decreasing
from infinity at 0 to 0 at a point $\tau \le \infty$. If we choose $M = \psi^{-1}\bigl(h^2(P, P_0)\bigr)$,
then $M\psi(M) \ge M h^2(P, P_0) \ge M\psi(M+) = P_0\bigl(m \mathbf{1}\{m > M\}\bigr)$. The right side of
the last display can now be bounded by an expression as in the lemma. ✷
Lemma 8.8. For a given function $m$ let $\psi^{-1}(\varepsilon) = \sup\{M : \psi(M) \ge \varepsilon\}$ be the
inverse function of $\psi(M) = P_0\bigl(m \mathbf{1}\{m \ge M\}\bigr)/M$. For every $\varepsilon \in (0, 0.44)$ and
probability measure $\Pi$ on the set
$$\Bigl\{P : p_0/p \le m,\ 18\, h^2(P, P_0)\Bigl(1 + \log_+ \frac{P_0 m}{h(P, P_0)} + \psi^{-1}\bigl(h^2(P, P_0)\bigr)\Bigr) \le \varepsilon^2\Bigr\}$$
we have, for a universal constant $B$,
$$P_0^n \Bigl(\int \prod_{i=1}^n \frac{p}{p_0}(X_i)\, d\Pi(P) \le \exp\bigl(-2n\varepsilon^2\bigr)\Bigr) \le \exp\bigl(-Bn\varepsilon^2\bigr). \tag{8.9}$$
Proof. This follows the same lines as the proof of Lemma 8.4, now sub-
stituting Lemmas 8.6 and 8.7 for Lemmas 8.2–8.3.
Proof of Theorem 2.4. The first part of the proof is identical to the first
part of the proof of Theorem 2.1, except that we choose the tests $\phi_n$ to satisfy
(8.1) and [instead of (8.2)] for every $j \in \mathbb{N}$,
$$\sup_{P \in \mathcal{P}_n :\, d(P, P_0) > M\varepsilon_n j} P^n (1 - \phi_n) \le \exp\bigl(-KnM^2\varepsilon_n^2 j^2\bigr). \tag{8.10}$$
REFERENCES
Barron, A., Schervish, M. J. and Wasserman, L. (1999). The consistency of posterior distribu-
tions in nonparametric problems. Ann. Statist. 27 536–561.
Birgé, L. (1983). Approximation dans les espaces métriques et théorie de l’estimation. Z. Wahrsch.
Verw. Gebiete 65 181–238.
Birgé, L. (1984). Sur un théorème de minimax et son application aux tests. Probab. Math. Statist.
3 259–282.
Birgé, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probab.
Theory Related Fields 97 113–150.
Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. In Festschrift for
Lucien Le Cam (G. Yang and D. Pollard, eds.) 55–87. Springer, New York.
Birgé, L. and Massart, P. (1998). Minimum contrast estimators on sieves: exponential bounds
and rates of convergence. Bernoulli 4 329–375.
de Boor, C. (1978). A Practical Guide to Splines. Springer, New York.
Diaconis, P. and Freedman, D. (1986). On the consistency of Bayes estimates (with discussion).
Ann. Statist. 14 1–67.
Doob, J. L. (1949). Application of the theory of martingales. Le Calcul des Probabilités et ses Applications. Coll. Int. du CNRS 13 23–27.
Dudley, R. M. (1984). A course on empirical processes. Lecture Notes in Math. 1097 2–141.
Springer, Berlin.
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1
209–230.
Ferguson, T. S. (1974). Prior distribution on the spaces of probability measures. Ann. Statist. 2
615–629.
Freedman, D. A. (1963). On the asymptotic behavior of Bayes’ estimates in the discrete case.
Ann. Math. Statist. 34 1194–1216.
Freedman, D. A. (1965). On the asymptotic behavior of Bayes’ estimates in the discrete case II.
Ann. Math. Statist. 36 454–456.
Ghosal, S., Ghosh, J. K. and Ramamoorthi, R. V. (1997). Non-informative priors via sieves
and packing numbers. In Advances in Statistical Decision Theory and Applications (S.
Panchapakesan and N. Balakrishnan, eds.) 129–140. Birkhäuser, Boston.
Ghosal, S., Ghosh, J. K. and Ramamoorthi R. V. (1999a). Posterior consistency of Dirichlet
mixtures in density estimation. Ann. Statist. 27 143–158.
Ghosal, S., Ghosh, J. K. and Ramamoorthi, R. V. (1999b). Consistency issues in Bayesian non-
parametrics. In Asymptotics, Nonparametrics and Time Series: A Tribute to Madan Lal
Puri (Subir Ghosh, ed.) 639–667. Dekker, New York.
Ibragimov, I. A. and Has’minskii, R. Z. (1981). Statistical Estimation: Asymptotic Theory.
Springer, New York.
Kolmogorov, A. N. and Tikhomirov, V. M. (1961). Epsilon-entropy and epsilon-capacity of sets
in function spaces. Amer. Math. Soc. Trans. Ser. 2 17 277–364.
Le Cam, L. M. (1973). Convergence of estimates under dimensionality restrictions. Ann. Statist.
1 38–53.
Le Cam, L. M. (1986). Asymptotic Methods in Statistical Decision Theory. Springer, New York.
Le Cam, L. M. and Yang, G. (1990). Asymptotics in Statistics: Some Basic Concepts. Springer,
New York.
Pollard, D. (1990). Empirical Processes: Theory and Applications. IMS, Hayward, CA and Amer.
Statist. Assoc., Alexandria, VA.
Schwartz, L. (1965). On Bayes procedures. Z. Wahrsch. Verw. Gebiete 4 10–26.
Shen, X. and Wasserman, L. (1999). Rates of convergence of posterior distributions. Preprint.
Stone, C. J. (1986). The dimensionality reduction principle for generalized additive models. Ann.
Statist. 14 590–606.
Stone, C. J. (1990). Large-sample inference for log-spline models. Ann. Statist. 18 717–741.
Stone, C. J. (1994). The use of polynomial splines and their tensor products in multivariate
function estimation (with discussion). Ann. Statist. 22 118–184.
van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes.
Springer, New York.
Wasserman, L. (1998). Asymptotic properties of nonparametric Bayesian procedures. Practical
Nonparametric and Semiparametric Bayesian Statistics. Lecture Notes in Statist. 133
293–304. Springer, New York.
Wong, W. H. and Shen, X. (1995). Probability inequalities for likelihood ratios and convergence
rates of sieve MLEs. Ann. Statist. 23 339–362.
S. Ghosal
A. W. van der Vaart
Department of Mathematics
Free University
De Boelelaan 1081a
1081 HV Amsterdam
Netherlands
E-mail: [email protected]

J. K. Ghosh
Statistics and Mathematics Unit
Indian Statistical Institute
203 B.T. Road
Calcutta 700 035
India