Nonparametric Bayesian model selection and averaging

We consider nonparametric Bayesian estimation of a probability density $p$ based on a random sample of size $n$ from this density using a hierarchical prior. The prior consists, for instance, of prior weights on the regularity of the unknown density combined with priors that are appropriate given that the density has this regularity. More generally, the hierarchy consists of prior weights on an abstract model index and a prior on a density model for each model index. We present a general theorem on the rate of contraction of the resulting posterior distribution as $n\to \infty$, which gives conditions under which the rate of contraction is the one attached to the model that best approximates the true density of the observations. This shows that, for instance, the posterior distribution can adapt to the smoothness of the underlying density. We also study the posterior distribution of the model index, and find that under the same conditions the posterior distribution gives negligible weight to models that are bigger than the optimal one, and thus selects the optimal model or smaller models that also approximate the true density well. We apply these results to log spline density models, where we show that the prior weights on the regularity index interact with the priors on the models, making the exact rates depend in a complicated way on the priors, but also that the rate is fairly robust to specification of the prior weights.


Introduction
It is well known that the selection of a suitable "bandwidth" is crucial in nonparametric estimation of densities. Within a Bayesian framework it is natural to put a prior on the bandwidth and let the data decide on a correct bandwidth through the corresponding posterior distribution. More generally, a Bayesian procedure might consist of the specification of a suitable prior on a statistical model that is "correct" if the true density possesses a certain regularity level, together with the specification of a prior on the regularity. Such a hierarchical Bayesian procedure fits naturally within the framework of adaptive estimation, which focuses on constructing estimators that automatically choose a best model from a given set of models. Given a collection of models, an estimator is said to be rate-adaptive if it attains the rate of convergence that would have been attained had only the best model been used. For instance, the minimax rate of convergence for estimating a density on $[0,1]^d$ that is known to have $\alpha$ derivatives is $n^{-\alpha/(2\alpha+d)}$. An estimator would be rate-adaptive to the set of models consisting of all smooth densities if it attained the rate $n^{-\alpha/(2\alpha+d)}$ whenever the true density is $\alpha$-smooth, for any $\alpha > 0$. (See e.g. Tsybakov [2004].)
In this paper we present a general result on adaptation for density estimation within the Bayesian framework. The observations are a random sample $X_1,\ldots,X_n$ from a density on a given measurable space. Given a countable collection of density models $\mathcal P_{n,\alpha}$, indexed by a parameter $\alpha\in A_n$, each provided with a prior distribution $\Pi_{n,\alpha}$, and a prior distribution $\lambda_n$ on $A_n$, we consider the posterior distribution relative to the prior that first chooses $\alpha$ according to $\lambda_n$ and next $p$ according to $\Pi_{n,\alpha}$ for the chosen $\alpha$. The index $\alpha$ may be a regularity parameter, but in the general result it may be arbitrary. Thus the overall prior is a probability measure on the set of probability densities, given by
$$\Pi_n=\sum_{\alpha\in A_n}\lambda_{n,\alpha}\Pi_{n,\alpha}. \tag{1.1}$$
Given this prior distribution, the corresponding posterior distribution is the random measure
$$\Pi_n(B\mid X_1,\ldots,X_n)=\frac{\int_B\prod_{i=1}^n p(X_i)\,d\Pi_n(p)}{\int\prod_{i=1}^n p(X_i)\,d\Pi_n(p)}=\frac{\sum_{\alpha\in A_n}\lambda_{n,\alpha}\int_{p\in\mathcal P_{n,\alpha}:\,p\in B}\prod_{i=1}^n p(X_i)\,d\Pi_{n,\alpha}(p)}{\sum_{\alpha\in A_n}\lambda_{n,\alpha}\int_{p\in\mathcal P_{n,\alpha}}\prod_{i=1}^n p(X_i)\,d\Pi_{n,\alpha}(p)}. \tag{1.2}$$
Of course, we impose appropriate (measurability) conditions to ensure that this expression is well defined. We say that the posterior distributions have rate of convergence at least $\varepsilon_n$ if, for every sufficiently large constant $M$, as $n\to\infty$, in probability,
$$\Pi_n\bigl(p: d(p,p_0) > M\varepsilon_n \mid X_1,\ldots,X_n\bigr)\to 0.$$
Here the distribution of the random measure (1.2) is evaluated under the assumption that $X_1,\ldots,X_n$ are an i.i.d. sample from $p_0$, and $d$ is a distance on the set of densities. Throughout the paper this distance is assumed to be bounded above by the Hellinger distance and to generate convex balls. (For instance, the Hellinger or $L_1$-distance, or the $L_2$-distance if the densities are uniformly bounded.) Thus we study the asymptotics of the posterior distribution in the frequentist sense.
The aim is to prove a result of the following type. For a given $p_0$ there exists a best model $\mathcal P_{n,\beta_n}$ that gives a posterior rate $\varepsilon_{n,\beta_n}$ if it would be combined with the prior $\Pi_{n,\beta_n}$. The hierarchical Bayesian procedure would adapt to the set of models if the posterior distributions (1.2), which are based on the mixture prior (1.1), have the same rate of convergence for this $p_0$, for any $p_0$ in some model $\mathcal P_{n,\alpha}$. Technically, the first main result is Theorem 2.1 in Section 2.
This sense of Bayesian adaptation refers to the "full" posterior, both its centering and its spread. As noted by Belitser and Ghosal [2003], suitably defined centers of these posteriors would yield adaptive point estimators.
The posterior distribution can be viewed as a mixture of the posterior distributions on the various models, with the weights given by the posterior distribution of the model index. Our second main result, Theorem 3.1, concerns the posterior distribution of the model index. It shows that models that are "bigger" than the optimal model asymptotically achieve zero posterior mass. On the other hand, under our conditions the posterior may distribute its mass over a selection of smaller models, provided that these can approximate the true distribution well.
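The mixture representation of the posterior can be illustrated with a small numerical sketch. It assumes that the logarithm of the integrated likelihood of each model is already available; the function name `posterior_model_weights` and the numerical inputs are hypothetical and serve only as an illustration.

```python
import numpy as np

def posterior_model_weights(prior_weights, log_marginals):
    """Posterior probabilities of the model indices.

    prior_weights: the model weights lambda_{n,alpha} (positive, summing to one).
    log_marginals: log of the integrated likelihood of each model under its
                   own prior (hypothetical inputs; computing them is model specific).
    """
    log_w = np.log(np.asarray(prior_weights, dtype=float))
    log_w = log_w + np.asarray(log_marginals, dtype=float)
    log_w -= log_w.max()              # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

# Two equally weighted models whose log integrated likelihoods differ by 5.
weights = posterior_model_weights([0.5, 0.5], [-120.0, -125.0])
```

With equal prior weights the ratio `weights[0] / weights[1]` is the Bayes factor of the first model against the second.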
In the situation that there are precisely two models this phenomenon can be conveniently described by the Bayes factor of the two models. We provide simple sufficient conditions for the Bayes factor to select the "true" model with probability tending to one. This consistency property is especially relevant for Bayesian goodness-of-fit testing against a nonparametric alternative. A computationally advantageous method for such goodness-of-fit tests was developed by Berger and Guglielmi [2001] using a mixture of Polya tree priors on the nonparametric alternative. Asymptotic properties of Bayes factors for nested regular parametric models have been well studied, beginning with the pioneering work by Schwarz [1978], who also introduced the Bayesian information criterion. However, large sample properties of Bayes factors when at least one model is infinite dimensional appear to be unknown except in special cases. Dass and Lee [2004] showed consistency of Bayes factors when one of the models is a singleton and the prior for the other model assigns positive probabilities to the Kullback-Leibler neighborhoods of the true density, popularly known as the Kullback-Leibler property. Walker et al. [2004] showed (in particular) that if the prior on one model has the Kullback-Leibler property and the other does not, then the Bayes factor will asymptotically favour the model with the Kullback-Leibler property. Unfortunately, the proof of Dass and Lee [2004] does not generalize to general null models, and frequently both priors will have the Kullback-Leibler property, precluding the application of Walker et al. [2004]. In Sections 3 and 4 we study these issues in general.
The present paper is an extension of the paper Ghosal et al. [2003], which studies adaptation to finitely many models of splines with a uniform weight on the models. In the present paper we derive a result for general models, possibly infinitely many, and investigate different model weights. Somewhat surprisingly, we find that both weights that give more prior mass to small models and weights that downweight small models may lead to adaptation.
Related work on Bayesian adaptation was carried out by Huang [2004], who considers adaptation using scales of finite-dimensional models, and Lember and van der Vaart [2007], who consider special weights that downweight large models. Our methods of proof borrow from Ghosal et al. [2000].
The paper is organized as follows. After stating the main theorems on adaptation and model selection and some corollaries in Sections 2 and 3, we investigate adaptation in detail in the context of log spline models in Section 5, and we consider the Bayes factors for testing a finite- versus an infinite-dimensional model in detail in Section 4. The proof of the main theorems is given in Section 6, and further technical proofs and complements are given in Section 7.

Notation
Throughout the paper the data are a random sample $X_1,\ldots,X_n$ from a probability measure $P_0$ on a measurable space $(\mathcal X,\mathcal A)$ with density $p_0$ relative to a given reference measure $\mu$ on $(\mathcal X,\mathcal A)$. In general we write $p$ and $P$ for a density and the corresponding probability measure. The Hellinger distance between two densities $p$ and $q$ relative to $\mu$ is defined as $h(p,q)=\|\sqrt p-\sqrt q\|_2$, for $\|\cdot\|_2$ the norm of $L_2(\mu)$. The $\varepsilon$-covering numbers and $\varepsilon$-packing numbers of a metric space $(\mathcal P,d)$, denoted by $N(\varepsilon,\mathcal P,d)$ and $D(\varepsilon,\mathcal P,d)$, are defined as the minimal number of balls of radius $\varepsilon$ needed to cover $\mathcal P$, and the maximal number of $\varepsilon$-separated points, respectively.
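As a concrete illustration of the Hellinger distance, the following sketch approximates $h(p,q)$ for two densities on $[0,1]$ relative to Lebesgue measure by the trapezoidal rule. The helper names `trapezoid` and `hellinger`, the grid size and the two example densities are our own choices and are not taken from the paper.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal approximation of the integral of y over the grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def hellinger(p, q, x):
    """L2-norm of sqrt(p) - sqrt(q), for densities tabulated on the grid x."""
    return np.sqrt(trapezoid((np.sqrt(p) - np.sqrt(q)) ** 2, x))

x = np.linspace(0.0, 1.0, 10001)
uniform = np.ones_like(x)      # density 1 on [0, 1]
linear = 2.0 * x               # density 2x on [0, 1]
h = hellinger(uniform, linear, x)
```

For these two densities the squared distance can be computed by hand as $2-4\sqrt2/3$, which gives a check on the numerical value.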
For each $n\in\mathbb N$ the index set is a countable set $A_n$, and for every $\alpha\in A_n$ the set $\mathcal P_{n,\alpha}$ is a set of $\mu$-probability densities on $(\mathcal X,\mathcal A)$ equipped with a $\sigma$-field such that the maps $(x,p)\mapsto p(x)$ are measurable. Furthermore, $\Pi_{n,\alpha}$ denotes a probability measure on $\mathcal P_{n,\alpha}$, and
Throughout the paper $\varepsilon_{n,\alpha}$ are given positive numbers with $\varepsilon_{n,\alpha}\to 0$ as $n\to\infty$. These may be thought of as the rate attached to the model $\mathcal P_{n,\alpha}$ if this is (approximately) correct.
The notation $a\lesssim b$ means that $a\le Cb$ for a constant $C$ that is universal or fixed in the proof. For sequences $a_n$ and $b_n$ we write $a_n\ll b_n$ if $a_n/b_n\to 0$, and $a_n\gg 0$ if $a_n > 0$ for every $n$ and $\liminf a_n > 0$. For a measure $P$ and a measurable function $f$ we write $Pf$ for the integral of $f$ relative to $P$.

Adaptation
For $\beta_n$ a given element of $A_n$, thought to be the index of a best model for a given fixed true density $p_0$, we split the index set into the indices that give a faster or slower rate: for a fixed constant $H\ge 1$,
Even though we do not assume that $A_n$ is ordered, we shall write $\alpha\gtrsim\beta_n$ and $\alpha < \beta_n$ if $\alpha$ belongs to the sets $A_{n,\gtrsim\beta_n}$ or $A_{n,<\beta_n}$, respectively. The set $A_{n,\gtrsim\beta_n}$ contains $\beta_n$ and hence is never empty, but the set $A_{n,<\beta_n}$ can be empty (if $\beta_n$ is the "smallest" possible index). In the latter case conditions involving $\alpha < \beta_n$ are understood to be automatically satisfied.
The assumptions of the following theorem are reminiscent of the assumptions in Ghosal et al. [2000], and entail a bound on the complexity of the models and a condition on the concentration of the priors. The complexity bound is exactly as in Ghosal et al. [2000] and takes the form: for some constants
The conditions on the priors involve comparisons of the prior masses of balls of various sizes in various models.
(2.4)
Let $K$ be the universal testing constant. According to the assertion of Lemma 6.2 it can certainly be taken equal to $K=1/9$.
(2.7)
Such a condition may be satisfied because the prior probabilities $\Pi_{n,\alpha}\bigl(C_{n,\alpha}(IB\varepsilon_{n,\alpha})\bigr)$ are very small. For instance, a reverse bound of the type (2.5) for $\alpha$ instead of $\beta_n$ would yield this type of bound for fairly general model weights $\lambda_{n,\alpha}$, since $\varepsilon_{n,\alpha}\ge H\varepsilon_{n,\beta_n}$ for $\alpha < \beta_n$. Alternatively, the condition could be forced by choice of the model weights $\lambda_{n,\alpha}$, for general priors $\Pi_{n,\alpha}$. For instance, in Section 5.3 we consider weights of the type
(2.8)
Such weights were also considered in Lember and van der Vaart [2007], who discuss several other concrete examples. For reference we codify the preceding discussion as a lemma.
Theorem 2.1 excludes the case that $\varepsilon_{n,\beta_n}$ is equal to the "parametric rate" $1/\sqrt n$. To cover this case the statement of the theorem must be slightly adapted. The proof of the following theorem is given in Section 6.
Furthermore, assume that $\sum_{\alpha\in A_n}\sqrt{\mu_{n,\alpha}}=O(1)$. If $\beta_n\in A_n$ for every $n$ and $\varepsilon_{n,\beta_n}=1/\sqrt n$, then the posterior distribution (1.2) satisfies, for every $I_n\to\infty$,
For further understanding it is instructive to apply the theorems to the situation of two models, say $\mathcal P_{n,1}$ and $\mathcal P_{n,2}$ with rates $\varepsilon_{n,1} > \varepsilon_{n,2}$. For simplicity we shall also assume (2.5) and use universal constants.
Proof. We apply the preceding theorems with
Statement (1) of the corollary gives the slower $\varepsilon_{n,1}$ of the two rates under the assumption that the bigger model satisfies the prior mass condition (2.5) and a condition on the weights $\lambda_{n,i}$ that ensures that the smaller model is not overly downweighted. The latter condition is very mild, as it allows the weights of the two models to be very different. Apart from this, statement (1) is not surprising, and could also be obtained from nonadaptive results on posterior rates of contraction, as in Ghosal et al. [2000].
Statement (2) gives the faster rate of contraction $\varepsilon_{n,2}$ under the condition that the smaller model satisfies the prior mass condition (2.5), an equally mild condition on the relative weights of the two models, and an additional condition on the prior weight $\Pi_{n,1}\bigl(C_{n,1}(I\varepsilon_{n,1})\bigr)$ that the bigger model attaches to neighbourhoods of the true distribution. If this would be of the expected order $\exp(-Fn\varepsilon_{n,1}^2)$, then the conditions on the weights $\lambda_{n,1}$ and $\lambda_{n,2}$ in the union of (1) and (2) can be summarized as
This is a remarkably big range of weights. One might conclude that Bayesian methods are very robust to the prior specification of model weights. One might also more cautiously guess that rate-asymptotics do not yield a complete picture of the performance of the various priors (even though rates are considerably more informative than consistency results).
Remark 2.1. The entropy condition (2.1) can be relaxed to the same condition on a submodel $\mathcal P'_{n,\alpha}\subset\mathcal P_{n,\alpha}$ that carries most of the prior mass, in the sense
This follows, because in that case the posterior will concentrate on $\cup_\alpha\mathcal P'_{n,\alpha}$ (see Ghosal and van der Vaart [2007a], Lemma 1). This relaxation has been found useful in several papers on nonadaptive rates. In the present context, it seems that the condition would only be natural if it is valid for the index $\beta_n$ that gives the slowest of all rates $\varepsilon_{n,\alpha}$.

Model Selection
The theorems in the preceding section concern the concentration of the posterior distribution on the set of densities relative to the metric $d$. In this section we consider the posterior distribution of the index parameter $\alpha$, within the same set-up. The proof of the following theorem is given in Section 6.
Somewhat abusing notation (cf. (1.2)), we write, for any set
Theorem 3.1. Under the conditions of Theorem 2.1,
Under the conditions of Theorem 2.2 this is true with $IB$ replaced by $I_n$, for any $I_n\to\infty$.
The first assertion of the theorem is pleasing. It can be interpreted in the sense that the models that are bigger than the model $\mathcal P_{n,\beta_n}$ that contains the true distribution eventually receive negligible posterior weight. The second assertion makes a similar claim about the smaller models, but it is restricted to the smaller models that keep a certain distance to the true distribution. Such a restriction appears not unnatural, as a small model that can represent the true distribution well ought to be favoured by the posterior: the posterior looks at the data through the likelihood and hence will judge a model by its approximation properties rather than its parametrization. That big models with similarly good approximation properties are not favoured is caused by the fact that (under our conditions) the prior mass on the big models is more spread out, yielding relatively little prior mass near good approximants within the big models.
It is again insightful to specialize the theorem to the case of two models, and simplify the prior mass conditions to (2.5). The behaviour of the posterior of the model index can then be described through the Bayes factor
and $\lambda_{n,2}/\lambda_{n,1}\le e^{n\varepsilon_{n,1}^2}$ and $d(p_0,\mathcal P_{n,2})\ge I_n\varepsilon_{n,1}$ for every $n$ and some
Proof. The Bayes factor tends to 0 or $\infty$ if the posterior probability of model $\mathcal P_{n,2}$ or $\mathcal P_{n,1}$ tends to zero, respectively. Therefore, we can apply Theorem 3.1 with the same choices as in the proof of Corollary 2.1.
In particular, if the two models are equally weighted ($\lambda_{n,1}=\lambda_{n,2}$), the models satisfy (2.1) and the priors satisfy (2.5), then the Bayes factors are asymptotically consistent if

Testing a Finite- versus an Infinite-dimensional Model
Suppose that there are two models, with the bigger model $\mathcal P_{n,1}$ infinite dimensional, and the alternative model a fixed parametric model $\mathcal P_{n,2}=\mathcal P_2=\{p_\theta:\theta\in\Theta\}$, for $\Theta\subset\mathbb R^d$, equipped with a fixed prior $\Pi_{n,2}=\Pi_2$. Assume that $\lambda_{n,1}=\lambda_{n,2}$. We shall show that the Bayes factors are typically consistent in this situation: $\mathrm{BF}_n\to\infty$ if $p_0\in\mathcal P_2$, and $\mathrm{BF}_n\to 0$ if $p_0\notin\mathcal P_2$.
If the prior $\Pi_2$ is smooth in the parameter and the parametrization $\theta\mapsto p_\theta$ is regular, then, for any $\theta_0\in\Theta$ and $\varepsilon\to 0$,
Therefore, if the true density $p_0$ is contained in
and sufficiently large $n$. (The logarithmic factor enters because we use the crude prior mass condition (2.5) instead of the comparisons of prior mass in the main theorems, but it does not matter for this example.) For this choice of $\varepsilon_{n,2}$ we have $\exp[n\varepsilon_{n,2}^2]=n^D$. Therefore, it follows from (2) of Corollary 3.1 that if $p_0\in\mathcal P_2$, then the Bayes factor $\mathrm{BF}_n$ tends to $\infty$ as $n\to\infty$ as soon as there exists
For an infinite-dimensional model $\mathcal P_{n,1}$ this is typically true, even if the models are nested, when $p_0$ is also contained in $\mathcal P_{n,1}$. In fact, we typically have for $p_0\in\mathcal P_{n,1}$ that the left side is of the order $\exp[-Fn\varepsilon_{n,1}^2]$ for $\varepsilon_{n,1}$ the rate attached to the model $\mathcal P_{n,1}$. As for a true infinite-dimensional model this rate is not faster than $n^{-a}$ for some $a < 1/2$, this gives an upper bound of the order $\exp(-Fn^{1-2a})$, which is easily $o(n^{-3D})$. For $p_0$ not contained in the model $\mathcal P_{n,1}$, the prior mass in the preceding display will be even smaller than this.
If $p_0$ is not contained in the parametric model, then typically $d(p_0,\mathcal P_2) > 0$ and hence $d(p_0,\mathcal P_2) > I_n\varepsilon_{n,1}$ for any $\varepsilon_{n,1}\to 0$ and sufficiently slowly increasing $I_n$, as required in (1) of Corollary 3.1. To ensure that $\mathrm{BF}_n\to 0$, it suffices that for some
This is the usual prior mass condition (cf. Ghosal et al. [2000]) for obtaining the rate of convergence $\varepsilon_{n,1}$ using the prior $\Pi_{n,1}$ on the model $\mathcal P_{n,1}$. We present three concrete examples where the preceding can be made precise.
Example 4.1 (Bernstein-Dirichlet mixtures). Bernstein polynomial densities with a Dirichlet prior on the mixing distribution and a geometric or Poisson prior on the order are described in Petrone [1999], Petrone and Wasserman [2002] and Ghosal [2001].
If we take these as the model $\mathcal P_{n,1}$ and prior $\Pi_{n,1}$, then the rate $\varepsilon_{n,1}$ is equal to $\varepsilon_{n,1}=n^{-1/3}(\log n)^{1/3}$ and (4.2) is satisfied, as shown in Ghosal [2001]. As the prior spreads its mass over an infinite-dimensional set that can approximate any smooth function, condition (4.1) will be satisfied for most true densities $p_0$. In particular, if $k_n$ is the minimal degree of a polynomial that is within Hellinger distance $n^{-1/3}\log n$ of $p_0$, then the left side of (4.1) is bounded by the prior mass of all Bernstein-Dirichlet polynomials of degree at least $k_n$, which is $e^{-ck_n}$ for some constant $c$ by construction. Thus (4.1) is certainly satisfied if $k_n\gg\log n$. Consequently, the Bayes factor is consistent for true densities that are not well approximable by polynomials.
Example 4.2 (Log spline densities). Let $\mathcal P_{n,1}$ be equal to the set of log spline densities described in Section 5 of dimension $J\sim n^{1/(2\alpha+1)}$, equipped with the prior obtained by putting the uniform distribution on $[-M,M]^J$ on the coefficients. The corresponding rate can then be taken $\varepsilon_{n,1}=n^{-\alpha/(2\alpha+1)}\sqrt{\log n}$ (see Section 5.1). Conditions (4.1) and (4.2) can be verified easily by computations on the uniform prior, after translating the distances on the spline densities into the Euclidean distance on the coefficients (see Lemmas 7.4 and 7.6).
Example 4.3 (Infinite dimensional normal model). Let $\mathcal P_{n,1}$ be the set of
(Thus a typical observation is an infinite sequence of independent normal variables with means $\theta_i$ and variances 1.) Equip it with the prior obtained by letting the $\theta_i$ be independent Gaussian variables with mean 0 and variances $i^{-(2q+1)}$. Take $\mathcal P_{n,2}$ equal to the submodel indexed by all $\theta$ with $\theta_i=0$ for all $i\ge 2$, equipped with a positive smooth (for instance Gaussian) prior on $\theta_1$. This model is equivalent to the signal plus Gaussian white noise model, for which Bayesian procedures were studied in Freedman [1999], Zhao [2000], Belitser and Ghosal [2003] and Ghosal and van der Vaart [2007a].
The Kullback-Leibler and squared Hellinger distances on $\mathcal P_{n,1}$ are, up to constants, essentially equivalent to the squared $\ell_2$-distance on the parameter $\theta$, when the distance is bounded (see e.g. Lemma 6.1 of Belitser and Ghosal [2003]). This allows us to verify (4.1) by calculations on Gaussian variables, after truncating the parameter set. For a sufficiently large constant $M$, consider sieves $\mathcal P'_{n,1}=\{\theta:\sum_i i^{2q'}\theta_i^2\le M\}$ for some $q' < q$, and $\mathcal P'_{n,2}=\{|\theta_1|\le M\}$, respectively. The posterior probabilities of the complements of these sieves are small in probability, respectively by Lemma 3.2 of Belitser and Ghosal [2003] and Lemma 7.2 of Ghosal et al. [2000]. Hence in view of Remark 2.1 it suffices to perform the calculations on $\mathcal P'_{n,1}$ and $\mathcal P'_{n,2}$. By Lemmas 6.3 and 6.4 of Belitser and Ghosal [2003] it follows that the conditions of (2) of Corollary 3.1 hold for $\varepsilon_{n,1}=\max\bigl(n^{-q'/(2q'+1)},n^{-q/(2q+1)}\bigr)=n^{-q'/(2q'+1)}$, provided that (4.1) can be verified. Now, for any
For $i\le n^{2q'/((2q+1)(2q'+1))}$ the argument in the normal distribution function is bounded above by 1, and then the corresponding factor is bounded by $2\Phi(1)-1 < 1$. It follows that the right side of the last display is bounded by a term of the order $e^{-c'n^{2q'/((2q+1)(2q'+1))}}$, for some positive constant $c'$. This easily shows that (4.1) is satisfied.
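The last step of the example uses only that a product of $m$ factors, each bounded by $2\Phi(1)-1$, decays exponentially in $m$. The following sketch evaluates this factor and the resulting bound; the function names `central_normal_mass` and `product_bound` are ours and are introduced only for illustration.

```python
import math

def central_normal_mass(t):
    """P(|Z| <= t) = 2*Phi(t) - 1 for a standard normal variable Z."""
    return math.erf(t / math.sqrt(2.0))

factor = central_normal_mass(1.0)   # the bound 2*Phi(1) - 1 on each factor
c = -math.log(factor)               # positive, because factor < 1

def product_bound(m):
    """Upper bound exp(-c*m) on a product of m factors each at most `factor`."""
    return math.exp(-c * m)
```

Taking $m$ of the order $n^{2q'/((2q+1)(2q'+1))}$ gives an exponential bound of the type used at the end of the example.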

Log Spline Models
Log spline density models, introduced in Stone [1990], are exponential families constructed as follows.
For a given "resolution" $K\in\mathbb N$ partition the half open unit interval $[0,1)$ into $K$ subintervals $[(k-1)/K,k/K)$ for $k=1,\ldots,K$. The linear space of splines of "order" $q\in\mathbb N$ relative to this partition is the set of all continuous functions $f:[0,1]\to\mathbb R$ that are $q-2$ times differentiable on $[0,1)$ and whose restriction to each of the partitioning intervals $[(k-1)/K,k/K)$ is a polynomial of degree strictly less than $q$. It can be shown that these splines form a $J=q+K-1$-dimensional vector space. A convenient basis is the set of B-splines $B_{J,1},\ldots,B_{J,J}$, defined e.g. in de Boor [2001]. The exact nature of these functions does not matter to us here, but the following properties are essential (cf. de Boor [2001], pp 109-110):
• $B_{J,j}\ge 0$, $j=1,\ldots,J$
• $\sum_{j=1}^J B_{J,j}\equiv 1$
• $B_{J,j}$ is supported on an interval of length $q/K$
• at most $q$ functions $B_{J,j}$ are nonzero at every given $x$.
The first two properties express that the basis elements form a partition of unity, and the third and fourth properties mean that their supports are close to being disjoint if K is very large relative to q.This renders the B-spline basis stable for numerical computation, and also explains the simple inequalities between norms of linear combinations of the basis functions and norms of the coefficients given in Lemma 7.2 below.
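The partition-of-unity and local-support properties can be checked numerically. The sketch below evaluates the B-splines by the Cox-de Boor recursion on an equally spaced partition with the boundary knots repeated $q$ times; this particular knot layout and the function name `bspline_basis` are assumptions made for the illustration and are not taken from de Boor [2001] or from the paper.

```python
import numpy as np

def bspline_basis(x, K, q):
    """Evaluate the J = q + K - 1 B-splines of order q on [0, 1) with K cells.

    Returns an array of shape (len(x), J). Assumes an open uniform knot
    vector: interior knots k/K and boundary knots repeated q times.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    interior = np.arange(1, K) / K
    t = np.concatenate([np.zeros(q), interior, np.ones(q)])
    # Order 1: indicators of the half open intervals [t_j, t_{j+1}).
    B = np.array([(t[j] <= x) & (x < t[j + 1]) for j in range(len(t) - 1)],
                 dtype=float).T
    # Cox-de Boor recursion from order k - 1 to order k.
    for k in range(2, q + 1):
        new = np.zeros((len(x), len(t) - k))
        for j in range(len(t) - k):
            left = t[j + k - 1] - t[j]
            right = t[j + k] - t[j + 1]
            if left > 0:
                new[:, j] += (x - t[j]) / left * B[:, j]
            if right > 0:
                new[:, j] += (t[j + k] - x) / right * B[:, j + 1]
        B = new
    return B

x = np.linspace(0.0, 0.999, 200)
B = bspline_basis(x, K=5, q=3)   # J = 3 + 5 - 1 = 7 basis functions
```

Each row of `B` holds the values of the basis functions at one point of $[0,1)$, so the row sums and the number of nonzero entries per row correspond to the second and fourth properties.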
For $\theta\in\mathbb R^J$ let $\theta^TB_J=\sum_j\theta_jB_{J,j}$ and define
$$p_{J,\theta}(x)=e^{\theta^TB_J(x)-c_J(\theta)},\qquad c_J(\theta)=\log\int_0^1 e^{\theta^TB_J(x)}\,dx.$$
Thus $p_{J,\theta}$ is a probability density that belongs to a $J$-dimensional exponential family with sufficient statistics the B-spline functions. Since the B-splines add up to unity, the family is actually of dimension $J-1$ and we can restrict $\theta$ to the subset of $\theta\in\mathbb R^J$ such that $\theta^T1=0$.
Splines possess excellent approximation properties for smooth functions, where the error is smaller if the function is smoother or the dimension of the spline space is higher. More precisely, a function $f\in C^\alpha[0,1]$ can be approximated with an error of order $(1/J)^\alpha$ by splines of order $q\ge\alpha$ and dimension $J$. Because there are $J-1$ free basis coefficients, the variance of a best estimate in a $J$-dimensional spline space can be expected to be of order $J/n$. Therefore, we may expect to determine an optimal dimension $J$ for a given smoothness level $\alpha$ from the bias-variance trade-off $J/n\sim(1/J)^{2\alpha}$. This leads to the dimension $J_{n,\alpha}\sim n^{1/(2\alpha+1)}$, and the "usual" rate of convergence $n^{-\alpha/(2\alpha+1)}$. This informal calculation was justified for maximum likelihood and Bayesian estimators in Stone [1990] and Ghosal et al. [2000], respectively. Stone [1990] showed that the maximum likelihood estimator of $p$ in the model $\{p_{J,\theta}:J=J_{n,\alpha},\theta\in\mathbb R^{J_{n,\alpha}},\theta^T1=0\}$ achieves the rate of convergence $n^{-\alpha/(2\alpha+1)}$ if the true density $p_0$ belongs to $C^\alpha[0,1]$. Ghosal et al. [2000] showed that a Bayes procedure with deterministic dimension $J_{n,\alpha}$ and a smooth prior on the coefficients $\theta\in\mathbb R^{J_{n,\alpha}}$ achieves the same (posterior) rate. (In both papers it is assumed that the true density is also bounded away from zero.)
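A minimal sketch of a log spline density in the simplest case $q=1$ may help to see why the family has only $J-1$ free parameters. For $q=1$ the B-splines are the indicators of the $J=K$ cells, and we assume the usual exponential-family form $p_{J,\theta}=\exp(\theta^TB_J-c(\theta))$ with $c(\theta)$ the logarithm of the norming integral. The function name `log_spline_density_order1` and the coefficient values are hypothetical.

```python
import numpy as np

def log_spline_density_order1(theta, x):
    """Log spline density of order q = 1 (piecewise constant) on [0, 1).

    theta: coefficient vector of length J; cell j is [j/J, (j+1)/J).
    """
    theta = np.asarray(theta, dtype=float)
    J = theta.size
    # c(theta) = log of the integral of exp(theta^T B_J); each cell has length 1/J.
    c = np.log(np.mean(np.exp(theta)))
    cell = np.minimum((np.asarray(x, dtype=float) * J).astype(int), J - 1)
    return np.exp(theta[cell] - c)

theta = np.array([0.3, -1.0, 0.7, 0.0])
midpoints = (np.arange(4) + 0.5) / 4
p = log_spline_density_order1(theta, midpoints)
p_shifted = log_spline_density_order1(theta + 2.5, midpoints)
```

Adding a constant to every coefficient leaves the density unchanged, which is the reason for restricting to $\theta^T1=0$.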
Both the maximum likelihood estimator and the Bayesian estimator described previously depend on $\alpha$. They can be made rate-adaptive to $\alpha$ by a variety of means. We shall consider several Bayesian schemes, based on different choices of priors $\bar\Pi_{n,\alpha}$ on the coefficients and $\lambda_n$ on the dimensions $J_{n,\alpha}$ of the spline spaces. Thus $\bar\Pi_{n,\alpha}$ will be a prior on $\mathbb R^{J_{n,\alpha}}$ for
$$J_{n,\alpha}=\lfloor n^{1/(2\alpha+1)}\rfloor, \tag{5.1}$$
the prior $\Pi_{n,\alpha}$ on densities will be the distribution induced under the map $\theta\mapsto p_{J,\theta}$, where $J=J_{n,\alpha}$, and $\lambda_n$ is a prior on the regularity parameter $\alpha$.
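The dimension $\lfloor n^{1/(2\alpha+1)}\rfloor$ and the rate $n^{-\alpha/(2\alpha+1)}$ are easily evaluated for given $n$ and $\alpha$. The following sketch does so and records the balance between variance and squared bias at the real-valued optimum; the function names `spline_dimension` and `rate` and the values of `n` and `alpha` are our own illustrative choices.

```python
import math

def spline_dimension(n, alpha):
    """Integer part of n^{1/(2*alpha+1)}.

    Floating-point rounding may matter when the power is an exact integer.
    """
    return math.floor(n ** (1.0 / (2.0 * alpha + 1.0)))

def rate(n, alpha):
    """The rate n^{-alpha/(2*alpha+1)} attached to smoothness alpha."""
    return n ** (-alpha / (2.0 * alpha + 1.0))

n, alpha = 10_000, 2.0
J = spline_dimension(n, alpha)
# Real-valued solution of the trade-off J/n = (1/J)^(2*alpha).
J_star = n ** (1.0 / (2.0 * alpha + 1.0))
variance_term = J_star / n
squared_bias_term = J_star ** (-2.0 * alpha)
```

At `J_star` the variance term and the squared bias term coincide and both equal the square of the rate.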
We always choose the order of the splines involved in the construction of the $\alpha$th log spline model to be at least $\alpha$.
We shall assume that the true density $p_0$ is bounded away from zero next to being smooth, so that the Hellinger and $L_2$-metrics are equivalent. We shall in fact assume that uniform upper and lower bounds are known, and construct the priors on sets of densities that are bounded away from zero and infinity. It follows from Lemma 7.3 that the latter is equivalent to restricting the coefficient vector $\theta$ in $p_{J,\theta}$ to a rectangle $[-M,M]^J$ for some $M$. We shall construct our priors on this rectangle, and assume that the true density $p_0$ is within the range of the corresponding spline densities, i.e. $\|\log p_0\|_\infty\le C_4M$ for the constant $C_4$ of Lemma 7.3. Extension to unknown $M$ through a second shell of adaptation is possible (see e.g. Lember and van der Vaart [2007]), but will not be pursued here.
In the next three sections we discuss three examples of priors. In the first example we combine smooth priors $\bar\Pi_{n,\alpha}$ on the coefficients with fixed model weights $\lambda_{n,\alpha}=\lambda_\alpha$. These natural priors lead to adaptation up to a logarithmic factor. Even though we only prove an upper bound, we believe that the logarithmic factor is not a defect of our proof, but connected to this prior. In the second example we show how the logarithmic factor can be removed by using special model weights $\lambda_{n,\alpha}$ that put less mass on small models. In the third example we show that the logarithmic factor can also be removed by using discrete priors $\Pi_{n,\alpha}$ for the coefficients combined with model weights that put more mass on small models. One might conclude from these examples that the fine details of rates depend on a delicate balance between model priors and model weights. The good news is that all three priors work reasonably well.
Proof. The dimension numbers $J_{n,\alpha}$ defined in (5.1) relate to the present rates $\varepsilon_{n,\alpha}:=n^{-\alpha/(2\alpha+1)}\sqrt{\log n}$ as $J_{n,\alpha}\log n\sim n\varepsilon_{n,\alpha}^2$. By Lemma 7.7 condition (2.1) is satisfied for any $\varepsilon_{n,\alpha}$ such that $n\varepsilon_{n,\alpha}^2\gtrsim J_{n,\alpha}$, and hence certainly for the present $\varepsilon_{n,\alpha}$. The constants $E_\alpha$ do not depend on $\alpha$, and hence both $E$ and $E$ in Theorem 2.1 can be taken equal to a single constant $E$.
Because $\|\log p_0\|_\infty < C_4M$ by assumption, the Hellinger distance of $p_0$ to $\mathcal P_J$ is bounded above by a multiple of $J^{-\beta}$ by Lemma 7.8. By Lemma 7.6 for $\varepsilon_{n,\beta}\gtrsim J_{n,\beta}^{-\beta}$, some constants $A$ and $A$, and sufficiently large $n$,
Because $\theta_J\in\Theta_J$ by its definition, the set $\{\theta\in\Theta_J:\|\theta-\theta_J\|_2\le\varepsilon\}$ contains at least a fraction $2^{-J}$ of the volume of the ball of radius $\varepsilon$ around $\theta_J$, even though it does not contain the full ball if $\theta_J$ is near the boundary of $\Theta_J$. It follows that, for any $\alpha,\beta\in A$ and $\varepsilon$, and $v_J$ the volume of the $J$-dimensional Euclidean ball,
for suitable constants $a$ and $a$, in view of Lemma 7.9. If $\alpha < \beta$, then with $\varepsilon=\varepsilon_{n,\alpha}$, inequality (5.2) yields
For sufficiently large $H$ the coefficient of $\log n$ is negative, uniformly in $\alpha\ge\alpha$. Condition (2.2) is easily satisfied for such $H$, with $\mu_{n,\alpha}=\mu_\alpha/\mu_\beta$ and arbitrarily small $L > 0$. If $\alpha < \beta$, then with $\varepsilon=IB\varepsilon_{n,\alpha}$, inequality (5.2) and similar calculations yield
By the same arguments as before, for sufficiently large $H$ the exponent is smaller than $-J_{n,\alpha}c\log n$ for a positive constant $c$, uniformly in $\alpha\ge\alpha$, eventually. This implies that (2.4) is fulfilled.
If $\alpha\gtrsim\beta$ the right side is bounded above by
for sufficiently large $i$, where $L$ may be an arbitrarily small constant. Hence condition (2.3) is fulfilled.
The theorem is a consequence of Theorem 2.1, with $A_n$ of this theorem equal to the present $A$.

Flat priors, decreasing weights
The constant weights $\lambda_{n,\alpha}=\mu_\alpha$ used in the preceding subsection resulted in an additional logarithmic factor in the rate. The following theorem shows that for $A=\{\alpha_1,\alpha_2,\ldots,\alpha_N\}$ a finite set, this factor can be removed by choosing the weights
$$\lambda_{n,\alpha}\propto\prod_{\gamma\in A:\,\gamma<\alpha}(C\varepsilon_{n,\gamma})^{J_{n,\gamma}}. \tag{5.3}$$
These weights are decreasing in $\alpha$, unlike the weights in (2.8). Thus the present prior puts less weight on the smaller, more regular models. We use the same priors $\Pi_{n,\alpha}$ on the spline models as in Section 5.1.
Similarly, again with $\alpha_r=\alpha < \beta=\alpha_s$,
This tends to 0 if $C > aIB$. Hence, for $C$ big enough, condition (2.4) is fulfilled as well.
Finally, choose $\alpha_r=\beta < \alpha=\alpha_s$ and note that
Here the exponent is of the order $J_{n,\alpha_r}\log(C/a)+o(1)\log i+o(1)$. We conclude that the condition (2.3) holds.
The theorem is a consequence of Theorem 2.1, with $A_n$ of this theorem equal to the present $A$.
The proof of Theorem 5.2 relies on the fact that $J_{n,\alpha_r}\ll J_{n,\alpha_s}$ if $s < r$, and for that reason it does not extend to the more general case of a dense set $A$ as considered in Theorem 5.1. On the other hand, it would be possible to extend the theorem to a countable totally ordered set $\alpha_1 < \alpha_2 < \cdots$ by using the weights (5.3) restricted to sets
The existence of rate-adaptive priors that yield the rate without log-factor in the general countable case is a subject for further research. There are some reasons to believe that this task is achievable with priors more elaborate than those in (5.3). For example, one could consider more general priors than (5.3) of the type
The truncation set $A_n$ as well as the constants $C_{\gamma,\alpha}$ and $\lambda_{\gamma,\alpha}$ must be carefully chosen.

Discrete priors, increasing weights
In this section we choose the priors $\Pi_{n,\alpha}$ to be discrete on a suitable subset of the $J_{n,\alpha}$-dimensional log spline densities, constructed as follows.
According to Kolmogorov and Tihomirov [1961] (cf. Theorem 2.7.1 in van der Vaart and Wellner [1996]) the unit ball $C_1^\alpha[0,1]$ of the Hölder space $C^\alpha[0,1]$ has entropy of the order $(1/\varepsilon)^{1/\alpha}$, as $\varepsilon\downarrow 0$, relative to the uniform norm $\|\cdot\|_\infty$. Then it follows that there exists a set of $N_{n,\alpha}$ functions $f_1,\ldots,f_{N_{n,\alpha}}$, with $\log N_{n,\alpha}\lesssim(M/\varepsilon_{n,\alpha})^{1/\alpha}$, such that every $f$ with Hölder norm smaller than a given constant $M$ is within uniform distance $\varepsilon_{n,\alpha}:=n^{-\alpha/(2\alpha+1)}$ of some $f_i$. These functions can without loss of generality be chosen with Hölder norm bounded by $M$. By the approximation properties of spline spaces (cf. Lemma 7.1), we can find
Define $\bar\Pi_{n,\alpha}$ to be the uniform probability distribution on the collection $\theta_1,\ldots,\theta_{N_{n,\alpha}}$.
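The choice $\varepsilon_{n,\alpha}=n^{-\alpha/(2\alpha+1)}$ matches the entropy order $(1/\varepsilon)^{1/\alpha}$ with $n\varepsilon^2$, which suggests why a uniform prior on such a net can carry mass of the order $e^{-Fn\varepsilon_{n,\alpha}^2}$ at each support point. The sketch below checks this identity numerically, assuming that the logarithm of the cardinality of the net is of the order $(M/\varepsilon_{n,\alpha})^{1/\alpha}$; the function name `entropy_and_target` and the default `M=1.0` are hypothetical.

```python
import math

def entropy_and_target(n, alpha, M=1.0):
    """Return ((M/eps)^(1/alpha), n*eps^2) for eps = n^(-alpha/(2*alpha+1))."""
    eps = n ** (-alpha / (2.0 * alpha + 1.0))
    entropy_order = (M / eps) ** (1.0 / alpha)
    target = n * eps ** 2
    return entropy_order, target

entropy_order, target = entropy_and_target(10_000, 2.0)
```

For $M=1$ both quantities equal $n^{1/(2\alpha+1)}$; a general $M$ only changes the entropy order by the constant factor $M^{1/\alpha}$.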
We combine the resulting prior Π n,α on log spline densities with model weights on the index set A = Q + of the form (2.8), where (µ α : α ∈ Q + ) is a strictly positive measure with α∈Q + √ µ α < ∞ and C is an arbitrary positive constant.
Theorem 5.3. If $p_0 \in C^\beta[0,1]$ for some $\beta \in A$ and $\|\log p_0\|_\beta < M$, then there exists a constant $B$ such that
Proof. By construction there exists an element $\theta_i$ in the support of $\Pi_{n,\beta}$ such that $\|\log p_0 - \theta_i^T B_{J_{n,\beta}}\|_\infty \lesssim \varepsilon_{n,\beta}$. It follows that the function $e^{\theta_i^T B_{J_{n,\beta}}}$ is sandwiched between the functions $p_0 e^{-d\varepsilon_{n,\beta}}$ and $p_0 e^{+d\varepsilon_{n,\beta}}$, for some constant $d$. Consequently, the norming constant satisfies $|e^{-c(\theta_i)} - 1| \lesssim \varepsilon_{n,\beta}$, and hence $\|p_0 - p_{J_{n,\beta},\theta_i}\|_\infty \lesssim \varepsilon_{n,\beta}$. Because $p_0$ is bounded away from zero and infinity, this implies that $p_{J_{n,\beta},\theta_i}$ is in the Kullback-Leibler neighbourhood $B_{n,\beta}(D\varepsilon_{n,\beta})$ of $p_0$, for some constant $D$ (cf. Lemma 8 in Ghosal and van der Vaart [2007b]). Because $\Pi_{n,\beta}$ is the uniform measure on the $N_{n,\beta}$ log spline densities of this type, it follows that Π for some positive constant $F$. In view of (2.8) it follows, for any $\varepsilon$, Define the sets of indices $\alpha < \beta$ and $\alpha \gtrsim \beta$ as in Theorem 2.1, relative to a given constant $H$. Thus $\alpha < \beta$ is equivalent to $J_{n,\alpha} > HJ_{n,\beta}$ and hence the sum over $\alpha < \beta$ of the preceding display can be bounded above by The leading term is $o(e^{-2n\varepsilon_{n,\beta}^2})$ provided $H$ is big enough, and the sum is bounded by assumption. Thus (2.4) is fulfilled for any constant $B$. Furthermore, condition (2.2) holds trivially with $\mu_{n,\alpha} = (\mu_\alpha/\mu_\beta)e^{-CJ_{n,\alpha}/2}$. Condition (2.3) clearly holds for sufficiently large $i$, with the same choice of $\mu_{n,\alpha}$ and any $L > 0$.
The theorem is a consequence of Theorem 2.1, with $A_n$ of that theorem equal to the present $A$.

Proof of the main theorems
We start by extending results from Ghosal and van der Vaart [2007b], Ghosal et al. [2000], Le Cam [1973] and Birgé [1983] on the existence of certain tests under local entropy of a statistical model. The results differ from the last three references by the inclusion of weights $\alpha$ and $\beta$; relative to Ghosal and van der Vaart [2007b] the difference is the use of local rather than global entropy.
Let $d$ be a metric which induces convex balls and is bounded above on $\mathcal{P}$ by the Hellinger metric $h$.
Lemma 6.1. For any dominated convex set of probability measures $\mathcal{P}$, and any constants $\alpha, \beta > 0$ and any $n$ there exists a test $\phi$ with
Proof. This follows by minor adaptation of a result of Le Cam [1986]. The essence is that, by the minimax theorem, Next we use the convexity of $\mathcal{P}$ to see that this is bounded above by (see Le Cam [1986] or Lemma 6.2 in Kleijn and van der Vaart [2006]) Finally we express the affinity $\int \sqrt{p_0}\sqrt{q}$ in terms of the Hellinger distance as $1 - \frac{1}{2}h^2(P_0, Q)$ and use the inequality $1 - x \le e^{-x}$, for $x > 0$.
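The last step of this proof can be illustrated numerically. The following sketch assumes discrete distributions on a finite set (a simplification, not the setting of the lemma) and uses hypothetical helper names. It checks the identity between the affinity and the squared Hellinger distance $h^2(P, Q) = \sum (\sqrt{p} - \sqrt{q})^2$, and the resulting exponential bound for $n$-fold products, whose affinity is the $n$-th power of the single-observation affinity.

```python
import math


def hellinger_sq(p, q):
    """Squared Hellinger distance h^2(P, Q) = sum (sqrt(p) - sqrt(q))^2 for discrete P, Q."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def affinity(p, q):
    """Hellinger affinity sum sqrt(p * q) for discrete P, Q."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


# Two illustrative probability vectors (arbitrary choices).
p0 = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

h2 = hellinger_sq(p0, q)
rho = affinity(p0, q)

if __name__ == "__main__":
    print("affinity:", rho, " 1 - h^2/2:", 1 - h2 / 2)
    for n in (1, 10, 100):
        # Affinity of the n-fold products versus the exponential bound exp(-n h^2 / 2).
        print(n, rho ** n, math.exp(-n * h2 / 2))
```

The bound $\rho^n \le e^{-nh^2/2}$ is what turns a fixed Hellinger separation into exponentially small testing errors.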
Corollary 6.1. For any dominated set of probability measures $\mathcal{P}$ with $d(P_0, \mathcal{P}) \ge 3\varepsilon$, any $\alpha, \beta > 0$ and any $n$ there exists a test $\phi$ with
Proof. Choose a maximal $2\varepsilon$-separated set $\mathcal{P}'$ of points in $\mathcal{P}$. Then the balls $B_{Q'}$ of radius $2\varepsilon$ centered at the points in $\mathcal{P}'$ cover $\mathcal{P}$, whence their number is bounded by $N(2\varepsilon, \mathcal{P}, d)$. Furthermore, these balls are convex by assumption, and are at distance $3\varepsilon - \varepsilon = 2\varepsilon$ from $P_0$. The latter is true both for the distance $d$ and the Hellinger distance, which is larger by assumption. For every ball $B_{Q'}$ attached to a point $Q' \in \mathcal{P}'$ there exists a test $\omega_{Q'}$ with the properties as in Lemma 6.1 with $\mathcal{P}$ taken equal to $B_{Q'}$. Let $\phi$ be the maximum of all tests attached in this way to some point $Q' \in \mathcal{P}'$. Then The right sides can be further bounded as desired.
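The covering step in this proof rests on a general fact: the balls of radius $\delta$ around a maximal $\delta$-separated subset cover the whole set. A minimal sketch of this fact follows, assuming a finite set of points on the real line with the absolute-value metric; the greedy routine `maximal_separated_subset` is a hypothetical helper and not part of the paper.

```python
import random


def maximal_separated_subset(points, delta, dist):
    """Greedily build a subset with pairwise distances >= delta that cannot be enlarged."""
    centers = []
    for x in points:
        # Keep x only if it is at distance >= delta from every center chosen so far.
        if all(dist(x, c) >= delta for c in centers):
            centers.append(x)
    return centers


def is_cover(points, centers, delta, dist):
    """Check that every point lies within distance delta of some center."""
    return all(any(dist(x, c) < delta for c in centers) for x in points)


random.seed(0)
pts = [random.random() for _ in range(500)]  # illustrative points in [0, 1]


def metric(a, b):
    return abs(a - b)


delta = 0.05
centers = maximal_separated_subset(pts, delta, metric)

if __name__ == "__main__":
    print(len(centers), is_cover(pts, centers, delta, metric))
```

A point that was rejected by the greedy pass lies within distance $\delta$ of an earlier center, which is exactly the maximality used in the proof.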
Lemma 6.2. Suppose that for a dominated set of probability measures $\mathcal{P}$, some nonincreasing function $\varepsilon \mapsto N(\varepsilon)$, some $\varepsilon_0 \ge 0$, and for every $\varepsilon > \varepsilon_0$, Then for every $\varepsilon > \varepsilon_0$ and every $\alpha, \beta > 0$ there exists a test $\phi$ (depending on $\varepsilon$ but not on $i$) such that for every $i \in \mathbb{N}$,
Proof. For $j \in \mathbb{N}$ let $\mathcal{P}_j = \{p \in \mathcal{P}: j\varepsilon < d(p, p_0) \le (j+1)\varepsilon\}$. Because the set $\mathcal{P}_j$ has distance $3(j\varepsilon/3)$ to $p_0$, the preceding corollary implies the existence of a test $\phi_j$ with
By Lemma 6.3 there exist events $A_n$ with probability By Fubini's theorem and the inequality $P_0(p/p_0) \le 1$, we have for every set $C$, Combining these inequalities with (6.6), (6.2) and (6.4) we see that
The third term on the right tends to zero by assumption (2.4), since We shall show that the first two terms on the right also tend to zero.
Because for $\alpha \gtrsim \beta_n$ and for $i \ge I \ge 3$ we have $S_{n,\alpha,i} \subset C_{n,\alpha}(\sqrt{2}\,iB\varepsilon_{n,\beta_n})$, assumption (2.3) shows that the first term is bounded by Because $\sum_{\alpha\in A_n} \sqrt{\mu_{n,\alpha}} \le \exp J_{n,\beta_n}$ by assumption, this tends to zero if $(K - 2L)I^2B^2 > 3$, which is assumed.
Similarly, for $\alpha < \beta_n$ the second term is bounded by, in view of (2.2), Here $J_{n,\alpha} > HJ_{n,\beta_n}$ for every $\alpha < \beta_n$, and hence this tends to zero, because again $(K - 2L)B^2I^2 > 3$.
Proof of Theorem 2.2. We follow the line of argument of the proof of Theorem 2.1, the main difference being that presently $J_{n,\beta_n} = 1$ and hence does not tend to infinity. To make sure that $P_0^n\phi_n$ is small we choose the constant $B$ sufficiently large, and to make $P_0^n(A_n)$ sufficiently large we apply Lemma 6.3 with $C$ a large constant instead of $C = 1$. This gives a factor $e^{-(1+C)J_{n,\beta_n}}$ instead of $e^{-2J_{n,\beta_n}}$ in the denominators of (6.7), but this is fixed for fixed $C$. The arguments then show that for an event $A_n$ with probability arbitrarily close to 1 the expectation $P_0^n\bigl[\Pi_n\bigl(d(p, p_0) > IB\varepsilon_{n,\beta_n} \mid X_1, \ldots, X_n\bigr)1_{A_n}\bigr]$ can be made arbitrarily small by choosing $I$ and $B$ sufficiently large. This proves the theorem.
Proof of Theorem 3.1. The second assertion of the theorem is an immediate consequence of Theorems 2.1 and 2.2. These theorems show that the posterior concentrates all its mass on balls of radius $BI\varepsilon_{n,\beta_n}$ or $I_n\varepsilon_{n,\beta_n}$ around $p_0$, respectively. Hence the posterior cannot charge any model that does not intersect these balls.
The first assertion can be proved using exactly the proof of Theorems 2.1 and 2.2, except that the references to $\alpha \gtrsim \beta_n$ can be omitted. In the notation of the proof of Theorem 2.1 we have that It follows that $P_0\bigl[\Pi_n\bigl(A_{n,<\beta_n} \mid X_1, \ldots, X_n\bigr)(1 - \phi_n)1_{A_n}\bigr]$ can be bounded by the sum of the second and third terms on the right side of (6.7), which tend to zero under the conditions of Theorem 2.1, and can be made arbitrarily small under the conditions of Theorem 2.2 by choosing $B$ and/or $I$ sufficiently large.
The first lemma in this list is the basic approximation lemma for splines and shows that splines of sufficient dimension are well suited to approximating smooth functions. Its proof can be found in de Boor [2001], p. 170. Lemmas 7.2-7.4 are (partly) implicit in Stone [1986, 1990] and can be explicitly found in Ghosal et al. [2000]. The equivalence of the $L_2$-norm or infinity-norm on the linear combinations of splines and the Euclidean or maximum norm on the coefficients (up to constants), given by Lemma 7.2, is a consequence of using the B-splines, with their special properties, as a basis. Lemma 7.5 is a consequence of the other lemmas; a proof can be found in Ghosal et al. [2003].
For given $M > 0$ let $\Theta_J = \{\theta \in [-M, M]^J: \theta^T 1 = 0, \|\theta\|_\infty \le M\}$, and write $\mathcal{P}_J$ for the set of functions $p_{J,\theta}$. By Lemma 7.3 the densities $p_{J,\theta}$ with $\theta \in \Theta_J$ take their values in the interval $[e^{-C_4M}, e^{C_4M}]$. In particular, they are uniformly bounded away from zero and infinity. Assume that the true density $p_0$ is also bounded away from 0 and infinity.
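The boundedness of these densities can be seen in a toy case. The sketch below assumes the simplest possible basis, the $J$ indicator functions of equal-width cells on $[0, 1]$ (B-splines of order 1), which is an assumption made here for illustration and not the basis required by the theorems. In this case $p_{J,\theta} = \exp(\theta^T B_J - c(\theta))$ with $c(\theta) = \log \int_0^1 e^{\theta^T B_J}$ is a histogram, and its values lie in $[e^{-2M}, e^{2M}]$, so the constant 2 plays the role of $C_4$ in this toy case only. The function name `log_spline_density` is hypothetical.

```python
import math
import random


def log_spline_density(theta):
    """Return (p, c) with p(x) = exp(theta_j - c) on the j-th of J equal cells of [0, 1]."""
    J = len(theta)
    # Norming constant: c(theta) = log of the integral of exp(theta^T B_J) over [0, 1].
    c = math.log(sum(math.exp(t) for t in theta) / J)

    def p(x):
        j = min(int(x * J), J - 1)  # index of the cell containing x
        return math.exp(theta[j] - c)

    return p, c


M = 1.5
J = 8
random.seed(1)
raw = [random.uniform(-M / 2, M / 2) for _ in range(J)]
mean = sum(raw) / J
theta = [t - mean for t in raw]  # enforces theta^T 1 = 0 and max |theta_j| <= M

p, c = log_spline_density(theta)
midpoints = [(j + 0.5) / J for j in range(J)]

if __name__ == "__main__":
    print("integral:", sum(p(x) for x in midpoints) / J)
    print("range:", min(p(x) for x in midpoints), max(p(x) for x in midpoints))
```

Because the density is constant on each cell, evaluating it at the cell midpoints and averaging gives its exact integral.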
Because the quotients $p_0/p_{J,\theta}$ are uniformly bounded above by $\exp(C_4M)$, there exists a constant $B \ge 1$, depending on $M$ only, such that $B_J(\varepsilon) \subset C_J(\varepsilon) \subset B_J(B\varepsilon)$. (7.1) (In fact $B$ a multiple of $M$ does; see e.g. Lemma 8 in Ghosal and van der Vaart [2007b].) In order to verify the conditions of the main theorems involving the sets $B_{n,\alpha}(\varepsilon)$ or $C_{n,\alpha}(\varepsilon)$, we can therefore restrict ourselves to the Hellinger balls $C_{n,\alpha}(\varepsilon)$. These Hellinger balls can themselves be related to Euclidean balls.
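Lemma 7.6. If $\theta_J$ minimizes the map $\theta \mapsto h(p_{J,\theta}, p_0)$ over $\Theta_J$ and $\varepsilon_J = h(p_0, p_{J,\theta_J})$, then there exist constants $\underline{F}$ and $\overline{F}$ such that $C_J(\varepsilon) \subset \{p_{J,\theta}: \theta \in \Theta_J, \underline{F}\|\theta - \theta_J\|_2 \le \sqrt{J}\,2\varepsilon\}$, $2\varepsilon < \underline{F}$, (7.2)
Proof. By Lemma 7.4 there exist constants $\underline{F} \le \overline{F}$, only depending on $M$, such that, for every $\theta \in \Theta_J$, (In fact multiples of $\underline{F} = e^{-C_4M}$ and $\overline{F} = e^{C_4M}$ will do.) The set $C_J(\varepsilon)$ is empty for $\varepsilon < \varepsilon_J$. Therefore, if $p_{J,\theta} \in C_J(\varepsilon)$, then $\varepsilon \ge \varepsilon_J$ and by the triangle inequality $h(p_{J,\theta}, p_{J,\theta_J}) \le 2\varepsilon$. If also $2\varepsilon < \underline{F}$, then the preceding display shows that $\underline{F}\|\theta - \theta_J\| \le \sqrt{J}\,2\varepsilon$. This, together with a similar argument for an inclusion in the other direction, yields the lemma.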