Adaptive Bayesian Density Estimation with Location-Scale Mixtures

Abstract: We study convergence rates of Bayesian density estimators based on finite location-scale mixtures of a kernel proportional to $\exp\{-|x|^p\}$. We construct a finite mixture approximation of densities whose logarithm is locally $\beta$-Hölder, with squared integrable Hölder constant. Under additional tail and moment conditions, the approximation is minimax for the Kullback-Leibler divergence. We use this approximation to establish convergence rates for a Bayesian mixture model with priors on the weights, locations, and the number of components. Regarding these priors, we provide general conditions under which the posterior converges at a near optimal rate, and is rate-adaptive with respect to the smoothness of the logarithm of the true density.


Introduction
When the number of components in a mixture model can increase with the sample size, it can be used for nonparametric density estimation. Such models were called mixture sieves by Grenander [15] and Geman and Hwang [7]. Although originally introduced in a maximum likelihood context, they have been the subject of a large number of Bayesian papers in recent years; among many others, see [25], [5], and [6]. Whereas much progress has been made regarding the computational problems in nonparametric Bayesian inference (see for example the review by Marin et al. [22]), results on convergence rates were found only recently, especially for the case when the underlying distribution is not a mixture itself. Also the approximation properties of mixtures needed in the latter case are not well understood.
In this paper we find conditions under which a probability density of any Hölder-smoothness can be efficiently approximated by a location-scale mixture. Using these results we then considerably generalize existing results on posterior convergence of location-scale mixtures. In particular our results are adaptive to any degree of smoothness, and allow for more general kernels and priors on the mixing distribution. Moreover, the bandwidth prior can be any inverse-gamma distribution, whose support neither has to be bounded away from zero, nor to depend on the sample size.
We consider location-scale mixtures of the type

$$m(x; k, \mu, w, \sigma) = \sum_{j=1}^{k} w_j\, \psi_\sigma(x - \mu_j), \qquad (1)$$

where $\sigma > 0$, $w_j \ge 0$, $\sum_{j=1}^{k} w_j = 1$, $\mu_j \in \mathbb{R}$ and, for $p \in \mathbb{N}$,

$$\psi_\sigma(x) = C_p\, \sigma^{-1} \exp\{-|x/\sigma|^p\}, \qquad (2)$$

with $C_p$ the normalizing constant.

Approximation theory (see for example [3]) tells us that for a compactly supported kernel and a compactly supported $\beta$-Hölder function, not necessarily nonnegative, the approximation error will be of order $k^{-\beta}$, provided $\sigma \sim k^{-1}$ and the weights are carefully chosen. This remains the case if both the kernel and the function to be approximated have exponential tails, as we consider in this work. If the function is a probability density however, this raises the question whether the approximation error $k^{-\beta}$ can also be achieved using nonnegative weights only. To our knowledge, this question has been little studied in the approximation theory literature.
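To make the mixture family concrete, the following Python sketch evaluates a finite location-scale mixture of the kernel proportional to $\exp\{-|x|^p\}$ and checks numerically that it integrates to one. It is an illustration only: the function names, the example parameter values and the midpoint-rule integrator are assumptions introduced here, and the normalizing constant is taken to be $C_p = 1/(2\Gamma(1 + 1/p))$, which follows from $\int \exp\{-|x|^p\}\,dx = 2\Gamma(1 + 1/p)$.

```python
import math


def kernel(x, sigma=1.0, p=2):
    """psi_sigma(x) = C_p / sigma * exp(-|x / sigma|^p), C_p = 1 / (2 * Gamma(1 + 1/p))."""
    c_p = 1.0 / (2.0 * math.gamma(1.0 + 1.0 / p))
    return c_p / sigma * math.exp(-abs(x / sigma) ** p)


def mixture_density(x, mu, w, sigma, p=2):
    """Finite location-scale mixture sum_j w_j * psi_sigma(x - mu_j)."""
    if len(mu) != len(w):
        raise ValueError("mu and w must have the same length")
    if any(wj < 0 for wj in w) or abs(sum(w) - 1.0) > 1e-12:
        raise ValueError("weights must be nonnegative and sum to one")
    return sum(wj * kernel(x - mj, sigma, p) for mj, wj in zip(mu, w))


def integrate(f, lo, hi, n=20000):
    """Midpoint rule on [lo, hi] with n cells (illustrative helper)."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


# Hypothetical example: three components, common bandwidth 0.4, kernel exponent p = 4.
mu = [-1.0, 0.5, 2.0]
w = [0.2, 0.5, 0.3]
total_mass = integrate(lambda x: mixture_density(x, mu, w, sigma=0.4, p=4), -10.0, 10.0)
```

All components share the single bandwidth `sigma`, as in the mixtures studied in this paper; only the locations and weights vary across components.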
Ghosal and Van der Vaart [13] approximate twice continuously differentiable densities with mixtures of Gaussians, but it is unclear if their construction can be extended to other kernels, or densities of different smoothness. In particular, for functions with more than two derivatives, the use of negative weights seems at first sight to be inevitable. A recent result by Rousseau [26] however does allow for nonnegative approximation of smooth but compactly supported densities by beta-mixtures. We will derive a similar result for location-scale mixtures of a kernel $\psi$ as in (2), for any $p \in \mathbb{N}$. Although the same differencing technique is used to construct the desired approximations, there are various differences. First, we are dealing with a noncompact support, which requires investigation of the tail conditions under which approximations can be established. Second, we are directly dealing with location-scale mixtures, hence there is no need for a 'location-scale mixture' approximation as in [26].
The posterior (or its mean) can be used as a Bayesian density estimator of $f_0$. Provided this estimator is consistent, it is then of interest to see how fast it converges to the Dirac mass at $f_0$. More precisely, let the convergence rate be a sequence $\epsilon_n$ tending to zero such that $n\epsilon_n^2 \to \infty$ and

$$\Pi\big(f : d(f, f_0) > M\epsilon_n \mid X_1, \ldots, X_n\big) \to 0 \qquad (3)$$

in $F_0^n$-probability, for some sufficiently large constant $M$, $d$ being the Hellinger or $L_1$-metric. The problem of finding general conditions for statistical models under which (3) holds has been studied in, among others, [11], [13], [32], [17], [8] and [29]. In all these papers, the complexity of the model needs to be controlled, typically by verifying entropy conditions, and at the same time the prior mass on Kullback-Leibler balls around $f_0$ needs to be lower bounded. It is for the latter condition that the need for good approximations arises. Our approximation result allows us to prove (3) with $\epsilon_n = n^{-\beta/(2\beta+1)} (\log n)^t$ for location-scale mixtures of the kernel $\psi$, provided $p$ is even and $f_0$ is locally Hölder and has tails bounded by $\psi$. The constant $t$ in the rate depends on the choice of the prior. We only consider priors independent of $\beta$, hence the posterior adapts to the unknown smoothness of $f_0$, which can be any $\beta > 0$. The adaptivity relies on the approximation result that allows us to approximate $f_0$ with $f_1 * \psi$, for a density $f_1$ that may be different from $f_0$. In previous work on density estimation with finite location-scale mixtures (see e.g. [27], [8] and [13]) $f_0$ is approximated with $f_0 * \psi$, which only gives minimax rates for $\beta \le 2$.
For regression models based on location-scale mixtures, fully adaptive posteriors have recently been obtained by De Jonge and Van Zanten [2]; their work was written at the same time as, and independently of, the present work. For continuous beta-mixtures, (near-)optimal rates have been derived by Rousseau [26]. Another related work is [28], where kernels of type (2) are also studied; however, it is assumed there that the true density is a mixture itself. In a clustering and variable selection framework using multivariate Gaussian mixtures, Maugis and Michel [23] give non-asymptotic bounds on the risk of a penalized maximum likelihood estimator. Finally, for a general result on consistency of location-scale mixtures, see [31].
is defined on $(0, C_p]$. When $\sigma = 1$ we also write For any function $h$, let $K_\sigma h$ denote the convolution $h * \psi_\sigma$, and let $\Delta_\sigma h$ denote the error $(K_\sigma h) - h$. The $(k-1)$-dimensional unit-simplex and the $k$-dimensional bounded quadrant are denoted ( Given $\epsilon > 0$ and fixed points $x \in \mathbb{R}^k$ and $y \in \Delta_k$, define the $l_1$-balls Inequality up to a multiplicative constant is denoted with $\lesssim$ and $\gtrsim$ (for $\lesssim$ we also use $O$). The number of integer points in an interval $I \subset \mathbb{R}$ is denoted $N(I)$. Integrals of the form $\int g\, dF_0$ are also denoted $F_0 g$.

Main results
We now state our conditions on $f_0$ and the prior. Note that some of them will not be used in some of our results. For instance in Theorem 1 below, (C3) is not required.

Conditions on $f_0$. The observations $X_1, \ldots, X_n$ are an i.i.d. sample from a density $f_0$ satisfying the following conditions.

(C1) Smoothness. $\log f_0$ is assumed to be locally $\beta$-Hölder, with derivatives $l_j(x) = \frac{d^j}{dx^j} \log f_0(x)$. We assume the existence of a polynomial $L$ and a constant $\gamma > 0$ such that for all $x, y$ with $|y - x| \le \gamma$.

(C2) Tails. There exists $\epsilon > 0$ such that the functions $l_j$ and $L$ satisfy and there exist constants $\alpha > 2$, $T > 0$ and $c > 0$ such that when $|x| > T$,

$$f_0(x) \le c\,|x|^{-\alpha}. \qquad (8)$$

imsart-generic ver. 2008/08/29 file: lsMixtures3.tex date: July 1, 2010

(C3) A stronger tail condition: $f_0$ has smaller tails than the kernel, i.e. there exist constants $T$ and $M_{f_0}$ such that

$$f_0(x) \le M_{f_0}\, \psi(x), \qquad |x| > T. \qquad (9)$$

(C4) Monotonicity. $f_0$ is strictly positive, and there exist $x_m < x_M$ such that $f_0$ is nondecreasing on $(-\infty, x_m)$ and nonincreasing on $(x_M, \infty)$. Without loss of generality we assume that $f_0(x_m) = f_0(x_M) = c$ and that $f_0(x) \ge c$ for all $x_m < x < x_M$. The monotonicity in the tails implies that $K_\sigma f_0 \gtrsim f_0$; see the remark on p. 149-150 in [9].
Assumption (C3) is only needed in the proofs of Lemma 4 and Theorem 2.
We can now state the approximation result which will be the main ingredient in the proof of Theorem 2, but which is also interesting in its own right.
Theorem 1. Let $f$ be a density satisfying conditions (C1), (C2) and (C4), and let $K_\sigma$ denote convolution with the kernel $\psi$ defined in (2), for any $p \in \mathbb{N}$. Then there exists a density $h_k$ such that for all small enough $\sigma$,

$$\int f \log\frac{f}{K_\sigma h_k} = O(\sigma^{2\beta}), \qquad \int f \Big(\log\frac{f}{K_\sigma h_k}\Big)^2 = O(\sigma^{2\beta}). \qquad (10)$$

The construction of the approximation $h_k$ is detailed in Section 3. As our smoothness condition is only local, the class of densities satisfying (C1), (C2) and (C4) is quite large. In particular, all (log-)spline densities are permitted, provided they are sufficiently differentiable at the knots. Condition (7) rules out superexponential densities like $\exp\{-\exp\{x^2\}\}$. In fact the smallest possible $L(x)$ such that (6) holds does not have to be of polynomial form, but in that case it should be bounded by some polynomial $L$ for which (7) holds. Note that when $\beta = 2$, L is an upper bound for 2 , and apart from the additional $\epsilon$ in (7), this assumption is equivalent to the assumption in [13] that $F_0 (f_0''/f_0)^2$ and $F_0 (f_0'/f_0)^4$ be finite. Also the monotonicity condition can be weakened, as in fact it suffices to have an upper and lower bound on $f_0$ for which (C4) holds. For clarity of presentation however we assume monotonicity of $f_0$ itself.
We now describe the family of priors we consider to construct our estimate.
Prior ($\Pi$). The prior on $\sigma$ is the inverse-gamma distribution with scale parameter $\lambda > 0$ and shape parameter $\alpha > 0$, i.e. $\sigma$ has prior density $\frac{\lambda^\alpha}{\Gamma(\alpha)} x^{-(\alpha+1)} e^{-\lambda/x}$ and $\sigma^{-1}$ has the Gamma density $\frac{\lambda^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\lambda x}$. The other parameters have a hierarchical prior, where the number of components $k$ is drawn first, and given $k$ the locations $\mu$ and weights $w$ are independent. The priors on $k$, $\mu$ and $w$ satisfy the conditions (11)-(14) below.
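The prior on $k$ is such that for all integers $k > 0$

$$B_0\, e^{-b_0 k \log^r k} \le \Pi(k) \le B_1\, e^{-b_1 k \log^r k}, \qquad (11)$$

for some positive constants $B_0$, $B_1$, $b_0$, $b_1$ and some $r \ge 0$.

Given $k$, the locations $\mu_1, \ldots, \mu_k$ are drawn independently from a prior density $p_\mu$ on $\mathbb{R}$ satisfying

$$p_\mu(x) \gtrsim \psi(x), \qquad (12)$$

$$p_\mu(x) \lesssim e^{-a_1 |x|^{a_2}}, \qquad (13)$$

for constants $a_1 > 0$ and $a_2 \le p$.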
Given $k$, the prior distribution of the weight vector $w = (w_1, \ldots, w_k)$ is independent of $\mu$, and there is a constant $d_1$ such that for $\epsilon < \frac{1}{k}$, and for some nonnegative constant $b$, which affects the logarithmic factor in the convergence rate.
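One draw from a prior of this hierarchical form can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the paper: the geometric prior on $k$, the location density proportional to $\exp\{-a_1|x|^{a_2}\}$, the flat Dirichlet prior on the weights, and all default hyperparameter values are hypothetical choices. Whether a particular choice meets the stated conditions (for instance $a_2 \le p$) has to be checked separately.

```python
import random


def sample_prior(rng, alpha=2.0, lam=1.0, q=0.3, a1=1.0, a2=2.0):
    """Draw (k, mu, w, sigma) from one hypothetical prior of the hierarchical form."""
    # Bandwidth: 1/sigma ~ Gamma(shape=alpha, rate=lam), so sigma is inverse-gamma.
    # random.gammavariate takes (shape, scale), hence scale = 1 / lam.
    sigma = 1.0 / rng.gammavariate(alpha, 1.0 / lam)

    # Number of components: geometric on {1, 2, ...}, P(k) = q * (1 - q)^(k - 1).
    k = 1
    while rng.random() > q:
        k += 1

    # Locations: density proportional to exp(-a1 * |x|^a2).
    # If X has this density, then a1 * |X|^a2 ~ Gamma(1 / a2, 1), with a random sign.
    mu = []
    for _ in range(k):
        g = rng.gammavariate(1.0 / a2, 1.0)
        sign = 1.0 if rng.random() < 0.5 else -1.0
        mu.append(sign * (g / a1) ** (1.0 / a2))

    # Weights: Dirichlet(1, ..., 1), obtained by normalizing independent Gamma(1, 1) draws.
    gammas = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(gammas)
    w = [g / total for g in gammas]
    return k, mu, w, sigma


rng = random.Random(0)
k, mu, w, sigma = sample_prior(rng)
```

Note that the prior on $\sigma$ does not depend on the sample size, and that a single $\sigma$ is shared by all $k$ components.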
Theorem 2. Let the bandwidth $\sigma$ be given an inverse-gamma prior, and assume that the prior on the weights and locations satisfies conditions (11)-(14). Given a positive even integer $p$, let $\psi$ be the kernel defined in (2), and consider the family of location-scale mixtures defined in (1), equipped with the prior described above. If $f_0$ satisfies conditions (C1)-(C4), then $\Pi(\cdot \mid X_1, \ldots, X_n)$ converges to $f_0$ in $F_0^n$-probability, with respect to the Hellinger or $L_1$-metric, with rate $\epsilon_n = n^{-\beta/(1+2\beta)} (\log n)^t$, where $r$ and $b$ are as in (11) and (14),

The proof is based on Theorem 5 of Ghosal and van der Vaart [13], which is included here in Appendix A.
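To get a feeling for the rate $n^{-\beta/(1+2\beta)} (\log n)^t$, the short Python sketch below evaluates it for a few smoothness levels. The function name and the value of $t$ used in the example are hypothetical; the admissible range of $t$ depends on the prior and is not encoded here.

```python
import math


def posterior_rate(n, beta, t):
    """Evaluate n^(-beta / (1 + 2 * beta)) * (log n)^t for sample size n > 1."""
    return n ** (-beta / (1.0 + 2.0 * beta)) * math.log(n) ** t


# Hypothetical illustration: t = 1, sample size one million, three smoothness levels.
rates = {beta: posterior_rate(10 ** 6, beta, t=1.0) for beta in (1.0, 2.0, 4.0)}
```

For fixed $n$ and $t$ the rate improves as $\beta$ grows, and the polynomial part approaches the parametric order $n^{-1/2}$ as $\beta \to \infty$.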
Condition (11) is usual in finite mixture models, see for instance [10], [20] and [26] for beta-mixtures. It controls both the approximating properties of the support of the prior and its entropy. For a Poisson prior, we have $r = 1$ and for a geometric prior $r = 0$.
Conditions (12) and (14) translate the general prior mass condition (41) in Theorem 3 to conditions on the priors for $\mu$ and $w$. The prior is to put enough mass near $\mu_0$ and $w_0$, which are the locations and weights of a mixture approximating $f_0$. Since $\mu_0$ and $w_0$ are unknown, the conditions in fact require that there is a minimal amount of prior mass around all their possible values. The restriction to kernels with even $p$ in Theorem 2 is needed to discretize the approximation $h_k$ obtained from Theorem 1. Results on minimax rates for Laplace mixtures ($p = 1$) (see [18]) suggest that this assumption is in fact necessary. Note that [2] and [28] also require analytic kernels.

Approximation of smooth densities
In many statistical problems it is of interest to bound the Kullback-Leibler divergence $D_{KL}(f_0, m) = \int f_0 \log \frac{f_0}{m}$ between $f_0$ and densities contained in the model under consideration, in our case finite location-scale mixtures $m$. When $\beta \le 2$, the usual approach to find an $m$ such that $D_{KL}(f_0, m) = O(\sigma^{2\beta})$ is to discretize the continuous mixture $K_\sigma f_0$, and show that $\|K_\sigma f_0 - m\|_\infty$ and $\|f_0 - K_\sigma f_0\|_\infty$ are both $O(\sigma^\beta)$. Under additional assumptions on $f_0$, this then gives a KL-divergence of $O(\sigma^{2\beta})$. But as $\|f_0 - K_\sigma f_0\|_\infty$ remains of order $\sigma^2$ when $\beta > 2$, this approach appears to be inefficient for smooth $f_0$. In this section we propose an alternative mixing distribution $\tilde f_0$ such that $D_{KL}(f_0, K_\sigma \tilde f_0) = O(\sigma^{2\beta})$. To do so, we first construct a not necessarily positive function $f_k$ such that under a global Hölder condition, $\|f_0 - K_\sigma f_k\|_\infty = O(\sigma^\beta)$. However, as we only assume the local Hölder condition (C1), the approximation error of $O(\sigma^\beta)$ will in fact include the local Hölder constant, which is made explicit in Lemma 1. Modifying $f_k$ we obtain a density which still has the desired approximation properties (Lemma 2). Using this result we then prove Theorem 1. Finally we prove that the continuous mixture can be approximated by a discrete mixture (Lemmas 3 and 4). In the remainder of this section, we write $f$ instead of $f_0$ for notational convenience, unless stated otherwise.
To illustrate the problem that arises when approximating a smooth density $f$ with its convolution $K_\sigma f$, let us consider a three times continuously differentiable density $f$ such that , where $\nu_2$ is defined as in (4). Although the regularity of $f$ is larger than two, the approximation error remains of order $\sigma^2$. The following calculation illustrates how this can be improved if we take Likewise, the error is $O(\sigma^\beta)$ when $f$ is of Hölder regularity $\beta \in (2, 4]$. When $\beta > 4$, this procedure can be repeated, yielding a sequence Once the approximation error $O(\sigma^\beta)$ is achieved with a certain $f_k$, the approximation clearly does not improve any more for $f_j$ with $j > k$. In the context of a fixed $\beta > 0$ and a density $f$ of Hölder regularity $\beta$, $f_k$ will be understood as the first function in the sequence $\{f_i\}_{i \in \mathbb{N}}$ for which an error of order $\sigma^\beta$ is achieved, i.e. $k$ is such that $\beta \in (2k, 2k+2]$. The construction of the sequence $\{f_i\}_{i \in \mathbb{N}}$ is related to the use of superkernels in kernel density estimation (see e.g. [30] and [4]), or to the twicing kernels used in econometrics (see [24]). However, instead of finding a kernel In Lemma 11 in Appendix B we show that for any $\beta > 0$, In Theorems 1 and 2 however we have instead the local Hölder condition (C1) on $\log f$, along with the tail and monotonicity conditions (C2) and (C4). With only a local Hölder condition, the approximation error will depend in some way on the local Hölder constant $L(x)$ as well as the derivatives $l_j(x)$ of $\log f$. This is made explicit in the following approximation result, whose proof can be found in Appendix C. A similar result for beta-mixtures is contained in Theorem 3.1 in [26].
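The gain from replacing $f$ by a corrected mixing function can be checked numerically in a case where everything is available in closed form. The sketch below is an illustration under explicit assumptions: the kernel exponent is $p = 2$ (so $\psi_\sigma$ is the $N(0, \sigma^2/2)$ density), $f$ is the standard normal density, and the first corrected function is assumed to be the twicing-type choice $f_1 = f - \Delta_\sigma f = 2f - K_\sigma f$. Then $K_\sigma f$ is the $N(0, 1 + \sigma^2/2)$ density and $K_\sigma f_1 = 2K_\sigma f - K_\sigma^2 f$, with $K_\sigma^2 f$ the $N(0, 1 + \sigma^2)$ density.

```python
import math


def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def sup_errors(sigma, grid):
    """Sup-norm errors of K_sigma f and K_sigma f_1 as approximations of f = N(0, 1)."""
    s = sigma ** 2 / 2.0  # variance of psi_sigma when p = 2
    err_plain = max(abs(normal_pdf(x, 1.0 + s) - normal_pdf(x, 1.0)) for x in grid)
    err_corrected = max(
        abs(2.0 * normal_pdf(x, 1.0 + s) - normal_pdf(x, 1.0 + 2.0 * s) - normal_pdf(x, 1.0))
        for x in grid
    )
    return err_plain, err_corrected


grid = [i / 100.0 for i in range(-600, 601)]
plain_a, corrected_a = sup_errors(0.2, grid)
plain_b, corrected_b = sup_errors(0.1, grid)

# Halving sigma should divide an O(sigma^2) error by about 4 and an O(sigma^4) error by about 16.
ratio_plain = plain_a / plain_b
ratio_corrected = corrected_a / corrected_b
```

In this example the plain convolution error behaves like $\sigma^2$ while the corrected one behaves like $\sigma^4$, even though $f_1$ itself need not be a density.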
Lemma 1.Given β > 0, let f be a density satisfying condition (C1), for any possible function L, not necessarily polynomial.Let k be such that β ∈ (2k, 2k + 2], and let f k be defined as in (15).Then for all sufficiently small σ and for all x contained in the set where H > 0 can be chosen arbitrarily large and for nonnegative constants r i .
Compared to the uniform result that can be obtained under a global Hölder condition (Lemma 11 in Appendix B) the approximation error The good news however, is that on a set on which the $l_j$'s are sufficiently controlled, it is also relative to $f(x)$, apart from a term $\sigma^H$ where $H$ can be arbitrarily large. Note that no assumptions were made regarding $L$, but obviously the result is only of interest when $L$ is known to be bounded in some way. In the remainder we require $L$ to be polynomial.

Since $K_\sigma f_j$ is a density when $f_j$ is a density, we have that for any nonnegative integer $j$ ($f_0$ denoting the density $f$ itself) $f_j$ integrates to one. For $j > 0$ the $f_j$'s are however not necessarily nonnegative. To obtain a probability density, we define The constant $\frac{1}{2}$ in (19) and (20) is arbitrary and could be replaced by any other number between zero and one. In the following lemma, whose proof can be found in Appendix D, we show that the normalizing constant $g_k$ is $1 + O(\sigma^\beta)$. For this purpose, we first control integrals over the sets $A_\sigma$ defined in (16) and for a sufficiently large constant $H_1$.
Lemma 2. Let f be a density satisfying conditions (C1),(C2) and (C4).Then for all small enough σ and all nonnegative integers m and all K > 0, provided that H 1 in ( 22) is sufficiently large.Furthermore, A σ ∩ E σ ⊂ J σ,k for small enough σ.Consequently, Finally, when β > 2, and f k is defined as in Lemma 1 and h k as in (21), for all x ∈ A σ ∩ E σ , i.e. in (17) we can replace f k by h k , provided we assume that x is also contained in E σ .
Remark 1. From (20), (21) and (24) it follows that The fact that $K_\sigma f$ is lower bounded by a multiple of $f$ then implies that the same is true for $K_\sigma h_k$.
Remark 2. The integrals over $A_\sigma^c$ in (23) can be shown to be $O(\sigma^{2\beta})$ using only conditions (C1) and (C2), whereas for the integrals over $E_\sigma^c$ condition (C4) is also required.
Using this result we can now prove Theorem 1: for any densities p and q and any set S, we have the bound The first integral on the right can be bounded by application of ( 25) and Remark 1 following Lemma 2. On A σ ∩E σ the integrand is bounded by  H1) , as f (x) ≥ σ H1 on E σ and the Lebesgue measure of this interval is at most σ −H1 .To bound the second integral in (26) we use once more that K σ h k f , and then apply (23) with m = 0.For the last integral we use (23) with m = 0, . . ., k + 1; recall that h k is a linear combination of K m σ f , m = 0, . . ., k.The second integral in (10) is bounded by ) by the same arguments.
The continuous mixture approximation of Theorem 1 is discretized in Lemma 4 below. Apart from the finite mixture derived from $h_k$ we also need to construct a set of finite mixtures close to it, such that this entire set is contained in a KL-ball around $f$. For this purpose the following lemma is useful. A similar result can be found in Lemma 5 of [13]. The inequality for the $L_1$-norm will be used in the entropy calculation in the proof of Theorem 2.
Proof.Let 1 ≤ i ≤ k and assume that wi ≤ w i .By the triangle inequality, for any norm.We have the following inequalities: To prove (27), let σ = z −1 and σ = z−1 , and for fixed x define the function g x : z → zψ(zx).By assumption, d dz g x (z) = ψ(zx) + zxψ ′ (zx) is bounded, and Applying the mean value theorem to ψ itself, the last inequality is obtained.
The approximation $h_k$ defined by (21) can be discretized such that the result of Lemma 1 still holds. The discretization relies on Lemma 3.13 in [19], which is included in Appendix F. As in [2] and [28], we require the kernel $\psi$ to be analytic, i.e. $p$ needs to be even.

Lemma 4. Let the constant $H_1$ in the definition of $E_\sigma$ be at least $4(\beta + p)$. Given $\beta > 0$, let $f$ be a density that satisfies conditions (C1)-(C4) and for $p = 2, 4, \ldots$ let $\psi$ be as in (2). Then there exists a finite mixture $m = m(\cdot; k_\sigma, \mu_\sigma, w_\sigma, \sigma)$ Furthermore, (28) holds for all mixtures

The proof can be found in Appendix E. A discretization assuming only (C1), (C2) and (C4) could be derived similarly, but to have sufficient control of the number of components in Theorem 2, we make the stronger assumption (C3) of exponential tails. Together with the monotonicity condition (C4) this implies the existence of a finite constant $c_f$ such that for all sufficiently small $\epsilon$, The constant $c_f$ depends on $f$ through the constant $M_f$ in (9). This property is used in the proof of Lemma 4.

The proof of Theorem 2
We first state a lemma needed for the entropy calculations.
Proof. A proof of (30) can be found in [11]; the other result follows from a volume argument. For 2 for all $i$. Because for each coordinate we have the bounds

Proof of Theorem 2. The proof is an application of Theorem 3 (Theorem 5 in [13], stated below in Appendix A), with sequences $\epsilon_n = n^{-\beta/(1+2\beta)} (\log n)^{t_1}$ and $\bar\epsilon_n = n^{-\beta/(1+2\beta)} (\log n)^{t_2}$, where $t_1$ and $t_2 \ge t_1$ are determined below. Let $k_n$ be the number of components in Lemma 4 when The number of components is their locations being contained in the set $E_{\sigma_n}$ defined in (22). By the same lemma there are $l_1$-balls It now suffices to lower bound the prior probability on having $k_n$ components and on $B_n$, $\Delta(n)$ and which is larger than $\exp\{-n\epsilon_n^2\}$ for any choice of $t_1 \ge 0$. Condition (11) gives a lower bound of $B_0 \exp\{-b_0 k_n \log^r k_n\}$ on $\Pi(k_n)$, which is larger than $\exp\{-n\epsilon_n^2\}$ when $(2 + \beta^{-1}) t_1 > 1 + p^{-1} + r$. Given that there are $k_n$ components, condition (14) gives a lower bound on $\Pi(\Delta(n))$, which is larger than $\exp\{-n\epsilon_n^2\}$ when $(2 + \beta^{-1}) t_1 > 2 + b + p^{-1}$. The required lower bound for $\Pi(B_n)$ follows from (9) and the fact that the locations $\mu_1, \ldots, \mu_{k_n}$ are independent with prior density $p_\mu$ satisfying (12). The 'target' mixture given by Lemma 4 has location vector $\mu^{(n)}$, whose elements are contained in $E_{\sigma_n}$. By (9), $E_{\sigma_n}$ is contained in the interval $T_{c_f, \epsilon}$ defined in (29), with $\epsilon = \sigma_n^{H_1}$. Since $p_\mu \gtrsim \psi$, $p_\mu$ is lower bounded by a multiple of $\sigma_n^{c_f^p H_1}$ at the boundaries of this interval. Consequently, for all $i = 1, \ldots, k_n$, As the $l_1$-ball B kn (µ for some constant $d > 0$.
Combining the above results it follows that Π(KL(f 0 , ǫ n )) ≥ exp{−n ǫ 2 n } when t 1 > (2 + b + p −1 )/(2 + β −1 ).We then have to find sets F n such that (40) and (42) hold.For r n = n 1 1+2β (log n) tr (rounded to the nearest integer) and a polynomially increasing sequence b n such that b a2 n > n 1/(1+2β) , with a 2 as in ( 13), we define The bandwidth σ is contained in S n = (σ n , σn ], where σ n = n −A and σn = exp{n ǫ 2 n (log n) δ }, for arbitrary constants A > 1 and δ > 0. An upper bound on Π(S c n ) can be found by direct calculation, for example Hence Π(S c n ) ≤ e −cn ǫ 2 n for any constant c, for large enough n.The prior mass on mixtures with more than r n support points is bounded by a multiple of exp{−b 1 k n log rn k n }.The prior mass on mixtures with at least one support point outside [−b n , b n ] is controlled as follows.By conditions ( 11) and ( 13), the probability that a certain µ Since the prior on k satisfies (11), k clearly has finite expectation.Consequently, (34) implies that Combining these bounds, we find The right hand side decreases faster than e −n ǫ 2 n if t r + r > 2t To control the sum in (40), we partition F n using An upper bound on the prior probability on the F n,j is again found by direct calculation: As the L 1 -distance is bounded by the Hellinger-distance, condition (40) only needs to be verified for the L 1 -distance.We further decompose the F n,j 's and write , (s n,j−1 , s n,j ], l 1 .
Lemma 5 provides the following bounds: For some constant C, we find that If b n ≥ ǭn s n,j−1 , we have (1 + ǫ n ) −j ≥ ǭnσ n bn(1+ ǫn) , and the last exponent in (36) is bounded by −λb −1 n ǭn /(1 + ǫ n ).A combination of (36), (37) and Stirling's bound on r n ! then imply that Π(F n,j ) N (ǭ n , F n,j , d) is bounded by a multiple of for certain constants C, K 0 and K 1 .If b n < ǭn s n,j−1 we obtain similar bound but with an additional factor ǭ−rn/2 , where the factor (1 + ǫ n ) (j−1)rn/2 cancels out with (1 + ǫ n ) −(j−1)rn/2 on the third line of the above display.There is however a remaining factor (1 Hence the convergence rate is

Examples of priors on the weights
Condition (14) on the weights-prior is known to hold for the Dirichlet distribution. We now address the question whether it also holds for other priors. Alternatives to Dirichlet priors are increasingly popular, see for example [16]. In this section two classes of priors on the simplex are considered. In both cases the Dirichlet distribution appears as a special case. The proof of Theorem 2 requires lower bounds for the prior mass on $l_1$-balls around some fixed point in the simplex. These bounds are given in Lemmas 6 and 8 below.
Since a normalized vector of independent gamma distributed random variables is Dirichlet distributed, a straightforward generalization is to consider random variables with an alternative distribution on R + .Given independent random variables Y 1 , . . ., Y k with densities f i on [0, ∞), define a vector X with elements The corresponding density is where We obtain a result similar to lemma 8 in [13].Lemma 6.Let X 1 , . . ., X k have a joint distribution with a density of the form (39). Assume there are positive constants c 1 (k), c 2 (k) and c 3 such that for i Then there are constants c and C such that for all y ∈ ∆ k and all ǫ ≤ Proof.As in [13] it is assumed that y k ≥ k −1 .Define δ i = max(0, y i − ǫ 2 ) and δi = min(1, Since all x i in (39) are at most one, Because ), there are constants c and C for which this quantity is lower-bounded by Ce −ck log( 1 ǫ ) .
Alternatively, the Dirichlet distribution can be seen as a Polya tree.Following Lavine [21] we use the notation for some integer m, and the coordinates are indexed with binary vectors ǫ ∈ E m .A vector X has a Polya tree distribution if where U δ , δ ∈ E m−1 * is a family of beta random variables with parameters (α δ1 , α δ2 ), δ ∈ E m−1 * .We only consider symmetric beta densities, for which α δ = α δ1 = α δ2 .Adding pairs of coordinates, lower dimensional vectors X δ can be defined for δ Then for all y ∈ ∆ 2 m and η > 0, Proof.For all i = 1, . . ., m and δ Hence, With δ ∈ E i−1 fixed, we can lower-bound P ) for various values of the α δ .In the remainder we will assume that α δ = α i , for all δ ∈ E i−1 , with i = 1, . . ., m.For increasing α i ≥ 1, U δ has a unimodal beta-density, and without loss of generality we can assume the most unfavorable case, i.e. when y δ0 y δ = 0.If the α i are decreasing, and smaller than one, this is when y δ0 y δ = 1 2 .In both cases Lemma 9 in appendix A is used to lower bound the normalizing constant of the beta-density.
At the ith level there are 2 i−1 independent variables U δ with the Beta(α i , α i ) distribution, and therefore We have the following application of these results.
Lemma 8. Let X m δ be Polya distributed with parameters α i .If for some constants c and C. By a straightforward calculation one can see that this result is also valid for b = 0.In the Dirichlet case in accordance with the result in [11].
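The Polya tree construction on the simplex with $2^m$ coordinates can be sketched in a few lines of Python. This is an illustrative sampler under assumptions made here: each node at level $i$ splits its mass with an independent symmetric Beta($\alpha_i$, $\alpha_i$) variable, the coordinates are ordered by their binary index, and the function name and example parameters are hypothetical.

```python
import random


def sample_polya_tree(m, alphas, rng):
    """Return 2**m nonnegative masses summing to one.

    alphas[i - 1] is the symmetric beta parameter used for all splits at level i,
    so level i uses 2**(i - 1) independent Beta(alpha_i, alpha_i) variables.
    """
    if len(alphas) != m:
        raise ValueError("need one beta parameter per level")
    masses = [1.0]
    for a in alphas:
        next_masses = []
        for mass in masses:
            u = rng.betavariate(a, a)
            next_masses.append(mass * u)          # child indexed by appending 0
            next_masses.append(mass * (1.0 - u))  # child indexed by appending 1
        masses = next_masses
    return masses


x = sample_polya_tree(3, [1.0, 2.0, 3.0], random.Random(1))
```

By the aggregation property of the Dirichlet distribution, taking $\alpha_i = a\,2^{m-i}$ gives a symmetric Dirichlet($a, \ldots, a$) vector on $2^m$ coordinates; other sequences $\alpha_i$ give the non-Dirichlet alternatives discussed in this section.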

Conclusion
We obtained posteriors that adapt to the smoothness of the underlying density, which is assumed to be contained in a nonparametric model. It is of interest to obtain, using the same prior, a parametric rate if the underlying density is a finite mixture itself. This is the case in the location-scale model studied in [19], and the arguments used therein could be easily applied in the present work.
The result would however have less practical relevance, as the variances σ 2 j of all components are required to be the same.
Furthermore, the prior on the $\sigma_j$'s used in [19] depends on $n$, and this seems to be essential if the optimal rates and adaptivity found in the present work are to be maintained. In the lower bound for the prior mass on a KL-ball around $f_0$, given by (33), we get an extra factor $k_n$ in the exponent, and the argument only applies if $\lambda = \lambda_n \approx \sigma_n$. This suggests that the restriction to have the same variance for all components is necessary to have a rate-adaptive posterior based on a fixed prior, but we have not proved this. The determination of lower bounds for convergence rates deserves further investigation; some results can be found in [33]. Full adaptivity over the union of all finite mixtures and Hölder densities could perhaps be established by putting a hyperprior on the two models, as considered in [12].

Acknowledgements
We want to thank Catia Scricciolo, Bertrand Michel and Cathy Maugis for carefully reading earlier versions of this work, which enabled us to improve the paper significantly.

Appendix A
The following theorem is taken from [13] (Theorem 5), and slightly adapted to facilitate the entropy calculations in the proof of Theorem 2. Their condition $\Pi(F_n \mid X_1, \ldots, X_n) \to 0$ in $F_0^n$-probability is a consequence of (41) and (42) below. This follows from a simplification of the proof of Theorem 2.1 in [11], p. 525, where we replace the complement of a Hellinger-ball around $f_0$ by $F_n^c$. If we then take $\epsilon = 2\bar\epsilon_n$ in Corollary 1 in [13], with $\bar\epsilon_n \ge \epsilon_n$ and $\bar\epsilon_n \to 0$, the result of Theorem 5 in [13] still holds.
Theorem 3 (Ghosal and van der Vaart, 2006).Given a statistical model F, let {X i } i≥1 be an i.i.d.sequence with density f 0 ∈ F. Assume that there exists a sequence of submodels F n that can be partitioned as where KL(f 0 , ǫ n ) is the Kullback-Leibler ball The advantage of the above version is that (42) is easier to verify for a faster sequence ǫ n .The use of the same sequence ǫ n in (40) and (42) would otherwise pose restrictions for the choice of F n .
From (43) it follows that for all α > 0 and all integers j ≥ 1, The following lemma will be required for the proof of Lemma 1 in the next section.
Lemma 10. Given a positive integer $m$ and $\psi^{(p)}(x) = C_p e^{-|x|^p}$, let $\varphi$ be the $m$-fold convolution $\psi * \cdots * \psi$. Then for any $\alpha \ge 0$ there is a number $k' = k'(p, \alpha, m)$ such that for all sufficiently small $\sigma > 0$,

Proof. For any $p > 0$ and a random variable $Z$ with density $\psi^{(p)}$, For $m = 1$, we have for any $\alpha > 0$ and $y > 0$. Now let $m > 1$, and $X = \sum_{i=1}^{m} Z_i$ for i.i.d. random variables $Z_i$ with density $\psi^{(p)}$. If $\alpha \ge 1$ then, by Jensen's inequality applied to the function $x \mapsto x^\alpha$, where we used (46) with $\alpha = 0$ and the independence of the $Z_i$'s to bound the terms with i = j. If $\alpha < 1$, we bound $|Z|^\alpha$ by $|Z|$ and apply the preceding result.
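For reference, the absolute moments of a random variable $Z$ with density $\psi^{(p)}(x) = C_p e^{-|x|^p}$ are available in closed form. The computation below is a standard calculus exercise stated here as an aid, not a quotation of the paper's own display; it uses the substitution $u = x^p$ and the value $C_p = p / (2\Gamma(1/p))$ implied by normalization.

```latex
\begin{align*}
E|Z|^{\alpha}
  &= 2 C_p \int_0^\infty x^{\alpha} e^{-x^p}\, dx
   = \frac{2 C_p}{p} \int_0^\infty u^{\frac{\alpha+1}{p} - 1} e^{-u}\, du \\
  &= \frac{2 C_p}{p}\, \Gamma\!\Big(\frac{\alpha+1}{p}\Big)
   = \frac{\Gamma\big((\alpha+1)/p\big)}{\Gamma(1/p)},
  \qquad \alpha \ge 0 .
\end{align*}
```

As a check, for $p = 2$ and $\alpha = 2$ this gives $\Gamma(3/2)/\Gamma(1/2) = 1/2$, the variance of the $N(0, 1/2)$ distribution, and for $\alpha = 0$ it gives one.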

Appendix B: Approximation under a global Hölder condition
For L > 0, β > 0 and r the largest integer smaller than β, let H(β, L) be the space of functions h such that sup x =y |h (r) (x) − h (r) (y)|/|y − x| β−r ≤ L, where h (r) is the rth derivative of h.Let H β be the Hölder-space ∪ L>0 H(β, L), and given some function h Proof.By induction it follows that The proof then depends on the following two observations.First, note that if f ∈ H β then f 1 , f for some i ∈ {1, . . ., r}.From (6) it follows that for all i = 1, . . ., r if δ is sufficiently small.Therefore it has to be a large value of |L(X + σU )| that forces X + σU to be in . We now derive the contradiction from the assumption that L is polynomial.Let q be its degree, and let η = max |z i |, z i being the roots of L. First, suppose that |X| > η + 1.Then again for sufficiently small σ.
If m = 2 in (23), note that the above argument remains valid if X has density K σ f instead of f .The last term in (57) is then bounded by P (X ∈ A c σ,δ ), which is O(σ 2β ) by the result for m = 1.This step can be repeated arbitrarily often, for some decreasing sequence of δ's.
To bound the second integral in (23) for m = 0, we need the tail condition f (x) ≤ c|x| −α in (C2).In combination with the monotonicity of f required in (C4), this implies that which is O(σ 2β ) when H 1 ≥ 4β.For m = 1, we integrate over the sets E c σ ∩ A c σ and E c σ ∩ A σ .The integral over the first set is O(σ 2β ) by the preceding paragraph.To bound the second integral, consider the sets This step can be repeated as long as the terms P X∼f (X ∈ E c σ,δ ) remain O(σ 2β ), which is the case if the initial H 1 is chosen large enough.This finishes the proof of (23).
To prove (25), let β > 2 and k ≥ 1 be such that 2k < β ≤ 2k + 2, l = log f being β-Hölder.It can be seen that Lemma 1 still holds if we treat l as if it was Hölder smooth of degree 2. Instead of (17), we then obtain where L (2) = l 2 and R (2) is a linear combination of l 2 1 and |L (2) |.The key observation is that R (2) = o(1) uniformly on A σ when σ → 0. Combining (60) with the lower bound for f on E σ , can find a constant ρ close to 1 such that for small enough σ.Similarly, when l is treated as being Hölder smooth of degree 4, we find that Continuing in this manner, we find a constant ρ k such that f k (x) > ρ k f (x) for x ∈ A σ ∩ E σ and σ sufficiently small.If initially ρ is chosen close enough to 1, ρ k > 1 2 and hence A σ ∩ E σ ⊂ J σ,k .To see that (23) now implies (24), note that the integrand 1 2 f − f k is a linear combination of K m σ f , m = 0, . . ., k.

Appendix E: Proof of Lemma 4
We bound the second integral in (28); the first integral can be bounded similarly.
For hk the normalized restriction of h k to E σ and m the finite mixture to be constructed, we write The integral of f (log(f /K σ h k )) ) and the required bound for Eσ f (log(K σ h k /K σ hk )) 2 follows.To bound the integral of f (log K σ hk /m) 2 over E σ , let m = m(•; k σ , µ σ , w σ , σ) be the finite mixture obtained from Lemmas 12 and 13, with ǫ = σ δ ′ H1+1 and δ ′ ≥ 1+2β/H 1 .
The requirement on $a$ and $\psi^{-1}(\epsilon)$ in Lemmas 12 and 13 is satisfied by the monotonicity and tail conditions on $f$ (see (29)). The number of components $k_\sigma$ in Lemma 13 is $O(\sigma^{-1} |\log \sigma|^{1 + p^{-1}})$. We have provided that $\delta' \ge 1 + \beta / H_1$. The cross-products resulting from the square in the integral over $E_\sigma$ can be shown to be $O(\sigma^{2\beta})$ using the Cauchy-Schwarz inequality and the preceding bounds.
To bound the integral over E c σ , we add a component with weight σ 2β and mean zero to the finite mixture m.From Lemma 3 it can be seen that this does not affect the preceding results.Since f and h k are uniformly bounded, so is K σ h k .If C is an upper bound for K σ h k , then