Semiparametric Bernstein-von Mises for the error standard deviation



Introduction
In this paper we study the asymptotic behavior of the marginal posterior for the error standard deviation in a nonparametric, fixed design regression model with Gaussian errors. We suppose we have observations $Y_1, \ldots, Y_n$ satisfying

$$Y_i = f_0(x_i) + \sigma_0 Z_i, \qquad i = 1, \ldots, n,$$

where $x_1, \ldots, x_n$ are known elements of a general design space $\mathcal{X}$, the variables $Z_1, \ldots, Z_n$ are independent, standard normal and both the regression function $f_0 : \mathcal{X} \to \mathbb{R}$ and the error standard deviation $\sigma_0 > 0$ are unknown. We can then make Bayesian inference about the parameters $f$ and $\sigma$ by endowing them with independent priors $\Pi_f$ and $\Pi_\sigma$, respectively, and considering the resulting posterior distribution $\Pi(\cdot \mid Y_1, \ldots, Y_n)$. We study the asymptotic behavior of the marginal posterior distribution $B \mapsto \Pi(\sigma \in B \mid Y_1, \ldots, Y_n)$ of the parameter $\sigma$ for $n \to \infty$.

* Research supported by the Netherlands Organization for Scientific Research (NWO).
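As a concrete illustration of the model above, the following minimal sketch simulates data from it. The regression function `f0_example`, the equispaced design on $[0, 1]$ and the values of `n` and `sigma0` are illustrative assumptions, not choices made in the paper.

```python
import math
import random


def simulate_regression(n, f0, sigma0, seed=0):
    """Draw (x_i, Y_i), i = 1..n, from Y_i = f0(x_i) + sigma0 * Z_i."""
    rng = random.Random(seed)
    # Fixed, known design points (an assumed equispaced grid on [0, 1]).
    xs = [(i + 0.5) / n for i in range(n)]
    # Independent standard normal errors Z_i, scaled by sigma0.
    ys = [f0(x) + sigma0 * rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def f0_example(x):
    """Hypothetical smooth regression function, used only for illustration."""
    return math.sin(2.0 * math.pi * x)


xs, ys = simulate_regression(n=2000, f0=f0_example, sigma0=0.5, seed=1)

# If f0 were known, the root mean square of the residuals would estimate sigma0.
rms = math.sqrt(sum((y - f0_example(x)) ** 2 for x, y in zip(xs, ys)) / len(xs))
```

With $f_0$ unknown, the residuals are not available, which is exactly why the behavior of the marginal posterior of $\sigma$ has to be studied separately.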
Although in most cases the main interest is in estimating the regression function $f$, making accurate inference about the error standard deviation $\sigma$ can also be important. In regression analysis it is common to report an estimate of $\sigma$ to quantify the magnitude of the measurement errors in the data or to assess model fit. It is quite natural to attempt to estimate $\sigma$ in an efficient way. In the frequentist literature this problem has been studied for a long time and in increasing generality. See for instance the recent paper Brown and Levine (2007) for historical comments and rather extensive references. The efficient estimation of the error standard deviation or variance in nonparametric regression has, however, so far received little attention in the Bayesian literature. The existing theorems focus on contraction rates for the posterior distribution of the regression function $f$ and at best give only crude rates for the posterior distribution of $\sigma$. Theorems about the asymptotic shape of the posterior of $\sigma$ have not been obtained so far.
The general rate of contraction result for fixed design regression obtained in Ghosal and Van der Vaart (2007) gives conditions under which the posterior for the regression function $f$ contracts around the true $f_0$ at a certain rate $\varepsilon_n$ as $n \to \infty$, under the assumption that $\sigma_0$ is known. As has been observed several times in the literature (see e.g. Van der Vaart and Van Zanten (2008a), Van der Vaart and Van Zanten (2009), De Jonge and Van Zanten (2010)), this result can be extended to the case that $\sigma_0$ is unknown; see also the appendix to this paper. In that case one also obtains a rate for the marginal posterior of $\sigma$. Specifically, the (extended version of) existing general results give conditions under which, for a given sequence $\varepsilon_n \to 0$, the contraction statement (1.1) holds as $n \to \infty$, for every sufficiently large $M > 0$. Here the convergence is in probability under the true model. A result like (1.1) implies in particular that the marginal posterior for $\sigma$ is asymptotically concentrated on an interval with length of the order $\varepsilon_n$ around the true value $\sigma_0$. Since $\varepsilon_n$ is also a bound for the rate of contraction of the marginal posterior for $f$, however, it is a "nonparametric rate" that will be slower than the parametric rate $n^{-1/2}$ if the space of regression functions that are considered is infinite-dimensional. The rate bound $\varepsilon_n$ for the one-dimensional parameter $\sigma$ is therefore typically very crude and it is natural to ask whether in fact the actual rate of contraction for the marginal posterior for $\sigma$ can be faster than the rate for the regression function $f$.
In the extreme case that the regression function $f$ is completely known and $\sigma$ is the only unknown parameter in the problem, the classical Bernstein-von Mises (BvM) theorem asserts that under minimal regularity conditions, the posterior distribution of $\sigma$ contracts around the true value $\sigma_0$ at the rate $n^{-1/2}$. Moreover, it says that the posterior law of $\sqrt{n}(\sigma - \sigma_0)$ behaves asymptotically like a normal distribution $N(\Delta_n, I_{\sigma_0}^{-1})$, with $\Delta_n$ a sequence of random variables with an asymptotic $N(0, I_{\sigma_0}^{-1})$-distribution under $P_0$ and $I_{\sigma_0}$ the Fisher information for $\sigma_0$. The precise statement is recalled in the next section. The BvM result implies in particular that the posterior for $\sigma$ correctly quantifies the uncertainty about the parameter. Specifically, if credible bounds $l_n < u_n$ are determined such that for a fixed level $\alpha \in (0, 1)$ it holds that $\Pi(\sigma \in (l_n, u_n) \mid Y_1, \ldots, Y_n) \ge 1 - \alpha$, then the BvM theorem implies that the credible interval $(l_n, u_n)$ also has frequentist coverage probability $1 - \alpha$ asymptotically, i.e. $\liminf_{n \to \infty} P_0(\sigma_0 \in (l_n, u_n)) \ge 1 - \alpha$. Moreover, the length $u_n - l_n$ of the credible interval asymptotically coincides with the length of an optimal confidence interval. We refer to the discussion in Section 1.5 of Castillo (2012a) for more details.
In this paper we investigate if and how this changes if the regression function $f$ is unknown. In this case we know that (1.1) holds for instance if $f_0 \in \mathcal{F}$ and $\sigma_0 \in [a, b]$, say, and we place independent priors $\Pi_f$ and $\Pi_\sigma$ on $\mathcal{F}$ and $[a, b]$, respectively, $\Pi_\sigma$ having a positive, continuous Lebesgue density and $\Pi_f$ such that for positive numbers $\tilde\varepsilon_n, \bar\varepsilon_n \le \varepsilon_n$ and constants $c_1, c_2 > 0$ it holds that for every $c_3 > 1$, there exist measurable subsets $\mathcal{F}_n \subset \mathcal{F}$ and a constant $c_4 > 0$ such that conditions (1.2)-(1.4) hold. Here $\|g\|_n^2 = n^{-1} \sum_{i=1}^n g^2(x_i)$ and for a metric space $(A, d)$ and $\varepsilon > 0$, $N(\varepsilon, A, d)$ is the minimal number of balls of $d$-radius $\varepsilon$ needed to cover $A$. (See Theorem A.1 in the appendix for this result.) We prove below (see Theorem 2.2) that if in addition

$$n\varepsilon_n^4 \to 0 \quad \text{and} \quad \int_0^{a\varepsilon_n} \log N(\delta, \mathcal{F}_n, \|\cdot\|_n)\, d\delta \to 0 \ \text{ for all } a > 0, \qquad (1.5)$$

then the BvM assertion holds for the marginal posterior distribution of $\sigma$. In particular, the marginal posterior distribution of $\sigma$ then has the same, optimal asymptotic behavior as in the case that $f$ is known.
In the literature various papers can be found that deal with the verification of conditions (1.2)-(1.4) for specific families of priors on $f$. See for instance Ghosal and Van der Vaart (2007), Van der Vaart and Van Zanten (2008a), Van der Vaart and Van Zanten (2009), De Jonge and Van Zanten (2010), De Jonge and Van Zanten (2012), Tokdar (2011), Bhattacharya, Pati and Dunson (2012). These results cannot, however, be applied directly to verify the additional condition (1.5) as well. The reason is that in the cited papers, the constructed sieves $\mathcal{F}_n$ that verify (1.3) and (1.4) are typically too large for condition (1.5) to hold. Therefore, verifying the conditions of our general BvM theorem for a specific prior usually involves the careful construction of alternative sieves. The new, smaller sieves should be such that the remaining mass condition (1.3) is still fulfilled and in addition the entropy $\log N(\delta, \mathcal{F}_n, \|\cdot\|_n)$ can be controlled for arbitrarily small $\delta$, so that (1.5) can be verified.
In this paper we carry out this task for Gaussian process priors and for a spline-based prior on the regression function. In the case of Gaussian process priors, it is known that conditions (1.2)-(1.4) can be replaced by a single condition on the so-called concentration function of the prior, cf. Van der Vaart and Van Zanten (2008a). Roughly speaking we prove in Theorem 3.1 below that if in addition to this condition the rate $\varepsilon_n$ is fast enough and the sample paths of the Gaussian prior have regularity larger than $d/2$, for $d$ the dimension of the covariate space, then the BvM statement holds. We give details for two specific popular families of Gaussian priors: multiply integrated Brownian motions and the class of Matérn processes. In both cases we find that BvM holds if the prior is rough enough relative to the degree of smoothness of the true regression function $f_0$. In some generality it is known that if we want optimal contraction rates for $f$ using a Gaussian prior, then the regularities of the truth and the prior should be equal (see Van der Vaart and Van Zanten (2008a), Castillo (2008)). In the examples we work out we find, however, that for BvM for $\sigma$ to hold it is not necessary that the smoothnesses are matched exactly. Some degree of oversmoothing and an arbitrary degree of undersmoothing are allowed, cf. Section 3.2. In particular, the rate of contraction of the marginal posterior for $f$ may be sub-optimal, while the posterior for $\sigma$ still has an optimal asymptotic behavior. This is in line with the findings of Castillo (2012a) in the context of the white noise model.
The second type of concrete priors we study are spline-based priors studied before in De Jonge and Van Zanten (2012). More precisely, we consider a hierarchical prior on functions on $[0, 1]^d$, defined structurally as a spline of fixed order, with randomly placed, regularly spaced knots and random B-spline coefficients (details in Section 4). In De Jonge and Van Zanten (2012) it was shown that when properly constructed, such a prior yields adaptive, nearly rate-optimal estimation of a smooth regression function $f$. We investigate this prior in this paper because we are interested in the question whether or not we can have adaptive estimation of $f$ and BvM for $\sigma$ at the same time. In Theorem 4.1 we show, by constructing appropriate sieves, that this is indeed possible. For the spline prior we prove that if the true $f_0$ is a $d$-variate function with (Hölder-) regularity $\beta$, then BvM for $\sigma$ holds if $\beta > d$. So in that case we have a single procedure that yields both efficient estimation of the error standard deviation and adaptive, nearly rate-optimal estimation of $f$ across a range of regularities.
The specific priors that we analyze are Gaussian or conditionally Gaussian. This is technically convenient, since it allows us to use tools from Gaussian process theory. However we stress that our general BvM theorems are valid outside the Gaussian realm as well.
Our general result can be viewed as a semiparametric Bernstein-von Mises theorem. In general, semiparametric BvM theorems deal with the asymptotic behavior of posterior distributions of finite-dimensional parameters in the presence of an infinite-dimensional "nuisance" parameter. Theorems of this type have recently been established by several authors, see for instance Shen (2002), De Blasi and Hjort (2009), Castillo (2012a), Bickel and Kleijn (2012), Rivoirard and Rousseau (2012). Our problem in fact fits into the general framework of Castillo (2012a) (up to minor adaptations) and we will use his results to derive our BvM theorem for the error standard deviation.
The remainder of the paper is organized as follows. After recalling the parametric BvM theorem in Section 2.1 we present our general semiparametric results for the error standard deviation in Section 2.2. In Section 3 we consider the special case that the prior on f is Gaussian. We formulate a general theorem and verify the conditions for the two particular examples mentioned above. Section 4 treats the hierarchical spline-based priors. We prove that they yield simultaneous adaptation for f and BvM for σ. The proof of our general theorem is given in Section 5. In the appendix, which we added for the sake of completeness, we state and prove a theorem giving sufficient conditions for the contraction rate result (1.1). This result is essentially known, but a proof has never been published.

Prelude: Parametric Bernstein-von Mises
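The main result of this paper is a semiparametric Bernstein-von Mises (BvM) theorem for the error standard deviation in a fixed design regression model. As a prelude we first consider the parametric case in which we observe variables $Y_1, \ldots, Y_n$ satisfying

$$Y_i = f_0(x_i) + \sigma Z_i, \qquad i = 1, \ldots, n,$$

for known covariates $x_i \in \mathcal{X}$ and standard normal random variables $Z_i$. We now assume that the regression function $f_0$ is known, so that the error standard deviation $\sigma > 0$ is the only unknown parameter. We denote its true value by $\sigma_0$.

Observe that in this case we simply have a sample of size $n$ from the $N(0, \sigma^2)$-distribution, given by

$$X_i = Y_i - f_0(x_i), \qquad i = 1, \ldots, n.$$

The BvM theorem in a smooth, parametric i.i.d. model like this one is classical. As an illustration and to connect to the semiparametric case studied ahead we briefly explain it. Let $p_\sigma$ be the marginal density of $X_1$ and put $l_\sigma = \log p_\sigma$, so that

$$l_\sigma(x) = -\log \sigma - \frac{x^2}{2\sigma^2} - \tfrac{1}{2} \log(2\pi).$$

Then a Taylor expansion gives

$$l_\sigma(x) = l_{\sigma_0}(x) + (\sigma - \sigma_0)\, \dot{l}_{\sigma_0}(x) + \tfrac{1}{2} (\sigma - \sigma_0)^2\, \ddot{l}_{\sigma_0}(x) + \cdots,$$

where dots denote derivatives with respect to $\sigma$. By the law of large numbers the average $-n^{-1} \sum_{i=1}^n \ddot{l}_{\sigma_0}(X_i)$ converges almost surely to the Fisher information $I_{\sigma_0} = -E_0 \ddot{l}_{\sigma_0}(X_1) = \mathrm{Var}_0\, \dot{l}_{\sigma_0}(X_1)$. It follows that for the full log-likelihood we have the LAN approximation

$$\sum_{i=1}^n \log \frac{p_{\sigma_0 + h/\sqrt{n}}}{p_{\sigma_0}}(X_i) \approx h I_{\sigma_0} \Delta_n - \tfrac{1}{2} h^2 I_{\sigma_0}, \qquad \Delta_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n I_{\sigma_0}^{-1}\, \dot{l}_{\sigma_0}(X_i).$$

By the central limit theorem, we have the weak convergence $\Delta_n \Rightarrow N(0, I_{\sigma_0}^{-1})$ as $n \to \infty$.

If we now put a prior on $(0, \infty)$ with a Lebesgue density $\pi$ which is positive and continuous at $\sigma_0$, then for the corresponding posterior we have, for a Borel subset $B \subset \mathbb{R}$,

$$\Pi\big(\sqrt{n}(\sigma - \sigma_0) \in B \mid X_1, \ldots, X_n\big) = \frac{\int_{\{\sigma :\, \sqrt{n}(\sigma - \sigma_0) \in B\}} \prod_{i=1}^n p_\sigma(X_i)\, \pi(\sigma)\, d\sigma}{\int_0^\infty \prod_{i=1}^n p_\sigma(X_i)\, \pi(\sigma)\, d\sigma}.$$

By the LAN approximation, the integrands are approximately equal to a constant times

$$\exp\big(-\tfrac{1}{2} I_{\sigma_0} (h - \Delta_n)^2\big), \qquad h = \sqrt{n}(\sigma - \sigma_0).$$

Making the change of variable $h = \sqrt{n}(\sigma - \sigma_0)$ in both integrals, we see that the posterior probability is approximately

$$\frac{\int_B e^{-\frac{1}{2} I_{\sigma_0} (h - \Delta_n)^2}\, dh}{\int_{\mathbb{R}} e^{-\frac{1}{2} I_{\sigma_0} (h - \Delta_n)^2}\, dh} = N(\Delta_n, I_{\sigma_0}^{-1})(B).$$

This somewhat loose argumentation can be made precise and it can be shown that in probability, the total variation distance between the posterior distribution of $\sqrt{n}(\sigma - \sigma_0)$ and the $N(\Delta_n, I_{\sigma_0}^{-1})$-distribution vanishes as $n \to \infty$, cf. e.g. Van der Vaart (1998). It is easily verified that in this case

$$\Delta_n = \frac{\sigma_0}{2\sqrt{n}} \sum_{i=1}^n \Big(\frac{X_i^2}{\sigma_0^2} - 1\Big) = \frac{\sigma_0}{2\sqrt{n}} \sum_{i=1}^n (Z_i^2 - 1), \qquad I_{\sigma_0} = \frac{2}{\sigma_0^2}. \qquad (2.1)$$

In the next section we state the semiparametric version of this result for the case that the regression function $f$ is in fact unknown. It turns out that there is no loss of information (in the semiparametric sense) for the error standard deviation and that under relatively mild conditions on the prior for the nonparametric part $f$, the asymptotic behavior of the marginal posterior for $\sqrt{n}(\sigma - \sigma_0)$ is the same as if $f$ were known.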
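The parametric BvM statement can be explored numerically. The sketch below is only an illustration under stated assumptions: it takes a uniform prior on a hypothetical interval `[a, b]`, evaluates the posterior of $\sigma$ on a grid, and compares it in total variation with the normal approximation, using the standard Gaussian-scale expressions $I_{\sigma_0} = 2/\sigma_0^2$ and $\Delta_n = \frac{\sigma_0}{2\sqrt{n}} \sum_i (X_i^2/\sigma_0^2 - 1)$. The function name, grid size and seeds are arbitrary choices.

```python
import math
import random


def tv_posterior_vs_normal(n, sigma0=1.0, a=0.2, b=3.0, seed=0, grid=4001):
    """Approximate TV distance between the posterior of sigma and its BvM normal limit."""
    rng = random.Random(seed)
    xs = [sigma0 * rng.gauss(0.0, 1.0) for _ in range(n)]  # sample from N(0, sigma0^2)
    s = sum(x * x for x in xs)

    fisher = 2.0 / sigma0 ** 2  # Fisher information for sigma at sigma0
    delta = sigma0 / (2.0 * math.sqrt(n)) * sum(x * x / sigma0 ** 2 - 1.0 for x in xs)
    sd = 1.0 / math.sqrt(n * fisher)       # sd of the normal approximation, sigma scale
    mu = sigma0 + delta / math.sqrt(n)     # its centre, sigma scale

    # Grid of sigma values covering essentially all posterior and normal mass.
    lo, hi = max(a, sigma0 - 10.0 * sd), min(b, sigma0 + 10.0 * sd)
    step = (hi - lo) / (grid - 1)
    sig = [lo + k * step for k in range(grid)]

    # Log-likelihood of sigma; the uniform prior only contributes a constant.
    loglik = [-n * math.log(t) - s / (2.0 * t * t) for t in sig]
    m = max(loglik)
    weights = [math.exp(v - m) for v in loglik]
    z = sum(weights) * step
    post = [w / z for w in weights]

    norm = [math.exp(-0.5 * ((t - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
            for t in sig]
    # TV distance is invariant under the affine map sigma -> sqrt(n) * (sigma - sigma0).
    return 0.5 * step * sum(abs(p - q) for p, q in zip(post, norm))


tv_small = tv_posterior_vs_normal(n=50, seed=1)
tv_large = tv_posterior_vs_normal(n=5000, seed=1)
```

Comparing `tv_small` and `tv_large` for a few seeds gives a rough impression of how quickly the approximation takes hold; the theorem itself only asserts convergence in probability as $n \to \infty$.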

Semiparametric Bernstein-von Mises
Now suppose that we have observations $Y_1, \ldots, Y_n$ from the regression model

$$Y_i = f(x_i) + \sigma Z_i, \qquad i = 1, \ldots, n,$$

with fixed and known design points $x_1, \ldots, x_n$ in the set $\mathcal{X}$, an unknown regression function $f : \mathcal{X} \to \mathbb{R}$, an unknown constant $\sigma > 0$, and with $Z_1, \ldots, Z_n$ independent standard Gaussian random variables. We assume that the true parameter $(\sigma_0, f_0)$ belongs to the set $(0, \infty) \times \mathcal{F}$, for $\mathcal{F}$ a measurable space of functions on $\mathcal{X}$. The corresponding true distribution of the data is denoted by $P_0$. The log-likelihood is given, up to an additive constant, by

$$\ell_n(\sigma, f; Y_1, \ldots, Y_n) = -n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^n \big(Y_i - f(x_i)\big)^2.$$

We assume that for every $n$, the map $(\sigma, f, y) \mapsto \ell_n(\sigma, f; y_1, \ldots, y_n)$ is a measurable map on $(0, \infty) \times \mathcal{F} \times \mathbb{R}^n$. Note that this is the case for instance if $\mathcal{X}$ is a topological space and $\mathcal{F}$ is a measurable subset of the space $C(\mathcal{X})$ of continuous functions on $\mathcal{X}$, endowed with its Borel sigma-field.
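The Gaussian log-likelihood of the pair $(\sigma, f)$ is simple to evaluate. The following sketch writes it out, including the additive constant $-\frac{n}{2}\log(2\pi)$, and checks on a tiny made-up data set that for a fixed $f$ it is maximized in $\sigma$ at the root mean square residual. The data values and the function names are hypothetical.

```python
import math


def log_likelihood(sigma, f, xs, ys):
    """Gaussian log-likelihood of (sigma, f) in the model Y_i = f(x_i) + sigma * Z_i."""
    n = len(ys)
    rss = sum((y - f(x)) ** 2 for x, y in zip(xs, ys))  # residual sum of squares
    return -n * math.log(sigma) - rss / (2.0 * sigma ** 2) - 0.5 * n * math.log(2.0 * math.pi)


def sigma_mle(f, xs, ys):
    """Maximizer of the log-likelihood in sigma for a fixed regression function f."""
    rss = sum((y - f(x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(rss / len(ys))


# Tiny illustrative data set with the (assumed) regression function f(x) = x.
xs_demo = [0.0, 0.5, 1.0]
ys_demo = [0.1, 0.4, 1.2]


def identity(x):
    return x


sigma_hat = sigma_mle(identity, xs_demo, ys_demo)
ll_hat = log_likelihood(sigma_hat, identity, xs_demo, ys_demo)
```

Note that the residual sum of squares equals $n \|Y - f\|_n^2$ in terms of the empirical norm, which is how the norm $\|\cdot\|_n$ enters the analysis.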
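To make Bayesian inference about $f$ and $\sigma$ we endow the pair $(\sigma, f)$ with a product prior distribution of the form $\Pi = \Pi_\sigma \times \Pi_f$. Here $\Pi_\sigma$ is a distribution on $(0, \infty)$ with a positive and continuous Lebesgue density and $\Pi_f$ is a distribution on $\mathcal{F}$. In view of the measurability assumptions the corresponding posterior distribution is well defined and given by Bayes' formula. For $A$ and $B$ measurable subsets of $(0, \infty)$ and $\mathcal{F}$, respectively, the posterior measure of the set $A \times B$ is

$$\Pi(\sigma \in A, f \in B \mid Y_1, \ldots, Y_n) = \frac{\int_A \int_B e^{\ell_n(\sigma, f; Y_1, \ldots, Y_n)}\, \Pi_f(df)\, \Pi_\sigma(d\sigma)}{\int_0^\infty \int_{\mathcal{F}} e^{\ell_n(\sigma, f; Y_1, \ldots, Y_n)}\, \Pi_f(df)\, \Pi_\sigma(d\sigma)}.$$

The following theorem deals with the marginal posterior distribution of the parameter $\sigma$. It gives conditions under which we have, as in the case that $f$ is known, that the posterior distribution of $\sqrt{n}(\sigma - \sigma_0)$ asymptotically behaves as an $N(\Delta_n, I_{\sigma_0}^{-1})$-distribution, where $\Delta_n$ and $I_{\sigma_0}$ are as in (2.1). Note that we still have the weak convergence $\Delta_n \Rightarrow N(0, I_{\sigma_0}^{-1})$ under $P_0$, by the central limit theorem.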
The existing general contraction rate theorems for fixed design regression give conditions under which the posterior contracts around the true parameter $(\sigma_0, f_0)$. More precisely, for a sequence of positive numbers $\varepsilon_n$ such that $n\varepsilon_n^2 \to \infty$ they give conditions under which there exist measurable subsets $\mathcal{F}_n \subset \mathcal{F}$ such that the contraction statement (2.3) holds as $n \to \infty$, where, as before, the norm $\|\cdot\|_n$ is the $L^2$-norm associated with the empirical measure on the design points, i.e. $\|g\|_n^2 = n^{-1} \sum_{i=1}^n g^2(x_i)$. (Since a full proof of this exact statement appears never to have been given in the literature, we provide it in the appendix of the paper for the sake of completeness. See Theorem A.1.) The case that $\sigma_0$ is known is covered by these general results as well. Following Castillo (2012a), we denote the posterior distribution for $f$ in the model that $\sigma_0$ is known by $\Pi^{\sigma = \sigma_0}(\cdot \mid Y_1, \ldots, Y_n)$. In this notation, the general theory gives conditions under which the corresponding statement (2.4) holds as $n \to \infty$ (see e.g. Ghosal and Van der Vaart (2007)).
The rate $\varepsilon_n$ should be viewed as the contraction rate that is achieved for the nonparametric part of the statistical problem. The following theorem states that if this rate is fast enough, namely $n\varepsilon_n^4 \to 0$, then under the additional entropy condition (1.5), we have the BvM result for the error standard deviation $\sigma$. The proof of the theorem is given in Section 5.
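Theorem 2.1. Consider positive numbers $\varepsilon_n$ such that $n\varepsilon_n^2 \to \infty$ and $n\varepsilon_n^4 \to 0$. If there exist measurable subsets $\mathcal{F}_n \subset \mathcal{F}$ such that (2.3), (2.4) and (1.5) hold, then with $\Delta_n$ and $I_{\sigma_0}$ given by (2.1) we have

$$\sup_B \Big| \Pi\big(\sqrt{n}(\sigma - \sigma_0) \in B \mid Y_1, \ldots, Y_n\big) - N(\Delta_n, I_{\sigma_0}^{-1})(B) \Big| \xrightarrow{P_0} 0$$

as $n \to \infty$, where the supremum is taken over all measurable subsets $B \subset \mathbb{R}$.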
Existing general theorems give sufficient conditions on the prior Π f for (2.3) and (2.4) to hold. Full proofs are only given in the literature for the case that σ is known (see Ghosal and Van der Vaart (2007)), which only takes care of (2.4). It has been noted however that these results can be adapted to deal with the case that σ 0 belongs to a known compact interval [a, b] and Π σ is a prior concentrated on [a, b]. For completeness, we give a precise result in Theorem A.1 in the appendix. Admittedly, the assumption that the standard deviation belongs to a compact interval is restrictive. Extending the general rate result given in the appendix to alleviate this restriction is therefore desirable, but is not completely straightforward. We note that our general theorem, Theorem 2.1, does not require σ to be in a compact set. Hence, a generalization of Theorem A.1 will immediately yield a generalization of the following theorem as well.
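Theorem 2.2. Suppose that $\sigma_0 \in [a, b]$ and $\Pi_\sigma$ is concentrated on $[a, b]$. Consider positive numbers $\tilde\varepsilon_n, \bar\varepsilon_n \le \varepsilon_n$ such that $n(\tilde\varepsilon_n \wedge \bar\varepsilon_n)^2 \gtrsim \log n$ and $n\varepsilon_n^4 \to 0$. Suppose that for constants $c_1, c_2 > 0$ we have that for every $c_3 > 1$, there exist measurable subsets $\mathcal{F}_n \subset \mathcal{F}$ and a constant $c_4 > 0$ such that conditions (1.2)-(1.5) are fulfilled. Then with $\Delta_n$ and $I_{\sigma_0}$ given by (2.1) we have

$$\sup_B \Big| \Pi\big(\sqrt{n}(\sigma - \sigma_0) \in B \mid Y_1, \ldots, Y_n\big) - N(\Delta_n, I_{\sigma_0}^{-1})(B) \Big| \xrightarrow{P_0} 0$$

as $n \to \infty$, where the supremum is taken over all measurable subsets $B \subset \mathbb{R}$.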
Proof. Combining Theorems A.1 and 2.1 yields the result.
In the next two sections we verify the conditions of Theorem 2.2 for two classes of priors Π f : Gaussian process priors and hierarchical spline-based priors.

General Gaussian priors
We now specialize to the case that $\mathcal{X} = [0, 1]^d$ for some $d \in \mathbb{N}$. As prior $\Pi_f$ on the regression function $f$ we employ the law of a Gaussian random element $W$ in the space $C([0, 1]^d)$ of continuous functions on $[0, 1]^d$. We denote the reproducing kernel Hilbert space (RKHS) of $W$ by $\mathbb{H}$. For $f_0 \in C([0, 1]^d)$ the true regression function, the associated concentration function is denoted by $\varphi_{f_0}$, that is to say

$$\varphi_{f_0}(\varepsilon) = \inf_{h \in \mathbb{H} :\, \|h - f_0\|_\infty \le \varepsilon} \|h\|_{\mathbb{H}}^2 - \log P(\|W\|_\infty \le \varepsilon). \qquad (3.1)$$

(See the papers Van der Vaart and Van Zanten (2008a) and Van der Vaart and Van Zanten (2008b) and the references therein for these fundamental concepts.) As in Theorem 2.2, the error standard deviation is assumed to belong to $[a, b]$ and $\Pi_\sigma$ is concentrated on that interval. The general theory for Gaussian process priors then says that if $\varepsilon_n \to 0$ is such that $n\varepsilon_n^2 \to \infty$ and

$$\varphi_{f_0}(\varepsilon_n) \le n\varepsilon_n^2, \qquad (3.2)$$

then the contraction result (1.1) holds.

The theorem below essentially states that if in addition to (3.2) we have $n\varepsilon_n^4 \to 0$ and $W$ has degree of regularity $\alpha > d/2$, then BvM holds true. Specifically, we shall assume that $W$ takes values in the Hölder space $C^\gamma[0, 1]^d$ for all $\gamma < \alpha$. (Recall that a function belongs to this space if, for $\underline{\gamma}$ the largest integer strictly smaller than $\gamma$, it has continuous partial derivatives up to the order $\underline{\gamma}$ and the derivatives of order $\underline{\gamma}$ are Hölder continuous of the order $\gamma - \underline{\gamma}$.) We typically have that if a Gaussian process on $[0, 1]^d$ is $\alpha$-regular in this sense, then its RKHS unit ball $\mathbb{H}_1$ is contained in a Sobolev-type ball of regularity $\alpha + d/2$ (see for instance the concrete examples in the next subsection). If this is the case, then for every $\gamma \in [0, \alpha)$ the space $\mathbb{H}_1$ typically satisfies an entropy bound of the form (see, e.g., Edmunds and Triebel (1996))

$$\log N(\varepsilon, \mathbb{H}_1, \|\cdot\|_{C^\gamma}) \le K_\gamma\, \varepsilon^{-2d/(2\alpha + d - 2\gamma)} \qquad (3.3)$$

for some $K_\gamma > 0$. Here $\|\cdot\|_{C^\gamma}$ denotes the usual Hölder norm on $C^\gamma[0, 1]^d$ (see e.g. Van der Vaart and Wellner (1996) for its precise definition).
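Theorem 3.1. Suppose that for $\alpha > d/2$ the process $W$ takes values in $C^\gamma([0, 1]^d)$ for every $\gamma < \alpha$ and its RKHS unit ball $\mathbb{H}_1$ satisfies the entropy bound (3.3) for every $\gamma \in [0, \alpha)$. If (3.2) holds for numbers $\varepsilon_n \to 0$ such that $n\varepsilon_n^4 \to 0$, then with $\Delta_n$ and $I_{\sigma_0}$ given by (2.1) we have

$$\sup_B \Big| \Pi\big(\sqrt{n}(\sigma - \sigma_0) \in B \mid Y_1, \ldots, Y_n\big) - N(\Delta_n, I_{\sigma_0}^{-1})(B) \Big| \xrightarrow{P_0} 0$$

as $n \to \infty$, where the supremum is taken over all measurable subsets $B \subset \mathbb{R}$.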
Proof. We first remark that if (3.2) holds for the sequence $\varepsilon_n$ then it also holds for larger sequences, in particular for $\varepsilon_n' = \varepsilon_n \vee n^{-\alpha/(d + 2\alpha)}$. Since $\alpha > d/2$, this new sequence satisfies $n(\varepsilon_n')^4 \to 0$ as well. Therefore, we can assume without loss of generality that $\varepsilon_n \ge n^{-\alpha/(d + 2\alpha)}$ in the remainder of the proof. We apply Theorem 2.2. It is well known that (3.2) implies that condition (1.2) is fulfilled with $\tilde\varepsilon_n = \varepsilon_n$ (see Van der Vaart and Van Zanten (2008b), Lemma 5.3). To prove that there exist sieves $\mathcal{F}_n$ such that (1.3)-(1.5) are satisfied we exploit the fact that by assumption we can view $W$ as a Gaussian random element in the Banach space $(C^\gamma[0, 1]^d, \|\cdot\|_{C^\gamma})$ for $\gamma < \alpha$. Since $C[0, 1]^d$ is the completion of $C^\gamma[0, 1]^d$ with respect to the $\|\cdot\|_\infty$-norm and $\|\cdot\|_\infty \le \|\cdot\|_{C^\gamma}$, we have that the RKHS of $W$ viewed as a $C^\gamma[0, 1]^d$-valued Gaussian random element coincides with the RKHS $\mathbb{H}$ of $W$ viewed as a continuous Gaussian process. This follows from Lemma 8.1 in Van der Vaart and Van Zanten (2008b).
Since $\alpha > d/2$ by assumption, there exists a $\gamma$ such that $\alpha > \gamma > d/2$. Now set $\delta_n = n^{-(\alpha - \gamma)/(d + 2\alpha)}$ and $\mathcal{F}_n = M\sqrt{n}\,\varepsilon_n \mathbb{H}_1 + \delta_n C_1^\gamma$, where $M$ is a constant to be determined below and $C_1^\gamma$ is the unit ball in $C^\gamma[0, 1]^d$. We claim that if $M$ is chosen large enough, then conditions (1.3)-(1.5) hold true.
By the relation between the entropy of the RKHS unit ball and small ball probabilities established by Li and Linde (1999), assumption (3.3) implies that, for small $\varepsilon > 0$,

$$-\log P(\|W\|_{C^\gamma} \le \varepsilon) \lesssim \varepsilon^{-d/(\alpha - \gamma)}.$$

Hence, by the Borell-Sudakov inequality (see Van der Vaart and Van Zanten (2008b)) and the fact that for the standard normal distribution function $\Phi$ we have $\Phi^{-1}(y) \ge -\sqrt{(5/2) \log(1/y)}$ for small $y$, we have that condition (1.3) is fulfilled with $\bar\varepsilon_n = \varepsilon_n$, provided $M$ is chosen large enough.
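For the entropy conditions we note that by assumption (3.3) (applied with $\gamma = 0$ this time) and known entropy bounds for Hölder balls (see for instance Van der Vaart and Wellner (1996)), we have

$$\log N(2\varepsilon, \mathcal{F}_n, \|\cdot\|_\infty) \lesssim \Big(\frac{\sqrt{n}\,\varepsilon_n}{\varepsilon}\Big)^{2d/(d + 2\alpha)} + \Big(\frac{\delta_n}{\varepsilon}\Big)^{d/\gamma}.$$

The right-hand side with $\varepsilon_n$ substituted for $\varepsilon$ is bounded by a constant times $n^{d/(d + 2\alpha)} + (\delta_n/\varepsilon_n)^{d/\gamma}$. Both terms in this sum are bounded by $n\varepsilon_n^2$ by the lower bound assumption on $\varepsilon_n$ and the definition of $\delta_n$. Hence, condition (1.4) holds. The inequality in the last display also shows that for $a > 0$ the entropy integral in (1.5) is bounded by a sum of two terms, one for each term on the right-hand side. Since $\alpha \ge d/2$ and $n\varepsilon_n^4 \to 0$, the first term converges to 0. Since $\gamma > d/2$, the second term vanishes as well. This covers condition (1.5).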
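The exponent arithmetic used in the proof can be spelled out as follows. This is a worked check, using only the choices $\varepsilon_n \ge n^{-\alpha/(d+2\alpha)}$ and $\delta_n = n^{-(\alpha-\gamma)/(d+2\alpha)}$ made in the proof; it is not part of the original argument.

```latex
% (i) The enlarged sequence still satisfies n (eps_n')^4 -> 0:
n \bigl(n^{-\alpha/(d+2\alpha)}\bigr)^4
  = n^{\,1 - \frac{4\alpha}{d+2\alpha}}
  = n^{\frac{d - 2\alpha}{d + 2\alpha}} \longrightarrow 0
  \quad\Longleftrightarrow\quad \alpha > d/2 .

% (ii) The first entropy term is dominated by n eps_n^2:
n \varepsilon_n^2 \;\ge\; n^{\,1 - \frac{2\alpha}{d+2\alpha}} = n^{\frac{d}{d+2\alpha}} .

% (iii) The second entropy term is dominated as well:
\frac{\delta_n}{\varepsilon_n}
  \;\le\; n^{-\frac{\alpha-\gamma}{d+2\alpha}} \, n^{\frac{\alpha}{d+2\alpha}}
  = n^{\frac{\gamma}{d+2\alpha}}
\quad\Longrightarrow\quad
\Bigl(\frac{\delta_n}{\varepsilon_n}\Bigr)^{d/\gamma}
  \;\le\; n^{\frac{d}{d+2\alpha}} \;\le\; n \varepsilon_n^2 .
```

Step (iii) shows why $\delta_n$ is chosen with exactly this exponent: it makes the Hölder-ball part of the sieve no more complex, at scale $\varepsilon_n$, than the RKHS part.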

Specific Gaussian priors
In this subsection we verify the conditions of Theorem 3.1 for two particular examples of Gaussian process priors on f. In the first example we investigate a Matérn prior on a multivariate regression function. In the second example we consider the case d = 1 and choose a Riemann-Liouville type prior.

Matérn prior
The Matérn process $(W_t : t \in [0, 1]^d)$ with parameter $\alpha > 0$ is the zero-mean, stationary Gaussian process with covariance function

$$E W_s W_t = \int_{\mathbb{R}^d} e^{i \langle \lambda, s - t \rangle}\, \mu(d\lambda),$$

where the spectral measure $\mu$ is given by

$$\mu(d\lambda) = \big(1 + \|\lambda\|^2\big)^{-(\alpha + d/2)}\, d\lambda.$$

A special case is the Ornstein-Uhlenbeck process, which is the case $d = 1$, $\alpha = 1/2$. The Matérn process is a popular prior in Bayesian nonparametrics, see for instance Rasmussen and Williams (2006) and the references therein.
It is not difficult to see that there exists a version of the Matérn process with parameter $\alpha > 0$ that takes its values in $C^\gamma([0, 1]^d)$ for any $\gamma < \alpha$, see Van der Vaart and Van Zanten (2011). The RKHS unit ball of the Matérn process is included in a Sobolev ball of regularity $\alpha + d/2$, cf. Section 4.3 of Van der Vaart and Van Zanten (2011). For $\gamma < \alpha$, the metric entropy relative to the $C^\gamma$-norm of such a Sobolev ball satisfies (3.3) (see Theorem 3.3.2 on p. 105 in Edmunds and Triebel (1996)). It is shown in Section IV of Van der Vaart and Van Zanten (2011) that for a true regression function $f_0$ of regularity $\beta$ the inequality (3.2) holds for $\varepsilon_n$ proportional to $n^{-(\alpha \wedge \beta)/(d + 2\alpha)}$.
It is easily verified that in this situation the conditions of Theorem 3.1 are satisfied if the regularity $\alpha$ of the prior and the regularity $\beta$ of the true regression function satisfy the conditions

$$\alpha > \frac{d}{2}, \qquad \beta > \frac{\alpha}{2} + \frac{d}{4}. \qquad (3.4)$$

The collection of $\alpha$'s and $\beta$'s satisfying (3.4) is sketched in Figure 1. The figure makes clear that for the BvM result to hold, it is not necessary to estimate the regression function $f_0$ at an optimal rate. In particular, it is not necessary that the smoothness $\alpha$ of the prior matches the smoothness $\beta$ of the unknown regression function exactly. An arbitrary amount of undersmoothing ($\beta > \alpha$) is allowed and also some degree of oversmoothing ($\beta < \alpha$).
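The admissible parameter region can be checked mechanically. The sketch below assumes only the rate $\varepsilon_n \propto n^{-(\alpha \wedge \beta)/(d + 2\alpha)}$ quoted above and the two requirements of Theorem 3.1 ($\alpha > d/2$ and $n\varepsilon_n^4 \to 0$). It compares them on a grid with the explicit inequalities $\alpha > d/2$ and $\beta > \alpha/2 + d/4$. The function names are hypothetical, and exact rational arithmetic is used so that boundary cases are handled correctly.

```python
from fractions import Fraction


def rate_exponent(alpha, beta, d):
    """Exponent r such that eps_n = n**(-r), with r = min(alpha, beta) / (d + 2 * alpha)."""
    alpha, beta = Fraction(alpha), Fraction(beta)
    return min(alpha, beta) / (d + 2 * alpha)


def theorem_3_1_applies(alpha, beta, d):
    """alpha > d/2 and n * eps_n**4 -> 0, the latter being equivalent to 4 * r > 1."""
    return Fraction(alpha) > Fraction(d, 2) and 4 * rate_exponent(alpha, beta, d) > 1


def explicit_region(alpha, beta, d):
    """The explicit inequalities alpha > d/2 and beta > alpha/2 + d/4."""
    alpha, beta = Fraction(alpha), Fraction(beta)
    return alpha > Fraction(d, 2) and beta > alpha / 2 + Fraction(d, 4)


# Compare the two descriptions on a grid of regularities, for d = 1, 2, 3.
grid = [Fraction(k, 4) for k in range(1, 41)]
agree = all(
    theorem_3_1_applies(a, b, d) == explicit_region(a, b, d)
    for d in (1, 2, 3) for a in grid for b in grid
)
```

For example, with $d = 1$ the pair $(\alpha, \beta) = (2, 3/2)$ lies in the region even though the prior oversmooths, whereas $(\alpha, \beta) = (2, 1)$ does not.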
We note that it is not ruled out that the area for which BvM holds is actually larger than what we found. Using our general theorems it does not seem possible however to shed more light on this issue. Possibly more insight can be obtained by a more detailed analysis, tailored to the particular statistical problem and prior, in the spirit of Castillo (2012b).

Riemann-Liouville prior
In this subsection we consider the case d = 1, i.e. the true regression function is an unknown element f 0 ∈ C[0, 1].
For $\alpha > 0$ and $W$ a standard Brownian motion, the Riemann-Liouville process with parameter $\alpha$ is defined by

$$R^\alpha_t = \int_0^t (t - s)^{\alpha - 1/2}\, dW_s, \qquad t \in [0, 1].$$

It can be interpreted as the $(\alpha - 1/2)$-fold iterated integral of Brownian motion. The use of such priors is well established and goes back at least to Wahba (1978). The process $R^\alpha$ and its higher derivatives (if they exist) vanish at zero. In order to enlarge the class of functions that are well approximated by the process we modify it slightly, following Van der Vaart and Van Zanten (2008a). Let $\underline{\alpha}$ be the biggest integer strictly smaller than $\alpha$, and let $Z_1, \ldots, Z_{\underline{\alpha} + 1}$ be independent standard normal random variables, independent of the Riemann-Liouville process $R^\alpha$. The Riemann-Liouville-type process $X$ is obtained by adding to $R^\alpha$ a polynomial in $t$ with the random coefficients $Z_1, \ldots, Z_{\underline{\alpha} + 1}$. The process $(X_t : t \in [0, 1])$ is zero-mean Gaussian and can be seen as a random element in $C[0, 1]$. Since Brownian motion has "regularity" 1/2, the Riemann-Liouville process with parameter $\alpha$ is expected to be "regular" of order $\alpha$ in an appropriate sense. Indeed it can be shown that the process $R^\alpha$, and hence also the process $X$, has a version that takes values in $C^\gamma[0, 1]$ for all $\gamma < \alpha$, cf. Lifshits and Simon (2005). The RKHS unit ball of $X$ is a Sobolev-type ball of regularity $\alpha + 1/2$, cf. e.g. Van der Vaart and Van Zanten (2008a), and hence satisfies (3.3) with $d = 1$. Alternatively, the entropy bound (3.3) follows from the bound on the small ball probability of the Riemann-Liouville process with respect to the $C^\gamma$-norm given by Lifshits and Simon (2005) in combination with the result of Li and Linde (1999).
Upper bounds for the left hand side of (3.2) in this case are given in Van der Vaart and Van Zanten (2008a) and Castillo (2012a). If $f_0$ is in $C^\beta[0, 1]$ for some $\beta \ge \alpha$, then the left hand side of (3.2) is bounded from above by a multiple of $\varepsilon_n^{-1/\alpha}$. For $\beta < \alpha$, the upper bound in Castillo (2012a) is $\varepsilon_n^{-(2\alpha - 2\beta + 1)/\beta} \log(1/\varepsilon_n)$. It follows that condition (3.2) is satisfied for $\varepsilon_n$ a multiple of $(\log n / n)^{\beta/(1 + 2\alpha)}$ if $\beta < \alpha$ and for $\varepsilon_n$ a multiple of $n^{-\alpha/(1 + 2\alpha)}$ if $\beta \ge \alpha$. These conditions are almost the same as in the Matérn prior case. The log factor does not affect the pairs $(\alpha, \beta)$ for which the inequalities are true. We thus obtain that for the Riemann-Liouville prior as well, the BvM statement of Theorem 3.1 holds if the regularity $\beta$ of the truth and the regularity $\alpha$ of the prior are related as in (3.4), for $d = 1$. Again, Figure 1 visualizes the set of $\alpha$'s and $\beta$'s.

Hierarchical spline-based priors
We consider again the case X = [0, 1] d in this section and investigate a spline prior on f . Such priors were considered for nonparametric regression for instance by Huang (2004) and De Jonge and Van Zanten (2012), where it was shown that when properly constructed, they can yield adaptive, rate-optimal procedures for estimating the regression function. Here we show that it is possible to simultaneously have BvM for the error standard deviation.
We fix an order $q \ge 2$ and for $m \in \mathbb{N}$, consider the space $S_m$ of polynomial splines of order $q$ with simple knots at the points $1/m, 2/m, \ldots, (m - 1)/m$. A function $s : [0, 1] \to \mathbb{R}$ belongs to $S_m$ if there exist polynomials $p_1, \ldots, p_m$ of degree at most $q - 1$ such that $s(x) = p_j(x)$ for $x \in [(j - 1)/m, j/m)$ and $s$ is $q - 2$ times continuously differentiable. The space $S_m$ has dimension $J_m = q + m - 1$, cf. Theorem 4.4 of Schumaker (1981). A convenient basis of the space is given by the so-called B-splines. The exact definition of these functions (see Theorem 4.9 of Schumaker (1981)) is not of importance to us here. Important properties of B-splines are that they are nonnegative and supported on relatively small parts of the domain and that the sum of all B-splines at any given location equals one. More precisely, they form a partition of unity: if we denote the B-splines by $B^m_1, \ldots, B^m_{J_m}$, then $\sum_{j=1}^{J_m} B^m_j(x) = 1$ for all $x \in [0, 1]$. As a consequence, the supremum norm $\|s\|_\infty$ of a function $s \in S_m$ of the form $s = \sum_j c_j B^m_j$ is bounded by the supremum norm of its B-spline coefficients $\|c\|_\infty = \max_j |c_j|$.
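The properties of B-splines used here are easy to check numerically. The sketch below assumes the usual construction with $q$-fold repeated boundary knots and the Cox-de Boor recursion, which may differ in presentation from the definition in Schumaker (1981); the function names are hypothetical. Exact rational arithmetic makes the partition of unity hold exactly rather than up to rounding.

```python
from fractions import Fraction


def knot_vector(q, m):
    """Knots for splines of order q on [0, 1] with simple interior knots k/m."""
    interior = [Fraction(k, m) for k in range(1, m)]
    return [Fraction(0)] * q + interior + [Fraction(1)] * q


def bspline(j, k, t, x):
    """Value at x of the j-th B-spline of order k (degree k - 1) on the knots t."""
    if k == 1:
        return Fraction(1) if t[j] <= x < t[j + 1] else Fraction(0)
    left = right = Fraction(0)
    # Terms with a zero-length knot span are dropped (the usual 0/0 := 0 convention).
    if t[j + k - 1] > t[j]:
        left = (x - t[j]) / (t[j + k - 1] - t[j]) * bspline(j, k - 1, t, x)
    if t[j + k] > t[j + 1]:
        right = (t[j + k] - x) / (t[j + k] - t[j + 1]) * bspline(j + 1, k - 1, t, x)
    return left + right


def basis_values(q, m, x):
    """Values at x in [0, 1) of all J_m = q + m - 1 B-splines of order q."""
    t = knot_vector(q, m)
    return [bspline(j, q, t, x) for j in range(len(t) - q)]


q_demo, m_demo = 4, 5
points = [Fraction(i, 37) for i in range(37)]  # evaluation points in [0, 1)
values = [basis_values(q_demo, m_demo, x) for x in points]
```

Because the basis functions are nonnegative and sum to one, any spline $\sum_j c_j B^m_j$ is a pointwise convex combination of its coefficients, which gives the bound $\|s\|_\infty \le \|c\|_\infty$ immediately.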
Functions of several variables can be dealt with using tensor product splines. For $d \ge 2$ we define the tensor product space $\mathbf{S}_m = S_m \otimes \cdots \otimes S_m$ ($d$ times), with $S_m$ the space of univariate splines defined above. The space $\mathbf{S}_m$ has dimension $J_m^d$ and a basis is given by the tensor-product B-splines

$$B^m_{j_1, \ldots, j_d}(x_1, \ldots, x_d) = B^m_{j_1}(x_1) \cdots B^m_{j_d}(x_d), \qquad 1 \le j_1, \ldots, j_d \le J_m.$$

Slightly abusing notation these multivariate B-splines are denoted by $B^m_1, \ldots, B^m_{J_m^d}$. It is easy to see that we again have the partition of unity property and hence also for $d \ge 2$ it holds that the supremum norm of a function in $\mathbf{S}_m$ is bounded by the supremum norm of its B-spline coefficients.
We define the prior $\Pi_f$ on $f$ as the law of the random spline process $W$ defined by

$$W(x) = \sum_{j=1}^{J_M^d} \xi_j B^M_j(x), \qquad x \in [0, 1]^d,$$

where $\xi_1, \xi_2, \ldots$ are independent, standard normal random variables and $M^d$ is a geometric variable, independent of the $\xi_j$'s. Theorem 4.2 of De Jonge and Van Zanten (2012) asserts that if $f_0 \in C^\beta([0, 1]^d)$ for some $\beta \le q$, then the corresponding posterior distribution satisfies (1.1) for $\varepsilon_n$ equal to $n^{-\beta/(d + 2\beta)}$, up to a logarithmic factor. In particular, with this prior we achieve nearly rate-optimal, adaptive estimation of the regression function for regularities up to the order of the splines that are used. We can now prove that if the regularity of the regression function is larger than the dimension of the design space, we simultaneously have BvM for $\sigma$.

Theorem 4.1. Suppose that $f_0 \in C^\beta([0, 1]^d)$ for some $\beta \in (d, q]$. Then with $\Delta_n$ and $I_{\sigma_0}$ given by (2.1) we have

$$\sup_B \Big| \Pi\big(\sqrt{n}(\sigma - \sigma_0) \in B \mid Y_1, \ldots, Y_n\big) - N(\Delta_n, I_{\sigma_0}^{-1})(B) \Big| \xrightarrow{P_0} 0$$

as $n \to \infty$, where the supremum is taken over all measurable subsets $B \subset \mathbb{R}$.
Proof. It was proved in De Jonge and Van Zanten (2012) (see Theorem 4.2 in that paper) that if $f_0 \in C^\beta([0, 1]^d)$ for $\beta \le q$, then for sequences $\tilde\varepsilon_n$ and $\bar\varepsilon_n$ that are both up to a logarithmic factor equal to $n^{-\beta/(d + 2\beta)}$, it holds that for every $C > 1$ there exist a constant $D > 0$ and sets $U_n \subset C[0, 1]$ for which the prior mass, remaining mass and entropy bounds hold. So we see that conditions (1.2)-(1.4) of Theorem 2.2 are satisfied. The sets $U_n$ are certain unions of enlarged RKHS balls corresponding to the Gaussian process that is obtained by conditioning the process $W$ on the gridsize variable $M$. Inspection of the proof of Theorem 4.2 of De Jonge and Van Zanten (2012) shows, however, that condition (1.5) does not hold for the $U_n$. Fix $C > 1$. To construct new, slightly smaller sieves we take constants $K, L > 0$, determined further below, and define sets $V_n$ depending on these constants. Then we set $\mathcal{F}_n = U_n \cap V_n$. We claim that conditions (1.3)-(1.5) are satisfied for these sets.
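We have $\Pi(F_n^c) \le P(W \notin U_n) + P(W \notin V_n)$. The first probability is bounded by $\exp(-Cn\varepsilon_n^2)$ and by construction we have
$$P(W \notin V_n) \le P(M^d > Kn\varepsilon_n^2) + \sum_{m:\, m^d \le Kn\varepsilon_n^2} P(M = m)\, P\Big(\max_{j \le J_m^d} |Z_j| > L\sqrt{n}\varepsilon_n\Big).$$
Hence, since the variable $M^d$ is geometric and $P(\max_{j \le J_m^d} |Z_j| > L\sqrt{n}\varepsilon_n) \lesssim m^d \exp(-L^2 n\varepsilon_n^2/2)$,
$$P(W \notin V_n) \lesssim e^{-cKn\varepsilon_n^2} + (Kn\varepsilon_n^2)^{1+1/d}\, e^{-\frac{1}{2}L^2 n\varepsilon_n^2}$$
for some $c > 0$. For $K, L$ large enough this is bounded by $\exp(-Cn\varepsilon_n^2)$ as well, and it follows that condition (1.3) is fulfilled.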
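The tail bound for the maximum of the Gaussian coefficients used above is a plain union bound. As a minimal sketch, the following compares the exact tail probability of the maximum of $J$ independent standard normal variables with the bound $2J\exp(-t^2/2)$; the function names and the constant 2 are illustrative assumptions, not notation from the paper.

```python
import math


def exact_max_tail(J: int, t: float) -> float:
    """Exact P(max_{j<=J} |Z_j| > t) for J independent standard normals."""
    p_single = math.erfc(t / math.sqrt(2.0))  # P(|Z| > t)
    return 1.0 - (1.0 - p_single) ** J


def union_bound(J: int, t: float) -> float:
    """Union bound J * 2 * exp(-t^2 / 2) on the same probability."""
    return 2.0 * J * math.exp(-t * t / 2.0)


if __name__ == "__main__":
    for J in (1, 10, 1000):
        for t in (0.5, 2.0, 5.0):
            print(J, t, exact_max_tail(J, t), union_bound(J, t))
```

With $J$ of the order $m^d$ and $t = L\sqrt{n}\varepsilon_n$ this gives a bound of the form $m^d \exp(-L^2 n\varepsilon_n^2/2)$, up to a constant.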
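It is clear that the sieves $F_n$ satisfy condition (1.4), since they are contained in the $U_n$. Next, observe that for $\delta > 0$, the covering number $N(\delta, V_n, \|\cdot\|_\infty)$ is bounded by the sum, over the admissible grid sizes $m$, of the covering numbers of the sets of splines in $\mathcal{S}_m$ with coefficients bounded by $L\sqrt{n}\varepsilon_n$. Since the supremum norm of a spline in $\mathcal{S}_m$ is bounded by the supremum norm of its B-spline coefficients, each of these covering numbers is bounded by $(2L\sqrt{n}\varepsilon_n/\delta)^{J_m^d}$ for small $\delta$. It follows that for every $a, \varepsilon > 0$,
$$\int_0^{a\varepsilon} \log N(\delta, V_n, \|\cdot\|_\infty)\, d\delta \lesssim a\varepsilon \log n + n\varepsilon_n^2 \int_0^{a\varepsilon} \log \frac{2L\sqrt{n}\varepsilon_n}{\delta}\, d\delta.$$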
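It is easily checked that the integral on the right is bounded by a constant times $a\varepsilon \log(2L\sqrt{n}\varepsilon_n/(a\varepsilon))$. All together we find that for $\varepsilon_n = \tilde\varepsilon_n \vee \bar\varepsilon_n$,
$$\int_0^{a\varepsilon_n} \log N(\delta, V_n, \|\cdot\|_\infty)\, d\delta \lesssim \varepsilon_n^3\, n \log n.$$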
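The bound on the logarithmic integral can be made explicit: for $0 < b \le c$ one has $\int_0^b \log(c/\delta)\, d\delta = b(\log(c/b) + 1)$, which is at most $2b\log(c/b)$ as soon as $c/b \ge e$. The sketch below checks this closed form against a midpoint rule; here `b` stands for $a\varepsilon$ and `c` for $2L\sqrt{n}\varepsilon_n$, and the function names and numerical values are hypothetical.

```python
import math


def log_integral_closed_form(b: float, c: float) -> float:
    """Closed form of the integral of log(c / delta) over (0, b]."""
    return b * (math.log(c / b) + 1.0)


def log_integral_midpoint(b: float, c: float, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the same integral."""
    h = b / steps
    return h * sum(math.log(c / ((k + 0.5) * h)) for k in range(steps))


if __name__ == "__main__":
    b, c = 0.1, 5.0
    print(log_integral_closed_form(b, c), log_integral_midpoint(b, c))
```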
Since $\varepsilon_n \lesssim n^{-\beta/(d+2\beta)} \log^p n$ for some $p > 0$, the right-hand side converges to 0 if $\beta > d$. This covers condition (1.5) and also shows that $n\varepsilon_n^4 \to 0$, as required.
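The role of the condition $\beta > d$ can be read off from the powers of $n$. Ignoring logarithmic factors, $\varepsilon_n^3 n$ behaves like $n^{1 - 3\beta/(d+2\beta)}$, which tends to zero precisely when $\beta > d$, while $n\varepsilon_n^4$ behaves like $n^{1 - 4\beta/(d+2\beta)}$, which already tends to zero when $\beta > d/2$. A small sketch of this exponent arithmetic, with hypothetical function names:

```python
from fractions import Fraction


def rate_exponent(beta, d) -> Fraction:
    """Exponent r with eps_n = n^{-r}, logarithmic factors ignored."""
    return Fraction(beta) / (Fraction(d) + 2 * Fraction(beta))


def entropy_bound_exponent(beta, d) -> Fraction:
    """Power of n in eps_n^3 * n; negative means convergence to zero."""
    return 1 - 3 * rate_exponent(beta, d)


def fourth_power_exponent(beta, d) -> Fraction:
    """Power of n in n * eps_n^4; negative means convergence to zero."""
    return 1 - 4 * rate_exponent(beta, d)


if __name__ == "__main__":
    for beta in (1, 2, 3):
        print(beta, entropy_bound_exponent(beta, 2), fourth_power_exponent(beta, 2))
```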
We remark that the condition $\beta > d$ is used for technical reasons, to control the last entropy integral appearing in the proof. This does not rule out the possibility that the statement of the theorem is true for a larger range of $\beta$'s.

Proof of the general theorem
In this section we give the proof of Theorem 2.1.
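It is convenient to describe the model by the parameter $(\theta, f)$ with $\theta = 1/\sigma^2$. For this parametrization the log-likelihood is given, up to an additive constant, by
$$\ell_n(\theta, f) = \frac{n}{2}\log\theta - \frac{\theta}{2}\sum_{i=1}^n \big(Y_i - f(x_i)\big)^2.$$
The first step in the proof is finding an appropriate expansion for the log-likelihood ratio $\Lambda_n(\theta, f) = \ell_n(\theta, f) - \ell_n(\theta_0, f_0)$. We define an inner product $\langle\cdot,\cdot\rangle_L$ on pairs $(\theta, f)$ of inverse variances and regression functions by
$$\langle(\theta_1, f_1), (\theta_2, f_2)\rangle_L = \frac{\theta_1\theta_2}{2\theta_0^2} + \frac{\theta_0}{n}\sum_{i=1}^n f_1(x_i) f_2(x_i).$$
The corresponding norm is denoted by $\|\cdot\|_L$, so
$$\|(\theta, f)\|_L^2 = \frac{\theta^2}{2\theta_0^2} + \theta_0 \|f\|_n^2.$$
Note that although it is not made explicit in the notation, the inner product and the norm depend on the sample size $n$ (and on the true parameter $\theta_0$). Straightforward algebra yields the following lemma.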
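Lemma 5.1. We have
$$\Lambda_n(\theta, f) = -\frac{n}{2}\|(\theta - \theta_0, f - f_0)\|_L^2 + \sqrt{n}\, W_n(\theta - \theta_0, f - f_0) + R_n(\theta, f),$$
where, with $Z_i = (Y_i - f_0(x_i))/\sigma_0$,
$$W_n(\theta, f) = -\frac{\theta}{2\theta_0\sqrt{n}}\sum_{i=1}^n (Z_i^2 - 1) + \frac{\sqrt{\theta_0}}{\sqrt{n}}\sum_{i=1}^n Z_i f(x_i)$$
and
$$R_n(\theta, f) = n\Big(\frac{1}{2}\log\frac{\theta}{\theta_0} - \frac{\theta - \theta_0}{2\theta_0} + \frac{(\theta - \theta_0)^2}{4\theta_0^2}\Big) - \frac{n}{2}(\theta - \theta_0)\|f - f_0\|_n^2 + \frac{\theta - \theta_0}{\sqrt{\theta_0}}\sum_{i=1}^n Z_i (f - f_0)(x_i).$$

We are now in the situation that we can apply Theorem 1 of Castillo (2012a). Strictly speaking this theorem does not allow the dependence of the inner product $\langle\cdot,\cdot\rangle_L$ on $n$ that we have, but inspection of Castillo's proof shows that this causes no problems. Since our LAN-norm has the property that the norm $\theta \mapsto \|(\theta, 0)\|_L$ on $\mathbb{R}$ is independent of $n$, only minor adaptations of that proof are necessary. We note that our change of variables $\theta = 1/\sigma^2$ helps to establish a direct connection with the setup of Castillo (2012a), since the map $W_n$ defined in Lemma 5.1 is linear in $\theta$.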
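An expansion of this type is an exact algebraic identity, so it can be checked numerically. The sketch below assumes the log-likelihood $\frac{n}{2}\log\theta - \frac{\theta}{2}\sum_i (Y_i - f(x_i))^2$, the squared LAN norm $\theta^2/(2\theta_0^2) + \theta_0\|f\|_n^2$, and a linear term and remainder of the form coded in `w_n` and `r_n`. These explicit forms are a reconstruction from the surrounding text, and the design points, functions and parameter values are hypothetical.

```python
import numpy as np


def lan_expansion_sides(n=50, theta0=2.0, theta=2.3, seed=0):
    """Return (Lambda_n, -n/2 * norm^2 + sqrt(n) * W_n + R_n) for simulated data."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)           # hypothetical fixed design
    f0 = np.sin(2.0 * np.pi * x)           # hypothetical true regression function
    f = f0 + 0.1 * np.cos(3.0 * x)         # hypothetical alternative function
    z = rng.standard_normal(n)             # standard normal errors Z_i
    y = f0 + z / np.sqrt(theta0)           # Y_i = f0(x_i) + sigma0 * Z_i

    def loglik(th, g):
        # Log-likelihood in the (theta, f) parametrization, constants dropped.
        return 0.5 * n * np.log(th) - 0.5 * th * np.sum((y - g) ** 2)

    lam = loglik(theta, f) - loglik(theta0, f0)

    a, h = theta - theta0, f - f0
    h_norm2 = np.mean(h ** 2)              # squared empirical norm of f - f0
    lan_norm2 = a ** 2 / (2.0 * theta0 ** 2) + theta0 * h_norm2
    w_n = (-a / (2.0 * theta0) * np.sum(z ** 2 - 1.0)
           + np.sqrt(theta0) * np.sum(z * h)) / np.sqrt(n)
    r_n = (n * (0.5 * np.log(theta / theta0) - a / (2.0 * theta0)
                + a ** 2 / (4.0 * theta0 ** 2))
           - 0.5 * n * a * h_norm2
           + a / np.sqrt(theta0) * np.sum(z * h))
    return lam, -0.5 * n * lan_norm2 + np.sqrt(n) * w_n + r_n


if __name__ == "__main__":
    print(lan_expansion_sides())
```

The two returned numbers agree up to floating-point error for any choice of the inputs, which reflects that the remainder collects exactly the terms left over after the quadratic and linear parts are split off.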
Castillo's theorem asserts that if there exist positive numbers $\delta_n \to 0$ with $n\delta_n^2 \to \infty$ and measurable subsets $F_n \subset F$ for which conditions (5.1)-(5.3) hold, then (5.4) holds. The next step is to show that conditions (5.1)-(5.3) hold for $\delta_n$ equal to a constant times $\varepsilon_n$ under the assumptions of Theorem 2.1.
Next we consider (5.3). Define $V_n = \{(\theta, f) \in (0, \infty) \times F_n : \|(\theta - \theta_0, f - f_0)\|_L \le \delta_n\}$. We consider the three terms in the definition of $R_n$ in the statement of Lemma 5.1 separately. For $(\theta, f) \in V_n$ it holds that $|\theta - \theta_0|$ is bounded by a multiple of $\delta_n$. By Taylor's formula, the first term in the definition of $R_n$ is $nO(|\theta - \theta_0|^3)$ for $\theta$ close to $\theta_0$, and hence the first term is bounded by a multiple of $(1 + n(\theta - \theta_0)^2)\delta_n$ on $V_n$. For the second term, note that $x \mapsto x/(1 + nx^2)$ is maximal at $x = n^{-1/2}$, and equal to $n^{-1/2}/2$ at that point. It follows that the supremum over $V_n$ of the second term divided by $1 + n(\theta - \theta_0)^2$ is bounded by a multiple of $\sqrt{n}\delta_n^2$.

R. De Jonge and H. Van Zanten
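Similarly, the supremum over $V_n$ of the third term divided by $1 + n(\theta - \theta_0)^2$ is bounded by a multiple of $\sup_{(\theta, f) \in V_n} |G_n(f - f_0)|$, where $G_n$ is the Gaussian random map defined by $G_n f = n^{-1/2}\sum_{i=1}^n Z_i f(x_i)$. The norm $\|\cdot\|_n$ is precisely the natural semi-norm associated with the Gaussian process $G_n$, in the sense that $E_0(G_n f - G_n g)^2 = \|f - g\|_n^2$. Therefore, the well-known maximal inequality for sub-Gaussian processes, cf. e.g. Van der Vaart and Wellner (1996), Corollary 2.2.8, implies that the expectation of this supremum is bounded by an entropy integral of $F_n$ with respect to $\|\cdot\|_n$, up to some constant $K > 0$. All together we conclude that the left-hand side of (5.3) is bounded in terms of $\delta_n$, $\sqrt{n}\delta_n^2$ and this entropy integral as $n \to \infty$. For $\delta_n$ a multiple of $\varepsilon_n$ this is $o_{P_0}(1)$ under the assumptions of the theorem, hence (5.3) holds as well.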
We have now established that (5.4) holds under the conditions of Theorem 2.1. Next, observe that $\|(1, 0)\|_L^2 = 1/(2\theta_0^2)$ and $W_n(1, 0) \rightsquigarrow N(0, \|(1, 0)\|_L^2)$ under $P_0$, by the central limit theorem. The statement of the theorem then follows by an application of the lemma below, which gives a total variation version of the delta method, tailored to our situation. We apply the lemma with $X_n$ a random variable which has the posterior distribution of $\theta$ as law, $x_0 = \theta_0$, $\mu_n = W_n(1, 0)/\|(1, 0)\|_L^2$, $\sigma^2 = 1/\|(1, 0)\|_L^2 = 2\theta_0^2$ and $f(x) = 1/\sqrt{x}$. The lemma deals with the total variation distance between deterministic distributions. We can use it in our stochastic setting since $W_n(1, 0)/\|(1, 0)\|_L^2$ converges in distribution and hence is uniformly tight.
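The effect of the change of variables $\sigma = \theta^{-1/2}$ can be illustrated numerically. The sketch below takes $\theta \sim N(\theta_0 + \mu/\sqrt{n}, 2\theta_0^2/n)$ for a fixed, hypothetical value of $\mu$ (in the proof this role is played by the random $\mu_n$), computes the density of $\sigma = \theta^{-1/2}$ by the change-of-variables formula, and integrates its absolute difference with the normal density suggested by the delta method, which has mean $\sigma_0 + f'(\theta_0)\mu/\sqrt{n}$ and variance $f'(\theta_0)^2\, 2\theta_0^2/n = \sigma_0^2/(2n)$. The grid, the truncation at eight standard deviations and all names are illustrative assumptions.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    """Density of the N(mean, sd^2) distribution."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def tv_delta_method(n, theta0=1.0, mu=0.5, grid=200_001):
    """Approximate total variation distance between the law of theta^{-1/2}
    and its delta-method normal approximation (intended for n >= 100)."""
    sd_theta = np.sqrt(2.0 * theta0 ** 2 / n)
    mean_theta = theta0 + mu / np.sqrt(n)
    sigma0 = theta0 ** -0.5
    deriv = -0.5 * theta0 ** -1.5          # derivative of x -> x^{-1/2} at theta0
    mean_sigma = sigma0 + deriv * mu / np.sqrt(n)
    sd_sigma = abs(deriv) * sd_theta
    s = np.linspace(mean_sigma - 8.0 * sd_sigma, mean_sigma + 8.0 * sd_sigma, grid)
    # Density of sigma = theta^{-1/2}: theta = s^{-2}, |d theta / d s| = 2 s^{-3}.
    p = normal_pdf(s ** -2.0, mean_theta, sd_theta) * 2.0 * s ** -3.0
    q = normal_pdf(s, mean_sigma, sd_sigma)
    return 0.5 * np.sum(np.abs(p - q)) * (s[1] - s[0])


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(n, tv_delta_method(n))
```

The computed distance shrinks as $n$ grows, in line with the total variation convergence that the lemma formalizes.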
We denote the total variation distance between two probability measures $\mu$ and $\nu$ by $d_{TV}(\mu, \nu)$ and the law, or distribution, of a random variable $X$ by $L(X)$.
Proof. We suppose for definiteness that $f'(x_0) > 0$. It follows from the assumptions on $f$ that there exist neighborhoods $U$ and $V$ of $x_0$ and $f(x_0)$ such that $f$ is an invertible (in this case increasing) bijection between $U$ and $V$. The distribution $N(x_0 + \mu_n/\sqrt{n}, \sigma^2/n)$ concentrates around $x_0$ as $n \to \infty$. Hence, by (5.5), so does $L(X_n)$ and hence the law $L(f(X_n))$ concentrates around $f(x_0)$. Therefore, we only need to prove that