Bayesian inference with rescaled Gaussian process priors

We use rescaled Gaussian processes as prior models for functional parameters in nonparametric statistical models. We show how the rate of contraction of the posterior distributions depends on the scaling factor. In particular, we exhibit rescaled Gaussian process priors yielding posteriors that contract around the true parameter at optimal convergence rates. To derive our results we establish bounds on small deviation probabilities for smooth stationary Gaussian processes.


Introduction
Gaussian processes have been adopted as building blocks for constructing prior distributions on infinite-dimensional statistical models in several settings.For instance, a sample path of a Gaussian process can be used directly as a prior model for a regression function (see e.g.[6], [22], [16]); after a monotonic transformation to the unit interval it can be used in the setting of classification (e.g.[16], [1], [5]); and after exponentiation and renormalization it becomes a model for density estimation (e.g.[11], [9], [18]).
Priors on functions of a single variable are commonly constructed using stationary Gaussian processes with smooth sample paths (e.g.[1], [5], [10], [16])).A popular example is the so-called squared-exponential process, i.e. the centered Gaussian process W with covariance function EW s W t = a exp(−b|t − s| 2 ) for some a, b > 0. The existing mathematical literature concerned with priors of this type focusses on computational issues (e.g.[16], [1], [10]) or posterior consistency ( [5]).In the present paper we study posterior convergence rates, i.e. the rate at which the posterior distribution contracts around the true unknown functional parameter of interest.In particular, we are interested in exhibiting priors yielding optimal convergence rates if the unknown function belongs to a smoothness class.
It is well known that in the frequentist setup in which the data are sampled from a fixed true distribution, a prior on an infinite-dimensional model has to be carefully chosen in order to obtain optimal rates of contraction of the posterior (cf.e.g.[3], [17], [19], [4], [25], [23]).Even if posterior consistency is ensured, the rate of contraction of the posterior around the true functional parameter will typically be suboptimal if the regularity of prior process does not equal the regularity of the unknown parameter.Since Gaussian processes like the squared exponential process have infinitely often differentiable sample paths (at least in mean square sense), they will, at least without modification, typically not be appropriate as a prior model for a functional parameter with some finite smoothness level, in the sense that they yield suboptimal contraction rates.
In this paper we show however that this can be remedied by suitably rescaling the smooth process, with rescaling constants depending on the sample size.Given a fixed Gaussian process (W t : t ≥ 0) indexed by the positive time axis and scaling constants c n > 0 we use the rescaled sample path as a prior model for a given function w 0 : [0, 1] → R that indexes the distribution of the observations.The rescaling has the purpose of changing the appearance of the sample paths, so as to make them reflect more closely our prior ideas about the true parameter.Scaling factors c n → ∞ stretch the restrictions of the sample paths t → W t to the time interval [0, 1/c n ] to the interval [0, 1] and hence use only a small part of the randomness in the Gaussian process.Typically, this has the effect of smoothing the sample path.Conversely, scaling factors c n → 0 use the sample path t → W t on a long interval [0, 1/c n ] and shrink this to the interval [0, 1].This typically makes the prior rougher, by incorporating the randomness of a longer time period.Coming back to the example of the squared exponential process, we will show that for any given regularity level α > 0 there exist scaling factors c n → 0 ("roughening of the sample paths") such that after rescaling, we obtain a prior yielding (up to logarithmic factors) optimal contraction rates if the true parameter is α-smooth.To prove this result we use the general theory on posterior convergence rates for Gaussian priors developed in Van der Vaart and Van Zanten [20].The results of the latter paper state that the rate of convergence for a Gaussian process prior W is determined by the behaviour of its concentration function for ε → 0, where H is the reproducing kernel Hilbert space (RKHS) associated to the process W , • H is the RKHS-norm, and • is the norm of the function space where W takes its values.More precisely, the results state that asymptotically, the posterior concentrates its mass on balls around the true parameter with radius of the order ε n , where ε n is found by solving Our results for rescaled smooth, stationary Gaussian process priors are obtained by studying the RKHS and small deviations behaviour of such processes, leading to upper bounds for their concentration functions.For W the centered Gaussian process with covariance function EW s W t = exp(−|t − s| 2 ) and c > 0 we find that for the rescaled process W c = (W t/c : 0 ≤ t ≤ 1), and w 0 a C α function, This implies that if we use rescaling rates 1+2α) .Up to a logarithmic factor, this is the well-known minimax rate for estimating α-regular functions.
In addition to smooth stationary prior processes we also consider self-similar processes.At first thought one might expect that rescaling would have no effect on such processes, but this turns out to be false.Stretching and shrinking causes the smoothing and roughening effects mentioned previously.Convergence rates for posteriors based on certain self-similar Gaussian process priors were obtained in [20].We proved for instance that if W is a k-fold integrated Brownian motion (plus an independent polynomial part), then using W as prior on (k + 1/2)-smooth functions yields an optimal convergence rate for the posterior.In this paper we show that after rescaling, this prior becomes appropriate for a larger range of smoothness levels.For any α ∈ (0, k + 1] there exist scaling factors c n such that the prior based on the rescaled process W t/cn yields optimal contraction rates if the true parameter is α-smooth.In this case we have c n → 0 ("roughening") if α < k + 1/2 and c n → ∞ ("smoothing") if α > k + 1/2.The range of α's for which rescaling the k-fold integrated Brownian motion leads to a rate-optimal posterior is limited by the smoothness level k + 1 of the functions in the RKHS of the process.Technically, the results for self-similar processes are relatively easy consequences of the general results obtained in [20].
The results of this paper can be viewed as mathematical support for the common use of rescaled Gaussian process priors in Bayesian practice (see for instance [1], [10], [24]).We show that, from a frequentist perspective, rescaling greatly enlarges the range of models for which a given Gaussian process prior is appropriate.In a practical setting one often tries to robustify a Bayes procedure, or reduce subjectivity, by employing a random rescaling, i.e. using the prior W t/C with a random scaling factor C independent of W , rather than the prior W t itself.Further analysis is necessary on this issue, but the results in this paper may serve as a starting point for such an investigation.
Another point that deserves elaboration is the extension of our results to multivariate settings, i.e. cases where the unknown function of interest is a function of several variables.This requires however the generalization to higher dimensions of a number of approximation results (cf.Lemmas 2.2 and 2.3), which, in general, can be technically quite involved.
The remainder of this paper is organized as follows.In the next section we introduce and study the Gaussian processes that will serve as prior models.
To prepare for the proofs of our main results on posterior convergence rates we obtain small deviation bounds and results on the approximation properties of the RKHS of rescaled smooth stationary processes and multiply integrated Brownian motions.The results on posterior contraction are precisely stated in the final Section 3.
The notation is used for "smaller than or equal to a universal constant times", and ≍ is "proportional up to constants".We use the notation C[0, 1] for the space of continuous functions f : [0, 1] → R, endowed with the uniform norm, and L r (µ) for the measurable functions f : denote the Hölder space of order β, consisting of the functions f ∈ C[0, 1] that have β continuous derivatives, for β the biggest integer strictly smaller than β, with the βth derivative f (β) being Lipshitz continuous of order β − β.For ε > 0 let N (ε, B, d) be the minimum number of balls of radius ε needed to cover a subset B of a metric space with metric d.

Prior processes
The theorems on posterior contraction rates that we present in the next section concern two classes of priors.The first are constructed by rescaling smooth, stationary Gaussian processes, the second by rescaling multiply integrated Brownian motions.In this section we study these rescaled processes, obtaining results on their small deviations behaviour and the approximation properties of their reproducing kernel Hilbert spaces (RKHSs).Together with the general theory of [20], these results will allow us to obtain rates of convergence for posteriors.

Smooth stationary processes
Consider a centered, mean-square continuous Gaussian process W = (W t : t ≥ 0) with covariance function for a given continuous function ϕ : R → R. For a fixed scaling constant c > 0, we define the rescaled version W c of the process W by setting the process W admits a continuous version.We shall work with this continuous version throughout.
The spectral measure µ c of the rescaled process W c is obtained by rescaling µ: Let L 2 (µ c ) be the set of all functions h : R → C whose modulus |h| is square integrable with respect to µ c .Denote by F c h the transform F c h : R → C of the function h relative to the measure µ c : Note that F c maps L 2 (µ c ) into the space C(R) of continuous functions on the real line.
The following lemma describes the RKHS H c of the process (W c t : t ∈ [0, 1]).Recall that this space is defined as the completion of the linear span of the functions h s defined by h s Proof.Although the RKHS is real by definition, it will be convenient to complete the linear span of the functions h s over the complex numbers.Because the functions h s are real-valued, the RKHS is the set of real parts of functions in this complex RKHS.
If e s : R → C denotes the function e s (λ) = e isλ , then, by the definition of the spectral measure, It follows that the linear extension of the map L : h s → e s is an isometry for the Hilbert space structures from lin(h s : Hence, H c is the inverse image under L of the closure of the set of functions lin(e s : s ∈ [0, 1]) in L 2 (µ c ).Now, again by the definition of the spectral measure, It follows that the inverse L −1 is exactly the transform F c .Finally we prove that lin(e s : s ∈ [0, 1]) is dense in L 2 (µ c ), using the condition (2.2).As s ↓ 0, by dominated convergence, as functions in L 2 (µ c ).Because the functions on the left side are in lin(e s : s ∈ [0, 1]), the function λ → iλ is in the closure of this set.We repeat this argument with the function e s − e 0 )(λ)/s − iλ to see that the function 1 2 (iλ) 2 is contained in the closure of lin(e s : s ∈ [0, 1]), and so on.We conclude that all polynomials λ → λ k are contained in this closure.By extension to the complex case of Proposition 6.4.1 of [15] and (2.2), it then follows that this closure is dense in the full space L 2 (µ c ).
Let ϕ c (x) = ϕ(x/c) and denote by ϕ c * G(t) = ϕ c (t − s) dG(s) the density of the convolution of a signed measure G and the distribution corresponding to the density ϕ c .By Fubini's theorem, such a convolution can be written as for Ĝ the characteristic function of G, defined by 2π Ĝ(λ) = e itλ dG(t).Because 2π| Ĝ| is uniformly bounded by the total variation of G, it is contained in L 2 (µ c ) and hence the function ϕ c * G is contained in the RKHS, with square norm For a measure G on the interval [0, 1] this follows readily from the definition of the RKHS as the linear space spanned by the functions t → EW c s W c t = ϕ c (s − t) = ϕ c * δ s (t).As shown by the preceding lemma, under condition (2.2), the functions ϕ c * G are contained in the RKHS for any signed measure G on the full line R.This will be important for the proof of the following lemma, which quantifies how well C β functions can be approximated by elements of the RKHS of the rescaled process W c .Lemma 2.2.Let µ satisfy (2.2) and possess a Lebesgue density that is bounded away from zero on a neighborhood of 0. Let β > 0 be given.Then for any w ∈ C β [0, 1] there exist constants C w and D w depending only on w such that, as c ↓ 0, Proof.Let β be the biggest integer strictly smaller than β.Let φ be the density of µ, and let ψ(λ) = (2π) −1 e itλ ψ(t) dt be the Fourier transform of a general function ψ : R → C. The Fourier transform of the function ψ c defined by ψ c (x) = ψ(x/c) is given by ψc (λ) = c ψ(cλ).
We can extend w : [0, 1] → R to a function w : R → R with compact support and w β < ∞.
By Taylor's theorem we can write, for s, t ∈ R, where, for some ξ ∈ [0, 1], In view of the assumption that ψ is a higher order kernel, for any t ∈ R, Combining the preceding displays shows that c For ŵ the Fourier transform of w, we can write It follows that the function c −1 ψ c * w is contained in the RKHS, with square norm Here (2π Lemma 2.1 implies that under (2.2) the elements of the RKHS can be continuously extended to functions that are analytic on the strip {t ∈ C : |Im t| < (cδ)/2} in the complex plane.The following entropy estimate is therefore related to classical results on the entropy of spaces of analytic functions, obtained by Kolmorogov and Tihomirov [7].They obtain estimates of the order log(1/ε) 2 for the entropy of the spaces we are interested in.The following lemma makes the dependence on the scaling constant c explicit, which is essential for the proof of our main results.Lemma 2.3.Assume that the spectral measure satisfies (2.2) for some δ > 0. Then the entropy of the unit ball H c 1 of the RKHS of the process Proof.We construct an ε-net of piecewise polynomials over H c 1 .Because all moments of the spectral measure µ c are finite by (2.2), we can for any h ∈ L 2 (µ c ) differentiate the function F c h under the integral sign to find that (F c h) (k) (t) = (−iλ) k e −itλ h(λ) dµ c (λ).Consequently, where α k are the absolute moments of the spectral measure µ.By Taylor's formula it follows that, for every s, t ∈ [0, 1], where γ i,j ranges over the grid {0, ±η j , ±2η j , . ..} intersected with the interval − √ α 2j /c j , √ α 2j /c j , for η j = ε j!/(d j k).For every function F c h ∈ H c 1 and i there exist points γ i,j in the grid such that sup

The function
k−1 j=0 (F c h) (j) (id)(t − id) j /j! is within uniform distance ε of the function F c h on the interval (i − 1)d, id .The preceding being true for every i implies that the set H of piecewise polynomials (2.3) forms a 2ε-net over H c 1 for the uniform norm on (0, 1], and hence the covering number N 2ε, H c 1 , • ∞ is bounded by the number of points in H, which is equal to the number of different matrices (γ i,j ).The logarithm of this number can be bounded as For any x ≥ 0 and j ≥ 0 we have x j ≤ e x Γ(j + 1).Indeed, Therefore, for any λ we have |λ| 2j ≤ δ −2j Γ(2j + 1)e δ|λ| and, consequently, with In view of Stirling's approximation Γ(n + 1) ≍ n n+1/2 e −n , the right-hand side is, up to a constant, for j ≥ 1, equivalent to We choose d < δc/2 and k ∼ log(1/ε) to reduce this expression for j = k to a number smaller than ε.We have that (d/c) j √ α 2j /j! is bounded above uniformly in j = 1, . . ., k − 1, and hence With the indicated choices of k, this yields the bound given in the statement of the lemma.
If the spectral measure satisfies the stronger tail condition exp(δ|λ| r ) µ(dλ) < ∞ for some δ > 0 and r > 1, the elements of the RKHS can be extended to the entire complex plane and satisfy an exponential type restriction.In that case the results of Section 7.4 of [7] apply and can be used to improve the power 2 appearing in the entropy bound given by the lemma.In the statistical results, this improves the power of the logarithmic factors that we have in the rate of contraction results.
The following is a consequence of the preceding lemma and the well-known connection between the entropy of the unit ball of the RKHS and small ball probabilities, cf.Kuelbs and Li [8], Li and Linde [14].
Theorem 2.4.Suppose the spectral measure satisfies (2.2) and c ≤ 1.Then there exists an ε 0 > 0, independent of c, such that the rescaled process W c satisfies The preceding lemma and Theorem 1.2 of [14] imply we have the crude bound for every α ∈ (0, 2).According to the proof of Theorem 1.2 and the related Proposition 3.1 of [14], this bound holds for all ε > 0 satisfying We have c ≤ 1 by assumption and hence Since the right-hand side is independent of c, it follows that (2.4) holds for all ε in an interval independent of c.The preceding lemma and Theorem 2 of [8] imply that for ε small enough Again, inspection of the proof of the cited result of [8] shows that under our assumption c ≤ 1, this holds for all ε > 0 in an interval independent of c. Combination of the preceding display with (2.4) now yields the statement of the theorem.

Multiply integrated Brownian motion
Consider a mean-zero Gaussian process (W t : t ≥ 0) that is self-similar of order α: the processes (c α W t/c : t ≥ 0) and (W t : t ≥ 0) are equal in distribution for every c > 0. The rescaled process (W t/c : 0 ≤ t ≤ 1) given in (1.1) is then equal in distribution to the process (c −α W t : 0 ≤ t ≤ 1), which means that for use as a prior distribution the rescaling of the time-axis is equivalent to a rescaling of the vertical axis.The rescaling has a simple effect on the reproducing kernel Hilbert space and small ball probability, but it has an interesting consequence.We assume that the restriction (W t : 0 ≤ t ≤ 1) of the process to the unit interval and the rescaled process W c = (W t/c : 0 ≤ t ≤ 1) can be viewed as Borel measurable maps in a separable Banach space (B, • ), and that the self-similarity can be understood in the sense that the Borel laws of these two processes are identical.The RKHS and small ball probability of the process W c = (W t/c : 0 ≤ t ≤ 1) are then equal to these objects for the process (c −α W t : 0 ≤ t ≤ 1).Let H W be the RHKS of the process W (restricted to the unit interval) and let ϕ 0 (ε; W ) = − log Pr( W ≤ ε) be the exponent in its centered small ball probability.The following lemma is clear from the preceding.
Lemma 2.5.The RKHS of the process W c is the set of functions H W equipped with the norm h W c = c α h W .The centered small ball exponent of W c satisfies − log Pr W c ≤ ε = ϕ 0 (c α ε; W ).
As an example consider the k-fold integrated Brownian motion.Define I 1 0+ f as the function t → t 0 f (s) ds and set We consider the restriction of this process to [0, 1] as a map in C[0, 1].
The fact that the integrated Brownian motion has k derivatives at 0 equal to zero causes that the functions in its reproducing kernel Hilbert space satisfy similar constraints at 0. A better prior is obtained by adding an independent polynomial to the process.We consider the modified process for scaling factors c, a > 0, B a standard Brownian motion and independent standard normal variables Z 0 , . . ., Z k , independent of B.
The following theorem gives a centered small deviation bound for the process V c,a , and describes the approximation of smooth functions by elements of its RKHS H c,a .Theorem 2.6.Consider the process V c,a given in (2.5) as a map in C[0, 1].This process satisfies, for ε > 0 small enough, Moreover, for w ∈ C β [0, 1] and Proof.The assertion on the small ball probability follows easily from Theorem 2.1 of Li and Linde [13] on the small ball probability of integrated Brownian motion, and the fact that the added polynomial is independent and finitedimensional.By general arguments (e.g.[21], Section 10) we have that the reproducing kernel Hilbert space of the process (2.5) viewed as a map in C[0, 1] is the Sobolev space H k+1 [0, 1] of functions h : [0, 1] → R that are k times continuously differentiable with absolutely continuous kth derivative that is the integral of a function h (k+1) ∈ L 2 [0, 1], equipped with the norm with square For a smooth function ϕ and ϕ σ (x) = ϕ(x/σ)/σ its scaled version, the convolution w * ϕ σ is contained in the RKHS, with square norm If ϕ is chosen such that ϕ(x) dx = 1 and with zero moments of orders 1, . . ., k, then the distance w * ϕ σ − w ∞ in the uniform norm can be seen to be of order σ β .We choose σ = ε 1/β and next evaluate the preceding display to be of the order as given in the theorem.

Posterior contraction rates
In this section we present the main results on posterior convergence rates using rescaled Gaussian process priors.We denote the posterior distribution based on a prior Π n and observations X (n) by B → Π n (B | X (n) ).We consider three different statistical settings: i.i.d.density estimation, classification, and fixed design regression.For any of these, the general theory developed in [20] gives results expressing posterior contraction rates in terms of the small deviations behaviour and the RKHS structure of the Gaussian prior.The results in the present section are obtained by combining these general results with the material of the preceding section.
Below we give complete proofs for the density estimation case.Since the other two cases are completely analogous we only explain the results briefly.

Density estimation
Suppose that we observe a random sample X 1 , . . ., X n from a positive density p0 on [0, 1].A prior distribution on the set of positive densities can be defined structurally as p W , for a Gaussian process W = (W t : t ∈ [0, 1]) and p w the function defined by In the next two theorems we assume that the true density is α-smooth in the sense that log p0 ∈ C α [0, 1].We show that if in this case we take for W a suitably rescaled Gaussian process, we obtain a posterior that (perhaps up to logarithmic factors) contracts around the true density at the optimal minimax rate n −α/(1+2α) .The first result deals with rescaled smooth stationary processes.this gives the rate ε n = (n/log 2 n) − α 1+2α .
Proof.By Theorem 3.1 of [20] we get the conclusion of the theorem as soon as we show that ϕ n (ε n ) nε It is easy to check that these relations indeed hold for c n and ε n as in the statement of the theorem.
The following theorem gives the analogous result for rescaled integrated Brownian motions. .Define the prior Π n structurally as p V n , with p w as in (3.1).Then if log p0 ∈ C α [0, 1] and α ≤ k + 1, we have E 0 Π n (p : h(p, p0 ) > M ε n | X 1 , . . ., X n ) → 0 for all M large enough, where ε n = n − α 1+2α and h is the Hellinger distance on densities.
We construct a prior on the set of regression functions as f W n (or f V n ) for W n (or V n ) the Gaussian process from Theorem 3.1 (or 3.2) and f w the function f w (t) = Ψ(w t ), where Ψ : R → (0, 1) is (for instance) the logistic distribution function.
Theorem 3.2 of [20] and the results of Section 2 imply that if Ψ −1 ( f0 ) ∈ C α [0, 1], the analogues of the statements of Theorems 3.1 and 3.2 hold in this setting.We get the same rates of posterior contraction in this case as well, the statement of the theorems has to be replaced by for any sufficiently large constant M , where f 2 G,2 = f 2 (t)dG(t) and G is the marginal distribution of X.

Theorem 3 . 2 .
For α > 0 and k ∈ N 0 , let V n be the modified k-fold integrated Brownian motion defined in (2.5), with the scaling constant c replaced byc n = n α−(k+1/2) (k+1/2)(1+2α)and a replaced by a sequence a n satisfying a n ≤ n 1+2α−2k 1+2α c .By Bochner's theorem the function ϕ is representable as the characteristic function ϕ(t) = e −itλ dµ(λ) of a symmetric, finite measure µ on R, called the spectral measure of the process W . (The minus sign in the exponent is for consistency in notation, but is superfluous as µ is symmetric.)We shall consider spectral measures satisfying the condition e δ|λ| µ(dλ) < ∞ (2.2) for some δ > 0. This condition on the tails of the spectral measure should be viewed as a smoothness condition on W .It implies for instance that the sample paths of W are infinitely often differentiable, at least in mean-square sense.Examples of processes satisfying (2.2) are the centered Gaussian processes with covariance functions (s, t) → exp(−|t − s| 2 ) or (s, t) → (1 + |t − s| 2 ) −1 , which correspond, respectively, to Gaussian and Laplace spectral measures.Observe that in particular (2.2) implies that µ has finite moments of all orders and hence, since 2n , whereϕ n (ε n ) = inf h∈Hn: h−log p0 ∞ <εn h 2 Hn − log Pr W n < ε n ,with H n the RKHS of the rescaled process W n .Hence, by Lemma 2.2 and Theorem 2.4, it suffices to verify that