Posterior contraction and credible regions for level sets

For a given function f on a multivariate domain, the level sets {x : f(x) = c} for different values of c provide important geometrical insights about the underlying function of interest. The distance between the level sets of two functions may be measured by the Hausdorff metric or a metric based on the Lebesgue measure of a discrepancy, both of which can be linked with the L∞-distance between the underlying functions. In a Bayesian framework, we derive posterior contraction rates and optimal-sized credible sets with assured frequentist coverage for level sets in some nonparametric settings by extending some univariate L∞-posterior contraction rates to the corresponding multivariate settings. For the multivariate Gaussian white noise model, adaptive Hausdorff and Lebesgue contraction rates for level sets of the signal function and its mixed-order partial derivatives are derived using a wavelet series prior on the function. Assuming a known smoothness level of the signal function, an optimal-sized credible region for a level set with assured frequentist coverage is derived based on a multidimensional trigonometric series prior. For the nonparametric regression problem, adaptive rates for level sets of the function and its mixed partial derivatives are obtained using a multidimensional wavelet series prior. When the smoothness level is given, optimal-sized credible regions with assured frequentist coverage are obtained using a finite random series prior based on tensor products of B-splines. We also derive Hausdorff and Lebesgue contraction rates for level sets of a multivariate density function under a known smoothness setting.


Introduction
For a given constant c, the c-level set of a smooth function f : R^d → R is defined as the set {x ∈ R^d : f(x) = c}. In the literature, by level sets, some authors mean the sets {x ∈ R^d : f(x) ≥ c}. To avoid possible confusion, it would perhaps be appropriate to term the sets {x ∈ R^d : f(x) = c} level curves when d = 2, but since in higher dimensions these are surfaces and hypersurfaces, we shall use the term "level set" for {x ∈ R^d : f(x) = c}. Estimation of level sets is helpful in understanding the geometry of the function surface, in problems such as clustering (Cuevas et al. [18], Rinaldo and Wasserman [40]), support estimation of the density function (Cuevas and Fraiman [17], Biau et al. [5]) and binary classification (Mammen and Tsybakov [33]).
The most common approach to the estimation of level sets is the so-called "plug-in" approach, in which, in computing a level set, the argument f is replaced by some nonparametric estimate f̂. Under some suitable metrics, convergence rates for density level set estimates by the plug-in method were studied in Tsybakov [46], and later in Baíllo et al. [1], Cadre [7], Singh and Nowak [43] and Rigollet and Vert [39]. Asymptotic normality of some measure of the symmetric difference between the level set and its plug-in estimator was studied in Mason and Polonik [34]. Cavalier [12] studied nonparametric regression level sets and Cuevas et al. [19] studied level sets of a general smooth function. Besides the plug-in approach to estimation, a direct approach called the "excess mass approach" was proposed and studied in Polonik [36] and Polonik and Wang [37] for density and regression functions respectively. More recent papers studying statistical inference for level sets include Jankowski and Stanberry [28], Mammen and Polonik [32] and Chen et al. [13]. The constructions of confidence regions proposed in the first two papers both relied on estimates of the sup-norm distance between f and f̂, while the third paper made use of the estimate of variation in the Hausdorff metric directly. Inferential problems of similar types have also been studied in the econometrics literature, where the objects of interest are identified as sets (Chernozhukov et al. [14, 15]).
All the papers from the literature mentioned above followed the frequentist approach, and the inference procedures about level sets mainly rely on bootstrap procedures. To the best of our knowledge, Gayraud and Rousseau [20] is the only paper on level set estimation using the Bayesian approach, where the authors studied the contraction rate for a level set {f = c} in density estimation with respect to the Lebesgue distance, that is, the Lebesgue measure of the symmetric difference of {f ≥ c} with its true value {f_0 ≥ c}. The present paper has two primary goals. The first is to study contraction rates for the level set {f = c} of a function f of interest in terms of both the Hausdorff and the Lebesgue distance in several nonparametric settings, such as for the signal function or its mixed partial derivatives in a multivariate signal with a Gaussian white noise model, for the multivariate regression function or a mixed partial derivative of it in a nonparametric regression setting, and for the density function in a multivariate density estimation problem, and to check whether such rates can be automatically adapted to the smoothness of the underlying function. The second is to validate Bayesian credible regions for level sets from the frequentist perspective by obtaining adequate frequentist coverage of such a region while maintaining its optimal size. The primary technique is to link the rates and the sizes of the credible regions respectively to the L∞-contraction rate of the underlying function and a posterior quantile of the L∞-distance between the function of interest and its Bayes estimate.
The general theory of posterior contraction rates in Bayesian nonparametric models was developed in Ghosal et al. [23] and Ghosal and van der Vaart [21], respectively for independent and identically distributed (i.i.d.) observations and general observations. In this general theory, the existence of tests with exponentially decaying errors plays a fundamental role. Shen and Wasserman [42] also studied posterior contraction rates for i.i.d. observations, but under somewhat stronger conditions. For a thorough account of the development of the theory of posterior contraction, see the recent monograph Ghosal and van der Vaart [22]. These works primarily focus on consistency and contraction rates in the Hellinger metric, or in metrics on the underlying parameters which can be related to the Hellinger distance (or the average root-squared-Hellinger distance in the case of non-i.i.d. observations) relatively easily.
A challenge with our goals is that the contraction rate in terms of the L∞-distance is harder to obtain using the general theory of posterior contraction, because certain exponentially powerful tests as required by the theory can exist only on much smaller portions of the parameter space, and the role of the prior becomes a lot more important. The resulting rates may turn out to be suboptimal for some commonly used natural priors. This led to the investigation of optimal rates through more direct analyses of the properties of the posterior distribution. For the one-dimensional signal with Gaussian white noise model, Giné and Nickl [24] obtained the optimal rate using conjugacy with Gaussian priors on wavelet coefficients, while Hoffmann et al. [27] obtained adaptive L∞-contraction rates using a spike-and-slab prior on these coefficients. Castillo [9] introduced a method based on a Bernstein-von Mises theorem to study sup-norm contraction rates; this method leads to minimax contraction rates in (univariate) density estimation with a wavelet series prior on the log-density, and in the signal with Gaussian white noise model using non-conjugate priors. Yoo and Ghosal [48] established optimal L∞-rates for the nonparametric multivariate regression function and its mixed partial derivatives using a tensor products of B-splines prior. Yoo et al. [50] obtained adaptive posterior L∞-contraction rates in the same setup using a spike-and-slab prior. Shen and Ghosal [41] studied both L2- and L∞-rates for a density and its derivatives and obtained the minimax rate for the L2-distance, but a suboptimal one for the L∞-distance. Castillo and Nickl [10] obtained exact asymptotic L∞-coverage of an optimal-sized Bayesian credible set in the signal with a Gaussian white noise model through a Bernstein-von Mises theorem in a larger space. For the multivariate regression problem, Yoo and Ghosal [48] showed that a suitably inflated Bayesian credible L∞-ball around the posterior mean has high frequentist coverage and the optimal size. A more recent paper, Castillo and van der Pas [11], derived optimal L∞-rates for the univariate hazard rate function. Nickl and Ray [35] studied d-dimensional diffusions and obtained the optimal L∞-rates for the drift vector field when d ≤ 4 and sub-optimal L∞-rates for d ≥ 5.
In this paper, we address the questions of posterior contraction and coverage of credible sets for level sets of functions appearing in nonparametric modeling. We address these problems by linking the posterior contraction rates and credible sets for level sets with those on the underlying function in terms of the L∞-distance. More specifically, three nonparametric models are considered: signal with Gaussian white noise, nonparametric Gaussian regression, and a multivariate smooth density on the d-dimensional hypercube [0, 1]^d. For the first and the third setup, we slightly extend known L∞-contraction results of respectively Hoffmann et al. [27] and Castillo [9] to the higher dimensional settings. Unfortunately, we find that the approach of Castillo [9] does not seem to give the optimal L∞-rate for a multivariate density function, although it does improve upon the rate using a finite random series prior based on tensor products of B-splines; see Proposition 6.2 of Shen and Ghosal [41]. For the nonparametric Gaussian regression problem, the L∞-posterior contraction results on the function and its mixed partial derivatives are directly usable from Yoo et al. [50]. The L∞-contraction rates for the signal with Gaussian white noise and nonparametric Gaussian regression also automatically adapt to the smoothness of the underlying function, leading to adaptive posterior contraction rates for the level sets.
The paper is organized as follows. In the next section, we present notations, preliminaries, and a brief tutorial on wavelets on the domain [0, 1] d . In Section 3, we present results on posterior contraction rates for level sets for all three models. The coverage of credible regions is addressed in Section 4. Computational algorithm and simulation results are presented in Section 5. The proofs and other related results are given in Section 6. Unless otherwise stated, throughout the paper it is implicitly assumed that d ≥ 2.

Notations and definitions
Let N = {1, 2, . . .}, N_0 = {0, 1, 2, 3, . . .} and let Z be the set of integers. Given two real sequences a_n and b_n, a_n = O(b_n) or a_n ≲ b_n means that a_n/b_n is bounded, while a_n = o(b_n) or a_n ≪ b_n means that a_n/b_n → 0. Also, a_n ≍ b_n means that both a_n = O(b_n) and b_n = O(a_n). For a sequence of random elements Z_n, Z_n = O_p(a_n) means that P(|Z_n| ≤ C_n a_n) → 1 for every C_n → ∞.
Consider f : U → R defined on some bounded set U ⊂ R^d. The L∞-norm (sup-norm) is denoted by ‖f‖_∞ = sup_{x∈U} |f(x)|. The L_2-norm (with respect to Lebesgue measure) is denoted by ‖·‖_2, with the associated inner product ⟨·, ·⟩. Any integration without specification of the underlying measure is implicitly taken with respect to the Lebesgue measure. For g : U → R on some bounded set U ⊂ R^d, let ∇g be the gradient of g, which is a d × 1 vector of functions.
For α > 0, the Hölder space H^α(U) consists of functions f whose mixed partial derivatives D^r f of orders |r| ≤ α̲ exist, with the norm
‖f‖_{∞,α} = max_{|r| ≤ α̲} ‖D^r f‖_∞ + max_{|r| = α̲} sup_{x ≠ y} |D^r f(x) − D^r f(y)| / ‖x − y‖^{α−α̲}
finite, where the supremum is taken over the support of f and α̲ is the largest integer strictly smaller than α. Given two sets A and B in a Euclidean space, let d(x, B) = inf{‖x − y‖ : y ∈ B}. The Hausdorff distance between A and B is defined as
d_H(A, B) = max{ sup_{x∈A} d(x, B), sup_{y∈B} d(y, A) }. (1)
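As a concrete illustration of the Hausdorff distance between discretized level sets, the following minimal Python sketch computes it for two hypothetical bivariate functions on a grid; the functions, grid, and tolerance are illustrative stand-ins and not part of the paper's setup.

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between finite point sets A (m, d) and B (n, d):
    the maximum of the two directed distances sup_{x in A} d(x, B) and sup_{y in B} d(y, A)."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # pairwise distance matrix
    return max(D.min(axis=1).max(), D.min(axis=0).max())

# Example: approximate level sets {f = c} of two bivariate surfaces on a grid.
x = np.linspace(0, 1, 201)
X, Y = np.meshgrid(x, x)
f1 = np.exp(-((X - 0.5) ** 2 + (Y - 0.5) ** 2) / 0.05)
f2 = np.exp(-((X - 0.5) ** 2 + (Y - 0.5) ** 2) / 0.06)
c, tol = 0.5, 0.01                       # grid points within tol of the level c approximate the level set
L1 = np.column_stack([X[np.abs(f1 - c) < tol], Y[np.abs(f1 - c) < tol]])
L2 = np.column_stack([X[np.abs(f2 - c) < tol], Y[np.abs(f2 - c) < tol]])
print(hausdorff(L1, L2))
```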

Level sets
Let f : R^d → R and consider some value c that belongs to the range of f. Then the level set at c is given by L_c(f) = {x ∈ R^d : f(x) = c}; when there is no ambiguity about f, we simply write L_c. We shall assume the following conditions.
(A1) There exist ε > 0 and δ_1 > 0 such that for any c̃ ∈ [c − ε, c + ε] and any x with |f(x) − c̃| ≤ δ_1, we have d(x, L_{c̃}(f)) ≤ A_1 |f(x) − c̃|^{ν_1}, for some constants A_1 > 0 and ν_1 > 0.

Assumption (A1) precludes functions arbitrarily flat around the level c. In particular, the condition implies that L_c does not include any stationary points of the function f (i.e., points at which ∇f equals zero) or any flat part of f at the level c. Hence points of local extrema are excluded. To see this, consider t_n = c + ε_n for some small ε_n > 0 such that ε_n < min(δ_1, ε). Clearly, |f(x) − t_n| = ε_n ≤ δ_1 for any x ∈ L_c(f). Therefore, there exists some x_n such that f(x_n) = t_n and ‖x − x_n‖ ≤ A_1 ε_n^{ν_1}. Hence there exists some sequence x_n → x with f(x_n) > c. Similarly, it is straightforward to see that there also exists some sequence x_n → x with f(x_n) < c.
The following lemma gives a sufficient condition for (A1).

Lemma 2.2. Let f be a continuous function satisfying (A1). Then there exists δ > 0 such that ‖f − g‖_∞ < δ for a function g implies that
d_H(L_c(f), L_c(g)) ≤ C_1 ‖f − g‖_∞^{ν_1}, (3)
for some constant C_1 depending only on A_1 and ν_1.

The order in the bound above is sharp in the situation where the level set is a regular curve (ν_1 = 1). To see this, let f and g respectively be two-dimensional centered normal densities with covariance matrices the identity and (1 + δ) times the identity. Then for a given c, the level sets for both f and g are circles whose radii differ by the order of δ as δ → 0. Hence their Hausdorff distance is of the order of δ. Now, as both f and g are radial functions, their L∞-distance can be seen to be also O(δ), as f(0) − g(0) is of the order of δ and the maximum of g − f is also easily seen to be of the order of δ. Thus for small ‖f − g‖_∞, the inequality may not be improved except for the constant. This implies that the optimal rate for L∞-estimation is also the optimal rate for level set estimation in the Hausdorff metric in the present situation. The conclusion remains valid even if the domain is restricted to a disc with a sufficiently large radius.
We also consider another metric given by the Lebesgue measure of the symmetric difference between the regions enclosed by the level sets, namely, λ({f_1 ≥ c} △ {f_2 ≥ c}), where λ stands for the Lebesgue measure. We shall call this the Lebesgue metric and denote it by Leb(L_c(f_1), L_c(f_2)). The following assumption will be needed to study distances in terms of the Lebesgue metric, where ν_2 plays a role similar to ν_1 in Assumption (A1).
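For concreteness, the Lebesgue metric can be approximated on a grid as the measure of the symmetric difference of the two super-level sets; the sketch below uses hypothetical functions for illustration only.

```python
import numpy as np

def lebesgue_metric(f1_vals, f2_vals, c, cell_area):
    """Approximate Leb(L_c(f1), L_c(f2)) = lambda({f1 >= c} symmetric-difference {f2 >= c})
    from function values on a regular grid; cell_area is the area of one grid cell."""
    sym_diff = (f1_vals >= c) != (f2_vals >= c)
    return sym_diff.sum() * cell_area

x = np.linspace(0, 1, 401)
X, Y = np.meshgrid(x, x)
f1 = np.exp(-((X - 0.5) ** 2 + (Y - 0.5) ** 2) / 0.05)
f2 = np.exp(-((X - 0.5) ** 2 + (Y - 0.5) ** 2) / 0.06)
print(lebesgue_metric(f1, f2, c=0.5, cell_area=(x[1] - x[0]) ** 2))
```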

(A1′) There exist ε > 0 and δ_2 > 0 such that for any c̃ ∈ [c − ε, c + ε] and any 0 < δ ≤ δ_2, λ({x : |f(x) − c̃| ≤ δ}) ≤ A_2 δ^{ν_2}, for some constants A_2 > 0 and ν_2 > 0.

Lemma 2.3. Let f be a continuous function satisfying (A1′) and 1 ≤ p < ∞. Then there exists δ > 0 such that ‖f − g‖_∞ < δ for a function g implies that
Leb(L_c(f), L_c(g)) ≤ C_2 ‖f − g‖_∞^{ν_2}, (4)
for some constant C_2, which may be taken as A_2.
Note that the bivariate normal density function satisfies (A1′) with ν_2 = 1, and in this regular case, it can be argued similarly as above that the inequality (4) is sharp.

Wavelets on [0, 1]^d
There is a huge literature on both theory and applications of wavelets; see Härdle et al. [26], Giné and Nickl [25] and the references therein. Suppose that φ, ψ are the scaling function and wavelet of a Daubechies wavelet basis of L 2 (R).
A wavelet basis on [0, 1]^d (called the CDV wavelet basis, named after Cohen, Daubechies and Vial) can be constructed from φ, ψ starting from some sufficiently large fixed resolution level J_0; see Cohen et al. [16]. We write this basis as {φ_{J_0,k}, ψ_{j,k,i} : j ≥ J_0, k ∈ K(j), i ∈ I}, where K(j) = {0, . . . , 2^j − 1}^d and I is the set of sequences i = (i_1, . . . , i_d) of zeros and ones excluding i = (0, . . . , 0). Both φ and ψ can be constructed to be of sufficiently high regularity, say q-times (q > α) continuously differentiable with their derivatives up to the q-th order uniformly bounded. Writing ψ^0_{j,k}(·) := φ_{j,k}(·) and ψ^1_{j,k}(·) := ψ_{j,k}(·), the tensor-product basis functions are φ_{J_0,k}(x) = ∏_{l=1}^d φ_{J_0,k_l}(x_l) and ψ_{j,k,i}(x) = ∏_{l=1}^d ψ^{i_l}_{j,k_l}(x_l), and a wavelet series for f ∈ L_2([0, 1]^d) can be written as
f = Σ_{k∈K(J_0)} ⟨f, φ_{J_0,k}⟩ φ_{J_0,k} + Σ_{j≥J_0} Σ_{i∈I} Σ_{k∈K(j)} ⟨f, ψ_{j,k,i}⟩ ψ_{j,k,i}.
It should be pointed out that the summation over i ∈ I may be omitted throughout the proofs with the understanding that |I| ≤ 2^d, and thus inclusion or exclusion of this summation does not affect the asymptotic analysis. Therefore, up to renumbering indices, it is more convenient to write
f = Σ_{j≥0} Σ_{k∈Λ_j} θ_{j,k} Ψ_{j,k},
where the level j = 0 corresponds to the father wavelet and Λ_j denotes the corresponding index set at level j. In the proofs, we shall use the following properties of this wavelet basis.
when α is an integer. (iii) Ψ_{j,k} has support S_{j,k} with area (or volume) at most 2^{−jd}, and ‖Ψ_{j,k}‖_∞ ≲ 2^{jd/2}, ∫|Ψ_{j,k}| ≲ 2^{−jd/2}. (iv) For a fixed level j, given a fixed Ψ_{j,k} with support S_{j,k}, the number of wavelets of level j′ < j with support intersecting S_{j,k} is bounded by a universal constant (independent of j, j′, k); the number of wavelets of level j′ > j with support intersecting S_{j,k} is bounded by 2^{(j′−j)d} times a universal constant.
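To make the tensor-product construction above concrete, the following sketch evaluates a truncated wavelet series on [0, 1]^2, with the Haar scaling function and wavelet standing in for the smoother Daubechies/CDV pair assumed in the paper; the coefficients and the truncation level are illustrative.

```python
import numpy as np

def haar_phi(t):   # Haar scaling (father) function on [0, 1)
    return ((0.0 <= t) & (t < 1.0)).astype(float)

def haar_psi(t):   # Haar wavelet (mother) function
    return ((0.0 <= t) & (t < 0.5)).astype(float) - ((0.5 <= t) & (t < 1.0)).astype(float)

def Psi(x, j, k, i):
    """Tensor-product basis function Psi_{j,k,i}(x) = prod_l 2^{j/2} psi^{(i_l)}(2^j x_l - k_l),
    with psi^{(0)} = phi and psi^{(1)} = psi, evaluated at rows of x in [0,1]^d."""
    x = np.atleast_2d(x)
    out = np.ones(x.shape[0])
    for l, (kl, il) in enumerate(zip(k, i)):
        base = haar_phi if il == 0 else haar_psi
        out *= 2 ** (j / 2) * base(2 ** j * x[:, l] - kl)
    return out

# A truncated random wavelet series f = sum theta_{j,k,i} Psi_{j,k,i} on [0,1]^2 (d = 2).
rng = np.random.default_rng(0)
d, J = 2, 3
grid = np.column_stack([a.ravel() for a in np.meshgrid(*([np.linspace(0, 1, 50)] * d))])
f_vals = np.zeros(grid.shape[0])
for j in range(J + 1):
    for k in np.ndindex(*([2 ** j] * d)):
        for i in np.ndindex(*([2] * d)):
            if j == 0 or any(i):              # the pure scaling tensor only at the coarsest level
                f_vals += rng.normal(scale=2.0 ** (-j)) * Psi(grid, j, k, i)
```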

Posterior contraction rates for level sets
Given a prior Π on the parameters describing f , let Π(·|D n ) denote the posterior distribution, where n is the sample size, D n are the observations, and E Π (·|D n ) is the expectation under the law Π(·|D n ). When the context of the prior is clear, we may drop the superscript Π in E Π (·|D n ). If the observations D n follow (the n-fold product of) the true law P 0 , let E 0 denote the expectation under the true law. We denote the true function (signal function, regression function or density function) by f 0 .

Signal with Gaussian white noise on [0, 1]^d
Consider the multivariate white noise model, which is defined through the stochastic differential equation dY(t) = f(t) dt + n^{−1/2} dW(t), t ∈ [0, 1]^d, where dW is defined through a multivariate stochastic integral with respect to independent standard Brownian motions (W_1(t_1), . . . , W_d(t_d)). With the above choice of the basis, the equivalent sequence space model is given by Y_{j,k} = θ_{j,k} + n^{−1/2} ε_{j,k}, where the parameters are given by θ_{j,k} := ⟨f, Ψ_{j,k}⟩ and the ε_{j,k} are all i.i.d. N(0, 1). As in Hoffmann et al. [27], we put a spike-and-slab prior on θ_{j,k}. Let J_n be a deterministic increasing function of n to be defined in the following theorem, and let the θ_{j,k}'s be drawn independently as θ_{j,k} ∼ (1 − w_{j,n}) δ_0 + w_{j,n} g for j ≤ J_n and θ_{j,k} = 0 for j > J_n; here δ_0 is the point mass at zero, w_{j,n} ∈ [0, 1] are mixing weights, and g is a bounded density function on R which satisfies
inf{g(x) : |x| ≤ L_0} > 0 (5)
for some suitable L_0 > 0. Let f_0 stand for the true signal function and let P_0 be the true distribution of the observed signal. A multivariate version of Theorem 3.1 of Hoffmann et al. [27] implies the following theorem.
Remark 3.1. The corresponding rate for the level sets of the derivatives D^r f of the function is given by (n/ log n)^{ν_l(|r|−α)/(d+2α)}, where |r| < α and l = 1, 2 for cases (i) and (ii), respectively.
The rate above is the optimal L∞-rate for α-smooth functions in d dimensions. In view of the sharpness of (3) when ν_1 = 1 (or (4) when ν_2 = 1) discussed earlier, this rate is also optimal for estimating a level set in terms of the Hausdorff and Lebesgue metrics. Since the prior does not depend on the smoothness level α of the underlying true function, the posterior contraction rate is adaptive.
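The sequence-space reduction and the spike-and-slab prior of this subsection can be sketched in a few lines; the slab density (uniform on [−L_0, L_0]), the mixing weight, and the values of n and J_n below are illustrative assumptions, not the choices required by the theorem.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, J_n, L0, w = 10_000, 2, 4, 2.0, 0.3   # illustrative choices only

# Draw theta_{j,k} ~ (1 - w) delta_0 + w g for j <= J_n and theta_{j,k} = 0 for j > J_n;
# here the uniform density on [-L0, L0] stands in for a bounded slab density g satisfying (5).
theta = {}
for j in range(J_n + 1):
    for k in np.ndindex(*([2 ** j] * d)):
        theta[(j, k)] = rng.uniform(-L0, L0) if rng.random() < w else 0.0

# Equivalent sequence-space observations: Y_{j,k} = theta_{j,k} + n^{-1/2} eps_{j,k}, eps iid N(0,1).
Y = {jk: th + rng.normal() / np.sqrt(n) for jk, th in theta.items()}
```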

Nonparametric Gaussian regression
Consider the Gaussian nonparametric regression problem with observations Y_i = f(X_i) + ε_i, ε_i ∼ N(0, σ^2) independently, X_i ∈ [0, 1]^d, i = 1, . . . , n. Our study allows for both fixed and random designs. For the fixed design case, X is considered fixed and each Y_i given X_i is independently distributed. For the random design case, each X_i is further assumed to be independent and identically distributed (i.i.d.). If the covariates are random, they are assumed to follow the uniform distribution on [0, 1]^d. If the covariates are deterministic, let G_n stand for the empirical distribution function of X_1, . . . , X_n. We assume that sup_{x∈[0,1]^d} |G_n(x) − U(x)| is suitably small, as in condition (6), where U stands for the cumulative distribution function of the uniform distribution on [0, 1]^d. More generally, if the relation holds with U replaced by a cumulative distribution function G having a positive and continuous density, then the case can be reduced to (6) by a transformation on x, but then the interpretation of the regression function will change too. However, the level sets are not affected by a transformation of x, so we can assume that the covariates are uniformly distributed. As in Yoo et al. [50], the priors on θ_{j,k} = ⟨f, Ψ_{j,k}⟩ and σ^2 are mutually independent. Specifically, the θ_{j,k}'s are drawn independently from a spike-and-slab prior of the same form as in the previous subsection; here, the density function g also satisfies Condition (5). The prior for σ is taken to be some positive and continuous density on (0, ∞).
Assume that the true error distribution is sub-Gaussian with mean zero and variance σ_0^2. In view of Theorem 4.2 of Yoo et al. [50], the following theorem and remark readily follow.

Theorem 3.2. Consider the spike-and-slab prior defined above with 2^{J_n} ≍ (n/ log n)^{1/(2d)}. Then the assertions (i) and (ii) of Theorem 3.1 hold for the level sets of the regression function, with the rates (n/ log n)^{−ν_1α/(2α+d)} and (n/ log n)^{−ν_2α/(2α+d)}, respectively.

Remark 3.2. The corresponding rate for the level sets of the derivatives D^r f of the function is given by (n/ log n)^{ν_l(|r|−α)/(d+2α)}, where |r| < α and l = 1, 2 for cases (i) and (ii), respectively.
As the prior is free of the smoothness level α, the posterior of the level set of the regression function contracts adaptively at the rate induced by the optimal L∞-rate (n/ log n)^{−α/(2α+d)}, which is also the optimal rate for level sets in the regular situation, in view of the sharpness of (3) and (4).

Density estimation
Let X_1, . . . , X_n be i.i.d. observations from P_0, whose density function f_0 is supported on [0, 1]^d. Following Castillo [9], we put a prior through an exponentiation of a wavelet prior, generalized to d dimensions. More specifically, let f = exp{T − c(T)}, where T = Σ_{j≤J_n} Σ_{k∈K(j)} σ_{j,k} α_{j,k} Ψ_{j,k}, c(T) = log ∫_{[0,1]^d} e^{T(x)} dx, and the α_{j,k} are drawn independently from some fixed density ϕ if j ≤ J_n and are set to 0 if j > J_n. There are two types of density functions ϕ under consideration. The first is the log-Lipschitz case, that is, log(ϕ) is a Lipschitz function. For this class, we assume condition (7) on the scale parameters σ_{j,k}. The second type is to take ϕ to be the standard normal density function. In this case, we assume that condition (8) holds for some r such that 0 < r < α − d/4.
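A minimal sketch of a draw from a prior of this exponentiated-series type on a grid is given below; a tensor cosine basis and a simple polynomial decay of the scale parameters stand in for the wavelet basis Ψ_{j,k} and for conditions (7)–(8), purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 400)
X1, X2 = np.meshgrid(x, x)                       # grid on [0,1]^2

# T(x) = sum_j sigma_j * alpha_j * e_j(x); tensor cosines stand in for Psi_{j,k}, and the
# decay sigma_{j1,j2} = (1 + j1 + j2)^{-2} is illustrative, not the decay required by (7)/(8).
T = np.zeros_like(X1)
J = 6
for j1 in range(J + 1):
    for j2 in range(J + 1):
        sigma = (1.0 + j1 + j2) ** (-2)
        T += sigma * rng.standard_normal() * np.cos(np.pi * j1 * X1) * np.cos(np.pi * j2 * X2)

# c(T) = log of the integral of exp(T) over [0,1]^2, approximated by the grid average; f = exp(T - c(T)).
cT = np.log(np.exp(T).mean())
f = np.exp(T - cT)
print(f.mean())   # approximately 1: the draw f integrates to one over [0,1]^2
```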

Theorem 3.3. Suppose the true density f_0 is a positive density function in H^α([0, 1]^d) that is bounded away from 0 and ∞, and use the priors described above with 2^{J_n} ≍ (n/ log n)^{1/(2α+d)}. Then for any K_n → ∞, the posterior distributions of d_H(L_c(f), L_c(f_0)) and Leb(L_c(f), L_c(f_0)) contract at the rates implied, through (3) and (4), by the L∞-rate of Proposition 6.2.

As the prior relies on the knowledge of the smoothness level α, the above rate is not adaptive. It is also useful to point out that the suboptimal rate arises from the suboptimal L∞-rate obtained for the multivariate density function when d > 1; see the discussion following the proof of Proposition 6.2. It is also useful to compare the obtained rate with those obtained by Gayraud and Rousseau [20] and Polonik [36]. The former used the weaker Lebesgue measure distance and concluded that for d = 2, the rate is max{n^{−α/(3α+3)}, n^{−1/4}}, which is always weaker than the rate in Theorem 3.3 for smoothness α ≥ 2. The latter paper used different conditions, which include a metric entropy condition for the underlying class of candidate sets, and a different metric. But when the true density is bounded and bounded away from zero, that metric behaves like the Lebesgue measure metric. Under a comparable situation, which involves at least the existence of a gradient, the rate is at best n^{−1/3}, which is always weaker than ours for smoothness α ≥ 3. Thus even though our obtained rate for the density is weaker than the anticipated optimal rate (n/ log n)^{−α/(2α+2)}, we improve upon the available results when the function is at least moderately smooth. Unlike ours, their rates are capped and do not improve with smoothness. Further, our rate is also with respect to the Hausdorff distance, which is typically stronger than the Lebesgue measure metric in a bounded domain, if measured from a curve with bounded length (that is, not too wiggly).

Optimal credible regions with assured coverage
In this section, we provide credible regions for level sets with sufficient coverage for the signal with a Gaussian white noise model using the trigonometric series prior, and the Gaussian nonparametric regression using B-splines series priors assuming that the targeted smoothness level of the true function is known. The issue of adaptive size and coverage is a lot more complex. It is known that even for a one-dimensional signal in a white noise model with the L 2 -distance, it is impossible to obtain coverage with adaptive size sets by any method, Bayesian or not; see for instance Li [30], Baraud [2], Cai and Low [8]. Size of a credible region can adapt to the underlying complexity only if certain parts of the parameter space are removed from consideration by using restrictions like "self-similarity", "polished tail" or "excessive bias restriction"; see Szabó et al. [45], Belitser and Nurushev [4], Belitser [3], Sniekers and van der Vaart [44] and Ray [38] for results on adaptive Bayesian coverage in certain conjugacy settings with the size measured by L 2 -type distances. At present, it is unclear whether L ∞ -adaptive credible sets are possible to obtain, and if possible, which part of the parameter space is to be removed to maintain coverage with adaptive L ∞ -size, as the notion of a bias-variance decomposition for the L 2 -setting is not easily extended to the L ∞ -setting. Thus we study coverage of credible sets only under the known smoothness setting for the signal with a Gaussian white noise model and the nonparametric Gaussian regression model using conjugate priors, respectively based on multivariate trigonometric series and tensor products of B-splines. The conjugacy allows certain explicit calculations necessary for lower-bounding the size of the posterior spread, which is the key to obtaining frequentist coverage. The approach will induce credible regions for the level sets from those on the function constructed using posterior quantiles of the L ∞ -spread of the function around its center. The approach is similar to that of Yoo and Ghosal [49] and Li and Ghosal [31] respectively for the mode and the filaments of the regression function.

Signal with Gaussian white noise model
We start with the Gaussian white noise model. Let L̂ be the level set induced by f̂, where f̂ is the posterior mean function. For some 0 < γ < 1/2, let R_{n,γ} denote the (1 − γ)-quantile of the posterior distribution of ‖f − f̂‖_∞. Let
C_γ := {L_c(f) : ‖f − f̂‖_∞ ≤ ρ R_{n,γ}}, (9)
where ρ is some sufficiently large positive constant. The following theorem provides a credible region with sufficiently high frequentist coverage in the signal with Gaussian white noise model using a trigonometric series prior.
To obtain the coverage of credible regions, conjugacy will be essential. The prior will be based on conjugate normal distributions on the coefficients of an orthogonal basis expansion. While tensor products of CDV wavelets can again be used, we illustrate the results using the basis of tensor products of trigonometric functions. The details of the proofs of the posterior contraction rates for these two priors are different. Expand the function f_0 in the basis as f_0 = Σ_j Σ_{k∈K(j)} θ_{0,j,k} φ_{j,k} for some collection of coefficients {θ_{0,j,k} : j ∈ {0, 1, . . .}^d, k ∈ K(j)}. For the sake of brevity, we may suppress the ranges of the indices of summation and just write the expansion as Σ_j Σ_k θ_{0,j,k} φ_{j,k}. While the posterior contraction rate and the convergence of the Bayes estimator can be obtained as before, we only present results on uncertainty quantification. The theorem below holds for d = 2 or 3. We do not prove it for d > 3, as the terms quickly become very cumbersome and the argument becomes tedious. However, by inspecting the proofs, we conjecture that it holds also for d > 3.
for some positive constant L and α > d. Then the following assertions hold: (i) for the credible region C_γ defined in (9), both the credibility and its coverage probability for L_c(f_0) tend to 1; (ii) for C̃_γ := {L_c(f) : Leb(L_c(f), L̂) ≤ C_2(ρR_{n,γ})^{ν_2}}, both the credibility and its coverage tend to one, where C_2 is as in (4).

Remark 4.1. Note that the smoothness condition (10) is stronger than conditions imposed commonly on Fourier coefficients to quantify their decay using square summability. This becomes necessary as such conditions characterize only L_2-Sobolev-type smoothness, not Hölder-type. The condition (10) implies L∞-type bounds and is slightly stronger than a typical Hölder condition.

Remark 4.2.
We note that the size of the credible region matches with the optimal rate up to a logarithmic factor in the regular situations.

Remark 4.3.
An alternative use of a posterior distribution in Bayesian analysis is to quantify the uncertainty that the level set will be contained in C̄_γ = {x : |f̂(x) − c| < ρR_{n,γ}}. This is attractive because of its simplicity, as the set is just a band in R^d instead of a subset of the space of curves in R^d. To examine its credibility, note that Π(L_c(f) ⊂ C̄_γ | D_n) ≥ Π(‖f − f̂‖_∞ < ρR_{n,γ} | D_n), which tends to one by the proof of Theorem 4.1. Similarly, the frequentist probability that L_c(f_0) ⊂ C̄_γ also tends to one. Also observe that if the gradient is nonzero throughout on the level curve, then the spread of the band is of the order of R_{n,γ}. The advantage of this set C̄_γ is that the posterior mean and R_{n,γ} can be obtained from posterior sampling. If one computes the set {x : |f̂(x) − c| < ρR_{n,γ}} numerically, it will serve as a confidence set. Here, instead of relying on the bootstrap to get the quantiles, the cut-off R_{n,γ} is obtained from the posterior. This credible set has to be interpreted as a possible region of controlled size that contains the true level set.
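The band described in this remark can be sketched directly from posterior draws evaluated on a grid; the draws below are synthetic placeholders, and the values of γ and ρ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
grid = np.linspace(0, 1, 201)
X, Y = np.meshgrid(grid, grid)

# Synthetic stand-in for posterior draws of f on the grid: a fixed surface plus smooth perturbations.
f0 = np.exp(-((X - 0.5) ** 2 + (Y - 0.5) ** 2) / 0.05)
draws = np.stack([f0 + 0.02 * rng.standard_normal() * np.cos(2 * np.pi * X)
                  + 0.02 * rng.standard_normal() * np.sin(2 * np.pi * Y) for _ in range(200)])

f_hat = draws.mean(axis=0)                                            # posterior mean surface
sup_dev = np.abs(draws - f_hat).reshape(len(draws), -1).max(axis=1)   # ||f - f_hat||_inf for each draw
gamma, rho, c = 0.1, 1.2, 0.5
R = np.quantile(sup_dev, 1 - gamma)                                   # posterior quantile R_{n,gamma}
band = np.abs(f_hat - c) < rho * R                                    # the band {x : |f_hat(x) - c| < rho R}
print(R, band.mean())
```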

Nonparametric Gaussian regression
For credible sets with coverage in the Gaussian nonparametric regression problem, we shall use B-spline functions. Choose a fixed order q and a sequence of knots 0 = t_0 < · · · < t_{N+1} = 1. Denote the B-spline functions of order q by B_1, . . . , B_{q+N}, and form the tensor products B_{i_1,...,i_d}(x) = ∏_{k=1}^d B_{i_k}(x_k). Let b_{J_1,...,J_d}(x) = (B_{i_1,...,i_d}(x) : 1 ≤ i_k ≤ J_k, k = 1, . . . , d)^T be a column vector of tensor products of B-spline functions; here N_k denotes the number of interior knots and J_k = q_k + N_k denotes the number of basis functions used for each coordinate. If desired, N_k, q_k and the knot locations t_{k,0}, . . . , t_{k,N_k+1} can be chosen differently for different k. Let δ_{k,ℓ} = t_{k,ℓ+1} − t_{k,ℓ}. We assume the quasi-uniformity of the knots, in that max{δ_{k,ℓ} : 0 ≤ ℓ ≤ N_k}/min{δ_{k,ℓ} : 0 ≤ ℓ ≤ N_k} ≤ C for some C > 0. To ensure the quality of the approximation, we fix q_k > α, and let N_k = N_k(n) and J_k = J_k(n) := N_k(n) + q_k.
As in Yoo and Ghosal [48], we put a prior on the regression function through the representation f = b^Tθ, θ | σ^2 ∼ N(θ_0, σ^2Λ_0), where c_1I ≤ Λ_0 ≤ c_2I for some constants 0 < c_1 ≤ c_2 < ∞ and I is the J × J identity matrix with J = ∏_{k=1}^d J_k. Using B to denote (b_{J_1,...,J_d}(X_1), . . . , b_{J_1,...,J_d}(X_n))^T, the model becomes Y | (X, θ, σ^2) ∼ N(Bθ, σ^2I_n). It then follows that the posterior distributions of f and its partial derivatives are Gaussian processes (GP), obtained by conjugacy. To handle σ^2, we can either put a conjugate inverse-gamma prior σ^2 ∼ IG(a/2, b/2) with shape parameter a/2 > 2 and rate parameter b/2 > 0 or plug in an estimate for σ^2. The theoretical study is similar in both cases, using the consistency of the marginal maximum likelihood estimator of σ, so for ease of exposition, we consider the second approach only. The plug-in posterior is obtained by substituting the estimate for σ in the conjugate posterior. The true errors ε_i are assumed to be sub-Gaussian with mean 0 and variance σ_0^2 for i = 1, . . . , n. Notice that under P_0, x ↦ A_r(x)ε/σ_0 is a mean-zero process with a sub-Gaussian tail. As in the adaptive case, the result below extends to both fixed and random designs, except that Condition (6) can be replaced by its analog with U replaced by G, where G can be taken as the uniform distribution for the fixed design and as the sampling distribution of X_i for the random design.
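A sketch of the conjugate computation for the tensor-product B-spline prior in d = 2 is given below, using SciPy's BSpline evaluation; the data-generating function, knots, and hyperparameters are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(4)
n, q, N_int = 500, 4, 6                       # sample size, spline order, interior knots (illustrative)
deg = q - 1
knots = np.concatenate([np.zeros(deg), np.linspace(0, 1, N_int + 2), np.ones(deg)])
J = len(knots) - q                            # basis functions per coordinate (J = q + N_int)

def design_1d(x):
    """All J B-spline basis functions of order q evaluated at the points x."""
    return BSpline(knots, np.eye(J), deg)(x)  # shape (len(x), J)

def design(X):
    """Tensor-product design matrix with rows b(X_i)^T, X of shape (m, 2)."""
    B1, B2 = design_1d(X[:, 0]), design_1d(X[:, 1])
    return np.einsum('ij,ik->ijk', B1, B2).reshape(len(X), J * J)

X = rng.uniform(size=(n, 2))
f0 = np.sin(2 * np.pi * X[:, 0]) * np.cos(np.pi * X[:, 1])   # hypothetical true regression values
Y = f0 + 0.1 * rng.standard_normal(n)

# Prior theta | sigma^2 ~ N(theta0, sigma^2 Lambda0) with Lambda0 = I gives the conjugate posterior
# theta | Y ~ N(mu_n, sigma^2 Sigma_n), Sigma_n = (B^T B + Lambda0^{-1})^{-1}, mu_n = Sigma_n (B^T Y + Lambda0^{-1} theta0).
B = design(X)
theta0, Lambda0_inv = np.zeros(J * J), np.eye(J * J)
Sigma_n = np.linalg.inv(B.T @ B + Lambda0_inv)
mu_n = Sigma_n @ (B.T @ Y + Lambda0_inv @ theta0)             # posterior mean; f_hat(x) = b(x)^T mu_n
```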
The following result shows the coverage of a Bayesian credible set.

Theorem 4.2. Under the above setup, both the credibility of the credible region and its coverage probability for L_c(f_0) tend to 1, and its size in the Hausdorff metric is of the order ((log n)/n)^{ν_1α/(2α+d)}. If f_0 satisfies Assumption (A1′), then (iv) the credibility and coverage of C̃_γ := {L_c(f) : Leb(L_c(f), L̂) ≤ C_2(ρR_{n,γ})^{ν_2}} both tend to one, where C_2 is as in (4).

Algorithm
We give a description of the two generic algorithms we use for finding level sets.
Points approximately on the level set can be generated by a slice sampler that gives rise to the invariant distribution whose density is proportional to exp{−(f(x) − c)^2/ā}.
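The sketch below targets the invariant density proportional to exp{−(f(x) − c)^2/ā}, with a random-walk Metropolis step standing in for the slice sampler; the surface f, the level c, the value of ā, and the step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x):                                    # hypothetical surface whose c-level set is a circle
    return np.exp(-((x[0] - 0.5) ** 2 + (x[1] - 0.5) ** 2) / 0.05)

c, a_bar, step, n_iter = 0.5, 1e-3, 0.05, 20_000

def log_target(x):                           # log of the invariant density, up to a constant
    return -(f(x) - c) ** 2 / a_bar

x = np.array([0.5, 0.7])                     # start near the level set
samples = []
for _ in range(n_iter):
    prop = x + step * rng.standard_normal(2)
    if np.all((prop >= 0) & (prop <= 1)) and np.log(rng.random()) < log_target(prop) - log_target(x):
        x = prop
    samples.append(x.copy())
samples = np.array(samples)                  # points concentrating near {x : f(x) = c} for small a_bar
```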

Simulation results
We work with the nonparametric Gaussian regression setting. In the simulation, we consider a test function defined through g, the normal density function with mean 0.5; the errors are drawn from the normal distribution with mean 0 and standard deviation 0.1, and the responses are then set according to the regression model. The true level set consists of two circles, as given in Figure 1. The sample size is 2000. We use fifth-order B-spline functions, that is, q_1 = q_2 = 5. One can choose the pair (J_1, J_2) by its posterior mode, by maximizing the marginal likelihood (in the logarithmic scale). We give the results using a simulated-annealing based algorithm. We choose τ = 0.5 and ā = 10^{−5}. The tuning parameters J_1 and J_2 are both chosen to be 9 based on the pilot experiments.
We also experimented with different choices of J_1 and J_2. Figure 2 shows the posterior mean under different smoothing levels. When J_1 = J_2 = 7, just slightly smaller than 9, it can be seen that the posterior mean completely fails to approximate the inner circle. The approximation seems quite reasonable with a larger J. Figure 3 gives uncertainty quantification for the two cases J_1 = J_2 = 7 and J_1 = J_2 = 9. We choose γ = 0.1 and ρ = 1.2. This choice of ρ should give sufficiently large, but not too high, credibility; this can be checked by some pilot simulation using the posterior samples. To evaluate the (1 − γ)-quantile R_{n,γ}, we first draw 200 posterior samples of θ and compute their posterior mean θ̄. We compute ‖b_{J_1,J_2}(·)^T(θ − θ̄)‖_∞ by searching on a crude grid and picking the maximum point on the grid. Then, starting from this maximum point, we apply the gradient ascent method to check whether nearby points can achieve a greater (absolute) value. We keep the largest value as the supremum. The (1 − γ)-empirical quantile over all these suprema gives our estimate of R_{n,γ}. Finally, we draw 100 posterior level sets (level sets induced from the posterior samples of f) and keep those that fall in the set C_γ. The curves of different colors correspond to these level sets. It is obvious that undersmoothing tends to deliver better and more robust coverage. Now we assess the credibility and coverage performance over 100 iterations. Some experimentation suggests that the choice ρ = 1.2 works well, giving 97.13% credibility averaging over all iterations. To evaluate the coverage performance, we compute the Hausdorff distance between the truth L*_c and L̂_c. Simulation shows that L*_c belongs to C̄_γ respectively 88%, 90%, 94%, 96% and 98% of the time when C takes the values 0.21, 0.22, 0.23, 0.24 and 0.25, respectively. Inspecting the proof of Lemma 2.1, the constant A_1 may be estimated as 1/inf{‖∇f(x)‖ : x ∈ {f = c}}. When we compute with C_1 = 6A_1 in this way, the frequentist coverage turns out to be 100%, as the theory predicts.
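A sketch of the quantile computation described above follows: for each posterior draw θ, the supremum of |b(x)^T(θ − θ̄)| is approximated on a crude grid and then refined by a local optimizer (standing in for the gradient ascent step); the basis and the posterior draws below are synthetic placeholders.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)

# Synthetic stand-ins: a tensor cosine basis b(x) and placeholder posterior draws of the coefficients.
J = 5
def b(x):
    j1, j2 = np.meshgrid(np.arange(J), np.arange(J), indexing='ij')
    return (np.cos(np.pi * j1 * x[0]) * np.cos(np.pi * j2 * x[1])).ravel()

draws = rng.normal(scale=0.05, size=(200, J * J))
theta_bar = draws.mean(axis=0)

grid = np.array([[u, v] for u in np.linspace(0, 1, 21) for v in np.linspace(0, 1, 21)])
sups = []
for theta in draws:
    dev = lambda x: -abs(b(x) @ (theta - theta_bar))     # negative |b(x)^T (theta - theta_bar)|
    x0 = grid[np.argmax([-dev(x) for x in grid])]        # crude grid search for the maximizer
    res = minimize(dev, x0, bounds=[(0, 1), (0, 1)])     # local refinement from the grid maximizer
    sups.append(-res.fun)
R = np.quantile(sups, 0.9)                               # estimate of R_{n,gamma} for gamma = 0.1
```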
Consider events E_1, E_2 and E_3 defined through bounds of the form |θ_{j,k} − θ_{0,j,k}| ≤ γ̄√(log n/n), for some γ̄ sufficiently large and γ sufficiently small. Write ε_{n,r,α} := (log n/n)^{(α−|r|)/(2α+d)}. Since E_0[Π(E_1^c ∩ E_2 ∩ E_3 | Y)] → 0, it suffices to establish the required bound on the events E_1 ∩ E_2 ∩ E_3. To this goal, one can find some suitable choice of J_n(α) with 2^{J_n(α)} ≍ (n/ log n)^{1/(2α+d)}. The sum-of-max over (j, k) can be taken over three different regions: J̄_n(α) ∩ J_n(γ), J̄_n(α) ∩ (J_n(γ))^c and (J̄_n(α))^c. On the first region, since J_n(γ) ⊂ J̄_n(α) and J_n(α) < J_n, it can be bounded on E_1 by a constant multiple of ε_{n,r,α}. On the second region, |θ_{0,j,k}| < γ√((log n)/n) and, on E_3, θ_{j,k} = 0, together implying that 2^{(|r|+d/2)j}|θ_{j,k} − θ_{0,j,k}| is bounded by 2^{(|r|+d/2)j}√((log n)/n) up to some multiplicative constant. The sum-of-max over this region is again upper bounded by ε_{n,r,α}. On the third region, on E_3, noting that θ_{j,k} = 0, the term can be dominated by the corresponding tail sum for f_0, which is further bounded by ε_{n,r,α}.
Finally, it can be shown, by following the proof of Lemma 1 of Hoffmann et al. [27], that E_0[Π(E_1^c ∩ E_2 ∩ E_3 | Y)] → 0, noting that the dimension d does not alter most of the argument.
Proof of Theorem 3.3. The proof follows from the L∞-contraction rate for the multivariate density model derived in Proposition 6.2 below, in conjunction with (3) and (4).

Proposition 6.2. Given a positive density function f_0 ∈ H^α([0, 1]^d) with α > d that is bounded away from 0 and ∞
; assume either the Gaussian prior under condition (8) or the log-Lipschitz prior under condition (7). Then for 2^{J_n} ≍ (n/ log n)^{1/(2α+d)} and any K_n → ∞, we have E_0 Π(f : ‖f − f_0‖_∞ > K_n ρ_{n,1} | X_1, . . . , X_n) → 0, where ρ_{n,1} is specified in the proof below.

Proof. The proof mainly follows the argument in Castillo [9]. For clarity, we sketch the main ideas of the proof and present the key results that contribute to the final rates for the multivariate case (d > 1). We only provide proofs for key steps and those lemmas that are illustrative of the subtleties for the case d > 1, whereas other proofs are referred to the original paper. Given two density functions f_1 and f_2, let h(f_1, f_2) denote their Hellinger distance. Some key intermediate results (Lemmas 4 and 8 of Castillo [9]) can be shown to hold for any K_n → ∞. Let D_n be the corresponding measurable set and denote by Π_{D_n} the normalization of Π to D_n, that is, dΠ_{D_n} = 1_{D_n} dΠ(·)/Π(D_n). Let g_0 = log f_0 and g = log f = T − c(T). Since ‖f − f_0‖_∞ = ‖e^{g_0}(e^{g−g_0} − 1)‖_∞ ≲ ‖g − g_0‖_∞, using the bound |e^x − 1| ≲ |x| for bounded x and the fact that f_0 is bounded, it suffices to bound ‖g − g_0‖_∞, which is done via Markov's inequality, for any K_n → ∞ and c_n > 0. We then bound ‖g − g_0‖_∞ by three terms involving g_{J_n}, the L_2-projection of g up to level J_n. The second term is 0 by the definition of T, as the constant function is orthogonal to the higher level wavelets. For the third term, letting g_{0,j,k} denote the wavelet coefficients of g_0, we obtain a bound of the order 2^{−J_nα} ≍ ε_{n,α} = (n/ log n)^{−α/(2α+d)}. To estimate the first term on the right side of (18), we introduce a few more notations. Let ℓ_n denote the log-likelihood, and let P_0 denote integration with respect to the density f_0. Let ζ_{j,k} := Ψ_{j,k}/f_0 and let A_{j,k} denote the L_2-projection of ζ_{j,k} onto the wavelets up to the level J_n (for j ≤ J_n). By Lemma 6.4 and Lemma 6.7, the second term on the right-hand side of the resulting expression can be bounded by ε_{n,α}, so we shall focus on the first term. For any t > 0, using Jensen's inequality and e^{|x|} ≤ e^x + e^{−x}, the term E^{Π_{D_n}}(‖g_{J_n} − Γ_{J_n}‖_∞ | D_n) can be bounded through the quantities M_{j,k}(t). As argued on p. 2074 of Castillo [9], g_{j,k} − g_{0,j,k} can be decomposed suitably, and hence M_{j,k}(t) can be expressed accordingly. Since |ρ(x)| ≤ x^2 for x small and ‖f − f_0‖_∞ is suitably small on D_n, by the assumption α > d the corresponding remainder term is negligible on D_n. By Lemma 6.5, maximizing the upper bound (by choosing 2^j ≍ 2^{J_nα/(α+1)}) on D_n, and in view of the bounds in the above two displays, together with Lemma 6.4 and Lemma 6.6 (setting γ_n = A_{j,k}), we have, for any t > 0, a bound consisting of two terms, (21) and (22). The second term (22) on the right-hand side of the above expression is bounded by ρ_{n,2} := (n/ log n)^{(d−2α)/(2α+d)}(log n)^{2η_1}, and hence ρ_{n,2} ≲ ε_{n,α} as long as α > d.
As will be clear later, this term turns out to be the dominating factor for rates in the multivariate case (d > 1).
All that remains is to bound the first term (21) on the right-hand side of the above expression. One can write the ratio inside the parentheses as in (24). Recall that f = exp(T(x) − c(T)), where T = Σ_{λ≤J_n} Σ_u θ_{λ,u} Ψ_{λ,u} and θ_{λ,u} := σ_{λ,u} α_{λ,u}. The prior on f effectively is the prior on the coefficients θ = {θ_{λ,u}}, and therefore the numerator of (24) is the integration over the law of θ, whose density is denoted by p_θ. We then proceed to apply a change of variables in this integration, by changing θ_{λ,u} → θ*_{λ,u} := θ_{λ,u} − tn^{−1/2}⟨A_{j,k}, Ψ_{λ,u}⟩. Therefore, the numerator can be rewritten over some suitably transformed set D̃_n. By Lemma 6.8, for the Gaussian prior case, the term in the square bracket is bounded by exp{C|t| + Ct^2} for some constant C > 0. Therefore, (24) can be bounded by exp{C|t| + Ct^2}/Π(D̃_n | D_n). Now the term on the right-hand side of (21) can be bounded accordingly, which, by setting t = √j, can be further bounded by ε_{n,α} + ε_{n,α} log(1/Π(D̃_n | D_n)). Combining all the above results, we obtain the bound on ‖g − g_0‖_∞. We choose c_n = K_n ε*_{n,α}, where K_n → ∞ is a given sequence, and ε*_{n,α} := max(ε_{n,α}, ρ_{n,1}, ρ_{n,2}) in (17). By (16), we obtain the desired conclusion. By the assumption α > d > 1, ρ_{n,2} ≲ ε_{n,α} ≲ ρ_{n,1}, thus ε*_{n,α} = ρ_{n,1}, completing the proof for the Gaussian prior.

Remark 6.1. For d = 1, the optimal rate ε_{n,α} := (n/ log n)^{−α/(2α+d)} was obtained. For d > 1, only a suboptimal rate ρ_{n,1} can be obtained, which may be due to some possibly crude bounds used in Lemma 6.5. It is not immediately clear whether the limitation is due to the techniques of the proof or due to the specification of the prior.
There are two directions that are worth exploring in the future. One is to obtain improved results for the lower smoothness situation, possibly α > d/2. The other is to obtain optimal sup-norm rates for d > 1. There are some very recent results in these directions in other related nonparametric estimation problems; see for instance Castillo and van der Pas [11] for survival analysis and Nickl and Ray [35] for diffusions. The first paper deals with univariate functions, and its technique may be used to obtain results for the lower smoothness situation. The second paper develops some techniques which make it possible to obtain optimal sup-norm rates for d ≤ 4 in multidimensional diffusion problems. It remains an open problem whether similar techniques can be used for improving rates in multivariate density estimation.
Lemma 6.3. Suppose f_0 ∈ H^α([0, 1]^d) for some α > 1, and is bounded away from 0 and ∞. Let ζ_{j,k} = Ψ_{j,k}/f_0. Then for k ∈ K(j) and u ∈ K(λ) with λ ≤ j, it holds that |⟨ζ_{j,k}, Ψ_{λ,u}⟩| ≲ 1, which follows using the property (iii) of the wavelet bases. When j < λ, a symmetric bound can be obtained, but a better bound is possible. To see this, note that Lemma 5 of Castillo [9] implies that ζ_{j,k} ∈ B^α_{∞,∞}([0, 1]^d) with norm ‖ζ_{j,k}‖_{∞,∞,α} ≲ 2^{j(α+d/2)}.

Lemma 6.4. Let f_0 be a density function bounded away from 0 and ∞ and let P_0 stand for the integration with respect to the density f_0. Then for any j ≤ J_n and k ∈ K(j), we have the following bounds.

Proof. These bounds are not explicitly stated in Castillo [9], but are needed for the conditions of Lemmas 6.6 and 6.7. We give a proof here for completeness. Write A_{j,k} = Σ_{λ≤J_n} Σ_{u∈K(λ)} ⟨ζ_{j,k}, Ψ_{λ,u}⟩ Ψ_{λ,u} and split the sum into the parts λ ≤ j and λ > j. When λ ≤ j, for any fixed λ, a fixed number of the {Ψ_{λ,u}} have support intersecting the support of ζ_{j,k}. By Lemma 6.3, |⟨ζ_{j,k}, Ψ_{λ,u}⟩| ≲ 1, and by the property (iii) of the wavelets, the first term can be bounded accordingly. When λ > j, by Lemma 6.3 again, the second term can be bounded by a quantity which is at most a constant multiple of 2^{jd/2} ≤ 2^{J_nd/2}. Similarly, for the next bound, the first term is bounded by Σ_{λ≤j} 2^{(λ−j)d}, which is O(1), and the second term is bounded analogously. For the last statement, note that P_0Ψ_{λ,u} ≲ ∫|Ψ_{λ,u}| ≲ 2^{−λd/2}; by a similar argument, we can obtain ‖P_0A_{j,k}‖_∞ = O(1).

Lemma 6.5. Let f_0 ∈ H^α([0, 1]^d) for some α > 1, and be bounded away from 0 and ∞. Then for any j ≤ J_n and k ∈ K(j), and any density f bounded away from 0 and ∞, the following bounds hold.

Proof. To show the first statement, write out the relevant expansion, in which the third line follows from Lemma 6.3 and the property (i) of the wavelets.
To show the second statement, recall first the support of Ψ j,k is denoted by S j,k whose volume vol(S j,k ) 2 −jd (since along each coordinate, the length of support is of order 2 −j ). For any fixed j, k, A j,k − ζ j,k is a linear combination of high frequency wavelets (λ > J n ≥ j). Observe first that for any λ > J n , and admissible u, each coordinate of Ψ λ,u has a support of length of order vol(S λ,u ) 1/d , which is smaller than that of each coordinate of ζ j,k (whose length is of the order of vol(S j,k ) 1/d ), that is, vol(S λ,u ) 1/d R × vol(S j,k ) 1/d for some constant R. Therefore, along each coordinate, supports of all Ψ λ,u 's which intersect with that of ζ j,k along the same coordinate are contained in an interval of length at most (2R + 1)vol(S j,k ) 1/d . Hence, the supports of Ψ λ,u 's which intersect with the support of ζ j,k are contained in a hyper-rectangle of volume at most (2R + 1) d vol(S j,k ). This implies that the volume of the support of ζ j,k − A j,k is bounded by a constant multiple of vol(S j,k ).
Let Δ j,k denote the support of ζ j,k − A j,k . Now bounding A j,k − ζ j,k by its supremum and applying Cauchy-Schwarz inequality, To obtain the bound for the other side, let D j,k : S j,k f 0 , which is bounded below from 0 (by the assumption on f 0 ). Since j ≤ J n , by the property of L 2 -projection, ζ j,k − A j,k 2 ≤ ζ j,k − D j,k 2 . Since D j,k has the same support as that of Ψ j,k , To bound the sup-norm, where the fifth line holds because f 0 ∈ H α ([0, 1] d ) with α > 1; and in the second from the last line above, the first factor in the parentheses vol(S j,k ) is due to the volume from the support and the second factor vol(S j,k ) 1/d is due to the maximum length of support along each coordinate (or the diameter of the support). Combing the above results, Lemma 6.6. Let f 0 be a density function bounded away from 0 and ∞. Let a n be a sequence of real numbers with na 2 n > 1, any n ≥ 1. Let {Π n } be a collection of priors on densities restricted to the set {f : h(f, f 0 ) ≤ a n }. Let {γ n } be an arbitrary sequence in L ∞ ([0, 1] d ). Setγ n := γ n − P 0 γ n , where P 0 denotes integration with respect to the density f 0 . Suppose that for some m > 0 and all n > 1, Then there exists C > 0 depending on m and f 0 ∞ only such that for any n ≥ 1 and any |t| ≤ log n, where f t is defined through the expression In particular, the result holds for γ n = A j,k for any j such that 2 j ≤ 2 Jn (n/ log n) 1/(2α+d) .
Proof. See the proof of Lemma 3 of Castillo [9].

Lemma 6.7. Suppose that f_0 is bounded away from 0 and ∞. Let g_0 = log f_0, and let A_{j,k} be any element of L^∞([0, 1]^d) such that there exist constants c_1, c_2 for which the required bounds hold for any j, k ∈ K(j) with 2^j ≤ 2^{J_n} ≍ (n/ log n)^{1/(2α+d)} and any n sufficiently large. Then the conclusion holds for any n sufficiently large.

Proof. See the proof of Lemma 7 of Castillo [9], with some necessary adaptation using the properties of the tensor product wavelet basis.
For the log-Lipschitz prior under condition (7), the upper bound is e^{C|t|} for some C > 0.
Proof. See the argument on pages 2078-2080 of Castillo [9], with some necessary adaptation using the properties of the tensor product wavelet basis and Lemma 6.3.
The following result, needed to prove Theorem 4.1, is also of independent interest.

Proposition 6.9. In the setup of Theorem 4.1, the following assertions hold for d = 2, 3: (a) E(‖f − f̂‖_∞ | D_n) ≲ n^{−α/(2α+d)}(log n)^{d/2}; (b) E_0‖f̂ − E_0f̂‖_∞ ≲ n^{−α/(2α+d)}(log n)^{d/2}; (c) ‖E_0f̂ − f_0‖_∞ ≲ n^{−α/(2α+d)}; and hence the posterior for f contracts around f_0 at the rate n^{−α/(2α+d)}(log n)^{d/2} with respect to the supremum distance.
Proof. To avoid the complication of notation, we first demonstrate the proof for the case d = 2 and then sketch the proof for the dominating terms for the case d = 3 after that.
Let Z = f − f̂ = Σ_j Σ_k φ_{j,k}(θ_{j,k} − θ̂_{j,k}). The posterior distribution of Z given the data is a centered Gaussian process that does not depend on D_n. To ease the notation, we may just use E to denote the expectation with respect to the posterior distribution, i.e., we write E(‖Z‖_∞) for E(‖Z‖_∞ | D_n). Hence (a) is the same as showing E(‖Z‖_∞) ≲ n^{−α/(2α+2)} log n.
Proof for d = 2: Note that E(Z 2 (x)) = j k μ j /(nμ j + 1)φ 2 j,k (x). By the uniform boundedness of the basis functions, EZ 2 (x) is bounded by Let n α = n 1/(α+1) . The second term can be bounded by The third term can be bounded in a similar way. For the fourth term, we split in two cases -one is where j 1 j 2 ≤ n α and the other j 1 j 2 > n α . For the first case, For any fixed j 1 so that 1 ≤ j 1 ≤ n α , the number of j 2 such that j 2 ≤ n α /j 1 is bounded by n α /j 1 . Therefore, n −1 #{(j 1 , j 2 ) : For the second case, we can have two scenarios: (i) when j 1 ≥ n α , j 2 ≥ 1, and hence is O(n −α/(1+α) log n). Thus EZ 2 (·) ∞ n −α/(1+α) log n. We next show that E|Z(x)−Z(y)| 2 is bounded by some power of n multiplied by x − y 2 . Note that

Clearly, when j 1 = j 2 = 0, the summand is bounded by n −1 . Consider the summation over j 1 = 0, j 2 ∈ N. Using the fact that |φ 0,j2,k (x) − φ 0,j2,k (y)| |x 2 − y 2 | 2 j 2 2 , the summation is bounded by Notice that for the second inequality above to hold, α is required to be greater than 2. A bound of the same order can be obtained by the summation over j 1 ∈ N, j 2 = 0. Consider the summation over j 1 , j 2 Therefore, by the Cauchy-Schwarz inequality. Hence As before, we consider the two cases for the summation. One is the summation over j 1 j 2 ≤ n α and the other is over j 1 j 2 > n α . For the first case, By a straightforward calculation, it is bounded by n (2−α)/(1+α) . For the second case, we again consider two scenarios (i) j 1 ≥ n α , j 2 ≥ 1 and (ii) j 2 < n α , j 2 ≥ n α /j 1 . Proceeding similarly as before, we can get the bound n (2−α)/(1+α) . In summary, we obtain that Now, by Lemma A.11 of Yoo and Ghosal [48] with δ n = n −p for p > 0 sufficiently large, we obtain E Z 2 ∞ log n × E(Z 2 (·)) ∞ . By the bound n −α/(1+α) log n for E(Z 2 (·)) ∞ obtained earlier in the proof, these imply To prove (b), let V =f − E 0f . Consider a mean-zero Gaussian process given by Following the same sequence of argument used for Z is used, we can establish that To prove (c), note that The summand of j 1 = j 2 = 0 is bounded by n −1 . Considering the summation over where the first term on the right hand side can be bounded as nα j2≥1 k∈K(0,j2) which is bounded by a constant multiple of n α /n = n −α/(1+α) by the uniform boundedness of the basis functions and the assumption on the smoothness of f 0 (take i 1 = 0, i 2 = α). Since for j 2 > n α , j2>nα k∈K(0,j2) |θ 0,j,k | n −α α , the second term of the right hand side of above display can be bounded as j2>nα k∈K(0,j2) |θ 0,j,k | n −α α = n −α/(α+1) . Therefore j2≥1 k∈K(0,j2) Finally, consider the sum over j 1 , j 2 ∈ N. We consider two scenarios (i) j 1 j 2 > n α = n 1/(α+1) and (ii) j 1 j 2 ≤ n α . For the first scenario, by the uniform boundedness of basis functions and the assumption on the smoothness of f 0 (taking s 1 = s 2 = α/2), it holds that j1j2>nα k∈K(j) |θ 0,j,k | n −α/2 α . It is easy to see that j1j2>nα k∈K(j) /(2(1+α)) . In the second scenario, we bound the term as which is of the order n −α/(2+2α) by the uniform boundedness of basis functions and the assumption on the smoothness of f 0 . In summary, E 0f − f 0 ∞ n −α/(2+2α) as claimed. This completes our proof for the case d = 2.
The first case to consider is when j_1j_2j_3 ≤ n_α. The above term can then be bounded by a constant multiple of n_α^3/n. The second case to consider is when j_1j_2j_3 > n_α. Again we only need to consider the three sub-cases: (i) j_1 > n_α, j_2 ≥ 1, j_3 ≥ 1; (ii) j_1 < n_α, j_2 > n_α/j_1, j_3 ≥ 1; (iii) j_1 < n_α, j_2 < n_α/j_1, j_3 > n_α/(j_1j_2).
Proof of Part (2): Note that since 0 < γ < 1/2, R_{n,γ} is greater than the posterior median of ‖Z‖_∞. By the estimate (26) and Borell's inequality (cf. Proposition A.2.1 of van der Vaart and Wellner [47]), the posterior mean and the posterior median of ‖Z‖_∞ are of the same order. As the (1 − γ)-quantile of ‖Z‖_∞ is larger than its median, by Part (1), it follows that R_{n,γ} ≍ E‖Z‖_∞ ≲ n^{−α/(2(1+α))} log n.
Let ε_n = n^{−α/(2(1+α))} log n. By Borell's inequality (see Proposition A.2.1 of van der Vaart and Wellner [47]), the posterior probability that ‖Z‖_∞ exceeds a large multiple of ε_n is exponentially small, where c_n = ‖EZ^2(·)‖_∞ is bounded by (log n)n^{−α/(1+α)} in view of (26), and C is a constant whose positivity can be ensured if ρ is chosen sufficiently large. Therefore, the above posterior probability tends to zero. Finally, by Parts (b) and (c) of Proposition 6.9, the coverage of C_γ is established. Observe that C_γ ⊂ C̄_γ, because any L_c ∈ C_γ is induced by some f such that ‖f − f̂‖_∞ ≤ ρR_{n,γ}. In view of Part (a) of Proposition 6.9, the claim then follows by the third assertion.