Oracle posterior contraction rates under hierarchical priors

We offer a general Bayes-theoretic framework to derive posterior contraction rates under a hierarchical prior design: the first-step prior serves to assess the model selection uncertainty, and the second-step prior quantifies the prior belief on the strength of the signals within the model chosen from the first step. In particular, we establish non-asymptotic oracle posterior contraction rates under (i) a local Gaussianity condition on the log likelihood ratio of the statistical experiment, (ii) a local entropy condition on the dimensionality of the models, and (iii) a sufficient mass condition on the second-step prior near the best approximating signal for each model. The first-step prior can be designed generically. The posterior distribution enjoys Gaussian tail behavior, and therefore the resulting posterior mean also satisfies an oracle inequality, automatically serving as an adaptive point estimator in a frequentist sense. Model mis-specification is allowed in these oracle rates. The local Gaussianity condition serves as a unified attempt at non-asymptotic Gaussian quantification of statistical experiments, and can be easily verified in various experiments considered in [GvdV07a] and beyond. The general results are applied to various problems, including: (i) trace regression, (ii) shape-restricted isotonic/convex regression, (iii) high-dimensional partially linear regression, (iv) covariance matrix estimation in the sparse factor model, (v) detection of non-smooth polytopal image boundaries, and (vi) intensity estimation in a Poisson point process model. These new results serve either as theoretical justification of practical prior proposals in the literature, or as an illustration of the generic construction scheme of a (nearly) minimax adaptive estimator for a complicated experiment.

MSC2020 subject classifications: Primary 62G20; secondary 62G05.


Overview
Suppose we observe $X^{(n)}$ from a statistical experiment $(\mathcal{X}^{(n)}, \mathcal{A}^{(n)}, P^{(n)}_f)$, where $f$ belongs to a statistical model $\mathcal{F}$ and $\{P^{(n)}_f\}_{f \in \mathcal{F}}$ is dominated by a $\sigma$-finite measure $\mu$. In many cases, instead of using a single 'big' model $\mathcal{F}$, a collection of suitably nested (sub-)models $\{\mathcal{F}_m\}_{m \in \mathcal{I}} \subset \mathcal{F}$ is available to statisticians. A hierarchical Bayesian approach assigns a first-step prior $\Lambda_n$ assessing the uncertainty about which model to use, followed by a second-step prior $\Pi_{n,m}$ quantifying the prior belief in the strength of the signals within the specific model $\mathcal{F}_m$ chosen in the first step.
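As a schematic illustration (not part of the paper's formal development), sampling from such a hierarchical prior proceeds in two steps; the toy model collection and weight choices below are purely hypothetical.

```python
import numpy as np

def sample_hierarchical_prior(model_logweights, sample_within_model, rng=None):
    """Generic two-step prior: draw a model index m from Lambda_n,
    then draw a signal f from the second-step prior Pi_{n,m} on F_m."""
    rng = np.random.default_rng(rng)
    logw = np.asarray(model_logweights, dtype=float)
    w = np.exp(logw - logw.max())
    m = rng.choice(len(w), p=w / w.sum())      # first-step prior Lambda_n
    return m, sample_within_model(m, rng)      # second-step prior Pi_{n,m}

# toy usage: F_m = polynomials of degree m on a grid, i.i.d. Gaussian coefficients
grid = np.linspace(0.0, 1.0, 50)
m, f = sample_hierarchical_prior(
    model_logweights=[-(k + 1) * np.log(50.0) for k in range(10)],
    sample_within_model=lambda m, rng: np.polyval(rng.standard_normal(m + 1), grid),
)
```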
Such a hierarchical prior design is intrinsic to many proposals for different problems, including the canonical Gaussian white noise/regression and density estimation [AGR13, BG03, dJvZ10, GLvdV08, KRvdV10, LvdV07, RS17, Scr06], and the more recent sparse linear regression [CSHvdV15, CvdV12], trace regression [ACCR14], shape-restricted regression [HD11, HH03], covariance matrix estimation [GZ15, PBPD14], etc. Despite the many contraction rates available for different models (see e.g. [Cas14, CSHvdV15, CvdV12, GGvdV00, GvdV07a, GvdV17, HRSH15, Rou10, SW01, vdVvZ08, vdVvZ09] for some key contributions), a unified theoretical understanding of the behavior of posterior distributions under the hierarchical prior design has been limited. [GLvdV08] focused on designing adaptive Bayes procedures with models primarily indexed by the smoothness level of function classes in the context of density estimation; their conditions are complicated and do not seem directly applicable to other settings. [dJvZ10] used a specific location mixture prior for regression/density estimation/classification. [AGR13] considered a more general setting where the models are indexed by functions that admit a linear $\ell_2$-basis structure (e.g. Sobolev/Besov type); see also [RS17]. [GvdVZ15] designed a prior specific to structured linear problems in the Gaussian regression model, with their main focus on high-dimensional (linear) and network problems. As such, all these results a priori require certain specific forms of the prior, the model structure, or the statistical experiments.
This paper aims at giving a unified theoretical treatment of deriving posterior contraction rates under the common hierarchical prior design, without specifying particular forms for the prior, the model structure, or the experiments. More specifically, we aim at identifying common structural assumptions on the statistical experiments $(\mathcal{X}^{(n)}, \mathcal{A}^{(n)}, P^{(n)}_f)$, the models $\{\mathcal{F}_m\}$ and the priors $(\Lambda_n, \Pi_{n,m})$ under which (G1) the posterior distribution contracts around the true signal $f_0$ at the oracle rate (1.1), which trades off the approximation error $\inf_{g \in \mathcal{F}_m} d_n^2(f_0, g)$ against a penalty $\mathrm{pen}(m)^2$ related to the 'dimension' of $\mathcal{F}_m$, and (G2) the posterior puts little mass on models that are substantially larger than the oracle one balancing the bias-variance tradeoff in (1.1).
The oracle formulation (1.1) follows the convention in the frequentist literature on model selection [BC91, YB98, BBM99, Mas07, Tsy14], and has several advantages: (i) (minimaxity) if the true signal $f_0$ can be well approximated by the models $\{\mathcal{F}_m\}$, the contraction rate in (1.1) is usually (nearly) minimax optimal, (ii) (adaptivity) if $f_0$ lies in a certain low-dimensional model $\mathcal{F}_m$, the contraction rate adapts to this unknown information, and (iii) (mis-specification) if the models $\mathcal{F}_m$ are mis-specified while $d_n^2(f_0, \cup_{m \in \mathcal{I}} \mathcal{F}_m)$ remains 'small', then the contraction rate is still rescued by this relatively 'small' bias.
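For concreteness, specializing to the isotonic regression setting of Section 3.1.2 (where the intrinsic metric is the empirical $L_2$ metric $\ell_n$ and $n\delta^2_{n,m} \asymp m\log(en)$), the oracle rate takes the form
$$\varepsilon_n^2(f_0) \;\asymp\; \inf_{m \in \mathbb{N}} \max\Big\{\inf_{g \in \mathcal{F}_m} \ell_n^2(f_0, g),\; \frac{m \log(en)}{n}\Big\},$$
so that a piecewise constant $f_0$ with $m_0$ pieces yields a nearly parametric rate $m_0\log(en)/n$ (adaptivity), a general bounded monotone $f_0$ yields roughly $n^{-2/3}$ up to logarithmic factors (near-minimaxity), and a mildly non-monotone $f_0$ only incurs the corresponding 'small' approximation bias (mis-specification).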
As the main abstract result of this paper (cf. Theorem 2.3), we show that our goals (G1)-(G2) can be accomplished under: (i) (Experiment) a local Gaussianity condition on the log likelihood ratio for the statistical experiment with respect to $d_n$; (ii) (Models) a dimensionality condition on the model $\mathcal{F}_m$ measured in terms of local entropy with respect to the metric $d_n$; (iii) (Priors) exponential weighting for the first-step prior $\Lambda_n$, and sufficient mass of the second-step prior $\Pi_{n,m}$ near the 'best' approximating signal $f_{0,m}$ within the model $\mathcal{F}_m$ for the true signal $f_0$.
The local Gaussianity condition is rooted in the frequentist theory of convergence rates of M-estimators (i.e. estimators maximizing a certain likelihood) via the theory of Gaussian and empirical processes. In fact, local Gaussianity serves as an essential ingredient for various (by now standard) techniques, including Gaussian concentration and chaining with bracketing, that unify the theory for, e.g., regression and density estimation [BM93, vdG00, vdVW96] (see Appendix E for more discussion). On the Bayesian side, one important convention in studying posterior contraction rates in the literature has been the construction of appropriate tests with exponentially small type I and II errors with respect to a certain metric, the Gaussian behavior of the type II error being particularly crucial [GGvdV00, GvdV07a]. It is therefore natural to ask whether the frequentist local Gaussianity can also be useful in Bayes theory. Our formulation in (i) can be viewed as an attempt in this regard, and seems useful in that local Gaussianity with respect to the intrinsic metric is a rather universal property across statistical experiments, including the ones considered in [GvdV07a] and beyond: Gaussian/Laplace/binary/Poisson regression, density estimation, Gaussian autoregression, Gaussian time series, covariance matrix estimation, image boundary detection, and support boundary recovery in a Poisson point process model, etc. Moreover, such local Gaussianity naturally entails the Gaussian tail behavior of the posterior distribution, thereby complementing a recent result of [HRSH15], who showed that such Gaussian tail behavior cannot be uniformly improved under uniform posterior consistency.
Conditions (ii) and (iii) are familiar in the Bayesian nonparametrics literature. In particular, the first-step prior can be designed generically (cf. Proposition 2.2). Sufficient mass of the second-step prior $\Pi_{n,m}$ is a minimal condition in the sense that using $\Pi_{n,m}$ alone should lead to a (nearly) optimal posterior contraction rate on the model $\mathcal{F}_m$.
As an illustration of the scope of our general results in concrete applications, we justify the prior proposals in (i) [ACCR14, MA15] for the trace regression problem, and in (ii) [HD11, HH03] for the shape-restricted regression problems. Despite the many theoretical results for Bayesian high-dimensional models (cf. [BG14, CSHvdV15, CvdV12, GvdVZ15, GZ15, PBPD14]), it seems that the important low-rank trace regression problem has not yet been successfully addressed; our result here fills in this gap. Furthermore, to the best of the author's knowledge, the theoretical results concerning shape-restricted regression problems provide the first systematic approach that bridges the gap between the Bayesian nonparametrics and shape-restricted nonparametric function estimation literatures in the context of adaptive estimation.
Several other applications are considered, including: (iii) the high-dimensional partially linear regression model, (iv) covariance matrix estimation in the sparse factor model, (v) detection of polytopal image boundaries, and (vi) estimation of piecewise constant intensities in a Poisson point process model. These results serve as an illustration of the generic construction scheme of a (nearly) minimax adaptive estimator in multi-structured experiments, or in experiments that seem far from Gaussian. We also revisit some density estimation problems, in particular location mixture models. The purpose is to provide some guidance on how the local Gaussianity can be applied via appropriate localization of the parameter space, when such Gaussianity may fail to hold at a global scale.
During the preparation of this paper, we became aware of a very recent paper [YP17] which independently considered a similar problem. Both our approach and that of [YP17] shed light on the behavior of Bayes procedures under hierarchical priors, while differing in several important aspects (cf. Remark 2.6). Moreover, our work here applies to a wide range of applications that are not covered by [YP17].

Notation
Let $(\mathcal{F}, \|\cdot\|)$ be a subset of the normed space of real functions $f: \mathcal{X} \to \mathbb{R}$. Let $N(\varepsilon, \mathcal{F}, \|\cdot\|)$ be the $\varepsilon$-covering number; see page 83 of [vdVW96] for more details. For a real-valued measurable function $f$ defined on $(\mathcal{X}, \mathcal{A}, P)$, $\|f\|_{L_p(P)} \equiv (P|f|^p)^{1/p}$ denotes the usual $L_p$-norm under $P$ (where $p \geq 1$), and will be written as $\|f\|_p$ when there is no potential confusion. $\|f\|_\infty \equiv \|f\|_{L_\infty} \equiv \sup_{x \in \mathcal{X}} |f(x)|$ denotes the supremum norm.
For any $v \in \mathbb{R}^d$, we use $\|v\|_p$ to denote the usual Euclidean $p$-norm. For any $\varepsilon > 0$, denote by $B_d(v, \varepsilon) \equiv \{u \in \mathbb{R}^d: \|u - v\|_2 \leq \varepsilon\}$ the Euclidean ball in $\mathbb{R}^d$ centered at $v$ with radius $\varepsilon$. $C_x$ denotes a generic constant that depends only on $x$, whose numeric value may change from line to line. $a \lesssim_x b$ and $a \gtrsim_x b$ mean $a \leq C_x b$ and $a \geq C_x b$ respectively, and $a \asymp_x b$ means $a \lesssim_x b$ and $a \gtrsim_x b$. For $a, b \in \mathbb{R}$, $a \vee b := \max\{a, b\}$ and $a \wedge b := \min\{a, b\}$. $P^{(n)}_f T$ denotes the expectation of a random variable $T = T(X^{(n)})$ under the experiment $(\mathcal{X}^{(n)}, \mathcal{A}^{(n)}, P^{(n)}_f)$.

Organization
Section 2 is devoted to the general results on oracle posterior contraction rates. We work out a wide range of experiments and some concrete applications that fit into our general theory in Section 3. Detailed proofs are deferred to the Appendix.

General results
In the hierarchical prior design framework, we first put a prior $\Lambda_n$ on the model index $\mathcal{I}$, followed by a prior $\Pi_{n,m}$ on the model $\mathcal{F}_m$ chosen from the first step. The overall prior is a probability measure on $\mathcal{F}$ given by $\Pi_n \equiv \sum_{m \in \mathcal{I}} \lambda_n(m)\, \Pi_{n,m}$. The posterior distribution is then a random measure on $\mathcal{F}$: for a measurable subset $B \subset \mathcal{F}$,
$$\Pi_n\big(B \,\big|\, X^{(n)}\big) \;=\; \frac{\int_B \big(dP^{(n)}_f/d\mu\big)\big(X^{(n)}\big)\, d\Pi_n(f)}{\int_{\mathcal{F}} \big(dP^{(n)}_f/d\mu\big)\big(X^{(n)}\big)\, d\Pi_n(f)}.$$
Here $d_n: \mathcal{F} \times \mathcal{F} \to \mathbb{R}_{\geq 0}$ is a symmetric function satisfying certain structural conditions for some constants $c_2, c_3 > 0$ and $d_0 \geq 0$ (possibly depending on $n$).
In Assumption A, we require the log likelihood ratio to have local Gaussian behavior with respect to the intrinsic 'metric' $d_n$ in the sense of (2.3). If $\kappa_\Gamma$ can be chosen to be $0$, then the log likelihood ratio exhibits global Gaussian behavior. In Section 3, many statistical experiments, beyond the apparent Gaussian ones, will be shown to satisfy this local Gaussianity condition in their respective intrinsic metrics. In some cases the local Gaussianity by itself may entail certain a priori compactness constraints on the parameter space, for instance boundedness requirements in binary/Poisson regression and density estimation. These constraints can be removed, in a technical way, by working with appropriately localized subsets of the parameter space on which the local Gaussianity holds. See Section 2.3 and Appendix F for more details and examples in this regard.
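As a simple sanity check, in the fixed-design Gaussian regression model $Y_i = f_0(x_i) + \xi_i$ with $\xi_i$ i.i.d. $N(0,1)$ (cf. Section 3.1), the log likelihood ratio is exactly Gaussian with respect to the empirical $L_2$ metric $\ell_n^2(f, g) \equiv n^{-1}\sum_{i=1}^n (f - g)^2(x_i)$, so the Gaussian behavior is global (presumably corresponding to $\kappa_\Gamma = 0$):
$$\log\frac{dP^{(n)}_{f}}{dP^{(n)}_{f_0}}\big(Y^{(n)}\big) \;=\; \sum_{i=1}^n\Big[(f - f_0)(x_i)\,\xi_i - \tfrac{1}{2}(f - f_0)^2(x_i)\Big] \;\sim\; N\Big(-\tfrac{n}{2}\,\ell_n^2(f, f_0),\; n\,\ell_n^2(f, f_0)\Big) \quad \text{under } P^{(n)}_{f_0}.$$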
As already mentioned in the Introduction, this local Gaussianity point of view has its roots in the unified treatment of deriving convergence rates of M-estimators; a formal connection to the theory of sieved MLEs under local Gaussianity will be given in Appendix E.
A direct consequence of the local Gaussianity of the statistical experiment is the following.
Next we state the assumption on the complexity of the models $\{\mathcal{F}_m\}_{m \in \mathcal{I}}$. Let $\mathcal{I} = \mathbb{N}^q$ be a $q$-dimensional lattice with the natural order $(\mathcal{I}, \leq)$. Here the dimension $q$ is understood as the number of different structures in the models $\{\mathcal{F}_m\}_{m \in \mathcal{I}}$. For instance, in the trace regression problem (cf. Section 3.1.1), there is only one rank structure so $q = 1$; in the covariance matrix estimation problem in the sparse factor model (cf. Section 3.5.1), there are both rank and sparsity structures so $q = 2$. In the sequel we will not explicitly mention $q$ unless otherwise specified. We require the models to be nested in the sense that $\mathcal{F}_m \subset \mathcal{F}_{m'}$ if and only if $m \leq m'$.
Let $f_{0,m}$ denote the 'best' approximation of $f_0$ within the model $\mathcal{F}_m$ in the sense that $f_{0,m} \in \arg\inf_{g \in \mathcal{F}_m} d_n(f_0, g)$. Our assumption on the model complexity below, at a heuristic level, says that $\mathcal{F}_m$ has dimension $n\delta^2_{n,m}$ measured in a local entropy sense, for some $\delta_{n,m} > 0$. In typical cases, $\mathcal{F}_m$ has 'dimension' $m$, and $\delta^2_{n,m} \approx \frac{m}{n} \times \text{poly-log}$ is regarded as the contraction rate on $\mathcal{F}_m$ (up to logarithmic factors).
Assumption B (Models: Local entropy condition). Let $\{\delta_{n,m}\}_{m \in \mathcal{I}} \subset \mathbb{R}_{>0}$ be such that each $\delta_{n,m}$ depends on $n, m$ only, and such that, for each $m \in \mathcal{I}$, the local entropy condition (2.4) holds, together with the growth condition (2.5) involving constants $c, \gamma$ and a threshold $h_0$. Using the $\delta_{n,m}$'s, the models can be divided into over-fitting or under-fitting ones according to whether $\delta^2_{n,m} \geq \inf_{g \in \mathcal{F}_m} d^2_n(f_0, g)$ or $\delta^2_{n,m} < \inf_{g \in \mathcal{F}_m} d^2_n(f_0, g)$. Note that if we choose all models $\mathcal{F}_m = \mathcal{F}$, then (2.4) reduces to the local entropy condition in [GGvdV00, GvdV07a]. When $\mathcal{F}_m$ is finite-dimensional, typically we can check (2.4) for all $g \in \mathcal{F}_m$. Now we comment on (2.5). The left side of (2.5) essentially requires super-linearity of the map $m \mapsto \delta^2_{n,m}$, while the right side of (2.5) controls the degree of this super-linearity. As a leading example, (2.5) is trivially satisfied with $c = \gamma = 1$, $h_0 = \infty$ when $n\delta^2_{n,m} = c' \cdot m\log(en)$ for some absolute constant $c' > 2/c_7$. Finally, we state assumptions on the priors.
Assumption C (Priors: Mass condition). For all $m$: (P1) (First-step prior) there exists some $h \geq 1$ such that the weights $\lambda_n(m)$ satisfy an exponential-weighting condition (involving $h$ and $n\delta^2_{n,m}$); (P2) (Second-step prior) $\Pi_{n,m}$ puts sufficient prior mass near $f_{0,m}$ in the sense of (2.7). Condition (P1) can be verified by using the generic prior $\Lambda_n$ given in (2.8); this prior (2.8) will be the model selection (first-step) prior on the model index $\mathcal{I}$ in all examples in Section 3.
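A sketch of the generic construction (2.8), as suggested by the proof of Proposition 2.2 (Appendix A.6), is exponential weighting in the effective model dimension:
$$\lambda_n(m) \;=\; \frac{e^{-2n\delta^2_{n,m}}}{\sum_{m' \in \mathcal{I}} e^{-2n\delta^2_{n,m'}}}, \qquad m \in \mathcal{I};$$
in the examples of Section 3, where $n\delta^2_{n,m} \asymp m\log(en)$ up to structure-specific factors, this reduces to first-step priors proportional to $e^{-c\, m\log(en)}$ for a sufficiently large constant $c > 0$.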
Condition (P2) is reminiscent of the classical prior mass condition considered in [GGvdV00, GvdV07a]. Since $\delta^2_{n,m}$ is understood as the 'posterior contraction rate' for the model $\mathcal{F}_m$, (P2) can also be viewed as a solvability condition imposed on each model. Note that (2.7) only requires a sufficient prior mass on a Kullback-Leibler ball near $f_{0,m}$, whereas [GGvdV00, GvdV07a] use more complicated metric balls induced by higher moments of the Kullback-Leibler divergence.

Main abstract results
We say an index set $\mathcal{M} \subset \mathcal{I}$ is rectangular if and only if there exist some integers $a_i \leq b_i$ ($1 \leq i \leq q$) such that $\mathcal{M} = \{m \in \mathcal{I}: a_i \leq m_i \leq b_i,\ 1 \leq i \leq q\}$. 3. Let $\hat f_n \equiv \Pi_n(f\,|\,X^{(n)})$ be the posterior mean. If $h_0 = \infty$ and $d_n(\cdot,\cdot)$ is convex in each of its arguments, then the oracle inequality (2.11) holds, where the constant $C_0$ depends on $\kappa, c, h$ and $\gamma$. Remark 2.4. Some technical comments: 1. $f_{0,m}$ in Assumptions B and C may be taken other than the minimizer of $f \mapsto d^2_n(f_0, f)$ over $\mathcal{F}_m$. In this case, the conclusion of the above theorems remains valid with $\varepsilon^2_{n,m} \equiv d^2_n(f_0, f_{0,m}) \vee \delta^2_{n,m}$. 2. The constants $\{C_i\}$ do not depend on $m \in \mathcal{M}$, so the conclusions in (1)-(2) hold simultaneously for all $m \in \mathcal{M}$.
Theorem 2.3 shows that the task of constructing Bayes procedures adaptive to a collection of models in the intrinsic metric of a given statistical experiment can be essentially reduced to that of designing a suitable non-adaptive prior for each model, provided the model selection prior is chosen according to (P1).

Furthermore, the resulting posterior mean serves as an automatic adaptive point estimator in a frequentist sense. Besides being rate-adaptive to the collection of models, (2.10) shows that the posterior distribution does not spread too much mass on overly large models. Results of this type have been derived primarily in the Gaussian regression model (cf. [CSHvdV15,CvdV12,GvdVZ15]) and in density estimation [GLvdV08]; here our result shows that this is a general phenomenon for the hierarchical prior design.
As mentioned in the Introduction, previous results [AGR13, dJvZ10, GvdVZ15, GLvdV08, RS17] require certain specific forms of the prior, the model structure, or the experiments. Our Theorem 2.3 can thus be viewed as a generalization of these results without such a priori requirements under a hierarchical prior design. As will be clear from the concrete applications in Section 3, another advantage of the formulation of Theorem 2.3 is that Assumptions B-C typically concern finite-dimensional models $\mathcal{F}_m$, so verification is easy and routine.
Note that $f_0$ is arbitrary and hence our oracle inequalities (2.9) and (2.11) account for model mis-specification errors. Previous work allowing model mis-specification includes [GvdVZ15], which mainly focuses on structured linear models in the Gaussian regression setting, and [KvdV06], which pursued generality at the cost of harder-to-check conditions. Remark 2.5. We make some technical remarks.
1. The probability estimate in (2.9) is of Gaussian type and is therefore sharp (up to constants) in view of the lower bound result, Theorem 2.1 of [HRSH15]. Such sharp estimates have been derived separately in the Hellinger metric [GGvdV00], or in individual settings, e.g. the sparse normal mean model [CvdV12], the sparse PCA model [GZ15], and the structured linear model [GvdVZ15], to name a few. The Gaussian estimate naturally implies good behavior of the posterior mean under bounded metrics (cf. page 507 of [GGvdV00]). In the leading case $c = \gamma = 1$, $h_0 = \infty$ in Assumption B, the posterior mean $\hat f_n$ satisfies an oracle inequality with a Gaussian tail. 2. (2.10) asserts that the posterior distribution does not concentrate on overly large models. It is also of significant interest to assert the converse in some models, i.e. that the posterior distribution does not concentrate on overly small models under additional problem-specific conditions; we refer the reader to [Bel17, CSHvdV15, RS16, YP17] and references therein for more details in this direction. 3. Assumption A implies, among other things, the existence of a good test (cf. Lemma 2.1). In this sense our approach here falls into the general testing approach adopted in [GGvdV00, GvdV07a]. Some alternative approaches for dealing with non-intrinsic metrics can be found in [Cas14, HRSH15, YG16]. 4. The constants $\{C_i\}_{i=1}^4$ in Theorem 2.3 depend at most polynomially on the constants involved in Assumption A. This will be useful in handling models where the local Gaussianity only holds locally on the parameter space (cf. Appendix F). 5. If $d_n$ does not satisfy the triangle inequality, then (2.9) and (2.10) in Theorem 2.3 hold if $f_0 \in \mathcal{F}_m$ for some $m$ (i.e. the form of an exact oracle inequality may be lost at a general level).
Remark 2.6. We compare our results with Theorems 4 and 5 of [YP17]. Both their results and our Theorem 2.3 shed light on the general problem of Bayes model selection, while differing in several important aspects: 1. Theorem 4 of [YP17] targets exact model selection consistency, under a set of additional 'separation' assumptions. Our Theorem 2.3 (2) requires no extra assumptions, and shows that the posterior distribution does not concentrate on overly large models. This is significant in non-parametric problems: the true signal typically need not belong to any specific model. 2. Theorem 5 of [YP17] contains a term involving the cardinality of the models, so their bound is finite only if there are finitely many models.
It remains open whether this restriction can be removed.

The localization (sieving) principle
Consider a sequence of models $\{\widetilde{\mathcal{F}}_n\}$, where $\widetilde{\mathcal{F}}_n$ is regarded as the localized model of $\mathcal{F}$ at sample size $n$. Note that any prior $\Pi_n$ on $\mathcal{F}$ can be localized to a prior $\widetilde\Pi_n$ on $\widetilde{\mathcal{F}}_n$: for any $B \subset \widetilde{\mathcal{F}}_n$, define $\widetilde\Pi_n(B) \equiv \Pi_n(B \cap \widetilde{\mathcal{F}}_n)/\Pi_n(\widetilde{\mathcal{F}}_n)$. The quantity in Theorem 2.3 concerning the posterior distribution can then be decomposed as in (2.12). In essence, (2.12) suggests that we can apply the machinery of Assumptions A-C to the localized model $\widetilde{\mathcal{F}}_n$ (typically by choosing the constants $c_2, c_3, d_0$ depending on $n$), as long as the residual term $P^{(n)}_{f_0}\Pi_n\big(f \notin \widetilde{\mathcal{F}}_n \,\big|\, X^{(n)}\big)$ is well controlled. This typically reduces to a reasonable control of $\Pi_n(\mathcal{F} \setminus \widetilde{\mathcal{F}}_n)$ (cf. Lemma 1 of [GvdV07a]; see also the examples in Appendix F). The localization principle goes under the name 'sieving' in [GGvdV00, GvdV07a].
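In its simplest form, writing $\widetilde\Pi_n(\cdot\,|\,X^{(n)})$ for the posterior under the localized prior $\widetilde\Pi_n$, such a decomposition can be sketched as
$$\Pi_n\big(B\,\big|\,X^{(n)}\big) \;\leq\; \widetilde\Pi_n\big(B\,\big|\,X^{(n)}\big) + \Pi_n\big(f \notin \widetilde{\mathcal{F}}_n\,\big|\,X^{(n)}\big),$$
since $\Pi_n(B \cap \widetilde{\mathcal{F}}_n\,|\,X^{(n)}) = \widetilde\Pi_n(B\,|\,X^{(n)})\,\Pi_n(\widetilde{\mathcal{F}}_n\,|\,X^{(n)})$.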

Proof sketch
Here we sketch the main steps in the proof of our main abstract result, Theorem 2.3; the details are deferred to Appendix A. The proof can be roughly divided into two main steps. (Step 1) We first solve a localized problem on the model $\mathcal{F}_m$ by 'projecting' the underlying probability measure from $P_{f_0}$ to $P_{f_{0,m}}$. In particular, we establish an exponential deviation inequality for the posterior contraction rate via the existence of tests guaranteed by Lemma 2.1 (this index may deviate from $m$ substantially for small indices). (Step 2) We argue that the cost of the projection in Step 1 is essentially a multiplicative $O(\exp(c_2 n\delta^2_{n,m}))$ factor in the probability bound (2.13), cf. Lemma A.1, which is made possible by the local Gaussianity Assumption A. Then, by choosing $c_1$ much larger than $c_2$, we obtain the conclusion by the definition of $\delta^2_{n,m}$ and the fact that $\varepsilon^2_{n,m} \approx d^2_n(f_0, f_{0,m}) \vee \delta^2_{n,m}$. The existence of tests (Lemma 2.1) is used in Step 1.
Step 2 is inspired by the work of [CGS15] in the context of the frequentist least squares estimator over a polyhedral cone in the Gaussian regression setting, where the localized problem therein is the estimation of signals on a low-dimensional face (where 'risk adaptation' happens). In the Bayesian context, [CSHvdV15, CvdV12] used a change-of-measure argument in the Gaussian regression setting for a different purpose. Our proof strategy can be viewed as an extension of these ideas beyond the (simple) Gaussian regression model.

Models and applications
In this section we work out a number of specific statistical models that satisfy the local Gaussianity Assumption A, to illustrate the scope of the general results in Section 2. Some of the examples come from [GvdV07a]; we identify the 'intrinsic' metric to use in these models. Some concrete applications are also given. The applications presented in this section serve as a demonstration of the scope of our general results in deriving new contraction rate results. More applications can be found in Appendix F, illustrating the localization principle (cf. Section 2.3) and aiding calculations/formulations in more complicated collections of models.

Regression models
Suppose we want to estimate $\theta = (\theta_1, \ldots, \theta_n)$ in a given model $\Theta_n \subset \mathbb{R}^n$ in the Gaussian, Laplace, binary and Poisson regression settings, with observations indexed by $1 \leq i \leq n$. Using similar techniques we can derive analogous results for Gaussian regression with random design and for the white noise model; we omit the details. Remark 3.3. The boundedness assumption in the Laplace/binary/Poisson models is imposed here for simplicity, and can be removed using the localization principle (cf. Section 2.3) for more concrete $\Theta_n$'s and priors. See Appendix F for an example.
Below we give three concrete applications in the Gaussian regression model

Example: Trace regression
Consider fitting the Gaussian regression model with a low-rank matrix parameter: for $r \in \mathcal{I}_1$, the models are $\mathcal{F}_r \equiv \{A \in \mathbb{R}^{m_1 \times m_2}: \mathrm{rank}(A) \leq r\}$, and for $r \in \mathcal{I}_2$, $\mathcal{F}_r \equiv \mathcal{F}_{r_{\max}}$. Although various Bayesian methods have been proposed in the literature (cf. [ACCR14] for a state-of-the-art summary), theoretical understanding has been limited. [MA15] derived an oracle inequality for an exponentially aggregated estimator for the matrix completion problem; their result is purely frequentist. Below we consider a two-step prior similar to [ACCR14, MA15], and derive the corresponding posterior contraction rates.
For a matrix $B = (b_{ij}) \in \mathbb{R}^{m_1 \times m_2}$ let $\|B\|_p$ denote its Schatten $p$-norm; $p = 1$ and $p = 2$ correspond to the nuclear norm and the Frobenius norm respectively. To introduce the notion of RIP, let $\mathcal{X}: \mathbb{R}^{m_1 \times m_2} \to \mathbb{R}^n$ be the linear map defined via $A \mapsto (\mathrm{tr}(x_i A))_{i=1}^n$. Definition 3.4. The linear map $\mathcal{X}: \mathbb{R}^{m_1 \times m_2} \to \mathbb{R}^n$ is said to satisfy RIP$(r, \nu_r)$ for some $1 \leq r \leq r_{\max}$ and some $\nu_r = (\underline{\nu}_r, \bar{\nu}_r)$ with $0 < \underline{\nu}_r \leq \bar{\nu}_r < \infty$ iff
$$\underline{\nu}_r \;\leq\; \frac{\|\mathcal{X}(A)\|_2}{\sqrt{n}\,\|A\|_2} \;\leq\; \bar{\nu}_r$$
holds for all matrices $A \in \mathbb{R}^{m_1 \times m_2}$ such that $\mathrm{rank}(A) \leq r$. For $r > r_{\max}$, $\mathcal{X}$ satisfies RIP$(r, \nu_r)$ iff $\mathcal{X}$ satisfies RIP$(r_{\max}, \nu_r)$. Furthermore, $\mathcal{X}: \mathbb{R}^{m_1 \times m_2} \to \mathbb{R}^n$ is said to satisfy uniform RIP$(\nu; \mathcal{I})$ on an index set $\mathcal{I}$ iff $\mathcal{X}$ satisfies RIP$(2r, \nu)$ for all $r \in \mathcal{I}$.
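The RIP constants are defined via a worst case over all low-rank matrices and are in general hard to certify; the following sketch (not from the paper) only probes the isometry ratio $\|\mathcal{X}(A)\|_2/(\sqrt{n}\,\|A\|_2)$ over randomly drawn low-rank matrices, here for a hypothetical i.i.d. Gaussian design for which RIP is known to hold with high probability.

```python
import numpy as np

def rip_ratio_range(X_map, m1, m2, r, n, trials=200, seed=0):
    """Monte Carlo probe of the restricted isometry ratio over random
    rank <= r matrices; an optimistic empirical check, not a certificate."""
    rng = np.random.default_rng(seed)
    ratios = []
    for _ in range(trials):
        U = rng.standard_normal((m1, r))
        V = rng.standard_normal((m2, r))
        A = U @ V.T                                   # random matrix of rank <= r
        ratios.append(np.linalg.norm(X_map(A)) / (np.sqrt(n) * np.linalg.norm(A)))
    return min(ratios), max(ratios)

# hypothetical design: X(A)_i = <G_i, A> with i.i.d. standard Gaussian entries G_i
m1, m2, n, r = 20, 15, 600, 2
G = np.random.default_rng(1).standard_normal((n, m1 * m2))
X_map = lambda A: G @ A.reshape(-1)
print(rip_ratio_range(X_map, m1, m2, r, n))           # both ratios should be close to 1
```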

RIP$(r, \nu_r)$ is a variant of the RIP condition introduced in [CT05, CP11, RFP10], with scaling factors $\bar\nu_r = 1/(1 - \delta_r)$ and $\underline\nu_r = 1/(1 + \delta_r)$ for some $0 < \delta_r < 1$. This condition quantifies the degree to which the linear map $\mathcal{X}$ behaves like an isometry between $\mathbb{R}^{m_1 \times m_2}$ and $\mathbb{R}^n$ in terms of the $\ell_2$ metric. Below are two canonical examples.
Consider a prior $\Lambda_n$ on $\mathcal{I}$ of the form (3.1), where $c > 0$ is a constant to be specified later. Given the chosen index $r \in \mathcal{I}_1$, a prior on $\mathcal{F}_r$ is induced by a prior on all $m_1 \times m_2$ matrices of the form $A = \sum_{j=1}^r u_j v_j^\top$ with $u_j \in \mathbb{R}^{m_1}$, $v_j \in \mathbb{R}^{m_2}$. Here we use a product prior distribution $G$ with Lebesgue density $(g_1 \otimes g_2)^{\otimes r}$ on $(\mathbb{R}^{m_1} \times \mathbb{R}^{m_2})^r$. For simplicity we use $g_i \equiv g^{\otimes m_i}$ for $i = 1, 2$, where $g$ is symmetric about $0$ and non-increasing on $(0, \infty)$. Let $\tau^{\mathrm{tr}}_{r,g}$ denote the associated prior mass quantity. Theorem 3.7. Fix $0 < \eta < 1/2$ and $r_{\max} \leq n$. Suppose that there exists some $\mathcal{M} \subset \mathcal{I}_1$ such that the linear map $\mathcal{X}: \mathbb{R}^{m_1 \times m_2} \to \mathbb{R}^n$ satisfies uniform RIP$(\nu; \mathcal{M})$, and that (3.2) holds for all $r \in \mathcal{M}$. Then there exists some $c > 0$ in (3.1), depending on $\bar\nu/\underline\nu$ and $\eta$, such that the oracle contraction (3.3) holds for any $r \in \mathcal{M}$, with $(\varepsilon^{\mathrm{tr}}_{n,r})^2 \equiv \max\{\inf_{B:\,\mathrm{rank}(B) \leq r} d_n^2(A_0, B),\, \delta^2_{n,r}\}$. By Theorem 5 of [RT11], the rate in (3.3) is minimax optimal up to a logarithmic factor. To the best of the author's knowledge, the theorem above is the first result in the literature that addresses the posterior contraction rate in the context of trace regression in a fully Bayesian setup.
(3.2) may be verified in a case-by-case manner; or, generically, we can take $\mathcal{M} = \{r_0, r_0 + 1, \ldots\}$ if the model is well specified, at the cost of sacrificing the form of the oracle inequality (while still getting nearly optimal posterior contraction rates) in (3.3). In particular, the first condition of (3.2) prevents the largest eigenvalue of $A_{0,r}$ from growing too fast. This is in a similar spirit to Theorem 2.8 of [CvdV12], which shows that the magnitude of the signals cannot be too large for light-tailed priors to work in the sparse normal mean model. The second condition of (3.2) is typically a mild technical condition: we only need to choose $\eta > 0$ small enough.

Example: Isotonic regression
Consider the isotonic regression model $Y_i = f_0(x_i) + \varepsilon_i$ with $f_0$ non-decreasing and $\varepsilon_i$ i.i.d. $N(0, 1)$; for simplicity the design points are assumed to be fixed. Consider the prior $\Lambda_n$ on $\mathcal{I} = \mathbb{N}$ given in (3.4), where $c > 0$ is a constant to be specified later. Let $g_m \equiv g^{\otimes m}$, where $g$ is symmetric and non-increasing on $(0, \infty)$. Then $\bar g_m(\mu) \equiv m!\, g_m(\mu)\, 1_{\{\mu_1 \leq \ldots \leq \mu_m\}}(\mu)$ is a valid density on $\{\mu_1 \leq \ldots \leq \mu_m\}$. Given a model $\mathcal{F}_m$ chosen by the prior $\Lambda_n$, we randomly pick a set of change points $\{x_{i(k)}\}_{k=1}^m$ ($i(1) < \ldots < i(m)$) and put the prior $\bar g_m$ on the values $\{f(x_{i(k)})\}$. [HH03] proposed a similar prior with $\Lambda_n$ being uniform, since they assumed the maximum number of change points to be known a priori; below we derive a theoretical result without assuming this knowledge. Let $\tau^{\mathrm{iso}}_{m,g}$ denote the associated prior mass quantity, defined via a supremum over $f_{0,m} \in \arg\min_{g \in \mathcal{F}_m}\ell_n^2(f_0, g)$. Theorem 3.8. Suppose (3.5) holds. Then there exists some $c > 0$ in (3.4), depending on $\eta$, such that (3.6) holds. Here $(\varepsilon^{\mathrm{iso}}_{n,m})^2 \equiv \max\{\inf_{g \in \mathcal{F}_m}\ell_n^2(f_0, g),\, m\log(en)/n\}$, and the constants $C_i$ ($i = 1, 2$) depend on $\eta$.
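A minimal sketch of sampling from this two-step prior; the specific weight form $\lambda_n(m) \propto e^{-c\, m\log(en)}$ and the standard Cauchy choice for $g$ are illustrative assumptions (any heavy-tailed $g$ as in Lemma 3.9 would do).

```python
import numpy as np

def sample_isotonic_prior(n, c=1.0, seed=None):
    """Sketch of the two-step prior on monotone step functions with m pieces."""
    rng = np.random.default_rng(seed)
    # first step: number of constant pieces m, weights ~ exp(-c*m*log(e*n))
    ms = np.arange(1, n + 1)
    logw = -c * ms * np.log(np.e * n)
    w = np.exp(logw - logw.max())
    m = rng.choice(ms, p=w / w.sum())
    # second step: m change points among the design indices, ordered heights
    change = np.sort(rng.choice(n, size=m, replace=False))     # i(1) < ... < i(m)
    heights = np.sort(rng.standard_cauchy(m))                  # mu_1 <= ... <= mu_m, law g_bar_m
    # build the non-decreasing step function on the design points x_1, ..., x_n
    piece = np.searchsorted(change, np.arange(n), side='right')
    f = heights[np.clip(piece, 0, m - 1)]
    return f

print(sample_isotonic_prior(n=20, seed=0))
```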

(3.6) implies that if $f_0$ is piecewise constant, the posterior distribution contracts at nearly a parametric rate. For general isotonic signals $f_0 \in \mathcal{F}$ with $\|f_0\|_\infty < \infty$, by using Theorem 4.1 of [CGS15], we obtain a contraction rate of order $n^{-2/3}\log(en)$ in $\ell_n^2$. (3.5) can be checked by the following. Lemma 3.9. Suppose $f_0$ is square integrable, and the prior density $g$ is heavy-tailed in the sense that there exists some $\alpha > 0$ such that $\liminf_{|x| \to \infty} |x|^\alpha g(x) > 0$. Then for any $\eta \in (0, 1/\alpha)$, (3.5) holds uniformly in all $m \in \mathbb{N}$ for $n$ large enough depending on $\alpha$ and $\|f_0\|_{L_2([0,1])}$.

Example: Convex regression
Consider fitting the Gaussian regression model with a convex regression function, where $\mathcal{F}_m$ denotes the class of piecewise affine convex functions with at most $m$ pieces.
We will focus on the multivariate case, since the univariate case can be easily derived using the techniques exploited for isotonic regression. A prior on each model $\mathcal{F}_m$ can be induced by a prior on the slopes and the intercepts. The prior $\Lambda_n$ we will use on the index $\mathcal{I} = \mathbb{N}$ is given in (3.7), where $c > 0$ is a constant to be specified later. The first-step prior used in [HD11] is a Poisson proposal, which differs slightly from (3.7) by a logarithmic factor; this would affect the contraction rate only by a logarithmic factor.
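As an illustration of the induced prior: the max-affine representation $f(x) = \max_{1 \leq i \leq m}(a_i \cdot x + b_i)$ is the one used in the proofs (cf. Appendix C.3); the Cauchy choice for the slope/intercept densities below is an assumption in the spirit of Lemma 3.9.

```python
import numpy as np

def sample_max_affine_prior(m, d, scale=1.0, seed=None):
    """Sketch of the second-step prior on F_m: a piecewise affine convex
    function represented as a maximum of m affine pieces."""
    rng = np.random.default_rng(seed)
    a = scale * rng.standard_cauchy((m, d))     # slopes of the m affine pieces
    b = scale * rng.standard_cauchy(m)          # intercepts
    return lambda x: np.max(x @ a.T + b, axis=-1)   # f(x) = max_i <a_i, x> + b_i

f = sample_max_affine_prior(m=5, d=2, seed=0)
x = np.random.default_rng(1).uniform(size=(10, 2))  # 10 design points in [0, 1]^2
print(f(x))                                          # convex piecewise-affine values
```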
Theorem 3.10. Suppose (3.8) holds and $n \geq d$. Then there exists some $c > 0$ in (3.7), depending on $\eta$, such that the oracle inequality (3.9) holds. This oracle inequality shows that the posterior contraction rate of [HD11] (Theorem 3.3 therein) is far from optimal. (3.8) can be satisfied by using heavy-tailed priors $g(\cdot)$ in the same spirit as Lemma 3.9, provided $f_0$ is square integrable and the design points are regular enough (e.g. regular grids on $[0, 1]^d$). Explicit rates can be obtained using the approximation techniques in [HW16]: using the same proof as Lemma 4.10 therein, if $f_0$ is Lipschitz, the contraction rate in $\ell_n^2$ becomes the familiar one in the sense that $\inf_{m \in \mathbb{N}}(\varepsilon^{\mathrm{cvx}}_{n,m})^2 \lesssim \inf_{m \in \mathbb{N}} \max\{m^{-4/d},\, \log n \cdot m\log(3m)/n\} \lesssim (\log^2 n/n)^{4/(d+4)}$.
Remark 3.11. For univariate convex regression, the term log(3m) in (3.7)-(3.9) can be removed. The logarithmic term is due to the fact that the pseudodimension of F m scales as m log(3m) for d ≥ 2, cf. Lemma C.9.
Remark 3.12. Using similar priors and proof techniques we can construct a (nearly) rate-optimal adaptive Bayes estimator for the support function regression problem for convex bodies [Gun12]. There the models F m are support functions indexed by polytopes with m vertices, and a prior on F m is induced by a prior on the location of the m vertices. The pseudo-dimension of F m can be controlled using techniques developed in [Gun12]. Details are omitted.

Consider fitting the Gaussian regression model
$Y_i = x_i^\top \beta + u(z_i) + \varepsilon_i$, $1 \leq i \leq n$, where the dimension of the parametric part can diverge. We consider $\mathcal{U}$ to be the class of non-decreasing functions as an illustration (cf. Section 3.1.2). Consider the models $\mathcal{F}_{(s,m)}$ indexed by the sparsity level $s$ of $\beta$ and the number of pieces $m$ of $u$; in this example the model index $\mathcal{I}$ is a 2-dimensional lattice. Our goal here is to construct an estimator that satisfies an oracle inequality over the models $\{\mathcal{F}_{(s,m)}\}_{(s,m) \in \{1,\ldots,p\} \times \{1,\ldots,n\}}$. Consider the model selection prior (3.10), where $c > 0$ is a constant to be specified later. Here $X \in \mathbb{R}^{n \times p}$ is the design matrix, normalized so that the diagonal elements of $X^\top X/n$ take value $1$. For a chosen model $\mathcal{F}_{(s,m)}$, consider the following prior $\Pi_{n,(s,m)}$: randomly pick a support $S \subset \{1, \ldots, p\}$ with $|S| = s$ and a set of change points, and then put a prior $g_{S,Q}$ on $\beta_S$ and the values $u(z_{i(k)})$. For simplicity we use a product prior $g_{S,Q} \equiv g^{\otimes s} \otimes \bar g_m$, where $\bar g_m$ is the ordered prior of Section 3.1.2 on the values at the change points.
The condition $p \geq n$ can be replaced by $p \geq n^\delta$ for any $\delta > 0$ by changing the constants. $L > 0$ prevents $p$, $\|\beta_{0,s}\|_\infty$ and the maximal singular value of $X$ from being too large. The second condition of (3.11) is the same as in (3.5) (so in particular it can be checked using Lemma 3.9). When the model is well specified in the sense that $f_0(x, z) = x^\top\beta_0 + u_0(z)$ for some $\beta_0 \in B_0(s_0)$ and $u_0 \in \mathcal{U}$, the oracle rate in (3.12) becomes (3.13). The two terms in the rate (3.13) trade off two structures of the experiment: the sparsity of $h_\beta(x)$ and the smoothness level of $u(z)$. The resulting phase transition of the rate (3.13) in terms of these structures is in a sense similar to the results of [YLC19, YZ16]. It is also easy to derive some explicit rate results from (3.13). For instance, if $u_0 \in \mathcal{U}$ and $\|u_0\|_\infty < \infty$, then by using Theorem 4.1 of [CGS15], (3.13) reduces to $(s_0\log(ep) \wedge \mathrm{rank}(X))/n + n^{-2/3}\log(en)$.

Density estimation
Suppose $X_1, \ldots, X_n$ are i.i.d. samples from a density $f \in \mathcal{F}$ with respect to a measure $\nu$ on the sample space $(\mathcal{X}, \mathcal{A})$. We consider a special form of $\mathcal{F}$ indexed by a uniformly bounded class $\mathcal{G}$. Lemma 3.14. Suppose that $\mathcal{G}$ is uniformly bounded. Then Assumption A is satisfied for the Hellinger metric $h$ with constants $\{c_i\}_{i=1}^3$, $\kappa$ depending on $\mathcal{G}$ only. Corollary 3.15. For density estimation, let $d_n \equiv h$. If $\mathcal{G}$ is a class of uniformly bounded functions and Assumptions B-C hold, then (2.9)-(2.11) hold.
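For reference, the intrinsic metric here is the Hellinger metric; up to the normalization convention (a factor $1/2$ may or may not be included),
$$h^2(f_0, f_1) \;\equiv\; \frac{1}{2}\int \big(\sqrt{f_0} - \sqrt{f_1}\big)^2\, d\nu \;=\; 1 - \int \sqrt{f_0 f_1}\, d\nu.$$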
Remark 3.16. As in Remark 3.3, the uniform boundedness is included here for simplicity. See Appendix F for an example on location mixture models where this restriction is removed.

Gaussian autoregression
Consider the Gaussian autoregression model $X_i = f(X_{i-1}) + \varepsilon_i$, where $f$ belongs to a function class $\mathcal{F}$ with a uniform bound $M$, and the $\varepsilon_i$'s are i.i.d. $N(0, 1)$. Then $(X_n)$ is a Markov chain with transition density $p_f(y|x) = \phi(y - f(x))$, where $\phi$ is the standard normal density. By the arguments on page 209 of [GvdV07a], this chain has a unique stationary distribution with density $q_f$ with respect to the Lebesgue measure $\lambda$ on $\mathbb{R}$. We assume that $X_0$ is generated from this stationary distribution under the true $f$. [GvdV07a] (cf. Section 7.4) uses a weighted $L_s$ ($s > 2$) norm to check the local entropy condition, and an average Hellinger metric as the loss function. Our results here use the metric $d_{r,M}$ defined as a weighted $L_2$ norm.

Gaussian time series
Suppose $X_1, X_2, \ldots$ is a stationary Gaussian process with spectral density $f \in \mathcal{F}$ defined on $[-\pi, \pi]$, so that the covariance matrix of $(X_1, \ldots, X_n)$ is the associated Toeplitz matrix. We consider a special form of $\mathcal{F}$. The metric $D_n$ is bounded from above by the usual $L_2$ metric, and can be related to the $L_2$ metric from below (cf. Lemma B.3 of [GZ16]). Our result then shows that the metric to use in the entropy condition can be weakened to the $L_2$ norm rather than the much stronger $L_\infty$ norm as on page 202 of [GvdV07a]. Such improvements are particularly important in, e.g., shape-constrained models that are not totally bounded in $L_\infty$ (cf. [GS13]). See also [CGR04, RCL12] for some related work on Bayesian spectral density estimation.

Covariance matrix estimation
Consider the set of $p \times p$ covariance matrices whose minimal and maximal eigenvalues are bounded by $L^{-1}$ and $L$ (where $L > 1$), respectively. For any $\Sigma_0$ in this set, the intrinsic metric is (a multiple of) the Frobenius metric.

Example: Covariance matrix estimation in the sparse factor model
In this example, the model index $\mathcal{I}$ is a 2-dimensional lattice, and the sparsity structure depends on the rank structure. Consider the model selection prior (3.14), where $c > 0$ is a constant to be specified later.
Theorem 3.23. Let $p \geq n$. There exist some $c > 0$ in (3.14) and some sequence of sieve priors $\Pi_{n,(k,s)}$ on $\mathcal{M}_{(k,s)}$, depending on $L$, such that an oracle posterior contraction rate of order $ks\log p/n$ holds. Since the spectral norm (non-intrinsic) is dominated by the Frobenius norm (intrinsic), our result shows that if the model is well specified (i.e. $\Sigma_0 \in \mathcal{M}$), then we can construct an adaptive Bayes estimator with convergence rates in both norms no worse than $ks\log p/n$. [PBPD14] considered the same sparse factor model, where they proved a strictly sub-optimal rate $k^3 s\log p\log n/n$ in spectral norm under a condition on $ks\log p$. [GZ15] considered a closely related sparse PCA problem, where the convergence rate under the spectral norm achieves the same rate as here (cf. Theorem 4.1 therein), while a factor of $\sqrt{k}$ is lost when using the Frobenius norm as a loss function (cf. Remark 4.3 therein).
It should be mentioned that the sieve prior Π n,(k,s) is constructed using the metric entropy of M (k,s) and hence the resulting Bayes estimator and the posterior mean as a point estimator are purely theoretical. We use this example to illustrate (i) the construction scheme of a (nearly) optimal adaptive procedure for a multi-structured experiment based on the metric entropy of the underlying parameter space, and (ii) derivation of contraction rates in non-intrinsic metrics when these metrics can be related to the intrinsic metrics nicely.
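A schematic of such a sieve prior is simply a uniform distribution over a finite net of the parameter set; the toy net below uses covariance matrices of the assumed form $\Lambda\Lambda^\top + I$ with sparse rank-$k$ loadings, rather than the metric-entropy based net used in the proof of Theorem 3.23.

```python
import numpy as np

def sieve_prior_sample(net_points, rng=None):
    """Sketch of a 'sieve prior': uniform over a finite epsilon-net."""
    rng = np.random.default_rng(rng)
    j = rng.integers(len(net_points))        # uniform index over the net
    return net_points[j]

# toy net: p x p matrices Lambda Lambda^T + I with s-sparse rank-k loadings (assumed form)
rng = np.random.default_rng(0)
p, k, s = 30, 2, 10
net = []
for _ in range(100):
    Lam = np.zeros((p, k))
    rows = rng.choice(p, size=s, replace=False)     # sparse support of the loadings
    Lam[rows] = rng.standard_normal((s, k))
    net.append(Lam @ Lam.T + np.eye(p))
Sigma = sieve_prior_sample(net, rng)
print(Sigma.shape)
```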
It is also possible to use similar strategies in the closely related problem of estimating a sparse precision matrix (cf. [BG15]), but we omit the repetitive details here.

Image boundary detection
Consider the setup in [LG17] as follows. Let $\{f(\cdot; \phi): \phi \in \mathbb{R}^p\}$ be a class of densities dominated by a $\sigma$-finite measure $\mu$ and indexed by a $p$-dimensional parameter. Here $X_i$ can be understood as the location of the $i$-th observation and $Y_i$ the corresponding pixel intensity. Let $\theta = (\xi, \rho, \Gamma) \in \Theta$ be the parameter, and define the discrepancy $d_n$ accordingly; here $\lambda$ denotes the Lebesgue measure on $[0, 1]^d$ and $\lambda(B) = \int_B d\lambda$. Clearly $d_n$ is symmetric, but may not satisfy the triangle inequality. The following lemma relates $d_n$ to the metric $\lambda(\cdot\,\Delta\,\cdot)$ of interest when two elements in $\Theta$ are close to each other in $d_n$.
Corollary 3.26. Suppose that $\{f(\cdot; \phi): \phi \in \Theta \subset \mathbb{R}^p\}$ is any parametric class considered in Section 3.1, and that there exist some $m \in \mathbb{N}$, $\eta > 0$ satisfying the stated conditions. If Assumptions B-C hold for the $d_n$ described above with $\theta_{0,m}$ replaced by $\theta_0$, then for $n$ large enough (depending only on $\xi_0, \rho_0, \eta$), the corresponding oracle contraction holds, where the constants $\{C_i\}_{i=1}^2 > 0$ depend on $\xi_0, \rho_0, \eta$. Our result can be used for smooth boundaries as studied in [LG17], but we will be mainly interested in non-smooth boundaries. Indeed, we will propose a hierarchical prior (cf. Section 3.6.1) under which the posterior distribution is nearly parametrically rate-adaptive to non-smooth polytopal regions $\Gamma$.

Example: Detection of polytopal image boundaries
For simplicity of presentation, we specify the binary model for $\{f(\cdot; \phi)\}$, and let $\Theta_m$ denote the parameters whose region $\Gamma$ is a polytope in $[0, 1]^2$ with at most $m$ vertices. Consider the model selection prior (3.15), where $c > 0$ is a constant to be specified later. A prior $\Pi_{n,m}$ on the model $\Theta_m$ can be induced by a product prior on $(\xi, \rho, \Gamma)$. In particular, we put priors on $\xi$ and $\rho$ with densities $g_\xi$ and $g_\rho$ respectively, and a prior on $\Gamma$ can be induced by taking the convex hull of $m$ randomly generated points in $[\eta, 1-\eta]^2$ with density $g_\Gamma^{\otimes m}$. For simplicity, we assume that $g_\xi, g_\rho, g_\Gamma$ all follow the uniform distribution. Theorem 3.27. In the above setting, if $\theta_0 \in \Theta_m$ with $\xi_0 \neq \rho_0$, then there exists some $c > 0$ in (3.15) such that the stated contraction holds for $n$ large enough.
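A sketch of drawing the region $\Gamma$ from the prior described above (the priors on the intensity parameters $\xi, \rho$ are omitted; `scipy` is used only for the convex hull).

```python
import numpy as np
from scipy.spatial import ConvexHull

def sample_polytope_prior(m, eta=0.05, seed=None):
    """Sketch of the prior on polytopal regions Gamma: the convex hull
    of m points drawn uniformly on [eta, 1-eta]^2."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(eta, 1 - eta, size=(m, 2))   # m random vertices, density g_Gamma^{(m)}
    hull = ConvexHull(pts)
    return pts[hull.vertices]                      # vertices of Gamma in counter-clockwise order

print(sample_polytope_prior(m=5, seed=0))
```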

Intensity estimation in a Poisson point process model
The goal is to recover the boundary $f: [0, 1] \to \mathbb{R}$ of the support of the intensity $\lambda_f$. Note that a single dominating measure $\mu$ is not available for all probability distributions $P^{(n)}_f$; the likelihood ratio takes the form $dP_{f_0}/dP_{f_1} = e^{n\int(f_0 - f_1)}\,\mathbf{1}_{\{\forall i:\, f_0(X_i) \leq Y_i\}}$, and therefore the Kullback-Leibler divergence is given by the one-sided quantity $\bar L_1$. The technical problem here is that $\bar L_1$ is not symmetric; fortunately, by a slight modification, our machinery can still be applied (see also [MR13]).
For a generic function class $\mathcal{G}$ defined on $[0, 1]$, the left bracketing number $N_{[}(\varepsilon, \mathcal{G}, \bar L_1)$ is the smallest number $M$ of functions $g_1, \ldots, g_M$ such that for any $g \in \mathcal{G}$ there exists some $j \in \{1, \ldots, M\}$ with $g_j \leq g$ and $\int_0^1 (g - g_j) \leq \varepsilon$; note that in this definition $g_j$ need not belong to $\mathcal{G}$. Suppose that (i) Assumption B holds with the set in (2.4) restricted to $f \geq f_0$, and (ii) Assumption C holds with the set in (P2) restricted to $f \geq f_{0,m}$; then (2.9)-(2.11) hold with the posterior distribution restricted to $f \geq f_0$.
In Section 3.7.1 we will use the above result to derive oracle contraction rates for estimating piecewise constant intensities.
It is also possible to consider the two-sided L 1 loss, at the expense of stronger conditions. Below is a result in this direction.
Corollary 3.29. Suppose the conditions above hold for some $f_{0,m} \leq f_0$. Then, using the prior (2.8), there exists some constant $C > 0$ such that the stated bound holds for any $m \in \mathcal{M}$.

Example: Estimating piecewise constant intensity in a Poisson point process model
Consider fitting the intensity $\lambda_f$ in the Poisson point process model by the class $\mathcal{F}_m$ of piecewise constant functions on $[0, 1]$ with $m$ pieces. A prior on $\mathcal{F}_m$ can be induced by a prior $\Pi^t_{n,m}$ on the change points $\{t_1 < \ldots < t_{m-1}\}$ followed by a prior $\Pi^a_{n,m}$ on the levels $\{a_j\}_{j=1}^m$. More specifically, we choose $\Pi^t_{n,m}$ with density $t = (t_1, \ldots, t_{m-1}) \mapsto (m-1)!\, 1_{\{t_1 < \ldots < t_{m-1}\}}(t)$, and $\Pi^a_{n,m}$ with product density $g_a^{\otimes m}$. As before, we assume that $g_a$ is symmetric, non-increasing and satisfies the following: $g_a$ has full support, and there exist some sequence $\{R_n\}$ with $\log R_n \asymp \log n$ and a large enough absolute constant $C > 0$ such that the required mass condition holds. It is easily seen that this condition is very weak, and essentially does not require any tail condition on $g_a$. The reason is that the information geometry of the model studied here does not change with the $L_\infty$ size of the model; the impact of the latter enters only through the complexity of the model via logarithmic factors.
Consider the following prior $\Lambda_n$ on the model index $\mathcal{I} \equiv \mathbb{N}$, where $c > 0$ is a constant to be specified later.
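A minimal sketch of sampling from this hierarchical prior; the weight form $\lambda_n(m) \propto e^{-c\, m\log(en)}$ and the standard Cauchy choice for $g_a$ are illustrative assumptions (the paper only requires $g_a$ symmetric, non-increasing, with full support).

```python
import numpy as np

def sample_pc_intensity_prior(c=1.0, n=100, seed=None):
    """Sketch of the hierarchical prior on piecewise constant intensities on [0, 1]."""
    rng = np.random.default_rng(seed)
    # first-step prior on the number of pieces m
    ms = np.arange(1, n + 1)
    logw = -c * ms * np.log(np.e * n)
    w = np.exp(logw - logw.max())
    m = rng.choice(ms, p=w / w.sum())
    # second-step prior: change points = order statistics of uniforms, i.i.d. levels
    t = np.sort(rng.uniform(size=m - 1))     # density (m-1)! * 1{t_1 < ... < t_{m-1}}
    a = rng.standard_cauchy(m)               # levels a_1, ..., a_m ~ g_a^{(m)}
    knots = np.concatenate(([0.0], t, [1.0]))
    return knots, a                          # f = a_j on [knots[j], knots[j+1])

knots, levels = sample_pc_intensity_prior(seed=0)
```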

Here the constants C i (i = 1, 2) are absolute.
Compared with Theorem 5.3 of [RSH17], our Theorem 3.30 works with a slightly weaker one-sided $L_1$ loss, but enjoys an exact form of an oracle posterior contraction rate. From here it is straightforward to derive rate results assuming Hölder smoothness of $f_0$ (as in [RSH17]). Note that here we do not require the technical condition $\log m \lesssim \log n$ as in [RSH17], so our result shows rate-adaptivity of the posterior distribution to intensities with a fixed number of constant pieces.

A.1. Proof of Theorem 2.3: main steps
First we need a lemma allowing a change-of-measure argument.
Lemma A.1. Let Assumption A hold. There exists some constant c 4 ≥ 1 only depending on c 1 , c 3 and κ such that for any random variable U ∈ [0, 1], any δ n ≥ d n (f 0 , f 1 ) and any j ∈ N, The next propositions solve the posterior contraction problem for the 'local' model F m .
The proofs of these results will be detailed in later subsections.
Proof of Theorem 2.3: main steps. Instead of (2.9), we will prove a slightly stronger statement as follows: for any $j \geq 8c_2/c_7 h$ and $h \geq 2c_4 c_8 c_2$, the bound (A.3) holds. Here the constants $c_i$ ($i = 1, 2$) depend on the constants involved in Assumption A and on $c, h$.

Proof of (A.3).
First consider the overfitting case. By Proposition A.2 and Lemma A.1, we see that when δ 2 n,m ≥ d 2 n (f 0 , f 0,m ) holds, for j ≥ 8c 2 /c 7 h, it holds that Here in the second line we used the fact that d 2 Next consider the underfitting case: fix m ∈ M such that δ 2 n,m < d 2 n (f 0 , f 0,m ). Apply Proposition A.3 and Lemma A.1, and use similar arguments to see that for j ≥ 8c 2 /c 7 h, Here in the second line we used (i) 2d 2 n (f, f 0,m ) ≥ d 2 n (f, f 0 ) − 2d 2 n (f 0 , f 0,m ), and (ii) δ n,m ≥ d n (f 0 , f 0,m ). The claim of (A.3) follows by combining the estimates. Proof of (2.11). The proof is essentially integration of tail estimates by a peeling device. Let the event A j be defined via Then, The inequality in the first line of the above display is due to Jensen's inequality applied with d 2 n (·, f 0 ) (the convexity follows since f → d n (f, f 0 ) is nonnegatively convex, so is its square), followed by Cauchy-Schwarz inequality. The summation can be bounded up to a constant depending on γ, c 1 , c 2 by where the inequality follows since nε 2 n,m ≥ nε 2 n,1 ≥ 1. This quantity can be bounded by a constant multiple of ∞ 0 x γ e −x/c2 dx independent of m. Now the proof is complete by noting that δ 2 n,m majorizes 1/n up to a constant, and then taking infimum over m ∈ M.

A.2. Proofs of Propositions A.2 and A.3
We will need several lemmas before the proof of Propositions A.2 and A.3.
Lemma A.4. Let Assumption A hold. Let F be a function class defined on the sample space X. Suppose that N : R ≥0 → R ≥0 is a non-increasing function such that for some ε 0 ≥ 2/(c 2 ∧ c 3 ) · d 0 and every ε ≥ ε 0 , the following entropy estimate holds: Then for any ε ≥ ε 0 , there exists some test φ n such that The constants c 5 , c 6 , c 7 are taken from Lemma 2.1.
Lemma A.5. Fix $\varepsilon > 0$. Let Assumption A hold for some $d_0$ such that $\varepsilon \geq 2/(c_2 \wedge c_3) \cdot d_0$. Suppose that $\Pi$ is a probability measure on $\{f \in \mathcal{F}: d_n(f, f_0) \leq \varepsilon\}$. Then for every $C > 0$, there exists some $C' > 0$ depending on $C, \kappa$ such that the stated bound under $P^{(n)}_{f_0}$ holds. The proofs of these lemmas can be found in Appendix D. Here we used the left side of (2.5). This implies that for any random variable $U \in [0, 1]$, we have

Proof of Proposition
On the power side, with m = jhm applied to (A.5) we see that n,jhm ≤ 2c 6 e −(c7/c 2 )njhδ 2 n,m .
The first inequality follows from the right side of (2.5) since c 2 (jh) γ δ 2 n,m ≥ δ 2 n,jhm , and the last inequality follows from the left side of (2.5). On the other hand, by applying Lemma A.5 with C = c 3 and ε 2 ≡ c 7 jhδ 2 n,m /8c 3 c 2 , we see that there exists some event E n such that n,m /8c3c 2 and it holds on the event E n that Note that where the inequality follows from (A.8). On the other hand, the expectation term in the above display can be further calculated as follows: The first term in the second inequality follows from (A.7) and the second term follows from (P1) in Assumption C along with the left side of (2.5). By (P1)-(P2) in Assumption C and j ≥ 8c 2 /c 7 h, We conclude (A.1) from (A.6), probability estimate on E c n . Proof of Proposition A.3. The proof largely follows the same lines as that of Proposition A.2. See Appendix D for details.

A.3. Completion of proof of Theorem 2.3
Proof of (2.10). For any m ∈ M such that δ 2 n,m ≥ d 2 n (f 0 , f 0,m ), following the similar reasoning in (A.9) with j = 8c 2 /c 7 h, From here (2.10) can be established by controlling the probability estimate for E c n as in Proposition A.2, and a change of measure argument using Lemma A.1.

A.4. Proof of Lemma 2.1
Proof of Lemma 2.1. Without loss of generality, we assume that d 0 = 0. Let c > 0 be a constant to be specified later. Consider the test statistics φ n ≡ 1 log(p . We first consider type I error. Under the null hypothesis, we have for any λ 1 ∈ (0, 1/κ Γ ), f1) .

A.5. Proof of Lemma A.1
We recall a standard fact.

Proof of Lemma A.1. For c = 2c 3 , consider the event E n ≡ log(p n . By Lemma A.6, we have for some constant C > 0 depending on c 1 , c 3 and κ, Here in ( * ) we used d n (f 0 , f 1 ) ≤ δ n . Then completing the proof.

A.6. Proof of Proposition 2.2
Proof of Proposition 2.2. Let $\Sigma_n = \sum_m e^{-2n\delta^2_{n,m}}$ be the total mass. The first condition of (P1) is trivial. We only need to verify the second condition of (P1), where the first inequality follows from (2.5) and the second from the condition $h \geq 2c_2$.

Appendix B: Proofs in Section 3 Part I: results for models
Proof of Lemma 3.1. Let P (n) θ0 denote the probability measure induced by the joint distribution of (X 1 , . . . , X n ) when the underlying signal is θ 0 .

Proof of Corollary 3.2. The claim follows from Lemma 3.1 and Theorem 2.3.
Proof of Lemma 3.14. Since the log-likelihood ratio for $X_1, \ldots, X_n$ can be decomposed into sums of the log-likelihood ratios for single samples, and the log-likelihood ratio is uniformly bounded over $\mathcal{F}$ (since $\mathcal{G}$ is bounded), the classical Bernstein inequality applies to show that for any couple $(f_0, f_1)$, the local Gaussianity condition in Assumption A holds with $v = \kappa_g\, n\,\mathrm{Var}_{f_0}(\log f_0/f_1)$ and $c = \kappa_\Gamma$, where $\kappa_g, \kappa_\Gamma$ depend only on $\mathcal{G}$. Hence we only need to verify that $\mathrm{Var}_{f_0}(\log f_0/f_1) \lesssim h^2(f_0, f_1)$. This follows from Lemma 8 of [GvdV07b] and the fact that the Hellinger metric is dominated by the Kullback-Leibler divergence.

where the last inequality follows by stationarity. On the other hand, by Jensen's inequality, Collecting the above estimates, we see that for |λ| ≤ 1, completing the proof.
Proof of Corollary 3.18. The claim follows from Lemma 3.17 and Theorem 2.3.

Proof of Lemma 3.19.
For any g ∈ G, let p (n) g denote the probability density function of a n-dimensional multivariate normal distribution with covariance matrix Σ g ≡ T n (f g ), and P (n) g the expectation taken with respect to the density p (n) g . Then for any g 0 , g 1 ∈ G, where we used the fact that for a random vector X with covariance matrix Σ, EX AX = tr(ΣA). Let Let B = U ΛU be the spectral decomposition of B where U is orthonormal and Λ = diag(λ 1 , . . . , λ n ) is a diagonal matrix. Then we can further compute where g 1 , . . . , g n 's are i.i.d. standard normal. Note that for any |t| < 1/2, where the inequality follows from With t = −λλ i /2, we have that for any |λ| < 1/ max i λ i , Denote · and · F the matrix operator norm and Frobenius norm respectively. By the arguments on page 203 of [GvdV07a], we have Σ g ≤ 2π e g ∞ and Σ −1 Since G is a class of uniformly bounded function classes, the spectrum of the covariance matrices Σ g and their inverses running over g must be bounded. Hence Next, note that where in the first inequality we used MN F = NM F for symmetric matrices M, N and the general rule P Q F ≤ P Q F . Collecting the above estimates we see that Assumption A is satisfied for v = κ g nD 2 n (g 0 , g 1 ) and c = κ Γ for constants κ g , κ Γ depending on G only.
Finally we relate n −1 P ) and D 2 n (g 0 , g 1 ). First by (B.3), we have Here Then we may verify Assumption A along the lines in the proof of Lemma 3.1, by considering each of the terms above by virtue of independence of X i 's.
In particular, the test φ n is constructed in the 'same way' as in the proof of Lemma 2.1 with a modified way of writing: nL1(f1,f0) .

Now for type I error,
Here the last equality follows as For type II error, note that as soon as f ≥ f 1 , This proves the modified version of Lemma 2.1 in the current setting. Then in the proof of Lemma A.4, the entropy condition needs to be replaced by the entropy with left bracketing, due to the reasoning towards the last display in the proof of Lemma A.4. Now in the proof of Proposition A.2, we apply Lemma A.4 with the set restricted to f ≥ f 0 . The set in the control of denominator in (A.8) can be restricted to f ≥ f 0,m . The rest of the proofs carry over exactly so we omit the details.
Proof of Corollary 3.29. The proof combines the change-of-measure idea of the current paper with the results in [RSH17]. Let $m \in \mathcal{M}$ be such that $\delta^2_{n,m} \geq L_1(f_0, f_{0,m})$. Note that condition (ii) entails that

Then, using Theorem 2.3 of [RSH17], we conclude the desired bound, where $K > 0$ is a constant to be chosen later; the claim follows by choosing $K = 2C_3$. We may similarly consider $m \in \mathcal{M}$ such that $\delta^2_{n,m} < L_1(f_0, f_{0,m})$.

C.1. Proof of Theorem 3.7
Lemma C.1. Let r ∈ I. Suppose that the linear map X : R m1×m2 → R n is uniform RIP(ν; I). Then for any ε > 0 and A 0 ∈ R m1×m2 such that rank(A 0 ) ≤ r, we have We will need the following result. Proof of Lemma C.2. The case for B = 1 follows from Lemma 3.1 of [CP11] and the general case follows by a scaling argument. We omit the details.
Proof of Lemma C.3. We only need to consider r ≤ r max . First note that , then by noting that the Frobenius norm is sub-multiplicative and that . Now withε n,r ≡ δn,r ν √ c3ρr ∧ 1 we see that (C.1) can be further bounded from below by where v d = vol(B d (0, 1)), and v d ≥ (1/ √ d) d . The right side of the above display is bounded from below by e −2nδ 2 n,r , if we require max log τ −1 r,g , log(ε −1 n,r ∨ 1) ≤ logm/(2η).
Proof of Theorem 3.7. The theorem follows by Corollary 3.2, Proposition 2.2 coupled with Lemmas C.1 and C.3.
Proof of Lemma C.4. Let Q m denote all m-partitions of the design points x 1 , . . . , x n . Then it is easy to see that |Q m | = n m−1 . For a given m-partition Q ∈ Q m , let F m,Q ⊂ F m denote all monotonic non-decreasing functions that are constant on the partition Q. Then the entropy in question can be bounded by On the other hand, for any fixed m-partition Q ∈ Q m , the entropy term above . . , f(x n )) : f ∈ F m,Q }. By Pythagoras theorem, the set involved in the entropy is included in {γ ∈ P n,m,Q : γ − π P n,m,Q (g) 2 ≤ 2 √ nε} where π P n,m,Q is the natural projection from R n onto the subspace P n,m,Q . Clearly P n,m,Q is contained in a linear subspace with dimension no more than m. Using entropy result for the finite-dimensional space [Problem 2.1.6 in [vdVW96], page 94 combined with the discussion in page 98 relating the packing number and covering number], The claim follows by combining the estimates and log n m−1 ≤ m log(en).
Hence we can take δ 2 n,m ≡ 4 log(6/c5) Proof of Lemma C.5. Let Q 0,m = {I k } m k=1 be the associated m-partition of {x 1 , . . . , x n } of f 0,m ∈ F m with the convention that {I k } ⊂ {x 1 , . . . , x n } is ordered from smaller values to bigger ones. Then it is easy to see that μ 0,m = (μ 0,1 , . . . , μ 0,m ) ≡ f 0,m (x i(1) ), . . . , f 0,m (x i(m) ) ∈ R m is well-defined and μ 0,1 ≤ . . . ≤ μ 0,m . It is easy to see that any f ∈ F m,Q0,m satisfying the property that Here the first inequality in the last line follows from the definition ofḡ m and τ iso m,g . The claim follows by verifying (3.5) implies that the second and third term in the exponent above are both bounded by 1 2η · m log(en) [the third term does not contribute to the condition since √ c 3 δ −1 n,m ≤ n by noting c 3 = 1 in the Gaussian regression setting and definition of η].
Proof of Theorem 3.8. The theorem follows by Corollary 3.2, Proposition 2.2 coupled with Lemmas C.4 and C.5.
We now prove Lemma 3.9. We need the following result.

C.3. Proof of Theorem 3.10
Checking the local entropy Assumption B requires some additional work; the notion of pseudo-dimension will be useful in this regard. Following [Pol90], Section 4, a subset $V$ of $\mathbb{R}^d$ is said to have pseudo-dimension $t$, denoted $\mathrm{pdim}(V) = t$, if for every $x \in \mathbb{R}^{t+1}$ and indices $I = (i_1, \cdots, i_{t+1}) \in \{1, \cdots, n\}^{t+1}$ with $i_\alpha \neq i_\beta$ for all $\alpha \neq \beta$, we can always find a sub-index set $J \subset I$ such that no $v \in V$ satisfies both $v_i > x_i$ for all $i \in J$ and $v_i < x_i$ for all $i \in I \setminus J$.
Lemma C.8. Let $V$ be a subset of $\mathbb{R}^n$ with $\sup_{v \in V}\|v\|_\infty \leq B$ and pseudo-dimension at most $t$. Then, for every $\varepsilon > 0$, the stated entropy bound holds for some absolute constant $\kappa \geq 1$.
Proof of Lemma C.7. Note that the entropy in question can be bounded by log N c 5 ε √ n, {P n,m −g}∩B n (0, 2 √ nε), · 2 . Since translation does not change the pseudo-dimension of a set, P n,m − g has the same pseudo-dimension with that of P n,m , which is bounded from above by D m by assumption. Further note that {P n,m − g} ∩ B n (0, 2 √ nε) is uniformly bounded by 2 √ nε, hence an application of Lemma C.8 yields that the entropy can be further bounded as follows: log N c 5 ε, {f ∈ F m : n (f, g) ≤ 2ε}, n ) ≤ κD m log 4 + 4n/c 5 ) ≤ C · D m log n for some constant C > 0 depending on c 5 whenever n ≥ 2.
The pseudo-dimension of the class of piecewise affine functions F m can be well controlled, as the following lemma shows. Lemma C.9 (Lemma 4.9 in [HW16]). pdim(P n,m ) ≤ 6md log 3m.
Proof of Lemma C.10. We write $f_{0,m} \equiv \max_{1 \leq i \leq m}(a_i \cdot x + b_i)$ throughout the proof. We first establish the claimed bound. To see this, for any $x \in \mathcal{X}$, there exists some index $i_x \in \{1, \ldots, m\}$ attaining the maximum; the reverse direction can be shown similarly, whence the claim follows by taking the supremum over $x \in \mathcal{X}$. For the second condition of (2.5), note that for $\gamma = 2$, in order to verify $\delta^2_{n,hm} \leq h^2\delta^2_{n,m}$, it suffices to have $hm\log(3hm) \leq h^2 m\log(3m)$, equivalently $3hm \leq (3m)^h$, and hence $3^{h-1} \geq h$ for all $h \geq 1$ suffices. This is valid, completing the proof.
Proof of Theorem 3.10. This is a direct consequence of Corollary 3.2, Lemma C.10 and C.11, combined with Proposition 2.2.
First consider $s\log(ep) \leq \mathrm{rank}(X)$. Using the notation in Lemma C.12, where $f_{0,(s,m)} \in \mathcal{F}_{(s,m),(S_0,Q_0)}$. To bound the prior mass of the above display from below, it suffices to bound the product of the following two terms. Here the inequality for the first term follows by bounding $\|X\beta - X\beta_{0,s}\|$, where $\sigma_\Sigma$ denotes the largest singular value of $X^\top X/n$. Note that $\sigma_\Sigma \leq \sqrt{p}$, since the trace of $X^\top X/n$ is $p$ and the trace of a p.s.d. matrix dominates the largest eigenvalue. The set above is supported on $\mathbb{R}^p_{S_0}$ and hence can be further bounded from below by $\tau_s$. The first terms in the above two lines can be verified by (3.11). The other terms in the above two lines do not contribute, by noting that $2c_3/\delta^2_{n,m} \leq \frac{2c_3 c_7}{4\log(6/c_5)}\, n \leq$

Q. Han
(1/2)n ≤ en since c 3 = 1 (in Gaussian regression model) and c 7 ∈ (0, 1), while 2c 3 σ 2 Σ /δ 2 n,s ≤ σ 2 Σ n ≤ pn ≤ p 2 and η < 1/4. Next for s log(ep) > rank(X), we may proceed with To bound the prior mass of the above display from below, it suffices to bound from below the product of π m and Let U ∈ R n×n and V ∈ R p×p give rise to the SVD of X: By choosing c > 2c 3 σ 2 1 ( β 0,s ∞ + 1) 2 , the RHS of the previous display can be bounded from below by g(1), as desired. π m can be handled similarly as in the case s log(ep) ≤ rank(X).
Proof of Theorem 3.13. The claim of the theorem follows by Corollary 3.2, Proposition 2.2 and Lemmas C.12-C.14.

C.5. Proof of Theorem 3.23
Lemma C.15. For any $\Sigma_0\in\mathcal M_{(k,s)}$, the following entropy estimate holds:

Proof. The set involved in the entropy is equivalent to We claim that $\sup_{\Lambda\in\mathcal R_{(k,s)}}\lVert\Lambda\Lambda^\top\rVert_F\le\sqrt{kL}$. To see this, let $\Lambda\equiv P\Xi Q$ be the singular value decomposition of $\Lambda$, where $P\in\mathbb R^{p\times p}$ and $Q\in\mathbb R^{k\times k}$ are orthogonal matrices and $\Xi\in\mathbb R^{p\times k}$ is a (rectangular) diagonal matrix. Then $\lVert\Lambda\Lambda^\top\rVert_F^2=\lVert\Xi\Xi^\top\rVert_F^2\le kL$, proving the claim. Combined with (C.4) and a Euclidean embedding, we see that the entropy in question can be bounded as follows: where $B_0(s;pk)\equiv\{v\in\mathbb R^{pk}:|\operatorname{supp}(v)|\le s\}$.
Proof of Theorem 3.23. Take $\delta^2_{n,(k,s)}=K\bar C\,ks\log(\bar Cp)/n$ for some $\bar C\ge e$ depending on $c_5,c_7,L$ and some absolute constant $K\ge1$. Clearly (2.5) holds with $c=1$, $\gamma=1$, $h_0=\infty$. The prior $\Pi_{n,(k,s)}$ on $\mathcal M_{(k,s)}$ will be the uniform distribution on a minimal $\sqrt{\bar Cks\log(\bar Cp)/(c_3n)}$-covering set of $\{\Sigma\in\mathcal M_{(k,s)}\}$ under the Frobenius norm $\lVert\cdot\rVert_F$. The above lemma entails that the cardinality of such a cover is no more than $e^{\bar C'ks\log(\bar C'p)}$ for another constant $\bar C'\ge e$ depending on $c_3,c_5,c_7,L$. Hence we have that which can be bounded from below by $e^{-2n\delta^2_{n,(k,s)}}$ by choosing $K$ large enough. The claim of Theorem 3.23 now follows from these considerations along with Corollary 3.22 and Proposition 2.2.
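To make the final mass computation explicit, here is a schematic version of the last step (a sketch with placeholder constants; $\mathcal N$ denotes the covering set just constructed, and $B$ a Frobenius ball of the covering radius around $\Sigma_0$, which contains at least one point of $\mathcal N$):
$$\Pi_{n,(k,s)}(B)\ \ge\ \frac1{|\mathcal N|}\ \ge\ e^{-\bar C'ks\log(\bar C'p)}\ \ge\ e^{-2n\delta^2_{n,(k,s)}},$$
where the last inequality holds once $K$ is chosen so large that $2K\bar C\,ks\log(\bar Cp)\ge\bar C'ks\log(\bar C'p)$.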

C.6. Proof of Theorem 3.27
Lemma C.16. For $\theta_0\in\Theta_m$, we have
$$\log N\big(c_5\varepsilon,\{\theta\in\Theta_m: d_n(\theta,\theta_0)\le2\varepsilon\},d_n\big)\ \le\ 4m\log\frac{C_\eta m}{c_5^4\varepsilon^4}.$$

Proof. We first claim that for $\varepsilon\le1$, To see this, fix $\delta>0$ to be chosen later, and partition $[0,1]^2$ into small squares with side length $\delta$. Let $D_\delta$ be the set of all polytopes in $[0,1]^2$ with at most $m$ vertices, all located on the grid points of these small squares. Apparently $|D_\delta|\le\big((1+1/\delta)^2\big)^m$. Then for each $\Gamma\in\mathcal C_m$, let $\Gamma_\delta\in D_\delta$ be such that $\Gamma_\delta\supset\Gamma$ and that, for every vertex $v$ of $\Gamma$, there exists a vertex $v_\delta$ of $\Gamma_\delta$ with $v$ and $v_\delta$ lying in the same small square, at distance at most $\sqrt2\delta$. Then every point on the boundary of $\Gamma_\delta$ is within distance $\sqrt2\delta$ of $\Gamma$, and therefore $\lambda(\Gamma_\delta\Delta\Gamma)\le\sqrt2(\sqrt2\delta)m=2\delta m$ (the estimate can be done in a conservative way by collapsing the vertices of $\Gamma$ corresponding to the same vertex of $\Gamma_\delta$ into one vertex). Setting $\varepsilon=2\delta m$ yields the claim. Since, for some constant $C_1>0$ depending only on $\eta$, it follows that the claimed entropy bound holds, as desired.
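For orientation, the counting behind the claim can be sketched as follows (a rough calculation under the construction above, not the exact display from the lemma):
$$|D_\delta|\ \le\ \big((1+1/\delta)^2\big)^m,\qquad\text{so with }\ \delta=\varepsilon/(2m):\quad \log|D_\delta|\ \le\ 2m\log\Big(1+\frac{2m}{\varepsilon}\Big),$$
which is of order $m\log(m/\varepsilon)$; the additional powers of $\varepsilon$ and the constant $C_\eta$ in Lemma C.16 arise when the symmetric-difference metric $\lambda(\cdot\,\Delta\,\cdot)$ is translated into $d_n$ on the localized set.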
Now we take $\delta^2_{n,m}\equiv C_\eta\,m\log n/n$ for some large constant $C_\eta>0$,
and that, for $n$ large enough depending on $\Gamma_0$, for any as long as $C_\eta>0$ is large enough.
Proof of Theorem 3.27. The claim follows by Corollary 3.26 and Proposition 2.2, coupled with Lemmas C.16 and C.17.

C.7. Proof of Theorem 3.30
Lemma C.18. For any $g\in\mathcal F_m$ such that $g\le f_0$, and any

Proof of Lemma C.18. Note that the local entropy with left bracketing in question can be bounded by its global counterpart $N_{[}\big(c_5\varepsilon^2,\{f\in\mathcal F_m:|f|\le R\},L_1\big)$. Let $m\ge2$. Fix $\varepsilon>0$ and let $\delta^2=c_5\varepsilon^2/(2Rm+1)$. Without loss of generality we assume that $1/\delta^2\in\mathbb N$, and we partition the interval $[0,1)$ into $\cup_{i=1}^{1/\delta^2}[(i-1)\delta^2,i\delta^2)$. On the other hand, there are at most $(1/\delta^2)^{m-1}\cdot(2R/\delta^2)^m$ many choices of $\bar f$, and For $m=1$ it is clear that the above bound holds, so the proof is complete.
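As a sanity check on the count just displayed (a sketch assuming $R\ge1/2$ and $\delta^2\le2R$, with loose constants):
$$\log\Big[\Big(\frac1{\delta^2}\Big)^{m-1}\Big(\frac{2R}{\delta^2}\Big)^m\Big]\ \le\ 2m\log\frac{2R}{\delta^2}\ =\ 2m\log\frac{2R(2Rm+1)}{c_5\varepsilon^2},$$
which is of order $m\log\big(Rm/(c_5\varepsilon^2)\big)$ and is of the same form as the entropy terms used in the proof of Theorem 3.30 below.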
Proof of Theorem 3.30. Let $R_n\to\infty$ be a sequence such that $\log R_n\asymp\log n$. We omit the superscripts in the constants throughout the proof. Let $\bar{\mathcal F}_n\equiv\{f:[0,1]\to\mathbb R:|f|\le R_n,\ f\in\mathcal F_m\}$ be the localized subset of $\mathcal F$. By the decomposition (2.12), the probability in question can be bounded by We first handle the first term in (C.5). Now Corollary 3.28 combined with Lemmas C.18 and C.19 yields that, for $n$ large enough, where $\varepsilon^2_{n,m}\equiv\max\{\inf_{g\in\mathcal F_m\cap\bar{\mathcal F}_n}L_1(f_0,g),\,m\log(R_n^2n)/n\}$. Here $C_2,C_3>0$ are absolute constants that do not depend on $R_n$. Note that in applying the (modified) Lemma C.19 we implicitly used the fact that the induced localized prior mass satisfies the following: Next we handle the second term in (C.5). Applying Lemma A.5 to the localized model with where the last inequality holds for $n$ large enough, and follows essentially from the same argument used in the proof of Lemma C.19. Now we have Furthermore we have, where the last inequality follows since $\log\big(\int_{|x|>R_n}g(x)\,dx\big)^{-1}\ge C\log(en)$ holds for a large enough constant $C>0$. Combining the above estimates concludes the proof.

Appendix D: Proofs of auxiliary lemmas in Appendix A
Proof of Lemma A.4. Without loss of generality we assume $d_0=0$. Let $\mathcal F_j:=\{f\in\mathcal F: j\varepsilon<d_n(f,f_0)\le2j\varepsilon\}$ and let $G_j\subset\mathcal F_j$ be a collection of functions forming a minimal $c_5j\varepsilon$-covering set of $\mathcal F_j$ under the metric $d_n$. Then by assumption $|G_j|\le N(j\varepsilon)$. Furthermore, for each $g\in G_j$, it follows from Lemma 2.1 that there exists some test $\omega_{n,j,g}$ such that
$$\sup_{f\in\mathcal F:\,d_n(f,g)\le c_5d_n(g,f_0)}\Big[P^{(n)}_{f_0}\omega_{n,j,g}+P^{(n)}_f(1-\omega_{n,j,g})\Big]\ \le\ c_6\,e^{-c_7nd_n^2(g,f_0)}.$$
Recall that $g\in G_j\subset\mathcal F_j$, so that $d_n(g,f_0)>j\varepsilon$; hence the indexing set above contains $\{f\in\mathcal F: d_n(f,g)\le c_5j\varepsilon\}$. Now consider the global test $\phi_n:=\sup_{j\ge1}\max_{g\in G_j}\omega_{n,j,g}$.
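Schematically, the type-I error of the global test is then controlled by a union bound over the peeling shells (a sketch using only the notation introduced above, with constants as in Lemma 2.1):
$$P^{(n)}_{f_0}\phi_n\ \le\ \sum_{j\ge1}\sum_{g\in G_j}P^{(n)}_{f_0}\omega_{n,j,g}\ \le\ \sum_{j\ge1}N(j\varepsilon)\,c_6\,e^{-c_7nj^2\varepsilon^2},$$
using that $d_n(g,f_0)>j\varepsilon$ for $g\in G_j$ and that $|G_j|\le N(j\varepsilon)$.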

On the other hand, for any $f\in\mathcal F$ such that $d_n(f,f_0)\ge\varepsilon$, there exist some $j^*\ge1$ and some $g_{j^*}\in G_{j^*}$ such that $d_n(f,g_{j^*})\le c_5j^*\varepsilon$. Hence The right-hand side is independent of the individual $f\in\mathcal F$ with $d_n(f,f_0)\ge\varepsilon$, and hence the claim follows.
Proof of Lemma A.5. Without loss of generality we assume $d_0=0$. By Jensen's inequality, the probability in question is bounded by where the last inequality follows from Fubini's theorem and Assumption A. Now the condition on the prior $\Pi$ entails that The claim follows by choosing $\lambda>0$ small enough depending on $C,\kappa$. Then, analogously to (A.6) and (A.7), for any random variable $U\in[0,1]$ we have the exponential testability:
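For orientation, the Jensen/Fubini step at the beginning of this proof can be sketched as follows (schematically; $B$ denotes the prior ball appearing in the statement of Lemma A.5, and the constants are placeholders rather than the ones in the actual statement):
$$\log\int\frac{p^{(n)}_f}{p^{(n)}_{f_0}}\,d\Pi(f)\ \ge\ \log\Pi(B)+\int_B\log\frac{p^{(n)}_f}{p^{(n)}_{f_0}}\,\frac{d\Pi(f)}{\Pi(B)},$$
and by Fubini's theorem the $P^{(n)}_{f_0}$-expectation of the last integral equals $\Pi(B)^{-1}\int_B P^{(n)}_{f_0}\log\big(p^{(n)}_f/p^{(n)}_{f_0}\big)\,d\Pi(f)$, which Assumption A bounds from below by a multiple of $-n\varepsilon^2$; the stated probability bound then follows from the corresponding one-sided Gaussian concentration.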

Proof of Proposition
Similar to (A.8), there exists an eventẼ n with

The entropy condition (E.1) used for the sieved MLE is of global type, since the construction of the net $\mathcal F_{\delta_n}$ does not use information about $f_0$. Results of this type in the context of Gaussian regression and density estimation have long been known in the literature; we refer the reader to [vdVW96, vdG00]. Our result here appears to yield some new results for other locally Gaussian experiments considered in Section 3.
The structural similarity of Theorem 2.3 (when only one model is used) and Proposition E.1 is obvious: both assertions hold under the same local Gaussianity structure of the experiment and the entropy condition, and the posterior distribution in Theorem 2.3 and the sieved MLE in Proposition E.1 both enjoy Gaussian tail behavior. Furthermore, the proofs for both results use (one-sided) Gaussian concentration in an essential way.

Appendix F: More examples
This section contains additional examples, including (i) regression models without boundedness restrictions, (ii) density estimation in location mixtures, (iii) estimation of piecewise constant signals in the Gaussian autoregression model, and (iv) subset selection for sparse approximation of regression functions. The main purpose of (i) and (ii) is to demonstrate how the localization principle (cf. Section 2.3) can be applied in situations where local Gaussianity may fail over the entire parameter space but still holds, essentially, on suitably localized subsets of it. The purpose of (iii) is to carry out some explicit calculations without losing additional logarithmic factors when the parameter space is non-compact. The purpose of (iv) is to demonstrate how to adapt the machinery of the paper to complicated model structures that are non-nested.

F.1. Removing boundedness restrictions in Section 3.1
The boundedness assumption in many examples in Section 3.1 is imposed for simplicity. Below we remove the boundedness restriction in the binary regression model as a proof of concept.

Proposition F.1. Suppose $\theta_0\in\Theta_m$ and $\theta_0\in[\eta,1-\eta]^n$ for some $\eta>0$. If $g$ is such that $\int_{[0,t]\cup[1-t,1]}g(x)\,dx\le e^{-1/t^C}$ for some large constant $C>0$ and all small $t>0$, then there exists $C'>0$ (depending on $\eta$ and the prior) such that $P^{(n)}_{\theta_0}\Pi_n\big(\theta\in\Theta:\lVert\theta-\theta_0\rVert_2^2>C'm\log^{C'}n/n\big)\to0$.

The boundedness restrictions in the other Laplace/Poisson models can be removed in a completely similar fashion, so we omit these digressions.
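For concreteness, one density $g$ on $(0,1)$ with the required thin tails near $\{0,1\}$ — assuming the tail condition in Proposition F.1 is read as $\int_{[0,t]\cup[1-t,1]}g(x)\,dx\le e^{-t^{-C}}$ — is the following (with $Z_C$ a normalization constant):
$$g(x)=Z_C^{-1}\exp\big(-x^{-C}-(1-x)^{-C}\big),\qquad x\in(0,1),$$
since for $x\in[0,t]$ one has $x^{-C}\ge t^{-C}$, so $\int_0^tg\le Z_C^{-1}\,t\,e^{-t^{-C}}$, and similarly near $1$; the sum is at most $e^{-t^{-C}}$ once $t\le Z_C/2$.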
For the first term in (F.2), we use Theorem 2.3. By the proof of Lemma 3.1, for any $\theta_0,\theta_1\in\bar\Theta_n$, Similarly we may verify the local Gaussianity condition with constants $\kappa=(\kappa_g,\kappa_\Gamma)$ depending polynomially on $w_n$. So Assumption A is verified by choosing $\{c_i\}$ and $\kappa$ (or its inverse) of the order $O(w_n^{C_1})$ for some $C_1>0$. Assumption B can be verified immediately using arguments similar to those in Lemma C.4. Assumption C follows by similar (and simpler) arguments to those in Lemma C.5 and the fact that $\bar\Pi_{n,m}(A)\ge\Pi_{n,m}(A)$ for any $A$. Hence the first term on the RHS of (F.2) is bounded by $\exp\big(C_2\log(1/w_n)-n\delta^2_{n,m}w_n^{C_2}\big)$, which is $o(1)$ by our choice of $w_n$ and $c>0$ large enough.

We handle the second term on the right-hand side of (F.2) below. By applying Lemma A.5 to the localized model with $\varepsilon^2\equiv\delta^2_{n,m}$, we see that on an event $E_n$ with large $P^{(n)}_{\theta_0}$-probability, the prior-weighted likelihood ratio $\int(p_\theta/p_{\theta_0})\,d\Pi_n(\theta)$ admits a lower bound, while $\int g(x)\,dx\le e^{-\log^{C_7}n/C_7}$ for some large $C_7>0$, by the assumption on $g$.

F.2. Density estimation in location mixtures
Consider estimation of a density $f_0$ on $\mathbb R$ from the class of location mixtures $\cup_{m=1}^\infty\mathcal F_m$, where $\mathcal F_m$ consists of densities of the type
$$f_{(w,\mu,\sigma)}(x)=\sum_{j=1}^m w_j\,\psi_\sigma(x-\mu_j),$$
where $\sigma>0$, $w_j\ge0$, $\sum_{j=1}^m w_j=1$, $\mu_j\in\mathbb R$ and $\psi_\sigma(x)\equiv e^{-x^2/2\sigma^2}/\sqrt{2\pi\sigma^2}$. This problem has received considerable attention; see e.g. [GvdV01, Rou10, KRvdV10, Scr16, DRRS18] and references therein for some Bayesian developments. The model selection prior $\Lambda_n$ on $m$ is chosen as
$$\lambda_n(m)\propto\exp\big(-c_{\mathrm{mix}}\,m\log(en)\big).\tag{F.3}$$
A prior $\Pi_{n,m}$ on the model $\mathcal F_m$ is naturally induced by a product prior $\Pi_w\otimes\Pi_\mu\otimes\Pi_\sigma$. For simplicity, we assume that $\Pi_w$ is the standard Dirichlet distribution, and that $\Pi_\mu,\Pi_\sigma$ have Lebesgue densities $g_\mu^{\otimes m},g_\sigma$ with the following properties: $g_\mu$ has full support on $\mathbb R$ with $-\log g_\mu(x)\asymp\log x$ as $x\to\infty$, while $-\log g_\sigma(x)\asymp\log(1/x)$ as $x\to0$ and $-\log g_\sigma(x)\asymp\log x$ as $x\to\infty$.
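For illustration only, one can sample from this two-step prior as follows. This is a minimal sketch under stated assumptions: the hyper-densities used below (standard Cauchy for the locations and a scale density proportional to $s/(1+s^4)$, sampled by inverting its CDF $(2/\pi)\arctan(s^2)$) merely satisfy the tail conditions above up to constants and are not claimed to be the exact choices analyzed here; the truncation level m_max and the constant c_mix are placeholders.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture_prior(n, c_mix=1.0, m_max=50):
    """Draw one mixture f(x) = sum_j w_j * psi_sigma(x - mu_j)
    from the two-step prior: first m, then (w, mu, sigma)."""
    # First-step prior: lambda_n(m) proportional to exp(-c_mix*m*log(e*n)),
    # truncated at m_max for the simulation.
    m_grid = np.arange(1, m_max + 1)
    log_w = -c_mix * m_grid * np.log(np.e * n)
    probs = np.exp(log_w - log_w.max())
    probs /= probs.sum()
    m = rng.choice(m_grid, p=probs)
    # Second-step prior: Dirichlet weights, heavy-tailed locations,
    # and a scale with density proportional to s/(1+s^4).
    w = rng.dirichlet(np.ones(m))
    mu = rng.standard_cauchy(m)
    sigma = np.sqrt(np.tan(0.5 * np.pi * rng.uniform()))
    def f(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]
        kern = (np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))
                / np.sqrt(2.0 * np.pi * sigma ** 2))
        return (w * kern).sum(axis=1)
    return m, w, mu, sigma, f

m, w, mu, sigma, f = sample_mixture_prior(n=100)
print(m, float(f(0.0)[0]))
\end{verbatim}
The sketch only serves to make the two-step structure of $\Lambda_n$ and $\Pi_{n,m}$ concrete; posterior computation is a separate matter not addressed here.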
Proposition F.2. Suppose that f 0 ∈ F m , and the priors are specified as above.
Proposition F.2 says that the posterior distribution under such hierarchical priors adapts to finite mixtures at a nearly parametric rate. Although this result does not seem to be explicitly spelled out in the literature, we believe that it can also be derived along the lines of, e.g., [KRvdV10]. Indeed, [KRvdV10] proved adaptive behavior of the posterior contraction rate with respect to the local smoothness of the density, under similar hierarchical priors. It is clear from the above proposition that adaptation to the smoothness of the density can be accomplished once the quantity $\inf_{f\in\mathcal F_m}h^2(f,f_0)$ is shown to adapt to the smoothness of $f_0$. This has been the main focus of [KRvdV10] (in Kullback–Leibler divergence). The main purpose here, rather than repeating the arguments of [KRvdV10], rests in demonstrating how the localization principle can be used in the mixture model.
It can also be seen immediately from the proof that the Gaussian kernel can be replaced by any kernel of the form considered in [KRvdV10]. Now define $\bar{\mathcal F}^*_n$ to be the set containing all $f^*$ defined as above from some $f\in\bar{\mathcal F}_n$. Note that for any $f\in\bar{\mathcal F}_n$, we have that

Then for a large enough constant $C>0$, by the decomposition (2.12), we have for $n$ large that the probability in question is bounded by
$$P^{(n)}_{f_0}\bar\Pi_n\Big(f\in\bar{\mathcal F}_n: h^2(f^*,f_0)>\frac{C_1m\log^\gamma n}{n}\;\Big|\;X^{(n)}\Big)+P^{(n)}_{f_0}\Pi_n\big(f\notin\bar{\mathcal F}_n\;\big|\;X^{(n)}\big),$$
which can be bounded by
$$P^{(n)}_{f_0}\bar\Pi^*_n\Big(f^*\in\bar{\mathcal F}^*_n: h^2(f^*,f_0)>\frac{C_1m\log^\gamma n}{n}\;\Big|\;X^{(n)}\Big)+P^{(n)}_{f_0}\Pi^*_n\big(f^*\notin\bar{\mathcal F}^*_n\;\big|\;X^{(n)}\big)+1/n,\tag{F.4}$$
where $\Pi^*_n,\bar\Pi^*_n$ are the priors naturally induced from $\Pi_n,\bar\Pi_n$. The last inequality follows by noting that for $\gamma_1\lesssim\gamma_2$. We handle the first term on the right-hand side of (F.4). To this end, we first verify the local Gaussianity condition (Assumption A). Clearly, for any $f^*_0,f^*_1\in\mathcal F^*_n$,
Next we verify Assumption B. Let $\delta^2_{n,m}\equiv C\,m\log n/n$ for some large constant $C>0$. Since the Hellinger distance is bounded by the square root of the total variation distance, we have
$$\log N\big(c_5\varepsilon,\bar{\mathcal F}^*_n\cap\mathcal F_m,h\big)\ \le\ \log N\big(c_5^2\varepsilon^2,\bar{\mathcal F}^*_n\cap\mathcal F_m,d_{\mathrm{TV}}\big).$$
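To indicate how the total variation entropy of the localized mixture class is typically controlled, the following elementary decomposition (a standard sketch, stated for a common scale $\sigma$ in both mixtures; the exact localization and constants are not taken from the argument above) separates a weight part from a location part:
$$\big\|f_{(w,\mu,\sigma)}-f_{(w',\mu',\sigma)}\big\|_1\ \le\ \sum_{j=1}^m|w_j-w_j'|+\sum_{j=1}^m w_j'\,\big\|\psi_\sigma(\cdot-\mu_j)-\psi_\sigma(\cdot-\mu_j')\big\|_1\ \le\ \sum_{j=1}^m|w_j-w_j'|+\sqrt{\tfrac2\pi}\sum_{j=1}^m w_j'\,\frac{|\mu_j-\mu_j'|}{\sigma},$$
so covering the weight simplex at scale $\varepsilon$ and the (localized) locations at scale $\sigma\varepsilon$ already yields an entropy of order $m\log(1/\varepsilon)$, plus terms reflecting the ranges of the locations and scales on $\bar{\mathcal F}^*_n$.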