Unified Bayesian theory of sparse linear regression with nuisance parameters

We study frequentist asymptotic properties of Bayesian procedures for high-dimensional Gaussian sparse regression when unknown nuisance parameters are involved. The nuisance parameters may be finite-, high-, or infinite-dimensional. A mixture of point masses at zero and continuous distributions is used as the prior distribution on the sparse regression coefficients, and appropriate prior distributions are used for the nuisance parameters. We examine when the presence of nuisance parameters does or does not hamper the optimal posterior contraction of the sparse regression coefficients. It is shown that the procedure yields strong model selection consistency. A Bernstein-von Mises-type theorem for the sparse regression coefficients is also obtained for uncertainty quantification through credible sets with guaranteed frequentist coverage. Asymptotic properties of numerous examples are investigated using the theory developed in this study. MSC2020 subject classifications: 62F15.


Introduction
While Bayesian model selection for classical low-dimensional problems has a long history, sparse estimation in high-dimensional regression was studied much later; see Bondell and Reich [5], Johnson and Rossell [20], and Narisetty and He [24] for consistent Bayesian model selection methods in high-dimensional linear models. Extensive theoretical investigations, however, have been carried out only very recently. Since the pioneering work of Castillo et al. [8], frequentist asymptotic properties of Bayesian sparse regression have been discovered under various settings, and there is now a substantial body of literature [e.g., 23,1,28,3,26,2,10,25,14,19,18].
Most of the existing studies deal with sparse regression setups without nuisance parameters, and there are only a few exceptions. An unknown variance parameter, the simplest type of nuisance parameter, was incorporated for high-dimensional linear regression in Song and Liang [28] and Bai et al. [2]. In these studies, the optimal properties of Bayesian procedures are characterized with continuous shrinkage priors. For more involved models, Chae et al. [10] adopted a nonparametric approach to estimate unknown symmetric error densities in sparse linear regression. Ning et al. [25] considered a sparse linear model for vector-valued response variables with unknown covariance matrices.
Although nuisance parameters may not be of primary interest, a modeling framework requires a complete description of their roles, as they explicitly parameterize the model. One may therefore want to achieve optimal estimation properties for the sparse regression coefficients regardless of the nuisance parameter. It may also be of interest to examine posterior contraction of the nuisance parameters as a secondary objective. Despite these facts, there has been no attempt to consider a general class of high-dimensional regression models with nuisance parameters. In this study, we consider a general form of Gaussian sparse regression in the presence of nuisance parameters and establish a theoretical framework for Bayesian procedures.
We formulate a general framework to treat sparse regression models in a unified way as follows. Let η be a possibly infinite-dimensional nuisance parameter taking values in a set H. For each η ∈ H and integers m_i ∈ {1, ..., m}, i = 1, ..., n, for some m ≥ 1, suppose that there are a vector ξ_{η,i} ∈ R^{m_i} and a positive definite matrix Δ_{η,i} ∈ R^{m_i×m_i} which define a regression model for a vector-valued response variable Y_i ∈ R^{m_i} against covariates X_i ∈ R^{m_i×p} given by

Y_i | (θ, η) ∼ N_{m_i}(X_i θ + ξ_{η,i}, Δ_{η,i}), independently for i = 1, ..., n, (1)

where θ ∈ R^p is a vector of regression coefficients. Here m_i (and m) can increase with n. We consider the high-dimensional situation where p > n, but θ is assumed to be sparse, with many coordinates zero. The form in (1) clearly includes sparse linear regression with unknown error variances. Our main interest lies in more complicated setups. As will be discussed shortly in Section 1.1, many interesting examples belong to form (1).
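The following minimal simulation sketch illustrates how data arise from form (1). All names, the toy choices of ξ_{η,i} and Δ_{η,i}, and the dimensions are illustrative assumptions, not part of the theory.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_response(X_i, theta, xi_i, Delta_i, rng):
    """Draw Y_i ~ N(X_i @ theta + xi_i, Delta_i) as in model (1)."""
    mean = X_i @ theta + xi_i
    return rng.multivariate_normal(mean, Delta_i)

# Toy instance: n = 50 groups, p = 200 coefficients, m_i = 3 responses each,
# a sparse theta with s_0 = 2 nonzero entries, and an exchangeable Delta.
n, p, m_i = 50, 200, 3
theta = np.zeros(p)
theta[[3, 17]] = [1.5, -2.0]
Delta = 0.5 * np.eye(m_i) + 0.5 * np.ones((m_i, m_i))  # plays the role of Delta_{eta,i}
Y = [draw_response(rng.standard_normal((m_i, p)), theta, np.zeros(m_i), Delta, rng)
     for _ in range(n)]
```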
In this paper, we develop a unified theory of posterior asymptotics in the high-dimensional sparse regression models described by form (1). To the best of our knowledge, there is no study thus far considering a general modeling framework of sparse regression as in (1), even from the frequentist perspective. Results on complicated high-dimensional regression models are only available at model-specific levels and cannot be universally used for different model classes. Our approach, on the other hand, is a unified theoretical treatment of the general model structure in (1) under the Bayesian framework. We establish general theorems on nearly optimal posterior contraction rates, a Bernstein-von Mises theorem via shape approximation to the posterior distribution of θ, and model selection consistency.
The general theory of posterior contraction using the canonical root-average-squared Hellinger metric on the joint density [16] is not very useful in this context, since recovering rates in terms of the metric of interest on the regression coefficients requires some boundedness conditions [19]. To deal with this issue, we construct an exponentially powerful likelihood ratio test on small pieces of the parameter space that are sufficiently separated from the true parameters in terms of the average Rényi divergence of order 1/2 (which coincides with the average negative log-affinity). This test provides posterior contraction relative to the corresponding divergence. The posterior contraction rates of θ and η can then be recovered in terms of the metrics of interest under mild conditions on the parameter space. Due to the nuisance parameter η, the resulting posterior contraction for θ may be suboptimal. Conditions for optimal posterior contraction will also be examined. Our results show that the obtained posterior contraction rates are adaptive to the unknown sparsity level.
For a Bernstein-von Mises theorem and selection consistency, stronger conditions are required than those used for posterior contraction, in line with the existing literature [e.g., 8,23]. As pointed out by Chae et al. [10], the Bernstein-von Mises theorems for finite dimensional parameters in classical semiparametric models [e.g., 7] may not be directly useful in the high-dimensional context. We thus directly characterize a version of the Bernstein-von Mises theorem for model (1). The key idea is to find a suitable orthogonal projection that satisfies certain required conditions, which is typically straightforward if the support of the prior for ξ_{η,i} is a linear space. The complexity of the space of covariance matrices, measured by its metric entropy, also plays an important role in deriving the Bernstein-von Mises theorem and selection consistency. Combining these two ingredients leads to a single normal component in the approximation, which enables us to correctly quantify the remaining uncertainty about the parameter through the posterior distribution.

Sparse linear regression with nuisance parameters
As briefly discussed above, the form in (1) is general and includes many interesting statistical models. Here we provide specific examples belonging to (1) in detail. In Section 5, these examples will be used to apply the main results developed in this study.
Example 1 (Multiple response models with missing components). We consider a general multiple response model with missing values, which is very common in practice. Suppose that for each i, a vector of m responses with covariance matrix Σ is supposed to be observed, but for the ith group (or subject) only m_i entries are actually observed, with the rest missing. Letting Y_i ∈ R^{m_i} be the ith observation and Y_i^aug ∈ R^m be the augmented vector of Y_i and the missing entries, we can write Y_i = E_i^T Y_i^aug, where E_i ∈ R^{m×m_i} is the submatrix of the m × m identity matrix with the jth column included if the jth element of Y_i^aug is observed, j = 1, ..., m. Assuming that the mean of Y_i is only X_i θ for covariates X_i ∈ R^{m_i×p} and sparse coefficients θ ∈ R^p with p > n, the model of interest can be written as Y_i ∼ N_{m_i}(X_i θ, E_i^T Σ E_i), independently for i = 1, ..., n. The model belongs to the class described by (1) with ξ_{η,i} = 0_{m_i} and Δ_{η,i} = E_i^T Σ E_i for η = Σ.

Example 2 (Multivariate measurement error models). Suppose that a scalar response variable Y_i^* ∈ R is connected to fixed covariates X_i^* ∈ R^p with p > n and random covariates Z_i ∈ R^q with fixed q ≥ 1, through the linear additive relationship Y_i^* = α + X_i^* θ + β^T Z_i + ε_i, ε_i iid ∼ N(0, σ²), Z_i iid ∼ N_q(μ, Σ), i = 1, ..., n. While X_i^* is fully observed without noise, we observe a surrogate W_i of Z_i as W_i = Z_i + τ_i, τ_i iid ∼ N_q(0, Ψ), where, to ensure identifiability, Ψ is assumed to be known. This type of model is called a measurement error model or an errors-in-variables model; see Fuller [13] and Carroll et al. [6] for a complete overview. By direct calculation, the joint distribution of Y_i := (Y_i^*, W_i^T)^T ∈ R^{q+1} is Gaussian with mean vector (α + X_i^* θ + β^T μ, μ^T)^T and a covariance matrix in R^{(q+1)×(q+1)} determined by (β, σ², Σ, Ψ). With η = (α, β, μ, σ², Σ), the model is of form (1) with m_i = q + 1.
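As a concrete illustration of Example 1, the following sketch builds the selection matrix E_i from an observation mask and forms Δ_{η,i} = E_i^T Σ E_i. The mask and Σ are made-up toy values.

```python
import numpy as np

def selection_matrix(obs_mask):
    """E_i: the columns of the m x m identity for the observed coordinates."""
    m = len(obs_mask)
    return np.eye(m)[:, np.asarray(obs_mask, dtype=bool)]

Sigma = np.array([[1.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
mask = [True, False, True]          # entries 1 and 3 of Y_i^aug observed
E_i = selection_matrix(mask)        # shape (3, 2)
Delta_i = E_i.T @ Sigma @ E_i       # covariance of the observed subvector
```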
Example 3 (Parametric correlation structure). For m_i ≥ 1, i = 1, ..., n, suppose that we have a response variable Y_i ∈ R^{m_i} and covariates X_i ∈ R^{m_i×p} with p > n. We consider the standard regression model Y_i = X_i θ + ε_i, ε_i ind ∼ N_{m_i}(0, Σ_i), i = 1, ..., n, where m_i is considered to be possibly increasing. For a known parametric correlation structure G_i and a fixed dimensional Euclidean parameter α, we model the covariance matrix as Σ_i = σ² G_i(α) using a variance parameter σ² and a correlation matrix G_i(α) ∈ R^{m_i×m_i}. Examples of G_i include first order autoregressive and moving average correlation matrices. The model belongs to (1) by writing ξ_{η,i} = 0_{m_i} and Δ_{η,i} = σ² G_i(α) with η = (α, σ²).

Example 4 (Mixed effects models). For m_i ≥ 1, i = 1, ..., n, consider a response variable Y_i ∈ R^{m_i} and covariates X_i ∈ R^{m_i×p} with p > n and Z_i ∈ R^{m_i×q} with fixed q ≥ 1. A mixed effects model is given by Y_i = X_i θ + Z_i b_i + e_i, b_i iid ∼ N_q(0, Ψ), e_i ind ∼ N_{m_i}(0, σ² I_{m_i}), i = 1, ..., n, where Ψ ∈ R^{q×q} is a positive definite matrix. The marginal law of Y_i is then given by Y_i = X_i θ + ε_i, ε_i ind ∼ N_{m_i}(0, σ² I_{m_i} + Z_i Ψ Z_i^T). We assume that σ² is known. The model belongs to (1) by letting ξ_{η,i} = 0_{m_i} and Δ_{η,i} = σ² I_{m_i} + Z_i Ψ Z_i^T with η = Ψ.

Example 5 (Graphical structure with a sparse precision matrix). For a response variable Y_i ∈ R^m and covariates X_i ∈ R^{m×p} with increasing m ≥ 1 and p > n, consider the model Y_i = X_i θ + ε_i, ε_i iid ∼ N_m(0, Ω^{−1}), i = 1, ..., n, where θ is a sparse coefficient vector and the precision matrix Ω ∈ R^{m×m} is positive definite. Along with θ, we also impose sparsity on the off-diagonal entries of Ω, which accounts for a graphical structure between observations. More precisely, a zero off-diagonal entry implies the conditional independence of the two concerned entries of ε_i given the remaining ones, and we suppose that most off-diagonal entries are actually zero, even though we do not know their locations. The model is then seen to be a special case of (1) by letting ξ_{η,i} = 0_m and Δ_{η,i} = Ω^{−1} with η = Ω.
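For Example 4, the marginal covariance obtained by integrating out the random effects can be formed directly. The following sketch does so for a hypothetical random-intercept design; all values are illustrative.

```python
import numpy as np

def marginal_covariance(Z_i, Psi, sigma2):
    """Delta_{eta,i} = sigma^2 I + Z_i Psi Z_i^T after integrating out b_i."""
    m_i = Z_i.shape[0]
    return sigma2 * np.eye(m_i) + Z_i @ Psi @ Z_i.T

Z_i = np.ones((4, 1))               # random intercept: q = 1
Psi = np.array([[0.7]])
Delta_i = marginal_covariance(Z_i, Psi, sigma2=1.0)
# For q = 1 and an all-ones Z_i this is a compound-symmetric covariance,
# matching the remark made later in Section 5.4.
```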
Example 6 (Nonparametric heteroskedastic regression models). For a response variable Y_i ∈ R and a row vector of covariates X_i ∈ R^{1×p}, a linear regression model with a nonparametric heteroskedastic error is given by Y_i = X_i θ + ε_i, ε_i ind ∼ N(0, v(z_i)), i = 1, ..., n, where θ is a sparse coefficient vector, v : [0, 1] → (0, ∞) is an unknown variance function, and z_1, ..., z_n ∈ [0, 1] are fixed design points. The model is of form (1) with ξ_{η,i} = 0 and Δ_{η,i} = v(z_i) for η = v.

Example 7 (Partial linear models). For a response variable Y_i ∈ R, a row vector of covariates X_i ∈ R^{1×p} with p > n, and fixed design points z_i ∈ [0, 1], consider the model Y_i = X_i θ + g(z_i) + ε_i, ε_i iid ∼ N(0, σ²), i = 1, ..., n, where θ is a sparse coefficient vector and g : [0, 1] → R is an unknown function. The model is of form (1) with ξ_{η,i} = g(z_i) and Δ_{η,i} = σ² for η = (g, σ²).

Outline
The rest of this paper is organized as follows. In Section 2, notation is introduced and a prior distribution on the sparse regression coefficients is specified. Sections 3-4 provide our main results on posterior contraction, the Bernstein-von Mises phenomenon, and selection consistency of the posterior distribution. In Section 5, our general theorems are applied to the examples considered above to derive the posterior asymptotic properties in each specific case. All technical proofs are provided in the Appendix.

Notation
Here we describe the notation used throughout this paper. For a vector θ = (θ_j) ∈ R^p and a set S ⊂ {1, ..., p} of indices, we write S_θ = {j : θ_j ≠ 0} to denote the support of θ, s := |S| (or s_θ := |S_θ|) to denote the cardinality of S (or S_θ), and θ_S = {θ_j : j ∈ S} and θ_{S^c} = {θ_j : j ∉ S} to separate the components of θ using S. In particular, the support of the true parameter θ_0 and its cardinality are written as S_0 and s_0 := |S_0|, respectively. The notation ||θ||_q = (Σ_j |θ_j|^q)^{1/q}, 1 ≤ q < ∞, stands for the ℓ_q-norm and ||θ||_∞ = max_j |θ_j| denotes the maximum norm. We write ρ_min(A) and ρ_max(A) for the minimum and maximum eigenvalues of a square matrix A, respectively. For a matrix X = ((x_ij)), let ||X||_sp = ρ_max^{1/2}(X^T X) stand for the spectral norm and ||X||_F = (Σ_{i,j} x_ij²)^{1/2} for the Frobenius norm of X. We also define the matrix norm ||X||_* = max_j ||X_{·j}||_2 for X_{·j} the jth column of X, which is used for compatibility conditions. The column space of X is denoted by span(X). For further convenience, we write ς_min(X) = ρ_min^{1/2}(X^T X) for the minimum singular value of X. The notation X_S means the submatrix of X with columns chosen by S. For sequences a_n and b_n, a_n ≲ b_n (or b_n ≳ a_n) stands for a_n ≤ C b_n for some constant C > 0 independent of n, and a_n ≍ b_n means a_n ≲ b_n ≲ a_n. These inequalities are also used for relations involving constant sequences.
For given parameters θ and η, we write the joint density as p_{θ,η} = ∏_{i=1}^n p_{θ,η,i} for p_{θ,η,i} the density of the ith observation vector Y_i. In particular, the true joint density is expressed as p_0 = ∏_{i=1}^n p_{0,i} for p_{0,i} := p_{θ_0,η_0,i} with the true parameters θ_0 and η_0. The notation E_0 denotes the expectation operator with respect to the true density p_0. For two probability measures P and Q, let ||P − Q||_TV denote the total variation distance between P and Q. For two n-variate densities f := ∏_{i=1}^n f_i and g := ∏_{i=1}^n g_i of independent variables, denote the average Rényi divergence by R_n(f, g) = −n^{−1} Σ_{i=1}^n log ∫ f_i^{1/2} g_i^{1/2} dμ, and define the two squared pseudo-metrics d_{A,n}²(η_1, η_2) = n^{−1} Σ_{i=1}^n ||ξ_{η_1,i} − ξ_{η_2,i}||_2² and d_{B,n}²(η_1, η_2) = n^{−1} Σ_{i=1}^n ||Δ_{η_1,i} − Δ_{η_2,i}||_F², with d_n² = d_{A,n}² + d_{B,n}². For compatibility conditions, the uniform compatibility number φ_1 and the smallest scaled singular value φ_2 are defined as φ_1(s) = inf{||Xθ||_2 s_θ^{1/2}/(||X||_* ||θ||_1) : θ ∈ Θ, 0 < s_θ ≤ s} and φ_2(s) = inf{||Xθ||_2/(||X||_* ||θ||_2) : θ ∈ Θ, 0 < s_θ ≤ s}, with Θ = R^p the parameter space of θ. Lastly, for a (pseudo-)metric space (F, d), let N(ε, F, d) denote the ε-covering number, the minimal number of ε-balls that cover F.
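The compatibility numbers can be explored numerically for small designs. The sketch below enumerates supports by brute force, so it is only feasible for toy sizes; for φ_1 it evaluates the ratio at the minimal singular direction, which yields an upper bound rather than the exact infimum.

```python
import numpy as np
from itertools import combinations

def compatibility_numbers(X, s):
    """Brute-force phi_2(s) and an upper bound for phi_1(s) for a toy design X."""
    X_star = np.linalg.norm(X, axis=0).max()        # ||X||_* = max_j ||X_.j||_2
    p = X.shape[1]
    phi1, phi2 = np.inf, np.inf
    for k in range(1, s + 1):
        for S in combinations(range(p), k):
            X_S = X[:, S]
            _, sv, Vh = np.linalg.svd(X_S, full_matrices=False)
            v = Vh[-1]                              # direction of smallest singular value
            phi2 = min(phi2, sv[-1] / X_star)       # exact over this support
            phi1 = min(phi1, sv[-1] * np.sqrt(k) / (X_star * np.linalg.norm(v, 1)))
    return phi1, phi2

# For X with i.i.d. standard normal entries and small (n, p, s), both numbers
# are typically bounded away from zero, in line with the discussion below.
```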

Prior for the high-dimensional coefficients
In this subsection, we specify a prior distribution for the high-dimensional regression coefficients θ. The prior for η should satisfy the conditions required for the main results, so its specific characterization is deferred to Section 3. The prior for θ specified here, on the other hand, always satisfies the requirements for our purposes.
We first select a dimension s from a prior π_p, and then randomly choose S ⊂ {1, ..., p} for the given s. The nonzero part θ_S of θ is then selected from a prior g_S on R^s while θ_{S^c} is fixed at zero. The resulting prior specification for (S, θ) is formulated as

(S, θ) ↦ π_p(s) (p choose s)^{−1} g_S(θ_S) δ_0(θ_{S^c}), (2)

where δ_0 is the Dirac measure at zero on R^{p−s} with suppressed dimensionality. For the prior π_p on the model dimension, we consider a prior satisfying the following: for some constants A_1, A_2, A_3, A_4 > 0,

A_1 p^{−A_3} π_p(s − 1) ≤ π_p(s) ≤ A_2 p^{−A_4} π_p(s − 1), s = 1, ..., p. (3)

Examples of priors satisfying (3) can be found in Castillo and van der Vaart [9] and Castillo et al. [8]. For the prior g_S, the s-fold product of the exponential power (Laplace) density θ ↦ (λ/2)e^{−λ|θ|} is considered, where the regularization parameter λ is allowed to vary with p and ||X||_*, i.e.,

||X||_*/(L_1 p^{L_2}) ≤ λ ≤ L_3 ||X||_*/√n, (4)

for some constants L_1, L_2, L_3 > 0. The order of λ is important in that it determines the boundedness requirement on the true signal θ_0 (see condition (C3) below). A particularly interesting case is obtained when λ is set to the lower bound ||X||_*/(L_1 p^{L_2}). Then the boundedness condition becomes very mild by choosing L_2 sufficiently large. When λ is set to the upper bound, the boundedness condition is still reasonably mild. It can actually be relaxed if the true signal is known to be small enough, though we do not pursue this generalization in this study. In Section 4, we shall see that values of λ that do not increase too fast are in fact necessary for a distributional approximation and selection consistency.
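A draw from the prior (2)-(4) can be sketched as follows. The dimension prior π_p(s) ∝ p^{−A_4 s} used here is one hypothetical choice satisfying (3); the constants are illustrative.

```python
import numpy as np

def sample_prior(p, lam, rng, A4=2.0):
    """One draw (S, theta) from the spike-and-slab prior (2)-(4).

    pi_p(s) proportional to p^(-A4 * s) is a particular choice satisfying (3);
    given s, the support S is uniform over size-s subsets, and theta_S is
    i.i.d. Laplace with rate lam (scale 1/lam), i.e., density (lam/2)e^(-lam|t|).
    """
    s_grid = np.arange(p + 1)
    log_w = -A4 * s_grid * np.log(p)              # complexity prior on dimension
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    s = rng.choice(s_grid, p=w)
    S = rng.choice(p, size=s, replace=False)
    theta = np.zeros(p)
    theta[S] = rng.laplace(scale=1.0 / lam, size=s)
    return np.sort(S), theta

rng = np.random.default_rng(0)
S, theta = sample_prior(p=200, lam=5.0, rng=rng)
```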

Remark 1.
Since some size restriction on θ_0 will be made, unlike in Castillo et al. [8], the use of the Laplace density is not essential, and other prior distributions may also be used for θ. For example, normal densities can be used for g_S to exploit semi-conjugacy. However, if its precision parameter is fixed independent of n, a normal prior requires a stronger restriction on the true signal than (C3) below. To achieve nearly optimal posterior contraction, other densities with similar tail properties should also work with appropriate modifications of the true signal size (see, e.g., Jeong and Ghosal [19]). Instead of the spike-and-slab prior in (2) and (3), a class of continuous shrinkage priors may also be used at the expense of substantial modifications in the technical details [28]. In this paper, we only consider the prior in (2)-(4).

Posterior contraction rates
The prior for a nuisance parameter η should be chosen to complete the prior specification. Once we assign the prior for the full parameters, the posterior distribution Π(· | Y (n) ) is defined by Bayes' rule. How the prior for η is chosen is crucial to obtain desirable asymptotic properties of the posterior distribution.
In this subsection, we shall examine such conditions on the prior distribution for a nuisance parameter and study the posterior contraction rates for both θ and η.
The prior for η is put on a subset H̄ ⊂ H. In many instances we take H̄ = H, especially when the nuisance parameter is finite dimensional, but the flexibility of a subset may be beneficial in infinite-dimensional situations. We need to choose H̄ to satisfy the following conditions.

(C1) There exist nondecreasing sequences a_n = o(n) and e_n ≤ a_n such that a_n max_{i≤n} inf_{η ∈ H̄} ||Δ_{η,i} − Δ_{η_0,i}||_F² → 0 and e_n max_{i≤n} ||Δ_{η_1,i} − Δ_{η_2,i}||_F² ≲ a_n d_{B,n}²(η_1, η_2) for every η_1, η_2 ∈ H̄.

(C2) For some sequence ε̄_n such that a_n ε̄_n² → 0 and n ε̄_n² → ∞ with a_n satisfying (C1), Π(η ∈ H̄ : d_n(η, η_0) ≤ ε̄_n) ≥ e^{−n ε̄_n²}.

The first condition of (C1) implies that we have a good approximation to the true parameter value in the parameter set H̄. This holds trivially if there exists η' ∈ H̄ such that Δ_{η',i} = Δ_{η_0,i} for every i ≤ n, which is obviously true if η_0 ∈ H̄. The second condition of (C1) means that in H̄, the maximum Frobenius norm of the difference between covariance matrices can be controlled by the average Frobenius norm multiplied by the sequence a_n. Clearly, this holds with a_n ≍ 1 if Δ_{η,i} is the same for every i ≤ n. By the triangle inequality, (C1) implies that max_{i≤n} ||Δ_{η,i} − Δ_{η_0,i}||_F² can be bounded using a_n d_{B,n}²(η, η_0) and the approximation error in the first condition, which is used throughout the paper. Condition (C2) is typically called the prior concentration condition, which requires the prior to put sufficient mass around the true parameter η_0, measured by the pseudo-metric d_n. As in other infinite-dimensional situations, such closeness is translated into closeness in terms of the Kullback-Leibler divergence and variation (see Lemma 1 in the Appendix for more details). As noted in Section 1, the true parameters should be restricted to a certain norm-bounded subset of the parameter space. This is clarified as follows.
(C3) The true signal satisfies ||θ_0||_∞ ≲ λ^{−1} log p.

(C4) The eigenvalues of the true covariance matrices satisfy min_{i≤n} ρ_min(Δ_{η_0,i}) ≳ 1 and max_{i≤n} ρ_max(Δ_{η_0,i}) ≲ 1.

Condition (C3) is required to apply the general strategy for posterior contraction to our modeling framework containing nuisance parameters. More specifically, the condition is imposed so that the prior assigns sufficient mass to a Kullback-Leibler neighborhood of θ_0. If nuisance parameters are not present, one can directly handle the model and such a restriction may be removed [e.g., 8,14]. One may refer to Song and Liang [28], Ning et al. [25], and Bai et al. [2] for conditions similar to ours, where a variance parameter plays the role of the nuisance parameter. Still, the condition is mild if λ is chosen to decrease at an appropriate order. In particular, if λ is matched to the lower bound ||X||_*/(L_1 p^{L_2}), the condition becomes ||θ_0||_∞ ≲ (p^{L_2} log p)/||X||_*, which is very mild if L_2 is sufficiently large. Even if the upper bound L_3 ||X||_*/√n is chosen, the condition is not restrictive, as the right hand side can be made nondecreasing as long as ||X||_* increases at a suitable order. Condition (C4) implies that the eigenvalues of the true covariance matrices are bounded below and above. The lower and upper bounds are required for many of the technical arguments, including the construction of an exponentially powerful test in Lemma 2 in the Appendix.

Remark 2.
Condition (C3) is actually stronger than what is needed, but is adopted for ease of interpretation. For Theorem 3 below to hold, it suffices that λ||θ_0||_1 ≲ (s_0 log p) ∨ nε̄_n² for ε̄_n satisfying (C2). For the optimal posterior contraction in Theorem 4 below, a slightly stronger bound is needed: λ||θ_0||_1 ≲ s_0 log p (see Lemma 6 and its proof in the Appendix).

Rényi posterior contraction and recovery
The goal of this subsection is to study posterior contraction of θ relative to the ℓ_1- and ℓ_2-metrics. To do so, we derive the posterior contraction rate with respect to the average Rényi divergence R_n(f, g), and then the rates for θ relative to more concrete metrics will be recovered from the Rényi contraction.
To proceed, we first need to examine a dimensionality property of the support of θ. The following theorem shows that the posterior distribution is concentrated on models of relatively small sizes.
Compared to the literature [e.g., 8,23,3], the rate in Theorem 1 is floored by the extra term nε̄_n²/log p. This arises from the presence of the nuisance parameter in the model formulation. To minimize its impact, the prior on η should be chosen such that (C2) holds for as small an ε̄_n as possible; a suitable choice induces the (nearly) optimal contraction rate.
Using the basic results in Theorem 1, the next theorem obtains the rate at which the posterior distribution contracts at the truth with respect to the average Rényi divergence. The theorem requires additional assumptions on a prior.
(C5) For s* := s_0 ∨ (nε̄_n²/log p) with ε̄_n satisfying (C2), a sufficiently large B > 0, and some sequences γ_n and ε_n ≥ √(s* log(p ∨ m ∨ γ_n)/n) satisfying ε_n² → 0, there exists a subset H_n ⊂ H̄ such that

min_{i≤n} inf_{η ∈ H_n} ρ_min(Δ_{η,i}) ≳ γ_n^{−1}, (6)
log N(ε_n, H_n, d_n) ≲ nε_n², (7)
e^{Bs* log p} Π(H̄ \ H_n) → 0. (8)
The above conditions are related to the classical ones in the literature (e.g., see Theorem 2.1 of Ghosal et al. [15]). Condition (6) requires that for every i ≤ n, the minimum eigenvalue of Δ_{η,i} is not too small on the sieve H_n. Although γ_n can be any positive sequence, a sequence increasing exponentially fast makes the entropy in (7) too large, resulting in a suboptimal rate ε_n. If γ_n can be chosen smaller than p and m, then it leads to no deterioration of the rate ε_n. The entropy condition (7) is actually stronger than needed. Scrutinizing the proof of the theorem, one can see that the entropy appearing in the theorem is obtained using pieces smaller than those giving the exponentially powerful test in Lemma 2 in the Appendix. However, the covering number with those pieces looks more complicated, and the form in (7) suffices for all examples in the present paper. Lastly, condition (8) implies that the outside of the sieve H_n should possess sufficiently small prior mass to kill the factor s* log p arising from the lower bound on the denominator of the posterior distribution. In fact, conditions similar to (C2), (7) and (8) are also required for the prior on θ. By reading the proof, it is easy to see that the prior (2) explicitly satisfies the analogous conditions on an appropriately chosen sieve.
We want to sharpen the rate ε_n ≥ √(s* log(p ∨ m ∨ γ_n)/n) as much as possible. In most instances, γ_n can be chosen such that log γ_n ≲ log p. This is trivially satisfied if γ_n is some polynomial in n, as in the examples in this paper. If p is known to increase much faster than n, e.g., log p ≍ n^c for some c ∈ (0, 1), then γ_n need not be a polynomial in n and the condition can be met more easily with a sequence that grows even faster. Note also that we typically have log m ≲ log p in most cases. These postulates lead to ε_n ≥ √((s* log p)/n). Indeed, it is often possible to choose ε_n = √((s* log p)/n), which is commonly guaranteed by choosing an appropriate sieve H_n and prior. The condition will be made precise in (C5*) below for recovery, and we only consider the situation ε_n = √((s* log p)/n) in what follows.
Although Theorem 2 provides the basic results for posterior contraction, it does not give precise interpretations for the parameters θ and η themselves, because of the abstruse expression of the average Rényi divergence. The contraction rates with respect to more concrete metrics are recovered under some additional conditions. Under the additional assumption a_n ε_n² → 0, it can be shown that Theorems 1 and 2 imply that for the set

A_n = {(θ, η) : s_θ ≤ K_1 s*, ||X(θ − θ_0) + ξ_η − ξ_{η_0}||_2² + nd_{B,n}²(η, η_0) ≤ M_1 nε_n²},

with ξ_η the stacked vector of the ξ_{η,i} and M_1 a sufficiently large constant, the posterior mass of A_n goes to one in probability (see the proof of Theorem 3). To complete the recovery, we need to separate the sum of squares of the mean into ||X(θ − θ_0)||_2² and nd_{A,n}²(η, η_0), which requires an additional condition. The conditions required for the recovery are clarified as follows.
(C5*) While log m ≲ log p, (C5) holds for γ_n and ε_n = √((s* log p)/n) such that log γ_n ≲ log p and a_n ε_n² → 0 with a_n satisfying (C1).

(C6) For s* satisfying (C5*), there exist η* ∈ H̄ and δ > 0 such that the normalized cross product ⟨X(θ − θ_0), ξ_η − ξ_{η*}⟩/(||X(θ − θ_0)||_2 ||ξ_η − ξ_{η*}||_2) stays bounded away from −1 and 1 by δ uniformly over (θ, η) ∈ A_n with nonzero factors, and nd_{A,n}²(η*, η_0) ≲ nε_n², where ε_n in A_n satisfies ε_n = √((s* log p)/n).
By expanding the quadratic term for the mean in A_n, one can see that the separation is possible if (C6) is satisfied. Clearly, (C6) is trivially satisfied if the model has only Xθ for its mean, in which case we take ξ_{η,i} − ξ_{η*,i} = ξ_{η*,i} − ξ_{η_0,i} = 0 for every i ≤ n. In many cases where there exists η' ∈ H̄ such that d_{A,n}(η', η_0) = 0, we can often take η* = η' for the second inequality of (C6) to hold automatically.
The following theorem shows that the posterior distributions of θ and η contract to their respective true values at certain rates, relative to more easily comprehensible metrics than the average Rényi divergence. In the expressions, if K_1 s* + s_0 < 1, the compatibility numbers should be understood to be equal to 1 for interpretation.
The thresholds for contraction depend upon the compatibility conditions, which makes their implication somewhat vague. As K_1 s* + s_0 is much smaller than n*, it is not unreasonable to assume that φ_1(K_1 s* + s_0) and φ_2(K_1 s* + s_0) are bounded away from zero, whence the compatibility numbers are removed from the rates. We refer to Example 7 of Castillo et al. [8] for more discussion. In the next subsection, we will see that one of these restrictions is actually necessary for shape approximation or selection consistency.

Remark 3.
The separation condition (C6) can be left as an assumption to be satisfied, but it can also be verified by a stronger condition on the design matrix without resorting to the values of the parameters. Suppose that for some integer q ≥ 1, there exists a matrix Z_i ∈ R^{m_i×q} such that ξ_{η,i} = Z_i h(η) for every η ∈ H̄, with some map h : H̄ → R^q. Since we can write ξ_{η,i} − ξ_{η*,i} = Z_i(h(η) − h(η*)) for any η, η* ∈ H̄, the Cauchy-Schwarz inequality indicates that the first inequality of (C6) is implied by requiring that the normalized cross product ⟨X(θ − θ_0), Z(h(η) − h(η*))⟩/(||X(θ − θ_0)||_2 ||Z(h(η) − h(η*))||_2), with Z the matrix obtained by stacking Z_1, ..., Z_n, be bounded away from −1 and 1. The left hand side is always between −1 and 1 by the Cauchy-Schwarz inequality, and is exactly equal to −1 or 1 if and only if the two vectors are linearly dependent. A sufficient condition for the preceding requirement is thus min{ς_min([X_S, Z]) : s ≤ K_1 s* + s_0} ≳ 1, since such linear dependence cannot happen under this condition due to the inequality s_{θ−θ_0} ≤ s_θ + s_0 ≤ K_1 s* + s_0 for θ such that s_θ ≤ K_1 s*. This sufficient condition is not restrictive at all if q = o(n), as we already have K_1 s* + s_0 = o(n). Since there typically exists η* ∈ H̄ satisfying the second inequality of (C6) as long as H̄ provides a good approximation to the true parameter η_0, condition (C6) can easily be satisfied if the sufficient condition is met.
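The sufficient condition of Remark 3 can be checked numerically for toy designs. The sketch below computes the smallest singular value of [X_S, Z] over all small supports; the 1/√n scaling used to put the quantity on a unit scale is an illustrative assumption.

```python
import numpy as np
from itertools import combinations

def min_joint_singular_value(X, Z, s_max):
    """Smallest singular value of n^{-1/2}[X_S, Z] over supports |S| <= s_max.

    Brute-force enumeration, feasible only for toy (p, s_max); a value bounded
    away from zero supports the sufficient condition in Remark 3.
    """
    n, p = X.shape
    worst = np.inf
    for k in range(1, s_max + 1):
        for S in combinations(range(p), k):
            M = np.hstack([X[:, S], Z]) / np.sqrt(n)   # scaling is illustrative
            worst = min(worst, np.linalg.svd(M, compute_uv=False)[-1])
    return worst
```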
Notwithstanding the lack of a formal study of minimax rates under these additional complications, we still want to match our rates for θ with those in simple linear regression, which we call the "optimal" rates. In this sense, Theorem 3 only provides suboptimal rates for θ if s_0 = o(s*). Although the theorem gives the optimal results if s_0 log p ≳ nε̄_n², it is practically hard to check this condition as s_0 is unknown. If s_0 is known to be nonzero, the desired conclusion is trivially achieved as soon as nε̄_n²/log p ≲ 1. The following corollary, however, shows that the optimal rates are still available even if s_0 = 0, with restrictions on ε̄_n and the prior. The corollary is useful in limited situations, especially when a parametric rate is available for the nuisance parameter. Even if nε̄_n² ≍ log n, we need to further assume that log n = o(log p), i.e., the ultra high-dimensional setup, to conclude that (a) holds, while we can always apply (b) because log n ≲ log p. Although assertion (b) holds for any s_0 ≥ 0 if A_4 is chosen sufficiently large, the specific threshold is not directly available. Indeed, by carefully reading the proof of Theorem 1 together with Lemma 1 in the Appendix, one can see that the threshold depends on the unknown constant bounds for the eigenvalues of the true covariance matrices in (C4). Still, (b) holds for any A_4 > 0 if s_0 > 0. We believe that the assumption s_0 > 0 is very mild, and hence simply apply (b) with this assumption to conclude the optimal contraction for models with finite dimensional nuisance parameters. The optimal rates can still be achieved for any s_0 ≥ 0 by verifying the conditions in the following subsection. With finite dimensional nuisance parameters, we do not pursue this direction, as it seems an overkill considering the mildness of the assumption s_0 > 0, though those conditions are actually required for the Bernstein-von Mises theorem and selection consistency in Section 4.
In semiparametric situations with high- or infinite-dimensional nuisance parameters, neither (a) nor (b) generally works unless p increases sufficiently fast. Still, the optimal rates can be achieved under stronger conditions using semiparametric theory, as the following subsection shows.

Optimal posterior contraction for θ
Recall that only suboptimal rates may be available from Theorem 3 if s_0 log p = o(nε̄_n²). In many semiparametric situations, however, it is often possible to obtain parametric rates for finite dimensional parameters under stronger conditions, even when there are infinite-dimensional nuisance parameters in the model [4,7]. It has also been shown that a similar argument holds in some high-dimensional semiparametric regression models [10]. Therefore, it is naturally of interest to examine under what conditions we can replace s* by s_0 in the rates for θ, even if s_0 log p = o(nε̄_n²). Similar to other semiparametric settings [4,10], this can be established by semiparametric theory, but requires stronger conditions than in traditional fixed dimensional parametric cases because of the high dimensionality of the parameters in our setup.
To proceed, some additional conditions are required for technical reasons; they are imposed on the size of ε̄_n, as the optimal rates are automatically attained if s_0 log p ≳ nε̄_n². Still, in a practical sense, the conditions almost always need to be verified to reach the optimal rates, since only oracle rates are generally available and we do not know which term is greater.
In what follows, we write s̃ := nε̄_n²/log p for ε̄_n satisfying the conditions of Theorem 3 through the definition of ε_n. We first assume the following condition on the uniform compatibility number.
(C7) For a sufficiently large M, the uniform compatibility number φ_1(Ms̃ + s_0) is bounded away from zero.
This condition is weaker than assuming that the smallest scaled singular value φ_2(Ms̃ + s_0) is bounded away from zero, as we have φ_1(s) ≥ φ_2(s) for any s > 0 by the Cauchy-Schwarz inequality. We will also rely on a slightly stronger condition on φ_1 for the distributional approximation in the following section. In this sense, our condition is weaker than those for Theorem 4 of Castillo et al. [8]. Condition (C7) is not restrictive, as (C5*) requires s* = o(n); we again refer to Example 7 of Castillo et al. [8].
To state the other conditions precisely, hereafter we use the following additional notation. We write ξ̃_η = (ξ_{η,1}^T, ..., ξ_{η,n}^T)^T ∈ R^{n*}, with n* := Σ_{i=1}^n m_i, and Δ̃_η to denote the collection of Δ_{η,i} for i = 1, ..., n. In particular, X_S ∈ R^{n*×|S|} denotes the submatrix of X with columns chosen by an index set S. We also define the following neighborhoods of the true parameters: for s̃ and ε_n satisfying (C5*) and sufficiently large constants M̃_1 and M̃_2,

Θ̃_n = {θ : s_θ ≤ K_1 s̃, ||θ − θ_0||_2 ≤ M̃_1 √(s̃ log p)/||X||_*}, H̃_n = {η ∈ H_n : nd_n²(η, η_0) ≤ M̃_2 s̃ log p}. (10)

Combined with the other conditions, Theorem 3 implies that the posterior probabilities of these neighborhoods tend to one in probability if s_0 log p = o(nε̄_n²). We need some bounding conditions on these neighborhoods, which will be specified below.
Let Φ(η) = (ξ̃_η, Δ̃_η) for any given η ∈ H̄. For a given θ, we choose a bijective map η ↦ η̃_n(θ, η) : H̄ → H̄ such that Φ(η̃_n(θ, η)) = (ξ̃_η + HX(θ − θ_0), Δ̃_η) for some orthogonal projection H which may depend on the true parameter values, but not on θ and η. The projection H plays a key role here and for the distributional approximation in the following section, and thus should be appropriately chosen to satisfy the following conditions.
(C8) The orthogonal projection H satisfies

sup_{η ∈ H̃_n} ||(I − H)(ξ̃_η − ξ̃_{η_0})||_2² = o((s_0 ∨ 1) log p), inf{||(I − H)u||_2/||u||_2 : u ∈ span(X_S), s ≤ K_1 s̃, u ≠ 0} ≳ 1.

(C9) The conditional law Π_{n,θ} of η̃_n(θ, η) given θ, induced by the prior, is absolutely continuous relative to its distribution Π_{n,θ_0} at θ = θ_0 (which is the same as the prior for η), and the Radon-Nikodym derivative satisfies

sup_{θ ∈ Θ̃_n} sup_{η ∈ H̃_n} |log (dΠ_{n,θ}/dΠ_{n,θ_0})(η̃_n(θ, η))| → 0.

By reading the proof, one can see that Theorem 4 below is based on an approximate likelihood ratio. The first condition of (C8) is required to control the remainder of the approximation. The second condition of (C8) implies that ||u||_2 ≲ ||(I − H)u||_2 ≤ ||u||_2 for every u ∈ span(X_S) with S such that s ≤ K_1 s̃, where the second inequality trivially holds by the fact that I − H is an orthogonal projection. The use of the shifting map η ↦ η̃_n(θ, η) is justified by condition (C9), which implies that a shift in certain directions does not substantially affect the prior on η. This is related in spirit to the absolute continuity condition in the semiparametric Bernstein-von Mises theorem (see, for example, Theorem 12.8 of Ghosal and van der Vaart [17]). We will see that the distributional approximation also requires similar, but stronger, conditions.
Lastly, the complexity of the neighborhood H n should also be controlled. Specifically, we make the following condition.
(C10) For a_n and e_n satisfying (C1) and a sufficiently large C > 0, the term ε̄_n √(a_n (s_0 ∨ 1) log p) tends to zero and the metric entropy integral ∫_0^{Cε̄_n} √(log N(u, H̃_n, d_n)) du is of smaller order than √((s_0 ∨ 1)(log p)/n).

(C11) The neighborhood H̃_n is separable with respect to the pseudo-metric d_n.

Similar to (C8), these conditions are required to control the remainder of the approximation. The integral term arises as a bound on the expected supremum of a separable Gaussian process, exploiting the Gaussian likelihood of the model and the separability of H̃_n under the standard deviation metric. Condition (C11) is crucial for this reason. Since we usually put a prior on η in an explicit way, condition (C11) is rarely violated in practice. One may see a connection between the first term of (C10) and the conditions for Corollary 1: the former easily tends to zero even if nε̄_n²/log p is increasing, owing to the extra factor ε̄_n, which commonly tends to zero at a polynomial order. Note also that the term s_0 ∨ 1 appears in (C8) and (C10). Although this gives sharper bounds, the conditions often need to be verified with s_0 ∨ 1 replaced by 1, as s_0 is unknown.
Under the conditions specified above, we obtain the following theorem on contraction rates for θ that do not depend on ε̄_n. The compatibility numbers below should be understood to be 1 if s_0 = 0.
As in the discussion following Theorem 3, the compatibility numbers are easily bounded away from zero so that they can be removed from the expressions; the requirements here are actually weaker than before as s_0 ≤ s*. The simplified rates are then available for ease of interpretation.

Remark 4.
In regression models where no additional mean part ξ η,i exists, conditions (C8) and (C9) are trivially satisfied by choosing the zero matrix for H. This is also true for (C8*) and (C9*) specified in the next section.
Remark 5. Suppose, as in Remark 3, that ξ_{η,i} = Z_i h(η) and that H is chosen as the orthogonal projection onto span(Z) for Z the matrix stacking Z_1, ..., Z_n. In this case, by the triangle inequality, the first condition of (C8) is satisfied if there exists η* ∈ H̄ such that nd_{A,n}²(η*, η_0)/(s_0 log p) → 0. For (C8*) in the next section, this is replaced by (s*² log p) nd_{A,n}²(η*, η_0) → 0. These are trivially the case if there exists η' ∈ H̄ such that d_{A,n}(η', η_0) = 0. Also similar to Remark 3, a sufficient condition for the second condition of (C8) is min{ς_min([X_S, Z]) : s ≤ K_1 s̃} ≳ 1, since full column rank of [X_S, Z] rules out linear dependence between span(X_S) and span(Z). This is also sufficient for (C8*) in the next section with s̃ replaced by s*.

Remark 6.
In many instances, for every δ > 0 and ζ_n > 0, we typically have N(δζ_n, {η : d_{B,n}(η, η_0) ≤ ζ_n}, d_{B,n}) ≤ (b_n/δ)^{r_n} for some sequences r_n and b_n, especially when the part of η involved with d_{B,n} is an r_n-dimensional Euclidean parameter. Note that

∫_0^{ζ_n} √(log N(u, {η : d_{B,n}(η, η_0) ≤ ζ_n}, d_{B,n})) du ≲ ∫_0^{ζ_n} √(r_n log(b_n ζ_n/u)) du.

If b_n is increasing, the right hand side is bounded by a multiple of ζ_n √(r_n log b_n) by the tail probability of a normal distribution, while it is bounded by a multiple of ζ_n b_n √r_n for nonincreasing b_n. This simplification is useful to verify (C10) in many applications, and can also be used for (C10*) in the next section.
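The following LaTeX sketch records the standard ball-covering calculation behind Remark 6; the constant 3 and the ball-radius parameterization are illustrative assumptions, not the paper's exact constants.

```latex
% For a ball of radius b_n in an r_n-dimensional Euclidean parameter, the
% standard volume argument gives the covering bound
%   N(u, \{\eta : \|\eta\| \le b_n\}, \|\cdot\|) \le (3 b_n / u)^{r_n},
% so the entropy integral satisfies, after substituting u = \zeta_n t,
\[
  \int_0^{\zeta_n} \sqrt{r_n \log\frac{3 b_n \zeta_n}{u}}\, du
  = \zeta_n \sqrt{r_n} \int_0^1 \sqrt{\log\frac{3 b_n}{t}}\, dt
  \lesssim \zeta_n \sqrt{r_n}\Bigl(\sqrt{\log b_n} + \int_0^1 \sqrt{\log(1/t)}\, dt\Bigr)
  \lesssim \zeta_n \sqrt{r_n \log b_n},
\]
% for increasing b_n, where \int_0^1 \sqrt{\log(1/t)}\,dt = \sqrt{\pi}/2 is a
% Gaussian-tail (incomplete gamma) integral, as invoked in Remark 6.
```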

Bernstein-von Mises and selection consistency
An extremely important question is whether the true support S 0 is recovered with probability tending to one, which is the property called selection consistency. We will show this based on a distributional approximation to the posterior distribution. Combined with selection consistency, the shape approximation also leads to the product of a point mass and a normal distribution, which we call the Bernstein-von Mises theorem. This reduced approximate distribution enables us to correctly quantify the remaining uncertainty of the parameter through the posterior distribution.

Shape approximation to the posterior distribution
It is worth noting that selection consistency can often be verified without a distributional approximation. For example, in sparse linear regression with scalar unknown variance σ 2 , Song and Liang [28] deployed the marginal likelihood of the model support which can be obtained by integrating out θ and σ 2 from the likelihood using the inverse gamma kernel. In our general formulation, however, this approach is hard to implement due to the arbitrary structure of a nuisance parameter η. Indeed, the approach is not directly available even for a parametric covariance matrix with dimension m ≥ 2. In this sense, using a shape approximation could be a natural solution to the problem, which may require some extra conditions on the parameter space and on the priors for θ and η.
Recall that the results in Section 3.2 are based on semiparametric theory. In this section we need conditions very similar to those above, but the requirements are generally stronger, as the remainder of the approximation must be strictly controlled. Since the setup is high-dimensional, our conditions are even more restrictive than those for semiparametric models with a fixed dimensional parametric component [e.g., 7]. One may refer to Section 3.3 of Chae et al. [10] for a relevant discussion.
Throughout this section, we only consider s* satisfying the conditions of Theorem 3. First of all, we make a modification of (C7). The following condition is slightly stronger than (C7), but is still not too restrictive as (C5*) requires s* = o(n).

(C7*) For a sufficiently large M, the uniform compatibility number φ_1(Ms* + s_0) is bounded away from zero.
The assumption on the prior for θ is made only through the regularization parameter λ. As in Castillo et al. [8], λ should not increase too fast and should satisfy λs*√(log p)/||X||_* → 0. In our setup, the range of λ in (4) induces a sufficient condition for this: s*² log p = o(n). Since this is weaker than a condition made later in this section, the "small lambda regime" is automatically met by a stronger condition for the entire procedure of distributional approximation (see (C10*) below and the following paragraph).
For sufficiently large constants M̃_1 and M̃_2, we now define the neighborhoods

Θ'_n = {θ : s_θ ≤ K_1 s*, ||θ − θ_0||_1 ≤ M̃_1 s* √(log p)/||X||_*}, H'_n = {η ∈ H_n : nd_n²(η, η_0) ≤ M̃_2 s* log p}.

Note that Θ'_n is defined with an ℓ_1-ball, which makes it contract more slowly than Θ̃_n in (10) under (C7*). This is due to the technical reason that, for the distributional approximation, the ℓ_1-ball must be directly manipulated on the complement of Θ'_n. The neighborhood H'_n is also enlarged to match Θ'_n. We leave more details to the reader; refer to the proof of Theorem 5 below.
As in Section 3.2, we choose a bijective map η ↦ η̃_n(θ, η) which gives rise to Φ(η̃_n(θ, η)) = (ξ̃_η + HX(θ − θ_0), Δ̃_η) for some orthogonal projection H. Again, the orthogonal projection H should be carefully chosen to satisfy some boundedness conditions. The conditions are similar to, but stronger than, those in Section 3.2. This is not only because of the enlarged neighborhoods Θ'_n and H'_n, but also because the remainder of the approximation must be bounded on their complements. We precisely state the required conditions below.
(C8*) The orthogonal projection H satisfies

(s*² log p) sup_{η ∈ H'_n} ||(I − H)(ξ̃_η − ξ̃_{η_0})||_2² → 0, inf{||(I − H)u||_2/||u||_2 : u ∈ span(X_S), s ≤ K_1 s*, u ≠ 0} ≳ 1.

(C9*) The conditional law Π_{n,θ} of η̃_n(θ, η) given θ, induced by the prior, is absolutely continuous relative to its distribution Π_{n,θ_0} at θ = θ_0, and the Radon-Nikodym derivative satisfies

sup_{θ ∈ Θ'_n} sup_{η ∈ H'_n} |log (dΠ_{n,θ}/dΠ_{n,θ_0})(η̃_n(θ, η))| → 0.

(C10*) For a_n and e_n satisfying (C1) and a sufficiently large C > 0, (a_n ∨ e_n)(s*² log p)√((s* log p)/n) → 0, together with the analogue of the entropy-integral bound in (C10) over the enlarged neighborhood H'_n.

Conditions (C8*)-(C10*) are required for reasons similar to those in Section 3.2. We mention that (C10*) is a sufficient condition for the small lambda regime, since a necessary condition for it is s*^5 log³ p = o(n), which is stronger than s*² log p = o(n). This necessary condition for (C10*) is often also sufficient in many finite dimensional models.
We define the standardized vector used to center the normal approximation on each support. Under the assumptions above, the posterior distribution of θ is approximated by Π^∞ given by the mixture

Π^∞ = Σ_S ŵ_S (N^S_{θ̂_S, Ω_S} ⊗ δ_{S^c, 0}),

where N^S_{μ,Ω} is the Gaussian measure with mean μ ∈ R^s and precision matrix Ω ∈ R^{s×s} on the coordinates S, δ_{S^c,0} is the Dirac measure at zero on S^c, θ̂_S is the least squares solution for the model with support S, and the ŵ_S are model weights summing to one, determined by the prior and the Gaussian marginal likelihoods. Another way to express Π^∞, for any measurable B ⊂ R^p, is as a ratio of integrals of the corresponding Gaussian densities over B relative to the Lebesgue measure L on each selected coordinate block. It can easily be checked that the two expressions are equivalent. The results are summarized in the following theorem.

Model selection consistency
The shape approximation to the posterior distribution facilitates obtaining the next theorem which shows that the posterior distribution is concentrated on subsets of the true support with probability tending to one. The result is then used as the basis of selection consistency. Similar to the literature, the theorem requires an additional condition on the prior as follows.
(C12) The prior satisfies A_4 > 1 and s* ≲ p^a for some a < A_4 − 1.
Since coefficients that are too close to zero cannot be identified by any selection strategy, some threshold on the magnitude of the true nonzero coefficients is needed for detection. The requirement of such a threshold is a fundamental limitation in high-dimensional setups. We impose the following threshold, the so-called beta-min condition. The condition is made in view of the third assertion of Theorem 4. The second assertion could also be used to form a similar threshold, but we only consider the one below, as it is generally weaker.
(C13) The true nonzero coefficients satisfy min_{j ∈ S_0} |θ_{0,j}| ≳ √(s_0 log p)/||X||_*, up to the compatibility numbers appearing in the third assertion of Theorem 4.

Since Theorem 3 implies that the support of θ contains the true support with posterior probability tending to one, selection consistency is an easy consequence of Theorem 6 under the beta-min condition (C13). Moreover, this improves the distributional approximation in (15) so that the posterior distribution can be approximated by a single component of the mixture; that is, the Bernstein-von Mises theorem holds for the parameter component θ_{S_0}. These arguments are summarized in the following two corollaries, whose proofs are straightforward and thus omitted.
Corollary 3 enables us to quantify the remaining uncertainty about the parameter through the posterior distribution. Specifically, we can construct credible sets for the individual components of θ_0 as in Castillo et al. [8]. It is easy to see that, by the definition of θ̂_{S_0}, its jth component has a normal distribution whose mean is the jth element of θ_{0,S_0} and whose variance is the jth diagonal element of the inverse information matrix of the selected model. Correct uncertainty quantification is thus guaranteed by the weak convergence.
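A minimal sketch of the resulting componentwise credible intervals follows. It assumes the limiting covariance takes the generalized least squares form (X_{S_0}^T Δ^{−1} X_{S_0})^{−1}, which is a plausible reading of the Bernstein-von Mises statement rather than a verbatim formula from the theorem.

```python
import numpy as np
from scipy import stats

def marginal_credible_intervals(X_S0, Delta_inv, theta_hat, level=0.95):
    """Componentwise intervals from the N(theta_hat, Gamma) limit,
    with Gamma = (X_S0^T Delta^{-1} X_S0)^{-1} (an assumed GLS form)."""
    Gamma = np.linalg.inv(X_S0.T @ Delta_inv @ X_S0)
    z = stats.norm.ppf(0.5 + level / 2)
    half = z * np.sqrt(np.diag(Gamma))
    return np.column_stack([theta_hat - half, theta_hat + half])
```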

Applications
In this section, we apply the main results established in this study to the examples considered in Section 1.1. The main objective is to obtain nearly optimal posterior contraction rates and selection consistency via shape approximation to the posterior distribution with the Bernstein-von Mises phenomenon.
To use Corollary 1 for the optimal posterior contraction when nε̄_n² ≍ log n, we simply assume that s_0 > 0 for all examples in this section, although Theorem 4 can also be applied under stronger conditions. The assumption s_0 > 0 is extremely mild compared with restricting to the ultra high-dimensional case, i.e., log n = o(log p). A large enough A_4 would also suffice instead of the assumption s_0 > 0, but we do not pursue this direction as a specific threshold is not available. We check the conditions of Theorem 4 only for the more complicated models where nε̄_n² > log n.

Multiple response models with missing components
We first apply the main results to Example 1. To recover posterior contraction of Σ from the primitive results, it is necessary to assume that every pair of entries of the response is jointly observed sufficiently often. To be more specific, let e_ij be 1 if the jth entry of Y_i^aug is observed and zero otherwise. The contraction rate of the (j, k)th element of Σ is directly determined by the order of n^{−1} Σ_{i=1}^n e_ij e_ik. The ideal case is when this quantity is bounded away from zero, that is, the entries are jointly observed at a rate proportional to n. Then the recovery is possible without any loss of information. If n^{−1} Σ_{i=1}^n e_ij e_ik decays to zero, then the optimal recovery is not attainable, but consistent estimation may still be possible with slower rates. With an inverse Wishart prior on Σ, the following theorem studies the posterior asymptotic properties of the given model.
for some nondecreasing c_n such that c_n s_0 log p = o(n). Then the following assertions hold. Assume further that c_n(s_0² ∨ log c_n)(s_0 log p)³ = o(n) and φ_1(Ds_0) ≳ 1 for a sufficiently large D. Then the following assertions also hold.
(c) For H ∈ R^{n*×n*} the zero matrix, the distributional approximation in (15) holds. (d) If A_4 > 1 and s_0 ≲ p^a for some a < A_4 − 1, then the no-superset result in (16) holds. (e) Under the beta-min condition as well as the conditions for (d), the selection consistency in (17) and the Bernstein-von Mises theorem in (18) hold.
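The observation-rate quantity n^{−1} Σ_{i=1}^n e_ij e_ik that drives the theorem above is easy to compute from data. The mask layout below is an illustrative assumption.

```python
import numpy as np

def joint_observation_rates(E_mask):
    """n^{-1} sum_i e_ij e_ik for a boolean mask of shape (n, m): the rate at
    which each pair of response entries is observed together."""
    E = np.asarray(E_mask, dtype=float)
    return E.T @ E / E.shape[0]

# Entries of the returned m x m matrix bounded away from zero correspond to
# the ideal case discussed above; decaying entries imply slower rates.
```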

Multivariate measurement error models
We now consider Example 2. In this subsection, we use the symbol ⊗ for the Kronecker product of matrices. For the priors on the nuisance parameters, normal prior distributions are assigned to the location parameters (α, β, and μ), and an inverse gamma and an inverse Wishart prior are used for the scale parameters (σ² and Σ, respectively). The next theorem shows the posterior asymptotic properties of the model. In particular, specific forms of the mean and variance for the shape approximation are provided, reflecting the modeling structure.
Then the following assertions hold.
(a) The optimal posterior contraction rates for θ in (11) are obtained. (b) The contraction rates for α, β, μ, and σ² are √((s_0 log p)/n) relative to the ℓ_2-norms. The same rate is also obtained for Σ with respect to the Frobenius norm.
Assume further that s_0^5 log³ p = o(n) and φ_1(Ds_0) ≳ 1 for a sufficiently large D. Then the following assertions hold.
(c) The distributional approximation in (15) holds with the mean vector and precision matrix determined by the model structure. (d) If A_4 > 1 and s_0 ≲ p^a for some a < A_4 − 1, then the no-superset result in (16) holds. (e) Under the beta-min condition as well as the conditions for (d), the selection consistency in (17) and the Bernstein-von Mises theorem in (18) hold.
We note that the marginal law of W_i is given by W_i ∼ N_q(μ, Σ + Ψ). This suggests that the rates for μ and Σ might actually be improved up to the parametric rate n^{−1/2} (possibly up to logarithmic factors). However, the other parameters are connected to the high-dimensional coefficients θ, so such a parametric rate may not be obtained for them.

Parametric correlation structure
Next, our main results are applied to Example 3. A correlation matrix G i (α) should be chosen so that the conditions in the main theorems can be satisfied.
Here we consider compound-symmetric, first order autoregressive, and first order moving average correlation matrices: for α ∈ (b_1, b_2) with fixed boundaries b_1 and b_2 of the range, respectively,

(G_i(α))_{jk} = 1{j = k} + α 1{j ≠ k}, (G_i(α))_{jk} = α^{|j−k|}, (G_i(α))_{jk} = 1{j = k} + α 1{|j − k| = 1}.

The range is chosen so that the corresponding correlation matrix is positive definite. Again, an inverse gamma prior is assigned to σ². For a prior on α, we consider a density whose tail probabilities decay to zero exponentially fast near both boundaries of the range. (c) For H ∈ R^{n*×n*} the zero matrix, the distributional approximation in (15) holds. As for the prior on α, the property that the tail probabilities decay to zero exponentially fast near both zero and one is crucial for the optimal posterior contraction rates. It should be noted that many common probability distributions with compact supports may not be enough for this purpose (e.g., beta distributions).
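The three correlation structures are easy to construct; the positive definiteness ranges noted in the comments are standard facts and indicate how (b_1, b_2) may be chosen.

```python
import numpy as np

def compound_symmetric(m, alpha):
    """PD for alpha in (-1/(m-1), 1)."""
    return (1 - alpha) * np.eye(m) + alpha * np.ones((m, m))

def ar1(m, alpha):
    """PD for |alpha| < 1."""
    idx = np.arange(m)
    return alpha ** np.abs(idx[:, None] - idx[None, :])

def ma1(m, alpha):
    """PD for every m when |alpha| < 1/2: eigenvalues 1 + 2*alpha*cos(j*pi/(m+1))."""
    return np.eye(m) + alpha * (np.eye(m, k=1) + np.eye(m, k=-1))

Sigma_i = 1.3 * ar1(m=5, alpha=0.4)    # Delta_{eta,i} = sigma^2 G_i(alpha)
```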
The main difference between this example and those in the preceding subsections is that we consider possibly increasing m_i here. Although we have the same form of contraction rates for θ as in previous examples, the implication is not the same due to a different order of ||X||_*. For increasing m_i, it is expected that ||X||_* ≍ √n*, which is commonly the case in regression settings. This reduces to ||X||_* ≍ √n for the cases with fixed m_i, and hence increasing m_i may help attain faster rates. While the increasing dimensionality of m_i is often a benefit for the contraction properties of θ, this may or may not be the case for the nuisance parameters, since it depends on the dimensionality of η. In the example in this subsection, the dimension of the nuisance parameters is fixed although m_i can increase, which makes their posterior contraction rates faster than those with fixed m_i. However, this may not be true if η is of increasing dimension; see the example in Section 5.5.

Mixed effects models
For the mixed effects models with sparse regression coefficients in Example 4, we assume that max_i ||Z_i||_sp is bounded, which is particularly mild if m is bounded. We also assume that Σ_{i=1}^n 1(m_i ≥ q) ≳ n and min{ς_min(Z_i) : m_i ≥ q} ≳ 1, that is, a nonvanishing fraction of the m_i are at least q and the corresponding Z_i have full rank. These conditions are required for (C1) to hold. We put an inverse Wishart prior on Ψ as in the other examples. The following theorem shows the posterior asymptotic properties of the mixed effects models.
Then the following assertions hold. (c) For H ∈ R^{n*×n*} the zero matrix, the distributional approximation in (15) holds. (d) If A_4 > 1 and s_0 ≲ p^a for some a < A_4 − 1, then the no-superset result in (16) holds. (e) Under the beta-min condition as well as the conditions for (d), the selection consistency in (17) and the Bernstein-von Mises theorem in (18) hold.
Note that we assume that σ² is known, which is actually unnecessary at the modeling stage. The assumption is made to find a sequence a_n satisfying (C1) with ease. It can be relaxed only with stronger assumptions on Z_i. For example, if q = 1 and Z_i is an all-one vector, then the model is equivalent to that with a compound-symmetric correlation matrix in Section 5.3 under some reparameterization, in which case σ² can be treated as unknown.

Graphical structure with a sparse precision matrix
For the graphical structure models in Example 5, we define an edge-inclusion indicator Υ = {υ_jk : 1 ≤ j < k ≤ m} such that υ_jk = 1 if ω_jk ≠ 0 and υ_jk = 0 otherwise, where ω_jk is the (j, k)th element of Ω. We put a prior with a density f_1 on R on the nonzero off-diagonal entries and a prior with a density f_2 on (0, ∞) on the diagonal entries of Ω, such that the support is truncated to a matrix space with restricted eigenvalues and entries. For the edge-inclusion indicator, we use a prior that, given |Υ| := Σ_{j<k} υ_jk, is uniform over all configurations with |Υ| edges, and assign a prior to |Υ| such that log Π(|Υ| ≤ r̄) ≳ −r̄ log r̄. The prior specification is summarized as follows. (a) The posterior contraction rates for θ are given by (9). If s̃ ≲ 1, the optimal rates in (11) are obtained. (c) The optimal posterior contraction rates for θ in (11) are obtained even if s̃ → ∞.
Assume further that (s* ∨ m)²(s* log p)³ = o(n) and φ_1(Ds*) ≳ 1 for a sufficiently large D. Then the following assertions hold.
(d) For H ∈ R^{n*×n*} the zero matrix, the distributional approximation in (15) holds. Note that increasing m is likely to improve the ℓ_2-norm contraction rate for θ, as we expect ||X||_* ≍ √(mn). In particular, the improvement clearly occurs if d ≲ m and φ_2(Ds*) ≳ 1 for a sufficiently large D. However, as pointed out in Section 5.3, this is not the case for Ω, as its dimension is also increasing.
If we assume that log n ≍ log m, then the term √((m + d)(log n)/n) arising from the sparse precision matrix Ω becomes √((m + d)(log m)/n). The latter is comparable to the frequentist convergence rate of the graphical lasso in Rothman et al. [27]. Therefore, our rate is deemed optimal considering the additional complication due to the mean term involving sparse regression coefficients.

Nonparametric heteroskedastic regression models
Next, we use the main results for Example 6. For a bounded, convex subset X ⊂ R, define the α-Hölder class C^α(X) as the collection of functions f : X → R with bounded derivatives f^{(k)}, k = 0, 1, ..., k_α, and with f^{(k_α)} Hölder continuous of order α − k_α, where f^{(k)} denotes the kth derivative of f and k_α is the largest integer strictly smaller than α. Let the true function v_0 belong to C^α[0, 1] with the assumption that v_0 is strictly positive. While α > 1/2 suffices for the basic posterior contraction, we will see that the optimal posterior contraction for θ requires α > 1. The stronger condition α > 2 is even needed for the Bernstein-von Mises theorem and selection consistency, but all of these conditions are mild if the true function is sufficiently smooth. We put a prior on v through B-splines: the function is expressed as a linear combination of the J-dimensional B-spline basis B_J of order q ≥ α, i.e., v_β(z) = β^T B_J(z). (a) The posterior contraction rates for θ are given by (9). If s̃ ≲ 1, the optimal rates in (11) are obtained. (b) The posterior contraction rate for v is √((s_0 log p)/n) ∨ ((log n)/n)^{α/(2α+1)} with respect to the ||·||_{2,n}-norm.
If further α > 1 and φ_1(Ds*) ≳ 1 for a sufficiently large D, then the following assertions hold.
(c) The optimal posterior contraction rates for θ in (11) are obtained even if s̃ → ∞.
(d) The distributional approximation in (15) holds with H the n × n zero matrix.
(e) If A_4 > 1 and s* ≲ p^a for some a < A_4 − 1, then the no-superset result in (16) holds.
(f) Under the beta-min condition as well as the conditions for (e), the selection consistency in (17) and the Bernstein-von Mises theorem in (18) hold.
An inverse Gaussian prior is used due to the property that its tail probabilities at both zero and infinity decay to zero exponentially fast. The exponentially decaying tail probabilities in both directions are essential to obtain the optimal contraction rate. Note that standard choices such as gamma and inverse gamma distributions do not satisfy this property.
By investigating the proof, it can be seen that the condition α > 1/2 is required to satisfy condition (C1) for posterior contraction, so this condition is unavoidable in applying the main theorems. Unlike Theorem 13 below, assertion (c) does not require any further boundedness condition. This is because the restriction α > 1 makes the required bound tend to zero. For the Bernstein-von Mises theorem and the selection consistency, it can be seen that α > 2 is necessary for the condition J(s*² ∨ J)(s* log p)³ = o(n), but not sufficient. Although the requirement α > 2 is implied by the latter condition, we specify it in the statement due to its importance. We refer to the proof of Theorem 12 for more details.
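To make the prior construction of Section 5.6 concrete, the following sketch draws one realization of the variance function from a B-spline expansion with positive inverse Gaussian coefficients; the hyperparameters (J, q, and the inverse Gaussian mean) are illustrative assumptions, not choices prescribed by the theory.

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.stats import invgauss

def bspline_design(z, J, q):
    """Design matrix of the J-dimensional B-spline basis of order q on [0, 1]."""
    knots = np.r_[np.zeros(q), np.linspace(0, 1, J - q + 2)[1:-1], np.ones(q)]
    return BSpline(knots, np.eye(J), q - 1)(z)     # shape (len(z), J)

rng = np.random.default_rng(1)
J, q = 10, 4
z = rng.uniform(0.01, 0.99, size=100)              # design points in (0, 1)
B = bspline_design(z, J, q)
beta = invgauss.rvs(mu=1.0, size=J, random_state=rng)  # positive coefficients
v = B @ beta   # a positive draw of the variance function at the design points
```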

Partial linear models
Lastly, we consider Example 7. We assume that the true function g_0 belongs to C^α[0, 1] with α > 0. Any α > 0 suffices for the basic posterior contraction, but stronger restrictions are required for the further assertions, as in Theorem 12. We put a prior on g through the J-dimensional B-spline basis of order q ≥ ᾱ, i.e., g_β(z) = β^T B_J(z). With a given J, we define the design matrix W_J = (B_J(z_1), ..., B_J(z_n))^T ∈ R^{n×J}. The standard normal prior is independently assigned to each component of β and an inverse gamma prior is assigned to σ². Similar to Section 5.6, we assume that the z_i are sufficiently regularly distributed on [0, 1]. (a) The posterior contraction rates for θ are given by (9). If s̃ ≲ 1, the optimal rates in (11) are obtained. (b) The contraction rates for g and σ² are √((s_0 log p)/n) ∨ ((log n)/n)^{ᾱ/(2ᾱ+1)} with respect to the ||·||_{2,n}- and ℓ_2-norms, respectively.
Assume that 1 < ᾱ < α − 1/2, (s*² log p)(log n)^{2α/(2ᾱ+1)} n^{(2(ᾱ−α)+1)/(2ᾱ+1)} = o(1), s*⁵ log³ p = o(n), and φ_1(Ds*) ≳ 1 for a sufficiently large D. Then the following assertion holds.
(c) The optimal posterior contraction rates for θ in (11) are obtained even if s* → ∞.
Here we elaborate more on the choice of the number J of basis terms. For assertions (a)-(b), J can be chosen such that ᾱ = α, which gives rise to the optimal rates for the nuisance parameters. This choice, however, does not satisfy (C8) and (C8*), and hence we need a better approximation for ‖(I − H)ξ_{η_0}‖_2 with some ᾱ < α to strictly control the remaining bias. For example, if ᾱ = α, the boundedness condition for (c) reduces to s* = o(1), which already gives the optimal contraction for θ by (a). Therefore, to incorporate the case that s* → ∞, one needs to consider an appropriate ᾱ that is strictly smaller than α. For the Bernstein-von Mises theorem and the selection consistency, the required restriction becomes even stronger: ᾱ < α − 1/2.
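As a concrete illustration of this prior construction, the sketch below builds a B-spline design matrix W_J on [0, 1] and draws g_β and σ² from the prior. The function name bspline_design, the clamped uniform knot placement, the uniform design points, and the specific sizes (n, J, q) are illustrative assumptions, not prescriptions from the paper; the theory only requires a sufficiently regular knot sequence and design.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(z, J, q):
    """n x J design matrix of B-splines of order q (degree q - 1) on [0, 1].

    Clamped uniform knots are an illustrative choice (hypothetical helper)."""
    degree = q - 1
    interior = np.linspace(0.0, 1.0, J - q + 2)[1:-1]        # J - q interior knots
    t = np.concatenate([np.zeros(q), interior, np.ones(q)])  # J + q knots -> J basis functions
    W = np.empty((len(z), J))
    for j in range(J):
        c = np.zeros(J)
        c[j] = 1.0                                           # jth basis function
        W[:, j] = BSpline(t, c, degree, extrapolate=False)(z)
    return np.nan_to_num(W)                                  # guard the right boundary

rng = np.random.default_rng(0)
n, J, q = 200, 12, 4                 # cubic B-splines; q >= alpha holds for alpha <= 4
z = rng.uniform(size=n)              # "sufficiently regularly distributed" covariates
W_J = bspline_design(z, J, q)        # the matrix W_J = (B_J(z_1), ..., B_J(z_n))^T

beta = rng.standard_normal(J)        # independent standard normal prior on beta
sigma2 = 1.0 / rng.gamma(2.0, 1.0)   # an inverse gamma prior draw for sigma^2
g = W_J @ beta                       # prior draw of g_beta evaluated at the design points
print(W_J.shape, g[:5], sigma2)
```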

Appendix A: Proofs for the main results
In this section, we provide proofs of the main theorems. We first describe the additional notation used in the proofs. For a matrix X, we write ρ_1(X) ≥ ρ_2(X) ≥ ⋯ for the eigenvalues of X in decreasing order. The notation Λ_n(θ, η) = ∏_{i=1}^n (p_{θ,η,i}/p_{0,i})(Y_i) stands for the likelihood ratio of p_{θ,η} and p_0. Let E_{θ,η} denote the expectation operator under the density p_{θ,η} and let P_0 denote the probability operator under the true density. For two densities f and g, let K(f, g) = ∫ f log(f/g) and V(f, g) = ∫ f |log(f/g) − K(f, g)|² stand for the Kullback-Leibler divergence and variation, respectively. Using constants ρ̄_0 > ρ_0 > 0, we rewrite (C4) as ρ_0 ≤ min_i ρ_min(Δ_{η_0,i}) ≤ max_i ρ_max(Δ_{η_0,i}) ≤ ρ̄_0 for clarity.
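For reference, the Gaussian Kullback-Leibler quantities used repeatedly below have standard closed forms. The following sketch is only a numeric sanity check of those formulas against Monte Carlo (the function names and parameter values are hypothetical, not from the paper):

```python
import numpy as np

def kl_gaussian(xi1, Delta1, xi2, Delta2):
    """K(N(xi1, Delta1), N(xi2, Delta2)) via the standard closed form."""
    m = len(xi1)
    inv2 = np.linalg.inv(Delta2)
    diff = xi2 - xi1
    return 0.5 * (np.trace(inv2 @ Delta1) - m + diff @ inv2 @ diff
                  + np.log(np.linalg.det(Delta2) / np.linalg.det(Delta1)))

rng = np.random.default_rng(1)
m = 3
A = rng.standard_normal((m, m))
Delta1 = A @ A.T + m * np.eye(m)     # illustrative positive definite matrices
Delta2 = np.eye(m)
xi1, xi2 = rng.standard_normal(m), np.zeros(m)

# Monte Carlo check: K = E_1 log(p1/p2) and V = E_1 |log(p1/p2) - K|^2,
# matching the definitions of K(f, g) and V(f, g) above.
L = np.linalg.cholesky(Delta1)
Y = xi1 + rng.standard_normal((200_000, m)) @ L.T
def logpdf(y, xi, Delta):
    d = y - xi
    inv = np.linalg.inv(Delta)
    return -0.5 * (np.einsum('ij,jk,ik->i', d, inv, d)
                   + np.log(np.linalg.det(2 * np.pi * Delta)))
lr = logpdf(Y, xi1, Delta1) - logpdf(Y, xi2, Delta2)
print(kl_gaussian(xi1, Delta1, xi2, Delta2), lr.mean(), ((lr - lr.mean())**2).mean())
```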

A.1. Proof of Theorem 1
We first state a lemma showing that the denominator of the posterior distribution is bounded below by an appropriate factor with P_0-probability tending to one; this will be used to prove the main theorems.

Lemma 1. Suppose that (C1)-(C4) are satisfied. Then there exists a constant C > 0 such that the event E_n in (19) satisfies P_0(E_n^c) → 0.
Proof. We define the Kullback-Leibler-type neighborhood B_n = {(θ, η) ∈ Θ×H : ∑_{i=1}^n K(p_{0,i}, p_{θ,η,i}) ≤ C_1nε̄_n², ∑_{i=1}^n V(p_{0,i}, p_{θ,η,i}) ≤ C_1nε̄_n²} for a sufficiently large C_1. Then Lemma 10 of Ghosal and van der Vaart [16] implies that for any C > 0, the denominator of the posterior distribution is bounded below by Π(B_n)e^{−(1+C)nε̄_n²} with P_0-probability tending to one. Hence, it suffices to show that Π(B_n) is bounded below as in the lemma. By Lemma 9, the Kullback-Leibler divergence and variation of the ith observation are available in closed form. Defining I_{n,δ} = {i ≤ n : ‖Δ_{η,i} − Δ_{η_0,i}‖_F ≥ δ} with small δ > 0 and |I_{n,δ}| the cardinality of I_{n,δ}, we see that on B_n, a_n|I_{n,δ}|/n ≲ a_nε̄_n², which is (21). Every i ∉ I_{n,δ} can be handled by the relation |1 − x| ≍ |1 − x^{−1}| as x → 1 together with (i) of Lemma 10 in the Appendix. Since a_n|I_{n,δ}|/n ≲ a_nε̄_n² by (21), it follows using (5) that the corresponding sum is suitably bounded for some constants C_2, C_3 > 0. Combining this with (21), we conclude that max_i ‖Δ_{η,i} − Δ_{η_0,i}‖²_F ≲ a_nε̄_n² + e_n on B_n, which implies that max_{i,k}|1 − ρ*_{i,k}| is small for all sufficiently large n, by (i) of Lemma 10 and the inequality |1 − x| ≍ |1 − x^{−1}| as x → 1. Hence, log ρ*_{i,k} can be expanded in powers of (1 − ρ*_{i,k}) for every i and k. Furthermore, since max_{i,k}|1 − ρ*_{i,k}| is sufficiently small, we obtain min_{i,k}ρ*_{i,k} ≳ 1 by the restriction on the eigenvalues of Δ_{η_0,i}. Combining these results, it follows that on B_n, both n^{−1}∑_{i=1}^n K(p_{0,i}, p_{θ,η,i}) and n^{−1}∑_{i=1}^n V(p_{0,i}, p_{θ,η,i}) are bounded above by a constant multiple of n^{−1}‖X(θ − θ_0)‖²_2 + d_n²(η, η_0). Hence, C_1 can be chosen sufficiently large such that B_n contains an appropriate product set, by the inequality ‖Xθ‖_2 ≤ ∑_{j=1}^p |θ_j|‖X_{·j}‖_2 ≤ ‖X‖_*‖θ‖_1. The logarithm of the second term on the rightmost side is bounded below by a constant multiple of −nε̄_n² by (C2). To find the lower bound for the first term, we shall first work with the case s_0 ≥ 1, and then show that the same lower bound is obtained even when s_0 = 0. Now, assume that s_0 ≥ 1 and let Θ_{0,n} = {θ_{S_0} ∈ R^{s_0} : n^{−1/2}‖X‖_*‖θ_{S_0} − θ_{0,S_0}‖_1 ≤ ε} for ε > 0 to be chosen later. Then, by the inequality g_{S_0}(θ_{S_0}) ≥ e^{−λ‖θ_0‖_1}g_{S_0}(θ_{S_0} − θ_{0,S_0}), the relation (6.2) of Castillo et al. [8], and the assumption on the prior in (4), the integral on the rightmost side admits a lower bound for s_0 > 0, and thus the rightmost side of (23) is bounded below using the inequality \binom{p}{s_0}s_0! ≤ p^{s_0}. Choosing ε = ε̄_n, the first term on the rightmost side of (22) satisfies the required bound. Note that nε̄_n² > 1 and that the terms involving s_0, ε̄_n, and s_0 log p are of order s_0 log p if s_0 > 0, and thus the last display implies that there exists a constant C_4 > 0 such that the asserted lower bound for Π(B_n) holds. If s_0 = 0, the first term of (22) is clearly bounded below by π_p(0), so the same lower bound for Π(B_n) in the last display is also obtained since λ‖θ_0‖_1 + s_0 log p = 0. Finally, the lemma follows from (20).
Proof of Theorem 1. For the set B = {(θ, η) : s_θ > s̄} with any integer s̄ ≥ s_0, we see that Π(B) is equal to ∑_{s>s̄}π_p(s). Let E_n be the event in (19). Since Λ_n(θ, η) is nonnegative, by Fubini's theorem and Lemma 1, the posterior probability of B is suitably bounded on E_n for some constant C_1 and sufficiently large p. For a sufficiently large constant C_2, choose for s̄ the largest integer that is smaller than C_2s*. Replacing s̄ + 1 by C_2s* in the last display, it is easy to see that the rightmost side goes to zero. The proof is complete since P_0(E_n^c) → 0 by Lemma 1.

A.2. Proof of Theorems 2-3 and Corollary 1
The following lemma shows that a small piece of the alternative centered at any (θ_1, η_1) ∈ Θ × H is locally testable with exponentially small errors, provided that the center is sufficiently separated from the truth with respect to the average Rényi divergence. Theorem 2 for posterior contraction relative to the average Rényi divergence will then be proved by showing that the number of such pieces is controlled by the target rate. We write p_1 for the density with (θ_1, η_1), and E_1 and P_1 for the expectation and probability under p_1, respectively.

Lemma 2.
For a given sequence γ_n > 0, a sequence a_n satisfying (C1), and a given (θ_1, η_1) ∈ Θ × H, define the set F_{1,n} as in (26).

Then under (C1), there exists a test φ̄_n with exponentially small error probabilities. Proof. For given (θ_1, η_1) ∈ Θ × H such that R_n(p_0, p_1) ≥ δ_n², consider the most powerful test φ̄_n = 1{Λ_n(θ_1, η_1) ≥ 1} given by the Neyman-Pearson lemma. The two error bounds in (27) then follow easily. The first inequality of the lemma is a direct consequence of the first line of (27). For the second inequality of the lemma, note that by the Cauchy-Schwarz inequality, E_{θ,η}(1 − φ̄_n) is controlled by the second moment of the likelihood ratio. Thus, by the second line of (27), it suffices to show E_1((p_{θ,η}/p_1)(Y^{(n)}))² ≤ e^{7nδ_n²/8} for every (θ, η) ∈ F_{1,n}. Defining Δ*_{η,i} appropriately, a bound holds on the set F_{1,n} in which the second inequality is due to (C1). Since the leftmost side of the display is further bounded below by max_i |ρ_k(Δ*_{η,i}) − 1| for every k ≤ m_i, we obtain (28). Since δ_n²/m → 0 and ρ_k(2Δ*_{η,i} − I) = 2ρ_k(Δ*_{η,i}) − 1 for every k ≤ m_i, (28) implies that 2Δ*_{η,i} − I is nonsingular for every i ≤ n, and hence on F_{1,n}, E_1((p_{θ,η}/p_1)(Y^{(n)}))² can be written in the closed form (29). To bound this, the determinant factors are controlled as in (30), where the first inequality holds by (28), the second by the inequality (1 − x²)/(1 − 2x) ≤ 1 + 3x for small x > 0, and the last by the inequality x + 1 ≤ e^x. Now, for every (θ, η) ∈ F_{1,n}, the exponent in (29) is bounded above appropriately; combining this with (29) and (30) completes the proof.
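The testing argument can be visualized in the simplest one-dimensional case: both error probabilities of the Neyman-Pearson test 1{Λ_n ≥ 1} are together bounded by the Hellinger affinity ∫√(p_0p_1), which decays exponentially in the Rényi divergence. Below is a minimal numeric sketch for a toy normal location pair (an illustrative example, not the paper's setting):

```python
import numpy as np
from scipy import stats

# Toy example: p0 = N(0, 1), p1 = N(mu, 1). The Neyman-Pearson test
# 1{p1/p0 >= 1} rejects exactly when the observation exceeds mu/2.
for mu in [1.0, 2.0, 3.0]:
    type1 = stats.norm.sf(mu / 2)    # E_0 phi_n
    type2 = stats.norm.cdf(-mu / 2)  # E_1 (1 - phi_n)
    affinity = np.exp(-mu**2 / 8)    # int sqrt(p0 p1) for unit-variance normals
    print(f"mu={mu}: type I + type II = {type1 + type2:.4f}"
          f" <= affinity = {affinity:.4f}")
# The sum of errors equals int min(p0, p1) <= int sqrt(p0 p1), so it is
# always dominated by the exponentially small affinity, as used in Lemma 2.
```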
Hence, it suffices to show that the first term goes to zero for ε > 0 chosen to be the threshold in the theorem. Now, let Θ*_n = {θ ∈ Θ : s_θ ≤ K_1s*, ‖θ‖_∞ ≤ p^{L_2+2}/‖X‖_*} and define F_{1,n} as in (26) with γ_n = γ̄_n and δ_n = ε_n. Then Lemma 2 implies that small pieces of the alternative densities can be tested with exponentially small errors as long as the center is ε_n-separated from the true parameter values relative to the average Rényi divergence. To complete the proof, we shall show that the minimal number N*_n of those small pieces needed to cover Θ*_n × H_n is controlled appropriately in terms of ε_n, and that the prior mass of Θ_n \ Θ*_n and H \ H_n decreases fast enough to balance the denominator of the posterior distribution. (For more discussion on the construction of a test using metric entropies, see Sections D.2 and D.3 of Ghosal and van der Vaart [17].) Note that for every θ, θ′ ∈ Θ and η, η′ ∈ H, the densities can be compared by the inequality ‖X(θ − θ′)‖_2 ≤ ‖X‖_*‖θ − θ′‖_1 ≤ p‖X‖_*‖θ − θ′‖_∞ and the Cauchy-Schwarz inequality. Since a_n < n and ε_n² > n^{−1}, it is easy to see that F_{1,n} contains the corresponding smaller piece defined with the same (θ_1, η_1). Hence, log N*_n is bounded above by log N(1/(6mγ̄_nnp‖X‖_*), Θ*_n, ‖·‖_∞) + log N(1/(6mγ̄_nn^{3/2}), H_n, d_n). (32)

Note that for any small δ > 0, the ‖·‖_∞-covering number of Θ*_n admits a standard bound. Using this and the entropy condition (7), the right-hand side of (32) is bounded above by a constant multiple of nε_n². Hence, by Lemma D.3 of Ghosal and van der Vaart [17], for every ε > ε_n, there exists a test ϕ_n such that for some C_1 > 0, E_0ϕ_n ≤ 2exp(C_1nε_n² − nε²) and E_{θ,η}(1 − ϕ_n) ≤ exp(−nε²/16) for every (θ, η) ∈ Θ*_n × H_n such that R_n(θ, η) > ε. Note that under condition (3) on the prior distribution, we have −log π_p(s_0) ≲ s_0 log p ≲ s* log p since π_p(0) is bounded away from zero. Hence, for E_n the event in (19) and some constant C_2 > 0, the first term on the right-hand side of (31) is suitably bounded, where the term P_0E_n^c converges to zero by Lemma 1. Choosing ε = C_3ε_n for a sufficiently large C_3, this bound tends to zero. Furthermore, Π(H \ H_n)e^{C_2s* log p} goes to zero by condition (8). Now, to show that Π(Θ_n \ Θ*_n) goes to zero exponentially fast, observe that the prior mass can be bounded using the inequality π_p(s) ≤ (A_2p^{−A_4})^sπ_p(0) for every s. Since the tail probability of the Laplace distribution is given by ∫_{|x|>t}2^{−1}λe^{−λ|x|}dx = exp(−λt) for every t > 0, the rightmost side of the last display is bounded above by a constant multiple of exp(−λp^{L_2+2}/‖X‖_*) times a polynomial factor. Since λp^{L_2+2}/‖X‖_* ≳ p² by (4), the right-hand side is bounded by e^{−C_4p²} for some C_4 > 0, and thus Π(Θ_n \ Θ*_n)e^{C_2s* log p} goes to zero since s* log p = o(p²). Finally, we conclude that the left-hand side of (31) goes to zero with ε = C_3ε_n.
Proof of Theorem 3. By Theorem 2, we obtain the contraction rate of the posterior distribution with respect to the average Rényi divergence R_n(p_{θ,η}, p_0) between p_{θ,η} and p_0. Theorem 2 then yields the lower bound (34), where the second inequality holds by the inequality log x ≤ x − 1. Note that by combining (i) and (ii) of Lemma 10 in the Appendix, we obtain g²(Δ_{η,i}, Δ_{η_0,i}) ≳ ‖Δ_{η,i} − Δ_{η_0,i}‖²_F if the left-hand side is small. Thus, using the same approach as in the proof of Lemma 1, (34) is further bounded below by (35) for some constants C_1, C_2, C_3 > 0. Since C_1 − C_3a_nε_n² is bounded away from zero and e_n is decreasing, (34) and (35) imply that ε_n ≳ d_{B,n}(η, η_0). Now, it is easy to see by (5) that the relevant term is bounded, since e_n + a_nε_n² = o(1). Hence, we see that the corresponding display holds for η* satisfying (C6). The display implies that ‖X(θ − θ_0)‖²_2 + nd²_{A,n}(η, η_0) ≲ nε_n² by Theorem 2 and (C6). Combining the results verifies the third and fourth assertions of the theorem. For the remainder, observe that s_{θ−θ_0} ≤ s_θ + s_0 ≤ K_1s* + s_0 ≲ s* for θ such that s_θ ≤ K_1s*. Therefore, by Theorem 1, the first and second assertions readily follow from the definitions of φ_1 and φ_2.
Proof of Corollary 1. We first verify assertion (a). If s_0 > 0, the assertion is trivial. If s_0 = 0, the condition nε̄_n²/log p → 0 implies that s* → 0, and hence Theorem 1 holds with s̄ = 0. Since this means that θ = θ_0 = 0 if s_0 = 0, we can plug in s_0 for s* in Theorem 3.
Similarly, assertion (b) trivially holds if s_0 > 0, and we only need to verify the case s_0 = 0. By reading the proof of Theorem 1, one can see that (25) goes to zero for large enough A_4 if s_0 = 0. This completes the proof.

A.3. Proof of Theorem 4
To prove Theorem 4, we first provide preliminary results. Some of these will also be used to prove Theorems 5-6.

Lemma 3.
Suppose that (C1), (C2), (C7), (C8) and (C10) are satisfied for some orthogonal projection H. Then, for Λ*_n(θ, η) = (p_{θ,η}/p_{θ_0,η̃_n(θ,η)})(Y^{(n)}) and Λ̃_n(θ) in (14) with the corresponding H, there exists a positive sequence δ_n → 0 such that the asserted uniform bound holds for any θ with s_θ ≤ K_1s*. Proof. If s_θ = s_0 = 0, the left-hand side in the probability operator is zero, and the assertion trivially holds. We thus only consider the case s_θ + s_0 > 0 below. By Markov's inequality, it suffices to show that there exists a positive sequence δ̃_n = o(δ_n) such that (37) holds. Let Δ̃_η ∈ R^{n*×n*} be the block-diagonal matrix formed by stacking Δ_{η,i}, i = 1, . . . , n, and expand log Λ*_n(θ, η) accordingly. The left-hand side of (37) is thus bounded by the sum of the terms (38)-(40). First, observe that (38) is bounded above by a constant multiple of the supremum over η ∈ H̃_n in (41). Using (i) of Lemma 10 and the inequality in (42), provided that the rightmost side is sufficiently small, and because max_i ‖Δ_{η,i} − Δ_{η_0,i}‖²_F ≤ e_n + a_nd²_{B,n}(η, η_0) ≲ e_n + a_nε̄_n² on H̃_n, (42) holds. This implies that for all sufficiently large n, the right-hand side of (41) is bounded above by a constant multiple of sup_{η∈H̃_n}(e_n + a_nd²_{B,n}(η, η_0)). By the triangle inequality, (43) is bounded by a constant multiple of two terms; using the same approach as in (42), the second term is further bounded above by a constant multiple of the same quantity. Therefore, by (C8) and (C10), (43) is bounded by a term of order δ̃_n for some δ̃_n → 0, which is not more than the right-hand side of (37) if s_θ + s_0 > 0.
Note also that (40) is bounded by a similar expression. We have φ_1(s_θ + s_0) ≥ φ_1(K_1s* + s_0) ≳ 1 by condition (C7). By Lemma 4 below, one can see that the relevant expectation is suitably bounded for some C_3 > 0, and the term in the braces goes to zero by (C10). Combining the bounds, we easily see that there exists δ̃_n → 0 satisfying (37). The assertion holds by choosing δ_n accordingly.

Lemma 4. Consider a neighborhood H*_n of η_0 for a_n satisfying (C1). Then, for any orthogonal projection P and a sufficiently large C > 0, the expected supremum in the statement is suitably bounded under (C1), where Δ̃_η ∈ R^{n*×n*} is the block-diagonal matrix formed by stacking the matrices Δ_{η,i}, i = 1, . . . , n. Proof. Let W_{η,j} = X̃^T_{·j}P(I − Δ̃_η)U for X̃_{·j} ∈ R^{n*} the jth column of X̃. Then, by Lemma 2.2.2 of van der Vaart and Wellner [29] applied with ψ(x) = e^{x²} − 1, the expectation in the lemma is controlled through the Orlicz norm ‖·‖_ψ. For any η_1, η_2 ∈ H*_n, define the standard deviation pseudo-metric d_{σ,j}(η_1, η_2) between W_{η_1,j} and W_{η_2,j}. Using the tail bound for normal distributions and Lemma 2.2.1 of van der Vaart and Wellner [29], we see that ‖W_{η_1,j} − W_{η_2,j}‖_ψ ≲ d_{σ,j}(η_1, η_2) for every η_1, η_2 ∈ H*_n. We shall show that H*_n is a separable pseudo-metric space with d_{σ,j} for every j ≤ p. Then, under the true model P_0, {W_{η,j} : η ∈ H*_n} is a separable Gaussian process for d_{σ,j}. Hence, by Corollary 2.2.5 of van der Vaart and Wellner [29], for any fixed η′ ∈ H*_n, the chaining bound (46) holds, where diam_j(H*_n) = sup{d_{σ,j}(η_1, η_2) : η_1, η_2 ∈ H*_n}. It is clear that W_{η′,j} possesses a normal distribution with mean zero and variance ‖(I − Δ̃_{η′})PX̃_{·j}‖²_2. Using Lemma 2.2.1 of van der Vaart and Wellner [29] again, we obtain a bound on ‖W_{η′,j}‖_ψ for every η′ ∈ H*_n, where the last inequality holds by (42) and the definition of H*_n. Next, to further bound the second term in (46), note that for every η_1, η_2 ∈ H*_n, the quantity a_nζ_n² dominates an expression that is further bounded below by using (i) of Lemma 10. In the last display, we see that min_i ρ_min(Δ_{η_2,i}) is bounded away from zero since ‖Δ_{η_0,i}‖_sp ≲ e_n + a_nζ_n² + 1, and hence every eigenvalue of Δ_{η_2,i} is bounded below and above by a multiple of its reciprocal, as a_nζ_n² → 0. This implies that a_nζ_n² is further bounded below by a constant multiple of the squared Frobenius distance.

By the definition of d_{σ,j} and the preceding displays, we thus obtain the desired comparison for every η_1, η_2 ∈ H*_n. Hence, using that diam_j(H*_n) ≲ ‖X̃_{·j}‖_2ζ_n√a_n, we can bound the second term in (46) above by a constant multiple of an entropy integral for some C_1, C_2 > 0. This can be further bounded by replacing ‖X̃_{·j}‖_2 in the display by ‖X‖_*. Then, using (45), (46), and (47), and by the substitution δ = ε/(C_2‖X‖_*√a_n) in the last display, we bound (45) above by a constant multiple of ‖X‖_*(√(log p)√(e_n + a_nζ_n²) + √a_nζ_n). To complete the proof, it remains to show that H*_n is a separable pseudo-metric space with d_{σ,j} for every j ≤ p. By (48), we see that d_{σ,j}(η_1, η_2) ≲ ‖X‖_*√a_n d_{B,n}(η_1, η_2) for every η_1, η_2 ∈ H*_n. This implies that H*_n is separable with d_{σ,j} since H is separable with d_{B,n}.

Lemma 5. For any orthogonal projection P, we have max_{j≤p}|X̃^T_{·j}PU| ≲ ‖X‖_*√(log p) with P_0-probability tending to one.
Proof. Note first that X̃^T_{·j}PU has a normal distribution with mean zero and variance ‖PX̃_{·j}‖²_2, and hence a union bound follows from the tail probabilities of normal distributions. By choosing t = 2√(log p) and using the inequality ‖PX̃_{·j}‖_2 ≤ ‖X̃_{·j}‖_2 ≤ ρ_0^{−1/2}‖X‖_* for every j ≤ p, we verify the assertion.
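A quick simulation confirms the √(log p) scaling that Lemma 5 exploits; the dimensions and the choice P = I are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, reps = 100, 2000, 200
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)             # normalize columns so ||X_j||_2 = 1
t = 2 * np.sqrt(np.log(p))                 # the threshold used in the proof

exceed = 0
for _ in range(reps):
    U = rng.standard_normal(n)             # U ~ N(0, I_n)
    exceed += np.max(np.abs(X.T @ U)) > t  # each X_j^T U is standard normal
print("fraction exceeding 2*sqrt(log p):", exceed / reps)
# A union bound gives exceedance probability at most 2p * exp(-2 log p) = 2/p,
# so the printed fraction should be essentially zero.
```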
Proof. Let Θ*_n = {θ ∈ Θ : s_θ = s_0, ‖X(θ − θ_0)‖²_2 ≤ 1}. Restricting the integral to this set, the left-hand side of the inequality in (49) is bounded below by (50). The exponent is equal to the infimum over η ∈ H̃_n of the corresponding quadratic form, since ‖Δ̃_η‖_sp ≲ 1 on H̃_n. We first consider the case s_0 > 0. Observe that the supremum over η ∈ H̃_n splits into two terms, where the first term is bounded by a constant multiple of ‖X‖_*√(log p) with P_0-probability tending to one, due to Lemma 5. By Lemma 4 applied with P = I together with (C10), the expected value of the second term is bounded by δ_n‖X‖_*√(log p) for some δ_n → 0. Hence, for any M_n → ∞, the two bounds can be combined. Consequently, taking a sufficiently slowly increasing M_n, (51) is bounded below by a constant multiple of the corresponding expression with P_0-probability tending to one. Note that ‖X‖_*‖θ − θ_0‖_1 ≤ √(s_θ + s_0)‖X(θ − θ_0)‖_2/φ_1(s_θ + s_0) and φ_1(s_θ + s_0) = φ_1(2s_0) ≳ 1 on Θ*_n by (C7), if s_0 log p ≲ nε̄_n². The last display is thus bounded below by −C_1s_0 log p for some C_1 > 0, uniformly over θ ∈ Θ*_n. Consequently, with P_0-probability tending to one, (50) is bounded below for some C_2 > 0, where the inequality holds by (23) and (24) since λ‖θ_0‖_1 ≲ s_0 log p by (C3). Since −log π_p(s_0) ≲ s_0 log p if s_0 > 0, the display is further bounded below as in the assertion.
If s_0 = 0, (51) is equal to zero on Θ*_n, as it is then the singleton {θ : θ = 0}. This means that (50) is bounded below by π_p(0), which is also bounded away from zero. This leads to the desired assertion.
Proof of Theorem 4. The idea of our proof is similar in part to that of Theorem 3.5 in Chae et al. [10]. We only need to verify the first and fourth assertions; the second and third assertions then follow from the definitions of φ_1 and φ_2. Note also that we only need to consider the case s_0 log p ≲ nε̄_n², as the assertions follow from Theorems 1 and 3 otherwise. Let B_n = {θ ∈ Θ : s_θ > K_4s_0} ∪ {θ ∈ Θ : ‖X(θ − θ_0)‖²_2 > K_5s_0 log p}. Also define H̄_n as H̃_n but using a constant M̄_2 ≤ M̃_2 such that H̄_n ⊂ H̃_n. Then, by Theorem 3, we have the bound (52). At the end of this proof, we will verify that (53) holds with P_0-probability tending to one, which controls sup_{θ∈Θ̃_n∩B_n}∫_{H̄_n}p_{θ_0,η̃_n(θ,η)}(Y^{(n)})dΠ(η). Assuming that this is true for now and letting Ω* be the event satisfying (53), we see that (52) is bounded by the corresponding expression on Ω*. To show that this tends to zero, for δ_n in Lemma 3, define the sets B_{2,n} and B_{3,n} accordingly. Since E_0Λ̃_n(θ) = 1 by the moment generating function of normal distributions, we obtain the first bound. If s_0 = 0, the rightmost side goes to zero for any K_4 > 0. If s_0 > 0, it still goes to zero for K_4 that is much larger than K_0. Note also that by conditions (C4), (C7) and (C8), we have, for some C_1, C_2 > 0 and any θ, the corresponding bound on the event Ω. Hence, by (36) and (54), the same bound holds for every θ ∈ B_{2,n} on the event Ω. Therefore, the associated term tends to zero if K_4 is sufficiently large. If s_0 = 0, B_{3,n} is the empty set as it implies θ = θ_0 = 0; hence it suffices to consider the case s_0 > 0 below. By (36) and (54) again, there exists a constant C_3 > 0 such that the corresponding bound holds for every θ ∈ B_{3,n} on the event Ω, where the last inequality holds by choosing K_5 much larger than K_4. Therefore, the remaining term tends to zero for K_5 that is much larger than K_0, if s_0 > 0. It only remains to show (53). Since the map η ↦ η̃_n(θ, η) is bijective for every fixed θ, for the set defined by η̃_n(θ, H̄_n) = {η̃_n(θ, η) : η ∈ H̄_n} with given θ ∈ Θ̃_n, the substitution in the integral yields the required identity. Writing Δ*_0 for the block-diagonal matrix formed by stacking Δ^{1/2}_{η_0,i}, i = 1, . . . , n, it can be seen that M̃_2 can be chosen sufficiently larger than M̄_2 such that η̃_n(θ, H̄_n) ⊂ H̃_n for every θ ∈ Θ̃_n, as we have dΠ̃(η) = dΠ̃_{n,θ_0}(η) by (C9). This verifies (53), and thus the proof is complete.

A.4. Proof of Theorems 5-6
To prove the shape approximation in Theorem 5 and the selection results in Theorem 6, we first obtain two lemmas. The first shows that the remainder of the approximation goes to zero in P_0-probability, which is a stronger version of Lemma 3. The second implies that with a point-mass prior for θ at θ_0, we also obtain a rate that is not worse than that in Theorem 3.

Lemma 7.
Suppose that (C1), (C4), (C8*), and (C10*) are satisfied for some orthogonal projection H. Then, for Λ*_n(θ, η) = (p_{θ,η}/p_{θ_0,η̃_n(θ,η)})(Y^{(n)}) and Λ̃_n(θ) in (14) with the corresponding H, the remainder of the approximation is uniformly negligible. Proof. Similar to the proof of Lemma 3, it suffices to show the three assertions (56)-(58). First, note that the left side of (56) is bounded above by a constant multiple of a double supremum over θ ∈ Θ̃_n and η ∈ H̃_n, where the inequality holds by (42) and the fact that max_i ‖Δ_{η,i} − Δ_{η_0,i}‖²_F ≤ e_n + a_nd²_{B,n}(η, η_0) ≲ e_n + a_n(s* log p)/n = o(1) on H̃_n. We see that (59) is bounded above by a constant multiple of sup_{θ∈Θ̃_n}‖X‖²_*‖θ − θ_0‖²_1 sup_{η∈H̃_n}(e_n + a_nd²_{B,n}(η, η_0)) ≲ s*²(log p)(e_n + a_n(s* log p)/n), which goes to zero by (C10*). Next, similar to (43), the left side of (57) is bounded by an analogous expression. Using the same approach as in (42), this is further bounded above by a constant multiple of √(s* log p) sup_{η∈H̃_n}‖(I − H)(ξ_η − ξ_{η_0})‖_2 + s*²(log p)(e_n + a_n(s* log p)/n), which goes to zero by (C8*) and (C10*). Now, using Lemma 4, note that (58) is bounded above by a constant multiple of √(s*(log p)(e_n + a_n(s* log p)/n)) for some C_1 > 0. This tends to zero by (C10*).
Proof. Since the prior for θ is the point mass at θ_0, we can reduce the problem to a low-dimensional model for the observations Y_i, i = 1, . . . , n. Then the lemma can be easily verified using the main results on posterior contraction in Section 3. The denominator of the posterior distribution with the Dirac prior at θ_0 is bounded as in Lemma 1, which can be shown using (20) for the prior concentration condition (C2) and the expressions for the Kullback-Leibler divergence K(p_{0,i}, p_{θ_0,η,i}) and variation V(p_{0,i}, p_{θ_0,η,i}) with the true value θ_0. For a local test relative to the average Rényi divergence, Lemma 2 applied with F_{1,n}, modified so that it involves only a given η_1 such that R_n(p_0, p_{θ_0,η_1}) ≥ ε̄_n², implies that a small piece of the alternative is tested with exponentially small errors. Hence, by (C5*), we obtain the contraction rate ε̄_n² relative to R_n(p_0, p_{θ_0,η}) for Π_{θ_0}(· | Y^{(n)}), as in the proof of Theorem 2. The lemma is then obtained by recovering the contraction rate of η with respect to d_n using the approach in the proof of Theorem 3.
Proof of Theorem 5. Our proof is based on the proof of Theorem 6 in Castillo et al. [8], but is more involved due to η. We use the fact that for any probability measure Q and its renormalized restriction Q_A(·) = Q(· ∩ A)/Q(A) to a set A, we have ‖Q − Q_A‖_TV ≤ 2Q(A^c). First, using a sufficiently large constant M̄_2 that is smaller than M̃_2, define H̄_n as H̃_n in (12) such that H̄_n ⊂ H̃_n. Let Π̃((θ, η) ∈ ·) be the prior distribution restricted and renormalized on Θ̃_n × H̄_n and Π̃((θ, η) ∈ · | Y^{(n)}) be the corresponding posterior distribution. Also, Π̃^∞(θ ∈ · | Y^{(n)}) is the restricted and renormalized version of Π^∞(θ ∈ · | Y^{(n)}) on the set Θ̃_n. Then the left-hand side of the theorem is bounded above by (60), where the first summand goes to zero in P_0-probability since Π((θ, η) ∈ Θ̃_n × H̄_n | Y^{(n)}) → 1 in P_0-probability by Theorem 1 and Theorem 3.
To show that the second summand goes to zero in P_0-probability, note that for every measurable B ⊂ R^p, the posterior can be expressed with the dominating measure dV(θ) = ∑_{S:|S|≤K_1s*} π_p(|S|)\binom{p}{|S|}^{−1}(λ/2)^{|S|} d{L(θ_S) ⊗ δ_0(θ_{S^c})}. In the last line, the factor e^{−λ‖θ_0‖_1}∫_{H̄_n}p_{θ_0,η}(Y^{(n)})dΠ(η) cancels out in the normalizing constant, but is inserted for the sake of comparison. For any sequences of measures {μ_S} and {ν_S}, if ν_S is absolutely continuous with respect to μ_S with the Radon-Nikodym derivative dν_S/dμ_S, then the total variation distance between the corresponding mixtures is easily bounded in terms of these derivatives. Hence, for C_n = ∫_{H̄_n}p_{θ_0,η}(Y^{(n)})dΠ(η), we see that the second summand of (60) is bounded by the corresponding supremum. Using the fact that |λ(‖θ‖_1 − ‖θ_0‖_1)| ≤ λ‖θ − θ_0‖_1 ≲ λs*√(log p)/‖X‖_* → 0 on Θ̃_n and that sup{|1 − Λ*_n(θ, η)/Λ̃_n(θ)| : θ ∈ Θ̃_n, η ∈ H̄_n} goes to zero in P_0-probability by Lemma 7, the last display is further bounded by a vanishing term, and hence the claim follows by the definition of φ̃_2. Now, we shall show that (68) holds for any fixed b > 2. Note that ‖(H_S − H_{S_0})U‖²_2 has a chi-squared distribution with s − s_0 degrees of freedom. Therefore, by Lemma 5 of Castillo et al. [8], there exists a constant C_2 such that the tail bound holds for every b > 2 and given s ≥ s_0 + 1, where N_s = \binom{p−s_0}{s−s_0} is the cardinality of the set {S ∈ S_n : |S| = s}. Since N_s ≤ (p − s_0)^{s−s_0} ≤ p^{s−s_0}, for T_n the event in (68), a union bound applies. This goes to zero as p → ∞, since s ≤ K_1s* and s*/p = o(1). To complete the proof, it remains to show that Π^∞(θ : S_θ ∈ S_n | Y^{(n)}) goes to zero on the set T_n. Combining (67) and (68), we see that Π^∞(θ : S_θ ∈ S_n | Y^{(n)})1_{T_n} is suitably bounded, which holds by the inequality π_p(s)/π_p(s_0) ≤ (A_2p^{−A_4})^{s−s_0} and standard bounds on ratios of binomial coefficients. Hence, the preceding display goes to zero provided that a − A_4 + b/2 < 0, since s* = o(n). This condition can be translated to a < A_4 − 1 by choosing b arbitrarily close to 2.
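The chi-squared distribution of ‖(H_S − H_{S_0})U‖²_2 used above is easy to check by simulation for nested supports S ⊃ S_0; in the sketch below, the random design, the dimensions, and the helper proj are illustrative assumptions only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, s0, s = 50, 2, 5
X = rng.standard_normal((n, s))   # design restricted to S; S0 = the first s0 columns

def proj(A):
    """Orthogonal projection onto the column space of A (hypothetical helper)."""
    return A @ np.linalg.solve(A.T @ A, A.T)

# For S0 a subset of S, H_S - H_{S0} is itself an orthogonal projection of
# rank s - s0, so the squared norm of its image of U ~ N(0, I_n) is chi-squared.
D = proj(X) - proj(X[:, :s0])
draws = np.array([np.sum((D @ rng.standard_normal(n))**2) for _ in range(100_000)])
print("sample mean:", draws.mean(), "vs degrees of freedom:", s - s0)
print("KS distance to chi2(s - s0):",
      stats.kstest(draws, stats.chi2(df=s - s0).cdf).statistic)  # should be tiny
```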
• Verification of (C1): Let σ̃_{jk} be the (j, k)th element of Σ − Σ_0, and bound the relevant Frobenius distances accordingly. Hence, we see that c_n plays the same role as a_n. We also have e_n = 0, as the true Σ_0 belongs to the support of the prior.
• Verification of (C2): A direct computation bounds the Kullback-Leibler quantities in terms of ‖Σ_1 − Σ_2‖_F for every Σ_1, Σ_2 ∈ H. Hence we obtain the required prior concentration for every ε̄_n > n^{−1/2}, which leads us to choose ε̄_n = √((log n)/n) for (C2) to be satisfied.
• Verification of (C3): The assumption ‖θ_0‖_∞ ≲ λ^{−1} log p given in the theorem directly satisfies (C3).
• Verification of (C4): We have two-sided bounds on the eigenvalues, so (C4) holds.
• Verification of (C5*): The minimum eigenvalue condition (6) is satisfied with log γ_n ≍ log n. Also, the entropy relative to d_n is bounded as in (71); see, for example, Lemma 9.16 of Ghosal and van der Vaart [17]. The entropy condition in (7) is therefore satisfied, and the sieve condition (8) is met provided that M is chosen sufficiently large. The condition on a_n is verified similarly.
• Verification of (C6): The separability condition is trivially satisfied in this example, as there is no nuisance mean part.
Therefore, the contraction properties in Theorem 3 are obtained with s* = s_0 ∨ (log n/log p), and s* can be replaced by s_0 since s_0 > 0 and log n ≲ log p. The contraction rate for Σ with respect to the Frobenius norm follows from (69). The optimal posterior contraction directly follows from Corollary 1. Assertions (a) and (b) are thus proved. Next, we verify conditions (C8*)-(C10*) and (C11) to apply Theorems 5-6 and Corollaries 2-3.
• Verification of (C8*)-(C9*): These conditions are trivially satisfied with the zero matrix H, as there is no nuisance mean part.
• Verification of (C10*): Since the entropy in (C10*) is bounded above by a constant multiple of the corresponding Frobenius-norm entropy by (69) and (70), the term in (C10*) is bounded by a multiple of (s* ∨ √(log c_n))√(c_n(s* log p)³/n) by Remark 6. This term tends to zero, as s* can be replaced by s_0.
• Verification of (C11): Note that d_{B,n}(Σ_1, Σ_2) ≤ ‖Σ_1 − Σ_2‖_F for every Σ_1, Σ_2 by (70), and hence it suffices to show that H is a separable metric space under the Frobenius norm. Since the support of the prior for Σ is Euclidean, separability with the Frobenius norm is trivial.
Hence, under (C7*), Theorem 5 can be applied to obtain the distributional approximation in (15) with the zero matrix H. Under (C7*) and (C12), Theorem 6 implies the no-superset result in (16). If the beta-min condition (C13) is also met, the strong results in Corollary 2 and Corollary 3 hold. These establish (c)-(e).

B.2. Proof of Theorem 8
We first verify the conditions of Theorem 3 to prove (a) and (b).
• Verification of (C8*): For H defined above, it is easy to see that the first condition of (C8*) is satisfied. The second condition is directly satisfied by Remark 5.
• Verification of (C9*): Choose the map (α, β, μ, σ², Σ) ↦ (α + n^{−1}1_n^T X(θ − θ_0), β, μ, σ², Σ) for η ↦ η̃_n(θ, η). To check (C9*), we shall verify that this map induces Φ(η̃_n(θ, η)) = (ξ_η + HX(θ − θ_0), Δ_η) as follows. Note that for matrices R_k, k = 1, . . . , 6, we have the properties of the Kronecker product that (R_1 ⊗ R_2)(R_3 ⊗ R_4) = (R_1R_3) ⊗ (R_2R_4) and (R_5 ⊗ R_6)^T = R_5^T ⊗ R_6^T, if the matrices allow such operations. Using these properties, we see that H satisfies the required identity. Therefore, under (C7*), Theorem 5 implies that the distributional approximation in (15) holds. Under (C7*) and (C12), we obtain the no-superset result in (16). The remaining assertions in the theorem are direct consequences of Corollary 2 and Corollary 3 if the beta-min condition (C13) is also satisfied. These prove (c)-(e). We complete the proof by showing that the covariance matrix of the nonzero part can be written as in the theorem. For a given S, we obtain an explicit expression; writing w̃^T for the first row of Δ^{−1}_{η_0} with the top-left element excluded, the last display is equal to the covariance matrix given in the theorem. This completes the proof.

B.3. Proof of Theorem 9
We shall verify the conditions for the posterior contraction in Theorem 3 to prove (a)-(b). First, we give bounds for the eigenvalues of each correlation matrix, as stated in (73)-(75). The first assertion in (73) follows directly from the identity ρ_k(G^{CS}_i(α)) = ρ_k(α1_{m_i}1_{m_i}^T) + 1 − α for every k ≤ m_i. For (74), see Theorem 2.1 and Theorem 3.5 of Fikioris [12]. The assertion in (75) is due to Theorem 2.2 of Kulkarni et al. [21].
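The identity behind (73) can be checked numerically. The sketch below (m and α are illustrative values) confirms that the compound-symmetry matrix α11^T + (1 − α)I has eigenvalues 1 + (m − 1)α and 1 − α:

```python
import numpy as np

m, alpha = 6, 0.3                          # illustrative dimension and correlation
G_cs = alpha * np.ones((m, m)) + (1 - alpha) * np.eye(m)
eig = np.sort(np.linalg.eigvalsh(G_cs))
print(eig)                                 # 1 - alpha with multiplicity m - 1, then 1 + (m-1)*alpha
# rho_k(alpha * 11^T) equals m*alpha once and 0 otherwise, so adding
# (1 - alpha)I shifts every eigenvalue by 1 - alpha, exactly as in the
# identity used for (73).
```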
• Verification of (C1): For the autoregressive correlation matrix, a direct calculation gives the required bound. Using mn ≍ n*, the maximum over 1 ≤ i ≤ n of the relevant quantity is controlled, which gives us a_n ≍ 1 for the autoregressive matrices. Similarly, we can also show that a_n ≍ 1 satisfies (C1) for the compound-symmetric and the moving-average correlation matrices. Also, we have e_n = 0 for (C1), as the true parameter values α_0 and σ²_0 are in the support of the prior.
• Verification of (C2): Since the nuisance parameters are of fixed dimension, condition (C2) is satisfied with ε̄_n = √((log n)/n) due to the restricted range of the true parameters, σ²_0 ≍ 1 and α_0 ∈ [b_1 + ε, b_2 − ε] for some fixed ε > 0.
• Verification of (C3): The assumption ‖θ_0‖_∞ ≲ λ^{−1} log p given in the theorem directly satisfies (C3).
• Verification of (C4): Using (73)-(75), we see that for the compound-symmetric correlation matrix, condition (C4) is satisfied on the bounded range of the true parameters provided that m is bounded. For the other correlation matrices, condition (C4) is satisfied even with increasing m.
• Verification of (C5*): For a sufficiently large M > 0 and s* = s_0 ∨ (log n/log p), choose a sieve H_n of pairs (σ², α) with 0 < σ² ≤ e^{Ms* log p} and 0 < α < 1. Then, using (73)-(75), it is easy to see that the minimum eigenvalue of each correlation matrix is bounded below by a polynomial in n, which implies that condition (6) is satisfied with log γ_n ≍ log n. For the entropy calculation, note that for every type of correlation matrix, the d_n-distance can be bounded by the parameter differences. From the entrywise identity in powers of α, we obtain ‖G_i(α_1) − G_i(α_2)‖²_F ≲ m⁴|α_1 − α_2|² for every correlation matrix. Then the last display is bounded by a multiple of m²(σ²_1 − σ²_2)² + e^{2Ms* log p}m⁴(α_1 − α_2)² for every η_1, η_2 ∈ H_n. The entropy in (7) is thus bounded by log N(δ_n, {σ² : 0 < σ² ≤ e^{Ms* log p}}, |·|) + log N(δ_n, {α : 0 < α < 1}, |·|) for δ_n = (6m³n^{3/2+C_1}e^{Ms* log p})^{−1} with some constant C_1 > 0. It can be easily checked that each term in the last display is bounded by a multiple of s* log p, by which the entropy condition in (7) is satisfied with ε_n = √((s* log p)/n). Using the tail bounds of inverse gamma distributions and the properties of the density Π(dα) near the boundaries, condition (8) is satisfied as long as M is chosen sufficiently large.
• Verification of (C6): The separation condition is trivially satisfied as there is no nuisance mean part.
• Verification of (C8*)-(C9*): These conditions are trivially satisfied with the zero matrix H since there is no nuisance mean part.

B.4. Proof of Theorem 10
We verify the conditions for the posterior contraction in Theorem 3 to show (a)-(b).
• Verification of (C1): Using the assumption max_i ‖Z_i‖_sp ≲ 1, the required bound follows, where the last inequality holds since min_i{ς_min(Z_i) : m_i ≥ q} ≳ 1 and ∑_{i=1}^n 1(m_i ≥ q) ≳ n. Thus we have a_n ≍ 1 and e_n = 0.
• Verification of (C2): The condition is satisfied with ε̄_n = √((log n)/n), as Ψ is of fixed dimension and we have 1 ≲ ρ_min(Ψ_0) ≤ ρ_max(Ψ_0) ≲ 1.
• Verification of (C3): The assumption ‖θ_0‖_∞ ≲ λ^{−1} log p given in the theorem directly satisfies (C3).
• Verification of (C4): By Weyl's inequality, we obtain the bounds (78) and (79). Since Z_iΨ_0Z_i^T is nonnegative definite, the right-hand side of (78) is further bounded below by σ², while the right-hand side of (79) is bounded (see the numeric sketch after this list). Condition (C4) is thus satisfied.
• Verification of (C5*): For a sufficiently large M and s* = s_0 ∨ (log n/log p), define a sieve as H_n = {Ψ : n^{−M} ≤ ρ_min(Σ) ≤ ρ_max(Σ) ≤ e^{Ms* log p}}, so that the minimum eigenvalue condition (6) is satisfied with log γ_n ≍ log n. Similar to the proof of Theorem 7, it can be easily shown that conditions (7) and (8) are satisfied with ε_n = √((s* log p)/n).
• Verification of (C6): The separation condition is trivially satisfied, as there is no nuisance mean part.
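The two-sided eigenvalue bounds behind (C4) can be probed directly. The sketch below (with illustrative sizes and an arbitrary positive definite Ψ_0; not the paper's data-generating model) verifies that ρ_min(Z_iΨ_0Z_i^T + σ²I) ≥ σ² and that ρ_max stays bounded when ‖Z_i‖_sp ≤ 1:

```python
import numpy as np

rng = np.random.default_rng(4)
m_i, q, sigma2 = 8, 3, 0.5                 # illustrative dimensions and noise level
Z = rng.standard_normal((m_i, q))
Z /= np.linalg.norm(Z, 2)                  # enforce spectral norm ||Z_i||_sp = 1
A = rng.standard_normal((q, q))
Psi0 = A @ A.T + np.eye(q)                 # an arbitrary positive definite Psi_0

Delta = Z @ Psi0 @ Z.T + sigma2 * np.eye(m_i)
eig = np.linalg.eigvalsh(Delta)
# Z Psi_0 Z^T is nonnegative definite, so Weyl's inequality gives
# rho_min(Delta) >= sigma2; the upper bound follows from ||Z||_sp and rho_max(Psi_0).
print(eig.min() >= sigma2 - 1e-10,
      eig.max() <= sigma2 + np.linalg.eigvalsh(Psi0).max())
```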
Therefore, the posterior contraction rates for θ are given by Theorem 3 with s* replaced by s_0, since s_0 > 0 and log n ≲ log p. The contraction rate for Σ relative to the Frobenius norm is a direct consequence of (77). The optimal posterior contraction easily follows from Corollary 1. Thus assertions (a)-(b) hold. Now, we verify conditions (C8*)-(C10*) and (C11) to apply Theorems 5-6 and Corollaries 2-3.
• Verification of (C8*)-(C9*): These conditions are trivially satisfied with the zero matrix H, since there is no nuisance mean part.
• Verification of (C10*): For some C_1 > 0, the entropy in (C10*) is bounded above by a multiple of log N(δ, {Σ : ‖Σ − Σ_0‖_F ≤ M̃_2C_1ε_n}, ‖·‖_F) ≲ 0 ∨ log(3M̃_2C_1ε_n/δ) by (77). The expression in (C10*) is thus bounded by a constant multiple of √(s*⁵ log³p/n) by Remark 6. This tends to zero since s* ≍ s_0.
• Verification of (C11): It is easy to see that d_{B,n}(η, η_0) ≍ ‖Ψ − Ψ_0‖_F since max_i ‖Z_i‖_sp ≲ 1. The separability of the space is thus trivial.
Hence, under (C7*), Theorem 5 can be applied to obtain the distributional approximation in (15) with the zero matrix H. Under (C7*) and (C12), we obtain the no-superset result in (16) by Theorem 6. The strong results in Corollary 2 and Corollary 3 follow if the beta-min condition (C13) is additionally satisfied. These establish (c)-(e).

B.5. Proof of Theorem 11
We verify the conditions for the posterior contraction in Theorem 3.
• Verification of (C2): Using (i) of Lemma 10 and the relation 1 − x ≍ 1 − x^{−1} as x → 1, observe that ‖Ω^{−1} − Ω^{−1}_0‖_F ≲ ‖Ω − Ω_0‖_F ≲ ε̄_n if the right-hand side is small enough. Thus, there exists a constant C_1 > 0 such that the Kullback-Leibler neighborhood contains the Frobenius-norm ball {Ω : ‖Ω − Ω_0‖_F ≤ C_1ε̄_n}. Furthermore, although the components of Ω are not a priori independent, as the prior is truncated to M⁺_0(L), the truncation can only increase the prior concentration since Ω_0 ∈ M⁺_0(cL) for some 0 < c < 1. Hence, the prior mass of this ball is suitably bounded below for some C_2 > 0, which justifies the choice ε̄_n ≍ √((m + d)(log n)/n) for (C2).
• Verification of (C3): The assumption ‖θ_0‖_∞ ≲ λ^{−1} log p given in the theorem directly satisfies (C3).
• Verification of (C4): This is trivially met, as Ω_0 ∈ M⁺_0(cL) for some 0 < c < 1.
• Verification of (C5*): Note that the minimum eigenvalue condition (6) is trivially satisfied with γ_n = 1, since the prior is put on M⁺_0(L). Now, for r̄_n = Ms* log p/log n with s* = s_0 ∨ (nε̄_n²/log p) and sufficiently large M, choose a sieve as H_n = {Ω ∈ M⁺_0(L) : ∑_{j,k}1{ω_{jk} ≠ 0} ≤ r̄_n}, that is, the maximum number of edges of Ω does not exceed r̄_n. Then, for δ_n = 1/(6mn^{3/2}), the entropy in (7) can be bounded, where in the second term, the factor (mL/δ_n)^m comes from the diagonal elements of Ω, while the rest is from the off-diagonal entries. It is easy to see that the last display is bounded by a multiple of s* log p with the chosen r̄_n, and hence the entropy condition in (7) is satisfied. Lastly, note that for some C_3 > 0, log Π(H \ H_n) = log Π(|Υ| > r̄_n) ≲ −r̄_n log r̄_n ≤ −C_3Ms* log p.
Therefore, condition (8) is satisfied with sufficiently large M . • Verification of (C6): The separation condition is trivially met as there is no nuisance mean part.
Therefore, we obtain the posterior contraction properties for θ by Theorem 3.
The theorem also implies that the posterior distribution of Ω^{−1} contracts to Ω^{−1}_0 at the rate ε_n = √((s_0 log p ∨ (m + d) log n)/n) with respect to the Frobenius norm. This also translates into convergence of Ω to Ω_0 at the same rate, since we obtain ‖Ω − Ω_0‖_F ≲ ‖Ω^{−1} − Ω^{−1}_0‖_F by (i) of Lemma 10 and the inequality 1 − x ≍ 1 − x^{−1} as x → 1. The assertion for the optimal posterior contraction is directly justified by Corollary 1. These prove (a)-(b). Next, we verify conditions (C8)-(C11) to obtain the optimal posterior contraction by applying Theorem 4.
A direct computation verifies the first assertion because E_{p_1}Z^TAZ = tr(A). After some algebra, we also obtain an expression whose rightmost side involves the forms E_{p_1}(ZZ^TQ_1Z) and E_{p_1}(Z^TQ_1ZZ^TQ_2Z) for two positive definite matrices Q_1 and Q_2. It is easy to see that the former is zero, while it can be shown that the latter equals 2tr(Q_1Q_2) + tr(Q_1)tr(Q_2); see, for example, Lemma 6.2 of Magnus [22]. Plugging these in for the expected values of the products of quadratic forms, it is easy (but tedious) to verify the second assertion.
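The two moment identities invoked here are easy to confirm by Monte Carlo for Z ~ N(0, I); the dimensions and the helper spd below are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(5)
r = 4
def spd():
    """An arbitrary symmetric positive definite matrix (hypothetical helper)."""
    A = rng.standard_normal((r, r))
    return A @ A.T + np.eye(r)

Q1, Q2 = spd(), spd()
Z = rng.standard_normal((1_000_000, r))
q1 = np.einsum('ij,jk,ik->i', Z, Q1, Z)    # Z^T Q1 Z for each draw
q2 = np.einsum('ij,jk,ik->i', Z, Q2, Z)

print(q1.mean(), "vs tr(Q1) =", np.trace(Q1))            # E Z^T A Z = tr(A)
print((q1 * q2).mean(), "vs",
      2 * np.trace(Q1 @ Q2) + np.trace(Q1) * np.trace(Q2))  # Lemma 6.2 of Magnus [22]
# E(Z Z^T Q1 Z) = 0 holds by symmetry: the integrand is an odd function of Z,
# matching the claim that the first of the two forms vanishes.
```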
Conversely, using the sub-multiplicative property of the Frobenius norm, ‖BC‖_F ≤ ‖B‖_sp‖C‖_F, a reverse comparison can be obtained. Hence, g²(Σ_1, Σ_2) < δ for a sufficiently small δ > 0 implies that the product over k = 1, . . . , r is close to one. Since every term in the product of the last display is greater than or equal to 1, each factor must itself be close to one. Each term, as a function of d_k, attains its global minimum 1 at d_k = 1, and hence δ can be chosen sufficiently small to make |d_k − 1| small for every k = 1, . . . , r, which establishes (ii).