Bayesian estimation of sparse precision matrices in the presence of Gaussian measurement error

Estimation of sparse, high-dimensional precision matrices is an important and challenging problem. Existing methods all assume that observations can be made precisely but, in practice, this often is not the case; for example, the instruments used to measure the response may have limited precision. The present paper incorporates measurement error in the context of estimating a sparse, high-dimensional precision matrix. In particular, for a Gaussian graphical model with data corrupted by Gaussian measurement error with unknown variance, we establish a general result which gives sufficient conditions under which the posterior contraction rates that hold in the no-measurement-error case carry over to the measurement-error case. Interestingly, this result does not require that the measurement error variance be small. We apply our general result to several cases with well-known prior distributions for sparse precision matrices and also to a case with a newly-constructed prior for precision matrices with a sparse factor-loading form. Two different simulation studies highlight the empirical benefits of accounting for the measurement error as opposed to ignoring it, even when that measurement error is relatively small.


Introduction
The precision matrix, namely, the inverse of the covariance matrix of a Gaussian random vector, is a key object in multivariate analysis because of its role in describing conditional distributions. Let there be observations X_1, ..., X_n, independent and identically distributed (i.i.d.) from a p-dimensional, mean-zero Gaussian distribution, N_p(0, Σ), where Σ denotes a p × p positive definite covariance matrix, with corresponding precision matrix Ω = Σ^{-1}. The goal is to make inference on the unknown Ω, especially in the high-dimensional situation where p may be comparable to, or larger than, n. Earlier Bayesian work has demonstrated optimal posterior contraction rates and strong performance in terms of computational speed and accuracy. Du and Ghosal [12] considered high-dimensional discriminant analysis, where they implemented both a mixture prior and a horseshoe shrinkage prior on the off-diagonal entries in a sparse modified Cholesky decomposition.
Beyond the challenges of high-dimensionality and complex dependence structures, it may happen that the data are also corrupted in some way. A classical example is one in which measurements on the sample units can only be taken with a low-precision device. In such a case, the natural sample variation is compounded by independent measurement errors. A more recent example, commonly found in medical applications, is where the data are corrupted intentionally to maintain privacy. In any case, the addition of a measurement error on top of the natural sampling variability creates new challenges. While there is an extensive body of literature on the subject of measurement error in statistics [11,8,17,19], very little work has been done in the context of structured precision matrix estimation in the presence of Gaussian measurement errors. Byrd, Nghiem and McGee [5] assumed the variance of the measurement error to be known, treated the unobservable outcomes as missing data and recently proposed a method to impute them and estimate the precision matrix iteratively. They combined the imputation-regularized optimization algorithm [26] and Bayesian regularization for graphical models with unequal shrinkage [20] to formulate a new procedure and prove its consistency. Their results also revealed the necessity of adjusting for measurement error when present. Here we propose a fully Bayesian framework for handling measurement error and give general sufficient conditions for establishing the posterior contraction rate.
Our main goal in this paper is to understand the effect of measurement error on Bayesian methods for estimating a high-dimensional structured precision matrix of a multi-dimensional Gaussian random vector. We focus here on a Gaussian measurement error model, Y_{ij} = X_i + Z_{ij}, with X_i i.i.d. N_p(0, Ω^{-1}) and Z_{ij} i.i.d. N_p(0, νI_p), for i = 1, ..., n and j = 1, ..., m, where the X and Z samples are mutually independent, I_p is the identity matrix of order p and m is the number of replicates for each X. Since the X samples carry information about Ω and the Z samples do not, the observable Y's are "corrupted" by the convolution of informative and non-informative inputs. If ν is unknown, then m ≥ 2 replicates for each outcome X_i are required to guarantee identifiability of ν and Ω. On the other hand, for the special case where ν is known, the convergence results hold also for m = 1. Let Y_i = (Y_{i1}^T, ..., Y_{im}^T)^T. Then the marginal distribution of the Y_i's is available in closed form, namely N_{mp}(0, Σ_ν), where Σ_ν is an mp × mp block matrix, with Ω^{-1} + νI_p as the diagonal blocks and Ω^{-1} as the off-diagonal blocks. Write Ω_ν = Σ_ν^{-1}, which is an mp × mp block matrix with ν^{-1}{I_p − (νΩ + mI_p)^{-1}} as the diagonal blocks and −ν^{-1}(νΩ + mI_p)^{-1} as the off-diagonal blocks.
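The stated block form of Ω_ν can be checked numerically. The following sketch (with illustrative values of p, m, ν and a randomly generated Ω, none of which come from the paper) builds Σ_ν and verifies that the claimed block matrix is indeed its inverse.

```python
import numpy as np

# Numerical check of the block form of Omega_nu = Sigma_nu^{-1}.
# p, m, nu, and the example Omega are illustrative choices only.
rng = np.random.default_rng(0)
p, m, nu = 3, 2, 0.7
A = rng.standard_normal((p, p))
Omega = A @ A.T + p * np.eye(p)        # a positive definite precision matrix
Sigma = np.linalg.inv(Omega)

# Sigma_nu: mp x mp, diagonal blocks Omega^{-1} + nu*I_p, off-diagonal Omega^{-1}
J = np.ones((m, m))
Sigma_nu = np.kron(J, Sigma) + nu * np.eye(m * p)

# Claimed inverse: nu^{-1} I_{mp} minus J_m kron nu^{-1} (nu*Omega + m*I_p)^{-1}
B = np.linalg.inv(nu * Omega + m * np.eye(p))
Omega_nu = np.eye(m * p) / nu - np.kron(J, B / nu)

assert np.allclose(Sigma_nu @ Omega_nu, np.eye(m * p))
```

The check exploits the fact that Σ_ν is a Kronecker-structured perturbation of the identity, so its inverse has the same two-block pattern.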
In contrast to the covariance matrix, on which the effect of the measurement error is simply additive, the inverse (νΩ + mI_p)^{-1} in the above expression reveals how even this simple linear measurement error model leads to a very nonlinear corruption when the goal is estimating the precision matrix. An important quantity in this model is the measurement error variance, ν, which characterizes the magnitude of the measurement errors or the degree of corruption. Intuitively, if the measurement error is ignored and ν is not small, then the estimation of Ω will be negatively affected. Here we develop a general strategy that allows the user to incorporate additive Gaussian measurement error into existing Bayesian procedures for inference on structured, high-dimensional precision matrices in such a way that the posterior concentration rates are preserved and minimal changes to posterior computations are required. To accommodate the measurement error in our theoretical analysis, so that the posterior can effectively undo the troublesome inverse in Ω_ν, it is crucial to have extra control on the prior distribution of the smallest eigenvalue of Ω. To address this, we express Ω as κI_p + Θ, and put independent priors on the scalar κ > 0 and the p × p matrix Θ. The prior on κ then gives us control over the smallest eigenvalue of Ω, while the prior for Θ can be any of those from the literature.
Expressing the matrix of interest as a sum "κI p + Θ" is a strategy that has appeared already in the literature. Indeed, Fan, Fan and Lv [13], Fan, Liao and Mincheva [16] and Pati et al. [30] have used such a model, with Θ having a sparse factor structure [4], for a high-dimensional covariance matrix. To our knowledge, this prior formulation has not been developed for inference on a precision matrix. Our posterior concentration rate result simultaneously covers the measurement error and no-measurement-error cases, and the rate attained parallels that obtained by Pati et al. [30] for the covariance matrix with respect to the Frobenius norm, with some improvements.
The remainder of this paper is organized as follows. In Section 2, we investigate what happens when the measurement error is ignored, i.e., when a misspecified no-measurement-error model is fit to the corrupted data Y_1, ..., Y_n in (1.2). Our general framework for incorporating Gaussian measurement error into existing Bayesian procedures for inference on structured, high-dimensional precision matrices is presented in Section 3, along with a general result on posterior contraction rates. The main conclusion from the result is that the rate in the absence of measurement error remains in force even when a substantial measurement error is present. Examples of the obtained rates with measurement error, based on priors commonly used in the no-measurement-error literature, are discussed in Section 4. A new prior for the estimation of a precision matrix with a sparse factor structure is proposed and the corresponding posterior concentration rate is illustrated in Section 5. The result is new even in the context of a Gaussian graphical model without measurement error. An extensive simulation study investigating the numerical performance of the proposed model under different magnitudes of measurement error is conducted in Section 6, showing that adjusting for the measurement error leads to a lower estimation error in terms of the Frobenius norm. All proofs are presented in the appendix.

Effect of ignoring measurement error
For a situation in which the data analyst is either unaware of the measurement error or simply chooses to ignore it, a natural question is: what can go wrong? We show below that failing to account for the measurement error creates a large bias and, therefore, certain adjustments are necessary to account for the presence of measurement error and to ensure accurate estimation of Ω. For simplicity, we assume that ν is known and m = 1 throughout this section.
To develop some intuition, consider the case where the dimension p is fixed, so that the precision matrix can be estimated directly, at least for large n, without imposing any structural or sparsity assumptions. Let Ω_n = Ω(Y_1, ..., Y_n) denote an asymptotically unbiased estimator of Ω based on the corrupted data Y_1, ..., Y_n, e.g., Ω_n = S_n^{-1}, the inverse of the sample covariance matrix S_n = n^{-1} Σ_{i=1}^n Y_i Y_i^T. By asymptotically unbiased, we mean that ‖E(Ω_n) − Ω‖_F → 0 as n → ∞ when the estimator is applied to uncorrupted data, where Ω denotes the true p × p precision matrix and ‖A‖_F = {tr(A^T A)}^{1/2} denotes the Frobenius norm of a matrix A, with tr(·) the trace operator. Also, let ‖A‖_2 denote the spectral norm, i.e., the square root of the maximum eigenvalue of A^T A.
Moreover, if ν ≤ ‖Ω^{-1}‖_2, then the bound can be simplified to an expression involving λ_min(Ω), where λ_min(·) stands for the minimum eigenvalue.
The proof is given in Appendix A. From the theorem, we can see that the lower bound on the bias vanishes as ν → 0 but increases monotonically to ‖Ω‖_F^2 as ν → ∞. Therefore, even in a relatively low-dimensional setting with fixed p, unless ν is vanishingly small, the mean squared error associated with any asymptotically unbiased estimator of the precision matrix is bounded away from 0 as n → ∞.
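To make the effect concrete, the following sketch (with a toy diagonal Ω of our own choosing, not from the paper) tracks the Frobenius distance between the naive limit (Ω^{-1} + νI_p)^{-1} and Ω as ν grows: the distance is zero at ν = 0, increases with ν, and approaches ‖Ω‖_F.

```python
import numpy as np

# Illustration (not from the paper): the naive inverse sample covariance
# targets (Omega^{-1} + nu I)^{-1}; its Frobenius distance to the true
# Omega grows with nu and approaches ||Omega||_F.
Omega = np.diag([1.0, 2.0, 5.0])            # toy true precision matrix
fro = np.linalg.norm(Omega, "fro")
biases = []
for nu in [0.0, 0.1, 1.0, 10.0, 1000.0]:
    limit = np.linalg.inv(np.linalg.inv(Omega) + nu * np.eye(3))
    biases.append(np.linalg.norm(limit - Omega, "fro"))

assert biases[0] == 0.0                     # no measurement error, no bias
assert all(b1 < b2 for b1, b2 in zip(biases, biases[1:]))  # monotone in nu
assert abs(biases[-1] - fro) < 0.05 * fro   # approaches ||Omega||_F
```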
Moreover, the same proof would apply to a case of increasing dimension if p = O(n) and Ω is known to be diagonal. Since both the fixed-p and known-to-be-diagonal Ω cases are simpler than the general high-dimensional structured precision matrix estimation problem, and the effect of ignoring measurement error is already profound, we conjecture that the estimation bias will be even worse in the more general setup involving a complex structure, unknown ν and increasing p.

Prior and posterior distributions
For technical reasons that will be made clear below, when measurement error is present, we need precise control on the prior distribution of the smallest and the largest eigenvalues of Ω. We introduce a simple device to automatically satisfy this requirement, namely, adding a scalar multiple of the identity matrix to the precision matrix. That is, we express the precision matrix as Ω = Θ + κI_p, where Θ is a positive semi-definite matrix and κ > 0 serves as a lower bound on the smallest eigenvalue of Ω. The strategy is to specify a prior distribution for Ω by assigning independent prior distributions to Θ and κ. That is, the prior Π for Ω is induced from independent priors Π_Θ and Π_κ for Θ and κ, respectively, by the mapping (Θ, κ) → Θ + κI_p. The term κ is introduced only to automatically ensure a lower bound on the eigenvalues of the precision matrix Ω in the theoretical results. This structure of the prior, though, is not convenient for computation. Computational issues will be discussed in Section 6.1. Since ν is typically unknown, we assign it a prior distribution. Details about the specific priors for κ, Θ and ν are presented below.

Prior for κ. As mentioned above, control on the prior distribution of the eigenvalues is crucial, so the tails of the prior for κ need to be carefully chosen. In particular, we require exponential tails in both directions, i.e., there exists a constant C > 0 such that the tail condition (3.2) holds. A common distribution that satisfies this requirement is the inverse Gaussian distribution [10] with density function, in the one-parameter form, given by π_κ(t) ∝ t^{-3/2} e^{-(t-ξ)^2/(2t)}, t > 0, where ξ > 0 serves as both the mean and the variance. A generalized inverse Gaussian density, proportional to t^a e^{-b(t-ξ)^2/t} with any b > 0 and a ∈ R, can also be used.

Prior for Θ. Since the structure in Ω is determined by the structure in Θ, we choose Π_Θ to induce the desired structure in Ω.
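As a quick sanity check on the one-parameter inverse Gaussian prior, one can sample from it and confirm that ξ plays the role of both the mean and the variance. The mapping to SciPy's `invgauss(mu, scale)` parametrization below (mu = 1/ξ, scale = ξ²) is our own translation of the density above, and ξ = 4 is an arbitrary illustrative value.

```python
import numpy as np
from scipy import stats

# pi_kappa(t) ∝ t^{-3/2} exp{-(t - xi)^2 / (2t)} is inverse Gaussian
# IG(mean = xi, shape = xi^2); in SciPy's parametrization this is
# invgauss(mu = 1/xi, scale = xi^2).
xi = 4.0
prior = stats.invgauss(mu=1 / xi, scale=xi ** 2)
samples = prior.rvs(size=200_000, random_state=1)

assert abs(samples.mean() - xi) < 0.1   # mean is xi
assert abs(samples.var() - xi) < 0.3    # variance is also xi
```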
Fortunately, most of the existing priors on a precision matrix can be applied directly to Θ here. For example, if we believe that Ω has a general sparsity structure, then we could take Π_Θ to be a suitable G-Wishart prior [25] or a mixture thereof [3]. Similarly, structures like a sparse Cholesky decomposition or a sparse factor model can be imposed on Ω with a suitable choice of prior on Θ. Details will be given for a number of special cases in Section 4 and Section 5 below. Roughly, our technical requirement is that Π_Θ satisfies the sufficient conditions originally laid out in Ghosal, Ghosh and van der Vaart [21] for posterior contraction at the target rate in the no-measurement-error context. These sufficient conditions have already been verified for various low-dimensional structures and commonly used priors that induce them, so our main focus here can be on the effects of measurement error.
Prior for ν. We require that the support of the prior distribution for ν is bounded by some large positive constant M, and that it has an exponential lower tail with some constant C > 0. We also require suitable prior concentration around the true-but-unknown measurement error variance ν*. A common distribution that satisfies these requirements is the truncated inverse Gaussian distribution or a two-sided truncated distribution. Alternatively, a point mass at the adjusted maximum likelihood estimator ν̂ of ν, given in (3.4), can be used, which corresponds to an empirical Bayes method.
Given a prior for Ω as described above, we update to the posterior distribution via Bayes's theorem. For the measurement error model (1.2), define the likelihood function as in (3.5), where S_n is the sample covariance matrix of Y as in Section 2 and det denotes the determinant operator. Then the corresponding posterior distribution, which depends on the data Y_1, ..., Y_n and the measurement error variance ν, is given in (3.6). A consequence of this indirect formulation is that the posterior distribution cannot be computed in closed form. Therefore, MCMC methods are needed to obtain samples from Π_n. Fortunately, these methods can be developed by modifying existing algorithms available in the no-measurement-error literature; see Section 6.1.

Posterior contraction rates
In this subsection, we characterize the posterior contraction rate with respect to the Frobenius distance in terms of the characteristics of the model, the prior, and the true precision matrix. Even under maximal sparsity, there are p unrestricted diagonal entries, so it is essential that the dimension p is of smaller order than n and, in particular, that log p is of the same order as log n, or smaller. As discussed above, the intuition here is that if the prior Π_Θ for Θ is such that the posterior would achieve the desired contraction rate without measurement error, and if the prior Π_ν for ν and prior Π_κ for κ are reasonable in some sense, then the same posterior contraction rate prevails in the presence of measurement error. The following three conditions make this setup more precise.
Conditions on the prior.
(a) The prior Π_κ has a continuous density on (0, ∞) and satisfies the tail condition (3.2).
(b) Given n and a certain structure in Ω*: (i) there exists a sieve S_n of precision matrices, having the same posited low-dimensional structure as Ω*, whose δ_n-entropy is bounded by a multiple of n ε_n², where δ_n = (2Knp)^{-1} for the K presented below; (ii) the prior Π_Θ for Θ satisfies Π_Θ(S_n^c) ≲ e^{-K n ε_n²} for some sufficiently large K > 0; (iii) for any constant c > 0, there exists another constant C > 0 such that the prior concentration bound holds for any Θ having the same posited structure as Ω*.
(c) The prior Π_ν satisfies the concentration condition (3.9) for some constant C > 0, where ν* is the true measurement error variance.
The conditions related to Π_Θ look complicated but, for the most part, these are the now-classical sufficient conditions from Ghosal, Ghosh and van der Vaart [21] for establishing posterior concentration rate results. Therefore, other authors who have investigated concentration rate properties of posterior distributions under various low-dimensional structures and priors, as in Section 4, will likely have checked these conditions already. One noticeable difference is in the entropy bound in Condition (b)(i), where the radius is proportional to δ_n = (2Knp)^{-1}, which is rather small. However, the dimension is what drives the entropy's magnitude, while the radius only impacts the logarithmic term, so the small δ_n has no significant effect. The conditions for Π_ν are satisfied by a continuous measure or a point mass prior located close enough to ν*. For continuous measures, many distributions can be used, a common example being the inverse Gaussian distribution. For the point mass prior, a typical choice is ν̂ in (3.4), which satisfies the conditions for Π_ν since then |ν̂ − ν*| ≲ (np)^{-1/2} if ε_n ≳ n^{-1/2}. The prior concentration condition (3.9) is guaranteed if Π_ν has continuous support and ν* > 0 is fixed, since the prior density then has a lower bound around ν* whose negative logarithm is of the order log n, which is smaller than n ε_n² up to a constant. However, matters are more delicate if ν* = 0 or if ν* varies in a sequence approaching 0 quickly. Then the lower bound (3.9) may not hold if the prior density decays rapidly at 0, as, for example, the inverse Gaussian density does. This is why the assumption on the prior concentration (3.9) is stated separately.

Theorem 3.1. Assume that Ω* satisfies a specific low-dimensional structure and has eigenvalues bounded away from 0.
Consider a prior distribution for Ω = Θ + κI_p induced from independent prior distributions Π_κ for κ and Π_Θ for Θ, where the prior for Θ is based on the same low-dimensional structure as that posited for Ω*. Assume that there exists a sequence ε_n ≳ n^{-1/2} with ε_n → 0 and n ε_n² ≳ log n, and a constant M, such that Conditions (a)-(c) are satisfied by Π_κ, Π_Θ and Π_ν, respectively. Under the model in (1.2), for any fixed ν* ≥ 0, the posterior distribution Π_n in (3.6) contracts at the rate ε_n, that is, there exists a constant L > 0, depending on ‖Ω*‖_2 and ν*, such that the contraction statement (3.11) holds, where ε̄_n = ε_n p^{-1/2}. If ‖Ω*‖_2 is not bounded, then the conclusion for ν remains the same but the conclusion for Ω holds with the rate ‖Ω*‖_2² ε_n. The proof of the theorem is given in Appendix A. A remarkable consequence of Theorem 3.1 is that the posterior contraction rate is not affected by measurement error, even when the measurement error is not small.

Examples
In this section, we investigate some existing Bayesian methods for structured, high-dimensional precision matrix estimation and show how measurement error can be accommodated in these models. With the prior for ν chosen to satisfy Condition (c), we consider a prior for Θ from the literature and verify the requirements of Theorem 3.1 in each case. The prior for κ is assumed to satisfy Condition (a), e.g., by choosing an inverse Gaussian distribution. Therefore, the discussion below focuses on the prior for Θ and on verifying Conditions (b)(i)-(iii) for Π_Θ. We assume that both the smallest and largest eigenvalues of the true precision matrix Ω* are bounded away from 0 and ∞ in all the examples of this section. Proofs of the rates derived in Theorems 4.1-4.3 are given in Appendix A.

General sparsity
Banerjee and Ghosal [3] proposed a Bayesian method to estimate a precision matrix with a general sparse structure in a Gaussian graphical model. In this first example, we adopt their setting as a prior for Θ and verify that all the required conditions hold. Let Θ_ij denote the entry in the ith row and jth column of Θ and let Γ denote the matrix with (i, j)th entry Γ_ij = 1{Θ_ij ≠ 0}. The cardinality of {(i, j) : i < j, Γ_ij = 1} will be denoted by γ. Consider the prior in which λ is a hyperparameter and q is a pre-specified probability controlling the sparsity; the smaller q is, the sparser Θ is. Another factor controlling the sparsity, R, is either given a prior or taken to be a large enough constant. Since the latter is a trivial case of the former, we state the main posterior contraction rate result in the former setting. The prior on R should satisfy a tail condition, for some a > 0 and a large enough constant M. Distributions with this property include the Poisson and binomial distributions.

Sparse Cholesky decomposition
Assume that the true precision matrix has a sparse Cholesky decomposition, Θ = UDU^T, where U is a lower-triangular matrix and D is a diagonal matrix, and specify a prior on Θ through U and D as in Du and Ghosal [12]. Let U_ij denote the entry in the ith row and jth column of U and D_ii the ith diagonal entry of D. Let Γ denote the matrix formed by the indicators Γ_ij = 1{U_ij ≠ 0}. Following Proposition 1 in Du and Ghosal [12], for i = 1, 2, ..., p, and j = 1, 2, ..., i, consider the prior in which α_1, β_1, σ_0² and σ_1² are hyperparameters and C_p is a constant going to zero polynomially in p^{-1} as p → ∞.
An alternative to the above prior is to consider more general positive real-valued Γ_ij, and induce a prior on U_ij through the hierarchical scheme (4.4), for i = 1, 2, ..., p, and j = 1, 2, ..., i, where Cauchy+(0, 1) is the positive half-Cauchy distribution and σ_1² is a pre-specified global shrinkage parameter. For both setups, let γ denote the number of U_ij's greater than ε_n p^{-1}, where ε_n is introduced below; then γ has a binomial distribution Bin(p(p−1)/2, η), where η = π(|U_ij| > ε_n p^{-1}) for any 0 < j < i ≤ p, with some constants a, b > 2 as assumed. This restriction on η can be achieved either by choosing C_p in (4.3) going to zero polynomially in p^{-1} as p → ∞, or by modifying the prior on Γ_ij in (4.4) to be truncated above by 1/τ, where τ < p^{-b-2} ε_n². Although the above condition on η is required for the theoretical result, in practice we choose C_p = 1 for computational convenience; in our simulation studies, the estimation results are not sensitive to the choice of C_p.

Theorem 4.2. Under the model in (1.2), the posterior distribution Π_n in (3.6) contracts at the rate ε_n around Ω*, where ε_n = n^{-1/2}(p + s*)^{1/2}(log n)^{1/2}, with s* denoting the number of nonzero off-diagonal entries in U*.
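A draw from the half-Cauchy (horseshoe-type) specification can be sketched as follows. This is our own illustration: the unit diagonal for U, the gamma draws for D, and all hyperparameter values are illustrative choices, since the exact hierarchy in (4.4) is not reproduced here.

```python
import numpy as np

# Sketch (our assumption of the hierarchical form): a horseshoe-type draw
# for the lower-triangular entries U_ij, with local scales Gamma_ij from a
# positive half-Cauchy and a global scale sigma2_1.
rng = np.random.default_rng(4)
p, sigma2_1 = 10, 1.0
Gamma = np.abs(rng.standard_cauchy((p, p)))    # positive half-Cauchy scales
U = rng.standard_normal((p, p)) * Gamma * np.sqrt(sigma2_1)
U = np.tril(U)                                 # keep the lower triangle
np.fill_diagonal(U, 1.0)                       # unit diagonal for the Cholesky factor
D = np.diag(rng.gamma(2.0, 1.0, size=p))       # positive diagonal entries
Omega_draw = U @ D @ U.T                       # one prior draw of the precision matrix

assert np.all(np.diag(D) > 0)
assert np.allclose(Omega_draw, Omega_draw.T)   # symmetric by construction
```

By construction U D U^T is positive definite (unit-diagonal triangular U, positive D), so each draw is a valid precision matrix.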

Banded structure using G-Wishart prior
Following Banerjee and Ghosal [2], assume that the sparse precision matrix has a banded structure. They assumed a k-banded structure on the precision matrix and used a G-Wishart prior, and derived the posterior convergence rate under such structure and prior with respect to the spectral norm. In this example, we consider the same structure and prior, but turn our attention to the rate in the Frobenius norm in the presence of measurement error.
Suppose that the true p × p precision matrix Θ* is k-banded, that is, Θ*_ij = 0 for all i, j = 1, ..., p such that |i − j| > k, with a fixed known value of k. A graphical Wishart distribution prior Θ ∼ G-Wish(δ, I_p) is assigned on Θ, where the graph G is induced by the k-banding. It is easy to conclude that the graph is decomposable with cliques C_j = {j, ..., j+k}, j = 1, ..., p−k, and separators S_j = {j, ..., j+k−1}, j = 2, ..., p−k [2]. An important property we use is that, given Θ_{S_2}, ..., Θ_{S_{p−k}}, the matrices Θ_{C_1}, ..., Θ_{C_{p−k}} are conditionally independent and Wishart distributed with δ degrees of freedom; here and elsewhere, for a matrix M and S ⊂ {1, ..., p}, M_S = (M_ij : i, j ∈ S) is the principal submatrix of M formed by the entries indexed by S. Let P_G stand for the cone of positive definite matrices compliant with the graphical structure, that is, matrices M with M_ij = 0 whenever (i, j) is not an edge of the graph.
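The clique structure of the k-banded graph can be verified directly. This small check (with illustrative p and k) confirms that the cliques C_j = {j, ..., j+k} cover exactly the edges allowed by the k-banding.

```python
import numpy as np

# Check (illustrative p and k) that the cliques of the k-banded graph cover
# exactly the edges {(i, l): |i - l| <= k}, as used in the G-Wishart factorization.
p, k = 8, 2
cliques = [set(range(j, j + k + 1)) for j in range(p - k)]   # C_j = {j, ..., j+k}
covered = {(i, l) for C in cliques for i in C for l in C if i < l}
banded = {(i, l) for i in range(p) for l in range(i + 1, p) if l - i <= k}
assert covered == banded
```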

Sparse factor-model structure
The sparse factor-model structure has been used in the literature to develop prior distributions for structured, high-dimensional covariance matrices, e.g., in Pati et al. [30]. However, to our knowledge, such a prior has not been proposed for a structured precision matrix, even when no measurement error is present. So we separate it from the examples in the previous section because of the novel use of the prior for estimating the precision matrix and new results on posterior contraction rate even in a model without measurement error.
Consider the model (1.1), where the possibility ν = 0 (i.e., the no-measurement-error model X_i i.i.d. N_p(0, Ω^{-1})) is not ruled out. Following our discussion in Section 3.1, we assume the precision matrix Ω to be of the form Ω = ΛΛ^T + κI_p, where Λ is a p × k matrix with k ≤ p and at most s non-zero entries in each of its k columns. For a given k, let Γ denote the matrix with (i, j)th entry Γ_ij = 1{Λ_ij ≠ 0}, and let γ ≤ ks denote the total number of non-zero entries of Λ.
We follow Pati et al. [30] and impose the following assumptions on the true precision matrix Ω*, the corresponding factor-loading matrix Λ*, its dimension k*, and the inherent error κ*. We assume that the true precision matrix Ω* also has the factor-model structure Ω* = Λ*Λ*^T + κ*I_p, where Λ* ∈ R^{p×k*} and k* ≪ p. In the high-dimensional setup, we assume two bounding sequences: s*, bounding the column sparsity of the true loading matrix Λ*, and c*, bounding the largest eigenvalue of the true precision matrix Ω*. Furthermore, we let Γ* denote the matrix with entries 1{Λ*_ij ≠ 0} and γ* = Σ_{i,j} Γ*_ij ≤ k* s*.

Assumptions.
Suppose that there exist c*, k* and s* such that the following three properties hold: (A1) the eigenvalues of Ω* are bounded above by c* and below by κ* > 0; (A2) (c*^5 s* k*)^{1/2} log n ≲ n^{1/2}; (A3) each column of Λ* has at most s* non-zero entries.
Assumption (A1) gives control on the upper and lower bounds of the true precision matrix's eigenvalues. Assumption (A2) is introduced to control the final convergence rate appropriately; it can be relaxed to (c* s* k*)^{1/2} log n ≲ n^{1/2} when ν = 0 is known. Assumption (A3) controls the sparsity, which is crucial in high-dimensional problems.
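A toy construction of a precision matrix with the sparse factor-model structure (with illustrative values of p, k, s and κ, not from the paper) shows how κ lower-bounds the spectrum while the loading matrix controls the largest eigenvalue:

```python
import numpy as np

# Toy construction of Omega = Lambda Lambda^T + kappa I with column-sparse
# loadings, and a check of the resulting eigenvalue bounds.
rng = np.random.default_rng(3)
p, k, s, kappa = 30, 3, 5, 0.5
Lambda = np.zeros((p, k))
for j in range(k):                             # at most s non-zeros per column
    rows = rng.choice(p, size=s, replace=False)
    Lambda[rows, j] = rng.standard_normal(s)
Omega = Lambda @ Lambda.T + kappa * np.eye(p)

eigs = np.linalg.eigvalsh(Omega)
assert eigs.min() >= kappa - 1e-10             # kappa lower-bounds the spectrum
assert eigs.max() <= np.linalg.norm(Lambda, 2) ** 2 + kappa + 1e-10
```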
For the Bayesian model formulation, let Λ_ij denote the entry in the ith row and jth column of the factor-loading matrix Λ. Then we consider a spike-and-slab prior similar to that in Pati et al. [30], except that we make certain choices, such as the inverse Gaussian distribution, to meet Condition (b). For i = 1, 2, ..., p, and j = 1, 2, ..., k, the priors are specified with δ_0 representing the Dirac distribution at zero, and μ_1, λ_1, θ_1, σ_1² ≥ 1 and η all pre-specified hyperparameters. The condition on the variance σ_1² is natural because, with the point mass at zero, large variation in the slab is preferable. In this prior setup, given k = k*, γ has a binomial distribution Bin(pk*, η), where η = π(|Λ_ij| > 0) for any 0 < i ≤ p and 0 < j ≤ k*. The choice of η is made to guarantee that η ≍ (pk*)^{-1}. Moreover, with such a specification, the prior probability satisfies η/2 ≤ π(|Λ_ij| > ε_n/(4c*³ pk*)) ≤ η, with ε_n introduced in Theorem 5.1.
In general, it is not easy to specify a bound k* that controls the sparsity of Λ and, hence, it is difficult to specify an appropriate η. The problem can be addressed by putting a further prior (5.2) on η, where a is the only new pre-specified hyperparameter. However, when this deeper hierarchical model is utilized, there is a slight loss in the posterior concentration rate in Theorem 5.1. With such a hyper-prior on η, given k = k*, the corresponding prior probabilities can be computed for any 0 < i ≤ p and 0 < j ≤ k*. If the prior (5.2) is imposed on η, then the rate becomes ε_n = n^{-1/2}(c*)^{5/2}(s* k*)^{1/2}(log n) when ν* is fixed and positive.

Computation
The existing literature often provides algorithms for sampling from the posterior distribution of Ω in the no-measurement-error context, and there is a simple and intuitive way to leverage these tools for sampling in cases with measurement error. Because the Bayes model in the measurement error case provides a joint distribution for (X, Y, Ω, ν), it is possible to write down the full conditionals (6.1)-(6.3), where Π(Ω | X_1, ..., X_n) is the posterior based on the no-measurement-error model with the augmented dataset X_1, ..., X_n, and Ȳ_i = m^{-1} Σ_{j=1}^m Y_{ij}. Note that Y_1, ..., Y_n do not appear in (6.2) because Ω is conditionally independent of Y, given X. Therefore, if we know how to sample from the no-measurement-error posterior distribution, e.g., using the algorithms available in the literature, then we can easily embed this into a Gibbs sampling framework wherein we iteratively sample from this set of full conditionals and obtain a posterior sample of precision matrices that accommodates the measurement error.
To execute step (6.2) efficiently, the prior on Ω needs to be convenient to work with. The assumed structure Ω = Θ + κI_p in the theoretical results of Section 3 is undoubtedly not convenient for computation. The role of κ is solely as a technical device to ensure a lower bound for the eigenvalues of Ω. If the prior on Θ already ensures a bound on the eigenvalues, then the additional term is not needed even for the theory. For practical applications, the additional term κI_p may not make a noticeable numerical difference and may sometimes be dropped, provided that this does not cause any instability in the matrix inversions and the simulations give sensible results. The numerical results presented below employ this simplification.
If ν is known, step (6.3) can simply be skipped by using the true ν in the other steps. However, when ν is unknown, under the model in (1.1), a prior is needed for ν to execute the algorithm. The conditions imposed on the prior for ν in the theoretical results are sufficient but not necessary. So, for practical implementation, one can feel reasonably safe taking any suitable prior for ν that simplifies the computation. For example, one might use the non-informative Jeffreys prior, π(ν) ∝ ν^{-3/2}, or an inverse-Gamma prior. Under the Jeffreys prior, the full conditional posterior of ν is available in closed form as an inverse-Gamma distribution. Another popular way to deal with the unknown nuisance parameter is to use an estimator of ν and treat it as the truth. This empirical Bayes method is equivalent to specifying a point mass prior located at the estimator; a typical choice is the estimator ν̂ in (3.4). The resulting procedure is sensible as long as the estimator of ν is sufficiently accurate. The following simulation study makes use of this simple plug-in method.
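The full Gibbs scheme can be sketched as below. This is a minimal illustration of our own: the Wishart update standing in for the no-measurement-error sampler of Ω, and all hyperparameters, are assumptions for the sketch rather than the structured algorithms from the literature; the X-update and the inverse-Gamma update of ν under the Jeffreys prior follow the full conditionals described above.

```python
import numpy as np
from scipy import stats

def gibbs_measurement_error(Y, n_iter=200, seed=0):
    """Gibbs sampler sketch for Y of shape (n, m, p); returns a draw of (Omega, nu)."""
    rng = np.random.default_rng(seed)
    n, m, p = Y.shape
    Ybar = Y.mean(axis=1)                      # (n, p) replicate means
    Omega, nu = np.eye(p), 1.0
    for _ in range(n_iter):
        # 1) X_i | Y_i, Omega, nu ~ N(mu_i, V), with V = (Omega + (m/nu) I)^{-1}
        V = np.linalg.inv(Omega + (m / nu) * np.eye(p))
        V = (V + V.T) / 2                      # symmetrize against round-off
        mu = (m / nu) * Ybar @ V
        X = mu + rng.multivariate_normal(np.zeros(p), V, size=n)
        # 2) Omega | X: plug in any no-measurement-error posterior sampler;
        #    stand-in here: conjugate Wishart update under a Wishart(p+1, I) prior
        Omega = stats.wishart.rvs(df=n + p + 1,
                                  scale=np.linalg.inv(np.eye(p) + X.T @ X),
                                  random_state=rng)
        # 3) nu | X, Y under the Jeffreys prior: inverse-Gamma full conditional
        resid = ((Y - X[:, None, :]) ** 2).sum()
        nu = stats.invgamma.rvs(a=n * m * p / 2 + 0.5, scale=resid / 2,
                                random_state=rng)
    return Omega, nu
```

Replacing step 2 with a structured sampler (e.g., for the sparse Cholesky prior of Section 4.2) yields the measurement-error-adjusted version of that procedure.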

Simulations
Since the proposed method covers both the case where the true value ν* of ν is known with m = 1 and the case where it is unknown with m > 1, we conduct two separate simulation studies to explore the performance of adjusting for the measurement error.

When ν is known and m = 1
We conduct a simulation study over different structures of the true precision matrix and known magnitudes of measurement error with m = 1. We fix the dimension p = 50 and sample size n = 100. Let Ω*_ij denote the entry in the ith row and jth column of the true precision matrix Ω*, and consider four sparse structures for Ω*: an AR(1) structure with Ω*_ii = 10, an AR(2) structure, and two block structures. For each sparse structure, 100 replicates are run. We use the priors on the Cholesky decomposition structure of Section 4.2, in the setup introduced as (4.3), to estimate the precision matrix, since all the true precision matrices Ω* described above have a sparse Cholesky decomposition and this prior induces a posterior with a fast and simple MCMC algorithm. Since our main interest is the influence of the measurement error, we do not survey other types of priors in this simulation study. The hyperparameters are specified as α_1 = β_1 = 1/2 and C_p = 1, since it is unrealistic to specify a very large C_p. To show how the choice of prior affects the results, we consider two combinations of the spike-and-slab prior: a diffuse prior with σ_0² = 0.01 and σ_1² = 10, and an informative prior with σ_0² = 0.0001 and σ_1² = 1. We use the Gibbs sampling technique introduced in Section 6.1 to sample from the posterior, with the Y_i's as the initial values of the X_i's for i = 1, ..., n. To demonstrate the effectiveness of the proposed method that adjusts for the measurement error, we compare the estimator after adjustment with the one ignoring the measurement error; the corresponding estimation errors are labeled "adjust" and "ignore" in the graphs. The estimator is the posterior mean, and the Frobenius norm estimation errors are given in Figures 1, 2, 3, and 4 for the four structures, respectively. In these figures, only the central 90% of the estimation errors are displayed, to remove some outliers caused by the extra variation from measurement error in such a finite-sample situation.
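The data-generating step of this design can be sketched as follows. The AR(1) truth is stood in for by its diagonal (the off-diagonal entries here are an illustrative omission), and the check at the end confirms that the naive sample covariance of the corrupted Y targets Σ* + νI_p rather than Σ*.

```python
import numpy as np

# Data-generating sketch for the known-nu, m = 1 design (n, p, and the AR(1)
# diagonal follow the text; the off-diagonals of the truth are omitted here).
rng = np.random.default_rng(5)
n, p, nu = 100, 50, 1.0
Omega_star = 10 * np.eye(p)                  # stand-in for the sparse AR(1) truth
Sigma_star = np.linalg.inv(Omega_star)
X = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)
Y = X + np.sqrt(nu) * rng.standard_normal((n, p))   # corrupted observations

S = Y.T @ Y / n                              # naive sample covariance of Y
# E[S] = Sigma_star + nu*I, so the naive inverse targets (Sigma_star + nu I)^{-1}
assert np.allclose(np.diag(S).mean(), Sigma_star[0, 0] + nu, atol=0.2)
```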
Note that the x-axis in these figures denotes log_10 ν, where ν is the measurement error variance. For the AR(1) model in Figure 1, the estimator that corrects for the measurement error has better accuracy in terms of the Frobenius norm, except when ν is relatively large and the diffuse prior is used. At the same time, the variance of the estimation error becomes larger when ν > 1, compared with the variance under the baseline model that ignores the measurement error. This inflation of the variance in the adjusted model is due to the relatively small sample size compared to the relatively large dimension, as well as the extra variation introduced by the measurement error. This large variance signals a lack of sufficient information about the signal in the data, so estimation with the diffuse prior becomes problematic when ν is large. When the more informative prior is used, the variance is more stable and our procedure beats the naive method uniformly, as shown in the right plot of Figure 1. On the other hand, since the naive estimator is close to the zero matrix when the magnitude of the measurement error is large, its error approaches a fixed value and its variance is tiny. This can be verified by comparing the error for large ν with the Frobenius norm of the true precision matrix listed above. This is why we chose relatively large entries for the true precision matrix, so that the bias dominates when the measurement error is ignored, even when its variance is small.
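The shrinkage of the naive estimator toward the zero matrix for large ν can be seen already at the population level: the best the naive method can target is (Σ + νI_p)⁻¹, whose Frobenius distance to Ω approaches ‖Ω‖_F as ν grows. The AR(1)-type matrix below (with illustrative entries) demonstrates this:

```python
import numpy as np

# Population-level illustration: if measurement error of variance nu is
# ignored, the naive target is (Sigma + nu*I)^{-1}, which shrinks toward
# the zero matrix as nu grows, so the Frobenius error approaches
# ||Omega||_F. The AR(1)-type matrix below is a hypothetical example.
p = 50
Omega = 10.0 * np.eye(p)
i = np.arange(p - 1)
Omega[i, i + 1] = Omega[i + 1, i] = 5.0
Sigma = np.linalg.inv(Omega)

errs = []
for nu in [0.01, 1.0, 100.0]:
    naive = np.linalg.inv(Sigma + nu * np.eye(p))
    errs.append(np.linalg.norm(naive - Omega, "fro"))

print([round(e, 2) for e in errs])              # increasing in nu
print(round(np.linalg.norm(Omega, "fro"), 2))   # the limiting error
```

Because (Σ + νI_p)⁻¹ and Ω share eigenvectors here, the per-eigenvalue error λ²ν/(1 + νλ) is monotone in ν, so the population error increases with ν and saturates at ‖Ω‖_F.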
When the structure of the precision matrix is more complex, such as AR(2), the proposed method performs uniformly better than the naive one, and the error stabilizes across the different measurement error scenarios in Figure 2. Similar phenomena are observed for the block structures in Figures 3 and 4.

When ν is unknown and m > 1
We consider the same simulation settings as in Section 6.2.1, except that now ν is unknown and we have m = 2 replicates of each outcome with which to estimate ν and Ω. Since the effectiveness of adjusting for the measurement error is similar across the four structures, only the AR(1) and AR(2) structures are considered. Further, the empirical Bayes method using ν̂ as in (3.4) is employed to estimate ν, and only the informative prior is used for estimating Ω, since its performance is much better than that of the diffuse prior.
For both the AR(1) and AR(2) structures, as shown in Figure 5, correcting for the measurement error improves the accuracy of the estimation in terms of the Frobenius norm for each choice of ν. Again, the estimators based on the proposed method have larger variance than those under the baseline model, since the unknown ν and the extra layer in the hierarchical model introduce more variability. Despite this slightly larger variability, the estimation error of the proposed method remains far smaller than that of the naive method that ignores the measurement error. There is a peculiar initial downward trend in the proposed method's estimation error as ν increases. We believe that this is because, when ν is very small, i.e., close to the boundary ν = 0, the plug-in estimator ν̂ loses accuracy. However, when ν is away from the boundary, the expected trend emerges, namely, the estimation error is increasing, but more slowly than for the naive method.
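A minimal sketch of a moment-based estimator of ν from m = 2 replicates, exploiting the fact that Y_i1 − Y_i2 is free of the latent X_i and has variance 2ν per coordinate. The actual estimator ν̂ in (3.4) may differ in form; this version only illustrates why replicates identify ν:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, nu_true = 100, 50, 0.5

# Latent X_i with some covariance (identity here for simplicity); the two
# replicates per subject share X_i but have independent errors.
X = rng.standard_normal((n, p))
Y1 = X + np.sqrt(nu_true) * rng.standard_normal((n, p))
Y2 = X + np.sqrt(nu_true) * rng.standard_normal((n, p))

# Y1 - Y2 eliminates X and has variance 2*nu per coordinate, giving a
# simple moment estimator of nu (hypothetical; the paper's (3.4) may differ).
nu_hat = np.mean((Y1 - Y2) ** 2) / 2.0
print(round(nu_hat, 3))
```

With np coordinates contributing, such an estimator is quite stable away from the boundary ν = 0, consistent with the behavior described above.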

A.1. Proof of Theorem 2.1
A simple bias-variance decomposition yields the first bound. Since the first term is non-negative, we obtain the second, in which the first term dominates as n → ∞ for fixed ν. An application of the Woodbury identity then gives the claimed inequality for large enough n, which proves the first assertion. Now consider ν ≤ ‖Ω*‖₂⁻¹, the smallest eigenvalue of Ω*⁻¹. Using a spectral decomposition Ω* = UDU^T, where U is an orthogonal matrix and D = diag(D_11, . . . , D_pp), we obtain νD_jj ≤ 1 for all j = 1, . . . , p, from which the second assertion follows.
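The Woodbury step can be checked numerically. The following sketch verifies one common form of the identity relevant here, (Ω⁻¹ + νI_p)⁻¹ = Ω − Ω(Ω + ν⁻¹I_p)⁻¹Ω, on a random positive definite Ω:

```python
import numpy as np

rng = np.random.default_rng(2)
p, nu = 20, 0.7

# Random positive definite precision matrix Omega.
A = rng.standard_normal((p, p))
Omega = A @ A.T + p * np.eye(p)
Sigma = np.linalg.inv(Omega)

# Woodbury identity in the form used for the error-corrupted model:
#   (Omega^{-1} + nu*I)^{-1} = Omega - Omega (Omega + nu^{-1} I)^{-1} Omega
lhs = np.linalg.inv(Sigma + nu * np.eye(p))
rhs = Omega - Omega @ np.linalg.inv(Omega + np.eye(p) / nu) @ Omega
print(np.allclose(lhs, rhs))  # True
```

This is the algebraic reason the measurement-error model's precision matrix can be expressed in terms of Ω without ever forming Σ explicitly.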

A.2. Proof of Theorem 3.1
The proof proceeds in a sequence of steps driven by the sufficient conditions in Ghosal, Ghosh and van der Vaart [21] for bounding the posterior contraction rate, which apply since the Y_i's are independent and identically distributed.
Recall that the prior for Ω is based on independent priors for the ingredients Θ and κ in the representation Ω = Θ + κI_p in (3.1). For a given true Ω*, in the proof we consider the corresponding representation Ω* = Θ* + κ*I_p. This decomposition is not unique: there are many Θ* and κ* that satisfy this equation; for example, fix a weight w ∈ (0, 1), set κ* = wλ_min(Ω*), and then Θ* = Ω* − κ*I_p. Fortunately, this non-uniqueness does not affect us here, since the same conclusion is reached for every choice of (Θ*, κ*) satisfying the above display, provided the corresponding assumptions hold. Indeed, recall that, e.g., Condition (b) requires that the concentration of the prior Π_Θ hold around some Θ* sharing the same assumed structure as Ω*.
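The non-uniqueness of the decomposition can be illustrated numerically: for any weight w ∈ (0, 1), the pair (Θ*, κ*) below is a valid choice, with Θ* remaining positive definite:

```python
import numpy as np

rng = np.random.default_rng(3)
p, w = 10, 0.5  # w is the weight from the text; any value in (0, 1) works

A = rng.standard_normal((p, p))
Omega = A @ A.T + np.eye(p)                # a generic true precision matrix

kappa = w * np.linalg.eigvalsh(Omega)[0]   # kappa* = w * lambda_min(Omega*)
Theta = Omega - kappa * np.eye(p)          # Theta* = Omega* - kappa* I_p

# Every choice of w gives a valid pair: Theta* stays positive definite,
# with smallest eigenvalue (1 - w) * lambda_min(Omega*).
print(np.linalg.eigvalsh(Theta)[0] > 0)  # True
```

Varying w traces out the continuum of admissible pairs (Θ*, κ*), all of which reconstruct the same Ω*.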
Step 1: Prior concentration. For generic Ω and ν, let g_Ω and g_{Ω,ν} denote the N_p(0, Ω⁻¹) and N_p(0, Ω⁻¹ + νI_p) densities, respectively, so that the data in (1.2) are i.i.d. with density g_{Ω,ν}. For the target rate ε_n, we aim to show that, for some C > 0, the prior probability of a Kullback-Leibler (KL) neighborhood of the truth of radius ε_n is at least e^{−Cnε_n²}, where the KL divergence and the corresponding KL variation of two densities f_1 and f_2 are as defined above. By our assumption that the eigenvalues of Ω* are bounded away from 0 and infinity, and Lemma B.1 in Appendix B, there exists c > 0 such that the KL neighborhood contains a set on which ‖Ω − Ω*‖_F and |ν − ν*| are of the order cε_n. Then, from the first result in Lemma B.2 in Appendix B, there exist c_1, c_2 > 0 such that the required bound reduces to a prior concentration statement for (Ω, ν). For the first term on the right-hand side, replacing Ω by (Θ, κ) and Ω* by (Θ*, κ*), the triangle inequality gives the desired bound.

Step 2: Sieve, test construction, and error rates. In order to apply the theory of posterior contraction in Ghosal and van der Vaart [22] to show that the contraction rate at the truth with respect to a distance d is ε_n, we need to construct a test of the true value against most of the complement of the d-neighborhood of radius ε_n around the truth, with error probabilities decaying exponentially in nε_n². This test is obtained by combining tests of the truth against small balls whose centers are separated from the truth by at least ε_n. Whether such a test for the truth against a small ball exists depends on the metric. If d is the Hellinger metric on the corresponding densities, then such a test exists by celebrated existence theorems. For other metrics, either such a test has to be constructed directly in the given situation, or the distance has to be dominated by a multiple of the Hellinger distance near the true value. In the present context, the distance of interest is the Frobenius distance on the precision matrix, which is not directly comparable with the Hellinger distance for arbitrary pairs of matrices.
Here, we construct the required test directly from the likelihood ratio tests of the truth against separated simple alternatives, which can be elegantly quantified by the Rényi divergence, similar to the strategy pursued by Ning, Jeong and Ghosal [29] and Jeong and Ghosal [23]. The structure of multivariate normality allows a useful control over the size of the likelihood ratios essential for this approach to work.
After the basic tests are constructed, they need to be combined, which is possible if their number can be controlled appropriately, and in particular, if there are only finitely many covering sets with centers separated from the truth. This necessitates defining a sieve, a sequence of increasing subsets of the parameter space, on which the number of covering sets can be controlled and whose complement has exponentially small prior probability. We shall work with the sieve T_n built from S_n as in the statement of the theorem, and we choose M_n = Knε_n² → ∞ for some sufficiently large constant K, and therefore M_n ≤ n. What makes this sieve appropriate for our purposes is that every Ω ∈ T_n has eigenvalues bounded below by M_n⁻¹, which is not too small. This eigenvalue control is critical to our demonstration below that the combined likelihood ratio test has suitable bounds on its Type I and Type II errors in the presence of measurement error.
Step 3: Entropy bound. In the above analysis, the Ω† separated from Ω* was fixed but arbitrary, so we can repeat the argument for finitely many different Ω†'s and construct a test for the complement of the Rényi neighborhood of Ω* by taking the maximum of those Ω†-specific tests. The logarithm of the number of such tests is then bounded by the δ_n-entropy of T_n with respect to the product of the spectral norm and the absolute value, log N(δ_n, T_n, (‖·‖₂, |·|)). It suffices to show that this is bounded by a constant multiple of nε_n². To this end, if we take (Ω_1 = Θ_1 + κ_1 I_p, ν_1) and (Ω_2 = Θ_2 + κ_2 I_p, ν_2) in T_n, then by the triangle inequality, for the given δ_n, log N(δ_n, T_n, (‖·‖₂, |·|)) ≤ log N(δ_n/2, S_n, ‖·‖₂) + 2 log(M_n δ_n⁻¹) ≲ nε_n² + 2 log n ≲ nε_n².
where the constant G > 0 can be made as large as we wish by choosing K large enough.

A.3. Proof of Theorem 4.1
As Condition (a) is directly assumed, we only need to verify the conditions of Theorem 3.1 for the sieve, where the last expression is no more than a constant multiple of nε_n². Since R_n ≍ nε_n²/log n, this verifies Condition (b)(i). Next, for Θ ∈ S_n^c, either |Θ_ij| > M_n for some (i, j), or γ > R_n. The probability of this set is at most p² Π(‖Θ‖_∞ > M_n) + Π(γ > R_n). Since the entries of Θ have exponential or Laplace distributions, both of which have exponentially small tail probabilities, the first term is bounded by a multiple of exp(−λM_n) = exp(−λKnε_n²). The second term is bounded as Π(γ > R_n) ≲ exp(−aR_n log R_n) ≲ exp(−aKnε_n²) by assumption (4.2) and the choice R_n ≍ nε_n²/log n. Taking K sufficiently large verifies Condition (b)(ii).

A.4. Proof of Theorem 4.2
We consider the sieve with R_n = Knε_n²/log n and M_n = Knε_n² for some sufficiently large constant K > 0, and verify Conditions (b)(i)-(iii).
For any two precision matrices Θ_1, Θ_2 ∈ S_n with Cholesky decompositions Θ_1 = U_1 D_1 U_1^T and Θ_2 = U_2 D_2 U_2^T, we obtain a bound on ‖Θ_1 − Θ_2‖₂. Since R_n ≲ nε_n²/log n and p ≲ nε_n², Condition (b)(i) is verified. To verify Condition (b)(ii), we observe that, under both (4.3) and (4.4), γ has a binomial distribution, and therefore the first term on the right-hand side is bounded by exp(−aR_n log R_n) ≲ exp(−aKnε_n²). By the tail probabilities of the gamma and normal distributions, the remaining two terms are bounded above by a multiple of exp(−λM_n) = exp(−λKnε_n²). Choosing K sufficiently large ensures the required bound.
To verify Condition (b)(iii) about prior concentration, we have, for some constant c > 0, the lower bound (A.8), since all the diagonal values of D* are bounded away from 0 and the prior density around D* is bounded below; the corresponding probability for U shares the same lower bound (A.8). This follows because Π(‖U − U*‖_∞ ≤ ε_n/p) involves the probability Π(|U_ij| < ε_n/p) ≳ 1 − p^{−b} when U*_ij = 0, which is the case for p(p−1)/2 − s* many pairs. Now, by the triangle inequality, the facts that ‖U*‖₂ and ‖D*‖₂ are assumed to have a constant upper bound, and the prior independence of U and D, it follows that − log Π(‖Θ − Θ*‖_F ≤ ε_n) ≲ (p + s*) log(p/ε_n) ≲ (p + s*) log n, so the rate ε_n = [{(p + s*) log n}/n]^{1/2} satisfies the required condition.
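The sparse modified Cholesky parameterization Θ = UDU^T underlying this proof can be sketched numerically; the sparsity level below is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)
p, s = 8, 0.3  # dimension and expected fraction of nonzero off-diagonals

# Sparse modified Cholesky parameterization Theta = U D U^T, with U unit
# lower triangular (sparse off-diagonal) and D diagonal, bounded away from 0.
U = np.eye(p)
rows, cols = np.tril_indices(p, k=-1)
mask = rng.random(rows.size) < s
U[rows, cols] = rng.standard_normal(rows.size) * mask

D = np.diag(rng.uniform(0.5, 2.0, size=p))
Theta = U @ D @ U.T

# Any such (U, D) yields a symmetric positive definite Theta.
print(np.allclose(Theta, Theta.T), np.linalg.eigvalsh(Theta)[0] > 0)  # True True
```

Positive definiteness for free is the appeal of this parameterization: the prior can act independently on the sparse entries of U and the diagonal of D without any positive-definiteness constraint to enforce.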

A.5. Proof of Theorem 4.3
By the arguments given at the beginning of the proof of Theorem 3.1, we may assume that our choice of Θ* meets the two conditions assumed on Ω*, namely, that its eigenvalues are bounded above and away from 0, and that a fixed, sufficiently small ‖·‖_∞-neighborhood of Θ* is contained in P_G.
We consider the sieve with M_n = Knε_n², where K is a suitably large constant to be chosen. On the sieve, note that Θ ∈ S_n^c only if |Θ_ij| > M_n for some pair (i, j). The positive definiteness of Θ implies that the largest entry of Θ in absolute value occurs at a diagonal position. By a property of the G-Wishart distribution, each diagonal entry follows a chi-square distribution with δ degrees of freedom. From the tail estimate for a chi-square random variable, it then follows that Π(S_n^c) ≲ e^{−cnε_n²} for some c > 0, which can be made as large as we please by choosing K large enough. It remains to verify that the prior concentration rate is ε_n = {n⁻¹(p log n)}^{1/2}. There are at most pk free arguments in Θ due to the k-banding structure. By Roverato [31], the G-Wishart density at the true value Θ* is bounded below by a constant multiple of the product of a power of det(Θ*), e^{−tr(Θ*)/2}, and e^{−cp} for some c > 0; see Equations (3.2), (3.3), (5.2), and (5.3) of Banerjee and Ghosal [2]. Clearly, tr(Θ*) = O(p) and |log det(Θ*)| = O(p) by the boundedness of the eigenvalues of Θ* and its inverse. Since Θ* stays away from the boundary of P_G by assumption, the prior concentration bound follows for some c > 0. From (A.9)-(A.11), the rate ε_n = {n⁻¹(p log n)}^{1/2} follows by an application of Theorem 3.1.
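The claim that k-banding leaves at most of the order pk free arguments can be checked by direct counting; the helper function name below is ours, not the paper's:

```python
def banded_free_params(p: int, k: int) -> int:
    """Free entries (diagonal plus upper bands) of a symmetric p x p
    matrix with bandwidth k, i.e. positions with 0 <= j - i <= k."""
    # p diagonal entries plus k upper bands of lengths p-1, ..., p-k
    return p + sum(p - b for b in range(1, k + 1))

# Brute-force check against enumerating the upper triangle.
p, k = 12, 3
count = sum(1 for i in range(p) for j in range(i, p) if j - i <= k)
print(banded_free_params(p, k), count)          # both equal 42 here
print(banded_free_params(p, k) <= p * (k + 1))  # True: at most p(k+1) ~ pk
```

The exact count p(k + 1) − k(k + 1)/2 is bounded by p(k + 1), which is the order pk appearing in the rate.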

A.6. Proof of Theorem 5.1
We apply Theorem 3.1 by verifying Conditions (b)(i)-(iii) on Π_Θ with Θ = ΛΛ^T. We prove the first assertion only; the second can be established by the minor adjustment of the argument described at the end. Observe that we can lower bound the prior concentration probability by a product of two factors, where the second factor is bounded below, up to a constant multiple, by exp(−K_1 k* log k*) for some constant K_1, by well-known properties of the Poisson distribution. For the first factor, we consider the supremum over all entries with 0 < i ≤ p and 0 < j ≤ k*, and let δ = ε/{4(c*)³pk*}; this factor can then be bounded below by a further product of two terms. For the second of these, we have Π(|Λ_ij − Λ*_ij| ≤ δ | |Λ_ij| > δ) ≳ δ exp(−c*/2), since the probability density is bounded below on a closed region, and Π(|Λ_ij| > δ) ≳ (pk*)⁻¹. Therefore, we conclude that Condition (b)(iii) is verified for ε_n = {n⁻¹(c* s* k* log n)}^{1/2}.
Next, let S_n be the sieve with γ_n a large constant multiple of nε_n²/log n, M_n = nε_n²γ_n, and k_n = γ_n. To bound Π(S_n^c) so that the theorem applies and gives the rate ε_n, we need to establish that, for some sufficiently large constant C > 0, the tail bounds in (A.12) hold. This suffices because, conditioning on k ≤ k_n and γ ≤ γ_n to obtain the desired bound e^{−Cnε_n²} for Π(S_n^c), the maximum number of (i, j) pairs to be considered is bounded by γ_n²k_n², which can be absorbed into the exponent appearing in the bound for max Π(|Λ_ij| > M_n/(2γ_n)). The first relation follows from the tail estimate for sufficiently large n. To derive the second relation, it suffices to condition on k with k ≤ k_n. By the tail probability of the binomial distribution, Π(γ > γ_n) is bounded by e^{−C′γ_n log n} for some constant C′ > 0, which gives the desired bound.
For the third inequality in (A.12), the tail estimate of a normal distribution gives a bound of e^{−C′M_n/γ_n}, which yields the desired bound in view of the choices of M_n and γ_n.
By the relations above, the δ_n-metric entropy of S_n with respect to the spectral norm is bounded by log Σ_{k=1}^{k_n} Σ_{γ=1}^{γ_n} (pk choose γ) {√(2M_n) γ_n/(M_n δ_n/(2γ_n))}^γ ≲ log{k_n γ_n (pk_n)^{γ_n} (M_n/δ_n)^{γ_n}} ≲ γ_n log n, which is of the order of nε_n². This gives the rate ε_n = {n⁻¹(log n) c* s* k*}^{1/2} in terms of the Rényi divergence. By the last assertion of Theorem 3.1, since ‖Ω*‖₂ ≤ c* by assumption, the contraction rate in terms of the Frobenius distance, namely n^{−1/2}(log n)^{1/2}(c* s* k*)^{1/2} in the no-measurement-error case, changes to (c*)² n^{−1/2}(log n)^{1/2}(c* s* k*)^{1/2} = n^{−1/2}(log n)^{1/2}(c*)^{5/2}(s* k*)^{1/2} for a fixed scale of measurement error. When the prior (5.2) is put on η, the only change in the calculation comes from the fact that γ then has a beta-binomial distribution instead of a binomial one. By the tail probability of the beta-binomial distribution in Castillo and van der Vaart [9], Π(γ > γ_n) is bounded by k_n(pk_n − γ_n) ((1+a_4)pk_n − γ_n choose a_4 pk_n) / ((1+a_4)pk_n + 1 choose a_4 pk_n) ≲ pk_n² {1 − (γ_n + 1)/((1 + a_4)pk_n + 1)}^{a_4 pk_n} ≲ pk_n² e^{−C′γ_n} for some constant C′ > 0, which can be bounded as desired; here we have used the fact that γ_n/(pk_n) → 0. Since this tail estimate is weaker than the bound e^{−C′γ_n log n} obtained in the case of a fixed η, the contraction rate in terms of the Rényi divergence weakens to n^{−1/2}(log n)(c* s* k*)^{1/2}, and that in terms of the Frobenius distance weakens to {n⁻¹(log n)² c* s* k*}^{1/2} and {n⁻¹(log n)² (c*)⁵ s* k*}^{1/2}, respectively, depending on whether ν = 0 is known or ν is positive and fixed.
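A minimal sketch of the sparse factor-loading form Θ = ΛΛ^T used in this proof, with the number of factors k and the per-column sparsity s chosen purely for illustration; adding κI_p as in (3.1) restores positive definiteness:

```python
import numpy as np

rng = np.random.default_rng(5)
p, k, s = 30, 3, 5   # dimension, number of factors, nonzeros per column

# Sparse factor-loading matrix Lambda (p x k): each column has only s
# nonzero entries, so Theta = Lambda Lambda^T has rank at most k.
Lam = np.zeros((p, k))
for j in range(k):
    nz = rng.choice(p, size=s, replace=False)
    Lam[nz, j] = rng.standard_normal(s)

Theta = Lam @ Lam.T
print(np.linalg.matrix_rank(Theta) <= k)           # True

# Adding kappa * I as in (3.1) makes the precision matrix positive definite.
Omega = Theta + 0.5 * np.eye(p)
print(np.linalg.eigvalsh(Omega)[0] >= 0.5 - 1e-8)  # True
```

The low rank of ΛΛ^T is exactly why the entropy calculation above only has to control of the order s*k* active loading entries rather than p² matrix entries.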

Lemma B.2.
Given Ω, Ω*, ν and ν*, let Ω_ν denote the precision matrix for model (1.2) involving Ω and ν, and let Ω*_{ν*} denote its true counterpart, involving Ω* and ν*. Then, if |ν − ν*| ≤ ε/√p and ‖Ω − Ω*‖_F ≤ ε for small enough ε, we have the first conclusion of the lemma, where K_1 is a constant depending only on m, ν* and ‖Ω*‖₂. On the other hand, if ‖Ω_ν − Ω*_{ν*}‖_F ≤ ε and ν ≤ M for some constant M, we have the second conclusion, where K_2 and K_3 are constants depending only on m and ν*.