Bayesian inference for high-dimensional decomposable graphs

In this paper, we consider high-dimensional Gaussian graphical models where the true underlying graph is decomposable. A hierarchical $G$-Wishart prior is proposed to conduct Bayesian inference for the precision matrix and its graph structure. Although posterior asymptotics using the $G$-Wishart prior has received increasing attention in recent years, most existing results assume moderate high-dimensional settings, where the number of variables $p$ is smaller than the sample size $n$. However, this assumption might not hold in many real applications such as genomics, speech recognition and climatology. Motivated by this gap, we investigate asymptotic properties of posteriors under the high-dimensional setting where $p$ can be much larger than $n$. Pairwise Bayes factor consistency, posterior ratio consistency and graph selection consistency are obtained in this high-dimensional setting. Furthermore, the posterior convergence rate for precision matrices under the matrix $\ell_1$-norm is derived, which turns out to coincide with the minimax convergence rate for sparse precision matrices. A simulation study confirms that the proposed Bayesian procedure outperforms competitors.


Introduction
Consider a sample of observations $X_1, \ldots, X_n \overset{i.i.d.}{\sim} N_p(0, \Omega^{-1})$ from a $p$-dimensional normal model, where $\Omega$ is a $p \times p$ precision matrix. The main focus of this paper is estimating (i) the support of the precision matrix and (ii) the precision matrix itself. Support recovery of the precision matrix (or equivalently, graph selection) means estimating the locations of the nonzero entries of the precision matrix. Statistical inference on a precision matrix, or a covariance matrix $\Sigma = \Omega^{-1}$, is essential to uncover the dependence structure of multivariate data. In particular, a precision matrix reveals the conditional dependencies between the variables. However, especially when the number of variables $p$ can be much larger than the sample size $n$, this is a challenging task because consistent estimation is impossible without further assumptions (Lee and Lee, 2018).
Various restrictive matrix classes have been suggested to enable consistent estimation in such high-dimensional settings. One of the most popular restrictive matrix classes is the set of sparse matrices. The sparsity assumption, which means that most entries of a matrix are zero, can be imposed on covariance matrices (Cai and Zhou, 2012b; Cai, Ren and Zhou, 2016), precision matrices (Cai, Liu and Zhou, 2015) or Cholesky factors (Cao et al.). In this paper, we focus on sparse precision matrices. They lead to sparse Gaussian graphical models, which will be described in Section 2.2. Various statistical methods have been proposed in the frequentist literature for estimating high-dimensional sparse precision matrices using penalized likelihood estimators (Yuan and Lin, 2007; Rothman et al., 2008; Ravikumar et al.) and neighborhood-based methods. Ren et al. (2015) and Cai, Liu and Zhou (2016) suggested a regression-based method and an adaptive constrained $\ell_1$-minimization method, respectively, and showed that the proposed methods achieve the minimax rates and graph selection consistency for sparse precision matrices.
On the Bayesian side, relatively few works have investigated asymptotic properties of posteriors for high-dimensional precision matrices. The main obstacle is the difficulty of constructing a convenient prior for sparse precision matrices. Because priors have to be defined on the space of sparse positive definite matrices, calculating normalizing constants is a nontrivial issue. Banerjee and Ghosal (2015) used a mixture of a point mass at zero and Laplace priors for off-diagonal entries and exponential priors for diagonal entries under the positive definiteness constraint. They obtained the posterior convergence rate for sparse precision matrices under the Frobenius norm, but their result requires the assumption $p = o(n)$. Furthermore, because the marginal posterior of the graph is intractable, they used a Laplace approximation. Wang (2015) proposed a similar method using continuous spike-and-slab priors for off-diagonal entries of precision matrices. However, theoretical properties of the induced posteriors are unavailable, and a Gibbs sampling algorithm should be used due to the unknown normalizing constant.
As an alternative, the $G$-Wishart prior (Atay-Kayis and Massam, 2005) has been widely used to conduct Bayesian inference for sparse precision matrices. One of the advantages of this prior is that the prior density has a closed form if the underlying graph is decomposable, where the definition of a decomposable graph will be given in Section 2.2. Based on the $G$-Wishart prior, Xiang et al. (2015) proved a posterior convergence rate for precision matrices under the matrix $\ell_\infty$-norm when the graph is decomposable. However, they assumed that the graph is known, which is rarely true in real applications. Banerjee and Ghosal (2014) also used the $G$-Wishart prior and derived the posterior convergence rate for banded (or bandable) precision matrices, whose entries farther than a certain distance from the diagonal are all zeros (or very small). Since the underlying graph is always decomposable for banded precision matrices, the posterior can be calculated in a closed form. However, in Xiang et al. (2015) and Banerjee and Ghosal (2014), the obtained posterior convergence rates are sub-optimal, and the graph selection consistency of posteriors has not been investigated.
Recently, Niu et al. (2019) and Liu and Martin (2019) investigated asymptotic properties of posteriors using $G$-Wishart priors when the true graph is decomposable and unknown. Niu et al. (2019) established posterior ratio consistency as well as graph selection consistency when $p$ grows to infinity as $n \to \infty$. Liu and Martin (2019) obtained the posterior convergence rate of precision matrices under the Frobenius norm. However, these works assumed a moderate high-dimensional setting, where $p = O(n^{\delta})$ for some $0 < \delta < 1$. To the best of our knowledge, asymptotic properties of posteriors for decomposable Gaussian graphical models in an ultra high-dimensional setting, say $p \gg n$, have not been established yet.
In this paper, we consider high-dimensional decomposable Gaussian graphical models. A hierarchical $G$-Wishart prior is proposed for sparse precision matrices. We fill the gap in the literature by showing that the proposed Bayesian method achieves graph selection consistency and posterior convergence rates in high-dimensional settings, even when $p \gg n$. Under mild conditions, we first show pairwise Bayes factor consistency (Theorem 3.1) and posterior ratio consistency (Theorem 3.2). Furthermore, the graph selection consistency of posteriors (Theorem 3.3) is shown under slightly stronger conditions. Based on these results, we also show that our method attains the posterior convergence rate for precision matrices (Theorem 3.4) under the matrix $\ell_1$-norm and spectral norm, which coincides with the minimax rate for sparse precision matrices. The practical performance of the proposed method is investigated in simulation studies, which show that our method outperforms its frequentist competitors.
The rest of the paper is organized as follows. In Section 2, we introduce notation, Gaussian graphical models, the hierarchical $G$-Wishart prior and the resulting posterior. In Section 3, we establish asymptotic properties of posteriors such as the graph selection consistency and the posterior convergence rate. Simulation studies focusing on the graph selection property are provided in Section 4, and a discussion is given in Section 5. The proofs of the main results are provided in the Appendix.

Notation
For any positive sequences $a_n$ and $b_n$, we denote $a_n = o(b_n)$, or equivalently $a_n \ll b_n$, if $a_n / b_n \longrightarrow 0$ as $n \to \infty$, and $a_n = O(b_n)$, or equivalently $a_n \lesssim b_n$, if there exists a constant $C > 0$ such that $a_n / b_n \le C$ for all sufficiently large $n$. We denote $a_n \asymp b_n$ if there exist positive constants $C_1$ and $C_2$ such that $C_1 \le a_n / b_n \le C_2$. For any $p \times p$ matrix $A = (A_{ij})$, $P \subset \{1, \ldots, p\}$ and $1 \le j \le p$, we write $A_P$ for the submatrix of $A$ formed by the rows and columns indexed by $P$, and $A_{(\cdot, j)}$ for the $j$th column of $A$. The matrix $\ell_w$-norm is defined as
$$\|A\|_w := \sup_{x \in \mathbb{R}^p, \, \|x\|_w = 1} \|Ax\|_w \qquad (1)$$
for any integer $1 \le w \le \infty$, where $\|a\|_w$ is the vector $\ell_w$-norm for any $a \in \mathbb{R}^p$. As special cases, the matrix $\ell_1$-norm is the maximum absolute column sum and the matrix $\ell_\infty$-norm is the maximum absolute row sum. The matrix $\ell_2$-norm in (1) is called the spectral norm.
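To make these conventions concrete, here is a small illustrative sketch in Python; numpy's `ord` argument matches the induced matrix norms used in this paper, and the function names are ours, not the paper's.

```python
# A minimal numpy sketch of the norm conventions above; names are
# purely illustrative and not part of the paper.
import numpy as np

def vector_norm(a, w):
    """Vector l_w-norm ||a||_w for a 1-D array a."""
    return np.linalg.norm(a, ord=w)

def matrix_norm(A, w):
    """Matrix l_w-norm of A: the operator norm induced by ||.||_w.
    ord=1 gives the maximum absolute column sum, ord=np.inf the
    maximum absolute row sum, and ord=2 the spectral norm."""
    return np.linalg.norm(A, ord=w)

A = np.array([[2.0, -1.0],
              [0.5,  3.0]])
print(matrix_norm(A, 1), matrix_norm(A, 2), matrix_norm(A, np.inf))
```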

Gaussian graphical models
Consider an undirected graph $G = (V, E)$, where $V = \{1, \ldots, p\}$ is the set of vertices and $E \subseteq \{(i, j) : 1 \le i < j \le p\}$ is the set of edges. For simplicity, we denote the number of edges in a graph $G$ by $|G|$. Let $P_G$ be the set of all $p \times p$ positive definite matrices $\Omega = (\Omega_{ij})$ such that, for $i \neq j$, $\Omega_{ij} \neq 0$ if and only if $(i, j) \in E$. Suppose that we observe data from the $p$-dimensional Gaussian graphical model
$$X_1, \ldots, X_n \mid \Omega \;\overset{i.i.d.}{\sim}\; N_p(0, \Omega^{-1}), \qquad (2)$$
where $\Omega \in P_G$ is a precision matrix. Since the graph $G$ is usually unknown, both recovery of the graph $G$ and estimation of the precision matrix $\Omega$ are the main goals of this paper. We consider the high-dimensional setting where $p = p_n$ grows to infinity as the sample size $n$ gets larger. We assume that the data were generated from a true precision matrix $\Omega_0$ and the true graph $G_0$ with $\Omega_0 \in P_{G_0}$.

We present here some necessary background on graph theory to be self-contained. A graph is said to be complete if all pairs of vertices are joined by an edge, and a complete subgraph that is maximal is called a clique. For given vertices $v$ and $w$ in $V$, a path of length $k$ from $v$ to $w$ is a sequence of vertices $v = v_0, v_1, \ldots, v_k = w$ such that $(v_{l-1}, v_l) \in E$ for all $l = 1, \ldots, k$. As a special case, if $v = w$, the path is called a cycle of length $k$. A chord is an edge between two vertices of a cycle that is itself not a part of the cycle. An undirected graph $G$ is said to be decomposable if every cycle of length greater than or equal to 4 possesses a chord (Lauritzen, 1996). One of the advantages of working with a decomposable graph $G$ is that, for any decomposable graph $G$, there exist a perfect sequence of cliques $P_1, \ldots, P_h$ and separators $S_2, \ldots, S_h$ defined as $S_l = (\cup_{j=1}^{l-1} P_j) \cap P_l$ for $l = 2, \ldots, h$ (Lauritzen (1996), Proposition 2.17). Here, a sequence is said to be perfect if every $S_l$ is complete and, for all $j > 1$, there exists an $l < j$ such that $S_j \subseteq P_l$. In this paper, we will focus on decomposable graphs mainly to exploit this property.
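As a concrete illustration of these definitions, the following hedged sketch uses the networkx library (our implementation choice, not the paper's) to test chordality, which is equivalent to decomposability, and to list the maximal cliques of a chordal graph.

```python
# A small sketch checking decomposability (chordality) with networkx.
import networkx as nx

# A 4-cycle has no chord, hence is not decomposable.
G = nx.cycle_graph(4)
print(nx.is_chordal(G))                   # False

# Adding the chord (0, 2) makes the graph decomposable.
G.add_edge(0, 2)
print(nx.is_chordal(G))                   # True

# The maximal cliques of a chordal graph can be arranged into a
# perfect sequence via a junction tree; here we simply list them,
# e.g. [frozenset({0, 1, 2}), frozenset({0, 2, 3})].
print(list(nx.chordal_graph_cliques(G)))
```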

Hierarchical G-Wishart prior
We consider a hierarchical prior for the precision matrix $\Omega$ in (2). First, we impose a prior (3) on the graph $G$, for some constant $C_\tau > 0$ and positive integer $R$, which is supported on the set $\mathcal{D}$ of all decomposable graphs satisfying $\max_{1 \le l \le h} |P_l| \le R$, where $P_1, \ldots, P_h$ is a perfect sequence of cliques of $G$. The condition $\max_{1 \le l \le h} |P_l| \le R$ implies that we focus only on graphs whose cliques are of moderate size. For a given graph $G$, we will work with the $G$-Wishart prior $W_G(\nu, A)$ (Atay-Kayis and Massam, 2005), whose density function is given by
$$\pi(\Omega \mid G) = I_G(\nu, A)^{-1} \det(\Omega)^{(\nu - 2)/2} \exp\{-\mathrm{tr}(\Omega A)/2\}\, I(\Omega \in P_G),$$
where $I_G(\nu, A)$ is the normalizing constant. There are four hyperparameters in the proposed hierarchical $G$-Wishart prior: $C_\tau$, $R$, $\nu$ and $A$.
To obtain the desired asymptotic properties of posteriors, appropriate conditions on these hyperparameters will be introduced in Section 3.

Posterior
For Bayesian inference on the graph $G$ and the precision matrix $\Omega$, the joint posterior $\pi(\Omega, G \mid X_n)$ should be calculated. Due to the conjugacy of the $G$-Wishart prior, we have
$$\Omega \mid G, X_n \;\sim\; W_G(\nu + n, \, A + X_n^T X_n).$$
The posterior samples of $(G, \Omega)$ can be obtained from $\pi(G \mid X_n)$ and $\pi(\Omega \mid G, X_n)$ in turn. Because the marginal posterior $\pi(G \mid X_n)$ is available only up to an unknown normalizing constant, Markov chain Monte Carlo (MCMC) methods such as the Metropolis-Hastings (MH) algorithm should be adopted.
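As a quick sanity check of this conjugate update in the special case of a complete graph $G$, where $W_G(\nu, A)$ reduces to an ordinary Wishart distribution, one could sample from the posterior with scipy; the parametrization mapping in the comments is our translation of the density convention used in this paper, not an official recipe from the method.

```python
# Sketch: conjugate posterior sampling for a COMPLETE graph G. In the
# paper's parametrization (density prop. to det(Omega)^{(nu-2)/2}
# exp{-tr(Omega A)/2}), W_G(nu, A) equals scipy's Wishart with
# df = nu + p - 1 and scale = A^{-1}; hence the posterior
# W_G(nu + n, A + X^T X) maps to df = nu + n + p - 1 and
# scale = (A + X^T X)^{-1}.
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)
p, n, nu = 3, 50, 3
A = np.eye(p)
X = rng.standard_normal((n, p))            # stand-in data matrix

post = wishart(df=nu + n + p - 1, scale=np.linalg.inv(A + X.T @ X))
Omega_samples = post.rvs(size=1000, random_state=rng)
print(Omega_samples.mean(axis=0))          # Monte Carlo posterior mean
```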

Main results
In this section, we show asymptotic properties of the proposed Bayesian procedure in high-dimensional settings. Let $G_0 = (V, E_0)$ be the true graph, and let $P_{0,1}, \ldots, P_{0,h_0}$ and $S_{0,2}, \ldots, S_{0,h_0}$ be the corresponding cliques and separators in a perfect ordering. Let $\Omega_0 = (\Omega_{0,ij})$ and $\Sigma_0 = (\Sigma_{0,ij}) = \Omega_0^{-1}$ be the true precision and covariance matrices, respectively. We assume that the data were generated from the $p$-dimensional Gaussian graphical model with the true precision matrix $\Omega_0 \in P_{G_0}$, i.e., $X_1, \ldots, X_n \mid \Omega_0 \overset{i.i.d.}{\sim} N_p(0, \Omega_0^{-1})$.

For a given random vector $Y = (Y_1, \ldots, Y_p)^T$ and a subset $S \subseteq \{1, \ldots, p\}$, we denote by $\rho_{ij \mid S}$ the partial correlation between $Y_i$ and $Y_j$ given $\{Y_k : k \in S\}$. To obtain the desired asymptotic properties of posteriors, we assume conditions (A1)-(A3) for the true graph and partial correlations. In words (the precise statements are discussed below): (A1) the maximum clique size of $G_0$ is at most $R$; (A2) the partial correlations are bounded away from $\pm 1$, with $|\rho_{ij \mid S}| \le 1 - 1/(n \vee p)$; (A3) the nonzero partial correlations $\rho_{ij \mid S \setminus \{i,j\}}$ with $|S| \le R$ are bounded away from zero at a rate governed by a constant $C_\beta$.
Condition (A1) says that the size of cliques in the true graph $G_0$ is not too large, so that $G_0$ resides in the prior support. In fact, this condition is different from assuming a row-wise sparse precision matrix (Cai, Liu and Zhou, 2014). Note that if $\Omega_0$ has at most $R$ nonzero entries in each row, then condition (A1) is met, whereas the reverse is not true. Thus, condition (A1) is much weaker than assuming a row-wise sparse precision matrix. Condition (A2) implies that the $i$th and $j$th variables have an imperfect linear relationship. Although $1 - 1/(n \vee p)$ is used as an upper bound for simplicity, a more general upper bound, $1 - 1/(n \vee p)^c$ for some constant $c > 0$, can be used with a proper change in the lower bound of $C_\beta$ in Theorems 3.1 and 3.2. Let $\min_{S \subseteq [p], |S| \le p} |\rho_{ij \mid S \setminus \{i,j\}}|$ be the minimum partial correlation; it is nonzero whenever $(i, j) \in E_0$ in a decomposable graph $G_0$ (Niu et al., 2019). Condition (A3) gives a lower bound for the nonzero partial correlations $\rho_{ij \mid S \setminus \{i,j\}}$ with $|S| \le R$ rather than $|S| \le p$. In our theory, this condition plays the role of the beta-min condition in the high-dimensional regression literature, which is essential to obtain selection consistency results (Martin et al.; Cao et al.).
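To fix ideas about the quantity $\rho_{ij \mid S}$, here is a small numpy sketch (the function name and example are ours): the partial correlation can be read off the inverse of the covariance submatrix over $\{i, j\} \cup S$.

```python
# Sketch: partial correlation rho_{ij|S} from a covariance matrix.
import numpy as np

def partial_correlation(Sigma, i, j, S):
    """Partial correlation of variables i and j given the index set S,
    via the precision matrix of the subvector indexed by {i, j} U S."""
    idx = [i, j] + [k for k in S if k not in (i, j)]
    K = np.linalg.inv(Sigma[np.ix_(idx, idx)])  # precision of subvector
    return -K[0, 1] / np.sqrt(K[0, 0] * K[1, 1])

# Example: AR(1) covariance, where Y_0 and Y_2 are conditionally
# independent given Y_1, so rho_{02|{1}} is (numerically) zero.
p = 4
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
print(partial_correlation(Sigma, 0, 2, [1]))
```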
(P1) Assume that $\nu$ and $C_\tau$ are fixed constants such that $\nu > 2$ and $C_\tau > 0$, respectively. Further assume that $R = C_r\, n / \log(n \vee p)$ and $A = g X_n^T X_n$, where $g \asymp (n \vee p)^{-\alpha}$ for some constants $C_r > 0$ and $\alpha > 0$.
Here, "P" stands for "prior". Condition (P1) is a sufficient condition for hyperparameters to guarantee the desired asymptotic properties of posteriors. Together with condition (A1), R = C r n/ log(n ∨ p) implies that the size of cliques in the true graph G 0 is at most of order n/ log(n ∨ p).
In the high-dimensional linear regression literature, it has been commonly assumed that the number of nonzero coefficients is bounded above by $n / \log p$ up to some constant. By choosing the scale matrix $A = g X_n^T X_n$, our prior can be seen as an inverse of the hyper-inverse Wishart $g$-prior (Carvalho and Scott, 2009). Niu et al. (2019) used a similar prior with $g = n^{-1}$, as suggested by Carvalho and Scott (2009). Note that the hyperparameter $g$ serves as a penalty term for adding false edges in graphs; thus, we essentially use a stronger penalty than Carvalho and Scott (2009) and Niu et al. (2019) if $\alpha > 1$.

Graph selection properties of posteriors
The first property is consistency of pairwise Bayes factors based on $G$-Wishart priors. Consider the hypothesis testing problem $H_0 : G = G_0$ versus $H_1 : G = G_1$ for some decomposable graph $G_1 \neq G_0$. If we use the priors $\Omega \sim W_{G_0}(\nu, A)$ and $\Omega \sim W_{G_1}(\nu, A)$ under $H_0$ and $H_1$, respectively, we support either $H_0$ or $H_1$ based on the Bayes factor $B_{10}(X_n) := f(X_n \mid G_1) / f(X_n \mid G_0)$. In general, for a given threshold $C_{th} > 0$, we support $H_1$ if $\log B_{10}(X_n) > C_{th}$, and support $H_0$ otherwise. Theorem 3.1 shows that we can consistently support the true hypothesis $H_0 : G = G_0$ based on the pairwise Bayes factor $B_{10}(X_n)$ for any $G_1 \neq G_0$.
Theorem 3.1 (Pairwise Bayes factor consistency) Assume that conditions (A1)-(A3) and (P1) hold with $C_\beta > 2\alpha + 12$ and $\alpha > 5/2$. Then we have $B_{10}(X_n) \longrightarrow 0$ in $P_0$-probability as $n \to \infty$ for any decomposable graph $G_1 \neq G_0$.

For the rest of this section, we consider the hierarchical $G$-Wishart prior described in Section 2.3. Theorem 3.2 shows what we call posterior ratio consistency. Note that the consistency of pairwise Bayes factors does not guarantee posterior ratio consistency, and vice versa. As a by-product of Theorem 3.2, it can be shown that the posterior mode, $\hat{G} = \operatorname{argmax}_G \pi(G \mid X_n)$, is a consistent estimator of the true graph $G_0$.
Specifically, the hyperparameter $q$ in the method of Niu et al. (2019) should be set at $q \asymp \exp(-C_q n^{\gamma})$ for some constants $C_q > 0$ and $\max(0, 1 - 4\alpha_1) < \gamma < 1 - \sigma - 2\lambda$ (where $\alpha_1$, $\sigma$ and $\lambda$ are constants appearing in their conditions) to achieve posterior ratio consistency.
On the other hand, the hyperparameters in our method do not depend on unknown quantities as long as condition (A1) is met.
Next, we show the strong graph selection consistency, which is much stronger than the posterior ratio consistency. To prove Theorem 3.3, we require conditions (B3) and (P2) below instead of conditions (A3) and (P1). Condition (B3) strengthens (A3) by imposing a larger lower bound on the nonzero partial correlations, and condition (P2) is as follows: assume that $\nu$ and $C_\tau$ are fixed constants such that $\nu > 2$ and $C_\tau > 0$, respectively, and further assume that $R = C_r \{n / \log(n \vee p)\}^{\xi}$ and $g \asymp (n \vee p)^{-R\alpha}$ for some constants $C_r > 0$, $0 \le \xi \le 1$ and $\alpha > 0$.
Condition (B3) gives a larger lower bound for the nonzero partial correlations than condition (A3). Condition (P2) implies that we further restrict the size of cliques in the true graph and use a stronger penalty for adding false edges. Note that if we assume that the sizes of cliques are bounded above by a constant $C_r$, i.e., assuming $\xi = 0$, then condition (B3) is essentially equivalent to (A3) in terms of the rate. Theorem 3.3 (Strong graph selection consistency) states that if conditions (B3) and (P2) hold with $C_\beta > 2(\alpha + 2)$ and $\alpha > 1$, then $\pi(G_0 \mid X_n) \longrightarrow 1$ in $P_0$-probability as $n \to \infty$. Niu et al. (2019) also obtained the strong graph selection consistency under slightly stronger conditions than those they used to prove the posterior ratio consistency. However, their result holds only when $p = o(n^{1/3})$, which does not include the ultra high-dimensional setting $p \gg n$.

Posterior convergence rate for precision matrices
In this section, we establish the posterior convergence rate for high-dimensional precision matrices under the matrix $\ell_1$-norm using the proposed hierarchical $G$-Wishart prior. To obtain the posterior convergence rate, we further assume two conditions: (B4), a row-wise sparsity assumption with sparsity level $s_0$, and (B5), the well-known bounded eigenvalue condition for $\Omega_0$ (with the eigenvalues of $\Omega_0$ lying between $\epsilon_0$ and $\epsilon_0^{-1}$, say). Condition (B4) is different from condition (A1); if $s_0 \le R$, then condition (B4) implies condition (A1). Conditions similar to (B5) can be found in Ren et al. (2015), Banerjee and Ghosal (2015) and Liu and Martin (2019). Recently, Liu and Martin (2019) obtained the posterior convergence rate for precision matrices under the Frobenius norm without a beta-min condition like condition (B3). However, they assumed a moderate high-dimensional setting, $p + |G_0| = o(n / \log p)$, which is much more restrictive than conditions (B1) and (B4). Theorem 3.4 gives the posterior convergence rate of the hierarchical $G$-Wishart prior under the matrix $\ell_1$-norm in high-dimensional settings, including $p \gg n$: it states that
$$E_0\, \pi\Big[ \|\Omega - \Omega_0\|_1 \ge M s_0 \{\log(n \vee p)/n\}^{1/2} \;\Big|\; X_n \Big] \longrightarrow 0 \qquad (4)$$
as $n \to \infty$ for some constant $M > 0$.
Using the $G$-Wishart prior, Xiang et al. (2015) obtained a larger posterior convergence rate, $s_0^{5/2} \{\log(n \vee p)/n\}^{1/2}$, for a precision matrix $\Omega_0 \in P_{G_0}$, where $G_0$ is decomposable and known. Banerjee and Ghosal (2014) derived the same posterior convergence rate for banded precision matrices. It was unclear whether the posterior convergence rate $s_0^{5/2} \{\log(n \vee p)/n\}^{1/2}$ based on the $G$-Wishart prior could be improved. Our result reveals that this rate can indeed be improved, even when the true graph $G_0$ is unknown.
The posterior convergence rate obtained in Theorem 3.4 coincides with the minimax rate for high-dimensional sparse precision matrices. Note that (4) implies
$$E_0\, \pi\Big[ \|\Omega - \Omega_0\| \ge M s_0 \{\log(n \vee p)/n\}^{1/2} \;\Big|\; X_n \Big] \longrightarrow 0$$
as $n \to \infty$ for some constant $M > 0$, because $\max_{1 \le w \le \infty} \|A\|_w \le \|A\|_1$ for any symmetric matrix $A$ (Cai and Zhou, 2012b). Cai, Ren and Zhou (2016) showed that $s_0 \sqrt{\log p / n}$ is the minimax rate under the spectral norm over $HP(s_0, \epsilon_0^{-1})$, a class of sparse precision matrices. Thus, the posterior convergence rate of the hierarchical $G$-Wishart prior coincides with the minimax rate when $\log p \asymp \log n$.
It is worth mentioning that we consider a smaller parameter class than $HP(s_0, \epsilon_0^{-1})$. If we denote the parameter space satisfying conditions (A1), (A2), (B3), (B4) and (B5) as $\mathcal{B}(s_0, R, \epsilon_0^{-1})$ and assume $R = s_0$, then it holds that $\mathcal{B}(s_0, R, \epsilon_0^{-1}) \subset HP(s_0, \epsilon_0^{-1})$. Thus, the posterior convergence rate obtained in Theorem 3.4 might not be minimax for $\mathcal{B}(s_0, R, \epsilon_0^{-1})$. To prove that it is indeed minimax, one should show that the rate of a minimax lower bound for a carefully chosen subset of $\mathcal{B}(s_0, R, \epsilon_0^{-1})$ equals the posterior convergence rate in Theorem 3.4. However, since the main focus of this paper is the graph selection consistency and the posterior convergence rate, we leave this as future work.

Simulation I: Illustration of posterior ratio consistency

In this section, we illustrate the posterior ratio consistency result in Theorem 3.2 using a simulation experiment. First note that, for a complete graph $G$ with $q$ vertices, the normalizing constant in the $G$-Wishart prior has the explicit expression
$$I_G(\nu, A) = 2^{(\nu + q - 1)q/2}\, \Gamma_q\{(\nu + q - 1)/2\}\, \det(A)^{-(\nu + q - 1)/2}, \qquad (5)$$
where $\Gamma_q(\cdot)$ denotes the multivariate gamma function. As shown in Roverato (2000) and Banerjee and Ghosal (2014), for any decomposable graph $G$ with the set of cliques $\{C_1, \ldots, C_h\}$ and the set of separators $\{S_2, \ldots, S_h\}$, the following factorization holds:
$$I_G(\nu, A) = \frac{\prod_{j=1}^{h} I_{C_j}(\nu, A_{C_j})}{\prod_{j=2}^{h} I_{S_j}(\nu, A_{S_j})}, \qquad (6)$$
where $A_{C_j}$ denotes the submatrix of $A$ formed by the rows and columns indexed by $C_j$. Note that $I_{C_j}(\cdot, \cdot)$ and $I_{S_j}(\cdot, \cdot)$ can be computed using (5) because $C_j$ and $S_j$ are complete for any decomposable graph $G$. Further note that the explicit form of the marginal likelihood is given by
$$f(X_n \mid G) = (2\pi)^{-np/2}\, \frac{I_G(\nu + n, A + X_n^T X_n)}{I_G(\nu, A)}.$$
It then follows from (6) that, for any decomposable graph $G$,
$$f(X_n \mid G) = (2\pi)^{-np/2} \prod_{j=1}^{h} \frac{I_{C_j}(\nu + n, (A + X_n^T X_n)_{C_j})}{I_{C_j}(\nu, A_{C_j})} \prod_{j=2}^{h} \frac{I_{S_j}(\nu, A_{S_j})}{I_{S_j}(\nu + n, (A + X_n^T X_n)_{S_j})}. \qquad (7)$$
Therefore, we can use (7) and the prior (3) to compute the posterior ratio between any two decomposable graphs.
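Below is a minimal sketch of how (5)-(7) could be evaluated numerically, assuming the density parametrization above; the function names are ours, and the clique and separator lists are assumed to come from a junction-tree decomposition of $G$.

```python
# Sketch of (5)-(7): log normalizing constant of the G-Wishart and
# the log marginal likelihood of a decomposable graph. Names ours.
import numpy as np
from scipy.special import multigammaln

def log_I_complete(nu, A):
    """log I_C(nu, A) for a complete graph on q = A.shape[0] vertices,
    with density prop. to det(Omega)^{(nu-2)/2} exp{-tr(Omega A)/2}."""
    q = A.shape[0]
    d = (nu + q - 1) / 2.0
    return d * q * np.log(2.0) + multigammaln(d, q) \
        - d * np.linalg.slogdet(A)[1]

def log_I_G(nu, A, cliques, separators):
    """log I_G(nu, A) for a decomposable graph via (6); cliques and
    separators are lists of index lists from a perfect sequence."""
    out = sum(log_I_complete(nu, A[np.ix_(C, C)]) for C in cliques)
    out -= sum(log_I_complete(nu, A[np.ix_(S, S)]) for S in separators)
    return out

def log_marginal(X, nu, A, cliques, separators):
    """log f(X | G) via (7)."""
    n, p = X.shape
    return (-0.5 * n * p * np.log(2.0 * np.pi)
            + log_I_G(nu + n, A + X.T @ X, cliques, separators)
            - log_I_G(nu, A, cliques, separators))
```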
Next, we consider seven different values of $p$ ranging from 50 to 350 and fix $n = 150$. Then, for each fixed $p$, we construct a $p \times p$ covariance matrix with $\Sigma_{0,ij} = 0.5^{|i-j|}$ for $1 \le i, j \le p$, so that the inverse covariance matrix $\Omega_0 = \Sigma_0^{-1}$ possesses a banded structure; this is the so-called AR(1) model. The matrix $\Omega_0$ also gives us the structure of the true underlying graph $G_0$. Next, we generate $n$ random samples from $N_p(0, \Sigma_0)$ to construct our data matrix $X_n$, and set the hyperparameters as $A = p^{-2.01} X_n^T X_n$, $\nu = 3$ and $C_\tau = 0.5$. The above process ensures that all the assumptions in Theorem 3.2 are satisfied. We then examine the posterior ratio in four different cases by computing the logarithm of the posterior ratio between a "non-true" decomposable graph $G$ and $G_0$, $\log\{\pi(G \mid X_n) / \pi(G_0 \mid X_n)\}$, as follows (a sketch of the data-generating setup is given after the list below).
1. Case 1: $G$ is a supergraph of $G_0$ and the number of total edges of $G$ is exactly twice that of $G_0$, i.e., $|G| = 2|G_0|$.

2. Case 2: $G$ is a subgraph of $G_0$ and the number of total edges of $G$ is exactly half that of $G_0$, i.e., $|G| = \frac{1}{2}|G_0|$.

3. Case 3: $G$ is not necessarily a supergraph of $G_0$, but the number of total edges of $G$ is twice $|G_0|$.

4. Case 4: $G$ is not necessarily a subgraph of $G_0$, but the number of total edges of $G$ is half of $|G_0|$.

The logarithms of the posterior ratio for the various cases are provided in Figure 1. As expected, in all four cases the logarithm of the posterior ratio decreases as $p$ becomes large, thereby providing a numerical illustration of Theorem 3.2.
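The data-generating part of this experiment can be sketched as follows; the equally spaced grid of $p$ values is our assumption (the text specifies only seven values from 50 to 350), and the posterior-ratio computation itself would reuse (7) together with the graph prior (3).

```python
# Sketch of the Simulation I setup: AR(1) covariance with
# Sigma_{0,ij} = 0.5^{|i-j|} and n = 150; hyperparameters as in the text.
import numpy as np

rng = np.random.default_rng(1)
n = 150
for p in (50, 100, 150, 200, 250, 300, 350):         # assumed grid
    i = np.arange(p)
    Sigma0 = 0.5 ** np.abs(np.subtract.outer(i, i))  # AR(1) covariance
    Omega0 = np.linalg.inv(Sigma0)      # banded (tridiagonal) precision
    X = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)
    A = p ** (-2.01) * (X.T @ X)        # hyperparameter scale matrix
    nu, C_tau = 3, 0.5
    # ... construct the four "non-true" graphs and compute
    # log{pi(G | X) / pi(G_0 | X)} using (7) and the prior (3)
```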

Simulation II: Illustration of graph selection
In this section, we perform the graph selection procedure under the proposed hierarchical $G$-Wishart prior and evaluate its performance along with other competing methods. Recall that the marginal posterior for $G$ is given by $\pi(G \mid X_n) \propto \pi(G) f(X_n \mid G)$, which is available only up to an unknown normalizing constant. We thereby suggest the following MH algorithm for posterior inference:

1. Set the initial value $G^{(1)}$.

2. For $t = 1, 2, \ldots$: (a) sample a candidate decomposable graph $G^{new}$ from the proposal kernel $q(\cdot \mid G^{(t)})$; (b) set $G^{(t+1)} = G^{new}$ with probability $\min\Big\{1, \frac{\pi(G^{new} \mid X_n)\, q(G^{(t)} \mid G^{new})}{\pi(G^{(t)} \mid X_n)\, q(G^{new} \mid G^{(t)})}\Big\}$, and $G^{(t+1)} = G^{(t)}$ otherwise.
In Step 2(a) above, we verify whether the graph resulting from a local perturbation of the current graph is still decomposable by accepting only those moves that satisfy the two conditions outlined in Green and Thomas (2013) on the junction tree representation of the proposed graph. The proposal kernel $q(\cdot \mid G)$ is chosen such that a new graph $G^{new}$ is sampled either by changing a randomly chosen nonzero entry in the lower triangular part of the adjacency matrix of $G$ to 0, with probability 0.5, or by changing a randomly chosen zero entry to 1, with probability 0.5.
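A hedged sketch of one step of this sampler is given next, using networkx's chordality test as a stand-in for the junction-tree conditions of Green and Thomas (2013); `log_post` is a hypothetical placeholder for the unnormalized log marginal posterior $\log\{\pi(G) f(X_n \mid G)\}$, and for simplicity the proposal-ratio correction $q(\cdot \mid \cdot)$ is omitted.

```python
# Sketch of Step 2: toggle one edge, keep only decomposable (chordal)
# candidates, and accept by a (simplified) Metropolis-Hastings ratio.
import math
import random
import networkx as nx

def propose(G):
    """Toggle a uniformly chosen edge: delete with prob. 0.5, else add."""
    G_new = G.copy()
    if random.random() < 0.5 and G.number_of_edges() > 0:
        G_new.remove_edge(*random.choice(list(G.edges())))
    else:
        non_edges = list(nx.non_edges(G))
        if non_edges:
            G_new.add_edge(*random.choice(non_edges))
    return G_new

def mh_step(G, log_post):
    """One MH step restricted to the decomposable graph space."""
    G_new = propose(G)
    if not nx.is_chordal(G_new):     # stay inside the decomposable class
        return G
    if math.log(random.random()) < log_post(G_new) - log_post(G):
        return G_new
    return G
```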
Following the simulation settings in Yuan and Lin (2007) and Friedman et al. (2007), we consider two different structures of the true graph, which correspond to two sparsity patterns of the true inverse covariance matrix.
For each model, we consider three different values of $p$ ranging from 50 to 250 and fix $n = 150$. Next, under each combination of the true precision matrix and the dimension, we generate $n$ observations from $N_p(0, \Sigma_0)$. We will refer to our proposed method as the hierarchical $G$-Wishart (HGW) based graph selection. The performance of HGW is then compared with existing methods including the graphical lasso (GLasso) (Friedman et al., 2007), the constrained $\ell_1$-minimization for inverse matrix estimation (CLIME) and the tuning-insensitive approach for optimally estimating Gaussian graphical models (TIGER) (Liu and Wang). The tuning parameters for GLasso and TIGER were chosen by the StARS criterion, a stability-based method for choosing the regularization parameter in high-dimensional inference for undirected graphs (Liu et al., 2010). The penalty parameter for CLIME was selected by 10-fold cross-validation. For HGW, the hyperparameters were set at $A = \delta^{-1} p^{-2-\delta} X_n^T X_n$ with $\delta = 0.01$, $\nu = 3$ and $C_\tau = 0.5$. The initial state for $G$ was chosen by GLasso. The results show that HGW overall performs better than the regularization methods. Our method performs particularly well in the sparse AR(1) model; this is because the consistency conditions of HGW are easier to satisfy under sparse settings. Generally speaking, the proposed method is able to achieve better specificity and precision, while the regularization methods have better sensitivity. The poor specificity of the regularization methods is in accordance with previous work demonstrating that selection of the regularization parameter using cross-validation is optimal with respect to prediction but tends to include more noise predictors compared with Bayesian methods. Overall, our simulation studies indicate that the proposed method can perform well under a variety of configurations with different dimensions, sparsity levels and correlation structures.

Discussion
In this paper, we assume that the true graph is decomposable. An open problem is whether this decomposability condition can be relaxed. We assume that the support of the prior is a subset of all decomposable graphs mainly for technical reasons: by focusing on decomposable graphs, the normalizing constants of posteriors are available in closed forms, which allows us to calculate upper and lower bounds of a posterior ratio. It is unclear to us whether this decomposability condition can be removed. Without this condition, general techniques for obtaining posterior convergence rates, for example Theorem 8.9 in Ghosal and Van der Vaart (2017), might be needed. Banerjee and Ghosal (2015) used this technique to prove the posterior convergence rate for sparse precision matrices under the Frobenius norm. However, it might be difficult to obtain the posterior convergence rate under the matrix $\ell_1$-norm using arguments similar to those in Banerjee and Ghosal (2015). Let $\epsilon_n$ and $\tilde{\epsilon}_n$ be the posterior convergence rates for precision matrices under the matrix $\ell_1$-norm and Frobenius norm, respectively, where $\epsilon_n \ll \tilde{\epsilon}_n$. Then, one can see that it is much more difficult to prove the prior thickness condition (condition (i) of Theorem 8.9 in Ghosal and Van der Vaart (2017)) using $\epsilon_n$. Therefore, we suspect that the arguments in Banerjee and Ghosal (2015) cannot be directly applied to our setting.

Proofs
Proof of Theorem 3.1 If $G \neq G_0$, then either $G_0 \subsetneq G$ or $G_0 \not\subset G$. We first focus on the case $G_0 \subsetneq G$.
By Lemma 2.22 in Lauritzen (1996), there exists a sequence of decomposable graphs $G_0 = G_0, G_1, \ldots, G_{k-1}, G_k = G$ in which consecutive graphs differ by exactly one edge. The Bayes factor can then be factorized along this sequence. For $l = 1, \ldots, k$, let $(i_l, j_l)$ be the added edge in the move from $G_{l-1}$ to $G_l$, and let $S_l$ be the separator which separates the two cliques containing $i_l$ and $j_l$ in $G_{l-1}$. Note that $\rho_{i_l j_l \mid S_l} = 0$ for any $l = 1, \ldots, k$ by Lemma D.4 in Niu et al. (2019).
which is of order $o(1)$ for any constant $C_1 > 4 + \epsilon$ and any sufficiently small constant $\epsilon > 0$.
Therefore, we can restrict ourselves to the event $\cap_{l=1}^{k} N_l(C_1)^c$. Because $\nu > 2$ and $\alpha > 5/2$, the resulting bound holds, where the first inequality follows from Lemma C.1 in Niu et al. (2019). The last expression is of order $o(1)$ by choosing a constant $C_1$ arbitrarily close to 4. Thus, the desired convergence holds for any $G_0 \subsetneq G$ as $n \to \infty$. Now we consider the case $G_0 \not\subset G$. By the same arguments as in the previous paragraph, the corresponding terms can be bounded. Again by Lemma 2.22 in Lauritzen (1996), there exists a sequence of decomposable graphs, starting from $G$, in which consecutive graphs differ by exactly one edge. For $l = 1, \ldots, k$, let $(i_l, j_l)$ be the added edge in the move from $G_{l-1}$ to $G_l$, and let $S_l$ be the separator which separates the two cliques containing $i_l$ and $j_l$ in $G_{l-1}$. Note that $(i_l, j_l) \in E_0$ for any $l = 1, \ldots, k$. Similar to the case $G_0 \subsetneq G$, we only need to focus on the event $\cap_{l=1}^{k} N_l(C_1)^c$ for some constant $C_1 > 12 + \epsilon$ and sufficiently small constant $\epsilon > 0$, by Corollary A.1 in Niu et al. (2019), where the last inequality follows from condition (A2). On the other hand, in the remaining bound, the first and last inequalities follow from Lemma B.1 in Niu et al. (2019) and condition (A3), respectively. The last expression is of order $o(1)$ by choosing a constant $C_1$ arbitrarily close to 12. Thus, the desired convergence holds for any $G_0 \not\subset G$ as $n \to \infty$, which completes the proof.
Proof of Theorem 3.2 Similar to the proof of Theorem 3.1, we consider two cases: $G_0 \subsetneq G$ and $G_0 \not\subset G$. Compared to the ratio of marginal likelihoods in Theorem 3.1, we only need to control the additional prior ratio term.
If $G_0 \subsetneq G$, we focus on the event $\cap_{l=1}^{k} N_l(C_1)^c$ defined in the proof of Theorem 3.1. Then, by the proof of Theorem 3.1, the resulting bound is of order $o(1)$ by choosing $C_1$ arbitrarily close to 4, because $\alpha + C_\tau > 5/2$.
If $G_0 \not\subset G$, we focus on the event $\cap_{l=1}^{k} N_l(C_1)^c$ defined in the proof of Theorem 3.1. Then, by the proof of Theorem 3.1, the resulting bound is of order $o(1)$ by choosing $C_1$ arbitrarily close to 12, because $C_\beta > 2(\alpha + C_\tau) + 16$.
Proof of Theorem 3.3 Let $N_{ijS}(C_1) = N_{ijS,1}(C_1) \cup N_{ijS,2}(C_1)$. Then, by Corollary A.1 in Niu et al. (2019), the probability of $\cup_{(i,j,S) \in I_d} N_{ijS}(C_1)$ is of order $o(1)$ if we take the constant $C_1$ such that $C_1 > 2 + \epsilon$ for any sufficiently small constant $\epsilon > 0$. Therefore, we restrict ourselves to the event $\cap_{(i,j,S) \in I_d} N_{ijS}(C_1)^c$ in the rest of the proof.
The first term in (8) is bounded above by a term of order $o(1)$, by taking a constant $C_1$ arbitrarily close to 2, because $g = (n \vee p)^{-R\alpha}$ and $\alpha > 1$.
Now we focus on the second term in (8). By arguments similar to those used in the proof of Theorem 3.1, together with condition (B3), the required bounds hold, leading to a decomposition into the two terms (9) and (10). The first term (9) is of order $o(1)$ because $C_\beta > 2(\alpha + 2)$. The second term (10) is also bounded above by a term of order $o(1)$, which completes the proof.
Proof of Theorem 3.4 For any matrix $A$ and clique $P$, let $A_{(\cdot, j)}$ denote the $j$th column of $A$, and let $(A_P)^0 = (A^0_{(i,j)}) \in \mathbb{R}^{p \times p}$ with $A^0_{(i,j)} = A_{(i,j)}$ for $i, j \in P$ and $A^0_{(i,j)} = 0$ otherwise. Furthermore, $\|A\| := \sup_{x \in \mathbb{R}^p, \|x\|_2 = 1} \|Ax\|_2$ denotes the spectral norm. For a given index $j \in [p]$, let $P^{(j)}_{0,1}, \ldots, P^{(j)}_{0,w_j}$ denote the cliques of $G_0$ containing $j$, and define the corresponding events $N_{nj}$; then, on the event $\cap_{1 \le j \le p} N_{nj}^c$, the desired bound holds. For any $j \in [p]$, the relevant quantity can be bounded in terms of $E^\pi(\Omega_{P^{(j)}_{0,l}} \mid X_n)$, the posterior mean of $\Omega_{P^{(j)}_{0,l}}$. By the property of the $G$-Wishart distribution, the posterior of $\Omega_P$ is a Wishart distribution for any clique $P$ (Roverato, 2000). Here, $W_q(\nu, A)$ denotes the Wishart distribution for $q \times q$ positive definite matrices $B$ with probability density proportional to $\det(B)^{(\nu - 2)/2} \exp\{-\mathrm{tr}(BA)/2\}$ (Roverato, 2000, page 652). Thus, we have an explicit expression for $E^\pi(\Omega_{P^{(j)}_{0,l}} \mid X_n)$ involving the factor $n + \nu + |P^{(j)}_{0,l}|$ and the inverse of the corresponding submatrix of $X_n^T X_n$. For a given constant $C_\lambda > 0$, define the set
$$\tilde{N}_{nj}(C_\lambda) := \Big\{ X_n : \max_{1 \le l \le w_j} \big\| \big\{ (n^{-1} X_n^T X_n)_{P^{(j)}_{0,l}} \big\}^{-1} \big\| > C_\lambda / 3 \Big\};$$
then $\|E^\pi(\Omega_{P^{(j)}_{0,l}} \mid X_n)\| \le C_\lambda$ on the event $\tilde{N}_{nj}(C_\lambda)^c$. By Lemma B.6 in Lee and Lee (2018), the posterior probability inside the expectation in (11) is bounded above by
$$5^{|P^{(j)}_{0,l}|}\, e^{-c_1 (n+\nu) M^2 |P^{(j)}_{0,l}| \log(n \vee p)/n} + e^{-c_2 (n+\nu) M |P^{(j)}_{0,l}| \log(n \vee p)/n}$$
on the event $\tilde{N}_{nj}(C_\lambda)^c$, for some positive constants $c_1$ and $c_2$ depending only on $C_\lambda$. We note here that we are using a different parametrization for the Wishart and inverse Wishart distributions compared to Lee and Lee (2018). Moreover, by Lemma B.7 in Lee and Lee (2018) and condition (B5), $P_0\{\tilde{N}_{nj}(C_\lambda)\}$ is negligible for some large $C_\lambda$, because $\log p = o(n)$ and $(X_n^T X_n)_{P^{(j)}_{0,l}}$ follows a Wishart distribution. Note that (14) is bounded above by the desired rate for all sufficiently large $n$ and some constant $C_\lambda > 0$, where the last inequality follows from Lemma B.7 in Lee and Lee (2018). Also note that, on the event $\tilde{N}_{nj}(C_\lambda)^c$, $\|\{(n^{-1} X_n^T X_n)_{P^{(j)}_{0,l}}\}^{-1}\|$ can be controlled in terms of $\|n^{-1}(X_n^T X_n)_{P^{(j)}_{0,l}} - \Sigma_{0, P^{(j)}_{0,l}}\|$, where the last inequality follows from condition (B5). Since $n^{-1}(X_n^T X_n)_{P^{(j)}_{0,l}}$ concentrates around $\Sigma_{0, P^{(j)}_{0,l}}$, the remaining bound holds for some constants $c_1$ and $c_2$ depending on $M$ and $\epsilon_0$, by Lemma B.6 in Lee and Lee (2018). This completes the proof.