Efficient Gaussian graphical model determination under G-Wishart prior distributions

Abstract: This paper proposes a new algorithm for Bayesian model determination in Gaussian graphical models under G-Wishart prior distributions. We first review recent developments in sampling from G-Wishart distributions for given graphs, with a particular interest in the efficiency of the block Gibbs samplers and other competing methods. We generalize the maximum clique block Gibbs samplers to a class of flexible block Gibbs samplers and prove their convergence. This class of block Gibbs samplers substantially outperforms its competitors along a variety of dimensions. We next develop the theory and computational details of a novel Markov chain Monte Carlo sampling scheme for Gaussian graphical model determination. Our method relies on the partial analytic structure of G-Wishart distributions integrated with the exchange algorithm. Unlike existing methods, the new method requires neither proposal tuning nor evaluation of normalizing constants of G-Wishart distributions.


Introduction
The purpose of this paper is to introduce a new algorithm for improving the efficiency of existing methods for Bayesian Gaussian graphical model determination under G-Wishart priors. Let y = (y^(1), y^(2), ..., y^(p))′ be a p-dimensional random vector having a multivariate normal distribution N(0, Σ) with mean zero and covariance matrix Σ. Let Ω = (ω_ij)_{p×p} = Σ^{−1} be the inverse of the covariance matrix. Let G = (V, E) be an undirected graph, where V is a non-empty set of vertices and E is a set of undirected edges. We apply G to Ω to represent strict conditional independencies. Specifically, each vertex i ∈ V corresponds to y^(i), and each edge (i, j) ∈ E corresponds to ω_ij ≠ 0; y^(i) and y^(j) are conditionally independent if and only if ω_ij = 0, or equivalently, (i, j) ∉ E. The G-Wishart distribution [23, 1] is the conjugate prior for Ω when Ω is constrained by the graph G. A zero-constrained random matrix Ω has the G-Wishart distribution W_G(b, D) if its density is

p(Ω | b, D) = I_G(b, D)^{−1} |Ω|^{(b−2)/2} exp{−tr(DΩ)/2},  Ω ∈ M^+(G),   (1.1)

where b > 2 is the degrees-of-freedom parameter, D is a symmetric positive definite matrix, I_G(b, D) is the normalizing constant, namely,

I_G(b, D) = ∫_{M^+(G)} |Ω|^{(b−2)/2} exp{−tr(DΩ)/2} dΩ,

and M^+(G) is the cone of symmetric positive definite matrices with off-diagonal entries ω_ij = 0 whenever (i, j) ∉ E. For arbitrary graphs, the explicit formula for computing I_G(b, D) is given in equation (3.1). The G-Wishart distribution is used extensively for analyzing covariance structures in models of increasing dimension and complexity in biology [11], finance [4, 29], economics [30], epidemiology [6] and other areas.
Conditional on a specific graph G and an observed dataset Y = (y_1, ..., y_n) of sample size n, the posterior distribution of Ω is then

p(Ω | Y, G) ∝ |Ω|^{(b+n−2)/2} exp{−tr((D + S)Ω)/2},

that is, W_G(b + n, D + S), where S = Y Y′. To estimate Ω or any function of it, we need to sample from the G-Wishart distribution for any given graph. For decomposable graphs, Carvalho, Massam and West [3] proposed a direct and efficient method based on the perfect ordering of the cliques. For arbitrary graphs, Piccioni [19] developed distributional theory for the block Gibbs sampler using Bayesian iterative proportional scaling. Implementations of this theory have relied on maximum clique decomposition and large matrix inversions [13, 15], leading to the conclusion that Bayesian iterative proportional scaling is unsuitable for large problems, because enumerating all cliques is NP-hard and inverting large matrices is computationally expensive. Motivated by these limitations, several other methods [28, 15, 6] took a different approach based on the theoretical innovations for non-decomposable graphical models developed in Atay-Kayis and Massam [1]. However, one key computational bottleneck of these methods is the matrix completion step required for every update of Ω. The matrix completion is conducted iteratively, with time complexity O(p²) for completing one non-free element. With increasingly large problems, each update becomes increasingly burdensome. In Section 2.4, we revisit the block Gibbs sampler from a different yet more straightforward perspective that relies on the theory of a non-ordinary Gibbs sampler. We show that the class of block Gibbs samplers is indeed very broad. It not only includes the previously proposed approach based on maximum cliques as a special case, but also motivates a simple implementation that uses individual edges as components in the Gibbs sampler to avoid maximum clique enumeration. Through simulation experiments, we illustrate the flexibility and efficiency of the class of block Gibbs samplers as compared with existing methods.

When G is unknown, most of the methods for determining graphical structures operate directly on the graphical model space by treating Ω as a nuisance parameter and computing the marginal likelihood function over graphs (e.g., [11, 24, 13]). Specifically, the marginal likelihood function for any graph G is computed by integrating out Ω with respect to its prior (1.1),

p(Y | G) = (2π)^{−np/2} I_G(b + n, D + S) / I_G(b, D).   (1.3)

The ability to focus on the graph G alone allows for the development of various search algorithms to visit high probability regions of the graph space. Markov chain Monte Carlo (MCMC) methods are often outperformed by other stochastic search approaches [11, 24, 13]. The primary challenge in these approaches based on the marginal likelihood function is that computing I_G(b, D) and I_G(b + n, D + S) for non-decomposable graphs requires approximation. Two popular approximations are the Monte Carlo integration of Atay-Kayis and Massam [1] and the Laplace approximation of Lenkoski and Dobra [13]. Neither approximation has theoretical results for variance estimation, though both have been empirically shown to be successful in guiding the graphical model search when carefully implemented (e.g., [11], [13]).
Alternatively, there are a number of carefully designed MCMC methods for sampling over the joint space of graphs and precision matrices [9, 6]. A salient feature of these joint space methods is that they do not need posterior normalizing constants, whose approximation tends to be more numerically unstable than that of prior normalizing constants. However, all existing joint space samplers require the tuning of proposals for both across- and within-graph moves. Moreover, they do not remove the need for evaluating prior normalizing constants.
We argue that there are important situations where avoiding the approximation of p(Y | G) is preferred. First, when the size of the prime components is not restricted to be small or when the graphical decomposition is not conducted, the accuracy of the approximation methods can be hard to assess even empirically; see one example in Section 6. Second, graphical models are often embedded within a larger and more complicated class of models, such as the seemingly unrelated regression (SUR) models [27], the conditionally autoregressive (CAR) models [6], the mixture models [22] and the copula models [5]. In these models, p(Y | G) is typically unavailable in closed form even given the normalizing constant I_G. MCMC is routinely used for posterior computation, in which the normalizing constant approximation step often takes a substantial part of the run-time. Hence, a sampling method that avoids evaluating I_G can facilitate efficient posterior computation. In Section 5, we introduce one such method. Two key features make our algorithm efficient. The first is that we use the partial analytic structure [10] of G-Wishart distributions to automatically choose proposals for the reversible jump algorithm, yielding essentially Gibbs steps for both across- and within-graph moves. The other is that we use an exchange algorithm [17, 14] to remove the need for evaluating prior normalizing constants in a carefully designed MCMC sampling scheme. Through simulation experiments, we illustrate the accuracy of the proposed algorithm and highlight its scalability to large graphs. Through a real-world example, we further illustrate that the algorithm can be embedded in a larger MCMC sampler for fitting broader classes of multivariate models.

Accept-reject algorithm
Wang and Carvalho [28] proposed an accept-reject algorithm for sampling from the G-Wishart distribution (1.1). Write D^{−1} = T′T and Ω = Φ′Φ as Cholesky decompositions and define Ψ = ΦT^{−1}. Following the nomenclature of Atay-Kayis and Massam [1], the free elements of Φ are those φ_ij such that (i, j) ∈ E or i = j. We let Ψ^V = [ψ²_11, ..., ψ²_pp, {ψ_ij}_{(i,j)∈E, i<j}]. From Theorem 1 and equation (38) of Atay-Kayis and Massam [1], these free elements have density

p(Ψ^V) ∝ ∏_{i=1}^p ψ_ii^{b+ν_i−1} exp{−(1/2) ∑_{i≤j} ψ_ij²},   (2.1)

where ν_i = |{j : j > i, (i, j) ∈ E}|, and the non-free elements {ψ_rs : (r, s) ∉ E, r < s} appearing in the sum are uniquely determined functions of the free elements through the completion recursion (2.2) of Atay-Kayis and Massam [1], in which t_{⟨ij⟩} = t_ij/t_jj. Following the notation in Dobra, Lenkoski and Rodriguez [6], we rewrite (2.1) as

p(Ψ^V) ∝ f(Ψ^V) h(Ψ^V),   (2.3)

where f(Ψ^V) = exp{−(1/2) ∑_{(r,s)∉E, r<s} ψ_rs²} ≤ 1 is a function of the non-free elements of Ψ, which are uniquely determined by Ψ^V according to (2.2), and h(Ψ^V) is the density of the product of mutually independent chi-square and standard normal distributions. Based on the expression (2.3), Wang and Carvalho [28] suggested the following rejection sampling algorithm [21]:

1. Sample Ψ^V following Steps 1 and 2 in Section 4.3 of Atay-Kayis and Massam [1], and u ∼ U[0, 1].
2. Check whether u < f(Ψ^V). If this holds, accept Ψ^V as a sample from (2.1); if not, reject the value of Ψ^V and repeat the sampling step.
3. Construct a sample of Ω following Steps 3 and 4 in Section 4.3 of Atay-Kayis and Massam [1].
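The accept-reject structure above (propose the free elements from h, accept with probability f ≤ 1) can be sketched with a toy stand-in for the completion-based factor f(Ψ^V); here h is a standard normal and f(x) = exp(−x²/2), so the accepted draws follow the tilted target N(0, 1/2):

```python
import numpy as np

rng = np.random.default_rng(0)

def rejection_sample(h_draw, f, n):
    """Accept-reject with target density proportional to h * f, 0 <= f <= 1.

    Mirrors the structure of the Wang-Carvalho sampler: draw from h,
    accept with probability f (a toy stand-in for f(Psi^V))."""
    out = []
    while len(out) < n:
        x = h_draw()
        if rng.uniform() < f(x):
            out.append(x)
    return np.array(out)

# Toy example: h = N(0, 1), f(x) = exp(-x^2/2) <= 1, so the target is
# proportional to exp(-x^2), i.e. N(0, 1/2).
samples = rejection_sample(lambda: rng.normal(), lambda x: np.exp(-x**2 / 2), 20000)
print(round(samples.var(), 2))  # close to 0.5
```

As with the G-Wishart case, the smaller f is on average under h, the lower the acceptance rate, which is exactly the failure mode discussed next.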
Clearly, the acceptance rate depends on the triple (b, D, G). Dobra, Lenkoski and Rodriguez [6] showed that the acceptance rate can be as low as 10^{−8} for some (b, D, G).

Independent Metropolis-Hastings algorithm
Mitsakakis, Massam and Escobar [15] proposed an independent Metropolis-Hastings algorithm for sampling from the G-Wishart distribution based on the density (2.3). Their method generates a candidate (Ψ*)^V from h(Ψ^V); then, instead of accepting it with probability f{(Ψ*)^V} as in Wang and Carvalho [28], they correct the sample using a Metropolis-Hastings step. This results in the following modification of Step 2 above: accept the candidate (Ψ*)^V with probability min[1, f{(Ψ*)^V}/f(Ψ^V)]; otherwise retain the current Ψ^V. Although this method improves the acceptance rate over Wang and Carvalho [28] considerably, it still suffers from the same problem of rarely accepting new samples when graphs are large [6].

Random walk Metropolis-Hastings algorithm
The methods of Wang and Carvalho [28] and Mitsakakis, Massam and Escobar [15] both change all free elements Ψ^V in a single step. Furthermore, this change of Ψ^V does not depend on its current value. As a result, these methods may suffer from slow mixing and low acceptance rates. Dobra, Lenkoski and Rodriguez [6] proposed a random walk Metropolis-Hastings algorithm (RW) that perturbs only one element of Ψ^V in a single step, by drawing a random value from a normal distribution with mean equal to the current value and a pre-specified variance. The random walk algorithm improves efficiency over the methods of Wang and Carvalho [28] and Mitsakakis, Massam and Escobar [15] significantly. Nevertheless, the three methods in Sections 2.1-2.3, and any other methods that use the matrix completion (2.2), can be inefficient for large problems. To see this, note that the completion step (2.2) can only be conducted iteratively.
To complete ψ_rs, it involves calculating terms such as ∑_{j=i}^{s−1} ψ_ij t_{⟨js⟩} at an estimated time complexity O(rs). Although the exact computing time depends on (r, s), in general it requires O(p²) calculations to complete a typical non-free element ψ_rs with 1 ≤ r < s ≤ p. For a graph with the number of missing edges on the order of p², the matrix completion requires time complexity O(p⁴) for one update, which makes these methods unacceptably slow on large graphs, as shown in Section 2.5.

Block Gibbs sampler
Piccioni [19] presented a theoretical framework that allows the construction of a block Gibbs sampler for sampling from the natural conjugate prior for regular exponential families. When applied to graphical models, the block Gibbs sampler corresponds to Bayesian iterative proportional scaling. Suppose {C_1, ..., C_K} is the set of maximum cliques of an arbitrary graph G = (V, E). The Bayesian iterative proportional scaling method can be summarized as follows:

Bayesian iterative proportional scaling [19]. Given the current value Ω ∈ M^+(G) and the set of maximum cliques {C_1, ..., C_K}, for each j = 1, ..., K:

1. Sample A ∼ W(b, D_{C_j,C_j}).
2. Set Ω_{C_j,C_j} = A + Ω_{C_j,V∖C_j}(Ω_{V∖C_j,V∖C_j})^{−1}Ω_{V∖C_j,C_j}.

Lenkoski and Dobra [13] and Mitsakakis, Massam and Escobar [15] implemented this method using the output of maximum clique enumeration algorithms. They pointed out that Bayesian iterative proportional scaling has two limitations: it requires maximum clique enumeration, which is NP-hard and lacks good algorithms; and it involves a series of large matrix inversions in Step 2, even for small cliques, which is computationally expensive. Although it is unclear whether the maximality condition is necessary in applying the general results of Piccioni [19] to G-Wishart distributions, we are able to provide and justify a class of block Gibbs samplers directly from Gibbs theory.
We first review a non-ordinary but theoretically valid Gibbs sampler. Suppose θ is a random vector that can be partitioned into p subvectors, θ = (θ_1, ..., θ_p), and p(θ) is the target distribution. Consider a collection of index sets I = {I_1, ..., I_K}, where I_k ⊆ {1, ..., p} for k = 1, ..., K, such that ∪_{k=1}^K I_k = {1, ..., p}. Let θ_{I_k} denote the subset of elements of θ corresponding to the index set I_k. Step k of a Gibbs sampler then generates

θ_{I_k} ∼ p(θ_{I_k} | θ∖θ_{I_k}).

Note that the sets {θ_{I_k} : I_k ∈ I} may not be disjoint; thus some components of θ may be updated in multiple steps. This is not an ordinary Gibbs sampler, which updates each component of θ only once in one sweep. In our notation, an ordinary Gibbs sampler means that I is a partition of {1, ..., p}. However, updating elements of θ multiple times in one sweep does not cause any theoretical problem: moving components in a step of a Gibbs sampler from being conditioned on to being sampled neither changes the invariant distribution of the chain nor destroys the compatibility of the conditional distributions. In fact, this technique can improve the convergence properties of the Gibbs sampler [26].
The above discussion shows that a non-ordinary Gibbs sampler has the target distribution as its stationary distribution. The following proposition is useful in proving that a Gibbs sampler is irreducible and aperiodic and hence will converge to its stationary distribution.

Proposition 2.1 (See, for example, Proposition 5 of [19]). Let p(θ) be the target distribution. Suppose θ can be partitioned into p subvectors, θ = (θ_1, ..., θ_p). Consider the collection of index sets I = {I_1, ..., I_K}, where I_k ⊆ {1, ..., p} for k = 1, ..., K, such that ∪_{k=1}^K I_k = {1, ..., p}. For the Gibbs sampler that simulates each component from p(θ_{I_k} | θ∖θ_{I_k}), if the marginal density p(θ∖θ_{I_k}) is bounded in θ∖θ_{I_k} for each k = 1, ..., K, then it is irreducible and aperiodic.

Using the above general formulation of a Gibbs sampler, we can design a class of Gibbs samplers for simulating from G-Wishart distributions and prove their convergence. We start with the construction of the Gibbs samplers that have the stationary distribution (1.1). Let Ω^V = [ω_11, ..., ω_pp, {ω_ij}_{(i,j)∈E, i<j}] be the set of free elements of Ω. Consider a collection of index sets I = {I_1, ..., I_K}, where I_k ⊆ V = {1, ..., p} for k = 1, ..., K, such that: (i) each subset I_k is complete, and (ii) ∪_{I_k∈I} Ω_{I_k,I_k} = Ω^V, where Ω_{I_k,I_k} is the submatrix of Ω corresponding to I_k. The two conditions for the construction of I are important. Condition (i) ensures that updating Ω_{I_k,I_k} can be carried out with the Wishart distribution, and condition (ii) ensures that all free elements of Ω can be updated. The Gibbs sampler then cycles through the collection of submatrices {Ω_{I_k,I_k} : I_k ∈ I}, drawing each submatrix Ω_{I_k,I_k} from its conditional distribution given all the other components of Ω:

Ω_{I_k,I_k} ∼ p(Ω_{I_k,I_k} | Ω∖Ω_{I_k,I_k}),   (2.4)

where Ω∖Ω_{I_k,I_k} represents all the components of Ω except for Ω_{I_k,I_k}. For any complete subset I_k of V, simulating Ω_{I_k,I_k} from (2.4) can be carried out as follows. Lemma 1 of Roverato [23] shows that

Ω_{I_k,I_k} − Ω_{I_k,V∖I_k}(Ω_{V∖I_k,V∖I_k})^{−1}Ω_{V∖I_k,I_k} ∼ W(b, D_{I_k,I_k}),   (2.5)

independently of the remaining components. Thus, we can first generate a Wishart random matrix A ∼ W(b, D_{I_k,I_k}) and then set Ω_{I_k,I_k} = A + Ω_{I_k,V∖I_k}(Ω_{V∖I_k,V∖I_k})^{−1}Ω_{V∖I_k,I_k}. Note that we use a different notation for a Wishart distribution with density (1.1) than Roverato [23]: we write W(b, D) where Roverato [23] used W(b + p − 1, D^{−1}).
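The update (2.4)-(2.5) can be sketched numerically. The sketch below assumes that the paper's W(b, D) on a d × d block (density proportional to |A|^{(b−2)/2} exp{−tr(DA)/2}) maps to scipy's Wishart with df = b + d − 1 and scale = D_{I,I}^{−1}; the 4-node circle graph is an illustrative choice:

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(1)

def block_update(Omega, D, b, I):
    """One step of the block Gibbs sampler (2.4)-(2.5) for a complete index
    set I: draw the Schur complement of Omega_{I,I} from a Wishart and add
    back the correction term, leaving all other entries of Omega fixed."""
    p = Omega.shape[0]
    I = np.asarray(I)
    R = np.setdiff1d(np.arange(p), I)          # the remaining indices V \ I
    d = len(I)
    # Assumed parameterization mapping: W(b, D_{I,I}) <-> scipy wishart with
    # df = b + d - 1 and scale = inv(D_{I,I}).
    A = wishart.rvs(df=b + d - 1, scale=np.linalg.inv(D[np.ix_(I, I)]), random_state=rng)
    A = np.atleast_2d(A)
    corr = Omega[np.ix_(I, R)] @ np.linalg.solve(Omega[np.ix_(R, R)], Omega[np.ix_(R, I)])
    Omega = Omega.copy()
    Omega[np.ix_(I, I)] = A + corr
    return Omega

# Edgewise sweep on a 4-node circle graph: edges (0,1),(1,2),(2,3),(0,3).
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
Omega = np.eye(4)
D, b = np.eye(4), 3.0
for _ in range(10):
    for e in edges:
        Omega = block_update(Omega, D, b, list(e))

print(np.all(np.linalg.eigvalsh(Omega) > 0))      # True: still in M+(G)
print(Omega[0, 2] == 0.0 and Omega[1, 3] == 0.0)  # True: non-edges stay zero
```

Because only the (I, I) block is overwritten and the Schur complement drawn is positive definite, the chain never leaves the cone M^+(G): the non-edge zeros are simply never touched.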
We next examine the convergence properties of the above Gibbs samplers by considering the bound of their marginal distributions, in order to apply Proposition 2.1.
Proof. From (2.5), the relevant marginal density can be written as a function f(X, Y); it suffices to show that f(X, Y) is bounded in (X, Y). Taking the derivative of f(X, Y) with respect to Y and then solving the first-order condition ∂f(X, Y)/∂Y = 0, we obtain Y = −XB, at which the density is bounded by the property of G-Wishart distributions.
Coupled with Proposition 2.1, Proposition 2.2 shows that the class of block Gibbs samplers is indeed irreducible and aperiodic. With this elaboration, we may summarize the class of block Gibbs samplers as follows. For an arbitrary graph G, the choice of the collection of index sets I need not be unique. For example, consider a 3-node complete graph with V = {1, 2, 3} and E = {(1, 2), (1, 3), (2, 3)}. Then the two collections {(1, 2, 3)} and {(1, 2), (1, 3), (2, 3)} can both be used as I. However, different choices of I lead to different configurations of the Gibbs sampler and hence different efficiencies. In the 3-node complete graph example, the choice I = {(1, 2, 3)} leads to a 1-step sampler that directly generates Ω from a Wishart distribution, while the choice I = {(1, 2), (1, 3), (2, 3)} implies a 3-step Gibbs sampler. Intuitively, to choose I, one would like K (the number of complete subsets) to be small and |I_k| (the dimension of each complete subset) to be large, in order to reduce the number of steps of the Gibbs sampler as well as the correlation between the Ω_{I_k,I_k}'s. The extreme case where I is the collection of the maximum cliques of G offers one such choice. This corresponds to the algorithm implemented by Lenkoski and Dobra [13], which requires an algorithm for maximum clique decomposition.
At the other extreme, one can choose I to be the union of the edge set E and the isolated node set I_0, that is, I = E ∪ I_0; we call the result the edgewise block Gibbs sampler. This choice of I = E ∪ I_0 does not require any clique decomposition and can be read directly from the graph. Our simulation studies show that the edgewise block Gibbs sampler is easy to implement and converges rapidly for sparse graphs. In cases where p is large, we obtain a fast updating scheme that avoids inverting a (p − 2) × (p − 2) matrix at Step 2 of the above algorithm, as follows. For any e = (i, j) ∈ E, we aim to compute (Ω_{V∖e,V∖e})^{−1} quickly. Note that

(Ω_{V∖e,V∖e})^{−1} = Σ_{V∖e,V∖e} − Σ_{V∖e,e}(Σ_{ee})^{−1}Σ_{e,V∖e},

and Σ_ee is only 2 × 2. To rapidly compute (Ω_{V∖e,V∖e})^{−1} using Σ, we only have to show that Σ can be updated quickly after Ω is updated at Step 2.
Suppose (Ω, Σ) is the current value and Ω_{ee,new} is a new value drawn by Step 2 of the block Gibbs sampler for edge e given (Ω, Σ). To update Ω, we set

Ω_new = Ω − UΔU′,

where Δ = Ω_ee − Ω_{ee,new} and U is a p × 2 matrix with the 2 × 2 identity matrix in rows i and j and zeros in the remaining entries. To update Σ, we apply the identity

(Ω − UΔU′)^{−1} = Σ + ΣU(Δ^{−1} − U′ΣU)^{−1}U′Σ,

and update Σ as

Σ_new = Σ + Σ_{V,e}(Δ^{−1} − Σ_{ee})^{−1}Σ_{e,V}.

All matrix inversions in the edgewise block Gibbs sampler are then inversions of 2 × 2 matrices, which can be done very efficiently.
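The low-rank update of Σ can be checked numerically; in the sketch below, the 6 × 6 matrix, the edge e and the 2 × 2 perturbation are arbitrary illustrations rather than draws from the actual sampler:

```python
import numpy as np

rng = np.random.default_rng(2)

# A random 6x6 precision matrix and its inverse.
p, e = 6, (1, 4)                       # hypothetical edge e = (i, j)
M = rng.normal(size=(p, p))
Omega = M @ M.T + p * np.eye(p)
Sigma = np.linalg.inv(Omega)

# U has the 2x2 identity in rows i and j; Delta = Omega_ee - Omega_ee_new.
U = np.zeros((p, 2))
U[e[0], 0] = U[e[1], 1] = 1.0
Omega_ee_new = Omega[np.ix_(e, e)] + np.array([[0.3, 0.1], [0.1, 0.2]])
Delta = Omega[np.ix_(e, e)] - Omega_ee_new

Omega_new = Omega - U @ Delta @ U.T    # only rows/columns i and j change

# Woodbury: (Omega - U Delta U')^{-1} = Sigma + Sigma U (Delta^{-1} - U' Sigma U)^{-1} U' Sigma,
# and U' Sigma U is just the 2x2 block Sigma_ee.
middle = np.linalg.inv(np.linalg.inv(Delta) - Sigma[np.ix_(e, e)])
Sigma_new = Sigma + Sigma[:, e] @ middle @ Sigma[e, :]

print(np.allclose(Sigma_new, np.linalg.inv(Omega_new)))  # True
```

Only 2 × 2 matrices are inverted, so one edge update costs O(p²) for the rank-2 correction rather than the O(p³) of a full inversion.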
The two extreme block Gibbs samplers, using maximum cliques or edges, illustrate a tradeoff between efficiency and ease of implementation. In practice, a more useful choice of I is, perhaps, a mix of large complete components and edges. For example, starting from the edgewise choice of I, one can merge any number of the I_k whose union forms a complete component, improving efficiency over the edgewise sampler. The two general conditions for the construction of I make the implementation of our block Gibbs sampler very flexible. Finally, we emphasize that the G-Wishart distribution is unimodal for any G. This important property ensures that the block Gibbs samplers typically converge rapidly and give reliable estimates.

Simulated experiments comparing samplers
To illustrate the computational aspects of the block Gibbs samplers, we compare both the edgewise and the maximum clique block Gibbs samplers with the random walk Metropolis-Hastings algorithm (RW) of Dobra, Lenkoski and Rodriguez [6], which has been shown to dominate the other methods above. We consider three types of graphs: a circle graph, a random graph and a two-clique graph. The G-Wishart parameters are b = 103 and B with zeros on the diagonal, with δ chosen so that the condition number of J is p. Here the condition number is defined as max(λ)/min(λ), where λ denotes the eigenvalues of J. The sparse circle graph was used in Dobra, Lenkoski and Rodriguez [6] under a set of different values of p with p ≤ 20. For the RW approach, let σ_m be the standard deviation of the normal proposal. Several initial runs under different combinations of (σ_m, p) ∈ {0.1, 0.5, 1, 2, 3} × {10, 20, 30} suggested that σ_m = 2 gives the best mixing results in all cases; hence we used σ_m = 2 in this simulation study. When updating one free element of Ψ^V, we completed only the non-free elements coming after the perturbed free element, using (2.2). One iteration entails updating all free elements Ψ^V once. For the maximum clique Gibbs sampler, we used the algorithm of Bron and Kerbosch [2] to produce all maximum cliques. For the two block Gibbs samplers, one iteration entails updating all components of the set I once. We saved 5000 iterations after discarding 2000 burn-in iterations for all three samplers. To measure efficiency, we recorded the total CPU run time and the lag of iterations required to obtain samples that could be practically treated as independent, measured as

L_ij = min{k : |ρ_ij(k)| < 2/√M},

where ρ_ij(k) is the autocorrelation at lag k for ω_ij and M is the total number of saved iterations. We also calculated the effective sample size

ESS_ij = M / {1 + 2 ∑_{k=1}^{L_ij−1} ρ_ij(k)},

where we cut off the sum at lag (L_ij − 1) to reduce noise from higher order lags [12]. Finally, we computed the percent error

PE_ij = 100 × |σ̂_ij − E(σ_ij)| / |E(σ_ij)|,

where E(σ_ij) is the theoretical expectation available in closed form (Corollary 2 of [23]) and σ̂_ij is the posterior mean estimate of σ_ij. We used Σ rather than Ω because only E(σ_ij) is analytically available for non-decomposable graphs. To summarize the L_ij, ESS_ij and PE_ij across different entries, we used their corresponding medians.
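The efficiency summaries above (truncation lag L and effective sample size) can be sketched as follows; the specific cutoff |ρ(k)| < 2/√M is a standard choice assumed here to stand in for the paper's exact definition of L_ij:

```python
import numpy as np

def ess(x):
    """Effective sample size M / (1 + 2 * sum_{k=1}^{L-1} rho(k)), with the
    sum truncated at the first lag L whose autocorrelation falls below
    2/sqrt(M) (an assumed truncation rule)."""
    x = np.asarray(x, dtype=float)
    M = len(x)
    xc = x - x.mean()
    acf = np.correlate(xc, xc, mode="full")[M - 1:] / (xc @ xc)  # acf[0] = 1
    below = np.nonzero(np.abs(acf[1:]) < 2.0 / np.sqrt(M))[0]
    L = below[0] + 1 if below.size else M
    return M / (1.0 + 2.0 * acf[1:L].sum())

rng = np.random.default_rng(3)
iid = rng.normal(size=5000)
rho = 0.9                           # AR(1) with strong autocorrelation
ar = np.empty(5000)
ar[0] = 0.0
for t in range(1, 5000):
    ar[t] = rho * ar[t - 1] + rng.normal()

print(ess(iid) > ess(ar))  # True: correlated draws carry less information
```

For the AR(1) chain the estimate is close to the theoretical M(1 − ρ)/(1 + ρ), which is the sense in which slow-mixing samplers deliver far fewer "practically independent" draws per iteration.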
The results based on 10 repetitions are given in Table 1. We report the mean over the 10 runs. The standard deviations around the mean are less than 5% for CPU, L and ESS, and less than 40% for PE. The programs are written in Matlab and run on a quad-CPU 3.33 GHz desktop running CentOS 5.0 Unix. The last two rows of Table 1 record the relative ESS of the two Gibbs samplers over RW after standardizing for CPU run time. As expected, the maximum clique block Gibbs sampler performs best in all scenarios in terms of computing time and mixing. For the circle and the random graphs, the edgewise sampler dominates RW in every dimension. Depending on p and G, the edgewise sampler gives a 5- to 150-fold reduction in run-time and around a 2- to 5-fold improvement in effective sample size, and produces substantially smaller percent errors. For the two-clique graph, the edgewise sampler mixes less well than the RW sampler; however, it is significantly faster and overall more efficient, as measured by relative ESS, for large problems. In summary, the RW sampler does not scale as well as the block Gibbs samplers as the dimension p grows. The maximum clique sampler is the most efficient sampler for G-Wishart distributions given the availability of maximum cliques. The edgewise sampler has excellent performance for sparse graphs and is easy to use.

Monte Carlo integration
Atay-Kayis and Massam [1] developed a Monte Carlo method to approximate the normalizing constant of the G-Wishart distribution, based on the decomposition described in Section 2.1. For a G-Wishart distribution W_G(b, D), the normalizing constant (3.1) can be expressed as a closed-form product of Gamma functions and powers of the diagonal entries t_ii, multiplied by the expectation E_h{f(Ψ^V)}, where ν_i is the number of neighbors of node i subsequent to it in the ordering of the vertices, h_i is the total number of neighbors of node i plus 1, and f(Ψ^V) is defined in (2.3). Because the distribution of Ψ^V is easy to sample from, it is straightforward to estimate the expectation part of (3.1) by Monte Carlo.
The Monte Carlo integration can be computationally expensive when a non-complete prime component is large or when it is used without graph decomposition, because it relies on the matrix completion (2.2) to evaluate the function f(Ψ^V). Moreover, the variance of the Monte Carlo estimator depends on the data, the graph and the order of the nodes, making it difficult to evaluate [11].

Laplace approximation
Lenkoski and Dobra [13] proposed a Laplace approximation to I_G(b, D), namely

Î_G(b, D) = exp{l(Ω̂)} (2π)^{(p+|E|)/2} |−H(Ω̂)|^{−1/2},

where l(Ω) = {(b − 2)/2} log |Ω| − tr(DΩ)/2, Ω̂ is the mode of W_G(b, D), p + |E| is the number of free elements, and H is the Hessian matrix of l with respect to the free elements, evaluated at Ω̂.
Theoretical evaluation of the Laplace approximation in Gaussian graphical models has yet to be undertaken, though Lenkoski and Dobra [13] empirically demonstrated its potential to facilitate computation in problems where cliques are restricted to be small, e.g., p ≤ 5. Intuitively, the accuracy of the Laplace approximation depends on the degree to which the density of Ω^V resembles a multivariate normal distribution. By comparing the margins of Ω to a normal distribution, Lenkoski and Dobra [13] empirically showed that the closeness increases as the degrees of freedom increase. Hence, they suggested using the computationally fast but less accurate Laplace approximation for the posterior normalizing constant, and the computationally expensive but more accurate Monte Carlo integration for the prior normalizing constant.
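The degrees-of-freedom effect can be seen in the scalar (p = 1) case, where I(b, D) is available exactly as Γ(b/2)(2/D)^{b/2}; the sketch below applies the mode-and-Hessian recipe to l(ω) = {(b − 2)/2} log ω − Dω/2 and compares:

```python
import math

def laplace_I(b, D):
    """Laplace approximation to the p = 1 G-Wishart normalizing constant
    I(b, D) = integral over w > 0 of w^{(b-2)/2} exp(-D w / 2) dw."""
    w_hat = (b - 2.0) / D                       # mode of l(w)
    l_hat = 0.5 * (b - 2.0) * math.log(w_hat) - 0.5 * D * w_hat
    neg_hess = (b - 2.0) / (2.0 * w_hat ** 2)   # -l''(w_hat)
    return math.exp(l_hat) * math.sqrt(2.0 * math.pi / neg_hess)

def exact_I(b, D):
    # Closed form: I(b, D) = Gamma(b/2) (2/D)^{b/2}
    return math.gamma(b / 2.0) * (2.0 / D) ** (b / 2.0)

for b in (5.0, 20.0, 100.0):
    print(b, round(laplace_I(b, 1.0) / exact_I(b, 1.0), 4))
```

The ratio approaches 1 as b grows, consistent with the empirical observation that the Laplace approximation is better suited to the posterior constant (degrees of freedom b + n) than to the prior constant (degrees of freedom b).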

Existing reversible jump samplers for graphical model determination
The normalizing constant approximation methods of Section 3 allow us to marginalize over Ω and work directly on the graphical model space. However, these approximations may be numerically unstable, especially for posterior normalizing constants. Alternatively, there are several special samplers that explore the joint space of graphs and precision matrices without integrating out Ω. Giudici and Green [9] proposed a reversible jump sampler for jointly simulating (G, Ω) for decomposable graphs. For both across- and within-graph moves, they update one entry of Σ at a time by proposing from an independent normal distribution with mean zero and an appropriately tuned variance. These moves require checking the positive definiteness constraint on Ω at each update. Dobra, Lenkoski and Rodriguez [6] designed another reversible jump sampler based on the re-parameterization (G, Ψ), where Ψ = ΦT^{−1} with Φ and T the Cholesky factors of Ω and (D + Y′Y)^{−1}, respectively. This sampler ensures the positive definiteness of Ω automatically; however, it requires the computationally intensive matrix completion and the tuning of proposals for both across- and within-graph moves. Furthermore, both Giudici and Green [9] and Dobra, Lenkoski and Rodriguez [6] still require the evaluation of prior normalizing constants.
In the following section, we improve the efficiency of the reversible jump sampler by first eliminating the need for proposal tuning and pilot runs, and then eliminating the need for evaluating prior normalizing constants.

Eliminating proposal tuning
We first present a new algorithm that requires neither matrix completion nor proposal tuning for either across- or within-graph moves. The central idea of our sampler is to exploit the partial analytic structure (PAS) of G-Wishart distributions for stochastic model selection. We briefly summarize the main feature of a reversible jump algorithm that uses the partial analytic structure [10]. Suppose there are K candidate models {M_k}_{k=1}^K, each associated with a likelihood p(y | θ_k, M_k) that depends on a set of unknown parameters θ_k. Consider a move from the current model M_k to a new model M_{k′}. Suppose there exists a subvector (θ_{k′})_U of the parameter θ_{k′} of the new model such that the full conditional p{(θ_{k′})_U | (θ_{k′})_{−U}, M_{k′}, y} is available in closed form, and, in the current model M_k, there exists an equivalent subset of parameters (θ_k)_{−U} with the same dimension as (θ_{k′})_{−U}. The PAS reversible jump algorithm uses a proposal distribution that sets (θ_{k′})_{−U} = (θ_k)_{−U} and draws (θ_{k′})_U from its full conditional, so that (θ_{k′})_U is effectively integrated out of the acceptance ratio. In summary, the PAS algorithm proceeds as follows:

PAS algorithm [10]. Given the current state (M_k, θ_k):

1. Update M_k. Note that this move also involves making changes to (θ_k)_U.
(a) Propose a new model M_{k′} from a proposal distribution q(M_{k′} | M_k).
(b) Accept the proposed model M_{k′} with probability

α = min{1, [p{y, (θ_{k′})_{−U} | M_{k′}} p(M_{k′}) q(M_k | M_{k′})] / [p{y, (θ_k)_{−U} | M_k} p(M_k) q(M_{k′} | M_k)]},

where each density in the ratio has the corresponding (θ)_U integrated out.

2. Update θ_k.
(a) If M_{k′} was accepted, update the parameters θ_{k′} using standard MCMC steps; otherwise, update the parameters θ_k using standard MCMC steps.
We now detail the PAS algorithm for sampling (G, Ω) from the full posterior distribution

p(Ω, G | Y) ∝ p(Y | Ω) p(Ω | G) p(G).

Consider two graphs G = (V, E) and G′ = (V, E′) that differ by one edge (i, j) only. With no loss of generality, suppose edge (i, j) ∈ E and E′ = E∖(i, j). In the notation of the PAS algorithm, we set (θ)_U = (ω_ij, ω_jj) and (θ)_{−U} = Ω∖(ω_ij, ω_jj), where the conditional posterior odds against the edge (i, j) are given by

odds = p{G′, Ω∖(ω_ij, ω_jj) | Y} / p{G, Ω∖(ω_ij, ω_jj) | Y}.   (5.3)

For the numerator, since ω_ij = 0 under G′, it is defined as

p{G′, Ω∖(ω_ij, ω_jj) | Y} = ∫ p(G′, Ω | Y) dω_jj.   (5.4)

Note that, from (2.5), the full conditional posterior of ω_jj is

(ω_jj − c) ∼ W(b + n, D_jj + S_jj),   (5.5)

where c = Ω_{j,V∖j}(Ω_{V∖j,V∖j})^{−1}Ω_{V∖j,j}. Let Ω⁰ equal Ω except for an entry 0 in positions (i, j) and (j, i), and an entry c in position (j, j). Then (5.4) can be analytically evaluated as

∫ p(G′, Ω | Y) dω_jj ∝ p(G′) I_{G′}(b, D)^{−1} |Ω⁰_{V∖j,V∖j}|^{(b+n−2)/2} exp{−tr{(D + S)Ω⁰}/2} I(b + n, D_jj + S_jj),   (5.6)

where I(b, D) is the normalizing constant of a scalar (p = 1) G-Wishart distribution W(b, D), that is, I(b, D) = Γ(b/2)(2/D)^{b/2}. For the denominator, since ω_ij ≠ 0 under G, it is defined as

p{G, Ω∖(ω_ij, ω_jj) | Y} = ∫ p(G, Ω | Y) dω_ij dω_jj.   (5.7)

To evaluate (5.7), we need the full conditional distribution of (ω_ij, ω_jj). Let e = (i, j) and write Ω_{ee|V∖e} = Ω_ee − Ω_{e,V∖e}(Ω_{V∖e,V∖e})^{−1}Ω_{V∖e,e}. From (2.5), the conditional posterior of (ω_ii, ω_ij, ω_jj) is determined by

Ω_{ee|V∖e} ∼ W(b + n, D_ee + S_ee).

Further conditioned on ω_ii, the full conditional distribution of (ω_ij, ω_jj) can then be obtained by applying the standard Wishart theory in the following proposition to the Wishart matrix Ω_{ee|V∖e}.

Proposition 5.1. Suppose a 2 × 2 random matrix A = (a_rs) follows the Wishart distribution W(h, B). Then the conditional density of (a_12, a_22) given a_11, together with its normalizing constant, is available in closed form.

Let Ω¹ be equal to Ω except for the entries of Ω_{e,V∖e}(Ω_{V∖e,V∖e})^{−1}Ω_{V∖e,e} in the positions corresponding to e. Applying Proposition 5.1 with A = Ω_{ee|V∖e}, h = b + n and B = D_ee + S_ee allows us to evaluate the density (5.7) analytically as a closed-form expression (5.8), in which a_11 is the first element of A = Ω_{ee|V∖e}. We plug (5.6) and (5.8) into (5.3) to provide the acceptance rate for a move from G to G′,

α = min{1, odds × q(G | G′)/q(G′ | G)},   (5.9)

where every term can be analytically evaluated.
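The analytic integration over ω_jj rests on the Schur determinant identity |Ω| = |Ω_{V∖j,V∖j}|(ω_jj − c), which makes ω_jj enter the joint density only through a Gamma-type integrand. A quick numerical check of the identity (the 5 × 5 positive definite Ω below is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

p, j = 5, 2
M = rng.normal(size=(p, p))
Omega = M @ M.T + p * np.eye(p)            # arbitrary positive definite matrix

rest = [k for k in range(p) if k != j]
# c = Omega_{j, V\j} (Omega_{V\j, V\j})^{-1} Omega_{V\j, j}
c = Omega[j, rest] @ np.linalg.solve(Omega[np.ix_(rest, rest)], Omega[rest, j])

lhs = np.linalg.det(Omega)
rhs = np.linalg.det(Omega[np.ix_(rest, rest)]) * (Omega[j, j] - c)
print(np.allclose(lhs, rhs))  # True
```

With this factorization, integrating (ω_jj − c)^{(b+n−2)/2} e^{−(D_jj+S_jj)ω_jj/2} over ω_jj > c produces exactly the scalar constant I(b + n, D_jj + S_jj) appearing in (5.6).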
We can summarize the PAS sampler for graphical model determination as follows:

Algorithm 1. Given the current state (G, Ω):

1. Update G. Note that this move also involves making changes to (ω_ij, ω_jj).
(a) Propose a new graph G′ = (V, E′), differing from G = (V, E) by only one edge, from the proposal distribution q(G′ | G). Without loss of generality, assume edge e = (i, j) ∈ E and E′ = E∖e.
(b) Accept G′ with probability α in (5.9).
(c) If G′ is accepted, set ω_ij = 0 and update ω_jj from (5.5). If G′ is rejected, update (ω_ij, ω_jj) from their full conditional distribution using Proposition 5.1.

2. Update Ω conditional on the most recent G using the block Gibbs sampler in Section 2.4.
In Step 1(a) of Algorithm 1, instead of randomly picking an edge and then correcting it by a Metropolis-Hastings step, we can scan through all (i, j) with i < j according to various deterministic or random schedules and update edge (i, j) as a Bernoulli random variable with the conditional posterior odds (5.3), which leads to a Gibbs sampler whose acceptance rate (5.9) is uniformly equal to 1. The main benefit of the above PAS algorithm is that the acceptance rate (5.9), with (ω_ij, ω_jj) integrated out, eliminates the need for across-graph proposal tuning. The new algorithm also uses the block Gibbs sampler for simulating from G-Wishart distributions at given graphs, eliminating the need for matrix completion and within-graph proposal tuning. However, it still requires the prior normalizing constant approximation.
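Turning the conditional posterior odds into a Bernoulli draw is a one-liner; the sketch below works on the log scale to avoid overflow (the log-odds value fed in is a hypothetical placeholder for the analytically evaluated ratio (5.3)):

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic function

rng = np.random.default_rng(5)

def edge_inclusion_prob(log_odds_against):
    """Inclusion probability implied by the posterior odds against the edge:
    if odds = p(absent | rest)/p(present | rest), then
    p(present | rest) = 1 / (1 + odds) = expit(-log(odds))."""
    return expit(-log_odds_against)

prob = edge_inclusion_prob(0.0)          # even odds -> probability 1/2
include_edge = rng.uniform() < prob      # the Bernoulli update for edge (i, j)
print(prob)  # 0.5
```

Sampling each edge indicator exactly from this conditional probability is what makes the scan a Gibbs sweep with acceptance probability uniformly equal to 1.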

Eliminating evaluation of prior normalizing constants
This section aims to circumvent the remaining computational bottleneck in Algorithm 1: the intractable prior normalizing constants. Our main tool is the double Metropolis-Hastings algorithm [14], an extension of the exchange algorithm [17] for simulating from distributions with intractable normalizing constants.
We start with a brief review of the exchange algorithm proposed by Murray, Ghahramani and MacKay [17]. Suppose data y are generated from the density p(y | θ) = Z(θ) −1 f (y | θ), where θ is the parameter and Z(θ) = ∫ f (y | θ)dy is a normalizing constant that depends on θ and is not analytically available. Suppose the prior for θ is p(θ). A standard Metropolis-Hastings (M-H) algorithm simulates from the posterior p(θ | y) ∝ p(θ)f (y | θ)/Z(θ) by proposing θ ′ from a proposal q(θ ′ | θ) and then accepting it with probability min{1, [p(θ ′ )f (y | θ ′ )q(θ | θ ′ )Z(θ)] / [p(θ)f (y | θ)q(θ ′ | θ)Z(θ ′ )]}, which depends on the ratio of two intractable normalizing constants. The exchange algorithm removes the need to evaluate Z by considering the augmented distribution p(θ, θ ′ , x | y) ∝ p(θ)f (y | θ)q(θ ′ | θ, y)f (x | θ ′ )/{Z(θ)Z(θ ′ )}, where q(θ ′ | θ, y) is an arbitrary distribution and x is an auxiliary variable. Marginally, the original distribution p(θ | y) is maintained. The exchange algorithm samples (θ, θ ′ , x) from the augmented distribution using a standard Metropolis-Hastings sampler. Operationally: The exchange algorithm [17]. Given the current state (θ, θ ′ , x): 1. Update (θ ′ , x) using a block Gibbs step.
2. (a) Propose exchanging θ and θ ′ . Note that this is a symmetric proposal.
(b) Accept the exchange with probability min{1, [p(θ ′ )f (y | θ ′ )q(θ | θ ′ )f (x | θ)] / [p(θ)f (y | θ)q(θ ′ | θ)f (x | θ ′ )]}. Comparing the acceptance rate of the exchange algorithm with that of the traditional M-H algorithm, we see that the exchange algorithm replaces the intractable normalizing constant ratio with an estimate from a single sample at each parameter setting: Z(θ)/Z(θ ′ ) ≈ f (x | θ)/f (x | θ ′ ) with x ∼ p(x | θ ′ ) (5.10), which provides some insight into why the exchange algorithm works. The use of the auxiliary variable x removes Z(θ) from the joint distribution; however, it requires an exact sampler for p(x | θ ′ ), which is not practical in many applications. Liang [14] proposed a double Metropolis-Hastings algorithm to avoid the need for exact samplers. This approach generates x from p(x | θ ′ ) using a product of Metropolis-Hastings updates starting at y, and yields the following extension of the exchange algorithm that does not require an exact sampler for x ∼ p(x | θ ′ ): Double M-H algorithm [14]. Given the current state (θ, θ ′ , x):

1. Propose θ ′ ∼ q(θ ′ | θ, y) and generate the auxiliary variable x from p(x | θ ′ ) using a product of Metropolis-Hastings updates starting at y.
2. Same as Step 2 in the exchange algorithm.
Two types of Metropolis-Hastings moves are performed for updating θ: one for generating the auxiliary variable x in Step 1(a) and the other for accepting θ ′ in Step 2; hence Liang [14] calls this a double Metropolis-Hastings algorithm. When x is generated exactly from f (x | θ ′ ) by an exact sampler, the double Metropolis-Hastings algorithm reduces to the exchange algorithm. When x is generated approximately by M-H methods, it can be viewed as an approximate exchange algorithm. In such cases, caution must be exercised about the convergence of the double M-H algorithm. Since the relationship (5.10) suggests that we use one sample of x to provide information about the global normalizing constant Z(θ), this sample must be generated in a way that considers the entire space of f (x | θ). An auxiliary variable x generated by an exact sampler considers the entire space of f (x | θ); however, an auxiliary variable x generated by a Markov chain with only a few M-H steps will be biased towards a local mode near the starting point. Choosing an M-H kernel K(• → •) that rapidly explores the global auxiliary variable space without being trapped by local modes is the key. We refer to Chapter 5 of Murray [16] for a discussion of using MCMC to generate the auxiliary variable x. We now extend the PAS algorithm of Section 5.1 by applying the double M-H algorithm to remove the need for prior normalizing constants. In the notation of the double M-H algorithm, let θ = G and y = Ω, and consider the augmented joint distribution proportional to p(Ω, G | Y )q(G ′ | Ω, G, Y )p(Ω ′ | G ′ ), where p(Ω, G | Y ) is the original target distribution (5.2), q(G ′ | Ω, G, Y ) is any distribution that proposes a graph G ′ differing by one edge (i, j) from G with (i, j) ∈ E and (i, j) / ∈ E ′ , and p(Ω ′ | G ′ ) is the density function of Ω ′ ∼ W G ′ (b, D). Marginally, the original posterior (5.2) is unaffected. We consider the following move types.
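The generic double M-H update described above can be sketched as follows, with `sample_aux` standing in for a few M-H or Gibbs steps targeting f (· | θ ′ ) started at y; all function names are illustrative, and the proposal is assumed symmetric so that the q terms cancel.

```python
import numpy as np

def double_mh_step(theta, y, log_f, log_prior, propose, sample_aux, rng):
    """One double Metropolis-Hastings update in the spirit of Liang [14].

    log_f(x, th)      : unnormalized log density log f(x | th)
    log_prior(th)     : log prior density of th
    propose(th, rng)  : symmetric proposal q(th' | th)
    sample_aux(th, x0): a few M-H / Gibbs steps targeting f(. | th),
                        started at x0 -- the inexact stand-in for an
                        exact sampler of p(x | th)."""
    theta_new = propose(theta, rng)
    x = sample_aux(theta_new, y)  # auxiliary variable drawn at the candidate
    # The intractable ratio Z(theta)/Z(theta') is replaced by the
    # single-sample estimate f(x | theta)/f(x | theta'), as in (5.10).
    log_r = (log_prior(theta_new) - log_prior(theta)
             + log_f(y, theta_new) - log_f(y, theta)
             + log_f(x, theta) - log_f(x, theta_new))
    if np.log(rng.random()) < log_r:
        return theta_new
    return theta
```

No normalizing constant is ever evaluated; only the unnormalized density f appears in the acceptance ratio.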
Move (1) generates G ′ directly from q(G ′ | Ω, G, Y ) and Ω ′ from p(Ω ′ | G ′ ) using a sequence of M-H steps starting from the current Ω. Notice that ω ij ≠ 0 while ω ′ ij = 0. Thus, starting at Ω\(ω ij , ω jj ), we first update (ω ij , ω jj ) from their conditional prior distributions under G ′ and then use m steps of the block Gibbs sampler to generate the auxiliary Ω ′ . Hence the product of M-H updates consists of m steps of the block Gibbs sampler applied to W G ′ (b, D). Thanks to the unimodal property of the G-Wishart distribution, the Gibbs kernels will properly cover the entire auxiliary space of Ω ′ ∼ W G ′ (b, D) without being biased towards a local mode near the starting point. As for the choice of m, Liang [14] suggested that only a small m (e.g. m = 1) is needed to obtain a good sample of x ∼ p(x | θ ′ ). In the examples analyzed in this paper, we found that one iteration of the block Gibbs sampler is sufficient to provide good mixing.
Move (2) is essentially a PAS step that proposes swapping G and G ′ , with (ω ij , ω jj ) and (ω ′ ij , ω ′ jj ) integrated out. The acceptance rate then factors into two parts: the first part is exactly the original acceptance rate (5.3), expressed in (5.9); the second part can be evaluated by making use of the full conditional distributions of (ω ′ ij , ω ′ jj ) under G ′ and G respectively. Let Ω ′0 = Ω ′ except for an entry 0 in the positions (i, j) and (j, i), and let Ω ′1 = Ω ′ except for entries of Ω ′ e,V \e (Ω ′ V \e,V \e ) −1 Ω ′ V \e,e in the positions corresponding to the edge e = (i, j). It is straightforward to show the corresponding analytic expressions. Collecting all terms together, we obtain the acceptance rate (5.11) of a move from G to G ′ , where all terms are analytically available. Comparing the acceptance rate (5.11) of the double M-H algorithm to the acceptance rate (5.9) of the original PAS sampler, we see that the double M-H algorithm replaces the intractable prior normalizing constant ratio with an unbiased estimate based on a single sample from the prior.
This gives an interpretation of why the double M-H algorithm works. Finally, Move (3) generates Ω conditional on the graph from Move (2) using the block Gibbs sampler. We may summarize the algorithm as follows. Algorithm 2. Given the current state (G, Ω): 1. Update G. (a) Propose a new graph G ′ from q(G ′ | Ω, G, Y ). (b) Generate the auxiliary variable Ω ′ ∼ W G ′ (b, D) using the block Gibbs sampler with initial value Ω\(ω ij , ω jj ), and accept G ′ with probability (5.11). (c) According to whether G ′ is accepted or not, update (ω ij , ω jj ) from their conditional distributions as in Step 1(c) in Algorithm 1.
3. Update Ω conditional on the most recent G using the block Gibbs sampler.
In Step 1(a), we can also systematically scan through all (i, j) for i < j and update edge (i, j) using a Bernoulli proposal with the corresponding conditional posterior odds, which simplifies the acceptance rate (5.11) accordingly.

Simulation experiment

A 6 node example
To investigate the accuracy of Algorithm 2, we consider a case with p = 6, yielding a graphical model space of size 32768, which is small enough to enumerate yet large enough to be interesting and to contain a significant proportion of non-decomposable graphs (about 45% of all graphs). We let S = Y Y ′ = nA −1 with n = 18. This choice of (S, n) represents 18 samples of Y from N(0, A −1 ). We evaluated I G using the Monte Carlo integration of Section 3.1 when G is non-decomposable and the closed form when G is decomposable. We then calculated the theoretical posterior edge inclusion probabilities and the theoretical posterior expectations of Σ and Ω, given in (6.1) and (6.2) respectively. We now compare the results obtained from Algorithm 2 to these theoretical values. We applied Algorithm 2 with a systematic scan for 60000 sweeps and discarded the first 10000 as burn-in. Each sweep entails updating all possible edges and all elements of Ω once. Two chains were run: one starting at the identity matrix for Ω and one at the sample precision matrix. The results were essentially the same for both chains. The posterior mean estimates of (p ij ), Σ and Ω are given below.

A 100 node circle graph example
The second example is more challenging as it has a large non-complete prime component of size 100. We simulated a sample of size n = 150 from the model N(0, A −1 ) where A is defined in Section 2.5. The prior parameters were b = 3, D = I 100 and independent edge inclusion probabilities of 2/(p − 1). We ran the systematic scan version of Algorithm 2 for 30000 sweeps after discarding the first 30000 warm-up iterations. Two chains were run: one starting at the identity matrix and one at the sample covariance matrix. The results from these two runs were similar. The median effective sample size of the free elements of Ω corresponding to the posterior mean graph was 30000 for a sample of size 30000. The posterior mean graph, which includes only edges having posterior inclusion probability exceeding 0.5, is the true underlying circle graph. The highest probability among excluded edges is 0.08, while the lowest probability among included edges is 1.
For comparison, we used the Monte Carlo integration of Section 3.1 to approximate the marginal likelihood of the true underlying graph. Under a C++ implementation, it took about 2 minutes to calculate the prior normalizing constant using 1000 Monte Carlo iterations. For the posterior normalizing constant, the computing time is about the same. However, the algorithm underflows in a standard implementation: the true value of the function f (Ψ V ) tends to be smaller than the computer's smallest positive floating point number. Figure 1 displays the boxplot of values of the function log f (Ψ V ) evaluated at M = 1000 samples of Ψ V , adjusted by an offset equal to max{log f (Ψ V i ) : i = 1, . . ., M }. The majority of these values are less than -2000, while the smallest positive floating point number in double precision is about -709 on the natural logarithm scale. Recall that I G is estimated by the sample average of f (Ψ V ), so the summation is taken over zeros most of the time. This example illustrates that the Monte Carlo integration may require a high precision arithmetic library capable of representing exponentials of values around −10 4 to −10 3 . To our knowledge, current software for Gaussian graphical models does not yet support this level of precision. Even if the underflow problem is addressed, the computation time can be unacceptable when the Monte Carlo sample size is increased until the variance falls below a fixed level. For example, using the default sample size 1.5p 3 suggested by Jones et al. [11] would cost 1.5 × 100 3 × 2/1000 = 3 × 10 3 minutes to evaluate this graph. Without a good estimate of the normalizing constant, we were unable to evaluate the accuracy of the Laplace approximation. However, note that its accuracy largely depends on the similarity between W G (b + n, D + S) and the normal distribution. Using the edgewise sampler, we simulated 10000 samples from W G (b + n, D + S) under the true graph. Figure 2 shows Q-Q plots for two univariate margins of Ω V ; these margins are clearly different from the normal distribution.
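The underflow described above is the classic motivation for the log-sum-exp trick: average on the log scale after subtracting the maximum, so at least one term equals exp(0) = 1. A minimal sketch (not the paper's implementation):

```python
import math

def log_mean_exp(log_vals):
    """Return log( mean( exp(v) for v in log_vals ) ) without underflow.

    A naive implementation fails here: math.exp(-2000) is exactly 0.0 in
    double precision, so averaging raw f(Psi_V) values sums over zeros.
    Subtracting the maximum first keeps the dominant terms representable."""
    m = max(log_vals)
    return m + math.log(sum(math.exp(v - m) for v in log_vals) / len(log_vals))
```

With log values near -2000, as in Figure 1, the naive average is identically zero while `log_mean_exp` returns the correct log-scale estimate of I G up to Monte Carlo error.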

Mutual fund performance evaluation
In this section, we illustrate an extension of Gaussian graphical models to a class of sparse seemingly unrelated regression models. We show how graphs can be useful in modeling real problems and how Algorithm 2 can serve as a key component of a larger sampling scheme.
The historical performance of a mutual fund can be summarized by estimating its alpha, defined as the intercept in a regression of the excess return of the fund on the excess return of one or more passive benchmarks. This is usually estimated by applying ordinary least squares (OLS) to the regression y 0,t = α 0 + x ′ t β 0 + e 0,t , t = 1 : T, where y 0,t is the fund return at time t, x t is a k × 1 vector of benchmark returns at time t, and α 0 is the fund alpha. The choice of benchmarks is often guided by a pricing model, such as the capital asset pricing model (CAPM) [25] or the Fama-French three factor model [7]. Pástor and Stambaugh [18] explored the role of nonbenchmark passive assets in estimating a fund's alpha using a seemingly unrelated regression (SUR) model. Suppose there are p nonbenchmark passive returns y 1:p,t besides the k benchmark returns x t . Further suppose returns on passive assets, benchmark or nonbenchmark, are available for the period from 1 to T, while a mutual fund only has a history from t 0 to T, where t 0 ≥ 1. Then the SUR model used to estimate the mutual fund alpha α 0 is written as y 0,t = α 0 + x ′ t β 0 + e 0,t , t = 1 : T, and y i,t = α i + x ′ t β i + e i,t , i = 1 : p, t = 1 : T, where y 0,t = y ⋆ 0,t is missing for t < t 0 and the error vector e 0:p,t is distributed as N (0, Σ). The basic idea is that a more precise estimate of α 0 is obtained through a more precise estimate of α 1:p when e 0,t is correlated with e 1:p,t . Note that many mutual funds have relatively short histories compared with passive assets. Given the more accurate estimate of α 1:p computed from a longer sample period, the α 0 estimated from SUR is more precise than the α 0 estimated solely by OLS. Some interesting questions arise in evaluating mutual fund performance using the SUR model (7.1). First, as observed by Pástor and Stambaugh [18], the assumption about the pricing power of benchmark assets on nonbenchmark
assets, i.e. whether α i = 0 for i = 1 : p, is critical to estimating a fund's alpha in a SUR model. The second question concerns the strictness of the SUR model assumption that fund returns are contemporaneously correlated with all nonbenchmark returns given the benchmark returns. For some managed funds, perhaps only the errors from a subset of nonbenchmark assets are relevant in explaining returns of the fund. Including too many correlated nonbenchmark assets when estimating alpha entails potentially high misspecification risk.
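As a point of reference for the OLS baseline discussed above, the alpha is simply the intercept of a least-squares fit of excess fund returns on benchmark excess returns; a minimal numpy sketch (illustrative, not the paper's SUR machinery):

```python
import numpy as np

def ols_alpha(y, X):
    """Alpha = intercept from an OLS regression of excess fund returns y
    (length T) on benchmark excess returns X (T x k)."""
    Z = np.column_stack([np.ones(len(y)), X])   # prepend intercept column
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef[0]
```

The SUR and SSUR approaches improve on this baseline by borrowing strength from the longer histories of the nonbenchmark assets.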
Motivated by these practically important considerations, we consider the following sparse seemingly unrelated regression (SSUR) models that extend SUR to address the two questions above. We use a hierarchical mixture prior for each of α 0:p , where z i = 0 or 1 according to whether the benchmark assets have pricing power or not, respectively, and ν i0 and ν i1 are set to be small and large respectively [8]. We next apply the Gaussian graphical model to the residual covariance matrix Σ to model the contemporaneous dependence among mutual fund and nonbenchmark returns. Algorithm 2 developed in Section 5 then extends to include components that sample (α 0:p , z 0:p , β 0:p ) and y ⋆ 0,1:t 0 −1 at each iteration, using the efficient stochastic search variable selection (SSVS) procedure and a conventional imputation approach.
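Under a two-component mixture prior of this kind, the conditional update of z i in an SSVS step reduces to comparing two normal densities at the current α i . A hedged sketch, assuming the components are mean-zero normals with standard deviations ν i0 and ν i1 and a uniform prior on z i (a common SSVS form; the paper's exact parameterization may differ):

```python
import math

def prob_z1(alpha, nu0=0.025, nu1=0.5, prior_z1=0.5):
    """Conditional probability that z = 1 given alpha under the mixture
    prior alpha ~ N(0, nu0^2) if z = 0 and alpha ~ N(0, nu1^2) if z = 1.
    The normal form and default values are illustrative assumptions."""
    def norm_pdf(x, s):
        return math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
    num = prior_z1 * norm_pdf(alpha, nu1)
    den = num + (1.0 - prior_z1) * norm_pdf(alpha, nu0)
    return num / den
```

A Gibbs step draws z i as Bernoulli with this probability; small alphas favor z i = 0 (benchmarks price the asset) while large alphas favor z i = 1.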
To evaluate the efficacy of the model, we applied it to a collection of 15 actively managed Vanguard mutual funds, using monthly returns through December 2008 available from the Center for Research in Security Prices (CRSP) mutual fund database. The set of benchmark and nonbenchmark assets consists of eleven passively constructed portfolios. Monthly returns on these passive assets are available from January 1927 through December 2008. The sample period for any given mutual fund is a much shorter subset of this overall period. We specify the benchmark series as the excess market return (MKT), so alpha is defined exclusively with respect to MKT. The first two nonbenchmark passive portfolios are the Fama-French factors, which are the payoffs on long-short spreads constructed by sorting stocks according to market capitalization and the book-to-market ratio. The third, fourth and fifth nonbenchmark series are the momentum, short-term reversal and long-term reversal factors respectively. The remaining five nonbenchmark assets are the value-weighted returns on five industry portfolios.
We choose ν i,0 = 0.025 and ν i,1 = 0.5 for the monthly α i . This choice of hyperparameters is in line with the view that a yearly excess return of 0.025 × 2 × 12 = 0.6% beyond the compensation for the risk borne may plausibly be ignored, while the maximum plausible yearly return for α i is about 0.5 × 2 × 12 = 12%. We assume a uniform prior for z i . We compare three methods for estimating α 0 : OLS, SUR and SSUR. Table 2 reports, for each fund, the estimated α 0 , the standard error and the posterior probability of the event {z 0 = 1} under the three methods for the period since the fund's inception. The SSUR estimates are nontrivially different from their OLS and SUR counterparts. In particular, the α 0 's tend towards zero under SSUR. This is not surprising since SSUR places positive probability on small values of α 0 . One important issue in fund performance evaluation is whether a managed fund adds value beyond the standard passive benchmarks. We address this issue by computing the estimated probability of the event {z 0 = 1} in the last column. Only a few funds have estimated probability exceeding 0.5, suggesting that most of the 15 mutual funds do not generate excess returns beyond the passive benchmark assets. Furthermore, the SUR standard errors are generally smaller than their OLS counterparts, an observation compatible with Pástor and Stambaugh [18]. With few exceptions, SSUR reduces the standard error even more than SUR. Recall that the standard error of the SSUR estimates takes structure uncertainty into account. The reduced standard errors suggest that there is a great deal of sparsity within SUR and that identifying this sparsity helps provide more precise estimates of the α 0 's. Finally, we note that the estimated graphs representing a fund's contemporaneous dependence on nonbenchmark assets seem to reflect the fund's portfolio composition. For example, the fund Explorer seeks small US companies with growth potential and
has its top two holdings in the information technology and health care sectors as of May 2008. The error of this fund is related to the errors of the nonbenchmark assets representing market capitalization, high technology, and health care.

Discussion
We have described a sampling algorithm for Bayesian model determination in Gaussian graphical models. Our method has three ingredients: a block Gibbs sampler for within-graph moves, a reversible jump sampler using partial analytic structure for across-graph moves, and an exchange algorithm for avoiding the evaluation of prior normalizing constants.
For the covariance selection problem, a possible disadvantage of not approximating the marginal likelihood is that it precludes more flexible search algorithms for rapid traversal of the graph space. However, the sequence of graphs from the auxiliary chain we developed will in many cases have the property that high probability graphs appear more quickly than low probability ones, providing useful guidance for setting Monte Carlo sample sizes or starting graphs for the more computationally intensive methods based on normalizing constant approximation.
2. A random graph. The edge set E is randomly generated from independent Bernoulli distributions with probability 0.3. The G-Wishart parameters are b = 103 and D = I + 100J −1 , where J = B + δI p and B ij = 0.5 if (i, j) ∈ E.

We placed the W G (3, I 6 ) prior on Ω and the uniform prior p(G) ∝ 1 on G. We calculated the theoretical posterior edge inclusion probabilities, denoted by (p ij ) 1≤i<j≤p | Y, and the theoretical posterior expectations of Σ and Ω, denoted by E(Σ | Y ) and E(Ω | Y ) respectively, as follows. For each G ∈ G, where G is the space of all 32768 graphs, we calculated its posterior probability as p(G | Y ) = p(G)I G (b + n, D + S)/I G (b, D) divided by the sum of p(G)I G (b + n, D + S)/I G (b, D) over all G ∈ G. The quantities E(Σ | Y, G) and E(Ω | Y, G) are analytically available only for decomposable graphs [20]. For non-decomposable graphs, we estimated E(Σ | Y, G) and E(Ω | Y, G) by their posterior sample means calculated from the output of the Gibbs sampler of Section 2.4. We report the Monte Carlo sample sizes we used: in (6.1), sample sizes of 1000 and 50000 were used for the prior and the posterior normalizing constants, respectively; in (6.2), an MCMC sample of 10000 iterations after an initial 5000 burn-in runs was used. These sample sizes allow the Monte Carlo estimation to be performed for each of the non-decomposable graphs and also yield agreement to about 2 decimal places for almost all elements of (p ij ) | Y, E(Σ | Y ) and E(Ω | Y ) when we repeated the entire process two more times. The final results of the theoretical posterior edge inclusion probabilities and the theoretical posterior expectations of Σ and Ω are given below.
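The exhaustive enumeration of (p ij ) described above can be sketched generically. Here `log_score` stands in for log{p(G)I G (b + n, D + S)/I G (b, D)}, which the paper computes by Monte Carlo integration or in closed form; the function names are illustrative.

```python
import math
from itertools import combinations, product

def edge_inclusion_probs(p, log_score):
    """Enumerate all 2^(p(p-1)/2) graphs on p vertices and return the
    posterior inclusion probability of each edge.

    log_score(edges) is a caller-supplied unnormalized log posterior
    score for the graph with the given frozenset of edges."""
    all_edges = list(combinations(range(p), 2))
    graphs, logs = [], []
    for mask in product([0, 1], repeat=len(all_edges)):
        edges = frozenset(e for e, m in zip(all_edges, mask) if m)
        graphs.append(edges)
        logs.append(log_score(edges))
    m = max(logs)                        # log-sum-exp shift for stability
    w = [math.exp(l - m) for l in logs]
    z = sum(w)
    return {e: sum(wi for wi, g in zip(w, graphs) if e in g) / z
            for e in all_edges}
```

For p = 6 this loops over all 32768 graphs, exactly the enumeration used to obtain the theoretical values against which Algorithm 2 is checked.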

Fig 1. Box plot showing the distribution of (log f (Ψ V ) − offset) in the 100 node cycle graph example.

Fig 2. Q-Q plots comparing the marginal distributions of two entries of Ω ∼ W G (b+n, D+S) with the normal distribution in the 100 node cycle graph example.
Comparing these MCMC estimates with the theoretical values computed above, we see that Algorithm 2 produces accurate estimates. As for computing time, under a Matlab implementation, Algorithm 2 took about 15 minutes to complete 60000 sweeps, while the Monte Carlo integration method took about 16 hours to evaluate all 32768 graphs.

Table 2
Summary of the estimated monthly α for three different models: OLS, SUR and SSUR. For OLS and SUR, point estimates and standard errors are reported. For SSUR, posterior means, standard deviations and the probability of α = 0 are reported.