A Metropolis-Hastings based method for sampling from the G-Wishart distribution in Gaussian graphical models

Abstract: In Gaussian graphical models, the conjugate prior for the precision matrix K is the G-Wishart distribution, W_G(δ, D). In this paper we propose a new sampling method for the W_G(δ, D) based on the Metropolis-Hastings algorithm, and we show its validity through a number of numerical experiments. We show that this method can easily be used to estimate the Deviance Information Criterion, providing a computationally inexpensive approach for model selection.


Introduction
Gaussian graphical models are statistical models for multivariate Gaussian data in which dependencies and conditional independencies are represented by means of a graph G. The components of the Gaussian variable X = (X_1, ..., X_p) are represented by the set of nodes, V = {1, ..., p}, of the graph, and the absence of an edge between nodes i and j in V indicates the conditional independence of X_i and X_j given the remaining variables. In a Bayesian framework, in order to perform model selection or covariance estimation, one often uses the popular Diaconis-Ylvisaker conjugate prior ([4]) for the precision matrix K = Σ^{-1} of the Gaussian distribution. This prior distribution is called the G-Wishart and was first given for arbitrary undirected graphs G by Roverato ([14]). Its density is given by

f(K) = (1 / I_G(δ, D)) |K|^{(δ-2)/2} exp(-(1/2) <K, D>),

where δ > 0, D is a symmetric positive definite matrix and I_G(δ, D) is the normalizing constant. The support of this distribution is the set of all positive definite symmetric matrices K that have zero entries K_ij whenever (i, j) is a missing edge in the graph G. This set is denoted by M^+(G). Without loss of generality, we can assume that X ~ N_p(0, Σ).
Since the W_G(δ, D) distribution is a conjugate prior, for a sample of size n the posterior distribution of K given the data matrix x is the W_G(δ + n, D + x^T x).
When G is decomposable, the normalizing constant of the W_G(δ, D),

∫ |K|^{(δ-2)/2} exp(-(1/2) <K, D>) dK,

can be computed explicitly (see [13]). However, this is not true for non-decomposable graphs; in that case the normalizing constant cannot be obtained analytically.
The ability to sample from the G-Wishart (or the Hyper Inverse Wishart) distribution is important and has many applications, including the estimation of the normalizing constant. Another application is the estimation of the covariance matrix Σ and of functions of it. Indeed, in a Bayesian context one is often interested in the posterior mean of Σ, E(Σ|x, G), given a set of data x of size n and a selected model G. Since the posterior distribution of K is the W_G(δ + n, D + x^T x), we can estimate E(Σ|x, G) by the quantity

Ĵ = (1/N) Σ_{i=1}^{N} K_i^{-1},

where the K_i are random samples from the W_G(δ + n, D + x^T x). In a similar manner, one can estimate any function of K.
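The estimator Ĵ can be sketched in a few lines. This is an illustrative Python fragment, not the authors' R code; two fixed positive definite matrices stand in for actual G-Wishart samples, which in practice would come from the sampler of Section 3.

```python
import numpy as np

def posterior_mean_sigma(K_samples):
    """Monte Carlo estimate of E(Sigma | x, G): average the inverses of
    precision-matrix samples K_i drawn from W_G(delta + n, D + x^T x)."""
    return np.mean([np.linalg.inv(K) for K in K_samples], axis=0)

# Hand-picked positive definite matrices standing in for G-Wishart samples.
K1 = np.array([[2.0, 0.5], [0.5, 1.0]])
K2 = np.array([[3.0, 0.2], [0.2, 2.0]])
J_hat = posterior_mean_sigma([K1, K2])
```

Any other functional of K can be estimated the same way by replacing `np.linalg.inv` with the function of interest.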
A method to sample from the G-Wishart distribution when G is non-decomposable was proposed in [3]. Unfortunately, this method is incorrect. The purpose of this paper is to offer a new method. In Section 2 we review the existing methods for the simulation of the G-Wishart. In Section 3 we present our new method, which is based on the Metropolis-Hastings algorithm. In Section 4 we investigate its performance through a number of experiments and use it together with the DIC criterion for model selection.

The decomposable case
Sampling from the Hyper Inverse Wishart distribution for decomposable graphs has been previously discussed in [3]. If C_1, C_2, ..., C_k is a perfect ordering of the cliques of G and S_2, ..., S_k are the corresponding separators, we write R_i = C_i \ S_i for the residuals. The sampling scheme is then as follows: first sample Σ_{C_1} from the inverse Wishart distribution IW(δ, D_{C_1}); then, for i = 2, ..., k, (a) sample Σ_{R_i|S_i} from an inverse Wishart distribution and U_i from a normal distribution with covariance matrix Σ_{R_i|S_i} ⊗ Σ_{S_i}^{-1}, with ⊗ denoting the Kronecker product, (b) compute Σ_{R_i,S_i} = U_i Σ_{S_i} and Σ_{R_i} = Σ_{R_i|S_i} + Σ_{R_i,S_i} Σ_{S_i}^{-1} Σ_{S_i,R_i}, and subsequently obtain Σ_{C_i} from Equation (4).

The precision matrix
The precision matrix K can then be obtained as

K = Σ_{i=1}^{k} [Σ_{C_i}^{-1}]_0 - Σ_{i=2}^{k} [Σ_{S_i}^{-1}]_0,

where for a |I| × |J| matrix A with |I|, |J| < p we denote by [A]_0 the p × p matrix obtained from A by filling in the remaining entries with zeros.
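The zero-padding operation [A]_0 and the clique-wise assembly of K can be illustrated with a small sketch (hypothetical Python, using the standard decomposable identity K = Σ_i [Σ_{C_i}^{-1}]_0 - Σ_i [Σ_{S_i}^{-1}]_0 on a three-node chain graph):

```python
import numpy as np

def pad_zero(A, rows, cols, p):
    """[A]_0: embed block A (indexed by rows x cols) into a p x p zero matrix."""
    M = np.zeros((p, p))
    M[np.ix_(rows, cols)] = A
    return M

# Chain graph 1-2-3 (0-based indices): cliques {0,1}, {1,2}, separator {1}.
K_true = np.array([[2.0, 1.0, 0.0],
                   [1.0, 2.0, 1.0],
                   [0.0, 1.0, 2.0]])
Sigma = np.linalg.inv(K_true)

cliques = [[0, 1], [1, 2]]
seps = [[1]]
K = sum(pad_zero(np.linalg.inv(Sigma[np.ix_(c, c)]), c, c, 3) for c in cliques) \
  - sum(pad_zero(np.linalg.inv(Sigma[np.ix_(s, s)]), s, s, 3) for s in seps)
# K recovers K_true because Sigma is Markov with respect to the chain graph.
```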

The general case
The method in 2.1 cannot be used when the graph is not decomposable, since in that case the marginals for the submatrices of Σ corresponding to prime components do not follow an inverse Wishart distribution [14]. Recently, a Gibbs-sampling-based method has been proposed for the generation of random samples from the G-Wishart distribution that can be applied to non-decomposable graphs [1]. This method is based on the theoretical results of [12], which describe sufficient conditions in regular exponential families for the construction of a block Gibbs sampler for sampling from their natural conjugate densities. When applied to Gaussian graphical models, the block Gibbs sampler becomes the Bayesian Iterative Proportional Scaling (BIPS) algorithm ([6]). For this cyclical method, the number of iterations needed to obtain one sample point is equal to the number of cliques of the graph. If C_1, C_2, ..., C_k is the set of cliques of G, the algorithm cycles through the cliques, at each step redrawing the block K_{C_i} conditionally on the remaining entries of K (see [1] for the exact conditional distributions). After some burn-in time r_0, the sequence (K_r)_{r>r_0} is a set of random samples from the W_G(δ, D).
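The clique-wise update can be sketched as follows. This is an illustrative Python reconstruction, not the authors' R code, and the Wishart degrees of freedom used below (δ + |C|) are an assumption; see [1] for the exact conditional distributions. The structural point the sketch demonstrates is that each update redraws only the free block K_{C,C}, so the zero pattern of M^+(G) and positive definiteness are preserved.

```python
import numpy as np

rng = np.random.default_rng(0)

def wishart_bartlett(df, scale, rng):
    """Sample from a Wishart(df, scale) via the Bartlett decomposition."""
    p = scale.shape[0]
    L = np.linalg.cholesky(scale)
    A = np.zeros((p, p))
    for i in range(p):
        A[i, i] = np.sqrt(rng.chisquare(df - i))
        for j in range(i):
            A[i, j] = rng.standard_normal()
    LA = L @ A
    return LA @ LA.T

def block_gibbs_step(K, cliques, delta, D, rng):
    """One BIPS-style sweep: for each clique C, redraw the Schur complement of
    K restricted to C while keeping every other entry of K fixed.  The degrees
    of freedom are illustrative (delta + |C|); see [1] for the exact values."""
    p = K.shape[0]
    K = K.copy()
    for C in cliques:
        rest = [i for i in range(p) if i not in C]
        # Part of K_CC explained by the rest of the graph (Schur structure).
        cond = (K[np.ix_(C, rest)]
                @ np.linalg.inv(K[np.ix_(rest, rest)])
                @ K[np.ix_(rest, C)])
        A = wishart_bartlett(delta + len(C), np.linalg.inv(D[np.ix_(C, C)]), rng)
        K[np.ix_(C, C)] = A + cond
    return K

# One sweep on a 3-node chain graph (zero allowed only at entry (0, 2)).
K0 = np.array([[2.0, 1.0, 0.0],
               [1.0, 2.0, 1.0],
               [0.0, 1.0, 2.0]])
K1 = block_gibbs_step(K0, [[0, 1], [1, 2]], 3.0, np.eye(3), rng)
```

Note the matrix inversion of K restricted to the complement of each clique; this is the per-clique cost discussed below.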
The method has been implemented successfully, for example in [9]. However, sampling using BIPS presents various limitations. The algorithm needs the set of cliques of G to be enumerated, a problem known to be NP-hard ([10]). Also, significant computational time is needed, since the generation of one sample requires a series of matrix inversions, one for each clique update. In the following section we propose a new sampling method that does not suffer from these limitations. This method is based on the Metropolis-Hastings (MH) algorithm.
A very recent addition to the literature on sampling methods for the G-Wishart distribution for non-decomposable graphs is presented in [17], where a rejection sampling method is used for the first prime component in a perfect ordering of the prime components, and the same method, together with conditioning on the separators, is used for the subsequent prime components.

3. A new sampling method for the W_G(δ, D) distribution

The proposed MH-based sampling method
In our proposed method, we make use of the fact that, given a positive definite matrix D, there is a bijection between M^+(G) and M^⊳*(G), the space of all upper triangular matrices that are incomplete with respect to the graph G. The mapping is described in detail in [2]. In summary, each K in M^+(G) is mapped to ψ^V, the projection of the upper triangular matrix ψ onto the space of V-incomplete matrices, where ψ = φT^{-1}, and φ, T are upper triangular matrices such that K = φ^T φ and D^{-1} = T^T T. Conversely, the completion ψ of ψ^V can be carried out with the use of the equations of [2], Lemma 2, which express each non-free entry ψ_{rs} in terms of the free entries and of the quantities h_{ij} = t_{ij}/t_{jj}, t_{ij} being the (i, j) entry of the matrix T. Once ψ is completed, K is given by (ψT)^T ψT. The density of ψ^V is such that

f(ψ^V) ∝ h(ψ^V) exp(-(1/2) Σ_{(i,j): ψ_{ij} not free} ψ_{ij}^2),

where h(ψ^V) is the density under which the free diagonal entries ψ_{ii} follow independent χ_{δ+ν_i} distributions and the free off-diagonal entries jointly follow the N_{|E|}(0_{|E|}, I_{|E|}) distribution. Here ν_i is the number of nodes j connected with node i such that j > i, χ_{δ+ν_i} denotes the distribution of ψ_{ii} when ψ_{ii}^2 follows the chi-square distribution with δ + ν_i degrees of freedom, and N_{|E|}(0_{|E|}, I_{|E|}) denotes the multivariate normal distribution with zero mean and covariance matrix equal to the identity matrix, of dimension equal to the number of edges |E|. The MH algorithm described here can be used to sample from the distribution of ψ^V; samples from the G-Wishart are then obtained through the mapping described above. The proposed algorithm is an independence chain ([16]), with proposal density h(ψ^V), as in Equation (8). The acceptance probability is equal to

α = min{1, w(ψ'^V) / w(ψ^V_cur)}, where w(ψ^V) = exp(-(1/2) Σ_{(i,j): ψ_{ij} not free} ψ_{ij}^2),

ψ being the completion of ψ^V. Since w(ψ^V) is uniformly bounded (by 1), the chain is uniformly ergodic (the strongest convergence rate condition in use, [11]). We now present pseudo code for the method:

1. Initialize the chain by sampling ψ^V_0 from h, as in Equation (8); set ψ^V_cur = ψ^V_0.
2. At each iteration, sample a proposal ψ'^V from h and compute log α = log w(ψ'^V) - log w(ψ^V_cur).
3. If log α > 0 then set ψ^V_cur = ψ'^V.
4. Else sample u from the uniform distribution on (0, 1); set ψ^V_cur = ψ'^V if log u < log α, and keep ψ^V_cur unchanged otherwise.

For each ψ^V_cur there is a unique corresponding matrix K_cur that can be constructed by reversing the procedure described at the beginning of this section. The sequence of the K_cur matrices gives a sample from the G-Wishart distribution.
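The accept/reject logic above can be sketched as a generic independence-chain MH routine. This is a hypothetical Python illustration, not the authors' implementation: a toy one-dimensional weight function bounded by 1 stands in for w(ψ^V), so the proposal and target below are assumptions made only to keep the example self-contained.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def independence_mh(sample_h, log_w, n_iter, rng):
    """Independence-chain Metropolis-Hastings: propose from h at every step
    and accept with probability min(1, w(prop) / w(cur))."""
    cur = sample_h(rng)
    log_w_cur = log_w(cur)
    chain, accepted = [], 0
    for _ in range(n_iter):
        prop = sample_h(rng)
        log_w_prop = log_w(prop)
        # Accept if log u < log alpha; covers the log alpha > 0 case too.
        if math.log(rng.uniform()) < log_w_prop - log_w_cur:
            cur, log_w_cur = prop, log_w_prop
            accepted += 1
        chain.append(cur)
    return chain, accepted / n_iter

# Toy stand-ins: h = N(0,1) proposal, w(x) = exp(-x^2/2) (bounded by 1),
# so the target is proportional to exp(-x^2), i.e. N(0, 1/2).
chain, acc_rate = independence_mh(
    lambda r: r.standard_normal(),
    lambda x: -0.5 * x * x,
    20000, rng)
```

Because w is uniformly bounded, the ratio w(prop)/w(cur) never degenerates, which is the source of the uniform ergodicity noted above.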
The proposed sampling method has been implemented in R and the code is available from the authors.

Experiment 1: Sampling from the prior
In order to investigate the performance of the MH sampling method under different scenarios, we conducted a number of numerical experiments. For the first experiment, we sampled from the prior distribution of K, using the seven-node non-decomposable graph shown in Figure 1. We use as matrix D the identity matrix and δ = 3. We ran the chain for 10,000 iterations, discarding the first 2,000 as burn-in samples. For the assessment of the samples we compare with exact distributions that are available, since in this example the graph G contains complete prime components. From [14] we know that, for a complete prime component C of G, a determinant d^(i) computed from the corresponding submatrix of each sampled K has the same distribution as det(X), where X ~ W(δ, D_C). The empirical distribution of the sample {d^(i)}_i is then compared with the empirical distribution of det(X); Q-Q plots can be used for the comparison. We also calculate the acceptance rates and autocorrelations and generate the trace plots. From the results shown in Figure 2 we see that, under the simple case of D = I and despite the fact that G is non-decomposable, the sampling distribution is very similar to the true distribution. The acceptance rate is also quite high, suggesting that the samples are approximately independent. The trace plot indicates that the chain is mixing well.

Experiment 2: Sampling from the posterior
We also perform a similar experiment using the same graph, but this time sampling from the posterior distribution of K. For that we use a simulated dataset of 50 observations generated from a multivariate normal distribution with zero mean and covariance matrix Σ such that Σ^{-1} ∈ M^+(G). Because of the way the simulated data are generated, the graph G is a plausible model for X.
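The data-generating step and the conjugate posterior update can be sketched as follows. This is a hypothetical Python illustration: a small three-node precision matrix stands in for the actual seven-node graph of Figure 1, and the specific entries of K are assumptions chosen only so that Σ^{-1} ∈ M^+(G).

```python
import numpy as np

rng = np.random.default_rng(2)

# A precision matrix respecting the zero pattern of a graph G
# (here a 3-node chain, standing in for the graph of Figure 1).
K = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])
Sigma = np.linalg.inv(K)

# Draw n observations from N_p(0, Sigma) via the Cholesky factor of Sigma.
n = 50
L = np.linalg.cholesky(Sigma)
x = rng.standard_normal((n, 3)) @ L.T   # rows are observations

# Conjugate posterior hyperparameters, as in Section 1:
# K | x ~ W_G(delta + n, D + x^T x).
delta_post = 3 + n
D_post = np.eye(3) + x.T @ x
```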
It is therefore important to ensure that the algorithm samples well from the posterior distribution. As in Experiment 1, we used D = I and δ = 3, and again ran the chain for 10,000 iterations, discarding the first 2,000 as burn-in samples.
For the assessment of the samples we use methods similar to those of Experiment 1.
The results presented in Figure 3 show good mixing of the chain, a high acceptance rate, low autocorrelation and close agreement with the true distribution, indicating that the sampler performs very well.

Experiment 3: Comparison with the Block Gibbs Sampler
In a third experiment, we compare the MH method with the only other available method, the Block Gibbs Sampler. We use an 11-node graph, taken from [7]. The graph was selected by the graphical lasso method to represent the conditional independencies of a dataset containing flow cytometry measurements of 11 proteins on 7466 cells. The graph is shown in Figure 4. It is decomposable, but we prefer to test the sampler with a non-decomposable graph. We therefore modify the graph by removing the edge (PIP2, P38). The resulting graph is non-decomposable, since it contains a chordless cycle of length 4, {PIcg, PIP2, Akt, P38} [8]. We sample from the posterior distribution using as prior hyperparameters D = I_11 and δ = 3. Figure 5 shows the autocorrelation plots of log det(K) for the two methods. The MH method required 38.1 seconds of elapsed CPU time per 1,000 effective samples, while the Block Gibbs Sampler required only 4.4. Using samples from runs of equal computational time (40 seconds), we also compare the empirical densities of det(K). The result in Figure 6 shows that the posterior density estimates of det(K) from the two methods are very similar. However, the estimate from MH shows signs of multiple modes. This could be because the estimate is based on an effective sample size 9.3 times smaller than that of the Block Gibbs Sampler. Following the results of these experiments, we can conclude that the MH algorithm seems to provide a reliable method for sampling from the G-Wishart distribution, with relatively small computational expense. In Gaussian graphical models, the precision matrix K plays the role of the parameter θ, with known posterior distribution W_G(δ + n, D + x^T x). The DIC of a graph G can be defined as in Equation (10), and the quantities K̄ and D̄ can be estimated after sampling efficiently from the W_G(δ + n, D + x^T x) with the use of the proposed MH method. We now present a numerical example of the calculation of the DIC for various graphical
models. We used the well-known 4-dimensional Iris flower dataset ([5]), containing measurements of the length and width of the sepals and petals of 150 samples of Iris flowers. The samples are taken equally from three species of the flower: Iris setosa, Iris virginica and Iris versicolor. For our experiment only the 50 samples of Iris virginica were used. Since p = 4, there are 2^{p(p-1)/2} = 2^6 = 64 possible four-node graphical models. We excluded the no-edge model and estimated the DIC for the 63 remaining models. For the hyperparameters of the prior we use the values D = I_4 and δ = 3. Under each model we generated 10,000 samples from the posterior distribution of K, using the MH-based sampling method; the initial 2,000 iterations are discarded as burn-in. Using those samples we estimate K̄ and D̄, and subsequently calculate the DIC values. We examine the consistency of the DIC values with the values of the log of the ratio of normalizing constants (equal to the marginal likelihood p(x|G) up to a constant multiplying factor), denoted by λ. Figure 7 shows a scatterplot of the values of -DIC against λ. The values of the two measures appear well correlated, which indicates that model selection based on the DIC is similar to selection based on the marginal likelihood, and that our sampler from the G-Wishart distribution worked well. A similar indication is offered by the observation that 9 of the 10 best models based on the DIC belong to the set of the 10 best models according to the λ values.

Conclusions
The main contribution of this paper is the proposal of a new method for sampling from the G-Wishart distribution, based on the Metropolis-Hastings algorithm. A first series of experiments showed satisfactory results and improved efficiency over existing sampling methods, such as the Block Gibbs Sampler. In addition, efficient sampling using this method can be used for the estimation of the Deviance Information Criterion (DIC), which, as our experiments suggest, provides a computationally inexpensive method for model selection. Further investigation is needed to fine-tune the method in order to improve its performance.

Fig 2 .
Fig 2. Results from Experiment 1, sampling from the prior distribution of K.

Fig 3 .
Fig 3. Results from Experiment 2, sampling from the posterior distribution of K.

Fig 4 .
Fig 4. Graphical model estimated from a flow cytometry dataset, with p = 11 proteins measured on N = 7466 cells. Taken from [7].

Fig 5 .
Fig 5. Autocorrelation plots for log det(K) using Metropolis-Hastings and Block Gibbs Sampling methods.

Fig 6 .
Fig 6. Density estimation for det(K) using the Metropolis-Hastings and Block Gibbs Sampling methods.

4.4. Application to model selection using the DIC

In this section we present an application of the proposed MH-based sampling method to model selection. The selection is performed with the use of the Deviance Information Criterion (DIC) ([15]), a criterion similar in spirit to the AIC and BIC. It provides a measure of how well a particular model fits some existing data, while taking into account how easily the model fits the data (how many parameters it effectively has). It can therefore be used for Bayesian model selection. Given a dataset x and an unknown parameter θ, the deviance is defined as D(θ) = -2 log(p(x|θ)) + C, where p(x|θ) is the likelihood function and C is a constant that depends only on the data. Under a specific model and posterior distribution of θ given x, one can define the expectations θ̄ = E_{θ|x}[θ] and D̄ = E_{θ|x}[D(θ)]. The Deviance Information Criterion can then be calculated as

DIC = p_D + D̄,   (10)

where p_D = D̄ - D(θ̄) is the effective number of parameters of the model, a Bayesian measure of model complexity ([15]). The larger p_D is, the more easily the model fits the data. On the other hand, the quantity D̄ measures how well the model fits the data, with larger D̄ indicating a worse fit. Overall, models with smaller DIC values are preferred to those with larger DIC values. One of the characteristics of the DIC that makes it attractive over other model selection criteria is the fact that it can be calculated whenever an MCMC or other sampling method for the posterior distribution of θ is available. In that case θ̄ and D̄ can be estimated by the sample means (1/N) Σ_{i=1}^{N} θ_i and (1/N) Σ_{i=1}^{N} D(θ_i), where θ_i, i = 1, ..., N, are samples from the posterior distribution of θ given x. The standard error of D̄ can be estimated by

[ Σ_{i=1}^{N} (D(θ_i) - D̄)^2 / (N(N-1)) ]^{1/2}.
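The sample-based DIC calculation described above can be sketched as follows. This is a hypothetical Python fragment, not the authors' R code; the toy one-parameter normal likelihood at the end is an assumption made only so the example is self-contained and checkable.

```python
import numpy as np

def dic_from_samples(log_lik, theta_samples):
    """Estimate DIC = p_D + D_bar from posterior samples, where
    D(theta) = -2 log p(x | theta), D_bar is the posterior mean deviance,
    p_D = D_bar - D(theta_bar), and the standard error of D_bar follows
    the formula above."""
    devs = np.array([-2.0 * log_lik(t) for t in theta_samples])
    D_bar = devs.mean()
    theta_bar = np.mean(theta_samples, axis=0)
    p_D = D_bar - (-2.0 * log_lik(theta_bar))
    N = len(devs)
    se_D_bar = np.sqrt(np.sum((devs - D_bar) ** 2) / (N * (N - 1)))
    return p_D + D_bar, p_D, se_D_bar

# Toy check: N(theta, 1) likelihood for a single observation x = 0,
# with draws of theta standing in for MCMC output.
x_obs = 0.0
log_lik = lambda t: -0.5 * (x_obs - t) ** 2 - 0.5 * np.log(2 * np.pi)
rng = np.random.default_rng(3)
theta_samples = rng.normal(0.0, 1.0, size=1000)
dic, p_D, se = dic_from_samples(log_lik, theta_samples)
```

In the setting of this section, `theta_samples` would be the precision matrices drawn from W_G(δ + n, D + x^T x) by the MH sampler, and `log_lik` the Gaussian log-likelihood of the data as a function of K.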

Fig 7 .
Fig 7. Scatterplot between the log ratio of the normalizing constants, λ, and the value of -DIC.