Bayesian Inference in Nonparanormal Graphical Models

Gaussian graphical models have been used to study intrinsic dependence among several variables, but the Gaussianity assumption may be restrictive in many applications. A nonparanormal graphical model is a semiparametric generalization for continuous variables where it is assumed that the variables follow a Gaussian graphical model only after some unknown smooth monotone transformations on each of them. We consider a Bayesian approach in the nonparanormal graphical model by putting priors on the unknown transformations through a random series based on B-splines where the coefficients are ordered to induce monotonicity. A truncated normal prior leads to partial conjugacy in the model and is useful for posterior simulation using Gibbs sampling. On the underlying precision matrix of the transformed variables, we consider a spike-and-slab prior and use an efficient posterior Gibbs sampling scheme. We use the Bayesian Information Criterion to choose the hyperparameters for the spike-and-slab prior. We present a posterior consistency result on the underlying transformation and the precision matrix. We study the numerical performance of the proposed method through an extensive simulation study and finally apply the proposed method on a real data set.


Introduction
Graphical models describe intrinsic relationships among a collection of variables. Each variable in the collection is represented by a node or a vertex. Two nodes in the graph are connected by an edge if and only if the corresponding variables are not conditionally independent given the remaining variables. Conditional independence impacts the precision matrix, that is, the inverse covariance matrix, by setting the (i, j)th off-diagonal entry to zero if the random variables associated with the ith and jth nodes are conditionally independent given others. Conditional independence makes the partial correlation coefficient between the random variables associated with the ith and jth entries equal to zero as well. If the random variables in the collection can be assumed to be jointly normally distributed, then the conditional independence between the ith and the jth variables is exactly equivalent to having the (i, j)th entry of the precision matrix equal to zero. Such models are known as Gaussian Graphical Models (GGMs). Learning the conditional dependence structure in a GGM is therefore equivalent to estimating the corresponding precision matrix under the assumed sparsity condition. Modeling intrinsic dependence between random variables through GGMs is commonly used in biology, finance, and the social sciences.
Estimation of a sparse precision matrix needs some form of regularization. In the non-Bayesian literature, the estimation is typically carried out by minimizing the penalized log-likelihood of the data with an 1 -penalty on the elements of the precision matrix. This problem is known as the graphical lasso algorithm (Friedman et al., 2008). Many algorithms have been proposed to solve this problem including (Meinshausen and Buhlmann, 2006;Yuan and Lin, 2007;Friedman et al., 2008;Banerjee et al., 2008;d'Aspremont et al., 2008;Rothman et al., 2008;Lu, 2009;Scheinberg et al., 2010;Witten et al., 2011;Mazumder and Hastie, 2012).
Bayesian methods for GGMs involve using priors on the precision matrix and priors on the graph as well. A popular prior on a precision matrix is given by the family of G-Wishart priors (Giudici, 1999;Letac and Massam, 2007;Wang and Li, 2012). The G-Wishart prior is conjugate to multivariate normal random variables and yields an explicit expression for the posterior mean. If the underlying graph is decomposable, the normalizing constant in a G-Wishart distribution has a simple closed form expression. In the absence of decomposability, the expression is more complex (see (Uhler et al., 2017)), but may be computed by simulations. Simulations from a G-Wishart distribution is possible using the R package BDgraph Wit, 2015, 2017;Dobra and Mohammadi, 2017;Mohammadi and Wit, 2018), which uses an explicit expression for the normalizing constant for a decomposable graph and uses the birth-death MCMC algorithm (Mohammadi and Wit, 2015) if the graph is not decomposable. This allows computation of the marginal likelihood, and hence the posterior probability, of any given graph. However, as the number of possible graphs is huge, computing posterior probabilities of all graphs is an impossible task for even a modest number of nodes. The problem is worsened by the fact that a very low fraction of graphs are decomposable. Thus when learning the graphical structure from the data, alternative mechanisms of putting priors on the entries of the precision matrix that allow sparsity are typically employed. A prior that models a sparse precision matrix is ideally a mixture of a point mass at zero and a continuous component (Wong et al., 2003;Carter et al., 2011;Talluri et al., 2014;Banerjee and Ghosal, 2015). However, since the normalizing constants in these mixture priors are intractable due to the positive definiteness constraint on the precision matrix, absolutely continuous priors have been proposed. The Bayesian graphical lasso (Wang, 2012) has been developed as a Bayesian counterpart to the graphical lasso. However, its use of a double exponential prior, which does not have enough mass at zero, does not give a true Bayesian model for sparsity. Continuous shrinkage priors, such as the horseshoe (Carvalho et al., 2010), generalized double Pareto (Armagan et al., 2013), Dirichet-Laplace (Bhattacharya et al., 2015), and others have been proposed as better models of sparsity since these priors have infinite spikes at zero and heavy tails.
Only a few results on the frequentist behavior of Bayesian methods for precision matrix estimation exist in the literature. Banerjee and Ghosal (2014) studied posterior convergence rates for a G-Wishart prior inducing a banding structure, but the true precision matrix need not have a banded structure. Banerjee and Ghosal (2015) provided results on posterior contraction rates for the precision matrix under point mass spike-and-slab priors.
Although GGMs are useful, the distributional assumption may fail to hold on some occasions. A nonparametric extension of the normal distribution is the nonparanormal distribution in which the random variables X = (X 1 , . . . , X d ) are replaced by some transformed random variables f (X) := (f 1 (X 1 ), . . . , f d (X d )) and it is assumed that f (X) has a d-variate normal distribution N d (µ, Σ) (Liu et al., 2009). In some situations, the logarithmic transform may be appropriate, but in general the transformations f 1 , . . . , f d are hard to specify. It is therefore more sensible to let f 1 , . . . , f d be unspecified, and use a nonparametric technique for their estimation. Liu et al. (2009) designed the nonparanormal graphical model, a two-step estimation process in which the functions f j were estimated first using a truncated empirical distribution function, and then the inverse covariance matrix Ω = Σ −1 was estimated using the graphical lasso applied to the transformed data. Although the approach in Liu et al. (2009) works well in many settings, their estimator for the transformation functions is based on the empirical distribution function, which leads to an unsmooth estimator. While the focus of this paper is on the nonparanormal graphical model, an alternative to the nonparanormal graphical model is the copula Gaussian graphical model (Pitt et al., 2006;Dobra and Lenkoski, 2011;Liu et al., 2012;Mohammadi and Wit, 2017) which avoids estimation of the transformation functions by using rank-based methods to transform the observed variables.
Bayesian approaches can naturally blend the desired smoothness in the estimate by considering a prior on a function space that consists of smooth functions. Gaussian process priors are the most commonly used priors on functions (Lenk, 1991;Rasmussen and Williams, 2006;Choudhuri et al., 2007;van der Vaart and van Zanten, 2007). Priors on function spaces have also been developed using a finite random series of certain basis functions like trigonometric polynomials, B-splines, or wavelets (Rivoirard and Rousseau (2012); de Jonge and van Zanten (2012); Arbel et al. (2013); Shen and Ghosal (2015)). We consider a Bayesian approach using a finite random series of B-splines prior on the underlying transformations. We choose the B-splines basis over other possible choices because B-splines can easily accommodate restrictions on functions, such as monotonicity and linear constraints, without compromising good approximation properties (Shen and Ghosal, 2015). In our context, as the transformation functions f 1 , . . . , f d are increasing, imposing the monotonicity restriction through the prior is essential. This can be easily be installed through a finite random series of B-splines by imposing the order restriction on the coefficients. By equipping the vector of the coefficients with a multivariate normal prior truncated to the cone of ordered coordinates, the order restriction can be imposed maintaining the conjugacy inherited from the original multivariate normal distribution. A simple Gibbs sampler is constructed in which first, a truncated normal prior on the transformation functions results in a truncated normal posterior distribution that is sampled using a Hamiltonian Monte Carlo technique (Pakman and Paninski, 2014) and second, a Student t-spike-and-slab prior on the precision matrix of the transformed variables results in sampling the corresponding posterior distribution of the precision matrix and the edge matrix, which determines the absence or presence of an edge in the graphical model. The underlying graphical structure can then be constructed from the obtained edge matrix.
The paper is organized as follows. In the next section, we state model assumptions of the Gaussian graphical model and the nonparanormal graphical model. In addition, we specify the prior distributions for the underlying parameters. In Section 3, we derive the posterior distributions, describe the Gibbs sampling algorithm and the tuning procedure. In Section 4, we provide a posterior consistency result for the priors under consideration. In Section 5, we present a simulation study. In Section 6, we apply the method to a real data set and we provide proofs in Section 7. Finally, we conclude with a discussion section.

Model and Priors
Let X = (X 1 , . . . , X p ) denote a random vector that is distributed as p-variate multivariate normal, N p (µ, Σ). The undirected graph G = (V, E) that corresponds to this distribution consists of a vertex set V , which has p elements for each component of X, and an edge set E which consists of ordered pairs (d, k) where (d, k) ∈ E if there is an edge between X d and X k . The edge between (d, k) is excluded from E if and only if X d is independent of X k given all other variables, to be denoted by X \{d,k} . For multivariate normal distributions, the conditional independence holds if and only if Σ −1 d,k = Ω d,k = 0; here for a matrix A, A d,k denotes its (d, k)th element. Definition 1. A random vector X = (X 1 , ..., X p ) has a nonparanormal distribution if there exist smooth monotone functions . . , f p (X p )). In this case we shall write X ∼ NPN(µ, Σ, f ).
By assuming that the transformed variables f (X) are distributed as normal, the conditional independence information in the nonparanormal model is completely contained in the parameter Ω, as in a parametric normal model. Since the transformation functions are one-to-one, the inherent dependency structure given by the graph for the observed variables is retained by the transformed variables. We note that any continuous random variable can be transformed to a normal variable by a strictly increasing transformation. However testing for high-dimensional multivariate normality is not feasible, and testing for the nonparanormality assumption is not possible in high dimension, but clearly the condition is a lot more general than multivariate normality. Instead of testing for nonparanormality, one may assess the efficacy of the assumption by looking at the effect of the transformations. If the transformation functions are linear, then assuming multivariate normality should be adequate. If the transformation functions are non-linear, then modeling through the nonparanormal distribution may be useful.
We put prior distributions on the unknown transformation functions through a random series based on B-splines. The coefficients are ordered to induce monotonicity, and the smoothness is controlled by the degree of the B-splines and the number of basis functions used in the expansion. Cubic splines, which are B-splines of degree 4, are used in this paper. The resulting posterior means of the coefficients give rise to a monotone smooth Bayes estimate of the underlying transformations.
Thus the smooth monotone functions that we use to estimate the true transformation functions are assumed to be multivariate normal, where f is a p-vector of functions, X is an n × p matrix, and θ j is a p-vector; here B j (·) are the B-spline basis functions, θ j are the associated coefficients in the expansion of the function, and J is the number of B-spline basis functions used in the expansion. These transformed variables f (X) are subsequently used to estimate the sparse precision matrix and hence in structure learning.
In the next part, we discuss the prior on the coefficients in more detail.
• Prior on the B-spline coefficients First, we temporarily disregard the monotonicity issue and put a normal prior on the coefficients of the B-splines, θ ∼ N J (ζ, σ 2 I), where σ 2 is some positive constant, ζ is some vector of constants, and I is the identity matrix. A normal prior is convenient as it leads to conjugacy. However, apart from monotonicity of the transformation, we also need to address identifiability since unknown µ and Σ allow flexibility in the location and the scale of the transformation so that the distribution of f (X) can be multivariate normal for many different choices of f . The easiest way to address identifiability is to standardize the transformations by setting µ = 0 and the diagonal entries of Σ to 1. However, then it will be more difficult to put a prior on sparse Ω complying with the restriction on the diagonal entries of Σ because of the constraint Σ = Ω −1 . Hence it is easier to keep µ and Ω free and impose restrictions on the locations and the scales of the transformation functions f d , d = 1, . . . , p. There are different ways to impose constraints on the locations and scales of f d . One can impose some location and scale restrictions on the corresponding B-spline coefficients, for instance, by making the meanθ d = J −1 J j=1 θ dj = 0 and the varianceθ d = J −1 J j=1 (θ dj −θ d ) 2 = 1. Then the prior distribution for θ d , d = 1, . . . , p, will have to be conditioned on these restrictions. The non-linearity of the variance restriction makes the prior less tractable. In order to obtain a conjugate normal prior, we instead consider the following two linear constraints on the coefficients through function values of the transformations: It may be noted that, as only a few B-spline functions are non-zero at any given point, the restrictions (2.2) and (2.3) involve only a few θ j s. More specifically, as the degree of B-splines used in this paper is 4, the first equation involves only 4 coefficients and the second only 8, no matter how large J is.
The linear constraints can be written in matrix form as and c = (0, 1) .
Using conditional normal distribution theory, the resulting prior on the coefficients θ is where the prior mean and variance are However, the prior dispersion matrix Γ is singular due to the two linear constraints, resulting in a lack of Lebesgue density for the prior distribution on R J . Thus, we work with a dimension reduced coefficient vector by removing two coefficients to ensure that we have a Lebesgue density on R J−2 for the remaining components. Suppose we remove the last two coefficients.
Then, the reduced vector of basis coefficients isθ d = [θ d,1 , θ d,2 , ..., θ d,J−2 ]. Then we can solve for θ d,J−1 and θ d,J using Aθ = c to obtain, Then the resulting prior for the coefficients for each predictor is, where the reduction is denoted with a bar.
Finally, we impose the monotonicity constraint on the coefficients, which is equivalent with the series of inequalities (2.11) Due to the two linear constraints, the monotonicity constraint reduces tō The final prior on the coefficients is given by a truncated normal prior distribution (2.14) where T = {θ :Fθ+ḡ > 0}, and the N p (µ, Σ)-distribution restricted on a set T is denoted by TN p (µ, Σ, T ). The conjugacy property of the prior distribution is preserved by the truncation. Instead of the simplifying example of solving for the last two coefficients, we use a more general method to reduce the dimension. The Symbolic Math Toolbox in MATLAB was used to solve for any two coefficients in terms of the remaining coefficients. In particular, for the first row of the linear constraints matrix A (2.5), we find the first column with a nonzero element. Then, for the second row of the linear constraints matrix, we find the first column with a nonzero element that is not the same as the column selected from the first row. We use the indices from those two columns to select the two coefficients that will be removed from the dimension in order to findθ,F , andḡ.
Although any choice of ζ is admissible, the prior can put a substantial probability of the truncation set T = {θ :Fθ +ḡ > 0} only when the original mean vector ζ has increasing components. A simple choice of ζ involving only two hyper-parameters is given by where ν is a constant, τ is a positive constant, and Φ −1 is the inverse of the cumulative distribution function (i.e. the quantile function) of the standard normal distribution. The motivation for the choice comes from imagining that the prior distribution of each θ j as N(ν, τ 2 ) before the ordering is imposed, and hence the expectations of the order statistics of N(ν, τ 2 ) may be considered as good choices for their means. The expressions in (2.15) give reasonable approximation of these expectations. Similar expressions Φ −1 (j/(J + 1)) appear for the score function of locally most powerful rank tests against normal alternatives (see Hájek et al. (1999)). Royston (1982) . . , J, as a more accurate approximation for the expected values of standard normal order statistics than the expression Φ −1 (j/(J + 1)) used in rank tests.

• Prior on the mean
For each predictor, we put an improper uniform prior • Prior on the precision matrix We build on the techniques of Wang (2015), which use a normal spike-and-slab prior to estimate a sparse precision matrix, but replace the normal by a Student t-distribution spikeand-slab prior, following Scheipl et al. (2012). Let τ 2 d,k be the slab variance and c 0 τ 2 d,k be the spike variance. The spike scale c 0 is assumed to be very small and given. Having a continuous spike instead of a point mass at zero is more convenient since it admits density; see Wang (2015). Unlike in Wang (2015), we estimate the sparse precision matrix by allowing the spikeand-slab variances and probability to be random with an inverse-gamma prior to lead to a Student t-distribution for the slabs. The diagonal entries of Ω are given an exponential distribution with rate parameter λ/2 for some λ > 0. We introduce a symmetric matrix of latent binary variables L = ((l d,k )) with binary entries to represent the edge matrix. The entries l d,k , d < k, are assumed to be independent with π denoting the probability of 1, i.e. the probability of an edge. Let N(·|·, ·) and Exp(·|·) respectively stand for the densities of the normal and exponential distributions.
. The joint prior for Ω = ((ω d,k )) and L is then obtained as imsart-ba ver. 2014/10/16 file: ba-sample.tex date: June 13, 2018 The prior for η = (τ 2 d,k , π d,k , d < k, λ) are given by, independently of each other, where Be stands for the beta distribution and IG for the inverse-gamma distribution. The value of λ controls the distribution of the diagonal elements of Ω. We use λ = 1 under similar reasoning to Wang (2015), because it assigns considerable probability to the region of reasonable values of the diagonal elements. We used b 0 = b 1 = 1 for the inverse gamma prior and we tuned the hyperparameters, a 0 and a 1 , for the beta prior. The beta prior for the weights π can be used to incorporate prior knowledge about the sparsity of Ω. See Scheipl et al. (2012) for more details regarding the spike-and-slab prior based on a mixture of inverse gamma distributions.

Posterior Computation
The full posterior distribution is ), the prior on the B-spline coefficients is p d (θ d ), the prior on the means is p(µ d ), and the joint prior on the sparse precision matrix and the edge matrix is p(Ω, L). Here, the likelihood is constructed from the working assumption that J j=1 θ j B j (X) ∼ N p (µ, Ω −1 ). The joint posteriors are standard and so they are not derived. They can be evaluated in the following Gibbs sampling algorithm.
(a) Since we can reduce the number of coefficients by two, the basis functions for these two coefficients can be represented as where the * is used to denote the two-dimensional vectors B * and θ * d .
. . , p, i = 1, . . . , n), the joint posterior for the B-spline coefficients is a truncated normal, with density imsart-ba ver. 2014/10/16 file: ba-sample.tex date: June 13, 2018 restricted on the region {Fθ +ḡ > 0} to satisfy the monotonicity constraint. However, this truncated multivariate normal distribution is p × (J − 2) dimensional, so we sample it using the following conditional normals in a Markov chain, where, using the conditional normal theory, and λ 2 d = 1/ω d,d . Samples from the truncated conditional normal posterior distributions for the B-spline coefficients are obtained using the exact Hamiltonian Monte Carlo algorithm (exact HMC) (Pakman and Paninski, 2014). Each iteration of the exact HMC results in a transition kernel which leaves the target distribution invariant and the Metropolis acceptance probability equal to 1. The exact HMC within Gibbs is like Metropolis within Gibbs and hence is a valid algorithm to sample from the joint density.
2. Obtain the centered transformed variables: 3. The posterior density of Ω given L is where S = Z Z.
where φ stands for the normal density function.
4. We ensure that Ω is positive definite by checking that all of its eigenvalues are positive.
These steps are repeated until convergence.

Choice of Prior Parameters
We use a model selection criterion to determine the optimal number of basis functions pre-MCMC. Sampling methods that involve putting a prior on the number of basis functions, such as reversible jump Monte Carlo, are computationally expensive. We calculate the Akaike Information Criterion (AIC) for different numbers of basis functions and chose the number of basis functions that corresponded to the lowest AIC. The AIC was determined by minimizing the negative log-likelihood, −2l(θ d ), respect to the basis coefficients subject to the linear and monotonicity constraints. The AIC is preferred here as the true transform does not belong to the set of splines and hence correct model selection is not the goal, but minimizing estimated estimation error is, which is provided by the model with the lowest AIC. The lowest AIC was found between a grid of five and 20 basis functions by doing a search in which the lowest AIC was chosen when the next five values were larger than the current value in the search, since the AIC should generally be a monotonic curve. Then for each predictor, d = 1, . . . , p, and for the number of basis functions, J, (3.2) After plugging in the maximum likelihood estimators (MLEs) of µ d and σ d and making the substitution Z id = B j (X id ) − n −1 n m=1 B j (X md ), minimizing the −2l(θ d ) results in the following problem, minimize This problem can be equivalently solved using the quadratic programming function in MATLAB Optimization Toolbox: For numerical stability, the monotonicity constraint was changed to Fθ d ≥ 10 −4 . Finally, after plugging in the solution to the quadratic programming problem,θ d , the final number of basis functions is chosen by selecting the number J that minimized the AIC We use a model selection criterion to determine the hyperparameters, a 0 and a 1 , for the beta distribution for ((π dk )) and to determine the constant value for the spike scale, c 0 , after the MCMC sampling. Inspired by Dahl et al. (2005), we solve a convex optimization problem in order to use the Bayesian Information Criterion (BIC). First, we find the Bayes estimate of the inverse covariance matrix,Ω Bayes . Using the 1 -loss function, the Bayes estimate is defined asΩ Bayes = E(Σ|Z) −1 , as derived by Yang and Berger (1994). We find the average of the transformed variables,Z = M −1 M m=1 Z m , where Z m , m = 1, . . . , M , are obtained from the MCMC output. Then, using the sum of squares matrix, S =Z Z , we solve the following to obtain the maximum likelihood estimate of the inverse covariance matrix,Ω MLE , minimize Ω − log det Ω + 1 n tr(ΩS), subject to C(Ω), (3.5) where C represents the elements ofΩ that are zero and nonzero, and they are determined by the zeros of the estimated edge matrix from the MCMC. The estimated edge matrix from the MCMC sampling will be described in more detail in Section 5. This constrained optimization problem was implemented as an unconstrained optimization problem, as described in Dahl et al. (2005).
Finally, we calculate BIC = −l(Ω MLE ) + k log n, where k = #C(Ω), the sum of the number of diagonal elements and the number of edges in the estimated edge matrix, and −l(Ω MLE ) = −n log detΩ MLE + tr(Ω MLE S).
We select the combination of hyperparameters, a 0 , a 1 , c 0 , that results in the smallest BIC.

Posterior Consistency
Posterior consistency is a fundamental way of validating a Bayesian method using a frequentist yardstick in the large sample setting, and is of interest to both frequentists and Bayesians; for a thorough account of posterior consistency, see Ghosal and van der Vaart (2017). In Gaussian graphical models, using point mass spike-and-slab priors, Banerjee and Ghosal (2015) showed that the posterior for Ω is consistent in the high-dimensional setting provided that (p + s)(log p)/n → 0, where s stands for the number of non-zero off-diagonal entries of the true Ω. With a slight modification of the arguments, it follows that the result extends to continuous spike-and-slab priors provided that the spike scale c 0 is sufficiently small with increasing p. In the nonparanormal model, the main complicating factor comes from the unknown transformations f 1 , . . . , f p , since the rest will then be as in a Gaussian graphical model. Below we argue that these transformations may be estimated consistently in an appropriate sense.
We study the posterior distributions for each transformation f d separately, which can be learned from the marginal likelihood for each component. Thus the problem of posterior consistency for f d can be generically described as follows. For brevity, we drop the index d. Consider the model Y = f (X) ∼ N(µ, σ 2 ), where f is a continuously differentiable, strictly monotone increasing transformation from (0, 1) to R. Clearly, this model is not identifiable and hence consistent estimation is not possible in the usual sense. Identifiability can be ensured by setting µ = 0 and σ = 1, but the procedure followed in this paper instead puts constraints on f : f (1/2) = 0 and f (3/4) − f (1/4) = 1. We shall show that the posterior for f is consistent under this set of constraints.
As the function f is necessarily unbounded near 0 and 1 to ensure that f (X) is normally distributed, which is a distribution with unbounded support, it is clear that uniform posterior consistency for f is not possible. We shall therefore consider the notion of uniform convergence on a compact subset of (0, 1): for a fixed δ > 0, the pseudo-metric to consider is d(f 1 , f 2 ) = sup{|f 1 (x) − f 2 (x)| : δ ≤ x ≤ 1 − δ}. Even then, the usual posterior distribution may be highly impacted by observations near 0 or 1, so we actually study a modified posterior distribution, based on observations falling within the given fixed compact subset [δ, 1 − δ] of (0, 1), with δ < 1/4, to be described below.
Note that the connections between F and f is given by Thus we have We note that the posterior distributions of the quantities π − and π + can be obtained based on the counts n * − and n * + respectively. In particular, using a Dirichlet prior on the probability vector (π − , π + , 1 − π − − π + ), we have consistency for the posterior distribution of (π − , π + ) at (π − 0 , π + 0 ). We shall assume that the posterior distribution of (π − , π + ) is consistent. Note that the truncated observations alone do not lead to a posterior distribution for (π − , π + ).
The modification in the posterior distribution of µ, σ and f that we consider can be described as follows. Using the given prior on (µ, σ, f ) and the truncated observations X * 1 , . . . , X * n * , we obtain the induced posterior distribution of F * , while we obtain the posterior distribution on (π − , π + ) directly conditioning on (n * − , n * + ). Then the posterior distribution of {F (x) : x ∈ [δ, 1 − δ]} is induced from (4.1). Finally the modified posterior distribution of (µ, σ, f ) is induced from the relations . (4.5) The following theorem on posterior consistency refers to this modified posterior distribution rather than the original posterior distribution of (µ, σ, f ). The proof can be found in the Supplementary Material.
Theorem 1. In the above setting. let the prior on µ and σ contain µ 0 and σ 0 in its support and independently, the prior Π for f satisfies the condition that Then for any > 0, The condition on the prior for the transformation f is satisfied by the truncated normal prior described in Section 2, and hence the transformation f (as well as the mean and variance parameters µ and σ 2 ) are consistently estimated by the posterior, as shown in the following corollary.
Corollary 1. Let the prior on f be described by f = J j=1 θ j B j , where the prior for J has infinite support and θ = (θ 1 , . . . , θ J ) is given a truncated normal prior as described in Section 2. Then for any > 0, Π(f : d(f, f 0 ) < , d(f , f 0 ) < ) > 0 and hence (4.6) holds.

Simulation
We conduct a simulation study to assess the performance of the Bayesian approach to graphical structure learning in nonparanormal graphical models. The unobserved random variables, Y 1 , . . . , Y p , are simulated from a multivariate normal distribution such that Y i1 , . . . , Y ip i.i.d.
∼ N p (µ, Ω −1 ) for i = 1, . . . , n. The means µ are selected from an equally spaced grid between 1 and 2 with length p. We consider nine different combinations of n, p, and sparsity for Ω: • p = 25, n = 50, sparsity = 10% non-zero entries in the off-diagonals • p = 50, n = 150, sparsity = 5% non-zero entries in the off-diagonals • p = 100, n = 500, sparsity = 2% non-zero entries in the off-diagonals The sparsity levels for Ω are computed using lower triangular matrices that have diagonal entries that are Gaussian distributed with µ diag = 1 and σ diag = 0.1, and non-zero off-diagonal entries that are Gaussian distributed with µ \diag = 0 and σ \diag = 1. Since these are lower triangular matrices, we are ensured to have positive definite matrices.
The hyperparameters for the prior are chosen to be µ = 1, τ = 0.5, and σ 2 = 1. The observed variables X = (X 1 , . . . , X p ) are constructed from the simulated unobserved variables Y 1 , . . . , Y p . The functions used to construct the observed variables are four c.d.f.s and the power function evaluated at the simulated unobserved variables Y 1 , . . . , Y p . The four c.d.f.s are: normal, logistic, extreme value, and stable. The power function is where m is an integer between 1 and 5. The values of the parameters for each of the c.d.f.s were the maximum likelihood estimates for the parameters of the corresponding distributions (normal, logistic, extreme value, and stable), using the variables Y 1 , . . . , Y p .
The initial B-spline coefficient values for the exact HMC algorithm are constructed as follows. First, pretending that the data are already normal, we start with the identity function f (X d ) = X d , where X d is uniform and Φ −1 (X d ) is normal so that f = Φ −1 . Then in the model, with the pretension that the transformation is a linear combination of B-spline basis functions, Multiplying both sides by B k (X d ) and integrating, we have Since these functions in the integral are functions of normal probability densities, Gauss-Hermite quadrature is used to estimate the left and right-hand sides. The number of points used is 20. Then setting the approximation for the left-hand side, equal to b, and setting the approximation for the right-hand side, we have the linear equation b = Eθ. Using the quadratic programming function in the MATLAB Optimization Toolbox, we solve for θ for each predictor For numerical stability, the monotonicity constraint is changed to F θ ≥ 10 −4 .
After finding the initial coefficient values θ d , we construct the initial values for Y d = J j=1 θ dj B j (X d ) using the observed variables. These initial values for Y are used to find initial values for Σ, µ, and Ω for the algorithm, where Σ initial = cov(Y ), µ initial =Ȳ , and Ω initial = Σ −1 initial . We consider four combinations of the hyperparameter settings for the spike-and-slab algorithm with c 0 = {0.02, 0.00025} and (a 0 , a 1 ) = {(5, 25), (10, 30)}. The combination of hyperparameters that yield the lowest BIC is selected for the final estimates of the precision matrix and edge matrix. The spike-and-slab algorithm is implemented in MATLAB by modifying the code provided by Wang (2015).
Three chains are initiated for the algorithm, each of length 2T , where T = 15000. These chains have different starting points that are determined from the initial values of θ d . We modify the minimization statement to 1 2 Cθ E ECθ − b CEθ, where C is randomly selected without replacement to be 1, 2, or 3 for each of the three chains. The exact HMC algorithm is implemented in MATLAB using the code provided by the authors (Pakman and Paninski, 2014).
The total iteration range is (1, . . . , 2T ). The total iteration range is divided into Q batches of length a, where a = max(100, 2T /100 ) and Q = 2T /a. The burn-in is set to qa/2 . As recommended by Gelman et al. (2014) in Chapter 11, convergence is assessed by monitoring the mixing and stationarity after burn-in, by splitting each chain in half and checking that all the resulting half-sequences are mixed. We monitor one scalar quantity of interest, the total number of edges estimated by the edge matrix. For this quantity, we estimate the potential scale reduction,R, as well as the effective sample size, as described in Chapter 11 of Gelman et al. (2014). The Gibbs algorithm is deemed converged whenR < 1.05 and the effective sample size is at least 100.
The nonparanormal method of Liu et al. (2009) is implemented using the R package huge (Zhao et al., 2015). The graphical lasso method is selected for the graph estimation and the screening method selected is the lossless screening method. Three regularization selection methods are used to select the graphical model: the Stability Approach for Regularization Selection (StARS) (Liu et al., 2010), the modified Rotation Information Criterion (RIC) (Lysen, 2009), and the Extended Bayesian Information Criterion (EBIC) (Foygel and Drton, 2010). The default parameters in the huge package are used for each selection method. The documentation for the huge package mentions an alternative threshold of 0.05 for the StARS method, but the results are not sensitive to the default choice of 0.1 or 0.05, so the default threshold of 0.1 is used. As in Liu et al. (2009), the number of regularization parameters used is 50 and they were selected among an evenly spaced grid in the interval [0.16, 1.2].
We run 100 replications for each of the nine combinations and assessed structure learning for each replication. For each replication, we determine the final hyperparameter setting for the spikeand-slab algorithm by choosing, out of the four hyperparameter settings, the one that yielded the lowest value of the BIC. The finally selected hyperparameter setting is used to find the Bayesian estimates of the precision and edge matrices and are used to learn the graphical structure.
To assess the performance of the graphical structure learning, specificity (SP), sensitivity (SE), and Matthews Correlation Coefficient (MCC) are computed. These metrics have been previously used for assessing the accuracy of classification procedures (Baldi et al., 2000). They are defined as follows: The median probability model (Berger and Barbieri, 2004), commonly used for graphical model structures, is used to find the Bayesian estimate of the edge matrix. The edge matrix estimate is found by comparing the mean of the samples of edge matrices and determining if each off-diagonal element of the mean is greater than 0.5. Then it is coded as an edge, and each off-diagonal element with mean not greater than 0.5 is coded as no edge. Models that are estimated to have no edges resulted in NaNs as MCC values. If there are NaNs, they are removed from the calculations of the means and standard errors of the MCC values in the simulation. The results are presented in Tables  1-3. The Bayesian method has consistently high specificity, unlike the models selected by the EBIC, StARS, and RIC methods. The Bayesian method suffers in sensitivity for the 10%, 5%, and 2% models, but the models selected by the EBIC, StARS, and RIC methods also suffer in sensitivity. It is interesting to note that the EBIC selection method has been shown to perform well with the graphical lasso (Foygel and Drton, 2010), but appears to suffer in performance when the graphical lasso is combined with the nonparanormal estimation method. Overall, based on the MCC values, the Bayesian method performs better than the models selected by the EBIC, StARS, and RIC methods.

Real Data Application
We consider the data set based on the GeneChip (Affymetrix) microarrays for the plant Arabidopsis thaliana (Wille et al., 2004). Since there are 118 microarrays, the sample size is n = 118. There are 39 genes from the isoprenoid pathway that are used. For pre-processing, the expression levels for each gene, x i for i = 1, . . . , 118, are log-transformed and converted to values between 0 and 1 using the equation (x i − min(x i ))/(max(x i ) − min(x i )). We study the associations among the genes using the Bayesian nonparanormal method and the nonparanormal method of Liu et al. (2009). These data are treated as multivariate Gaussian in the original analyses (Wille et al., 2004). For the Bayesian nonparanormal method, the final hyperparameter setting is chosen using the BIC method and for the nonparanormal method of Liu et al. (2009), 50 regularization parameters are used on an evenly spaced grid in the interval [0.16,1.2]. The three selection methods, RIC, EBIC, and StARS, are used with the default parameters in the huge package. The nonparanormal model selected by EBIC result in no edges, so this model is not included in the comparison. The graphs are displayed in Figure 1.
Our study shows that the Bayesian and the non-Bayesian selection methods for nonparanormal models lead to graphs with different sparsity. In particular, the Bayesian nonparanormal and the   Note that weak consistency implies posterior consistency of the corresponding cumulative distribution function F * with respect to the uniform distance in view of Pòlya's theorem, i.e. Π(sup{|F * (x) − F * 0 (x)| : x ∈ I} > |X * 1 , . . . , X * n * ) → 0 a.s. for any > 0.
This completes the proof of theorem.
To prove the corollary, we observe that as the true transformation f 0 is strictly increasing and continuously differentiable, it is uniformly approximable on compact subintervals of (0, 1) by a linear combination of B-splines with strictly monotone increasing coefficients, by applying Lemma 1(b) of Shen and Ghosal (2015) to the derivative function, where the derivative is also uniformly approximable on compact intervals. Now since the truncated normal distribution has positive density on a small neighborhood of a strictly increasing coefficient vector, the condition of prior positivity is fulfilled. Thus the posterior consistency holds under the B-spline series prior with respect to the uniform pseudo-distance on any compact subset of (0, 1).

Discussion
We have introduced a Bayesian method to construct graphical models for continuous data that do not rely on a normality assumption. The method assumes the nonparanormal structure, that under some unknown monotone transformations on each components, the original observation vector reduces to a multivariate normal vector. The precision matrix of the transformed observations thus also determines the graphical structure of conditional independence of the original observations. We have considered a prior distribution on the underlying transformations through a finite random series of B-splines with increasing coefficients that are given a multivariate truncated normal prior, and the precision matrix of the transformed observations is given a spike-and-slab prior distribution. The procedure requires carefully considering identifiability restrictions. We have shown that certain linear constraints on the coefficients can give rise to identifiability. The advantage of using linear restrictions only is that the truncated multivariate normal structure on the vector of coefficients can be maintained under the identifiability restrictions. This allows us to use an efficient Gibbs sampler to compute the posterior distribution. We have shown that a suitably modified posterior distribution leads to posterior consistency of the mean and the variance of the transformed observations and the transformation functions using Euclidean distances on the mean and variance and the uniform pseudo-distance on a compact subset of the unit interval for the transformation functions.
The Bayesian method appears to perform better than an earlier proposed empirical estimation method in the nonparanormal model at picking up edges that are significantly different from zero, thereby resulting in sparser models. Although it is not feasible to check for the nonparanormal distribution and therefore determine if our transformations improve on detecting the true transformation functions compared to the previous method, we believe that the use of the smooth and strictly increasing transformation functions that take into account non-normality in combination with the prior on the precision matrix that incorporates sparsity improves on the goal of learning the structure of Gaussian graphical models when the data are continuous but not Gaussian.