Using Bayesian latent Gaussian graphical models to infer symptom associations in verbal autopsies

Learning dependence relationships among variables of mixed types provides insights in a variety of scientific settings and is a well-studied problem in statistics. Existing methods, however, typically rely on copious, high quality data to accurately learn associations. In this paper, we develop a method for scientific settings where learning dependence structure is essential, but data are sparse and have a high fraction of missing values. Specifically, our work is motivated by survey-based cause of death assessments known as verbal autopsies (VAs). We propose a Bayesian approach to characterize dependence relationships using a latent Gaussian graphical model that incorporates informative priors on the marginal distributions of the variables. We demonstrate such information can improve estimation of the dependence structure, especially in settings with little training data. We show that our method can be integrated into existing probabilistic cause-of-death assignment algorithms and improves model performance while recovering dependence patterns between symptoms that can inform efficient questionnaire design in future data collection.


Introduction
In many parts of the world, deaths are not systematically recorded, meaning that there is massive uncertainty about the distribution of deaths by cause (Horton, 2007;Jha, 2014).
Knowing why individuals are dying is essential for both rapid, acute public health actions (e.g. responding to an infectious disease outbreak) and for longer term monitoring (e.g. encouraging behavior change to head off an obesity epidemic). In many such areas, a surveybased tool called Verbal Autopsy (VA) are routinely used to collect information about the causes of death when medical autopsies cannot be performed. VAs consist of an interview with a caregiver or relative of the decedent and contain questions about the decedent's medical history and the circumstances surrounding the death. Collecting VA surveys is a time-consuming and resource intensive enterprise. A community informant alerts a health official of a recent death and then, after a period of months, a survey team returns to administer the VA survey with the relative or caregiver. VA surveys are taxing for the relative or caregiver both because they typically consist of over a hundred questions and because they require a person recall a traumatic time in depth. What's more, VA surveys themselves do not reveal the cause of death, but only the circumstances and symptoms.
Assigning cause of death requires either coding directly by a clinician or using one of several statistical and machine learning algorithms.
The majority of the existing statistical or algorithmic methods to assign cause of death using VA surveys make the assumption that VA symptoms are independent from one another conditional on cause of death (Byass et al., 2003;James et al., 2011;Miasnikof et al., 2015;McCormick et al., 2016). This assumption simplifies computation and is efficient in settings with limited training data. The ignored associations, however, provide valuable information that could be used to improve cause of death classification. Knowing that a person lives in a Malaria-prone area and had a fever before dying, for example, gives substantially more information than knowing only that the person presented with a fever before dying. One previous method by King and Lu (2008) does account for associations using a regression model on stochastic samples of combinations of symptoms taken from a gold-standard training dataset. This computation, however, is extensive for even moderately large symptom sets. Moreover, in order to account for symptom dependence, both the classic regression method (King and Lu, 2008) and the recently developed latent factor approach by Kunihama et al. (2018) relies on the existence of high-quality training data, which is typically unavailable in practice.
In this paper, we propose a latent Gaussian graphical model to infer associations between symptoms in VA data. In developing our model for associations between symptoms on VA surveys, we must address three statistical challenges that arise from the VA data. First, the VA data are of mixed types. That is, survey questions are a mixture of binary (e.g. Was the decedent in an auto accident?), continuous (e.g. How long did the decedent have a fever?), and count (e.g. How many times did the decedent vomit blood?) outcomes. The data are then usually pre-processed into a standard set of binary indicators for which many methods have been proposed to automatically assign cause(s) of death. Instead of dichotomize the many continuous variables, we develop a (latent) Gaussian graphical model and introduce new spike-and-slab prior for inverse correlation matrices to mitigate the risk of inferring spurious associations. We also develop an efficient Markov chain Monte Carlo algorithm to sample from the resulting posterior distribution.
A second challenge is that we want to not only learn the dependence structures among the symptoms given causes of death, but also to use them to improve prediction for unknown cause of death. As shown by a numerical example in the supplementary materials, violations of the conditional independence assumption can substantially bias the prediction of the cause of death. In practice, researchers are typically interested in both the accuracy of the method in assigning cause of death for a specific individual as well as the overall fraction of deaths due to each cause in the sample, in order to understand disease epidemiology and inform public policies. We address this challenge by extending the latent Gaussian model to a Gaussian mixture models framework so that it can be integrated into existing VA methods for both cause-of-death assignment and estimation of population cause-specific mortality fractions.
Lastly, a fundamental challenge we face in building such predictive model is that there is typically very limited training data available. "Gold standard" training data typically consists of physical autopsies with pathology reports. Obtaining such data is very expensive in low resource settings where they are not common practice, requires physicians commit time to performing autopsies rather than treating patients, and is generally only possible for the selected set of deaths that happen in a hospital. To date there is only one single "gold standard" dataset that is widely available to train VA algorithms (Murray et al., 2011a). This dataset, described in further detail in subsequent sections, contains cases from six different geographic areas. Using the binary symptoms from this dataset,  show that the empirical marginal probabilities (or conditional probabilities given a cause-of-death) of observing a symptom can vary significantly across training sites. The lack of reliable training data in the VA context limits the applicability of currently available statistical approaches. A standard approach to joint modeling of mixed variables, for example, is through characterizing the vector of observed variables by latent variables that follow some parametric models. However, when the data contains only a small number of observations and a high proportion of missing values, sometimes even the marginal distribution of the variables cannot be reliably estimated and thus it may lead to erroneous inference of the joint distribution of variables.
We address this challenge by incorporating expert knowledge, a common strategy in VA cause of death classification. In the VA context, expert knowledge consists of information about the marginal likelihood of seeing symptom given that the person died of a particular cause. This type of information is widely used in the VA literature (Byass et al., 2003) and is substantially less costly to obtain than in person autopsies. Only marginal propensities can be obtained since asking experts about all possible symptom combinations would be laborious and time consuming. A small number of joint probabilities could be solicited from experts, but there is currently no means available to guide researchers about which combinations are most influential. Our work provides one such approach for choosing combinations to elicit.
We incorporate information about the marginal propensity of seeing each symptom by decoupling the correlation matrix from the marginal variances and allow researchers to incorporate marginal informative priors through hierarchical models. This differs from many existing work on Gaussian copula models where the marginal distributions are usually treated as nuisance parameters (e.g., Hoff, 2007;. Our approach also contributes to the literature on inference of correlation matrices from mixed data, where several related ideas have been explored previously in other context. For instance, Talhouk et al. (2012) proposed a Bayesian framework for latent graphical models with decomposable graphs. Our shrinkage prior provides a more flexible approach to allow also non-decomposable graphs with a rejection-free sampling strategy. The recent work from  studied semiparametric approaches for structure learning and provided a two-step procedure to obtain sparse graph structures. Our approach also yields improved estimation of the latent correlation matrices and is more robust to missing data, as illustrated in Section 5. Our work also incorporates a different kind of expert knowledge, the marginal distribution of variables, rather than the interactions between variables, such as reference network structure among variables (Peterson et al., 2013) or distance metrics measuring 'closeness' of variables (Bu and Lederer, 2017).
The rest of the paper is structured as follows. In Section 2 we describe the proposed latent Gaussian graphical model to characterize the dependence structure in mixed data and present two different prior choices of the latent correlation matrix, reflecting different types of prior beliefs. In Section 3 we describe the details of the posterior sampling algorithms. In Section 4 we show how the latent Gaussian model could be extended to Gaussian mixture models and integrated into existing VA methods for cause-of-death assignment. Section 5 examines the performance of correlation matrix estimation, structure learning, and prediction performance with extensive numerical simulation. In Section 6 we apply our methods to a gold standard dataset and data from a health and demographic surveillance system (HDSS) site where only physician coded causes are available. Finally, in Section 7 we discuss the remaining limitations of the approach and some future directions for improvement.
2 Latent Gaussian graphical model for mixed data We begin by considering the characterization and estimation of dependence structures in mixed data. Let X = (X 1 , ..., X n ) T denote the data with n observations of p-dimensional random variables. In survey data, for example, X ij may represent the response of respondent i on question j. We use a latent Gaussian representation to encode the dependence between the variables by assuming that the observed data matrix X can be represented by a set of multivariate Gaussian random variables Z under some monotone transformation: where R is a correlation matrix, and f j (·)'s are non-decreasing functions. When the marginal transformation functions are unknown, this formulation is usually referred to as the Gaussian copula model (e.g., Xue et al., 2012). For continuous variables, a popular strategy to deal with the marginal transformation f j is to first estimate it byf j (z) =F −1 j (Φ(z)), whereF j is typically taken to be the empirical marginal cumulative distribution function of the j-th variable (e.g. Klaassen and Wellner, 1997;Liu et al., 2009). Inference on R is then performed with pseudo-dataẐ ij =f −1 j (X ij ). However, this strategy is problematic for discrete data, since directly applying monotonic marginal transformations changes only the sample space instead of the distribution of the observed data (Hoff, 2007). Therefore, for data with mixed variable types, it is common to adopt the semi-parametric marginal likelihood approach (Hoff, 2007). Inference on the correlation matrix is then carried out based on the marginal likelihood of the observed ordering of the variables, with the marginal transformation functions considered as nuisance parameters.
Moving now to binary variables, the marginal distribution can be characterized by the marginal probability, a single parameter. Thus direct estimation of the transformation functions can be reduced to estimating cutoffs of the latent Gaussian variables .
Conceptually, we can fix the marginal transformation, and estimate only the latent mean variable µ. That is, we can write the data generating process as where Λ is a diagonal matrix that contains marginal standard deviations for the continuous variables and fixed at 1 for the binary variables, and R is a correlation matrix. The marginal prior probabilities for binary variables p j = Pr(X ij = 1) are specified though the priors for µ, since the expectation of X ij given µ is P r(X ij = 1) = Pr(Z ij > 0) = 1 − Φ(−µ j ) = Φ(µ j ).
For simplicity, throughout this paper we assume the continuous variables are marginally Gaussian, similar to the scenario considered in . The extension to the case where the continuous variables exhibit non-Gaussian marginal patterns is straightforward by first preprocessing the raw continuous variables into pseudo-data using their marginal prior distributions (Liu et al., 2009) ). Specifying priors on their marginal variances, i.e., Λ, usually depends on the context. In this paper we adopt the improper prior on the marginal standard deviations suggested in Gelman (2006), so that Λ jj ∝ 1.
The latent Gaussian distribution provides a simplistic description of the conditional independence relationship for Z. Zeros in off-diagonal elements of the inverse correlation matrix, R −1 , correspond to pairs of latent variables that are conditionally independent given other latent variables. Thus for high-dimensional problems, we typically favor priors on R where elements in R −1 are shrunk to zero. However, the transformation of the marginal prior probabilities to µ 0 in the proposed model requiresR to have unit variance for the binary variables, or equivalently, the submatrix ofR corresponding to binary variables to be a correlation matrix. This complication prohibits standard graphical model problem to apply since posterior sampling on the space of the correlation matrices is generally more difficult than from the covariance matrices due to the constraint of unit diagonal elements. Next, we propose new class of priors and describe a parameter expansion (PX) scheme (Liu and Wu, 1999;Meng and Van Dyk, 1999) where the correlation matrix R is first expanded to a covariance matrix and updated, and then projected back to the space of correlation matrices.

Prior specification for the correlation matrix
We discuss two classes of priors for R = Λ −1R Λ −1 that lead to efficient posterior inference: one with the standard conjugate priors for the covariance matrix and uniform marginal priors for R, and one with a sparse structure in R −1 . Similar priors for marginally uniform R were proposed in Talhouk et al. (2012) for the multivariate probit model. Their direct generalization to sparse R −1 uses a Metropolis-Hasting algorithm that is computationally expensive and imposes an additional decomposability constraint on the graph structure. A major advantage of the proposed model, summarized in Section 3, is the computational simplicity of posterior sampling, as well as the removal of the decomposability constraint.

Marginally uniform prior for the correlation matrix
First, we review a marginally uniform prior on the correlation matrix, and the corresponding parameter expansion scheme. Without any additional knowledge about the structure of the latent correlation matrix, the marginal uniform prior on all the elements of R ( Barnard et al., 2000) is For the model Z i ∼ Normal(µ, R), sampling from the posterior distribution p(R|Z, µ) is not straightforward. However, with parameter expansion, we can expand the correlation matrix summarize the conditional independence structure in a more concise manner, one option would be to estimate a sparse representation ofR −1 using a two-stage procedure similar to  with the posterior meanR as input. Alternatively, we could incorporate sparsity directly into the prior, which we describe in the next section.

Spike-and-slab prior for the inverse correlation matrix
The marginally uniform prior for R is sometimes inappropriate for settings where sparse structure inR −1 is strongly suspected a priori. For example, we may expect only small groups of symptoms in a VA survey, say, all pregnancy-related symptoms, would be correlated but are conditionally independent of other clusters of symptoms. Several priors for sparse precision matrices have been proposed. The G-Wishart prior (Roverato, 2002) extends the Wishart distribution by restricting cells in the precision matrix that correspond to non-edges in a graph to be exact zeros, and has been extensively studied in existing literature (Jones et al., 2005;Mohammadi et al., 2017). More recently shrinkage priors have become more popular, in part due to their computational simplicity. Bayesian analogies to penalized precision matrix estimators have been proposed for Lasso (Wang and others, 2012; Peterson et al., 2013), horseshoe  and spike-and-slab mixture penalties (Wang, 2015;Li and McCormick, 2017;Deshpande et al., 2017). In this work we adapt the spike-and-slab prior idea proposed in Wang (2015) and propose a mixture prior for the inverse correlation matrix. The supplement material contains a brief introduction to Wang's original proposal and its relationship to Wishart priors.
The spike-and-slab framework is appealing because it performs graph selection and parameter inference simultaneously, in contrast to other shrinkage priors that require a further thresholding step after shrinkage. We put independent Gaussian priors on each off-diagonal element of the inverse correlation matrix, R −1 , i.e.
where R + denotes the space of correlation matrices, and C δ is the normalizing constant, which cancels out to result in the marginal prior (6) on the expanded parameter space.
We show in the supplementary material that C δ is finite and thus both distributions are proper. The proposed setup differs from current literature on shrinkage priors in two ways.
First, we restrict the support of R to the space of the correlation matrix, so that working with latent variables that cannot be normalized does not create identifiability issues. In the next section we show that this additional restriction does not increase computational cost by much. Second, we add a |R| −(p+1) term to ensure that the prior assigns no weight to degenerate R −1 . This term also allows the marginal distribution of Ω after parameter expansion to be in a form similar to the spike-and-slab prior defined in Wang (2015). Finally, we complete the parameter expansion scheme by defining the expansion parameter D such that d 2 j ∼ InvGamma((p + 1)/2, 1/2). The expanded precision matrix Ω = (DRD) −1 has the following marginal prior distribution: where σ 2 j is the j-th diagonal element of Ω −1 . This expanded prior can be derived with a standard change of variables, as described in more detail in the supplementary material.
However, it turns out that it can be efficiently sampled with a block update. We fully describe our sampling scheme in detail in Section 3.1.

Choosing the shrinkage parameters
The proposed prior for R has several hyperparameters, v 0 , v 1 , λ, and π δ , that jointly determine the prior scales and sparsity of R −1 . The relationship between the implied prior sparsity, i.e., p(δ = 1) and the hyperprameters, however, cannot be easily obtained, because of the constrained space of R + and the intractable normalizing constant C δ . We follow a similar practice to Wang (2015) in choosing the hyperparameters by simulating the implied prior edge probabilities from different combination of hyperparameters. We use the sampler in Section 3 and choose the values that lead to the desired prior sparsity.
Generally, v 1 /v 0 needs to be large so that it gives enough separation between the spikeand-slab densities. The choice of v 0 also needs to be carefully considered: an extremely small v 0 leads to a density that approaches the point-mass and thus can slow the mixing of the Markov chain, while a larger v 0 may absorb many elements of R −1 and assigns a heavy portion of prior mass on the 'sparse' models with many small values. The choice of v 0 may be roughly guided by comparing the marginal distributions implied by the prior to a pre-specified threshold for practical significance. We let v 0 = 0.01 in our experiments, as it can be seen from the prior simulation in Figure 1 that it assigns reasonable weights to graphs with edge probability between 0.05 to 0.2 under various choice of v 1 and π δ . Because of the linear constraints on the elements of R −1 imposed by the space of R + , the hyperparameter π δ typically differs from the implied marginal edge probability significantly, and also needs to be determined from numerical simulation. From Figure 1, the prior sparsity is relatively consistent for v 0 = 0.01 when v 1 /v 0 > 50 and π δ < 0.001 We chose v 1 /v 0 = 100 and π δ = 0.0001 in our experiments.
It is also worth noting that λ also contributes to the prior sparsity directly, as it regularizes the diagonal elements of R −1 . Since the support of diagonal elements of R −1 are (1, ∞), large λ restricts r jj to be closer to 1, leading the correlation between the j-th variable and other variables to be closer to 0, and thus sparser models. From our prior simulation, we found the choice of λ = 10 usually leads to reasonable prior sparsity. We include more discussion of the relationships between the proposed prior and that of Wang (2015) in the supplementary materials. 3 Sampling from the posterior Inference using the full model can be performed using Markov Chain Monte Carlo with mostly Gibbs steps and elliptical slice sampling (ESS), a rejection-free MCMC technique (Murray et al., 2010). We first describe in detail the sampling procedure with the spike-and-slab prior, and then describe how this step fits into the the full inference procedure in Section 3.2.

Posterior sampling with the spike-and-slab prior
We begin by describing sampling with the spike-and-slab prior. We update Ω with the prior defined in (6) in a column-wise fashion. Consider the j-th row and column of Ω, if we , and the variance specified by the latent indicators, V = {v 2 δ jk } jk , the joint distribution of u and v can be calculated as Notice that σ 2 j = 1/v, and for all k = j, σ 2 k depends on both u and v, rendering the block Gibbs update scheme in Wang (2015) inapplicable. However, the full conditional distribution for u and v can both be written as the product of a standard distribution and an additional correction term. We let then we have the full conditional distributions To sample from p(u|·), we use elliptical slice sampling (ESS) (Murray et al., 2010) to sample from both distributions by treating the normal distribution part as "prior" and the later term as "likelihood." For u, ESS first generates an elliptical locus from the normal prior and then searches for acceptable points for slice sampling. ESS typically sticks to the same posterior region when strong signals are provided in the "prior" Gaussian distribution, as is the case here. Additionally when Ω −1 is sparse, σ 2 k and d 2 k should be close to each other, and thus the signal from the "prior" part is typically much stronger. To implement ESS for v, we approximate the Gamma likelihood in . This approximation is typically reasonable given the size of n in the data we consider, and this again allows easy use of ESS. Furthermore, the added computational burden of ESS over the block Gibbs sampler in Wang (2015) is minimal, as the {σ 2 k }'s can be easily calculated by the fact that −j] , and σ 2 j = 1/v, without any additional computation of a matrix inversion.
Finally, each time a block update is performed, all latent indicators can be updated with the corresponding conditional posterior inclusion probabilities, .

Full posterior sampling steps
Given suitable initial values, the full sampling scheme updates each parameter in turn.
Update Z. The conditional posterior distributions of the latent variables conditional on the observed data are truncated Normal(µ,R) distributions with the truncation defined by domain I ij where I ij = (−∞, 0) if X ij binary and X ij = 0, (0, +∞) if X ij binary and X ij = 1, and (−∞, +∞) if X ij is missing or continuous. To sample from the multivariate truncated normal posterior, we draw approximate samples by iteratively sampling Z ij |Z i,−j by j] , and the truncated domain I ij is defined above.
Update Λ. We perform the conditional update of Λ by sampling from −j] . These conditional distributions can be efficiently sampled with ESS (Murray et al., 2010).
Update µ. The conditional posterior distribution for the mean parameters is also multivariate normal, Update R. To update the latent correlation matrix, we first draw the working expansion parameter with d 2 j |R ∼ InvGamma((p + 1)/2, β), where β = r ii /2 for the marginally uniform prior, and β = 1/2 for the spike-and-slab prior. The inverse Gamma distribu-tion is parameterized with shape and scale. We then construct the expanded observa- , and compute the sample covariance ma- . For the marginally uniform prior, the posterior conditional distribution of the expanded precision matrix Ω takes the conjugate form, Ω|W , µ ∼ Wishart(I p + S, n + p + 2). For the spike-and-slab prior, we sample the expanded precision matrix Ω|W , γ using ESS as described in Section 3.1.
After a new Ω is sampled, we can then compute the induced expansion parameter D = diag(σ 2 1 , ..., σ 2 p ) 4 Cause-of-death assignment using latent Gaussian mixture model In this section we extend the latent Gaussian graphical models to model data from a mixture of underlying distributions. This extension allows us to complete our model to simultaneously estimate the latent correlation matrix and assign causes of death using VA data. Before we describe our model, it is worth noting that for many existing automated VA methods such as InSilicoVA (McCormick et al., 2016), InterVA (Byass et al., 2003), and the Naive Bayes Classifier (Miasnikof et al., 2015), the classification rule is closely related to the naive Bayes classifier under the assumption of (conditional) independence between symptoms, i.e. .
For algorithms using this conditional independence assumption, the information provided by training data (aside from a prior guess of π c ) can be summarized by the conditional relationships between a single sign/symptom and causes. In contexts without training data, expert clinicians provide the same information in the form of informative prior beliefs (e.g. Byass et al., 2003;McCormick et al., 2016). Thus to extend the latent Gaussian graphical model to the context of cause-of-death assignment, we hope to incorporate such conditional relationships as well, in order to make full use of the existing information. This can be achieved similarly as before. We let y i denote the categorical indicator from a set of C causes of death for person i. A key goal of VA analysis is to associate unlabeled data with cause-of-death assignments. With a generative model similar to Section 2, we let the mean of the latent variable Z i depend on the class of the i-th observation. The complete data generating mechanism can be written as where the priors for µ andR are the same as in Section 2. Following the setup presented To account for the different strength of prior information for each mixture, we can also put an additional hyper-prior on σ 2 c . In our experiments with unspecified σ 2 c , we use weak independent priors such that σ 2 c ∼ InvGamma(0.001, 0.001), for c = 1, ..., C. Although not presented here, if marginal information on the continuous variable distributions is available in practice, we may also let X ij |y i = c to be f cj (z) =F −1 cj (Φ(z)), whereF cj is the fixed marginal distribution function, and inference can be similarly carried out with one additional step to update the observed continuous variables each time an assignment changes.
The mixture model approach allows the joint distribution of symptoms in the data to further guide the estimation of the latent correlation matrix. The proposed model is ideally suited for settings with some, but not extensive, training data. In verbal autopsy this typically happens when a small subset of deaths are assigned a cause either by a traditional medical autopsy or, more commonly, when clinicians review the verbal autopsy data and assign a cause of death, so-called 'physician-coded' VAs. In most settings physician-coded VAs are comparatively (very) rare because physician coding is costly in terms of physician time and opportunity costs, e.g. physicians not seeing living patients. The informative prior setup we propose allows researchers to combine prior or clinician-derived expert information with training data. Conceptually, in the extreme case when no training data exist, the latent Gaussian mixture model can still be estimated given strong informative priors on µ, i.e. the conditional probabilities of symptoms, and the latent correlation matrix will be estimated dynamically based on cause assignments in each iteration. In the following sections we show the advantages of combining both strong priors and limited training data using both simulated and observed data.
Finally, if the labeled and unlabeled deaths come from different populations (e.g. the labeled deaths occur in a high Malaria region whereas the unlabled deaths do not), then one could let the labeled and unlabeled deaths follow two multinomial distributions with different π, or further include additional subpopulation-specific π. Posterior inference of π, µ andR can be similarly carried out as in Section 3.2 with minor modifications. We leave the detailed algorithms in the supplementary material. After obtaining the posterior mean estimatorsπ,μ, andR through MCMC, the most likely cause-of-death assignments for each death can be obtained in a discriminant analysis fashion by marginalizing over the latent variable Z to obtain the Bayes classifier as 5 Simulation evidence In this section we conduct simulation experiments to characterize the performance of the proposed method for both the estimation of R under the latent Gaussian framework and classification under the mixture framework. We describe our data generation process and provide results for correlation matrix estimation and graph recovery in Section 5.1 and then for classification in Section 5.2.

Estimation error and graph recovery
To examine the performance of our method in recovering the latent correlation matrix under different scenarios, we follow a data generating procedure similar to those in Liu et al. (2012) and . In all our simulations, we generate the sparse precision matrix Ω so that ω jj = 1, and ω jk = ta jk , where a jk ∼ Bernoulli((2π) −0.5 exp(||z j − z k || 2 )/(2c)) and z j 's are independent bivariate uniform random variables sampled from [0, 1] 2 . We set c = 0.2 so that on average each node has 6.4 edges in the graph, and set t so that the precision matrix is positive definite. In all our examples we further rescale Ω so that its inverse is a correlation matrix. We consider the following two scenarios using the assumed generative model: (i) Assume X contains 10% continuous Gaussian variables and the marginal means for the latent variables µ j ∼ Unif[−1, 1], and let the marginal prior mean µ 0 be the true µ.
(ii) Same as before, except the marginal prior µ 0j is misspecified to be sign(µ 0j ) * µ 2 0j , and we further generate continuous variables from the misspecified marginal distribution so that X 3 ij is marginally Gaussian.
The misspecified case reflects the practical scenario where more extreme marginal probabilities are relatively easier to solicit but may be provided on a different scale compared to the truth. In all our simulations we set n = 200, p = 50, and randomly remove m% of the entries in the data matrix to represent m% missing data. We repeat the simulation under each scenario 100 times. For both proposed models, we run the MCMC 3, 000 iterations and report the mean estimator for R from the second half of the posterior draws. The results are similar for synthetic data with only binary variables and thus are omitted from reporting here.
To benchmark the performance of our method in recovering the true correlation matrix, we compare our method with the semi-parametric estimator proposed in .
To obtain a fair comparison with our method that uses marginal priors, we calculate the rank-based estimator with the prior marginal probabilities, instead of the empirical marginal probabilities calculated from data. In our experiments described above, this approach leads to better estimation of R. We note that this substitution may harm the estimator performance when marginal priors are misspecified significantly. We compare the estimated correlation matrix errorR − R in terms of the matrix element-wise maximum norm, spectral norm, and Frobenius norm. The results are in Tables 1. The posterior mean estimator R from the proposed approach consistently outperforms the rank-based estimator for all three norms and is more robust to missing data and model misspecification.
To evaluate performance for graph recovery under the marginal uniform prior, we use the same two-stage procedure as in  where we first obtain the posterior mean estimator ofR and then apply graphical Lasso to obtain a sparse R −1 . For the spike-andslab prior, we can directly threshold R −1 since the conditional posterior inclusion probability Pr(δ jk |r jk ) is a monotonically increasing function of |r jk |. We define the false positive rate and true positive rate in the same way as : where E is the number of edges in the graph. Tables 1 also shows the comparison of the ROC curve using AUC and maximum F1 score. Under all scenarios our estimator yields better AUC and F1 scores, especially when the fraction of missing data is high.  Table 1: Simulation with mixed X under different scenarios. The proposed latent Gaussian graphical model approach (Spike-and-Slab prior) outperforms the semi-parametric alternatives and the marginal uniform prior (Uniform prior) in both scenarios. The Spikeand-slab prior performs especially well in scenarios with a high proportion of missing data.

Classification error
In this section we illustrate the performance of our method for cause-of-death assignment in VA analysis. We generate n = 800 unlabeled data with p = 50 from C = 20 classes, where the class membership distributions are generated from Dirichlet(1). Data within all groups share the same latent correlation matrix but have different marginal mean vectors generated in the same way as described in 5.1.
We compare the average classification accuracy with that produced from the naive Bayes classifier and the underlying algorithm from InterVA (Byass et al., 2012), which is closely related to the naive Bayes classifier. To assess the performance in estimating class probability, as is a main goal in VA analysis, we also compared the estimation of π with the truth using 'CSMF accuracy' (Murray et al., 2011b) . For the proposed model, we further investigate the scenario where 80, 100 and 200 labeled data exist. Intuitively, adding labeled data helps our model identify the dependence structure more quickly, especially in the presence of low sample size and high proportion of missing data. However, we do not impose the assumption that the labeled data shares the same class distribution as the testing data to maintain fair comparison. Figure 2 display the results.
The proposed latent Gaussian model consistently outperforms both the naive Bayes classifier and InterVA model, and is more robust to misspecification.

Analysis of verbal autopsy data
In this section we present results comparing the proposed model and the naive Bayes classifier using VA data in two contexts. First, in Section 6.1, we compare our method to both InterVA and the Naive Bayes Classifier using a set of gold standard data. In this scenario, we have sufficient labeled data to obtain good estimates of the conditional distribution of each symptom given each cause. This setting mimics a scenario where informative prior information is available and of high quality, which is common but not ubiquitous in practice.
In Section 6.2, we evaluate our methods using data from health and demographic surveillance system (HDSS) sites where the missing data proportion is much higher and the sample sizes are smaller. We compare different methods with physician-coded causes of death and show that the proposed approach is able to improve classification accuracy compared to both InterVA and the Naive Bayes classifier with noisy marginal priors that are poorly specified, in the scenarios where no or little labeled data are available.

PHMRC gold standard data
We first evaluate the performance of the proposed methods using the Population Health Metrics Research Consortium (PHMRC) 'gold standard' VA dataset (Murray et al., 2011a).
The PHMRC dataset consists of about 7,000 deaths recorded in six sites across four countries (Andhra Pradesh, India; Bohol, Philippines; Dar es Salaam, Tanzania; Mexico City, Mexico; Pemba Island, Tanzania; and Uttar Pradesh, India). Gold standard causes are assigned using a set of specific diagnostic criteria that use laboratory, pathology, and medical imaging findings. All deaths occurred in a health facility. For each death, a blinded verbal autopsy was also conducted. We removed all deaths due to external causes, e.g., homicide, road traffic, etc., since the conditional probabilities of many symptom given an external cause is less meaningful, and external causes are also much easier to identify with a deterministic screening procedure in practice. For the rest of the deaths from 26 causes, we randomly selected 1, 000 deaths as testing data, additional 1, 000 deaths as labeled data, and used the rest of the dataset to calculate the conditional probability matrix of each symptom given each cause as the informative prior. We fit the proposed model both with and without the labeled data, but do not assume the labeled deaths share the same distribution of causes.
We repeated this experiment 50 times.
We compared our methods with both InterVA and the Naive Bayes classifier using the same prior information. We ran the MCMC chains for 3, 000 iterations and discarded the first half as burn-in. We put the hyper-prior described in Section 4 on σ 2 . We used flat Dirichlet prior with α c = 1 for all c = 1, ..., C and calculated the individual cause assignment using Equation 7, and compared with the truth in terms of the accuracy of most likely cause, top three most likely causes, and CSMF accuracy. Figure 3 shows clear improvements of the proposed method over alternatives that assume conditional independence.

HDSS sites
In this section, we apply our method to a dataset from the Karonga HDSS (Crampin et al.,  The metrics are evaluated on 1, 000 randomly selected deaths for InterVA, Naive Bayes classifier, and the proposed model without any training data (GM: w/o training). An additional 1, 000 randomly selected labeled death is used as training data in the last case (GM: w/i training). The labeled data are not assumed to have the same distribution of causes.
The Karonga VA data were first coded by two physicians, and if they disagreed, a third physician adjudicated and determined the final cause assignment. These assignments were originally coded into 88 cause categories. We removed the small fraction of deaths due to external causes (such as traffic accident and suicide) from this dataset since they are in practice easy to classify and may be conditionally independent from most of the symptoms.
Given the limited sample size, we further aggregated the remaining causes into broader groups. We aggregated the assignments into 16 subcategories. We remove the symptoms that are missing for over 90% of the data which reduces the size of the symptom list to 92. Finally, we formed a "prior" dataset by taking all the deaths (VA symptoms and We first fit the model on all the data from 2008-2014 using this empirical P(symptom|cause) matrix. We used the same hyperparameter setup as the previous example with PHMRC. In the VA questionnaires, there are several groups of questions probing different aspects of the same symptom, for example "fever of any kind" and "fever lasting less than 2 weeks", or "male" and any pregnancy-related symptoms. Such questions are expected to be conditional dependent due to the structure of the questionnaire, and thus we fix the corresponding selection indices to be 1 in the inverse correlation matrix. We compare our method with InterVA and the Naive Bayes classifier using the same "prior" P(symptom|cause) matrix. Table 2 summarizes the performance of each algorithm, and  In addition to the structures induced by the questionnaire, we also recover interesting symptom pairs that are conditionally dependent on each other. For example, history of high blood pressure is strongly positively associated with paralysis of one side of the body across all cross-validation experiments, which is expected given the relatively high prevalence of cardiovascular diseases in the data. In our experiment, there are 3874 potential edges excluding the one knowns from the survey. Table 3

Discussion
Understanding the correlation structure among high dimensional mixed data in the presence of missing data is a challenging task. In this work we propose a method that models the joint distribution of variables of mixed types and leverages marginal prior information.
Using both simulation, gold-standard, and physician-coded VA data, we demonstrate that our new framework can significantly improve estimation of the latent correlation structure, graph recovery, and classification performance. The estimation of sparse inverse correlation matrices proposed in this paper allows us to decouple the parameterization of the marginal distributions of variables from their dependence structures. For future directions of research, it may also be interesting to explore other parsimonious representations (e.g. Murray et al., 2013;Gruhl et al., 2013;Jin and Matteson, 2018) in the context of analyzing VA data.  Table 3: List of conditional dependent symptom pairs. The non-zero elements in the inverse correlation matrix are selected by the estimated median probability graph.
to target the posterior modes directly with deterministic EM-type algorithms (e.g. Ročková and George, 2014; Li and McCormick, 2017). Second, symptom reduction in VA analysis is of key interest as a shorter set of symptoms can both reduce the cost as well as improve the quality of data collection. There has been active research on variable selection in Gaussian mixture models (Andrews and McNicholas, 2014), and consequently the proposed framework may also be extended to perform symptom selection in a data-driven way. Third, the model presented in this paper focuses mostly on binary and continuous data. Extensions to ordinal data are also possible by specifying priors on additional cut-off points. With a normal prior on the log-scale differences between consecutive cutoffs, the proposed model can easily incorporate prior information on marginal probabilities of more than two levels. Finally, in this paper we only consider the case where all mixtures follow the same correlation matrix.
Direct extension to group-specific correlation matrices would be straightforward, but estimating several correlation matrices independently in the context of VA can be problematic since mixture probabilities are highly unbalanced. Priors on joint distribution of multiple correlation matrix that allow them to borrow information needs to be developed.
Finally, we would like to draw attention to the fact that using marginal information to guide the modeling of joint associations is strongly related to stratified sampling. If we consider cause of death as an unknown stratification variable, the marginal informative prior

A Derivation of the spike-and-slab prior
The proposed prior distribution for R is First we show that C δ < ∞ so that the prior distribution is proper. We note Since exp(−λr jj /2+ p+1 2 log(r jj )) is a non-negative function of r jj , and has a global maximum at r jj = (p + 1)/λ, and C is a positive constant, we have where the constant C < ∞, and R + |R| −(p+1) j (r jj ) − p+1 2 dR < ∞ as well since it is proportional to the marginally uniform prior of R derived from the Wishart distribution.
Therefore the normalizing constant C δ < ∞, and the prior is proper.
In order to obtain the prior distribution on the expanded precision matrix Ω = (DRD) −1 , we put prior on the marginal expansion parameter D with a prior distribution so that p(d 2 j |R) is an inverse Gamma distribution with shape and rate parameter being ((p + 1)/2, 1/2), we have where d j = σ j is the square root of the k-th diagonal element of Σ = Ω −1 , i.e.,

B Comparing spike-and-slab with Wishart prior
Since the proposed method is heavily based on the spike-and-slab prior for the precision matrix (Wang, 2015), Ω, we first describe the spike-and-slab prior on the precision matrix, and compare it to other commonly used prior families in this section. Wang (2015) defines the spike-and-slab prior as where M + denotes the space of positive definite matrices, δ jk are latent indicator variables for each ω jk related to their size (large or small), π δ is the prior sparsity parameter, and v 1 v 0 imposes different levels of shrinkage for the elements drawn from the "slab" and "spike" prior distributions respectively. Conditional on the binary indicators δ jk , this representation shrinks the elements of Ω differently: a very small v 0 allows us to strongly shrink elements in Ω to 0 if they are small in scale, and a larger v 1 , i.e. a more dispersed prior distribution, shrinks the larger elements only slightly and thus leads to less bias.
Due to the positive definiteness constraint, the normalizing constant for this prior distribution of Ω is intractable. We glean insights about this prior distribution by simulating from the prior using the MCMC steps described in Wang (2015). Figure  ii . Middle: The densities are derived from sampling 2, 000 draws using MCMC from the prior distribution after 2, 000 iterations of burn-in. similar to that of the marginal uniform prior. This is not surprising as it can be seen directly from the marginal distribution on the matrix elements of Ω as well. For the j-th column of Ω, the spike-and-slab prior induces the conditional prior distribution on ω [j,−j] and the in the constrained space that Θ is a inverse correlation matrix. This conditional density also can help guide the choice of the hyperparameters, by comparing λ, v 0 , and v 1 to −j] . The linear constraints may render the choice of hyperparameters not straightforward when the edge probability is larger. Nevertheless, we can see from Figure 6 that both the spike-andslab distributions still changes as expected when we fix all but one parameters, and behaves marginally similar to the spike-and-slab prior for the precision matrix. C Implied prior sparsity with different hyperparameters In this section, we provide more prior simulation results to facilitate the choice of λ, v 0 , v 1 , and π δ . Figure 7 illustrates our approach in understanding how these 4 parameters jointly imply the prior sparsity. It can be seen that small λ and extremely small v 0 usually leads to denser prior graph unless v 1 is also small, which defeats the purpose of using the continuous mixture prior. We choose to use λ = 10, v 0 = 0.01, v 1 /v 0 = 100, and π δ = 0.0001 in our experiments. In general, for the prior edge probability to be calibrated between 0.05 to 0.2, we believe the model is not very sensitive to parameters in the close range to our choices.

D Posterior inference for the classification model
This section describes the inference procedure for the model presented in Section 2 of the main paper. The steps are mostly similar to Section 3.2 of the paper.
Update Z and Λ. This first two steps are the same as in Section 3.2 of the main paper, except replacing µ to the corresponding µ c .
Update µ. The conditional posterior distribution for the mean parameters is also multivariate normal, Update R. To update the latent correlation matrix, we first draw the working expansion and expand the observations in the same way as Section 3.2 of the main paper. The rescaled sample covariance matrix is S = n i=1 (W i −Dµ y i ) Λ −2 (W i −Dµ y i ). The rest of the sampling steps are the same.
Update Y . This step can be performed with Equation (7) in the main paper, or by integrating out π, so that where n −i,c = i =i 1 Y i =c .
(optional) Update σ 2 c . When σ 2 c is not fixed in the model, we can sample them from the conjugate posterior distribution E More details about the Karonga data analysis E.1 Distribution of causes of death in Karonga data

E.2 Estimated dependence structures
In this subsection, we include some additional results of the analysis in Section 6.2 of the main paper using pre-2008 data as training set and all the rest as testing data. The estimated correlation matrix, inverse correlation matrix, and the posterior inclusion probabilities of edges are shown in Figure 9.

E.3 Cross validation
In this section, we present a cross validation study using the Karonga data. We first used the data from 2002 to 2007 to calculate the informative priors. We then randomly selected from the rest of the data α% of training set and use the rest as test set. We repeated the exercise for α = 5, 10, 20. Our train-test split differs from standard out-of-sample cross-validation analysis in that we use the smaller fraction as training data, in order to reflect more closely the practical realities of VA data. We assumed the training and testing data share the same CSMF, since they are both from the second period in time. Figure 10 and 11 shows the comparison based on the accuracy of the top-cause assignment and CSMF accuracy, both   The cells with orange color are the known edges from the questionnaire structure that is not estimated.
consistent with the patterns in the other experiments. Since we assumed the training data share the same CSMF, we included also a variation of Naive Bayes classifier derived from the training data only. It is worth noting that InterVA performs surprisingly well compared to the proposed model. This is misleading, however, as the experiment uses prior probabilities calculated from a small number of deaths, resulting in the probabilities for rare symptoms being very noisy. InterVA does not take into account the absence of symptoms, and therefore it is not subject to the impact of misspecified priors on rare symptom probabilities. We expect that if the priors are provided based on physician knowledge, or better estimated with a larger dataset as shown in the experiment with the PHMRC data, InterVA's classification rule will not be as effective as the Naive Bayes classifier or the proposed model.

F Numerical illustration of structural bias of independence assumption in VA analysis
In this section, we provide a numerical illustration to show the influence of ignoring correlation in cause-of-death assignment. We note that similar ideas of incorporating the dependencies between predictors for prediction have been studied recently in regression analysis (e.g. Guan et al., 2016;Peterson et al., 2015). For Naive Bayes classification, many previous studies have shown that it is, in many scenarios, robust to ignored dependencies (e.g., Rish, 2001), yet we are not aware of any formal discussion of the independence assumption in VA analysis. Here we illustrate some potential issues with the following example.
Assume the simple scenario where only three infectious diseases C = (c 1 , c 2 , c 3 ) are of interest. For example, HIV/AIDS, TB, and a third category of "undetermined infectious disease", which in general includes deaths possibly due to either HIV/AIDS or TB but cannot be determined from data. Assuming there are two symptoms S = (s 1 , s 2 ), and denoting P s 1 s 2 (C) = Pr(C|S = (s 1 , s 2 )), p i = Pr(s 1 = 1|C i ) and q i = Pr(s 2 = 1|C i ), we can write the conditional distribution for the four combinations of S as follows under the independence assumption Applying Bayes rule with uniform prior on the prior distribution of the three causes of death, we can see the entries in the table above are proportional to the posterior probability of assigning each cause of death given a specific observation of symptoms, since P s 1 ,s 2 (C i ) = Now consider the case where the two symptoms s 1 and s 2 are respectively key symptoms for c 1 and c 2 , so that p 1 > p 2 and q 1 < q 2 . Since deaths due to c 3 are essentially a mixture of the other two causes and we assume equal prevalence of c 1 and c 2 , we can roughly let P (S|C 3 ) = P (S|C 1 )/2 + P (S|C 2 )/2. Still using the independence assumption for c 1 and c 2 , we calculate the correct joint distribution of symptoms given c 3 to be 0 ((1 − p 1 )(1 − q 1 ) + (1 − p 2 )(1 − q 2 ))/2 ((1 − p 1 )q 1 + (1 − p 2 )q 2 )/2 1 (p 1 (1 − q 1 ) + p 2 (1 − q 2 ))/2 (p 1 q 1 + p 2 q 2 )/2    which violates the independence assumption since the product of marginal probabilities Pr(s 1 = 1|C 3 ) Pr(s 2 = 1|C 3 ) = (θ 10 +θ 11 )(θ 01 +θ 11 ) = (q 1 +q 2 )(p 1 +p2)/4 > (p 1 q 1 +p 2 q 2 )/2 = θ 11 when (p 1 −p 2 )(q 1 −q 2 ) < 0. This implies that by naively applying Bayes rule and assuming independence of symptoms, we will over-estimate P 11 (C 3 ) under this setup.
Additionally, we consider the scenario where p 1 = q 2 and q 1 = p 2 , which is highly likely when the conditional probabilities are provided as rankings instead of numerical values, as in the implementation of InterVA. It is obvious to show that Pr(s 1 = 1|C 3 ) Pr(s 2 = 1|C 3 ) = (q 1 + q 2 )(p 1 + p 2 )/4 = (q 1 + p 1 ) 2 /4 > q 1 p 1 , which means if independence of symptoms conditional on causes is assumed, a researcher will conclude P 11 (C 3 ) > P 11 (C 1 ), and similarly P 11 (C 3 ) > P 11 (C 2 ). In contrast, if the analysis is carried out with the correct conditional probability table, it should lead to P 11 (C 1 ) = P 11 (C 2 ) = P 11 (C 3 ) since the lower right entries in all three tables are equal. This heuristic example shows that even when some of the conditional independence assumptions are satisfied and all marginal P s|c are accurately estimated, due to the particular features of VA analysis that includes causes that are "undetermined", the independence assumption can lead to undesired outcomes that overestimate the "undetermined" categories. These biases result entirely from model assumptions and cannot be solved with more data, and the problem becomes even worse as the number of symptoms and causes grows.