Reciprocal Graphical Models for Integrative Gene Regulatory Network Analysis

Constructing gene regulatory networks is a fundamental task in systems biology. We introduce a Gaussian reciprocal graphical model for inference about gene regulatory relationships by integrating mRNA gene expression and DNA level information including copy number and methylation. Data integration allows for inference on the directionality of certain regulatory relationships, which would be otherwise indistinguishable due to Markov equivalence. Efficient inference is developed based on simultaneous equation models. Bayesian model selection techniques are adopted to estimate the graph structure. We illustrate our approach by simulations and two applications in ZODIAC pairwise gene interaction analysis and colon adenocarcinoma pathway analysis.

a hierarchical RGM to discover the relationship between cholesterol levels and pulmonary function in a longitudinal study. The graph structure is fixed and a generalized EM algorithm was developed to estimate the parameters. Telesca et al. (2012ab) use RGMs for latent variables that represent active/inactive proteins and differential vs. non-differential gene expression, respectively. However, the use of the RGM is restricted to representation and for convenient summaries. The actual inference model is based on the implied conditional independence structure only (after moralization). Also, they use RGMs with only directed edges, excluding possible undirected edges. By contrast, our use of RGMs is directly on the observed gene expressions and we use the RGM to build the probability model in a way that allows us to infer direction for directed edges.
Because different RGMs may determine equivalent conditional independence structure, inference algorithms that use only the implied conditional independence structure can not possibly distinguish such equivalent graphs based on observational data. In particular, the directionality of some regulatory relationships cannot be determined (examples are given in Section 2.2). In this paper, we propose a possible solution. We integrate different sources of genomic information including DNA copy number and DNA methylation in a way that allows us to determine the direction of some edges from first biological principles. This allows us then to identify the gene regulatory networks (GRN) that best fit the data, including inference on edge directions in some cases. That is, the additional genetic and epigenetic information together with fundamental biological knowledge can inform the directionality of the regulatory relationships between genes.
The proposed approach is motivated by two interesting applications. In the first application, we extend ZODIAC (Zhu et al., 2015) to incorporating directed edges between genes. ZODIAC is a free online tool (http://www.compgenome.org/ZODIAC) that provides (undirected) molecular interactions for about 200 million pairs of genes across different cancers based on multimodal genomic data from The Cancer Genome Atlas (TCGA). It can be used to explore gene regulations in cancer. With our contribution, ZODIAC will enable researchers to not only find out whether certain genetic interactions are likely to be present in cancer but also identify the sources (e.g. transcription factors) and targets (e.g. targeted genes) of the regulations. Our second application studies GRNs of colon adeno-carcinoma (COAD). We focus on genes from the RAS-MAPK pathway which is critical in the initiation and progression of COAD (TCGA, 2012). Integrating copy number and methylation information, we find biologically meaningful gene interactions as well as novel regulatory relationships that need to be validated by further biological experiment.
Finally, for clarification and to avoid misunderstanding we note that our use of graphical models is quite different from inference for random graph models (Hoff, 2009;Goldenberg et al., 2010) when the graph itself is the data.
The rest of the article is organized as follows. We describe our model in Section 2.
We present simulation studies in Section 3 and two applications in Section 4. Section 5 provides our closing discussion.

Notation
consists of a set of vertices V = {1, . . . , p} and a set of edges E connecting these vertices. We consider both directed and undirected edges denotes a directed edge from vertex i to vertex j and {i, j} denotes an undirected edge. We use the vertices to index a set of random variables, Y = Y V = (Y 1 , . . . , Y p ) T . For example, in our application Y j represents gene expressions for gene j.
A path of length K is an ordered sequence (i 0 , . . . , i K ) of distinct vertices except possibly such that there are no directed edges between vertices in the same path component (Koster, 1996). Some examples of RG with four vertices are given in Figure 1a. The boundary of a vertex i is bd(i) = {j | {j, i} ∈ E or (j, i) ∈ E} and the boundary of a subset V 0 ⊆ V is bd(V 0 ) = i∈V 0 bd(i)\V 0 . An anterior set is a subset V 0 ⊆ V such that bd(V 0 ) = ∅ and the smallest anterior set containing V 0 is denoted by an(V 0 ).
The Markov property (i.e. conditional independence relationships of Y ) of an RG relies on the notion of moralization. To moralize a graph G, we connect all vertices in the boundary of each path component of G by undirected edges and then replace all directed edges by undirected edges. The resulting moral graph is an undirected graph and is denoted by G m . Later we will use graph separation to introduce a global Markov property. In an undirected graph, two sets V 1 and V 2 are said to be separated by a third set V 3 if every path between V 1 and V 2 intersects V 3 . 1 2 Figure 1: Graphs for illustration. Graphs in (a) -(d) are reciprocal graphs and imply different conditional independence. Graphs in (e) -(h) are Markov equivalent as they have imply the same Markov properties. In our applications, we will use nodes 1 and 2 to represent gene expression of two genes and nodes 3 and 4 to represent copy number (or methylation) of the same genes.

Reciprocal graphical models, Markov equivalence and integrative genomics
A graphical model is a mapping between a family of distribution and an underlying graph.
In this paper, we focus on the class of RGs which by definition strictly contains MRFs and DAGs as subclasses.
The probability distribution of Y is said to be (global) Markov with respect to an RG . RGMs can represent global Markov properties beyond the conditional independence structure that is encoded in MRFs and DAGs. RGMs are a strictly larger class of probability model than MRFs and DAGs. For example, in Figure 1a, the only two conditional independence relationships are 3 ⊥ ⊥ 4 and 3 ⊥ ⊥ 4 | 1, 2. There is no MRF or DAG that encodes the same conditional independence relationships. More importantly for our application, RGMs are particularly useful for the construction of genomic networks because of the ability in modeling feedback loops which are not allowed by MRFs or DAGs. For example, in Figure 1a, gene 1 may regulate the expression of gene 2 while the status of gene 2 may reciprocally affect gene 1 through a feedback regulatory mechanism.
Two graphs are said to be Markov equivalent if they have the same Markov properties.
For example, graphs 1e, 1f, 1g and 1h are Markov equivalent as they have the same Markov property, namely, no conditional independence assertion. Markov equivalence is indeed an equivalence relation and hence induces Markov equivalence class. Graphs within the same Markov equivalence class are usually nonidentifiable from observational data. In particular, observational data do not allow for inference on the direction of the edges in Figures 1e-1h.
However, with prior knowledge, sometimes we are able to distinguish the relationships between variables. In this paper, we introduce such prior knowledge by deliberately considering edges with (biologically) known direction if included. This allows for inference on other edges. We develop this approach based on integrating genomic information across different platforms and exploiting the central dogma of molecular biology that mRNA is produced by transcription from segments of DNA on which the copy number and methylation are measured, but the reverse processes are rare and biologically uninterpretable. For illustration, let vertices 1 and 2 in Figure 1 represent mRNA gene expressions and vertices 3 and 4 represent copy numbers of the corresponding genes and assume there exists dependence between copy number and gene expression for each gene and hence by the central dogma, there are directed edges from 3 to 1 (3 → 1) and from 4 to 2 (4 → 2). Now we are able to fully identify the relationship between mRNAs 1 and 2 because each of graphs 1a, 1b, 1c and 1d defines a distinct set of conditional independence relationships.

Simultaneous equation models
For a formal description of our approach, we still need the mapping from the RGM to a family of probability models for the observational data. Let Y = (Y 1 , . . . , Y p ) T denote the mRNA gene expressions for genes 1, . . . , p. Let X = (X 1 , . . . , X 2p ) T be the set of DNA level measurements for genes 1, . . . , p with X 2i−1 and X 2i being the copy number and the methylation for gene i, respectively. We first state the simultaneous equation model (SEM) and will then introduce the mapping. An SEM for Y and X is given by (1) can be equivalently expressed as To link an SEM to an RGM, we draw a path diagram G = (V, E) of SEM by the following rules: In words, (i) we introduce a node for each variable in (Y , X) with nodes j = 1, . . . , p corresponding to Y j and p + j corresponding to X j for j = 1, . . . , 2p; (ii) nodes i = 1, . . . , p (i.e. Y i nodes) become targets of directed edges from node j if the corresponding a ij = 0 or b i,j−p = 0; (iii) we introduce undirected edges between X i and X j (i.e. nodes i + p and Figure 2 shows an example of an RGM with p = 2, with * indicating non-zero elements. In general, the path diagram of an SEM following rules (i)-(iii) is not always an RGM. However, the probability distribution of (Y , X) defined by a SEM is Markov with respect to the path diagram G (Spirtes, 1995;Koster, 1996) if . . , σ p ) and Ψ is block diagonal with each block being a full matrix (without zeros). Since the main inferential goal of this paper is to investigate the regulatory relationship between genes, we will not model the marginal distribution of X and will only focus on the conditional distribution of Y | X. We fix b ij to 0 for j = 2i−1 or 2i because copy number and methylation of gene i, in principle, only affect the expression of gene i.

Priors and graph structure learning
The structural zeros in A and B correspond to missing edges in the RG. Therefore, learning the graph structure is equivalent to finding sparse estimators for A and B. Towards this end, we define a non-local prior for A and B, which is constructed as follows. We put a thresholded prior on each element of A and B. We first write a ij and b ik as for i = 1, . . . , p, j = i and k = 2i − 1, 2i. The threshold parameter t i controls the minimum effect sizes of a ij and b ik . We do not fix t i but instead learn it from the data by assigning a prior t i ∼ Uniform(0, t 0 ). We impose normal priors forã ij andb ik ,ã ij ∼ N(0, τ ij ) and b ik ∼ N(0, ν ik ) and conjugate hyperpriors τ ij ∼ IG(α τ , β τ ) and ν ik ∼ IG(α ν , β ν ). Ni et al. (2016) show that marginally, after integrating with respect to t i , the induced marginal prior for a ij or b ik is a mixture of a point mass at 0 and a non-local prior (Johnson and Rossell, 2010). Non-local priors shrink small effects to zero, which is desirable in our setting where we are only interested in edges with moderate to strong effects. We complete our prior specifications with a conjugate prior σ i ∼ IG(α σ , β σ ). However, this is only feasible when the model space is small (i.e. with very small p).

Simulations
We carry out a simulation study to validate the model's ability to recover a true graph with sample sizes similar to the later application. We generate data by mimicking the actual data from the later application, using a sample size of n = 276 and p = 10 genes. For each gene, we generate two hypothetical DNA level measurements X from (3) with Ψ = I 2p (corresponding to copy number and methylation in the real data).
We consider three scenarios. In scenario 1, we randomly set 4/5 entires in A and 1/3 entries in B to zero. To ensure that the true model is identifiable, we restrict the simulation truth to include at least one edge from DNA level measurements to each gene (this is relaxed in scenario 2). We draw nonzero elements of A and B from {−0.5, 0.5} with equal probability and let Σ = 0.25I 10 in the simulation truth.
In scenario 2, we reduce the signal by drawing nonzero elements of A and B from {−0.4, 0.4} and increase the noise to Σ = I 10 . Moreover, we do not force every gene to be connected to at least one of its DNA level measurements. As a result one gene is independent of both its DNA level measurements. For scenarios 1 and 2, Y is then generated from (2).
Scenario 3 is the same as scenario 1, except that the conditional distribution of Y | X is misspecified and generated from a p-dimensional multivariate t-distribution T p (µ, Θ, δ) with location µ = (I p − A) −1 BX, scale matrix Θ = (I p − A) −1 Σ(I p − A) −T and degrees of freedom δ = 3. By design, scenarios 2 and 3 are more difficult and perhaps more realistic than scenario 1.
We run 50,000 MCMC iterations, discard the first 25,000 iterations as burn-in and retain every 5th sample. The graph is estimated as an MPM.
In Table 1 we report the true positive rate (TPR), FDR and Matthews correlation coefficient (MCC). We also plot the receiving operating characteristic (ROC) curves in  Table 1 (the ROC plot for scenario 1 is not shown -the AUC is 1). The performance in scenario 1 is nearly perfect. As expected, scenarios 2 and 3 are more challenging. In particular, we find a higher FDR, due to the lower signal-to-noise ratio and a minor identifiability issue for scenario 2 and model misspecification for scenario 3.

ZODIAC pairwise gene interaction analysis
We use TCGA-Assembler (Zhu et al., 2014) to retrieve mRNA gene expression (GE), DNA copy number (CN) and DNA methylation (ME) data for different cancer types from TCGA.
For illustration, we report the analysis based on pairs of genes whose GEs are moderately correlated (between 0.5 and 0.7) and show high correlations (> 0.8) with corresponding CNs or MEs. The resulting dataset consists of 1395 pairs from 15 cancers (details of the data are in secton B of the Supplementary Material). We apply our method separately to each pair of genes using the same hyperparameter and MCMC settings as in the simulation study. Since p = 2 is sufficiently small, we can report the HPM as an estimate of the graph.
In this application, we aim to demonstrate: (1) like ZODIAC, our model can be used as a hypothesis generating tool; (2) compared to the undirected graph model used in ZODIAC, the RGM is more flexible and allows for more and different inference and interpretation.
Among the 1395 gene pairs, 518 pairs are estimated to have one-way interactions, 131 pairs have two-way interactions and the rest have no interactions. In addition, we find 2544 intragenic GE-CN interactions and 1157 intragenic GE-ME interactions. We show three gene pairs in Figure 4. These interactions are confirmed by ZODIAC (Zhu et al., 2015) but our approach has richer interpretations. In Figure 4a, both our method and ZODIAC find a negative interaction between MEA1 and UBR2 whereas our analysis indicates that MEA1 downregulates UBR2 specifically in rectum adenocarcinoma (READ). Similarly, in Figure 4b, while ZODIAC finds positive interaction between PHTF2 and TLK1, our model suggests a positive feedback loop between these two genes in testicular germ cell tumors (TGCT). Figure 4c manifests a unique feature of our model, a negative feedback loop between COG3 and MRPS31 in colon adenocarcinoma (COAD). Negative feedback loops are a common and important mechanism in gene regulations. For example, it can reduce expression noise and intercellular variability in molecular levels (Singh, 2011). But negative feedback loops are not allowed by DAGs and MRFs, both of which include the simplified assumption that gene interaction is always symmetric, either positive or negative. Our model is able to generate hypotheses that involve both one-way and two-way interactions and both positive and negative feedback loops. and "m" indicate that they are associated with copy number and methylation, respectively. Solid lines with arrowheads represent stimulatory interactions, whereas dashed lines with horizontal bars denote inhibitory regulation. Edge width is proportional to its posterior probability.

Colon adenocarcinoma pathway analysis
We analyze data from colon adenocarcinoma (COAD) samples. We focus on genes that are mapped to the RAS-MAPK pathway, which is critical for initiation of carcinogenesis in COAD (Colussi et al., 2013). The RAS-MAPK pathway includes p = 10 core genes.
Restricting to samples with available GE, CN and ME data, the sample size is n = 276. We run two parallel MCMC simulations, each with 50,000 iterations, discard the first 50% as burn-in and thin the chains to every 5th sample. MCMC diagnostics show no evidence for lack of practical convergence (for details see part C of the Supplementary Materials). We summarize the posterior distribution on the unknown graph by controlling the posterior expected FDR < 10%.
We find that all genes are associated with their respective copy number and NRAS, Gene networks are often made up of a small set of recurrent regulation patterns, called network motifs, which can be thought of as fundamental building blocks for the network and are expected to occur more often in gene networks than in random networks. In Figures (5b)-(5e), we display four motifs identified by our method that are commonly observed in gene networks (Alon, 2007). Figure (5b) shows a feed-forward loop among SOS2, KRAS and MAPK3. Part of this feed-forward loop is well studied in COAD (Zenonos and Kyprianou, 2013 (Plotnikov et al., 2011;Zenonos and Kyprianou, 2013) while MAP2K1 can phosphorylate and inhibit SOS2 and thereby reduces MAP2K1 activation (Holt et al., 1996;Mendoza et al., 2011). On the gene level, we calculate the posterior distribution of degree of each gene, that is, the number of edges connected to the gene. We visualize these posterior distributions by box plots in Figure 6a. Highly connected genes are often called hub genes which are usually more involved in multiple regulatory activities than non-hub genes. MAPK1 and MAPK3 appear to be the two most highly connected genes. Several studies have shown that overexpression of MAPK1/3 plays an critical role in the progression of COAD (Fang and Richardson, 2005) and is responsible for proliferation, differentiation, survival, migration and angiogenesis of tumors in many cancers (Dhillon et al., 2007).
For another quantitative exploration of the known MAPK signalling cascade we consider the following inference. The following signaling pathway/cascade has been extensively studied and validated in biological literatue (Plotnikov et al., 2011;Zenonos and Kyprianou, 2013), where genes within each curly brace belong to the same gene family (except for GRB2 for which the protein often binds to SOS to form a protein complex) and play similar roles in the pathway. We label each gene in cascade (4) from left to right with integer j = 1, . . . , p as our reference ordering. The labels within each curly brace are arbitrary. To compare our findings with this well established cascade, we score each gene j by where indegree j is the number of edges pointing towards gene j and outdegree j is the number of edges pointing away from gene j. Intuitively, gene j is likely to be on the left of gene k in cascade (4) if Score j < Score k . We then rank the scores in an increasing order and plot the rank against the reference ordering in Figure 6b. We also calculate the normalized Kendall's tau distance, which is the ratio of the number of discordant pairs over the total number of pairs. The normalized Kendall's tau distance lies between 0 and 1, with 0 indicating a perfect agreement of the two orderings and 1 indicating a perfect disagreement. The resulting Kendall's tau distance is 0.07. Our findings appear to be consistent with the biologically validated pathway since both Figure 6b and the Kendall's tau distance indicate a good concordance between our rankings and the reference ordering.

Discussion
In this article, we have introduced a Gaussian RGM to model gene regulatory relationships from genomic data. RGMs are statistically more general and biologically more interpretable than MRFs and DAGs. By integrating DNA level information, we are able to differentiate between RGMs that belong to the same Markov equivalence class. We exploit the connection between RGMs and SEMs for efficient inference. We constructed a prior probability scatter plot for ranking vs reference ordering. Ranking of the genes is based on the Score j defined by (5). The reference ordering is based on (4) where the genes with the same bracket are assigned arbitrary ordering.
model for the unknown graph using a thresholded model which marginally defines a mixture of non-local prior and a point mass. We use simulation studies to illustrate the performance of our method in terms of graph structure learning. Our method is applied to a ZODIAC gene interaction analysis and a colon cancer pathway analysis. Some of our findings are consistent with the literature, while others need to be validated by biological experiments.
Although our applications focus on GRNs, the proposed approach is general and can be potentially applied to other scientific settings such as climate sciences and macroeconomics.
The approach works for any network where some edges have known direction if included (inclusion itself need not be fixed).
The link between RGMs and SEMs is based on the assumption that the gene expressions are multivariate Gaussian, which could be thought of as one limitation of our model. We have empirically demonstrated the robustness of the proposed model against slight model misspecification. More general sampling models might need an additional hierarchical layer of latent variables, such as latent probit scores for binary outcomes.