Bayesian latent hierarchical model for transcriptomic meta-analysis to detect biomarkers with clustered meta-patterns of differential expression signals

Due to the rapid development of high-throughput experimental techniques and fast-dropping prices, many transcriptomic datasets have been generated and accumulated in the public domain. Meta-analysis combining multiple transcriptomic studies can increase the statistical power to detect disease-related biomarkers. In this paper, we introduce a Bayesian latent hierarchical model to perform transcriptomic meta-analysis. This method is capable of detecting genes that are differentially expressed (DE) in only a subset of the combined studies, and the latent variables help quantify homogeneous and heterogeneous differential expression signals across studies. A tight clustering algorithm is applied to detected biomarkers to capture differential meta-patterns that are informative to guide further biological investigation. Simulations and three examples, including a microarray dataset from metabolism-related knockout mice, an RNA-seq dataset from HIV transgenic rats, and cross-platform datasets from human breast cancer, are used to demonstrate the performance of the proposed method.

1. Introduction. With the rapid development of high-throughput experimental techniques and dropping prices, many transcriptomic datasets have been generated and deposited into public databases. In general, each dataset contains small to moderate sample size that brings caution of accuracy and reproducibility of detected biomarkers (Simon et al., 2003;Simon, 2005;Domany, 2014). Meta-analysis combining multiple transcriptomic studies can increase statistical power and provide robust conclusions from various platforms and sample cohorts (Ramasamy et al., 2008). Tseng, Ghosh and Feingold (2012) presented a comprehensive review of methods and applications in the microarray meta-analysis field and categorized existing methods into combining p-values, combining effect sizes, direct merging (a.k.a mega-analysis) and combining non-parametric ranks. In general, meta-analysis can be viewed as information combination tools for adjusting batch effects (such as different experimental platform, protocols and bias) across studies to draw a more efficient and accurate conclusion. This paper focuses on combining p-value method and we will discuss other potential approaches of adjusting batch effects and directly merging studies in the discussion.
Following conventions in Birnbaum (1954) and Li and Tseng (2011), two hypothesis testings have been considered in meta-analysis. In the first setting (namely HS A ), we aim to detect biomarkers that are differentially expressed (DE) in all studies: H 0 : θ ∈ {θ s = 0} vs H A : θ ∈ {θ s = 0}, where θ s is the effect size of study s, 1 ≤ s ≤ S. Throughout this manuscript, effect size refers to unstandardized effect size (Cooper, Hedges and Valentine, 2009) (difference of group means or unstandardized regression coefficients). In the second setting (HS B ), targeted biomarkers are DE in one or more studies: H 0 : θ ∈ {θ s = 0} vs H A : θ ∈ {θ s = 0}. In view of overly stringent criterion in HS A in noisy genomic data and when S is large, Song and Tseng (2014) proposed a robust setting HS r , requiring that r or more studies are DE: H 0 : θ ∈ {θ s = 0} vs H A : θ ∈ I{θ s = 0} ≥ r, where I{·} is an indicator function taking value one if the statement is true and zero otherwise and r is usually pre-estimated with S/2 ≤ r ≤ S. Song and Tseng (2014) also proposed to use the rth-ordered p-value (rOP, T rOP = p (r) ) to test HS r . Generally speaking, HS A and HS r are of the major biological interests to detect consensus biomarkers in the disease or phenotypic contrast. However, when heterogeneous differential expression signals across studies are expected (e.g. studies come from different tissues or brain regions in the two mouse/rat examples in Section 4), biomarkers detected from HS B can be of interest. Chang et al. (2013) conducted a comparative study evaluating 12 popular microarray meta-analysis methods targeting on the three complementary hypothesis testings (HS A , HS B and HS r ).
Strictly speaking, HS B is a sound hypothesis testing, and statistical tests for HS B are easier to develop. Most popular p-values aggregation methods such as Fisher (Fisher, 1934) and Stouffer (Stouffer et al., 1949) methods aim for this purpose. In the literature, HS B is also called a conjunction or intersection hypothesis (Benjamini and Heller, 2008). On the other hand, HS A is a somewhat defective hypothesis testing. If we apply the maximum p-value test (T maxP = max s p s , where p s is the p-value for study s) for HS A and reasonably assuming that p-values are independently distributed with UNIF(0,1), the null distribution of T maxP is Beta(S,1) and it is easily seen that the test is anti-conservative when there exist genes DE in partial studies. The problem mainly comes from the non-complementary null and alternative spaces in HS A (and also HS r ) and the more adequate hypothesis testing should be HSĀ ≡ H 0 : θ ∈ {θ s = 0} vs H A : θ ∈ {θ s = 0} (and HSr ≡ H 0 : I{θ s = 0} < r vs H A : I{θ s = 0} ≥ r). Benjamini and Heller (2008) proposed a legitimate but conservative test for HSĀ and HSr. In genomic application, the composite null hypotheses of HSĀ and HSr are complicated by the fact that genes can be differentially expressed in up to "S − 1" (for HSĀ) or "r − 1" (for HSr) studies with different levels of effect sizes. Under such a scenario, it becomes very difficult to characterize the null distribution for hypothesis test in a frequentist setting. Theoretically it is necessary to borrow differential expression information across genes to significantly improve statistical power. Bayesian hierarchical modeling can provide a convenient solution for this purpose. Efron et al. (2008) and Efron (2009) applied empirical Bayes methods to control false discovery rate (FDR) in single microarray studies. Muralidharan (2010) further extended these works to allow the simultaneous modeling of both empirical null and alternative distribution of the test stiatistics. Zhao, Kang and Yu (2014) incorporated pathway information to select genes using the Bayesian mixture model. Despite these successful applications, a Bayesian method that combines multiple studies and detect DE genes based on various metaanalysis hypothesis settings is yet to be developed. In this paper, we propose a Bayesian latent hierarchical model (BayesMP, named after Bayesian method for meta-patterns) that uses non-parametric Bayesian method to effectively combine information across genes for direct testing for HSĀ (as well as HS B and HSr) in genome-wide scale. In simulations, we show successful Bayesian false discovery rate control of BayesMP while the original maxP and rOP method using beta null distribution loses FDR control for the HSĀ and HSr setting.
Traditionally, meta-analysis aims to pool consensus signals to increase statistical power (e.g. by fixed or random effects models). Recently, researchers have recognized existence of heterogeneous signals among cohorts and the importance of their characterization in meta-analysis. For example, Figure 1(a) shows three modules of detected biomarkers from the RNA-seq HIV transgenic rat data using BayesMP and meta-pattern clustering (See Section 4.2 for details). Module I and II are consensus biomarkers that are all down-regulated or all up-regulated across three brain regions. In contrast, the biomarkers in module III are down-regulated in HIP but up-regulated in PFC and STR. Such kind of biomarkers are somewhat expected biolog-ically because it is well-known that different brain regions are responsible for different functions such as reasoning, recognition, visual inspection and memory/speech. Several approaches such as adaptive weighting (or subset) method (Li and Tseng, 2011;Bhattacharjee et al., 2012) or lasso variable selection (Li et al., 2014) have been proposed for quantifying and inferring such heterogeneity. In the adaptively weighted Fisher's method (AW-Fisher;Li and Tseng 2011), for example, heterogeneity of differential expression signals in each study is categorized by w gs as 0 or 1 weights (1 representing differential expression for gene g in study s and 0 for nondifferential expression). Specifically, AW-Fisher considers weighted Fisher's statistics (i.e. T ( W g ) = −2 S s=1 w gs · log(p gs ), where W g = (w g1 , . . . , w gS ) is the vector of 0-1 weights reflecting gene-specific heterogeneous contribution of each study and p gs is the p-value of gene g in study s) and adaptively searches the best weight vector for gene g by minimizing the resulting p-value: W g = arg min Wg p(T ( W g )) = arg min Wg 1 − F −1 where F −1 χ 2 w is the inverse CDF of chi-squared distribution with degree of freedom w. The 0-1 weight estimated from AW-Fisher helps cluster detected biomarkers by their differential expression meta-patterns but has a disadvantage of hard-thresholding without quantification of variability. When S is large, the number of all 2 S −1 possible weight combinations also makes it intractable. In BayesMP, the differential expression indicators naturally come with variability estimates from posterior distribution (see a confidence score to be defined later in Section 2.3). In BayesMP, we also adopt a cosine dissimilarity measure on these posterior distributions and apply tight clustering (Tseng and Wong, 2005) to identify biomarkers of different meta-patterns (e.g. see the three modules of biomarkers in Figure 1). Unsupervised clustering of the expression pattern across studies identifies data-driven modules of biomarkers of different meta-patterns and provides interpretable results for further biological investigation. For example, it is interesting to investigate why biomarkers in module III are down-regulated in HIP but up-regulated in PFC and STR. We note here that our proposed cluster analysis to categorize detected biomarkers by studying heterogeneity in meta-analysis is a relatively novel concept. It is different from popular practices of clustering gene for identifying co-expression gene modules or clustering samples for discovering disease subtypes (e.g. Huo et al. (2016)).
The paper is structured as follows. Section 2 establishes the methodology, estimation and inference of BayesMP. Section 3 evaluates the performance of the proposed method using simulation datasets. Section 4 shows the application to three real examples of multi-tissue microarray studies using I.

II.
III. Three meta-pattern modules of biomarkers from HIV transgenic rats example. Each row (Module I, II and III) shows a set of detected biomarkers showing similar meta-pattern of differential signals. 1(a) Heatmaps of detected genes (on the rows) and samples (on the columns) for each brain region (HIP, PFC or STR), where each brain region represents a study (i.e. HIP for s = 1, PFC for s = 2, STR for s = 3). Black color bar on top represents F334 rats (control) and red color bar on top represents HIV transgenic rats (case). 1(b) Heatmaps of confidence scores (CS) (genes on the rows and three studies on the columns). Confidence score is described in Section 2.3, which ranges from -1 (blue color for down-regulation) to 1 (red color for up-regulation). 1(c) Bar plots of mean posterior probability (error bar represents standard deviation across all genes in the module) of differential expression indicator in each brain region. Up-regulation is shown by red and down-regulation is shown by blue. Number of genes is shown on top of each barplot. metabolism related knockout mice, multi-brain-region RNA-seq studies using HIV transgenic rats, and cross-platform breast cancer studies. Finally, Section 5 provides final conclusion and discussion.

HIP PFC STR
2. Methods. For the ease of discussion, we focus on detecting DE genes in two-class comparison in this manuscript. The method can be easily extended for studies with numerical or survival outcomes. In a meta-analysis combining S studies with G genes, we denote p gs as the one sided p-value testing for down-regulation for gene g in study s, where 1 ≤ g ≤ G and 1 ≤ s ≤ S. These p-values can be calculated from SAM (Tusher, Tibshirani and Chu, 2001) or limma (Smyth, 2005) for microarray studies (or RNA-seq studies with RPKM data), and edgeR (Robinson, McCarthy and Smyth, 2010) or DEseq (Anders and Huber, 2010) for RNA-seq studies with count data. As a result, our model is flexible to mixed studies of different platforms (e.g. microarray or RNA-seq) and study designs (e.g. case-control, numerical outcome or survival outcome). Throughout this manuscript, we use limma and edgeR to obtain the p-values. For modeling convenience, we transform the one-sided p-values into Z-statistics, i.e. Z gs = Φ −1 (p gs ), where Φ −1 (·) is the inverse cumulative density function (CDF) of standard Gaussian distribution. Z gs 's are the input data for BayesMP.
2.1. Bayesian hierarchical mixture model. Denote by θ gs the effect size of gene g in study s and Y gs an indicator variable s.t. Y gs = 1 if θ gs > 0 (upregulation), Y gs = −1 if θ gs < 0 (down-regulation) and Y gs = 0 if θ gs = 0 (non-DE gene). We assume that the Z-statistics from study s are sampled from a mixture distribution with three mixing components depending on Y gs : is the pdf of Z-statistics in study s, and f In most situations, if an appropriate statistical test is adopted, one can expect that p gs ∼ Unif(0, 1) if gene g in study s is not DE, hence reasonably assume f (s) 0 ≡ N(0, 1). In case the p-value distribution is not uniform under null hypothesis, one can also empirically estimate f (s) 0 following Efron (2004). Throughout this manuscript, we use theoretical null N(0, 1) and we put discussion about this choice in the conclusion section. Unlike f are usually unknown and their estimation is not trivial. To account for the complex composition of alternative f (s) ±1 (i.e. potentially several subgroups exist in the alternative space), we model them non-parametrically by assuming they are also mixtures of distributions using Dirichlet processes (DPs). DPs are widely discussed and applied in the literature (Neal, 2000;Müller and Quintana, 2004) and density estimation using DPs has also been discussed (Escobar and West, 1995). In our model, when Y gs = 0, we assume Z gs ∼ N(µ gs , 1), and µ gs follows distribution G s+ or G s− generated from DPs. Specifically, the generative process of Z gs given Y gs = ±1 is as follow.
Z gs ∼ N(µ gs , 1). DP(G, α) denotes a DP with base distribution G and concentration parameter α, and G 0+ (G 0− ) denotes normal density N(0, σ 2 0 ) left (right) truncated at 0. We find that the selection of σ 0 and α ± do not much affect the performance of this model in simulation (see details in Supplementary  Table S3). It should be noted that we assume Z gs ∼ N(µ gs , 1) where the variance is fixed at 1 to ensure that f is monotonically increasing (decreasing) while Z gs > 0 (Z gs < 0), which in turn guarantees the posterior probability of gene g being DE in study s to increase as |Z gs | increases (see Theorem 2.1). In addition, this assumption makes the MCMC simpler and hence speeds up the algorithm.
In order to borrow information across studies, we further assume that Y gs is generated depending on (1) the prior probability π g that gene g is a DE gene and (2) the conditional probability δ g for gene g in study s being upregulated (or 1 − δ g for down-regulated) given gene g is DE. Specifically, we assume W gs ∼ Mult 1, (1 − π g , π + g , π − g ) and Y gs = W gs · (0, 1, −1), where π + g = π g δ g , π − g = π g (1 − δ g ) and · is the inner product of two vectors. Given Y gs = y, Z gs is generated from f (s) y (Z). The graphical representation of the full generative model is shown in Figure 2.
We assume that each gene g is DE in different studies in the same probability π g , i.e. π g = Pr(Y gs = 0), and π g ∼ Beta(γ, 1−γ). γ can be interpreted as the proportion of DE genes pooling all studies since the expectation of π g from this prior is γ. We further set the prior of γ being uniform distribution, i.e. γ ∼ UNIF(0, 1).
For each gene g, define δ g = Pr(Y gs = 1|gene g is a DE gene). We assume δ g ∼ Beta(β, β). We set β = 1/2 in this paper, which gives a non-informative prior. Note that this conditional probability provides flexibility for a DE gene to contain conflicting differential expression directions (i.e. up-regulation in one study but down-regulation in another study; e.g. module III in Figure 1) Graphical representation of Bayesian latent hierarchical model. Shaded nodes are observed variables. Dashed nodes are pre-estimated/fixed parameters. Arrows represent generative process. Dashed lines represent equivalent variables. s is the study index and g is the gene index.

Model fitting.
Since conjugate priors were used in the generative model, we can generate the posterior samples efficiently using MCMC procedure. In order to update the DP with infinite number of components, we take the alternative view of DP as the Chinese restaurant process. We define C gs ∈ {. . . , −3, −2, −1, 0, 1, 2, 3, . . .} as the auxiliary component variable, s.t. C gs = 0 indicates Y gs = 0; C gs = k and k > 0 indicates that Y gs = 1 and µ gs is sampled from the kth component (or kth table in Chinese restau-rant process) of G s+ ; and C gs = k and k < 0 indicates that Y gs = −1 and µ gs is sampled from the k th component of G s− . The following steps provide details of our MCMC iterations.
1. Update π g 's: where C −g,s denotes all the C's in study s excluding gene g. Note that h (s) k can be calculated directly following the convention of Algorithm 3 in Neal (2000). Details of h (s) k (Z gs |C −g,s ) are given in Supplementary Section III. Finally, we set Y gs = sgn(C gs ), where sgn(·) is the sign function. 4. Because there is no immediate conjugate prior for γ, we sample γ using Metropolis-Hastings algorithm such that where dBeta(x; a, b) is the probability density function of beta distribution evaluated at x with shape parameter a and b. For details see Supplementary Section I..

Decision space and inference making.
A main benefit of Bayesian modeling is its capability of making inference by statistical decision theory, a generalized framework that covers the traditional hypothesis testing framework as a special case (Berger, 2013). Take HSĀ from Section 1 as an example. Traditional hypothesis testing considers null hypothesis H 0 : θ g ∈ Ω 0Ā where Ω 0Ā = { θ g : S s=1 I(θ gs = 0) < S} and θ g = (θ g1 , · · · , θ gS ) versus alternative hypothesis H A : θ g ∈ Ω 1Ā where Ω 1Ā = { θ g : S s=1 I(θ gs = 0) = S}. When observed data are unlikely to happen (i.e. type I error controlled at 5%) under null hypothesis, we reject the null hypothesis. One notable feature is that traditional hypothesis testing views null and alternative hypothesis spaces differently. The decision of hypothesis testing is only based on the null hypothesis -either reject or accept. The alternative hypothesis plays little role in decision making. In view of decision theory framework, a decision space (a.k.a. action space) is designed as DĀ = (Ω 0Ā , Ω 1Ā ). The inference generates a decision function f that maps from the observed data space Z to DĀ (i.e. f : Z → DĀ). Under this framework, type I error can be expressed as Pr(f (Z) = Ω 1Ā | θ g ∈ Ω 0Ā ); and type II error is Pr(f (Z) = Ω 0Ā | θ g ∈ Ω 1Ā ). Hypothesis testing is a special case under this framework with adequate type I error control to determine the decision function f . Unlike hypothesis testing, decision theory treats Ω 0Ā and Ω 1Ā equally, because in decision theory, the decision is made through cost analysis, which weighs the costs of making wrong decisions in both spaces. One can easily design a realistic loss (cost) function based on the two types of errors to determine their balance and achieve the best decision function. In this paper in order to make fair comparison with classical hypothesis testing, we use posterior probabilities from Bayesian modeling and adopt a false discovery control described by Newton et al. (2004) to determine the decision function.
Denote by ξ g = Pr( θ g ∈ Ω 0Ā |Z) = 1 − Pr( θ g ∈ Ω 1Ā |Z), which is local FDR (Efron and Tibshirani, 2002) by definition. Given a threshold κ, we declare gene g as a DE gene if ξ g ≤ κ and the expected number of false discoveries is g ξ g I(ξ g ≤ κ). False discovery rate from Bayesian modeling is defined as g ξgI(ξg≤κ) g I(ξg≤κ) (Newton et al., 2004). In simulation and real applications, we compare performance of FDR control from traditional hypothesis testing and FDR control from Bayesian modeling. Note that, we can consider Finally for a declared DE gene, we are given the posterior probability of whether gene g in study s is a non-DE gene (Pr(Y gs = 0|Z)), an upregulated gene (Pr(Y gs = 1|Z)) or a down-regulated gene (Pr(Y gs = −1|Z)). We propose a gene and study specific confidence score V gs = Pr(Y gs = 1|Z) − Pr(Y gs = −1|Z), which ranges between −1 and 1. We are confident that gene g is up-regulated in study s if V gs is close to one, and vise versa when V gs is close to minus one. See Figure 1(b) for an example.
2.4. Biomarker clustering for meta-patterns of homogenous and heterogenous differential signals. Several recently developed meta-analysis methods (Li and Tseng, 2011;Bhattacharjee et al., 2012;Li et al., 2014) provide modeling of homogeneous and heterogeneous differential signals. The 0-1 differential expression indicators (e.g. w g in AW-Fisher) allow further biological investigation on consensus biomarkers as well as study-specific biomarkers. However, when S becomes large, the number of biomarker categories grows exponentially to (2 S − 1) and the biomarker categories become intractable. In BayesMP, the posterior probability of the differential expression indicator (i.e. Pr(Y gs |Z)) provides probabilistic soft conclusions. After we obtain a list of biomarkers under certain global FDR control, we apply tight clustering algorithm (Tseng and Wong, 2005) to generate data-driven biomarker modules. Tight clustering is a resampling-based algorithm built upon K-means or K-medoids. This method aggregates information from repeated clustering of subsampled data to directly identify tight clusters (i.e. sets of biomarkers with close distances), and does not force every biomarker into a cluster. It can be applied to any dissimilarity matrix if K-medoids is used. Pathway enrichment analysis is then applied to functionally annotate each biomarker module. The resulting biomarker modules of different metapatterns will greatly facilitate interpretation and hypothesis generation for further biological investigation. In the first two real data applications, for example, heterogeneous meta-patterns of biomarkers are expected from the nature of multi-tissue or multi-brain-region design across studies. Biomarkers up-regulated in one brain region but non-DE (or even down-regulated) in another brain region is of great interest. It should be noted that in the third breast cancer data application, meta-patterns still help characterize the heterogeneity of different cohorts (e.g. differences of study population and probe design), even though this heterogeneity is hopefully minimal since these studies focus on the same disease with homogeneous tissue type.
To apply tight clustering, we need to define a dissimilarity measure for any pair of genes. Denote by U gs the posterior probability vector for Y gs : U gs = (Pr(Y gs = 1|Z), Pr(Y gs = −1|Z), Pr(Y gs = 0|Z)). For two genes i and j, we first calculate the dissimilarity of U is and U js in study s and then average over study index s. The dissimilarity measure between U is and U js we considered includes Cosine dissimilarity, l 2 dissimilarity, l2 2D dissimilarity, Symmetric KL dissimilarity and Hellinger dissimilarity. Details of these dissimilarity measurement are described in Supplementary Section II. By using the simulation setting in Section 3.2, we found Cosine dissimilarity outperforms others (see details in Supplementary Figure S3), and hence adopt it in our paper and would recommend it for other applications.

Simulation results.
3.1. DE gene detection and FDR control. To evaluate the performance of the proposed method and compare to other methods, we performed the simulations below.
1. Let S be the number of studies, G = 10, 000 be the total number of genes, and N = 20 be the number of cases and controls (2N samples in total). 2. We firstly focus on simulating gene correlation structure and assume no effect size for all genes in all studies. Sample expression levels with correlated genes following the procedure in Song and Tseng (2014).
(a) Sample 200 gene clusters with 20 genes in each cluster and the remaining 6,000 genes are uncorrelated. Denote by C g ∈ {0, 1, . . . , 200} the cluster membership indicator for gene g, e.g. C g = 1 indicates gene g is in cluster 1, whereas C g = 0 indicates gene g is not in any gene cluster.
(b) For cluster c and study s, sample A cs ∼ W −1 (Ψ, 60), where 1 ≤ c ≤ 200, Ψ = 0.5I 20×20 + 0.5J 20×20 , W −1 denotes the inverse Wishart distribution, I is the identity matrix and J is the matrix with all elements equal 1. Then A cs is calculated by standardizing A cs such that the diagonal elements are all 1's. The covariance matrix for gene cluster c in study s is calculated as Σ cs = σ 2 A cs , where σ is a tuning parameter we vary in the evaluation.
(c) Denote by g c1 , . . . , g c20 the indices of the 20 genes in cluster c (i.e. C g cj = c, where 1 ≤ c ≤ C(C = 200), and 1 ≤ j ≤ 20). Sample expression levels of genes in cluster c for sample n in study s as (X g c1 sn , . . . , X g c20 sn ) ∼ MVN(0, Σ cs ), where 1 ≤ n ≤ 2N and 1 ≤ s ≤ S. For any uncorrelated gene g with C g = 0, sample the expression level for sample n in study s as X gsn ∼ N(0, σ 2 ), where 1 ≤ n ≤ 2N and 1 ≤ s ≤ S.
3. Sample DE genes, effect sizes, and their differential expression directions.
(a) Assume that the first G 1 genes are DE in at lease one of the combined studies, where G 1 = 30% × G. For each 1 ≤ g ≤ G 1 , sample v g from discrete uniform distribution v g ∼ UNIF(1, . . . , S) and then randomly sample subset v g ⊆ {1, . . . , S} such that |v g | = v g .
Here v g is the set of studies in which gene g is DE.
(c) Sample d g ∼ Ber(0.5), where 1 ≤ g ≤ G 1 and s ∈ v g . Here d g controls effect size direction for gene g.

4.
Add the effect sizes to the gene expression levels sampled in Step 2c. For controls (1 ≤ n ≤ N ), set the expression levels as X gsn = X gsn . For cases (N + 1 ≤ n ≤ 2N ), if 1 ≤ g ≤ G 1 and s ∈ v g , set the expression levels as X gsn = X gsn + (−1) dg θ gs ; otherwise, set X gsn = X gsn .
We performed simulation with S = 3, 5, 10 and σ = 1, 2, 3 to account for different numbers of combined studies and various signal/noise ratio. We applied limma to compare the gene expression levels between the control group and case group. We transformed the two-sided p-values from limma to one-sided p-values by taking account of the directions of estimated effect sizes. Then one-sided p-values are transformed to Z statistics. BayesMP took 53 minutes on a regular PC with 1.4 GHz CPU (i.e. for one simulation with S = 3 and σ = 1) to obtain 10,000 posterior samples using MCMC.
Supplementary Figure S1 shows the posterior samples of π g in two example genes -a DE gene and a non-DE gene as well as γ. Because the posterior samples converge to a stationary distribution very fast for our method (see examples in Supplementary Figure S1), excluding 500 posterior samples for burn-in is enough for the analyses of this paper. We repeated the simulation for 100 times and averaged the results. We compared the performance of our method and existing methods designed for decision space DĀ (maxP), D B (Fisher's method and AW), and Dr (rOP) with r = S/2 + 1 using both FDR and the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. Note that in our comparison, we used DĀ and Dr which are equivalent to the complementary hypothesis testings HSĀ and HSr, and the true number of studies in which a gene is DE can be calculated because the truth is known in simulation. Table 1 compares the actual false positive rate (FDR), false negative rate (FNR) and area under the curve (AUC) of different methods. The nominal FDR used in these methods is 5%, which is widely accepted in genomic research. For decision space D B , all the three methods controlled FDR around its nominal level, which is anticipated because HS B is complementary and equivalent to D B . Fisher and AW were slightly over-conservative in terms of FDR control -Fisher's method and AW controlled FDR at around 3.5%, whereas our BayesMP controlled FDR at around nominal 5%. This phenomenon has also been observed in Song and Tseng (2014) when the genes are correlated. Because BayesMP is less conservative than the other two methods, we were able to detect slightly more genes under D B . Besides BayesMP achieved similar (or slightly better) FNR and AUC with Fisher and AW under D B . Fisher' s method is known to be almost optimal (known as asymptotic Bahadur optimality (ABO) (Littell and Folks, 1971)) when effect sizes are consistent and equal for all studies. This indicated BayesMP is also almost optimal for D B under the simulation scenario. For decision space DĀ and Dr, we observed that maxP and rOP lost control of FDR. As discussed in Section 1, this is caused by the nature that HS A and HS r have non-complementary null and alternative spaces. To the contrary, BayesMP still controlled FDR close to its nominal level for DĀ and Dr. Note that because maxP and rOP were not able to control FDR at its nominal level, the number of genes detected by these methods are not directly comparable to our methods. However, FNR of BayesMP was only slightly larger than rOP for Dr, regardless of the conservative FDR control. When S was large (S = 10) in simulation, the FDR control of BayesMP under DĀ deviated from its nominal level (∼ 10% instead of 5%). The reason of the anti-conservative control was that the data simulation setting was different from model generative process, thus small errors accumulated when S gets large. However, BayesMP still performed much better than maxP and roP (FDR=0.58 for maxP in DĀ and FDR=0.23 for rOP in Dr setting). In addition, BayesMP achieved much larger AUC than maxP and rOP under DĀ and Dr, which indicated better power of BayesMP.
3.2. Simulation to evaluate meta-pattern gene module detection. To evaluate the performance of gene module detection, we adopted a simulation procedure similar to Section 3.1. We simulated S = 4 studies in total. Among the G = 10, 000 genes, we set 4% of them as homogeneously concordant DE genes, with the same direction in all studies (all positive or all negative). We denote "homo+" as the homogeneously concordant DE genes with all positive effect sizes and "homo−" as the homogeneously concordant DE genes with all negative effect sizes. We also set another 4% of all genes as studyspecific DE genes -differential expressed only in one study. Among them, 1/4 are DE genes only in the first study with positive effect sizes (denoted as "ssp1+"), 1/4 are DE genes only in the first study with negative effect sizes (denoted as "ssp1−"), 1/4 are DE genes only in the second study with positive effect sizes (denoted as "ssp2+"), and the rest 1/4 are DE genes only in the second study with negative effect sizes (denoted as "ssp1−"). The rest of the genes are not DE (denoted as "nonDE"). The biological variance σ is set to 1 in this simulation.
We first applied the proposed method to this synthetic dataset. We controlled FDR at 5% under D B and obtained 691 genes. These genes were used as input for our gene module detection using tight clustering algorithm. We identified 6 gene modules in these 691 genes. The detected gene modules   are tabulated against the true gene modules simulated in Table 2 (Module 0 contains scattered genes not assigned to any of the six modules). The detected gene modules clearly correspond to the true modules, and most of the nonDE genes were left to the noises. The heatmaps, confidence scores and DE patterns of these 6 modules are shown in Supplementary Figure S2. An alternative approach is to apply tight clustering directly on the Z-statistics. By comparing the results, we found that the modules detect by this naïve approach are neither pure nor distinguishable under our simulation settings (see details in Supplementary Table S1).
3.3. Additional simulations on sample size effects. To assess impact of unbalanced sample size, we simulated the following special scenarios with 1. Different numbers of samples in different studies. 2. Different numbers of cases and controls in each study. 3. Different ratios of case and control samples in each study.
Below, we followed the simulation setting in Section 3.1 unless otherwise mentioned. In Scenario 1, we allowed different studies to have different numbers of samples. Under this Scenario, we simulated Case (a): numbers of samples (case/control) being 20/20, 30/30, 40/40 for 3 studies respectively; and Case (b): numbers of samples (case/control) being 20/20, 50/50, 100/100 respectively. In Scenario 2, we allowed the numbers of cases and controls being different within each study. Under this Scenario, we simulated Case (c): numbers of samples (case/control) being 60/20, 60/20, 60/20 for 3 studies respectively. In Scenario 3, we allowed the ratios of case and control samples being different within each study. Under this Scenario, we simulated Case (d): numbers of samples (case/control) being 20/60, 40/40, 60/20 for 3 studies respectively.
The results of simulation Cases (a)∼(d) are shown in Supplementary Ta-ble S2. We observed that under Scenario 1, Scenario 2 and Scenario 3, BayesMP controlled FDR to its nominal level for DĀ, D B and Dr. These results indicate that our Bayesian model is robust against impact of heterogeneity sample size in a wide spectrum of scenarios.
3.4. Additional simulations on robustness of the algorithm. In our Bayesian hierarchical model, we assumed the null component comes N(0, 1) -theoretical null for all studies. However, this assumption can be violated when genes from null components are correlated (Efron et al., 2001). Therefore, we designed simulations to access the performance of our model when theoretical null assumption is not valid. To be specific, in our simulation setting Step 2c, we varied number of correlated clusters C = 200, 300, 400, 500, representing increasing probability of correlated null component. We evaluated the performance of the Bayesian hierarchical model and the result is shown in Supplementary Table S4. We observe that the BayesMP still performs very well even though the null components are correlated. Therefore BayesMP is robust against the assumption the theoretical null assumption for the null component.

Real data applications.
To further evaluate our method and demonstrate its usage, we applied BayesMP on three real meta-analysis examples: one on the gene expression of multi-tissue microarray studies using metabolism related knockout mice, one on multi-brain-region RNA-seq studies using HIV transgenic rats and another on transcriptomic breast cancer studies across multiple platforms. The sample size description is shown in Supplementary Table S5. 4.1. mouse metabolism data. Very long-chain acyl-CoA dehydrogenase (VLCAD) deficiency was found to be associated with energy metabolism disorder in children (Li and Tseng, 2011). Two genotypes of the mouse model -wild type (VLCAD +/+) and VLCAD-deficient (VLCAD −/−)were studied for three types of tissues (brown fat, liver and heart) with 3 to 4 mice in each genotype group. Total number of genes from these three transcriptomic microarray studies is 14,495. Supplementary Table S5(a) shows details of the study design and the data set is available in supplement. Twosided p-values were calculated using limma by comparing wild-type versus VLCAD-deficient mice in each tissue and one-sided p-values were obtained by considering the effect size direction. BayesMP took 62 minutes to obtain 10,000 posterior samples using MCMC and the first 500 posterior samples were excluded as burn-in iterations. By controlling FDR at 5%, we detected 168 probes under DĀ, among them 156 have concordant effect size direc-tions in all the three tissues. The heatmap for the genes detected under DĀ is shown in Supplementary Figure S6.
Similarly, under D B we obtained 1,243 DE genes at FDR level of 1%. Then we applied the tight clustering algorithm based on the cosine distance as described in Section 2.4 to detect modules in these genes. The results are shown in Figure 3. Using the tight clustering, we were able to detect 6 gene modules with unique patterns. The first two biomarker modules are consensus genes that are up-regulated or down-regulated in all tissues. The next four modules are biomarkers with study-specific differential patterns. For example, DE genes in Module III are up-regulated in heart but not in brown fat or liver. To examine the biological functions of these modules, we performed pathway enrichment analysis for genes in each module using Fisher's exact test. The pathway database was downloaded from Molecular Signatures Database (MSigDB) v5.0 (http://bioinf.wehi.edu.au/software/MSigDB/), where a mouse-version pathway database were created by combining pathways from KEGG, BIOCARTA, REACTOME and GO databases and mapping all the human genes to their orthologs in mouse using Jackson Laboratory Human and Mouse Orthology Report (http://www.informatics.jax.org/ orthology.shtml). At Fisher's exact test FDR 5% cutoff, we summarized the pathway detection result (see supplementary Excel file 1 for detailed pathway information). Among the six gene modules with distinct DE patterns, module I is enriched in enzyme pathways (e.g. KEGG LYSOSOME; q = 1.6 × 10 −3 ), module II is enriched in pathways for lyase activity (e.g. LYASE ACTIVITY; q = 0.26), module III is enriched in defense response pathways (e.g. DEFENSE RESPONSE PATHWAY; q = 2.2 × 10 −6 ), module IV is enriched in phosphatase regulator pathways (e,g, PHOSPHATASE REGULATOR ACTIVITY; q = 0.12). module VI is enriched in platelet related pathways (e.g. FORMATION OF PLATELET; q = 2.2 × 10 −2 ), For module V, we didn't detect any enriched pathways. Remarkably, all of these pathways are known to be related to different aspects of metabolism, which indicates that our method is able to detect homogeneous and heterogeneous gene modules that are biologically meaningful. The biomarker clustering result enhances meta-analysis interpretation and motivates hypothesis for further biological investigation. For example, it is intriguing to understand why VLCAD-mutation impacts DE genes only up-regulated in heart but not in brown fat or liver and why these genes are associated with the DEFENSE RESPONSE pathway.

4.2.
HIV transgenic rat RNA-seq data. Li et al. (2013) conducted studies to determine gene expression differences between F344 and HIV transgenic I.

Brown fat
Heart Liver Six meta-pattern modules of biomarkers from mouse metabolism example. Each row shows a set of detected biomarkers showing similar meta-pattern of differential signals. 3(a) Heatmaps of detected genes (on the rows) and samples (on the columns) for each tissue (brown fat, heart or liver), where each tissue represents a study (i.e. brown fat for s = 1, heart for s = 2, liver for s = 3). Black color bar on top represents wild type (control) and red color bar on top represents VLCAD-deficient mice (case). 3(b) Heatmaps of confidence scores (CS) (genes on the rows and three studies on the columns). Confidence score is described in Section 2.3, which ranges from -1 (blue color for down-regulation) to 1 (red color for up-regulation). 3(c) Bar plots of mean posterior probability (error bar represents standard deviation across all genes in the module) of differential expression indicator in each brain region. Up-regulation is shown by red and down-regulation is shown by blue.
Number of genes is shown on top of each barplot. In Figure 3(b) and 3(c), we use "Brown" to denote "Brown fat". rats using RNA-seq (GSE47474 in Gene expression Omnibus database http: //www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE47474). The HIV transgenic rat model was designed to study learning, memory, vulnerability to drug addiction and other psychiatryic disorders vulnerable to HIV positive patients. They sequenced RNA transcripts with 12 F334 rats and 12 HIV transgenic rats in prefrontal cortex (PFC), hippocampus (HIP), and striatum (STR) brain regions (See detail in Supplementary Table S5(b)). We applied the same alignment procedure using tophat (Trapnell, Pachter and Salzberg, 2009) adopted by Li et al. (2013) and obtained the RNA-seq count data by bedtools (Quinlan and Hall, 2010) with 16,821 genes. We filtered out gene with less than 100 total counts within any brain region and ended up with 11,824 genes. We removed potential outliers by checking the sample correlation heatmaps (Supplementary Figure S7). We employed R package "edgeR" to perform DE gene detection and obtained two-sided p-values. The one-sided p-values were obtained by considering the effect size directions and further converted to Z statistics. It took 41 minutes to obtain 10,000 posterior samples via MCMC and the first 500 posterior samples were excluded as burn-in iterations. Since it is well known that the post-mortem brain expression profiles generally contain weak signals, we controlled FDR at 20% in the analysis. Under DĀ, we detected 69 genes, among which all 69 have concordant DE directions. The heatmaps of the expression levels (log of normalized counts) of these genes in the three brain regions are shown in Supplementary Figure S8. Under D B , we detected 669 genes. We further applied the tight clustering algorithm and obtained 3 gene modules. Their gene expression heatmaps, DE confidence scores and bar plots of posterior probability of differential expression are shown in Figure 1. To examine the biological functions of these modules, we also performed pathway enrichment analysis using the same procedure as in Section 4.1 (see supplementary Excel file 2 for detailed information). According to the results, module I is down-regulated in all the three brain regions, and is enriched in pathways related to immune system (e.g. REACTOME INNATE IMMUNITY SIG-NALING; p = 6.26 × 10 −3 ); module II is up-regulated in all the three brain regions, and is enriched in pathways related to response to virus (e.g. RE-SPONSE TO VIRUS; p = 1.81 × 10 −3 ); module III is down-regulated in HIP, but up-regulated in PFC and STR, and it is enriched in pathways related to synapsis (e.g. SYNAPTIC TRANSMISSION p = 2.75 × 10 −3 ) and neuron connections (e.g. KEGG NEUROACTIVE LIGAND RECEPTOR INTERACTION p = 2.88 × 10 −3 ). Since it is well-known that HIV attacks the immune system (Weiss, 1993), we anticipate that genes for immune response to be down-regulated, as observed in module I. The up-regulation of response to virus pathway we found in module II is reasonable since the mice are infected by virus. Moreover, because different brain regions have different functions, it is not surprising to discover some neuron-related genes that may respond differently to HIV in different brain region (module III).
4.3. Breast Cancer dataset. In this example, we combined 7 breast cancer transcriptomic datasets, which study the same biological problem using different gene expression platforms including Illuminia, Affymetrix and RNA-seq. Phenotype of interest is breast cancer grade, which is defined according to the cancer cells' growth patterns as well as their appearance compared to to healthy breast cells. Grade 1 cancer cells show slow and wellorganized growth patterns and they look a little bit different from healthy cells, while Grade 3 cancer cells grow quickly in disorganized patterns, with many dividing to make new cancer cells, which look very different from healthy cells. Details of these 7 datasets are described in Supplementary  Table S5(c). For each study, if multiple probes match to the same gene, we select the probe with the largest IQR (Gentleman et al., 2006) to represent the gene. After matching the same gene symbols, 3,920 genes that appeared in all 7 studies were selected for the analysis. For each study, we obtained two-sided p-values using limma (for continuous data) or edgeR (for count data) by comparing Grade 1 versus Grade 3. We calculated one-sided pvalues by considering direction of the effect sizes. We applied BayesMP and it took 31 minutes to obtain 10,000 posterior samples using MCMC and the first 500 posterior samples were excluded as burn-in iterations.
Since we expect the studies combined in this application to be more homogeneous than previous examples, genes DE in most of the studies are our major interest and worth further investigation. Moreover, because the studies are conducted by different groups using different platforms, we want our analysis to be robust against a couple of studies with poor quality or unspecific probe design. We visualize number of declared DE genes at FDR 5% for Dr (r = 1, . . . , 7) in Figure 4.
It is noticed that there is a dramatic drop of number of DE genes between r = 6 and r = 7. This indicates that it could be too stringent to require the DE genes being agreed by all seven studies. To make the DE gene detection replicable enough yet not too stringent, we chose to use Dr (r = 6). Under FDR 5%, we detected 1,499 significant genes. Pathway enrichment analysis was performed using Fisher's exact test. Top pathways associated with these genes are: REACTOME CELL CYCLE (q = 0.0025), REACTOME DNA REPLICATION (q = 0.0047). These results are biological meaningful since it is known that cancer cells' growth patterns of different grades depend on cell cycle and DNA replication.

Conclusion.
For meta-analysis at genome-wide level, the issues to efficiently integrate information across studies and genes and to quantify homogeneous and heterogeneous DE signals across studies have brought new statistical challenges. Bayesian hierarchical model provides a feasible and effective solution. Compared to traditional hypothesis testing, decision theory framework from Bayesian modeling provides a more flexible inference to determine DE genes from meta-analysis. In this paper, we proposed a Bayesian hierarchical model for general transcriptomic meta-analysis. From posterior distribution of the latent variable (DE indicators Y gs ), FDR is wellcontrolled and there is no need to select different test statistics for different hypothesis testing settings (HSĀ, HS B and HSr). Post hoc clustering analysis on the detected biomarkers generates biomarker modules of different meta-patterns that facilitate biological interpretation and provides clues for hypothesis generation and biological investigation.
Our proposed BayesMP framework has the following advantages. Firstly, the model is simple, yet practical and powerful. The model is based on one-sided p-values. This allows easy integration of data from different platforms (for example, many different platforms from microarray and RNAseq). As a contrast, Scharpf et al. (2009) described a full Bayesian hierarchical model for microarray meta-analysis where the input data are microarray raw data (normalized intensities). Although such a full Bayesian model theoretically best integrates all information and can be more pow-erful, it cannot combine new RNA-seq platforms since RNA-seq generates count data versus continuous intensity measures in microarray. Such a full hierarchical model also runs a greater risk of model mis-specification that increases systemic bias across different microarray platforms. Our framework based on p-values circumvents these difficulties and is powerful as long as the method used to generate p-values in each study is effective. Secondly, we adopted a conjugate Bayesian approach using DPs for alternative distributions f (s) ±1 , which enables mixture of multiple subgroups instead of single onecomponent alternative. DPs is non-parametric thus robust against model assumption and conjugacy guarantee fast computing of our Gibbs sampling procedure. Thirdly, we have shown that decision theory framework from BayesMP provides good FDR control and power under different hypothesis testings (or decision spaces). Fourthly, in contrast to the "hard" decision of 0-1 weights in AW-Fisher, the posterior distributions of the DE indicators Y gs provides a stochastic quantification and "soft" decision. For example, in the mouse metabolism example, gene Mbnl2 (probeset 1422836 at) and gene Bcl2l11 (probeset 1435449 at) have very similar p-values in the three studies: (0.0063, 0.16, 0.097) for Mbnl2 and (0.0070, 0.16, 0.098) for Bcl2l11. Using AW, Mbnl2 ended up with a p-value of 0.020 with weights (1, 0, 1) but Bcl2l11 had a p-value of 0.021 with weights (1, 1, 1). Slight change of the p-value in the second study (heart tissue) resulted in different weights. The posterior probabilities for these two genes are, however, very similar, with (0.827, 0.561, 0.671) and (0.819, 0.561, 0.662) respectively and they belonged to the same gene module I in Figure 3. The stochastic quantification avoids sensitive 0-1 weight changes in AW-Fisher. Finally, traditional Fisher's method does not categorize DE genes with homogeneous or heterogeneous DE patterns across studies. In contrast, the improved AW-Fisher method allows categorization of biomarkers but generates up to 2 S − 1 biomarker clusters, which becomes intractable when S is large. The posterior probability of Y gs from BayesMP allows application of tight clustering to directly identify tight clusters of biomarkers with distinct DE meta-patterns. Our simulation and three applications have shown good clustering accuracy and improved interpretation of the biomarker modules.
BayesMP potentially has the following potential limitations. Computing is often a consideration for Bayesian approaches. Our experiences have shown that 10,000 simulations are sufficient to generate the posterior probabilities in general and less than 1 hour is enough for combining around 10, 000 genes using a regular desktop. For applications to much larger number of features (e.g. SNPs or methylation sites in methyl-seq), parallel computing and/or faster MCMC techniques will be needed. Efron (2004) pointed out that it might worth to estimate an empirical null distribution for the Z-statistics, when the null distribution deviates from N(0, 1). In our simulation, we have shown BayesMP with theoretical null is robust when genes from null are correlated, which violates the theoretical null assumption. We also found that BayesMP with empirical null is slightly over-conservative (actually FDR 1% while nominal FDR 5%) when noise level is large. In our R package, we allow the users to choose from using the theoretical null or the empirical null. However, the user should also be careful about the assumptions made when estimating the empirical null distributions. For example, Efron (2004) assumes DE genes are less than a certain proportion (e.g. 10%) and empirical null is also from a Gaussian distribution. This assumption needs to be examined post hoc (e.g. examine whether the DE proportion is less than 10% in each study given a relatively loose FDR control, say FDR<10%). Also, to help the user make the decision whether theoretical or empirical null is more appropriate for their application, we provided histograms for visual diagnosis (see Supplementary  Figures S4 and S5 for details).
As mentioned in the introduction, batch effect correction and direct merging is a viable alternative for meta-analysis (Tseng, Ghosh and Feingold, 2012). There are two major types of batch effect correction method if the factor of interest (i.e. case/control label) is known. For the first type, when the unwanted variation (UV) factors (i.e. batch information) are known, people correct for these known UVs (e.g. Johnson, Li and Rabinovic, 2007;Walker et al., 2008) and pool samples together (a.k.a mega-analysis). This approach is a potent substitute of methods such as fixed effect model and random effect model when the studies combined are homogeneous and/or the DE genes are assumed to be consistent across studies. The major purpose of this mega-analysis is to boost statistical power using a sample size larger than each single study. However, this approach cannot be applied when exploring the heterogeneity in DE genes across studies is one of the major research interest, as we have shown in Section 4.2. The second type of batch effect correction methods assume the UV factors are unknown. People have developed many methods (e.g. Leek and Storey, 2007;Kang, Ye and Eskin, 2008;Listgarten et al., 2010) to eliminate the effects of these unknown UV factors. For our approach, this second type of methods can be applied to each single study to remove UV when they exist and to produce p-values as input of BayesMP. We also call for caution when using these batch correction methods, because it is reported that they might eliminate the effect of interestwhen UV factors are highly correlated with the factor of interest (Jacob, Gagnon-Bartsch and Speed, 2016).