Bayesian Analysis of RNA-Seq Data Using a Family of Negative Binomial Models

. The analysis of RNA-Seq data has been focused on three main cate-gories, including gene expression, relative exon usage and transcript expression. Methods have been proposed independently for each category using a negative binomial (NB) model. However, counts following a NB distribution on one feature (e.g., exon) do not guarantee a NB distribution for the other two features (e.g., gene/transcript). In this paper we propose a family of Negative Binomial models, which integrates the gene, exon and transcript analysis under a coherent NB model. The proposed model easily incorporates the uncertainty of assigning reads to transcripts and simpliﬁes substantially the estimation for the relative usage. We developed simple Gibbs sampling algorithms for the posterior inference by exploiting fully tractable closed-forms of computation via suitable conjugate priors. The proposed models were investigated under extensive simulations. Finally, we applied our model to a real data set.


Introduction
High throughput sequencing technology has rapidly become the standard method for measuring RNA expression levels (Mortazavi et al., 2008). RNA-Seq uses next-generation sequencing (NGS) methods to sequence cDNA that has been derived from an RNA sample, and hence produces millions of short reads. These reads are mapped to a reference genome using a data alignment tool, such as TopHat (Kim et al., 2013). The number of reads mapping within a genomic feature of interest (such as a gene or an exon) is used as a measure of the abundance of that feature in the analyzed sample (Anders et al., 2012).
For the statistical analysis, we rely on a simple count matrix produced from the RNA-Seq experiments. In the count matrix the row represents a feature and the column a sample in a specific experimental condition. Then the data are normalized to account for different library sizes (the total number of mapped reads) across different samples. A detailed review of the normalization methods can be found in Dillies et al. (2013) and Rapaport et al. (2013). It is widely recognized that a negative binomial (NB) distribution provides a good fit to read counts data produced by NGS. Many statistical algorithms have been published using a NB model, including Anders and Huber (2010); Robinson et al. (2010); Di et al. (2011); Hardcastle and Kelly (2010); Leng et al. (2013); van de Wiel et al. (2012van de Wiel et al. ( , 2014; Love et al. (2014); Wu et al. (2013); Trapnell et al. (2013); Niu et al. (2014). In all these works, the NB distribution was parameterized by a mean parameter μ and an overdispersion parameter α. Hypothesis tests were constructed to compare the μ's between conditions. However, all statistical inference rely on some form of approximation. Many treat the estimated dispersions as if they were known parameters, without allowing for uncertainty of the estimation. Other methods that account for the uncertainty in the dispersion rely on other approximations due to the lack of conjugacy in the parameter formulation (see Hardcastle and Kelly (2010); van de Wiel et al. (2012)). In this paper, we parameterize the NB distribution using a probability parameter p and a dispersion parameter r and treat both as unknown parameters. Based on the newly developed sampling algorithm in the field of topic modelling Carin, 2012, 2015), we estimate p and r using fully tractable closed-forms via suitable conjugate priors. More importantly, this parameterization allows us to define a family of NB models for each gene to unify the analysis of genes, exons and transcripts under a coherent statistical model.
The analysis of genes, exons and transcripts have been developed with separate NB models. For example, a NB distribution has been used for 1) modelling the total gene counts, such as edgeR (Robinson et al., 2010), DESeq (Anders and Huber, 2010), DE-Seq2 (Love et al., 2014), baySeq (Hardcastle and Kelly, 2010), NBPSeq (Di et al., 2011), and ShrinkBayes (van de Wiel et al., 2014), 2) modelling exon counts, such as DEXSeq (Anders et al., 2012), and 3) modelling transcript counts, such as Cuffdiff  and IUTA (Niu et al., 2014). However, under their model formulations, a NB model for one feature prohibits a NB distribution for the other two features. In particular, if different NB distributions are assumed for different exon counts within a particular gene, the total count in that gene by summing over the exon counts does not have a NB distribution.
In this paper, we propose a family of NB models (FNB for short), which unifies the exon, transcript and gene analysis under a coherent negative binomial modelling framework. Moreover, the FNB approach greatly simplifies the imputation of latent transcript counts and allows us to estimate the relative exon/transcript usage from a simple multinomial distribution.

Parameter estimation in a negative binomial distribution
Most negative binomial models developed for the analysis of RNA-Seq count data are parameterized by a mean parameter μ and an overdispersion parameter α. In this work, we characterize the NB distribution by parameters p and r. That is, y|r, p ∼ NB(r, p) has the probability mass function f (y|r, p) = Γ(r+y) y!Γ(r) (1 − p) r p y , where Γ denotes the gamma function, p ∈ (0, 1) and r > 0. The mean and overdispersion are obtained by μ = rp/(1 − p) and r = 1/α. It is a classical fact that the NB distribution is equivalent to a Gamma-Poisson mixture distribution: we can obtain y|r, p ∼ NB(r, p) by first drawing λ ∼ Gamma(r, (1 − p)/p) and then generating y|λ ∼ Pois(λ). A NB distribution can also be augmented under a compound Poisson representation, so that by endowing both parameters r and p with suitable conjugate priors, it is possible to draw samples for their posterior distribution in a tractable way, see Carin (2012, 2015). Specifically, given the NB model y j |r, p iid ∼ NB(r, p), for j = 1, · · · , n, the prior specifications p ∼ Beta(a 0 , b 0 ), and r ∼ Gamma(e 0 , f 0 ), where e 0 and f 0 are the shape and the rate parameters. Then the posterior distributions of p and r are obtained in the limit by iteratively applying the following Gibbs sampling steps: In the above display "|−" denotes the conditional distributions given the data and all remaining parameters. The first step for sampling p follows from the Beta-Binomial conjugacy. The last two steps complete the sampling of parameter r, which also involves auxiliary variable l j 's. These variables represent (latent) counts distributed according to a Chinese restaurant table (CRT) distribution, which are defined as follows. We write Now, given all l j 's, r can be sampled by the third equation, which is obtained by exploiting a Gamma-Poisson conjugacy.

Notation
We first introduce the notation for a particular gene. In general, let y j be gene count in sample j (j = 1, · · · , n), and r and p are gene-level parameters in the NB distribution. We use superscript "e" for exon-level data and parameters and superscript "t" for transcript-level data and parameters. Specifically, on the exon level, let y ei j be the read count in exon i (i = 1, · · · , E) in sample j, r ei be the corresponding dispersion parameter. On the transcript level, let y t i j be the read count in transcript i (i = 1, · · · , T ) in sample j, r t i be the corresponding dispersion parameter. We know that , the sum of exon counts and the sum of transcript counts are equal to the gene counts). In our proposed model (defined in the next section), , the sum of exon-level dispersion parameters and the sum of transcript-level dispersion parameters are equal to the gene-level dispersion parameter). There is a single parameter p for the gene, exons and transcripts within that gene. The mean expression for exons are μ ei (i = 1, · · · , E) and for transcripts are μ t i (i = 1, · · · , T ). data for a particular gene in a given sample ("?" refers to unobserved data). The M matrix for this data is on the right.
We also need notation on the exon-transcript level to show that our model integrates the gene, exon and transcript analysis under a coherent NB model. Let y eit i be the read count in exon i and transcript i (i = 1, · · · , E; i = 1, · · · , T ). Since not all exons are included in a particular transcript, we set M eit i = 1 if exon i belongs to transcript i and zero otherwise. Let S ei be the effective length of exon i, it follows that the effective Furthermore, we use subscript "k" to denote experimental conditions. For example, in condition k (k = 1, 2), the read count data are denoted by y jk , y ei jk and y t i jk on the gene, exon and transcript level, respectively, and the corresponding dispersion parameters are r k , r ei k and r t i k , mean parameters are μ k , μ ei k and μ t i k , and the probability parameter is p k .

A family of negative binomial models for the analysis of gene, exon and transcript
In this section we focus on the modelling for a particular gene in a particular sample (we dropped indices of j and k for simplified notation). Table 1 shows a simple hypothetical data set with a total read count 700. This total count is decomposed into multiple counts on the exon level (i.e., 700 = 100 + 200 + 400), or multiple counts on the transcript level. Here, "exon" is used loosely. More generally, "exon" can also be thought of as part of the exon (i.e., an exon counting bin in Anders et al. (2012), or as defined in Turro et al. (2011)). Regardless of these various definitions, counts in exons are observed, while counts in transcripts are unobserved. If we know how each exon count is allocated to different transcripts, we can impute the unobserved count for a given transcript by summing over exon counts belonging to that transcript. However, the allocation probabilities are unknown. Moreover, the allocation is under some restrictions because not all exons are included in a particular transcript. We use a M matrix to consider the restrictions. In the hypothetical example, the first row of M is (1, 1) indicating that count 100 is allocated to two transcripts, and the second row is (0, 1) indicating that count 200 is only allocated to transcript 2 since transcript 1 does not include exon 2.
In this paper, we exploit the infinitely divisible property of the negative binomial distribution and define a family of NB models for each gene. In a "gene" family, read counts in all family members (exons, transcripts and the corresponding gene) have NB distributions. These NB distributions share a common probability parameter p, but each retain its own dispersion parameter (we call it a FNB model). The key assumption in our FNB model is that all family members share the same p. Biologically, the common p can characterize the effect of common transcription factors that control the rates of transcription in regulating the amounts of RNA products. Mathematically, this assumption greatly reduces the number of parameters in the model, which could gain efficiency for the parameter estimation, especially when the sample size is small as in RNA-Seq studies. Furthermore, this assumption allows us to unify the analyses of gene, exon and transcript under a coherent modeling framework. To investigate the robustness of the FNB model to this assumption, we conducted extensive simulation studies and performed goodness-of-fit tests in the real data analysis.
To illustrate that the gene, exon and transcript analyses are integrated under our proposed FNB model, we start from the exon-transcript level by assuming that the read count in exon i and transcript i has a NB distribution. That is, y eit i ∼ NB(M eit i S eir t i , p), wherer t i is the dispersion parameter per unit length of transcript i , and S ei is the effective length of exon i (Turro et al., 2011). Here, y eit i = 0 if M eit i = 0. The count in exon i has a NB distribution, which is given by This is true because the sum of independent NB distributions with a common p is still a NB distribution with the same p and the dispersion parameter obtained by summing over the dispersion parameters in the independent NB distributions. Here, r ei = S ei T i M eit i r t i . It is worth noting that the M matrix and exon length are not necessary for the estimation of r ei , which can be directly estimated using observed exon counts. Similarly, on the transcript level, we have Therefore, analyses of the gene, exon and transcript were integrated under a coherent NB modeling framework. The only unknown parameters in the FNB model are the probability and dispersion parameters, the data on the exon level, the exon and transcript length, and M matrix are known.
In the next two sections, we show that the above modelling framework 1) provides a simple way to infer the allocation probabilities in order to impute unobserved transcript counts, and 2) reduces the complex problem of estimating the relative usage in a NB model to a simpler problem of estimating of the proportion of latent counts from a multinomial distribution.

Imputation of unobserved transcript counts
The read count in each transcript is not directly observed while similar transcripts can generate identical sequence reads (for example, the "?" in Table 1). Imputation of transcript counts is statistically very challenging especially under the NB model. Because of the difficulty, a two-step procedure is commonly used Niu et al., 2014). That is, impute transcript counts in the first step, and then perform differential test in the second step based on the imputed counts obtained from the first step. Ideally, these two steps should be modelled simultaneously to avoid the potential bias and gain efficiency in the parameter estimation.
In our imputation, we treat y eit i (i = 1, · · · , E; i = 1, · · · , T ) as unknown parameters, and estimate them together with other parameters in the NB distribution. As discussed in Section 2.3, y eit i ∼ NB(M eit i S eir t i , p), and it can also be reformulated using the Gamma-Poisson mixture formulation, At each Markov chain Monte Carlo (MCMC) iteration, we sample λ eit i (i = 1, · · · , T ) from Gamma distributions in (5). Based on the relationship between independent Poisson distributions and the multinomial distribution, we estimate transcript counts in exon i from . This probability vector p contains the allocation probabilities, with which the count in exon i is distributed to different transcripts. The imputed count for transcript i is simply y t i = i y eit i .
It is important to note that our imputation is naturally embedded in the whole estimation procedure. This procedure relies on the Gamma-Poisson mixture formulation of the NB distribution for imputation and the compound Poisson representation of the NB distribution for the parameter estimation.

Estimation of the relative usage
When the interest lies in estimating the fraction of the gene's reads that falls into the exon or transcript (i.e., the relative usage), read counts across all exons/transcripts in a particular gene need to be modelled simultaneously. The FNB model reduces the complex problem of estimating the relative usage in a NB model to a simple problem of estimating of the proportion of latent counts from a multinomial distribution. Specifically, the relative exon usage Q E = (Q e1 , · · · , Q ei , · · · , Q e E ) is sampled from a Dirichlet distribution, where l ei j ∼ CRT(y ei j , r ei ), and h 0 is a fixed hyperparameter in the Dirichlet prior to quantify the prior belief of the relative usage. Conceptually, latent counts across the exons have a multinomial distribution. The relative usage of an exon is proportional to the latent counts in that exon (proof for (6) can be found in supplementary materials (Zhao et al., 2017)).
Similarly, the relative transcript usage is based on the latent counts inferred from the imputed transcript counts, that is,

Parameter estimation and differential tests
In the previous sections, we defined our FNB model and emphasized its novelty through its simplicity to impute the unobserved transcript counts and infer the relative usage.
In this section we discuss detailed steps for parameter estimations and differential tests on the gene, exon and transcript level, respectively.
For the illustrative purpose, we assume two experimental conditions and use index k to denote condition, but the algorithms can be easily extended to deal with multiple conditions. Recall that y jk denotes the normalized total count for a particular gene in sample j in condition k (j = 1, · · · , n k ; k = 1, 2). The read counts on the gene, exon and transcript level all have NB distributions and they share a common p k , but they have their own dispersion parameters (see Section 2.2 for the relevant notation). In the sampling algorithms below, the gene-level dispersion parameters are assumed to be the same between conditions (i.e., r 1 = r 2 = r, which is a commonly used assumption in practice), but the algorithms can be easily modified to allow different r's.

Gene analysis
We proposed two algorithms to estimate the expression for a particular gene. Algorithm 1A is a simple extension of (1)-(3) for studies with two experimental conditions. After iterating the four steps in Algorithm 1A, we obtain posterior distributions of the mean gene expression, μ 1 and μ 2 , for the two conditions (we removed the "|−" in the algorithms below to simplify the notation). Based on these posterior distributions, we obtain the posterior distribution of the log fold change, β = log(μ 2 ) − log(μ 1 ), and compute a posterior probability P β = 2 × min{P (β ≤ 0), P (β > 0)}, with a smaller value of P β indicating stronger evidence of differential expression (DE) for that gene.
For the multiplicity control, we can convert P (β ≤ 0)'s, or P (β > 0)'s, over all genes to obtain the posterior expected false discovery rates (FDR) using methods in Lewin et al. (2007); León-Novelo et al. (2013) (see supplementary materials (Zhao et al., 2017)). These FDRs are derived under an one-sided test: H 1 : μ 2 ≤ μ 1 or H 1 : μ 2 > μ 1 . If r is assumed to be the same between conditions, we can derive the FDRs under a two-sided test (i.e., H 0 : μ 2 = μ 1 vs. H 1 : μ 2 = μ 1 ), see Algorithm 1B. If r is common between The rest (sampling l jk and r) are the same as in Algorithm 1A conditions, the hypothesis test for DE is reduced to H 0 : p 2 = p 1 vs. H 1 : p 2 = p 1 . Let z ∈ {0, 1} indicates choosing the null or alternative model, the posterior probability of selecting H 1 is given by where π 0 and π 1 are the prior probabilities for null and alternative model. The marginal likelihood functions, m 0 (y) and m 1 (y) under the null and alternative model, are obtained by integrating out the p k 's with respect to their prior distributions. Under our parametrization of the NB distribution, the marginal functions have simple forms due to the fact that a NB distribution with a beta prior for the probability parameter is a beta negative binomial distribution, which is given by where B(, ) denotes a beta function; derivation of the formula (7) is in supplementary materials (Zhao et al., 2017). Posterior probabilities P r(z = 1|−)'s over all genes, obtained from Algorithm 1B, are then converted to the posterior expected FDRs.
When the number of replicates per condition is small, it is beneficial to allow each gene to have its own dispersion estimate while borrowing information from the other genes. In the Bayesian framework, this is achieved by adding a hierarchical structure for gene-level r s. Here, we add index g to represent gene g. That is, r g ∼ Gamma(R, c 0 ) g = 1, · · · , G. Then the sampling of r g is where Parameters e 0 and f 0 in a gamma distribution quantify the prior belief for the population parameter R, and c 0 controls the shrinkage of the gene-level r g 's to the population average.

Exon analysis
Algorithm 2 contains steps to sample parameters in the exon analysis.

Algorithm 2 Exon Analysis
Sample MCMC sampling allows us to carry out essentially all posterior inference of interest.
2. To test differential relative usage for exon i, we calculate the posterior probability P (Q ei 2 > Q ei 1 ) or P (Q ei 2 < Q ei 1 ).
3. To test the difference in the overall exon usage, we use a compositional metric (Aitchison et al., 2000) (denoted by Adist) to compare the two probability vectors Q E 1 vs. Q E 2 . This metric is widely recognized as the most appropriate geometry for the compositional data. Specifically, Adist = ( Then we calculate the posterior probability P (Adist > threshold) to quantify the difference (for example, threshold = 1).
Finally, we can convert all posterior probabilities to FDRs for the multiplicity control under an one-sided hypothesis. To borrow data information in the whole genome, we can add a hierarchical prior for the gene-level r's as specified in (8)-(10). Similarly, we can add a gamma hierarchical prior for r's in the two conditions such that the r's are not the same, but could be similar between conditions.

Transcript analysis
Algorithm 3 contains steps to sample parameters in the transcript analysis, which are similar to Algorithm 2 for the exon analysis except that we added two more steps to impute the unobserved transcript counts.

Algorithm 3 Transcript Analysis
Sample )), and calculate The differential tests are the same as in the exon analysis with the following modifications based on suggestions in Cuffdiff .
1. The differential tests are based on transcript-length adjusted estimates μ * t i and 2. The expression for a particular gene is estimated using estimates of the transcript expression, that is, μ * k = i μ * t i k (k = 1, 2). This estimate has a larger variance compared to the estimate directly using the total gene count (see Section 3.1), since it considers the uncertainty of assigning reads to transcripts.

Simulation studies
In this section, we ran extensive simulations to evaluate our model. First, we investigated our Gibbs sampling algorithms for the gene expression analysis. Secondly, we tested the robustness of our FNB model when the key assumption of the model is violated. Finally, we evaluated our FNB model for the exon analysis and compared simulation results to DEXSeq.

Gene expression analysis
There are many existing algorithms to perform the differential test for gene expression data. We set up simulation studies as described in Soneson and Delorenzi (2013), and compared our method to three popular methods, DESeq, edgeR and baySeq. The differential test for the gene expression using the total counts in a gene does not require the key assumption in the FNB model. Hence, we directly simulated gene counts from NB distributions with mean and dispersion parameters estimated from two real RNA-Seq data sets. One data set contains RNA-Seq data from 69 unrelated Nigerian individuals (Pickrell et al., 2010). The other data set contains RNA-Seq data from 41 unrelated Caucasian individuals of European descent (Cheung et al., 2010). For each data set, the maximum likelihood estimates of the mean and the dispersion were obtained for each gene using all samples. Then we merged all pairs of mean and dispersion estimates to form a population of true parameters to be sampled in the simulation studies.
We studied two scenarios with 2 or 5 samples per condition. For each scenario, we simulated 10 data sets, with each data set consisting of 10,000 genes with mean and dispersion parameters randomly sampled from the population pool. The number of differentially expressed genes was set to 20%, and the log 2 fold changes for the differentially expressed genes were generated from a standard normal distribution. In these simulations, the dispersions in both conditions were assumed to be identical. Both algorithms 1A and 1B were used for the estimation, with a hierarchical structure on the gene-level dispersion parameters, as specified in (8)-(10). We set a 0 = b 0 = 1 in the beta prior and e 0 = f 0 = c 0 = 0.1 in the gamma prior (vague priors). With a burn-in of 4000 iterations, an additional 6000 iterations were used for inference. We observed that the chain mixes well. Figure 1 shows that our method is comparable to DESeq, edgeR and baySeq. It is not surprising that all four methods give similar performances because all methods used a NB model for the differential test. When n is larger (n = 5), the FNB model seems to have a slightly higher area under the receiver operating characteristic (ROC) curve than DESeq2 and edgeR, and slightly smaller false discoveries than the baySeq. Our results show that our Gibbs sampling algorithms provide competing alternatives for the differential test in gene expression analysis.

Robustness of the FNB model
The key assumption of the FNB model is that the counts in exons/transcripts share the same parameter p. This assumption greatly simplifies the estimation of relative usage and the imputation of transcript counts. In this section, we conducted simulation studies to assess the robustness of the FNB model when this assumption is violated.
In all simulations, we generated datasets consisting of 24,000 exons with 2 conditions and 3 samples per condition. Each exon count was generated from a NB distribution with a mean and a probability parameter, p. We randomly sampled 24,000 means from the mean estimates derived from the pasilla dataset (Brooks et al., 2010). The proportion of differentially expressed exons and the distribution of fold change were the same as the gene expression simulations in Section 4.1. Parameter p's were simulated under four scenarios. In scenario I, p's were fixed to be 0.8 for all exons in both conditions. In the remaining three scenarios, p's were generated from a beta distribution in both conditions. Specifically, we used beta(8, 2), beta(4, 1), and beta(1, 1) for scenario II, III, and IV, respectively. The mean of the distribution for p is 0.8 in scenarios II and III, which was determined based on the real dataset. In scenario IV, p's are uniformly distributed between 0 and 1. When p's have a larger variance, the assumption (i.e., the same p's for exons in the same gene) is more likely to be violated. Hence, scenarios II, III and IV studied the presence of mild, moderate and severe departure of the assumption, respectively.
In each scenario, there is a read count matrix of 24,000 rows, with each row representing an exon and a column representing a sample in a condition. Next, we used six different ways to assign gene IDs, resulting in six different datasets: 1) every 30 exons share an unique gene ID, 2) every 10 exons share an unique gene ID, 3) every 6 exons share an unique gene ID, 4) every 3 exons share an unique gene ID, 5) every 2 exons share an unique gene ID and 6) each exon has an unique gene ID. It is important to note that the six dastasets under each scenario have exactly the same read count matrices but different gene IDs. Since the FNB model assumes that all exons in the same gene share a common p, the first five datasets allow us to investigate whether the FNB model is robust to the violation of the assumption when every 30, 10, 6, 3 or 2 exons share a common p if p's are slightly, moderately and dramatically different (scenarios II-IV). Among these studied cases, the first dataset in scenario IV represents the strongest deviation from the assumption, because the model assumes that 30 exons have the same p's when the actual p's are likely to be very different. In scenario I, all six datasets satisfy the assumption, and the last dataset in scenarios II to IV also satisfies the assumption since each exon has its own p.
Finally, we applied our FNB model to 24 datasets (6 datasets × 4 scenarios), and results for the six datasets were compared in each scenario. In the FNB model, we used Algorithm 2 with a gamma hierarchical prior for r's in the two conditions. We set a 0 = b 0 = 1, e 0 = f 0 = h 0 = 0.1, and c 0 = 0.01 (weak priors). With a burn-in of 4000 iterations, an additional 6000 iterations were used for inference. We also applied DESeq2 to the last dataset in each scenario and compared to the FNB model. To evaluate the model performance, we calculated the area under the ROC curve (AUC). Additionally, we computed model selection criteria to assess the goodness-of-fit of the six FNB models within each scenario. We chose log-pseudo marginal likelihood (LPML) (Ibrahim et al., 2001;Chen et al., 2000) and Watanabe-Akaike information criterion (WAIC) (Watanabe, 2010). LPML is a cross-validated leave-one-out measure of a model's ability to predict the data. It is valid for small and large samples and does not suffer from a heuristic justification based on large sample normality. Based on Gelfand and Dey, 1994, the LPML implicitly includes a similar dimensional penalty as AIC asymptotically. WAIC was proposed recently and can also be viewed as an improvement over the standard DIC and it estimates pointwise out-of-sample prediction accuracy from a fitted Bayesian model (see supplementary materials (Zhao et al., 2017)). The best model should have the smallest WAIC and |LMPL|.
As shown in Figure 2, the FNB model were very robust in all studied scenarios. The differences in AUC are less than 0.005 even in scenario IV where p's are vastly different. Moreover, when p's are the same as in scenario I, the FNB model gained power as more exons were used to estimate the p parameter. To our surprise, a similar gain in power was also observed in scenario II where p's are different (p ranges from 0.186 to 0.999). Compared to the AUC in DESeq2, the FNB model is better in 3 out of 4 studied scenarios.
The model fit statistics in Figure 3 appear to tell us the same story as in the AUC results. The differences in LPML and WAIC statistics are very small in all studied scenarios.

Exon-level data analysis
We conducted extensive simulation studies to evaluate our FNB model for the exon analysis. In the simulations, we compared the FNB model to DEXSeq. DEXSeq is a commonly used package for testing the exon relative usage (i.e., the fraction of the gene's reads that falls into the exon) between conditions. It estimates the relative usage for each exon separately. Specifically, each exon is a factor with two levels, this (i.e., the number of reads to the exon in question) and others (i.e.,the read counts from all other exons of the same gene), and an interaction between exon and condition is tested in a generalized linear model. In contrast, the FNB model estimates the relative usage for all exons simultaneously. Furthermore, it tests a global hypothesis for the overall relative usage. In addition to the exon usage, the FNB model also tests for the differential expression of exons.
In all simulation studies, we generated datasets with 2 conditions with 3 samples per condition. Each simulated dataset consists of 10,000 genes with mean expressions randomly sampled from two real RNA-Seq data sets as described in 4.1. The mean expression for each exon was calculated based on its relative usage, which was simulated from a Dirichlet distribution. Finally, read counts in each exon were generated from a NB distribution. We assumed that 3 or 6 exons per gene in each dataset, which produces a dataset with 30,000 or 60,000 exons (call it a 3-exon or 6-exon dataset). To test for the exon Figure 3: Each curve indicates a different scenario. When each exon was assumed to have a different p (i.e., number of exons sharing a common p is one), |LMPL| statistics are 24.4, 26.4, 24.8 and 22.2 for scenario I to IV, respectively, and WAIC statistics are 47.0, 47.6, 47.8,and 43.4 for scenario I to IV, respectively (call these statistics as the baseline values). The WAIC and |LMPL| statistics on the y-axis were adjusted by the baseline values such that the first value in each scenario is zero. expression and the relative usage at the same test, we generated 70% genes with the same expression and same exon relative usage between conditions, 20% genes with the same expression and differential exon relative usage between conditions, and 10% genes with differential expression and same exon relative usage between conditions. The exon expression is calculated by multiplying the gene expression and the relative usage. Thus, each dataset includes exons with the same and different expressions/usages between conditions, and genes with the same and different overall exon relative usages.
To generate gene expressions, we used a similar approach as described in Section 4.1. For DE genes, log 2 fold changes were simulated from a normal distribution with mean 0 and standard deviation 2. The relative usages were generated from a Dirichlet distribution in both conditions. Specifically, a Dirichlet(1,4,8) for the 3-exon dataset and a Dirichlet(1,4,8,1,2,2) for the 6-exon dataset. The averaged usage is (0.08 0.31 0.61) and (0.06,0.22,0.44,0.06,0.11,0.11), respectively. Table 2 shows concentration parameters in the Dirichlet distributions used to generate the relative usage in condition 2. For the 3-exon dataset, we studied two scenarios. In scenario I, usages of the first two exons are switched in condition 2, and the differences in usage between conditions are about 23% (i.e., 0.31-0.08=23%) on average. For the 6-exon dataset, we studied three scenarios. In the last scenario, all six exons have different usages between conditions. Based on the mean gene expression and the relative usage described as above, we calculated the mean expression for each exon. To generate exon read counts from a NB distribution, we also need to determine a probability parameter p. We generated p's from three distributions. The first distribution assumes that p's for exons in the same  gene are the same. The last two distributions allow p's to be different (a more realistic assumption). To determine the p's in the first distribution, we first sampled gene-level dispersion parameters, r's, from the two real RNA-Seq data sets as described in Section 4.1, and then obtained the p by p = μ/(μ + r) (here, μ is the mean gene expression). In the other two distributions, p's were simulated from a beta distribution, i.e., beta(8,2) or a beta(1,1) as described in Section 4.2.
Finally, we applied the FNB model to six 3-exon datasets (2 scenarios for relative usage × 3 distributions for p), and nine 6-exon datasets (3 scenarios for relative usage × 3 distributions for p). As shown in Table 3, the FNB model tests for the exon expression, relative usage and the overall relative usage simultaneously. In contrast, DEXSeq only tests for the exon relative usage. With regarding to the relative usage, the FNB model exhibited a dramatically better performance than DEXSeq when the model assumption is satisfied. It performed similarly to DEXSeq when the assumption is violated, and it seems to be slightly better if the violation is not so strong (all ROC and FDR curves can be found in supplementary materials (Zhao et al., 2017)). Additionally, tests for the overall relative usage are always more powerful than the tests for a single exon usage.
In simulation studies we did not compare FNB to Cuffdiff, because we directly generated read count data from R package and it's technically challenging to compare to Cuffdiff which has its own pipeline that takes raw sequencing reads. However, we illustrated the advantage of FNB over Cuffdiff in the real data analysis.

Real data analysis
In this section we applied our FNB model on a real dataset (Brooks et al., 2010). The experiment investigated the effect of siRNA knock-down of pasilla, a gene that is known to bind to mRNA in the spliceosome, which is thought to be involved in the regulation of splicing. The dataset is available in DEXSeq package, which contains 3 biological replicates of the knockdown as well as 4 biological replicates for the untreated control. BDGP5.gtf was used as the reference transcriptome annotation.

Gene expression analysis
To test the differential gene expression, we used Algorithm 1B with the same priors as in the simulation studies. For each gene, we estimated prior π 1 /π 0 to be 0.15 using the  Table 3: AUC for the analysis of exon relative usage by the FNB model and DEXSeq. The FNB model also provides the analysis for the exon expression and overall relative usage.
qvalue package (Storey and Tibshirani, 2003). When the FDR was controlled at 5%, we compared the number of DE genes selected from the FNB model to that identified in DESeq, edgeR and baySeq. Figure 4 shows a good overlap among the set of DE genes selected by the four methods.

Exon analysis
In the data set, 70% genes have less than 6 exons and 90% genes have less than 11 exons.
We applied the FNB model to the pasilla data for the exon analysis. We normalized the count data using the default method in DEXSeq and deleted exons with the total sample counts ≤ 5. We used Algorithm 2 for the statistical inference with a 0 = b 0 = 1 and e 0 = f 0 = h 0 = 0.1 (r's on the gene-level were assumed to be the same between conditions based on the goodness-of-fit statistics). With a burn-in of 4000 iterations, an additional 6000 iterations were used for inference.
The model simultaneously provided test results for exon expression, relative usage for each exon, and the overall relative usage within a gene. We compared our results to DEXSeq (Anders et al., 2012) for the relative usage of each exon. In DEXSeq, the relative usage for a particular exon is tested through testing the interaction between that exon (defined as "this" and "others"; see Section 2.5) and condition in a generalized linear model. This test is the same as comparing Q ei 2 and Q ei 1 (i = 1, · · · , E) in the FNB  model as both methods compare whether the fraction of the gene's reads that fall into the exon differs significantly between conditions. Figure 5 shows a very good overlap between the two methods. The FNB model identified 520 (≈ 1%) significantly under-used exons (i.e., Q ei 2 < Q ei 1 ; k = 1 for the controls) while DEXSeq identified 77. When the FDR was controlled at 2.5%. About 90% of those exons identified by DEXSeq were also selected by FNB. A similar conclusion was obtained for the over-used exons (i.e., Q ei 2 > Q ei 1 ). It is important to note that the comparison to DEXSeq is just to show that the FNB model provided reasonable results for the real data analysis. However, the important feature of the FNB model is its ability to test multiple hypotheses. For example, the FNB model shows that 2.4% gene have large differences in the overall usage between two conditions (median Adist larger than 20). Another example of the FNB model used to evaluate composite hypotheses was presented in the transcript analysis.

Transcript analysis
Algorithm 3 was used for the inference with the same priors as in the exon analysis. With a burn-in of 4000 iterations, an additional 6000 iterations were used for inference. Posterior distributions of μ * t i k and μ * k (k = 1, 2) were used to test differential transcript and gene expression, respectively, as discussed in Section 3.3. Then we compared DE transcripts and genes identified by the FNB model to that selected by Cuffdiff  (see details of the parameter setting in Cuffdiff in supplementary materials (Zhao et al., 2017)). The comparisons were based on all transcripts/genes that were testable by Cuffdiff (the status in the output file showed "OK"). As shown in Figure  6, there is a good overlap among the set of DE transcripts/genes selected by the FNB model and Cuffdiff, and the FNB model identified more DE transcripts and genes.
It is important to note that Cuffdiff relies on a two-step method for inference (i.e., impute transcript counts + estimate expression), while the FNB model performs the two steps simultaneously. Figure 7 plots the statistical significance versus estimated transcript fold-changes for both methods. It seems that Cuffdiff missed out majority of transcripts with a large magnitude of fold change.
One appealing feature of the FNB model is its flexibility to test various hypotheses. It not only allows us to test the relative usage for a particular transcript, but also allows a test for the overall usage for a particular gene (see Section 3.3). Combined with the test for gene expression, we can categorize each gene into one of the four groups: (1) Non-DE gene and transcript usage, (2) DE gene and Non-DE transcript usage, (3) Non-DE gene and DE transcript usage, or (4) DE gene and transcript usage (Shi and Jiang (2013) considered (3) and (4) as a single group). Figure 8 depicts a representative gene for each of the four groups.

Goodness of fit of the FNB model
In the exon analysis, we compared the LPML and WAIC statistics for our FNB model, a Poisson model, and a NB model, in which each exon has its own dispersion and Figure 8: Plot of four possible scenarios for the analysis of gene expression and relative transcript usage. The transcripts with differential relative usage are indicated by the red lines. The difference in the overall relative usage is indicated by Adist, with a large value suggesting a large difference; PostPr= P (μ * 2 > μ * 1 )(μ * 2 vs μ * 1 ), and a value of PostPr close to 0 or 1 suggests an under-or over-expressed gene. probability parameter (call it an IndNB model). In the Poisson model, the Gamma-Poisson conjugacy was used to update the rate parameter of the poisson distribution for each exon. In the IndNB model, Algorithm 1A was used to estimate r e and p e k by treating each exon as a single gene. In the transcript analysis, we compared the FNB model to a Poisson model based on the fit on the exon-level observed data. In the Poisson model, a multinomial distribution was used to impute the latent transcript counts as described in Turro et al. (2011). In this comparison, we did not evaluate an IndNB model   Table 4 shows that the FNB model dramatically improved the fit of the data compared to the Poisson model, and the fit statistics are similar between FNB and IndNB model. Compared to the IndNB model, the FNB model is much more appealing, in that 1) greatly simplifies the estimation for (overall) relative usage, and 2) allows straightforward imputation of the unobserved transcript counts.

Discussions
We have developed a family of NB models that addressed the downstream analysis tasks for count-based RNA-Seq data. To our best knowledge, this is the only method which integrates the gene, exon and transcript analysis under an unified NB modelling framework. The computational algorithms are very simple because of the available conjugate forms in our FNB model. It easily incorporates the uncertainty of assigning reads to transcripts using the Gamma-Poisson mixture formulation of the NB distribution. Furthermore, it greatly simplifies the estimation of the relative exon/transcript usage, by reducing a complex problem of estimating the relative usage in a NB model to a simple problem of estimating of the proportion of latent counts from a multinomial distribution.
Our simulation studies showed that our FNB model is very robust to the violation of the model assumption. In real data analysis, we compared the FNB model to 1) DE-Seq, edgeR and baySeq for the gene expression analysis, 2) DEXSeq for the exon usage analysis, and 3) Cuffdiff for the differential transcript analysis. We have demonstrated that the FNB model provided a good fit to the real data. It also allows us to test various hypotheses of interest at the same time, including the exon/transcript expression, exon/transcript relative usage and overall relative usage.
Our method is relatively computationally intensive, but has been implemented to take advantage of parallel processing. It took approximately 3 hours for the transcript and exon analysis in the pasilla dataset, running on a quad-core Intel Xeon 2.10 GHz 8 GB RAM x64 computer. Frequentist approaches, such as DESeq, edgeR, baySeq, or DEXSeq, took just a couple of minutes. However, these methods only test a single hypothesis, while the FNB model is able to test multiple hypotheses simultaneously and provides a more comprehensive analysis than the previous mentioned methods.
In the transcript analysis, the FNB model relies on the M matrix to impute the count data in transcripts. In this work the M matrix was created using DEXSeq. Additionally, this M matrix can be obtained using other software, such as rSeq (Salzman et al., 2011;Shi and Jiang, 2013) and MMSEQ (Turro et al., 2011). In MMSEQ, the insert size from paired-end data can be considered to construct the M matrix.