Open Access
February 2011
Statistical Modeling of RNA-Seq Data
Julia Salzman, Hui Jiang, Wing Hung Wong
Statist. Sci. 26(1): 62-83 (February 2011). DOI: 10.1214/10-STS343
Abstract

Recently, ultra high-throughput sequencing of RNA (RNA-Seq) has been developed as an approach for analysis of gene expression. By obtaining tens or even hundreds of millions of reads of transcribed sequences, an RNA-Seq experiment can offer a comprehensive survey of the population of genes (transcripts) in any sample of interest. This paper introduces a statistical model for estimating isoform abundance from RNA-Seq data. The model is flexible enough to accommodate both single-end and paired-end RNA-Seq data, as well as sampling bias along the length of the transcript. Based on the derivation of minimal sufficient statistics for the model, a computationally feasible implementation of the maximum likelihood estimator is provided. Further, it is shown that paired-end RNA-Seq provides more accurate isoform abundance estimates than single-end sequencing at fixed sequencing depth. Simulation studies are also given.
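
The abstract does not spell out the form of the model, so the following is only a minimal sketch of what maximum likelihood estimation of isoform abundance can look like under one common formulation. It assumes that reads are grouped into categories, that the count in category j is Poisson with mean sum_i theta_i * rates[i][j], and that the sampling rates are known. The names `estimate_abundance`, `log_likelihood`, `rates` and `counts` are hypothetical, and the multiplicative EM-style update is a standard technique for Poisson models of this kind, not necessarily the implementation described in the paper.

```python
import math


def log_likelihood(theta, rates, counts):
    """Poisson log-likelihood (up to a constant) of the category counts.

    Assumed model: counts[j] ~ Poisson(sum_i theta[i] * rates[i][j]).
    """
    total = 0.0
    for j, n in enumerate(counts):
        lam = sum(theta[i] * rates[i][j] for i in range(len(theta)))
        if lam > 0:
            total += n * math.log(lam) - lam
        elif n > 0:
            return float("-inf")
    return total


def estimate_abundance(rates, counts, iterations=500):
    """Approximate the maximum likelihood abundances with EM-style updates.

    rates[i][j] is the assumed (known) rate at which isoform i produces
    reads in category j; counts[j] is the observed read count in category j.
    Every isoform is assumed to have a positive total rate.
    """
    num_iso = len(rates)
    num_cat = len(counts)
    row_sums = [sum(row) for row in rates]
    theta = [1.0] * num_iso  # arbitrary positive starting point
    for _ in range(iterations):
        # Expected count in each category under the current abundances.
        lam = [
            sum(theta[i] * rates[i][j] for i in range(num_iso))
            for j in range(num_cat)
        ]
        # Reassign observed reads to isoforms in proportion to their
        # contribution, then renormalize by each isoform's total rate.
        theta = [
            theta[i]
            * sum(
                counts[j] * rates[i][j] / lam[j]
                for j in range(num_cat)
                if lam[j] > 0
            )
            / row_sums[i]
            for i in range(num_iso)
        ]
    return theta


# Toy example (made-up numbers): two isoforms, three read categories.
# Isoform 0 contains all three regions; isoform 1 skips the middle one.
example_rates = [[100.0, 50.0, 100.0], [100.0, 0.0, 100.0]]
example_counts = [500, 100, 500]
example_theta = estimate_abundance(example_rates, example_counts)
```

In this toy setup the middle category is produced only by isoform 0, which is what makes the two abundances separable. Paired-end data would enter such a sketch simply as a finer set of read categories with their own rates.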

Copyright © 2011 Institute of Mathematical Statistics
Julia Salzman, Hui Jiang, and Wing Hung Wong "Statistical Modeling of RNA-Seq Data," Statistical Science 26(1), 62-83, (February 2011). https://doi.org/10.1214/10-STS343