Statistical Science

Statistical Modeling of RNA-Seq Data

Julia Salzman, Hui Jiang, and Wing Hung Wong


Ultra high-throughput sequencing of RNA (RNA-Seq) has recently been developed as an approach for analyzing gene expression. By obtaining tens or even hundreds of millions of reads of transcribed sequences, an RNA-Seq experiment can offer a comprehensive survey of the population of genes (transcripts) in any sample of interest. This paper introduces a statistical model for estimating isoform abundance from RNA-Seq data. The model is flexible enough to accommodate both single-end and paired-end RNA-Seq data, as well as sampling bias along the length of the transcript. Based on the derivation of minimal sufficient statistics for the model, a computationally feasible implementation of the maximum likelihood estimator is provided. Further, it is shown that paired-end RNA-Seq provides more accurate isoform abundance estimates than single-end sequencing at fixed sequencing depth. Simulation studies are also given.

Article information

Statist. Sci. Volume 26, Number 1 (2011), 62-83.

First available in Project Euclid: 9 June 2011

Salzman, Julia; Jiang, Hui; Wong, Wing Hung. Statistical Modeling of RNA-Seq Data. Statistical Science 26 (2011), no. 1, 62–83. doi:10.1214/10-STS343.
