The Annals of Applied Statistics

Quantifying alternative splicing from paired-end RNA-sequencing data

David Rossell, Camille Stephan-Otto Attolini, Manuel Kroiss, and Almond Stöcker

Full-text: Open access

Abstract

RNA-sequencing has revolutionized biomedical research and, in particular, our ability to study gene alternative splicing. The problem has important implications for human health, as alternative splicing may be involved in malfunctions at the cellular level and multiple diseases. However, the high-dimensional nature of the data and the existence of experimental biases pose serious data analysis challenges. We find that the standard data summaries used to study alternative splicing are severely limited, as they ignore a substantial amount of valuable information. Current data analysis methods are based on such summaries and are hence suboptimal. Further, they have limited flexibility in accounting for technical biases. We propose novel data summaries and a Bayesian modeling framework that overcome these limitations and determine biases in a nonparametric, highly flexible manner. These summaries adapt naturally to the rapid improvements in sequencing technology. We provide efficient point estimates and uncertainty assessments. The approach allows one to study alternative splicing patterns for individual samples and can also be the basis for downstream analyses. We found a severalfold improvement in estimation mean square error compared to popular approaches in simulations, and substantially higher consistency between replicates in experimental data. Our findings indicate the need for adjusting the routine summarization and analysis of alternative splicing RNA-seq studies. We provide a software implementation in the R package casper.

Article information

Source
Ann. Appl. Stat., Volume 8, Number 1 (2014), 309–330.

Dates
First available in Project Euclid: 8 April 2014

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1396966288

Digital Object Identifier
doi:10.1214/13-AOAS687

Mathematical Reviews number (MathSciNet)
MR3191992

Zentralblatt MATH identifier
06302237

Keywords
Alternative splicing; RNA-Seq; Bayesian modeling; estimation

Citation

Rossell, David; Stephan-Otto Attolini, Camille; Kroiss, Manuel; Stöcker, Almond. Quantifying alternative splicing from paired-end RNA-sequencing data. Ann. Appl. Stat. 8 (2014), no. 1, 309–330. doi:10.1214/13-AOAS687. https://projecteuclid.org/euclid.aoas/1396966288


References

  • Ameur, A., Wetterbom, A., Feuk, L. and Gyllensten, U. (2010). Global and unbiased detection of splice junctions from RNA-seq data. Genome Biol. 11 R34.
  • Blencowe, B. J. (2006). Alternative splicing: New insights from global analyses. Cell 126 37–47.
  • Casella, G. and Berger, R. L. (2001). Statistical Inference, 2nd ed. Duxbury, N. Scituate.
  • Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol. 39 1–38.
  • ENCODE Project Consortium (2004). The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306 636–640.
  • Glaus, P., Honkela, A. and Rattray, M. (2012). Identifying differentially expressed transcripts from RNA-seq data with biological variation. Bioinformatics 28 1721–1728.
  • Guttman, M., Garber, M., Levin, J. Z., Donaghey, J., Robinson, J., Adiconis, X., Fan, L., Koziol, M. J., Gnirke, A., Nusbaum, C., Rinn, J. L., Lander, E. S. and Regev, A. (2010). Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28 503–510.
  • Holt, R. A. and Jones, S. J. M. (2008). The new paradigm of flow cell sequencing. Genome Res. 18 839–846.
  • Jiang, H. and Wong, W. H. (2009). Statistical inferences for isoform expression in RNA-Seq. Bioinformatics 25 1026–1032.
  • Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. J. Amer. Statist. Assoc. 53 457–481.
  • Katz, Y., Wang, E. T., Airoldi, E. M. and Burge, C. B. (2010). Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat. Methods 7 1009–1015.
  • Lacroix, V., Sammeth, M., Guigo, R. and Bergeron, A. (2008). Exact transcriptome reconstruction from short sequence reads. In Proceedings of the 8th International Workshop on Algorithms in Bioinformatics. 50–63. Springer, Berlin.
  • Langmead, B., Trapnell, C., Pop, M. and Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10 R25.
  • Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25 1754–1760.
  • Li, R., Yu, C., Li, Y., Lam, T. W., Yiu, S. M., Kristiansen, K. and Wang, J. (2009). SOAP2: An improved ultrafast tool for short read alignment. Bioinformatics 25 1966–1967.
  • Montgomery, S. B., Sammeth, M., Gutierrez-Arcelus, M., Lach, R. P., Ingle, C., Nisbett, J., Guigo, R. and Dermitzakis, E. T. (2010). Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 464 773–777.
  • Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5 621–628.
  • Pepke, S., Wold, B. and Mortazavi, A. (2009). Computation for ChIP-seq and RNA-seq studies. Nat. Methods 6 S22–S32.
  • Roberts, A., Trapnell, C., Donaghey, J., Rinn, J. L. and Pachter, L. (2011a). Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 12 R22.
  • Roberts, A., Pimentel, H., Trapnell, C. and Pachter, L. (2011b). Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics 27 2325–2329.
  • Rogers, M. F., Thomas, J., Reddy, A. S. and Ben-Hur, A. (2012). SpliceGrapher: Detecting patterns of alternative splicing from RNA-Seq data in the context of gene models and EST data. Genome Biol. 13 R4.
  • Rossell, D., Stephan-Otto Attolini, C., Kroiss, M. and Stöcker, A. (2014). Supplement to “Quantifying alternative splicing from paired-end RNA-sequencing data.” DOI:10.1214/13-AOAS687SUPP.
  • Salzman, J., Jiang, H. and Wong, W. H. (2011). Statistical modeling of RNA-Seq data. Statist. Sci. 26 62–83.
  • Therneau, T. and Lumley, T. (2011). Survival: Survival analysis, including penalised likelihood. R package version 2.36-10.
  • Trapnell, C., Pachter, L. and Salzberg, S. L. (2009). TopHat: Discovering splice junctions with RNA-Seq. Bioinformatics 25 1105–1111.
  • Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., Salzberg, S. L., Wold, B. J. and Pachter, L. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28 511–515.
  • Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., Salzberg, S. L., Rinn, J. L. and Pachter, L. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7 562–578.
  • Wu, Z., Wang, X. and Zhang, X. (2011). Using non-uniform read distribution models to improve isoform expression inference in RNA-Seq. Bioinformatics 27 502–508.
  • Wu, J., Akerman, M., Sun, S., McCombie, W. R., Krainer, A. R. and Zhang, M. Q. (2011). SpliceTrap: A method to quantify alternative splicing under single cellular conditions. Bioinformatics 27 3010–3016.
  • Xing, Y., Yu, T., Wu, Y. N., Roy, M., Kim, J. and Lee, C. (2006). An expectation–maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs. Nucleic Acids Res. 34 3150–3160.

Supplemental materials

  • Supplementary material: Supplementary results. In Rossell et al. (2014) we assess the dependence of fragment start and length distributions on gene length, show additional simulation results, assess MCMC convergence and apply the approach to transcripts found de novo.