The Annals of Applied Statistics

MSIQ: Joint modeling of multiple RNA-seq samples for accurate isoform quantification

Wei Vivian Li, Anqi Zhao, Shihua Zhang, and Jingyi Jessica Li


Abstract

Next-generation RNA sequencing (RNA-seq) technology has been widely used to assess full-length RNA isoform abundance in a high-throughput manner. RNA-seq data offer insight into gene expression levels and transcriptome structures, enabling us to better understand the regulation of gene expression and fundamental biological processes. Accurate isoform quantification from RNA-seq data is challenging due to the information loss in sequencing experiments. A recent accumulation of multiple RNA-seq data sets from the same tissue or cell type provides new opportunities to improve the accuracy of isoform quantification. However, existing statistical or computational methods for multiple RNA-seq samples either pool the samples into one sample or assign equal weights to the samples when estimating isoform abundance. These methods ignore the possible heterogeneity in the quality of different samples and could result in biased and non-robust estimates. In this article, we develop a method, which we call “joint modeling of multiple RNA-seq samples for accurate isoform quantification” (MSIQ), for more robust isoform quantification by integrating multiple RNA-seq samples under a Bayesian framework. Our method aims to (1) identify a consistent group of samples with homogeneous quality and (2) improve isoform quantification accuracy by jointly modeling multiple RNA-seq samples and allowing for higher weights on the consistent group. We show that MSIQ provides a consistent estimator of isoform abundance, and we demonstrate the accuracy and effectiveness of MSIQ compared with alternative methods through simulation studies on D. melanogaster genes. We justify MSIQ’s advantages over existing approaches via application studies on real RNA-seq data of human embryonic stem cells, brain tissues, and the HepG2 immortalized cell line. We also perform a comprehensive analysis of how the isoform quantification accuracy would be affected by RNA-seq sample heterogeneity and different experimental protocols.

Article information

Source
Ann. Appl. Stat. Volume 12, Number 1 (2018), 510-539.

Dates
Received: November 2016
Revised: September 2017
First available in Project Euclid: 9 March 2018

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1520564482

Digital Object Identifier
doi:10.1214/17-AOAS1100

Keywords
Isoform abundance estimation; joint inference from multiple samples; RNA sequencing; Bayesian hierarchical models; Gibbs sampling; data heterogeneity

Citation

Li, Wei Vivian; Zhao, Anqi; Zhang, Shihua; Li, Jingyi Jessica. MSIQ: Joint modeling of multiple RNA-seq samples for accurate isoform quantification. Ann. Appl. Stat. 12 (2018), no. 1, 510–539. doi:10.1214/17-AOAS1100. https://projecteuclid.org/euclid.aoas/1520564482



