The Annals of Applied Statistics

Shrinkage of dispersion parameters in the binomial family, with application to differential exon skipping

Sean Ruddy, Marla Johnson, and Elizabeth Purdom

Full-text: Open access

Abstract

The prevalence of sequencing experiments in genomics has led to increased use of count-data methods for analyzing high-throughput genomic data. Shrinkage methods remain important for improving the performance of these statistical methods. A common example is gene expression data, where the counts per gene are often modeled as some form of overdispersed Poisson. Shrinkage estimates of the per-gene dispersion parameter have led to improved estimation of dispersion, particularly when the number of samples is small.

We address a different count setting introduced by the use of sequencing data: comparing differential proportional usage via an overdispersed binomial model. We are motivated by our interest in testing for differential exon skipping in mRNA-Seq experiments. We introduce a novel shrinkage method that models the overdispersion with the double binomial distribution proposed by Efron [J. Amer. Statist. Assoc. 81 (1986) 709–721].

Our method (WEB-Seq) is an empirical Bayes strategy that produces a shrunken estimate of dispersion and effectively detects differential proportional usage. It has close ties to the weighted-likelihood strategy of edgeR developed for gene expression data [Bioinformatics 23 (2007) 2881–2887, Bioinformatics (Oxford, England) 26 (2010) 139–140]. We analyze its behavior on simulated and real data sets and show that our method is fast, powerful and gives accurate control of the FDR compared to alternative approaches. We provide an implementation of our methods in the R package DoubleExpSeq, available on CRAN.
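
As a rough illustration of the double binomial model named in the abstract, the following Python sketch evaluates double binomial probabilities in the form given by Efron (1986), normalizing the density numerically over the support. The function names and the parameterization (mean proportion `mu`, dispersion `theta`, with `theta < 1` giving overdispersion) are assumptions made for this example. It is not the WEB-Seq shrinkage procedure and not the DoubleExpSeq implementation.

```python
import math


def _xlogy(x, y):
    # Convention 0 * log(0) = 0, needed at the boundary counts k = 0 and k = n.
    return 0.0 if x == 0 else x * math.log(y)


def _log_binom_pmf(k, n, p):
    # Log of the ordinary binomial pmf, using lgamma for the binomial coefficient.
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + _xlogy(k, p) + _xlogy(n - k, 1.0 - p)


def double_binomial_pmf(n, mu, theta):
    """Normalized double binomial pmf over k = 0..n (assumes 0 < mu < 1, theta > 0)."""
    log_w = []
    for k in range(n + 1):
        fitted = _log_binom_pmf(k, n, mu)        # binomial evaluated at the model mean
        saturated = _log_binom_pmf(k, n, k / n)  # binomial evaluated at the observed proportion
        log_w.append(0.5 * math.log(theta) + theta * fitted + (1.0 - theta) * saturated)
    shift = max(log_w)                           # subtract the max for numerical stability
    weights = [math.exp(v - shift) for v in log_w]
    total = sum(weights)                         # explicit normalization over the support
    return [w / total for w in weights]


def pmf_variance(pmf):
    # Variance of a count distribution given as probabilities for k = 0..len(pmf)-1.
    mean = sum(k * p for k, p in enumerate(pmf))
    return sum((k - mean) ** 2 * p for k, p in enumerate(pmf))


if __name__ == "__main__":
    binomial = double_binomial_pmf(20, 0.3, 1.0)       # theta = 1 recovers the binomial
    overdispersed = double_binomial_pmf(20, 0.3, 0.5)  # theta < 1 spreads the mass
    print(pmf_variance(binomial), pmf_variance(overdispersed))
```

With `theta = 1` the sketch reduces to the ordinary binomial; values below 1 flatten the distribution and inflate its variance, which is the kind of overdispersion the dispersion parameter is meant to capture.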

Article information

Source
Ann. Appl. Stat., Volume 10, Number 2 (2016), 690-725.

Dates
Received: March 2015
Revised: August 2015
First available in Project Euclid: 22 July 2016

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1469199890

Digital Object Identifier
doi:10.1214/15-AOAS871

Mathematical Reviews number (MathSciNet)
MR3528357

Zentralblatt MATH identifier
06625666

Keywords
Empirical Bayes, dispersion estimation, over-dispersed binomial, alternative splicing, mRNA-Seq

Citation

Ruddy, Sean; Johnson, Marla; Purdom, Elizabeth. Shrinkage of dispersion parameters in the binomial family, with application to differential exon skipping. Ann. Appl. Stat. 10 (2016), no. 2, 690–725. doi:10.1214/15-AOAS871. https://projecteuclid.org/euclid.aoas/1469199890



References

  • Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biol. 11 R106.
  • Anders, S., Reyes, A. and Huber, W. (2012). Detecting differential usage of exons from RNA-seq data. Genome Res. 22 2008–2017.
  • Barbosa-Morais, N. L., Irimia, M., Pan, Q., Xiong, H. Y., Gueroussov, S., Lee, L. J., Slobodeniuc, V., Kutter, C., Watt, S., Colak, R., Kim, T., Misquitta-Ali, C. M., Wilson, M. D., Kim, P. M., Odom, D. T., Frey, B. J. and Blencowe, B. J. (2012). The evolutionary landscape of alternative splicing in vertebrate species. Science 338 1587–1593.
  • Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289–300.
  • Bourgon, R., Gentleman, R. and Huber, W. (2010). Independent filtering increases detection power for high-throughput experiments. Proc. Natl. Acad. Sci. USA 107 9546–9551.
  • Brooks, A. N., Yang, L., Duff, M. O., Hansen, K. D., Park, J. W., Dudoit, S., Brenner, S. E. and Graveley, B. R. (2011). Conservation of an RNA regulatory map between Drosophila and mammals. Genome Res. 21 193–202.
  • Brooks, A. N., Choi, P. S., de Waal, L., Sharifnia, T., Imielinski, M., Saksena, G., Pedamallu, C. S., Sivachenko, A., Rosenberg, M., Chmielecki, J., Lawrence, M. S., DeLuca, D. S., Getz, G. and Meyerson, M. (2014). A pan-cancer analysis of transcriptome changes associated with somatic mutations in U2AF1 reveals commonly altered splicing events. PLoS ONE 9 e87361.
  • Cancer Genome Atlas Research Network (2011). Integrated genomic analyses of ovarian carcinoma. Nature 474 609–615.
  • Denoeud, F., Aury, J.-M., Silva, C. D., Noel, B., Rogier, O., Delledonne, M., Morgante, M., Valle, G., Wincker, P., Scarpelli, C., Jaillon, O. and Artiguenave, F. (2008). Annotating genomes with massive-scale RNA sequencing. Genome Biol. 9 R175.
  • Dolzhenko, E. and Smith, A. D. (2014). Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments. BMC Bioinformatics 15 215.
  • Efron, B. (1986). Double exponential families and their use in generalized linear regression. J. Amer. Statist. Assoc. 81 709–721.
  • Feng, H., Conneely, K. N. and Wu, H. (2014). A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data. Nucleic Acids Res. 42 e69.
  • Guttman, M., Garber, M., Levin, J. Z., Donaghey, J., Robinson, J., Adiconis, X., Fan, L., Koziol, M. J., Gnirke, A., Nusbaum, C., Rinn, J. L., Lander, E. S. and Regev, A. (2010). Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28 503–510.
  • Hardcastle, T. J. and Kelly, K. A. (2010). BaySeq: Empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11 422.
  • Hardcastle, T. J. and Kelly, K. A. (2013). Empirical Bayesian analysis of paired high-throughput sequencing data with a beta-binomial distribution. BMC Bioinformatics 14 135.
  • Hu, Y., Huang, Y., Du, Y., Orellana, C. F., Singh, D., Johnson, A. R., Monroy, A., Kuan, P. F., Hammond, S. M., Makowski, L., Randell, S. H., Chiang, D. Y., Hayes, D. N., Jones, C., Liu, Y., Prins, J. F. and Liu, J. (2013). DiffSplice: The genome-wide detection of differential splicing events with RNA-seq. Nucleic Acids Res. 41 e39.
  • Jiang, H. and Wong, W. H. (2009). Statistical inferences for isoform expression in RNA-seq. Bioinformatics 25 1026–1032.
  • Jørgensen, B. (1997). The Theory of Dispersion Models. Monographs on Statistics and Applied Probability 76. Chapman & Hall, London.
  • Katz, Y., Wang, E. T., Airoldi, E. M. and Burge, C. B. (2010). Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat. Methods 7 1009–1015.
  • Law, C. W., Chen, Y., Shi, W. and Smyth, G. K. (2014). Voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15 R29.
  • Leng, N., Dawson, J. A., Thomson, J. A., Ruotti, V., Rissman, A. I., Smits, B. M. G., Haag, J. D., Gould, M. N., Stewart, R. M. and Kendziorski, C. (2013). EBSeq: An empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics 29 1035–1043.
  • Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M. and Gilad, Y. (2008). RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18 1509–1517.
  • McCarthy, D. J., Chen, Y. and Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-seq experiments with respect to biological variation. Nucleic Acids Res. 40 4288–4297.
  • National Human Genome Research Institute (2014). Alternative splicing. Available at www.genome.gov.
  • Pan, Q., Shai, O., Lee, L. J., Frey, B. J. and Blencowe, B. J. (2008). Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 40 1413–1415.
  • Pawitan, Y. (2001). In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford Univ. Press, London.
  • R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
  • Richard, H., Schulz, M. H., Sultan, M., Nürnberger, A., Schrinner, S., Balzereit, D., Dagand, E., Rasche, A., Lehrach, H., Vingron, M., Haas, S. A. and Yaspo, M.-L. (2010). Prediction of alternative isoforms from exon expression levels in RNA-seq experiments. Nucleic Acids Res. 38 e112.
  • Robinson, M. D., McCarthy, D. J. and Smyth, G. K. (2010). edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England) 26 139–140.
  • Robinson, M. D. and Smyth, G. K. (2007). Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 23 2881–2887.
  • Robinson, M. D. and Smyth, G. K. (2008). Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9 321–332.
  • Ruddy, S., Johnson, M. and Purdom, E. (2015a). Supplement A to “Shrinkage of dispersion parameters in the binomial family, with application to differential exon skipping.” DOI:10.1214/15-AOAS871SUPPA.
  • Ruddy, S., Johnson, M. and Purdom, E. (2015b). Supplement B to “Shrinkage of dispersion parameters in the binomial family, with application to differential exon skipping.” DOI:10.1214/15-AOAS871SUPPB.
  • Ruddy, S., Johnson, M. and Purdom, E. (2015c). Supplement C to “Shrinkage of dispersion parameters in the binomial family, with application to differential exon skipping.” DOI:10.1214/15-AOAS871SUPPC.
  • Salzman, J., Jiang, H. and Wong, W. H. (2010). Statistical modeling of RNA-Seq data. Technical Report No. BIO-252, Division of Biostatistics, Stanford Univ., Palo Alto.
  • Shen, S., Park, J. W., Huang, J., Dittmar, K. A., Lu, Z.-x., Zhou, Q., Carstens, R. P. and Xing, Y. (2012). MATS: A Bayesian framework for flexible detection of differential alternative splicing from RNA-seq data. Nucleic Acids Res. 40 e61.
  • Shi, Y. and Jiang, H. (2013). rSeqDiff: Detecting differential isoform expression from RNA-seq data using hierarchical likelihood ratio test. PLoS ONE 8 e79448.
  • Smyth, G. K. (2005). Limma: Linear models for microarray data. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor (R. Gentleman, V. J. Carey, W. Huber, R. A. Irizarry and S. Dudoit, eds.) 397–420. Springer, New York.
  • Sun, D., Xi, Y., Rodriguez, B., Park, H. J., Tong, P., Meong, M., Goodell, M. A. and Li, W. (2014). MOABS: Model based analysis of bisulfite sequencing data. Genome Biol. 15 R38.
  • Trapnell, C., Pachter, L. and Salzberg, S. L. (2009). TopHat: Discovering splice junctions with RNA-seq. Bioinformatics 25 1105–1111.
  • Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., Salzberg, S. L., Wold, B. J. and Pachter, L. (2010). Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28 511.
  • Venables, J. P., Klinck, R., Koh, C., Gervais-Bird, J., Bramard, A., Inkel, L., Durand, M., Couture, S., Froehlich, U., Lapointe, E., Lucier, J.-F., Thibault, P., Rancourt, C., Tremblay, K., Prinos, P., Chabot, B. and Elela, S. A. (2009). Cancer-associated regulation of alternative splicing. Nat. Struct. Mol. Biol. 16 670–676.
  • Wang, X. (2006). Approximating Bayesian inference by weighted likelihood. Canad. J. Statist. 34 279–298.
  • Williams, D. A. (1982). Extrabinomial variation in logistic linear models. J. Roy. Statist. Soc. Ser. C 31 144–148.
  • Wu, T. D. and Nacu, S. (2010). Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics (Oxford, England) 26 873–881.
  • Wu, H., Wang, C. and Wu, Z. (2013). A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics 14 232–243.
  • Wu, J., Akerman, M., Sun, S., McCombie, W. R., Krainer, A. R. and Zhang, M. Q. (2011). SpliceTrap: A method to quantify alternative splicing under single cellular conditions. Bioinformatics 27 3010–3016.
  • Yang, X., Todd, J. A., Clayton, D. and Wallace, C. (2012). Extra-binomial variation approach for analysis of pooled DNA sequencing data. Bioinformatics 28 2898–2904.
  • Yu, D., Huber, W. and Vitek, O. (2013). Shrinkage estimation of dispersion in negative binomial models for RNA-seq experiments with small sample size. Bioinformatics 29 1275–1282.
  • Zhou, Y. H., Xia, K. and Wright, F. A. (2011). A powerful and flexible approach to the analysis of RNA sequence count data. Bioinformatics (Oxford, England) 27 2672–2678.
