The Annals of Applied Statistics

hmmSeq: A hidden Markov model for detecting differentially expressed genes from RNA-seq data

Shiqi Cui, Subharup Guha, Marco A. R. Ferreira, and Allison N. Tegge

Full-text: Open access

Abstract

We introduce hmmSeq, a model-based hierarchical Bayesian technique for detecting differentially expressed genes from RNA-seq data. Our novel hmmSeq methodology uses hidden Markov models to account for potential co-expression of neighboring genes. In addition, hmmSeq employs an integrated approach to studies with technical or biological replicates, automatically adjusting for any extra-Poisson variability. Moreover, for cases when paired data are available, hmmSeq includes a paired structure between treatments that incorporates subject-specific effects. To perform parameter estimation for the hmmSeq model, we develop an efficient Markov chain Monte Carlo algorithm. Further, we develop a procedure for detection of differentially expressed genes that automatically controls the false discovery rate. A simulation study shows that the hmmSeq methodology performs better than competitors in terms of receiver operating characteristic curves. Finally, the analyses of three publicly available RNA-seq data sets demonstrate the power and flexibility of the hmmSeq methodology. An R package implementing the hmmSeq framework will be submitted to CRAN upon publication of the manuscript.
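
The abstract does not spell out how the detection step controls the false discovery rate. As a rough illustration only, the sketch below shows one generic rule that works from per-gene posterior probabilities of differential expression, such as an MCMC fit might produce. It is not the authors' procedure: the function name `bayesian_fdr_select`, its inputs, and the thresholding rule are all assumptions made for this example.

```python
def bayesian_fdr_select(post_prob_de, target_fdr=0.05):
    """Flag genes as differentially expressed (DE) under a posterior FDR bound.

    Hypothetical helper, not part of hmmSeq.
    post_prob_de[i] is an assumed posterior probability that gene i is DE.
    Genes are added in decreasing order of that probability for as long as
    the average posterior probability of a false call stays <= target_fdr.
    """
    # Rank genes from most to least likely to be DE.
    order = sorted(range(len(post_prob_de)),
                   key=lambda i: post_prob_de[i], reverse=True)
    selected = []
    running_error = 0.0  # sum of (1 - posterior probability) over selected genes
    for i in order:
        err = 1.0 - post_prob_de[i]
        # The running average can only grow from here on, so stop at the
        # first gene that would push it above the target.
        if (running_error + err) / (len(selected) + 1) > target_fdr:
            break
        running_error += err
        selected.append(i)
    return selected


# Toy posterior probabilities for five genes (made-up numbers).
probs = [0.99, 0.2, 0.97, 0.6, 0.95]
print(bayesian_fdr_select(probs, target_fdr=0.05))
```

With these toy numbers the three genes with probabilities 0.99, 0.97 and 0.95 are selected, since their average error is 0.03, while adding the gene at 0.6 would raise the average above 0.05.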

Article information

Source
Ann. Appl. Stat., Volume 9, Number 2 (2015), 901-925.

Dates
Received: February 2014
Revised: January 2015
First available in Project Euclid: 20 July 2015

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1437397117

Digital Object Identifier
doi:10.1214/15-AOAS815

Mathematical Reviews number (MathSciNet)
MR3371341

Zentralblatt MATH identifier
06499936

Keywords
Bayesian hierarchical model; first order dependence; next-generation sequencing; overdispersion; serial correlation

Citation

Cui, Shiqi; Guha, Subharup; Ferreira, Marco A. R.; Tegge, Allison N. hmmSeq: A hidden Markov model for detecting differentially expressed genes from RNA-seq data. Ann. Appl. Stat. 9 (2015), no. 2, 901–925. doi:10.1214/15-AOAS815. https://projecteuclid.org/euclid.aoas/1437397117


