The Annals of Applied Statistics

Nonparametric Bayesian learning of heterogeneous dynamic transcription factor networks

Xiangyu Luo and Yingying Wei

Full-text: Open access

Abstract

Gene expression is largely controlled by transcription factors (TFs) in a collaborative manner. Therefore, an understanding of TF collaboration is crucial for the elucidation of gene regulation. The co-activation of TFs can be represented by networks. These networks are dynamic in diverse biological conditions and heterogeneous across the genome within each biological condition. Existing methods for construction of TF networks lack solid statistical models, analyze each biological condition separately, and enforce a single network for all genomic locations within one biological condition, resulting in low statistical power and misleading spurious associations. In this paper, we present a novel Bayesian nonparametric dynamic Poisson graphical model for inference on TF networks. Our approach automatically teases out genome heterogeneity and borrows information across conditions to improve signal detection from very few replicates, thus offering a valid and efficient measure of TF co-activations. We develop an efficient parallel Markov chain Monte Carlo algorithm for posterior computation. The proposed approach is applied to study TF associations in ENCODE cell lines and provides novel findings.

Article information

Source
Ann. Appl. Stat., Volume 12, Number 3 (2018), 1749-1772.

Dates
Received: March 2017
Revised: November 2017
First available in Project Euclid: 11 September 2018

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1536652973

Digital Object Identifier
doi:10.1214/17-AOAS1129

Mathematical Reviews number (MathSciNet)
MR3852696

Keywords
Poisson graphical model; nonparametric Bayes; parallel Markov chain Monte Carlo; next generation sequencing

Citation

Luo, Xiangyu; Wei, Yingying. Nonparametric Bayesian learning of heterogeneous dynamic transcription factor networks. Ann. Appl. Stat. 12 (2018), no. 3, 1749–1772. doi:10.1214/17-AOAS1129. https://projecteuclid.org/euclid.aoas/1536652973


Supplemental materials

  • Supplementary Materials to “Nonparametric Bayesian learning of heterogeneous dynamic transcription factor networks”. The zip file provides the supplementary details referenced in the main text, the C code that implements HDPGM, and the datasets used in the simulation study and the real application.