The Annals of Applied Statistics

Bayesian joint modeling of multiple gene networks and diverse genomic data to identify target genes of a transcription factor

Peng Wei and Wei Pan

Abstract

We consider integrative modeling of multiple gene networks and diverse genomic data, including protein-DNA binding, gene expression and DNA sequence data, to accurately identify the regulatory target genes of a transcription factor (TF). Rather than treating all genes equally and independently a priori, as existing joint modeling approaches do, we incorporate the biological prior knowledge that neighboring genes on a gene network tend to be (or not to be) regulated together by a TF. A key contribution of our work is that, to maximize the use of existing biological knowledge, we allow multiple gene networks to be incorporated into the joint modeling of genomic data by introducing a mixture model based on multiple Markov random fields (MRFs). Another important contribution is that we allow different genomic data to be correlated, which lets us examine the validity and effect of the independence assumption adopted in existing methods. Because the approach is fully Bayesian, inference about model parameters can be carried out based on MCMC samples. Application to an E. coli data set, together with simulation studies, demonstrates the utility of the proposed joint model and its gains in statistical efficiency.
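
The network prior described in the abstract can be illustrated with a minimal sketch. It assumes a generic auto-logistic MRF, in which the prior probability that a gene is a target rises with the share of its network neighbors that are targets, and a simple weighted combination of these conditionals across networks. The function names, the parameters `gamma` and `beta`, the network weights and the toy networks are all hypothetical; the paper's actual model specification, likelihood terms and WinBUGS implementation are not reproduced here.

```python
import math
import random


def conditional_prob(i, labels, neighbors, gamma, beta):
    """P(z_i = 1 | neighbor labels) under a generic auto-logistic MRF.

    labels    : list of 0/1 target indicators, one per gene
    neighbors : dict mapping a gene index to a list of neighbor indices
    gamma     : baseline log-odds of being a target (assumed parameter)
    beta      : strength of the neighborhood effect (assumed parameter)
    """
    nbrs = neighbors.get(i, [])
    if not nbrs:
        # An isolated gene falls back to the baseline prior.
        return 1.0 / (1.0 + math.exp(-gamma))
    n1 = sum(labels[j] for j in nbrs)   # neighbors currently labeled as targets
    n0 = len(nbrs) - n1                 # neighbors currently labeled as non-targets
    eta = gamma + beta * (n1 - n0) / len(nbrs)
    return 1.0 / (1.0 + math.exp(-eta))


def mixture_prior_prob(i, labels, networks, weights, gamma, beta):
    """Weighted combination of the per-network conditionals (a simplification)."""
    return sum(
        w * conditional_prob(i, labels, net, gamma, beta)
        for net, w in zip(networks, weights)
    )


def gibbs_sweep(labels, networks, weights, gamma, beta, rng):
    """One sweep that resamples each label from the prior alone.

    A full sampler would also multiply in the data likelihood for each gene;
    that part is omitted because it depends on the paper's data models.
    """
    for i in range(len(labels)):
        p = mixture_prior_prob(i, labels, networks, weights, gamma, beta)
        labels[i] = 1 if rng.random() < p else 0
    return labels


# Toy example: four genes and two hypothetical networks.
net_a = {0: [1, 2], 1: [0], 2: [0]}
net_b = {0: [3], 3: [0]}
labels = [0, 1, 1, 0]
p_gene0 = mixture_prior_prob(0, labels, [net_a, net_b], [0.5, 0.5], gamma=0.0, beta=2.0)
updated = gibbs_sweep(list(labels), [net_a, net_b], [0.5, 0.5], 0.0, 2.0, random.Random(1))
```

In this toy setting the two networks disagree about gene 0: both of its neighbors in the first network are targets, while its only neighbor in the second is not. The combined prior therefore sits between the two per-network values, which is the kind of pooling across networks that a mixture of MRFs is meant to provide.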

Article information

Source
Ann. Appl. Stat., Volume 6, Number 1 (2012), 334-355.

Dates
First available in Project Euclid: 6 March 2012

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1331043399

Digital Object Identifier
doi:10.1214/11-AOAS502

Mathematical Reviews number (MathSciNet)
MR2951540

Zentralblatt MATH identifier
1235.62031

Keywords
Bayesian hierarchical model; Markov random field; gene networks; joint modeling; mixture models; systems biology

Citation

Wei, Peng; Pan, Wei. Bayesian joint modeling of multiple gene networks and diverse genomic data to identify target genes of a transcription factor. Ann. Appl. Stat. 6 (2012), no. 1, 334–355. doi:10.1214/11-AOAS502. https://projecteuclid.org/euclid.aoas/1331043399


Supplemental materials

  • Supplementary material: Supplemental tables and figures. The supplemental article contains the WinBUGS code, sensitivity analysis results and MCMC convergence diagnostic plots.