The Annals of Applied Statistics

Bayesian variable selection and data integration for biological regulatory networks

Shane T. Jensen, Guang Chen, and Christian J. Stoeckert, Jr.



A substantial focus of research in molecular biology is gene regulatory networks: the sets of transcription factors and target genes that control the involvement of different biological processes in living cells. Previous statistical approaches for identifying gene regulatory networks have used gene expression data, ChIP binding data, or promoter sequence data, but each of these resources provides only partial information. We present a Bayesian hierarchical model that integrates all three data types in a principled variable selection framework. The gene expression data are modeled as a function of the unknown gene regulatory network, which has an informed prior distribution based upon both ChIP binding and promoter sequence data. We also present a variable weighting methodology for the principled balancing of multiple sources of prior information. We apply our procedure to the discovery of gene regulatory relationships in Saccharomyces cerevisiae (yeast), for which we can use several external sources of information to validate our results. Our inferred relationships show greater biological relevance on the external validation measures than those of previous data integration methods. Our model also estimates synergistic and antagonistic interactions between transcription factors, many of which are validated by previous studies. We also evaluate the results of our procedure for weighting multiple sources of prior information. Finally, we discuss our methodology in the context of previous approaches to data integration and Bayesian variable selection.

Article information

Ann. Appl. Stat., Volume 1, Number 2 (2007), 612-633.

First available in Project Euclid: 30 November 2007


Keywords: Regulatory networks; Bayesian variable selection; data integration; transcription factors


Jensen, Shane T.; Chen, Guang; Stoeckert, Jr., Christian J. Bayesian variable selection and data integration for biological regulatory networks. Ann. Appl. Stat. 1 (2007), no. 2, 612–633. doi:10.1214/07-AOAS130.



