The Annals of Applied Statistics

Transcription factor binding site prediction with multivariate gene expression data

Nancy R. Zhang, Mary C. Wildermuth, and Terence P. Speed



Multi-sample microarray experiments have become a standard experimental method for studying biological systems. A frequent goal in such studies is to unravel the regulatory relationships between genes. During the last few years, regression models have been proposed for the de novo discovery of cis-acting regulatory sequences using gene expression data. However, when applied to multi-sample experiments, existing regression-based methods model each individual sample separately. To better capture the dynamic relationships in multi-sample microarray experiments, we propose a flexible method for the joint modeling of promoter sequence and multivariate expression data.
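To make the regression setting concrete, here is a minimal sketch, not the authors' method. It assumes promoters are plain DNA strings, candidate motifs are fixed words, and expression is a genes-by-samples matrix; the helper names `count_motif`, `fit_joint`, and `motif_scores` are hypothetical. Ordinary least squares via NumPy stands in for whatever model the paper actually fits. With plain least squares the coefficients coincide with per-sample fits, so the only "joint" aspect shown here is that each motif is scored by its whole coefficient profile across samples.

```python
import numpy as np


def count_motif(promoter: str, word: str) -> int:
    """Count occurrences of `word` in `promoter`, allowing overlaps."""
    last_start = len(promoter) - len(word) + 1
    return sum(promoter.startswith(word, i) for i in range(last_start))


def fit_joint(promoters, words, Y):
    """Least-squares fit of Y (genes x samples) on motif counts.

    Returns B with shape (1 + len(words), n_samples); row 0 is the intercept
    and row k (k >= 1) is the coefficient profile of words[k - 1].
    """
    counts = np.array(
        [[count_motif(p, w) for w in words] for p in promoters], dtype=float
    )
    X = np.column_stack([np.ones(len(promoters)), counts])
    B, *_ = np.linalg.lstsq(X, np.asarray(Y, dtype=float), rcond=None)
    return B


def motif_scores(B):
    """Score each motif by the Euclidean norm of its profile across samples."""
    return np.linalg.norm(B[1:], axis=1)
```

Calling `fit_joint(promoters, words, Y)` followed by `motif_scores(B)` gives one score per candidate word; ranking words by that score is one simple, assumed way to compare motifs using all samples at once.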

In higher-order eukaryotic genomes, expression regulation usually involves combinatorial interaction between several transcription factors. Experiments have shown that spacing between transcription factor binding sites can significantly affect their strength in activating gene expression. We propose an adaptive model building procedure to capture such spacing-dependent cis-acting regulatory modules.
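One simple way to encode spacing dependence as a regression covariate is an indicator that two motifs occur close together in the same promoter. The sketch below is an illustration under assumptions, not the adaptive procedure proposed in the paper: the helper names `positions` and `pair_within`, the start-to-start distance definition, and the `max_gap` threshold are all hypothetical.

```python
def positions(promoter: str, word: str) -> list:
    """Return all start indices at which `word` occurs in `promoter`."""
    last_start = len(promoter) - len(word) + 1
    return [i for i in range(last_start) if promoter.startswith(word, i)]


def pair_within(promoter: str, word_a: str, word_b: str, max_gap: int) -> int:
    """Return 1 if some occurrence of word_a and some occurrence of word_b
    start within `max_gap` bases of each other, else 0."""
    starts_a = positions(promoter, word_a)
    starts_b = positions(promoter, word_b)
    return int(any(abs(i - j) <= max_gap for i in starts_a for j in starts_b))
```

A column of such indicators, one value per gene, could be added alongside single-motif counts in a regression on expression; varying `max_gap` then probes how sensitive the fit is to the allowed spacing.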

We apply our methods to the analysis of microarray time-course experiments in yeast and in Arabidopsis. These experiments exhibit very different dynamic temporal relationships. For both data sets, we found all of the well-known cis-acting regulatory elements in the related context and also predicted novel elements.

Article information

Ann. Appl. Stat., Volume 2, Number 1 (2008), 332-365.

First available in Project Euclid: 24 March 2008


Multivariate analysis; linear models; transcription regulation; DNA motifs; gene expression


Zhang, Nancy R.; Wildermuth, Mary C.; Speed, Terence P. Transcription factor binding site prediction with multivariate gene expression data. Ann. Appl. Stat. 2 (2008), no. 1, 332–365. doi:10.1214/07-AOAS142.



