Statistical Science

Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective

Shane T. Jensen, X. Shirley Liu, Qing Zhou, and Jun S. Liu

Abstract

The Bayesian approach, together with Markov chain Monte Carlo techniques, has provided an attractive solution to many important bioinformatics problems, such as multiple sequence alignment, microarray analysis and the discovery of gene regulatory binding motifs. The use of such methods and, more broadly, of explicit statistical modeling has revolutionized the field of computational biology. After reviewing several heuristics-based computational methods, this article presents a systematic account of Bayesian formulations of, and solutions to, the motif discovery problem. Generalizations are made to further enhance the Bayesian approach. Motivated by the need for a fast algorithm, we also provide a perspective on the problem from the viewpoint of optimizing a scoring function. We observe that scoring functions derived from proper posterior distributions, or from approximations to such distributions, perform best and can be used to improve upon existing motif-finding programs. Simulation analyses and a real-data example support this observation.

Article information

Source
Statist. Sci. Volume 19, Number 1 (2004), 188-204.

Dates
First available in Project Euclid: 14 July 2004

Permanent link to this document
https://projecteuclid.org/euclid.ss/1089808282

Digital Object Identifier
doi:10.1214/088342304000000107

Mathematical Reviews number (MathSciNet)
MR2082154

Zentralblatt MATH identifier
1057.62101

Keywords
Gene regulation; motif discovery; Bayesian models; scoring functions; optimization; Markov chain Monte Carlo

Citation

Jensen, Shane T.; Liu, X. Shirley; Zhou, Qing; Liu, Jun S. Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective. Statist. Sci. 19 (2004), no. 1, 188-204. doi:10.1214/088342304000000107. https://projecteuclid.org/euclid.ss/1089808282

