Statistical Science

The EM Algorithm and the Rise of Computational Biology

Xiaodan Fan, Yuan Yuan, and Jun S. Liu

Abstract

In the past decade, computational biology has grown from a cottage industry with a handful of researchers into an attractive interdisciplinary field, catching the attention and imagination of many quantitatively minded scientists. Of interest to us is the key role played by the EM algorithm during this transformation. We survey the use of the EM algorithm in a few important computational biology problems surrounding the “central dogma” of molecular biology: from DNA to RNA and then to proteins. Topics of this article include sequence motif discovery, protein sequence alignment, population genetics, evolutionary models, and mRNA expression microarray data analysis.

Article information

Source
Statist. Sci., Volume 25, Number 4 (2010), 476-491.

Dates
First available in Project Euclid: 14 March 2011

Permanent link to this document
https://projecteuclid.org/euclid.ss/1300108232

Digital Object Identifier
doi:10.1214/09-STS312

Mathematical Reviews number (MathSciNet)
MR2807765

Zentralblatt MATH identifier
1329.92013

Keywords
EM algorithm; computational biology; literature review

Citation

Fan, Xiaodan; Yuan, Yuan; Liu, Jun S. The EM Algorithm and the Rise of Computational Biology. Statist. Sci. 25 (2010), no. 4, 476-491. doi:10.1214/09-STS312. https://projecteuclid.org/euclid.ss/1300108232

