The Annals of Applied Statistics

A Dirichlet process mixture of hidden Markov models for protein structure prediction

Kristin P. Lennox, David B. Dahl, Marina Vannucci, Ryan Day, and Jerry W. Tsai


By providing new insights into the distribution of a protein’s torsion angles, recent statistical models for these data have pointed the way to more efficient methods for protein structure prediction. Most current approaches have concentrated on bivariate models at a single sequence position. There is, however, considerable value in simultaneously modeling angle pairs at multiple sequence positions in a protein. One area of application for such models is structure prediction for the highly variable loop and turn regions. Such modeling is difficult because the number of known protein structures available to estimate these torsion angle distributions is typically small. Furthermore, the data are “sparse” in that not all proteins have angle pairs at each sequence position. We propose a new semiparametric model for the joint distributions of angle pairs at multiple sequence positions. Our model accommodates sparse data by leveraging known information about the behavior of protein secondary structure. We demonstrate our technique by predicting the torsion angles in a loop from the globin fold family. Our results show that a template-based approach can now be successfully extended to modeling the notoriously difficult loop and turn regions.
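To make the notion of sparse angle data concrete, the following minimal sketch shows one way to store torsion angles for an aligned loop when some proteins have no residue at a given position, and why such angles need circular rather than arithmetic summaries. It is an illustration only and not the model proposed in the article; the function name `circular_mean`, the use of `None` for a missing angle, and the example values are all assumptions made for this sketch.

```python
import math

def circular_mean(angles):
    """Circular mean (in radians) of the angles that are present.

    Missing angles are marked with None (an assumption of this sketch).
    Returns None when no angle is observed at the position.
    """
    present = [a for a in angles if a is not None]
    if not present:
        return None
    # Average the unit vectors, then take the direction of the sum.
    s = sum(math.sin(a) for a in present)
    c = sum(math.cos(a) for a in present)
    return math.atan2(s, c)

# Hypothetical phi angles: rows are proteins, columns are aligned positions.
# None marks a position where a protein has no residue (a "sparse" entry).
phi = [
    [math.radians(-60.0), math.radians(170.0), None],
    [math.radians(-70.0), math.radians(-170.0), None],
    [None, math.radians(175.0), None],
]

# zip(*phi) turns rows (proteins) into columns (sequence positions).
means = [circular_mean(column) for column in zip(*phi)]

for position, mean in enumerate(means):
    if mean is None:
        print(f"position {position}: no observed angles")
    else:
        print(f"position {position}: {math.degrees(mean):.1f} degrees")
```

At the second position the observed angles lie on either side of the ±180 degree boundary, so an arithmetic average would land far from all three values, while the circular mean stays near 180 degrees. This wrap-around behavior is the reason angular distributions such as the von Mises family are used for torsion angles.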

Article information

Ann. Appl. Stat., Volume 4, Number 2 (2010), 916-942.

First available in Project Euclid: 3 August 2010

Keywords: Bayesian nonparametrics; density estimation; dihedral angles; protein structure prediction; torsion angles; von Mises distribution


Lennox, Kristin P.; Dahl, David B.; Vannucci, Marina; Day, Ryan; Tsai, Jerry W. A Dirichlet process mixture of hidden Markov models for protein structure prediction. Ann. Appl. Stat. 4 (2010), no. 2, 916–942. doi:10.1214/09-AOAS296.

