Bayesian Analysis

Bayesian Analysis of Continuous Time Markov Chains with Application to Phylogenetic Modelling

Tingting Zhao, Ziyu Wang, Alexander Cumberworth, Joerg Gsponer, Nando de Freitas, and Alexandre Bouchard-Côté

Full-text: Open access

Abstract

Bayesian analysis of continuous time, discrete state space time series is an important and challenging problem, where incomplete observation and large parameter sets call for user-defined priors based on known properties of the process. Generalized linear models have a largely unexplored potential to construct such prior distributions. We show that an important challenge with Bayesian generalized linear modelling of continuous time Markov chains is that classical Markov chain Monte Carlo techniques are too ineffective to be practical in that setup. We address this issue using an auxiliary variable construction combined with an adaptive Hamiltonian Monte Carlo algorithm. This sampling algorithm and model make it efficient both in terms of computation and analyst’s time to construct stochastic processes informed by prior knowledge, such as known properties of the states of the process. We demonstrate the flexibility and scalability of our framework using synthetic and real phylogenetic protein data, where a prior based on amino acid physicochemical properties is constructed to obtain accurate rate matrix estimates.
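As a rough illustration of the idea in the abstract, the sketch below builds a small CTMC rate matrix whose off-diagonal rates are log-linear in per-transition features, then computes transition probabilities by uniformization. The abstract does not state the exact parameterization, so the feature layout, the weights, and the function names `glm_rate_matrix` and `transition_probabilities` are all hypothetical. Uniformization is used here only to evaluate the matrix exponential with NumPy alone; it is not meant to reproduce the paper's auxiliary variable sampler.

```python
import numpy as np


def glm_rate_matrix(features, weights):
    """Build a CTMC rate matrix with log-linear off-diagonal rates.

    features: array of shape (n, n, p); features[i, j] describes the jump i -> j
              (hypothetical layout, assumed for illustration only).
    weights:  array of shape (p,), the GLM coefficients.
    """
    n = features.shape[0]
    rates = np.exp(features @ weights)           # q_ij = exp(<w, phi(i, j)>) > 0
    rates[np.diag_indices(n)] = 0.0              # no self-jumps
    np.fill_diagonal(rates, -rates.sum(axis=1))  # each row sums to zero
    return rates


def transition_probabilities(rate_matrix, t, tol=1e-12, max_terms=10_000):
    """Approximate exp(t * Q) by uniformization (truncated Poisson mixture).

    Suitable for moderate values of lam * t; exp(-lam * t) underflows for very large ones.
    """
    n = rate_matrix.shape[0]
    lam = max(-rate_matrix.diagonal().min(), 1e-12)  # at least the largest exit rate
    jump = np.eye(n) + rate_matrix / lam             # stochastic matrix of the uniformized chain
    poisson_weight = np.exp(-lam * t)                # P(N = 0) for N ~ Poisson(lam * t)
    power = np.eye(n)
    result = poisson_weight * power
    total = poisson_weight
    k = 0
    while 1.0 - total > tol and k < max_terms:
        k += 1
        power = power @ jump
        poisson_weight *= lam * t / k                # P(N = k) from P(N = k - 1)
        result += poisson_weight * power
        total += poisson_weight
    return result


# Toy example: 3 states, an intercept feature plus a "shared property" indicator.
same_group = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
features = np.stack([np.ones((3, 3)), same_group], axis=-1)
weights = np.array([-1.0, 1.5])                      # hypothetical coefficients

Q = glm_rate_matrix(features, weights)
P = transition_probabilities(Q, t=0.7)
```

In this toy setup, a prior on `weights` (for example an independent normal prior, again an assumption) would induce a prior on the whole rate matrix through a small number of coefficients, which is the appeal of feature-based constructions when the state space is large.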

Article information

Source
Bayesian Anal., Volume 11, Number 4 (2016), 1203-1237.

Dates
First available in Project Euclid: 30 November 2015

Permanent link to this document
https://projecteuclid.org/euclid.ba/1448899900

Digital Object Identifier
doi:10.1214/15-BA982

Mathematical Reviews number (MathSciNet)
MR3577377

Zentralblatt MATH identifier
1357.60112

Subjects
Primary: 60K35: Interacting random processes; statistical mechanics type models; percolation theory [See also 82B43, 82C43]
Secondary: 60K35: Interacting random processes; statistical mechanics type models; percolation theory [See also 82B43, 82C43]

Keywords
CTMCs; Bayesian GLMs; rate matrix; MCMC; AHMC; uniformization; phylogenetic tree

Citation

Zhao, Tingting; Wang, Ziyu; Cumberworth, Alexander; Gsponer, Joerg; de Freitas, Nando; Bouchard-Côté, Alexandre. Bayesian Analysis of Continuous Time Markov Chains with Application to Phylogenetic Modelling. Bayesian Anal. 11 (2016), no. 4, 1203–1237. doi:10.1214/15-BA982. https://projecteuclid.org/euclid.ba/1448899900



