The Annals of Applied Statistics

Simulation from endpoint-conditioned, continuous-time Markov chains on a finite state space, with applications to molecular evolution

Asger Hobolth and Eric A. Stone

Source: Ann. Appl. Stat. Volume 3, Number 3 (2009), 1204-1231.

Abstract

Analyses of serially-sampled data often begin with the assumption that the observations represent discrete samples from a latent continuous-time stochastic process. The continuous-time Markov chain (CTMC) is one such generative model whose popularity extends to a variety of disciplines ranging from computational finance to human genetics and genomics. A common theme among these diverse applications is the need to simulate sample paths of a CTMC conditional on realized data that is discretely observed. Here we present a general solution to this sampling problem when the CTMC is defined on a discrete and finite state space. Specifically, we consider the generation of sample paths, including intermediate states and times of transition, from a CTMC whose beginning and ending states are known across a time interval of length T. We first unify the literature through a discussion of the three predominant approaches: (1) modified rejection sampling, (2) direct sampling, and (3) uniformization. We then give analytical results for the complexity and efficiency of each method in terms of the instantaneous transition rate matrix Q of the CTMC, its beginning and ending states, and the length of sampling time T. In doing so, we show that no method dominates the others across all model specifications, and we give explicit proof of which method prevails for any given Q, T, and endpoints. Finally, we introduce and compare three applications of CTMCs to demonstrate the pitfalls of choosing an inefficient sampler.
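To make the sampling problem concrete, the sketch below draws an endpoint-conditioned path by plain rejection sampling: it simulates the chain forward from the beginning state and keeps the path only if it sits in the required ending state at time T. This is the naive baseline, not the modified rejection sampler, direct sampler, or uniformization scheme analyzed in the paper, and it is not the authors' code. The function names, the list-of-lists representation of Q, the max_tries cap, and the example matrix EXAMPLE_Q are all assumptions made for illustration.

```python
import random

# Hypothetical 3-state rate matrix (rows sum to zero); not taken from the paper.
EXAMPLE_Q = [
    [-1.0, 0.5, 0.5],
    [0.5, -1.0, 0.5],
    [0.5, 0.5, -1.0],
]


def simulate_path(Q, start, T, rng):
    """Forward-simulate a CTMC with rate matrix Q from `start` over [0, T].

    Returns a list of (time, state) pairs beginning with (0.0, start);
    each later pair records a transition time and the state entered.
    """
    path = [(0.0, start)]
    t, state = 0.0, start
    while True:
        rate = -Q[state][state]        # total rate of leaving `state`
        if rate <= 0.0:                # absorbing state: no further jumps
            return path
        t += rng.expovariate(rate)     # exponential waiting time
        if t >= T:                     # next jump falls outside the interval
            return path
        # Pick the next state j != state with probability Q[state][j] / rate.
        targets = [j for j in range(len(Q)) if j != state]
        weights = [Q[state][j] for j in targets]
        state = rng.choices(targets, weights=weights)[0]
        path.append((t, state))


def sample_conditioned_path(Q, start, end, T, rng, max_tries=100_000):
    """Naive rejection sampler: retry until the path ends in `end` at time T."""
    for _ in range(max_tries):
        path = simulate_path(Q, start, T, rng)
        if path[-1][1] == end:
            return path
    raise RuntimeError("no path accepted; the acceptance probability may be very small")
```

A call such as sample_conditioned_path(EXAMPLE_Q, 0, 2, 1.0, random.Random(0)) returns one accepted path. The expected number of forward simulations per accepted path is the reciprocal of the transition probability from the beginning state to the ending state over time T, so this baseline becomes slow when that probability is small, which is the kind of inefficiency the abstract warns about.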

Keywords: Continuous-time Markov chains; simulation; molecular evolution

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aoas/1254773285
Digital Object Identifier: doi:10.1214/09-AOAS247


2009 © Institute of Mathematical Statistics