Electronic Journal of Statistics

A nonparametric HMM for genetic imputation and coalescent inference

Lloyd T. Elliott and Yee Whye Teh

Full-text: Open access


Genetic sequence data are well described by hidden Markov models (HMMs) in which latent states correspond to clusters of similar mutation patterns. Theory from statistical genetics suggests that these HMMs are nonhomogeneous (their transition probabilities vary along the chromosome) and have large support for self-transitions. We develop a new nonparametric model of genetic sequence data, based on the hierarchical Dirichlet process, which supports these self-transitions and nonhomogeneity. Our model provides a parameterization of the genetic process that is more parsimonious than the more general nonparametric models that have previously been applied to population genetics. We provide truncation-free MCMC inference for our model using a new auxiliary sampling scheme for Bayesian nonparametric HMMs. In a series of experiments on male X chromosome data from the Thousand Genomes Project, and on data simulated from a population bottleneck, we show the benefits of our model over the popular finite model fastPHASE, which can itself be seen as a parametric truncation of our model. We find that the number of HMM states found by our model is correlated with the time to the most recent common ancestor (TMRCA) in population bottlenecks. This work demonstrates the flexibility of Bayesian nonparametrics applied to large and complex genetic data.
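As a rough illustration of the kind of model the abstract describes, the sketch below runs forward-backward on a small finite HMM whose self-transition mass varies from site to site (nonhomogeneity) and uses the resulting posteriors to impute a missing allele. It is not the authors' nonparametric model or their MCMC scheme: the number of states is fixed in advance, the "stay or jump" transition form is an assumption chosen for simplicity, and every name (`forward_backward`, `impute`, `stay_probs`, `jump_weights`, `emit`) and every number is hypothetical.

```python
def _normalise(values):
    """Rescale a list of positive numbers so that it sums to one."""
    total = sum(values)
    return [v / total for v in values]


def forward_backward(obs, stay_probs, jump_weights, emit):
    """Posterior state marginals for a toy nonhomogeneous HMM.

    obs[t]        -- observed allele at site t: 0, 1, or None if missing
    stay_probs[t] -- self-transition mass between sites t and t + 1
    jump_weights  -- unnormalised weights of the state chosen after a jump
    emit[t][k]    -- P(allele 1 at site t | latent state k)
    """
    n_sites, n_states = len(obs), len(jump_weights)
    pi = _normalise(jump_weights)

    def lik(t, k):
        # A missing observation carries no information about the state.
        if obs[t] is None:
            return 1.0
        return emit[t][k] if obs[t] == 1 else 1.0 - emit[t][k]

    # Forward pass. With transition A[j][k] = s * [j == k] + (1 - s) * pi[k]
    # and a normalised previous message, sum_j prev[j] * A[j][k] simplifies
    # to s * prev[k] + (1 - s) * pi[k].
    alpha = [_normalise([pi[k] * lik(0, k) for k in range(n_states)])]
    for t in range(1, n_sites):
        s, prev = stay_probs[t - 1], alpha[-1]
        alpha.append(_normalise(
            [(s * prev[k] + (1.0 - s) * pi[k]) * lik(t, k)
             for k in range(n_states)]))

    # Backward pass, using the same stay-or-jump structure.
    beta = [[1.0] * n_states for _ in range(n_sites)]
    for t in range(n_sites - 2, -1, -1):
        s = stay_probs[t]
        nxt = [beta[t + 1][k] * lik(t + 1, k) for k in range(n_states)]
        jump = sum(pi[k] * nxt[k] for k in range(n_states))
        beta[t] = _normalise(
            [s * nxt[j] + (1.0 - s) * jump for j in range(n_states)])

    return [_normalise([alpha[t][k] * beta[t][k] for k in range(n_states)])
            for t in range(n_sites)]


def impute(obs, stay_probs, jump_weights, emit):
    """P(allele 1) per site; observed sites are returned unchanged."""
    post = forward_backward(obs, stay_probs, jump_weights, emit)
    return [float(o) if o is not None
            else sum(p * e for p, e in zip(post[t], emit[t]))
            for t, o in enumerate(obs)]


# Hypothetical toy data: two haplotype clusters, five sites, one missing allele.
obs = [1, 1, None, 1, 1]
stay_probs = [0.95, 0.90, 0.60, 0.90]   # varies along the sequence
jump_weights = [1.0, 1.0]
emit = [[0.05, 0.95] for _ in obs]      # cluster 0 mostly 0, cluster 1 mostly 1

print(impute(obs, stay_probs, jump_weights, emit))
```

Because the stay probabilities are indexed by site, a low value such as the hypothetical 0.60 above marks a position where switching clusters is more likely, which is one simple way to encode transition probabilities that vary along the chromosome.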

Article information

Electron. J. Statist., Volume 10, Number 2 (2016), 3425-3451.

Received: January 2016
First available in Project Euclid: 16 November 2016

Primary: 62F15: Bayesian inference
Secondary: 92D10: Genetics {For genetic algebras, see 17D92}

Keywords: Bayesian nonparametrics; statistical genetics; HMMs; genetic imputation; TMRCA inference; haplotype inference; population genetics


Elliott, Lloyd T.; Teh, Yee Whye. A nonparametric HMM for genetic imputation and coalescent inference. Electron. J. Statist. 10 (2016), no. 2, 3425-3451. doi:10.1214/16-EJS1197. https://projecteuclid.org/euclid.ejs/1479287227


