Annales de l'Institut Henri Poincaré, Probabilités et Statistiques

Bayesian nonparametric analysis of Kingman’s coalescent

Stefano Favaro, Shui Feng, and Paul A. Jenkins



Kingman’s coalescent is one of the most popular models in population genetics. It describes the genealogy of a population whose genetic composition evolves in time according to the Wright–Fisher model, or suitable approximations of it belonging to the broad class of Fleming–Viot processes. Ancestral inference under Kingman’s coalescent has received much attention in the literature, both in practical data analysis and from a theoretical and methodological point of view. Given a sample of individuals taken from the population at time $t>0$, most contributions have aimed at making frequentist or Bayesian parametric inference on quantities related to the genealogy of the sample. In this paper we propose a Bayesian nonparametric predictive approach to ancestral inference. That is, under the prior assumption that the composition of the population evolves in time according to a neutral Fleming–Viot process, and given the information contained in an initial sample of $m$ individuals taken from the population at time $t>0$, we estimate quantities related to the genealogy of an additional unobservable sample of size $m^{\prime}\geq1$. As a by-product of our analysis we introduce a class of Bayesian nonparametric estimators (predictors) which can be thought of as Good–Turing type estimators for ancestral inference. The proposed approach is illustrated through an application to genetic data.
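To make "quantities related to the genealogy" concrete, the following is a minimal simulation sketch of one such quantity: the number of ancestral lineages of a sample of size $m$ at time $t$ in the past. It is not the estimator proposed in the paper. It assumes the standard line-of-descent death process, in which $k$ lineages are reduced to $k-1$ at rate $k(k-1+\theta)/2$, where $\theta$ is a mutation parameter and $\theta=0$ gives the pure Kingman coalescent. The function names, the parameter `theta`, and the time scaling are illustrative assumptions.

```python
import random


def lineages_at_time(m, t, theta=0.0, rng=None):
    """Simulate the number of ancestral lineages at time t in the past.

    Assumption (not taken from the paper): k lineages drop to k - 1 at
    rate k * (k - 1 + theta) / 2, the usual line-of-descent death process.
    """
    rng = rng or random.Random()
    k, elapsed = m, 0.0
    while k > 0:
        rate = k * (k - 1 + theta) / 2.0
        if rate == 0.0:
            break  # one lineage and no mutation: the state is absorbing
        elapsed += rng.expovariate(rate)  # waiting time to the next loss
        if elapsed >= t:
            break  # the next loss would happen after time t
        k -= 1
    return k


def mean_lineages(m, t, theta=0.0, reps=2000, seed=0):
    """Monte Carlo estimate of the expected number of lineages at time t."""
    rng = random.Random(seed)
    total = sum(lineages_at_time(m, t, theta, rng) for _ in range(reps))
    return total / reps
```

For example, `mean_lineages(10, 0.1)` and `mean_lineages(10, 1.0)` give Monte Carlo estimates of the expected lineage count at two depths; the second should be smaller, since lineages can only be lost as one looks further back.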



Article information

Ann. Inst. H. Poincaré Probab. Statist., Volume 55, Number 2 (2019), 1087-1115.

Received: 26 June 2017
Revised: 7 April 2018
Accepted: 17 April 2018
First available in Project Euclid: 14 May 2019


Primary: 62C10: Bayesian problems; characterization of Bayes procedures
Secondary: 62M05: Markov processes: estimation

Keywords: Ancestral inference; Bayesian nonparametrics; Dirichlet process; Kingman’s coalescent; Lineages distributions; Predictive probability


Favaro, Stefano; Feng, Shui; Jenkins, Paul A. Bayesian nonparametric analysis of Kingman’s coalescent. Ann. Inst. H. Poincaré Probab. Statist. 55 (2019), no. 2, 1087–1115. doi:10.1214/18-AIHP910.


