Advances in Applied Probability

Approximate sampling formulae for general finite-alleles models of mutation

Anand Bhaskar, John A. Kamm, and Yun S. Song



Many applications in genetic analyses use sampling distributions, which describe the probability of observing a sample of DNA sequences randomly drawn from a population. In the one-locus case with special models of mutation, such as the infinite-alleles model or the finite-alleles parent-independent mutation model, closed-form sampling distributions under the coalescent have been known for many decades. However, no exact formula is currently known for more general models of mutation that are of biological interest. In this paper, models with finitely many alleles are considered, and an urn construction related to the coalescent is used to derive approximate closed-form sampling formulae for an arbitrary irreducible recurrent mutation model or for a reversible recurrent mutation model, depending on whether the number of distinct observed allele types is at most three or four, respectively. It is demonstrated empirically that the formulae derived here are highly accurate when the per-base mutation rate is low, which holds for many biological organisms.
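To make concrete what a closed-form sampling distribution looks like in one of the special cases mentioned above, the following is a minimal Python sketch of the classical Ewens sampling formula for the infinite-alleles model. It is not one of the approximate formulae derived in the paper. The function name `ewens_probability`, the argument layout, and the symbol `theta` for the scaled mutation rate are assumptions chosen for this illustration.

```python
from math import factorial, prod


def ewens_probability(allele_counts, theta):
    """Probability of an allelic configuration under the Ewens sampling formula.

    allele_counts[j - 1] is the number of allele types observed exactly j times
    in the sample (hypothetical layout chosen for this sketch).
    theta is the scaled mutation rate and must be positive.
    """
    # Sample size n is implied by the configuration: n = sum_j j * a_j.
    n = sum(j * a for j, a in enumerate(allele_counts, start=1))
    # Rising factorial theta * (theta + 1) * ... * (theta + n - 1).
    rising = prod(theta + i for i in range(n))
    # Product over j of (theta / j)^{a_j} / a_j!.
    weight = prod(
        (theta / j) ** a / factorial(a)
        for j, a in enumerate(allele_counts, start=1)
    )
    return factorial(n) / rising * weight


# Usage: a sample of size 3 with one singleton type and one doubleton type.
p = ewens_probability([1, 1, 0], theta=2.0)
```

For a sample of size two, the sketch gives probability 1 / (1 + theta) that both sampled alleles are of the same type, which is a convenient sanity check.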

Article information

Adv. in Appl. Probab., Volume 44, Number 2 (2012), 408-428.

First available in Project Euclid: 16 June 2012


Primary: 92D15: Problems related to evolution
Secondary: 65C50: Other computational problems in probability; 92D10: Genetics {For genetic algebras, see 17D92}; 41A58: Series expansions (e.g. Taylor, Lidstone series, but not Fourier series)

Keywords: sampling probability; coalescent theory; urn model; martingale


Bhaskar, Anand; Kamm, John A.; Song, Yun S. Approximate sampling formulae for general finite-alleles models of mutation. Adv. in Appl. Probab. 44 (2012), no. 2, 408–428. doi:10.1239/aap/1339878718.


