Source: Adv. in Appl. Probab. Volume 44, Number 2 (2012), 408-428.
Many applications in genetic analyses utilize sampling distributions, which
describe the probability of observing a sample of DNA sequences randomly drawn
from a population. In the one-locus case with special models of mutation, such
as the infinite-alleles model or the finite-alleles parent-independent mutation
model, closed-form sampling distributions under the coalescent have been known
for many decades. However, no exact formula is currently known for more general
models of mutation that are of biological interest. In this paper, models with
finitely many alleles are considered, and an urn construction related to the
coalescent is used to derive approximate closed-form sampling formulae for an
arbitrary irreducible recurrent mutation model or for a reversible recurrent
mutation model, depending on whether the number of distinct observed allele
types is at most three or four, respectively. It is demonstrated empirically
that the formulae derived here are highly accurate when the per-base mutation
rate is low, which holds for many biological organisms.
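
As an illustration of what a closed-form sampling distribution looks like in the one-locus infinite-alleles case mentioned above, the following minimal sketch evaluates the classical Ewens sampling formula. It is not the approximate formula derived in this paper for finite-alleles models; the function name `ewens_probability` and its argument layout are assumptions chosen for this example.

```python
from math import factorial


def ewens_probability(counts, theta):
    """Ewens sampling formula under the infinite-alleles model.

    counts[j-1] is the number of allele types seen exactly j times in the
    sample (hypothetical argument layout); theta is the scaled mutation rate.
    """
    # Sample size n = sum over j of j * a_j.
    n = sum(j * a for j, a in enumerate(counts, start=1))

    # Rising factorial theta * (theta + 1) * ... * (theta + n - 1).
    rising = 1.0
    for i in range(n):
        rising *= theta + i

    # n! / rising * product over j of theta^a_j / (j^a_j * a_j!).
    prob = factorial(n) / rising
    for j, a in enumerate(counts, start=1):
        prob *= theta ** a / (j ** a * factorial(a))
    return prob


# Probability that two sampled sequences carry the same allele: 1 / (1 + theta).
p_same = ewens_probability([0, 1], theta=1.0)
```

With `theta = 1.0`, a sample of size two is equally likely to show one allele type or two, so `p_same` evaluates to 0.5. Summing the function over all allelic partitions of a fixed sample size gives 1, which is a convenient sanity check.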