Source: Ann. Appl. Probab.
Volume 20, Number 3
The Ewens sampling formula (ESF) is a one-parameter family of probability distributions with a number of intriguing combinatorial connections. This elegant closed-form formula first arose in biology as the stationary probability distribution of a sample configuration at one locus under the infinite-alleles model of mutation. Since its discovery in the early 1970s, the ESF has been used in various biological applications and has sparked several interesting mathematical generalizations. In the population genetics community, extending the underlying random-mating model to include recombination has received much attention, but no general closed-form sampling formula is currently known even for the simplest extension, that is, a model with two loci. In this paper, we show that it is possible to obtain useful closed-form results when the population-scaled recombination rate ρ is large but not necessarily infinite. Specifically, we consider an asymptotic expansion of the two-locus sampling formula in inverse powers of ρ and obtain closed-form expressions for the first few terms in the expansion. Our asymptotic sampling formula applies to arbitrary sample sizes and configurations.
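For readers unfamiliar with the one-locus ESF mentioned above, the following is a minimal sketch of its standard textbook form, which is not stated in this abstract: for a sample of size n in which a_j allele types are each observed exactly j times, the probability is n! / (θ(θ+1)···(θ+n−1)) × ∏_j θ^{a_j} / (j^{a_j} a_j!), where θ is the single (mutation) parameter. The function names `ewens_probability` and `configurations` are illustrative assumptions, and the sketch does not implement the two-locus asymptotic formula of the paper.

```python
from fractions import Fraction
from math import factorial


def ewens_probability(counts, theta):
    """One-locus ESF probability of an allele-frequency spectrum.

    counts[j-1] is a_j, the number of allele types seen exactly j times.
    theta is the mutation parameter (assumed positive).
    Exact rational arithmetic is used so the result can be checked exactly.
    """
    n = sum(j * a for j, a in enumerate(counts, start=1))
    theta = Fraction(theta)
    # Rising factorial theta * (theta + 1) * ... * (theta + n - 1).
    rising = Fraction(1)
    for i in range(n):
        rising *= theta + i
    prob = Fraction(factorial(n)) / rising
    for j, a in enumerate(counts, start=1):
        prob *= theta ** a / (Fraction(j) ** a * factorial(a))
    return prob


def configurations(n):
    """Yield every tuple (a_1, ..., a_n) with sum of j * a_j equal to n."""
    def extend(j, remaining):
        if j > n:
            if remaining == 0:
                yield ()
            return
        for a in range(remaining // j + 1):
            for rest in extend(j + 1, remaining - j * a):
                yield (a,) + rest
    yield from extend(1, n)


if __name__ == "__main__":
    theta = Fraction(3, 2)
    total = sum(ewens_probability(c, theta) for c in configurations(5))
    print(total)  # the probabilities over all configurations sum to 1
```

Summing over all configurations of a fixed sample size is a quick sanity check that the expression defines a probability distribution for any chosen θ.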