Statistical Science

The Ubiquitous Ewens Sampling Formula

Harry Crane

Abstract

Ewens’s sampling formula exemplifies the harmony of mathematical theory, statistical application, and scientific discovery. The formula not only contributes to the foundations of evolutionary molecular genetics, the neutral theory of biodiversity, Bayesian nonparametrics, combinatorial stochastic processes, and inductive inference but also emerges from fundamental concepts in probability theory, algebra, and number theory. With an emphasis on its far-reaching influence throughout statistics and probability, we highlight these and many other consequences of Ewens’s seminal discovery.

Article information

Source
Statist. Sci., Volume 31, Number 1 (2016), 1-19.

Dates
First available in Project Euclid: 10 February 2016

Permanent link to this document
https://projecteuclid.org/euclid.ss/1455115906

Digital Object Identifier
doi:10.1214/15-STS529

Mathematical Reviews number (MathSciNet)
MR3458585

Keywords
Ewens’s sampling formula; Poisson–Dirichlet distribution; random partition; coalescent process; inductive inference; exchangeability; logarithmic combinatorial structures; Chinese restaurant process; Ewens–Pitman distribution; Dirichlet process; stick breaking; permanental partition model; cyclic product distribution; clustering; Bayesian nonparametrics; $\alpha$-permanent

Citation

Crane, Harry. The Ubiquitous Ewens Sampling Formula. Statist. Sci. 31 (2016), no. 1, 1–19. doi:10.1214/15-STS529. https://projecteuclid.org/euclid.ss/1455115906



