## Bernoulli

• Volume 24, Number 3 (2018), 1910-1941.

### Finite sample properties of the mean occupancy counts and probabilities

#### Abstract

For a probability distribution $P$ on an at most countable alphabet $\mathcal{A}$, this article gives finite sample bounds for the expected occupancy counts $\mathbb{E}K_{n,r}$ and probabilities $\mathbb{E}M_{n,r}$. Both upper and lower bounds are given in terms of the counting function $\nu$ of $P$. Special attention is given to the case where $\nu$ is bounded by a regularly varying function. In this case, it is shown that our general results lead to an optimal-rate control of the expected occupancy counts and probabilities with explicit constants. Our results are also put in perspective with Turing’s formula and recent concentration bounds to deduce bounds in probability. At the end of the paper, we discuss an extension of the occupancy problem to arbitrary distributions in a metric space.
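The abstract does not define its notation, so the following is only a sketch under the standard conventions of the occupancy literature, which are assumed here rather than taken from the text: $K_{n,r}$ is the number of letters seen exactly $r$ times in an i.i.d. sample of size $n$ from $P$, $M_{n,r}$ is the total probability of those letters, and $\nu(\varepsilon)$ counts the letters with probability at least $\varepsilon$. Under these assumptions both expectations are finite sums of binomial terms, and the identity $\mathbb{E}M_{n,r} = \frac{r+1}{n+1}\,\mathbb{E}K_{n+1,r+1}$ that underlies Turing's formula can be checked numerically. The function names and the four-letter distribution are hypothetical, chosen only for illustration.

```python
from math import comb


def expected_occupancy_count(p, n, r):
    """E[K_{n,r}] for a finite probability vector p (assumed definition)."""
    # Letter a is seen exactly r times with Binomial(n, p_a) probability.
    return sum(comb(n, r) * pa**r * (1.0 - pa) ** (n - r) for pa in p)


def expected_occupancy_probability(p, n, r):
    """E[M_{n,r}] for a finite probability vector p (assumed definition)."""
    # Same sum, with each letter weighted by its own probability p_a.
    return sum(pa * comb(n, r) * pa**r * (1.0 - pa) ** (n - r) for pa in p)


def counting_function(p, eps):
    """nu(eps): number of letters with probability at least eps (assumed)."""
    return sum(1 for pa in p if pa >= eps)


# Hypothetical four-letter distribution, used only for illustration.
p = [0.5, 0.25, 0.125, 0.125]
n, r = 20, 1

# Identity behind Turing's formula: E M_{n,r} = (r+1)/(n+1) * E K_{n+1,r+1}.
lhs = expected_occupancy_probability(p, n, r)
rhs = (r + 1) / (n + 1) * expected_occupancy_count(p, n + 1, r + 1)
```

For a countably infinite alphabet the same sums would have to be truncated, which is one reason bounds expressed through $\nu$, as in the article, are useful.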

#### Article information

Source
Bernoulli, Volume 24, Number 3 (2018), 1910-1941.

Dates
Revised: October 2016
First available in Project Euclid: 2 February 2018

Permanent link to this document
https://projecteuclid.org/euclid.bj/1517540463

Digital Object Identifier
doi:10.3150/16-BEJ915

Mathematical Reviews number (MathSciNet)
MR3757518

Zentralblatt MATH identifier
06839255

#### Citation

Decrouez, Geoffrey; Grabchak, Michael; Paris, Quentin. Finite sample properties of the mean occupancy counts and probabilities. Bernoulli 24 (2018), no. 3, 1910–1941. doi:10.3150/16-BEJ915. https://projecteuclid.org/euclid.bj/1517540463
