## Bernoulli

• Volume 23, Number 1 (2017), 249-287.

### Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications

#### Abstract

An infinite urn scheme is defined by a probability mass function $(p_{j})_{j\geq1}$ over the positive integers. A random allocation consists of a sample of $N$ independent draws from this distribution, where $N$ may be deterministic or Poisson-distributed. This paper is concerned with occupancy counts, that is, the number of symbols with exactly $r$ or at least $r$ occurrences in the sample, and with the missing mass, that is, the total probability of all symbols that do not occur in the sample. Without any further assumption on the sampling distribution, these random quantities are shown to satisfy Bernstein-type concentration inequalities. The variance factors in these concentration inequalities are shown to be tight if the sampling distribution satisfies a regular variation property, which reads as follows. Let the number of symbols with probability at least $x$ be $\vec{\nu}(x)=|\{j\colon p_{j}\geq x\}|$. In a regularly varying urn scheme, $\vec{\nu}$ satisfies $\lim_{\tau\rightarrow0}\vec{\nu}(\tau x)/\vec{\nu}(\tau)=x^{-\alpha}$ for $\alpha\in[0,1]$, and the variance of the number of distinct symbols in a sample tends to infinity as the sample size tends to infinity. Among other applications, these concentration inequalities allow us to derive tight confidence intervals for the Good–Turing estimator of the missing mass.
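The quantities named in the abstract can be made concrete with a small simulation. The following is a minimal sketch, not the authors' code: it assumes a deterministic sample size, truncates the infinite urn to a finite geometric distribution so that the missing mass can be computed exactly, and indexes symbols from 0 rather than 1. The function names `geometric_pmf` and `occupancy_summary` are hypothetical, and the Good–Turing estimate is taken in its usual form, the fraction of the sample made up of symbols seen exactly once.

```python
import random
from collections import Counter


def geometric_pmf(q, size):
    """Truncated geometric pmf on symbols 0..size-1, renormalised to sum to 1.

    Assumption: a finite truncation stands in for the infinite urn scheme.
    """
    weights = [(1 - q) * q ** j for j in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def occupancy_summary(sample, pmf, r=1):
    """Occupancy counts, missing mass and Good-Turing estimate for one sample."""
    counts = Counter(sample)
    # Number of symbols occurring exactly r times, and at least r times.
    k_exactly_r = sum(1 for c in counts.values() if c == r)
    k_at_least_r = sum(1 for c in counts.values() if c >= r)
    # Missing mass: total probability of the symbols absent from the sample.
    missing_mass = sum(p for j, p in enumerate(pmf) if j not in counts)
    # Good-Turing estimate of the missing mass: singletons / sample size.
    singletons = sum(1 for c in counts.values() if c == 1)
    good_turing = singletons / len(sample)
    return {
        "k_exactly_r": k_exactly_r,
        "k_at_least_r": k_at_least_r,
        "missing_mass": missing_mass,
        "good_turing": good_turing,
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility
    pmf = geometric_pmf(q=0.9, size=500)
    n = 200  # deterministic sample size
    sample = rng.choices(range(len(pmf)), weights=pmf, k=n)
    print(occupancy_summary(sample, pmf, r=1))
```

Repeating the draw with different seeds gives a rough empirical picture of how far the true missing mass and its Good–Turing estimate fluctuate from sample to sample, which is the kind of deviation the paper's inequalities bound.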

#### Article information

Source
Bernoulli, Volume 23, Number 1 (2017), 249-287.

Dates
Received: December 2014
Revised: May 2015
First available in Project Euclid: 27 September 2016

Permanent link to this document
https://projecteuclid.org/euclid.bj/1475001355

Digital Object Identifier
doi:10.3150/15-BEJ743

Mathematical Reviews number (MathSciNet)
MR3556773

Zentralblatt MATH identifier
1366.60016

#### Citation

Ben-Hamou, Anna; Boucheron, Stéphane; Ohannessian, Mesrob I. Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications. Bernoulli 23 (2017), no. 1, 249–287. doi:10.3150/15-BEJ743. https://projecteuclid.org/euclid.bj/1475001355
