## The Annals of Statistics

### Chebyshev polynomials, moment matching, and optimal estimation of the unseen

#### Abstract

We consider the problem of estimating the support size of a discrete distribution whose minimum nonzero mass is at least $\frac{1}{k}$. Under the independent sampling model, we show that the sample complexity, that is, the minimal sample size to achieve an additive error of $\varepsilon k$ with probability at least 0.1, is within universal constant factors of $\frac{k}{\log k}\log^{2}\frac{1}{\varepsilon }$, which improves the state-of-the-art result of $\frac{k}{\varepsilon^{2}\log k}$ in [In Advances in Neural Information Processing Systems (2013) 2157–2165]. A similar characterization of the minimax risk is also obtained. Our procedure is a linear estimator based on the Chebyshev polynomial and its approximation-theoretic properties; it can be evaluated in $O(n+\log^{2}k)$ time and attains the sample complexity within constant factors. The superiority of the proposed estimator in terms of accuracy, computational efficiency and scalability is demonstrated on a variety of synthetic and real datasets.
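To give a feel for the size of the improvement, the two rates quoted in the abstract can be compared numerically. The sketch below is only illustrative: it drops the universal constant factors, and it assumes $0<\varepsilon<1$ and $k>1$ without reproducing whatever range of $\varepsilon$ the paper requires. The function names are hypothetical and do not come from the paper.

```python
import math


def rate_chebyshev(k: float, eps: float) -> float:
    """(k / log k) * log^2(1/eps): the sample complexity from the abstract,
    with the universal constant factor dropped (an assumption of this sketch)."""
    return k / math.log(k) * math.log(1.0 / eps) ** 2


def rate_previous(k: float, eps: float) -> float:
    """k / (eps^2 * log k): the earlier bound quoted in the abstract,
    again with constants dropped."""
    return k / (eps ** 2 * math.log(k))


def improvement_ratio(k: float, eps: float) -> float:
    """Ratio of the earlier rate to the new one.

    Algebraically this equals 1 / (eps * log(1/eps))^2, so it does not
    depend on k.
    """
    return rate_previous(k, eps) / rate_chebyshev(k, eps)


if __name__ == "__main__":
    k = 10 ** 6  # illustrative value only
    for eps in (0.2, 0.1, 0.05, 0.01):
        print(
            f"eps={eps:<5} new={rate_chebyshev(k, eps):.3e} "
            f"previous={rate_previous(k, eps):.3e} "
            f"ratio={improvement_ratio(k, eps):.1f}"
        )
```

Because $\varepsilon\log\frac{1}{\varepsilon}\le\frac{1}{e}$ on $(0,1)$, the ratio is at least $e^{2}$ and grows as $\varepsilon$ shrinks, reflecting the change from polynomial to polylogarithmic dependence on $\frac{1}{\varepsilon}$. Since constants are ignored, the printed numbers indicate scaling only, not actual sample sizes.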

#### Article information

Source
Ann. Statist., Volume 47, Number 2 (2019), 857-883.

Dates
Revised: November 2017
First available in Project Euclid: 11 January 2019

Permanent link to this document
https://projecteuclid.org/euclid.aos/1547197241

Digital Object Identifier
doi:10.1214/17-AOS1665

Mathematical Reviews number (MathSciNet)
MR3909953

Zentralblatt MATH identifier
07033154

Subjects
Primary: 62G05: Estimation
Secondary: 62C20: Minimax procedures

#### Citation

Wu, Yihong; Yang, Pengkun. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. Ann. Statist. 47 (2019), no. 2, 857–883. doi:10.1214/17-AOS1665. https://projecteuclid.org/euclid.aos/1547197241

#### References

• [1] Bar-Yossef, Z., Jayram, T., Kumar, R., Sivakumar, D. and Trevisan, L. (2002). Counting distinct elements in a data stream. In Proceedings of the 6th Randomization and Approximation Techniques in Computer Science 1–10. Springer, Berlin.
• [2] Bunge, J. and Fitzpatrick, M. (1993). Estimating the number of species: A review. J. Amer. Statist. Assoc. 88 364–373.
• [3] Burnham, K. P. and Overton, W. S. (1979). Robust estimation of population size when capture probabilities vary among animals. Ecology 60 927–936.
• [4] Cai, T. T. and Low, M. G. (2011). Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional. Ann. Statist. 39 1012–1041.
• [5] Chao, A. (1984). Nonparametric estimation of the number of classes in a population. Scand. J. Stat. 265–270.
• [6] Chao, A. and Lee, S.-M. (1992). Estimating the number of classes via sample coverage. J. Amer. Statist. Assoc. 87 210–217.
• [7] Charikar, M., Chaudhuri, S., Motwani, R. and Narasayya, V. (2000). Towards estimation error guarantees for distinct values. In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS) 268–279. ACM, New York.
• [8] Darroch, J. and Ratcliff, D. (1980). A note on capture-recapture estimation. Biometrics 36 149–153.
• [9] Dzyadyk, V. K. and Shevchuk, I. A. (2008). Theory of Uniform Approximation of Functions by Polynomials. de Gruyter, Berlin.
• [10] Efron, B. and Thisted, R. (1976). Estimating the number of unseen species: How many words did Shakespeare know? Biometrika 63 435–447.
• [11] Fisher, R. A., Corbet, A. S. and Williams, C. B. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. J. Anim. Ecol. 12 42–58.
• [12] Gandolfi, A. and Sastri, C. C. A. (2004). Nonparametric estimations about species not observed in a random sample. Milan J. Math. 72 81–105.
• [13] Global Language Monitor. Number of words in the English language. https://www.languagemonitor.com/global-english/no-of-words/. Accessed: 2016-02-16.
• [14] Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika 40 237–264.
• [15] Good, I. J. and Toulmin, G. H. (1956). The number of new species, and the increase in population coverage, when a sample is increased. Biometrika 43 45–63.
• [16] Gotelli, N. J. and Colwell, R. K. (2011). Estimating species richness. Biological Diversity: Frontiers in Measurement and Assessment 12 39–54.
• [17] Harris, B. (1968). Statistical inference in the classical occupancy problem unbiased estimation of the number of classes. J. Amer. Statist. Assoc. 63 837–847.
• [18] Huang, S.-P. and Weir, B. (2001). Estimating the total number of alleles using a sample coverage method. Genetics 159 1365–1373.
• [19] Ibragimov, I. A., Nemirovskii, A. S. and Khas’minskii, R. Z. (1987). Some problems on nonparametric estimation in Gaussian white noise. Theory Probab. Appl. 31 391–406.
• [20] Jiao, J., Venkat, K., Han, Y. and Weissman, T. (2015). Minimax estimation of functionals of discrete distributions. IEEE Trans. Inform. Theory 61 2835–2885.
• [21] Lepski, O., Nemirovski, A. and Spokoiny, V. (1999). On estimation of the $L_{r}$ norm of a regression function. Probab. Theory Related Fields 113 221–253.
• [22] Lewontin, R. C. and Prout, T. (1956). Estimation of the number of different classes in a population. Biometrics 12 211–223.
• [23] Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer, New York.
• [24] Mao, C. X. and Lindsay, B. G. (2007). Estimating the number of classes. Ann. Statist. 35 917–930.
• [25] Marchand, J. and Schroeck Jr, F. (1982). On the estimation of the number of equally likely classes in a population. Comm. Statist. Theory Methods 11 1139–1146.
• [26] McNeil, D. R. (1973). Estimating an author’s vocabulary. J. Amer. Statist. Assoc. 68 92–96.
• [27] Mitzenmacher, M. and Upfal, E. (2005). Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge Univ. Press, Cambridge.
• [28] Orlitsky, A., Suresh, A. T. and Wu, Y. (2016). Optimal prediction of the number of unseen species. Proc. Natl. Acad. Sci. USA 113 13283–13288.
• [29] Oxford English Dictionary. http://public.oed.com/about/. Accessed: 2016-02-16.
• [30] Raskhodnikova, S., Ron, D., Shpilka, A. and Smith, A. (2009). Strong lower bounds for approximating distribution support size and the distinct elements problem. SIAM J. Comput. 39 813–842.
• [31] Robbins, H. E. (1968). Estimating the total probability of the unobserved outcomes of an experiment. Ann. Math. Stat. 39 256–257.
• [32] Samuel, E. (1968). Sequential maximum likelihood estimation of the size of a population. Ann. Math. Stat. 39 1057–1068.
• [33] Steele, J. M. (1986). An Efron–Stein inequality for nonsymmetric statistics. Ann. Statist. 14 753–758.
• [34] Thisted, R. and Efron, B. (1987). Did Shakespeare write a newly-discovered poem? Biometrika 74 445–455.
• [35] Timan, A. F. (1963). Theory of Approximation of Functions of a Real Variable. Pergamon Press, Elmsford, NY.
• [36] Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Revised and extended from the 2004 French original. Translated by Vladimir Zaiats. Springer, New York, NY.
• [37] Valiant, G. (2017). Private communication.
• [38] Valiant, G. and Valiant, P. (2010). A CLT and tight lower bounds for estimating entropy. In Electronic Colloquium on Computational Complexity (ECCC) 17 179.
• [39] Valiant, G. and Valiant, P. (2011). Estimating the unseen: An $n/\log(n)$-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing 685–694.
• [40] Valiant, G. and Valiant, P. (2011). The power of linear estimators. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on 403–412. IEEE, New York.
• [41] Valiant, P. and Valiant, G. (2013). Estimating the unseen: Improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems 2157–2165.
• [42] Wu, Y. and Yang, P. (2016). Sample complexity of the distinct elements problem. Available at arXiv:1612.03375.
• [43] Wu, Y. and Yang, P. (2016). Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Trans. Inform. Theory 62 3702–3720.
• [44] Wu, Y. and Yang, P. (2017). Supplement to “Chebyshev polynomials, moment matching and optimal estimation of the unseen.” DOI:10.1214/17-AOS1665SUPP.

#### Supplemental materials

• Supplementary material for “Chebyshev polynomials, moment matching and optimal estimation of the unseen”. Due to space constraints, the technical proofs are given in the supplementary document [44].