The Annals of Statistics

Chebyshev polynomials, moment matching, and optimal estimation of the unseen

Yihong Wu and Pengkun Yang

Abstract

We consider the problem of estimating the support size of a discrete distribution whose minimum nonzero mass is at least $\frac{1}{k}$. Under the independent sampling model, we show that the sample complexity, that is, the minimal sample size to achieve an additive error of $\varepsilon k$ with probability at least 0.1, is within universal constant factors of $\frac{k}{\log k}\log^{2}\frac{1}{\varepsilon }$, which improves the state-of-the-art result of $\frac{k}{\varepsilon^{2}\log k}$ in [In Advances in Neural Information Processing Systems (2013) 2157–2165]. A similar characterization of the minimax risk is also obtained. Our procedure is a linear estimator based on the Chebyshev polynomial and its approximation-theoretic properties, which can be evaluated in $O(n+\log^{2}k)$ time and attains the sample complexity within constant factors. The superiority of the proposed estimator in terms of accuracy, computational efficiency and scalability is demonstrated on a variety of synthetic and real datasets.
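To get a feel for the size of the improvement, the two rates quoted in the abstract can be compared numerically. The following is only an illustrative sketch: it drops the universal constants, ignores any restrictions on the range of $\varepsilon$ under which the result holds, and the function names are hypothetical rather than taken from the paper.

```python
import math


def rate_chebyshev(k: int, eps: float) -> float:
    """Rate from the abstract, (k / log k) * log^2(1/eps), constants dropped."""
    return k / math.log(k) * math.log(1.0 / eps) ** 2


def rate_previous(k: int, eps: float) -> float:
    """Earlier rate quoted in the abstract, k / (eps^2 * log k), constants dropped."""
    return k / (eps ** 2 * math.log(k))


def improvement_factor(k: int, eps: float) -> float:
    """Ratio of the two rates; algebraically 1 / (eps^2 * log^2(1/eps))."""
    return rate_previous(k, eps) / rate_chebyshev(k, eps)


if __name__ == "__main__":
    k = 10 ** 6  # illustrative value, not from the paper
    for eps in (0.5, 0.1, 0.01):
        print(f"eps={eps}: new rate ~ {rate_chebyshev(k, eps):.3g}, "
              f"old rate ~ {rate_previous(k, eps):.3g}, "
              f"ratio ~ {improvement_factor(k, eps):.3g}")
```

The ratio depends only on $\varepsilon$, not on $k$, since the common factor $\frac{k}{\log k}$ cancels. It grows as $\varepsilon$ shrinks, which is where the change from $\frac{1}{\varepsilon^{2}}$ to $\log^{2}\frac{1}{\varepsilon}$ matters most.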

Article information

Source
Ann. Statist., Volume 47, Number 2 (2019), 857-883.

Dates
Received: June 2016
Revised: November 2017
First available in Project Euclid: 11 January 2019

Permanent link to this document
https://projecteuclid.org/euclid.aos/1547197241

Digital Object Identifier
doi:10.1214/17-AOS1665

Mathematical Reviews number (MathSciNet)
MR3909953

Zentralblatt MATH identifier
07033154

Subjects
Primary: 62G05: Estimation
Secondary: 62C20: Minimax procedures

Keywords
Support size estimation; large domain; polynomial approximation; high-dimensional statistics; nonparametric inference

Citation

Wu, Yihong; Yang, Pengkun. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. Ann. Statist. 47 (2019), no. 2, 857–883. doi:10.1214/17-AOS1665. https://projecteuclid.org/euclid.aos/1547197241


References

  • [1] Bar-Yossef, Z., Jayram, T., Kumar, R., Sivakumar, D. and Trevisan, L. (2002). Counting distinct elements in a data stream. In Proceedings of the 6th Randomization and Approximation Techniques in Computer Science 1–10. Springer, Berlin.
  • [2] Bunge, J. and Fitzpatrick, M. (1993). Estimating the number of species: A review. J. Amer. Statist. Assoc. 88 364–373.
  • [3] Burnham, K. P. and Overton, W. S. (1979). Robust estimation of population size when capture probabilities vary among animals. Ecology 60 927–936.
  • [4] Cai, T. T. and Low, M. G. (2011). Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional. Ann. Statist. 39 1012–1041.
  • [5] Chao, A. (1984). Nonparametric estimation of the number of classes in a population. Scand. J. Stat. 265–270.
  • [6] Chao, A. and Lee, S.-M. (1992). Estimating the number of classes via sample coverage. J. Amer. Statist. Assoc. 87 210–217.
  • [7] Charikar, M., Chaudhuri, S., Motwani, R. and Narasayya, V. (2000). Towards estimation error guarantees for distinct values. In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS) 268–279. ACM, New York.
  • [8] Darroch, J. and Ratcliff, D. (1980). A note on capture-recapture estimation. Biometrics 36 149–153.
  • [9] Dzyadyk, V. K. and Shevchuk, I. A. (2008). Theory of Uniform Approximation of Functions by Polynomials. de Gruyter, Berlin.
  • [10] Efron, B. and Thisted, R. (1976). Estimating the number of unseen species: How many words did Shakespeare know? Biometrika 63 435–447.
  • [11] Fisher, R. A., Corbet, A. S. and Williams, C. B. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. J. Anim. Ecol. 12 42–58.
  • [12] Gandolfi, A. and Sastri, C. C. A. (2004). Nonparametric estimations about species not observed in a random sample. Milan J. Math. 72 81–105.
  • [13] Global Language Monitor. Number of words in the English language. https://www.languagemonitor.com/global-english/no-of-words/. Accessed: 2016-02-16.
  • [14] Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika 40 237–264.
  • [15] Good, I. J. and Toulmin, G. H. (1956). The number of new species, and the increase in population coverage, when a sample is increased. Biometrika 43 45–63.
  • [16] Gotelli, N. J. and Colwell, R. K. (2011). Estimating species richness. Biological Diversity: Frontiers in Measurement and Assessment 12 39–54.
  • [17] Harris, B. (1968). Statistical inference in the classical occupancy problem unbiased estimation of the number of classes. J. Amer. Statist. Assoc. 63 837–847.
  • [18] Huang, S.-P. and Weir, B. (2001). Estimating the total number of alleles using a sample coverage method. Genetics 159 1365–1373.
  • [19] Ibragimov, I. A., Nemirovskii, A. S. and Khas’minskii, R. Z. (1987). Some problems on nonparametric estimation in Gaussian white noise. Theory Probab. Appl. 31 391–406.
  • [20] Jiao, J., Venkat, K., Han, Y. and Weissman, T. (2015). Minimax estimation of functionals of discrete distributions. IEEE Trans. Inform. Theory 61 2835–2885.
  • [21] Lepski, O., Nemirovski, A. and Spokoiny, V. (1999). On estimation of the $L_{r}$ norm of a regression function. Probab. Theory Related Fields 113 221–253.
  • [22] Lewontin, R. C. and Prout, T. (1956). Estimation of the number of different classes in a population. Biometrics 12 211–223.
  • [23] Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer, New York.
  • [24] Mao, C. X. and Lindsay, B. G. (2007). Estimating the number of classes. Ann. Statist. 35 917–930.
  • [25] Marchand, J. and Schroeck Jr, F. (1982). On the estimation of the number of equally likely classes in a population. Comm. Statist. Theory Methods 11 1139–1146.
  • [26] McNeil, D. R. (1973). Estimating an author’s vocabulary. J. Amer. Statist. Assoc. 68 92–96.
  • [27] Mitzenmacher, M. and Upfal, E. (2005). Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge Univ. Press, Cambridge.
  • [28] Orlitsky, A., Suresh, A. T. and Wu, Y. (2016). Optimal prediction of the number of unseen species. Proc. Natl. Acad. Sci. USA 113 13283–13288.
  • [29] Oxford English Dictionary. http://public.oed.com/about/. Accessed: 2016-02-16.
  • [30] Raskhodnikova, S., Ron, D., Shpilka, A. and Smith, A. (2009). Strong lower bounds for approximating distribution support size and the distinct elements problem. SIAM J. Comput. 39 813–842.
  • [31] Robbins, H. E. (1968). Estimating the total probability of the unobserved outcomes of an experiment. Ann. Math. Stat. 39 256–257.
  • [32] Samuel, E. (1968). Sequential maximum likelihood estimation of the size of a population. Ann. Math. Stat. 39 1057–1068.
  • [33] Steele, J. M. (1986). An Efron–Stein inequality for nonsymmetric statistics. Ann. Statist. 14 753–758.
  • [34] Thisted, R. and Efron, B. (1987). Did Shakespeare write a newly-discovered poem? Biometrika 74 445–455.
  • [35] Timan, A. F. (1963). Theory of Approximation of Functions of a Real Variable. Pergamon Press, Elmsford, NY.
  • [36] Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Revised and extended from the 2004 French original. Translated by Vladimir Zaiats. Springer, New York, NY.
  • [37] Valiant, G. (2017). Private communication.
  • [38] Valiant, G. and Valiant, P. (2010). A CLT and tight lower bounds for estimating entropy. In Electronic Colloquium on Computational Complexity (ECCC) 17 179.
  • [39] Valiant, G. and Valiant, P. (2011). Estimating the unseen: An $n/\log(n)$-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing 685–694.
  • [40] Valiant, G. and Valiant, P. (2011). The power of linear estimators. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on 403–412. IEEE, New York.
  • [41] Valiant, P. and Valiant, G. (2013). Estimating the unseen: Improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems 2157–2165.
  • [42] Wu, Y. and Yang, P. (2016). Sample complexity of the distinct elements problem. Available at arXiv:1612.03375.
  • [43] Wu, Y. and Yang, P. (2016). Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Trans. Inform. Theory 62 3702–3720.
  • [44] Wu, Y. and Yang, P. (2017). Supplement to “Chebyshev polynomials, moment matching and optimal estimation of the unseen.” DOI:10.1214/17-AOS1665SUPP.

Supplemental materials

  • Supplementary material for “Chebyshev polynomials, moment matching and optimal estimation of the unseen”. Due to space constraints, the technical proofs have been given in the supplementary documents [44].