Abstract
We consider the problem of estimating the support size of a discrete distribution whose minimum nonzero mass is at least $\frac{1}{k}$. Under the independent sampling model, we show that the sample complexity, that is, the minimal sample size needed to achieve an additive error of $\varepsilon k$ with probability at least 0.1, is within universal constant factors of $\frac{k}{\log k}\log^{2}\frac{1}{\varepsilon}$, which improves the state-of-the-art result of $\frac{k}{\varepsilon^{2}\log k}$ in [In Advances in Neural Information Processing Systems (2013) 2157–2165]. A similar characterization of the minimax risk is also obtained. Our procedure is a linear estimator based on the Chebyshev polynomial and its approximation-theoretic properties; it can be evaluated in $O(n+\log^{2}k)$ time, where $n$ is the sample size, and attains the sample complexity within constant factors. The superiority of the proposed estimator in terms of accuracy, computational efficiency and scalability is demonstrated on a variety of synthetic and real datasets.
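To make the construction concrete, below is a minimal sketch of such a Chebyshev-based linear estimator, assuming Poissonized counts and nonzero masses of at least $1/k$. The degree $L$, the interval endpoints, the count threshold, and the constants `c0` and `c1` are illustrative choices, not the tuned values from the paper; likewise, the basis conversion below is the simplest route, whereas the paper's own evaluation scheme is what achieves the stated $O(n+\log^{2}k)$ running time.

```python
import numpy as np
from collections import Counter

def support_estimator(samples, k, c0=0.45, c1=0.5):
    """Sketch of a linear support-size estimator built from a shifted
    Chebyshev polynomial. Constants c0, c1 are illustrative, not the
    paper's tuned values."""
    n = len(samples)
    counts = Counter(samples)                 # multiplicity N_i of each seen symbol
    L = max(1, int(c0 * np.log(k)))           # polynomial degree L ~ log k
    l, r = 1.0 / k, c1 * np.log(k) / n        # interval of masses needing smoothing
    if r <= l:
        return len(counts)                    # sample is large: plug-in count suffices
    # Chebyshev polynomial T_L rescaled to [l, r], then normalized so that
    # P_L(0) = -1; hence 1 + P_L(p) is uniformly close to 1 on [l, r].
    T = np.polynomial.Chebyshev([0] * L + [1], domain=[l, r])
    a = T.convert(kind=np.polynomial.Polynomial).coef
    a = a / (-a[0])                           # power coefficients of P_L, a[0] = -1

    def g(j):
        # Frequent symbols are seen with overwhelming probability: count as 1.
        if j > c1 * np.log(k):
            return 1.0
        # With N ~ Poisson(n p), E[(N)_m] = (n p)^m for the falling factorial
        # (N)_m, so sum_{m>=1} a[m] (j)_m / n^m estimates 1 + P_L(p) unbiasedly.
        total, ff = 0.0, 1.0                  # ff accumulates (j)_m / n^m
        for m in range(1, min(j, L) + 1):
            ff *= (j - m + 1) / n
            total += a[m] * ff
        return total

    return sum(g(j) for j in counts.values())
```

For a quick sanity check, `support_estimator(samples, k)` can be compared against the naive count of distinct symbols, `len(set(samples))`, which is badly biased downward when $n \ll k$.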
Citation
Yihong Wu, Pengkun Yang. "Chebyshev polynomials, moment matching, and optimal estimation of the unseen." Ann. Statist. 47 (2) 857–883, April 2019. https://doi.org/10.1214/17-AOS1665