## The Annals of Statistics

### Chebyshev polynomials, moment matching, and optimal estimation of the unseen

#### Abstract

We consider the problem of estimating the support size of a discrete distribution whose minimum nonzero mass is at least $\frac{1}{k}$. Under the independent sampling model, we show that the sample complexity, that is, the minimal sample size to achieve an additive error of $\varepsilon k$ with probability at least 0.1, is within universal constant factors of $\frac{k}{\log k}\log^{2}\frac{1}{\varepsilon }$, which improves the state-of-the-art result of $\frac{k}{\varepsilon^{2}\log k}$ in [In Advances in Neural Information Processing Systems (2013) 2157–2165]. A similar characterization of the minimax risk is also obtained. Our procedure is a linear estimator based on the Chebyshev polynomial and its approximation-theoretic properties; it can be evaluated in $O(n+\log^{2}k)$ time and attains the sample complexity within constant factors. The superiority of the proposed estimator in terms of accuracy, computational efficiency and scalability is demonstrated on a variety of synthetic and real datasets.
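To give a feel for the gap between the two rates quoted in the abstract, the following minimal sketch evaluates both expressions numerically. The function names are illustrative and not from the paper, and all universal constants are dropped, so the numbers indicate scaling only, not actual sample sizes.

```python
import math


def rate_chebyshev(k: int, eps: float) -> float:
    """(k / log k) * log^2(1 / eps), with universal constants dropped."""
    return k / math.log(k) * math.log(1.0 / eps) ** 2


def rate_previous(k: int, eps: float) -> float:
    """k / (eps^2 * log k), the earlier rate, with constants dropped."""
    return k / (eps ** 2 * math.log(k))


def improvement_ratio(eps: float) -> float:
    """rate_previous / rate_chebyshev; k cancels, leaving a function of eps."""
    return 1.0 / (eps * math.log(1.0 / eps)) ** 2


if __name__ == "__main__":
    k = 10 ** 6  # assumed value, chosen only for illustration
    for eps in (0.1, 0.01, 0.001):
        new = rate_chebyshev(k, eps)
        old = rate_previous(k, eps)
        print(f"eps={eps}: new ~ {new:.3g}, previous ~ {old:.3g}, "
              f"ratio ~ {improvement_ratio(eps):.3g}")
```

With constants ignored, the ratio of the two expressions depends only on $\varepsilon$ and grows like $1/(\varepsilon \log\frac{1}{\varepsilon})^{2}$, which is why the improvement is most visible at small target error.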

#### Article information

Source
Ann. Statist., Volume 47, Number 2 (2019), 857-883.

Dates
Revised: November 2017
First available in Project Euclid: 11 January 2019

Permanent link to this document
https://projecteuclid.org/euclid.aos/1547197241

Digital Object Identifier
doi:10.1214/17-AOS1665

Mathematical Reviews number (MathSciNet)
MR3909953

Zentralblatt MATH identifier
07033154

Subjects
Primary: 62G05: Estimation
Secondary: 62C20: Minimax procedures

#### Supplemental materials

• Supplementary material for “Chebyshev polynomials, moment matching and optimal estimation of the unseen”. Due to space constraints, the technical proofs are given in the supplementary document [44].