## The Annals of Statistics

### Exact lower bounds for the agnostic probably-approximately-correct (PAC) machine learning model

#### Abstract

We provide an exact nonasymptotic lower bound on the minimax expected excess risk (EER) in the agnostic probably-approximately-correct (PAC) machine learning classification model and identify the minimax learning algorithms as certain maximally symmetric and minimally randomized “voting” procedures. Based on this result, an exact asymptotic lower bound on the minimax EER is provided. This bound is of the simple form $c_{\infty}/\sqrt{\nu}$ as $\nu\to\infty$, where $c_{\infty}=0.16997\dots$ is a universal constant, $\nu=m/d$, $m$ is the size of the training sample, and $d$ is the Vapnik–Chervonenkis dimension of the hypothesis class. It is shown that the differences between these asymptotic and nonasymptotic bounds, as well as the differences between these two bounds and the maximum EER over all learning algorithms that minimize the empirical risk, are asymptotically negligible; all these differences are due to ties in the mentioned “voting” procedures. A few easy-to-compute nonasymptotic lower bounds on the minimax EER are also obtained and are shown to be close to the exact asymptotic lower bound $c_{\infty}/\sqrt{\nu}$ even for rather small values of the ratio $\nu=m/d$. As an application of these results, we substantially improve existing lower bounds on the tail probability of the excess risk. Among the tools used are Bayes estimation and apparently new identities and inequalities for binomial distributions.
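To illustrate the headline bound, the sketch below evaluates the asymptotic expression $c_{\infty}/\sqrt{\nu}$ for a given sample size $m$ and VC dimension $d$. This is only an illustrative computation under the assumptions stated in the abstract: the function name is ours, and the constant is truncated to the digits quoted above ($c_{\infty}=0.16997\dots$); the bound is exact only in the limit $\nu\to\infty$.

```python
import math

# Truncation of the universal constant c_inf = 0.16997... from the abstract.
C_INF = 0.16997


def asymptotic_eer_lower_bound(m: int, d: int) -> float:
    """Evaluate the asymptotic minimax EER lower bound c_inf / sqrt(nu).

    m  -- size of the training sample
    d  -- Vapnik-Chervonenkis dimension of the hypothesis class
    nu = m / d; the expression is the exact bound only as nu -> infinity,
    so for moderate nu this is an approximation.
    """
    nu = m / d
    return C_INF / math.sqrt(nu)


# Example: m = 10000 training examples, VC dimension d = 10, so nu = 1000.
print(asymptotic_eer_lower_bound(10_000, 10))
```

Note that the abstract's nonasymptotic results show this simple expression remains close to the exact minimax lower bound even for rather small $\nu$, which is what makes such a quick evaluation informative.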

#### Article information

Source
Ann. Statist., Volume 47, Number 5 (2019), 2822–2854.

Dates
Received: June 2016
Revised: December 2017
First available in Project Euclid: 3 August 2019

Permanent link to this document
https://projecteuclid.org/euclid.aos/1564797865

Digital Object Identifier
doi:10.1214/18-AOS1766

Mathematical Reviews number (MathSciNet)
MR3988774

#### Citation

Kontorovich, Aryeh; Pinelis, Iosif. Exact lower bounds for the agnostic probably-approximately-correct (PAC) machine learning model. Ann. Statist. 47 (2019), no. 5, 2822–2854. doi:10.1214/18-AOS1766. https://projecteuclid.org/euclid.aos/1564797865
