The Annals of Statistics

Exact lower bounds for the agnostic probably-approximately-correct (PAC) machine learning model

Aryeh Kontorovich and Iosif Pinelis


Abstract

We provide an exact nonasymptotic lower bound on the minimax expected excess risk (EER) in the agnostic probably-approximately-correct (PAC) machine learning classification model and identify minimax learning algorithms as certain maximally symmetric and minimally randomized "voting" procedures. Based on this result, an exact asymptotic lower bound on the minimax EER is provided. This bound is of the simple form $c_{\infty}/\sqrt{\nu}$ as $\nu\to\infty$, where $c_{\infty}=0.16997\dots$ is a universal constant, $\nu=m/d$, $m$ is the size of the training sample and $d$ is the Vapnik–Chervonenkis dimension of the hypothesis class. It is shown that the differences between these asymptotic and nonasymptotic bounds, as well as the differences between these two bounds and the maximum EER of any learning algorithm that minimizes the empirical risk, are asymptotically negligible; all these differences are due to ties in the mentioned "voting" procedures. A few easy-to-compute nonasymptotic lower bounds on the minimax EER are also obtained, which are shown to be close to the exact asymptotic lower bound $c_{\infty}/\sqrt{\nu}$ even for rather small values of the ratio $\nu=m/d$. As an application of these results, we substantially improve existing lower bounds on the tail probability of the excess risk. Among the tools used are Bayes estimation and apparently new identities and inequalities for binomial distributions.
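As an illustrative sketch (not code from the paper), the asymptotic lower bound $c_{\infty}/\sqrt{\nu}$ stated in the abstract can be evaluated numerically. Here `C_INF` is the truncated value of the universal constant $c_{\infty}=0.16997\dots$, and the function name and its arguments `m` (training-sample size) and `d` (VC dimension) are our own labels:

```python
import math

# Truncated value of the universal constant c_infinity = 0.16997... from the abstract.
C_INF = 0.16997

def asymptotic_eer_lower_bound(m: int, d: int) -> float:
    """Asymptotic minimax excess-risk lower bound c_inf / sqrt(nu), with nu = m / d.

    The bound is exact only in the limit nu -> infinity; for small nu it is
    merely indicative (the paper shows it remains close even for small nu).
    """
    if m <= 0 or d <= 0:
        raise ValueError("sample size m and VC dimension d must be positive")
    nu = m / d
    return C_INF / math.sqrt(nu)

# Example: m = 1000 training samples, hypothesis class of VC dimension d = 10,
# so nu = 100 and the bound is c_inf / 10.
print(asymptotic_eer_lower_bound(1000, 10))
```

Note how the bound decays at the rate $1/\sqrt{\nu}$: quadrupling the sample size per VC dimension only halves the guaranteed excess risk.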

Article information

Ann. Statist., Volume 47, Number 5 (2019), 2822-2854.

Received: June 2016
Revised: December 2017
First available in Project Euclid: 3 August 2019


Subject classifications (MSC)

Primary: 68T05 (Learning and adaptive systems); 62C20 (Minimax procedures); 62C10 (Bayesian problems; characterization of Bayes procedures); 62C12 (Empirical decision procedures; empirical Bayes procedures); 62G20 (Asymptotic properties); 62H30 (Classification and discrimination; cluster analysis)
Secondary: 62G10 (Hypothesis testing); 91A35 (Decision theory for games); 60C05 (Combinatorial probability)

Keywords: PAC learning theory; classification; generalization error; minimax decision rules; Bayes decision rules; empirical estimators; binomial distribution


Kontorovich, Aryeh; Pinelis, Iosif. Exact lower bounds for the agnostic probably-approximately-correct (PAC) machine learning model. Ann. Statist. 47 (2019), no. 5, 2822--2854. doi:10.1214/18-AOS1766.
