Bernoulli, Volume 17, Number 2 (2011), 687–713.

Margin-adaptive model selection in statistical learning

Sylvain Arlot and Peter L. Bartlett



Abstract

A classical condition for fast learning rates is the margin condition, first introduced by Mammen and Tsybakov. In this paper, we tackle the problem of adaptivity to this condition in the context of model selection, in a general learning framework. In fact, we consider a weaker version of this condition, which allows one to take into account that learning within a small model can be much easier than learning within a large one. Requiring this “strong margin adaptivity” makes the model selection problem more challenging. Our first main result shows, in a general framework, that some penalization procedures (including those based on local Rademacher complexities) exhibit this adaptivity when the models are nested; in contrast with previous results, this holds with penalties that depend only on the data. Our second main result is that strong margin adaptivity is not always achievable when the models are not nested: for every model selection procedure (even a randomized one), there is a problem for which it fails to demonstrate strong margin adaptivity.
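To make the setting concrete, the following is a minimal, self-contained sketch of penalized model selection by empirical risk minimization with a data-dependent penalty. It is an illustration only, not the paper's procedure: the models are nested histogram classifiers on dyadic partitions of [0, 1], and the penalty is a simple Monte Carlo estimate of the *global* Rademacher complexity of each model (the paper works with localized complexities and a general framework). All function names (`fit_histogram`, `rademacher_penalty`, `select_model`) are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_histogram(x, y, d):
    # ERM over the model S_d: majority-vote classifier on the 2**d dyadic bins of [0, 1].
    # The models S_0 ⊂ S_1 ⊂ ... are nested, as in the paper's positive result.
    n_bins = 2 ** d
    bins = np.minimum((x * n_bins).astype(int), n_bins - 1)
    labels = np.ones(n_bins)  # arbitrary label for empty bins
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            labels[b] = 1.0 if y[mask].mean() >= 0 else -1.0
    def predict(xs):
        return labels[np.minimum((xs * n_bins).astype(int), n_bins - 1)]
    return predict

def empirical_risk(f, x, y):
    # 0-1 loss: fraction of misclassified points.
    return float(np.mean(f(x) != y))

def rademacher_penalty(x, d, n_rounds=20):
    # Data-driven penalty: Monte Carlo estimate of the global Rademacher
    # complexity of S_d, i.e. how well the model class can correlate with
    # random +/-1 labels drawn independently of the data.
    total = 0.0
    for _ in range(n_rounds):
        sigma = rng.choice([-1.0, 1.0], size=len(x))
        f = fit_histogram(x, sigma, d)  # ERM against the random signs
        total += float(np.mean(sigma * f(x)))
    return total / n_rounds

def select_model(x, y, max_d=6):
    # Penalized model selection: pick the model minimizing
    # empirical risk + penalty, with a penalty depending only on the data.
    scores = {d: empirical_risk(fit_histogram(x, y, d), x, y)
                 + rademacher_penalty(x, d)
              for d in range(max_d + 1)}
    return min(scores, key=scores.get)
```

The penalty grows with model size (a finer partition can fit random signs better), so the procedure trades data fit against complexity without knowing the noise level in advance; that trade-off is what margin adaptivity sharpens.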

Article information

First available in Project Euclid: 5 April 2011

Keywords: adaptivity; empirical minimization; empirical risk minimization; local Rademacher complexity; margin condition; model selection; oracle inequalities; statistical learning


Arlot, Sylvain; Bartlett, Peter L. Margin-adaptive model selection in statistical learning. Bernoulli 17 (2011), no. 2, 687--713. doi:10.3150/10-BEJ288.



  • [1] Arlot, S. (2007). Resampling and Model Selection. PhD thesis, University Paris-Sud 11, December 2007. Available at
  • [2] Arlot, S. (2008). V-fold cross-validation improved: V-fold penalization. Available at arXiv:0802.0566v2.
  • [3] Arlot, S. (2009). Model selection by resampling penalization. Electron. J. Stat. 3 557–624 (electronic).
  • [4] Arlot, S. and Massart, P. (2009). Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res. 10 245–279 (electronic).
  • [5] Audibert, J.-Y. (2004). Classification under polynomial entropy and margin assumptions and randomized estimators. Laboratoire de Probabilités et Modèles Aléatoires. Preprint.
  • [6] Audibert, J.-Y. and Tsybakov, A.B. (2007). Fast learning rates for plug-in classifiers. Ann. Statist. 35 608–633.
  • [7] Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301–413.
  • [8] Bartlett, P.L., Bousquet, O. and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist. 33 1497–1537.
  • [9] Bartlett, P.L., Jordan, M.I. and McAuliffe, J.D. (2006). Convexity, classification, and risk bounds. J. Amer. Statist. Assoc. 101 138–156.
  • [10] Bartlett, P.L., Mendelson, S. and Philips, P. (2004). Local complexities for empirical risk minimization. In Learning Theory. Lecture Notes in Comput. Sci. 3120 270–284. Berlin: Springer.
  • [11] Birgé, L. and Massart, P. (1998). Minimum contrast estimators on sieves: Exponential bounds and rates of convergence. Bernoulli 4 329–375.
  • [12] Blanchard, G., Lugosi, G. and Vayatis, N. (2004). On the rate of convergence of regularized boosting classifiers. J. Mach. Learn. Res. 4 861–894.
  • [13] Blanchard, G. and Massart, P. (2006). Discussion: “Local Rademacher complexities and oracle inequalities in risk minimization” [Ann. Statist. 34 (2006) 2593–2656] by V. Koltchinskii. Ann. Statist. 34 2664–2671.
  • [14] Devroye, L. and Lugosi, G. (1995). Lower bounds in pattern recognition and learning. Pattern Recognition 28 1011–1018.
  • [15] Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-validation. J. Amer. Statist. Assoc. 78 316–331.
  • [16] Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 2593–2656.
  • [17] Lecué, G. (2007). Simultaneous adaptation to the margin and to complexity in classification. Ann. Statist. 35 1698–1721.
  • [18] Lecué, G. (2007). Suboptimality of penalized empirical risk minimization in classification. In COLT 2007. Lecture Notes in Artificial Intelligence 4539. Berlin: Springer.
  • [19] Lugosi, G. (2002). Pattern classification and learning theory. In Principles of Nonparametric Learning (Udine, 2001). CISM Courses and Lectures 434 1–56. Vienna: Springer.
  • [20] Lugosi, G. and Wegkamp, M. (2004). Complexity regularization via localized random penalties. Ann. Statist. 32 1679–1697.
  • [21] Mammen, E. and Tsybakov, A.B. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808–1829.
  • [22] Massart, P. (2003). Concentration Inequalities and Model Selection. Lecture Notes in Mathematics 1896. Berlin: Springer.
  • [23] Massart, P. and Nédélec, É. (2006). Risk bounds for statistical learning. Ann. Statist. 34 2326–2366.
  • [24] Tsybakov, A.B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135–166.
  • [25] Tsybakov, A.B. and van de Geer, S.A. (2005). Square root penalty: Adaptation to the margin in classification and in edge estimation. Ann. Statist. 33 1203–1224.
  • [26] Vapnik, V.N. (1998). Statistical Learning Theory. New York: Wiley.
  • [27] Vapnik, V.N. and Cervonenkis, A.J. (1971). The uniform convergence of frequencies of the appearance of events to their probabilities. (Russian. English summary) Teor. Verojatnost. i Primenen. 16 264–279.