Bernoulli


Performance of empirical risk minimization in linear aggregation

Guillaume Lecué and Shahar Mendelson


Abstract

We study conditions under which, given a dictionary $F=\{f_{1},\ldots,f_{M}\}$ and an i.i.d. sample $(X_{i},Y_{i})_{i=1}^{N}$, the empirical risk minimizer $\tilde{f}^{\mathrm{ERM}}$ in $\operatorname{span}(F)$ relative to the squared loss satisfies, with high probability,

\[R(\tilde{f}^{\mathrm{ERM}})\leq\inf_{f\in\operatorname{span}(F)}R(f)+r_{N}(M),\]

where $R(\cdot)$ is the squared risk and $r_{N}(M)$ is of the order of $M/N$.

Among other results, we prove that a uniform small-ball estimate for functions in $\operatorname{span}(F)$ is enough to achieve that goal when the noise is independent of the design.
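As a purely illustrative sketch of the setup (not code from the paper): with the squared loss, ERM over $\operatorname{span}(F)$ reduces to ordinary least squares on the design matrix $\Phi_{ij}=f_{j}(X_{i})$. The dictionary, the data-generating model, and the sizes $N$ and $M$ below are hypothetical choices made only for the example.

```python
import numpy as np

# Minimal sketch of ERM in span(F) for the squared loss.
# All concrete choices (dictionary, regression function, noise level)
# are illustrative assumptions, not taken from the paper.

rng = np.random.default_rng(0)
N, M = 500, 10  # sample size and dictionary size

# Hypothetical dictionary: monomials f_j(x) = x^j, j = 0, ..., M-1.
dictionary = [lambda x, j=j: x ** j for j in range(M)]

# Hypothetical data: Y = f*(X) + noise, with noise independent of the design.
X = rng.uniform(-1.0, 1.0, size=N)
Y = np.sin(np.pi * X) + 0.1 * rng.standard_normal(N)

# Design matrix Phi[i, j] = f_j(X_i); minimizing the empirical squared risk
# (1/N) * sum_i (Y_i - sum_j a_j f_j(X_i))^2 over span(F) is least squares in a.
Phi = np.column_stack([f(X) for f in dictionary])
coef, *_ = np.linalg.lstsq(Phi, Y, rcond=None)

def f_erm(x):
    """Evaluate the empirical risk minimizer tilde{f}^ERM at the points x."""
    return np.column_stack([f(x) for f in dictionary]) @ coef

# In-sample squared risk of the ERM; the paper's result bounds the excess
# *population* risk R(f_erm) - inf_{f in span(F)} R(f) at the order M/N.
print("empirical squared risk:", np.mean((Y - f_erm(X)) ** 2))
print("order M/N:", M / N)
```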

Article information

Source
Bernoulli, Volume 22, Number 3 (2016), 1520-1534.

Dates
Received: March 2014
Revised: February 2015
First available in Project Euclid: 16 March 2016

Permanent link to this document
https://projecteuclid.org/euclid.bj/1458132990

Digital Object Identifier
doi:10.3150/15-BEJ701

Mathematical Reviews number (MathSciNet)
MR3474824

Zentralblatt MATH identifier
1346.60075

Keywords
aggregation theory; empirical processes; empirical risk minimization; learning theory

Citation

Lecué, Guillaume; Mendelson, Shahar. Performance of empirical risk minimization in linear aggregation. Bernoulli 22 (2016), no. 3, 1520--1534. doi:10.3150/15-BEJ701. https://projecteuclid.org/euclid.bj/1458132990


