The Annals of Statistics

Learning by mirror averaging

A. Juditsky, P. Rigollet, and A. B. Tsybakov

Abstract

Given a finite collection of estimators or classifiers, we study the problem of model selection type aggregation, that is, we construct a new estimator or classifier, called aggregate, which is nearly as good as the best among them with respect to a given risk criterion. We define our aggregate by a simple recursive procedure which solves an auxiliary stochastic linear programming problem related to the original nonlinear one and constitutes a special case of the mirror averaging algorithm. We show that the aggregate satisfies sharp oracle inequalities under some general assumptions. The results are applied to several problems including regression, classification and density estimation.
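The recursive procedure itself is not reproduced on this page, but with weights ranging over the simplex and an entropic mirror map, the mirror averaging recursion reduces to exponential weighting by cumulative losses, with the weight vectors then averaged across iterations. The sketch below is a minimal illustration of that reduction, not the authors' exact algorithm; the function name mirror_averaging_weights, the loss matrix losses, and the temperature parameter beta are assumptions introduced here for illustration.

    import numpy as np

    def mirror_averaging_weights(losses, beta=1.0):
        # Illustrative sketch of mirror averaging over a finite
        # collection of candidates (not the paper's exact algorithm).
        # losses: (n, M) array, losses[i, j] = loss of candidate j on
        #         observation i (e.g. squared error or negative
        #         log-likelihood).
        # beta:   temperature parameter controlling how sharply the
        #         weights concentrate on low-loss candidates.
        n, M = losses.shape
        cum_loss = np.zeros(M)   # cumulative losses of the M candidates
        avg_w = np.zeros(M)      # running sum of per-step weight vectors
        for i in range(n):
            # Exponential weights from the losses seen so far;
            # subtracting the max avoids overflow in exp.
            logits = -cum_loss / beta
            w = np.exp(logits - logits.max())
            avg_w += w / w.sum()
            cum_loss += losses[i]
        return avg_w / n         # averaged ("mirror-averaged") weights

    # Illustrative use: aggregate M = 3 candidates from n = 100
    # observed per-candidate losses (here, synthetic toy losses).
    rng = np.random.default_rng(0)
    toy_losses = rng.exponential(size=(100, 3))
    w_bar = mirror_averaging_weights(toy_losses, beta=1.0)
    # The aggregate is the convex combination sum_j w_bar[j] * f_j.

Averaging the weight vectors over the iterations, rather than keeping only the final exponential weights, is the step that gives the procedure its name and underlies the sharp oracle inequalities established in the paper.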

Article information

Source
Ann. Statist., Volume 36, Number 5 (2008), 2183–2206.

Dates
First available in Project Euclid: 13 October 2008

Permanent link to this document
https://projecteuclid.org/euclid.aos/1223908089

Digital Object Identifier
doi:10.1214/07-AOS546

Mathematical Reviews number (MathSciNet)
MR2458184

Zentralblatt MATH identifier
1274.62288

Subjects
Primary: 62G08: Nonparametric regression
Secondary: 62C20: Minimax procedures; 62G05: Estimation; 62G20: Asymptotic properties

Keywords
Learning; aggregation; oracle inequalities; mirror averaging; model selection; stochastic optimization

Citation

Juditsky, A.; Rigollet, P.; Tsybakov, A. B. Learning by mirror averaging. Ann. Statist. 36 (2008), no. 5, 2183–2206. doi:10.1214/07-AOS546. https://projecteuclid.org/euclid.aos/1223908089
