Volume 18, Number 3
We consider the problem of aggregating the elements of a possibly infinite dictionary to build a decision procedure that aims to minimize a given criterion. Along with the dictionary, an independent and identically distributed training sample is available, on which the performance of a given procedure can be tested. In a fairly general set-up, we establish an oracle inequality for the Mirror Averaging aggregate with any prior distribution. By choosing an appropriate prior, we apply this oracle inequality in the context of prediction under a sparsity assumption for the problems of regression with random design, density estimation and binary classification.