## Bernoulli

### On the optimality of the aggregate with exponential weights for low temperatures

#### Abstract

Given a finite class of functions $F$, the problem of aggregation is to construct a procedure whose risk is as close as possible to the risk of the best element in the class. A classical procedure (see [2], [7], [31]) is the aggregate with exponential weights (AEW), defined by

$\tilde{f}^{\mathrm{AEW}}=\sum_{f\in F}\widehat{\theta}(f)\,f,\qquad\text{where }\widehat{\theta}(f)=\frac{\exp(-(n/T)R_{n}(f))}{\sum_{g\in F}\exp(-(n/T)R_{n}(g))},$

where $T>0$ is called the temperature parameter and $R_{n}(\cdot)$ is an empirical risk.
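The weight assignment above is a softmax of the negated, rescaled empirical risks, so it can be sketched in a few lines. The following NumPy illustration is our own minimal sketch (the function names and the risk-centering step for numerical stability are ours, not from the paper):

```python
import numpy as np

def aew_weights(empirical_risks, n, T):
    """Exponential weights theta(f) proportional to exp(-(n/T) * R_n(f)).

    Subtracting the minimum risk before exponentiating leaves the
    normalized weights unchanged but avoids overflow for small T.
    """
    risks = np.asarray(empirical_risks, dtype=float)
    scores = -(n / T) * (risks - risks.min())
    w = np.exp(scores)
    return w / w.sum()

def aew_predict(predictions, empirical_risks, n, T):
    """Aggregate: f_AEW(x) = sum_f theta(f) * f(x).

    `predictions` has shape (|F|, m): one row of m predicted values
    per dictionary element; returns the m aggregated predictions.
    """
    theta = aew_weights(empirical_risks, n, T)
    return theta @ np.asarray(predictions, dtype=float)
```

Note that as $T \to 0$ the weights concentrate on the empirical risk minimizer, while large $T$ yields a near-uniform mixture over the dictionary.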

In this article, we study the optimality of the AEW in the regression model with random design and in the low-temperature regime. We prove three properties of the AEW.

First, we show that the AEW is a suboptimal aggregation procedure in expectation with respect to the quadratic risk when $T\leq c_{1}$, where $c_{1}$ is an absolute positive constant (the low-temperature regime), and that it is suboptimal in probability even for high temperatures.

Second, we show that as the cardinality of the dictionary grows, the behavior of the AEW might deteriorate: in the low-temperature regime, it might concentrate with high probability around elements of the dictionary whose risk exceeds that of the best function in the dictionary by at least an order of $1/\sqrt{n}$.

Third, we prove that if a geometric condition on the dictionary (the so-called “Bernstein condition”) is assumed, then the AEW is indeed optimal, both in high probability and in expectation, in the low-temperature regime. Moreover, under that assumption, the complexity term is essentially the logarithm of the cardinality of the set of “almost minimizers” rather than the logarithm of the cardinality of the entire dictionary. This result holds for small values of the temperature parameter, thus complementing an analogous result for high temperatures.
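The contrast between the two temperature regimes can be made concrete with a toy computation (our own illustration; the dictionary, sample size, and temperature values are chosen purely for demonstration and are not from the paper). Two dictionary elements whose empirical risks differ by order $1/\sqrt{n}$ receive essentially all-or-nothing weights at low temperature, but genuinely mixed weights at high temperature:

```python
import numpy as np

def weights(risks, n, T):
    # Exponential weights, centered at the minimum risk for stability.
    scores = -(n / T) * (risks - risks.min())
    w = np.exp(scores)
    return w / w.sum()

n = 10_000
# Two dictionary elements whose risks differ by exactly 1/sqrt(n) = 0.01.
risks = np.array([0.10, 0.10 + 1.0 / np.sqrt(n)])

low_T = weights(risks, n, T=0.5)    # n/T = 20000: weights collapse onto the minimizer
high_T = weights(risks, n, T=50.0)  # n/T = 200: a genuine mixture of both elements
```

At $T=0.5$ the score gap is $20000 \times 0.01 = 200$, so the second element's weight is of order $e^{-200}$ and the AEW behaves like empirical risk minimization; at $T=50$ the gap is only $2$, and both elements keep non-negligible weight.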

#### Article information

Source
Bernoulli, Volume 19, Number 2 (2013), 646-675.

Dates
First available in Project Euclid: 13 March 2013

https://projecteuclid.org/euclid.bj/1363192042

Digital Object Identifier
doi:10.3150/11-BEJ408

Mathematical Reviews number (MathSciNet)
MR3037168

Zentralblatt MATH identifier
06168767

#### Citation

Lecué, Guillaume; Mendelson, Shahar. On the optimality of the aggregate with exponential weights for low temperatures. Bernoulli 19 (2013), no. 2, 646--675. doi:10.3150/11-BEJ408. https://projecteuclid.org/euclid.bj/1363192042

#### References

• [1] Alquier, P. (2006). Transductive and inductive adaptative inference for density and regression estimation. Ph.D. thesis, Paris 6.
• [2] Audibert, J.-Y. (2004). PAC-Bayesian statistical learning theory. Ph.D. thesis, Paris 6.
• [3] Audibert, J.-Y. (2007). No fast exponential deviation inequalities for the progressive mixture rule. Technical report, CERTIS.
• [4] Audibert, J.-Y. (2009). Fast learning rates in statistical inference through aggregation. Ann. Statist. 37 1591–1646.
• [5] Bartlett, P.L. and Mendelson, S. (2006). Empirical minimization. Probab. Theory Related Fields 135 311–334.
• [6] Bunea, F., Tsybakov, A.B. and Wegkamp, M.H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697.
• [7] Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization. Lecture Notes in Math. 1851. Berlin: Springer. Lecture notes from the 31st Summer School on Probability Theory held in Saint-Flour, July 8–25, 2001.
• [8] Catoni, O. (2007). PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics Lecture Notes—Monograph Series 56. Beachwood, OH: IMS.
• [9] Dalalyan, A.S. and Tsybakov, A.B. (2007). Aggregation by exponential weighting and sharp oracle inequalities. In Learning Theory. Lecture Notes in Computer Science 4539 97–111. Berlin: Springer.
• [10] Emery, M., Nemirovski, A. and Voiculescu, D. (2000). Lectures on Probability Theory and Statistics. Lecture Notes in Math. 1738. Berlin: Springer. Lectures from the 28th Summer School on Probability Theory held in Saint-Flour, August 17–September 3, 1998, Edited by Pierre Bernard.
• [11] Gaïffas, S. and Lecué, G. (2007). Optimal rates and adaptation in the single-index model using aggregation. Electron. J. Stat. 1 538–573.
• [12] Giné, E. and Zinn, J. (1984). Some limit theorems for empirical processes (with discussion). Ann. Probab. 12 929–998.
• [13] Juditsky, A., Rigollet, P. and Tsybakov, A.B. (2008). Learning by mirror averaging. Ann. Statist. 36 2183–2206.
• [14] Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 2593–2656.
• [15] Lecué, G. (2007). Simultaneous adaptation to the margin and to complexity in classification. Ann. Statist. 35 1698–1721.
• [16] Lecué, G. and Mendelson, S. (2009). Aggregation via empirical risk minimization. Probab. Theory Related Fields 145 591–613.
• [17] Lecué, G. and Mendelson, S. (2010). On the optimality of the empirical risk minimization procedure for the convex aggregation problem. Unpublished manuscript.
• [18] Lecué, G. and Mendelson, S. (2010). Sharper lower bounds on the performance of the empirical risk minimization algorithm. Bernoulli 16 605–613.
• [19] Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Ergebnisse der Mathematik und Ihrer Grenzgebiete (3) [Results in Mathematics and Related Areas (3)] 23. Berlin: Springer.
• [20] Leung, G. and Barron, A.R. (2006). Information theory and mixing least-squares regressions. IEEE Trans. Inform. Theory 52 3396–3410.
• [21] Mendelson, S. (2008). Lower bounds for the empirical minimization algorithm. IEEE Trans. Inform. Theory 54 3797–3803.
• [22] Mendelson, S. (2008). Obtaining fast error rates in nonconvex situations. J. Complexity 24 380–397.
• [23] Petrov, V.V. (1995). Limit Theorems of Probability Theory: Sequences of Independent Random Variables. Oxford Studies in Probability 4. New York: Oxford Univ. Press.
• [24] Rao, M.M. and Ren, Z.D. (1991). Theory of Orlicz Spaces. Monographs and Textbooks in Pure and Applied Mathematics 146. New York: Dekker.
• [25] Samarov, A. and Tsybakov, A. (2007). Aggregation of density estimators and dimension reduction. In Advances in Statistical Modeling and Inference. Ser. Biostat. 3 233–251. Hackensack, NJ: World Sci. Publ.
• [26] Tsybakov, A. (2003). Optimal rate of aggregation. In Computational Learning Theory and Kernel Machines (COLT-2003). Lecture Notes in Artificial Intelligence 2777 303–313. Heidelberg: Springer.
• [27] Tsybakov, A.B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135–166.
• [28] Tsybakov, A.B. (2009). Introduction to Nonparametric Estimation. Springer Series in Statistics. New York: Springer. Revised and extended from the 2004 French original, Translated by Vladimir Zaiats.
• [29] van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. New York: Springer.
• [30] Yang, Y. (2000). Combining different procedures for adaptive regression. J. Multivariate Anal. 74 135–161.
• [31] Yang, Y. (2000). Mixing strategies for density estimation. Ann. Statist. 28 75–87.
• [32] Yang, Y. (2001). Adaptive regression by mixing. J. Amer. Statist. Assoc. 96 574–588.