## The Annals of Statistics

### On the exponentially weighted aggregate with the Laplace prior

#### Abstract

In this paper, we study the statistical behaviour of the Exponentially Weighted Aggregate (EWA) in the problem of high-dimensional regression with fixed design. Under the assumption that the underlying regression vector is sparse, it is reasonable to use the Laplace distribution as a prior. The resulting estimator, and in particular the instance of it known as the Bayesian lasso, has already been used in the statistical literature because of its computational convenience, even though no thorough mathematical analysis of its statistical properties had been carried out. The present work fills this gap by establishing sharp oracle inequalities for the EWA with the Laplace prior. These inequalities show that, when the temperature parameter is small, the EWA with the Laplace prior satisfies the same type of oracle inequality as the lasso estimator, as long as the quality of estimation is measured by the prediction loss. Extensions of the proposed methodology to the problem of prediction with low-rank matrices are also considered.
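Concretely, the EWA with the Laplace prior is the mean of the pseudo-posterior proportional to exp(-||Y - Xθ||²/τ) · exp(-λ||θ||₁), where τ is the temperature. In the spirit of the Langevin Monte-Carlo approach of Dalalyan and Tsybakov (2012b), it can be approximated by averaging the iterates of an unadjusted Langevin algorithm. The sketch below is purely illustrative: the synthetic data and the values of τ, λ, and the step size `h` are placeholders, not the paper's recommended choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sparse regression data (illustrative only)
n, p = 50, 20
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:3] = 2.0                       # 3-sparse signal
Y = X @ theta_star + 0.5 * rng.standard_normal(n)

tau, lam = 0.5, 2.0                        # temperature and Laplace prior scale (placeholders)

def grad_log_post(theta):
    """Gradient of the log pseudo-posterior -(1/tau)*||Y - X theta||^2 - lam*||theta||_1
    (using sign(theta) as a subgradient of the l1 term)."""
    return (2.0 / tau) * X.T @ (Y - X @ theta) - lam * np.sign(theta)

# Unadjusted Langevin algorithm; the EWA is approximated by averaging post-burn-in iterates.
h, n_iter, burn_in = 1e-3, 5000, 1000
theta = np.zeros(p)
running_sum = np.zeros(p)
for k in range(n_iter):
    theta = theta + h * grad_log_post(theta) + np.sqrt(2 * h) * rng.standard_normal(p)
    if k >= burn_in:
        running_sum += theta

theta_ewa = running_sum / (n_iter - burn_in)
```

The step size must be small relative to the inverse Lipschitz constant of the gradient (here driven by (2/τ)·λ_max(XᵀX)) for the discretized Langevin dynamics to remain stable; guarantees for such samplers are studied in Dalalyan (2017) and Durmus and Moulines (2016), cited below.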

#### Article information

Source
Ann. Statist., Volume 46, Number 5 (2018), 2452-2478.

Dates
Revised: July 2017
First available in Project Euclid: 17 August 2018

https://projecteuclid.org/euclid.aos/1534492841

Digital Object Identifier
doi:10.1214/17-AOS1626

Mathematical Reviews number (MathSciNet)
MR3845023

Zentralblatt MATH identifier
06964338

Subjects
Primary: 62J05: Linear regression
Secondary: 62H12: Estimation

#### Citation

Dalalyan, Arnak S.; Grappin, Edwin; Paris, Quentin. On the exponentially weighted aggregate with the Laplace prior. Ann. Statist. 46 (2018), no. 5, 2452--2478. doi:10.1214/17-AOS1626. https://projecteuclid.org/euclid.aos/1534492841

#### References

• Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automat. Control AC-19 716–723.
• Alquier, P. (2013). Bayesian methods for low-rank matrix estimation: Short survey and theoretical study. In 24th International Conference, ALT 2013, Singapore. Proceedings 309–323. Springer, Berlin, Heidelberg.
• Alquier, P. and Biau, G. (2013). Sparse single-index model. J. Mach. Learn. Res. 14 243–280.
• Alquier, P. and Lounici, K. (2011). PAC-Bayesian bounds for sparse regression estimation with exponential weights. Electron. J. Stat. 5 127–145.
• Arias-Castro, E. and Lounici, K. (2014). Estimation and variable selection with exponential weights. Electron. J. Stat. 8 328–354.
• Audibert, J.-Y. (2009). Fast learning rates in statistical inference through aggregation. Ann. Statist. 37 1591–1646.
• Bellec, P. C., Dalalyan, A. S., Grappin, E. and Paris, Q. (2016). On the prediction loss of the lasso in the partially labeled setting. Technical report, arXiv:1606.06179.
• Bellec, P. C., Lecué, G. and Tsybakov, A. B. (2016). Slope meets lasso: Improved oracle bounds and optimality. Technical report, arXiv:1605.08651.
• Belloni, A., Chernozhukov, V. and Wang, L. (2014). Pivotal estimation via square-root lasso in nonparametric regression. Ann. Statist. 42 757–788.
• Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732.
• Bobkov, S. and Madiman, M. (2011). Concentration of the information in data with log-concave distributions. Ann. Probab. 39 1528–1543.
• Bogdan, M., van den Berg, E., Sabatti, C., Su, W. and Candès, E. J. (2015). SLOPE—Adaptive variable selection via convex optimization. Ann. Appl. Stat. 9 1103–1140.
• Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Heidelberg.
• Bunea, F., She, Y. and Wegkamp, M. H. (2011). Optimal selection of reduced rank estimators of high-dimensional matrices. Ann. Statist. 39 1282–1309.
• Bunea, F., Tsybakov, A. and Wegkamp, M. (2007). Sparsity oracle inequalities for the lasso. Electron. J. Stat. 1 169–194.
• Candès, E. J. and Plan, Y. (2011). Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Trans. Inform. Theory 57 2342–2359.
• Candès, E. J. and Tao, T. (2007). The Dantzig selector: Statistical estimation when $p$ is much larger than $n$. Ann. Statist. 35 2313–2351.
• Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory 56 2053–2080.
• Castillo, I., Schmidt-Hieber, J. and van der Vaart, A. (2015). Bayesian linear regression with sparse priors. Ann. Statist. 43 1986–2018.
• Castillo, I. and van der Vaart, A. (2012). Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. Ann. Statist. 40 2069–2101.
• Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization. Lecture Notes in Math. 1851. Springer, Berlin.
• Catoni, O. (2007). Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Lecture Notes–Monograph Series 56. IMS, Beachwood, OH.
• Chernousova, E., Golubev, Y. and Krymova, E. (2013). Ordered smoothers with exponential weighting. Electron. J. Stat. 7 2395–2419.
• Chesneau, C. and Lecué, G. (2009). Adapting to unknown smoothness by aggregation of thresholded wavelet estimators. Statist. Sinica 19 1407–1417.
• Cottet, V. and Alquier, P. (2016). 1-bit matrix completion: PAC-Bayesian analysis of a variational approximation. Technical report, arXiv:1604.04191.
• Dai, D., Rigollet, P., Xia, L. and Zhang, T. (2014). Aggregation of affine estimators. Electron. J. Stat. 8 302–327.
• Dalalyan, A. S. (2017). Theoretical guarantees for approximate sampling from a smooth and log-concave density. J. R. Stat. Soc. Ser. B. Stat. Methodol. 79 651–676.
• Dalalyan, A. S., Grappin, E. and Paris, Q. (2018). Supplement to “On the exponentially weighted aggregate with the Laplace prior.” DOI:10.1214/17-AOS1626SUPP.
• Dalalyan, A. S., Hebiri, M. and Lederer, J. (2017). On the prediction performance of the Lasso. Bernoulli 23 552–581.
• Dalalyan, A., Ingster, Y. and Tsybakov, A. B. (2014). Statistical inference in compound functional models. Probab. Theory Related Fields 158 513–532.
• Dalalyan, A. S. and Salmon, J. (2012). Sharp oracle inequalities for aggregation of affine estimators. Ann. Statist. 40 2327–2355.
• Dalalyan, A. S. and Tsybakov, A. B. (2007). Aggregation by exponential weighting and sharp oracle inequalities. In Learning Theory. Lecture Notes in Computer Science 4539 97–111. Springer, Berlin.
• Dalalyan, A. S. and Tsybakov, A. B. (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Mach. Learn. 72 39–61.
• Dalalyan, A. S. and Tsybakov, A. B. (2012a). Mirror averaging with sparsity priors. Bernoulli 18 914–944.
• Dalalyan, A. S. and Tsybakov, A. B. (2012b). Sparse regression learning by aggregation and Langevin Monte-Carlo. J. Comput. System Sci. 78 1423–1443.
• Donoho, D. L. and Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90 1200–1224.
• Durmus, A. and Moulines, E. (2016). Sampling from strongly log-concave distributions with the Unadjusted Langevin Algorithm. Technical report, arXiv:1605.01559.
• Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
• Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics 35 109–135.
• Fu, W. J. (1998). Penalized regressions: The bridge versus the lasso. J. Comput. Graph. Statist. 7 397–416.
• Gaïffas, S. and Lecué, G. (2007). Optimal rates and adaptation in the single-index model using aggregation. Electron. J. Stat. 1 538–573.
• Gaïffas, S. and Lecué, G. (2011). Sharp oracle inequalities for high-dimensional matrix prediction. IEEE Trans. Inform. Theory 57 6942–6957.
• Gao, C., van der Vaart, A. W. and Zhou, H. H. (2015). A general framework for Bayes structured linear models. Technical report, arXiv:1506.02174.
• Giraud, C. (2015). Introduction to High-Dimensional Statistics. CRC Press, Boca Raton, FL.
• Golubev, Y. and Ostrovski, D. (2014). Concentration inequalities for the exponential weighting method. Math. Methods Statist. 23 20–37.
• Guedj, B. and Alquier, P. (2013). PAC-Bayesian estimation and prediction in sparse additive models. Electron. J. Stat. 7 264–291.
• Hans, C. (2009). Bayesian lasso regression. Biometrika 96 835–845.
• Harchaoui, Z., Douze, M., Paulin, M., Dudik, M. and Malick, J. (2012). Large-scale image classification with trace-norm regularization. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on 3386–3393.
• Hoffmann, M., Rousseau, J. and Schmidt-Hieber, J. (2015). On adaptive posterior concentration rates. Ann. Statist. 43 2259–2295.
• Juditsky, A., Rigollet, P. and Tsybakov, A. B. (2008). Learning by mirror averaging. Ann. Statist. 36 2183–2206.
• Klopp, O. (2014). Noisy low-rank matrix completion with general sampling distribution. Bernoulli 20 282–303.
• Koltchinskii, V. (2009). Sparse recovery in convex hulls via entropy penalization. Ann. Statist. 37 1332–1359.
• Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École D'Été de Probabilités de Saint-Flour XXXVIII—2008 38. Springer.
• Koltchinskii, V., Lounici, K. and Tsybakov, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Statist. 39 2302–2329.
• Lecué, G. and Mendelson, S. (2013). On the optimality of the aggregate with exponential weights for low temperatures. Bernoulli 19 646–675.
• Leung, G. and Barron, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Trans. Inform. Theory 52 3396–3410.
• Lim, Y. J. and Teh, Y. W. (2007). Variational Bayesian approach to movie rating prediction. In KDD-cup-2007, Proceedings.
• Mai, T. T. and Alquier, P. (2015). A Bayesian approach for noisy matrix completion: Optimal rate under general sampling distribution. Electron. J. Stat. 9 823–841.
• Mallows, C. L. (1973). Some comments on $C_{P}$. Technometrics 15 661–675.
• McAllester, D. A. (1998). Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (Madison, WI, 1998) 230–234. ACM, New York.
• Negahban, S. and Wainwright, M. J. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Statist. 39 1069–1097.
• Negahban, S. and Wainwright, M. J. (2012). Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. J. Mach. Learn. Res. 13 1665–1697.
• Park, T. and Casella, G. (2008). The Bayesian lasso. J. Amer. Statist. Assoc. 103 681–686.
• Rigollet, P. and Tsybakov, A. (2011). Exponential screening and optimal rates of sparse estimation. Ann. Statist. 39 731–771.
• Rohde, A. and Tsybakov, A. B. (2011). Estimation of high-dimensional low-rank matrices. Ann. Statist. 39 887–930.
• Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464.
• Shen, X. and Wu, Y. (2012). A unified approach to salient object detection via low rank matrix recovery. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on 853–860.
• Srebro, N. and Shraibman, A. (2005). Rank, trace-norm and max-norm. In 18th Annual Conference on Learning Theory, COLT 2005. Proceedings (P. Auer and R. Meir, eds.) 545–560.
• Su, W. and Candès, E. (2016). SLOPE is adaptive to unknown sparsity and asymptotically minimax. Ann. Statist. 44 1038–1068.
• Sun, T. and Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika 99 879–898.
• Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
• Tibshirani, R. J. and Taylor, J. (2012). Degrees of freedom in lasso problems. Ann. Statist. 40 1198–1232.
• Tsybakov, A. B. (2014). Aggregation and minimax optimality in high-dimensional estimation. In Proceedings of the International Congress of Mathematicians (Seoul, August 2014) 3 225–246.
• van de Geer, S. (2016). Estimation and Testing Under Sparsity: École D’Été de Probabilités de Saint-Flour XLV—2015. Springer, Berlin.
• van de Geer, S. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the lasso. Electron. J. Stat. 3 1360–1392.
• van der Pas, S. L., Salomond, J.-B. and Schmidt-Hieber, J. (2016). Conditions for posterior contraction in the sparse normal means problem. Electron. J. Stat. 10 976–1000.
• Vovk, V. G. (1990). Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory, COLT 1990 371–386.
• Wipf, D. P., Palmer, J. A. and Rao, B. D. (2003). Perspectives on sparse Bayesian learning. In Advances in Neural Information Processing Systems 16 249–256.
• Yang, Y. (2000a). Adaptive estimation in pattern recognition by combining different procedures. Statist. Sinica 10 1069–1089.
• Yang, Y. (2000b). Combining different procedures for adaptive regression. J. Multivariate Anal. 74 135–161.
• Yang, Y. (2000c). Mixing strategies for density estimation. Ann. Statist. 28 75–87.
• Yang, Y. (2001). Adaptive regression by mixing. J. Amer. Statist. Assoc. 96 574–588.
• Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B. Stat. Methodol. 68 49–67.
• Yuditskiĭ, A. B., Nazin, A. V., Tsybakov, A. B. and Vayatis, N. (2005). Recursive aggregation of estimators by the mirror descent method with averaging. Problemy Peredachi Informatsii 41 78–96.
• Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38 894–942.
• Zhou, Y., Wilkinson, D., Schreiber, R. and Pan, R. (2008). Large-scale parallel collaborative filtering for the Netflix prize. In AAIM 2008. Proceedings 337–348. Springer, Berlin, Heidelberg.
• Zou, H., Hastie, T. and Tibshirani, R. (2007). On the “degrees of freedom” of the lasso. Ann. Statist. 35 2173–2192.

#### Supplemental materials

• Supplement to “On the exponentially weighted aggregate with the Laplace prior”. The proof of equation (10), as well as the proofs of the results of Section 5, are gathered in the Supplementary Material.