## Statistical Science

### Sparse Estimation by Exponential Weighting

#### Abstract

Consider a regression model with fixed design and Gaussian noise where the regression function can potentially be well approximated by a function that admits a sparse representation in a given dictionary. This paper resorts to exponential weights to exploit this underlying sparsity by implementing the principle of sparsity pattern aggregation. This model selection take on sparse estimation allows us to derive sparsity oracle inequalities in several popular frameworks, including ordinary sparsity, fused sparsity and group sparsity. One striking aspect of these theoretical results is that they hold under no condition in the dictionary. Moreover, we describe an efficient implementation of the sparsity pattern aggregation principle that compares favorably to state-of-the-art procedures on some basic numerical examples.

#### Article information

Source
Statist. Sci., Volume 27, Number 4 (2012), 558-575.

Dates
First available in Project Euclid: 21 December 2012

https://projecteuclid.org/euclid.ss/1356098556

Digital Object Identifier
doi:10.1214/12-STS393

Mathematical Reviews number (MathSciNet)
MR3025134

Zentralblatt MATH identifier
1331.62351

#### Citation

Rigollet, Philippe; Tsybakov, Alexandre B. Sparse Estimation by Exponential Weighting. Statist. Sci. 27 (2012), no. 4, 558--575. doi:10.1214/12-STS393. https://projecteuclid.org/euclid.ss/1356098556

#### References

• Alquier, P. and Lounici, K. (2011). PAC-Bayesian bounds for sparse regression estimation with exponential weights. Electron. J. Stat. 5 127–145.
• Baraniuk, R., Davenport, M., DeVore, R. and Wakin, M. (2008). A simple proof of the restricted isometry property for random matrices. Constr. Approx. 28 253–263.
• Bartlett, P. L., Boucheron, S. and Lugosi, G. (2002). Model selection and error estimation. Mach. Learn. 48 85–113.
• Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732.
• Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. (JEMS) 3 203–268.
• Breheny, P. and Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat. 5 232–253.
• Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697.
• Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when $p$ is much larger than $n$. Ann. Statist. 35 2313–2351.
• Catoni, O. (1999). Universal aggregation rules with exact bias bounds. Technical report, Laboratoire de Probabilités et Modeles Aléatoires, Preprint 510.
• Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization. Lecture Notes in Math. 1851. Springer, Berlin.
• Catoni, O. (2007). Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics Lecture Notes—Monograph Series 56. IMS, Beachwood, OH.
• Dalalyan, A. S. and Salmon, J. (2011). Sharp oracle inequalities for aggregation of affine estimators. Available at ArXiv:1104.3969.
• Dalalyan, A. S. and Tsybakov, A. B. (2007). Aggregation by exponential weighting and sharp oracle inequalities. In Learning Theory. Lecture Notes in Computer Science 4539 97–111. Springer, Berlin.
• Dalalyan, A. and Tsybakov, A. B. (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Mach. Learn. 72 39–61.
• Dalalyan, A. and Tsybakov, A. B. (2012). Mirror averaging with sparsity priors. Bernoulli 18 914–944.
• Dalalyan, A. S. and Tsybakov, A. B. (2012). Sparse regression learning by aggregation and Langevin Monte-Carlo. J. Comput. System Sci. 78 1423–1443.
• Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
• Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33 1–22.
• Gerchinovitz, S. (2011). Prediction of individual sequences and prediction in the statistical framework: Some links around sparse regression and aggregation techniques. Ph.D. thesis, Univ. Paris Sud—Paris XI.
• Giraud, C. (2007). Mixing least-squares estimators when the variance is unknown. Available at arXiv:0711.0372.
• Giraud, C. (2008). Mixing least-squares estimators when the variance is unknown. Bernoulli 14 1089–1107.
• Huang, J. and Zhang, T. (2010). The benefit of group sparsity. Ann. Statist. 38 1978–2004.
• Juditsky, A., Rigollet, P. and Tsybakov, A. B. (2008). Learning by mirror averaging. Ann. Statist. 36 2183–2206.
• Juditsky, A. B., Nazin, A. V., Tsybakov, A. B. and Vayatis, N. (2005). Recursive aggregation of estimators by the mirror descent method with averaging. Probl. Inf. Transm. 41 368–384.
• Kneip, A. (1994). Ordered linear smoothers. Ann. Statist. 22 835–866.
• Koltchinskii, V., Lounici, K. and Tsybakov, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Statist. 39 2302–2329.
• Lecué, G. (2007). Simultaneous adaptation to the margin and to complexity in classification. Ann. Statist. 35 1698–1721.
• Lecué, G. (2012). Empirical risk minimization is optimal for the Convex aggregation problem. Bernoulli. To appear.
• Leung, G. and Barron, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Trans. Inform. Theory 52 3396–3410.
• Lounici, K., Pontil, M., van de Geer, S. and Tsybakov, A. B. (2011). Oracle inequalities and optimal inference under group sparsity. Ann. Statist. 39 2164–2204.
• Lugosi, G. and Wegkamp, M. (2004). Complexity regularization via localized random penalties. Ann. Statist. 32 1679–1697.
• Nemirovski, A. (2000). Topics in non-parametric statistics. In Lectures on Probability Theory and Statistics (Saint-Flour, 1998). Lecture Notes in Math. 1738 85–277. Springer, Berlin.
• Rigollet, P. (2012). Kullback–Leibler aggregation and misspecified generalized linear models. Ann. Statist. 40 639–665.
• Rigollet, P. and Tsybakov, A. B. (2007). Linear and convex aggregation of density estimators. Math. Methods Statist. 16 260–280.
• Rigollet, P. and Tsybakov, A. (2011). Exponential screening and optimal rates of sparse estimation. Ann. Statist. 39 731–771.
• Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods, 2nd ed. Springer, New York.
• Rudin, L. I., Osher, S. and Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena 60 259–268.
• Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 91–108.
• Tsybakov, A. B. (2003). Optimal rates of aggregation. In COLT (B. Schölkopf and M. K. Warmuth, eds.). Lecture Notes in Computer Science 2777 303–313. Springer, Berlin.
• Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer, New York.
• Wegkamp, M. (2003). Model selection in nonparametric regression. Ann. Statist. 31 252–273.
• Yang, Y. (2004). Aggregating regression procedures to improve performance. Bernoulli 10 25–47.
• Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49–67.
• Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38 894–942.