Electronic Journal of Statistics

Estimation and variable selection with exponential weights

Ery Arias-Castro and Karim Lounici

Full-text: Open access


In the context of a linear model with a sparse coefficient vector, exponential weights methods have been shown to achieve oracle inequalities for denoising/prediction. We show that such methods also succeed at variable selection and estimation under a near-minimum condition on the design matrix, rather than the much stronger assumptions required by other methods such as the Lasso or the Dantzig Selector. The same analysis yields consistency results for Bayesian methods and BIC-type variable selection under similar conditions.
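As a rough illustration of the kind of procedure the abstract refers to, the following sketch weights every small candidate support by the exponential of its negative residual sum of squares, with a per-variable penalty standing in for a sparsity-favoring prior. It is a generic toy version under stated assumptions, not the estimator analyzed in the paper: the names `exponential_weights`, `temperature`, and `penalty` are hypothetical, the tuning values are arbitrary, and supports are enumerated exhaustively, which is only feasible for very small problems.

```python
import itertools
import math
import random


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def rss_on_support(X, y, S):
    """Least-squares fit using only the columns in S; returns (coefficients, RSS)."""
    if not S:
        return [], sum(v * v for v in y)
    G = [[sum(row[i] * row[j] for row in X) for j in S] for i in S]  # Gram matrix
    g = [sum(row[i] * yi for row, yi in zip(X, y)) for i in S]
    coef = solve(G, g)
    rss = sum((yi - sum(c * row[j] for c, j in zip(coef, S))) ** 2
              for row, yi in zip(X, y))
    return coef, rss


def exponential_weights(X, y, max_size, temperature, penalty):
    """Aggregate least-squares fits over all supports of size <= max_size.

    Each support S gets weight proportional to exp(-RSS(S)/temperature - penalty*|S|).
    Returns (aggregated coefficient vector, highest-weight support, weights).
    """
    p = len(X[0])
    fits = []
    for k in range(max_size + 1):
        for S in itertools.combinations(range(p), k):
            coef, rss = rss_on_support(X, y, S)
            fits.append((S, coef, -rss / temperature - penalty * k))
    top = max(lw for _, _, lw in fits)  # subtract the max for numerical stability
    Z = sum(math.exp(lw - top) for _, _, lw in fits)
    weights = {S: math.exp(lw - top) / Z for S, _, lw in fits}
    beta = [0.0] * p
    for S, coef, _ in fits:
        for j, c in zip(S, coef):
            beta[j] += weights[S] * c
    best = max(weights, key=weights.get)
    return beta, best, weights


# Toy data (assumed for illustration): 6 variables, true support {1, 4}.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(40)]
y = [3 * row[1] - 2 * row[4] + rng.gauss(0, 0.5) for row in X]
beta_hat, support_hat, _ = exponential_weights(
    X, y, max_size=3, temperature=1.0, penalty=math.log(6))
print("selected support:", support_hat)
print("aggregated estimate:", [round(b, 2) for b in beta_hat])
```

The aggregated vector plays the role of the estimator and the highest-weight support plays the role of the selected model. For larger problems the exhaustive loop would have to be replaced by a sampling scheme over supports, such as the Gibbs sampler named in the keywords.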

Article information

Electron. J. Statist., Volume 8, Number 1 (2014), 328-354.

First available in Project Euclid: 18 April 2014

Primary: 62J99: None of the above, but in this section

Keywords: estimation; variable selection; model selection; sparse linear model; exponential weights; Gibbs sampler; identifiability condition


Arias-Castro, Ery; Lounici, Karim. Estimation and variable selection with exponential weights. Electron. J. Statist. 8 (2014), no. 1, 328--354. doi:10.1214/14-EJS883. https://projecteuclid.org/euclid.ejs/1397826704


