The Annals of Statistics

Exponential Screening and optimal rates of sparse estimation

Philippe Rigollet and Alexandre Tsybakov


Abstract

In high-dimensional linear regression, the goal pursued here is to estimate an unknown regression function using linear combinations of a suitable set of covariates. One of the key assumptions for the success of any statistical procedure in this setup is that the linear combination is sparse in some sense, for example, that it involves only a few covariates. We consider a general, not necessarily linear, regression with Gaussian noise and study a related question, namely, to find a linear combination of approximating functions which is at the same time sparse and has small mean squared error (MSE). We introduce a new estimation procedure, called Exponential Screening, that shows remarkable adaptation properties. It adapts to the linear combination that optimally balances MSE and sparsity, whether the latter is measured in terms of the number of nonzero entries in the combination (ℓ0 norm) or in terms of the global weight of the combination (ℓ1 norm). The power of this adaptation result is illustrated by showing that Exponential Screening solves optimally and simultaneously all the problems of aggregation in Gaussian regression that have been discussed in the literature. Moreover, we show that the performance of the Exponential Screening estimator cannot be improved in a minimax sense, even if the optimal sparsity is known in advance. The theoretical and numerical superiority of Exponential Screening compared to state-of-the-art sparse procedures is also discussed.
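
To make the aggregation idea concrete, the following is a minimal numerical sketch of exponential-weights aggregation over sparse least-squares fits, in the spirit of the procedure described above but not a reproduction of it. The function name exponential_weights_aggregate, the temperature parameter beta, the support-size cap max_size, and the sparsity-favoring prior are all illustrative placeholders; the exact weights and prior of the Exponential Screening estimator are specified in the article.

```python
import itertools
import numpy as np


def exponential_weights_aggregate(X, y, sigma, max_size=3, beta=4.0):
    """Average least-squares fits over small supports with exponential weights.

    Toy illustration only: `beta`, `max_size` and the prior below are
    placeholders, not the constants used in the paper.
    """
    n, p = X.shape
    supports = [()]  # the empty support corresponds to the zero fit
    for k in range(1, max_size + 1):
        supports.extend(itertools.combinations(range(p), k))

    fits, log_w = [], []
    for M in supports:
        theta = np.zeros(p)
        if M:
            cols = list(M)
            # least-squares fit restricted to the support M
            theta[cols], *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        resid = y - X @ theta
        # sparsity-favoring prior, roughly proportional to p^{-|M|}
        log_prior = -len(M) * np.log(p)
        log_w.append(-(resid @ resid) / (beta * sigma ** 2) + log_prior)
        fits.append(X @ theta)

    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return np.array(fits).T @ w  # length-n vector of aggregated fitted values
```

In this sketch the weights concentrate on supports that trade off residual fit against support size, which mirrors the MSE/sparsity balance described in the abstract.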

Article information

Source
Ann. Statist., Volume 39, Number 2 (2011), 731-771.

Dates
First available in Project Euclid: 9 March 2011

Permanent link to this document
https://projecteuclid.org/euclid.aos/1299680953

Digital Object Identifier
doi:10.1214/10-AOS854

Mathematical Reviews number (MathSciNet)
MR2816337

Zentralblatt MATH identifier
1215.62043

Subjects
Primary: 62G08: Nonparametric regression
Secondary: 62G05: Estimation; 62J05: Linear regression; 62C20: Minimax procedures; 62G20: Asymptotic properties

Keywords
High-dimensional regression; aggregation; adaptation; sparsity; sparsity oracle inequalities; minimax rates; Lasso; BIC

Citation

Rigollet, Philippe; Tsybakov, Alexandre. Exponential Screening and optimal rates of sparse estimation. Ann. Statist. 39 (2011), no. 2, 731--771. doi:10.1214/10-AOS854. https://projecteuclid.org/euclid.aos/1299680953

References

  • Abramovich, F., Benjamini, Y., Donoho, D. L. and Johnstone, I. M. (2006). Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist. 34 584–653.
  • Alquier, P. and Lounici, K. (2010). PAC-Bayesian bounds for sparse regression estimation with exponential weights. HAL.
  • Baraniuk, R., Davenport, M., DeVore, R. and Wakin, M. (2008). A simple proof of the restricted isometry property for random matrices. Constr. Approx. 28 253–263.
  • Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39 930–945.
  • Bickel, P., Ritov, Y. and Tsybakov, A. (2008). Hierarchical selection of variables in sparse high-dimensional regression. Available at ArXiv:0801.1158.
  • Bickel, P. J. and Doksum, K. A. (2006). Mathematical Statistics: Basic Ideas and Selected Topics, Vol. 1, 2nd ed. Prentice-Hall, Upper Saddle River, NJ.
  • Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist. 37 1705–1732.
  • Bühlmann, P. and van de Geer, S. (2009). On the conditions used to prove oracle results for the Lasso. Electron. J. Statist. 3 1360–1392.
  • Bunea, F., Tsybakov, A. and Wegkamp, M. (2007a). Sparsity oracle inequalities for the Lasso. Electron. J. Statist. 1 169–194 (electronic).
  • Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. (2007b). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697.
  • Candes, E. (2008). The restricted isometry property and its implications for compressed sensing. C. R. Math. Acad. Sci. Paris 346 589–592.
  • Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist. 35 2313–2351.
  • Dalalyan, A. and Tsybakov, A. (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning 72 39–61.
  • Dalalyan, A. and Tsybakov, A. B. (2010). Mirror averaging with sparsity priors. Available at ArXiv:1003.1189.
  • Dalalyan, A. S. and Tsybakov, A. B. (2007). Aggregation by exponential weighting and sharp oracle inequalities. In Learning Theory. Lecture Notes in Computer Science 4539 97–111. Springer, Berlin.
  • Dalalyan, A. S. and Tsybakov, A. B. (2009). Sparse regression learning by aggregation and Langevin Monte Carlo. Available at ArXiv:0903.1223.
  • Donoho, D. L. and Johnstone, I. M. (1994a). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81 425–455.
  • Donoho, D. L. and Johnstone, I. M. (1994b). Minimax risk over lp-balls for lq-error. Probab. Theory Related Fields 99 277–303.
  • Donoho, D. L., Johnstone, I. M., Hoch, J. C. and Stern, A. S. (1992). Maximum entropy and the nearly black object. J. Roy. Statist. Soc. Ser. B 54 41–81. With discussion and a reply by the authors.
  • Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
  • Foster, D. P. and George, E. I. (1994). The risk inflation criterion for multiple regression. Ann. Statist. 22 1947–1975.
  • George, E. I. (1986a). Combining minimax shrinkage estimators. J. Amer. Statist. Assoc. 81 437–445.
  • George, E. I. (1986b). Minimax multiple shrinkage estimation. Ann. Statist. 14 188–205.
  • Giraud, C. (2008). Mixing least-squares estimators when the variance is unknown. Bernoulli 14 1089–1107.
  • Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
  • Koh, K., Kim, S.-J. and Boyd, S. (2008). l1_ls: A Matlab solver for l1-regularized least squares problems. BETA version. Stanford Univ. Available at http://www.stanford.edu/~boyd/l1_ls.
  • Koltchinskii, V. (2010). Oracle inequalities in empirical risk minimization and sparse recovery problems. St. Flour Lecture Notes. To appear.
  • Koltchinskii, V. (2009a). The Dantzig selector and sparsity oracle inequalities. Bernoulli 15 799–828.
  • Koltchinskii, V. (2009b). Sparsity in penalized empirical risk minimization. Ann. Inst. H. Poincaré Probab. Statist. 45 7–57.
  • LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. and Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems 396–404. Morgan Kaufmann, San Francisco.
  • Leung, G. and Barron, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Trans. Inform. Theory 52 3396–3410.
  • Lounici, K. (2007). Generalized mirror averaging and D-convex aggregation. Math. Methods Statist. 16 246–259.
  • Nemirovski, A. (2000). Topics in non-parametric statistics. In Lectures on Probability Theory and Statistics (Saint-Flour, 1998). Lecture Notes in Math. 1738 85–277. Springer, Berlin.
  • Raskutti, G., Wainwright, M. J. and Yu, B. (2009). Minimax rates of estimation for high-dimensional linear regression over lq-balls. Available at ArXiv:0910.2042.
  • Reynaud-Bouret, P. (2003). Adaptive estimation of the intensity of inhomogeneous Poisson processes via concentration inequalities. Probab. Theory Related Fields 126 103–153.
  • Rigollet, P. (2009). Maximum likelihood aggregation and misspecified generalized linear models. Technical report. Available at ArXiv:0911.2919.
  • Rigollet, P. and Tsybakov, A. (2010). Exponential screening and optimal rates of sparse estimation. Available at ArXiv:1003.2654v3.
  • Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer, New York.
  • Tsybakov, A. B. (2003). Optimal rates of aggregation. In COLT (B. Schölkopf and M. K. Warmuth, eds.). Lecture Notes in Computer Science 2777 303–313. Springer, Berlin.
  • Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer, New York.
  • van de Geer, S. A. (2008). High-dimensional generalized linear models and the Lasso. Ann. Statist. 36 614–645.
  • Wright, J., Yang, A. Y., Ganesh, A., Sastry, S. S. and Ma, Y. (2009). Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell. 31 210–227.
  • Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38 894–942.
  • Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the LASSO selection in high-dimensional linear regression. Ann. Statist. 36 1567–1594.
  • Zhang, C.-H. and Melnik, O. (2009). Plus: Penalized linear unbiased selection. R package version 0.8. CRAN. Available at http://CRAN.R-project.org/package=plus.
  • Zhang, T. (2009). Some sharp performance bounds for least squares regression with L1 regularization. Ann. Statist. 37 2109–2144.