Bernoulli, Volume 19, Number 2 (2013), 521-547

Least squares after model selection in high-dimensional sparse models

Alexandre Belloni and Victor Chernozhukov

Abstract

In this article we study post-model selection estimators that apply ordinary least squares (OLS) to the model selected by first-step penalized estimators, typically Lasso. It is well known that Lasso can estimate the nonparametric regression function at nearly the oracle rate, and is thus hard to improve upon. We show that the OLS post-Lasso estimator performs at least as well as Lasso in terms of the rate of convergence, and has the advantage of a smaller bias. Remarkably, this performance occurs even if the Lasso-based model selection “fails” in the sense of missing some components of the “true” regression model. By the “true” model, we mean the best $s$-dimensional approximation to the nonparametric regression function chosen by the oracle. Furthermore, the OLS post-Lasso estimator can perform strictly better than Lasso, in the sense of a strictly faster rate of convergence, if the Lasso-based model selection correctly includes all components of the “true” model as a subset and also achieves sufficient sparsity. In the extreme case, when Lasso perfectly selects the “true” model, the OLS post-Lasso estimator becomes the oracle estimator. An important ingredient in our analysis is a new sparsity bound on the dimension of the model selected by Lasso, which guarantees that this dimension is at most of the same order as the dimension of the “true” model. Our rate results are nonasymptotic and hold in both parametric and nonparametric models. Moreover, our analysis is not limited to the Lasso estimator acting as a selector in the first step, but also applies to any other estimator, for example, various forms of thresholded Lasso, with good rates and good sparsity properties. Our analysis covers both traditional thresholding and a new practical, data-driven thresholding scheme that induces additional sparsity subject to maintaining a certain goodness of fit. The latter scheme has theoretical guarantees similar to those of Lasso or OLS post-Lasso, but it dominates those procedures as well as traditional thresholding in a wide variety of experiments.
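As a concrete illustration of the two-step procedure described in the abstract, the sketch below runs Lasso as a first-step model selector and then refits OLS on the selected support. This is only a minimal sketch under assumed choices, not the authors' implementation: it uses scikit-learn's LassoCV with cross-validated tuning, whereas the paper works with a data-driven penalty level, and the simulated design, sample sizes, and variable names (n, p, s, beta_true) are hypothetical.

    import numpy as np
    from sklearn.linear_model import LassoCV, LinearRegression

    # Simulated sparse regression design (all sizes and names here are illustrative).
    rng = np.random.default_rng(0)
    n, p, s = 100, 500, 5
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[:s] = 1.0                         # "true" s-dimensional model
    y = X @ beta_true + rng.standard_normal(n)

    # Step 1: Lasso as a model selector. The penalty is chosen by cross-validation
    # for convenience; the paper instead uses a specific data-driven penalty level.
    lasso = LassoCV(cv=5, random_state=0).fit(X, y)
    support = np.flatnonzero(lasso.coef_)

    # Step 2: OLS post-Lasso, i.e. ordinary least squares refitted on the selected
    # support, which removes the shrinkage bias of the first-step Lasso fit.
    ols = LinearRegression().fit(X[:, support], y)

    beta_post = np.zeros(p)
    beta_post[support] = ols.coef_
    print("selected model size:", support.size)

Refitting OLS on the selected support is what undoes the first-step shrinkage, which is the intuition behind the smaller bias of OLS post-Lasso noted in the abstract.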

Article information

Source
Bernoulli, Volume 19, Number 2 (2013), 521-547.

Dates
First available in Project Euclid: 13 March 2013

Permanent link to this document
https://projecteuclid.org/euclid.bj/1363192037

Digital Object Identifier
doi:10.3150/11-BEJ410

Mathematical Reviews number (MathSciNet)
MR3037163

Zentralblatt MATH identifier
06168762

Keywords
Lasso; OLS post-Lasso; post-model selection estimators

Citation

Belloni, Alexandre; Chernozhukov, Victor. Least squares after model selection in high-dimensional sparse models. Bernoulli 19 (2013), no. 2, 521--547. doi:10.3150/11-BEJ410. https://projecteuclid.org/euclid.bj/1363192037.


References

  • [1] Belloni, A. and Chernozhukov, V. (2011). Supplement to “$\ell_{1}$-penalized quantile regression in high-dimensional sparse models.” DOI:10.1214/10-AOS827SUPP.
  • [2] Belloni, A. and Chernozhukov, V. (2012). Supplement to “Least squares after model selection in high-dimensional sparse models.” DOI:10.3150/11-BEJ410SUPP.
  • [3] Belloni, A. and Chernozhukov, V. (2011). $\ell_{1}$-penalized quantile regression in high-dimensional sparse models. Ann. Statist. 39 82–130.
  • [4] Bickel, P.J., Ritov, Y. and Tsybakov, A.B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732.
  • [5] Bunea, F. (2008). Consistent selection via the Lasso for high-dimensional approximating models. In IMS Lecture Notes Monograph Series 123 123–137.
  • [6] Bunea, F., Tsybakov, A.B. and Wegkamp, M.H. (2006). Aggregation and sparsity via $l_{1}$ penalized least squares. In Learning Theory. Lecture Notes in Computer Science 4005 379–391. Berlin: Springer.
  • [7] Bunea, F., Tsybakov, A.B. and Wegkamp, M. (2007). Sparsity oracle inequalities for the Lasso. Electron. J. Stat. 1 169–194.
  • [8] Bunea, F., Tsybakov, A.B. and Wegkamp, M.H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697.
  • [9] Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when $p$ is much larger than $n$. Ann. Statist. 35 2313–2351.
  • [10] Efromovich, S. (1999). Nonparametric Curve Estimation: Methods, Theory, and Applications. Springer Series in Statistics. New York: Springer.
  • [11] Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol. 70 849–911.
  • [12] Koltchinskii, V. (2009). Sparsity in penalized empirical risk minimization. Ann. Inst. Henri Poincaré Probab. Stat. 45 7–57.
  • [13] Lounici, K. (2008). Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electron. J. Stat. 2 90–102.
  • [14] Lounici, K., Pontil, M., Tsybakov, A.B. and van de Geer, S. (2009). Taking advantage of sparsity in multi-task learning. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT 2009) 73–82. Omnipress.
  • [15] Lounici, K., Pontil, M., Tsybakov, A.B. and van de Geer, S. (2012). Oracle inequalities and optimal inference under group sparsity. Ann. Statist. To appear.
  • [16] Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. Ann. Statist. 37 246–270.
  • [17] Rosenbaum, M. and Tsybakov, A.B. (2010). Sparse recovery under matrix uncertainty. Ann. Statist. 38 2620–2651.
  • [18] Rudelson, M. and Vershynin, R. (2008). On sparse reconstruction from Fourier and Gaussian measurements. Comm. Pure Appl. Math. 61 1025–1045.
  • [19] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
  • [20] Tsybakov, A.B. (2008). Introduction to Nonparametric Estimation. Berlin: Springer.
  • [21] van de Geer, S.A. (2000). Empirical Processes in M-Estimation. Cambridge: Cambridge Univ. Press.
  • [22] van de Geer, S.A. (2008). High-dimensional generalized linear models and the lasso. Ann. Statist. 36 614–645.
  • [23] van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes. Springer Series in Statistics. New York: Springer.
  • [24] Wainwright, M.J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_{1}$-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory 55 2183–2202.
  • [25] Wasserman, L. (2006). All of Nonparametric Statistics. Springer Texts in Statistics. New York: Springer.
  • [26] Zhang, C.H. and Huang, J. (2008). The sparsity and bias of the LASSO selection in high-dimensional linear regression. Ann. Statist. 36 1567–1594.
  • [27] Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541–2563.

Supplemental materials

  • Supplementary material: Supplementary material for “Least squares after model selection in high-dimensional sparse models.” The online supplemental article [2] contains finite-sample results for the estimation of $\sigma$, details regarding the oracle problem, omitted proofs, uniform control of sparse eigenvalues, and Monte Carlo experiments to assess the performance of the estimators proposed in the paper.