Annals of Statistics

Nearly unbiased variable selection under minimax concave penalty

Cun-Hui Zhang

Full-text: Open access


We propose MC+, a fast, continuous, nearly unbiased and accurate method of penalized variable selection in high-dimensional linear regression. The LASSO is fast and continuous, but biased. The bias of the LASSO may prevent consistent variable selection. Subset selection is unbiased but computationally costly. The MC+ has two elements: a minimax concave penalty (MCP) and a penalized linear unbiased selection (PLUS) algorithm. The MCP provides the convexity of the penalized loss in sparse regions to the greatest extent given certain thresholds for variable selection and unbiasedness. The PLUS computes multiple exact local minimizers of a possibly nonconvex penalized loss function in a certain main branch of the graph of critical points of the penalized loss. Its output is a continuous piecewise linear path encompassing from the origin for infinite penalty to a least squares solution for zero penalty. We prove that at a universal penalty level, the MC+ has high probability of matching the signs of the unknowns, and thus correct selection, without assuming the strong irrepresentable condition required by the LASSO. This selection consistency applies to the case of pn, and is proved to hold for exactly the MC+ solution among possibly many local minimizers. We prove that the MC+ attains certain minimax convergence rates in probability for the estimation of regression coefficients in r balls. We use the SURE method to derive degrees of freedom and Cp-type risk estimates for general penalized LSE, including the LASSO and MC+ estimators, and prove their unbiasedness. Based on the estimated degrees of freedom, we propose an estimator of the noise level for proper choice of the penalty level. For full rank designs and general sub-quadratic penalties, we provide necessary and sufficient conditions for the continuity of the penalized LSE. Simulation results overwhelmingly support our claim of superior variable selection properties and demonstrate the computational efficiency of the proposed method.

Article information

Ann. Statist., Volume 38, Number 2 (2010), 894-942.

First available in Project Euclid: 19 February 2010

Permanent link to this document

Digital Object Identifier

Mathematical Reviews number (MathSciNet)

Zentralblatt MATH identifier

Primary: 62J05: Linear regression 62J07: Ridge regression; shrinkage estimators
Secondary: 62H12: Estimation 62H25: Factor analysis and principal components; correspondence analysis

Variable selection model selection penalized estimation least squares correct selection minimax unbiasedness mean squared error nonconvex minimization risk estimation degrees of freedom selection consistency sign consistency


Zhang, Cun-Hui. Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38 (2010), no. 2, 894--942. doi:10.1214/09-AOS729.

Export citation


  • Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Proc. 2nd International Symposium on Information Theory (V. Petrov and F. Csáki, eds.) 267–281. Akadmiai Kiadó, Budapest.
  • Antoniadis, A. and Fan, J. (2001). Regularized wavelet approximations (with discussion). J. Amer. Statist. Assoc. 96 939–967.
  • Bach, F. R. (2008). Bolasso: Model consistent Lasso estimation through the bootstrap. In Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008, Helsinki, Finland) (A. McCallum and S. Roweis, eds.) 33–40.
  • Bunea, F., Tsybakov, A. and Wegkamp, M. (2007). Sparsity oracle inequalities for the lasso. Electron. J. Stat. 1 169–194 (electronic).
  • Candés, E. and Tao, T. (2005). Decoding by linear programming. IEEE Trans. Inform. Theory 51 4203–4215.
  • Candés, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n (with discussion). Ann. Statist. 35 2313–2404.
  • Chen, S. and Donoho, D.L. (1994). On basis pursuit. Technical report, Dept. Statistics, Stanford Univ.
  • Davidson, K. and Szarek, S. (2001). Local operator theory, random matrices and Banach spaces. In Handbook on the Geometry of Banach Spaces (W. B. Johnson and J. Lindenstrauss, eds.) I 317–366. North-Holland, Amsterdam.
  • Donoho, D. L. and Johnstone, I. (1994a). Minimax risk over p-balls for q-error. Probab. Theory Related Fields 99 277–303.
  • Donoho, D. L. and Johnstone, I. M. (1994b). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81 425–455.
  • Donoho, D. L., Johnstone, I. M., Hoch, J. C. and Stern, A. S. (1992). Maximum entropy and the nearly black object (with discussion). J. Roy. Statist. Soc. Ser. B 54 41–81.
  • Efron, B. (1986). How biased is the apparent error of a prediction rule? J. Amer. Statist. Assoc. 81 461–470.
  • Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). Ann. Statist. 32 407–499.
  • Efron, B., Hastie, T. and Tibshirani, R. (2007). Discussion: The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist. 35 2358–2364.
  • Fan, J. (1997). Comments on “Wavelets in statistics: A review” by A. Antoniadis. J. Italian Statist. Assoc. 6 131–138.
  • Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
  • Fan, J. and Lv, Jinchi. (2008). Sure independence screening for ultrahigh-dimensional feature space. J. Roy. Statist. Soc. Ser. B 70 849–911.
  • Fan, J. and Peng, H. (2004). On nonconcave penalized likelihood with diverging number of parameters. Ann. Statist. 32 928–961.
  • Foster, D. P. and George, E. I. (1994). The risk inflation criterion for multiple regression. Ann. Statist. 22 1947–1975.
  • Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148–156. Morgan Kauffmann, San Francisco.
  • Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting (with discussion). Ann. Statist. 28 337–307.
  • Gao, H.-Y. and Bruce, A. G. (1997). Waveshrink with firm shrinkage. Statist. Sinica 7 855–874.
  • Genkin, A. Lewis, D. D. and Madigan, D. (2004). Large-scale Bayesian logistic regression for text categorization. Technical report, DIMACS, Rutgers Univ.
  • Greenshtein E. and Ritov Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10 971–988.
  • Huang, J., Ma, S. and Zhang, C.-H. (2008). Adaptive Lasso for sparse high-dimensional regression models. Statist. Sinica 18 1603–1618.
  • Hunter, D. R. and Li, R. (2005). Variable selection using MM algorithms. Ann. Statist. 33 1617–1642.
  • Mallows, C. L. (1973). Some comments on Cp. Technometrics 12 661–675.
  • Meinshausen, N. (2007). Relaxed Lasso. Comput. Statist. Data Anal. 52 374–393.
  • Meinshausen, N. and Buhlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Ann. Statist. 34 1436–1462.
  • Meinshausen, N., Rocha, G. and Yu, B. (2007). Discussion: The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist. 35 2373–2384.
  • Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. Ann. Statist. 37 2246–2270.
  • Meyer, M. and Woodroofe, M. (2000). On the degrees of freedom in shape-restricted regression. Ann. Statist. 28 1083–1104.
  • Osborne, M., Presnell, B. and Turlach, B. (2000a). A new approach to variable selection in least squares problems. IMA J. Numer. Anal. 20 389–404.
  • Osborne, M., Presnell, B. and Turlach, B. (2000b). On the lasso and its dual. J. Comput. Graph. Statist. 9 319–337.
  • Park, M. Y. and Hastie, T. (2007). An L1 regularization-path algorithm for generalized linear models. J. R. Stat. Soc. Ser. B Stat. Methodol. 69 659–677.
  • Rosset, S. and Zhu, J. (2007). Piecewise linear regularized solution paths. Ann. Statist. 35 1012–1030.
  • Schapire, R. E. (1990). The strength of weak learnability. Machine Learning 5 197–227.
  • Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464.
  • Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist. 9 1135–1151.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
  • Tropp, J. A. (2006). Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Trans. Inform. Theory 52 1030–1051.
  • Van de Geer, S. (2008). High-dimensional generalized linear models and the Lasso. Ann. Statist. 36 614–645.
  • Wainwright, M. (2006). Sharp thresholds for high-dimensional and noisy recovery of sparsity. Technical Report 708, Dept. Statistics, Univ. California, Berkeley.
  • Ye, F. and Zhang, C.-H. (2009). Rate Minimaxity of the Lasso and Dantzig Estimators. Technical Report No. 2009-001. Dept. Statistics, Rutgers Univ.
  • Yuan, M. and Lin, Y. (2007). On the nonnegative garrote estimator. J. R. Stat. Soc. Ser. B Stat. Methodol. 69 143–161.
  • Zhang, C.-H. (2007a). Continuous generalized gradient descent. J. Comput. Graph. Statist. 16 761–781.
  • Zhang, C.-H. (2007b). Penalized linear unbiased selection. Technical Report 2007-003. Dept. Statistics, Rutgers Univ.
  • Zhang, C.-H. (2007c). Information-theoretic optimality of variable selection with concave penalty. Technical Report No. 2007-008. Dept. Statistics, Rutgers Univ.
  • Zhang, C.-H. (2008). Discussion: One-step sparse estimates in nonconcave penalized likelihood models. Ann. Statist. 36 1553–1560.
  • Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the LASSO selection in high-dimensional regression. Ann. Statist. 36 1567–1594.
  • Zhao, P. and Yu, B. (2006). On model selection consistency of LASSO. J. Mach. Learn. Res. 7 2541–2567.
  • Zhao, P. and Yu, B. (2007). Stagewise Lasso. J. Mach. Learn. Res. 8 2701–2726.
  • Zou, H. (2006). The adaptive Lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429.
  • Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models (with discussion). Ann. Statist. 36 1509–1533.