The Annals of Statistics

When do stepwise algorithms meet subset selection criteria?

Xiaoming Huo and Xuelei (Sherry) Ni

Full-text: Open access


Recent results in homotopy and solution paths demonstrate that certain well-designed greedy algorithms, with a range of values of the algorithmic parameter, can provide solution paths to a sequence of convex optimization problems. On the other hand, in regression many existing criteria in subset selection (including Cp, AIC, BIC, MDL, RIC, etc.) involve optimizing an objective function that contains a counting measure. The two optimization problems are formulated as (P1) and (P0) in the present paper. The latter is generally combinatoric and has been proven to be NP-hard. We study the conditions under which the two optimization problems have common solutions. Hence, in these situations a stepwise algorithm can be used to solve the seemingly unsolvable problem. Our main result is motivated by recent work in sparse representation, while two others emerge from different angles: a direct analysis of sufficiency and necessity and a condition on the mostly correlated covariates. An extreme example connected with least angle regression is of independent interest.

Article information

Ann. Statist., Volume 35, Number 2 (2007), 870-887.

First available in Project Euclid: 5 July 2007

Permanent link to this document

Digital Object Identifier

Mathematical Reviews number (MathSciNet)

Zentralblatt MATH identifier

Primary: 62J07: Ridge regression; shrinkage estimators

Subset selection model selection stepwise algorithms convex optimization concurrent optimal subset


Huo, Xiaoming; Ni, Xuelei (Sherry). When do stepwise algorithms meet subset selection criteria?. Ann. Statist. 35 (2007), no. 2, 870--887. doi:10.1214/009053606000001334.

Export citation


  • Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (B. N. Petrov and F. Csáki, eds.) 267--281. Akadémiai Kiadó, Budapest.
  • Burnham, K. P. and Anderson, D. A. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, 2nd ed. Springer, New York.
  • Chen, S. S. (1995). Basis pursuit. Ph.D. dissertation, Dept. Statistics, Stanford Univ.
  • Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 33--61. Reprinted at SIAM Rev. 43 (2001) 129--159.,
  • DeVore, R. A. and Temlyakov, V. N. (1996). Some remarks on greedy algorithms. Adv. Comput. Math. 5 173--187.
  • Donoho, D. L. and Elad, M. (2003). Optimally sparse representation in general (nonorthogonal) dictionaries via $\ell_1$ minimization. Proc. Natl. Acad. Sci. USA 100 2197--2202.
  • Donoho, D. L., Elad, M. and Temlyakov, V. N. (2006). Stable recovery of sparse overcomplete representations in the presense of noise. IEEE Trans. Inform. Theory 52 6--18.
  • Donoho, D. L. and Huo, X. (2001). Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inform. Theory 47 2845--2862.
  • Donoho, D. L. and Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90 1200--1224.
  • Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). Ann. Statist. 32 407--499.
  • Elad, M. and Bruckstein, A. M. (2002). A generalized uncertainty principle and sparse representation in pairs of bases. IEEE Trans. Inform. Theory 48 2558--2567.
  • Fan, J. and Li, R. (2001). Variable selection via nonconvave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348--1360.
  • Fan, J. and Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. In International Congress of Mathematicians 3 595--622. European Math. Soc., Zürich.
  • Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist. 32 928--961.
  • Foster, D. P. and George, E. I. (1994). The risk inflation criterion for multiple regression. Ann. Statist. 22 1947--1975.
  • Fuchs, J.-J. (2004). On sparse representations in arbitrary redundant bases. IEEE Trans. Inform. Theory 50 1341--1344.
  • Furnival, G. and Wilson, R. (1974). Regressions by leaps and bounds. Technometrics 16 499--511.
  • Gatu, C. and Kontoghiorghes, E. J. (2006). Branch-and-bound algorithms for computing the best-subset regression models. J. Comput. Graph. Statist. 15 139--156.
  • George, E. I. (2000). The variable selection problem. J. Amer. Statist. Assoc. 95 1304--1308.
  • Gilmour, S. G. (1996). The interpretation of Mallows's $C_p$-statistic. The Statistician 45 49--56.
  • Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations, 3rd ed. Johns Hopkins Univ. Press, Baltimore.
  • Gribonval, R., Figueras i Ventura, R. M. and Vandergheynst, P. (2005). A simple test to check the optimality of sparse signal approximations. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'05) 5 717--720. IEEE Press, Piscataway, NJ. Available at A longer version is available at
  • Gribonval, R. and Nielsen, M. (2003). Sparse representations in unions of bases. IEEE Trans. Inform. Theory 49 3320--3325.
  • Hastie, T., Rosset, S., Tibshirani, R. and Zhu, J. (2004). The entire regularization path for the support vector machine. J. Mach. Learn. Res. 5 1391--1415.
  • Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, New York.
  • Huo, X. and Ni, X. S. (2005). When do stepwise algorithms meet subset selection criteria? Technical report, Georgia Inst. Technology. Available at
  • Jain, A. and Zongker, D. (1997). Feature selection: Evaluation, application and small sample performance. IEEE Trans. Pattern Analysis and Machine Intelligence 19 153--158.
  • Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence 97 273--324.
  • Malioutov, D. M., Cetin, M. and Willsky, A. S. (2005). Homotopy continuation for sparse signal representation. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'05) 5 733--736. IEEE Press, Piscataway, NJ.
  • Mallows, C. L. (1973). Some comments on $C_P$. Technometrics 15 661--675.
  • Miller, A. J. (2002). Subset Selection in Regression, 2nd ed. Chapman and Hall/CRC, Boca Raton, FL.
  • Narendra, P. M. and Fukunaga, K. (1977). A branch and bound algorithm for feature subset selection. IEEE Trans. Comput. 26 917--922.
  • Natarajan, B. K. (1995). Sparse approximate solutions to linear systems. SIAM J. Comput. 24 227--234.
  • Ni, X. S. (2006). New results in detection, estimation and model selection. Ph.D. dissertation, Georgia Inst. Technology. Available at
  • Ni, X. S. and Huo, X. (2005). Enhanced leaps-and-bounds methods in subset selections with additional optimality tests. INFORMS QSR student paper competition finalist. Available at
  • Ni, X. S. and Huo, X. (2006). Regressions by enhanced leaps-and-bounds via optimality tests (LBOT). Technical report, Georgia Inst. Technology. Available at
  • Osborne, M. R. (1985). Finite Algorithms in Optimization and Data Analysis. Wiley, Chichester.
  • Osborne, M. R., Presnell, B. and Turlach, B. A. (2000). On the LASSO and its dual. J. Comput. Graph. Statist. 9 319--337.
  • Osborne, M. R., Presnell, B. and Turlach, B. A. (2000). A new approach to variable selection in least squares problems. IMA J. Numer. Anal. 20 389--403.
  • Ridout, M. S. (1988). Algorithm AS 233: An improved branch and bound algorithm for feature subset selection. Appl. Statist. 37 139--147.
  • Roberts, S. J. (1984). Algorithm AS 199: A branch and bound algorithm for determining the optimal feature subset of given size. Appl. Statist. 33 236--241.
  • Rockafellar, R. T. (1970). Convex Analysis. Princeton Univ. Press.
  • Santosa, F. and Symes, W. W. (1986). Linear inversion of band-limited reflection seismograms. SIAM J. Sci. Statist. Comput. 7 1307--1330.
  • Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461--464.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267--288.
  • Tropp, J. A. (2004). Greed is good: Algorithmic results for sparse approximation. IEEE Trans. Inform. Theory 50 2231--2242.
  • Tropp, J. A. (2006). Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Trans. Inform. Theory 52 1030--1051.
  • Turlach, B. A. (2004). Discussion of ``Least angle regression,'' by B. Efron, T. Hastie, I. Johnstone and R. Tibshirani. Ann. Statist. 32 481--490.
  • Wu, C.-F. J. (1993). Construction of supersaturated designs through partially aliased interactions. Biometrika 80 661--669.
  • Zou, H. (2005). The adaptive Lasso and its oracle properties. Unpublished manuscript.