The Annals of Statistics

Least angle regression

Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani


Abstract

The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression (LARS), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods. Three main properties are derived: (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of ordinary least squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps us understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm. (3) A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a Cp estimate of prediction error; this allows a principled choice among the range of possible LARS estimates. LARS and its variants are computationally efficient: the paper describes a publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates.
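As a rough illustration of points (1) and (3) above, the sketch below traces a Lasso path with the LARS modification and then scores each step with a Cp-style criterion. It uses scikit-learn's independent lars_path implementation rather than the authors' published code; the synthetic data, the known noise level sigma, and the use of the number of nonzero coefficients as the degrees-of-freedom approximation are illustrative assumptions made in the spirit of the paper's simple df result, not a reproduction of its method.

```python
# A minimal sketch (not the authors' code): trace the Lasso path via the LARS
# modification, then pick a step with a Cp-type estimate of prediction error,
# approximating the degrees of freedom at each step by the number of nonzero
# coefficients.  Data, noise level and variable names are illustrative.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p, sigma = 100, 10, 1.0
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]          # only three covariates matter
y = X @ beta_true + sigma * rng.standard_normal(n)

# method="lasso" gives the Lasso modification of LARS; method="lar" gives pure LARS.
alphas, active, coefs = lars_path(X, y, method="lasso")

# Cp(k) = RSS_k / sigma^2 - n + 2 * df_k.  Here sigma is known because the data
# are simulated; in practice sigma^2 would be estimated, e.g. from the full OLS fit.
cp = []
for k in range(coefs.shape[1]):
    resid = y - X @ coefs[:, k]
    df_k = np.count_nonzero(coefs[:, k])   # simple df approximation
    cp.append(resid @ resid / sigma**2 - n + 2 * df_k)

best = int(np.argmin(cp))
print("chosen step:", best, "active variables:", np.flatnonzero(coefs[:, best]))
```

In examples like this one the Cp curve typically reaches its minimum near the true model size (three active covariates here), illustrating how a degrees-of-freedom approximation gives a principled stopping rule along the LARS/Lasso path.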

Article information

Source
Ann. Statist. Volume 32, Number 2 (2004), 407-499.

Dates
First available in Project Euclid: 28 April 2004

Permanent link to this document
http://projecteuclid.org/euclid.aos/1083178935

Digital Object Identifier
doi:10.1214/009053604000000067

Mathematical Reviews number (MathSciNet)
MR2060166

Zentralblatt MATH identifier
02100802

Subjects
Primary: 62J07: Ridge regression; shrinkage estimators

Keywords
Lasso; boosting; linear regression; coefficient paths; variable selection

Citation

Efron, Bradley; Hastie, Trevor; Johnstone, Iain; Tibshirani, Robert. Least angle regression. Ann. Statist. 32 (2004), no. 2, 407--499. doi:10.1214/009053604000000067. http://projecteuclid.org/euclid.aos/1083178935.


