The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression (LARS), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods. Three main properties are derived: (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of ordinary least squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps us understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm. (3) A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a Cp estimate of prediction error; this allows a principled choice among the range of possible LARS estimates. LARS and its variants are computationally efficient: the paper describes a publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates.
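To make the Forward Stagewise idea mentioned in point (2) concrete, here is a minimal sketch of the basic incremental procedure: repeatedly find the covariate most correlated with the current residual and nudge its coefficient by a small amount in the direction of that correlation. This is not the LARS modification the abstract refers to, only the simple small-step version. The function name `forward_stagewise`, the step size `eps`, the iteration cap `n_steps`, and the assumption that covariates are supplied as centered, comparably scaled columns are all illustrative choices.

```python
def forward_stagewise(X, y, eps=0.01, n_steps=1000):
    """Incremental forward stagewise regression (illustrative sketch).

    X is a list of covariate columns (each a list of floats), assumed
    centered and comparably scaled; y is the centered response.
    """
    beta = [0.0] * len(X)
    residual = list(y)
    for _ in range(n_steps):
        # Inner product of each covariate with the current residual.
        corr = [sum(xi * ri for xi, ri in zip(col, residual)) for col in X]
        # Pick the covariate with the largest absolute correlation.
        j = max(range(len(X)), key=lambda k: abs(corr[k]))
        if abs(corr[j]) < 1e-12:
            break  # residual is (numerically) orthogonal to every covariate
        # Take a tiny step in the direction of the sign of that correlation.
        step = eps if corr[j] > 0 else -eps
        beta[j] += step
        residual = [ri - step * xi for ri, xi in zip(residual, X[j])]
    return beta


# Hypothetical toy data: two orthogonal covariates, y = 2*x1 + 1*x2.
x1 = [1.0, -1.0, 1.0, -1.0]
x2 = [1.0, 1.0, -1.0, -1.0]
y = [2 * a + b for a, b in zip(x1, x2)]
beta = forward_stagewise([x1, x2], y)
```

Because every update has fixed size `eps`, the coefficients approach the least squares fit only to within roughly one step and may oscillate around it; a smaller `eps` traces a finer path at the cost of more iterations.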