The Annals of Applied Statistics

Robust VIF regression with application to variable selection in large data sets

Debbie J. Dupuis and Maria-Pia Victoria-Feser

Abstract

The sophisticated and automated means of data collection used by an increasing number of institutions and companies lead to extremely large data sets. Subset selection in regression is essential when a huge number of covariates can potentially explain a response variable of interest. The recent statistical literature has seen the emergence of new selection methods that offer a compromise between implementation (computational speed) and statistical optimality (e.g., prediction error minimization). Global methods such as Mallows’ $C_{p}$ have been supplanted by sequential methods such as stepwise regression. More recently, streamwise regression, which is faster than stepwise regression, has emerged. A recently proposed streamwise regression approach based on the variance inflation factor (VIF) is promising, but its least-squares based implementation makes it susceptible to the outliers that are inevitable in such large data sets. This lack of robustness can lead to suboptimal feature selection. In our application, we seek to predict an individual’s educational attainment using economic and demographic variables. We show that classical VIF regression performs this task poorly and that a robust procedure is necessary for policy makers. This article proposes a robust VIF regression, based on fast robust estimators, that inherits all the good properties of classical VIF regression in the absence of outliers and continues to perform well in their presence, where the classical approach fails.
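
The sensitivity of least squares to outliers mentioned in the abstract can be illustrated with a small sketch. It is not the authors' robust VIF procedure: the data are made up, the model is a one-covariate regression through the origin, and a generic Huber $M$-estimator (fitted by iteratively reweighted least squares, with the conventional tuning constant 1.345) stands in for the "fast robust estimators" that the abstract leaves unspecified. The function names `ls_slope` and `huber_slope` are hypothetical.

```python
from statistics import median


def ls_slope(x, y):
    """Least-squares slope for the no-intercept model y = b * x."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def huber_slope(x, y, c=1.345, n_iter=50):
    """Huber M-estimate of the slope, computed by iteratively
    reweighted least squares (illustrative assumption, not the paper's
    estimator)."""
    # Robust starting value: median of the pointwise ratios y_i / x_i.
    b = median(yi / xi for xi, yi in zip(x, y) if xi != 0)
    for _ in range(n_iter):
        resid = [yi - b * xi for xi, yi in zip(x, y)]
        # Residual scale from the median absolute residual; the factor
        # 1.4826 makes it consistent for Gaussian errors.
        scale = max(1.4826 * median(abs(r) for r in resid), 1e-12)
        # Huber weights: 1 for small residuals, shrinking for large ones.
        w = [1.0 if abs(r) <= c * scale else c * scale / abs(r) for r in resid]
        b = (sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
             / sum(wi * xi * xi for wi, xi in zip(w, x)))
    return b


# Made-up data: true slope 2, small deterministic noise.
x = [float(i) for i in range(1, 11)]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.05, -0.15, 0.1, 0.0]
y_clean = [2.0 * xi + e for xi, e in zip(x, noise)]

# Contaminate a single observation with a gross outlier.
y = list(y_clean)
y[-1] = 200.0

print("LS slope, clean data:       ", ls_slope(x, y_clean))
print("LS slope, one outlier:      ", ls_slope(x, y))
print("Huber slope, one outlier:   ", huber_slope(x, y))
```

On these data a single contaminated response pulls the least-squares slope far from the true value of 2, while the Huber fit downweights that observation and stays close to 2. A selection procedure that tests covariates with least-squares statistics is exposed to the same distortion.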

Article information

Source
Ann. Appl. Stat., Volume 7, Number 1 (2013), 319-341.

Dates
First available in Project Euclid: 9 April 2013

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1365527201

Digital Object Identifier
doi:10.1214/12-AOAS584

Mathematical Reviews number (MathSciNet)
MR3086421

Zentralblatt MATH identifier
06171274

Keywords
Variable selection; linear regression; multicollinearity; $M$-estimator; college data

Citation

Dupuis, Debbie J.; Victoria-Feser, Maria-Pia. Robust VIF regression with application to variable selection in large data sets. Ann. Appl. Stat. 7 (2013), no. 1, 319–341. doi:10.1214/12-AOAS584. https://projecteuclid.org/euclid.aoas/1365527201


