The Annals of Applied Statistics

Sparse least trimmed squares regression for analyzing high-dimensional large data sets

Andreas Alfons, Christophe Croux, and Sarah Gelper

Full-text: Open access

Abstract

Sparse model estimation is a topic of high importance in modern data analysis due to the increasing availability of data sets with a large number of variables. Another common problem in applied statistics is the presence of outliers in the data. This paper combines robust regression and sparse model estimation. A robust and sparse estimator is introduced by adding an $L_{1}$ penalty on the coefficient estimates to the well-known least trimmed squares (LTS) estimator. The breakdown point of this sparse LTS estimator is derived, and a fast algorithm for its computation is proposed. In addition, the sparse LTS is applied to protein and gene expression data of the NCI-60 cancer cell panel. Both a simulation study and the real data application show that the sparse LTS has better prediction performance than its competitors in the presence of leverage points.
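The estimator described in the abstract minimizes the sum of the $h$ smallest squared residuals plus an $L_{1}$ penalty on the coefficients, and is typically computed by alternating lasso fits with "C-steps" that re-select the $h$ best-fitting observations. The sketch below is a minimal, illustrative implementation of that idea, not the authors' algorithm (the paper's fast algorithm uses multiple subsamples and other refinements); the function names, the single random start, and the penalty scaling are assumptions made here for brevity.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate-descent lasso (no intercept), minimizing
    (1/2) * ||y - X b||^2 + lam * ||b||_1.  Illustrative only."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)      # per-column squared norms
    r = y - X @ b                       # current residual vector
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]         # remove coordinate j from the fit
            rho = X[:, j] @ r
            # soft-thresholding update for coordinate j
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]         # add the updated coordinate back
    return b

def sparse_lts_sketch(X, y, lam, h, n_csteps=20, seed=0):
    """Toy sparse LTS: alternate a lasso fit on a size-h subset with
    re-selecting the h observations having smallest squared residuals.
    A single random start; the paper's algorithm is more elaborate."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    subset = rng.choice(n, size=h, replace=False)
    for _ in range(n_csteps):
        b = lasso_cd(X[subset], y[subset], lam)
        res2 = (y - X @ b) ** 2
        new_subset = np.argsort(res2)[:h]   # C-step: keep h best fits
        if set(new_subset) == set(subset):  # converged
            break
        subset = new_subset
    return b, subset
```

On data with a few gross vertical outliers, the C-steps drive the retained subset away from the contaminated observations, so the final lasso fit is based (mostly) on clean points — the robustness mechanism the abstract describes.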

Article information

Source
Ann. Appl. Stat., Volume 7, Number 1 (2013), 226-248.

Dates
First available in Project Euclid: 9 April 2013

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1365527197

Digital Object Identifier
doi:10.1214/12-AOAS575

Mathematical Reviews number (MathSciNet)
MR3086417

Zentralblatt MATH identifier
06171270

Keywords
Breakdown point; outliers; penalized regression; robust regression; trimming

Citation

Alfons, Andreas; Croux, Christophe; Gelper, Sarah. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann. Appl. Stat. 7 (2013), no. 1, 226--248. doi:10.1214/12-AOAS575. https://projecteuclid.org/euclid.aoas/1365527197

References

  • Alfons, A. (2012a). simFrame: Simulation framework. R package version 0.5.0.
  • Alfons, A. (2012b). robustHD: Robust methods for high-dimensional data. R package version 0.1.0.
  • Alfons, A., Templ, M. and Filzmoser, P. (2010). An object-oriented framework for statistical simulation: The R package simFrame. Journal of Statistical Software 37 1–36.
  • Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407–499.
  • Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
  • Germain, J.-F. and Roueff, F. (2010). Weak convergence of the regularization path in penalized M-estimation. Scand. J. Stat. 37 477–495.
  • Gertheiss, J. and Tutz, G. (2010). Sparse modeling of categorial explanatory variables. Ann. Appl. Stat. 4 2150–2180.
  • Hassan, R., Bera, T. and Pastan, I. (2004). Mesothelin: A new target for immunotherapy. Clin. Cancer Res. 10 3937–3942.
  • Hastie, T. and Efron, B. (2011). lars: Least angle regression, lasso and forward stagewise. R package version 0.9-8.
  • Khan, J. A., Van Aelst, S. and Zamar, R. H. (2007). Robust linear model selection based on least angle regression. J. Amer. Statist. Assoc. 102 1289–1299.
  • Knight, K. and Fu, W. (2000). Asymptotics for lasso-type estimators. Ann. Statist. 28 1356–1378.
  • Koenker, R. (2011). quantreg: Quantile regression. R package version 4.67.
  • Lee, D., Lee, W., Lee, Y. and Pawitan, Y. (2011). Sparse partial least-squares regression and its applications to high-throughput data analysis. Chemometrics and Intelligent Laboratory Systems 109 1–8.
  • Li, G., Peng, H. and Zhu, L. (2011). Nonconcave penalized $M$-estimation with a diverging number of parameters. Statist. Sinica 21 391–419.
  • Maglott, D., Ostell, J., Pruitt, K. D. and Tatusova, T. (2005). Entrez gene: Gene-centered information at NCBI. Nucleic Acids Res. 33 D54–D58.
  • Maronna, R. A. (2011). Robust ridge regression for high-dimensional data. Technometrics 53 44–53.
  • Maronna, R. A., Martin, R. D. and Yohai, V. J. (2006). Robust Statistics: Theory and Methods. Wiley, Chichester.
  • Meinshausen, N. (2007). Relaxed lasso. Comput. Statist. Data Anal. 52 374–393.
  • Menjoge, R. S. and Welsch, R. E. (2010). A diagnostic method for simultaneous feature selection and outlier identification in linear regression. Comput. Statist. Data Anal. 54 3181–3193.
  • Oshima, R. G., Baribault, H. and Caulín, C. (1996). Oncogenic regulation and function of keratins 8 and 18. Cancer and Metastasis Reviews 15 445–471.
  • Owens, D. W. and Lane, E. B. (2003). The quest for the function of simple epithelial keratins. Bioessays 25 748–758.
  • R Development Core Team (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  • Radchenko, P. and James, G. M. (2011). Improved variable selection with forward-lasso adaptive shrinkage. Ann. Appl. Stat. 5 427–448.
  • Rosset, S. and Zhu, J. (2004). Discussion of “Least angle regression,” by B. Efron, T. Hastie, I. Johnstone and R. Tibshirani. Ann. Statist. 32 469–475.
  • Rousseeuw, P. J. (1984). Least median of squares regression. J. Amer. Statist. Assoc. 79 871–880.
  • Rousseeuw, P. J. and Leroy, A. M. (2003). Robust Regression and Outlier Detection, 2nd ed. Wiley, Hoboken.
  • Rousseeuw, P. J. and Van Driessen, K. (2006). Computing LTS regression for large data sets. Data Min. Knowl. Discov. 12 29–45.
  • Shankavaram, U. T., Reinhold, W. C., Nishizuka, S., Major, S., Morita, D., Chary, K. K., Reimers, M. A., Scherf, U., Kahn, A., Dolginow, D., Cossman, J., Kaldjian, E. P., Scudiero, D. A., Petricoin, E., Liotta, L., Lee, J. K. and Weinstein, J. N. (2007). Transcript and protein expression profiles of the NCI-60 cancer cell panel: An integromic microarray study. Molecular Cancer Therapeutics 6 820–832.
  • She, Y. and Owen, A. B. (2011). Outlier detection using nonconvex penalized regression. J. Amer. Statist. Assoc. 106 626–639.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
  • van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. Ann. Statist. 36 614–645.
  • Wang, H., Li, G. and Jiang, G. (2007). Robust regression shrinkage and consistent variable selection through the LAD-lasso. J. Bus. Econom. Statist. 25 347–355.
  • Wang, S., Nan, B., Rosset, S. and Zhu, J. (2011). Random lasso. Ann. Appl. Stat. 5 468–485.
  • Wu, T. T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2 224–244.
  • Yohai, V. J. (1987). High breakdown-point and high efficiency robust estimates for regression. Ann. Statist. 15 642–656.
  • Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49–67.
  • Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. J. Mach. Learn. Res. 7 2541–2563.
  • Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429.
  • Zou, H., Hastie, T. and Tibshirani, R. (2007). On the “degrees of freedom” of the lasso. Ann. Statist. 35 2173–2192.