Electronic Journal of Statistics

Minimum Distance Lasso for robust high-dimensional regression

Aurélie C. Lozano, Nicolai Meinshausen, and Eunho Yang

Full-text: Open access

Abstract

We propose a minimum distance estimation method for robust regression in sparse high-dimensional settings. Likelihood-based estimators lack resilience against outliers and model misspecification, a critical issue when dealing with high-dimensional noisy data. Our method, Minimum Distance Lasso (MD-Lasso), combines the minimum distance functionals customarily used for robustness in nonparametric estimation with $\ell_{1}$-regularization. MD-Lasso is governed by a scaling parameter that caps the influence of outliers: the loss is locally convex and close to quadratic for small squared residuals, and flattens for squared residuals larger than the scaling parameter. As the parameter tends to infinity, the estimator becomes equivalent to least-squares Lasso. MD-Lasso maintains the robustness of minimum distance functionals in sparse high-dimensional regression. The estimator achieves the maximum breakdown point and is consistent with fast convergence rates under mild conditions on the model error distribution. These results hold for any solution in a convexity region around the true parameter and, in certain cases, for every solution. We provide an alternative set of results that do not require the solutions to lie within the convexity region, but instead constrain the $\ell_{2}$-norm of the feasible solutions within a safety radius. Thanks to this constraint, a first-order optimization method produces local optima that are consistent. A connection with re-weighted least squares intuitively explains the robustness of MD-Lasso. The merits of our method are demonstrated through simulations and eQTL analysis.
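To make the qualitative description concrete, the following is a minimal sketch (not the authors' code) using a Welsch-type exponential loss as an illustrative stand-in for a scaled loss that is approximately quadratic for small squared residuals and flattens beyond the scaling parameter, combined with an $\ell_{1}$ penalty and minimized by proximal gradient descent. The function names, the choices of gamma and lam, and the solver are assumptions for illustration only; the gradient takes the form of a weighted least-squares gradient with weights exp(-r^2/gamma), mirroring the re-weighted least-squares connection mentioned above.

```python
# Illustrative sketch only: a Welsch/exponential-type robust loss plus an l1 penalty,
# NOT claimed to be the exact MD-Lasso objective.  gamma, lam and the solver are
# assumptions chosen to mimic the behavior described in the abstract.
import numpy as np

def robust_loss(r, gamma):
    """gamma * (1 - exp(-r^2/gamma)): ~ r^2 when r^2 << gamma, -> gamma as r^2 -> inf."""
    return gamma * (1.0 - np.exp(-r**2 / gamma))

def objective(beta, X, y, gamma, lam):
    r = y - X @ beta
    return robust_loss(r, gamma).mean() + lam * np.abs(beta).sum()

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fit_prox_grad(X, y, gamma, lam, n_iter=200):
    """Proximal gradient (ISTA) on the robust loss + l1 penalty.  The gradient of the
    smooth term equals a weighted least-squares gradient with per-observation weights
    exp(-r^2/gamma), so observations with large residuals are effectively down-weighted."""
    n, p = X.shape
    beta = np.zeros(p)
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)   # conservative step from a Lipschitz bound
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.exp(-r**2 / gamma)                  # weights in (0, 1]; outliers get ~0
        grad = -2.0 * (X.T @ (w * r)) / n          # gradient of the smooth robust term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Tiny usage example with a few grossly corrupted responses.
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(n)
y[:5] += 20.0                                      # contaminate five observations
beta_hat = fit_prox_grad(X, y, gamma=1.0, lam=0.05)
print(np.round(beta_hat[:5], 2))
```

In this sketch, sending gamma to infinity makes the weights tend to one and the update reduces to an ordinary Lasso proximal-gradient step, which is the limiting behavior the abstract describes for the scaling parameter.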

Article information

Source
Electron. J. Statist., Volume 10, Number 1 (2016), 1296-1340.

Dates
Received: August 2014
First available in Project Euclid: 19 May 2016

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1463664092

Digital Object Identifier
doi:10.1214/16-EJS1136

Mathematical Reviews number (MathSciNet)
MR3504182

Zentralblatt MATH identifier
1349.62322

Keywords
Lasso; robust estimation; high-dimensional variable selection; sparse learning

Citation

Lozano, Aurélie C.; Meinshausen, Nicolai; Yang, Eunho. Minimum Distance Lasso for robust high-dimensional regression. Electron. J. Statist. 10 (2016), no. 1, 1296--1340. doi:10.1214/16-EJS1136. https://projecteuclid.org/euclid.ejs/1463664092

