The Annals of Statistics

Pathwise coordinate optimization for sparse learning: Algorithm and theory

Tuo Zhao, Han Liu, and Tong Zhang


Abstract

Pathwise coordinate optimization is one of the most important computational frameworks for high dimensional convex and nonconvex sparse learning problems. It differs from classical coordinate optimization algorithms in three salient features: warm start initialization, active set updating and strong rule for coordinate preselection. Such a complex algorithmic structure grants superior empirical performance, but also poses significant challenges for theoretical analysis. To tackle this long-standing problem, we develop a new theory showing that these three features play pivotal roles in guaranteeing the outstanding statistical and computational performance of the pathwise coordinate optimization framework. In particular, we analyze the existing pathwise coordinate optimization algorithms and provide new theoretical insights into them. The obtained insights further motivate the development of several modifications to improve the pathwise coordinate optimization framework, which guarantees linear convergence to a unique sparse local optimum with optimal statistical properties in parameter estimation and support recovery. This is the first result on the computational and statistical guarantees of the pathwise coordinate optimization framework in high dimensions. Thorough numerical experiments are provided to support our theory.
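The three features named in the abstract can be made concrete with a small sketch. Below is a minimal, illustrative Python implementation of pathwise coordinate descent for the lasso (least-squares loss with an l1 penalty), showing warm starts along the regularization path, active-set updating in the inner loop, and a sequential strong rule for coordinate preselection. The function names (lasso_path, soft_threshold) and all parameter choices are assumptions made for illustration; this is not the authors' algorithm, which handles nonconvex penalties and includes safeguards (such as KKT rechecking after screening) that are omitted here.

```python
# Minimal sketch of pathwise coordinate descent for the lasso.
# Illustrative only: simplified strong rule, no KKT rechecking,
# convex l1 penalty rather than the paper's nonconvex penalties.
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_path(X, y, n_lambda=50, max_iter=200, tol=1e-6):
    n, d = X.shape
    col_norm2 = (X ** 2).sum(axis=0) / n          # per-coordinate curvature
    lam_max = np.max(np.abs(X.T @ y)) / n         # smallest lambda giving beta = 0
    lambdas = lam_max * np.logspace(0, -2, n_lambda)
    beta = np.zeros(d)                            # warm start: reused across lambdas
    path, lam_prev = [], lam_max
    for lam in lambdas:
        # Strong rule: preselect coordinates with large gradient at the previous solution.
        grad = X.T @ (X @ beta - y) / n
        strong = np.abs(grad) >= 2 * lam - lam_prev
        active = strong | (beta != 0)
        for _ in range(max_iter):
            beta_old = beta.copy()
            # Inner loop: cyclic coordinate updates restricted to the active set.
            for j in np.flatnonzero(active):
                r_j = y - X @ beta + X[:, j] * beta[j]      # partial residual
                beta[j] = soft_threshold(X[:, j] @ r_j / n, lam) / col_norm2[j]
            active = strong | (beta != 0)                   # active-set updating
            if np.max(np.abs(beta - beta_old)) < tol:
                break
        path.append((lam, beta.copy()))
        lam_prev = lam
    return path
```

Because consecutive solutions along the path differ little, warm starts keep each inner solve cheap, and the active set stays close to the true support, which is the intuition the paper's theory makes precise.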

Article information

Source
Ann. Statist., Volume 46, Number 1 (2018), 180-218.

Dates
Received: August 2016
Revised: January 2017
First available in Project Euclid: 22 February 2018

Permanent link to this document
https://projecteuclid.org/euclid.aos/1519268428

Digital Object Identifier
doi:10.1214/17-AOS1547

Mathematical Reviews number (MathSciNet)
MR3766950

Zentralblatt MATH identifier
06865109

Subjects
Primary: 62F30: Inference under constraints 90C26: Nonconvex programming, global optimization
Secondary: 62J12: Generalized linear models 90C52: Methods of reduced gradient type

Keywords
Nonconvex sparse learning; pathwise coordinate optimization; global linear convergence; optimal statistical rates of convergence; oracle property; active set; strong rule

Citation

Zhao, Tuo; Liu, Han; Zhang, Tong. Pathwise coordinate optimization for sparse learning: Algorithm and theory. Ann. Statist. 46 (2018), no. 1, 180--218. doi:10.1214/17-AOS1547. https://projecteuclid.org/euclid.aos/1519268428



Supplemental materials

  • Supplement to “Pathwise coordinate optimization for sparse learning: Algorithm and theory”. The supplementary materials contain the proofs of the theoretical lemmas in the paper “Pathwise coordinate optimization for nonconvex sparse learning: Algorithm and theory.”