Statistical Science

Hierarchical Sparse Modeling: A Choice of Two Group Lasso Formulations

Xiaohan Yan and Jacob Bien


Abstract

Demanding sparsity in estimated models has become a routine practice in statistics. In many situations, we wish to require that the sparsity patterns attained honor certain problem-specific constraints. Hierarchical sparse modeling (HSM) refers to situations in which these constraints specify that one set of parameters be set to zero whenever another is set to zero. In recent years, numerous papers have developed convex regularizers for this form of sparsity structure, which arises in many areas of statistics including interaction modeling, time series analysis, and covariance estimation. In this paper, we observe that these methods fall into two frameworks, the group lasso (GL) and latent overlapping group lasso (LOG), which have not been systematically compared in the context of HSM. The purpose of this paper is to provide a side-by-side comparison of these two frameworks for HSM in terms of their statistical properties and computational efficiency. We call special attention to GL’s more aggressive shrinkage of parameters deep in the hierarchy, a property not shared by LOG. In terms of computation, we introduce a finite-step algorithm that exactly solves the proximal operator of LOG for a certain simple HSM structure; we later exploit this to develop a novel path-based block coordinate descent scheme for general HSM structures. Both algorithms greatly improve the computational performance of LOG. Finally, we compare the two methods in the context of covariance estimation, where we introduce a new sparsely-banded estimator using LOG, which we show achieves the statistical advantages of an existing GL-based method but is simpler to express and more efficient to compute.
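Illustration of the two penalties

As a concrete sketch (using the standard constructions from the structured-sparsity literature; the notation and weights are illustrative rather than taken from the paper), consider the smallest hierarchy: a parent coefficient \beta_1 and a child \beta_2 that may be nonzero only when \beta_1 is. GL encodes this with the nested groups \{1,2\} \supset \{2\}:

  \Omega_{\mathrm{GL}}(\beta) = \lambda \bigl( \|(\beta_1, \beta_2)\|_2 + \|\beta_2\|_2 \bigr),

whose attainable zero patterns are (generically) unions of the groups, so the nonzero set can be \emptyset, \{1\}, or \{1,2\}, but never \{2\} alone: the child enters the model only together with its parent. LOG instead uses the groups \{1\} and \{1,2\} together with a latent decomposition of \beta:

  \Omega_{\mathrm{LOG}}(\beta) = \min \Bigl\{ \lambda \bigl( \|v^{(1)}\|_2 + \|v^{(2)}\|_2 \bigr) : v^{(1)} + v^{(2)} = \beta,\ \mathrm{supp}(v^{(1)}) \subseteq \{1\},\ \mathrm{supp}(v^{(2)}) \subseteq \{1,2\} \Bigr\},

so the estimated support is (generically) a union of selected groups and again respects the hierarchy. Note that \beta_2 is penalized by two norms under GL but belongs to a single latent group under LOG; in deeper hierarchies this gap grows with depth, which is the source of GL's more aggressive shrinkage of deep parameters noted in the abstract.

A basic building block of proximal algorithms for group-lasso-type penalties is the proximal operator of a single group norm, i.e., blockwise soft-thresholding. The Python/NumPy sketch below (the function name and the value of lam are illustrative, not from the paper) shows this operator and applies it to the nested GL groups of the toy hierarchy via the known leaves-to-root composition for tree-structured groups (Jenatton et al., 2011); it is not the paper's finite-step LOG algorithm.

import numpy as np

def group_soft_threshold(x, lam):
    """Proximal operator of lam * ||x||_2 (blockwise soft-thresholding).

    Returns argmin_z 0.5 * ||z - x||_2^2 + lam * ||z||_2, which zeroes the
    whole block when its norm is at most lam and shrinks it toward zero
    otherwise.
    """
    norm = np.linalg.norm(x)
    if norm <= lam:
        return np.zeros_like(x)
    return (1.0 - lam / norm) * x

# Prox of the nested GL penalty lam * (||(b1, b2)||_2 + ||b2||_2): apply the
# operator once per group, deepest group first (leaves-to-root composition,
# which is exact here because the groups are nested).
beta = np.array([1.5, 0.4])
lam = 0.5                                        # illustrative value only
beta[1:] = group_soft_threshold(beta[1:], lam)   # inner group {2}
beta = group_soft_threshold(beta, lam)           # outer group {1, 2}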

Article information

Source
Statist. Sci., Volume 32, Number 4 (2017), 531–560.

Dates
First available in Project Euclid: 28 November 2017

Permanent link to this document
https://projecteuclid.org/euclid.ss/1511838027

Digital Object Identifier
doi:10.1214/17-STS622

Mathematical Reviews number (MathSciNet)
MR3730521

Zentralblatt MATH identifier
06849281

Keywords
Hierarchical sparsity; convex regularization; group lasso; latent overlapping group lasso

Citation

Yan, Xiaohan; Bien, Jacob. Hierarchical Sparse Modeling: A Choice of Two Group Lasso Formulations. Statist. Sci. 32 (2017), no. 4, 531–560. doi:10.1214/17-STS622. https://projecteuclid.org/euclid.ss/1511838027


