## Statistical Science

### Structured Sparsity through Convex Optimization

#### Abstract

Sparse estimation methods are aimed at using or obtaining parsimonious representations of data or models. While naturally cast as a combinatorial optimization problem, variable or feature selection admits a convex relaxation through the regularization by the $\ell_{1}$-norm. In this paper, we consider situations where we are not only interested in sparsity, but where some structural prior knowledge is available as well. We show that the $\ell_{1}$-norm can then be extended to structured norms built on either disjoint or overlapping groups of variables, leading to a flexible framework that can deal with various structures. We present applications to unsupervised learning, for structured sparse principal component analysis and hierarchical dictionary learning, and to supervised learning in the context of nonlinear variable selection.

#### Article information

Source
Statist. Sci. Volume 27, Number 4 (2012), 450-468.

Dates
First available in Project Euclid: 21 December 2012

https://projecteuclid.org/euclid.ss/1356098550

Digital Object Identifier
doi:10.1214/12-STS394

Mathematical Reviews number (MathSciNet)
MR3025128

Zentralblatt MATH identifier
1331.90050

Keywords
Sparsity convex optimization

#### Citation

Bach, Francis; Jenatton, Rodolphe; Mairal, Julien; Obozinski, Guillaume. Structured Sparsity through Convex Optimization. Statist. Sci. 27 (2012), no. 4, 450--468. doi:10.1214/12-STS394. https://projecteuclid.org/euclid.ss/1356098550.

#### References

• Adams, R., Ghahramani, Z. and Jordan, M. (2010). Tree-structured stick breaking for hierarchical data. In Advances in Neural Information Processing Systems 23 (J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel and A. Culotta, eds.) 19–27.
• Aharon, M., Elad, M. and Bruckstein, A. (2006). K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Processing 54 4311–4322.
• Bach, F. R. (2008). Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res. 9 1179–1225.
• Bach, F. (2009). Exploring large feature spaces with hierarchical multiple kernel learning. In Neural Information Processing Systems 21.
• Bach, F. (2010). Structured sparsity-inducing norms through submodular functions. In Advances in Neural Information Processing Systems 23.
• Bach, F. (2011a). Learning with submodular functions: A convex optimization perspective. Technical Report No. 00645271, HAL.
• Bach, F. (2011b). Shaping level sets with submodular functions. In Advances in Neural Information Processing Systems 24.
• Bach, F., Mairal, J. and Ponce, J. (2008). Convex sparse matrix factorizations. Technical report. Preprint. Available at arXiv:0812.1869.
• Bach, F., Jenatton, R., Mairal, J. and Obozinski, G. (2012). Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning 4 1–106.
• Baraniuk, R. G., Cevher, V., Duarte, M. F. and Hegde, C. (2010). Model-based compressive sensing. IEEE Trans. Inform. Theory 56 1982–2001.
• Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2 183–202.
• Becker, S., Bobin, J. and Candès, E. J. (2011). NESTA: A fast and accurate first-order method for sparse recovery. SIAM J. Imaging Sci. 4 1–39.
• Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732.
• Blei, D. M., Griffiths, T. L. and Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM 57 1–30.
• Blei, D., Ng, A. and Jordan, M. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res. 3 993–1022.
• Bondell, H. D. and Reich, B. J. (2008). Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics 64 115–123.
• Borwein, J. M. and Lewis, A. S. (2006). Convex Analysis and Nonlinear Optimization. Theory and Examples, 2nd ed. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC 3. Springer, New York.
• Buntine, W. L. (2002). Variational extensions to EM and multinomial PCA. In Proceedings of the European Conference on Machine Learning (ECML).
• Candes, E. J. and Tao, T. (2005). Decoding by linear programming. IEEE Trans. Inform. Theory 51 4203–4215.
• Cevher, V., Duarte, M. F., Hegde, C. and Baraniuk, R. G. (2008). Sparse signal recovery using Markov random fields. In Advances in Neural Information Processing Systems 20.
• Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 33–61.
• Chen, X., Lin, Q., Kim, S., Carbonell, J. G. and Xing, E. P. (2011). Smoothing proximal gradient method for general structured sparse learning. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI).
• Combettes, P. L. and Pesquet, J. C. (2010). Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer, New York.
• d’Aspremont, A., Bach, F. and El Ghaoui, L. (2008). Optimal solutions for sparse principal component analysis. J. Mach. Learn. Res. 9 1269–1294.
• Donoho, D. L. and Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90 1200–1224.
• Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407–499.
• Friedman, J., Hastie, T. and Tibshirani, R. (2010). A note on the group Lasso and a sparse group Lasso. Preprint.
• Friedman, J., Hastie, T., Höfling, H. and Tibshirani, R. (2007). Pathwise coordinate optimization. Ann. Appl. Stat. 1 302–332.
• Gramfort, A. and Kowalski, M. (2009). Improving M/EEG source localization with an inter-condition sparse prior. In IEEE International Symposium on Biomedical Imaging.
• Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proc. Natl. Acad. Sci. USA 101 5228–5235.
• Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. Data Mining, Inference, and Prediction. Springer, New York.
• Huang, J. and Zhang, T. (2010). The benefit of group sparsity. Ann. Statist. 38 1978–2004.
• Huang, J., Zhang, T. and Metaxas, D. (2011). Learning with structured sparsity. J. Mach. Learn. Res. 12 3371–3412.
• Jacob, L., Obozinski, G. and Vert, J. P. (2009). Group Lasso with overlaps and graph Lasso. In Proceedings of the International Conference on Machine Learning (ICML).
• Jenatton, R., Audibert, J.-Y. and Bach, F. (2011). Structured variable selection with sparsity-inducing norms. J. Mach. Learn. Res. 12 2777–2824.
• Jenatton, R., Obozinski, G. and Bach, F. (2010). Structured sparse principal component analysis. In International Conference on Artificial Intelligence and Statistics (AISTATS).
• Jenatton, R., Gramfort, A., Michel, V., Obozinski, G., Eger, E., Bach, F. and Thirion, B. (2011a). Multi-scale mining of fMRI data with hierarchical structured sparsity. SIAM J. Imaging Sci. To appear. Technical report. Preprint. Available at arXiv:1105.0363.
• Jenatton, R., Mairal, J., Obozinski, G. and Bach, F. (2011b). Proximal methods for hierarchical sparse coding. J. Mach. Learn. Res. 12 2297–2334.
• Jolliffe, I. T., Trendafilov, N. T. and Uddin, M. (2003). A modified principal component technique based on the LASSO. J. Comput. Graph. Statist. 12 531–547.
• Kavukcuoglu, K., Ranzato, M. A., Fergus, R. and LeCun, Y. (2009). Learning invariant features through topographic filter maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
• Kim, S., Sohn, K. A. and Xing, E. P. (2009). A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics 25 204–212.
• Kim, S. and Xing, E. P. (2010). Tree-guided group Lasso for multi-task regression with structured sparsity. In Proceedings of the International Conference on Machine Learning (ICML).
• Kimeldorf, G. and Wahba, G. (1971). Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 33 82–95.
• Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401 788–791.
• Lin, Y. and Zhang, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. Ann. Statist. 34 2272–2297.
• Liu, H., Palatucci, M. and Zhang, J. (2009). Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In Proceedings of the International Conference on Machine Learning (ICML).
• Lounici, K., Pontil, M., Tsybakov, A. B. and van de Geer, S. (2009). Taking advantage of sparsity in multi-task learning. In Proceedings of the Conference on Learning Theory.
• Lounici, K., Pontil, M., van de Geer, S. and Tsybakov, A. B. (2011). Oracle inequalities and optimal inference under group sparsity. Ann. Statist. 39 2164–2204.
• Mackey, L. (2009). Deflation methods for sparse PCA. In Advances in Neural Information Processing Systems 21.
• Mairal, J. (2010). Sparse coding for machine learning, image processing and computer vision. Ph.D. thesis, École normale supérieure de Cachan—ENS Cachan. Available at http://tel.archives-ouvertes.fr/tel-00595312/fr/.
• Mairal, J., Bach, F., Ponce, J. and Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res. 11 19–60.
• Mairal, J., Jenatton, R., Obozinski, G. and Bach, F. (2011). Convex and network flow optimization for structured sparsity. J. Mach. Learn. Res. 12 2681–2720.
• Mallat, S. G. (1999). A Wavelet Tour of Signal Processing. Academic Press, New York.
• Martinez, A. M. and Kak, A. C. (2001). PCA versus LDA. In IEEE Transactions on Pattern Analysis and Machine Intelligence 23 228–233.
• Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462.
• Moghaddam, B., Weiss, Y. and Avidan, S. (2006). Spectral bounds for sparse PCA: Exact and greedy algorithms. In Advances in Neural Information Processing Systems 18.
• Moreau, J.-J. (1962). Fonctions convexes duales et points proximaux dans un espace hilbertien. C. R. Acad. Sci. Paris 255 2897–2899.
• Needell, D. and Tropp, J. A. (2009). CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Appl. Comput. Harmon. Anal. 26 301–321.
• Negahban, S. N. and Wainwright, M. J. (2011). Simultaneous support recovery in high dimensions: Benefits and perils of block $\ell_{1}/\ell_{\infty}$-regularization. IEEE Trans. Inform. Theory 57 3841–3863.
• Negahban, S., Ravikumar, P., Wainwright, M. J. and Yu, B. (2009). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems 22.
• Nesterov, Y. (2004). Introductory Lectures on Convex Optimization. A Basic Course. Applied Optimization 87. Kluwer Academic, Boston, MA.
• Nesterov, Y. (2007). Gradient methods for minimizing composite objective function. Technical report, Center for Operations Research and Econometrics (CORE), Catholic Univ. Louvain.
• Obozinski, G. and Bach, F. (2012). Convex relaxation for combinatorial penalties. Technical report, HAL.
• Obozinski, G., Jacob, L. and Vert, J. P. (2011). Group Lasso with overlaps: The latent group Lasso approach. Technical Report No. inria-00628498, HAL.
• Obozinski, G., Taskar, B. and Jordan, M. I. (2010). Joint covariate selection and joint subspace selection for multiple classification problems. Stat. Comput. 20 231–252.
• Obozinski, G., Wainwright, M. J. and Jordan, M. I. (2011). Support union recovery in high-dimensional multivariate regression. Ann. Statist. 39 1–47.
• Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381 607–609.
• Percival, D. (2012). Theoretical properties of the overlapping group Lasso. Electron. J. Statist. 6 269–288.
• Quattoni, A., Carreras, X., Collins, M. and Darrell, T. (2009). An efficient projection for $\ell_{1}/\ell_{\infty}$ regularization. In Proceedings of the International Conference on Machine Learning (ICML).
• Rao, N. S., Nowak, R. D., Wright, S. J. and Kingsbury, N. G. (2011). Convex approaches to model wavelet sparsity patterns. In International Conference on Image Processing (ICIP).
• Rapaport, F., Barillot, E. and Vert, J.-P. (2008). Classification of arrayCGH data using fused SVM. Bioinformatics 24 i375–i382.
• Ravikumar, P., Lafferty, J., Liu, H. and Wasserman, L. (2009). Sparse additive models. J. R. Stat. Soc. Ser. B Stat. Methodol. 71 1009–1030.
• Roth, V. and Fischer, B. (2008). The group-Lasso for generalized linear models: Uniqueness of solutions and efficient algorithms. In Proceedings of the International Conference on Machine Learning (ICML).
• Rudin, L. I., Osher, S. and Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Phys. D 60 259–268.
• Schmidt, M., Le Roux, N. and Bach, F. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems 24.
• Schmidt, M. and Murphy, K. (2010). Convex structure learning in log-linear models: Beyond pairwise potentials. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).
• Shalev-Shwartz, S., Srebro, N. and Zhang, T. (2010). Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM J. Optim. 20 2807–2832.
• Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge Univ. Press, Cambridge.
• Shen, X. and Huang, H.-C. (2010). Grouping pursuit through a regularization solution surface. J. Amer. Statist. Assoc. 105 727–739.
• Singh, A. P. and Gordon, G. J. (2008). A unified view of matrix factorization models. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases.
• Sprechmann, P., Ramirez, I., Sapiro, G. and Eldar, Y. (2010). Collaborative hierarchical sparse modeling. In 44th Annual Conference on Information Sciences and Systems (CISS) 1–6. IEEE.
• Stojnic, M., Parvaresh, F. and Hassibi, B. (2009). On the reconstruction of block-sparse signals with an optimal number of measurements. IEEE Trans. Signal Process. 57 3075–3085.
• Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
• Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 91–108.
• Tropp, J. A. (2004). Greed is good: Algorithmic results for sparse approximation. IEEE Trans. Inform. Theory 50 2231–2242.
• Tropp, J. A. (2006). Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Trans. Inform. Theory 52 1030–1051.
• Turlach, B. A., Venables, W. N. and Wright, S. J. (2005). Simultaneous variable selection. Technometrics 47 349–363.
• van de Geer, S. (2010). $\ell_{1}$-regularization in high-dimensional statistical models. In Proceedings of the International Congress of Mathematicians. Volume IV 2351–2369. Hindustan Book Agency, New Delhi.
• Varoquaux, G., Jenatton, R., Gramfort, A., Obozinski, G., Thirion, B. and Bach, F. (2010). Sparse structured dictionary learning for brain resting-state activity modeling. In NIPS Workshop on Practical Applications of Sparse Modeling: Open Issues and New Directions.
• Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_{1}$-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory 55 2183–2202.
• Witten, D. M., Tibshirani, R. and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10 515–534.
• Wright, S. J., Nowak, R. D. and Figueiredo, M. A. T. (2009). Sparse reconstruction by separable approximation. IEEE Trans. Signal Process. 57 2479–2493.
• Wu, T. T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2 224–244.
• Xiang, Z. J., Xi, Y. T., Hasson, U. and Ramadge, P. J. (2009). Boosting with spatial regularization. In Advances in Neural Information Processing Systems 22.
• Yuan, M. (2010). High dimensional inverse covariance matrix estimation via linear programming. J. Mach. Learn. Res. 11 2261–2286.
• Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49–67.
• Yuan, G.-X., Chang, K.-W., Hsieh, C.-J. and Lin, C.-J. (2010). A comparison of optimization methods and software for large-scale L1-regularized linear classification. J. Mach. Learn. Res. 11 3183–3234.
• Zass, R. and Shashua, A. (2007). Nonnegative sparse PCA. In Advances in Neural Information Processing Systems 19.
• Zhang, T. (2009). Some sharp performance bounds for least squares regression with $L_{1}$ regularization. Ann. Statist. 37 2109–2144.
• Zhao, P., Rocha, G. and Yu, B. (2009). The composite absolute penalties family for grouped and hierarchical variable selection. Ann. Statist. 37 3468–3497.
• Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541–2563.
• Zhong, L. W. and Kwok, J. T. (2011). Efficient sparse modeling with automatic feature grouping. In Proceedings of the International Conference on Machine Learning (ICML).
• Zhou, Y., Jin, R. and Hoi, S. C. H. (2010). Exclusive Lasso for multi-task feature selection. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).
• Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429.
• Zou, H., Hastie, T. and Tibshirani, R. (2006). Sparse principal component analysis. J. Comput. Graph. Statist. 15 265–286.