The Annals of Applied Statistics

Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping

Seyoung Kim and Eric P. Xing
Source: Ann. Appl. Stat. Volume 6, Number 3 (2012), 1095-1117.

Abstract

We consider the problem of estimating a sparse multi-response regression function, with an application to expression quantitative trait locus (eQTL) mapping, where the goal is to discover genetic variations that influence gene-expression levels. In particular, we investigate a shrinkage technique capable of capturing a given hierarchical structure over the responses, such as a hierarchical clustering tree with leaf nodes for responses and internal nodes for clusters of related responses at multiple granularities, and we seek to leverage this structure to recover covariates relevant to each hierarchically defined cluster of responses. We propose a tree-guided group lasso, or tree lasso, for estimating such structured sparsity under multi-response regression by employing a novel penalty function constructed from the tree. We describe a systematic weighting scheme for the overlapping groups in the tree penalty such that each regression coefficient is penalized in a balanced manner, even though overlaps among the groups give the coefficients unequal numbers of group memberships. For efficient optimization, we employ a smoothing proximal gradient method that was originally developed for a general class of structured-sparsity-inducing penalties. Using simulated and yeast data sets, we demonstrate that our method achieves superior performance, in terms of both prediction error and recovery of true sparsity patterns, compared to other methods for learning a multivariate-response regression.
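The tree-constructed penalty described in the abstract can be illustrated with a minimal sketch. It assumes the penalty takes the usual overlapping group-lasso form: for each covariate, a weighted sum of the L2 norms of its coefficients over the responses in each tree node. The abstract does not state the formula, so this form is an assumption. The function name `tree_lasso_penalty`, the three-response tree, and the weights are hypothetical placeholders; in particular, the weights are not the paper's balanced weighting scheme.

```python
import math


def tree_lasso_penalty(B, groups, weights):
    """Overlapping group-lasso penalty with groups read off a tree.

    B       : list of rows, one row of coefficients per covariate,
              one column per response (assumed layout).
    groups  : list of response-index lists, one per tree node.
    weights : one non-negative weight per tree node (placeholders here).
    """
    total = 0.0
    for row in B:
        for group, w in zip(groups, weights):
            # L2 norm of this covariate's coefficients within the node's group
            total += w * math.sqrt(sum(row[k] ** 2 for k in group))
    return total


# Hypothetical tree over 3 responses:
# leaves {0}, {1}, {2}; internal node {0, 1}; root {0, 1, 2}.
groups = [[0], [1], [2], [0, 1], [0, 1, 2]]
weights = [0.25, 0.25, 0.5, 0.25, 0.5]  # illustrative values only

# Two covariates: the first affects responses 0 and 1, the second affects none.
B = [
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 0.0],
]
penalty = tree_lasso_penalty(B, groups, weights)
```

Because the groups overlap, a coefficient under a deep leaf is counted in more norms than one under a shallow leaf, which is the imbalance the abstract's weighting scheme is meant to correct.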

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aoas/1346418575
Digital Object Identifier: doi:10.1214/12-AOAS549
Zentralblatt MATH identifier: 06096523
Mathematical Reviews number (MathSciNet): MR3012522

2013 © Institute of Mathematical Statistics
