The Annals of Applied Statistics

Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping

Seyoung Kim and Eric P. Xing


Abstract

We consider the problem of estimating a sparse multi-response regression function, with an application to expression quantitative trait locus (eQTL) mapping, where the goal is to discover genetic variations that influence gene-expression levels. In particular, we investigate a shrinkage technique capable of capturing a given hierarchical structure over the responses, such as a hierarchical clustering tree with leaf nodes for responses and internal nodes for clusters of related responses at multiple granularities, and we seek to leverage this structure to recover covariates relevant to each hierarchically defined cluster of responses. We propose a tree-guided group lasso, or tree lasso, for estimating such structured sparsity under multi-response regression by employing a novel penalty function constructed from the tree. We describe a systematic weighting scheme for the overlapping groups in the tree penalty such that each regression coefficient is penalized in a balanced manner, even though overlaps among groups cause the coefficients to belong to differing numbers of groups. For efficient optimization, we employ a smoothing proximal gradient method that was originally developed for a general class of structured-sparsity-inducing penalties. Using simulated and yeast data sets, we demonstrate that our method achieves superior performance, in terms of both prediction error and recovery of true sparsity patterns, compared to other methods for learning a multivariate-response regression.
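
As a rough illustration of a penalty "constructed from the tree", the sketch below evaluates a weighted overlapping group-lasso penalty in which every tree node (leaf or internal) contributes the L2 norm of one covariate's coefficients over the responses under that node. The function name `tree_lasso_penalty`, the toy tree, and the weight values are assumptions made for this example; they are not taken from the paper, and the paper's own weighting scheme is not reproduced here.

```python
import math

def tree_lasso_penalty(B, groups, weights, lam=1.0):
    """Weighted overlapping group-lasso penalty with groups read off a tree.

    B       : list of rows; B[j][k] is the coefficient of covariate j for response k
    groups  : list of response-index lists, one per tree node (leaf or internal)
    weights : one nonnegative weight per node (illustrative values, supplied by the caller)
    lam     : overall regularization strength
    """
    total = 0.0
    for row in B:  # handle one covariate at a time
        for group, w in zip(groups, weights):
            # L2 norm of this covariate's coefficients over the node's responses
            norm = math.sqrt(sum(row[k] ** 2 for k in group))
            total += w * norm
    return lam * total

# Toy tree over three responses: leaves {0}, {1}, {2}; internal node {0, 1}; root {0, 1, 2}.
groups = [[0], [1], [2], [0, 1], [0, 1, 2]]
weights = [0.25, 0.25, 0.5, 0.25, 0.5]  # hypothetical weights, for illustration only

B = [
    [3.0, 4.0, 0.0],  # covariate 0 influences the cluster {0, 1}
    [0.0, 0.0, 0.0],  # covariate 1 is irrelevant to every response
]
penalty = tree_lasso_penalty(B, groups, weights)
print(penalty)
```

Because the penalty uses an unsquared L2 norm per node, shrinking a whole cluster's coefficients to zero for a covariate removes that node's contribution entirely, which is what encourages cluster-level sparsity.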

Article information

Source
Ann. Appl. Stat. Volume 6, Number 3 (2012), 1095–1117.

Dates
First available in Project Euclid: 31 August 2012

Permanent link to this document
http://projecteuclid.org/euclid.aoas/1346418575

Digital Object Identifier
doi:10.1214/12-AOAS549

Zentralblatt MATH identifier
06096523

Mathematical Reviews number (MathSciNet)
MR3012522

Citation

Kim, Seyoung; Xing, Eric P. Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping. The Annals of Applied Statistics 6 (2012), no. 3, 1095–1117. doi:10.1214/12-AOAS549. http://projecteuclid.org/euclid.aoas/1346418575.



Supplemental materials

  • Supplementary material: The balanced weighting scheme of tree lasso and additional experimental results. We prove that the weighting scheme of the tree-lasso penalty achieves a balanced penalization of all regression coefficients. We also provide additional experimental results on the comparison of the tree lasso with other sparse regression methods using simulated data sets.
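
To make "balanced penalization" concrete, the sketch below builds node weights under one assumed scheme and checks that, for every response, the weights of all groups containing it sum to one. The scheme is an assumption for illustration: each internal node v carries a pair (s_v, g_v) with s_v + g_v = 1, an internal node's weight is g_v times the product of its ancestors' s values, and a leaf's weight is the product of its ancestors' s values. The names `node_weights` and `s` are hypothetical, and whether this matches the supplement's exact construction should be checked against the paper.

```python
def node_weights(tree, s, scale=1.0, out=None):
    """Return [(responses, weight), ...] for every node under `tree`.

    tree  : an int (leaf, i.e. a response index) or a (name, [children]) pair
    s     : dict mapping internal-node name -> s_v in [0, 1]; g_v is taken as 1 - s_v
    scale : product of s values along the path from the root (1.0 at the root)
    """
    if out is None:
        out = []
    if isinstance(tree, int):
        # A leaf's weight is the product of its ancestors' s values.
        out.append(([tree], scale))
        return out
    name, children = tree
    s_v = s[name]
    g_v = 1.0 - s_v
    start = len(out)
    for child in children:
        node_weights(child, s, scale * s_v, out)
    # The internal node's group is the set of responses under it.
    members = sorted({k for grp, _ in out[start:] for k in grp})
    out.append((members, scale * g_v))
    return out

# Toy tree: the root has an internal child "c01" (leaves 0 and 1) and the leaf 2.
tree = ("root", [("c01", [0, 1]), 2])
s = {"root": 0.5, "c01": 0.5}  # hypothetical values

nodes = node_weights(tree, s)

# For each response, add up the weights of every group that contains it.
totals = {}
for members, w in nodes:
    for k in members:
        totals[k] = totals.get(k, 0.0) + w
print(totals)
```

In this toy case each response accumulates a total weight of one regardless of how many groups it belongs to, which is one way to read the balance property described above.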