The Annals of Applied Statistics

Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer

Jie Peng, Ji Zhu, Anna Bergamaschi, Wonshik Han, Dong-Young Noh, Jonathan R. Pollack, and Pei Wang

Full-text: Open access


In this paper we propose a new method, remMap (REgularized Multivariate regression for identifying MAster Predictors), for fitting multivariate response regression models under the high-dimension–low-sample-size setting. remMap is motivated by the study of regulatory relationships among different biological molecules based on multiple types of high-dimensional genomic data. In particular, we are interested in the influence of DNA copy number alterations on RNA transcript levels. For this purpose, we model the dependence of RNA expression levels on DNA copy numbers through multivariate linear regressions and use regularization both to deal with the high dimensionality and to incorporate desired network structures. Criteria for selecting the tuning parameters are also discussed. The performance of the proposed method is illustrated through extensive simulation studies. Finally, remMap is applied to a breast cancer study in which genome-wide RNA transcript levels and DNA copy numbers were measured for 172 tumor samples. We identify a trans-hub region in cytoband 17q12-q21 whose amplification influences the RNA expression levels of more than 30 unlinked genes. These findings may lead to a better understanding of breast cancer pathology.
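The abstract does not state the penalty remMap uses, so the following is only a minimal sketch of the general idea it describes: a multivariate linear regression Y = XB + E in which regularization favors a few "master" predictors (rows of B) that affect many responses. It assumes a combined elementwise l1 penalty and row-wise l2 penalty, fitted by proximal gradient descent. The function names, the parameters `lam1` and `lam2`, and the solver are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def soft_threshold(z, t):
    """Elementwise soft-thresholding: the proximal operator of t * |z|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def fit_row_sparse(X, Y, lam1, lam2, n_iter=500):
    """Proximal gradient for an assumed criterion (not taken from the source):

        0.5 * ||Y - X B||_F^2 + lam1 * sum_ij |B_ij| + lam2 * sum_i ||B_i.||_2

    X is n-by-p (e.g. copy numbers), Y is n-by-q (e.g. expression levels),
    and B is p-by-q. A row of B that is not shrunk to zero marks a predictor
    that influences at least one response.
    """
    p = X.shape[1]
    q = Y.shape[1]
    B = np.zeros((p, q))
    # Step size 1/L, where L is the squared spectral norm of X.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        grad = X.T @ (X @ B - Y)
        # Prox of the l1 term: elementwise soft-thresholding.
        Z = soft_threshold(B - step * grad, step * lam1)
        # Prox of the row-wise l2 term: shrink each row toward zero as a group.
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        scale = np.maximum(1.0 - step * lam2 / np.maximum(norms, 1e-12), 0.0)
        B = Z * scale
    return B
```

In this sketch, larger values of `lam2` remove whole predictors, while `lam1` controls sparsity within the rows that remain; the row norms of the fitted matrix give a rough ranking of candidate hub predictors.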

Article information

Ann. Appl. Stat. Volume 4, Number 1 (2010), 53-77.

First available in Project Euclid: 11 May 2010


Peng, Jie; Zhu, Ji; Bergamaschi, Anna; Han, Wonshik; Noh, Dong-Young; Pollack, Jonathan R.; Wang, Pei. Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Ann. Appl. Stat. 4 (2010), no. 1, 53–77. doi:10.1214/09-AOAS271.



