The Annals of Applied Statistics
- Ann. Appl. Stat.
- Volume 7, Number 1 (2013), 269-294.
Sparse integrative clustering of multiple omics data sets
Ronglai Shen, Sijian Wang, and Qianxing Mo
Abstract
High-resolution microarrays and second-generation sequencing platforms are powerful tools to investigate genome-wide alterations in DNA copy number, methylation and gene expression associated with a disease. An integrated genomic profiling approach measures multiple omics data types simultaneously in the same set of biological samples. Such an approach renders an integrated data resolution that would not be available with any single data type. In this study, we use penalized latent variable regression methods for joint modeling of multiple omics data types to identify common latent variables that can be used to cluster patient samples into biologically and clinically relevant disease subtypes. We consider lasso [J. Roy. Statist. Soc. Ser. B 58 (1996) 267–288], elastic net [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005) 301–320] and fused lasso [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005) 91–108] methods to induce sparsity in the coefficient vectors, revealing important genomic features that have significant contributions to the latent variables. An iterative ridge regression is used to compute the sparse coefficient vectors. In model selection, a uniform design [Monographs on Statistics and Applied Probability (1994) Chapman & Hall] is used to seek “experimental” points that are scattered uniformly across the search domain for efficient sampling of tuning parameter combinations. We compare our method to sparse singular value decomposition (SVD) and a penalized Gaussian mixture model (GMM) using both real and simulated data sets. The proposed method is applied to integrate genomic, epigenomic and transcriptomic data for subtype analysis in breast and lung cancer data sets.
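The abstract mentions that sparse coefficient vectors are computed by an iterative ridge regression. As a rough illustration of that general idea only, and not the authors' implementation, the sketch below approximates an ordinary lasso regression by repeatedly solving a weighted ridge problem, using the local quadratic approximation |b| ≈ b²/(2|b_old|) + constant. The function name `lasso_by_iterative_ridge`, the parameters `lam`, `eps` and `tol`, the ridge start value and the toy data are all assumptions made for this example; the latent variable model of the paper is not reproduced here.

```python
import numpy as np

def lasso_by_iterative_ridge(X, y, lam, n_iter=200, eps=1e-8, tol=1e-10):
    """Approximately minimize 0.5 * ||y - X b||^2 + lam * sum_j |b_j|
    by iterating weighted ridge fits (illustrative sketch, hypothetical helper)."""
    p = X.shape[1]
    XtX = X.T @ X
    Xty = X.T @ y
    # Lightly regularized ridge start so coefficients do not begin at exactly zero.
    b = np.linalg.solve(XtX + 1e-3 * np.eye(p), Xty)
    for _ in range(n_iter):
        # The L1 term is replaced by a quadratic one with per-coefficient
        # ridge weight lam / |b_j| taken from the previous iterate.
        w = lam / np.maximum(np.abs(b), eps)
        b_new = np.linalg.solve(XtX + np.diag(w), Xty)
        converged = np.max(np.abs(b_new - b)) < tol
        b = b_new
        if converged:
            break
    # Coefficients driven to numerically negligible size are set to zero.
    b[np.abs(b) < 1e-6] = 0.0
    return b

# Toy data with orthonormal columns, where the lasso solution is known
# in closed form (soft-thresholding of X^T y), for comparison.
rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((50, 5)))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.01 * rng.standard_normal(50)

lam = 0.5
b_hat = lasso_by_iterative_ridge(X, y, lam)
z = X.T @ y
b_soft = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
print("iterative ridge:", b_hat)
print("soft threshold :", b_soft)
```

The orthonormal toy design is chosen only because it gives a closed-form answer to compare against; with correlated features the same loop applies, but no such simple check is available.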
Article information
Source
Ann. Appl. Stat., Volume 7, Number 1 (2013), 269-294.
Dates
First available in Project Euclid: 9 April 2013
Permanent link to this document
https://projecteuclid.org/euclid.aoas/1365527199
Digital Object Identifier
doi:10.1214/12-AOAS578
Mathematical Reviews number (MathSciNet)
MR3086419
Zentralblatt MATH identifier
06171272
Keywords
Sparse integrative clustering; latent variable approach; penalized regression
Citation
Shen, Ronglai; Wang, Sijian; Mo, Qianxing. Sparse integrative clustering of multiple omics data sets. Ann. Appl. Stat. 7 (2013), no. 1, 269--294. doi:10.1214/12-AOAS578. https://projecteuclid.org/euclid.aoas/1365527199
References
- Alizadeh, A. A., Eisen, M. B., Davis, E. E. et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403 503–511.
- Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., Margolin, A. A., Kim, S., Wilson, C. J., Lehar, J., Kryukov, G. V., Sonkin, D., Reddy, A., Liu, M., Murray, L., Berger, M. F., Monahan, J. E., Morais, P., Meltzer, J., Korejwa, A., Jane-Valbuena, J., Mapa, F. A., Thibault, J., Bric-Furlong, E., Raman, P., Shipway, A., Engels, I. H., Cheng, J., Yu, G. K., Yu, J., Aspesi, P., de Silva, M., Jagtap, K., Jones, M. D., Wang, L., Hatton, C., Palescandolo, E., Gupta, S., Mahan, S., Sougnez, C., Onofrio, R. C., Liefeld, T., MacConaill, L., Winckler, W., Reich, M., Li, N., Mesirov, J. P., Gabriel, S. B., Getz, G., Ardlie, K., Chan, V., Myer, V. E. and Weber, B. L. (2012). The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483 603–607.
- Beroukhim, R., Mermel, C. H., Porter, D., Wei, G., Raychaudhuri, S., Donovan, J., Barretina, J., Boehm, J. S., Dobson, J., Urashima, M., Mc Henry, K. T., Pinchback, R. M., Ligon, A. H., Cho, Y. J., Haery, L., Greulich, H., Reich, M., Winckler, W., Lawrence, M. S., Weir, B. A., Tanaka, K. E., Chiang, D. Y., Bass, A. J., Loo, A., Hoffman, C., Prensner, J., Liefeld, T., Gao, Q., Yecies, D., Signoretti, S., Maher, E., Kaye, F. J., Sasaki, H., Tepper, J. E., Fletcher, J. A., Tabernero, J., Baselga, J., Tsao, M., Demichelis, F., Rubin, M. A., Janne, P. A., Daly, M. J., Nucera, C., Levine, R. L., Ebert, B. L., Gabriel, S., Rustgi, A., Antonescu, C. R., Ladanyi, M., Letai, A., Garraway, L., Loda, M., Beer, D., True, L. D., Okamoto, A., Pomeroy, S. L., Singer, S., Golub, T. R., Lander, E. S., Getz, G. and Sellers, W. R. (2010). The landscape of somatic copy-number alteration across human cancers. Nature 463 899–905.
- Cancer Genome Atlas Research Network (2008). Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455 1061–1068.
- Carvalho, I., Milanezi, F., Martins, A., Reis, R. M. and Schmitt, F. (2005). Overexpression of platelet-derived growth factor receptor alpha in breast cancer is associated with tumour progression. Breast Cancer Res. 7 R788–R795.
- Chen, H., Xing, H. and Zhang, N. R. (2011). Estimation of parent specific DNA copy number in tumors using high-density genotyping arrays. PLoS Comput. Biol. 7 e1001060, 15.
Mathematical Reviews (MathSciNet): MR2776334
- Chin, L. and Gray, J. W. (2008). Translating insights from the cancer genome into clinical practice. Nature 452 553–563.
Zentralblatt MATH: 1032.68737
- Chitale, D., Gong, Y., Taylor, B. S., Broderick, S., Brennan, C., Somwar, R., Golas, B., Wang, L., Motoi, N., Szoke, J., Reinersman, J. M., Major, J., Sander, C., Seshan, V. E., Zakowski, M. F., Rusch, V., Pao, W., Gerald, W. and Ladanyi, M. (2009). An integrated genomic analysis of lung cancer reveals loss of DUSP4 in EGFR-mutant tumors. Oncogene 28 2773–2783.
- Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 1–38.
Mathematical Reviews (MathSciNet): MR501537
- Dudoit, S. and Fridlyand, J. (2002). A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3 1–21.
- Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
Mathematical Reviews (MathSciNet): MR1946581
Zentralblatt MATH: 1073.62547
Digital Object Identifier: doi:10.1198/016214501753382273
- Fang, K. T. and Wang, Y. (1994). Number-Theoretic Methods in Statistics. Monographs on Statistics and Applied Probability 51. Chapman & Hall, London.
Mathematical Reviews (MathSciNet): MR1284470
- Feinberg, A. P. and Vogelstein, B. (1983). Hypomethylation distinguishes genes of some human cancers from their normal counterparts. Nature 301 89–92.
- Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York.
Mathematical Reviews (MathSciNet): MR2722294
- Holliday, R. (1979). A new theory of carcinogenesis. Br. J. Cancer 40 513–522.
- Holm, K., Hegardt, C., Staaf, J. et al. (2010). Molecular subtypes of breast cancer are associated with characteristic DNA methylation patterns. Breast Cancer Research 12 R36.
- Hoshida, Y., Nijman, S. M., Kobayashi, M., Chan, J. A., Brunet, J. P., Chiang, D. Y., Villanueva, A., Newell, P., Ikeda, K., Hashimoto, M., Watanabe, G., Gabriel, S., Friedman, S. L., Kumada, H., Llovet, J. M. and Golub, T. R. (2003). Integrative transcriptome analysis reveals common molecular subclasses of human hepatocellular carcinoma. Cancer Research 69 7385–7392.
- Irizarry, R. A., Ladd-Acosta, C., Carvalho, B., Wu, H., Brandenburg, S. A., Jeddeloh, J. A., Wen, B. and Feinberg, A. P. (2008). Comprehensive high-throughput arrays for relative methylation (CHARM). Genome Res. 18 780–790.
- Jolliffe, I. T. (2002). Principal Component Analysis, 2nd ed. Springer, New York.
Mathematical Reviews (MathSciNet): MR2036084
- Kapp, A. V. and Tibshirani, R. (2007). Are clusters found in one dataset present in another dataset? Biostatistics 8 9–31.
- Laird, P. W. (2003). The power and the promise of DNA methylation markers. Nat. Rev. Cancer 3 253–266.
- Laird, P. W. (2010). Principles and challenges of genome-wide DNA methylation analysis. Nat. Rev. Genet. 11 191–203.
- Lapointe, J., Li, C., Higgins, J. P., van de Rijn, M., Bair, E., Montgomery, K., Ferrari, M., Egevad, L., Rayford, W., Bergerheim, U., Ekman, P., DeMarzo, A. M., Tibshirani, R., Botstein, D., Brown, P. O., Brooks, J. D. and Pollack, J. R. (2003). Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc. Natl. Acad. Sci. USA 101 811–816.
- Le Cao, K. A., Martin, P. G., Robert-Granie, P. and Besse, P. (2009). Sparse canonical methods for biological data integration: Application to a cross-platform study. BMC Bioinformatics 10 34.
- Ng, A. Y., Jordan, M. I. and Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. Adv. Neural Inf. Process. Syst. 2 849–856.
- Olshen, A. B., Venkatraman, E. S., Lucito, R. and Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5 557–572.
- Olshen, A. B., Bengtsson, H., Neuvial, P., Spellman, P. T., Olshen, R. A. and Seshan, V. E. (2011). Parent-specific copy number in paired tumor-normal studies using circular binary segmentation. Bioinformatics 27 2038–2046.
- Parkhomenko, E., Tritchler, D. and Beyene, J. (2009). Sparse canonical correlation analysis with application to genomic data integration. Stat. Appl. Genet. Mol. Biol. 8 Art. 1, 36.
- Peng, J., Zhu, J., Bergamaschi, A., Han, W., Noh, D.-Y., Pollack, J. R. and Wang, P. (2010). Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Ann. Appl. Stat. 4 53–77.
Mathematical Reviews (MathSciNet): MR2758084
Zentralblatt MATH: 1189.62174
Digital Object Identifier: doi:10.1214/09-AOAS271
Project Euclid: euclid.aoas/1273584447
- Perou, C. M., Jeffrey, S. S., van de Rijn, M. et al. (1999). Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc. Natl. Acad. Sci. USA 96 9212–9217.
- Pollack, J. R., Sørlie, T., Perou, C. M. et al. (2002). Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc. Natl. Acad. Sci. USA 99 12963–12968.
- Rohe, K., Chatterjee, S. and Yu, B. (2010). Spectral clustering and the high-dimensional stochastic block model. Available at arXiv:1007.1684.
Mathematical Reviews (MathSciNet): MR2893856
Zentralblatt MATH: 1227.62042
Digital Object Identifier: doi:10.1214/11-AOS887
Project Euclid: euclid.aos/1314190618
- Shen, R., Olshen, A. B. and Ladanyi, M. (2009). Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 25 2906–2912.
Zentralblatt MATH: 1254.92006
- Simon, R. (2010). Translational research in oncology: Key bottlenecks and new paradigms. Expert Reviews Molecular Medicine 12 e32.
- Soneson, C., Lilljebjörn, H., Fioretos, T. and Fontes, M. (2010). Integrative analysis of gene expression and copy number alterations using canonical correlation analysis. BMC Bioinformatics 11 191.
- Sorlie, T., Perou, C. M., Tibshirani, R. et al. (2001). Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc. Natl. Acad. Sci. USA 98 10869–10874.
- TCGA Network (2011). Integrated genomic analyses of ovarian carcinoma. Nature 474 609–615.
- Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
Mathematical Reviews (MathSciNet): MR1379242
- Tibshirani, R. and Walther, G. (2005). Cluster validation by prediction strength. J. Comput. Graph. Statist. 14 511–528.
- Tibshirani, R. and Wang, P. (2008). Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics 9 18–29.
- Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 91–108.
Mathematical Reviews (MathSciNet): MR2136641
Zentralblatt MATH: 1060.62049
Digital Object Identifier: doi:10.1111/j.1467-9868.2005.00490.x
- Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal component analysis. J. R. Stat. Soc. Ser. B Stat. Methodol. 61 611–622.
Mathematical Reviews (MathSciNet): MR1707864
Zentralblatt MATH: 0924.62068
Digital Object Identifier: doi:10.1111/1467-9868.00196
- van Wieringen, W. N. and van de Wiel, M. A. (2009). Nonparametric testing for DNA copy number induced differential mRNA gene expression. Biometrics 65 19–29.
Mathematical Reviews (MathSciNet): MR2665842
Digital Object Identifier: doi:10.1111/j.1541-0420.2008.01052.x
- Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., Haussler, D. and Stuart, J. M. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics 26 237–245.
- Venkatraman, E. S. and Olshen, A. B. (2007). A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23 657–663.
- Waaijenborg, S., Verselewel de Witt Hamer, P. C. and Zwinderman, A. H. (2008). Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis. Stat. Appl. Genet. Mol. Biol. 7 Art. 3, 29.
- Wang, S. and Zhu, J. (2008). Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics 64 440–448, 666.
Mathematical Reviews (MathSciNet): MR2432414
Digital Object Identifier: doi:10.1111/j.1541-0420.2007.00922.x
- Wang, S., Nan, B., Zhu, J. and Beer, D. G. (2008). Doubly penalized Buckley–James method for survival data with high-dimensional covariates. Biometrics 64 132–140, 323.
Mathematical Reviews (MathSciNet): MR2422827
Digital Object Identifier: doi:10.1111/j.1541-0420.2007.00877.x
- Wang, S., Nan, B., Zhou, N. and Zhu, J. (2009). Hierarchically penalized Cox regression with grouped variables. Biometrika 96 307–322.
Mathematical Reviews (MathSciNet): MR2507145
Zentralblatt MATH: 1163.62089
Digital Object Identifier: doi:10.1093/biomet/asp016
- Witten, D. M. and Tibshirani, R. J. (2009). Extensions of sparse canonical correlation analysis with applications to genomic data. Stat. Appl. Genet. Mol. Biol. 8 Art. 28, 29.
- Witten, D. M., Tibshirani, R. and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10 515–534.
- Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49–67.
Mathematical Reviews (MathSciNet): MR2212574
Zentralblatt MATH: 1141.62030
Digital Object Identifier: doi:10.1111/j.1467-9868.2005.00532.x
- Zhao, Y. and Simon, R. (2010). Development and validation of predictive indices for a continuous outcome using gene expression profiles. Cancer Inform. 9 105–114.
- Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 301–320.
Mathematical Reviews (MathSciNet): MR2137327
Zentralblatt MATH: 1069.62054
Digital Object Identifier: doi:10.1111/j.1467-9868.2005.00503.x