The Annals of Applied Statistics

A statistical framework for data integration through graphical models with application to cancer genomics

Yuping Zhang, Zhengqing Ouyang, and Hongyu Zhao



Recent advances in high-throughput biotechnologies have generated various types of genetic, genomic, epigenetic, transcriptomic and proteomic data across different biological conditions. Integrating data from diverse experiments may lead to a more unified and global view of biological systems and complex diseases. We present a coherent statistical framework for integrating various types of data from distinct but related biological conditions through graphical models. Specifically, the framework is designed for modeling multiple networks with shared regulatory mechanisms from heterogeneous high-dimensional datasets. We illustrate the performance of our approach through simulations and applications to cancer genomics.

Article information

Ann. Appl. Stat. Volume 11, Number 1 (2017), 161-184.

Received: February 2016
Revised: September 2016
First available in Project Euclid: 8 April 2017


Keywords: cancer genomics, data integration, graphical models


Zhang, Yuping; Ouyang, Zhengqing; Zhao, Hongyu. A statistical framework for data integration through graphical models with application to cancer genomics. Ann. Appl. Stat. 11 (2017), no. 1, 161–184. doi:10.1214/16-AOAS998.




Supplemental materials

  • Supplement to “A statistical framework for data integration through graphical models with application to cancer genomics.” We present technical and methodological details regarding the model and algorithm in Sections 2 and 4. Furthermore, complementary results for the application in Section 7 are provided.