The Annals of Applied Statistics

Bootstrap inference for network construction with an application to a breast cancer microarray study

Shuang Li, Li Hsu, Jie Peng, and Pei Wang

Abstract

Gaussian Graphical Models (GGMs) have been used to construct genetic regulatory networks, where regularization techniques are widely used because network inference usually falls into a high-dimension-low-sample-size scenario. Yet finding the right amount of regularization can be challenging, especially in an unsupervised setting where traditional methods such as BIC or cross-validation often do not work well. In this paper, we propose a new method, Bootstrap Inference for Network COnstruction (BINCO), to infer networks by directly controlling the false discovery rates (FDRs) of the selected edges. This method fits a mixture model for the distribution of edge selection frequencies to estimate the FDRs, where the selection frequencies are calculated via model aggregation. The method is applicable to a wide range of problems beyond network construction. When we applied the proposed method to build a gene regulatory network from breast cancer microarray expression data, we identified high-confidence edges and well-connected hub genes that could potentially play important roles in understanding the underlying biological processes of breast cancer.
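
To make the notion of an edge selection frequency concrete, the following is a minimal Python sketch of the aggregation step only, and it is not the authors' implementation. It assumes a simple stand-in selector (thresholding the absolute marginal sample correlation) in place of a regularized GGM fit, and it omits the mixture-model FDR estimation entirely. The function names, the threshold of 0.5, the number of resamples, the 0.9 frequency cutoff, and the toy data are all illustrative assumptions.

```python
import random
from itertools import combinations
from math import sqrt


def correlation(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def select_edges(rows, threshold):
    """Hypothetical stand-in selector: keep pairs with |correlation| > threshold.

    A real GGM analysis would use a regularized estimate of partial
    correlations here instead of marginal correlations.
    """
    p = len(rows[0])
    cols = [[row[j] for row in rows] for j in range(p)]
    return {
        (i, j)
        for i, j in combinations(range(p), 2)
        if abs(correlation(cols[i], cols[j])) > threshold
    }


def selection_frequencies(rows, threshold, n_boot, rng):
    """Fraction of bootstrap resamples in which each candidate edge is selected."""
    n, p = len(rows), len(rows[0])
    counts = {pair: 0 for pair in combinations(range(p), 2)}
    for _ in range(n_boot):
        # Resample the n observations with replacement.
        resample = [rows[rng.randrange(n)] for _ in range(n)]
        for edge in select_edges(resample, threshold):
            counts[edge] += 1
    return {edge: c / n_boot for edge, c in counts.items()}


# Toy data: variable 1 depends on variable 0; variables 2 and 3 are pure noise.
rng = random.Random(0)
data = []
for _ in range(60):
    z = rng.gauss(0, 1)
    data.append([z, z + 0.3 * rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)])

freqs = selection_frequencies(data, threshold=0.5, n_boot=100, rng=rng)

# Keep edges selected in at least 90% of resamples (an arbitrary cutoff here;
# BINCO instead derives the cutoff from an estimated FDR).
stable_edges = sorted(e for e, f in freqs.items() if f >= 0.9)
```

In this sketch the cutoff on the selection frequency is fixed by hand. The method described in the abstract differs at exactly this point: it models the distribution of the frequencies with a mixture model so that the cutoff can be tied to a target FDR.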

Article information

Source
Ann. Appl. Stat., Volume 7, Number 1 (2013), 391-417.

Dates
First available in Project Euclid: 9 April 2013

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1365527204

Digital Object Identifier
doi:10.1214/12-AOAS589

Mathematical Reviews number (MathSciNet)
MR3086424

Zentralblatt MATH identifier
06171277

Keywords
High dimensional data; GGM; model aggregation; mixture model; FDR

Citation

Li, Shuang; Hsu, Li; Peng, Jie; Wang, Pei. Bootstrap inference for network construction with an application to a breast cancer microarray study. Ann. Appl. Stat. 7 (2013), no. 1, 391--417. doi:10.1214/12-AOAS589. https://projecteuclid.org/euclid.aoas/1365527204


Supplemental materials

  • Supplementary material: Supplement to “Bootstrap inference for network construction with an application to a breast cancer microarray study”. This supplement contains additional simulation results, details of the hub genes detected by BINCO on the breast cancer data, and examples of $p_{ij}$ and $\tilde{p}_{ij}$ being close.