The Annals of Applied Statistics

Network-based feature screening with applications to genome data

Mengyun Wu, Liping Zhu, and Xingdong Feng

Abstract

Modern biological techniques produce many types of data, which are often used with appropriate statistical methods, such as feature screening, to identify important biomarkers for particular diseases. Model-free feature screening has been studied extensively in the literature and is effective at selecting useful predictors from ultra-high dimensional data. Existing screening procedures, however, rely on marginal correlations between the predictors and a response variable, so network structures connecting the predictors are usually ignored. Motivated by the remarkable success of Google’s PageRank algorithm, we adopt its spirit and adjust the original screening approaches by incorporating network information. This adjustment significantly improves the performance of those screening methods in choosing useful biomarkers, as demonstrated in an intensive simulation study. A couple of real genome datasets, together with a biological network, are further analyzed, and the results are compared on both the accuracy of predicting responses and the stability of identifying biomarkers.
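
The general idea of adjusting marginal screening scores with a PageRank-style walk over a predictor network can be illustrated with a minimal sketch. This is not the authors' procedure, whose details are in the full text; it assumes absolute Pearson correlation as the marginal utility, a symmetric 0/1 adjacency matrix, and a personalized-PageRank update with a damping factor of 0.5. The function names `marginal_scores` and `network_adjusted_scores` are hypothetical and chosen only for illustration.

```python
import numpy as np


def marginal_scores(X, y):
    """Absolute Pearson correlation of each column of X with y (assumed utility)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom = np.where(denom > 0, denom, np.inf)  # constant columns score 0
    return np.abs(Xc.T @ yc) / denom


def network_adjusted_scores(scores, adjacency, damping=0.5, tol=1e-12, max_iter=1000):
    """Smooth scores over a network with a personalized-PageRank-style iteration.

    The restart distribution is the normalized marginal score vector, so a
    feature ranks highly if it is marginally strong or linked to strong features.
    """
    s = np.asarray(scores, dtype=float)
    A = np.asarray(adjacency, dtype=float)
    total = s.sum()
    if total <= 0:
        return s.copy()
    restart = s / total

    # Column-stochastic transition matrix; isolated nodes get a self-loop.
    col_sums = A.sum(axis=0)
    P = A / np.where(col_sums > 0, col_sums, 1.0)
    isolated = np.flatnonzero(col_sums == 0)
    P[isolated, isolated] = 1.0

    r = restart.copy()
    for _ in range(max_iter):
        r_new = (1.0 - damping) * restart + damping * (P @ r)
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r


# Toy usage: 5 features, edges 0-1 and 3-4, feature 2 isolated.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 3] + rng.normal(size=100)
A_demo = np.zeros((5, 5))
A_demo[0, 1] = A_demo[1, 0] = 1.0
A_demo[3, 4] = A_demo[4, 3] = 1.0

adjusted = network_adjusted_scores(marginal_scores(X_demo, y_demo), A_demo)
top_k = np.argsort(-adjusted)[:2]  # indices of the two highest-ranked features
```

In this sketch the damping factor controls how much weight the network receives relative to the marginal scores: at 0 the ranking reduces to ordinary marginal screening, and larger values let neighbors of strong features rise in the ranking.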

Article information

Source
Ann. Appl. Stat., Volume 12, Number 2 (2018), 1250-1270.

Dates
Received: January 2017
Revised: May 2017
First available in Project Euclid: 28 July 2018

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1532743493

Digital Object Identifier
doi:10.1214/17-AOAS1097

Mathematical Reviews number (MathSciNet)
MR3834302

Keywords
Correlation; feature screening; model-free; network; ultra-high dimension; variable selection

Citation

Wu, Mengyun; Zhu, Liping; Feng, Xingdong. Network-based feature screening with applications to genome data. Ann. Appl. Stat. 12 (2018), no. 2, 1250–1270. doi:10.1214/17-AOAS1097. https://projecteuclid.org/euclid.aoas/1532743493


Supplemental materials

  • Some additional tables. The Supplementary Materials include additional simulation results with different network structures, signal-to-noise ratios, and types of responses; the top 100 biomarkers for the dataset GSE71729 identified by DC-SIS-Network; and the corresponding KEGG pathway analysis results.