The Annals of Applied Statistics

Incorporating biological information into linear models: A Bayesian approach to the selection of pathways and genes

Francesco C. Stingo, Yian A. Chen, Mahlet G. Tadesse, and Marina Vannucci



The vast amount of biological knowledge accumulated over the years has allowed researchers to identify various biochemical interactions and define different families of pathways. There is increasing interest in identifying pathways and pathway elements involved in particular biological processes. Drug discovery efforts, for example, are focused on identifying biomarkers as well as pathways related to a disease. We propose a Bayesian model that addresses this question by incorporating information on pathways and gene networks in the analysis of DNA microarray data. Such information is used to define pathway summaries, specify prior distributions, and structure the MCMC moves to fit the model. We illustrate the method with an application to gene expression data with censored survival outcomes. In addition to identifying markers that would have been missed otherwise and improving prediction accuracy, the integration of existing biological knowledge into the analysis provides a better understanding of underlying molecular processes.

Article information

Ann. Appl. Stat., Volume 5, Number 3 (2011), 1978-2002.

First available in Project Euclid: 13 October 2011


Keywords: Bayesian variable selection; gene expression; Markov chain Monte Carlo; Markov random field prior; pathway selection


Stingo, Francesco C.; Chen, Yian A.; Tadesse, Mahlet G.; Vannucci, Marina. Incorporating biological information into linear models: A Bayesian approach to the selection of pathways and genes. Ann. Appl. Stat. 5 (2011), no. 3, 1978–2002. doi:10.1214/11-AOAS463.



