The Annals of Applied Statistics

Variable selection and regression analysis for graph-structured covariates with an application to genomics

Caiyan Li and Hongzhe Li

Full-text: Open access


Graphs and networks are common ways of depicting biological information. Many biological processes are represented by graphs, such as regulatory networks, metabolic pathways and protein–protein interaction networks. Such a priori graph information is a useful supplement to standard numerical data such as microarray gene expression data. In this paper we consider the problem of regression analysis and variable selection when the covariates are linked on a graph. We study a graph-constrained regularization procedure, together with its theoretical properties, that takes into account the neighborhood information of the variables measured on a graph. This procedure involves a smoothness penalty on the coefficients, defined as a quadratic form of the Laplacian matrix associated with the graph. We establish estimation and model selection consistency results and provide estimation bounds for both fixed and diverging numbers of parameters in regression models. We demonstrate by simulations and a real data set that the proposed procedure can lead to better variable selection and prediction than existing methods that ignore the graph information associated with the covariates.
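As a rough illustration of a Laplacian smoothness penalty, the sketch below builds a graph Laplacian from an adjacency matrix and evaluates the quadratic form on a coefficient vector. Several choices here are assumptions and are not stated in the abstract: the unnormalized Laplacian L = D - A, the squared-error loss, the added lasso-type L1 term, the proximal-gradient solver, and all function and parameter names (`graph_laplacian`, `smoothness_penalty`, `graph_penalized_fit`, `lam1`, `lam2`). It is a minimal sketch under those assumptions, not the authors' implementation.

```python
import numpy as np


def graph_laplacian(adjacency):
    """Unnormalized Laplacian L = D - A of an undirected graph (assumed form)."""
    A = np.asarray(adjacency, dtype=float)
    return np.diag(A.sum(axis=1)) - A


def smoothness_penalty(beta, L):
    """Quadratic form beta' L beta.

    For L = D - A with 0/1 edges this equals the sum over edges (u, v)
    of (beta_u - beta_v)^2, so it is small when linked coefficients agree.
    """
    beta = np.asarray(beta, dtype=float)
    return float(beta @ L @ beta)


def graph_penalized_fit(X, y, L, lam1, lam2, n_iter=500):
    """Proximal-gradient sketch (hypothetical solver) minimizing
    0.5 * ||y - X b||^2 + lam1 * ||b||_1 + lam2 * b' L b.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    H = X.T @ X + 2.0 * lam2 * L           # Hessian of the smooth part
    step = 1.0 / np.linalg.eigvalsh(H)[-1]  # 1 / largest eigenvalue
    Xty = X.T @ y
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - step * (H @ b - Xty)        # gradient step on the smooth part
        # Soft-thresholding handles the L1 term and produces exact zeros.
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0)
    return b


# Path graph 0 - 1 - 2: identical coefficients incur no penalty.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
L = graph_laplacian(A)
print(smoothness_penalty([1, 1, 1], L))   # 0.0
print(smoothness_penalty([1, 0, -1], L))  # 2.0
```

In this sketch the quadratic term pulls coefficients of neighboring covariates toward each other, while the assumed L1 term sets some coefficients exactly to zero, which is one common way to obtain variable selection.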

Article information

Ann. Appl. Stat. Volume 4, Number 3 (2010), 1498-1516.

First available in Project Euclid: 18 October 2010


Li, Caiyan; Li, Hongzhe. Variable selection and regression analysis for graph-structured covariates with an application to genomics. Ann. Appl. Stat. 4 (2010), no. 3, 1498–1516. doi:10.1214/10-AOAS332.



