The Annals of Applied Statistics

Inferring gene–gene interactions and functional modules using sparse canonical correlation analysis

Y. X. Rachel Wang, Keni Jiang, Lewis J. Feldman, Peter J. Bickel, and Haiyan Huang

Networks pervade many disciplines of science for analyzing complex systems with interacting components. In particular, this concept is commonly used to model interactions between genes and identify closely associated genes forming functional modules. In this paper, we focus on gene group interactions and infer these interactions using appropriate partial correlations between genes, that is, the conditional dependencies between genes after removing the influences of a set of other functionally related genes. We introduce a new method for estimating group interactions using sparse canonical correlation analysis (SCCA) coupled with repeated random partition and subsampling of the gene expression data set. By considering different subsets of genes and ways of grouping them, our interaction measure can be viewed as an aggregated estimate of partial correlations of different orders. Our approach is unique in evaluating conditional dependencies when the correct dependent sets are unknown or only partially known. As a result, a gene network can be constructed using the interaction measures as edge weights and gene functional groups can be inferred as tightly connected communities from the network. Comparisons with several popular approaches using simulated and real data show our procedure improves both the statistical significance and biological interpretability of the results. In addition to achieving considerably lower false positive rates, our procedure shows better performance in detecting important biological pathways.
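The abstract does not spell out the estimator, so the following is only a rough sketch of the general idea of combining sparse CCA with repeated random partitioning and subsampling. Everything named here is an assumption made for illustration and not the authors' definition: the functions `soft_threshold`, `sparse_cca_rank1` and `edge_weights`, the parameters `frac`, `sub_frac` and `n_rep`, the use of a simple soft-thresholded power iteration as the sparse CCA step, and the choice of co-selection frequency as the edge weight.

```python
import numpy as np

def soft_threshold(a, frac):
    """Soft-threshold `a` at frac * max|a|, then rescale to unit L2 norm."""
    t = frac * np.max(np.abs(a))
    s = np.sign(a) * np.maximum(np.abs(a) - t, 0.0)
    norm = np.linalg.norm(s)
    return s / norm if norm > 0 else s

def sparse_cca_rank1(X1, X2, frac=0.5, n_iter=50):
    """Rank-1 sparse CCA stand-in: alternating soft-thresholded power steps
    on the cross-correlation matrix (an illustrative simplification)."""
    X1 = (X1 - X1.mean(0)) / (X1.std(0) + 1e-12)
    X2 = (X2 - X2.mean(0)) / (X2.std(0) + 1e-12)
    K = X1.T @ X2 / X1.shape[0]          # cross-correlation between the two gene groups
    v = np.linalg.svd(K, full_matrices=False)[2][0]  # leading right singular vector
    for _ in range(n_iter):
        u = soft_threshold(K @ v, frac)
        v = soft_threshold(K.T @ u, frac)
    return u, v

def edge_weights(X, n_rep=200, sub_frac=0.8, frac=0.5, seed=0):
    """Hypothetical aggregated interaction measure: for each gene pair, the
    fraction of repetitions (among those that placed the two genes in
    different groups) in which both genes received nonzero loadings."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    hits = np.zeros((p, p))
    trials = np.zeros((p, p))
    for _ in range(n_rep):
        rows = rng.choice(n, size=int(sub_frac * n), replace=False)  # subsample samples
        perm = rng.permutation(p)                                    # random gene partition
        g1, g2 = perm[: p // 2], perm[p // 2:]
        u, v = sparse_cca_rank1(X[np.ix_(rows, g1)], X[np.ix_(rows, g2)], frac)
        hits[np.ix_(g1, g2)] += np.outer(u != 0, v != 0).astype(float)
        trials[np.ix_(g1, g2)] += 1.0
    hits, trials = hits + hits.T, trials + trials.T                  # symmetrize
    return np.divide(hits, trials, out=np.zeros_like(hits), where=trials > 0)
```

Under these assumptions, `edge_weights(X)` takes a samples-by-genes expression matrix and returns a symmetric matrix with entries in [0, 1]. That matrix could serve as a weighted adjacency matrix for a community detection step, with tightly connected communities read as candidate functional groups.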

Article information

Ann. Appl. Stat., Volume 9, Number 1 (2015), 300-323.

First available in Project Euclid: 28 April 2015

Keywords: Gene association networks; community structure; sparse canonical correlation analysis (SCCA); partial correlation


Wang, Y. X. Rachel; Jiang, Keni; Feldman, Lewis J.; Bickel, Peter J.; Huang, Haiyan. Inferring gene–gene interactions and functional modules using sparse canonical correlation analysis. Ann. Appl. Stat. 9 (2015), no. 1, 300–323. doi:10.1214/14-AOAS792.



Supplemental materials

  • Supplementary information: Asymptotic analysis and additional explanations of the procedure, additional simulation and real data results. The code for estimating the edge weight matrix can be requested from