The Annals of Applied Statistics

A fast algorithm for detecting gene–gene interactions in genome-wide association studies

Jiahan Li, Wei Zhong, Runze Li, and Rongling Wu

Full-text: Open access


With the recent advent of high-throughput genotyping techniques, genetic data for genome-wide association studies (GWAS) have become increasingly available, which calls for efficient and effective statistical approaches. Although many such approaches have been developed and used to identify single-nucleotide polymorphisms (SNPs) that are associated with complex traits or diseases, few are able to detect gene–gene interactions among different SNPs. Genetic interactions, also known as epistasis, are recognized as playing a pivotal role in the genetic variation of phenotypic traits. However, because of the extremely large number of SNP–SNP combinations in GWAS, the model dimensionality can quickly become so overwhelming that no prevailing variable selection method is capable of handling this problem. In this paper, we present a statistical framework for characterizing main genetic effects and epistatic interactions in a GWAS. Specifically, we first propose a two-stage sure independence screening (TS-SIS) procedure that generates a pool of candidate SNPs and interactions, which serve as predictors to explain and predict the phenotypes of a complex trait. We also propose a rates adjusted thresholding estimation (RATE) approach to determine the size of the reduced model selected by independence screening. Regularized regression methods, such as LASSO or SCAD, are then applied to further identify important genetic effects. Simulation studies show that the TS-SIS procedure is computationally efficient and has outstanding finite-sample performance in selecting potential SNPs as well as gene–gene interactions. We apply the proposed framework to an ultrahigh-dimensional GWAS data set from the Framingham Heart Study, selecting 23 active SNPs and 24 active epistatic interactions for variation in body mass index. This result shows the capability of our procedure to resolve the complexity of genetic control.
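The abstract does not spell out the screening steps, so the following is only a minimal sketch of one way a two-stage marginal screening could look, not the authors' TS-SIS implementation. It assumes that stage one ranks single SNPs by absolute marginal correlation with the phenotype and that stage two forms pairwise products only among the retained SNPs and ranks those. The function names (`pearson`, `top_k`, `two_stage_screen`), the cutoffs `k_main` and `k_inter`, and the simulated data are all hypothetical; in the paper the size of the reduced model is chosen by RATE, which is not reproduced here, and the final LASSO or SCAD step is omitted.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no marginal signal
    return sxy / math.sqrt(sxx * syy)


def top_k(columns, y, k):
    """Return the k keys whose columns have the largest |correlation| with y."""
    ranked = sorted(columns, key=lambda j: abs(pearson(columns[j], y)), reverse=True)
    return ranked[:k]


def two_stage_screen(snps, y, k_main, k_inter):
    """Assumed two-stage screening: main effects first, then their pairwise products."""
    # Stage 1: marginal screening of single SNPs.
    kept = top_k(snps, y, k_main)
    # Stage 2: build products only among retained SNPs, then screen the products.
    products = {
        (a, b): [u * v for u, v in zip(snps[a], snps[b])]
        for a, b in itertools.combinations(sorted(kept), 2)
    }
    return kept, top_k(products, y, k_inter)


# Toy data (simulated, not from the paper): 300 subjects, 30 SNPs coded 0/1/2.
random.seed(0)
n, p = 300, 30
snps = {j: [random.choice((0, 1, 2)) for _ in range(n)] for j in range(p)}
# Phenotype with one main effect (SNP 3) and one interaction (SNP 0 x SNP 1).
y = [
    1.5 * snps[3][i] + 2.0 * snps[0][i] * snps[1][i] + random.gauss(0.0, 0.5)
    for i in range(n)
]

kept, interactions = two_stage_screen(snps, y, k_main=5, k_inter=3)
print("retained SNPs:", sorted(kept))
print("retained interactions:", interactions)
```

The point of screening in two stages is cost: with `p` SNPs there are `p * (p - 1) / 2` candidate pairs, whereas stage two here examines only `k_main * (k_main - 1) / 2` products. The retained SNPs and interactions would then be passed to a penalized regression such as LASSO or SCAD, as the abstract describes.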

Article information

Ann. Appl. Stat., Volume 8, Number 4 (2014), 2292-2318.

First available in Project Euclid: 19 December 2014

Keywords: Gene–gene interaction; GWAS; high-dimensional data; sure independence screening; variable selection


Li, Jiahan; Zhong, Wei; Li, Runze; Wu, Rongling. A fast algorithm for detecting gene–gene interactions in genome-wide association studies. Ann. Appl. Stat. 8 (2014), no. 4, 2292–2318. doi:10.1214/14-AOAS771.


