Statistical Science

Structures and Assumptions: Strategies to Harness Gene × Gene and Gene × Environment Interactions in GWAS

Charles Kooperberg, Michael LeBlanc, James Y. Dai, and Indika Rajapakse

Full-text: Open access


Genome-wide association studies, in which as many as a million single nucleotide polymorphisms (SNPs) are measured on several thousand samples, are quickly becoming a common type of study for identifying genetic factors associated with many phenotypes. There is a strong assumption that interactions between SNPs or genes and interactions between genes and environmental factors substantially contribute to the genetic risk of a disease. Identification of such interactions could potentially lead to increased understanding of disease mechanisms; drug × gene interactions could have profound applications for personalized medicine; strong interaction effects could be beneficial for risk prediction models. In this paper we provide an overview of different approaches to model interactions, emphasizing approaches that make specific use of the structure of genetic data, and those that make specific modeling assumptions that may (or may not) be reasonable to make. We conclude that to identify interactions it is often necessary to do some selection of SNPs, for example, based on prior hypotheses or marginal significance, but that to identify SNPs that are marginally associated with a disease it may also be useful to consider larger numbers of interactions.

Article information

Statist. Sci. Volume 24, Number 4 (2009), 472-488.

First available: 20 April 2010



Kooperberg, Charles; LeBlanc, Michael; Dai, James Y.; Rajapakse, Indika. Structures and Assumptions: Strategies to Harness Gene × Gene and Gene × Environment Interactions in GWAS. Statistical Science 24 (2009), no. 4, 472–488. doi:10.1214/09-STS287.


