Statistical Science

Analysis of Case-Control Association Studies: SNPs, Imputation and Haplotypes

Nilanjan Chatterjee, Yi-Hau Chen, Sheng Luo, and Raymond J. Carroll


Although prospective logistic regression is the standard method of analysis for case-control data, it has recently been noted that in genetic epidemiologic studies one can use the “retrospective” likelihood to gain major power by incorporating various population genetics model assumptions, such as Hardy–Weinberg equilibrium (HWE), gene–gene independence and gene–environment independence. In this article we review these modern methods and contrast them with the more classical approaches through two types of applications: (i) association tests for typed and untyped single nucleotide polymorphisms (SNPs) and (ii) estimation of haplotype effects and haplotype–environment interactions in the presence of haplotype-phase ambiguity. We provide novel insights into existing methods by constructing various score tests and pseudo-likelihoods. In addition, we describe a novel two-stage method for the analysis of untyped SNPs that can use any flexible external algorithm for genotype imputation, followed by a powerful association test based on the retrospective likelihood. We illustrate applications of the methods using simulated and real data.
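To make the idea of a retrospective likelihood under HWE concrete, the following is a minimal Python sketch for a single biallelic SNP. It is an illustration under stated assumptions, not the authors' implementation: it assumes HWE holds among controls (reasonable for a rare disease when HWE holds in the source population) and uses the fact that the case genotype distribution is proportional to the control distribution weighted by the genotype odds ratios. The function names, parameter names and genotype counts are hypothetical.

```python
import math


def hwe_genotype_freqs(q):
    """Frequencies of genotypes with 0, 1, 2 copies of an allele of frequency q under HWE."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    p = 1.0 - q
    return (p * p, 2.0 * p * q, q * q)


def case_genotype_freqs(q, or_het, or_hom):
    """Genotype frequencies among cases.

    Assumption: HWE holds among controls. The case distribution is then the
    control distribution reweighted by the genotype odds ratios (relative to
    non-carriers) and renormalised.
    """
    controls = hwe_genotype_freqs(q)
    weights = (controls[0], or_het * controls[1], or_hom * controls[2])
    total = sum(weights)
    return tuple(w / total for w in weights)


def retrospective_loglik(q, or_het, or_hom, control_counts, case_counts):
    """Log-likelihood of genotype counts given disease status (illustrative).

    control_counts and case_counts are (n0, n1, n2) genotype counts.
    """
    controls = hwe_genotype_freqs(q)
    cases = case_genotype_freqs(q, or_het, or_hom)
    loglik = 0.0
    for counts, freqs in ((control_counts, controls), (case_counts, cases)):
        for n, f in zip(counts, freqs):
            if n > 0:  # skip empty cells so log(0) is never evaluated for them
                loglik += n * math.log(f)
    return loglik


# Hypothetical genotype counts, for illustration only.
control_counts = (49, 42, 9)
case_counts = (40, 45, 15)
null_fit = retrospective_loglik(0.3, 1.0, 1.0, control_counts, case_counts)
alt_fit = retrospective_loglik(0.3, 1.3, 2.0, control_counts, case_counts)
```

In this sketch the control genotype distribution depends on a single parameter, the allele frequency `q`, rather than on two free genotype frequencies. That reduction in nuisance parameters is the kind of constraint the retrospective approach exploits to gain power.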

Article information

Statist. Sci., Volume 24, Number 4 (2009), 489-502.

First available in Project Euclid: 20 April 2010

Keywords: Case-control studies; Empirical-Bayes; genetic epidemiology; haplotypes; model averaging; model robustness; model selection; retrospective studies; shrinkage


Chatterjee, Nilanjan; Chen, Yi-Hau; Luo, Sheng; Carroll, Raymond J. Analysis of Case-Control Association Studies: SNPs, Imputation and Haplotypes. Statist. Sci. 24 (2009), no. 4, 489–502. doi:10.1214/09-STS297.

