Statistical Science

Methodological Issues in Multistage Genome-Wide Association Studies

Duncan C. Thomas, Graham Casey, David V. Conti, Robert W. Haile, Juan Pablo Lewinger, and Daniel O. Stram

Because of the high cost of commercial genotyping chip technologies, many investigations have used a two-stage design for genome-wide association studies, using part of the sample for an initial discovery of “promising” SNPs at a less stringent significance level and the remainder in a joint analysis of just these SNPs using custom genotyping. With this design, cost savings of about 50% are typically possible while maintaining comparable levels of overall type I error and power, by using about half the sample for stage I and carrying about 0.1% of SNPs forward to the second stage; the optimal design depends primarily upon the ratio of costs per genotype for stages I and II. However, with the rapidly declining costs of the commercial panels, the generally low odds ratios observed in current studies, and many studies aiming to test multiple hypotheses and multiple endpoints, many investigators are abandoning the two-stage design in favor of simply genotyping all available subjects using a standard high-density panel. Concern is sometimes raised about the absence of a “replication” panel in this approach, as required by some high-profile journals, but it must be appreciated that the two-stage design is not a discovery/replication design but simply a more efficient design for discovery using a joint analysis of the data from both stages. Once a subset of highly significant associations has been discovered, a truly independent “exact replication” study of the same promising SNPs is needed in a similar population using similar methods. This can then be followed by (1) “generalizability” studies to assess the full scope of replicated associations across different races, different endpoints, different interactions, etc.; (2) fine-mapping or resequencing to try to identify the causal variant; and (3) experimental studies of the biological function of these genes. Multistage sampling designs may be more useful at this stage, say, for selecting subsets of subjects for deep resequencing of regions identified in the GWAS.
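
The cost figure quoted in the abstract can be illustrated with a rough back-of-the-envelope calculation. The sketch below is not taken from the article: it assumes that genotyping cost is simply proportional to the number of genotypes produced, that stage I types a fraction of the subjects on the full panel, and that stage II types the remaining subjects on only the SNPs carried forward. The function name, its parameters, and the cost ratio of 20 are hypothetical choices made for illustration.

```python
def two_stage_relative_cost(sample_fraction, marker_fraction, cost_ratio):
    """Genotyping cost of a two-stage design relative to a one-stage design.

    A one-stage design (all subjects on the full panel) has relative cost 1.0.
    All three arguments are illustrative assumptions, not values from the article:
      sample_fraction: share of subjects genotyped on the full panel in stage I
      marker_fraction: share of SNPs carried forward to stage II
      cost_ratio:      per-genotype cost in stage II divided by that in stage I
    """
    for name, value in (("sample_fraction", sample_fraction),
                        ("marker_fraction", marker_fraction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1, got {value}")
    if cost_ratio < 0:
        raise ValueError(f"cost_ratio must be non-negative, got {cost_ratio}")

    # Stage I: a fraction of the subjects, every SNP, at the baseline unit cost.
    stage1_cost = sample_fraction
    # Stage II: the remaining subjects, only the promising SNPs,
    # at a per-genotype cost that is cost_ratio times the baseline.
    stage2_cost = (1.0 - sample_fraction) * marker_fraction * cost_ratio
    return stage1_cost + stage2_cost


# Half the sample in stage I, 0.1% of SNPs carried forward,
# and an assumed 20-fold higher per-genotype cost for custom genotyping.
relative = two_stage_relative_cost(0.5, 0.001, 20.0)
print(f"relative cost: {relative:.2f}, saving: {1.0 - relative:.0%}")
```

Under these assumptions stage II contributes very little to the total, so the saving is close to the share of subjects left out of stage I. This accounting ignores power and type I error entirely; it only shows why the per-genotype cost ratio is the quantity that drives the trade-off.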

Article information

Statist. Sci., Volume 24, Number 4 (2009), 414-429.

First available in Project Euclid: 20 April 2010

Keywords: Multistage sampling; genetic associations; replication; resequencing; DNA pooling; gene–environment interactions


Thomas, Duncan C.; Casey, Graham; Conti, David V.; Haile, Robert W.; Lewinger, Juan Pablo; Stram, Daniel O. Methodological Issues in Multistage Genome-Wide Association Studies. Statist. Sci. 24 (2009), no. 4, 414–429. doi:10.1214/09-STS288.

