The Annals of Applied Statistics

Bayesian semiparametric analysis for two-phase studies of gene-environment interaction

Jaeil Ahn, Bhramar Mukherjee, Stephen B. Gruber, and Malay Ghosh

Full-text: Open access


The two-phase sampling design is a cost-efficient way of collecting expensive covariate information on a judiciously selected subsample. It is natural to apply such a strategy for collecting genetic data in a subsample enriched for exposure to environmental factors for gene-environment interaction ($G\times E$) analysis. In this paper, we consider two-phase studies of $G\times E$ interaction where phase I data are available on exposure, covariates and disease status. Stratified sampling is done to prioritize individuals for genotyping at phase II conditional on disease and exposure. We consider a Bayesian analysis based on the joint retrospective likelihood of phases I and II data. We address several important statistical issues: (i) we consider a model with multiple genes, environmental factors and their pairwise interactions, and employ a Bayesian variable selection algorithm to reduce the dimensionality of this potentially high-dimensional model; (ii) we use the assumption of gene-gene and gene-environment independence to trade off between bias and efficiency in estimating the interaction parameters, through hierarchical priors reflecting this assumption; (iii) we posit a flexible model for the joint distribution of the phase I categorical variables using the nonparametric Bayes construction of Dunson and Xing [J. Amer. Statist. Assoc. 104 (2009) 1042–1051]. We carry out a small-scale simulation study to compare the proposed Bayesian method with weighted likelihood and pseudo-likelihood methods that are standard choices for analyzing two-phase data. The motivating example originates from an ongoing case-control study of colorectal cancer, where the goal is to explore the interaction between the use of statins (drugs used to lower lipid levels) and 294 genetic markers in the lipid metabolism/cholesterol synthesis pathway. The subsample of cases and controls on which these genetic markers were measured is enriched in terms of statin users. The example and simulation results illustrate that the proposed Bayesian approach has a number of advantages for characterizing joint effects of genotype and exposure over existing alternatives and makes efficient use of all available data in both phases.
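As a rough illustration of the design described in the abstract, the following toy sketch draws a phase II subsample stratified on disease status D and exposure E, oversampling exposed subjects, and attaches inverse-sampling-probability weights of the kind used by weighted likelihood comparators. It is not the authors' Bayesian method; the function name `two_phase_sample`, the stratum counts and the sample sizes are all hypothetical values chosen for illustration.

```python
import random


def two_phase_sample(phase1, n_per_stratum, seed=0):
    """Select a phase II subsample stratified on (D, E).

    phase1        : list of (D, E) pairs observed on every phase I subject.
    n_per_stratum : dict mapping (D, E) to the number of subjects to genotype.
    Returns a dict {subject index: inverse-probability weight}.
    """
    rng = random.Random(seed)

    # Group phase I subject indices by their (D, E) stratum.
    strata = {}
    for i, (d, e) in enumerate(phase1):
        strata.setdefault((d, e), []).append(i)

    weights = {}
    for key in sorted(strata):
        members = strata[key]
        m = min(n_per_stratum.get(key, 0), len(members))
        # Simple random sample without replacement within the stratum;
        # each selected subject gets weight N_stratum / n_stratum.
        for i in rng.sample(members, m):
            weights[i] = len(members) / m
    return weights


# Hypothetical phase I cohort: 500 cases (D=1), 500 controls (D=0), few exposed.
phase1 = [(1, 1)] * 50 + [(1, 0)] * 450 + [(0, 1)] * 30 + [(0, 0)] * 470

# Exposure-enriched phase II: genotype every exposed subject,
# but only 100 unexposed subjects per disease group.
n_per_stratum = {(1, 1): 50, (0, 1): 30, (1, 0): 100, (0, 0): 100}

weights = two_phase_sample(phase1, n_per_stratum)
n_exposed_phase2 = sum(1 for i in weights if phase1[i][1] == 1)
print(len(weights), n_exposed_phase2, round(sum(weights.values())))
```

In this toy setup the weights of the genotyped subjects sum back to the phase I sample size, which is the property that lets a weighted analysis correct for the biased phase II selection. The joint retrospective likelihood described in the abstract instead models the phase I and phase II data together.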

Article information

Ann. Appl. Stat., Volume 7, Number 1 (2013), 543–569.

First available in Project Euclid: 9 April 2013

Keywords: Biased sampling; colorectal cancer; Dirichlet prior; exposure enriched sampling; gene-environment independence; joint effects; multivariate categorical distribution; spike and slab prior


Ahn, Jaeil; Mukherjee, Bhramar; Gruber, Stephen B.; Ghosh, Malay. Bayesian semiparametric analysis for two-phase studies of gene-environment interaction. Ann. Appl. Stat. 7 (2013), no. 1, 543–569. doi:10.1214/12-AOAS599.


  • Agresti, A. (2002). Categorical Data Analysis, 2nd ed. Wiley, New York.
  • Ahn, J., Mukherjee, B., Gruber, S. B. and Ghosh, M. (2013). Supplement to “Bayesian semiparametric analysis for two-phase studies of gene-environment interaction.” DOI:10.1214/12-AOAS599SUPP.
  • Amundadottir, L., Kraft, P., Stolzenberg-Solomon, R. Z., Fuchs, C. S., Petersen, G. M., Arslan, A. A. et al. (2009). Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer. Nat. Genet. 41 986–990.
  • Bhattacharjee, S., Chatterjee, N. and Wheeler, W. (2011). An R package for analysis of case-control studies in genetic epidemiology. Package CGEN, Version 1.0.0. Available at
  • Bhattacharya, A. and Dunson, D. B. (2012). Simplex factor models for multivariate unordered categorical data. J. Amer. Statist. Assoc. 107 362–377.
  • Breslow, N. E. and Cain, K. C. (1988). Logistic regression for two-stage case-control data. Biometrika 75 11–20.
  • Breslow, N. E. and Chatterjee, N. (1999). Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. J. Appl. Stat. 48 457–468.
  • Breslow, N. E. and Holubkov, R. (1997a). Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. J. R. Stat. Soc. Ser. B Stat. Methodol. 59 447–461.
  • Breslow, N. E. and Holubkov, R. (1997b). Weighted likelihood, pseudo-likelihood and maximum likelihood methods for logistic regression analysis of two-stage data. Stat. Med. 16 103–116.
  • Chatterjee, N. and Carroll, R. J. (2005). Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika 92 399–418.
  • Chatterjee, N., Chen, Y.-H. and Breslow, N. E. (2003). A pseudoscore estimator for regression problems with two-phase sampling. J. Amer. Statist. Assoc. 98 158–168.
  • Chatterjee, N. and Chen, Y.-H. (2007). Maximum likelihood inference on a mixed conditionally and marginally specified regression model for genetic epidemiologic studies with two-phase sampling. J. R. Stat. Soc. Ser. B Stat. Methodol. 69 123–142.
  • Cochran, W. G. (1963). Sampling Techniques. Wiley, New York.
  • Dunson, D. B. and Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. J. Amer. Statist. Assoc. 104 1042–1051.
  • Durt, T. (2010). Experimental proposal for testing the emergence of environment induced (EIN) classical selection rules with biological systems. Studia Logica 95 259–277.
  • Flanders, W. D. and Greenland, S. (1991). Analytic methods for 2-stage case-control studies and other stratified designs. Stat. Med. 10 739–747.
  • Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statist. Sci. 7 457–472.
  • Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6 721–741.
  • George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88 881–889.
  • Hachem, C., Morgan, R., Johnson, M., Muebeler, M. and El-Serag, H. (2009). Statins and the risk of colorectal carcinoma: A nested case-control study in veterans with diabetes. Am. J. Gastroenterol. 104 1241–1248.
  • Haneuse, S. and Chen, J. (2011). A multiphase design strategy for dealing with participation bias. Biometrics 67 309–318.
  • Haneuse, S. J.-P. A. and Wakefield, J. C. (2007). Hierarchical models for combining ecological and case-control data. Biometrics 63 128–136, 312.
  • Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc. 47 663–685.
  • Hunter, D. J., Kraft, P., Jacobs, K. B., Cox, D. G., Yeager, M. et al. (2007). A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat. Genet. 39 870–874.
  • Ishwaran, H. and Rao, J. S. (2003). Detecting differentially expressed genes in microarrays using Bayesian model selection. J. Amer. Statist. Assoc. 98 438–455.
  • Lawless, J. F., Kalbfleisch, J. D. and Wild, C. J. (1999). Semiparametric methods for response-selective and missing data problems in regression. J. R. Stat. Soc. Ser. B Stat. Methodol. 61 413–438.
  • Lee, A. J., Scott, A. J. and Wild, C. J. (2010). Efficient estimation in multi-phase case-control studies. Biometrika 97 361–374.
  • Li, D. and Conti, D. V. (2009). Interactions using a combined case-only and case-control approach. Am. J. Epidemiol. 169 497–504.
  • Lipkin, S. M. et al. (2010). Genetic variation in 3-hydroxy-3-methylglutaryl CoA reductase modifies the chemopreventive activity of statins for colorectal cancer. Cancer Prev. Res. (Phila) 3 597–603.
  • Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data, 2nd ed. Wiley, Hoboken, NJ.
  • Lumley, T. (2011). R for analyzing data from complex surveys. Package Survey, Version 3.2.4. Available at
  • Manski, C. F. and Lerman, S. R. (1977). The estimation of choice probabilities from choice based samples. Econometrica 45 1977–1988.
  • Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. J. Amer. Statist. Assoc. 83 1023–1036.
  • Mukherjee, B. and Chatterjee, N. (2008). Exploiting gene-environment independence for analysis of case-control studies: An empirical Bayes-type shrinkage estimator to trade-off between bias and efficiency. Biometrics 64 685–694.
  • Mukherjee, B., Ahn, J., Gruber, S. B., Rennert, G., Moreno, V. and Chatterjee, N. (2008). Testing gene-environment interaction from case-control data: A novel study of type-1 error, power and designs. Genet. Epidemiol. 32 615–626.
  • Mukherjee, B., Ahn, J., Gruber, S. B., Ghosh, M. and Chatterjee, N. (2010). Bayesian sample size determination for case-control studies of gene-environment interaction. Biometrics 66 934–948.
  • Müller, P., Parmigiani, G., Schildkraut, J. and Tardella, L. (1999). A Bayesian hierarchical approach for combining case-control and prospective studies. Biometrics 55 858–866.
  • Murcray, C. E., Lewinger, J. P. and Gauderman, W. J. (2009). Gene-environment interaction in genome-wide association studies. Am. J. Epidemiol. 169 219–226.
  • Neyman, J. (1938). Contribution to the theory of sampling from human populations. J. Amer. Statist. Assoc. 33 101–116.
  • Park, J. H., Wacholder, S., Gail, M. H., Peters, U., Jacobs, K. B., Chanock, S. J. and Chatterjee, N. (2010). Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat. Genet. 42 570–575.
  • Pfeiffer, R. M. and Gail, M. H. (2003). Sample size calculations for population- and family-based case-control association studies on marker genotypes. Genet. Epidemiol. 25 136–148.
  • Piegorsch, W. W., Weinberg, C. R. and Taylor, J. (1994). Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Stat. Med. 13 153–162.
  • Plummer, M., Best, N., Cowles, K. and Vines, K. (2009). Output analysis and diagnostics for MCMC. Package CODA, Version 0.13-4. Available at
  • Poynter, J. N., Gruber, S. B., Higgins, P. D. R., Almog, R., Bonner, J. D., Rennert, H. S., Low, M., Greenson, J. K. and Rennert, G. (2005). Statins and the risk of colorectal cancer. N. Engl. J. Med. 352 2184–2192.
  • Reilly, M. and Pepe, M. S. (1995). A mean score method for missing and auxiliary covariate data in regression models. Biometrika 82 299–314.
  • Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. J. Amer. Statist. Assoc. 89 846–866.
  • Schill, W., Jöckel, K. H., Drescher, K. and Timm, J. (1993). Logistic analysis in case-control studies under validation sampling. Biometrika 80 339–352.
  • Scott, A. J. and Wild, C. J. (1997). Fitting regression models to case-control data by maximum likelihood. Biometrika 84 57–71.
  • Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statist. Sinica 4 639–650.
  • Umbach, D. M. and Weinberg, C. R. (1997). Designing and analysing case-control studies to exploit independence of genotype and exposure. Stat. Med. 11 259–272.
  • Vansteelandt, S., VanderWeele, T. J. and Robins, J. M. (2008). Multiply robust inference for statistical interactions. J. Amer. Statist. Assoc. 103 1693–1704.
  • Wacholder, S., Hartge, P., Prentice, R., Garcia-Closas, M. et al. (2010). Performance of common genetic variants in breast-cancer risk models. N. Engl. J. Med. 362 986–993.
  • Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Comm. Statist. Simulation Comput. 36 45–54.
  • Whittemore, A. S. and Halpern, J. (1998). Multi-stage sampling in genetic epidemiology. Stat. Med. 16 153–167.
  • Yeager, M., Orr, N., Hayes, R. B., Jacobs, K. B., Kraft, P., Wacholder, S. et al. (2007). Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nat. Genet. 39 645–649.

Supplemental materials

  • Supplementary material: Bayesian semiparametric analysis for two-phase studies of gene-environment interaction. We consider two-phase studies of $G\times E$ interaction where phase I data are available on exposure, covariates and disease status, and stratified sampling is done to prioritize individuals for genotyping at phase II. We consider a Bayesian analysis based on the joint retrospective likelihood of phases I and II data that handles multiple genetic and environmental factors and makes data-adaptive use of gene-environment independence.