Open Access
Bayesian semiparametric analysis for two-phase studies of gene-environment interaction
Jaeil Ahn, Bhramar Mukherjee, Stephen B. Gruber, Malay Ghosh
Ann. Appl. Stat. 7(1): 543-569 (March 2013). DOI: 10.1214/12-AOAS599


The two-phase sampling design is a cost-efficient way of collecting expensive covariate information on a judiciously selected subsample. It is natural to apply such a strategy for collecting genetic data in a subsample enriched for exposure to environmental factors for gene-environment interaction ($G\times E$) analysis. In this paper, we consider two-phase studies of $G\times E$ where phase I data are available on exposure, covariates and disease status. Stratified sampling is done to prioritize individuals for genotyping at phase II conditional on disease and exposure. We consider a Bayesian analysis based on the joint retrospective likelihood of phase I and phase II data. We address several important statistical issues: (i) we consider a model with multiple genes, environmental factors and their pairwise interactions, and employ a Bayesian variable selection algorithm to reduce the dimensionality of this potentially high-dimensional model; (ii) we use the assumption of gene-gene and gene-environment independence to trade off between bias and efficiency for estimating the interaction parameters through the use of hierarchical priors reflecting this assumption; (iii) we posit a flexible model for the joint distribution of the phase I categorical variables using the nonparametric Bayes construction of Dunson and Xing [J. Amer. Statist. Assoc. 104 (2009) 1042–1051]. We carry out a small-scale simulation study to compare the proposed Bayesian method with weighted likelihood and pseudo-likelihood methods that are standard choices for analyzing two-phase data. The motivating example originates from an ongoing case-control study of colorectal cancer, where the goal is to explore the interaction between the use of statins (drugs used to lower lipid levels) and 294 genetic markers in the lipid metabolism/cholesterol synthesis pathway. The subsample of cases and controls on which these genetic markers were measured is enriched in terms of statin users. The example and simulation results illustrate that the proposed Bayesian approach has a number of advantages for characterizing joint effects of genotype and exposure over existing alternatives and makes efficient use of all available data in both phases.
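
As a rough illustration of the exposure-enriched design described in the abstract, the following Python sketch draws a phase II subsample stratified on disease and exposure and attaches the inverse-probability weights that a weighted likelihood analysis would typically use. It is not the authors' code or their Bayesian method. The function name `two_phase_sample`, the simulated phase I data and the sampling fractions are all hypothetical choices made for this example.

```python
import random

def two_phase_sample(phase1, fractions, seed=0):
    """Select a phase II subsample stratified on (disease, exposure).

    phase1    : list of (disease, exposure) pairs, one per phase I subject
    fractions : dict mapping each (disease, exposure) stratum to the
                fraction of that stratum to genotype (hypothetical values)
    Returns the sorted selected indices and a dict of inverse-probability
    weights (stratum size / number sampled) keyed by selected index.
    """
    rng = random.Random(seed)

    # Group phase I subjects by their (disease, exposure) stratum.
    strata = {}
    for i, key in enumerate(phase1):
        strata.setdefault(key, []).append(i)

    selected, weights = [], {}
    for key, members in strata.items():
        # Take at least one subject from every non-empty stratum.
        n_take = max(1, round(fractions[key] * len(members)))
        chosen = rng.sample(members, n_take)
        for i in chosen:
            weights[i] = len(members) / n_take
        selected.extend(chosen)
    return sorted(selected), weights

# Simulated phase I data (assumption): about 50% cases, about 20% exposed.
rng = random.Random(1)
phase1 = [(rng.random() < 0.5, rng.random() < 0.2) for _ in range(2000)]

# Oversample exposed subjects (assumption): 80% of exposed, 20% of unexposed.
fractions = {(d, e): (0.8 if e else 0.2)
             for d in (False, True) for e in (False, True)}

selected, weights = two_phase_sample(phase1, fractions)
exposed_phase1 = sum(e for _, e in phase1) / len(phase1)
exposed_phase2 = sum(phase1[i][1] for i in selected) / len(selected)
print(f"exposed in phase I: {exposed_phase1:.2f}, in phase II: {exposed_phase2:.2f}")
```

Within each stratum the weights sum to the phase I stratum size, so weighted phase II summaries are on the scale of the full phase I sample even though exposed subjects are over-represented among those genotyped.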



Jaeil Ahn, Bhramar Mukherjee, Stephen B. Gruber, Malay Ghosh. "Bayesian semiparametric analysis for two-phase studies of gene-environment interaction." Ann. Appl. Stat. 7 (1) 543-569, March 2013.


Published: March 2013
First available in Project Euclid: 9 April 2013

zbMATH: 06171283
MathSciNet: MR3086430
Digital Object Identifier: 10.1214/12-AOAS599

Keywords: biased sampling, colorectal cancer, Dirichlet prior, exposure enriched sampling, gene-environment independence, joint effects, multivariate categorical distribution, spike and slab prior

Rights: Copyright © 2013 Institute of Mathematical Statistics

Vol. 7 • No. 1 • March 2013