The Annals of Applied Statistics

Bayesian model search and multilevel inference for SNP association studies

Melanie A. Wilson, Edwin S. Iversen, Merlise A. Clyde, Scott C. Schmidler, and Joellen M. Schildkraut


Technological advances in genotyping have given rise to hypothesis-based association studies of increasing scope. As a result, the scientific hypotheses addressed by these studies have become more complex and more difficult to address using existing analytic methodologies. Obstacles to analysis include inference in the face of multiple comparisons, complications arising from correlations among the SNPs (single nucleotide polymorphisms), choice of their genetic parametrization and missing data. In this paper we present an efficient Bayesian model search strategy that searches over the space of genetic markers and their genetic parametrization. The resulting method for Multilevel Inference of SNP Associations, MISA, allows computation of multilevel posterior probabilities and Bayes factors at the global, gene and SNP level, with the prior distribution on SNP inclusion in the model providing an intrinsic multiplicity correction. We use simulated data sets to characterize MISA’s statistical power, and show that MISA has higher power to detect association than standard procedures. Using data from the North Carolina Ovarian Cancer Study (NCOCS), MISA identifies variants that were not identified by standard methods and have been externally “validated” in independent studies. We examine sensitivity of the NCOCS results to prior choice and method for imputing missing data. MISA is available in an R package on CRAN.
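To make the idea of multilevel posterior probabilities and Bayes factors concrete, the following is a minimal Python sketch and not the MISA implementation (which is an R package). It assumes a model space small enough to enumerate, takes the log marginal likelihood of every model as given, and uses a Beta-Binomial(1, 1) prior on model size as one common example of a prior on SNP inclusion that adjusts for multiplicity. The function names, SNP and gene labels, and toy likelihood values are all hypothetical; genetic parametrizations and the stochastic model search are left out.

```python
from itertools import combinations
from math import comb, exp


def beta_binomial_prior(size, p):
    """Prior mass of ONE model containing `size` of `p` SNPs.

    Assumption: Beta-Binomial(1, 1) prior on model size, i.e. uniform over
    sizes 0..p and uniform over the models of each size.
    """
    return 1.0 / ((p + 1) * comb(p, size))


def multilevel_summary(log_marglik, genes):
    """Return {level: (posterior probability, Bayes factor)}.

    log_marglik: dict mapping frozenset of SNP names -> log marginal
                 likelihood, with one entry for every subset of SNPs.
    genes:       dict mapping gene name -> list of SNP names in that gene.
    """
    snps = sorted(s for members in genes.values() for s in members)
    p = len(snps)
    models = [frozenset(c) for k in range(p + 1) for c in combinations(snps, k)]

    prior = {m: beta_binomial_prior(len(m), p) for m in models}
    shift = max(log_marglik[m] for m in models)  # guard against underflow
    unnorm = {m: exp(log_marglik[m] - shift) * prior[m] for m in models}
    total = sum(unnorm.values())
    post = {m: w / total for m, w in unnorm.items()}

    def event(test):
        # Posterior probability of the event and its Bayes factor,
        # computed as posterior odds divided by prior odds.
        pri = sum(prior[m] for m in models if test(m))
        pos = sum(post[m] for m in models if test(m))
        return pos, (pos / (1.0 - pos)) / (pri / (1.0 - pri))

    out = {"global": event(lambda m: len(m) > 0)}
    for g, members in genes.items():
        out["gene:" + g] = event(lambda m, ms=frozenset(members): bool(m & ms))
    for s in snps:
        out["snp:" + s] = event(lambda m, s=s: s in m)
    return out


# Toy input: hypothetical SNPs in two hypothetical genes; "rs1" is favoured.
genes = {"A": ["rs1", "rs2"], "B": ["rs3"]}
all_snps = ["rs1", "rs2", "rs3"]
log_marglik = {
    frozenset(c): 2.0 * ("rs1" in c) - 0.5 * len(c)
    for k in range(4)
    for c in combinations(all_snps, k)
}
summary = multilevel_summary(log_marglik, genes)
for level, (prob, bf) in summary.items():
    print(f"{level:8s} posterior={prob:.3f} BF={bf:.2f}")
```

Because the gene-level and global events contain the SNP-level events, their posterior probabilities can only be at least as large; the Bayes factors divide out the prior odds, so they show how far the data moved each level away from the prior.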

Article information

Ann. Appl. Stat. Volume 4, Number 3 (2010), 1342-1364.

First available in Project Euclid: 18 October 2010


Keywords: AIC; Bayes factor; Bayesian model averaging; BIC; Evolutionary Monte Carlo; false discovery; genetic models; lasso; model uncertainty; single nucleotide polymorphism; variable selection


Wilson, Melanie A.; Iversen, Edwin S.; Clyde, Merlise A.; Schmidler, Scott C.; Schildkraut, Joellen M. Bayesian model search and multilevel inference for SNP association studies. Ann. Appl. Stat. 4 (2010), no. 3, 1342–1364. doi:10.1214/09-AOAS322.




Supplemental materials

  • Supplementary material: Bayesian model search and multilevel inference for SNP association studies: Supplementary materials. In this supplement we provide: (1) a derivation of the implied prior distribution on the regression coefficients when AIC is used to approximate the marginal likelihood in logistic regression, (2) a description of the marginal Bayes factor screen used to reduce the number of SNPs in the MISA analysis, (3) details of how the simulated genetic data sets used in the power analysis of MISA were created, together with information on the statistical software we developed for this purpose, and (4) the location of the freely available software resources referred to in this and the parent document.
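
As a rough illustration of items (1) and (2), the Python sketch below uses the AIC-based approximation to the log marginal likelihood (the maximized log-likelihood minus the number of parameters, that is, -AIC/2) to form an approximate Bayes factor for each single-SNP model against the null model, and keeps the SNPs whose Bayes factor reaches a threshold. This page does not give the details of the screen, so the single-SNP form, the threshold value, the function names and the input numbers are all assumptions made for illustration.

```python
from math import exp


def aic_log_marglik(loglik, n_params):
    """AIC-based approximation: log p(Y | M) ~ -AIC / 2 = loglik - n_params."""
    return loglik - n_params


def marginal_bf_screen(null_loglik, null_params, snp_fits, threshold=1.0):
    """Keep SNPs whose approximate marginal Bayes factor is >= threshold.

    snp_fits: dict mapping SNP name -> (maximized log-likelihood, number of
              parameters) for the model containing only that SNP.
    The default threshold of 1.0 is a hypothetical choice.
    """
    base = aic_log_marglik(null_loglik, null_params)
    bfs = {
        snp: exp(aic_log_marglik(loglik, k) - base)
        for snp, (loglik, k) in snp_fits.items()
    }
    kept = sorted(snp for snp, bf in bfs.items() if bf >= threshold)
    return kept, bfs


# Hypothetical fitted log-likelihoods for an intercept-only null model
# and two single-SNP logistic regressions.
kept, bfs = marginal_bf_screen(
    null_loglik=-100.0,
    null_params=1,
    snp_fits={"rs1": (-96.0, 2), "rs2": (-99.8, 2)},
)
print(kept, bfs)
```

In this toy input the extra parameter costs one unit on the log scale, so a SNP survives a threshold of 1 only if it improves the maximized log-likelihood by at least 1.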