The Annals of Applied Statistics

Bayesian variable selection regression for genome-wide association studies and other large-scale problems

Yongtao Guan and Matthew Stephens



We consider applying Bayesian Variable Selection Regression, or BVSR, to genome-wide association studies and similar large-scale regression problems. Currently, typical genome-wide association studies measure hundreds of thousands, or millions, of genetic variants (SNPs), in thousands or tens of thousands of individuals, and attempt to identify regions harboring SNPs that affect some phenotype or outcome of interest. This goal can naturally be cast as a variable selection regression problem, with the SNPs as the covariates in the regression. Characteristic features of genome-wide association studies include the following: (i) a focus primarily on identifying relevant variables, rather than on prediction; and (ii) many relevant covariates may have tiny effects, making it effectively impossible to confidently identify the complete “correct” subset of variables. Taken together, these factors put a premium on having interpretable measures of confidence for individual covariates being included in the model, which we argue is a strength of BVSR compared with alternatives such as penalized regression methods. Here we focus primarily on analysis of quantitative phenotypes, and on appropriate prior specification for BVSR in this setting, emphasizing the idea of considering what the priors imply about the total proportion of variance in outcome explained by relevant covariates. We also emphasize the potential for BVSR to estimate this proportion of variance explained, and hence shed light on the issue of “missing heritability” in genome-wide association studies. More generally, we demonstrate that, despite the apparent computational challenges, BVSR can provide useful inferences in these large-scale problems, and in our simulations produces better power and predictive performance compared with standard single-SNP analyses and the penalized regression method LASSO. Methods described here are implemented in a software package, pi-MASS, available from the Guan Lab website
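The quantity the abstract calls the proportion of variance explained (PVE) can be made concrete with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' prior specification and not the pi-MASS implementation: the function names, the independent-SNP genotype model, the normal effect sizes, and all default parameter values are hypothetical choices made only for illustration.

```python
import numpy as np


def simulate_sparse_phenotype(n=500, p=2000, n_causal=10, target_pve=0.3, seed=0):
    """Simulate centred genotypes X, sparse effects beta, and a phenotype y.

    Assumption (illustrative only): SNPs are independent and coded 0/1/2.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical genotypes: each SNP has its own minor allele frequency.
    freqs = rng.uniform(0.05, 0.5, size=p)
    X = rng.binomial(2, freqs, size=(n, p)).astype(float)
    X -= X.mean(axis=0)  # centre each SNP column

    # Sparse effects: only n_causal of the p SNPs are "relevant".
    gamma = np.zeros(p, dtype=bool)
    gamma[rng.choice(p, size=n_causal, replace=False)] = True
    beta = np.zeros(p)
    beta[gamma] = rng.normal(0.0, 1.0, size=n_causal)

    # Pick the residual variance so that the sample PVE equals target_pve:
    #   var(X beta) / (var(X beta) + sigma2) = target_pve
    signal = X @ beta
    sigma2 = signal.var() * (1.0 - target_pve) / target_pve
    y = signal + rng.normal(0.0, np.sqrt(sigma2), size=n)
    return X, y, beta, gamma, sigma2


def pve(X, beta, sigma2):
    """Sample proportion of variance in the outcome explained by X @ beta."""
    signal_var = (X @ beta).var()
    return signal_var / (signal_var + sigma2)


if __name__ == "__main__":
    X, y, beta, gamma, sigma2 = simulate_sparse_phenotype()
    print("relevant SNPs:", int(gamma.sum()))
    print("PVE implied by (beta, sigma2):", round(pve(X, beta, sigma2), 3))
```

In this sketch the PVE depends jointly on how many covariates are included and on how large their effects are, which is one way to see why a prior on the coefficients also implies something about the total variance explained.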

Article information

Ann. Appl. Stat. Volume 5, Number 3 (2011), 1780-1815.

First available: 13 October 2011



Guan, Yongtao; Stephens, Matthew. Bayesian variable selection regression for genome-wide association studies and other large-scale problems. The Annals of Applied Statistics 5 (2011), no. 3, 1780–1815. doi:10.1214/11-AOAS455.


