The Annals of Applied Statistics

Simultaneous SNP identification in association studies with missing data

Zhen Li, Vikneswaran Gopal, Xiaobo Li, John M. Davis, and George Casella

Full-text: Open access


Association testing aims to discover the underlying relationship between genotypes (usually Single Nucleotide Polymorphisms, or SNPs) and phenotypes (attributes, or traits). The typically large data sets used in association testing often contain missing values. Standard statistical methods either impute the missing values using relatively simple assumptions, or delete them, or both, which can generate biased results. Here we describe the Bayesian hierarchical model BAMD (Bayesian Association with Missing Data). BAMD is fitted with a Gibbs sampler, in which missing values are multiply imputed based upon all of the available information in the data set. We estimate the parameters and prove that updating one SNP at each iteration preserves the ergodic property of the Markov chain, and at the same time improves computational speed. We also implement a model selection option in BAMD, which enables potential detection of SNP interactions. Simulations show that unbiased estimates of SNP effects are recovered with missing genotype data. Also, we validate associations between SNPs and a carbon isotope discrimination phenotype that were previously reported using a family-based method, and discover an additional SNP associated with the trait. BAMD is available as an R package from
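The idea of imputing missing genotypes inside a Gibbs sampler, refreshing only one SNP per iteration, can be illustrated with a small sketch. This is not the BAMD model or its R package: it assumes a plain linear model y = Zγ + ε with no family or covariate effects, a genotype coding of -1/0/1, a uniform prior over genotype codes, a N(0, τ²I) prior on γ, and an inverse-gamma(1, 1) prior on σ². The function name `gibbs_impute` and all parameter names are hypothetical.

```python
import numpy as np

def gibbs_impute(y, Z, n_iter=2000, burn=500, tau2=1.0, seed=0):
    """Toy Gibbs sampler for y = Z @ gamma + noise; NaN entries of Z are imputed."""
    rng = np.random.default_rng(seed)
    n, p = Z.shape
    codes = np.array([-1.0, 0.0, 1.0])          # assumed genotype coding
    miss = np.isnan(Z)
    Z = Z.copy()
    Z[miss] = rng.choice(codes, size=miss.sum())  # random starting values
    miss_cols = np.flatnonzero(miss.any(axis=0))
    sigma2 = 1.0
    draws = []
    for t in range(n_iter):
        # 1. gamma | Z, sigma2, y ~ N(m, V), with prior gamma ~ N(0, tau2 * I)
        V = np.linalg.inv(Z.T @ Z / sigma2 + np.eye(p) / tau2)
        m = V @ Z.T @ y / sigma2
        gamma = rng.multivariate_normal(m, V)
        # 2. sigma2 | gamma, Z, y ~ inverse-gamma(1 + n/2, 1 + RSS/2)
        resid = y - Z @ gamma
        sigma2 = 1.0 / rng.gamma(1.0 + n / 2, 1.0 / (1.0 + resid @ resid / 2))
        # 3. re-impute the missing genotypes of ONE SNP column per iteration
        if miss_cols.size:
            j = miss_cols[t % miss_cols.size]
            rows = np.flatnonzero(miss[:, j])
            # response with the contribution of all other SNPs removed
            partial = y[rows] - Z[rows] @ gamma + Z[rows, j] * gamma[j]
            # log-likelihood of each candidate code (uniform prior over codes)
            ll = -(partial[:, None] - codes[None, :] * gamma[j]) ** 2 / (2 * sigma2)
            prob = np.exp(ll - ll.max(axis=1, keepdims=True))
            prob /= prob.sum(axis=1, keepdims=True)
            # inverse-CDF draw of one code per missing entry
            u = rng.random(rows.size)
            idx = (u[:, None] > np.cumsum(prob, axis=1)).sum(axis=1)
            Z[rows, j] = codes[np.minimum(idx, 2)]
        if t >= burn:
            draws.append(gamma)
    return np.mean(draws, axis=0)                # posterior-mean SNP effects

# Simulated example: 10% of genotypes missing completely at random.
rng = np.random.default_rng(1)
n, p = 300, 5
Z_true = rng.choice([-1.0, 0.0, 1.0], size=(n, p))
gamma_true = np.array([1.0, 0.0, -0.5, 0.0, 0.8])
y = Z_true @ gamma_true + rng.normal(scale=0.5, size=n)
Z_obs = Z_true.copy()
Z_obs[rng.random((n, p)) < 0.1] = np.nan
est = gibbs_impute(y, Z_obs)
print(np.round(est, 2))
```

Cycling through the SNP columns one at a time keeps each iteration cheap, since only one column's conditional distribution is evaluated, while every missing entry is still revisited regularly over the run.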

Article information

Ann. Appl. Stat., Volume 6, Number 2 (2012), 432-456.

First available in Project Euclid: 11 June 2012


Keywords: Hierarchical models; Bayes models; Gibbs sampling; genome-wide association


Li, Zhen; Gopal, Vikneswaran; Li, Xiaobo; Davis, John M.; Casella, George. Simultaneous SNP identification in association studies with missing data. Ann. Appl. Stat. 6 (2012), no. 2, 432–456. doi:10.1214/11-AOAS516.




Supplemental materials

  • Supplementary material: Theory and additional simulations. The Supplemental Information contains details on the variable selector, and the proof of convergence of the two Markov chains (the Gibbs sampler and the model search). In addition, there are further comparisons between BAMD and BIMBAM.