The Annals of Applied Statistics

Efficient computation with a linear mixed model on large-scale data sets with applications to genetic studies

Matti Pirinen, Peter Donnelly, and Chris C. A. Spencer

Full-text: Open access

Abstract

Motivated by genome-wide association studies, we consider a standard linear model with one additional random effect in situations where many predictors have been collected on the same subjects and each predictor is analyzed separately. Three novel contributions are (1) a transformation between the linear and log-odds scales which is accurate for the important genetic case of small effect sizes; (2) a likelihood-maximization algorithm that is an order of magnitude faster than the previously published approaches; and (3) efficient methods for computing marginal likelihoods which allow Bayesian model comparison. The methodology has been successfully applied to a large-scale association study of multiple sclerosis including over 20,000 individuals and 500,000 genetic variants.

Article information

Source
Ann. Appl. Stat., Volume 7, Number 1 (2013), 369-390.

Dates
First available in Project Euclid: 9 April 2013

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1365527203

Digital Object Identifier
doi:10.1214/12-AOAS586

Mathematical Reviews number (MathSciNet)
MR3086423

Zentralblatt MATH identifier
06171276

Keywords
Genetic association study; case-control study; linear mixed model

Citation

Pirinen, Matti; Donnelly, Peter; Spencer, Chris C. A. Efficient computation with a linear mixed model on large-scale data sets with applications to genetic studies. Ann. Appl. Stat. 7 (2013), no. 1, 369–390. doi:10.1214/12-AOAS586. https://projecteuclid.org/euclid.aoas/1365527203


Supplemental materials

  • Supplementary material: Supplementary text. In this supplement we give the details of the application of the mixed model to binary data, of the conditional maximization of the likelihood function and of the Bayesian computations.