The Annals of Applied Statistics

A unified framework for variance component estimation with summary statistics in genome-wide association studies

Xiang Zhou

Abstract

Linear mixed models (LMMs) are among the most commonly used tools for genetic association studies. However, the standard method for estimating variance components in LMMs, restricted maximum likelihood (REML), has several important drawbacks: it requires individual-level genotypes and phenotypes from all samples in the study, is computationally slow, and produces downward-biased estimates in case-control studies. To remedy these drawbacks, we present an alternative framework for variance component estimation, which we refer to as MQS. MQS is based on the method of moments (MoM) and the minimal norm quadratic unbiased estimation (MINQUE) criterion, and brings two seemingly unrelated methods, the renowned Haseman–Elston (HE) regression and the recent LD score regression (LDSC), into a unified statistical framework. Within this framework, we provide an alternative but mathematically equivalent form of HE that allows for the use of summary statistics, and an exact estimation form of LDSC that yields unbiased and statistically more efficient estimates. A key feature of our method is its ability to pair marginal $z$-scores computed using all samples with SNP correlation information computed using a small random subset of individuals (or individuals from a proper reference panel), while still producing estimates that can be almost as accurate as if both quantities were computed using the full data. As a result, our method produces unbiased and statistically efficient estimates, makes use of summary statistics, and is computationally efficient for large data sets. Using simulations and applications to 37 phenotypes from 8 real data sets, we illustrate the benefits of our method for estimating and partitioning SNP heritability in population studies as well as for heritability estimation in family studies. Our method is implemented in the GEMMA software package, freely available at www.xzlab.org/software.html.
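
To make the method-of-moments idea concrete, the following is a minimal sketch of textbook single-component HE regression, not the MQS estimator and not GEMMA code. It assumes a standardized phenotype, a relatedness matrix K = XX'/p built from standardized genotypes, and Gaussian stand-ins for genotypes in the simulation; the names `he_regression` and `simulate` are hypothetical. The final lines check the algebraic identity y'Ky = (n/p) Σ z_j², which is what allows the quadratic form to be computed from marginal z-scores alone.

```python
import numpy as np


def he_regression(y, K):
    """Textbook HE regression for one variance component (illustrative only).

    Regresses the products y_i * y_j on K_ij over all pairs i < j, through
    the origin. With a standardized phenotype the slope estimates the
    proportion of phenotypic variance explained by the SNPs behind K.
    """
    y = np.asarray(y, dtype=float)
    K = np.asarray(K, dtype=float)
    z = (y - y.mean()) / y.std()           # standardize the phenotype
    iu = np.triu_indices(z.size, k=1)      # index pairs with i < j
    prod = np.outer(z, z)[iu]              # responses: y_i * y_j
    k = K[iu]                              # predictors: K_ij
    return float(k @ prod / (k @ k))       # least-squares slope, no intercept


def simulate(n=1000, p=200, h2=0.5, seed=0):
    """Simulate a toy polygenic trait (assumption: Gaussian 'genotypes')."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    X = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize each column
    beta = rng.normal(0.0, np.sqrt(h2 / p), size=p)  # small effect per SNP
    y = X @ beta + rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
    return X, y


if __name__ == "__main__":
    X, y = simulate()
    n, p = X.shape
    K = X @ X.T / p                        # relatedness matrix
    print("HE estimate of h2:", he_regression(y, K))

    # The quadratic form y'Ky can be recovered from marginal z-scores alone.
    ys = (y - y.mean()) / y.std()
    z_scores = X.T @ ys / np.sqrt(n)       # one marginal statistic per SNP
    print(np.isclose(ys @ K @ ys, n / p * np.sum(z_scores ** 2)))
```

The estimate varies noticeably from one simulated data set to the next at these sample sizes, so a single run should be read as a rough check rather than a benchmark.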

Article information

Source
Ann. Appl. Stat. Volume 11, Number 4 (2017), 2027-2051.

Dates
Received: November 2016
Revised: March 2017
First available in Project Euclid: 28 December 2017

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1514430276

Digital Object Identifier
doi:10.1214/17-AOAS1052

Keywords
Genome-wide association studies; summary statistics; variance component; linear mixed model; MINQUE; method of moments

Citation

Zhou, Xiang. A unified framework for variance component estimation with summary statistics in genome-wide association studies. Ann. Appl. Stat. 11 (2017), no. 4, 2027–2051. doi:10.1214/17-AOAS1052. https://projecteuclid.org/euclid.aoas/1514430276


Supplemental materials

  • Supplementary Material. Supplementary figures, tables and text.