The Annals of Applied Statistics

Joint analysis of SNP and gene expression data in genetic association studies of complex diseases

Yen-Tsung Huang, Tyler J. VanderWeele, and Xihong Lin



Genetic association studies have been a popular approach for assessing the association between common Single Nucleotide Polymorphisms (SNPs) and complex diseases. However, other genomic data involved in the mechanism from SNPs to disease, for example, gene expression, are usually neglected in these association studies. In this paper, we propose to exploit gene expression information to test the association between SNPs and diseases more powerfully by jointly modeling the relations among SNPs, gene expression and disease. We propose a variance component test for the total effect of SNPs and a gene expression on disease risk. We cast the test within the causal mediation analysis framework with the gene expression as a potential mediator. For eQTL SNPs, the use of gene expression information can enhance power to test for the total effect of a SNP-set on disease risk, that is, the combined direct effect of the SNPs and their indirect effect mediated through the gene expression. We show that the test statistic under the null hypothesis follows a mixture of $\chi^{2}$ distributions, which can be evaluated analytically or empirically using the resampling-based perturbation method. We construct tests for each of three disease models, in which disease risk is determined by SNPs only, by SNPs and gene expression, or by SNPs, gene expression and their interactions. As the true disease model is unknown in practice, we further propose an omnibus test to accommodate different underlying disease models. We evaluate the finite sample performance of the proposed methods using simulation studies, and show that our proposed test performs well and that the omnibus test can almost reach the optimal power attained when the disease model is known and correctly specified. We apply our method to reanalyze the overall effect of the SNP-set and expression of the ORMDL3 gene on the risk of asthma.
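
As a rough illustration of the kind of variance component score test described above, the following Python sketch computes a quadratic-form statistic for a set of predictors and approximates its null distribution, a mixture of $\chi^{2}_{1}$ variables, by Monte Carlo simulation. It is not the authors' implementation: it assumes a continuous outcome, an intercept-only null model and a linear kernel, whereas the paper concerns disease risk and additionally models gene expression and interactions. The function name `vc_score_test` and all of its parameters are hypothetical.

```python
import numpy as np

def vc_score_test(y, G, n_draws=20000, seed=0):
    """Illustrative variance-component score test with a linear kernel.

    Assumptions (not from the paper): continuous outcome y, intercept-only
    null model, unweighted columns of G (e.g. SNP genotypes).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    # Residuals and variance estimate under the intercept-only null model
    r = y - y.mean()
    sigma2 = r @ r / (n - 1)
    # Centre the columns of G so the intercept is projected out
    Gc = G - G.mean(axis=0)
    # Score-type statistic Q = r' Gc Gc' r / sigma2
    s = Gc.T @ r
    Q = s @ s / sigma2
    # Under the null, Q is approximately sum_j lambda_j * chi2_1,
    # where lambda_j are the eigenvalues of Gc' Gc
    lam = np.linalg.eigvalsh(Gc.T @ Gc)
    lam = lam[lam > 1e-10]
    # Monte Carlo approximation of the chi-square mixture tail probability
    draws = rng.chisquare(df=1, size=(n_draws, lam.size)) @ lam
    p_value = (1 + np.sum(draws >= Q)) / (1 + n_draws)
    return Q, p_value

# Usage with simulated genotypes (hypothetical data)
rng = np.random.default_rng(1)
G = rng.binomial(2, 0.3, size=(200, 5)).astype(float)
y = G @ np.full(5, 1.0) + rng.normal(size=200)
Q, p = vc_score_test(y, G)
print(Q, p)
```

The Monte Carlo step stands in for the analytic evaluation of the mixture distribution (for example by Davies' method) and for the perturbation procedure mentioned in the abstract; it is chosen here only because it needs nothing beyond NumPy.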

Article information

Ann. Appl. Stat., Volume 8, Number 1 (2014), 352-376.

First available in Project Euclid: 8 April 2014


Keywords: Causal inference; data integration; mediation analysis; mixed models; score test; SNP set analysis; variance component test


Huang, Yen-Tsung; VanderWeele, Tyler J.; Lin, Xihong. Joint analysis of SNP and gene expression data in genetic association studies of complex diseases. Ann. Appl. Stat. 8 (2014), no. 1, 352–376. doi:10.1214/13-AOAS690.




Supplemental materials

  • Supplementary material: Detailed causal and statistical development and supplementary table and figures. Section 1: detailed development of the causal mediation model and derivations referenced in Sections 4.1 and 4.2; Section 2: derivation of model (4.7) in Section 4.4; Section 3: asymptotic distribution of $Q$ referenced in Section 3.1; table and figures referenced in Sections 5.2 and 5.4.