The Annals of Applied Statistics

A sparse conditional Gaussian graphical model for analysis of genetical genomics data

Jianxin Yin and Hongzhe Li


Abstract

Genetical genomics experiments are now routinely conducted to measure both genetic markers and gene expression data on the same subjects. The gene expression levels are often treated as quantitative traits and subjected to standard genetic analysis in order to identify expression quantitative trait loci (eQTL). However, the genetic architecture of many gene expression traits may be complex, and a poorly estimated genetic architecture may compromise inference of the dependency structure of the genes at the transcriptional level. In this paper we introduce a sparse conditional Gaussian graphical model for studying the conditional independence relationships among a set of gene expressions after adjusting for possible genetic effects, where the gene expressions are modeled with seemingly unrelated regressions. We present an efficient coordinate descent algorithm to obtain penalized estimates of both the regression coefficients and the sparse concentration matrix. The corresponding graph can be used to determine the conditional independence among a group of genes while adjusting for shared genetic effects. Simulation experiments, together with asymptotic convergence rates and sparsistency results, are used to justify our proposed methods. By sparsistency, we mean the property that all parameters that are zero are estimated as zero with probability tending to one. We apply our methods to the analysis of a yeast eQTL data set and demonstrate that the conditional Gaussian graphical model leads to a more interpretable gene network than a standard Gaussian graphical model based on gene expression data alone.
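
The following toy simulation is not the authors' penalized estimator. It is a minimal sketch of why adjusting for shared genetic effects matters. It assumes one binary marker that drives two genes which are otherwise independent, and it replaces the joint penalized fit with ordinary least-squares adjustment. All names (`simulate`, `residuals`, `pearson`) and parameter values are hypothetical choices made for illustration.

```python
import random


def pearson(u, v):
    """Sample Pearson correlation of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


def residuals(y, x):
    """Residuals of y after a least-squares regression on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


def simulate(n=2000, effect=2.0, seed=0):
    """Hypothetical data: one 0/1 marker shifts both genes by `effect`.

    The two genes have independent noise, so they are conditionally
    independent given the marker.
    """
    rng = random.Random(seed)
    marker = [rng.randint(0, 1) for _ in range(n)]
    gene1 = [effect * m + rng.gauss(0.0, 1.0) for m in marker]
    gene2 = [effect * m + rng.gauss(0.0, 1.0) for m in marker]
    return marker, gene1, gene2


marker, gene1, gene2 = simulate()

# Expression data alone: the shared marker induces a spurious association.
marginal = pearson(gene1, gene2)

# After adjusting each gene for the marker, the association should vanish.
adjusted = pearson(residuals(gene1, marker), residuals(gene2, marker))

print(f"marginal correlation:        {marginal:.3f}")
print(f"marker-adjusted correlation: {adjusted:.3f}")
```

In this setup a graphical model fitted to expression data alone would tend to place an edge between the two genes, whereas a model that conditions on the marker would not. The method described in the abstract goes further by estimating the regression coefficients and a sparse concentration matrix jointly under penalization.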

Article information

Source
Ann. Appl. Stat., Volume 5, Number 4 (2011), 2630-2650.

Dates
First available in Project Euclid: 20 December 2011

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1324399609

Digital Object Identifier
doi:10.1214/11-AOAS494

Mathematical Reviews number (MathSciNet)
MR2907129

Zentralblatt MATH identifier
1234.62151

Keywords
eQTL; Gaussian graphical model; regularization; genetic networks; seemingly unrelated regression

Citation

Yin, Jianxin; Li, Hongzhe. A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Ann. Appl. Stat. 5 (2011), no. 4, 2630–2650. doi:10.1214/11-AOAS494. https://projecteuclid.org/euclid.aoas/1324399609



Supplemental materials

  • Supplementary material: Supplemental materials for “A sparse conditional Gaussian graphical model for analysis of genetical genomics data” (DOI:10.1214/11-AOAS494SUPP). The online supplemental materials include the simulation standard errors for Tables 1 and 2; two propositions, one on the Hessian matrix of the likelihood function and one on the convergence of the algorithm; and the theoretical properties of the proposed penalized estimates of the sparse cGGM: their asymptotic distribution and oracle properties when p and q are fixed as n → ∞, and the convergence rates and sparsistency of the estimators when p = p_n and q = q_n diverge as n → ∞. All proofs are also given in the supplemental materials.