The Annals of Applied Statistics

A hierarchical Bayesian model for inference of copy number variants and their association to gene expression

Alberto Cassese, Michele Guindani, Mahlet G. Tadesse, Francesco Falciani, and Marina Vannucci

Abstract

A number of statistical models have been successfully developed for the analysis of high-throughput data from a single source, but few methods are available for integrating data from different sources. Here we focus on integrating gene expression levels with comparative genomic hybridization (CGH) array measurements collected on the same subjects. We specify a measurement error model that relates the gene expression levels to latent copy number states which, in turn, are related to the observed surrogate CGH measurements via a hidden Markov model. We employ selection priors that exploit the dependencies across adjacent copy number states and investigate MCMC stochastic search techniques for posterior inference. Our approach results in a unified modeling framework for simultaneously inferring copy number variants (CNV) and identifying their significant associations with mRNA transcript abundance. We show performance on simulated data and illustrate an application to data from a genomic study on human cancer cell lines.
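
To make the layered structure described in the abstract concrete, the following is a minimal simulation sketch, not the authors' implementation. It generates latent copy number states from a first-order Markov chain, noisy CGH measurements from those states, and one expression value that depends on a sparse subset of the latent states. The four-state coding, the Gaussian emissions, the linear expression model, and every numeric value and name (`STATES`, `CGH_MEAN`, `simulate_subject`, `active`) are assumptions chosen for illustration; the abstract does not specify them.

```python
import random

# Assumed state labels: 1 = loss, 2 = neutral, 3 = gain, 4 = amplification.
STATES = (1, 2, 3, 4)
# Assumed state-specific means of the observed CGH log-ratio.
CGH_MEAN = {1: -0.65, 2: 0.0, 3: 0.65, 4: 1.5}


def simulate_states(n_probes, stay, rng):
    """Draw a first-order Markov chain of copy number states along the genome."""
    chain = [rng.choice(STATES)]
    for _ in range(n_probes - 1):
        if rng.random() < stay:
            # Adjacent probes tend to share the same state.
            chain.append(chain[-1])
        else:
            chain.append(rng.choice([s for s in STATES if s != chain[-1]]))
    return chain


def simulate_subject(n_probes, active, stay=0.9, cgh_sd=0.1, expr_sd=0.5, rng=None):
    """Simulate latent states, CGH measurements, and one transcript for a subject.

    `active` maps probe index -> regression coefficient (hypothetical); probes
    absent from it have no effect, mimicking a sparse selection structure.
    """
    if rng is None:
        rng = random.Random()
    xi = simulate_states(n_probes, stay, rng)
    # Observed CGH values are noisy surrogates of the latent states.
    cgh = [rng.gauss(CGH_MEAN[s], cgh_sd) for s in xi]
    # Expression depends on the latent states, not on the noisy CGH values.
    expr = sum(beta * xi[m] for m, beta in active.items()) + rng.gauss(0.0, expr_sd)
    return xi, cgh, expr


rng = random.Random(1)
active = {3: 1.0, 4: 0.8}  # hypothetical: only probes 3 and 4 affect the transcript
data = [simulate_subject(20, active, rng=rng) for _ in range(5)]
```

The sketch only runs the generative direction. Posterior inference over the latent states and the selected associations, which the abstract handles with MCMC stochastic search, is not shown.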

Article information

Source
Ann. Appl. Stat., Volume 8, Number 1 (2014), 148-175.

Dates
First available in Project Euclid: 8 April 2014

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1396966282

Digital Object Identifier
doi:10.1214/13-AOAS705

Mathematical Reviews number (MathSciNet)
MR3191986

Zentralblatt MATH identifier
06302231

Keywords
Bayesian hierarchical models; comparative genomic hybridization arrays; gene expression; hidden Markov models; measurement error; variable selection

Citation

Cassese, Alberto; Guindani, Michele; Tadesse, Mahlet G.; Falciani, Francesco; Vannucci, Marina. A hierarchical Bayesian model for inference of copy number variants and their association to gene expression. Ann. Appl. Stat. 8 (2014), no. 1, 148–175. doi:10.1214/13-AOAS705. https://projecteuclid.org/euclid.aoas/1396966282


Supplemental materials

  • Supplementary material: Supplement to “A hierarchical Bayesian model for inference of copy number variants and their association to gene expression”. Description of the MCMC steps and additional results on the case study.