The Annals of Applied Statistics

Wavelet-based genetic association analysis of functional phenotypes arising from high-throughput sequencing assays

Heejung Shim and Matthew Stephens


Abstract

Understanding how genetic variants influence cellular-level processes is an important step toward understanding how they influence important organismal-level traits, or “phenotypes,” including human disease susceptibility. To this end, scientists are undertaking large-scale genetic association studies that aim to identify genetic variants associated with molecular and cellular phenotypes, such as gene expression, transcription factor binding, or chromatin accessibility. These studies use high-throughput sequencing assays (e.g., RNA-seq, ChIP-seq, DNase-seq) to obtain high-resolution data on how the traits vary along the genome in each sample. However, typical association analyses fail to exploit these high-resolution measurements, instead aggregating the data at coarser resolutions, such as genes, or windows of fixed length. Here we develop and apply statistical methods that better exploit the high-resolution data. The key idea is to treat the sequence data as measuring an underlying “function” that varies along the genome, and then, building on wavelet-based methods for functional data analysis, test for association between genetic variants and the underlying function. Applying these methods to identify genetic variants associated with chromatin accessibility (dsQTLs), we find that they identify substantially more associations than a simpler window-based analysis, and in total we identify 772 novel dsQTLs not identified by the original analysis.
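The contrast between a window-based analysis and a wavelet-based one can be illustrated with a minimal sketch. It is not the authors' method: the keywords indicate that the paper uses a Bayesian hierarchical model, whereas this sketch swaps in an ordinary least-squares t statistic per Haar wavelet coefficient. The function names (`haar_dwt`, `coefficient_t_stats`), the simulated data, and every numeric setting are assumptions made for illustration only.

```python
import numpy as np


def haar_dwt(x):
    """Full orthonormal Haar decomposition of a length-2^J vector.

    Returns [scaling coefficient, coarsest detail, ..., finest details].
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    details = []
    approx = x
    while approx.size > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # local differences
        approx = (even + odd) / np.sqrt(2.0)         # local sums
    # details were collected finest-first; reverse to coarsest-first.
    return np.concatenate([approx] + details[::-1])


def coefficient_t_stats(profiles, genotypes):
    """t statistic from regressing each wavelet coefficient on genotype.

    profiles:  (n_individuals, 2^J) array of per-base signal.
    genotypes: (n_individuals,) allele counts (0, 1 or 2).
    """
    W = np.apply_along_axis(haar_dwt, 1, np.asarray(profiles, dtype=float))
    g = np.asarray(genotypes, dtype=float)
    g = g - g.mean()
    Wc = W - W.mean(axis=0)
    sxx = g @ g
    beta = g @ Wc / sxx                      # one slope per coefficient
    resid = Wc - np.outer(g, beta)
    se = np.sqrt((resid ** 2).sum(axis=0) / (len(g) - 2) / sxx)
    return beta / se


# Simulated example; all numbers are hypothetical.
rng = np.random.default_rng(0)
n_ind, n_bases = 60, 64
genotypes = rng.binomial(2, 0.5, size=n_ind)
rate = np.full((n_ind, n_bases), 5.0)
rate[:, 16:20] += 4.0 * genotypes[:, None]   # narrow genotype-dependent bump
profiles = rng.poisson(rate)

t_wavelet = coefficient_t_stats(profiles, genotypes)

# The scaling coefficient is the window total divided by sqrt(n_bases),
# so its t statistic is the one a fixed-window analysis would report.
t_window = t_wavelet[0]
print("window-total |t|:", abs(t_window))
print("largest wavelet |t|:", np.abs(t_wavelet).max())
```

In this sketch the first coefficient plays the role of the fixed-window summary, while the remaining coefficients capture differences in signal at successively finer scales, which is the kind of high-resolution information a window total discards.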

Article information

Ann. Appl. Stat. Volume 9, Number 2 (2015), 665-686.

Received: July 2013
Revised: June 2014
First available in Project Euclid: 20 July 2015

Digital Object Identifier: doi:10.1214/14-AOAS776

Keywords: Wavelets; high-throughput sequencing assays; RNA-seq; DNase-seq; chromatin accessibility; ChIP-seq; genetic association analysis; hierarchical model; Bayesian inference; functional data


Shim, Heejung; Stephens, Matthew. Wavelet-based genetic association analysis of functional phenotypes arising from high-throughput sequencing assays. Ann. Appl. Stat. 9 (2015), no. 2, 665–686. doi:10.1214/14-AOAS776.


Supplemental materials

  • Supplement to “Wavelet-based genetic association analysis of functional phenotypes arising from high-throughput sequencing assays.” The supplementary material referenced in Sections 3, 4 and 5 is provided in the supplement file.