The Annals of Applied Statistics

A hierarchical framework for state-space matrix inference and clustering

Chandler Zuo, Kailei Chen, Kyle J. Hewitt, Emery H. Bresnick, and Sündüz Keleş


Integrative analysis of multiple experimental datasets measured over a large number of observational units is the focus of many contemporary genomic and epigenomic studies. The key objectives of such studies include not only inferring a hidden state of activity for each unit in each individual experiment, but also detecting highly associated clusters of units based on their inferred states. Although a number of methods are tailored to specific datasets, there is currently no state-of-the-art modeling framework for this general class of problems. In this paper, we develop the MBASIC (Matrix Based Analysis for State-space Inference and Clustering) framework. MBASIC consists of two parts: state-space mapping and state-space clustering. In state-space mapping, MBASIC maps observations onto a finite state-space representing the activation states of units across conditions. In state-space clustering, MBASIC incorporates a finite mixture model to cluster the units based on their inferred state-space profiles across all conditions. The state-space mapping and the clustering are estimated simultaneously through an Expectation-Maximization algorithm. MBASIC flexibly adapts to a large number of parametric distributions for the observed data, as well as to heterogeneity in replicate experiments. It allows structural assumptions to be imposed on each cluster and enables model selection using information criteria. In our data-driven simulation studies, MBASIC accurately recovered both the underlying state-space variables and the clustering structures. We applied MBASIC to two genome research problems using large numbers of datasets from the ENCODE project. The first application grouped genes based on transcription factor occupancy profiles of their promoter regions in two different cell types. The second application used transcription factor occupancy data to identify groups of loci that are similar to a GATA2 binding site that is functional at its endogenous locus, illustrating the applicability of MBASIC to a wide variety of problems. In both studies, MBASIC showed higher levels of raw data fidelity than a two-step approach that analyzes these data using ENCODE results on transcription factor occupancy.
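The clustering half of the approach described above can be illustrated with a deliberately simplified sketch. It is not the MBASIC implementation: it assumes the states are already known and binary (active or inactive), whereas MBASIC estimates states and clusters jointly from the raw observations. The function name `em_bernoulli_mixture`, its parameters, and the smoothing constants are all hypothetical choices made for illustration. The sketch fits a finite mixture of independent Bernoulli profiles with an Expectation-Maximization loop.

```python
import math
import random


def em_bernoulli_mixture(states, n_clusters, n_iter=50, seed=0):
    """Cluster units by binary state profiles (illustrative sketch only).

    states: list of units, each a list of 0/1 states, one per condition.
    Returns (labels, weights, probs). All names here are hypothetical.
    """
    rng = random.Random(seed)
    n_units, n_conds = len(states), len(states[0])
    # Mixing weights and per-cluster activation probabilities (random start).
    weights = [1.0 / n_clusters] * n_clusters
    probs = [[rng.uniform(0.25, 0.75) for _ in range(n_conds)]
             for _ in range(n_clusters)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each unit,
        # computed in log space and normalized for numerical stability.
        resp = []
        for row in states:
            log_post = []
            for j in range(n_clusters):
                lp = math.log(weights[j])
                for k, s in enumerate(row):
                    p = probs[j][k]
                    lp += math.log(p if s else 1.0 - p)
                log_post.append(lp)
            top = max(log_post)
            unnorm = [math.exp(v - top) for v in log_post]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: re-estimate weights and activation probabilities.
        # The small additive constants are an assumption of this sketch;
        # they keep every probability strictly inside (0, 1).
        for j in range(n_clusters):
            nj = sum(r[j] for r in resp)
            weights[j] = (nj + 1e-6) / (n_units + n_clusters * 1e-6)
            for k in range(n_conds):
                on = sum(r[j] * row[k] for r, row in zip(resp, states))
                probs[j][k] = (on + 0.5) / (nj + 1.0)
    # Hard assignment: most probable cluster for each unit.
    labels = [max(range(n_clusters), key=lambda j: r[j]) for r in resp]
    return labels, weights, probs


# Toy input: five units active in the first three conditions,
# five units active in the last three.
toy = [[1, 1, 1, 0, 0, 0]] * 5 + [[0, 0, 0, 1, 1, 1]] * 5
labels, weights, probs = em_bernoulli_mixture(toy, n_clusters=2)
```

In the full framework, the E-step would also have to account for uncertainty in the states themselves, using the chosen parametric distribution of the observed data; that part is omitted here.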

Article information

Ann. Appl. Stat., Volume 10, Number 3 (2016), 1348-1372.

Received: May 2015
Revised: January 2016
First available in Project Euclid: 28 September 2016

Keywords: State-space clustering; E-M algorithm; transcription factors; ChIP-seq


Zuo, Chandler; Chen, Kailei; Hewitt, Kyle J.; Bresnick, Emery H.; Keleş, Sündüz. A hierarchical framework for state-space matrix inference and clustering. Ann. Appl. Stat. 10 (2016), no. 3, 1348–1372. doi:10.1214/16-AOAS938.


