Bi-clustering is a useful approach for analyzing large biological data sets when the observations come from heterogeneous groups and have a large number of features. We outline a general Bayesian approach to bi-clustering problems in moderate to high dimensions and propose three Bayesian bi-clustering models for categorical data, which increase in complexity in how they model the distributions of features across bi-clusters. Our proposed methods apply to a wide range of scenarios, from situations where clusters are distinguishable only on a small subset of features and are masked by a large amount of noise, to situations where different groups of data are identified by different sets of features or where the data exhibit hierarchical structure. Through simulation studies, we show that our methods outperform existing (bi-)clustering methods both in identifying clusters and in recovering feature distributional patterns across bi-clusters. We further apply the developed approaches to a human genetic dataset, a human single-cell genomic dataset, and a collection of 1774 mouse genomic datasets with a focus on 58 genes from two pathways.
This research was supported in part by NSF Grants DMS-1903139 and DMS-2015411.
Han Yan, Jiexing Wu and Yang Li contributed equally to the manuscript. This presentation reflects the analysis and views of Yang Li. No recipient should interpret this presentation to represent the general views of Citadel Securities or its personnel. Facts, analysis, and views presented in this presentation have not been reviewed by, and may not reflect information known to, other Citadel Securities professionals.
"Bayesian bi-clustering methods with applications in computational biology." Ann. Appl. Stat. 16 (4), 2804-2831, December 2022. https://doi.org/10.1214/22-AOAS1622