December 2022
Bayesian bi-clustering methods with applications in computational biology
Han Yan, Jiexing Wu, Yang Li, Jun S. Liu
Ann. Appl. Stat. 16(4): 2804-2831 (December 2022). DOI: 10.1214/22-AOAS1622


Bi-clustering is a useful approach for analyzing large biological datasets when the observations come from heterogeneous groups and have a large number of features. We outline a general Bayesian approach to tackling bi-clustering problems in moderate to high dimensions and propose three Bayesian bi-clustering models for categorical data, which increase in complexity in how they model the distributions of features across bi-clusters. Our proposed methods apply to a wide range of scenarios: from situations where data are cluster-distinguishable only among a small subset of features but masked by a large amount of noise, to situations where different groups of data are identified by different sets of features or the data exhibit hierarchical structures. Through simulation studies, we show that our methods outperform existing (bi-)clustering methods in both identifying clusters and recovering feature distributional patterns across bi-clusters. We further apply the developed approaches to a human genetic dataset, a human single-cell genomic dataset, and a collection of 1774 mouse genomic datasets with a focus on 58 genes from two pathways.

Funding Statement

This research was supported in part by NSF Grants DMS-1903139 and DMS-2015411.


Han Yan, Jiexing Wu and Yang Li contributed equally to the manuscript. This presentation reflects the analysis and views of Yang Li. No recipient should interpret this presentation to represent the general views of Citadel Securities or its personnel. Facts, analysis, and views presented in this presentation have not been reviewed by, and may not reflect information known to, other Citadel Securities professionals.



Han Yan, Jiexing Wu, Yang Li, Jun S. Liu. "Bayesian bi-clustering methods with applications in computational biology." Ann. Appl. Stat. 16(4): 2804-2831, December 2022.


Received: 1 February 2020; Revised: 1 October 2021; Published: December 2022
First available in Project Euclid: 26 September 2022

Digital Object Identifier: 10.1214/22-AOAS1622

Keywords: bi-clustering, categorical data, clustering, genetics, high dimensionality, model selection, variable selection

Rights: Copyright © 2022 Institute of Mathematical Statistics


