Bayesian bi-clustering methods with applications in computational biology

Han Yan; Jiexing Wu; Yang Li; Jun S. Liu

doi:10.1214/22-AOAS1622

Abstract

Bi-clustering is a useful approach for analyzing large biological datasets when the observations come from heterogeneous groups and have a large number of features. We outline a general Bayesian approach to bi-clustering problems in moderate to high dimensions and propose three Bayesian bi-clustering models for categorical data, which increase in complexity in how they model the distributions of features across bi-clusters. Our proposed methods apply to a wide range of scenarios: from situations where the data are cluster-distinguishable only among a small subset of features but masked by a large amount of noise, to situations where different groups of data are identified by different sets of features or the data exhibit hierarchical structures. Through simulation studies, we show that our methods outperform existing (bi-)clustering methods in both identifying clusters and recovering feature distributional patterns across bi-clusters. We further apply the developed approaches to a human genetic dataset, a human single-cell genomic dataset, and a collection of 1774 mouse genomic datasets with a focus on 58 genes from two pathways.

Funding Statement

This research was supported in part by NSF Grants DMS-1903139 and DMS-2015411.

Acknowledgments

Han Yan, Jiexing Wu and Yang Li contributed equally to the manuscript. This presentation reflects the analysis and views of Yang Li. No recipient should interpret this presentation to represent the general views of Citadel Securities or its personnel. Facts, analysis, and views presented in this presentation have not been reviewed by, and may not reflect information known to, other Citadel Securities professionals.

Citation

Han Yan, Jiexing Wu, Yang Li, Jun S. Liu. "Bayesian bi-clustering methods with applications in computational biology." Ann. Appl. Stat. 16 (4) 2804-2831, December 2022. https://doi.org/10.1214/22-AOAS1622

Information

Received: 1 February 2020; Revised: 1 October 2021; Published: December 2022

First available in Project Euclid: 26 September 2022

MathSciNet: MR4489234

zbMATH: 1498.62128

Digital Object Identifier: 10.1214/22-AOAS1622

Keywords: bi-clustering, categorical data, clustering, genetics, high dimensionality, model selection, variable selection
