Cross-Study Replicability in Cluster Analysis
Lorenzo Masoero, Emma Thomas, Giovanni Parmigiani, Svitlana Tyekucheva, Lorenzo Trippa
Statist. Sci. 38(2): 303-316 (May 2023). DOI: 10.1214/22-STS871


In cancer research, clustering techniques are widely used for exploratory analyses and play a critical role in the identification of novel cancer subtypes and in patient management. As the volume of data collected by multiple research groups grows, it is increasingly feasible to investigate the replicability of clustering procedures, that is, their ability to consistently recover biologically meaningful clusters across several data sets. In this paper, we review methods for assessing the replicability of clustering analyses and discuss a novel framework for evaluating cross-study clustering replicability, which is useful when two or more studies are available. Our approach can be applied to any clustering algorithm and can employ different measures of similarity between partitions to quantify replicability, both globally (i.e., for the whole sample) and locally (i.e., for individual clusters). Using experiments on synthetic and real gene expression data, we illustrate the usefulness of our procedure for evaluating whether the same clusters are identified consistently across a collection of data sets.
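
To make the notion of a "measure of similarity between partitions" concrete, the sketch below compares two labelings of the same samples. It is only an illustration under stated assumptions: the abstract does not say which measures the paper uses, so the adjusted Rand index (a global score) and a best-match Jaccard overlap per cluster (a local score) are stand-ins chosen here, and the function names `adjusted_rand_index` and `best_jaccard_per_cluster` are hypothetical. How the two labelings are obtained across studies is likewise left open.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Global agreement between two partitions of the same items.

    Assumption: the adjusted Rand index is used as an example measure;
    the paper may rely on other similarity measures.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same items")
    n = len(labels_a)
    # Count pairs of items placed together in both, in A, and in B.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total = comb(n, 2)
    expected = pairs_a * pairs_b / total if total else 0.0
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (pairs_joint - expected) / (max_index - expected)


def best_jaccard_per_cluster(labels_a, labels_b):
    """Local agreement: for each cluster in A, its best Jaccard match in B.

    Assumption: a hypothetical per-cluster score, chosen for illustration.
    """
    members_a, members_b = {}, {}
    for i, (a, b) in enumerate(zip(labels_a, labels_b)):
        members_a.setdefault(a, set()).add(i)
        members_b.setdefault(b, set()).add(i)
    return {
        a: max(len(sa & sb) / len(sa | sb) for sb in members_b.values())
        for a, sa in members_a.items()
    }


if __name__ == "__main__":
    # Two hypothetical labelings of the same six samples.
    first = [0, 0, 0, 1, 1, 1]
    second = [0, 0, 1, 1, 1, 1]
    print(adjusted_rand_index(first, second))
    print(best_jaccard_per_cluster(first, second))
```

Both functions depend only on which samples are grouped together, not on the label values, so a relabeled copy of the same partition scores as a perfect match. The global score summarizes agreement over the whole sample, while the per-cluster dictionary shows which individual clusters are recovered well and which are not.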

Funding Statement

The fifth author has been supported by NIH Grant 5R01LM013352-02 and NSF Grant 2113707.


Citation

Lorenzo Masoero, Emma Thomas, Giovanni Parmigiani, Svitlana Tyekucheva, Lorenzo Trippa. "Cross-Study Replicability in Cluster Analysis." Statist. Sci. 38(2): 303-316, May 2023.


Published: May 2023
First available in Project Euclid: 6 February 2023

MathSciNet: MR4596760
zbMATH: 07708433
Digital Object Identifier: 10.1214/22-STS871

Keywords: clustering, multiple studies, replicability

Rights: Copyright © 2023 Institute of Mathematical Statistics


