Abstract
In this paper we address the problem of obtaining a single clustering estimate $\hat{c}$ based on an MCMC sample of clusterings $c^{(1)},c^{(2)},\ldots,c^{(M)}$ from the posterior distribution of a Bayesian cluster model. Methods to derive $\hat{c}$ when the number of groups $K$ varies between the clusterings are reviewed and discussed. These include the maximum a posteriori (MAP) estimate and methods based on the posterior similarity matrix, a matrix containing the posterior probabilities that observations $i$ and $j$ are in the same cluster. The posterior similarity matrix is related to a commonly used loss function by Binder (1978). Minimization of this loss is shown to be equivalent to maximizing the Rand index between the estimated and the true clustering. We propose new criteria for estimating a clustering, which are based on the posterior expected adjusted Rand index. The criteria are shown to possess a shrinkage property and to outperform Binder's loss in a simulation study and in an application to gene expression data. They also perform favorably compared to other clustering procedures.
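As a rough illustration of the quantities named above, the following sketch estimates the posterior similarity matrix from a set of sampled clusterings and then picks the sampled clustering with the smallest posterior expected Binder loss. It is not the authors' implementation: the function names are hypothetical, the loss is assumed to use equal misclassification costs (so that it reduces to $\sum_{i<j} |I(c_i = c_j) - \pi_{ij}|$), and the search is restricted to the MCMC draws themselves. The adjusted Rand criteria proposed in the paper are not implemented here.

```python
import numpy as np

def posterior_similarity_matrix(samples):
    """Estimate pi_ij = P(c_i = c_j | data) as the fraction of
    sampled clusterings in which observations i and j share a label."""
    samples = np.asarray(samples)  # shape (M, n): M draws, n observations
    # same[m, i, j] is True when i and j are in the same cluster in draw m
    same = samples[:, :, None] == samples[:, None, :]
    return same.mean(axis=0)

def binder_loss(labels, psm):
    """Posterior expected Binder loss, assuming equal costs,
    summed over all pairs i < j."""
    labels = np.asarray(labels)
    together = labels[:, None] == labels[None, :]
    i, j = np.triu_indices(len(labels), k=1)
    return np.abs(together[i, j].astype(float) - psm[i, j]).sum()

def best_sampled_clustering(samples):
    """Return the sampled clustering minimizing the expected loss,
    together with the estimated posterior similarity matrix."""
    samples = np.asarray(samples)
    psm = posterior_similarity_matrix(samples)
    losses = [binder_loss(c, psm) for c in samples]
    return samples[int(np.argmin(losses))], psm

# Toy "MCMC output": four draws over four observations (made-up labels).
draws = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 0, 1],
         [1, 1, 0, 0]]
c_hat, psm = best_sampled_clustering(draws)
```

Because the comparison is made on pairwise co-membership, the first and last draws count as the same partition even though their labels are swapped, which is why the posterior similarity matrix is a convenient summary when the labeling and the number of groups vary between draws.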
Citation
Arno Fritsch, Katja Ickstadt. "Improved criteria for clustering based on the posterior similarity matrix." Bayesian Anal. 4 (2) 367-391, June 2009. https://doi.org/10.1214/09-BA414
Information