Bayesian Analysis

Improved criteria for clustering based on the posterior similarity matrix

Arno Fritsch and Katja Ickstadt

Abstract

In this paper we address the problem of obtaining a single clustering estimate $\hat{c}$ based on an MCMC sample of clusterings $c^{(1)}, c^{(2)}, \ldots, c^{(M)}$ from the posterior distribution of a Bayesian cluster model. Methods to derive $\hat{c}$ when the number of groups $K$ varies between the clusterings are reviewed and discussed. These include the maximum a posteriori (MAP) estimate and methods based on the posterior similarity matrix, a matrix containing the posterior probabilities that observations $i$ and $j$ are in the same cluster. The posterior similarity matrix is related to a commonly used loss function by Binder (1978). Minimization of this loss is shown to be equivalent to maximizing the Rand index between estimated and true clustering. We propose new criteria for estimating a clustering, which are based on the posterior expected adjusted Rand index. The criteria are shown to possess a shrinkage property and outperform Binder's loss in a simulation study and in an application to gene expression data. They also perform favorably compared to other clustering procedures.
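
As a concrete illustration of the quantities mentioned in the abstract, the short Python sketch below computes the posterior similarity matrix from an MCMC sample of label vectors and evaluates the equal-cost version of Binder's criterion for a candidate clustering; the function names, variable names, and toy data are illustrative assumptions, not taken from the paper or its accompanying software.

```python
import numpy as np

def posterior_similarity(labels):
    """Posterior similarity matrix from an (M x n) array of MCMC cluster labels.

    Entry (i, j) estimates the posterior probability that observations i and j
    belong to the same cluster, i.e. the fraction of MCMC draws in which their
    labels coincide.
    """
    labels = np.asarray(labels)
    M, n = labels.shape
    psm = np.zeros((n, n))
    for c in labels:                      # one clustering per MCMC draw
        psm += (c[:, None] == c[None, :])
    return psm / M

def binder_score(chat, psm):
    """Quantity maximized by the equal-cost Binder criterion:
    the sum of (pi_ij - 1/2) over pairs i < j clustered together in `chat`."""
    chat = np.asarray(chat)
    same = (chat[:, None] == chat[None, :])
    iu = np.triu_indices(len(chat), k=1)  # pairs i < j only
    return np.sum(same[iu] * (psm[iu] - 0.5))

# Toy usage (illustrative only): 4 MCMC draws of labels for 5 observations.
draws = np.array([[0, 0, 1, 1, 2],
                  [0, 0, 1, 1, 1],
                  [0, 0, 0, 1, 1],
                  [1, 1, 0, 0, 0]])
psm = posterior_similarity(draws)
print(binder_score([0, 0, 1, 1, 1], psm))
```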

Article information

Source
Bayesian Anal., Volume 4, Number 2 (2009), 367–391.

Dates
First available in Project Euclid: 22 June 2012

Permanent link to this document
https://projecteuclid.org/euclid.ba/1340370282

Digital Object Identifier
doi:10.1214/09-BA414

Mathematical Reviews number (MathSciNet)
MR2507368

Zentralblatt MATH identifier
1330.62249

Keywords
adjusted Rand index; cluster analysis; Dirichlet process mixture model; Markov chain Monte Carlo

Citation

Fritsch, Arno; Ickstadt, Katja. Improved criteria for clustering based on the posterior similarity matrix. Bayesian Anal. 4 (2009), no. 2, 367--391. doi:10.1214/09-BA414. https://projecteuclid.org/euclid.ba/1340370282


References

  • Antoniak, C. E. (1974). "Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems." Annals of Statistics, 2: 1152–1174.
  • Bensmail, H., Celeux, G., Raftery, A. E., and Robert, C. P. (1997). "Inference in Model-Based Cluster Analysis." Statistics and Computing, 7: 1–10.
  • Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. New York: Wiley.
  • Binder, D. A. (1978). "Bayesian Cluster Analysis." Biometrika, 65: 31–38.
  • Blackwell, D. and MacQueen, J. B. (1973). "Ferguson Distributions via Polya Urn Schemes." Annals of Statistics, 1: 353–355.
  • Dahl, D. B. (2005). "Sequentially-Allocated Merge-Split Sampler for Conjugate and Nonconjugate Dirichlet Process Mixture Models." Journal of Computational and Graphical Statistics, under revision (preprint available from author's web page).
  • Dahl, D. B. (2006). "Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model." In Do, K. A., Müller, P., and Vannucci, M. (eds.), Bayesian Inference for Gene Expression and Proteomics, 201–218. Cambridge University Press.
  • Dahl, D. B. and Newton, M. A. (2007). "Multiple Hypothesis Testing by Clustering Treatment Effects." Journal of the American Statistical Association, 102: 517–526.
  • Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm." Journal of the Royal Statistical Society, Ser. B, 39: 1–38.
  • Dunson, D. B. (2008). "Nonparametric Bayes Applications to Biostatistics." Department of Statistical Science, Duke University, Durham, NC, USA, (Technical Report 6).
  • Escobar, M. D. and West, M. (1995). "Bayesian Density Estimation and Inference Using Mixtures." Journal of the American Statistical Association, 90: 577–588.
  • Ferguson, T. S. (1973). "A Bayesian Analysis of Some Nonparametric Problems." Annals of Statistics, 1: 209–230.
  • Fraley, C. and Raftery, A. E. (2002). "Model-Based Clustering, Discriminant Analysis and Density Estimation." Journal of the American Statistical Association, 97: 611–631.
  • Gene Ontology Consortium (2000). "Gene Ontology: Tool for the Unification of Biology." Nature Genetics, 25: 25–29.
  • Griffin, J. E. and Steel, M. F. J. (2006). "Order-Based Dependent Dirichlet Processes." Journal of the American Statistical Association, 101: 179–194.
  • Hubert, L. and Arabie, P. (1985). "Comparing Partitions." Journal of Classification, 2: 193–218.
  • Hurn, M., Justel, A., and Robert, C. P. (2003). "Estimating Mixtures of Regressions." Journal of Computational and Graphical Statistics, 12: 55–79.
  • Ideker, T., Thorsson, V., Ranish, J. A., Christmas, R., Buhler, J., Eng, J. K., Bumgarner, R., Goodlett, D. R., Aebersold, R., and Hood, L. (2001). "Integrated Genomic and Proteomic Analyses of a Systematically Perturbed Metabolic Network." Science, 292: 929–934.
  • Ishwaran, H. and James, L. F. (2001). "Gibbs Sampling Methods for Stick-Breaking Priors." Journal of the American Statistical Association, 96: 161–173.
  • Jara, A., Hanson, T., Quintana, F. A., Müller, P., and Rosner, G. L. (2009). "DPpackage: Bayesian Nonparametric and Semiparametric Analysis." R package version 1.0-7.
  • Jensen, S. T. and Liu, J. S. (2008). "Bayesian Clustering of Transcription Factor Binding Motifs." Journal of the American Statistical Association, 103: 188–200.
  • Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995). Continuous Univariate Distributions, Volume 2. New York: Wiley, 2nd edition.
  • Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data. New York: Wiley.
  • Kim, S., Tadesse, M. G., and Vannucci, M. (2006). "Variable Selection in Clustering via Dirichlet Process Mixture Models." Biometrika, 93: 877–893.
  • Lau, J. W. and Green, P. J. (2007). "Bayesian Model-Based Clustering Procedures." Journal of Computational and Graphical Statistics, 16: 526–558.
  • Lijoi, A., Mena, R. H., and Prünster, I. (2007). "Controlling the Reinforcement in Bayesian Non-Parametric Mixture Models." Journal of the Royal Statistical Society, Ser. B, 69: 715–740.
  • Medvedovic, M., Yeung, K., and Bumgarner, R. (2004). "Bayesian Mixture Model Based Clustering of Replicated Microarray Data." Bioinformatics, 20: 1222–1232.
  • Meilă, M. (2007). "Comparing Clusterings – an Information Based Distance." Journal of Multivariate Analysis, 98: 873–895.
  • Milligan, G. W. and Cooper, M. C. (1986). "A Study of the Comparability of External Criteria for Hierarchical Cluster Analysis." Multivariate Behavioral Research, 21: 441–458.
  • Pitman, J. and Yor, M. (1997). "The Two-Parameter Poisson-Dirichlet Distribution Derived from a Stable Subordinator." Annals of Probability, 25: 855–900.
  • Qin, Z. S. (2006). "Clustering Microarray Gene Expression Data Using Weighted Chinese Restaurant Process." Bioinformatics, 22: 1988–1997.
  • Quintana, F. A. and Iglesias, P. L. (2003). "Bayesian Clustering and Product Partition Models." Journal of the Royal Statistical Society, Ser. B, 65: 557–574.
  • R Development Core Team (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org
  • Rand, W. M. (1971). "Objective Criteria for the Evaluation of Clustering Methods." Journal of the American Statistical Association, 66: 846–850.
  • Richardson, S. and Green, P. J. (1997). "On Bayesian Analysis of Mixtures with an Unknown Number of Components." Journal of the Royal Statistical Society, Ser. B, 59: 731–792.
  • Stephens, M. (2000). "Dealing with Label Switching in Mixture Models." Journal of the Royal Statistical Society, Ser. B, 62: 795–809.
  • Tadesse, M. G., Sha, N., and Vannucci, M. (2005). "Bayesian Variable Selection in Clustering High-Dimensional Data." Journal of the American Statistical Association, 100: 602–617.