Open Access
Determine the number of clusters by data augmentation
Wei Luo
Electron. J. Statist. 16(2): 3910-3936 (2022). DOI: 10.1214/22-EJS2032

Abstract

Determining the number of clusters is crucial for the successful application of clustering. In this paper, we propose a new order-determination method, called the data augmentation estimator (DAE), for general model-based clustering. The estimator is based on a novel idea: augmenting the data with an independently generated small cluster, which enables us to assess how the instability of clustering changes with the number of clusters assumed in clustering. The pattern of instability provides a characterization of the true number of clusters that is an alternative to the commonly used goodness-of-fit measure. By combining the two sources of information appropriately, the proposed estimator achieves asymptotic consistency under general conditions and is easy to implement. It is also more efficient than the conventional BIC-type approaches, which use the goodness-of-fit measure only. These properties are illustrated by simulation studies and real data examples at the end of the paper.

Acknowledgments

The author thanks the editor, the associate editor and the referees for their helpful comments and constructive suggestions. Luo was supported in part by the National Science Foundation of China (12131006, 12001484).

Citation

Wei Luo. "Determine the number of clusters by data augmentation." Electron. J. Statist. 16 (2) 3910 - 3936, 2022. https://doi.org/10.1214/22-EJS2032

Information

Received: 1 August 2021; Published: 2022
First available in Project Euclid: 14 July 2022

MathSciNet: MR4452010
zbMATH: 07567703
Digital Object Identifier: 10.1214/22-EJS2032

Keywords: Data augmentation, instability of clustering, Model-based clustering, order determination

Vol.16 • No. 2 • 2022