Open Access
2012 Regularized k-means clustering of high-dimensional data and its asymptotic consistency
Wei Sun, Junhui Wang, Yixin Fang
Electron. J. Statist. 6: 148-167 (2012). DOI: 10.1214/12-EJS668


K-means clustering is a widely used tool for cluster analysis because of its conceptual simplicity and computational efficiency. However, its performance can deteriorate on high-dimensional data, where the number of variables is relatively large and many of them may carry no information about the clustering structure. This article proposes a method for high-dimensional cluster analysis based on regularized k-means clustering, which simultaneously clusters similar observations and eliminates redundant variables. The key idea is to formulate k-means clustering as a regularization problem, with an adaptive group lasso penalty on the cluster centers. To balance the trade-off between clustering model fit and sparsity, a selection criterion based on clustering stability is developed. The asymptotic estimation and selection consistency of regularized k-means clustering with diverging dimension are established. The effectiveness of the method is also demonstrated through a variety of numerical experiments and applications to two gene microarray examples. The regularized clustering framework can also be extended to general model-based clustering.
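
The idea of penalizing cluster centers with a group lasso can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it alternates the usual k-means assignment step with a column-wise group soft-thresholding of the cluster means, which is a simplification chosen for brevity. The function names (`group_soft_threshold`, `farthest_point_init`, `regularized_kmeans`), the threshold parameter `lam`, the optional `weights` argument standing in for adaptive weights, and the deterministic farthest-point initialization are all assumptions made for this illustration. The stability-based selection of the tuning parameter is not shown.

```python
import numpy as np


def group_soft_threshold(v, t):
    """Shrink vector v toward zero by t in Euclidean norm; zero it if ||v|| <= t."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1.0 - t / norm) * v


def farthest_point_init(X, k):
    """Deterministic start (assumption): largest-norm point, then farthest points."""
    idx = [int(np.argmax((X ** 2).sum(axis=1)))]
    for _ in range(k - 1):
        # squared distance from every point to its nearest already-chosen center
        d2 = ((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(axis=2).min(axis=1)
        idx.append(int(np.argmax(d2)))
    return X[idx].copy()


def regularized_kmeans(X, k, lam, weights=None, n_iter=100):
    """Simplified group-lasso-penalized k-means (illustrative sketch only)."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)  # center, so uninformative variables have centers near 0
    n, p = X.shape
    # weights=None gives a plain (non-adaptive) group lasso penalty
    weights = np.ones(p) if weights is None else np.asarray(weights, dtype=float)
    centers = farthest_point_init(X, k)
    for _ in range(n_iter):
        # assignment step: nearest center in squared Euclidean distance
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: cluster means (an empty cluster keeps its old center)
        means = centers.copy()
        for c in range(k):
            if np.any(labels == c):
                means[c] = X[labels == c].mean(axis=0)
        # penalty step: shrink each variable's column of k center coordinates
        shrunk = np.column_stack(
            [group_soft_threshold(means[:, j], lam * weights[j]) for j in range(p)]
        )
        if np.allclose(shrunk, centers):
            break
        centers = shrunk
    # variables whose centers are not all zero are treated as informative
    selected = np.flatnonzero(np.linalg.norm(centers, axis=0) > 0)
    return labels, centers, selected


# Synthetic example: 2 informative variables, 8 pure-noise variables.
rng = np.random.default_rng(0)
n_per, p = 100, 10
X = rng.normal(size=(2 * n_per, p))
X[:n_per, :2] += 5.0
X[n_per:, :2] -= 5.0
labels, centers, selected = regularized_kmeans(X, k=2, lam=1.0)
print("selected variables:", selected)
```

Because the penalty acts on a whole column of the center matrix at once, a variable is dropped only when its center coordinates are shrunk to zero in every cluster, which is what lets clustering and variable selection happen together. Adaptive weights could be supplied through `weights`, for example as inverse column norms of the centers from an unpenalized run (`lam=0`); that choice is likewise an assumption of this sketch.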


Wei Sun, Junhui Wang, Yixin Fang. "Regularized k-means clustering of high-dimensional data and its asymptotic consistency." Electron. J. Statist. 6: 148-167, 2012.


Published: 2012
First available in Project Euclid: 3 February 2012

zbMATH: 1335.62109
MathSciNet: MR2879675
Digital Object Identifier: 10.1214/12-EJS668

Primary: 62H30

Keywords: diverging dimension, K-means, Lasso, selection consistency, stability, variable selection

Rights: Copyright © 2012 The Institute of Mathematical Statistics and the Bernoulli Society

