Electronic Journal of Statistics

Regularized k-means clustering of high-dimensional data and its asymptotic consistency

Wei Sun, Junhui Wang, and Yixin Fang


K-means clustering is a widely used tool for cluster analysis due to its conceptual simplicity and computational efficiency. However, its performance can deteriorate on high-dimensional data, where the number of variables is relatively large and many of them may carry no information about the clustering structure. This article proposes a high-dimensional cluster analysis method via regularized k-means clustering, which simultaneously clusters similar observations and eliminates redundant variables. The key idea is to formulate k-means clustering as a regularization problem, with an adaptive group lasso penalty on the cluster centers. To balance the trade-off between clustering model fit and sparsity, a selection criterion based on clustering stability is developed. The asymptotic estimation and selection consistency of regularized k-means clustering with diverging dimension are established. The effectiveness of regularized k-means clustering is also demonstrated through a variety of numerical experiments as well as applications to two gene microarray examples. The regularized clustering framework can also be extended to general model-based clustering.
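
The idea of penalizing cluster centers can be illustrated with a minimal sketch. It is not the authors' implementation: the objective written in the docstring, the adaptive weights (inverse column norms of the unpenalized k-means centers), the proximal-gradient center update, and all function and parameter names are assumptions made for illustration. The penalty level `lam` is fixed by hand here, whereas the article selects it with a stability-based criterion.

```python
import numpy as np


def assign(X, centers):
    """Label each row of X with the index of its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def cluster_means(X, labels, k):
    """Per-cluster means and sizes; an empty cluster gets a zero mean."""
    counts = np.bincount(labels, minlength=k).astype(float)
    means = np.zeros((k, X.shape[1]))
    for c in range(k):
        if counts[c] > 0:
            means[c] = X[labels == c].mean(axis=0)
    return means, counts


def regularized_kmeans(X, k, lam, n_iter=20, n_inner=50, seed=0):
    """Sketch of k-means with an adaptive group lasso penalty on the centers.

    Assumed objective (an illustration, not quoted from the article):
        (1/n) * sum_i ||x_i - c_label(i)||^2 + lam * sum_j w_j * ||C[:, j]||_2
    where C[:, j] collects the j-th coordinate of all k centers.
    """
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)  # center columns so an uninformative variable has zero centers
    n, p = X.shape

    # Step 1: plain Lloyd iterations give unpenalized centers.
    centers = X[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iter):
        means, counts = cluster_means(X, assign(X, centers), k)
        centers = np.where(counts[:, None] > 0, means, centers)

    # Step 2: adaptive weights, large for variables whose centers are near zero.
    weights = 1.0 / (np.linalg.norm(centers, axis=0) + 1e-12)

    # Step 3: alternate assignments and penalized center updates.
    for _ in range(n_iter):
        means, counts = cluster_means(X, assign(X, centers), k)
        frac = counts / n
        step = 1.0 / (2.0 * frac.max())  # 1 / Lipschitz constant of the smooth part
        for _ in range(n_inner):
            # Gradient step on the within-cluster sum of squares ...
            Z = centers - step * 2.0 * frac[:, None] * (centers - means)
            # ... then column-wise group soft-thresholding.
            norms = np.maximum(np.linalg.norm(Z, axis=0), 1e-12)
            centers = Z * np.maximum(0.0, 1.0 - step * lam * weights / norms)

    return assign(X, centers), centers
```

A variable is treated as eliminated when its column of `centers` is exactly zero, so the retained variables can be read off with `np.flatnonzero(np.linalg.norm(centers, axis=0) > 0)`.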

Article information

Electron. J. Statist., Volume 6 (2012), 148-167.

First available in Project Euclid: 3 February 2012

Primary: 62H30: Classification and discrimination; cluster analysis [See also 68T10, 91C20]

Keywords: K-means, diverging dimension, lasso, selection consistency, variable selection, stability


Sun, Wei; Wang, Junhui; Fang, Yixin. Regularized k-means clustering of high-dimensional data and its asymptotic consistency. Electron. J. Statist. 6 (2012), 148--167. doi:10.1214/12-EJS668. https://projecteuclid.org/euclid.ejs/1328280901

