Regularized k-means clustering of high-dimensional data and its asymptotic consistency

K-means clustering is a widely used tool for cluster analysis due to its conceptual simplicity and computational efficiency. However, its performance can be distorted when clustering high-dimensional data where the number of variables becomes relatively large and many of them may contain no information about the clustering structure. This article proposes a high-dimensional cluster analysis method via regularized k-means clustering, which can simultaneously cluster similar observations and eliminate redundant variables. The key idea is to formulate the k-means clustering in a form of regularization, with an adaptive group lasso penalty term on cluster centers. In order to optimally balance the trade-off between the clustering model fitting and sparsity, a selection criterion based on clustering stability is developed. The asymptotic estimation and selection consistency of the regularized k-means clustering with diverging dimension is established. The effectiveness of the regularized k-means clustering is also demonstrated through a variety of numerical experiments as well as applications to two gene microarray examples. The regularized clustering framework can also be extended to the general model-based clustering. AMS 2000 subject classifications: Primary 62H30.


Introduction
Cluster analysis assigns observations to a number of clusters such that observations in the same cluster are similar to each other. The similarity is often quantified by some distance measure, such as the Euclidean distance [15] or correlation [2]. To optimize the similarity measures, various clustering algorithms have been developed. Among others, k-means clustering is one of the most popular clustering algorithms; it aims at minimizing the within-cluster dissimilarity measured by the Euclidean distance. While k-means clustering is conceptually simple and computationally efficient, its performance can deteriorate severely when clustering high-dimensional data where the number of variables becomes large and many of them may contain no information about the clustering structure. Furthermore, the interpretability of k-means clustering can be impeded as it usually includes all the variables and produces complicated clustering models. To overcome these difficulties in clustering high-dimensional data, a more appropriate clustering algorithm that can simultaneously perform cluster analysis and select informative variables is in demand.
In the statistical literature, two major kinds of variable selection techniques have been developed in the context of high-dimensional data analysis. The first kind pre-screens the redundant variables by conducting a multiple testing procedure and controlling certain error rates, such as [7] and the references therein. The second kind is the shrinkage method, which penalizes the model fitting with various types of regularization terms that encourage model sparsity, such as the LASSO regression in [20]. Although variable selection for regression has been extensively studied, analogous results for clustering are limited; examples include [19,16,26,24,12,10]. Focusing on k-means and hierarchical clustering, [25] proposed a general sparse clustering framework using an idea similar to the nonnegative garrote [4]; however, the asymptotic consistency was not discussed in their framework.
In this article, we propose a regularized k-means clustering, which can perform cluster analysis and variable selection at the same time. The key idea is to formulate k-means clustering in a form of regularization, with an adaptive group lasso penalty term on cluster centers. Note that all cluster centers share the same set of variables, so the group lasso penalty term is employed to select the variables in a group fashion; i.e., a variable is redundant if it is not used in any cluster center. The regularized k-means clustering framework can also be extended to model-based clustering, where the EM algorithm is employed to minimize the regularized negative log-likelihood function. To optimally balance the trade-off between model fitting and sparsity, a model selection criterion is developed based on the clustering stability in [3,23]. The key idea is that if multiple samples are available from the same distribution, a good clustering algorithm should yield clustering assignments of observations that do not vary much from one sample to another. An efficient estimation scheme based on the bootstrap is proposed to accurately estimate the clustering stability in high-dimensional clustering. Furthermore, the asymptotic estimation and selection consistency of the proposed regularized k-means clustering with diverging dimension is established. Whereas selection consistency in regression has been obtained in [9,29], analogous results in the context of cluster analysis seem rare. The effectiveness of the proposed algorithms is also demonstrated in a variety of simulated examples as well as applications to two gene microarray examples.
The rest of the paper is organized as follows. Section 2 reviews the standard k-means clustering. Section 3 presents the proposed regularized k-means clustering as well as its efficient implementation. Section 4 introduces the stability-based model selection criterion for tuning the regularized k-means clustering. Asymptotic estimation and selection consistency is established in section 5. Extension to the regularized model-based clustering is provided in section 6, followed by simulation studies in section 7 and two real gene examples in section 8. A brief discussion is given in section 9. Technical details are provided in the Appendix.

Clustering analysis and k-means clustering
In k-means clustering, assume that $n$ data points $X_1, \ldots, X_n$ are available with $X_i = (X_{i1}, \ldots, X_{ip})^T$, and the number of clusters is pre-specified as $K$. The $K$ clusters are denoted by $A_1, \ldots, A_K$ with centers $C_1, \ldots, C_K$, where $C_k = (C_{k1}, \ldots, C_{kp})^T$. K-means clustering then attempts to solve

$$\min_{A_k, C_k} \sum_{k=1}^{K} \sum_{X_i \in A_k} \|X_i - C_k\|^2, \tag{1}$$

where $\|\cdot\|$ is the standard Euclidean norm. Note that the global minimization in (1) is NP-hard and requires integer programming due to the discrete nature of $A_k$. As a remedy, an iterative scheme [14] is often employed to approximate the solution of (1), which updates $A_k$ and $C_k$ separately at each iteration with the other one held fixed. Specifically, at the $t$-th iteration, for the fixed $K$ centers $C_k^{(t-1)}$, $A_k^{(t)}$ is updated by assigning each observation $X_i$ to the closest cluster; and then for the fixed $A_k^{(t)}$, $C_k^{(t)}$ is updated as the mean of the observations in $A_k^{(t)}$.

Although k-means clustering has been reported successful in many real applications, its performance can be less effective in high-dimensional cluster analysis. [13] pointed out that when the sample size is fixed and the dimension diverges, the distances among observations tend to be deterministic. Specifically, the observations from the same cluster tend to lie symmetrically at the vertices of a regular simplex, and the distance between observations from different clusters is determined by the cluster difference relative to the data dimension. Consequently, if the cluster difference is relatively small compared with the diverging data dimension, k-means clustering based on the Euclidean distance will operate in a degenerate fashion, assigning all the observations to the same cluster. In addition, k-means clustering tends to include all the variables regardless of whether a variable contains information about the clustering structure. This is undesirable in high-dimensional cluster analysis, where the clustering structure often lies in a low-dimensional subspace and the majority of the variables are redundant in capturing the structure.
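The distance concentration described in [13] can be illustrated numerically. The following is a minimal sketch, assuming standard normal data with no cluster structure; the function name `pairwise_distance_cv` and its defaults are hypothetical and not part of the original article. It reports the coefficient of variation of all pairwise Euclidean distances, which shrinks as the dimension grows with the sample size held fixed.

```python
import numpy as np

def pairwise_distance_cv(n, p, seed=0):
    """Coefficient of variation (std / mean) of pairwise Euclidean distances
    among n standard normal points in R^p (illustrative setting only)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    sq = (X ** 2).sum(axis=1)
    # Squared distances from the Gram matrix: |x|^2 + |y|^2 - 2 x.y
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    iu = np.triu_indices(n, k=1)              # each unordered pair once
    d = np.sqrt(np.maximum(d2[iu], 0.0))      # guard against tiny negatives
    return d.std() / d.mean()

for p in (2, 20, 200, 2000):
    print(p, pairwise_distance_cv(50, p))
```

With the sample size fixed at 50, the printed ratio should decrease steadily in $p$, which is the sense in which the distances "tend to be deterministic".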

Regularized k-means clustering
This section proposes the regularized k-means clustering for high dimensional cluster analysis, which allows simultaneous clustering model fitting and variable selection.
The key idea of the regularized k-means clustering is to extend the k-means clustering in (1) by adding an adaptive group lasso penalty term on cluster centers. Specifically, the regularized k-means clustering is formulated as

$$\min_{A_k, C_k} \frac{1}{n}\sum_{k=1}^{K} \sum_{X_i \in A_k} \|X_i - C_k\|^2 + \sum_{j=1}^{p} J(C_{(j)}), \tag{2}$$

where the training data $X_1, \ldots, X_n$ are centered so that the mean of each variable is zero. In (2), the first term is equivalent to the k-means criterion, which measures the within-cluster distance from each observation to its corresponding cluster center, and the second term $J(C_{(j)})$ is a regularization term on each variable, where $C_{(j)} = (C_{1j}, \ldots, C_{Kj})^T$ and $C_{kj}$ is the $j$-th element of $C_k$. In particular, the regularization term $J(C_{(j)})$ can be the group LASSO penalty $\lambda\|C_{(j)}\|$ [27], or the adaptive group LASSO penalty $J(C_{(j)}) = \lambda_j\|C_{(j)}\|$ [22], where $\lambda$ and $\lambda_j$, $j = 1, \ldots, p$, are tuning parameters that control the balance between the clustering model fitting and sparsity. Whereas the group LASSO penalty uses the same $\lambda$ for all dimensions and may ignore the relative importance of each dimension [28], the adaptive group LASSO penalty associates each dimension with a different $\lambda_j$ so that the relative importance of each dimension can be incorporated. For illustration, we set $J(C_{(j)}) = \lambda_j\|C_{(j)}\|$ in this article and note that it can be generalized to other types of regularization terms such as the group LASSO penalty and the $L_\infty$-norm penalty [24].
To solve the optimization in (2), we adopt an iterative scheme similar to the one used for k-means clustering. That is, we update $A_k$ and $C_k$ separately at each iteration with the other one held fixed. When $C_k$ is fixed, $A_k$ is updated by assigning each observation $X_i$ to the closest cluster. When $A_k$ is fixed, the following Lemma 3.1 suggests that $C_k$ can be solved in a componentwise fashion, which can substantially facilitate the computation in high-dimensional cluster analysis.
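Lemma 3.1 follows immediately from the following equality,

$$\sum_{k=1}^{K}\sum_{X_i \in A_k}\|X_i - C_k\|^2 = \sum_{j=1}^{p}\|X_{(j)} - L C_{(j)}\|^2,$$

where $X_{(j)} = (X_{1j}, \ldots, X_{nj})^T$ and $L = (L_{ik})$ is the $n \times K$ cluster assignment matrix with $L_{ik} = 1$ if $X_i \in A_k$ and $L_{ik} = 0$ otherwise. A direct consequence of Lemma 3.1 is that when $L$ is fixed, solving (2) can be simplified to

$$\min_{C_{(j)}} \frac{1}{n}\|X_{(j)} - L C_{(j)}\|^2 + \lambda_j\|C_{(j)}\| \tag{3}$$

for each individual variable, where $\lambda_j$ is determined by $\lambda$ and $\tilde C_{(j)}$, and $\tilde C_{(1)}, \ldots, \tilde C_{(p)}$ are the estimated cluster centers from the standard k-means clustering.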
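As a worked step, the per-variable problem can be characterized through its Karush-Kuhn-Tucker conditions. The sketch below assumes that (3) has the form $\min_{c \in R^K} \frac{1}{n}\|X_{(j)} - Lc\|^2 + \lambda_j\|c\|$ with $L$ the 0/1 cluster assignment matrix; the $1/n$ scaling and the symbols $s$, $N$ and $\hat c$ are assumptions introduced here for illustration.

```latex
\begin{aligned}
&\text{Let } s = L^{T} X_{(j)} \ \text{(within-cluster sums)}, \qquad
  N = L^{T} L = \mathrm{diag}(n_1, \ldots, n_K). \\[4pt]
&\hat c = 0 \iff \Big\| \tfrac{2}{n}\, s \Big\| \le \lambda_j, \\[4pt]
&\hat c \neq 0 \implies
  \tfrac{2}{n}\,(N \hat c - s) + \lambda_j \frac{\hat c}{\|\hat c\|} = 0
  \iff
  \hat c_k = \frac{2 s_k}{2 n_k + n \lambda_j / \|\hat c\|}, \quad k = 1, \ldots, K .
\end{aligned}
```

Under these assumptions a variable is dropped from all $K$ centers at once when its vector of within-cluster sums is small relative to $\lambda_j$; otherwise every retained center coordinate is shrunk toward zero relative to the cluster mean $s_k / n_k$.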
The details of the proposed regularized k-means clustering are as follows.
Algorithm 1.
Step 1. Initialize the centers $C_1^{(0)}, \ldots, C_K^{(0)}$ by the standard k-means clustering.
Step 2. Until the termination condition is met, repeat
(a) Given $C^{(t-1)}$, find the cluster assignment matrix $L^{(t)}$.
(b) Given $L^{(t)}$, update $C^{(t)}$ by minimizing (3) for each $j$.

As computational remarks, to overcome the sensitivity to the initialization in Step 1, the standard k-means clustering is randomly started multiple times and the run with the smallest within-cluster distance is selected as the initialization. In Step 2, the iteration stops when $L^{(t)}$ no longer changes. In our limited numerical experience, the algorithm often stops within five iterations.
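The two steps above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation, and it rests on stated assumptions: the per-variable objective is taken to be $\frac{1}{n}\|X_{(j)} - LC_{(j)}\|^2 + \lambda_j\|C_{(j)}\|$, the adaptive weights are taken to be $\lambda_j = \lambda / \|\tilde C_{(j)}\|$ with $\tilde C$ the initial centers, and the function names `update_center_column` and `regularized_kmeans` are hypothetical. Because cluster sizes differ, the nonzero solution is found by a one-dimensional bisection on the norm of the center column instead of a single soft-thresholding formula.

```python
import numpy as np

def update_center_column(x_j, labels, K, lam_j, n_bisect=100):
    """Minimize (1/n)*||x_j - L c||^2 + lam_j*||c|| over c in R^K (lam_j > 0)."""
    n = x_j.shape[0]
    counts = np.bincount(labels, minlength=K).astype(float)   # cluster sizes n_k
    sums = np.bincount(labels, weights=x_j, minlength=K)      # s = L^T x_j
    # KKT zero test: the whole column is dropped when ||(2/n) s|| <= lam_j.
    if np.linalg.norm(2.0 * sums / n) <= lam_j:
        return np.zeros(K)
    # Otherwise c_k = 2 s_k r / (2 n_k r + n lam_j), where r = ||c|| solves g(r) = 0.
    means = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
    lo, hi = 0.0, np.linalg.norm(means)   # r cannot exceed the norm of the cluster means
    for _ in range(n_bisect):
        r = 0.5 * (lo + hi)
        g = np.sum((2.0 * sums / (2.0 * counts * r + n * lam_j)) ** 2) - 1.0
        if g > 0:
            lo = r
        else:
            hi = r
    r = 0.5 * (lo + hi)
    return 2.0 * sums * r / (2.0 * counts * r + n * lam_j)

def regularized_kmeans(X, init_centers, lam, max_iter=50):
    """Alternate assignments and penalized center updates; X is assumed centered."""
    n, p = X.shape
    C = np.array(init_centers, dtype=float)           # K x p, e.g. from standard k-means
    K = C.shape[0]
    # ASSUMPTION: adaptive weights lam_j = lam / ||column j of the initial centers||.
    lam_j = lam / np.maximum(np.linalg.norm(C, axis=0), 1e-12)
    labels = None
    for _ in range(max_iter):
        dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)              # step (a): closest center
        if labels is not None and np.array_equal(new_labels, labels):
            break                                     # assignments no longer change
        labels = new_labels
        cols = [update_center_column(X[:, j], labels, K, lam_j[j]) for j in range(p)]
        C = np.column_stack(cols)                     # step (b): one variable at a time
    return labels, C
```

A variable is removed exactly when its column of `C` is returned as all zeros, so the selected variables can be read off as `np.flatnonzero(np.linalg.norm(C, axis=0) > 0)`.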

Selection of tuning parameters
In the proposed regularized k-means clustering formulation, two tuning parameters, K and λ, need to be appropriately determined so that the clustering performance can be optimized. In this section, the tuning parameters are selected through a selection criterion based on clustering stability.
The key idea of clustering stability is that if we repeatedly draw samples from the same population and apply the regularized clustering algorithm, a good clustering algorithm should produce clustering assignments that are similar from one sample to another. In the proposed regularized k-means clustering, different values of K and λ define different clustering algorithms; therefore we select the values of K and λ such that the resulting clustering algorithm has the maximal clustering stability.
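Suppose $Z = \{X_1, \ldots, X_n\}$ is a random sample of size $n$ from some unknown distribution $F(x)$ with $x \in R^p$. Following [23], we define a clustering assignment $\psi(x)$ to be a mapping $R^p \to \{1, \ldots, K\}$, and the regularized k-means clustering $\Psi(\cdot; K, \lambda)$ generates a clustering assignment $\psi$ when applied to a sample $Z$. The clustering distance between any two clustering assignments $\psi_1(x)$ and $\psi_2(x)$ is defined as

$$d(\psi_1, \psi_2) = P\big(I[\psi_1(X) = \psi_1(Y)] + I[\psi_2(X) = \psi_2(Y)] = 1\big),$$

where $X$ and $Y$ are independently sampled from $F$, and $I[\cdot]$ is the indicator function. Clearly, the distance between $\psi_1$ and $\psi_2$ measures the probability of their disagreement. The clustering instability of the regularized k-means clustering $\Psi(\cdot; K, \lambda)$ is then

$$S(\Psi, K, \lambda, n) = E\big(d(\Psi(Z_1; K, \lambda), \Psi(Z_2; K, \lambda))\big),$$

where $\Psi(Z_1; K, \lambda)$ and $\Psi(Z_2; K, \lambda)$ are clustering assignments obtained by applying $\Psi(\cdot; K, \lambda)$ to two independent samples $Z_1$ and $Z_2$ respectively.

To accurately estimate $S(\Psi, K, \lambda, n)$, we propose a bootstrap resampling scheme. Consider the candidate algorithms $\{\Psi(\cdot, K, \lambda) : K = 2, \ldots, K.\mathrm{max};\ \lambda \ge 0\}$, where $K.\mathrm{max}$ specifies the largest possible number of clusters, and $K = 1$ is excluded as it assigns all observations to the same cluster and thus provides little structural information about the data. Given $n$ observations $(X_1, \ldots, X_n)$, three bootstrap samples $Z_1^{*b}$, $Z_2^{*b}$ and $Z_3^{*b}$ are drawn for each $b = 1, \ldots, B$. With $\psi_1^{*b} = \Psi(Z_1^{*b}; K, \lambda)$ and $\psi_2^{*b} = \Psi(Z_2^{*b}; K, \lambda)$, the instability is estimated on the third sample by

$$S^{*b}(\Psi, K, \lambda, n) = \frac{\big|\{(i,j) : i < j,\ I[\psi_1^{*b}(X_i^{(3)}) = \psi_1^{*b}(X_j^{(3)})] + I[\psi_2^{*b}(X_i^{(3)}) = \psi_2^{*b}(X_j^{(3)})] = 1\}\big|}{\big|\{(i,j) : i < j\}\big|},$$

where $X_i^{(3)}$ and $X_j^{(3)}$ are elements in sample $Z_3^{*b}$, and $|A|$ is the cardinality of set $A$.

Then the optimal $K$ and $\lambda$ can be estimated by the following voting scheme. For each $\lambda$, $\hat K_\lambda = \mathrm{mode}\{\hat K_\lambda^{*1}, \ldots, \hat K_\lambda^{*B}\}$, where $\hat K_\lambda^{*b} = \arg\min_{2 \le K \le K.\mathrm{max}} S^{*b}(\Psi, K, \lambda, n)$; the optimal $K$ is then estimated as $\hat K = \mathrm{mode}\{\hat K_\lambda\}$. Given the estimated $\hat K$, the optimal $\lambda$ is estimated as $\hat\lambda = \mathrm{mode}\{\hat\lambda^{*1}, \ldots, \hat\lambda^{*B}\}$, where $\hat\lambda^{*b} = \arg\min_{\lambda} S^{*b}(\Psi, \hat K, \lambda, n)$.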
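The empirical counterpart of the clustering distance is easy to compute once two assignments have been evaluated on the same set of points. The sketch below assumes that the distance is the proportion of unordered pairs on which the two assignments disagree about co-membership, which is one reading of "the probability of their disagreement"; the function name `clustering_distance` is hypothetical.

```python
import numpy as np

def clustering_distance(labels1, labels2):
    """Proportion of pairs (i < j) that one assignment puts in the same
    cluster while the other assignment separates them."""
    a = np.asarray(labels1)
    b = np.asarray(labels2)
    same1 = a[:, None] == a[None, :]      # co-membership under assignment 1
    same2 = b[:, None] == b[None, :]      # co-membership under assignment 2
    iu = np.triu_indices(len(a), k=1)     # each unordered pair once
    return float(np.mean(same1[iu] != same2[iu]))

print(clustering_distance([0, 0, 1, 1], [1, 1, 0, 0]))  # relabeling only
print(clustering_distance([0, 0, 1, 1], [0, 1, 0, 1]))  # genuinely different
```

Because only co-membership is compared, the measure does not depend on how the clusters are labeled, so no label matching between the two assignments is required.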

Consistency of regularized k-means clustering
We now present the asymptotic estimation and selection consistency of the proposed regularized k-means clustering with diverging dimension. The estimation consistency assures that the estimated cluster centers converge almost surely to the true cluster centers based on the population, and the selection consistency shows that the uninformative variables are eliminated from the estimated cluster centers with probability tending to one.
Let $X_1, \ldots, X_n$ be a random sample from an unknown distribution $P$, and denote $P_n$ as the associated empirical measure. Regarding (2) as a function of cluster centers and the empirical measure $P_n$, the regularized k-means clustering is to minimize

$$\int \min_{1 \le k \le K}\|x - C_k\|^2\, P_n(dx) + \sum_{j=1}^{p} \lambda_j\|C_{(j)}\| \tag{6}$$

over $C = (C_1, \ldots, C_K)^T$. Denote $\hat C = (\hat C_1, \ldots, \hat C_K)^T$ as the estimated cluster centers obtained by solving (6), $\bar C = (\bar C_1, \ldots, \bar C_K)^T$ as the true cluster centers which minimize

$$\int \min_{1 \le k \le K}\|x - C_k\|^2\, P(dx), \tag{7}$$

and $\hat L$, $\bar L$ as the cluster assignment matrices of $X_1, \ldots, X_n$ based on $\hat C$ and $\bar C$ respectively.
Theorem 1 shows that the regularized k-means clustering with a properly selected λ attains asymptotic estimation consistency similar to that of the standard k-means clustering in [17,18]. Note that the dimension $p$ is allowed to diverge to infinity at an order of $o(\min(n^2\lambda^2, n^{-1/2}\lambda^{-1}))$. Specifically, if $p = O(n^a)$ with $0 < a < 1/3$, setting $\lambda = O(n^{-(a+3)/4})$ satisfies the order conditions. These conditions have also been used in [9] for establishing the asymptotic consistency of high-dimensional regularized regression.
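The stated choice of rates can be checked by direct arithmetic. The display below is a sketch that treats the orders as exact rates, taking $p = n^a$ and $\lambda = n^{-(a+3)/4}$ (an assumption made only to track exponents), and reads the order condition as the pair $n^{1/2}\lambda p \to 0$ and $n^{-2}\lambda^{-2} p \to 0$.

```latex
\begin{aligned}
n^{1/2}\,\lambda\, p &= n^{1/2}\cdot n^{-(a+3)/4}\cdot n^{a} = n^{(3a-1)/4} \to 0, \\
n^{-2}\,\lambda^{-2}\, p &= n^{-2}\cdot n^{(a+3)/2}\cdot n^{a} = n^{(3a-1)/2} \to 0,
\end{aligned}
\qquad \text{both exactly when } a < 1/3 .
```

Both exponents are negative only for $a < 1/3$, which matches the stated range $0 < a < 1/3$.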
Next we establish the asymptotic selection consistency of the regularized k-means clustering, which is desirable in high-dimensional cluster analysis where many variables are redundant and contain no information about the clustering structure. Without loss of generality, we assume that only the first $p_0 < p$ variables are informative in that $\bar C_{(j)} \neq 0$ for $j \le p_0$ and $\bar C_{(j)} = 0$ for $j > p_0$. The informative variable set is denoted as $A = \{1, \ldots, p_0\}$ and the uninformative variable set is then $A^c = \{p_0 + 1, \ldots, p\}$.
Theorem 2. Under Assumptions (i)-(vii) in the Appendix, if $n^{1/2}\lambda p \to 0$ and $n^{-2}\lambda^{-2} p \to 0$ as $n \to \infty$, then $P(\hat C_{(j)} = 0) \to 1$ for any $j \in A^c$.

Theorem 2 establishes the asymptotic selection consistency in the sense that the regularized k-means clustering can eliminate the uninformative variables in the estimated cluster centers with probability tending to one. In summary, Theorems 1 and 2 demonstrate that the proposed regularized k-means clustering is capable of performing cluster analysis and variable selection at the same time.
Note that the asymptotic estimation and selection consistency is established assuming the number of clusters K is pre-specified. When the true number of clusters is available, the asymptotic results assure that the true cluster centers and the informative variables can be accurately recovered. When the true number of clusters is not known, [23] shows the selection consistency of the number of clusters in the un-penalized clustering framework. However, it remains unclear whether similar consistency results can be obtained for the regularized methods due to the difficulty of tuning K and λ simultaneously. A numerical experiment has been conducted in section 7.2 to demonstrate the superior performance of tuning K and λ via the selection criterion in section 4.

Regularized model-based clustering
The regularized clustering framework can be extended to the regularized model-based clustering with the adaptive group lasso penalty. As opposed to the $L_1$ penalty in [16], the adaptive group lasso penalty encourages the selection of variables in a factor fashion with each variable as one factor.
In general, assume each observation $X_i$, $i = 1, \ldots, n$, is drawn from a mixture model with $f(x) = \sum_{k=1}^{K} \pi_k f_k(x; \theta_k)$, where $\pi_k$ is the mixture weight and $f_k(x; \theta_k)$ can be any distribution function of the mixture component indexed by parameter $\theta_k$. For illustration, $f_k(x; \theta_k)$ is assumed to be a multivariate normal distribution with mean $C_k$ and covariance matrix $V_k$. The regularized log-likelihood function for the observed data, labeled (8), penalizes the observed-data log-likelihood with the adaptive group lasso term on the cluster centers. To facilitate the high-dimensional clustering as in [16], we further assume that a common diagonal covariance matrix is shared among the mixture components.

W. Sun et al.
Specifically, $V_k = V = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2)$ for all $k$. An EM algorithm can be employed to maximize (8), where the cluster assignment $L_{ik}$ is treated as missing data.
If $L_{ik}$ is available, the regularized log-likelihood function for the complete data is In the expectation step, the conditional expectation of (9) is denoted as where . In the maximization step, maximizing (10) yields the update of the parameters, The centers can be obtained by a direct calculation based on the Karush-Kuhn-Tucker conditions. Specifically, for any For any These two conditions imply that where $I_K$ is the $K \times K$ identity matrix and $1_n$ is the vector of all 1's. Note that $(A)_+$ above is component-wise, so $(A)_+ = (a_{ij+})$, where $a_{ij+} = \max(0, a_{ij})$. Therefore the element C .
The details of the EM algorithm are as follows.
Step 1. Initialize the centers $C_1^{(0)}, \ldots, C_K^{(0)}$ by the standard k-means clustering, and initialize $\pi$.
As in Algorithm 1, the standard k-means clustering in Step 1 is randomly started multiple times to overcome its sensitivity to the initialization. The iteration in Step 2 stops when $L^{(t)}$ no longer changes.
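The expectation step of such an EM algorithm reduces to computing posterior membership probabilities. The sketch below uses the standard Bayes-rule formula for a Gaussian mixture with a common diagonal covariance matrix, as assumed in this section; it is not taken from the article's own update formulas, and the function name `e_step` and its argument names are hypothetical. The normalizing constant of the normal density is shared by all components and cancels, so it is omitted.

```python
import numpy as np

def e_step(X, centers, sigma2, weights):
    """Posterior probability that each row of X belongs to each component.

    X: (n, p) data, centers: (K, p) component means,
    sigma2: (p,) common diagonal variances, weights: (K,) mixture weights.
    """
    diff = X[:, None, :] - centers[None, :, :]             # (n, K, p)
    log_f = -0.5 * np.sum(diff ** 2 / sigma2, axis=2)      # log density up to a shared constant
    log_w = np.log(weights)[None, :] + log_f
    log_w -= log_w.max(axis=1, keepdims=True)              # stabilize before exponentiating
    resp = np.exp(log_w)
    return resp / resp.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [4.0, 4.0], [2.0, 2.0]])
centers = np.array([[0.0, 0.0], [4.0, 4.0]])
print(e_step(X, centers, np.ones(2), np.array([0.5, 0.5])))
```

A hard assignment matrix can be recovered from the output by taking the row-wise `argmax`, which is one way to monitor whether the assignments still change between iterations.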

Simulation study
This section examines the effectiveness of the proposed regularized k-means clustering and regularized model-based clustering, and compares them against the standard k-means and the sparse k-means. As shown by [25], the sparse k-means outperforms many other popular high-dimensional clustering algorithms in a variety of numerical experiments. To assess the performance of various clustering algorithms, the clustering error is defined as the estimated distance between an estimated clustering assignment $\hat\psi$ and the true assignment $\psi$ of the sample data $X_1, \ldots, X_n$.
The simulated data consist of 80 observations $X_i \in R^p$, $i = 1, \ldots, 80$, generated as follows. First, the $Y_i$'s are uniformly sampled from $\{1, 2, 3, 4\}$ and indicate the cluster memberships. Then for each $i$, the first 50 informative variables are generated from $N(\mu(Y_i), I_{50})$, where the mean vector $\mu(Y_i)$ is specified in terms of $\mu$ and $1_{25}$, a vector of 25 ones, and the last $p - 50$ noise variables are generated from $N(0, 1)$. To examine the clustering performance in various scenarios, we set $p$ = 50, 200, 500 or 1000 and $\mu$ = 0.4, 0.6 or 0.8. Clearly, the four clusters are well separated when $\mu$ is large, and can overlap heavily when $\mu$ is small. Furthermore, when the data dimension $p$ increases, the first 50 informative variables become harder to identify as more noise variables are present.
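A data set of this shape can be generated as sketched below. The exact mean vectors $\mu(Y_i)$ are not reproduced here, so the code makes a loudly stated assumption: the four cluster means are taken to be the four sign combinations of $(\pm\mu 1_{25}, \pm\mu 1_{25})$. The function name `simulate`, the `signs` table and the seed are hypothetical.

```python
import numpy as np

def simulate(p=200, mu=0.8, n=80, seed=0):
    """Draw n observations: 50 informative variables plus p - 50 noise variables."""
    rng = np.random.default_rng(seed)
    y = rng.integers(1, 5, size=n)            # memberships uniform on {1, 2, 3, 4}
    ones = np.ones(25)
    # ASSUMED mean pattern (not confirmed by the text): the four sign combinations.
    signs = {1: (1, 1), 2: (1, -1), 3: (-1, 1), 4: (-1, -1)}
    means = np.array([
        np.concatenate([mu * signs[k][0] * ones, mu * signs[k][1] * ones])
        for k in y
    ])
    informative = means + rng.standard_normal((n, 50))   # N(mu(Y_i), I_50)
    noise = rng.standard_normal((n, p - 50))             # N(0, 1) noise variables
    return np.hstack([informative, noise]), y

X, y = simulate(p=200, mu=0.8)
print(X.shape, np.bincount(y)[1:])
```

Under this assumed pattern the clusters separate as `mu` grows, and increasing `p` only adds pure-noise columns, which mirrors the two difficulty knobs described in the text.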
Two scenarios are considered. In scenario I, we focus on the clustering performance of various clustering algorithms when the true number of clusters is given. In scenario II, with K unknown, we compare the clustering performance of various clustering algorithms after accounting for the tuning parameter selection. In both scenarios, the selection criterion in section 4 is used to select tuning parameters for the standard k-means, the regularized k-means clustering and the regularized model-based clustering, and the gap statistic [21] is used to select tuning parameters for the sparse k-means as suggested in [25].

Scenario I: K is known
In scenario I, the number of clusters is fixed as 4 in all clustering algorithms. For the sparse k-means, the regularized k-means clustering and the regularized model-based clustering, the tuning parameters are selected through a grid search over 20 grid points $\{10^{-2+4l/19};\ l = 0, \ldots, 19\}$. For fair comparison, the number of bootstrap samples is set as $B = 10$ in both the stability-based selection criterion in section 4 and the gap statistic, and all clustering algorithms are randomly started 100 times to overcome their dependence on the initialization. Following the setup by [25], each simulation is replicated 20 times, and the averaged clustering error and averaged number of selected informative variables are summarized in Tables 1 and 2.

Evidently, our proposed regularized k-means clustering and regularized model-based clustering deliver superior results against their competitors in terms of both clustering error and variable selection. In Table 1, the regularized k-means clustering yields smaller clustering error than both the standard k-means and the sparse k-means when p > 50, except that both the proposed regularized model-based clustering and sparse k-means lead to perfect clustering when µ = 0.8 and p = 200. When p = 50 with no noise variable present, the regularized model-based clustering yields the best performance for µ = 0.6 and 0.8, while the standard k-means clustering has a great advantage for µ = 0.4, whereas the performance of the sparse k-means appears to be less competitive. In Table 2, the number of variables selected by the regularized k-means clustering is much closer to the truth than that of the sparse k-means in most cases, whereas the standard k-means clustering does not perform any variable selection at all. When p = 1000, in the examples of µ = 0.6 and µ = 0.8, the regularized k-means clustering tends to include a few more variables than the sparse k-means, yet it is still reasonably close to the number of true informative variables. Furthermore, the regularized model-based clustering performs similarly to the regularized k-means clustering, but it requires substantially higher computational cost. As a consequence, the results of the regularized model-based clustering for p = 1000 are omitted in Tables 1 and 2 because of the long computational time.
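For reference, the tuning grid used in this scenario, 20 points of the form $10^{-2+4l/19}$, is a logarithmically spaced grid from 0.01 to 100. A short sketch (the variable name `grid` is hypothetical) builds it directly from the formula and compares it with NumPy's `logspace`:

```python
import numpy as np

# Grid points 10^(-2 + 4l/19) for l = 0, ..., 19, straight from the formula.
grid = np.array([10.0 ** (-2 + 4 * l / 19) for l in range(20)])

# The same grid via logspace: 20 points with exponents evenly spaced on [-2, 2].
assert np.allclose(grid, np.logspace(-2, 2, 20))

print(grid[0], grid[-1])          # end points of the grid
print(grid[1] / grid[0])          # constant ratio between neighbors, 10^(4/19)
```

Consecutive grid points differ by the constant factor $10^{4/19}$, about 1.62, so the search is uniform on the log scale.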

Scenario II: K is unknown
Now we conduct a comparison of all clustering algorithms in a more realistic scenario, where the number of clusters is unknown. For illustration, we only consider p = 200 and µ = 0.8. To select the number of clusters and tuning parameters, tuning procedures similar to those in section 7.1 are applied. The grid search is conducted over K ∈ {2, . . ., 10} and the same grid points for λ as in section 7.1. The simulation is replicated 20 times and the averaged clustering errors and averaged numbers of selected variables are summarized in Table 3.
Again the regularized k-means clustering and regularized model-based clustering deliver superior performance in both clustering and variable selection, and outperform the sparse k-means and the standard k-means. The performance of the sparse k-means deteriorates severely as the gap statistic selects the wrong number of clusters 18 out of 20 times. The difficulty of the gap statistic in selecting the number of clusters is also pointed out in [25]. In contrast, the selection criterion based on clustering stability appears to perform well in selecting the number of clusters and the tuning parameters.
To illustrate the effectiveness of the clustering stability based selection criterion, we randomly select one replication and display the estimated clustering instability and the clustering error for various values of K and λ. In Figure 1, it is clear that there is a positive association between clustering instability and clustering error across the values of K and λ.

Furthermore, we examine the behavior of the regularized k-means clustering and the tuning parameter selection criterion as the sample size grows. The simulation is conducted for the regularized k-means clustering with sample size n = 20, 40, 80. The estimated number of clusters, number of selected variables and clustering errors over 20 replications are summarized in Table 4. As the sample size increases, the true number of clusters is selected with higher probability, the noninformative variables tend not to be selected, and the clustering errors decrease, implying better estimates of the cluster centers.

Applications to gene microarray analysis
In this section, we apply the proposed regularized k-means clustering to two benchmark microarray datasets, Leukemia [11] and Lymphoma [1]. In the Leukemia data, [11] studied microarray gene expression data to discover two types of human acute leukemias: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). This dataset consists of 72 patients in total, 25 patients with AML and 47 patients with ALL. The gene expression levels were measured by Affymetrix microarrays containing 6817 human genes. Distinguishing ALL from AML is clinically significant for successful treatment because the chemotherapy regimens for ALL patients differ from those for AML patients, so that using ALL therapy for AML cases (and vice versa) may result in distinctly reduced cure rates and possible toxicities. In the Lymphoma data set, the total sample size is 62 and the number of genes is 4026. Three of the most prevalent adult lymphoid malignancies were studied: 42 cases of diffuse large B-cell lymphoma (DLBCL), 9 samples of follicular lymphoma (FL), and 11 observations of B-cell chronic lymphocytic leukemia (CLL). A specialized cDNA microarray was used to measure the gene expression levels. Both data sets are provided by [6] and available at http://stat.ethz.ch/dettling/bagboost.html. Following the pre-processing steps in [8], both data sets are pre-processed by first setting a thresholding window [100, 16000] and then excluding genes with max/min ≤ 5 or (max − min) ≤ 500. Finally a logarithmic transformation and standardization are applied. For the original Lymphoma data set, some arrays contain genes with missing values. As suggested in [8], a simple 5-nearest-neighbor algorithm is employed to impute the missing values.
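The pre-processing steps can be sketched as follows for an expression matrix with samples in rows and genes in columns. Two details are not specified in the text and are assumptions here: the logarithm is taken in base 10, and the standardization is applied per gene to zero mean and unit variance. The function name `preprocess` and its parameter names are hypothetical, and the nearest-neighbor imputation of missing values is not included.

```python
import numpy as np

def preprocess(expr, low=100.0, high=16000.0, ratio=5.0, spread=500.0):
    """Threshold, filter genes, log-transform and standardize.

    expr: (samples, genes) array of raw expression levels without missing values.
    Returns the processed matrix and a boolean mask of the retained genes.
    """
    X = np.clip(expr, low, high)                          # thresholding window [100, 16000]
    gmax, gmin = X.max(axis=0), X.min(axis=0)
    # Exclude genes with max/min <= 5 or (max - min) <= 500.
    keep = (gmax / gmin > ratio) & (gmax - gmin > spread)
    X = np.log10(X[:, keep])                              # ASSUMPTION: base-10 logarithm
    X = (X - X.mean(axis=0)) / X.std(axis=0)              # ASSUMPTION: per-gene standardization
    return X, keep

raw = np.array([[50.0, 200.0, 100.0, 100.0],
                [1000.0, 300.0, 550.0, 520.0],
                [20000.0, 400.0, 620.0, 590.0]])
Z, keep = preprocess(raw)
print(keep, Z.shape)
```

In the toy matrix the second gene fails the ratio filter and the fourth fails the spread filter, so two of the four genes are retained.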
All the clustering algorithms are randomly started 100 times to overcome their dependence on the initialization. To optimally tune the algorithms, a grid search over K and tuning parameter λ as in section 7.2 is conducted to optimize the clustering instability or gap statistic. Since there is no true clustering assignment in either gene microarray data set, we compare the estimated clustering assignments to the available cancer types of each tumor. The comparison results are summarized in Table 5.
In the Leukemia data, the regularized k-means clustering correctly selects 2 clusters and makes only 2 misclassifications out of 72 samples. In the Lymphoma data, the regularized k-means clustering correctly selects 3 clusters and yields the smallest clustering error of only 1/62. Clearly, the regularized k-means clustering achieves competitive clustering performance with far fewer selected genes compared with the sparse k-means and the standard k-means clustering algorithms. Furthermore, in the Leukemia data the number of important genes selected by the regularized k-means clustering agrees with the observations in [11,5].
To scrutinize the performance of the regularized k-means clustering in the gene microarray examples, we plot the heatmap of the Lymphoma data based on the 66 selected genes in Figure 2. The three clusters are distinct on the heatmap in that genes 1, . . ., 22 have significant signals in detecting FL and CLL, genes 23, . . ., 27 have significant signals in discriminating FL, and genes 28, . . ., 66 have significant signals in distinguishing DLBCL.

Appendix
(vi) Matrix $\Gamma$ defined in [18] is positive definite at $\bar C$; (vii) $\arg\min_{1 \le k \le K}\|X - \bar C_k\|^2$ is unique with probability one.
Here Assumption (i) is a standard assumption for Euclidean distance based cluster analysis. Assumptions (ii)-(vi) are analogous to the assumptions in [17,18], where $p$ is allowed to diverge as $n \to \infty$. Assumption (vii) is necessary to prevent ambiguity in estimating the cluster assignment matrix.
Proof of Theorem 1. First we show the estimated cluster centers $\hat C_1, \ldots, \hat C_K$ lie in a compact region of $R^p$ when $n$ is large enough. It suffices to show there exists a sufficiently large closed ball $B(M)$ centered at the origin and of radius $M$, which contains all the estimated cluster centers when $n$ is sufficiently large. Note that minimization of (6) is equivalent to the minimization of min where $s_n \to \infty$ as $n \to \infty$. As proved in [17], under Assumptions (ii) and (iii), there is an $M_1$ so large that, when $n$ is large enough, the estimated cluster centers $\{\tilde C_1, \ldots, \tilde C_K\}$ based on the standard k-means are contained in $B(M_1)$. By the fact that $s_n \to \infty$, there exists a sufficiently large $s_N$ such that the set {C : Note that $\hat C \to \bar C$ in Theorem 1 together with Assumption (vii) implies that the estimated cluster assignment matrix $\hat L$ converges in probability to the true cluster assignment matrix $\bar L$. Therefore, in the last equality, the first four terms are of the order $o_p(1)$, and the fifth term is of the order $O_p(1)$ due to Assumption (i). It follows from the fact that $\frac{2}{n} L^T L$ is a nonnegative matrix and the component-wise central limit theorem of standard k-means that Note that $n\lambda \to \infty$ according to the assumption $n^{-2}\lambda^{-2} p \to 0$. So the last term diverges to infinity and dominates the first five terms, which contradicts the above KKT condition. Therefore, $\hat C_{(p)}$ must be equal to 0 with probability tending to one. This completes the proof.

Fig 1. The plots of clustering instability and clustering error as functions of number of clusters K and tuning parameter λ respectively.

Fig 2. Heatmap of the Lymphoma data set based on 66 genes selected by the regularized k-means clustering. Each row represents one of the 62 sample tumors and each column represents one of the 66 selected genes.

Discussion
This article proposes the regularized k-means clustering, which is able to simultaneously cluster high-dimensional observations and select informative variables. To optimally balance the trade-off between model fitting and model sparsity, a tuning parameter selection criterion based on clustering stability is developed. The proposed methods deliver superior performance in both cluster analysis and variable selection, and outperform their competitors in simulated and real experiments. A possible future direction is to extend the framework of the regularized k-means clustering to other clustering algorithms, like fuzzy c-means, which relaxes the constraints of the discrete and nonnegative clustering assignment of k-means.

Table 1
The averaged clustering errors and their estimated standard deviations for various clustering algorithms in section 7.1

Table 2
The averaged numbers of selected variables and their estimated standard deviations for various clustering algorithms in section 7.1

Table 3
The selected numbers of clusters, averaged numbers of selected variables, averaged clustering errors and their estimated standard deviations in section 7.2

Table 4
The estimated numbers of clusters, numbers of selected variables, and clustering errors with various sample sizes in section 7.2

Table 5
The selected numbers of clusters and informative genes and clustering errors in two gene microarray examples