Penalized model-based clustering with unconstrained covariance matrices

: Clustering is one of the most useful tools for high-dimensional analysis, e.g., for microarray data. It becomes challenging in presence of a large number of noise variables, which may mask underlying clustering structures. Therefore, noise removal through variable selection is necessary. One eﬀective way is regularization for simultaneous parameter estimation and variable selection in model-based clustering. However, existing methods focus on regularizing the mean parameters representing centers of clusters, ignoring dependencies among variables within clusters, leading to incorrect orientations or shapes of the resulting clusters. In this article, we propose a regularized Gaussian mixture model permitting a treatment of general co-variance matrices, taking various dependencies into account. At the same time, this approach shrinks the means and covariance matrices, achieving better clustering and variable selection. To overcome one technical challenge in estimating possibly large covariance matrices, we derive an E-M algorithm utilizing the graphical lasso (Friedman et al 2007) for parameter estimation. Numerical examples, including applications to microarray gene expression data, demonstrate the utility of the proposed method.


Introduction
As an important tool of data analysis, clustering has emerged as indispensable to analyzing high-dimensional genomic data.For example, in gene function discovery, various methods of clustering have been applied to group genes with their expressions across multiple conditions (Eisen et al 1998 [(9)]; Tavazoie et al 1999 [(39)]); in disease subtype discovery, clustering is used to cluster patients' tissue samples with their genomic expressions (Golub et al 1999 [(1)]).In this process, because of unknown identities of many relevant genes and/or experimental conditions it is necessary to select informative genes or conditions to yield meaningful clusters.Such a task of variable selection is critical not only to clustering but also to other modeling strategies such as classification.For classification, Alaiya et al (2002) [( 1)] studied borderline ovarian tumor classification, where classification using all 1584 protein spots is unsatisfactory but that focusing on a subset of around 200 selected spots provided more accurate results.For clustering, there have been only a limited number of studies, in contrast to a large body of literature on variable selection for classification and regression.As pointed out in Pan and Shen (2007) [( 30)], clustering imposes many unique challenges to variable selection in that some well known model selection procedures, e.g.best subset selection with BIC (Schwarz 1978 [(36)]), may not be applicable to clustering, which is unlike in classification and regression.One main reason is that in general there are many true models in clustering, most of which are not useful.For example, any true noise variable may suggest the existence of only one cluster; however, this discovery, albeit true, is useless because, in clustering one would like to uncover the underlying heterogeneity and structures in the data, such as identifying informative variables that suggest the existence of two or more clusters.Among many clustering approaches, model-based clustering has become increasingly important due to its interpretability.In addition to good empirical performance relative to its competitors (Thalamuthu et al, 2008 [(40)]), modelbased clustering has a solid probabilistic framework of mixture models, which facilitates model building and checking, such as selecting the number of clusters (McLachlan, 1987 [(25)]; Fraley and Raftery, 1998 [(12)]).Although Normal mixture models have been extensively studied by both frequentist and Bayesian approaches (Banfield and Raftery, 1993 [(3)]; Muller et al, 1996 [(28)], and references therein), our focus here is on high-dimensional data, for which variable selection is necessary (e.g.Pan and Shen, 2007 [(30)]).There are basically two categories of approaches to variable selection for high-dimensional model-based clustering: the Bayesian approaches (Liu et al, 2003 [(24)]; Teh et al, 2004 [(41)]; Hoff, 2006 [(15)]; Tadesse et al, 2005 [(38)]; Kim et al, 2006 [(19)]; Raftery and Dean, 2006 [(32)]) and penalized likelihood approaches (Pan and Shen, 2007 [(30)]; Xie et al, 2008a [(49)]; Xie et al, 2008b [(50)]; Wang and Zhu, 2008 [(47)]; Guo et al 2009 [(14)]).In general, the Bayesian approaches are more flexible by allowing more general covariance matrices, but computationally are more demanding due to the use of MCMC for stochastic searches.For the penalized likelihood approaches, one common assumption is that each cluster has a diagonal covariance matrix, implying the same orientation for all clusters (Banfield and Raftery, 1993 [(3)]).As to be shown later, this is too stringent and can severely degrade performance in practice.Conceptually a general or unconstrained covariance matrix should be allowed for each cluster; however the challenge is how to treat means and general covariances subject to the constraint that any resulting covariance matrix estimate is positive definite.This challenge is evident in the recent literature on Gaussian graphical modeling that estimates a large covariance matrix based on a Normal sample (Huang et al, 2006 [(16)]; Yuan and Lin, 2007 [(52)]; Levina et al, 2008 [(21)]; Rothman et al, 2009 [(34)]; Fan et al, 2009 [(10)] and references therein).This problem continues to be even more challenging for mixture models, because it is unknown which observations are from which Normal components.
In this article, we propose a general penalized likelihood approach that permits unconstrained covariance matrices in a Normal mixture model.A major innovation here is the recognition of the connection between fitting Normal mixture models and Gaussian graphical modeling.Our approach utilizes the recent development in Gaussian graphical modeling by effectively embedding an existing Gaussian graphical modeling method into the E-M algorithm for Normal mixture models.In particular, we implement our method using the graphical lasso method (Friedman et al, 2007 [(11)]) for covariance estimation.Moreover, we generalize the proposed method to semisupervised learning, permitting partially labeled observations.The rest of this article is organized as follows.Section 2 reviews the penalized model-based clustering method with diagonal covariance matrices, followed by a description of our proposed method that allows for a common or clusterspecific general covariance matrices.A brief discussion of an extension to semisupervised learning is given to permit known cluster memberships for a subset of observations.Section 3 presents simulation results, and an application to real microarray data is contained in section 4, to demonstrate the feasibility and effectiveness of the method compared against its counterpart with diagonal covariance matrices.Section 5 concludes with a summary and a discussion of future work.

Mixture model and its penalized likelihood
Denote by X = {x 1 , ..., x n } a random sample of n K-dimensional observations.Assume that the n observations are standardized with sample mean 0 and sample variance 1 for each variable.Assume that the observations are independent and from a mixture distribution with probability density function (pdf) where f i is the pdf for component or cluster i with unknown parameter vector θ i , and π i is the prior probability for component i with For parameter estimation, we adopt the maximum penalized likelihood estimator (MPLE) that maximizes the penalized log-likelihood where Θ represents all unknown parameters and p λ (Θ) is a penalty function for Θ with a regularization parameter (vector) λ.In what follows, we assume a Gaussian mixture model with f i being a multivariate Normal density, and regularize their mean vectors and possibly covariance matrices.
To obtain the MPLE, we employ the E-M algorithm (Dempster et al, 1977 [(7)]).Let z ij denote the indicator of whether x j belongs to component i; namely, z ij = 1 if x j comes from component i, and z ij = 0 otherwise.Here z ij 's are regarded as missing data simply because they are not observed.If z ij 's were observed, the complete data penalized log-likelihood becomes Given a current estimate Θ (r) , the E-step of the E-M calculates where τ ij is the posterior probability of x j 's belonging to component i.The M-step maximizes Q P to update the estimate of Θ.

Penalized clustering with diagonal covariance matrices
For comparison, we briefly review the method of Pan and Shen (2007) [( 30)], which specifies the components f i as multivariate Normal with a common diagonal covariance matrix They proposed a penalty function p λ (Θ) as an L 1 -norm of the mean parameters where µ ik is the mean of kth variable for component i.Note that the observations are standardized to have zero mean and unit variance for each variable k.If µ 1k = ... = µ gk = 0, then variable k cannot differentiate the components, hence deemed as noninformative (i.e. a noise variable) and automatically excluded from clustering.Variable selection is realized when small estimates of µ ik 's can be shrunken to be exactly 0 by the use of the L 1 penalty (Tibshirani, 1996 [(42)]).
For convenience, a generic notation Θ (r) is used to represent the parameter estimate at iteration r.The E-M updating formulas for maximizing the penalized likelihood (2) are as follows: at iteration r + 1, the posterior probability of x j belonging to component i is the prior probability of an observation from the i th component π(r+1 the variance of variable k σ2,(r+1) and the mean of variable k in cluster i for i = 1, 2, • • • , g and k = 1, 2, • • • , K. For sufficiently large λ 1 , we have μik = 0; if μ1k = μ2k = ... = μgk = 0 for variable k, variable k is a noise variable and does not contribute to clustering as can be seen from equation (6).
If the variance parameters are not regularized, it is straightforward to extend ( 6)-( 9) to the case with cluster-specific diagonal covariance matrices V i = diag(σ 2 i,1 , ..., σ 2 i,1 ): all the updating formulas remain to be the same except replacing σ 2 k by σ 2 i,k : σ2,(r+1) Note that the treatment here differs from Xie et al (2008b) [( 50)], in which the variance parameters are regularized.Throughout this article, we assume that an informative variable is defined to have cluster-specific means, no matter whether it has a common or cluster-specific variances.

Penalized clustering with a common unconstrained covariance
We now consider general or unconstrained covariance matrix V that further relax the diagonal covariance matrix assumption.Denote W = V −1 the inverse covariance matrix (or precision matrix) with elements W kl .
To realize variable selection, we require that a noise variable has a common mean across clusters.Since the data have been standardized to have mean 0 for each variable, a common mean implies µ 1k = ... = µ gk = 0.As in Bayesian approaches (Tadesse et al 2005 [(38)]), one can assume that any noise variable is uncorrelated with any informative variable, though this assumption is not necessary in our approach (because this assumption does not influence our estimation procedure).To facilitate estimating large and sparse covariance matrices, we propose the following penalty function: Note that, the penalty on the mean parameter is mainly for variable selection, while that for the precision matrix is necessary for high-dimensional data.Since the data dimension K is larger than the sample size n, the sample covariance matrix (or the maximum likelihood estimate under the Normality) is necessarily singular.In addition, as discussed in the literature of Gaussian graphical modeling, penalization on a large covariance (or precison) matrix can yield better estimates than non-penalized ones.Although various penalties have been proposed for a covariance (or precision) matrix, some do not yield a positive-definite covariance estimate.For the problem considered here, since we need to calculate the log-likelihood, and thus the determinant of a covariance estimate, the positive-definiteness of a covariance estimate is needed, which imposes a major technical difficulty.In Gaussian graphical modeling, one aims to estimate the covariance or precision matrix of a Normal distribution; since all the observations are known to be iid from the same Normal distribution, the problem is easier than that for mixture models, where we need to cluster the observations into various unknown groups (each corresponding to a Normal distribution) and estimate the covariance matrix (and other parameters) for each group simultaneously.A major contribution of our work is recognition of the connection between Gaussian mixture modeling and Gaussian graphical modeling: in spite of the unknown cluster-or group-memberships of the observations in a Gaussian mixture model, the estimation of a covariance (or precision) matrix for a mixture component can be formulated to be similar to that for Gaussian graphical modeling.

Estimation of the non-covariance parameters
For the EM algorithm, the E-step yields Q P as given in ( 4) and the M-step maximizes Q P with respect to the unknown parameters, resulting in the same updating formulas for τ ij and π i as given in ( 6) and ( 7).The updating formula for µ ij is derived from the following theorem.
Theorem 1.The sufficient and necessary conditions for μik to be a (global) maximizer of Q P (for a fixed i and k) are and where Hence, we have the below updating formula for the mean parameter: Simple algebra indicates that the updating formulas ( 14)-( 15) for µ ik reduces to (9) when the covariance matrix is diagonal.The coordinate-wise updating for µ as above converges to the global maximum in view of the results of Tseng (1988) [( 44)] and Tseng (2001) [( 45)], because the first term of Q P , the conditional expectation of the complete data log-likelihood, is concave, while the second term, the L 1 penalty on the mean parameters, is separable (and concave) in µ ik 's.
It remains to derive an updating formula for the covariance matrix in the E-M algorithm.

Estimation of the inverse covariance matrix
To derive the estimate of the covariance matrix V , we focus on the M-step of ( 4) with respect to the V .Replacing V with W −1 , we only need to find the updating formula for W .To maximize Q p with respect to W , it suffices to maximize is the empirical covariance matrix.For (16), we shall use the graphical lasso algorithm of Friedman et al (2007) [( 11)], to maximize an objective function over all non-negative definite matrices W for a known covariance matrix S. Hence, we can apply the algorithm to maximize (16) with λ = 2λ 2 /n and S = S. Their algorithm is implemented in R package glasso.

Penalized clustering with cluster-specific covariance matrices
To permit varying cluster volumes and orientations, we now consider componentspecific unconstrained covariance matrices V i , i = 1, ..., g.We employ a slightly modified penalty: In the E-M algorithm, the updating formulas for π and τ remain the same as in ( 6) and (7), respectively; for µ, we only need to replace W in ( 14) and ( 15) with W i .Thus, we only need to consider the covariance matrix estimation.Note where C stands for a constant term unrelated to W i .To maximize Q p , we only need to maximize .
Hence, we can separately maximize each of these g terms using the graphical lasso to obtain an updated estimate of W i .

Model selection
We propose using the predictive log-likelihood based on an independent tuning dataset or cross-validation (CV) as our model selection criterion.Through this criterion, we use a grid search to estimate the optimal (g, λ 1 , λ 2 ) as the one with the maximum predictive log-likelihood.
For any given (g, λ 1 , λ 2 ), because of possibly many local maxima for the mixture model, we run the EM algorithm multiple times with random starts.For our numerical examples, we started with the K-means clustering, and used its result as initial parameter estimates for the E-M algorithm.From the multiple runs, we selected the one giving the maximal penalized log-likelihood as the final result for the given (g, λ 1 , λ 2 ).

Extension: semi-supervised learning
We further extend the proposed method to mixture model-based semi-supervised learning, in which some observations have known cluster labels while the others do not (McLachlan and Peel, 2002 [(27)]; Liang et al, 2007 [(22)]).Pan et al (2006) [(31)] developed the penalized mixture approach with diagonal covariance matrices; here we push it to the case with unconstrained covariance matrices.
Without loss of generality, assume that we have partially labeled K-dimensional data x 1 , ..., x n , in which the first n 0 observations do not have class labels while the remaining n 1 do have.Furthermore, assume the existence of g 0 classes among the first n 0 observations and g 1 known classes among the n 1 labeled observations.The log-likelihood is The penalized log-likelihood can be then constructed with a suitable penalty function p λ (Θ).It is noted that, in the E-M algorithm, because z ij 's for the last n 1 observations are indeed observed, their corresponding "posterior probabilities" are known as τ ij = z ij , while those for the first n 0 observations are the same as (6).With the new updating formula for τ , the updating formulas for µ, π and covariance parameters in the E-M algorithm are the same as before.
Model selection can be performed as before.First, we use an independent tuning dataset or CV, including both labeled and unlabeled observations, to select the number of clusters and penalty parameters.Then we fit the selected model to the whole data set.Note that, after obtaining all parameter estimates, we calculate the posterior probabilities τ for each observation, including the n 1 observations with known class labels, and use the posterior probabilities for class assignment.

Small K: Why use non-diagonal covariance matrices
To better visualize the advantage of allowing unconstrained covariance matrices, we applied the methods to two-dimensional data.We considered two simple setups: the first with only one true cluster while the second with two clusters.The number of observations in each set-up was 200; for set-up 2, 100 observations were generated from each cluster.We used an independent tuning dataset of an equal size as that of a training dataset to choose the number of clusters and the penalty parameters.
Figure 1 displays the true clusters, estimated clusters with cluster-specific unconstrained and diagonal covariance matrices respectively.Throughout this article, to reflect the sampling variability, for the true clusters, the parameter values were estimated based on the true cluster memberships of the observations.The correctly classified observations were represented by open circles or diamonds, while incorrect ones by filled ones.For set-up 1 (Figure 1), although there was only one cluster, due to the use of the diagonal covariance matrices, to account for the fact that the orientation of the true cluster was not parallel to either axis, two clusters with their orientation parallel to either axis were needed, leading to selecting an incorrect number of clusters.For set-up 2, even though a correct number of clusters was identified with the use of diagonal covariances, again due to the restriction on the orientation of the clusters imposed

Diagonal cov
Cluster Cluster by the diagonal covariances, there were a large number of mis-classified observations.In contrast, the new method with unconstrained covariance matrices yielded much better results.

Simulated data with diagonal covariance matrices
We next considered three set-ups with K > n and diagonal covariance matrices.
The first was a null case with g = 1 cluster; the other two were with two clusters: one with difference only in the mean parameters, and the other in both the mean and variance parameters.For each set-up, 100 simulated datasets were generated.Each dataset contained n = 80 observations with dimension K = 100.For set-up 1, all observations belonged to the same cluster; for set-ups 2 and 3, the first 40 observations formed the first cluster while the remaining 40 observations belonged to the second cluster.Specifically, for set-up 1, all variables came from a standard normal distribution N (0, 1); for set-up 2 and 3, 90 variables were noises generated from N (0, 1), and the remaining 10 variables were informative.For set-up 2, the informative variables for the first cluster came from N (0, 1) and from N (1.5, 1) for the second cluster.For set-up 3, the informative variables were from N (0, 1) and N (1.5, 2) for the two clusters respectively.We applied three methods: a common unconstrained covariance, clusterspecific unconstrained covariance matrices, and diagonal covariances.For the one with diagonal covariances, we used either a common diagonal covariance or cluster-specific diagonal covariances according to the truth to ideally optimize its performance.For each simulated dataset, we fitted a series of models with the number of components g = 1, 2 and 3, and performed a grid search to choose the penalty parameters.We show the frequencies of selecting various numbers of clusters, the average number of informative variables incorrectly selected to be noninformative (z 1 ), the average number of noninformative variables correctly selected (z 2 ), the number of observations correctly classified to cluster i (C i ), and the number of observations mis-classified from cluster i (IC i ).We also report the Rand index (RI) or adjusted Rand index (aRI) to summarize the quality of clustering.
As shown in Table 1, for the null case, we could select the correct number of clusters correctly using the proposed method with a common covariance matrix, while the proposed method with cluster-specific covariance matrices did not perform so well.For set-up 2 with the true model with a common covariance matrix, the proposed method with a common unconstrained covariance matrix correctly selected g = 2 most often, and had a comparable performance to the method with a diagonal covariance in terms of the sample assignments, as summarized by the Rand or adjusted Rand index (Table 2).For set-up 3 with the true model with a cluster-dependent diagonal covariance matrices, the proposed method with cluster-dependent covariances correctly selected g = 2 nearly as often as using cluster-dependent diagonal covariance matrices.In terms of sample assignments, the proposed method also performed comparably.For these three set-ups, a close examination indicated that for the proposed method the highest (predictive) log-likelihood was achieved when we had a sufficiently large penalty on the off-diagonal elements of the inverse covariance matrices, leading to the estimated covariance matrices close to being diagonal as were the truth.

Simulated data with non-diagonal covariance matrices
We considered some true models with non-diagonal covariance matrices, while other aspects of simulation remained the same as in section 3.2; in particular, among 100 variables, ten of which were informative.We used two non-diagonal covariance matrices for the 10 informative variables: a compound symmetry (CS) and an AR-1 with ρ = 0.6; the noise variables were independent of each other and of the informative variables.The resulting two covariance matrices are denoted as V 0,1 and V 0,2 .
The following four set-ups were considered: set-up 1 was for a null case with only one cluster; for set-ups 2 and 3, we had two clusters sharing the same Table 1 Simulated data with diagonal covariances: frequencies of the selected numbers (g) of clusters, and mean numbers of predicted noise variables among the true informative (z 1 ) and noise variables (z 2 ).For set-up 1, the truth was g = 1, z 1 = 10 and z 2 = 90; for other set-ups, g = 2, z 1 = 0 and z 2 = 90.covariance matrix V 0,1 or V 0,2 , but with mean parameters differing by 1.5 in each informative variable; for set-up 4, the two clusters differed in both the mean vectors (by 1.5) and covariance matrices as V 0,1 and V 0,2 respectively.As before, a training dataset contained 80 observations, 40 of which came from the first cluster (if there were two clusters); we used an independent tuning dataset of size 80.We applied the three methods; again according to the truth, we used a common diagonal covariance matrix for set-ups 1-3, but cluster-specific diagonal covariance matrices for set-up 4.
The frequencies of the selected numbers of clusters based on 100 simulated datasets are shown in Table 3.For each case, the proposed methods performed better than the diagonal matrix method; between the two proposed methods, depending on the truth (i.e.whether there was a common covariance matrix), one of them performed better than the other.The same conclusion can be drawn on the performance of the methods for sample classification (Table 4).

Data and a clustering analysis
We first applied the methods to a well-known leukemia gene expression dataset of Golub et al (1999) [( 13)] to compare their performance.The (training) data contained 38 patient samples, among which 11 were acute myeloid leukemia (AML) while the remaining were acute lymphoblastic leukemia (ALL) samples.The ALL samples consisted of two subtypes: 8 T-cell and 19 B-cell samples.For each sample, the expression levels of 7129 genes were measured by an Affymetrix microarray.As in Dudoit et al (2002) [( 8)], we pre-processed the data in the following steps: 1) truncation: any expression level x jk was truncated below at 1 if x jk < 1, and above at 16,000 if x jk > 16, 000; 2) filtering: any gene was excluded if its max/min ≤ 5 and max − min ≤ 500, where max and min were the maximum and minimum expression levels of the gene across all the samples.Finally, as preliminary gene screening, we selected the top 300 genes with the Table 2. Simulated data with diagonal covariance matrices: sample assignments for ĝ = 2 and (adjusted) Rand index (RI/aRI) values.
# #C i represents the average number of the samples correctly assigned to cluster i, i = 1, 2; # #I − C i represents the average number of the samples incorrectly assigned to cluster i that # arise from the other cluster, i = 1, 2.

Sample assignments, ĝ = 2
Rand Index Cluster 1 Table 3 Simulated data with non-diagonal covariance matrices: frequencies of the selected numbers (g) of clusters, and mean numbers of predicted noise variables among the true informative (z 1 ) and noise variables (z 2 ).For set-up 1, the truth was g = 1, z 1 = 10 and z 2 = 90; for others, g = 2, z 1 = 0 and z 2 = 90.largest sample variances across the 38 samples.Because there were only a small number of samples, we took stratified sampling by cell types in 3-fold cross validation (CV).We fitted the models with g = 1, 2 ,3 and 4. The first local maximum of the predictive log-likelihood was achieved at g = 3.
The clustering results for g = 3 are shown in Table 5.Although all the 300 genes were selected as informative, there was evidence for a large number of genes with differential expression between the leukemia subtypes (e.g.Pan and Shen 2007 [(30)]).It is clear that the new method with cluster-specific unconstrained covariance matrices gave fewer errors in sample classification.To see why, we examined a few genes in more details.Genes CST3 (cystatin c, M23197) and ZYX (zyxin, X95735) were in the top 50 genes ranked by Golub et at (1999) [( 13)], and were two of the 17 genes selected by Antonov et al (2004) [(? )] to distinguish the AML and ALL subtypes.CST3 was also regarded as a suitable marker by Bardi et al (2004) 46)]further identified ZYX as the most significant gene to discriminate AML/ALL subtypes.These two genes were also identified among the top 20 genes used in the classifier by Liao et al (2007) [(23)] .In addition, we included two genes with gene accession number HG613-HT613 and M38591.The expression levels of gene pairs (HG613-HT613, M23197), and (X95735,M38591) are shown in Figure 2, with the true and estimated clusters.Clearly the true clusters did not necessarily have orientations parallel with either axis, which was captured by the proposed method, leading to more accurate sample classifications with more homogeneous clusters.We also examined element-wise differences between the covariance matrices resulting from the true clusters and estimated clusters: the unconstrained covariance matrices were much closer to the true covariance matrices than the diagonal covariance matrices were (results not shown).# #C i represents the average number of the samples correctly assigned to cluster i, i = 1, 2; # #I − C i represents the average number of the samples incorrectly assigned to cluster i that arise from the other cluster, i = 1, 2.

Sample assignments, ĝ = 2
Rand Index Cluster 1   19)] proposed a mixture of multivariate normal distributions via Dirichlet process (DP) mixtures for model-based clustering with the capability of variable selection for high-dimensional data.In particular, their method uses non-diagonal covariance matrices.Among others, a concentration parameter α for the DP prior and some data-dependent priors have to be specified; note that, as pointed out by other authors (Richardson and Green, 1997 [(35)]; Wasserman, 2000 [( 48)]), it is not possible to obtain proper posterior distributions with fully noninformative priors in mixture models.Kim et al (2006) [( 19)] applied their method to the leukemia gene expression data of Golub et al (1999) [( 13)].The data preprocessing was the same as before except that, rather than using the top 300 genes with the largest sample variances, all the 3571 genes were used. ).
On the other hand, with α = 1, the sampler visited models with 3 to 6 components, and the clustering results were ).
In each case, about 120 genes were selected to be informative.We applied our proposed method with cluster-specific unconstrained covariance matrices.For the number of cluster g = 1 to 7, the predictive log-likelihood values based on a 3-fold CV were −28160, −24836, −23117, −23551, −23896, −24432 and −25218, respectively.Hence, we would select g = 3.For comparison, we showed the clustering results for both g = 3 and g = 7 in Table 6.
Comparing our results with that of Kim et al (2006) [( 19)], we see some significant differences.First, our method could distinguish the two subtypes of ALL, while that of Kim et al (2006) [(19)] failed.Second, in contrast to that of Kim et al (2006) [(19)], our results did not show the heterogeneous subgroups of AML even when we forced to have 7 clusters.Third, our method selected 3 clusters, corresponding to the three subtypes of the leukemia, while the Bayesian method chose a larger number of clusters.Finally, the method of Kim et al (2006) [( 19)] selected about 120 informative genes, in contrast to 1372 genes selected by our method for g = 3.We also note that the results of the Bayesian method depended on the choices of the prior and sample allocation method.
Currently we implemented our method in R, in which the glasso function is called to estimate a covariance matrix.For analysis of Golub's data as considered here, it took about two days to complete running our program on a laptop.17)] identified 37 genes that would discriminate the LVEC and MVEC samples, and we applied the proposed method using these 37 genes to cluster the 80 samples.Here, we used four-fold CV for tuning.We fitted the models with g = 1, 2, 3 and g = 4; the first local maximum of the predictive log-likelihood was reached at g = 3.At g = 3, 30 genes out of 37 were selected as informative.

BOEC gene expression data
The clustering results for three clusters are shown in Table 7.It is confirmed that each type of the samples clustered closely together, as observed by Jiang (2007) [(17)].Nevertheless, it is noted that using unconstrained covariance matrices led to more accurate (i.e. more homogeneous) clustering than using diagonal covariance matrices.For illustration, two pairs of genes were chosen to have their expression levels and the corresponding true and estimated clusters plotted in Figure 3. Again, based on the true clusters, it is evident that non-diagonal covariance structures were captured better by our proposed method, leading to better (i.e. more homogeneous) clustering results; a direct examination of the estimated covariance matrices also confirmed this point (results not shown).For the BOEC data, treating the LVEC and MVEC samples with known class labels (i.e.g 1 = 2), we obtained the following results using the 37 genes as used before.First, if we did not allow the BOEC samples to form their own class with g 0 = 0, the BOEC samples (largely) fell to the class of MVEC, as shown before (Pan et al 2006).Second, if we allowed the possibility of the BOEC samples to form their own class with g 0 = 1, they indeed formed a separate class.As shown in Table 8, in either case, the proposed method with unconstrained covariance matrices performed better than using diagonal covariance matrices with fewer mixed sample assignments.The proposed method selected 26 and 29 informative genes for the two cases respectively.

Discussion
We have proposed a new approach to penalized model-based clustering with unconstrained covariance matrices.We have shown its better empirical performance than that using diagonal covariance matrices when some informative variables are correlated, which is often the case for high-dimensional genomic data, as supported by co-expressions of functionally-related genes for microarray gene expression data.One key technical challenge is in estimating a possibly large covariance matrix.By taking advantage of the recent development in Gaussian graphical models, we have implemented our approach with the use of the graphical lasso algorithm (Friedman et al 2007 [(11)]), largely due to its fast speed.Nevertheless, in principle, other covariance estimation methods, either frequentist (Huang et 5)], and references therein), could be used, though computational speed is an important factor.Alternatively, some nondiagonal structural assumptions may be imposed on covariance matrices.Xie et al (2009) [( 51)] proposed such an approach based on the mixture of factor analyzers: some latent variables are used to model the covariance structure in each cluster, which however is computationally demanding if the number of the latent variables needs to be chosen data-adaptively.We comment on that, although we have focused on its application in variable selection, penalized model-based clustering may be useful in its own right, such as in providing better parameter estimates (due to regularization) and thus better clustering results for small samples.For future improvement, we may follow Xie et al (2008b) [( 50)] to impose a penalty on variance parameters to account for clustering structures in varying variances across clusters, in addition to that in locations or means.It may be of interest to extend the proposed method to deal with other issues in high-dimensional genomic data, such as prior biological knowledge and outliers (Pan 2006 [(29); Tseng 2007 [(43)]).More investigations are necessary, especially in further evaluating the proposed method with real-world applications.
Free software will be posted on our web site.
Appendix A: Appendix: Proof of Theorem 1 For simplicity of notation, we drop the "hat" from any parameter estimate.Since Q P is differentiable with respect to µ ik when µ ik = 0, while nondifferentiable at µ ik = 0, we consider the following two cases: i) If µ ik = 0 is a maximum, given that Q P is concave and differentiable, then the sufficient and necessary condition for µ ik to be the global maximum of Q P is ∂Q P /∂µ ik = 0 ⇐⇒ n j=1 τ ij K l=1 (x jl − µ il )W lk − λ 1 sign(µ ik ) = 0, from which (12) can be easily derived if we separate µ ik from other components of µ i .

Fig 1 .
Fig 1. Simulated data from only one cluster (top panels) and from two clusters (bottom panels) with K = 2.
2. A comparison with a Bayesian method Kim et al (2006) [(

Table 5
Clustering results for the leukemia gene expression data with K = 300 genes.
They considered two values of α.For α = 38, the MCMC sampler visited models with 4 to 7 components.The sample allocations based on the maximum a posteriori probability (MAP) and on the least-squares clustering (LSC) algorithm were respectively,

Table 6
Clustering results with cluster-specific unconstrained covariance matrices for the leukemia gene expression data with K = 3571 genes.

Table 7
Clustering results for the BOEC data.

Table 8
Semi-supervised learning results for the BOEC data.To address the biological question of whether the BOEC samples belong to either or neither of the LVEC and MVEC classes, we fully utilize the known memberships of the LVEC and MVEC samples while allowing the memberships of the BOEC samples to be determined.This is a semi-supervised learning problem with some observations with known cluster memberships (McLachlan and Peel 2002 [(27)]; Pan et al 2006 [(31)]).Below we illustrate the application of the methods to the BOEC data.