Electronic Journal of Statistics

Penalized model-based clustering with unconstrained covariance matrices

Hui Zhou, Wei Pan, and Xiaotong Shen


Clustering is one of the most useful tools for high-dimensional analysis, e.g., for microarray data. It becomes challenging in the presence of a large number of noise variables, which may mask underlying clustering structures. Therefore, noise removal through variable selection is necessary. One effective approach is regularization for simultaneous parameter estimation and variable selection in model-based clustering. However, existing methods focus on regularizing the mean parameters representing the centers of clusters and ignore dependencies among variables within clusters, which leads to incorrect orientations or shapes of the resulting clusters. In this article, we propose a regularized Gaussian mixture model with general covariance matrices, taking various dependencies into account. At the same time, this approach shrinks the means and covariance matrices, achieving better clustering and variable selection. To overcome one technical challenge in estimating possibly large covariance matrices, we derive an EM algorithm that uses the graphical lasso (Friedman et al. 2007) for parameter estimation. Numerical examples, including applications to microarray gene expression data, demonstrate the utility of the proposed method.
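
As a rough sketch of the kind of objective described above, one plausible form of the penalized log-likelihood is written out below. The notation is assumed here for illustration and is not quoted from the article: $x_j$ are the $n$ observations, $\pi_i$, $\mu_i$ and $V_i$ are the mixing proportion, mean and covariance matrix of cluster $i$, $W_i = V_i^{-1}$ is its inverse covariance matrix, and $\lambda_1, \lambda_2 \ge 0$ are tuning parameters.

$$
\log L_P(\Theta) = \sum_{j=1}^{n} \log \Big\{ \sum_{i=1}^{g} \pi_i \, \phi(x_j; \mu_i, V_i) \Big\} - \lambda_1 \sum_{i=1}^{g} \sum_{k=1}^{K} |\mu_{ik}| - \lambda_2 \sum_{i=1}^{g} \sum_{k,l=1}^{K} |W_{i;kl}|
$$

Under these assumptions, the E-step would compute the usual posterior probabilities $\tau_{ij} = \pi_i \phi(x_j; \mu_i, V_i) / \sum_{h} \pi_h \phi(x_j; \mu_h, V_h)$. With $n_i = \sum_j \tau_{ij}$ and $S_i = \sum_j \tau_{ij} (x_j - \mu_i)(x_j - \mu_i)^T / n_i$, the part of the M-step that involves $W_i$ reduces to

$$
\max_{W_i \succ 0} \; \log \det W_i - \operatorname{tr}(S_i W_i) - \frac{2 \lambda_2}{n_i} \sum_{k,l=1}^{K} |W_{i;kl}|,
$$

which has the form of a graphical lasso problem with penalty $2\lambda_2 / n_i$. This is one way to see why the graphical lasso can be applied cluster by cluster inside the EM iterations.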

Article information

Electron. J. Statist., Volume 3 (2009), 1473-1496.

First available in Project Euclid: 4 January 2010

Primary: 62H30: Classification and discrimination; cluster analysis [See also 68T10, 91C20]

Covariance estimation; EM algorithm; Gaussian graphical models; high-dimension but low-sample size; $L_1$ penalization; normal mixtures; penalized likelihood; semi-supervised learning


Zhou, Hui; Pan, Wei; Shen, Xiaotong. Penalized model-based clustering with unconstrained covariance matrices. Electron. J. Statist. 3 (2009), 1473–1496. doi:10.1214/09-EJS487. https://projecteuclid.org/euclid.ejs/1262617415


  • [1] Alaiya, A.A. et al. (2002). Molecular classification of borderline ovarian tumors using hierarchical cluster analysis of protein expression profiles. Int. J. Cancer, 98, 895–899.
  • [2] Antonov, A.V., Tetko, I.V., Mader, M.T., Budczies, J. and Mewes, H.W. (2004). Optimization models for cancer classification: extracting gene interaction information from microarray expression data. Bioinformatics, 20, 644–652.
  • [3] Baker, Stuart G. and Kramer, Barnett S. (2006). Identifying genes that contribute most to good classification in microarray. BMC Bioinformatics, 7, 407.
  • [4] Banfield, J.D. and Raftery, A.E. (1993). Model-Based Gaussian and Non-Gaussian Clustering. Biometrics, 49, 803–821.
  • [5] Bardi, E., Bobok, I., Olah, A.V., Olah, E., Kappelmayer, J. and Kiss, C. (2004). Cystatin C is a suitable marker of glomerular function in children with cancer. Pediatric Nephrology, 19, 1145–1147.
  • [6] Carvalho, C.M. and Scott, J.G. (2009). Objective Bayesian model selection in Gaussian graphical models. Biometrika, 96, 497–512.
  • [7] Chi, J-T. et al. (2003). Endothelial cell diversity revealed by global expression profiling. PNAS, 100, 10623–10628.
  • [8] Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). JRSS-B, 39, 1–38.
  • [9] Dudoit, S., Fridlyand, J. and Speed, T. (2002). Comparison of discrimination methods for the classification of tumors using expression data. J. Am. Stat. Assoc., 97, 77–87.
  • [10] Eisen, M., Spellman, P., Brown, P. and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. PNAS, 95, 14863–14868.
  • [11] Fan, J., Feng, Y. and Wu, Y. (2009). Network exploration via the adaptive LASSO and SCAD penalties. Ann. Appl. Stat., 3, 521–541.
  • [12] Friedman, J., Hastie, T. and Tibshirani, R. (2007). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 0, 1–10.
  • [13] Fraley, C. and Raftery, A.E. (1998). How many clusters? Which clustering methods? Answers via model-based cluster analysis. Computer J., 41, 578–588.
  • [14] Golub, T. et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
  • [15] Guo, F.J., Levina, E., Michailidis, G. and Zhu, J. (2009). Pairwise Variable Selection for High-dimensional Model-based Clustering. To appear in Biometrics.
  • [16] Hoff, P.D. (2006). Model-based subspace clustering. Bayesian Analysis, 1, 321–344.
  • [17] Huang, J.Z., Liu, N., Pourahmadi, M. and Liu, L. (2006). Covariance selection and estimation via penalised normal likelihood. Biometrika, 93, 85–98.
  • [18] Jiang, A., Pan, W., Yu, S. and Robert, P.H. (2007). A practical question based on cross-platform microarray data normalization: are BOEC more like large vessel or microvascular endothelial cells or neither of them? Journal of Bioinformatics and Computational Biology, 5, 875–893.
  • [19] Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C. and West, M. (2005). Experiments in stochastic computation for high-dimensional graphical models. Statist. Sci., 20, 388–400.
  • [20] Kim, S., Tadesse, M.G. and Vannucci, M. (2006). Variable selection in clustering via Dirichlet process mixture models. Biometrika, 93, 877–893.
  • [21] Lau, J.W. and Green, P.J. (2007). Bayesian model based clustering procedure. Journal of Computational and Graphical Statistics, 16, 526–558.
  • [22] Levina, E., Rothman, A. and Zhu, J. (2008). Sparse estimation of large covariance matrices via a nested lasso penalty. Annals of Applied Statistics, 2, 245–263.
  • [23] Liang, F., Mukherjee, S. and West, M. (2007). The use of unlabeled data in predictive modeling. Statistical Science, 22, 189–205.
  • [24] Liao, J.G. and Chin, K.V. (2007). Logistic regression for disease classification using microarray data: model selection in a large p and small n case. Bioinformatics, 23, 1945–1951.
  • [25] Liu, J.S., Zhang, J.L., Palumbo, M.J. and Lawrence, C.E. (2003). Bayesian clustering with variable and transformation selection (with discussion). Bayesian Statistics 7, 249–275.
  • [26] McLachlan, G. (1987). On bootstrapping likelihood ratio test statistics for the number of components in a normal mixture. Applied Statistics, 36, 318–324.
  • [27] McLachlan, G.J., Bean, R.W. and Peel, D. (2002). A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18, 413–422.
  • [28] McLachlan, G.J. and Peel, D. (2002). Finite Mixture Model. New York: John Wiley & Sons, Inc.
  • [29] Muller, P., Erkanli, A. and West, M. (1996). Bayesian curve fitting using multivariate normal mixtures. Biometrika, 83, 67–79.
  • [30] Pan, W. (2006). Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics, 22, 795–801.
  • [31] Pan, W. and Shen, X. (2007). Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8, 1145–1164.
  • [32] Pan, W., Shen, X., Jiang, A. and Hebbel, R.P. (2006). Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics, 22, 2388–2395.
  • [33] Raftery, A.E. and Dean, N. (2006). Variable selection for model-based clustering. Journal of the American Statistical Association, 101, 168–178.
  • [34] Rand, W.M. (1971). Objective criteria for the evaluation of clustering methods. JASA, 66, 846–850.
  • [35] Rothman, A., Levina, E. and Zhu, J. (2009). Generalized thresholding of large covariance matrices. JASA, 104, 177–186.
  • [36] Richardson, S. and Green, P.J. (1997). On Bayesian analysis of mixture models (with Discussion). J R Statist Soc B, 59, 731–792.
  • [37] Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
  • [38] Scott, J.G. and Carvalho, C.M. (2009). Feature-inclusion stochastic search for Gaussian graphical models. J. Comp. Graph. Stat., 17, 790–808.
  • [39] Tadesse, M.G., Sha, N. and Vannucci, M. (2005). Bayesian variable selection in clustering high-dimensional data. Journal of the American Statistical Association, 100, 602–617.
  • [40] Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J. and Church, G.M. (1999). Systematic determination of genetic network architecture. Nat. Genet., 22, 281–285.
  • [41] Thalamuthu, A., Mukhopadhyay, I., Zheng, X. and Tseng, G.C. (2006). Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics, 22, 2405–2412.
  • [42] Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M. (2004). Sharing clusters among related groups: Hierarchical Dirichlet processes. NIPS.
  • [43] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. JRSS-B, 58, 267–288.
  • [44] Tseng, G.C. (2007). Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data. Bioinformatics, 23, 2247–2255.
  • [45] Tseng, P. (1988). Coordinate ascent for maximizing nondifferentiable concave functions. Technical report LIDS-P; 1840, Massachusetts Institute of Technology, Laboratory for Information and Decision Systems.
  • [46] Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. J. Opt. Theory and Applications, 109, 474–494.
  • [47] Wang, Y., Tetko, I.V., Hall, M.A., Frank, E., Facius, A., Mayer, K.F.X. and Mewes, H.W. (2005). Gene selection from microarray data for cancer classification - a machine learning approach. Comput Biol Chem, 29, 37–46.
  • [48] Wang, S. and Zhu, J. (2008). Variable Selection for Model-Based High-Dimensional Clustering and Its Application to Microarray Data. Biometrics, 64, 440–448.
  • [49] Wasserman, L. (2000). Asymptotic inference for mixture models using data-dependent priors. J R Statist Soc B, 62, 159–180.
  • [50] Xie, B., Pan, W. and Shen, X. (2008a). Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics, 64, 921–930.
  • [51] Xie, B., Pan, W. and Shen, X. (2008b). Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electron. J. Statist., 2, 168–212.
  • [52] Xie, B., Pan, W. and Shen, X. (2009). Penalized mixtures of factor analyzers with application to clustering high dimensional microarray data. To appear in Bioinformatics. Available at http://www.biostat.umn.edu./rrs.php as Research Report 2009-019, Division of Biostatistics, University of Minnesota.
  • [53] Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94, 19–35.