Statistics Surveys

Variable selection methods for model-based clustering

Michael Fop and Thomas Brendan Murphy

Full-text: Open access

Abstract

Model-based clustering is a popular approach for clustering multivariate data and has seen applications in numerous fields. High-dimensional data are now increasingly common, and the model-based clustering approach has adapted to deal with the growing dimensionality. In particular, the development of variable selection techniques has received considerable attention and research effort in recent years. Even for small-scale problems, variable selection has been advocated as a way to ease the interpretation of the clustering results. This review summarizes the methods developed for variable selection in model-based clustering. Existing R packages implementing the different methods are indicated and illustrated through two data analysis examples.
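As a minimal illustration of the idea behind the review's subject — greedy stepwise variable selection guided by a model-selection criterion for a finite mixture model — the sketch below uses scikit-learn's `GaussianMixture` in Python. This is not the authors' method or any of the R packages the review covers (e.g. clustvarsel); the score used here, a BIC comparison between a clustered and an unclustered model on the same variable subset, is a simplified stand-in for the criteria surveyed (in the spirit of Raftery and Dean, 2006). All variable names and the synthetic data are assumptions for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic data: variables 0-1 carry two-cluster structure, 2-4 are noise.
n = 200
informative = np.vstack([
    rng.normal(0.0, 1.0, size=(n, 2)),   # cluster 1
    rng.normal(4.0, 1.0, size=(n, 2)),   # cluster 2
])
noise = rng.normal(0.0, 1.0, size=(2 * n, 3))
X = np.hstack([informative, noise])

def clustering_evidence(X_sub, G=2):
    """Evidence for G clusters vs. a single component on the same variables:
    BIC(G=1) - BIC(G); positive when clustering is supported
    (sklearn's BIC is lower-is-better)."""
    bics = []
    for g in (1, G):
        gm = GaussianMixture(n_components=g, n_init=5, random_state=0).fit(X_sub)
        bics.append(gm.bic(X_sub))
    return bics[0] - bics[1]

def greedy_forward(X, G=2):
    """Add the variable that most improves the clustering evidence;
    stop when no candidate improves on the current subset."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining:
        scores = {j: clustering_evidence(X[:, selected + [j]], G)
                  for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:
            break
        best_score = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return sorted(selected)

selected = greedy_forward(X)
print("selected variables:", selected)
```

Comparing BIC values across models fitted to the *same* variable subset keeps the comparison valid; the noise variables add more parameters to the mixture than to the single-component model without improving the fit, so the forward search stops once the informative variables are in.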

Article information

Source
Statist. Surv., Volume 12 (2018), 18–65.

Dates
Received: July 2017
First available in Project Euclid: 26 April 2018

Permanent link to this document
https://projecteuclid.org/euclid.ssu/1524729611

Digital Object Identifier
doi:10.1214/18-SS119

Mathematical Reviews number (MathSciNet)
MR3794323

Zentralblatt MATH identifier
06875306

Keywords
Gaussian mixture model; latent class analysis; model-based clustering; R packages; variable selection

Rights
Creative Commons Attribution 4.0 International License.

Citation

Fop, Michael; Murphy, Thomas Brendan. Variable selection methods for model-based clustering. Statist. Surv. 12 (2018), 18--65. doi:10.1214/18-SS119. https://projecteuclid.org/euclid.ssu/1524729611


