## Electronic Journal of Statistics

### Clustering and variable selection for categorical multivariate data

#### Abstract

This article investigates unsupervised classification (clustering) of categorical multivariate data. The study relies on multivariate multinomial mixture models, a class of models particularly well suited to multilocus genotypic data. A model selection procedure is used to select the number of mixture components and the subset of relevant variables simultaneously. A non-asymptotic oracle inequality is established, leading to a new penalized maximum likelihood criterion. Under weak assumptions on the true probability distribution underlying the observations, the selected model is shown to be asymptotically consistent. The main theoretical result yields a penalty function defined up to a multiplicative constant; in practice, this constant is calibrated from the data using the slope heuristics. On simulated data, the resulting procedure improves on classical criteria such as BIC and AIC, and provides an answer to the question “Which criterion for which sample size?” Applications to real datasets are also presented.
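The model behind the abstract is the latent class model: a mixture of products of multinomial distributions, fitted by EM, with model selection via a penalized log-likelihood whose penalty is proportional to the model dimension. The following is a minimal NumPy sketch of that pipeline, not the authors' implementation; the function names, the smoothing constant, and the choice `kappa = log(n)/2` (which recovers a BIC-type penalty, whereas the article calibrates this constant from the data by slope heuristics) are illustrative assumptions.

```python
import numpy as np

def fit_latent_class(X, K, n_iter=200, seed=0):
    """EM for a mixture of products of multinomials (latent class model).

    X : (n, J) integer array, X[i, j] is the category of variable j for
        observation i.  Returns mixing weights pi (K,), per-component
        category probabilities alpha[j] of shape (K, m_j), and the
        maximized log-likelihood.
    """
    rng = np.random.default_rng(seed)
    n, J = X.shape
    levels = [int(X[:, j].max()) + 1 for j in range(J)]
    pi = np.full(K, 1.0 / K)
    alpha = [rng.dirichlet(np.ones(m), size=K) for m in levels]
    for _ in range(n_iter):
        # E-step: log t[i, k] ∝ log pi_k + sum_j log alpha[j][k, X[i, j]]
        log_t = np.log(pi)[None, :] + sum(
            np.log(alpha[j][:, X[:, j]].T) for j in range(J)
        )
        log_t -= log_t.max(axis=1, keepdims=True)   # stabilize before exp
        t = np.exp(log_t)
        t /= t.sum(axis=1, keepdims=True)           # responsibilities
        # M-step: responsibility-weighted frequencies (tiny smoothing
        # keeps every probability strictly positive)
        pi = t.mean(axis=0)
        for j in range(J):
            counts = np.stack(
                [t[X[:, j] == c].sum(axis=0) for c in range(levels[j])],
                axis=1,
            ) + 1e-10
            alpha[j] = counts / counts.sum(axis=1, keepdims=True)
    # observed-data log-likelihood via log-sum-exp over components
    log_p = np.log(pi)[None, :] + sum(
        np.log(alpha[j][:, X[:, j]].T) for j in range(J)
    )
    m = log_p.max(axis=1, keepdims=True)
    loglik = float((m.ravel() + np.log(np.exp(log_p - m).sum(axis=1))).sum())
    return pi, alpha, loglik

def penalized_criterion(loglik, K, levels, kappa):
    """Penalized log-likelihood: loglik - kappa * D_K, where the model
    dimension D_K counts (K-1) free mixing weights plus K free
    multinomial parameter vectors per variable."""
    D = (K - 1) + K * sum(m - 1 for m in levels)
    return loglik - kappa * D

# tiny demo: two well-separated synthetic clusters over 4 binary variables
rng = np.random.default_rng(1)
z = rng.integers(0, 2, size=300)
X = np.where(rng.random((300, 4)) < 0.9, z[:, None], 1 - z[:, None])
pi, alpha, loglik = fit_latent_class(X, K=2)
crit = penalized_criterion(loglik, K=2, levels=[2, 2, 2, 2],
                           kappa=np.log(len(X)) / 2)  # BIC-type choice
```

In the article's procedure, `kappa` is not fixed in advance: the slope heuristics estimates a minimal penalty constant from the behaviour of the maximized log-likelihood across models and then doubles it, which is what drives the reported improvement over BIC and AIC at moderate sample sizes.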

#### Article information

**Source:** Electron. J. Statist., Volume 7 (2013), 2344–2371.

**Dates:** First available in Project Euclid: 19 September 2013

**Permanent link:** https://projecteuclid.org/euclid.ejs/1379596773

**Digital Object Identifier:** doi:10.1214/13-EJS844

**Mathematical Reviews number (MathSciNet):** MR3108816

**Zentralblatt MATH identifier:** 1349.62259

#### Citation

Bontemps, Dominique; Toussile, Wilson. Clustering and variable selection for categorical multivariate data. Electron. J. Statist. 7 (2013), 2344–2371. doi:10.1214/13-EJS844. https://projecteuclid.org/euclid.ejs/1379596773

#### References

• Arlot, S. and Massart, P. (2009). Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res. 10 245–279.
• Asuncion, A. and Newman, D. J. (2007). UCI Machine Learning Repository.
• Bai, Z., Rao, C. R. and Wu, Y. (1999). Model selection with data-oriented penalty. J. Statist. Plann. Inference 77 102–117.
• Biernacki, C., Celeux, G. and Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. 22 719–725.
• Birgé, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probab. Theory Related Fields 138 33–73.
• Celeux, G. and Govaert, G. (1991). Clustering criteria for discrete data and latent class models. J. Classif. 8 157–176.
• Celeux, G., Hurn, M. and Robert, C. P. (2000). Computational and inferential difficulties with mixture posterior distributions. J. Am. Stat. Assoc. 95 957–970.
• Chen, C., Forbes, F. and François, O. (2006). Fastruct: Model-based clustering made faster. Molecular Ecology Notes 6 980–983.
• Collins, L. M. and Lanza, S. T. (2010). Latent Class and Latent Transition Analysis: With Applications in the Social, Behavioral, and Health Sciences. Wiley Series in Probability and Statistics. Wiley.
• Corander, J., Marttinen, P., Sirén, J. and Tang, J. (2008). Enhanced Bayesian modelling in BAPS software for learning genetic structures of populations. BMC Bioinformatics 9 539.
• Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Series B 39 1–38.
• Genovese, C. R. and Wasserman, L. (2000). Rates of convergence for the Gaussian mixture sieve. Ann. Statist. 28 1105–1127.
• Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 61 215–231.
• Latch, E. K., Dharmarajan, G., Glaubitz, J. C. and Rhodes, O. E. Jr. (2006). Relative performance of Bayesian clustering software for inferring population substructure and individual assignment at low levels of population differentiation. Conservation Genetics 7 295.
• Lebarbier, É. (2002). Quelques approches pour la détection de rupture à horizon fini. PhD thesis, Univ. Paris-Sud, F-91405, Orsay.
• Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Mathematics 1896. Springer-Verlag, Berlin.
• Maugis, C. and Michel, B. (2011a). A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: P&S 15 41–68.
• Maugis, C. and Michel, B. (2011b). Data-driven penalty calibration: A case study for Gaussian mixture model selection. ESAIM: P&S 15 320–339.
• McCutcheon, A. L. (1987). Latent Class Analysis. Quantitative Applications in the Social Sciences 64. Sage Publications, Thousand Oaks, California.
• McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley Series in Probability and Statistics. Wiley.
• Nadif, M. and Govaert, G. (1998). Clustering for binary data and mixture models – choice of the model. Appl. Stoch. Models Data Anal. 13 269–278.
• Pritchard, J. K., Stephens, M. and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics 155 945–959.
• Rigouste, L., Cappé, O. and Yvon, F. (2006). Inference and evaluation of the multinomial mixture model for text clustering. Inform. Process. Manag. 43 1260–1280.
• Rosenberg, N. A., Burke, T., Elo, K., Feldman, M. W., Freidlin, P. J., Groenen, M. A. M., Hillel, J., Ma, A., Vignal, A., Wimmers, K. and Weigend, S. (2001). Empirical evaluation of genetic clustering methods using multilocus genotypes from 20 chicken breeds. Biotechnology.
• Toussile, W. and Gassiat, E. (2009). Variable selection in model-based clustering using multilocus genotype data. Adv. Data Anal. Classif. 3 109–134.
• Verzelen, N. (2009). Adaptive estimation of regular Gaussian Markov random fields. PhD thesis, Univ. Paris-Sud.
• Villers, F. (2007). Tests et sélection de modèles pour l'analyse de données protéomiques et transcriptomiques. PhD thesis, Univ. Paris-Sud.