The Annals of Statistics

A general trimming approach to robust cluster analysis

Luis A. García-Escudero, Alfonso Gordaliza, Carlos Matrán, and Agustin Mayo-Iscar

Source: Ann. Statist. Volume 36, Number 3 (2008), 1324-1345.

Abstract

We introduce a new clustering method designed to fit clusters with different scatters and weights. To guarantee robustness, the method allows a proportion α of contaminating data to be trimmed. As a characteristic feature, restrictions are placed on the ratio between the maximum and the minimum eigenvalues of the groups' scatter matrices. These restrictions make the problem well defined and guarantee the consistency of the sample solutions to the population ones.
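As a rough illustration of the eigenvalue-ratio restriction described above, the following sketch checks whether a set of group scatter matrices satisfies a bound on that ratio. It is an assumption-laden toy, not the authors' implementation: the bound name `c`, the function names, the restriction to symmetric 2x2 matrices, and the reading that the maximum and minimum are taken over the eigenvalues of all groups pooled together are all choices made here for illustration.

```python
import math


def eigenvalues_sym2x2(s):
    """Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]], smallest first."""
    (a, b), (_, d) = s
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius


def eigenvalue_ratio(scatters):
    """Largest eigenvalue divided by smallest, pooled over all groups (assumed reading)."""
    eigs = [lam for s in scatters for lam in eigenvalues_sym2x2(s)]
    return max(eigs) / min(eigs)


def satisfies_restriction(scatters, c):
    """True if the pooled eigenvalue ratio does not exceed the hypothetical bound c."""
    return eigenvalue_ratio(scatters) <= c


# Two hypothetical group scatter matrices: one spherical, one elongated.
scatters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[4.0, 0.0], [0.0, 2.0]],
]
print(eigenvalue_ratio(scatters))
print(satisfies_restriction(scatters, c=5.0))
```

Under this reading, a bound close to 1 forces all groups toward nearly spherical scatters of similar size, while a large bound leaves the scatters almost unrestricted.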

The method covers a wide range of clustering approaches depending on the strength of the chosen restrictions. Our proposal includes an algorithm for approximately solving the sample problem.
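To give a feel for how trimming a proportion α can be combined with cluster assignment, here is a minimal sketch of one update step in the style of trimmed k-means, which appears among the keywords. It is not the algorithm proposed in the paper, which also estimates scatters and weights under the eigenvalue restriction. The function name, the 1-D data, and the choice to trim ceil(n * alpha) observations are assumptions made for this example.

```python
import math


def trimmed_kmeans_step(points, centers, alpha):
    """One trimmed k-means style update for 1-D data (illustrative sketch only)."""
    n = len(points)
    n_trim = math.ceil(n * alpha)  # assumption: discard ceil(n * alpha) points

    # Assign each point to its nearest center and record the distance.
    assigned = []
    for x in points:
        j = min(range(len(centers)), key=lambda k: abs(x - centers[k]))
        assigned.append((abs(x - centers[j]), x, j))

    # Keep the n - n_trim points closest to their centers; trim the rest.
    assigned.sort(key=lambda t: t[0])
    kept = assigned[: n - n_trim]
    trimmed = [x for _, x, _ in assigned[n - n_trim:]]

    # Recompute each center from its kept members (unchanged if it has none).
    new_centers = []
    for j, old in enumerate(centers):
        members = [x for _, x, g in kept if g == j]
        new_centers.append(sum(members) / len(members) if members else old)
    return new_centers, trimmed


# Hypothetical data: two tight groups plus one gross outlier.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 100.0]
new_centers, trimmed = trimmed_kmeans_step(points, [1.5, 10.5], alpha=0.1)
print(new_centers, trimmed)
```

In this toy run the outlier is discarded before the centers are recomputed, so it does not pull the second center away from its group.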

Primary Subjects: 62H3
Secondary Subjects: 62H3
Keywords: Robustness; cluster analysis; trimming; asymptotics; trimmed k-means; EM-algorithm; fast-MCD algorithm; Dykstra’s algorithm

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aos/1211819566
Digital Object Identifier: doi:10.1214/07-AOS515

2009 © Institute of Mathematical Statistics