The Annals of Statistics

A general trimming approach to robust cluster analysis

Luis A. García-Escudero, Alfonso Gordaliza, Carlos Matrán, and Agustin Mayo-Iscar
Source: Ann. Statist. Volume 36, Number 3 (2008), 1324-1345.

Abstract

We introduce a new clustering method aimed at fitting clusters with different scatters and weights. To guarantee robustness, the method is designed to handle a proportion α of contaminating data. As a characteristic feature, restrictions are placed on the ratio between the maximum and the minimum eigenvalues of the groups' scatter matrices. This makes the problem well defined and guarantees the consistency of the sample solutions to the population ones.

The method covers a wide range of clustering approaches depending on the strength of the chosen restrictions. Our proposal includes an algorithm for approximately solving the sample problem.
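
The two ingredients named in the abstract, the eigenvalue-ratio restriction and the trimming of a proportion α of the data, can be illustrated with a minimal sketch. This is not the authors' algorithm. It assumes bivariate data with symmetric, positive definite 2x2 scatter matrices, reads the restriction as "largest eigenvalue across all groups divided by smallest eigenvalue across all groups is at most a constant c", and assumes that trimming discards the int(n·α) observations with the lowest scores. The function names, the constant `c`, and the `scores` input are all hypothetical.

```python
import math


def eigenvalues_2x2(s):
    """Return (smallest, largest) eigenvalue of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = s
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius


def eigenvalue_ratio(scatters):
    """Largest eigenvalue divided by smallest eigenvalue, pooled over all groups.

    Assumes every scatter matrix is positive definite, so the minimum is > 0.
    """
    eigs = [lam for s in scatters for lam in eigenvalues_2x2(s)]
    return max(eigs) / min(eigs)


def satisfies_restriction(scatters, c):
    """Check the (assumed) constraint: pooled eigenvalue ratio is at most c."""
    return eigenvalue_ratio(scatters) <= c


def trimmed_indices(scores, alpha):
    """Indices kept after discarding the int(n * alpha) lowest-scoring observations."""
    n_trim = int(len(scores) * alpha)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[n_trim:])


# Hypothetical example: one spherical group and one elongated group.
scatters = [[[1.0, 0.0], [0.0, 1.0]], [[4.0, 0.0], [0.0, 2.0]]]
ratio = eigenvalue_ratio(scatters)
kept = trimmed_indices([0.9, 0.1, 0.5, 0.7, 0.3], alpha=0.2)
```

Under this reading, a value of c close to 1 forces nearly spherical groups of similar size, while a large c leaves the scatters almost unrestricted, which is one way to interpret the abstract's remark about the strength of the chosen restrictions.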

Primary Subjects: 62H3
Secondary Subjects: 62H3
Full-text: Open access
Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aos/1211819566
Digital Object Identifier: doi:10.1214/07-AOS515
Mathematical Reviews number (MathSciNet): MR2418659
Zentralblatt MATH identifier: 05294975


2012 © Institute of Mathematical Statistics
