The Annals of Statistics

A robust method for cluster analysis

María Teresa Gallegos and Gunter Ritter
Source: Ann. Statist. Volume 33, Number 1 (2005), 347-380.

Abstract

Let there be given a contaminated list of $n$ $\mathbb{R}^d$-valued observations coming from $g$ different, normally distributed populations with a common covariance matrix. We compute the ML-estimator with respect to a certain statistical model with $n-r$ outliers for the parameters of the $g$ populations; it detects outliers and simultaneously partitions their complement into $g$ clusters. It turns out that the estimator unites both the minimum-covariance-determinant rejection method and the well-known pooled determinant criterion of cluster analysis. We also propose an efficient algorithm for approximating this estimator and study its breakdown points for mean values and pooled SSP matrix.
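
As an illustration only, the following sketch evaluates a trimmed version of the pooled determinant criterion for one fixed assignment: observations flagged as outliers are dropped, and the determinant of the pooled within-cluster SSP (sums of squares and products) matrix of the remaining observations is computed. It is a sketch under stated assumptions, not the estimator or the approximation algorithm of the paper, which the abstract does not spell out. The function names, the label convention (-1 for an outlier) and the toy data are hypothetical.

```python
from typing import List, Sequence


def determinant(matrix: List[List[float]]) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def pooled_ssp(points: Sequence[Sequence[float]],
               labels: Sequence[int], g: int) -> List[List[float]]:
    """Sum over clusters 0..g-1 of the within-cluster SSP matrices.

    Points labelled -1 (assumed outlier convention) are ignored.
    """
    d = len(points[0])
    w = [[0.0] * d for _ in range(d)]
    for k in range(g):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            continue
        mean = [sum(p[j] for p in members) / len(members) for j in range(d)]
        for p in members:
            dev = [p[j] - mean[j] for j in range(d)]
            for i in range(d):
                for j in range(d):
                    w[i][j] += dev[i] * dev[j]
    return w


def trimmed_pooled_determinant(points: Sequence[Sequence[float]],
                               labels: Sequence[int], g: int) -> float:
    """Criterion value for one assignment; smaller is better."""
    return determinant(pooled_ssp(points, labels, g))


# Toy data: two tight groups in the plane plus one gross outlier.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (100, -50)]
trimmed = trimmed_pooled_determinant(points, [0, 0, 0, 1, 1, 1, -1], g=2)
untrimmed = trimmed_pooled_determinant(points, [0, 0, 0, 1, 1, 1, 1], g=2)
print(trimmed, untrimmed)
```

On this toy data, discarding the last point gives a much smaller criterion value than forcing it into a cluster. A search over assignments, which this sketch does not attempt, would compare such values across candidate partitions and outlier sets.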

Primary Subjects: 62H30
Secondary Subjects: 62F35
Full-text: Open access
Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aos/1112967709
Digital Object Identifier: doi:10.1214/009053604000000940
Zentralblatt MATH identifier: 02182566
Mathematical Reviews number (MathSciNet): MR2157806

References

Barnett, V. and Lewis, T. (1994). Outliers in Statistical Data, 3rd ed. Wiley, Chichester.
Mathematical Reviews (MathSciNet): MR1272911
Zentralblatt MATH: 0801.62001
Bezdek, J. C., Keller, J., Krisnapuram, R. and Pal, N. R. (1999). Fuzzy Models and Algorithms for Pattern Recognition and Image Processing. Kluwer, Dordrecht.
Mathematical Reviews (MathSciNet): MR1745848
Zentralblatt MATH: 0998.68138
Coleman, D. A. and Woodruff, D. L. (2000). Cluster analysis for large datasets: An effective algorithm for maximizing the mixture likelihood. J. Comput. Graph. Statist. 9 672--688.
Mathematical Reviews (MathSciNet): MR1821813
Cuesta-Albertos, J. A., Gordaliza, A. and Matrán, C. (1997). Trimmed $k$-means: An attempt to robustify quantizers. Ann. Statist. 25 553--576.
Mathematical Reviews (MathSciNet): MR1439314
Digital Object Identifier: doi:10.1214/aos/1031833664
Project Euclid: euclid.aos/1031833664
Zentralblatt MATH: 0878.62045
Donoho, D. L. and Huber, P. J. (1983). The notion of a breakdown point. In A Festschrift for Erich L. Lehmann (P. J. Bickel, K. A. Doksum and J. L. Hodges, Jr., eds.) 157--184. Wadsworth, Belmont, CA.
Mathematical Reviews (MathSciNet): MR689745
Zentralblatt MATH: 0523.62032
Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. J. Amer. Statist. Assoc. 97 611--631.
Mathematical Reviews (MathSciNet): MR1951635
Digital Object Identifier: doi:10.1198/016214502760047131
Zentralblatt MATH: 1073.62545
Friedman, H. and Rubin, J. (1967). On some invariant criteria for grouping data. J. Amer. Statist. Assoc. 62 1159--1178.
Mathematical Reviews (MathSciNet): MR223012
García-Escudero, L. A. and Gordaliza, A. (1999). Robustness properties of $k$-means and trimmed $k$-means. J. Amer. Statist. Assoc. 94 956--969.
Mathematical Reviews (MathSciNet): MR1723291
García-Escudero, L. A., Gordaliza, A. and Matrán, C. (2003). Trimming tools in exploratory data analysis. J. Comput. Graph. Statist. 12 434--449.
Mathematical Reviews (MathSciNet): MR1983163
Digital Object Identifier: doi:10.1198/1061860031806
Gather, U. and Kale, B. K. (1988). Maximum likelihood estimation in the presence of outliers. Comm. Statist. Theory Methods 17 3767--3784.
Mathematical Reviews (MathSciNet): MR968034
Hampel, F. R. (1968). Contributions to the theory of robust estimation. Ph.D. dissertation, Univ. California, Berkeley.
Hampel, F. R. (1971). A general qualitative definition of robustness. Ann. Math. Statist. 42 1887--1896.
Mathematical Reviews (MathSciNet): MR301858
Digital Object Identifier: doi:10.1214/aoms/1177693054
Hartigan, J. A. (1975). Clustering Algorithms. Wiley, New York.
Mathematical Reviews (MathSciNet): MR405726
Zentralblatt MATH: 0372.62040
Hodges, J. L., Jr. (1967). Efficiency in normal samples and tolerance of extreme values for some estimates of location. Proc. Fifth Berkeley Symp. Math. Statist. Probab. 1 163--186. Univ. California Press, Berkeley.
Mathematical Reviews (MathSciNet): MR214251
Zentralblatt MATH: 0211.50205
Lopuhaä, H. P. and Rousseeuw, P. J. (1991). Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. Ann. Statist. 19 229--248.
Mathematical Reviews (MathSciNet): MR1091847
Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis. Academic Press, London.
Mathematical Reviews (MathSciNet): MR560319
Zentralblatt MATH: 0432.62029
Mathar, R. (1981). Ausreißer bei ein- und mehrdimensionalen Wahrscheinlichkeitsverteilungen. Ph.D. dissertation, Mathematisch-Naturwissenschaftliche Fakultät der Rheinisch-Westfälischen Technischen Hochschule Aachen.
Pesch, C. (2000). Eigenschaften des gegenüber Ausreissern robusten MCD-Schätzers und Algorithmen zu seiner Berechnung. Ph.D. dissertation, Fakultät für Mathematik und Informatik, Univ. Passau.
Ritter, G. and Gallegos, M. T. (1997). Outliers in statistical pattern recognition and an application to automatic chromosome classification. Pattern Recognition Letters 18 525--539.
Ritter, G. and Gallegos, M. T. (2002). Bayesian object identification: Variants. J. Multivariate Anal. 81 301--334.
Mathematical Reviews (MathSciNet): MR1906383
Digital Object Identifier: doi:10.1006/jmva.2001.2009
Zentralblatt MATH: 1011.62011
Rousseeuw, P. J. (1985). Multivariate estimation with high breakdown point. In Mathematical Statistics and Applications (W. Grossmann, G. C. Pflug, I. Vincze and W. Wertz, eds.) 283--297. Reidel, Dordrecht.
Mathematical Reviews (MathSciNet): MR851060
Zentralblatt MATH: 0609.62054
Rousseeuw, P. J. and Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics 41 212--223.
Schroeder, A. (1976). Analyse d'un mélange de distributions de probabilités de même type. Rev. Statist. Appl. 24 39--62.
Mathematical Reviews (MathSciNet): MR445694
Scott, A. J. and Symons, M. J. (1971). Clustering methods based on likelihood ratio criteria. Biometrics 27 387--397.
Späth, H. (1985). Cluster Dissection and Analysis. Theory, FORTRAN Programs, Examples. Ellis Horwood, Chichester.
Symons, M. J. (1981). Clustering criteria and multivariate normal mixtures. Biometrics 37 35--43.
Mathematical Reviews (MathSciNet): MR673031

2012 © Institute of Mathematical Statistics
