Source: Ann. Statist. Volume 33, Number 1 (2005), 347-380.
Let there be given a contaminated list of n ℝ^d-valued observations coming from g different, normally distributed populations with a common covariance matrix. We compute the ML-estimator with respect to a certain statistical model with n−r outliers for the parameters of the g populations; it detects outliers and simultaneously partitions their complement into g clusters. It turns out that the estimator unites both the minimum-covariance-determinant rejection method and the well-known pooled determinant criterion of cluster analysis. We also propose an efficient algorithm for approximating this estimator and study its breakdown points for mean values and pooled SSP matrix.
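The criterion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm, which the abstract does not spell out. It assumes the objective is the determinant of the pooled within-cluster SSP matrix of the r retained points, and it uses a simple trimming-and-reassignment heuristic based on Mahalanobis distances. All names (`pooled_scatter`, `criterion`, `reduction_step`) and the convention that label -1 marks an outlier are hypothetical choices made for this illustration.

```python
import numpy as np

def pooled_scatter(X, labels, g):
    """Pooled within-cluster SSP matrix of the retained points.

    labels[i] in {0, ..., g-1} assigns point i to a cluster;
    labels[i] == -1 marks it as an outlier (convention of this sketch).
    """
    d = X.shape[1]
    W = np.zeros((d, d))
    for j in range(g):
        C = X[labels == j]
        if len(C) > 0:
            D = C - C.mean(axis=0)
            W += D.T @ D
    return W

def criterion(X, labels, g):
    """Determinant of the pooled SSP matrix (smaller is better)."""
    return np.linalg.det(pooled_scatter(X, labels, g))

def reduction_step(X, labels, g, r):
    """One heuristic improvement step (an assumption, not the paper's method).

    Each point is matched to its nearest cluster mean in Mahalanobis
    distance with respect to the current pooled SSP matrix; the r closest
    points are retained and the remaining n - r are marked as outliers.
    """
    for j in range(g):
        if not np.any(labels == j):
            raise ValueError(f"cluster {j} is empty")
    W_inv = np.linalg.inv(pooled_scatter(X, labels, g))
    means = np.stack([X[labels == j].mean(axis=0) for j in range(g)])
    diff = X[:, None, :] - means[None, :, :]            # shape (n, g, d)
    dist = np.einsum("ngd,de,nge->ng", diff, W_inv, diff)
    nearest = dist.argmin(axis=1)
    keep = np.argsort(dist.min(axis=1))[:r]
    new_labels = np.full(len(X), -1)
    new_labels[keep] = nearest[keep]
    return new_labels

# Toy data: two normal clusters with a common covariance, plus 4 gross outliers.
rng = np.random.default_rng(0)
A = rng.normal([0.0, 0.0], 1.0, size=(20, 2))
B = rng.normal([10.0, 10.0], 1.0, size=(20, 2))
O = np.array([[50.0, -50.0], [-60.0, 40.0], [80.0, 80.0], [-70.0, -70.0]])
X = np.vstack([A, B, O])
g, r = 2, 40

# Deliberately poor start: two outliers retained, two regular points discarded.
labels = np.array([0] * 20 + [1] * 20 + [-1] * 4)
labels[[0, 20]] = -1
labels[40], labels[41] = 0, 1

before = criterion(X, labels, g)
new_labels = reduction_step(X, labels, g, r)
after = criterion(X, new_labels, g)
print(before, after)
```

In this toy example a single step discards the four planted outliers and lowers the determinant. In practice one would iterate the step until the labels stop changing and restart from several initial assignments, since a heuristic of this kind can only be expected to reach a local optimum.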