Electronic Journal of Statistics

On clustering procedures and nonparametric mixture estimation

Stéphane Auray, Nicolas Klutchnikoff, and Laurent Rouvière

Abstract

This paper addresses nonparametric estimation of conditional densities in mixture models when additional covariates are available. The proposed approach first runs a clustering algorithm on the additional covariates to guess the mixture component of each observation. The conditional densities of the mixture model are then estimated by kernel density estimates applied separately to each cluster. We investigate the expected $L_{1}$-error of the resulting estimates and derive optimal rates of convergence over classical nonparametric density classes, provided the clustering method is accurate. The performance of a clustering algorithm is measured by its maximal misclassification error. We obtain upper bounds on this quantity for a single linkage hierarchical clustering algorithm. Lastly, we present applications of the proposed method to mixture models involving electricity distribution data and simulated data.
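The two-step procedure described in the abstract (cluster the covariates, then estimate one density per cluster) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the Gaussian kernel, the fixed bandwidth, the toy data, and all function names (`single_linkage`, `gaussian_kde`, `cluster_then_kde`) are assumptions made for the example, and the number of clusters is taken as known.

```python
import math
import random
from itertools import combinations


def single_linkage(points, n_clusters):
    """Single-linkage agglomerative clustering, stopped at n_clusters groups.

    Repeatedly merging the two closest clusters is equivalent to adding
    pairwise edges in increasing order of length (as in Kruskal's algorithm)
    until only n_clusters connected components remain.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    remaining = len(points)
    for _, i, j in edges:
        if remaining == n_clusters:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            remaining -= 1

    # Relabel the component roots as 0, 1, ... in order of first appearance.
    roots = [find(i) for i in range(len(points))]
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]


def gaussian_kde(sample, bandwidth):
    """Return a Gaussian kernel density estimate built from `sample`."""
    norm = len(sample) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(y):
        return sum(math.exp(-0.5 * ((y - s) / bandwidth) ** 2) for s in sample) / norm

    return density


def cluster_then_kde(covariates, responses, n_clusters, bandwidth):
    """Cluster on the covariates, then fit one KDE per cluster on the responses."""
    labels = single_linkage(covariates, n_clusters)
    estimates = []
    for k in range(n_clusters):
        sample = [y for y, lab in zip(responses, labels) if lab == k]
        estimates.append(gaussian_kde(sample, bandwidth))
    return labels, estimates


# Toy data (made up for illustration): two components whose covariates
# are well separated, so the clustering step can recover the components.
random.seed(0)
n = 100
covariates, responses, truth = [], [], []
for _ in range(n):
    comp = random.randrange(2)
    cx, cy = (0.0, 0.0) if comp == 0 else (10.0, 10.0)
    covariates.append((cx + random.uniform(-1, 1), cy + random.uniform(-1, 1)))
    responses.append(random.gauss(-2.0 if comp == 0 else 3.0, 1.0))
    truth.append(comp)

labels, estimates = cluster_then_kde(covariates, responses, n_clusters=2, bandwidth=0.5)
```

Here `estimates[k]` is a callable density for cluster `k`; cluster labels are only defined up to a permutation of the true component indices. In practice the bandwidth and the number of clusters would need to be chosen from the data, which this sketch does not attempt.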

Article information

Source
Electron. J. Statist., Volume 9, Number 1 (2015), 266-297.

Dates
First available in Project Euclid: 18 February 2015

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1424267116

Digital Object Identifier
doi:10.1214/15-EJS995

Mathematical Reviews number (MathSciNet)
MR3314483

Zentralblatt MATH identifier
1307.62101

Subjects
Primary: 62G07: Density estimation
Secondary: 62H30: Classification and discrimination; cluster analysis [See also 68T10, 91C20]

Keywords
Nonparametric estimation; mixture models; clustering

Citation

Auray, Stéphane; Klutchnikoff, Nicolas; Rouvière, Laurent. On clustering procedures and nonparametric mixture estimation. Electron. J. Statist. 9 (2015), no. 1, 266–297. doi:10.1214/15-EJS995. https://projecteuclid.org/euclid.ejs/1424267116

