The Annals of Statistics

Generalized density clustering

Alessandro Rinaldo and Larry Wasserman

Full-text: Open access

Abstract

We study generalized density-based clustering in which sharply defined clusters, such as clusters on lower-dimensional manifolds, are allowed. We show that accurate clustering is possible even in high dimensions. We propose two data-based methods for choosing the bandwidth, and we study the stability properties of density clusters. We show that a simple graph-based algorithm successfully approximates the high density clusters.
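As a rough illustration of the graph-based idea mentioned in the abstract (estimate the density with a kernel, keep the high-density sample points, and read clusters off the connected components of a neighborhood graph), here is a minimal sketch. The function name, the Gaussian kernel, and the union-find bookkeeping are illustrative choices, not the paper's algorithm.

```python
import numpy as np

def density_clusters(X, h, lam, r):
    """Illustrative graph-based density clustering (not the paper's method):
    1) Gaussian kernel density estimate at each sample point,
    2) keep points with estimated density >= lam,
    3) connect kept points within distance r,
    4) return the connected components as cluster estimates."""
    n, d = X.shape
    # Pairwise squared distances and a Gaussian KDE evaluated at the samples.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    dens = np.exp(-sq / (2 * h * h)).sum(1) / (n * (np.sqrt(2 * np.pi) * h) ** d)
    keep = np.where(dens >= lam)[0]

    # Union-find over the r-neighborhood graph on the kept points.
    parent = {i: i for i in keep}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a in range(len(keep)):
        for b in range(a + 1, len(keep)):
            i, j = keep[a], keep[b]
            if sq[i, j] <= r * r:
                parent[find(i)] = find(j)

    # Group kept indices by their component root.
    components = {}
    for i in keep:
        components.setdefault(find(i), []).append(i)
    return list(components.values())
```

On two well-separated point clouds, this recovers one cluster per cloud; the level `lam` and radius `r` play the roles of the density threshold and graph connectivity scale.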

Article information

Source
Ann. Statist. Volume 38, Number 5 (2010), 2678–2722.

Dates
First available in Project Euclid: 11 July 2010

Permanent link to this document
https://projecteuclid.org/euclid.aos/1278861457

Digital Object Identifier
doi:10.1214/10-AOS797

Mathematical Reviews number (MathSciNet)
MR2722453

Zentralblatt MATH identifier
1200.62066

Subjects
Primary: 62H30: Classification and discrimination; cluster analysis [See also 68T10, 91C20]
Secondary: 62G07: Density estimation

Keywords
Density clustering; kernel density estimation

Citation

Rinaldo, Alessandro; Wasserman, Larry. Generalized density clustering. Ann. Statist. 38 (2010), no. 5, 2678–2722. doi:10.1214/10-AOS797. https://projecteuclid.org/euclid.aos/1278861457



References

  • Ambrosio, L., Fusco, N. and Pallara, D. (2000). Functions of Bounded Variation and Free Discontinuity Problems. Clarendon, Oxford.
  • Audibert, J. and Tsybakov, A. (2007). Fast learning rates for plug-in classifiers. Ann. Statist. 35 608–633.
  • Baíllo, A., Cuesta-Albertos, J. and Cuevas, A. (2001). Convergence rates in nonparametric estimation of level sets. Statist. Probab. Lett. 53 27–35.
  • Ben-David, S., von Luxburg, U. and Pál, D. (2006). A sober look at clustering stability. In Learning Theory. Lecture Notes in Computer Science 4005 5–19. Springer, Berlin.
  • Ben-Hur, A., Elisseeff, A. and Guyon, I. (2002). A stability based method for discovering structure in clustered data. In Pacific Symposium on Biocomputing (R. B. Altman, A. K. Dunker, L. Hunter, K. Lauderdale and T. E. Klein, eds.) 6–17. World Scientific, New York.
  • Biau, G., Cadre, B. and Pelletier, B. (2007). A graph-based estimator of the number of clusters. ESAIM Probab. Stat. 11 272–280.
  • Cadre, B. (2006). Kernel density estimation on level sets. J. Multivariate Anal. 97 999–1023.
  • Castro, R. and Nowak, R. (2008). Minimax bounds for active learning. IEEE Trans. Inform. Theory 54 2339–2353.
  • Cormen, T., Leiserson, C., Rivest, R. and Stein, C. (2002). Introduction to Algorithms. McGraw-Hill, New York.
  • Cuevas, A., Febrero, M. and Fraiman, R. (2000). Estimating the number of clusters. Canad. J. Statist. 28.
  • Cuevas, A. and Fraiman, R. (1997). A plug-in approach to support estimation. Ann. Statist. 25 2300–2312.
  • Cuevas, A., González-Manteiga, W. and Rodríguez-Casal, A. (2006). Plug-in estimation of general level sets. Aust. N. Z. J. Stat. 48 7–19.
  • Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York.
  • Devroye, L. and Wise, G. (1980). Detection of abnormal behavior via nonparametric estimation of the support. SIAM J. Appl. Math. 38 480–488.
  • Einmahl, U. and Mason, D. (2005). Uniform in bandwidth consistency of kernel-type function estimators. Ann. Statist. 33 1380–1403.
  • Evans, L. and Gariepy, R. (1992). Measure Theory and Fine Properties of Functions. CRC Press, Boca Raton, FL.
  • Falconer, K. (2003). Fractal Geometry: Mathematical Foundations and Applications. Wiley, Hoboken, NJ.
  • Federer, H. (1969). Geometric Measure Theory. Springer, New York.
  • Giné, E. and Guillou, A. (2002). Rates of strong uniform consistency for multivariate kernel density estimators. Ann. Inst. H. Poincaré Probab. Statist. 38 907–921.
  • Giné, E. and Koltchinskii, V. (2006). Empirical graph Laplacian approximation of Laplace–Beltrami operators: Large sample results. In High Dimensional Probability (E. Giné, V. Koltchinskii, W. Li and J. Zinn, eds.) 238–259. IMS, Beachwood, OH.
  • Györfi, L., Kohler, M., Krzyżak, A. and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer, New York.
  • Hartigan, J. (1975). Clustering Algorithms. Wiley, New York.
  • Jang, W. and Hendry, M. (2007). Cluster analysis of massive datasets in astronomy. Stat. Comput. 17 253–262.
  • Korostelev, A. and Tsybakov, A. (1993). Minimax Theory of Image Reconstruction. Springer, New York.
  • Lange, T., Roth, V., Braun, M. and Buhmann, J. (2004). Stability-based validation of clustering solutions. Neural Comput. 16 1299–1323.
  • Lee, J. (2003). Introduction to Smooth Manifolds. Springer, New York.
  • Leoni, G. and Fonseca, I. (2007). Modern Methods in the Calculus of Variations: Lp Spaces. Springer, New York.
  • Mammen, E. and Tsybakov, A. B. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808–1829.
  • Massart, P. (2000). About the constants in Talagrand’s concentration inequalities for empirical processes. Ann. Probab. 28 863–884.
  • Mattila, P. (1999). Geometry of Sets and Measures in Euclidean Spaces: Fractals and Rectifiability. Cambridge Univ. Press, Cambridge.
  • Müller, D. and Sawitzki, G. (1991). Excess mass estimates and tests for multimodality. J. Amer. Statist. Assoc. 86 738–746.
  • Ng, A., Jordan, M. and Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 14 849–856.
  • Niyogi, P., Smale, S. and Weinberger, S. (2008). Finding the homology of submanifolds with high confidence. Discrete Comput. Geom. 38 419–441.
  • Nolan, D. and Pollard, D. (1987). U-processes: Rates of convergence. Ann. Statist. 15 780–799.
  • Polonik, W. (1995). Measuring mass concentrations and estimating density contour clusters – an excess mass approach. Ann. Statist. 23 855–881.
  • Rigollet, P. (2007). Generalization error bounds in semi-supervised classification under the cluster assumption. J. Mach. Learn. Res. 8 1369–1392.
  • Rigollet, P. and Vert, R. (2006). Fast rates for plug-in estimators of density level sets. Available at arXiv:math/0611473.
  • Scott, C. and Nowak, R. (2006). Learning minimum volume sets. J. Mach. Learn. Res. 7 665–704.
  • Singh, A., Nowak, R. and Zhu, X. (2008). Unlabeled data: Now it helps, now it doesn’t. In Advances in Neural Information Processing Systems.
  • Singh, A., Scott, C. and Nowak, R. (2009). Adaptive Hausdorff estimation of density level sets. Ann. Statist. 37 2760–2782.
  • Steinwart, I., Hush, D. and Scovel, C. (2005). A classification framework for anomaly detection. J. Mach. Learn. Res. 6 211–232.
  • Stuetzle, W. and Nugent, R. (2010). A generalized single linkage method for estimating the cluster tree of a density. J. Comput. Graph. Statist. 19 397–418.
  • Tsybakov, A. (1997). On nonparametric estimation of density level sets. Ann. Statist. 25 948–969.
  • Tsybakov, A. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135–166.
  • van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes. Springer, New York.
  • von Luxburg, U. (2007). A tutorial on spectral clustering. Stat. Comput. 17 395–416.
  • Willett, R. and Nowak, R. (2007). Minimax optimal level-set estimation. IEEE Trans. Image Process. 16 2965–2979.