Electronic Journal of Statistics

A weighted k-nearest neighbor density estimate for geometric inference

Gérard Biau, Frédéric Chazal, David Cohen-Steiner, Luc Devroye, and Carlos Rodríguez

Abstract

Motivated by a broad range of potential applications in topological and geometric inference, we introduce a weighted version of the k-nearest neighbor density estimate. Various pointwise consistency results for this estimate are established. We present a general central limit theorem under the lightest possible conditions. In addition, a strong approximation result is obtained, and the choice of the optimal set of weights is discussed. In particular, we show that the classical k-nearest neighbor estimate is not optimal in a sense described in the manuscript. The proposed method is implemented to recover level sets in both simulated and real-life data sets.
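
For readers who want to experiment, the following sketch (in Python, not part of the published article) illustrates one natural reading of a weighted k-nearest neighbor density estimate: a convex combination of the ordinary i-nearest neighbor estimates i/(n V_d R_i(x)^d), where V_d is the volume of the unit ball in R^d and R_i(x) is the distance from x to its i-th nearest sample point. The function names, the assumed form of the estimator, and the uniform-weight example are illustrative assumptions, not taken from the paper.

import numpy as np
from math import gamma, pi

def unit_ball_volume(d):
    # Lebesgue volume of the unit ball in R^d: pi^(d/2) / Gamma(d/2 + 1).
    return pi ** (d / 2) / gamma(d / 2 + 1)

def weighted_knn_density(x, sample, weights):
    # Hypothetical helper: weighted k-NN density estimate at a query point x.
    # `sample`: (n, d) array of observations; `weights`: length-n nonnegative
    # vector summing to one.  The estimator is assumed here to be the convex
    # combination sum_i weights[i-1] * i / (n * V_d * R_i(x)^d), where R_i(x)
    # is the distance from x to its i-th nearest sample point.  Putting all
    # weight on rank k recovers the classical estimate k / (n * V_d * R_k^d).
    sample = np.asarray(sample, dtype=float)
    n, d = sample.shape
    v_d = unit_ball_volume(d)
    # Sorted distances: r[i-1] is the distance to the i-th nearest neighbor.
    r = np.sort(np.linalg.norm(sample - np.asarray(x, dtype=float), axis=1))
    ranks = np.arange(1, n + 1)
    return float(np.sum(np.asarray(weights) * ranks / (n * v_d * r ** d)))

# Illustrative usage: uniform weights over the first k ranks.
rng = np.random.default_rng(0)
data = rng.standard_normal((500, 2))
k = 25
w = np.zeros(len(data))
w[:k] = 1.0 / k
print(weighted_knn_density([0.0, 0.0], data, w))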

Article information

Source
Electron. J. Statist., Volume 5 (2011), 204-237.

Dates
First available in Project Euclid: 14 April 2011

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1302784854

Digital Object Identifier
doi:10.1214/11-EJS606

Mathematical Reviews number (MathSciNet)
MR2792552

Zentralblatt MATH identifier
1274.62264

Subjects
Primary: 62G07: Density estimation; 62G05: Estimation
Secondary: 62G20: Asymptotic properties

Keywords
Geometric inference; level sets; density estimation; k-nearest neighbor estimate; weighted estimate; consistency; rates of convergence; central limit theorem; strong approximation

Citation

Biau, Gérard; Chazal, Frédéric; Cohen-Steiner, David; Devroye, Luc; Rodríguez, Carlos. A weighted k-nearest neighbor density estimate for geometric inference. Electron. J. Statist. 5 (2011), 204-237. doi:10.1214/11-EJS606. https://projecteuclid.org/euclid.ejs/1302784854

