## Electronic Journal of Statistics

### Bandwidth selection for kernel density estimators of multivariate level sets and highest density regions

#### Abstract

We consider bandwidth matrix selection for kernel density estimators (KDEs) of density level sets in $\mathbb{R}^{d}$, $d \ge 2$. We also consider estimation of highest density regions, which differs from level set estimation in that one specifies the probability content of the set rather than the level directly; this complicates the problem. Bandwidth selection for KDEs is well studied, but most methods aim to minimize a global loss function for the density or its derivatives. The loss we consider here is instead the measure of the symmetric difference of the true set and the estimated set. We derive an asymptotic approximation to the corresponding risk. The approximation depends on unknown quantities that can be estimated, and it can then be minimized to yield a choice of bandwidth, which we show in simulations performs well. We provide an R package, lsbs, that implements our procedure.
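To make the loss concrete, the following is a minimal Python sketch of plug-in highest density region estimation and a Monte Carlo estimate of the symmetric-difference loss. It is an illustration under stated assumptions, not the paper's procedure and not the lsbs API: the true density is taken to be a standard bivariate normal (so the true level for probability content $1-\tau$ is $\tau/(2\pi)$), the bandwidth matrix is restricted to $h^{2}I$ with a scalar $h$, the level is estimated by a sample quantile of fitted density values, and the "measure" of the symmetric difference is taken to be probability under the true density. All function names are hypothetical.

```python
import math
import random


def kde(data, h):
    """2-D Gaussian kernel density estimate with bandwidth matrix h^2 * I (assumption)."""
    norm = 1.0 / (len(data) * 2.0 * math.pi * h * h)

    def f_hat(x):
        total = 0.0
        for xi in data:
            u0 = (x[0] - xi[0]) / h
            u1 = (x[1] - xi[1]) / h
            total += math.exp(-0.5 * (u0 * u0 + u1 * u1))
        return norm * total

    return f_hat


def true_density(x):
    """Standard bivariate normal density (the assumed truth in this sketch)."""
    return math.exp(-0.5 * (x[0] ** 2 + x[1] ** 2)) / (2.0 * math.pi)


def hdr_level(f_hat, data, tau):
    """Estimate the level whose upper level set has content about 1 - tau,
    using the tau-quantile of the fitted density at the sample points."""
    values = sorted(f_hat(x) for x in data)
    k = int(math.floor(tau * len(values)))
    return values[min(k, len(values) - 1)]


def symmetric_difference_loss(f_hat, c_hat, c_true, n_mc, rng):
    """Monte Carlo estimate of the true-density probability of the
    symmetric difference between {f >= c_true} and {f_hat >= c_hat}."""
    mismatches = 0
    for _ in range(n_mc):
        x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        in_true = true_density(x) >= c_true
        in_hat = f_hat(x) >= c_hat
        if in_true != in_hat:
            mismatches += 1
    return mismatches / n_mc


rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(150)]

tau = 0.5
# For the standard bivariate normal, P(f(X) >= c) = 1 - 2*pi*c, so c = tau / (2*pi).
c_true = tau / (2.0 * math.pi)

candidates = [0.1, 0.4, 1.5]
losses = {}
for h in candidates:
    f_hat = kde(data, h)
    c_hat = hdr_level(f_hat, data, tau)
    losses[h] = symmetric_difference_loss(f_hat, c_hat, c_true, 1000, rng)

# Oracle choice: only possible here because the true density is known.
best_h = min(losses, key=losses.get)
print(losses, best_h)
```

The grid search above is an oracle comparison that requires the true density. The approach described in the abstract instead minimizes an estimated asymptotic approximation to the risk, which does not require knowing the truth.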

#### Article information

Source
Electron. J. Statist., Volume 12, Number 2 (2018), 4313-4376.

Dates
First available in Project Euclid: 18 December 2018

https://projecteuclid.org/euclid.ejs/1545123626

Digital Object Identifier
doi:10.1214/18-EJS1501

Subjects
Primary: 62G07: Density estimation

#### Citation

Doss, Charles R.; Weng, Guangwei. Bandwidth selection for kernel density estimators of multivariate level sets and highest density regions. Electron. J. Statist. 12 (2018), no. 2, 4313–4376. doi:10.1214/18-EJS1501. https://projecteuclid.org/euclid.ejs/1545123626
