Source: Ann. Statist. Volume 33, Number 1 (2005), 284-306.
It is shown that, for kernel-based classification with univariate distributions and two populations, optimal bandwidth choice has a dichotomous character. If the two densities cross at just one point, where their curvatures have the same signs, then minimum Bayes risk is achieved using bandwidths which are an order of magnitude larger than those which minimize pointwise estimation error. On the other hand, if the curvature signs are different, or if there are multiple crossing points, then bandwidths of conventional size are generally appropriate. The range of different modes of behavior is narrower in multivariate settings. There, the optimal size of bandwidth is generally the same as that which is appropriate for pointwise density estimation. These properties motivate empirical rules for bandwidth choice.
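As a rough illustration of the setting described above, the following is a minimal sketch of a two-population kernel classifier in one dimension, in which the bandwidths are the tuning parameters whose size is at issue. It assumes a Gaussian kernel and equal prior probabilities by default. The function names `kde_gaussian` and `kernel_classify`, the simulated normal samples and the bandwidth value 0.5 are all hypothetical choices made for illustration. The sketch does not implement the bandwidth rules studied in the paper.

```python
import numpy as np

def kde_gaussian(x, sample, h):
    """Gaussian-kernel density estimate at the points x, using bandwidth h."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    sample = np.asarray(sample, dtype=float)
    # Scaled differences between every evaluation point and every data point.
    u = (x[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(sample) * h * np.sqrt(2.0 * np.pi))

def kernel_classify(x, sample_f, sample_g, h_f, h_g, prior_f=0.5):
    """Assign label 0 where the prior-weighted estimate of f is at least
    that of g, and label 1 otherwise."""
    score_f = prior_f * kde_gaussian(x, sample_f, h_f)
    score_g = (1.0 - prior_f) * kde_gaussian(x, sample_g, h_g)
    return np.where(score_f >= score_g, 0, 1)

# Illustrative data only: two normal populations whose densities cross once.
rng = np.random.default_rng(0)
sample_f = rng.normal(-1.0, 1.0, size=200)
sample_g = rng.normal(1.0, 1.0, size=200)

labels = kernel_classify([-3.0, 3.0], sample_f, sample_g, h_f=0.5, h_g=0.5)
```

Varying `h_f` and `h_g` changes where the two estimated densities cross, and so changes the decision boundary. That is the mechanism through which bandwidth choice affects the classification risk.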