The Annals of Statistics

Robust nearest-neighbor methods for classifying high-dimensional data

Yao-ban Chan and Peter Hall

Abstract

We suggest a robust nearest-neighbor approach to classifying high-dimensional data. The method enhances sensitivity by employing a threshold, and truncates the data to a sequence of zeros and ones in order to reduce the deleterious impact of heavy-tailed data. Empirical rules are suggested for choosing the threshold. They require the bare minimum of data: only one data vector is needed from each population. Theoretical and numerical aspects of performance are explored, paying particular attention to the impacts of correlation and heterogeneity among data components. On the theoretical side, it is shown that our truncated, thresholded nearest-neighbor classifier enjoys the same classification boundary as more conventional, nonrobust approaches, which require finite moments in order to achieve good performance. In particular, the greater robustness of our approach does not come at the price of reduced effectiveness. Moreover, when both training sample sizes equal 1, our new method can have performance equal to that of optimal classifiers that require independent and identically distributed data with known marginal distributions; yet our classifier does not itself need conditions of this type.
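
The abstract describes the classifier only at a high level. As a concrete illustration, here is a minimal Python sketch of one plausible reading: each component of a data vector is truncated to 1 when it exceeds a threshold t in absolute value (0 otherwise), and a new vector is assigned to the population whose truncated training vector is nearer in Hamming distance. The function names, the use of absolute values, the tie-breaking rule, and the simulated data are illustrative assumptions, not the paper's exact construction; in particular, the paper's empirical rules for choosing t are not reproduced here.

```python
import numpy as np

def truncate(x, t):
    # Map each component to 1 if it exceeds the threshold t in absolute
    # value, and to 0 otherwise.  (Assumed reading of "truncates to a
    # sequence of zeros and ones"; the paper's exact rule may differ.)
    return (np.abs(np.asarray(x)) > t).astype(int)

def nn_classify(z, x, y, t):
    # Assign z to population I or II by nearest neighbor, in Hamming
    # distance, among the truncated training vectors.  A single training
    # vector per population suffices, matching the abstract's setting.
    zt, xt, yt = truncate(z, t), truncate(x, t), truncate(y, t)
    d1 = int(np.sum(zt != xt))  # Hamming distance to population I
    d2 = int(np.sum(zt != yt))  # Hamming distance to population II
    return "I" if d1 <= d2 else "II"  # tie broken toward I (arbitrary)

# Toy example: heavy-tailed (Student-t, 2 d.f.) components, with the
# first 50 components shifted by 3 in population I.  All hypothetical.
rng = np.random.default_rng(0)
p = 1000
shift = np.where(np.arange(p) < 50, 3.0, 0.0)
x = rng.standard_t(df=2, size=p) + shift   # training vector, population I
y = rng.standard_t(df=2, size=p)           # training vector, population II
z = rng.standard_t(df=2, size=p) + shift   # new vector, truly from I
print(nn_classify(z, x, y, t=2.0))         # expected output: "I"
```

Binarizing before computing distances is the source of the robustness claimed in the abstract: a single enormous component can move a Euclidean nearest-neighbor rule arbitrarily far, but can change a Hamming distance by at most 1.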

Article information

Source
Ann. Statist., Volume 37, Number 6A (2009), 3186–3203.

Dates
First available in Project Euclid: 17 August 2009

Permanent link to this document
https://projecteuclid.org/euclid.aos/1250515384

Digital Object Identifier
doi:10.1214/08-AOS591

Mathematical Reviews number (MathSciNet)
MR2549557

Zentralblatt MATH identifier
1191.62113

Subjects
Primary: 62H30: Classification and discrimination; cluster analysis [See also 68T10, 91C20]

Keywords
Classification boundary; detection boundary; false discovery rate; heterogeneous components; higher criticism; optimal classification; threshold; zero–one data

Citation

Chan, Yao-ban; Hall, Peter. Robust nearest-neighbor methods for classifying high-dimensional data. Ann. Statist. 37 (2009), no. 6A, 3186–3203. doi:10.1214/08-AOS591. https://projecteuclid.org/euclid.aos/1250515384

