The Annals of Statistics

Optimal properties of centroid-based classifiers for very high-dimensional data

Peter Hall and Tung Pham

Full-text: Open access

Abstract

We show that scale-adjusted versions of the centroid-based classifier enjoy optimal properties when used to discriminate between two very high-dimensional populations where the principal differences are in location. The scale adjustment removes the tendency of scale differences to confound differences in means. Certain other distance-based methods, for example, those founded on nearest-neighbor distance, do not have optimal performance in the sense that we propose. Our results permit varying degrees of sparsity and signal strength to be treated, and require only mild conditions on dependence of vector components. Additionally, we permit the marginal distributions of vector components to vary extensively. In addition to providing theory, we explore numerical properties of a centroid-based classifier, and show that these features reflect theoretical accounts of performance.
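
The idea of a scale-adjusted centroid rule can be illustrated with a minimal sketch. The abstract does not state the classifier's formula, so the adjustment used here is an assumption: each squared distance to a sample centroid is reduced by an estimate of tr(Σ)/n, the part of that distance that comes from sampling variability of the centroid rather than from a difference in means. The function name and the toy data are hypothetical, and this should not be read as the authors' exact procedure.

```python
import numpy as np


def scale_adjusted_centroid_classify(z, X, Y):
    """Assign z to the X population (return 0) or the Y population (return 1).

    X and Y are (n, p) arrays of training vectors; z is a length-p vector.
    The adjustment terms are an assumed form, not taken from the abstract.
    """
    n_x, n_y = X.shape[0], Y.shape[0]
    x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)

    # Squared Euclidean distances from z to each sample centroid.
    d_x = np.sum((z - x_bar) ** 2)
    d_y = np.sum((z - y_bar) ** 2)

    # Assumed scale adjustment: an estimate of tr(Sigma) / n for each sample,
    # using component-wise sample variances (ddof=1).
    adj_x = np.sum(X.var(axis=0, ddof=1)) / n_x
    adj_y = np.sum(Y.var(axis=0, ddof=1)) / n_y

    # Negative statistic: z is closer, after adjustment, to the X centroid.
    stat = (d_x - adj_x) - (d_y - adj_y)
    return 0 if stat < 0 else 1


# Toy data (hypothetical): two small samples with well-separated means.
X = np.array([[0.0, 0.0], [2.0, 0.0]])
Y = np.array([[10.0, 10.0], [12.0, 10.0]])
label_near_x = scale_adjusted_centroid_classify(np.array([1.0, 0.0]), X, Y)
label_near_y = scale_adjusted_centroid_classify(np.array([11.0, 10.0]), X, Y)
```

Without the two adjustment terms, a sample with larger component variances or a smaller sample size would inflate its own distance term, which is the kind of confounding between scale and location that the abstract describes.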

Article information

Source
Ann. Statist., Volume 38, Number 2 (2010), 1071-1093.

Dates
First available in Project Euclid: 19 February 2010

Permanent link to this document
https://projecteuclid.org/euclid.aos/1266586623

Digital Object Identifier
doi:10.1214/09-AOS736

Mathematical Reviews number (MathSciNet)
MR2604705

Zentralblatt MATH identifier
1183.62104

Subjects
Primary: 62H30: Classification and discrimination; cluster analysis [See also 68T10, 91C20]

Keywords
Centroid method; classification; discrimination; distance-based classifiers; high-dimensional data; location differences; minimax performance; scale adjustment; sparsity

Citation

Hall, Peter; Pham, Tung. Optimal properties of centroid-based classifiers for very high-dimensional data. Ann. Statist. 38 (2010), no. 2, 1071–1093. doi:10.1214/09-AOS736. https://projecteuclid.org/euclid.aos/1266586623
