The Annals of Applied Statistics

An outlier map for Support Vector Machine classification

Michiel Debruyne

Full-text: Open access

Abstract

Support Vector Machines are a widely used classification technique. They are computationally efficient and provide excellent predictions even for high-dimensional data. Moreover, Support Vector Machines are very flexible due to the incorporation of kernel functions, which make it possible to model nonlinearity and also to handle nonnumerical data such as protein strings. However, Support Vector Machines can suffer considerably from unclean data containing, for example, outliers or mislabeled observations. Although several outlier detection schemes have been proposed in the literature, the selection of outliers versus nonoutliers is often rather ad hoc and does not provide much insight into the data. In robust multivariate statistics, outlier maps are popular tools for assessing the quality of the data under consideration. They provide a visual representation of the data that depicts several types of outliers. This paper proposes an outlier map designed for Support Vector Machine classification. The Stahel–Donoho outlyingness measure from multivariate statistics is extended to an arbitrary kernel space. A trimmed version of Support Vector Machines is defined by trimming the part of the samples with the largest outlyingness. Based on this classifier, an outlier map is constructed that visualizes data in any type of high-dimensional kernel space. The outlier map is illustrated on four biological examples, showing its use in exploratory data analysis.
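As a rough illustration of the outlyingness measure named in the abstract, the sketch below approximates the classical Stahel–Donoho outlyingness in ordinary input space: for each observation, the largest robustly standardized distance |a'x - median(a'X)| / MAD(a'X) over a set of directions a. This is not the paper's kernel-space extension or its trimmed classifier, which the abstract does not spell out. The function name, the use of random directions, the number of directions and the toy data are all assumptions made for this example.

```python
import math
import random
import statistics


def stahel_donoho_outlyingness(points, n_directions=500, seed=0):
    """Approximate the classical Stahel-Donoho outlyingness of each point.

    Hypothetical helper for illustration only: the supremum over all
    directions is approximated by a maximum over random unit directions.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    scores = [0.0] * len(points)
    for _ in range(n_directions):
        # Draw a random unit direction by normalising a Gaussian vector.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        if norm == 0.0:
            continue
        direction = [d / norm for d in direction]
        # Project every observation onto the direction.
        projections = [sum(d * x for d, x in zip(direction, p)) for p in points]
        # Robust location (median) and scale (median absolute deviation).
        center = statistics.median(projections)
        mad = statistics.median([abs(z - center) for z in projections])
        if mad == 0.0:
            continue  # skip degenerate directions
        # Keep the largest standardized distance seen so far for each point.
        for i, z in enumerate(projections):
            scores[i] = max(scores[i], abs(z - center) / mad)
    return scores


# Toy data (an assumption): a Gaussian cloud plus one planted outlier.
data_rng = random.Random(1)
cloud = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(50)]
data = cloud + [[10.0, 10.0]]
scores = stahel_donoho_outlyingness(data)
most_outlying = scores.index(max(scores))
```

In this toy setting the planted point at index 50 receives the largest score. A trimming step in the spirit of the abstract would then set aside the observations with the largest scores before fitting a classifier, though the exact rule used in the paper is not given here.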

Article information

Source
Ann. Appl. Stat., Volume 3, Number 4 (2009), 1566-1580.

Dates
First available in Project Euclid: 1 March 2010

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1267453953

Digital Object Identifier
doi:10.1214/09-AOAS256

Mathematical Reviews number (MathSciNet)
MR2752147

Zentralblatt MATH identifier
1185.62112

Keywords
Support Vector Machine; high-dimensional data analysis; robust statistics; data visualization

Citation

Debruyne, Michiel. An outlier map for Support Vector Machine classification. Ann. Appl. Stat. 3 (2009), no. 4, 1566–1580. doi:10.1214/09-AOAS256. https://projecteuclid.org/euclid.aoas/1267453953


