Open Access
December 2011
Prototype selection for interpretable classification
Jacob Bien, Robert Tibshirani
Ann. Appl. Stat. 5(4): 2403-2424 (December 2011). DOI: 10.1214/11-AOAS495

Abstract

Prototype methods seek a minimal subset of samples that can serve as a distillation or condensed view of a data set. As the size of modern data sets grows, being able to present a domain specialist with a short list of “representative” samples chosen from the data set is of increasing interpretative value. While much recent statistical research has been focused on producing sparse-in-the-variables methods, this paper aims at achieving sparsity in the samples.

We discuss a method for selecting prototypes in the classification setting (in which the samples fall into known discrete categories). Our method of focus is derived from three basic properties that we believe a good prototype set should satisfy. This intuition is translated into a set cover optimization problem, which we solve approximately using standard approaches. While prototype selection is usually viewed as purely a means toward building an efficient classifier, in this paper we emphasize the inherent value of having a set of prototypical elements. That said, by using the nearest-neighbor rule on the set of prototypes, we can of course discuss our method as a classifier as well.
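To make the set cover idea concrete, the following is a minimal sketch of a generic greedy set cover heuristic for choosing prototypes, followed by the nearest-neighbor rule applied to the chosen set. It is not the paper's exact formulation, which the abstract does not spell out: the function names, the covering radius `eps`, and the scoring rule (newly covered same-class samples minus other-class samples within the radius) are all assumptions made for illustration.

```python
def greedy_prototypes(D, y, eps):
    """Greedy set-cover style prototype selection (illustrative sketch only).

    D   : n x n list of lists of pairwise dissimilarities
    y   : list of n class labels
    eps : assumed radius within which a prototype "covers" a sample
    """
    n = len(y)
    # covers[j]: same-class samples lying within eps of candidate j
    covers = [{i for i in range(n) if D[j][i] <= eps and y[i] == y[j]}
              for j in range(n)]
    # wrong[j]: number of other-class samples lying within eps of candidate j
    wrong = [sum(1 for i in range(n) if D[j][i] <= eps and y[i] != y[j])
             for j in range(n)]
    uncovered = set(range(n))
    chosen = []
    while uncovered:
        # Score each candidate: newly covered samples minus wrongly covered ones.
        gains = [len(covers[j] & uncovered) - wrong[j] for j in range(n)]
        j = max(range(n), key=gains.__getitem__)
        if gains[j] <= 0:
            break  # no remaining candidate is worth adding
        chosen.append(j)
        uncovered -= covers[j]
    return chosen


def nearest_prototype_predict(d_to_prototypes, prototype_labels):
    """Assign each new sample the label of its closest prototype (1-NN rule).

    d_to_prototypes : list of rows, one per new sample, holding its
                      dissimilarity to each prototype
    """
    preds = []
    for row in d_to_prototypes:
        k = min(range(len(row)), key=row.__getitem__)
        preds.append(prototype_labels[k])
    return preds


# Toy usage: two well-separated groups on a line.
x = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
y = [0, 0, 0, 1, 1, 1]
D = [[abs(a - b) for b in x] for a in x]
protos = greedy_prototypes(D, y, eps=0.25)
proto_labels = [y[j] for j in protos]
new_x = [0.05, 4.9]
d_new = [[abs(v - x[j]) for j in protos] for v in new_x]
predictions = nearest_prototype_predict(d_new, proto_labels)
```

Because the sketch only reads entries of `D`, it does not care how the dissimilarities were computed. The greedy rule is a standard approximation for set cover problems; other approximate solvers could be substituted.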

We demonstrate the interpretative value of producing prototypes on the well-known USPS ZIP code digits data set and show that as a classifier it performs reasonably well. We apply the method to a proteomics data set in which the samples are strings and therefore not naturally embedded in a vector space. Our method is compatible with any dissimilarity measure, making it amenable to situations in which using a non-Euclidean metric is desirable or even necessary.
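As a rough illustration of working with samples that are strings rather than vectors, the sketch below builds a pairwise dissimilarity matrix from an edit distance. The choice of Levenshtein distance is an assumption for this example; the abstract does not say which dissimilarity is used for the proteomics data. The helper names and the toy strings are likewise hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, by dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def pairwise(samples, dissim):
    """Matrix of dissimilarities between all pairs of samples."""
    return [[dissim(s, t) for t in samples] for s in samples]


# Toy strings standing in for sequence data (made up for illustration).
samples = ["ACDE", "ACDF", "WWYY"]
D = pairwise(samples, edit_distance)
```

Any procedure that only needs pairwise dissimilarities can take a matrix like `D` as input, with no vector-space embedding of the strings required.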

Citation

Jacob Bien, Robert Tibshirani. "Prototype selection for interpretable classification." Ann. Appl. Stat. 5 (4): 2403-2424, December 2011. https://doi.org/10.1214/11-AOAS495

Information

Published: December 2011
First available in Project Euclid: 20 December 2011

zbMATH: 1234.62096
MathSciNet: MR2907120
Digital Object Identifier: 10.1214/11-AOAS495

Keywords: classification, integer program, nearest neighbors, prototypes, set cover

Rights: Copyright © 2011 Institute of Mathematical Statistics
