## The Annals of Applied Statistics

- Ann. Appl. Stat.
- Volume 5, Number 4 (2011), 2403-2424.

### Prototype selection for interpretable classification

Jacob Bien and Robert Tibshirani

#### Abstract

Prototype methods seek a minimal subset of samples that can serve as a distillation or condensed view of a data set. As the size of modern data sets grows, being able to present a domain specialist with a short list of “representative” samples chosen from the data set is of increasing interpretative value. While much recent statistical research has focused on producing sparse-in-the-variables methods, this paper aims at achieving sparsity in the samples.

We discuss a method for selecting prototypes in the classification setting (in which the samples fall into known discrete categories). The method we focus on is derived from three basic properties that we believe a good prototype set should satisfy. This intuition is translated into a set cover optimization problem, which we solve approximately using standard approaches. While prototype selection is usually viewed as purely a means toward building an efficient classifier, in this paper we emphasize the inherent value of having a set of prototypical elements. That said, applying the nearest-neighbor rule to the set of prototypes also lets us treat our method as a classifier.
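To make the set cover idea concrete, here is a minimal sketch of a greedy prototype selector followed by a nearest-prototype classifier. The abstract does not spell out the three properties or the exact objective, so everything specific here is an assumption made for illustration: the ball radius `eps`, the per-prototype `cost`, the scoring rule (newly covered same-class points minus other-class points inside the ball), and the function names `select_prototypes` and `nearest_prototype_predict`. It should not be read as the authors' algorithm.

```python
import numpy as np

def select_prototypes(D, y, eps, cost=0.0):
    """Greedy set-cover-style prototype selection (illustrative sketch).

    D    : (n, n) array of pairwise dissimilarities between training points
    y    : (n,) array of class labels
    eps  : assumed radius of the ball that each prototype covers
    cost : assumed non-negative penalty per prototype (favors smaller sets)
    """
    D = np.asarray(D, dtype=float)
    y = np.asarray(y)
    within = D <= eps                    # within[j, i]: point i lies in the ball of j
    same = y[:, None] == y[None, :]      # same[j, i]: points i and j share a class
    covered = np.zeros(len(y), dtype=bool)
    prototypes = []
    while True:
        # Same-class points that candidate j would newly cover.
        gain = (within & same & ~covered[None, :]).sum(axis=1)
        # Other-class points that fall inside candidate j's ball.
        penalty = (within & ~same).sum(axis=1)
        score = (gain - penalty).astype(float) - cost
        j = int(np.argmax(score))
        if score[j] <= 0:                # no candidate improves the cover
            break
        prototypes.append(j)
        covered |= within[j] & same[j]
    return prototypes

def nearest_prototype_predict(D_new, prototypes, y):
    """1-NN rule restricted to prototypes.

    D_new[i, j] is the dissimilarity from new point i to training point j.
    """
    D_new = np.asarray(D_new, dtype=float)
    nearest = np.argmin(D_new[:, prototypes], axis=1)
    return np.asarray(y)[np.asarray(prototypes)[nearest]]
```

Because the sketch only ever touches the matrix `D`, it works with any dissimilarity, not just Euclidean distance. The greedy loop is one standard way to approximate a set cover problem; other relaxations or roundings could be substituted.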

We demonstrate the interpretative value of producing prototypes on the well-known USPS ZIP code digits data set and show that the method performs reasonably well as a classifier. We also apply the method to a proteomics data set in which the samples are strings and therefore not naturally embedded in a vector space. Our method is compatible with any dissimilarity measure, making it amenable to situations in which using a non-Euclidean metric is desirable or even necessary.
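As a small illustration of what "compatible with any dissimilarity measure" can mean for string data, the sketch below computes pairwise dissimilarities between strings and assigns a query to its nearest prototype. The abstract does not say which dissimilarity is used for the proteomics data; the unit-cost edit (Levenshtein) distance, the toy strings, and the helper names are all assumptions chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b with unit costs."""
    prev = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete ca
                curr[j - 1] + 1,         # insert cb
                prev[j - 1] + (ca != cb) # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def dissimilarity_matrix(items, dissim):
    """Pairwise dissimilarities for arbitrary objects (no vector space needed)."""
    return [[dissim(a, b) for b in items] for a in items]

def nearest_prototype(query, prototypes, labels, dissim):
    """Return the label of the prototype closest to the query."""
    best = min(range(len(prototypes)), key=lambda k: dissim(query, prototypes[k]))
    return labels[best]

# Hypothetical toy strings, not taken from the paper's data.
prototypes = ["ACGA", "TTTT"]
labels = ["class_a", "class_b"]
print(nearest_prototype("ACGT", prototypes, labels, edit_distance))
```

Any function of two samples that returns a non-negative number can be passed as `dissim`, which is the only place the data type matters.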

#### Article information

**Source**

Ann. Appl. Stat., Volume 5, Number 4 (2011), 2403-2424.

**Dates**

First available in Project Euclid: 20 December 2011

**Permanent link to this document**

https://projecteuclid.org/euclid.aoas/1324399600

**Digital Object Identifier**

doi:10.1214/11-AOAS495

**Mathematical Reviews number (MathSciNet)**

MR2907120

**Zentralblatt MATH identifier**

1234.62096

**Keywords**

Classification; prototypes; nearest neighbors; set cover; integer program

#### Citation

Bien, Jacob; Tibshirani, Robert. Prototype selection for interpretable classification. Ann. Appl. Stat. 5 (2011), no. 4, 2403-2424. doi:10.1214/11-AOAS495. https://projecteuclid.org/euclid.aoas/1324399600