The Annals of Applied Statistics

Prototype selection for interpretable classification

Jacob Bien and Robert Tibshirani



Prototype methods seek a minimal subset of samples that can serve as a distillation or condensed view of a data set. As the size of modern data sets grows, being able to present a domain specialist with a short list of “representative” samples chosen from the data set is of increasing interpretative value. While much recent statistical research has been focused on producing sparse-in-the-variables methods, this paper aims at achieving sparsity in the samples.

We discuss a method for selecting prototypes in the classification setting (in which the samples fall into known discrete categories). Our method of focus is derived from three basic properties that we believe a good prototype set should satisfy. This intuition is translated into a set cover optimization problem, which we solve approximately using standard approaches. While prototype selection is usually viewed as purely a means toward building an efficient classifier, in this paper we emphasize the inherent value of having a set of prototypical elements. That said, by using the nearest-neighbor rule on the set of prototypes, we can of course discuss our method as a classifier as well.
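The standard approximate approach to set cover mentioned above is the greedy heuristic: repeatedly add the set covering the most still-uncovered elements. The sketch below assumes a simplified formulation in which each sample is a candidate prototype that "covers" the same-class samples within dissimilarity eps of it; the radius-based neighborhoods and function names are illustrative, not the paper's exact integer program.

```python
import numpy as np

def greedy_set_cover(sets, universe):
    """Classical greedy approximation to set cover: repeatedly pick
    the set covering the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        gains = [len(s & uncovered) for s in sets]
        best = int(np.argmax(gains))
        if gains[best] == 0:
            break  # remaining elements cannot be covered
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

def select_prototypes(D, y, eps):
    """Illustrative per-class prototype selection: each sample is a
    candidate prototype whose ball of radius eps covers the same-class
    samples.  D is an n x n dissimilarity matrix, y the class labels."""
    prototypes = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        # set j covers the same-class points within eps of candidate j
        sets = [set(i for i in idx if D[j, i] <= eps) for j in idx]
        for j in greedy_set_cover(sets, idx):
            prototypes.append(int(idx[j]))
    return prototypes
```

The greedy rule inherits the classical logarithmic approximation guarantee for set cover, which is why it serves as a reasonable stand-in for solving the integer program exactly.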

We demonstrate the interpretative value of producing prototypes on the well-known USPS ZIP code digits data set and show that as a classifier it performs reasonably well. We apply the method to a proteomics data set in which the samples are strings and therefore not naturally embedded in a vector space. Our method is compatible with any dissimilarity measure, making it amenable to situations in which using a non-Euclidean metric is desirable or even necessary.
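Because the method needs only pairwise dissimilarities, classifying a new sample reduces to the nearest-neighbor rule over the prototype set under whatever dissimilarity fits the data. A minimal sketch for string data, using plain edit distance as a stand-in dissimilarity (the proteomics application would use its own string dissimilarity; all names here are illustrative):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # deletion, insertion, or substitution/match
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def classify_nearest_prototype(x, prototypes, labels, dissim):
    """Assign x the label of its nearest prototype under dissim;
    any (even non-metric) dissimilarity function can be plugged in."""
    best = min(prototypes, key=lambda p: dissim(x, p))
    return labels[best]
```

For example, with prototypes `["ACGT", "TTTT"]` labeled `"A"` and `"B"`, the query `"ACGA"` is one edit from the first prototype and is labeled `"A"`.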

Article information

Ann. Appl. Stat., Volume 5, Number 4 (2011), 2403-2424.

First available in Project Euclid: 20 December 2011


Keywords: classification; prototypes; nearest neighbors; set cover; integer program


Bien, Jacob; Tibshirani, Robert. Prototype selection for interpretable classification. Ann. Appl. Stat. 5 (2011), no. 4, 2403--2424. doi:10.1214/11-AOAS495.


