Annals of Applied Statistics

Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model

Benjamin Letham, Cynthia Rudin, Tyler H. McCormick, and David Madigan

Full-text: Open access


We aim to produce predictive models that are not only accurate, but are also interpretable to human experts. Our models are decision lists, which consist of a series of if-then statements (e.g., if high blood pressure, then stroke) that discretize a high-dimensional, multivariate feature space into a series of simple, readily interpretable decision statements. We introduce a generative model called Bayesian Rule Lists that yields a posterior distribution over possible decision lists. It employs a novel prior structure to encourage sparsity. Our experiments show that Bayesian Rule Lists has predictive accuracy on par with the current top algorithms for prediction in machine learning. Our method is motivated by recent developments in personalized medicine, and can be used to produce highly accurate and interpretable medical scoring systems. We demonstrate this by producing an alternative to the CHADS$_{2}$ score, actively used in clinical practice for estimating the risk of stroke in patients who have atrial fibrillation. Our model is as interpretable as CHADS$_{2}$, but more accurate.
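The decision lists described in the abstract classify by checking an ordered sequence of if-then rules and returning the prediction of the first rule whose condition matches, falling back to a default otherwise. A minimal sketch of that evaluation logic is below; the specific rules, risk values, and patient attributes are illustrative placeholders, not the fitted model from the paper.

```python
# Sketch of decision-list evaluation: the first matching rule fires.
# Rules and risk values here are hypothetical, for illustration only.

def predict_risk(patient, rule_list, default_risk):
    """Return the risk estimate of the first rule whose condition matches."""
    for condition, risk in rule_list:
        if condition(patient):
            return risk
    return default_risk

# Hypothetical rules in priority order, each a (condition, estimated risk) pair.
rules = [
    (lambda p: p["hemiplegia"] and p["age"] > 60, 0.58),
    (lambda p: p["prior_stroke"], 0.47),
    (lambda p: p["age"] > 75, 0.28),
]

patient = {"hemiplegia": False, "prior_stroke": True, "age": 80}
print(predict_risk(patient, rules, default_risk=0.05))  # prints 0.47
```

Note that rule order matters: the patient above also satisfies the age rule, but the earlier prior-stroke rule takes precedence, which is what makes the list readable as a prioritized checklist.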

Article information

Ann. Appl. Stat., Volume 9, Number 3 (2015), 1350-1371.

Received: October 2013
Revised: April 2015
First available in Project Euclid: 2 November 2015

Keywords: Bayesian analysis, classification, interpretability


Letham, Benjamin; Rudin, Cynthia; McCormick, Tyler H.; Madigan, David. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Ann. Appl. Stat. 9 (2015), no. 3, 1350--1371. doi:10.1214/15-AOAS848.

