Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges

Interpretability in machine learning (ML) is crucial for high stakes decisions and troubleshooting. In this work, we provide fundamental principles for interpretable ML, and dispel common misunderstandings that dilute the importance of this crucial topic. We also identify 10 technical challenge areas in interpretable machine learning and provide history and background on each problem. Some of these problems are classically important, and some are recent problems that have arisen in the last few years. These problems are: (1) Optimizing sparse logical models such as decision trees; (2) Optimization of scoring systems; (3) Placing constraints into generalized additive models to encourage sparsity and better interpretability; (4) Modern case-based reasoning, including neural networks and matching for causal inference; (5) Complete supervised disentanglement of neural networks; (6) Complete or even partial unsupervised disentanglement of neural networks; (7) Dimensionality reduction for data visualization; (8) Machine learning models that can incorporate physics and other generative or causal constraints; (9) Characterization of the "Rashomon set" of good models; and (10) Interpretable reinforcement learning. This survey is suitable as a starting point for statisticians and computer scientists interested in working in interpretable machine learning.


Introduction
With widespread use of machine learning (ML), the importance of interpretability has become clear in avoiding catastrophic consequences. Black box predictive models, which by definition are inscrutable, have led to serious societal problems that deeply affect health, freedom, racial bias, and safety. Interpretable predictive models, which are constrained so that their reasoning processes are more understandable to humans, are much easier to troubleshoot and to use in practice. It is universally agreed that interpretability is a key element of trust for AI models [304,257,235,17,287,278,42]. In this survey, we provide fundamental principles, as well as 10 technical challenges in the design of inherently interpretable machine learning models.
Let us provide some background. A black box machine learning model is a formula that is either too complicated for any human to understand, or proprietary, so that one cannot understand its inner workings. Black box models are difficult to troubleshoot, which is particularly problematic for medical data. Black box models often predict the right answer for the wrong reason (the "Clever Hans" phenomenon), leading to excellent performance in training but poor performance in practice [268,172,225,331,19,119]. There are numerous other issues with black box models. In criminal justice, individuals may have been subjected to years of extra prison time due to typographical errors in black box model inputs [317], and poorly-designed proprietary models for air quality have had serious consequences for public safety during wildfires [203]; both of these situations may have been easy to avoid with interpretable models. In cases where the underlying distribution of data changes (called domain shift, which occurs often in practice), problems arise if users cannot troubleshoot the model in real-time, which is much harder with black box models than interpretable models. Determining whether a black box model is fair with respect to gender or racial groups is much more difficult than determining whether an interpretable model has such a bias. In medicine, black box models turn computer-aided decisions into automated decisions, precisely because physicians cannot understand the reasoning processes of black box models. Explaining black boxes, rather than replacing them with interpretable models, can make the problem worse by providing misleading or false characterizations [250,173,171], or adding unnecessary authority to the black box [253]. There is a clear need for innovative machine learning models that are inherently interpretable.
There is now a vast and confusing literature on some combination of interpretability and explainability. Much literature on explainability confounds it with interpretability/comprehensibility, thus obscuring the arguments (and thus detracting from their precision), and failing to convey the relative importance and use-cases of the two topics in practice. Some of the literature discusses topics in such generality that its lessons have little bearing on any specific problem. Some of it aims to design taxonomies that miss vast topics within interpretable ML. Some of it provides definitions that we disagree with. Some of it even provides guidance that could perpetuate bad practice. Importantly, most of it assumes that one would explain a black box without consideration of whether there is an interpretable model of the same accuracy. In what follows, we provide some simple and general guiding principles of interpretable machine learning. These are not meant to be exhaustive. Instead they aim to help readers avoid common but problematic ways of thinking about interpretability in machine learning.
The major part of this survey outlines a set of important and fundamental technical grand challenges in interpretable machine learning. These are both modern and classical challenges, and some are much harder than others. They are all either hard to solve, or difficult to formulate correctly. While there are numerous sociotechnical challenges about model deployment (that can be much more difficult than technical challenges), human-computer interaction challenges, and questions of how robustness and fairness interact with interpretability, those topics can be saved for another day. We begin with the most classical and most canonical problems in interpretable machine learning: how to build sparse models for tabular data, including decision trees (Challenge #1) and scoring systems (Challenge #2). We then delve into a challenge involving additive models (Challenge #3), followed by another in case-based reasoning (Challenge #4), which is another classic topic in interpretable artificial intelligence. We then move to more exotic problems, namely supervised and unsupervised disentanglement of concepts in neural networks (Challenges #5 and #6). Returning to classical problems, we discuss dimension reduction (Challenge #7), and then how to incorporate physics or causal constraints (Challenge #8). Challenge #9 involves understanding, exploring, and measuring the Rashomon set of accurate predictive models. Challenge #10 discusses interpretable reinforcement learning. Table 1 provides a guideline that may help users to match a dataset to a suitable interpretable supervised learning technique. We will touch on all of these techniques in the challenges.

General principles of interpretable machine learning
Our first fundamental principle defines interpretable ML, following [250]:

Principle 1 An interpretable machine learning model obeys a domain-specific set of constraints to allow it (or its predictions, or the data) to be more easily understood by humans. These constraints can differ dramatically depending on the domain.

Table 1 Rule of thumb for the types of data that naturally apply to various supervised learning algorithms. "Clean" means that the data do not have too much noise or systematic bias. "Tabular" means that the features are categorical or real, and that each feature is a meaningful predictor of the output on its own. "Raw" data is unprocessed and has a complex data type, e.g., image data where each pixel is a feature, medical records, or time series data.

Model: Data type
decision trees / decision lists (rule lists) / decision sets: somewhat clean tabular data with interactions, including multiclass problems. Particularly useful for categorical data with complex interactions (i.e., more than quadratic). Robust to outliers.
scoring systems: somewhat clean tabular data, typically used in medicine and criminal justice because they are small enough that they can be memorized by humans.
generalized additive models (GAMs): continuous data with at most quadratic interactions, useful for large-scale medical record data.
case-based reasoning: any data type (different methods exist for different data types), including multiclass problems.
disentangled neural networks: data with raw inputs (computer vision, time series, textual data), suitable for multiclass problems.
A typical interpretable supervised learning setup, with data {z_i}_i, and models chosen from function class F is:

(*)    min_{f ∈ F}  (1/n) Σ_i Loss(f, z_i) + C · InterpretabilityPenalty(f),    subject to InterpretabilityConstraint(f),

where the loss function, as well as soft and hard interpretability constraints, are chosen to match the domain. (For classification z_i might be (x_i, y_i), x_i ∈ R^p, y_i ∈ {−1, 1}.) The goal of these constraints is to make the resulting model f or its predictions more interpretable. While solutions of (*) would not necessarily be sufficiently interpretable to use in practice, the constraints would generally help us find models that would be interpretable (if we design them well), and we might also be willing to consider slightly suboptimal solutions to find a more useful model. The constant C trades off between accuracy and the interpretability penalty, and can be tuned, either by cross-validation or by taking into account the user's desired tradeoff between the two terms. Equation (*) can be generalized to unsupervised learning, where the loss term would simply be replaced by a loss term for the unsupervised problem, whether it is novelty detection, clustering, dimension reduction, or another task.
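To make the objective in (*) concrete, here is a minimal, purely illustrative sketch: the function class F is a handful of linear threshold models with small integer coefficients, the loss is 0-1 classification error, and the interpretability penalty counts nonzero coefficients (sparsity). The candidate set, toy data, and value of C are all invented for this example.

```python
# Toy instance of the interpretable-learning objective (*):
#     min over f in F of  (1/n) * sum_i Loss(f, z_i) + C * InterpretabilityPenalty(f)
# Here f is a linear threshold model with small integer coefficients
# (a hypothetical setup for illustration), Loss is 0-1 error, and the
# penalty counts nonzero coefficients.

def zero_one_loss(w, data):
    """Average 0-1 loss of sign(w . x) on data = [(x, y), ...], y in {-1, +1}."""
    errors = 0
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x))
        pred = 1 if score > 0 else -1
        if pred != y:
            errors += 1
    return errors / len(data)

def sparsity_penalty(w):
    """Number of nonzero coefficients: one simple interpretability penalty."""
    return sum(1 for wi in w if wi != 0)

def objective(w, data, C):
    return zero_one_loss(w, data) + C * sparsity_penalty(w)

def best_model(candidates, data, C):
    """Solve (*) by exhaustive search over a small candidate set F."""
    return min(candidates, key=lambda w: objective(w, data, C))

# Two features; only the first is predictive of the label.
data = [((1, 5), 1), ((2, 1), 1), ((-1, 4), -1), ((-2, 2), -1)]
candidates = [(1, 0), (0, 1), (1, 1), (2, -1)]

w_star = best_model(candidates, data, C=0.1)
print(w_star)  # the sparse model (1, 0) classifies perfectly
```

Real interpretable-ML methods of course do not enumerate F; the point is only that the objective combines a data-fit term with an interpretability term weighted by C.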
Creating interpretable models can sometimes be much more difficult than creating black box models for many different reasons including: (i) Solving the optimization problem may be computationally hard, depending on the choice of constraints and the model class F. (ii) When one does create an interpretable model, one invariably realizes that the data are problematic and require troubleshooting, which slows down deployment (but leads to a better model). (iii) It might not be initially clear which definition of interpretability to use. This definition might require refinement, sometimes over multiple iterations with domain experts. There are many papers detailing these issues, the earliest dating from the mid-1990s [e.g., 157].
Interpretability differs across domains just as predictive performance metrics vary across domains. Just as we might choose from a variety of performance metrics (e.g., accuracy, weighted accuracy, precision, average precision, precision@N, recall, recall@N, DCG, NDCG, AUC, partial AUC, mean-time-to-failure, etc.), or combinations of these metrics, we might also choose from a combination of interpretability metrics that are specific to the domain. We may not be able to define a single best definition of interpretability; regardless, if our chosen interpretability measure is helpful for the problem at hand, we are better off including it. Interpretability penalties or constraints can include sparsity of the model, monotonicity with respect to a variable, decomposability into sub-models, an ability to perform case-based reasoning or other types of visual comparisons, disentanglement of certain types of information within the model's reasoning process, generative constraints (e.g., laws of physics), preferences among the choice of variables, or any other type of constraint that is relevant to the domain. Just as it would be futile to create a complete list of performance metrics for machine learning, any list of interpretability metrics would be similarly fated.
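As a concrete instance of one constraint from the list above, the following sketch brute-force checks monotonicity of a model's prediction with respect to a single variable. The toy risk model, feature names, and sweep values are invented for illustration only.

```python
# Checking a monotonicity constraint: a model's prediction should be
# non-decreasing as one feature increases, holding the others fixed.
# The risk model below is a made-up example, not a real scoring model.

def is_monotone_in(predict, base_input, feature, values):
    """Check that predict() is non-decreasing as base_input[feature] sweeps values."""
    preds = []
    for v in sorted(values):
        x = dict(base_input)
        x[feature] = v
        preds.append(predict(x))
    return all(a <= b for a, b in zip(preds, preds[1:]))

# A toy risk model: risk grows with age, but has an odd bump in income,
# so it is not monotone in income.
def risk(x):
    return 0.01 * x["age"] + (0.2 if x["income"] == 50 else 0.0)

base = {"age": 40, "income": 30}
print(is_monotone_in(risk, base, "age", range(20, 80, 10)))    # True
print(is_monotone_in(risk, base, "income", [10, 30, 50, 70]))  # False
```

In practice monotonicity is enforced during training (e.g., via constraints on coefficients or shape functions) rather than checked after the fact; this sketch only shows what the constraint means.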
Our 10 challenges involve how to define some of these interpretability constraints and how to incorporate them into machine learning models. For tabular data, sparsity is usually part of the definition of interpretability, whereas for computer vision of natural images, it generally is not. (Would you find a model for natural image classification interpretable if it uses only a few pixels from the image?) For natural images, we are better off with interpretable neural networks that perform case-based reasoning or disentanglement, and provide us with a visual understanding of intermediate computations; we will describe these in depth. Choices of model form (e.g., the choice to use a decision tree, or a specific neural architecture) are examples of interpretability constraints. For most problems involving tabular data, a fully interpretable model, whose full calculations can be understood by humans such as a sparse decision tree or sparse linear model, is generally more desirable than either a model whose calculations can only be partially understood, or a model whose predictions (but not its model) can be understood. Thus, we make a distinction between fully interpretable and partially interpretable models, often preferring the former.
Interpretable machine learning models are not needed for all machine learning problems. For low-stakes decisions (e.g., advertising), for decisions where an explanation would be trivial and the model is 100% reliable (e.g., "there is no lesion in this mammogram"), and for decisions where humans can verify or modify the decision afterwards (e.g., segmentation of the chambers of the heart), interpretability is probably not needed. On the other hand, for self-driving cars, even if they are very reliable, problems would arise if the car's vision system malfunctioned, causing a crash, and no reason for the crash were available. Lack of interpretability would be problematic in this case.
Our second fundamental principle concerns trust: Principle 2 Despite common rhetoric, interpretable models do not necessarily create or enable trust; they could also enable distrust. They simply allow users to decide whether to trust them. In other words, they permit a decision of trust, rather than trust itself.
With black boxes, one needs to make a decision about trust with much less information; without knowledge about the reasoning process of the model, it is more difficult to detect whether it might generalize beyond the dataset. As stated by Afnan et al. [4] with respect to medical decisions, while interpretable AI is an enhancement of human decision making, black box AI is a replacement of it.
An important point about interpretable machine learning models is that there is no scientific evidence for a general tradeoff between accuracy and interpretability when one considers the full data science process for turning data into knowledge. (Examples of such pipelines include KDD, CRISP-DM, or the CCC Big Data Pipelines; see Figure 1, or [95,51,7].) In real problems, interpretability
is useful for troubleshooting, which leads to better accuracy, not worse. In that sense, we have the third principle: Principle 3 It is important not to assume that one needs to make a sacrifice in accuracy in order to gain interpretability. In fact, interpretability often begets accuracy, and not the reverse. Interpretability versus accuracy is, in general, a false dichotomy in machine learning.
Interpretability has traditionally been associated with complexity, and specifically, sparsity, but model creators generally would not equate interpretability with sparsity. Sparsity is often one component of interpretability, and a model that is sufficiently sparse but has other desirable properties is more typical. While there is almost always a tradeoff of accuracy with sparsity (particularly for extremely small models), there is no evidence of a general tradeoff of accuracy with interpretability. Let us consider both (1) development and use of ML models in practice, and (2) experiments with static datasets; in neither case have interpretable models proven to be less accurate.

Development and use of ML models in practice.
A key example of the pitfalls of black box models when used in practice is the story of COMPAS [224,258]. COMPAS is a black box because it is proprietary: no one outside of its designers knows its secret formula for predicting criminal recidivism, yet it is used widely across the United States and influences parole, bail, and sentencing decisions that deeply affect people's lives [13]. COMPAS is error-prone because it could require over 130 variables, and typographical errors in those variables influence outcomes [317]. COMPAS has been explained incorrectly, where the news organization ProPublica mistakenly assumed that an important variable in an approximation to COMPAS (namely race) was also important to COMPAS itself, and used this faulty logic to conclude that COMPAS depends on race, other than through age and criminal history [13,258]. While the racial bias of COMPAS does not seem to be what ProPublica claimed, COMPAS still has unclear dependence on race. Worse, COMPAS seems to be unnecessarily complicated, as it does not seem to be any more accurate than a very sparse decision tree [10,11] involving only a couple of variables. The story of COMPAS is a key example simultaneously demonstrating many pitfalls of black boxes in practice. And, it shows an example of when black boxes are unnecessary, but used anyway.
Other systemic issues arise when using black box models in practice, or when developing them as part of a data science process, that would cause them to be less accurate than an interpretable model. A full list is beyond the scope of this survey, but an article that points out many serious issues that arise with black box models in a particular domain is that of Afnan et al. [4], who discuss in vitro fertilization (IVF). In modern IVF, black box models that have not been subjected to randomized controlled trials determine who comes into existence. Afnan et al. [4] point out ethical and practical issues, ranging from the inability to perform shared decision-making with patients, to economic consequences of clinics needing to "buy into" environments that are similar to those where the black box models are trained so as to avoid distribution shift; this is necessary precisely because errors cannot be detected effectively in real-time using the black box. They also discuss accountability issues ("Who is accountable when the model causes harm?"). Finally, they present a case where a standard performance metric (namely area under the ROC curve -AUC) has been misconstrued as representing the value of a model in practice, potentially leading to overconfidence in the performance of a black box model. Specifically, a reported AUC was inflated by including many "obvious" cases in the sample over which it was computed. If we cannot trust reported numerical performance results, then interpretability would be a crucial remaining ingredient to assess trust.
In a full data science process, like the one shown in Figure 1, interpretability plays a key role in determining how to update the other steps of the process for the next iteration. One interprets the results and tunes the processing of the data, the loss function, the evaluation metric, or anything else that is relevant, as shown in the figure. How can one do this without understanding how the model works? It may be possible, but might be much more difficult. In essence, the messiness that comes with messy data and complicated black box models causes lower quality decision-making in practice.
Let's move on to a case where the problem is instead controlled, so that we have a static dataset and a fixed evaluation metric.
Static datasets. Most benchmarking of algorithms is done on static datasets, where the data and evaluation metric are not cleaned or updated as a result of a run of an algorithm. In other words, these experiments are not done as part of a data science process, they are only designed to compare algorithms in a controlled experimental environment.
Even with static datasets and fixed evaluation metrics, interpretable models do not generally lead to a loss in accuracy over black box models. Even for deep neural networks for computer vision, even on the most challenging of benchmark datasets, a plethora of machine learning techniques have been designed that do not sacrifice accuracy but gain substantial interpretability [333,53,158,58,12,216].
Let us consider two extremes of data types: tabular data, where all variables are real or discrete features, each of which is meaningful (e.g., age, race, sex, number of past strokes, congestive heart failure), and "raw" data, such as images, sound files or text, where each pixel, bit, or word is not useful on its own. These types of data have different properties with respect to both machine learning performance and interpretability.
For tabular data, most machine learning algorithms tend to perform similarly in terms of prediction accuracy. This means it is often difficult even to beat logistic regression, assuming one is willing to perform minor preprocessing such as creating dummy variables [e.g., see 61]. In these domains, neural networks generally find no advantage. It has been known for a very long time that very simple models perform surprisingly well for tabular data [128]. The fact that simple models perform well for tabular data could arise from the Rashomon Effect discussed by Leo Breiman [41]. Breiman posits the possibility of a large Rashomon set, i.e., a multitude of models with approximately the minimum error rate, for many problems. Semenova et al. [269] show that as long as a large Rashomon set exists, it is more likely that some of these models are interpretable.
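The Rashomon Effect can be made tangible on a toy dataset: enumerate a simple model class (here, single-feature threshold classifiers) and count how many models come close to the minimum error. A large set of near-optimal models leaves room to pick an interpretable one. The data and tolerance below are synthetic, chosen only for illustration.

```python
# Illustration of Breiman's Rashomon Effect: on this toy tabular dataset,
# several distinct threshold models are (near-)optimal, so there is a
# whole "Rashomon set" of good models to choose an interpretable one from.

def threshold_error(data, feature, threshold):
    """0-1 error of the rule: predict 1 if x[feature] > threshold else 0."""
    wrong = sum(1 for x, y in data if (1 if x[feature] > threshold else 0) != y)
    return wrong / len(data)

data = [
    ((1.0, 3.1), 0), ((1.5, 0.2), 0), ((2.0, 2.8), 0), ((2.5, 1.0), 0),
    ((3.5, 2.9), 1), ((4.0, 0.5), 1), ((4.5, 3.0), 1), ((5.0, 1.2), 1),
]

# Candidate models: (feature index, threshold) pairs on a coarse grid.
models = [(f, t / 2) for f in (0, 1) for t in range(0, 12)]
errors = {m: threshold_error(data, *m) for m in models}
best = min(errors.values())
epsilon = 0.125  # tolerance defining "approximately minimum error"
rashomon_set = [m for m, e in errors.items() if e <= best + epsilon]

print(f"minimum error: {best}")
print(f"models within {epsilon} of optimal: {len(rashomon_set)}")
```

Here four different thresholds on the first feature achieve (near-)zero error, so a practitioner is free to choose whichever of these equally good models is easiest to understand and justify.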
For raw data, on the other hand, neural networks have an advantage currently over other approaches [164]. In these raw data cases, the definition of interpretability changes; for visual data, one may require visual explanations. In such cases, as discussed earlier and in Challenges 4 and 5, interpretable neural networks suffice, without losing accuracy.
These two data extremes show that in machine learning, the dichotomy between the accurate black box and the less-accurate interpretable model is false. The often-discussed hypothetical choice between the accurate machine-learning-based robotic surgeon and the less-accurate human surgeon is moot once someone builds an interpretable robotic surgeon. Given that even the most difficult computer vision benchmarks can be solved with interpretable models, there is no reason to believe that an interpretable robotic surgeon would be worse than its black box counterpart. The question ultimately becomes whether the Rashomon set should permit such an interpretable robotic surgeon; all scientific evidence so far (including a large-and-growing number of experimental papers on interpretable deep learning) suggests it would.
Our next principle returns to the data science process.

Principle 4
As part of the full data science process, one should expect both the performance metric and interpretability metric to be iteratively refined.
The knowledge discovery process in Figure 1 explicitly shows these important feedback loops. We have found it useful in practice to create many interpretable models (satisfying the known constraints) and have domain experts choose between them. Their rationale for choosing one model over another helps to refine the definition of interpretability. Each problem can thus have its own unique interpretability metrics (or set of metrics).
The fifth principle is as follows: Principle 5 For high stakes decisions, interpretable models should be used if possible, rather than "explained" black box models.
Hence, this survey concerns the former. This is not a survey on Explainable AI (XAI, where one attempts to explain a black box using an approximation model, derivatives, variable importance measures, or other statistics), it is a survey on Interpretable Machine Learning (creating a predictive model that is not a black box). Unfortunately, these topics are much too often lumped together within the misleading term "explainable artificial intelligence" or "XAI" despite a chasm separating these two concepts [250]. Explainability and interpretability techniques are not alternative choices for many real problems, as the recent surveys often imply; one of them (XAI) can be dangerous for high-stakes decisions to a degree that the other is not.
Interpretable ML is not a subset of XAI. The term XAI dates from ∼2016, and grew out of work on function approximation, i.e., explaining a black box model by approximating its predictions by a simpler model [e.g., 70,69], or explaining a black box using local approximations. Interpretable ML also has a (separate) long and rich history, dating back to the days of expert systems in the 1950s, and the early days of decision trees. While these topics may sound similar to some readers, they differ in ways that are important in practice.
In particular, there are many serious problems with the use of explaining black boxes posthoc, as outlined in several papers that have shown why explaining black boxes can be misleading and why explanations do not generally serve their intended purpose [250,173,171]. The most compelling such reasons are:

• Explanations for black boxes are often problematic and misleading, potentially creating misplaced trust in black box models. Such issues with explanations have arisen with assessment of fairness and variable importance [258,82], as well as uncertainty bands for variable importance [113,97]. There is an overall difficulty in troubleshooting the combination of a black box and an explanation model on top of it: if the explanation model is not always correct, it can be difficult to tell whether the black box model is wrong, or whether it is right and the explanation model is wrong. Ultimately, posthoc explanations are wrong (or misleading) too often. One particular type of posthoc explanation, saliency maps (also called attention maps), has become particularly popular in radiology and other computer vision domains despite known problems [2,53,334]. Saliency maps highlight the pixels of an image that are used for a prediction, but do not explain how the pixels are used. As an analogy, consider a real estate agent who is pricing a house. A "black box" real estate agent would provide the price with no explanation. A "saliency" real estate agent would say that the price is determined by the roof and backyard, but would not explain how the roof and backyard were used to determine the price. In contrast, an interpretable agent would explain the calculation in detail, for instance using "comps" (comparable properties) to explain how the roof and backyard are comparable between properties, and how these comparisons were used to determine the price. One can see from this real estate example how the saliency agent's explanation is insufficient. Saliency maps also tend to be unreliable; researchers often report that different saliency methods provide different results, making it unclear which one (if any) actually represents the network's true attention.

• Black boxes are generally unnecessary, given that their accuracy is generally not better than that of a well-designed interpretable model. Thus, explanations that seem reasonable can undermine efforts to find an interpretable model of the same level of accuracy as the black box.

• Explanations for complex models hide the fact that complex models are difficult to use in practice for many different reasons. Typographical errors in input data are a prime example of this issue [as in the use of COMPAS in practice, see 317]. A model with 130 hand-typed inputs is more error-prone than one involving 5 hand-typed inputs.
In that sense, explainability methods are often used as an excuse to use a black box model-whether or not one is actually needed. Explainability techniques give authority to black box models rather than suggesting the possibility of models that are understandable in the first place [253].
XAI surveys have (thus far) universally failed to acknowledge the important point that interpretability begets accuracy when considering the full data science process, and not the other way around. Perhaps this point is missed because of the more subtle fact that one does generally lose accuracy when approximating a complicated function with a simpler one, so it would appear that the simpler approximation is less accurate. (Again, the approximations must be imperfect, otherwise one would throw out the black box and instead use the explanation as an inherently interpretable model.) But function approximators are not used in interpretable ML; instead of approximating a known function (a black box ML model), interpretable ML can choose from a potential myriad of approximately equally good models, which, as we noted earlier, is called "the Rashomon set" [41,97,269]. We will discuss the study of this set in Challenge 9. Thus, when one explains black boxes, one expects to lose accuracy, whereas when one creates an inherently interpretable ML model, one does not.
In this survey, we do not aim to provide yet another dull taxonomy of "explainability" terminology. The ideas of interpretable ML can be stated in just one sentence: an interpretable model is constrained, following a domain-specific set of constraints that make reasoning processes understandable. Instead, we highlight important challenges, each of which can serve as a starting point for someone wanting to enter into the field of interpretable ML.

Sparse logical models: Decision trees, decision lists, decision sets
The first two challenges involve optimization of sparse models. We discuss both sparse logical models in Challenge #1 and scoring systems (which are sparse linear models with integer coefficients) in Challenge #2. Sparsity is often used as a measure of interpretability for tabular data where the features are meaningful. Sparsity is useful because humans can handle only 7±2 cognitive entities at the same time [208], and sparsity makes it easier to troubleshoot, check for typographical errors, and reason about counterfactuals (e.g., "How would my prediction change if I changed this specific input?"). Sparsity is rarely the only consideration for interpretability, but if we can design models to be sparse, we can often handle additional constraints. Also, if one can optimize for sparsity, a useful baseline can be established for how sparse a model could be with a particular level of accuracy.
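A scoring system of the kind mentioned above (a sparse linear model with small integer coefficients, Challenge #2) is simple enough to show in full. The features, point values, and threshold below are invented for illustration and are not a validated clinical tool.

```python
# A hypothetical scoring system: add small integer points for each risk
# factor present, then compare the total against a threshold. A model
# this size can be memorized, audited for typos, and reasoned about
# counterfactually ("what if this patient had no prior stroke?").

SCORING_RULES = [
    # (feature name, points awarded if the condition holds)
    ("age > 70",                 2),
    ("prior stroke",             3),
    ("high blood pressure",      1),
    ("congestive heart failure", 1),
]
THRESHOLD = 3  # predict "high risk" if total points >= threshold

def score(patient):
    """Sum the points for every rule the patient satisfies."""
    return sum(points for name, points in SCORING_RULES if patient.get(name, False))

def predict(patient):
    return "high risk" if score(patient) >= THRESHOLD else "low risk"

patient = {"age > 70": True, "high blood pressure": True}
print(score(patient), predict(patient))  # 3 points -> "high risk"
```

Learning such systems is hard precisely because the coefficients are constrained to small integers; Challenge #2 concerns optimizing them rather than rounding a real-valued model.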
We remark that more sparsity does not always equate to more interpretability. This is because "humans by nature are mentally opposed to too simplistic representations of complex relations" [93,100]. For instance, in loan decisions, we may choose to have several sparse mini-models for length of credit, history of default, etc., which are then assembled at the end into a larger model composed of the results of the mini-models [see 54, who attempted this]. On the other hand, sparsity is necessary for many real applications, particularly in healthcare and criminal justice where the practitioner needs to memorize the model.
Logical models, which consist of logical statements involving "if-then," "or," and "and" clauses are among the most popular algorithms for interpretable machine learning, since their statements provide human-understandable reasons for each prediction.
When would we use logical models? Logical models are usually an excellent choice for modeling categorical data with potentially complicated interaction terms (e.g., "IF (female AND high blood pressure AND congestive heart failure), OR (male AND high blood pressure AND either prior stroke OR age > 70) THEN predict Condition 1 = true"). Logical models are also excellent for multiclass problems, and are known for their robustness to outliers and ease of handling missing data. Logical models can be highly nonlinear, and even classes of sparse nonlinear models can be quite powerful.

Figure 2 visualizes three logical models: a decision tree, a decision list, and a decision set. Decision trees are tree-structured predictive models where each branch node tests a condition and each leaf node makes a prediction. Decision lists, identical to rule lists or one-sided decision trees, are composed of if-then-else statements. The rules are tried in order, and the first rule that is satisfied makes the prediction. Sometimes rule lists have multiple conditions in each split, whereas decision trees typically do not. A decision set, also known as a "disjunction of conjunctions," "disjunctive normal form" (DNF), or an "OR of ANDs," consists of an unordered collection of rules, where each rule is a conjunction of conditions. A positive prediction is made if at least one of the rules is satisfied. Even though these logical models seem to have very different forms, they are closely related: every decision list is a (one-sided) decision tree, and every decision tree can be expressed as an equivalent decision list (by listing each path to a leaf as a decision rule). The collection of leaves of a decision tree (or a decision list) also forms a decision set.
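Under the semantics just described (rules tried in order, first match wins, with a default rule at the end), a decision list takes only a few lines to express. The rules and patient fields below are invented for illustration.

```python
# A minimal decision list (rule list): conditions are tried in order and
# the first rule that fires makes the prediction. The final catch-all
# rule plays the role of the "else" branch.

decision_list = [
    (lambda x: x["prior_stroke"] and x["age"] > 70,               True),
    (lambda x: x["high_blood_pressure"] and x["sex"] == "female", True),
    (lambda x: True,                                              False),  # default rule
]

def predict(x):
    for condition, label in decision_list:
        if condition(x):
            return label
    raise ValueError("no rule fired; a decision list needs a default rule")

patient = {"prior_stroke": False, "age": 75,
           "high_blood_pressure": True, "sex": "female"}
print(predict(patient))  # second rule fires -> True
```

Note how the equivalences in the text show up here: reading the positive rules as an unordered OR of ANDs gives the corresponding decision set, and unrolling the ordered rules gives a one-sided decision tree.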
Let us provide some background on decision trees. Since Morgan and Sonquist [212] developed the first decision tree algorithm, many works have been proposed to build decision trees and improve their performance. However, learning decision trees with high performance and sparsity is not easy. Full decision tree optimization is known to be an NP-complete problem [174], and heuristic greedy splitting and pruning procedures have been the major type of approach since the 1980s to grow decision trees [40,239,192,205]. These greedy methods build trees from the top down and prune them back afterwards; they do not go back to fix a bad split if one was made. Consequently, the trees created by these greedy methods tend to be both less accurate and less interpretable than necessary. That is, greedy induction algorithms are not designed to optimize any particular performance metric, leaving a gap between the performance that a decision tree might obtain and the performance that the algorithm's decision tree actually attains, with no way to determine how large the gap is (see Figure 3 for a case where the 1984 CART algorithm [40] did not obtain an optimal solution, as shown by the better solution from a 2020 algorithm called "GOSDT," to the right). This gap can cause a problem in practice because one does not know whether poor performance is due to the choice of model form (the choice to use a decision tree of a specific size) or poor optimization (not fully optimizing over the set of decision trees of that size). When fully optimized, single trees can be as accurate as ensembles of trees, or neural networks, for many problems. Thus, it is worthwhile to think carefully about how to optimize them.

Figure 3 caption: (a) decision tree learned by CART [40] and (b) 9-leaf decision tree generated by GOSDT [185] for the classic Monk 2 dataset [88]. The GOSDT tree is optimal with respect to a balance between accuracy and sparsity.
Finding an optimal sparse decision tree can be posed as an optimization problem of the form

(1.1)   min_f  Loss(f, training data) + λ · (number of leaves in f),

where f ranges over a class of decision trees, and where the user specifies the loss function and the trade-off (regularization) parameter λ. Efforts to fully optimize decision trees, solving problems related to (1.1), have been made since the 1990s [28,85,94,220,221,131,185]. Many recent papers directly optimize the performance metric (e.g., accuracy) with soft or hard sparsity constraints on the tree size, where sparsity is measured by the number of leaves in the tree. Three major groups of these techniques are (1) mathematical programming, including mixed integer programming (MIP) [see the works of 28,29,251,301,302,118,6] and SAT solvers [214,130] [see also the review of 47], (2) stochastic search through the space of trees [e.g., 321,114,228], and (3) customized dynamic programming algorithms that incorporate branch-and-bound techniques for reducing the size of the search space [131,185,222,78].
Decision list and decision set construction lead to the same challenges as decision tree optimization, and have followed a parallel development path. Dating back to the 1980s, decision lists have often been constructed in a top-down greedy fashion. Associative classification methods assemble decision lists or decision sets from a set of pre-mined rules, generally either by greedily adding rules to the model one by one, or simply including all "top-scoring" rules into a decision set, where each rule is scored separately according to a scoring function [247,63,188,184,325,275,200,298,252,62,108,65,99,104,199,198]. Sometimes decision lists or decision sets are optimized by sampling [180,321,306], providing a Bayesian interpretation. Some recent works can jointly optimize performance metrics and sparsity for decision lists [251,327,10,11,8] and decision sets [311,110,170,132,77,197,109,80,328,45]. Some works optimize for individual rules [77,255].
In recent years, great progress has been made on optimizing the combination of accuracy and sparsity for logical models, but there are still many challenges that need to be solved. Some important ones are as follows.

1.1 Can we improve the scalability of optimal sparse decision trees?
A lofty goal for optimal decision tree methods is to fully optimize trees as fast as CART produces its (non-optimal) trees. Current state-of-the-art optimal decision tree methods can handle medium-sized datasets (thousands of samples, tens of binary variables) in a reasonable amount of time (e.g., within 10 minutes) when appropriate sparsity constraints are used. But how to scale up to deal with large datasets or to reduce the running time remains a challenge. These methods often scale exponentially in p, the number of dimensions of the data. Developing algorithms that reduce the number of dimensions through variable screening theorems or through other means could be extremely helpful.
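To convey the flavor of the branch-and-bound search these optimal methods perform, here is a toy sketch with lower-bound pruning. For simplicity it optimizes a decision set (an OR of pre-mined rules) rather than a tree, and the objective, bound, and representation are all simplifying assumptions for illustration, not any published algorithm.

```python
import numpy as np

def decision_set_bb(rule_covers, y, lam):
    """Branch-and-bound over decision sets (OR-of-rules classifiers).
    rule_covers[r] is a boolean array marking the samples rule r fires on;
    y holds boolean labels. Objective: #misclassified + lam * #rules used.
    Pruning uses a valid lower bound: false positives only increase as rules
    are added, so FP(selected) + lam*|selected| bounds every completion."""
    n_rules = len(rule_covers)
    best = {"obj": np.inf, "rules": None}

    def positives(selected):
        pred = np.zeros_like(y, dtype=bool)
        for r in selected:
            pred |= rule_covers[r]
        return pred

    def search(i, selected):
        if i == n_rules:                       # leaf: a complete rule set
            pred = positives(selected)
            obj = np.sum(pred != y) + lam * len(selected)
            if obj < best["obj"]:
                best["obj"], best["rules"] = obj, list(selected)
            return
        fp = np.sum(positives(selected) & ~y)  # lower bound for this branch
        if fp + lam * len(selected) >= best["obj"]:
            return                             # prune: no completion can win
        search(i + 1, selected + [i])          # branch 1: include rule i
        search(i + 1, selected)                # branch 2: exclude rule i

    search(0, [])
    return best["obj"], best["rules"]
```

Real systems like GOSDT add the other mechanisms discussed in this section (scheduling policies, computational reuse, and shared subproblem caches) on top of this basic prune-and-branch loop.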
For methods that use mathematical programming solvers, a good formulation is key to reducing training time. For example, MIP solvers use branch-and-bound methods, which partition the search space recursively and solve Linear Programming (LP) relaxations for each partition to produce lower bounds. Small formulations with fewer variables and constraints enable the LP relaxations to be solved faster, while stronger LP relaxations (which usually involve more variables) can produce high-quality lower bounds to prune the search space faster and reduce the number of LPs to be solved. How to formulate the problem to leverage the full power of MIP solvers is an open question. Currently, mathematical programming solvers are not as efficient as the best customized algorithms. For customized branch-and-bound search algorithms such as GOSDT and OSDT [185,131], there are several mechanisms to improve scalability: (1) effective lower bounds, which prevent branching into parts of the search space where we can prove there is no optimal solution; (2) effective scheduling policies, which help us search the space to find close-to-optimal solutions quickly, which in turn improves the bounds and again prevents us from branching into irrelevant parts of the space; (3) computational reuse, whereby if a computation involves a sum over (even slightly) expensive computations, and part of that sum has previously been computed and stored, we can reuse the previous computation instead of computing the whole sum over again; (4) efficient data structures to store subproblems that can be referenced later in the computation should those subproblems arise again.

1.2 Can we efficiently handle continuous variables?

While decision trees handle categorical variables and complicated interactions better than other types of approaches (e.g., linear models), one of the most important challenges for decision tree algorithms is to optimize over continuous features.
Many current methods use binary variables as input [131,185,301,222,78], which assumes that continuous variables have been transformed into indicator variables beforehand (e.g., age > 50). These methods are unable to jointly optimize the selection of variables to split at each internal tree node, the splitting threshold of that variable (if it is continuous), and the tree structure (the overall shape of the tree). Lin et al. [185] preprocess the data by transforming continuous features into a set of dummy variables, with many different split points; they take split points between every ordered pair of unique values present in the training data. Doing this preserves optimality, but creates a huge number of binary features, leading to a dramatic increase in the size of the search space, and the possibility of hitting either time or memory limits. Some methods [301,222] preprocess the data using an approximation, whereby they consider a much smaller subset of possible thresholds, potentially sacrificing the optimality of the solution [see 185, Section 3, which explains this]. One possible technique to help with this problem is to use similar support bounds, identified by Angelino et al. [11], but in practice these bounds have been hard to implement because checking the bounds repeatedly is computationally expensive, to the point where the bounds have never been used (as far as we know). Future work could go into improving the determination of when to check these bounds, or proving that a subset of all possible dummy variables still preserves closeness to optimality.

1.3 Can we handle constraints more gracefully?

Particularly for greedy methods that create trees using local decisions, it is difficult to enforce global constraints on the overall tree. Given that domain-specific constraints may be essential for interpretability, an important challenge is to determine how to incorporate such constraints.
Optimization approaches (mathematical programming, dynamic programming, branch-and-bound) are more amenable to global constraints, but the constraints can make the problem much more difficult. For instance, falling constraints [306,55] enforce decreasing probabilities along a rule list, which make the list more interpretable and useful in practice, but make the optimization problem harder, even though the search space itself becomes smaller.
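The falling constraint itself is simple to state in code; here is a minimal check, assuming we summarize a rule list by its per-rule predicted probabilities from top to bottom.

```python
def satisfies_falling_constraint(rule_probs):
    """True iff predicted probabilities are non-increasing down the rule
    list, so the most urgent (highest-risk) rules come first."""
    return all(p_hi >= p_lo for p_hi, p_lo in zip(rule_probs, rule_probs[1:]))
```

The optimization difficulty discussed above lies in enforcing this during the search, not in checking it after the fact.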
Example: Suppose a hospital would like to create a decision tree that will be used to assign medical treatments to patients. A tree is convenient because it corresponds to a set of questions to ask the patient (one for each internal node along a path to a leaf). A tree is also convenient in its handling of multiple medical treatments; each leaf could even represent a different medical treatment. The tree can also handle complex interactions, where patients can be asked multiple questions that build on each other to determine the best medication for the patient. To train this tree, we assume the proper assumptions and data handling are in place to allow machine learning to be used for causal analysis (in practice these are more difficult than we have room to discuss here). The questions we discussed above arise when the variables are continuous; for instance, if we split on age somewhere in the tree, what is the optimal age to split at in order to create a sparse tree? (See Challenge 1.2.) If we have many other continuous variables (e.g., blood pressure, weight, body mass index), scalability in choosing how to split them all becomes an issue. Further, if the hospital has additional preferences, such as "falling probabilities," where fewer questions should be asked to determine whether a patient is in the most urgent treatment categories, that again could affect our ability to find an optimal tree given limited computational resources (see Challenge 1.3).
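The all-splits binarization discussed in Challenge 1.2 (e.g., for the age variable in this example) can be sketched as follows; this is a simplified illustration of the preprocessing, creating one dummy per midpoint between consecutive distinct observed values.

```python
import numpy as np

def binarize_thresholds(x):
    """Turn one continuous feature into threshold dummies: one candidate
    split at each midpoint between consecutive distinct observed values.
    This preserves optimality of downstream tree search, but the number
    of columns grows with the number of distinct values."""
    vals = np.unique(x)                          # sorted distinct values
    thresholds = (vals[:-1] + vals[1:]) / 2.0    # midpoints
    dummies = x[:, None] <= thresholds[None, :]  # n x (#thresholds) booleans
    return thresholds, dummies
```

The approximate methods cited above keep only a subset of these thresholds to shrink the search space, at a possible cost in optimality.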

Scoring systems
Scoring systems are linear classification models that require users to add, subtract, and multiply only a few small numbers in order to make a prediction. These models are used to assess the risk of numerous serious medical conditions since they allow quick predictions, without using a computer. Such models are also heavily used in criminal justice. Table 2 shows an example of a scoring system. A doctor can easily determine whether a patient screens positive for obstructive sleep apnea by adding points for the patient's age, whether they have hypertension, body mass index, and sex. If the score is above a threshold, the patient would be recommended to a clinic for a diagnostic test. Scoring systems commonly use binary or indicator variables (e.g., age ≥ 60) and point values (e.g., Table 2) which make computation of the score easier for humans. Linear models do not handle interaction terms between variables like the logical models discussed above in Challenge 1.1, and they are not particularly useful for multiclass problems, but they are useful for counterfactual reasoning: if we ask "What changes could keep someone's score low if they developed hypertension?" the answer would be easy to compute using point scores such as "4, 4, 2, 2, and -6." Such logic is more difficult for humans if the point scores are instead, for instance, 53.2, 41.1, 16.3, 23.6 and -61.8.
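As a sketch of the arithmetic a user performs with such a model: the code below reuses the point values "4, 4, 2, 2, and -6" quoted above, but the mapping of conditions to points and the threshold are hypothetical, not the published sleep-apnea model.

```python
# Hypothetical scoring system in the style of Table 2 (illustrative only).
FEATURES = [
    ("age >= 60",    4),
    ("hypertension", 4),
    ("bmi >= 30",    2),
    ("bmi >= 40",    2),
    ("female",      -6),
]
THRESHOLD = 1  # hypothetical screening cutoff

def screen(patient):
    """Add the points for each condition the patient satisfies; screen
    positive when the total exceeds the threshold."""
    score = sum(pts for cond, pts in FEATURES if patient.get(cond, False))
    return score, score > THRESHOLD
```

Counterfactual questions ("what if the patient develops hypertension?") reduce to adding one small integer to the total, which is what makes the reasoning easy for humans.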

Table 2
A scoring system for sleep apnea screening [295]. Patients that screen positive may need to come to the clinic to be tested.

Risk scores are scoring systems that have a conversion table to probabilities. For instance, a 1-point total might convert to a probability of 15%, 2 points to 33%, and so on. Whereas scoring systems with a threshold (like the one in Table 2) would be measured by false positive and false negative rates, a risk score might be measured using the area under the ROC curve (AUC) and calibration.
The development of scoring systems dates back at least to criminal justice work in the 1920s [44]. Since then, many scoring systems have been designed for healthcare [154,152,153,38,175,14,107,211,274,315,286]. However, none of the scoring systems mentioned so far was optimized purely using an algorithm applied to data. Each scoring system was created using a different method involving different heuristics. Some of them were built using domain expertise alone without data, and some were created using rounding heuristics for logistic regression coefficients and other manual feature selection approaches to obtain integer-valued point scores [see, e.g., 175].
Such scoring systems could be optimized using a combination of the user's preferences (and constraints) and data. This optimization should ideally be accomplished by a computer, leaving the domain expert only to specify the problem. However, jointly optimizing for predictive performance, sparsity, and other user constraints may not be an easy task. Equation (2.1) shows an example of a generic optimization problem for creating a scoring system:

(2.1)   min_λ  (1/n) Σ_i Loss(λ; x_i, y_i) + C · (number of nonzero coefficients of λ),
        subject to small integer coefficients, ∀j, λ_j ∈ {−10, −9, .., 0, .., 9, 10}, and additional user constraints.
Here, the user would specify the loss function (logistic loss, etc.), the tradeoff parameter C between the number of nonzero coefficients and the training loss, and possibly some additional constraints, depending on the domain. The integrality constraint on the coefficients makes the optimization problem very difficult. The easiest way to satisfy these constraints is to fit real coefficients (e.g., run logistic regression, perhaps with ℓ1 regularization) and then round these real coefficients to integers. However, rounding can go against the loss gradient and ruin predictive performance. To illustrate why rounding might not work, consider a coefficient vector such as (5.3, 6.1, 0.31, 0.30, 0.25, 0.25, …), where every coefficient after the first two is below 0.5: when rounding, we lose all signal coming from all variables except the first two. The contribution from the eliminated variables may together be significant even if each individual coefficient is small, in which case, we lose predictive performance.
Compounding the issue with rounding is the fact that ℓ1 regularization introduces a strong bias for very sparse problems. To understand why, consider that the regularization parameter must be set to a very large number to get a very sparse solution. In that case, the ℓ1 regularization does more than make the solution sparse; it also imposes a strong ℓ1 bias. Solution quality degrades as solutions become sparser, and rounding to integers then makes the solutions worse still.
An even bigger problem arises when trying to incorporate additional constraints, as we allude to in (2.1). Even simple constraints such as "ensure precision is at least 20%" when optimizing recall would be very difficult to satisfy manually with rounding. There are four main types of approaches to building scoring systems: i) exact solutions using optimization techniques, ii) approximation algorithms using linear programming, iii) more sophisticated rounding techniques, iv) computer-aided exploration techniques.
Exact solutions. There are several methods that can solve (2.1) directly [294,291,292,293,256]. To date, the most promising approaches use mixed-integer linear programming solvers (MIP solvers), which are generic optimization software packages that handle systems of linear equations in which variables can be either real-valued or integer. Commercial MIP solvers (currently CPLEX and Gurobi) are substantially faster than free MIP solvers, and have free academic licenses. MIP solvers can be used directly when the problem is not too large and when the loss function is discrete or linear (e.g., classification error is discrete, as it takes values either 0 or 1). These solvers are flexible and can handle a huge variety of user-defined constraints easily. However, in the case where the loss function is nonlinear, like the classical logistic loss, Σ_i log(1 + exp(−y_i f(x_i))), MIP solvers cannot be used directly. In that case, it is possible to use an algorithm called RiskSLIM [293] that uses sophisticated optimization tools: cutting planes within a branch-and-bound framework, using "callback" functions to a MIP solver. A major benefit of scoring systems is that they can be used as decision aids in very high stakes settings; RiskSLIM has been used to create a model (the 2HELPS2B score) that is used in intensive care units of hospitals to make treatment decisions about critically ill patients [281].
While exact optimization approaches provide optimal solutions, they struggle with larger problems. For instance, to handle nonlinearities in continuous covariates, these variables are often discretized to form dummy variables by splitting on all possible values of the covariate (similar to the way continuous variables are handled for logical model construction as discussed above, e.g., create dummy variables for age<30, age<31, age<32, etc.). Obviously doing this can turn a small number of continuous variables into a large number of categorical variables. One way to reduce the size of the problem is to use only a subset of thresholds (e.g., age<30, age<35, age<40, etc.), but it is possible to lose accuracy if not enough thresholds are included. Approximation methods can be valuable in such cases.
Approximate methods. Several works solve approximate versions of (2.1), including the works of [277,276,35,34,48]. These works generally use a piecewise linear or piecewise constant loss function, and sometimes use the ℓ1 norm for regularization as a proxy for the number of terms. This allows for the possibility of solving linear programs using a mathematical programming solver, which is generally computationally efficient since linear programs can be solved much faster than mixed-integer programs. The main problem with approximation approaches is that it is not clear how close the solution of the approximate optimization problem is to the solution of the desired optimization problem, particularly when user-defined constraints or preferences are required. Some of these constraints may be able to be placed into the mathematical program, but it is still not clear whether the solution of the optimization problem one solves would actually be close to the solution of the optimization problem we actually care about.
It is possible to use sampling to try to find useful scoring systems, which can be useful for Bayesian interpretations, though the space of scoring systems can be quite large and hard to sample [251].
Sophisticated rounding methods. Keeping in mind the disadvantages of rounding discussed above, sophisticated rounding approaches have some compelling advantages, namely that they are easy to program and use in practice. Rounding techniques cannot easily accommodate constraints, but they can be used for problems without constraints, for problems where the constraints are weak enough that rounding would not cause us to violate them, or within the middle of algorithms like RiskSLIM [293] to help find optimal solutions faster. There are several variations of sophisticated rounding. Chevaleyre et al. [60] propose randomized rounding, where non-integers are rounded up or down randomly. They also propose a greedy method where the sum of coefficients is fixed and coefficients are rounded one at a time. Sokolovska et al. [277] propose an algorithm that finds a local minimum by improving the solution at each iteration until no further improvements are possible. Ustun and Rudin [293] propose a combination of rounding and "polishing." Their rounding method is called Sequential Rounding. At each iteration, Sequential Rounding chooses a coefficient to round and whether to round it up or down. It makes this choice by evaluating each possible coefficient rounded both up and down, and chooses the option with the best objective. After Sequential Rounding produces an integer coefficient vector, a second algorithm, called Discrete Coordinate Descent (DCD), is used to "polish" the rounded solution. At each iteration, DCD chooses a coefficient, and optimizes its value over the set of integers to obtain a feasible integer solution with a better objective. All of these algorithms are easy to program and might be easier to deal with than troubleshooting a MIP or LP solver.
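The rounding-and-polishing pipeline just described can be sketched as follows under the logistic loss; this follows the textual description of Sequential Rounding and DCD, but it is a simplified illustration, not the authors' implementation.

```python
import numpy as np

def logistic_loss(w, X, y):
    """Average logistic loss; labels y are in {-1, +1}."""
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def sequential_rounding(w_real, X, y):
    """Round one coefficient at a time: at each step, try every unrounded
    coefficient rounded both up and down, and commit the single choice
    with the best objective."""
    w = np.asarray(w_real, dtype=float).copy()
    unrounded = set(range(len(w)))
    while unrounded:
        best = None  # (loss, coordinate, rounded value)
        for j in unrounded:
            for v in (np.floor(w[j]), np.ceil(w[j])):
                trial = w.copy()
                trial[j] = v
                loss = logistic_loss(trial, X, y)
                if best is None or loss < best[0]:
                    best = (loss, j, v)
        _, j, v = best
        w[j] = v
        unrounded.remove(j)
    return w.astype(int)

def discrete_coordinate_descent(w_int, X, y, bound=10):
    """Polish: re-optimize one integer coefficient at a time over
    {-bound, ..., bound} until no single-coordinate change improves."""
    w = np.asarray(w_int, dtype=int).copy()
    improved = True
    while improved:
        improved = False
        for j in range(len(w)):
            best_v, best_loss = w[j], logistic_loss(w, X, y)
            for v in range(-bound, bound + 1):
                trial = w.copy()
                trial[j] = v
                loss = logistic_loss(trial, X, y)
                if loss < best_loss - 1e-12:
                    best_v, best_loss = v, loss
            if best_v != w[j]:
                w[j] = best_v
                improved = True
    return w
```

Since DCD only accepts strict improvements over a finite integer grid, it terminates, and the polished solution is never worse than the rounded one.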
Computer-aided exploration techniques. These are design interfaces where domain experts can modify the model itself rather than relying directly on optimization techniques to encode constraints. Billiet et al. [35,34] created a toolbox that allows users to make manual adjustments to the model, which could potentially help users design interpretable models according to certain types of preferences. Xie et al. [320] also suggest an expert-in-the-loop approach, where heuristics such as the Gini splitting criteria can be used to help discretize continuous variables. Again, with these approaches, the domain expert must know how they want the model to depend on its variables, rather than considering overall performance optimization.
2.1 Improve the scalability of optimal sparse scoring systems: As discussed, for scoring systems, the only practical approaches that produce optimal scoring systems require a MIP solver, and these approaches may not be able to scale to large problems, or optimally handle continuous variables. Current state-of-the-art methods for optimal scoring systems (like RiskSLIM) can handle datasets with thousands of samples and tens of variables within about an hour. However, an hour is quite long if one wants to adjust constraints and rerun it several times. How to scale up to large datasets or to reduce solving time remains a challenge, particularly when including complicated sets of constraints.

2.2 Ease of constraint elicitation and handling: Since domain experts often do not know the full set of constraints they might want to use in advance, and since they also might want to adjust the model manually [34], a more holistic approach to scoring system design might be useful. There are many ways to cast the problem of scoring system design with feedback from domain experts. For instance, if we had better ways of representing and exploring the Rashomon set (see Challenge 9), domain experts might be able to search within it effectively, without fear of leaving that set and producing a suboptimal model. If we knew domain experts' views about the importance of features, we should be able to incorporate that through regularization [307]. Better interfaces might elicit better constraints from domain experts and incorporate such constraints into the models. Faster optimization methods for scoring systems would allow users faster turnaround for creating these models interactively.
Example: A physician wants to create a scoring system for predicting seizures in critically ill patients [similarly to 281]. The physician has upwards of 70 clinical variables, some of which are continuous (e.g., patient age). The physician creates dummy variables for age and other continuous features, combines them with the variables that are already binarized, runs ℓ1-regularized logistic regression, and rounds the coefficients. However, the model does not look reasonable, as it uses only one feature, and it is not very accurate. The physician thinks that the model should instead depend heavily on age, and have a false positive rate below 20% when the true positive rate is above 70% (see Challenge 2.2). The physician downloads a piece of software for developing scoring systems. The software reveals that a boosted decision tree model is much more accurate, and that there is a set of scoring systems with approximately the same accuracy as the boosted tree. Some of these models do not use age and violate the physician's other constraint, so these constraints are then added to the system, which restricts its search. The physician uses a software tool to look at models provided by the system and manually adjusts the dependence of the model on age and other important variables, in a way that still maintains predictive ability. Ideally, this entire process takes a few hours from start to finish (see Challenge 2.1). Finally, the physician takes the resulting model into the clinic to perform a validation study on data the model has not been trained with, and the model is adopted into regular use as a decision aid for physicians.

Generalized additive models
Generalized additive models (GAMs) were introduced as a flexible extension of generalized linear models (GLMs) [217], allowing for arbitrary functions for modeling the influence of each feature on a response ([122], see also [318]). The set of GAMs includes the set of additive models, which, in turn, includes the set of linear models, which includes scoring systems (and risk scores). Figure 4 shows these relationships. The standard form of a GAM is

g(E[y]) = β_0 + f_1(x_·1) + f_2(x_·2) + ··· + f_p(x_·p),

where g is a link function and each component function f_j depends on a single feature. If the features are all binary (or categorical), the GAM becomes a linear model and the visualizations are just step functions. The visualizations become more interesting for continuous variables, like the ones shown in Figure 5. If a GAM has bivariate component functions (that is, if we choose an f_j to depend on two variables, which permits an interaction between these two variables), a heatmap can be used to visualize the component function on the two-dimensional plane and understand the pairwise interactions [196]. As a comparison point with decision trees, GAMs typically do not handle more than a few interaction terms, and all of these would be quadratic (i.e., involve 2 variables); this contrasts with decision trees, which handle complex interactions of categorical variables. GAMs, like other linear models, do not handle multiclass problems in a natural way. GAMs have been particularly successful for dealing with large datasets of medical records that have many continuous variables because they can elucidate complex relationships between, for instance, age, disease, and mortality. (Of course, dealing with large raw medical datasets, we would typically encounter serious issues with missing data, or bias in the labels or variables, which would be challenging for any method, including GAMs.) A component function f_j can take different forms.
For example, it can be a weighted sum of indicator functions, that is,

(3.1)   f_j(x_·j) = Σ_{j'} c_{j,j'} 1[x_·j > θ_{j'}] + Σ_{j'} c'_{j,j'} 1[x_·j < θ'_{j'}].

If the weights on the indicator functions are integers, and only a small set of weights are nonzero, then the GAM becomes a scoring system. If the indicators are all forced to aim in one direction (e.g., 1[x_·j > θ_{j'}] for all j', with no indicators in the other direction, 1[x_·j < θ_{j'}]) and the coefficients c_{j,j'} are all constrained to be nonnegative, then the function will be monotonic. In the case that splines are used as component functions, the GAM can be a weighted sum of the splines' basis functions, i.e., f_j(x_·j) = Σ_{k=1}^{K_j} β_{jk} b_{jk}(x_·j). There are many different ways to fit GAMs. The traditional way is to use backfitting, where we iteratively train a component function to best fit the residuals from the other (already-chosen) components [122]. If the model is fitted using boosting methods [101,102,103], we learn a tree on each single feature in each iteration and then aggregate them together [195]. Among different estimations of component functions and fitting procedures, Binder and Tutz [36] found that boosting performed particularly well in high-dimensional settings, and Lou et al. [195] found that using a shallow bagged ensemble of trees on a single feature in each step of stochastic gradient boosting generally achieved better performance.
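The backfitting loop can be sketched as follows, using a binned (piecewise-constant) smoother as a toy stand-in for splines or boosted trees; the choice of smoother, bin count, and iteration count here are all simplifying assumptions.

```python
import numpy as np

def backfit_gam(X, y, n_iter=20, bins=10):
    """Backfitting for an additive model y ~ beta0 + sum_j f_j(x_j):
    repeatedly refit each component f_j to the partial residuals left
    by the other components."""
    n, p = X.shape
    beta0 = y.mean()
    edges = [np.quantile(X[:, j], np.linspace(0, 1, bins + 1)) for j in range(p)]
    fitted = np.zeros((n, p))                 # current values f_j(x_ij)
    values = [np.zeros(bins) for _ in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove every component except f_j
            resid = y - beta0 - fitted.sum(axis=1) + fitted[:, j]
            idx = np.clip(np.searchsorted(edges[j], X[:, j], side="right") - 1,
                          0, bins - 1)
            vals = np.array([resid[idx == b].mean() if np.any(idx == b) else 0.0
                             for b in range(bins)])
            vals -= vals.mean()               # center for identifiability
            values[j] = vals
            fitted[:, j] = vals[idx]
    return beta0, edges, values
```

Swapping the binned smoother for a shallow tree learner on one feature per pass roughly corresponds to the boosted and bagged variants cited above.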
We remark that GAMs have the advantage that they are very powerful, particularly if they are trained as boosted stumps or trees, which are reliable out-of-the-box machine learning techniques. The AdaBoost algorithm also has the advantage that it maximizes convex proxies for both classification error and area under the ROC curve (AUC) simultaneously [254,72]. This connection explains why boosted models tend to have both high AUC and accuracy. However, boosted models are not naturally sparse, and issues with bias arise under ℓ1 regularization, as discussed in the scoring systems section.
We present two interesting challenges involving GAMs.

3.1 How to control the sparsity, smoothness, and monotonicity of GAM component functions?

Such control is useful when we have prior knowledge, e.g., that risk increases with age. In the case when component functions are estimated by splines, many works apply convex regularizers (e.g., ℓ1) to control both smoothness and sparsity [186,245,206,324,234,194,262,121]. For example, they add "roughness" penalties and lasso-type penalties on the f_j's in the objective function to control both the smoothness of component functions and the sparsity of the model. Similarly, if the f_j's are sums of indicators, as in (3.1), we could regularize to shrink the c_{j,j'} coefficients to induce smoothness. These penalties are usually convex; therefore, when combined with convex loss functions, convex optimization algorithms can minimize the (regularized) objectives. There could be some disadvantages to this setup: (1) as we know, ℓ1 regularization imposes a strong unintended bias on the coefficients when aiming for very sparse solutions; (2) Lou et al. [195] find that imposing smoothness may come at the expense of accuracy; (3) imposing smoothness may miss important naturally-occurring patterns like a jump in a component function; in fact, they found such a jump in mortality as a function of age that seems to occur around retirement age. In that case, it might be more interpretable to include a smooth increasing function of age plus an indicator function around retirement age. At the moment, these types of choices are hand-designed, rather than automated. As mentioned earlier, boosting can be used to train GAMs to produce accurate models. However, sparsity and smoothness are hard to control with AdaBoost since it adds a new term to the model at each iteration.

3.2 How to use GAMs to troubleshoot complex datasets?

GAMs are often used on raw medical records or other complex data types, and these datasets are likely to benefit from troubleshooting.
Using a GAM, we might find counterintuitive patterns; e.g., as shown in [49], asthma patients fared better than non-asthma patients in a health outcomes study. Caruana et al. [49] provides a possible reason for this finding, which is that asthma patients are at higher natural risk, and are thus given better care, leading to lower observed risk. Medical records are notorious for missing important information or providing biased information such as billing codes. Could a GAM help us to identify important missing confounders, such as retirement effects or special treatment for asthma patients? Could GAMs help us reconcile medical records from multiple data storage environments? These data quality issues can be really important.
Example: Suppose a medical researcher has a stack of raw medical records and would like to predict the mortality risk for pneumonia patients. The data are challenging, including missing measurements (structural missingness as well as data missing not at random, and unobserved variables), insurance codes that do not convey exactly what happened to the patient, nor what their state was. However, the researcher decides that there is enough signal in the data that it could be useful in prediction, given a powerful machine learning method, such as a GAM trained with boosted trees. There are also several important continuous variables, such as age, that could be visualized. A GAM with a small number of component functions might be appropriate since the researcher can visualize each component function. If there are too many component functions (GAM without sparsity control), analyzing contributions from all of them could be overwhelming (see Challenge 3.1). If the researcher could control the sparsity, smoothness, and monotonicity of the component functions, she might be able to design a model that not only predicts well, but also reveals interesting relationships between observed variables and outcomes. This model could also help us to determine whether important variables were missing, recorded inconsistently or incorrectly, and could help identify key risk factors (see Challenge 3.2). From there, the researcher might want to develop even simpler models, such as decision trees or scoring systems, for use in the clinic (see Challenges 1 and 2).

Modern case-based reasoning
Case-based reasoning is a paradigm that involves solving a new problem using known solutions to similar past problems [1]. It is a problem-solving strategy that we humans use naturally in our decision-making processes [219]. For example, when ornithologists classify a bird, they will look for specific features or patterns on the bird and compare them with those from known bird species to decide which species the bird belongs to. The interesting question is: can a machine learning algorithm emulate the case-based reasoning process that we humans are accustomed to? A model that performs case-based reasoning is appealing, because by emulating how humans reason, the model can explain its decision-making process in an interpretable way.
The potential uses for case-based reasoning are incredibly broad: whereas the earlier challenges apply only to tabular data, case-based reasoning applies to both tabular and raw data, including computer vision. For computer vision and other raw data problems, we distinguish between the feature extraction and prediction steps. The feature extraction can be verified by a human, and the prediction steps are simple calculations. That is, while a human may not understand the mapping from the original image to the feature space, they can visually verify that a particular interpretable concept/feature has been extracted. Then, the features/concepts are combined to form a prediction, which is done via a sparse linear combination, or another calculation that a human can understand.
Case-based reasoning has long been a subject of interest in the artificial intelligence (AI) community [246,160,120]. There are, in general, two types of case-based reasoning techniques: (i) nearest neighbor-based techniques, and (ii) prototype-based techniques, illustrated in Figure 6. There are many variations of each of these two types. Nearest neighbor-based techniques. These techniques make a decision for a previously unseen test instance by finding training instances that most closely resemble the particular test instance (i.e., the training instances that have the smallest "distance" or the largest "similarity" to the test instance). A classic example of nearest neighbor-based techniques is k-nearest neighbors (kNN) [98,68]. Traditionally, a kNN classifier is non-parametric and requires no training at all: given a previously unseen test instance, a kNN classifier finds the k training instances that have the smallest ℓ2 distances to the test instance, and the class label of the test instance is predicted to be the majority label of those k training instances. Many variants of the original kNN classification scheme were developed in the 1970s and 80s [89,106,142,30]. Some later papers on nearest neighbor-based techniques focused on the problem of "adaptive kNN," where the goal is to learn a suitable distance metric to quantify the dissimilarity between any pair of input instances (instead of using a pre-determined distance metric such as the Euclidean ℓ2 distance), to improve the performance of nearest neighbor-based techniques. For example, [316] proposed a method to learn a parametrized distance metric (such as the matrix in a Mahalanobis distance measure) for kNN.
Their method involves minimizing a loss function on the training data such that, for every training instance, the distance between the training instance and its "target" neighbors (of the same class) is minimized, while the distance between the training instance and its "impostor" neighbors (from other classes) is maximized (until those neighbors are at least 1 distance unit away from the training instance). More recently, several works have focused on "performing kNN in a learned latent space," where the latent space is often learned using a deep neural network. For example, Salakhutdinov and Hinton [264] proposed to learn a nonlinear transformation, using a deep neural network, that transforms the input space into a feature space where a kNN classifier will perform well (i.e., deep kNN). Papernot and McDaniel [229] proposed an algorithm for deep kNN classification, which uses the k-nearest neighbors of a test instance from every hidden layer of a trained neural network. Card et al. [46] introduced a deep weighted averaging classifier, which classifies an input based on its latent-space distances to other training examples.
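As a toy illustration of the classic (non-adaptive) kNN scheme described above, the following sketch uses invented two-dimensional data and plain Euclidean distance:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classic kNN: label x by majority vote among the k training points
    closest in Euclidean (l2) distance. No training phase is needed."""
    dists = sorted(
        (math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y)
    )
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Invented toy data: two well-separated clusters.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["A", "A", "A", "B", "B", "B"]
knn_predict(train_X, train_y, (0.5, 0.5), k=3)  # → "A"
knn_predict(train_X, train_y, (5.5, 5.5), k=3)  # → "B"
```

The prediction is interpretable in the case-based sense: the three retrieved neighbors themselves serve as the explanation.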
We remark that the notion of "adaptive kNN" is mathematically the same as "performing kNN in a learned latent space." Let us show why that is. In adaptive kNN, we would learn a distance metric d(·, ·) such that k-nearest neighbors tends to be an accurate classifier; the distance between points x1 and x2 would be d(x1, x2). For latent space nearest neighbor classification, we would learn a mapping φ : x → φ(x) from our original space to the latent space, and then use the ℓ2 distance in the latent space, so that d(x1, x2) = ‖φ(x1) − φ(x2)‖2. That is, the latent space mapping acts as a transformation of the original metric so that the ℓ2 distance works well for kNN.
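This equivalence can be checked numerically for the linear (Mahalanobis) case: with M = LᵀL, the learned metric is exactly the ℓ2 distance between points mapped by φ(x) = Lx. The matrix L below is arbitrary, standing in for a learned transformation:

```python
import math

# Arbitrary invertible matrix standing in for a learned transformation L.
L = [[2.0, 0.0],
     [1.0, 3.0]]

def phi(x):
    """Map to the 'latent space': phi(x) = L x."""
    return [sum(L[i][j] * x[j] for j in range(2)) for i in range(2)]

def mahalanobis(x1, x2):
    """d(x1, x2) = sqrt((x1-x2)^T M (x1-x2)) with M = L^T L,
    computed without forming M explicitly, using L(x1-x2) = Lx1 - Lx2."""
    diff = [a - b for a, b in zip(x1, x2)]
    return math.sqrt(sum(v * v for v in phi(diff)))

x1, x2 = [1.0, 2.0], [4.0, -1.0]
d_metric = mahalanobis(x1, x2)          # adaptive-kNN view
d_latent = math.dist(phi(x1), phi(x2))  # latent-space view
assert abs(d_metric - d_latent) < 1e-9  # the two views coincide
```

For a nonlinear φ (a deep network), the same identity holds by construction: the "learned metric" is defined as the ℓ2 distance after the mapping.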
As an aside, the problem of matching in observational causal inference also can use case-based reasoning. Here treatment units are matched to similar control units to estimate treatment effects. We refer readers to recent work on this topic for more details [310,231].
Prototype-based techniques. Despite the popularity of nearest-neighbor techniques, those techniques often require a substantial amount of distance computations (e.g., to find out the nearest neighbors of a test input), which can be slow in practice. Also, it is possible that the nearest neighbors may not be particularly good representatives of a class, so that reasoning about nearest neighbors may not be interpretable. Prototype-based techniques are an alternative to nearest-neighbor techniques that have neither of these disadvantages. Prototype-based techniques learn, from the training data, a set of prototypical cases for comparison. Given a previously unseen test instance, they make a decision by finding prototypical cases (instead of training instances from the entire training set) that most closely resemble the particular test instance. One of the earliest prototype learning techniques is learning vector quantization (LVQ) [159]. In LVQ, each class is represented by one or more prototypes, and points are assigned to the nearest prototype. During training, if the training example's class agrees with the nearest prototype's class, then the prototype is moved closer to the training example; otherwise the prototype is moved further away from the training example. In more recent works, prototype learning is also achieved by solving a discrete optimization program, which selects the "best" prototypes from a set of training instances according to some training objective. For example, Bien and Tibshirani [33] formulated the prototype learning problem as a set-cover integer program (an NP-complete problem), which can be solved using standard approximation algorithms such as relaxation-and-rounding and greedy algorithms. Kim et al. [145] formulated the prototype learning problem as an optimization program that minimizes the squared maximum mean discrepancy, which is a submodular optimization problem and can be solved approximately using a greedy algorithm.
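The LVQ update described above can be sketched in a few lines; the prototypes, example point, and learning rate below are invented for illustration:

```python
import math

def lvq1_step(prototypes, x, y, lr=0.1):
    """One LVQ1 update. prototypes: dict mapping class label -> prototype
    vector. The prototype nearest to x is attracted toward x if its class
    matches the label y, and repelled from x otherwise."""
    nearest = min(prototypes, key=lambda c: math.dist(prototypes[c], x))
    p = prototypes[nearest]
    sign = 1.0 if nearest == y else -1.0  # attract if labels match, else repel
    prototypes[nearest] = [pi + sign * lr * (xi - pi) for pi, xi in zip(p, x)]

prototypes = {"A": [0.0, 0.0], "B": [5.0, 5.0]}
lvq1_step(prototypes, x=[1.0, 1.0], y="A")  # "A" prototype moves toward x
lvq1_step(prototypes, x=[1.0, 1.0], y="B")  # nearest ("A") is pushed away
```

After training, each prototype is a concrete, inspectable representative of its class, which is what makes LVQ a case-based method.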
Part-based prototypes. One issue that arises with both nearest neighbor and prototype techniques is the comparison of a whole observation to another whole observation. This makes little sense, for instance, with images, where some aspects resemble a known past image, but other aspects resemble a different image. For example, consider the architecture of buildings: while some architectural elements of a building may resemble one style, other elements resemble another. Another example is recipes: a recipe for a cheesecake with strawberry topping may borrow the strawberry sauce preparation from a strawberry pancake recipe, while the cheesecake part of the recipe could follow a traditional plain cheesecake recipe. In that case, it makes more sense to compare the strawberry cheesecake recipe to both the pancake recipe and the plain cheesecake recipe. Thus, some newer case-based reasoning methods have been comparing parts of observations to parts of other observations, by creating comparisons on subsets of features. This gives case-based reasoning techniques both more flexibility and more interpretability.
Kim et al. [146] formulated a prototype-parts learning problem for structured (tabular) data using a Bayesian generative framework. They considered the example (discussed above) of recipes in their experiments. Wu and Tabak [319] used a convex combination of training instances to represent a prototype, where the prototype does not necessarily need to be a member of the training set. Using a convex combination of training examples as a prototype would be suitable for some data types (e.g., tabular data, where a convex combination of real training examples might resemble a realistic observation), but for images, averaging the latent positions of units in latent space may not correspond to a realistic-looking image, which means the prototype may not look like a real image, which could be a disadvantage to this type of approach.

[Figure 7: (a) Prototype classification of Li et al. [183]. The network compares a previously unseen image of "6" with 15 prototypes of handwritten digits, learned from the training set, and classifies the image as a 6 because it looks like the three prototypes of handwritten 6's, which have been visualized by passing them through a decoder from latent space into image space. (b) Part-based prototype classification of a ProtoPNet [53]. The ProtoPNet compares a previously unseen image of a bird with prototypical parts of a clay-colored sparrow, which are learned from the training set. It classifies the image as a clay-colored sparrow because (the network thinks that) its head looks like a prototypical head from a clay-colored sparrow, its wing bars look like prototypical wing bars from a clay-colored sparrow, and so on. Here, the prototypes do not need to be passed through a decoder; they are images from the training set.]
Recently, there are works that integrate deep learning with prototype- and prototype-parts-based classification. This idea was first explored by Li et al. [183] for image classification. They created a neural network architecture that contains an autoencoder and a special prototype layer, where each unit of that layer (i.e., a prototype) stores a weight vector that resembles some encoded training input. Given an input instance, the network compares the encoded input instance with the learned prototypes (stored in the prototype layer), which can be visualized using the decoder. The prediction of the network is based on the ℓ2 distances between the encoded input instance and the learned prototypes. The authors applied the network to handwritten digit recognition [MNIST, 177], and the network was able to learn prototypical cases of how humans write digits by hand. Given a test image of a handwritten digit, the network was also able to find prototypical cases that are similar to the test digit (Figure 7(a)).
More recently, Chen et al. [53] extended the work of Li et al. [183] to create a prototypical part network (ProtoPNet) whose prototype layer stores prototypical parts of encoded training images. The prototypical parts are patches of convolutional-neural-network-encoded training images, and represent typical features observed for various image classes. Given an input instance, the network compares an encoded input image with each of the learned prototypical parts, and generates a prototype activation map that indicates both the location and the degree of the image patch most similar to that prototypical part. The authors applied the network to the benchmark Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset [305] for bird recognition, and the ProtoPNet was able to learn prototypical parts of 200 classes of birds, and use these prototypes to classify birds with an accuracy comparable to non-interpretable black-box models. Given a test image of a bird, the ProtoPNet was able to find prototypical parts that are similar to various parts of the test image, and was able to provide an explanation for its prediction, such as "this bird is a clay-colored sparrow, because its head looks like that prototypical head from a clay-colored sparrow, and its wing bars look like those prototypical wing bars from a clay-colored sparrow" (Figure 7(b)). In their work, Chen et al. [53] also removed the decoder, and instead introduced prototype projection, which pushes every prototypical part to the nearest encoded training patch of the same class for visualization. This improved the visualization quality of the learned prototypes (in comparison to the approach of Li et al. [183], which used a decoder).
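The scoring scheme of a part-based prototype model can be sketched schematically. The snippet below is not the authors' implementation; it only illustrates the idea of turning patch-to-prototype ℓ2 distances into similarity scores (using the log((d+1)/(d+ε)) form from [53]) and pooling them into class scores, with invented vectors:

```python
import math

def similarity(patch, prototype):
    """log((d + 1) / (d + eps)): large when the encoded patch is close
    (in squared l2 distance) to the prototype."""
    d = sum((a - b) ** 2 for a, b in zip(patch, prototype))
    return math.log((d + 1.0) / (d + 1e-4))

def class_scores(patches, prototypes_per_class):
    """For each class, sum each prototype's best (max over patches)
    similarity -- a schematic stand-in for max pooling over the
    activation map followed by a linear layer."""
    return {
        cls: sum(max(similarity(p, proto) for p in patches)
                 for proto in prototypes)
        for cls, prototypes in prototypes_per_class.items()
    }

# Invented encoded patches of one test image, and one prototype per class.
patches = [[0.9, 0.1], [0.2, 0.8]]
prototypes = {"sparrow": [[1.0, 0.0]], "wren": [[-1.0, -1.0]]}
scores = class_scores(patches, prototypes)
# scores["sparrow"] > scores["wren"]: one patch closely matches the
# sparrow prototype, so the model "thinks this part looks like that part."
```

The explanation comes for free: the winning (patch, prototype) pairs are exactly the "this looks like that" evidence shown to the user.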
These two works (i.e., [183,53]) have been extended in the domain of deep case-based reasoning and deep prototype learning. In the image recognition domain, Nauta et al. [215] proposed a method for explaining what visual characteristics a prototype (in a trained ProtoPNet) is looking for; Nauta et al. [216] proposed a method for learning neural prototype trees based on a prototype layer; Rymarczyk et al. [260] proposed data-dependent merge-pruning of the prototypes in a ProtoPNet, to allow prototypes that activate on similarly looking parts from various classes to be pruned and shared among those classes. In the sequence modeling domain (such as natural language processing), Ming et al. [209] and Hong et al. [129] took the concepts in [183] and [53], and integrated prototype learning into recurrent neural networks for modeling sequential data. Barnett et al. [21] extended the ideas of Chen et al. [53] and developed an application to interpretable computer-aided digital mammography.
Despite the recent progress, many challenges still exist in the domain of case-based reasoning, including: 4.1 How to extend the existing case-based reasoning approach to handle more complex data, such as video? Currently, case-based reasoning has been used for structured (tabular) data, static images, and simple sequences such as text. It remains a challenge to extend the existing case-based reasoning approach to handle more complex data, such as video data, which are sequences of static images. While Trinh et al. [290] recently extended the ideas in Chen et al. [53] and developed a dynamic prototype network (DPNet), which learns prototypical video patches from deep-faked and real videos, how to efficiently compare various videos and find similar videos (for either nearest-neighbor or prototype-based classification) remains an open challenge for case-based video classification tasks. Performing case-based reasoning on video data is technically challenging because of the high dimensionality of the input data. On the other hand, a video is an ordered combination of frames, and we can take advantage of the sequential nature of the data. For example, current algorithms can be refined or improved with more information that comes from neighboring video frames; could prototypes be designed from neighboring frames or parts of the frames? 4.2 How to integrate prior knowledge or human supervision into prototype learning? Current approaches to prototype learning do not take into account prior knowledge or expert opinions. At times, it may be advantageous to develop prototype-learning algorithms that collaborate with human experts in choosing prototypical cases or prototypical features. For example, in healthcare, it would be beneficial if a prototype-based classifier learned, under the supervision of human doctors, prototypical signs of cancerous growth.
For example, the doctors might prune prototypes, design them, or specify a region of interest where the prototypes should focus. Such human-machine collaboration would improve the classifier's accuracy and interpretability, and would potentially reduce the amount of data needed to train the model. However, human-machine collaboration is rarely explored in the context of prototype learning, or case-based reasoning in general. 4.3 How to troubleshoot a trained prototype-based model to improve the quality of prototypes? Prototype-based models make decisions based on similarities with learned prototypes. However, sometimes, the prototypes may not be learned well enough, in the sense that they may not capture the most representative features of a class. This is especially problematic for part-based prototype models, because these models reason by comparing subsets of features they deem important. In the case of a (part-based) prototype model with "invalid" prototypes that capture non-representative or undesirable features (e.g., a prototype of text in medical images, which represents information that improves training accuracy but not test accuracy), one way to "fix" the model is to get rid of the invalid prototypes, but this may lead to an imbalance in the number of prototypes among different classes and a bias for some classes that are over-represented with abundant prototypes. An alternative solution is to replace each undesirable prototype with a different prototype that does not involve the undesirable features.
The challenge here lies in how we can replace undesirable prototypes systematically without harming the model performance, and at the same time improve the given model incrementally without retraining from scratch.

Complete supervised disentanglement of neural networks
Deep neural networks (DNNs) have achieved state-of-the-art predictive performance for many important tasks [176]. DNNs are the quintessential "black box" because the computations within their hidden layers are typically inscrutable. As a result, there have been many works that have attempted to "disentangle" DNNs in various ways so that information flow through the network is easier to understand. "Disentanglement" here refers to the way information travels through the network: we would perhaps prefer that all information about a specific concept (say "lamps") traverse through one part of the network while information about another concept (e.g., "airplane") traverse through a separate part. Therefore, disentanglement is an interpretability constraint on the neurons. For example, suppose we hope neuron "neur" in layer l is aligned with concept c; in that case, the disentanglement constraint is

    Signal(c, x_j^(l)) = 0 for every neuron j ≠ neur in layer l,    (5.1)

where x_j^(l) denotes the activation of neuron j in layer l, and Signal(c, x) means the signal of concept c that passes through x, which measures similarity between its two arguments. This constraint means all signal about concept c in layer l will only pass through neuron neur. In other words, using these constraints, we could constrain our network to have the kind of "Grandmother node" that scientists have been searching for in both real and artificial convolutional neural networks [115].
In this challenge, we consider the possibility of fully disentangling a DNN so that each neuron in a piece of the network represents a human-interpretable concept.
The vector space whose axes are formed by activation values on the hidden layer's neurons is known as the latent space of a DNN. Figure 8 shows an example of what an ideal interpretable latent space might look like. The axes of the latent space are aligned with individual visual concepts, such as "lamp," "bed," "nightstand," "curtain." Note that the "visual concepts" are not restricted to objects but can also be things such as weather or materials in a scene. We hope all information about the concept travels through that concept's corresponding neuron on the way to the final prediction. For example, the "lamp" neuron will be activated if and only if the network thinks that the input image contains information about lamps. This kind of representation makes the reasoning process of the DNN much easier to understand: the image is classified as "bedroom" because it contains information about "bed" and "lamp." Such a latent space, made of disjoint data generation factors, is a disentangled latent space [26,123]. An easy way to create a disentangled latent space is just to create a classifier for each concept (e.g., create a lamp classifier), but this is not a good strategy: it might be that the network only requires the light of the lamp rather than the actual lamp body, so creating a lamp classifier could actually reduce performance. Instead, we would want to encourage the information that is used about a concept (if any is used) to go along one path through the network.
Disentanglement is not guaranteed in standard neural networks. In fact, information about any concept could be scattered throughout the latent space of a standard DNN. For example, post hoc analyses on neurons of standard convolutional neural networks [336,337] show that concepts that are completely unrelated could be activated on the same axis, as shown in Figure 9. Even if we create a vector in the latent space that is aimed towards a single concept [as is done in 147,338], that vector could activate highly on multiple concepts, which means the signal for the two concepts is not disentangled. In that sense, vectors in the latent space are "impure" in that they do not naturally represent single concepts [see 58, for a detailed discussion].
[Figure 9: Example of an impure neuron in a standard neural network, from [337]. The figure shows images that highly activate this neuron: both dining tables (green) and Greek-style buildings (red) are highly activated on the neuron, even though these two concepts are unrelated.]

This challenge focuses on supervised disentanglement of neural networks, i.e., the researcher specifies which concepts to disentangle in the latent space. (In the next section we will discuss unsupervised disentanglement.) Earlier work in this domain disentangles the latent space for specific applications, such as face recognition, where we might aim to separate identity and pose [341]. Recent work in this area aims to disentangle the latent space with respect to a collection of predefined concepts [58,158,193,3]. For example, Chen et al. [58] add constraints to the latent space to decorrelate the concepts and align them with axes in the latent space; this is called "concept whitening." (One could think of this as analogous to a type of principal components analysis for neural networks.) To define the concepts, a separate dataset is labeled with concept information (or the main training dataset could also serve as this separate dataset, as long as it has labels for each concept). These concept datasets are used to align the network's axes along the concepts. This type of method can create disentangled latent spaces where concepts are more "pure" than those of a standard DNN, without hurting accuracy. With the disentangled latent space, one can answer questions about how the network gradually learns concepts over layers [58], or one could interact with the model by intervening on the activations of concept neurons [158]. Many challenges still exist for supervised disentanglement of DNNs, including:

5.1 How to make a whole layer of a DNN interpretable? Current methods can make neurons in a single layer interpretable, but all of them have limitations. Several works [158,193,3] have tried to directly learn a concept classifier in the latent space. However, good discriminative power on concepts does not guarantee concept separation and disentanglement [58]. That is, we could easily have our network classify all the concepts correctly, in addition to the overall classification task, but this would not mean that the information flow concerning each concept goes only through that concept's designated path through the network; the activations of the different concept nodes would likely still be correlated. The method discussed earlier, namely that of [58], can fully disentangle the latent space with respect to a few pre-defined concepts. But it also has two disadvantages if we want to disentangle all neurons in a layer, namely: (i) this method currently has a group of unconstrained neurons to handle residual information, which means it does not disentangle all neurons in a layer, just some of them, and (ii) it requires loading samples of all target concepts to train the latent space. If the layer to be disentangled contains 500 neurons, we would need to load samples for all 500 concepts, which takes a long time. Perhaps the latter problem could be solved by training with a small random subset of concepts at each iteration, but this has not been tried in current methods at the time of this writing. The first problem might be solved by using an unsupervised concept detector from Challenge 6 and having a human interact with the concepts to determine whether each one is interpretable. 5.2 How to disentangle all neurons of a DNN simultaneously? Current methods try to disentangle at most a single layer in the DNN (that is, they attempt only the problem discussed above). This means one could only interpret neurons in that specific layer, while semantic meanings of neurons in other layers remain unknown.
Ideally, we want to be able to completely understand and modify the information flowing through all neurons in the network. This is a challenging task for many obvious reasons, the first one being that it is hard practically to define what all these concepts could possibly be. We would need a comprehensive set of human-interpretable concepts, which is hard to locate, create, or even to parameterize. Even if we had this complete set, we would probably not want to manually specify exactly what part of the network would be disentangled with respect to each of these numerous concepts. For instance, if we tried to disentangle the same set of concepts in all layers, it would be immediately problematic because DNNs are naturally hierarchical: high-level concepts (objects, weather, etc.) are learned in deeper layers, and the deeper layers leverage low-level concepts (color, texture, object parts, etc.) learned in lower layers. Clearly, complex concepts like "weather outside" could not be learned well in earlier layers of the network, so higher-level concepts might be reserved for deeper layers. Hence, we also need to know the hierarchy of the concepts to place them in the correct layer. Defining the concept hierarchy manually is almost impossible, since there could be thousands of concepts. But how to automate it is also a challenge.

5.3 How to choose good concepts to learn for disentanglement? In supervised disentanglement, the concepts are chosen manually. To gain useful insights from the model, we need good concepts. But what are good concepts in specific application domains? For example, in medical applications, past works mostly use clinical attributes that already exist in the datasets. However, Chen et al. [58] found that the attributes in the ISIC dataset might be missing the key concept used by the model to classify lesion malignancy. Active learning approaches could be incredibly helpful in interfacing with domain experts to create and refine concepts. Moreover, it is challenging to learn concepts with continuous values. These concepts might be important in specific applications, e.g., age of the patient and size of tumors in medical applications. Current methods either define a concept by using a set of representative samples or treat the concept as a binary variable; both are discrete. Therefore, for continuous concepts, a challenge is how to choose good thresholds to transform the continuous concept into one or multiple binary variables.

5.4 How to make the mapping from the disentangled layer to the output layer interpretable? The decision process of current disentangled neural networks contains two parts: x → c, mapping the input x to the disentangled representation (concepts) c, and c → y, mapping the disentangled representation c to the output y (the notation is adopted from [158]). All current methods on neural disentanglement aim at making c interpretable, i.e., making the neurons in the latent space aligned with human-understandable concepts, but how these concepts combine to make the final prediction, i.e., c → y, often remains a black box. This leaves a gap between the interpretability of the latent space and the interpretability of the entire model. Current methods either rely on variable importance methods to explain c → y post hoc [58], or simply make c → y a linear layer [158]. However, a linear layer might not be expressive enough to learn c → y. [158] also shows that a linear function c → y is less effective than nonlinear counterparts when the user wants to intervene in developing the disentangled representation, e.g., replacing predicted concept values ĉ_j with true concept values c_j. Neural networks like neural additive models [5] and neural decision trees [322] could potentially be used to model c → y, since they are both differentiable, nonlinear, and intrinsically interpretable once the input features are interpretable.
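The two-stage structure x → c → y can be made concrete with a small sketch. Everything below (the encoder stand-in, the concept names, and the weights) is invented for illustration; the point is that when c → y is a sparse linear model, its weights, and the effect of intervening on a concept, can be read off directly:

```python
def encoder(x):
    """Stand-in for the x -> c stage: in practice a DNN that outputs
    scores for named, human-understandable concepts."""
    return {"bone_spur": x[0], "joint_space_narrowing": x[1], "age_over_60": x[2]}

# Interpretable c -> y stage: a sparse linear model with readable weights.
# These weights are hypothetical, not from any trained model.
concept_weights = {"bone_spur": 2.0, "joint_space_narrowing": 1.5, "age_over_60": 0.5}
bias = -1.0

def predict(x):
    c = encoder(x)
    y = bias + sum(concept_weights[name] * c[name] for name in concept_weights)
    return c, y

c, y = predict([1.0, 0.0, 1.0])
# Intervening on a concept (e.g., a doctor corrects c["bone_spur"] to 0)
# changes y in a way the linear weights make transparent:
y_intervened = bias + sum(concept_weights[n] * (0.0 if n == "bone_spur" else c[n])
                          for n in concept_weights)
```

The gap discussed above is visible here: replacing the linear c → y stage with a nonlinear one (e.g., a neural additive model) would gain expressiveness, but only remains interpretable if each concept's contribution can still be isolated.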
Example: Suppose machine learning practitioners and doctors want to build a supervised disentangled DNN on X-ray data to detect and predict arthritis. First, they aim to choose a set of relevant concepts that have been assessed by doctors (Challenge 5.3). They should also choose thresholds to turn continuous concepts (e.g., age) into binary variables to create the concept datasets (Challenge 5.3). Using the concept datasets, they can use supervised disentanglement methods like concept whitening [58] to build a disentangled DNN. However, if they choose too many concepts to disentangle in the neural network, loading samples from all of the concept datasets may take a very long time (Challenge 5.1). Moreover, doctors may have chosen different levels of concepts, such as bone spur (high-level) and shape of joint (low-level), and they would like the low-level concepts to be disentangled by neurons in earlier layers, and high-level concepts to be disentangled in deeper layers, since these concepts have a hierarchy according to medical knowledge. However, current methods only allow placing them in the same layer (Challenge 5.2). Finally, all previous steps can only make neurons in the DNN latent space aligned with medical concepts, while the way in which these concepts combine to predict arthritis remains uninterpretable (Challenge 5.4).

Unsupervised disentanglement of neural networks
The major motivation of unsupervised disentanglement is the same as Challenge 5, i.e., making information flow through the network easier to understand and interact with. But in the case where we do not know the concepts, or in the case where the concepts are numerous and we do not know how to parameterize them, we cannot use the techniques from Challenge 5. In other words, the concept c in Constraint (5.1) is no longer a concept we predefine, but it must still be an actual concept in the existing universe of concepts. There are situations where concepts are actually unknown; for example, in materials science, the concepts, such as key geometric patterns in the unit cells of materials, have generally not been previously defined. There are also situations where concepts are generally known but too numerous to handle; a key example is computer vision for natural images. Even though we could put a name to many concepts that exist in natural scenes, labeled datasets for computer vision have a severe labeling bias: we tend only to label entities in images that are useful for a specific task (e.g., object detection), thus ignoring much of the information found in images. If we could effectively perform unsupervised disentanglement, we can rectify problems caused by human bias, and potentially make scientific discoveries in uncharted domains. For example, an unsupervised disentangled neural network can be used to discover key patterns in materials and characterize their relation to the physical properties of the material (e.g., "will the material allow light to pass through it?"). Figure 10 shows such a neural network, with a latent space completely disentangled and aligned with the key patterns discovered without supervision: in the latent space, each neuron corresponds to a key pattern and all information about the pattern flows through the corresponding neuron. 
Analyzing these patterns' contribution to the prediction of a desired physical property could help material scientists understand what correlates with the physical properties and could provide insight into the design of new materials. Multiple branches of works are related to unsupervised disentanglement of deep neural networks, and we will describe some of them.
Disentangled representations in deep generative models: Disentanglement of generative models has long been studied [267,79]. Deep generative models such as generative adversarial networks [GANs, 112] and variational auto-encoders [VAEs, 149] try to learn a mapping from points in one probability distribution to points in another. The first distribution is over points in the latent space, and these points are chosen i.i.d. according to a zero-centered Gaussian distribution. The second distribution is over the space from which the data are drawn (e.g., random natural images in the space of natural images). The statistical independence between the latent features makes it easy to disentangle the representation [27]: a disentangled representation simply guarantees that knowing or changing one latent feature (and its corresponding concept) does not affect the distribution of any other. Results on simple imagery datasets show that these generative models can decompose the data generation process into disjoint factors (for instance, age, pose, and identity of a person) and explicitly represent them in the latent space, without any supervision on these factors; this happens based purely on the statistical independence of these factors in the data.
Recently, the quality of disentanglement in deep generative models has been improved [57,124]. These methods achieve full disentanglement without supervision by maximizing the mutual information between latent variables and the observations. However, these methods only work for relatively simple imagery data, such as faces or single 3D objects (that is, in the image, there is only one object, not a whole scene). Learning the decomposition of a scene into groups of objects in the latent space, for example, is not yet achievable by these deep generative models. One reason for the failure of these approaches might be that the occurrence of objects may not be statistically independent; for example, some objects such as "bed" and "lamp" tend to co-occur in the same scene. Also, the same type of object may occur multiple times in the scene, which cannot be easily encoded by a single continuous latent feature.
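As a minimal illustration of this independence pressure, the sketch below computes the closed-form KL term of a β-VAE-style objective in numpy; the encoder, decoder, and training loop are omitted, and the function names and constants are our own illustrative choices, not taken from the works cited above.

```python
import numpy as np

def diag_gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over
    latent dimensions: 0.5 * sum(exp(log_var) + mu^2 - 1 - log_var)."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """Per-sample objective: reconstruction error plus a beta-weighted KL
    term pulling the posterior toward the factorized standard-normal prior,
    which is the statistical-independence pressure discussed above."""
    recon = np.sum((x - x_recon) ** 2, axis=-1)
    return recon + beta * diag_gaussian_kl(mu, log_var)

# A posterior that already matches the prior contributes no KL penalty,
# while a shifted posterior is penalized more heavily as beta grows.
kl_zero = diag_gaussian_kl(np.zeros(5), np.zeros(5))
kl_shifted = diag_gaussian_kl(np.ones(5), np.zeros(5))
```

Raising beta trades reconstruction quality for stronger pressure toward the factorized prior, which is what encourages each latent dimension to capture a separate factor of variation.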
Neural networks that incorporate compositional inductive bias: Another line of work designs neural networks that directly build compositional structure into the neural architecture. Compositional structure occurs naturally in computer vision data, as objects in the natural world are made of parts. In areas that have been widely studied beyond computer vision, including speech recognition, researchers have already summarized a series of compositional hypotheses and incorporated them into machine learning frameworks. For example, in computer vision, the "vision as inverse graphics" paradigm [23] treats vision tasks as the inverse of the computer graphics rendering process. In other words, it tries to decode images into a combination of features that might control the rendering of a scene, such as object position, orientation, texture and lighting. Many studies on unsupervised disentanglement have focused on creating this type of representation because it is intrinsically disentangled. Early approaches toward this goal include DC-IGN [165] and Spatial Transformers [135]. Recently, Capsule Networks [126,261] have provided a new way to incorporate compositional assumptions into neural networks. Instead of using neurons as the building blocks, Capsule Networks combine sets of neurons into larger units called "Capsules," and force them to represent information such as the pose, color and location of either a particular part or a complete object. This method was later combined with generative models in the Stacked Capsule Autoencoder (SCAE) [163]. With the help of the Set Transformer [179] in combining information between layers, SCAE discovers constituents of the image and organizes them into a smaller set of objects. Slot Attention modules [191] further use an iterative attention mechanism to control information flow between layers, and achieve better results on unsupervised object discovery.
Nevertheless, similar to the generative models, these networks perform poorly when aiming to discover concepts on more realistic datasets. For example, SCAE can only discover stroke-like structures that are uninterpretable to humans on the Street View House Numbers (SVHN) dataset [218] (see Figure 11). The reason is that the SCAE can only discover visual structures that appear frequently in the dataset, but in reality the appearance of objects that belong to the same category can vary a lot. There have been proposals [e.g., GLOM, 125] for how a neural network with a fixed architecture could potentially parse an image into a part-whole hierarchy. Although the idea seems promising, no working system has been developed yet. There is still a lot of room for development for this type of method.
Fig 11. Left: images from the SVHN dataset [218]. Right: Stroke-like templates discovered by the SCAE. Although the objects in the dataset are mostly digits, current capsule networks are unable to discover them as concepts without supervision.
Works that perform unsupervised disentanglement implicitly: Many other interpretable neural networks can also learn disjoint concepts in the latent space, although the idea of "disentanglement" is not explicitly mentioned in these papers. For example, as mentioned in Section 4, Chen et al. [53] propose to create a prototype layer, storing prototypical parts of training images, to perform case-based reasoning. When classifying birds, the prototypical parts are usually object parts such as the heads and wings of birds. Interestingly, a separation cost is applied to the learned prototypes to encourage diversity of the prototypical parts, which is very similar to the idea of disentanglement. Zhang et al. [333] propose to maximize mutual information between the output of convolutional filters and predefined part templates (feature masks with positive values in a localized region and negative values elsewhere). The mutual information regularization term is essentially the sum of two disentanglement losses: (a) an inter-category entropy loss that encourages each filter to be exclusively activated by images of one category and not activated on other categories; and (b) a spatial entropy loss that encourages each filter to be activated only on a local region of the image. Results show that convolutional filters trained with the mutual information loss tend to be activated only on a specific part of the object. These two methods are capable of learning single objects or parts in the latent space but have not yet been generalized to handle more comprehensive concepts (such as properties of scenes, including the style of an indoor room (e.g., cozy or modern) or the weather in outdoor scenes), since these concepts are not localized in the images.
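As a rough illustration of the spatial entropy idea above, the sketch below treats a single filter's activation map as a distribution over spatial positions and computes its entropy; the function name and toy maps are our own, and the actual regularizer of Zhang et al. [333] is trained end-to-end with both loss terms.

```python
import numpy as np

def spatial_entropy(act_map, eps=1e-12):
    """Entropy of a filter's activation map, normalized into a distribution
    over spatial positions. Low entropy means the filter fires on a small,
    localized region -- the behavior the spatial loss term encourages."""
    a = np.maximum(np.asarray(act_map, dtype=float), 0.0).ravel()
    p = a / (a.sum() + eps)
    return float(-np.sum(p * np.log(p + eps)))

uniform = np.ones((8, 8))        # a filter that fires everywhere
localized = np.zeros((8, 8))
localized[2, 3] = 1.0            # a filter that fires at a single position
# The localized map has far lower spatial entropy than the uniform one,
# so minimizing this quantity pushes filters toward localized parts.
```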
Despite multiple branches of related work targeting concept discovery in different ways, many challenges still exist:

6.1 How to quantitatively evaluate unsupervised disentanglement?
Recall that unsupervised disentanglement is desired (but challenging) in two different types of domains: (a) domains in which we do not know what the concepts are (e.g., materials science); (b) domains in which concepts are known but labeling biases exist. Let us start with a version of case (b) where some concepts are known, such as objects in natural images. Let us say we are just missing a subset of concept labels in the dataset, which is a simple type of missing data bias. If we build a neural network that disentangles the latent space, we can quantitatively evaluate the quality of disentanglement. For instance, let us say we know a collection of concepts that we would like disentangled (say, a collection of specific objects that appear in natural images). If we use an unsupervised algorithm for disentangling the space, and then evaluate whether it indeed disentangled the known concepts, then we have a useful quantitative evaluation [124,56,148,92,84]. If we are working in a domain where we do not know the concepts to disentangle (case (a)), we also might not know what regularization (or other type of inductive bias) to add to the algorithm so that it can discover concepts that we would find interpretable. In that case, it becomes difficult to evaluate the results quantitatively. Materials science, as discussed above, is an example of one of these domains, where we cannot find ground-truth labels of key patterns, nor can we send queries to human evaluators (since even material scientists do not know the key patterns in many cases). Evaluation metrics from natural images do not work. Evaluating disentanglement in these domains is thus a challenge. Going back to case (b), in situations where labeling biases exist (i.e., when only some concepts are labeled, and the set of labels is biased), current evaluation metrics of disentanglement that rely only on human annotations can be problematic. 
For instance, most annotations in current imagery datasets concern only objects in images but ignore useful information such as lighting and style of furniture. An unsupervised neural network may discover important types of concepts that do not exist in the annotations. Medical images may be an important domain here, where some concepts are clearly known to radiologists, but where labels in available datasets are extremely limited.
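When a collection of ground-truth factor labels is available for evaluation, one widely used score is the mutual information gap (MIG): for each factor, the gap in mutual information between the two latents most informative about it, normalized by the factor's entropy. Below is a simplified numpy sketch with invented discrete data; real evaluations must also estimate mutual information for continuous latents.

```python
import numpy as np

def discrete_mi(a, b):
    """Mutual information (nats) of two discrete variables via a joint histogram."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        joint[i, j] += 1
    p = joint / joint.sum()
    px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz])))

def entropy(v):
    p = np.bincount(v) / len(v)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mig(factor, latents):
    """Gap between the two latents most informative about the factor,
    normalized by the factor's entropy; near 1 means cleanly disentangled."""
    mis = sorted((discrete_mi(z, factor) for z in latents), reverse=True)
    return (mis[0] - mis[1]) / entropy(factor)

rng = np.random.default_rng(0)
factor = rng.integers(0, 4, 2000)          # a known ground-truth concept
z_aligned = factor.copy()                  # one latent captures it exactly
z_noise = rng.integers(0, 4, 2000)         # another latent is unrelated
score = mig(factor, [z_aligned, z_noise])  # close to 1.0 here
```

The metric only scores the factors we have labels for, which is exactly why labeling bias makes such evaluations problematic.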

6.2 How to adapt neural network architectures designed with compositional constraints to other domains?
Incorporating compositional assumptions into network architecture design is a way to create intrinsically interpretable disentangled representations.
That said, such assumptions are usually modality- and task-specific, which severely limits the general applicability of such designs. Let us take Capsule Networks [261] and the Slot Attention module [191] in the computer vision field as examples: these modules try to create object-centric and part-centric representations inside the latent space of the network. These ideas are based on the assumption that an image can be understood as a composition of different objects, and that the interference between objects is negligible. Nevertheless, such an assumption cannot necessarily be applied to materials science, in which a material's physical properties could depend jointly, in a complex way, on the patterns of constituent materials within it. How would we redefine the compositional constraints derived in computer vision for natural images to work for other domains such as materials science?

6.3 How to learn part-whole disentanglement for more complicated patterns in large vision datasets?
A specific area within unsupervised disentanglement in computer vision is to semantically segment the image into different parts, where each part represents an object or a part of an object [71,261,163,191]. Current networks can achieve convincing results on simple, synthetic datasets (where objects consist of simple geometric shapes or digits), including CLEVR6 [137], d-Sprite [202], and Objects Room and Tetrominoes [139]. However, no work has successfully learned part-whole relationships on more realistic datasets such as ImageNet. New techniques may be needed to handle the various interactions between different objects in a real-world dataset.
Example: Suppose material scientists want to build a classification model that can predict whether the designs of metamaterials support the existence of band gaps (same example as Figure 10). Because unit cells of metamaterials are usually represented as a matrix/tensor indicating which constituent materials are placed at each location, the researchers plan to build a deep neural network, since neural networks excel at extracting useful information from raw inputs. They might encounter several challenges when building the disentangled neural network. First, they need to identify the disentanglement constraints to build into the network architecture. Architectures that work well for imagery data may not work for unit cells of metamaterials, since key patterns in unit cells can be completely different from objects/concepts in imagery data (Challenge 6.2 above). Moreover, evaluation of disentanglement can be challenging as the material patterns have no ground truth labels (Challenge 6.1 above).

Dimension reduction for data visualization
Even in data science, a picture is worth a thousand words. Dimension reduction (DR) techniques take, as input, high-dimensional data and project it down to a lower-dimensional space (usually 2D or 3D) so that a human can better comprehend it. Data visualization can provide an intuitive understanding of the underlying structure of the dataset. DR can help us gain insight and build hypotheses. DR can help us design features so that we can build an interpretable supervised model, allowing us to work with high-dimensional data in a way we would not otherwise be able to. With DR, biases or pervasive noise in the data may be illuminated, allowing us to be better data caretakers. However, with the wrong DR method, information about the high-dimensional relationships between points can be lost when projecting onto a 2D or 3D space. DR methods are unsupervised ML methods that are constrained to be interpretable. Referring back to our generic interpretable ML formulation (*), DR methods produce a function mapping data x_1, x_2, ..., x_n from p-dimensional space to y_1, y_2, ..., y_n in a low-dimensional space (usually 2D). The constraint of mapping to 2 dimensions is an interpretability constraint. DR methods typically have a loss function that aims to preserve a combination of distance information between points and neighborhood information around each point when projecting to 2D. Each DR algorithm chooses the loss function differently. Each algorithm also chooses its own distance or neighborhood information to preserve.
Fig 12. Visualization of the MNIST dataset [177] using different kinds of DR methods: PCA [232], t-SNE [297,187,236], UMAP [204], and PaCMAP [313]. The axes are not quantified because these are projections into an abstract 2D space.
Generally speaking, there are two primary types of approaches to DR for visualization, commonly referred to as local and global methods. Global methods aim mainly to preserve distances between any pair of points (rather than neighborhoods), while local methods emphasize preservation of local neighborhoods (that is, which points are nearest neighbors). As a result, local methods can preserve the local cluster structure better, while failing to preserve the overall layout of clusters in the space, and vice versa. Figure 12 demonstrates the difference between the two kinds of algorithms on the MNIST handwritten digit dataset [177], which is a dataset where local structure tends to be more important than global structure. The only global method here, Principal Component Analysis (PCA) [232], fails to separate different digits into clusters, but it gives a sense of which digits are different from each other. t-SNE [297], which is a local method, successfully separates all the digits, but does not keep the scale information that is preserved in the PCA embedding. More recent methods, such as UMAP [204] and PaCMAP [313], also separate the digits while preserving some of the global information.
Early approaches toward this problem, including PCA [232] and Multidimensional Scaling (MDS) [289], mostly fall into the global category. They aim to preserve as much information as possible from the high-dimensional space, including the distances or rank information between pairs of points. These methods usually apply matrix decomposition over the data or pairwise distance matrix, and are widely used for data preprocessing. These methods usually fail to preserve local structure, including cluster structure.
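As a concrete reference point, the projection step of PCA, the prototypical global method, can be sketched in a few lines of numpy; the random data here are illustrative.

```python
import numpy as np

def pca_2d(X):
    """Classical global DR: project centered data onto the top-2 right
    singular vectors, i.e., the directions of maximal variance."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 0] *= 5.0               # make one input direction dominate the variance
Y = pca_2d(X)                # 2-D embedding
# The first embedding axis captures the dominant high-variance direction,
# but nearby points in 10-D need not stay nearest neighbors in 2-D --
# the local-structure failure discussed above.
```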
To solve these problems, researchers later (early 2000s) developed methods that aimed at local structure, because they had knowledge that high dimensional data lie along low-dimensional manifolds. Here, it was important to preserve the local information along the manifolds. Isomap [285], Local Linear Embedding (LLE) [249], Hessian Local Linear Embedding [87], and Laplacian Eigenmaps [24] all try to preserve exact local Euclidean distances from the original space when creating low-dimensional embeddings. But distances between points behave differently in high dimensions than in low dimensions, leading to problems preserving the distances. In particular, these methods tended to exhibit what is called the "crowding problem," where samples in the low-dimensional space crowd together, devastating the local neighborhood structure and losing information. t-SNE [297] was able to handle this problem by transforming the high-dimensional distances between each pair of points into probabilities of whether the two points should be neighbors in the low-dimensional space. Doing this aims to ensure that local neighborhood structure is preserved. Then, during the projection to low dimensions, t-SNE enforces that the distribution over distances between points in the low-dimensional space is a specific transformation of the distribution over distances between points in the high-dimensional space. Forcing distances in the low-dimensional space to follow a specific distribution avoids the crowding problem, which is when a large proportion of the distances are almost zero.
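The distance-to-probability idea can be sketched as follows. This is a toy version: it fixes a single Gaussian bandwidth instead of calibrating a per-point bandwidth to a target perplexity as t-SNE does, uses conditional rather than symmetrized probabilities, and omits the KL-divergence optimization that actually produces the embedding.

```python
import numpy as np

def _sq_dists(X):
    """All pairwise squared Euclidean distances."""
    return ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

def high_dim_affinities(X, sigma=1.0):
    """Gaussian neighbor probabilities p_{j|i}: each row is a distribution
    over 'which point is my neighbor'. (Real t-SNE calibrates a per-point
    bandwidth sigma_i to a target perplexity; we fix one sigma.)"""
    d = _sq_dists(X)
    np.fill_diagonal(d, np.inf)             # a point is not its own neighbor
    p = np.exp(-d / (2.0 * sigma ** 2))
    return p / p.sum(axis=1, keepdims=True)

def low_dim_affinities(Y):
    """Heavy-tailed Student-t similarities q_{ij}; the heavy tail lets
    moderately-far points sit far apart in 2D, avoiding the crowding problem."""
    d = _sq_dists(Y)
    np.fill_diagonal(d, np.inf)
    q = 1.0 / (1.0 + d)
    return q / q.sum()

X = np.array([[0.0], [0.1], [5.0]])         # two close points, one far away
P = high_dim_affinities(X)                  # rows sum to 1; P[0,1] >> P[0,2]
```

The embedding is then found by moving the low-dimensional points so that the q distribution matches the p distribution.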
Though more successful than previous methods, t-SNE still suffers from slow running times, sensitivity to hyperparameter changes, and lack of preservation of global structure. A series of t-SNE variants aim to improve on these shortcomings. The most famous among them are BH t-SNE [296] and FIt-SNE [187]. BH t-SNE constructs a k-nearest neighbor graph over the high-dimensional space to record local neighborhood structure, and utilizes the Barnes-Hut force calculation algorithm [which is typically used for multi-body simulation; 20] to accelerate optimization of the low-dimensional embedding. These choices reduced the complexity of each step in the optimization from O(n^2), where n is the number of samples in the dataset, to a smaller O(n log n). FIt-SNE further accelerates the rendering phase with a Fast Fourier Transform, and brings the complexity down to O(n). Besides these variants, multiple new algorithms have been created based on the framework of t-SNE. Prominent examples include LargeVis [284], which is widely used in network analysis, and UMAP [204], which is widely used in computational biology. With a better initialization created by unsupervised machine learning algorithms (such as spectral clustering) and better loss functions, these algorithms improve global structure preservation and run-time efficiency. Figure 13 demonstrates differences in results from DR methods on a dataset of isotropic Gaussians. As the figure shows, the results of different DR techniques look quite different from each other.
Fig 13. Visualization of 4000 points sampled from 20 isotropic Gaussians using Laplacian Eigenmap [24,233], t-SNE [297,187,236], ForceAtlas2 [134], UMAP [204], TriMap [9] and PaCMAP [313]. The 20 Gaussians are equally spaced along an axis in 50-dimensional space, labelled by the gradient colors.
The best results are arguably those of t-SNE and PaCMAP in this figure, which preserve the clusters compactly and their relative placement (yellow on the left, purple on the right). Recent studies on DR algorithms shed light on how the loss function affects the rendering of local structure [37], and provide guidance on how to design good loss functions so that local and global structure can both be preserved simultaneously [313]. Nevertheless, several challenges still exist for DR methods:

7.1 How to capture information from the high-dimensional space more accurately?
Most of the recent DR methods capture information in the high-dimensional space mainly from the k-nearest neighbors and their relative distances, at the expense of information from points that are more distant, which would allow the preservation of more global structure. Several works [314,64] discuss possible pitfalls in data analysis created by t-SNE and UMAP due to loss of non-local information. Recent methods mitigate the loss of global information by using global-aware initialization [204,155] (that is, initializing the distances in the low-dimensional space using PCA) and/or selectively preserving distances between non-neighbor samples [105,313]. Nevertheless, these methods are still designed and optimized under the assumption that the nearest neighbors, defined by the given metric (usually Euclidean distance in the high-dimensional space), can accurately depict the relationships between samples. This assumption may not hold true for some data; for instance, Euclidean distance may not be suitable for measuring distances between weights (or activations) of a neural network [see 50, for a detailed example of such a failure]. We would like DR methods to better capture information from the high-dimensional space to avoid such pitfalls.

7.2 How should we select hyperparameters for DR?
Modern DR methods, due to their multi-stage characteristics, involve a large number of hyperparameters, including the number of high-dimensional nearest neighbors to be preserved in the low-dimensional space and the learning rate used to optimize the low-dimensional embedding. There are often dozens of hyperparameters in any given DR method, and since DR methods are unsupervised and we do not already know the structure of the high-dimensional data, it is difficult to tune them. A poor choice of hyperparameters may lead to disappointing (or even misleading) DR results. Fig 14 shows some DR results for the Mammoth dataset [133,64] using t-SNE, LargeVis, UMAP, TriMap and PaCMAP with different sets of reasonable hyperparameters. When the perplexity parameter or the number of nearest neighbors is not chosen carefully, algorithms can fail to preserve the global structure of the mammoth (specifically, the overall placement of the mammoth's parts), create spurious clusters (losing connectivity between parts of the mammoth), and lose details (such as the toes on the feet of the mammoth). For more detailed discussions about the effects of different hyperparameters, see [314,64]. Multiple works [for example 25,313] have aimed to alleviate this problem for the most influential hyperparameters, but the problem still exists, and the set of hyperparameters remains data-dependent. The tuning process, which sometimes involves many runs of a DR method, is time and power consuming, and requires user expertise in both the data domain and in DR algorithms. It would be extremely helpful to achieve better automatic hyperparameter selection for DR algorithms.

7.3 Can the DR transformation from high to low dimensions be made more interpretable or explainable?
The DR mapping itself, that is, the transformation from high to low dimensions, is typically complex.
There are some cases in which insight into this mapping can be gained; for instance, if PCA is used as the DR method, we may be able to determine which of the original dimensions are dominant in the first few principal components. It may be useful to design modern approaches to help users understand how the final two or three dimensions are defined in terms of the high-dimensional features. This may take the form of explanatory post-hoc visualizations or constrained DR methods.
Example: Computational biologists often apply DR methods to single-cell RNA sequence data to understand the cell differentiation process and discover previously-unknown subtypes of cells. Without a suitable way to tune parameters, they may be misled by a DR method into thinking that a spurious cluster from a DR method is actually a new subtype of cell, when it is simply a failure of the DR method to capture local or global structure (Challenge 7.2 above).
Since tuning hyperparameters in a high-dimensional space is difficult (without the ground truth afforded to supervised methods), the researchers have no way to see whether this cluster is present in the high-dimensional data or not (Challenge 7.1 above). Scientists could waste a lot of time examining each such spurious cluster. If we were able to solve the problems with DR tuning and structure preservation discussed above, it would make DR methods more reliable, leading to potentially increased understanding of many datasets.
Fig 14. Projection from [313] of the Mammoth dataset into 2D using t-SNE [297,187,236], LargeVis [284], UMAP [204], TriMap [9] and PaCMAP [313]. Incorrectly-chosen hyperparameters will lead to misleading results even in a simple dataset. This issue is particularly visible for t-SNE (first two columns) and UMAP (fourth column). The original dataset is 3-dimensional and is shown at the top.

Machine learning models that incorporate physics and other generative or causal constraints
There is a growing trend towards developing machine learning models that incorporate physics (or other) constraints. These models are not purely data-driven, in the sense that their training may require little data or no data at all [e.g., 244]. Instead, these models are trained to observe physical laws, often in the form of ordinary (ODEs) and partial differential equations (PDEs). These physics-guided models provide alternatives to traditional numerical methods (e.g., finite element methods) for solving PDEs, and are of immense interest to physicists, chemists, and materials scientists. The resulting models are interpretable, in the sense that they are constrained to follow the laws of physics that were provided to them. (It might be easier to think conversely: physicists might find that a standard supervised machine learning model that is trained on data from a known physical system -- but that does not follow the laws of physics -- would be uninterpretable.) The idea of using machine learning models to approximate the solutions of ODEs and PDEs is not new. Lee and Kang [178] developed highly parallel algorithms, based on neural networks, for solving finite difference equations (which are themselves approximations of original differential equations). Psichogios and Ungar [238] created a hybrid neural network-first principles modeling scheme, in which neural networks are used to estimate parameters of differential equations. Lagaris et al. [168,169] explored the idea of using neural networks to solve initial and boundary value problems. Several additional works [167,43,259] used sparse regression and dynamic mode decomposition to discover the governing equations of dynamical systems directly from data. More recently, Raissi et al. [242] extended the earlier works and developed the general framework of a physics-informed neural network (PINN). In general, a PINN is a neural network that approximates the solution of a set of PDEs with initial and boundary conditions. 
The training of a PINN minimizes the residuals from the PDEs as well as the residuals from the initial and boundary conditions. In general, physics-guided models (neural networks) can be trained without supervised training data. Let us explain how this works. Given a differential equation, say, f'(t) = af(t) + bt + c, where a, b and c are known constants, we could train a neural network g to approximate f by minimizing (g'(t) − ag(t) − bt − c)^2 at finitely many points t. Thus, no labeled data in the form (t, f(t)) (what we would need for conventional supervised machine learning) is needed. The derivative g'(t) with respect to input t (at each of those finitely many points t used for training) is found by leveraging the existing network structure of g using back-propagation. Figure 15 illustrates the training process of a PINN for approximating the solution of the one-dimensional heat equation ∂u/∂t = k ∂²u/∂x² with initial condition u(x, 0) = f(x) and Dirichlet boundary conditions u(0, t) = 0 and u(L, t) = 0. If observed data are available, a PINN can be optimized with an additional mean-squared-error term to encourage data fit. Many extensions to PINNs have since been developed, including fractional PINNs [226] and parareal PINNs [207]. PINNs have been extended to convolutional [340] and graph neural network [270] backbones. They have been used in many scientific applications, including fluid mechanics modeling [243], cardiac activation mapping [263], stochastic systems modeling [323], and discovery of differential equations [241,240].
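To make the "no labeled data" point concrete, here is a minimal collocation sketch for the example ODE above, with a polynomial ansatz standing in for the neural network so that residual minimization reduces to linear least squares; the constants, degree, and collocation grid are our own illustrative choices.

```python
import numpy as np

# Physics-only training for f'(t) = a*f(t) + b*t + c with f(0) = f0:
# minimize the squared ODE residual at collocation points.
# No labeled (t, f(t)) pairs are used anywhere.
a, b, c, f0 = -1.0, 1.0, 0.0, 1.0           # illustrative constants
deg = 8                                     # degree of the polynomial ansatz
t = np.linspace(0.0, 1.0, 50)               # collocation points

powers = np.arange(deg + 1)
G = t[:, None] ** powers                                 # basis values g(t_i)
dG = powers * t[:, None] ** np.maximum(powers - 1, 0)    # derivatives g'(t_i)

# Residual rows g'(t_i) - a*g(t_i) = b*t_i + c, plus one up-weighted row
# enforcing the initial condition g(0) = f0.
ic = np.zeros(deg + 1)
ic[0] = 1.0
A = np.vstack([dG - a * G, 10.0 * ic])
rhs = np.append(b * t + c, 10.0 * f0)
w, *_ = np.linalg.lstsq(A, rhs, rcond=None)

approx = G @ w                               # the trained "model" at the points
exact = 2.0 * np.exp(-t) + t - 1.0           # analytic solution for these constants
err = float(np.max(np.abs(approx - exact)))  # tiny: physics alone pins down f
```

With a neural network in place of the polynomial, the derivative g'(t) would come from back-propagation and the residual would be minimized by gradient descent, but the structure of the objective is the same.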
In addition to neural networks, Gaussian processes are also popular models for approximating solutions of differential equations. For example, Archambeau et al. [16] developed a variational approximation scheme for estimating the posterior distribution of a system governed by a general stochastic differential equation, based on Gaussian processes. Zhao et al. [335] developed a PDE-constrained Gaussian process model, based on the global Galerkin discretization of the governing PDEs for the wire saw slicing process. More recently, Pang et al. [227] used the neural-network-induced Gaussian process (NNGP) regression for solving PDEs.
Despite recent success of PINNs or physics-guided machine learning in general, challenges still exist, including:

Characterization of the "Rashomon" set of good models
In many practical machine learning problems, there is a multiplicity of almost-equally-accurate models. This set of high-performing models is called the Rashomon set, based on an observation of the Rashomon effect by the statistician Leo Breiman. The Rashomon effect occurs when there are multiple descriptions of the same event [41] with possibly no ground truth. The Rashomon set includes these multiple descriptions of the same dataset [and again, none of them are assumed to be the truth, see 269,97,86,201]. Rashomon sets occur in multiple domains, including credit score estimation, medical imaging, natural language processing, health records analysis, recidivism prediction, model explanations for planning and robotics, and so on [74,201,280]. We have discussed in Challenges 4 and 5 that even deep neural networks for computer vision exhibit Rashomon sets, because neural networks that perform case-based reasoning or disentanglement still yielded models that were equally accurate to their unconstrained counterparts; thus, these interpretable deep neural models are within the Rashomon set. Rashomon sets present an opportunity for data scientists: if there are many equally-good models, we can choose one that has desired properties that go beyond minimizing an objective function. In fact, the model that optimizes the training loss might not be the best to deploy in practice anyway due to the possibilities of poor generalization, trust issues, or encoded inductive biases that are undesirable [74]. More careful approaches to problem formulation and model selection could be taken that include the possibility of model multiplicity in the first place. Simply put -- we need ways to explore the Rashomon set, particularly if we are interested in model interpretability.
Formally, the Rashomon set is the set of models whose training loss is below a specific threshold, as shown in Figure 16 (a). Given a loss function Loss and a model class F, the Rashomon set can be written as

R(F, f*, ε) = {f ∈ F : Loss(f) ≤ Loss(f*) + ε},

where f* can be an empirical risk minimizer, an optimal model, or any other reference model. We would typically choose F to be complex enough to contain models that fit the training data well without overfitting. The threshold ε, which is called the Rashomon parameter, can be a hard hyperparameter set by a machine learning practitioner or a multiplier γ of the loss (i.e., ε becomes γ·Loss(f*)). We would typically choose ε or γ to be small enough that suffering this additional loss would have little to no practical significance on predictive performance. For instance, we might choose it to be much smaller than the (generalization) error between training and test sets. We would conversely want to choose ε or γ to be as large as permitted so that we have more flexibility to choose models within a bigger Rashomon set.
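For intuition, the Rashomon set can be enumerated exactly for a toy model class. In the sketch below, F is the class of one-dimensional threshold classifiers under 0-1 loss, f* is the empirical risk minimizer, and ε is a hard threshold; the dataset and model class are invented for illustration.

```python
import numpy as np

# Enumerate the Rashomon set for threshold classifiers f_theta(x) = 1[x >= theta].
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 1.0, 100),   # class 0
                    rng.normal(+1.0, 1.0, 100)])  # class 1
y = np.concatenate([np.zeros(100), np.ones(100)])

thetas = np.linspace(-3.0, 3.0, 601)
losses = np.array([np.mean((x >= th) != (y == 1)) for th in thetas])  # 0-1 loss

best = losses.min()        # loss of the empirical risk minimizer f*
eps = 0.02                 # Rashomon parameter
rashomon = thetas[losses <= best + eps]   # all near-optimal thresholds
# Many distinct models are (almost) equally accurate on the training data,
# leaving room to pick one with additional desirable properties.
```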
It has been shown by Semenova et al. [269] that when the Rashomon set is large, under weak assumptions, it must contain a simple (perhaps more interpretable) model within it. The argument goes as follows: assume the Rashomon set is large, so that it contains a ball of functions from a complicated function class F_complicated (think of this as high-dimensional polynomials that are complex enough to fit the data well without overfitting). If a set of simpler functions F_simpler could serve as an approximating set for F_complicated (think of decision trees of a certain depth approximating the set of polynomials), then each complicated function can be well-approximated by a simpler function (and indeed, polynomials can be well-approximated by decision trees). By this logic, the ball of F_complicated that is within the Rashomon set must contain at least one function from F_simpler, which is the simple function we were looking for.
Semenova et al. [269] also suggested a useful rule of thumb for determining whether a Rashomon set is large: run many different types of machine learning algorithms (e.g., boosted decision trees, support vector machines, neural networks, random forests, logistic regression) and if they generally perform similarly, it correlates with the existence of a large Rashomon set (and thus the possibility of a simpler function also achieving a similar level of accuracy). The knowledge that there might exist a simple-yet-accurate function before finding it could be very useful, particularly in the cases where finding an optimal sparse model is NP-hard, as in Challenge 1. Here, the user would run many different algorithms to determine whether it would be worthwhile to solve the hard problem of finding an interpretable sparse model.
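This rule of thumb can be sketched as a simple heuristic check; the validation accuracies below are hypothetical stand-ins for the results of running the different algorithms:

```python
def rashomon_set_likely_large(accuracies, tolerance=0.01):
    """Heuristic version of the rule of thumb: if most algorithm types
    reach nearly the best accuracy, a large Rashomon set likely exists,
    so searching for a simple accurate model may be worthwhile.

    accuracies : dict mapping algorithm name -> validation accuracy
    tolerance  : how close to the best accuracy counts as "similar"
    """
    best = max(accuracies.values())
    near_best = [a for a in accuracies.values() if best - a <= tolerance]
    # "Generally perform similarly" = nearly all algorithms are near the best.
    return len(near_best) >= max(2, len(accuracies) - 1)

# Hypothetical validation accuracies from very different model classes.
accs = {"boosted_trees": 0.861, "svm": 0.855, "random_forest": 0.862,
        "logistic_regression": 0.858, "neural_net": 0.860}
print(rashomon_set_likely_large(accs))
```

A True result would suggest it is worth paying the computational price of searching for an optimal sparse model, as in Challenge 1.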
Knowledge of the functions that lie within Rashomon sets is valuable in multiple interesting use cases. Models with various important properties besides interpretability can exist in the Rashomon set, including fairness [67] and monotonicity. Thus, understanding the properties of the Rashomon set could be pivotal for the analysis of a complex machine learning problem and its possible modeling choices.
The size of the Rashomon set can be considered a way of measuring the complexity of a learning problem. Problems with large Rashomon sets are less complicated, since more good solutions exist. The Rashomon set is a property of both the model class and the dataset. The size of the Rashomon set differs from other known characterizations of complexity in machine learning. The complexity of function classes, algorithms, or learning problems is typically measured in statistical learning theory using the VC dimension, covering numbers, algorithmic stability, Rademacher complexity, or flat minima [see, for instance, 279,140,339,266,161,39,299]; the size of the Rashomon set differs from all of these quantities in fundamental ways, and it is important in its own right for showing the existence of simpler models.
A useful way to represent the hypothesis space for a problem is by projecting it into variable importance space, where each axis represents the importance of a variable. That way, a single function is represented by a point in this space (i.e., a vector of coordinates), where each coordinate represents how important a variable is to that model. (Here, the importance of variable v is measured by model reliance or conditional model reliance, which represents the change in loss we would incur if we scrambled variable v.) The Rashomon set can be represented as a subset of this variable importance space. Fisher et al. [97] used this representation to create a model-independent notion of variable importance: specifically, they suggest considering the maximum and minimum of variable importance across all models in the Rashomon set. This is called the Model Class Reliance (MCR). For any user who picks a "good" model (i.e., a model in the Rashomon set), the importance of the variable lies within the range given by the MCR. Dong and Rudin [86] expand on this by suggesting to visualize the "cloud" of variable importance values for models within the Rashomon set. This cloud helps us understand the Rashomon set well enough to choose a model we might find interpretable.
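As a toy sketch of these ideas, MCR can be computed by taking the range of scrambling-based importance over the models in the Rashomon set. The data, models, and loss below are hypothetical, and model reliance is simplified to a single random permutation rather than an expectation over permutations:

```python
import random

def mse(w, X, y):
    # Mean squared error of linear model w on (X, y).
    return sum((sum(wj * xj for wj, xj in zip(w, x)) - yi) ** 2
               for x, yi in zip(X, y)) / len(y)

def model_reliance(w, X, y, v, rng):
    # Loss increase after scrambling (permuting) the values of variable v.
    col = [x[v] for x in X]
    rng.shuffle(col)
    X_scr = [list(x) for x in X]
    for i in range(len(X_scr)):
        X_scr[i][v] = col[i]
    return mse(w, X_scr, y) - mse(w, X, y)

def model_class_reliance(models, X, y, v, epsilon, seed=0):
    # MCR: the range of variable v's importance across the Rashomon set.
    best = min(mse(w, X, y) for w in models)
    rashomon = [w for w in models if mse(w, X, y) <= best + epsilon]
    rng = random.Random(seed)
    reliances = [model_reliance(w, X, y, v, rng) for w in rashomon]
    return min(reliances), max(reliances)

# Duplicated feature (x1 == x0), so many weightings fit equally well.
X = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]
y = [2, 4, 6, 8, 10]
models = [[2, 0], [0, 2], [1, 1]]   # all three have zero training error
lo, hi = model_class_reliance(models, X, y, v=0, epsilon=0.01)
print("MCR range for variable 0:", (lo, hi))
```

Because the model [0, 2] ignores variable 0 entirely while [2, 0] depends on it completely, the MCR interval for variable 0 stretches from zero importance upward, illustrating why a single model's variable importance can be misleading.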
Current research has barely scratched the surface of the characterization and usage of model multiplicity and Rashomon sets in machine learning. Exploring the Rashomon set is critical if we hope to find interpretable models within the set of accurate models. We ask the following questions about the Rashomon set:

9.1 How can we characterize the Rashomon set?

As discussed above, the volume of the Rashomon set is a useful indirect indicator of the existence of interpretable models. Several works suggest computing statistics of the Rashomon set in parameter space, such as its volume in parameter space, which is called the Rashomon volume. The Rashomon volume and other statistics that characterize the Rashomon set can be computed more easily if the loss surface in parameter space has a single global minimum [127,83,144,52]; related measures include ε-flatness [127,83] and ε-sharpness [144,83]. The Rashomon ratio provides some perspective on the size of the Rashomon set; it is defined as the ratio of the Rashomon volume to the volume of the whole hypothesis space [269], assuming that both are bounded. Several problems arise when computing the size of the Rashomon set in parameter space. First, the volume of a set of functions can be hard to compute or estimate, but there are ways to do it, including sampling (or analytical computation in some simple cases when there is one minimum). Second, the choice of parameterization of the hypothesis space matters when working in parameter space. Overparametrization or underparametrization can make volumes in parameter space artificially larger or smaller without any change in the actual space of functions [83]. For instance, if we make a copy of a variable and include it in the dataset, it can change the Rashomon ratio. A third, separate problem with the Rashomon ratio is its denominator: the denominator is the volume of all models we might reasonably consider before seeing the data.
If we change the set of models we might consider, we change the denominator, and the ratio changes. And, because different model spaces are different, there might not be a way to directly compare Rashomon ratios computed on different hypothesis spaces. This brings us to another way to measure the Rashomon set: the pattern Rashomon ratio [269,201], which could potentially handle some of these problems. The pattern Rashomon ratio considers unique predictions on the data (called "patterns") rather than the functions themselves. In other words, the pattern Rashomon ratio counts the predictions that could be made rather than counting functions. Figure 17 illustrates the difference between the Rashomon ratio and the pattern Rashomon ratio. The pattern Rashomon ratio has a fixed denominator (all possible patterns one could produce on the data using functions from the class), so a direct comparison of the pattern Rashomon ratio across function classes is meaningful. However, it is still difficult to compute, and it is more sensitive to small shifts and changes in the data. Although multiple measures, including ε-flatness, ε-sharpness, the Rashomon ratio, and the pattern Rashomon ratio, have been proposed to measure the size of the Rashomon set, none provides a universal and simple-to-calculate characterization of it. Many challenges remain open: (a) What is a good space in which to measure the size of the Rashomon set? What is a good measure of the size of the Rashomon set? As discussed above, the parameter space suffers from issues with parameterization, but the pattern Rashomon ratio has its own problems. Would variable importance space be suitable, where each axis represents the importance of a raw variable, as in the work of Dong and Rudin [86]? Or is there an easier space and metric to work with?
(b) Are there approximation algorithms or other techniques that will allow us to efficiently compute or approximate the size of the Rashomon set? At worst, computation over the Rashomon set requires a brute force calculation over a large discrete hypothesis space. In most cases, the computation should be much easier. Perhaps we could use dynamic programming or branch and bound techniques to reduce the search space so that it encompasses the Rashomon set but not too much more than that? In some cases, we could actually compute the size of the Rashomon set analytically. For instance, for linear regression, a closed-form solution in parameter space for the volume of the Rashomon set has been derived based on the singular values of the data matrix [269].
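As a toy illustration of the pattern Rashomon ratio discussed above, consider one-dimensional threshold classifiers, where many thresholds induce the same prediction pattern on the data. The data and hypothesis class here are hypothetical:

```python
def pattern_rashomon_ratio(xs, ys, thresholds, epsilon):
    """Pattern Rashomon ratio for 1-D threshold classifiers f_t(x) = 1[x >= t].

    Counts distinct prediction vectors ("patterns") on the data rather than
    functions: many thresholds induce the same pattern, so the pattern view
    has a fixed, comparable denominator.
    """
    def pattern(t):
        return tuple(1 if x >= t else 0 for x in xs)

    def zero_one_loss(p):
        return sum(pi != yi for pi, yi in zip(p, ys)) / len(ys)

    all_patterns = {pattern(t) for t in thresholds}
    best = min(zero_one_loss(p) for p in all_patterns)
    good = {p for p in all_patterns if zero_one_loss(p) <= best + epsilon}
    return len(good) / len(all_patterns)

# Hypothetical data: four points, labels flip at x = 2.5.
xs, ys = [1, 2, 3, 4], [0, 0, 1, 1]
thresholds = [0.5, 1.5, 2.5, 3.5, 4.5]   # one per distinct pattern here
print(pattern_rashomon_ratio(xs, ys, thresholds, epsilon=0.25))
print(pattern_rashomon_ratio(xs, ys, thresholds, epsilon=0.0))
```

Shrinking ε shrinks the numerator (fewer near-optimal patterns), while the denominator, the set of achievable patterns, stays fixed.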

9.2 What techniques can be used to visualize the Rashomon set?
Visualization of the Rashomon set can potentially help us to understand its properties, issues with the data, biases, or underspecification of the problem. To give an example of how visualization can help in troubleshooting models generally, Li et al. [181] visualized the loss landscape of neural networks and, as a result, found answers to questions about the selection of the batch size, optimizer, and network architecture. Kissel and Mentch [151] proposed a tree-based graphical visualization to display outputs of the model path selection procedure, which finds models from the Rashomon set based on forward selection of variables. The visualization helps us to understand the stability of the model selection process, as well as the richness of the Rashomon set, since wider graphical trees imply that there are more models available for the selection procedure. To characterize the Rashomon set in variable importance space, variable importance diagrams have been proposed [86], which are 2-dimensional projections of the variable importance cloud. The variable importance cloud is created by mapping every variable to its importance for every good predictive model (every model in the Rashomon set). Figure 16(b) uses a similar projection into two-dimensional space and depicts the visualization of the example Rashomon set of Figure 16(a). This simple visualization allows us to see the Rashomon set's layout, estimate its size, or locate sparser models within it. Can more sophisticated approaches be developed that would allow good visualization? The success of these techniques will most likely depend on whether we can design a good metric for the model class. After the metric is designed, perhaps we might be able to utilize techniques from Challenge 7 for the visualization.

9.3 What model to choose from the Rashomon set?

When the Rashomon set is large, it can contain multiple accurate models with different properties.
Choosing between them might be difficult, particularly if we do not know how to explore the Rashomon set. Interactive methods might rely on dimension reduction techniques (that allow users to change the location of data on the plot or to change the axes of the visualization), weighted linear models (that allow the user to compute weights on specific data points), or continuous feedback from the user (which helps to continuously improve model predictions in changing environments, for example, in recommender systems) in order to interpret or choose a specific model with a desired property. Das et al. [75] designed a system called BEAMES that allows users to interactively select important features, change weights on data points, and visualize and select a specific model or even an ensemble of models. BEAMES searches the hypothesis space for models that are close to the practitioner's constraints and design choices. The main limitation of BEAMES is that it works with linear regression models only. Can a similar framework that searches the Rashomon set, instead of the whole hypothesis space, be designed? What would the interactive specification of constraints look like in practice to help the user choose the right model? Could collaboration with domain experts be useful in other ways to explore the Rashomon set?
Example: Suppose that a financial institution would like to make data-driven loan decisions. The institution must have a model that is as accurate as possible, and must provide reasons for loan denials. In practice, loan decision prediction problems have large Rashomon sets, and many machine learning methods perform similarly despite their different levels of complexity. To check whether the Rashomon set is indeed large, the financial institution wishes to measure the size of the Rashomon set (Challenge 9.1) or visualize the layout of it (Challenge 9.2) to understand how many accurate models exist, and how many of them are interpretable. If the Rashomon set contains multiple interpretable models, the financial institution might wish to use an interactive framework (Challenge 9.3) to navigate the Rashomon set and design constraints that would help to locate the best model for their purposes. For example, the institution could additionally optimize for fairness or sparsity.

Interpretable reinforcement learning
Reinforcement learning (RL) methods determine what actions to take at different states in order to maximize a cumulative numerical reward [e.g., see 282]. In RL, a learning agent interacts with the environment by a trial and error process, receiving signals (rewards) for the actions it takes. Recently, deep reinforcement learning algorithms have achieved state-of-the-art performance in mastering the game of Go and various Atari games [273,210] and have been actively used in robotic control [156,283], simulated autonomous navigation and self-driving cars [150,265], dialogue summarization, and question answering [182]. RL has also been applied to high-stakes decisions, such as healthcare and personalized medicine, including approaches for dynamic treatment regimes [190,326], treatment of sepsis [162], treatment of chronic disorders including epilepsy [117], for HIV therapy [230], and management of Type 1 diabetes [136].
In reinforcement learning, at each timestep, an agent observes the state (situation) that it is currently in, chooses an action according to a policy (a learned mapping from states to actions), and receives an intermediate reward that depends on the chosen action and the current state; this leads the agent to the next state, where the process repeats. For example, let us consider an RL system that is used for the management of glucose levels of Type 1 diabetes patients by regulating the amount of synthetic insulin in the blood. The state is represented by the patient's information, including diet, exercise level, weight, and so on; actions include the amount of insulin to inject; rewards are positive if the blood sugar level is within a healthy range and negative otherwise. Figure 18 shows the schematic diagram of a reinforcement learning system based on the diabetes example.
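The loop above (observe state, act, receive reward, transition) can be sketched with tabular Q-learning on a deliberately tiny, entirely hypothetical environment loosely inspired by the glucose example; this illustrates the RL interaction cycle, not a medical model:

```python
import random

# Toy deterministic environment: states are coarse blood-sugar levels
# 0 (low) .. 4 (high); the healthy level is state 2. Action 0 = no insulin
# (level drifts up), action 1 = inject (level goes down). Hypothetical.
def step(state, action):
    next_state = max(0, min(4, state + (1 if action == 0 else -1)))
    reward = 1.0 if next_state == 2 else -1.0   # healthy range -> positive
    return next_state, reward

def q_learning(episodes=1000, horizon=20, alpha=0.5, gamma=0.9,
               explore=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(5)]          # Q[state][action]
    for _ in range(episodes):
        s = rng.randrange(5)                    # random start state
        for _ in range(horizon):
            if rng.random() < explore:          # epsilon-greedy exploration
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda act: Q[s][act])
            s2, r = step(s, a)
            # Temporal-difference update toward reward + discounted value.
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

Q = q_learning()
policy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(5)]
print("greedy action per state:", policy)
```

The learned greedy policy drifts upward from low states and injects from high states, steering toward the healthy state in the middle.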
Reinforcement learning is more challenging than the standard supervised setting: the agent does not receive all the data at once (in fact, the agent is responsible for collecting data by exploring the environment), and besides immediate rewards, there are sometimes delayed rewards, where feedback is provided only after some sequence of actions, or even after the goal is achieved (for example, a self-driving car agent receives its positive reward only after it reaches the destination). Because reinforcement learning involves long-term consequences of actions, because actions and states depend on each other, and because agents need to collect data through exploration, interpretability generally becomes more difficult than in standard supervised learning. Search spaces for RL can be massive, since we need to understand which action to take in each state based on an estimate of future rewards, the uncertainty of these estimates, and the exploration strategy. In deep reinforcement learning [210,273], policies are defined by deep neural networks, which helps to solve complex applied problems, but typically these policies are essentially impossible to understand or trust.
Interpretability could be particularly valuable in RL. Interpretability might help to reduce the RL search space; if we understand the choices and intent of the RL system, we can remove actions that might lead to harm. Interpretability could make it easier to troubleshoot (and use) RL for long-term medical treatment or to train an agent for human-robot collaboration.
One natural way to include interpretability in a reinforcement learning system is by representing the policy or value function by a decision tree or rule list [248,272]. In a policy tree representation, the nodes might contain a small set of logical conditions, and leaves might contain an estimate of the optimal probability distribution over actions (see Figure 19). A policy tree can be grown, for example, in a greedy incremental way by updating the tree only when the estimated discounted future reward would increase sufficiently [248]; in the future, one could adapt methods like those from Challenge 1 to build such trees.
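A minimal sketch of such a policy tree, with hypothetical conditions and action probabilities in the spirit of Figure 19, might look like:

```python
# A minimal policy-tree sketch: internal nodes hold a logical condition on
# the state, leaves hold a probability distribution over actions. The
# features, conditions, and probabilities here are hypothetical.
class Leaf:
    def __init__(self, action_probs):
        self.action_probs = action_probs          # e.g. {"inject": 0.9, ...}

    def act(self, state):
        return self.action_probs

class Node:
    def __init__(self, condition, description, if_true, if_false):
        self.condition = condition                # state -> bool
        self.description = description            # human-readable condition
        self.if_true, self.if_false = if_true, if_false

    def act(self, state):
        branch = self.if_true if self.condition(state) else self.if_false
        return branch.act(state)

# Hypothetical tree for the glucose example.
policy = Node(lambda s: s["glucose"] > 180, "glucose > 180",
              Leaf({"inject": 0.9, "wait": 0.1}),
              Node(lambda s: s["glucose"] < 80, "glucose < 80",
                   Leaf({"inject": 0.0, "wait": 1.0}),
                   Leaf({"inject": 0.2, "wait": 0.8})))

print(policy.act({"glucose": 200}))
print(policy.act({"glucose": 100}))
```

Because each leaf is reached through a short conjunction of readable conditions, the reason for any action distribution can be stated directly, which is exactly what a black-box policy network cannot offer.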
Other methods make specific assumptions about the domain or problem that allow for interpretable policies. One of the most popular simplifications is to consider a symbolic or relational state representation. In this case, the state is a combination of logical or relational statements. For example, in the classic Blocks World domain, where the objective is to learn how to stack blocks on top of each other to achieve a target configuration, the statements comprising the state might look like on(block1, block2); on(block1, table); free(block2). Typically, in relational reinforcement learning, policies are represented as relational regression trees [90,91,143,96,76], which are more interpretable than neural networks.
A second set of assumptions is based on a natural decomposition of the problem for multi-task or skills learning. For example, Shu et al. [271] use hierarchical policies that are learned over multiple tasks, where each task is decomposed into multiple sub-tasks (skills). The description of the skills is created by a human, so that the agent learns these understandable skills (for instance, task stack blue block can be decomposed into the skills find blue block, get blue block, and put blue block ).
As far as we know, there are currently no general, well-performing interpretable methods for deep reinforcement learning that allow transparency in the agent's actions or intent. Progress has been made instead on explainable deep reinforcement learning (post-hoc approximations), including tree-based explanations [66,189,81,22,73], reward decomposition [138], and attention-based methods [330,213]. An interesting approach is that of Verma et al. [300], who define rule-based policies through a domain-specific, human-readable programming language that generalizes to unseen environments, but shows slightly worse performance than the neural network policies it was designed to explain. The approach works for deterministic policies and symbolic domains only, and will not work for domains where the state is represented as a raw image unless an additional logical relation extractor is provided.
Based on experience from supervised learning (discussed above), we know that post-hoc explanations suffer from multiple problems, in that explanations are often incorrect or incomplete. Atrey et al. [18] have argued that for deep reinforcement learning, saliency maps should be used for exploratory, and not explanatory, purposes. Therefore, developing interpretable reinforcement learning policies and other interpretable RL models is important. The most crucial challenges for interpretable reinforcement learning include:

10.1 What constraints would lead to an accurate interpretable policy?
In relational methods, policies or value functions could be represented by decision or regression trees; however, since these methods work for symbolic domains only, they have limited practical use without a method of extracting interpretable relations from the domain. Can interpretable deep neural network methods similar to those in Challenge 5 provide an interpretable policy without a performance compromise? Are there other ways of adding constraints or structure to the policy that will ensure human-understandable actions and intent? Will adding other sparsity or interpretability constraints, as in Challenges 1 and 2, to the policy or value function help improve interpretability without sacrificing accuracy? Can we design interpretable models whose explanations can capture longer-term consequences of actions?

10.2 Under what assumptions does there exist an interpretable policy that is as accurate as the more complicated black-box model?

Reinforcement learning systems are tricky to train, understand, and debug. It is a challenge in itself to decide what constraints will lead to an interpretable policy without loss of performance. For example, for PIRL [300], there was a drop in performance of the resulting rule-based policy. One possible reason for this drop is that PIRL used post-hoc approximations rather than inherently interpretable models; another is that there might not be a well-performing sparse rule-based policy at all. So, the question is: how would we know whether it is useful to optimize for an interpretable policy? A simple test that tells the modeler whether such a model exists (similar to the characterizations of the Rashomon set in Challenge 9) would be helpful before trying to develop an algorithm to find this well-performing model (if it exists).

10.3 Can we simplify or interpret the state space?

Deep reinforcement learning often operates on complex (raw) data or huge, if not continuous, state spaces. The agent might be required to do a lot of exploration to learn useful policies in this case. Simplifying the state space might not only make the RL system learn more efficiently, but might also help humans to understand the decisions it is making. For example, by applying t-SNE to the last-layer activation data of Mnih et al.'s deep Q-learning model [210], Zahavy et al. [329] showed that the network automatically discovers hierarchical structures. Previously, these structures had not been used to simplify the state space. Guan et al. [116] improved sample efficiency by incorporating domain knowledge through human feedback and visual explanations. Zhang et al. [332] improved over batch RL by computing the policy only at specific decision points where clinicians treat patients differently, instead of at every timestep. This resulted in a significantly smaller state space and faster planning.
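A minimal sketch of state-space simplification by aggregation, with hypothetical glucose bins, shows how a large raw state space can collapse into a handful of readable abstract states:

```python
# Raw glucose readings span a large state space; binning them into a few
# clinically meaningful abstract states (hypothetical cutoffs) shrinks
# the table an agent must learn and makes its decisions easier to read.
def abstract_state(glucose):
    if glucose < 80:
        return "low"
    if glucose <= 180:
        return "healthy"
    return "high"

raw_states = list(range(40, 301))
abstract_states = sorted({abstract_state(g) for g in raw_states})
print(len(raw_states), "raw states ->", abstract_states)
```

Any policy or value table indexed by the abstract states is both smaller to learn and easier for a human to audit, at the cost of whatever distinctions the aggregation throws away.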
Example: Consider a machine learning practitioner who is working on planning for assistive robots that help the elderly perform day-to-day duties. In cases when the robot does some damage during its task execution or chooses an unusual plan to execute, it might be hard to understand the logic behind the robot's choices. If the practitioner can check that an interpretable model exists (Challenge 10.2), optimize for it (Challenge 10.1), and interpret the state space (Challenge 10.3), debugging the robot's failed behavior and correcting it can be much faster than without these techniques. Further, interpretability will help to strengthen trust between the robot and the human it is assisting, and help it to generalize to new, unforeseen cases.

Problems that were not in our top 10 but are really important
We covered a lot of ground in the 10 challenges above, but we certainly did not cover all of the important topics related to interpretable ML. Here are a few that we left out:

• Can we improve preprocessing to help with both interpretability and test accuracy? As we discussed, some interpretable modeling problems are computationally difficult (or could tend to overfit) without preprocessing. For supervised problems with tabular data, popular methods like Principal Component Analysis (PCA) generally transform the data in a way that damages the interpretability of the model, because each transformed feature is a combination of all of the original features. Perhaps there are other general preprocessing tools that would preserve (or improve) predictive power, yet retain interpretability.

• Can we convey uncertainty clearly? Uncertainty quantification is always important. Tomsett et al. [288] discuss the importance of uncertainty quantification and interpretability, and Antoran et al. [15] discuss interpretable uncertainty estimation.

• Can we divide the observations into easier cases and harder cases, assigning more interpretable models to the easier cases? Not all observations are equally easy to classify. As noted in the work of Wang et al. [309,312], it is possible that easier cases could be handled by simpler models, leaving only the harder cases for models with less interpretability.

• Do we need interpretable neural networks for tabular data? There have been numerous attempts to design interpretable neural networks for tabular data, imposing sparsity, monotonicity, etc. However, it is unclear whether there is motivation for such formulations, since neural networks do not generally provide much benefit for tabular data over other kinds of models. One nice benefit of the "optimal" methods discussed above is that when we train them, we know how far we are from a globally optimal solution; the same is not true for neural networks, rendering them not just less interpretable but also harder to implement reliably.

• What are good ways to co-design models with their visualizations? Work in data and model visualization will be important in conveying information to humans. Ideally, we would like to co-design our model in a way that lends itself to easier visualization. One particularly impressive interactive visualization was done by Gomez et al. [111] for a model on loan decisions. Another example of an interactive model visualization tool is that of Chen et al. [54]. These works demonstrate that as long as the model can be visualized effectively, it can be non-sparse and still be interpretable. In fact, these works on loan decisions constitute an example of when a sparse model might not be interpretable, since people might find a loan decision model unacceptable if it does not include many types of information about the applicant's credit history. Visualizations can allow the user to home in on the important pieces of the model for each prediction.

• Can we do interpretable matching in observational causal inference? In the classical potential outcomes framework of causal inference, matching algorithms are often used to match treatment units to control units. Treatment effects can be estimated using the matched groups. Matching in causal inference can be considered a form of case-based reasoning (Challenge 4). Ideally, units within each matched group would be similar on important variables. One such framework for matching is the Almost Matching Exactly framework [also called "learning to match," see 310], which learns the important variables using a training set and prioritizes matching on those important variables. Matching opens the door for interpretable machine learning in causal inference because it produces treatment and control data whose distributions overlap. Interpretable machine learning algorithms for supervised classification, discussed in Challenges 1, 2, and 3, can be directly applied to matched data; for instance, a sparse decision tree can be used on matched data to obtain an interpretable model for individual treatment effects, or an interpretable policy (i.e., treatment regimes).

• How should we measure variable importance? There are several types of variable importance measures: those that measure how important a variable is to a specific prediction, those that measure how important a variable is generally to a model, and those that measure how important a variable is independently of a model. Inherently interpretable models should generally not need the first one, as it should be clear how much a variable contributed to the prediction; for instance, for decision trees and other logical models, scoring systems, GAMs, case-based reasoning models, and disentangled neural networks, the reasoning process that comes with each prediction tells us explicitly how a variable (or concept) contributed to a given prediction. There are many explainable ML tutorials that describe how to estimate the importance of variables for black box models. For the other two types of variable importance, we refer readers to the work of Fisher et al. [97], which has relevant references. If we can decide on an ideal variable importance measure, we may be able to create interpretable models with specific preferences on variable importance.

• What are the practical challenges in deploying interpretable ML models? There are many papers that provide useful background about the use of interpretability and explainability in practice. For instance, an interesting perspective is provided by Bhatt et al. [31], who conduct an explainability survey showing that explanations are often used only internally for troubleshooting rather than being shown to users. Several studies, e.g., [141,237], have performed extensive human experiments on interpretable models and post-hoc explanations of black box models, with some interesting and sometimes nonintuitive results.

• What type of explanation is required by law? The question of what explanations are legally required is important and interesting, but we will leave those questions to legal scholars (e.g., [32,303]). Interestingly, scholars have explicitly stated that a "Right to Explanation" for automated decision-making does not actually exist in the European Union's General Data Protection Regulation, despite the intention [303]. Rudin [250] proposes that for a high-stakes decision that deeply affects lives, no black box model should be used unless the deployer can prove that no interpretable model exists with similar accuracy. If enacted, this would likely mean that black box models would rarely (if ever) be deployed for high-stakes decisions.

• What are other forms of interpretable models? It is not possible to truly review all interpretable ML models. One could argue that most of applied Bayesian statistics fits our definition of interpretable ML, because the models are constrained to be formed through an interpretable generative process. This is indeed a huge field, and numerous topics that we were unable to cover would also have deserved a spot in this survey.
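As an illustration of one idea above, routing easier cases to simpler models [309,312], a cascade can be sketched as follows; all models and the confidence function are hypothetical stand-ins:

```python
def cascade_predict(x, simple_model, complex_model, confidence, threshold=0.8):
    """Route easy cases to the interpretable model; only cases where it
    is unconfident fall through to the black box."""
    if confidence(simple_model, x) >= threshold:
        return simple_model(x), "simple"
    return complex_model(x), "complex"

# Hypothetical stand-ins: a one-feature rule and a "black box".
simple = lambda x: int(x >= 0.5)
black_box = lambda x: int(x >= 0.45)
conf = lambda model, x: min(1.0, abs(x - 0.5) * 4)  # low near the boundary

print(cascade_predict(0.9, simple, black_box, conf))   # easy, far from boundary
print(cascade_predict(0.48, simple, black_box, conf))  # hard, near the boundary
```

In a deployed cascade, the fraction of cases handled by the simple model determines how often a human-readable explanation is available by construction.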

Conclusion
In this survey, we hoped to provide a pathway for readers into important topics in interpretable machine learning. The literature currently being generated on interpretable and explainable AI can be downright confusing. The sheer diversity of individuals weighing in on this field includes not just statisticians and computer scientists but legal experts, philosophers, and graduate students, many of whom have never built or deployed a machine learning model. It is easy to underestimate how difficult it is to convince someone to use a machine learning model in practice, and interpretability is a key factor. Many works over the last few years have contributed new terminology, mistakenly subsumed the older field of interpretable machine learning into the new field of "XAI," and review papers have universally failed to truly distinguish between the basic concepts of explaining a black box and designing an interpretable model. Because of this misleading terminology, where papers titled "explainability" are sometimes about "interpretability" and vice versa, it is very difficult to follow the literature (even for us). At the very least, we hoped to introduce some fundamental principles, cover several important areas of the field, and show how they relate to each other and to real problems. Clearly this is a massive field that we cannot truly hope to cover, but we hope that the diverse areas we covered and the problems we posed might be useful to those needing an entrance point into this maze. Interpretable models are not just important for society; they are also beautiful. One might even find it magical that simple-yet-accurate models exist for so many real-world datasets. We hope this document allows you to see not only the importance of this topic but also the elegance of its mathematics and the beauty of its models.