Interpretability in machine learning (ML) is crucial for high-stakes decisions and for troubleshooting. In this work, we provide fundamental principles for interpretable ML and dispel common misunderstandings that dilute the importance of this crucial topic. We also identify 10 technical challenge areas in interpretable machine learning and provide history and background on each problem. Some of these problems are classically important, and some have arisen only in the last few years. These problems are: (1) Optimizing sparse logical models such as decision trees; (2) Optimization of scoring systems; (3) Placing constraints into generalized additive models to encourage sparsity and better interpretability; (4) Modern case-based reasoning, including neural networks and matching for causal inference; (5) Complete supervised disentanglement of neural networks; (6) Complete or even partial unsupervised disentanglement of neural networks; (7) Dimensionality reduction for data visualization; (8) Machine learning models that can incorporate physics and other generative or causal constraints; (9) Characterization of the “Rashomon set” of good models; and (10) Interpretable reinforcement learning. This survey is suitable as a starting point for statisticians and computer scientists interested in working in interpretable machine learning.
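To make one of these model forms concrete, here is a minimal sketch of a scoring system (challenge 2): a sparse linear model with small integer coefficients that a person can evaluate by hand. The features, point values, and threshold below are hypothetical, chosen only to illustrate the form; they are not taken from the survey.

```python
# A toy scoring system: sum hand-checkable integer points per feature,
# then threshold the total. Features and weights are hypothetical.

def risk_score(age_over_60: bool, prior_event: bool, smoker: bool) -> int:
    """Sum integer points for each active feature."""
    points = 0
    points += 2 if age_over_60 else 0
    points += 3 if prior_event else 0
    points += 1 if smoker else 0
    return points

def predict_high_risk(score: int, threshold: int = 4) -> bool:
    """Classify as high risk when the total score meets the threshold."""
    return score >= threshold

# Example: a 65-year-old smoker with no prior event scores 3 -> low risk.
print(predict_high_risk(risk_score(True, False, True)))
```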
The research on statistical inference after data-driven model selection can be traced back as far as Koopmans (1949). Intensive research on modern model selection methods for high-dimensional data over the past three decades has revived interest in statistical inference after model selection. In recent years, there has been a surge of articles on the topic, and a rather vast literature now exists. Our manuscript aims to present a holistic review of post-model-selection inference in linear regression models, while also incorporating perspectives from high-dimensional inference in these models. We first give a simulated example motivating the necessity of valid statistical inference after model selection. We then provide theoretical insights explaining the phenomena observed in the example, through a literature survey on the post-selection sampling distribution of regression parameter estimators and on the coverage properties of naïve confidence intervals. Categorized according to two types of estimation targets, namely the population- and projection-based regression coefficients, we present a review of recent uncertainty assessment methods. We also discuss possible pros and cons of the confidence intervals constructed by different methods.
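As a hedged illustration of the kind of simulated example the review describes (the paper's actual setup may differ), the sketch below selects the predictor with the larger absolute t-statistic and then forms a naive 95% confidence interval for its coefficient. Both true coefficients are zero, so nominal coverage should be 0.95, but the data-driven selection pushes it noticeably lower.

```python
# Monte Carlo: naive CIs after selecting the "best" of two null predictors.
# All design choices (n, two predictors, noise level) are ours, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
n, reps, cover = 100, 5000, 0

for _ in range(reps):
    X = rng.standard_normal((n, 2))   # two candidate predictors
    y = rng.standard_normal(n)        # true model: y = 0*x1 + 0*x2 + noise
    stats = []
    for j in range(2):
        # Fit each single-predictor model (no intercept) and its t-statistic.
        x = X[:, j]
        bhat = x @ y / (x @ x)
        resid = y - bhat * x
        se = np.sqrt(resid @ resid / (n - 1)) / np.sqrt(x @ x)
        stats.append((abs(bhat / se), bhat, se))
    _, bhat, se = max(stats)          # data-driven selection step
    # Naive 95% CI, ignoring that the predictor was chosen by the data.
    if bhat - 1.96 * se <= 0.0 <= bhat + 1.96 * se:
        cover += 1

print(f"naive coverage after selection: {cover / reps:.3f}")  # below 0.95
```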
The generation of random sequences is the basis of simulation and is used in many different areas, such as statistics, computer science, systems management and control, biology, particle physics, cryptography, and cyber-security, among others. It is crucial that the generated numbers be random or, at least, behave as such. The fundamental statistical properties required of such sequences are randomness and independence and, from a cryptographic perspective, unpredictability. There is a variety of methods to generate these sequences, the main ones being physical and arithmetic methods. In this work, a detailed study of the main arithmetic methods is carried out. In addition, the need for secure sequence generation is analyzed, and new lines of ongoing research focusing on applications in the Internet of Things and on new generator designs are described.
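As an illustration of the arithmetic methods studied, here is a minimal linear congruential generator (LCG), perhaps the best-known arithmetic method; the constants are the widely used Numerical Recipes parameters. LCGs can be adequate for simple simulation but are predictable and therefore unsuitable for cryptography.

```python
# Linear congruential generator: x_{k+1} = (a * x_k + c) mod m.
# Constants a, c, m are the classic Numerical Recipes choices.
def lcg(seed: int, a: int = 1664525, c: int = 1013904223, m: int = 2**32):
    """Yield an endless stream of pseudo-random integers in [0, m)."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

gen = lcg(seed=42)
# Map to floats in [0, 1), as simulation code usually does.
uniforms = [next(gen) / 2**32 for _ in range(5)]
print(uniforms)
```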
Planned missing survey data, for example those stemming from split questionnaire designs, are becoming increasingly common in survey research, making imputation indispensable for obtaining reasonably analyzable data. However, these data can be difficult to impute due to low correlations, many predictors, and limited sample sizes to support imputation models. This paper presents findings from a Monte Carlo simulation in which we investigate the accuracy of correlations after multiple imputation using different imputation methods and predictor-set specifications, based on data from the German Internet Panel (GIP). The results show that strategies that simplify the imputation exercise (such as predictive mean matching with dimensionality reduction or restricted predictor sets, linear regression models, or the multivariate normal model without transformation) perform well, while generalized linear models for categorical data, classification trees, and imputation models with many predictor variables in particular lead to strong biases.
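For readers unfamiliar with the best-performing method, the following is a minimal single-imputation sketch of predictive mean matching (PMM) on toy data of our own making; the GIP data and the paper's exact settings are not reproduced here. Proper multiple imputation would repeat the step with draws from the parameter posterior, e.g. via the R package mice.

```python
# Single-pass PMM on toy data: regress on complete cases, predict means
# for everyone, then donate observed values from the nearest predicted mean.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.standard_normal(n)
y = 0.3 * x + rng.standard_normal(n)   # weakly correlated toy variables
miss = rng.random(n) < 0.3             # 30% planned-missing y values

# 1. Regress y on x using complete cases only.
X_obs = np.column_stack([np.ones((~miss).sum()), x[~miss]])
beta, *_ = np.linalg.lstsq(X_obs, y[~miss], rcond=None)

# 2. Predicted means for all cases.
pred = beta[0] + beta[1] * x

# 3. For each missing case, donate the observed y whose predicted mean is
#    closest (1 nearest neighbour here; mice samples from 5 donors by default).
y_imp = y.copy()
obs_idx = np.flatnonzero(~miss)
for i in np.flatnonzero(miss):
    donor = obs_idx[np.argmin(np.abs(pred[obs_idx] - pred[i]))]
    y_imp[i] = y[donor]

print("correlation after PMM imputation:", np.corrcoef(x, y_imp)[0, 1].round(3))
```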
Central subspaces have long been a key concept for sufficient dimension reduction. Initially constructed for solving problems in the n > p setting, central subspace methods have seen many successes and developments. However, over the last few years and with the advancement of technology, many statistical problems are now situated in the high-dimensional setting where p > n. In this article, we review the theory of central subspaces and give an updated overview of central subspace methods for the n > p, p > n, and big data settings. We also develop a new classification system for these techniques and list some R and MATLAB packages that can be used for estimating the central subspace. Finally, we develop a central subspace framework for bioinformatics applications and show, using two distinct data sets, how this framework can be applied in practice.
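As a concrete instance of a classical n > p central subspace estimator, here is a minimal sketch of sliced inverse regression (SIR; Li, 1991) on toy data where the response depends on the predictors only through a single direction, which SIR should approximately recover. The toy data and slice count are our own choices.

```python
# Sliced inverse regression: standardize X, average the standardized
# predictors within slices of y, and eigen-decompose the covariance of
# the slice means; leading eigenvectors span the central subspace.
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 5
X = rng.standard_normal((n, p))
beta = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
y = (X @ beta) ** 3 + 0.1 * rng.standard_normal(n)

# Standardize X so that Z has identity covariance: Z = (X - mu) L with L L^T = Sigma^{-1}.
mu, cov = X.mean(0), np.cov(X, rowvar=False)
L = np.linalg.cholesky(np.linalg.inv(cov))
Z = (X - mu) @ L

# Slice the data by the order of y (near-equal slice sizes, hence equal weights).
H = 10
order = np.argsort(y)
slice_means = np.array([Z[idx].mean(0) for idx in np.array_split(order, H)])

# Leading eigenvector of the covariance of slice means, mapped back to X scale.
M = slice_means.T @ slice_means / H
w = np.linalg.eigh(M)[1][:, -1]
est = L @ w
est /= np.linalg.norm(est)
print("|cos angle| with true direction:", abs(est @ beta).round(3))
```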