Abstract
A common difficulty in data analysis is how to handle categorical predictors with a large number of levels or categories. Few proposals have been developed to tackle this important and frequent problem. We introduce a generative model that simultaneously carries out the model fitting and the aggregation of the categorical levels into larger groups. We represent the categorical predictor by a graph whose nodes are the categories and establish a probability distribution over meaningful partitions of this graph. Conditionally on the observed data, we obtain a posterior distribution for the level aggregation, allowing inference about the most probable clustering of the categories. Simultaneously, we obtain inference for all the other regression model parameters. We compare our method with state-of-the-art methods and show that it achieves equally good predictive performance with more interpretable results. Our approach balances accuracy against interpretability, a currently important concern in statistics and machine learning.
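As a rough illustration of what a probability distribution over partitions of a category graph can look like, the following Python sketch enumerates the partitions of a tiny graph whose blocks are all connected and gives each a product-form prior weight. Everything specific here is an assumption made for illustration and is not taken from the paper: the four categories and their path-shaped graph, the reading of "meaningful" as "every block induces a connected subgraph", the cohesion function `alpha * (|S| - 1)!`, and all function names. Brute-force enumeration is only feasible for a handful of categories.

```python
from math import factorial, prod


def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Place `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + partition


def is_connected(block, edges):
    """Return True if `block` induces a connected subgraph of the graph."""
    block = set(block)
    start = next(iter(block))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for a, b in edges:
            if a == node:
                neighbor = b
            elif b == node:
                neighbor = a
            else:
                continue
            if neighbor in block and neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == block


def cohesion(block, alpha=1.0):
    """Hypothetical cohesion c(S) = alpha * (|S| - 1)!, chosen for illustration."""
    return alpha * factorial(len(block) - 1)


def partition_prior(nodes, edges, alpha=1.0):
    """Normalized product-form prior over partitions with connected blocks."""
    weights = {}
    for partition in set_partitions(list(nodes)):
        if all(is_connected(block, edges) for block in partition):
            key = frozenset(frozenset(block) for block in partition)
            weights[key] = prod(cohesion(block, alpha) for block in partition)
    total = sum(weights.values())
    return {key: weight / total for key, weight in weights.items()}


# Hypothetical example: four categories whose graph is the path A - B - C - D.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D")]
prior = partition_prior(nodes, edges)

# Show the partitions from most to least probable under this toy prior.
for partition, probability in sorted(prior.items(), key=lambda kv: -kv[1]):
    blocks = sorted("".join(sorted(block)) for block in partition)
    print(blocks, round(probability, 3))
```

In a full analysis, such a prior would be combined with the regression likelihood to give a posterior over aggregations of the levels; this sketch covers the prior only.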
Acknowledgments
The authors would like to thank Professor Karen Kafadar and the anonymous reviewers for their careful review and all of their constructive comments. All of them contributed substantially to the improvement of this paper. We thank the Brazilian research funding agencies CNPq, CAPES, and FAPEMIG for partial financial support to this research. We would also like to express our gratitude to the Universidade Federal de Minas Gerais, especially to PROGRAD, for allowing the use of the educational data.
Citation
Tulio L. Criscuolo, Renato M. Assunção, Rosangela H. Loschi, Wagner Meira Jr., Danna Cruz-Reyes. "Handling categorical features with many levels using a product partition model." Ann. Appl. Stat. 17 (1) 786-814, March 2023. https://doi.org/10.1214/22-AOAS1651
Information