Handling categorical features with many levels using a product partition model
Tulio L. Criscuolo, Renato M. Assunção, Rosangela H. Loschi, Wagner Meira Jr., Danna Cruz-Reyes
Ann. Appl. Stat. 17(1): 786-814 (March 2023). DOI: 10.1214/22-AOAS1651


A common difficulty in data analysis is how to handle categorical predictors with a large number of levels or categories. Few proposals have been developed to tackle this important and frequent problem. We introduce a generative model that simultaneously carries out the model fitting and the aggregation of the categorical levels into larger groups. We represent the categorical predictor by a graph in which the nodes are the categories, and we establish a probability distribution over meaningful partitions of this graph. Conditionally on the observed data, we obtain a posterior distribution for the aggregation of levels, which allows inference about the most probable clustering of the categories. Simultaneously, we obtain inference about all the other regression model parameters. We compare our method with state-of-the-art methods and show that it has equally good predictive performance and more interpretable results. Our approach balances accuracy against interpretability, an important current concern in statistics and machine learning.


The authors would like to thank Professor Karen Kafadar and the anonymous reviewers for their careful review and constructive comments, all of which contributed substantially to the improvement of this paper. We thank the Brazilian research funding agencies CNPq, CAPES, and FAPEMIG for partial financial support of this research. We would also like to express our gratitude to the Universidade Federal de Minas Gerais, especially to PROGRAD, for allowing the use of the educational data.



Tulio L. Criscuolo, Renato M. Assunção, Rosangela H. Loschi, Wagner Meira Jr., Danna Cruz-Reyes. "Handling categorical features with many levels using a product partition model." Ann. Appl. Stat. 17 (1) 786-814, March 2023.


Received: 1 July 2020; Revised: 1 January 2022; Published: March 2023
First available in Project Euclid: 24 January 2023

Digital Object Identifier: 10.1214/22-AOAS1651

Keywords: categorical predictors, clustering effects, dimension reduction, linear regression, random partition

Rights: Copyright © 2023 Institute of Mathematical Statistics


