High-Cardinality-Covariates-Regularization

This is the R code for our paper "High-Cardinality Categorical Covariates in Network Regressions", which can be downloaded from SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4549049

Abstract. High-cardinality (nominal) categorical covariates are challenging in regression modeling because they lead to high-dimensional models. For example, in generalized linear models (GLMs), categorical covariates can be implemented by dummy coding, which results in high-dimensional regression parameters for high-cardinality categorical covariates. It is difficult to find the correct structure of interactions in high-cardinality covariates, and such high-dimensional models are prone to overfitting. Various regularization strategies can be applied to prevent overfitting. In neural network regressions, a popular way of dealing with categorical covariates is entity embedding, and overfitting is typically taken care of by early stopping strategies. In the case of high-cardinality categorical covariates, this often leads to very early stopping, resulting in a poor predictive model. Building on Avanzi, Taylor, Wang and Wong (arXiv 2023), we introduce new versions of random effects entity embedding of categorical covariates. In particular, when the categorical covariates have a hierarchical structure, we propose a recurrent neural network architecture and a Transformer architecture, respectively, for random effects entity embedding that give us very accurate regression models.
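
To make the dimensionality argument in the abstract concrete, the following minimal sketch contrasts dummy coding with an entity embedding lookup in base R. It is an illustration only and is not taken from the paper's code: the toy covariate `x`, the number of levels `n_levels`, the embedding dimension `emb_dim`, and the randomly initialized matrix `E` are all assumptions made for this example. In a real network, `E` would be learned during training.

```r
# Hypothetical toy data: one categorical covariate with 50 levels.
set.seed(1)
n_levels <- 50
level_names <- sprintf("level_%02d", seq_len(n_levels))
x <- factor(sample(level_names, size = 1000, replace = TRUE),
            levels = level_names)

# Dummy coding as used in a GLM design matrix:
# one intercept column plus (n_levels - 1) dummy columns.
X_dummy <- model.matrix(~ x)
dim(X_dummy)  # 1000 rows, 50 columns

# Entity embedding: each level is mapped to a row of an
# (n_levels x emb_dim) matrix. Here E is random for illustration;
# in a neural network these weights would be trained.
emb_dim <- 2
E <- matrix(rnorm(n_levels * emb_dim, sd = 0.1),
            nrow = n_levels, ncol = emb_dim)
X_emb <- E[as.integer(x), , drop = FALSE]
dim(X_emb)    # 1000 rows, 2 columns
```

The embedding reduces the covariate to `emb_dim` inputs per observation, but the embedding matrix itself still holds `n_levels * emb_dim` weights, which is why overfitting remains a concern for high-cardinality covariates.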