KaveIO/PhiK

Handling categorical variables with many unique values

jkleint opened this issue · 2 comments

First, thanks for this package; it's great that cutting-edge statistical research can be put into practice so promptly and easily.

I have a dataset with millions of rows and categorical variables with tens of thousands of unique values. When I run phik_matrix() on it, I get a warning like this:
UserWarning: The number of unique values of variable a is very large: 101. Are you sure this is not an interval variable? Analysis for pairs of variables including a might be slow.

And it is indeed slow. What's the best way to handle high-cardinality categoricals? Downsample to fewer than 100 values? Keep only the rows with the 100 most common values?

mbaak commented

Thanks for the kind words. Indeed, the approach I would recommend (for now) is to reduce the number of unique values, if possible. For example, in the case of addresses, reduce to ZIP code or county level, etc.

I will check if there is code left that can be easily parallelized or compiled. I think there is. I'll return to this comment later.

mbaak commented

I've committed v0.9.11 of the phik library, where the calculation of phik should be faster by at least a factor of two. I've also parallelized the calculation of phik for different variable pairs. Altogether it should be a sizeable speedup.

However, having a huge number of unique values will always remain a bottleneck for phik, because by construction the formula for phik relies on mirroring the number of categories seen in the data in the calculation of a bivariate normal distribution, from which phik is then derived.

So my recommendation to reduce the number of unique values still stands. Working with several thousand bins per variable should be quite alright though ...
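
For categoricals with no natural coarser level (unlike address to ZIP code), one possible way to reduce the number of unique values is to keep the most frequent ones and lump the rest into a single catch-all category. The sketch below uses plain pandas only; the helper name lump_rare_categories, the "other" label, and the default of 100 are illustrative assumptions, not part of the phik library.

```python
import pandas as pd


def lump_rare_categories(series: pd.Series, top_n: int = 100,
                         other_label: str = "other") -> pd.Series:
    """Hypothetical helper: keep the top_n most frequent values of a
    categorical column and replace all other values with other_label.

    Missing values are not counted by value_counts(), so they also end up
    in other_label.
    """
    # Most frequent values first; nlargest keeps only the top_n of them.
    top_values = series.value_counts().nlargest(top_n).index
    # Keep entries whose value is in the top set, replace the rest.
    return series.where(series.isin(top_values), other_label)


# Toy data: "x" appears 3 times, "y" twice, "z" and "w" once each.
df = pd.DataFrame({"a": ["x", "x", "x", "y", "y", "z", "w"]})
df["a_reduced"] = lump_rare_categories(df["a"], top_n=2)
```

The reduced column can then be used in place of the original when calling phik_matrix(). Unlike dropping rows outside the most common values, this keeps every row, at the cost of treating all rare values as one category.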