ronikobrosly/causal-curve

How to deal with the discrete variables that are not binary?

v6l4188 opened this issue · 4 comments

Hello :D
I have a question. In your end-to-end demonstration, some features are discrete, but they are binary (0 or 1). In my data, the discrete features are not binary; for example, one feature can be an integer between 0 and 30. How should I handle this kind of feature? If I use one-hot encoding, will the dimensionality be too high and the data too sparse? Should I use binary encoding instead? Or is it better to leave the feature as it is?
I would appreciate it if you could help me :D

Hello @v6l4188 ! That’s a great question. To answer it, could you tell me how many observations you have in your data frame (what’s the N)? If you were to one-hot encode all of the discrete features, how many feature columns would you end up with?

Thank you for your quick reply! I have 354,218 observations and 54 features. Of those, 18 are discrete features that would need to be encoded, and if I one-hot encode them the total number of features becomes about 200.

@v6l4188 ahh ok. I was trying to get a quick sense of the ratio of observations to parameters/features in your case. If that feature is nominal/non-ordered discrete (e.g. there are 31 countries in that feature), then it should be one-hot encoded into 30 binary features (one category serves as the reference level). If it is naturally ordered discrete (e.g. the 31 categories represent income levels from low to high), then it’s up to you whether to one-hot encode it or just leave it as one feature of 31 possible integers. In the ordered case, I would probably leave it as one ordered integer feature, just to keep things simple. Does this help?
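
For illustration, the two cases above could be sketched with pandas roughly as follows. This is only a minimal sketch: the data frame and the column names `country` and `income_level` are hypothetical, not taken from the actual data in this issue, and `pd.get_dummies` is just one of several ways to one-hot encode.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative assumptions.
df = pd.DataFrame({
    "country": ["US", "FR", "JP", "FR", "US"],  # nominal, unordered
    "income_level": [0, 12, 30, 7, 12],         # ordered integer in 0..30
})

# Nominal feature: one-hot encode it. drop_first=True drops one category
# as the reference level, so k categories become k - 1 binary columns.
encoded = pd.get_dummies(df, columns=["country"], drop_first=True, dtype=int)

# Ordered feature: "income_level" is left untouched as one integer column.
print(encoded.columns.tolist())
print(encoded.head())
```

With 31 countries instead of 3, the same call would yield 30 binary columns, while the ordered feature would still occupy a single column.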

Yes, it helps. :D thank you again!