larskotthoff/fselector

Handling classes with a single case

Closed this issue · 3 comments

Hi,

This is more of a question about how the function handles a situation than a technical issue. I am using information.gain for feature selection. The data set contains 40 cases: 15 from class A, 10 from B, 3 from C, and the remaining 12 are each the only case in their class. When I label them like this, information.gain gives me a list of attribute importances. However, when I label a single case against everything else, e.g. class D vs. non-class D, every attribute importance drops to zero.

That does make sense, since you can't calculate the entropy if there is only one case in a class. However, I would like to know how it handles classes with a single case when they are mixed with other "normal" classes. If the attribute score means nothing for classes with a single case, a warning message would be nice. It would be very useful for data science beginners like myself.

Thanks a lot.

I'm not sure what your question is. The information gain of an attribute doesn't explicitly consider how many cases there are. If the information gain of a particular attribute is 0, it means that knowing its value doesn't help you discriminate between the classes, regardless of how many classes there are. This is a normal case and not an exceptional condition that would warrant a warning.

Does that help?
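
For readers following along, here is a minimal sketch of the textbook definition of information gain for a discrete attribute: the entropy of the class labels minus the entropy that remains once the attribute value is known. It is written in Python purely for illustration and is not the package's code; the function names and the toy data are hypothetical, and any preprocessing the package applies to the attributes is not modelled.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(attribute, labels):
    """Class entropy minus the class entropy remaining after splitting on the attribute."""
    n = len(labels)
    # Group the class labels by the attribute value they occur with.
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the groups (conditional entropy).
    conditional = sum(len(group) / n * entropy(group) for group in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: four cases, two classes.
labels = ["A", "A", "B", "B"]
informative = ["x", "x", "y", "y"]    # value fully determines the class
uninformative = ["x", "y", "x", "y"]  # value says nothing about the class

gain_informative = information_gain(informative, labels)
gain_uninformative = information_gain(uninformative, labels)
print(gain_informative, gain_uninformative)
```

In this sketch the second attribute scores zero because each of its values occurs equally often with both classes, which is the "knowing its value doesn't help" situation described above. Nothing in the calculation depends on how many cases a class has, only on how the labels are distributed within each attribute value.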

Thanks for your reply. I think your answer does relate to my question. Please correct me if I am wrong.

To summarise: if I look at the information gain for a binary data set (e.g. D vs. non-D) and there is only a single sample in D, it is likely that there is no way to discriminate it from non-D, so every attribute will score 0. And when I have a data set with classes A, B, C, etc., it calculates the information gain of each attribute for distinguishing ALL classes at once, as opposed to A vs. non-A + B vs. non-B + C vs. non-C?

If that's the case I can see what you mean. Thanks a lot. Sorry for asking a question that is not so much related to the package itself.

Yes, that is correct.
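
The distinction confirmed above can be sketched as follows, again in Python and again only as an illustration under stated assumptions: the helper one_vs_rest, the toy labels, and the attribute values are all hypothetical and are not part of the package. The sketch computes one score over all classes jointly, then relabels the data as D vs. non-D and computes the score again. It also computes the class entropy of a 1-versus-39 split like the one in the question, because information gain can never exceed the class entropy, so a single-case class leaves very little to gain in the first place.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(attribute, labels):
    """Class entropy minus the class entropy remaining after splitting on the attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(group) / n * entropy(group) for group in groups.values())
    return entropy(labels) - conditional


def one_vs_rest(labels, target):
    """Hypothetical helper: collapse every class except `target` into one 'non-' class."""
    return [label if label == target else "non-" + target for label in labels]


# Hypothetical toy data: two larger classes plus one single-case class D.
labels = ["A", "A", "A", "B", "B", "B", "D"]
attribute = ["low", "low", "low", "high", "high", "high", "high"]

# One score per attribute, computed over all classes at once.
multiclass_gain = information_gain(attribute, labels)

# The same attribute scored on the relabelled D vs. non-D problem.
binary_labels = one_vs_rest(labels, "D")
d_vs_rest_gain = information_gain(attribute, binary_labels)
d_vs_rest_ceiling = entropy(binary_labels)  # upper bound on any attribute's gain

# Upper bound for a split like the one in the question: 1 case vs. 39 cases.
ceiling_1_of_40 = entropy(["D"] + ["non-D"] * 39)

print(multiclass_gain, d_vs_rest_gain, d_vs_rest_ceiling, ceiling_1_of_40)
```

In this toy example the attribute separates A from B well, so its multiclass score is much higher than its D vs. non-D score. The multiclass score is a single number for the whole labelling and is not a sum of the one-vs-rest scores.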