Sparse data and memory
Closed this issue · 2 comments
I'd like to try the techniques but have sparse text features. The current code doesn't run with sparse matrices. How hard would it be to change it so it did?
If I limit to fewer than 1000 features and convert to dense matrices, memory still climbs above 20 GB even with a single thread (num_cores forced to 1). Is this an intrinsic limitation of the techniques, or is it unexpected?
Thanks.
Hi,
Since kNN is used for the pair-wise estimation of MI between features, the limiting factor is the number of samples rather than the number of features. How many samples do you have? If you have a lot, could you subsample your dataset to, say, 1,000 samples and run the feature selection many times to see how stable the selected features are?
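The stability check suggested above could be sketched roughly like this. Note that `select_features` here is just a hypothetical placeholder (it ranks features by absolute correlation with the label); you'd swap in whatever kNN-MI based selector you're actually using:

```python
# Repeatedly subsample, run feature selection on each subsample, and
# count how often each feature is picked across runs.
import numpy as np
from collections import Counter

def select_features(X, y, k=5):
    # Placeholder selector: rank features by absolute correlation
    # with the label and keep the top k. Substitute your real selector.
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return list(np.argsort(scores)[::-1][:k])

def selection_stability(X, y, n_runs=20, n_sub=1000, k=5, seed=0):
    rng = np.random.default_rng(seed)
    counts = Counter()
    for _ in range(n_runs):
        idx = rng.choice(len(y), size=min(n_sub, len(y)), replace=False)
        counts.update(select_features(X[idx], y[idx], k=k))
    # Fraction of runs in which each feature was selected.
    return {f: c / n_runs for f, c in counts.most_common()}

# Toy data: feature 0 carries the signal, the rest are noise.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=5000)
X = rng.normal(size=(5000, 20))
X[:, 0] += 2.0 * y
freq = selection_stability(X, y, n_runs=10, n_sub=500, k=3)
```

Features with a selection frequency near 1.0 are stable under subsampling; features that only appear in a few runs are likely noise.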
I'm not sure you can estimate mutual information from sparse matrices. But if you have text features (counts, I presume) you might be better off using the JMI method from the FEAST package. There's a nice Python wrapper for it: https://github.com/mutantturkey/PyFeast
The estimation of MI in the FEAST package isn't based on kNN and it's written in C, so you should be able to get MI estimates and perform JMI on millions of samples as well, I'd guess.
Let me know how it goes!
Cheers,
Dan
I tried on around 88K samples, which is itself downsampled from 1 million.
Thanks for the suggestions! I'll try various smaller samples and also check out FEAST.