gregversteeg/corex_topic

Tradeoffs with using fractional counts

aditya-malte opened this issue · 3 comments

Hi @ryanjgallagher @gregversteeg,
I understand that the model is binary in nature, and that you have also provided an option to use fractional counts, which you mention is experimental. A use case I am tackling might require fractional counts. What, in your view, are the potential risks of using fractional counts? Also, is there a specific reason why you initially went with binarization rather than the fractional count approach? I'm asking to understand the potential risks of using the model in this different fashion.
Thanks in advance

Hi @aditya-malte,

Mathematically, most of the setup of the CorEx topic model assumes binary counts. This assumption is what makes it possible for us to run CorEx on large, sparse data.
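For reference, here's a minimal sketch of the standard binary workflow, plus the experimental fractional-count switch. The exact keyword (`count='fraction'`, with `count='binarize'` as the default) is my reading of the `Corex` constructor; verify it against the version you have installed.

```python
import numpy as np
import scipy.sparse as ss
from corextopic import corextopic as ct

# Toy sparse doc-word matrix (documents x words). The standard model
# expects binary 0/1 entries, which is what lets CorEx exploit sparsity.
X = ss.csr_matrix(np.random.randint(2, size=(100, 500)))
words = ['word_{}'.format(i) for i in range(500)]

# Standard binary-count topic model
topic_model = ct.Corex(n_hidden=10, seed=1)
topic_model.fit(X, words=words)

# Experimental fractional-count option -- assumed keyword; check the
# Corex constructor in your installed version.
frac_model = ct.Corex(n_hidden=10, count='fraction', seed=1)
frac_model.fit(X, words=words)

print(topic_model.get_topics())
```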

I haven't done much with the fractional counts implementation. @gregversteeg would have more insight into it. Given the size / sparsity of your data, he might also have other CorEx implementations that are better suited for your use case.

For example, the bio_corex implementation can be run on continuous data, though it's not designed for large, sparse data.
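A rough sketch of what running bio_corex on continuous data looks like; the option names here are my best reading of its `corex.py` and may differ in your version:

```python
import numpy as np
import corex as ce  # from https://github.com/gregversteeg/bio_corex

# Dense continuous data; bio_corex models continuous variables with
# Gaussian marginals, but it works on dense arrays, so it doesn't
# scale to large sparse matrices the way corex_topic does.
X = np.random.randn(200, 50)

layer = ce.Corex(n_hidden=5, marginal_description='gaussian',
                 smooth_marginals=True)  # assumed options; check corex.py
layer.fit(X)
print(layer.clusters)  # cluster assignment for each input variable
print(layer.tcs)       # total correlation explained by each latent factor
```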

Hi @ryanjgallagher,
Thanks for the quick reply. My use case mostly involves small amounts of data (around 20k samples). Training time is also a non-issue for me (it doesn't matter whether it's 2 hours or 7 hours), unless it's prohibitively long (think 2 days). I'm assuming that by sparsity you mean, say, a very sparse but large vocabulary.
The bio_corex pointer is definitely helpful; I will check it out.

Other than training time, are there any other side effects (like lower-quality topics, etc.)?

Also, to what extent would memory usage be affected? Does it increase memory consumption by, say, 10x? (I just want a rough idea.)

Important note: anchoring is an important feature for my use case (it seems to be absent in bio_corex).
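For context, this is the anchoring API in corex_topic that I'm referring to; the anchor word lists below are just placeholders:

```python
import numpy as np
import scipy.sparse as ss
from corextopic import corextopic as ct

X = ss.csr_matrix(np.random.randint(2, size=(100, 500)))
words = ['word_{}'.format(i) for i in range(500)]

# Anchor placeholder words to topics 0 and 1; anchor_strength > 1
# upweights the anchors relative to ordinary words during fitting.
model = ct.Corex(n_hidden=10, seed=1)
model.fit(X, words=words,
          anchors=[['word_0', 'word_1'], ['word_2']],
          anchor_strength=2)
```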

Sorry to chime in late. If your data is sparse, then I think this version is the way to go. Mathematically, I think it's justified to use fractional counts. Imagine you have a document/sample over four words, (0, 1, 1/2, 0). This is modeled like a distribution, for just this one sample: p(x1)=0, p(x2)=1, p(x3)=1/2, p(x4)=0. Mathematically, it would be equivalent to take each sample and draw new binary samples according to this distribution. For instance, if you sampled in this way, half of your samples would be (0,1,1,0) and half would be (0,1,0,0). Even though that's the interpretation, it's more efficient to use the fractional counts directly (assuming this type of sampling is compatible with how you interpret fractional counts).
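To make that sampling interpretation concrete, here's a quick numpy check (not from the repo, just an illustration) that binary rows drawn with those probabilities average back to the fractional row:

```python
import numpy as np

rng = np.random.default_rng(0)

# One document with fractional counts over four words
doc = np.array([0.0, 1.0, 0.5, 0.0])

# Draw many binary "resampled" documents: word j appears with prob doc[j]
samples = (rng.random((100_000, 4)) < doc).astype(int)

# Roughly half the samples are (0,1,1,0) and half are (0,1,0,0),
# so the empirical mean recovers the fractional document.
print(samples.mean(axis=0))  # ~ [0., 1., 0.5, 0.]
```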