GateNLP/gateplugin-LearningFramework

Bugs in cc.mallet.topics.ParallelTopicModel

clause opened this issue · 1 comment

I think I've found two issues in cc.mallet.topics.ParallelTopicModel.

The first is on Line 245. The loop bound should be tokens.size() (or .getLength()), not topics.length. If an instance has fewer tokens than the minimum capacity of a FeatureSequence (currently 2), then spurious topics will be added.
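A minimal sketch of the pattern being described (variable names follow this report rather than the exact source around Line 245, and the random-assignment body is only illustrative):

```java
import cc.mallet.types.FeatureSequence;
import cc.mallet.types.LabelSequence;
import java.util.Random;

// Illustration only: initialize the topic assignments for one document.
void initTopics(FeatureSequence tokens, LabelSequence topicSequence,
                int numTopics, Random random) {
    // getFeatures() returns the backing array. Because a FeatureSequence has
    // a minimum capacity (currently 2), this array can be longer than the
    // real token count for very short documents.
    int[] topics = topicSequence.getFeatures();

    // Buggy bound: also assigns a topic to the padding position(s),
    // which later show up as spurious topic assignments.
    // for (int position = 0; position < topics.length; position++) { ... }

    // Suggested bound: the true token count.
    for (int position = 0; position < tokens.size(); position++) {
        topics[position] = random.nextInt(numTopics);
    }
}
```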

The second is that the worker's docLengthCounts and topicDocCounts are always incremented but only cleared when alphaStatistics are collected. This results in the counts being optimizeInterval/saveSampleInterval times larger than they should be. topicDocCounts should be cleared every iteration, or only accumulated when alpha statistics are collected. docLengthCounts should only be calculated once (i.e., when the TopicAssignments are created); caching this computation would save some time and memory.
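A hedged sketch of the suggested change (the collectAlphaStatistics flag and method name are assumptions standing in for whatever marks an iteration on which alpha statistics are actually consumed, not the real Mallet identifiers):

```java
// Illustration only: accumulate per-document histograms in the worker.
void updateAlphaHistograms(boolean collectAlphaStatistics,
                           int numTopics,
                           int[] localTopicCounts,
                           int[][] topicDocCounts) {
    // Only accumulate on iterations where the statistics will be consumed
    // (and clear them after they are read); otherwise the counts grow
    // optimizeInterval/saveSampleInterval times larger than they should be.
    if (collectAlphaStatistics) {
        for (int topic = 0; topic < numTopics; topic++) {
            topicDocCounts[topic][localTopicCounts[topic]]++;
        }
    }
    // docLengthCounts depends only on the fixed document lengths, so it
    // could instead be computed once when the TopicAssignments are created
    // and cached, rather than re-accumulated on every sampling iteration.
}
```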

Please report bugs in the cc.mallet package here:
https://github.com/mimno/Mallet/issues