refine training sets

Question

Opened this issue 12 years ago · 0 comments

Currently we just chuck all the tokens at the classifier. This is brutal. It would be better to strip out most identifiers and build a more language-specific dataset that characterizes the language itself, not the identifiers that happened to be used a lot in our training data.
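
A rough sketch of what the stripping step could look like, in Python. Everything here is an assumption for illustration, not the project's actual code: the `KEYWORDS` table, the `refine_tokens` name, and the regex-based tokenizer are all hypothetical. The idea is to keep keywords and punctuation, and drop any identifier that is not a keyword of the language.

```python
import re

# Hypothetical keyword lists (assumption); a real dataset would need
# a fuller list for each supported language.
KEYWORDS = {
    "python": {"def", "class", "import", "return", "lambda", "elif"},
    "ruby": {"def", "end", "module", "require", "unless", "elsif"},
}

# A token is either a word (identifier or keyword) or any other single
# non-space character (punctuation, operators, digits).
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\S")
WORD_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def refine_tokens(source, language):
    """Return the tokens of `source` with non-keyword identifiers removed."""
    keep = KEYWORDS[language]
    refined = []
    for tok in TOKEN_RE.findall(source):
        if WORD_RE.fullmatch(tok) and tok not in keep:
            continue  # an identifier that says nothing about the language
        refined.append(tok)
    return refined


sample = "def total_price(items): return sum(items)"
print(refine_tokens(sample, "python"))
```

With this, `total_price`, `items` and `sum` never reach the classifier, while `def`, `return` and the punctuation do. Whether to drop identifiers outright or replace them with a placeholder token is an open choice.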