
refine training sets


tj commented

Currently we just chuck all the tokens at the classifier, which is brutal. It would be better to strip out most identifiers and build a more language-specific dataset that characterizes the language itself, not the identifiers that happened to be used a lot in our training data.
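A rough sketch of what the filtering could look like, in Python purely for illustration. Everything here is an assumption rather than existing project code: the `KEYWORDS` table, the `language_tokens` helper, and the regex tokenizer are hypothetical, and the keyword lists are deliberately incomplete. The idea is to keep keywords and punctuation/operator runs, and drop anything that looks like a user-chosen identifier.

```python
import re

# Hypothetical, deliberately incomplete keyword lists (an assumption for
# illustration); a real training set would need fuller lists per language.
KEYWORDS = {
    "go": {"func", "package", "import", "defer", "chan", "range", "return", "var", "type", "struct"},
    "python": {"def", "import", "lambda", "return", "class", "elif", "yield", "with", "as", "from"},
}

# Matches either a word (keyword or identifier) or a run of
# punctuation/operator characters. Numeric literals match neither
# alternative, so they are skipped entirely.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[^\sA-Za-z0-9_]+")


def language_tokens(source, language):
    """Return the tokens of `source` that characterize `language`.

    Words that are not in the language's keyword list are treated as
    identifiers and dropped; keywords and punctuation runs are kept.
    """
    keywords = KEYWORDS[language]
    kept = []
    for token in TOKEN_RE.findall(source):
        is_word = token[0] == "_" or token[0].isalpha()
        if is_word and token not in keywords:
            continue  # identifier: says more about the codebase than the language
        kept.append(token)
    return kept


print(language_tokens("func fetchUser(id int) { return }", "go"))
# ['func', '(', ')', '{', 'return', '}']
```

Note this also drops built-in type names like `int`, since they are not in the sketch's keyword list. Whether those belong in the training data is a judgment call; "most identifiers" leaves room to keep a whitelist of common built-ins.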