refine training sets
tj commented
Currently we just chuck all the tokens at the classifier, which is brutal. It would be better to strip out most identifiers and build a more language-specific dataset that characterizes the language itself, not identifiers that happened to be used a lot in our training data.
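One possible way to do this, as a rough sketch only: keep punctuation and operators as-is, and keep an identifier-like token only if it shows up in a large share of that language's training files, on the assumption that keywords and builtins recur across files while project-specific names do not. The names `tokenize`, `common_words`, `refine` and the `min_share` cutoff are all hypothetical and not part of the current code, and the regex tokenizer is a stand-in for whatever tokenizer we actually use.

```python
import re
from collections import Counter

# Hypothetical stand-in tokenizer: words, digit runs, or single symbol characters.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|[^\sA-Za-z0-9_]")
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(source):
    """Split source text into a flat list of tokens."""
    return TOKEN_RE.findall(source)


def common_words(files, min_share=0.5):
    """Return identifier-like tokens found in at least min_share of the files.

    Assumption: words shared by most files of one language are keywords or
    builtins, while rarer words are project-specific identifiers.
    """
    doc_freq = Counter()
    for source in files:
        # Count each word once per file (document frequency, not raw count).
        doc_freq.update({t for t in tokenize(source) if IDENT_RE.fullmatch(t)})
    threshold = min_share * len(files)
    return {word for word, count in doc_freq.items() if count >= threshold}


def refine(source, keep):
    """Drop identifier-like tokens that are not in keep; keep everything else."""
    return [t for t in tokenize(source) if not IDENT_RE.fullmatch(t) or t in keep]


if __name__ == "__main__":
    go_files = [
        "func main() { x := 1 }",
        "func add(a, b int) int { return a + b }",
        "func foo() { return }",
    ]
    keep = common_words(go_files, min_share=0.6)
    for source in go_files:
        print(refine(source, keep))
```

The cutoff would need tuning per language, and a small training set will still leak popular identifiers through, so a hand-maintained keyword list per language may turn out to be the more reliable option.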