leijie-wang/filterbuddy-experiment

Training ML algorithms based on user labels

While we are still developing the frontend interface for labeling examples, we can assume that users have access to a list of labeled examples in the format [(text1, 1), (text2, 0), ...]. You are expected to add functions that train a traditional ML classifier on these user labels.

The algorithm should have the structure pre-trained word embeddings + traditional ML (SVM, Naive Bayes, Random Forests, ...). You should decide on the architecture based on preliminary results. You should also try different numbers of training examples (from 100 upward).
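
To make the intended structure concrete, here is a minimal, dependency-free sketch of "embeddings + traditional ML", not a reference implementation. Everything named here is an assumption for illustration: TOY_EMBEDDINGS is a hypothetical stand-in for real pre-trained vectors, the classifier is a small hand-written Gaussian Naive Bayes (an SVM or Random Forest from a library would slot into the same place), and VAR_SMOOTHING is tuned only for this toy data. The learning_curve helper shows one way to compare different training-set sizes.

```python
import math
from collections import defaultdict

# Hypothetical toy table; real code would load pre-trained word vectors instead.
TOY_EMBEDDINGS = {
    "you": [0.1, 0.0, 0.2],
    "are": [0.0, 0.1, 0.0],
    "idiot": [0.9, 0.8, -0.5],
    "stupid": [0.8, 0.9, -0.4],
    "thanks": [-0.7, -0.6, 0.5],
    "great": [-0.8, -0.7, 0.6],
}
DIM = 3
VAR_SMOOTHING = 0.1  # assumption: large smoothing because the toy data is tiny


def embed(text):
    """Mean-pool the vectors of known tokens; zero vector if none are known."""
    vectors = [TOY_EMBEDDINGS[t] for t in text.lower().split() if t in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM
    return [sum(col) / len(vectors) for col in zip(*vectors)]


class GaussianNB:
    """Gaussian Naive Bayes with per-class, per-dimension mean and variance."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for x, label in zip(X, y):
            by_class[label].append(x)
        self.stats = {}
        for label, rows in by_class.items():
            means = [sum(col) / len(rows) for col in zip(*rows)]
            variances = [
                sum((v - m) ** 2 for v in col) / len(rows) + VAR_SMOOTHING
                for col, m in zip(zip(*rows), means)
            ]
            self.stats[label] = (math.log(len(rows) / len(X)), means, variances)
        return self

    def _log_posterior(self, x, label):
        log_prior, means, variances = self.stats[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances)
        )
        return log_prior + log_likelihood

    def predict(self, X):
        return [max(self.stats, key=lambda c: self._log_posterior(x, c)) for x in X]


def train_filter(examples):
    """Train on user labels given as [(text, 0 or 1), ...]."""
    texts, labels = zip(*examples)
    return GaussianNB().fit([embed(t) for t in texts], list(labels))


def accuracy(model, examples):
    texts, labels = zip(*examples)
    preds = model.predict([embed(t) for t in texts])
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)


def learning_curve(train, test, sizes):
    """Accuracy on `test` after training on the first n examples, for each n."""
    return {n: accuracy(train_filter(train[:n]), test) for n in sizes}


train = [("you are stupid", 1), ("thanks you are great", 0),
         ("you idiot", 1), ("great thanks", 0)]
test = [("stupid idiot", 1), ("thanks", 0)]
predictions = train_filter(train).predict([embed(t) for t, _ in test])
curve = learning_curve(train, test, sizes=[2, 4])
```

With the toxicity dataset, the same learning_curve call could be run with sizes such as 100, 500, and 1000 to see how accuracy changes as more labels become available.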

While we do not have real examples for now, you could use the toxicity dataset and its labels.

I completed the traditional ML filter with training and testing methods, using the toxicity dataset and labeling each comment by the mean of its annotators' toxicity ratings. My next step is adding embeddings. Let me know if the code I have for the traditional ML filter (without embeddings) would be helpful and, if so, where to put or send it. We can also wait until I add the embeddings.
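
For reference, the labeling step described above might look roughly like the sketch below. The input shape (a dict from comment text to a list of per-annotator ratings), the function name label_comments, and the 0.5 threshold are all assumptions for illustration; the actual rating scale and cutoff depend on the toxicity dataset used.

```python
from statistics import mean


def label_comments(ratings_by_comment, threshold=0.5):
    """Turn per-annotator ratings into [(text, label), ...] pairs.

    A comment is labeled 1 when the mean rating is at or above `threshold`
    (an assumed cutoff), else 0. Comments with no ratings are skipped.
    """
    return [
        (text, int(mean(ratings) >= threshold))
        for text, ratings in ratings_by_comment.items()
        if ratings
    ]


# Hypothetical ratings on a 0/1 scale from three annotators each.
examples = label_comments({
    "comment a": [1, 1, 0],
    "comment b": [0, 0, 1],
    "comment c": [],
})
```

The output is in the [(text, label), ...] format the issue asks for, so it can be passed directly to a training function.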