microsoft/vscode-github-triage-actions

ML Model: experiment with a Transformer model

julien-c opened this issue · 17 comments

Depending on the size of the dataset, using a Transformer model from https://github.com/huggingface/transformers could boost the accuracy of the classifier.

I can give a hand if needed =)

Yes, I will be looking into using one of those this iteration! Any help would be appreciated :)

I was planning on working with them via https://simpletransformers.ai/docs/multi-class-classification/, as it seems to wrap up the models in a way that makes quite a bit more sense to someone with zero ML experience (me).

How much data do you think would be needed for one of those models to work well?

Yes, simple-transformers should work.

50 to 100 samples for each label would probably work well already. Is the human-labelled dataset accessible somewhere? (Or should one just call the GitHub API?)

(BTW: how do you differentiate between labels that were human-annotated vs. labels that were attributed automatically?)

Also, this model card (trained on CodeSearchNet by @hamelsmu) might be of interest; i.e. you would probably want two steps:

  1. Pre-train your model on a massive amount of (non-labelled) issues' text (maybe just the vscode repo, but probably a larger slice of GitHub).
  2. Then fine-tune a multi-class classification layer on your labelled dataset.

Happy to help if needed!

@jlewi is working on building an issue label classifier right now for Kubeflow, specifically this project https://github.com/kubeflow/code-intelligence/tree/master/Label_Microservice

I would suggest using Google BigQuery to get a large batch of issue data at once. Here is a blog post that will help: How to Automate Tasks on GitHub With Machine Learning for Fun and Profit.

@julien-c

pre-train your model on a massive amount of (non-labelled) issues' text (maybe just the vscode repo, but probably a larger slice of GitHub)

Does this mean I would not be using a base like bert-base-uncased but rather training my own? vscode has ~90,000 issues; how many do you think this would need?

how do you differentiate between labels that were human-annotated vs. labels that were attributed automatically

Using the GitHub GraphQL API I can get the timeline of label events, which includes the creator of each event. I filter out the known bots:
https://github.com/microsoft/vscode-github-triage-actions/blob/master/classifier/train/fetch-issues/createDataDir.ts#L267-L272
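
For reference, the filtering idea looks roughly like the Python sketch below. This is only a hedged illustration: the actual implementation is the TypeScript linked above, and the bot list and helper name here are placeholders I made up.

# Hedged sketch: fetch an issue's label events via the GitHub GraphQL API and keep
# only the labels applied by humans. The real implementation (and the real bot list)
# lives in classifier/train/fetch-issues/createDataDir.ts.
import os
import requests

KNOWN_BOTS = {"vscodebot", "vscode-triage-bot"}  # placeholder list, not the real one

QUERY = """
query ($owner: String!, $name: String!, $number: Int!) {
  repository(owner: $owner, name: $name) {
    issue(number: $number) {
      timelineItems(itemTypes: [LABELED_EVENT], first: 100) {
        nodes {
          ... on LabeledEvent {
            label { name }
            actor { login }
          }
        }
      }
    }
  }
}
"""

def human_applied_labels(owner, name, number):
    resp = requests.post(
        "https://api.github.com/graphql",
        json={"query": QUERY, "variables": {"owner": owner, "name": name, "number": number}},
        headers={"Authorization": f"bearer {os.environ['INPUT_TOKEN']}"},
    )
    resp.raise_for_status()
    nodes = resp.json()["data"]["repository"]["issue"]["timelineItems"]["nodes"]
    # Keep a label only if the actor who applied it is not a known bot.
    return [n["label"]["name"] for n in nodes if n["actor"] and n["actor"]["login"] not in KNOWN_BOTS]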

Is the human-labelled dataset accessible somewhere?

I don't think I'm at liberty to post it myself, what with GDPR/etc.

One can get a copy of the data by running

INPUT_TOKEN="{github pat}" GITHUB_REPOSITORY="Microsoft/vscode" node classifier/train/fetch-data/index.js

on the deep-investigations branch of this repo

@ThilinaRajapakse, is it possible to pre-train models in simple-transformers? I wasn't able to find anything referencing it in a skim of the docs.

Yes, it is possible. I'm still working on completing the documentation at simpletransformers.ai, so it's actually not added there yet. But, you can find the relevant info in the readme.

You can start with a pre-trained model like bert-base-cased and pre-train it on the large, unlabelled Github issue dataset.

Pre-training a pre-trained model sounds weird, but the way I think of it is, pre-training is like teaching a language to the model. bert-base-cased already "knows" English, but probably isn't familiar with the programmer-jargon of Github issues. So, the idea is to retain the English knowledge while adapting the model to understand the more technical language found in issues. This is done by using the same technique (masked language modeling) that was used to initially train the original, pre-trained bert-base-cased model, hence the term pre-training.
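
To make the masked-language-modeling idea concrete, here is a tiny illustration using the transformers fill-mask pipeline (the model name and sentence are just examples, not part of the actual training setup):

# Illustrative only: masked language modeling asks the model to fill in a hidden token.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-cased")

# During pre-training on issue text, the model repeatedly solves puzzles like this one,
# which is how it picks up the vocabulary and phrasing of GitHub issues.
for prediction in fill_mask("The debug [MASK] button should not be shown."):
    print(prediction["token_str"], prediction["score"])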

Then, you can take the model that was pre-trained on the Github issues, add the multi-class classification layer (done automatically when the model is loaded to a classification model), and fine-tune it on your custom, labelled dataset.

I hope that made sense!

@ThilinaRajapakse what's the relation of simple transformers to Hugging Face? Is this a separate library?

Does anyone have an example of using a Hugging Face model that is fine-tuned on a dataset and adding a custom head for a classification task, using just Hugging Face and not another library? I looked through the docs and it wasn't clear whether something like this is included. cc: @julien-c Nevermind, found this

Simple Transformers is built on top of and relies on the Hugging Face library, but it is a separate library. I am not affiliated with Hugging Face. It was something I wrote for my own convenience in the beginning but some people seem to find it useful, so I've kept expanding it.

@ThilinaRajapakse it does seem useful. I'm a bit nervous about introducing more dependencies, but I think your abstraction is really useful, especially when doing common tasks. Thanks for sharing!

Thank you!

jlewi commented

Just an FYI for bulk downloads we switched to using BigQuery to fetch the data to avoid hitting GitHub's GraphQL API limits. The code to construct the comment stream and labels based on GitHub events is here
https://github.com/kubeflow/code-intelligence/blob/master/py/code_intelligence/github_bigquery.py

The BigQuery data is from the GitHub Archive, which I believe also makes tarballs available:
https://www.gharchive.org/
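
As a rough sketch, pulling vscode issue events out of the public GitHub Archive tables in BigQuery might look something like the snippet below. The table name and JSON fields follow the public githubarchive dataset layout, so treat the exact query as an assumption; the real query we use is in github_bigquery.py linked above.

# Hedged sketch: query the public GitHub Archive dataset on BigQuery for vscode issue events.
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT
  JSON_EXTRACT_SCALAR(payload, '$.issue.number') AS issue_number,
  JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
  JSON_EXTRACT_SCALAR(payload, '$.issue.title') AS title,
  JSON_EXTRACT_SCALAR(payload, '$.issue.body') AS body
FROM `githubarchive.month.202001`
WHERE type = 'IssuesEvent'
  AND repo.name = 'microsoft/vscode'
"""

# Each row is one issue event (opened, labeled, closed, ...) for the repo that month.
for row in client.query(QUERY).result():
    print(row.issue_number, row.action, row.title)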

This means we also have negative examples; i.e. we have issues where a label was added by either a human or bot and then removed by a human because it wasn't correct.

@julien-c Do you have suggestions about how we could incorporate these negative examples into training?

Do you have suggestions about how we could incorporate these negative examples into training?

I know you didn't ask me the question, but I can try to answer. The fact that you initially had the wrong label but now have the correct one doesn't seem like something you would handle differently from the main case (aside from it suggesting these are much harder examples for your model to classify, but you would have to look at those examples to determine that).

If you only have the wrong label and not the right one, I would suggest going through and hand-labeling those if possible. If that is not feasible, you could explore some variation of label smoothing: give "a slightly stronger negative label" to the known negatives (e.g. 0.1) than to the other classes for which the label is unknown (e.g. 0.15). I'm not sure this will work; it's a wild idea, but it's the best thing I can think of to take advantage of this "negative information". P.S. I haven't done a literature search for this; it is quite possible someone has tried it.
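
If someone were to try it, the target construction might look roughly like this. The numbers and shapes are only there to illustrate the "slightly stronger negative" idea; nothing here has been validated.

# Hedged sketch: build soft targets that push known-wrong labels slightly below
# the "unknown" classes, per the idea above. Entirely unvalidated.
from typing import List, Optional

import numpy as np

def soft_targets(num_classes: int, true_label: Optional[int], known_negatives: List[int]) -> np.ndarray:
    if true_label is not None:
        # Normal case: a (lightly smoothed) one-hot target for the known-correct label.
        target = np.full(num_classes, 0.1 / (num_classes - 1))
        target[true_label] = 0.9
        return target
    # Only negative information: known-wrong classes get a slightly lower weight than
    # the unknown ones, then everything is normalized to a probability distribution.
    target = np.full(num_classes, 0.15)
    target[known_negatives] = 0.1
    return target / target.sum()

# Example: 5 classes, no confirmed label, label 2 was applied and then removed by a human.
print(soft_targets(5, None, [2]))

To actually train on targets like these you would need a soft-label loss (e.g. KL divergence against the model's softmax) rather than the standard hard-label cross-entropy.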

Does anyone have an example of using a Hugging Face model that is fine-tuned on a dataset and adding a custom head for a classification task, using just Hugging Face?

Yes, as an alternative to simple-transformers you should be able to use huggingface/transformers to do this reasonably easily. We recently released a Trainer that lets you train or fine-tune models on different tasks with a common API (documentation is still a WIP): https://github.com/huggingface/transformers/tree/master/examples
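
As a rough sketch (placeholder data, and exact arguments may vary a bit between transformers versions), multi-class issue classification with only huggingface/transformers could look like this:

# Hedged sketch: fine-tune a sequence-classification head with the Trainer API.
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

label_names = ["bug", "feature-request", "question"]  # placeholder label set
train_texts = ["crash when opening 50mb yaml file", "please add a vertical tab bar", "how do I change the font size?"]
train_labels = [0, 1, 2]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(label_names))

class IssueDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        # Tokenize all texts up front; padding/truncation keeps batch shapes uniform.
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="issue-classifier", num_train_epochs=3),
    train_dataset=IssueDataset(train_texts, train_labels),
)
trainer.train()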

I've run fine-tuning on the full vscode issues dataset, and quickly looking through the results, it seems like the vocabulary of the model doesn't get expanded to include words from the fine-tuning dataset? For instance, vscode and debug don't show up in the output vocab.txt file. Is this expected? It seems to me that's leaving a lot of performance on the table.

@ThilinaRajapakse not sure if this is a simple-transformers thing or base transformers (@julien-c).

Code:

from simpletransformers.language_modeling import LanguageModelingModel
import logging


logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

train_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
}

# Continue masked-language-model training from the pre-trained bert-base-uncased checkpoint
model = LanguageModelingModel("bert", "bert-base-uncased", args=train_args)

# Fine-tune the language model on the unlabelled issue text
model.train_model(
    "issues.train.tokens",
    eval_file="issues.test.tokens",
    output_dir="finetuned",
)

# Evaluate the fine-tuned model on the held-out issues, writing results to finetune-eval
model.eval_model("issues.test.tokens", "finetune-eval")

Excerpt of issues.train.tokens:

when no folder or file open, the debug play button should not be shown  - open a new window with no folder or file open - go to the debug viewlet. - the gear icon is missing which makes sense as there is no workspace to create the launch.json file in - the play icon is present. clicking on it opens the "select environment" action choosing any option obviously would result in an error. we should either not show the play icon at all or on clicking on it should show an error saying "no file/folder to debug" instead of showing the "select environment" action 
keyboard navigation in trees can't find children of never-expanded folders   1. focus on explorer 2. start typing, for instance "foo" 3. if "foo" is nested under a folder that has never been expanded since window load, "foo" will not be found. 4. if i have expanded foo's immediate parent at least once, "foo" will be found, no matter if it's collapsed or not. (having a `jsconfig.json` doesn't help) 
crash when opening 50mb yaml file  version 0.10.1 as the subject says, when i attempt to open a 55mb yaml file, crashes. opening smaller files seems to work fine (5mb). the larger one makes it blow up every time. let me know if i can provide any more information. 

The vocabulary of the model doesn't get expanded by fine-tuning the language model. The BERT vocabulary has unused tokens (~1000 for bert-base-uncased) which you can replace with your own tokens by editing the vocab.txt file (they look like [unusedXXX]).

That said, my understanding is that this usually doesn't significantly improve the performance of the model. I think that's because the model can "learn" the meaning of words like vscode even if they are tokenized into smaller parts. Since BERT's representations are context-dependent, it probably doesn't matter too much whether vscode appears as a single token or as several tokens next to each other.

I think it's better to consider doing this later on if/when you are looking for any possible improvement.
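
If you ever do want to add domain tokens without hand-editing vocab.txt, the transformers tokenizer API can do it directly. A hedged sketch (the added tokens are just examples):

# Hedged sketch: add domain-specific tokens via the transformers API instead of
# editing vocab.txt by hand. The new tokens start with random embeddings, so the
# language-model fine-tuning step is what actually gives them meaning.
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("vscode debug viewlet"))  # split into subword pieces, e.g. ['vs', '##code', ...]

tokenizer.add_tokens(["vscode", "viewlet"])        # example tokens only
model.resize_token_embeddings(len(tokenizer))      # grow the embedding matrix to match the new vocab size

print(tokenizer.tokenize("vscode debug viewlet"))  # 'vscode' and 'viewlet' are now single tokens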

It's probably best to take the model you fine-tuned and try it out on the actual classification. You can load the fine-tuned model as a classifier with the snippet below.

from simpletransformers.classification import ClassificationModel
import logging


logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

train_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
}

# Using the file path from your snippet
model = ClassificationModel("bert", "finetuned", args=train_args)
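
From there, training on the labelled issues should be a single call. A hedged sketch with placeholder data (per the simpletransformers docs, for more than two labels you would also pass num_labels when constructing the ClassificationModel):

import pandas as pd

# simpletransformers expects a DataFrame with "text" and "labels" columns
# (placeholder rows shown here, not real vscode data).
train_df = pd.DataFrame({
    "text": ["crash when opening 50mb yaml file", "please add a vertical tab bar"],
    "labels": [0, 1],
})

model.train_model(train_df)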

Added a classifier using BERT fine-tuned on the vscode issue base. Initial results are promising:

54 issues
  43 triaged  (80%)
    4 incorrectly triaged (8%)
    39 correctly triaged (72%)
  11 skipped (20%)

Implementation at: https://github.com/microsoft/vscode-github-triage-actions/tree/master/classifier-deep.

Thanks to everyone for your suggestions!

Aside:
I implemented a confidence-threshold configuration that proved very helpful for getting better numbers. You can see a live feed of the bot's confidence levels at https://github.com/JacksonKearl/testissues/issues. By default, the threshold is at around 70% but it can be configured on a per-person/label basis.
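
For anyone curious, the thresholding idea boils down to something like the sketch below. It is only a hedged illustration; the real per-person/label configuration lives in the classifier-deep directory linked above, and the threshold values here are made up.

# Hedged sketch: only apply a label when the classifier is confident enough,
# otherwise skip and leave the issue for a human to triage.
from typing import List, Optional

import numpy as np

DEFAULT_THRESHOLD = 0.70
PER_LABEL_THRESHOLDS = {"bug": 0.80}  # example of a per-label override, not the real config

def pick_label(logits: np.ndarray, label_names: List[str]) -> Optional[str]:
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax over the classifier outputs
    best = int(probs.argmax())
    label = label_names[best]
    threshold = PER_LABEL_THRESHOLDS.get(label, DEFAULT_THRESHOLD)
    # Below the threshold the bot returns None and the issue is counted as "skipped".
    return label if probs[best] >= threshold else None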