CodeSearchNet
I am currently working on an `IterableDataset` object to train this on the data from the CodeSearchNet dataset (https://github.com/github/CodeSearchNet). It seems to work fine on my machine now, but it has no tests and the code needs a lot of refactoring and cleanup.
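The gist is a PyTorch `IterableDataset` that streams the per-language `jsonl.gz` dumps and yields tokenized code snippets. A minimal sketch of the idea (assuming the standard CodeSearchNet file layout and a Hugging Face tokenizer; the names here are illustrative, not the code in my branch):

```python
import glob
import gzip
import json

from torch.utils.data import DataLoader, IterableDataset
from transformers import GPT2TokenizerFast


class CodeSearchNetDataset(IterableDataset):
    """Streams CodeSearchNet jsonl.gz dumps so the corpus never has to fit in memory."""

    def __init__(self, data_glob, tokenizer, max_length=512):
        self.files = sorted(glob.glob(data_glob))
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __iter__(self):
        for path in self.files:
            with gzip.open(path, "rt", encoding="utf-8") as f:
                for line in f:
                    record = json.loads(line)
                    # each CodeSearchNet record keeps the raw function body under "code"
                    encoded = self.tokenizer(
                        record["code"],
                        truncation=True,
                        max_length=self.max_length,
                        return_tensors="pt",
                    )
                    yield encoded["input_ids"].squeeze(0)


# usage sketch
tokenizer = GPT2TokenizerFast.from_pretrained("distilgpt2")
dataset = CodeSearchNetDataset("python/final/jsonl/train/*.jsonl.gz", tokenizer)
loader = DataLoader(dataset, batch_size=1)
```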
If you care to look at what I've done so far, it is here: https://github.com/cameronbergh/auto_coding/tree/dev
I will create a PR in the next few days...
Okay, I have what I think is a reasonably decent PR ready that doesn't break the existing code. I am currently training DistilGPT2 on the CodeSearchNet dataset, and so far the results look good.
I will post the trained model in a few days.
I'll also start training the larger GPT-2 models soon.
Oh, and the code is available here: https://github.com/cameronbergh/auto_coding/tree/dev
@cameronbergh I just took a quick look at what you have built on top of this repo. That's really nice, so your PR is welcome. You said you are going to experiment with the larger models. May I ask what computational resources you have on your side? I did not try GPT-2 Large due to a lack of GPUs on my end :(.
I have an old server with 64 cores and 170+ GB of RAM. It has been training GPT-2 1558M on a dataset of Python code from the Python150k dataset and CodeSearchNet. It has taken a long time, but it performs pretty well. My other machine has two 2080 Tis.
My PR is a little delayed due to dependency hell...
I did a little (not very rigorous) experiment:
- training GPT-2 117M on GPUs (2x 2080 Ti)
- training GPT-2 1558M on CPUs (64 cores @ 2.10 GHz)

The larger, slower setup seems to perform better given the same amount of training time!
I'll post the models eventually!
Okay, I have the training debugged, and it is currently running here: https://app.wandb.ai/impudentstrumpet/gpt2distilcsnl?workspace=user-impudentstrumpet
Great work. Just to let you know, I plan to upgrade the trainer class and add some features to it, for example compatibility with a `None` dev dataset and some logging controls. Have you made any changes to the trainer class in your branch?
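To sketch what I mean by the `None` dev dataset handling, something along these lines (a rough sketch only; `dev_dataloader` and `evaluate` are placeholder names, not the actual attributes of the trainer class in this repo):

```python
import logging

logger = logging.getLogger(__name__)


def maybe_evaluate(trainer, epoch):
    """Skip evaluation entirely when no dev dataset was configured."""
    if trainer.dev_dataloader is None:  # placeholder attribute name
        logger.info("No dev dataset given; skipping evaluation for epoch %d", epoch)
        return None
    dev_loss = trainer.evaluate(trainer.dev_dataloader)  # placeholder method name
    logger.info("Epoch %d dev loss: %.4f", epoch, dev_loss)
    return dev_loss
```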