Reproducing the OctoCoder model
Opened this issue · 12 comments
Hello, I have a few questions about OctoCoder.
For this part in the paper:
For instruction tuning our models, we select 5,000 random samples from COMMITPACKFT across the 6 programming languages that we evaluate on.
Could you please provide the exact training data and the launch script to fine-tune StarCoder into OctoCoder?
Or, the seeds that you used for selecting 5,000 instructions from CommitPackFT?
For a second question, were OctoCoder and the results in the paper produced using finetuning/starcoder/finetune.py with LoRA/peft?
Thanks!
Btw, fantastic results @Muennighoff and team :)
I think this is the exact dataset we used for OctoCoder: https://huggingface.co/datasets/bigcode/guanaco-commits
Yes, we used LoRA for OctoCoder.
cc @ArmelRandy
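For anyone reproducing this, here is a minimal sketch of a PEFT LoRA setup on StarCoder. The rank, alpha, dropout, and target module names below are illustrative assumptions, not necessarily the exact OctoCoder hyperparameters:

```python
# Minimal LoRA sketch with peft + transformers (illustrative hyperparameters,
# not necessarily the exact OctoCoder settings).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder")

lora_config = LoraConfig(
    r=16,                # assumed rank
    lora_alpha=32,       # assumed scaling
    lora_dropout=0.05,   # assumed dropout
    # Assumed attention projection names for the GPTBigCode (StarCoder) architecture.
    target_modules=["c_attn", "c_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights should be trainable
```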
Hi @Muennighoff , @ArmelRandy:
Did you find full fine-tuning of StarCoder to be better than LoRA/PEFT?
(I'm a bit confused since the paper doesn't mention the use of LoRA / PEFT techniques.)
We did not find a significant difference between LoRA and full fine-tuning, thus we use LoRA for all experiments.
Sorry for that. I have added the above as a note in Appendix M (Hyperparameters). We will update the arXiv version in a few months.
Hi @Muennighoff ,
I think this is the exact dataset we used for OctoCoder: https://huggingface.co/datasets/bigcode/guanaco-commits
The above dataset contains 13K samples. However, from the paper it seems ~23K samples were used for training OctoCoder.
Am I missing something?
For OctoCoder, we use OASST + CommitPackFT so 8587 + 5000 ~ 13,000
The others are only used in the ablations
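As a quick sanity check, here is a small sketch that loads the dataset linked above and verifies the row count roughly matches 8,587 + 5,000. It only assumes the `datasets` library can fetch it; the split layout is whatever the hub provides:

```python
# Sketch: load the dataset linked above and check the sample count
# (expected to be roughly 8,587 OASST + 5,000 CommitPackFT ≈ 13.6K rows).
from datasets import load_dataset

ds = load_dataset("bigcode/guanaco-commits")
print(ds)  # shows the available splits and their row counts
print(sum(len(split) for split in ds.values()))
```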
For OctoCoder, we use OASST + CommitPackFT so 8587 + 5000 ~ 13,000 The others are only used in the ablations
Thanks! :)
Great! Appreciate the response.
Could you also clarify the environments used in evaluation? We are seeing discrepancies of up to 10% between the paper and our eval results on OctoCoder. Perhaps you could specify the build versions of the languages? I see you only specify the latest stable Rust in the code, for example.
Sure, these are the evaluation versions:
Python: Python 3.9.13
torch 1.13.0+rocm5.2
accelerate 0.20.3
transformers 4.32.1
C++: 11.4.0
(but newer ones should be fine too)
JS: js-md5@0.7.3
Java: java version "18" 2022-03-22
Go: go1.18.4
Rust: rustc 1.71.1 (eb26296b5 2023-08-03)
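If it helps, here is a rough sketch for checking a local environment against the versions above. It assumes the C++ entry refers to g++ and that the toolchains are on PATH:

```python
# Rough environment check against the versions listed above.
import subprocess
import sys

import accelerate
import torch
import transformers

print("Python      :", sys.version.split()[0])    # 3.9.13
print("torch       :", torch.__version__)          # 1.13.0+rocm5.2
print("accelerate  :", accelerate.__version__)     # 0.20.3
print("transformers:", transformers.__version__)   # 4.32.1

# JS additionally uses the js-md5@0.7.3 npm package; the rest are toolchain versions.
for cmd in (["g++", "--version"], ["java", "-version"], ["go", "version"], ["rustc", "--version"]):
    try:
        out = subprocess.run(cmd, capture_output=True, text=True)
        print(cmd[0], ":", (out.stdout or out.stderr).splitlines()[0])
    except FileNotFoundError:
        print(cmd[0], ": not found")
```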
Also, HumanEval performance is noisy as there are only 164 samples per task per subset. You may find that a different seed or a checkpoint from a different step makes up for that 10% relative difference on Python HumanEvalSynthesize.
Other people have been able to reproduce the results; someone even got 46.5 pass@1 on Python just by re-evaluating OctoCoder with our script, which is better than in our paper, probably due to different versions or the batch size setting.
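As a rough sense of how large that noise is, here is a back-of-the-envelope sketch treating pass@1 over 164 problems as a binomial proportion (a simplification; the 0.46 score is only illustrative):

```python
# Back-of-the-envelope noise estimate: treat pass@1 over the 164 problems
# as a binomial proportion (ignores per-problem correlation and sampling settings).
import math

n = 164   # problems per language subset
p = 0.46  # illustrative pass@1, close to the scores discussed above

stderr = math.sqrt(p * (1 - p) / n)
print(f"one-sigma noise: +/- {stderr:.3f} pass@1")  # ~0.039, i.e. about 4 points
print(f"relative noise : +/- {stderr / p:.1%}")     # ~8-9% relative
```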
We did not find a significant difference between LoRA and full fine-tuning, thus we use LoRA for all experiments.
Sorry for that. I have added the above as a note in Appendix M (Hyperparameters). We will update the arXiv version in a few months.
@Muennighoff From the Appendix, there is a description that "OCTOCODER was trained for 35 steps with a sequence length of 2048".
With a sequence length of 2048 and only 35 steps, it seems like the entire dataset won't be fully covered (only about 70,000 tokens?). Am I understanding this correctly?
Note that it's 2.2 million total fine-tuning tokens due to the batch size of 32. The steps and sequence length are correct; you usually do not need many tokens for instruction tuning, see e.g. the graph below from prior work.
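To make the arithmetic explicit, here is a quick sketch using only the numbers above:

```python
# Token-count arithmetic from the numbers above.
seq_len = 2048
steps = 35
batch_size = 32

print(seq_len * steps)               # 71,680 tokens if the batch size is ignored
print(seq_len * steps * batch_size)  # 2,293,760 ≈ 2.2-2.3M total fine-tuning tokens
```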
@Muennighoff How did you decide on the number of samples from CommitPackFT to use during fine-tuning, i.e. where did the 5k number come from? Your graph above seems to indicate increased performance for the BLOOMZ model well into the hundreds of millions of fine-tuning tokens, and I've seen other fine-tunes of Llama-2 using training sets ranging from ~5k all the way up to ~80k for similar-ish tasks. I am curious what insights/experiences led you to 5k.
The 5K was mostly arbitrary. Our filtered OASST dataset had around 5K samples, so we just decided to fix it at 5K for CommitPackFT, too. You can probably use more.
You are right that performance improves into the hundreds of millions of tokens for BLOOMZ; mT0 seems to saturate earlier. It could be that fine-tuning OctoCoder for longer would lead to better performance.