The dataset for fine-tuning
gfzum opened this issue · 2 comments
gfzum commented
Hi, I'm interested in reproducing the fine-tuning process of RepairLLaMA, but I can't find the dataset file mentioned in the arXiv paper (or the ior_21 file referenced in full_finetune.py). Would you mind sharing it? : )
andre15silva commented
Hi @gfzum !
The processed datasets are available at https://huggingface.co/datasets/ASSERT-KTH/repairllama-datasets
It contains the datasets used to train the RepairLLaMA models, one subset per input/output representation pair.
To obtain the 30k..50k datasets, we applied further filtering, keeping only pairs whose combined input + output length is under 1024 tokens.
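A minimal sketch of that length filter, assuming the datasets expose `input`/`output` text fields (field names and the dataset config name below are assumptions, and a whitespace tokenizer stands in for the actual model tokenizer):

```python
# Sketch of the token-length filter described above. The real pipeline
# presumably used the model's own tokenizer; str.split is a toy stand-in.
def count_tokens(text, tokenize=str.split):
    # Swap in e.g. a HuggingFace AutoTokenizer's encode() for real counts.
    return len(tokenize(text))

def keep_pair(example, max_tokens=1024):
    # Keep only samples whose combined input + output fits the token budget.
    return count_tokens(example["input"]) + count_tokens(example["output"]) <= max_tokens

# Hypothetical usage with datasets.Dataset.filter (commented out to stay offline;
# the config name "..." must be one of the repo's actual subsets):
# from datasets import load_dataset
# ds = load_dataset("ASSERT-KTH/repairllama-datasets", "...", split="train")
# ds = ds.filter(keep_pair)

pairs = [
    {"input": "a " * 600, "output": "b " * 500},  # 1100 tokens -> dropped
    {"input": "a " * 300, "output": "b " * 200},  # 500 tokens  -> kept
]
filtered = [p for p in pairs if keep_pair(p)]
```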
If you're interested, you can also find the source datasets on our HuggingFace org:
- Megadiff (original dataset, in HF format): https://huggingface.co/datasets/ASSERT-KTH/megadiff
- Megadiff Single-Function (single-function diffs only, with buggy and fixed functions extracted from it): https://huggingface.co/datasets/ASSERT-KTH/megadiff-single-function
gfzum commented
Thank you Andre!