ASSERT-KTH/repairllama

The dataset for fine-tuning

gfzum opened this issue · 2 comments

Hi, I'm interested in reproducing the fine-tuning process of RepairLLaMA, but I can't find the dataset file mentioned in the arXiv paper (or the ior_21 file in full_finetune.py). Could you share it? : )

Hi @gfzum !

The processed datasets are available at https://huggingface.co/datasets/ASSERT-KTH/repairllama-datasets
The repository contains the datasets used for training the RepairLLaMA models, one subset per input/output representation pair.
To get the 30k..50k datasets, we filtered further, keeping only samples whose input + output pair is shorter than 1024 tokens.

If it interests you, you can also find these on our HuggingFace org:

Thank you Andre!