The dataset for fine-tuning
gfzum opened this issue · 2 comments
gfzum commented
Hi, I'm interested in reproducing the fine-tuning process of RepairLLaMA, but I can't find the dataset file mentioned in the arXiv paper (or the ior_21 file referenced in full_finetune.py). Would you mind sharing it? : )
andre15silva commented
Hi @gfzum !
The processed datasets are available at https://huggingface.co/datasets/ASSERT-KTH/repairllama-datasets
It contains the datasets used to train the RepairLLaMA models, one subset per input/output representation pair.
To obtain the 30k..50k datasets, we applied further filtering, keeping only pairs whose combined input + output length is under 1024 tokens.
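A minimal sketch of that length filter, assuming the datasets expose `input`/`output` text fields (field names and the dataset config name below are assumptions, and a whitespace tokenizer stands in for the actual model tokenizer):

```python
# Sketch of the token-length filter described above. The real pipeline
# presumably used the model's own tokenizer; str.split is a toy stand-in.
def count_tokens(text, tokenize=str.split):
    # Swap in e.g. a HuggingFace AutoTokenizer's encode() for real counts.
    return len(tokenize(text))

def keep_pair(example, max_tokens=1024):
    # Keep only samples whose combined input + output fits the token budget.
    return count_tokens(example["input"]) + count_tokens(example["output"]) <= max_tokens

# Hypothetical usage with datasets.Dataset.filter (commented out to stay offline;
# the config name "..." must be one of the repo's actual subsets):
# from datasets import load_dataset
# ds = load_dataset("ASSERT-KTH/repairllama-datasets", "...", split="train")
# ds = ds.filter(keep_pair)

pairs = [
    {"input": "a " * 600, "output": "b " * 500},  # 1100 tokens -> dropped
    {"input": "a " * 300, "output": "b " * 200},  # 500 tokens  -> kept
]
filtered = [p for p in pairs if keep_pair(p)]
```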
If you're interested, you can also find the source datasets on our HuggingFace org:
- Megadiff (original dataset, in HF format): https://huggingface.co/datasets/ASSERT-KTH/megadiff
- Megadiff Single-Function (single-function diffs only, with buggy and fixed functions extracted from it): https://huggingface.co/datasets/ASSERT-KTH/megadiff-single-function
gfzum commented
Thank you Andre!