Question about the overall framework setup for evaluating patches with defects4j
ramsey-coding opened this issue · 5 comments
Hi,
Thanks for this great artifact, but that is to be expected from @monperrus!
I have a question about the overall framework. For each bug in Defects4J, you need to:
- checkout the repo
- evaluate the patch
- and run the test cases.
Where is the code for all of this setup? I would like to reproduce the whole framework from scratch.
Thank you!
Hi @ramsey-coding,
Thanks for your interest in RepairLLaMA :)
The code for doing this is not yet public. We are working on making it public, but we do not currently have a timeline for this.
I will let you know when we have an update on this.
@andre15silva, since there is no timeline for it, could you please help me with the following queries?
Defects4J provides shell commands to check out the repository, evaluate, and run test cases. I presume you have used these commands to build your infrastructure. Or did you do something else?
> @andre15silva, since there is no timeline for it, could you please help me with the following queries?
> Defects4J provides shell commands to check out the repository, evaluate, and run test cases. I presume you have used these commands to build your infrastructure. Or did you do something else?
Yes, we use the commands provided by Defects4J!
For HumanEval-Java we also use the commands provided in that repo. Note that we use our own fork of HumanEval-Java located at https://github.com/ASSERT-KTH/human-eval-java
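For reference, here is a minimal sketch (not our actual infrastructure; the wrapper and function names are made up for illustration) of how those Defects4J commands can be driven from a script:

```python
import subprocess
from pathlib import Path

def d4j(args, cwd=None):
    """Run a Defects4J subcommand and capture its output."""
    return subprocess.run(["defects4j", *args], cwd=cwd,
                          capture_output=True, text=True)

def evaluate_candidate(project: str, bug_id: int, workdir: str) -> str:
    """Checkout the buggy revision, apply a patch, compile, and run the tests."""
    workdir = Path(workdir)

    # Checkout the buggy revision ("b" suffix; "f" would be the human-fixed one).
    d4j(["checkout", "-p", project, "-v", f"{bug_id}b", "-w", str(workdir)])

    # ... apply the candidate patch to the checked-out sources here ...

    # Compile the project; a candidate that does not compile is discarded.
    if d4j(["compile"], cwd=workdir).returncode != 0:
        return "uncompilable"

    # Run the developer test suite; Defects4J prints a "Failing tests: N" summary.
    test = d4j(["test"], cwd=workdir)
    return "plausible" if "Failing tests: 0" in test.stdout else "test-failing"
```

A candidate is then considered plausible when it compiles and the whole developer test suite passes, e.g. `evaluate_candidate("Chart", 1, "/tmp/chart_1_buggy")`.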
@andre15silva, how have you integrated the code generated by LLMs back into the buggy code? Do you plan to share that code?
Also, LLMs sometimes generate noisy output.
What steps have you taken to address this?
> @andre15silva, how have you integrated the code generated by LLMs back into the buggy code? Do you plan to share that code?
We replace the buggy function with the candidate fixed function generated by the model. Yes, the code for doing that is part of the infrastructure we plan to open-source.
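As an illustrative sketch only (assuming the line range of the buggy function is known from the benchmark metadata; this is not our actual code), the splicing step looks roughly like this:

```python
from pathlib import Path

def splice_candidate(source_file: str, start_line: int, end_line: int,
                     candidate_function: str) -> None:
    """Overwrite the buggy function (1-indexed, inclusive line range)
    with the candidate fixed function generated by the model."""
    path = Path(source_file)
    lines = path.read_text().splitlines(keepends=True)

    # Keep a trailing newline so the patched file stays well-formed.
    if not candidate_function.endswith("\n"):
        candidate_function += "\n"

    patched = lines[:start_line - 1] + [candidate_function] + lines[end_line:]
    path.write_text("".join(patched))
```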
> Also, LLMs sometimes generate noisy output.
> What steps have you taken to address this?
For the models based on codellama we simply take the output. It is worth noting that the models are trained for these representations (i.e. infilling for the base model, or the fine-tuned representations presented in the paper).
This means that they do not output any noise similar to what instruction-tuned models like ChatGPT usually output.
For the ChatGPT experiments we parse the code inside the triple-backtick block, which is the standard Markdown code fence and usually what the model generates.
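As a rough illustration (again, not the exact code we use), the fenced block can be extracted with a small regex, falling back to the raw reply when no fence is found:

```python
import re

# A Markdown code fence is three backticks, optionally tagged with a language.
CODE_FENCE = re.compile(r"`{3}[a-zA-Z]*\n(.*?)`{3}", re.DOTALL)

def extract_code(response: str) -> str:
    """Return the first fenced code block in a ChatGPT reply, or the raw reply."""
    match = CODE_FENCE.search(response)
    return match.group(1).strip() if match else response.strip()
```

The extracted function is then spliced into the buggy file in the same way as for the codellama-based models.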