Replication Package for "Assessing the Latent Automated Program Repair Capabilities of Large Language Models using Round-Trip Translation"
This repository contains the replication package for the paper "Assessing the Latent Automated Program Repair Capabilities of Large Language Models using Round-Trip Translation" by Fernando Vallecillos Ruiz, Anastasiia Grishina, Max Hort and Leon Moonen, which is currently under review.
An earlier version was deposited on arXiv (DOI: 10.48550/arXiv.2401.07994) under the title "A Novel Approach for Automated Program Repair using Round-Trip Translation with Large Language Models".
The replication package is archived on Zenodo with DOI: 10.5281/zenodo.10500593. It is maintained on GitHub at https://github.com/secureIT-project/RTT_for_APR.
The source code is distributed under the MIT license. Except for third-party datasets that come with their own licenses, all documentation, data, models, and results in this repository are distributed under the CC BY 4.0 license.
The replication package is organized as follows:
- clm-apr:
  - plbart: code to generate patches with PLBART models.
  - codet5: code to generate patches with CodeT5 models.
  - transcoder: code to generate patches with the TransCoder model.
  - incoder: code to generate patches with InCoder models.
  - santacoder: code to generate patches with the SantaCoder model.
  - starcoder: code to generate patches with the StarCoderBase model.
  - quixbugs: code to validate patches generated for the QuixBugs benchmark.
  - defects4j: code to validate patches generated for any of the Defects4J benchmarks.
  - humaneval: code to validate patches generated for the HumanEval-Java benchmark.
- humaneval-java: the HumanEval-Java benchmark proposed by Jiang et al. (2023).
- jasper: a Java tool for parsing Java programs, needed to preprocess the input.
- models: folder into which the language models are downloaded.
- analysis_wandb: data exported from WandB and a Jupyter notebook to create the graphs.
- tmp_benchmarks: folder for temporary files used during patch validation. It may contain pairs of "parallel" folders `src` and `src_org` for each benchmark, used to replace buggy code with candidate patches.
- Python version: 3.8–3.10.
- Git LFS is required for model downloading.
- Create an account on Weights and Biases (WandB).
- Install the Weights and Biases library.
- Run `wandb login` and follow the instructions.
- An OpenAI account with access to `gpt-3.5-turbo` and `gpt-4` is needed. The `OPENAI_API_KEY` environment variable should be set to your OpenAI API access token (see the sanity-check snippet after this list).
- Defects4J: to generate inputs for the Defects4J datasets or to validate patches against them, you need to have the Defects4J tool installed.
- Java 8
- Apache Maven
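As a quick, optional sanity check for the OpenAI requirement above (not part of the replication package), you can confirm that the key is visible to Python before running the GPT-based scripts:

```python
# Optional sanity check (not part of the repository): confirm the OpenAI key is
# set in the environment before running the gpt-3.5-turbo / gpt-4 scripts.
import os

assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY to your OpenAI API access token"
```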
We recommend using the setup script `setup.sh`, which performs the following:
- Creates a virtual environment for Python and activates it.
- Installs the packages in `requirements.txt`.
- Compiles Jasper.
- Downloads parsers.
- Checks that the Defects4J installation is correct.
The bash script `models/download_models.sh` downloads all of the models used. Due to their size, we recommend downloading only the models you are going to use.

```bash
cd models
chmod +x download_models.sh
./download_models.sh
```
To download one specific model, for example PLBART (Java<->C#), use the following commands:

```bash
cd models
git lfs install
git clone https://huggingface.co/uclanlp/plbart-java-cs
git clone https://huggingface.co/uclanlp/plbart-cs-java
cd ..
```
Each script in a `clm-apr/[model]` folder connects one or more models with one dataset. These scripts follow the naming template `[benchmark]_[model]_[technique].py`. Each script first creates a `[model]_input.json` file with the preprocessed input, and then generates outputs based on that file with one or more models.
For example:

```bash
cd clm-apr/plbart
python quixbugs_plbart_round.py     # generates input for QuixBugs and patches using Java<->C# RTT
python quixbugs_plbart_round_nl.py  # generates input for QuixBugs and patches using Java<->NL RTT
```
Optionally, use the argument `--device_map cpu` if you wish to run the script on the CPU, for example:

```bash
python quixbugs_plbart_round.py --device_map cpu
```

Otherwise, the script runs on all available CUDA GPUs.
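For reference, the sketch below illustrates how such a `--device_map` flag is commonly wired to device selection. It is an illustration under assumptions, not the repository's exact code, and uses the `uclanlp/plbart-java-cs` checkpoint only as an example:

```python
# Illustrative sketch (not the repository's exact code) of how a --device_map
# flag can select between CPU and GPU inference for a Hugging Face model.
import argparse

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

parser = argparse.ArgumentParser()
parser.add_argument("--device_map", default="auto",
                    help='pass "cpu" to force CPU-only inference')
args = parser.parse_args()

if args.device_map == "cpu":
    device = torch.device("cpu")
else:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("uclanlp/plbart-java-cs")
model = AutoModelForSeq2SeqLM.from_pretrained("uclanlp/plbart-java-cs").to(device)
```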
We have commented out the input generation in the scripts; users are free to uncomment this method and try it for themselves. It is easily recognizable by the name template `[benchmark]_[model]_input()`, in the previous case `quixbugs_plbart_input()`.
The output-generation steps are also included in the `[benchmark]_[model]_[technique].py` scripts mentioned above. They are modularized in a method recognizable by the name template `[benchmark]_[model]_output()`, for example `quixbugs_incoder_output()`.
This method:
- Reads the input JSON file.
- Generates outputs through the LLM.
- Postprocesses the output (extracts the patch, cleans up extra tokens, etc.).
- Creates `[model]_output_[technique]_[extra].json`.
The last three steps are repeated according to the number of runs set to be performed (10 in our experiments). Each run produces a different file, generated with its own seed. For example, the `quixbugs_plbart_round.py` and `quixbugs_plbart_round_nl.py` scripts create:
```
clm-apr/quixbugs/plbart_results/run_0/plbart_java_cs_java_output_round_csharp_batch.json
clm-apr/quixbugs/plbart_results/run_0/plbart_java_nl_java_output_round_nl_batch.json
```
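Conceptually, each `[benchmark]_[model]_output()` method follows the flow sketched below; the helper functions, output fields, and seed handling are placeholders, not the repository's actual implementation:

```python
# Conceptual sketch of the output stage described above; generate_patches and
# postprocess are placeholders for the model-specific logic in the real scripts.
import json
import os


def generate_patches(inputs, seed):
    # Placeholder: query the LLM for round-trip translations of each input.
    return [{"input_id": i, "patches": [], "seed": seed} for i, _ in enumerate(inputs)]


def postprocess(outputs):
    # Placeholder: extract the patch, clean up extra tokens, etc.
    return outputs


def run_output_stage(input_file, results_dir, total_runs=10):
    with open(input_file) as f:          # 1. read the input JSON file
        inputs = json.load(f)
    for run in range(total_runs):        # steps 2-4 repeat once per run, each with its own seed
        outputs = postprocess(generate_patches(inputs, seed=run))
        run_dir = os.path.join(results_dir, f"run_{run}")
        os.makedirs(run_dir, exist_ok=True)
        out_file = os.path.join(run_dir, "plbart_java_cs_java_output_round_csharp_batch.json")
        with open(out_file, "w") as f:   # 4. write the [model]_output_[technique]_[extra].json file
            json.dump(outputs, f, indent=2)
```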
The last step evaluates the generated outputs against the test suites of each benchmark. The validation script reads the previously generated output files and produces a new file with the test results for one model. Furthermore, it connects to WandB to calculate metrics and upload them for analysis.
Following the previous examples, to validate the results obtained above, we execute:
```bash
cd clm-apr/quixbugs
python validate_quixbugs_parallel.py
```
Given the included JSON files, this script creates:

```
clm-apr/quixbugs/plbart_results/run_0/plbart_java_cs_java_validate_round_csharp_batch.json
```
We have disabled WandB in the script so that users can try it out first. However, it can easily be activated by changing the parameter `mode="disabled"` to `mode="online"`.
We have set the variable `total_runs = 1`, and pointed `input_file` and `output_file` at the results included in the repository. They should be modified accordingly to validate more runs or to validate other files/models.
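Schematically, the knobs described above look like the following inside the validation script (the exact variable placement in `validate_quixbugs_parallel.py` may differ; the paths are shown relative to `clm-apr/quixbugs`):

```python
# Schematic view of the configuration described above (not a verbatim excerpt).
import wandb

total_runs = 1  # increase to validate additional runs

# Point these at the output/validation files for the model and run of interest.
input_file = "plbart_results/run_0/plbart_java_cs_java_output_round_csharp_batch.json"
output_file = "plbart_results/run_0/plbart_java_cs_java_validate_round_csharp_batch.json"

# WandB is disabled by default; switch to mode="online" to compute and upload metrics.
run = wandb.init(mode="disabled")
```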
We include two CSV files obtained through WandB:
- `data_cleaned_grouped.csv`: aggregated metrics of the 25 outputs for all runs.
- `full_data_all_runs.csv`: all metrics for all outputs on all runs.
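For a quick look at these files, a minimal snippet such as the following can be used (it assumes the CSVs are located in `analysis_wandb/` and that `pandas` is installed):

```python
# Minimal example for inspecting the exported WandB data; assumes the CSV files
# are in analysis_wandb/ and pandas is available.
import pandas as pd

grouped = pd.read_csv("analysis_wandb/data_cleaned_grouped.csv")
all_runs = pd.read_csv("analysis_wandb/full_data_all_runs.csv")

print(grouped.head())
print(all_runs.shape)
```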
If you build on this data or code, please cite this work by referring to the paper:
```bibtex
@misc{ruiz2024:rtt:arxiv,
  title        = {A Novel Approach for Automated Program Repair using Round-Trip Translation with Large Language Models},
  author       = {Vallecillos Ruiz, Fernando and Grishina, Anastasiia and Hort, Max and Moonen, Leon},
  year         = {2024},
  month        = jan,
  number       = {arXiv:2401.07994},
  eprint       = {2401.07994},
  primaryclass = {cs},
  publisher    = {{arXiv}}
}
```
- v0.1: initial replication package corresponding to v1 of the arXiv deposit; includes raw data, code, and example outputs.
The work included in this repository was supported by the Research Council of Norway through the secureIT project (IKTPLUSS #288787). Max Hort is supported through the ERCIM ‘Alain Bensoussan’ Fellowship Programme. The empirical evaluation was performed on the Experimental Infrastructure for Exploration of Exascale Computing (eX3), financially supported by the Research Council of Norway under contract #270053.
Jiang, N.; Liu, K.; Lutellier, T.; and Tan, L. 2023. Impact of Code Language Models on Automated Program Repair. In 45th International Conference on Software Engineering (ICSE), 1430–1442. IEEE. ISBN 978-1-66545-701-9.