CURE: Code-Aware Neural Machine Translation for Automatic Program Repair

A pytorch implementation of paper CURE: Code-Aware Neural Machine Translation for Automatic Program Repair

File Structure

results: This folder contains all the bugs in both Defects4J and QuixBugs benchmarks that CURE fixed. Each file contains the buggy line, CURE's patch and the developer's patch
candidate_patches: This folder contains all the candidate patches CURE generated for bugs in each benchmark
data: This folder contains the vocabulary file, subword tokenizer, some training data examples, and the GPT PL model pre-trained on code.
- vocabulary
  - subword.txt: the subword tokenizer model needed by subword-nmt
  - vocabulary.txt: the vocabulary file used in CURE's paper
- models: This folder is used to save the models
  - code_gpt.pt: the save GPT PL model trained on code
- patches: This folder is used to save the generated patches
  - gpt_conut_1.txt: an example file that contains the candidate patches generated by a GPT-CoNuT model, including 100 patches for each QuixBugs bug.
  - gpt_fconv_1.txt: an example file that contains the candidate patches generated by a GPT-FConv model, including 100 patches for each QuixBugs bug.
- data: This folder is used to save the training data and validation data
  - CURE uses the source code training data shared by previous work CoCoNuT
src: This folder includes the source code for CURE's APR model

Dependency

Python 3.8
PyTorch 1.4.0
NumPy 1.18.1
Huggingface transformers 2.10.0
subword-nmt

Usage

To train a GPT-CoNuT model, run src/trainer/gpt_conut_trainer.py Some settings you may need to change:

vocab_file: the path to the vocabulary file used by the model
train_file: the path to the training data
valid_file: the path to the validation data
gpt_file: the path to the saved GPT PL model
hyper_parameter: the hyper-parameter of the model (including the number of encoder/decoder layers, dropout rate, etc.)
save_dir: the directory to save the model, default: data/models/

To train a GPT-FConv model, run src/trainer/gpt_fconv_trainer.py Some settings you may need to change:

vocab_file: the path to the vocabulary file used by the model
train_file: the path to the training data
valid_file: the path to the validation data
gpt_file: the path to the saved GPT PL model
hyper_parameter: the hyper-parameter of the model (including the number of encoder/decoder layers, dropout rate, etc.)
save_dir: the directory to save the model, default: data/models/

To prepare input for new test data, check data/data/prepare_testing_data.py, make sure you check the readme file and follow the three steps to prepare the test input.

To generate patches, run src/tester/generator.py Some settings you may need to change:

vocab_file: the path to the vocabulary file used by the model
input_file: the input data to the model for generating patches, with each line referring to a bug in the following format: buggy line <CTX> surrounding function. see candidate_patches/QuixBugs/quixbugs_bpe.txt for reference.
identifier_txt_file: the valid identifiers for each bug, with each line being a list of valid identifiers, identifiers are split by space. see candidate_patches/QuixBugs/identifier.txt for reference
identifier_token_file: the tokenized identifiers for each bug, with each line being a list of valid identifiers tokenized by camel letter, underscore, and subword. identifiers are split by \t. see candidate_patches/QuixBugs/identifier.tokens for reference
output_file: the path to the output result
beam_size: the number of candidate patches generated by each model
model_file: the path to the saved APR model
CURE's trained models: https://zenodo.org/record/7030145#.YwvXfFvMI5l

data/patches/gpt_conut_1.txt and data/patches/gpt_fconv_1.txt are example candidate patches generated by GPT-CoNuT and GPT-FConv models for QuixBugs benchmark.

To validate the candidate patches generated by models, run src/validation/rerank.py, which will rerank the patches generated by all the models and the result will be dumped into data/patches/reranked_patches.json, then run src/validation/validate_quixbugs.py or src/validation/validate_defects4j.py, which will run unit test cases (offered by Defects4J or QuixBugs) to validate the candidate patches. The final result will be dumped into data/patches/validated_patches.json

If you use CURE for academic purpose, please cite the following citation:

@inproceedings{jiang2021cure,
  author={Jiang, Nan and Lutellier, Thibaud and Tan, Lin},
  booktitle={2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)}, 
  title={CURE: Code-Aware Neural Machine Translation for Automatic Program Repair}, 
  year={2021},
  pages={1161-1173},
  doi={10.1109/ICSE43902.2021.00107}
}

lin-tan/CURE

CURE: Code-Aware Neural Machine Translation for Automatic Program Repair

File Structure

Dependency

Usage