/CURE

Artefact for our ICSE 2021 paper CURE: Code-Aware Neural Machine Translation for Automatic Program Repair

Primary LanguagePythonOtherNOASSERTION

CURE: Code-Aware Neural Machine Translation for Automatic Program Repair

A pytorch implementation of paper CURE: Code-Aware Neural Machine Translation for Automatic Program Repair

File Structure

  • results: This folder contains all the bugs in both Defects4J and QuixBugs benchmarks that CURE fixed. Each file contains the buggy line, CURE's patch and the developer's patch
  • candidate_patches: This folder contains all the candidate patches CURE generated for bugs in each benchmark
  • data: This folder contains the vocabulary file, subword tokenizer, some training data examples, and the GPT PL model pre-trained on code.
    • vocabulary
      • subword.txt: the subword tokenizer model needed by subword-nmt
      • vocabulary.txt: the vocabulary file used in CURE's paper
    • models: This folder is used to save the models
      • code_gpt.pt: the save GPT PL model trained on code
    • patches: This folder is used to save the generated patches
      • gpt_conut_1.txt: an example file that contains the candidate patches generated by a GPT-CoNuT model, including 100 patches for each QuixBugs bug.
      • gpt_fconv_1.txt: an example file that contains the candidate patches generated by a GPT-FConv model, including 100 patches for each QuixBugs bug.
    • data: This folder is used to save the training data and validation data
      • CURE uses the source code training data shared by previous work CoCoNuT
  • src: This folder includes the source code for CURE's APR model

Dependency

  • Python 3.8
  • PyTorch 1.4.0
  • NumPy 1.18.1
  • Huggingface transformers 2.10.0
  • subword-nmt

Usage

To train a GPT-CoNuT model, run src/trainer/gpt_conut_trainer.py Some settings you may need to change:

  • vocab_file: the path to the vocabulary file used by the model
  • train_file: the path to the training data
  • valid_file: the path to the validation data
  • gpt_file: the path to the saved GPT PL model
  • hyper_parameter: the hyper-parameter of the model (including the number of encoder/decoder layers, dropout rate, etc.)
  • save_dir: the directory to save the model, default: data/models/

To train a GPT-FConv model, run src/trainer/gpt_fconv_trainer.py Some settings you may need to change:

  • vocab_file: the path to the vocabulary file used by the model
  • train_file: the path to the training data
  • valid_file: the path to the validation data
  • gpt_file: the path to the saved GPT PL model
  • hyper_parameter: the hyper-parameter of the model (including the number of encoder/decoder layers, dropout rate, etc.)
  • save_dir: the directory to save the model, default: data/models/

To prepare input for new test data, check data/data/prepare_testing_data.py, make sure you check the readme file and follow the three steps to prepare the test input.

To generate patches, run src/tester/generator.py Some settings you may need to change:

  • vocab_file: the path to the vocabulary file used by the model
  • input_file: the input data to the model for generating patches, with each line referring to a bug in the following format: buggy line <CTX> surrounding function. see candidate_patches/QuixBugs/quixbugs_bpe.txt for reference.
  • identifier_txt_file: the valid identifiers for each bug, with each line being a list of valid identifiers, identifiers are split by space. see candidate_patches/QuixBugs/identifier.txt for reference
  • identifier_token_file: the tokenized identifiers for each bug, with each line being a list of valid identifiers tokenized by camel letter, underscore, and subword. identifiers are split by \t. see candidate_patches/QuixBugs/identifier.tokens for reference
  • output_file: the path to the output result
  • beam_size: the number of candidate patches generated by each model
  • model_file: the path to the saved APR model
  • CURE's trained models: https://zenodo.org/record/7030145#.YwvXfFvMI5l

data/patches/gpt_conut_1.txt and data/patches/gpt_fconv_1.txt are example candidate patches generated by GPT-CoNuT and GPT-FConv models for QuixBugs benchmark.

To validate the candidate patches generated by models, run src/validation/rerank.py, which will rerank the patches generated by all the models and the result will be dumped into data/patches/reranked_patches.json, then run src/validation/validate_quixbugs.py or src/validation/validate_defects4j.py, which will run unit test cases (offered by Defects4J or QuixBugs) to validate the candidate patches. The final result will be dumped into data/patches/validated_patches.json

If you use CURE for academic purpose, please cite the following citation:

@inproceedings{jiang2021cure,
  author={Jiang, Nan and Lutellier, Thibaud and Tan, Lin},
  booktitle={2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)}, 
  title={CURE: Code-Aware Neural Machine Translation for Automatic Program Repair}, 
  year={2021},
  pages={1161-1173},
  doi={10.1109/ICSE43902.2021.00107}
}