NPR4J: A Python repository from kwz219

NPR4J-Framework

Supported NPR systems

List of implemented papers

Overall Procedure

Instructions for training and translating models with NPR4J

https://docs.google.com/document/d/1EEheTmFiMcvsgvCtb5BSL4YdlxLiSReweuzqKIshCkE/edit?usp=sharing

Install Prerequisites

pip install bson scipy pymongo h5py javalang nltk torch transformers OpenNMT-py==2.2.0
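After installation, a quick importability check can catch a missing package before a long run fails. The following is a minimal sketch and not part of NPR4J: the helper name `missing_modules` is hypothetical, and the module list assumes the usual import names for the packages above (in particular `onmt` for OpenNMT-py).

```python
import importlib.util

# Import names assumed for the packages installed above
# (OpenNMT-py is assumed to be importable as "onmt").
REQUIRED_MODULES = [
    "bson", "scipy", "pymongo", "h5py", "javalang",
    "nltk", "torch", "transformers", "onmt",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All prerequisite modules were found.")
```

The check only confirms that each module can be located; it does not verify versions such as the pinned OpenNMT-py==2.2.0.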

Data Preprocessing

Processing data into the form that each NPR system needs

Preprocess_RawData.py

Before preprocessing, organize your data in a directory with the following layout:

|---data_dir
    |---data.ids: each line holds an id that identifies a data sample
    |---buggy_lines: each file contains the buggy line of a sample
    |---buggy_methods: each file contains the buggy method of a sample
    |---buggy_classes: each file contains the buggy class of a sample
    |---fix_lines: each file contains the developer patch line of a sample
    |---fix_methods: each file contains the developer patch method of a sample
    |---fix_classes: each file contains the developer patch class of a sample
    |---metas: meta information of data samples
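Before running the preprocessing script, it can help to confirm that a directory matches the layout above. The sketch below is illustrative only and not part of NPR4J: `check_data_dir` is a hypothetical helper, it treats `data.ids` as a text file and the six `buggy_*` / `fix_*` entries as directories, and it only checks that `metas` exists, since the layout does not say whether it is a file or a directory. How files inside each directory are named is not checked.

```python
from pathlib import Path

# Directory entries named in the layout above.
SAMPLE_DIRS = [
    "buggy_lines", "buggy_methods", "buggy_classes",
    "fix_lines", "fix_methods", "fix_classes",
]


def check_data_dir(data_dir):
    """Return (sample_ids, problems) for a raw-data directory."""
    root = Path(data_dir)
    problems = []
    ids = []

    ids_file = root / "data.ids"
    if ids_file.is_file():
        # One sample id per line; blank lines are skipped.
        lines = ids_file.read_text().splitlines()
        ids = [line.strip() for line in lines if line.strip()]
    else:
        problems.append("missing file: data.ids")

    for name in SAMPLE_DIRS:
        if not (root / name).is_dir():
            problems.append(f"missing directory: {name}")

    # Assumption: only existence is checked for metas.
    if not (root / "metas").exists():
        problems.append("missing entry: metas")

    return ids, problems
```

An empty `problems` list means every expected entry was found; `ids` can then be used to spot-check that each sample has its files.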

Raw data of NPR4J-Benchmark can be downloaded from this link: https://drive.google.com/drive/folders/1vKyABQbdvH8SuQc23VihB2INj_brrdnv?usp=sharing

Training

To train an NPR system, use a command like this:

python train.py -model NPR_SYSTEM_NAME -config CONFIG_FILE_PATH

Generating Patches

To generate patches with a trained NPR system, use a command like this:

python translate.py -model NPR_SYSTEM_NAME -config CONFIG_FILE_PATH
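When training and patch generation are run back to back, the two commands can be driven from a small script. This is a sketch under stated assumptions, not an NPR4J utility: `build_command` and `train_then_translate` are hypothetical helpers, the script is assumed to run from the repository root where `train.py` and `translate.py` live, and using separate config files for the two steps is an assumption. The accepted values for NPR_SYSTEM_NAME are not listed here, so the model name is passed through unchanged.

```python
import subprocess
import sys


def build_command(script, model, config_path):
    """Build the argument list for train.py or translate.py."""
    if script not in ("train.py", "translate.py"):
        raise ValueError(f"unexpected script: {script}")
    # Mirrors: python <script> -model NPR_SYSTEM_NAME -config CONFIG_FILE_PATH
    return [sys.executable, script, "-model", model, "-config", config_path]


def train_then_translate(model, train_config, translate_config):
    """Run training, then patch generation; stop on the first failure."""
    subprocess.run(build_command("train.py", model, train_config), check=True)
    subprocess.run(build_command("translate.py", model, translate_config), check=True)
```

Passing the arguments as a list avoids shell quoting problems with config paths that contain spaces, and `check=True` raises an error if training fails so that patch generation is not attempted with a missing model.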

Resources of trained NPR systems

Trained NPR models can be downloaded from this link: https://drive.google.com/drive/folders/18WmVJQwAOmcbudgHK839KYfY98JKVrEH?usp=sharing

GPU memory requirements for each NPR model (with the tuned hyperparameters used in our experiments)

SequenceR: 20GB for training, less than 10GB for predicting
Recoder: 40GB for training, 20GB for predicting
CODIT: less than 10GB for training and predicting
Edits: less than 10GB for training and predicting
CoCoNut (singleton mode): less than 10GB for training and predicting
Tufano: less than 10GB for training and predicting
CodeT5-ft: 40GB for training, 20GB for predicting
UniXCoder-ft: 40GB for training, 20GB for predicting
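The figures above can be turned into a quick lookup for deciding which systems fit on a given GPU. The sketch below is illustrative and not part of NPR4J: the table copies the listed values, "less than 10GB" is recorded as 10 (an upper bound, not a measurement), and the function names are hypothetical. Detecting the installed GPU relies on PyTorch and returns `None` when PyTorch or CUDA is unavailable.

```python
# Values in GB, copied from the list above.
# "Less than 10GB" is recorded as 10, i.e. an upper bound.
GPU_MEMORY_GB = {
    "SequenceR": {"train": 20, "predict": 10},
    "Recoder": {"train": 40, "predict": 20},
    "CODIT": {"train": 10, "predict": 10},
    "Edits": {"train": 10, "predict": 10},
    "CoCoNut": {"train": 10, "predict": 10},
    "Tufano": {"train": 10, "predict": 10},
    "CodeT5-ft": {"train": 40, "predict": 20},
    "UniXCoder-ft": {"train": 40, "predict": 20},
}


def systems_that_fit(available_gb, phase="train"):
    """Return the systems whose listed requirement is within available_gb."""
    return sorted(
        name for name, need in GPU_MEMORY_GB.items()
        if need[phase] <= available_gb
    )


def detect_gpu_memory_gb(device=0):
    """Total memory of a CUDA device in GB, or None if it cannot be read."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(device).total_memory / 1024**3
```

Note that the total reported by the driver is usually slightly below a card's nominal size, so a comparison against the detected value is conservative for borderline cases.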

Latest Experiment Results

Considering 9 NPR systems: Edits, Tufano, CoCoNut, CodeBERT-ft, RewardRepair, Recoder, SequenceR, CodeBert-ft, UniXCoder-ft
Candidate number: up to 300
Manual validation results 1: https://docs.google.com/spreadsheets/d/11oUYyEiMnDfHRONSrB9hY1smXcrroJSN/edit?usp=sharing&ouid=116802316915888919937&rtpof=true&sd=true
Manual validation results 2: latest_results/additional_result_check.xlsx
