Is this Change the Answer to that Problem? Correlating Descriptions of Bug and Code Changes for Evaluating Patch Correctness
@inproceedings{tian2022change,
title={Is this Change the Answer to that Problem? Correlating Descriptions of Bug and Code Changes for Evaluating Patch Correctness},
author={Tian, Haoye and Tang, Xunzhu and Habib, Andrew and Wang, Shangwen and Liu, Kui and Xia, Xin and Klein, Jacques and Bissyand{\'e}, Tegawend{\'e} F},
booktitle={37th IEEE/ACM International Conference on Automated Software Engineering},
pages={1--13},
url = {https://doi.org/10.1145/3551349.3556914},
doi = {10.1145/3551349.3556914},
year={2022}
}
Paper Link: https://dl.acm.org/doi/abs/10.1145/3551349.3556914
Quatrain (Question Answering for Patch Correctness Evaluation) is a supervised learning approach that exploits a deep NLP model to classify the relatedness of a bug report with a patch description.
artifact_detection_model: a model to detect code in text.
data: processed and structured dataset.
experiment: scripts to obtain the experimental results of the paper.
figure: saved figures for the experiments.
preprocess: scripts to extract bug reports and commit messages.
representation: embedding representation model.
utils: scripts to deduplicate the dataset.
---------------
INSTALL.md: installation instructions.
quatrain_model.h5: pre-trained QUATRAIN model for users' custom prediction.
requirements.txt: required dependencies.
run.py: entry point for running the experiments.
- bug report summary: title for bug issue.
- bug report description: detailed description for bug issue.
- patch description: CodeTrans-generated commit message for patch.
- data/bugreport_patch.txt: 9135 (1591:7544) Pairs of Bug report & Commit message. Structured as
bug-id $$ bug report summary $$ bug report description $$ patchId $$ patch description $$ label
- data/bugreport_patch_json_bert.pickle: Bert embeddings of Pairs of Bug report & Commit message.
- data/bugreport_patch_array_bert.pickle: Bert embeddings of pairs for 10-fold cross validation.
data/BugReport: Bug report texts for Defects4j, Bugsjar, and Bears. Structured as `bug-id $$ bug report summary $$ bug report description` in txt files.
data/CommitMessage: Commit messages written by developers or generated by CodeTrans, in json and pickle formats. Structured as `bug-id: commit message` in the json file.
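The `$$`-delimited layout of data/bugreport_patch.txt can be read with a few lines of Python. The following is a minimal sketch, not the repository's own loader: the names `BugPatchPair`, `parse_pair_line`, and `load_pairs` are hypothetical, the sample record is invented, and it assumes one record per line with no `$$` inside a field. The label is kept as text because its encoding is not stated here.

```python
from typing import List, NamedTuple


class BugPatchPair(NamedTuple):
    """One record of data/bugreport_patch.txt (field names are illustrative)."""
    bug_id: str
    summary: str
    description: str
    patch_id: str
    patch_description: str
    label: str  # kept as text; the label encoding is an open assumption


def parse_pair_line(line: str) -> BugPatchPair:
    """Split one '$$'-delimited line into its six fields."""
    fields = [field.strip() for field in line.rstrip("\n").split("$$")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return BugPatchPair(*fields)


def load_pairs(path: str) -> List[BugPatchPair]:
    """Parse every non-empty line of the file (assumes one record per line)."""
    with open(path, encoding="utf-8") as handle:
        return [parse_pair_line(line) for line in handle if line.strip()]


# Invented sample record, for illustration only.
example = ("Project-1 $$ Missing type-checks for var_args notation $$ "
           "longer description $$ patch-1 $$ check var_args properly $$ 1")
pair = parse_pair_line(example)
print(pair.bug_id, "|", pair.patch_description)
```

Raising on a wrong field count makes malformed lines visible early instead of silently shifting columns.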
---------------
BATS_RESULT_0.0.json: the prediction results of BATS with cut-off 0.0 on our dataset.
BATS_RESULT_0.8.json: the prediction results of BATS with cut-off 0.8 on our dataset.
PATCHSIM_RESULT.json: the prediction results of Patch-Sim on our dataset.
PatchLabelsYe.csv: the original prediction results of ODS.
Bears_testinfo.txt: the stack failure information of test suites for Bears.
bears_index_dict(inverse).json: dictionary of bug-id and commit-id.
save_bugreport_patch.py: script to produce data/bugreport_patch.txt.
- python 3.7 (Anaconda recommended)
- pip install -r requirements.txt
If you do not have the python3.7 dev package, first run `sudo apt-get install python3.7-dev`.
Download ASE2022withTextUnique.zip (unzip it) and ASE_features2_bert.pickle from the data on Zenodo, then set the absolute paths of these two files in experiment/config.py of this repository as follows.
- self.path_patch ---> ASE2022withTextUnique. Original dataset with patches text and commit messages text.
- self.path_ASE2020_feature ---> ASE_features2_bert.pickle. The feature from Tian et al.'s ASE2020 paper for our RQ3 DL experiment.
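Before launching an experiment, it can help to confirm that the two configured paths are usable. The helper below is only a sketch: `check_dataset_paths` is a hypothetical name that does not exist in the repository, and it assumes the unzipped ASE2022withTextUnique is a directory while ASE_features2_bert.pickle is a regular file.

```python
import os
from typing import List


def check_dataset_paths(path_patch: str, path_feature: str) -> List[str]:
    """Return a list of problems; an empty list means both paths look usable."""
    problems = []
    # Assumption: the unzipped patch dataset is a directory.
    if not os.path.isabs(path_patch):
        problems.append(f"path_patch is not absolute: {path_patch}")
    elif not os.path.isdir(path_patch):
        problems.append(f"path_patch is not an existing directory: {path_patch}")
    # Assumption: the feature pickle is a single regular file.
    if not os.path.isabs(path_feature):
        problems.append(f"path_ASE2020_feature is not absolute: {path_feature}")
    elif not os.path.isfile(path_feature):
        problems.append(f"path_ASE2020_feature is not an existing file: {path_feature}")
    return problems


# Placeholder paths, for illustration only.
for problem in check_dataset_paths("data/ASE2022withTextUnique",
                                   "data/ASE_features2_bert.pickle"):
    print(problem)
```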
Simplified dataset: ASE2022withText.
To obtain the experimental results of our paper, execute run.py with the following parameters:
- Figure 2: Distributions of Euclidean distances between bug and patch descriptions.
python run.py hypothesis
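For readers unfamiliar with the measure behind Figure 2, the sketch below computes the Euclidean distance between two embedding vectors. It is an illustration only: `euclidean_distance` is a hypothetical helper, and the toy 3-dimensional vectors stand in for the real embeddings of a bug description and a patch description.

```python
import math
from typing import Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance between two vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy vectors, not real embeddings.
bug_vec = [0.0, 3.0, 1.0]
patch_vec = [4.0, 0.0, 1.0]
print(euclidean_distance(bug_vec, patch_vec))  # 5.0
```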
- Figure 5: Distribution of Patches in Train and Test Data.
- Table 2: Confusion matrix of Quatrain prediction.
python run.py RQ1
- The improved F1: a better F1 score of 0.793 by re-balancing the test data.
python run.py RQ1 balance
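The metrics reported in this README (F1, +Recall) can be derived from a confusion matrix. The sketch below assumes that +Recall is the recall on correct patches and -Recall is the recall on incorrect patches; the function name `patch_metrics` and the counts are invented for illustration and are not results from the paper.

```python
from typing import Dict


def patch_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Compute metrics from confusion-matrix counts.

    tp / fn: correct patches predicted as correct / incorrect.
    tn / fp: incorrect patches predicted as incorrect / correct.
    """
    pos_recall = tp / (tp + fn) if tp + fn else 0.0   # assumed meaning of +Recall
    neg_recall = tn / (tn + fp) if tn + fp else 0.0   # assumed meaning of -Recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    denominator = precision + pos_recall
    f1 = 2 * precision * pos_recall / denominator if denominator else 0.0
    return {"+Recall": pos_recall, "-Recall": neg_recall,
            "Precision": precision, "F1": f1}


# Invented counts, for illustration only.
metrics = patch_metrics(tp=80, fn=20, tn=150, fp=50)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because precision depends on the ratio of correct to incorrect patches in the test set, changing that ratio changes F1 even when both recalls stay the same.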
- Figure 6: Impact of length of patch description to prediction.
python run.py RQ2.1
- Figure 7: The distribution of probability of patch correctness on original and random bug report.
- The dropped +Recall: 22% (241/1073) of developer patches, which were previously predicted as correct, are no longer recalled by Quatrain after they have been associated to a random bug report.
python run.py RQ2.2
- Figure 8: Impact of distance between generated patch description to ground truth on prediction performance.
- The dropped +Recall: The metric (+Recall) drops by 37 percentage points to 45% when the developer-written descriptions are replaced with CodeTrans-generated descriptions.
python run.py RQ2.3
- The dropped AUC: we evaluated Quatrain in a setting where all developer commit messages were replaced with CodeTrans-generated descriptions: the AUC metric dropped by 11 percentage points to 0.774, confirming our findings.
python run.py RQ2.3 CodeTrans
- Table 3: Quatrain vs a DL-based patch classifier.
- New identification: Among 9135 patches, our approach identifies 7842 patches, of which 2735 patches cannot be identified by Tian et al.'s approach (RF).
python run.py RQ3 DL
- Table 4: Quatrain vs BATS.
- New identification: 180 out of 345 patches are exclusively identified by Quatrain.
python run.py RQ3 BATS
- Table 5: Quatrain vs (execution-based) PATCH-SIM.
- New identification: Most of the patches (1856/3149) that we identify are not correctly predicted by PATCH-SIM.
python run.py RQ3 PATCHSIM
- RF with 10-fold: RandomForest (RF) on the embeddings of the bug report and the patch based on 10-fold cross validation.
- RF with 10-group: RandomForest (RF) on the embeddings of the bug report and the patch based on 10-group cross validation.
python run.py insights
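The difference between the two settings can be sketched as follows, under the assumption that "10-group" means samples are grouped by bug-id so that all patches of one bug report land in the same split (this README does not define the term). The helper `group_folds` is hypothetical and is not the repository's implementation.

```python
import random
from typing import List, Sequence


def group_folds(bug_ids: Sequence[str], n_groups: int = 10,
                seed: int = 0) -> List[List[int]]:
    """Assign sample indices to folds so that samples sharing a bug id share a fold."""
    unique_ids = sorted(set(bug_ids))
    random.Random(seed).shuffle(unique_ids)
    # Deal the shuffled bug ids round-robin into the groups.
    fold_of = {bug_id: i % n_groups for i, bug_id in enumerate(unique_ids)}
    folds: List[List[int]] = [[] for _ in range(n_groups)]
    for index, bug_id in enumerate(bug_ids):
        folds[fold_of[bug_id]].append(index)
    return folds


# Invented bug ids: each entry is the bug id of one (bug report, patch) sample.
sample_bug_ids = ["A", "A", "B", "C", "C", "C", "D"]
folds = group_folds(sample_bug_ids, n_groups=2)
for number, fold in enumerate(folds):
    print(number, [sample_bug_ids[i] for i in fold])
```

A plain 10-fold split shuffles individual samples instead, so patches for the same bug report may appear in both the training and the test fold.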
To predict the correctness of your custom patches, you are welcome to use the prediction interface.
- BERT model client&server: 24-layer, 1024-hidden, 16-heads, 340M parameters. download it here.
- Environment for BERT server (different from reproduction)
- python 3.7
- pip install tensorflow==1.14
- pip install bert-serving-client==1.10.0
- pip install bert-serving-server==1.10.0
- pip install protobuf==3.20.1
- Launch BERT server via
bert-serving-start -model_dir "Path2BertModel"/wwm_cased_L-24_H-1024_A-16 -num_worker=2 -max_seq_len=360 -port 8190
- If port 8190 is already occupied, switch the port in BERT_Port.
- Bug report text: developer-written bug report.
- Patch description text: generate a patch description for each of your plausible patches with a commit message generation tool, e.g. CodeTrans. Github and API.
Let's give it a try!
python run.py predict $bug_report_text $patch_description_text
For instance: python run.py predict 'Missing type-checks for var_args notation' 'check var_args properly'
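When calling the prediction interface from a script, passing the two texts as separate list elements avoids shell quoting problems with multi-word input. This is a sketch under assumptions: `build_predict_command` and `run_predict` are hypothetical wrappers, the script must be run from the repository root with the BERT server already started, and the output of run.py is returned unparsed because its format is not described here.

```python
import subprocess
import sys
from typing import List


def build_predict_command(bug_report_text: str,
                          patch_description_text: str) -> List[str]:
    """Build the argument list for 'python run.py predict <bug> <patch>'."""
    return [sys.executable, "run.py", "predict",
            bug_report_text, patch_description_text]


def run_predict(bug_report_text: str, patch_description_text: str) -> str:
    """Run the prediction and return its raw standard output (format not assumed)."""
    command = build_predict_command(bug_report_text, patch_description_text)
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


cmd = build_predict_command("Missing type-checks for var_args notation",
                            "check var_args properly")
print(cmd[1:])
```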
To re-train the QUATRAIN model on our dataset or on another dataset, execute the following steps.
- Structure your dataset as data/bugreport_patch.txt.
- Obtain Bert embeddings of your dataset via
experiment/save_bugreport_dataset_json.py
- Accordingly, change self.dataset_json in experiment/config.py
- Execute
python run.py RQ1
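For the first step, a custom dataset can be serialized into the `$$`-delimited layout of data/bugreport_patch.txt. The sketch below is illustrative: `format_pair_line` is a hypothetical helper, the sample record is invented, and it assumes one record per line, so whitespace and newlines inside a field are collapsed and fields containing `$$` are rejected.

```python
FIELD_SEPARATOR = " $$ "


def format_pair_line(bug_id: str, summary: str, description: str,
                     patch_id: str, patch_description: str, label) -> str:
    """Serialize one record as a single '$$'-delimited line."""
    fields = [bug_id, summary, description, patch_id,
              patch_description, str(label)]
    cleaned = []
    for field in fields:
        if "$$" in field:
            raise ValueError("a field must not contain the '$$' separator")
        # Collapse newlines and repeated spaces so the record stays on one line.
        cleaned.append(" ".join(field.split()))
    return FIELD_SEPARATOR.join(cleaned)


# Invented record, for illustration only.
line = format_pair_line("Project-1", "Crash on empty input",
                        "NPE when the\nlist is empty",
                        "patch-1", "add null check", 1)
print(line)
```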
Quatrain is distributed under the terms of the MIT License, see LICENSE.