Is this Change the Answer to that Problem? Correlating Descriptions of Bug and Code Changes for Evaluating Patch Correctness

@inproceedings{tian2022change,
  title={Is this Change the Answer to that Problem? Correlating Descriptions of Bug and Code Changes for Evaluating Patch Correctness},
  author={Tian, Haoye and Tang, Xunzhu and Habib, Andrew and Wang, Shangwen and Liu, Kui and Xia, Xin and Klein, Jacques and Bissyand{\'e}, Tegawend{\'e} F},
  booktitle={37th IEEE/ACM International Conference on Automated Software Engineering},
  pages={1--13},
  url = {https://doi.org/10.1145/3551349.3556914}, 
  doi = {10.1145/3551349.3556914},
  year={2022}
}

Paper Link: https://dl.acm.org/doi/abs/10.1145/3551349.3556914

Quatrain

Quatrain (Question Answering for Patch Correctness Evaluation) is a supervised learning approach that exploits a deep NLP model to classify the relatedness of a bug report with a patch description.

Catalogue of Repository

artifact_detection_model: a model to detect code in text.
data: processed and structured dataset.
experiment: scripts to obtain the experimental results of the paper.
figure: saved figures for the experiments.
preprocess: scripts to extract bug reports and commit messages.
representation: embedding representation model.
utils: scripts to deduplicate the dataset.
---------------
INSTALL.md: installation instructions.
quatrain_model.h5: pre-trained QUATRAIN model for users' custom prediction.
requirements.txt: required dependencies.
run.py: entry point for running the experiments.

Ⅰ) Dataset

bug report summary: title of the bug issue.
bug report description: detailed description of the bug issue.
patch description: CodeTrans-generated commit message for the patch.

A) Table 1: Datasets of labelled patches.

data/bugreport_patch.txt: 9135 (1591:7544) pairs of bug report & commit message. Structured as `bug-id $$ bug report summary $$ bug report description $$ patchId $$ patch description $$ label`.
data/bugreport_patch_json_bert.pickle: BERT embeddings of the pairs of bug report & commit message.
data/bugreport_patch_array_bert.pickle: BERT embeddings of the pairs for 10-fold cross validation.
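The record layout above can be read with a few lines of Python. This is only a minimal sketch: it assumes each record sits on a single line and that `$$` never occurs inside a field. The `PatchRecord` container and the `parse_record` helper are hypothetical names, not part of this repository, and the example line is made up.

```python
from typing import NamedTuple


class PatchRecord(NamedTuple):
    """One bug-report/patch pair (hypothetical container, not from the repo)."""
    bug_id: str
    summary: str
    description: str
    patch_id: str
    patch_description: str
    label: str  # kept as text; how the label is encoded is left open here


def parse_record(line: str) -> PatchRecord:
    """Split one line on the '$$' delimiter into its six fields."""
    fields = [field.strip() for field in line.rstrip("\n").split("$$")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return PatchRecord(*fields)


if __name__ == "__main__":
    # Made-up example line that only mirrors the documented field order.
    example = "bug-1 $$ a summary $$ a description $$ patch-1 $$ a patch description $$ 1"
    print(parse_record(example))
```

Keeping the label as a string avoids guessing at its encoding; convert it once you have checked the actual file.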

B) Collected elements

data/BugReport: Bug report texts for Defects4J, Bugs.jar and Bears. Structured as `bug-id $$ bug report summary $$ bug report description` in a txt file.
data/CommitMessage: Commit messages written by developers or generated by CodeTrans, in JSON and pickle format. Structured as `bug-id: commit message` in the JSON file.
---------------
BATS_RESULT_0.0.json: the prediction results of BATS with cut-off 0.0 on our dataset. 
BATS_RESULT_0.8.json: the prediction results of BATS with cut-off 0.8 on our dataset.
PATCHSIM_RESULT.json: the prediction results of Patch-Sim on our dataset.
PatchLabelsYe.csv: the original prediction results of ODS.
Bears_testinfo.txt: the stack failure information of test suites for Bears.   
bears_index_dict(inverse).json: dictionary of bug-id and commit-id. 
save_bugreport_patch.py: script to produce data/bugreport_patch.txt.

Ⅱ) Requirements

A) Environment

python 3.7 (Anaconda recommended)
pip install -r requirements.txt

If you don't have the python3.7 dev package, run sudo apt-get install python3.7-dev first.

B) Data elements

Download ASE2022withTextUnique.zip (unzip it after downloading) and ASE_features2_bert.pickle from the data on Zenodo, then set the absolute paths of these two files in experiment/config.py of this repository as follows.

self.path_patch ---> ASE2022withTextUnique. Original dataset with patches text and commit messages text.
self.path_ASE2020_feature ---> ASE_features2_bert.pickle. The feature from Tian et al.'s ASE2020 paper for our RQ3 DL experiment.
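As a rough illustration of the two settings, the sketch below uses a stand-in class. Only the attribute names `path_patch` and `path_ASE2020_feature` come from this README. The class name `Config`, the `data_root` argument and the `missing_paths` helper are hypothetical, and the real experiment/config.py may look different.

```python
import os


class Config:
    """Hypothetical stand-in for the settings in experiment/config.py."""

    def __init__(self, data_root: str):
        # Path to the unzipped ASE2022withTextUnique directory.
        self.path_patch = os.path.join(data_root, "ASE2022withTextUnique")
        # Path to the feature pickle used by the RQ3 DL experiment.
        self.path_ASE2020_feature = os.path.join(data_root, "ASE_features2_bert.pickle")

    def missing_paths(self):
        """Return the configured paths that do not exist on disk."""
        paths = [self.path_patch, self.path_ASE2020_feature]
        return [p for p in paths if not os.path.exists(p)]
```

Calling `missing_paths()` before an experiment is a cheap way to catch a wrong download location early.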

Simplified dataset: ASE2022withText.

Ⅲ) Experiment

To obtain the experimental results of our paper, execute run.py with the following parameters:

A) Sec. 2.2 (Hypothesis validation)

Figure 2: Distributions of Euclidean distances between bug and patch descriptions.

python run.py hypothesis
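The quantity plotted in Figure 2 can be illustrated with a minimal sketch, assuming each description has already been turned into a fixed-length embedding vector. The helper name `euclidean_distance` is hypothetical rather than a function from this repository, and the toy vectors are made up.

```python
import math
from typing import Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    bug_vec = [0.0, 0.0, 0.0]    # toy stand-in for a bug report embedding
    patch_vec = [3.0, 4.0, 0.0]  # toy stand-in for a patch description embedding
    print(euclidean_distance(bug_vec, patch_vec))  # prints 5.0
```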

B) Sec. 5.1 (RQ1: Effectiveness of Quatrain)

Figure 5: Distribution of Patches in Train and Test Data.
Table 2: Confusion matrix of Quatrain prediction.

python run.py RQ1

The improved F1: re-balancing the test data yields a better F1 score of 0.793.

python run.py RQ1 balance
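For readers less familiar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts. It does not reproduce the re-balancing step, which is not described here; the helper name `f1_score` is hypothetical and the counts are made up.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up counts: precision = recall = 0.8, so F1 is about 0.8.
    print(round(f1_score(tp=8, fp=2, fn=2), 3))
```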

C) Sec. 5.2 (RQ2: Analysis of the Impact of Input Quality on Quatrain)

RQ 2.1

Figure 6: Impact of length of patch description to prediction.

python run.py RQ2.1

RQ 2.2

Figure 7: The distribution of probability of patch correctness on original and random bug report.
The dropped +Recall: 22% (241/1073) of developer patches, which were previously predicted as correct, are no longer recalled by Quatrain after they have been associated to a random bug report.

python run.py RQ2.2

RQ 2.3

Figure 8: Impact of distance between generated patch description to ground truth on prediction performance.
The dropped +Recall: The metric (+Recall) drops by 37 percentage points to 45% when the developer-written descriptions are replaced with CodeTrans-generated descriptions.

python run.py RQ2.3

The dropped AUC: we evaluated Quatrain in a setting where all developer commit messages were replaced with CodeTrans-generated descriptions: the AUC metric dropped by 11 percentage points to 0.774, confirming our findings.

python run.py RQ2.3 CodeTrans

D) Sec. 5.3 (RQ3: Comparison Against the State of the Art)

Sec. 5.3.1 (Comparing against Static Approaches)

Table 3: Quatrain vs a DL-based patch classifier.
New identification: Among 9135 patches, our approach identifies 7842 patches, of which 2735 patches cannot be identified by Tian et al.'s approach (RF).

python run.py RQ3 DL

Table 4: Quatrain vs BATS.
New identification: 180 out of 345 patches are exclusively identified by Quatrain.

python run.py RQ3 BATS

Sec. 5.3.2 (Comparing against Dynamic Approach)

Table 5: Quatrain vs (execution-based) PATCH-SIM.
New identification: Most of the patches (1856/3149) that we identify are not correctly predicted by PATCH-SIM.

python run.py RQ3 PATCHSIM

E) Sec. 6.1 (Experimental insights)

RF with 10-fold: RandomForest (RF) on the embeddings of the bug report and the patch based on 10-fold cross validation.
RF with 10-group: RandomForest (RF) on the embeddings of the bug report and the patch based on 10-group cross validation.

python run.py insights
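The difference between the two settings can be sketched as follows. The sketch assumes that "10-group" means all patches belonging to the same bug are kept in one fold, which this README does not spell out. Both helper names are hypothetical, and the deterministic assignment (no shuffling) is a simplification; libraries such as scikit-learn offer `KFold` and `GroupKFold` for real use.

```python
from typing import Dict, List, Sequence


def fold_assignment(n_samples: int, k: int) -> List[int]:
    """Plain k-fold: sample i goes to fold i mod k, ignoring bug ids."""
    return [i % k for i in range(n_samples)]


def group_assignment(bug_ids: Sequence[str], k: int) -> List[int]:
    """Group-wise k-fold: samples sharing a bug id land in the same fold."""
    fold_of_bug: Dict[str, int] = {}
    for bug_id in bug_ids:
        if bug_id not in fold_of_bug:
            # Assign each newly seen bug to the next fold in round-robin order.
            fold_of_bug[bug_id] = len(fold_of_bug) % k
    return [fold_of_bug[bug_id] for bug_id in bug_ids]


if __name__ == "__main__":
    bug_ids = ["a", "a", "b", "b", "c"]   # made-up bug ids, one per patch
    print(fold_assignment(len(bug_ids), 2))  # [0, 1, 0, 1, 0]
    print(group_assignment(bug_ids, 2))      # [0, 0, 1, 1, 0]
```

With the plain split, the two patches of bug "a" end up in different folds, so the same bug report can appear in both training and test data. The group-wise split prevents that.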

Ⅳ) Custom Prediction

To predict the correctness of your custom patches, you are welcome to use the prediction interface.

A) Requirements for BERT

BERT model client&server: 24-layer, 1024-hidden, 16-heads, 340M parameters. download it here.
Environment for BERT server (different from reproduction)
- python 3.7
- pip install tensorflow==1.14
- pip install bert-serving-client==1.10.0
- pip install bert-serving-server==1.10.0
- pip install protobuf==3.20.1
- Launch BERT server via bert-serving-start -model_dir "Path2BertModel"/wwm_cased_L-24_H-1024_A-16 -num_worker=2 -max_seq_len=360 -port 8190
- Change the port in BERT_Port if port 8190 is already in use.
Bug report text: developer-written bug report.
Patch description text: generate a description for each of your plausible patches with a commit message generation tool, e.g. CodeTrans (GitHub and API).

B) Predict

Let's give it a try!

python run.py predict $bug_report_text $patch_description_text

For instance: python run.py predict 'Missing type-checks for var_args notation' 'check var_args properly'
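When the texts contain spaces or quotes, it can be easier to build the call programmatically than to quote it by hand. The sketch below assumes it is run from the repository root and that the BERT server is already up. `build_predict_command` is a hypothetical helper, not part of the repository.

```python
import shlex
import sys
from typing import List


def build_predict_command(bug_report_text: str, patch_description_text: str) -> List[str]:
    """Argument list for `python run.py predict <bug report> <patch description>`."""
    return [sys.executable, "run.py", "predict", bug_report_text, patch_description_text]


if __name__ == "__main__":
    cmd = build_predict_command(
        "Missing type-checks for var_args notation",
        "check var_args properly",
    )
    # Shell-quoted form, handy for pasting into a terminal.
    print(" ".join(shlex.quote(part) for part in cmd))
    # To actually launch the prediction:
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues entirely, because each text reaches run.py as a single argument.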

Ⅴ) Custom Train

To re-train the QUATRAIN model on our dataset or on another one, execute the following steps.

Structure your dataset in the same format as data/bugreport_patch.txt.
Obtain the BERT embeddings of your dataset via experiment/save_bugreport_dataset_json.py.
Accordingly, change self.dataset_json in experiment/config.py.
Execute python run.py RQ1.
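For the first step, a custom dataset can be serialized with a small helper like the one below. It is a sketch under stated assumptions: one record per line, fields joined by ` $$ ` in the order documented for data/bugreport_patch.txt, and line breaks inside a field collapsed to spaces. `format_record` is a hypothetical name, and the label value in the example is made up.

```python
def format_record(bug_id: str, summary: str, description: str,
                  patch_id: str, patch_description: str, label) -> str:
    """Join six fields into one '$$'-delimited line."""
    fields = [bug_id, summary, description, patch_id, patch_description, str(label)]
    cleaned = []
    for field in fields:
        if "$$" in field:
            raise ValueError("field must not contain the '$$' delimiter")
        # Collapse newlines and repeated whitespace so the record stays on one line.
        cleaned.append(" ".join(field.split()))
    return " $$ ".join(cleaned)


if __name__ == "__main__":
    print(format_record("bug-1", "a summary", "a multi-line\ndescription",
                        "patch-1", "a patch description", 1))
```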

License

Quatrain is distributed under the terms of the MIT License, see LICENSE.

Trustworthy-Software/Quatrain