This is the replication package for "Revisiting Learning-based Commit Message Generation". We provide the data and scripts to reproduce all the results of the paper. The contents of each folder are as follows.
CommitMessages
contains the generated commit messages used and analyzed in each RQ, which will be introduced in Section 2.Patterns
contains the tool to obtain the frequent patterns, which will be introduced in Section 3.Scripts
contains the scripts to reproduce the results of this paper, which analyze the commit messages in the folderCommitMessages
. The scripts will be introduced in Section 4.DataSet
contains the benchmark used in this paper.Models
contains the implementation of models studied in this paper.Examples
contains the examples of each pattern.MergeProcess
contains the merge graphs of each pattern.
We provide a requirements file requirements.txt
containing all the packages needed for reproduction. You can run the following commands to create the conda environment and install all the packages.
$ conda create -n icse-msg-study python=3.8
$ conda activate icse-msg-study
$ python3 -m pip install -r requirements.txt
In this section, we introduce the commit messages used in each RQ.
In RQ1, the commit messages generated by the default models are used and analyzed. For each technique , the commit messages generated by the default model are placed in the file CommitMessages/DefaultModels/output_modelname_train
and CommitMessages/DefaultModels/output_modelname
for the training set and testing set respectively , and modelname
is one of NMT, PtrGNCMsg, CODISUM, CoreGen, and FIRA.
In RQ2, the commit messages generated by the models training with various modified datasets are used and analyzed. For each technique, the commit messages generated by the model training with the dataset which decreases the pattern ratio by R are placed in the file CommitMessages/ModifiedDataset/output_modelname_R
, and R is from 0 to 0.9. For example, the commit messages generated by NMT training with the dataset which decreases the pattern ratio by 20% are placed in CommitMessages/ModifiedDataset/output_nmt_0.2
.
In RQ3, the commit messages generated by the models training with various input representations are used and analyzed. For each technique, the commit messages generated by the model training with mark representation are placed in the file CommitMessages/InputRepresentation/output_modelname_onlymark
.
In RQ4, the commit messages generated by the models removing each component are used and analyzed. For each technique, the commit messages generated by the model removing different components are placed in the file CommitMessages/Component/output_modelname_noatten
, CommitMessages/Component/output_modelname_nocopy
, or CommitMessages/Component/output_modelname_noanony
.
We use MaxSP [1], a sequential pattern mining algorithm, to mine frequent patterns from the commit messages generated by each technique. We use the implementation of MaxSP provided by the data mining library SPMF [2], and the documentation is in this website. Switch to the directory Patterns
, you can execute get_patterns.sh
to get the frequent patterns of each model, which will be saved in Patterns/results/modelname_maximal_patterns
.
$ bash get_patterns.sh modelname
In the beginning, the pattern mining algorithm returns some raw patterns, and we iteratively summarize the patterns in a bottom-up way. The merge process is as follows.
(1)fix checkstyle (2)fix crash (3)fix error (4)fix failing test
(5)fix javadoc (6)fix potential npe (7)fix quality (8)fix test (9)fix unit test
(10)fix bug in ... (11)fix a bug in ... (12)fix typo in ... (13)fix npe in ... (14)fix ... in javadoc
(15)fix ... when ...
(16)remove unused ... (17)remove unused code (18)remove unnecessary ... (19)remove unnecessary code
(20)add javadoc (21)add a ... (22)add missing ... (23)add ... to ... (24)add ... for ...
(25)don't ... (26)do not ... (27)don't show ... (28)don't ... if ...
(1-9) --> (29)fix ...
(10-14) --> (30)fix ... in ...
(16)(17) --> (31)remove unused ...
(18)(19) --> (32)remove unnecessary ...
(20)(21) --> (33)add ...
(23)(24) --> (34)add ... for|to ...
(25)(27) --> (35)don't ...
(15)(30) --> (36)fix ... in|when ...
(22)(33) --> (37)add [missing] ...
(26)(35) --> (38)don't|do not ...
(29)(36) --> (39)fix ... [in|when ...]
(31)(32) --> (40)remove unused|unnecessary ...
(34)(37) --> (41)add [missing] ... [for|to ...]
(28)(38) --> (42)don't|do not ... [if...]
(39)(40)(41)(42) are the final Fix Pattern, Removal Pattern, Addition Pattern, and Avoidance Pattern respectively.
To make the merge process more intuitive, we present the merge graph of each pattern.
The merge graph of Addition Pattern is in the following image, which is in the directory MergeProcess/merge_addition.png
.
The merge graph of Removal Pattern is in the following image, which is in the directory MergeProcess/merge_remove.png
.
The merge graph of Fix Pattern is in the following image, which is in the directory MergeProcess/merge_fix.png
.
The merge graph of Avoidance Pattern is in the following image, which is in the directory MergeProcess/merge_avoid.png
.
In this paper, we obtain a total of four patterns, that is, Addition Pattern, Removal Pattern, Fix Pattern, and Avoidance Pattern, which is shown in the following table, including pattern name, pattern format, and pattern description.
Pattern Name | Pattern Format | Pattern Description |
---|---|---|
Addition Pattern | Add [missing] ... [for|to ...] | New/missing elements are needed |
Removal Pattern | Remove unused|unnecessary ... | Code changes have redundant elements |
Fix Pattern | Fix ... [in|when ...] | Code changes have bugs |
Avoidance Pattern | Don't|do not ... [if ...] | Some things shouldn't be done |
The detailed description of each pattern is as follows.
-
Addition Pattern. The commit message is used when a new element is added or certain missing element is added. We give a real example of Addition Pattern in the following image, which is located in the directory
Examples/example_add.png
. Addition Pattern consists of six sub-patterns, including (1) Add missing ... for ..., (2) Add missing ... to ..., (3) Add ... for ..., (4) Add ... to ..., (5) Add missing ..., and (6) Add ... -
Removal Pattern. The commit message is used when the commit removes redundant elements (e.g., code, imports, and methods) in code changes. We give a real example of Removal Pattern in the following image, which is located in the directory
Examples/example_remove.png
. Removal Pattern consists of 2 sub-patterns, (1) Remove unused ..., and (2) Remove unnecessary ... -
Fix Pattern. The commit message is used when the commit fixes some bugs. We give a real example of Fix Pattern in the following image, which is located in the directory
Examples/example_fix.png
. Fix Pattern consists of 3 sub-patterns, (1) Fix ... in ..., (2) Fix ... when ..., and (3) Fix ... -
Avoidance Pattern. The pattern is often related to the commit message which states not to do some wrong behavior. We give a real example of Avoidance Pattern in the following image, which is located in the directory
Examples/example_avoid.png
. Avoidance Pattern consists of 4 sub-patterns, (1) Don't ... if ..., (2) Do not ... if ..., (3) Don't ..., and (4) Do not ...
In this section, we show the scripts and steps to reproduce the results present in the paper, which analyze the commit messages in the folder CommitMessages
. We name the scripts to get the Table X/Figure X presented in RQ Y of the paper as get_rqY_tableX.py
/get_rqY_figureX.py
, which is placed in the folder Scripts
. For example, the Figure 5 is presented in RQ2, so the script to get it is get_rq2_table5.py
. Switch to the directory Scripts
, you can run each script to get the corresponding table/figure, which will be saved in Scripts/TablesAndFigures/rqY_tableX.tex
or Scripts/TablesAndFigures/rqY_figureX.png
. The table is in the format of latex, and can be compiled to pdf, and the package multirow and booktabs are needed for compilation.
$ python get_rqY_tableX.py
or
$ python get_rqY_figureX.py
In addition, we put the commands to execute all the scripts to one single script run_total.sh
, and you can switch to the directory Scripts
and get all the tables/figures of the paper by running
$ bash run_total.sh
[1] Fournier-Viger P, Wu C W, Tseng V S. Mining maximal sequential patterns without candidate maintenance[C]//International Conference on Advanced Data Mining and Applications. Springer, Berlin, Heidelberg, 2013: 169-180.
[2] Fournier-Viger P, Lin J C W, Gomariz A, et al. The SPMF open-source data mining library version 2[C]//Joint European conference on machine learning and knowledge discovery in databases. Springer, Cham, 2016: 36-40.