PRSummarizer

Source code for the paper "Automatic Generation of Pull Request Descriptions".

Dataset

Raw Data

The 333K pull requests we collected can be downloaded from here. An example PR record from the JSON file:

{
    "id": "elastic/elasticsearch_37980",
    "body": "'Eclipse build files were missing so .eclipse project files were not being generated.\\r\\nCloses #37973\\r\\n\\r\\n'",
    "cms": [
      "'Added missing eclipse-build.gradle files\\n\\nCloses #fix/37973'"
    ],
    "commits": {
      "'3e10ee798c932cc1cab1ea6ca679417408fc1416'": {
        "cm": "'Added missing eclipse-build.gradle files\\n\\nCloses #fix/37973'",
        "comments": []
      }
    }
  }
  • id: $user/$project_$prid
  • body: PR description
  • cms: the commit messages in this PR
  • commits: the commits in this PR
    • key is the SHA1 hash digest
      • cm: commit message
      • comments: source code comments added in this commit
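A minimal sketch of reading one record with this layout. It parses the example above from an inline string, because the on-disk layout of the downloadable file is not described here. It assumes the string values are Python-style quoted literals (as the extra single quotes and doubled backslashes in the example suggest) and decodes them with `ast.literal_eval`. The helper names `unquote` and `split_pr_id` are hypothetical and not part of this repository.

```python
import ast
import json

# One record, copied from the example above (raw string keeps the JSON escapes intact).
RECORD_JSON = r'''
{
  "id": "elastic/elasticsearch_37980",
  "body": "'Eclipse build files were missing so .eclipse project files were not being generated.\\r\\nCloses #37973\\r\\n\\r\\n'",
  "cms": ["'Added missing eclipse-build.gradle files\\n\\nCloses #fix/37973'"],
  "commits": {
    "'3e10ee798c932cc1cab1ea6ca679417408fc1416'": {
      "cm": "'Added missing eclipse-build.gradle files\\n\\nCloses #fix/37973'",
      "comments": []
    }
  }
}
'''


def unquote(value):
    """Decode a Python-style quoted literal into plain text (assumed encoding)."""
    return ast.literal_eval(value)


def split_pr_id(pr_id):
    """Split "$user/$project_$prid" into (user, project, pr_number)."""
    user, rest = pr_id.split("/", 1)
    # Split on the last underscore, since project names may contain "_".
    project, _, number = rest.rpartition("_")
    return user, project, int(number)


record = json.loads(RECORD_JSON)
user, project, number = split_pr_id(record["id"])
body = unquote(record["body"])
messages = [unquote(cm) for cm in record["cms"]]
commits = {unquote(sha): commit for sha, commit in record["commits"].items()}

print(user, project, number)
print(body.splitlines()[0])
print(messages[0].splitlines()[0])
print(list(commits))
```

If the values in the real file turn out not to be quoted literals, `unquote` can simply be dropped.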

Preprocessed Dataset

Our dataset can be downloaded from here, which contains:

  • the train, validation and test sets
  • a json file for building vocabulary

Regular Expressions

To preprocess the raw data, we used the following regular expressions:

email_pattern = r'(^|\s)<[\w.-]+@(?=[a-z\d][^.]*\.)[a-z\d.-]*[^.]>'
url_pattern = r'https?://[-a-zA-Z0-9@:%._+~#?=/]+(?=($|[^-a-zA-Z0-9@:%._+~#?=/]))'
reference_pattern = r'#[\d]+'
signature_pattern = r'^(signed-off-by|co-authored-by|also-by):'
at_pattern = r'@\S+'
structure_pattern = r'^#+'
version_pattern = r'(^|\s|-)[\d]+(\.[\d]+){1,}'
sha_pattern = r'(^|\s)[\dA-Fa-f-]{7,}(?=(\s|$))'
digit_pattern = r'(^|\s|-)[\d]+(?=(\s|$))'
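To see what these expressions capture, the sketch below runs four of them over short sample strings. The flags, replacement tokens and order the preprocessing actually uses are not stated here, so this only lists matches and does not reproduce the pipeline. `find_matches` is a hypothetical helper and the sample strings are invented.

```python
import re

# Four of the patterns listed above, copied verbatim.
url_pattern = r'https?://[-a-zA-Z0-9@:%._+~#?=/]+(?=($|[^-a-zA-Z0-9@:%._+~#?=/]))'
reference_pattern = r'#[\d]+'
version_pattern = r'(^|\s|-)[\d]+(\.[\d]+){1,}'
sha_pattern = r'(^|\s)[\dA-Fa-f-]{7,}(?=(\s|$))'


def find_matches(pattern, text):
    """Return every match of `pattern` in `text`, with surrounding whitespace removed."""
    # Some patterns also consume the whitespace in front of the token,
    # so it is stripped here for display.
    return [m.group(0).strip() for m in re.finditer(pattern, text)]


print(find_matches(reference_pattern, "Closes #37973"))
print(find_matches(url_pattern, "see https://github.com/Tbabm/PRSummarizer for details"))
print(find_matches(version_pattern, "bump to 6.5.1"))
print(find_matches(sha_pattern, "revert 3e10ee798c932cc1cab1ea6ca679417408fc1416"))
```

Because `version_pattern`, `sha_pattern`, `digit_pattern` and `email_pattern` include the leading separator in the match, any substitution built on them would need to put that separator back.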

Installation

Clone and Prepare Dataset

$ git clone https://github.com/Tbabm/PRSummarizer.git
$ cd PRSummarizer
$ mkdir data
# download our preprocessed dataset and place the four files in `data`
$ mkdir models

Install ROUGE

  • See here for instructions on installing ROUGE.
  • Make sure the environment variable ROUGE is set to /absolute/path/to/ROUGE-RELEASE-1.5.5
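A quick way to check the variable before going further is sketched below. `rouge_problems` is a hypothetical helper, not part of this repository. It only checks that ROUGE is set, absolute, and points to a directory, and does not verify the ROUGE installation itself.

```python
import os


def rouge_problems(env):
    """Return a list of human-readable problems with the ROUGE setting in `env`."""
    problems = []
    rouge_home = env.get("ROUGE")
    if not rouge_home:
        problems.append("ROUGE is not set")
        return problems
    if not os.path.isabs(rouge_home):
        problems.append("ROUGE is not an absolute path: " + rouge_home)
    if not os.path.isdir(rouge_home):
        problems.append("ROUGE does not point to a directory: " + rouge_home)
    return problems


# Report anything suspicious in the current environment.
for problem in rouge_problems(os.environ):
    print("warning:", problem)
```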

Install Dependencies

Through conda:

$ conda env create -f environment.yml

Or through pip:

$ pip install -r requirements.txt

Install pyrouge

  • Install and test pyrouge if you have not already done so:
$ git clone https://github.com/bheinzerling/pyrouge
$ cd pyrouge
$ pip install .

# set rouge path for pyrouge
$ pyrouge_set_rouge_path ${ROUGE}

# test the installation of pyrouge
$ python -m pyrouge.test

Usage

Train

Train Attn+PG first:

python3 -m prsum.prsum train --param_path params_attn_pg.json

After training, suppose the models are stored in models/train_12345678/model/. Select the best Attn+PG model:

python3 -m prsum.prsum select_model \
                       --param_path params_attn_pg.json \
                       --model_pattern "models/train_12345678/model/model_{}_" \
                       --start_iter 1000 \
                       --end_iter 26000

Suppose the best model is model_12000_87654321. Train Attn+PG+RL based on the best model:

python3 -m prsum.prsum train \
                       --param_path params_attn_pg_rl.json \
                       --model_path "models/train_12345678/model/model_12000_87654321"

Validate

Select the best Attn+PG+RL model:

# start_iter = the best iteration of `Attn+PG` (here, 12000) + save_interval (here, 1000)
START_ITER=13000
python3 -m prsum.prsum select_model \
                       --param_path params_attn_pg_rl.json \
                       --model_pattern "models/train_12345678/model/rl_model_{}_" \
                       --start_iter $START_ITER \
                       --end_iter 41000

Suppose the best model is rl_model_34000_98765432.
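The arithmetic behind START_ITER, and how a pattern containing `{}` could map to checkpoint names, can be sketched as follows. This is an illustration only: the selection logic inside `select_model` is not shown here, `candidate_prefixes` is a hypothetical helper, and treating `end_iter` as inclusive is an assumption.

```python
def candidate_prefixes(model_pattern, start_iter, end_iter, save_interval):
    """Expand a pattern containing "{}" into one checkpoint prefix per saved iteration."""
    # Assumption: end_iter is inclusive and checkpoints are saved every save_interval steps.
    return [model_pattern.format(i) for i in range(start_iter, end_iter + 1, save_interval)]


best_attn_pg_iter = 12000  # best Attn+PG iteration from the training step
save_interval = 1000       # value quoted in the comment above
start_iter = best_attn_pg_iter + save_interval

prefixes = candidate_prefixes(
    "models/train_12345678/model/rl_model_{}_", start_iter, 41000, save_interval
)
print(start_iter)
print(len(prefixes))
print(prefixes[0])
print(prefixes[-1])
```

Under these assumptions, a checkpoint such as models/train_12345678/model/rl_model_34000_98765432 would be found through the prefix generated for iteration 34000.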

Test

Test the best Attn+PG+RL model:

python3 -m prsum.prsum decode \
                       --param_path params_attn_pg_rl.json \
                       --model_path "models/train_12345678/model/rl_model_34000_98765432" \
                       --ngram_filter 1

Now, you will get the test results.

NOTE: Your test results may differ slightly from those reported in our paper, because the pointer generator uses PyTorch's scatter_add function, which is nondeterministic on GPUs. See here for more details.
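The effect behind this note can be illustrated without a GPU. The sketch assumes the variation comes from the order in which parallel additions to the same output index are applied, which is the usual explanation for such behaviour on GPUs. Floating-point addition is not associative, so a different order can change the last bits of the sum. `scatter_add_1d` is a hypothetical pure-Python stand-in, not PyTorch's API.

```python
def scatter_add_1d(size, index, src, order):
    """Add src[i] into out[index[i]], visiting the elements in the given order."""
    out = [0.0] * size
    for i in order:
        out[index[i]] += src[i]
    return out


index = [0, 0, 0]      # all three values land in the same output slot
src = [0.1, 0.2, 0.3]

forward = scatter_add_1d(1, index, src, order=[0, 1, 2])
backward = scatter_add_1d(1, index, src, order=[2, 1, 0])

print(forward[0])
print(backward[0])
print(forward[0] == backward[0])
```

The two sums differ only in the last bits, which is consistent with test scores that vary slightly between runs.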

Pre-trained Model

Our pre-trained model and test results can be downloaded here. To test with our pre-trained model:

mkdir models
mv rl_model_34000 ./models
python3 -m prsum.prsum decode \
                       --param_path params_attn_pg_rl.json \
                       --model_path "models/rl_model_34000" \
                       --ngram_filter 1

Citation

If you use this code, please consider citing our paper:

@inproceedings{liu2019automatic,
  title={Automatic generation of pull request descriptions},
  author={Liu, Zhongxin and Xia, Xin and Treude, Christoph and Lo, David and Li, Shanping},
  booktitle={Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering},
  pages={176--188},
  year={2019},
}

Thanks!

Reference