code_review_automation

This repository is the replication package of the research work "Using Pre-trained Models to Boost Code Review Automation".

In our work, we trained several T5 models to automate three code review tasks, each one using a specific dataset. Here we provide everything needed to replicate our experiments. We also provide all the raw data generated while running our experiments (e.g., the predictions output by the models).

There are two ways of replicating our results:

  • Use our fine-tuned models to generate new predictions;
  • Train your own models from scratch.

For the second option (training your own models), you will need a Google Colab Pro account and a Google Cloud Storage account (details follow).

Resources

  • In the code folder we provide:

    • the Google Colab notebooks we used:
      • Preprocessing.ipynb: preprocess the pre-training dataset and train the Sentencepiece tokenizer;
      • PreTraining.ipynb: pre-train the T5 model;
      • FineTuning.ipynb: fine-tune the T5 models on different tasks.
    • Analyzer.py, Cleaner.py: the two main Python classes we used to preprocess the fine-tuning dataset. In particular, the function isCommentRelevant(...) in the Cleaner class (line 1129) implements the updated heuristic for detecting irrelevant comments.
    • utils: folder containing some useful resources used during the fine-tuning data preprocessing.
  • manual analysis.xlsx: contains the results of the manual analysis we performed on some non-perfect predictions (see the paper for details).

  • perfect_predictions.zip: for convenience, we stored the perfect predictions generated by our models at k=1 (the model is allowed to generate a single prediction) in HTML format. Use these files if you want a quick look at the correct predictions generated by the models. All generated predictions are instead available in results.zip.

Here we stored the extra materials needed to replicate our experiments:

  • automating_code_review.zip contains all the material needed to successfully run our Google Colab notebooks (see section Train your own T5 models for more details).

  • datasets.zip contains all the processed and split datasets we used:

    • pre-training
      • pre-training.tsv
    • fine-tuning
      • new_large
        • code-to-code
          • test.tsv, train.tsv, val.tsv
        • code-to-comment
          • test.tsv, train.tsv, val.tsv
        • code&comment-to-code
          • test.tsv, train.tsv, val.tsv
      • Tufano_etal_ICSE21
        • code-to-code
          • test.tsv, train.tsv, val.tsv
        • code&comment-to-code
          • test.tsv, train.tsv, val.tsv
  • generate_predictions.zip contains the scripts needed to successfully generate predictions using a T5 model checkpoint (see section Use our fine-tuned T5 models for more details).

  • models.zip contains the (best) checkpoints of our T5 models (pre-trained or not), for all the tasks (code-to-code, code-to-comment, code&comment-to-code) and both the datasets (new_large_dataset, Tufano_etal_dataset) we used. We also stored the checkpoint of the pre-trained model without any fine-tuning. The following is the content of the models folder:

    • T5_non_pre-trained_new_large_dataset_code-to-code
    • T5_non_pre-trained_new_large_dataset_code-to-comment
    • T5_non_pre-trained_new_large_dataset_code&comment-to-code
    • T5_non_pre-trained_Tufano_etal_dataset_code-to-code
    • T5_non_pre-trained_Tufano_etal_dataset_code&comment-to-code
    • T5_pre-trained
    • T5_pre-trained_new_large_dataset_code-to-code
    • T5_pre-trained_new_large_dataset_code-to-comment
    • T5_pre-trained_new_large_dataset_code&comment-to-code
    • T5_pre-trained_Tufano_etal_dataset_code-to-code
    • T5_pre-trained_Tufano_etal_dataset_code&comment-to-code
  • tokenizer.zip contains the SentencePiece tokenizer and the extracted vocabulary obtained by training on our pre-training dataset:

    • TokenizerModel.model, TokenizerModel.vocab
  • results.zip contains, for each dataset (new_large_dataset, Tufano_etal_dataset), the results obtained by each model (pre-trained or not) fine-tuned on each task (code-to-code, code-to-comment, code&comment-to-code). In particular, for each combination of dataset and model we share the following files:

    • source.txt: input file for the model;
    • target.txt: target file (expected output);
    • predictions_<k>.txt: generated predictions file with BEAM_SIZE = k (k=1,3,5,10);
    • code_bleu_<k>.txt or bleu_<k>.txt: code_BLEU or BLEU scores file (depending on the task) with BEAM_SIZE = k (k=1,3,5,10);
    • confidence_<k>.txt: confidence scores file with BEAM_SIZE = k (k=1,3,5,10); a minimal example of inspecting these files is sketched right after this list.
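
As a quick way to inspect the results files, here is a minimal sketch (not part of the replication package) that computes the perfect-prediction rate at k=1, assuming predictions_1.txt contains one prediction per line, aligned with target.txt:

# Minimal sketch: perfect-prediction rate at k=1.
# Assumes one prediction per line, aligned with target.txt.
with open("target.txt") as f:
    targets = [line.strip() for line in f]
with open("predictions_1.txt") as f:
    predictions = [line.strip() for line in f]

perfect = sum(p == t for p, t in zip(predictions, targets))
print(f"Perfect predictions: {perfect}/{len(targets)} ({perfect / len(targets):.2%})")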

Use our fine-tuned T5 models

In order to generate predictions with our models you need:

  • the model checkpoints stored in models.zip;
  • the content of the archive generate_predictions.zip;
  • the datasets stored in datasets.zip.

The folder generate_predictions stores all the code necessary to generate the predictions of the T5 models with different beam sizes and evaluate them in terms of perfect predictions and codeBLEU (code-to-code and code&comment-to-code tasks) or BLEU (code-to-comment task) score.

First, you need to convert the model checkpoint to PyTorch. To do that, run the following command from the generate_predictions folder:

python3 ./tf_2_pytorch_T5.py --tf_checkpoint_path <model_path> --config_file ./config.json --pytorch_dump_path ./dumps

where <model_path> is the path to the model checkpoint you want to use. For example, if you want to generate the predictions for the code-to-code task on the new_large_dataset using the pre-trained T5 model, you need to run the following command:

python3 ./tf_2_pytorch_T5.py --tf_checkpoint_path ../models/T5_pre-trained_new_large_dataset_code-to-code/model.ckpt-best --config_file ./config.json --pytorch_dump_path ./dumps

In the Python script generate_predictions/generate_predictions.py, set the beam size (line 45), the task of interest (line 47), and the path to the right dataset (line 48). For example:

beam_size = 1     # number of predictions generated per input (k)
batch_size = 64   # number of inputs processed per generation step
task = 'code2code: '  # possible options: 'code2code: ', 'code&comment2code: ', 'code2comment: '
data_dir = "../dataset/fine-tuning/new_large/code-to-code/"  # dataset for the chosen task

The output is a text file, predictions_k.txt (where k = beam_size), stored in the same dataset folder and containing all the generated predictions.
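
For reference, here is a minimal, hypothetical sketch of how the converted checkpoint, the SentencePiece tokenizer, and the beam size fit together for beam-search generation with the HuggingFace transformers library. It is not the repository's script: the checkpoint and tokenizer paths are assumptions, and generate_predictions.py may differ in its details.

# Illustrative sketch only; generate_predictions.py may differ.
# Assumes ./dumps holds the converted PyTorch checkpoint (plus config.json)
# and that TokenizerModel.model from tokenizer.zip is available locally.
from transformers import T5ForConditionalGeneration, T5Tokenizer

beam_size = 1
model = T5ForConditionalGeneration.from_pretrained("./dumps")
tokenizer = T5Tokenizer("TokenizerModel.model")

source = "code2code: <abstracted method to revise>"  # hypothetical input
inputs = tokenizer(source, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=512,
    num_beams=beam_size,
    num_return_sequences=beam_size,  # one line per beam in predictions_k.txt
)
for out in outputs:
    print(tokenizer.decode(out, skip_special_tokens=True))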

In order to evaluate the generated predictions in terms of perfect predictions and codeBLEU or BLEU score, run one of the Python scripts generate_predictions/for_codeBLEU.py or generate_predictions/for_BLEU.py, after setting the right paths to the target file, the predictions file, and the location where to store the results (lines 69-71 or 17-19, respectively).
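
The scripts above are the reference implementation; purely to illustrate the idea behind the BLEU evaluation, a corpus-level BLEU score can be computed with NLTK along these lines (the whitespace tokenization here is a simplification, so scores will not exactly match those of for_BLEU.py):

# Rough illustration of corpus-level BLEU; not equivalent to for_BLEU.py.
from nltk.translate.bleu_score import corpus_bleu

with open("target.txt") as f:
    references = [[line.split()] for line in f]  # one reference per input
with open("predictions_1.txt") as f:
    hypotheses = [line.split() for line in f]

print(f"BLEU: {corpus_bleu(references, hypotheses):.4f}")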

Train your own T5 models

To train the T5 models we used the Google Colab service. To replicate our training you will need a Google Colab Pro account and a Google Cloud Storage (GCS) account. Once you have your GCS account, you need to set up a new bucket; please follow the guide provided by Google.

In your GCS bucket, upload the content of the archive automating_code_review.zip. In it we stored our datasets, our pre-trained model, our SentencePiece tokenizer, and some other utilities needed to replicate our work. Moreover, we kept the same structure as our bucket, to facilitate the use of the Colab notebooks.
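
For example, the bucket can be created and populated with the gsutil CLI (your-bucket is a placeholder; adjust the destination so the bucket mirrors the structure expected by the notebooks):

gsutil mb gs://your-bucket
gsutil -m cp -r automating_code_review/* gs://your-bucket/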

Once everything is set you can:

  • Pre-train a T5 model from scratch on our pre-training dataset, following the PreTraining.ipynb notebook;
  • Fine-tune a T5 model (with or without pre-training) on one of the downstream tasks (code-to-code, code-to-comment, code&comment-to-code), using our datasets and following the FineTuning.ipynb notebook.

We also provide a notebook (Preprocessing.ipynb) with all the preprocessing steps we followed to prepare our pre-training dataset and to train the SentencePiece tokenizer on it.
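
For reference, the tokenizer-training step boils down to a single SentencePiece training call along these lines (a sketch only: the exact input file and vocabulary size used in Preprocessing.ipynb are assumptions here):

# Sketch of the SentencePiece training step; exact parameters may differ
# from those used in Preprocessing.ipynb.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="pre-training.tsv",       # raw pre-training corpus
    model_prefix="TokenizerModel",  # produces TokenizerModel.model / .vocab
    vocab_size=32000,               # assumed value
)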