This repository contains the source code for the course IN4334 - Analytics and Machine Learning for Software Engineering given at the Delft University of Technology.
It uses a deep learning sequence-to-sequence model to generate commit messages from git diff files.
For more information, see the paper.
Edit the configuration at the top of git-helper/github_api.py to change which language to collect or the number of repositories to scrape.
pip install GitPython==3.0.3 PyGithub==1.43.8
python git-helper/github_api.py <output_dir>
After scraping finishes, your output_dir will contain a 'msg' and a 'diff' directory holding all the messages and diffs collected from GitHub. You can pass output_dir to the preprocessing script as an argument.
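As a quick sanity check after scraping, the two directories can be compared. The following is a minimal sketch and not part of this repository. It assumes that each commit is stored as one file in 'msg' and one file in 'diff' sharing the same file name, which may not match the layout the scraper actually produces. The function name pair_scraped_files is hypothetical.

```python
from pathlib import Path


def pair_scraped_files(output_dir):
    """Return (msg_path, diff_path) pairs for names found in both directories.

    Assumption (not confirmed by the repository): a commit message and its
    diff are stored under the same file name in 'msg' and 'diff'.
    """
    msg_dir = Path(output_dir) / "msg"
    diff_dir = Path(output_dir) / "diff"

    # Collect plain file names on each side.
    msg_names = {p.name for p in msg_dir.iterdir() if p.is_file()}
    diff_names = {p.name for p in diff_dir.iterdir() if p.is_file()}

    # Keep only names that have both a message and a diff.
    shared = sorted(msg_names & diff_names)
    return [(msg_dir / name, diff_dir / name) for name in shared]
```

Files that appear in only one of the two directories are ignored, so comparing the number of returned pairs with the number of files in each directory gives a rough indication of whether the scrape completed cleanly.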
First install the required dependencies:
pip install -r requirements.txt
Or with pipenv:
pipenv install
pipenv shell
And download the spacy model:
python -m spacy download en_core_web_sm
Then start a Python shell and run the following commands:
import nltk
nltk.download('punkt')
Now, make the necessary modifications to the configuration in preprocessing/constants.py. Then, from the root of this repo, run python -m preprocessing.main. The script will first index the dataset, which can take a couple of minutes, and then process the dataset, which can take more than an hour.
- Training from config:
python train.py --config config/<config.json>
- Training from checkpoint:
python train.py --resume saved/models/<subdirectories>/checkpoint.pth
- Analyse the logs with Tensorboard:
tensorboard --logdir saved/log/<model_name>
Note that when resuming a model from a checkpoint, the corresponding config.json from saved/models/<model_name>/<subdirectories> will be used.
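To check which configuration a resumed run would pick up, the lookup can be sketched as follows. This is an illustration only: it assumes config.json sits in the same directory as the checkpoint file, and the helper name config_path_for_checkpoint is hypothetical rather than something defined in train.py.

```python
from pathlib import Path


def config_path_for_checkpoint(checkpoint_path):
    """Return the config.json path expected next to a checkpoint file.

    Assumption: the config is saved in the same directory as the checkpoint.
    """
    return Path(checkpoint_path).parent / "config.json"
```

Inspecting that file before resuming is a simple way to confirm that the run will continue with the intended settings.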
For GPU support for PyTorch with CUDA, see the official documentation on the PyTorch site.
- Test from config:
python test.py --config config/<config.json>
- Test from checkpoint:
python test.py --resume saved/models/<model_name>/<subdirectories>
- Analyse the test logs with Tensorboard:
tensorboard --logdir saved/test_log/<model_name>
The test script will compute the following on the test set:
- The loss and perplexity.
- Inference on the diff data. The file with the predicted commit messages is stored with the .pred suffix. For the exact file location, see config['inference'].
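The two metrics in the first item are closely related. Under the standard definition, perplexity is the exponential of the mean per-token cross-entropy loss. The sketch below assumes the loss is measured in nats (natural logarithm), as is the default for PyTorch's cross-entropy loss. It illustrates the relationship and is not the code used by test.py.

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)
```

A loss of 0 gives a perplexity of 1, the best possible value. A model that assigns uniform probability over a vocabulary of size V has a loss of log(V) and a perplexity of V.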
The OpenNMT-py toolkit is included in this repository. It can be executed from the root of this repository with the train and test scripts in scripts: ./scripts/train.sh or ./scripts/test.sh.
The configurations used in this research are the following:
- Java data: config/java.json
- C# data: config/cs.json
- NMT1 from Jiang et al.: config/nmt1.json
- NMT1 but preprocessed with our preprocessing: config/nmt1_preprocessed.json
The collected datasets, the trained model, and all of the testing results are available online at Zenodo.
The following resources were used during the creation of this codebase: