The current code was developed on PyTorch 0.4; it has not been tested on other versions.
A subset of the data (20k docs) is provided here for you to test the code. Unzip it and place it in data/.
To train on the whole KP20k dataset, download the JSON data and run preprocess.py
first. No trained model will be released in the near future.
Update: I will not be updating this repo for a while, but the information below should help you run the code.
Some test datasets in JSON format: download
- preprocess.py: entry for preprocessing datasets in JSON format.
- train.py: entry for training models.
- predict.py: entry for generating phrases with trained models (checkpoints).
You can refer to these scripts as examples.
Note that some papers in popular test datasets (e.g. Inspec, SemEval) also appear in the KP20k training dataset. Be sure to remove these duplicates from the training data before training.
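
One possible way to filter the overlapping papers is sketched below. It is not part of this repo: it assumes each document is a dict with a "title" field and treats two papers as duplicates when their normalized titles match. The field name and both function names (normalize_title, remove_test_duplicates) are hypothetical, so adapt them to the actual JSON layout.

```python
import re


def normalize_title(title):
    """Lowercase the title and collapse non-alphanumeric runs into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def remove_test_duplicates(train_docs, test_docs, key="title"):
    """Return the training docs whose normalized title does not occur in test_docs.

    Assumption: docs are dicts and `key` names the title field (hypothetical).
    """
    test_titles = {normalize_title(doc[key]) for doc in test_docs if doc.get(key)}
    # Ignore titles that normalize to nothing, so empty titles never match.
    test_titles.discard("")
    return [
        doc for doc in train_docs
        if normalize_title(doc.get(key) or "") not in test_titles
    ]
```

Title matching is only a heuristic. It misses papers whose titles differ between datasets, so a stricter check (for example, also comparing abstracts) may be needed.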