Keyphrase Generation (built on OpenNMT-py)

This repository provides the code and datasets used in "An Empirical Study on Neural Keyphrase Generation", "One Size Does Not Fit All: Generating and Evaluating Variable Number of Keyphrases", and "Does Order Matter? An Empirical Study on Generating Multiple Keyphrases as a Sequence".

Resources

  • (2021.1 update) The MagKP-CS data is released as magkp_train.json.zip. Please refer to notebook/split_magkp.ipynb for generating the LN/Nsmall/Nlarge splits.
  • All datasets and selected model checkpoints used in the papers can be downloaded here (data.zip and models.zip). Unzip both files and overwrite the original data/ and model/ folders.

Quickstart

All the config files used for training and evaluation can be found in the folder config/. For more examples, refer to the scripts in the folder script/.

Preprocess the data

source kp_convert.sh  # dump JSON datasets to src/tgt text files (OpenNMT format)
python preprocess.py -config config/preprocess/config-preprocess-keyphrase-kp20k.yml  # build the OpenNMT training data from the dumped files

Train a One2Seq model with Diversity Mechanisms enabled

python train.py -config config/train/config-rnn-keyphrase-one2seq-diverse.yml

Train a One2One model

python train.py -config config/train/config-rnn-keyphrase-one2one-stackexchange.yml

Run generation and evaluation

python kp_gen_eval.py -tasks pred eval report \
    -config config/test/config-test-keyphrase-one2seq.yml \
    -data_dir data/keyphrase/meng17/ \
    -ckpt_dir models/keyphrase/meng17-one2seq-kp20k-topmodels/ \
    -output_dir output/meng17-one2seq-topbeam-selfterminating/meng17-one2many-beam10-maxlen40/ \
    -testsets duc inspec semeval krapivin nus \
    -gpu -1 --verbose --beam_size 10 --batch_size 32 --max_length 40 \
    --onepass --beam_terminate topbeam --eval_topbeam

Evaluation and Datasets

You may refer to notebook/json_process.ipynb for a glance at the pre-processing.
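
For instance, a minimal sketch of peeking at one released file (assuming each line is a JSON object, which is the layout used by the released files; the notebook above is the authoritative reference):

import json

# Read one example from a released dataset file and inspect its fields.
with open("data/json/kp20k/kp20k_test_meng17token.json") as f:
    example = json.loads(f.readline())
print(sorted(example.keys()))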

We follow the data pre-processing and evaluation protocols of Meng et al. (2017). We pre-process both document texts and ground-truth keyphrases, including word segmentation, lowercasing, and replacing all digit tokens with the symbol <digit>.
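
As an illustration, a rough sketch of this pre-processing (not the repo's exact implementation; see the notebooks and scripts for the authoritative version):

import re

def meng17_style_tokenize(text):
    # Lowercase and split into word/number tokens (crude word segmentation).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Replace purely numeric tokens with the special symbol <digit>.
    return ["<digit>" if t.isdigit() else t for t in tokens]

print(meng17_style_tokenize("LSTM networks since 1997"))
# -> ['lstm', 'networks', 'since', '<digit>']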

We manually cleaned the data examples in the valid/test sets of KP20k (cleaning noisy text, replacing erroneous keyphrases with the actual author keyphrases, and removing examples without any ground-truth keyphrase), and we use scripts to remove invalid training examples (those without any author keyphrase).

We evaluate models' performance on predicting present and absent phrases separately. Specifically, we first tokenize (replacing punctuation marks with whitespace), lowercase, and stem (with NLTK's Porter stemmer) the text; we then determine the presence of each ground-truth keyphrase by checking whether its word sequence can be found verbatim in the source text.
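
A minimal sketch of this present/absent check (function names here are illustrative; see kp_gen_eval.py and the repo's evaluation code for the official version):

import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(text):
    # Replace punctuation with whitespace, lowercase, then stem each word.
    words = re.sub(r"[^\w]", " ", text.lower()).split()
    return [stemmer.stem(w) for w in words]

def is_present(keyphrase, source_text):
    # A phrase is "present" if its stemmed word sequence occurs
    # contiguously in the stemmed source text.
    kp, src = stem_tokens(keyphrase), stem_tokens(source_text)
    return bool(kp) and any(src[i:i + len(kp)] == kp
                            for i in range(len(src) - len(kp) + 1))

print(is_present("neural networks", "Training neural network models"))  # True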

To evaluate present-phrase performance, we compute Precision/Recall/F1-score for each document, taking only the present ground-truth keyphrases as targets and ignoring the absent ones. We report the macro-averaged scores over documents that have at least one present ground-truth phrase (changed from all documents after the Empirical Study; this corresponds to the column #PreDoc in the table below). Absent-phrase evaluation is handled analogously.

For a cutoff k, the per-document scores are

Precision@k = #(correct@k) / min(k, #(pred)),  Recall@k = #(correct@k) / #(target),  F1@k = 2 · Precision@k · Recall@k / (Precision@k + Recall@k)

where #(pred) and #(target) are the numbers of predicted and ground-truth keyphrases respectively, and #(correct@k) is the number of correct predictions among the first k results.
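
A minimal sketch of computing these scores and macro-averaging them (assuming predictions and targets are already normalized as described above):

def prf_at_k(predicted, targets, k):
    topk = predicted[:k]
    correct = sum(1 for p in topk if p in targets)        # #(correct@k)
    p = correct / min(k, len(predicted)) if predicted else 0.0
    r = correct / len(targets) if targets else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

# Macro-average over documents with at least one (present) ground-truth phrase.
docs = [(["neural network", "deep learning"], ["neural network"]),
        (["svm"], ["support vector machine", "svm"])]
f1s = [prf_at_k(pred, tgt, k=5)[2] for pred, tgt in docs if tgt]
print(sum(f1s) / len(f1s))  # macro-averaged F1@5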

Note that, since our study mainly focuses on keyword/keyphrase extraction/generation for short text, we use only the abstracts of SemEval and NUS as source text. Statistics such as #PreKP may therefore differ from those computed on full text, which also affects the final F1-scores. For ease of reproduction, we list the detailed statistics in the table below; the processed test sets with the present/absent phrase split can be found in the released data (e.g. data/json/kp20k/kp20k_test_meng17token.json).

Dataset    #Train   #Valid   #Test    #KP       #PreDoc   #PreKP   #AbsDoc   #AbsKP
KP20k      514k     19,992   19,987   105,181   19,059    66,746   16,325    38,435
MagKP      2.7m     --       --       34.8m     --        --       --        --
Inspec     --       1,500    500      4,913     497       3,921    363       992
Krapivin   --       1,844    460      2,641     438       1,492    416       1,149
NUS        --       --       211      2,461     207       1,260    195       1,201
Semeval    --       144      100      1,507     100       673      99        834
StackEx    298k     16,000   16,000   43,131    13,498    24,864   10,967    18,267
DUC        --       --       308      2,484     308       2,421    38        63

Contributors


Citation

Please cite the following papers if you use our code or datasets.

@inproceedings{meng2020empirical,
  title={An Empirical Study on Neural Keyphrase Generation},
  author={Meng, Rui and Yuan, Xingdi and Wang, Tong and Zhao, Sanqiang and Trischler, Adam and He, Daqing},
  booktitle={Proceedings of the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics},
  url={https://arxiv.org/pdf/2009.10229.pdf},
  year={2021}
}
@inproceedings{yuan2018onesizenotfit,
  title={One Size Does Not Fit All: Generating and Evaluating Variable Number of Keyphrases},
  author={Yuan, Xingdi and Wang, Tong and Meng, Rui and Thaker, Khushboo and Brusilovsky, Peter and He, Daqing and Trischler, Adam},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  month={jul},
  year={2020},
  publisher={Association for Computational Linguistics},
  url={https://www.aclweb.org/anthology/2020.acl-main.710},
  doi={10.18653/v1/2020.acl-main.710},
  pages={7961--7975}
}
@article{meng2019ordermatters,
  title={Does Order Matter? An Empirical Study on Generating Multiple Keyphrases as a Sequence},
  author={Meng, Rui and Yuan, Xingdi and Wang, Tong and Brusilovsky, Peter and Trischler, Adam and He, Daqing},
  journal={arXiv preprint arXiv:1909.03590},
  url={https://arxiv.org/pdf/1909.03590.pdf},
  year={2019}
}
@inproceedings{meng2017kpgen,
  title={Deep keyphrase generation},
  author={Meng, Rui and Zhao, Sanqiang and Han, Shuguang and He, Daqing and Brusilovsky, Peter and Chi, Yu},
  booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={582--592},
  url={https://arxiv.org/pdf/1704.06879.pdf},
  year={2017}
}