
Code for the EMNLP 2020 paper: "Diverse, Controllable, and Keyphrase-Aware: A Corpus and Method for News Multi-Headline Generation"

This repo contains the code and data of the following paper:

Diverse, Controllable, and Keyphrase-Aware: A Corpus and Method for News Multi-Headline Generation, Dayiheng Liu, Yeyun Gong, Yu Yan, Jie Fu, Bo Shao, Daxin Jiang, Jiancheng Lv, Nan Duan, EMNLP2020 [paper]


  • Python 3.6
  • numba 0.49.1
  • tensorflow 1.10.0
  • numpy 1.16.2
  • nltk 3.3+
  • cuda 9.0


Download the dataset file here and extract it:

tar -xzvf keyaware_news_emnlp20.tar

The extracted data directory has the following layout:

|-- dev_keyaware_news_KQTAC.txt
|-- test_keyaware_news_KQTAC.txt
|-- test_keyaware_news_KQTAC_5slot.txt
|-- test_keyaware_news_KQTAC_multi.txt
`-- train_keyaware_news_KQTAC.txt

The files train_keyaware_news_KQTAC.txt, dev_keyaware_news_KQTAC.txt, and test_keyaware_news_KQTAC.txt contain 5-tuples of the form <keyphrase, query, title, article, click_times>.
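As a minimal sketch of how one such record might be read, the snippet below splits a line into the five fields. The tab delimiter, the one-record-per-line layout, and the names `KQTACRecord` and `parse_kqtac_line` are assumptions made for illustration and are not taken from this repo; check the data files and the repo's own data loader for the actual format.

```python
from typing import NamedTuple


class KQTACRecord(NamedTuple):
    """Hypothetical container; field order follows the 5-tuple described above."""
    keyphrase: str
    query: str
    title: str
    article: str
    click_times: str  # kept as text; convert only once the real format is confirmed


def parse_kqtac_line(line: str, sep: str = "\t") -> KQTACRecord:
    # ASSUMPTION: one record per line, fields separated by `sep` (tab here).
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return KQTACRecord(*fields)


# Made-up example line, not taken from the dataset.
record = parse_kqtac_line("some keyphrase\tsome query\tSome Title\tSome article text\t12\n")
print(record.keyphrase, "|", record.title)
```

Raising on an unexpected field count makes a wrong delimiter assumption fail loudly instead of silently shifting fields.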

The file test_keyaware_news_KQTAC_multi.txt provides five tuples for each test example, each with a different predicted keyphrase. For each article, the five keyphrases were obtained with the SEQ2SEQ model described in our paper.

Similarly, the file test_keyaware_news_KQTAC_5slot.txt provides five tuples for each test example, each with a different predicted keyphrase. For each article, the five keyphrases were obtained with the SLOT model described in our paper.


Headline Generation

Our headline generation baselines are based on the BERT-base-uncased model, which can be downloaded here.

Run run_base.sh to train and test the BASE model.

Run run.sh to train and test our model.

The detailed hyper-parameters can be found in run.sh and config.py.

Model checkpoints and the log file are saved to the paths set by OUTPUT_DATA_DIR and LOG_FILE in run.sh, respectively.

We also provide several variants of the keyphrase-aware headline generation model, as well as keyphrase-agnostic baselines, in model_pools/. To use a different model, change the default in the line MODEL=${2:-encoder_filter_query_plus_decoder_mem} in run.sh to another model name (the available names are listed in model_pools/__init__.py). Because ${2:-...} is a shell default for the second positional argument, the model can presumably also be selected by passing its name as the second argument to run.sh.

Keyphrase Generation

To train the SEQ2SEQ model for keyphrase generation, replace the content of the title field with the keyphrase (key) for each sample in train_keyaware_news_KQTAC.txt, dev_keyaware_news_KQTAC.txt, and test_keyaware_news_KQTAC.txt. After that, run run_base.sh to use the BASE model for keyphrase generation. To generate diverse keyphrases, set --use_diverse_beam_search and tune --decode_gamma to control the diversity penalty.
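The title-to-keyphrase replacement described above could be scripted roughly as follows. This is a sketch under assumptions: it assumes one tab-separated record per line in the order <keyphrase, query, title, article, click_times>, and the function name `title_to_keyphrase` and the sample file names are hypothetical, not part of this repo.

```python
import os
import tempfile


def title_to_keyphrase(src_path, dst_path, sep="\t"):
    """Copy src to dst, overwriting the title field with the keyphrase field."""
    converted = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # ASSUMPTION: fields are <keyphrase, query, title, article, click_times>.
            fields = line.rstrip("\n").split(sep)
            if len(fields) != 5:
                raise ValueError(f"expected 5 fields, got {len(fields)}")
            fields[2] = fields[0]  # title <- keyphrase
            dst.write(sep.join(fields) + "\n")
            converted += 1
    return converted


# Demonstration on a made-up record written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample_KQTAC.txt")
    dst = os.path.join(tmp, "sample_KQTAC_key.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("some keyphrase\tsome query\tSome Title\tSome article text\t3\n")
    n = title_to_keyphrase(src, dst)
    with open(dst, encoding="utf-8") as f:
        converted_line = f.read()

print(n, repr(converted_line))
```

Writing to a separate output file keeps the original data intact for headline generation; the converted train, dev, and test files would then need to be the ones that run_base.sh reads.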

To train the SLOT model for keyphrase generation, we adopt the answer span prediction implementation provided by Huggingface; please refer to the code here.


