
Better Highlighting: Creating Sub-Sentence Summary Highlights

We provide the source code for the paper "Better Highlighting: Creating Sub-Sentence Summary Highlights", accepted at EMNLP'20. If you find the code useful, please cite the following paper.

@inproceedings{cho-song-li-yu-foroosh-liu:2020,
  author    = {Sangwoo Cho and Kaiqiang Song and Chen Li and Dong Yu and Hassan Foroosh and Fei Liu},
  title     = {Better Highlighting: Creating Sub-Sentence Summary Highlights},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2020}}

Goal

  • Our system summarizes multi-document article sets using sub-sentence segments.
  • The code covers sub-sentence segment generation, segment importance and similarity score computation, and summary selection with a determinantal point process (DPP).

Dependencies

The code is written in Python (v3.6) and PyTorch (v1.4).

Sub-sentence segment generation

  • seg_gen.py: generate all segments from sentences with XLNet (a toy sketch of segment enumeration follows this list)
    $ python seg_gen.py --dataset 0 --split train --data_start 0 --data_end 1

  • seg_filter_subsent.py: filter segments to produce the candidate segments for a summary
    $ python seg_filter_subsent.py --dataset 0

  • draw_fullsent_pos.py: plot the positions of the original sentences within their documents, in percent
    $ python draw_fullsent_pos.py
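
For intuition, here is a minimal, hypothetical sketch of what candidate sub-sentence segments look like: it enumerates contiguous token spans under simple length bounds. The function enumerate_segments and its length constraints are illustrative assumptions only; the actual seg_gen.py derives and scores segments with XLNet.

    # Hypothetical sketch: enumerate contiguous sub-sentence spans as candidates.
    # The real seg_gen.py uses XLNet; the windowing below is illustrative only.
    def enumerate_segments(tokens, min_len=3, max_len=20):
        """Yield (start, end, text) for every contiguous span within the bounds."""
        n = len(tokens)
        for start in range(n):
            for end in range(start + min_len, min(n, start + max_len) + 1):
                yield start, end, " ".join(tokens[start:end])

    for start, end, text in enumerate_segments("the cat sat on the mat".split()):
        print(start, end, text)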

BERT-sim, BERT-imp fine-tuning on CNN/DM

Data generation

  • We use the CNN/DM summary dataset, downloaded from HERE (pre-processed CNN/DM summary data file direct link).
    • The data contains a list of candidate summary sentences (the sentences most similar to the reference summary) in each article.
  • gen_cnndm_pairs.py: generate train/test data for BERT-sim (pair) and BERT-imp (pair_leadn)
    • The generated data is balanced between positive and negative examples (see the sketch below).
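
A hedged sketch of balanced pair construction follows. The pairing and labeling scheme here (a reference summary sentence paired with a candidate as a positive, and with a random non-candidate as a negative) is an assumption for illustration; gen_cnndm_pairs.py implements the actual construction.

    # Hypothetical sketch of balanced pair construction for BERT-sim-style data.
    # The labels and sampling strategy are assumptions; see gen_cnndm_pairs.py.
    import random

    def make_balanced_pairs(summary_sents, candidates, others, seed=0):
        """Emit equal numbers of positive (label 1) and negative (label 0) pairs."""
        rng = random.Random(seed)
        pairs = []
        for ref, cand in zip(summary_sents, candidates):
            pairs.append((ref, cand, 1))                # candidate: positive
            pairs.append((ref, rng.choice(others), 0))  # non-candidate: negative
        rng.shuffle(pairs)
        return pairs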

BERT-sim, BERT-imp train / test

  • run_finetune.py: main
  • train_finetune.py: trainer
  • dataset_cnndm.py: data feeder
  • BERT-sim: $ python run_finetune.py --data_type pair --max_seq_len 128
  • BERT-imp: $ python run_finetune.py --data_type pair_leadn --max_seq_len 512
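
For reference, here is a minimal sketch of BERT sentence-pair fine-tuning, assuming a recent Hugging Face transformers release and bert-base-uncased (both assumptions); run_finetune.py and train_finetune.py implement the repo's actual training loop.

    # Minimal sketch, not the repo's trainer: one training step of BERT
    # sentence-pair classification with Hugging Face transformers.
    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # Toy example: one pair with label 1 ("similar"); max_seq_len 128 as for BERT-sim.
    enc = tokenizer("first sentence", "second sentence",
                    truncation=True, max_length=128, return_tensors="pt")
    out = model(**enc, labels=torch.tensor([1]))
    out.loss.backward()
    optimizer.step()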

BERT-sim, BERT-imp prediction on target dataset

  • run_bert_scores.py: BERT similarity and importance score prediction (DUC, TAC)
    $ python run_bert_scores.py --dataset 0 --data_type xlnet --split train --data_start 0 --data_end 1 --gpu_id 0 --batch_size_imp 5 --batch_size_sim 25
  • merge_bert_ext.py: merge the predicted files into a single file and convert *.pkl to *.mat (the BERT feature file for CNN is *.h5 and is not converted to *.mat)
    $ python merge_bert_ext.py --dataset 0 --data_type xlnet --split train
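
The *.pkl to *.mat conversion can be done with scipy; below is a minimal sketch (the file names are hypothetical, and merge_bert_ext.py also merges the per-shard prediction files first).

    # Sketch of the .pkl -> .mat step; "bert_scores.pkl" is a hypothetical name.
    import pickle
    from scipy.io import savemat

    with open("bert_scores.pkl", "rb") as f:
        scores = pickle.load(f)          # e.g. a dict of numpy arrays

    savemat("bert_scores.mat", scores)   # readable from the MATLAB DPP code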

Text generation for DPP training

  • gen_text_DPP.py: generate text files (.txt, .words, .pos, .Y, .YY, .seg, idf, dict) for DPP training/testing from candidate segments or sentences (a sketch of the idf computation follows below)
    $ python gen_text_DPP.py --dataset 0 --data_type xlnet
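
As a reference point for the idf file, the standard inverse document frequency is idf(w) = log(N / df(w)). Whether gen_text_DPP.py applies smoothing or a different base is not specified here, so the sketch below shows only the textbook formula.

    # Textbook IDF: idf(w) = log(N / df(w)). Any smoothing used by
    # gen_text_DPP.py is unknown; this is the standard definition only.
    import math
    from collections import Counter

    def compute_idf(documents):
        """documents: list of token lists. Returns {word: idf score}."""
        n_docs = len(documents)
        df = Counter(w for doc in documents for w in set(doc))
        return {w: math.log(n_docs / c) for w, c in df.items()}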

Utility files

  • read_text_from_data.py: text loader for DUC, TAC
  • utils.py: utility functions

DPP training, testing

  • located under the src/DPP directory
  • $ bash run.bash run_DPP.m
  • run_DPP.m: main file to set parameters and run DPP training/testing
  • main_DPP.m: read the train/test text files and call DPP.m
  • DPP.m: set finer-grained parameters, assign features, and run DPP training/testing (a Python sketch of DPP selection follows this list)
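
To make the selection step concrete, here is a hedged Python sketch of greedy MAP inference for a DPP under the standard quality/diversity decomposition L_ij = q_i * S_ij * q_j (Kulesza and Taskar), where q holds importance scores and S pairwise similarities. This is a simplification for intuition, not the MATLAB implementation in src/DPP, which also trains the DPP parameters.

    # Greedy MAP selection for a DPP with kernel L = diag(q) * S * diag(q).
    # A sketch for intuition only; src/DPP contains the actual MATLAB code.
    import numpy as np

    def greedy_dpp(quality, similarity, k):
        """quality: (n,) importance scores; similarity: (n, n) PSD matrix.
        Greedily add the item that most increases log det of the submatrix."""
        L = np.outer(quality, quality) * similarity
        selected = []
        for _ in range(k):
            best_i, best_gain = None, -np.inf
            for i in range(len(quality)):
                if i in selected:
                    continue
                idx = selected + [i]
                sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
                if sign > 0 and logdet > best_gain:
                    best_i, best_gain = i, logdet
            if best_i is None:
                break
            selected.append(best_i)
        return selected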

System summary

We provide our best system summaries for DUC-04 and TAC-11 (/summary_results). We do not distribute the DUC and TAC datasets due to licensing; please obtain the DUC 03/04 and TAC 08/09/10/11 datasets through the official request-and-approval process.

License

This project is licensed under the BSD License - see the LICENSE.md file for details.