Language: Python. License: Apache License 2.0.

How "Multi" is Multi-Document Summarization?

This repo contains the code for the paper "How "Multi" is Multi-Document Summarization?" (EMNLP 2022).

Setup

conda create --name multi_mds python=3.8 
conda activate multi_mds 
pip install -r requirements.txt 

If the setup fails on jsonnet, see this issue.

Preprocessing data

Preprocess your dataset into JSONL format, where each line is a JSON object with the following fields:

  • documents: a list of source documents
  • summary: a reference (or system) summary
  • topic_id: instance id

You can find an example input file in the repo: wcep10_test.jsonl.
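
As an illustration of this format, the sketch below builds one record and checks it for the three fields listed above. The field names come from this README; the helper name validate_record, the sample texts, and the topic id value are hypothetical, and the check that documents is a list of strings is an assumption about what the scripts expect.

```python
import json

# Field names taken from the list above.
REQUIRED_FIELDS = ("documents", "summary", "topic_id")


def validate_record(record):
    """Return a list of problems found in one JSONL record (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in record]
    docs = record.get("documents")
    # Assumption: each source document is stored as a plain string.
    if docs is not None and not (
        isinstance(docs, list) and all(isinstance(d, str) for d in docs)
    ):
        problems.append("documents must be a list of strings")
    return problems


# Hypothetical example record; real data would hold full article texts.
record = {
    "documents": ["First source article.", "Second source article."],
    "summary": "A short reference summary.",
    "topic_id": "topic-0",
}

# One JSON object per line is what makes the file JSONL.
line = json.dumps(record)
print(validate_record(json.loads(line)))
```

Writing one such line per topic to a file gives an input in the same shape as wcep10_test.jsonl.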

Compute the AAC score and curves

There are several steps for computing the AAC score:

  1. Extract the Open IE propositions from all source documents and the summary.
  2. Prepare pairs of Open IE propositions.
  3. Compute alignment scores between source and summary propositions for each topic.
  4. Greedily build the maximally covering subsets of source documents.
  5. Compute the Area Above the Curve (AAC) and save the coverage plot.
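
To make steps 4 and 5 concrete, here is a minimal sketch of greedy subset building and an area-above-the-curve computation. It is not the repo's implementation. It assumes the alignment scores have already been reduced to, for each document, the set of summary proposition indices that document covers. The function names, the normalization of both axes to [0, 1], and the use of the trapezoidal rule are all assumptions made for illustration.

```python
def greedy_coverage_curve(doc_to_props, num_summary_props):
    """Greedily add the document that covers the most new summary propositions.

    doc_to_props maps a document id to the set of summary proposition
    indices it covers (assumed input format). Returns the fraction of
    summary propositions covered after 0, 1, ..., N documents.
    """
    remaining = dict(doc_to_props)
    covered = set()
    curve = [0.0]
    while remaining:
        # Pick the document with the largest marginal gain in coverage.
        best = max(remaining, key=lambda d: len(remaining[d] - covered))
        covered |= remaining.pop(best)
        curve.append(len(covered) / num_summary_props)
    return curve


def area_above_curve(curve):
    """Area above a coverage curve, with the x-axis normalized to [0, 1]."""
    n = len(curve) - 1
    if n < 1:
        raise ValueError("need at least one document")
    # Trapezoidal rule over n equal-width intervals.
    area_under = sum((curve[i] + curve[i + 1]) / 2 for i in range(n)) / n
    return 1.0 - area_under


# Hypothetical topic: three documents, four summary propositions.
docs = {"d1": {0, 1}, "d2": {1, 2}, "d3": {2}}
curve = greedy_coverage_curve(docs, num_summary_props=4)
print(curve)
print(area_above_curve(curve))
```

Under these assumptions, a curve that rises to full coverage after a single document leaves little area above it, while a curve that keeps rising as documents are added leaves more.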

You can run all steps with a single command, which skips any steps that have already completed (edit the paths of raw_data_dir and process_dir):

bash run.sh [preprocessed_data] [dir_path] 

Alternatively, you can run each step separately, as follows:

  1. Extract all Open IE tuples from the summary and the source documents.
export raw_data= # path to jsonl file 
export data_dir= # output dir

python extract_open_ie.py --raw_data $raw_data \
                          --data_dir $data_dir \
                          --gpu 0 

This script will create a directory $data_dir/oie with the propositions from the summary and the documents.

  2. Prepare pairs:
python prepare_oie_pairs.py --data_dir $data_dir

This script will create a file $data_dir/pairs.pickle with all possible pairs of Open IE propositions.

  3. Compute alignment scores between source and summary propositions for each topic:
python get_superpal_scores.py --data_dir $data_dir \
                              --model biu-nlp/superpal \
                              --device_ids 0,1,2,3 \
                              --batch_size 64

This script will run the alignment model on $data_dir/pairs.pickle and save the results in the directory $data_dir/result_npy.

  4. Build greedy subsets of documents that maximize coverage:
python build_greedy_subsets.py --data_dir $data_dir

  5. Compute AAC score and save plot in $data_dir/plot.png:
python get_aac_scores.py --data_dir $data_dir
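
The five commands above could also be chained from Python. The sketch below only assembles the command lines shown in this README and runs them in order. The function names build_pipeline_commands and run_pipeline are hypothetical, the default values mirror the examples above, and it does not reproduce the step-skipping logic of run.sh.

```python
import subprocess


def build_pipeline_commands(raw_data, data_dir, gpu="0",
                            device_ids="0,1,2,3", batch_size=64):
    """Return the five step commands from this README as argument lists."""
    return [
        ["python", "extract_open_ie.py", "--raw_data", raw_data,
         "--data_dir", data_dir, "--gpu", str(gpu)],
        ["python", "prepare_oie_pairs.py", "--data_dir", data_dir],
        ["python", "get_superpal_scores.py", "--data_dir", data_dir,
         "--model", "biu-nlp/superpal", "--device_ids", device_ids,
         "--batch_size", str(batch_size)],
        ["python", "build_greedy_subsets.py", "--data_dir", data_dir],
        ["python", "get_aac_scores.py", "--data_dir", data_dir],
    ]


def run_pipeline(raw_data, data_dir):
    """Run every step in order, stopping at the first failure."""
    for cmd in build_pipeline_commands(raw_data, data_dir):
        # check=True raises CalledProcessError if a step exits non-zero.
        subprocess.run(cmd, check=True)


# Print the commands without executing them; the paths are placeholders.
for cmd in build_pipeline_commands("wcep10_test.jsonl", "output"):
    print(" ".join(cmd))
```

Calling run_pipeline from the repo root, with the conda environment active, would execute the steps for real.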

Citation

@inproceedings{Wolhandler2022HowI,
  title={How "Multi" is Multi-Document Summarization?},
  author={Ruben Wolhandler and Arie Cattan and Ori Ernst and Ido Dagan},
  booktitle={EMNLP},
  year={2022}
}