This is the implementation of the paper Efficiently Summarizing Text and Graph Encodings of Multi-Document Clusters.
- This code is tested using Python 3.6, PyTorch 1.4, and CUDA 10.1.
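If you want to verify that your environment matches these versions, a quick check along the following lines (assuming PyTorch is already installed) prints the Python, PyTorch, and CUDA versions:

```bash
# Optional sanity check (not part of the original setup): print the
# Python, PyTorch, and CUDA versions in the current environment.
python --version
python -c "import torch; print('torch', torch.__version__, '| cuda', torch.version.cuda)"
```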
- Install Apex:
cd ~; git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--deprecated_fused_adam" ./
- Install nccl:
cd ~; git clone https://github.com/NVIDIA/nccl.git
cd nccl
make -j10 src.build CUDA_HOME=/usr/local/cuda-10.1
sudo apt install build-essential devscripts debhelper fakeroot
make -j10 pkg.debian.build CUDA_HOME=/usr/local/cuda-10.1
sudo apt install ./build/pkg/deb/*.deb
- Setup fairseq and import some files:
cd ~
pip uninstall -y enum34  # Prevent AttributeError: module 'enum' has no attribute 'IntFlag'
git clone --branch v0.9.0 https://github.com/pytorch/fairseq
mkdir ~/fairseq/models
cd ~/fairseq/models
wget 'https://dl.fbaipublicfiles.com/fairseq/models/bart.large.tar.gz'
tar -xzf bart.large.tar.gz
cd ~/fairseq
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'
- Download the encoder-updated.json file from https://github.com/amazon-research/BartGraphSumm/blob/main/data/encoder-updated.json and put it under ~/fairseq
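If you prefer to fetch the file from the command line, a sketch like the following should work; the raw.githubusercontent.com URL is an assumption derived from the blob link above, so verify it before relying on it:

```bash
# Assumed raw URL corresponding to the blob link above.
cd ~/fairseq
wget -N 'https://raw.githubusercontent.com/amazon-research/BartGraphSumm/main/data/encoder-updated.json'
```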
- Install NLTK and Spacy:
pip install nltk spacy more_itertools
python -m spacy download en_core_web_sm
python -m nltk.downloader stopwords
python -m nltk.downloader punkt
- For ROUGE:
sudo apt-get install -y cpanminus
cpanm --force XML::Parser
cd ~
pip install -U git+https://github.com/pltrdy/pyrouge
git clone https://github.com/pltrdy/files2rouge.git
cd files2rouge
python setup_rouge.py
python setup.py install
pyrouge_set_rouge_path ~/.files2rouge
- Get PreSumm (for a fast parallel ROUGE implementation -- note that it does not split summaries into sentences and therefore gives worse ROUGE-L scores than files2rouge):
cd ~
git clone https://github.com/nlpyang/PreSumm.git
- Other:
mkdir -p ~/results
pip install rouge
- Set up the data directory:
cd ~; mkdir data; cd data
- multi-news-500 (Preprocessed and truncated data):
  - Get it from the original source and rename the folder as multi-news-500; also rename xxx.txt.src.yyy as xxx.source and xxx.txt.tgt.yyy as xxx.target (a renaming sketch covering all three folders follows this list).
- multi-news-full-clean (Preprocessed but not truncated):
  - Get it from the original source and rename the folder as multi-news-full-clean; also rename the files inside this folder as follows: xxx.txt.src as xxx.source and xxx.txt.tgt as xxx.target.
- multi-news-full-raw (not processed and not truncated) -- this is only needed for graph construction in the BART-Long-Graph models:
  - Get it from the original source and rename the folder as multi-news-full-raw; also rename the files inside this folder as follows: xxx.src as xxx.source and xxx.tgt as xxx.target.
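As noted above, here is a minimal renaming sketch covering all three folders. It assumes the usual train/val/test split names and that the downloaded file names match these patterns; adjust the paths and globs to whatever the original sources actually ship:

```bash
# Hypothetical renaming helper -- adjust patterns to the files you actually
# downloaded from the original Multi-News sources.
for split in train val test; do
    # multi-news-500: xxx.txt.src.yyy -> xxx.source, xxx.txt.tgt.yyy -> xxx.target
    mv ~/data/multi-news-500/${split}.txt.src.* ~/data/multi-news-500/${split}.source
    mv ~/data/multi-news-500/${split}.txt.tgt.* ~/data/multi-news-500/${split}.target
    # multi-news-full-clean: xxx.txt.src -> xxx.source, xxx.txt.tgt -> xxx.target
    mv ~/data/multi-news-full-clean/${split}.txt.src ~/data/multi-news-full-clean/${split}.source
    mv ~/data/multi-news-full-clean/${split}.txt.tgt ~/data/multi-news-full-clean/${split}.target
    # multi-news-full-raw: xxx.src -> xxx.source, xxx.tgt -> xxx.target
    mv ~/data/multi-news-full-raw/${split}.src ~/data/multi-news-full-raw/${split}.source
    mv ~/data/multi-news-full-raw/${split}.tgt ~/data/multi-news-full-raw/${split}.target
done
```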
- Clone this repository and install its fairseq package:
cd ~; git clone git@github.com:amazon-research/BartGraphSumm.git
cd ~/BartGraphSumm/src/fairseq
pip install --editable .
cd ../
Try the following command to train and evaluate the BART baseline model on the Multi-News-500 dataset:
cd ~/BartGraphSumm/src
make -f bart-large.mk TASK=~/data/multi-news-500 OUTPUT_DIR=~/results/bart-large-multinews-model1 rouge
The ROUGE F1 scores (R1/R2/RL) can be printed with:
cat ~/results/bart-large-multinews-model1/test.rouge-stdout | grep ">"
>> ROUGE-F(1/2/3/l): 49.22/18.88/23.88
The scores correspond to the numbers from BART (input length=500) in Table 4 in the paper.
- Create a Longformer model:
cd ~; cp -r ~/fairseq/models/bart.large ~/fairseq/models/bart.large.long
cd ~/BartGraphSumm/src; python convert_model_to_long.py --input_model_path ~/fairseq/models/bart.large/model.pt --output_model_path ~/fairseq/models/bart.large.long/model.pt
- Create data of length 500 tokens:
cd ~/BartGraphSumm/src
python prepare_data.py --data_path ~/data/multi-news-full-clean --output_path ~/data/multi-news-500-sentmarkers --max_length 500 --sentence_level_markers
Then train and evaluate the Bart-Long model:
make -f bart-large-long.mk TASK=~/data/multi-news-500-sentmarkers MAX_TOKENS=1024 OUTPUT_DIR=~/results/bart-large-multinews-model2 rouge
The ROUGE F1 scores (R1/R2/RL) can be printed with:
cat ~/results/bart-large-multinews-model2/test.rouge-stdout | grep ">"
>> ROUGE-F(1/2/3/l): 48.54/18.56/23.78
The scores correspond to the numbers from Bart-Long in Table 1 in the paper.
- Create a BART-Long model with an additional encoder for encoding the graph information:
cp -r ~/fairseq/models/bart.large ~/fairseq/models/bart.large.long.graph.linear
cd ~/BartGraphSumm/src; python convert_model_to_long.py --input_model_path ~/fairseq/models/bart.large/model.pt --output_model_path ~/fairseq/models/bart.large.long.graph.linear/model.pt --linear_graph
- Create the graph data and its linearized form:
  - Create a new virtual environment with PyTorch 1.5: the graph construction code relies on recent allennlp modules, which require PyTorch 1.5, so set up a separate environment as follows:
sudo apt-get install virtualenv
cd ~/; virtualenv -p python3.6 graph_env; source graph_env/bin/activate
pip install numpy allennlp==1.0.0 allennlp_models==1.0.0 networkx==2.4 matplotlib==3.3.0
cd ~/BartGraphSumm/src
python graph_construction.py --data_path ~/data/multi-news-full-raw --output_path ~/data/multi-news-full-graph --split train
python graph_construction.py --data_path ~/data/multi-news-full-raw --output_path ~/data/multi-news-full-graph --split val
python graph_construction.py --data_path ~/data/multi-news-full-raw --output_path ~/data/multi-news-full-graph --split test
  - Load back the PyTorch 1.4 environment.
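How you switch back depends on how your PyTorch 1.4 setup was created; assuming it is simply the environment you were in before activating graph_env, deactivating is enough:

```bash
# Leave the PyTorch 1.5 graph_env; if your PyTorch 1.4 setup is itself a
# virtualenv or conda env, re-activate it here as well.
deactivate
```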
- Now create the data by concatenating the plain text input with the graph information for each sample in the Multi-News dataset:
python prepare_data.py --data_path ~/data/multi-news-full-clean --output_path ~/data/multi-news-500-500 --max_length 500 --mode standard_with_graph_knowledge --graph_data_path ~/data/multi-news-full-graph --sentence_level_markers
Then train and evaluate the Bart-Long-Graph model:
make -f bart-large-graph-linear.mk TASK=~/data/multi-news-500-500 MAX_TOKENS=1500 MAX_EPOCH=8 OUTPUT_DIR=~/results/bart-large-multinews-model3 rouge
The ROUGE F1 scores (R1/R2/RL) can be printed with:
cat ~/results/bart-large-multinews-model3/test.rouge-stdout | grep ">"
>> ROUGE-F(1/2/3/l): 49.03/19.04/24.04
The scores correspond to the numbers from Bart-Long-Graph (500 tokens graph text) in Table 1 in the paper.
To create the data with 1000 tokens for graph information:
python prepare_data.py --data_path ~/data/multi-news-full-clean --output_path ~/data/multi-news-500-1000 --max_length 1000 --mode standard_with_graph_knowledge --graph_data_path ~/data/multi-news-full-graph --sentence_level_markers
and train the model:
make -f bart-large-graph-linear.mk TASK=~/data/multi-news-500-1000 MAX_TOKENS=2500 MAX_EPOCH=8 LR=3e-05 OUTPUT_DIR=~/results/bart-large-multinews-model4 rouge
The ROUGE F1 scores (R1/R2/RL) can be printed with:
cat ~/results/bart-large-multinews-model4/test.rouge-stdout | grep ">"
>> ROUGE-F(1/2/3/l): 49.24/18.99/23.97
The scores correspond to the numbers from Bart-Long-Graph (1000 tokens graph text) in Table 1 in the paper.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.
If you find this code helpful, please consider citing the following paper:
@inproceedings{pasunuru2021efficient,
title={Efficiently Summarizing Text and Graph Encodings of Multi-Document Clusters},
author={Pasunuru, Ramakanth and Liu, Mengwen and Bansal, Mohit and Ravi, Sujith and Dreyer, Markus},
booktitle={NAACL},
year={2021}
}