Learning by Semantic Similarity Makes Abstractive Summarization Better
Under review.
Please note that most of the Git logs were intentionally omitted to keep the review process double-blind.
We will redirect or update this repository after we receive the final decision.
This repository provides the pre-processed dataset, source code, and pre-trained weights used in our experiments.
Folder description
/--|fairseq-semsim/
|datasets/
|results/
|README.md
|model.jpg
/fairseq-semsim
: The code for our model. Modified from the fairseq (v0.8.0, commit 534905) and Rewarder repositories.
/datasets
: Our version of the pre-processed CNN/DM dataset and the pre-processing code. Modified from the PGN code by See et al., following the instructions of BART (issue #1391).
/results
: Summarization results for the CNN/DM dataset and the reduced dataset (n=1000). The folder contains the generated summaries of BART and SemSim, as well as the reference summaries (not tokenized).
Requirements and Installation
For preparing (pre-processing) the CNN/DM dataset
Please check the README inside the datasets folder.
For fine-tuning and inference
- PyTorch version >= 1.2.0 (CUDA available version)
- Python version >= 3.6
- fairseq == 0.8.0
- pytorch_transformers == 1.2.0
You also need to install fairseq from source:
cd fairseq-semsim
pip install --editable .
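To quickly confirm that the environment matches the versions above, a check along these lines can be used (a minimal sketch; it simply prints the installed package versions):

import torch
import fairseq
import pytorch_transformers

# Print installed versions; they should satisfy the requirements listed above.
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("fairseq:", fairseq.__version__)                             # expected 0.8.0
print("pytorch_transformers:", pytorch_transformers.__version__)   # expected 1.2.0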
We also provide our pre-trained weights for quick evaluation of our model: Download.
Name | SHA1SUM |
---|---|
semsim.pt | d7ba2c2e06201e373a5e53cffe40d153ee867cc4 |
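To verify the integrity of the downloaded checkpoint against the SHA1 checksum above, a small check like the following can be used (the path checkpoints/semsim.pt is an example; adjust it to wherever you saved the file):

import hashlib

# Compute the SHA1 checksum of the downloaded checkpoint and compare it
# with the value listed in the table above.
def sha1sum(path, chunk_size=1 << 20):
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

print(sha1sum('checkpoints/semsim.pt'))
# Expected: d7ba2c2e06201e373a5e53cffe40d153ee867cc4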
Fine-tuning the model
If you wish to fine-tune the SemSim model yourself from a BART checkpoint, please download the bart.large.cnn checkpoint.
Our example code and instructions are copied and modified from fairseq.
1) Get the data files from here and move them to the /fairseq-semsim/cnn_dm folder.
2) BPE preprocess:
Please make sure that you are executing the commands from the /fairseq-semsim folder.
cd fairseq-semsim
for SPLIT in train val
do
for LANG in source target
do
python -m examples.roberta.multiprocessing_bpe_encoder \
--encoder-json encoder.json \
--vocab-bpe vocab.bpe \
--inputs "cnn_dm/$SPLIT.$LANG" \
--outputs "cnn_dm/$SPLIT.bpe.$LANG" \
--workers 60 \
--keep-empty;
done
done
3) Binarize dataset:
fairseq-preprocess \
--source-lang "source" \
--target-lang "target" \
--trainpref "cnn_dm/train.bpe" \
--validpref "cnn_dm/val.bpe" \
--destdir "cnn_dm-bin/" \
--workers 60 \
--srcdict dict.txt \
--tgtdict dict.txt;
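Steps 2) and 3) above assume that the GPT-2 BPE files encoder.json and vocab.bpe, as well as dict.txt, are present in the /fairseq-semsim folder. If they are missing, they can be fetched from the public URLs used in the official fairseq BART instructions, for example with a small script like this (adjust the URLs if the files have moved):

import urllib.request

# GPT-2 BPE vocabulary files and the fairseq dictionary used by the
# BPE-encoding and binarization steps above.
FILES = [
    "https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json",
    "https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe",
    "https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt",
]

for url in FILES:
    filename = url.rsplit("/", 1)[-1]
    print("Downloading", filename)
    urllib.request.urlretrieve(url, filename)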
4) Fine-tuning bart.large.cnn with the SemSim approach on the CNN/DM summarization task:
Use the following command to fine-tune bart.large.cnn with the SemSim strategy.
BART_PATH=/pretrained/BART/bart.large.cnn/model.pt
TOTAL_NUM_UPDATES=50000
WARMUP_UPDATES=500
LR=3e-05
MAX_TOKENS=1792
UPDATE_FREQ=32
python train.py cnn_dm-bin \
--restore-file $BART_PATH \
--max-tokens $MAX_TOKENS \
--task translation \
--source-lang source --target-lang target \
--layernorm-embedding \
--share-all-embeddings \
--share-decoder-input-output-embed \
--reset-optimizer --reset-dataloader --reset-meters \
--required-batch-size-multiple 1 \
--arch bart_large \
--criterion semantic_similarity_loss \
--label-smoothing 0.1 \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
--clip-norm 0.1 \
--lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
--update-freq $UPDATE_FREQ \
--skip-invalid-size-inputs-valid-test \
--save-dir checkpoints/semsim \
--find-unused-parameters;
We followed most of the default settings of BART. However, we removed a few options such as --truncate-source and --fp16.
MAX_TOKENS was changed to 1792 to fit our GPU memory.
We used one NVIDIA TITAN RTX GPU with 24GB of memory, and a single epoch took 7~9 hours. We achieved the best performance at epoch 6.
We believe 24GB of GPU memory is the minimum requirement for fine-tuning. Our test on a 12GB GPU failed. We managed to train the model on 16GB of memory with MAX_TOKENS=1024, but we have not evaluated the result. Our code does not support multi-GPU settings yet.
For details, check the instructions in /fairseq-semsim and the Fine-tuning BART file.
Evaluating the model (Inference)
Please make sure that you are executing the following Python script from the /fairseq-semsim folder.
cd fairseq-semsim
Run the following Python script to generate summaries.
(Please also check the instructions in the BART repository for details.)
import torch
from fairseq.models.bart import BARTModel
bart = BARTModel.from_pretrained(
    'checkpoints/',
    checkpoint_file='semsim.pt',
    data_name_or_path='cnn_dm-bin'
)

bart.cuda()
bart.eval()
bart.half()

count = 1
bsz = 32  # for 12GB GPU memory

with open('cnn_dm/test.source') as source, open('cnn_dm/test.hypo', 'w') as fout:
    sline = source.readline().strip()
    slines = [sline]
    for sline in source:
        if count % bsz == 0:
            # Generate summaries for the current batch of source documents.
            with torch.no_grad():
                hypotheses_batch = bart.sample(slines, beam=4, lenpen=2.0, max_len_b=140, min_len=55, no_repeat_ngram_size=3)

            for hypothesis in hypotheses_batch:
                fout.write(hypothesis + '\n')
                fout.flush()
            slines = []

        slines.append(sline.strip())
        count += 1

    # Generate summaries for the remaining documents in the last (partial) batch.
    if slines != []:
        hypotheses_batch = bart.sample(slines, beam=4, lenpen=2.0, max_len_b=140, min_len=55, no_repeat_ngram_size=3)
        for hypothesis in hypotheses_batch:
            fout.write(hypothesis + '\n')
            fout.flush()
Please adjust bsz for faster inference (bsz=32 works well with 12GB of GPU memory).
Install files2rouge
from here.
export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar
# Tokenize hypothesis and target files.
cat test.hypo | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.hypo.tokenized
cat test.target | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.hypo.target
files2rouge test.hypo.tokenized test.hypo.target
# Expected output: (ROUGE-L Average_F: 0.4153)
You need Java and the Stanford CoreNLP library to run the tokenization commands above.
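If files2rouge is not available, a rough sanity check can be run with a Python ROUGE implementation such as the rouge-score package (not used in the paper; scores computed this way will not exactly match the official files2rouge numbers, and the file paths below assume the outputs of the inference script above):

from rouge_score import rouge_scorer

# Approximate ROUGE-L F1 over the untokenized hypothesis/reference files.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

with open("cnn_dm/test.hypo") as hypo_f, open("cnn_dm/test.target") as ref_f:
    scores = [
        scorer.score(ref.strip(), hyp.strip())["rougeL"].fmeasure
        for hyp, ref in zip(hypo_f, ref_f)
    ]

print("Approximate ROUGE-L F1: %.4f" % (sum(scores) / len(scores)))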