This repo contains code and checkpoints to reproduce the results of the paper Adapting Pretrained Text-to-Text Models for Long Text Sequences. We further pretrain the BART model for long-sequence tasks, setting a new state of the art on abstractive summarization of long documents (e.g., GovReport, BookSum, SummScreen, QMSum). Our implementation is based on custom forks of fairseq and xformers. You can use this repo to finetune on your own long-context tasks or to implement efficient long-context models on top of the fast fairseq package.
Our models were developed with A100 GPUs, CUDA 11.4, and PyTorch 1.12.1. Exact result numbers might vary slightly due to environment differences.
- Install xformers and fairseq by running `pip install -e .` under each of their directories. Install apex following https://github.com/NVIDIA/apex.
- Install Triton (to suppress errors from xformers): `pip install triton`
- Install pyrouge and rouge_score for summarization evaluation: `pip install -U git+https://github.com/pltrdy/pyrouge` and `pip install rouge_score`
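A quick way to confirm the environment is set up correctly is to import the packages installed above; this is just a sanity check, not part of the repo:

```python
# Sanity check (not part of the repo): the dependencies installed above
# should all import from the same environment.
import apex          # built from source following the NVIDIA instructions
import fairseq       # custom fork, installed with `pip install -e .`
import rouge_score
import triton
import xformers      # custom fork, installed with `pip install -e .`

print("fairseq", fairseq.__version__)
print("triton", triton.__version__)
print("environment looks OK")
```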
All cells report ROUGE-1/2; parameter counts are shown in parentheses.

Method | GovReport | BookSum-Chapters | SummScreen-FD | SummScreen-TVM |
---|---|---|---|---|
Previous SOTA | 61.0/28.8 (525M) | 38.3/9.2 (660M) | 36.8/9.2 (660M) | 51.0/14.7 (660M) |
BART-LS (ours, 440M) | 62.0/30.9 | 38.5/10.3 | 39.1/10.7 | 51.8/17.2 |
Model Description | Download |
---|---|
Pretrained model | model_100k.pt |
Finetuned checkpoint on GovReport | model_gov.pt |
Finetuned checkpoint on SummScreen-FD | model_fd.pt |
Finetuned checkpoint on BookSum | model_book.pt |
Dictionary/vocabulary file | dict.txt |
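As a rough illustration (not code from the repo), a downloaded checkpoint can be inspected with plain PyTorch before plugging it into fairseq; the key names below assume the standard fairseq checkpoint layout:

```python
# Minimal sketch: peek inside a downloaded checkpoint without building the model.
# Assumes the standard fairseq layout with a 'model' state dict.
import torch

ckpt = torch.load("checkpoints/model_100k.pt", map_location="cpu")
print(list(ckpt.keys()))          # typically includes 'model' and 'cfg' or 'args'

state = ckpt["model"]
n_params = sum(p.numel() for p in state.values())
print(f"{len(state)} tensors, ~{n_params / 1e6:.0f}M parameters")
```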
- Pretraining task: fairseq-py/fairseq/tasks/long_denoising.py
- Summarization task: fairseq-py/fairseq/tasks/summarization.py
- Pooling layers: fairseq-py/fairseq/models/long_transformers/pooling_layers.py
- Block Attention: xformers/xformers/components/attention/block_noglobal.py.
- Integration to fairseq's transformer architecture: fairseq-py/fairseq/modules/multihead_attention.py
Apart from the block attention implemented with native PyTorch operations, we also provide a faster version within xformers implemented with Triton: xformers/xformers/components/attention/blocksparse_local.py. This implementation brings about 20-30% efficiency gains at the cost of slightly worse results. To enable this option, simply pass `--attention-name bs_local`. You can easily implement other attention architectures without worrying about the rest of the transformer blocks.
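For orientation, new attention variants are registered with xformers' attention registry so they can be selected by name. The sketch below is purely illustrative: the class, config, and the placeholder full-attention math are not from this repo, and it assumes the registry API (`Attention`, `AttentionConfig`, `register_attention`) exposed by the bundled xformers fork.

```python
# Illustrative sketch only -- names and logic are NOT from this repo.
# Registers a toy attention variant by name, the same mechanism that makes
# block_noglobal / bs_local selectable via --attention-name.
from dataclasses import dataclass

import torch
from torch import nn
from xformers.components.attention import Attention, AttentionConfig, register_attention


@dataclass
class MyWindowAttentionConfig(AttentionConfig):
    window_size: int = 1024  # hypothetical hyperparameter


@register_attention("my_window", MyWindowAttentionConfig)
class MyWindowAttention(Attention):
    def __init__(self, dropout: float = 0.0, window_size: int = 1024, *args, **kwargs):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.window_size = window_size  # a real variant would restrict attention to local blocks

    def forward(self, q, k, v, att_mask=None, *args, **kwargs):
        # Placeholder: plain softmax attention; a real implementation would
        # compute block-local attention within `window_size` tokens.
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        if att_mask is not None:
            scores = scores + att_mask
        return self.dropout(scores.softmax(dim=-1)) @ v
```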
- Prepare raw data. Organize your data as `{train|val|test}.{src|tgt}`, where each line corresponds to one example (a conversion sketch is given after this list).
- Under fairseq-py/, binarize the data with `bash ./scripts/summarization/binarize.sh`. For query-based summarization, check fairseq-py/scripts/summarization/qmsum_preprocess.sh.
- The hyperparameters we used for each dataset can be found in fairseq-py/fb_sweep/long_finetune/sweep_summ.py. After downloading the checkpoints and putting them under checkpoints/, run finetuning with `bash scripts/summarization/ft_summ.sh`.
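If your dataset is not already in the `{split}.{src|tgt}` layout, a conversion along the following lines works; this is a hypothetical helper, and the "document"/"summary" field names are assumptions about your raw files, not something the repo requires:

```python
# Hypothetical helper: convert a JSON-lines dataset into the
# {split}.{src|tgt} layout expected by binarize.sh.
# The "document"/"summary" field names are assumptions about your data.
import json
import os


def write_split(jsonl_path: str, split: str, out_dir: str = "raw_data") -> None:
    os.makedirs(out_dir, exist_ok=True)
    with open(jsonl_path) as f, \
            open(os.path.join(out_dir, f"{split}.src"), "w") as src, \
            open(os.path.join(out_dir, f"{split}.tgt"), "w") as tgt:
        for line in f:
            ex = json.loads(line)
            # One example per line: keep each document/summary on a single line.
            src.write(ex["document"].replace("\n", " ").strip() + "\n")
            tgt.write(ex["summary"].replace("\n", " ").strip() + "\n")


for split in ("train", "val", "test"):
    write_split(f"{split}.jsonl", split)
```

To generate summaries with a finetuned checkpoint, run: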
python scripts/summarization/long_generate.py \
--model-dir ../checkpoints/model_fd.pt \
--data-dir ${BINARIZED_DATA} \
--save-dir ${SUMMARY_SAVE_DIR} \
--split valid \
--bsz 4
This script prints ROUGE numbers calculated with rouge_score, the library used by SCROLLS. In our paper, we reported ROUGE scores computed with files2rouge; please follow their repo to install files2rouge and download Stanford CoreNLP for tokenization.
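For reference, scoring saved outputs with rouge_score looks roughly like the snippet below; the file names are placeholders (check long_generate.py for what it actually writes under the save directory):

```python
# Rough illustration of ROUGE-1/2 evaluation with rouge_score.
# The file paths are placeholders, not necessarily what long_generate.py saves.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

with open("summaries/valid.hypo") as hyp_f, open("summaries/valid.target") as ref_f:
    scores = [scorer.score(ref.strip(), hyp.strip()) for hyp, ref in zip(hyp_f, ref_f)]

for key in ("rouge1", "rouge2", "rougeLsum"):
    avg = 100 * sum(s[key].fmeasure for s in scores) / len(scores)
    print(f"{key}: {avg:.2f}")
```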
If you find the repo useful, please consider citing our paper:
@article{xiong2022adapting,
title={Adapting Pretrained Text-to-Text Models for Long Text Sequences},
author={Xiong, Wenhan and Gupta, Anchit and Toshniwal, Shubham and Mehdad, Yashar and Yih, Wen-tau},
journal={arXiv preprint arXiv:2209.10052},
year={2022}
}
CC-BY-NC 4.0