This repository contains the accompanying code for the paper:
"Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering ." Avi Caciularu, Arman Cohan, Ido Dagan, Jacob Goldberger and Arman Cohan. In ACL, 2023. [PDF]
You can either pre-train the model yourself or use the pre-trained QAmden model weights and tokenizer files, which are available on HuggingFace.
Code for loading and using the QAmden pre-trained model:

```python
from transformers import AutoTokenizer, AutoModel

# load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('biu-nlp/QAmden')
model = AutoModel.from_pretrained('biu-nlp/QAmden')
```
Please note that during pre-training we used document separator tokens (as in PRIMERA), which you might want to add to your data. The document separator is `<doc-sep>` (the last token in the vocabulary).
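As a minimal sketch of preparing a multi-document input with this separator (the example documents and the exact spacing around `<doc-sep>` are only illustrative; the preprocessing scripts below show the full pipeline):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('biu-nlp/QAmden')

# Illustrative documents; in practice these are the documents of one cluster.
documents = [
    "First news article about the event ...",
    "Second news article covering the same event ...",
]

# Join the documents with the <doc-sep> separator token used during pre-training.
multi_doc_input = " <doc-sep> ".join(documents)

# Tokenize the joined input for the model.
inputs = tokenizer(multi_doc_input, return_tensors="pt", truncation=True)
print(inputs["input_ids"].shape)
```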
We also provide QAmden fine-tuned on the multi-news dataset:

```python
from transformers import AutoTokenizer, AutoModel

# load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('biu-nlp/QAmden')
model = AutoModel.from_pretrained('biu-nlp/QAmden-multinews')
```
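As a usage sketch, generating a summary with the fine-tuned checkpoint could look like the following; it assumes the checkpoint exposes a sequence-to-sequence LM head (loaded here through `AutoModelForSeq2SeqLM`), and the generation settings are illustrative rather than the ones used in the paper:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('biu-nlp/QAmden')
model = AutoModelForSeq2SeqLM.from_pretrained('biu-nlp/QAmden-multinews')

# Illustrative cluster of related articles, joined with the <doc-sep> separator.
documents = [
    "Document one of the news cluster ...",
    "Document two of the news cluster ...",
]
source = " <doc-sep> ".join(documents)

inputs = tokenizer(source, return_tensors="pt", truncation=True)

# Illustrative generation settings (not the paper's hyper-parameters).
summary_ids = model.generate(**inputs, max_length=256, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```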
To generate the pre-training data for your own QAmden model:

- Download and untar the preprocessed newshead data.
- Process the data by running `pretrain_preprocess_qasem.py`.
- Filter the processed data and create the csv files by running `preprocess_and_filter_data.py`.
Alternatively, you can download and use the already-preprocessed data:
```python
from datasets import load_dataset

qamden_pretraining_dataset = load_dataset("biu-nlp/QAmden-pretraining")
```
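To check what you downloaded (the split and column names depend on the released dataset, so none are assumed here):

```python
from datasets import load_dataset

qamden_pretraining_dataset = load_dataset("biu-nlp/QAmden-pretraining")

# Print the available splits and their columns.
print(qamden_pretraining_dataset)

# Peek at the first example of the first split (field names depend on the release).
first_split = next(iter(qamden_pretraining_dataset))
print(qamden_pretraining_dataset[first_split][0])
```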
Once you have the data, launch pre-training using the `pretrain_qamden.py` script.
Use the `finetune_summarization.py` script to fine-tune and evaluate on multi-news or on multi_x_science_sum.
If you find our work useful, please cite the paper as:
```bibtex
@article{caciularu2023Peekacross,
  title={Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering},
  author={Caciularu, Avi and Peters, Matthew E. and Goldberger, Jacob and Dagan, Ido and Cohan, Arman},
  journal={The Annual Meeting of the Association for Computational Linguistics (ACL 2023)},
  year={2023}
}
```