
Improving Multi-Document Modeling via Cross-Document Question-Answering

This repository contains the accompanying code for the paper:

"Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering ." Avi Caciularu, Arman Cohan, Ido Dagan, Jacob Goldberger and Arman Cohan. In ACL, 2023. [PDF]

You can either pretrain the model yourself or use the pretrained QAmden model weights and tokenizer files, which are available on HuggingFace.

Pre-trained Model Usage

Code for loading and using the QAmden pre-trained model:

from transformers import AutoTokenizer, AutoModel
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('biu-nlp/QAmden')
model = AutoModel.from_pretrained('biu-nlp/QAmden')

Please note that during pretraining we used document separators (similarly to PRIMERA), which you might want to add to your data. The document separator is <doc-sep> (the last token in the vocabulary).
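
For example, a minimal sketch of preparing a multi-document input with the separator (docs is a hypothetical list of document strings; the tokenizer is the one loaded above):

# hypothetical document strings; join them with the pretraining separator token
docs = ["First document ...", "Second document ...", "Third document ..."]
multi_doc_input = "<doc-sep>".join(docs)
inputs = tokenizer(multi_doc_input, return_tensors="pt", truncation=True)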

We also provide QAmden fine-tuned on the Multi-News dataset:

from transformers import AutoTokenizer, AutoModel
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('biu-nlp/QAmden')
model = AutoModel.from_pretrained('biu-nlp/QAmden-multinews')
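
As a minimal generation sketch (assuming the fine-tuned checkpoint is a standard encoder-decoder summarization model; the input documents and generation settings below are illustrative, not the paper's configuration):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# load the tokenizer and the Multi-News fine-tuned checkpoint with a generation head
tokenizer = AutoTokenizer.from_pretrained('biu-nlp/QAmden')
model = AutoModelForSeq2SeqLM.from_pretrained('biu-nlp/QAmden-multinews')
# hypothetical input documents, joined with the pretraining separator
docs = ["First news article ...", "Second news article ..."]
inputs = tokenizer("<doc-sep>".join(docs), return_tensors="pt", truncation=True)
# num_beams and max_length are illustrative values
summary_ids = model.generate(**inputs, num_beams=4, max_length=256)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))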

Pre-training your own QAmden model

To generate the pre-training data for your own QAmden model:

  1. Download and untar the preprocessed newshead data.
  2. Process the data by running pretrain_preprocess_qasem.py.
  3. Filter the processed data and create the csv files by running preprocess_and_filter_data.py.

Alternatively, you can download and use the already-preprocessed data:

from datasets import load_dataset
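# download the already-preprocessed pre-training data from the HuggingFace Hub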
qamden_pretraining_dataset = load_dataset("biu-nlp/QAmden-pretraining")

Once you have the data, launch pre-training using the pretrain_qamden.py script.

Evaluating the QAmden model on multi-document summarization

Use the finetune_summarization.py script to evaluate on multi_news or on multi_x_science_sum.
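
Both datasets are available on the HuggingFace Hub; for example, loading them with the datasets library looks like this (dataset identifiers only; the evaluation arguments themselves are defined in finetune_summarization.py):

from datasets import load_dataset
# multi-document summarization benchmarks used for evaluation
multi_news = load_dataset("multi_news")
multi_x_science = load_dataset("multi_x_science_sum")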


Citation:

If you find our work useful, please cite the paper as:

@article{caciularu2023Peekacross,
  title={Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering},
  author={Caciularu, Avi and Peters, Matthew E. and Goldberger, Jacob and Dagan, Ido and Cohan, Arman},
  journal={The Annual Meeting of the Association for Computational Linguistics (ACL 2023)},
  year={2023}
}