SLED

The official repository for Efficient Long-Text Understanding with Short-Text Models (Ivgi et al., 2022), to appear in Transactions of the Association for Computational Linguistics (TACL) 2023.

SLED models take pretrained, short-range encoder-decoder models and apply them to long-text inputs by splitting the input into multiple overlapping chunks, encoding each chunk independently, and performing fusion-in-decoder.
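The chunking step described above can be illustrated with a minimal sketch in plain Python. This is not the library's implementation: the function name and the `chunk_size` and `overlap` parameters are hypothetical and chosen for illustration only, and the sketch covers only the splitting step, not the encoding or the fusion in the decoder.

```python
from typing import List


def split_into_overlapping_chunks(
    token_ids: List[int], chunk_size: int, overlap: int
) -> List[List[int]]:
    """Split a token sequence into chunks of at most `chunk_size` tokens.

    Consecutive chunks share `overlap` tokens. Both parameters are
    hypothetical names for this sketch, not SLED configuration keys.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not token_ids:
        return []

    stride = chunk_size - overlap  # how far the window moves each step
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # this chunk already reaches the end of the input
        start += stride
    return chunks


tokens = list(range(10))
chunks = split_into_overlapping_chunks(tokens, chunk_size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

In a fusion-in-decoder setup, each such chunk would be encoded on its own, and the decoder would then attend over the encoded representations of all chunks together.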

Data

The data for this paper is hosted on the dataset hub here. It is based on the SCROLLS dataset (paper), the SQuAD 1.1 dataset (paper) and the HotpotQA dataset (paper). It doesn't contain any unpublished data, but includes the configuration needed for the paper.

Usage example:

from datasets import load_dataset
qasper = load_dataset("tau/sled", "qasper")

Installation

Make sure to install PyTorch according to your machine spec. See installation options here.

Installing SLED is easy with pip.

pip install py-sled

Some backbone models require additional dependencies. For example, if you wish to work with T5, you can install them using:

pip install py-sled[t5]

If you wish to run the examples, install the required dependencies with:

pip install py-sled[examples]

If you wish to continue developing this repository, install the full development requirements with:

pip install py-sled[dev]

Usage

SLED works with the AutoClasses in HuggingFace's Transformers library.

A minimal usage example:

import sled  # ** required so SLED would be properly registered by the AutoClasses **
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('tau/bart-base-sled')
model = AutoModel.from_pretrained('tau/bart-base-sled')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state

Important: You need to import sled before using the AutoClass (e.g. AutoModel.from_pretrained('tau/bart-base-sled')) for it to work.

A minimal working example can be found here.

To work with SCROLLS-like data of the kind used for the paper, see here.

Custom datasets

For SLED to be able to prepend the prefix input to every chunk, it requires the input tensor prefix_length. If using a custom dataset, you can refer to run.py for the correct way to preprocess the data.

Note: Currently, HF's Seq2SeqTrainer doesn't pass the prefix_length tensor in the prediction loop, so you should use the CustomSeq2SeqTrainer or something similar until it is fixed.
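To make the role of prefix_length concrete, here is a minimal sketch in plain Python. It assumes that the prefix tokens (for example, a question) are concatenated in front of the document tokens and that prefix_length records how many tokens belong to the prefix. The helper names `build_example` and `prepend_prefix_to_chunks` are hypothetical, and the chunks here do not overlap, to keep the sketch short. Refer to run.py for the preprocessing actually used in this repository.

```python
from typing import Dict, List, Union


def build_example(
    prefix_ids: List[int], document_ids: List[int]
) -> Dict[str, Union[List[int], int]]:
    """Concatenate prefix and document tokens and record the prefix length.

    Hypothetical helper: it mirrors the idea of passing `prefix_length`
    alongside `input_ids`, not the exact preprocessing in run.py.
    """
    return {
        "input_ids": prefix_ids + document_ids,
        "prefix_length": len(prefix_ids),
    }


def prepend_prefix_to_chunks(
    input_ids: List[int], prefix_length: int, chunk_size: int
) -> List[List[int]]:
    """Split the document part into chunks and prepend the prefix to each."""
    prefix = input_ids[:prefix_length]
    document = input_ids[prefix_length:]
    room = chunk_size - prefix_length  # tokens left for document text
    if room <= 0:
        raise ValueError("chunk_size must be larger than prefix_length")
    return [prefix + document[i:i + room] for i in range(0, len(document), room)]


example = build_example(prefix_ids=[101, 102], document_ids=[1, 2, 3, 4, 5, 6])
prefixed_chunks = prepend_prefix_to_chunks(
    example["input_ids"], example["prefix_length"], chunk_size=5
)
for chunk in prefixed_chunks:
    print(chunk)
```

Without prefix_length, the model cannot tell where the prefix ends and the document begins, which is why the value has to travel with the inputs through both training and prediction.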

Backbone models

There are multiple model cards available on the HuggingFace Hub, including

If you wish to use a custom model that is available as a model card (public or private) on the hub, or to use different parameters for SLED, you can create a JSON config file like the one below and change underlying_config to your custom model card.

{
  "model_type": "tau/sled",
  "underlying_config": "facebook/bart-base",
  "context_size": 256,
  "window_fraction": 0.5,
  "prepend_prefix": true,
  "encode_prefix": true,
  "sliding_method": "dynamic"
}

You can then load it as follows:

import sled
from transformers import AutoModelForSeq2SeqLM
custom_sled_model = AutoModelForSeq2SeqLM.from_pretrained(<your custom json config>)
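If you prefer to generate the JSON config from code rather than write it by hand, a sketch along the following lines may help. It uses only the Python standard library and the keys shown in the config above. The temporary directory and the file name config.json are assumptions made for illustration, not requirements stated by this repository.

```python
import json
import tempfile
from pathlib import Path

# Same keys and values as the JSON config shown above; change
# "underlying_config" to point at your own model card.
sled_config = {
    "model_type": "tau/sled",
    "underlying_config": "facebook/bart-base",
    "context_size": 256,
    "window_fraction": 0.5,
    "prepend_prefix": True,
    "encode_prefix": True,
    "sliding_method": "dynamic",
}

# The directory and file name are hypothetical choices for this sketch.
config_dir = Path(tempfile.mkdtemp())
config_path = config_dir / "config.json"
config_path.write_text(json.dumps(sled_config, indent=2))

# Read the file back to confirm that it is valid JSON.
loaded = json.loads(config_path.read_text())
print(config_path)
```

Note that Python's True is serialized as JSON true, so the written file matches the format shown above.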

Citation

If you use this repository, please cite as below:

@inproceedings{Ivgi2022EfficientLU,
  title={Efficient Long-Text Understanding with Short-Text Models},
  author={Maor Ivgi and Uri Shaham and Jonathan Berant},
  year={2022}
}

Disclaimer

This repository is still under active development and may contain some unintended behavior. Please open an issue if any unexpected behavior occurs, and we will promptly try to fix it.

The code was developed and tested with transformers version 4.21.0. Newer versions may break backward compatibility and cause unexpected behavior.