MuLD (Multitask Long Document Benchmark) is a set of 6 NLP tasks where the inputs consist of at least 10,000 words. The benchmark covers a wide variety of task types, including translation, summarization, question answering, and classification. Additionally, the output lengths range from a single-word classification label all the way up to an output longer than the input text.
This repo contains the official code for the paper *MuLD: The Multitask Long Document Benchmark*.
The easiest method is to use the Hugging Face Datasets library:
import datasets
ds = datasets.load_dataset("ghomasHudson/muld", "NarrativeQA")
ds = datasets.load_dataset("ghomasHudson/muld", "HotpotQA")
ds = datasets.load_dataset("ghomasHudson/muld", "Character Archetype Classification")
ds = datasets.load_dataset("ghomasHudson/muld", "OpenSubtitles")
ds = datasets.load_dataset("ghomasHudson/muld", "AO3 Style Change Detection")
ds = datasets.load_dataset("ghomasHudson/muld", "VLSP")
If you prefer to download the data files yourself:
- NarrativeQA: Train, Val, Test
- HotpotQA: Train, Val
- Character Archetype Classification: Train, Val, Test
- OpenSubtitles: Train, Test
- Style Change: Train, Val, Test
- VLSP: Test
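Files downloaded this way can still be read with the `datasets` library. The sketch below assumes a JSON Lines file and uses a placeholder file name; substitute the path of whichever split you downloaded (and decompress it first if it ships compressed):

```python
import datasets

# Placeholder path -- replace with the file you actually downloaded
data_files = {"train": "muld_narrativeqa_train.jsonl"}

# The generic "json" loader reads JSON Lines files into a DatasetDict
ds = datasets.load_dataset("json", data_files=data_files)
print(ds["train"][0])
```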
If you use our benchmark, please cite the paper:
@misc{hudson2022muld,
    title={MuLD: The Multitask Long Document Benchmark},
    author={G Thomas Hudson and Noura Al Moubayed},
    year={2022},
    eprint={2202.07362},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Additionally, please cite the datasets we used (particularly NarrativeQA, HotpotQA, and OpenSubtitles, where we directly use their data with limited filtering).
The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.
property | value
---|---
name | MuLD
alternateName | Multitask Long Document Benchmark
url | https://github.com/ghomasHudson/muld
description | MuLD (Multitask Long Document Benchmark) is a set of 6 NLP tasks where the inputs consist of at least 10,000 words. The benchmark covers a wide variety of task types, including translation, summarization, question answering, and classification. Additionally, the output lengths range from a single-word classification label all the way up to an output longer than the input text.
citation | https://arxiv.org/abs/2202.07362
creator | G Thomas Hudson and Noura Al Moubayed