
Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles

License: MIT

Multi-XScience

Dataset for the EMNLP 2020 paper, Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles.

Authors: Yao Lu, Yue Dong, Laurent Charlin

Appendix: model implementation and evaluation details.

Dataset Statistics

Word-level statistics:

| Train/val/test examples | Average document length (words) | Average summary length (words) | Average number of references |
|---|---|---|---|
| 30,369/5,066/5,093 | 778.08 | 116.44 | 4.42 |

We also compare the percentage of novel n-grams in the target summaries, that is, summary n-grams that do not appear in the source documents, against previous datasets. Three of those datasets are single-document summarization datasets. Multi-XScience has the highest abstractiveness among existing multi-document summarization datasets.

| Dataset | % novel unigrams | % novel bigrams | % novel trigrams | % novel 4-grams |
|---|---|---|---|---|
| CNN-DailyMail (single) | 17.00 | 53.91 | 71.98 | 80.29 |
| NY Times (single) | 22.64 | 55.59 | 71.93 | 80.16 |
| XSum (single) | 35.76 | 83.45 | 95.50 | 98.49 |
| WikiSum | 18.20 | 51.88 | 69.82 | 78.16 |
| Multi-News | 17.76 | 57.10 | 75.71 | 82.30 |
| Multi-XScience | 42.33 | 81.75 | 94.57 | 97.62 |
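For readers who want to reproduce this kind of measurement, the following is a minimal sketch of a novel n-gram percentage. It assumes lowercased whitespace tokenization and counts distinct n-grams; the tokenizer and counting scheme behind the numbers in the table are not specified here, so results from this sketch may differ. The function names `ngrams` and `novel_ngram_percentage` are illustrative, not part of the dataset tooling.

```python
def ngrams(tokens, n):
    """Return the set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_percentage(source_text, summary_text, n):
    """Percentage of distinct summary n-grams that never occur in the source.

    Assumption: lowercased whitespace tokenization, which is only a stand-in
    for whatever tokenizer was used to produce the published numbers.
    """
    source = ngrams(source_text.lower().split(), n)
    summary = ngrams(summary_text.lower().split(), n)
    if not summary:
        return 0.0
    return 100.0 * len(summary - source) / len(summary)


# Toy example with made-up sentences (not taken from the dataset).
source_text = "the cat sat on the mat"
summary_text = "the cat lay on a mat"
unigram_pct = novel_ngram_percentage(source_text, summary_text, 1)
bigram_pct = novel_ngram_percentage(source_text, summary_text, 2)
```

For multi-document inputs, one option is to concatenate the paper abstract and all reference abstracts into a single source string before calling the function.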

Dataset Format

| Key | Description |
|---|---|
| aid | arXiv id (e.g. 2010.14235) |
| mid | Microsoft Academic Graph id |
| abstract | text of the paper abstract |
| ref_abstract | meta-information of the reference papers |
| ref_abstract.cite_N | meta-information of reference paper cite_N (special cite symbol) |
| ref_abstract.cite_N.mid | Microsoft Academic Graph id of reference paper cite_N |
| ref_abstract.cite_N.abstract | abstract text of reference paper cite_N |
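As a rough illustration of this layout, the sketch below parses one record and collects its source documents. The record is hypothetical: every value is a placeholder, the exact spelling of the cite keys in the released files is an assumption, and the helper `source_documents` is not part of the dataset release.

```python
import json

# Hypothetical record following the key layout described above.
# All values are placeholders, not real dataset entries.
record_json = """
{
  "aid": "2010.14235",
  "mid": "0000000000",
  "abstract": "Abstract of the query paper.",
  "ref_abstract": {
    "cite_1": {"mid": "1111111111", "abstract": "Abstract of the first reference."},
    "cite_2": {"mid": "2222222222", "abstract": "Abstract of the second reference."}
  }
}
"""


def source_documents(record):
    """Return the paper abstract followed by each reference abstract,
    with every reference prefixed by its cite symbol."""
    docs = [record["abstract"]]
    for cite, ref in record["ref_abstract"].items():
        docs.append(f"{cite}: {ref['abstract']}")
    return docs


record = json.loads(record_json)
docs = source_documents(record)
```

Keeping the cite symbol next to each reference abstract is one way to let a summarization model tie generated citations back to the right source document.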

Extended Usage

Our dataset is aligned with the Microsoft Academic Graph through the mid fields of each paper and its references. Anyone interested in combining graph structure with summarization can use the dataset for exploration.
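One possible starting point is to turn the mid fields into a citation graph. The sketch below builds an adjacency map from a paper's mid to the mids of its references. It assumes records are already loaded as dictionaries in the format described above, uses made-up ids, and skips references whose mid is missing or empty; the helpers `citation_edges` and `adjacency` are hypothetical names.

```python
from collections import defaultdict


def citation_edges(records):
    """Yield (citing_mid, cited_mid) pairs, skipping references without a mid."""
    for record in records:
        for ref in record["ref_abstract"].values():
            if ref.get("mid"):
                yield record["mid"], ref["mid"]


def adjacency(records):
    """Map each citing paper's mid to the set of mids it references."""
    graph = defaultdict(set)
    for citing, cited in citation_edges(records):
        graph[citing].add(cited)
    return dict(graph)


# Hypothetical records with placeholder ids and abstracts.
records = [
    {"mid": "100", "ref_abstract": {
        "cite_1": {"mid": "200", "abstract": "..."},
        "cite_2": {"mid": "300", "abstract": "..."}}},
    {"mid": "200", "ref_abstract": {
        "cite_1": {"mid": "300", "abstract": "..."},
        "cite_2": {"mid": "", "abstract": "..."}}},
]
graph = adjacency(records)
```

The resulting ids could then be joined against Microsoft Academic Graph data to attach further metadata to each node.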