Multi-XScience

Dataset for the EMNLP 2020 paper, Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles.

Authors: Yao Lu, Yue Dong, Laurent Charlin

Appendix: model implementation and evaluation details.

Dataset Statistics

Word-level statistics

train/val/test examples	average document length	average summary length	average number of references
30,369/5,066/5,093	778.08	116.44	4.42

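We also compute the percentage of novel n-grams in the target summaries, that is, n-grams that do not appear in the source documents, for Multi-XScience and for previous datasets. The three datasets marked "(single)" are single-document summarization datasets; the others are multi-document. Our dataset has the highest abstractiveness among the existing multi-document summarization datasets.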

Dataset	% of novel unigrams	% of novel bigrams	% of novel trigrams	% of novel 4-grams
CNN-DailyMail (single)	17.00	53.91	71.98	80.29
NY Times (single)	22.64	55.59	71.93	80.16
XSum (single)	35.76	83.45	95.50	98.49
WikiSum	18.20	51.88	69.82	78.16
Multi-News	17.76	57.10	75.71	82.30
Multi-XScience	42.33	81.75	94.57	97.62
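
As a rough illustration of how a novel n-gram percentage can be computed, here is a minimal sketch. It assumes lowercased whitespace tokenization and counts unique n-gram types; the tokenizer and counting scheme behind the numbers in the table are not specified here, so this sketch is not expected to reproduce them exactly. The function names `ngrams` and `novel_ngram_percentage` are hypothetical.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) found in a list of tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_percentage(source, summary, n):
    """Percentage of summary n-grams that never occur in the source text.

    Assumption: lowercased whitespace tokenization, unique n-gram types.
    """
    source_ngrams = ngrams(source.lower().split(), n)
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0  # summary shorter than n tokens
    novel = summary_ngrams - source_ngrams
    return 100.0 * len(novel) / len(summary_ngrams)


# Toy example, not taken from the dataset.
source = "the cat sat on the mat"
summary = "the cat lay on the rug"
unigram_pct = novel_ngram_percentage(source, summary, 1)
bigram_pct = novel_ngram_percentage(source, summary, 2)
```

For a multi-document example, `source` would be the concatenation of all input documents.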

Dataset Format

key	description
aid	arXiv ID (e.g. 2010.14235)
mid	Microsoft Academic Graph ID
abstract	text of the paper's abstract
ref_abstract	meta-information of the reference papers
ref_abstract.cite_N	meta-information of reference paper cite_N (special cite symbol)
ref_abstract.cite_N.mid	Microsoft Academic Graph ID of reference paper cite_N
ref_abstract.cite_N.abstract	text of the abstract of reference paper cite_N
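
The key layout above can be read as a nested JSON object. The following is a minimal sketch that assumes one record is a JSON object with exactly these keys; the record values are placeholders rather than real dataset content, the exact spelling of the cite keys in the released files may differ from `cite_1`, and `build_source` is a hypothetical helper.

```python
import json

# Hypothetical record following the key layout in the table above.
# All values are placeholders, not real dataset content.
sample_json = """
{
  "aid": "2010.14235",
  "mid": "0000000000",
  "abstract": "Abstract of the query paper.",
  "ref_abstract": {
    "cite_1": {"mid": "1111111111", "abstract": "Abstract of the first reference."},
    "cite_2": {"mid": "2222222222", "abstract": ""}
  }
}
"""


def build_source(record):
    """Join the paper abstract with its non-empty reference abstracts."""
    parts = [record["abstract"]]
    for cite_key, ref in sorted(record["ref_abstract"].items()):
        # Skip references whose abstract text is empty.
        if ref["abstract"].strip():
            parts.append(f"{cite_key}: {ref['abstract']}")
    return "\n".join(parts)


record = json.loads(sample_json)
source_text = build_source(record)
```

Prefixing each reference abstract with its cite key is one possible design choice; it keeps the link between a reference and its cite symbol visible in the concatenated input.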

Extended Usage

Our dataset is aligned with the Microsoft Academic Graph: every paper and every reference paper carries its Microsoft Academic Graph ID (mid). Anyone interested in the intersection of graphs and summarization can use our dataset for exploration.
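
One way to start exploring the graph side is to collect citation edges from the mid fields. The sketch below assumes records shaped like the Dataset Format table; the records and IDs are placeholders, and `citation_edges` is a hypothetical helper, not part of any released tooling.

```python
from collections import defaultdict


def citation_edges(records):
    """Map each paper's mid to the set of mids of the papers it cites."""
    graph = defaultdict(set)
    for record in records:
        for ref in record["ref_abstract"].values():
            # Ignore references that have no graph ID.
            if ref.get("mid"):
                graph[record["mid"]].add(ref["mid"])
    return graph


# Placeholder records; the IDs are not real Microsoft Academic Graph IDs.
records = [
    {"mid": "A", "ref_abstract": {
        "cite_1": {"mid": "X", "abstract": ""},
        "cite_2": {"mid": "Y", "abstract": ""},
    }},
    {"mid": "B", "ref_abstract": {
        "cite_1": {"mid": "X", "abstract": ""},
    }},
]

graph = citation_edges(records)
# References cited by both papers, a simple signal of relatedness.
shared = graph["A"] & graph["B"]
```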
