This repository contains the DiscoFuse dataset described in the paper:
- Title: "DiscoFuse: A Large-Scale Dataset for Discourse-Based Sentence Fusion"
- Authors: Mor Geva, Eric Malmi, Idan Szpektor, Jonathan Berant
- https://arxiv.org/abs/1902.10526
- Accepted as a long research paper at NAACL 2019.
If you use this dataset in your work, please cite our paper:
@InProceedings{GevaEtAl2019,
title = {{DiscoFuse: A Large-Scale Dataset for Discourse-Based Sentence Fusion}},
author = {Geva, Mor and Malmi, Eric and Szpektor, Idan and Berant, Jonathan},
booktitle = {Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics},
note = {arXiv preprint arXiv:1902.10526},
year = {2019}
}
DiscoFuse was created by applying a rule-based splitting method to two corpora: sports articles crawled from the Web, and Wikipedia. See the paper for a detailed description of the dataset generation process and evaluation.
DiscoFuse has two parts:
File | Download | Source | Examples |
---|---|---|---|
discofuse_v1_sports.tar.gz | Link | Sports articles | 44,177,443 |
discofuse_v1_wikipedia.tar.gz | Link | Wikipedia | 16,642,323 |
For each part, we provide a random split into training (98% of the examples), development (1%) and test (1%) sets. In addition, because the original data distribution is highly skewed (see details in the paper), we also provide a balanced version of each set.
Overall, each part contains six data subsets:
- train
- train_balanced
- dev
- dev_balanced
- test
- test_balanced
All files are in plain-text TSV (tab-separated values) format. Each example has the following attributes:
- coherent_first_sentence: The first sentence of the original text from which the example was generated.
- coherent_second_sentence: The second sentence of the original text from which the example was generated. If the example was generated from a single sentence, this field is empty.
- incoherent_first_sentence: The first sentence of the split text, generated by our rule-based method.
- incoherent_second_sentence: The second sentence of the split text, generated by our rule-based method.
- discourse_type: The discourse phenomenon identified in the original text. The full list of discourse types appears below.
- connective_string: The connective that was removed from the original text, if any. Otherwise, this field is empty.
- has_coref_type_pronoun: Contains 1.0 if there was a pronoun replacement during example generation, and 0.0 otherwise.
- has_coref_type_nominal: Contains 1.0 if there was a nominal replacement during example generation, and 0.0 otherwise.
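As a rough illustration of how these attributes could be read, here is a minimal Python sketch. It assumes that each file starts with a header row and that the columns appear in the order listed above; neither is stated here, so check against the actual files. The names `COLUMNS` and `read_examples` are hypothetical and not part of the dataset.

```python
import csv
import io

# Attribute names from the list above; the column order is an assumption.
COLUMNS = [
    "coherent_first_sentence",
    "coherent_second_sentence",
    "incoherent_first_sentence",
    "incoherent_second_sentence",
    "discourse_type",
    "connective_string",
    "has_coref_type_pronoun",
    "has_coref_type_nominal",
]


def read_examples(lines, has_header=True):
    """Yield one dict per example from an iterable of TSV lines."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the assumed header row
    for row in reader:
        if not row:
            continue  # ignore blank lines
        example = dict(zip(COLUMNS, row))
        # The coreference flags are stored as the strings "1.0" / "0.0".
        for flag in ("has_coref_type_pronoun", "has_coref_type_nominal"):
            example[flag] = float(example[flag]) == 1.0
        yield example


# In-memory sample built from one example of this dataset.
# For real data, pass a file opened with encoding="utf-8" and newline="".
sample_row = [
    "Melvyn Douglas originally was signed to play Sam Bailey , but the role ultimately went to Walter Pidgeon .",
    "",
    "Melvyn Douglas originally was signed to play Sam Bailey .",
    "The role ultimately went to Walter Pidgeon .",
    "SINGLE_S_COORD",
    ", but",
    "0.0",
    "0.0",
]
sample = io.StringIO("\t".join(COLUMNS) + "\n" + "\t".join(sample_row) + "\n")
examples = list(read_examples(sample))
```

If the files turn out to have no header row, calling `read_examples(f, has_header=False)` keeps the first line as data.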
Below is the list of discourse types. The prefix PAIR/SINGLE indicates whether the example was generated from two consecutive sentences or from a single sentence. CONN indicates that a connective was removed from the original text, S_COORD stands for sentence coordination, and VP_COORD stands for verb-phrase coordination.
- PAIR_ANAPHORA
- PAIR_CONN
- PAIR_CONN_ANAPHORA
- PAIR_NONE
- SINGLE_APPOSITION
- SINGLE_CATAPHORA
- SINGLE_CONN_INNER
- SINGLE_CONN_INNER_ANAPHORA
- SINGLE_CONN_START
- SINGLE_RELATIVE
- SINGLE_S_COORD
- SINGLE_S_COORD_ANAPHORA
- SINGLE_VP_COORD
Please see the paper for the data distribution and more details on each discourse type.
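The naming scheme of the labels can be unpacked mechanically. The sketch below splits a label into its PAIR/SINGLE prefix and the remainder, and reports whether the CONN token is present. The names `DISCOURSE_TYPES` and `parse_discourse_type` are hypothetical helpers, not part of the dataset.

```python
from collections import Counter

# The 13 discourse types listed above.
DISCOURSE_TYPES = {
    "PAIR_ANAPHORA",
    "PAIR_CONN",
    "PAIR_CONN_ANAPHORA",
    "PAIR_NONE",
    "SINGLE_APPOSITION",
    "SINGLE_CATAPHORA",
    "SINGLE_CONN_INNER",
    "SINGLE_CONN_INNER_ANAPHORA",
    "SINGLE_CONN_START",
    "SINGLE_RELATIVE",
    "SINGLE_S_COORD",
    "SINGLE_S_COORD_ANAPHORA",
    "SINGLE_VP_COORD",
}


def parse_discourse_type(label):
    """Split a discourse type label into its prefix and phenomenon."""
    if label not in DISCOURSE_TYPES:
        raise ValueError(f"unknown discourse type: {label!r}")
    # Split only at the first underscore: "PAIR_CONN_ANAPHORA" -> "PAIR", "CONN_ANAPHORA".
    source, _, phenomenon = label.partition("_")
    return {
        "source": source,  # "PAIR" or "SINGLE"
        "phenomenon": phenomenon,
        # CONN marks the removal of a connective from the original text.
        "connective_removed": "CONN" in phenomenon.split("_"),
    }


# Count how many labels in a small list come from sentence pairs vs. single sentences.
labels = ["SINGLE_S_COORD", "SINGLE_RELATIVE", "PAIR_CONN_ANAPHORA"]
source_counts = Counter(parse_discourse_type(label)["source"] for label in labels)
```

The same `Counter` pattern applied to the full `discourse_type` column gives a quick view of how skewed a subset is.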
Three examples from the dataset follow:
- coherent_first_sentence: Melvyn Douglas originally was signed to play Sam Bailey , but the role ultimately went to Walter Pidgeon .
- coherent_second_sentence: -
- incoherent_first_sentence: Melvyn Douglas originally was signed to play Sam Bailey .
- incoherent_second_sentence: The role ultimately went to Walter Pidgeon .
- discourse_type: SINGLE_S_COORD
- connective_string: , but
- has_coref_type_pronoun: 0.0
- has_coref_type_nominal: 0.0
- coherent_first_sentence: The target , which is only six feet away , serves the archer as a mirror in order to reflect the status of the archer 's mind and spirit .
- coherent_second_sentence: -
- incoherent_first_sentence: The target serves the archer as a mirror in order to reflect the status of the archer 's mind and spirit .
- incoherent_second_sentence: The target is only six feet away .
- discourse_type: SINGLE_RELATIVE
- connective_string: -
- has_coref_type_pronoun: 0.0
- has_coref_type_nominal: 0.0
- coherent_first_sentence: Rather than returning to England , Ingram stayed in the Gambia and turned to trade .
- coherent_second_sentence: However , on a visit to England in 1860 , he was declared bankrupt .
- incoherent_first_sentence: Rather than returning to England , Ingram stayed in the Gambia and turned to trade .
- incoherent_second_sentence: On a visit to England in 1860 , Ingram was declared bankrupt .
- discourse_type: PAIR_CONN_ANAPHORA
- connective_string: however ,
- has_coref_type_pronoun: 1.0
- has_coref_type_nominal: 0.0
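For a sentence fusion model, one plausible way to use such an example is to feed the two incoherent sentences as input and the coherent text as the target. The sketch below builds that pair. It is an illustration under assumptions, not the setup used in the paper: the function name `to_fusion_pair` is hypothetical, and treating a lone "-" as an empty field is an assumption based on how empty fields are displayed in the examples above.

```python
def _clean(field):
    """Strip whitespace; treat a lone "-" as an empty field (assumption)."""
    field = field.strip()
    return "" if field == "-" else field


def _join(first, second):
    """Join two fields with a space, skipping empty ones."""
    return " ".join(part for part in (_clean(first), _clean(second)) if part)


def to_fusion_pair(example):
    """Return (model_input, target) for a sentence fusion example."""
    model_input = _join(
        example["incoherent_first_sentence"],
        example["incoherent_second_sentence"],
    )
    target = _join(
        example["coherent_first_sentence"],
        example["coherent_second_sentence"],
    )
    return model_input, target


# The first example above, written as a dict.
example = {
    "coherent_first_sentence": "Melvyn Douglas originally was signed to play Sam Bailey , but the role ultimately went to Walter Pidgeon .",
    "coherent_second_sentence": "-",
    "incoherent_first_sentence": "Melvyn Douglas originally was signed to play Sam Bailey .",
    "incoherent_second_sentence": "The role ultimately went to Walter Pidgeon .",
}
model_input, target = to_fusion_pair(example)
```

For examples generated from a sentence pair, the target contains both coherent sentences separated by a space.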
The data is licensed under the Creative Commons Attribution-ShareAlike 3.0 license.
Contact:
- morgeva [at] mail.tau.ac.il
- szpektor [at] google.com