DiscoFuse

This repository contains the DiscoFuse dataset described in the paper:

  • Title: "DiscoFuse: A Large-Scale Dataset for Discourse-Based Sentence Fusion"
  • Authors: Mor Geva, Eric Malmi, Idan Szpektor, Jonathan Berant
  • https://arxiv.org/abs/1902.10526
  • Accepted as a long research paper at NAACL 2019.

If you use this dataset in your work, please cite our paper:

@InProceedings{GevaEtAl2019,
  title = {{DiscoFuse: A Large-Scale Dataset for Discourse-Based Sentence Fusion}},
  author = {Geva, Mor and Malmi, Eric and Szpektor, Idan and Berant, Jonathan},
  booktitle = {Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics},
  note = {arXiv preprint arXiv:1902.10526},
  year = {2019}
}

Dataset

DiscoFuse was created by applying a rule-based splitting method to two corpora: sports articles crawled from the Web, and Wikipedia. See the paper for a detailed description of the dataset generation process and evaluation.

DiscoFuse has two parts:

File                            Download   Source            Examples
discofuse_v1_sports.tar.gz      Link       Sports articles   44,177,443
discofuse_v1_wikipedia.tar.gz   Link       Wikipedia         16,642,323

For each part, we provide a random split into train (98% of the examples), development (1%) and test (1%) sets. Because the original data distribution is highly skewed (see the paper for details), we also provide a balanced version of each part.

Overall, each part contains 6 data subsets:

  • train
  • train_balanced
  • dev
  • dev_balanced
  • test
  • test_balanced

Data Format

All files are plain-text TSV (tab-separated values) files. Each example contains the following attributes:

coherent_first_sentence: The first sentence of the original text from which the example was generated.

coherent_second_sentence: The second sentence of the original text from which the example was generated. If the example was generated from a single sentence, this field is empty.

incoherent_first_sentence: The first sentence of the split text, generated by our rule-based method.

incoherent_second_sentence: The second sentence of the split text, generated by our rule-based method.

discourse_type: The discourse phenomenon identified in the original text. See the full list of discourse types below.

connective_string: If a connective was removed from the original text, it is given in this field. Otherwise, this field is empty.

has_coref_type_pronoun: Contains 1.0 if there was a pronoun replacement during example generation, and 0.0 otherwise.

has_coref_type_nominal: Contains 1.0 if there was a nominal replacement during example generation, and 0.0 otherwise.
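
As a rough illustration of the format above, here is a minimal sketch of reading examples with Python's standard csv module. It assumes that each file starts with a header row using exactly the attribute names listed above; the helper name read_examples and the in-memory sample are ours, for illustration only. Check the header of the downloaded files before relying on this.

```python
import csv
import io

# Column names taken from the attribute list above. That each file starts
# with a header row using exactly these names is an assumption.
COLUMNS = [
    "coherent_first_sentence",
    "coherent_second_sentence",
    "incoherent_first_sentence",
    "incoherent_second_sentence",
    "discourse_type",
    "connective_string",
    "has_coref_type_pronoun",
    "has_coref_type_nominal",
]


def read_examples(handle):
    """Yield one dict per example from an open tab-separated text stream."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # The coreference flags are stored as "1.0" / "0.0"; convert to bool.
        for flag in ("has_coref_type_pronoun", "has_coref_type_nominal"):
            row[flag] = float(row[flag]) == 1.0
        yield row


# Tiny in-memory stand-in for a real file, built from Example 1 of this README.
sample_row = [
    "Melvyn Douglas originally was signed to play Sam Bailey , but the role ultimately went to Walter Pidgeon .",
    "",
    "Melvyn Douglas originally was signed to play Sam Bailey .",
    "The role ultimately went to Walter Pidgeon .",
    "SINGLE_S_COORD",
    ", but",
    "0.0",
    "0.0",
]
sample_tsv = "\t".join(COLUMNS) + "\n" + "\t".join(sample_row) + "\n"

examples = list(read_examples(io.StringIO(sample_tsv)))
print(examples[0]["discourse_type"], examples[0]["has_coref_type_pronoun"])
```

To read a real subset, pass an open file object (opened with encoding="utf-8" and newline="") instead of the io.StringIO sample.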

Below is the list of discourse types. The prefix PAIR/SINGLE indicates whether the example was generated from two consecutive sentences or from a single sentence. CONN indicates that a connective was removed from the original text, S_COORD stands for sentence coordination, and VP_COORD for verb-phrase coordination.

  • PAIR_ANAPHORA
  • PAIR_CONN
  • PAIR_CONN_ANAPHORA
  • PAIR_NONE
  • SINGLE_APPOSITION
  • SINGLE_CATAPHORA
  • SINGLE_CONN_INNER
  • SINGLE_CONN_INNER_ANAPHORA
  • SINGLE_CONN_START
  • SINGLE_RELATIVE
  • SINGLE_S_COORD
  • SINGLE_S_COORD_ANAPHORA
  • SINGLE_VP_COORD

Please see the paper for data distribution and more details on each discourse type.
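
The naming scheme of the discourse types can be unpacked mechanically. The sketch below covers only the parts explained in this README (the PAIR/SINGLE prefix, CONN, S_COORD and VP_COORD); the function name describe_discourse_type and the keys of the returned dict are hypothetical and not part of the dataset.

```python
def describe_discourse_type(label):
    """Split a discourse_type label into the parts this README explains."""
    prefix, _, rest = label.partition("_")
    if prefix not in ("PAIR", "SINGLE") or not rest:
        raise ValueError(f"unexpected discourse type: {label!r}")
    tokens = rest.split("_")
    return {
        # PAIR: generated from two consecutive sentences; SINGLE: from one.
        "from_sentence_pair": prefix == "PAIR",
        # CONN: a connective was removed from the original text.
        "connective_removed": "CONN" in tokens,
        # S_COORD / VP_COORD: sentence / verb-phrase coordination.
        "sentence_coordination": rest.startswith("S_COORD"),
        "verb_phrase_coordination": rest.startswith("VP_COORD"),
    }


info = describe_discourse_type("PAIR_CONN_ANAPHORA")
print(info["from_sentence_pair"], info["connective_removed"])
```

Other components of the labels (for example ANAPHORA or APPOSITION) are left uninterpreted here; see the paper for their definitions.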

Examples

Example 1 (from Wikipedia portion)

  • coherent_first_sentence: Melvyn Douglas originally was signed to play Sam Bailey , but the role ultimately went to Walter Pidgeon .
  • coherent_second_sentence: -
  • incoherent_first_sentence: Melvyn Douglas originally was signed to play Sam Bailey .
  • incoherent_second_sentence: The role ultimately went to Walter Pidgeon .
  • discourse_type: SINGLE_S_COORD
  • connective_string: , but
  • has_coref_type_pronoun: 0.0
  • has_coref_type_nominal: 0.0

Example 2 (from sports portion)

  • coherent_first_sentence: The target , which is only six feet away , serves the archer as a mirror in order to reflect the status of the archer 's mind and spirit .
  • coherent_second_sentence: -
  • incoherent_first_sentence: The target serves the archer as a mirror in order to reflect the status of the archer 's mind and spirit .
  • incoherent_second_sentence: The target is only six feet away .
  • discourse_type: SINGLE_RELATIVE
  • connective_string: -
  • has_coref_type_pronoun: 0.0
  • has_coref_type_nominal: 0.0

Example 3 (from Wikipedia portion)

  • coherent_first_sentence: Rather than returning to England , Ingram stayed in the Gambia and turned to trade .
  • coherent_second_sentence: However , on a visit to England in 1860 , he was declared bankrupt .
  • incoherent_first_sentence: Rather than returning to England , Ingram stayed in the Gambia and turned to trade .
  • incoherent_second_sentence: On a visit to England in 1860 , Ingram was declared bankrupt .
  • discourse_type: PAIR_CONN_ANAPHORA
  • connective_string: however ,
  • has_coref_type_pronoun: 1.0
  • has_coref_type_nominal: 0.0
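
For a sentence-fusion model, one plausible way to use such an example is to feed the two incoherent sentences as input and take the coherent text as the target. The sketch below assumes that setup and joins sentences with a single space; the helper name to_fusion_pair is ours, and an empty second sentence is represented as an empty string, as described under Data Format.

```python
def to_fusion_pair(example):
    """Return (source, target) text for a sentence-fusion model."""

    def join(first, second):
        # Skip an empty second sentence (single-sentence examples).
        return " ".join(part for part in (first, second) if part)

    source = join(
        example["incoherent_first_sentence"],
        example["incoherent_second_sentence"],
    )
    target = join(
        example["coherent_first_sentence"],
        example["coherent_second_sentence"],
    )
    return source, target


# Example 3 from this README, restricted to the four sentence fields.
example_3 = {
    "coherent_first_sentence": "Rather than returning to England , Ingram stayed in the Gambia and turned to trade .",
    "coherent_second_sentence": "However , on a visit to England in 1860 , he was declared bankrupt .",
    "incoherent_first_sentence": "Rather than returning to England , Ingram stayed in the Gambia and turned to trade .",
    "incoherent_second_sentence": "On a visit to England in 1860 , Ingram was declared bankrupt .",
}

source, target = to_fusion_pair(example_3)
print(source)
print(target)
```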

License

The data is licensed under the Creative Commons Attribution-ShareAlike 3.0 license.

Contacts

  • morgeva [at] mail.tau.ac.il
  • szpektor [at] google.com