The enriched version of the WebNLG dataset, described in the INLG 2018 paper "Enriching the WebNLG corpus".
WebNLG is a valuable resource and benchmark for the Natural Language Generation (NLG) community. However, as other NLG benchmarks, it only consists of a collection of parallel raw representations and their corresponding textual realizations. This work aimed to provide intermediate representations of the data for the development and evaluation of popular tasks in the NLG pipeline architecture (Reiter and Dale, 2000), such as Discourse Ordering, Lexicalization, Aggregation and Referring Expression Generation.
Here are the changes per version:
- v1.5: English Lexicalization templates, introduced in the EMNLP 2019 paper "Neural data-to-text generation: A comparison between pipeline and end-to-end architectures". (August 22th, 2019)
- v1.4: full revision of the delexicalized templates. (April 1st, 2019)
- v1.3: tokenization by NLTK, leading to a better extraction of references and discourse ordering information. (January 31st, 2019)
- v1.2: annotation of the test part of the corpus. See Issue #2. (January 31st, 2019)
- v1.1: fix on some annotation mistakes. See Issue #1. (November 21st, 2018)
- v1.0: first version with annotation of the train and development parts of the corpus and German translation.
BETA
- v2.0 (BETA): Tree templates. (April 1st, 2019)
<!-- entry -->
<entry category="Monument" eid="Id5" size="3">
<!-- original tripleset -->
<originaltripleset>
<otriple>11th_Mississippi_Infantry_Monument | region | Adams_County,_Pennsylvania</otriple>
<otriple>11th_Mississippi_Infantry_Monument | established | 2000</otriple>
<otriple>11th_Mississippi_Infantry_Monument | category | Contributing_property</otriple>
</originaltripleset>
<!-- modified tripleset -->
<modifiedtripleset>
<mtriple>11th_Mississippi_Infantry_Monument | location | Adams_County,_Pennsylvania</mtriple>
<mtriple>11th_Mississippi_Infantry_Monument | established | 2000</mtriple>
<mtriple>11th_Mississippi_Infantry_Monument | category | Contributing_property</mtriple>
</modifiedtripleset>
<!-- lexical entry -->
<lex comment="good" lid="Id1">
<!-- ordered tripleset segmented in sentences -->
<sortedtripleset>
<sentence ID="1">
<striple>11th_Mississippi_Infantry_Monument | location | Adams_County,_Pennsylvania</striple>
</sentence>
<sentence ID="2">
<striple>11th_Mississippi_Infantry_Monument | established | 2000</striple>
<striple>11th_Mississippi_Infantry_Monument | category | Contributing_property</striple>
</sentence>
</sortedtripleset>
<!-- extracted referring expressions -->
<references>
<reference entity="11th_Mississippi_Infantry_Monument" number="1" tag="AGENT-1" type="description">The 11th Mississippi Infantry Monument</reference>
<reference entity="Adams_County,_Pennsylvania" number="2" tag="PATIENT-1" type="name">Adams County , Pennsylvania</reference>
<reference entity="11th_Mississippi_Infantry_Monument" number="3" tag="AGENT-1" type="pronoun">It</reference>
<reference entity="2000" number="4" tag="PATIENT-2" type="name">2000</reference>
<reference entity="Contributing_property" number="5" tag="PATIENT-3" type="name">contributing property</reference>
</references>
<!-- original text -->
<text>
The 11th Mississippi Infantry Monument which is located in Adams County, Pennsylvania. It was established in 2000 and falls under the category of contributing property.
</text>
<!-- text with delexicalized referring expressions -->
<template>
AGENT-1 which is located in PATIENT-1 . AGENT-1 was established in PATIENT-2 and falls under the category of PATIENT-3 .
</template>
<!-- lexicalization template -->
<lexicalization>
AGENT-1 which VP[aspect=simple,tense=present,voice=active,person=3rd,number=singular] be located in PATIENT-1 . AGENT-1 VP[aspect=simple,tense=past,voice=passive,person=null,number=singular] establish in PATIENT-2 and VP[aspect=simple,tense=present,voice=active,person=3rd,number=null] fall under DT[form=defined] the category of PATIENT-3 .
</lexicalization>
</lex>
</entry>
Besides the official English version of the data (en
), we also provide a silver-standard version of the corpus in German (de
). The details on how to obtain the translation is presented on the following section and on the INLG 2018 paper.
To obtain the enriched version of the dataset as available in the mentioned directory, make sure to proper set up the University of Edinburgh's Neural MT System for WMT17, publicly available here.
After the settings, update the path variable MT_PATH
in the main.sh
script before execute it:
sh main.sh
- WebNLG Reader: An easy-to-use Python reader and cleaner for the corpus.
- NeuralREG: An end-to-end approach to Referring Expression Generation trained and evaluated on the enriched WebNLG.
- DeepNLG: Project which aims to systematically compare neural pipeline and end-to-end architectures for NLG. The enriched WebNLG was used for evaluation.
The WebNLG data is licensed under the CC Attribution-Noncommercial-Share Alike 4.0 International. The original version of the dataset can be found here.
@InProceedings{ferreiraetal2018,
author = "Castro Ferreira, Thiago
and Moussallem, Diego
and Wubben, Sander
and Krahmer, Emiel",
title = "Enriching the WebNLG corpus",
booktitle = "Proceedings of the 11th International Conference on Natural Language Generation",
year = "2018",
series = {INLG'18},
publisher = "Association for Computational Linguistics",
address = "Tilburg, The Netherlands",
}
@InProceedings{gardentetal017,
author = "Gardent, Claire
and Shimorina, Anastasia
and Narayan, Shashi
and Perez-Beltrachini, Laura",
title = {Creating Training Corpora for {NLG} Micro-Planners},
booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) ",
series = {ACL'17},
year = "2017",
publisher = "Association for Computational Linguistics",
pages = "179--188",
address = "Vancouver, Canada",
doi = "10.18653/v1/P17-1017",
url = "http://www.aclweb.org/anthology/P17-1017"
}