DISRPT/sharedtask2019

Repository for the DISRPT2019 shared task on Discourse Unit Segmentation

Introduction

The DISRPT 2019 workshop introduces the first iteration of a cross-formalism shared task on discourse unit segmentation. Since all major discourse parsing frameworks presuppose a division of texts into segments, learning segmentations for and from diverse resources is a promising area for converging methods and insights. We provide training, development and test datasets from all available languages and treebanks in the RST, SDRT and PDTB formalisms, using a uniform format. Because different corpora, languages and frameworks use different guidelines for segmentation, the shared task is meant to promote the design of flexible methods for dealing with various guidelines, and to help push forward the discussion of standards for discourse units. For datasets that have treebanks, we will evaluate in two different scenarios: with and without gold syntax, or otherwise using provided automatic parses for comparison.

https://sites.google.com/view/disrpt2019/

Shared task participants are encouraged to follow this repository in case bugs are found and need to be fixed.

Types of data

The tasks are oriented towards finding the locus of discourse relations in texts. For frameworks that segment text into non-overlapping spans covering each entire document (RST and SDRT), the task corresponds to finding the starting point of each discourse unit. For PDTB-style datasets, the task is to identify the spans of discourse connectives that explicitly identify the existence of a discourse relation.
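As a rough illustration of the two task types, the sketch below encodes both as per-token labels and scores predicted unit starts with precision, recall and F1. The label names (`B`, `I`, `O`), the example tokens and the helpers `boundary_prf` and `connective_spans` are assumptions made for illustration only; they are not the shared task's file format, and the official scorer remains `utils/seg_eval.py`.

```python
from typing import List, Tuple


def boundary_prf(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over positions marked 1 (unit starts)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    n_pred = sum(pred)
    n_gold = sum(gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def connective_spans(labels: List[str]) -> List[Tuple[int, int]]:
    """Turn hypothetical B/I/O labels into (start, end) token spans, end exclusive."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "B":
            # A new span begins; close any span that is still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == "O":
            if start is not None:
                spans.append((start, i))
                start = None
        # "I" extends an open span; a stray "I" with no open span is ignored.
    if start is not None:
        spans.append((start, len(labels)))
    return spans


# RST/SDRT-style: 1 marks the first token of a discourse unit.
tokens = ["She", "left", "because", "it", "rained", "."]
gold_starts = [1, 0, 1, 0, 0, 0]
pred_starts = [1, 0, 0, 1, 0, 0]
print(boundary_prf(gold_starts, pred_starts))

# PDTB-style: label the tokens that belong to an explicit connective.
conn_labels = ["O", "O", "B", "O", "O", "O"]
print(connective_spans(conn_labels))
```

Scoring only the positive positions keeps the many non-boundary tokens from inflating the result, which is why a plain token accuracy is not used in this sketch.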

Submitting a system

Systems should be accompanied by a regular workshop paper in the ACL format, as described on the workshop website. During submission, you will be asked to supply a URL from which your system can be downloaded. If your system does not download necessary resources by itself (e.g. word embeddings), these resources should be included at the download URL. The system download should include a README file describing exactly how paper results can be reproduced. Please do not supply pre-trained models, but rather instructions on how to train the system using the downloaded resources. For any questions regarding system submissions, please contact the organizers.

Important dates

  • Fri, December 28, 2018 - shared task sample data release
  • Mon, January 21, 2019 - training data release
  • Fri, February 15, 2019 - test data release
  • Thu, February 28, 2019 - papers due (shared task & regular workshop papers)
  • Wed, March 27, 2019 - notification of acceptance
  • Fri, April 5, 2019 - camera-ready papers due
  • Thu, June 6, 2019 - workshop

Directories

The shared task repository currently comprises the following directories (to be extended as the task progresses):

  • data - individual corpora from various languages and frameworks.
    • Folders are given names in the scheme LANG.FRAMEWORK.CORPUS, e.g. eng.rst.gum is the directory for the GUM corpus, which is in English and annotated in the framework of Rhetorical Structure Theory (RST).
    • Note that four corpora (eng.rst.rstdt, eng.pdtb.pdtb, tur.pdtb.tdb, zho.pdtb.cdtb) do not contain the underlying text, which therefore needs to be reconstructed using utils/process_underscores.py.
  • sample - sample data to illustrate formats, provided ahead of the release of training data (this data will be included in training data).
  • utils - scripts for validating, evaluating and generating data formats. The official scorer is seg_eval.py.

See the README files in individual data directories for more details on each dataset.
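The LANG.FRAMEWORK.CORPUS naming scheme described above can be split mechanically, for example when looping over the data directory. The following is a minimal sketch; `CorpusName`, `parse_corpus_dir` and `UNDERSCORED` are hypothetical names that do not come from the repository's utils scripts.

```python
from typing import NamedTuple


class CorpusName(NamedTuple):
    lang: str
    framework: str
    corpus: str


def parse_corpus_dir(name: str) -> CorpusName:
    """Split a LANG.FRAMEWORK.CORPUS folder name into its three parts."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected LANG.FRAMEWORK.CORPUS, got {name!r}")
    return CorpusName(*parts)


# The four corpora named above as not containing text.
UNDERSCORED = {"eng.rst.rstdt", "eng.pdtb.pdtb", "tur.pdtb.tdb", "zho.pdtb.cdtb"}

for folder in ["eng.rst.gum", "zho.pdtb.cdtb"]:
    info = parse_corpus_dir(folder)
    needs_text = folder in UNDERSCORED
    print(info.lang, info.framework, info.corpus, "needs text reconstruction:", needs_text)
```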

Surprise language

At the release of the test data, a surprise language dataset was added: Turkish connective data from the Turkish Discourse Bank (TDB). Since the existence of the surprise language was not originally announced, and some systems may not have resources to handle the additional language, evaluation on the Turkish dataset is optional and will be reported separately in the shared task results.

Statistics

| corpus | lang | framework | train_toks | train_sents | train_docs | dev_toks | dev_sents | dev_docs | test_toks | test_sents | test_docs | total_toks | total_sents | total_docs | seg_style | underscored | SpaceAfter | syntax | has_multitoks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| deu.rst.pcc | deu | rst | 26,831 | 1,773 | 142 | 3,152 | 207 | 17 | 3,239 | 213 | 17 | 33,222 | 2,193 | 176 | EDU | no | no | UD | no |
| eng.pdtb.pdtb | eng | pdtb | 1,061,222 | 44,563 | 1,992 | 39,768 | 1,703 | 79 | 55,658 | 2,364 | 91 | 1,156,648 | 48,630 | 2,162 | Conn | yes | no | UD(gold) | no |
| eng.rst.gum | eng | rst | 67,098 | 3,600 | 78 | 15,593 | 784 | 18 | 15,924 | 890 | 18 | 98,615 | 5,274 | 114 | EDU | no | yes | UD(gold) | no |
| eng.rst.rstdt | eng | rst | 166,849 | 6,672 | 309 | 17,309 | 717 | 38 | 21,666 | 929 | 38 | 205,824 | 8,318 | 385 | EDU | yes | no | UD(gold) | no |
| eng.sdrt.stac | eng | sdrt | 36,445 | 7,689 | 29 | 4,747 | 981 | 6 | 6,549 | 1,350 | 6 | 47,741 | 10,020 | 41 | EDU | no | no | UD | no |
| eus.rst.ert | eus | rst | 21,122 | 990 | 84 | 7,533 | 350 | 28 | 6,658 | 320 | 28 | 35,313 | 1,660 | 140 | EDU | no | no | other | no |
| fra.sdrt.annodis | fra | sdrt | 22,278 | 880 | 64 | 4,987 | 227 | 11 | 5,146 | 211 | 11 | 32,411 | 1,318 | 86 | EDU | no | no | UD | no |
| nld.rst.nldt | nld | rst | 17,566 | 1,202 | 56 | 3,789 | 257 | 12 | 3,565 | 248 | 12 | 24,920 | 1,707 | 80 | EDU | no | no | UD | no |
| por.rst.cstn | por | rst | 44,808 | 1,595 | 110 | 6,233 | 232 | 14 | 3,615 | 123 | 12 | 54,656 | 1,950 | 136 | EDU | no | yes | UD | yes |
| rus.rst.rrt | rus | rst | 214,484 | 9,859 | 140 | 29,412 | 1,327 | 19 | 28,768 | 1,327 | 19 | 272,664 | 12,513 | 178 | EDU | no | no | UD | no |
| spa.rst.rststb | spa | rst | 43,034 | 1,577 | 203 | 7,531 | 256 | 32 | 8,026 | 303 | 32 | 58,591 | 2,136 | 267 | EDU | no | no | UD | no |
| spa.rst.sctb | spa | rst | 10,249 | 304 | 32 | 2,450 | 74 | 9 | 3,813 | 100 | 9 | 16,512 | 478 | 50 | EDU | no | no | UD | no |
| tur.pdtb.tdb | tur | pdtb | 398,203 | 25,080 | 159 | 50,272 | 3,179 | 19 | 47,880 | 2,937 | 19 | 496,355 | 31,196 | 197 | Conn | yes | yes | UD | yes |
| zho.pdtb.cdtb | zho | pdtb | 52,061 | 2,049 | 125 | 11,178 | 438 | 21 | 10,075 | 404 | 18 | 73,314 | 2,891 | 164 | Conn | yes | no | other(gold) | no |
| zho.rst.sctb | zho | rst | 8,960 | 344 | 32 | 2,102 | 87 | 9 | 3,380 | 132 | 9 | 14,442 | 563 | 50 | EDU | no | no | UD(V1) | no |