
German Drama Corpus for Coreference



General Information

GerDraCor-Coref (German Drama Corpus for Coreference) is a fork of GerDraCor and contains coreference annotations for a subset of the GerDraCor texts. All texts are German dramatic texts written between 1730 and 1920. All noun phrases are annotated; singletons were removed. Generic entities, abstract anaphora and ambiguous mentions are additionally marked explicitly. Abstract anaphora and ambiguous mentions have been annotated in only a part of the corpus.

File Naming

The file names are composed of a short form of the title of the play and a file ending indicating the format, e.g. Rosenkavalier.xmi, Rosenkavalier.xml, Rosenkavalier.conll for "Der Rosenkavalier" by Hugo von Hofmannsthal. A full list of file names and the corresponding plays is given in plays.csv.

Partial Annotations

Some texts have been annotated only in part, covering one or more acts. The annotated act(s) are indicated in the filename, e.g. Manuscript_Act5.xmi. If the full text was annotated, no special marker is applied, e.g. Sara.xmi.

Parallel Annotations

To make inter-annotator agreement studies possible, single acts were annotated in parallel by distinct annotators. These annotations are located in the folder parallel_annotations, and the annotator and act are additionally indicated in the filename, e.g. Sara-AS_Act1.xmi.
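The naming conventions above can be sketched as a small parser. This is only an illustration: the pattern `<Play>[-<Annotator>][_Act<N>].<ext>` is inferred from the three examples given in this README, and the names `FILENAME_RE`, `AnnotationFile` and `parse_filename` are hypothetical. Files that cover several acts, or play names containing hyphens or underscores, may follow a scheme this sketch does not handle.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred from the README examples only:
#   Sara.xmi, Manuscript_Act5.xmi, Sara-AS_Act1.xmi
FILENAME_RE = re.compile(
    r"^(?P<play>[^_.\-]+)"           # short title of the play
    r"(?:-(?P<annotator>[^_.]+))?"   # optional annotator initials
    r"(?:_Act(?P<act>\d+))?"         # optional act marker
    r"\.(?P<ext>xmi|xml|conll)"      # format ending
    r"(?:\.gz)?$"                    # XMI files may be gzip-compressed
)


class AnnotationFile(NamedTuple):
    play: str
    annotator: Optional[str]
    act: Optional[int]
    ext: str


def parse_filename(name: str) -> AnnotationFile:
    """Split a corpus file name into play, annotator, act and format."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"File name does not fit the assumed pattern: {name}")
    act = match.group("act")
    return AnnotationFile(
        play=match.group("play"),
        annotator=match.group("annotator"),
        act=int(act) if act is not None else None,
        ext=match.group("ext"),
    )
```

A file without annotator or act marker, such as Sara.xmi, yields `None` for both fields, which corresponds to a fully annotated text in the main annotations.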


All files are encoded in UTF-8 Unicode.


We provide several formats to represent the coreference annotations:

  • XMI
  • TEI
  • CoNLL 2012

For texts that have not been fully annotated, we additionally provide TEI output that covers only the annotated parts. The CoNLL output always contains only the annotated parts. The XMI output always contains the full text.
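As a rough sketch of how the CoNLL output might be consumed, the following function collects mention spans per entity. It assumes the usual CoNLL 2012 convention, namely that the coreference information sits in the last column, with `(3` opening a mention of entity 3, `3)` closing it, `(3)` marking a one-token mention, and `|` separating several markers on one token. Whether the files in this corpus match that layout in every detail has not been checked here, and `read_chains` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def read_chains(lines: Iterable[str]) -> Dict[int, List[Tuple[int, int]]]:
    """Return {entity id: [(first token, last token), ...]}.

    Token positions are counted over the whole input, starting at 0,
    and are not reset at sentence boundaries.
    """
    chains: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    open_starts: Dict[int, List[int]] = defaultdict(list)
    token_idx = 0
    for line in lines:
        line = line.strip()
        # Skip sentence breaks and "#begin document" / "#end document" lines.
        if not line or line.startswith("#"):
            continue
        coref = line.split()[-1]  # assumption: coreference is the last column
        if coref not in ("-", "_"):
            for part in coref.split("|"):
                entity = int(part.strip("()"))
                if part.startswith("("):
                    open_starts[entity].append(token_idx)
                if part.endswith(")"):
                    start = open_starts[entity].pop()
                    chains[entity].append((start, token_idx))
        token_idx += 1
    return dict(chains)
```

It could be called on an open file, for example `read_chains(open("conll/Sara.conll", encoding="utf-8"))`, where the path is only an example following the folder layout of this repository.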


As the XMI files can become quite large, they have been compressed using gzip. To uncompress a file, run the following on the command line:

$ gzip -d <FILENAME>.xmi.gz
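To uncompress all XMI files of a folder in one go, something along the following lines could be used. This is a minimal sketch using only the Python standard library; the function name `gunzip_all` and its `keep_original` parameter are hypothetical and not part of this repository. Unlike `gzip -d`, it keeps the compressed files by default.

```python
import gzip
import shutil
from pathlib import Path
from typing import List


def gunzip_all(folder: str, keep_original: bool = True) -> List[Path]:
    """Uncompress every *.xmi.gz file in `folder` next to its archive."""
    written = []
    for gz_path in sorted(Path(folder).glob("*.xmi.gz")):
        target = gz_path.with_suffix("")  # "Sara.xmi.gz" -> "Sara.xmi"
        with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)  # stream, so large files fit in memory
        if not keep_original:
            gz_path.unlink()
        written.append(target)
    return written


# Example call, assuming the script is run from the repository root:
# gunzip_all("xmi")
```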


DIRNDL is a file format based on the CoNLL format that additionally contains a speaker column (among others).

Running the export scripts

The manual annotations are stored in the XMI format; all other formats are exported automatically using CorefAnnotator, DramaNLP and Python. DramaNLP needs to be compiled following the instructions at https://github.com/quadrama/DramaNLP#compiling-from-source. The paths to CorefAnnotator and DramaNLP need to be set as described in the scripts create-tei.sh and create-conll.sh. To reproduce the export of the formats included in this corpus, run the scripts as follows:

$ sh create-tei.sh
$ sh create-conll.sh
$ python3 split_tei.py tei/ tei/part/ --addScenes
$ sh split_tei_parallel.sh

There is also a makefile that runs the entire pipeline for convenience:

$ make all


The annotations are sorted into folders according to the different output formats. Organizing the parallel annotations by different annotators into branches in the git tree is still a ToDo; for now they are located in the folder parallel_annotations. The main annotations are located in the gold branch. Partial annotations are placed in a subfolder called part within the respective format folder.

Folder structure

$ tree -d
├── conll
│   └── part
├── parallel_annotations
│   ├── conll
│   ├── tei
│   └── xmi
├── tei
│   ├── full
│   └── part
└── xmi


$ git branch
* gold


If you are using GerDraCor-Coref for a publication, please refer to the following paper:

   author    = {Janis Pagel and Nils Reiter},
   booktitle = {{Proceedings of the Language Resources and Evaluation Conference (LREC)}},
   location  = {Marseille, France},
   month     = {5},
   pages     = {55--64},
   title     = {{GerDraCor-Coref: A Coreference Corpus for Dramatic Texts in German}},
   url       = {http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.7.pdf},
   year      = {2020},


Like GerDraCor, GerDraCor-Coref is released under the Creative Commons Zero copyright waiver CC0.


We appreciate contributions regarding extensions, bug fixes and the like. Please feel free to create issues or pull requests.

