A historical German-language corpus (1840-1919) of fictional and non-fictional texts, annotated for speech, thought and writing representation (STWR).
The corpus was created by the DFG-funded project "Redewiedergabe - eine literatur- und sprachwissenschaftliche Korpusanalyse" (Leibniz Institute for the German Language / University of Würzburg). Homepage: www.redewiedergabe.de
The following publication complements this technical description with in-depth discussion of corpus design and annotation. Please cite it when using the corpus:
The detailed annotation guidelines developed by project Redewiedergabe are available at redewiedergabe.de/richtlinien/richtlinien.html (in German).
If you encounter any issues or have any questions, please use GitHub's issue tracker.
Project Redewiedergabe also provides automatic taggers for German STWR, trained (mostly) on this corpus.
The corpus REDEWIEDERGABE (and the additional material) is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
We ask you to mention project "Redewiedergabe" regarding the annotation, and project TextGrid, Deutsches Textarchiv, Leibniz-Institut für Deutsche Sprache and Staats- und Universitätsbibliothek Bremen regarding the texts.
Corpus | Samples | Tokens | STWR instances | Notes |
---|---|---|---|---|
Main corpus | 838 | 489,459 | 12,123 | Detailed statistical data |
Main corpus (Beta release) | 619 | 360,974 | 9,451 | Detailed statistical data; Differences to the final release |
This is a collection of several types of additional annotated material produced by project Redewiedergabe. The material generally follows the same annotation guidelines and is available in the same formats as the core corpus, but it has some idiosyncrasies and underwent less quality control. For additional information about the individual corpus parts, follow the links in the table.
Material | Files | Tokens | STWR instances | Notes |
---|---|---|---|---|
Single-annotated samples | 258 | 150,162 | 4,395 | Annotations only by a single annotator |
Single-annotated full texts (fictional) | 18 | 235,493 | 6,232 | Annotations only by a single annotator; NOTE: Annotation guidelines differ slightly with respect to speaker |
Single-annotated full texts (non-fictional) | 15 | 84,769 | 1,472 | Annotations only by a single annotator |
Indirect full texts | 16 | 51,864 | 272 | Only instances of indirect STWR with a simplified annotation system |
Free indirect full texts (fictional) | 142 | 2,647,924 | 2,136 | Only instances of free indirect STWR with a simplified annotation system; semi-automated annotation |
Primary annotations of the core corpus | 1,704 | 989,384 | 27,297 | Collection of all individual annotations of the core corpus |
KONVENS 2020 data | | | | Data splits used for the STWR taggers, as described in the KONVENS 2020 paper |
The core corpus is available in three different formats:
- Column-based text format
- XML format
- XMI format (Not available for the Beta release)
NOTE: The XMI files are compatible with the free annotation tool ATHEN (developed by Markus Krug in the Kallimachos project) and its STWR view (developed by Tanja Tu).
The core corpus REDEWIEDERGABE and the additional material were created by the DFG-funded project "Redewiedergabe. Eine literatur- und sprachwissenschaftliche Korpusanalyse" (2017-2020) in cooperation between Leibniz-Institut für Deutsche Sprache, Mannheim (Abteilung Lexik) and Universität Würzburg (Lehrstuhl für Computerphilologie und Neuere Deutsche Literaturgeschichte).
Project members: Annelen Brunner (IDS Mannheim), Stefan Engelberg (IDS Mannheim), Fotis Jannidis (Universität Würzburg), Ngoc Duyen Tanja Tu (IDS Mannheim), Lukas Weimer (Universität Würzburg).
In addition, the following people participated in the annotation: Sarah Gorke, Anna Hartmann, Janne Lorenzen, Christoph Peterek, Laura Schäfer, Lisa Sergel and Theresa Valta.
Project homepage: www.redewiedergabe.de
A list of all publications can be found here.
The core corpus REDEWIEDERGABE is a historical corpus of fictional and non-fictional texts. These texts were published between 1840 and 1919 and were compiled from the following three sources:
- Narrative texts from the 'Digitale Bibliothek', converted to TEI format by project TextGrid
- Texts from the magazine "Die Grenzboten", digitized by Universitätsbibliothek Bremen (Source: Die Grenzboten: Zeitschrift für Politik, Literatur und Kunst. Berlin: Dt. Verl, 1841-1922. Staats- und Universitätsbibliothek Bremen, Ac 7155 Public Domain Mark 1.0), TEI structuring by Deutsches Textarchiv and OCR correction by project "Redewiedergabe".
- Texts from the "Mannheimer Korpus Historischer Zeitungen und Zeitschriften" (MKHZ; Mannheim corpus of historical newspapers and magazines), collected by the Leibniz-Institut für Deutsche Sprache and converted by Deutsches Textarchiv.
The corpus consists of text samples rather than complete texts. Each sample is at least 500 tokens long for texts from the Digitale Bibliothek and at least 200 tokens long for newspaper/magazine texts. The samples were drawn randomly from the available material, subject to the following additional rules: for the texts from the Digitale Bibliothek, material by each author was considered evenly within a decade; likewise, for the texts from MKHZ, the different newspapers/magazines were considered evenly. This prevented authors or newspapers with little material from dropping out entirely during sampling.
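The sampling idea can be pictured with a minimal Python sketch. This is not the project's actual sampling code: the function `draw_samples`, the dictionary keys `author`, `decade` and `tokens`, and the round-robin strategy are assumptions made for illustration, and the sketch cuts windows of exactly `sample_len` tokens and may reuse a text, which the real procedure need not do.

```python
import random
from collections import defaultdict


def draw_samples(texts, n_samples, sample_len, seed=0):
    """Draw token windows so that every (decade, author) group is visited evenly.

    `texts` is a list of dicts with the hypothetical keys
    'author', 'decade' and 'tokens' (a list of strings).
    """
    rng = random.Random(seed)

    # Group usable texts by decade and author; texts that are too short are skipped.
    groups = defaultdict(list)
    for text in texts:
        if len(text["tokens"]) >= sample_len:
            groups[(text["decade"], text["author"])].append(text)
    if not groups:
        return []

    keys = sorted(groups)
    rng.shuffle(keys)

    # Round-robin over the groups, so an author with a single text is drawn
    # as often as an author with many texts.
    samples = []
    while len(samples) < n_samples:
        for key in keys:
            if len(samples) == n_samples:
                break
            text = rng.choice(groups[key])
            start = rng.randrange(len(text["tokens"]) - sample_len + 1)
            samples.append({
                "author": text["author"],
                "decade": text["decade"],
                "tokens": text["tokens"][start:start + sample_len],
            })
    return samples


# Toy data: author A has ten long texts, author B only one.
texts = [{"author": "A", "decade": 1850, "tokens": ["w"] * 2000} for _ in range(10)]
texts.append({"author": "B", "decade": 1850, "tokens": ["w"] * 600})
samples = draw_samples(texts, n_samples=4, sample_len=500)
```

With purely random selection over all texts, author B would rarely be chosen; visiting the groups in turn keeps B represented, which is the effect the rules above aim for.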
Each sample contains metadata with information about the publication time, text type and fictionality status. Author and title are provided if available (more information: Metadata).
The core corpus contains detailed annotation of instances of speech, thought and writing representation (STWR). We distinguish four main types: direct STWR (Er sagte: "Ich bin hungrig."), indirect STWR (Er sagte, er sei hungrig.), free indirect STWR (Wo sollte er jetzt etwas zu essen herbekommen?) and reported STWR (Er sprach über Restaurants.). We also distinguish three media: speech, thought and writing. In addition, we annotate attributes such as embedding level, non-factual STWR, borderline cases, and pragmatic and metaphoric use, as well as frames, introductory expressions and speakers.
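As a rough mental model, the main categories can be written down as a small Python data structure. The class and field names below are hypothetical and do not mirror the attribute names used in the corpus files (see Annotation structure for those); the medium assigned to each example sentence is our own reading, and each `text` holds the whole example sentence rather than the exact annotated span.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StwrType(Enum):
    DIRECT = "direct"
    INDIRECT = "indirect"
    FREE_INDIRECT = "free indirect"
    REPORTED = "reported"


class Medium(Enum):
    SPEECH = "speech"
    THOUGHT = "thought"
    WRITING = "writing"


@dataclass(frozen=True)
class StwrInstance:
    """One annotated STWR instance (illustrative field names only)."""
    text: str
    stwr_type: StwrType
    medium: Medium
    embedding_level: Optional[int] = None  # left unset; numbering scheme not assumed here
    non_factual: bool = False
    borderline: bool = False


# The four example sentences from the description above.
examples = [
    StwrInstance('Er sagte: "Ich bin hungrig."', StwrType.DIRECT, Medium.SPEECH),
    StwrInstance("Er sagte, er sei hungrig.", StwrType.INDIRECT, Medium.SPEECH),
    StwrInstance("Wo sollte er jetzt etwas zu essen herbekommen?",
                 StwrType.FREE_INDIRECT, Medium.THOUGHT),
    StwrInstance("Er sprach über Restaurants.", StwrType.REPORTED, Medium.SPEECH),
]

for instance in examples:
    print(f"{instance.stwr_type.value:>13} / {instance.medium.value}: {instance.text}")
```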
Each sample of the main corpus was annotated independently by two different people. The final annotation was created by a third person on the basis of those annotations. The underlying first annotations are also available (see primary annotations).
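To illustrate what the third person works from, the following Python sketch compares two independent annotations of one sample and separates agreements from disagreements. It is an illustration only, not the project's tooling: representing an annotation as a `(start_token, end_token, stwr_type)` tuple and the helper `compare_annotations` are assumptions.

```python
def compare_annotations(first, second):
    """Split two annotators' span sets into agreements and disagreements.

    Each argument is a set of (start_token, end_token, stwr_type) tuples;
    this representation is hypothetical and chosen for brevity.
    """
    agreed = first & second        # identical span and type in both annotations
    only_first = first - second    # found (or typed differently) only by annotator 1
    only_second = second - first   # found (or typed differently) only by annotator 2
    return agreed, only_first, only_second


annotator_1 = {(3, 7, "direct"), (12, 18, "indirect"), (25, 30, "reported")}
annotator_2 = {(3, 7, "direct"), (12, 18, "free indirect")}

agreed, only_1, only_2 = compare_annotations(annotator_1, annotator_2)
print("agreed:", sorted(agreed))
print("to be resolved:", sorted(only_1 | only_2))
```

Here the span at tokens 12 to 18 appears in both disagreement sets because the annotators chose different types, and the span at tokens 25 to 30 was marked by only one annotator; both cases would be left for the third person to decide.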
The detailed annotation guidelines are available at redewiedergabe.de/richtlinien/richtlinien.html (in German).
An overview of the structure of the annotations is available at Annotation structure.