
Fast Encoding of Theater in TEI: Automatic TEI generation based on OCR output

Primary language: HTML. License: GNU General Public License v3.0 (GPL-3.0).

FETE: Fast Encoding of Theater in TEI


Introduction

FETE is an application that generates TEI for the body of theater plays in Alsatian from OCR output. It takes HOCR or ALTO files as input and outputs TEI for the play's body: <div> elements for acts and scenes (with the relevant @type attribute), stage directions (<stage>), and character speech turns (<sp> elements and their children, including the <speaker> element). When speech is in verse, lines are encoded as <l> elements.
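To make the target structure concrete, here is a minimal sketch that builds the kind of element tree described above with Python's standard library. It is not FETE's own code: the function name `build_scene`, the @type values "act" and "scene", the use of <p> for prose speech and the placeholder text are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_scene(stage_direction, turns):
    """Build an act <div> containing one scene <div>.

    `turns` is a list of (speaker, lines, is_verse) tuples.
    Assumption: @type is "act"/"scene"; prose speech goes in <p>.
    """
    act = ET.Element("div", {"type": "act"})
    scene = ET.SubElement(act, "div", {"type": "scene"})
    if stage_direction:
        # Stage direction opening the scene
        ET.SubElement(scene, "stage").text = stage_direction
    for speaker, lines, is_verse in turns:
        sp = ET.SubElement(scene, "sp")
        ET.SubElement(sp, "speaker").text = speaker
        for text in lines:
            # Verse lines become <l>; prose becomes <p> (assumed here)
            ET.SubElement(sp, "l" if is_verse else "p").text = text
    return act


if __name__ == "__main__":
    act = build_scene(
        "A room.",
        [("FIRST SPEAKER", ["First verse line", "Second verse line"], True)],
    )
    print(ET.tostring(act, encoding="unicode"))
```

Running the script prints a single serialized <div type="act"> element, which is a convenient way to compare a hand-built expectation against what the tool predicts for a short passage.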

The <teiHeader> element needs to be encoded separately, as do the play's <castList> element, any other front matter preceding the play's first scene, and any back matter following the last scene.

Inspired by earlier literature (e.g. Grobid), the tool uses Conditional Random Fields (CRF) as implemented in sklearn-crfsuite. Lexical and typographical cues present in the OCR output, together with token coordinates on the page, are exploited to generate TEI elements.
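The kind of per-token features mentioned above can be sketched as follows. This is a hedged illustration, not the feature set in sklearn_crfsuite/features.py: the function names, the feature names and the choice of cues are assumptions. The only format detail relied on is that hOCR stores token coordinates in the `title` attribute as `bbox x0 y0 x1 y1`.

```python
import re

# hOCR keeps coordinates in the title attribute, e.g. "bbox 500 100 620 140; x_wconf 95"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None if absent."""
    match = BBOX_RE.search(title)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


def token_features(tokens, i, page_width):
    """Feature dict for token i; `tokens` is a list of (text, title) pairs.

    Hypothetical features: typographic cues plus horizontal position.
    """
    text, title = tokens[i]
    x0, _, _, _ = parse_bbox(title) or (0, 0, 0, 0)
    feats = {
        "lower": text.lower(),
        "is_upper": text.isupper(),          # e.g. speaker names in capitals
        "is_title": text.istitle(),
        "starts_with_paren": text.startswith("("),  # possible stage direction
        "ends_with_punct": text.endswith((".", ":")),
        # Relative left offset: centered headings vs. left-aligned speech
        "rel_x0": round(x0 / page_width, 2) if page_width else 0.0,
    }
    if i == 0:
        feats["BOS"] = True
    else:
        feats["prev_lower"] = tokens[i - 1][0].lower()
    return feats
```

A list of such dicts per sequence, paired with a list of label strings, is the input shape that `sklearn_crfsuite.CRF.fit(X, y)` expects; how FETE segments sequences and names its labels is not shown here.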

The tool was developed by Andrew Briand (University of Washington), in the context of work supervised by Pablo Ruiz within the Methal project (University of Strasbourg); the project is creating a large TEI-encoded corpus of theater in Alsatian varieties.

Application structure

  • example: example input, XML output obtained with it and CRF model used to predict the output.
  • hocr2alto: Scripts to convert between HOCR and ALTO formats.
    • Usage is documented in the script
    • Requires the ocr-fileformat package
  • sklearn_crfsuite: The main program is in this directory, see Generating TEI and Training a model below for its usage.
  • utils: Some scripts for common manipulations to HOCR and TEI documents. Usage described in the scripts.

Requirements

The tool requires the packages listed in requirements.txt. To install them, you can run pip install -r requirements.txt from the directory where requirements.txt resides.

It is not required, but it is good practice, to create a virtual environment for projects using the tool and to install the requirements there. To create an environment, you can use venv or, if you have Anaconda, run conda create --name fete python=3.12. Then activate the environment (conda activate fete) and run pip install -r requirements.txt once the environment is active.

Generating TEI

To generate TEI based on a directory of HOCR files, use the following command from within the sklearn_crfsuite directory:

python main.py MODEL_PATH HOCR_DIRECTORY OUTPUT_TEI_PATH

For instance:

python main.py ../example/models/model-exp3-20221226.crf ../example/inputs/hocr-verbotte-fahne ../example/outputs/verbotte-fahne-exp3.xml

This will predict the TEI file ../example/outputs/verbotte-fahne-exp3.xml from the HOCR files in ../example/inputs/hocr-verbotte-fahne.

Training a model

Training data consist of HOCR files and manually corrected TEI for them.

At the moment, the training corpus must be located in predefined directories inside sklearn_crfsuite:

  • html: The HOCR (from which the features are computed)
  • tei: The TEI (the labels to predict)

To train a model, use the following command from within sklearn_crfsuite:

python train.py html tei MODEL_OUTPUT_PATH

For instance:

python train.py html tei ../example/models/model-exp3-new.crf

The exp3 infix in the model filename was used for the following reason: several feature combinations were implemented in the tool. The best one was called exp3 and this model was trained with it, so we chose to include exp3 in the filename. Output-file naming is manual.

Postprocessing the output XML

Let's show this with an example. If you trained a model using the example command above and use it to predict TEI for ../example/inputs/hocr-verbotte-fahne, your results should reproduce ../example/outputs/verbotte-fahne-exp3.xml.

The prediction doesn't look bad, but you'll see it is not valid XML. This is because the model is designed to handle the play's body, from the start of the first act to the final curtain, but not the front matter and back matter that may precede and follow it. Since we did not remove the HOCR files for the front matter and back matter, the model tried to generate TEI from them, and errors were to be expected. Once the portions generated from the front matter and back matter are removed, the file will be valid XML. You can see the difference by comparing ../example/outputs/verbotte-fahne-exp3.xml with ../example/outputs/verbotte-fahne-exp3-postpro.xml.
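A quick way to tell whether postprocessing is still needed is to try parsing the output. The sketch below uses only Python's standard library; the helper name `check_well_formed` is hypothetical and not part of FETE. Note that it tests well-formedness only, not validity against a TEI schema.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_data):
    """Return (True, None) if xml_data parses, else (False, error message).

    `xml_data` may be a str or bytes. Only well-formedness is checked;
    no TEI schema validation is performed.
    """
    try:
        ET.fromstring(xml_data)
    except ET.ParseError as err:
        # The message includes the line and column of the first problem
        return False, str(err)
    return True, None


if __name__ == "__main__":
    print(check_well_formed("<div><sp><speaker>A</speaker></sp></div>"))
    print(check_well_formed("<div><sp></div>"))
```

For a file on disk, passing its bytes (for example `Path(path).read_bytes()`) lets the parser honor any encoding declaration. The reported line and column point at the first place where the parser gave up, which is usually close to a fragment generated from front or back matter.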

Instead of postprocessing the output XML by removing the front- and backmatter content, we could also remove the input HOCR files (or paragraphs if the body does not start and end on its own page) for such content before generating the XML output.
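The alternative described above, removing front- and back-matter pages before prediction, could be sketched like this. The function name and the example filenames are hypothetical, and the sketch assumes one HOCR file per page whose filenames sort in page order; it does not handle the case where the body starts or ends in the middle of a page.

```python
import shutil
from pathlib import Path


def copy_body_pages(hocr_dir, body_dir, first_body, last_body):
    """Copy the HOCR files from first_body to last_body (inclusive) into body_dir.

    Assumption: sorting filenames reproduces page order.
    Returns the list of copied filenames.
    """
    files = sorted(p for p in Path(hocr_dir).iterdir() if p.is_file())
    names = [p.name for p in files]
    start, end = names.index(first_body), names.index(last_body)
    if start > end:
        raise ValueError("first_body must not come after last_body")
    target = Path(body_dir)
    target.mkdir(parents=True, exist_ok=True)
    selected = files[start:end + 1]
    for path in selected:
        shutil.copy2(path, target / path.name)
    return [p.name for p in selected]
```

The resulting directory can then be passed as HOCR_DIRECTORY to main.py, leaving the original OCR output untouched.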

Adapting to other languages

The lexical cues used by the tool are currently suitable for Alsatian theater. Paratext in Alsatian theater is often in German and sometimes in French. Accordingly, lexical cues are currently provided for Alsatian varieties as well as for German and French.

The tool's lexical features (see sklearn_crfsuite/features.py) could be adapted to further languages. For training, a corpus of plays in HOCR (or ALTO) and their corresponding TEI-encoded versions is needed (see Training a model above).
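As an illustration of what adapting lexical cues might involve, the sketch below keeps cue words in a per-language table and turns matches into features. It does not reproduce the contents of sklearn_crfsuite/features.py: the table `STRUCTURE_CUES`, the function `cue_features`, the feature names and the word lists are all assumptions, and the English entry stands in for a hypothetical newly added language.

```python
# Hypothetical cue table: words that tend to open act or scene headings
STRUCTURE_CUES = {
    "de": {"act": ["akt", "aufzug"], "scene": ["szene", "auftritt"]},
    "fr": {"act": ["acte"], "scene": ["scène"]},
    "en": {"act": ["act"], "scene": ["scene"]},  # example of an added language
}


def cue_features(token, languages):
    """Return {"cue_act": True} / {"cue_scene": True} if token matches a cue.

    Matching is case-insensitive and ignores trailing punctuation.
    """
    word = token.lower().strip(".,:;")
    feats = {}
    for lang in languages:
        for label, cues in STRUCTURE_CUES.get(lang, {}).items():
            if word in cues:
                feats[f"cue_{label}"] = True
    return feats
```

With a layout like this, supporting another language amounts to adding one entry to the table and retraining on a corpus in that language, rather than editing the feature logic itself.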

How to cite

The software may be cited as:

  • Briand, Andrew & Ruiz Fabo, Pablo (2023). FETE: Fast Encoding of Theater in TEI.

You can also cite a related publication:

  • Ruiz Fabo, Pablo, Bernhard, Delphine, Briand, Andrew & Werner, Carole. (2024). Computational drama analysis from almost zero electronic text: The case of Alsatian theater. To appear in Andresen, Melanie & Reiter, Nils (eds.). Computational Drama Analysis: Reflecting Methods and Interpretations. Preprint at https://univoak.eu/islandora/object/islandora:157880