CorA-XML Utils

License: GPLv3. Code style: black.

CorA-XML Utils is a collection of tools for processing CorA-XML and the various associated transcription languages for historical manuscripts.

It consists of:

  • A model for CorA-XML
  • A model for transcriptions
  • Importers that read different file formats
  • Exporters that dump the content of a data model to certain formats
  • Scripts for carrying out various combinations of these tasks

Because CorA-XML distinguishes between "dipl" and "mod" tokenizations, CorA needs functions that keep track of these parallel tokenizations whenever tokens are edited in the interface. However, because the details of each project's transcription guidelines can differ, CorA relies on external scripts to manage this process. Users can of course write their own scripts for token editing and document import, but CorA-XML Utils is provided to support the production of the requisite scripts and to offer additional functionality.

To support a new transcription standard, one need only define the regular expressions (which comprise the main functionality of a parser) in coraxml_utils/parser.py and in coraxml_utils/tokenizer.py, and only insofar as the existing modules aren't compatible, in which case they can be used as a reference. Once this has been done, the rest of the functionality comes for free: the included editing script (bin/check_and_parse_token) and import script (bin/trans2coraxml.py) operate on the structured internal representation of token transcriptions and thus work for any transcription standard for which there is a parser. This is all you need to use CorA for your data. Additionally, CorA-XML Utils contains a number of export modules for JSON, TEI, Markdown, etc. Once the parsers have been defined, these modules can help get your data into the format you need.

Installation

Dependencies:

  • regex
  • lxml
  • nose2
  • click

These should be installed automatically by the setup script:

pip install [--user] git+https://github.com/comphist/coraxml-utils

(NB: Your pip executable might be called pip3.)

Running tests

From the test/ directory (optionally calculating test coverage):

nose2 [--with-coverage --coverage coraxml_utils]

The data model

Corpus documents

A CorA-XML file is represented in our data model by the Document object. The internal structure of Document objects reflects the fact that they are meant to represent historical prints and manuscripts. They thus also model the layout of text on pages.

A Document consists of Pages, which are made up of Columns, which in turn contain a series of Lines. Each Line contains a series of diplomatic transcriptions (TokDipl). Parallel to these structures, the Document contains the list of CoraTokens, which represent the mapping between the diplomatically faithful tokenizations and the modernized, annotatable tokenizations. Each CoraToken object contains a series of TokDipl and TokAnno objects, and the TokAnno objects contain all of the annotations visible/editable in CorA.
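The containment structure described above can be pictured with a few plain dataclasses. This is only an illustrative sketch: the class names are borrowed from the description, but the fields (`trans`, `tags`, `dipls`, `tok_dipls`, and so on), the example words, and the `pos` tag are assumptions made for this example and are not the actual attributes of the coraxml_utils classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TokDipl:
    """One diplomatic token as written on the page (illustrative only)."""
    trans: str


@dataclass
class TokAnno:
    """One modernized token; carries the annotations (illustrative only)."""
    trans: str
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class CoraToken:
    """Maps a group of diplomatic tokens to a group of annotatable tokens."""
    tok_dipls: List[TokDipl]
    tok_annos: List[TokAnno]


@dataclass
class Line:
    dipls: List[TokDipl]


@dataclass
class Column:
    lines: List[Line]


@dataclass
class Page:
    columns: List[Column]


@dataclass
class Document:
    pages: List[Page]        # layout view: page > column > line > TokDipl
    tokens: List[CoraToken]  # token view: dipl <-> anno mapping


# Hypothetical case: the manuscript writes "ver geben" as two words,
# which together form one annotatable token "vergeben".
d1, d2 = TokDipl("ver"), TokDipl("geben")
token = CoraToken(tok_dipls=[d1, d2],
                  tok_annos=[TokAnno("vergeben", {"pos": "VVINF"})])
doc = Document(
    pages=[Page(columns=[Column(lines=[Line(dipls=[d1, d2])])])],
    tokens=[token],
)

# Both views refer to the same TokDipl objects.
layout_dipls = [d for p in doc.pages for c in p.columns
                for ln in c.lines for d in ln.dipls]
```

The point of the sketch is that the layout hierarchy and the token list are two views over the same diplomatic tokens, which is why edits in one have to be kept consistent with the other.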

Transcriptions

A transcription (Trans) consists of characters (Char) -- see the next section for more on characters.

The central distinction that CorA-XML makes is that between diplomatic tokenizations and modernized, i.e. annotatable, tokenizations. CorA-XML additionally differentiates between diplomatic representations of transcribed text and simplified ASCII representations of the same text.

A Trans object thus has two essential methods, tokenize_dipl and tokenize_anno, which produce the two tokenizations. The tokenize_dipl method produces a list of DiplTrans objects, which contain the UTF diplomatic representation of the transcriptions (accessible with .utf()). The tokenize_anno method produces a list of AnnoTrans objects, which contain the simplified ASCII representations (accessible with .simple()).
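To make the two tokenizations concrete, here is a toy re-implementation of the idea. It does not use coraxml_utils: the `Toy*` classes and the mini transcription convention (`#` joins two written tokens into one annotatable token, `|` splits one written token into two) are invented for this example, and real transcription standards are considerably richer.

```python
import re
import unicodedata
from typing import List


class ToyDiplTrans:
    """Diplomatic token: keeps the original (non-ASCII) characters."""

    def __init__(self, text: str):
        self._text = text

    def utf(self) -> str:
        return self._text


class ToyAnnoTrans:
    """Annotatable token: exposes a simplified ASCII form."""

    def __init__(self, text: str):
        self._text = text

    def simple(self) -> str:
        # Map long s to s, decompose diacritics, then drop non-ASCII remnants.
        text = self._text.replace("\u017f", "s")
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")


class ToyTrans:
    """Invented convention: whitespace separates written tokens,
    '#' joins two written tokens into one annotatable token,
    '|' splits one written token into two annotatable tokens."""

    def __init__(self, raw: str):
        self.raw = raw

    def tokenize_dipl(self) -> List[ToyDiplTrans]:
        # '#' stands between two separately written tokens; '|' is only
        # an editorial mark inside one written token, so it is removed.
        parts = re.split(r"[\s#]+", self.raw.strip())
        return [ToyDiplTrans(p.replace("|", "")) for p in parts if p]

    def tokenize_anno(self) -> List[ToyAnnoTrans]:
        # '|' is a boundary for annotation; '#' glues two pieces together.
        parts = re.split(r"[\s|]+", self.raw.strip())
        return [ToyAnnoTrans(p.replace("#", "")) for p in parts if p]


trans = ToyTrans("ver#geben in|der \u017ftat")
dipl_forms = [d.utf() for d in trans.tokenize_dipl()]
anno_forms = [a.simple() for a in trans.tokenize_anno()]
print(len(dipl_forms), anno_forms)
```

In this toy input the written form has the tokens "ver", "geben", "inder" and a fourth token beginning with a long s, while the annotatable form has "vergeben", "in", "der", "stat": same text, two different token boundaries, and only the second is reduced to ASCII.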

Character classes

For the processing of transcriptions, coraxml_utils makes use of a detailed character class model.

(Figure: character model overview, a visualization of the character class hierarchy.)

Scripts

Contents of the bin/ directory.

Interacting with CorA

For scripting some of the basic functions of CorA there is corascript.py.

Conversion Scripts

Included are a number of scripts for converting between formats, which follow the naming convention {source}2{destination}.py:

  • trans2coraxml.py
  • coraxml2gatejson.py
  • coraxml2tei.py
  • (etc.)

These special scripts sometimes perform various related functions in addition to simply converting from one format to another. For instance, trans2coraxml.py also prints the messages that CorA uses to provide user feedback when new documents are being uploaded and processed.

Scripts such as coraxml2coraxml.py also apply various custom transformations to the data and confirm that the data are valid and have been correctly processed.

If all you need is plain conversion from one format to another, you might only need the coraxml_utils executable.

Usage

```
Usage: coraxml_utils convert [OPTIONS] INFILE

Options:
  -f, --from [coraxml|bonnxml|trans]
                                  Format of the input.  [default: trans]
  -t, --to [coraxml|trans|gatejson|tei|md]
                                  Format of the output.  [default: coraxml]
  -P, --parser [plain|rem|ref|ren|redi|anselm]
                                  Token parser to use.  [default: plain]
  -o, --outfile FILENAME
  --help                          Show this message and exit.
```
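If you want to drive the converter from another Python program, one option is to assemble the command line from the options listed in the usage text. The helper below is a hypothetical convenience function, not part of CorA-XML Utils; it only uses the flags and choices shown above, and the file names are placeholders.

```python
import shlex
from typing import List, Optional

# Choices copied from the usage text above.
FROM_FORMATS = ("coraxml", "bonnxml", "trans")
TO_FORMATS = ("coraxml", "trans", "gatejson", "tei", "md")
PARSERS = ("plain", "rem", "ref", "ren", "redi", "anselm")


def build_convert_command(infile: str,
                          from_format: str = "trans",
                          to_format: str = "coraxml",
                          parser: str = "plain",
                          outfile: Optional[str] = None) -> List[str]:
    """Return an argv list for `coraxml_utils convert` (hypothetical helper)."""
    if from_format not in FROM_FORMATS:
        raise ValueError(f"unknown input format: {from_format}")
    if to_format not in TO_FORMATS:
        raise ValueError(f"unknown output format: {to_format}")
    if parser not in PARSERS:
        raise ValueError(f"unknown parser: {parser}")
    cmd = ["coraxml_utils", "convert",
           "-f", from_format, "-t", to_format, "-P", parser]
    if outfile is not None:
        cmd += ["-o", outfile]
    cmd.append(infile)
    return cmd


# Placeholder file names, for illustration only.
cmd = build_convert_command("doc.trans", to_format="tei",
                            parser="rem", outfile="doc.xml")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the `coraxml_utils` executable is installed and on your `PATH`.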

# Available Transcription Parsers

Currently there are parsers for the following transcription conventions.

* ReM ([Referenzkorpus Mittelhochdeutsch](https://linguistics.rub.de/rem))
* ReF ([Referenzkorpus Frühneuhochdeutsch](https://linguistics.rub.de/ref))
* ReDI ([Referenzkorpus Deutscher Inschriften](https://www.ruhr-uni-bochum.de/wegera/ReDI/index.htm))
* Anselm ([The Anselm Corpus](https://linguistics.rub.de/anselm))
* ReN ([Referenzkorpus Mittelniederdeutsch/Niederrheinisch (1200&ndash;1650)](https://www.slm.uni-hamburg.de/ren))

Please note: the parser for ReN is not very strict. It can therefore be used
to import valid transcriptions, but it should not be used to validate
transcriptions.


# Importers

* `CoraXMLImporter`
* `TransImporter` (For plain text transcription files.)
* `BonnXMLImporter` (For ReM.)


# Exporters

* `CoraXMLExporter`
  - Data imported with the `CoraXMLImporter` and exported with this exporter should be identical.
* `TransExporter`
* `TEIExporter`
* `GateJsonExporter` (This is the variant of Tweet JSON used by GATE.)
* `MarkdownExporter`


# Modifiers

Sometimes you want to transform a document in some way before exporting it to a
destination format: rename a node, add some tags, etc. For this, CorA-XML Utils
uses **modifiers**: functions that perform whatever post-processing one might
require in certain situations.
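As a rough illustration of the pattern (not the library's actual interface), a modifier can be thought of as a function that takes a document and returns the transformed document, so that several modifiers can be chained before export. In this sketch a "document" is just a list of token dictionaries, and `rename_tag`, `add_default_tag`, and `apply_modifiers` are hypothetical names.

```python
from functools import reduce
from typing import Callable, Dict, List

Doc = List[Dict[str, str]]          # stand-in for a real document object
Modifier = Callable[[Doc], Doc]


def rename_tag(old: str, new: str) -> Modifier:
    """Build a modifier that renames one annotation key on every token."""
    def modifier(doc: Doc) -> Doc:
        return [{(new if key == old else key): value
                 for key, value in tok.items()} for tok in doc]
    return modifier


def add_default_tag(name: str, value: str) -> Modifier:
    """Build a modifier that adds a tag where it is missing."""
    def modifier(doc: Doc) -> Doc:
        # Existing values win over the default.
        return [{name: value, **tok} for tok in doc]
    return modifier


def apply_modifiers(doc: Doc, modifiers: List[Modifier]) -> Doc:
    """Apply the modifiers from left to right."""
    return reduce(lambda current, mod: mod(current), modifiers, doc)


doc = [{"trans": "vnde", "pos": "KON"}, {"trans": "stat"}]
result = apply_modifiers(doc, [rename_tag("pos", "POS"),
                               add_default_tag("POS", "--")])
print(result)
```

Keeping each modifier small and free of side effects on its input makes the order of application explicit and easy to change per export target.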

The following are some of the modifiers currently included in CorA-XML Utils.

## Adding tokenization tags

For ReF, Anselm, and ReM (at least) we want to have tags indicating where
univerbation or multiverbation has taken place. The `add_tokenization_tags`
function adds these tags based on the `TokenBound` annotations added during the
transcription phase.

## Modifying tags

The `add_punc_tags` function converts sentence boundary annotations (such as
`(.)` or `(?)`) to tags that are easier to query.
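The following toy function shows the general idea of turning such inline markers into queryable tags. It is a sketch under assumptions, not the implementation of `add_punc_tags`: the `punc` key, the tag values `period` and `question`, and the premise that the marker sits at the end of a token's transcription are all invented for this example.

```python
import re
from typing import Dict, List

# Hypothetical tag names; the real tag set is defined by the project.
HYPOTHETICAL_PUNC_TAGS = {".": "period", "?": "question"}

# A sentence boundary marker such as "(.)" or "(?)" at the end of a token.
BOUNDARY = re.compile(r"\((?P<mark>[.?])\)$")


def toy_add_punc_tags(tokens: List[str]) -> List[Dict[str, str]]:
    """Strip boundary markers from tokens and record them as a tag."""
    tagged = []
    for trans in tokens:
        match = BOUNDARY.search(trans)
        if match:
            tagged.append({
                "trans": trans[:match.start()],
                "punc": HYPOTHETICAL_PUNC_TAGS[match.group("mark")],
            })
        else:
            tagged.append({"trans": trans})
    return tagged


tagged = toy_add_punc_tags(["er", "sprach(.)", "wer(?)"])
print(tagged)
```

With the marker moved into its own field, a query for sentence-final tokens becomes a simple lookup on the tag instead of a pattern match on the transcription string.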