Description

The participants of the Byzantine Greek Group at the Winter School of Handwritten Text Recognition of Medieval Manuscripts Latin / Greek / Czech trained the Transkribus model "15th c. liturgical" for Byzantine Greek. The model was trained on images of the codex Dresden, SLUB A. 151. This repository contains a) transcriptions of 66 images set as "ground truth" and b) automatic transcriptions of the remaining pages (to be added later). The automatic transcriptions have an error rate of 22.3%. The errors are mostly wrong accents and breathings. To improve searchability despite the erroneous transcription, try fuzzy search. Alternatively, you can remove the accents from the txt file (e.g. https://dev.to/djemos/removeaccents-py-5dmd).
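Both suggestions can be tried with the Python standard library alone. The following is a minimal sketch, not part of the dataset or of Transkribus: the function names strip_diacritics and fuzzy_find are hypothetical, and the similarity threshold of 0.8 is an arbitrary assumption that may need tuning for your queries. It removes accents and breathings by Unicode decomposition and then compares words by approximate string similarity.

```python
import unicodedata
from difflib import SequenceMatcher


def strip_diacritics(text: str) -> str:
    """Remove accents, breathings and other combining marks from text."""
    # NFD splits each precomposed letter into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def fuzzy_find(query: str, text: str, threshold: float = 0.8) -> list[tuple[str, float]]:
    """Return (word, score) pairs for words in text that resemble the query.

    Both query and words are compared without diacritics and in lower case.
    The threshold is an assumed default, not a value from the dataset.
    """
    target = strip_diacritics(query).lower()
    hits = []
    for word in text.split():
        candidate = strip_diacritics(word).lower()
        score = SequenceMatcher(None, target, candidate).ratio()
        if score >= threshold:
            hits.append((word, score))
    return hits


if __name__ == "__main__":
    line = "Κύριε ἐλέησον"
    print(strip_diacritics(line))
    print(fuzzy_find("ελεησον", line))
```

To search a whole transcription, read the txt file as UTF-8 and pass its contents (or each line in turn) to fuzzy_find. Comparing accent-free forms means that wrong accents and breathings, the most common errors in the automatic transcriptions, do not prevent a match.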

Origin of the data:

Data organisation

The transcriptions of the 66 pages marked as "ground truth" are in the separate file "ground_truth". All transcriptions, whether checked and marked as ground truth by the editors or not, are in the file "all_transcriptions".

How to cite

This dataset was created by Angelos Zaloumis, Canan Arıkan-Caba, Carole Hofstetter, Eirini Afentoulidou, Ekaterini Mitsiou, Emanuele Scieri, Georgi Mitov, Konstantina Tsakona, Kyriaco Nikias, Louiza Argyriou, Panagiotis Leontaridis. The digitisation is not copyright free, but the transcription is. However, properly annotating a corpus takes time and is a task that should be recognised. If you use any item from this corpus as ground truth, cite the dataset using the following information.

The BibTeX citation can be copied from the Zenodo record.

Copyright and licence

This dataset was created as part of the Winter School of Handwritten Text Recognition of Medieval Manuscripts 2023/2024, in Vienna at the Österreichische Akademie der Wissenschaften, Institut für Mittelalterforschung. All transcriptions are licensed under the Creative Commons 4 licence. Images were provided by the Saxon State and University Library (SLUB) and are licensed under the Public Domain Mark - No Copyright Protection.