Handwriting-GT ✍️

A collection of handwritten ground truth for HTR training.

About

This collection draws on several manuscript editions from the Digital Humanities and provides their edited texts (transcriptions) as ground truth for training HTR models.

All ground truth is provided as PAGE XML. All transcriptions are based on the OCR-D transcription guidelines Level 2.

See sections below for individual data set descriptions.
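Because all ground truth is PAGE XML, the line counts given for each data set can be roughly cross-checked by counting `TextLine` elements. The following is only a minimal sketch: the function name `count_text_lines` is hypothetical, and it assumes that the files end in `.xml` and that `TextLine` elements carry no namespace prefix.

```shell
#!/usr/bin/env bash
# Count opening <TextLine> tags in all PAGE XML files below a folder.
# Assumptions (not confirmed by this README): files end in .xml and
# TextLine elements are written without a namespace prefix.
count_text_lines() {
  local dir="$1"
  find "$dir" -type f -name '*.xml' -exec cat {} + \
    | grep -o '<TextLine[ >]' \
    | wc -l \
    | tr -d ' '   # some wc implementations pad the number with spaces
}
```

Calling `count_text_lines gsa_389889` from the folder that contains that data set would print a single number to compare against the Lines column. A proper XML parser would be more robust than `grep`; the sketch trades that for having no dependencies.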

Data Sets

Faustedition

| Folder | Source | Pages | Lines | License |
|---|---|---|---|---|
| gsa_389889 | faustedition | 8 | 230 | CC BY-NC-SA 4.0 |
| gsa_390028 | faustedition | 94 | 2493 | CC BY-NC-SA 4.0 |
| gsa_390825 | faustedition | 30 | 743 | CC BY-NC-SA 4.0 |
| gsa_391098 | faustedition | 414 | 10178 | CC BY-NC-SA 4.0 |
| gsa_391511 | faustedition | 6 | 168 | CC BY-NC-SA 4.0 |
| gsa_391347 | faustedition | 35 | 955 | CC BY-NC-SA 4.0 |
| gsa_391247 | faustedition | 68 | 1698 | CC BY-NC-SA 4.0 |
| Total | | 671 | 16816 | |

Download images using the bash script download_imgs.sh in each data set folder.
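To fetch the images for several data sets in one go, the per-folder scripts can be run in a loop. This is a sketch under assumptions that this README does not confirm: that `download_imgs.sh` takes no arguments and expects to be run from inside its own folder. The function name `download_all` is hypothetical.

```shell
#!/usr/bin/env bash
# Run every download_imgs.sh found below a root folder (default: current dir).
# Assumptions: the script needs no arguments and should be started from
# within the folder it lives in.
download_all() {
  local root="${1:-.}"
  find "$root" -type f -name 'download_imgs.sh' | while IFS= read -r script; do
    # A subshell keeps the cd local; /dev/null stops the script from
    # consuming the list of paths that the loop is reading.
    ( cd "$(dirname "$script")" && bash ./download_imgs.sh ) < /dev/null
  done
}
```

Running `download_all .` from the repository root would then visit each data set folder in turn. To download a single data set, change into its folder and run `bash download_imgs.sh` directly.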

Source: Johann Wolfgang Goethe: Faust. Historisch-kritische Edition. Herausgegeben von Anne Bohnenkamp, Silke Henke und Fotis Jannidis unter Mitarbeit von Gerrit Brüning, Katrin Henzel, Christoph Leijser, Gregor Middell, Dietmar Pravida, Thorsten Vitt und Moritz Wissenbach.

Transcription guidelines: The following normalisations made by the edition were resolved in the ground truth to conform to the OCR-D transcription guidelines Level 2:

  • Round brackets: ( and ) (edition) → /: and :/ (ground truth)
  • Hyphens: - (edition) → = (ground truth)
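The two character mappings above can be illustrated with a small `sed` filter. This is not the procedure used to build the ground truth, only a sketch of the mapping. The helper name `normalise_faust` is hypothetical, and the filter rewrites every bracket and hyphen it sees, without checking context.

```shell
#!/usr/bin/env bash
# Map edition characters to the ground-truth conventions listed above.
# Hypothetical helper: reads text on stdin, writes the mapped text to stdout.
normalise_faust() {
  sed -e 's|(|/:|g' \
      -e 's|)|:/|g' \
      -e 's|-|=|g'
}
```

For example, `printf '%s\n' '(ab) c-' | normalise_faust` prints `/:ab:/ c=`.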

Theodor Fontane Notizbücher

| Folder | Source | Pages | Lines | License |
|---|---|---|---|---|
| A01 | Fontane Edition | 67 | 1046 | CC BY-NC-ND 4.0 |
| C13 | Fontane Edition | 53 | 879 | CC BY-NC-ND 4.0 |
| Total | | 120 | 1925 | |

Download images using the bash script download_imgs.sh in each data set folder.

Source: Theodor Fontane: Notizbücher. Digitale genetisch-kritische und kommentierte Edition. Hrsg. von Gabriele Radecke.

Transcription guidelines: The following normalisations made by the edition were resolved in the ground truth to conform to the OCR-D transcription guidelines Level 2:

  • Sammlung (edition) → Sam̄lung (ground truth)

August Wilhelm Schlegel Briefe

| Folder | Source | Pages | Lines | License |
|---|---|---|---|---|
| GT_PAGE | Schlegel Briefe | 40 | 788 | CC BY-NC-SA 3.0 |
| Total | | 40 | 788 | |

Download images using the bash script download_imgs.sh in each data set folder.

Source: August Wilhelm Schlegel: Digitale Edition der Korrespondenz. Hg. von Jochen Strobel und Claudia Bamberg. Dresden, Marburg, Trier 2014–2020.

Transcription guidelines: The following normalisations made by the edition were resolved in the ground truth to conform to the OCR-D transcription guidelines Level 2:

  • round s (edition) → long ſ (ground truth)