
How can I use the models for Fraktur (German)?


I would like to use your model for Fraktur. How do I set this up, or is it just a special command?

  • Download the desired model file(s) (*.traineddata), in either the fast variant (recommended for recognition) or the best variant (required for additional training)
  • Install the model file(s) in your local tessdata directory or a subdirectory of that directory
  • Optionally rename the model file(s)
  • Run Tesseract and specify the name of the model file (-l MODEL), prefixed with the subdirectory if you used one and without the trailing .traineddata
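
The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not part of the original answer: the helper names `model_name`, `build_command` and `run_ocr`, the subdirectory `ubma` and the shortened file name `frak2021.traineddata` are assumptions made for illustration. It relies only on Tesseract's `-l` and `--tessdata-dir` options and on `stdout` as the output base.

```python
import subprocess

SUFFIX = ".traineddata"


def model_name(relative_path):
    """Turn a model path relative to tessdata into the value for -l.

    Only the trailing ".traineddata" is stripped; a leading
    subdirectory (if any) is kept as part of the name.
    """
    if not relative_path.endswith(SUFFIX):
        raise ValueError("expected a *.traineddata file: " + relative_path)
    return relative_path[: -len(SUFFIX)]


def build_command(image, model, tessdata_dir=None):
    """Build the tesseract command line; the text goes to stdout."""
    cmd = ["tesseract", image, "stdout", "-l", model]
    if tessdata_dir is not None:
        # Only needed when the models are not in the default tessdata directory.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_ocr(image, model, tessdata_dir=None):
    """Run Tesseract (must be installed) and return the recognized text."""
    result = subprocess.run(
        build_command(image, model, tessdata_dir),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For a model renamed to `frak2021.traineddata` and placed in a hypothetical `ubma` subdirectory of tessdata, `run_ocr("page.png", model_name("ubma/frak2021.traineddata"))` would call Tesseract with `-l ubma/frak2021`.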

Models are available from these URLs:

We used https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_fast/frak2021_1.069_755545_3685930.traineddata (CER 1.069 % on selected ground truth) for our own most recent OCR, but depending on your texts other models might give better results.

Really good. Thank you very much! The only problem I see with my test file is "ſ" where I expected "s". But anyway, it is really good compared with my previous tests.
Belehrt durch die Erfahrung, wie leicht der Zuhörer Urtheil
über die Geiſteserzeugniſſe ihres Predigers durch ſo manche

We train our models to detect the long s as "ſ", so if you want an "s", that requires a simple search and replace operation on the results.
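
As a minimal sketch of that search and replace step (the function name `modernize_long_s` is an assumption for illustration, not something the models or Tesseract provide), the OCR output can be post-processed in Python like this:

```python
LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S, "ſ"


def modernize_long_s(text):
    """Replace every long s in the OCR output with a round "s"."""
    return text.replace(LONG_S, "s")
```

Applied to the sample above, "Geiſteserzeugniſſe" becomes "Geisteserzeugnisse". Keeping the unmodified OCR output alongside the modernized text may be useful, since the replacement cannot be undone automatically.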

OK, thanks, understood. As I said before, I am very happy with this result! I noticed another issue: "oͤ" instead of "ö", but not always. Maybe my bad scan is the reason; the paper is very rough. Do you prefer .jpg or .png as the source?
For my project I will have to figure out whether it is enough to improve my images or whether I have to improve the traineddata. The latter is probably the more difficult option.

The model was trained on a wide range of historic texts (from early prints to early 20th century) which include both umlaut variants "oͤ" and "ö". Tesseract does not care which image format you provide: it works with jpg, png and other image formats.
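
If a single modern spelling is wanted downstream, the historic variant can be normalized after OCR. The following is a minimal sketch under the assumption that the model emits the historic umlaut as a base vowel followed by U+0364 (COMBINING LATIN SMALL LETTER E); the names `UMLAUT_MAP` and `modernize_umlauts` are hypothetical and chosen for illustration.

```python
COMBINING_E = "\u0364"  # COMBINING LATIN SMALL LETTER E

# Base vowel + combining small e -> precomposed modern umlaut.
UMLAUT_MAP = {
    "a" + COMBINING_E: "\u00e4",  # ä
    "o" + COMBINING_E: "\u00f6",  # ö
    "u" + COMBINING_E: "\u00fc",  # ü
    "A" + COMBINING_E: "\u00c4",  # Ä
    "O" + COMBINING_E: "\u00d6",  # Ö
    "U" + COMBINING_E: "\u00dc",  # Ü
}


def modernize_umlauts(text):
    """Rewrite historic umlauts (vowel + small e above) as modern ones."""
    for historic, modern in UMLAUT_MAP.items():
        text = text.replace(historic, modern)
    return text
```

Text that already uses the modern "ö" passes through unchanged, so both variants on one page end up with the same spelling.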

My print is from 1828. I see both variants on the same page in the OCR result, even though the paper shows only one form of "ö".

Can you provide example images?


Result: see line 24, same line, different characters.
beſtehenden allerhoͤchſten Vorſchriften kräftig zu fördern: um


Line 24 does indeed contain both variants of "ö", so the OCR result is correct in distinguishing them. In the print, "allerhöchsten" uses a lowercase "o" combined with a small "e", and that is what the OCR should detect.

Line 24:
case 1: "allerhöchsten" => "o" & "e"
case 2: "fördern" => "ö" (in the same line)
Why?