UB-Mannheim/tesseract

How can I use the models for Fraktur (German)?


I would like to use your model for Fraktur. How does this have to be set up, or is it just a matter of a special command?

  • Download the desired model file(s) (*.traineddata), either the fast variant (recommended for recognition) or the best variant (required for additional training)
  • Install the model file(s) in your local tessdata directory or a subdirectory of that directory
  • Optionally rename the model file(s)
  • Run Tesseract and specify the name of the model file (-l MODEL), prefixed with the subdirectory if you used one and without the trailing .traineddata
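
The steps above can be sketched in a few lines of Python. This is only an illustration, not part of the original answer: the helper names model_name and tesseract_command, the tessdata location and the file name page.png are assumptions chosen for the example. The only Tesseract feature used is the -l MODEL option mentioned in the list.

```python
from pathlib import Path


def model_name(model_file: Path, tessdata_dir: Path) -> str:
    """Return the value to pass to -l for a model stored below tessdata_dir.

    The name is the path relative to the tessdata directory, including any
    subdirectory, without the trailing .traineddata.
    """
    relative = model_file.relative_to(tessdata_dir)
    if relative.suffix != ".traineddata":
        raise ValueError(f"not a model file: {model_file}")
    return relative.with_suffix("").as_posix()


def tesseract_command(image: Path, output_base: Path, model: str) -> list[str]:
    """Build the argument list: tesseract IMAGE OUTPUTBASE -l MODEL."""
    return ["tesseract", str(image), str(output_base), "-l", model]


if __name__ == "__main__":
    # Assumed locations; adjust them to your own installation.
    tessdata = Path("tessdata")
    model_file = tessdata / "frak" / "frak2021_1.069_755545_3685930.traineddata"

    command = tesseract_command(
        Path("page.png"), Path("page"), model_name(model_file, tessdata)
    )
    # Print the command; it could be passed to subprocess.run(command) as is.
    print(" ".join(command))
```

If the model file was renamed (the optional third step), the shorter name is simply what ends up after -l.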

Models are available from these URLs:

We used https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_fast/frak2021_1.069_755545_3685930.traineddata (CER 1.069 % on selected ground truth) for our own most recent OCR, but depending on your texts other models might give better results.

Really good. Thank you very much! The only problem I see with my test file is that I get "ſ" where I would like "s". But anyway, really good compared with my previous tests.
Example:
Vorrede.
Belehrt durch die Erfahrung, wie leicht der Zuhörer Urtheil
über die Geiſteserzeugniſſe ihres Predigers durch ſo manche

We train our models to detect the long s as "ſ", so if you want an "s", that requires a simple search and replace operation on the results.
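
A minimal sketch of such a search and replace step in Python, applied to one line of the example above. The function name modernize_long_s is made up for this illustration; any editor or tool with search and replace does the same job.

```python
LONG_S = "\u017f"  # ſ, LATIN SMALL LETTER LONG S


def modernize_long_s(text: str) -> str:
    """Replace every long s in the OCR result with a round s."""
    return text.replace(LONG_S, "s")


sample = "über die Geiſteserzeugniſſe ihres Predigers durch ſo manche"
print(modernize_long_s(sample))
# prints: über die Geisteserzeugnisse ihres Predigers durch so manche
```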

OK, thanks, understood. As I said before, I am very happy with this result! I noticed another issue: "oͤ" instead of "ö", but not always. Maybe my bad scan is the reason; the paper is very rough. Do you prefer .jpg or .png as the source?
For my project I will have to figure out whether improving my images is enough or whether I have to improve the traineddata. The latter is probably the more difficult option.

The model was trained on a wide range of historic texts (from early prints to early 20th century) which include both umlaut variants "oͤ" and "ö". Tesseract does not care which image format you provide: it works with jpg, png and other image formats.

My print is from 1828. I get both variants on the same page, even though on paper there is only one sign, "ö".

Can you provide example images?

Source:
[image]

Result: see line 24, where the same character comes out differently within one line.
beſtehenden allerhoͤchſten Vorſchriften kräftig zu fördern: um

[image]

Line 24 does indeed contain both variants of "ö", so the OCR result is correct in distinguishing them. "allerhöchsten" is printed with a lower case "o" combined with a small "e" above it. That's what the OCR should detect.
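
In Unicode terms the two variants are different character sequences: the older form is a base letter followed by U+0364 (COMBINING LATIN SMALL LETTER E), the modern form is a letter with diaeresis. The sketch below shows how to inspect them and, as an assumption that goes beyond what was said above, how to map the older form to the modern one if that is preferred in the output. The function names describe and modernize_umlauts are made up for this example.

```python
import re
import unicodedata

COMBINING_SMALL_E = "\u0364"  # the small "e" printed above a, o, u
COMBINING_DIAERESIS = "\u0308"  # the two dots of a modern umlaut

# A vowel that can carry an umlaut, directly followed by the combining "e".
OLD_UMLAUT = re.compile("([aouAOU])" + COMBINING_SMALL_E)


def describe(text: str) -> list[str]:
    """Return the Unicode name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


def modernize_umlauts(text: str) -> str:
    """Turn a/o/u + combining small e into the precomposed ä/ö/ü."""
    swapped = OLD_UMLAUT.sub(lambda m: m.group(1) + COMBINING_DIAERESIS, text)
    # NFC joins base letter and diaeresis into one code point; it leaves "ſ" alone.
    return unicodedata.normalize("NFC", swapped)


print(describe("o\u0364"))
# prints: ['LATIN SMALL LETTER O', 'COMBINING LATIN SMALL LETTER E']
print(describe("\u00f6"))
# prints: ['LATIN SMALL LETTER O WITH DIAERESIS']
print(modernize_umlauts("allerho\u0364ch\u017ften"))
# prints: allerhöchſten
```

Like the long s replacement, this discards information that is present in the print, so it is best applied to a copy of the OCR result.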

Line 24:
case 1: "allerhöchsten" => "o" & "e"
case 2: "fördern" => "ö"
Both are in the same line. Why?