How can I use the models for Fraktur (German)?

Question

Hermann12 opened this issue 4 years ago · 10 comments

Hermann12 commented 4 years ago

I would like to use your model for Fraktur. How do I set this up, or does it only need a special command?

Answer 1 · 2021-04-11T11:17:55.000Z

1. Download the desired model file(s) (*.traineddata), in either the fast variant (recommended for recognition) or the best variant (required for additional training).
2. Install the model file(s) in your local tessdata directory or in a subdirectory of that directory.
3. Optionally rename the model file(s).
4. Run Tesseract and specify the name of the model file (-l MODEL), with the subdirectory before the name if you used one and without the trailing .traineddata.

Models are available from these URLs:

https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/AustrianNewspapers/ (trained from newspapers)
https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/Fraktur_5000000/ (trained based on script/Fraktur)
https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/GT4HistOCR/ (trained from scratch)
https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/ (latest models)

For our own most recent OCR we used https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_fast/frak2021_1.069_755545_3685930.traineddata (CER 1.069 % on selected ground truth), but depending on your texts, other models might give better results.
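
The four steps above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the local directory `tessdata`, the shortened name `frak2021`, and the helper names `install_model`, `model_name`, `tesseract_command` and `ocr_page` are made up for this example; only the model URL and the `-l` option come from this thread. It also assumes the `tesseract` executable is on your PATH.

```python
import subprocess
import urllib.request
from pathlib import Path
from typing import List, Optional

# Model URL quoted in this thread; every local name below is an assumption.
MODEL_URL = (
    "https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/"
    "tessdata_fast/frak2021_1.069_755545_3685930.traineddata"
)


def install_model(url: str, tessdata_dir: Path, new_name: str = "frak2021") -> Path:
    """Steps 1-3: download a model into tessdata_dir under a shorter name."""
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    target = tessdata_dir / f"{new_name}.traineddata"
    urllib.request.urlretrieve(url, str(target))
    return target


def model_name(model_file: Path, tessdata_dir: Path) -> str:
    """Name for -l: path relative to tessdata, without the .traineddata suffix."""
    relative = model_file.relative_to(tessdata_dir)
    return relative.with_suffix("").as_posix()


def tesseract_command(
    image: str, output_base: str, model: str, tessdata_dir: Optional[Path] = None
) -> List[str]:
    """Step 4: build the Tesseract command line for one image."""
    cmd = ["tesseract", image, output_base, "-l", model]
    if tessdata_dir is not None:
        # Only needed when the models are not in Tesseract's default tessdata.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def ocr_page(image: str, output_base: str, tessdata_dir: Path) -> None:
    """Download the model, then run Tesseract (writes output_base.txt)."""
    model_file = install_model(MODEL_URL, tessdata_dir)
    name = model_name(model_file, tessdata_dir)
    subprocess.run(tesseract_command(image, output_base, name, tessdata_dir), check=True)
```

Calling `ocr_page("page.png", "page", Path("tessdata"))` would then be equivalent to running `tesseract page.png page -l frak2021 --tessdata-dir tessdata` by hand, where `page.png` is a placeholder for your own scan.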

Answer 2 · 2021-04-11T11:46:03.000Z

Really good. Thank you very much! I see only one problem with my test file: I get "ſ" where I would like "s". But anyway, it is really good compared with my previous tests.
Example:
Vorrede.
Belehrt durch die Erfahrung, wie leicht der Zuhörer Urtheil
über die Geiſteserzeugniſſe ihres Predigers durch ſo manche

Answer 3 · 2021-04-11T15:46:02.000Z

We train our models to detect the long s as "ſ", so if you want an "s", that requires a simple search and replace operation on the results.
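
A minimal sketch of that search-and-replace step in Python, assuming the OCR text is already available as a string. The function name `modernize_long_s` is made up for this example, and the sample is the line posted earlier in this thread.

```python
LONG_S = "\u017f"  # "ſ", LATIN SMALL LETTER LONG S


def modernize_long_s(text: str) -> str:
    """Replace every long s in the OCR output with a round s."""
    return text.replace(LONG_S, "s")


sample = "über die Geiſteserzeugniſſe ihres Predigers durch ſo manche"
print(modernize_long_s(sample))
```

The same replacement can be applied to a whole output file by reading it as UTF-8, passing the contents through the function, and writing the result back.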

Answer 4 · 2021-04-11T20:11:21.000Z

OK, thanks, understood. As I said before, I am very happy with this result! I noticed another issue: "oͤ" instead of "ö", but not always. Maybe my bad scan is the reason; the paper is very rough. Do you prefer .jpg or .png as the source?
I will figure out for my project whether it is enough to improve my images or whether I have to improve the traineddata. The second is probably the more difficult option.

Answer 5 · 2021-04-11T20:35:57.000Z

The model was trained on a wide range of historic texts (from early prints to early 20th century) which include both umlaut variants "oͤ" and "ö". Tesseract does not care which image format you provide: it works with jpg, png and other image formats.
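
If a uniform modern spelling is wanted in the output, the historic variant can be mapped to the modern one afterwards. The sketch below assumes the OCR output encodes the historic form as a base letter followed by U+0364 (COMBINING LATIN SMALL LETTER E); the function name `modernize_umlauts` is made up for this example. Note that this throws away a distinction the model deliberately preserves.

```python
import unicodedata

COMBINING_SMALL_E = "\u0364"    # the small "e" in the historic form "oͤ" (assumed encoding)
COMBINING_DIAERESIS = "\u0308"  # the two dots of the modern "ö"


def modernize_umlauts(text: str) -> str:
    """Map base letter + combining small e to the precomposed modern umlaut."""
    replaced = text.replace(COMBINING_SMALL_E, COMBINING_DIAERESIS)
    # NFC composes "o" + U+0308 into the single code point "ö" (U+00F6).
    return unicodedata.normalize("NFC", replaced)


print(modernize_umlauts("allerho\u0364ch\u017ften"))
```

Text that already uses the modern "ö" passes through unchanged, and the long s is not touched by this step.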

Answer 6 · 2021-04-11T20:54:06.000Z

My print is from 1828. I see both variants in the result for the same page, even though there is only a single form of "ö" on paper.

Answer 7 · 2021-05-19T06:02:10.000Z

Can you provide example images?

Answer 8 · 2021-05-19T21:21:56.000Z

Source:

Result: see line 24; the same line contains two different characters.
beſtehenden allerhoͤchſten Vorſchriften kräftig zu fördern: um

Answer 9 · 2021-05-20T03:57:34.000Z

Line 24 does indeed contain both variants of "ö", so the OCR result is correct in distinguishing them. In the print, "allerhöchsten" uses a lowercase "o" combined with a small "e". That is what the OCR should detect.

Answer 10 · 2021-05-20T15:57:35.000Z

Line 24:
case 1: "allerhöchsten" => "o" & "e"
case 2: "fördern" => "ö", in the same line
Why?