How can I use the models for Fraktur (German)?
Hermann12 opened this issue · 10 comments
I would like to use your model for Fraktur. How must this be implemented, or is it just a special command?
- Download the desired model file(s) (*.traineddata), either the fast (recommended for recognition) or the best (required for additional training) variant
- Install the model file(s) in your local tessdata directory or a subdirectory of that directory
- Optionally rename the model file(s)
- Run Tesseract and specify the name of the model file (-l MODEL), maybe with the subdirectory before the name and without the trailing .traineddata
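As a rough illustration of the last step, here is a minimal Python sketch that assembles the Tesseract command line. The image name page.png, the model name frak2021 (a renamed model file) and the helper names are assumptions made for this example, not part of the instructions above. The options -l and --tessdata-dir are standard Tesseract options, and stdout as output base prints the recognized text.

```python
import subprocess

SUFFIX = ".traineddata"


def build_tesseract_command(image, model, tessdata_dir=None, output_base="stdout"):
    """Return the argument list for running Tesseract with a given model.

    `model` may include a subdirectory (e.g. "fraktur/frak2021"). A trailing
    ".traineddata" is stripped because -l expects the bare model name.
    """
    if model.endswith(SUFFIX):
        model = model[: -len(SUFFIX)]
    command = ["tesseract", str(image), output_base, "-l", model]
    if tessdata_dir is not None:
        # Only needed when the models are not in the default tessdata directory.
        command += ["--tessdata-dir", str(tessdata_dir)]
    return command


def run_ocr(image, model, tessdata_dir=None):
    """Run Tesseract (must be installed) and return the recognized text."""
    command = build_tesseract_command(image, model, tessdata_dir)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical file and model names, for illustration only.
print(" ".join(build_tesseract_command("page.png", "frak2021.traineddata")))
# → tesseract page.png stdout -l frak2021
```

Calling run_ocr("page.png", "frak2021") would then return the OCR text, provided Tesseract and the model file are installed.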
Models are available from these URLs:
- https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/AustrianNewspapers/ (trained from newspapers)
- https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/Fraktur_5000000/ (trained based on script/Fraktur)
- https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/GT4HistOCR/ (trained from scratch)
- https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/ (latest models)
We used https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_fast/frak2021_1.069_755545_3685930.traineddata (CER 1.069 % on selected ground truth) for our own most recent OCR, but depending on your texts, other models might give better results.
Really good. Thank you very much! I see only one problem with my test file: "ſ" instead of "s". But anyway, really good in comparison with my previous tests.
Example:
Vorrede.
Belehrt durch die Erfahrung, wie leicht der Zuhörer Urtheil
über die Geiſteserzeugniſſe ihres Predigers durch ſo manche
We train our models to detect the long s as "ſ", so if you want an "s", that requires a simple search and replace operation on the results.
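A minimal sketch of such a search and replace in Python, assuming the OCR result is available as a string (the function name is made up for this example):

```python
LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S, "ſ"


def replace_long_s(text):
    """Replace every long s in the OCR result with a round s."""
    return text.replace(LONG_S, "s")


sample = "über die Geiſteserzeugniſſe ihres Predigers durch ſo manche"
print(replace_long_s(sample))
# → über die Geisteserzeugnisse ihres Predigers durch so manche
```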
OK, thanks, understood. As I said before, I am very happy with this result! I noticed another issue: "oͤ" instead of "ö", but not always. Maybe my bad scan could be the reason. I have very rough paper. Do you prefer .jpg or .png as the source?
For my project, I will figure out whether it is enough to improve my images or whether I have to improve the traineddata. The second is probably the more difficult option.
The model was trained on a wide range of historic texts (from early prints to early 20th century) which include both umlaut variants "oͤ" and "ö". Tesseract does not care which image format you provide: it works with jpg, png and other image formats.
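If a modern spelling is preferred in the output, the older umlaut variant can be normalized afterwards. The following is only a sketch under the assumption that the OCR result encodes the old form as a base vowel followed by U+0364 (COMBINING LATIN SMALL LETTER E). The function name is made up for this example, and the replacement discards a distinction that is present in the print.

```python
import re
import unicodedata

COMBINING_SMALL_E = "\u0364"  # small "e" written above the base letter
COMBINING_DIAERESIS = "\u0308"  # the two dots of the modern umlaut

# Only vowels that form German umlauts are touched.
OLD_UMLAUT = re.compile("([aouAOU])" + COMBINING_SMALL_E)


def normalize_umlauts(text):
    """Turn a, o, u with a combining small e into the precomposed ä, ö, ü."""
    return OLD_UMLAUT.sub(
        lambda m: unicodedata.normalize("NFC", m.group(1) + COMBINING_DIAERESIS),
        text,
    )


print(normalize_umlauts("allerho\u0364chsten"))
# → allerhöchsten
```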
My print is from 1828. I see both variants in the OCR result for the same page, even though there is only a single sign "ö" on paper.
Can you provide example images?
Line 24 does indeed contain both variants of "ö", so the OCR result is correct when it distinguishes them. "allerhöchsten" uses a lower case "o" combined with a small "e". That's what the OCR should detect.
Line 24:
case 1: "allerhöchsten" => "o" & "e" - AND same line
case 2: "fördern" => "ö"
Why ???