OCR-D/ocrd_all

Failed to load any lstm-specific dictionaries for lang

Closed this issue · 9 comments

Hi,

I am trying to use stweil's GT4HistOCR model (from https://ocr-d.de/en/models.html) with the following command:
tesseract 00000005.tif 00000005 -l Fraktur_50000000.334_450937

I get the following error message: Failed to load any lstm-specific dictionaries for lang ...

Recognition runs anyway. The problem is that some characters are recognized correctly but are not displayed properly, e.g.: Hoͤring / Verſuche / uͤber / laͤnger

How can I fix this problem?

Thanks

kba commented

I get the following error message: Failed to load any lstm-specific dictionaries for lang ...

This is just a warning. It means that the Tesseract model contains only the trained neural network and no language model features (dictionaries), which is fine.
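To confirm that a model really ships without dictionaries, one option is to list its components with combine_tessdata -d from the Tesseract training tools. This is only a sketch: it assumes that tool is installed, and list_lstm_dawgs is a hypothetical helper name, not part of Tesseract.

```shell
# Hypothetical helper: print the dictionary (lstm-*-dawg) components of a
# .traineddata file, or a short note if none are listed.
# Assumes combine_tessdata (Tesseract training tools) is on PATH.
list_lstm_dawgs() {
    model=$1
    if ! command -v combine_tessdata >/dev/null 2>&1; then
        echo "combine_tessdata not found" >&2
        return 127
    fi
    # -d lists the components stored in the model; keep only the *-dawg entries.
    # The fallback note is also printed if the model file cannot be read.
    combine_tessdata -d "$model" 2>&1 | grep -- '-dawg' || echo "no dictionary components"
}
```

Calling it as list_lstm_dawgs /path/to/tessdata/Fraktur_50000000.334_450937.traineddata (the directory is a placeholder) and getting "no dictionary components" would be consistent with the warning above.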

The problem is that some characters are recognized correctly but are not displayed properly, e.g.: Hoͤring / Verſuche / uͤber / laͤnger

You mean that the font of your terminal doesn't print these characters? Try a more complete Unicode font like those in the nerd-fonts project.
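A quick way to tell a rendering problem from a recognition problem is to look at the bytes in the output file rather than at the glyphs. The sketch below assumes the mark in words like "Hoͤring" is U+0364 (COMBINING LATIN SMALL LETTER E, UTF-8 bytes CD A4); has_combining_e is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if stdin contains U+0364
# (COMBINING LATIN SMALL LETTER E), encoded in UTF-8 as the bytes CD A4.
has_combining_e() {
    # LC_ALL=C makes grep compare raw bytes, independent of the terminal font
    # or the current locale.
    LC_ALL=C grep -q "$(printf '\315\244')"
}
```

For example, has_combining_e < 00000005.txt && echo "combining e present" would indicate that the text file holds the combining character and that only the display is at fault.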


Is it possible and necessary to add language model features?

kba commented

Is it possible and necessary to add language model features?

It is possible, see e.g. tesseract-ocr/tesstrain#155 (comment) on how to do it with tesstrain.

It is not necessary, and there is little to be gained with these dictionaries in place. They were more important in the pre-LSTM rule-based models, but AFAIK nobody bothers with them when training Tesseract 4+ models.

Ok. Thank you

Hey,

I tried to create a PDF as the output file and got an error: Cannot open file "/.../ocrd_all/venv/share/tessdata//pdf.ttf"!

The command I used: tesseract -l Fraktur_50000000.334_450937 input.tif output pdf

Any ideas?

I tried to create a PDF as the output file and got an error: Cannot open file "/.../ocrd_all/venv/share/tessdata//pdf.ttf"!

The command I used: tesseract -l Fraktur_50000000.334_450937 input.tif output pdf

I guess our packaged Tesseract installation does not include the config files you normally get. Try copying /usr/share/tesseract-ocr/4.00/tessdata/{pdf.ttf,configs/} to your venv/share/tessdata/.

EDIT: No, we actually do install all that correctly. But your error message contains a leading slash, which hints at a wrong VIRTUAL_ENV setting...

What do you mean by a wrong VIRTUAL_ENV setting?

kba commented

What do you mean by a wrong VIRTUAL_ENV setting?

It's suspicious that tesseract tries to look in /.../ocrd_all/venv/share/tessdata//pdf.ttf (note the leading / and the // between tessdata and pdf.ttf). What's the output of echo $VIRTUAL_ENV and ls -la $VIRTUAL_ENV/share/tessdata?
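Alongside the ls output, it may also help to check whether pdf.ttf exists at all or is a symlink whose target is missing. The following is a minimal sketch; check_tessdata_entry is a hypothetical helper name, and it only reports what the filesystem says without claiming to be the cause of the error.

```shell
# Hypothetical helper: report how an entry under a tessdata directory resolves.
# Prints "ok", "dangling symlink", or "missing".
check_tessdata_entry() {
    dir=$1
    name=$2
    path="$dir/$name"
    if [ -L "$path" ] && [ ! -e "$path" ]; then
        # -L is true for the link itself; -e follows it and fails if the target is gone.
        echo "dangling symlink"
    elif [ -e "$path" ]; then
        echo "ok"
    else
        echo "missing"
    fi
}
```

A call such as check_tessdata_entry "$VIRTUAL_ENV/share/tessdata" pdf.ttf would then show whether the file Tesseract complains about can actually be opened.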

output of echo $VIRTUAL_ENV: /home/superman/ocrd_all/venv

output ls -la $VIRTUAL_ENV/share/tessdata:
total 1039432
drwxrwxr-x 5 superman superman 4096 Dec 2 15:32 .
drwxrwxr-x 4 superman superman 4096 Dec 2 15:19 ..
drwxrwxr-x 8 superman superman 4096 Dec 2 15:20 .git
-rw-rw-r-- 1 superman superman 102 Dec 2 15:20 .gitmodules
-rw-rw-r-- 1 superman superman 11358 Dec 2 15:20 LICENSE
-rw-rw-r-- 1 superman superman 1393 Dec 2 15:20 README.md
-rw-rw-r-- 1 superman superman 7851157 Dec 2 15:20 afr.traineddata
-rw-rw-r-- 1 superman superman 8423467 Dec 2 15:20 amh.traineddata
-rw-rw-r-- 1 superman superman 2494806 Dec 2 15:20 ara.traineddata
-rw-rw-r-- 1 superman superman 2045457 Dec 2 15:20 asm.traineddata
-rw-rw-r-- 1 superman superman 10139884 Dec 2 15:20 aze.traineddata
-rw-rw-r-- 1 superman superman 4726411 Dec 2 15:20 aze_cyrl.traineddata
-rw-rw-r-- 1 superman superman 11185811 Dec 2 15:20 bel.traineddata
-rw-rw-r-- 1 superman superman 1789439 Dec 2 15:20 ben.traineddata
-rw-rw-r-- 1 superman superman 1966470 Dec 2 15:20 bod.traineddata
-rw-rw-r-- 1 superman superman 7930443 Dec 2 15:20 bos.traineddata
-rw-rw-r-- 1 superman superman 6335020 Dec 2 15:20 bre.traineddata
-rw-rw-r-- 1 superman superman 8371797 Dec 2 15:20 bul.traineddata
-rw-rw-r-- 1 superman superman 6502025 Dec 2 15:20 cat.traineddata
-rw-rw-r-- 1 superman superman 2402839 Dec 2 15:20 ceb.traineddata
-rw-rw-r-- 1 superman superman 16238266 Dec 2 15:20 ces.traineddata
-rw-rw-r-- 1 superman superman 44366093 Dec 2 15:20 chi_sim.traineddata
-rw-rw-r-- 1 superman superman 2470991 Dec 2 15:20 chi_sim_vert.traineddata
-rw-rw-r-- 1 superman superman 59025073 Dec 2 15:20 chi_tra.traineddata
-rw-rw-r-- 1 superman superman 2368306 Dec 2 15:20 chi_tra_vert.traineddata
-rw-rw-r-- 1 superman superman 1458011 Dec 2 15:20 chr.traineddata
-rw-rw-r-- 1 superman superman 19 Dec 2 15:32 configs
-rw-rw-r-- 1 superman superman 2299112 Dec 2 15:20 cos.traineddata
-rw-rw-r-- 1 superman superman 5998144 Dec 2 15:20 cym.traineddata
-rw-rw-r-- 1 superman superman 10578171 Dec 2 15:20 dan.traineddata
-rw-rw-r-- 1 superman superman 1622858 Dec 2 15:20 dan_frak.traineddata
-rw-rw-r-- 1 superman superman 15437534 Dec 2 15:20 deu.traineddata
-rw-rw-r-- 1 superman superman 1978741 Dec 2 15:20 deu_frak.traineddata
-rw-rw-r-- 1 superman superman 1774548 Dec 2 15:20 div.traineddata
-rw-rw-r-- 1 superman superman 449626 Dec 2 15:20 dzo.traineddata
-rw-rw-r-- 1 superman superman 7543380 Dec 2 15:20 ell.traineddata
-rw-rw-r-- 1 superman superman 23466654 Dec 2 15:20 eng.traineddata
-rw-rw-r-- 1 superman superman 5207312 Dec 2 15:20 enm.traineddata
-rw-rw-r-- 1 superman superman 11330444 Dec 2 15:20 epo.traineddata
-rw-rw-r-- 1 superman superman 2251950 Dec 2 15:20 equ.traineddata
-rw-rw-r-- 1 superman superman 15301628 Dec 2 15:20 est.traineddata
-rw-rw-r-- 1 superman superman 10145441 Dec 2 15:20 eus.traineddata
-rw-rw-r-- 1 superman superman 3439785 Dec 2 15:20 fao.traineddata
-rw-rw-r-- 1 superman superman 561272 Dec 2 15:20 fas.traineddata
-rw-rw-r-- 1 superman superman 2509440 Dec 2 15:20 fil.traineddata
-rw-rw-r-- 1 superman superman 21140513 Dec 2 15:20 fin.traineddata
-rw-rw-r-- 1 superman superman 14213351 Dec 2 15:20 fra.traineddata
-rw-rw-r-- 1 superman superman 22874034 Dec 2 15:20 frk.traineddata
-rw-rw-r-- 1 superman superman 17856636 Dec 2 15:20 frm.traineddata
-rw-rw-r-- 1 superman superman 1906031 Dec 2 15:20 fry.traineddata
-rw-rw-r-- 1 superman superman 3068320 Dec 2 15:20 gla.traineddata
-rw-rw-r-- 1 superman superman 4664254 Dec 2 15:20 gle.traineddata
-rw-rw-r-- 1 superman superman 8074927 Dec 2 15:20 glg.traineddata
-rw-rw-r-- 1 superman superman 7428728 Dec 2 15:20 grc.traineddata
-rw-rw-r-- 1 superman superman 1963128 Dec 2 15:20 guj.traineddata
-rw-rw-r-- 1 superman superman 3326722 Dec 2 15:20 hat.traineddata
-rw-rw-r-- 1 superman superman 5413459 Dec 2 15:20 heb.traineddata
-rw-rw-r-- 1 superman superman 1651010 Dec 2 15:20 hin.traineddata
-rw-rw-r-- 1 superman superman 13797409 Dec 2 15:20 hrv.traineddata
-rw-rw-r-- 1 superman superman 18051690 Dec 2 15:20 hun.traineddata
-rw-rw-r-- 1 superman superman 3594112 Dec 2 15:20 hye.traineddata
-rw-rw-r-- 1 superman superman 3797385 Dec 2 15:20 iku.traineddata
-rw-rw-r-- 1 superman superman 8279230 Dec 2 15:20 ind.traineddata
-rw-rw-r-- 1 superman superman 9037703 Dec 2 15:20 isl.traineddata
-rw-rw-r-- 1 superman superman 15951701 Dec 2 15:20 ita.traineddata
-rw-rw-r-- 1 superman superman 17345259 Dec 2 15:20 ita_old.traineddata
-rw-rw-r-- 1 superman superman 7386965 Dec 2 15:20 jav.traineddata
-rw-rw-r-- 1 superman superman 35659159 Dec 2 15:20 jpn.traineddata
-rw-rw-r-- 1 superman superman 3039939 Dec 2 15:20 jpn_vert.traineddata
-rw-rw-r-- 1 superman superman 3608311 Dec 2 15:20 kan.traineddata
-rw-rw-r-- 1 superman superman 8744377 Dec 2 15:20 kat.traineddata
-rw-rw-r-- 1 superman superman 1082383 Dec 2 15:20 kat_old.traineddata
-rw-rw-r-- 1 superman superman 9263539 Dec 2 15:20 kaz.traineddata
-rw-rw-r-- 1 superman superman 1446906 Dec 2 15:20 khm.traineddata
-rw-rw-r-- 1 superman superman 15430595 Dec 2 15:20 kir.traineddata
-rw-rw-r-- 1 superman superman 3568645 Dec 2 15:20 kmr.traineddata
-rw-rw-r-- 1 superman superman 15317715 Dec 2 15:20 kor.traineddata
-rw-rw-r-- 1 superman superman 1128590 Dec 2 15:20 kor_vert.traineddata
-rw-rw-r-- 1 superman superman 7055204 Dec 2 15:20 lao.traineddata
-rw-rw-r-- 1 superman superman 9215366 Dec 2 15:20 lat.traineddata
-rw-rw-r-- 1 superman superman 10635271 Dec 2 15:20 lav.traineddata
-rw-rw-r-- 1 superman superman 12629538 Dec 2 15:20 lit.traineddata
-rw-rw-r-- 1 superman superman 2606439 Dec 2 15:20 ltz.traineddata
-rw-rw-r-- 1 superman superman 5953416 Dec 2 15:20 mal.traineddata
-rw-rw-r-- 1 superman superman 3193116 Dec 2 15:20 mar.traineddata
-rw-rw-r-- 1 superman superman 5323418 Dec 2 15:20 mkd.traineddata
-rw-rw-r-- 1 superman superman 7426902 Dec 2 15:20 mlt.traineddata
-rw-rw-r-- 1 superman superman 2137055 Dec 2 15:20 mon.traineddata
-rw-rw-r-- 1 superman superman 862986 Dec 2 15:20 mri.traineddata
-rw-rw-r-- 1 superman superman 8243366 Dec 2 15:20 msa.traineddata
-rw-rw-r-- 1 superman superman 4640591 Dec 2 15:20 mya.traineddata
-rw-rw-r-- 1 superman superman 2189424 Dec 2 15:20 nep.traineddata
-rw-rw-r-- 1 superman superman 23163950 Dec 2 15:20 nld.traineddata
-rw-rw-r-- 1 superman superman 12397893 Dec 2 15:20 nor.traineddata
-rw-rw-r-- 1 superman superman 6322100 Dec 2 15:20 oci.traineddata
-rw-rw-r-- 1 superman superman 1480096 Dec 2 15:20 ori.traineddata
-rw-rw-r-- 1 superman superman 10562874 Dec 2 15:20 osd.traineddata
-rw-rw-r-- 1 superman superman 1698789 Dec 2 15:20 pan.traineddata
lrwxrwxrwx 1 superman superman 19 Dec 2 15:20 pdf.ttf -> tessconfigs/pdf.ttf
-rw-rw-r-- 1 superman superman 19344135 Dec 2 15:20 pol.traineddata
-rw-rw-r-- 1 superman superman 15336931 Dec 2 15:20 por.traineddata
-rw-rw-r-- 1 superman superman 1772117 Dec 2 15:20 pus.traineddata
-rw-rw-r-- 1 superman superman 5026349 Dec 2 15:20 que.traineddata
-rw-rw-r-- 1 superman superman 11008634 Dec 2 15:20 ron.traineddata
-rw-rw-r-- 1 superman superman 19920885 Dec 2 15:20 rus.traineddata
-rw-rw-r-- 1 superman superman 12404680 Dec 2 15:20 san.traineddata
drwxrwxr-x 2 superman superman 4096 Dec 2 15:20 script
-rw-rw-r-- 1 superman superman 1727443 Dec 2 15:20 sin.traineddata
-rw-rw-r-- 1 superman superman 14100356 Dec 2 15:20 slk.traineddata
-rw-rw-r-- 1 superman superman 845398 Dec 2 15:20 slk_frak.traineddata
-rw-rw-r-- 1 superman superman 9942454 Dec 2 15:20 slv.traineddata
-rw-rw-r-- 1 superman superman 1694065 Dec 2 15:20 snd.traineddata
-rw-rw-r-- 1 superman superman 18256019 Dec 2 15:20 spa.traineddata
-rw-rw-r-- 1 superman superman 19628379 Dec 2 15:20 spa_old.traineddata
-rw-rw-r-- 1 superman superman 8575759 Dec 2 15:20 sqi.traineddata
-rw-rw-r-- 1 superman superman 7434267 Dec 2 15:20 srp.traineddata
-rw-rw-r-- 1 superman superman 9375025 Dec 2 15:20 srp_latn.traineddata
-rw-rw-r-- 1 superman superman 1369513 Dec 2 15:20 sun.traineddata
-rw-rw-r-- 1 superman superman 6029030 Dec 2 15:20 swa.traineddata
-rw-rw-r-- 1 superman superman 13627152 Dec 2 15:20 swe.traineddata
-rw-rw-r-- 1 superman superman 2207238 Dec 2 15:20 syr.traineddata
-rw-rw-r-- 1 superman superman 3353079 Dec 2 15:20 tam.traineddata
-rw-rw-r-- 1 superman superman 1072909 Dec 2 15:20 tat.traineddata
-rw-rw-r-- 1 superman superman 3315170 Dec 2 15:20 tel.traineddata
drwxrwxr-x 2 superman superman 4096 Dec 2 15:20 tessconfigs
-rw-rw-r-- 1 superman superman 3721580 Dec 2 15:20 tgk.traineddata
-rw-rw-r-- 1 superman superman 7321367 Dec 2 15:20 tgl.traineddata
-rw-rw-r-- 1 superman superman 1072519 Dec 2 15:20 tha.traineddata
-rw-rw-r-- 1 superman superman 2184930 Dec 2 15:20 tir.traineddata
-rw-rw-r-- 1 superman superman 947262 Dec 2 15:20 ton.traineddata
-rw-rw-r-- 1 superman superman 18750612 Dec 2 15:20 tur.traineddata
-rw-rw-r-- 1 superman superman 2794302 Dec 2 15:20 uig.traineddata
-rw-rw-r-- 1 superman superman 12408644 Dec 2 15:20 ukr.traineddata
-rw-rw-r-- 1 superman superman 1398748 Dec 2 15:20 urd.traineddata
-rw-rw-r-- 1 superman superman 10757098 Dec 2 15:20 uzb.traineddata
-rw-rw-r-- 1 superman superman 4905658 Dec 2 15:20 uzb_cyrl.traineddata
-rw-rw-r-- 1 superman superman 7763728 Dec 2 15:20 vie.traineddata
-rw-rw-r-- 1 superman superman 4882707 Dec 2 15:20 yid.traineddata
-rw-rw-r-- 1 superman superman 963413 Dec 2 15:20 yor.traineddata