no correction with ocrd-cis-postcorrect
I'm running ocrd-cis-postcorrect on the aligned OCR output of Calamari and Tesserocr. So far, the output seems to be completely identical to the input, even though there are quite a few differences between the results of the two OCR engines. See e.g. the attached example.
postcorrect.zip
How can I achieve some correction results?
Thanks for reporting. I am having a look.
It appears that both files are line-segmented. The post-correction needs word-segmented input.
Anyway you could try to set the OCR to output word segments (as well as line segments).
thanks for your quick reply! I'll try it again with word segments and report back
I finally tried ocrd-cis-postcorrect again, this time with two OCR results from Tesseract and Calamari, both segmented on word level (and aligned beforehand). Unfortunately I now run into an error (see attachment), and no output files are produced at all.
From a quick glance I suspect problems with the profiling. Can you rerun the same command with --log-level DEBUG? I'll take a closer look later today.
thx a lot for your quick reply! there's the log file
In order to run our post correction, both our profiler and a corresponding language backend have to be installed on the system. The configuration variable profilerPath (which should more appropriately be named profilerCommand) must point to the profiler executable, and the profilerConfig variable must point to the corresponding language configuration file. There is a manual for the profiler and the language backend in our repositories.
The other way is to use the profiler that is installed by this project's Dockerfile, using docker. You can execute the following steps to build and test the docker container:
$ cd path/to/ocrd_cis # Change into ocrd_cis directory.
$ sudo docker build -t ocrd_cis . # Build the ocrd_cis docker image (this will take some time).
$ sudo docker run ocrd_cis /apps/profiler --help # Check the profiler command in the image.
$ echo 'Theyle' | sudo docker run -i ocrd_cis /apps/profiler \
--config /etc/profiler/languages/german.ini \
--sourceFormat TXT --sourceFile /dev/stdin --simpleOutput
Then you can write a shell script that executes sudo docker run -i ocrd_cis /apps/profiler "$@", set the profilerPath to this script and the profilerConfig to e.g. /etc/profiler/languages/german.ini (a language configuration file within the docker container).
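For anyone following along, the wrapper script described here could look roughly like the sketch below. The file name profiler.sh and the DOCKER_CMD override are my own assumptions for illustration and not part of ocrd_cis. The image name ocrd_cis and the path /apps/profiler come from the build steps above. Note that -i is an option of docker run, so it goes after run.

```shell
# Write a hypothetical wrapper script, profiler.sh, into the current directory.
# The quoted 'EOF' keeps "$@" literal so it is expanded when the wrapper runs.
cat > profiler.sh <<'EOF'
#!/bin/sh
# Forward all arguments (and stdin, via -i) to the profiler in the ocrd_cis image.
# DOCKER_CMD is an assumed override, e.g. DOCKER_CMD=docker if sudo is not needed.
exec ${DOCKER_CMD:-sudo docker} run -i ocrd_cis /apps/profiler "$@"
EOF
chmod +x profiler.sh
```

profilerPath would then be set to the absolute path of profiler.sh. The wrapper can be checked the same way as the manual test above, i.e. by piping a word into ./profiler.sh with the --config, --sourceFormat, --sourceFile and --simpleOutput arguments.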
The third option is to run the post correction directly from the built docker image. I see that these points are not very clear in the documentation for the post correction. I will improve the documentation to make the configuration of the profiler clearer.
And I forgot to mention that the error you are getting is due to a bad profiler configuration.
thanks for your help! I'm using a native installation of ocrd_all and assumed that it included everything I need to run ocrd-cis-postcorrect (except the model). But then I guess I still need to install the profiler and the language backend.