ocrd-cis-ocropy-recognize: 'ascii' codec can't decode byte 0xa9
models:
> find . -name *.pyrnn|xargs md5sum
bb90b17321987002afa6b94e650d16fa ./venv/lib/python3.6/site-packages/ocrd_cis/ocropy/models/fraktur.pyrnn
ef3238cd60cb1c35ede74573c8d14766 ./venv/lib/python3.6/site-packages/ocrd_cis/ocropy/models/fraktur-jze.pyrnn
file: https://digi.ub.uni-heidelberg.de/diglitData/jb/ocropy-test.jpg
command:
> ocrd-make -f crop-anyocr-binarize-page-olena-sauvola-denoise-ocropy-deskew-page-ocropy-segment-tesseract-ocropy-dewarp-ocr-ocropy-tesseract.mk
make: Entering directory '/home/jb/workspace/ocrd/ocrd4dwork'
building OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP from OCR-D-SEG-LINE-tesseract-ocropy-DEWARP with pattern rule for ocrd-cis-ocropy-recognize
ocrd workspace remove-group -r OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP 2>/dev/null || true
ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE-tesseract-ocropy-DEWARP -O OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP -p OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP.json 2>&1 | tee OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP.log && touch -c OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP || { rm -fr OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP.json OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP; exit 1; }
16:39:06.634 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-SEG-LINE-tesseract-ocropy-DEWARP'] output_file_grp=['OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP']
Traceback (most recent call last):
  File "/home/jb/ocrd_all/venv/bin/ocrd-cis-ocropy-recognize", line 8, in <module>
    sys.exit(ocrd_cis_ocropy_recognize())
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd_cis/ocropy/cli.py", line 49, in ocrd_cis_ocropy_recognize
    return ocrd_cli_wrap_processor(OcropyRecognize, *args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd/decorators.py", line 54, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd/processor/base.py", line 57, in run_processor
    processor.process()
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd_cis/ocropy/recognize.py", line 134, in process
    self.network = load_object(self.get_model(), verbose=1)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd_cis/ocropy/ocrolib/common.py", line 459, in load_object
    return unpickler.load()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa9 in position 0: ordinal not in range(128)
Makefile:304: recipe for target 'OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP' failed
make: *** [OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP] Error 1
make: Leaving directory '/home/jb/workspace/ocrd/ocrd4dwork'
Thanks for reporting!
I believe this is an artifact of incomplete Python 2-3 porting. You can avoid it by leaving the file in gzip-compressed form (with .gz extension).
The uncompressed case needs to use the same latin1 encoding IMO.
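For illustration, here is a minimal sketch of what such a loader could look like. It is not the actual ocrolib code: the function name load_py2_pickle is hypothetical, and the only assumption is that the model file is a pickle written by Python 2. It uses the standard pickle and gzip modules and passes the same latin1 encoding on both the compressed and the uncompressed path.

```python
import gzip
import pickle


def load_py2_pickle(path):
    """Load a pickle written by Python 2, gzip-compressed or not.

    Hypothetical helper for illustration; not part of ocrd_cis or ocrolib.
    """
    # Pick the opener from the file extension; both return binary streams.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        # Python 3 decodes Python 2 `str` payloads as ASCII by default,
        # which fails on bytes >= 0x80 (such as 0xa9). latin1 maps every
        # byte value to a code point, so decoding cannot fail.
        return pickle.load(stream, encoding="latin1")
```

Without the encoding argument, pickle.load falls back to ASCII for Python 2 strings, which is consistent with the UnicodeDecodeError in the traceback.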
Tried it, but then ocrd-cis-ocropy-recognize does not find the *.pyrnn.gz.
That's odd. Relative paths should be searched:
1. in __file__'s directory, e.g. venv/lib/python3.6/site-packages/ocrd_cis/ocropy
2. in __file__'s models subdirectory, e.g. venv/lib/python3.6/site-packages/ocrd_cis/ocropy/models
3. in any of the directories mentioned in ocrolib.ocropus_find_file:

    Result of searching $fname is the first existing in:

        * $base/$fname
        * $base/$fname.gz
        * $base/model/$fname
        * $base/model/$fname.gz
        * $base/data/$fname
        * $base/data/$fname.gz
        * $base/gui/$fname
        * $base/gui/$fname.gz # if gz

    $base can be four base paths:

        * `$OCROPUS_DATA` environment variable
        * current working directory
        * ../../../../share/ocropus from this file's install location
        * `/usr/local/share/ocropus`
        * `$PREFIX/share/ocropus` ($PREFIX being the Python installation prefix, usually `/usr`)
Option 3 probably won't help you, because the CWD is the OCR-D workspace directory in the processor's context, and you probably never installed ocropus itself.
So, you should stick with 1 or 2, in the .gz form (until we have patched the uncompressed case).
Perhaps you forgot to also add the .gz suffix in the makefile/parameter file?
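To make the lookup order above concrete, here is a rough sketch of it. This is not the real implementation: find_model, module_dir and base_dirs are hypothetical names, and the sketch assumes that options 1 and 2 match the file name exactly, while only the ocropus-style lookup (option 3) also tries an appended .gz. Under that assumption, a model stored as fraktur.pyrnn.gz is only found in the models directory when the parameter names it with the .gz suffix.

```python
import os


def find_model(fname, module_dir, base_dirs=()):
    """Return the first existing candidate path for fname, or None.

    Hypothetical sketch of the search order described in this thread.
    """
    candidates = [
        os.path.join(module_dir, fname),            # 1. next to the module
        os.path.join(module_dir, "models", fname),  # 2. its models subdirectory
    ]
    # 3. ocropus-style lookup: each base directory, with and without .gz
    for base in base_dirs:
        for sub in ("", "model", "data", "gui"):
            plain = os.path.join(base, sub, fname)
            candidates += [plain, plain + ".gz"]
    for path in candidates:
        if os.path.exists(path):
            return path
    return None
```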