OCR-D/ocrd_olena

Filenames with whitespace not supported

Closed this issue · 3 comments

ocrd-olena-binarize fails to process image names which contain blanks:

22:36:52.039 INFO ocrd-olena-binarize - processing PAGE-XML input file Add MS 23494_0027_page (Add MS 23494_0027)
22:36:52.065 INFO ocrd-olena-binarize - found AlternativeImage filename 'Add MS 23494_0027_B.tif' (B/W)
22:36:52.070 WARNING ocrd-olena-binarize - image URL 'Add MS 23494_0027_B.tif' not referenced
Error: wrong number of arguments! "23494_0027_B.tif" was not expected!
Error: wrong number of arguments! "WOLF-IMG/Add" was not expected!
Error: wrong number of arguments! "MS" was not expected!
Error: wrong number of arguments! "23494_0027_page-IMG-BIN_wolf.png" was not expected!

Should we try to support such filenames? Especially Windows users love them. I just tried to process real data from the British Library which were created with Aletheia.

Of course it's possible to rename such files and patch the PAGE files before using OCR-D. If we require this, there should be support for that kind of data preprocessing.

> Should we try to support such filenames? Especially Windows users love them. I just tried to process real data from the British Library which were created with Aletheia.

Absolutely. We are trying to drag this feature along everywhere meticulously. I was under the impression that this already worked. You've proven me wrong.

At a glance, it looks like the problem is not in ocrd-olena-binarize, but in scribo-cli (which uses $@ without quotes).
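To illustrate why an unquoted $@ produces the "wrong number of arguments" errors in the log above, here is a minimal sketch. The function names (count_args, forward_unquoted, forward_quoted) are hypothetical and are not taken from scribo-cli; they only show how the shell re-splits arguments on whitespace when $@ is not quoted.

```shell
#!/bin/sh
# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# Unquoted $@: every argument is split again on whitespace.
forward_unquoted() { count_args $@; }

# Quoted "$@": every argument is forwarded unchanged.
forward_quoted() { count_args "$@"; }

forward_unquoted "Add MS 23494_0027_B.tif" out.png   # prints 4
forward_quoted "Add MS 23494_0027_B.tif" out.png     # prints 2
```

With the unquoted form, the single file name arrives as three separate words, which matches the stray "Add", "MS" and "23494_0027_B.tif" arguments reported in the error messages.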

This is trivial to fix with a sed call during make install. Anyone?
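A rough sketch of what such a sed call could look like, assuming the wrapper simply contains a bare $@. The file below is a stand-in created for the demonstration, not the real scribo-cli, and the helper name patch_unquoted_args is made up for this example; the actual path and the place to hook it into make install would need to be checked against the build.

```shell
#!/bin/sh
# Hypothetical helper: wrap every bare $@ in double quotes.
# Occurrences directly preceded by a double quote are left untouched.
patch_unquoted_args() {
    sed 's/\([^"]\)\$@/\1"$@"/g' "$1" > "$1.new" && mv "$1.new" "$1"
}

# Stand-in wrapper script (assumed content, for illustration only).
tmp=$(mktemp -d)
cat > "$tmp/wrapper.sh" <<'EOF'
#!/bin/sh
printf '%s\n' $@
EOF

patch_unquoted_args "$tmp/wrapper.sh"

# After patching, each argument is printed on exactly one line.
sh "$tmp/wrapper.sh" "Add MS 23494_0027_B.tif" out.png
```

The pattern is deliberately simple: it would also rewrite a $@ that sits inside a longer double-quoted string, so the patched file should be reviewed rather than trusted blindly.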

> This is trivial to fix with a sed call during make install. Anyone?

Or let's wait for OCR-D/olena#4 and adopt here.

> This is trivial to fix with a sed call during make install. Anyone?

> Or let's wait for OCR-D/olena#4 and adopt here.

Fixed via #49