bashlib input_files: ensure download_file (as in all Pythonic processors)
bertsky opened this issue · 0 comments
All of our processors written in Python call Workspace.download_file(input_file) in their processing loop. This ensures the file is available locally, even if it was still a URL (saving it under a reproducible temporary path).
Unfortunately, our bashlib processors have no chance to get that behaviour: ocrd workspace find --download would inevitably persist the downloaded file, which is perhaps not entirely wrong, but different from the Python processors. Regardless, it's not what we do in ocrd_olena, ocrd_pagetopdf, ocrd_fileformat, ocrd_im6convert etc.
Hence, if the input fileGrp is entirely remote, we only get messages like this:
ERROR ocrd.ocrd-olena-binarize - input file ID=FILE_0024_DEFAULT (pageId=PHYS_0024 MIME=image/jpg) is not on disk
The result is a nominally successful run without an actual output fileGrp:
Exception: Invalid state: expected output file group 'OCR-D-BIN' not in METS (despite processor success)
Now, the solution I propose is simple: have ocrd bashlib input-files (which does have access to Workspace.download_file(input_file)) do the job!
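To illustrate the idea, here is a minimal, self-contained sketch of the behaviour I have in mind. It is not the actual OCR-D core code: InputFile, StubWorkspace, ensure_local and the /tmp/ocrd-sketch path are hypothetical stand-ins, and the stub only pretends to download. The point is merely that each input file gets passed through download_file before it is handed on to the shell side, so files that are already local stay untouched and remote ones get a local path.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class InputFile:
    """Hypothetical stand-in for a METS file entry (not the OCR-D class)."""
    ID: str
    url: str
    local_filename: Optional[str] = None


class StubWorkspace:
    """Hypothetical stand-in that only offers download_file(input_file)."""

    def __init__(self, tempdir: str = "/tmp/ocrd-sketch"):
        self.tempdir = tempdir
        self.downloaded: List[str] = []

    def download_file(self, input_file: InputFile) -> InputFile:
        # Already on disk: nothing to do.
        if input_file.local_filename:
            return input_file
        # Pretend to fetch the URL (no network access in this sketch).
        # The path depends only on the file ID, so it is the same on every run.
        input_file.local_filename = f"{self.tempdir}/{input_file.ID}"
        self.downloaded.append(input_file.ID)
        return input_file


def ensure_local(workspace, input_files: Iterable[InputFile]) -> List[InputFile]:
    """Pass every input file through download_file before handing it on."""
    return [workspace.download_file(f) for f in input_files]


files = [
    # Remote only: would trigger the "is not on disk" error today.
    InputFile("FILE_0024_DEFAULT", "https://example.org/0024.jpg"),
    # Already local: must be left alone.
    InputFile("FILE_0025_DEFAULT", "https://example.org/0025.jpg", "DEFAULT/0025.jpg"),
]
ws = StubWorkspace()
for f in ensure_local(ws, files):
    print(f.ID, f.local_filename)
```

With something along these lines inside ocrd bashlib input-files, the bashlib processors themselves would not need to change: they would simply receive local paths for all input files.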