bashlib input_files: ensure download_file (as in all Pythonic processors)

Question


bertsky opened this issue 8 months ago · 0 comments

All of our processors written in Python call Workspace.download_file(input_file) in their processing loop. This ensures the file is available locally, even if it was still only a URL (saving it under a reproducible temporary path).
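To make the pattern concrete, here is a minimal, self-contained sketch of such a processing loop. It is not OCR-D code: `InputFile`, `ensure_local` and `process` are hypothetical stand-ins for illustration only, and the hash-based naming is just one assumed way to get a reproducible temporary path. In a real processor, the `Workspace.download_file(input_file)` call takes the place of `ensure_local`.

```python
import hashlib
import os
import shutil
import tempfile
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class InputFile:
    """Hypothetical stand-in for a METS file entry (not the OCR-D class)."""
    ID: str
    url: str
    local_filename: Optional[str] = None


def ensure_local(input_file: InputFile, tmpdir: str) -> InputFile:
    """Make sure input_file exists on disk, fetching it if it is only a URL."""
    if input_file.local_filename and os.path.exists(input_file.local_filename):
        return input_file  # already on disk, nothing to do
    # Derive the target path from the URL, so that repeated runs map the
    # same remote file to the same temporary location (assumed scheme).
    digest = hashlib.sha1(input_file.url.encode("utf-8")).hexdigest()[:12]
    target = os.path.join(tmpdir, digest + "_" + os.path.basename(input_file.url))
    if not os.path.exists(target):
        with urllib.request.urlopen(input_file.url) as response, \
                open(target, "wb") as out:
            shutil.copyfileobj(response, out)
    input_file.local_filename = target
    return input_file


def process(input_files, tmpdir=None):
    """Processing loop: every file is made local before it is used."""
    tmpdir = tmpdir or tempfile.gettempdir()
    for input_file in input_files:
        input_file = ensure_local(input_file, tmpdir)
        # A real processor would now read input_file.local_filename.
        yield input_file.ID, input_file.local_filename
```

The point of the sketch is that the loop body never has to care whether a file started out as a local path or as a URL; that decision is made once, in one place.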

Unfortunately, our bashlib processors have no way to get that behaviour: ocrd workspace find --download would inevitably persist the downloaded file, which is perhaps not entirely wrong, but differs from what the Python processors do. In any case, it is not what we do in ocrd_olena, ocrd_pagetopdf, ocrd_fileformat, ocrd_im6convert etc.

Hence, if the input fileGrp is entirely remote, we only get messages like this:

ERROR ocrd.ocrd-olena-binarize - input file ID=FILE_0024_DEFAULT (pageId=PHYS_0024 MIME=image/jpg) is not on disk

The result would be a nominally successful run without an actual output fileGrp:

Exception: Invalid state: expected output file group 'OCR-D-BIN' not in METS (despite processor success)

Now, the solution I propose is simple: have ocrd bashlib input-files (which does have access to Workspace.download_file(input_file)) do the job!
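A rough sketch of the idea, under stated assumptions: the function that enumerates input files for the shell side calls a download step on each file before emitting it. The name `emit_input_files`, the dict-based file records and the `key=value` line format are all hypothetical; the actual output format of ocrd bashlib input-files is not shown here. The `download` parameter stands for whatever plays the role of `Workspace.download_file(input_file)`.

```python
import shlex


def emit_input_files(input_files, download):
    """Yield one shell-safe line per input file, after making it local.

    `download` is any callable that takes a file record and returns a
    record with a usable local path (hypothetical interface).
    """
    for input_file in input_files:
        local = download(input_file)  # ensure the file is on disk first
        fields = {
            "ID": local["ID"],
            "pageId": local["pageId"],
            "mimetype": local["mimetype"],
            "local_filename": local["local_filename"],
        }
        # Quote every value so the consuming shell loop can parse it safely.
        yield " ".join(
            f"{key}={shlex.quote(str(value))}" for key, value in fields.items()
        )
```

With this shape, the bash side would only ever see local paths, so an "is not on disk" check there could no longer fail merely because the input fileGrp was remote.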