METS Server based page parallelism for `ocrd process`
kba opened this issue · 1 comments
kba commented
BTW, we could also provide this per-page parallelism recipe in core via Python. For the user, it could then look like
`ocrd process --jobs 4 --timeout 2m --on-error=empty`
Originally posted by @bertsky in OCR-D/ocrd-demo-mets-server#3 (comment)
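For illustration, the proposed options could be parsed roughly like this. This is only a sketch using the standard library's `argparse`, not the actual `ocrd` CLI code; the `parse_duration` helper, the accepted unit suffixes (`s`, `m`, `h`) and the defaults are assumptions, since the proposal only shows `2m`.

```python
import argparse
import re

# Assumed unit suffixes; the proposal itself only shows "2m".
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert a string such as '2m' or '90' (plain seconds) into seconds."""
    match = re.fullmatch(r"(\d+)([smh]?)", text.strip())
    if match is None:
        raise argparse.ArgumentTypeError(f"invalid duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNITS[unit or "s"]


def build_parser():
    """Build a stand-in parser for the three proposed options."""
    parser = argparse.ArgumentParser(prog="ocrd process")
    parser.add_argument("--jobs", type=int, default=1,
                        help="number of pages to process in parallel")
    parser.add_argument("--timeout", type=parse_duration, default=None,
                        help="time limit per lowest-level step, e.g. 2m")
    parser.add_argument("--on-error", default="raise",
                        choices=["raise", "ignore", "skip", "empty"],
                        help="what to do when a step fails")
    return parser


args = build_parser().parse_args(
    ["--jobs", "4", "--timeout", "2m", "--on-error=empty"])
```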
bertsky commented
To elaborate:
- add an option `--jobs` to `ocrd process` which would split the workspace into per-page pipelines synchronised via METS server and managed by Python's builtin `multiprocessing` facilities.
  → could also offer additional options (splitting up into chunks instead of pages...)
- add another option `--timeout`, applicable to the lowest substep (i.e. whole-workspace single-processor call normally, single-page single-processor call in parallel case)
  → now merely as a stopgap, later to be implemented in `Processor.process_page` and `Processor.process_workspace` when we have the new processor API
- add another option `--on-error` offering various options (raise, ignore, skip, empty)
  → now merely as a stopgap, later to be implemented in `Processor.process_page` and `Processor.process_workspace` when we have the new processor API including error handling
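To make the three options more concrete, here is a minimal sketch of a per-page dispatcher built on `multiprocessing.Pool`. It is not taken from core: `run_pages`, its parameters and the `pool_cls` hook are hypothetical names, and the METS server synchronisation is left out entirely. The policy semantics are guesses as well: `raise` aborts on the first failure, `skip` leaves the page out of the results, and `empty` records a `None` placeholder. `ignore` is omitted because how it would differ from `skip` is not spelled out above.

```python
import multiprocessing


def run_pages(page_ids, worker, jobs=1, timeout=None, on_error="raise",
              pool_cls=multiprocessing.Pool):
    """Run worker(page_id) for every page, using a pool of `jobs` workers.

    Returns a dict mapping page IDs to whatever the worker returned.
    `worker` must be picklable (a module-level function) for process pools.
    """
    if on_error not in ("raise", "skip", "empty"):
        raise ValueError(f"unsupported on_error policy: {on_error!r}")
    results = {}
    with pool_cls(processes=jobs) as pool:
        # Submit everything first so up to `jobs` pages run concurrently.
        pending = [(page_id, pool.apply_async(worker, (page_id,)))
                   for page_id in page_ids]
        for page_id, job in pending:
            try:
                # Waits at most `timeout` seconds from this call, so pages
                # collected later have effectively had more time to finish.
                results[page_id] = job.get(timeout)
            except Exception:  # includes multiprocessing.TimeoutError
                if on_error == "raise":
                    raise
                if on_error == "empty":
                    results[page_id] = None  # assumed placeholder result
                # "skip": leave the page out of the results
    return results
```

A caller would pass a function that runs the whole pipeline for one page ID, for example `run_pages(page_ids, process_one_page, jobs=4, timeout=120, on_error="empty")`, where `process_one_page` is a hypothetical name. Note that `get(timeout)` only stops waiting; the worker itself is not interrupted until the pool is torn down, which is one reason a timeout inside the processor API would be the cleaner long-term solution.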