OCR-D/core

METS Server-based page parallelism for `ocrd process`

kba opened this issue · 1 comments

kba commented
BTW, we could also provide this per-page parallelism recipe in core via Python. For the user, it could then look like:

ocrd process --jobs 4 --timeout 2m --on-error=empty

Originally posted by @bertsky in OCR-D/ocrd-demo-mets-server#3 (comment)

To elaborate:

  • add an option --jobs to ocrd process that splits the workspace into per-page pipelines, synchronised via the METS Server and managed by Python's built-in multiprocessing facilities.
    → could also offer additional options (e.g. splitting into chunks instead of pages)
  • add another option --timeout, applied to the lowest-level substep (i.e. a whole-workspace single-processor call in the normal case, a single-page single-processor call in the parallel case)
    → for now merely as a stopgap, later to be implemented in Processor.process_page and Processor.process_workspace once we have the new processor API
  • add another option --on-error with a choice of policies (raise, ignore, skip, empty)
    → for now merely as a stopgap, later to be implemented in Processor.process_page and Processor.process_workspace once we have the new processor API, including error handling
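
To make the three options above more concrete, here is a rough sketch of how the stopgap could look. It is not the actual `ocrd process` implementation. All function names (`parse_timeout`, `run_step`, `process_page`, `process_pages`) are hypothetical, and so is the meaning given to each `--on-error` policy, which this issue does not define yet. For simplicity the sketch drives one subprocess per page and step from a thread pool (`concurrent.futures`) instead of `multiprocessing`, and it assumes every task is an argv list to which `--page-id` can be appended. Wiring up the METS Server is left out.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

ON_ERROR_POLICIES = ("raise", "ignore", "skip", "empty")
_UNITS = {"s": 1, "m": 60, "h": 3600}  # assumed suffixes for --timeout


def parse_timeout(text):
    """Turn a string like '2m' or '90' into seconds (assumed syntax)."""
    text = text.strip()
    if text and text[-1] in _UNITS:
        return float(text[:-1]) * _UNITS[text[-1]]
    return float(text)


def run_step(cmd, timeout=None):
    """Run one processor call; raises on non-zero exit or on timeout."""
    subprocess.run(cmd, check=True, timeout=timeout)


def process_page(page_id, tasks, timeout=None, on_error="raise", run=run_step):
    """Run all tasks for a single page in order and return (page_id, status)."""
    if on_error not in ON_ERROR_POLICIES:
        raise ValueError(f"unknown on-error policy: {on_error}")
    for task in tasks:
        cmd = list(task) + ["--page-id", page_id]
        try:
            run(cmd, timeout)
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            if on_error == "raise":
                raise  # abort the whole run
            if on_error == "ignore":
                continue  # assumed: carry on with the next step of this page
            if on_error == "skip":
                return page_id, "skipped"  # assumed: drop the rest of this page
            # assumed: 'empty' would write an empty result for the page;
            # here it is only reported, nothing is written
            return page_id, "empty"
    return page_id, "ok"


def process_pages(page_ids, tasks, jobs=1, timeout=None, on_error="raise",
                  run=run_step):
    """Run the per-page pipelines with at most `jobs` pages in flight."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = pool.map(
            lambda page: process_page(page, tasks, timeout, on_error, run),
            page_ids,
        )
        # consuming the iterator re-raises any exception from a worker
        return dict(results)
```

With this shape, `ocrd process --jobs 4 --timeout 2m --on-error=empty` would roughly correspond to `process_pages(page_ids, tasks, jobs=4, timeout=parse_timeout("2m"), on_error="empty")`. Passing the runner in as a parameter keeps the timeout and error handling testable without calling real processors.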