bertsky/workflow-configuration

leverage more facilities offered by GNU parallel


With (ocrd-)make, we want to be able to apply a given workflow either to a set of workspaces together (possibly in parallel), by recursing into each workspace, or to a single workspace if it is already the CWD. This necessitates determining from within the Makefile where we are, by searching recursively for mets.xml files, before anything else is done. The choice of the default target crucially depends on this, as does the availability of certain special targets (like info, show, help, all, server, install). This not only costs extra time, it also makes the Makefile hard to read (with lots of ifeq conditionals and enumerations of targets).
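
In plain shell, that up-front detection amounts to something like the following (a hypothetical sketch for illustration only; the actual logic lives in the Makefile's ifeq blocks):

```shell
#!/bin/sh
# Hypothetical sketch of the self-location logic that must run first:
# decide between single- and multi-workspace mode by looking for mets.xml.
tmp=$(mktemp -d)                        # throwaway example tree, illustration only
mkdir -p "$tmp/ws1" "$tmp/ws2"
touch "$tmp/ws1/mets.xml" "$tmp/ws2/mets.xml"
cd "$tmp"
if test -f mets.xml; then
    mode=single                         # the CWD is itself a workspace
    workspaces=.
else
    mode=multi                          # recurse into every workspace below CWD
    workspaces=$(find . -name mets.xml -exec dirname {} \; | sort)
fi
echo "mode: $mode"
echo "$workspaces"
cd / && rm -rf "$tmp"
```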

It would probably be more intuitive, but also more flexible, if the top level was not controlled by make itself but by a shell script, say ocrd-make. It could use parallel to schedule and control multiple recursive tasks/workspaces (parallel offers prioritisation based on load level, among many other criteria). It could also encapsulate the jobs in Docker calls, or externalize them via ssh calls (assuming the remote side has a suitable OCR-D installation and network mounts). We could rely on parallel's --hostgroup --sshlogin --onall --transfer and --joblog facilities. It even has capabilities like --resume --resume-failed, and can show --progress. It can also use an external SQL server to control jobs, which a UI component could interface with.
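
For illustration, the ssh and transfer facilities could be combined roughly like this (a hypothetical command line, not tested: hostnames are placeholders, and the remote side is assumed to have workflow-configuration installed):

```shell
# Hypothetical: distribute workspaces over two remote hosts (placeholder names),
# copying each workspace directory over and back, with per-job logging,
# live progress, and the ability to resume failed jobs later.
find . -name mets.xml -exec dirname {} \; |
  parallel --sshlogin user@host1 --sshlogin user@host2 \
           --transferfile {} --return {} --cleanup \
           --joblog all-tess-frak2021.joblog --resume-failed --progress \
           make -C {} -f all-tess-frak2021.mk all
```

With --joblog in place, a crashed or interrupted run can simply be restarted with the same command line and --resume-failed will skip the jobs that already succeeded.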

I have implemented the base functionality in 1746daa already:

```
ocrd-make -j -l 4 -f all-tess-frak2021.mk all
INFO: processing 23 workspaces with -R -I /ocrd_all/venv/share/workflow-configuration -f /ocrd_all/venv/share/workflow-configuration/all-tess-frak2021.mk in parallel

Computers / CPU cores / Max jobs to run
1:local / 8 / 8

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:7/0/100%/0.0s ./buerger_gedichte_1778.ocrd/data
local:6/1/100%/105.0s ./huebner_handbuch_1696.ocrd/data
local:5/2/100%/72.0s ./silesius_seelenlust01_1657.ocrd/data
local:7/3/100%/48.7s ./loskiel
local:6/4/100%/41.0s ./estor_rechtsgelehrsamkeit02_1758.ocrd/data
local:5/5/100%/36.2s ./benner_herrnhuterey04_1748.ocrd/data
local:5/6/100%/33.0s ./praetorius_verrichtung_1668.ocrd/data
local:6/7/100%/32.4s ./weigel_gnothi02_1618.ocrd/data
local:5/8/100%/31.8s ./bernd_lebensbeschreibung_1738.ocrd/data
local:5/9/100%/37.8s ./rollenhagen_reysen_1603.ocrd/data
local:4/10/100%/34.9s ./loeber_heuschrecken_1693.ocrd/data
local:3/11/100%/32.3s ./weigel_gnothi02_1618.ocrd.alternativeimages/data
local:4/12/100%/30.8s ./wecker_kochbuch_1598.ocrd/data
local:5/13/100%/32.2s ./euler_rechenkunst01_1738.ocrd/data
local:7/14/100%/41.9s ./valentinus_occulta_1603.ocrd/data
local:6/15/100%/39.6s ./weigel_gnothi02_1618.ocrd.debug/data
local:5/16/100%/37.6s ./lohenstein_agrippina_1665.ocrd/data
local:4/17/100%/35.4s ./luz_blitz_1784.ocrd/data
local:3/18/100%/33.8s ./weigel_gnothi02_1618
local:4/19/100%/35.0s ./praetorius_syntagma02_1619_teil2.ocrd/data
local:3/20/100%/35.5s ./justi_abhandlung01_1758.ocrd/data
local:2/21/100%/34.4s ./bohse_helicon_1696.ocrd/data
local:1/22/100%/37.7s ./glauber_opera01_1658.ocrd/data
local:0/23/100%/36.0s 
all-tess-frak2021.3476.log
_all.all-tess-frak2021.log
```

This uses --progress for live updates, but also aggregates the stdout/stderr results from --files --tag by moving the temporary files into the proper logfile name in each workspace directory (as before). When running for all, it also concatenates all logfiles into an overall one for that workflow (as before). Additionally, a CSV file with operation stats gets created from --joblog. Finally, it aggregates the exit codes by summing them all up and exiting with the result.
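
The exit-code aggregation at the end amounts to a simple summation, roughly like this (a minimal sketch with simulated per-workspace return codes):

```shell
#!/bin/sh
# Minimal sketch of the final aggregation: sum the per-workspace exit codes
# (simulated here by a fixed list) and use the sum as the overall exit status.
total=0
for rc in 0 2 0 1; do                # stand-ins for the recursive make results
    total=$((total + rc))
done
echo "aggregate exit status: $total"
```

Since the sum is zero only when every single workspace succeeded, the overall invocation fails as soon as any job fails.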

So the next steps could be:

  • resuming from failure
  • delegating to Docker calls
  • delegating to ssh calls and remote control
  • transfer of input and output data for ssh runs (in case there is no shared network storage)
  • using an SQL server for job control for further interfacing (web monitoring?)
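
As a starting point for the last item, GNU parallel's SQL backend could be invoked roughly like this (hypothetical DSN, database, and table names; an untested sketch):

```shell
# Hypothetical: keep the job queue in an external database so other tools
# (e.g. a web monitor) can inspect or steer it. The DBURL format follows
# parallel(1): vendor://user:password@host/database/table
find . -name mets.xml -exec dirname {} \; |
  parallel --sqlandworker mysql://ocrd:secret@dbhost/ocrd/jobqueue \
           make -C {} -f all-tess-frak2021.mk all
```

With --sqlandworker, the same invocation both fills the queue and works on it, while additional workers (or a monitoring UI) can attach to the same table from other machines.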