leverage more facilities offered by GNU parallel
With (ocrd-)make, we want to be able to apply a given workflow either to a whole set of workspaces together (possibly in parallel), by recursing into each workspace, or to a single workspace if that is already the CWD. This necessitates determining from within the Makefile where we are, by searching recursively for mets.xml files – before anything else is done. The choice of the default target crucially depends on it, and so does the availability of certain special targets (like info, show, help, all, server, install). This not only takes extra time, it also makes the Makefile hard to read (with lots of ifeq conditionals and enumerations of targets).
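For illustration, a heavily simplified sketch of the kind of top-level conditional this leads to (the rules and target names here are invented, not taken from the actual Makefile):

# simplified sketch: everything hinges on whether the CWD already is a
# workspace (i.e. contains a mets.xml) or sits above a set of workspaces
WORKSPACES := $(patsubst %/mets.xml,%,$(shell find . -name mets.xml))

ifeq ($(wildcard mets.xml),mets.xml)
# single-workspace mode: the workflow rules (not shown here) apply directly,
# and workspace-level targets like info become available
info:
	@echo workspace: $(CURDIR)
else
# multi-workspace mode: the default target recurses into every workspace
all: $(WORKSPACES)
$(WORKSPACES):
	$(MAKE) -C $@ -f $(abspath $(firstword $(MAKEFILE_LIST)))
.PHONY: all $(WORKSPACES)
endif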
It would probably be more intuitive, but also more flexible, if the top level was not controlled by make itself, but only by a shell script – say ocrd-make. It could use parallel to schedule and control multiple recursive tasks/workspaces (which also offers scheduling priority based on load level, among many other criteria). It could also encapsulate the jobs in Docker calls, or externalize them via ssh calls (assuming the remote side has a suitable OCR-D installation and network mounts). We could rely on parallel's --hostgroup --sshlogin --onall --transfer and --joblog facilities. It even has capabilities like --resume --resume-failed, and can show --progress. It can also use an external SQL server to control jobs, which could be interfaced with from a UI component.
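For illustration only, a minimal sketch of what such a wrapper could boil down to (the host names and slot counts are invented, the paths mirror the log output further below, and all options are existing GNU parallel facilities):

# sketch: run the workflow in every workspace below the CWD, using 4 local
# job slots and 2 slots each on two remote hosts (assumed to share the same
# OCR-D installation and network mounts), keeping a job log for statistics
CONFIGDIR=/ocrd_all/venv/share/workflow-configuration
find . -name mets.xml | sed 's,/mets.xml$,,' |
  parallel --progress --load 80% --joblog all-tess-frak2021.csv \
           --sshlogin 4/:,2/remote1,2/remote2 \
           make -R -I $CONFIGDIR -f $CONFIGDIR/all-tess-frak2021.mk -C {} all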
I have implemented the base functionality in 1746daa already:
ocrd-make -j -l 4 -f all-tess-frak2021.mk all
INFO: processing 23 workspaces with -R -I /ocrd_all/venv/share/workflow-configuration -f /ocrd_all/venv/share/workflow-configuration/all-tess-frak2021.mk in parallel
Computers / CPU cores / Max jobs to run
1:local / 8 / 8
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:7/0/100%/0.0s ./buerger_gedichte_1778.ocrd/data
local:6/1/100%/105.0s ./huebner_handbuch_1696.ocrd/data
local:5/2/100%/72.0s ./silesius_seelenlust01_1657.ocrd/data
local:7/3/100%/48.7s ./loskiel
local:6/4/100%/41.0s ./estor_rechtsgelehrsamkeit02_1758.ocrd/data
local:5/5/100%/36.2s ./benner_herrnhuterey04_1748.ocrd/data
local:5/6/100%/33.0s ./praetorius_verrichtung_1668.ocrd/data
local:6/7/100%/32.4s ./weigel_gnothi02_1618.ocrd/data
local:5/8/100%/31.8s ./bernd_lebensbeschreibung_1738.ocrd/data
local:5/9/100%/37.8s ./rollenhagen_reysen_1603.ocrd/data
local:4/10/100%/34.9s ./loeber_heuschrecken_1693.ocrd/data
local:3/11/100%/32.3s ./weigel_gnothi02_1618.ocrd.alternativeimages/data
local:4/12/100%/30.8s ./wecker_kochbuch_1598.ocrd/data
local:5/13/100%/32.2s ./euler_rechenkunst01_1738.ocrd/data
local:7/14/100%/41.9s ./valentinus_occulta_1603.ocrd/data
local:6/15/100%/39.6s ./weigel_gnothi02_1618.ocrd.debug/data
local:5/16/100%/37.6s ./lohenstein_agrippina_1665.ocrd/data
local:4/17/100%/35.4s ./luz_blitz_1784.ocrd/data
local:3/18/100%/33.8s ./weigel_gnothi02_1618
local:4/19/100%/35.0s ./praetorius_syntagma02_1619_teil2.ocrd/data
local:3/20/100%/35.5s ./justi_abhandlung01_1758.ocrd/data
local:2/21/100%/34.4s ./bohse_helicon_1696.ocrd/data
local:1/22/100%/37.7s ./glauber_opera01_1658.ocrd/data
local:0/23/100%/36.0s
all-tess-frak2021.3476.log
_all.all-tess-frak2021.log
This uses --progress for live updates, but also aggregates the stdout/stderr results from --files --tag by moving the temporary files into the proper logfile name within each workspace directory (as before). When running for all, it also concatenates all logfiles into an overall one for that workflow (as before). Additionally, a CSV file with job statistics from --joblog gets created. Finally, it aggregates the exit codes by summing them up and exiting with that sum.
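Not the actual implementation, but the final aggregation roughly amounts to the following (the file names are placeholders; Exitval is the 7th column of parallel's tab-separated --joblog format):

# concatenate the per-workspace logfiles into the overall one for this
# workflow, then exit with the sum of all job exit codes from the job log
cat ./*/data/all-tess-frak2021.log > _all.all-tess-frak2021.log
exit $(awk -F'\t' 'NR>1 {sum+=$7} END {print sum+0}' all-tess-frak2021.csv)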
So the next steps could be (see the sketches after this list):
- resuming from failure
- delegating to Docker calls
- delegating to ssh calls and remote control
- transfer of input and output data for ssh runs (in case we have no implicit network storage)
- using an SQL server for job control for further interfacing (web monitoring?)
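
Rough sketches of how these could map onto existing GNU parallel options (everything besides the option names – hosts, paths, image, DB URL – is hypothetical):

CONFIGDIR=/ocrd_all/venv/share/workflow-configuration
workspaces() { find . -name mets.xml | sed 's,/mets.xml$,,'; }

# resume from failure: with the job log kept, rerun only failed/unfinished jobs
workspaces | parallel --joblog all-tess-frak2021.csv --resume-failed \
  make -C {} -I $CONFIGDIR -f $CONFIGDIR/all-tess-frak2021.mk all

# Docker encapsulation (image name and in-container paths only indicative)
workspaces | parallel docker run --rm -v $PWD/{}:/data ocrd/all \
  make -C /data -I $CONFIGDIR -f $CONFIGDIR/all-tess-frak2021.mk all

# ssh delegation without shared storage: copy each workspace to the remote
# host, and copy the processed workspace back afterwards
workspaces | parallel --sshloginfile hosts.txt --transfer --return {} --cleanup \
  make -C {} -I $CONFIGDIR -f $CONFIGDIR/all-tess-frak2021.mk all

# SQL-based job control (e.g. for a web monitor polling the jobs table)
workspaces | parallel --sqlandworker mysql://ocrd:secret@dbhost/ocrd/jobs \
  make -C {} -I $CONFIGDIR -f $CONFIGDIR/all-tess-frak2021.mk all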