SIMEXP/psom

workers detected as crashed when they are not

Opened this issue · 5 comments

Apparently, on laggy file systems some workers are detected as crashed when they are not. One option is to send kill messages, with unique file names, so that there are never several active workers at the same time.
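
To make the "kill messages with unique file names" idea concrete, here is a minimal sketch in Python. It is not PSOM code. The function names, the `kill_<token>.signal` file name and the per-spawn token are all hypothetical, chosen only for illustration. Each spawned worker gets its own token, so a kill file written for a stale worker cannot stop the worker that replaces it in the same folder.

```python
import os
import uuid


def new_worker_token():
    # Hypothetical: one unique token per spawned worker instance.
    return uuid.uuid4().hex


def kill_file_path(worker_dir, token):
    # Hypothetical naming scheme; the token makes the file name unique.
    return os.path.join(worker_dir, "kill_" + token + ".signal")


def request_kill(worker_dir, token):
    # Called by the daemon for a worker it considers dead.
    path = kill_file_path(worker_dir, token)
    with open(path, "w"):
        pass  # an empty file is enough, only its existence matters
    return path


def should_stop(worker_dir, token):
    # Polled by the worker between jobs; it exits if its own kill file exists.
    return os.path.exists(kill_file_path(worker_dir, token))
```

A worker that was wrongly declared dead would see its own kill file on its next poll and exit, while a replacement worker started with a fresh token would keep running.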

After inspecting the logs, the workers did indeed crash. Here is an example of the error:

Something went bad ... the pipeline has FAILED !
The last error message occured was :
I could not find /gs/scratch/pbellec/psom2/cambridge_preproc_gui_100/logs/worker/psom10/new_jobs.mat for spawning
File /home/pbellec/git/psom/psom_worker.m at line 144

This may be a race condition. Another possibility is that it does reflect several workers working in the same folder. More investigation is necessary.

I have fixed the possible race condition, and I am still getting the error. Maybe re-run with nb_resub = 0 to get the error message the first time the workers crash.

So I can confirm that the heartbeat mechanism is broken due to delays in file updates. The daemon resets workers that are not dead in the first place. This ends up creating a huge mess in the news feed, with jobs getting completed multiple times. The manager understandably goes nuts, starting to count negative numbers of running tasks and eventually losing it completely. The course of action to fix this is not clear. Sending a kill signal to workers deemed to be dead would definitely be a useful safeguard, in case the daemon is confused about the state of the workers. But the fact that file updates can be delayed by several seconds calls the feasibility of the heartbeat into question. I will try using a much longer time (5 minutes, instead of 5 seconds) before declaring jobs as dead, but I am afraid this parameter would end up being too system (and pipeline) dependent.
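
The failure mode described above can be sketched as a staleness check on a heartbeat file. This is an illustration in Python, not the actual PSOM implementation. The function names are hypothetical, and using the file modification time as the heartbeat is an assumption.

```python
import os
import time


def heartbeat_age(heartbeat_file, now=None):
    # Seconds elapsed since the heartbeat file was last modified,
    # as seen by the process running the check.
    if now is None:
        now = time.time()
    return now - os.path.getmtime(heartbeat_file)


def looks_dead(age, dead_time):
    # The worker is declared dead once its heartbeat is older than dead_time.
    return age > dead_time
```

If the file system delays the visibility of an update by 8 seconds, `looks_dead(8.0, 5.0)` is true even though the worker is alive, while `looks_dead(8.0, 300.0)` is false. A longer dead_time hides the lag, at the cost of taking longer to notice workers that really did crash.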

So with a dead_time of 5 minutes, there do not seem to be any problems on Guillimin. Maybe there should be a different default dead_time in 'background/batch' vs 'qsub/msub/bsub/condor' modes. The following two new features need to be implemented before this issue is closed:

  • Send kill signal to workers deemed to be dead.
  • Expose dead_time as a parameter in PSOM_GB_VARS. For backward compatibility, PSOM should not crash if the default for this parameter is not found in PSOM_GB_VARS.
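
The backward-compatible lookup in the second item could look like the following sketch. It is written in Python for illustration only, with a plain dictionary standing in for PSOM_GB_VARS. The function name is hypothetical, and mapping 5 seconds to the local modes and 5 minutes to the cluster modes is an assumption based on the values tried in this thread, not a decided default.

```python
# Execution modes named in this thread.
CLUSTER_MODES = {"qsub", "msub", "bsub", "condor"}

# Assumed defaults, in seconds (illustrative, not decided).
DEFAULT_DEAD_TIME_LOCAL = 5.0
DEFAULT_DEAD_TIME_CLUSTER = 300.0


def get_dead_time(gb_vars, mode):
    # Use the configured value when present.
    value = gb_vars.get("dead_time")
    if value is not None:
        return float(value)
    # Older configurations have no dead_time entry: fall back to a
    # mode-dependent default instead of raising an error.
    if mode in CLUSTER_MODES:
        return DEFAULT_DEAD_TIME_CLUSTER
    return DEFAULT_DEAD_TIME_LOCAL
```

With this fallback, a configuration written before the parameter existed keeps working, and a user on a slow file system can still override the value explicitly.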

The same issue of conflicting workers happened when integrating PSOM with CBRAIN: https://github.com/glatard/cbrain-plugins-psom/issues/7

Needs urgent fixing.