find_id in cluster.slurm improperly uses scontrol
Closed this issue · 1 comment
JimCircadian commented
scontrol won't return jobs that have disappeared from the queue. This is an issue if the status update for a completed job isn't caught, which is only likely to be serious when the status check timers are set too high. It still needs addressing, though, so that jobs which have left the queue are picked up and marked appropriately.
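One option (a rough sketch only, not the package's current code; the `_run` helper and `Job` tuple below are illustrative) would be for `find_id` to guard the scontrol parse and fall back to `sacct`, which reads the accounting database and still reports jobs that have left the queue:

```python
import asyncio
from collections import namedtuple

# Hypothetical result type; the real find_id may return something different.
Job = namedtuple("Job", ["name", "state"])


async def _run(cmd):
    """Run a shell command, returning decoded stdout ('' on a non-zero exit)."""
    proc = await asyncio.create_subprocess_shell(
        cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()
    return stdout.decode() if proc.returncode == 0 else ""


async def find_id(job_id):
    # scontrol only knows about jobs still in (or very recently purged from)
    # the queue, so for a completed job this can return nothing at all.
    output = await _run("scontrol show job {}".format(job_id))
    # Naive whitespace parse: fine for JobName/JobState, not for fields
    # whose values contain spaces.
    fields = dict(kv.split("=", 1) for kv in output.split() if "=" in kv)
    if "JobName" in fields:
        return Job(name=fields["JobName"], state=fields.get("JobState"))

    # Fall back to sacct, which still reports jobs that have left the queue.
    # -n drops the header, -P gives pipe-delimited output, so the first line
    # is the job's allocation record.
    output = await _run("sacct -j {} -o JobName,State -n -P".format(job_id))
    for line in output.splitlines():
        name, state = line.split("|", 1)
        return Job(name=name, state=state.strip())

    return None
```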
JimCircadian commented
@CRosieWilliams identified this in WAVIhpc runs, so getting a fix rolled out ASAP
```
[15-03-22 10:22:14 :WARNING ] - Command returned err: None
[15-03-22 10:22:14 :ERROR ] - Job status for run PIGTHW3km_sanity_checks_t100_runs-0 retrieval whilst slurm running, waiting and retrying
Traceback (most recent call last):
  File "/data/hpcdata/users/chll1/WAVI_Julia/WAVIhpc/venv/lib/python3.7/site-packages/model_ensembler/batcher.py", line 264, in run_batch_item
    job = await cluster.find_id(job_id)
  File "/data/hpcdata/users/chll1/WAVI_Julia/WAVIhpc/venv/lib/python3.7/site-packages/model_ensembler/cluster/slurm.py", line 46, in find_id
    if v.split("=")[0] == "JobName"][0],
IndexError: list index out of range
```
Or more specifically, one job can get stuck in that state and hold the whole thing up. If I see that, I quit it and restart. Not sure if that's good practice, but it keeps it ticking over.