find_id in cluster.slurm improperly uses scontrol
Closed this issue · 1 comment
JimCircadian commented
scontrol won't return jobs that have disappeared from the queue. This is an issue if the status update for a completed job isn't caught, which is only likely to be serious when the status check timers are set too high. It still needs addressing, though, so that jobs which have left the queue are picked up and marked appropriately.
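One option (a rough sketch only, not the package's current code; the `_run` helper and `Job` tuple below are illustrative) would be for `find_id` to guard the scontrol parse and fall back to `sacct`, which reads the accounting database and still reports jobs that have left the queue:

```python
import asyncio
from collections import namedtuple

# Hypothetical result type; the real find_id may return something different.
Job = namedtuple("Job", ["name", "state"])


async def _run(cmd):
    """Run a shell command, returning decoded stdout ('' on a non-zero exit)."""
    proc = await asyncio.create_subprocess_shell(
        cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()
    return stdout.decode() if proc.returncode == 0 else ""


async def find_id(job_id):
    # scontrol only knows about jobs still in (or very recently purged from)
    # the queue, so for a completed job this can return nothing at all.
    output = await _run("scontrol show job {}".format(job_id))
    # Naive whitespace parse: fine for JobName/JobState, not for fields
    # whose values contain spaces.
    fields = dict(kv.split("=", 1) for kv in output.split() if "=" in kv)
    if "JobName" in fields:
        return Job(name=fields["JobName"], state=fields.get("JobState"))

    # Fall back to sacct, which still reports jobs that have left the queue.
    # -n drops the header, -P gives pipe-delimited output, so the first line
    # is the job's allocation record.
    output = await _run("sacct -j {} -o JobName,State -n -P".format(job_id))
    for line in output.splitlines():
        name, state = line.split("|", 1)
        return Job(name=name, state=state.strip())

    return None
```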
JimCircadian commented
@CRosieWilliams identified this in WAVIhpc runs, so getting a fix rolled out ASAP
```
[15-03-22 10:22:14 :WARNING ] - Command returned err: None
[15-03-22 10:22:14 :ERROR ] - Job status for run PIGTHW3km_sanity_checks_t100_runs-0 retrieval whilst slurm running, waiting and retrying
Traceback (most recent call last):
  File "/data/hpcdata/users/chll1/WAVI_Julia/WAVIhpc/venv/lib/python3.7/site-packages/model_ensembler/batcher.py", line 264, in run_batch_item
    job = await cluster.find_id(job_id)
  File "/data/hpcdata/users/chll1/WAVI_Julia/WAVIhpc/venv/lib/python3.7/site-packages/model_ensembler/cluster/slurm.py", line 46, in find_id
    if v.split("=")[0] == "JobName"][0],
IndexError: list index out of range
```
Or more specifically, one job can get stuck in that state and hold the whole thing up. If I see that, I quit it and restart. Not sure if that's good practice, but it keeps it ticking over.