NREL/jade

Completed job not getting cleared out of active_hpc_job_ids

elainethale opened this issue · 3 comments

jade v0.4.9

In this screenshot, you can see that one job is done and two are submitted, but there are still three active_hpc_job_ids:

image

For reference, here is my HPC queue at the time:

image

This is a bug that needs to be fixed. I won't have time to think it through thoroughly and fix it until next week, but here is what I think occurred.

  • The distributed Jade submitter checks the status of all outstanding HPC job IDs at these points: when SLURM allocates a compute node and starts Jade (which then runs that node's batch of jobs), when that node finishes its batch of jobs, and when a user runs jade show-status or jade try-submit-jobs.
  • SLURM will report status of a completed job ID for some amount of time (I don't know how long). At some point that ID will get purged and the squeue command will fail.
  • When a Jade process completes on a node it doesn't internally complete that job ID because technically that job ID is still active.
  • You had one node complete its batch and then tried to check status after this SLURM timeout occurred.
  • Up until now most people using Jade have individual compute nodes complete their work regularly before the SLURM timeout occurs for any single job.
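To illustrate the failure mode in the list above, here is a minimal sketch of a status check that treats a purged job ID as complete instead of failing. It is not Jade's actual code: the function names are hypothetical, and the assumption that SLURM reports a purged ID with the text "Invalid job id specified" should be verified against your cluster.

```python
import subprocess

# Assumption: this is the text squeue prints once the controller has purged
# the job ID. Verify it against the SLURM version on your cluster.
INVALID_JOB_ID_MARKER = "Invalid job id specified"

# Job states that mean the allocation is finished.
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED", "TIMEOUT"}


def classify_squeue_result(returncode, stdout, stderr):
    """Map one squeue invocation to 'active', 'complete', or 'unknown'."""
    if returncode == 0:
        state = stdout.strip()
        # No output means the job is no longer listed in the queue.
        if not state or state in TERMINAL_STATES:
            return "complete"
        return "active"
    if INVALID_JOB_ID_MARKER in stderr:
        # The ID was purged, so the job finished some time ago.
        return "complete"
    # Any other failure (timeout, controller down) tells us nothing.
    return "unknown"


def check_job(job_id):
    """Query squeue for a single job ID and classify the result."""
    proc = subprocess.run(
        ["squeue", "-j", str(job_id), "--noheader", "--format=%T"],
        capture_output=True,
        text=True,
    )
    return classify_squeue_result(proc.returncode, proc.stdout, proc.stderr)
```

Keeping the classification separate from the subprocess call makes it possible to test the purged-ID case without access to a SLURM cluster.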

This could have some bad outcomes. Any existing jobs will complete, but if there are other jobs dependent on these completions, I'm guessing it's possible that they won't get started.

I can fix this by making the Jade submitter internally complete its own HPC job ID before exiting. That should be simple. I could also detect this particular error message as another safeguard.
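A rough sketch of the first fix, assuming the submitter holds the active IDs in a plain list. The function name and the list representation are hypothetical, not Jade's internals. SLURM_JOB_ID is the environment variable SLURM sets inside a running allocation.

```python
import os


def release_own_job_id(active_ids, environ=None):
    """Return active_ids without the ID of the allocation we are running in.

    IDs are compared as strings because the stored IDs may be ints or
    strings, while environment variables are always strings.
    """
    environ = os.environ if environ is None else environ
    own_id = environ.get("SLURM_JOB_ID")
    if own_id is None:
        # Not running inside a SLURM allocation, so there is nothing to release.
        return list(active_ids)
    return [job_id for job_id in active_ids if str(job_id) != own_id]
```

Called just before the submitter exits on a compute node, this would drop the node's own ID even though SLURM still reports the job as active at that moment.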

Your current batch might be stuck (or only the reporting is stuck). You can get it unstuck by doing the following:

  • Make sure that the submitter lock exists: <output-dir>/cluster_config.lock.
  • Delete the stale HPC job ID in <output-dir>/job_statuses.json.
  • Delete <output-dir>/cluster_config.lock.
  • Run jade try-submit-jobs <output-dir>.
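If you would rather script the second step than edit the file by hand, something like the sketch below might work. It rests on an assumption about the file layout: that the stale ID sits in a list stored under a key named active_hpc_job_ids somewhere in job_statuses.json. Check the file first. The function names are hypothetical, and the lock handling in the other steps is still up to you.

```python
import json
from pathlib import Path


def drop_job_id(node, stale_id, key="active_hpc_job_ids"):
    """Remove stale_id from every list stored under `key`, at any depth.

    Modifies `node` in place and returns how many entries were removed.
    The key name is an assumption about the file layout.
    """
    removed = 0
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key and isinstance(value, list):
                kept = [v for v in value if str(v) != str(stale_id)]
                removed += len(value) - len(kept)
                node[name] = kept
            else:
                removed += drop_job_id(value, stale_id, key)
    elif isinstance(node, list):
        for item in node:
            removed += drop_job_id(item, stale_id, key)
    return removed


def remove_stale_id(output_dir, stale_id):
    """Rewrite job_statuses.json without stale_id, keeping a .bak copy."""
    path = Path(output_dir) / "job_statuses.json"
    original = path.read_text()
    data = json.loads(original)
    removed = drop_job_id(data, stale_id)
    if removed:
        # Keep the untouched file so the edit can be reverted.
        path.with_name(path.name + ".bak").write_text(original)
        path.write_text(json.dumps(data, indent=2))
    return removed
```

A return value of 0 means the ID was not found under that key, in which case the file is left untouched and a manual edit is the safer route.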

If you want to prevent this issue from occurring until I properly fix it, I think that this will work:

  • Once an hour run jade try-submit-jobs <output-dir>
  • There is a potential downside in that you might put fewer jobs on each node.
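One way to automate the hourly workaround is a small loop like the sketch below (a cron entry would do the same job). Only the jade try-submit-jobs command comes from the steps above. The function names, the injectable runner and sleep parameters, and the default interval are illustrative assumptions.

```python
import subprocess
import time


def build_command(output_dir):
    """Build the command line described in the workaround above."""
    return ["jade", "try-submit-jobs", str(output_dir)]


def run_periodically(output_dir, interval_seconds=3600, iterations=None,
                     runner=subprocess.run, sleep=time.sleep):
    """Run the command every interval_seconds; return how many times it ran.

    iterations=None loops until interrupted. runner and sleep can be
    replaced so the loop can be exercised without jade installed.
    """
    count = 0
    while iterations is None or count < iterations:
        # check=False: a single failed attempt should not stop the loop.
        runner(build_command(output_dir), check=False)
        count += 1
        if iterations is None or count < iterations:
            sleep(interval_seconds)
    return count
```

Run it from a login node inside a terminal multiplexer or similar so that it outlives your session, and stop it once the batch is done.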

Let me know if any of this doesn't match what you've seen and I'll investigate more.

It seems to have cleared itself overnight somehow ...

I must have forgotten something about the code. Perhaps you can only see the error if you run the jade show-status command. The fixes I described still apply, and I'll implement them next week. Thanks for reporting it.