flux-framework/flux-core

housekeeping doesn't release resources for a large job

On elcap, flux-core and Fluxion got out of sync with allocated resources.

Affected resources were associated with a single job which was INACTIVE, but flux housekeeping list showed no tasks for the job. flux dmesg showed the following for the job:

[  +4.361834] job-exec[0]: exec aborted: id=f2AJsP9HniMm
[  +9.365820] job-exec[0]: Sending SIGKILL to job f2AJsP9HniMm
[ +14.362288] job-exec[0]: Sending SIGKILL to job f2AJsP9HniMm
[ +19.365833] job-exec[0]: Sending SIGKILL to job f2AJsP9HniMm
[ +24.370602] job-exec[0]: Sending SIGKILL to job f2AJsP9HniMm
[ +29.366845] job-exec[0]: Sending SIGKILL to job f2AJsP9HniMm
[ +29.366902] job-exec[0]: Sending SIGKILL to job shell for job f2AJsP9HniMm
[ +39.373830] job-exec[0]: Sending SIGKILL to job shell for job f2AJsP9HniMm
[  +0.545771] job-exec[0]: Sending SIGKILL to job shell for job f2AJsP9HniMm
[  +7.122439] job-exec[0]: job-exception: id=f2AJsP9HniMm: node failure on elcapX (shell rank Y)
[  +8.683464] job-manager[0]: f2AJsP9HniMm: epilog: stderr: f2AJsP9HniMm: epilog: rank M offline. Skipping.
[ +26.515790] job-manager[0]: housekeeping: elcapZ (rank M) f2AJsP9HniMm: No route to host

In this version of Flux, I had expected to also see these messages:

housekeeping: JOBID complete
housekeeping: all resources of JOBID have been released

The all resources of JOBID have been released message is logged at LOG_DEBUG, so I don't believe it would be propagated to the systemd journal (or syslog), if that's where you were looking for it. Unfortunately, flux dmesg has wrapped.

I've been looking at the code and I haven't spotted a code path that would allow a job to be removed from the housekeeping list without logging that message (in 0.64.0).

I should have been more explicit: the messages above were from flux dmesg | grep JOBID. The job did not appear in job-manager stats. And since we still have the exec aborted message, it isn't possible that the all resources have been released message had wrapped at that point...
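For anyone cross-checking something similar, dumping the job-manager's stats and grepping for the job is a quick sanity check. A minimal sketch using the Flux Python bindings; stats.get is the generic per-module stats method behind flux module stats, but the layout of the response (and where housekeeping state appears in it) varies by flux-core version, so treat the key names as assumptions:

```python
import json
import flux

h = flux.Flux()
# Query the job-manager module's stats on rank 0. "stats.get" is the
# generic per-module stats method that `flux module stats` uses.
stats = h.rpc("job-manager.stats.get", nodeid=0).get()
# Dump everything and search for the job id; where housekeeping
# allocations appear in this payload is version-dependent.
print(json.dumps(stats, indent=2))
```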

Oh duh, I should have known that given the human readable output.

Perplexing.

I wonder if a bulk-exec bug could cause this? Sorry, I haven't had a chance to look yet.

Also, for the record, we ended up sending a custom sched-fluxion-qmanager.free RPC to clear the job in fluxion. (Thanks to @trws for that).
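For reference, here is a minimal sketch of that workaround using the Flux Python bindings. The topic string is the one named above; the payload shape is an assumption based on the RFC 27 scheduler protocol (free requests are keyed on the job id), so verify it against your fluxion version before trying this:

```python
import flux
from flux.job import JobID

h = flux.Flux()
jobid = JobID("f2AJsP9HniMm")  # the stuck job from the log above

# Assumed payload: RFC 27 free requests carry the job id. Depending on
# the fluxion version, an R fragment may also be expected here.
h.rpc("sched-fluxion-qmanager.free", {"id": int(jobid)}, nodeid=0).get()
```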

Somehow the job would need to get out of the hk->allocations list without going through allocation_remove() (which logs that message). I don't see a way for that to happen, other than if the scheduler were reloaded and the job was only partially released, or some other less likely error, but then we would see another log message:

housekeeping: HOSTS (rank RANKS) from JOBID will be terminated because REASON

I don't think there is any way for bulk exec to get started and not be in the list, and those are the only two code paths that take things out of the list with zlistx_delete(), so yeah, very strange.

trws commented

Hard to say, but we know we've seen cases where release didn't get finished. I'm starting to think some redundancy would be worth it. Maybe one or both of:

  1. Send a bit with the final free noting that it's final, so we clear out the remainder.
  2. Add a service to qmanager that iterates over all of its jobs, inquires about their status in job-manager, reports discrepancies, and, if requested, removes jobs that should have been retired. This is more work, but I've wanted a way to do this manually while debugging often enough lately that I think some version of it might belong in the sched protocol (a rough sketch follows this list).
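A sketch of what the client side of that second idea could look like. Everything here is hypothetical: the sched-fluxion-qmanager.audit topic, the remove flag, and the response fields do not exist today; this only illustrates the request/response shape such a service might have:

```python
import flux

h = flux.Flux()

# Hypothetical audit method: ask qmanager to compare its job table against
# job-manager and report (but not remove) anything that should be retired.
resp = h.rpc(
    "sched-fluxion-qmanager.audit",  # hypothetical topic, does not exist today
    {"remove": False},               # hypothetical flag: report only
    nodeid=0,
).get()

# Hypothetical response shape: one entry per job fluxion still holds that
# job-manager considers inactive.
for job in resp.get("discrepancies", []):
    print(f"{job['id']}: fluxion={job['sched_state']} job-manager={job['jm_state']}")
```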

Good thoughts. The final flag would be a simple change to the protocol, and perhaps fluxion could detect and log any extra resources that needed to be freed at that point.
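Concretely, the protocol change could be as small as one extra key on the existing free request. A sketch under the assumption that the RFC 27 free payload carries the job id plus an R fragment for partial release; the final key is the proposed addition, and the values are illustrative only:

```python
# Sketch of a free request payload under this proposal. The id and R keys
# follow the RFC 27 free request as extended for partial release; "final"
# is the proposed new bit telling the scheduler this is the last free for
# the job, so it can reconcile and log any leftover resources.
free_request = {
    "id": 1234,                             # job id (illustrative value)
    "R": {"version": 1, "execution": {}},   # placeholder R fragment
    "final": True,                          # proposed flag
}
```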

One other thought: a lot of the housekeeping logging was removed on current master. I wonder if we should put some of that back, or add other helpful debugging, in case we see this or a similar issue in the next release.

There are two ways I can think of to implement the final free bit/flag in qmanager and resource:

  1. Whenever the final bit/flag is set, run a full cancellation. This is straightforward, but a full cancellation is more expensive than a partial release when the R payload equals the remaining allocated resources. Since that is the most common scenario, we would incur longer cancellation times under most circumstances.
  2. Run a partial release with the R payload and use the existing return bool (full_removal?) to check whether all remaining resources have been released. If not, run a full cancellation as cleanup (sketched below). This approach would be faster except when cleanup is needed, and it would also make discrepancy reporting easier to implement.

I'm planning to implement option 2 unless there's a preference for the first option.
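A sketch of option 2's control flow, in Python pseudocode rather than fluxion's actual C++; remove(), cancel(), and the full_removal return value are stand-ins for qmanager/resource internals, and the names are illustrative only:

```python
def free(ctx, jobid, R, final):
    """Handle a free request: partial release, with full cancel as backstop."""
    # Existing partial-release path; assumed to return True when the job's
    # last resources were just released (the "full_removal" bool above).
    full_removal = ctx.remove(jobid, R)
    if final and not full_removal:
        # Core says this was the last free, yet fluxion still holds
        # resources for the job. Log the discrepancy (a core bug, per the
        # next comment), then run a full cancellation as cleanup.
        print(f"{jobid}: final free left resources allocated; cancelling remainder")
        ctx.cancel(jobid)
```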

Actually, finding anything else to free would be a bug in core and could just be logged, IMHO.

trws commented

Given what we have now for partial release and the final flag, is this still active? My impression is we'd want a new issue anyway if we get more information on this.

Yes, good point @trws. Let's close this and reopen with fresh data if we see anything like it again.