flux-framework/flux-sched

Partial cancel not releasing rabbit resources (?)


Snipped results of flux dmesg on hetchy:

2024-08-27T01:46:53.149076Z sched-fluxion-resource.err[0]: run_remove: dfu_traverser_t::remove (id=152883667495027712): mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149167Z sched-fluxion-resource.err[0]: ssd0.
2024-08-27T01:46:53.149175Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149181Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149186Z sched-fluxion-resource.err[0]: ssd1.
2024-08-27T01:46:53.149190Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149194Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149199Z sched-fluxion-resource.err[0]: ssd2.
2024-08-27T01:46:53.149204Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149208Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149218Z sched-fluxion-resource.err[0]: ssd3.
2024-08-27T01:46:53.149225Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149234Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149244Z sched-fluxion-resource.err[0]: ssd4.
2024-08-27T01:46:53.149251Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149257Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149267Z sched-fluxion-resource.err[0]: ssd5.
2024-08-27T01:46:53.149279Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149287Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149295Z sched-fluxion-resource.err[0]: ssd7.
2024-08-27T01:46:53.149303Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149309Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149316Z sched-fluxion-resource.err[0]: ssd6.
2024-08-27T01:46:53.149324Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149333Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149342Z sched-fluxion-resource.err[0]: ssd8.
2024-08-27T01:46:53.149349Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149355Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149363Z sched-fluxion-resource.err[0]: ssd9.
2024-08-27T01:46:53.149369Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149377Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149385Z sched-fluxion-resource.err[0]: ssd10.
2024-08-27T01:46:53.149391Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149397Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149405Z sched-fluxion-resource.err[0]: ssd11.
2024-08-27T01:46:53.149415Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149421Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149429Z sched-fluxion-resource.err[0]: ssd12.
2024-08-27T01:46:53.149436Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149443Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149451Z sched-fluxion-resource.err[0]: ssd13.
2024-08-27T01:46:53.149457Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149464Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149472Z sched-fluxion-resource.err[0]: ssd14.
2024-08-27T01:46:53.149480Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149487Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149495Z sched-fluxion-resource.err[0]: ssd15.
2024-08-27T01:46:53.149502Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149508Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
2024-08-27T01:46:53.149516Z sched-fluxion-resource.err[0]: ssd16.
2024-08-27T01:46:53.149522Z sched-fluxion-resource.err[0]: Success.
2024-08-27T01:46:53.149528Z sched-fluxion-resource.err[0]: mod_plan: traverser tried to remove schedule and span aft
2024-08-27T01:46:53.149544Z sched-fluxion-resource.err[0]: partial_cancel_request_cb: remove fails due to match error (id=152883667495027712): Success
2024-08-27T01:46:53.150544Z sched-fluxion-qmanager.err[0]: remove: .free RPC partial cancel failed for jobid 152883667495027712: Invalid argument
2024-08-27T01:46:53.150564Z sched-fluxion-qmanager.err[0]: jobmanager_free_cb: remove (queue=parrypeak id=152883667495027712): Invalid argument
2024-08-27T01:50:11.281162Z sched-fluxion-qmanager.debug[0]: feasibility_request_cb: feasibility succeeded
2024-08-27T01:50:52.232045Z sched-fluxion-qmanager.debug[0]: feasibility_request_cb: feasibility succeeded

Also I think I observed that rabbit resources are not released by the scheduler when jobs complete. For instance, I ran a one-node rabbit job, and then tried to submit another one only for it to become stuck in SCHED.

Any thoughts on what might be going on, @milroy?

I reloaded the resource and fluxion modules and scheduling went back to working as expected at first, but then as I ran jobs they eventually became stuck in SCHED.

[  +2.103872] job-manager[0]: scheduler: hello
[  +2.104034] job-manager[0]: scheduler: ready unlimited
[  +2.104099] sched-fluxion-qmanager[0]: handshaking with job-manager completed
[Aug27 12:59] sched-fluxion-resource[0]: find_request_cb: find succeeded
[ +15.464330] sched-fluxion-resource[0]: find_request_cb: find succeeded
[Aug27 13:00] sched-fluxion-qmanager[0]: feasibility_request_cb: feasibility succeeded
[Aug27 13:05] sched-fluxion-qmanager[0]: feasibility_request_cb: feasibility succeeded
[  +4.494890] job-manager[0]: housekeeping: fMjSa3tyHyZ started
[Aug27 13:07] job-manager[0]: housekeeping: fMjUstxpcZm started
[ +31.506371] sched-fluxion-qmanager[0]: feasibility_request_cb: feasibility succeeded
[Aug27 13:09] sched-fluxion-qmanager[0]: feasibility_request_cb: feasibility succeeded
[ +18.366731] job-manager[0]: housekeeping: fMjVuq84egF started
[Aug27 13:12] job-manager[0]: housekeeping: fMjSa3tyHyZ complete
[  +0.000729] sched-fluxion-resource[0]: run_remove: dfu_traverser_t::remove (id=153988281996936192): mod_plan: traverser tried to remove schedule and span after vtx_cancel during partial cancel:
[  +0.000771] sched-fluxion-resource[0]: ssd0.
[  +0.000777] sched-fluxion-resource[0]: Success.
[  +0.000787] sched-fluxion-resource[0]: partial_cancel_request_cb: remove fails due to match error (id=153988281996936192): Success
[  +0.000971] sched-fluxion-qmanager[0]: remove: .free RPC partial cancel failed for jobid 153988281996936192: Invalid argument
[  +0.000994] sched-fluxion-qmanager[0]: jobmanager_free_cb: remove (queue=parrypeak id=153988281996936192): Invalid argument

The issue seems to have been introduced between 0.36.1 and 0.37.0.

I suspect this line is reached with mod_type == job_modify_t::PARTIAL_CANCEL, in which case an additional check and then goto done; is probably warranted; see the sketch below. Do you have a reproducer for this issue?
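
A rough idea of the check I mean, as a sketch only (mod_type and job_modify_t::PARTIAL_CANCEL come from the suspected line; the mod_data name and the exact placement inside mod_plan are assumptions on my part):

// sketch: in mod_plan, where the traverser finds the schedule/span already
// gone because vtx_cancel ran during the partial cancel
if (mod_data.mod_type == job_modify_t::PARTIAL_CANCEL) {
    // nothing left to remove for this vertex; treat it as success rather
    // than reporting "tried to remove schedule and span"
    goto done;
}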

I can reproduce this in the flux-coral2 environment locally or on LC clusters, but there are a bunch of plugins loaded. The simplest thing I have is the following, I think.

It reproduces on the following R+JGF:
R.json

(it may work more easily if you give your Docker container the hostname compute-01)

with jobspecs like:

version: 9999
resources:
  - type: slot
    count: 1
    label: default
    with:
    - type: ssd
      count: 1
      exclusive: true
    - type: node
      count: 1
      exclusive: true
      with:
      - type: slot
        label: task
        count: 1
        with:
        - type: core
          count: 1
# a comment
attributes:
  system:
    duration: 3600
tasks:
  - command: [ "app" ]
    slot: task
    count:
      per_slot: 1

However, I don't know how to get it to ignore the "parse_jobspec: job ƒ2kTi2FyZ invalid jobspec; Unsupported resource type 'ssd'" errors. In the flux-coral2 environment it isn't an issue because the jobspec is modified after submission.

However, I don't know how to get it to ignore the "parse_jobspec: job ƒ2kTi2FyZ invalid jobspec; Unsupported resource type 'ssd'" errors.

Is this coming from the job-list module? If so you can probably safely ignore it, or just unload job-list.
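
If it is job-list, the one-liner the reproducer can use is simply:

flux module remove job-list

Scheduling doesn't go through job-list, so removing it shouldn't change the behavior you're debugging.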

After our discussion with Tom, I'm almost certain this issue is related to the ssds not being mapped to a broker rank.

It reproduces on the following R+JGF
R.json

Are the rank values in this JGF representative of a production cluster (i.e., each rank is -1)?
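
For reference, what I mean is the per-vertex rank field in the JGF metadata, shaped roughly like this (an illustrative ssd vertex only; the ids and paths are made up, and the field list assumes the usual flux-sched JGF layout):

{
  "id": "42",
  "metadata": {
    "type": "ssd",
    "basename": "ssd",
    "name": "ssd0",
    "id": 0,
    "uniq_id": 42,
    "rank": -1,
    "exclusive": false,
    "unit": "",
    "size": 1,
    "paths": { "containment": "/cluster0/rack0/ssd0" }
  }
}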

Here's my first crack at a reproducer (t/issues/t1284-cancel-ssds.sh):

#!/bin/bash
#
#  Ensure fluxion cancels ssds as expected
#

log() { printf "issue#1284: $@\n" >&2; }

TEST_SIZE=2

log "Unloading modules..."
flux module remove sched-simple
flux module remove resource

flux config load <<EOF
[resource]
noverify = true
norestrict = true
path="${SHARNESS_TEST_SRCDIR}/R.json"
EOF

flux module load resource monitor-force-up
flux module load sched-fluxion-resource
flux module load sched-fluxion-qmanager
flux module list
flux module remove job-list 
flux queue start --all --quiet
flux resource list
flux resource status

log "Running test jobs"
flux submit --flags=waitable \
		--setattr=exec.test.run_duration=0.01 \
		${SHARNESS_TEST_SRCDIR}/ssd-jobspec.yaml
flux submit --flags=waitable \
		--setattr=exec.test.run_duration=0.01 \
		${SHARNESS_TEST_SRCDIR}/ssd-jobspec.yaml

flux job wait -av

flux submit --flags=waitable \
		--setattr=exec.test.run_duration=0.01 \
		${SHARNESS_TEST_SRCDIR}/ssd-jobspec.yaml

flux jobs -a

Here R.json is set to the file you provided, @jameshcorbett. I'm getting an error when trying to initialize the resource graph from R.json:

Sep 12 01:21:56.737938 UTC sched-fluxion-resource.err[0]: grow_resource_db_jgf: unpack_parent_job_resources: Invalid argument
Sep 12 01:21:56.737943 UTC sched-fluxion-resource.err[0]: update_resource_db: grow_resource_db: Invalid argument
Sep 12 01:21:56.737945 UTC sched-fluxion-resource.err[0]: update_resource: update_resource_db: Invalid argument
Sep 12 01:21:56.737990 UTC sched-fluxion-resource.err[0]: populate_resource_db_acquire: update_resource: Invalid argument
Sep 12 01:21:56.737991 UTC sched-fluxion-resource.err[0]: populate_resource_db: loading resources using resource.acquire
Sep 12 01:21:56.737996 UTC sched-fluxion-resource.err[0]: init_resource_graph: can't populate graph resource database
Sep 12 01:21:56.737997 UTC sched-fluxion-resource.err[0]: mod_main: can't initialize resource graph database

Also, it seems like the first test should be to allocate the whole cluster and then see if another job can be scheduled after the resources are released. I think this is the jobspec we want:

version: 9999
resources:
  - type: slot
    count: 1
    label: default
    with:
    - type: rack
      count: 1
      with:
      - type: ssd
        count: 36
      - type: node
        count: 1
        with:
        - type: core
          count: 10
# a comment
attributes:
  system:
    duration: 3600
tasks:
  - command: [ "hostname" ]
    slot: default
    count:
      per_slot: 1

Is this the right approach? Any ideas what may be wrong with R.json?

@milroy I have a branch in my fork that repros the issue https://github.com/jameshcorbett/flux-sched/tree/issue-1284

Interestingly, while fooling around with it I noticed that the issue only comes up if the jobspec has a top-level "slot". If it instead has "ssd" and "node" at the top level, it didn't seem to have the same problem.
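
Roughly what I mean by the ssd/node-at-top-level variant (a sketch that mirrors the jobspec above with the outer slot dropped, not the exact file I ran):

version: 9999
resources:
  - type: ssd
    count: 1
    exclusive: true
  - type: node
    count: 1
    exclusive: true
    with:
    - type: slot
      label: task
      count: 1
      with:
      - type: core
        count: 1
attributes:
  system:
    duration: 3600
tasks:
  - command: [ "app" ]
    slot: task
    count:
      per_slot: 1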

With this patch that @trws and I talked about

diff --git a/qmanager/policies/base/queue_policy_base.hpp b/qmanager/policies/base/queue_policy_base.hpp
index 6fa2e44d..e9fd1166 100644
--- a/qmanager/policies/base/queue_policy_base.hpp
+++ b/qmanager/policies/base/queue_policy_base.hpp
@@ -666,7 +666,7 @@ class queue_policy_base_t : public resource_model::queue_adapter_base_t {
                     // during cancel
                     auto job_sp = job_it->second;
                     m_jobs.erase (job_it);
-                    if (final && !full_removal) {
+                    if (true) {
                         // This error condition indicates a discrepancy between core and sched.
                         flux_log_error (flux_h,
                                         "%s: Final .free RPC failed to remove all resources for "

applied on to the branch linked above, I see errors like:

Oct 08 02:39:39.086386 UTC sched-fluxion-qmanager.err[0]: remove: Final .free RPC failed to remove all resources for jobid 278065577984: Success
Oct 08 02:39:39.086651 UTC sched-fluxion-resource.debug[0]: cancel_request_cb: nonexistent job (id=278065577984)
Oct 08 02:39:39.086839 UTC sched-fluxion-qmanager.err[0]: jobmanager_free_cb: remove (queue=default id=278065577984): Protocol error

@jameshcorbett I cloned your fork and ran t5101-issue-1284.t. Several things are going on.

The logs indicate that the test is using the JGF reader to read R.json, but then it calls the rv1exec reader for cancel. That's because partial cancel is hard-coded to only use rv1exec, since that was the only reader needed in production when I implemented rank-based partial cancel. I should have remembered that.

However, enabling jgf-based partial cancel reveals more problems. The default match format used by the test (it appears to be rv1_nosched) is not compatible with jgf-based partial cancel. Unfortunately, specifying a match format compatible with jgf partial cancel results in yet another error upon graph initialization:

flux-config: error converting TOML to JSON: Invalid argument

I'll continue investigating and will report back as I find out more.

Update: tests 5101 and 5102 succeed if I specify the ssd pruning filter as follows:

test_expect_success 'an ssd jobspec can be allocated' '
	flux module remove sched-simple &&
	flux module remove resource &&
	flux config load <<EOF &&
[resource]
noverify = true
norestrict = true
path="${SHARNESS_TEST_SRCDIR}/R.json"
[sched-fluxion-resource]
prune-filters = "cluster:ssd,rack:ssd"
EOF
<...>

It appears that allowing core pruning filters in the graph causes cancellations to fail.

OK, interesting. I will have to try it out. I just realized that we're switching the EAS clusters to the rv1 match format, from rv1_nosched. Is partial cancel going to break or are we going to have other issues?

OK, interesting. I will have to try it out.

To be clear, Fluxion should tolerate both core and ssd filters and shouldn't behave incorrectly if there are more filters installed than necessary. I don't consider switching to ssd to be a fix, just a further clue about how to fix the overall behavior.
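
In concrete terms, a configuration like this (both filters installed; the exact string is only an illustration) should also cancel cleanly once the underlying behavior is fixed:

[sched-fluxion-resource]
prune-filters = "ALL:core,ALL:ssd"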

I just realized that we're switching the EAS clusters to the rv1 match format, from rv1_nosched. Is partial cancel going to break or are we going to have other issues?

I don't know, but that is something I'll start investigating as well.

After more investigation, there appear to be several issues to sort through:

  1. With the rv1_nosched match format, there isn't a straightforward way to support rank-based partial cancel for resources that don't have a broker rank. It may be possible to coerce it to work, but that will involve relaxing the error criterion for removal of planner_multi spans. In particular, this line is causing some of the errors reported in the related issues (see the sketch after this list):
    if ((rc = planner_multi_rem_span (subtree_plan, span_it->second)) != 0) {
    That error condition occurs because a partial cancel successfully removes the allocations of the other resource vertices (especially core, which is installed in all pruning filters by default), since those vertices have broker ranks. When the final .free RPC then fails to remove the ssd vertex allocation, the full cleanup cancel exits with an error as soon as it hits the vertices it has already cancelled.
  2. In principle, switching to rv1 with JGF should allow the partial cancel to work with the ssd vertices. Currently, only the rv1exec reader is supported in partial cancel. Supporting jgf is a quick PR.
  3. Adding support for the JGF reader in partial cancel is unsuccessful so far because the .free RPC omits the scheduling key even with the rv1 match format. I can't figure out why yet.
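
As a rough illustration of what relaxing that error criterion in item 1 might look like (a sketch only: planner_multi_rem_span and the quoted line are real, but the vertex variable u and the already_cancelled_by_partial_cancel predicate are hypothetical placeholders):

// during the final cleanup cancel, tolerate spans that the rank-based
// partial cancel already removed instead of failing the whole removal
if ((rc = planner_multi_rem_span (subtree_plan, span_it->second)) != 0) {
    // hypothetical check: was this vertex already cancelled by vtx_cancel
    // during the earlier partial cancel?
    if (already_cancelled_by_partial_cancel (u)) {
        rc = 0;        // skip it and keep cleaning up the remaining vertices
    } else {
        goto done;     // a genuine core/sched discrepancy: fail as before
    }
}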

A temporary hack that may work on running clusters is to specify prune-filters = "ALL:ssd" in sched-fluxion-resource.toml. That should bypass the planner_multi error in the line linked above.
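
Concretely, that's the same table the test snippet above uses:

[sched-fluxion-resource]
prune-filters = "ALL:ssd"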

I was a bit pessimistic in the previous comment. I figured out a way to support the external-rank partial cancellation with the rv1_nosched/rv1exec format and linked PR #1292 to close this issue.