flux-framework/flux-sched

implement partial release of resources

Closed · 5 comments

Problem: as discussed in flux-framework/flux-core#4312, the original plan for partial release of resources was to give the scheduler a free RPC for each R fragment of a job's resources that can be returned to the pool. In Fluxion, the R is ignored in the free callback, and the jobid is used instead to free all resources allocated to the job.

An additional problem is that flux-core cannot fragment the contents of the opaque scheduling key in R.

Assuming we figure out a way in flux-core to release resources in parts, how can this be made to work in Fluxion?

Note that RFC 27 would need to be updated as it currently describes a single free RPC.

A fragment would contain all the resources allocated to the job on one or more execution targets (broker ranks). That is, it would not be further subdivided.

One thought on the JGF problem is that perhaps a combination of job ID and the list of execution target ids from Rv1 would be sufficient to identify the resources being freed.
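As a rough illustration of that idea, the sketch below pulls the execution target ids out of the `execution.R_lite` section of an Rv1 object and compares a fragment against the job's full R. The helper names (`parse_idset`, `fragment_ranks`) are hypothetical, the idset parser is deliberately simplified (plain ranges and commas only), and nothing here is flux-core or Fluxion code.

```python
def parse_idset(s):
    """Expand a simple idset string such as "0-2,5" into a set of ints.

    Simplified on purpose: handles only comma-separated ids and ranges.
    """
    ids = set()
    for part in s.strip("[]").split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        ids.update(range(int(lo), int(hi or lo) + 1))
    return ids


def fragment_ranks(R):
    """Collect every broker rank named in the R_lite entries of an Rv1 object."""
    ranks = set()
    for entry in R["execution"]["R_lite"]:
        ranks |= parse_idset(entry["rank"])
    return ranks


# Hypothetical job allocated on ranks 0-3, with ranks 1-2 being released.
full_R = {
    "version": 1,
    "execution": {"R_lite": [{"rank": "0-3", "children": {"core": "0-7"}}]},
}
fragment = {
    "version": 1,
    "execution": {"R_lite": [{"rank": "1-2", "children": {"core": "0-7"}}]},
}

freed = fragment_ranks(fragment)
remaining = fragment_ranks(full_R) - freed
print(sorted(freed), sorted(remaining))
```

With the job ID supplied alongside the fragment, the pair (job ID, `freed`) would be enough to name what is being returned, without touching the opaque scheduling key.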

With the assumption that a fragment contains a subset of the job's broker ranks but the entire R (i.e., the full R for each broker rank) for each fragment broker rank, adding this support should be straightforward.

Mainly what's needed is to identify the broker ranks in the R fragment and iterate through the vertices in the by_rank graph metadata map for each rank. Then remove the scheduling and planner data per vertex. Updating the vertices' ancestors' pruning filters will require some thought, though...
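The per-rank walk described above might look roughly like the following. This is a toy model, not Fluxion's data structures: `Vertex`, its `allocations` dict, and `partial_cancel` are all hypothetical stand-ins for the real graph vertices, their scheduling/planner data, and the cancel path. Ancestor pruning-filter updates are intentionally left out, since that is the part that needs more thought.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Vertex:
    name: str
    rank: int
    # Hypothetical stand-in for per-vertex scheduling data: jobid -> amount held.
    allocations: dict = field(default_factory=dict)


def partial_cancel(by_rank, jobid, ranks):
    """Drop jobid's allocation from every vertex on the given ranks.

    Returns the vertices that were modified, so a caller could later
    update their ancestors (not modeled here).
    """
    touched = []
    for rank in sorted(ranks):
        for v in by_rank.get(rank, []):
            if v.allocations.pop(jobid, None) is not None:
                touched.append(v)
    return touched


# Build a tiny by_rank map: one node and one core vertex per rank, all held by job 42.
by_rank = defaultdict(list)
for rank in range(3):
    for kind in ("node", "core"):
        by_rank[rank].append(Vertex(f"{kind}{rank}", rank, {42: 1}))

# Release ranks 0 and 2 of job 42; rank 1 stays allocated.
touched = partial_cancel(by_rank, 42, {0, 2})
print([v.name for v in touched])
```

Returning the touched vertices is a design choice made for the sketch: it keeps the removal step separate from whatever bookkeeping the ancestors need afterward.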

> With the assumption that a fragment contains a subset of the job's broker ranks but the entire R (i.e., the full R for each broker rank) for each fragment broker rank, adding this support should be straightforward.
>
> Mainly what's needed is to identify the broker ranks in the R fragment and iterate through the vertices in the by_rank graph metadata map for each rank.

My understanding is that on elcap systems, the scheduler will need to be initialized from JGF in order to understand rabbit layout. Also, it will need to emit JGF for jobs in order to facilitate scheduler restart. The partial release will come in the form of R, but that's OK because of this simplifying assumption, right?

> The partial release will come in the form of R but that's OK because of this simplifying assumption right?

That's correct. The partial cancel/release just uses the Rlite fragment string contained in the free RPC payload.

> adding this support should be straightforward.

Famous last words. Fortunately the PR is merged and the functionality is in Fluxion now.

trws commented

@milroy, it looks like this one can be closed, so I'm closing it. If there's something we need to keep open here feel free to re-open.