kubernetes-sigs/jobset

Allow to mutate PodTemplate when suspending a JobSet and support resuming such JobSet

mimowo opened this issue · 4 comments

There are currently two related issues which prevent JobSet - Kueue integration:

  1. JobSet rejects mutation of PodTemplate on suspend

When Kueue evicts a workload (represented by a JobSet), it suspends the JobSet and tries to restore the PodTemplate so that the same JobSet can be re-admitted to another ResourceFlavor (with potentially different nodeSelectors).
For example, the following e2e test for Job shows how Kueue can preempt a workload and re-admit with another nodeSelector: link.

However, the integration with Kueue does not currently work, because Kueue's request to suspend
the JobSet fails if it also updates the PodTemplate.
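
To make the requested behavior concrete, here is a minimal Go sketch of one possible update rule. It is not JobSet's actual webhook code: the `JobSetSpec` and `PodTemplate` types and the `validateUpdate` function are hypothetical, simplified stand-ins, and the rule itself (permit PodTemplate changes whenever the JobSet is suspended before or after the update) is an assumption about what would cover both the suspend and the resume case.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// PodTemplate is a hypothetical, simplified stand-in for a real pod template.
type PodTemplate struct {
	NodeSelector map[string]string
}

// JobSetSpec is a hypothetical, simplified stand-in for a JobSet spec.
type JobSetSpec struct {
	Suspend  bool
	Template PodTemplate
}

var errImmutable = errors.New("pod template may only change while the JobSet is suspended")

// validateUpdate accepts a PodTemplate change if the JobSet is suspended
// either before or after the update. This is an assumed rule covering:
//   - suspend + restore the template in one request (running -> suspended)
//   - resume with new nodeSelectors in one request (suspended -> running)
func validateUpdate(oldSpec, newSpec JobSetSpec) error {
	if reflect.DeepEqual(oldSpec.Template, newSpec.Template) {
		return nil // template untouched, nothing to check
	}
	if oldSpec.Suspend || newSpec.Suspend {
		return nil
	}
	return errImmutable
}

func main() {
	flavorA := PodTemplate{NodeSelector: map[string]string{"pool": "a"}}
	flavorB := PodTemplate{NodeSelector: map[string]string{"pool": "b"}}
	restored := PodTemplate{}

	running := JobSetSpec{Suspend: false, Template: flavorA}
	evicted := JobSetSpec{Suspend: true, Template: restored}
	readmitted := JobSetSpec{Suspend: false, Template: flavorB}

	// Eviction: suspend and restore the template in the same update.
	fmt.Println("evict:", validateUpdate(running, evicted))
	// Re-admission: resume with a different nodeSelector.
	fmt.Println("re-admit:", validateUpdate(evicted, readmitted))
	// Changing the template of a running JobSet is still rejected.
	fmt.Println("live edit:", validateUpdate(running, readmitted))
}
```

Under this rule the first two updates are accepted and only the edit on a running JobSet is rejected. Whether the check should compare the whole template or only selected fields (such as nodeSelectors) is a separate design question.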

Let's fix this, but rather than solely doing a one-off fix here, we need to iron out the specific requirements for JobSet + Kueue integration, as well as align our roadmaps so changes in Kueue don't break JobSet integration.

We had a similar issue a couple of months ago: Kueue tried to mutate certain podTemplate fields on suspended JobSets, but those fields are immutable in JobSet, which led to a user reporting the issue (#579).

One thing we could potentially do is make the entire podTemplate mutable in JobSet, to prevent any further issues like this.

cc @alculquicondor @mimowo @ahg-g @kannon92

I think this is a good point. At the technical level, we should keep extending the JobSet e2e test suite in Kueue, which was started recently.

EDIT: the test suite for reference: https://github.com/kubernetes-sigs/kueue/blob/main/test/e2e/singlecluster/jobset_test.go. I'm going to extend it as part of kubernetes-sigs/kueue#2691 (started the PR in kubernetes-sigs/kueue#2700).

The proposal for the e2e test scenario which covers this and #623: #623 (comment)