Allow mutating the PodTemplate when suspending a JobSet, and support resuming such a JobSet
mimowo opened this issue · 4 comments
There are currently two related issues which prevent JobSet - Kueue integration:
- JobSet rejects mutation of PodTemplate on suspend
When Kueue evicts a workload (represented by a JobSet), it suspends the JobSet and tries to restore the original PodTemplate, so that the same JobSet can be re-admitted to another ResourceFlavor (potentially with different nodeSelectors).
For example, the following e2e test for Job shows how Kueue can preempt a workload and re-admit with another nodeSelector: link.
However, the integration with Kueue does not currently work, because Kueue's request to suspend the JobSet fails if the same request also updates the PodTemplate.
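To make the flow above concrete, here is a minimal sketch of the admit / evict / re-admit cycle. The `JobSet` and `PodTemplate` types, the `admit` and `evict` helpers, and the `example.com/flavor` label key are all hypothetical simplifications for illustration; they are not the real JobSet or Kueue APIs. The point is that `evict` sets `Suspend` and restores the node selector in the same update, which is the kind of request described as being rejected.

```go
package main

import "fmt"

// PodTemplate is a simplified, hypothetical stand-in for a pod template.
type PodTemplate struct {
	NodeSelector map[string]string
}

// JobSet is a simplified, hypothetical stand-in for the JobSet object.
type JobSet struct {
	Suspend  bool
	Template PodTemplate
}

// copySelector returns an independent copy of a node selector map.
func copySelector(in map[string]string) map[string]string {
	out := make(map[string]string, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// admit injects the flavor's node selector, unsuspends the JobSet, and
// returns the original selector so it can be restored on eviction.
func admit(js *JobSet, flavorSelector map[string]string) map[string]string {
	original := copySelector(js.Template.NodeSelector)
	merged := copySelector(original)
	for k, v := range flavorSelector {
		merged[k] = v
	}
	js.Template.NodeSelector = merged
	js.Suspend = false
	return original
}

// evict suspends the JobSet and restores the original node selector
// in one combined change.
func evict(js *JobSet, original map[string]string) {
	js.Suspend = true
	js.Template.NodeSelector = copySelector(original)
}

func main() {
	js := &JobSet{Suspend: true, Template: PodTemplate{NodeSelector: map[string]string{"team": "a"}}}

	original := admit(js, map[string]string{"example.com/flavor": "on-demand"})
	fmt.Println("admitted:", js.Suspend, js.Template.NodeSelector)

	evict(js, original)
	fmt.Println("evicted:", js.Suspend, js.Template.NodeSelector)

	admit(js, map[string]string{"example.com/flavor": "spot"})
	fmt.Println("re-admitted:", js.Suspend, js.Template.NodeSelector)
}
```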
/assign
/cc @danielvegamyhre @tenzen-y
Let's fix this, but rather than solely doing a one-off fix here, we need to iron out the specific requirements for JobSet + Kueue integration, as well as align our roadmaps so changes in Kueue don't break JobSet integration.
We had a similar issue a couple of months ago: Kueue tried to mutate certain podTemplate fields on suspended JobSets, but those fields are immutable in JobSet, which led to a user reporting the issue (#579).
One thing we could potentially do is make the entire podTemplate mutable in JobSet, to prevent any further issues like this.
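As a rough sketch of what such a relaxed rule could look like: the check below allows any pod template change as long as the JobSet is suspended before or after the update, and rejects template changes only while it stays running. The `JobSetSpec` type, `validateUpdate`, and the exact condition are assumptions made for illustration; the actual JobSet webhook may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// PodTemplate is a simplified, hypothetical stand-in for a pod template.
type PodTemplate struct {
	NodeSelector map[string]string
	Image        string
}

// JobSetSpec is a simplified, hypothetical stand-in for the JobSet spec.
type JobSetSpec struct {
	Suspend  bool
	Template PodTemplate
}

var errImmutable = errors.New("pod template is immutable while the JobSet is running")

// validateUpdate permits pod template changes only when the JobSet is
// suspended in the old or the new spec (assumed rule, not the real webhook).
func validateUpdate(oldSpec, newSpec JobSetSpec) error {
	// Note: reflect.DeepEqual treats a nil map and an empty map as different.
	if reflect.DeepEqual(oldSpec.Template, newSpec.Template) {
		return nil
	}
	if oldSpec.Suspend || newSpec.Suspend {
		return nil
	}
	return errImmutable
}

func main() {
	running := JobSetSpec{
		Suspend:  false,
		Template: PodTemplate{NodeSelector: map[string]string{"example.com/flavor": "on-demand"}},
	}

	// Suspend and restore the template in a single update.
	suspendAndRestore := JobSetSpec{Suspend: true, Template: PodTemplate{NodeSelector: map[string]string{}}}
	fmt.Println("suspend + mutate:", validateUpdate(running, suspendAndRestore))

	// Mutate the template while the JobSet keeps running.
	mutateRunning := JobSetSpec{
		Suspend:  false,
		Template: PodTemplate{NodeSelector: map[string]string{"example.com/flavor": "spot"}},
	}
	fmt.Println("mutate while running:", validateUpdate(running, mutateRunning))
}
```

Allowing the change when either side of the update is suspended covers both the eviction case (suspend and restore together) and the admission case (unsuspend and inject together); a stricter rule could require the old spec to be suspended.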
I think this is a good point. At the technical level, we should keep extending the JobSet e2e test suite in Kueue, which was started recently.
EDIT: the test suite for reference: https://github.com/kubernetes-sigs/kueue/blob/main/test/e2e/singlecluster/jobset_test.go. I'm going to extend it as part of kubernetes-sigs/kueue#2691 (started the PR in kubernetes-sigs/kueue#2700).
The proposal for the e2e test scenario which covers this and #623: #623 (comment)