kubernetes-sigs/kueue

Support dynamically sized (elastic) jobs

Opened this issue · 13 comments

ahg-g commented

We should have a clear path toward supporting Spark and other dynamically sized jobs. Ray is another example.

One related aspect is support for dynamically updating the resource requirements of a workload. We can probably limit that to changing the count of a PodSet in QueuedWorkload: in Spark, the number of workers can change during the runtime of the job, but the resource requirements of a worker cannot.
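To make the "only the count is mutable" restriction concrete, here is a minimal sketch of an update check. The `PodSet` class and `validate_podset_update` function are hypothetical, simplified stand-ins written for illustration; they are not Kueue's actual types or validation code.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PodSet:
    """Hypothetical, simplified stand-in for a PodSet in a QueuedWorkload."""
    name: str
    count: int
    # Per-pod resource requests, e.g. {"cpu": "2", "memory": "4Gi"}.
    requests: Dict[str, str] = field(default_factory=dict)


def validate_podset_update(old: PodSet, new: PodSet) -> None:
    """Raise ValueError unless the update changes nothing but `count`."""
    if new.name != old.name:
        raise ValueError("PodSet name is immutable")
    if new.requests != old.requests:
        raise ValueError("per-pod resource requests are immutable")
    if new.count < 0:
        raise ValueError("count must be non-negative")
```

Under this assumption, scaling Spark workers from 4 to 8 passes the check, while changing a worker's CPU request is rejected.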

One idea is to model it similarly to "in-place update to pod resources" [1], but in our case the mutable field would be the count. The driver pod in Spark would watch the corresponding QueuedWorkload instance and adjust the number of workers when the new count is admitted.

[1] https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources
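A rough sketch of the driver-side behavior described above, assuming the count is split into a requested value and an admitted value, loosely analogous to how in-place pod resize separates desired from allocated resources. The `WorkerCount` class, its field names, and `workers_to_change` are assumptions made for illustration, not part of any existing API.

```python
from dataclasses import dataclass


@dataclass
class WorkerCount:
    """Hypothetical view of one PodSet's count as seen by the driver."""
    requested: int  # count the job has asked for
    admitted: int   # count the queue controller has admitted so far


def workers_to_change(state: WorkerCount, running: int) -> int:
    """Return how many workers to add (positive) or remove (negative).

    The driver converges on the admitted count only; a larger requested
    count has no effect until it is admitted.
    """
    return state.admitted - running
```

For example, with 4 workers running, a request for 10 that has not yet been admitted yields no change; once 10 is admitted, the driver would add 6.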

/lifecycle stale

/remove-lifecycle stale

/lifecycle frozen

I am interested in working on this. It probably needs some sort of design doc; I will work with @alculquicondor and see if I can put something together in the next few weeks.

/assign

Hi @andrewsykim! Is there any progress?

@tenzen-y I was planning to work on this in a couple of weeks, during the holiday season, but feel free to start on it if you're interested.

@andrewsykim Thanks. I don't have enough time right now either, so I will check in on progress again once I do.

FYI, @vicentefb and I are working on a proposal in a Google Doc; we will share it here once it's ready.

/reopen

@tenzen-y: Reopened this issue.

Hi, I would love to work on this issue, especially the Ray autoscaling support. Would resuming #1852 be a good starting point?

/assign