kubernetes-sigs/jobset

Support Autoscaling replicatedJob

tenzen-y opened this issue · 11 comments

What would you like to be added:
I would like to add support for the scale subresource and the metrics that HPA can consume, something like this:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: autoscaling-sample
spec:
  scalePolicy:
    replicatedJobName: workers   # name of the ReplicatedJob to scale
    replicas: 2                  # scaling target
    autoScaling:
      minReplicas: 1
      maxReplicas: 10
      metrics:                   # typed `[]autoscalingv2.MetricSpec`
        [...]
  replicatedJobs:
  - name: workers
[...]
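
For illustration, if JobSet exposed a scale subresource for the ReplicatedJob named in scalePolicy, a standard autoscaling/v2 HorizontalPodAutoscaler could target it roughly as in the sketch below (the metric and object names here are placeholders, not part of any existing API):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: autoscaling-sample
spec:
  scaleTargetRef:
    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    name: autoscaling-sample
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80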

Why is this needed:
In the machine learning field, workers are often scaled elastically, for example with PyTorch Elastic.
Here is a Kubeflow Training Operator example: https://github.com/kubeflow/training-operator/blob/e31d11faa9f6ce5111b60c01079d39295589e0ef/pkg/apis/kubeflow.org/v1/pytorch_types.go#L98-L135

/kind feature

Thanks @tenzen-y, support for the scale subresource was part of the original JobSet design https://bit.ly/k8s-jobset but we haven't been able to prioritize it yet. We will likely need other developers to do the implementation here, since I am currently busy with other work.

Yeah, actually during JobSet design in https://bit.ly/k8s-jobset, I mentioned the scale subresource :)
For sure. If no one else has the bandwidth, I may be able to take this issue, but I'm not confident I have enough bandwidth for it right now.

@tenzen-y it would be helpful if you or others could take this on; a short KEP would be great so we can align on the API changes to JobSet.

Yeah, we should definitely create a small KEP. Once I find the time, I will try to assign this issue to myself.

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/remove-lifecycle stale

I'd like to take this issue and give the implementation a try. :)

/assign

After a PoC test, I found some problems.
I tried to implement the scale subresource with the following kubebuilder marker:

// +kubebuilder:subresource:scale:specpath=.spec.replicatedJobs[*].replicas,statuspath=.status.replicatedJobsStatus[*].active,selectorpath=

but got:

root@VM-0-6-ubuntu:/home/ubuntu# kubectl get hpa
NAME             REFERENCE               TARGETS              MINPODS   MAXPODS   REPLICAS   AGE
network-jobset   JobSet/network-jobset   cpu: <unknown>/80%   1         3         0          10h
root@VM-0-6-ubuntu:/home/ubuntu# kubectl describe hpa
Name:                                                  network-jobset
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Wed, 11 Sep 2024 22:46:13 +0800
Reference:                                             JobSet/network-jobset
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  <unknown> / 80%
Min replicas:                                          1
Max replicas:                                          3
JobSet pods:                                           0 current / 0 desired
Conditions:
  Type         Status  Reason          Message
  ----         ------  ------          -------
  AbleToScale  False   FailedGetScale  the HPA controller was unable to get the target's current scale: Internal error occurred: the spec replicas field ".spec.replicatedJobs[*].replicas" does not exist
Events:
  Type     Reason          Age                     From                       Message
  ----     ------          ----                    ----                       -------
  Warning  FailedGetScale  2m43s (x2461 over 10h)  horizontal-pod-autoscaler  Internal error occurred: the spec replicas field ".spec.replicatedJobs[*].replicas" does not exist
root@VM-0-6-ubuntu:/home/ubuntu#

I'd like to implement this without changing too much of the existing API, but the fact that replicatedJobs is a []ReplicatedJob slice seems to be the main obstacle to implementing the scale subresource.

// JobSetSpec defines the desired state of JobSet
type JobSetSpec struct {
	// ReplicatedJobs is the group of jobs that will form the set.
	// +listType=map
	// +listMapKey=name
	ReplicatedJobs []ReplicatedJob `json:"replicatedJobs,omitempty"`
...
}
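
For context: CRD scale subresources (apiextensions.k8s.io/v1) only accept plain dotted JSON paths for specReplicasPath and statusReplicasPath; array notation such as .spec.replicatedJobs[*].replicas is not allowed, which is consistent with the FailedGetScale error above. A minimal sketch of what the generated CRD would need instead, assuming a hypothetical single integer replicas field (these paths do not exist in the current JobSet API):

subresources:
  scale:
    # hypothetical paths for illustration only
    specReplicasPath: .spec.scalePolicy.replicas
    statusReplicasPath: .status.replicas
    labelSelectorPath: .status.selector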