Support Autoscaling replicatedJob
tenzen-y opened this issue · 11 comments
What would you like to be added:
I would like JobSet to support the scale subresource, along with metrics that map onto an HPA resource, like this:
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: autoscaling-sample
spec:
  scalePolicy:
    replicatedJobName: workers # Job Name
    replicas: 2 # scaling target
    autoScaling: # typed `[]autoscalingv2.MetricSpec`
      minReplicas: 1
      maxReplicas: 10
      metrics:
      [...]
  replicatedJobs:
  - name: workers
    [...]
Why is this needed:
In the machine learning field, frameworks often support elastically scaling the number of worker nodes, as PyTorch Elastic does.
Here is Kubeflow TrainingOperator Example: https://github.com/kubeflow/training-operator/blob/e31d11faa9f6ce5111b60c01079d39295589e0ef/pkg/apis/kubeflow.org/v1/pytorch_types.go#L98-L135
cc: @andreyvelich @ahg-g
/kind feature
Thanks @tenzen-y, support for scale subresource was part of the original JobSet design https://bit.ly/k8s-jobset but we haven't been able to prioritize it yet. We will likely need other developers to do the implementation here, since I am currently busy with other work.
Yeah, actually during JobSet design in https://bit.ly/k8s-jobset, I mentioned the scale subresource :)
For sure. If nobody else has the bandwidth, I may be able to take this issue, but I'm not confident that I have enough bandwidth for it right now.
@tenzen-y it would be helpful if you or others could take this on, a short KEP would be great so we can align on the API changes to JobSet
Yeah, we should definitely create a small KEP. Once I find the time, I will assign this issue to myself.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
I'd like to pick this issue up and try implementing it. :)
/assign
After a PoC test, I ran into some problems.
I wanted to implement the scale subresource with a kubebuilder marker like this:
// +kubebuilder:subresource:scale:specpath=.spec.replicatedJobs[*].replicas,statuspath=.status.replicatedJobsStatus[*].active,selectorpath=
but the HPA controller could not read the scale:
root@VM-0-6-ubuntu:/home/ubuntu# kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
network-jobset JobSet/network-jobset cpu: <unknown>/80% 1 3 0 10h
root@VM-0-6-ubuntu:/home/ubuntu# kubectl describe hpa
Name: network-jobset
Namespace: default
Labels: <none>
Annotations: <none>
CreationTimestamp: Wed, 11 Sep 2024 22:46:13 +0800
Reference: JobSet/network-jobset
Metrics: ( current / target )
resource cpu on pods (as a percentage of request): <unknown> / 80%
Min replicas: 1
Max replicas: 3
JobSet pods: 0 current / 0 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale False FailedGetScale the HPA controller was unable to get the target's current scale: Internal error occurred: the spec replicas field ".spec.replicatedJobs[*].replicas" does not exist
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedGetScale 2m43s (x2461 over 10h) horizontal-pod-autoscaler Internal error occurred: the spec replicas field ".spec.replicatedJobs[*].replicas" does not exist
root@VM-0-6-ubuntu:/home/ubuntu#
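The error above is consistent with the documented restriction on a CRD's `specReplicasPath`: only JSON paths in plain dot notation, without array notation, are allowed. The following is a minimal sketch of why a path like `.spec.replicatedJobs[*].replicas` cannot resolve under that rule. It is not the API server's actual code; `lookupReplicas` and the sample object are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// lookupReplicas (hypothetical helper) walks a dot-separated path such as
// ".spec.replicas" through a decoded JSON object and returns the integer
// stored there. It has no support for array notation like "[*]", which
// mirrors the documented dot-notation-only rule for specReplicasPath.
func lookupReplicas(obj map[string]any, path string) (int64, error) {
	if strings.ContainsAny(path, "[]*") {
		return 0, fmt.Errorf("path %q uses array notation, which is not supported", path)
	}
	var cur any = obj
	for _, field := range strings.Split(strings.TrimPrefix(path, "."), ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			// The parent is a list or a scalar, so it has no named fields.
			return 0, fmt.Errorf("field %q in path %q does not exist", field, path)
		}
		cur, ok = m[field]
		if !ok {
			return 0, fmt.Errorf("field %q in path %q does not exist", field, path)
		}
	}
	switch v := cur.(type) {
	case int64:
		return v, nil
	case int:
		return int64(v), nil
	case float64: // encoding/json decodes numbers as float64
		return int64(v), nil
	}
	return 0, fmt.Errorf("value at %q is not a number", path)
}

func main() {
	// A trimmed-down JobSet-like object; the replica count sits inside a list.
	jobSet := map[string]any{
		"spec": map[string]any{
			"replicatedJobs": []any{
				map[string]any{"name": "workers", "replicas": int64(2)},
			},
		},
	}
	_, err := lookupReplicas(jobSet, ".spec.replicatedJobs[*].replicas")
	fmt.Println("lookup failed:", err)
}
```

Because the replica count lives inside a list, no single dot-notation path can address it, whichever entry the autoscaler is meant to scale.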
I'd like to implement this without changing too much of the original API, but the fact that ReplicatedJobs is a list ([]ReplicatedJob) seems to be the main obstacle to implementing the scale subresource this way.
// JobSetSpec defines the desired state of JobSet
type JobSetSpec struct {
	// ReplicatedJobs is the group of jobs that will form the set.
	// +listType=map
	// +listMapKey=name
	ReplicatedJobs []ReplicatedJob `json:"replicatedJobs,omitempty"`
	...
}
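One possible direction, sketched here under assumptions rather than as the project's design, is to expose a scalar replica count at a fixed path (as the `scalePolicy` block proposed at the top of this issue does) and have the controller copy it into the named entry of the list. All types below are trimmed-down stand-ins, and `ScalePolicy` and `applyScalePolicy` are hypothetical names. With a layout like this, the scale marker could point at a dot-notation path such as `.spec.scalePolicy.replicas`.

```go
package main

import "fmt"

// ReplicatedJob is a trimmed-down stand-in for the JobSet type.
type ReplicatedJob struct {
	Name     string `json:"name"`
	Replicas int32  `json:"replicas,omitempty"`
}

// ScalePolicy is a hypothetical type modelled on the YAML proposed in this
// issue: it names one replicated job and holds the scalar scaling target.
type ScalePolicy struct {
	ReplicatedJobName string `json:"replicatedJobName"`
	Replicas          int32  `json:"replicas"`
}

// JobSetSpec is a stand-in that keeps only the fields used by this sketch.
type JobSetSpec struct {
	ScalePolicy    *ScalePolicy    `json:"scalePolicy,omitempty"`
	ReplicatedJobs []ReplicatedJob `json:"replicatedJobs,omitempty"`
}

// applyScalePolicy (hypothetical) copies the scalar target into the matching
// list entry, so a scale update lands on exactly one replicated job.
func applyScalePolicy(spec *JobSetSpec) error {
	if spec.ScalePolicy == nil {
		return nil // autoscaling not configured; nothing to do
	}
	for i := range spec.ReplicatedJobs {
		if spec.ReplicatedJobs[i].Name == spec.ScalePolicy.ReplicatedJobName {
			spec.ReplicatedJobs[i].Replicas = spec.ScalePolicy.Replicas
			return nil
		}
	}
	return fmt.Errorf("scalePolicy references unknown replicatedJob %q",
		spec.ScalePolicy.ReplicatedJobName)
}

func main() {
	spec := JobSetSpec{
		ScalePolicy: &ScalePolicy{ReplicatedJobName: "workers", Replicas: 4},
		ReplicatedJobs: []ReplicatedJob{
			{Name: "driver", Replicas: 1},
			{Name: "workers", Replicas: 2},
		},
	}
	if err := applyScalePolicy(&spec); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", spec.ReplicatedJobs)
}
```

The trade-off is that only one replicated job per JobSet can be autoscaled, and the API gains a second place where a replica count is stored, which is the kind of change a KEP would need to settle.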