Support Running condition
tenzen-y opened this issue · 18 comments
What would you like to be added:
I would like to support a new JobSetConditionType, Running, so that researchers can easily find out whether the JobSet is still running.
jobset/api/jobset/v1alpha2/jobset_types.go
Lines 50 to 62 in 235745c
However, the current batch/v1 Job doesn't support the Running condition: https://github.com/kubernetes/kubernetes/blob/4f04dffe5b2cd652a20b362eaea30164e3e5ea54/pkg/apis/batch/types.go#L612C4-L627
So, we need to define when we can say the Job/JobSet is still running. Starting this discussion from KEP #2804 might be better.
Why is this needed:
Currently, we represent a similar state with Suspend=False, Completed=False, and Failed=False, but that representation is not intuitive. So, I would like to express the Running state explicitly.
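For illustration, here is a minimal sketch (not JobSet code) of the indirect check users have to write today, assuming the standard metav1.Condition helpers and the Suspended/Completed/Failed condition type strings:

```go
// Minimal sketch of how "running" has to be inferred today: none of the
// Suspended/Completed/Failed conditions are True. The condition type strings
// are assumed to match the JobSet API.
package conditions

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isImplicitlyRunning reports whether a JobSet can be treated as running,
// using only the conditions that exist today.
func isImplicitlyRunning(conds []metav1.Condition) bool {
	return !meta.IsStatusConditionTrue(conds, "Suspended") &&
		!meta.IsStatusConditionTrue(conds, "Completed") &&
		!meta.IsStatusConditionTrue(conds, "Failed")
}
```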
cc: @andreyvelich @ahg-g
/kind feature
I have considered adding an "Active" or "Running" field before, but it would be inconsistent with the Job API, where, as you noted, the Active/Running state is implicit (i.e., not failed, completed, or suspended).
@alculquicondor I am curious, is there a reason an "active" or "running" condition is not part of the Job status?
@danielvegamyhre I have the same concern as you. So, as I mentioned in the issue description, it might be better to start discussing the Running condition in kubernetes/enhancements#3524.
Actually, Aldo already mentioned the Running condition for the batch/v1 Job here: kubernetes/enhancements#3524 (comment)
The question is what does "running" mean when you have multiple pods. A field such as ready (the number of ready pods) in the Job status gives controllers enough information without having to define what a "Running" condition would mean.
Maybe we should aim for a similar value in the JobSet status? Maybe it can just add up all the active fields from all of its Jobs?
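As a rough sketch of that aggregation idea (assuming the controller already has the child batchv1 Jobs in hand; the helper name is just illustrative):

```go
// Sums the per-Job status counters into JobSet-level numbers.
package aggregate

import batchv1 "k8s.io/api/batch/v1"

// sumJobCounters adds up the active (includes pending) and ready pod counts
// reported by each child Job's status.
func sumJobCounters(jobs []batchv1.Job) (active, ready int32) {
	for _, j := range jobs {
		active += j.Status.Active
		if j.Status.Ready != nil {
			ready += *j.Status.Ready
		}
	}
	return active, ready
}
```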
The question is what does "running" mean when you have multiple pods.
Ideally, Running means that all PyTorch workers have been scheduled by the Kubernetes control plane, have pulled the image, and are ready to process model training.
As of today, we mark a PyTorchJob as Running if the master replica is active, or if one of the workers is active in case the PyTorchJob doesn't have a master.
From my point of view, that is wrong, since model training can start only when all replicas are active.
An image pull might take a significant amount of time, especially for large models, and the Job should not be in the Running condition during that time. cc @johnugeorge @tenzen-y
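To make the difference concrete, here is a hedged illustration of the two policies using plain counts rather than the actual Kubeflow types (function names are made up for this sketch):

```go
// Compares the current, looser Running policy with the stricter one proposed above.
package policy

// runningToday mirrors the current behavior: the master is active, or, when
// there is no master, any worker is active.
func runningToday(hasMaster, masterActive bool, activeWorkers int) bool {
	if hasMaster {
		return masterActive
	}
	return activeWorkers > 0
}

// runningStrict requires every expected replica to be ready (image pulled,
// container started) before the job counts as running.
func runningStrict(readyReplicas, expectedReplicas int) bool {
	return expectedReplicas > 0 && readyReplicas == expectedReplicas
}
```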
Maybe it can just add up all the active fields from all of its jobs?
That looks good to me.
@andreyvelich @alculquicondor Since active includes pending pods, I think what we would want is the ready values from each child Job.
JobSet already tracks the number of "Ready" Jobs for each ReplicatedJob in the ReplicatedJobStatus here. A child Job is counted as ready if ready pods + completed pods >= expected pod count for the Job, where the expected pod count is the minimum of parallelism and completions.
So we could consider a JobSet "Active/Running" if all ReplicatedJobs have replicatedJobStatus.Ready == replicatedJob.Spec.Replicas.
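A sketch of that check against the jobset v1alpha2 types referenced in this thread (the exact field names here are my assumption, not authoritative):

```go
// Checks whether every ReplicatedJob has as many ready child Jobs as replicas,
// i.e. the proposed "Active/Running" test.
package readiness

import jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2"

func allReplicatedJobsReady(js *jobset.JobSet) bool {
	ready := map[string]int32{}
	for _, s := range js.Status.ReplicatedJobsStatus {
		ready[s.Name] = s.Ready
	}
	for _, rj := range js.Spec.ReplicatedJobs {
		if ready[rj.Name] != rj.Replicas {
			return false
		}
	}
	return true
}
```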
Oops, sorry, I meant to say ready indeed :)
So we could consider a JobSet "Active/Running" if all ReplicatedJobs have replicatedJobStatus.Ready == replicatedJob.Spec.Replicas
@tenzen-y Do we need to set a condition for that given that the info is already in the status?
Ping @tenzen-y @andreyvelich
Is this still necessary?
Yeah, I know that @googs1025 did some work to improve JobSet conditions in this PR: #594.
Also, I understand that it is hard to define exactly what Running means.
So maybe we could just follow @danielvegamyhre's suggestion here: #571 (comment)
From my perspective we should add this condition to the Job, since users assume the following sequence when they create a training job:
Created -> Running -> Succeeded
Any thoughts @tenzen-y ?
@ahg-g @andreyvelich @kannon92 As we discussed here, I'm hesitant to add this Running condition: at first glance a Running phase/condition could be introduced easily, but deciding when a JobSet is actually Running is too hard.
So, I would rather not introduce the Running phase/condition on the JobSet side. But I'm wondering if Kubeflow Training v2 could introduce the Running condition type based on the JobSet terminal state:
jobset/api/jobset/v1alpha2/jobset_types.go
Lines 137 to 139 in 56c77da
In conclusion, I would prefer not to introduce the Running phase/condition on the JobSet side.
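For illustration, here is a sketch (not Kubeflow Training v2 code) of how a downstream controller could derive its own Running signal from the JobSet status instead of JobSet exposing one; the TerminalState field and JobSetSuspended constant are assumptions based on the references above:

```go
// Treats the JobSet as running while it has no terminal state and is not suspended.
package trainjob

import (
	"k8s.io/apimachinery/pkg/api/meta"
	jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2"
)

func jobSetStillRunning(js *jobset.JobSet) bool {
	return js.Status.TerminalState == "" &&
		!meta.IsStatusConditionTrue(js.Status.Conditions, string(jobset.JobSetSuspended))
}
```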
Sounds good. I think phase can suffice.
Maybe it’s something you can add to the TrainJob KEP and we can close this issue.
That makes sense. Anyway, let's close this issue for now.
Once we find use cases that outweigh the drawbacks, we can revisit it.
/close
@tenzen-y: Closing this issue.