kubernetes-sigs/jobset

Support Running condition

tenzen-y opened this issue · 18 comments

What would you like to be added:
I would like to support a new JobSetConditionType, Running, so that researchers can easily find out whether a JobSet is still running.

// These are built-in conditions of a JobSet.
const (
// JobSetCompleted means the job has completed its execution.
JobSetCompleted JobSetConditionType = "Completed"
// JobSetFailed means the job has failed its execution.
JobSetFailed JobSetConditionType = "Failed"
// JobSetSuspended means the job is suspended.
JobSetSuspended JobSetConditionType = "Suspended"
// JobSetStartupPolicyInProgress means the StartupPolicy is in progress.
JobSetStartupPolicyInProgress JobSetConditionType = "StartupPolicyInProgress"
// JobSetStartupPolicyCompleted means the StartupPolicy has completed.
JobSetStartupPolicyCompleted JobSetConditionType = "StartupPolicyCompleted"
)
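
For illustration, the proposal would add a single new constant to that block (a sketch only; the name JobSetRunning is an assumption, since nothing has been merged):

// JobSetRunning would mean the JobSet's child Jobs are currently running.
// Hypothetical: no such condition exists in the JobSet API today.
JobSetRunning JobSetConditionType = "Running"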

However, the current batch/v1 Job doesn't support a Running condition: https://github.com/kubernetes/kubernetes/blob/4f04dffe5b2cd652a20b362eaea30164e3e5ea54/pkg/apis/batch/types.go#L612C4-L627

So, we need to define when we can say a Job/JobSet is still running. It might be better to start this discussion from KEP #2804.

Why is this needed:
Currently, we represent this state implicitly as Suspended=False, Completed=False, and Failed=False, but that representation is not intuitive. So, I would like to express the Running state explicitly.
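
As a minimal sketch of what callers have to do today, assuming the JobSet status conditions are standard metav1.Condition values and using the k8s.io/apimachinery condition helpers (the helper function itself is hypothetical):

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isImplicitlyRunning infers "running" from the absence of the Suspended,
// Completed, and Failed conditions, since no explicit Running condition exists.
func isImplicitlyRunning(conditions []metav1.Condition) bool {
	return !meta.IsStatusConditionTrue(conditions, "Suspended") &&
		!meta.IsStatusConditionTrue(conditions, "Completed") &&
		!meta.IsStatusConditionTrue(conditions, "Failed")
}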

/kind feature

hi! @tenzen-y There are some discussions on this issue #545

I have considered adding an "Active" or "Running" field before, but it would be inconsistent with the Job API, where, as you noted, the Active/Running state is implicit (i.e., not failed, completed, or suspended).

@alculquicondor I am curious, is there a reason an "active" or "running" condition is not part of the Job status?

hi! @tenzen-y There are some discussions on this issue #545

Thank you for pointing this out. I would like to focus on the new condition type separate from counting in this issue.

I have considered adding an "Active" or "Running" field before, but it would be inconsistent with the Job API, where, as you noted, the Active/Running state is implicit (i.e., not failed, completed, or suspended).

@alculquicondor I am curious, is there a reason an "active" or "running" condition is not part of the Job status?

@danielvegamyhre I have the same concern as you. So, as I mentioned in the issue description, it might be better to start discussing the Running condition in kubernetes/enhancements#3524.

Additionally, Aldo actually mentioned a Running condition for the batch/v1 Job here: kubernetes/enhancements#3524 (comment)

The question is what "running" means when you have multiple pods. A field such as ready (the number of ready pods) in the Job status gives controllers enough information without having to define what a "Running" condition would be.

Maybe we should aim for a similar value in jobset status? Maybe it can just add up all the active fields from all of its jobs?
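
A rough sketch of that idea, assuming the controller already has the list of child batch/v1 Jobs in hand (Active is a real JobStatus field, but the helper itself is hypothetical):

import batchv1 "k8s.io/api/batch/v1"

// totalActive sums the Active counters of all child Jobs.
// Note that Active also counts pending pods, which is discussed below.
func totalActive(jobs []batchv1.Job) int32 {
	var total int32
	for _, job := range jobs {
		total += job.Status.Active
	}
	return total
}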

The question is what "running" means when you have multiple pods.

Ideally, Running means that all PyTorch workers have been scheduled by the Kubernetes control plane, have pulled the image, and are ready to process model training.
As of today, we mark a PyTorchJob as Running if the master replica is active, or if one of the workers is active in case the PyTorchJob doesn't have a master.
From my point of view, that is wrong, since model training can start only when all replicas are active.
Image pull might take a significant amount of time, especially for large models, and the Job should not be in the Running condition during that time. cc @johnugeorge @tenzen-y

Maybe it can just add up all the active fields from all of its jobs?

That looks good to me.

Maybe it can just add up all the active fields from all of its jobs?

That looks good to me.

@andreyvelich @alculquicondor Since active includes pending pods, I think what we would want is the ready values from each child Job.

JobSet already tracks the number of "Ready" jobs for each ReplicatedJob in the ReplicatedJobStatus here. A child Job is counted as ready if ready pods + completed pods >= expected pod count for the job, where the expected pod count is the minimum of parallelism and completions.
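
In other words, the per-Job readiness rule described above boils down to roughly this (a simplified sketch, not the actual controller code):

// jobIsReady mirrors the rule above: ready plus completed pods must cover
// the expected pod count, which is min(parallelism, completions).
func jobIsReady(ready, completed, parallelism, completions int32) bool {
	expected := parallelism
	if completions < expected {
		expected = completions
	}
	return ready+completed >= expected
}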

So we could consider a JobSet "Active/Running" if all ReplicatedJobs have replicatedJobStatus.Ready == replicatedJob.Spec.Replicas
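
A hedged sketch of that JobSet-level check, using the v1alpha2 API types (treat the exact field names as assumptions based on the status described above; the helper is hypothetical):

import jobsetv1alpha2 "sigs.k8s.io/jobset/api/jobset/v1alpha2"

// allReplicatedJobsReady returns true when every ReplicatedJob reports as
// many ready child Jobs as it has replicas.
func allReplicatedJobsReady(js *jobsetv1alpha2.JobSet) bool {
	ready := make(map[string]int32, len(js.Status.ReplicatedJobsStatus))
	for _, s := range js.Status.ReplicatedJobsStatus {
		ready[s.Name] = s.Ready
	}
	for _, rj := range js.Spec.ReplicatedJobs {
		if ready[rj.Name] != rj.Replicas {
			return false
		}
	}
	return true
}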

Oops, sorry, I meant to say ready indeed :)

ahg-g commented

So we could consider a JobSet "Active/Running" if all ReplicatedJobs have replicatedJobStatus.Ready == replicatedJob.Spec.Replicas

@tenzen-y Do we need to set a condition for that given that the info is already in the status?

Ping @tenzen-y @andreyvelich

Is this still necessary?

Yeah, I know that @googs1025 did some work to improve JobSet conditions in this PR: #594.
Also, I understand that it is hard to define exactly what Running means.

So maybe we could just follow @danielvegamyhre suggestion here: #571 (comment)

From my perspective, we should add this condition to the Job, since users assume the following sequence of conditions when they create a training job:
Created -> Running -> Succeeded

Any thoughts @tenzen-y ?

@ahg-g @andreyvelich @kannon92 As we discussed here, I'm hesitant to add this Running condition: at first glance, a Running phase/condition can be introduced easily, but deciding when a JobSet is actually Running is too hard.

So, I would rather not introduce the Running phase/condition on the JobSet side. But I'm wondering if Kubeflow Training v2 could introduce a Running condition type based on the JobSet terminal state (

// TerminalState the state of the JobSet when it finishes execution.
// It can be either Complete or Failed. Otherwise, it is empty by default.
TerminalState string `json:"terminalState,omitempty"`
).
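
For example, a downstream controller such as Kubeflow Training v2 could approximate a Running state from that field plus the Suspended condition (a rough sketch under these assumptions, not an agreed-upon design):

import (
	"k8s.io/apimachinery/pkg/api/meta"
	jobsetv1alpha2 "sigs.k8s.io/jobset/api/jobset/v1alpha2"
)

// isLikelyRunning is a downstream approximation: the JobSet has not reached a
// terminal state (Complete/Failed) and is not suspended. It deliberately
// sidesteps the pod-level readiness question discussed above.
func isLikelyRunning(js *jobsetv1alpha2.JobSet) bool {
	return js.Status.TerminalState == "" &&
		!meta.IsStatusConditionTrue(js.Status.Conditions, "Suspended")
}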

In conclusion, I prefer not to introduce the Running phase/condition on the JobSet side.

Sounds good. I think phase can suffice.

Maybe it’s something you can add to the TrainJob KEP and we can close this issue.

Sounds good. I think phase can suffice.

Maybe it’s something you can add to the TrainJob KEP and we can close this issue.

That makes sense. Anyway, let's close this issue for now.
Once we find use cases whose benefits outweigh the drawbacks, we can revisit this.
/close

@tenzen-y: Closing this issue.

