kubernetes-sigs/jobset

Add global Job index label/annotation to provide a global index for each job across the entire JobSet


What would you like to be added:
Add a label and annotation, jobset.sigs.k8s.io/job-id, containing an integer value from 0 to N-1, where N is the total number of Jobs in the JobSet, so that each Job is assigned a globally unique index within the JobSet.

Why is this needed:
Currently the jobset.sigs.k8s.io/job-index label contains the local job index within its parent replicatedJob (values range from 0 to N-1 where N=replicatedJob.replicas).

This means for a JobSet with multiple replicatedJobs, multiple jobs may have the same job index (for example, two replicated jobs of 1 replica each will result in 2 Jobs each with job-index of 0).
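To make the collision concrete, here is a minimal sketch in plain Python (not the JobSet controller code). The replicatedJob names "workers-a" and "workers-b" are hypothetical, and the JobSet is modeled simply as a list of (name, replicas) pairs.

```python
from collections import Counter

# Hypothetical JobSet shape: (replicatedJob name, replicas).
# Names are illustrative only.
replicated_jobs = [("workers-a", 1), ("workers-b", 1)]

# Each Job receives a job-index that is local to its parent replicatedJob.
local_indexes = [
    (name, idx)
    for name, replicas in replicated_jobs
    for idx in range(replicas)
]

# Count how many Jobs share each job-index value; keep only duplicates.
counts = Counter(idx for _, idx in local_indexes)
collisions = {idx: n for idx, n in counts.items() if n > 1}

print(collisions)  # {0: 2} -> two Jobs both have job-index 0
```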

In TPU multislice training we have used a JobSet with a single replicated job and exclusive job placement per slice (node pool), to assign 1 job replica exclusive usage of each TPU Slice. The job-index is then a natural and convenient way of assigning a unique TPU slice ID at the TPU runtime layer, which is required by TPU driver/runtime libraries for multislice training.

However, some users want to run multislice training workloads using a JobSet with multiple replicated jobs that have different templates. This is currently not possible because the job-index values from different replicatedJobs are not unique (as described above), and the TPU runtime requires unique slice IDs.

Therefore, we can add a new annotation, "jobset.sigs.k8s.io/job-id", which sets a globally unique job index across the JobSet.
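
One possible way to derive such an ID, sketched in plain Python rather than as the actual controller change: walk the replicatedJobs in the order they are listed and offset each local job-index by the number of replicas that came before it. The ordering rule, the function name global_job_ids, and the replicatedJob names are assumptions for illustration; this issue does not prescribe a specific numbering scheme.

```python
def global_job_ids(replicated_jobs):
    """Map (replicatedJob name, local job-index) -> candidate global job-id.

    Assumption: IDs are assigned in the order replicatedJobs are listed,
    so each replicatedJob's Jobs occupy a contiguous range.
    """
    ids = {}
    offset = 0  # total replicas of all preceding replicatedJobs
    for name, replicas in replicated_jobs:
        for local_index in range(replicas):
            ids[(name, local_index)] = offset + local_index
        offset += replicas
    return ids


# Hypothetical JobSet: two replicatedJobs with different templates.
ids = global_job_ids([("slice-a", 2), ("slice-b", 3)])
for (name, local_index), job_id in ids.items():
    print(f"{name}-{local_index}: job-id={job_id}")
```

With this scheme the IDs cover 0 to N-1 exactly once, so they could be used directly as TPU slice IDs. Note that they would shift if replicatedJobs were reordered or their replica counts changed.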