kubernetes-sigs/jobset

Support for defining a global coordinator pod in the JobSet spec

danielvegamyhre opened this issue · 1 comment

What would you like to be added:
Support for defining a global coordinator pod in the JobSet spec.

Why is this needed:
We need to be able to build automation on top of JobSet that knows the stable network endpoint of the pod assigned to be the global coordinator for distributed ML training and HPC workloads.
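As a rough illustration of what such automation might compute, here is a minimal sketch. It assumes pods are addressable as `<jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<subdomain>`, with the subdomain defaulting to the JobSet name. The `Coordinator` class, its field names, and `coordinator_endpoint` are hypothetical names used only for this example, not the API proposed in this issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Coordinator:
    """Hypothetical stand-in for a coordinator definition in the JobSet spec."""
    replicated_job: str  # name of the replicated job that hosts the coordinator
    job_index: int = 0   # index of the job within that replicated job
    pod_index: int = 0   # index of the pod within that job


def coordinator_endpoint(jobset_name: str,
                         coord: Coordinator,
                         subdomain: Optional[str] = None) -> str:
    """Build the coordinator pod's stable DNS name.

    Assumption: the hostname pattern is
    <jobset>-<replicatedJob>-<jobIndex>-<podIndex>, and the headless
    service subdomain defaults to the JobSet name.
    """
    subdomain = subdomain or jobset_name
    hostname = (
        f"{jobset_name}-{coord.replicated_job}-"
        f"{coord.job_index}-{coord.pod_index}"
    )
    return f"{hostname}.{subdomain}"


if __name__ == "__main__":
    print(coordinator_endpoint("train", Coordinator("driver")))
```

With a coordinator defined once in the spec, tooling could read a single well-known value instead of re-deriving this name for every workload.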

Adding coordinator field and controller changes: #618

Adding validation: #627

Adding runnable example: #628