Support for defining a global coordinator pod in the JobSet spec
danielvegamyhre opened this issue · 1 comment
danielvegamyhre commented
What would you like to be added:
Support for defining a global coordinator pod in the JobSet spec.
Why is this needed:
We need to be able to build automation on top of JobSet that knows the stable network endpoint of the pod designated as the global coordinator for distributed ML training / HPC workloads.
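To illustrate what "stable network endpoint" could mean here, below is a minimal sketch of how automation might derive a coordinator pod's hostname. It assumes pods are reachable through a headless service at `<jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<subdomain>`, with the subdomain defaulting to the JobSet name. The function name `coordinator_endpoint` and its parameters are hypothetical and not part of any JobSet API.

```python
from typing import Optional


def coordinator_endpoint(
    jobset_name: str,
    replicated_job: str,
    job_index: int = 0,
    pod_index: int = 0,
    subdomain: Optional[str] = None,
) -> str:
    """Build the DNS name of the pod chosen as global coordinator.

    Assumption: hostnames follow
    <jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<subdomain>,
    and the subdomain falls back to the JobSet name when unset.
    """
    if job_index < 0 or pod_index < 0:
        raise ValueError("job_index and pod_index must be non-negative")
    # Fall back to the JobSet name when no subdomain is given (assumed default).
    effective_subdomain = subdomain or jobset_name
    return f"{jobset_name}-{replicated_job}-{job_index}-{pod_index}.{effective_subdomain}"


if __name__ == "__main__":
    # Hypothetical JobSet "train" with a replicated job "workers".
    print(coordinator_endpoint("train", "workers"))
```

Today every consumer has to hard-code a convention like this one. A coordinator field in the JobSet spec would let tooling read the endpoint from a single place instead.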
danielvegamyhre commented