This is a minimal implementation (~7k lines of code) for running distributed torch training jobs in a k8s cluster. It introduces a CRD called torchjob, which is composed of multiple tasks (each task has a type, e.g. master or worker), and each task is a wrapper around a pod.
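Below is a rough sketch of what the torchjob CRD types could look like based on this description; the Go type and field names (`TaskSpec`, `Tasks`, and so on) are illustrative assumptions, not the repository's exact definitions.

```go
// Illustrative sketch of the torchjob CRD types; names are assumptions.
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// TaskType distinguishes the roles inside a TorchJob, e.g. master or worker.
type TaskType string

const (
	TaskTypeMaster TaskType = "Master"
	TaskTypeWorker TaskType = "Worker"
)

// TaskSpec describes one task group; every replica is backed by a single pod.
type TaskSpec struct {
	Replicas *int32                 `json:"replicas,omitempty"`
	Template corev1.PodTemplateSpec `json:"template"`
}

// TorchJobSpec composes a job out of typed tasks.
type TorchJobSpec struct {
	Tasks map[TaskType]TaskSpec `json:"tasks"`
}

// TorchJob is the top-level custom resource.
type TorchJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              TorchJobSpec `json:"spec"`
}
```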
Supported features:
- DAG scheduling and Gang scheduling are enabled with Volcano.
- Native job scheduling and coordination: a torchjob is enqueued into a queue, and on each round the coordinator selects a queue and then a torchjob from that queue to reconcile (see the coordinator sketch after this list).
- Elastic scaling of torchjob task replicas.
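As a rough illustration of the coordination flow above, the loop might look like the following sketch; the `Queue`, `Job`, and `Coordinator` names and interfaces are assumptions for illustration, not the project's actual API.

```go
// Hypothetical coordinator loop: pick a queue, pop a torchjob, reconcile it.
package coordinator

import (
	"context"
	"time"
)

// Job is the unit of reconciliation (a torchjob key, in practice).
type Job interface {
	Key() string
}

// Queue holds pending torchjobs awaiting reconciliation.
type Queue interface {
	Name() string
	Pop() (Job, bool)
}

// Coordinator repeatedly picks a queue, then a job from it, then reconciles.
type Coordinator struct {
	selectQueue func() Queue // e.g. weighted-round-robin over queues
	reconcile   func(ctx context.Context, job Job) error
}

// Run drives one reconciliation per iteration until the context is cancelled.
func (c *Coordinator) Run(ctx context.Context) {
	for ctx.Err() == nil {
		q := c.selectQueue()
		if q == nil {
			time.Sleep(100 * time.Millisecond) // no queue ready yet
			continue
		}
		job, ok := q.Pop()
		if !ok {
			time.Sleep(100 * time.Millisecond) // queue is empty
			continue
		}
		// Errors would normally be logged and the job requeued with backoff.
		_ = c.reconcile(ctx, job)
	}
}
```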
Fixed:
- Node selection for local storage violated the intended usage of host paths; it has been removed temporarily.
Something new compared with kubedl:
- A weighted-round-robin (WRR) algorithm implementation for queue selection in job coordination.
- The ability to set `MinMember` is exposed to users for Gang scheduling when DAG scheduling is enabled. Specifically, `MinMember` is added to `TorchJobSpec`, and the related structs and functions are revised accordingly (TODO: Add link); see the sketch at the end of this list.
- Global optimization:
  - Use an in-memory cache to improve performance (avoid regenerating labels & annotations repeatedly, etc.).
  - Create service & pod label selectors outside loops to avoid repeated creation.
- Fix some hidden bugs (see the sketches at the end of this list):
  - In `pkg/common/failover.go`, `restartPod()` should return `false` rather than `err == nil` when the CRR resource status is `kruisev1alpha1.ContainerRecreateRequestFailed`. Otherwise, there is a risk that the pod awaiting an in-place restart is marked as succeeded when it is not.
  - When Gang scheduling is enabled, a torchjob enters the `Running` status once `MinMember` pods are running, rather than waiting for all pods to be running.
- For queue selection, implement a smooth weighted-round-robin algorithm (see the sketch below).
- Refine the TorchElastic implementation to follow the coordinator style.
- Add tests and examples.
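The sketches referenced in the list above follow. First, a hedged sketch of how the exported `MinMember` field could look and how a default might be resolved for the gang-scheduling pod group; everything except the `MinMember` name itself is an assumption for illustration.

```go
package v1alpha1

// GangSchedulingFields is an illustrative fragment of TorchJobSpec showing
// the exported MinMember field described above.
type GangSchedulingFields struct {
	// MinMember is the minimum number of pods that must be schedulable before
	// the gang is admitted. When unset, it defaults to the sum of all task replicas.
	MinMember *int32 `json:"minMember,omitempty"`
}

// EffectiveMinMember resolves the value handed to the gang scheduler's pod group.
func EffectiveMinMember(f GangSchedulingFields, totalReplicas int32) int32 {
	if f.MinMember != nil && *f.MinMember > 0 && *f.MinMember <= totalReplicas {
		return *f.MinMember
	}
	return totalReplicas
}
```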
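In the same spirit, the `Running`-status change noted in the bug-fix list might reduce to a check like the following; the helper name and signature are hypothetical.

```go
package v1alpha1

// shouldMarkRunning is a hypothetical helper: with Gang scheduling enabled the
// job is considered Running once MinMember pods are running, instead of
// waiting for every pod to reach the running state.
func shouldMarkRunning(runningPods, totalPods, minMember int32, gangEnabled bool) bool {
	if gangEnabled && minMember > 0 && minMember <= totalPods {
		return runningPods >= minMember
	}
	return runningPods >= totalPods
}
```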
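The `restartPod()` fix in `pkg/common/failover.go` can be summarized with the simplified sketch below; the phase handling mirrors the description above, but the types are abbreviated stand-ins for the OpenKruise ContainerRecreateRequest API rather than the actual code.

```go
package common

// crrPhase stands in for the OpenKruise ContainerRecreateRequest phase type.
type crrPhase string

const (
	crrCompleted crrPhase = "Completed"
	crrFailed    crrPhase = "Failed"
)

// restartPod reports whether the in-place restart succeeded. The important
// part is the Failed branch: returning err == nil there (i.e. true) would let
// a pod whose recreate request failed be treated as successfully restarted.
func restartPod(phase crrPhase, err error) bool {
	if err != nil {
		return false
	}
	switch phase {
	case crrFailed:
		return false
	case crrCompleted:
		return true
	default:
		// Still in progress; the caller should requeue and check again.
		return false
	}
}
```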
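Finally, the smooth weighted-round-robin queue selection mentioned above follows the standard smooth-WRR scheme (as popularized by nginx); this is a self-contained sketch with illustrative names, not the project's actual implementation.

```go
package coordinator

// wrrEntry tracks one queue's static weight and its smoothed current value.
type wrrEntry struct {
	name    string
	weight  int
	current int
}

// SmoothWRR picks higher-weight queues more often while keeping the
// selections evenly interleaved.
type SmoothWRR struct {
	entries []*wrrEntry
	total   int
}

func NewSmoothWRR(weights map[string]int) *SmoothWRR {
	s := &SmoothWRR{}
	for name, w := range weights {
		s.entries = append(s.entries, &wrrEntry{name: name, weight: w})
		s.total += w
	}
	return s
}

// Next adds each entry's weight to its current value, picks the entry with
// the largest current value, then subtracts the total weight from the winner.
func (s *SmoothWRR) Next() string {
	var best *wrrEntry
	for _, e := range s.entries {
		e.current += e.weight
		if best == nil || e.current > best.current {
			best = e
		}
	}
	if best == nil {
		return ""
	}
	best.current -= s.total
	return best.name
}
```

For example, three queues A, B, C with weights 5, 1, 1 are selected in the pattern A A B A C A A, so A still gets five of every seven picks but is not chosen five times in a row.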