This is a minimal implementation (~7k lines of code) for running distributed torch training jobs in a k8s cluster. It introduces a CRD called torchjob, which is composed of multiple tasks (each task has a type, e.g. master or worker), and each task is a wrapper around a pod.
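Below is a rough sketch of what the torchjob CRD types could look like based on this description; the Go type and field names (`TaskSpec`, `Tasks`, and so on) are illustrative assumptions, not the repository's exact definitions.

```go
// Illustrative sketch of the torchjob CRD types; names are assumptions.
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// TaskType distinguishes the roles inside a TorchJob, e.g. master or worker.
type TaskType string

const (
	TaskTypeMaster TaskType = "Master"
	TaskTypeWorker TaskType = "Worker"
)

// TaskSpec describes one task group; every replica is backed by a single pod.
type TaskSpec struct {
	Replicas *int32                 `json:"replicas,omitempty"`
	Template corev1.PodTemplateSpec `json:"template"`
}

// TorchJobSpec composes a job out of typed tasks.
type TorchJobSpec struct {
	Tasks map[TaskType]TaskSpec `json:"tasks"`
}

// TorchJob is the top-level custom resource.
type TorchJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              TorchJobSpec `json:"spec"`
}
```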
Supported features:
- DAG scheduling and Gang scheduling are enabled with Volcano.
- Native job scheduling and coordination: a torchjob is enqueued into a queue, and on each round the coordinator selects a queue and then a torchjob from that queue to reconcile (see the coordinator sketch after this list).
- Elastic scaling of torchjob task replicas.
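As a rough illustration of the coordination flow above, the loop might look like the following sketch; the `Queue`, `Job`, and `Coordinator` names and interfaces are assumptions for illustration, not the project's actual API.

```go
// Hypothetical coordinator loop: pick a queue, pop a torchjob, reconcile it.
package coordinator

import (
	"context"
	"time"
)

// Job is the unit of reconciliation (a torchjob key, in practice).
type Job interface {
	Key() string
}

// Queue holds pending torchjobs awaiting reconciliation.
type Queue interface {
	Name() string
	Pop() (Job, bool)
}

// Coordinator repeatedly picks a queue, then a job from it, then reconciles.
type Coordinator struct {
	selectQueue func() Queue // e.g. weighted-round-robin over queues
	reconcile   func(ctx context.Context, job Job) error
}

// Run drives one reconciliation per iteration until the context is cancelled.
func (c *Coordinator) Run(ctx context.Context) {
	for ctx.Err() == nil {
		q := c.selectQueue()
		if q == nil {
			time.Sleep(100 * time.Millisecond) // no queue ready yet
			continue
		}
		job, ok := q.Pop()
		if !ok {
			time.Sleep(100 * time.Millisecond) // queue is empty
			continue
		}
		// Errors would normally be logged and the job requeued with backoff.
		_ = c.reconcile(ctx, job)
	}
}
```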
Fixed:
- Node selection for local storage violated the intended usage of host paths; it has been removed temporarily.
Something new compared with kubedl:
- A weighted-round-robin (WRR) algorithm implementation for queue selection in job coordination.
- The ability to set `MinMember` is exposed to users for Gang scheduling when DAG scheduling is enabled. Specifically, `MinMember` is added to `TorchJobSpec`, and the related structs and functions are revised accordingly (TODO: Add link); see the sketch at the end of this list.
- Global optimization:
  - Use an in-memory cache to improve performance (avoid regenerating labels & annotations repeatedly, etc.).
  - Create service & pod label selectors outside loops to avoid repeated creation.
- Fix some hidden bugs (see the sketches at the end of this list):
  - In `pkg/common/failover.go`, `restartPod()` should return `false` rather than `err == nil` when the CRR resource status is `kruisev1alpha1.ContainerRecreateRequestFailed`. Otherwise, there is a risk that the pod awaiting an in-place restart is marked as succeeded when it is not.
  - When Gang scheduling is enabled, a torchjob enters the `Running` status once `MinMember` pods are running, rather than waiting for all pods to be running.
- For queue selection, implement a smooth weighted-round-robin algorithm (see the sketch below).
- Refine the TorchElastic implementation to follow the coordinator style.
- Add tests and examples.
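The sketches referenced in the list above follow. First, a hedged sketch of how the exported `MinMember` field could look and how a default might be resolved for the gang-scheduling pod group; everything except the `MinMember` name itself is an assumption for illustration.

```go
package v1alpha1

// GangSchedulingFields is an illustrative fragment of TorchJobSpec showing
// the exported MinMember field described above.
type GangSchedulingFields struct {
	// MinMember is the minimum number of pods that must be schedulable before
	// the gang is admitted. When unset, it defaults to the sum of all task replicas.
	MinMember *int32 `json:"minMember,omitempty"`
}

// EffectiveMinMember resolves the value handed to the gang scheduler's pod group.
func EffectiveMinMember(f GangSchedulingFields, totalReplicas int32) int32 {
	if f.MinMember != nil && *f.MinMember > 0 && *f.MinMember <= totalReplicas {
		return *f.MinMember
	}
	return totalReplicas
}
```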
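In the same spirit, the `Running`-status change noted in the bug-fix list might reduce to a check like the following; the helper name and signature are hypothetical.

```go
package v1alpha1

// shouldMarkRunning is a hypothetical helper: with Gang scheduling enabled the
// job is considered Running once MinMember pods are running, instead of
// waiting for every pod to reach the running state.
func shouldMarkRunning(runningPods, totalPods, minMember int32, gangEnabled bool) bool {
	if gangEnabled && minMember > 0 && minMember <= totalPods {
		return runningPods >= minMember
	}
	return runningPods >= totalPods
}
```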
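The `restartPod()` fix in `pkg/common/failover.go` can be summarized with the simplified sketch below; the phase handling mirrors the description above, but the types are abbreviated stand-ins for the OpenKruise ContainerRecreateRequest API rather than the actual code.

```go
package common

// crrPhase stands in for the OpenKruise ContainerRecreateRequest phase type.
type crrPhase string

const (
	crrCompleted crrPhase = "Completed"
	crrFailed    crrPhase = "Failed"
)

// restartPod reports whether the in-place restart succeeded. The important
// part is the Failed branch: returning err == nil there (i.e. true) would let
// a pod whose recreate request failed be treated as successfully restarted.
func restartPod(phase crrPhase, err error) bool {
	if err != nil {
		return false
	}
	switch phase {
	case crrFailed:
		return false
	case crrCompleted:
		return true
	default:
		// Still in progress; the caller should requeue and check again.
		return false
	}
}
```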
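Finally, the smooth weighted-round-robin queue selection mentioned above follows the standard smooth-WRR scheme (as popularized by nginx); this is a self-contained sketch with illustrative names, not the project's actual implementation.

```go
package coordinator

// wrrEntry tracks one queue's static weight and its smoothed current value.
type wrrEntry struct {
	name    string
	weight  int
	current int
}

// SmoothWRR picks higher-weight queues more often while keeping the
// selections evenly interleaved.
type SmoothWRR struct {
	entries []*wrrEntry
	total   int
}

func NewSmoothWRR(weights map[string]int) *SmoothWRR {
	s := &SmoothWRR{}
	for name, w := range weights {
		s.entries = append(s.entries, &wrrEntry{name: name, weight: w})
		s.total += w
	}
	return s
}

// Next adds each entry's weight to its current value, picks the entry with
// the largest current value, then subtracts the total weight from the winner.
func (s *SmoothWRR) Next() string {
	var best *wrrEntry
	for _, e := range s.entries {
		e.current += e.weight
		if best == nil || e.current > best.current {
			best = e
		}
	}
	if best == nil {
		return ""
	}
	best.current -= s.total
	return best.name
}
```

For example, three queues A, B, C with weights 5, 1, 1 are selected in the pattern A A B A C A A, so A still gets five of every seven picks but is not chosen five times in a row.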