kubeflow/mpi-operator
Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
Go, Apache-2.0
Issues
Add support for the managedBy field
#646 opened by mimowo - 3
What scale can mpi-operator support?
#648 opened by yxzhao6 - 0
Question: Is AllReduce network traffic (e.g., ML gradients) encrypted between workers?
#645 opened by jsyqrt - 6
ttlSecondsAfterFinished for MPIJob, not only launcher
#644 opened by hy00nc - 1
"cleanPodPolicy: All" does not clean up launcher pod
#643 opened by hy00nc - 4
Connection reset
#642 opened by bbenshab - 2
NCCL tests example
#639 opened by samos123 - 11
Release 0.5.0
#563 opened by tenzen-y - 8
Running in a subset of namespaces
#620 opened by emsixteeen - 0
Wrong host info in discover_hosts.sh
#621 opened by kuizhiqing - 0
"make generate" command run failed
#614 opened by wang-mask - 0
The operator still creates the launcher when launcherCreationPolicy is "WaitForWorkersReady" and suspend is "true"
#615 opened by wang-mask - 8
MPI-Operator run example failed
#598 opened by q443048756 - 4
Replace the plain pod workers with Indexed Job
#613 opened by tenzen-y - 28
Work with DeepSpeed for large scale training
#611 opened by kuizhiqing - 8
the object has been modified; please apply your changes to the latest version and try again
#607 opened by gl-001 - 4
When WaitForWorkersReady is enabled in MPI operator, MPI operator and gang scheduler are in a deadlock
#608 opened by yzhao-2023 - 9
Can't get MPIJob status when pod template is invalid
#604 opened by congpeiqing - 4
which is the latest mpi job definition between mpi-operator and training operator
#605 opened by sxwl-donggang - 11
OpenMPI 4.1.5
#588 opened by bdevcich - 2
Update stale examples
#596 opened by jarulsamy - 2
Port conflicts will occur when multiple pods are dispatched to the same node under hostnetwork.
#593 opened by Saturnoul - 3
[feature] Upgrade volcano to v1.8.0
#586 opened by lowang-bh - 2
pod priority was assigned to 0 though the priorityclassname of the podgroup had been assigned
#592 opened by Robin7831 - 2
Create CI pipeline for the example images
#541 opened by tenzen-y - 3
MPIJobs with Kubernetes Python SDK
#582 opened by AymenFJA - 2
Connection dropped after 24 hours
#581 opened by sheevy - 9
python setup.py doesn't appear to install?
#578 opened by vsoch - 3
Copy APIs from common repo into here
#564 opened by tenzen-y - 3
Multiple MPI jobs via multiple launchers?
#574 opened by AymenFJA - 4
e2e test sometimes fails
#570 opened by lowang-bh - 2
strange backup in hack/python-sdk/gen-sdk.sh
#572 opened by lowang-bh - 3
How can I deploy distributed training on Kubernetes clusters with torch.distributed.launch
#560 opened by ThomaswellY - 9
questions about applying for nodes and gpus
#558 opened by ThomaswellY - 8
Pod scheduling conundrum
#553 opened by sheevy - 2
Add tolerations only to specific worker pods
#539 opened by anxietymonger - 4
Support exitCode restartPolicy
#537 opened by Syulin7 - 3
Set the knowledge about Launcher and Worker to CRD
#519 opened by tenzen-y