Issues
unable to build image for ppc64le
#365 opened by gajanankulkarni-18 - 2
Multi-gpu in a single pod
#362 opened by wallarug - 1
run https://github.com/kubeflow/pytorch-operator/blob/master/sdk/python/test/test_e2e.py failed
#363 opened by sxl1993 - 3
How to use DDP in pytorch operator?
#350 opened by SeibertronSS - 1
Is python sdk still being maintained?
#317 opened by ca-scribner - 20
Can not use volcano for Gang Scheduling
#358 opened by bug-developer021 - 9
Mnist dataset server is down
#325 opened by Jeffwan - 1
why worker need initContainer in pytorch-operator?
#349 opened by zqz-net - 3
[feat] Support PyTorch 1.9
#346 opened by gaocegege - 1
Upgrade to v1 CRDs
#347 opened by mcristina422 - 4
PytorchJob replicas has different node affinity behaviors compared with Deployment
#344 opened by Shuai-Xie - 0
Feel confused about world_size
#340 opened by ldd91 - 4
`init-pytorch` init container image configurable
#339 opened by apatil4 - 0
PyTorch Lightning Example.
#334 opened by tchaton - 4
Unable to spawn PyTorchJob due to image alpine dependency of pytorch-operator
#319 opened by asahalyft - 6
Worker template should be configurable.
#335 opened by MartinForReal - 1
NCCL "Connection Refused" for Worker Pods
#332 opened by twolffpiggott - 1
worker get connection timed out error in user namespace with sidecar.istio.io/inject=false
#329 opened by tingweiwu - 2
is there a simpler way to install pytorch-operator
#328 opened by tingweiwu - 3
Please create v1.2-branch
#314 opened by SatwikBhandiwad - 1
pytorch-operator: Consolidate manifests
#322 opened by yanniszark - 0
pytorch-operator pod CheckCRDExist failed
#294 opened by myonlyzzy - 9
dist.init_process_group stuck
#313 opened by ravenj73 - 2
Does pytorch-operator just simplify the use of nn.parallel.DistributedDataParallel on multiple multi-GPU nodes?
#311 opened by lwj1980s - 5
can I use gpus on specific node to train
#310 opened by lwj1980s - 4
Make manifest test friendly
#302 opened by Jeffwan - 1
Do not trigger presubmit jobs for simple changes
#307 opened by Jeffwan - 2
Support Torch Elastic in pytorch operator
#296 opened by Jeffwan - 2
Activate Travis in PR check
#299 opened by andreyvelich - 4
[bug] Unit test is broken
#292 opened by gaocegege - 7
how to create a local non-distributed training
#287 opened by houz42