Issues
GCP preemptible instances
#237 opened by Nintorac - 7
Distributed mnist is unexpectedly slow
#271 opened by panchul - 7
Integration into kubeflow pipeline
#190 opened by miguelvr - 2
PyTorchJob CRD definition link is broken
#284 opened by sakaia - 4
Link to CRD definition is broken
#254 opened by sakaia - 5
Kubernetes 1.6 support
#259 opened by posix4e - 3
OCI Runtime error for init-pytorch on AKS
#275 opened by wangdian - 7
Do we need pod name and namespace in manifests?
#283 opened by gaocegege - 8
How to run single-machine job?
#278 opened by jiaqianjing - 4
Cut release for pytorch operator
#273 opened by Jeffwan - 6
[feature] Rethink distributed Pytorch backoff retry
#270 opened by czheng94 - 8
Example PytorchJob is not starting
#264 opened by natalytvinova - 15
Failed to deploy pytorch operator
#206 opened by xiaqunfeng - 1
[examples/smoke_dist] pytorch_job_sendrecv.yaml does not exist in the directory
#268 opened by sakaia - 2
kubeflow common dependency path to be updated
#263 opened by igorvalko - 2
cleanPodPolicy Set to Running should clean Running pod
#260 opened by xrmzju - 3
Implement "earlier" resource validation
#183 opened by johanfleury - 3
resolve test image conflict
#185 opened by kunmingg - 2
MPI distributed training job failed on master node with message "MPI process group does not support multi-GPU collectives" but succeed on worker node
#203 opened by YYStreet - 2
PyTorchJob 1.0
#214 opened by johnugeorge - 1
Add licenses for dependencies in PyTorch Operator Image
#249 opened by jlewi - 1
Pytorch Docker image pytorch/pytorch:1.2-cuda10.0-cudnn7-runtime does not have cuda so unable to use GPU
#245 opened by MATRIX4284 - 0
Unstructured converted to Pytorch Job Anonymous field error when json uses inline mode
#234 opened by leileiwan - 7
Test case "TestDeletePodsAndServices" error
#228 opened by leileiwan - 0
Failed to set kubeflow in CI test.
#229 opened by jinchihe - 2
allocating master and work on different GPU nodes
#224 opened by mengdong - 3
Add documentation on RBAC authorizations
#182 opened by johanfleury - 4
NCCL backend did not start distributed training
#202 opened by YYStreet - 6
Add controller-name label for Pods and services
#211 opened by johnugeorge - 4
Use multi-stage build for pytorch operator Dockerfile
#195 opened by hmtai - 2
Can I use deployment.yaml in manifests directly
#205 opened by wynn5a - 16
gang schedule bug
#186 opened by zlcnju - 3
Prometheus Operator for Pytorch
#174 opened by krishnadurai - 2
Pytorch operator 1.0 release
#160 opened by johnugeorge - 3
Question: MNIST example
#163 opened by zzvara - 0
Pytorch API v1 implementation
#161 opened by johnugeorge - 3
how could i delete this pod
#154 opened by younkun - 3