Unable to spawn PyTorchJob due to pytorch-operator's alpine image dependency
asahalyft opened this issue · 4 comments
Hey Team,
I am trying to use the PyTorch Operator to spawn distributed PyTorch jobs. I see the image mentioned is 809251082950.dkr.ecr.us-west-2.amazonaws.com/pytorch-operator. However, that repo is not accessible from inside our network, so I switched to gcr.io/kubeflow-images-public/pytorch-operator:latest instead.
I cloned this pytorch-operator repo and deployed the operator with kustomize build manifests/ | kubectl apply -f -, which generates the YAML below. I also customized the namespace; one way to do that is sketched next, followed by the generated YAML.
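For reference, a minimal sketch of how the namespace customization could be done with kustomize before building. The manifests/ path comes from the command above, and kustomize edit set namespace is a standard kustomize subcommand; the exact layout of the cloned repo may differ:

cd manifests/
# Rewrite the namespace on every resource in this kustomization,
# then build and apply the result.
kustomize edit set namespace pytorch-operator
kustomize build . | kubectl apply -f -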
apiVersion: v1
kind: Namespace
metadata:
  labels:
    kustomize.component: pytorch-operator
  name: pytorch-operator
---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  labels:
    kustomize.component: pytorch-operator
  name: pytorchjobs.kubeflow.org
spec:
  additionalPrinterColumns:
  - JSONPath: .status.conditions[-1:].type
    name: State
    type: string
  - JSONPath: .metadata.creationTimestamp
    name: Age
    type: date
  group: kubeflow.org
  names:
    kind: PyTorchJob
    plural: pytorchjobs
    singular: pytorchjob
  scope: Namespaced
  subresources:
    status: {}
  validation:
    openAPIV3Schema:
      properties:
        spec:
          properties:
            pytorchReplicaSpecs:
              properties:
                Master:
                  properties:
                    replicas:
                      maximum: 1
                      minimum: 1
                      type: integer
                Worker:
                  properties:
                    replicas:
                      minimum: 1
                      type: integer
  versions:
  - name: v1
    served: true
    storage: true
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: pytorch-operator
    kustomize.component: pytorch-operator
  name: pytorch-operator
  namespace: pytorch-operator
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  labels:
    app: pytorch-operator
    kustomize.component: pytorch-operator
  name: pytorch-operator
rules:
- apiGroups:
  - kubeflow.org
  resources:
  - pytorchjobs
  - pytorchjobs/status
  - pytorchjobs/finalizers
  verbs:
  - '*'
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - pods
  - services
  - endpoints
  - events
  verbs:
  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  labels:
    app: pytorch-operator
    kustomize.component: pytorch-operator
  name: pytorch-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pytorch-operator
subjects:
- kind: ServiceAccount
  name: pytorch-operator
  namespace: pytorch-operator
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8443"
    prometheus.io/scrape: "true"
  labels:
    app: pytorch-operator
    kustomize.component: pytorch-operator
  name: pytorch-operator
  namespace: pytorch-operator
spec:
  ports:
  - name: monitoring-port
    port: 8443
    targetPort: 8443
  selector:
    kustomize.component: pytorch-operator
    name: pytorch-operator
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    kustomize.component: pytorch-operator
  name: pytorch-operator
  namespace: pytorch-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      kustomize.component: pytorch-operator
      name: pytorch-operator
  template:
    metadata:
      labels:
        kustomize.component: pytorch-operator
        name: pytorch-operator
    spec:
      containers:
      - command:
        - /pytorch-operator.v1
        - --alsologtostderr
        - -v=1
        - --monitoring-port=8443
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        image: gcr.io/kubeflow-images-public/pytorch-operator:latest
        name: pytorch-operator
      serviceAccountName: pytorch-operator
I applied the above YAML and verified that the operator is running successfully:
$ kubectl get pods -n pytorch-operator
NAME READY STATUS RESTARTS AGE
pytorch-operator-6746dbbc89-sv2qw 1/1 Running 0 100m
$
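As an extra sanity check (not part of my original steps), the CRD registration and the deployment status can be confirmed with standard kubectl queries:

$ kubectl get crd pytorchjobs.kubeflow.org
$ kubectl -n pytorch-operator get deployment pytorch-operator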
I then applied the following YAML to create a distributed PyTorchJob.
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
name: "pytorch-dist-mnist-nccl"
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: OnFailure
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
labels:
lyft.com/ml-platform: ""
spec:
containers:
- name: pytorch
image: "OUR_AWS_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/lyftlearnhorovod:8678853078c35bf1d003761a070389ca535a5d03"
command:
- python
args:
- "/mnt/user-home/distributed-training-exploration/pytorchjob_distributed_mnist.py"
- "--backend"
- "nccl"
- "--epochs"
- "2"
env:
- name: NCCL_DEBUG
value: "INFO"
- name: NCCL_SOCKET_IFNAME
value: "eth0"
resources:
limits:
nvidia.com/gpu: 1
volumeMounts:
- mountPath: /mnt/user-home
name: nfs
volumes:
- name: nfs
persistentVolumeClaim:
claimName: asaha
tolerations:
- key: lyft.net/gpu
operator: Equal
value: dedicated
effect: NoSchedule
Worker:
replicas: 1
restartPolicy: OnFailure
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
labels:
lyft.com/ml-platform: ""
spec:
containers:
- name: pytorch
image: "OUR_AWS_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/lyftlearnhorovod:8678853078c35bf1d003761a070389ca535a5d03"
command:
- python
args:
- "/mnt/user-home/distributed-training-exploration/pytorchjob_distributed_mnist.py"
- "--backend"
- "nccl"
- "--epochs"
- "2"
env:
- name: NCCL_DEBUG
value: "INFO"
- name: NCCL_SOCKET_IFNAME
value: "eth0"
resources:
limits:
nvidia.com/gpu: 1
volumeMounts:
- mountPath: /mnt/user-home
name: nfs
volumes:
- name: nfs
persistentVolumeClaim:
claimName: asaha
tolerations:
- key: lyft.net/gpu
operator: Equal
value: dedicated
effect: NoSchedule
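(For anyone reproducing this: the failure below shows up in the pod events. Standard kubectl commands along these lines surface it; the asaha namespace is taken from the event output further down:)

$ kubectl -n asaha describe pod pytorch-dist-mnist-nccl-worker-0
$ kubectl -n asaha get events --sort-by=.lastTimestamp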
I see the worker pods failing with ImagePullBackOff errors:
Failed to pull image "alpine:3.10": rpc error: code = Unknown desc = Error reading manifest 3.10 in OUR_AWS_ACCOUNT.dkr.ecr.us-west-2.amazonaws.com/alpine: name unknown: The repository with name 'alpine' does not exist in the registry with id 'OUR_AWS_ACCOUNT'
15m Normal BackOff pod/pytorch-dist-mnist-nccl-worker-0 Back-off pulling image "alpine:3.10"
18m Warning Failed pod/pytorch-dist-mnist-nccl-worker-0 Error: ImagePullBackOff
10s Normal Scheduled pod/pytorch-dist-mnist-nccl-worker-0 Successfully assigned asaha/pytorch-dist-mnist-nccl-worker-0 to ip-10-44-108-79.ec2.internal
9s Normal Pulling pod/pytorch-dist-mnist-nccl-worker-0 Pulling image "alpine:3.10"
8s Warning Failed pod/pytorch-dist-mnist-nccl-worker-0 Failed to pull image "alpine:3.10": rpc error: code = Unknown desc = Error reading manifest 3.10 in <OUR_AWS_ACCOUNT>.dkr.ecr.us-west-2.amazonaws.com/alpine: name unknown: The repository with name 'alpine' does not exist in the registry with id '<OUR_AWS_ACCOUNT>'
8s Warning Failed pod/pytorch-dist-mnist-nccl-worker-0 Error: ErrImagePull
7s Normal BackOff pod/pytorch-dist-mnist-nccl-worker-0 Back-off pulling image "alpine:3.10"
20m Normal SuccessfulCreatePod pytorchjob/pytorch-dist-mnist-nccl Created pod: pytorch-dist-mnist-nccl-master-0
Since the Docker images are fully materialized, why would the worker fail trying to pull alpine:3.10?
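In case it helps others hitting this: as far as I can tell from the operator source, the alpine:3.10 image is not referenced by the job spec at all. The pytorch-operator injects an init container into each worker replica that waits for the master pod's hostname to become resolvable, and that init container defaults to alpine:3.10. On a cluster whose container runtime resolves unqualified image names against a private ECR mirror instead of Docker Hub, that pull fails exactly as above. A hypothetical workaround, assuming the alpine repository name and us-west-2 region from the error message, is to mirror the image into the private registry:

# Assumes docker is already authenticated against the private ECR registry.
aws ecr create-repository --repository-name alpine --region us-west-2
docker pull alpine:3.10
docker tag alpine:3.10 OUR_AWS_ACCOUNT.dkr.ecr.us-west-2.amazonaws.com/alpine:3.10
docker push OUR_AWS_ACCOUNT.dkr.ecr.us-west-2.amazonaws.com/alpine:3.10

Newer operator versions also appear to expose a flag to override this init-container image; check the operator binary's --help output rather than relying on the default.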
Hi @gaocegege, is there a plan to look at or comment on this issue?
Yeah, we are trying to use Amazon's new public Docker registry; ref kubeflow/training-operator#1205. 809251082950.dkr.ecr.us-west-2.amazonaws.com/pytorch-operator is used for internal testing. Once we move to the public registry, we will make the change. It has already been changed to use GCR on master.
Hi @Jeffwan, I am still getting the alpine image-not-found error when we apply a PyTorchJob YAML, even with the Kubeflow 1.3.0 manifests.
Failed to pull image "alpine:3.10": rpc error: code = Unknown desc = Error reading manifest 3.10 in OUR_AWS_ACCOUNT.dkr.ecr.us-west-2.amazonaws.com/alpine: name unknown: The repository with name 'alpine' does not exist in the registry with id 'OUR_AWS_ACCOUNT'
I applied the PyTorchJob YAML below. I also used the Kubeflow 1.3.0 manifests and kustomize to generate the pytorch-operator CRDs and operator YAMLs, and applied them. The pytorch-operator logs show that the operator is running fine.
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
name: "pytorch-dist-mnist-nccl"
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: OnFailure
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- name: pytorch
image: gcr.io/kubeflow-images-public/pytorch-dist-mnist-test:latest
args: ["--backend", "nccl"]
env:
- name: NCCL_DEBUG
value: "INFO"
- name: NCCL_SOCKET_IFNAME
value: "eth0"
resources:
limits:
nvidia.com/gpu: 1
volumeMounts:
- mountPath: /mnt/user-home
name: nfs
volumes:
- name: nfs
persistentVolumeClaim:
claimName: asaha
tolerations:
- key: lyft.net/gpu
operator: Equal
value: dedicated
effect: NoSchedule
Worker:
replicas: 1
restartPolicy: OnFailure
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- name: pytorch
image: gcr.io/kubeflow-images-public/pytorch-dist-mnist-test:latest
args: ["--backend", "nccl"]
env:
- name: NCCL_DEBUG
value: "INFO"
- name: NCCL_SOCKET_IFNAME
value: "eth0"
resources:
limits:
nvidia.com/gpu: 1
volumeMounts:
- mountPath: /mnt/user-home
name: nfs
volumes:
- name: nfs
persistentVolumeClaim:
claimName: asaha
tolerations:
- key: lyft.net/gpu
operator: Equal
value: dedicated
effect: NoSchedule
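Follow-up note: to confirm where alpine:3.10 enters the picture, the worker pod the operator created can be inspected directly. This is a standard kubectl query, with the pod name and namespace taken from the events above; it prints the images of any injected init containers:

$ kubectl -n asaha get pod pytorch-dist-mnist-nccl-worker-0 \
    -o jsonpath='{.spec.initContainers[*].image}'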