kubeflow/pytorch-operator

Can PyTorchJob skip or cancel the init container?

SeibertronSS opened this issue · 2 comments

Hello,
Dear developers, I ran into a question while using PyTorchJob: can PyTorchJob skip or cancel the init container?

You might see a couple of restarts in the worker pods until the master pod is up. I don't see any other problem, though I haven't tested it.

Hi @johnugeorge, I ran into the same problem as @SeibertronSS.

I want to speed up PyTorchJob training so that it matches the training speed on bare metal.

What I do:

  • turn on hostNetwork for each pod
  • assign 4 GPUs to each pod and launch 4 processes inside it (GPU machines 48 and 49 each have 4 GPUs)

Ideally, training starts as if running the commands below on 48 and 49:

# 48
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="10.252.192.48" --master_port=22222 mnist_ddp_launch.py
# 49
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="10.252.192.48" --master_port=22222 mnist_ddp_launch.py

However, in my experiments the PyTorchJob master reaches Running quickly while the worker stays stuck in Init:0/1.

$ kubectl get pods -o wide
NAME                 READY   STATUS     RESTARTS   AGE     IP              NODE
mnist-ddp-master-0   1/1     Running    0          2m48s   10.252.192.48   gpu-10-252-192-48
mnist-ddp-worker-0   0/1     Init:0/1   0          2m48s   10.252.192.49   gpu-10-252-192-49

$ kubectl describe pod mnist-ddp-worker-0
...
Status:       Pending
IP:           10.252.192.49
IPs:
  IP:           10.252.192.49
Controlled By:  PyTorchJob/mnist-ddp
Init Containers:
  init-pytorch:
    ...
    Command:
      sh
      -c
      until nslookup mnist-ddp-master-0; do echo waiting for master; sleep 2; done; 
    State:          Running		# always sleeping; never gets past the init-pytorch container
      Started:      Mon, 13 Sep 2021 16:22:12 +0800
    Ready:          False
    Restart Count:  0
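For reference, the injected init container just loops on DNS resolution of the master name until it succeeds. A minimal Python equivalent of that loop (the function name and timeout parameter are my own, for illustration) makes the behaviour easy to reproduce outside the cluster:

```python
import socket
import time


def wait_for_host(hostname: str, timeout: float, interval: float = 2.0) -> bool:
    """Return True once `hostname` resolves, False if `timeout` elapses first.

    Mirrors the operator's init container, which runs
    `until nslookup <master>; do echo waiting for master; sleep 2; done`
    inside each worker pod.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            socket.gethostbyname(hostname)  # succeeds once DNS has the record
            return True
        except socket.gaierror:
            print(f"waiting for {hostname}")
            time.sleep(interval)
    return False


# "localhost" always resolves, so this returns True immediately.
print(wait_for_host("localhost", timeout=5.0))
```

With hostNetwork enabled, the worker pod falls back to the node's resolver by default, and the node's resolver has no record for the `mnist-ddp-master-0` Service, so this loop (and hence the init container) spins forever.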

I just want to know whether there is a way to run PyTorchJob with bare-metal-like performance.

Thanks very much.


Here is the YAML file I use to start the PyTorchJob.

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "mnist-ddp"
  namespace: "default"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: shuaix/pytorch-dist-mnist:1.0
              imagePullPolicy: IfNotPresent
              command:
                [
                  "python",
                  "-m",
                  "torch.distributed.launch",          	# launch 4 processes in a pod
                  "--nproc_per_node=4",       
                  "--nnodes=2",
                  "--node_rank=0",                   	# node rank 0
                  "--master_addr=10.252.192.48",      	# master IP -> host network IP
                  "mnist_ddp.py",
                ]
              resources:
                limits:
                  nvidia.com/gpu: 4						# assign 4 gpus for each pod
          hostIPC: true
          hostNetwork: true                          	# turn on host Network
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: kubernetes.io/hostname
                        operator: In
                        values:
                          - gpu-10-252-192-48   		# assign pod to 48
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: shuaix/pytorch-dist-mnist:1.0
              imagePullPolicy: IfNotPresent
              command:
                [
                  "python",
                  "-m",
                  "torch.distributed.launch",
                  "--nproc_per_node=4",
                  "--nnodes=2",
                  "--node_rank=1",               		# node rank 1
                  "--master_addr=10.252.192.48",		# master IP -> host network IP
                  "mnist_ddp.py",
                ]
              resources:
                limits:
                  nvidia.com/gpu: 4						# assign 4 gpus for each pod
          hostIPC: true
          hostNetwork: true                          	# turn on host Network
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: kubernetes.io/hostname
                        operator: In
                        values:
                          - gpu-10-252-192-49			# assign pod to 49
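If the goal is to keep hostNetwork while letting the init container pass, one likely cause to check is the pod DNS policy: the Kubernetes documentation states that pods running with hostNetwork fall back to the node's resolver unless `dnsPolicy: ClusterFirstWithHostNet` is set explicitly, and the node's resolver cannot see the `mnist-ddp-master-0` Service record that the operator creates. A sketch of the change for the worker pod template spec (the same line would also go in the Master template):

```yaml
# Pod template spec fragment (sketch): keep hostNetwork,
# but resolve in-cluster Service names via the cluster DNS.
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet   # needed for hostNetwork pods to use cluster DNS
```

This would let the worker's `nslookup mnist-ddp-master-0` succeed without skipping the init container; whether it restores bare-metal throughput is a separate question.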