kubeflow/pytorch-operator

Worker gets "connection timed out" error in user namespace with sidecar.istio.io/inject=false

tingweiwu opened this issue · 1 comment

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
  namespace: system
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: pytorch:1.0-cuda10.0-cudnn7-runtime
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              resources: 
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers: 
            - name: pytorch
              image: pytorch:1.0-cuda10.0-cudnn7-runtime
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              resources: 
                limits:
                  nvidia.com/gpu: 1

When I deploy the PyTorchJob above, the worker pod fails with the following error:

Traceback (most recent call last):
  File "mnist.py", line 153, in <module>
    main()
  File "mnist.py", line 119, in main
    dist.init_process_group(backend=args.backend)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 354, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 144, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, start_daemon)
RuntimeError: Connection timed out
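
For reference, this is the point where the env:// rendezvous (the default when dist.init_process_group is called without an init_method) reads MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK from the environment, and the worker opens a TCP connection to the master's store. A minimal sketch of what the failing line boils down to, assuming the same env vars as shown below:

# Sketch of the env:// rendezvous step; requires MASTER_ADDR, MASTER_PORT,
# WORLD_SIZE and RANK to be set, as the operator does for each replica.
import os
import torch.distributed as dist

print("rank %s/%s connecting to %s:%s" % (
    os.environ["RANK"], os.environ["WORLD_SIZE"],
    os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"]))

# Times out exactly like the traceback above if the master is unreachable.
dist.init_process_group(backend="gloo")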

The env in the worker pod spec is:

  env:
    - name: MASTER_PORT
      value: "23456"
    - name: MASTER_ADDR
      value: pytorch-dist-mnist-gloo-master-0
    - name: WORLD_SIZE
      value: "2"
    - name: RANK
      value: "1"
    - name: PYTHONUNBUFFERED
      value: "0"

Why isn't MASTER_ADDR set to pytorch-dist-mnist-gloo-master-0.<namespace>?
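
As far as I can tell, the bare name works because master and worker run in the same namespace, so the pod's DNS search path (<namespace>.svc.cluster.local) resolves it without a namespace suffix. A quick way to check resolution and reachability from the worker (the worker pod name below follows the operator's <job>-worker-0 convention and is an assumption):

kubectl exec -n system pytorch-dist-mnist-gloo-worker-0 -- \
  python -c "import socket; print(socket.gethostbyname('pytorch-dist-mnist-gloo-master-0'))"
kubectl exec -n system pytorch-dist-mnist-gloo-worker-0 -- \
  python -c "import socket; socket.create_connection(('pytorch-dist-mnist-gloo-master-0', 23456), timeout=5); print('port 23456 reachable')"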

Additional info:

  1. The namespace is described as follows:
kubectl describe namespace system
Name:         system
Labels:       <none>
Annotations:  <none>
Status:       Active

Resource Quotas
 Name:    system-resourcequota
 Scopes:  NotBestEffort, NotTerminating
  * Matches all pods that have at least one resource requirement set. These pods have a burstable or guaranteed quality of service.
  * Matches all pods that do not have an active deadline. These pods usually include long running pods whose container command is not expected to terminate.
 Resource                 Used   Hard
 --------                 ---    ---
 limits.cpu               7      400
 limits.memory            29Gi   3000Gi
 requests.cpu             6125m  400
 requests.memory          29Gi   3000Gi
 requests.nvidia.com/gpu  1      50

No resource limits.

  2. pytorch-operator was installed with kustomize in the kube-system namespace:
kustomize build manifests/overlays/standalone | kubectl apply -f -

Solved by checking the NetworkPolicy in the user namespace.
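
For anyone hitting the same thing: a NetworkPolicy in the job's namespace was blocking the worker's connection to the master on the rendezvous port. A rough sketch of the kind of allow rule that would unblock it (the policy name and the pytorch-job-name label are assumptions; check the actual pod labels with kubectl get pods --show-labels and adjust the selectors to match):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-pytorchjob-rendezvous   # hypothetical name
  namespace: system
spec:
  # Select the job's pods; verify the label key/value on your cluster.
  podSelector:
    matchLabels:
      pytorch-job-name: pytorch-dist-mnist-gloo
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              pytorch-job-name: pytorch-dist-mnist-gloo
      ports:
        - protocol: TCP
          port: 23456   # MASTER_PORT from the pod spec above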