kubernetes-sigs/jobset

MASTER_ADDR setting for mnist example

Closed this issue · 2 comments

When launching the public PyTorch examples for JobSet (doc link), I had to change the MASTER_ADDR value from pytorch-workers-0-0.pytorch-workers to pytorch-workers-0-0.pytorch.train.svc.cluster.local for the pods to connect.

Is this something specific to my installation? Or do the examples need to be updated?

Exact error log from resnet.yaml

[W socket.cpp:558] [c10d] The IPv6 network addresses of (pytorch-workers-0-0.pytorch-workers, 3389) cannot be retrieved (gai error: -2 - Name or service not known).
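
The lookup failure can be reproduced outside of PyTorch by resolving the name directly from inside a worker pod. This is a minimal sketch, assuming Python is available in the container; `can_resolve` is a hypothetical helper and not part of JobSet or PyTorch.

```python
import socket


def can_resolve(host: str, port: int = 3389) -> bool:
    """Return True if `host` resolves to at least one address.

    Uses the same getaddrinfo() lookup that produces the "gai error"
    in the log above. The port only fills in the address tuple; no
    connection is attempted.
    """
    try:
        results = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        return len(results) > 0
    except socket.gaierror as err:
        # "Name or service not known" is reported here.
        print(f"cannot resolve {host!r}: {err}")
        return False
```

Calling `can_resolve("pytorch-workers-0-0.pytorch-workers")` inside the pod should return `False` if the name is the problem, which separates a DNS issue from a PyTorch rendezvous issue.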

Setup

I launched the JobSet into a dedicated namespace named train:
kubectl apply -f jobset.yaml -n train --server-side

We are using v0.5.2 of JobSet:
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.5.2/manifests.yaml

I have verified that the headless service resource exists after the JobSet is created, and that the jobset-controller-manager is in a healthy state.

Oh, it looks like the example docs might be out of date.

The public docs have MASTER_ADDR=pytorch-workers-0-0.pytorch-workers, but the GitHub example correctly has MASTER_ADDR=pytorch-workers-0-0.pytorch.

Setting MASTER_ADDR=pytorch-workers-0-0.pytorch now works.
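
For anyone adapting the example to a different JobSet name, here is a rough sketch of how the working address appears to be composed. The pattern is inferred from the names in this thread, not taken from the JobSet source: I am assuming the JobSet is named `pytorch`, the replicated job is named `workers`, and the headless service shares the JobSet's name. The function names are hypothetical.

```python
from typing import Optional


def pod_hostname(jobset: str, replicated_job: str,
                 job_index: int = 0, pod_index: int = 0) -> str:
    """Assumed pod hostname pattern: <jobset>-<replicatedJob>-<jobIndex>-<podIndex>."""
    return f"{jobset}-{replicated_job}-{job_index}-{pod_index}"


def master_addr(jobset: str, replicated_job: str,
                namespace: Optional[str] = None,
                cluster_domain: str = "cluster.local") -> str:
    """Build a MASTER_ADDR value for the first pod of the first job.

    Assumes the headless service (the pod's DNS subdomain) has the
    same name as the JobSet. With `namespace` set, the fully
    qualified form is returned instead of the short one.
    """
    short = f"{pod_hostname(jobset, replicated_job)}.{jobset}"
    if namespace is None:
        return short
    return f"{short}.{namespace}.svc.{cluster_domain}"


print(master_addr("pytorch", "workers"))
print(master_addr("pytorch", "workers", namespace="train"))
```

The first call gives the short form that works now, and the second gives the fully qualified name I used as a workaround in the train namespace.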