MASTER_ADDR setting for mnist example
When launching the public pytorch examples for jobsets (doc link), I had to change the MASTER_ADDR value from pytorch-workers-0-0.pytorch-workers -> pytorch-workers-0-0.pytorch.train.svc.cluster.local for the pods to connect.
Is this something specific to my installation? Or do the examples need to be updated?
Exact error log from resnet.yaml
[W socket.cpp:558] [c10d] The IPv6 network addresses of (pytorch-workers-0-0.pytorch-workers, 3389) cannot be retrieved (gai error: -2 - Name or service not known).
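For anyone debugging the same message: on glibc, gai error -2 is EAI_NONAME, i.e. the name did not resolve at all. A minimal sketch of a check that could be run from inside a worker pod to compare candidate MASTER_ADDR values follows. The function name `can_resolve` and the injectable `resolver` parameter are hypothetical, added only so the check can be exercised without a cluster; port 3389 is taken from the log line above.

```python
import socket


def can_resolve(host, port=3389, resolver=socket.getaddrinfo):
    """Return True if `host` resolves, False on a name-resolution failure.

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so
    the function can be tested with a stand-in (an assumption of this sketch).
    """
    try:
        resolver(host, port)
        return True
    except socket.gaierror:
        # Raised for failures such as "Name or service not known".
        return False
```

Calling `can_resolve("pytorch-workers-0-0.pytorch-workers")` and `can_resolve("pytorch-workers-0-0.pytorch")` from a pod in the jobset should show which of the two names the cluster DNS actually serves.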
Setup
I launched the jobset into a specific namespace named train.
kubectl apply -f jobset.yaml -n train --server-side
We are using v0.5.2 of jobset:
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.5.2/manifests.yaml
I have verified that the headless service resource exists on jobset creation. I have also verified that the jobset-controller-manager is in a healthy state.
Oh it looks like the example docs might be out of date.
The public docs have MASTER_ADDR=pytorch-workers-0-0.pytorch-workers, but the github example correctly has MASTER_ADDR=pytorch-workers-0-0.pytorch.
Setting MASTER_ADDR=pytorch-workers-0-0.pytorch now works.
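The two working values in this thread follow one pattern, sketched below. This assumes the pod hostname is `<jobset>-<replicatedjob>-<jobindex>-<podindex>` and that the headless service (the pod subdomain) is named after the jobset itself, which is what the working value above suggests; it is not taken from the jobset source. The helper `master_addr` and its parameters are hypothetical, and `cluster.local` is assumed as the cluster domain because it appears in the original report.

```python
def master_addr(jobset, replicated_job, job_index=0, pod_index=0,
                namespace=None, cluster_domain="cluster.local"):
    """Build a MASTER_ADDR value for one pod of a jobset (illustrative only).

    Assumes the headless service shares the jobset's name.
    """
    pod = f"{jobset}-{replicated_job}-{job_index}-{pod_index}"
    short = f"{pod}.{jobset}"  # resolvable from pods in the same namespace
    if namespace is None:
        return short
    # Fully qualified form, as used in the original workaround.
    return f"{short}.{namespace}.svc.{cluster_domain}"


print(master_addr("pytorch", "workers"))
print(master_addr("pytorch", "workers", namespace="train"))
```

The short form matches the github example, and the namespaced form matches the workaround from the original report.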
Fixed in #604
Thanks @song-william!