
failed to connect to 'ipv4:': socket error: connection refused

echoyes opened this issue · 5 comments

When I run the distributed training by following the command "./ 8 3", I get errors such as "failed to connect to 'ipv4:': socket error: connection refused".
Also, how does the command "./ 8 3" start the remote server processes without using ssh or some other protocol?

You should run the command inside the k8s cluster. That means you need to use
kubectl exec -it some-pod -- bash
to get a shell inside the cluster and start the training process from there. some-pod can be any pod that runs inside the cluster; for example, you can use the ps-worker pod.
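To confirm that a "connection refused" error comes from where the command is run, a small TCP probe can be executed once from your own machine and once from a shell inside a pod. This is only a minimal sketch using the Python standard library; the `probe` helper is hypothetical and not part of the scripts discussed in this issue, and the host and port to test are whatever addresses your job actually uses.

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify a TCP endpoint as 'open', 'refused', 'timeout', or 'error: ...'."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout (often a routing or firewall issue).
        return "timeout"
    except OSError as exc:
        # Covers DNS failures and other network errors.
        return f"error: {exc}"
```

If the probe reports "open" from inside the pod but "refused", "timeout", or a name-resolution error from outside, the addresses are probably only reachable from within the cluster, which matches the advice above.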

If you want to train models on remote servers, note that TensorFlow uses gRPC by default for communication between processes (and I don't think you can change that without significant code changes).
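Because the processes talk over gRPC, each task has to know the host:port of every other task. One common way to pass this to TensorFlow is the `TF_CONFIG` environment variable, a JSON document with a `cluster` and a `task` entry. The sketch below only builds that JSON; it assumes the training script reads `TF_CONFIG`, which this thread does not confirm, and the `make_tf_config` helper and all addresses are hypothetical.

```python
import json


def make_tf_config(workers, ps, task_type, task_index):
    """Build a TF_CONFIG-style JSON string for one task of a ps/worker cluster."""
    cluster = {"worker": list(workers), "ps": list(ps)}
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type!r}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise IndexError(f"no {task_type} task with index {task_index}")
    # 'cluster' lists every task's address; 'task' says which one this process is.
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}}
    )


# Hypothetical in-cluster addresses, for illustration only.
config = make_tf_config(
    workers=["worker-0:2222", "worker-1:2222"],
    ps=["ps-0:2222"],
    task_type="worker",
    task_index=1,
)
```

Each process would get the same `cluster` section and its own `task` section, for example by exporting the string as `TF_CONFIG` before the training command starts.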



