caicloud/tensorflow-demo

failed to connect to 'ipv4:10.160.113.47:2222': socket error: connection refused

echoyes opened this issue · 5 comments

When I run the distributed mnist_cnn.py, following the command "./start_tf.sh 8 3 mnist_cnn.py", I get errors such as "failed to connect to 'ipv4:10.160.113.47:2222': socket error: connection refused".
Also, I am wondering how the command "./start_tf.sh 8 3 mnist_cnn.py" starts the remote server processes without using ssh or some other protocol.
Thanks.

You should run start_tf.sh inside the k8s cluster. That means you need to use
kubectl exec -it some-pod bash
to get into the cluster and start the training process from there. some-pod can be any pod that runs inside the cluster; for example, you can use the ps-worker pod.
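
To check whether the address from the error is reachable from where you are running, a minimal Python sketch like the one below may help. The function name can_connect and the 3-second timeout are my own choices for illustration, not something from this repo. It only tests whether a plain TCP connection can be opened; it does not speak gRPC.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False
```

For example, calling can_connect("10.160.113.47", 2222) from your own machine and again from inside a pod should show whether the failure is just a matter of being outside the cluster network.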

If you want to train models on remote servers, note that TensorFlow uses gRPC by default (and I don't think you can change that without significant code changes).
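
As a rough sketch of what that looks like with the TensorFlow 1.x API (this is not taken from start_tf.sh or mnist_cnn.py): each process is given the full list of "host:port" addresses and starts its own gRPC server on its entry. The helper names build_cluster and start_server and the host names in the example are hypothetical; port 2222 is simply the one from the error message.

```python
def build_cluster(ps_hosts, worker_hosts, port=2222):
    """Return a cluster dict mapping job name to "host:port" addresses."""
    return {
        "ps": ["%s:%d" % (h, port) for h in ps_hosts],
        "worker": ["%s:%d" % (h, port) for h in worker_hosts],
    }


def start_server(cluster, job_name, task_index):
    """Start this process's server (TensorFlow 1.x API, gRPC by default)."""
    # Imported lazily so build_cluster works without TensorFlow installed.
    import tensorflow as tf

    spec = tf.train.ClusterSpec(cluster)
    # The server listens on its own address from the spec and connects
    # to the other tasks' addresses when a session needs them.
    return tf.train.Server(spec, job_name=job_name, task_index=task_index)
```

For instance, build_cluster(["ps-0"], ["worker-0", "worker-1"]) gives one ps address and two worker addresses. If no process is listening at one of them, the other tasks report "connection refused" when they try to reach it.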

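Hello!
You have open-sourced your v1.0 code. I would like to ask: if I set up my own Kubernetes cluster, is it feasible to integrate your v1.0 code into that existing cluster?

Thank you 🙏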
