kengz/SLM-Lab

How can I train with multiple computers?

lidongke opened this issue · 5 comments

Hi~
How can I train with multiple computers? I haven't seen where to set the address to connect. Does the "distributed" setting in the spec JSON work for this?
@kengz

kengz commented

Hi @lidongke, this is currently not a feature; the lab is meant to run within a single machine, although that can already be quite big.
Multi-machine is a use case the lab has not encountered, so you'll likely need to write custom code to modify or import the lab. We do not plan to support this soon, but here's a reference to get you started: https://pytorch.org/docs/stable/distributed.html
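For reference, a minimal sketch of what the custom PyTorch side could look like. This is not SLM-Lab code; `init_distributed` and `average_gradients` are hypothetical helper names, and the master address/port are placeholders you would set for your own machines.

```python
import os
import torch
import torch.distributed as dist

def init_distributed(rank: int, world_size: int, master_addr: str, master_port: str = "29500"):
    """Join a multi-machine process group over TCP.

    Run one process per machine (or per GPU) with a unique rank;
    the rank-0 machine acts as the rendezvous point.
    """
    os.environ["MASTER_ADDR"] = master_addr  # IP of the rank-0 machine
    os.environ["MASTER_PORT"] = master_port
    # "gloo" works on CPU; "nccl" is the usual choice for multi-GPU
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

def average_gradients(model: torch.nn.Module):
    """All-reduce gradients so every worker applies the same update."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size
```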

I see that ray can work for multi-machine. Do you know if it would be easy to add to the lab?

kengz commented

You can probably start with the ray documentation https://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html , set up the cluster machines, and pass those cluster configs into the ray.init(...) calls in SLM Lab; a rough sketch follows below.
Note that this means the parallelized runtimes are distinct Trials and so will contain different instances of an algorithm. If you're trying to, say, run a massive Hogwild parallelization of 1 algorithm with many workers across multiple machines, this is not the use case.
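A rough sketch of that change, assuming you have located the ray.init(...) calls in SLM Lab (their exact location is not shown here, and `<head_node_ip>` is a placeholder for your head machine's address):

```python
import ray

# On the head machine, start the cluster:
#   ray start --head --port=6379
# On each worker machine, join it:
#   ray start --address='<head_node_ip>:6379'

# Then, instead of the default local initialization inside SLM Lab,
# connect to the existing cluster:
ray.init(address="<head_node_ip>:6379")  # or address="auto" when run on a cluster node
```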

That is correct: I'm trying to run 1 algorithm with many workers across multiple machines, which would increase sampling efficiency.
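For context, the single-machine version of that pattern looks roughly like the standard PyTorch Hogwild sketch below (not SLM-Lab code; the dummy data stands in for sampled experience). Extending it across machines is exactly the part that would need torch.distributed instead of shared memory:

```python
import torch
import torch.multiprocessing as mp

def worker(shared_model):
    """Each worker computes gradients on its own data and applies
    lock-free updates directly to the shared parameters (Hogwild)."""
    optimizer = torch.optim.SGD(shared_model.parameters(), lr=0.01)
    for _ in range(100):
        x = torch.randn(8, 4)       # stand-in for sampled experience
        target = torch.randn(8, 2)
        loss = ((shared_model(x) - target) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()            # updates the shared memory in place

if __name__ == "__main__":
    model = torch.nn.Linear(4, 2)
    model.share_memory()  # put parameters in shared memory, visible to all workers
    procs = [mp.Process(target=worker, args=(model,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```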

You can close this issue, thanks!