kengz/SLM-Lab

How can I train with multiple computers?

lidongke opened this issue · 5 comments

Hi~
How can I train with multiple computers? I haven't seen where to set the address to connect. Does the "distributed" setting in the spec JSON work for this?
@kengz

kengz commented

Hi @lidongke, this is currently not a feature; the lab is meant to run within a single machine, although that can already be quite big.
Multi-machine is a use case the lab has not encountered, so you'll likely need to write custom code to modify or import the lab. We do not plan to support this soon, but here's a reference to get you started: https://pytorch.org/docs/stable/distributed.html
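For reference, a minimal sketch of what the custom PyTorch side could look like. This is not SLM-Lab code; `init_distributed` and `average_gradients` are hypothetical helper names, and the master address/port are placeholders you would set for your own machines.

```python
import os
import torch
import torch.distributed as dist

def init_distributed(rank: int, world_size: int, master_addr: str, master_port: str = "29500"):
    """Join a multi-machine process group over TCP.

    Run one process per machine (or per GPU) with a unique rank;
    the rank-0 machine acts as the rendezvous point.
    """
    os.environ["MASTER_ADDR"] = master_addr  # IP of the rank-0 machine
    os.environ["MASTER_PORT"] = master_port
    # "gloo" works on CPU; "nccl" is the usual choice for multi-GPU
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

def average_gradients(model: torch.nn.Module):
    """All-reduce gradients so every worker applies the same update."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size
```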

I see that ray can work for multi-machine. Do you know if it would be easy to add to the lab?

kengz commented

You can probably start with the ray documentation https://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html , set up the cluster machines, and pass those cluster configs into the ray.init(...) calls in SLM Lab; a rough sketch follows below.
Note that this means the parallelized runtimes are distinct Trials and so will contain different instances of an algorithm. If you're trying to, say, run a massive Hogwild parallelization of 1 algorithm with many workers across multiple machines, this is not the use case.
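A rough sketch of that change, assuming you have located the ray.init(...) calls in SLM Lab (their exact location is not shown here, and `<head_node_ip>` is a placeholder for your head machine's address):

```python
import ray

# On the head machine, start the cluster:
#   ray start --head --port=6379
# On each worker machine, join it:
#   ray start --address='<head_node_ip>:6379'

# Then, instead of the default local initialization inside SLM Lab,
# connect to the existing cluster:
ray.init(address="<head_node_ip>:6379")  # or address="auto" when run on a cluster node
```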

That is correct: I'm trying to run 1 algorithm with many workers across multiple machines, which would increase sampling efficiency.
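For context, the single-machine version of that pattern looks roughly like the standard PyTorch Hogwild sketch below (not SLM-Lab code; the dummy data stands in for sampled experience). Extending it across machines is exactly the part that would need torch.distributed instead of shared memory:

```python
import torch
import torch.multiprocessing as mp

def worker(shared_model):
    """Each worker computes gradients on its own data and applies
    lock-free updates directly to the shared parameters (Hogwild)."""
    optimizer = torch.optim.SGD(shared_model.parameters(), lr=0.01)
    for _ in range(100):
        x = torch.randn(8, 4)       # stand-in for sampled experience
        target = torch.randn(8, 2)
        loss = ((shared_model(x) - target) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()            # updates the shared memory in place

if __name__ == "__main__":
    model = torch.nn.Linear(4, 2)
    model.share_memory()  # put parameters in shared memory, visible to all workers
    procs = [mp.Process(target=worker, args=(model,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```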

You can close this issue, thanks!