distributed training questions
Eric-Zhang1990 opened this issue · 2 comments
Eric-Zhang1990 commented
@Tomcli @sboagibm Sorry to bother you.
I am still confused about multiple learners.
My understanding is that multiple learners mean distributed training, but when I run a PyTorch job with 3 learners (each learner has 1 GPU), I get 3 separate training results, and they run independently.
That means I am just running the same job on different servers, not doing distributed training, right?
Thank you.
Tomcli commented
Hi @Eric-Zhang1990, we have an example of how to train distributed PyTorch on FfDL: https://github.com/IBM/FfDL/blob/master/etc/examples/c10d-native-parallelism/model-files/train_dist_parallel.py#L187-L227
On FfDL, all the learners share a working directory under the /job/ path. We use that path to discover each learner container's IP and connect them with the 'gloo'/'nccl'/'mpi' backends. Then, during model training, our example averages the gradients at the end of each batch.
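For reference, here is a minimal sketch of that pattern (not the FfDL example itself, which is at the link above): rendezvous through a file on the shared directory, then all-reduce the gradients after each backward pass. The `RANK`/`WORLD_SIZE` environment variables and the `/job/shared_init_file` name are assumptions for illustration only.

```python
# Minimal sketch, assuming each learner gets RANK and WORLD_SIZE from its
# environment and can write to the shared /job/ directory.
import os
import torch.distributed as dist


def init_process_group():
    # Learners rendezvous through a file on the shared /job/ path
    # (hypothetical file name) instead of a hard-coded master address.
    dist.init_process_group(
        backend="gloo",  # or "nccl" / "mpi"
        init_method="file:///job/shared_init_file",
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )


def average_gradients(model):
    # Sum each parameter's gradient across all learners, then divide by the
    # number of learners so every learner applies the same averaged update.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size
```

In a training loop, `average_gradients(model)` would be called after `loss.backward()` and before `optimizer.step()`, so the learners stay in sync instead of training independently.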
Eric-Zhang1990 commented
@Tomcli Thank you for your kind reply.