IBM/FfDL

distributed training questions

Eric-Zhang1990 opened this issue · 2 comments

@Tomcli @sboagibm Sorry to bother you.
I am still confused about multiple learners.
[screenshot]
My understanding is that multiple learners mean distributed training, but when I run a PyTorch job with 3 learners (each learner has 1 GPU), I get 3 separate training results, and the learners run independently.
[screenshot]

Does that mean I am just running the same job on different servers rather than doing distributed training?
Thank you.

Hi @Eric-Zhang1990, we have an example of how to run distributed PyTorch training on FfDL: https://github.com/IBM/FfDL/blob/master/etc/examples/c10d-native-parallelism/model-files/train_dist_parallel.py#L187-L227

On FfDL, all the learners share a working directory under the /job/ path. We use that path to discover the IPs of all the learner containers and connect them with the 'gloo'/'nccl'/'mpi' backends. Then, during model training, our example averages the gradients across learners at the end of each batch.
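A minimal sketch of that pattern (not the exact FfDL example): each learner initializes `torch.distributed` and then averages gradients across learners after every backward pass. Reading `RANK`/`WORLD_SIZE`/`MASTER_ADDR` from environment variables is an assumption here; the linked FfDL example instead discovers peers through the shared /job/ directory.

```python
# Sketch of distributed data-parallel training with manual gradient averaging.
# Assumes MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are set in the
# environment on every learner (an assumption, not FfDL's exact mechanism).
import os
import torch
import torch.distributed as dist
import torch.nn as nn

def average_gradients(model):
    """All-reduce each parameter's gradient and divide by the number of learners."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size

def main():
    # 'gloo' works on CPU; 'nccl' is preferred for multi-GPU; 'mpi' needs an MPI build.
    dist.init_process_group(backend="gloo",
                            rank=int(os.environ["RANK"]),
                            world_size=int(os.environ["WORLD_SIZE"]))

    model = nn.Linear(10, 2)  # toy model standing in for the real one
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(100):      # toy loop with random data
        inputs = torch.randn(32, 10)
        targets = torch.randint(0, 2, (32,))
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        average_gradients(model)  # synchronize learners every batch
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

With this setup the learners train a single shared model instead of three independent copies, which is the difference you are seeing in your runs.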

@Tomcli Thank you for your kind reply.