distributed training questions
Eric-Zhang1990 opened this issue · 2 comments
Eric-Zhang1990 commented
@Tomcli @sboagibm Sorry to bother you.
I am still confused about multiple learners.
My understanding is that multiple learners mean distributed training, but when I run a PyTorch job with 3 learners (each learner has 1 GPU), I get 3 separate training results, and they run independently.
That means I am just running the same job on different servers, not doing distributed training, right?
Thank you.
Tomcli commented
Hi @Eric-Zhang1990, we have an example of how to train distributed PyTorch on FfDL: https://github.com/IBM/FfDL/blob/master/etc/examples/c10d-native-parallelism/model-files/train_dist_parallel.py#L187-L227
On FfDL, all the learners share a working directory under the /job/ path. We use that path to discover each learner container's IP and connect them with the 'gloo'/'nccl'/'mpi' backends. Then, during model training, our example averages the gradients at the end of each batch.
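For reference, here is a minimal sketch of that pattern (not the FfDL example itself, which is at the link above): rendezvous through a file on the shared directory, then all-reduce the gradients after each backward pass. The `RANK`/`WORLD_SIZE` environment variables and the `/job/shared_init_file` name are assumptions for illustration only.

```python
# Minimal sketch, assuming each learner gets RANK and WORLD_SIZE from its
# environment and can write to the shared /job/ directory.
import os
import torch.distributed as dist


def init_process_group():
    # Learners rendezvous through a file on the shared /job/ path
    # (hypothetical file name) instead of a hard-coded master address.
    dist.init_process_group(
        backend="gloo",  # or "nccl" / "mpi"
        init_method="file:///job/shared_init_file",
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )


def average_gradients(model):
    # Sum each parameter's gradient across all learners, then divide by the
    # number of learners so every learner applies the same averaged update.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size
```

In a training loop, `average_gradients(model)` would be called after `loss.backward()` and before `optimizer.step()`, so the learners stay in sync instead of training independently.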
Eric-Zhang1990 commented
@Tomcli Thank you for your kind reply.