richardkxu/distributed-pytorch

One question about DDP

cswwp opened this issue · 2 comments

cswwp commented

@richardkxu Nice repo. One question: is there a difference between "Single node, multiple GPUs with torch.distributed.launch" (①) and "Single node, multiple GPUs with multi-processes" (②)? Or are they equivalent, just two different ways of doing the same thing?


[screenshot: README section ① — Single node, multiple GPUs with torch.distributed.launch]

[screenshot: README section ② — Single node, multiple GPUs with multi-processes]

richardkxu commented

The main difference is which distributed training library you use. The 1st one uses the NVIDIA Apex library, while the 2nd one uses torch.nn.DistributedDataParallel. The 1st one gives better performance and works better with NVIDIA GPUs. It has also become the default approach in newer versions of PyTorch (> 1.6.0). Hope this is helpful!
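To make the contrast concrete, here is a minimal sketch (not taken from this repo; the model, port, and function names are placeholders) of how the two launch styles typically differ in code. With ①, `torch.distributed.launch` starts one process per GPU and passes `--local_rank`; with ②, the script itself spawns one worker per GPU via `torch.multiprocessing.spawn`:

```python
# Minimal sketch contrasting the two launch styles; placeholder model and port.
import argparse

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def build_and_wrap_model(local_rank):
    # Same DDP wrapping in both cases: one process drives one GPU.
    torch.cuda.set_device(local_rank)
    model = nn.Linear(10, 10).cuda(local_rank)  # placeholder model
    return DDP(model, device_ids=[local_rank])


# --- Style ①: run as `python -m torch.distributed.launch --nproc_per_node=N main.py` ---
def main_launch():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)  # injected by the launcher
    args = parser.parse_args()
    # The launcher sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE, so env:// reads them.
    dist.init_process_group(backend="nccl", init_method="env://")
    ddp_model = build_and_wrap_model(args.local_rank)
    # ... training loop ...


# --- Style ②: run as plain `python main.py`; the script spawns the workers itself ---
def worker(rank, world_size):
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:23456",  # assumed free port
        world_size=world_size,
        rank=rank,
    )
    ddp_model = build_and_wrap_model(rank)
    # ... training loop ...


def main_spawn():
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

In both cases the training loop and the DDP wrapper look the same; the difference is who creates the per-GPU processes and how the process group learns its rank and world size.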

cswwp commented

> The main difference is which distributed training library you use. The 1st one uses the NVIDIA Apex library, while the 2nd one uses torch.nn.DistributedDataParallel. The 1st one gives better performance and works better with NVIDIA GPUs. It has also become the default approach in newer versions of PyTorch (> 1.6.0). Hope this is helpful!

Thank you, very helpful!