facebookincubator/gloo

Gloo in PyTorch for GPU tensor collective communication

For Gloo in PyTorch distributed, as documented at https://pytorch.org/docs/stable/distributed.html, will the following code get the performance benefits of CUDA-aware MPI (e.g., direct GPU-to-GPU transfers over PCIe that bypass the CPU)?

import torch.distributed as dist

# gpu_tensor_a is a CUDA tensor; both the subgroup and the all-reduce use the Gloo backend
group = dist.new_group([0, 1], backend="gloo")
dist.all_reduce(gpu_tensor_a, op=dist.ReduceOp.SUM, group=group)
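
For context, here is a minimal self-contained sketch of the pattern being asked about. It is only an illustration: the env:// rendezvous, one GPU per rank, and the example tensor standing in for gpu_tensor_a are assumptions, not part of the original snippet.

import torch
import torch.distributed as dist

def main():
    # Rendezvous via environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE)
    dist.init_process_group(backend="gloo", init_method="env://")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Gloo subgroup over ranks 0 and 1, as in the snippet above
    group = dist.new_group([0, 1], backend="gloo")

    # Example CUDA tensor standing in for gpu_tensor_a
    gpu_tensor_a = torch.ones(4, device=f"cuda:{rank}") * (rank + 1)

    # All-reduce the GPU tensor through the Gloo process group
    dist.all_reduce(gpu_tensor_a, op=dist.ReduceOp.SUM, group=group)
    print(f"rank {rank}: {gpu_tensor_a}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

With two GPUs on a single node, this would typically be launched with something like torchrun --nproc_per_node=2 script.py, which sets the rendezvous environment variables automatically.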