Why doesn't the CUDA benchmark include allreduce_ring?

Question

Closed this issue 5 years ago · 1 comment

The regular version of the benchmark supports allreduce_ring, but the CUDA version doesn't. However, the source code does contain a CUDA implementation of it. Is this because allreduce_ring doesn't perform well on CUDA?

Answer 1 · 2019-10-11T13:54:50.000Z

The regular ring algorithm is only useful for very small messages, since it puts more bytes on the wire than the chunked version, so it was never added to the CUDA benchmark.
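
To make the "more bytes on the wire" point concrete, here is a rough back-of-the-envelope sketch. It is not taken from the library source. It assumes the regular ring forwards the full buffer to the next neighbor for N-1 steps, while the chunked ring does a reduce-scatter followed by an all-gather over chunks of size S/N. The function names `ring_bytes_sent` and `chunked_ring_bytes_sent` are hypothetical, chosen only for illustration.

```python
def ring_bytes_sent(message_bytes: int, num_procs: int) -> int:
    """Bytes each process sends with a plain ring (assumed model).

    Assumption: the full buffer is forwarded to the next neighbor
    once per step, for num_procs - 1 steps.
    """
    return (num_procs - 1) * message_bytes


def chunked_ring_bytes_sent(message_bytes: int, num_procs: int) -> float:
    """Bytes each process sends with a chunked ring (assumed model).

    Assumption: reduce-scatter then all-gather, each taking
    num_procs - 1 steps and sending one chunk of size
    message_bytes / num_procs per step.
    """
    return 2 * (num_procs - 1) * message_bytes / num_procs


if __name__ == "__main__":
    size = 1 << 20  # 1 MiB message
    for n in (2, 4, 8, 16):
        plain = ring_bytes_sent(size, n)
        chunked = chunked_ring_bytes_sent(size, n)
        print(f"N={n:2d}  ring={plain}  chunked={chunked:.0f}  ratio={plain / chunked:.1f}")
```

Under this model the plain ring sends N/2 times as many bytes per process as the chunked ring, so the gap grows with the number of processes. With very small messages the transfer is latency-bound and the extra bytes matter little, which matches the answer above.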