Why doesn't the CUDA benchmark include allreduce_ring?
Closed · 1 comment
leeQT commented
The regular version of the benchmark supports allreduce_ring, but the CUDA version doesn't. However, the source code does contain a CUDA implementation of it. Is it left out because allreduce_ring doesn't perform well on CUDA?
pietern commented
The regular ring algorithm is only useful for very small messages, since it puts more bytes on the wire than the chunked version, so it was never added to the CUDA benchmark.
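
To make the "more bytes on the wire" point concrete, here is a rough back-of-the-envelope sketch. It assumes the regular ring forwards the full buffer to its neighbor at each of the N-1 steps, and that the chunked ring splits the buffer into N chunks and runs a reduce-scatter followed by an allgather. The function names are hypothetical and are not part of the library; the numbers ignore latency and protocol overhead.

```python
def ring_bytes_sent(message_bytes: int, num_procs: int) -> int:
    """Bytes each process sends in a non-chunked ring allreduce.

    Assumption: the whole buffer is forwarded at each of the N-1 steps.
    """
    return (num_procs - 1) * message_bytes


def chunked_ring_bytes_sent(message_bytes: int, num_procs: int) -> float:
    """Bytes each process sends in a chunked ring allreduce.

    Assumption: reduce-scatter plus allgather, each taking N-1 steps
    and each step sending one chunk of size message_bytes / N.
    """
    return 2 * (num_procs - 1) * message_bytes / num_procs


if __name__ == "__main__":
    size = 1 << 20  # 1 MiB message, chosen only for illustration
    for n in (2, 4, 8, 16):
        full = ring_bytes_sent(size, n)
        chunked = chunked_ring_bytes_sent(size, n)
        print(f"N={n:2d}  ring={full:>10d} B  chunked={chunked:>12.0f} B  "
              f"ratio={full / chunked:.1f}x")
```

Under these assumptions the ratio between the two is N/2, so the gap grows with the number of processes. For very small messages the per-step latency dominates and the extra bytes matter less, which is consistent with the comment above.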