Opened this issue 3 years ago · 1 comment
pytorch/pytorch#62140
"grouped comm on a set of unflattened tensors can be more performant than flattening+a single flat nccl call."
Could also use allgather_coalesced instead of the gradient/inverse broadcast.