[Perf] Have you tuned the performance of single-node all-to-all communication?
sphish opened this issue · 2 comments
sphish commented
I tested on a DGX A100 and found that, when nbytes ranges from 1M to 32M, NCCL's all-to-all is far faster than MSCCLPP's.
Binyang2014 commented
Our current all-to-all communication implementation is basic and serves primarily as a demonstration of using MSCCL++ in kernel code, rather than being optimized for performance. For high-performance requirements, it is advisable to utilize the smChannel and to develop your own kernel using the MSCCL++ API. Contributions to enhance this are welcome.
sphish commented
> For high-performance requirements, it is advisable to utilize the smChannel and to develop your own kernel using the MSCCL++ API. Contributions to enhance this are welcome.
Thanks. I'll give it a try
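
For anyone following up on the suggestion above, a host-side reference for the all-to-all data movement can help check a custom kernel's output before tuning it. The sketch below is plain C++ and deliberately does not use the MSCCL++ API. `referenceAllToAll` is a hypothetical helper name, and the layout (each rank's send buffer holds one fixed-size chunk per destination rank, in rank order) is an assumption about how the buffers are arranged.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of MSCCL++ or NCCL.
// send[r] is rank r's send buffer: nRanks chunks of chunkElems elements,
// where chunk p is the data rank r addresses to rank p (assumed layout).
// Returns recv, where chunk p of recv[r] is the chunk rank p addressed to rank r.
std::vector<std::vector<int>> referenceAllToAll(
    const std::vector<std::vector<int>>& send, std::size_t chunkElems) {
  const std::size_t nRanks = send.size();
  std::vector<std::vector<int>> recv(nRanks,
                                     std::vector<int>(nRanks * chunkElems));
  for (std::size_t src = 0; src < nRanks; ++src) {
    // Every send buffer must hold exactly one chunk per destination rank.
    assert(send[src].size() == nRanks * chunkElems);
    for (std::size_t dst = 0; dst < nRanks; ++dst) {
      for (std::size_t i = 0; i < chunkElems; ++i) {
        // Chunk dst of the sender lands in slot src of the receiver.
        recv[dst][src * chunkElems + i] = send[src][dst * chunkElems + i];
      }
    }
  }
  return recv;
}
```

Copying each GPU's receive buffer back to the host and comparing it against `recv[rank]` gives a simple correctness check that is independent of which channel type the kernel uses.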