microsoft/mscclpp

[Perf] Have you tuned the performance of single-node all-to-all communication?

sphish opened this issue · 2 comments

sphish commented

I tested on a DGX A100 and found that, when nbytes ranges from 1M to 32M, NCCL's all-to-all is far faster than MSCCLPP's.

Our current all-to-all communication implementation is basic and serves primarily as a demonstration of using MSCCL++ in kernel code, rather than being optimized for performance. For high-performance requirements, it is advisable to utilize the smChannel and to develop your own kernel using the MSCCL++ API. Contributions to enhance this are welcome.
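For anyone writing such a kernel, it may help to pin down the exchange it has to perform. Below is a minimal host-side C++ sketch of all-to-all semantics only, with no GPU or channel code. `allToAllReference` and its parameters are hypothetical names, not part of the MSCCL++ API, and the sketch assumes every rank's send buffer holds `nRanks * chunkElems` elements.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, NOT part of the MSCCL++ API.
// send[r] is rank r's send buffer, split into nRanks chunks of chunkElems
// elements each (assumption: send[r].size() == nRanks * chunkElems).
// After the exchange, chunk `src` of recv[dst] equals chunk `dst` of send[src].
std::vector<std::vector<int>> allToAllReference(
    const std::vector<std::vector<int>>& send, std::size_t chunkElems) {
  const std::size_t nRanks = send.size();
  std::vector<std::vector<int>> recv(
      nRanks, std::vector<int>(nRanks * chunkElems));
  for (std::size_t src = 0; src < nRanks; ++src) {
    for (std::size_t dst = 0; dst < nRanks; ++dst) {
      for (std::size_t i = 0; i < chunkElems; ++i) {
        // Rank src's chunk for dst lands in dst's slot reserved for src.
        recv[dst][src * chunkElems + i] = send[src][dst * chunkElems + i];
      }
    }
  }
  return recv;
}
```

A reference like this could be used to check the output of a custom kernel on small inputs before measuring its performance.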

sphish commented

> For high-performance requirements, it is advisable to utilize the smChannel and to develop your own kernel using the MSCCL++ API. Contributions to enhance this are welcome.

Thanks. I'll give it a try.