[Perf] Have you tuned the performance of single-node all-to-all communication?

Question

sphish opened this issue a year ago · 2 comments

I tested on a DGX A100 and found that, when nbytes ranges from 1M to 32M, NCCL's all-to-all is far faster than MSCCLPP's.

Answer 1 · 2023-11-28T09:38:57.000Z

Our current all-to-all implementation is basic. It serves primarily as a demonstration of how to use MSCCL++ in kernel code and is not optimized for performance. If you need high performance, we recommend using the smChannel and writing your own kernel with the MSCCL++ API. Contributions that improve the implementation are welcome.
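When writing a custom all-to-all kernel, it can help to have a host-side reference for checking the kernel's output. The following is a minimal sketch of the all-to-all data layout only. It does not call any MSCCL++ API or touch GPU memory. The function name `referenceAllToAll` and the layout (one equally sized chunk per peer, stored in rank order) are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not part of MSCCL++.
// send[r] holds nRanks chunks of chunkElems elements each;
// chunk d of rank r is destined for rank d.
// On return, recv[d] holds the chunk received from rank s at slot s.
std::vector<std::vector<int>> referenceAllToAll(
    const std::vector<std::vector<int>>& send, std::size_t chunkElems) {
  const std::size_t nRanks = send.size();
  std::vector<std::vector<int>> recv(
      nRanks, std::vector<int>(nRanks * chunkElems, 0));
  for (std::size_t src = 0; src < nRanks; ++src) {
    for (std::size_t dst = 0; dst < nRanks; ++dst) {
      for (std::size_t i = 0; i < chunkElems; ++i) {
        // Rank src writes its dst-th chunk into slot src of rank dst.
        recv[dst][src * chunkElems + i] = send[src][dst * chunkElems + i];
      }
    }
  }
  return recv;
}
```

One way to use it is to copy each GPU's receive buffer back to the host after the custom kernel runs and compare it element by element with `recv[rank]`.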

Answer 2 · 2023-11-29T00:50:23.000Z

Thanks. I'll give it a try.