microsoft/mscclpp

Is there any documentation explaining the differences between the allreduce algorithms in mscclpp?

MARD1NO opened this issue · 4 comments

I'm not clear on the differences between allreduce1/2/3..., or which algorithm I should use on different devices (PCIe, NVLink, NVLS).

The allreduce1/2/3 kernels are just for testing purposes, and we have only tested them on NVLink devices. Allreduce 1/2/3 are implemented with different algorithms/interfaces: Algo1 is a ring-based reduce, Algo2 uses the packet interface, and Algo3 is a reduce-scatter + allgather reduce.

They are just examples showing how to use the mscclpp interfaces to write your own algorithm, and should not be used in production.
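To make the Algo3 idea concrete, here is a minimal, library-free sketch of the reduce-scatter + allgather decomposition of allreduce, simulating each rank's buffer with a NumPy array. This illustrates the algorithmic structure only; it is not the mscclpp API.

```python
import numpy as np

def allreduce_rs_ag(rank_buffers):
    """Simulate a reduce-scatter + allgather allreduce over in-memory 'ranks'.

    Phase 1 (reduce-scatter): rank r ends up owning the fully reduced
    chunk r. Phase 2 (allgather): every rank collects all reduced chunks.
    """
    n = len(rank_buffers)
    # Split each rank's buffer into n equal chunks.
    chunks = [np.array_split(buf, n) for buf in rank_buffers]

    # Reduce-scatter: chunk r is summed across all source ranks.
    reduced = [sum(chunks[src][r] for src in range(n)) for r in range(n)]

    # Allgather: every rank concatenates all reduced chunks.
    result = np.concatenate(reduced)
    return [result.copy() for _ in range(n)]

bufs = [np.arange(8, dtype=np.float32) * (r + 1) for r in range(4)]
out = allreduce_rs_ag(bufs)
# Every rank sees the same summed result.
assert all(np.array_equal(out[0], o) for o in out)
assert np.array_equal(out[0], sum(bufs))
```

In a real implementation each phase is a communication step between GPUs; the decomposition keeps every rank busy reducing one chunk instead of funneling all data through a single ring pass.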

Thanks. What about algo6? It seems to target NVLS; can it be used in production? (I ask because I see its benchmark results are better than the NCCL implementation.)

Oh, you mean the Python benchmark. Yes, algo6 is for NVLS, and these algorithms should be tested first. You can check the code here:

if MPI.COMM_WORLD.size // N_GPUS_PER_NODE == 1:
    if memory.nbytes < 2**20:
        mscclpp_algos = [MscclppAllReduce2(mscclpp_group, memory, memory_out)]
    else:
        mscclpp_algos = [
            MscclppAllReduce1(mscclpp_group, memory),
            MscclppAllReduce3(mscclpp_group, memory, proxy_service),
        ]
    if is_nvls_supported() and (data_type == cp.float32 or data_type == cp.float16):
        mscclpp_algos.append(MscclppAllReduce6(mscclpp_group, nelem, data_type))
else:
    if memory.nbytes < 2**22:
        mscclpp_algos = [MscclppAllReduce5(mscclpp_group, memory, memory_out, N_GPUS_PER_NODE, proxy_service)]
    else:
        mscclpp_algos = [MscclppAllReduce4(mscclpp_group, memory, N_GPUS_PER_NODE, proxy_service)]
This shows how to choose a different algorithm for different message sizes.
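The selection logic above can be summarized as a standalone helper, with the size thresholds taken from the benchmark code. The returned names are just labels for illustration, not mscclpp objects, and the helper itself is hypothetical:

```python
def pick_allreduce(nbytes: int, n_nodes: int, nvls_ok: bool = False):
    """Return candidate allreduce algos per the benchmark's size thresholds."""
    if n_nodes == 1:
        # Intra-node: packet-based algo2 for small messages (< 1 MiB),
        # otherwise the ring-based algo1 and reduce-scatter/allgather algo3.
        if nbytes < 2**20:
            algos = ["MscclppAllReduce2"]
        else:
            algos = ["MscclppAllReduce1", "MscclppAllReduce3"]
        if nvls_ok:
            algos.append("MscclppAllReduce6")  # NVLS path (fp32/fp16 only)
        return algos
    # Multi-node: algo5 for small messages (< 4 MiB), algo4 otherwise.
    if nbytes < 2**22:
        return ["MscclppAllReduce5"]
    return ["MscclppAllReduce4"]

assert pick_allreduce(512 * 1024, n_nodes=1) == ["MscclppAllReduce2"]
assert pick_allreduce(8 * 2**20, n_nodes=2) == ["MscclppAllReduce4"]
```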
We also support the NCCL interface: https://github.com/microsoft/mscclpp/blob/main/apps/nccl/README.md

You can try these algorithms in production, but please note that we included them mainly to demonstrate that mscclpp can outperform NCCL, and they only cover the single-node and two-node cases.
We also encourage you to develop your own algorithms via the mscclpp interfaces.

Thanks for your explanation :D