microsoft/mscclpp

[Perf] Failed to reproduce the performance result for Single-node AllReduce mentioned in README.md

FC-Li opened this issue · 5 comments

Hi,
I am trying to compare the allreduce performance of mscclpp and nccl. I used the following scripts to collect performance metrics for each.
For mscclpp:

log_dir="logs"
mkdir -p ${log_dir}

# Sweep the allreduce kernel variants (-k 0..7); cap the max message
# size at 2g for kernel 2.
for ((k = 0; k < 8; k++)); do
    maxBytes=4g
    if ((k == 2)); then
        maxBytes=2g
    fi
    mpirun --bind-to numa -np 8 ./build/test/mscclpp-test/allreduce_test_perf \
        -b 3m \
        -e ${maxBytes} \
        -G 10 \
        -n 30 \
        -w 10 \
        -f 2 \
        -k ${k} \
        -o "${log_dir}/ar.txt"
done

For nccl, using nccl-tests from NVIDIA:

mpirun -np 8 ./build/all_reduce_perf \
    -b 3M \
    -e 4G \
    -G 10 \
    -f 2 \
    -d int32 \
    -R 0

What mscclpp got is:
[figure: mscclpp allreduce benchmark results]
As shown in the figure above, mscclpp's maximum algBw stays below 100GB/s, while README.md suggests it should be around 140GB/s.

What nccl got is:
[figure: nccl allreduce benchmark results]
The achieved maximum algBw roughly matches README.md.

My nccl version is libnccl.so.2.20.5, and all tests were carried out on a machine with 8x H800 GPUs.

As we can see, nccl's performance is much better than mscclpp's, especially at large message sizes. I suspect mscclpp may not be running in its best configuration. It would help me reproduce the performance reported in README.md if you could share the test script and the environment configuration your tests were run with.

Could you try the python benchmark? https://microsoft.github.io/mscclpp/getting-started/quickstart.html#performance-benchmark. mscclpp-test may not be suitable for your case.
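For reference, a minimal invocation sketch, assuming the benchmark script path python/benchmark/allreduce_bench.py from the quickstart (the exact path and flags may differ in your checkout, so check the linked page):

# Run the mscclpp python allreduce benchmark on 8 local GPUs.
# The script path is an assumption based on the quickstart; adjust as needed.
mpirun -tag-output -np 8 python3 ./python/benchmark/allreduce_bench.py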

Also, it seems you enabled nvlink-sharp for nccl-tests but ran mscclpp-test without it. Testing with the python benchmark will enable nvlink sharp in both scenarios.
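One way to confirm whether nccl-tests is actually using NVLink SHARP: NCCL's INFO-level logs mention NVLS when it is active (a sketch; the exact log wording varies across NCCL versions, and -x is OpenMPI's flag for propagating environment variables to all ranks):

# Print NCCL's setup decisions and filter for NVLS mentions.
mpirun -x NCCL_DEBUG=INFO -np 8 ./build/all_reduce_perf \
    -b 3M -e 4G -G 10 -f 2 -d int32 2>&1 | grep -i nvls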
Besides, our result is for an A100 machine. The NVLink bandwidth is 600GB/s on A100 but 400GB/s on H800, so it makes sense that the A100 machine is faster: scaling the README's ~140GB/s by 400/600 gives roughly 93GB/s, in line with what you observed.

@Binyang2014 Thank you for your swift reply. I'll give the python benchmark a try and share the result with you later on.

@Binyang2014 My kernel version is 4.18.0, older than 5.6, so I ran into problems compiling the mscclpp python benchmark with nvls support. Updating the OS would take some time, so instead I turned nvls off for nccl to see what would happen. The following is what I got. The result now makes sense and aligns with your hypothesis.

[figure: nccl allreduce results with nvls disabled]
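For reference, disabling NVLink SHARP in nccl comes down to setting its NCCL_NVLS_ENABLE environment variable (a sketch assuming OpenMPI's -x flag; the rest of the command matches the earlier nccl-tests run):

# Disable NVLink SHARP (NVLS) in NCCL, then rerun the nccl-tests
# allreduce benchmark with the same parameters as before.
mpirun -x NCCL_NVLS_ENABLE=0 -np 8 ./build/all_reduce_perf \
    -b 3M -e 4G -G 10 -f 2 -d int32 -R 0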

Please make sure the nvidia_peermem driver is running on your machine. https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md#prerequisites
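A quick check with standard Linux module tooling (not mscclpp-specific):

# Verify the nvidia_peermem kernel module is loaded; load it if missing.
lsmod | grep nvidia_peermem || sudo modprobe nvidia_peermem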

Oh, I missed that you are using H800 GPUs. Then your numbers already make sense.