microsoft/mscclpp

[Bug]Meet IB problem in single node experiment?

TonyWu199 opened this issue · 1 comments

Hi developers,
This is really nice work, and I am trying to reproduce it on one node with 8 GPUs. However, I hit the IB (InfiniBand) errors below in both the unit tests (UT) and the collective communication test.

The UT

command:
mpirun -np 2 ./test/mp_unit_tests
[image: screenshot of the error output]

The allreduce test in C++

command:
mpirun --bind-to numa -np 8 ./test/mscclpp-test/allreduce_test_perf -b 3m -e 48m -G 100 -n 100 -w 20 -f 2 -k 5
[image: screenshot of the error output]
As I understand it, GPU communication within a single node should not depend on IB, right?
Could you help fix this problem, or suggest a workaround?

Please refer to this comment: #254 (comment)