aws/aws-ofi-nccl

NCCL Cannot Find Tuner Symbols. Need to Export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/lib/libnccl-ofi-tuner.so

zhanwenchen opened this issue · 1 comment

Hello,

I followed the official AWS OFI NCCL plugin installation guide, but the tuner does not seem to be loaded by default. When I run the nccl-tests command from that guide:

/opt/amazon/openmpi/bin/mpirun \
-x LD_LIBRARY_PATH=/opt/nccl/build/lib:/usr/local/cuda/lib64:/opt/amazon/efa/lib:/opt/amazon/openmpi5/lib:/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH \
-x NCCL_DEBUG=INFO \
--hostfile my-hosts -n 8 -N 8 \
--mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
$HOME/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100

I got:

ip-172-31-18-239:755152:755194 [5] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
ip-172-31-18-239:755152:755194 [5] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.

Only after running export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/lib/libnccl-ofi-tuner.so do I get:

ip-172-31-18-239:754820:754863 [5] NCCL INFO TUNER/Plugin: Plugin name set by env to /opt/aws-ofi-nccl/lib/libnccl-ofi-tuner.so
ip-172-31-18-239:754820:754863 [5] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
ip-172-31-18-239:754820:754863 [5] NCCL INFO TUNER/Plugin: Using tuner plugin nccl_ofi_tuner

Yes, the current public instructions do not load the tuner. Setting NCCL_TUNER_PLUGIN as you have done is the correct way to load the tuner.
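As an alternative to exporting the variable in the launching shell, it can be forwarded to every rank with mpirun's -x flag, the same way the command in this issue already passes NCCL_DEBUG. The following is a sketch only, reusing the paths from the report. The function name nccl_test_with_tuner and the TUNER_LIB and DRY_RUN variables are hypothetical and exist only for this example. By default it prints the command instead of running it.

```shell
#!/bin/sh
# Assumption: the tuner library lives at the path reported in this issue.
TUNER_LIB="${TUNER_LIB:-/opt/aws-ofi-nccl/lib/libnccl-ofi-tuner.so}"

# Hypothetical helper: builds the nccl-tests command from the issue and adds
# -x NCCL_TUNER_PLUGIN=... so the variable reaches every rank.
# DRY_RUN=1 (the default in this sketch) prints the command; DRY_RUN=0 runs it.
nccl_test_with_tuner() {
  set -- /opt/amazon/openmpi/bin/mpirun \
    -x "LD_LIBRARY_PATH=/opt/nccl/build/lib:/usr/local/cuda/lib64:/opt/amazon/efa/lib:/opt/amazon/openmpi5/lib:/opt/aws-ofi-nccl/lib:${LD_LIBRARY_PATH:-}" \
    -x NCCL_DEBUG=INFO \
    -x "NCCL_TUNER_PLUGIN=${TUNER_LIB}" \
    --hostfile my-hosts -n 8 -N 8 \
    --mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
    "${HOME}/nccl-tests/build/all_reduce_perf" -b 8 -e 1G -f 2 -g 1 -c 1 -n 100

  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

nccl_test_with_tuner
```

Run it with DRY_RUN=0 once the printed command looks right for your hosts and install paths.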

Loading the tuner is not required to use the plugin, although the tuner improves performance in some configurations. We may update the public instructions in the future to include loading the tuner.
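To confirm whether a given run picked up the tuner, the NCCL_DEBUG=INFO output can be searched for the two messages quoted in the logs above. This is a minimal sketch, assuming the message text matches what is shown in this issue, and the wording may differ in other NCCL versions. The function name tuner_status is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads NCCL_DEBUG=INFO output on stdin and reports
# which tuner NCCL ended up using, based on the log lines quoted in this issue.
tuner_status() {
  log=$(cat)
  if printf '%s\n' "$log" | grep -q 'TUNER/Plugin: Using tuner plugin'; then
    echo "plugin"      # an external tuner plugin was loaded
  elif printf '%s\n' "$log" | grep -q 'using internal tuner instead'; then
    echo "internal"    # NCCL fell back to its built-in tuner
  else
    echo "unknown"     # no tuner message found (is NCCL_DEBUG=INFO set?)
  fi
}
```

For example, pipe the output of the mpirun command into it: `mpirun ... 2>&1 | tuner_status`.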