[Bug] libmscclpp_nccl fails linking using ROCm 6.0
corey-derochie-amd opened this issue · 2 comments
corey-derochie-amd commented
While commit 72b99a4 allows libmscclpp to compile using ROCm 6.0, there are still linker errors in libmscclpp_nccl:
ld.lld: error: duplicate symbol: __float2bfloat16(float)
>>> defined at executor.cc
>>> ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__float2bfloat16(float))
>>> defined at allreduce.hpp
>>> CMakeFiles/mscclpp_nccl_obj.dir/src/allreduce.hpp.o:(.text+0x0)
ld.lld: error: duplicate symbol: __bfloat1622float2(__hip_bfloat162)
>>> defined at executor.cc
>>> ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__bfloat1622float2(__hip_bfloat162))
>>> defined at allreduce.hpp
>>> CMakeFiles/mscclpp_nccl_obj.dir/src/allreduce.hpp.o:(.text+0x40)
ld.lld: error: duplicate symbol: __double2bfloat16(double)
>>> defined at executor.cc
>>> ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__double2bfloat16(double))
>>> defined at allreduce.hpp
>>> CMakeFiles/mscclpp_nccl_obj.dir/src/allreduce.hpp.o:(.text+0x60)
ld.lld: error: duplicate symbol: __float22bfloat162_rn(HIP_vector_type<float, 2u>)
>>> defined at executor.cc
>>> ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__float22bfloat162_rn(HIP_vector_type<float, 2u>))
>>> defined at allreduce.hpp
>>> CMakeFiles/mscclpp_nccl_obj.dir/src/allreduce.hpp.o:(.text+0xA0)
ld.lld: error: duplicate symbol: __high2float(__hip_bfloat162)
>>> defined at executor.cc
>>> ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__high2float(__hip_bfloat162))
>>> defined at allreduce.hpp
>>> CMakeFiles/mscclpp_nccl_obj.dir/src/allreduce.hpp.o:(.text+0x120)
ld.lld: error: duplicate symbol: __low2float(__hip_bfloat162)
>>> defined at executor.cc
>>> ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__low2float(__hip_bfloat162))
>>> defined at allreduce.hpp
>>> CMakeFiles/mscclpp_nccl_obj.dir/src/allreduce.hpp.o:(.text+0x130)
ld.lld: error: duplicate symbol: __float2bfloat16(float)
>>> defined at executor.cc
>>> ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__float2bfloat16(float))
>>> defined at nccl.cu
>>> CMakeFiles/mscclpp_nccl_obj.dir/src/nccl.cu.o:(.text+0x0)
ld.lld: error: duplicate symbol: __bfloat1622float2(__hip_bfloat162)
>>> defined at executor.cc
>>> ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__bfloat1622float2(__hip_bfloat162))
>>> defined at nccl.cu
>>> CMakeFiles/mscclpp_nccl_obj.dir/src/nccl.cu.o:(.text+0x40)
ld.lld: error: duplicate symbol: __double2bfloat16(double)
>>> defined at executor.cc
>>> ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__double2bfloat16(double))
>>> defined at nccl.cu
>>> CMakeFiles/mscclpp_nccl_obj.dir/src/nccl.cu.o:(.text+0x60)
ld.lld: error: duplicate symbol: __float22bfloat162_rn(HIP_vector_type<float, 2u>)
>>> defined at executor.cc
>>> ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__float22bfloat162_rn(HIP_vector_type<float, 2u>))
>>> defined at nccl.cu
>>> CMakeFiles/mscclpp_nccl_obj.dir/src/nccl.cu.o:(.text+0xA0)
ld.lld: error: duplicate symbol: __high2float(__hip_bfloat162)
>>> defined at executor.cc
>>> ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__high2float(__hip_bfloat162))
>>> defined at nccl.cu
>>> CMakeFiles/mscclpp_nccl_obj.dir/src/nccl.cu.o:(.text+0x120)
ld.lld: error: duplicate symbol: __low2float(__hip_bfloat162)
>>> defined at executor.cc
>>> ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__low2float(__hip_bfloat162))
>>> defined at nccl.cu
>>> CMakeFiles/mscclpp_nccl_obj.dir/src/nccl.cu.o:(.text+0x130)
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
gmake[5]: *** [apps/nccl/CMakeFiles/mscclpp_nccl.dir/build.make:145: apps/nccl/libmscclpp_nccl.so.0.5.2] Error 1
gmake[4]: *** [CMakeFiles/Makefile2:379: apps/nccl/CMakeFiles/mscclpp_nccl.dir/all] Error 2
gmake[4]: *** Waiting for unfinished jobs....
[100%] Built target check-format-cpp
gmake[3]: *** [Makefile:139: all] Error 2
gmake[2]: *** [CMakeFiles/mscclpp_nccl-download.dir/build.make:86: mscclpp_nccl-download-prefix/src/mscclpp_nccl-download-stamp/mscclpp_nccl-download-build] Error 2
gmake[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/mscclpp_nccl-download.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2
This does not appear to be an issue with later versions of ROCm.
chhwang commented
Hi @corey-derochie-amd, the team has investigated this before, and it is very tricky to fix from the mscclpp side. Instead, we apply the following ROCm patch to include/hip/amd_detail/amd_hip_bf16.h to avoid this issue on ROCm 6.0.
97c97
< #define __HOST_DEVICE__ __device__
---
> #define __HOST_DEVICE__ __device__ static
100c100
< #define __HOST_DEVICE__ __host__ __device__
---
> #define __HOST_DEVICE__ __host__ __device__ static inline
This change is already included in ROCm 6.1.
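For anyone landing here later, a minimal sketch of why the patch helps. The log shows the same bf16 helpers defined in executor.cc.o, allreduce.hpp.o and nccl.cu.o, which is what happens when a header defines non-inline functions with external linkage and several translation units include it. Adding `static` (or `static inline`) gives each translation unit its own private copy, so the linker no longer sees clashing definitions. The names below (`DEMO_HOST_DEVICE`, `demo_float_to_bf16_bits`) are hypothetical stand-ins, not the actual contents of amd_hip_bf16.h, and the conversion simply truncates; it is only there to give the function a body.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the __HOST_DEVICE__ macro touched by the patch.
// With plain external linkage (no `static`, no `inline`), every translation
// unit that includes this header would emit its own strong definition of the
// function below, and the linker would report a duplicate symbol.
// `static inline` gives the function internal linkage, so each translation
// unit keeps a private copy and the definitions never clash at link time.
#define DEMO_HOST_DEVICE static inline

// Imagine this definition living in a header that is included by several
// source files, as the bf16 helpers in the log are.
// Assumption for illustration only: truncating conversion, not the rounding
// that a real bf16 conversion routine would normally perform.
DEMO_HOST_DEVICE std::uint16_t demo_float_to_bf16_bits(float value) {
  std::uint32_t bits = 0;
  std::memcpy(&bits, &value, sizeof(bits));       // reinterpret float as raw bits
  return static_cast<std::uint16_t>(bits >> 16);  // keep the upper 16 bits
}
```

With this definition, `demo_float_to_bf16_bits(1.0f)` returns `0x3F80`, the upper half of the IEEE 754 pattern `0x3F800000`. The trade-off of `static` is that each object file carries its own copy of the code, which is usually acceptable for small conversion helpers like these.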
corey-derochie-amd commented
Thanks, @chhwang.