USE_DIST_KVSTORE triggers "undefined reference to 'void mxnet::op::ElemwiseBinaryOp::DnsCsrDnsOp'" linker error
jens-maus opened this issue · 3 comments
Description
After switching to Ubuntu 22.04 with the latest gcc/g++ v11 and CUDA 11.7 with NVIDIA driver 515.65.01 for Tesla V100S GPU cards, I tried to compile mxnet 1.9.1 for our new environment because we need to get the R-package installed/updated as well. However, while most of the mxnet build seems to succeed, the build unfortunately stops right at trying to link im2rec with the error message shown in the next section. Trying to skip the im2rec build ends up in similar linker errors for other tools, for which I could not find any solution. Looking at similar issue tickets like #18761 and #18357 also did not lead to a fix we could apply.
Any help in trying to solve this issue would be highly appreciated.
Error Message
$ cmake --build . --parallel 1
Consolidate compiler generated dependencies of target objects
[ 8%] Built target objects
[ 8%] Built target libzmq-static
Consolidate compiler generated dependencies of target dnnl_cpu_x64
[ 27%] Built target dnnl_cpu_x64
Consolidate compiler generated dependencies of target dnnl_common
[ 32%] Built target dnnl_common
Consolidate compiler generated dependencies of target dnnl_cpu
[ 41%] Built target dnnl_cpu
[ 41%] Built target dnnl
Consolidate compiler generated dependencies of target intgemm
[ 41%] Built target intgemm
[ 42%] Built target libomp-needed-headers
Consolidate compiler generated dependencies of target omp
[ 45%] Built target omp
Consolidate compiler generated dependencies of target dmlc
[ 46%] Built target dmlc
[ 46%] Built target proto_python
Consolidate compiler generated dependencies of target pslite
[ 47%] Built target pslite
Consolidate compiler generated dependencies of target mxnet
[ 94%] Built target mxnet
Consolidate compiler generated dependencies of target customop_lib
[ 94%] Built target customop_lib
Consolidate compiler generated dependencies of target transposecsr_lib
[ 94%] Built target transposecsr_lib
Consolidate compiler generated dependencies of target transposerowsp_lib
[ 95%] Built target transposerowsp_lib
Consolidate compiler generated dependencies of target subgraph_lib
[ 95%] Built target subgraph_lib
Consolidate compiler generated dependencies of target pass_lib
[ 95%] Built target pass_lib
Consolidate compiler generated dependencies of target customop_gpu_lib
[ 95%] Built target customop_gpu_lib
Consolidate compiler generated dependencies of target im2rec
[ 95%] Linking CXX executable im2rec
/usr/bin/ld: libmxnet.so: undefined reference to `void mxnet::op::ElemwiseBinaryOp::DnsCsrDnsOp<mxnet::op::mshadow_op::plus>(mshadow::Stream<mshadow::gpu>*, nnvm::NodeAttrs const&, mxnet::OpContext const&, mxnet::NDArray const&, mxnet::NDArray const&, mxnet::OpReqType, mxnet::NDArray const&, bool)'
/usr/bin/ld: libmxnet.so: undefined reference to `void mxnet::op::ElemwiseBinaryOp::DnsCsrDnsOp<mxnet::op::mshadow_op::minus>(mshadow::Stream<mshadow::gpu>*, nnvm::NodeAttrs const&, mxnet::OpContext const&, mxnet::NDArray const&, mxnet::NDArray const&, mxnet::OpReqType, mxnet::NDArray const&, bool)'
collect2: error: ld returned 1 exit status
gmake[2]: *** [CMakeFiles/im2rec.dir/build.make:130: im2rec] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:749: CMakeFiles/im2rec.dir/all] Error 2
gmake: *** [Makefile:146: all] Error 2
To Reproduce
This is the config.cmake file we are using to build mxnet 1.9.1 in our Ubuntu 22.04 environment:
set(CMAKE_BUILD_TYPE "Distribution" CACHE STRING "Build type")
set(CFLAGS "-mno-avx" CACHE STRING "CFLAGS")
set(CXXFLAGS "-mno-avx" CACHE STRING "CXXFLAGS")
set(USE_CUDA ON CACHE BOOL "Build with CUDA support")
set(USE_CUDNN ON CACHE BOOL "Build with CUDA support")
set(USE_NCCL ON CACHE BOOL "Build with NCCL support")
set(USE_OPENCV ON CACHE BOOL "Build with OpenCV support")
set(USE_OPENMP ON CACHE BOOL "Build with Openmp support")
set(USE_MKL_IF_AVAILABLE OFF CACHE BOOL "Use Intel MKL if found")
set(USE_MKLDNN ON CACHE BOOL "Build with MKL-DNN support")
set(USE_LAPACK ON CACHE BOOL "Build with lapack support")
set(USE_TVM_OP OFF CACHE BOOL "Enable use of TVM operator build system.")
set(USE_SSE ON CACHE BOOL "Build with x86 SSE instruction support")
set(USE_F16C OFF CACHE BOOL "Build with x86 F16C instruction support")
set(USE_LIBJPEG_TURBO ON CACHE BOOL "Build with libjpeg-turbo")
set(USE_DIST_KVSTORE ON CACHE BOOL "Build with DIST_KVSTORE support")
set(MXNET_CUDA_ARCH "5.0;6.0;7.0;8.0;8.6" CACHE STRING "Cuda architectures")
set(CMAKE_CUDA_COMPILER "/usr/local/cuda-11.7/bin/nvcc" CACHE STRING "Cuda compiler")
set(OPENMP_FILECHECK_EXECUTABLE "/usr/lib/llvm-14/bin/FileCheck")
set(OPENMP_LLVM_LIT_EXECUTABLE "/usr/lib/llvm-14/build/utils/lit/lit.py")
set(USE_CPP_PACKAGE ON CACHE BOOL "Build C++ Package")
set(NCCL_ROOT "/usr/local/nccl" CACHE BOOL "NCCL install path. Supports autodetection.")
Steps to reproduce
cmake -DCMAKE_INSTALL_PREFIX=/usr/local/mxnet-1.9.1 ..
cmake --build . --parallel 20
What have you tried to solve it?
Environment
n/a
Please note that I am seeing the same/similar issue when trying to compile the older mxnet 1.9.0 in the same environment:
$ cmake --build . --parallel 1
Consolidate compiler generated dependencies of target objects
[ 9%] Built target objects
[ 9%] Built target libzmq-static
Consolidate compiler generated dependencies of target dnnl_cpu_x64
[ 25%] Built target dnnl_cpu_x64
Consolidate compiler generated dependencies of target dnnl_common
[ 28%] Built target dnnl_common
Consolidate compiler generated dependencies of target dnnl_cpu
[ 35%] Built target dnnl_cpu
[ 36%] Built target dnnl
Consolidate compiler generated dependencies of target intgemm
[ 36%] Built target intgemm
[ 37%] Built target libomp-needed-headers
Consolidate compiler generated dependencies of target omp
[ 40%] Built target omp
[ 40%] Built target proto_python
Consolidate compiler generated dependencies of target pslite
[ 40%] Built target pslite
Consolidate compiler generated dependencies of target mxnet_static
[ 91%] Built target mxnet_static
Consolidate compiler generated dependencies of target dmlc
[ 92%] Built target dmlc
Consolidate compiler generated dependencies of target mxnet
[ 92%] Built target mxnet
Consolidate compiler generated dependencies of target customop_lib
[ 92%] Built target customop_lib
Consolidate compiler generated dependencies of target transposecsr_lib
[ 92%] Built target transposecsr_lib
Consolidate compiler generated dependencies of target transposerowsp_lib
[ 93%] Built target transposerowsp_lib
Consolidate compiler generated dependencies of target subgraph_lib
[ 93%] Built target subgraph_lib
Consolidate compiler generated dependencies of target pass_lib
[ 94%] Built target pass_lib
Consolidate compiler generated dependencies of target customop_gpu_lib
[ 94%] Built target customop_gpu_lib
Consolidate compiler generated dependencies of target im2rec
[ 94%] Linking CXX executable im2rec
/usr/bin/ld: libmxnet.a(elemwise_binary_broadcast_op_basic.cu.o): in function `void mxnet::op::BinaryBroadcastComputeDenseEx<mshadow::gpu, mxnet::op::mshadow_op::minus>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)':
tmpxft_001b66c6_00000000-6_elemwise_binary_broadcast_op_basic.compute_80.cudafe1.cpp:(.text._ZN5mxnet2op29BinaryBroadcastComputeDenseExIN7mshadow3gpuENS0_10mshadow_op5minusEEEvRKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_7NDArrayESaISE_EERKSD_INS_9OpReqTypeESaISJ_EESI_[_ZN5mxnet2op29BinaryBroadcastComputeDenseExIN7mshadow3gpuENS0_10mshadow_op5minusEEEvRKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_7NDArrayESaISE_EERKSD_INS_9OpReqTypeESaISJ_EESI_]+0x34c): undefined reference to `void mxnet::op::ElemwiseBinaryOp::DnsCsrDnsOp<mxnet::op::mshadow_op::minus>(mshadow::Stream<mshadow::gpu>*, nnvm::NodeAttrs const&, mxnet::OpContext const&, mxnet::NDArray const&, mxnet::NDArray const&, mxnet::OpReqType, mxnet::NDArray const&, bool)'
/usr/bin/ld: libmxnet.a(elemwise_binary_broadcast_op_basic.cu.o): in function `void mxnet::op::BinaryBroadcastComputeDenseEx<mshadow::gpu, mxnet::op::mshadow_op::plus>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)':
tmpxft_001b66c6_00000000-6_elemwise_binary_broadcast_op_basic.compute_80.cudafe1.cpp:(.text._ZN5mxnet2op29BinaryBroadcastComputeDenseExIN7mshadow3gpuENS0_10mshadow_op4plusEEEvRKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_7NDArrayESaISE_EERKSD_INS_9OpReqTypeESaISJ_EESI_[_ZN5mxnet2op29BinaryBroadcastComputeDenseExIN7mshadow3gpuENS0_10mshadow_op4plusEEEvRKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_7NDArrayESaISE_EERKSD_INS_9OpReqTypeESaISJ_EESI_]+0x14de): undefined reference to `void mxnet::op::ElemwiseBinaryOp::DnsCsrDnsOp<mxnet::op::mshadow_op::plus>(mshadow::Stream<mshadow::gpu>*, nnvm::NodeAttrs const&, mxnet::OpContext const&, mxnet::NDArray const&, mxnet::NDArray const&, mxnet::OpReqType, mxnet::NDArray const&, bool)'
collect2: error: ld returned 1 exit status
gmake[2]: *** [CMakeFiles/im2rec.dir/build.make:125: im2rec] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:783: CMakeFiles/im2rec.dir/all] Error 2
gmake: *** [Makefile:146: all] Error 2
Does anyone have a clue where this might originate from?
After some further investigation I found the config option that seems to trigger the mentioned undefined reference linker error. It is the USE_DIST_KVSTORE config option, which, if enabled, seems to trigger these error messages in this environment. Thus, setting it to OFF
...
set(USE_DIST_KVSTORE OFF CACHE BOOL "Build with DIST_KVSTORE support")
seems to help and the linker error goes away. However, it is still unclear why this config option triggers the mentioned linker error.
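To compare the two configurations, it may help to check whether the DnsCsrDnsOp symbols are listed as undefined in the built library itself. The following is only a rough sketch: filter_undefined is a hypothetical helper name, and the library path in the usage comment is an assumption about the build directory layout. It filters the output of GNU nm for undefined (type U) entries containing a given substring.

```shell
# Hypothetical helper (not part of mxnet): reads `nm -C` output on stdin and
# prints only the undefined symbols (type "U") that contain the given substring.
# nm prints undefined symbols without an address, so the first
# whitespace-separated field of such a line is the type letter "U".
filter_undefined() {
  awk -v pat="$1" '$1 == "U" && index($0, pat) > 0'
}

# Example usage (the library path is an assumption, adjust to your build dir):
#   nm -D -C ./libmxnet.so | filter_undefined 'ElemwiseBinaryOp::DnsCsrDnsOp'
```

If the GPU instantiations show up as undefined with USE_DIST_KVSTORE set to ON but not with OFF, that would at least narrow the problem down to which object files end up in the library in each configuration.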