DeepRec-AI/DeepRec

[BUILD] build failed with GPU configuration

zzhong44 opened this issue · 3 comments

System information

  • OS CPU: AMD EPYC 7V12 64-Core Processor
  • Build image: alideeprec/deeprec-build:deeprec-dev-gpu-py38-cu116-ubuntu20.04, run with nvidia-docker
  • OS Platform and Distribution (e.g., Linux Ubuntu 20.04): CentOS Linux release 7.9.2009 (Core)
  • DeepRec version or commit id: 29ecde4
  • Python version: 3.8.10
  • Bazel version (if compiling from source): 5.3.1 (built from source)
  • GCC/Compiler version (if compiling from source): 9.4
  • CUDA/cuDNN version: 11.6
  • GPU: Tesla T4
  • GPU Driver version: 470.161.03

.tf_configure.bazelrc:

build --python_path="/usr/bin/python"  # python 3.8.10
build:xla --define with_xla_support=true
build --config=xla
build:star --define with_star_support=true
build --config=star
build:pmem --define with_pmem_support=true
build:parquet_dataset --define with_parquet_dataset_support=true
build --config=parquet_dataset
build:api_compatible --define with_api_compatible=true
build --action_env TF_USE_CCACHE="0"
build --action_env CUDA_TOOLKIT_PATH="/usr/local/cuda"
build --action_env TF_CUDA_COMPUTE_CAPABILITIES="7.5"
build --action_env LD_LIBRARY_PATH="/usr/local/cuda/compat:/usr/local/nvidia/lib:/usr/local/nvidia/lib64"
build --action_env GCC_HOST_COMPILER_PATH="/usr/bin/gcc"
build --config=cuda
build:opt --copt=-march=native
build:opt --copt=-Wno-sign-compare
build:opt --host_copt=-march=native
build:opt --define with_default_optimizations=true
build:v2 --define=tf_api_version=2
test --flaky_test_attempts=3
test --test_size_filters=small,medium
test --test_tag_filters=-benchmark-test,-no_oss,-oss_serial
test --build_tag_filters=-benchmark-test,-no_oss
test --test_tag_filters=-gpu
test --build_tag_filters=-gpu
build --action_env TF_CONFIGURE_IOS="0"

build --config=noaws
build --config=nogcp
build --config=noignite
build --config=nokafka
build --config=numa

Describe the problem
Building with bazel build -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package fails with this error:
[screenshot of the error output; image not preserved]

The error mentions std::__cxx11::basic_string, so I tried building with bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package, which also fails:
[screenshot of the error output; image not preserved]
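A reference to std::__cxx11::basic_string usually points to a libstdc++ ABI mismatch between objects that were compiled with different _GLIBCXX_USE_CXX11_ABI settings. One rough way to see which ABI a built library references is to search its dynamic symbols for the __cxx11 namespace. The following is a minimal sketch, not part of DeepRec: the helper name uses_cxx11_abi and the example library path are assumptions, and it presumes binutils' nm is installed.

```shell
#!/bin/sh
# Reads symbol names on stdin (for example the output of `nm -D`) and reports
# whether any of them reference the libstdc++ C++11 ABI namespace.
# Hypothetical helper for illustration only.
uses_cxx11_abi() {
    if grep -q '__cxx11'; then
        echo "cxx11-abi"
    else
        # No __cxx11 symbols: either the old ABI, or no std::string/std::list
        # types appear in the exported symbols at all.
        echo "old-abi-or-no-std-strings"
    fi
}

# Example (the path is an assumption about where the build places the library):
#   nm -D bazel-bin/tensorflow/libtensorflow_framework.so.1 | uses_cxx11_abi
```

Running the same check on every library that gets linked together can show whether some of them were built with a different ABI setting than the rest.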

But if I comment out these lines:

#build --action_env CUDA_TOOLKIT_PATH="/usr/local/cuda"
#build --action_env TF_CUDA_COMPUTE_CAPABILITIES="7.5"
#build --action_env LD_LIBRARY_PATH="/usr/local/cuda/compat:/usr/local/nvidia/lib:/usr/local/nvidia/lib64"
#build --action_env GCC_HOST_COMPILER_PATH="/usr/bin/gcc"
#build --config=cuda

and build the CPU version with either bazel build -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package or bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package, it compiles.

With the GPU configuration, the build also succeeds if I add --config=monolithic, for example bazel build --config=monolithic --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package. However, with this option //tensorflow:libtensorflow_framework.so is not generated, which causes the subsequent SOK build to fail.
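Since the monolithic build does not produce the shared framework library, it can help to check for it explicitly before starting a downstream build that links against it. A minimal sketch, assuming the library lands under bazel-bin/tensorflow; the helper name has_framework_lib is hypothetical.

```shell
#!/bin/sh
# Succeeds if the given directory contains a file matching
# libtensorflow_framework.so* (this also covers versioned names such as .so.1).
# Hypothetical helper; the default directory is an assumption.
has_framework_lib() {
    dir="${1:-bazel-bin/tensorflow}"
    for f in "$dir"/libtensorflow_framework.so*; do
        # If the glob matched nothing, $f is the literal pattern and -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}

# Example:
#   has_framework_lib bazel-bin/tensorflow || echo "libtensorflow_framework.so missing" >&2
```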

What version of Bazel are you using?

5.3.1

The build succeeds with Bazel 0.26.1.
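Because the outcome here depended on the Bazel version, a small guard before the build can catch a mismatched Bazel early. This sketch parses the "Build label:" line printed by bazel version; the helper names bazel_build_label and check_bazel_version are hypothetical, and 0.26.1 is simply the version reported to work in this thread.

```shell
#!/bin/sh
# Extracts the version from the "Build label:" line of `bazel version`
# output read on stdin. Hypothetical helper for illustration.
bazel_build_label() {
    sed -n 's/^Build label: *//p' | head -n 1
}

# Returns non-zero with a message if the actual version differs from the
# expected one.
check_bazel_version() {
    expected="$1"
    actual="$2"
    if [ "$actual" != "$expected" ]; then
        echo "expected Bazel $expected, found ${actual:-unknown}" >&2
        return 1
    fi
}

# Example:
#   check_bazel_version 0.26.1 "$(bazel version | bazel_build_label)"
```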