gloo cannot find system nccl
jinserk opened this issue · 1 comment
I've posted this to the PyTorch repo, but it is clearly related to the gloo project as well, so I'm reporting it here too.
🐛 Bug
When I build PyTorch from the latest repo, the build fails with the following CMake warning and compile error:
CMake Warning (dev) at cmake/Dependencies.cmake:846 (add_dependencies):
Policy CMP0046 is not set: Error on non-existent dependency in
add_dependencies. Run "cmake --help-policy CMP0046" for policy details.
Use the cmake_policy command to set the policy and suppress this warning.
The dependency target "nccl_external" of target "gloo_cuda" does not exist.
Call Stack (most recent call first):
CMakeLists.txt:201 (include)
This warning is for project developers. Use -Wno-dev to suppress it.
In file included from /u3/setup/pytorch/pytorch/third_party/gloo/gloo/nccl/nccl.cu:10:0:
/u3/setup/pytorch/pytorch/third_party/gloo/gloo/nccl/nccl.h:12:18: fatal error: nccl.h: No such file or directory
#include <nccl.h>
^
compilation terminated.
[ 20%] Linking CXX executable ../../bin/c10_DeviceGuard_test
[ 20%] Building CXX object c10/test/CMakeFiles/c10_logging_test.dir/logging_test.cpp.o
CMake Error at gloo_cuda_generated_nccl.cu.o.Release.cmake:215 (message):
Error generating
/u3/setup/pytorch/pytorch/build/third_party/gloo/gloo/CMakeFiles/gloo_cuda.dir/nccl/./gloo_cuda_generated_nccl.cu.o
make[2]: *** [third_party/gloo/gloo/CMakeFiles/gloo_cuda.dir/nccl/gloo_cuda_generated_nccl.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
To Reproduce
Run python setup.py bdist_wheel (this requires the wheel package from pip).
Expected behavior
Environment
PyTorch version: 1.0.0a0+60e7d04
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
CMake version: version 2.8.12.2
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: GPU 0: GeForce GTX 1070
Nvidia driver version: 410.72
cuDNN version: Probably one of the following:
/usr/local/cudnn_6.0-cuda_8.0/lib64/libcudnn.so.6.0.21
/usr/local/cudnn_6.0-cuda_8.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.0.3-cuda_9.0/lib64/libcudnn.so.7.0.3
/usr/local/cudnn_7.0.3-cuda_9.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.0.4-cuda_8.0/lib64/libcudnn.so.7.0.4
/usr/local/cudnn_7.0.4-cuda_8.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.0.4-cuda_9.0/lib64/libcudnn.so.7.0.4
/usr/local/cudnn_7.0.4-cuda_9.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.0.5+cuda_9.1/lib64/libcudnn.so.7.0.5
/usr/local/cudnn_7.0.5+cuda_9.1/lib64/libcudnn_static.a
/usr/local/cudnn_7.1.3+cuda_9.1/lib64/libcudnn.so.7.1.3
/usr/local/cudnn_7.1.3+cuda_9.1/lib64/libcudnn_static.a
/usr/local/cudnn_7.1.4+cuda_9.0/lib64/libcudnn.so.7.1.4
/usr/local/cudnn_7.1.4+cuda_9.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.1.4+cuda_9.2/lib64/libcudnn.so.7.1.4
/usr/local/cudnn_7.1.4+cuda_9.2/lib64/libcudnn_static.a
/usr/local/cudnn_7.2.1+cuda_9.2/lib64/libcudnn.so.7.2.1
/usr/local/cudnn_7.2.1+cuda_9.2/lib64/libcudnn_static.a
/usr/local/cudnn_7.3.0+cuda_10.0/lib64/libcudnn.so.7.3.0
/usr/local/cudnn_7.3.0+cuda_10.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.3.1+cuda_10.0/lib64/libcudnn.so.7.3.1
/usr/local/cudnn_7.3.1+cuda_10.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.4.1.5+cuda_10.0/lib64/libcudnn.so.7.3.1
/usr/local/cudnn_7.4.1.5+cuda_10.0/lib64/libcudnn.so.7.4.1
/usr/local/cudnn_7.4.1.5+cuda_10.0/lib64/libcudnn_static.a
Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
Additional context
This is a PyTorch build system issue. Gloo takes a hint from the PyTorch build system as to where NCCL is.
Dup of pytorch/pytorch#14537.
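For anyone hitting the same "nccl.h: No such file or directory" error, a quick way to check whether a system NCCL header is visible at all is a small script like the one below. This is only a sketch: the function name find_nccl_header and the directory list are hypothetical and not part of gloo or the PyTorch build system, so adjust the candidate paths to wherever NCCL is installed on your machine.

```python
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical candidate include directories; these are assumptions,
# not paths that the PyTorch or gloo build is known to search.
DEFAULT_INCLUDE_DIRS = [
    "/usr/include",
    "/usr/local/include",
    "/usr/local/cuda/include",
]


def find_nccl_header(include_dirs: Iterable[str]) -> Optional[Path]:
    """Return the first directory that contains nccl.h, or None if none does."""
    for directory in include_dirs:
        candidate = Path(directory) / "nccl.h"
        if candidate.is_file():
            return candidate.parent
    return None


if __name__ == "__main__":
    found = find_nccl_header(DEFAULT_INCLUDE_DIRS)
    if found is None:
        print("nccl.h not found in any candidate directory")
    else:
        print(f"nccl.h found in {found}")
```

If the header is not found in any of the directories you expect, the compile error above is consistent with the build not being pointed at a usable NCCL install.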