facebookincubator/gloo

Gloo cannot find system NCCL

jinserk opened this issue · 1 comment

I've already posted this to the PyTorch repo, but it is clearly related to the gloo project, so I'm reporting it here too.

🐛 Bug

When I build PyTorch from the latest master, the build fails with the following CMake warning and compile error:

CMake Warning (dev) at cmake/Dependencies.cmake:846 (add_dependencies):
  Policy CMP0046 is not set: Error on non-existent dependency in
  add_dependencies.  Run "cmake --help-policy CMP0046" for policy details.
  Use the cmake_policy command to set the policy and suppress this warning.

  The dependency target "nccl_external" of target "gloo_cuda" does not exist.
Call Stack (most recent call first):
  CMakeLists.txt:201 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.
In file included from /u3/setup/pytorch/pytorch/third_party/gloo/gloo/nccl/nccl.cu:10:0:
/u3/setup/pytorch/pytorch/third_party/gloo/gloo/nccl/nccl.h:12:18: fatal error: nccl.h: No such file or directory
 #include <nccl.h>
                  ^
compilation terminated.
[ 20%] Linking CXX executable ../../bin/c10_DeviceGuard_test
[ 20%] Building CXX object c10/test/CMakeFiles/c10_logging_test.dir/logging_test.cpp.o
CMake Error at gloo_cuda_generated_nccl.cu.o.Release.cmake:215 (message):
  Error generating
  /u3/setup/pytorch/pytorch/build/third_party/gloo/gloo/CMakeFiles/gloo_cuda.dir/nccl/./gloo_cuda_generated_nccl.cu.o


make[2]: *** [third_party/gloo/gloo/CMakeFiles/gloo_cuda.dir/nccl/gloo_cuda_generated_nccl.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....

To Reproduce

Run python setup.py bdist_wheel (this requires the wheel package, installed from pip).

Expected behavior

Environment

PyTorch version: 1.0.0a0+60e7d04
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
CMake version: version 2.8.12.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: GPU 0: GeForce GTX 1070
Nvidia driver version: 410.72
cuDNN version: Probably one of the following:
/usr/local/cudnn_6.0-cuda_8.0/lib64/libcudnn.so.6.0.21
/usr/local/cudnn_6.0-cuda_8.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.0.3-cuda_9.0/lib64/libcudnn.so.7.0.3
/usr/local/cudnn_7.0.3-cuda_9.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.0.4-cuda_8.0/lib64/libcudnn.so.7.0.4
/usr/local/cudnn_7.0.4-cuda_8.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.0.4-cuda_9.0/lib64/libcudnn.so.7.0.4
/usr/local/cudnn_7.0.4-cuda_9.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.0.5+cuda_9.1/lib64/libcudnn.so.7.0.5
/usr/local/cudnn_7.0.5+cuda_9.1/lib64/libcudnn_static.a
/usr/local/cudnn_7.1.3+cuda_9.1/lib64/libcudnn.so.7.1.3
/usr/local/cudnn_7.1.3+cuda_9.1/lib64/libcudnn_static.a
/usr/local/cudnn_7.1.4+cuda_9.0/lib64/libcudnn.so.7.1.4
/usr/local/cudnn_7.1.4+cuda_9.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.1.4+cuda_9.2/lib64/libcudnn.so.7.1.4
/usr/local/cudnn_7.1.4+cuda_9.2/lib64/libcudnn_static.a
/usr/local/cudnn_7.2.1+cuda_9.2/lib64/libcudnn.so.7.2.1
/usr/local/cudnn_7.2.1+cuda_9.2/lib64/libcudnn_static.a
/usr/local/cudnn_7.3.0+cuda_10.0/lib64/libcudnn.so.7.3.0
/usr/local/cudnn_7.3.0+cuda_10.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.3.1+cuda_10.0/lib64/libcudnn.so.7.3.1
/usr/local/cudnn_7.3.1+cuda_10.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.4.1.5+cuda_10.0/lib64/libcudnn.so.7.3.1
/usr/local/cudnn_7.4.1.5+cuda_10.0/lib64/libcudnn.so.7.4.1
/usr/local/cudnn_7.4.1.5+cuda_10.0/lib64/libcudnn_static.a

Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect

Additional context

This is a PyTorch build system issue. Gloo takes a hint from the PyTorch build system as to where NCCL is.
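Before rebuilding, it can help to confirm where (or whether) nccl.h is actually visible on the machine. The following is a minimal sketch, not part of either build system: the environment variable names NCCL_INCLUDE_DIR and NCCL_ROOT_DIR are assumed to be the hints the PyTorch build reads and should be checked against your checkout, and the system directories listed are illustrative guesses. Both helper functions are hypothetical names introduced here.

```python
import os
from pathlib import Path
from typing import Iterable, List, Mapping, Optional


def candidate_include_dirs(env: Optional[Mapping[str, str]] = None) -> List[Path]:
    """Collect directories where nccl.h might live, in priority order."""
    env = os.environ if env is None else env
    dirs: List[Path] = []
    # Assumed hint variables; verify the names against your PyTorch checkout.
    if env.get("NCCL_INCLUDE_DIR"):
        dirs.append(Path(env["NCCL_INCLUDE_DIR"]))
    if env.get("NCCL_ROOT_DIR"):
        dirs.append(Path(env["NCCL_ROOT_DIR"]) / "include")
    # Common system locations (illustrative, not exhaustive).
    dirs += [
        Path("/usr/include"),
        Path("/usr/local/include"),
        Path("/usr/local/cuda/include"),
    ]
    return dirs


def find_nccl_header(dirs: Iterable[Path]) -> Optional[Path]:
    """Return the first existing nccl.h among the given directories, or None."""
    for d in dirs:
        header = Path(d) / "nccl.h"
        if header.is_file():
            return header
    return None


if __name__ == "__main__":
    found = find_nccl_header(candidate_include_dirs())
    print(found if found else "nccl.h not found in any candidate directory")
```

If the script reports no header, the compiler will likely fail on #include <nccl.h> in the same way as the log above, regardless of which project's CMake files are at fault.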

Duplicate of pytorch/pytorch#14537.