filaPro/oneformer3d

Error with nvidia driver version in Dockerfile

Opened this issue · 11 comments

Hi, I'm trying to test the model. When I run test.py, an error occurs:
(screenshot of the error)
Then I tried nvidia-smi and got:
(screenshot of nvidia-smi output)
After looking into the driver versions, I found them incompatible:
(screenshot of the version comparison)
Since I'm using the Dockerfile, this issue only happens inside this image, and I believe it is a problem within the Dockerfile, as nvidia-smi works fine on my host. Do you have any suggestions in this case?

Hi @K25801,
Which GPU are you using? Is it compatible with CUDA 11.6, which we are using in the Dockerfile?

Also, have you tried this advice from the Dockerfile?

# Feel free to skip nvidia-cuda-dev if minkowski installation is fine

Hi, I think the problem is caused by
nvidia-cuda-dev
There is no problem when skipping this line, but that makes the installation of MinkowskiEngine fail, like this:
[11/21] /opt/conda/bin/nvcc -I/opt/conda/lib/python3.10/site-packages/torch/include -I/opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.10/site-packages/torch/include/TH -I/opt/conda/lib/python3.10/site-packages/torch/include/THC -I/opt/conda/include -I/workspace/MinkowskiEngine/src -I/workspace/MinkowskiEngine/src/3rdparty -I/opt/conda/include/python3.10 -c -c /workspace/MinkowskiEngine/src/interpolation_gpu.cu -o /workspace/MinkowskiEngine/build/temp.linux-x86_64-cpython-310/workspace/MinkowskiEngine/src/interpolation_gpu.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' --expt-relaxed-constexpr --expt-extended-lambda -O3 -Xcompiler=-fno-gnu-unique -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++14
FAILED: /workspace/MinkowskiEngine/build/temp.linux-x86_64-cpython-310/workspace/MinkowskiEngine/src/interpolation_gpu.o
/opt/conda/bin/nvcc -I/opt/conda/lib/python3.10/site-packages/torch/include -I/opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.10/site-packages/torch/include/TH -I/opt/conda/lib/python3.10/site-packages/torch/include/THC -I/opt/conda/include -I/workspace/MinkowskiEngine/src -I/workspace/MinkowskiEngine/src/3rdparty -I/opt/conda/include/python3.10 -c -c /workspace/MinkowskiEngine/src/interpolation_gpu.cu -o /workspace/MinkowskiEngine/build/temp.linux-x86_64-cpython-310/workspace/MinkowskiEngine/src/interpolation_gpu.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' --expt-relaxed-constexpr --expt-extended-lambda -O3 -Xcompiler=-fno-gnu-unique -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++14
In file included from /workspace/MinkowskiEngine/src/allocators.cuh:36:0,
                 from /workspace/MinkowskiEngine/src/kernel_region.hpp:40,
                 from /workspace/MinkowskiEngine/src/coordinate_map.hpp:30,
                 from /workspace/MinkowskiEngine/src/interpolation_gpu.cu:26:
/opt/conda/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
 #include <cusolverDn.h>
          ^~~~~~~~~~~~~~
compilation terminated.

I'm using an RTX A6000 and a Quadro RTX 8000; here is my nvidia-smi info:
(screenshot of nvidia-smi output)

Yes, I'm using the Dockerfile.

fatal error: cusolverDn.h: No such file or directory

Yes, this is exactly the error, and it is why in some configurations we need to install extra CUDA headers from nvidia-cuda-dev. However, this package pulls in tons of dependencies, including another version of the nvidia drivers or something similar. You could maybe try installing libcusolver* instead of the whole nvidia-cuda-dev. I don't know a good solution, as unfortunately MinkowskiEngine has not been maintained for a couple of years now :(
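
Not something I have verified, but one way to try that without pulling the driver packages is apt's --no-install-recommends flag; the exact package name depends on which CUDA apt repository the base image has configured (the -11-6 suffix below is an assumption):

# sketch only: pull just the cuSOLVER headers instead of the whole nvidia-cuda-dev
apt-get update && apt-get install -y --no-install-recommends libcusolver-dev-11-6
# or keep nvidia-cuda-dev but stop apt from dragging in its recommended driver packages:
# apt-get install -y --no-install-recommends nvidia-cuda-dev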

Thanks for your update! I immediately tried
apt install libcusolver*

After it installed successfully, I tried to install MinkowskiEngine again, but got the same error.
[11/21] /opt/conda/bin/nvcc -I/opt/conda/lib/python3.10/site-packages/torch/include -I/opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.10/site-packages/torch/include/TH -I/opt/conda/lib/python3.10/site-packages/torch/include/THC -I/opt/conda/include -I/tmp/pip-req-build-_1eo4q2g/src -I/tmp/pip-req-build-_1eo4q2g/src/3rdparty -I/opt/conda/include/python3.10 -c -c /tmp/pip-req-build-_1eo4q2g/src/broadcast_gpu.cu -o /tmp/pip-req-build-_1eo4q2g/build/temp.linux-x86_64-cpython-310/tmp/pip-req-build-_1eo4q2g/src/broadcast_gpu.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' --expt-relaxed-constexpr --expt-extended-lambda -O3 -Xcompiler=-fno-gnu-unique -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++14
FAILED: /tmp/pip-req-build-_1eo4q2g/build/temp.linux-x86_64-cpython-310/tmp/pip-req-build-_1eo4q2g/src/broadcast_gpu.o
/opt/conda/bin/nvcc -I/opt/conda/lib/python3.10/site-packages/torch/include -I/opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.10/site-packages/torch/include/TH -I/opt/conda/lib/python3.10/site-packages/torch/include/THC -I/opt/conda/include -I/tmp/pip-req-build-_1eo4q2g/src -I/tmp/pip-req-build-_1eo4q2g/src/3rdparty -I/opt/conda/include/python3.10 -c -c /tmp/pip-req-build-_1eo4q2g/src/broadcast_gpu.cu -o /tmp/pip-req-build-_1eo4q2g/build/temp.linux-x86_64-cpython-310/tmp/pip-req-build-_1eo4q2g/src/broadcast_gpu.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' --expt-relaxed-constexpr --expt-extended-lambda -O3 -Xcompiler=-fno-gnu-unique -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++14
In file included from /tmp/pip-req-build-_1eo4q2g/src/allocators.cuh:36:0,
                 from /tmp/pip-req-build-_1eo4q2g/src/kernel_region.hpp:40,
                 from /tmp/pip-req-build-_1eo4q2g/src/coordinate_map.hpp:30,
                 from /tmp/pip-req-build-_1eo4q2g/src/broadcast_gpu.cu:28:
/opt/conda/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
 #include <cusolverDn.h>
          ^~~~~~~~~~~~~~
compilation terminated.

If I use nvidia-cuda-dev, the installation is fine, but nvidia-smi doesn't work.

Can such a GPU run the program? Is CUDA 11.6 the minimum requirement?
I ask because I always encounter this error in training: [RuntimeError: CUDA error: the provided PTX was compiled with an unsupported toolkit.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.]

CUDA 11.2 should be fine, and an RTX 3090 too. If you change the CUDA version, just make sure you also change it in the precompiled versions of pytorch, mmcv, torch-scatter, and spconv.
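
A quick way to check that the precompiled wheels agree with the toolkit inside the image (standard commands, nothing specific to this repo; the output values are just whatever your setup reports):

python -c "import torch; print(torch.__version__, torch.version.cuda)"   # CUDA version pytorch was built with
pip list | grep -iE "mmcv|spconv|torch-scatter"                          # versions of the other precompiled packages
nvcc -V                                                                  # toolkit inside the image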

I use an RTX 3090 and build the Docker image with the command docker build -t my-tag, with the Dockerfile in the current path. All is fine.
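
Spelled out with the build context path, and with a generic run command for completeness (the run line is just the standard form, not necessarily the exact command I used):

docker build -t my-tag .            # Dockerfile in the current directory
docker run --gpus all -it my-tag    # generic run command; select specific GPUs with --gpus device=0 etc.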

Before installing MinkowskiEngine, enter this command:
export CPATH=/usr/local/cuda/include:$CPATH

I referenced this: microsoft/DeepSpeed#2684 (comment)
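
For completeness, the full sequence looks something like this (the pip line is the generic install from the MinkowskiEngine GitHub repo, not necessarily the exact command in this project's Dockerfile):

# make the CUDA headers (cusolverDn.h etc.) visible to the MinkowskiEngine build
export CPATH=/usr/local/cuda/include:$CPATH
pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps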

Hello, I have the same problem. I built the image according to the Dockerfile and then used sudo docker run --gpus 'device=1' -it oneformer_container /opt/nvidia/nvidia_entrypoint.sh to get into the container, without making any driver or CUDA changes. Then I ran nvidia-smi and got the error: Failed to initialize NVML: Driver/Library version mismatch.

cat /proc/driver/nvidia/version shows that the kernel driver version is 470.182.03, and nvcc -V shows that the CUDA version is 11.6, which should be compatible.
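
In one place, the checks I ran inside the container:

cat /proc/driver/nvidia/version   # kernel driver: 470.182.03
nvcc -V                           # CUDA toolkit: 11.6
nvidia-smi                        # fails with: Failed to initialize NVML: Driver/Library version mismatch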

By chance I tried running nvidia-smi after uninstalling the driver with apt-get --purge remove '*nvidia*' inside the container, and it unexpectedly worked. But I don't know why.

After I restarted the container and built the relevant dataset for ScanNet, I ran python tools/train.py configs/oneformer3d_1xb4_scannet.py and found that CUDA is still reporting an error:

RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW.
This still looks like a driver/hardware compatibility issue, but I'm not sure what the problem is. Checking print(torch.cuda.is_available()) also reports an error:
/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:88: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at .... /c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
False
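
A couple of standard commands that can show which driver library the container actually resolves (generic checks, not specific to this repo):

ldconfig -p | grep libcuda          # which libcuda.so the container links against
cat /proc/driver/nvidia/version     # kernel driver the container sees from the host
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"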
I don't know how I can fix these problems. Any advice you can give me would be greatly appreciated, thanks.

@K25801 have you solved it?