microsoft/DeepSpeed

[BUG] fatal error: cusolverDn.h: No such file or directory

IamHussain503 opened this issue · 14 comments

Describe the bug
I installed DeepSpeed along with its gcc and g++ dependencies, following these links:

https://lindevs.com/install-gcc-on-ubuntu
https://lindevs.com/install-g-on-ubuntu

In a Python environment, I am trying to run:
import deepspeed
deepspeed.ops.op_builder.CPUAdamBuilder().load()

which should load cpu_adam successfully; however, it fails with the error
fatal error: cusolverDn.h: No such file or directory
and the final error is:
RuntimeError: Error building extension 'cpu_adam'

I have downloaded the following packages from
https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64/
cuda-license-10-0_10.0.130-1_amd64.deb
cuda-cublas-dev-10-0_10.0.130-1_amd64.deb
cuda-cublas-10-0_10.0.130-1_amd64.deb

cuda-cusolver-10-0_10.0.130-1_amd64.deb
cuda-cusolver-dev-10-0_10.0.130-1_amd64.deb

cuda-curand-10-0_10.0.130-1_amd64.deb

and installed them all; however, the error does not go away.
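A quick way to verify where those .deb packages actually placed the header (assuming the CUDA 10.0 packages install under /usr/local/cuda-10.0):

dpkg -L cuda-cusolver-dev-10-0 | grep cusolverDn.h
ls /usr/local/cuda-10.0/include/cusolverDn.h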

import deepspeed
deepspeed.ops.op_builder.CPUAdamBuilder().load()
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu116/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/THC -isystem /opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/opt/conda/envs/bitten/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -c /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
FAILED: cpu_adam.o
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/THC -isystem /opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/opt/conda/envs/bitten/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -c /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
In file included from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/context.h:3:0,
from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:16,
from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/cpu_adam.h:11,
from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp:1:
/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
#include <cusolverDn.h>
^~~~~~~~~~~~~~
compilation terminated.
[2/3] /opt/conda/envs/bitten/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/THC -isystem /opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -c /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
FAILED: custom_cuda_kernel.cuda.o
/opt/conda/envs/bitten/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/THC -isystem /opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -c /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
In file included from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/context.h:3:0,
from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:16,
from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu:1:
/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
#include <cusolverDn.h>
^~~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/opt/conda/envs/bitten/lib/python3.8/subprocess.py", line 512, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "", line 1, in
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 460, in load
return self.jit_load(verbose)
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 495, in jit_load
op_module = load(
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
_write_ninja_file_and_build_library(
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'

To Reproduce
Steps to reproduce the behavior:
OS version 18.04 Ubuntu
(bitten) root@C.5718699:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
(bitten) root@C.5718699:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.6 LTS
Release: 18.04
Codename: bionic
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46 Driver Version: 495.46 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A5000 On | 00000000:04:00.0 Off | Off |
| 30% 28C P8 18W / 230W | 1MiB / 24256MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 On | 00000000:44:00.0 Off | Off |
| 30% 27C P8 19W / 230W | 1MiB / 24256MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

Expected behavior
deepspeed.ops.op_builder.CPUAdamBuilder().load() should compile and load the cpu_adam extension without errors.

ds_report output
(bitten) root@C.5718699:~$ ds_report

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/opt/conda/envs/bitten/lib/python3.8/site-packages/torch']
torch version .................... 1.13.1
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed install path ........... ['/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.7.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6

Please help, thanks!

Hello @Shaukat-Hussain
You are using nvcc from /opt/conda/envs/bitten/bin/nvcc. Are you sure this is the correct nvcc you want to use? Also, the command line does not include the system CUDA include dir /usr/local/cuda/include/, which is where cusolverDn.h is located.

Could you try export PATH=/usr/local/cuda/bin:$PATH to see if that fixes the problem? (Replace /usr/local/cuda/ with your CUDA dir.)
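For example, a quick check after updating PATH (assuming the system toolkit lives at /usr/local/cuda):

export PATH=/usr/local/cuda/bin:$PATH
which nvcc                                # should now print /usr/local/cuda/bin/nvcc
ls /usr/local/cuda/include/cusolverDn.h   # the header the build cannot find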

To follow up on this issue: the root cause is on the PyTorch side. They accidentally shipped nvcc with their conda package, which breaks the toolchain. The issue has been reported to the PyTorch team and should be fixed in the next release.

For now, please use the temporary workaround: export PATH=/usr/local/cuda/bin:$PATH

Ref: https://discuss.pytorch.org/t/not-able-to-include-cusolverdn-h/169122

Please feel free to reopen the issue if the above solution doesn't work.

sudo apt install nvidia-cuda-dev

sudo apt install nvidia-cuda-dev

This can lead to "Failed to initialize NVML: Driver/library version mismatch". Use with caution.

I solved this issue by swapping out the Docker base image.

I used pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel
instead of pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel.

And the issue went away. Hope this helps.
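If you want to verify a candidate base image before rebuilding, a quick check (assuming Docker is available; the -devel images ship the full CUDA toolkit headers, including cusolverDn.h):

docker run --rm pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel ls /usr/local/cuda/include/cusolverDn.h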

Along with adding to $PATH, make sure CUDA_HOME is also set to the matching CUDA toolkit; that resolved the issue for me.
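For example, a minimal sketch (assuming the toolkit lives at /usr/local/cuda; adjust for your install):

export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH   # often needed at runtime as well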

@HeyangQin Can you help me? I have checked that the nvcc dir is correct, and CUDA is already added to $PATH, but I still get this error.

Error traceback:

Installed CUDA version 11.2 does not match the version torch was compiled with 11.6 but since the APIs are compatible, accepting this combination
Using /home/jovyan/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jovyan/.cache/torch_extensions/py310_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/valle/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -std=c++14 -c /opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/valle/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -std=c++14 -c /opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
In file included from /opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu:8:
/opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
10 | #include <cusolverDn.h>
| ^~~~~~~~~~~~~~
compilation terminated.
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/valle/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/opt/conda/envs/valle/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/opt/conda/envs/valle/lib/python3.10/subprocess.py", line 524, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/jovyan/vall-e/train.py", line 128, in
main()
File "/home/jovyan/vall-e/train.py", line 119, in main
trainer.train(
File "/home/jovyan/vall-e/vall_e/utils/trainer.py", line 125, in train
engines = engines_loader()
File "/home/jovyan/vall-e/train.py", line 21, in load_engines
model=trainer.Engine(
File "/home/jovyan/vall-e/vall_e/utils/engines.py", line 22, in init
super().init(None, *args, **kwargs)
File "/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 340, in init
self._configure_optimizer(optimizer, model_parameters)
File "/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1283, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1360, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 73, in init
fused_adam_cuda = FusedAdamBuilder().load()
File "/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 485, in load
return self.jit_load(verbose)
File "/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 520, in jit_load
op_module = load(
File "/opt/conda/envs/valle/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/opt/conda/envs/valle/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
_write_ninja_file_and_build_library(
File "/opt/conda/envs/valle/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/opt/conda/envs/valle/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'

DS_REPORT:

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/opt/conda/envs/valle/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu116
deepspeed install path ........... ['/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.8.3, unknown, unknown
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6

Hi @thanhlong1997. Could you manually check if cusolverDn.h exists in the include dir?
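For example (checking both the system toolkit and the active conda env; these are typical locations, adjust as needed):

find /usr/local/cuda*/include -name cusolverDn.h 2>/dev/null
find $CONDA_PREFIX -name cusolverDn.h 2>/dev/null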

Hi @thanhlong1997. Could you manually check if cusolverDn.h exists in the include dir?

hello, about export PATH=/usr/local/cuda/bin:$PATH: I want to ask how to find my CUDA dir. This is the output of which nvcc:
~/miniconda3/envs/myseg/bin/nvcc
and this is the error message:
share/home/ncu10/miniconda3/envs/myseg/lib/python3.8/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
10 | #include <cusolverDn.h>
| ^~~~~~~~~~~~~~

For me this solved the issue: export CPATH=/usr/local/cuda/include:$CPATH
(solution provided by ChatGPT)
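For context: g++ reads the directories in CPATH as extra include search paths, so this makes the CUDA headers visible without touching the generated build command. A generic way to confirm the directory is actually being searched (not DeepSpeed-specific):

echo | g++ -x c++ -E -v - 2>&1 | sed -n '/search starts here/,/End of search list/p'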

For me this solved the issue: export CPATH=/usr/local/cuda/include:$CPATH (solution provided by ChatGPT)

Awesome! I solved the problem this way!

Another solution, if you still want to use conda to manage CUDA: simply install libcusolver-dev from the nvidia channel for your CUDA version. For example, I am using CUDA 11.6.1, so I can run conda install nvidia/label/cuda-11.6.1::libcusolver-dev.

conda install nvidia/label/cuda-11.6.1::libcusolver-dev

that does not work for me 😭

I found that one of the best methods is building from source:

git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed/
DS_BUILD_CPU_ADAM=1 python setup.py build_ext -j8 bdist_wheel
pip install dist/deepspeed-0.14.3+b6e24adb-cp312-cp312-linux_x86_64.whl
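(Note: the wheel filename above depends on the DeepSpeed version/commit and your Python/platform tags, so it will likely differ on your machine; pip install dist/*.whl installs whichever wheel was just built.)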