[BUG] Deepspeed training integration failing with WizardCoder
karths8 opened this issue · 5 comments
To Reproduce
I am using the WizardCoder training script with the DeepSpeed integration to further fine-tune the model on some examples of my own. I have followed the instructions given here to fine-tune the model, and I am getting the following error:
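For context, train.sh just wraps the DeepSpeed launcher. Reconstructed from the "cmd = ..." line in the launcher log below (the script itself and the referenced deepspeed_config.json are not reproduced here, so treat this as a rough sketch), it is roughly equivalent to:
# Rough sketch reconstructed from the launcher log below; the actual train.sh
# and deepspeed_config.json are not shown in this issue.
deepspeed /root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py \
  --model_name_or_path /root/Custom-LLM/WizardCoder-15B-V1.0 \
  --data_path /root/Custom-LLM/data.json \
  --output_dir /root/Custom-LLM/WC-Checkpoint \
  --num_train_epochs 3 --model_max_length 512 \
  --per_device_train_batch_size 1 --per_device_eval_batch_size 1 \
  --gradient_accumulation_steps 4 \
  --evaluation_strategy no --save_strategy steps --save_steps 50 --save_total_limit 2 \
  --learning_rate 2e-5 --warmup_steps 30 --logging_steps 2 \
  --lr_scheduler_type cosine --report_to tensorboard \
  --gradient_checkpointing True \
  --deepspeed /root/Custom-LLM/Llama-X/src/configs/deepspeed_config.json \
  --fp16 True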
Traceback
(datachat_env) root@C.6442427:~/Custom-LLM$ sh train.sh
[2023-06-23 00:36:25,039] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-23 00:36:25,077] [INFO] [runner.py:541:main] cmd = /root/anaconda3/envs/datachat_env/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None /root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py --model_name_or_path /root/Custom-LLM/WizardCoder-15B-V1.0 --data_path /root/Custom-LLM/data.json --output_dir /root/Custom-LLM/WC-Checkpoint --num_train_epochs 3 --model_max_length 512 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 50 --save_total_limit 2 --learning_rate 2e-5 --warmup_steps 30 --logging_steps 2 --lr_scheduler_type cosine --report_to tensorboard --gradient_checkpointing True --deepspeed /root/Custom-LLM/Llama-X/src/configs/deepspeed_config.json --fp16 True
[2023-06-23 00:36:26,992] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-06-23 00:36:26,993] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-06-23 00:36:26,993] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-06-23 00:36:26,993] [INFO] [launch.py:247:main] dist_world_size=4
[2023-06-23 00:36:26,993] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-06-23 00:36:29,650] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-06-23 00:36:55,124] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 15.82B parameters
[2023-06-23 00:37:12,845] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-06-23 00:37:12,968] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-06-23 00:37:12,969] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-06-23 00:37:12,970] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
FAILED: cpu_adam.o
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
In file included from /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes/cpu_adam.h:19,
from /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp:6:
/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:12:10: fatal error: curand_kernel.h: No such file or directory
12 | #include <curand_kernel.h>
| ^~~~~~~~~~~~~~~~~
compilation terminated.
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
[2/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
FAILED: custom_cuda_kernel.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
In file included from /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu:6:
/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:12:10: fatal error: curand_kernel.h: No such file or directory
12 | #include <curand_kernel.h>
| ^~~~~~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/root/anaconda3/envs/datachat_env/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 247, in <module>
train()
File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 241, in train
trainer.train()
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1664, in train
return inner_training_loop(
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1741, in _inner_training_loop
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/__init__.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1218, in _configure_basic_optimizer
optimizer = DeepSpeedCPUAdam(model_parameters,
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
self.ds_opt_adam = CPUAdamBuilder().load()
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
return self.jit_load(verbose)
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
op_module = load(name=self.name,
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
_write_ninja_file_and_build_library(
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'
[. . . identical ImportError tracebacks from the other ranks, interleaved . . .]
Loading extension module cpu_adam...
Traceback (most recent call last):
File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 247, in <module>
train()
File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 241, in train
trainer.train()
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1664, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1741, in _inner_training_loop
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/__init__.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1218, in _configure_basic_optimizer
optimizer = DeepSpeedCPUAdam(model_parameters,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
self.ds_opt_adam = CPUAdamBuilder().load()
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
return self.jit_load(verbose)
^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
op_module = load(name=self.name,
^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 573, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1233, in create_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py311_cu118/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fcaec4a89a0>
Traceback (most recent call last):
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
self.ds_opt_adam.destroy_adam(self.opt_id)
^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fbf4e6409a0>
Traceback (most recent call last):
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
self.ds_opt_adam.destroy_adam(self.opt_id)
^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f9ce61b09a0>
Traceback (most recent call last):
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f6c2bf109a0>
Traceback (most recent call last):
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
self.ds_opt_adam.destroy_adam(self.opt_id)
^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Expected behavior
Expect the model to train using the given DeepSpeed config file.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch']
torch version .................... 2.0.1
deepspeed install path ........... ['/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.9.2, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: 1 machine with 4x A100 80GB
Hi @karths8, it looks like you're hitting a couple of c++ and nvcc errors:
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
FAILED: cpu_adam.o
Also, could you try a newer version of DeepSpeed? It has improved error logging for these types of errors; 0.9.5 should have those changes.
Also, if you pre-build the DS ops, do you get the same errors? That is, running:
DS_BUILD_OPS=1 DS_BUILD_AIO=0 pip install deepspeed==0.9.5
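For a clean rebuild, something along these lines should work (the uninstall and --no-cache-dir are just to make sure pip doesn't reuse a previously built wheel; AIO is skipped since libaio-dev isn't installed per your ds_report):
# Force a fresh source build with the ops precompiled
pip uninstall -y deepspeed
DS_BUILD_OPS=1 DS_BUILD_AIO=0 pip install --no-cache-dir deepspeed==0.9.5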
Hey, I tried this and got the following error:
DS_BUILD_OPS=1 DS_BUILD_AIO=0 pip install deepspeed==0.9.5
[. . .]
/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/setuptools/command/build_py.py:201: _Warning: Package 'deepspeed.ops.csrc.utils' is absent from the `packages` configuration.
!!
********************************************************************************
############################
# Package would be ignored #
############################
Python recognizes 'deepspeed.ops.csrc.utils' as an importable package[^1],
but it is absent from setuptools' `packages` configuration.
This leads to an ambiguous overall configuration. If you want to distribute this
package, please make sure that 'deepspeed.ops.csrc.utils' is explicitly added
to the `packages` configuration field.
Alternatively, you can also rely on setuptools' discovery methods
(for example by using `find_namespace_packages(...)`/`find_namespace:`
instead of `find_packages(...)`/`find:`).
You can read more about "package discovery" on setuptools documentation page:
- https://setuptools.pypa.io/en/latest/userguide/package_discovery.html
If you don't want 'deepspeed.ops.csrc.utils' to be distributed and are
already explicitly excluding 'deepspeed.ops.csrc.utils' via
`find_namespace_packages(...)/find_namespace` or `find_packages(...)/find`,
you can try to use `exclude_package_data`, or `include-package-data=False` in
combination with a more fine grained `package-data` configuration.
You can read more about "package data files" on setuptools documentation page:
- https://setuptools.pypa.io/en/latest/userguide/datafiles.html
[^1]: For Python, any directory (with suitable naming) can be imported,
even if it does not contain any `.py` files.
On the other hand, currently there is no concept of package data
directory, all directories are treated like packages.
********************************************************************************
!!
check.warn(importable)
creating build/lib.linux-x86_64-cpython-311/deepspeed/autotuning/config_templates
copying deepspeed/autotuning/config_templates/template_zero0.json -> build/lib.linux-x86_64-cpython-311/deepspeed/autotuning/config_templates
copying deepspeed/autotuning/config_templates/template_zero1.json -> build/lib.linux-x86_64-cpython-311/deepspeed/autotuning/config_templates
copying deepspeed/autotuning/config_templates/template_zero2.json -> build/lib.linux-x86_64-cpython-311/deepspeed/autotuning/config_templates
copying deepspeed/autotuning/config_templates/template_zero3.json -> build/lib.linux-x86_64-cpython-311/deepspeed/autotuning/config_templates
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/adagrad
copying deepspeed/ops/csrc/adagrad/cpu_adagrad.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/adagrad
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/adam
copying deepspeed/ops/csrc/adam/cpu_adam.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/adam
copying deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/adam
copying deepspeed/ops/csrc/adam/multi_tensor_adam.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/adam
copying deepspeed/ops/csrc/adam/multi_tensor_apply.cuh -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/adam
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/common
copying deepspeed/ops/csrc/aio/common/deepspeed_aio_common.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/common
copying deepspeed/ops/csrc/aio/common/deepspeed_aio_common.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/common
copying deepspeed/ops/csrc/aio/common/deepspeed_aio_types.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/common
copying deepspeed/ops/csrc/aio/common/deepspeed_aio_types.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/common
copying deepspeed/ops/csrc/aio/common/deepspeed_aio_utils.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/common
copying deepspeed/ops/csrc/aio/common/deepspeed_aio_utils.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/common
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/deepspeed_aio_thread.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/deepspeed_aio_thread.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/deepspeed_pin_tensor.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/deepspeed_pin_tensor.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_copy.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_copy.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/py_ds_aio.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_test
copying deepspeed/ops/csrc/aio/py_test/single_process_config.json -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_test
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/common
copying deepspeed/ops/csrc/common/custom_cuda_kernel.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/common
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/cpu
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/cpu/comm
copying deepspeed/ops/csrc/cpu/comm/ccl.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/cpu/comm
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/StopWatch.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/Timer.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/compat.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/context.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/conversion_utils.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/cpu_adagrad.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/cpu_adam.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/cublas_wrappers.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/custom_cuda_layers.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/dequantization_utils.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/dropout.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/ds_kernel_utils.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/ds_transformer_cuda.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/feed_forward.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/gelu.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/gemm_test.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/general_kernels.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/memory_access_utils.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/normalize_layer.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/quantization.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/quantization_utils.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/quantizer.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/reduction_utils.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/simd.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/softmax.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/strided_batch_gemm.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/type_shim.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/lamb
copying deepspeed/ops/csrc/lamb/fused_lamb_cuda.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/lamb
copying deepspeed/ops/csrc/lamb/fused_lamb_cuda_kernel.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/lamb
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/quantization
copying deepspeed/ops/csrc/quantization/dequantize.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/quantization
copying deepspeed/ops/csrc/quantization/fake_quantizer.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/quantization
copying deepspeed/ops/csrc/quantization/pt_binding.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/quantization
copying deepspeed/ops/csrc/quantization/quantize.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/quantization
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/random_ltd
copying deepspeed/ops/csrc/random_ltd/gather_scatter.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/random_ltd
copying deepspeed/ops/csrc/random_ltd/pt_binding.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/random_ltd
copying deepspeed/ops/csrc/random_ltd/slice_attn_masks.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/random_ltd
copying deepspeed/ops/csrc/random_ltd/token_sort.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/random_ltd
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/sparse_attention
copying deepspeed/ops/csrc/sparse_attention/utils.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/sparse_attention
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/spatial
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/spatial/csrc
copying deepspeed/ops/csrc/spatial/csrc/opt_bias_add.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/spatial/csrc
copying deepspeed/ops/csrc/spatial/csrc/pt_binding.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/spatial/csrc
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/spatial/includes
copying deepspeed/ops/csrc/spatial/includes/spatial_cuda_layers.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/spatial/includes
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer
copying deepspeed/ops/csrc/transformer/cublas_wrappers.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer
copying deepspeed/ops/csrc/transformer/dropout_kernels.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer
copying deepspeed/ops/csrc/transformer/ds_transformer_cuda.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer
copying deepspeed/ops/csrc/transformer/gelu_kernels.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer
copying deepspeed/ops/csrc/transformer/general_kernels.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer
copying deepspeed/ops/csrc/transformer/normalize_kernels.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer
copying deepspeed/ops/csrc/transformer/softmax_kernels.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer
copying deepspeed/ops/csrc/transformer/transform_kernels.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
copying deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
copying deepspeed/ops/csrc/transformer/inference/csrc/dequantize.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
copying deepspeed/ops/csrc/transformer/inference/csrc/gelu.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
copying deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
copying deepspeed/ops/csrc/transformer/inference/csrc/pointwise_ops.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
copying deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
copying deepspeed/ops/csrc/transformer/inference/csrc/relu.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
copying deepspeed/ops/csrc/transformer/inference/csrc/rms_norm.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
copying deepspeed/ops/csrc/transformer/inference/csrc/softmax.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
copying deepspeed/ops/csrc/transformer/inference/csrc/transform.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/includes
copying deepspeed/ops/csrc/transformer/inference/includes/inference_context.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/includes
copying deepspeed/ops/csrc/transformer/inference/includes/inference_cublas_wrappers.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/includes
copying deepspeed/ops/csrc/transformer/inference/includes/inference_cuda_layers.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/includes
creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/utils
copying deepspeed/ops/csrc/utils/flatten_unflatten.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/utils
copying deepspeed/ops/sparse_attention/trsrc/matmul.tr -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/sparse_attention/trsrc
copying deepspeed/ops/sparse_attention/trsrc/softmax_bwd.tr -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/sparse_attention/trsrc
copying deepspeed/ops/sparse_attention/trsrc/softmax_fwd.tr -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/sparse_attention/trsrc
running build_ext
/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py:398: UserWarning: There are no g++ version bounds defined for CUDA version 11.8
warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
building 'deepspeed.ops.adagrad.cpu_adagrad_op' extension
creating build/temp.linux-x86_64-cpython-311
creating build/temp.linux-x86_64-cpython-311/csrc
creating build/temp.linux-x86_64-cpython-311/csrc/adagrad
creating build/temp.linux-x86_64-cpython-311/csrc/common
gcc -pthread -B /root/anaconda3/envs/datachat_env/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/anaconda3/envs/datachat_env/include -fPIC -O2 -isystem /root/anaconda3/envs/datachat_env/include -fPIC -Icsrc/includes -I/usr/local/cuda/include -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -I/usr/local/cuda/include -I/root/anaconda3/envs/datachat_env/include/python3.11 -c csrc/adagrad/cpu_adagrad.cpp -o build/temp.linux-x86_64-cpython-311/csrc/adagrad/cpu_adagrad.o -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=cpu_adagrad_op -D_GLIBCXX_USE_CXX11_ABI=0
In file included from csrc/includes/cpu_adagrad.h:13,
from csrc/adagrad/cpu_adagrad.cpp:6:
csrc/includes/simd.h:69: warning: ignoring #pragma unroll [-Wunknown-pragmas]
69 | #pragma unroll
|
csrc/includes/simd.h:76: warning: ignoring #pragma unroll [-Wunknown-pragmas]
76 | #pragma unroll
|
csrc/includes/simd.h:82: warning: ignoring #pragma unroll [-Wunknown-pragmas]
82 | #pragma unroll
|
csrc/includes/simd.h:90: warning: ignoring #pragma unroll [-Wunknown-pragmas]
90 | #pragma unroll
|
csrc/includes/simd.h:98: warning: ignoring #pragma unroll [-Wunknown-pragmas]
98 | #pragma unroll
|
csrc/includes/simd.h:106: warning: ignoring #pragma unroll [-Wunknown-pragmas]
106 | #pragma unroll
|
csrc/includes/simd.h:112: warning: ignoring #pragma unroll [-Wunknown-pragmas]
112 | #pragma unroll
|
csrc/includes/simd.h:118: warning: ignoring #pragma unroll [-Wunknown-pragmas]
118 | #pragma unroll
|
csrc/includes/simd.h:124: warning: ignoring #pragma unroll [-Wunknown-pragmas]
124 | #pragma unroll
|
csrc/includes/simd.h:130: warning: ignoring #pragma unroll [-Wunknown-pragmas]
130 | #pragma unroll
|
csrc/includes/simd.h:136: warning: ignoring #pragma unroll [-Wunknown-pragmas]
136 | #pragma unroll
|
In file included from csrc/includes/cpu_adagrad.h:19,
from csrc/adagrad/cpu_adagrad.cpp:6:
csrc/includes/custom_cuda_layers.h:12:10: fatal error: curand_kernel.h: No such file or directory
12 | #include <curand_kernel.h>
| ^~~~~~~~~~~~~~~~~
compilation terminated.
error: command '/usr/bin/gcc' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for deepspeed
Running setup.py clean for deepspeed
Failed to build deepspeed
ERROR: Could not build wheels for deepspeed, which is required to install pyproject.toml-based projects
I had a lot of warnings saying "Package would be ignored", as shown near the top of the traceback, and I am getting a similar error (curand_kernel.h: No such file or directory) when attempting to install DeepSpeed the way you suggested. Any help from your side would be greatly appreciated!
Interesting, something certainly looks to be wrong here. The missing curand_kernel.h seems to indicate that the CUDA install isn't quite right somehow, since in all likelihood the -lcurand flag isn't being included.
Can you pip install deepspeed without building any ops, and then run ds_report so we can see what that says?
Also, it might be good to try this in a venv, just to ensure there are no other packages conflicting.
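A few quick sanity checks would also help narrow this down (assuming the default /usr/local/cuda install location from your build log):
# Is nvcc on PATH, and which toolkit does it come from?
which nvcc && nvcc --version
# Is the header the cpu_adam/cpu_adagrad build can't find actually installed?
ls -l /usr/local/cuda/include/curand_kernel.h
# What CUDA version is torch built against, and what does DeepSpeed report?
python -c "import torch; print(torch.version.cuda)"
ds_report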
I did some prodding and found that the curand_kernel.h file was not present in the /usr/local/cuda-11.8/targets/x86_64-linux/include directory, which is what caused the error in the traceback (fatal error: curand_kernel.h: No such file or directory). To solve this I manually pulled the .deb file associated with cuRAND from here and installed it like so:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/libcurand-11-8_10.3.0.86-1_amd64.deb
sudo dpkg -i libcurand-11-8_10.3.0.86-1_amd64.deb
This seemed to fix the issue.
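For anyone else hitting this, the cuRAND headers normally ship in the matching dev package, so installing that from the same CUDA apt repo should presumably also work if the runtime .deb alone is not enough (package name assumed, not verified here):
# Assumed equivalent via NVIDIA's CUDA apt repo (not verified):
sudo apt-get install libcurand-dev-11-8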
Interesting, glad that works for you!