microsoft/DeepSpeed

[BUG] Deepspeed training integration failing with WizardCoder

karths8 opened this issue · 5 comments

To Reproduce
I am using the WizardCoder training script with the DeepSpeed integration to further fine-tune the model on some of my own examples. I followed their fine-tuning instructions (given here), and I am getting the following error:

Traceback

(datachat_env) root@C.6442427:~/Custom-LLM$ sh train.sh
[2023-06-23 00:36:25,039] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-23 00:36:25,077] [INFO] [runner.py:541:main] cmd = /root/anaconda3/envs/datachat_env/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None /root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py --model_name_or_path /root/Custom-LLM/WizardCoder-15B-V1.0 --data_path /root/Custom-LLM/data.json --output_dir /root/Custom-LLM/WC-Checkpoint --num_train_epochs 3 --model_max_length 512 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 50 --save_total_limit 2 --learning_rate 2e-5 --warmup_steps 30 --logging_steps 2 --lr_scheduler_type cosine --report_to tensorboard --gradient_checkpointing True --deepspeed /root/Custom-LLM/Llama-X/src/configs/deepspeed_config.json --fp16 True
[2023-06-23 00:36:26,992] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-06-23 00:36:26,993] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-06-23 00:36:26,993] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-06-23 00:36:26,993] [INFO] [launch.py:247:main] dist_world_size=4
[2023-06-23 00:36:26,993] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-06-23 00:36:29,650] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-06-23 00:36:55,124] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 15.82B parameters
[2023-06-23 00:37:12,845] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-06-23 00:37:12,968] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-06-23 00:37:12,969] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-06-23 00:37:12,970] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
FAILED: cpu_adam.o 
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
In file included from /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes/cpu_adam.h:19,
                 from /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp:6:
/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:12:10: fatal error: curand_kernel.h: No such file or directory
   12 | #include <curand_kernel.h>
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
[2/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
FAILED: custom_cuda_kernel.cuda.o 
/usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
In file included from /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu:6:
/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:12:10: fatal error: curand_kernel.h: No such file or directory
   12 | #include <curand_kernel.h>
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 247, in <module>
    train()
  File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 241, in train
    trainer.train()
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1741, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
                                                ^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1218, in _configure_basic_optimizer
    optimizer = DeepSpeedCPUAdam(model_parameters,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
    self.ds_opt_adam = CPUAdamBuilder().load()
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
                ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'
Loading extension module cpu_adam...
Traceback (most recent call last):
  File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 247, in <module>
    train()
  File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 241, in train
    trainer.train()
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1741, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
                                                ^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1218, in _configure_basic_optimizer
    optimizer = DeepSpeedCPUAdam(model_parameters,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
    self.ds_opt_adam = CPUAdamBuilder().load()
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
                ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 573, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1233, in create_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py311_cu118/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fcaec4a89a0>
Traceback (most recent call last):
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
    ^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fbf4e6409a0>
Traceback (most recent call last):
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
    ^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f9ce61b09a0>
Traceback (most recent call last):
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
    ^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f6c2bf109a0>
Traceback (most recent call last):
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
    ^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
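
The traceback above is long because several ranks print at once, but the actual cause appears on the compiler's `fatal error` lines. As a rough aid for reading logs like this one, here is a minimal sketch that pulls those lines out of a saved log. The function name `extract_fatal_errors` and the regular expression are illustrative assumptions, not part of DeepSpeed or PyTorch, and the pattern only covers the gcc/nvcc diagnostic shape seen in this log.

```python
import re

# Matches diagnostics shaped like "<path>:<line>:<col>: fatal error: <message>",
# which is the form both c++ and nvcc emit in the log above.
FATAL_RE = re.compile(
    r"^(?P<path>\S+):(?P<line>\d+):(?P<col>\d+): fatal error: (?P<msg>.+)$"
)


def extract_fatal_errors(log_text):
    """Return unique (path, line, message) triples, in order of first appearance."""
    found = []
    for raw in log_text.splitlines():
        match = FATAL_RE.match(raw.strip())
        if match is None:
            continue
        item = (match.group("path"), int(match.group("line")), match.group("msg"))
        if item not in found:  # the same error is printed once per failed step
            found.append(item)
    return found
```

Run against this log, it should reduce everything to a single entry: `custom_cuda_layers.h`, line 12, `curand_kernel.h: No such file or directory`.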

Expected behavior
I expect the model to train using the given DeepSpeed config file.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch']
torch version .................... 2.0.1
deepspeed install path ........... ['/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.9.2, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
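
The report lists `cpu_adam` as not installed but compatible, so it is JIT compiled at runtime, and that is the step that fails in the traceback. A small sketch for listing which ops fall into that category from saved `ds_report` text follows. The helper names are hypothetical, the row pattern is inferred only from the table above, and `[YES]` as the "installed" marker is an assumption, since no installed op appears in this report.

```python
import re

# One table row looks like: "cpu_adam ............... [NO] ....... [OKAY]"
OP_ROW = re.compile(
    r"^(?P<op>\w+) \.+ \[(?P<installed>\w+)\] \.+ \[(?P<compatible>\w+)\]$"
)


def parse_op_table(report_text):
    """Map op name -> (installed, compatible) for every op row in the report."""
    ops = {}
    for raw in report_text.splitlines():
        match = OP_ROW.match(raw.strip())
        if match:
            ops[match.group("op")] = (
                match.group("installed") == "YES",   # assumed marker for pre-built ops
                match.group("compatible") == "OKAY",
            )
    return ops


def jit_candidates(ops):
    """Ops that are not pre-built but reported compatible, i.e. built on first use."""
    return sorted(name for name, (inst, compat) in ops.items() if compat and not inst)
```

Note that "compatible" did not prevent the build failure here, so the column is better read as a first check than as a guarantee.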

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: 1 machine with 4x A100 80GB

Hi @karths8, it looks like you're hitting a couple of C++ and nvcc compile errors:

c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
FAILED: cpu_adam.o
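
Both failed steps in the original log stop on `fatal error: curand_kernel.h: No such file or directory`, and the compile command searches `/usr/local/cuda/include`. A quick way to confirm whether the header is really absent is sketched below. The helper names are hypothetical, and falling back to `/usr/local/cuda` when `CUDA_HOME` is unset is an assumption taken from the paths on the command line above.

```python
import os
from pathlib import Path


def default_cuda_include_dirs():
    """Candidate CUDA include directory: $CUDA_HOME/include, else /usr/local/cuda/include."""
    cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
    return [Path(cuda_home) / "include"]


def find_header(header, include_dirs):
    """Return the full path of `header` in each include directory that contains it."""
    hits = []
    for directory in include_dirs:
        candidate = Path(directory) / header
        if candidate.is_file():
            hits.append(candidate)
    return hits
```

Calling `find_header("curand_kernel.h", default_cuda_include_dirs())` on the failing machine would be expected to return an empty list. If it does, the CUDA install is likely missing the cuRAND development headers, though that is an inference from the log rather than something confirmed in this thread.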

Could you also try a newer version of DeepSpeed? 0.9.5 should include improved error logging for these kinds of errors.

Also, if you pre-build the DS ops, do you get the same errors? You can pre-build them by running:

DS_BUILD_OPS=1 DS_BUILD_AIO=0 pip install deepspeed==0.9.5

Hey, I tried this and got the following error:

DS_BUILD_OPS=1 DS_BUILD_AIO=0 pip install deepspeed==0.9.5

[. . .]
 /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/setuptools/command/build_py.py:201: _Warning: Package 'deepspeed.ops.csrc.utils' is absent from the `packages` configuration.
      !!
      
              ********************************************************************************
              ############################
              # Package would be ignored #
              ############################
              Python recognizes 'deepspeed.ops.csrc.utils' as an importable package[^1],
              but it is absent from setuptools' `packages` configuration.
      
              This leads to an ambiguous overall configuration. If you want to distribute this
              package, please make sure that 'deepspeed.ops.csrc.utils' is explicitly added
              to the `packages` configuration field.
      
              Alternatively, you can also rely on setuptools' discovery methods
              (for example by using `find_namespace_packages(...)`/`find_namespace:`
              instead of `find_packages(...)`/`find:`).
      
              You can read more about "package discovery" on setuptools documentation page:
      
              - https://setuptools.pypa.io/en/latest/userguide/package_discovery.html
      
              If you don't want 'deepspeed.ops.csrc.utils' to be distributed and are
              already explicitly excluding 'deepspeed.ops.csrc.utils' via
              `find_namespace_packages(...)/find_namespace` or `find_packages(...)/find`,
              you can try to use `exclude_package_data`, or `include-package-data=False` in
              combination with a more fine grained `package-data` configuration.
      
              You can read more about "package data files" on setuptools documentation page:
      
              - https://setuptools.pypa.io/en/latest/userguide/datafiles.html
      
      
              [^1]: For Python, any directory (with suitable naming) can be imported,
                    even if it does not contain any `.py` files.
                    On the other hand, currently there is no concept of package data
                    directory, all directories are treated like packages.
              ********************************************************************************
      
      !!
        check.warn(importable)
      creating build/lib.linux-x86_64-cpython-311/deepspeed/autotuning/config_templates
      copying deepspeed/autotuning/config_templates/template_zero0.json -> build/lib.linux-x86_64-cpython-311/deepspeed/autotuning/config_templates
      copying deepspeed/autotuning/config_templates/template_zero1.json -> build/lib.linux-x86_64-cpython-311/deepspeed/autotuning/config_templates
      copying deepspeed/autotuning/config_templates/template_zero2.json -> build/lib.linux-x86_64-cpython-311/deepspeed/autotuning/config_templates
      copying deepspeed/autotuning/config_templates/template_zero3.json -> build/lib.linux-x86_64-cpython-311/deepspeed/autotuning/config_templates
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/adagrad
      copying deepspeed/ops/csrc/adagrad/cpu_adagrad.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/adagrad
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/adam
      copying deepspeed/ops/csrc/adam/cpu_adam.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/adam
      copying deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/adam
      copying deepspeed/ops/csrc/adam/multi_tensor_adam.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/adam
      copying deepspeed/ops/csrc/adam/multi_tensor_apply.cuh -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/adam
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/common
      copying deepspeed/ops/csrc/aio/common/deepspeed_aio_common.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/common
      copying deepspeed/ops/csrc/aio/common/deepspeed_aio_common.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/common
      copying deepspeed/ops/csrc/aio/common/deepspeed_aio_types.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/common
      copying deepspeed/ops/csrc/aio/common/deepspeed_aio_types.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/common
      copying deepspeed/ops/csrc/aio/common/deepspeed_aio_utils.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/common
      copying deepspeed/ops/csrc/aio/common/deepspeed_aio_utils.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/common
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
      copying deepspeed/ops/csrc/aio/py_lib/deepspeed_aio_thread.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
      copying deepspeed/ops/csrc/aio/py_lib/deepspeed_aio_thread.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
      copying deepspeed/ops/csrc/aio/py_lib/deepspeed_pin_tensor.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
      copying deepspeed/ops/csrc/aio/py_lib/deepspeed_pin_tensor.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
      copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
      copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
      copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
      copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
      copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_copy.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
      copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_copy.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
      copying deepspeed/ops/csrc/aio/py_lib/py_ds_aio.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_lib
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_test
      copying deepspeed/ops/csrc/aio/py_test/single_process_config.json -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/aio/py_test
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/common
      copying deepspeed/ops/csrc/common/custom_cuda_kernel.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/common
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/cpu
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/cpu/comm
      copying deepspeed/ops/csrc/cpu/comm/ccl.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/cpu/comm
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/StopWatch.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/Timer.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/compat.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/context.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/conversion_utils.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/cpu_adagrad.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/cpu_adam.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/cublas_wrappers.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/custom_cuda_layers.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/dequantization_utils.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/dropout.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/ds_kernel_utils.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/ds_transformer_cuda.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/feed_forward.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/gelu.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/gemm_test.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/general_kernels.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/memory_access_utils.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/normalize_layer.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/quantization.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/quantization_utils.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/quantizer.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/reduction_utils.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/simd.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/softmax.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/strided_batch_gemm.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      copying deepspeed/ops/csrc/includes/type_shim.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/includes
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/lamb
      copying deepspeed/ops/csrc/lamb/fused_lamb_cuda.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/lamb
      copying deepspeed/ops/csrc/lamb/fused_lamb_cuda_kernel.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/lamb
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/quantization
      copying deepspeed/ops/csrc/quantization/dequantize.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/quantization
      copying deepspeed/ops/csrc/quantization/fake_quantizer.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/quantization
      copying deepspeed/ops/csrc/quantization/pt_binding.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/quantization
      copying deepspeed/ops/csrc/quantization/quantize.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/quantization
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/random_ltd
      copying deepspeed/ops/csrc/random_ltd/gather_scatter.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/random_ltd
      copying deepspeed/ops/csrc/random_ltd/pt_binding.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/random_ltd
      copying deepspeed/ops/csrc/random_ltd/slice_attn_masks.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/random_ltd
      copying deepspeed/ops/csrc/random_ltd/token_sort.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/random_ltd
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/sparse_attention
      copying deepspeed/ops/csrc/sparse_attention/utils.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/sparse_attention
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/spatial
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/spatial/csrc
      copying deepspeed/ops/csrc/spatial/csrc/opt_bias_add.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/spatial/csrc
      copying deepspeed/ops/csrc/spatial/csrc/pt_binding.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/spatial/csrc
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/spatial/includes
      copying deepspeed/ops/csrc/spatial/includes/spatial_cuda_layers.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/spatial/includes
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer
      copying deepspeed/ops/csrc/transformer/cublas_wrappers.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer
      copying deepspeed/ops/csrc/transformer/dropout_kernels.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer
      copying deepspeed/ops/csrc/transformer/ds_transformer_cuda.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer
      copying deepspeed/ops/csrc/transformer/gelu_kernels.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer
      copying deepspeed/ops/csrc/transformer/general_kernels.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer
      copying deepspeed/ops/csrc/transformer/normalize_kernels.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer
      copying deepspeed/ops/csrc/transformer/softmax_kernels.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer
      copying deepspeed/ops/csrc/transformer/transform_kernels.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
      copying deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
      copying deepspeed/ops/csrc/transformer/inference/csrc/dequantize.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
      copying deepspeed/ops/csrc/transformer/inference/csrc/gelu.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
      copying deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
      copying deepspeed/ops/csrc/transformer/inference/csrc/pointwise_ops.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
      copying deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
      copying deepspeed/ops/csrc/transformer/inference/csrc/relu.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
      copying deepspeed/ops/csrc/transformer/inference/csrc/rms_norm.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
      copying deepspeed/ops/csrc/transformer/inference/csrc/softmax.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
      copying deepspeed/ops/csrc/transformer/inference/csrc/transform.cu -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/csrc
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/includes
      copying deepspeed/ops/csrc/transformer/inference/includes/inference_context.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/includes
      copying deepspeed/ops/csrc/transformer/inference/includes/inference_cublas_wrappers.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/includes
      copying deepspeed/ops/csrc/transformer/inference/includes/inference_cuda_layers.h -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/transformer/inference/includes
      creating build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/utils
      copying deepspeed/ops/csrc/utils/flatten_unflatten.cpp -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/csrc/utils
      copying deepspeed/ops/sparse_attention/trsrc/matmul.tr -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/sparse_attention/trsrc
      copying deepspeed/ops/sparse_attention/trsrc/softmax_bwd.tr -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/sparse_attention/trsrc
      copying deepspeed/ops/sparse_attention/trsrc/softmax_fwd.tr -> build/lib.linux-x86_64-cpython-311/deepspeed/ops/sparse_attention/trsrc
      running build_ext
      /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py:398: UserWarning: There are no g++ version bounds defined for CUDA version 11.8
        warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
      building 'deepspeed.ops.adagrad.cpu_adagrad_op' extension
      creating build/temp.linux-x86_64-cpython-311
      creating build/temp.linux-x86_64-cpython-311/csrc
      creating build/temp.linux-x86_64-cpython-311/csrc/adagrad
      creating build/temp.linux-x86_64-cpython-311/csrc/common
      gcc -pthread -B /root/anaconda3/envs/datachat_env/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/anaconda3/envs/datachat_env/include -fPIC -O2 -isystem /root/anaconda3/envs/datachat_env/include -fPIC -Icsrc/includes -I/usr/local/cuda/include -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -I/usr/local/cuda/include -I/root/anaconda3/envs/datachat_env/include/python3.11 -c csrc/adagrad/cpu_adagrad.cpp -o build/temp.linux-x86_64-cpython-311/csrc/adagrad/cpu_adagrad.o -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=cpu_adagrad_op -D_GLIBCXX_USE_CXX11_ABI=0
      In file included from csrc/includes/cpu_adagrad.h:13,
                       from csrc/adagrad/cpu_adagrad.cpp:6:
      csrc/includes/simd.h:69: warning: ignoring #pragma unroll  [-Wunknown-pragmas]
         69 | #pragma unroll
            |
      csrc/includes/simd.h:76: warning: ignoring #pragma unroll  [-Wunknown-pragmas]
         76 | #pragma unroll
            |
      csrc/includes/simd.h:82: warning: ignoring #pragma unroll  [-Wunknown-pragmas]
         82 | #pragma unroll
            |
      csrc/includes/simd.h:90: warning: ignoring #pragma unroll  [-Wunknown-pragmas]
         90 | #pragma unroll
            |
      csrc/includes/simd.h:98: warning: ignoring #pragma unroll  [-Wunknown-pragmas]
         98 | #pragma unroll
            |
      csrc/includes/simd.h:106: warning: ignoring #pragma unroll  [-Wunknown-pragmas]
        106 | #pragma unroll
            |
      csrc/includes/simd.h:112: warning: ignoring #pragma unroll  [-Wunknown-pragmas]
        112 | #pragma unroll
            |
      csrc/includes/simd.h:118: warning: ignoring #pragma unroll  [-Wunknown-pragmas]
        118 | #pragma unroll
            |
      csrc/includes/simd.h:124: warning: ignoring #pragma unroll  [-Wunknown-pragmas]
        124 | #pragma unroll
            |
      csrc/includes/simd.h:130: warning: ignoring #pragma unroll  [-Wunknown-pragmas]
        130 | #pragma unroll
            |
      csrc/includes/simd.h:136: warning: ignoring #pragma unroll  [-Wunknown-pragmas]
        136 | #pragma unroll
            |
      In file included from csrc/includes/cpu_adagrad.h:19,
                       from csrc/adagrad/cpu_adagrad.cpp:6:
      csrc/includes/custom_cuda_layers.h:12:10: fatal error: curand_kernel.h: No such file or directory
         12 | #include <curand_kernel.h>
            |          ^~~~~~~~~~~~~~~~~
      compilation terminated.
      error: command '/usr/bin/gcc' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for deepspeed
  Running setup.py clean for deepspeed
Failed to build deepspeed
ERROR: Could not build wheels for deepspeed, which is required to install pyproject.toml-based projects

I got a lot of warnings saying that packages would be ignored, as shown near the top of the traceback, and I am getting a similar error (curand_kernel.h: No such file or directory) when attempting to install DeepSpeed the way you suggested. Any help would be greatly appreciated!

Interesting, something certainly looks wrong here. The missing curand_kernel.h seems to indicate that the CUDA installation isn't quite right, since in all likelihood the -lcurand flag is missing from the compile command.

Can you pip install deepspeed without building any ops, and then run ds_report so we can see what that says?

It might also be good to try this in a venv, just to ensure that no other packages are conflicting.
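A minimal sketch of that check, assuming `python3` with the `venv` module is available. The directory name `ds-venv` and the helper names `run` and `ds_sanity_check` are hypothetical; `DS_BUILD_OPS=0` is the DeepSpeed install-time switch for skipping op pre-compilation, and `ds_report` is the console script DeepSpeed installs. Setting `DRY_RUN=1` only prints the commands.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Install DeepSpeed into a fresh venv without pre-building ops, then report.
ds_sanity_check() {
    venv_dir="${1:-ds-venv}"   # hypothetical directory name
    run python3 -m venv "$venv_dir"
    # DS_BUILD_OPS=0 asks the DeepSpeed build not to pre-compile C++/CUDA ops.
    run env DS_BUILD_OPS=0 "$venv_dir/bin/pip" install deepspeed
    # ds_report prints op compatibility plus the detected torch/CUDA versions.
    run "$venv_dir/bin/ds_report"
}

# To actually run it: ds_sanity_check ds-venv
```

PyTorch may need to be installed in the same venv first for `ds_report` to show complete information.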

I did some digging and found that curand_kernel.h was not present in the /usr/local/cuda-11.8/targets/x86_64-linux/include directory, which caused the error in the traceback (fatal error: curand_kernel.h: No such file or directory). To solve this, I manually downloaded the .deb package for cuRAND from NVIDIA's CUDA repository and installed it like so:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/libcurand-11-8_10.3.0.86-1_amd64.deb
sudo dpkg -i libcurand-11-8_10.3.0.86-1_amd64.deb

This seems to have fixed the issue.
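For anyone hitting the same error, a quick way to confirm whether the header is present before rebuilding might look like the sketch below. The function name `find_cuda_header` is hypothetical, and the two include locations are assumptions based on the path mentioned above and the usual toolkit layout; the default root falls back to `/usr/local/cuda-11.8` when `CUDA_HOME` is unset.

```shell
# Print the full path of a CUDA header if it exists under the toolkit root.
# Usage: find_cuda_header <cuda_root> [header_name]
find_cuda_header() {
    cuda_root="$1"
    header="${2:-curand_kernel.h}"
    for dir in "$cuda_root/include" "$cuda_root/targets/x86_64-linux/include"; do
        if [ -f "$dir/$header" ]; then
            echo "$dir/$header"
            return 0
        fi
    done
    return 1
}

# Example check against the toolkit root (assumed default shown here).
root="${CUDA_HOME:-/usr/local/cuda-11.8}"
if found="$(find_cuda_header "$root")"; then
    echo "found: $found"
else
    echo "curand_kernel.h not found under $root"
fi
```

If the header is reported as missing, the build fails at the same `#include <curand_kernel.h>` line shown in the traceback.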

Interesting, glad that works for you!