microsoft/DeepSpeed

Error building extension 'fused_adam' with DeepSpeed==0.3.13

saichandrapandraju opened this issue · 17 comments

Hi,

I upgraded DeepSpeed to 0.3.13 and Torch to 1.8.0, and while using DeepSpeed with HF (HuggingFace) I'm getting the error below:
RuntimeError: Error building extension 'fused_adam'. Here is the stack trace:

[2021-03-23 07:03:49,374] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.13, git-hash=unknown, git-branch=unknown
[2021-03-23 07:03:49,407] [INFO] [engine.py:77:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /home/jovyan/.cache/torch_extensions as PyTorch extensions root...
Creating extension directory /home/jovyan/.cache/torch_extensions/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jovyan/.cache/torch_extensions/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
   1672             check=True,
-> 1673             env=env)
   1674     except subprocess.CalledProcessError as e:

/usr/lib/python3.6/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
    437             raise CalledProcessError(retcode, process.args,
--> 438                                      output=stdout, stderr=stderr)
    439     return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-24-3435b262f1ae> in <module>
----> 1 trainer.train()

~/.local/lib/python3.6/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
    901         delay_optimizer_creation = self.sharded_ddp is not None and self.sharded_ddp != ShardedDDPOption.SIMPLE
    902         if self.args.deepspeed:
--> 903             model, optimizer, lr_scheduler = init_deepspeed(self, num_training_steps=max_steps)
    904             self.model = model.module
    905             self.model_wrapped = model  # will get further wrapped in DDP

~/.local/lib/python3.6/site-packages/transformers/integrations.py in init_deepspeed(trainer, num_training_steps)
    416         model=model,
    417         model_parameters=model_parameters,
--> 418         config_params=config,
    419     )
    420 

~/.local/lib/python3.6/site-packages/deepspeed/__init__.py in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params)
    123                                  dist_init_required=dist_init_required,
    124                                  collate_fn=collate_fn,
--> 125                                  config_params=config_params)
    126     else:
    127         assert mpu is None, "mpu must be None with pipeline parallelism"

~/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py in __init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params, dont_change_device)
    181         self.lr_scheduler = None
    182         if model_parameters or optimizer:
--> 183             self._configure_optimizer(optimizer, model_parameters)
    184             self._configure_lr_scheduler(lr_scheduler)
    185             self._report_progress(0)

~/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_optimizer(self, client_optimizer, model_parameters)
    596                 logger.info('Using client Optimizer as basic optimizer')
    597         else:
--> 598             basic_optimizer = self._configure_basic_optimizer(model_parameters)
    599             if self.global_rank == 0:
    600                 logger.info(

~/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_basic_optimizer(self, model_parameters)
    670                     optimizer = FusedAdam(model_parameters,
    671                                           **optimizer_parameters,
--> 672                                           adam_w_mode=effective_adam_w_mode)
    673 
    674         elif self.optimizer_name() == LAMB_OPTIMIZER:

~/.local/lib/python3.6/site-packages/deepspeed/ops/adam/fused_adam.py in __init__(self, params, lr, bias_correction, betas, eps, adam_w_mode, weight_decay, amsgrad, set_grad_none)
     70         self.set_grad_none = set_grad_none
     71 
---> 72         fused_adam_cuda = FusedAdamBuilder().load()
     73         # Skip buffer
     74         self._dummy_overflow_buf = torch.cuda.IntTensor([0])

~/.local/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in load(self, verbose)
    213             return importlib.import_module(self.absolute_name())
    214         else:
--> 215             return self.jit_load(verbose)
    216 
    217     def jit_load(self, verbose=True):

~/.local/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in jit_load(self, verbose)
    250             extra_cuda_cflags=self.nvcc_args(),
    251             extra_ldflags=self.extra_ldflags(),
--> 252             verbose=verbose)
    253         build_duration = time.time() - start_build
    254         if verbose:

~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
   1089         is_python_module,
   1090         is_standalone,
-> 1091         keep_intermediates=keep_intermediates)
   1092 
   1093 

~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
   1300                         verbose=verbose,
   1301                         with_cuda=with_cuda,
-> 1302                         is_standalone=is_standalone)
   1303             finally:
   1304                 baton.release()

~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _write_ninja_file_and_build_library(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_standalone)
   1405         build_directory,
   1406         verbose,
-> 1407         error_prefix=f"Error building extension '{name}'")
   1408 
   1409 

~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
   1681         if hasattr(error, 'output') and error.output:  # type: ignore
   1682             message += f": {error.output.decode()}"  # type: ignore
-> 1683         raise RuntimeError(message) from e
   1684 
   1685 

RuntimeError: Error building extension 'fused_adam'

The versions I'm using are:

Collecting environment information...
PyTorch version: 1.8.0+cu101
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB

Nvidia driver version: 450.51.06
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] kubeflow-pytorchjob==0.1.3
[pip3] numpy==1.18.5
[pip3] torch==1.8.0+cu101
[pip3] torchvision==0.8.1
[conda] Could not collect

transformers==4.4.2
DeepSpeed==0.3.13
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

But I was able to run DeepSpeed-0.3.10 with HuggingFace-4.3.2 and Torch-1.7.1+cu101 without any issue.

Please suggest how to proceed further.

Here is the config file that I'm using for DeepSpeed:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

   "zero_optimization": {
       "stage": 2,
       "allgather_partitions": true,
       "allgather_bucket_size": 2e8,
       "overlap_comm": true,
       "reduce_scatter": true,
       "reduce_bucket_size": 2e8,
       "contiguous_gradients": true,
       "cpu_offload": false
   },

   "optimizer": {
     "type": "AdamW",
     "params": {
       "lr": 3e-5,
       "betas": [
         0.8,
         0.999
       ],
       "eps": 1e-8,
       "weight_decay": 3e-7
     }
   },

   "scheduler": {
     "type": "WarmupLR",
     "params": {
       "warmup_min_lr": 0,
       "warmup_max_lr": 3e-5,
       "warmup_num_steps": 500
     }
   },

    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}

When I use "cpu_offload": true, getting error as - RuntimeError: Error building extension 'cpu_adam'.
Below is the full stacktrace -

[2021-03-23 07:21:31,906] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.13, git-hash=unknown, git-branch=unknown
[2021-03-23 07:21:31,929] [INFO] [engine.py:77:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /home/jovyan/.cache/torch_extensions as PyTorch extensions root...
Creating extension directory /home/jovyan/.cache/torch_extensions/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jovyan/.cache/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
   1672             check=True,
-> 1673             env=env)
   1674     except subprocess.CalledProcessError as e:

/usr/lib/python3.6/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
    437             raise CalledProcessError(retcode, process.args,
--> 438                                      output=stdout, stderr=stderr)
    439     return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-24-3435b262f1ae> in <module>
----> 1 trainer.train()

~/.local/lib/python3.6/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
    901         delay_optimizer_creation = self.sharded_ddp is not None and self.sharded_ddp != ShardedDDPOption.SIMPLE
    902         if self.args.deepspeed:
--> 903             model, optimizer, lr_scheduler = init_deepspeed(self, num_training_steps=max_steps)
    904             self.model = model.module
    905             self.model_wrapped = model  # will get further wrapped in DDP

~/.local/lib/python3.6/site-packages/transformers/integrations.py in init_deepspeed(trainer, num_training_steps)
    416         model=model,
    417         model_parameters=model_parameters,
--> 418         config_params=config,
    419     )
    420 

~/.local/lib/python3.6/site-packages/deepspeed/__init__.py in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params)
    123                                  dist_init_required=dist_init_required,
    124                                  collate_fn=collate_fn,
--> 125                                  config_params=config_params)
    126     else:
    127         assert mpu is None, "mpu must be None with pipeline parallelism"

~/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py in __init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params, dont_change_device)
    181         self.lr_scheduler = None
    182         if model_parameters or optimizer:
--> 183             self._configure_optimizer(optimizer, model_parameters)
    184             self._configure_lr_scheduler(lr_scheduler)
    185             self._report_progress(0)

~/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_optimizer(self, client_optimizer, model_parameters)
    596                 logger.info('Using client Optimizer as basic optimizer')
    597         else:
--> 598             basic_optimizer = self._configure_basic_optimizer(model_parameters)
    599             if self.global_rank == 0:
    600                 logger.info(

~/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_basic_optimizer(self, model_parameters)
    665                     optimizer = DeepSpeedCPUAdam(model_parameters,
    666                                                  **optimizer_parameters,
--> 667                                                  adamw_mode=effective_adam_w_mode)
    668                 else:
    669                     from deepspeed.ops.adam import FusedAdam

~/.local/lib/python3.6/site-packages/deepspeed/ops/adam/cpu_adam.py in __init__(self, model_params, lr, bias_correction, betas, eps, weight_decay, amsgrad, adamw_mode)
     76         DeepSpeedCPUAdam.optimizer_id = DeepSpeedCPUAdam.optimizer_id + 1
     77         self.adam_w_mode = adamw_mode
---> 78         self.ds_opt_adam = CPUAdamBuilder().load()
     79 
     80         self.ds_opt_adam.create_adam(self.opt_id,

~/.local/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in load(self, verbose)
    213             return importlib.import_module(self.absolute_name())
    214         else:
--> 215             return self.jit_load(verbose)
    216 
    217     def jit_load(self, verbose=True):

~/.local/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in jit_load(self, verbose)
    250             extra_cuda_cflags=self.nvcc_args(),
    251             extra_ldflags=self.extra_ldflags(),
--> 252             verbose=verbose)
    253         build_duration = time.time() - start_build
    254         if verbose:

~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
   1089         is_python_module,
   1090         is_standalone,
-> 1091         keep_intermediates=keep_intermediates)
   1092 
   1093 

~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
   1300                         verbose=verbose,
   1301                         with_cuda=with_cuda,
-> 1302                         is_standalone=is_standalone)
   1303             finally:
   1304                 baton.release()

~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _write_ninja_file_and_build_library(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_standalone)
   1405         build_directory,
   1406         verbose,
-> 1407         error_prefix=f"Error building extension '{name}'")
   1408 
   1409 

~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
   1681         if hasattr(error, 'output') and error.output:  # type: ignore
   1682             message += f": {error.output.decode()}"  # type: ignore
-> 1683         raise RuntimeError(message) from e
   1684 
   1685 

RuntimeError: Error building extension 'cpu_adam'
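
For reference, a minimal first round of checks for this kind of JIT op build failure (nothing specific to this issue; nvcc, torch.version.cuda and DeepSpeed's ds_report all appear further down in this thread):

# check that the system CUDA toolkit matches the CUDA build of PyTorch
nvcc --version
python -c "import torch; print(torch.version.cuda)"
# DeepSpeed's op compatibility report
ds_report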

I created a colab notebook that took quite a lot of trial and error to figure out the right versions of everything to make DeepSpeed compile.

As you can see in the notebook, I'm using torch==1.7.1+cu110. Have you tried running my notebook? It worked a week ago.

I will try to keep it up-to-date as colab changes its setup, so ping me (or file an issue) if it stops working and needs a tune-up.

I documented the critical components for successfully building deepspeed here.

Hi @stas00 ,

I ran your notebook in colab and it gave this error:

[1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.7/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -c /usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/adam/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.7/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -L/usr/local/cuda/lib64 -lcudart -lcublas -g -Wno-reorder -march=native -fopenmp -D__AVX256__ -c /usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -L/usr/local/lib/python3.7/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 26.246264457702637 seconds
Killing subprocess 468
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/launcher/launch.py", line 171, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/launcher/launch.py", line 161, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'examples/seq2seq/run_seq2seq.py', '--local_rank=0', '--model_name_or_path', 'google/mt5-small', '--output_dir', 'output_dir', '--adam_eps', '1e-06', '--evaluation_strategy=steps', '--do_train', '--label_smoothing', '0.1', '--learning_rate', '3e-5', '--logging_first_step', '--logging_steps', '1000', '--max_source_length', '128', '--max_target_length', '128', '--num_train_epochs', '1', '--overwrite_output_dir', '--per_device_train_batch_size', '16', '--predict_with_generate', '--sortish_sampler', '--val_max_target_length', '128', '--warmup_steps', '500', '--max_train_samples', '2000', '--max_val_samples', '500', '--task', 'translation_en_to_ro', '--dataset_name', 'wmt16', '--dataset_config', 'ro-en', '--source_prefix', 'translate English to Romanian: ', '--deepspeed', 'ds_config.json', '--fp16']' died with <Signals.SIGKILL: 9>.

Here is the full notebook with outputs.

None of the critical components you mentioned for building deepspeed apply in my case, since I'm using the system-wide cuda version when installing torch. I also don't have multiple cuda versions on my system, and I'm using gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0.

This is odd, since I have just re-run my notebook on the free version of colab and it didn't have any problems.

You may have noticed you made progress, since you managed to build the deepspeed extension using this notebook. But then something killed the process immediately after it built the extension. So now you have a correct combination of packages.

Try re-running that last cell, since the extension is now built and cached (that is, if you're in the same session; if not, start a new one and re-run the cell a second time if it dies the first time).

In theory everybody gets mostly the same environment, but perhaps that's not so. Could you monitor that your disk space and RAM are not at 100%? Perhaps the watchdog kills the process when resources are exhausted.

I'm curious what happens if you run the training cell the 2nd time.

You're absolutely correct @stas00

RAM is reaching 100% and the process is getting killed. I ran it a second time after changing cpu_offload to false, and training completed successfully. But I'm not sure why the same didn't happen for you in colab.

Also, I'm wondering how this setup relates to my issue. Are you suggesting that I upgrade my system-wide cuda to 11?

RAM is reaching 100% and the process is getting killed. I ran it a second time after changing cpu_offload to false, and training completed successfully. But I'm not sure why the same didn't happen for you in colab.

Glad you figured it out!

We don't know if you get the same environment as I do. Actually it looks pretty random. I just tried 2 different notebooks and in the deepspeed one it gave me 25GB RAM and in another one only 12GB!

Run a cell with:

! free -h

I will add this to the notebook with a note, so others will know.
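
A few related checks that can be run in a notebook cell while training (standard Linux/NVIDIA tools, not specific to this notebook):

! free -h      # RAM and swap usage
! df -h /      # disk usage on the root filesystem
! nvidia-smi   # GPU memory and utilization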

Also, I'm wondering how this setup relates to my issue. Are you suggesting that I upgrade my system-wide cuda to 11?

Not at all. You just need to have the same cuda as your pytorch was built with, so just install a pytorch build that matches your system-wide cuda and make sure you have PATH and LD_LIBRARY_PATH set correctly. It's all documented here:

https://huggingface.co/transformers/main_classes/trainer.html#installation-notes
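
(A minimal sketch of what those notes boil down to, assuming CUDA 10.1 is installed under /usr/local/cuda-10.1; adjust the path for your system:)

# put the matching CUDA toolkit first on the search paths
export PATH=/usr/local/cuda-10.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH

# verify that nvcc and the CUDA build of PyTorch now agree
which nvcc
python -c "import torch; print(torch.version.cuda)"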

You can probably close this issue now.

Hi @stas00
But I'm still facing the same issue after making the following changes suggested in HF's installation notes.

Previously my paths were:

!which nvcc
/usr/local/cuda/bin/nvcc
!echo $LD_LIBRARY_PATH
!echo $PATH
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
/home/jovyan/.local/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

and now they are modified to:

!echo $LD_LIBRARY_PATH
!echo $PATH
/usr/local/cuda-10.1/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
/usr/local/cuda-10.1/bin:/home/jovyan/.local/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
!which nvcc
/usr/local/cuda-10.1/bin/nvcc

I have cuda-10.1 in /usr/local/, so the above paths are correct.
Here is the ds_report:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
 [WARNING]  sparse_attn requires the 'cmake' command, but it does not exist!
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/jovyan/.local/lib/python3.6/site-packages/torch']
torch version .................... 1.8.0+cu101
torch cuda version ............... 10.1
nvcc version ..................... 10.1
deepspeed install path ........... ['/home/jovyan/.local/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.3.13, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.8, cuda 10.1

I also tried this from a similar issue, but it didn't work.

Oh, so it's not colab that you're trying to get it to work on. OK!

But I was able to run DeepSpeed-0.3.10 with HuggingFace-4.3.2 and Torch-1.7.1+cu101 without any issue.

Can you build it by downgrading to torch-1.7.1+cu101? Just to validate that deepspeed master is not at fault since you had it working with 0.3.10 - but as you see you changed the pytorch version as well.

Where did you find torch-1.8.0-cu101?

I can only see 10.2 or 11.1 at https://pytorch.org/get-started/locally/

Alternatively, if your nvidia driver supports it, move to 11.1; you also get a better cudnn along with newer drivers/cuda, which you'd want to upgrade as well. I use cuda-11.1 at the moment and it works well.

If you want to save the hassle of upgrading to 11.1 and keep 10.1, I'd pre-build from source, since it'd help you identify any problems more easily. See the details here:
https://www.deepspeed.ai/tutorials/advanced-install/#pre-install-deepspeed-ops
This is the approach I use most of the time on machines where building the extension at run time proves problematic.

My build script is:

#!/bin/bash

rm -rf build
time TORCH_CUDA_ARCH_LIST="6.1;8.6" DS_BUILD_OPS=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1 | tee build.log

You just need to adjust the arch list to match your hardware, and maybe -j to match how many parallel make jobs you'd like to run. This does a develop install; you can remove -e if you want to.
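
If you're unsure what to put in TORCH_CUDA_ARCH_LIST, a quick way to query the compute capability of the GPU on the build machine (assuming a single visible GPU at index 0) is:

# prints e.g. 7.0 for a V100; use that value (one or more, separated by ";") in TORCH_CUDA_ARCH_LIST
python -c "import torch; p = torch.cuda.get_device_properties(0); print(f'{p.major}.{p.minor}')"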

Hi @stas00 ,

Can you build it by downgrading to torch-1.7.1+cu101? Just to validate that deepspeed master is not at fault since you had it working with 0.3.10 - but as you see you changed the pytorch version as well.

I tried this and I'm still facing the same issue.

Where did you find torch-1.8.0-cu101?

I downloaded cu101/torch-1.8.0%2Bcu101-cp36-cp36m-linux_x86_64.whl from https://download.pytorch.org/whl/torch_stable.html

If you want to save the hassle of upgrading to 11.1 and keep 10.1, I'd pre-build from source, since it'd help you identify any problems more easily.

If this downloads and installs modules from external sources, that's a problem: my VM won't have open internet access and everything has to go through my company's firewall. If that (downloading from external sources) is the case, I may not be able to pre-build from source.
Please confirm whether it collects the necessary things from external sources.

Alternatively, if your nvidia driver supports it, move to 11.1; you also get a better cudnn along with newer drivers/cuda, which you'd want to upgrade as well. I use cuda-11.1 at the moment and it works well.

I want this to be the last option, as it's not in my control and I'd have to contact another team to upgrade.

Meanwhile, I tried prebuilding in colab with different combinations, and all of them worked fine; you can find the detailed outputs here.

If you want to save the hassle of upgrading to 11.1 and keep 10.1, I'd pre-build from source, since it'd help you identify any problems more easily.

If this downloads and installs modules from external sources, that's a problem: my VM won't have open internet access and everything has to go through my company's firewall. If that (downloading from external sources) is the case, I may not be able to pre-build from source.
Please confirm whether it collects the necessary things from external sources.

Meanwhile, I tried prebuilding in colab with different combinations, and all of them worked fine; you can find the detailed outputs here.

OK, so since you prebuilt from source on colab (thank you for sharing the outcomes), you now know what's involved. It installs dependencies just like when you don't pre-build from source. So if you are able to do pip install deepspeed on your setup, you can also do the same here, i.e. preinstall all the dependencies while you have network access, just like you'd do normally.
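
One common way to deal with the firewall constraint (standard pip tooling, not specific to DeepSpeed) is to fetch the dependency wheels on a machine that does have network access and carry them over, for example:

# on a machine with internet access: download deepspeed and its dependencies as wheels
pip download deepspeed -d ./wheels
# on the firewalled VM, after copying ./wheels over: install without touching the network
pip install --no-index --find-links ./wheels deepspeed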

Here is yet another approach to consider: build a binary wheel on any normal machine where you have a similar cuda setup:

git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="6.1;8.6" DS_BUILD_OPS=1 python setup.py build_ext -j8 bdist_wheel

adjust TORCH_CUDA_ARCH_LIST for the required archs on the target machine.

Now you have dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl (will be a different name depending on the build).

Now you can install it on your VM without needing to build anything at run time; you just do:

pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl

I presume you will already have the other dependencies installed since you already did that for pip install deepspeed.

I wonder if DeepSpeed should document this approach on their advanced install page.

Thanks a lot @stas00

Finally it worked. As colab has python-3.7, I replicated what you said on my AWS EC2 instance, where I have multiple CUDA versions including 10.1 and 11.1.

Reiterating the steps I followed so that they can help someone with similar issues:

  1. Created a conda environment with python-3.6.9 (because the target machine where I want to run DeepSpeed has 3.6.9).

  2. Changed PATH and LD_LIBRARY_PATH to point to CUDA-10.1 (again because of my target machine), as suggested in HF's installation notes here. Below are the commands:

export PATH=/usr/local/cuda-10.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH
  3. Installed PyTorch (should be the same version as on the target machine). I installed it with the command below:
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

Verify the torch/cuda versions with python -m torch.utils.collect_env.

  4. Executed the commands below to pre-build DeepSpeed:
git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
time DS_BUILD_OPS=1 pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1 | tee build.log

Check whether the compatible ops were installed with ds_report.

  5. Built the whl file for this DeepSpeed using the commands below:
rm -rf build
DS_BUILD_OPS=1 python setup.py build_ext -j8 bdist_wheel
  6. Took the whl from dist/ and installed it on the target machine using pip install deepspeed-0.3.13+7fcc891-cp36-cp36m-linux_x86_64.whl (a quick sanity check is sketched below).
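
As a quick sanity check on the target machine after installing the wheel (ds_report is DeepSpeed's environment report, shown earlier in this thread):

# the prebuilt ops should now show up as installed instead of needing JIT compilation
ds_report
python -c "import deepspeed; print(deepspeed.__version__)"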

Awesome! Thank you for the report, @saichandrapandraju

Except you don't need step 4. Step 5 is all you need after you cloned the repo.

Step 4 is for when you want to install it locally; it is similar to steps 5+6, but you don't get a wheel to take to another machine.

I had the same issue with fairscale on several setups: no matter what I tried, it wouldn't build at runtime, but prebuilding it into a wheel and installing that worked.

BTW, I do recommend you use an explicit TORCH_CUDA_ARCH_LIST for your gpus during the build, since from what I understand you may get better performance that way, especially if your build machine doesn't have the same gpus as your target machine.

Yes.

In my case both my build and target machines are the same, so I didn't use TORCH_CUDA_ARCH_LIST.

But yeah, it's always better to mention it explicitly. For reference, I used torch.cuda.get_device_properties(device) to check my device architecture, which gives output like _CudaDeviceProperties(name='Tesla V100-SXM2-32GB', major=7, minor=0, total_memory=32510MB, multi_processor_count=80).

I'm not entirely sure, but from the above output I thought my device architecture is 7.0. One can also check the list of CUDA architectures the installed torch is compiled for using torch.cuda.get_arch_list(), which gives output like:
['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75'] for torch==1.7.1+cu101
['sm_37', 'sm_50', 'sm_60', 'sm_70'] for torch==1.8.1+cu101

I'm not sure whether this is the correct way to check. Maybe @stas00 can confirm.

That's the correct way: major=7, minor=0 => 7.0

Also you can find the full list of all archs at https://developer.nvidia.com/cuda-gpus

Incidentally, I have just added all this information to the docs; hopefully it will be merged in the next few days:


You can find the complete list of NVIDIA GPUs and their corresponding **Compute Capabilities** 
(same as arch in this context) `here <https://developer.nvidia.com/cuda-gpus>`__.

You can check the archs pytorch was built with using:

.. code-block:: bash

   python -c "import torch; print(torch.cuda.get_arch_list())"

Here is how to find out the arch for one of the installed GPUs. For example, for GPU 0:

.. code-block:: bash

   CUDA_VISIBLE_DEVICES=0 python -c "import torch; \
   print(torch.cuda.get_device_properties(torch.device('cuda')))"

If the output is:

.. code-block:: bash

   _CudaDeviceProperties(name='GeForce RTX 3090', major=8, minor=6, total_memory=24268MB, multi_processor_count=82)

then you know that this card's arch is ``8.6``.

You can also leave ``TORCH_CUDA_ARCH_LIST`` out completely, and then the build program will automatically query the
architecture of the GPUs the build is made on. This may or may not match the GPUs on the target machines, which is
why it's best to specify the desired archs explicitly.
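
For example, for the Tesla V100 discussed in this thread (compute capability 7.0), an explicit wheel build would look like the earlier command with only the arch list changed:

git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="7.0" DS_BUILD_OPS=1 python setup.py build_ext -j8 bdist_wheel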

Also added a deepspeed PR with various docs including how to build the binary wheel: #909