Error building extension 'fused_adam' with DeepSpeed==0.3.13
saichandrapandraju opened this issue · 17 comments
Hi,
I upgraded DeepSpeed to 0.3.13 and Torch to 1.8.0, and while using DeepSpeed with HF (HuggingFace) I'm getting the error below:
RuntimeError: Error building extension 'fused_adam'
Here is the stack trace:
[2021-03-23 07:03:49,374] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.13, git-hash=unknown, git-branch=unknown
[2021-03-23 07:03:49,407] [INFO] [engine.py:77:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /home/jovyan/.cache/torch_extensions as PyTorch extensions root...
Creating extension directory /home/jovyan/.cache/torch_extensions/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jovyan/.cache/torch_extensions/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
1672 check=True,
-> 1673 env=env)
1674 except subprocess.CalledProcessError as e:
/usr/lib/python3.6/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
437 raise CalledProcessError(retcode, process.args,
--> 438 output=stdout, stderr=stderr)
439 return CompletedProcess(process.args, retcode, stdout, stderr)
CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-24-3435b262f1ae> in <module>
----> 1 trainer.train()
~/.local/lib/python3.6/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
901 delay_optimizer_creation = self.sharded_ddp is not None and self.sharded_ddp != ShardedDDPOption.SIMPLE
902 if self.args.deepspeed:
--> 903 model, optimizer, lr_scheduler = init_deepspeed(self, num_training_steps=max_steps)
904 self.model = model.module
905 self.model_wrapped = model # will get further wrapped in DDP
~/.local/lib/python3.6/site-packages/transformers/integrations.py in init_deepspeed(trainer, num_training_steps)
416 model=model,
417 model_parameters=model_parameters,
--> 418 config_params=config,
419 )
420
~/.local/lib/python3.6/site-packages/deepspeed/__init__.py in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params)
123 dist_init_required=dist_init_required,
124 collate_fn=collate_fn,
--> 125 config_params=config_params)
126 else:
127 assert mpu is None, "mpu must be None with pipeline parallelism"
~/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py in __init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params, dont_change_device)
181 self.lr_scheduler = None
182 if model_parameters or optimizer:
--> 183 self._configure_optimizer(optimizer, model_parameters)
184 self._configure_lr_scheduler(lr_scheduler)
185 self._report_progress(0)
~/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_optimizer(self, client_optimizer, model_parameters)
596 logger.info('Using client Optimizer as basic optimizer')
597 else:
--> 598 basic_optimizer = self._configure_basic_optimizer(model_parameters)
599 if self.global_rank == 0:
600 logger.info(
~/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_basic_optimizer(self, model_parameters)
670 optimizer = FusedAdam(model_parameters,
671 **optimizer_parameters,
--> 672 adam_w_mode=effective_adam_w_mode)
673
674 elif self.optimizer_name() == LAMB_OPTIMIZER:
~/.local/lib/python3.6/site-packages/deepspeed/ops/adam/fused_adam.py in __init__(self, params, lr, bias_correction, betas, eps, adam_w_mode, weight_decay, amsgrad, set_grad_none)
70 self.set_grad_none = set_grad_none
71
---> 72 fused_adam_cuda = FusedAdamBuilder().load()
73 # Skip buffer
74 self._dummy_overflow_buf = torch.cuda.IntTensor([0])
~/.local/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in load(self, verbose)
213 return importlib.import_module(self.absolute_name())
214 else:
--> 215 return self.jit_load(verbose)
216
217 def jit_load(self, verbose=True):
~/.local/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in jit_load(self, verbose)
250 extra_cuda_cflags=self.nvcc_args(),
251 extra_ldflags=self.extra_ldflags(),
--> 252 verbose=verbose)
253 build_duration = time.time() - start_build
254 if verbose:
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
1089 is_python_module,
1090 is_standalone,
-> 1091 keep_intermediates=keep_intermediates)
1092
1093
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
1300 verbose=verbose,
1301 with_cuda=with_cuda,
-> 1302 is_standalone=is_standalone)
1303 finally:
1304 baton.release()
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _write_ninja_file_and_build_library(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_standalone)
1405 build_directory,
1406 verbose,
-> 1407 error_prefix=f"Error building extension '{name}'")
1408
1409
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
1681 if hasattr(error, 'output') and error.output: # type: ignore
1682 message += f": {error.output.decode()}" # type: ignore
-> 1683 raise RuntimeError(message) from e
1684
1685
RuntimeError: Error building extension 'fused_adam'
The versions I'm using are:
Collecting environment information...
PyTorch version: 1.8.0+cu101
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
Nvidia driver version: 450.51.06
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] kubeflow-pytorchjob==0.1.3
[pip3] numpy==1.18.5
[pip3] torch==1.8.0+cu101
[pip3] torchvision==0.8.1
[conda] Could not collect
transformers==4.4.2
DeepSpeed==0.3.13
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
But I was able to run DeepSpeed-0.3.10 with HuggingFace-4.3.2 and Torch-1.7.1+cu101 without any issue.
Please suggest how to proceed further.
Here is the config file that I'm using for DeepSpeed -
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true,
"cpu_offload": false
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": 3e-5,
"betas": [
0.8,
0.999
],
"eps": 1e-8,
"weight_decay": 3e-7
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 3e-5,
"warmup_num_steps": 500
}
},
"steps_per_print": 2000,
"wall_clock_breakdown": false
}
When I use "cpu_offload": true, I get RuntimeError: Error building extension 'cpu_adam'.
Below is the full stack trace:
[2021-03-23 07:21:31,906] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.13, git-hash=unknown, git-branch=unknown
[2021-03-23 07:21:31,929] [INFO] [engine.py:77:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /home/jovyan/.cache/torch_extensions as PyTorch extensions root...
Creating extension directory /home/jovyan/.cache/torch_extensions/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jovyan/.cache/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
1672 check=True,
-> 1673 env=env)
1674 except subprocess.CalledProcessError as e:
/usr/lib/python3.6/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
437 raise CalledProcessError(retcode, process.args,
--> 438 output=stdout, stderr=stderr)
439 return CompletedProcess(process.args, retcode, stdout, stderr)
CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-24-3435b262f1ae> in <module>
----> 1 trainer.train()
~/.local/lib/python3.6/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
901 delay_optimizer_creation = self.sharded_ddp is not None and self.sharded_ddp != ShardedDDPOption.SIMPLE
902 if self.args.deepspeed:
--> 903 model, optimizer, lr_scheduler = init_deepspeed(self, num_training_steps=max_steps)
904 self.model = model.module
905 self.model_wrapped = model # will get further wrapped in DDP
~/.local/lib/python3.6/site-packages/transformers/integrations.py in init_deepspeed(trainer, num_training_steps)
416 model=model,
417 model_parameters=model_parameters,
--> 418 config_params=config,
419 )
420
~/.local/lib/python3.6/site-packages/deepspeed/__init__.py in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params)
123 dist_init_required=dist_init_required,
124 collate_fn=collate_fn,
--> 125 config_params=config_params)
126 else:
127 assert mpu is None, "mpu must be None with pipeline parallelism"
~/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py in __init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params, dont_change_device)
181 self.lr_scheduler = None
182 if model_parameters or optimizer:
--> 183 self._configure_optimizer(optimizer, model_parameters)
184 self._configure_lr_scheduler(lr_scheduler)
185 self._report_progress(0)
~/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_optimizer(self, client_optimizer, model_parameters)
596 logger.info('Using client Optimizer as basic optimizer')
597 else:
--> 598 basic_optimizer = self._configure_basic_optimizer(model_parameters)
599 if self.global_rank == 0:
600 logger.info(
~/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_basic_optimizer(self, model_parameters)
665 optimizer = DeepSpeedCPUAdam(model_parameters,
666 **optimizer_parameters,
--> 667 adamw_mode=effective_adam_w_mode)
668 else:
669 from deepspeed.ops.adam import FusedAdam
~/.local/lib/python3.6/site-packages/deepspeed/ops/adam/cpu_adam.py in __init__(self, model_params, lr, bias_correction, betas, eps, weight_decay, amsgrad, adamw_mode)
76 DeepSpeedCPUAdam.optimizer_id = DeepSpeedCPUAdam.optimizer_id + 1
77 self.adam_w_mode = adamw_mode
---> 78 self.ds_opt_adam = CPUAdamBuilder().load()
79
80 self.ds_opt_adam.create_adam(self.opt_id,
~/.local/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in load(self, verbose)
213 return importlib.import_module(self.absolute_name())
214 else:
--> 215 return self.jit_load(verbose)
216
217 def jit_load(self, verbose=True):
~/.local/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in jit_load(self, verbose)
250 extra_cuda_cflags=self.nvcc_args(),
251 extra_ldflags=self.extra_ldflags(),
--> 252 verbose=verbose)
253 build_duration = time.time() - start_build
254 if verbose:
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
1089 is_python_module,
1090 is_standalone,
-> 1091 keep_intermediates=keep_intermediates)
1092
1093
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
1300 verbose=verbose,
1301 with_cuda=with_cuda,
-> 1302 is_standalone=is_standalone)
1303 finally:
1304 baton.release()
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _write_ninja_file_and_build_library(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_standalone)
1405 build_directory,
1406 verbose,
-> 1407 error_prefix=f"Error building extension '{name}'")
1408
1409
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
1681 if hasattr(error, 'output') and error.output: # type: ignore
1682 message += f": {error.output.decode()}" # type: ignore
-> 1683 raise RuntimeError(message) from e
1684
1685
RuntimeError: Error building extension 'cpu_adam'
I created a colab notebook that took quite a lot of trial and error to figure out the right versions of everything to make DeepSpeed compile.
As you can see in the notebook, I'm using torch==1.7.1+cu110 - have you tried running my notebook? It worked a week ago.
I will try to keep it up to date as colab changes its setup, so ping me (or file an issue) if it stops working and needs a tune-up.
I documented the critical components of successfully building deepspeed here.
Hi @stas00 ,
I ran your notebook in colab and it gave this error -
[1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.7/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -c /usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/adam/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.7/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -L/usr/local/cuda/lib64 -lcudart -lcublas -g -Wno-reorder -march=native -fopenmp -D__AVX256__ -c /usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -L/usr/local/lib/python3.7/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 26.246264457702637 seconds
Killing subprocess 468
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/dist-packages/deepspeed/launcher/launch.py", line 171, in <module>
main()
File "/usr/local/lib/python3.7/dist-packages/deepspeed/launcher/launch.py", line 161, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/usr/local/lib/python3.7/dist-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'examples/seq2seq/run_seq2seq.py', '--local_rank=0', '--model_name_or_path', 'google/mt5-small', '--output_dir', 'output_dir', '--adam_eps', '1e-06', '--evaluation_strategy=steps', '--do_train', '--label_smoothing', '0.1', '--learning_rate', '3e-5', '--logging_first_step', '--logging_steps', '1000', '--max_source_length', '128', '--max_target_length', '128', '--num_train_epochs', '1', '--overwrite_output_dir', '--per_device_train_batch_size', '16', '--predict_with_generate', '--sortish_sampler', '--val_max_target_length', '128', '--warmup_steps', '500', '--max_train_samples', '2000', '--max_val_samples', '500', '--task', 'translation_en_to_ro', '--dataset_name', 'wmt16', '--dataset_config', 'ro-en', '--source_prefix', 'translate English to Romanian: ', '--deepspeed', 'ds_config.json', '--fp16']' died with <Signals.SIGKILL: 9>.
Here is the full notebook with outputs.
None of the critical components you mentioned for building deepspeed apply in my case, as I'm using the system-wide cuda version when installing torch. Also, I don't have multiple cuda versions on my system, and I'm using gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0.
This is odd, since I have just re-run my notebook on the free version of colab and it didn't have any problems.
As you may have noticed, you made progress: you managed to build the deepspeed extensions using this notebook, so you now have a correct combination of packages. But then something killed the process immediately after it built the extension.
Try re-running that last cell, since the extension is now built and cached (that is, if you're in the same session; if not, start a new one and re-run the cell a second time if it dies the first time).
In theory everybody gets mostly the same environment, but perhaps that's not so. Could you check that your disk space and RAM are not at 100%? Perhaps the watchdog kills the process when resources are exhausted.
I'm curious what happens if you run the training cell a second time.
You're absolutely correct, @stas00 -
RAM is reaching 100% and the process is getting killed. I ran it a second time with cpu_offload changed to false and training completed successfully. But I'm not sure why the same didn't happen for you in colab.
Also, I'm wondering how this setup relates to my issue. Are you suggesting I upgrade my system-wide cuda to 11?
RAM is reaching 100% and the process is getting killed. I ran it a second time with cpu_offload changed to false and training completed successfully. But I'm not sure why the same didn't happen for you in colab.
Glad you figured it out!
We don't know if you get the same environment as I do. Actually it looks pretty random. I just tried 2 different notebooks and in the deepspeed one it gave me 25GB RAM and in another one only 12GB!
Run a cell with:
! free -h
I will add this to the notebook with a note, so others will know.
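A slightly fuller check, if you also want to rule out disk exhaustion, is to run the standard Linux utilities from a cell (a minimal sketch, nothing DeepSpeed-specific):
! free -h    # RAM and swap usage
! df -h /    # disk usage of the root filesystem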
Also, I'm wondering how this setup relates to my issue. Are you suggesting I upgrade my system-wide cuda to 11?
Not at all. You just need to have the same cuda as your pytorch was built with, so just install a pytorch build that matches your system-wide cuda and make sure you have PATH and LD_LIBRARY_PATH set correctly. It's all documented here:
https://huggingface.co/transformers/main_classes/trainer.html#installation-notes
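For a system-wide CUDA 10.1 installed under /usr/local/cuda-10.1 (the layout used later in this thread), the adjustment would look roughly like this - a sketch only, adjust the path to wherever your CUDA toolkit actually lives:
export PATH=/usr/local/cuda-10.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH
which nvcc && nvcc --version    # should report release 10.1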
You can probably close this issue now.
Hi @stas00
I'm still facing the same issue after making the following changes suggested in HF's installation notes.
Previously my paths were:
!which nvcc
/usr/local/cuda/bin/nvcc
!echo $LD_LIBRARY_PATH
!echo $PATH
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
/home/jovyan/.local/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
and now they are modified to:
!echo $LD_LIBRARY_PATH
!echo $PATH
/usr/local/cuda-10.1/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
/usr/local/cuda-10.1/bin:/home/jovyan/.local/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
!which nvcc
/usr/local/cuda-10.1/bin/nvcc
I have cuda-10.1 in /usr/local/, so the above paths are correct.
Here is the ds_report output:
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
[WARNING] sparse_attn requires the 'cmake' command, but it does not exist!
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/jovyan/.local/lib/python3.6/site-packages/torch']
torch version .................... 1.8.0+cu101
torch cuda version ............... 10.1
nvcc version ..................... 10.1
deepspeed install path ........... ['/home/jovyan/.local/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.3.13, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.8, cuda 10.1
I also tried this from a similar issue, but it didn't work.
Oh, so it's not colab that you're trying to get it to work on. OK!
But I was able to run DeepSpeed-0.3.10 with HuggingFace-4.3.2 and Torch-1.7.1+cu101 without any issue.
Can you build it by downgrading to torch-1.7.1+cu101? Just to validate that deepspeed master is not at fault since you had it working with 0.3.10 - but as you see you changed the pytorch version as well.
Where did you find torch-1.8.0-cu101?
I can see only 10.2 or 11.1 at https://pytorch.org/get-started/locally/
Alternatively, if your nvidia driver supports it, move to 11.1 - you also get a better cudnn along with the newer driver/cuda, which you'll want to upgrade as well. I use cuda-11.1 at the moment and it works well.
If you want to save the hassle of upgrading to 11.1 and keep 10.1, I'd pre-build from source, since that will make it easier to identify any problems. See the details here:
https://www.deepspeed.ai/tutorials/advanced-install/#pre-install-deepspeed-ops
This is the approach that I use most of the time on machines where building the extension at run-time proves to be problematic.
My build script is:
#!/bin/bash
rm -rf build
time TORCH_CUDA_ARCH_LIST="6.1;8.6" DS_BUILD_OPS=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1 | tee build.log
You just need to adjust the arch list to match your hardware, and maybe -j to match how many parallel make jobs you'd like to run. This does a develop install; you can remove -e if you don't want that.
Hi @stas00 ,
Can you build it by downgrading to torch-1.7.1+cu101? Just to validate that deepspeed master is not at fault since you had it working with 0.3.10 - but as you see you changed the pytorch version as well.
I tried this and I'm still facing the same issue.
Where did you find torch-1.8.0-cu101?
I downloaded cu101/torch-1.8.0%2Bcu101-cp36-cp36m-linux_x86_64.whl
from https://download.pytorch.org/whl/torch_stable.html
If you want to save the hassle of upgrading to 11.1 and keep 10.1, I'd pre-build from source, since that will make it easier to identify any problems.
If this downloads and installs modules from external sources, that could be a problem: my VM won't have open internet access and everything has to go through my company's firewall. If that (downloading from external sources) is the case, I may not be able to pre-build from source.
Please confirm whether it pulls the necessary pieces from external sources.
Alternatively, if your nvidia driver supports it, move to 11.1 - you also get a better cudnn along with the newer driver/cuda, which you'll want to upgrade as well. I use cuda-11.1 at the moment and it works well.
I want this to be the last option, as it's not in my control and I'd have to contact another team to upgrade.
Meanwhile, I tried pre-building in colab with different combinations and all of them worked fine; you can find the detailed outputs here.
If you want to save the hassle of upgrading to 11.1 and keep 10.1, I'd pre-build from source, since that will make it easier to identify any problems.
If this downloads and installs modules from external sources, that could be a problem: my VM won't have open internet access and everything has to go through my company's firewall. If that (downloading from external sources) is the case, I may not be able to pre-build from source.
Please confirm whether it pulls the necessary pieces from external sources.
Meanwhile, I tried pre-building in colab with different combinations and all of them worked fine; you can find the detailed outputs here.
OK, so since you pre-built from source on colab (thank you for sharing the outcomes), you now know what's involved. It installs dependencies just like when you don't pre-build from source. So if you are able to do pip install deepspeed on your setup, you can also do the same here, i.e. pre-install all the dependencies while you have network access, just like you'd do normally.
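If the concern is only network access at install time, pip can also fetch everything on a networked machine and install it offline later - a sketch, assuming both machines have the same Python version and platform:
pip download deepspeed -d ./deps    # on the networked machine: pulls deepspeed plus its dependencies into ./deps
pip install --no-index --find-links ./deps deepspeed    # on the firewalled VM, using only the local files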
Here is yet another approach to consider: build a binary wheel on any normal machine where you have a similar cuda setup:
git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="6.1;8.6" DS_BUILD_OPS=1 python setup.py build_ext -j8 bdist_wheel
Adjust TORCH_CUDA_ARCH_LIST for the required archs on the target machine.
Now you have dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl (the name will differ depending on the build).
Now you can install it on your VM without needing to build anything at run time; you just do:
pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl
I presume you already have the other dependencies installed, since you already did that for pip install deepspeed.
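Once the wheel is installed on the target machine, a quick sanity check (a sketch; ds_report output formatting may vary between versions) is:
ds_report    # the ops you pre-built should now show as installed rather than JIT-compiled
python -c "import deepspeed; print(deepspeed.__version__)"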
I wonder if DeepSpeed should document this approach on their advanced install page.
Thanks a lot @stas00,
Finally it worked. As colab has python 3.7, I replicated what you said on my AWS EC2 instance, where I have multiple CUDA versions including 10.1 and 11.1.
Reiterating the steps I followed so that they can help someone with similar issues:
1. Created a conda environment with python-3.6.9 (because my target machine where I want to run DeepSpeed has 3.6.9).
2. Changed PATH and LD_LIBRARY_PATH to point to CUDA-10.1 (again because of my target machine), as suggested in HF's installation notes here. Below are the commands:
export PATH=/usr/local/cuda-10.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH
3. Installed PyTorch (it should be the same version as on the target machine). I installed it with the command below:
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
Verify the torch versions with python -m torch.utils.collect_env.
4. Executed the commands below to pre-build DeepSpeed:
git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
time DS_BUILD_OPS=1 pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1 | tee build.log
Check whether the compatible ops were installed with ds_report.
5. Built the whl file for this DeepSpeed using the commands below:
rm -rf build
DS_BUILD_OPS=1 python setup.py build_ext -j8 bdist_wheel
6. Took the whl from dist/ and installed it on the target machine using pip install deepspeed-0.3.13+7fcc891-cp36-cp36m-linux_x86_64.whl
Awesome! Thank you for the report, @saichandrapandraju
Except you don't need step 4; step 5 is all you need after you've cloned the repo.
Step 4 is for when you want to install it locally, and is similar to steps 5+6, but you don't get a wheel to take to another machine.
I had the same issue with fairscale on several setups - no matter what I tried it wouldn't build at runtime, but pre-building into a wheel and installing that worked.
BTW, I do recommend using an explicit TORCH_CUDA_ARCH_LIST for your GPUs during the build, since from what I understand you may get better performance that way - especially if your build machine doesn't have the same GPUs as your target machine.
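For example, the Tesla V100s in this setup are compute capability 7.0, so the wheel build above would be pinned like this (a sketch; substitute your own arch list):
TORCH_CUDA_ARCH_LIST="7.0" DS_BUILD_OPS=1 python setup.py build_ext -j8 bdist_wheel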
Yes.
In my case both my build and target machines are the same, so I didn't use TORCH_CUDA_ARCH_LIST.
But yeah, it's always better to specify it explicitly. For reference, I used torch.cuda.get_device_properties(device) to check my device architecture, which gives output like _CudaDeviceProperties(name='Tesla V100-SXM2-32GB', major=7, minor=0, total_memory=32510MB, multi_processor_count=80).
I'm not entirely sure, but from that output I believe my device architecture is 7.0. One can also check the list of CUDA architectures the installed torch was compiled for using torch.cuda.get_arch_list(), which gives:
['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75'] for torch==1.7.1+cu101
['sm_37', 'sm_50', 'sm_60', 'sm_70'] for torch==1.8.1+cu101
Not sure whether this is the correct way to check. Maybe @stas00 can confirm.
That's the correct way: major=7, minor=0 => 7.0.
Also, you can find the full list of all archs at https://developer.nvidia.com/cuda-gpus
Incidentally, I have just added all this information to the docs; it should hopefully be merged in the next few days:
You can find the complete list of NVIDIA GPUs and their corresponding **Compute Capabilities**
(same as arch in this context) `here <https://developer.nvidia.com/cuda-gpus>`__.
You can check the archs pytorch was built with using:
.. code-block:: bash
python -c "import torch; print(torch.cuda.get_arch_list())"
Here is how to find out the arch for one of the installed GPUs. For example, for GPU 0:
.. code-block:: bash
CUDA_VISIBLE_DEVICES=0 python -c "import torch; \
print(torch.cuda.get_device_properties(torch.device('cuda')))"
If the output is:
.. code-block:: bash
_CudaDeviceProperties(name='GeForce RTX 3090', major=8, minor=6, total_memory=24268MB, multi_processor_count=82)
then you know that this card's arch is ``8.6``.
You can also leave ``TORCH_CUDA_ARCH_LIST`` out completely, in which case the build program will automatically query the
architecture of the GPUs the build is made on. This may or may not match the GPUs on the target machines, which is why
it's best to specify the desired archs explicitly.