Error building extension 'cpu_adam' (Hugging Face integration)
Hi guys,
I'm trying to use the cpu_offload function of the DeepSpeed integration with HuggingFace's Trainer on a single GPU on AWS SageMaker (ml.p2.xlarge instance). However, I've been struggling for quite some time to get it to work properly. Here are the current versions I'm using:
CUDA: 10.1 (V10.1.243)
transformers: 4.6.0
pytorch: 1.7.1
deepspeed: 0.3.16 / 0.4.0 (master)
OS: Amazon Linux AMI 2018.03 (x86_64)
gcc/g++/c++: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
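As a quick sanity check on the toolchain listed above, the compiler version can be compared against a minimum. The sketch below is only illustrative: the threshold of GCC 5.0 is an assumption based on the ABI-compatibility warning that PyTorch's C++ extension loader emits, and the helper names are hypothetical. A version below the threshold does not by itself prove that the compiler is what breaks the cpu_adam build.

```python
import re

# Assumption: PyTorch's C++ extension loader expects a compiler that is
# ABI-compatible with GCC 5.0 or newer. Adjust if your torch build differs.
ASSUMED_MIN_GCC = (5, 0, 0)


def parse_gcc_version(banner):
    """Extract (major, minor, patch) from text such as '(GCC) 4.8.5 20150623'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError("no version number found in: %r" % banner)
    return tuple(int(part) for part in match.groups())


def gcc_meets_minimum(banner, minimum=ASSUMED_MIN_GCC):
    """Return True when the compiler in `banner` is at least `minimum`."""
    return parse_gcc_version(banner) >= minimum


# Version string copied from the environment list above.
reported = "(GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)"
print(parse_gcc_version(reported), gcc_meets_minimum(reported))
```

On a live machine the banner would come from running `gcc --version`; here it is hard-coded so the sketch runs anywhere.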
I've taken a look at similar issues (#889, #694, #885) but haven't had any success. So far I have tried:
- Changing the versions of the pytorch, deepspeed and transformers libraries
- Pre-building the ops of deepspeed (DS_BUILD_OPS=1 and DS_BUILD_CPU_ADAM=1)
- Installing DeepSpeed and Transformers from source
Here is a simplified version of the code I'm running on Sagemaker:
import os
from transformers import EarlyStoppingCallback, PrinterCallback, TrainingArguments

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '9993' # modify if RuntimeError: Address already in use
os.environ['RANK'] = "0"
os.environ['LOCAL_RANK'] = "0"
os.environ['WORLD_SIZE'] = "1"
# Training Args
MAX_LEN = 512
TRAIN_BATCH_SIZE = 8
VAL_BATCH_SIZE = 8
EPOCHS = 1
LEARNING_RATE = 1e-05
args = TrainingArguments(
    output_dir = "../flujo_nlp/outputs/",
    overwrite_output_dir = True,
    per_device_train_batch_size = TRAIN_BATCH_SIZE,
    per_device_eval_batch_size = VAL_BATCH_SIZE,
    learning_rate = LEARNING_RATE,
    weight_decay = 0.01,
    max_grad_norm = 1.0,
    num_train_epochs = EPOCHS, # If max_steps is set, that value is used to train the model instead
    max_steps = 2000, # 2000
    evaluation_strategy = "steps",
    eval_steps = 200, # 200
    lr_scheduler_type = 'linear',
    warmup_ratio = 0.0,
    warmup_steps = 0,
    logging_dir = "../flujo_nlp/logs/",
    logging_strategy = 'steps',
    logging_steps = 200,
    seed = 42,
    fp16 = False,
    dataloader_drop_last = False,
    dataloader_num_workers = 0,
    label_names = ["labels"],
    load_best_model_at_end = True,
    metric_for_best_model = "eval_loss",
    greater_is_better = False,
    ignore_data_skip = False,
    deepspeed = "deepspeed_config_1gpu.json"
)
cbks = [
    EarlyStoppingCallback(early_stopping_patience = 2, early_stopping_threshold = 0),
    PrinterCallback()
]
# Trainer
trainer = MultilabelTrainer(
    num_labels = n_labels,
    loss_fct = loss,
    model = TextClassifier(model_dict["model_path"], n_labels, loss, n_extra_layers = n_extra_layers),
    args = args,
    train_dataset = train_dataset,
    eval_dataset = val_dataset,
    compute_metrics = compute_metrics_fct,
    callbacks = cbks
)
train_output = trainer.train()
My json config file:
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true,
        "cpu_offload": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}
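The DeepSpeed log further down in this report warns that cpu_offload is deprecated in favor of offload_optimizer. A minimal sketch of rewriting the config accordingly is shown here. It assumes that `{"device": "cpu"}` is the equivalent offload_optimizer form, and `migrate_cpu_offload` is a hypothetical helper rather than part of DeepSpeed. Whether the transformers version in use accepts the newer key is not verified here, and this addresses only the deprecation warning, not the build failure itself.

```python
import copy
import json


def migrate_cpu_offload(config):
    """Return a copy of `config` using `offload_optimizer` instead of `cpu_offload`."""
    migrated = copy.deepcopy(config)
    zero = migrated.get("zero_optimization", {})
    # Only rewrite when the old boolean flag is present.
    if "cpu_offload" in zero:
        enabled = zero.pop("cpu_offload")
        # Assumption: {"device": "cpu"} is the replacement for cpu_offload=true.
        if enabled and "offload_optimizer" not in zero:
            zero["offload_optimizer"] = {"device": "cpu"}
    return migrated


# Reduced version of the config above, keeping only the relevant section.
old = json.loads('{"zero_optimization": {"stage": 2, "cpu_offload": true}}')
print(json.dumps(migrate_cpu_offload(old), indent=2))
```

The input dictionary is left untouched, so the original file can be kept for comparison.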
The output of running ds_report:
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
/bin/sh: line 0: type: llvm-config: not found
/bin/sh: line 0: type: llvm-config-9: not found
[WARNING] sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch']
torch version .................... 1.7.1
torch cuda version ............... 10.1
nvcc version ..................... 10.1
deepspeed install path ........... ['/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.4.0+11e94e6, 11e94e6, master
deepspeed wheel compiled w. ...... torch 1.7, cuda 10.1
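To compare reports across environments, the op table above can be reduced to a small dictionary. This is a rough sketch that assumes the report was pasted as plain text (no terminal color codes) and that an installed op is printed as `[YES]`; `parse_op_table` is a hypothetical helper, not a DeepSpeed API.

```python
import re

# Matches rows such as "cpu_adam ............... [NO] ....... [OKAY]".
OP_ROW = re.compile(r"^(\w+)\s*\.+\s*\[(\w+)\]\s*\.+\s*\[(\w+)\]\s*$")


def parse_op_table(report_text):
    """Map op name -> (installed, compatible) as booleans."""
    ops = {}
    for line in report_text.splitlines():
        match = OP_ROW.match(line.strip())
        if match:
            name, installed, compatible = match.groups()
            # Assumption: "[YES]" marks an installed op, "[OKAY]" a compatible one.
            ops[name] = (installed == "YES", compatible == "OKAY")
    return ops


# A few rows copied from the report above.
sample = """\
cpu_adam ............... [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [NO]
async_io ............... [NO] ....... [NO]
"""
ops = parse_op_table(sample)
not_buildable = sorted(name for name, (_, ok) in ops.items() if not ok)
print(not_buildable)
```

In the report above, cpu_adam is listed as compatible but not installed, which is why DeepSpeed attempts the JIT build at runtime.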
This is the stack trace I get if I try to run the code without pre-building:
[2021-06-02 21:06:31,700] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.4.0+11e94e6, git-hash=11e94e6, git-branch=master
[2021-06-02 21:06:31,707] [WARNING] [config.py:80:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-06-02 21:06:31,858] [INFO] [utils.py:13:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
[2021-06-02 21:06:31,968] [INFO] [engine.py:173:__init__] DeepSpeed Flops Profiler Enabled: False
Using /home/ec2-user/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
1538 check=True,
-> 1539 env=env)
1540 else:
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
437 raise CalledProcessError(retcode, process.args,
--> 438 output=stdout, stderr=stderr)
439 return CompletedProcess(process.args, retcode, stdout, stderr)
CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-13-4a7b77bf5678> in <module>
46 )
47
---> 48 train_output = trainer.train()
49
50 # Evalua
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
1112 if args.deepspeed:
1113 deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
-> 1114 self, num_training_steps=max_steps, resume_from_checkpoint=resume_from_checkpoint
1115 )
1116 self.model = deepspeed_engine.module
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/integrations.py in deepspeed_init(trainer, num_training_steps, resume_from_checkpoint)
520 config_params=config,
521 optimizer=optimizer,
--> 522 lr_scheduler=lr_scheduler,
523 )
524
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/deepspeed/__init__.py in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config, config_params)
134 collate_fn=collate_fn,
135 config=config,
--> 136 config_params=config_params)
137 else:
138 assert mpu is None, "mpu must be None with pipeline parallelism"
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/deepspeed/runtime/engine.py in __init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config, config_params, dont_change_device)
185 self.lr_scheduler = None
186 if model_parameters or optimizer:
--> 187 self._configure_optimizer(optimizer, model_parameters)
188 self._configure_lr_scheduler(lr_scheduler)
189 self._report_progress(0)
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_optimizer(self, client_optimizer, model_parameters)
687 logger.info('Using client Optimizer as basic optimizer')
688 else:
--> 689 basic_optimizer = self._configure_basic_optimizer(model_parameters)
690 if self.global_rank == 0:
691 logger.info(
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_basic_optimizer(self, model_parameters)
758 optimizer = DeepSpeedCPUAdam(model_parameters,
759 **optimizer_parameters,
--> 760 adamw_mode=effective_adam_w_mode)
761 else:
762 from deepspeed.ops.adam import FusedAdam
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/deepspeed/ops/adam/cpu_adam.py in __init__(self, model_params, lr, bias_correction, betas, eps, weight_decay, amsgrad, adamw_mode)
76 DeepSpeedCPUAdam.optimizer_id = DeepSpeedCPUAdam.optimizer_id + 1
77 self.adam_w_mode = adamw_mode
---> 78 self.ds_opt_adam = CPUAdamBuilder().load()
79
80 self.ds_opt_adam.create_adam(self.opt_id,
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in load(self, verbose)
214 return importlib.import_module(self.absolute_name())
215 else:
--> 216 return self.jit_load(verbose)
217
218 def jit_load(self, verbose=True):
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in jit_load(self, verbose)
251 extra_cuda_cflags=self.nvcc_args(),
252 extra_ldflags=self.extra_ldflags(),
--> 253 verbose=verbose)
254 build_duration = time.time() - start_build
255 if verbose:
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/cpp_extension.py in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, keep_intermediates)
995 with_cuda,
996 is_python_module,
--> 997 keep_intermediates=keep_intermediates)
998
999
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, keep_intermediates)
1200 build_directory=build_directory,
1201 verbose=verbose,
-> 1202 with_cuda=with_cuda)
1203 finally:
1204 baton.release()
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _write_ninja_file_and_build_library(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda)
1298 build_directory,
1299 verbose,
-> 1300 error_prefix="Error building extension '{}'".format(name))
1301
1302
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
1553 if hasattr(error, 'output') and error.output: # type: ignore
1554 message += ": {}".format(error.output.decode()) # type: ignore
-> 1555 raise RuntimeError(message) from e
1556
1557
RuntimeError: Error building extension 'cpu_adam'
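The RuntimeError above does not include the compiler's own message. According to the log, the failing command was `ninja -v` and the build file was written under /home/ec2-user/.cache/torch_extensions/cpu_adam, so re-running that command there and capturing its output should surface the underlying error. A minimal sketch follows; `run_and_capture` is a hypothetical helper, and a stand-in command is used so the example runs without ninja installed.

```python
import subprocess
import sys


def run_and_capture(command, cwd=None):
    """Run `command`, returning (exit_code, combined stdout/stderr text)."""
    completed = subprocess.run(
        command,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into stdout so ordering is kept
        universal_newlines=True,   # decode to str; works on Python 3.6
    )
    return completed.returncode, completed.stdout


# Stand-in command so the sketch runs anywhere. For the real case, pass
# ["ninja", "-v"] with cwd set to the cpu_adam build directory from the log.
code, output = run_and_capture([sys.executable, "-c", "print('compiler output here')"])
print(code, output.strip())
```

A non-zero exit code together with the captured text would show which source file and compiler flag the build stops on.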
And finally, the stack trace I get if I try to pre-build while installing with DS_BUILD_CPU_ADAM=1 pip install deepspeed:
Collecting deepspeed
Downloading deepspeed-0.3.16.tar.gz (385 kB)
|████████████████████████████████| 385 kB 19.1 MB/s eta 0:00:01
Requirement already satisfied: torch>=1.2 in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from deepspeed) (1.7.1)
Requirement already satisfied: torchvision>=0.4.0 in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from deepspeed) (0.8.2)
Requirement already satisfied: tqdm in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from deepspeed) (4.61.0)
Collecting tensorboardX==1.8
Downloading tensorboardX-1.8-py2.py3-none-any.whl (216 kB)
|████████████████████████████████| 216 kB 46.3 MB/s eta 0:00:01
Collecting ninja
Downloading ninja-1.10.0.post2-py3-none-manylinux1_x86_64.whl (107 kB)
|████████████████████████████████| 107 kB 57.8 MB/s eta 0:00:01
Requirement already satisfied: numpy in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from deepspeed) (1.19.2)
Requirement already satisfied: psutil in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from deepspeed) (5.8.0)
Requirement already satisfied: protobuf>=3.2.0 in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from tensorboardX==1.8->deepspeed) (3.15.8)
Requirement already satisfied: six in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from tensorboardX==1.8->deepspeed) (1.15.0)
Requirement already satisfied: typing_extensions in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from torch>=1.2->deepspeed) (3.7.4.3)
Requirement already satisfied: dataclasses in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from torch>=1.2->deepspeed) (0.8)
Requirement already satisfied: pillow>=4.1.1 in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from torchvision>=0.4.0->deepspeed) (8.1.0)
Building wheels for collected packages: deepspeed
Building wheel for deepspeed (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /home/ec2-user/anaconda3/envs/pytorch_latest_p36/bin/python3.6 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-mg0l8e9x/deepspeed_429da1b49e0440b5894b5291a4e649c0/setup.py'"'"'; __file__='"'"'/tmp/pip-install-mg0l8e9x/deepspeed_429da1b49e0440b5894b5291a4e649c0/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-p5cawauv
cwd: /tmp/pip-install-mg0l8e9x/deepspeed_429da1b49e0440b5894b5291a4e649c0/
Complete output (254 lines):
DS_BUILD_OPS=0
/bin/sh: line 0: type: llvm-config: not found
/bin/sh: line 0: type: llvm-config-9: not found
[WARNING] sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing.
/bin/sh: line 0: type: llvm-config: not found
/bin/sh: line 0: type: llvm-config-9: not found
[WARNING] sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
[WARNING] async_io requires the libraries: ['libaio-dev'] but are missing.
Install Ops={'cpu_adam': 1, 'fused_adam': False, 'fused_lamb': False, 'sparse_attn': False, 'transformer': False, 'stochastic_transformer': False, 'utils': False, 'async_io': False}
fatal: not a git repository (or any of the parent directories): .git
version=0.3.16, git_hash=unknown, git_branch=unknown
install_requires=['torch>=1.2', 'torchvision>=0.4.0', 'tqdm', 'tensorboardX==1.8', 'ninja', 'numpy', 'psutil']
compatible_ops={'cpu_adam': True, 'fused_adam': True, 'fused_lamb': True, 'sparse_attn': False, 'transformer': True, 'stochastic_transformer': True, 'utils': True, 'async_io': False}
ext_modules=[<setuptools.extension.Extension('deepspeed.ops.adam.cpu_adam_op') at 0x7f10bd356668>]
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/deepspeed
copying deepspeed/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed
copying deepspeed/constants.py -> build/lib.linux-x86_64-3.6/deepspeed
copying deepspeed/git_version_info_installed.py -> build/lib.linux-x86_64-3.6/deepspeed
copying deepspeed/git_version_info.py -> build/lib.linux-x86_64-3.6/deepspeed
copying deepspeed/env_report.py -> build/lib.linux-x86_64-3.6/deepspeed
creating build/lib.linux-x86_64-3.6/op_builder
copying op_builder/__init__.py -> build/lib.linux-x86_64-3.6/op_builder
copying op_builder/fused_lamb.py -> build/lib.linux-x86_64-3.6/op_builder
copying op_builder/transformer.py -> build/lib.linux-x86_64-3.6/op_builder
copying op_builder/utils.py -> build/lib.linux-x86_64-3.6/op_builder
copying op_builder/async_io.py -> build/lib.linux-x86_64-3.6/op_builder
copying op_builder/builder.py -> build/lib.linux-x86_64-3.6/op_builder
copying op_builder/fused_adam.py -> build/lib.linux-x86_64-3.6/op_builder
copying op_builder/cpu_adam.py -> build/lib.linux-x86_64-3.6/op_builder
copying op_builder/sparse_attn.py -> build/lib.linux-x86_64-3.6/op_builder
copying op_builder/stochastic_transformer.py -> build/lib.linux-x86_64-3.6/op_builder
creating build/lib.linux-x86_64-3.6/deepspeed/ops
copying deepspeed/ops/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/ops
copying deepspeed/ops/module_inject.py -> build/lib.linux-x86_64-3.6/deepspeed/ops
creating build/lib.linux-x86_64-3.6/deepspeed/module_inject
copying deepspeed/module_inject/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/module_inject
copying deepspeed/module_inject/inject.py -> build/lib.linux-x86_64-3.6/deepspeed/module_inject
copying deepspeed/module_inject/replace_module.py -> build/lib.linux-x86_64-3.6/deepspeed/module_inject
creating build/lib.linux-x86_64-3.6/deepspeed/utils
copying deepspeed/utils/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/utils
copying deepspeed/utils/zero_to_fp32.py -> build/lib.linux-x86_64-3.6/deepspeed/utils
copying deepspeed/utils/timer.py -> build/lib.linux-x86_64-3.6/deepspeed/utils
copying deepspeed/utils/logging.py -> build/lib.linux-x86_64-3.6/deepspeed/utils
copying deepspeed/utils/distributed.py -> build/lib.linux-x86_64-3.6/deepspeed/utils
creating build/lib.linux-x86_64-3.6/deepspeed/elasticity
copying deepspeed/elasticity/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/elasticity
copying deepspeed/elasticity/config.py -> build/lib.linux-x86_64-3.6/deepspeed/elasticity
copying deepspeed/elasticity/constants.py -> build/lib.linux-x86_64-3.6/deepspeed/elasticity
copying deepspeed/elasticity/elasticity.py -> build/lib.linux-x86_64-3.6/deepspeed/elasticity
creating build/lib.linux-x86_64-3.6/deepspeed/launcher
copying deepspeed/launcher/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/launcher
copying deepspeed/launcher/constants.py -> build/lib.linux-x86_64-3.6/deepspeed/launcher
copying deepspeed/launcher/launch.py -> build/lib.linux-x86_64-3.6/deepspeed/launcher
copying deepspeed/launcher/runner.py -> build/lib.linux-x86_64-3.6/deepspeed/launcher
copying deepspeed/launcher/multinode_runner.py -> build/lib.linux-x86_64-3.6/deepspeed/launcher
creating build/lib.linux-x86_64-3.6/deepspeed/pipe
copying deepspeed/pipe/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/pipe
creating build/lib.linux-x86_64-3.6/deepspeed/profiling
copying deepspeed/profiling/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/profiling
copying deepspeed/profiling/config.py -> build/lib.linux-x86_64-3.6/deepspeed/profiling
copying deepspeed/profiling/constants.py -> build/lib.linux-x86_64-3.6/deepspeed/profiling
creating build/lib.linux-x86_64-3.6/deepspeed/runtime
copying deepspeed/runtime/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
copying deepspeed/runtime/config_utils.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
copying deepspeed/runtime/dataloader.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
copying deepspeed/runtime/engine.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
copying deepspeed/runtime/config.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
copying deepspeed/runtime/utils.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
copying deepspeed/runtime/constants.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
copying deepspeed/runtime/lr_schedules.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
copying deepspeed/runtime/progressive_layer_drop.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
copying deepspeed/runtime/csr_tensor.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
creating build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention
copying deepspeed/ops/sparse_attention/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention
copying deepspeed/ops/sparse_attention/sparse_self_attention.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention
copying deepspeed/ops/sparse_attention/matmul.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention
copying deepspeed/ops/sparse_attention/softmax.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention
copying deepspeed/ops/sparse_attention/bert_sparse_self_attention.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention
copying deepspeed/ops/sparse_attention/sparsity_config.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention
copying deepspeed/ops/sparse_attention/sparse_attention_utils.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention
creating build/lib.linux-x86_64-3.6/deepspeed/ops/aio
copying deepspeed/ops/aio/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/aio
creating build/lib.linux-x86_64-3.6/deepspeed/ops/transformer
copying deepspeed/ops/transformer/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/transformer
copying deepspeed/ops/transformer/transformer.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/transformer
creating build/lib.linux-x86_64-3.6/deepspeed/ops/lamb
copying deepspeed/ops/lamb/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/lamb
copying deepspeed/ops/lamb/fused_lamb.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/lamb
creating build/lib.linux-x86_64-3.6/deepspeed/ops/adam
copying deepspeed/ops/adam/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/adam
copying deepspeed/ops/adam/fused_adam.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/adam
copying deepspeed/ops/adam/multi_tensor_apply.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/adam
copying deepspeed/ops/adam/cpu_adam.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/adam
creating build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
copying deepspeed/ops/op_builder/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
copying deepspeed/ops/op_builder/fused_lamb.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
copying deepspeed/ops/op_builder/transformer.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
copying deepspeed/ops/op_builder/utils.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
copying deepspeed/ops/op_builder/async_io.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
copying deepspeed/ops/op_builder/builder.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
copying deepspeed/ops/op_builder/fused_adam.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
copying deepspeed/ops/op_builder/cpu_adam.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
copying deepspeed/ops/op_builder/sparse_attn.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
copying deepspeed/ops/op_builder/stochastic_transformer.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
creating build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention/trsrc
copying deepspeed/ops/sparse_attention/trsrc/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention/trsrc
creating build/lib.linux-x86_64-3.6/deepspeed/profiling/flops_profiler
copying deepspeed/profiling/flops_profiler/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/profiling/flops_profiler
copying deepspeed/profiling/flops_profiler/profiler.py -> build/lib.linux-x86_64-3.6/deepspeed/profiling/flops_profiler
creating build/lib.linux-x86_64-3.6/deepspeed/runtime/activation_checkpointing
copying deepspeed/runtime/activation_checkpointing/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/activation_checkpointing
copying deepspeed/runtime/activation_checkpointing/config.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/activation_checkpointing
copying deepspeed/runtime/activation_checkpointing/checkpointing.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/activation_checkpointing
creating build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
copying deepspeed/runtime/zero/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
copying deepspeed/runtime/zero/config.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
copying deepspeed/runtime/zero/stage3.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
copying deepspeed/runtime/zero/linear.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
copying deepspeed/runtime/zero/utils.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
copying deepspeed/runtime/zero/offload_config.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
copying deepspeed/runtime/zero/constants.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
copying deepspeed/runtime/zero/stage2.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
copying deepspeed/runtime/zero/contiguous_memory_allocator.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
copying deepspeed/runtime/zero/tiling.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
copying deepspeed/runtime/zero/test.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
copying deepspeed/runtime/zero/offload_constants.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
copying deepspeed/runtime/zero/stage1.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
copying deepspeed/runtime/zero/partition_parameters.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
creating build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
copying deepspeed/runtime/swap_tensor/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
copying deepspeed/runtime/swap_tensor/aio_config.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
copying deepspeed/runtime/swap_tensor/utils.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
copying deepspeed/runtime/swap_tensor/pipelined_optimizer_swapper.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
copying deepspeed/runtime/swap_tensor/constants.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
copying deepspeed/runtime/swap_tensor/async_swapper.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
copying deepspeed/runtime/swap_tensor/optimizer_utils.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
copying deepspeed/runtime/swap_tensor/partitioned_optimizer_swapper.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
copying deepspeed/runtime/swap_tensor/partitioned_param_swapper.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
creating build/lib.linux-x86_64-3.6/deepspeed/runtime/pipe
copying deepspeed/runtime/pipe/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/pipe
copying deepspeed/runtime/pipe/engine.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/pipe
copying deepspeed/runtime/pipe/p2p.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/pipe
copying deepspeed/runtime/pipe/schedule.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/pipe
copying deepspeed/runtime/pipe/module.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/pipe
copying deepspeed/runtime/pipe/topology.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/pipe
creating build/lib.linux-x86_64-3.6/deepspeed/runtime/compression
copying deepspeed/runtime/compression/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/compression
copying deepspeed/runtime/compression/cupy.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/compression
creating build/lib.linux-x86_64-3.6/deepspeed/runtime/fp16
copying deepspeed/runtime/fp16/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/fp16
copying deepspeed/runtime/fp16/unfused_optimizer.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/fp16
copying deepspeed/runtime/fp16/fused_optimizer.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/fp16
copying deepspeed/runtime/fp16/loss_scaler.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/fp16
creating build/lib.linux-x86_64-3.6/deepspeed/runtime/comm
copying deepspeed/runtime/comm/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/comm
copying deepspeed/runtime/comm/mpi.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/comm
copying deepspeed/runtime/comm/nccl.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/comm
creating build/lib.linux-x86_64-3.6/deepspeed/runtime/fp16/onebit
copying deepspeed/runtime/fp16/onebit/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/fp16/onebit
copying deepspeed/runtime/fp16/onebit/adam.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/fp16/onebit
copying deepspeed/runtime/fp16/onebit/lamb.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/fp16/onebit
running egg_info
writing deepspeed.egg-info/PKG-INFO
writing dependency_links to deepspeed.egg-info/dependency_links.txt
writing requirements to deepspeed.egg-info/requires.txt
writing top-level names to deepspeed.egg-info/top_level.txt
reading manifest file 'deepspeed.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching '*.cc' under directory 'deepspeed'
warning: no files found matching '*.tr' under directory 'csrc'
warning: no files found matching '*.cc' under directory 'csrc'
writing manifest file 'deepspeed.egg-info/SOURCES.txt'
creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc
creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/adam
copying deepspeed/ops/csrc/adam/compat.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/adam
copying deepspeed/ops/csrc/adam/cpu_adam.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/adam
copying deepspeed/ops/csrc/adam/custom_cuda_kernel.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/adam
copying deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/adam
copying deepspeed/ops/csrc/adam/multi_tensor_adam.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/adam
copying deepspeed/ops/csrc/adam/multi_tensor_apply.cuh -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/adam
creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio
creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/common
copying deepspeed/ops/csrc/aio/common/deepspeed_aio_common.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/common
copying deepspeed/ops/csrc/aio/common/deepspeed_aio_common.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/common
copying deepspeed/ops/csrc/aio/common/deepspeed_aio_types.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/common
copying deepspeed/ops/csrc/aio/common/deepspeed_aio_types.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/common
copying deepspeed/ops/csrc/aio/common/deepspeed_aio_utils.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/common
copying deepspeed/ops/csrc/aio/common/deepspeed_aio_utils.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/common
creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/deepspeed_aio_thread.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/deepspeed_aio_thread.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_copy.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_copy.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
copying deepspeed/ops/csrc/aio/py_lib/py_ds_aio.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/StopWatch.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/Timer.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/context.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/cpu_adam.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/cublas_wrappers.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/custom_cuda_layers.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/dropout.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/ds_transformer_cuda.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/feed_forward.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/gelu.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/gemm_test.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/general_kernels.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/normalize_layer.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/softmax.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/strided_batch_gemm.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
copying deepspeed/ops/csrc/includes/type_shim.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/lamb
copying deepspeed/ops/csrc/lamb/fused_lamb_cuda.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/lamb
copying deepspeed/ops/csrc/lamb/fused_lamb_cuda_kernel.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/lamb
creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/sparse_attention
copying deepspeed/ops/csrc/sparse_attention/utils.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/sparse_attention
creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/transformer
copying deepspeed/ops/csrc/transformer/cublas_wrappers.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/transformer
copying deepspeed/ops/csrc/transformer/dropout_kernels.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/transformer
copying deepspeed/ops/csrc/transformer/ds_transformer_cuda.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/transformer
copying deepspeed/ops/csrc/transformer/gelu_kernels.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/transformer
copying deepspeed/ops/csrc/transformer/general_kernels.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/transformer
copying deepspeed/ops/csrc/transformer/normalize_kernels.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/transformer
copying deepspeed/ops/csrc/transformer/softmax_kernels.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/transformer
copying deepspeed/ops/csrc/transformer/transform_kernels.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/transformer
creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/utils
copying deepspeed/ops/csrc/utils/flatten_unflatten.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/utils
copying deepspeed/ops/sparse_attention/trsrc/matmul.tr -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention/trsrc
copying deepspeed/ops/sparse_attention/trsrc/softmax_bwd.tr -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention/trsrc
copying deepspeed/ops/sparse_attention/trsrc/softmax_fwd.tr -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention/trsrc
running build_ext
building 'deepspeed.ops.adam.cpu_adam_op' extension
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/csrc
creating build/temp.linux-x86_64-3.6/csrc/adam
/home/ec2-user/anaconda3/envs/pytorch_latest_p36/bin/x86_64-conda-linux-gnu-cc -DNDEBUG -fwrapv -O2 -Wall -Wstrict-prototypes -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/ec2-user/anaconda3/envs/pytorch_latest_p36/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/ec2-user/anaconda3/envs/pytorch_latest_p36/include -fPIC -Icsrc/includes -I/usr/local/cuda-10.1/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include/TH -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include/THC -I/usr/local/cuda-10.1/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/include/python3.6m -c csrc/adam/cpu_adam.cpp -o build/temp.linux-x86_64-3.6/csrc/adam/cpu_adam.o -O3 -std=c++14 -L/usr/local/cuda-10.1/lib64 -lcudart -lcublas -g -Wno-reorder -march=native -fopenmp -D__AVX256__ -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -DTORCH_EXTENSION_NAME=cpu_adam_op -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
/usr/local/cuda-10.1/bin/nvcc -Icsrc/includes -I/usr/local/cuda-10.1/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include/TH -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include/THC -I/usr/local/cuda-10.1/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/include/python3.6m -c csrc/adam/custom_cuda_kernel.cu -o build/temp.linux-x86_64-3.6/csrc/adam/custom_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -DTORCH_EXTENSION_NAME=cpu_adam_op -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ec2-user/anaconda3/envs/pytorch_latest_p36/bin/x86_64-conda-linux-gnu-cc
In file included from /usr/local/cuda-10.1/include/cuda_runtime.h:83,
from <command-line>:
/usr/local/cuda-10.1/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported!
| ^~~~~
error: command '/usr/local/cuda-10.1/bin/nvcc' failed with exit status 1
Any input on how to solve this issue would be very much appreciated!
Update:
Following the steps taken in https://seo-explorer.io/blog/configuring-centos-7-to-finetune-eleutherai-gpt-neo-2-7b-with-torch-and-deepspeed/ I managed to get the cpu_offload function to run by updating my gcc from 4.8.5 to 7.2.1. However, as observed in Xirider/finetune-gpt2xl#3, I also had to set "torch_adam":true in the config file to be able to run the code.
As mentioned in that issue, I too experienced a significant (2x) decrease in training speed. Does anyone have an idea of why this might be happening?
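For readers unsure where the "torch_adam" flag goes, here is a minimal sketch of a DeepSpeed config built as a Python dict. It assumes the flag sits under the optimizer's "params" block, as described in the DeepSpeed config documentation, and that ZeRO stage 2 with "cpu_offload" is in use. The stage and the learning-rate and weight-decay values are assumptions taken from the training arguments earlier in this issue, not from the actual config file.

```python
import json

# Hypothetical DeepSpeed config for illustration; not the file used in this issue.
ds_config = {
    "zero_optimization": {
        "stage": 2,           # assumption: cpu_offload used with ZeRO stage 2
        "cpu_offload": True,  # offload optimizer state to CPU memory
    },
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 1e-5,            # mirrors LEARNING_RATE above (assumption)
            "weight_decay": 0.01,  # mirrors weight_decay above (assumption)
            # Use torch's Adam implementation instead of DeepSpeed's own
            # compiled Adam op, so the cpu_adam extension is not needed.
            "torch_adam": True,
        },
    },
}

# Serialize so it can be written to a JSON file and passed to the Trainer.
ds_config_json = json.dumps(ds_config, indent=2)
```

Since this setting swaps DeepSpeed's compiled CPU Adam op for the plain torch optimizer, it is a plausible (though unconfirmed here) contributor to the slowdown.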
Hi again, another update:
After coming back to the issue after some time, I realized that when running the nvcc command (below), the gcc compiler being used was not the one I installed system-wide, but rather the one that came with anaconda, which was pointed to by the softlink /home/ec2-user/anaconda3/envs/pytorch_latest_p36/bin/x86_64-conda-linux-gnu-cc.
/usr/local/cuda-10.1/bin/nvcc -Icsrc/includes -I/usr/local/cuda-10.1/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include/TH -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include/THC -I/usr/local/cuda-10.1/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/include/python3.6m -c csrc/adam/custom_cuda_kernel.cu -o build/temp.linux-x86_64-3.6/csrc/adam/custom_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -DTORCH_EXTENSION_NAME=cpu_adam_op -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ec2-user/anaconda3/envs/pytorch_latest_p36/bin/x86_64-conda-linux-gnu-cc
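A quick way to confirm which host compiler nvcc is handed is to pull the -ccbin argument out of the logged command and compare that compiler's major version against the limit in the error message ("gcc versions later than 8 are not supported"). The sketch below is only an illustration: the helper names are made up, the check covers only that upper bound, and the version text is assumed to come from running the compiler with -dumpversion.

```python
import shlex
import subprocess


def host_compiler_from_nvcc(cmd):
    """Return the path given to nvcc via -ccbin / --compiler-bindir, or None."""
    tokens = shlex.split(cmd)
    for i, tok in enumerate(tokens):
        if tok in ("-ccbin", "--compiler-bindir") and i + 1 < len(tokens):
            return tokens[i + 1]
        if tok.startswith("--compiler-bindir="):
            return tok.split("=", 1)[1]
    return None


def gcc_major(dumpversion_text):
    """Major version from `-dumpversion` output such as '7.2.1' or '9'."""
    return int(dumpversion_text.strip().split(".")[0])


def supported_by_nvcc(dumpversion_text, max_major=8):
    """True if the gcc major version is within the limit from the error message."""
    return gcc_major(dumpversion_text) <= max_major


def dump_version(compiler_path):
    """Ask the compiler for its version (runs `<compiler> -dumpversion`)."""
    result = subprocess.run([compiler_path, "-dumpversion"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Usage would be along the lines of `supported_by_nvcc(dump_version(host_compiler_from_nvcc(logged_cmd)))`, where `logged_cmd` is the nvcc line copied from the build log.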
I managed to get the cpu_offload functionality running by first installing a new gcc version with conda:
conda install -c creditx gcc-7 -y
And then re-pointing the softlink used by nvcc by running
sudo ln -sfn /home/ec2-user/anaconda3/envs/pytorch_latest_p36/bin/x86_64-pc-linux-gnu-gcc-7.1.0 /home/ec2-user/anaconda3/envs/pytorch_latest_p36/bin/x86_64-conda-linux-gnu-cc
NOTE: re-pointing the softlink to the system-wide gcc compiler did not work; it threw all kinds of strange errors.
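For anyone who prefers to script the re-pointing step instead of running ln by hand, a rough Python equivalent of `ln -sfn` could look like the sketch below. The function name is hypothetical, and the paths you pass in would be the ones from your own environment (it may still need elevated permissions, as the sudo above suggests).

```python
import os


def repoint_symlink(link_path, new_target):
    """Make link_path a symlink to new_target, replacing any existing link."""
    if not os.path.exists(new_target):
        raise FileNotFoundError(new_target)
    # Create the new link under a temporary name, then rename it over the
    # old one so the link is never missing in between.
    tmp_path = "{}.tmp.{}".format(link_path, os.getpid())
    os.symlink(new_target, tmp_path)
    os.replace(tmp_path, link_path)
```

Checking `os.readlink(link_path)` afterwards shows where the link now points.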
This solved my issue, though there are two things worth noting:
- This took quite some time to figure out, and perhaps could be solved or better documented to help others struggling with these same issues on Sagemaker (dealing with Linux AMI, gcc, etc.)
- The performance I got when using cpu_offload wasn't what I was expecting. It could be that the GPU I'm using is a bit small (V100 with 16 GB of RAM), but memory-wise performance wasn't great (batches no greater than 16 when using Hugging Face's DistilBERT). This topic will probably be filed in a separate issue.
Hope this helps someone out there! :D
Hi @Gabriel-Macias - thank you for sharing your result and that you were able to get this to work.
I'm going to close this issue as it is stale and resolved, and with changes to python/nvcc/cuda/etc., it's less likely that someone would hit the same issue.