Error building extension 'cpu_adam' (Hugging Face integration)

Hi guys,

I'm trying to use DeepSpeed's cpu_offload option through the Hugging Face Trainer integration on a single GPU on AWS SageMaker (ml.p2.xlarge instance). However, I've been struggling for quite some time to get it to work. Here are the versions I'm using:

CUDA: 10.1 (V10.1.243)
transformers: 4.6.0
pytorch: 1.7.1
deepspeed: 0.3.16 / 0.4.0 (master)
OS: Amazon Linux AMI 2018.03 (x86_64)
gcc/g++/c++: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
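
Since the failure below happens while compiling a C++ extension, the compiler version in this list is worth checking first. The following is a minimal sketch that parses a `gcc --version` line and compares it with a minimum version. The threshold `ASSUMED_MIN_GCC` and both helper names are assumptions made for illustration, not values confirmed in this issue, so check the real requirement for your PyTorch release.

```python
import re

# Assumption: placeholder minimum GCC version for building PyTorch C++
# extensions. Confirm the real requirement for your PyTorch release.
ASSUMED_MIN_GCC = (5, 0, 0)


def parse_gcc_version(version_line):
    """Return the first X.Y.Z triple found in a `gcc --version` line."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_line)
    if match is None:
        raise ValueError("no version number found in: {!r}".format(version_line))
    return tuple(int(part) for part in match.groups())


def is_gcc_new_enough(version_line, minimum=ASSUMED_MIN_GCC):
    """Compare the parsed version tuple against the given minimum."""
    return parse_gcc_version(version_line) >= minimum


if __name__ == "__main__":
    # Version string as reported in the list above.
    reported = "(GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)"
    print(parse_gcc_version(reported), is_gcc_new_enough(reported))
```

If the check fails, one option is to install a newer toolchain and point the build at it before compiling the op; whether that resolves this particular error is not established here.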

I've taken a look at similar issues (#889, #694, #885) but haven't had any success. So far I have tried:

Changing the versions of the pytorch, deepspeed and transformers libraries
Pre-building the DeepSpeed ops (DS_BUILD_OPS=1 and DS_BUILD_CPU_ADAM=1)
Installing DeepSpeed and Transformers from source

Here is a simplified version of the code I'm running on Sagemaker:

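import os

from transformers import EarlyStoppingCallback, PrinterCallback, TrainingArguments

# MultilabelTrainer, TextClassifier, the datasets, loss and compute_metrics_fct
# are defined elsewhere in my code and omitted from this simplified version.

os.environ['MASTER_ADDR'] = 'localhost'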
os.environ['MASTER_PORT'] = '9993' # modify if RuntimeError: Address already in use
os.environ['RANK'] = "0"
os.environ['LOCAL_RANK'] = "0"
os.environ['WORLD_SIZE'] = "1"

# Training Args
MAX_LEN = 512
TRAIN_BATCH_SIZE = 8
VAL_BATCH_SIZE = 8
EPOCHS = 1
LEARNING_RATE = 1e-05

args = TrainingArguments(
    output_dir = "../flujo_nlp/outputs/",
    overwrite_output_dir = True,
    per_device_train_batch_size = TRAIN_BATCH_SIZE,
    per_device_eval_batch_size = VAL_BATCH_SIZE,
    learning_rate = LEARNING_RATE,
    weight_decay = 0.01,
    max_grad_norm = 1.0,
    
    num_train_epochs = EPOCHS, # if max_steps is set, that value is used to train the model instead
    max_steps = 2000, # 2000
    evaluation_strategy = "steps",
    eval_steps = 200, # 200
    
    lr_scheduler_type = 'linear',
    warmup_ratio = 0.0,
    warmup_steps = 0,
    logging_dir = "../flujo_nlp/logs/",
    logging_strategy = 'steps',
    logging_steps = 200,
    seed = 42,
    fp16 = False,
    dataloader_drop_last = False,
    dataloader_num_workers = 0,
    label_names = ["labels"],
    load_best_model_at_end = True,
    metric_for_best_model = "eval_loss",
    greater_is_better = False,
    ignore_data_skip = False,
    deepspeed = "deepspeed_config_1gpu.json"
)

cbks = [
    EarlyStoppingCallback(early_stopping_patience = 2, early_stopping_threshold = 0),
    PrinterCallback()
]

# Trainer
trainer = MultilabelTrainer(
    num_labels = n_labels,
    loss_fct = loss,
    model = TextClassifier(model_dict["model_path"], n_labels, loss, n_extra_layers = n_extra_layers),
    args = args,
    train_dataset = train_dataset,
    eval_dataset = val_dataset,
    compute_metrics = compute_metrics_fct,
    callbacks = cbks
)

train_output = trainer.train()

My json config file:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true,
        "cpu_offload": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}
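
The log further down warns that `cpu_offload` is deprecated in favour of `offload_optimizer`. A minimal sketch of rewriting the config accordingly is shown here. The helper name `migrate_cpu_offload` is hypothetical, and the `{"device": "cpu"}` form is my understanding of the newer key, so verify it against the documentation for your DeepSpeed version. This only addresses the deprecation warning and is not expected to fix the build error by itself.

```python
import copy
import json


def migrate_cpu_offload(config):
    """Return a copy of a DeepSpeed config dict in which the deprecated
    zero_optimization.cpu_offload flag is replaced by offload_optimizer."""
    new_config = copy.deepcopy(config)
    zero = new_config.get("zero_optimization", {})
    if "cpu_offload" in zero:
        enabled = zero.pop("cpu_offload")
        # Only add the new key when offloading was actually requested and
        # the user has not already provided an offload_optimizer section.
        if enabled and "offload_optimizer" not in zero:
            zero["offload_optimizer"] = {"device": "cpu"}
    return new_config


if __name__ == "__main__":
    # Reduced version of the config above, keeping only the relevant keys.
    old_config = {"zero_optimization": {"stage": 2, "cpu_offload": True}}
    print(json.dumps(migrate_cpu_offload(old_config), indent=4))
```

The function works on a deep copy, so the original dict loaded from deepspeed_config_1gpu.json stays unchanged and the two versions can be compared.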

The output of running ds_report:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
/bin/sh: line 0: type: llvm-config: not found
/bin/sh: line 0: type: llvm-config-9: not found
 [WARNING]  sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch']
torch version .................... 1.7.1
torch cuda version ............... 10.1
nvcc version ..................... 10.1
deepspeed install path ........... ['/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.4.0+11e94e6, 11e94e6, master
deepspeed wheel compiled w. ...... torch 1.7, cuda 10.1

This is the stack trace I get if I try to run the code without pre-building:

[2021-06-02 21:06:31,700] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.4.0+11e94e6, git-hash=11e94e6, git-branch=master
[2021-06-02 21:06:31,707] [WARNING] [config.py:80:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-06-02 21:06:31,858] [INFO] [utils.py:13:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
[2021-06-02 21:06:31,968] [INFO] [engine.py:173:__init__] DeepSpeed Flops Profiler Enabled: False
Using /home/ec2-user/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
   1538                 check=True,
-> 1539                 env=env)
   1540         else:

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
    437             raise CalledProcessError(retcode, process.args,
--> 438                                      output=stdout, stderr=stderr)
    439     return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-13-4a7b77bf5678> in <module>
     46             )
     47 
---> 48             train_output = trainer.train()
     49 
     50             # Evalua

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
   1112         if args.deepspeed:
   1113             deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
-> 1114                 self, num_training_steps=max_steps, resume_from_checkpoint=resume_from_checkpoint
   1115             )
   1116             self.model = deepspeed_engine.module

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/integrations.py in deepspeed_init(trainer, num_training_steps, resume_from_checkpoint)
    520         config_params=config,
    521         optimizer=optimizer,
--> 522         lr_scheduler=lr_scheduler,
    523     )
    524 

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/deepspeed/__init__.py in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config, config_params)
    134                                  collate_fn=collate_fn,
    135                                  config=config,
--> 136                                  config_params=config_params)
    137     else:
    138         assert mpu is None, "mpu must be None with pipeline parallelism"

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/deepspeed/runtime/engine.py in __init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config, config_params, dont_change_device)
    185         self.lr_scheduler = None
    186         if model_parameters or optimizer:
--> 187             self._configure_optimizer(optimizer, model_parameters)
    188             self._configure_lr_scheduler(lr_scheduler)
    189             self._report_progress(0)

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_optimizer(self, client_optimizer, model_parameters)
    687                 logger.info('Using client Optimizer as basic optimizer')
    688         else:
--> 689             basic_optimizer = self._configure_basic_optimizer(model_parameters)
    690             if self.global_rank == 0:
    691                 logger.info(

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_basic_optimizer(self, model_parameters)
    758                     optimizer = DeepSpeedCPUAdam(model_parameters,
    759                                                  **optimizer_parameters,
--> 760                                                  adamw_mode=effective_adam_w_mode)
    761                 else:
    762                     from deepspeed.ops.adam import FusedAdam

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/deepspeed/ops/adam/cpu_adam.py in __init__(self, model_params, lr, bias_correction, betas, eps, weight_decay, amsgrad, adamw_mode)
     76         DeepSpeedCPUAdam.optimizer_id = DeepSpeedCPUAdam.optimizer_id + 1
     77         self.adam_w_mode = adamw_mode
---> 78         self.ds_opt_adam = CPUAdamBuilder().load()
     79 
     80         self.ds_opt_adam.create_adam(self.opt_id,

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in load(self, verbose)
    214             return importlib.import_module(self.absolute_name())
    215         else:
--> 216             return self.jit_load(verbose)
    217 
    218     def jit_load(self, verbose=True):

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in jit_load(self, verbose)
    251             extra_cuda_cflags=self.nvcc_args(),
    252             extra_ldflags=self.extra_ldflags(),
--> 253             verbose=verbose)
    254         build_duration = time.time() - start_build
    255         if verbose:

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/cpp_extension.py in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, keep_intermediates)
    995         with_cuda,
    996         is_python_module,
--> 997         keep_intermediates=keep_intermediates)
    998 
    999 

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, keep_intermediates)
   1200                         build_directory=build_directory,
   1201                         verbose=verbose,
-> 1202                         with_cuda=with_cuda)
   1203             finally:
   1204                 baton.release()

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _write_ninja_file_and_build_library(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda)
   1298         build_directory,
   1299         verbose,
-> 1300         error_prefix="Error building extension '{}'".format(name))
   1301 
   1302 

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
   1553         if hasattr(error, 'output') and error.output:  # type: ignore
   1554             message += ": {}".format(error.output.decode())  # type: ignore
-> 1555         raise RuntimeError(message) from e
   1556 
   1557 

RuntimeError: Error building extension 'cpu_adam'
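
The traceback above stops at the failed `ninja -v` command and does not include the compiler's own messages. A rough way to see them is to rerun that command in the build directory printed in the log and capture everything it writes. The helper name `run_and_capture` is hypothetical, and the build directory is the one from the log above, so adjust it for your machine.

```python
import os
import shutil
import subprocess
import sys


def run_and_capture(command, cwd=None):
    """Run a command and return (exit_code, combined stdout and stderr text)."""
    result = subprocess.run(
        command,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so compiler errors are kept
        universal_newlines=True,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Build directory as printed in the log above; adjust for your machine.
    build_dir = "/home/ec2-user/.cache/torch_extensions/cpu_adam"
    if os.path.isdir(build_dir) and shutil.which("ninja"):
        code, output = run_and_capture(["ninja", "-v"], cwd=build_dir)
        print("ninja exited with", code)
        print(output)
    else:
        print("build directory or ninja not found", file=sys.stderr)
```

The first compiler error in the captured output is usually more informative than the final RuntimeError.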

And finally, the stack trace I get if I try to pre-build while installing with DS_BUILD_CPU_ADAM=1 pip install deepspeed:

Collecting deepspeed
  Downloading deepspeed-0.3.16.tar.gz (385 kB)
     |████████████████████████████████| 385 kB 19.1 MB/s eta 0:00:01
Requirement already satisfied: torch>=1.2 in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from deepspeed) (1.7.1)
Requirement already satisfied: torchvision>=0.4.0 in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from deepspeed) (0.8.2)
Requirement already satisfied: tqdm in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from deepspeed) (4.61.0)
Collecting tensorboardX==1.8
  Downloading tensorboardX-1.8-py2.py3-none-any.whl (216 kB)
     |████████████████████████████████| 216 kB 46.3 MB/s eta 0:00:01
Collecting ninja
  Downloading ninja-1.10.0.post2-py3-none-manylinux1_x86_64.whl (107 kB)
     |████████████████████████████████| 107 kB 57.8 MB/s eta 0:00:01
Requirement already satisfied: numpy in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from deepspeed) (1.19.2)
Requirement already satisfied: psutil in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from deepspeed) (5.8.0)
Requirement already satisfied: protobuf>=3.2.0 in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from tensorboardX==1.8->deepspeed) (3.15.8)
Requirement already satisfied: six in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from tensorboardX==1.8->deepspeed) (1.15.0)
Requirement already satisfied: typing_extensions in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from torch>=1.2->deepspeed) (3.7.4.3)
Requirement already satisfied: dataclasses in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from torch>=1.2->deepspeed) (0.8)
Requirement already satisfied: pillow>=4.1.1 in /home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages (from torchvision>=0.4.0->deepspeed) (8.1.0)
Building wheels for collected packages: deepspeed
  Building wheel for deepspeed (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/ec2-user/anaconda3/envs/pytorch_latest_p36/bin/python3.6 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-mg0l8e9x/deepspeed_429da1b49e0440b5894b5291a4e649c0/setup.py'"'"'; __file__='"'"'/tmp/pip-install-mg0l8e9x/deepspeed_429da1b49e0440b5894b5291a4e649c0/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-p5cawauv
       cwd: /tmp/pip-install-mg0l8e9x/deepspeed_429da1b49e0440b5894b5291a4e649c0/
  Complete output (254 lines):
  DS_BUILD_OPS=0
  /bin/sh: line 0: type: llvm-config: not found
  /bin/sh: line 0: type: llvm-config-9: not found
   [WARNING]  sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
   [WARNING]  async_io requires the libraries: ['libaio-dev'] but are missing.
  /bin/sh: line 0: type: llvm-config: not found
  /bin/sh: line 0: type: llvm-config-9: not found
   [WARNING]  sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
   [WARNING]  async_io requires the libraries: ['libaio-dev'] but are missing.
  Install Ops={'cpu_adam': 1, 'fused_adam': False, 'fused_lamb': False, 'sparse_attn': False, 'transformer': False, 'stochastic_transformer': False, 'utils': False, 'async_io': False}
  fatal: not a git repository (or any of the parent directories): .git
  version=0.3.16, git_hash=unknown, git_branch=unknown
  install_requires=['torch>=1.2', 'torchvision>=0.4.0', 'tqdm', 'tensorboardX==1.8', 'ninja', 'numpy', 'psutil']
  compatible_ops={'cpu_adam': True, 'fused_adam': True, 'fused_lamb': True, 'sparse_attn': False, 'transformer': True, 'stochastic_transformer': True, 'utils': True, 'async_io': False}
  ext_modules=[<setuptools.extension.Extension('deepspeed.ops.adam.cpu_adam_op') at 0x7f10bd356668>]
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-3.6
  creating build/lib.linux-x86_64-3.6/deepspeed
  copying deepspeed/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed
  copying deepspeed/constants.py -> build/lib.linux-x86_64-3.6/deepspeed
  copying deepspeed/git_version_info_installed.py -> build/lib.linux-x86_64-3.6/deepspeed
  copying deepspeed/git_version_info.py -> build/lib.linux-x86_64-3.6/deepspeed
  copying deepspeed/env_report.py -> build/lib.linux-x86_64-3.6/deepspeed
  creating build/lib.linux-x86_64-3.6/op_builder
  copying op_builder/__init__.py -> build/lib.linux-x86_64-3.6/op_builder
  copying op_builder/fused_lamb.py -> build/lib.linux-x86_64-3.6/op_builder
  copying op_builder/transformer.py -> build/lib.linux-x86_64-3.6/op_builder
  copying op_builder/utils.py -> build/lib.linux-x86_64-3.6/op_builder
  copying op_builder/async_io.py -> build/lib.linux-x86_64-3.6/op_builder
  copying op_builder/builder.py -> build/lib.linux-x86_64-3.6/op_builder
  copying op_builder/fused_adam.py -> build/lib.linux-x86_64-3.6/op_builder
  copying op_builder/cpu_adam.py -> build/lib.linux-x86_64-3.6/op_builder
  copying op_builder/sparse_attn.py -> build/lib.linux-x86_64-3.6/op_builder
  copying op_builder/stochastic_transformer.py -> build/lib.linux-x86_64-3.6/op_builder
  creating build/lib.linux-x86_64-3.6/deepspeed/ops
  copying deepspeed/ops/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/ops
  copying deepspeed/ops/module_inject.py -> build/lib.linux-x86_64-3.6/deepspeed/ops
  creating build/lib.linux-x86_64-3.6/deepspeed/module_inject
  copying deepspeed/module_inject/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/module_inject
  copying deepspeed/module_inject/inject.py -> build/lib.linux-x86_64-3.6/deepspeed/module_inject
  copying deepspeed/module_inject/replace_module.py -> build/lib.linux-x86_64-3.6/deepspeed/module_inject
  creating build/lib.linux-x86_64-3.6/deepspeed/utils
  copying deepspeed/utils/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/utils
  copying deepspeed/utils/zero_to_fp32.py -> build/lib.linux-x86_64-3.6/deepspeed/utils
  copying deepspeed/utils/timer.py -> build/lib.linux-x86_64-3.6/deepspeed/utils
  copying deepspeed/utils/logging.py -> build/lib.linux-x86_64-3.6/deepspeed/utils
  copying deepspeed/utils/distributed.py -> build/lib.linux-x86_64-3.6/deepspeed/utils
  creating build/lib.linux-x86_64-3.6/deepspeed/elasticity
  copying deepspeed/elasticity/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/elasticity
  copying deepspeed/elasticity/config.py -> build/lib.linux-x86_64-3.6/deepspeed/elasticity
  copying deepspeed/elasticity/constants.py -> build/lib.linux-x86_64-3.6/deepspeed/elasticity
  copying deepspeed/elasticity/elasticity.py -> build/lib.linux-x86_64-3.6/deepspeed/elasticity
  creating build/lib.linux-x86_64-3.6/deepspeed/launcher
  copying deepspeed/launcher/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/launcher
  copying deepspeed/launcher/constants.py -> build/lib.linux-x86_64-3.6/deepspeed/launcher
  copying deepspeed/launcher/launch.py -> build/lib.linux-x86_64-3.6/deepspeed/launcher
  copying deepspeed/launcher/runner.py -> build/lib.linux-x86_64-3.6/deepspeed/launcher
  copying deepspeed/launcher/multinode_runner.py -> build/lib.linux-x86_64-3.6/deepspeed/launcher
  creating build/lib.linux-x86_64-3.6/deepspeed/pipe
  copying deepspeed/pipe/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/pipe
  creating build/lib.linux-x86_64-3.6/deepspeed/profiling
  copying deepspeed/profiling/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/profiling
  copying deepspeed/profiling/config.py -> build/lib.linux-x86_64-3.6/deepspeed/profiling
  copying deepspeed/profiling/constants.py -> build/lib.linux-x86_64-3.6/deepspeed/profiling
  creating build/lib.linux-x86_64-3.6/deepspeed/runtime
  copying deepspeed/runtime/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
  copying deepspeed/runtime/config_utils.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
  copying deepspeed/runtime/dataloader.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
  copying deepspeed/runtime/engine.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
  copying deepspeed/runtime/config.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
  copying deepspeed/runtime/utils.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
  copying deepspeed/runtime/constants.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
  copying deepspeed/runtime/lr_schedules.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
  copying deepspeed/runtime/progressive_layer_drop.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
  copying deepspeed/runtime/csr_tensor.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime
  creating build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention
  copying deepspeed/ops/sparse_attention/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention
  copying deepspeed/ops/sparse_attention/sparse_self_attention.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention
  copying deepspeed/ops/sparse_attention/matmul.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention
  copying deepspeed/ops/sparse_attention/softmax.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention
  copying deepspeed/ops/sparse_attention/bert_sparse_self_attention.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention
  copying deepspeed/ops/sparse_attention/sparsity_config.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention
  copying deepspeed/ops/sparse_attention/sparse_attention_utils.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention
  creating build/lib.linux-x86_64-3.6/deepspeed/ops/aio
  copying deepspeed/ops/aio/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/aio
  creating build/lib.linux-x86_64-3.6/deepspeed/ops/transformer
  copying deepspeed/ops/transformer/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/transformer
  copying deepspeed/ops/transformer/transformer.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/transformer
  creating build/lib.linux-x86_64-3.6/deepspeed/ops/lamb
  copying deepspeed/ops/lamb/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/lamb
  copying deepspeed/ops/lamb/fused_lamb.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/lamb
  creating build/lib.linux-x86_64-3.6/deepspeed/ops/adam
  copying deepspeed/ops/adam/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/adam
  copying deepspeed/ops/adam/fused_adam.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/adam
  copying deepspeed/ops/adam/multi_tensor_apply.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/adam
  copying deepspeed/ops/adam/cpu_adam.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/adam
  creating build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
  copying deepspeed/ops/op_builder/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
  copying deepspeed/ops/op_builder/fused_lamb.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
  copying deepspeed/ops/op_builder/transformer.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
  copying deepspeed/ops/op_builder/utils.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
  copying deepspeed/ops/op_builder/async_io.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
  copying deepspeed/ops/op_builder/builder.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
  copying deepspeed/ops/op_builder/fused_adam.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
  copying deepspeed/ops/op_builder/cpu_adam.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
  copying deepspeed/ops/op_builder/sparse_attn.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
  copying deepspeed/ops/op_builder/stochastic_transformer.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/op_builder
  creating build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention/trsrc
  copying deepspeed/ops/sparse_attention/trsrc/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention/trsrc
  creating build/lib.linux-x86_64-3.6/deepspeed/profiling/flops_profiler
  copying deepspeed/profiling/flops_profiler/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/profiling/flops_profiler
  copying deepspeed/profiling/flops_profiler/profiler.py -> build/lib.linux-x86_64-3.6/deepspeed/profiling/flops_profiler
  creating build/lib.linux-x86_64-3.6/deepspeed/runtime/activation_checkpointing
  copying deepspeed/runtime/activation_checkpointing/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/activation_checkpointing
  copying deepspeed/runtime/activation_checkpointing/config.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/activation_checkpointing
  copying deepspeed/runtime/activation_checkpointing/checkpointing.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/activation_checkpointing
  creating build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
  copying deepspeed/runtime/zero/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
  copying deepspeed/runtime/zero/config.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
  copying deepspeed/runtime/zero/stage3.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
  copying deepspeed/runtime/zero/linear.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
  copying deepspeed/runtime/zero/utils.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
  copying deepspeed/runtime/zero/offload_config.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
  copying deepspeed/runtime/zero/constants.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
  copying deepspeed/runtime/zero/stage2.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
  copying deepspeed/runtime/zero/contiguous_memory_allocator.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
  copying deepspeed/runtime/zero/tiling.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
  copying deepspeed/runtime/zero/test.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
  copying deepspeed/runtime/zero/offload_constants.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
  copying deepspeed/runtime/zero/stage1.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
  copying deepspeed/runtime/zero/partition_parameters.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/zero
  creating build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
  copying deepspeed/runtime/swap_tensor/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
  copying deepspeed/runtime/swap_tensor/aio_config.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
  copying deepspeed/runtime/swap_tensor/utils.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
  copying deepspeed/runtime/swap_tensor/pipelined_optimizer_swapper.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
  copying deepspeed/runtime/swap_tensor/constants.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
  copying deepspeed/runtime/swap_tensor/async_swapper.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
  copying deepspeed/runtime/swap_tensor/optimizer_utils.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
  copying deepspeed/runtime/swap_tensor/partitioned_optimizer_swapper.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
  copying deepspeed/runtime/swap_tensor/partitioned_param_swapper.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/swap_tensor
  creating build/lib.linux-x86_64-3.6/deepspeed/runtime/pipe
  copying deepspeed/runtime/pipe/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/pipe
  copying deepspeed/runtime/pipe/engine.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/pipe
  copying deepspeed/runtime/pipe/p2p.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/pipe
  copying deepspeed/runtime/pipe/schedule.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/pipe
  copying deepspeed/runtime/pipe/module.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/pipe
  copying deepspeed/runtime/pipe/topology.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/pipe
  creating build/lib.linux-x86_64-3.6/deepspeed/runtime/compression
  copying deepspeed/runtime/compression/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/compression
  copying deepspeed/runtime/compression/cupy.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/compression
  creating build/lib.linux-x86_64-3.6/deepspeed/runtime/fp16
  copying deepspeed/runtime/fp16/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/fp16
  copying deepspeed/runtime/fp16/unfused_optimizer.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/fp16
  copying deepspeed/runtime/fp16/fused_optimizer.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/fp16
  copying deepspeed/runtime/fp16/loss_scaler.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/fp16
  creating build/lib.linux-x86_64-3.6/deepspeed/runtime/comm
  copying deepspeed/runtime/comm/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/comm
  copying deepspeed/runtime/comm/mpi.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/comm
  copying deepspeed/runtime/comm/nccl.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/comm
  creating build/lib.linux-x86_64-3.6/deepspeed/runtime/fp16/onebit
  copying deepspeed/runtime/fp16/onebit/__init__.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/fp16/onebit
  copying deepspeed/runtime/fp16/onebit/adam.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/fp16/onebit
  copying deepspeed/runtime/fp16/onebit/lamb.py -> build/lib.linux-x86_64-3.6/deepspeed/runtime/fp16/onebit
  running egg_info
  writing deepspeed.egg-info/PKG-INFO
  writing dependency_links to deepspeed.egg-info/dependency_links.txt
  writing requirements to deepspeed.egg-info/requires.txt
  writing top-level names to deepspeed.egg-info/top_level.txt
  reading manifest file 'deepspeed.egg-info/SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  warning: no files found matching '*.cc' under directory 'deepspeed'
  warning: no files found matching '*.tr' under directory 'csrc'
  warning: no files found matching '*.cc' under directory 'csrc'
  writing manifest file 'deepspeed.egg-info/SOURCES.txt'
  creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc
  creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/adam
  copying deepspeed/ops/csrc/adam/compat.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/adam
  copying deepspeed/ops/csrc/adam/cpu_adam.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/adam
  copying deepspeed/ops/csrc/adam/custom_cuda_kernel.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/adam
  copying deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/adam
  copying deepspeed/ops/csrc/adam/multi_tensor_adam.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/adam
  copying deepspeed/ops/csrc/adam/multi_tensor_apply.cuh -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/adam
  creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio
  creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/common
  copying deepspeed/ops/csrc/aio/common/deepspeed_aio_common.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/common
  copying deepspeed/ops/csrc/aio/common/deepspeed_aio_common.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/common
  copying deepspeed/ops/csrc/aio/common/deepspeed_aio_types.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/common
  copying deepspeed/ops/csrc/aio/common/deepspeed_aio_types.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/common
  copying deepspeed/ops/csrc/aio/common/deepspeed_aio_utils.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/common
  copying deepspeed/ops/csrc/aio/common/deepspeed_aio_utils.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/common
  creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
  copying deepspeed/ops/csrc/aio/py_lib/deepspeed_aio_thread.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
  copying deepspeed/ops/csrc/aio/py_lib/deepspeed_aio_thread.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
  copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
  copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
  copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
  copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
  copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_copy.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
  copying deepspeed/ops/csrc/aio/py_lib/deepspeed_py_copy.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
  copying deepspeed/ops/csrc/aio/py_lib/py_ds_aio.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/aio/py_lib
  creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
  copying deepspeed/ops/csrc/includes/StopWatch.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
  copying deepspeed/ops/csrc/includes/Timer.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
  copying deepspeed/ops/csrc/includes/context.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
  copying deepspeed/ops/csrc/includes/cpu_adam.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
  copying deepspeed/ops/csrc/includes/cublas_wrappers.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
  copying deepspeed/ops/csrc/includes/custom_cuda_layers.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
  copying deepspeed/ops/csrc/includes/dropout.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
  copying deepspeed/ops/csrc/includes/ds_transformer_cuda.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
  copying deepspeed/ops/csrc/includes/feed_forward.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
  copying deepspeed/ops/csrc/includes/gelu.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
  copying deepspeed/ops/csrc/includes/gemm_test.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
  copying deepspeed/ops/csrc/includes/general_kernels.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
  copying deepspeed/ops/csrc/includes/normalize_layer.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
  copying deepspeed/ops/csrc/includes/softmax.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
  copying deepspeed/ops/csrc/includes/strided_batch_gemm.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
  copying deepspeed/ops/csrc/includes/type_shim.h -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/includes
  creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/lamb
  copying deepspeed/ops/csrc/lamb/fused_lamb_cuda.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/lamb
  copying deepspeed/ops/csrc/lamb/fused_lamb_cuda_kernel.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/lamb
  creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/sparse_attention
  copying deepspeed/ops/csrc/sparse_attention/utils.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/sparse_attention
  creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/transformer
  copying deepspeed/ops/csrc/transformer/cublas_wrappers.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/transformer
  copying deepspeed/ops/csrc/transformer/dropout_kernels.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/transformer
  copying deepspeed/ops/csrc/transformer/ds_transformer_cuda.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/transformer
  copying deepspeed/ops/csrc/transformer/gelu_kernels.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/transformer
  copying deepspeed/ops/csrc/transformer/general_kernels.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/transformer
  copying deepspeed/ops/csrc/transformer/normalize_kernels.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/transformer
  copying deepspeed/ops/csrc/transformer/softmax_kernels.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/transformer
  copying deepspeed/ops/csrc/transformer/transform_kernels.cu -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/transformer
  creating build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/utils
  copying deepspeed/ops/csrc/utils/flatten_unflatten.cpp -> build/lib.linux-x86_64-3.6/deepspeed/ops/csrc/utils
  copying deepspeed/ops/sparse_attention/trsrc/matmul.tr -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention/trsrc
  copying deepspeed/ops/sparse_attention/trsrc/softmax_bwd.tr -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention/trsrc
  copying deepspeed/ops/sparse_attention/trsrc/softmax_fwd.tr -> build/lib.linux-x86_64-3.6/deepspeed/ops/sparse_attention/trsrc
  running build_ext
  building 'deepspeed.ops.adam.cpu_adam_op' extension
  creating build/temp.linux-x86_64-3.6
  creating build/temp.linux-x86_64-3.6/csrc
  creating build/temp.linux-x86_64-3.6/csrc/adam
  /home/ec2-user/anaconda3/envs/pytorch_latest_p36/bin/x86_64-conda-linux-gnu-cc -DNDEBUG -fwrapv -O2 -Wall -Wstrict-prototypes -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/ec2-user/anaconda3/envs/pytorch_latest_p36/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/ec2-user/anaconda3/envs/pytorch_latest_p36/include -fPIC -Icsrc/includes -I/usr/local/cuda-10.1/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include/TH -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include/THC -I/usr/local/cuda-10.1/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/include/python3.6m -c csrc/adam/cpu_adam.cpp -o build/temp.linux-x86_64-3.6/csrc/adam/cpu_adam.o -O3 -std=c++14 -L/usr/local/cuda-10.1/lib64 -lcudart -lcublas -g -Wno-reorder -march=native -fopenmp -D__AVX256__ -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -DTORCH_EXTENSION_NAME=cpu_adam_op -D_GLIBCXX_USE_CXX11_ABI=0
  cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
  /usr/local/cuda-10.1/bin/nvcc -Icsrc/includes -I/usr/local/cuda-10.1/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include/TH -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include/THC -I/usr/local/cuda-10.1/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/include/python3.6m -c csrc/adam/custom_cuda_kernel.cu -o build/temp.linux-x86_64-3.6/csrc/adam/custom_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -DTORCH_EXTENSION_NAME=cpu_adam_op -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ec2-user/anaconda3/envs/pytorch_latest_p36/bin/x86_64-conda-linux-gnu-cc
  In file included from /usr/local/cuda-10.1/include/cuda_runtime.h:83,
                   from <command-line>:
  /usr/local/cuda-10.1/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
    138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported!
        |  ^~~~~
  error: command '/usr/local/cuda-10.1/bin/nvcc' failed with exit status 1

Any input on how to solve this issue would be very much appreciated!

Answer 1 · 2021-06-04T16:17:10.000Z

Update:

Following the steps in https://seo-explorer.io/blog/configuring-centos-7-to-finetune-eleutherai-gpt-neo-2-7b-with-torch-and-deepspeed/ I managed to get the cpu_offload function to run by updating my gcc from 4.8.5 to 7.2.1. However, as observed in Xirider/finetune-gpt2xl#3, I also had to set "torch_adam": true in the config file for the code to run.

As mentioned in that issue, I too experienced a significant (2x) decrease in training speed. Does anyone have an idea of why this might be happening?
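
For reference, here is a minimal sketch of where the "torch_adam": true setting mentioned above would sit in a DeepSpeed config. The learning rate and weight decay mirror the TrainingArguments in the question; the ZeRO stage and the exact key names are assumptions based on the DeepSpeed 0.3.x/0.4.x config schema, so check them against the version you have installed. As far as I understand, this flag makes DeepSpeed use PyTorch's own Adam instead of its compiled op, which may be related to the slowdown, though I have not verified that.

```python
import json

# Sketch of a DeepSpeed config using the torch_adam workaround.
# Key names are assumptions based on the 0.3.x/0.4.x config schema.
ds_config = {
    "zero_optimization": {
        "stage": 2,            # assumed stage; not stated in the thread
        "cpu_offload": True,   # offload optimizer state to host memory
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-5,            # LEARNING_RATE from the question
            "weight_decay": 0.01,  # weight_decay from the question
            "torch_adam": True,    # use torch's Adam rather than the compiled op
        },
    },
}

# Print as JSON so it can be saved to the file passed to the Trainer.
print(json.dumps(ds_config, indent=2))
```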

Answer 2 · 2021-06-17T16:38:39.000Z

Hi again, another update:

Coming back to this issue after some time, I realized that the nvcc command (below) was not using the gcc I had installed system-wide, but rather the one that came with Anaconda, reached through the softlink /home/ec2-user/anaconda3/envs/pytorch_latest_p36/bin/x86_64-conda-linux-gnu-cc.

/usr/local/cuda-10.1/bin/nvcc -Icsrc/includes -I/usr/local/cuda-10.1/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include/TH -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/include/THC -I/usr/local/cuda-10.1/include -I/home/ec2-user/anaconda3/envs/pytorch_latest_p36/include/python3.6m -c csrc/adam/custom_cuda_kernel.cu -o build/temp.linux-x86_64-3.6/csrc/adam/custom_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -DTORCH_EXTENSION_NAME=cpu_adam_op -D_GLIBCXX_USE_CXX11_ABI=0 -ccbin /home/ec2-user/anaconda3/envs/pytorch_latest_p36/bin/x86_64-conda-linux-gnu-cc
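
A quick way to confirm which host compiler nvcc is being handed is to pull the -ccbin argument out of the logged command and ask that binary for its version. The sketch below assumes a gcc-compatible compiler that answers `-dumpversion`; the helper names are hypothetical, and the limit of 8 is taken from the host_config.h error message in the build log.

```python
import re
import shlex
import subprocess

# Per the host_config.h error above, CUDA 10.1 rejects gcc newer than 8.
MAX_GCC_MAJOR_FOR_CUDA_10_1 = 8


def host_compiler_from_nvcc(command):
    """Return the path passed to nvcc via -ccbin, or None if absent."""
    tokens = shlex.split(command)
    for i, tok in enumerate(tokens):
        if tok in ("-ccbin", "--compiler-bindir") and i + 1 < len(tokens):
            return tokens[i + 1]
        if tok.startswith("--compiler-bindir="):
            return tok.split("=", 1)[1]
    return None


def gcc_major(version_text):
    """Extract the major version from output such as '9.3.0'."""
    match = re.match(r"\s*(\d+)", version_text)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_text!r}")
    return int(match.group(1))


def query_gcc_major(compiler_path):
    """Run `<compiler> -dumpversion` and return its major version."""
    result = subprocess.run(
        [compiler_path, "-dumpversion"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return gcc_major(result.stdout)
```

Pasting the logged nvcc line into `host_compiler_from_nvcc(...)` and passing the result to `query_gcc_major(...)` should show whether the major version exceeds `MAX_GCC_MAJOR_FOR_CUDA_10_1`.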

I got the cpu_offload functionality running by first installing a new gcc version with conda:

conda install -c creditx gcc-7 -y

and then re-pointing the softlink used by nvcc:

sudo ln -sfn /home/ec2-user/anaconda3/envs/pytorch_latest_p36/bin/x86_64-pc-linux-gnu-gcc-7.1.0  /home/ec2-user/anaconda3/envs/pytorch_latest_p36/bin/x86_64-conda-linux-gnu-cc

NOTE: re-pointing the softlink to the system-wide gcc compiler did not work; it threw all kinds of strange errors.
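
Before and after re-pointing the softlink, it can help to check where it actually resolves. The following is a read-only sketch (it never modifies the link); `link_status` is a hypothetical helper name, and the paths you pass in would be the ones from the `ln -sfn` command above.

```python
import os


def link_status(link_path, wanted_target):
    """Report where link_path points and whether it resolves to wanted_target."""
    current = os.path.realpath(link_path)
    return {
        "is_symlink": os.path.islink(link_path),
        "resolves_to": current,
        # Compare fully resolved paths so intermediate links do not matter.
        "matches_wanted": current == os.path.realpath(wanted_target),
    }
```

If `matches_wanted` is still false after running `ln -sfn`, the build is probably picking up a different compiler than intended.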

This solved my issue, though there are two things worth noting:

This took quite some time to figure out, and perhaps it could be fixed or better documented to help others struggling with the same issues on Sagemaker (dealing with Linux AMI, gcc, etc.).
The performance I got when using cpu_offload wasn't what I was expecting. It could be that the GPU I'm using is a bit small (V100 with 16 GB of RAM), but memory-wise the results weren't great (batches no larger than 16 when using Hugging Face's DistilBERT). This topic will probably be filed as a separate issue.

Hope this helps someone out there! :D

Answer 3 · 2023-08-18T16:23:49.000Z

Hi @Gabriel-Macias - thank you for sharing your results; glad you were able to get this to work.

I'm going to close this issue since it is resolved. It is also stale, and with the changes to python/nvcc/cuda/etc. since then, it's less likely that someone would hit the same problem.