microsoft/DeepSpeed

[BUG] cpu_adam warning

SkyAndCloud opened this issue · 27 comments

I encountered a warning as training begins:

cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!

Here is my environment:

Ubuntu 18.04 LTS
CUDA 11.8
python=3.9
torch=2.0.0
deepspeed=0.9.2
python3.9-dev

ds_report: [screenshot not included]

I searched through all the issues and found no working solution.

I have this warning as well.
Where can I find the log for deepspeed? I am quite new to this

Hi @SkyAndCloud - the warning you are seeing comes from here. Specifically, it appears when the system-installed CUDA and the CUDA version torch was built with do not match each other. Are you running with DS_SKIP_CUDA_CHECK or not?

Hi, I have verified that the system installed cuda version matches the torch cuda, both cu118. I haven't set DS_SKIP_CUDA_CHECK.

Are there other CUDA installations on the system? Something seems to be triggering that warning. Do you see either of the printouts from this function in your output too?

Hi @SkyAndCloud - any updates on this?

Sorry for the late reply. There are no other CUDA installations, and no prints from the function you referenced. I still don't know what causes this warning. However, my code runs successfully.

Interesting, I'm curious if this repros with the latest deepspeed as well, but since I'm not able to repro this and your code is running successfully I'll close it for now - if others hit this error, feel free to re-open.

I have the same warning. In the end it leads to the fine-tuning (WizardCoder) subprocess being killed and the main process exiting with return code -9.

Here is my configuration:
Ubuntu 22.04 LTS
CUDA 12.1
python=3.10
torch=2.0.1+cu118
deepspeed=0.9.5

[screenshot: DeepSpeedInfo1]

Hi @paveltonev - your issue is probably more similar to #3794, given that it is an error and not a warning, and that you are using WizardCoder. If that's not similar enough, please open a new issue.
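For readers unsure what a return code of -9 means: Python's subprocess convention reports a child killed by signal N as -N, so -9 corresponds to SIGKILL, which on Linux is often (though not always) sent by the kernel's out-of-memory killer. A minimal sketch for decoding such codes follows; the helper name `describe_returncode` is hypothetical and not part of DeepSpeed.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Translate a subprocess return code into a readable description."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with error code {returncode}"


print(describe_returncode(-9))
```

If the result is SIGKILL, checking `dmesg` or the cluster scheduler's job log for out-of-memory events is a reasonable next step.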

@loadams Hi. Wish to re-open this issue since I'm encountering the exact same problem as @SkyAndCloud, except my code is not working.

Output of ds_report:

[2023-08-24 18:01:15,632] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
Traceback (most recent call last):
  File "/opt/conda/envs/llama_etuning/bin/ds_report", line 6, in <module>
    cli_main()
  File "/opt/conda/envs/llama_etuning/lib/python3.10/site-packages/deepspeed/env_report.py", line 159, in cli_main
    main(hide_operator_status=args.hide_operator_status, hide_errors_and_warnings=args.hide_errors_and_warnings)
  File "/opt/conda/envs/llama_etuning/lib/python3.10/site-packages/deepspeed/env_report.py", line 153, in main
    op_report(verbose=not hide_errors_and_warnings)
  File "/opt/conda/envs/llama_etuning/lib/python3.10/site-packages/deepspeed/env_report.py", line 53, in op_report
    is_compatible = OKAY if builder.is_compatible(verbose) else no
  File "/opt/conda/envs/llama_etuning/lib/python3.10/site-packages/deepspeed/ops/op_builder/spatial_inference.py", line 29, in is_compatible
    sys_cuda_major, _ = installed_cuda_version()
  File "/opt/conda/envs/llama_etuning/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 41, in installed_cuda_version
    assert cuda_home is not None, "CUDA_HOME does not exist, unable to compile CUDA op(s)"
AssertionError: CUDA_HOME does not exist, unable to compile CUDA op(s)

Versions of torch cuda and system installed cuda match. So that does not seem to be the issue.
Can you help me figure it out pls? Thanks!

@KeeratKG - the error is here:

AssertionError: CUDA_HOME does not exist, unable to compile CUDA op(s)

It seems you'd need to set your CUDA_HOME env var. However, this can sometimes be a symptom of CUDA not being installed correctly. You may want to just re-install CUDA to be sure.
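As a rough diagnostic, the sketch below guesses where a CUDA toolkit might live by checking the CUDA_HOME and CUDA_PATH environment variables, then `nvcc` on PATH, then the conventional `/usr/local/cuda` directory. This lookup order is an approximation for illustration, not DeepSpeed's or PyTorch's actual code, and `guess_cuda_home` is a hypothetical helper.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def guess_cuda_home(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
    default_dir: str = "/usr/local/cuda",
) -> Optional[str]:
    """Approximate where a CUDA toolkit might live (hypothetical helper)."""
    # 1. Explicit environment variables take priority.
    for var in ("CUDA_HOME", "CUDA_PATH"):
        if env.get(var):
            return env[var]
    # 2. Otherwise derive it from nvcc on PATH: <cuda_home>/bin/nvcc.
    nvcc = which("nvcc")
    if nvcc:
        return os.path.dirname(os.path.dirname(nvcc))
    # 3. Fall back to the conventional Linux location, if it exists.
    return default_dir if os.path.isdir(default_dir) else None


print("Guessed CUDA_HOME:", guess_cuda_home())
```

If this prints None, exporting CUDA_HOME to the directory that contains `bin/nvcc` before launching is the usual fix. The value PyTorch itself resolved can be inspected with `python -c "from torch.utils.cpp_extension import CUDA_HOME; print(CUDA_HOME)"`.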

So I re-installed cudatoolkit via conda, and the issue persists in the ds_report output. I then ran the following:

>>> import os 
>>> print(os.environ.get('CUDA_PATH'))
None

Clearly an issue with my CUDA installation still exists, but I'm not sure I know what it exactly is. Would you be able to suggest any diagnostic checks for this? It's okay if this lies beyond the purview of this issue. Thanks.

If you are using conda, we have an environment.yaml here that you can use and has worked for others.

But to debug, I would try the following:

nvcc --version
python -c "import torch; print('torch:', torch.__version__, torch)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

That should tell you more about your cuda/torch+cuda installs to then debug if they are installed or not seen by Python or what.
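To compare the two versions programmatically, a small sketch like the following can parse the `nvcc --version` text and check it against the string reported by `torch.version.cuda`. The function names are hypothetical, the sample nvcc line is illustrative, and comparing only the major version is an assumption made here, not necessarily the rule DeepSpeed applies.

```python
import re
from typing import Optional, Tuple


def parse_nvcc_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def same_cuda_major(
    nvcc_version: Optional[Tuple[int, int]], torch_cuda: Optional[str]
) -> bool:
    """Return True if both versions are known and share a major version."""
    if nvcc_version is None or not torch_cuda:
        return False
    return nvcc_version[0] == int(torch_cuda.split(".")[0])


# Illustrative input; in practice pass the real `nvcc --version` text
# and the value of torch.version.cuda.
sample = "Cuda compilation tools, release 12.2, V12.2.140"
print(parse_nvcc_release(sample))
print(same_cuda_major(parse_nvcc_release(sample), "11.7"))
```

A mismatch such as nvcc 12.2 against a torch build for CUDA 11.7 is the kind of situation the warning text describes.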

@loadams This still reproduces when installing PyTorch through pip with --index-url https://download.pytorch.org/whl/cu117, as is the default on PyTorch's website. (CUDA_PATH is not set in this installation process.)

Have you fixed the error now? I get the same error.

@amitportnoy and @susht3 - I'm not sure I understand what your issue is, could you elaborate?

Here is the sample of how we install a specific torch/cuda version in our CI tests.

CUDA_PATH won't be set in the torch install process; that should be set by the CUDA install process.

Hi @loadams, I'm getting this warning with a fresh install of deepspeed via pip install transformers[deepspeed]. Here's the output from the commands you posted above:

torch: 2.0.1 <module 'torch' from '/scratch/gpfs/ashwinee/envs/llm/lib/python3.8/site-packages/torch/__init__.py'>
CUDA available: True

and this is the output from ds_report

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/scratch/gpfs/ashwinee/envs/llm/lib/python3.8/site-packages/torch']
torch version .................... 2.0.1
deepspeed install path ........... ['/scratch/gpfs/ashwinee/envs/llm/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.10.3, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 12.2
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
shared memory (/dev/shm) size .... 503.50 GB

and this is the output after deepspeed.initialize

[2023-09-19 12:12:27,640] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.10.3, git-hash=unknown, git-branch=unknown
[2023-09-19 12:12:27,642] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-09-19 12:12:27,642] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-09-19 12:12:32,339] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-09-19 12:12:32,558] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-09-19 12:12:32,558] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!

Using /home/ashwinee/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /home/ashwinee/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /home/ashwinee/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...                                                                              
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)                                                                
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.864145040512085 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000050, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2023-09-19 12:12:34,209] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-09-19 12:12:34,225] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-09-19 12:12:34,225] [INFO] [utils.py:54:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-09-19 12:12:34,225] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2023-09-19 12:12:34,225] [INFO] [stage_1_and_2.py:146:__init__] Reduce bucket size 200000000
[2023-09-19 12:12:34,225] [INFO] [stage_1_and_2.py:147:__init__] Allgather bucket size 200000000
[2023-09-19 12:12:34,225] [INFO] [stage_1_and_2.py:148:__init__] CPU Offload: True
[2023-09-19 12:12:34,225] [INFO] [stage_1_and_2.py:149:__init__] Round robin gradient partitioning: False
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.9055325984954834 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000050, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
Rank: 0 partition count [2] and sizes[(1387535360, False)]
Rank: 1 partition count [2] and sizes[(1387535360, False)]
[2023-09-19 12:12:40,323] [INFO] [utils.py:803:see_memory_usage] Before initializing optimizer states
[2023-09-19 12:12:40,324] [INFO] [utils.py:804:see_memory_usage] MA 5.54 GB         Max_MA 5.54 GB         CA 6.04 GB         Max_CA 6 GB
[2023-09-19 12:12:40,324] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory:  used = 49.96 GB, percent = 5.0%
[2023-09-19 12:12:43,823] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 218608
[2023-09-19 12:12:44,781] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 218609

So it seems that the "cpu_adam cuda is missing" warning is causing the initialization of optimizer states to fail?

EDIT: for completeness, my ds_config.json is:

{
    "train_batch_size": 64,
    "optimizer": {
        "type": "AdamW",
        "params": {
          "lr": 5e-5
      }
    },
    
    "fp16": {
        "enabled": true,
        "auto_cast": false,
        "loss_scale": 0,
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "consecutive_hysteresis": false,
        "min_loss_scale": 1
    },

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}

Hi @kiddyboots216 - the warning is thrown from here which is in the op_builder. Can you try, when you install DeepSpeed, running DS_BUILD_CPU_ADAM=1 pip install deepspeed so the ops will be pre-compiled and we can debug the error that way?

Hi @loadams, I installed DeepSpeed with this command:

(llm) [ashwinee@della-gpu ashwinee]$ DS_BUILD_CPU_ADAM=1 pip install --no-cache-dir deepspeed --global-option="build_ext" --global-option="-j8"
WARNING: Implying --no-binary=:all: due to the presence of --build-option / --global-option / --install-option. Consider using --config-settings for more flexibility.
DEPRECATION: --no-binary currently disables reading from the cache of locally built wheels. In the future --no-binary will not influence the wheel cache. pip 23.1 will enforce this behaviour change. A possible replacement is to use the --no-cache-dir option. You can use the flag --use-feature=no-binary-enable-wheel-cache to test the upcoming behaviour. Discussion can be found at https://github.com/pypa/pip/issues/11453
Collecting deepspeed
  Downloading deepspeed-0.10.3.tar.gz (867 kB)
     โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 867.3/867.3 kB 47.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: hjson in ./envs/llm/lib/python3.8/site-packages (from deepspeed) (3.1.0)
Requirement already satisfied: ninja in ./envs/llm/lib/python3.8/site-packages (from deepspeed) (1.11.1)
Requirement already satisfied: numpy in ./envs/llm/lib/python3.8/site-packages (from deepspeed) (1.24.3)
Requirement already satisfied: packaging>=20.0 in ./envs/llm/lib/python3.8/site-packages (from deepspeed) (23.1)
Requirement already satisfied: psutil in ./envs/llm/lib/python3.8/site-packages (from deepspeed) (5.9.5)
Requirement already satisfied: py-cpuinfo in ./envs/llm/lib/python3.8/site-packages (from deepspeed) (9.0.0)
Requirement already satisfied: pydantic<2.0.0 in ./envs/llm/lib/python3.8/site-packages (from deepspeed) (1.10.12)
Requirement already satisfied: torch in ./envs/llm/lib/python3.8/site-packages (from deepspeed) (2.0.1)
Requirement already satisfied: tqdm in ./envs/llm/lib/python3.8/site-packages (from deepspeed) (4.65.0)
Requirement already satisfied: typing-extensions>=4.2.0 in ./envs/llm/lib/python3.8/site-packages (from pydantic<2.0.0->deepspeed) (4.5.0)
Requirement already satisfied: filelock in ./envs/llm/lib/python3.8/site-packages (from torch->deepspeed) (3.9.0)
Requirement already satisfied: sympy in ./envs/llm/lib/python3.8/site-packages (from torch->deepspeed) (1.11.1)
Requirement already satisfied: networkx in ./envs/llm/lib/python3.8/site-packages (from torch->deepspeed) (2.8.4)
Requirement already satisfied: jinja2 in ./envs/llm/lib/python3.8/site-packages (from torch->deepspeed) (3.1.2)
Requirement already satisfied: MarkupSafe>=2.0 in ./envs/llm/lib/python3.8/site-packages (from jinja2->torch->deepspeed) (2.1.1)
Requirement already satisfied: mpmath>=0.19 in ./envs/llm/lib/python3.8/site-packages (from sympy->torch->deepspeed) (1.2.1)
Installing collected packages: deepspeed
  DEPRECATION: deepspeed is being installed using the legacy 'setup.py install' method, because the '--no-binary' option was enabled for it and this currently disables local wheel building for projects that don't have a 'pyproject.toml' file. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/11451
  Running setup.py install for deepspeed ... done

This is the output

[2023-09-19 12:35:45,705] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.10.3, git-hash=unknown, git-branch=unknown
[2023-09-19 12:35:45,706] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-09-19 12:35:49,861] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-09-19 12:35:50,056] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000050, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2023-09-19 12:35:50,072] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-09-19 12:35:50,072] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000050, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2023-09-19 12:35:50,088] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-09-19 12:35:50,088] [INFO] [utils.py:54:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-09-19 12:35:50,088] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2023-09-19 12:35:50,088] [INFO] [stage_1_and_2.py:146:__init__] Reduce bucket size 200000000
[2023-09-19 12:35:50,088] [INFO] [stage_1_and_2.py:147:__init__] Allgather bucket size 200000000
[2023-09-19 12:35:50,088] [INFO] [stage_1_and_2.py:148:__init__] CPU Offload: True
[2023-09-19 12:35:50,088] [INFO] [stage_1_and_2.py:149:__init__] Round robin gradient partitioning: False
Rank: 0 partition count [2] and sizes[(1387535360, False)] 
Rank: 1 partition count [2] and sizes[(1387535360, False)] 
[2023-09-19 12:35:56,733] [INFO] [utils.py:803:see_memory_usage] Before initializing optimizer states
[2023-09-19 12:35:56,734] [INFO] [utils.py:804:see_memory_usage] MA 5.54 GB         Max_MA 5.54 GB         CA 6.04 GB         Max_CA 6 GB 
[2023-09-19 12:35:56,734] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory:  used = 49.54 GB, percent = 4.9%
[2023-09-19 12:36:00,145] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 221269
[2023-09-19 12:36:01,063] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 221270

So I guess cpu_adam is correctly installed now, but the subprocess dies anyway.
Output of nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:41:00.0 Off |                    0 |
| N/A   32C    P0              56W / 500W |      2MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:C1:00.0 Off |                    0 |
| N/A   31C    P0              58W / 500W |      2MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

Script to repro:

import argparse 
from transformers import GPTNeoXForCausalLM
import deepspeed

def main(args):
    model = GPTNeoXForCausalLM.from_pretrained(args.model_name, revision=args.revision, cache_dir=args.cache_dir)
    model_engine, optimizer, _, _ = deepspeed.initialize(args=args,
                                                        model=model)
    print("Done initializing")

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', type=int, default=-1,
                    help='local rank passed from distributed launcher')
    parser.add_argument('--user_path', type=str, default='/home/')
    parser.add_argument('--model_size', type=str, default='2.8b')
    parser.add_argument('--revision', type=str, default='step143000', help='what iteration (from pretraining) of the model')
    # Include DeepSpeed configuration arguments
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()
    args.model_name = f"EleutherAI/pythia-{args.model_size}"
    args.cache_dir = f'{args.user_path}/huggingface-models/pythia-{args.model_size}'
    print(args)
    main(args)

Command:

deepspeed deepspeed_test.py --model_size 2.8b --user_path /scratch/gpfs/$USER --deepspeed --deepspeed_config ds_config.json

I tried a few more debugging things but the minimal example I provided above still results in deepspeed killing the subprocess. Let me know if you are able to repro.

Here is an even more minimal example. It works for '1.3b' but kills the subprocess for '2.7b' (yet I can actually train that 2.7b-parameter model, so I would be surprised if DeepSpeed somehow can't initialize the engine for it on 2 GPUs).

import argparse 
from transformers import AutoModelForCausalLM
import deepspeed

def main(args):
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
    model_engine, optimizer, _, _ = deepspeed.initialize(args=args,
                                                        model=model)
    print("Done initializing")

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', type=int, default=-1,
                    help='local rank passed from distributed launcher')
    # Include DeepSpeed configuration arguments
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()
    main(args)

OK, after messing around with some other things and reinstalling (there were some other errors with FusedAdam), I ended up just allocating more CPU memory. It seems that the CPUAdam ops were very memory hungry. Thanks for your help with the installation!

Interesting, thanks for the info @kiddyboots216 - could you share how much memory you needed?

I ended up needing 300G of CPU RAM! Which seemed quite high to me. But 32G didn't work and I kept doubling. If I changed "pin_memory": true -> false under offload_optimizer then I got some other errors due to improperly installed fused adam.
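A back-of-the-envelope estimate helps put these numbers in context. The sketch below assumes the offloaded Adam state costs 12 bytes per parameter (fp32 master weights, momentum, and variance at 4 bytes each), or 16 bytes if fp32 gradients are also kept on the host. Those byte counts are assumptions for illustration rather than DeepSpeed's own accounting, and `cpu_adam_state_bytes` is a hypothetical helper. The partition size is taken from the "partition count" lines in the logs above.

```python
GIB = 2**30


def cpu_adam_state_bytes(num_params: int, bytes_per_param: int = 12) -> int:
    """Estimate host bytes for offloaded Adam state (illustrative only).

    The default of 12 assumes fp32 weights + momentum + variance,
    4 bytes each per parameter.
    """
    return num_params * bytes_per_param


# Two ranks, each holding a partition of 1387535360 elements (from the logs).
total_params = 2 * 1387535360

scenarios = (
    ("fp32 weights + Adam moments", 12),
    ("same, plus fp32 gradients", 16),
)
for label, per_param in scenarios:
    gib = cpu_adam_state_bytes(total_params, per_param) / GIB
    print(f"{label}: {gib:.1f} GiB")
```

Under these assumptions the optimizer state alone comes to roughly 31 to 41 GiB, which is consistent with 32G failing. It does not by itself explain a 300G requirement, so other consumers (for example each rank loading the full model on the host, or pinned buffers) may be contributing.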

I get the following error:
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Traceback (most recent call last):
  File "/root/data/LMFlow/examples/finetune.py", line 62, in <module>
    main()
  File "/root/data/LMFlow/examples/finetune.py", line 58, in main
    tuned_model = finetuner.tune(model=model, dataset=dataset)
  File "/root/data/LMFlow/src/lmflow/pipeline/finetuner.py", line 298, in tune
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/envs/lmflow/lib/python3.9/site-packages/transformers/trainer.py", line 1639, in train
    return inner_training_loop(
  File "/opt/conda/envs/lmflow/lib/python3.9/site-packages/transformers/trainer.py", line 1708, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/opt/conda/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/opt/conda/envs/lmflow/lib/python3.9/site-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 310, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/opt/conda/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1198, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/opt/conda/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1254, in _configure_basic_optimizer
    optimizer = DeepSpeedCPUAdam(model_parameters,
  File "/opt/conda/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
    self.ds_opt_adam = CPUAdamBuilder().load()
  File "/opt/conda/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 454, in load
    return self.jit_load(verbose)
  File "/opt/conda/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load
    op_module = load(name=self.name,
  File "/opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1598, in _write_ninja_file_and_build_library
    get_compiler_abi_compatibility_and_version(compiler)
  File "/opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 337, in get_compiler_abi_compatibility_and_version
    if not check_compiler_ok_for_platform(compiler):
  File "/opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 291, in check_compiler_ok_for_platform
    which = subprocess.check_output(['which', compiler], stderr=subprocess.STDOUT)
  File "/opt/conda/envs/lmflow/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/opt/conda/envs/lmflow/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f719dc2bb80>
Traceback (most recent call last):
  File "/opt/conda/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'

How can I solve it?
the env: torch2.0.1+cuda11.7
the system cuda is 12.0
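The traceback above ends with `which c++` returning a non-zero exit status, which suggests no C++ compiler is visible on PATH. A quick sketch for checking the build tools the JIT compile step appears to rely on is shown below; the tool list and the helper name `find_build_tools` are assumptions for illustration.

```python
import shutil
from typing import Callable, Dict, Iterable, Optional


def find_build_tools(
    names: Iterable[str] = ("c++", "g++", "gcc", "ninja"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, Optional[str]]:
    """Map each tool name to its location on PATH, or None if missing."""
    return {name: which(name) for name in names}


for tool, path in find_build_tools().items():
    print(f"{tool:6s} -> {path or 'NOT FOUND'}")
```

If `c++` shows as NOT FOUND, installing a compiler toolchain is the likely fix. On Debian or Ubuntu that is normally the `build-essential` package, though the poster's distribution is not stated, so treat the package name as an assumption.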

@wuhongyan123 - I'm not sure that is the full error, but it looks like something is wrong with your setup, since these errors aren't related to python/DeepSpeed, I'd check that the gcc/g++ compiler is properly installed on your system:

    which = subprocess.check_output(['which', compiler], stderr=subprocess.STDOUT)
  File "/opt/conda/envs/lmflow/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/opt/conda/envs/lmflow/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f719dc2bb80>
Traceback (most recent call last):