dandelionsllm/pandallm

OOM when training a 7B model on a single A100

HuihuiChyan opened this issue · 12 comments

I am trying to instruction-tune the 7B model following llama_7b_zh_instruct_coig_sft_v1_0_ds.yaml, with batch_size set to 1, sequence_length set to 512, and everything else unchanged, but I run out of GPU memory. Are there any possible solutions?

First, try CPU offload under a ZeRO stage-1 setup:

zero_optimization:
    stage: 1
    contiguous_gradients: True
    overlap_comm: True
    reduce_scatter: True
    reduce_bucket_size: 5e8
    allgather_bucket_size: 5e8
    offload_optimizer:
      device: cpu
      pin_memory: True

If that still doesn't work, change the stage to 3 and also enable offload_param:

zero_optimization:
    stage: 3
    contiguous_gradients: True
    overlap_comm: True
    reduce_scatter: True
    reduce_bucket_size: 5e8
    allgather_bucket_size: 5e8
    offload_optimizer:
      device: cpu
      pin_memory: True
    offload_param:
      device: cpu
      pin_memory: True

Note that training may become very slow in this mode, depending on the GPU-CPU transfer bandwidth of your machine.
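
As a rough way to gauge that bandwidth, here is a minimal PyTorch sketch (the 1 GiB buffer size and the helper name are just for illustration) that times a single pinned host-to-device copy:

# Sketch: time one pinned host->device copy to estimate offload bandwidth.
# On plain PCIe 3.0 x16 expect roughly 10-13 GB/s; the lower this number,
# the slower each offloaded optimizer step will be.
import time
import torch

def h2d_bandwidth_gb_s(n_bytes: int = 1 << 30) -> float:
    src = torch.empty(n_bytes, dtype=torch.uint8, pin_memory=True)  # pinned, as with pin_memory: True
    dst = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    dst.copy_(src, non_blocking=True)
    torch.cuda.synchronize()
    return n_bytes / (time.perf_counter() - start) / 1e9

print(f"host->device: {h2d_bandwidth_gb_s():.1f} GB/s")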

I enabled both offloads and it still fails... Could this be related to running on a single GPU?

How much GPU memory do you have? 40 GB? Could you post your config and the training launch command?

My GPU is an 80 GB A100. The config is as follows:

hydra:
  run:
    dir: ./

# Wiki path pretrain v8.2
model_name_or_path: /mnt/bn/slp-llm/sft_huihui/pandallm/llama-panda-zh-7b
pretrain:

aws_output_bucket:
data_dir:

train_file: /opt/ml/input/data/train/coig.json
dev_file:
test_file:

# Model
model:
  _target_: models.llama.LlamaForConditionalGeneration.from_pretrained
  vocab_size: 32001
  pad_token_id: 32000
  use_peft: False
  gradient_checkpointing: True

# model_eval:
#   _target_: models.llama.LlamaForConditionalGenerationFlan.from_pretrained_peft_eval
#   base_model_name_or_path: ${model_name_or_path}


# Data loading
read_tensor:
  _target_: data.collators.zh_instruct.TextDatasetUnify


extended_vocab:

# Data collator
collator:
  _target_: data.collators.flan.FlanCollatorOverCollator
  collator:
  max_seq_length: 512
  tokenizer: ${model_name_or_path}
  decoder_only: True

# Dataloader
num_workers: 4
prefetch_factor: 2

do_preprocess: False

exp_name: llama.7b.zh_instruct.10M.coig.sft.v1.0.seq1024.w8.adamw.NA100.0428.ds
exp_notes:
output_dir: ./${exp_name}
resume:

do_train: True
evaluate_during_training: False

do_eval: False
eval_sub_path: checkpoint-*

# Training hyper-parameters
per_gpu_train_batch_size: 1
per_gpu_eval_batch_size: 1
learning_rate: 3e-5
gradient_accumulation_steps: 2
weight_decay: 0.00
adam_epsilon: 1e-6
adam_betas: "(0.9, 0.99)"
max_grad_norm: 5.0
num_train_epochs: 5
total_dataset_len: -1
max_steps: 0
warmup_proportion: 0.01
warmup_steps: 0

# Optimizer
optimizer:
use_nvlamb:
bit_training:


logging_steps: 1
save_best: False
save_steps: 250
eval_steps: 250
ddp_eval: True
no_cuda: False
seed: 42
local_rank: -1
fp16: True
fp16_opt_level: O1
fp16_bfloat16: True

# Prediction config
prediction_cfg:
  metric: "acc"
  measure: 1
  best_checkpoint:
  best_result:
eval_forward_fn:
  _target_: general_util.evaluator.DiscriminatorForwardFn
post_process:


# fairscale.FullyShardedDP
fairscale_config:
  _target_: general_util.fsdp_utils.default_initialize
  # _target_: general_util.fsdp_utils.recursive_initialize
  # _target_: general_util.fsdp_utils.default_initialize_v2
  # _target_: general_util.torch_fsdp_utils.torch_fsdp_transformer_init
  # _target_: general_util.torch_fsdp_utils.torch_fsdp_auto_wrap
  fp16: ${fp16}
  move_grads_to_cpu: False
  move_params_to_cpu: False
  flatten_parameters: False
  # fp16_bfloat16: ${fp16_bfloat16}
  # cpu_offload: True
  # disable_reshard_on_root: False


# Lightseq config
with_lightseq: False

# Deepspeed config
ds_cfg:
  train_micro_batch_size_per_gpu: ${per_gpu_train_batch_size}
  gradient_accumulation_steps: ${gradient_accumulation_steps}
  optimizer:
    type: AdamW
    params:
      lr: ${learning_rate}
      betas: [0.9, 0.99]
      eps: ${adam_epsilon}
      weight_decay: ${weight_decay}
  scheduler:
    type: WarmupDecayLR
    params:
      total_num_steps:
      warmup_max_lr: ${learning_rate}
      warmup_num_steps:
      warmup_type: linear
  gradient_clipping: ${max_grad_norm}
  # fp16:
  #   enabled: ${fp16}
  #   initial_scale_power: 12
  bf16:
    enabled: ${fp16}
  # autotuning:
  #   enabled: true
  #   arg_mappings:
  #     train_micro_batch_size_per_gpu: "per_gpu_train_batch_size"
  #     gradient_accumulation_steps: "gradient_accumulation_steps"
  #     zero_optimization: "ds_cfg.zero_optimization"
  zero_optimization:
    stage: 3
    contiguous_gradients: True
    overlap_comm: True
    reduce_scatter: True
    reduce_bucket_size: 5e8
    allgather_bucket_size: 5e8
  offload_optimizer:
    device: cpu
    pin_memory: True
  offload_param:
    device: cpu
    pin_memory: True
  # activation_checkpointing:
  #   partition_activations: True
  #   cpu_checkpointing: True
  #   contiguous_memory_optimization: False
  #   number_checkpoints: False
  #   synchronize_checkpoint_boundary: False
  #   profile: False
  steps_per_print: 1024


summary_helper:
#  _target_: general_util.tensorboard_helper.SummaryWriterHelper
  _target_: general_util.tensorboard_helper.WandbWriter
  batch_index_or_keys:
#    "train/pair_value_num": pair_value_num
#    "train/pair_label_num": pair_label_num
#    "train/dropped_op_cnt": dropped_op_cnt
#    "train/invalid_path": invalid_path
  outputs_index_or_keys:
#    "train/mlm_loss": mlm_loss
#    "train/cls_loss": cls_loss
#    "train/tagging_loss": tagging_loss
#    "train/path_gen_loss": path_gen_loss

# Temporary variables
n_gpu:
device:
train_batch_size:
eval_batch_size:
world_size:
world_rank:

The training launch command is as follows:

export HYDRA_FULL_ERROR=1
export WANDB_MODE=dryrun
export WANDB_SILENT=true
deepspeed --include localhost:0 \
    trainer_base_ds_mul.py \
    -cp conf/llama/zh \
    -cn llama_7b_zh_instruct_coig_sft_v1_0_ds.yaml

I really don't see anything wrong with this config. Could you check that GPU 0 on your server has sufficient free memory and that no one else is using it? In principle, ZeRO-3 plus CPU offload should be enough to fine-tune within 40 GB.
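
For reference, DeepSpeed ships a ZeRO-3 memory estimator you can run offline to check that claim. A minimal sketch, assuming a transformers version with LLaMA support (it loads the checkpoint with plain AutoModelForCausalLM rather than the repo's models.llama wrapper):

# Sketch: estimate ZeRO-3 model-state memory for the 7B checkpoint.
# Adam in fp32 needs roughly 12 bytes per parameter (4 master weights +
# 8 momentum/variance), i.e. ~84 GB for 7B params -- which is why those
# states must be offloaded to host RAM rather than kept on one 80 GB card.
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForCausalLM.from_pretrained(
    "/mnt/bn/slp-llm/sft_huihui/pandallm/llama-panda-zh-7b")
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)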

Also, if convenient, please post your error message.

Sure, the training log is as follows:

[2023-05-29 19:48:32,312] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-29 19:48:32,326] [INFO] [runner.py:550:main] cmd = /root/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None trainer_base_ds_mul.py -cp conf/llama/zh -cn llama_7b_zh_instruct_coig_sft_v1_0_ds.yaml
[2023-05-29 19:48:33,790] [INFO] [launch.py:135:main] 0 LAB_PYTORCH_NCCL_SCM_VERSION=1.0.0.1
[2023-05-29 19:48:33,790] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-29 19:48:33,790] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-29 19:48:33,790] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-29 19:48:33,790] [INFO] [launch.py:162:main] dist_world_size=1
[2023-05-29 19:48:33,790] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
['trainer_base_ds_mul.py', 'local_rank=0', '-cp', 'conf/llama/zh', '-cn', 'llama_7b_zh_instruct_coig_sft_v1_0_ds.yaml']
[2023-05-29 19:48:36,856] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-05-29 19:48:36,856][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2023-05-29 19:48:36,857][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
[2023-05-29 19:48:36,859][FK][WARNING] - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
[2023-05-29 19:48:36,859][FK][WARNING] - CPU cores: 128
[2023-05-29 19:51:28,268][FK.general_util.tokenization_utils][INFO] - LlamaTokenizerFast(name_or_path='/mnt/bn/slp-llm/sft_huihui/pandallm/llama-panda-zh-7b', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '[PAD]'}, clean_up_tokenization_spaces=False)
[2023-05-29 19:51:28,269][FK.general_util.tokenization_utils][INFO] - PAD TOKEN ID = 32000
[2023-05-29 19:52:11,607][FK.models.llama][INFO] - gradient_checkpointing: True
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████| 3/3 [00:21<00:00,  7.25s/it]
[2023-05-29 19:52:34,217][FK.models.llama][INFO] - Config pad token id after loading pre-trained weights: 32000
[2023-05-29 19:52:40,317][FK.TensorboardHelper][INFO] - Logs details:
[2023-05-29 19:52:40,317][FK.TensorboardHelper][INFO] - None
[2023-05-29 19:52:40,317][FK.TensorboardHelper][INFO] - None
[2023-05-29 19:52:40,317][FK][INFO] - []
0it [00:00, ?it/s]
[2023-05-29 19:52:40,466] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-05-29 19:52:53,269][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:2 to store for rank: 0
[2023-05-29 19:52:53,272][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
[2023-05-29 19:52:53,338] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.1588904857635498 seconds
[2023-05-29 19:52:53,863] [INFO] [logging.py:93:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-05-29 19:52:53,875] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2023-05-29 19:52:53,875] [INFO] [utils.py:55:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2023-05-29 19:52:53,875] [INFO] [logging.py:93:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2023-05-29 19:52:53,979] [INFO] [utils.py:829:see_memory_usage] Stage 3 initialize beginning
[2023-05-29 19:52:53,980] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB         Max_MA 12.58 GB         CA 12.59 GB         Max_CA 13 GB 
[2023-05-29 19:52:53,980] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 79.61 GB, percent = 4.0%
[2023-05-29 19:52:53,981] [INFO] [stage3.py:113:__init__] Reduce bucket size 500000000
[2023-05-29 19:52:53,982] [INFO] [stage3.py:114:__init__] Prefetch bucket size 50,000,000
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.14214134216308594 seconds
[2023-05-29 19:52:54,227] [INFO] [utils.py:829:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-05-29 19:52:54,228] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB         Max_MA 12.58 GB         CA 12.59 GB         Max_CA 13 GB 
[2023-05-29 19:52:54,228] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 79.88 GB, percent = 4.0%
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2023-05-29 19:52:54,373] [INFO] [utils.py:829:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-05-29 19:52:54,374] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB         Max_MA 12.83 GB         CA 13.08 GB         Max_CA 13 GB 
[2023-05-29 19:52:54,374] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 80.02 GB, percent = 4.0%
[2023-05-29 19:52:54,471] [INFO] [utils.py:829:see_memory_usage] Before creating fp16 partitions
[2023-05-29 19:52:54,472] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB         Max_MA 12.58 GB         CA 13.08 GB         Max_CA 13 GB 
[2023-05-29 19:52:54,472] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 80.1 GB, percent = 4.0%
[2023-05-29 19:53:06,285] [INFO] [utils.py:829:see_memory_usage] After creating fp16 partitions: 7
[2023-05-29 19:53:06,286] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB         Max_MA 12.58 GB         CA 12.59 GB         Max_CA 13 GB 
[2023-05-29 19:53:06,287] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 81.14 GB, percent = 4.0%
[2023-05-29 19:53:06,386] [INFO] [utils.py:829:see_memory_usage] Before creating fp32 partitions
[2023-05-29 19:53:06,387] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB         Max_MA 12.58 GB         CA 12.59 GB         Max_CA 13 GB 
[2023-05-29 19:53:06,387] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 81.15 GB, percent = 4.0%
[2023-05-29 19:53:06,539] [INFO] [utils.py:829:see_memory_usage] After creating fp32 partitions
[2023-05-29 19:53:06,540] [INFO] [utils.py:830:see_memory_usage] MA 37.69 GB         Max_MA 38.94 GB         CA 41.47 GB         Max_CA 41 GB 
[2023-05-29 19:53:06,540] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 81.16 GB, percent = 4.0%
[2023-05-29 19:53:06,641] [INFO] [utils.py:829:see_memory_usage] Before initializing optimizer states
[2023-05-29 19:53:06,642] [INFO] [utils.py:830:see_memory_usage] MA 37.69 GB         Max_MA 37.69 GB         CA 41.47 GB         Max_CA 41 GB 
[2023-05-29 19:53:06,644] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 81.17 GB, percent = 4.0%
Error executing job with overrides: ['local_rank=0']
Traceback (most recent call last):
  File "/mnt/bn/slp-llm/sft_huihui/pandallm/trainer_base_ds_mul.py", line 437, in <module>
    main()
  File "/root/miniconda3/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/root/miniconda3/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/root/miniconda3/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/root/miniconda3/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/mnt/bn/slp-llm/sft_huihui/pandallm/trainer_base_ds_mul.py", line 367, in main
    global_step, tr_loss = train(cfg, model, tokenizer, continue_from_global_step)
  File "/mnt/bn/slp-llm/sft_huihui/pandallm/trainer_base_ds_mul.py", line 168, in train
    model, optimizer, _, scheduler = deepspeed.initialize(model=model,
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/__init__.py", line 125, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 340, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1298, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1599, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 312, in __init__
    self._setup_for_real_optimizer()
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 371, in _setup_for_real_optimizer
    self.initialize_optimizer_states()
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 938, in initialize_optimizer_states
    self._optimizer_step(i)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 858, in _optimizer_step
    self.optimizer.step()
  File "/root/miniconda3/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 137, in step
    state['exp_avg_sq'] = torch.zeros_like(p.data)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.77 GiB (GPU 0; 79.35 GiB total capacity; 75.35 GiB already allocated; 2.95 GiB free; 75.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-05-29 19:53:11,094] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 2250
[2023-05-29 19:53:11,094] [ERROR] [launch.py:324:sigkill_handler] ['/root/miniconda3/bin/python', '-u', 'trainer_base_ds_mul.py', '--local_rank=0', '-cp', 'conf/llama/zh', '-cn', 'llama_7b_zh_instruct_coig_sft_v1_0_ds.yaml'] exits with return code = 1

It looks like GPU 0 can't even hold the model weights, and training hasn't even started; that shouldn't be possible. So could you first run nvidia-smi to confirm that GPU 0 actually has enough free memory?
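
The same check can also be done from Python right before deepspeed.initialize runs; a minimal sketch:

# Sketch: query free/total memory on GPU 0. Memory held by other
# processes shows up as a reduced "free" figure here, same as nvidia-smi.
import torch

free_b, total_b = torch.cuda.mem_get_info(0)  # (free, total) in bytes
print(f"GPU 0: {free_b / 2**30:.1f} GiB free / {total_b / 2**30:.1f} GiB total")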

Sure, here is the info for my GPU 0:
[screenshot: nvidia-smi output for GPU 0]

Could you try again now? The error log you posted is already from an hour ago.

Thank you for the solutions. While I have you, there is something I've been curious about: why are most open-source models based on 7B or 13B, while 30B and 60B models are rare? Is there some kind of gap between 13B and 30B?

A 30B model is hard to fit into 80 GB of GPU memory. It needs further changes, such as model parallelism or tensor parallelism, but these are not out-of-the-box: you have to write code tailored to the specific model architecture, which is fairly difficult, and hobbyists usually won't put in the effort to learn it.
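
To make "not out-of-the-box" concrete, below is a toy sketch of a column-parallel linear layer, the basic building block of tensor parallelism. It assumes two visible GPUs (cuda:0 and cuda:1) and skips the communication collectives that a real implementation such as Megatron-LM must wire into every attention and MLP block of the specific architecture:

# Toy column-parallel linear: each device holds half of the output
# columns; the full output is the concatenation of the partial results.
import torch
import torch.nn.functional as F

def column_parallel_linear(x, weight, devices=("cuda:0", "cuda:1")):
    shards = torch.chunk(weight, len(devices), dim=0)  # split along out_features
    parts = [F.linear(x.to(d), w.to(d)) for d, w in zip(devices, shards)]
    return torch.cat([p.to(devices[0]) for p in parts], dim=-1)

x = torch.randn(2, 4096)
w = torch.randn(8192, 4096)  # (out_features, in_features)
out = column_parallel_linear(x, w)
assert torch.allclose(out.cpu(), F.linear(x, w), atol=1e-2)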