OOM when training a 7B model on a single A100
HuihuiChyan opened this issue · 12 comments
I'm trying to instruct-tune a 7B model following llama_7b_zh_instruct_coig_sft_v1_0_ds.yaml, with batch_size set to 1 and sequence_length set to 512, and everything else left unchanged, but I run out of GPU memory. Are there any possible solutions?
First, try CPU offload with the ZeRO-1 setting:
zero_optimization:
  stage: 1
  contiguous_gradients: True
  overlap_comm: True
  reduce_scatter: True
  reduce_bucket_size: 5e8
  allgather_bucket_size: 5e8
  offload_optimizer:
    device: cpu
    pin_memory: True
If that still doesn't work, change stage to 3 and enable offload_param:
zero_optimization:
  stage: 3
  contiguous_gradients: True
  overlap_comm: True
  reduce_scatter: True
  reduce_bucket_size: 5e8
  allgather_bucket_size: 5e8
  offload_optimizer:
    device: cpu
    pin_memory: True
  offload_param:
    device: cpu
    pin_memory: True
Note that training may become very slow in this case, depending on how fast the GPU and CPU on your machine can communicate.
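For reference, this zero_optimization block is just the corresponding section of the DeepSpeed config that the trainer hands to deepspeed.initialize. Below is a minimal standalone sketch (illustrative only, with a placeholder model; in this repo the equivalent dict is built from the ds_cfg section of the Hydra config):

import torch
import deepspeed

# Illustrative DeepSpeed config: ZeRO-3 with optimizer and parameter offload to CPU.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 2,
    "bf16": {"enabled": True},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 3e-5, "betas": [0.9, 0.99], "eps": 1e-6, "weight_decay": 0.0},
    },
    "zero_optimization": {
        "stage": 3,
        "contiguous_gradients": True,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 5e8,
        "allgather_bucket_size": 5e8,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

# Placeholder model so the snippet is self-contained; in practice this is the LLaMA model
# returned by models.llama.LlamaForConditionalGeneration.from_pretrained.
model = torch.nn.Linear(4096, 4096)

# Run this through the deepspeed launcher (e.g. deepspeed --include localhost:0 script.py)
# so the distributed environment is set up before initialize is called.
engine, optimizer, _, scheduler = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)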
I enabled both offloads and it still doesn't work... Could this be related to running on a single GPU?
How much GPU memory do you have? 40GB? Could you paste your config and the training launch command?
My GPU is an 80GB A100. The config is as follows:
hydra:
  run:
    dir: ./
# Wiki path pretrain v8.2
model_name_or_path: /mnt/bn/slp-llm/sft_huihui/pandallm/llama-panda-zh-7b
pretrain:
aws_output_bucket:
data_dir:
train_file: /opt/ml/input/data/train/coig.json
dev_file:
test_file:
# Model
model:
  _target_: models.llama.LlamaForConditionalGeneration.from_pretrained
  vocab_size: 32001
  pad_token_id: 32000
  use_peft: False
  gradient_checkpointing: True
# model_eval:
#   _target_: models.llama.LlamaForConditionalGenerationFlan.from_pretrained_peft_eval
#   base_model_name_or_path: ${model_name_or_path}
# Data loading
read_tensor:
  _target_: data.collators.zh_instruct.TextDatasetUnify
extended_vocab:
# Data collator
collator:
  _target_: data.collators.flan.FlanCollatorOverCollator
  collator:
  max_seq_length: 512
  tokenizer: ${model_name_or_path}
  decoder_only: True
# Dataloader
num_workers: 4
prefetch_factor: 2
do_preprocess: False
exp_name: llama.7b.zh_instruct.10M.coig.sft.v1.0.seq1024.w8.adamw.NA100.0428.ds
exp_notes:
output_dir: ./${exp_name}
resume:
do_train: True
evaluate_during_training: False
do_eval: False
eval_sub_path: checkpoint-*
# Training hyper-parameters
per_gpu_train_batch_size: 1
per_gpu_eval_batch_size: 1
learning_rate: 3e-5
gradient_accumulation_steps: 2
weight_decay: 0.00
adam_epsilon: 1e-6
adam_betas: "(0.9, 0.99)"
max_grad_norm: 5.0
num_train_epochs: 5
total_dataset_len: -1
max_steps: 0
warmup_proportion: 0.01
warmup_steps: 0
# Optimizer
optimizer:
use_nvlamb:
bit_training:
logging_steps: 1
save_best: False
save_steps: 250
eval_steps: 250
ddp_eval: True
no_cuda: False
seed: 42
local_rank: -1
fp16: True
fp16_opt_level: O1
fp16_bfloat16: True
# Prediction config
prediction_cfg:
  metric: "acc"
  measure: 1
  best_checkpoint:
  best_result:
eval_forward_fn:
  _target_: general_util.evaluator.DiscriminatorForwardFn
post_process:
# fairscale.FullyShardedDP
fairscale_config:
  _target_: general_util.fsdp_utils.default_initialize
  # _target_: general_util.fsdp_utils.recursive_initialize
  # _target_: general_util.fsdp_utils.default_initialize_v2
  # _target_: general_util.torch_fsdp_utils.torch_fsdp_transformer_init
  # _target_: general_util.torch_fsdp_utils.torch_fsdp_auto_wrap
  fp16: ${fp16}
  move_grads_to_cpu: False
  move_params_to_cpu: False
  flatten_parameters: False
  # fp16_bfloat16: ${fp16_bfloat16}
  # cpu_offload: True
  # disable_reshard_on_root: False
# Lightseq config
with_lightseq: False
# Deepspeed config
ds_cfg:
  train_micro_batch_size_per_gpu: ${per_gpu_train_batch_size}
  gradient_accumulation_steps: ${gradient_accumulation_steps}
  optimizer:
    type: AdamW
    params:
      lr: ${learning_rate}
      betas: [0.9, 0.99]
      eps: ${adam_epsilon}
      weight_decay: ${weight_decay}
  scheduler:
    type: WarmupDecayLR
    params:
      total_num_steps:
      warmup_max_lr: ${learning_rate}
      warmup_num_steps:
      warmup_type: linear
  gradient_clipping: ${max_grad_norm}
  # fp16:
  #   enabled: ${fp16}
  #   initial_scale_power: 12
  bf16:
    enabled: ${fp16}
  # autotuning:
  #   enabled: true
  #   arg_mappings:
  #     train_micro_batch_size_per_gpu: "per_gpu_train_batch_size"
  #     gradient_accumulation_steps: "gradient_accumulation_steps"
  #     zero_optimization: "ds_cfg.zero_optimization"
  zero_optimization:
    stage: 3
    contiguous_gradients: True
    overlap_comm: True
    reduce_scatter: True
    reduce_bucket_size: 5e8
    allgather_bucket_size: 5e8
    offload_optimizer:
      device: cpu
      pin_memory: True
    offload_param:
      device: cpu
      pin_memory: True
  # activation_checkpointing:
  #   partition_activations: True
  #   cpu_checkpointing: True
  #   contiguous_memory_optimization: False
  #   number_checkpoints: False
  #   synchronize_checkpoint_boundary: False
  #   profile: False
  steps_per_print: 1024
summary_helper:
  # _target_: general_util.tensorboard_helper.SummaryWriterHelper
  _target_: general_util.tensorboard_helper.WandbWriter
  batch_index_or_keys:
    # "train/pair_value_num": pair_value_num
    # "train/pair_label_num": pair_label_num
    # "train/dropped_op_cnt": dropped_op_cnt
    # "train/invalid_path": invalid_path
  outputs_index_or_keys:
    # "train/mlm_loss": mlm_loss
    # "train/cls_loss": cls_loss
    # "train/tagging_loss": tagging_loss
    # "train/path_gen_loss": path_gen_loss
# Temporary variables
n_gpu:
device:
train_batch_size:
eval_batch_size:
world_size:
world_rank:
The training launch command is as follows:
export HYDRA_FULL_ERROR=1
export WANDB_MODE=dryrun
export WANDB_SILENT=true
deepspeed --include localhost:0 \
trainer_base_ds_mul.py \
-cp conf/llama/zh \
-cn llama_7b_zh_instruct_coig_sft_v1_0_ds.yaml
I don't see anything wrong with this config. Could you check that GPU 0 on your server actually has enough free memory and that no one else is using it? In principle, with ZeRO-3 + CPU offload, 40GB of GPU memory should be enough for fine-tuning.
Also, if convenient, please post your error message.
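For reference, a rough back-of-the-envelope estimate behind the 40GB figure (an illustrative sketch, assuming bf16 weights/gradients and Adam states fully offloaded to CPU; these numbers are not from the repo):

# Approximate memory footprint of 7B SFT under ZeRO-3 + CPU offload.
n_params = 7e9
GiB = 1024 ** 3

bf16_weights = n_params * 2 / GiB            # ~13 GiB; kept in CPU RAM when offload_param is enabled
bf16_grads   = n_params * 2 / GiB            # ~13 GiB; reduced in buckets
adam_fp32    = n_params * (4 + 4 + 4) / GiB  # ~78 GiB fp32 master weights + two moments; in CPU RAM with offload_optimizer

print(f"bf16 weights: {bf16_weights:.1f} GiB")
print(f"bf16 grads:   {bf16_grads:.1f} GiB")
print(f"Adam states:  {adam_fp32:.1f} GiB")
# With these offloaded, the GPU mainly holds activations (small at seq_len 512, batch 1,
# with gradient checkpointing) plus temporary gather/reduce buffers, hence well under 40 GiB.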
OK, here is the training log:
[2023-05-29 19:48:32,312] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-29 19:48:32,326] [INFO] [runner.py:550:main] cmd = /root/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None trainer_base_ds_mul.py -cp conf/llama/zh -cn llama_7b_zh_instruct_coig_sft_v1_0_ds.yaml
[2023-05-29 19:48:33,790] [INFO] [launch.py:135:main] 0 LAB_PYTORCH_NCCL_SCM_VERSION=1.0.0.1
[2023-05-29 19:48:33,790] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-29 19:48:33,790] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-29 19:48:33,790] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-29 19:48:33,790] [INFO] [launch.py:162:main] dist_world_size=1
[2023-05-29 19:48:33,790] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
['trainer_base_ds_mul.py', 'local_rank=0', '-cp', 'conf/llama/zh', '-cn', 'llama_7b_zh_instruct_coig_sft_v1_0_ds.yaml']
[2023-05-29 19:48:36,856] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-05-29 19:48:36,856][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2023-05-29 19:48:36,857][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
[2023-05-29 19:48:36,859][FK][WARNING] - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
[2023-05-29 19:48:36,859][FK][WARNING] - CPU cores: 128
[2023-05-29 19:51:28,268][FK.general_util.tokenization_utils][INFO] - LlamaTokenizerFast(name_or_path='/mnt/bn/slp-llm/sft_huihui/pandallm/llama-panda-zh-7b', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '[PAD]'}, clean_up_tokenization_spaces=False)
[2023-05-29 19:51:28,269][FK.general_util.tokenization_utils][INFO] - PAD TOKEN ID = 32000
[2023-05-29 19:52:11,607][FK.models.llama][INFO] - gradient_checkpointing: True
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████| 3/3 [00:21<00:00, 7.25s/it]
[2023-05-29 19:52:34,217][FK.models.llama][INFO] - Config pad token id after loading pre-trained weights: 32000
[2023-05-29 19:52:40,317][FK.TensorboardHelper][INFO] - Logs details:
[2023-05-29 19:52:40,317][FK.TensorboardHelper][INFO] - None
[2023-05-29 19:52:40,317][FK.TensorboardHelper][INFO] - None
[2023-05-29 19:52:40,317][FK][INFO] - []
0it [00:00, ?it/s]
[2023-05-29 19:52:40,466] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-05-29 19:52:53,269][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:2 to store for rank: 0
[2023-05-29 19:52:53,272][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
[2023-05-29 19:52:53,338] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.1588904857635498 seconds
[2023-05-29 19:52:53,863] [INFO] [logging.py:93:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-05-29 19:52:53,875] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2023-05-29 19:52:53,875] [INFO] [utils.py:55:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2023-05-29 19:52:53,875] [INFO] [logging.py:93:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2023-05-29 19:52:53,979] [INFO] [utils.py:829:see_memory_usage] Stage 3 initialize beginning
[2023-05-29 19:52:53,980] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB Max_MA 12.58 GB CA 12.59 GB Max_CA 13 GB
[2023-05-29 19:52:53,980] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 79.61 GB, percent = 4.0%
[2023-05-29 19:52:53,981] [INFO] [stage3.py:113:__init__] Reduce bucket size 500000000
[2023-05-29 19:52:53,982] [INFO] [stage3.py:114:__init__] Prefetch bucket size 50,000,000
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.14214134216308594 seconds
[2023-05-29 19:52:54,227] [INFO] [utils.py:829:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-05-29 19:52:54,228] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB Max_MA 12.58 GB CA 12.59 GB Max_CA 13 GB
[2023-05-29 19:52:54,228] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 79.88 GB, percent = 4.0%
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2023-05-29 19:52:54,373] [INFO] [utils.py:829:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-05-29 19:52:54,374] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB Max_MA 12.83 GB CA 13.08 GB Max_CA 13 GB
[2023-05-29 19:52:54,374] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 80.02 GB, percent = 4.0%
[2023-05-29 19:52:54,471] [INFO] [utils.py:829:see_memory_usage] Before creating fp16 partitions
[2023-05-29 19:52:54,472] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB Max_MA 12.58 GB CA 13.08 GB Max_CA 13 GB
[2023-05-29 19:52:54,472] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 80.1 GB, percent = 4.0%
[2023-05-29 19:53:06,285] [INFO] [utils.py:829:see_memory_usage] After creating fp16 partitions: 7
[2023-05-29 19:53:06,286] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB Max_MA 12.58 GB CA 12.59 GB Max_CA 13 GB
[2023-05-29 19:53:06,287] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 81.14 GB, percent = 4.0%
[2023-05-29 19:53:06,386] [INFO] [utils.py:829:see_memory_usage] Before creating fp32 partitions
[2023-05-29 19:53:06,387] [INFO] [utils.py:830:see_memory_usage] MA 12.58 GB Max_MA 12.58 GB CA 12.59 GB Max_CA 13 GB
[2023-05-29 19:53:06,387] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 81.15 GB, percent = 4.0%
[2023-05-29 19:53:06,539] [INFO] [utils.py:829:see_memory_usage] After creating fp32 partitions
[2023-05-29 19:53:06,540] [INFO] [utils.py:830:see_memory_usage] MA 37.69 GB Max_MA 38.94 GB CA 41.47 GB Max_CA 41 GB
[2023-05-29 19:53:06,540] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 81.16 GB, percent = 4.0%
[2023-05-29 19:53:06,641] [INFO] [utils.py:829:see_memory_usage] Before initializing optimizer states
[2023-05-29 19:53:06,642] [INFO] [utils.py:830:see_memory_usage] MA 37.69 GB Max_MA 37.69 GB CA 41.47 GB Max_CA 41 GB
[2023-05-29 19:53:06,644] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 81.17 GB, percent = 4.0%
Error executing job with overrides: ['local_rank=0']
Traceback (most recent call last):
File "/mnt/bn/slp-llm/sft_huihui/pandallm/trainer_base_ds_mul.py", line 437, in <module>
main()
File "/root/miniconda3/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/root/miniconda3/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/root/miniconda3/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/root/miniconda3/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/mnt/bn/slp-llm/sft_huihui/pandallm/trainer_base_ds_mul.py", line 367, in main
global_step, tr_loss = train(cfg, model, tokenizer, continue_from_global_step)
File "/mnt/bn/slp-llm/sft_huihui/pandallm/trainer_base_ds_mul.py", line 168, in train
model, optimizer, _, scheduler = deepspeed.initialize(model=model,
File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/__init__.py", line 125, in initialize
engine = DeepSpeedEngine(args=args,
File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 340, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1298, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1599, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer_Stage3(
File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 312, in __init__
self._setup_for_real_optimizer()
File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 371, in _setup_for_real_optimizer
self.initialize_optimizer_states()
File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 938, in initialize_optimizer_states
self._optimizer_step(i)
File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 858, in _optimizer_step
self.optimizer.step()
File "/root/miniconda3/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
out = func(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 137, in step
state['exp_avg_sq'] = torch.zeros_like(p.data)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.77 GiB (GPU 0; 79.35 GiB total capacity; 75.35 GiB already allocated; 2.95 GiB free; 75.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /mnt/bn/slp-llm/sft_huihui/pandallm/trainer_base_ds_mul.py:437 in <module> │
│ │
│ 434 │ │ │ hydra_formatted_args.append(arg) │
│ 435 │ sys.argv = hydra_formatted_args │
│ 436 │ print(sys.argv) │
│ ❱ 437 │ main() │
│ 438 │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/hydra/main.py:94 in decorated_main │
│ │
│ 91 │ │ │ │ else: │
│ 92 │ │ │ │ │ # no return value from run_hydra() as it may sometime actually run t │
│ 93 │ │ │ │ │ # multiple times (--multirun) │
│ ❱ 94 │ │ │ │ │ _run_hydra( │
│ 95 │ │ │ │ │ │ args=args, │
│ 96 │ │ │ │ │ │ args_parser=args_parser, │
│ 97 │ │ │ │ │ │ task_function=task_function, │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py:394 in _run_hydra │
│ │
│ 391 │ │ │
│ 392 │ │ if args.run or args.multirun: │
│ 393 │ │ │ run_mode = hydra.get_mode(config_name=config_name, overrides=overrides) │
│ ❱ 394 │ │ │ _run_app( │
│ 395 │ │ │ │ run=args.run, │
│ 396 │ │ │ │ multirun=args.multirun, │
│ 397 │ │ │ │ mode=run_mode, │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py:457 in _run_app │
│ │
│ 454 │ │ │ overrides.extend(["hydra.mode=MULTIRUN"]) │
│ 455 │ │
│ 456 │ if mode == RunMode.RUN: │
│ ❱ 457 │ │ run_and_report( │
│ 458 │ │ │ lambda: hydra.run( │
│ 459 │ │ │ │ config_name=config_name, │
│ 460 │ │ │ │ task_function=task_function, │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py:223 in run_and_report │
│ │
│ 220 │ │ return func() │
│ 221 │ except Exception as ex: │
│ 222 │ │ if _is_env_set("HYDRA_FULL_ERROR") or is_under_debugger(): │
│ ❱ 223 │ │ │ raise ex │
│ 224 │ │ else: │
│ 225 │ │ │ try: │
│ 226 │ │ │ │ if isinstance(ex, CompactHydraException): │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py:220 in run_and_report │
│ │
│ 217 │
│ 218 def run_and_report(func: Any) -> Any: │
│ 219 │ try: │
│ ❱ 220 │ │ return func() │
│ 221 │ except Exception as ex: │
│ 222 │ │ if _is_env_set("HYDRA_FULL_ERROR") or is_under_debugger(): │
│ 223 │ │ │ raise ex │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/hydra/_internal/utils.py:458 in <lambda> │
│ │
│ 455 │ │
│ 456 │ if mode == RunMode.RUN: │
│ 457 │ │ run_and_report( │
│ ❱ 458 │ │ │ lambda: hydra.run( │
│ 459 │ │ │ │ config_name=config_name, │
│ 460 │ │ │ │ task_function=task_function, │
│ 461 │ │ │ │ overrides=overrides, │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/hydra/_internal/hydra.py:132 in run │
│ │
│ 129 │ │ callbacks.on_run_end(config=cfg, config_name=config_name, job_return=ret) │
│ 130 │ │ │
│ 131 │ │ # access the result to trigger an exception in case the job failed. │
│ ❱ 132 │ │ _ = ret.return_value │
│ 133 │ │ │
│ 134 │ │ return ret │
│ 135 │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/hydra/core/utils.py:260 in return_value │
│ │
│ 257 │ │ │ sys.stderr.write( │
│ 258 │ │ │ │ f"Error executing job with overrides: {self.overrides}" + os.linesep │
│ 259 │ │ │ ) │
│ ❱ 260 │ │ │ raise self._return_value │
│ 261 │ │
│ 262 │ @return_value.setter │
│ 263 │ def return_value(self, value: Any) -> None: │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/hydra/core/utils.py:186 in run_job │
│ │
│ 183 │ │ with env_override(hydra_cfg.hydra.job.env_set): │
│ 184 │ │ │ callbacks.on_job_start(config=config, task_function=task_function) │
│ 185 │ │ │ try: │
│ ❱ 186 │ │ │ │ ret.return_value = task_function(task_cfg) │
│ 187 │ │ │ │ ret.status = JobStatus.COMPLETED │
│ 188 │ │ │ except Exception as e: │
│ 189 │ │ │ │ ret.return_value = e │
│ │
│ /mnt/bn/slp-llm/sft_huihui/pandallm/trainer_base_ds_mul.py:367 in main │
│ │
│ 364 │ │ │ logger.info("Resuming training from the latest checkpoint: %s", checkpoint) │
│ 365 │ │ │ continue_from_global_step = int(checkpoint.split('-')[-1]) │
│ 366 │ │ │
│ ❱ 367 │ │ global_step, tr_loss = train(cfg, model, tokenizer, continue_from_global_step) │
│ 368 │ │ logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) │
│ 369 │ │
│ 370 │ # Test │
│ │
│ /mnt/bn/slp-llm/sft_huihui/pandallm/trainer_base_ds_mul.py:168 in train │
│ │
│ 165 │ # 'weight_decay': 0.0} │
│ 166 │ # ] │
│ 167 │ torch.compile(model, mode="max-autotune") │
│ ❱ 168 │ model, optimizer, _, scheduler = deepspeed.initialize(model=model, │
│ 169 │ │ │ │ │ │ │ │ │ │ │ │ │ │ model_parameters=model.paramet │
│ 170 │ │ │ │ │ │ │ │ │ │ │ │ │ │ config=ds_config) │
│ 171 │ logger.info(optimizer.optimizer) │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/deepspeed/__init__.py:125 in initialize │
│ │
│ 122 │ assert model is not None, "deepspeed.initialize requires a model" │
│ 123 │ │
│ 124 │ if not isinstance(model, PipelineModule): │
│ ❱ 125 │ │ engine = DeepSpeedEngine(args=args, │
│ 126 │ │ │ │ │ │ │ │ model=model, │
│ 127 │ │ │ │ │ │ │ │ optimizer=optimizer, │
│ 128 │ │ │ │ │ │ │ │ model_parameters=model_parameters, │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py:340 in __init__ │
│ │
│ 337 │ │ │ model_parameters = list(model_parameters) │
│ 338 │ │ │
│ 339 │ │ if has_optimizer: │
│ ❱ 340 │ │ │ self._configure_optimizer(optimizer, model_parameters) │
│ 341 │ │ │ self._configure_lr_scheduler(lr_scheduler) │
│ 342 │ │ │ self._report_progress(0) │
│ 343 │ │ elif self.zero_optimization(): │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py:1298 in │
│ _configure_optimizer │
│ │
│ 1295 │ │ optimizer_wrapper = self._do_optimizer_sanity_check(basic_optimizer) │
│ 1296 │ │ │
│ 1297 │ │ if optimizer_wrapper == ZERO_OPTIMIZATION: │
│ ❱ 1298 │ │ │ self.optimizer = self._configure_zero_optimizer(basic_optimizer) │
│ 1299 │ │ elif optimizer_wrapper == AMP: │
│ 1300 │ │ │ amp_params = self.amp_params() │
│ 1301 │ │ │ log_dist(f"Initializing AMP with these params: {amp_params}", ranks=[0]) │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py:1599 in │
│ _configure_zero_optimizer │
│ │
│ 1596 │ │ │ │ log_dist(f'Creating {model_dtype} ZeRO stage {zero_stage} optimizer', │
│ 1597 │ │ │ │ │ │ ranks=[0]) │
│ 1598 │ │ │ │ from deepspeed.runtime.zero.stage3 import DeepSpeedZeroOptimizer_Stage3 │
│ ❱ 1599 │ │ │ │ optimizer = DeepSpeedZeroOptimizer_Stage3( │
│ 1600 │ │ │ │ │ self.module, │
│ 1601 │ │ │ │ │ optimizer, │
│ 1602 │ │ │ │ │ timers=timers, │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:312 in __init__ │
│ │
│ 309 │ │ │ f'Largest partitioned param numel = {largest_partitioned_param_numel}', │
│ 310 │ │ │ force=False) │
│ 311 │ │ │
│ ❱ 312 │ │ self._setup_for_real_optimizer() │
│ 313 │ │ self.grad_position = {} │
│ 314 │ │ self.set_grad_positions() │
│ 315 │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:371 in │
│ _setup_for_real_optimizer │
│ │
│ 368 │ │ │
│ 369 │ │ see_memory_usage("Before initializing optimizer states", force=True) │
│ 370 │ │ │
│ ❱ 371 │ │ self.initialize_optimizer_states() │
│ 372 │ │ see_memory_usage("After initializing optimizer states", force=True) │
│ 373 │ │ dist.barrier() │
│ 374 │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:938 in │
│ initialize_optimizer_states │
│ │
│ 935 │ │ │ │ │ 0, │
│ 936 │ │ │ │ │ num_elements) │
│ 937 │ │ │ │
│ ❱ 938 │ │ │ self._optimizer_step(i) │
│ 939 │ │ │ │
│ 940 │ │ │ if swappable_param_subgroup: │
│ 941 │ │ │ │ self._partitioned_params_swap_out(i) │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py:858 in │
│ _optimizer_step │
│ │
│ 855 │ │ fp32_param = self.fp32_partitioned_groups_flat[sub_group_id] │
│ 856 │ │ self.optimizer.param_groups[param_group_id]['params'] = [fp32_param] │
│ 857 │ │ │
│ ❱ 858 │ │ self.optimizer.step() │
│ 859 │ │ self.optimizer.param_groups[param_group_id]['params'] = [] │
│ 860 │ │
│ 861 │ def _swappable_optimizer_subgroup(self, sub_group_id): │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/torch/optim/optimizer.py:280 in wrapper │
│ │
│ 277 │ │ │ │ │ │ │ raise RuntimeError(f"{func} must return None or a tuple of ( │
│ 278 │ │ │ │ │ │ │ │ │ │ │ f"but got {result}.") │
│ 279 │ │ │ │ │
│ ❱ 280 │ │ │ │ out = func(*args, **kwargs) │
│ 281 │ │ │ │ self._optimizer_step_code() │
│ 282 │ │ │ │ │
│ 283 │ │ │ │ # call optimizer step post hooks │
│ │
│ /root/miniconda3/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py:137 in step │
│ │
│ 134 │ │ │ │ │ # Exponential moving average of gradient values │
│ 135 │ │ │ │ │ state['exp_avg'] = torch.zeros_like(p.data) │
│ 136 │ │ │ │ │ # Exponential moving average of squared gradient values │
│ ❱ 137 │ │ │ │ │ state['exp_avg_sq'] = torch.zeros_like(p.data) │
│ 138 │ │ │ │ │
│ 139 │ │ │ │ if p.dtype == torch.float16: │
│ 140 │ │ │ │ │ g_16.append(p.grad.data) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 3.77 GiB (GPU 0; 79.35 GiB total capacity; 75.35 GiB already allocated; 2.95
GiB free; 75.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid
fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-05-29 19:53:11,094] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 2250
[2023-05-29 19:53:11,094] [ERROR] [launch.py:324:sigkill_handler] ['/root/miniconda3/bin/python', '-u', 'trainer_base_ds_mul.py', '--local_rank=0', '-cp', 'conf/llama/zh', '-cn', 'llama_7b_zh_instruct_coig_sft_v1_0_ds.yaml'] exits with return code = 1
It looks like GPU 0 cannot even hold the model weights and training hasn't started yet, which shouldn't be possible. Could you first run nvidia-smi to confirm that GPU 0 actually has enough free memory?
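If it helps, the same check can be done from Python inside the training environment (illustrative snippet, not part of the repo):

import torch

# torch.cuda.mem_get_info returns (free_bytes, total_bytes) for the given device.
free, total = torch.cuda.mem_get_info(0)
print(f"GPU 0: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")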
Could you try again now? I see the error log is already from an hour ago.
Thank you for the solution. There's actually something I've been curious about and would like to ask you: why are most open-source models 7B or 13B, while 30B and 60B models are rare? Is there some kind of gap between 13B and 30B?
A 30B model is hard to fit into 80GB of GPU memory. It requires more changes, such as model parallelism or tensor parallelism, but these are not out-of-the-box: you have to write code tailored to a specific model architecture, which is fairly difficult, and hobbyists generally won't put in the effort to learn it.
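For a sense of scale, a rough illustrative calculation (approximate figures only):

# Why a 30B model does not fit on a single 80 GB card without parallelism or aggressive offload.
n_params = 30e9
GiB = 1024 ** 3

bf16_weights  = n_params * 2 / GiB             # ~56 GiB just for the bf16 weights
full_training = n_params * (2 + 2 + 12) / GiB  # weights + grads + fp32 Adam states: ~447 GiB

print(f"bf16 weights only:         {bf16_weights:.0f} GiB")
print(f"naive full training state: {full_training:.0f} GiB")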