Fine-tuning error
Closed this issue · 5 comments
After merging the LoRA weights with the base LLM, I get an error when I fine-tune the model with sh script/train/finetune_lora.sh. The error log is below. What is the problem, and how can I fix it?
Traceback (most recent call last):
File "/home/Bunny-main/bunny/train/train.py", line 393, in
train()
File "/home/Bunny-main/bunny/train/train.py", line 380, in train
non_lora_state_dict = get_peft_state_non_lora_maybe_zero_3(
File "/home/Bunny-main/bunny/train/train.py", line 118, in get_peft_state_non_lora_maybe_zero_3
to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
File "/home/Bunny-main/bunny/train/train.py", line 118, in
to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
File "/home/Bunny-main/bunny/train/train.py", line 81, in maybe_zero_3
with zero.GatheredParameters([param]):
File "/root/miniconda3/envs/bunny/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2178, in exit
self.params[0].partition(param_list=self.params, has_been_updated=False)
File "/root/miniconda3/envs/bunny/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1329, in partition
self._partition(param_list, has_been_updated=has_been_updated)
File "/root/miniconda3/envs/bunny/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1478, in _partition
self._partition_param(param, has_been_updated=has_been_updated)
File "/root/miniconda3/envs/bunny/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/miniconda3/envs/bunny/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1511, in _partition_param
free_param(param)
File "/root/miniconda3/envs/bunny/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/miniconda3/envs/bunny/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 281, in free_param
assert not param.ds_active_sub_modules, param.ds_summary()
AssertionError: {'id': 451, 'status': 'AVAILABLE', 'numel': 2949120, 'ds_numel': 2949120, 'shape': (2560, 1152), 'ds_shape': (2560, 1152), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {2494}, 'ds_tensor.shape': torch.Size([1474560])}
(The identical traceback is printed a second time in the log.)
[2024-07-04 16:56:43,965] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 799
[2024-07-04 16:56:43,966] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 800
[2024-07-04 16:56:43,966] [ERROR] [launch.py:322:sigkill_handler] ['/root/miniconda3/envs/bunny/bin/python', '-u', 'bunny/train/train.py', '--local_rank=1', '--lora_enable', 'True', '--lora_r', '128', '--lora_alpha', '256', '--mm_projector_lr', '2e-5', '--deepspeed', './script/deepspeed/zero3.json', '--model_name_or_path', './outmodel', '--model_type', 'phi-2', '--version', 'bunny', '--data_path', './finetune/test.json', '--image_folder', './finetune/', '--vision_tower', '../siglip-so400m', '--mm_projector_type', 'mlp2x_gelu', '--image_aspect_ratio', 'pad', '--group_by_modality_length', 'False', '--bf16', 'True', '--output_dir', './checkpoints-phi-2/bunny-lora-phi-2', '--num_train_epochs', '1', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '2', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '500', '--save_total_limit', '1', '--learning_rate', '2e-4', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'none'] exits with return code = 1
After changing the argument from --deepspeed ./script/deepspeed/zero3.json \ to --deepspeed ./script/deepspeed/zero2.json \, I was able to get the fine-tuned files shown below.
Since I am not a very experienced developer: whether I use the model directly or merge it, I get this error with --model-base ./outmodel.
Thanks. After changing the model-base argument I can merge normally, and I can also run and load the model.
My merged file structure is as follows:
But I would like to load the model directly with Transformers:
It reports that files are missing from my merged directory.
I copied those two files over from the source code.
Now it says there is no process_images. How should I fix this?
The snippet in Quickstart is for Bunny-v1.0-3B (SigLIP + Phi-2) and similar models. We manually combined some configuration code into a single file for users' convenience. You can also check modeling_bunny_phi.py and configuration_bunny_phi.py and their related parts in the Bunny source code to see the difference.
For other models, including models trained by yourself, we currently only support loading them by installing the source code of Bunny. Alternatively, you can copy modeling_bunny_phi.py and configuration_bunny_phi.py into your model directory and edit config.json.
BTW, offset_bos should be 0 for Phi-2-based Bunny.
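For reference, here is a minimal sketch of that Transformers-only loading path, assuming modeling_bunny_phi.py and configuration_bunny_phi.py have been copied next to config.json and that config.json's auto_map points at the classes they define (the model path below is a placeholder, not taken from this thread):

```python
# Minimal sketch: load a merged Phi-2-based Bunny checkpoint with Transformers only.
# Assumes modeling_bunny_phi.py and configuration_bunny_phi.py sit next to config.json,
# and that config.json's "auto_map" maps AutoConfig / AutoModelForCausalLM to the
# classes defined in those files (compare with the official Bunny-v1.0-3B config).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./merged-bunny-phi-2"  # placeholder: path to your merged model directory

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # needed so the copied modeling/configuration files are imported
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```

With trust_remote_code=True the copied modeling code is imported, which should also bring back helpers such as process_images that are unavailable when only the plain weights and config are present.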
OK, got it. Thank you!