Resume training from the last saved checkpoint
chensongcan opened this issue · 7 comments
Using the pretraining script with the overwrite_output_dir argument removed, loading the checkpoint-100 folder saved under output raises: RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
The peft version is fine:
peft 0.3.0.dev0
torch 1.13.1+cu117
The Trainer saves only a subset of the parameters (the PEFT weights), while DeepSpeed loads the model with strict=True by default, which requires every weight to be present, hence the error.
You need to modify the DeepSpeed source code. Since the exact location differs across DeepSpeed versions, I can't point to a specific line. Roughly: in deepspeed/runtime/engine.py, change the default value of the load_module_strict parameter in the definition of the load_checkpoint function to False.
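To see why strict loading fails here, a minimal stand-alone sketch (a toy mimic of state_dict loading, not DeepSpeed's actual implementation) shows what happens when a checkpoint contains only the PEFT adapter weights while the model also has base weights:

```python
# Toy mimic of strict vs. non-strict state_dict loading. Purely illustrative:
# the real logic lives in torch/DeepSpeed; key names below are made up.
def load_state_dict(model_params, checkpoint, strict=True):
    missing = [k for k in model_params if k not in checkpoint]
    unexpected = [k for k in checkpoint if k not in model_params]
    if strict and (missing or unexpected):
        raise RuntimeError(
            f"Error(s) in loading state_dict: "
            f"missing {missing}, unexpected {unexpected}")
    for k, v in checkpoint.items():
        if k in model_params:
            model_params[k] = v
    return missing, unexpected

# Full model: frozen base weights plus LoRA adapters.
model = {"base.weight": 0.0, "lora_A.weight": 0.0, "lora_B.weight": 0.0}
# Trainer checkpoint: only the trainable adapter weights were saved.
ckpt = {"lora_A.weight": 1.0, "lora_B.weight": 2.0}

try:
    load_state_dict(dict(model), ckpt, strict=True)
except RuntimeError as e:
    print("strict=True:", e)          # the base weights are "missing"

missing, _ = load_state_dict(model, ckpt, strict=False)
print("strict=False loaded adapters; missing base keys:", missing)
```

With strict=False the adapter weights are restored and the untouched base weights simply keep their current (pretrained) values, which is exactly what a PEFT resume needs.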
Got it. Is that one change enough on its own, or are there other parameters that need changing?
That should be all; give it a try first.
Thanks a lot.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.
This is a serious issue for this repo. There is no way to resume training; do we really have to switch to https://github.com/Facico/Chinese-Vicuna/tree/master ?!?
The problem is that I'm doing pretraining, not only fine-tuning, and the script works for most of the workflow. Maybe I'll have to rewrite everything from scratch ://
The Trainer saves only a subset of the parameters, while DeepSpeed loads the model with strict=True by default, requiring all weights to be present, hence the error. You need to modify the DeepSpeed source code; since the location differs across DeepSpeed versions, I can't give the exact spot. Roughly: in deepspeed/runtime/engine.py, change the default value of the load_module_strict parameter in the definition of the load_checkpoint function to False.
Then what is the point of even saving the checkpoint if this code does not work with DeepSpeed?!?