Resume training from the last saved checkpoint
chensongcan opened this issue · 7 comments
Using the pretraining script with the overwrite_output_dir argument removed, loading the checkpoint-100 folder saved under output raises: RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
The peft version is fine:
peft 0.3.0.dev0
torch 1.13.1+cu117
The Trainer saves only a subset of the parameters (the PEFT weights), while DeepSpeed loads the model with strict=True by default, which requires every weight to be present, hence the error.
You need to modify the DeepSpeed source code. Since the exact location differs across DeepSpeed versions, I can't point to a specific line. Roughly: in deepspeed/runtime/engine.py, change the default value of the load_module_strict parameter in the definition of the load_checkpoint function to False.
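To see why strict loading fails here, a minimal stand-alone sketch (a toy mimic of state_dict loading, not DeepSpeed's actual implementation) shows what happens when a checkpoint contains only the PEFT adapter weights while the model also has base weights:

```python
# Toy mimic of strict vs. non-strict state_dict loading. Purely illustrative:
# the real logic lives in torch/DeepSpeed; key names below are made up.
def load_state_dict(model_params, checkpoint, strict=True):
    missing = [k for k in model_params if k not in checkpoint]
    unexpected = [k for k in checkpoint if k not in model_params]
    if strict and (missing or unexpected):
        raise RuntimeError(
            f"Error(s) in loading state_dict: "
            f"missing {missing}, unexpected {unexpected}")
    for k, v in checkpoint.items():
        if k in model_params:
            model_params[k] = v
    return missing, unexpected

# Full model: frozen base weights plus LoRA adapters.
model = {"base.weight": 0.0, "lora_A.weight": 0.0, "lora_B.weight": 0.0}
# Trainer checkpoint: only the trainable adapter weights were saved.
ckpt = {"lora_A.weight": 1.0, "lora_B.weight": 2.0}

try:
    load_state_dict(dict(model), ckpt, strict=True)
except RuntimeError as e:
    print("strict=True:", e)          # the base weights are "missing"

missing, _ = load_state_dict(model, ckpt, strict=False)
print("strict=False loaded adapters; missing base keys:", missing)
```

With strict=False the adapter weights are restored and the untouched base weights simply keep their current (pretrained) values, which is exactly what a PEFT resume needs.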
Got it. Is that one change enough on its own, or are there other parameters that need changing?
That should be all; give it a try first.
Thanks a lot.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.
This is a serious issue for this repo. There is no way to resume training; do we really have to switch to https://github.com/Facico/Chinese-Vicuna/tree/master ?!?
The problem is that I'm doing pretraining, not only fine-tuning, and the script works for most of the workflow. Maybe I'll have to rewrite everything from scratch ://
The Trainer saves only a subset of the parameters, while DeepSpeed loads the model with strict=True by default, requiring all weights to be present, hence the error. You need to modify the DeepSpeed source code; since the location differs across DeepSpeed versions, I can't give the exact spot. Roughly: in deepspeed/runtime/engine.py, change the default value of the load_module_strict parameter in the definition of the load_checkpoint function to False.
Then what is the point of even saving the checkpoint if this code does not work with DeepSpeed?!?