a bug found in save_model of LOMOTrainer
DingQiang2018 opened this issue · 10 comments
I fine-tuned chatglm2-6b with LOMO (and ZeRO-3) on 8 NVIDIA 3090 GPUs and saved checkpoints with LOMOTrainer's save_model method. After reloading a checkpoint, the validation loss I measured differed from the one measured at the end of training. I rewrote save_model following DeepSpeed's official model-saving code (rewritten version below), and the discrepancy disappeared. This indicates that the original save_model has a bug, but I have not yet found the specific cause.
def save_model(self, index):
    if self.training_args.local_rank in [-1, 0]:
        checkpoint_dir = sorted(Path(self.training_args.output_dir).glob("checkpoint-*"))
        if len(checkpoint_dir) >= self.training_args.save_total_limit:
            shutil.rmtree(checkpoint_dir[0], ignore_errors=True)
    torch.distributed.barrier()

    if self.training_args.resume_step:
        output_dir = os.path.join(self.training_args.output_dir,
                                  f"checkpoint-{index + self.training_args.resume_step}")
    else:
        output_dir = os.path.join(self.training_args.output_dir, f"checkpoint-{index}")
    if not os.path.exists(output_dir):
        os.makedirs(output_dir, exist_ok=True)

    state_dict = OrderedDict() if torch.distributed.get_rank() == 0 else None
    shared_params = {}

    # Prepare for checkpoint save by ensuring all parameters are partitioned
    self.model.optimizer.partition_all_parameters()

    with deepspeed.zero.GatheredParameters(list(self.model.module.parameters()), modifier_rank=0):
        if torch.distributed.get_rank() == 0:
            for name, param in self.model.module.named_parameters():
                if param is None:
                    continue
                # can't rely on param.data_ptr() as it will be reused as weights gets
                # gathered and reduced, but param.ds_id is unique across all zero weights
                # (and shared params will have the same param.ds_id)
                if param.ds_id in shared_params:
                    # shared weights
                    state_dict[name] = state_dict[shared_params[param.ds_id]]
                else:
                    state_dict[name] = param.detach().cpu()
                    shared_params[param.ds_id] = name

            # now buffers - not sure if need to take care of potentially shared weights here
            for name, buf in self.model.module.named_buffers():
                if buf is not None and name not in self.model.module._non_persistent_buffers_set:
                    state_dict[name] = buf.detach().cpu()

    if len(self.model.optimizer.persistent_parameters) > 0:
        self.model.optimizer.persistent_parameters[0].all_gather(self.model.optimizer.persistent_parameters)

    if torch.distributed.get_rank() == 0:
        torch.save(state_dict, os.path.join(output_dir, 'pytorch_model.bin'))
    torch.distributed.barrier()
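For reference, a minimal reload sketch (the paths are hypothetical and this is not code from the issue) for checking validation loss once the method above has written a consolidated pytorch_model.bin, assuming the base model is THUDM/chatglm2-6b loaded via transformers with trust_remote_code=True:

# Minimal reload sketch (hypothetical paths, not code from this issue).
# Assumes the consolidated checkpoint written by save_model() above and the
# THUDM/chatglm2-6b base model loaded with trust_remote_code=True.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
state_dict = torch.load("outputs/checkpoint-1000/pytorch_model.bin", map_location="cpu")

# strict=False reports key mismatches instead of raising
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)

# then run the same validation loop as at the end of training and compare the loss

Loading with strict=False and printing the reported key lists makes any naming mismatch between the saved state_dict and the freshly constructed model visible immediately.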
Thanks for your kind feedback. save_model() has been updated according to your advice. FYI: 06e50c0
It's my pleasure to see my advice accepted. May I also ask whether you have any thoughts on the specific cause of the previous save_model bug? I still haven't figured it out.
Hi, the reason I want to know the answer is that the LOMO optimizer implementation and the old save_model code assume the same layout for DeepSpeed's partitioned parameters, namely that each parameter is flattened and split into chunks, with the i-th chunk assigned to the i-th process. I am not sure DeepSpeed actually partitions parameters that way. The save_model code I provided above therefore drops this assumption and only uses deepspeed.zero.GatheredParameters, provided by DeepSpeed, to gather the parameters automatically. To my surprise, this change fixes the bug, so I suspect the cause may be that the partitioning assumption is wrong. This has shaken my confidence in the correctness of the LOMO optimizer implementation, and I hope the author can clear up my doubts.
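To make the assumption concrete, here is a hypothetical sketch (not code from LOMO or DeepSpeed) of the layout in question: every parameter is flattened, padded, split into world_size equal chunks, and rank i holds chunk i as its local ZeRO-3 shard. The attributes ds_tensor, ds_numel and ds_shape are DeepSpeed ZeRO-3 internals; whether the real partitioning matches this chunk order is exactly what is unclear.

# Hypothetical illustration (not LOMO or DeepSpeed code) of the assumed layout:
# each parameter is flattened, padded, split into world_size equal chunks,
# and rank i keeps chunk i as its local ZeRO-3 shard.
import torch
import torch.distributed as dist

def gather_assuming_chunked_layout(param):
    # param.ds_tensor / ds_numel / ds_shape are DeepSpeed ZeRO-3 internals
    world_size = dist.get_world_size()
    shard = param.ds_tensor                             # this rank's flat shard
    gathered = [torch.empty_like(shard) for _ in range(world_size)]
    dist.all_gather(gathered, shard)                    # one shard per rank
    flat = torch.cat(gathered)                          # assumes chunk order == rank order
    return flat[:param.ds_numel].view(param.ds_shape)   # drop padding, restore shape

If the real shard order or padding ever differs from this assumption, a reconstruction along these lines would silently produce permuted or misaligned weights, which could explain a loss mismatch after reloading; deepspeed.zero.GatheredParameters sidesteps the question by letting DeepSpeed reassemble the full parameter itself.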
@DingQiang2018 Hi, I noticed that the author modified LOMOTrainer and LOMOLoRaTrainer according to your suggestion. LOMOTrainer runs without problems, but LOMOLoRaTrainer raises an error at self.model.optimizer.partition_all_parameters(). Have you encountered the same problem? Thanks!
Yeah I am having this issue, did you find any solution?
Not solved yet…
I also can't get the same results after merging LLaMA with LoRA, which is strange.
Hi, in lomo_lora_trainer there is an additional optimizer for LoRA, so DeepSpeedZeRoOffload cannot be reached through model.optimizer. For now I have reverted save_model() in lomo_lora_trainer.py to the previous version.
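For anyone hitting the LOMOLoRaTrainer error, a small hedged sketch (not the repository's fix; the helper name is made up for illustration) of how the ZeRO-3-specific call could be guarded:

# Hypothetical helper: only take the ZeRO-3 consolidation path when
# model.optimizer actually exposes partition_all_parameters(); in
# lomo_lora_trainer, model.optimizer is the extra LoRA optimizer and does not.
def can_use_zero3_consolidation(model) -> bool:
    optimizer = getattr(model, "optimizer", None)
    return optimizer is not None and hasattr(optimizer, "partition_all_parameters")

A trainer could check this before choosing between the GatheredParameters-based save path and the previous saving code, which matches the fallback described above.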
Hi, I'd like to know how much the ChatGLM2 loss differs between the two saving methods; do you still have a record of that? BTW, would LLaMA have the same problem?
Hi, I noticed that (the revert in lomo_lora_trainer.py), but I still can't get the same eval results from the model after the merge…