OpenLMLab/LOMO

A bug found in save_model of LOMOTrainer

DingQiang2018 opened this issue · 10 comments

I used LOMO (and ZeRO-3) to fine-tune chatglm2-6b on 8 NVIDIA 3090 GPUs and saved checkpoints with LOMOTrainer's save_model method. After reloading a saved checkpoint, the validation loss it produced differed from the validation loss measured at the end of training. I referred to DeepSpeed's official model-saving code and rewrote save_model (rewritten code below), and the bug went away. This indicates that the original save_model has a bug, although I have not yet found the specific cause.

    def save_model(self, index):
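        # Requires os, shutil, torch, deepspeed, pathlib.Path and
        # collections.OrderedDict to be imported at module level.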
        if self.training_args.local_rank in [-1, 0]:
            checkpoint_dir = sorted(Path(self.training_args.output_dir).glob("checkpoint-*"))
            if len(checkpoint_dir) >= self.training_args.save_total_limit:
                shutil.rmtree(checkpoint_dir[0], ignore_errors=True)
        torch.distributed.barrier()

        if self.training_args.resume_step:
            output_dir = os.path.join(self.training_args.output_dir, f"checkpoint-{index+self.training_args.resume_step}")
        else:
            output_dir = os.path.join(self.training_args.output_dir, f"checkpoint-{index}")
        if not os.path.exists(output_dir):
            os.makedirs(output_dir, exist_ok=True)

        state_dict = OrderedDict() if torch.distributed.get_rank() == 0 else None
        shared_params = {}

        # Prepare for checkpoint save by ensuring all parameters are partitioned
        self.model.optimizer.partition_all_parameters()

        with deepspeed.zero.GatheredParameters(list(self.model.module.parameters()), modifier_rank=0):
            if torch.distributed.get_rank() == 0:
                for name, param in self.model.module.named_parameters():
                    if param is None:
                        continue
                    # can't rely on param.data_ptr() as it will be reused as weights gets
                    # gathered and reduced, but param.ds_id is unique across all zero weights
                    # (and shared params will have the same param.ds_id)
                    if param.ds_id in shared_params:
                        # shared weights
                        #print(f"`{name}` is shared with `{shared_params[param.ds_id]}`")
                        state_dict[name] = state_dict[shared_params[param.ds_id]]
                    else:
                        state_dict[name] = param.detach().cpu()
                        shared_params[param.ds_id] = name
                    #print(f"param {param.ds_id} {param.shape} {name} ")

                # now buffers - not sure if need to take care of potentially shared weights here
                for name, buf in self.model.module.named_buffers():
                    if (buf is not None and name not in self.model.module._non_persistent_buffers_set):
                        state_dict[name] = buf.detach().cpu()

        if len(self.model.optimizer.persistent_parameters) > 0:
            self.model.optimizer.persistent_parameters[0].all_gather(self.model.optimizer.persistent_parameters)

        if torch.distributed.get_rank() == 0:
            torch.save(state_dict, os.path.join(output_dir, 'pytorch_model.bin'))

        torch.distributed.barrier()
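
For reference, here is a minimal sketch of how I reload a saved checkpoint to re-check the validation loss (the checkpoint path is a placeholder, and load_state_dict uses strict=False in case the saved keys carry wrapper prefixes):

    import torch
    from transformers import AutoModel

    # Placeholder checkpoint path; substitute the actual output_dir and index.
    ckpt_path = "outputs/checkpoint-1000/pytorch_model.bin"

    model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
    state_dict = torch.load(ckpt_path, map_location="cpu")

    # strict=False so that missing or unexpectedly named keys are reported
    # instead of raising; ideally both lists are empty.
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")

    # ...then run the same validation loop as in training and compare the loss.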

Thanks for your kind feedback; save_model() has been updated according to your advice. FYI: 06e50c0

I'm glad my suggestion was adopted. May I also ask whether you have any thoughts on the specific cause of the bug in the previous save_model? I haven't figured it out yet.

Hi, the reason I want to know the answer is that the LOMO optimizer implementation and the save_model code appear to assume the same layout for DeepSpeed's partitioned parameters: each parameter is flattened and divided into chunks, with the i-th chunk assigned to the i-th process. I am not sure whether DeepSpeed actually partitions parameters this way, so the save_model code I provided above does not rely on that assumption and only uses deepspeed.zero.GatheredParameters to gather the parameters automatically. To my surprise, this change fixed the bug. I therefore suspect that the bug may stem from an incorrect assumption about how the parameters are partitioned, which shakes my confidence in the correctness of the LOMO optimizer implementation. I hope the author can clear up my doubts.
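
For concreteness, a minimal sketch of the assumed layout: the flattened parameter is padded to a multiple of the world size and split into equal contiguous chunks, with rank i holding chunk i in param.ds_tensor. The sketch below reconstructs one full parameter by hand under that assumption (ds_tensor, ds_numel, and ds_shape are attributes DeepSpeed attaches to ZeRO-3 parameters; deepspeed.zero.GatheredParameters performs the equivalent gathering internally), so it illustrates the assumption rather than a verified behavior:

    import torch
    import torch.distributed as dist

    def gather_zero3_param(param, world_size):
        # Assumed layout: the flattened parameter is padded so that its length
        # is divisible by world_size, split into equal contiguous chunks, and
        # rank i keeps chunk i in param.ds_tensor.
        shards = [torch.empty_like(param.ds_tensor) for _ in range(world_size)]
        dist.all_gather(shards, param.ds_tensor.contiguous())
        flat = torch.cat(shards)[:param.ds_numel]  # drop the tail padding
        return flat.view(param.ds_shape)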

@DingQiang2018 Hi, I noticed that the author modified LOMOTrainer and LOMOLoRaTrainer according to your suggestion. LOMOTrainer runs without problems, but LOMOLoRaTrainer raises an error at self.model.optimizer.partition_all_parameters(). Have you encountered the same problem? Thanks!

Yeah, I am having this issue too. Did you find any solution?

Not solved yet...

I also cannot get the same results after merging LLaMA with LoRA, which is strange.

Hi, lomo_lora_trainer has an additional optimizer for LoRA, so DeepSpeedZeRoOffload cannot be reached through model.optimizer. For now I have rolled save_model() in lomo_lora_trainer.py back to the previous version.

Hi, I would like to know how much the ChatGLM2 validation loss differs between the two saving methods. Do you still have a record of it? BTW, does LLaMA have the same problem?

Hi, I noticed the rollback, but I still cannot get the same eval results for the merged model...