OpenMOSS/CoLLiE

How to convert parallel state_dict to normal state_dict?

Opened this issue · 3 comments

Hi there! I saved a parallel state_dict (requires_grad True only) with 8 GPUs remotely. How can I load these state_dicts and save them as a single one locally? Thanks in advance.

collie_dp0_pp0_tp0.pt  collie_zero_dp0_pp0_tp0.pt  collie_zero_dp2_pp0_tp0.pt  collie_zero_dp4_pp0_tp0.pt  collie_zero_dp6_pp0_tp0.pt
collie.json            collie_zero_dp1_pp0_tp0.pt  collie_zero_dp3_pp0_tp0.pt  collie_zero_dp5_pp0_tp0.pt  collie_zero_dp7_pp0_tp0.pt

Hi, the model weights should be saved in files like pytorch_model.bin with the CheckpointCallback below.

callbacks = [CheckpointCallback(your_path, every_n_batches=1600, model_only=False, peft_only=False)]
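
Once the checkpoint is written, loading it and re-saving a single consolidated copy locally could look like the minimal sketch below, assuming the callback writes a standard PyTorch state dict named pytorch_model.bin under your_path (your_model is a placeholder for however you construct the model locally):

import torch

# Assumption: pytorch_model.bin is a plain PyTorch state dict.
state_dict = torch.load("your_path/pytorch_model.bin", map_location="cpu")
your_model.load_state_dict(state_dict)

# Re-save one consolidated copy locally.
torch.save(your_model.state_dict(), "consolidated_model.pt")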

BTW, are you using the main branch or the dev branch? We recommend using dev now.

Got it. I'm using the dev branch. So the files above are all trainer state (not model weights), as defined in the Trainer. The issue was caused by my filter on requires_grad, which is always False for tensors returned by state_dict().

self.checkpoint_file = "collie_dp{}_pp{}_tp{}.pt".format(env.dp_rank, env.pp_rank, env.tp_rank)  # Trainer state
state_dict = {n: p.detach().cpu() for n, p in model.state_dict().items() if p.requires_grad}  # always empty: state_dict() returns detached tensors with requires_grad=False
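
A minimal sketch of a working filter, assuming the goal is to keep only trainable parameters: iterate named_parameters(), which preserves requires_grad, instead of state_dict():

# named_parameters() keeps the requires_grad flag, so the filter is no longer always empty.
state_dict = {n: p.detach().cpu() for n, p in model.named_parameters() if p.requires_grad}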

The topk argument of CheckpointCallback defaults to 0, which will not save the model... I think it would be better to default it to 1 or -1, or to raise a warning, in case of misconfiguration.
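
For reference, a hedged example of passing topk explicitly so checkpoints are actually kept (parameter names are taken from the snippets above; the exact topk semantics are assumed from this discussion):

# Assumption from this thread: topk=0 disables saving, while 1 or -1 keeps checkpoints.
callbacks = [CheckpointCallback(your_path, every_n_batches=1600, topk=1, model_only=False, peft_only=False)]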