kamalkraj/e5-mistral-7b-instruct

Exception when "--checkpointing_steps" is set

Hypothesis-Z opened this issue · 2 comments

The source code of the Accelerate library shows that the weights list passed to the hooks is empty if the training task is launched via DeepSpeed.

https://github.com/huggingface/accelerate/blob/b8c85839531ded28efb77c32e0ad85af2062b27a/src/accelerate/accelerator.py#L2778-L2824

Therefore, an IndexError is raised in save_model_hook:

def save_model_hook(models, weights, output_dir):
    for i, model in enumerate(models):
        model.save_pretrained(output_dir, state_dict=weights[i])
        # make sure to pop weight so that corresponding model is not saved again
        weights.pop()
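
One way to keep the hook registered but avoid the IndexError is to make it a no-op when the weights list is empty; this is only a sketch based on the linked Accelerate code (under DeepSpeed, save_state checkpoints the engine itself), not an official fix:

def save_model_hook(models, weights, output_dir):
    # Under DeepSpeed, accelerator.save_state() checkpoints the engine directly
    # and passes an empty weights list to this hook, so there is nothing to do here.
    if not weights:
        return
    for i, model in enumerate(models):
        model.save_pretrained(output_dir, state_dict=weights[i])
        # make sure to pop weight so that corresponding model is not saved again
        weights.pop()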


Another error is that if "--checkpointing_steps" is set to "epoch", accelerator.save_state() times out, whereas it works if an integer is set.

Hi, have you solved this problem?

@liujiqiang999 Do not register the hooks.
# accelerator.register_save_state_pre_hook(save_model_hook)
# accelerator.register_load_state_pre_hook(load_model_hook)
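
If the hooks are removed, the final model can still be exported with the standard Accelerate pattern. A minimal sketch, assuming model, accelerator, and output_dir are in scope in the training script:

# Export the trained model without the save/load state hooks.
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    output_dir,
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    # get_state_dict gathers the full weights even under DeepSpeed ZeRO-3
    state_dict=accelerator.get_state_dict(model),
)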