Is there existing code to resume training from specific checkpoint?
Closed this issue · 1 comments
javirandor commented
Are there any official guidelines for resuming the training from a specific checkpoint?
Taking a look at the gpt-neox repository, I guess we need to set the "load" parameter in the config.
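For concreteness, here is a rough sketch of what I imagine the relevant fragment of a gpt-neox config might look like — the path and surrounding keys are placeholders I made up, and the exact key names should be checked against the example configs in the repository:

```yaml
{
  # Directory containing the saved checkpoints; gpt-neox should pick up
  # the step to resume from based on the checkpoint metadata it finds here.
  # (Placeholder path, not the actual training setup.)
  "load": "/path/to/checkpoints",

  # Presumably the data paths also need to match the original run
  # so that resumed training sees the same data order.
}
```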
But I assume there is no 1:1 mapping between data chunks and checkpoints, since there are 133 data splits but 143,000 steps.
Are there any existing resources to ensure our setup faithfully reproduces your training?
javirandor commented
I solved this by manually inspecting things. I will try to provide reproducible instructions soon!