Is there existing code to resume training from specific checkpoint?
Closed this issue · 1 comments
javirandor commented
Are there any official guidelines for resuming the training from a specific checkpoint?
Taking a look at the gpt-neox repository, I guess we need to set the "load" parameter in the config.
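For concreteness, here is a rough sketch of what I imagine the relevant fragment of a gpt-neox config might look like — the path and surrounding keys are placeholders I made up, and the exact key names should be checked against the example configs in the repository:

```yaml
{
  # Directory containing the saved checkpoints; gpt-neox should pick up
  # the step to resume from based on the checkpoint metadata it finds here.
  # (Placeholder path, not the actual training setup.)
  "load": "/path/to/checkpoints",

  # Presumably the data paths also need to match the original run
  # so that resumed training sees the same data order.
}
```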
But I assume there is no 1:1 mapping between data chunks and checkpoints, since there are 133 data splits but 143,000 steps.
Are there any existing resources to ensure our setup faithfully reproduces your training?
javirandor commented
I solved this by manually inspecting things. I will try to provide reproducible instructions soon!