EleutherAI/pythia

Is there existing code to resume training from a specific checkpoint?

Closed this issue · 1 comments

Are there any official guidelines for resuming the training from a specific checkpoint?

Taking a look at the gpt-neox repository, it seems we need to set the "load" parameter in the config.
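A minimal sketch of what that might look like in a gpt-neox YAML config (the path below is hypothetical; point it at the directory holding the downloaded checkpoint for the step you want to resume from):

```yaml
# Hypothetical checkpoint directory -- substitute your own path.
"load": "/path/to/pythia-checkpoints/step100000"
```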

But I assume there is no 1:1 mapping between data chunks and checkpoints, since there are 133 data splits and 143,000 steps.
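As a rough illustration of why the mapping is not 1:1, here is a sketch that maps a training step to the data shard it would fall in, under the (unverified) assumption that the 133 shards are consumed uniformly and in order over the 143,000 steps — the actual dataloader order would need to be checked against the released configs:

```python
# Assumption: shards are read sequentially and uniformly over training.
TOTAL_STEPS = 143_000
NUM_SHARDS = 133


def shard_for_step(step: int) -> int:
    """Return the 0-based index of the shard a given step reads from."""
    if not 0 <= step < TOTAL_STEPS:
        raise ValueError(f"step must be in [0, {TOTAL_STEPS})")
    return step * NUM_SHARDS // TOTAL_STEPS


# Each shard covers roughly 143000 / 133 ~ 1075 steps, so a checkpoint
# taken every 1000 steps will usually land mid-shard.
```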

Are there any existing resources to ensure our setup faithfully reproduces your training?

I solved this by manually inspecting things. I will try to provide some reproducible instructions soon!