lessw2020/Ranger21

resuming training with ranger21?

neuronflow opened this issue · 3 comments

As I understand it, Ranger21 handles learning rate scheduling and similar things internally.

How should training be resumed? Is there a state dict to be loaded etc.?

Hi @neuronflow,
Thanks for opening the issue!
Ranger21 does maintain a basic state dict, but we definitely need to extend it with some additional data to ensure a clean restart when training is stopped.
Let me use this issue to track that. It has been on my to-do list, and I'll test and fix it, ideally in the next few days.
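
To illustrate why the extra data matters, here is a minimal sketch of an optimizer with a built-in warmup schedule. It is a toy, not Ranger21's actual code: the class `ToyScheduledOptimizer`, its fields, and the linear warmup rule are all hypothetical. The point is only that when the schedule lives inside the optimizer, the step counter has to be part of the saved state, otherwise a resumed run restarts the schedule from step 0.

```python
class ToyScheduledOptimizer:
    """Hypothetical stand-in for an optimizer that schedules its own LR.

    This is NOT Ranger21's implementation, only an illustration.
    """

    def __init__(self, base_lr=1e-3, warmup_steps=100):
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps
        self.step_count = 0  # drives the internal schedule

    def current_lr(self):
        # Linear warmup to base_lr, then constant (assumed schedule).
        progress = min(1.0, (self.step_count + 1) / self.warmup_steps)
        return self.base_lr * progress

    def step(self):
        # A real optimizer would update parameters here.
        self.step_count += 1

    def state_dict(self):
        # The step counter must be saved along with the hyperparameters,
        # because the schedule cannot be reconstructed without it.
        return {
            "base_lr": self.base_lr,
            "warmup_steps": self.warmup_steps,
            "step_count": self.step_count,
        }

    def load_state_dict(self, state):
        self.base_lr = state["base_lr"]
        self.warmup_steps = state["warmup_steps"]
        self.step_count = state["step_count"]


# Train for 50 steps, then checkpoint.
opt = ToyScheduledOptimizer()
for _ in range(50):
    opt.step()
checkpoint = opt.state_dict()

# Resuming with the saved state continues the schedule where it stopped.
resumed = ToyScheduledOptimizer()
resumed.load_state_dict(checkpoint)

# A fresh optimizer without the saved state starts the warmup over.
fresh = ToyScheduledOptimizer()
```

In this sketch `resumed.current_lr()` matches `opt.current_lr()`, while `fresh.current_lr()` is back at the start of the warmup. A real optimizer would also need to persist per-parameter buffers and any other schedule-related counters.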

Any updates on this one? :) I lost multiple GPU days of training because the runs can't be resumed :/

Seconding the need for this feature!