Recover parameters for resume training
Opened this issue · 1 comment
AdamYang011 commented
Hello! This is excellent work! I'm currently attempting to run an experiment based on this project, but I have some questions about how to resume from a checkpoint for continued training.
I've already tried the --resume argument, but the parameters don't seem to be restored properly. Can I rely on --resume alone, or do I need to recover them from the bitstream with --bitstream?
hmkx commented
Hi, to resume training from a model checkpoint, simply use the following argument:
--resume OUTPUT_DIR/checkpoints/CHECKPOINT_NAME
OUTPUT_DIR is the full path of the original run's output directory, which is printed at the start of training.
CHECKPOINT_NAME is one of the checkpoint names in the checkpoints folder.
For example, if the original run printed:
Output dir: /home/HiNeRV/ReadySetGo-HiNeRV-20240413-191040-15cda5b7
You can resume training with:
--resume /home/HiNeRV/ReadySetGo-HiNeRV-20240413-191040-15cda5b7/checkpoints/checkpoint_best
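For readers curious what --resume does under the hood, the usual pattern is that the trainer periodically serializes a dict of state (epoch counter, model weights, optimizer state) and restores it on startup. The sketch below illustrates that general pattern only; the function names, dict keys, and use of pickle here are illustrative assumptions, not HiNeRV's actual API (PyTorch projects typically use torch.save/torch.load instead).

```python
import os
import pickle
import tempfile

# Illustrative sketch of checkpoint save/resume. All names here
# (save_checkpoint, load_checkpoint, "epoch", "model_state",
# "optimizer_state") are hypothetical, not HiNeRV's real interface.

def save_checkpoint(path, epoch, model_state, optimizer_state):
    """Serialize training state so a later run can pick up where this left off."""
    with open(path, "wb") as f:
        pickle.dump({"epoch": epoch,
                     "model_state": model_state,
                     "optimizer_state": optimizer_state}, f)

def load_checkpoint(path):
    """Restore the state dict written by save_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Usage: simulate a run that saved at epoch 10, then a resumed run.
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint_best")
save_checkpoint(ckpt_path, epoch=10,
                model_state={"w": [0.1, 0.2]},
                optimizer_state={"lr": 1e-3})

ckpt = load_checkpoint(ckpt_path)
start_epoch = ckpt["epoch"] + 1  # resumed training continues from epoch 11
```

The key point for the question above: if the checkpoint file contains the model and optimizer state, --resume should be sufficient; --bitstream-style recovery is only needed when reconstructing weights from a compressed bitstream rather than a training checkpoint.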