Add example on how to resume model training

Question

Opened this issue 5 years ago · 1 comments

Modify the existing code samples and configs to include the option of resuming training.
For example: given a checkpoint from epoch 100 run the training for 50 more epochs.

Restart where training left off.
Update: Dutch_F3 notebook
Re-use min_epoch param in the config
Param for the weights path to resume from: double-check which existing param we can reuse

Answer 1 · 2020-02-05T22:56:44.000Z

Need more discussion offline, but right now suggest you use:
TRAIN.BEGIN_EPOCH - epoch to resume training from
TEST.MODEL_PATH - path to model you want to resume from (however, we need to discuss this - I don't know if there's a better mechanism to specify which model to resume from).
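
A minimal sketch of how these two keys could drive a resume. It assumes a plain dict stands in for the config object, that `TRAIN.END_EPOCH` exists (an assumed key, not named in this thread), and that epochs are zero-indexed so a checkpoint taken after 100 epochs resumes at epoch 100. `resume_training` and `train_one_epoch` are hypothetical names. `pickle` is used only to keep the sketch framework-free; a PyTorch project would typically use `torch.load` and `load_state_dict` instead.

```python
import pickle
import tempfile
from pathlib import Path


def resume_training(cfg, train_one_epoch):
    """Load weights from TEST.MODEL_PATH and train from TRAIN.BEGIN_EPOCH."""
    begin = cfg["TRAIN"]["BEGIN_EPOCH"]
    end = cfg["TRAIN"]["END_EPOCH"]  # assumed key, not confirmed by the thread
    with open(cfg["TEST"]["MODEL_PATH"], "rb") as f:
        weights = pickle.load(f)
    epochs_run = []
    for epoch in range(begin, end):
        weights = train_one_epoch(weights, epoch)
        epochs_run.append(epoch)
    return weights, epochs_run


# Usage: a checkpoint saved after 100 epochs, trained for 50 more.
ckpt_path = Path(tempfile.mkdtemp()) / "model_100.pkl"
with open(ckpt_path, "wb") as f:
    pickle.dump({"w": 0.0}, f)

cfg = {
    "TRAIN": {"BEGIN_EPOCH": 100, "END_EPOCH": 150},
    "TEST": {"MODEL_PATH": str(ckpt_path)},
}
# The lambda is a stand-in training step that just increments a counter.
final_weights, epochs_run = resume_training(cfg, lambda w, e: {"w": w["w"] + 1.0})
```

Note that with this approach the caller has to keep `BEGIN_EPOCH` in sync with the checkpoint by hand.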

For example, the following config uses the same setup as ours: https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/blob/master/experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml

Thoughts on how to specify model name to resume from? Suggest we add TRAIN.MODEL_RESUME or something of that nature, which technically renders BEGIN_EPOCH useless
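
One way this could work, sketched under assumptions: if the checkpoint stores the last completed epoch alongside the weights, a single `TRAIN.MODEL_RESUME` path is enough and the starting epoch is derived rather than configured. `TRAIN.MODEL_RESUME` is the key proposed above and does not exist yet; `save_checkpoint`, `load_resume_state`, and the checkpoint layout are hypothetical, and `pickle` stands in for whatever serialization the project actually uses.

```python
import pickle
import tempfile
from pathlib import Path


def save_checkpoint(path, weights, epoch):
    """Store the weights together with the last completed epoch."""
    with open(path, "wb") as f:
        pickle.dump({"weights": weights, "epoch": epoch}, f)


def load_resume_state(cfg):
    """Return (weights, begin_epoch); start fresh when MODEL_RESUME is unset."""
    resume_path = cfg["TRAIN"].get("MODEL_RESUME")  # hypothetical key
    if not resume_path:
        return None, 0
    with open(resume_path, "rb") as f:
        state = pickle.load(f)
    # Resume at the epoch after the last one that completed.
    return state["weights"], state["epoch"] + 1


path = Path(tempfile.mkdtemp()) / "checkpoint.pkl"
save_checkpoint(path, {"w": 1.5}, epoch=99)  # 100 epochs done, zero-indexed
weights, begin_epoch = load_resume_state({"TRAIN": {"MODEL_RESUME": str(path)}})
fresh_weights, fresh_begin = load_resume_state({"TRAIN": {}})
```

The trade-off is that older checkpoints without a stored epoch would still need an explicit starting epoch.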