About checkpoints

Question

Closed this issue 2 years ago · 2 comments

Hi, thanks for your great work. I have a question about checkpoints.

I looked at the config files and found that mode=max is set in latest_checkpoint.yaml, but not in last_checkpoint.yaml.
If both use the same metric, I think it should be removed from latest_checkpoint.yaml (at least if the metric is an error or a loss, where lower is better).

What do you think about this?

Additionally, I want to know which checkpoint is the best model. Is last.ckpt the best model according to the metrics (validation error or loss)?

Thanks.

Answer 1 · 2022-05-28T18:13:54.000Z

Hello @dwro0121,

To clarify:

last_checkpoint.yaml: at the end of each epoch, save the current checkpoint as latest-{epoch}.ckpt (and delete the previous one). At the end of training, save the final one as last.ckpt.
latest_checkpoint.yaml: every X epochs (200 by default), save the checkpoint (and keep all the others).
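
The two retention policies described above can be sketched in plain Python. This is only a simulation that tracks file names, not the project's actual code: the function names `rolling_latest` and `periodic_keep_all` and the `epoch-{epoch}.ckpt` file name pattern are hypothetical, and epochs are counted from 1 for simplicity.

```python
def rolling_latest(num_epochs):
    """Save every epoch, drop the previous file, add last.ckpt at the end."""
    files = []
    for epoch in range(1, num_epochs + 1):
        # The new checkpoint replaces the one from the previous epoch.
        files = [f"latest-{epoch}.ckpt"]
    if num_epochs > 0:
        # At the end of training, the final state is also saved as last.ckpt.
        files.append("last.ckpt")
    return files


def periodic_keep_all(num_epochs, every_n_epochs=200):
    """Save every `every_n_epochs` epochs and never delete anything."""
    files = []
    for epoch in range(1, num_epochs + 1):
        if epoch % every_n_epochs == 0:
            # Hypothetical file name pattern, used for illustration only.
            files.append(f"epoch-{epoch}.ckpt")
    return files
```

Under these assumptions, a 450-epoch run would leave one rolling file plus last.ckpt from the first policy, and two periodic snapshots (epochs 200 and 400) from the second.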

Actually, for the config files it does not matter much: the default behaviour with monitor: None is to save the last checkpoint.
I will remove monitor: step and mode: max from latest_checkpoint, since they do the same thing as the default, and the config will be clearer without them. Thanks for pointing this out to me.
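
One way to see why monitoring the step with mode max ends up equivalent to keeping the most recent checkpoint: the step counter only grows during training, so the checkpoint with the largest monitored value is always the newest one. A small sketch, where the `best_by` helper, the file names, and the step values are all made up for illustration:

```python
def best_by(checkpoints, mode):
    """Return the name of the checkpoint with the max (or min) monitored value."""
    pick = max if mode == "max" else min
    return pick(checkpoints, key=lambda item: item[1])[0]


# (file name, monitored step value) in the order the checkpoints were saved.
saved = [("epoch-0.ckpt", 100), ("epoch-1.ckpt", 200), ("epoch-2.ckpt", 300)]

most_recent = saved[-1][0]
selected = best_by(saved, mode="max")
```

Here `selected` and `most_recent` name the same file. With a loss or error as the monitored value, mode max would instead pick the worst checkpoint, which is the concern raised in the question.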

To keep things simple, the best model is last.ckpt, the checkpoint saved after a full training run. I am not using the validation metric or loss to choose the best model.

Answer 2 · 2022-05-30T11:54:07.000Z

Questions have been resolved. Thank you for the reply.