About checkpoints
Hi, thanks for your great work. I have a question about checkpoints.
I looked at the config files and found that you use mode=max in latest_checkpoint.yaml, but I can't find it in last_checkpoint.yaml.
If both use the same metric, I think it should be removed from latest_checkpoint.yaml (if the metric is an error or a loss).
What do you think about this?
Additionally, I want to know which one is the best model. Is last.ckpt the best model according to the metrics (validation error or loss)?
Thanks.
Hello @dwro0121,
To clarify:
last_checkpoint.yaml: After the end of each epoch, save the current checkpoint as latest-{epoch}.ckpt (and delete the previous one). At the end of training, save the last one as last.ckpt.
latest_checkpoint: Every X epochs (200 by default), save the checkpoint (and keep all others).
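To make the difference between the two policies concrete, here is a minimal pure-Python sketch of which files would remain on disk under each one. It is only a simulation, not the repository's checkpoint code: the function names, the zero-based epoch numbering, and the epoch-{epoch}.ckpt file name used for the periodic policy are assumptions made for illustration.

```python
from typing import List


def keep_latest_only(num_epochs: int) -> List[str]:
    """Per-epoch policy: each save replaces the previous file; last.ckpt is added at the end."""
    on_disk: List[str] = []
    for epoch in range(num_epochs):
        # Saving latest-{epoch}.ckpt deletes the previous latest-* file.
        on_disk = [f"latest-{epoch}.ckpt"]
    # Written once, when training finishes.
    on_disk.append("last.ckpt")
    return on_disk


def keep_every_x(num_epochs: int, every: int = 200) -> List[str]:
    """Periodic policy: save every `every` epochs and never delete earlier files."""
    # The epoch-{epoch}.ckpt name is hypothetical; epochs are assumed zero-based.
    return [
        f"epoch-{epoch}.ckpt"
        for epoch in range(num_epochs)
        if (epoch + 1) % every == 0
    ]
```

Under these assumptions, a 1000-epoch run leaves two files from the first policy (the most recent latest-* file and last.ckpt) and five files from the second.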
Actually, the config files do not matter much here: the default behaviour with monitor: None is to save the last checkpoint.
I will remove monitor: step and mode: max from latest_checkpoint, since they do the same thing as the default; it will be clearer that way. Thanks for pointing this out to me.
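The reason monitor: step with mode: max is redundant can be sketched in a few lines: the step counter only increases, so the checkpoint with the maximum step is always the most recent one. The snippet below is a simplified stand-in for a keep-the-best-one selection rule, not the actual checkpoint callback; select_checkpoint, the history list, and the val_loss values are hypothetical.

```python
from typing import Dict, List, Optional


def select_checkpoint(
    history: List[Dict[str, float]],
    monitor: Optional[str],
    mode: str = "min",
) -> int:
    """Return the index of the checkpoint a keep-the-best-one policy would retain."""
    if monitor is None:
        # No monitored quantity: keep the most recent checkpoint.
        return len(history) - 1
    values = [entry[monitor] for entry in history]
    best = max(values) if mode == "max" else min(values)
    return values.index(best)


# Hypothetical training history, one entry per saved checkpoint.
history = [
    {"step": 100, "val_loss": 0.9},
    {"step": 200, "val_loss": 0.4},
    {"step": 300, "val_loss": 0.6},
]

no_monitor = select_checkpoint(history, monitor=None)
by_step = select_checkpoint(history, monitor="step", mode="max")
by_val_loss = select_checkpoint(history, monitor="val_loss", mode="min")
```

Here no_monitor and by_step pick the same (final) entry, while by_val_loss picks the middle one. Only a validation metric would select something other than the last checkpoint, and that is not the criterion used here.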
To simplify things, the best model is last.ckpt, the checkpoint after a full training run. I am not using the validation metric/loss to choose the best model.
Questions have been resolved. Thank you for the reply.