Add config option `initial_epoch` to restore model checkpoint and position in the LR scheduler
dbuscombe-usgs opened this issue · 4 comments
When things go awry during model training, for example a power cut, memory leak, or other unexpected issue that interrupts lengthy training, there is currently no way to resume training where it stopped.
`HOT_START` could be used to restore model weights and resume training from the first epoch. However, the LR scheduler would then start again from the beginning, negating the point of the LR scheduler. In fact, restarting the model with refined weights without adjusting the LR scheduler could create unwanted model convergence issues.
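To illustrate why the scheduler position matters, here is a minimal sketch of an epoch-indexed schedule with a ramp-up followed by exponential decay. The function name and all constants are hypothetical, chosen only for illustration; they are not taken from the project's config. The point is that the learning rate depends only on the epoch index, so a restart at epoch 0 sends already-refined weights back through the ramp-up.

```python
# Hypothetical schedule constants, for illustration only.
START_LR = 1e-7
MAX_LR = 1e-4
MIN_LR = 1e-7
RAMPUP_EPOCHS = 5
EXP_DECAY = 0.9


def lr_schedule(epoch, lr=None):
    """Return the learning rate for a given epoch index.

    The second argument is accepted (and ignored) so the function matches
    the (epoch, lr) form used by Keras' LearningRateScheduler callback.
    """
    if epoch < RAMPUP_EPOCHS:
        # Linear ramp from START_LR towards MAX_LR.
        return START_LR + (MAX_LR - START_LR) / RAMPUP_EPOCHS * epoch
    # Exponential decay from MAX_LR towards MIN_LR.
    return (MAX_LR - MIN_LR) * EXP_DECAY ** (epoch - RAMPUP_EPOCHS) + MIN_LR


# Restarting at epoch 0 versus resuming at epoch 40 gives very different rates.
print(lr_schedule(0), lr_schedule(40))
```

With `initial_epoch=40`, Keras numbers the resumed epochs from 40, so a callback built on a function like this would pick up the schedule where it left off rather than at the start of the ramp.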
To avoid this situation, the code could be modified as follows:
- add a new parameter `INITIAL_EPOCH` to the config file
- if absent, it would default to zero
- `model.fit` would use `INITIAL_EPOCH` as the argument to the `initial_epoch` parameter
- if `HOT_START` is specified but `INITIAL_EPOCH` is not, the program should exit with a message for the user
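The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the helper name `resolve_initial_epoch`, the use of a plain dict for the parsed config, and the wording of the exit message are all hypothetical, and the rule "exit when `HOT_START` is given without `INITIAL_EPOCH`" is one reading of the last bullet.

```python
import sys


def resolve_initial_epoch(config):
    """Return the epoch to resume from, based on a parsed config dict.

    `config` is assumed to be a dict loaded from the config file; the keys
    HOT_START and INITIAL_EPOCH follow the names proposed in this issue.
    """
    hot_start = config.get("HOT_START")
    initial_epoch = config.get("INITIAL_EPOCH")

    # Restoring weights without saying where to resume would restart the
    # LR scheduler from the beginning, so stop with a message instead.
    if hot_start and initial_epoch is None:
        sys.exit(
            "HOT_START is set but INITIAL_EPOCH is missing: "
            "add INITIAL_EPOCH to the config file to resume training."
        )

    # An absent INITIAL_EPOCH defaults to zero (a fresh run).
    return 0 if initial_epoch is None else int(initial_epoch)


# Intended use (not executed here, `model` and `train_ds` are placeholders):
# model.fit(train_ds, epochs=max_epochs,
#           initial_epoch=resolve_initial_epoch(config))
```

Note that when `initial_epoch` is used, Keras treats `epochs` as the index of the final epoch rather than the number of additional epochs to run, so the config's total epoch count would stay unchanged on a resumed run.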
Keras' `model.fit` options, listed here, include the description for `initial_epoch`:
https://keras.io/api/models/model_training_apis/
should be a straightforward fix
The one downside I see is that the full training history for the model, currently provided in the output file `..model_history.npz`, would not be available. It is only created after model training completes successfully. I do not see a workaround, however ....
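One possible partial mitigation, offered only as a sketch and not something the current code does, would be to append per-epoch metrics to a file as training proceeds, so that whatever was logged before an interruption survives. The helper below is hypothetical and uses only the standard library; Keras also ships a `CSVLogger` callback with an `append` option that serves a similar purpose.

```python
import csv
import os


def append_epoch_metrics(path, epoch, metrics):
    """Append one row of metrics for `epoch` to a CSV file at `path`.

    The header is written only when the file is new or empty, so the same
    file can keep growing across an interrupted run and its resumption.
    """
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    fieldnames = ["epoch"] + sorted(metrics)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow({"epoch": epoch, **metrics})
```

This would not reproduce the `.npz` history file, but the rows written before and after a restart could be combined afterwards into a single record of the run.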
Implemented in 809466a
Leaving open to add to wiki docs
now added to wiki
closing