Add config option `initial_epoch` to restore model checkpoint and position in the LR scheduler

Question

dbuscombe-usgs opened this issue 2 years ago · 4 comments

When things go awry during model training, for example a power cut, memory leak, or other unexpected issue that interrupts a lengthy run, there is currently no way to resume training from where it stopped.

HOT_START could be used to restore the model weights and resume training, but training would begin again at the first epoch, so the LR scheduler would also restart from its beginning, negating the point of the LR scheduler. In fact, restarting the model with refined weights without adjusting the LR scheduler could cause unwanted model convergence issues.
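
To make the problem concrete, here is a minimal sketch in plain Python. The exponential-decay `schedule` and its constants are hypothetical stand-ins, not the project's actual scheduler. The only assumption taken from Keras is that an epoch-based schedule is called with the epoch index, and that the epoch loop runs from `initial_epoch` up to `epochs`.

```python
# Hypothetical schedule for illustration only; the real scheduler may differ.
START_LR = 1e-3   # assumed starting learning rate
DECAY = 0.9       # assumed per-epoch decay factor


def schedule(epoch, lr=None):
    """Epoch-based schedule with the (epoch, lr) signature Keras passes to
    LearningRateScheduler; the current lr is ignored here."""
    return START_LR * DECAY ** epoch


def learning_rates(initial_epoch, epochs):
    """Learning rates seen when the epoch loop runs from initial_epoch to epochs."""
    return [schedule(epoch) for epoch in range(initial_epoch, epochs)]


# Training was interrupted after 40 of 100 epochs.
restarted = learning_rates(initial_epoch=0, epochs=100)   # weights restored, epoch counter reset
resumed = learning_rates(initial_epoch=40, epochs=100)    # weights and epoch counter restored

print("first LR after restart from epoch 0:", restarted[0])
print("first LR after resume at epoch 40: ", resumed[0])
```

With the epoch counter reset, the already-refined weights are hit with the largest learning rate in the schedule; with the counter restored, training continues at the much smaller rate the schedule had reached.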

To avoid this situation, the code could be modified as follows:

Add a new parameter, INITIAL_EPOCH, to the config file.
If absent, it would default to zero.
model.fit would be called with INITIAL_EPOCH as the value of its initial_epoch argument.
If HOT_START is specified but INITIAL_EPOCH is not, the program should exit with a message for the user.
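
The steps above could be sketched roughly as follows. This is not the committed implementation: the function names and the `MAX_EPOCHS` key are hypothetical, the config is assumed to be an already-parsed dictionary, and the exit condition assumes the last item means "HOT_START is given without INITIAL_EPOCH".

```python
import sys


def resolve_initial_epoch(config):
    """Return the epoch at which training should start (0 when not resuming)."""
    hot_start = config.get("HOT_START")

    # Assumed reading of the proposal: resuming from saved weights without
    # saying which epoch to resume at is treated as a user error.
    if hot_start and "INITIAL_EPOCH" not in config:
        sys.exit("HOT_START is set but INITIAL_EPOCH is missing from the config file.")

    # Absent INITIAL_EPOCH defaults to zero, i.e. a normal run from scratch.
    initial_epoch = int(config.get("INITIAL_EPOCH", 0))
    if initial_epoch < 0:
        sys.exit("INITIAL_EPOCH must be zero or a positive integer.")
    return initial_epoch


def build_fit_kwargs(config):
    """Collect the epoch-related keyword arguments for model.fit."""
    return {
        "epochs": int(config["MAX_EPOCHS"]),  # hypothetical config key
        "initial_epoch": resolve_initial_epoch(config),
    }


config = {"MAX_EPOCHS": 100, "HOT_START": "weights.h5", "INITIAL_EPOCH": 40}
fit_kwargs = build_fit_kwargs(config)
print(fit_kwargs)
# The result would then be passed on, e.g. model.fit(train_ds, **fit_kwargs),
# after the HOT_START weights have been loaded into the model.
```

Note that Keras treats `epochs` as the index of the final epoch rather than a count of additional epochs, so with the values above training would run from epoch 40 up to epoch 100.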

Answer 1 · 2022-11-28T23:49:01.000Z

The Keras model.fit options listed here include a description of initial_epoch: https://keras.io/api/models/model_training_apis/

should be a straightforward fix

Answer 2 · 2022-11-28T23:51:25.000Z

The one downside I see is that the full training history for the model, currently provided in the output file ..model_history.npz, would not be available, because that file is only created after model training completes successfully. I do not see a workaround, however.

Answer 3 · 2022-11-29T00:01:04.000Z

Implemented in 809466a

Leaving open to add to wiki docs

Answer 4 · 2023-02-24T03:56:16.000Z

now added to wiki

closing