Doodleverse/segmentation_gym

Add config option `initial_epoch` to restore model checkpoint and position in the LR scheduler

dbuscombe-usgs opened this issue · 4 comments

When things go awry during model training (for example a power cut, memory leak, or other unexpected issue that interrupts a lengthy run), there is currently no way to resume training from where it stopped.

HOT_START could be used to restore the model weights and resume training from the first epoch. However, the LR scheduler would also start again from the beginning, which negates the point of the scheduler. In fact, restarting from refined weights without adjusting the LR scheduler could cause convergence problems.
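To illustrate the problem, here is a minimal sketch of an epoch-dependent schedule. The function name, the ramp-up/decay shape, and every parameter value are assumptions made for illustration, not the settings used in this repository.

```python
# Hypothetical schedule parameters, chosen only for illustration.
START_LR = 1e-7
MAX_LR = 1e-4
MIN_LR = 1e-7
RAMPUP_EPOCHS = 20
EXP_DECAY = 0.9


def lr_for_epoch(epoch):
    """Return the learning rate for a zero-based epoch index."""
    if epoch < RAMPUP_EPOCHS:
        # Linear ramp from START_LR up to MAX_LR.
        return START_LR + (MAX_LR - START_LR) * epoch / RAMPUP_EPOCHS
    # Exponential decay from MAX_LR towards MIN_LR after the ramp.
    decay_steps = epoch - RAMPUP_EPOCHS
    return MIN_LR + (MAX_LR - MIN_LR) * EXP_DECAY ** decay_steps


# A run interrupted at epoch 60 and restarted from epoch 0 would repeat
# the ramp-up, instead of continuing with the small, decayed rate.
lr_if_restarted = lr_for_epoch(0)
lr_peak = lr_for_epoch(RAMPUP_EPOCHS)
lr_if_resumed = lr_for_epoch(60)
```

With a schedule like this, a restart from epoch 0 climbs back to the peak rate over the ramp-up epochs, which is much larger than the rate the refined weights were last trained with.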

To avoid this situation, the code could be modified as follows:

  1. add new parameter INITIAL_EPOCH to the config file
  2. If absent, it would default to zero
  3. model.fit would use INITIAL_EPOCH as argument to the initial_epoch parameter
  4. if HOT_START is specified but INITIAL_EPOCH, the program should exit with a message for the user
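Steps 1 to 3 above could look roughly like the sketch below. The helper name `fit_kwargs_from_config` and the `MAX_EPOCHS` key are assumptions for illustration, and the consistency check from step 4 is left out. It relies on documented Keras behaviour: when `initial_epoch` is set, `epochs` is treated as the index of the final epoch, not as the number of additional epochs.

```python
def fit_kwargs_from_config(config):
    """Build the epoch-related keyword arguments for model.fit.

    `config` is a dict loaded from the config file. The key names
    INITIAL_EPOCH and MAX_EPOCHS are assumed here for illustration.
    """
    # Step 2: a missing INITIAL_EPOCH defaults to zero (train from scratch).
    initial_epoch = int(config.get("INITIAL_EPOCH", 0))
    max_epochs = int(config["MAX_EPOCHS"])

    # Guard against values that would leave nothing to train.
    if initial_epoch < 0 or initial_epoch >= max_epochs:
        raise ValueError(
            "INITIAL_EPOCH must be >= 0 and smaller than MAX_EPOCHS"
        )

    # Step 3: Keras treats `epochs` as the final epoch index when
    # `initial_epoch` is given, so MAX_EPOCHS is passed unchanged.
    return {"epochs": max_epochs, "initial_epoch": initial_epoch}


# Example: resume at epoch 40 of a 100-epoch run.
resume_kwargs = fit_kwargs_from_config({"MAX_EPOCHS": 100, "INITIAL_EPOCH": 40})
# model.fit(train_ds, validation_data=val_ds, callbacks=callbacks, **resume_kwargs)
```

Because Keras passes the absolute epoch index to its callbacks, a `LearningRateScheduler` callback would then pick up the schedule at epoch 40 instead of epoch 0.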

The Keras model.fit options, including the description of initial_epoch, are listed here: https://keras.io/api/models/model_training_apis/

This should be a straightforward fix.

The one downside I see is that the full training history for the model, currently provided in the output file ..model_history.npz, would not be available, because that file is only created after training completes successfully. I do not see a workaround, however.

Implemented in 809466a

Leaving open to add to wiki docs

now added to wiki

closing