Resume Aim tracking when resuming training
adamtupper opened this issue · 0 comments
Bug
When training is resumed with the AimHook, a new Aim run is created instead of the previously started run being resumed. As a result, a single training run that is interrupted and then resumed is incorrectly recorded in Aim as multiple separate runs.
Reproduce the Bug
Start any training with the AimHook, interrupt it partway through, and then resume it. In Aim, the training is recorded as two runs, one for each segment.
Error Messages and Logs
N/A
Proposed Fix
I have implemented a simple fix that, when the AimHook is in use, saves the run hash when a checkpoint is saved and loads it again when a checkpoint is loaded. The AimHook then checks whether the model/algorithm has an aim_run_hash attribute to decide whether tracking should be started or resumed.
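
To illustrate the idea, here is a minimal, self-contained sketch of the save/load/resume flow. It is not the actual patch: `FakeRun`, `Algorithm`, and `TrackingHook` are hypothetical stand-ins written for this example, and only the attribute name `aim_run_hash` comes from the description above. In a real hook, `FakeRun` would presumably be replaced by Aim's `Run`, which as far as I know can reopen an existing run when given its `run_hash`.

```python
import uuid


class FakeRun:
    """Stand-in for a tracker run (NOT the Aim API). A fresh hash means a new run."""

    def __init__(self, run_hash=None):
        self.resumed = run_hash is not None
        self.hash = run_hash if run_hash is not None else uuid.uuid4().hex[:12]


class Algorithm:
    """Hypothetical model/algorithm object that owns the checkpoint state."""

    def __init__(self):
        self.step = 0

    def state_dict(self):
        state = {"step": self.step}
        # Save the run hash alongside the rest of the checkpoint, if tracking is on.
        run_hash = getattr(self, "aim_run_hash", None)
        if run_hash is not None:
            state["aim_run_hash"] = run_hash
        return state

    def load_state_dict(self, state):
        self.step = state["step"]
        # Restore the run hash so the hook can resume the same run.
        if "aim_run_hash" in state:
            self.aim_run_hash = state["aim_run_hash"]


class TrackingHook:
    """Hypothetical hook: resumes the run if the algorithm carries a run hash."""

    def before_run(self, algorithm):
        run_hash = getattr(algorithm, "aim_run_hash", None)
        self.run = FakeRun(run_hash=run_hash)  # None -> start a new run
        algorithm.aim_run_hash = self.run.hash
```

With this sketch, starting a run, calling `state_dict()`, loading that state into a fresh `Algorithm`, and calling `before_run` again yields a run with the same hash rather than a new one. Checkpoints written without the hook simply omit the key, so loading them still starts a fresh run.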