microsoft/Semi-supervised-learning

Resume Aim tracking when resuming training

adamtupper opened this issue · 0 comments

Bug

When training is resumed with the AimHook enabled, a new Aim run is created instead of the previously started run being resumed. As a result, a single training run that is interrupted and then resumed is incorrectly recorded in Aim as multiple separate runs.

Reproduce the Bug

Start any training job with the AimHook enabled, interrupt it partway through, and then resume from the checkpoint. In Aim, the job appears as two runs, one for each segment.

Error Messages and Logs

N/A

Proposed Fix

I have implemented a simple fix that, when the AimHook is in use, saves the run hash when a checkpoint is saved and restores it when a checkpoint is loaded. The AimHook then checks whether the model/algorithm has an attribute aim_run_hash to decide whether tracking should be started or resumed.
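
To illustrate the idea, here is a minimal sketch of the save/load/resume flow, not the actual patch. `StubRun` is a stand-in for `aim.Run` so the example runs without Aim installed (as far as I know, `aim.Run(run_hash=...)` reopens an existing run and `run.hash` returns its hash). The `Algorithm` and `AimHookSketch` classes, the `before_run` method, and the checkpoint layout are simplified, hypothetical versions of the real ones; only the `aim_run_hash` attribute name comes from the fix described above.

```python
import uuid


class StubRun:
    """Stand-in for aim.Run (assumption: the real one accepts run_hash=...)."""

    def __init__(self, run_hash=None):
        # A new hash means a new run; a given hash means the run is resumed.
        self.hash = run_hash if run_hash is not None else uuid.uuid4().hex


class Algorithm:
    """Hypothetical, heavily simplified trainer with checkpointing."""

    def __init__(self):
        self.it = 0

    def save_checkpoint(self):
        ckpt = {"it": self.it}
        # Only store the hash if the AimHook has set it.
        if hasattr(self, "aim_run_hash"):
            ckpt["aim_run_hash"] = self.aim_run_hash
        return ckpt

    def load_checkpoint(self, ckpt):
        self.it = ckpt["it"]
        # Older checkpoints will not contain the key, so check first.
        if "aim_run_hash" in ckpt:
            self.aim_run_hash = ckpt["aim_run_hash"]


class AimHookSketch:
    """Hypothetical hook: resume tracking if a run hash is available."""

    def before_run(self, algorithm):
        run_hash = getattr(algorithm, "aim_run_hash", None)
        self.run = StubRun(run_hash=run_hash)
        # Remember the hash so the next checkpoint includes it.
        algorithm.aim_run_hash = self.run.hash


# First segment: a fresh run is started and its hash is checkpointed.
first = Algorithm()
first_hook = AimHookSketch()
first_hook.before_run(first)
checkpoint = first.save_checkpoint()

# Second segment: the hash is restored, so the same run is reused.
resumed = Algorithm()
resumed.load_checkpoint(checkpoint)
resumed_hook = AimHookSketch()
resumed_hook.before_run(resumed)
```

With this flow, the hook in the resumed segment ends up with the same run hash as the first segment, while a training job started without a checkpoint still gets a new run.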