r-three/t-few

Which epoch/training step does the finish.pt checkpoint belong to?

stefanhgm opened this issue · 3 comments

Hi everyone!

When I run the experiments, the model is validated every eval_epoch_interval epochs and a checkpoint is written out as global_stepXXXXX.pt. At the end, a final checkpoint named finish.pt is also written out. I assumed this one belongs either to the best intermediate validation performance or to the last epoch. However, from comparing it with the other checkpoints that were created, it seems that finish.pt differs from all global_stepXXXXX.pt checkpoints, so I am wondering which point in training finish.pt belongs to.

Sorry if I am missing something obvious here.

Best,
Stefan

dptam commented

Hi Stefan,

finish.pt corresponds to the final model after training for 1000 steps. Since the total number of steps is usually not divisible by the number of steps in an epoch, the last epoch will usually not correspond to the model after training for 1000 steps, but to the model after training for the greatest multiple of (the number of steps in an epoch) less than 1000.
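As a rough illustration of that arithmetic: the sketch below uses a made-up value of 32 steps per epoch, and `last_epoch_boundary` is a hypothetical helper written for this comment, not a function from the t-few code.

```python
def last_epoch_boundary(total_steps: int, steps_per_epoch: int) -> int:
    """Return the greatest multiple of steps_per_epoch that does not exceed total_steps."""
    if steps_per_epoch <= 0:
        raise ValueError("steps_per_epoch must be positive")
    return (total_steps // steps_per_epoch) * steps_per_epoch


total_steps = 1000      # training length mentioned above
steps_per_epoch = 32    # assumed value, for illustration only

boundary = last_epoch_boundary(total_steps, steps_per_epoch)
remaining = total_steps - boundary

# The last full epoch ends at step 992; finish.pt would be written 8 steps later.
print(boundary, remaining)  # 992 8
```

With these assumed numbers, the last epoch-aligned checkpoint and finish.pt differ by 8 training steps, which would explain why their weights do not match.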

All our experiments are reported using the finish.pt checkpoint.

Yes, sorry it wasn't clear in our setup. Let me know if anything is still unclear.

Hi Derek,

thanks for your quick answer! Okay, that makes sense, and I have now also found the corresponding parts in the code. I want to use the number of steps as an additional hyperparameter, so I need to pick the best checkpoint to run on my test set. I got that working now!

Thanks again for making your code available!

Hi Stefan,
How did you pick the best checkpoint for your test set? By accuracy? By score_gt? Thanks! @stefanhgm

Best wishes,
Caffrey