Stopping and restarting training script starts from scratch

Question

turowicz opened this issue 4 years ago · 3 comments

turowicz commented 4 years ago

Hey,

I'm having a serious issue each time I stop training after a checkpoint has been created and then run evaluation.

It seems like the restarted job picks up the checkpoint step number but starts learning from scratch.

Cheers
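
One way to confirm the symptom described above is to compare the loss logged just before stopping with the loss logged right after the restart. The following is only a minimal sketch: it assumes the loss values have been copied by hand from the training logs, and the function name `looks_like_reset` and its arguments are hypothetical, not part of any library.

```python
from statistics import mean


def looks_like_reset(losses_before, losses_after, initial_loss):
    """Heuristic check (hypothetical helper, not a library function).

    Returns True when the average loss after the restart is closer to the
    loss seen at the very start of training than to the loss reached just
    before the job was stopped.
    """
    if not losses_before or not losses_after:
        raise ValueError("need at least one loss value on each side of the restart")

    before = mean(losses_before)  # average loss just before stopping
    after = mean(losses_after)    # average loss just after restarting

    # Distance of the post-restart loss from the two reference points.
    gap_to_before = abs(after - before)
    gap_to_initial = abs(after - initial_loss)

    return gap_to_initial < gap_to_before
```

If this returns True while the step counter continues from the checkpoint, the step number was probably restored but some or all of the model weights were not.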

Answer 1 · 2020-09-15T03:29:04.000Z

@turowicz, this certainly is odd behaviour. Can you please specify the commands you used for training and evaluation? A copy of your config file would also be helpful.

Answer 2 · 2020-09-15T06:42:21.000Z

More details here:

tensorflow/models#9229 (comment)

Answer 3 · 2021-06-15T13:13:07.000Z

@sglvladi that was EfficientNet causing the issue