Training time

Question

zzaibi opened this issue 8 years ago · 5 comments

zzaibi commented 8 years ago

It's been training on a single-GPU machine for 3 days, and there is no sign of finishing. How long should it take?

Answer 1 · 2017-12-29T19:40:50.000Z

It trains forever. It saves periodically, and you can kill it whenever you think it's "done" based on the loss level.

Answer 2 · 2017-12-29T20:48:20.000Z

Problem is, it only saves a very small number of checkpoints. I can't go back, right?

Answer 3 · 2017-12-29T22:16:07.000Z

By default, it checkpoints every 2000 steps, which I think is pretty infrequent. You can lower steps_per_checkpoint in models/config.json to make it save more often. Since it saves a checkpoint and prints loss information at the same time, I'd wait until you see new loss information before killing it. That way you lose as little training time as possible.
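If you'd rather script the config change than edit the file by hand, something like this should work. It's only a sketch: it assumes steps_per_checkpoint is a top-level key in the JSON file, and the helper name `set_steps_per_checkpoint` is made up for illustration.

```python
import json


def set_steps_per_checkpoint(config_path, steps):
    """Rewrite the JSON config so checkpoints are saved every `steps` steps.

    Assumption: "steps_per_checkpoint" is a top-level key in the file.
    """
    with open(config_path) as f:
        config = json.load(f)

    # Fail loudly instead of silently adding a key the trainer never reads.
    if "steps_per_checkpoint" not in config:
        raise KeyError("steps_per_checkpoint not found at the top level")

    config["steps_per_checkpoint"] = steps

    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    return config
```

For example, `set_steps_per_checkpoint("models/config.json", 500)` would save four times as often as the default of 2000. The change is only picked up when training is started again.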

Answer 4 · 2017-12-29T22:54:50.000Z

Understood. The problem is that it seems to delete old checkpoints and keep only the most recent few. I guess I've lost the optimal point.

Answer 5 · 2017-12-29T23:01:52.000Z

Oh, ok. I looked into it, and if you change this line locally to include a max_to_keep argument, it should stop deleting old checkpoints. I haven't tried it, though, and you'd have to restart training to test it out. You can read the docs for the saver here for more details.
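To show what max_to_keep does, here is a small standalone sketch. It is not this repo's training code: it assumes the saver is a TensorFlow `tf.train.Saver` (written as `tf.compat.v1.train.Saver` so it also runs on TensorFlow 2.x), and the function name, variable, and file prefix are made up for the demo. The Saver keeps only the 5 most recent checkpoints by default; passing `max_to_keep=None` (or 0) keeps all of them.

```python
import glob
import os

import tensorflow as tf

tf1 = tf.compat.v1  # TF1-style API; on TensorFlow 1.x, plain `tf` works


def save_n_checkpoints(save_dir, n_saves, max_to_keep):
    """Save n_saves checkpoints and return the ones still on disk."""
    with tf.Graph().as_default():
        step = tf1.Variable(0, name="step")  # stand-in for model weights
        increment = step.assign_add(1)

        # max_to_keep=None (or 0) disables deletion of older checkpoints.
        saver = tf1.train.Saver(max_to_keep=max_to_keep)

        with tf1.Session() as sess:
            sess.run(tf1.global_variables_initializer())
            for i in range(n_saves):
                sess.run(increment)
                saver.save(sess, os.path.join(save_dir, "model.ckpt"),
                           global_step=i)

    # Each surviving checkpoint has exactly one .index file.
    return sorted(glob.glob(os.path.join(save_dir, "model.ckpt-*.index")))
```

Calling this with `max_to_keep=2` leaves only the last two checkpoints in the directory, while `max_to_keep=None` leaves every one. Keeping everything can use a lot of disk space on a long run, so a large finite number may be the safer choice.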