Chung-I/Variational-Recurrent-Autoencoder-Tensorflow

Training time

zzaibi opened this issue · 5 comments

It's been training on a single-GPU machine for 3 days, and there is no sign of finishing. How long should it take?

It trains forever. It saves periodically, and you can kill it whenever you think it's "done" based on the loss level.

Problem is, it only saves a very small number of checkpoints. I can't go back, right?

By default, it checkpoints every 2000 steps, which I think is pretty infrequent. You can change the steps_per_checkpoint in models/config.json to a lower number to make it save more often. Since it saves a checkpoint and prints loss information at the same time, I'd wait until you see new loss information before killing it to minimize the amount of lost training time.
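If you'd rather script the change than edit the file by hand, here is a minimal sketch. It assumes only that models/config.json is ordinary JSON with a steps_per_checkpoint key somewhere in it; I haven't checked how the file is nested, so the helper searches the whole structure. The function names are made up for illustration and are not part of the repo.

```python
import json


def set_key(node, key, value):
    """Set `key` to `value` wherever it appears in nested JSON data.

    Returns the number of places that were updated.
    """
    hits = 0
    if isinstance(node, dict):
        for k in node:
            if k == key:
                node[k] = value
                hits += 1
            else:
                hits += set_key(node[k], key, value)
    elif isinstance(node, list):
        for item in node:
            hits += set_key(item, key, value)
    return hits


def set_checkpoint_interval(path, steps):
    """Rewrite the JSON config at `path` so steps_per_checkpoint equals `steps`."""
    with open(path) as f:
        config = json.load(f)
    # Assumption: the key exists somewhere in the file; fail loudly if it does not.
    if set_key(config, "steps_per_checkpoint", steps) == 0:
        raise KeyError("steps_per_checkpoint not found in " + path)
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
```

Calling set_checkpoint_interval("models/config.json", 500) before starting training would make it save (and print loss) every 500 steps instead of every 2000. Note that this rewrites the file with two-space indentation, so the formatting may differ from the original.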

Understood. The problem is that it seems to delete old checkpoints and keep only the most recent few. I guess I've lost the optimal point.

Oh, ok. I looked into it, and if you change this line locally to include a max_to_keep argument, it should stop deleting old checkpoints. I haven't tried it, though, and you'd have to restart training to test it out. You can read the docs for the saver here for more details.
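To show why the older checkpoints disappear, here is a small pure-Python sketch of a keep-the-newest-N policy. It is not TensorFlow's code, just an illustration of the behavior: in TensorFlow 1.x, tf.train.Saver defaults to max_to_keep=5, and passing None or 0 keeps every checkpoint. The commented-out Saver call is an assumption about how the repo builds its saver, since the actual line isn't quoted in this thread.

```python
from collections import deque


def retained_checkpoints(save_steps, max_to_keep=5):
    """Return the checkpoint steps left on disk under a keep-the-newest-N policy."""
    if not max_to_keep:  # None or 0: nothing is ever deleted
        return list(save_steps)
    # A bounded deque drops the oldest entry each time a new one arrives.
    return list(deque(save_steps, maxlen=max_to_keep))


# Ten saves at the default interval of 2000 steps.
saves = range(2000, 20001, 2000)

print(retained_checkpoints(saves))
# [12000, 14000, 16000, 18000, 20000]

print(len(retained_checkpoints(saves, max_to_keep=None)))
# 10

# Hypothetical form of the change in the model code (TensorFlow 1.x, untested):
# saver = tf.train.Saver(tf.global_variables(), max_to_keep=None)
```

Keeping everything will use a lot more disk space on a long run, so a larger finite value for max_to_keep may be the more practical choice.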