joeddav/devol

Out of memory exceptions cause next model to fail

joeddav opened this issue · 4 comments

Future models can't even compile when the model before them runs out of memory, resulting in termination of the entire process. It appears to be a GPU memory allocation issue.

In the meantime, try not to allow more parameters than your system is equipped to train.

@joeddav

I'm running into this problem trying to replicate your MNIST results (in the "Running max of MNIST accuracies across 20 generations" figure).

I get

"Resource exhausted: OOM when allocating tensor with shape[1024,256]"

after running for a while (this time on the 811th candidate solution).

I am running on a Titan X with 12 GB of VRAM. Do you think it's possible TensorFlow/Keras is having a problem freeing memory? If so, do you know of a way to explicitly force memory to be freed after each candidate evaluation?

Regards,

Alex

Yes, I think the memory is not being deallocated immediately after the error. The puzzling thing is that the error happens when the next model is being compiled, not trained, so it has to be some kind of allocation issue. We could try calling gc.collect() after catching the OOM error, but I'm not sure whether that will actually make an impact in this case.
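
For reference, a rough sketch of what catching the OOM and collecting afterward might look like. `evaluate_candidate`, `build_fn`, `x_train`, and `y_train` are hypothetical names for illustration, not devol's actual API:

```python
import gc

import tensorflow as tf
from keras import backend as K

def evaluate_candidate(build_fn, x_train, y_train):
    """Evaluate one candidate model, cleaning up GPU memory on OOM.

    build_fn, x_train, and y_train are placeholders for however the
    genetic search constructs and trains each candidate.
    """
    try:
        model = build_fn()
        model.fit(x_train, y_train, epochs=1, verbose=0)
        return model.evaluate(x_train, y_train, verbose=0)
    except tf.errors.ResourceExhaustedError:
        # Force a garbage collection so the failed model's GPU
        # allocations are released before the next candidate is
        # compiled. clear_session() is not mentioned in this thread;
        # it's included here as an assumed extra precaution.
        K.clear_session()
        gc.collect()
        return None
```
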

Calling gc.collect() after the first OOM exception seems to solve it.

@joeddav

Thanks for the quick response - I'll pull down your changes and try it later in the week!

Regards,

Alex