anniesch/jtt

Memory Leak

Opened this issue · 2 comments

Whenever I run either the ERM or upweighted training routines I encounter a memory leak during the training epochs. There is no memory leaked during the validation or test epochs.

Initial runs leak around 450MB per epoch
Upweighted runs leak around 2410MB per epoch
Both the initial and JTT runs, which use the same batch size of 64, leak roughly 180KB per batch.

There are some training-only instructions in the run_epoch function in train.py that involve the loss calculator and the csv logger. I'm pretty confident there's nothing in the csv logger code that would cause a memory leak. I'm less confident about the loss calculator; however, I've yet to find anything in it that looks like it would leak memory.
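For anyone auditing the loss calculator or csv logger: the classic PyTorch version of this leak is appending the loss tensor itself (which drags its whole autograd graph along) instead of a plain float. Here's a minimal stdlib-only sketch of that pattern — `Tensor` is a hypothetical stand-in for a framework tensor, not anything from this repo:

```python
import weakref

class Tensor:
    """Stand-in for a framework tensor that pins a large buffer (e.g. an autograd graph)."""
    def __init__(self):
        self.grad_graph = bytearray(10**6)  # pretend per-batch autograd state

leaky_log = []   # mimics logger.append(loss)        -- keeps every tensor alive
fixed_log = []   # mimics logger.append(loss.item()) -- keeps only a float
refs = []        # weak references let us observe which tensors survive

for step in range(5):
    loss = Tensor()
    refs.append(weakref.ref(loss))
    leaky_log.append(loss)   # leak: the list holds the tensor and its buffer forever
    fixed_log.append(0.5)    # fix: store a plain Python float instead
    del loss                 # our local reference is gone, but leaky_log's is not

alive = sum(r() is not None for r in refs)
print(alive)        # 5 -- every "tensor" is still pinned by leaky_log

leaky_log.clear()   # dropping the references frees the buffers immediately
print(sum(r() is not None for r in refs))  # 0
```

Because this only happens on lines that run during training batches, it would match the symptom of leaking during training epochs but not validation/test.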

I got a memory leak when training on MultiNLI. I changed the `num_workers` parameter in the dataloader to 0, and now it works! I don't know if that helps!

This was when training on CelebA. I fixed it, though: I believe they forgot to release a reference to a tensor in the data logger, so the logger kept it (and everything it pointed to) alive across batches.