shubham-goel/4D-Humans

Backbone GPU memory increasing after each batch

DavidBoja opened this issue · 8 comments

Hello,

first of all, thank you very much for sharing the implementation!

I was wondering if you have experienced issues with GPU memory increasing after every batch when using the ViT backbone?
I decreased the batch size to 2, so I start off with about 2 GB of GPU memory. However, after each batch iteration the memory keeps increasing, and it finally blows up after 4 iterations.

I tried tracking where exactly the memory increases, and I found two sources:

  1. After the 32 ViT blocks the memory keeps increasing
  2. After finishing the training_step the memory consumption spikes

I would expect the memory to increase during the first iteration, since PyTorch allocates things lazily, but not during the following iterations.
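For reference, this is roughly how I was tracking the allocation between iterations (a minimal sketch using PyTorch's allocator counters; the backbone call in the comments is just a placeholder for where I put the prints):

import torch

def log_gpu_memory(tag: str) -> None:
    # memory_allocated: memory currently occupied by live tensors.
    # memory_reserved: memory held by PyTorch's caching allocator.
    allocated = torch.cuda.memory_allocated() / 1024 ** 2
    reserved = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"[{tag}] allocated={allocated:.1f} MB | reserved={reserved:.1f} MB")

# Placed around the suspected spots, e.g. inside training_step:
# log_gpu_memory("before ViT blocks")
# features = self.backbone(batch["img"])   # placeholder call
# log_gpu_memory("after ViT blocks")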
Any tips are much appreciated :).

The PyTorch Lightning estimate of the memory allocation for the whole model is:

637 M     Trainable params
0         Non-trainable params
637 M     Total params
2,549.510 Total estimated model params size (MB)

This does make sense, and it is around what gets allocated on the GPU when training starts (2,706 MB is the actual size).
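As a rough sanity check of that number (assuming float32 parameters at 4 bytes each; the extra ~150 MB on the GPU is presumably the CUDA context and buffers):

# 637 M float32 parameters, 4 bytes each:
n_params = 637_000_000
print(n_params * 4 / 1e6)  # ~2548 MB, close to the 2,549.510 MB that Lightning reports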

However, once the training actually starts (when training_step starts running), the allocated memory just keeps increasing with each iteration. I am still at a loss as to what exactly is causing it. I checked whether any non-detached tensors are being appended anywhere, I turned off the logging, and I reduced everything else I could, but the issue still persists.
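The kind of pattern I was checking for is something like this (a minimal, self-contained sketch; the linear layer just stands in for the real model):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(10, 1).to(device)   # stand-in for the real model
total_loss = 0.0

for step in range(4):
    x = torch.randn(2, 10, device=device)
    loss = model(x).pow(2).mean()

    # Leaky: accumulating the attached tensor keeps an ever-growing autograd
    # graph alive across iterations, so GPU memory increases every step.
    # total_loss += loss

    # Safe: take a plain float (or .detach()) before accumulating or logging.
    total_loss += loss.item()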

Do you have any advice?

Hi @DavidBoja, any luck finding the reason behind the spike? I am facing the same issue.

No, unfortunately I did not.

I switched to other architectures, since the architecture from this paper needs a lot of compute power (GPU), and I believe it primarily achieves good results because of the huge amount of data it is trained on.

I wish you luck. Let me know if you manage to find a solution please :).

Yeah, makes sense. I will get back to you if I find a solution. Thank you!

Hi @DavidBoja, are you working on the 3D human reconstruction problem? Also, which architecture are you currently using?

Hi @mlkorra
I'm more focused on 3D data, rather than 2D data, but I'm interested in guided transformers like these, and non-learning NNs like these.

Not sure of the exact setting you are working with, but one aspect we have observed that can create issues with GPU memory is setting the number of workers higher than what is actually available on the machine we train on. In that case, decreasing that value made the GPU memory increase go away for us.
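For illustration, the adjustment is only on the dataloader side, along these lines (the toy dataset here is just a stand-in for the real training data):

import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real training data.
train_dataset = TensorDataset(torch.randn(100, 3, 256, 256))

# Keep the worker count at or below what the machine actually provides.
num_workers = min(4, os.cpu_count() or 1)

train_loader = DataLoader(
    train_dataset,
    batch_size=2,
    num_workers=num_workers,
    pin_memory=True,
)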

Hi @geopavlakos,

Thanks for the help. I am working with two 12 GB NVIDIA cards. I tried lowering the number of workers, but unfortunately this did not help.

I can run the demo successfully, but I face issues with the training. I tried lowering the number of workers and the batch size, and even played around with lowering the SMPL_HEAD depth and number of heads to only 2, but the issue still persists.
However, I think the number of workers should not be related to the issue I'm facing (GPU memory that keeps increasing), because, as I understand it, the workers prepare the dataset examples that are going to be batched in a training iteration, but the batches are only transferred to the GPU once the training loop starts.
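A small sketch of what I mean (a toy loader; the assertion just illustrates that worker output is still on the CPU):

import torch
from torch.utils.data import DataLoader, TensorDataset

loader = DataLoader(
    TensorDataset(torch.randn(8, 3, 224, 224)),
    batch_size=2,
    num_workers=2,
)
device = "cuda" if torch.cuda.is_available() else "cpu"

for (imgs,) in loader:
    assert not imgs.is_cuda          # workers only prepare CPU tensors
    imgs = imgs.to(device)           # GPU memory is touched only here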

On the other hand, I have never used PyTorch Lightning before, so maybe the issue arises from there.

In the meantime I have switched to other work, so I'm not actively experimenting with the network; maybe @jerriebright can share more input regarding the issue he is facing if he is still working on it, or whether he has found a solution :).