donnyyou/torchcv

Weird memory allocation, leading to OOM easily.

wondervictor opened this issue · 1 comment

Hi, I ran into a weird OOM problem (possibly a bug) when using torchcv to train a semantic segmentation model. GPU memory usage increases dramatically, sometimes to the point of OOM, right when training starts. The same problem also exists in SFNet.
I inserted the lines below into the training loop to track the peak allocated GPU memory:

print("max mem: {:.3f} GB".format(torch.cuda.max_memory_allocated()/1024/1024/1024))
torch.cuda.reset_max_memory_allocated()

This produced the following peak-memory readings:

# first iteration
max mem: 9.849 GB
max mem: 9.849 GB
max mem: 9.849 GB
max mem: 9.849 GB
# second iteration
max mem: 5.010 GB
max mem: 5.010 GB
max mem: 5.010 GB
max mem: 5.010 GB
# third iteration
max mem: 5.016 GB
max mem: 5.016 GB
max mem: 5.016 GB
max mem: 5.016 GB
....

GPU memory usage tends to explode at the start of training. Are there any clues about what causes this?
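
For reference, here is a minimal sketch of the kind of per-iteration measurement I did; the model, loss, and data below are just placeholders, not torchcv's actual classes:

import torch
import torch.nn as nn

def log_peak_mem(tag):
    # Peak GPU memory allocated since the last reset, in GB.
    peak_gb = torch.cuda.max_memory_allocated() / 1024 / 1024 / 1024
    print("{} max mem: {:.3f} GB".format(tag, peak_gb))
    # Reset the counter so the next call measures only the next segment.
    torch.cuda.reset_max_memory_allocated()

# Placeholder model, loss, and data; torchcv's real trainer wires these up differently.
model = nn.Conv2d(3, 19, kernel_size=3, padding=1).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for it in range(3):
    images = torch.randn(2, 3, 512, 512, device="cuda")
    targets = torch.randint(0, 19, (2, 512, 512), device="cuda")

    outputs = model(images)
    log_peak_mem("after forward:")

    loss = criterion(outputs, targets)
    loss.backward()
    log_peak_mem("after backward:")

    optimizer.step()
    optimizer.zero_grad()
    log_peak_mem("after step:")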

I've solved it. cuDNN benchmark mode causes large memory consumption at the beginning of training, which can even lead to OOM.
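
For anyone who hits the same issue, a minimal way to turn cuDNN benchmarking off (note that cuDNN then no longer auto-tunes convolution algorithms, so steady-state speed may drop slightly):

import torch

# cuDNN benchmark mode tries out several convolution algorithms during the
# first iterations, and the temporary workspaces it allocates can be very
# large, which is what produces the initial memory spike.
torch.backends.cudnn.benchmark = False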