CUDA out of memory error.
joaocps opened this issue · 2 comments
joaocps commented
Congratulations on the excellent work! I was trying to test it on an RGB image, but I can't because I run out of GPU memory. Any suggestions?
Thank you very much!
Stacktrace:
Traceback (most recent call last):
File "denoise_rgb.py", line 90, in <module>
denoise_bw_func()
File "denoise_rgb.py", line 63, in denoise_bw_func
test_image_dn = process_image(nl_denoiser, test_image_n.to(device), opt.max_chunk)
File "C:\Users\jcps\Desktop\AIMAGE-TEST\LIDIA-denoiser-master\LIDIA-denoiser-master\code\utils.py", line 77, in process_image
image_dn = nl_denoiser(image_n, train=False, save_memory=True, max_chunk=max_chunk)
File "C:\Users\jcps\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\jcps\Desktop\AIMAGE-TEST\LIDIA-denoiser-master\LIDIA-denoiser-master\code\modules.py", line 459, in forward
image_dn = self.denoise_image(image_n, train, save_memory, max_chunk)
File "C:\Users\jcps\Desktop\AIMAGE-TEST\LIDIA-denoiser-master\LIDIA-denoiser-master\code\modules.py", line 344, in denoise_image
top_dist0, top_ind0 = self.find_nn(image_for_nn0, im_params0, self.patch_w)
File "C:\Users\jcps\Desktop\AIMAGE-TEST\LIDIA-denoiser-master\LIDIA-denoiser-master\code\modules.py", line 289, in find_nn
top_dist = torch.zeros(im_params['batches'], im_params['patches_h'],
RuntimeError: CUDA out of memory. Tried to allocate 70.87 GiB (GPU 0; 4.00 GiB total capacity; 868.09 MiB already allocated; 1.84 GiB free; 1.02 GiB reserved in total by PyTorch)
GPU: NVIDIA GeForce GTX 960M (4 GB)
grishavak commented
Thank you for showing interest in my work, and sorry for the late reply! Unfortunately, 4 GB of GPU memory is not enough to run this code. I suggest running it on the CPU instead, or using a GPU with more memory, such as an NVIDIA GTX 1080 Ti.
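A minimal sketch of the CPU-fallback idea, assuming PyTorch is installed; `min_free_gib` and `pick_device` are illustrative names, not part of the LIDIA code:

```python
import torch

def pick_device(min_free_gib: float = 8.0) -> torch.device:
    """Prefer the GPU only if it is present and has enough total memory;
    otherwise fall back to the CPU. The 8 GiB threshold is an assumption
    based on the discussion above, not a measured requirement."""
    if torch.cuda.is_available():
        total_gib = torch.cuda.get_device_properties(0).total_memory / 2**30
        if total_gib >= min_free_gib:
            return torch.device('cuda')
    return torch.device('cpu')

device = pick_device()
```

The returned `device` could then be passed to the `.to(device)` calls in `denoise_rgb.py`; on a 4 GB card this sketch would select the CPU.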
This code applies separable matrix multiplications (W1 X W2, where X is the input and W1, W2 are trainable matrices). This is an unusual operation for neural networks, so I suspect it is implemented inefficiently in the PyTorch or CUDA libraries.
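The separable multiplication described above can be sketched in PyTorch as follows; the shapes here are illustrative and not taken from the LIDIA code:

```python
import torch

# Separable matrix multiplication: Y = W1 @ X @ W2.
# X is the input; W1 and W2 are trainable matrices applied on the
# left and right, respectively. Shapes are chosen arbitrarily.
X = torch.randn(16, 32)                  # input
W1 = torch.nn.Parameter(torch.randn(8, 16))    # left trainable matrix
W2 = torch.nn.Parameter(torch.randn(32, 64))   # right trainable matrix

Y = W1 @ X @ W2                          # result has shape (8, 64)
```

Because both sides of `X` are multiplied by dense matrices, the intermediate tensors can grow quickly with image size, which is consistent with the large allocation in the traceback above.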