chrischoy/FCGF

Out of memory with batch_size 1 and 4GB VRAM

fjodborg opened this issue · 0 comments

Hello, I have a problem where I eventually run out of memory when running python train.py --threed_match_dir ~/dataset/threedmatch/ --batch_size 1.
At first I tried batch_size 2, but that was too much for my GPU, so I changed it to 1. After a few thousand iterations I started getting "out of memory" errors like:

INFO - 2021-02-22 12:51:28,348 - trainer - Train Epoch: 1 [1440/7317], Current Loss: 1.157e+00 Pos: 0.365 Neg: 0.792	Data time: 0.0536, Train time: 0.5614, Iter time: 0.6150
Traceback (most recent call last):
  File "train.py", line 84, in <module>
    main(config)
  File "train.py", line 63, in main
    trainer.train()
  File "/home/f/repos/FCGF/lib/trainer.py", line 132, in train
    self._train_epoch(epoch)
  File "/home/f/repos/FCGF/lib/trainer.py", line 492, in _train_epoch
    self.config.batch_size)
  File "/home/f/repos/FCGF/lib/trainer.py", line 427, in contrastive_hardest_negative_loss
    D01 = pdist(posF0, subF1, dist_type='L2')
  File "/home/f/repos/FCGF/lib/metrics.py", line 24, in pdist
    D2 = torch.sum((A.unsqueeze(1) - B.unsqueeze(0)).pow(2), 2)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 3.82 GiB total capacity; 744.27 MiB already allocated; 43.38 MiB free; 814.00 MiB reserved in total by PyTorch)

Currently my system takes up about 500 MiB of VRAM on my GTX 1650 (4 GB), and the rest is used by PyTorch. I'm running PyTorch 1.7 in a Python 3.7 conda environment, and I tried compiling MinkowskiEngine for both CUDA 11.2 and 10.2, but both gave the same error.
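
In case it helps: from my own (possibly wrong) reading of the traceback, the line in lib/metrics.py that fails, D2 = torch.sum((A.unsqueeze(1) - B.unsqueeze(0)).pow(2), 2), broadcasts the two feature sets into a single (N, M, C) intermediate tensor, so its memory use grows with the product of the two point counts and the feature dimension, on top of whatever the network already holds. As a rough sketch of a possible workaround (not code from this repo; pdist_chunked and the sizes in the comments are just placeholders I made up), computing the distances chunk by chunk with torch.cdist keeps the peak allocation much smaller:

import torch

def pdist_chunked(A, B, chunk=1024):
    # A: (N, C), B: (M, C) feature tensors on the GPU; returns the (N, M)
    # matrix of L2 distances without ever materializing an (N, M, C) tensor.
    rows = []
    for i in range(0, A.shape[0], chunk):
        # torch.cdist computes pairwise L2 distances for this row chunk;
        # the extra intermediate is roughly (chunk, M) instead of (N, M, C).
        rows.append(torch.cdist(A[i:i + chunk], B, p=2))
    return torch.cat(rows, dim=0)

# Hypothetical sizes, just to illustrate the scaling: with N = M = 5000
# points and C = 32 float32 channels, the broadcasted (N, M, C) difference
# tensor alone would be 5000 * 5000 * 32 * 4 bytes ≈ 3 GiB, while the
# (N, M) distance matrix is only about 100 MiB.
# D01 = pdist_chunked(posF0, subF1)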