marcoamonteiro/pi-GAN

Cuda OOM error during training

athenas-lab opened this issue · 4 comments

Hi,

I tried to train the code on the CARLA dataset, but I am getting a CUDA out-of-memory (OOM) error. These are the things I have tried so far:

  1. I have tried running on a single as well as multiple 2080 Ti GPUs (specified using CUDA_VISIBLE_DEVICES), each with 11 GB of memory, but it still generates an OOM error.
  2. I tried a 3090 GPU, but the code generates errors on the 3090 that are not related to the CUDA OOM error.
  3. I have also tried reducing the batch size for the CARLA dataset in curriculum.py from 30 to 10, as shown below. But I still get the OOM error when I run on a single or multiple 2080 Ti GPUs.

CARLA = {
    0: {'batch_size': 10, 'num_steps': 48, 'img_size': 32, 'batch_split': 1, 'gen_lr': 4e-5, 'disc_lr': 4e-4},
    int(10e3): {'batch_size': 14, 'num_steps': 48, 'img_size': 64, 'batch_split': 2, 'gen_lr': 2e-5, 'disc_lr': 2e-4},
    int(55e3): {'batch_size': 10, 'num_steps': 48, 'img_size': 128, 'batch_split': 5, 'gen_lr': 10e-6, 'disc_lr': 10e-5},
    int(200e3): {},
}
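For scale, I did a back-of-envelope estimate: the activation memory for the coarse pass grows with batch_size * img_size^2 * num_steps per split, so batch_size is only one of the knobs. This sketch assumes a hidden width of 256 and about eight modulated layers, which may not match the actual SIREN:

def points_per_chunk(batch_size, img_size, num_steps, batch_split=1):
    # Each image contributes img_size**2 rays, and each ray is sampled at num_steps points.
    return (batch_size // batch_split) * img_size ** 2 * num_steps

def rough_activation_mib(batch_size, img_size, num_steps, batch_split=1,
                         hidden_dim=256, num_layers=8, bytes_per_float=4):
    # Every layer's output is kept for the backward pass, so activation memory
    # scales with the layer count as well as the number of sampled points.
    pts = points_per_chunk(batch_size, img_size, num_steps, batch_split)
    return pts * hidden_dim * num_layers * bytes_per_float / 2 ** 20

print(rough_activation_mib(10, 32, 48, batch_split=1))  # ~3840 MiB of generator activations per chunk
print(rough_activation_mib(10, 32, 48, batch_split=4))  # ~768 MiB per chunk (10 // 4 = 2 images per chunk)

So even at batch_size 10 with batch_split 1, the generator's coarse pass alone wants a few GiB of activations per GPU, before counting the model weights, optimizer states, and the discriminator.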

Is there anything else I can do to fix the OOM error?

thanks

Hi,

How many 2080 GPUs did you try running on concurrently? We trained our models with 48GB of GPU memory.

Could you try increasing the batch_split on the first CARLA step to 4? That'll divide the batch into multiple runs and reduce memory usage.
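Conceptually, batch_split works like gradient accumulation. Here's a minimal sketch of the idea (not our actual train.py; optimizer_G and the mean() stand-in loss are just placeholders):

import torch

def generator_step(generator_ddp, optimizer_G, z, metadata, batch_split):
    # Cut the latent batch into batch_split chunks, run forward/backward per
    # chunk, and accumulate gradients before a single optimizer step, so peak
    # activation memory shrinks by roughly a factor of batch_split.
    optimizer_G.zero_grad()
    for subset_z in torch.chunk(z, batch_split, dim=0):
        gen_imgs, gen_positions = generator_ddp(subset_z, **metadata)
        loss = gen_imgs.mean()            # stand-in for the real adversarial loss
        (loss / batch_split).backward()   # gradients accumulate across chunks
    optimizer_G.step()

The effective batch size stays the same; only the per-chunk memory goes down.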

Hi,

  1. I tried running on 2 to 6 2080 Ti GPUs (so from 22 GB to 66 GB in total). I tried batch_size values ranging from 6 to 30, and batch_split values of 1 and 4. But in each case I got a CUDA OOM error. The issue appears to be in siren.py, as shown in the traceback below.

Progress to next stage: 0%| | 0/10000 [00:16<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 400, in <module>
    mp.spawn(train, args=(num_gpus, opt), nprocs=num_gpus, join=True)

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "pi_gan/train.py", line 263, in train
    gen_imgs, gen_positions = generator_ddp(subset_z, **metadata)
  File "python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "pi_gan/generators/generators.py", line 49, in forward
    coarse_output = self.siren(transformed_points, z, ray_directions=transformed_ray_directions_expanded).reshape(batch_size, img_size * img_size, num_steps, 4)
  File "python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "pi_gan/siren/siren.py", line 133, in forward
    return self.forward_with_frequencies_phase_shifts(input, frequencies, phase_shifts, ray_directions, **kwargs)
  File "pi_gan/siren/siren.py", line 143, in forward_with_frequencies_phase_shifts
    x = layer(x, frequencies[..., start:end], phase_shifts[..., start:end])
  File "python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "pi_gan/siren/siren.py", line 94, in forward
    return torch.sin(freq * x + phase_shift)
RuntimeError: CUDA out of memory. Tried to allocate 720.00 MiB (GPU 0; 10.76 GiB total capacity; 7.26 GiB already allocated; 441.44 MiB free; 8.13 GiB reserved in total by PyTorch)
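For context, the line the traceback ends on is a FiLM-modulated sine layer. Simplified, it computes something like the sketch below (the class name, sizes, and shapes are illustrative, not copied from siren.py):

import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    # Simplified FiLM-style SIREN layer: a linear map followed by a sine whose
    # per-sample frequency and phase come from the mapping network.
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.layer = nn.Linear(input_dim, hidden_dim)

    def forward(self, x, freq, phase_shift):
        # x: (batch, num_points, input_dim), where num_points = img_size**2 * num_steps
        x = self.layer(x)
        # freq / phase_shift: (batch, hidden_dim), broadcast over every sampled point
        freq = freq.unsqueeze(1).expand_as(x)
        phase_shift = phase_shift.unsqueeze(1).expand_as(x)
        # The sine materializes a full (batch, num_points, hidden_dim) tensor;
        # allocations of that size are what fail in the traceback above, which is
        # why fewer points per chunk (larger batch_split, or smaller batch_size /
        # img_size / num_steps) relieves the OOM.
        return torch.sin(freq * x + phase_shift)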

I have only made changes to the first line below; for the remaining 3 steps I am retaining the original values. Should they be changed based on the parameters in the first line?
0: {'batch_size': 6, 'num_steps': 48, 'img_size': 32, 'batch_split': 1, 'gen_lr': 4e-5, 'disc_lr': 4e-4},
int(10e3): {'batch_size': 14, 'num_steps': 48, 'img_size': 64, 'batch_split': 2, 'gen_lr': 2e-5, 'disc_lr': 2e-4},
int(55e3): {'batch_size': 10, 'num_steps': 48, 'img_size': 128, 'batch_split': 5, 'gen_lr': 10e-6, 'disc_lr': 10e-5},
int(200e3): {},

Using 11798 MiB:

0: {'batch_size': 28 * 2, 'num_steps': 12, 'img_size': 64, 'batch_split': 8, 'gen_lr': 6e-5, 'disc_lr': 2e-4},
int(200e3): {},
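Plugging that curriculum into the same kind of rough arithmetic as above (a sketch, not a measurement) shows why it fits in roughly 11.8 GiB:

# (28 * 2) images with batch_split 8 -> 7 images per forward chunk
chunk_points = (28 * 2 // 8) * 64 ** 2 * 12
print(chunk_points)  # 344,064 sampled points per SIREN call,
                     # vs. 491,520 for batch_size 10, img_size 32, num_steps 48, batch_split 1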

Thanks for the reply, @zwzhang121. Sounds like that curriculum worked for you?

@athena913 we've noticed that when you split training across multiple GPUs, one of the GPUs will need a little more memory than if you were training on just one GPU. Given that the above curriculum worked for @zwzhang121, I'd recommend increasing the batch_split and seeing if it works.
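For example, something like the following (starting values to try, not settings we've verified; curriculum keys other than the ones quoted in this thread are omitted, and batch_split is chosen to divide batch_size evenly):

CARLA = {
    0: {'batch_size': 30, 'num_steps': 48, 'img_size': 32, 'batch_split': 5, 'gen_lr': 4e-5, 'disc_lr': 4e-4},
    int(10e3): {'batch_size': 14, 'num_steps': 48, 'img_size': 64, 'batch_split': 7, 'gen_lr': 2e-5, 'disc_lr': 2e-4},
    int(55e3): {'batch_size': 10, 'num_steps': 48, 'img_size': 128, 'batch_split': 10, 'gen_lr': 10e-6, 'disc_lr': 10e-5},
    int(200e3): {},
}

If a stage still runs out of memory, keep raising that stage's batch_split; it only changes how the batch is chunked within a step, not the effective batch size.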