GPU memory size issues on Colab?
duskvirkus opened this issue · 1 comment
Hey @rosinality, exciting repo!
I'm working on a fork that uses PyTorch Lightning for training on TPUs, but I've hit a roadblock where it's having trouble loading images. So I thought I'd run your training script to check whether it was something I changed about the dataloader.
Switched to a fresh install of your repo to run the test.
Using a Colab Pro instance with a Tesla P100-PCIE-16GB.
Installed a couple of pip libraries to get things working (tensorfn, wandb, ninja, and jsonnet; setup cell reproduced after the command below). Converted my dataset, changed the config file to use a size of 256, and ran the following command:
!python train.py --n_gpu 1 --conf /content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/config/config-t-256.jsonnet training.batch=16 path="/content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/datasets/painterly-faces-256"
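For completeness, the setup boiled down to installing the packages mentioned above, roughly:
!pip install tensorfn wandb ninja jsonnet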
The memory error I'm getting:
Output appended
...
0% 0/800000 [00:00<?, ?it/s]/content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/stylegan2/op/conv2d_gradfix.py:89: UserWarning: conv2d_gradfix not supported on PyTorch 1.9.0+cu102. Falling back to torch.nn.functional.conv2d().
f"conv2d_gradfix not supported on PyTorch {torch.__version__}. Falling back to torch.nn.functional.conv2d()."
Traceback (most recent call last):
  File "train.py", line 406, in <module>
    main, conf.n_gpu, conf.n_machine, conf.machine_rank, conf.dist_url, args=(conf,)
  File "/usr/local/lib/python3.7/dist-packages/tensorfn/distributed/launch.py", line 49, in launch
    fn(*args)
  File "train.py", line 399, in main
    train(conf, loader, generator, discriminator, g_optim, d_optim, g_ema, device)
  File "train.py", line 250, in train
    fake_img = generator(noise)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/model.py", line 424, in forward
    out = conv(out, latent)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/model.py", line 303, in forward
    out = self.activation(out)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/model.py", line 258, in forward
    out = fused_leaky_relu(out, negative_slope=self.negative_slope)
  File "/content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/stylegan2/op/fused_act.py", line 119, in fused_leaky_relu
    return FusedLeakyReLUFunction.apply(input, bias, negative_slope, scale)
  File "/content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/stylegan2/op/fused_act.py", line 66, in forward
    out = fused.fused_bias_act(input, bias, empty, 3, 0, negative_slope, scale)
RuntimeError: CUDA out of memory. Tried to allocate 1.16 GiB (GPU 0; 15.90 GiB total capacity; 11.59 GiB already allocated; 231.75 MiB free; 14.75 GiB reserved in total by PyTorch)
0% 0/800000 [00:11<?, ?it/s]
How much memory does your config require? Do I need to decrease the batch size (or other settings) to be able to train on Colab?
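For what it's worth, a quick way to see where the memory goes (plain torch.cuda bookkeeping calls, nothing specific to this repo) is something like:

import torch

# Run in a notebook cell on the same runtime; reports the caching allocator's view of the card.
device = torch.device("cuda:0")
props = torch.cuda.get_device_properties(device)
gib = 1024 ** 3

total = props.total_memory / gib                       # total VRAM on the card
reserved = torch.cuda.memory_reserved(device) / gib    # held by PyTorch's caching allocator
allocated = torch.cuda.memory_allocated(device) / gib  # occupied by live tensors

print(f"{props.name}: {allocated:.2f} GiB allocated, {reserved:.2f} GiB reserved, {total:.2f} GiB total")

The allocated/reserved numbers it prints line up with the ones in the RuntimeError above.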
Never mind, it seems to be working with a batch size of 8. Still open to advice on tuning the config for Colab training, though.
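For reference, the only change from the command above is the batch override (paths unchanged):
!python train.py --n_gpu 1 --conf /content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/config/config-t-256.jsonnet training.batch=8 path="/content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/datasets/painterly-faces-256"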