GPU memory size issues on Colab?
duskvirkus opened this issue · 1 comment
Hey @rosinality, exciting repo!
I'm working on a fork that uses PyTorch Lightning for training on TPUs, but I've hit a roadblock where it's having trouble loading images. So I thought I'd run your training script to check whether it was something I changed about the dataloader.
Switched to a fresh install of your repo to run the test.
Using a Colab Pro instance with a Tesla P100-PCIE-16GB.
Installed a couple of pip libraries to get things working (tensorfn, wandb, ninja, and jsonnet; setup cell reproduced after the command below). Converted my dataset, changed the config file to use a size of 256, and ran the following command:
!python train.py --n_gpu 1 --conf /content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/config/config-t-256.jsonnet training.batch=16 path="/content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/datasets/painterly-faces-256"
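For completeness, the setup boiled down to installing the packages mentioned above, roughly:
!pip install tensorfn wandb ninja jsonnet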
The memory error I'm getting:
Output appended
...
0% 0/800000 [00:00<?, ?it/s]/content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/stylegan2/op/conv2d_gradfix.py:89: UserWarning: conv2d_gradfix not supported on PyTorch 1.9.0+cu102. Falling back to torch.nn.functional.conv2d().
f"conv2d_gradfix not supported on PyTorch {torch.__version__}. Falling back to torch.nn.functional.conv2d()."
Traceback (most recent call last):
  File "train.py", line 406, in <module>
    main, conf.n_gpu, conf.n_machine, conf.machine_rank, conf.dist_url, args=(conf,)
  File "/usr/local/lib/python3.7/dist-packages/tensorfn/distributed/launch.py", line 49, in launch
    fn(*args)
  File "train.py", line 399, in main
    train(conf, loader, generator, discriminator, g_optim, d_optim, g_ema, device)
  File "train.py", line 250, in train
    fake_img = generator(noise)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/model.py", line 424, in forward
    out = conv(out, latent)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/model.py", line 303, in forward
    out = self.activation(out)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/model.py", line 258, in forward
    out = fused_leaky_relu(out, negative_slope=self.negative_slope)
  File "/content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/stylegan2/op/fused_act.py", line 119, in fused_leaky_relu
    return FusedLeakyReLUFunction.apply(input, bias, negative_slope, scale)
  File "/content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/stylegan2/op/fused_act.py", line 66, in forward
    out = fused.fused_bias_act(input, bias, empty, 3, 0, negative_slope, scale)
RuntimeError: CUDA out of memory. Tried to allocate 1.16 GiB (GPU 0; 15.90 GiB total capacity; 11.59 GiB already allocated; 231.75 MiB free; 14.75 GiB reserved in total by PyTorch)
0% 0/800000 [00:11<?, ?it/s]
How much memory does your config require? Do I need to decrease the batch size (or other settings) to be able to train on Colab?
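For what it's worth, a quick way to see where the memory goes (plain torch.cuda bookkeeping calls, nothing specific to this repo) is something like:

import torch

# Run in a notebook cell on the same runtime; reports the caching allocator's view of the card.
device = torch.device("cuda:0")
props = torch.cuda.get_device_properties(device)
gib = 1024 ** 3

total = props.total_memory / gib                       # total VRAM on the card
reserved = torch.cuda.memory_reserved(device) / gib    # held by PyTorch's caching allocator
allocated = torch.cuda.memory_allocated(device) / gib  # occupied by live tensors

print(f"{props.name}: {allocated:.2f} GiB allocated, {reserved:.2f} GiB reserved, {total:.2f} GiB total")

The allocated/reserved numbers it prints line up with the ones in the RuntimeError above.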
Never mind, it seems to be working with a batch size of 8. Still open to advice on tuning the config for Colab training, though.
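For reference, the only change from the command above is the batch override (paths unchanged):
!python train.py --n_gpu 1 --conf /content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/config/config-t-256.jsonnet training.batch=8 path="/content/drive/MyDrive/afg-lightning-devel/alias-free-gan-pytorch/datasets/painterly-faces-256"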