Error with grid_sampler_2d_backward()

Question

zikpefu opened this issue 3 years ago · 10 comments

Describe the bug
I get the following traceback every time I attempt to train stylegan3:

/home/zikpefu/ece8550/final/stylegan3/training/augment.py:231: UserWarning: Specified kernel cache directory could not be created! This disables kernel caching. Specified directory is /home/zikpefu/.cache/torch/kernels. This warning will appear only once per process. (Triggered internally at  ../aten/src/ATen/native/cuda/jit_utils.cpp:860.)
  s = torch.exp2(torch.randn([batch_size], device=device) * self.scale_std)
Traceback (most recent call last):
  File "train.py", line 286, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/software/spackages/linux-centos8-x86_64/gcc-8.3.1/anaconda3-2019.10-v5cuhr6keyz5ryxcwvv2jkzfj2gwrj4a/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/software/spackages/linux-centos8-x86_64/gcc-8.3.1/anaconda3-2019.10-v5cuhr6keyz5ryxcwvv2jkzfj2gwrj4a/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/software/spackages/linux-centos8-x86_64/gcc-8.3.1/anaconda3-2019.10-v5cuhr6keyz5ryxcwvv2jkzfj2gwrj4a/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/software/spackages/linux-centos8-x86_64/gcc-8.3.1/anaconda3-2019.10-v5cuhr6keyz5ryxcwvv2jkzfj2gwrj4a/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "train.py", line 281, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "train.py", line 96, in launch_training
    subprocess_fn(rank=0, c=c, temp_dir=temp_dir)
  File "train.py", line 47, in subprocess_fn
    training_loop.training_loop(rank=rank, **c)
  File "/home/zikpefu/ece8550/final/stylegan3/training/training_loop.py", line 278, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, gain=phase.interval, cur_nimg=cur_nimg)
  File "/home/zikpefu/ece8550/final/stylegan3/training/loss.py", line 81, in accumulate_gradients
    loss_Gmain.mean().mul(gain).backward()
  File "/home/zikpefu/.local/lib/python3.7/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/zikpefu/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
  File "/home/zikpefu/.local/lib/python3.7/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/home/zikpefu/ece8550/final/stylegan3/torch_utils/ops/grid_sample_gradfix.py", line 50, in backward
    grad_input, grad_grid = _GridSample2dBackward.apply(grad_output, input, grid)
  File "/home/zikpefu/ece8550/final/stylegan3/torch_utils/ops/grid_sample_gradfix.py", line 59, in forward
    grad_input, grad_grid = op(grad_output, input, grid, 0, 0, False)
RuntimeError: aten::grid_sampler_2d_backward() is missing value for argument 'output_mask'. Declaration: aten::grid_sampler_2d_backward(Tensor grad_output, Tensor input, Tensor grid, int interpolation_mode, int padding_mode,  bool align_corners, bool[2] output_mask) -> (Tensor, Tensor)

To Reproduce
Steps to reproduce the behavior:

1. In the 'stylegan3' directory, run the command: python train.py --outdir=training-runs-anime --cfg=stylegan3-t --data=dataset/anime_dataset.zip --gpus=1 --batch=32 --gamma=0.5 --mirror=1
2. See the error.

Expected behavior
If this bug didn't exist, I would be able to train stylegan3-t with anime images.

Desktop (please complete the following information):

OS: Linux
PyTorch version 1.11.0
CUDA toolkit version: Cuda compilation tools, release 10.2, V10.2.89
NVIDIA driver version: 470.42.01
GPU V100 with NVLINK
Docker: did you use Docker? No Docker

Additional context
All of the Desktop info above is correct. Any help is appreciated.

Answer 1 · 2022-04-01T02:30:50.000Z

I have the same problem.

Answer 2 · 2022-04-03T23:17:04.000Z

I have the same problem too.

Answer 3 · 2022-04-04T12:45:42.000Z

See the related issue in PyTorch repository: pytorch/pytorch#75018

My (non-expert) analysis of this: PyTorch 1.11.0 made a backwards-incompatible change to grid_sampler_2d_backward. It now takes an additional output_mask argument (presumably so it can skip computing gradients that are not needed, such as the gradient for the input), and the stylegan3 code calls this function directly with the old argument list.
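
To make the analysis above concrete, here is a minimal sketch of how a caller could build the argument list for either signature. It is not the fix that was pushed to stylegan3. The helper names parse_major_minor and grid_sampler_backward_args are hypothetical, the 1.11 threshold comes from this thread, and the assumption that the two output_mask entries correspond to (input, grid) is mine. The sketch only assembles arguments and does not import torch.

```python
import re


def parse_major_minor(version):
    """Extract (major, minor) from a version string such as '1.11.0+cu113'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def grid_sampler_backward_args(grad_output, input, grid, needs_input_grad, torch_version):
    """Build positional arguments for aten::grid_sampler_2d_backward.

    Hypothetical helper: the three constants mirror the call shown in the
    traceback, op(grad_output, input, grid, 0, 0, False).
    """
    args = [grad_output, input, grid, 0, 0, False]
    # Assumption from this thread: 1.11 and later require the extra
    # bool[2] output_mask argument; earlier versions reject it.
    if parse_major_minor(torch_version) >= (1, 11):
        # Assumed order: (gradient wanted for input, gradient wanted for grid).
        args.append(tuple(bool(flag) for flag in needs_input_grad))
    return tuple(args)
```

In a real backward function, torch_version would come from torch.__version__ and needs_input_grad from the autograd context, so the same code path could serve both 1.10 and 1.11.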

Answer 4 · 2022-04-08T10:31:23.000Z

Specifying "pytorch=1.10.2" in the environment.yml should work. I have had issues with Anaconda downloading the CPU version, so I personally use "pytorch=1.10.2=py3.9_cuda11.3_cudnn8_0".

More information is in PDillis/stylegan3-fun#7, which includes a pull request with an environment.yml that works for me.

Answer 5 · 2022-04-20T00:15:23.000Z

I had the same problem, and the following command fixed it. Using pytorch=1.10 through conda should also work, but I'm not using conda.

pip3 install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio===0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

Answer 6 · 2022-04-22T14:47:56.000Z

I pushed a change that should fix stylegan3 for pytorch 1.11. I'd be curious to know if it fixes the problems mentioned in the above thread (it does fix it for me.)

Answer 7 · 2022-04-22T21:31:37.000Z

Thanks @jannehellsten for the change!

Answer 8 · 2022-05-01T11:23:51.000Z

@jannehellsten - Still not working for me with the above fix.

D:\python\repos\stable\stylegan3\training\augment.py:231: UserWarning: Specified kernel cache directory could not be created! This disables kernel caching. Specified directory is C:\Users\secre\AppData\Local\Temp/torch/kernels. This warning will appear only once per process. (Triggered internally at ..\aten\src\ATen\native\cuda\jit_utils.cpp:860.)
s = torch.exp2(torch.randn([batch_size], device=device) * self.scale_std)

Answer 9 · 2022-05-01T11:27:41.000Z

As an update, manually creating the directory worked for me. No warnings now.
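
For anyone who would rather script this workaround, here is a minimal sketch that creates the cache directory ahead of time. The function name ensure_kernel_cache_dir is hypothetical, and the path to pass in is whatever directory your own warning message prints. I have not checked whether PyTorch needs anything beyond the directory existing.

```python
import os


def ensure_kernel_cache_dir(path):
    """Create the given directory (and any missing parents) if it is absent."""
    expanded = os.path.expanduser(path)
    # exist_ok=True makes repeated calls harmless.
    os.makedirs(expanded, exist_ok=True)
    return expanded
```

For the Linux warning at the top of this thread, the call would be ensure_kernel_cache_dir("~/.cache/torch/kernels"), run once before starting train.py.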

Answer 10 · 2022-05-01T16:04:44.000Z

I’ve seen this too, but I think that’s just a warning? Training should still work; startup just takes longer. FWIW, it’s a bug in PyTorch’s cache code: it fails to create this directory.