Error with grid_sampler_2d_backward()
zikpefu opened this issue · 10 comments
Describe the bug
I get the following traceback every time I attempt to train stylegan3:
/home/zikpefu/ece8550/final/stylegan3/training/augment.py:231: UserWarning: Specified kernel cache directory could not be created! This disables kernel caching. Specified directory is /home/zikpefu/.cache/torch/kernels. This warning will appear only once per process. (Triggered internally at ../aten/src/ATen/native/cuda/jit_utils.cpp:860.)
s = torch.exp2(torch.randn([batch_size], device=device) * self.scale_std)
Traceback (most recent call last):
File "train.py", line 286, in <module>
main() # pylint: disable=no-value-for-parameter
File "/software/spackages/linux-centos8-x86_64/gcc-8.3.1/anaconda3-2019.10-v5cuhr6keyz5ryxcwvv2jkzfj2gwrj4a/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/software/spackages/linux-centos8-x86_64/gcc-8.3.1/anaconda3-2019.10-v5cuhr6keyz5ryxcwvv2jkzfj2gwrj4a/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/software/spackages/linux-centos8-x86_64/gcc-8.3.1/anaconda3-2019.10-v5cuhr6keyz5ryxcwvv2jkzfj2gwrj4a/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/software/spackages/linux-centos8-x86_64/gcc-8.3.1/anaconda3-2019.10-v5cuhr6keyz5ryxcwvv2jkzfj2gwrj4a/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "train.py", line 281, in main
launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
File "train.py", line 96, in launch_training
subprocess_fn(rank=0, c=c, temp_dir=temp_dir)
File "train.py", line 47, in subprocess_fn
training_loop.training_loop(rank=rank, **c)
File "/home/zikpefu/ece8550/final/stylegan3/training/training_loop.py", line 278, in training_loop
loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, gain=phase.interval, cur_nimg=cur_nimg)
File "/home/zikpefu/ece8550/final/stylegan3/training/loss.py", line 81, in accumulate_gradients
loss_Gmain.mean().mul(gain).backward()
File "/home/zikpefu/.local/lib/python3.7/site-packages/torch/_tensor.py", line 363, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/zikpefu/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
File "/home/zikpefu/.local/lib/python3.7/site-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "/home/zikpefu/ece8550/final/stylegan3/torch_utils/ops/grid_sample_gradfix.py", line 50, in backward
grad_input, grad_grid = _GridSample2dBackward.apply(grad_output, input, grid)
File "/home/zikpefu/ece8550/final/stylegan3/torch_utils/ops/grid_sample_gradfix.py", line 59, in forward
grad_input, grad_grid = op(grad_output, input, grid, 0, 0, False)
RuntimeError: aten::grid_sampler_2d_backward() is missing value for argument 'output_mask'. Declaration: aten::grid_sampler_2d_backward(Tensor grad_output, Tensor input, Tensor grid, int interpolation_mode, int padding_mode, bool align_corners, bool[2] output_mask) -> (Tensor, Tensor)
To Reproduce
Steps to reproduce the behavior:
- In the 'stylegan3' directory, run the command:
python train.py --outdir=training-runs-anime --cfg=stylegan3-t --data=dataset/anime_dataset.zip --gpus=1 --batch=32 --gamma=0.5 --mirror=1
- See error
Expected behavior
If this bug didn't exist, I would be able to train stylegan3-t with anime images.
Desktop (please complete the following information):
- OS: Linux
- PyTorch version 1.11.0
- CUDA toolkit version: Cuda compilation tools, release 10.2, V10.2.89
- NVIDIA driver version: 470.42.01
- GPU V100 with NVLINK
- Docker: did you use Docker? No Docker
Additional context
All Desktop info is correct. Any help is appreciated.
I have the same problem.
I have the same problem too.
See the related issue in PyTorch repository: pytorch/pytorch#75018
My (non-expert) analysis of this: There was a backwards-incompatible change to grid_sampler_2d_backward in PyTorch 1.11.0 (it now takes an additional output_mask argument, possibly so that it can skip computing gradients that are not needed), and the stylegan3 code calls this function directly.
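To illustrate the signature change described in this comment, here is a minimal sketch of how a caller could build the argument list for either version. It is not the code from the stylegan3 repository: the helper names parse_major_minor and grid_sampler_backward_args and the needs_input_grad parameter are hypothetical, the 1.11 threshold is taken from this thread, and the sketch only assembles arguments without calling PyTorch.

```python
import re


def parse_major_minor(version):
    """Extract (major, minor) from a version string such as '1.11.0+cu113'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def grid_sampler_backward_args(version, grad_output, input, grid,
                               needs_input_grad=(True, True)):
    """Build positional arguments for aten::grid_sampler_2d_backward.

    Hypothetical helper: it only assembles the argument tuple.
    """
    # Same constants as the failing call in the traceback:
    # interpolation_mode=0, padding_mode=0, align_corners=False.
    args = [grad_output, input, grid, 0, 0, False]
    # Assumption from this thread: from PyTorch 1.11 on, the op also expects
    # a bool[2] output_mask selecting which of the two gradients to compute.
    if parse_major_minor(version) >= (1, 11):
        args.append(tuple(bool(flag) for flag in needs_input_grad))
    return tuple(args)
```

With a real installation, the version string would come from torch.__version__, and the resulting tuple would be unpacked into the op call. Comparing (major, minor) as a tuple avoids the string-comparison pitfall where "1.9" sorts after "1.11".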
Specifying "pytorch=1.10.2" in the environment.yml should work. I have had issues with anaconda downloading the CPU version, so I personally use "pytorch=1.10.2=py3.9_cuda11.3_cudnn8_0".
More information is in PDillis/stylegan3-fun#7, which includes a pull request with an environment.yml that works for me.
I had the same problem, and the following command fixed it. Using pytorch=1.10 should work, but I'm not using conda.
pip3 install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio===0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
I pushed a change that should fix stylegan3 for pytorch 1.11. I'd be curious to know if it fixes the problems mentioned in the above thread (it does fix it for me).
Thanks @jannehellsten for the change!
@jannehellsten - Still not working for me using the above fix.
D:\python\repos\stable\stylegan3\training\augment.py:231: UserWarning: Specified kernel cache directory could not be created! This disables kernel caching. Specified directory is C:\Users\secre\AppData\Local\Temp/torch/kernels. This warning will appear only once per process. (Triggered internally at ..\aten\src\ATen\native\cuda\jit_utils.cpp:860.)
s = torch.exp2(torch.randn([batch_size], device=device) * self.scale_std)
As an update, manually creating the directory worked for me. No warnings now.
I’ve seen this too but I think that’s just a warning? It should work, just that startup takes longer. FWIW it’s a bug in PyTorch: it’s unable to create this directory due to a bug in their cache code.
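For the kernel cache warning, the workaround reported above (manually creating the directory) can be sketched in Python as follows. The function name ensure_kernel_cache_dir is hypothetical, and the path to pass is whatever directory your own warning message names; the example path in the comment is the one from the first traceback in this thread.

```python
import os


def ensure_kernel_cache_dir(path):
    """Create the directory (and any missing parents) if it does not exist.

    Hypothetical helper; returns the expanded path that was created or found.
    """
    expanded = os.path.expanduser(path)
    # exist_ok=True makes the call safe to repeat on every startup.
    os.makedirs(expanded, exist_ok=True)
    return expanded


# Example call, using the directory named in the warning above:
# ensure_kernel_cache_dir("~/.cache/torch/kernels")
```

Running this once before starting training should be enough, since the directory persists between runs.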