SHI-Labs/Neighborhood-Attention-Transformer

pytorch 1.12.0 CUDA 11.6 Win10 VS2019 build error

Ken1256 opened this issue · 19 comments

C:\Program Files\Python\Python37\lib\site-packages\torch\include\pybind11\cast.h(1429): error: too few arguments for template template parameter "Tuple"
          detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std::pair, Ts=<T1, T2>]"
(1507): here

C:\Program Files\Python\Python37\lib\site-packages\torch\include\pybind11\cast.h(1503): error: too few arguments for template template parameter "Tuple"
          detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std::pair, Ts=<T1, T2>]"
(1507): here

2 errors detected in the compilation of "C:/pytorch/NAT/natten/src/nattenav_cuda_kernel.cu".
nattenav_cuda_kernel.cu
ninja: build stopped: subcommand failed.

Hello and thank you for your interest.
We recommend using PyTorch 1.11.
1.12 is a very recent release and will likely require us to update the kernel.
However, the error you shared does not appear to be from our code.
Have you tried 1.11?

Similar problem.
facebookresearch/pytorch3d#1127
Maybe it needs a specific Windows version.

I seriously doubt that, because as I mentioned the error points to pybind, not to our code. Unless that's not the full error.
But again, I'd recommend using 1.11, we still haven't even tested our kernel on 1.12.

Edit: It appears to be an incompatibility issue with nvcc. I've seen multiple instances of this in other PyTorch CUDA extensions; maybe these will help:

ashawkey/torch-ngp#51 (comment)

facebookresearch/pytorch3d#1024

bamsumit/slayerPytorch#86
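If it helps narrow things down, a minimal standalone CUDA extension built with PyTorch's `load_inline` should hit the same `cast.h` error if the nvcc + MSVC + bundled pybind11 combination is the culprit. This is just a sketch for isolating the toolchain; the kernel and names below are made up for the test and are not part of NATTEN:

```python
# Minimal toolchain check: compiles a trivial CUDA kernel through the same
# nvcc + MSVC + pybind11 path that the NATTEN build uses. If this fails with
# the same pybind11/cast.h error, the environment is the problem, not the repo.
import torch
from torch.utils.cpp_extension import load_inline

cuda_src = r"""
__global__ void add_one_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

torch::Tensor add_one(torch::Tensor x) {
    auto y = x.contiguous();
    const int n = y.numel();
    add_one_kernel<<<(n + 255) / 256, 256>>>(y.data_ptr<float>(), n);
    return y;
}
"""

cpp_src = "torch::Tensor add_one(torch::Tensor x);"

ext = load_inline(
    name="toolchain_check",
    cpp_sources=cpp_src,
    cuda_sources=cuda_src,
    functions=["add_one"],
    verbose=True,  # prints the ninja/nvcc commands and the build directory
)

print(ext.add_one(torch.zeros(4, device="cuda")))  # expect a tensor of ones
```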

Win10 VS2019 pytorch 1.11.0 CUDA 11.3 pass
Win10 VS2019 pytorch 1.12.0 CUDA 11.3 pass
Win10 VS2019 pytorch 1.12.0 CUDA 11.6 fail

pytorch/pytorch#69460

Are those your CUDA toolkit versions or CUDA driver versions?
Assuming it's the latter, did just using 1.12 with an earlier toolkit resolve the issue?

So what is your actual CUDA version, though?
Also, it's unclear: is the kernel working with the 11.3 toolkit?
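For reference, something like this quick sketch prints the versions that need to line up when the extension gets compiled (`torch.version.cuda` is the toolkit PyTorch was built against, while nvcc is the locally installed toolkit that ninja actually invokes):

```python
# Quick environment report: the toolkit PyTorch was built with, the toolkit
# nvcc will use to compile the .cu files, and whether a CUDA device is visible.
import subprocess
import torch

print("PyTorch version:        ", torch.__version__)
print("PyTorch built with CUDA:", torch.version.cuda)
print("CUDA device available:  ", torch.cuda.is_available())

try:
    nvcc = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    print(nvcc.stdout)  # the "release 11.x" here should match the toolkit above
except FileNotFoundError:
    print("nvcc not found on PATH; check the CUDA toolkit install / CUDA_HOME.")
```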

So is the issue resolved?

The NAT v0.11 issue is resolved.
NAT v0.12 still has other build errors.

Z:\py_test\NAT_v0_12\natten\src\nattenav_cuda_kernel.cu(881): error: expected an expression

Z:\py_test\NAT_v0_12\natten\src\nattenav_cuda_kernel.cu(911): error: expected an expression

Z:\py_test\NAT_v0_12\natten\src\nattenav_cuda_kernel.cu(911): error: expected an expression

Z:\py_test\NAT_v0_12\natten\src\nattenav_cuda_kernel.cu(1167): error: expected an expression

Z:\py_test\NAT_v0_12\natten\src\nattenav_cuda_kernel.cu(1167): error: expected an expression

Z:\py_test\NAT_v0_12\natten\src\nattenav_cuda_kernel.cu(1167): error: expected an expression

Z:\py_test\NAT_v0_12\natten\src\nattenav_cuda_kernel.cu(1167): error: expected an expression

Z:\py_test\NAT_v0_12\natten\src\nattenav_cuda_kernel.cu(1214): error: expected an expression

Z:\py_test\NAT_v0_12\natten\src\nattenav_cuda_kernel.cu(1214): error: expected an expression

9 errors detected in the compilation of "Z:/py_test/NAT_v0_12/natten/src/nattenav_cuda_kernel.cu".
nattenav_cuda_kernel.cu
ninja: build stopped: subcommand failed.

Are you still on PyTorch v1.12 or 1.11?

On PyTorch v1.11.

Can you clear your compilation cache and try again? I just tried a fresh compile and it works fine on multiple setups on my end.
I'm not sure where the cache would be on Windows; on Linux it's $HOME/.cache/torch_extensions.
Could you also confirm you're on the latest commit?
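If it helps, here's a small sketch for locating and wiping that cache. The Windows default below (%LOCALAPPDATA%\torch_extensions) is an assumption on my part, so check the printed path (or the TORCH_EXTENSIONS_DIR environment variable, if you set it) before deleting anything:

```python
# Locate and clear the torch_extensions build cache so the kernel is rebuilt
# from scratch on the next import. Paths other than the Linux default are
# assumptions; verify the printed directory before removing it.
import os
import shutil
from pathlib import Path


def torch_extensions_cache() -> Path:
    env = os.environ.get("TORCH_EXTENSIONS_DIR")
    if env:
        return Path(env)
    if os.name == "nt":
        # Assumed Windows default; adjust if your cache lives elsewhere.
        local = os.environ.get("LOCALAPPDATA", str(Path.home() / "AppData" / "Local"))
        return Path(local) / "torch_extensions"
    # Linux default mentioned above.
    return Path.home() / ".cache" / "torch_extensions"


cache = torch_extensions_cache()
print("torch_extensions cache:", cache)
if cache.exists():
    shutil.rmtree(cache)
    print("Cache cleared; the extension will recompile on the next import.")
else:
    print("No cache directory found at that path.")
```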

After clearing the cache there are still build errors.
Did you test on Windows?
Does NAT v0.12 use much less memory than NAT v0.11?

I'm sorry to hear that.
Unfortunately no, we don't have a Windows environment, but the error is really strange.
Based on the error you shared, it's possible that it's failing to load a header file that is new in v0.12.
But from what I'm seeing, it's probably an incompatibility somewhere in your environment (CUDA driver vs. CUDA toolkit vs. PyTorch version) that's resulting in the compilation error; again, I can't say for certain with the information I have.

And no: our NA extension just generally uses less memory than SWSA (I can get into details if you want); the memory usage hasn't changed in the new version. But our models will run a lot faster with the new version.

PyTorch 1.11 should work?

Yes. 1.11 is the recommended version.

Closing this due to inactivity. If you still have questions feel free to open it back up.