NVCC error. Compilation of custom CUDA ops on Windows 10 for Tensorflow 2.x

Question

uladzislau-varabei opened this issue 3 years ago · 5 comments

uladzislau-varabei commented 3 years ago

Hi everyone,

I'm trying to compile custom CUDA ops on Windows 10 for TensorFlow 2.x, but I've run into a problem.
Below is the output from compiling the fused_bias_act op.

...project/dnnlib/ops/fused_bias_act.cu(204): error: expected an expression

...project/dnnlib/ops/fused_bias_act.cu(204): error: no instance of constructor "tensorflow::register_op::OpDefBuilderWrapper::OpDefBuilderWrapper" matches the argument list
            argument types are: (const char [13], __nv_bool)

...project/dnnlib/ops/fused_bias_act.cu(217): error: expected an expression

...project/dnnlib/ops/fused_bias_act.cu(217): error: expected an expression

...project/dnnlib/ops/fused_bias_act.cu(217): error: expected a type specifier

...project/dnnlib/ops/fused_bias_act.cu(217): error: expected an expression

...project/dnnlib/ops/fused_bias_act.cu(218): error: expected an expression

...project/dnnlib/ops/fused_bias_act.cu(218): error: expected an expression

...project/dnnlib/ops/fused_bias_act.cu(218): error: expected a type specifier

...project/dnnlib/ops/fused_bias_act.cu(218): error: expected an expression

10 errors detected in the compilation of "...project/dnnlib/ops/fused_bias_act.cu".
_pywrap_tensorflow_internal.lib
fused_bias_act.cu
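
Since the same few source lines are reported many times, it can help to collapse the output before reading it. Below is a minimal sketch, not part of the repo, that groups diagnostics by file and line. It assumes every diagnostic has the form `<file>(<line>): error: <message>`, as in the output above; the function name `group_nvcc_errors` is hypothetical.

```python
import re
from collections import defaultdict

# Matches diagnostics of the form "<file>(<line>): error: <message>".
_DIAG = re.compile(r"^(?P<file>.+)\((?P<line>\d+)\): error: (?P<msg>.+)$")


def group_nvcc_errors(log):
    """Group error messages by (file, line) so repeated diagnostics collapse."""
    grouped = defaultdict(list)
    for raw in log.splitlines():
        m = _DIAG.match(raw.strip())
        if m:
            grouped[(m["file"], int(m["line"]))].append(m["msg"])
    return dict(grouped)


# A few lines copied from the output above (the path is elided in the original).
sample = """\
...project/dnnlib/ops/fused_bias_act.cu(204): error: expected an expression
...project/dnnlib/ops/fused_bias_act.cu(204): error: no instance of constructor "tensorflow::register_op::OpDefBuilderWrapper::OpDefBuilderWrapper" matches the argument list
...project/dnnlib/ops/fused_bias_act.cu(217): error: expected a type specifier
"""

for (path, line), msgs in sorted(group_nvcc_errors(sample).items()):
    print(f"{path}:{line} -> {len(msgs)} error(s)")
```

Grouped this way, the ten errors reduce to three source lines: the REGISTER_OP call and the two REGISTER_KERNEL_BUILDER calls.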

This is how these lines look in the script (taken from the repo):

(204) REGISTER_OP("FusedBiasAct")
   ...
    .Attr       ("clamp: float = -1.0");
(217) REGISTER_KERNEL_BUILDER(Name("FusedBiasAct").Device(DEVICE_GPU).TypeConstraint<float>("T"), FusedBiasActOp<float>);
(218) REGISTER_KERNEL_BUILDER(Name("FusedBiasAct").Device(DEVICE_GPU).TypeConstraint<Eigen::half>("T"), FusedBiasActOp<Eigen::half>);

It's the same as described by mavanmanen here.

I tried several conda environments and here are the results:

Successfully compiled with Tf 1.14 (pip) + cuda 10.0 (conda) + cudnn 7.6.5 (conda) + MSVC 14.16 (VS17)
Didn't compile with Tf 2.6 (pip) + cuda 10.2 (conda) + cudnn 7.6.5 (conda) + MSVC 14.16 (VS17) / 14.29 (VS19)
Didn't compile with Tf 2.5 (pip) + cuda 11.2 (conda) + cudnn 8.1.0 (conda) + MSVC 14.16 (VS17) / 14.29 (VS19)
Didn't compile with Tf 2.5 (pip) + cuda 11.2 (system) + cudnn 8.1.0 (system) + MSVC 14.16 (VS17) / 14.29 (VS19)

As you can see, the problem seems to be related to Tf 2.x (option 1, with Tf 1.14, is the only one that compiles). I thought it might have something to do with cuda/cudnn, but trying different versions didn't help (see options 2 and 3). I also thought that something might be missing when installing from conda channels, but again the result is the same (see options 3 and 4). I tried Tf v1 compatibility mode as well, but it made no difference. Code for this:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

I noticed that most Tf 2.x ports of StyleGAN2/StyleGAN2-ADA (both projects have these custom ops) use different compilation flags on Linux. Example 1 and 2.
The changes are aligned with the official Tf 2.x guide for custom ops. However, none of the ports I found change anything for Windows (except for the path to MSVC). The guide also only provides details for Linux.
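
For reference, the Linux instructions in that guide take the compile and link flags from TensorFlow itself through `tf.sysconfig`. The sketch below only prints what the installed TensorFlow reports; the wrapper name `tf_build_flags` is hypothetical, and whether the reported flags are usable with nvcc and MSVC on Windows is exactly the open question here, so treat the output as a starting point rather than a fix.

```python
def tf_build_flags():
    """Return (compile_flags, link_flags) reported by TensorFlow.

    Returns None if TensorFlow cannot be imported in this environment.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # Both functions return lists of flag strings.
    return tf.sysconfig.get_compile_flags(), tf.sysconfig.get_link_flags()


flags = tf_build_flags()
if flags is None:
    print("TensorFlow is not installed in this environment")
else:
    compile_flags, link_flags = flags
    print("compile:", " ".join(compile_flags))
    print("link:   ", " ".join(link_flags))
```

Comparing this output between the Tf 1.14 and Tf 2.x environments may show which defines or include paths differ.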

I have found an official repo with details for both Linux and Windows, but it didn't help me much. It has a potentially very useful Bazel build file, which provides flags/options for Windows, but unfortunately I still couldn't compile the op. I tried to explicitly add those flags to the script from this repo (I mean StyleGAN2-ADA), but I only got errors saying that some of them are not recognized. Note: I was still compiling the ops with MSVC, not with Bazel.

So, if anyone could help with compiling these ops on Windows with TensorFlow 2.x, it would be great. Pieces of code (explicit compilation flags, etc.), ideas, explanations and just thoughts are welcome.

Thanks in advance

Answer 1 · 2021-08-23T23:45:52.000Z

try https://github.com/johndpope/stylegan2-ada/
or this branch
https://github.com/johndpope/stylegan2-ada/tree/digressions

tensorflow is dead to nvidia labs - gotta move to pytorch.
https://github.com/NVlabs/stylegan2-ada-pytorch

Answer 2 · 2021-08-24T00:59:21.000Z

@johndpope the links you suggested don't seem to have any changes for Windows compile flags, only for Linux. I tried them anyway, but the ops still don't compile and the error is the same. In case you didn't notice, I mentioned that I had tried disabling v2 behaviour (one of the main changes in the shared links), with no success.

I know that there is an official PyTorch port and all upcoming projects by NVlabs will use it, yet I still would like to compile the ops on Windows and Tf 2.x.

Answer 3 · 2021-08-24T01:22:03.000Z

I had some problems with fused ops on Linux; one of them was the gcc version. When the OS updated, gcc was bumped to 10.3 (broken), whereas 10.2 had been working fine, and I got a similar error. I had to point nvcc at gcc 9 rather than downgrade the system gcc.

NVIDIA/nccl#494

Answer 4 · 2021-08-24T21:36:57.000Z

Interesting. On Windows I use Visual Studio and MSVC (suggested by NVlabs, and it works for Tf 1.14). Actually, one of my guesses was that some versions of MSVC are not compatible with Tf 2.x (except the old ones), so I tried MSVC 14.16 (VS 2017) and MSVC 14.29 (VS 2019), but neither worked for Tf 2.x. I'm not sure whether the cause is the compiler version, some missing components of VS (though again it works for Tf 1.14), the Visual Studio version, or the compilation options (most likely the last one, in my opinion).

Answer 5 · 2022-05-19T16:40:25.000Z

I have the same problem building 'upfirdn_2d.cu'.
nvcc --std=c++11 -DNDEBUG "C:\Program Files\Python39\lib\site-packages\tensorflow\python_pywrap_tensorflow_internal.lib" --gpu-architecture=sm_86 --use_fast_math --disable-warnings --include-path "C:\Program Files\Python39\lib\site-packages\tensorflow\include" --include-path "C:\Program Files\Python39\lib\site-packages\tensorflow\include\external\protobuf_archive\src" --include-path "C:\Program Files\Python39\lib\site-packages\tensorflow\include\external\com_google_absl" --include-path "C:\Program Files\Python39\lib\site-packages\tensorflow\include\external\eigen_archive" --compiler-bindir "C:/Program Files (x86)/Microsoft Visual Studio/2019/Community/VC/Tools/MSVC/14.16.27023/bin/HostX64/x64" 2>&1 "D:\notebook_root\StyleGAN2-TensorFlow-2.x-master\dnnlib\ops\upfirdn_2d.cu" --shared -o "C:\Users\ADMINI~~1\AppData\Local\Temp\tmp5005yz24\upfirdn_2d_tmp.dll" --keep --keep-dir "C:\Users\ADMINI~~1\AppData\Local\Temp\tmp5005yz24"

D:/notebook_root/StyleGAN2-TensorFlow-2.x-master/dnnlib/ops/upfirdn_2d.cu(310): error: expected an expression

D:/notebook_root/StyleGAN2-TensorFlow-2.x-master/dnnlib/ops/upfirdn_2d.cu(310): error: no instance of constructor "tensorflow::register_op::OpDefBuilderWrapper::OpDefBuilderWrapper" matches the argument list
argument types are: (const char [10], __nv_bool)
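
For anyone who wants to experiment with individual flags, here is a minimal sketch that assembles the same kind of nvcc command as an argument list, so single options can be added or removed and the result passed to `subprocess.run`. It only mirrors the flags in the command above (without `--keep` and `--keep-dir`) and is not a fix for the error. The function name `build_nvcc_cmd` and the example paths are hypothetical placeholders.

```python
import os


def build_nvcc_cmd(source, output, tf_dir, tf_lib, compiler_bindir, arch="sm_86"):
    """Assemble an nvcc argument list mirroring the command quoted above.

    All arguments are caller-supplied; nothing is probed or validated.
    """
    include = os.path.join(tf_dir, "include")
    include_dirs = [
        include,
        os.path.join(include, "external", "protobuf_archive", "src"),
        os.path.join(include, "external", "com_google_absl"),
        os.path.join(include, "external", "eigen_archive"),
    ]
    cmd = ["nvcc", "--std=c++11", "-DNDEBUG", tf_lib,
           f"--gpu-architecture={arch}", "--use_fast_math", "--disable-warnings"]
    for d in include_dirs:
        cmd += ["--include-path", d]
    cmd += ["--compiler-bindir", compiler_bindir, source, "--shared", "-o", output]
    return cmd


# Placeholder paths for illustration only; substitute your own.
example = build_nvcc_cmd(
    source="upfirdn_2d.cu",
    output="upfirdn_2d_tmp.dll",
    tf_dir="path/to/site-packages/tensorflow",
    tf_lib="path/to/_pywrap_tensorflow_internal.lib",
    compiler_bindir="path/to/MSVC/bin/HostX64/x64",
)
print(" ".join(example))
```

Passing the list directly to `subprocess.run(example)` avoids quoting problems with paths that contain spaces, such as "Program Files".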