NVlabs/stylegan2-ada-pytorch

Vast.ai instance - **No module named 'upfirdn2d_plugin'**

dokluch opened this issue · 18 comments

Stuck here big time with ImportError: No module named 'upfirdn2d_plugin'

I am using a vast.ai instance nvidia/cuda:11.2.1-cudnn8-runtime-ubuntu18.04

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   30C    P0    35W / 250W |      0MiB / 16160MiB |      0%      Default |

Conda environment is set with
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch --yes
(doesn't matter if I try a newer one)

What I've tried

First I made sure my VM has CUDA 11.2 installed. Then I installed a newer torch with CUDA 11.1.1, which did not help, so I rolled back (made a new env).

Removed torch_extensions
Just as described here:
#11

Didn't help
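
For readers repeating this step, here is a minimal sketch of clearing the JIT build cache from Python. It assumes the cache lives in `~/.cache/torch_extensions` on Linux unless the `TORCH_EXTENSIONS_DIR` environment variable points elsewhere; the helper names are hypothetical, and the exact layout can differ between PyTorch versions.

```python
import os
import shutil
from pathlib import Path


def torch_extensions_dir() -> Path:
    """Return the directory where JIT-built extensions are assumed to be cached."""
    # Assumption: TORCH_EXTENSIONS_DIR overrides the default location,
    # which on Linux is typically ~/.cache/torch_extensions.
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "torch_extensions"


def clear_extension_cache(cache_dir: Path) -> bool:
    """Delete the cache directory; return True if something was removed."""
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir)
        return True
    return False
```

Calling `clear_extension_cache(torch_extensions_dir())` forces a rebuild on the next run. It only removes stale build artifacts, so it cannot help when the compiler or CUDA toolkit is missing.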

gcc
I found this thread and
#35

And tried installing gcc7
conda install -c conda-forge/label/gcc7 gcc_linux-64 (didn't help)

and even gcc5
conda install -c psi4 gcc-5
The latter sent me in a weird loop and I've abandoned this path.

This does not help either
#2 (comment)

Google Colab works fine and has ubuntu 18.04 with gcc 7.5.0 installed which I am trying to mimic. Hope that is the correct logic.

UPD:
Another instance with gcc 7.5.0 throws the same error as well

gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.

UPD2
Installing gcc 5 as described here: https://askubuntu.com/questions/1087150/install-gcc-5-on-ubuntu-18-04
Did not help either

UPD3
Sorry for not including the traceback originally

Traceback (most recent call last):
  File "/root/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py", line 32, in _init
    _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
  File "/root/stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 997, in load
    keep_intermediates=keep_intermediates)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1213, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1560, in _import_module_from_library
    file, path, description = imp.find_module(module_name, [path])
  File "/usr/local/envs/stylegan/lib/python3.7/imp.py", line 296, in find_module
    raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'upfirdn2d_plugin'

  warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
/root/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

Traceback (most recent call last):
  File "/root/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py", line 32, in _init
    _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
  File "/root/stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 997, in load
    keep_intermediates=keep_intermediates)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1213, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1560, in _import_module_from_library
    file, path, description = imp.find_module(module_name, [path])
  File "/usr/local/envs/stylegan/lib/python3.7/imp.py", line 296, in find_module
    raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'upfirdn2d_plugin'

  warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())

Please advise on any possible next steps. I have no idea where to go from here.

Originally posted by @dokluch in #2 (comment)

Please post the full stacktrace for the "No module named 'upfirdn2d_plugin'" exception, as requested in the issue template too:

2. See error (please copy&paste full log and stacktraces).

Just updated the original post with the traceback for generate.py

Somehow the real reason why the cpp extension build fails is not shown. Can you confirm this is on the latest version from GitHub? Can you also post the git commit id?

See if you get any more information if you apply the suggestion from #39 (comment)

I have followed the advice to modify those files and what I got is:

Traceback (most recent call last):
  File "generate.py", line 127, in <module>
    generate_images() # pylint: disable=no-value-for-parameter
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "generate.py", line 119, in generate_images
    img = G(z, label, truncation_psi=truncation_psi, noise_mode=noise_mode)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 490, in forward
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 221, in forward
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 109, in forward
  File "/root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.py", line 84, in bias_act
    if impl == 'cuda' and x.device.type == 'cuda' and _init():
  File "/root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.py", line 47, in _init
    _plugin = custom_ops.get_plugin('bias_act_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
  File "/root/stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 997, in load
    keep_intermediates=keep_intermediates)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1202, in _jit_compile
    with_cuda=with_cuda)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1273, in _write_ninja_file_and_build_library
    check_compiler_abi_compatibility(compiler)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 265, in check_compiler_abi_compatibility
    if not check_compiler_ok_for_platform(compiler):
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 225, in check_compiler_ok_for_platform
    which = subprocess.check_output(['which', compiler], stderr=subprocess.STDOUT)
  File "/usr/local/envs/stylegan/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/usr/local/envs/stylegan/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.
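
The last frame shows the build probing for the host compiler with `which c++` and finding nothing. A minimal sketch of the same check, run before starting a build, could look like the following. The function name and the default tool list are hypothetical; `ninja` is included because the later logs show the build invoking it.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_build_tools(
    tools: Iterable[str] = ("c++", "ninja"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    # shutil.which returns None when the executable is not on PATH,
    # which corresponds to `which` exiting with a non-zero status.
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_build_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Host compiler and ninja are both on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out in tests.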

Ran it on the machine with gcc5.5 installed and got another error message

Traceback (most recent call last):
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1539, in _run_ninja_build
    env=env)
  File "/usr/local/envs/stylegan/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "generate.py", line 127, in <module>
    generate_images() # pylint: disable=no-value-for-parameter
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "generate.py", line 119, in generate_images
    img = G(z, label, truncation_psi=truncation_psi, noise_mode=noise_mode)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 490, in forward
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 221, in forward
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 109, in forward
  File "/root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.py", line 88, in bias_act
    if impl == 'cuda' and x.device.type == 'cuda' and _init():
  File "/root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.py", line 51, in _init
    _plugin = custom_ops.get_plugin('bias_act_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
  File "/root/stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 997, in load
    keep_intermediates=keep_intermediates)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1202, in _jit_compile
    with_cuda=with_cuda)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1300, in _write_ninja_file_and_build_library
    error_prefix="Error building extension '{}'".format(name))
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'bias_act_plugin': [1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/TH -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/envs/stylegan/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' --use_fast_math -std=c++14 -c /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cu -o bias_act.cuda.o 
FAILED: bias_act.cuda.o 
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/TH -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/envs/stylegan/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' --use_fast_math -std=c++14 -c /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cu -o bias_act.cuda.o 
/bin/sh: 1: /usr/local/cuda/bin/nvcc: not found
[2/3] c++ -MMD -MF bias_act.o.d -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/TH -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/envs/stylegan/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cpp -o bias_act.o 
FAILED: bias_act.o 
c++ -MMD -MF bias_act.o.d -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/TH -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/envs/stylegan/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cpp -o bias_act.o 
In file included from /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cpp:10:0:
/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/ATen/cuda/CUDAContext.h:5:30: fatal error: cuda_runtime_api.h: No such file or directory
compilation terminated.
ninja: build stopped: subcommand failed.

PS. The irony is that my Windows machine is happily working with this repository while Ubuntu fails.

Are you sure you can't run Docker on this machine? It's usually an easy way to fix stuff like this.

Anyway, your run with GCC 5.5 gets a lot further, so at least there's some progress.

This error:

c++ -MMD -MF bias_act.o.d -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/TH -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/envs/stylegan/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cpp -o bias_act.o 
In file included from /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cpp:10:0:
/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/ATen/cuda/CUDAContext.h:5:30: fatal error: cuda_runtime_api.h: No such file or directory
compilation terminated.

seems to suggest the compilation cannot find some cuda headers. In my containers it's here:

root@7367a65ac3a5:/workspace# ls /usr/local/cuda/include/cuda_runtime_api.h 
/usr/local/cuda/include/cuda_runtime_api.h

Do you have CUDA installed in the first place? There's another error here that indicates it can't even find the CUDA compiler:

/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/TH -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/envs/stylegan/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' --use_fast_math -std=c++14 -c /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cu -o bias_act.cuda.o 
/bin/sh: 1: /usr/local/cuda/bin/nvcc: not found

vast.ai support answered that I can't reinstall CUDA; I can only get a new instance with the CUDA version of my choice, which I did.
I am going to try using Docker for this, but first I need a crash course on it, since I've never used it in a real-world scenario.

UPD. I can't run Docker since their instances already run inside Docker.

Bummer that you can't use Docker. I'm not sure how much more help I can give apart from what I've already given above.

I guess you'll have to work through the CUDA compilation issues on these instances. For example, why is nvcc not found when the extension gets built? Look through the file system on the vast.ai instance: does /usr/local/cuda exist, can you find nvcc in the expected location, and are the CUDA header files there too?

If the CUDA toolkit is installed in some non-standard location, maybe you can point PyTorch to it by setting CUDA_HOME appropriately? See https://pytorch.org/docs/stable/cpp_extension.html and torch.utils.cpp_extension.load for additional clues.
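
The file system checks suggested above can be scripted. The sketch below is only an illustration: it treats a directory as a usable toolkit root if it contains both `bin/nvcc` and `include/cuda_runtime_api.h` (the two files the failed build could not find), and the candidate list, including `/opt/cuda`, is a hypothetical example to adapt to the instance.

```python
import os
from pathlib import Path
from typing import Iterable, Optional


def looks_like_cuda_toolkit(root: Path) -> bool:
    """True if root contains both the CUDA compiler and the runtime header."""
    has_nvcc = (root / "bin" / "nvcc").is_file()
    has_header = (root / "include" / "cuda_runtime_api.h").is_file()
    return has_nvcc and has_header


def find_cuda_home(candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate that looks like a full toolkit, else None."""
    for candidate in candidates:
        root = Path(candidate)
        if looks_like_cuda_toolkit(root):
            return root
    return None


if __name__ == "__main__":
    # Hypothetical candidate list; adjust it to the instance's file system.
    candidates = [os.environ.get("CUDA_HOME", ""), "/usr/local/cuda", "/opt/cuda"]
    home = find_cuda_home(c for c in candidates if c)
    if home is None:
        print("No directory with both nvcc and cuda_runtime_api.h was found.")
    else:
        print(f"export CUDA_HOME={home}")
```

If nothing is found, the toolkit is probably not installed at all, and setting CUDA_HOME will not help.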

Thank you for your time. I am going to go back to square one, do this all over again, and hope it works. Or rent an instance somewhere else.

By the way, I just analyzed my Windows logs and found that upfirdn2d is indeed not building properly there either. It is a one-time error, though, and it doesn't spam the log like in the previous cases:

C:\Code\ML\stylegan2-ada-pytorch\torch_utils\ops\upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

Traceback (most recent call last):
  File "C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\utils\cpp_extension.py", line 1539, in _run_ninja_build
    env=env)
  File "C:\Users\admin\.conda\envs\stylegan-pytorch\lib\subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Code\ML\stylegan2-ada-pytorch\torch_utils\ops\upfirdn2d.py", line 32, in _init
    _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
  File "C:\Code\ML\stylegan2-ada-pytorch\torch_utils\custom_ops.py", line 110, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
  File "C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\utils\cpp_extension.py", line 997, in load
    keep_intermediates=keep_intermediates)
  File "C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\utils\cpp_extension.py", line 1202, in _jit_compile
    with_cuda=with_cuda)
  File "C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\utils\cpp_extension.py", line 1300, in _write_ninja_file_and_build_library
    error_prefix="Error building extension '{}'".format(name))
  File "C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\utils\cpp_extension.py", line 1555, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'upfirdn2d_plugin': [1/1] "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29910\bin\Hostx64\x64/link.exe" upfirdn2d.o upfirdn2d.cuda.o /nologo /DLL c10.lib c10_cuda.lib torch_cpu.lib torch_cuda.lib -INCLUDE:?warp_size@cuda@at@@YAHXZ torch.lib torch_python.lib /LIBPATH:C:\Users\admin\.conda\envs\stylegan-pytorch\libs /LIBPATH:C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\lib "/LIBPATH:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\lib/x64" cudart.lib /out:upfirdn2d_plugin.pyd

FAILED: upfirdn2d_plugin.pyd 

"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29910\bin\Hostx64\x64/link.exe" upfirdn2d.o upfirdn2d.cuda.o /nologo /DLL c10.lib c10_cuda.lib torch_cpu.lib torch_cuda.lib -INCLUDE:?warp_size@cuda@at@@YAHXZ torch.lib torch_python.lib /LIBPATH:C:\Users\admin\.conda\envs\stylegan-pytorch\libs /LIBPATH:C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\lib "/LIBPATH:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\lib/x64" cudart.lib /out:upfirdn2d_plugin.pyd

LINK : fatal error LNK1104: cannot open file 'upfirdn2d_plugin.pyd'

ninja: build stopped: subcommand failed.



  warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())
Setting up PyTorch plugin "upfirdn2d_plugin"... Done.

UPD. The vast.ai issue was fixed by choosing a "devel" type Ubuntu image instead of "runtime": the runtime image has neither nvcc nor gcc, and it is impossible to install them properly.
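
As a small illustration of the fix, the flavor can be read off the image name before renting an instance. This sketch assumes the tag naming seen in this thread, where `runtime` or `devel` appears as a dash-separated part of the tag; `image_flavor` is a hypothetical helper, not part of any tool.

```python
from typing import Optional


def image_flavor(image: str) -> Optional[str]:
    """Return 'devel', 'runtime', or None, based on the image tag."""
    # Split "repo:tag" and inspect the dash-separated parts of the tag.
    # A bare tag without a repository prefix is handled as well.
    _, _, tag = image.rpartition(":")
    parts = tag.split("-")
    for flavor in ("devel", "runtime"):
        if flavor in parts:
            return flavor
    return None


if __name__ == "__main__":
    image = "nvidia/cuda:11.2.1-cudnn8-runtime-ubuntu18.04"
    if image_flavor(image) != "devel":
        print(f"{image} is not a devel image; nvcc and gcc are likely missing.")
```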

@dokluch Hi, could you share how exactly you set the vast.ai instance up for stylegan training? It would be amazing if you could share the exact name of the image you used and the on-start script!

Is it as simple as choosing 1.8.0-cuda11.1-cudnn8-devel as the image, or do I need to install nvidia-cuda-toolkits, gcc etc. on top of it?

That's pretty much it. Choose an nvidia/cuda image with the appropriate CUDA version.

You don't have to install gcc, toolkit etc. Docker won't let you anyway. Then SSH to the instance and start training.

I install miniconda and then run

conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 tensorboard -c pytorch --yes

pip install click psutil scipy requests tqdm pyspng ninja imageio imageio-ffmpeg==0.4.3 ipywidgets jupyterlab

If you need UI, then start jupyter lab from SSH. Here's a guide on that: https://gist.github.com/hsed/197ded8431bb545dffefb742dab5efb8
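
After running the two install commands, a quick sanity check can confirm that the packages are importable in the active environment. This is a minimal sketch; `missing_modules` is a hypothetical helper, and the module list assumes each import name matches its package name, which holds for the packages listed here.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    required = [
        "torch", "torchvision", "click", "psutil", "scipy",
        "requests", "tqdm", "pyspng", "ninja", "imageio",
    ]
    missing = missing_modules(required)
    print("Missing:", ", ".join(missing) if missing else "none")
```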

The solution is cool.

Banging my head on this issue too... Which miniconda did you install? The StyleGAN docs say we should use python3.7 64 bits, but that installer is missing on the miniconda installers page... https://docs.conda.io/en/latest/miniconda.html#linux-installers it's got 32 bits for python3.7.

Also that docker instance comes very bare bones, no man, no vim. But your conda and pip commands should be enough?

Thanks a lot for all the pointers! I might finally see this through tonight...

Later Python versions should work fine too. I regularly run StyleGAN2 pytorch with Python 3.8 and 3.9.

It is finally working, phewwww. Thank you so much!

So indeed, future confused users, just go straight for the docker image and enjoy your training!

@dokluch Hi, I encountered exactly the same problem as you. The error shows that nvcc cannot be found, and the file cuda_runtime_api.h cannot be found either. Other compilation tasks with nvcc work without problems, though, so I don't know why this build fails. I am running on my local machine; here is my machine information:

Ubuntu 16.04, PyTorch 1.9.0, Python 3.7, CUDA 11.3, gcc 5.4.0, RTX Titan

I have tried all the methods in this issue, but the problem is still not solved. I don't know if something is wrong with my Ubuntu system. I haven't tried Docker yet, and I don't know whether moving to Docker for training is my only option as a next step.

Any advice and suggestions are welcome.

I can add that miniconda with Python 3.9 doesn't work (current latest version), while miniconda with Python 3.8 works like a charm.