Vast.ai instance - **No module named 'upfirdn2d_plugin'**
dokluch opened this issue · 18 comments
Stuck here big time with ImportError: No module named 'upfirdn2d_plugin'
I am using a vast.ai instance nvidia/cuda:11.2.1-cudnn8-runtime-ubuntu18.04
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   30C    P0    35W / 250W |      0MiB / 16160MiB |      0%      Default |
Conda environment is set with
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch --yes
(doesn't matter if I try a newer one)
What I've tried
First I made sure my VM has CUDA 11.2 installed. Then I installed a newer torch built for CUDA 11.1.1, which did not help, so I rolled back (made a new env).
Removed torch_extensions
Just as described here:
#11
Didn't help
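For readers who want to repeat the "remove torch_extensions" step, here is a minimal Python sketch. It assumes the JIT build cache lives in one of three places: the directory named by the TORCH_EXTENSIONS_DIR environment variable, a torch_extensions folder under the system temp directory, or ~/.cache/torch_extensions. Which one applies depends on the PyTorch version, so treat the list as a guess to verify on your machine. The function names are hypothetical and not part of PyTorch or this repository.

```python
import os
import shutil
import tempfile
from pathlib import Path


def candidate_extension_dirs(env=None):
    """Return directories where JIT-built PyTorch extensions may be cached.

    Assumption: the locations below cover common PyTorch versions; check
    which one actually exists on your system.
    """
    env = os.environ if env is None else env
    candidates = []
    override = env.get("TORCH_EXTENSIONS_DIR")
    if override:
        candidates.append(Path(override))
    candidates.append(Path(tempfile.gettempdir()) / "torch_extensions")
    candidates.append(Path.home() / ".cache" / "torch_extensions")
    return candidates


def clear_extension_cache(dirs, dry_run=True):
    """Return the directories in `dirs` that exist; delete them unless dry_run."""
    found = [d for d in dirs if d.is_dir()]
    if not dry_run:
        for d in found:
            shutil.rmtree(d)
    return found


if __name__ == "__main__":
    # Dry run by default: only report what would be deleted.
    for path in clear_extension_cache(candidate_extension_dirs(), dry_run=True):
        print("would remove", path)
```

Clearing the cache only forces a rebuild. As the rest of this thread shows, it does not help when the compiler or the CUDA toolkit is missing.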
gcc
I found this thread:
#35
and tried installing gcc7:
conda install -c conda-forge/label/gcc7 gcc_linux-64
(didn't help)
and even gcc5
conda install -c psi4 gcc-5
The latter sent me in a weird loop and I've abandoned this path.
This does not help either
#2 (comment)
Google Colab works fine and has ubuntu 18.04 with gcc 7.5.0 installed which I am trying to mimic. Hope that is the correct logic.
UPD:
Another instance with gcc 7.5.0 throws the same error as well
gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
UPD2
Installing gcc 5 as described here: https://askubuntu.com/questions/1087150/install-gcc-5-on-ubuntu-18-04
Did not help either
UPD3
Sorry for not including the traceback originally
Traceback (most recent call last):
File "/root/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py", line 32, in _init
_plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
File "/root/stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 997, in load
keep_intermediates=keep_intermediates)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1213, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1560, in _import_module_from_library
file, path, description = imp.find_module(module_name, [path])
File "/usr/local/envs/stylegan/lib/python3.7/imp.py", line 296, in find_module
raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'upfirdn2d_plugin'
warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
/root/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:
Traceback (most recent call last):
File "/root/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py", line 32, in _init
_plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
File "/root/stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 997, in load
keep_intermediates=keep_intermediates)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1213, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1560, in _import_module_from_library
file, path, description = imp.find_module(module_name, [path])
File "/usr/local/envs/stylegan/lib/python3.7/imp.py", line 296, in find_module
raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'upfirdn2d_plugin'
warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())
Please advise on any possible next steps. I have no idea where to go next.
Originally posted by @dokluch in #2 (comment)
Please post the full stacktrace for the "No module named 'upfirdn2d_plugin'" exception, as requested in the issue template too:
2. See error (please copy&paste full log and stacktraces).
Just updated the original post with the traceback for generate.py
Somehow the real reason why the cpp extension build fails is not shown. Can you confirm this is on the latest version from GitHub? Can you also post the git commit id?
See if you get any more information if you apply the suggestion from #39 (comment)
I have followed the advice to modify those files and what I got is:
Traceback (most recent call last):
File "generate.py", line 127, in <module>
generate_images() # pylint: disable=no-value-for-parameter
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
return f(get_current_context(), *args, **kwargs)
File "generate.py", line 119, in generate_images
img = G(z, label, truncation_psi=truncation_psi, noise_mode=noise_mode)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "<string>", line 490, in forward
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "<string>", line 221, in forward
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "<string>", line 109, in forward
File "/root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.py", line 84, in bias_act
if impl == 'cuda' and x.device.type == 'cuda' and _init():
File "/root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.py", line 47, in _init
_plugin = custom_ops.get_plugin('bias_act_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
File "/root/stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 997, in load
keep_intermediates=keep_intermediates)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1202, in _jit_compile
with_cuda=with_cuda)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1273, in _write_ninja_file_and_build_library
check_compiler_abi_compatibility(compiler)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 265, in check_compiler_abi_compatibility
if not check_compiler_ok_for_platform(compiler):
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 225, in check_compiler_ok_for_platform
which = subprocess.check_output(['which', compiler], stderr=subprocess.STDOUT)
File "/usr/local/envs/stylegan/lib/python3.7/subprocess.py", line 411, in check_output
**kwargs).stdout
File "/usr/local/envs/stylegan/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.
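The traceback above ends with `which c++` returning a non-zero exit status, which means no executable named c++ is on PATH. The same lookup can be reproduced from Python before starting a long run. This is a minimal sketch: it assumes the build uses the CXX environment variable when set and falls back to c++ otherwise, which matches the command in the traceback but is not confirmed in this thread. The function name is hypothetical.

```python
import os
import shutil


def find_host_compiler(env=None):
    """Return (name, path) for the C++ compiler the extension build would look up.

    `path` is None when the compiler is not on PATH.
    Assumption: CXX overrides the default name "c++".
    """
    env = os.environ if env is None else env
    name = env.get("CXX", "c++")
    return name, shutil.which(name, path=env.get("PATH"))


if __name__ == "__main__":
    name, path = find_host_compiler()
    if path is None:
        print(f"{name} not found on PATH; the CUDA kernels cannot be compiled")
    else:
        print(f"{name} resolves to {path}")
```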
I ran it on the machine with gcc 5.5 installed and got a different error message:
Traceback (most recent call last):
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1539, in _run_ninja_build
env=env)
File "/usr/local/envs/stylegan/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "generate.py", line 127, in <module>
generate_images() # pylint: disable=no-value-for-parameter
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
return f(get_current_context(), *args, **kwargs)
File "generate.py", line 119, in generate_images
img = G(z, label, truncation_psi=truncation_psi, noise_mode=noise_mode)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "<string>", line 490, in forward
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "<string>", line 221, in forward
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "<string>", line 109, in forward
File "/root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.py", line 88, in bias_act
if impl == 'cuda' and x.device.type == 'cuda' and _init():
File "/root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.py", line 51, in _init
_plugin = custom_ops.get_plugin('bias_act_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
File "/root/stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 997, in load
keep_intermediates=keep_intermediates)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1202, in _jit_compile
with_cuda=with_cuda)
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1300, in _write_ninja_file_and_build_library
error_prefix="Error building extension '{}'".format(name))
File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'bias_act_plugin': [1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/TH -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/envs/stylegan/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' --use_fast_math -std=c++14 -c /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cu -o bias_act.cuda.o
FAILED: bias_act.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/TH -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/envs/stylegan/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' --use_fast_math -std=c++14 -c /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cu -o bias_act.cuda.o
/bin/sh: 1: /usr/local/cuda/bin/nvcc: not found
[2/3] c++ -MMD -MF bias_act.o.d -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/TH -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/envs/stylegan/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cpp -o bias_act.o
FAILED: bias_act.o
c++ -MMD -MF bias_act.o.d -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/TH -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/envs/stylegan/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cpp -o bias_act.o
In file included from /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cpp:10:0:
/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/ATen/cuda/CUDAContext.h:5:30: fatal error: cuda_runtime_api.h: No such file or directory
compilation terminated.
ninja: build stopped: subcommand failed.
PS. The irony is that my Windows machine is happily working with this repository while Ubuntu fails.
Are you sure you can't run Docker on this machine? It's usually an easy way to fix stuff like this.
Anyway, your run with GCC 5.5 gets a lot further, so at least there's some progress.
This error:
c++ -MMD -MF bias_act.o.d -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/TH -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/envs/stylegan/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cpp -o bias_act.o
In file included from /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cpp:10:0:
/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/ATen/cuda/CUDAContext.h:5:30: fatal error: cuda_runtime_api.h: No such file or directory
compilation terminated.
seems to suggest the compilation cannot find some cuda headers. In my containers it's here:
root@7367a65ac3a5:/workspace# ls /usr/local/cuda/include/cuda_runtime_api.h
/usr/local/cuda/include/cuda_runtime_api.h
Do you have CUDA installed in the first place? There's another error here that indicates it can't even find the CUDA compiler:
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/TH -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/envs/stylegan/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' --use_fast_math -std=c++14 -c /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cu -o bias_act.cuda.o
/bin/sh: 1: /usr/local/cuda/bin/nvcc: not found
Vast.ai support answered that I can't reinstall CUDA; I can only get a new instance with the CUDA version of my choice, which I did.
I am going to try to use Docker for this, but first I need a crash course on it since I've never used it in a real-world scenario.
UPD. I can't run Docker since their instances are already running inside Docker.
Bummer that you can't use Docker. I'm not sure how much more help I can give apart from what I've already given above.
I guess you'll have to work through the CUDA compilation issues on these instances. For example, why is nvcc not found when the extension gets built? Look at the file system on the vast.ai instance: does /usr/local/cuda exist, and can you find nvcc and the CUDA header files in the expected locations?
If the CUDA toolkit is installed in some non-standard location, maybe you can point PyTorch to it by setting CUDA_HOME appropriately? See https://pytorch.org/docs/stable/cpp_extension.html and torch.utils.cpp_extension.load for additional clues.
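The checks suggested here can be scripted. The sketch below looks for the two files the failed build complained about, nvcc and cuda_runtime_api.h, under a CUDA root directory. It assumes the root is whatever CUDA_HOME points to, or /usr/local/cuda when that variable is unset; PyTorch's own lookup may consider more locations. The function name is hypothetical.

```python
import os
from pathlib import Path


def check_cuda_toolkit(cuda_home=None):
    """Report whether nvcc and cuda_runtime_api.h exist under a CUDA root.

    Assumption: the root is `cuda_home`, else $CUDA_HOME, else /usr/local/cuda.
    """
    root = Path(cuda_home or os.environ.get("CUDA_HOME") or "/usr/local/cuda")
    return {
        "cuda_home": str(root),
        "nvcc": (root / "bin" / "nvcc").is_file(),
        "cuda_runtime_api.h": (root / "include" / "cuda_runtime_api.h").is_file(),
    }


if __name__ == "__main__":
    for key, value in check_cuda_toolkit().items():
        print(f"{key}: {value}")
```

If both entries come back False on an instance where `nvidia-smi` works, the driver is present but the toolkit is not, which is consistent with the errors posted earlier in this thread.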
Thank you for your time. I am going to go back to square one and try to do this all over again and hope it works. Or rent an instance somewhere else.
By the way, I just analyzed my Windows logs and found that upfirdn2d is indeed not building properly there either. Though it is a one-time error and it doesn't spam like in the previous cases:
C:\Code\ML\stylegan2-ada-pytorch\torch_utils\ops\upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:
Traceback (most recent call last):
File "C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\utils\cpp_extension.py", line 1539, in _run_ninja_build
env=env)
File "C:\Users\admin\.conda\envs\stylegan-pytorch\lib\subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Code\ML\stylegan2-ada-pytorch\torch_utils\ops\upfirdn2d.py", line 32, in _init
_plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
File "C:\Code\ML\stylegan2-ada-pytorch\torch_utils\custom_ops.py", line 110, in get_plugin
torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
File "C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\utils\cpp_extension.py", line 997, in load
keep_intermediates=keep_intermediates)
File "C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\utils\cpp_extension.py", line 1202, in _jit_compile
with_cuda=with_cuda)
File "C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\utils\cpp_extension.py", line 1300, in _write_ninja_file_and_build_library
error_prefix="Error building extension '{}'".format(name))
File "C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\utils\cpp_extension.py", line 1555, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'upfirdn2d_plugin': [1/1] "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29910\bin\Hostx64\x64/link.exe" upfirdn2d.o upfirdn2d.cuda.o /nologo /DLL c10.lib c10_cuda.lib torch_cpu.lib torch_cuda.lib -INCLUDE:?warp_size@cuda@at@@YAHXZ torch.lib torch_python.lib /LIBPATH:C:\Users\admin\.conda\envs\stylegan-pytorch\libs /LIBPATH:C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\lib "/LIBPATH:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\lib/x64" cudart.lib /out:upfirdn2d_plugin.pyd
FAILED: upfirdn2d_plugin.pyd
"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29910\bin\Hostx64\x64/link.exe" upfirdn2d.o upfirdn2d.cuda.o /nologo /DLL c10.lib c10_cuda.lib torch_cpu.lib torch_cuda.lib -INCLUDE:?warp_size@cuda@at@@YAHXZ torch.lib torch_python.lib /LIBPATH:C:\Users\admin\.conda\envs\stylegan-pytorch\libs /LIBPATH:C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\lib "/LIBPATH:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\lib/x64" cudart.lib /out:upfirdn2d_plugin.pyd
LINK : fatal error LNK1104: cannot open file 'upfirdn2d_plugin.pyd'
ninja: build stopped: subcommand failed.
warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())
Setting up PyTorch plugin "upfirdn2d_plugin"... Done.
UPD. The Vast.ai issue was fixed by choosing a "devel" type Ubuntu image instead of "runtime", since the runtime image does not have nvcc and gcc and it's impossible to properly install them.
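Since the fix came down to the image flavor, a quick look at the image name before renting can save a wasted instance. The sketch below assumes image tags follow the nvidia/cuda naming pattern, where one dash-separated part of the tag is base, runtime, or devel. It only inspects the name and cannot prove what is installed inside the image. Both function names are hypothetical.

```python
def cuda_image_flavor(image):
    """Return 'base', 'runtime', or 'devel' if the tag names a flavor, else None.

    Assumption: nvidia/cuda-style tags such as
    nvidia/cuda:11.2.1-cudnn8-runtime-ubuntu18.04.
    """
    tag = image.rsplit(":", 1)[-1]
    for part in tag.split("-"):
        if part in ("base", "runtime", "devel"):
            return part
    return None


def can_build_extensions(image):
    """Heuristic: only 'devel' images are expected to ship nvcc and CUDA headers."""
    return cuda_image_flavor(image) == "devel"


if __name__ == "__main__":
    image = "nvidia/cuda:11.2.1-cudnn8-runtime-ubuntu18.04"  # image from the original post
    print(image, "->", cuda_image_flavor(image), can_build_extensions(image))
```

For the image from the original post this reports the runtime flavor, which matches the missing nvcc seen in the build log.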
@dokluch Hi, could you share how exactly you set the vast.ai instance up for stylegan training? It would be amazing if you could share the exact name of the image you used and the on-start script!
Is it as simple as choosing 1.8.0-cuda11.1-cudnn8-devel as the image, or do I need to install nvidia-cuda-toolkits, gcc etc. on top of it?
That's pretty much it. You choose an nvidia-cuda image with the appropriate CUDA version.
You don't have to install gcc, toolkit etc. Docker won't let you anyway. Then SSH to the instance and start training.
I install miniconda and then run
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 tensorboard -c pytorch --yes
pip install click psutil scipy requests tqdm pyspng ninja imageio imageio-ffmpeg==0.4.3 ipywidgets jupyterlab
If you need UI, then start jupyter lab from SSH. Here's a guide on that: https://gist.github.com/hsed/197ded8431bb545dffefb742dab5efb8
The solution is cool.
Banging my head on this issue too... Which miniconda did you install? The StyleGAN docs say we should use python3.7 64 bits, but that installer is missing on the miniconda installers page... https://docs.conda.io/en/latest/miniconda.html#linux-installers it's got 32 bits for python3.7.
Also that docker instance comes very bare bones, no man, no vim. But your conda and pip commands should be enough?
Thanks a lot for all the pointers! I might finally see this through tonight...
Later Python versions should work fine too. I regularly run StyleGAN2 pytorch with Python 3.8 and 3.9.
It is finally working, phewwww. Thank you so much!
So indeed, future confused users, just go straight for the docker image and enjoy your training!
@dokluch Hi, I encountered exactly the same problem as you. My error showed that nvcc could not be found, and the file cuda_runtime_api.h could not be found either. But there is no problem with other compilation tasks that use nvcc, so I don't know why it fails when compiling here. I am running on my local host; this is my machine information:
Ubuntu 16.04, PyTorch 1.9.0, Python 3.7, CUDA 11.3, gcc 5.4.0, RTX Titan
I have tried all the methods in this issue but the problem is still not solved. I don't know if something is wrong with my Ubuntu system. I hope to get some of your comments and opinions. I haven't tried Docker yet, and I don't know whether moving to Docker for training is my only option for the next step.
Any advice and suggestions are welcome.
I can add that miniconda with Python 3.9 doesn't work (current latest version), while miniconda with Python 3.8 works like a charm.