ArcticHare105/SPIN

ninja: build stopped: subcommand failed.


compile cuda source of 'pair_wise_distance' function...
NOTE: if you avoid this process, you make .cu file and compile it following https://pytorch.org/tutorials/advanced/cpp_extension.html
Traceback (most recent call last):
  File "/home/hs/anaconda3/envs/tgrs1/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1723, in _run_ninja_build
    env=env)
  File "/home/hs/anaconda3/envs/tgrs1/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "demo_train.py", line 17, in <module>
    sr_model = model.Model(args, checkpoint)
  File "/media/save_old/mnt/hs_data/SR_work/tgrs/LBNet/codes/model/__init__.py", line 23, in __init__
    module = import_module('model.' + args.model.lower())
  File "/home/hs/anaconda3/envs/tgrs1/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/media/save_old/mnt/hs_data/SR_work/tgrs/LBNet/codes/model/spin.py", line 7, in <module>
    from .pair_wise_distance import PairwiseDistFunction
  File "/media/save_old/mnt/hs_data/SR_work/tgrs/LBNet/codes/model/pair_wise_distance.py", line 9, in <module>
    "pair_wise_distance", cpp_sources="", cuda_sources=source
  File "/home/hs/anaconda3/envs/tgrs1/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1285, in load_inline
    keep_intermediates=keep_intermediates)
  File "/home/hs/anaconda3/envs/tgrs1/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1347, in _jit_compile
    is_standalone=is_standalone)
  File "/home/hs/anaconda3/envs/tgrs1/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1452, in _write_ninja_file_and_build_library
    error_prefix=f"Error building extension '{name}'")
  File "/home/hs/anaconda3/envs/tgrs1/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1733, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'pair_wise_distance': [1/3] /usr/bin/nvcc -DTORCH_EXTENSION_NAME=pair_wise_distance -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /home/hs/anaconda3/envs/tgrs1/lib/python3.7/site-packages/torch/include -isystem /home/hs/anaconda3/envs/tgrs1/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /home/hs/anaconda3/envs/tgrs1/lib/python3.7/site-packages/torch/include/TH -isystem /home/hs/anaconda3/envs/tgrs1/lib/python3.7/site-packages/torch/include/THC -isystem /home/hs/anaconda3/envs/tgrs1/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++14 -c /home/hs/.cache/torch_extensions/py37_cu111/pair_wise_distance/cuda.cu -o cuda.cuda.o
FAILED: cuda.cuda.o
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=pair_wise_distance -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /home/hs/anaconda3/envs/tgrs1/lib/python3.7/site-packages/torch/include -isystem /home/hs/anaconda3/envs/tgrs1/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /home/hs/anaconda3/envs/tgrs1/lib/python3.7/site-packages/torch/include/TH -isystem /home/hs/anaconda3/envs/tgrs1/lib/python3.7/site-packages/torch/include/THC -isystem /home/hs/anaconda3/envs/tgrs1/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++14 -c /home/hs/.cache/torch_extensions/py37_cu111/pair_wise_distance/cuda.cu -o cuda.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_86'
[2/3] c++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=pair_wise_distance -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /home/hs/anaconda3/envs/tgrs1/lib/python3.7/site-packages/torch/include -isystem /home/hs/anaconda3/envs/tgrs1/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /home/hs/anaconda3/envs/tgrs1/lib/python3.7/site-packages/torch/include/TH -isystem /home/hs/anaconda3/envs/tgrs1/lib/python3.7/site-packages/torch/include/THC -isystem /home/hs/anaconda3/envs/tgrs1/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /home/hs/.cache/torch_extensions/py37_cu111/pair_wise_distance/main.cpp -o main.o
ninja: build stopped: subcommand failed.
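
For context, 'compute_86' (Ampere, e.g. RTX 30xx cards) is only recognized by nvcc from CUDA 11.1 onward, and the failing command is invoking /usr/bin/nvcc, i.e. the system toolkit rather than the CUDA 11.1 toolkit the cu111 PyTorch wheel expects. A minimal sketch of one possible workaround, assuming a CUDA >= 11.1 toolkit is installed under /usr/local/cuda-11.1 (a hypothetical path, adjust to your install), is to point the JIT build at that toolkit before anything imports the model:

import os

# torch.utils.cpp_extension reads CUDA_HOME (and otherwise falls back to
# `which nvcc`) when it JIT-compiles the extension, so set these before
# model.spin / pair_wise_distance are imported.
os.environ["CUDA_HOME"] = "/usr/local/cuda-11.1"   # hypothetical install path
os.environ["PATH"] = os.environ["CUDA_HOME"] + "/bin:" + os.environ["PATH"]

# Alternative: pin the arch list to something the available nvcc does accept,
# at the cost of building for an older architecture:
# os.environ["TORCH_CUDA_ARCH_LIST"] = "7.5+PTX"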

I have the same problem: nvcc fatal : Unsupported gpu architecture 'compute_86'.
Have you solved it?

same problem

sym330 commented

same error

same error

I encountered the 'Unsupported GPU architecture 'compute_89'' error and managed to resolve it using a Docker container. Here's a brief walkthrough of my solution:

I used a Docker container built by DGL (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/dgl). This image primarily includes CUDA 12.2, Torch 2.1.0, and the Deep Graph Library.
After torch 1.10, the THC namespace was deprecated and migrated into ATen, so within the Docker container I made the following changes to the headers of the pair_wise_distance_cuda_source.py source file:

#include <stdio.h>
#include <math.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define CUDA_NUM_THREADS 256

#include <torch/extension.h>
#include <torch/types.h>
#include <ATen/core/TensorAccessor.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/Atomic.cuh>
#include <ATen/cuda/DeviceUtils.cuh>
// #include <THC/THC.h>
// #include <THC/THCAtomics.cuh>
// #include <THC/THCDeviceUtils.cuh>

After these modifications, I successfully loaded pair_wise_distance_cuda.
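
A quick way to check the change took effect (a sketch, not repo code: the cache path comes from the traceback above and the import assumes you run from the repo's codes/ directory) is to drop the stale failed build and re-trigger the JIT compile:

import shutil

# Remove the cached (failed) build so torch.utils.cpp_extension recompiles
# the patched CUDA source instead of reusing the old ninja files.
shutil.rmtree(
    "/home/hs/.cache/torch_extensions/py37_cu111/pair_wise_distance",
    ignore_errors=True,
)

# Importing the module re-runs load_inline on the edited headers.
from model.pair_wise_distance import PairwiseDistFunction  # noqa: F401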

I'm not entirely sure why the environment in the DGL image fixes the problem, but I hope this can be of help to others facing a similar issue.

@237014845 @Wwwww-disign @sym330 @wizard1023 @JiangYun77

That works, thanks!