thu-coai/DA-Transformer

Running process stopped at “compiling cuda operations”

sudanl opened this issue · 5 comments

Hello! I can run the code successfully up to this point. However, when the running process reaches the step shown below, it stops and does not continue, and no error is printed. Do you have any advice about this problem?

2022-10-18 17:05:27 | INFO | fairseq.utils | ***********************CUDA enviroments for all 4 workers***********************
2022-10-18 17:05:27 | INFO | fairseq_cli.train | training on 4 devices (GPUs/TPUs)
2022-10-18 17:05:27 | INFO | fairseq_cli.train | max tokens per device = 2048 and max sentences per device = None
2022-10-18 17:05:27 | INFO | fairseq.trainer | Preparing to load checkpoint ./model/checkpoint_last.pt
2022-10-18 17:05:27 | INFO | fairseq.trainer | No existing checkpoint found ./model/checkpoint_last.pt
2022-10-18 17:05:27 | INFO | fairseq.trainer | loading train data for epoch 1
2022-10-18 17:05:28 | INFO | fairseq.data.data_utils | loaded 4,500,966 examples from: ./bin_data/WMT16/train.en-de.en
2022-10-18 17:05:28 | INFO | fairseq.data.data_utils | loaded 4,500,966 examples from: ./bin_data/WMT16/train.en-de.de
2022-10-18 17:05:28 | INFO | fairseq.tasks.translation | ./bin_data/WMT16 train en-de 4500966 examples
2022-10-18 17:05:34 | WARNING | fairseq.tasks.fairseq_task | 1,391 samples have invalid sizes and will be skipped, max_positions=(128, 1024), first few sample ids=[3749843, 2629309, 3912533, 2428533, 3659653, 4231852, 3663212, 2382171, 3373663, 4175821]
2022-10-18 17:05:34 | WARNING | fairseq.tasks.fairseq_task | 1,391 samples have invalid sizes and will be skipped, max_positions=(128, 1024), first few sample ids=[3749843, 2629309, 3912533, 2428533, 3659653, 4231852, 3663212, 2382171, 3373663, 4175821]
2022-10-18 17:05:34 | WARNING | fairseq.tasks.fairseq_task | 1,391 samples have invalid sizes and will be skipped, max_positions=(128, 1024), first few sample ids=[3749843, 2629309, 3912533, 2428533, 3659653, 4231852, 3663212, 2382171, 3373663, 4175821]
2022-10-18 17:05:34 | WARNING | fairseq.tasks.fairseq_task | 1,391 samples have invalid sizes and will be skipped, max_positions=(128, 1024), first few sample ids=[3749843, 2629309, 3912533, 2428533, 3659653, 4231852, 3663212, 2382171, 3373663, 4175821]
2022-10-18 17:05:35 | INFO | fairseq.data.iterators | grouped total_num_itrs = 1278
2022-10-18 17:05:35 | INFO | fairseq.trainer | begin training epoch 1
2022-10-18 17:05:35 | INFO | fairseq_cli.train | Start iterating over samples
Start compiling cuda operations for DA-Transformer...(It usually takes a few minutes for the first time running.)
Start compiling cuda operations for DA-Transformer...(It usually takes a few minutes for the first time running.)
Start compiling cuda operations for DA-Transformer...(It usually takes a few minutes for the first time running.)
Start compiling cuda operations for DA-Transformer...(It usually takes a few minutes for the first time running.)

What do you mean by "it stops"? Does the program exit, or does it just stop producing output?

It didn't exit and didn't produce any output.

@sudanl You can try deleting the CUDA extension folder (default path: $HOME/.cache/torch_extensions/py37_cu113/dag_loss_fn/, where the py37_cu113 part depends on your Python and CUDA versions). A previous compilation failure occasionally leaves it in a state that causes deadlocks.
Then run your script again and wait a while (it should take no more than 10 minutes) until the compilation finishes.

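A minimal sketch of that fix, assuming the Python 3.7 / CUDA 11.3 setup above (adjust the py37_cu113 folder name to your versions; train.sh is just a placeholder for whatever script you actually use to launch training):

# Remove the possibly corrupted JIT-compiled extension so PyTorch rebuilds it from scratch
rm -rf $HOME/.cache/torch_extensions/py37_cu113/dag_loss_fn
# Relaunch training; the first training step triggers recompilation, which takes a few minutes
bash train.sh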

I followed your suggestion but ran into the error below. Maybe I really can't run this code without updating gcc? ˃̣̣̥᷄⌓˂̣̣̥᷅

  • Python 3.7.10
  • gcc 4.8.5
  • Torch 1.10.1+Cuda 11.3
2022-10-19 14:17:25 | INFO | fairseq.trainer | Saving checkpoint to /data/home/USERNAME/DA-Transformer/model/crash.pt
2022-10-19 14:17:27 | INFO | fairseq.trainer | Finished saving checkpoint to /data/home/USERNAME/DA-Transformer/model/crash.pt
Traceback (most recent call last):
  File "/data/home/USERNAME/anaconda3/envs/DAT/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/data/home/USERNAME/DA-Transformer/fairseq_cli/train.py", line 530, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/data/home/USERNAME/DA-Transformer/fairseq/distributed/utils.py", line 351, in call_main
    join=True,
  File "/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1723, in _run_ninja_build
    env=env)
  File "/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/data/home/USERNAME/DA-Transformer/fairseq/distributed/utils.py", line 328, in distributed_main
    main(cfg, **kwargs)
  File "/data/home/USERNAME/DA-Transformer/fairseq_cli/train.py", line 190, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/data/home/USERNAME/DA-Transformer/fairseq_cli/train.py", line 305, in train
    log_output = trainer.train_step(samples)
  File "/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/data/home/USERNAME/DA-Transformer/fairseq/trainer.py", line 856, in train_step
    raise e
  File "/data/home/USERNAME/DA-Transformer/fairseq/trainer.py", line 830, in train_step
    **extra_kwargs,
  File "/data/home/USERNAME/DA-Transformer/fs_plugins/tasks/translation_lev_modified.py", line 195, in train_step
    loss, sample_size, logging_output = criterion(model, sample)
  File "/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/home/USERNAME/DA-Transformer/fs_plugins/criterions/nat_dag_loss.py", line 266, in forward
    outputs = model(src_tokens, src_lengths, prev_output_tokens, tgt_tokens, glat, glat_function)
  File "/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/home/USERNAME/DA-Transformer/fairseq/distributed/module_proxy_wrapper.py", line 56, in forward
    return self.module(*args, **kwargs)
  File "/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/home/USERNAME/DA-Transformer/fs_plugins/models/glat_decomposed_with_link.py", line 250, in forward
    prev_output_tokens, tgt_tokens, glat_info = glat_function(self, word_ins_out, tgt_tokens, prev_output_tokens, glat, links=links)
  File "/data/home/USERNAME/DA-Transformer/fs_plugins/criterions/nat_dag_loss.py", line 213, in glat_function
    word_ins_out, match = dag_logsoftmax_gather_inplace(word_ins_out, tgt_tokens.unsqueeze(1).expand(-1, prelen, -1))
  File "/data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_loss.py", line 206, in forward
    selected_result = get_dag_kernel().logsoftmax_gather(word_ins_out, select_idx, require_gradient)
  File "/data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_loss.py", line 60, in get_dag_kernel
    extra_include_paths=extra_include_paths,
  File "/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1136, in load
    keep_intermediates=keep_intermediates)
  File "/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1347, in _jit_compile
    is_standalone=is_standalone)
  File "/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1452, in _write_ninja_file_and_build_library
    error_prefix=f"Error building extension '{name}'")
  File "/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1733, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'dag_loss_fn': [1/5] c++ -MMD -MF dag_loss.o.d -DTORCH_EXTENSION_NAME=dag_loss_fn -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/TH -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -DOF_SOFTMAX_USE_FAST_MATH -O3 -c /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_loss.cpp -o dag_loss.o 
FAILED: dag_loss.o 
c++ -MMD -MF dag_loss.o.d -DTORCH_EXTENSION_NAME=dag_loss_fn -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/TH -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -DOF_SOFTMAX_USE_FAST_MATH -O3 -c /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_loss.cpp -o dag_loss.o 
c++: error: unrecognized command line option ‘-std=c++14’
[2/5] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=dag_loss_fn -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/TH -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -DOF_SOFTMAX_USE_FAST_MATH -O3 -std=c++14 -c /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_loss.cu -o dag_loss.cuda.o 
FAILED: dag_loss.cuda.o 
/usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=dag_loss_fn -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/TH -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -DOF_SOFTMAX_USE_FAST_MATH -O3 -std=c++14 -c /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_loss.cu -o dag_loss.cuda.o 
nvcc warning : The -std=c++14 flag is not supported with the configured host compiler. Flag will be ignored.
In file included from /usr/include/c++/4.8.2/tuple:35:0,
                 from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_loss.cu:22:
/usr/include/c++/4.8.2/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
 #error This file requires compiler and library support for the \
  ^
In file included from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_loss.cu:25:0:
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/macros/Macros.h:215:22: error: missing binary operator before token "("
 #elif __has_attribute(always_inline) || defined(__GNUC__)
                      ^
In file included from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_loss.cu:26:0:
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/ATen.h:4:2: error: #error C++14 or later compatible compiler is required to use ATen.
 #error C++14 or later compatible compiler is required to use ATen.
  ^
In file included from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/string_view.h:4:0,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/StringUtil.h:6,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:6,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/core/Device.h:5,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/core/Allocator.h:6,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/ATen.h:7,
                 from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_loss.cu:26:
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/C++17.h:16:2: error: #error "You're trying to build PyTorch with a too old version of GCC. We need GCC 5 or later."
 #error \
  ^
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/C++17.h:27:2: error: #error You need C++14 to compile PyTorch
 #error You need C++14 to compile PyTorch
  ^
In file included from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/typeid.h:25:0,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/core/ScalarTypeToTypeMeta.h:4,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/core/TensorOptions.h:10,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/Operators.h:14,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/core/TensorBody.h:3,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/Tensor.h:3,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/Context.h:4,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/ATen.h:9,
                 from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_loss.cu:26:
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/TypeIndex.h:76:2: error: #error "You're running a too old version of GCC. We need GCC 5 or later."
 #error "You're running a too old version of GCC. We need GCC 5 or later."
  ^
In file included from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/extension.h:4:0,
                 from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_loss.cu:29:
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/all.h:4:2: error: #error C++14 or later compatible compiler is required to use PyTorch.
 #error C++14 or later compatible compiler is required to use PyTorch.
  ^
In file included from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/enum.h:7:0,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/all.h:9,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/extension.h:4,
                 from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_loss.cu:29:
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/variant.h:243:2: error: #error "MPark.Variant requires C++11 support."
 #error "MPark.Variant requires C++11 support."
  ^
[3/5] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=dag_loss_fn -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/TH -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -DOF_SOFTMAX_USE_FAST_MATH -O3 -std=c++14 -c /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_best_alignment.cu -o dag_best_alignment.cuda.o 
FAILED: dag_best_alignment.cuda.o 
/usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=dag_loss_fn -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/TH -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -DOF_SOFTMAX_USE_FAST_MATH -O3 -std=c++14 -c /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_best_alignment.cu -o dag_best_alignment.cuda.o 
nvcc warning : The -std=c++14 flag is not supported with the configured host compiler. Flag will be ignored.
In file included from /usr/include/c++/4.8.2/tuple:35:0,
                 from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_best_alignment.cu:22:
/usr/include/c++/4.8.2/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
 #error This file requires compiler and library support for the \
  ^
In file included from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_best_alignment.cu:25:0:
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/macros/Macros.h:215:22: error: missing binary operator before token "("
 #elif __has_attribute(always_inline) || defined(__GNUC__)
                      ^
In file included from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_best_alignment.cu:26:0:
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/ATen.h:4:2: error: #error C++14 or later compatible compiler is required to use ATen.
 #error C++14 or later compatible compiler is required to use ATen.
  ^
In file included from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/string_view.h:4:0,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/StringUtil.h:6,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:6,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/core/Device.h:5,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/core/Allocator.h:6,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/ATen.h:7,
                 from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_best_alignment.cu:26:
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/C++17.h:16:2: error: #error "You're trying to build PyTorch with a too old version of GCC. We need GCC 5 or later."
 #error \
  ^
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/C++17.h:27:2: error: #error You need C++14 to compile PyTorch
 #error You need C++14 to compile PyTorch
  ^
In file included from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/typeid.h:25:0,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/core/ScalarTypeToTypeMeta.h:4,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/core/TensorOptions.h:10,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/Operators.h:14,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/core/TensorBody.h:3,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/Tensor.h:3,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/Context.h:4,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/ATen.h:9,
                 from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_best_alignment.cu:26:
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/TypeIndex.h:76:2: error: #error "You're running a too old version of GCC. We need GCC 5 or later."
 #error "You're running a too old version of GCC. We need GCC 5 or later."
  ^
In file included from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/extension.h:4:0,
                 from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_best_alignment.cu:29:
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/all.h:4:2: error: #error C++14 or later compatible compiler is required to use PyTorch.
 #error C++14 or later compatible compiler is required to use PyTorch.
  ^
In file included from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/enum.h:7:0,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/all.h:9,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/extension.h:4,
                 from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/dag_best_alignment.cu:29:
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/variant.h:243:2: error: #error "MPark.Variant requires C++11 support."
 #error "MPark.Variant requires C++11 support."
  ^
[4/5] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=dag_loss_fn -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/TH -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -DOF_SOFTMAX_USE_FAST_MATH -O3 -std=c++14 -c /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/logsoftmax_gather.cu -o logsoftmax_gather.cuda.o 
FAILED: logsoftmax_gather.cuda.o 
/usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=dag_loss_fn -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/TH -isystem /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /data/home/USERNAME/anaconda3/envs/DAT/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -DOF_SOFTMAX_USE_FAST_MATH -O3 -std=c++14 -c /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/logsoftmax_gather.cu -o logsoftmax_gather.cuda.o 
nvcc warning : The -std=c++14 flag is not supported with the configured host compiler. Flag will be ignored.
In file included from /usr/include/c++/4.8.2/tuple:35:0,
                 from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/logsoftmax_gather.cu:25:
/usr/include/c++/4.8.2/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
 #error This file requires compiler and library support for the \
  ^
In file included from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/logsoftmax_gather.cu:28:0:
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/macros/Macros.h:215:22: error: missing binary operator before token "("
 #elif __has_attribute(always_inline) || defined(__GNUC__)
                      ^
In file included from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/logsoftmax_gather.cu:29:0:
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/ATen.h:4:2: error: #error C++14 or later compatible compiler is required to use ATen.
 #error C++14 or later compatible compiler is required to use ATen.
  ^
In file included from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/string_view.h:4:0,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/StringUtil.h:6,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:6,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/core/Device.h:5,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/core/Allocator.h:6,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/ATen.h:7,
                 from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/logsoftmax_gather.cu:29:
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/C++17.h:16:2: error: #error "You're trying to build PyTorch with a too old version of GCC. We need GCC 5 or later."
 #error \
  ^
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/C++17.h:27:2: error: #error You need C++14 to compile PyTorch
 #error You need C++14 to compile PyTorch
  ^
In file included from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/typeid.h:25:0,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/core/ScalarTypeToTypeMeta.h:4,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/core/TensorOptions.h:10,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/Operators.h:14,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/core/TensorBody.h:3,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/Tensor.h:3,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/Context.h:4,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/ATen/ATen.h:9,
                 from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/logsoftmax_gather.cu:29:
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/TypeIndex.h:76:2: error: #error "You're running a too old version of GCC. We need GCC 5 or later."
 #error "You're running a too old version of GCC. We need GCC 5 or later."
  ^
In file included from /usr/local/cuda/include/thrust/detail/type_deduction.h:11:0,
                 from /usr/local/cuda/include/thrust/detail/functional/operators/operator_adaptors.h:21,
                 from /usr/local/cuda/include/thrust/detail/functional/operators/assignment_operator.h:22,
                 from /usr/local/cuda/include/thrust/detail/functional/actor.h:32,
                 from /usr/local/cuda/include/thrust/detail/functional/placeholder.h:20,
                 from /usr/local/cuda/include/thrust/functional.h:26,
                 from /usr/local/cuda/include/thrust/system/detail/error_category.inl:22,
                 from /usr/local/cuda/include/thrust/system/error_code.h:520,
                 from /usr/local/cuda/include/thrust/system_error.h:49,
                 from /usr/local/cuda/include/thrust/system/cuda/detail/util.h:34,
                 from /usr/local/cuda/include/thrust/system/cuda/detail/core/alignment.h:21,
                 from /usr/local/cuda/include/thrust/system/cuda/detail/core/triple_chevron_launch.h:30,
                 from /usr/local/cuda/include/cub/device/dispatch/dispatch_histogram.cuh:48,
                 from /usr/local/cuda/include/cub/device/device_histogram.cuh:41,
                 from /usr/local/cuda/include/cub/cub.cuh:52,
                 from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/logsoftmax_gather.cu:31:
/usr/local/cuda/include/thrust/detail/cpp11_required.h:23:6: error: #error C++11 is required for this Thrust feature; please upgrade your compiler or pass the appropriate -std=c++XX flag to it.
 #    error C++11 is required for this Thrust feature; please upgrade your compiler or pass the appropriate -std=c++XX flag to it.
      ^
In file included from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/extension.h:4:0,
                 from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/logsoftmax_gather.cu:34:
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/all.h:4:2: error: #error C++14 or later compatible compiler is required to use PyTorch.
 #error C++14 or later compatible compiler is required to use PyTorch.
  ^
In file included from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/enum.h:7:0,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/all.h:9,
                 from /data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/torch/extension.h:4,
                 from /data/home/USERNAME/DA-Transformer/fs_plugins/custom_ops/logsoftmax_gather.cu:34:
/data/home/USERNAME/anaconda3/envs/DAT/lib/python3.7/site-packages/torch/include/c10/util/variant.h:243:2: error: #error "MPark.Variant requires C++11 support."
 #error "MPark.Variant requires C++11 support."
  ^
ninja: build stopped: subcommand failed.

@sudanl Your g++ version is too old to compile against the PyTorch headers. If you do not want to upgrade gcc, you still have two options to run our model successfully:

  1. Disable all custom CUDA operations by adding the following flags to your script (see the example command after this list). It will be slightly slower and use much more GPU memory (maybe +40~70%):
--torch-dag-loss                  # Use the torch implementation for the DAG loss instead of the cuda implementation. It may become slower and consume more memory.
--torch-dag-best-alignment        # Use the torch implementation for best-alignment instead of the cuda implementation. It may become slower and consume more memory.
--torch-dag-logsoftmax-gather     # Use the torch implementation for logsoftmax-gather instead of the cuda implementation. It may become slower and consume more memory.
  2. You can compile the cuda code on another computer (with the same versions of cuda/pytorch/python/[maybe gpu], but a newer g++), then copy the compiled files to your server. Since you seem to have the same environment as mine, I have uploaded the binary files here. You can extract these files to $HOME/.cache/torch_extensions/py37_cu113 and try again (a sketch of this step follows my environment info below).
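For option 1, a minimal sketch of the resulting command, assuming your script launches fairseq-train on the binarized WMT16 data shown in the log above; $EXISTING_OPTIONS is a placeholder for all the other arguments your script already passes, and only the three fallback flags are new:

# Hypothetical example: append the three fallback flags to your existing fairseq-train command
fairseq-train ./bin_data/WMT16 \
    $EXISTING_OPTIONS \
    --torch-dag-loss \
    --torch-dag-best-alignment \
    --torch-dag-logsoftmax-gather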

My environment:

  • Python 3.7.13
  • Torch 1.10.1+Cuda 11.3
  • GPU Nvidia V100-32G
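For option 2, a minimal sketch of the copy step, assuming you have downloaded the archive attached above (the archive file name here is only a placeholder; the target directory is the default extension cache path mentioned earlier):

# Hypothetical sketch: place the precompiled extension into the torch extension cache
mkdir -p $HOME/.cache/torch_extensions/py37_cu113
tar -xzf dag_loss_fn_precompiled.tar.gz -C $HOME/.cache/torch_extensions/py37_cu113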