Transcoder fails with error: CUBLAS_STATUS_NOT_INITIALIZED
raffian opened this issue · 2 comments
After updating the NVIDIA driver to 510.06, CUDA is finally recognized.
(codeGen_env) usr1@mak:~/projects/CodeGen$ python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.7.0
Is debug build: True
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.6 (64-bit runtime)
**Is CUDA available: True**
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1080
Nvidia driver version: 510.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.7.0
[pip3] torchaudio==0.7.0a0+ac17b64
[pip3] torchvision==0.8.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.0.3 h15472ef_9 conda-forge
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] libblas 3.9.0 12_linux64_mkl conda-forge
[conda] libcblas 3.9.0 12_linux64_mkl conda-forge
[conda] liblapack 3.9.0 12_linux64_mkl conda-forge
[conda] mkl 2021.4.0 h06a4308_640
[conda] numpy 1.19.5 py36hfc0c790_2 conda-forge
[conda] pytorch 1.7.0 py3.6_cuda11.0.221_cudnn8.0.3_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchaudio 0.7.0 py36 pytorch
[conda] torchvision 0.8.1 py36_cu110 pytorch
(codeGen_env) usr1@mak:~/projects/CodeGen$
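As a cross-check of the report above, the same facts are visible from standard torch calls; a minimal sketch (the commented values come from the collect_env output, not from a separate run):

import torch

print(torch.__version__)               # 1.7.0
print(torch.version.cuda)              # 11.0 - the CUDA version the PyTorch binaries were built with
print(torch.cuda.is_available())       # True
print(torch.cuda.get_device_name(0))   # NVIDIA GeForce GTX 1080
print(torch.backends.cudnn.version())  # bundled cuDNN (8.0.3 per the conda package above)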
I successfully compiled the CUDA extensions, but I had to comment out the version check below in apex to suppress a runtime error, as the error output itself suggests. This may (or may not) be the cause of CUBLAS_STATUS_NOT_INITIALIZED (see below).
apex setup.py - commented out:
if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
    raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
...
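For context, the check that was commented out compares the CUDA release reported by nvcc against the CUDA version the PyTorch binaries were built with. A rough sketch of that comparison (illustrative only, not the actual apex code; the function name is made up):

import subprocess
import torch

def cuda_versions_match():
    # nvcc prints e.g. "Cuda compilation tools, release 10.1, V10.1.243"
    nvcc_out = subprocess.check_output(["nvcc", "--version"]).decode()
    bare_metal = nvcc_out.split("release")[1].split(",")[0].strip()   # "10.1"
    return bare_metal.split(".")[:2] == torch.version.cuda.split(".")[:2]

print(cuda_versions_match())  # False here: nvcc is 10.1, PyTorch was built with CUDA 11.0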
Cuda Ext Compilation
(codeGen_env) usr1@mak:~/projects/CodeGen/apex$ pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
/home/usr1/anaconda3/envs/codeGen_env/lib/python3.6/site-packages/pip/_internal/commands/install.py:245: UserWarning: Disabling all use of wheels due to the use of --build-option / --global-option / --install-option.
cmdoptions.check_install_build_global(options)
Using pip 21.3.1 from /home/usr1/anaconda3/envs/codeGen_env/lib/python3.6/site-packages/pip (python 3.6)
Processing /home/usr1/projects/CodeGen/apex
Running command python setup.py egg_info
torch.__version__ = 1.7.0
running egg_info
creating /tmp/pip-pip-egg-info-_podlbbz/apex.egg-info
writing /tmp/pip-pip-egg-info-_podlbbz/apex.egg-info/PKG-INFO
writing dependency_links to /tmp/pip-pip-egg-info-_podlbbz/apex.egg-info/dependency_links.txt
writing top-level names to /tmp/pip-pip-egg-info-_podlbbz/apex.egg-info/top_level.txt
writing manifest file '/tmp/pip-pip-egg-info-_podlbbz/apex.egg-info/SOURCES.txt'
reading manifest file '/tmp/pip-pip-egg-info-_podlbbz/apex.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file '/tmp/pip-pip-egg-info-_podlbbz/apex.egg-info/SOURCES.txt'
/home/usr1/projects/CodeGen/apex/setup.py:67: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")
Preparing metadata (setup.py) ... done
Skipping wheel build for apex, due to binaries being disabled for it.
Installing collected packages: apex
Running command /home/usr1/anaconda3/envs/codeGen_env/bin/python3.6 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/usr1/projects/CodeGen/apex/setup.py'"'"'; __file__='"'"'/home/usr1/projects/CodeGen/apex/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-tezgj0d4/install-record.txt --single-version-externally-managed --compile --install-headers /home/usr1/anaconda3/envs/codeGen_env/include/python3.6m/apex
torch.__version__ = 1.7.0
/home/usr1/projects/CodeGen/apex/setup.py:67: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")
Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
from /usr/bin
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/usr1/projects/CodeGen/apex/setup.py", line 159, in <module>
check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)
File "/home/usr1/projects/CodeGen/apex/setup.py", line 103, in check_cuda_torch_binary_vs_bare_metal
"https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. "
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 11.0.
In some cases, a minor-version mismatch will not cause later errors: https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. You can try commenting out this check (at your own risk).
Translation
cpp --> Java
The source file is actually C (not cpp), about 600 lines of code like this (nothing fancy):
int DBG_change(int level_offset)
{
int retval = 0;
if ((debug_level += level_offset) < MSG_DEBUG_LEVEL_NONE)
{
debug_level = MSG_DEBUG_LEVEL_NONE;
debug_status = 0;
}
if (! debug_file_ptr && debug_level >= MSG_DEBUG_LEVEL_NONE)
{
retval = DBG_setup((char *) NULL, (char *) NULL);
}
else if (debug_file_ptr && debug_level <= MSG_DEBUG_LEVEL_NONE)
DBG_close();
return (retval);
}
static void DBG_set_level(char *app_name)
{
char *envptr;
char debug_env[MAX_ENVVARLEN + 1];
if ((app_name != NULL) && *app_name)
{
strncpy(debug_env, app_name, MAX_ENVVARLEN);
strncat(debug_env, "_DEBUG_LEVEL", MAX_ENVVARLEN - strlen(debug_env));
}
else
{
strcpy(debug_env, "DEBUG_LEVEL");
}
envptr = (char *) getenv(debug_env);
if (envptr == NULL)
debug_level = MSG_DEBUG_LEVEL_NONE;
else if (! strcmp(envptr, "TRUE") || ! strcmp(envptr, "true"))
debug_level = MSG_DEBUG_LEVEL_ON_MIN;
else if (! strcmp(envptr, "MIN") || ! strcmp(envptr, "min"))
debug_level = MSG_DEBUG_LEVEL_ON_MIN;
else if (! strcmp(envptr, "NORM") || ! strcmp(envptr, "norm"))
debug_level = MSG_DEBUG_LEVEL_ON_NORM;
else if (! strcmp(envptr, "MAX") || ! strcmp(envptr, "max"))
debug_level = MSG_DEBUG_LEVEL_ON_MAX;
else if (! strcmp(envptr, "FALSE") || ! strcmp(envptr, "false"))
debug_level = MSG_DEBUG_LEVEL_NONE;
else if (isdigit(envptr[0]))
{
if ((debug_level = (int) atoi(envptr)) <= MSG_DEBUG_LEVEL_NONE)
debug_level = MSG_DEBUG_LEVEL_NONE;
}
else
{
debug_level = MSG_DEBUG_LEVEL_NONE;
}
}
python -m codegen_sources.model.translate
Error: CUBLAS_STATUS_NOT_INITIALIZED
(codeGen_env) usr1@mak:~/projects/CodeGen$ python -m codegen_sources.model.translate --src_lang cpp --tgt_lang java --model_path TransCoder_model_1.pth --beam_size 10 < csrc.c
adding to path /home/usr1/projects/CodeGen
INFO - 12/13/21 16:50:08 - 0:00:05 - ============ Model Reloading
INFO - 12/13/21 16:50:08 - 0:00:05 - Reloading encoder from TransCoder_model_1.pth ...
WARNING - 12/13/21 16:50:13 - 0:00:09 - Lang cpp_sa matched to pretrained cpp_sa lang embedding.
WARNING - 12/13/21 16:50:13 - 0:00:09 - Lang java_sa matched to pretrained java_sa lang embedding.
WARNING - 12/13/21 16:50:13 - 0:00:09 - Lang python_sa matched to pretrained python_sa lang embedding.
WARNING - 12/13/21 16:50:13 - 0:00:09 - The size of position embeddings in current model is 2048, the size of reloaded is 1024. need to repeat last positions 1024 times.
INFO - 12/13/21 16:50:13 - 0:00:10 - Reloading decoders from TransCoder_model_1.pth ...
WARNING - 12/13/21 16:50:14 - 0:00:11 - Lang cpp_sa matched to pretrained cpp_sa lang embedding.
WARNING - 12/13/21 16:50:14 - 0:00:11 - Lang java_sa matched to pretrained java_sa lang embedding.
WARNING - 12/13/21 16:50:14 - 0:00:11 - Lang python_sa matched to pretrained python_sa lang embedding.
WARNING - 12/13/21 16:50:14 - 0:00:11 - The size of position embeddings in current model is 2048, the size of reloaded is 1024. need to repeat last positions 1024 times.
INFO - 12/13/21 16:50:14 - 0:00:11 - Number of parameters (encoder): 143239641
INFO - 12/13/21 16:50:14 - 0:00:11 - Number of parameters (decoders): 168442329
INFO - 12/13/21 16:50:14 - 0:00:11 - Number of decoders: 1
...
/opt/conda/conda-bld/pytorch_1603729128610/work/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [158,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1603729128610/work/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [158,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1603729128610/work/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [158,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1603729128610/work/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [158,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "/home/usr1/anaconda3/envs/codeGen_env/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/usr1/anaconda3/envs/codeGen_env/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/usr1/projects/CodeGen/codegen_sources/model/translate.py", line 276, in <module>
beam_size=params.beam_size,
File "/home/usr1/projects/CodeGen/codegen_sources/model/translate.py", line 192, in translate
enc1 = self.encoder("fwd", x=x1, lengths=len1, langs=langs1, causal=False)
File "/home/usr1/anaconda3/envs/codeGen_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/usr1/projects/CodeGen/codegen_sources/model/src/model/transformer.py", line 433, in forward
return self.fwd(**kwargs)
File "/home/usr1/projects/CodeGen/codegen_sources/model/src/model/transformer.py", line 526, in fwd
attn = self.attentions[i](tensor, attn_mask, use_cache=use_cache)
File "/home/usr1/anaconda3/envs/codeGen_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/usr1/projects/CodeGen/codegen_sources/model/src/model/transformer.py", line 243, in forward
q = shape(self.q_lin(input)) # (bs, n_heads, qlen, dim_per_head)
File "/home/usr1/anaconda3/envs/codeGen_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/usr1/anaconda3/envs/codeGen_env/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 93, in forward
return F.linear(input, self.weight, self.bias)
File "/home/usr1/anaconda3/envs/codeGen_env/lib/python3.6/site-packages/torch/nn/functional.py", line 1692, in linear
output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
It's worth noting that I haven't trained any models; I'm just using TransCoder_model_1.pth as-is and doing a simple cpp -> Java translation as a test. What is the significance of this error?
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)
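Note: CUDA reports errors asynchronously, so the cuBLAS failure in the traceback is often a secondary symptom of the earlier srcIndex < srcSelectDimSize device-side assertions, which point to an out-of-range index during an embedding lookup. One way to get the failing op to report the error directly is to re-run with CUDA_LAUNCH_BLOCKING=1; a sketch (not something that was tried in this thread):

import os
import subprocess

# Same translate command as above, but with synchronous kernel launches so the device-side
# assertion surfaces at the op that triggers it rather than as a later cuBLAS failure.
env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
with open("csrc.c", "rb") as src:
    subprocess.run(
        ["python", "-m", "codegen_sources.model.translate",
         "--src_lang", "cpp", "--tgt_lang", "java",
         "--model_path", "TransCoder_model_1.pth", "--beam_size", "10"],
        stdin=src,
        env=env,
        check=True,
    )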
I managed to get past this issue. It turned out my input file had Windows line endings (can't believe I'm even admitting to that, but whatever, it happens). Also, my source files are C, not CPP; I'm not sure whether that was contributing to CUBLAS_STATUS_NOT_INITIALIZED, but I intend to create a new processor for C.
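For anyone hitting the same thing, normalizing the input to Unix line endings before piping it in should avoid this particular failure; a minimal sketch (equivalent to running dos2unix on the file):

# Convert Windows (CRLF) line endings to Unix (LF) before feeding the file to translate.
with open("csrc.c", "rb") as f:
    data = f.read()
with open("csrc_unix.c", "wb") as f:
    f.write(data.replace(b"\r\n", b"\n"))

Then run the same translate command with < csrc_unix.c as the input.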
closed