ERROR: RuntimeError: cublas runtime error
harpap opened this issue · 4 comments
My conda env:
python=3.6 pytorch=1.3.1
_libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
_openmp_mutex pkgs/main/linux-64::_openmp_mutex-4.5-1_gnu
_pytorch_select pkgs/main/linux-64::_pytorch_select-0.2-gpu_0
blas pkgs/main/linux-64::blas-1.0-mkl
ca-certificates pkgs/main/linux-64::ca-certificates-2021.10.26-h06a4308_2
certifi pkgs/main/linux-64::certifi-2021.5.30-py36h06a4308_0
cffi pkgs/main/linux-64::cffi-1.14.6-py36h400218f_0
cudatoolkit pkgs/main/linux-64::cudatoolkit-10.0.130-0
cudnn pkgs/main/linux-64::cudnn-7.6.5-cuda10.0_0
intel-openmp pkgs/main/linux-64::intel-openmp-2021.4.0-h06a4308_3561
ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.35.1-h7274673_9
libffi pkgs/main/linux-64::libffi-3.3-he6710b0_2
libgcc-ng pkgs/main/linux-64::libgcc-ng-9.3.0-h5101ec6_17
libgomp pkgs/main/linux-64::libgomp-9.3.0-h5101ec6_17
libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-9.3.0-hd4cf53a_17
mkl pkgs/main/linux-64::mkl-2020.2-256
mkl-service pkgs/main/linux-64::mkl-service-2.3.0-py36he8ac12f_0
mkl_fft pkgs/main/linux-64::mkl_fft-1.3.0-py36h54f3939_0
mkl_random pkgs/main/linux-64::mkl_random-1.1.1-py36h0573a6f_0
ncurses pkgs/main/linux-64::ncurses-6.3-h7f8727e_2
ninja pkgs/main/linux-64::ninja-1.10.2-h5e70eb0_2
numpy pkgs/main/linux-64::numpy-1.19.2-py36h54aff64_0
numpy-base pkgs/main/linux-64::numpy-base-1.19.2-py36hfa32c7d_0
openssl pkgs/main/linux-64::openssl-1.1.1l-h7f8727e_0
pip pkgs/main/linux-64::pip-21.2.2-py36h06a4308_0
pycparser pkgs/main/noarch::pycparser-2.21-pyhd3eb1b0_0
python pkgs/main/linux-64::python-3.6.13-h12debd9_1
pytorch pkgs/main/linux-64::pytorch-1.3.1-cuda100py36h53c1284_0
readline pkgs/main/linux-64::readline-8.1-h27cfd23_0
setuptools pkgs/main/linux-64::setuptools-58.0.4-py36h06a4308_0
six pkgs/main/noarch::six-1.16.0-pyhd3eb1b0_0
sqlite pkgs/main/linux-64::sqlite-3.36.0-hc218d9a_0
tk pkgs/main/linux-64::tk-8.6.11-h1ccaba5_0
wheel pkgs/main/noarch::wheel-0.37.0-pyhd3eb1b0_1
xz pkgs/main/linux-64::xz-5.2.5-h7b6447c_0
zlib pkgs/main/linux-64::zlib-1.2.11-h7f8727e_4
I then run pip install -r requirements.txt, which reports a dependency conflict but still installs the following:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed.
This behaviour is the source of the following dependency conflicts.
mkl-fft 1.3.0 requires numpy>=1.16, but you have numpy 1.15.1 which is incompatible.
Successfully installed Deprecated-1.2.6 Jinja2-3.0.3 MarkupSafe-2.0.1 Pillow-7.0.0 Werkzeug-2.0.2 aadict-0.2.3
alabaster-0.7.12 allennlp-0.9.0 asset-0.6.13 attrs-21.2.0 babel-2.9.1 backcall-0.2.0 blis-0.2.4 boto3-1.10.45
botocore-1.13.45 bpemb-0.3.0 certifi-2020.4.5.1 chardet-3.0.4 click-8.0.3 conllu-1.3.1 cycler-0.10.0 cymem-2.0.6
dataclasses-0.8 decorator-5.1.0 docutils-0.15.2 editdistance-0.6.0 filelock-3.4.0 flaky-3.7.0 flask-2.0.2 flask-cors-3.0.10
ftfy-6.0.3 gensim-3.8.1 gevent-21.12.0 globre-0.1.5 greenlet-1.1.2 h5py-2.8.0 idna-2.8 imagesize-1.3.0
importlib-metadata-4.8.3 iniconfig-1.1.1 ipython-7.12.0 ipython-genutils-0.2.0 itsdangerous-2.0.1 jedi-0.18.1
jmespath-0.10.0 joblib-1.1.0 jsonnet-0.18.0 jsonpickle-2.0.0 kiwisolver-1.3.1 matplotlib-3.1.3 mock-4.0.1
murmurhash-1.0.6 nltk-3.6.3 numpy-1.15.1 numpydoc-1.1.0 overrides-2.8.0 packaging-21.3 parsimonious-0.8.1
parso-0.8.3 pexpect-4.8.0 pickleshare-0.7.5 plac-0.9.6 pluggy-0.13.1 preshed-2.0.1 prompt-toolkit-3.0.24
protobuf-3.19.1 ptyprocess-0.7.0 py-1.11.0 pygments-2.10.0 pyhocon-0.3.56 pyparsing-3.0.6 pytest-6.1.2
python-dateutil-2.8.2 pytorch-pretrained-bert-0.6.2 pytorch-transformers-1.1.0 pytz-2021.3 pyyaml-5.2
regex-2019.12.20 requests-2.22.0 responses-0.16.0 s3transfer-0.2.1 sacremoses-0.0.46 scikit-learn-0.24.2
scipy-1.4.1 segtok-1.5.7 sentencepiece-0.1.96 sklearn-0.0 smart-open-5.2.1 snowballstemmer-2.2.0
spacy-2.1.9 sphinx-4.3.2 sphinxcontrib-applehelp-1.0.2 sphinxcontrib-devhelp-1.0.2 sphinxcontrib-htmlhelp-2.0.0
sphinxcontrib-jsmath-1.0.1 sphinxcontrib-qthelp-1.0.3 sphinxcontrib-serializinghtml-1.1.5 sqlparse-0.4.2
srsly-1.0.5 tabulate-0.8.6 tensorboardX-2.4.1 thinc-7.0.8 threadpoolctl-3.0.0 tokenizers-0.8.0rc4 toml-0.10.2
tqdm-4.41.0 traitlets-4.3.3 transformers-3.0.0 typing-extensions-4.0.1 unidecode-1.3.2 urllib3-1.25.11
wasabi-0.9.0 wcwidth-0.2.5 word2number-1.1 wrapt-1.13.3 zipp-3.6.0 zope.event-4.5.0 zope.interface-5.4.0
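One possible way to clear the numpy conflict reported above, assuming nothing else in the project pins numpy below 1.16, would be to upgrade it after the install, e.g.:
pip install "numpy>=1.16,<1.20"
(the 1.19.x series is the last one that still supports Python 3.6).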
Then, when I run CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_english.yaml --test, it throws this error:
[2021-12-23 11:25:58,720 INFO] loading file https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-english-sentencepiece.bpe.model from cache at /home/chapapadopoulos/.cache/torch/transformers/431cf95b26928e8ff52fd32e349c1de81e77e39e0827a725feaa4357692901cf.309f0c29486cffc28e1e40a2ab0ac8f500c203fe080b95f820aa9cb58e5b84ed
[2021-12-23 11:25:59,854 INFO] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-english-config.json from cache at /home/chapapadopoulos/.cache/torch/transformers/4df1826a1128bbf8e81e2d920aace90d7e8a32ca214090f7210822aca0fd67d2.af9bc4ec719428ebc5f7bd9b67c97ee305cad5ba274c764cd193a31529ee3ba6
[2021-12-23 11:25:59,856 INFO] Model config XLMRobertaConfig {
"_num_labels": 8,
"architectures": [
"XLMRobertaForTokenClassification"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"eos_token_id": 2,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"id2label": {
"0": "B-LOC",
"1": "B-MISC",
"2": "B-ORG",
"3": "I-LOC",
"4": "I-MISC",
"5": "I-ORG",
"6": "I-PER",
"7": "O"
},
"initializer_range": 0.02,
"intermediate_size": 4096,
"label2id": {
"B-LOC": 0,
"B-MISC": 1,
"B-ORG": 2,
"I-LOC": 3,
"I-MISC": 4,
"I-ORG": 5,
"I-PER": 6,
"O": 7
},
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "xlm-roberta",
"num_attention_heads": 16,
"num_hidden_layers": 24,
"output_hidden_states": true,
"output_past": true,
"pad_token_id": 1,
"type_vocab_size": 1,
"vocab_size": 250002
}
[2021-12-23 11:26:00,498 INFO] loading weights file https://cdn.huggingface.co/xlm-roberta-large-finetuned-conll03-english-pytorch_model.bin from cache at /home/chapapadopoulos/.cache/torch/transformers/3a603320849fd5410edf034706443763632c09305bb0fd1f3ba26dcac5ed84b3.437090cbc8148a158bd2b30767652c9e66e4b09430bc0fa2b717028fb6047724
[2021-12-23 11:26:21,062 INFO] All model checkpoint weights were used when initializing XLMRobertaModel.
[2021-12-23 11:26:21,063 INFO] All the weights of XLMRobertaModel were initialized from the model checkpoint at xlm-roberta-large-finetuned-conll03-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use XLMRobertaModel for predictions without further training.
2021-12-23 11:26:22,672 Model Size: 1106399156
Corpus: 14987 train + 3466 dev + 3684 test sentences
2021-12-23 11:26:22,721 ----------------------------------------------------------------------------------------------------
2021-12-23 11:26:25,010 loading file resources/taggers/en-xlmr-tuned-first_elmo_bert-old-four_multi-bert-four_word-glove_word_origflair_mflair_char_30episode_150epoch_32batch_0.1lr_800hidden_en_monolingual_crf_fast_reinforce_freeze_norelearn_sentbatch_0.5discount_0.9momentum_5patience_nodev_newner5/best-model.pt
2021-12-23 11:26:30,452 Testing using best model ...
2021-12-23 11:26:30,455 Setting embedding mask to the best action: tensor([1., 0., 0., 0., 1., 1., 0., 1., 1., 1., 1.], device='cuda:0')
['/home/chapapadopoulos/.cache/torch/transformers/bert-base-cased', '/home/chapapadopoulos/.flair/embeddings/lm-jw300-backward-v0.1.pt', '/home/chapapadopoulos/.flair/embeddings/lm-jw300-forward-v0.1.pt', '/home/chapapadopoulos/.flair/embeddings/news-backward-0.4.1.pt', '/home/chapapadopoulos/.flair/embeddings/news-forward-0.4.1.pt', '/home/chapapadopoulos/.flair/embeddings/xlm-roberta-large-finetuned-conll03-english', 'Char', 'Word: en', 'Word: glove', 'bert-base-multilingual-cased', 'elmo-original']
2021-12-23 11:26:32,461 /home/yongjiang.jy/.cache/torch/transformers/bert-base-cased 108310272
Traceback (most recent call last):
File "train.py", line 163, in <module>
predict_posterior=args.predict_posterior,
File "/home/chapapadopoulos/github/NER/ACE-main/flair/trainers/reinforcement_trainer.py", line 1459, in final_test
self.gpu_friendly_assign_embedding([loader], selection = self.model.selection)
File "/home/chapapadopoulos/github/NER/ACE-main/flair/trainers/distillation_trainer.py", line 1171, in gpu_friendly_assign_embedding
embedding.embed(sentences)
File "/home/chapapadopoulos/github/NER/ACE-main/flair/embeddings.py", line 97, in embed
self._add_embeddings_internal(sentences)
File "/home/chapapadopoulos/github/NER/ACE-main/flair/embeddings.py", line 2722, in _add_embeddings_internal
sequence_output, pooled_output, all_encoder_layers = self.model(all_input_ids, token_type_ids=None, attention_mask=all_input_masks)
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/transformers/modeling_bert.py", line 762, in forward
output_hidden_states=output_hidden_states,
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/transformers/modeling_bert.py", line 439, in forward
output_attentions,
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/transformers/modeling_bert.py", line 371, in forward
hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/transformers/modeling_bert.py", line 315, in forward
hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/transformers/modeling_bert.py", line 239, in forward
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: cublas runtime error : the GPU program failed to execute at /tmp/pip-req-build-ocx5vxk7/aten/src/THC/THCBlas.cu:331
It runs on an NVIDIA 3090 and I have updated all drivers:
NVIDIA-SMI 470.86 Driver Version: 470.86 CUDA Version: 11.4
It seems to be a problem with the CUDA or PyTorch version. Can you successfully run this in Python:
import torch
torch.zeros(1).cuda()
I notice that the PyTorch CUDA version (10.0) does not match the CUDA version (11.4) in your environment:
pytorch pkgs/main/linux-64::pytorch-1.3.1-cuda100py36h53c1284_0
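For reference, a quick way to confirm which CUDA build of PyTorch is actually loaded (only standard torch attributes, nothing project-specific):
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
With the env above this should report a 10.0 toolkit build, even though the driver exposes CUDA 11.4.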
Maybe your CUDA version is too high; you could try a lower CUDA version or a higher PyTorch version (PyTorch 1.7 is OK for running the code).
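For reference, something along these lines should pull a CUDA 11 build of PyTorch 1.7 from the official pytorch conda channel (the exact cudatoolkit pin here is an assumption; adjust it to what the channel offers for your driver):
conda install pytorch==1.7.1 cudatoolkit=11.0 -c pytorch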
Hi @wangxinyu0922! Thanks for the help.
The command torch.zeros(1).cuda() runs, but very slowly.
I created a new env with torch 1.7 and python 3.9.7, and it installed:
_libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
_openmp_mutex pkgs/main/linux-64::_openmp_mutex-4.5-1_gnu
_pytorch_select pkgs/main/linux-64::_pytorch_select-0.1-cpu_0
blas pkgs/main/linux-64::blas-1.0-mkl
ca-certificates pkgs/main/linux-64::ca-certificates-2021.10.26-h06a4308_2
certifi pkgs/main/linux-64::certifi-2021.10.8-py39h06a4308_0
cffi pkgs/main/linux-64::cffi-1.14.6-py39h400218f_0
cudatoolkit pkgs/main/linux-64::cudatoolkit-11.3.1-h2bc3f7f_2
cudnn pkgs/main/linux-64::cudnn-8.2.1-cuda11.3_0
intel-openmp pkgs/main/linux-64::intel-openmp-2019.4-243
ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.35.1-h7274673_9
libffi pkgs/main/linux-64::libffi-3.3-he6710b0_2
libgcc-ng pkgs/main/linux-64::libgcc-ng-9.3.0-h5101ec6_17
libgomp pkgs/main/linux-64::libgomp-9.3.0-h5101ec6_17
libmklml pkgs/main/linux-64::libmklml-2019.0.5-0
libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-9.3.0-hd4cf53a_17
mkl pkgs/main/linux-64::mkl-2020.2-256
mkl-service pkgs/main/linux-64::mkl-service-2.3.0-py39he8ac12f_0
mkl_fft pkgs/main/linux-64::mkl_fft-1.3.0-py39h54f3939_0
mkl_random pkgs/main/linux-64::mkl_random-1.0.2-py39h63df603_0
ncurses pkgs/main/linux-64::ncurses-6.3-h7f8727e_2
ninja pkgs/main/linux-64::ninja-1.10.2-py39hd09550d_3
numpy pkgs/main/linux-64::numpy-1.19.2-py39h89c1606_0
numpy-base pkgs/main/linux-64::numpy-base-1.19.2-py39h2ae0177_0
openssl pkgs/main/linux-64::openssl-1.1.1l-h7f8727e_0
pip pkgs/main/linux-64::pip-21.2.4-py39h06a4308_0
pycparser pkgs/main/noarch::pycparser-2.21-pyhd3eb1b0_0
python pkgs/main/linux-64::python-3.9.7-h12debd9_1
pytorch pkgs/main/linux-64::pytorch-1.7.1-cpu_py39h6a09485_0
readline pkgs/main/linux-64::readline-8.1-h27cfd23_0
setuptools pkgs/main/linux-64::setuptools-58.0.4-py39h06a4308_0
six pkgs/main/noarch::six-1.16.0-pyhd3eb1b0_0
sqlite pkgs/main/linux-64::sqlite-3.36.0-hc218d9a_0
tk pkgs/main/linux-64::tk-8.6.11-h1ccaba5_0
typing-extensions pkgs/main/noarch::typing-extensions-3.10.0.2-hd3eb1b0_0
typing_extensions pkgs/main/noarch::typing_extensions-3.10.0.2-pyh06a4308_0
tzdata pkgs/main/noarch::tzdata-2021e-hda174b7_0
wheel pkgs/main/noarch::wheel-0.37.0-pyhd3eb1b0_1
xz pkgs/main/linux-64::xz-5.2.5-h7b6447c_0
zlib pkgs/main/linux-64::zlib-1.2.11-h7f8727e_4
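Note that the pytorch package listed above (pytorch-1.7.1-cpu_py39h6a09485_0, together with _pytorch_select-0.1-cpu_0) is a CPU-only build, so CUDA would not be used at all in this env. A quick check, using only standard torch calls:
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
On a CPU-only build this prints None for the CUDA version and False for availability.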
But in this env it was impossible to install requirements.txt (it throws lots of errors). If you could tell me the exact package versions, it would really help. Here is the requirements.txt that I tried:
allennlp==0.9.0
boto3==1.10.45
botocore==1.13.45
bpemb==0.3.0
certifi==2020.4.5.1
conllu==1.3.1
cycler==0.10.0
Deprecated==1.2.6
gensim==3.8.1
h5py==2.8.0
ipython==7.12.0
matplotlib==3.1.3
mock==4.0.1
numpy
overrides==2.8.0
Pillow==7.0.0
pyhocon==0.3.56
pytest==6.1.2
pytorch-transformers==1.1.0
pyyaml==5.2
regex==2019.12.20
requests==2.22.0
scipy==1.4.1
segtok==1.5.7
sklearn==0.0
spacy
tabulate==0.8.6
torch
tqdm==4.41.0
transformers==3.0.0
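If a known-good environment exists on the maintainer's side, exporting it would give the exact pins, e.g.:
pip freeze > requirements-exact.txt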
You may see this issue