ERROR: RuntimeError: cublas runtime error
harpap opened this issue · 4 comments
My conda env:
python=3.6 pytorch=1.3.1
_libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
_openmp_mutex pkgs/main/linux-64::_openmp_mutex-4.5-1_gnu
_pytorch_select pkgs/main/linux-64::_pytorch_select-0.2-gpu_0
blas pkgs/main/linux-64::blas-1.0-mkl
ca-certificates pkgs/main/linux-64::ca-certificates-2021.10.26-h06a4308_2
certifi pkgs/main/linux-64::certifi-2021.5.30-py36h06a4308_0
cffi pkgs/main/linux-64::cffi-1.14.6-py36h400218f_0
cudatoolkit pkgs/main/linux-64::cudatoolkit-10.0.130-0
cudnn pkgs/main/linux-64::cudnn-7.6.5-cuda10.0_0
intel-openmp pkgs/main/linux-64::intel-openmp-2021.4.0-h06a4308_3561
ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.35.1-h7274673_9
libffi pkgs/main/linux-64::libffi-3.3-he6710b0_2
libgcc-ng pkgs/main/linux-64::libgcc-ng-9.3.0-h5101ec6_17
libgomp pkgs/main/linux-64::libgomp-9.3.0-h5101ec6_17
libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-9.3.0-hd4cf53a_17
mkl pkgs/main/linux-64::mkl-2020.2-256
mkl-service pkgs/main/linux-64::mkl-service-2.3.0-py36he8ac12f_0
mkl_fft pkgs/main/linux-64::mkl_fft-1.3.0-py36h54f3939_0
mkl_random pkgs/main/linux-64::mkl_random-1.1.1-py36h0573a6f_0
ncurses pkgs/main/linux-64::ncurses-6.3-h7f8727e_2
ninja pkgs/main/linux-64::ninja-1.10.2-h5e70eb0_2
numpy pkgs/main/linux-64::numpy-1.19.2-py36h54aff64_0
numpy-base pkgs/main/linux-64::numpy-base-1.19.2-py36hfa32c7d_0
openssl pkgs/main/linux-64::openssl-1.1.1l-h7f8727e_0
pip pkgs/main/linux-64::pip-21.2.2-py36h06a4308_0
pycparser pkgs/main/noarch::pycparser-2.21-pyhd3eb1b0_0
python pkgs/main/linux-64::python-3.6.13-h12debd9_1
pytorch pkgs/main/linux-64::pytorch-1.3.1-cuda100py36h53c1284_0
readline pkgs/main/linux-64::readline-8.1-h27cfd23_0
setuptools pkgs/main/linux-64::setuptools-58.0.4-py36h06a4308_0
six pkgs/main/noarch::six-1.16.0-pyhd3eb1b0_0
sqlite pkgs/main/linux-64::sqlite-3.36.0-hc218d9a_0
tk pkgs/main/linux-64::tk-8.6.11-h1ccaba5_0
wheel pkgs/main/noarch::wheel-0.37.0-pyhd3eb1b0_1
xz pkgs/main/linux-64::xz-5.2.5-h7b6447c_0
zlib pkgs/main/linux-64::zlib-1.2.11-h7f8727e_4
I then run pip install -r requirements.txt, which reports a dependency conflict but still installs the following:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed.
This behaviour is the source of the following dependency conflicts.
mkl-fft 1.3.0 requires numpy>=1.16, but you have numpy 1.15.1 which is incompatible.
Successfully installed Deprecated-1.2.6 Jinja2-3.0.3 MarkupSafe-2.0.1 Pillow-7.0.0 Werkzeug-2.0.2 aadict-0.2.3
alabaster-0.7.12 allennlp-0.9.0 asset-0.6.13 attrs-21.2.0 babel-2.9.1 backcall-0.2.0 blis-0.2.4 boto3-1.10.45
botocore-1.13.45 bpemb-0.3.0 certifi-2020.4.5.1 chardet-3.0.4 click-8.0.3 conllu-1.3.1 cycler-0.10.0 cymem-2.0.6
dataclasses-0.8 decorator-5.1.0 docutils-0.15.2 editdistance-0.6.0 filelock-3.4.0 flaky-3.7.0 flask-2.0.2 flask-cors-3.0.10
ftfy-6.0.3 gensim-3.8.1 gevent-21.12.0 globre-0.1.5 greenlet-1.1.2 h5py-2.8.0 idna-2.8 imagesize-1.3.0
importlib-metadata-4.8.3 iniconfig-1.1.1 ipython-7.12.0 ipython-genutils-0.2.0 itsdangerous-2.0.1 jedi-0.18.1
jmespath-0.10.0 joblib-1.1.0 jsonnet-0.18.0 jsonpickle-2.0.0 kiwisolver-1.3.1 matplotlib-3.1.3 mock-4.0.1
murmurhash-1.0.6 nltk-3.6.3 numpy-1.15.1 numpydoc-1.1.0 overrides-2.8.0 packaging-21.3 parsimonious-0.8.1
parso-0.8.3 pexpect-4.8.0 pickleshare-0.7.5 plac-0.9.6 pluggy-0.13.1 preshed-2.0.1 prompt-toolkit-3.0.24
protobuf-3.19.1 ptyprocess-0.7.0 py-1.11.0 pygments-2.10.0 pyhocon-0.3.56 pyparsing-3.0.6 pytest-6.1.2
python-dateutil-2.8.2 pytorch-pretrained-bert-0.6.2 pytorch-transformers-1.1.0 pytz-2021.3 pyyaml-5.2
regex-2019.12.20 requests-2.22.0 responses-0.16.0 s3transfer-0.2.1 sacremoses-0.0.46 scikit-learn-0.24.2
scipy-1.4.1 segtok-1.5.7 sentencepiece-0.1.96 sklearn-0.0 smart-open-5.2.1 snowballstemmer-2.2.0
spacy-2.1.9 sphinx-4.3.2 sphinxcontrib-applehelp-1.0.2 sphinxcontrib-devhelp-1.0.2 sphinxcontrib-htmlhelp-2.0.0
sphinxcontrib-jsmath-1.0.1 sphinxcontrib-qthelp-1.0.3 sphinxcontrib-serializinghtml-1.1.5 sqlparse-0.4.2
srsly-1.0.5 tabulate-0.8.6 tensorboardX-2.4.1 thinc-7.0.8 threadpoolctl-3.0.0 tokenizers-0.8.0rc4 toml-0.10.2
tqdm-4.41.0 traitlets-4.3.3 transformers-3.0.0 typing-extensions-4.0.1 unidecode-1.3.2 urllib3-1.25.11
wasabi-0.9.0 wcwidth-0.2.5 word2number-1.1 wrapt-1.13.3 zipp-3.6.0 zope.event-4.5.0 zope.interface-5.4.0
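One possible way to clear the numpy conflict reported above, assuming nothing else in the project pins numpy below 1.16, would be to upgrade it after the install, e.g.:
pip install "numpy>=1.16,<1.20"
(the 1.19.x series is the last one that still supports Python 3.6).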
Then, when I run CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_english.yaml --test, it throws this error:
[2021-12-23 11:25:58,720 INFO] loading file https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-english-sentencepiece.bpe.model from cache at /home/chapapadopoulos/.cache/torch/transformers/431cf95b26928e8ff52fd32e349c1de81e77e39e0827a725feaa4357692901cf.309f0c29486cffc28e1e40a2ab0ac8f500c203fe080b95f820aa9cb58e5b84ed
[2021-12-23 11:25:59,854 INFO] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/xlm-roberta-large-finetuned-conll03-english-config.json from cache at /home/chapapadopoulos/.cache/torch/transformers/4df1826a1128bbf8e81e2d920aace90d7e8a32ca214090f7210822aca0fd67d2.af9bc4ec719428ebc5f7bd9b67c97ee305cad5ba274c764cd193a31529ee3ba6
[2021-12-23 11:25:59,856 INFO] Model config XLMRobertaConfig {
"_num_labels": 8,
"architectures": [
"XLMRobertaForTokenClassification"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"eos_token_id": 2,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"id2label": {
"0": "B-LOC",
"1": "B-MISC",
"2": "B-ORG",
"3": "I-LOC",
"4": "I-MISC",
"5": "I-ORG",
"6": "I-PER",
"7": "O"
},
"initializer_range": 0.02,
"intermediate_size": 4096,
"label2id": {
"B-LOC": 0,
"B-MISC": 1,
"B-ORG": 2,
"I-LOC": 3,
"I-MISC": 4,
"I-ORG": 5,
"I-PER": 6,
"O": 7
},
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "xlm-roberta",
"num_attention_heads": 16,
"num_hidden_layers": 24,
"output_hidden_states": true,
"output_past": true,
"pad_token_id": 1,
"type_vocab_size": 1,
"vocab_size": 250002
}
[2021-12-23 11:26:00,498 INFO] loading weights file https://cdn.huggingface.co/xlm-roberta-large-finetuned-conll03-english-pytorch_model.bin from cache at /home/chapapadopoulos/.cache/torch/transformers/3a603320849fd5410edf034706443763632c09305bb0fd1f3ba26dcac5ed84b3.437090cbc8148a158bd2b30767652c9e66e4b09430bc0fa2b717028fb6047724
[2021-12-23 11:26:21,062 INFO] All model checkpoint weights were used when initializing XLMRobertaModel.
[2021-12-23 11:26:21,063 INFO] All the weights of XLMRobertaModel were initialized from the model checkpoint at xlm-roberta-large-finetuned-conll03-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use XLMRobertaModel for predictions without further training.
2021-12-23 11:26:22,672 Model Size: 1106399156
Corpus: 14987 train + 3466 dev + 3684 test sentences
2021-12-23 11:26:22,721 ----------------------------------------------------------------------------------------------------
2021-12-23 11:26:25,010 loading file resources/taggers/en-xlmr-tuned-first_elmo_bert-old-four_multi-bert-four_word-glove_word_origflair_mflair_char_30episode_150epoch_32batch_0.1lr_800hidden_en_monolingual_crf_fast_reinforce_freeze_norelearn_sentbatch_0.5discount_0.9momentum_5patience_nodev_newner5/best-model.pt
2021-12-23 11:26:30,452 Testing using best model ...
2021-12-23 11:26:30,455 Setting embedding mask to the best action: tensor([1., 0., 0., 0., 1., 1., 0., 1., 1., 1., 1.], device='cuda:0')
['/home/chapapadopoulos/.cache/torch/transformers/bert-base-cased', '/home/chapapadopoulos/.flair/embeddings/lm-jw300-backward-v0.1.pt', '/home/chapapadopoulos/.flair/embeddings/lm-jw300-forward-v0.1.pt', '/home/chapapadopoulos/.flair/embeddings/news-backward-0.4.1.pt', '/home/chapapadopoulos/.flair/embeddings/news-forward-0.4.1.pt', '/home/chapapadopoulos/.flair/embeddings/xlm-roberta-large-finetuned-conll03-english', 'Char', 'Word: en', 'Word: glove', 'bert-base-multilingual-cased', 'elmo-original']
2021-12-23 11:26:32,461 /home/yongjiang.jy/.cache/torch/transformers/bert-base-cased 108310272
Traceback (most recent call last):
File "train.py", line 163, in <module>
predict_posterior=args.predict_posterior,
File "/home/chapapadopoulos/github/NER/ACE-main/flair/trainers/reinforcement_trainer.py", line 1459, in final_test
self.gpu_friendly_assign_embedding([loader], selection = self.model.selection)
File "/home/chapapadopoulos/github/NER/ACE-main/flair/trainers/distillation_trainer.py", line 1171, in gpu_friendly_assign_embedding
embedding.embed(sentences)
File "/home/chapapadopoulos/github/NER/ACE-main/flair/embeddings.py", line 97, in embed
self._add_embeddings_internal(sentences)
File "/home/chapapadopoulos/github/NER/ACE-main/flair/embeddings.py", line 2722, in _add_embeddings_internal
sequence_output, pooled_output, all_encoder_layers = self.model(all_input_ids, token_type_ids=None, attention_mask=all_input_masks)
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/transformers/modeling_bert.py", line 762, in forward
output_hidden_states=output_hidden_states,
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/transformers/modeling_bert.py", line 439, in forward
output_attentions,
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/transformers/modeling_bert.py", line 371, in forward
hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/transformers/modeling_bert.py", line 315, in forward
hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/chapapadopoulos/anaconda3/envs/ACEagain/lib/python3.6/site-packages/transformers/modeling_bert.py", line 239, in forward
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: cublas runtime error : the GPU program failed to execute at /tmp/pip-req-build-ocx5vxk7/aten/src/THC/THCBlas.cu:331
It runs on an NVIDIA 3090 and I have updated all drivers:
NVIDIA-SMI 470.86 Driver Version: 470.86 CUDA Version: 11.4
It seems to be a problem with the CUDA or PyTorch version. Can you successfully run this in Python:
import torch
torch.zeros(1).cuda()
I notice that the PyTorch CUDA version (10.0) does not match the CUDA version (11.4) in your environment:
pytorch pkgs/main/linux-64::pytorch-1.3.1-cuda100py36h53c1284_0
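For reference, a quick way to confirm which CUDA build of PyTorch is actually loaded (only standard torch attributes, nothing project-specific):
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
With the env above this should report a 10.0 toolkit build, even though the driver exposes CUDA 11.4.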
Maybe your CUDA version is too high; you could try a lower CUDA version or a higher PyTorch version (PyTorch 1.7 is OK for running the code).
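For reference, something along these lines should pull a CUDA 11 build of PyTorch 1.7 from the official pytorch conda channel (the exact cudatoolkit pin here is an assumption; adjust it to what the channel offers for your driver):
conda install pytorch==1.7.1 cudatoolkit=11.0 -c pytorch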
Hi @wangxinyu0922! Thanks for the help.
The command torch.zeros(1).cuda() runs, but very slowly.
I created a new env with torch 1.7 and python 3.9.7, and it installed:
_libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
_openmp_mutex pkgs/main/linux-64::_openmp_mutex-4.5-1_gnu
_pytorch_select pkgs/main/linux-64::_pytorch_select-0.1-cpu_0
blas pkgs/main/linux-64::blas-1.0-mkl
ca-certificates pkgs/main/linux-64::ca-certificates-2021.10.26-h06a4308_2
certifi pkgs/main/linux-64::certifi-2021.10.8-py39h06a4308_0
cffi pkgs/main/linux-64::cffi-1.14.6-py39h400218f_0
cudatoolkit pkgs/main/linux-64::cudatoolkit-11.3.1-h2bc3f7f_2
cudnn pkgs/main/linux-64::cudnn-8.2.1-cuda11.3_0
intel-openmp pkgs/main/linux-64::intel-openmp-2019.4-243
ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.35.1-h7274673_9
libffi pkgs/main/linux-64::libffi-3.3-he6710b0_2
libgcc-ng pkgs/main/linux-64::libgcc-ng-9.3.0-h5101ec6_17
libgomp pkgs/main/linux-64::libgomp-9.3.0-h5101ec6_17
libmklml pkgs/main/linux-64::libmklml-2019.0.5-0
libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-9.3.0-hd4cf53a_17
mkl pkgs/main/linux-64::mkl-2020.2-256
mkl-service pkgs/main/linux-64::mkl-service-2.3.0-py39he8ac12f_0
mkl_fft pkgs/main/linux-64::mkl_fft-1.3.0-py39h54f3939_0
mkl_random pkgs/main/linux-64::mkl_random-1.0.2-py39h63df603_0
ncurses pkgs/main/linux-64::ncurses-6.3-h7f8727e_2
ninja pkgs/main/linux-64::ninja-1.10.2-py39hd09550d_3
numpy pkgs/main/linux-64::numpy-1.19.2-py39h89c1606_0
numpy-base pkgs/main/linux-64::numpy-base-1.19.2-py39h2ae0177_0
openssl pkgs/main/linux-64::openssl-1.1.1l-h7f8727e_0
pip pkgs/main/linux-64::pip-21.2.4-py39h06a4308_0
pycparser pkgs/main/noarch::pycparser-2.21-pyhd3eb1b0_0
python pkgs/main/linux-64::python-3.9.7-h12debd9_1
pytorch pkgs/main/linux-64::pytorch-1.7.1-cpu_py39h6a09485_0
readline pkgs/main/linux-64::readline-8.1-h27cfd23_0
setuptools pkgs/main/linux-64::setuptools-58.0.4-py39h06a4308_0
six pkgs/main/noarch::six-1.16.0-pyhd3eb1b0_0
sqlite pkgs/main/linux-64::sqlite-3.36.0-hc218d9a_0
tk pkgs/main/linux-64::tk-8.6.11-h1ccaba5_0
typing-extensions pkgs/main/noarch::typing-extensions-3.10.0.2-hd3eb1b0_0
typing_extensions pkgs/main/noarch::typing_extensions-3.10.0.2-pyh06a4308_0
tzdata pkgs/main/noarch::tzdata-2021e-hda174b7_0
wheel pkgs/main/noarch::wheel-0.37.0-pyhd3eb1b0_1
xz pkgs/main/linux-64::xz-5.2.5-h7b6447c_0
zlib pkgs/main/linux-64::zlib-1.2.11-h7f8727e_4
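Note that the pytorch package listed above (pytorch-1.7.1-cpu_py39h6a09485_0, together with _pytorch_select-0.1-cpu_0) is a CPU-only build, so CUDA would not be used at all in this env. A quick check, using only standard torch calls:
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
On a CPU-only build this prints None for the CUDA version and False for availability.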
But in this env it was impossible to install requirements.txt (it throws lots of errors). If you could tell me the exact package versions, it would really help. Here is the requirements.txt that I tried:
allennlp==0.9.0
boto3==1.10.45
botocore==1.13.45
bpemb==0.3.0
certifi==2020.4.5.1
conllu==1.3.1
cycler==0.10.0
Deprecated==1.2.6
gensim==3.8.1
h5py==2.8.0
ipython==7.12.0
matplotlib==3.1.3
mock==4.0.1
numpy
overrides==2.8.0
Pillow==7.0.0
pyhocon==0.3.56
pytest==6.1.2
pytorch-transformers==1.1.0
pyyaml==5.2
regex==2019.12.20
requests==2.22.0
scipy==1.4.1
segtok==1.5.7
sklearn==0.0
spacy
tabulate==0.8.6
torch
tqdm==4.41.0
transformers==3.0.0
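If a known-good environment exists on the maintainer's side, exporting it would give the exact pins, e.g.:
pip freeze > requirements-exact.txt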
You may see this issue