yym6472/KBQARelationLearning

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`


(screenshots of the error attached)

CUDA_VISIBLE_DEVICES=0 python main.py --seed 1 --model_name_or_path bert-base-uncased --model_save_path ./output/bert_newdata_webred_pretraining_matching_seed1 --force_del --use_apex_amp --apex_amp_opt_level O1 --task kbqa_name_expanded --metric f1 --data_path ./data/webqsp_name_expanded --head_entity_start_token "[start of head entity]" --head_entity_end_token "[end of head entity]" --tail_entity_start_token "[start of tail entity]" --tail_entity_end_token "[end of tail entity]" --continue_training_path ./output/webred_rel_matching_pretraining --batch_size 32

The above error occurred while executing this command.

It seems the length of the input sample exceeds the maximum input length of the BERT model. How did you generate the data in ./data/webqsp_name_expanded (i.e., the train/dev/test.json files)? I have recorded the sizes of these files below, so you can compare your generated data files against them:

| File | Size |
| --- | --- |
| train.json | 836,054,211 bytes |
| dev.json | 74,405,349 bytes |
| test.json | 494,622,411 bytes |
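
For a quick comparison, a small snippet like the following (just a sketch, assuming the default ./data/webqsp_name_expanded path) will print the sizes of your generated files:

import os

# Print the on-disk size of each generated split to compare with the table above
for split in ("train", "dev", "test"):
    path = f"./data/webqsp_name_expanded/{split}.json"
    print(f"{split}.json: {os.path.getsize(path):,} bytes")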

Or can you provide one or more examples of your generated files for more details?

I also encountered a similar problem. How did you solve it?
My generated data files:

| File | Size |
| --- | --- |
| train.json | 834,265,663 bytes |
| dev.json | 76,220,178 bytes |
| test.json | 494,618,068 bytes |

Hi, yuanmeng, thanks for your work and contribution to this repo!
Unfortunately, I have encountered the above problem even though I followed the README, and I cannot figure out which part of the preprocessing went wrong.
The sizes of my generated data files are quite different:

| File | Size |
| --- | --- |
| train.json | 407,193,390 bytes |
| dev.json | 38,454,180 bytes |
| test.json | 238,490,319 bytes |

By the way, the size of my freebase-rdf-latest.gz file is 29.98 GB, and the generated mid2name.json file is 1.18 GB.
Looking forward to hearing from you, thanks!

Hi @lirenhao1997 , sorry for the late reply.

I have tried to reproduce the issue recently, and it is possibly because the length of the model input (i.e., the number of wordpieces after tokenization) exceeds the maximum input length of BERT. Did you add the following special tokens to the vocab.txt of the pre-downloaded BERT files, as instructed in README.md?

[unused0] -> [unknown entity]
[unused1] -> [path separator]
[unused2] -> [start of entity]
[unused3] -> [end of entity]
[unused4] -> [self]
[unused5] -> [start of head entity]
[unused6] -> [end of head entity]
[unused7] -> [start of tail entity]
[unused8] -> [end of tail entity]

Without this modification, the BERT tokenizer will split these special tokens into multiple wordpieces, which can lead to the above CUDA error.
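
As a quick sanity check (a rough sketch, assuming transformers 4.3.3 and the locally modified ./bert-base-uncased/ directory), each of these tokens should map to a single in-vocab id (one of the former [unusedN] slots) rather than falling back to [UNK]:

from transformers import BertTokenizer

# Load the tokenizer from the locally modified BERT directory
tokenizer = BertTokenizer.from_pretrained("./bert-base-uncased/")

special_tokens = [
    "[unknown entity]", "[path separator]", "[start of entity]", "[end of entity]",
    "[self]", "[start of head entity]", "[end of head entity]",
    "[start of tail entity]", "[end of tail entity]",
]
for token in special_tokens:
    token_id = tokenizer.convert_tokens_to_ids(token)
    # If vocab.txt was edited correctly, none of these should map to the [UNK] id
    status = "OK" if token_id != tokenizer.unk_token_id else "NOT IN VOCAB"
    print(f"{token} -> {token_id} ({status})")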

Thanks for the reply! @yym6472
I had already modified the vocab file following your instructions in README.md before I ran into the above CUDA error:
(screenshot of the modified vocab.txt attached)

Hi @lirenhao1997, could you share your training script (just copy and paste the one that encountered this error)?

@yym6472 Just the content in ./scripts/bert.sh:

python3 main.py \
    --seed 4 \
    --model_name_or_path ./bert-base-uncased/ \
    --model_save_path ./output/bert_newdata_headtail_seed4 \
    --force_del \
    --use_apex_amp \
    --apex_amp_opt_level O1 \
    --task kbqa_name_expanded \
    --metric f1 \
    --data_path ./data/webqsp_name_expanded \
    --head_entity_start_token "[start of head entity]" \
    --head_entity_end_token "[end of head entity]" \
    --tail_entity_start_token "[start of tail entity]" \
    --tail_entity_end_token "[end of tail entity]" \
    "$@"

It seems that this script runs fine in my environment. Could you please check the following for more information:

  1. Is the version of the installed transformers package 4.3.3? I suspect that if you installed another version, it may have downloaded the online vocab file (a quick check is sketched after this list).
  2. Could you please run the following script and paste the output to check the format of processed data is consistent with ours:
    import json
    data = json.load(open("./data/webqsp_name_expanded/train.json"))
    print(data["WebQTrn-0"]["candidiates"][:10])
  3. Did you modify any arguments of the data processing script? I noticed your data file is about half the size of ours (407,193,390 vs. 836,054,211 bytes for train.json).
  4. Did you encounter the CUDA error at the very beginning of training, or only after training for several steps?
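
For item 1, a minimal check (just a sketch, assuming the script is run from the repo root with the local ./bert-base-uncased/ directory) could be:

import transformers
from transformers import BertTokenizer

# The repo expects transformers 4.3.3; other versions may resolve the vocab differently
print(transformers.__version__)

# Confirm the tokenizer loads from the local directory and keeps the original vocab size
tokenizer = BertTokenizer.from_pretrained("./bert-base-uncased/")
print(len(tokenizer))  # bert-base-uncased ships with 30,522 vocab entries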

@yym6472

  1. Yes, I installed transformers version 4.3.3.
  2. Here is the output of running the above script:
[{'candidate_id': '<fb:m.0gxnnwq>', 'is_answer': 1, 'is_self': 0, 'converted_paths': [['Justin Bieber', 'people.person.children', 'Jeremy Bieber', 'people.person.children', 'Jaxon Bieber'], ['Justin Bieber', 'people.person.sibling_s', '[unknown entity]', 'people.sibling_relationship.sibling', 'Jaxon Bieber'], ['Justin Bieber', 'people.person.parents', 'Jeremy Bieber', 'people.person.children', 'Jaxon Bieber'], ['Justin Bieber', 'people.sibling_relationship.sibling', '[unknown entity]', 'people.sibling_relationship.sibling', 'Jaxon Bieber']], 'raw_paths': [['<fb:m.06w2sn5>', '<fb:people.person.children>', '<fb:m.0gxnnv_>', '<fb:people.person.children>', '<fb:m.0gxnnwq>'], ['<fb:m.06w2sn5>', '<fb:people.person.sibling_s>', '<fb:m.0gxnnwp>', '<fb:people.sibling_relationship.sibling>', '<fb:m.0gxnnwq>'], ['<fb:m.06w2sn5>', '<fb:people.person.parents>', '<fb:m.0gxnnv_>', '<fb:people.person.children>', '<fb:m.0gxnnwq>'], ['<fb:m.06w2sn5>', '<fb:people.sibling_relationship.sibling>', '<fb:m.0gxnnwp>', '<fb:people.sibling_relationship.sibling>', '<fb:m.0gxnnwq>']]}, {'candidate_id': '<fb:m.0frtj8w>', 'is_answer': 0, 'is_self': 0, 'converted_paths': [['Justin Bieber', 'music.artist.album', 'My Worlds']], 'raw_paths': [['<fb:m.06w2sn5>', '<fb:music.artist.album>', '<fb:m.0frtj8w>']]}, {'candidate_id': '<fb:m.0zs8p_f>', 'is_answer': 0, 'is_self': 0, 'converted_paths': [['Justin Bieber', 'music.artist.track', 'Backpack']], 'raw_paths': [['<fb:m.06w2sn5>', '<fb:music.artist.track>', '<fb:m.0zs8p_f>']]}, {'candidate_id': '<fb:m.0rd_xpf>', 'is_answer': 0, 'is_self': 0, 'converted_paths': [['Justin Bieber', 'music.artist.track', 'Never Say Never (acoustic)']], 'raw_paths': [['<fb:m.06w2sn5>', '<fb:music.artist.track>', '<fb:m.0rd_xpf>']]}, {'candidate_id': '<fb:m.0sj6x9g>', 'is_answer': 0, 'is_self': 0, 'converted_paths': [['Justin Bieber', 'music.artist.track', 'Beauty and a Beat (acoustic version)']], 'raw_paths': [['<fb:m.06w2sn5>', '<fb:music.artist.track>', '<fb:m.0sj6x9g>']]}, {'candidate_id': '<fb:m.0zkmw7y>', 'is_answer': 0, 'is_self': 0, 'converted_paths': [['Justin Bieber', 'music.artist.album', 'PYD']], 'raw_paths': [['<fb:m.06w2sn5>', '<fb:music.artist.album>', '<fb:m.0zkmw7y>']]}, {'candidate_id': '<fb:m.0yrkc0l>', 'is_answer': 0, 'is_self': 0, 'converted_paths': [['Justin Bieber', 'award.award_honor.award_winner', '[unknown entity]'], ['Justin Bieber', 'music.artist.album', 'Boyfriend', 'award.award_honor.honored_for', '[unknown entity]'], ['Justin Bieber', 'award.award_winner.awards_won', '[unknown entity]'], ['Justin Bieber', 'music.artist.album', 'Boyfriend', 'award.award_winning_work.awards_won', '[unknown entity]']], 'raw_paths': [['<fb:m.06w2sn5>', '<fb:award.award_honor.award_winner>', '<fb:m.0yrkc0l>'], ['<fb:m.06w2sn5>', '<fb:music.artist.album>', '<fb:m.0j8sx6v>', '<fb:award.award_honor.honored_for>', '<fb:m.0yrkc0l>'], ['<fb:m.06w2sn5>', '<fb:award.award_winner.awards_won>', '<fb:m.0yrkc0l>'], ['<fb:m.06w2sn5>', '<fb:music.artist.album>', '<fb:m.0j8sx6v>', '<fb:award.award_winning_work.awards_won>', '<fb:m.0yrkc0l>']]}, {'candidate_id': '<fb:m.03gfvhv>', 'is_answer': 0, 'is_self': 0, 'converted_paths': [['Justin Bieber', 'broadcast.artist.content', 'Emphatic Radio.com!']], 'raw_paths': [['<fb:m.06w2sn5>', '<fb:broadcast.artist.content>', '<fb:m.03gfvhv>']]}, {'candidate_id': '<fb:m.0np1qjx>', 'is_answer': 0, 'is_self': 0, 'converted_paths': [['Justin Bieber', 'music.artist.track', 'Kiss and Tell']], 'raw_paths': [['<fb:m.06w2sn5>', '<fb:music.artist.track>', '<fb:m.0np1qjx>']]}, 
{'candidate_id': '<fb:m.0c3vvnk>', 'is_answer': 0, 'is_self': 0, 'converted_paths': [['Justin Bieber', 'music.composer.compositions', 'Never Say Never'], ['Justin Bieber', 'music.composition.composer', 'Never Say Never']], 'raw_paths': [['<fb:m.06w2sn5>', '<fb:music.composer.compositions>', '<fb:m.0c3vvnk>'], ['<fb:m.06w2sn5>', '<fb:music.composition.composer>', '<fb:m.0c3vvnk>']]}]
  3. I did not modify any preprocessing scripts in the 'Preparing WebQSP Dataset' step (all scripts in graftnet_preprocessing).
    But I did modify the scripts in the 'Preprocessing WebQSP for BERT-based KBQA' step as follows, because the path ./GraftNet/preprocessing/ did not exist in my environment.

For line 20 in data/freebase/generate_mid_to_name_mapping.py:

all_entities_file = "../GraftNet/preprocessing/freebase_2hops/all_entities"
->
all_entities_file = "../../freebase_2hops/all_entities"

For line 99 in data/webqsp_name_expanded/process.py:

subgraph_file = "../GraftNet/preprocessing/webqsp_subgraphs.json"
->
subgraph_file = "../../webqsp_subgraphs.json"

I wonder if that indicates something went wrong in my previous preprocessing steps.

  4. I encountered the CUDA error right at the beginning of training.

Hi, yuanmeng, I have solved the problem and successfully run the script over the past few days.
The reason I ran into the above problem is probably that I did not get the complete Freebase data from the graftnet_preprocessing/run_pipeline script: the network connection was dropped the first time I downloaded the file. Although I reran the script and got the complete file, I forgot to rename the new file to overwrite the old one, so the subsequent files were still processed from the incomplete download.
Anyway, thanks again for your help! @yym6472

@lirenhao1997 Happy to hear that, and thanks for the update. I'm sorry I couldn't work through it with you due to some recent personal matters.