Error after training for a few epochs: RuntimeError: d.is_cuda() INTERNAL ASSERT FAILED
Opened this issue · 4 comments
With device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") set,
training on either of the bundled datasets, duie or dgre, throws the following error after a few epochs:
【train】6/100 40420/713100 loss:2.2920708656311035
【train】6/100 40430/713100 loss:0.8514504432678223
【train】6/100 40440/713100 loss:1.3389232158660889
Traceback (most recent call last):
  File "/home/qhm/Program/TaoYuan/BERT-BILSTM-CRF-ty/BERT-BILSTM-CRF-main/main.py", line 229, in <module>
    main(data_name)
  File "/home/qhm/Program/TaoYuan/BERT-BILSTM-CRF-ty/BERT-BILSTM-CRF-main/main.py", line 220, in main
    train.train()
  File "/home/qhm/Program/TaoYuan/BERT-BILSTM-CRF-ty/BERT-BILSTM-CRF-main/main.py", line 54, in train
    output = self.model(input_ids, attention_mask, labels)
  File "/home/qhm/anaconda3/envs/TY_taishan/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/qhm/Program/TaoYuan/BERT-BILSTM-CRF-ty/BERT-BILSTM-CRF-main/model.py", line 34, in forward
    logits = self.crf.decode(seq_out, mask=attention_mask.bool())
  File "/home/qhm/anaconda3/envs/TY_taishan/lib/python3.9/site-packages/torchcrf/__init__.py", line 139, in decode
    return self._viterbi_decode(emissions, mask)
  File "/home/qhm/anaconda3/envs/TY_taishan/lib/python3.9/site-packages/torchcrf/__init__.py", line 305, in _viterbi_decode
    score = torch.where(mask[i].unsqueeze(1), next_score, score)
RuntimeError: d.is_cuda() INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1659484809662/work/c10/cuda/impl/CUDAGuardImpl.h":30, please report a bug to PyTorch.
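It was a hardware problem. After switching to a different machine, the error no longer appeared. The original machine kept dropping its GPU driver, and training other models in a different torch environment on that machine produced the same error.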
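Because the same assert showed up in a second torch environment on the same machine, a check that does not involve this repository at all can help separate a hardware or driver fault from a bug in the model code. The snippet below is only a rough sketch: the function name `gpu_sanity_check`, the matrix size, the round count, and the tolerances are all made up for illustration, and a passing run does not prove the hardware is healthy.

```python
def gpu_sanity_check(rounds=20, size=512):
    """Repeat one matmul on the GPU and compare it with a CPU reference.

    Hypothetical helper (not part of this repository). It returns a short
    status string instead of raising, so it also runs on machines that
    have no torch installed or no usable GPU.
    """
    try:
        import torch
    except ImportError:
        return "skipped: torch is not installed"
    if not torch.cuda.is_available():
        return "skipped: CUDA is not available"

    device = torch.device("cuda:0")
    torch.manual_seed(0)
    a = torch.randn(size, size)
    b = torch.randn(size, size)
    reference = a @ b  # computed on the CPU

    try:
        a_gpu, b_gpu = a.to(device), b.to(device)
        for i in range(rounds):
            result = (a_gpu @ b_gpu).cpu()
            torch.cuda.synchronize()  # surface asynchronous CUDA errors here
            # Loose tolerances (an assumption) so that reduced-precision
            # matmul modes such as TF32 do not count as a mismatch.
            if not torch.allclose(result, reference, rtol=1e-2, atol=1e-1):
                return f"mismatch on round {i}"
    except RuntimeError as exc:
        # Driver drops and internal asserts usually arrive as RuntimeError.
        return f"runtime error: {exc}"
    return "ok"


print(gpu_sanity_check())
```

If this fails or hangs intermittently in a fresh environment, the problem is more likely the machine or driver than the training script.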
Man! What can I say!
I kept hitting this problem too a few days ago. At the time my CPU was an i9-13900K; since swapping in a new i7-14700K, the problem has not come back.
Hard to believe, but this program was indeed running on a 13900K.
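For anyone who suspects the CPU rather than the GPU, one crude check is to run the same deterministic computation many times under load and see whether every run agrees. The sketch below uses only the standard library; the names `hash_chain` and `cpu_consistency_check` and all default values are assumptions chosen for illustration. It is no substitute for a proper stress-testing tool, and a clean result does not rule out an unstable CPU.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def hash_chain(payload, passes):
    """Hash the payload repeatedly, feeding each digest into the next pass."""
    digest = b""
    for _ in range(passes):
        digest = hashlib.sha256(digest + payload).digest()
    return digest.hex()


def cpu_consistency_check(workers=None, repeats=8, passes=50, payload_size=1 << 20):
    """Return how many parallel runs disagree with a reference result.

    hashlib releases the GIL for inputs larger than a couple of kilobytes,
    so the threads below can keep several cores busy at once.
    """
    workers = workers or os.cpu_count() or 1
    payload = bytes(range(256)) * (payload_size // 256)
    expected = hash_chain(payload, passes)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(
            pool.map(lambda _: hash_chain(payload, passes), range(workers * repeats))
        )
    # On healthy hardware every run must produce the identical digest.
    return sum(1 for r in results if r != expected)


print("mismatching runs:", cpu_consistency_check())
```

Any nonzero count means identical inputs produced different outputs, which points at hardware (CPU, memory, or power) rather than at PyTorch or the training code.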