CUDA error: device-side assert triggered

Question

Risho92 opened this issue 4 years ago · 8 comments

python3 run.py --model AttH --max_epochs 1 --batch_size 2

I tried to run the AttH model with the command above and got the error "CUDA error: device-side assert triggered". The full traceback is below. I am on Ubuntu 20 with CUDA 11. Could you please provide some guidance?

Traceback (most recent call last):
File "run.py", line 191, in <module>
train(parser.parse_args())
File "run.py", line 142, in train
train_loss = optimizer.epoch(train_examples)
File "/home/<user_name>/Desktop/AttH/KGEmb/optimizers/kg_optimizer.py", line 175, in epoch
l = self.calculate_loss(input_batch)
File "/home/<user_name>/Desktop/AttH/KGEmb/optimizers/kg_optimizer.py", line 120, in calculate_loss
loss, factors = self.neg_sampling_loss(input_batch)
File "/home/<user_name>/Desktop/AttH/KGEmb/optimizers/kg_optimizer.py", line 80, in neg_sampling_loss
positive_score, factors = self.model(input_batch)
File "/home/<user_name>/Desktop/AttH/KGEmb/hyp_kg_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/<user_name>/Desktop/AttH/KGEmb/models/base.py", line 140, in forward
lhs_e, lhs_biases = self.get_queries(queries)
File "/home/<user_name>/Desktop/AttH/KGEmb/models/hyperbolic.py", line 94, in get_queries
rot_q = givens_rotations(rot_mat, head).view((-1, 1, self.rank))
File "/home/<user_name>/Desktop/AttH/KGEmb/utils/euclidean.py", line 41, in givens_rotations
givens = givens / torch.norm(givens, p=2, dim=-1, keepdim=True)
File "/home/<user_name>/Desktop/AttH/KGEmb/hyp_kg_env/lib/python3.7/site-packages/torch/functional.py", line 1123, in norm
return _VF.norm(input, p, _dim, keepdim=keepdim)
RuntimeError: CUDA error: device-side assert triggered
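
CUDA kernels are launched asynchronously, so the Python line named in a device-side assert traceback is not necessarily the operation that actually failed. A minimal sketch of one common debugging step follows: setting the `CUDA_LAUNCH_BLOCKING` environment variable so kernel launches become synchronous and the traceback points at the real call. The helper name is hypothetical, and the sketch assumes the variable is set before PyTorch initializes CUDA.

```python
import os


def enable_blocking_cuda_launches() -> None:
    """Make CUDA kernel launches synchronous (slower, for debugging only)."""
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


# Call this before importing torch, so the setting is in place
# when the CUDA context is created.
enable_blocking_cuda_launches()
```

Prefixing the command works the same way: `CUDA_LAUNCH_BLOCKING=1 python3 run.py --model AttH --max_epochs 1 --batch_size 2`. Running the same command on the CPU is another way to get a more direct error message.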

Answer 1 · 2020-12-01T18:22:43.000Z

I am facing the same issue. Is it resolved? If yes, please let me know how.

Answer 2 · 2020-12-02T15:22:26.000Z

@kingsaint could you share the command you are running?

Answer 3 · 2020-12-02T15:55:09.000Z

@ines-chami Looks like the multi_c option should be on? The following command worked.
python run.py --dataset YAGO3-10 --model AttH --max_epochs 500 --patience 10 --rank 200 --neg_sample_size -1 --learning_rate 0.0005 --multi_c

If I don't want multiple curvatures per relation, however, it does not work: an index out of range error occurs at line 91 of models/hyperbolic.py.

Answer 4 · 2020-12-02T16:12:33.000Z

The command:
python run.py --dataset YAGO3-10 --model AttH --max_epochs 500 --patience 10 --rank 200 --neg_sample_size -1 --learning_rate 0.0005 --batch_size 100
worked fine for me (I had to reduce the batch size to avoid memory issues).

Could you share your training log or a screenshot so I can see where the error happens?

Answer 5 · 2020-12-02T18:10:44.000Z

I used your command and got this error

Traceback (most recent call last):
File "run.py", line 191, in <module>
train(parser.parse_args())
File "run.py", line 142, in train
train_loss = optimizer.epoch(train_examples)
File "/common/home/rb897/KGEmb/optimizers/kg_optimizer.py", line 175, in epoch
l = self.calculate_loss(input_batch)
File "/common/home/rb897/KGEmb/optimizers/kg_optimizer.py", line 122, in calculate_loss
predictions, factors = self.model(input_batch, eval_mode=True)
File "/common/users/rb897/local/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/common/home/rb897/KGEmb/models/base.py", line 140, in forward
lhs_e, lhs_biases = self.get_queries(queries)
File "/common/home/rb897/KGEmb/models/hyperbolic.py", line 95, in get_queries
rot_q = givens_rotations(rot_mat, head).view((-1, 1, self.rank))
File "/common/home/rb897/KGEmb/utils/euclidean.py", line 43, in givens_rotations
x_rot = givens[:, :, 0:1] * x + givens[:, :, 1:] * torch.cat((-x[:, :, 1:], x[:, :, 0:1]), dim=-1)
RuntimeError: CUDA error: device-side assert triggered
/opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda ->auto::operator()(int)->auto: block: [0,0,0], thread: [96,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda ->auto::operator()(int)->auto: block: [0,0,0], thread: [97,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda ->auto::operator()(int)->auto: block: [0,0,0], thread: [98,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda ->auto::operator()(int)->auto: block: [0,0,0], thread: [99,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
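
The assertion in this log, `index >= -sizes[i] && index < sizes[i]`, is a bounds check: every index used in a lookup must lie in the range [-size, size). Below is a minimal pure-Python sketch of the same check, which can be run on the CPU over a batch of ids before they reach an indexing operation. The function name and the sample values are hypothetical and do not come from the KGEmb code.

```python
from typing import Iterable, List


def out_of_range(indices: Iterable[int], size: int) -> List[int]:
    """Return every index that violates -size <= index < size."""
    return [i for i in indices if not (-size <= i < size)]


# Hypothetical example: a table with 5 rows and a batch of lookup indices.
bad = out_of_range([0, 4, 5, -5, -6], size=5)
print(bad)  # [5, -6]
```

For a tensor of ids, the values could be passed in as a plain list (for example via `tensor.tolist()`). A non-empty result would mean some id exceeds the size of the table it is used to index.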

Answer 6 · 2020-12-02T22:50:35.000Z

The first issue, posted by @Risho92, seems to be caused by a divide-by-zero error, which should be fixed in this commit:
7390004
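
The line flagged in the first traceback, `givens = givens / torch.norm(givens, p=2, dim=-1, keepdim=True)`, divides each pair of values by its L2 norm, so an all-zero pair produces a zero denominator. As an illustration only, here is a pure-Python sketch of a generic guard that clamps the norm to a small positive value before dividing. The content of the commit is not shown in this thread, so this is not necessarily what it does; the function name and the `eps` value are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def normalize_pairs(
    pairs: Sequence[Tuple[float, float]], eps: float = 1e-15
) -> List[Tuple[float, float]]:
    """Scale each (a, b) pair to unit L2 norm, guarding against a zero norm."""
    out = []
    for a, b in pairs:
        # Clamp the norm from below so a (0, 0) pair does not divide by zero.
        norm = max(math.hypot(a, b), eps)
        out.append((a / norm, b / norm))
    return out


print(normalize_pairs([(3.0, 4.0), (0.0, 0.0)]))  # [(0.6, 0.8), (0.0, 0.0)]
```

On tensors, the same idea could be written by applying `clamp_min` to the result of `torch.norm` before the division.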

@kingsaint your issue seems to be triggered somewhere else but I cannot reproduce the bug. I am using Python 3.7.3 and the packages below:

numpy==1.18.3
torch==1.5.0

Answer 7 · 2021-05-16T19:26:51.000Z

Hey, I am also getting this error. Can anyone help me?

Answer 8 · 2021-06-09T08:04:10.000Z

@Sahajtomar It works perfectly if you pin these versions in the requirements.txt file:

   numpy==1.18.3
   torch==1.5.0