CUDA error: device-side assert triggered in IndexKernel.cu
zyfone opened this issue · 1 comment
zyfone commented
Issue Title:
CUDA error: device-side assert triggered in IndexKernel.cu
Description:
[05/03 01:24:57 d2.utils.events]: eta: 12:47:44 iter: 979 total_loss: 0.05901 cos_dist_loss_6: 0.03457 reg_loss_6: 0.02462 time: 0.4656 data_time: 0.0113 lr: N/A max_mem: 4285M
[05/03 01:25:07 d2.utils.events]: eta: 12:47:56 iter: 999 total_loss: 0.05903 cos_dist_loss_6: 0.03479 reg_loss_6: 0.02454 time: 0.4658 data_time: 0.0107 lr: N/A max_mem: 4285M
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [66,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [126,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [127,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
ERROR [05/03 01:05:25 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "train_voc.py", line 315, in run_step
loss_dict_s = self.model(data_s)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/domaingen/modeling/meta_arch.py", line 341, in forward
logits, proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/domaingen/modeling/rpn.py", line 63, in forward
gt_labels, gt_boxes = self.label_and_sample_anchors(anchors, gt_instances)
File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/rpn.py", line 340, in label_and_sample_anchors
matched_idxs, gt_labels_i = retry_if_cuda_oom(self.anchor_matcher)(match_quality_matrix)
File "/root/miniconda3/lib/python3.8/site-packages/detectron2/utils/memory.py", line 70, in wrapped
return func(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/detectron2/modeling/matcher.py", line 89, in __call__
assert torch.all(match_quality_matrix >= 0)
RuntimeError: CUDA error: device-side assert triggered
[05/03 01:05:25 d2.engine.hooks]: Overall training speed: 999 iterations in 0:07:51 (0.4720 s / it)
[05/03 01:05:25 d2.engine.hooks]: Total training time: 0:07:51 (0:00:00 on hooks)
[05/03 01:05:25 d2.utils.events]: eta: 12:57:07 iter: 1001 total_loss: 0.05898 cos_dist_loss_6: 0.03474 reg_loss_6: 0.02445 loss_cls: 26.82 loss_box_reg: 0.02828 loss_rpn_cls: 0.4097 loss_rpn_loc: 0.2613 time: 0.4719 data_time: 0.0103 lr: N/A max_mem: 20716M
Traceback (most recent call last):
File "train_voc.py", line 609, in <module>
main(args)
File "train_voc.py", line 598, in main
trainer.train()
File "/root/miniconda3/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 484, in train
super().train(self.start_iter, self.max_iter)
File "/root/miniconda3/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "train_voc.py", line 315, in run_step
loss_dict_s = self.model(data_s)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/domaingen/modeling/meta_arch.py", line 341, in forward
logits, proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/autodl-tmp/domaingen/modeling/rpn.py", line 63, in forward
gt_labels, gt_boxes = self.label_and_sample_anchors(anchors, gt_instances)
File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/rpn.py", line 340, in label_and_sample_anchors
matched_idxs, gt_labels_i = retry_if_cuda_oom(self.anchor_matcher)(match_quality_matrix)
File "/root/miniconda3/lib/python3.8/site-packages/detectron2/utils/memory.py", line 70, in wrapped
return func(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/detectron2/modeling/matcher.py", line 89, in __call__
assert torch.all(match_quality_matrix >= 0)
RuntimeError: CUDA error: device-side assert triggered
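Note: once a device-side assert has fired, every subsequent CUDA call raises the same RuntimeError, so the Matcher line in the traceback is not necessarily the real failure point; the original failure is the index-out-of-bounds assert from IndexKernel.cu shown at the top. In a detection pipeline this assert is commonly caused by ground-truth annotations that do not match the model configuration, e.g. a category id outside the configured number of classes, or degenerate/NaN boxes that make pairwise_iou produce invalid values. The snippet below is a minimal, hypothetical sanity check over a registered detectron2 dataset; the dataset name, the num_classes value, and the check_annotations helper are placeholders, not part of this repository.

# Hypothetical annotation sanity check; dataset name and num_classes are placeholders.
import math
from detectron2.data import DatasetCatalog
from detectron2.structures import BoxMode

def check_annotations(dataset_name, num_classes):
    bad = 0
    for record in DatasetCatalog.get(dataset_name):
        for ann in record.get("annotations", []):
            # Convert whatever bbox_mode the dataset uses into absolute XYXY.
            x0, y0, x1, y1 = BoxMode.convert(ann["bbox"], ann["bbox_mode"], BoxMode.XYXY_ABS)
            # Degenerate or NaN boxes can make pairwise_iou produce NaN/negative
            # values; out-of-range category ids trigger the indexing assert.
            bad_box = any(math.isnan(float(v)) for v in (x0, y0, x1, y1)) or x1 <= x0 or y1 <= y0
            bad_cls = not (0 <= ann["category_id"] < num_classes)
            if bad_box or bad_cls:
                bad += 1
                print("suspicious annotation in", record.get("file_name"), ":", ann)
    print(dataset_name, "->", bad, "suspicious annotations")

check_annotations("voc_2007_trainval", num_classes=6)  # placeholders: use the names/values from the config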
Environment:
- OS: Ubuntu
- CUDA Version: 11.1 (cu111 build)
- PyTorch Version: 1.8.1
Steps to Reproduce:
- Run: python -u train_voc.py --config-file configs/comic_watercolor.yaml
- The error is triggered at around iteration 999 (see the note on synchronous CUDA launches below).
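A general PyTorch debugging step (not specific to this repository) that helps here: rerun with synchronous CUDA kernel launches so the traceback points at the operation that actually tripped the device-side assert, rather than at a later call such as the Matcher assert above. A minimal sketch, assuming the variable is set before CUDA is initialized:

# Sketch only: force synchronous CUDA launches so the device-side assert
# surfaces at the failing kernel. Must run before torch initializes CUDA.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the variable is in place

The same effect can be obtained by prefixing the launch command with CUDA_LAUNCH_BLOCKING=1.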
zyfone commented
Please refer to this issue: facebookresearch/detectron2#3945.