LiWentomng/OrientedRepPoints

Error when computing GIoU during training

Closed this issue · 2 comments

During training, an error in the GIoU computation appears before a single epoch finishes, even though the dataset seems fine after checking. The problem shows up on some datasets, while training runs normally on most others. What could be causing this, and how can it be fixed? Thanks!

Traceback (most recent call last):
  File "tools/train.py", line 154, in <module>
    main()
  File "tools/train.py", line 150, in main
    meta=meta)
  File "/home/f523/guazai/sdb/zhoujunfeng/OrientedRepPoints-main/mmdet/apis/train.py", line 112, in train_detector
    meta=meta)
  File "/home/f523/guazai/sdb/zhoujunfeng/OrientedRepPoints-main/mmdet/apis/train.py", line 245, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/f523/anaconda3/envs/redet/lib/python3.6/site-packages/mmcv/runner/runner.py", line 373, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/f523/anaconda3/envs/redet/lib/python3.6/site-packages/mmcv/runner/runner.py", line 275, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/home/f523/guazai/sdb/zhoujunfeng/OrientedRepPoints-main/mmdet/apis/train.py", line 75, in batch_processor
    losses = model(**data)
  File "/home/f523/anaconda3/envs/redet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/f523/anaconda3/envs/redet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/f523/anaconda3/envs/redet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/f523/guazai/sdb/zhoujunfeng/OrientedRepPoints-main/mmdet/core/fp16/decorators.py", line 49, in new_func
    return old_func(*args, **kwargs)
  File "/home/f523/guazai/sdb/zhoujunfeng/OrientedRepPoints-main/mmdet/models/detectors/base.py", line 147, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home/f523/guazai/sdb/zhoujunfeng/OrientedRepPoints-main/mmdet/models/detectors/orientedreppoints_detector.py", line 36, in forward_train
    *loss_inputs, gt_rbboxes_ignore=gt_rbboxes_ignore)
  File "/home/f523/guazai/sdb/zhoujunfeng/OrientedRepPoints-main/mmdet/models/anchor_heads/orientedreppoints_head.py", line 465, in loss
    rbox_weights_list_refine, pos_inds_list_refine)
  File "/home/f523/guazai/sdb/zhoujunfeng/OrientedRepPoints-main/mmdet/core/utils/misc.py", line 24, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/f523/guazai/sdb/zhoujunfeng/OrientedRepPoints-main/mmdet/models/anchor_heads/orientedreppoints_head.py", line 567, in points_quality_assessment
    reduction_override='none')
  File "/home/f523/anaconda3/envs/redet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/f523/guazai/sdb/zhoujunfeng/OrientedRepPoints-main/mmdet/models/losses/iou_loss.py", line 134, in forward
    self.loss_weight)
  File "/home/f523/guazai/sdb/zhoujunfeng/OrientedRepPoints-main/mmdet/models/losses/iou_loss.py", line 78, in forward
    convex_gious, grad = convex_giou(pred, target)
  File "/home/f523/guazai/sdb/zhoujunfeng/OrientedRepPoints-main/mmdet/ops/iou/iou_wrapper.py", line 13, in convex_giou
    convex_giou_grad = convex_giou_cuda.convex_giou(pred, target)
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at mmdet/ops/iou/src/convex_giou_kernel.cu:856
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered (insert_events at /opt/conda/conda-bld/pytorch_1573049304260/work/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f3b4d9a5687 in /home/f523/anaconda3/envs/redet/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x17044 (0x7f3b4dbe1044 in /home/f523/anaconda3/envs/redet/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1cccb (0x7f3b4dbe6ccb in /home/f523/anaconda3/envs/redet/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7f3b4d992e9d in /home/f523/anaconda3/envs/redet/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x1c86c9 (0x7f3b4eb2d6c9 in /home/f523/anaconda3/envs/redet/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x4aff6b (0x7f3b4ee14f6b in /home/f523/anaconda3/envs/redet/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x4affa6 (0x7f3b4ee14fa6 in /home/f523/anaconda3/envs/redet/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #26: __libc_start_main + 0xf0 (0x7f3b53406840 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

Has this been resolved? I'm hitting the same problem on an ordinary detection dataset: RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at mmdet/ops/iou/src/convex_giou_kernel.cu:856 appears at random after training has been running for a while.

Hi, please check whether the annotations in your dataset are valid. In my case, the annotations themselves were faulty: some boxes had a width or height of 0, and such degenerate boxes cause the GIoU computation to fail.
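For reference, a minimal sketch of such a check, assuming DOTA-style label files where each line holds 8 polygon coordinates followed by the category (the script name, function name, and eps threshold below are my own, not part of OrientedRepPoints):

```python
# check_annotations.py -- hypothetical helper, not part of the OrientedRepPoints repo.
# Scans DOTA-style label files (x1 y1 x2 y2 x3 y3 x4 y4 category [difficult]) and
# reports boxes whose minimum-area rotated rectangle has ~zero width or height,
# i.e. the degenerate case described above.
import glob
import os
import sys

import cv2
import numpy as np


def find_degenerate_boxes(label_dir, eps=1e-3):
    bad = []
    for path in sorted(glob.glob(os.path.join(label_dir, "*.txt"))):
        with open(path) as f:
            for lineno, line in enumerate(f, 1):
                parts = line.split()
                if len(parts) < 8:
                    continue  # skip metadata lines such as 'imagesource:...'
                try:
                    coords = np.array([float(v) for v in parts[:8]],
                                      dtype=np.float32).reshape(4, 2)
                except ValueError:
                    continue
                # Width/height of the tightest rotated rectangle around the polygon.
                _, (w, h), _ = cv2.minAreaRect(coords)
                if w < eps or h < eps:
                    bad.append((path, lineno, w, h))
    return bad


if __name__ == "__main__":
    for path, lineno, w, h in find_degenerate_boxes(sys.argv[1]):
        print(f"{path}:{lineno} degenerate box (w={w:.4f}, h={h:.4f})")
```

Something like `python check_annotations.py path/to/labelTxt` would list the offending lines; removing or fixing those boxes before retraining should avoid the crash if this is indeed the cause.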