ddshan/hand_object_detector

RuntimeError: CUDA error: device-side assert triggered (can't train the model if batch is not 1)

It seems that batch_size can only be 1.
When I set batch_size = 4 or 8 during training, the following error occurs:

Traceback (most recent call last):
File "trainval_net.py", line 321, in <module>
rois_label, loss_list = fasterRCNN(im_data, im_info, gt_boxes, num_boxes, box_info)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/content/Hand-Object-Interaction-detection/lib/model/faster_rcnn/faster_rcnn.py", line 62, in forward
roi_data = self.RCNN_proposal_target(rois, gt_boxes, num_boxes, box_info)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/content/Hand-Object-Interaction-detection/lib/model/rpn/proposal_target_layer_cascade.py", line 52, in forward
rois_per_image, self._num_classes, box_info)
File "/content/Hand-Object-Interaction-detection/lib/model/rpn/proposal_target_layer_cascade.py", line 146, in _sample_rois_pytorch
fg_inds = torch.nonzero(max_overlaps[i] >= cfg.TRAIN.FG_THRESH).view(-1)
RuntimeError: CUDA error: device-side assert triggered

Hey @ddshan, have you ever trained the network with batch_size = 4 or any other value greater than 1?

I turned off CUDA and ran on the CPU; the actual error is:

Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
Traceback (most recent call last):
File "trainval_net.py", line 321, in <module>
rois_label, loss_list = fasterRCNN(im_data, im_info, gt_boxes, num_boxes, box_info)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/content/Hand-Object-Interaction-detection/lib/model/faster_rcnn/faster_rcnn.py", line 62, in forward
roi_data = self.RCNN_proposal_target(rois, gt_boxes, num_boxes, box_info)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/content/Hand-Object-Interaction-detection/lib/model/rpn/proposal_target_layer_cascade.py", line 52, in forward
rois_per_image, self._num_classes, box_info)
File "/content/Hand-Object-Interaction-detection/lib/model/rpn/proposal_target_layer_cascade.py", line 136, in _sample_rois_pytorch
list_box.append(box_info[i][(offset[i,:].view(-1),)])
IndexError: index 20 is out of bounds for dimension 0 with size 20
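
For anyone hitting the same opaque "device-side assert triggered" message: besides rerunning on the CPU as above, a common way to get a more accurate stack trace is to make CUDA kernel launches synchronous with the `CUDA_LAUNCH_BLOCKING` environment variable. A minimal sketch, assuming the variable is set before CUDA is initialized (the safest place is before `import torch`):

```python
import os

# Make CUDA kernel launches synchronous so the Python traceback points at the
# operation that actually failed rather than at a later, unrelated call.
# This must run before CUDA is initialized, so keep it above "import torch".
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

This slows training down noticeably, so it is only worth enabling while debugging.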

I am pretty sure the problem is in proposal_target_layer_cascade.py, around line 170:

labels = gt_boxes[:, :, 4].contiguous().view(-1)[(offset.view(-1),)].view(batch_size, -1)
list_box = []
for i in range(batch_size):
    # Error when batch_size > 1: IndexError: index 20 is out of bounds for dimension 0 with size 20
    list_box.append(box_info[i][(offset[i, :].view(-1),)])
boxes_info = torch.stack(list_box)
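
A possible explanation, sketched with toy tensors. The definition of `offset` is not shown in this thread, so the code below assumes it is `i * gt_boxes.size(1) + gt_assignment[i]`, i.e. an index into the flattened batch-and-box axis. That is what the `labels` line suggests, since it applies `offset.view(-1)` to a flattened view of `gt_boxes`. Under that assumption, `box_info[i][offset[i]]` is only valid for `i = 0`, which would explain why batch size 1 works. The tensor shapes, the `gt_assignment` values, and the helper names `gather_per_image_buggy`, `gather_flat`, and `gather_local` are all hypothetical.

```python
import torch

batch_size, num_gt, info_dim = 2, 20, 5

# Hypothetical stand-ins for the real tensors.
box_info = torch.arange(batch_size * num_gt * info_dim, dtype=torch.float32)
box_info = box_info.view(batch_size, num_gt, info_dim)
# gt_assignment[i, j]: index (0..num_gt-1) of the ground-truth box matched to ROI j of image i.
gt_assignment = torch.tensor([[0, 3, 19, 7], [1, 19, 4, 0]])

# ASSUMED definition of offset: shift image i by i * num_gt so that the
# indices address the flattened (batch_size * num_gt) axis.
offset = torch.arange(0, batch_size).view(-1, 1) * num_gt + gt_assignment


def gather_per_image_buggy(box_info, offset):
    # Mirrors the loop from the issue: flattened indices applied to a
    # per-image tensor, which goes out of bounds for every image after the first.
    return torch.stack([box_info[i][offset[i]] for i in range(box_info.size(0))])


def gather_flat(box_info, offset):
    # Option 1: flatten batch and box axes so the flattened indices are valid.
    b, _, d = box_info.shape
    return box_info.reshape(-1, d)[offset.reshape(-1)].view(b, -1, d)


def gather_local(box_info, gt_assignment):
    # Option 2: keep the loop, but index with per-image indices instead.
    return torch.stack([box_info[i][gt_assignment[i]] for i in range(box_info.size(0))])
```

On the CPU, `gather_per_image_buggy(box_info, offset)` raises an `IndexError` for the second image, while `gather_flat` and `gather_local` return the same `(batch_size, rois, info_dim)` tensor. Whether either option is enough to train correctly with a larger batch has not been verified against the repository.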

Hi,

We only trained with batch size = 1 because of constraints introduced by our modifications to the codebase we built on. Sorry for the inconvenience. We will let you know if we release an improved version.