hustvl/CrossVIS

RuntimeError: CUDA error: device-side assert triggered

Alxx999 opened this issue · 10 comments

Hello, I ran into the following error during training. How can I solve it?

Before training, I added the following to adet/data/builtin.py:

_PERDEFINED_SPLITS_YOUTUBEVIS_VIDEO = {
    'youtubevis_train':
        # ('youtubevis/train/', 'youtubevis/annotations/train.json'),
        ('/media/lin/file/VIS/datasets/youtube-vis2021/train/JPEGImages', '/media/lin/file/VIS/datasets/youtube-vis2021/train/instances.json'),
    'youtubevis_valid':
        # ('youtubevis/valid/', 'youtubevis/annotations/valid.json'),
        ('/media/lin/file/VIS/datasets/youtube-vis2021/valid/JPEGImages', '/media/lin/file/VIS/datasets/youtube-vis2021/valid/instances.json'),
    'youtubevis_test':
        ('youtubevis/test/', 'youtubevis/annotations/test.json'),
}

metadata_youtubevis_video = {
    'thing_classes': [
        'airplane', 'bear', 'bird', 'boat', 'car',
        'cat', 'cow', 'deer', 'dog', 'duck',
        'earless_seal', 'elephant', 'fish', 'flying_disc', 'fox',
        'frog', 'giant_panda', 'giraffe', 'horse', 'leopard',
        'lizard', 'monkey', 'motorbike', 'mouse', 'parrot',
        'person', 'rabbit', 'shark', 'skateboard', 'snake',
        'snowboard', 'squirrel', 'surfboard', 'tennis_racket', 'tiger',
        'train', 'truck', 'turtle', 'whale', 'zebra'
    ]
}

Then:
I registered VIS_CATEGORIES in detectron2/data/datasets/builtin_meta.py and modified _get_coco_instances_meta to use it:

def _get_coco_instances_meta():
    thing_ids = [k["id"] for k in VIS_CATEGORIES if k["isthing"] == 1]
    thing_colors = [k["color"] for k in VIS_CATEGORIES if k["isthing"] == 1]
    assert len(thing_ids) == 40, len(thing_ids)
    # Mapping from the dataset category id to a contiguous id in [0, 39]
    thing_dataset_id_to_contiguous_id = {k: i for i, k in enumerate(thing_ids)}
    thing_classes = [k["name"] for k in VIS_CATEGORIES if k["isthing"] == 1]
    ret = {
        "thing_dataset_id_to_contiguous_id": thing_dataset_id_to_contiguous_id,
        "thing_classes": thing_classes,
        "thing_colors": thing_colors,
    }
    return ret
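To illustrate what this mapping produces, here is a minimal sketch. The three-entry VIS_CATEGORIES list and the helper name build_contiguous_mapping are hypothetical stand-ins for illustration only; the real list has 40 entries, and its ids and colors may differ.

```python
# Hypothetical three-entry stand-in for VIS_CATEGORIES (the real list has 40 entries).
VIS_CATEGORIES = [
    {"id": 1, "name": "airplane", "isthing": 1, "color": [220, 20, 60]},
    {"id": 2, "name": "bear", "isthing": 1, "color": [119, 11, 32]},
    {"id": 3, "name": "bird", "isthing": 1, "color": [0, 0, 142]},
]


def build_contiguous_mapping(categories):
    """Map each dataset category id to a 0-based contiguous training id."""
    thing_ids = [c["id"] for c in categories if c["isthing"] == 1]
    return {dataset_id: i for i, dataset_id in enumerate(thing_ids)}


mapping = build_contiguous_mapping(VIS_CATEGORIES)
print(mapping)  # {1: 0, 2: 1, 3: 2}
```

With 40 categories the contiguous ids run from 0 to 39, which is why NUM_CLASSES has to match the length of this list.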

Then I changed NUM_CLASSES in adet/config/defaults.py to 40.

My final training command is:

python tools/train_net.py --config configs/CrossVIS/R_50_1x.yaml MODEL.WEIGHTS CondInst_MS_R_50_1x.pth

The error is as follows:
[05/16 20:27:26 adet.data.common]: Serializing 89750 elements to byte tensors and concatenating them all ...
[05/16 20:27:26 adet.data.common]: Serialized dataset takes 216.92 MiB
[05/16 20:27:26 adet.data.build]: Using training sampler TrainingSampler
[05/16 20:27:26 fvcore.common.checkpoint]: [Checkpointer] Loading from CondInst_MS_R_50_1x.pth ...
WARNING [05/16 20:27:26 fvcore.common.checkpoint]: Skip loading parameter 'proposal_generator.fcos_head.cls_logits.weight' to the model due to incompatible shapes: (80, 256, 3, 3) in the checkpoint but (40, 256, 3, 3) in the model! You might want to double check if this is expected.
WARNING [05/16 20:27:26 fvcore.common.checkpoint]: Skip loading parameter 'proposal_generator.fcos_head.cls_logits.bias' to the model due to incompatible shapes: (80,) in the checkpoint but (40,) in the model! You might want to double check if this is expected.
WARNING [05/16 20:27:26 fvcore.common.checkpoint]: Some model parameters or buffers are not found in the checkpoint:
cls.{bias, weight}
mask_head._iter
proposal_generator.fcos_head.cls_logits.{bias, weight}
proposal_generator.fcos_head.reid_pred.{bias, weight}
proposal_generator.fcos_head.reid_pred_bn.{bias, running_mean, running_var, weight}
proposal_generator.fcos_head.reid_tower.0.{bias, weight}
proposal_generator.fcos_head.reid_tower.1.{bias, weight}
proposal_generator.fcos_head.reid_tower.10.{bias, weight}
proposal_generator.fcos_head.reid_tower.3.{bias, weight}
proposal_generator.fcos_head.reid_tower.4.{bias, weight}
proposal_generator.fcos_head.reid_tower.6.{bias, weight}
proposal_generator.fcos_head.reid_tower.7.{bias, weight}
proposal_generator.fcos_head.reid_tower.9.{bias, weight}
[05/16 20:27:26 adet.trainer]: Starting training from iteration 0
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [32,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [33,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [34,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [35,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [36,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [37,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [38,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [39,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [40,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [41,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [42,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [43,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [44,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [45,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [46,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [47,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [48,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [49,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [50,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [51,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [52,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [53,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [54,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [0,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [1,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [2,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [3,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [4,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [5,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [6,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [7,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [8,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [9,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [10,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [11,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [12,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [13,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [14,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [15,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [16,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [17,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [18,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [19,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [20,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [21,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [22,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [23,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [24,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [25,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [26,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [27,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [28,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [29,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [30,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [31,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Traceback (most recent call last):
  File "tools/train_net.py", line 231, in <module>
    args=(args, ),
  File "/home/lin/.conda/envs/crossvis/lib/python3.7/site-packages/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "tools/train_net.py", line 219, in main
    return trainer.train()
  File "tools/train_net.py", line 97, in train
    self.train_loop(self.start_iter, self.max_iter)
  File "tools/train_net.py", line 87, in train_loop
    self.run_step()
  File "/home/lin/.conda/envs/crossvis/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/home/lin/.conda/envs/crossvis/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 285, in run_step
    losses.backward()
  File "/home/lin/.conda/envs/crossvis/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/lin/.conda/envs/crossvis/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA error: device-side assert triggered

I also ran into this error. Do you know how to fix it?

Hi, all! Thanks for your interest in our work.
It seems both of you are trying to train CrossVIS on the YouTube-VIS 2021 dataset. Please change self.nID here to the number of identities in YouTube-VIS 2021 (larger than the default 3774); otherwise the identity loss will raise errors because the target indices may be out of bounds.
Hope this helps!
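A quick way to confirm this diagnosis before launching a GPU run is to compare the identity labels against self.nID on the CPU. The sketch below is only an illustration: the helper name out_of_range_ids and the sample labels are hypothetical, and it assumes identity labels are 0-based integers that must lie in [0, nID).

```python
def out_of_range_ids(instance_ids, n_id):
    """Return the sorted identity labels that fall outside [0, n_id)."""
    return sorted({i for i in instance_ids if not 0 <= i < n_id})


# Hypothetical identity labels; 3774 is the default self.nID mentioned above.
labels = [0, 17, 3773, 3774, 6282]
bad = out_of_range_ids(labels, n_id=3774)
print(bad)  # [3774, 6282]
```

Any non-empty result means some target index would go past the end of the identity classifier, which is consistent with the "index out of bounds" assertion in the log. Because CUDA errors are reported asynchronously, rerunning with the environment variable CUDA_LAUNCH_BLOCKING=1 usually makes the Python traceback point at the operation that actually failed.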

Much appreciated!

Hi, what value did you finally change 3774 to?

You can try 6283; it works for me.

Thanks

If I use a VIS dataset I created myself, what should self.nID be set to?

It should be the number of all the instances in your own training dataset.

Check the ['annotations'] key in train.json.
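A minimal sketch of that check, assuming the file follows the YouTube-VIS JSON layout in which each entry of 'annotations' describes one instance track. The helper name count_instances and the tiny sample file are hypothetical and only make the example runnable on its own.

```python
import json
import os
import tempfile


def count_instances(annotation_path):
    """Count the entries under the 'annotations' key of a VIS-style JSON file."""
    with open(annotation_path) as f:
        data = json.load(f)
    return len(data["annotations"])


# Hypothetical two-instance annotation file used only for this demo.
sample = {
    "videos": [{"id": 1}],
    "annotations": [{"id": 1, "video_id": 1}, {"id": 2, "video_id": 1}],
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.json")
    with open(path, "w") as f:
        json.dump(sample, f)
    demo_count = count_instances(path)

print(demo_count)  # 2
```

Running count_instances on your own train.json gives a candidate value for self.nID. It is worth double-checking that the result is at least as large as the highest identity label the data loader can produce.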

Thank you so much