Error when training HTC-CBV2
liming-ai opened this issue · 3 comments
liming-ai commented
Hi @bryanyzhu @cailk
Thanks for your contribution. I created an environment following the README and tried to train with this config.
However, an error was raised:
Traceback (most recent call last):
File "tools/train.py", line 188, in <module>
main()
File "tools/train.py", line 184, in main
meta=meta)
File "/home/tiger/code/bigdetection/mmdet/apis/train.py", line 189, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
self.call_hook('after_train_iter')
File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
getattr(hook, fn_name)(self)
File "/home/tiger/code/bigdetection/mmdet/utils/optimizer.py", line 26, in after_train_iter
scaled_loss.backward()
File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [1, 256, 68, 92]], which is output 0 of ReluBackward0, is at version 4; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3739365) of binary: /home/tiger/miniconda3/envs/cbv2/bin/python
After adding torch.autograd.set_detect_anomaly(True), it shows:
File "tools/train.py", line 188, in <module>
main()
File "tools/train.py", line 184, in main
meta=meta)
File "/home/tiger/code/bigdetection/mmdet/apis/train.py", line 189, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 53, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/home/tiger/code/bigdetection/mmdet/models/detectors/base.py", line 237, in train_step
losses = self(**data)
File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1112, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
output = old_func(*new_args, **new_kwargs)
File "/home/tiger/code/bigdetection/mmdet/models/detectors/base.py", line 171, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/home/tiger/code/bigdetection/mmdet/models/detectors/two_stage.py", line 266, in forward_train
**kwargs)
File "/home/tiger/code/bigdetection/mmdet/models/roi_heads/htc_roi_head.py", line 244, in forward_train
semantic_pred, semantic_feat = self.semantic_head(x)
File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1112, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
output = old_func(*new_args, **new_kwargs)
File "/home/tiger/code/bigdetection/mmdet/models/roi_heads/mask_heads/fused_semantic_head.py", line 86, in forward
x = self.lateral_convs[self.fusion_level](feats[self.fusion_level])
File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1112, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/cnn/bricks/conv_module.py", line 202, in forward
x = self.activate(x)
File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1112, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/nn/modules/activation.py", line 98, in forward
return F.relu(input, inplace=self.inplace)
File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/nn/functional.py", line 1299, in relu
result = torch.relu(input)
(function _print_stack)
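For readers hitting the same message, here is a minimal sketch of this class of error, not a reproduction of the HTC-CBV2 code path and not the fix that was applied in this thread. It assumes only stock PyTorch; the helper name `backward_after_relu` is hypothetical. ReLU keeps its output for the backward pass, so editing that output in place changes its version counter and autograd refuses to use it.

```python
import torch
import torch.nn.functional as F


def backward_after_relu(modify_in_place: bool) -> str:
    """Return "ok" if backward succeeds, otherwise the RuntimeError text.

    Hypothetical helper for illustration only.
    """
    x = torch.randn(2, 3, requires_grad=True)
    y = F.relu(x)  # autograd saves y itself to compute ReLU's gradient
    if modify_in_place:
        y += 1.0  # in-place edit: bumps the version counter of the saved tensor
        out = y
    else:
        out = y + 1.0  # out-of-place: y is left untouched
    try:
        out.sum().backward()
    except RuntimeError as err:
        return "RuntimeError: " + str(err)
    return "ok"


if __name__ == "__main__":
    print("out-of-place:", backward_after_relu(False))
    print("in-place:", backward_after_relu(True))
```

In a real model the in-place edit usually sits somewhere after the activation named in the anomaly-detection trace, so the trace above points at where the tensor was created (the ReLU in the lateral conv), not necessarily at the line that modified it.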
cailk commented
Hi, sorry for the late reply. This config runs in our environment without errors. Could you tell me which versions of MMCV and MMDet you are using?
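One way to collect the requested versions is sketched below. It assumes each package exposes a `__version__` attribute on its top-level module, and the helper name `report_versions` is hypothetical; packages that are missing are reported rather than raising.

```python
import importlib


def report_versions(module_names):
    """Map each module name to its __version__, or a placeholder string."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
            continue
        # Fall back to "unknown" if the module has no __version__ attribute.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    for name, version in report_versions(["torch", "mmcv", "mmdet"]).items():
        print(f"{name}: {version}")
```

Running this inside the same conda environment used for training gives the numbers to paste into the issue.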
liming-ai commented
Hi, thanks for your reply. I have fixed this issue.