amazon-science/bigdetection

Error when training HTC-CBV2

liming-ai opened this issue · 3 comments

Hi @bryanyzhu @cailk

Thanks for your contribution. I created an environment following the README and tried to train the HTC-CBV2 config.

However, an error was raised:

Traceback (most recent call last):
  File "tools/train.py", line 188, in <module>
    main()
  File "tools/train.py", line 184, in main
    meta=meta)
  File "/home/tiger/code/bigdetection/mmdet/apis/train.py", line 189, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/tiger/code/bigdetection/mmdet/utils/optimizer.py", line 26, in after_train_iter
    scaled_loss.backward()
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [1, 256, 68, 92]], which is output 0 of ReluBackward0, is at version 4; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3739365) of binary: /home/tiger/miniconda3/envs/cbv2/bin/python

After adding torch.autograd.set_detect_anomaly(True), the traceback shows:

  File "tools/train.py", line 188, in <module>
    main()
  File "tools/train.py", line 184, in main
    meta=meta)
  File "/home/tiger/code/bigdetection/mmdet/apis/train.py", line 189, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 53, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/tiger/code/bigdetection/mmdet/models/detectors/base.py", line 237, in train_step
    losses = self(**data)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1112, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/tiger/code/bigdetection/mmdet/models/detectors/base.py", line 171, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home/tiger/code/bigdetection/mmdet/models/detectors/two_stage.py", line 266, in forward_train
    **kwargs)
  File "/home/tiger/code/bigdetection/mmdet/models/roi_heads/htc_roi_head.py", line 244, in forward_train
    semantic_pred, semantic_feat = self.semantic_head(x)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1112, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/tiger/code/bigdetection/mmdet/models/roi_heads/mask_heads/fused_semantic_head.py", line 86, in forward
    x = self.lateral_convs[self.fusion_level](feats[self.fusion_level])
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1112, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/mmcv/cnn/bricks/conv_module.py", line 202, in forward
    x = self.activate(x)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1112, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/nn/modules/activation.py", line 98, in forward
    return F.relu(input, inplace=self.inplace)
  File "/home/tiger/miniconda3/envs/cbv2/lib/python3.7/site-packages/torch/nn/functional.py", line 1299, in relu
    result = torch.relu(input)
 (function _print_stack)
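
For anyone who lands here with the same message, the following is a minimal sketch of this class of error outside the repo, assuming only that PyTorch is installed. It is not the bigdetection code path, and the thread does not say what the actual cause or fix was. The helper name `backward_succeeds` and the tensor shape are made up for illustration. It shows that a ReLU output saved for the backward pass can no longer be used once a later in-place op overwrites it, which is the situation the "output 0 of ReluBackward0 ... expected version 0" message describes.

```python
import torch
import torch.nn as nn


def backward_succeeds(inplace: bool) -> bool:
    """Hypothetical helper: True if backward() runs cleanly for the given ReLU mode."""
    x = torch.randn(1, 4, 8, 8, requires_grad=True)
    feat = torch.relu(x)            # autograd saves this output for its backward pass
    act = nn.ReLU(inplace=inplace)  # a second activation applied to the same tensor
    out = act(feat)                 # inplace=True overwrites `feat` and bumps its version
    try:
        out.sum().backward()
        return True
    except RuntimeError as err:
        # With inplace=True this reports that a tensor needed for gradient
        # computation was modified by an in-place operation.
        print(f"inplace={inplace}: {err}")
        return False


if __name__ == "__main__":
    print(backward_succeeds(inplace=True))
    print(backward_succeeds(inplace=False))
```

In this sketch the out-of-place variant leaves `feat` untouched, so the saved tensor is still at the version autograd expects. Whether something similar happens around the ReLU shown in the traceback above likely depends on the installed MMCV, MMDet and PyTorch versions.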

@cailk is investigating this and will update here soon.

cailk commented

Hi, sorry for the late reply. This config runs in our environment without errors. Could you please tell me which versions of MMCV and MMDet you are using?

Hi, thanks for your reply. I have fixed this issue.