Megvii-BaseDetection/BEVDepth

Input type different & Memory usage during training

hutchinsonian opened this issue · 2 comments

Thanks for this work!

When I train the model with python [EXP_PATH] --amp_backend native -b 8 --gpus 8,
where [EXP_PATH] is bevdepth/exps/nuscenes/mv/bev_depth_lss_r50_256x704_128x128_20e_cbgs_2key_da_ema.py,
I get the following error:

Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same
  File "BEVDepth/bevdepth/layers/backbones/base_lss_fpn.py", line 310, in forward
    x = self.reduce_conv(x)
  File "BEVDepth/bevdepth/layers/backbones/base_lss_fpn.py", line 402, in _forward_voxel_net
    self.depth_aggregation_net(img_feat_with_depth).view(
  File "BEVDepth/bevdepth/layers/backbones/base_lss_fpn.py", line 533, in _forward_single_sweep
    img_feat_with_depth = self._forward_voxel_net(img_feat_with_depth)
  File "BEVDepth/bevdepth/layers/backbones/base_lss_fpn.py", line 593, in forward
    key_frame_res = self._forward_single_sweep(
  File "BEVDepth/bevdepth/models/base_bev_depth.py", line 56, in forward
    x, depth_pred = self.backbone(x,
  File "BEVDepth/bevdepth/exps/nuscenes/base_exp.py", line 239, in forward
    return self.model(sweep_imgs, mats)
  File "BEVDepth/bevdepth/exps/nuscenes/base_exp.py", line 249, in training_step
    preds, depth_preds = self(sweep_imgs, mats)
  File "BEVDepth/bevdepth/exps/base_cli.py", line 78, in run_cli
    trainer.fit(model)
  File "BEVDepth/bevdepth/exps/nuscenes/mv/bev_depth_lss_r50_256x704_128x128_20e_cbgs_2key_da_ema.py", line 29, in <module>
    run_cli(BEVDepthLightningModel,
RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

So I tried adding the two lines marked "Add" below:

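import torch
from torch.cuda.amp import autocast

@autocast(False)  # autocast is disabled inside this forward
def forward(self, x):
    x = x.to(torch.float32)  # Add: match the float32 weights
    x = self.reduce_conv(x)
    x = self.conv(x) + x
    x = self.out_conv(x)
    x = x.to(torch.float16)  # Add: cast back to half for the caller
    return x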

The above error disappeared, but the memory usage is very high. When I set batch_size=1, 22G memory is still used.
Is this correct?
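
For anyone trying to reproduce this without the full model, here is a minimal sketch of the same pattern. It assumes the mismatch comes from a half-precision tensor reaching a block that runs with autocast disabled while its weights are still float32. The `Block` class and tensor shapes are hypothetical stand-ins, not BEVDepth code, and it uses CPU autocast with bfloat16 so it runs without a GPU.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Hypothetical stand-in for a block that opts out of autocast."""

    def __init__(self):
        super().__init__()
        self.reduce_conv = nn.Conv2d(4, 4, kernel_size=3, padding=1)  # float32 weights

    def forward(self, x, cast_input=True):
        # Autocast is disabled here, so the conv runs with its float32 weights.
        with torch.autocast(device_type="cpu", enabled=False):
            if cast_input:
                x = x.float()  # make the input match the weights
            return self.reduce_conv(x)


block = Block()
low = torch.randn(1, 4, 8, 8).to(torch.bfloat16)  # pretend an upstream autocast op produced this

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    try:
        block(low, cast_input=False)  # low-precision input meets float32 weights
        mismatch_raised = False
    except RuntimeError:
        mismatch_raised = True
    out = block(low)  # with the cast, the block runs in float32

print(mismatch_raised, out.dtype)
```

Casting the input to float32 removes the error, but everything inside the block then runs and stores activations in full precision, which would be consistent with higher memory use than a pure half-precision path.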

I met a similar issue and worked around it by following https://discuss.pytorch.org/t/runtimeerror-input-type-torch-cuda-floattensor-and-weight-type-torch-halftensor-should-be-the-same/104312/5

i.e., wrapping the forward pass in autocast, like this:

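from torch.cuda.amp import autocast

# `model` and `tensor` are your own module and input batch
with autocast():
    outputs = model(tensor)  # call the module rather than .forward() so hooks still run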
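
A self-contained sketch of why this workaround helps, under the assumption that the failing call receives a low-precision input while its weights are float32. The model here is a single hypothetical conv layer, and CPU autocast with bfloat16 stands in for CUDA autocast with float16 so the snippet runs anywhere.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(4, 4, kernel_size=3, padding=1)  # hypothetical model, float32 weights
tensor = torch.randn(1, 4, 8, 8).to(torch.bfloat16)  # low-precision input

# Outside autocast, the dtype mismatch raises a RuntimeError.
try:
    model(tensor)
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

# Inside autocast, eligible ops cast their inputs and weights to a common dtype.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    outputs = model(tensor)

print(mismatch_raised, tuple(outputs.shape))
```

Note that this only applies where autocast is actually enabled; a forward decorated with autocast(False) still needs its inputs cast by hand.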

May I ask what your GPU memory usage is? Is it still 22G with batch_size=1?
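
To compare numbers, it may help to measure peak allocated memory from inside PyTorch rather than reading nvidia-smi, which also counts the CUDA context and memory cached by the allocator. A rough sketch follows; `peak_memory_gib` and `step_fn` are hypothetical names, and `step_fn` would be whatever runs one training step in your setup.

```python
import torch


def peak_memory_gib(step_fn):
    """Run step_fn once; return peak allocated CUDA memory in GiB, or None without CUDA."""
    if not torch.cuda.is_available():
        step_fn()
        return None
    torch.cuda.reset_peak_memory_stats()  # start counting from this point
    step_fn()
    torch.cuda.synchronize()  # wait for queued kernels to finish
    return torch.cuda.max_memory_allocated() / 1024 ** 3


# Placeholder step; replace with a function that runs one forward/backward pass.
peak = peak_memory_gib(lambda: None)
print(peak)
```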