pytorch/vision

Change default value of eps in FrozenBatchNorm to match BatchNorm

juyunsang opened this issue · 18 comments

❓ Questions and Help

Hello
A "Loss is nan" error occurs when I train Faster R-CNN with a ResNeXt101 backbone.
My code is as follows:

from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

backbone = resnet_fpn_backbone('resnext101_32x8d', pretrained=True)
model = FasterRCNN(backbone, num_classes)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

error message

Epoch: [0]  [   0/7208]  eta: 1:27:42  lr: 0.000040  loss: 40613806080.0000 (40613806080.0000)  loss_box_reg: 7979147264.0000 (7979147264.0000)  loss_classifier: 11993160704.0000 (11993160704.0000)  loss_objectness: 9486380032.0000 (9486380032.0000)  loss_rpn_box_reg: 11155118080.0000 (11155118080.0000)  time: 0.7301  data: 0.4106  max mem: 1241
Loss is nan, stopping training

When I change the backbone to resnet50 or resnet152, no error occurs.


Hi @juyunsang

as our template states:

Please note that this issue tracker is not a help form and this issue will be closed. [...] Our primary means of support is our discussion forum.


Without knowing your data it's hard to tell what is going wrong. I'm assuming your data is not corrupt, since the other models work, so this might be a hyper-parameter problem. The loss in the first step is fairly large (~40e9). As a first step, I would reduce the learning rate and see if that already solves the problem.

@juyunsang note that we do not provide pre-trained weights for detection models with the resnext101 backbone, which might explain the issue you are facing. You might be finetuning a detection model with ResNet50 pre-trained on COCO, while training it from scratch with ResNeXt101.

@fmassa
Thank you for reply
I don't understand what you mean by:
You might be finetuning a detection model with ResNet50 pre-trained on COCO, while training it from scratch with ResNeXt101
Are you saying that I should use the code below?

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

Can you show me the code for how I can use the resnext101 backbone?

@juyunsang My understanding was that you were using

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

for ResNet50, and

backbone = resnet_fpn_backbone('resnext101_32x8d', pretrained=True)
model = FasterRCNN(backbone, num_classes)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

for ResNeXt101. Is that the case?

@fmassa
Yes!
It works well with the code below.

backbone = resnet_fpn_backbone('resnet50', pretrained=True)
model = FasterRCNN(backbone, num_classes)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

However, changing the backbone_name argument of resnet_fpn_backbone to 'resnext101_32x8d' results in a "Loss is nan" error.

Thanks for confirming.

My first bet would be that it's an issue with the FPN, because we forgot to run the weight initialization, as discussed in #2326

Could you check if implementing that fix could solve the issue for you?
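For reference, here is a rough stand-alone sketch of what that fix amounts to: run the init over every conv nested anywhere inside the FPN, not only its direct children. The helper name and usage below are hypothetical, for illustration only; the actual change lives inside the FPN's own initialization code.

from torch import nn

# Hypothetical helper mirroring the fix discussed in #2326: initialize every
# conv layer reachable via .modules(), not just the direct children.
def reinit_fpn_convs(fpn: nn.Module) -> None:
    for m in fpn.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_uniform_(m.weight, a=1)
            nn.init.constant_(m.bias, 0)

# Apply it to the FPN part only, so the pretrained body stays untouched:
# reinit_fpn_convs(backbone.fpn)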

@fmassa
Thank you for reply
I changed self.children to self.modules in the code below,
but the error still occurred.

# initialize parameters now to avoid modifying the initialization of top_blocks
for m in self.children():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_uniform_(m.weight, a=1)
        nn.init.constant_(m.bias, 0)

Ok, I think I know what's going on. The resnext implementation might have weights which are zero in the batch norm, and we might need to set eps in

eps: float = 0.,
to be 1e-5 to match the default values in PyTorch.

This can be done by changing the norm_layer in

def resnet_fpn_backbone(backbone_name, pretrained, norm_layer=misc_nn_ops.FrozenBatchNorm2d, trainable_layers=3):
to be a lambda with eps set to 1e-5.
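A minimal sketch of that suggestion (assuming the resnet_fpn_backbone signature quoted above, and that FrozenBatchNorm2d already exposes the eps argument):

from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
from torchvision.ops.misc import FrozenBatchNorm2d

# build the frozen BN layers with a non-zero eps, matching BatchNorm2d's default
norm_layer = lambda channels: FrozenBatchNorm2d(channels, eps=1e-5)
backbone = resnet_fpn_backbone('resnext101_32x8d', pretrained=True, norm_layer=norm_layer)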

Can you try this out and report back?

The same thing happened to @szagoruyko with WideResNets. Changing the default value of eps in FrozenBatchNorm should be considered, but it will change the results for pre-trained models, so we should check how much it affects performance.

I was having a somewhat similar issue and this seems to fix it.
I was trying to load the weights from a pre-trained torchvision.models.resnet50 model into a torchvision.models.detection.backbone_utils.resnet_fpn_backbone for use in a FasterRCNN model. Renaming the state dict keys (dict([('.'.join(['body'] + k.split('.')[1:]),v) for k,v in checkpoint["state_dict"].items()])) and loading it with strict=False worked, but I was getting Loss is nan, stopping training messages during training.
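For readability, the renaming trick above expands to something like this (assuming a checkpoint dict with a "state_dict" entry, as described):

# replace the leading key component with "body" so the keys match the
# backbone returned by resnet_fpn_backbone, then load non-strictly
renamed = {
    '.'.join(['body'] + key.split('.')[1:]): value
    for key, value in checkpoint["state_dict"].items()
}
backbone.load_state_dict(renamed, strict=False)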

Using functools.partial as below created a model that got through training.

from functools import partial
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
from torchvision.ops.misc import FrozenBatchNorm2d

FBN = partial(FrozenBatchNorm2d, eps=1E-5)
backbone = resnet_fpn_backbone('resnet50', pretrained=True, norm_layer=FBN, trainable_layers=3).cuda()

I think it might be time to think about changing the default in FrozenBatchNorm2d to more closely align with what BatchNorm in PyTorch does; we should just check that this change doesn't affect the performance of the currently trained models in any way.

frgfm commented

Hi @fmassa, I just opened #2852 to tackle this!

Hi @frgfm

We still need to make sure that the current pre-trained models still give correct results with the new value.

@datumbox is going to be working on ensuring that this is the case.

frgfm commented

No worries @fmassa! Should I split the PR (one adding eps to __repr__ to avoid silent differences, and the other changing the default eps value)?

Yes please, if you could only make the __repr__ changes in your PR it would be great.

@datumbox will be taking care of switching the default eps value in a follow-up PR

frgfm commented

@fmassa done!

frgfm commented

Just FYI @datumbox @fmassa, this is all the more beneficial when you're trying to use RCNN models in torch.float16.
I just tried on my end, and when the running_var of some BN layers gets converted to half, it drops to zero by underflow. With eps=0 the model yields many NaNs, while with this change it yields a valid output :)
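A minimal illustration of that underflow (not the torchvision code itself, just roughly the scale computation a frozen BN performs):

import torch

# a variance that is fine in float32 underflows to zero in float16
running_var = torch.tensor([1e-10]).half()
print(running_var)                          # tensor([0.], dtype=torch.float16)

x = torch.randn(4).half()
scale_eps0 = (running_var + 0.0).rsqrt()    # rsqrt(0) -> inf
scale_eps5 = (running_var + 1e-5).rsqrt()   # finite, roughly 316
print(x * scale_eps0)                       # non-finite values
print(x * scale_eps5)                       # valid output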

@frgfm would you say that we should make it a BC-breaking change and revert the back-compatibility fix in #2940, if the benefits of having a non-zero eps outweigh the downsides of breaking BC?

frgfm commented

@fmassa I would consider the following points:

  • Prevents underflow, and thus NaNs, in FrozenBN's forward pass in float16
  • Training an RCNN model with a pretrained backbone will start from a closer reproduction of the backbone's state from its image-classification training
  • Now a downside: pretrained RCNN models will have slightly different (likely worse) performance in float32

Seeing the results of #2933, I'm less concerned than I used to be about this last inconvenience. So I would argue the benefits do outweigh the BC downsides. But I may be missing other aspects I'm not aware of 🤷‍♂️