rwightman/efficientdet-pytorch

[BUG] validate.py is not able to validate models for cspresdext50pan and cspdarkdet53m

sadransh opened this issue · 4 comments

Describe the bug
Some models seem to have a problem with the validate.py script. I trained models using cspdarkdet53m and cspresdext50pan. Everything seems fine during training and the models get acceptable results, but when I try to run validate.py on a checkpoint, almost all metrics are 0 (some AR values are around 0.1, and the prediction results in the JSON file all have confidence < 0.07).
Note: with other EfficientDet models everything works fine.

To Reproduce
Steps to reproduce the behavior:

  1. train a model !./distributed_train.sh 3 ../data --model cspdarkdet53m --sync-bn --num-classes 1 --opt fusedadam --mean 0.3238 --std 0.1493 -b 9 --apex-amp --lr .00006 --warmup-epochs 3 --save-images -j 25 --epochs 400
  2. try to validate a checkpoint using !python ./validate.py ../data --split val --apex-amp --checkpoint ./output/train/20210407-195811-cspdarkdet53m/checkpoint-108.pth.tar --model cspdarkdet53m --num-classes 1 --dataset weld --mean 0.3238 --std 0.1493 -b 20

Expected behavior

Expected to see similar numbers as validation while training.

Probable Reason
It seems these models use a different FPN than BiFPN, so determining the correct FPN for them should solve the issue. A quick check is sketched below.
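
As a quick check on this theory, the FPN choice is part of each model's config in the effdet package. The sketch below is my own and assumes the `get_efficientdet_config` helper and its `fpn_name` field behave this way in the installed effdet version; verify against your install.

```python
# Sketch (an assumption, verify against your effdet version): print each model's
# resolved config and look at fpn_name to see which FPN variant it uses.
from effdet import get_efficientdet_config

for name in ("cspdarkdet53m", "cspresdext50pan", "tf_efficientdet_d0"):
    cfg = get_efficientdet_config(name)
    # An fpn_name of None means the default BiFPN; other values select alternate FPN layouts.
    print(name, cfg.fpn_name)
```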

@sadransh if you're loading a training checkpoint and used EMA averaging, you want to use --use-ema, or clean the checkpoint with the clean script so it only has either the EMA or the non-EMA weights.
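
For reference, a minimal sketch of that cleanup (this is not the repo's clean script itself, and it assumes the usual `state_dict` / `state_dict_ema` keys the training script writes; check your checkpoint's keys first):

```python
import torch

# Path taken from the reproduce step above; adjust to your own checkpoint.
ckpt_path = "./output/train/20210407-195811-cspdarkdet53m/checkpoint-108.pth.tar"
checkpoint = torch.load(ckpt_path, map_location="cpu")
print(checkpoint.keys())  # confirm whether both EMA and non-EMA weights are present

use_ema = True  # pick the EMA weights if they exist, otherwise fall back to the regular ones
key = "state_dict_ema" if use_ema and "state_dict_ema" in checkpoint else "state_dict"
# Save a stripped checkpoint containing only the chosen set of weights.
torch.save({"state_dict": checkpoint[key]}, "cleaned-checkpoint.pth.tar")
```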

If that's not the issue, I'm not sure; the model configs should be the same so long as you didn't make custom changes at the script level (they share the same config file). I've stopped using sync-bn because I kept running into problems, especially with train/val differences. I also stopped using apex amp due to issues.

One thing to try is setting the model into train mode in the val script (model.train()). If there is a big improvement (but likely not to the max), you may have run into one of the unsolved sync-bn issues I've had, where the saved BatchNorm running mean/var stats from distributed training with sync-bn are bad.
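
In other words, roughly this change in the eval loop (illustrative names only; in validate.py it amounts to calling model.train() instead of model.eval() before evaluation):

```python
import torch

def evaluate_with_batch_stats(bench, loader, evaluator):
    # Leave the model in train mode so BatchNorm uses per-batch statistics rather than
    # the saved running mean/var, while still disabling gradient tracking.
    bench.train()
    with torch.no_grad():
        for images, targets in loader:
            detections = bench(images, targets)  # adjust the call to match validate.py's bench usage
            evaluator.add_predictions(detections, targets)  # hypothetical evaluator hook
    return evaluator.evaluate()
```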

I first tried model.train() in the validate script; it did not work.
Then I retrained without sync-bn and without apex-amp and the problem was resolved. Not sure which one was the source yet.
Thanks.

@sadransh most likely the sync-bn, as that matches experience I've had with some nets. I had to train these two specific models, and other recent non-efficientdet variants, without sync-bn; for some reason the problem doesn't seem to happen with most efficientdet models. It's pretty confusing, but I have had issues with standard classification convnets and sync-bn for some architectures.
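
If anyone wants to confirm that theory on an affected checkpoint, a quick way (just a sketch of mine, not something wired into the repo) is to dump the BatchNorm running stats and compare a sync-bn run against a non-sync-bn run; wildly out-of-range running_mean/running_var values would point at corrupted stats:

```python
import torch

state = torch.load("checkpoint-108.pth.tar", map_location="cpu")["state_dict"]
for name, t in state.items():
    # BatchNorm buffers end in running_mean / running_var regardless of sync-bn conversion.
    if name.endswith("running_mean") or name.endswith("running_var"):
        print(f"{name}: min={t.min().item():.3g} max={t.max().item():.3g}")
```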

I switched away from apex amp because it conflicts with torchscript, and I've found torchscript useful for speeding up training of these models, so I just use native amp and native DDP now.
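
For completeness, the native combination I mean is just the standard torch pieces (torch.cuda.amp plus DistributedDataParallel). A bare-bones sketch, assuming the process group is already initialized and a DetBenchTrain-style model that returns a loss dict:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def train_one_epoch(model, loader, optimizer, device):
    # Native AMP: autocast for the forward pass, GradScaler for the backward pass.
    scaler = torch.cuda.amp.GradScaler()
    ddp_model = DDP(model.to(device), device_ids=[device.index])
    for images, targets in loader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            output = ddp_model(images.to(device), targets)  # assumes a bench returning {'loss': ...}
            loss = output["loss"]
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```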