arthurdouillard/CVPR2021_PLOP

Model weights become NaN in step 1 on VOC

HieuPhan33 opened this issue · 10 comments

Hi,
Thanks for your contribution.
I have a problem when training PLOP on the VOC dataset with the 15-5 setting.
After successfully training the model at step 0, I trained the model at the next step.
The model becomes NaN after a few training iterations, even in the first epoch.

Since the model M0 can be trained without any problem, I suspect that distilling the knowledge of M0 into M1 might lead to a divergence problem for M1.

Following the paper, I used lr=0.01 for M0 and lr=0.001 for M1.
Here is the setting I used.
Step 0:
python -m torch.distributed.launch --nproc_per_node=2 run.py --data_root /media/hieu/DATA/semantic_segmentation/PascalVOC12 --batch_size 12 --dataset voc --name PLOP --task 15-5 --overlap --step 0 --lr 0.01 --epochs 30 --method FT --pod local --pod_factor 0.01 --pod_logits --pseudo entropy --threshold 0.001 --classif_adaptive_factor --init_balanced

Step 1:
python -m torch.distributed.launch --nproc_per_node=2 run.py --data_root /media/hieu/DATA/semantic_segmentation/PascalVOC12 --batch_size 12 --dataset voc --name PLOP --task 15-5 --overlap --step 1 --lr 0.001 --epochs 30 --method FT --pod local --pod_factor 0.01 --pod_logits --pseudo entropy --threshold 0.001 --classif_adaptive_factor --init_balanced

Distillation is disabled for M0, so don't worry about that.

You can get all the correct hyperparameters if you use --method PLOP.
But if you want to set them by hand, don't forget that Local POD is applied to the logits with a different factor, through an option called --pod_options:

https://github.com/arthurdouillard/CVPR2021_PLOP/blob/main/argparser.py#L53

As you can see, the factor is 0.0005 instead of the 0.01 used for the backbone. You probably get NaN because the loss is too large and causes a gradient explosion.
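If it helps, here is a minimal sketch of the idea (the `local_pod` helper and the list-of-features interface are illustrative, not the repository's actual code): the same Local POD term is applied to the backbone feature maps with factor 0.01, but to the logits with the much smaller 0.0005.

```python
# Simplified sketch of Local POD distillation with per-level factors
# (toy placeholder, not the repository's implementation).
import torch
import torch.nn.functional as F

def local_pod(old_feat: torch.Tensor, new_feat: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for a Local POD term: L2 distance between spatially
    pooled statistics of the old and new feature maps."""
    return F.mse_loss(new_feat.mean(dim=(2, 3)), old_feat.mean(dim=(2, 3)))

def distillation_loss(old_outputs, new_outputs,
                      backbone_factor=0.01, logits_factor=0.0005):
    # old_outputs / new_outputs: lists of 4D feature maps, last entry = logits.
    loss = 0.0
    for old_feat, new_feat in zip(old_outputs[:-1], new_outputs[:-1]):
        loss = loss + backbone_factor * local_pod(old_feat, new_feat)
    # The logits get a much smaller factor (0.0005 instead of 0.01).
    loss = loss + logits_factor * local_pod(old_outputs[-1], new_outputs[-1])
    return loss
```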

Hi Douillard,
thanks for your quick response.

I still have the problem when using --method PLOP. I inspected the program: it does use pod_factor=0.01 for the early layers and pod_factor=0.0005 for the last layer.

Here is the command:
python -m torch.distributed.launch --nproc_per_node=2 run.py --data_root /media/hieu/DATA/semantic_segmentation/PascalVOC12 --batch_size 12 --dataset voc --name PLOP --task 15-5 --step 1 --lr 0.001 --epochs 30 --method PLOP --pod local --pod_factor 0.01 --pod_logits --pseudo entropy --threshold 0.001 --classif_adaptive_factor --init_balanced

I even tried lowering the lr and pod_factor to 0.0001 and 0.0005, but it still results in a gradient explosion. Here is the shortest version of the command:
python run.py --dataset voc --task 15-5 --step 1 --lr 0.0001 --method PLOP --pod local --pod_factor 0.0005

It's curious, because other researchers have already managed to run those scripts without problems (actually, they even got slightly better results).

I'm rerunning multiple experiments right now and will come back to you.

I see that you don't use half precision (--opt_level O1). Could that be the problem? I know that apex (the library behind half precision) rescales the loss.
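For reference, this is roughly how the apex O1 setup looks (the exact integration in run.py may differ; the tiny model, loader and loss here are only stand-ins):

```python
# Rough sketch of apex O1 mixed precision in a training step.
import torch
import torch.nn as nn
from apex import amp

model = nn.Conv2d(3, 21, kernel_size=1).cuda()   # stand-in for the segmentation model
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss(ignore_index=255)

# O1 patches selected ops to FP16 and maintains a dynamic loss scale.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

images = torch.randn(2, 3, 64, 64).cuda()
labels = torch.randint(0, 21, (2, 64, 64)).cuda()

optimizer.zero_grad()
loss = criterion(model(images), labels)
# scale_loss rescales the loss so FP16 gradients don't underflow;
# on overflow, apex lowers the scale and skips the step.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```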

Yes, I see that no one else is complaining about this NaN problem in the GitHub issues. It's a strange problem, I admit.

I even tried setting pod_factor=0 to remove the distillation loss, but the model still gets NaN exactly at iteration 21 of the first epoch.

I inspected the loss and model parameters at iteration 20. The epoch loss is only 28, which does not indicate a gradient explosion. Most model weights are still small (less than 1.0).

At iteration 21, however, all model weights suddenly become NaN. To be honest, I have never encountered such a strange problem with model training before.
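A quick diagnostic along these lines could pinpoint the exact iteration and parameter (just a sketch, not code from the repository):

```python
# Hypothetical check to run after each backward/step: flag the first
# parameter or gradient that becomes non-finite.
import torch

def check_finite(model: torch.nn.Module, iteration: int) -> None:
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            raise RuntimeError(f"Non-finite weights in {name} at iteration {iteration}")
        if param.grad is not None and not torch.isfinite(param.grad).all():
            raise RuntimeError(f"Non-finite gradients in {name} at iteration {iteration}")
```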

I really appreciate your quick feedback and your hands-on help with my problem to find a solution together.

I will try using --opt_level O1 and see how it goes.
Many thanks.

Maybe your data is corrupted at some point? You may want to add a check at line 216 in train.py, like assert torch.isfinite(images).all() and torch.isfinite(labels).all().
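A helper like this (hypothetical name, not in the repo) could be dropped in right where the batch comes out of the loader:

```python
import torch

def check_batch_finite(images: torch.Tensor, labels: torch.Tensor, step: int) -> None:
    # Fails immediately if the loader ever yields corrupted (NaN/inf) data.
    assert torch.isfinite(images).all(), f"Non-finite values in images at step {step}"
    assert torch.isfinite(labels).all(), f"Non-finite values in labels at step {step}"
```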

Nevertheless, I've just launched experiments on my side and will keep you updated if I also encounter NaN (which I doubt).

Hi Douillard,
I tried mixed-precision training. Yes, the gradients are overflowing at iteration 21, and the loss scale is reduced to 32768.
Unlike FP32 training, it just reduces the loss scale and continues training.
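For anyone reading later, this is roughly what a dynamic loss scaler does conceptually (a simplified illustration, not apex's actual implementation or constants):

```python
# On gradient overflow: skip the optimizer step and halve the scale
# (e.g. down to 32768 as seen above); after a stable stretch, grow it back.
class DynamicLossScaler:
    def __init__(self, init_scale: float = 2.0 ** 16, growth_interval: int = 2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_overflow: bool) -> bool:
        if found_overflow:
            self.scale /= 2.0        # back off
            self.good_steps = 0
            return False             # skip this optimizer step
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            self.scale *= 2.0        # grow again after many clean steps
        return True                  # apply the optimizer step
```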

Let's see how it goes. Will keep you updated, thanks.

Hum, I admit that I've always run my code with mixed precision, so that may be the reason. It's a bit ugly I guess, but I'm happy to know you don't have the NaN problem anymore.

I'm still available if you have any more issues. Good luck in your research!

Thanks Douillard, wish you all the best also.

Cheers.

I've re-run 15-5 and 15-1 with this repository's code and scripts; results below, respectively:

15-5:
Final Mean IoU 69.65
Average Mean IoU 75.2
Mean IoU first 75.8
Mean IoU last 49.96

15-1:
Final Mean IoU 56.17
Average Mean IoU 67.55
Mean IoU first 66.62
Mean IoU last 22.73

So I think the problem was coming from the half precision.

Hi @arthurdouillard,

One problem I found when running PLOP with mixed precision was the classif_adaptive_factor.
Sometimes, with mixed precision, the denominator (den) becomes zero, producing a NaN value.
Adding a small eps (1e-6) to den solved the issue.
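For reference, the fix looks roughly like this (only `den` is named above; the function and the other names are illustrative):

```python
import torch

def adaptive_factor(num: torch.Tensor, den: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Without the eps, den can underflow to 0 in FP16 and the division produces inf/NaN.
    return num / (den + eps)
```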

Hope it helps.