M-Nauta/PIPNet

No learning during training

zimmermax98 opened this issue · 2 comments

Hey,

I tried training PIPNet on the CUB dataset. I followed the README.md and used the default arguments. The only changes I made are reducing batch size and learning rate, since I only have 12 GB of VRAM available. Specifically, I set

```python
args.batch_size = 8
args.batch_size_pretrain = 16
args.lr = 0.05 / 8
args.lr_block = 0.0005 / 8
args.lr_net = 0.0005 / 8
```

which is a reduction by a factor of 8 for all arguments.
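Written out explicitly, the scaling amounts to dividing every default by the same factor. The default values of 64 and 128 below are inferred from the factor-8 reduction described above; the authoritative defaults live in the repository's argument parser:

```python
# Linear scaling of batch sizes and learning rates by a common factor.
# Defaults here are inferred from the factor-8 reduction, not read from
# the repo; check the argument parser for the actual values.
SCALE = 8
defaults = {
    "batch_size": 64,
    "batch_size_pretrain": 128,
    "lr": 0.05,
    "lr_block": 0.0005,
    "lr_net": 0.0005,
}
scaled = {k: (v // SCALE if isinstance(v, int) else v / SCALE)
          for k, v in defaults.items()}
print(scaled)  # batch_size 8, batch_size_pretrain 16, lr 0.00625, ...
```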

The rest of the code is unchanged, and I pulled the recent version with the new CUB preprocessing. The code executes without any errors.

Looking at log_epoch_overview.csv, I can see the test_top1_acc dropping from 0.023 in epoch 1 to 0.005 in epoch 12 and stagnating afterward. mean_train_loss_during_epoch stays constant at around 11 +/- 1 for all epochs.

I also tested the code on different GPUs and tried different learning rates, but the problem remains. Since I only applied the changes mentioned above, I'm confused as to why the model isn't learning at all.

Do you have any idea what the problem might be here? And can you recommend further ways to debug?

Also, would it be possible to provide a pretrained model (or maybe even the entire run directory for a pretrained model)?

Thanks a lot in advance!

Best
Max

Hi Max,
The model should train as expected when following the recommended hyperparameters. I think your model is not training effectively because of the very small batch size. For example, our tanh loss encourages each prototype to be detected at least once per minibatch; with very small minibatches, that objective can become ineffective. Having a look at the visualized part-prototypes could give you some indication of how well the pretraining is doing.

I would recommend using a GPU that allows bigger minibatches (or multiple GPUs; the code should support that, although I haven't tested it thoroughly). Alternatively, you could try `--net convnext_tiny_13` or freeze more layers of the backbone (as also described in the FAQ section of the README).

Yes, I will release a pretrained CUB model soon.

@zimmermax98 Checkpoint of PIP-Net trained on CUB is available, see the README.