Sense-X/UniFormer

a basic question about pretrained model

go-ahead-maker opened this issue · 4 comments

Hi authors, nice work!
I got a basic question about the checkpoint file or pretrained weight
For image classification, the saved checkpoint file is a dict:

```python
utils.save_on_master({
    'model': model_without_ddp.state_dict(),
    'optimizer': optimizer.state_dict(),
    'lr_scheduler': lr_scheduler.state_dict(),
    'epoch': epoch,
    'model_ema': get_state_dict(model_ema),
    'scaler': loss_scaler.state_dict(),
    'args': args,
    'max_accuracy': max_accuracy,
}, checkpoint_path)
```
When using model_ema, the checkpoint dict contains both model and model_ema. I wonder which of the two is used when loading the model for a downstream task. I checked the load_checkpoint function used in UniFormer (following Swin), and it seems to load model. So, will model_ema not be used?
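Since the checkpoint is just an ordinary dict, the key selection can be illustrated with plain Python (a minimal sketch with dummy stand-in values; in practice the dict comes from `torch.load(checkpoint_path, map_location="cpu")`):

```python
# Dummy checkpoint mirroring the layout produced by save_on_master above;
# the lists stand in for real tensor state_dicts.
checkpoint = {
    "model": {"head.weight": [0.1, 0.2]},       # weights of the live model
    "model_ema": {"head.weight": [0.3, 0.4]},   # EMA copy, saved but unused downstream
    "epoch": 299,
}

# load_checkpoint (following Swin) reads only the 'model' entry:
state_dict = checkpoint["model"]
print(sorted(checkpoint.keys()))  # ['epoch', 'model', 'model_ema']
```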

Yes. Actually, we follow the code in DeiT and do not test model_ema in our codebase. In the downstream tasks, we only use the model instead of model_ema.

In my later experiments, model_ema also does not help the current models. It may be more suitable for lightweight models (FLOPs < 1G), for which it is a common training technique.

Thanks for your valuable reply!
I want to ask one more question. If I enable model_ema during training, will it affect the training of the original model? Or does EMA copy the original model's parameters and update the copy independently? From reading the EMA code, it seems to first copy the original model and then update the copy after each backward pass of the original model. So I suppose model_ema does not affect the original model's weights.
Thank you again for your patience with this basic question!

Sorry for the late reply.
Yes, model_ema does not affect the model parameters. It works like ensembling models via exponentially weighted averaging of the parameters.
Thus, it does not affect the original model's performance. But the EMA model usually performs better during training, which is why EMA models are often used as teachers in contrastive learning.
In my experiments, though it works better in the middle epochs, it achieves a similar result in the end.
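The one-way flow described above can be sketched in a few lines (a toy illustration modeled on timm-style EMA, with parameters as plain floats; the decay of 0.5 is unrealistically small just to make the averaging visible, real EMA uses something like 0.9999):

```python
decay = 0.5  # illustrative only; real EMA decay is close to 1.0

model_params = {"w": 1.0}
ema_params = dict(model_params)  # step 1: EMA starts as a copy of the model

def optimizer_step(params):
    # Stand-in for backward + optimizer update on the live model.
    params["w"] -= 0.25

def ema_update(ema, model, decay):
    # Step 2: after each optimizer step, blend the live weights into the
    # EMA copy. Note: only `ema` is written to; `model` is read-only here.
    for k in ema:
        ema[k] = decay * ema[k] + (1.0 - decay) * model[k]

for _ in range(3):
    optimizer_step(model_params)
    ema_update(ema_params, model_params, decay)

print(model_params["w"])  # 0.25 -- exactly as if EMA were disabled
print(ema_params["w"])    # 0.46875 -- lags behind the live weights
```

Because the update only ever writes into the EMA copy, disabling or enabling model_ema leaves the original training trajectory unchanged.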

Thanks a lot for your detailed explanations; they really helped me.
Looking forward to your future works~