A basic question about the pretrained model
go-ahead-maker opened this issue · 4 comments
Hi authors, nice work!
I have a basic question about the checkpoint file / pretrained weights.
For image classification, the saved checkpoint is a dict like:

```python
utils.save_on_master({
    'model': model_without_ddp.state_dict(),
    'optimizer': optimizer.state_dict(),
    'lr_scheduler': lr_scheduler.state_dict(),
    'epoch': epoch,
    'model_ema': get_state_dict(model_ema),
    'scaler': loss_scaler.state_dict(),
    'args': args,
    'max_accuracy': max_accuracy,
}, checkpoint_path)
```
When `model_ema` is used, the checkpoint dict contains both `model` and `model_ema`. I wonder which of these two is used when loading the model into a downstream task. I checked the `load_checkpoint` function used in UniFormer (following Swin), and it seems to pick `model` to load. So, will `model_ema` not be used?
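For reference, here is a minimal sketch of how I understand the downstream loading, assuming the checkpoint layout above (the `load_pretrained` helper is just my own illustration, not the actual `load_checkpoint` in the repo):

```python
import torch

def load_pretrained(model: torch.nn.Module, checkpoint_path: str, use_ema: bool = False):
    """Load classification weights into a downstream backbone.

    The checkpoint is the dict saved by utils.save_on_master above:
    'model' holds the raw weights, 'model_ema' the EMA copy.
    """
    checkpoint = torch.load(checkpoint_path, map_location='cpu')
    key = 'model_ema' if use_ema else 'model'  # Swin-style loading picks 'model'
    state_dict = checkpoint[key]
    # strict=False because downstream heads (e.g. the classifier) usually differ
    msg = model.load_state_dict(state_dict, strict=False)
    print('missing:', msg.missing_keys, 'unexpected:', msg.unexpected_keys)
    return model
```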
Yes. Actually, we follow the code in DeiT and did not test `model_ema` in our codebase. In the downstream tasks, we only use `model` instead of `model_ema`.
In my later experiments, `model_ema` also did not help for the current model. It may be more suitable for lightweight models (FLOPs < 1G), where EMA is a common training technique.
Thanks for your valuable reply!
I want to ask you one more question. If I enable `model_ema` during training, will it affect the training of the original model? Or does EMA copy the original model's parameters and update them independently? From reading the EMA code, it seems to first copy the original model and then update the copy after the original model's backward pass. So I suppose `model_ema` does not affect the original model's weights.
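To check my understanding, here is a minimal sketch of the EMA logic as I read it (a hypothetical `SimpleModelEma`, not the exact timm `ModelEma` code):

```python
import copy
import torch

class SimpleModelEma:
    """Keep an exponential moving average of a model's weights.

    The EMA copy is created once from the training model and then updated
    after each optimizer step; it never writes back into the training model.
    """
    def __init__(self, model: torch.nn.Module, decay: float = 0.9999):
        self.ema = copy.deepcopy(model).eval()  # independent copy of the weights
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # ema_w = decay * ema_w + (1 - decay) * model_w
        for ema_v, v in zip(self.ema.state_dict().values(), model.state_dict().values()):
            if ema_v.dtype.is_floating_point:
                ema_v.mul_(self.decay).add_(v, alpha=1.0 - self.decay)
            else:
                ema_v.copy_(v)  # e.g. BatchNorm's num_batches_tracked

# In the training loop, after loss.backward() and optimizer.step():
#     ema.update(model)   # only the EMA copy changes; the live model is untouched
```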
Thank you again for your patience with this basic issue!
Sorry for the late reply.
Yes, `model_ema` will not affect the model parameters. It works like ensembling models via a weighted average of their parameters, so it will not affect the original model's performance. However, the EMA model usually performs better during training, which is why EMA models are often used as teachers in contrastive learning.
In my experiments, though it works better in the middle epochs, it achieves a similar result in the end.
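If you want to compare them yourself, a rough sketch assuming the checkpoint layout above (`build_model`, `evaluate`, and `val_loader` are placeholders for your own constructor, evaluation loop, and dataloader):

```python
import torch

checkpoint = torch.load('checkpoint.pth', map_location='cpu')

for key in ('model', 'model_ema'):
    backbone = build_model()              # placeholder: your own model constructor
    backbone.load_state_dict(checkpoint[key])
    acc = evaluate(backbone, val_loader)  # placeholder: your own evaluation loop
    print(f'{key}: top-1 = {acc:.2f}')
```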
I really appreciate your detailed explanations; they help me a lot.
Looking forward to your future works~