Error while loading a pretrained model to resume training.
Firstly, congrats on the paper and great results.
Currently, I face an issue when I resume training with the pretrained model (R50-MS1Mv2) provided by you; when I run the script to train a model from scratch, it works without any issues.
RuntimeError: Error(s) in loading state_dict for Trainer:
size mismatch for head.kernel: copying a param with shape torch.Size([512, 205990]) from checkpoint, the shape in current model is torch.Size([512, 85742]).
Any suggestions on how to fix this? Thank you.
python main.py \
--data_root /mnt/largedisk/Datasets/ \
--train_data_path faces_emore \
--val_data_path faces_emore \
--prefix ir50_ms1mv2_adaface \
--use_wandb \
--use_mxrecord \
--gpus 1 \
--use_16bit \
--arch ir_50 \
--batch_size 32 \
--num_workers 8 \
--epochs 26 \
--lr_milestones 12,20,24 \
--lr 0.1 \
--head adaface \
--m 0.4 \
--h 0.333 \
--low_res_augmentation_prob 0.2 \
--crop_augmentation_prob 0.2 \
--photometric_augmentation_prob 0.2 \
--resume_from_checkpoint /home/akhil/adaface/adaface_ir50_ms1mv2.ckpt
Hello @Akhil-Gurram,
you may want to pay attention to the model's last fc layer. Before the checkpoint weights are loaded, the current model's final fc layer has shape [512, 85742], which differs from the shape of the fc layer stored in the checkpoint ([512, 205990]). That's expected, since I assume you are training on another dataset, one with 85742 classes.
It would therefore probably help to skip this fc layer's weights when loading the parameters from the checkpoint, e.g. along the lines of the sketch below.
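Here is a minimal sketch of that idea, assuming the checkpoint is a standard PyTorch Lightning file with the weights stored under a "state_dict" key, and that `model` stands for your freshly built module (the one whose head is [512, 85742]). Filtering by shape drops exactly the mismatched head parameter and loads everything else; you would then start training normally instead of passing --resume_from_checkpoint:

```python
import torch

# Hypothetical sketch: load the released checkpoint and skip any
# parameter whose shape does not match the current model (here, the
# final fc layer "head.kernel").
ckpt = torch.load("adaface_ir50_ms1mv2.ckpt", map_location="cpu")
pretrained = ckpt["state_dict"]  # standard Lightning layout (assumption)

current = model.state_dict()     # `model` = your freshly built module
filtered = {
    k: v for k, v in pretrained.items()
    if k in current and v.shape == current[k].shape
}

# strict=False tolerates the dropped keys; the skipped fc layer keeps
# its random initialization and is trained on the new 85742-class data.
model.load_state_dict(filtered, strict=False)
print("skipped:", sorted(set(pretrained) - set(filtered)))
```

Note that this loads only the network weights, not the optimizer or LR-scheduler state that --resume_from_checkpoint would also restore, which is usually what you want when fine-tuning on a different dataset anyway.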
P.S.: I want to do the same thing, but I'm still stuck on how exactly to structure my own dataset for evaluation so that it works with AdaFace. Do you have any ideas? The same question goes for the training set, since I use my own training data. Do you use a training list?