Error while loading a pretrained model to resume training.
Firstly, congrats on the paper and great results.
Currently, I face an issue when I resume training with the pretrained model (R50-MS1Mv2) provided by you; when I run the script to train a model from scratch, it works without any issues.
RuntimeError: Error(s) in loading state_dict for Trainer:
size mismatch for head.kernel: copying a param with shape torch.Size([512, 205990]) from checkpoint, the shape in current model is torch.Size([512, 85742]).
Any suggestions on how to fix this? Thank you.
python main.py \
--data_root /mnt/largedisk/Datasets/ \
--train_data_path faces_emore \
--val_data_path faces_emore \
--prefix ir50_ms1mv2_adaface \
--use_wandb \
--use_mxrecord \
--gpus 1 \
--use_16bit \
--arch ir_50 \
--batch_size 32 \
--num_workers 8 \
--epochs 26 \
--lr_milestones 12,20,24 \
--lr 0.1 \
--head adaface \
--m 0.4 \
--h 0.333 \
--low_res_augmentation_prob 0.2 \
--crop_augmentation_prob 0.2 \
--photometric_augmentation_prob 0.2 \
--resume_from_checkpoint /home/akhil/adaface/adaface_ir50_ms1mv2.ckpt
Hello @Akhil-Gurram,
you may want to pay attention to the model's last fc layer. Before the checkpoint weights are loaded, the current model's final fc layer has shape [512, 85742], which differs from the shape of the fc layer stored in the checkpoint ([512, 205990]). That's expected, since I assume you are training on another dataset, one with 85742 classes.
It would therefore probably help to skip this fc layer's weights when loading the parameters from the checkpoint, e.g. along the lines of the sketch below.
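Here is a minimal sketch of that idea, assuming the checkpoint is a standard PyTorch Lightning file with the weights stored under a "state_dict" key, and that `model` stands for your freshly built module (the one whose head is [512, 85742]). Filtering by shape drops exactly the mismatched head parameter and loads everything else; you would then start training normally instead of passing --resume_from_checkpoint:

```python
import torch

# Hypothetical sketch: load the released checkpoint and skip any
# parameter whose shape does not match the current model (here, the
# final fc layer "head.kernel").
ckpt = torch.load("adaface_ir50_ms1mv2.ckpt", map_location="cpu")
pretrained = ckpt["state_dict"]  # standard Lightning layout (assumption)

current = model.state_dict()     # `model` = your freshly built module
filtered = {
    k: v for k, v in pretrained.items()
    if k in current and v.shape == current[k].shape
}

# strict=False tolerates the dropped keys; the skipped fc layer keeps
# its random initialization and is trained on the new 85742-class data.
model.load_state_dict(filtered, strict=False)
print("skipped:", sorted(set(pretrained) - set(filtered)))
```

Note that this loads only the network weights, not the optimizer or LR-scheduler state that --resume_from_checkpoint would also restore, which is usually what you want when fine-tuning on a different dataset anyway.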
P.S.: I want to do the same thing, but I'm still stuck on how exactly to structure my own dataset for evaluation so that it works with AdaFace. Do you have any ideas? The same question goes for the training set, since I use my own training data. Do you use a training list?