shansongliu/MuMu-LLaMA

Errors in loading state_dict for M2UGen when running inference.py


Hi! I tried to run inference.py but encountered the error below, which seems to indicate missing keys and size mismatches.
I believe I have set up the checkpoint files correctly.
For the LLaMA model, I made a request to Meta and downloaded the 7B weights with a signed download link.
Other than that, I got everything from Hugging Face, including knn.index.

Not sure what I should fix at this point. I would appreciate it if you could give me some hints! I hope the problem is from my setup.

Traceback (most recent call last):
  File "/workspace/M2UGen/M2UGen/inference.py", line 95, in <module>
    load_result = model.load_state_dict(new_ckpt, strict=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for M2UGen:
        Missing key(s) in state_dict: "vit_model.encoder.layer.12.attention.attention.query.weight", "vit_model.encoder.layer.12.attention.attention.query.bias", 
...
...
        size mismatch for vit_model.embeddings.cls_token: copying a param with shape torch.Size([1, 1, 768]) from checkpoint, the shape in current model is torch.Size([1, 1, 1024]).
        size mismatch for vit_model.embeddings.position_embeddings: copying a param with shape torch.Size([1, 197, 768]) from checkpoint, the shape in current model is torch.Size([1, 197, 1024]).
        size mismatch for vit_model.embeddings.patch_embeddings.projection.weight: copying a param with shape torch.Size([768, 3, 16, 16]) from checkpoint, the shape in current model is torch.Size([1024, 3, 16, 16]).
        size mismatch for vit_model.embeddings.patch_embeddings.projection.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
...

Below is the command I used to run inference

python M2UGen/inference.py --video_file "/workspace/video-test.mp4" --model ./ckpts/M2UGen-MusicGen/checkpoint.pth --llama_dir ./ckpts/LLaMA --music_decoder musicgen

and this is the structure of the ckpts folder

├── LLaMA
│   ├── 7B
│   │   ├── checklist.chk
│   │   ├── consolidated.00.pth
│   │   └── params.json
│   ├── tokenizer.model
│   └── tokenizer_checklist.chk
├── M2UGen-MusicGen
│   └── checkpoint.pth
└── knn.index
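
The traceback above reports two kinds of problems: keys the model expects but the checkpoint lacks, and tensors whose shapes differ. A minimal sketch for summarizing both without reading the full strict-load dump, assuming each state dict has first been reduced to a name-to-shape mapping (for PyTorch, something like `{k: tuple(v.shape) for k, v in state_dict.items()}`). `diff_shapes` is a hypothetical helper, not part of M2UGen.

```python
from typing import Dict, List, Mapping, Tuple

Shape = Tuple[int, ...]


def diff_shapes(
    ckpt: Mapping[str, Shape], model: Mapping[str, Shape]
) -> Dict[str, List]:
    """Compare checkpoint shapes against the shapes the model expects."""
    # Keys the model defines but the checkpoint does not provide.
    missing = sorted(k for k in model if k not in ckpt)
    # Keys the checkpoint provides but the model does not define.
    unexpected = sorted(k for k in ckpt if k not in model)
    # Keys present on both sides whose shapes disagree.
    mismatched = sorted(
        (k, ckpt[k], model[k])
        for k in ckpt
        if k in model and tuple(ckpt[k]) != tuple(model[k])
    )
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}


if __name__ == "__main__":
    # Toy entries based on the error message above (not a real checkpoint).
    ckpt = {"vit_model.embeddings.cls_token": (1, 1, 768)}
    model = {
        "vit_model.embeddings.cls_token": (1, 1, 1024),
        "vit_model.encoder.layer.12.attention.attention.query.bias": (1024,),
    }
    report = diff_shapes(ckpt, model)
    print("missing:", report["missing"])
    print("mismatched:", report["mismatched"])
```

Note that passing `strict=False` to `load_state_dict` only relaxes the missing/unexpected key check; size mismatches like the ones above still raise, so a shape comparison of this kind may be the quicker way to see what differs.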

Have you solved it? I have the same problem.

Just tried running this model. Getting the same error

Traceback (most recent call last):
  File "/content/M2UGen/M2UGen/gradio_app.py", line 75, in <module>
    load_result = model.load_state_dict(new_ckpt, strict=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for M2UGen

What does "knn.index" mean?

Same problem here. Using the medium model fixed it for me.
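
In the error output in this thread, the checkpoint's `vit_model` tensors are 768 wide and the missing keys start at `encoder.layer.12`, while the constructed model expects width 1024 and layers up to 23. Those numbers match the standard ViT-Base (12 layers, hidden size 768) and ViT-Large (24 layers, hidden size 1024) configurations, so the checkpoint and the constructed model appear to use different encoder sizes. A small sketch for reading depth and width off a name-to-shape mapping, assuming the key layout shown in the traceback; `encoder_size` is a hypothetical helper, not part of M2UGen.

```python
from typing import Mapping, Optional, Tuple


def encoder_size(
    shapes: Mapping[str, Tuple[int, ...]], prefix: str = "vit_model"
) -> Tuple[int, Optional[int]]:
    """Return (number of encoder layers, hidden size) for one sub-model."""
    marker = f"{prefix}.encoder.layer."
    layer_ids = set()
    for name in shapes:
        if name.startswith(marker):
            # The layer index is the first dotted field after the marker.
            layer_ids.add(int(name[len(marker):].split(".", 1)[0]))
    # In the traceback the class token has shape (1, 1, hidden_size).
    cls = shapes.get(f"{prefix}.embeddings.cls_token")
    hidden = cls[-1] if cls else None
    return len(layer_ids), hidden


if __name__ == "__main__":
    # Toy mapping shaped like the checkpoint side of the error message.
    ckpt = {"vit_model.embeddings.cls_token": (1, 1, 768)}
    for i in range(12):
        ckpt[f"vit_model.encoder.layer.{i}.layernorm_before.weight"] = (768,)
    print(encoder_size(ckpt))  # (12, 768)
```

Running the same helper on the shapes of the instantiated model and comparing the two results is one way to confirm whether the encoder variant is the source of the mismatch.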

Same error here.

RuntimeError: Error(s) in loading state_dict for M2UGen:
	Missing key(s) in state_dict: "vit_model.encoder.layer.12.attention.attention.query.weight", "vit_model.encoder.layer.12.attention.attention.query.bias", "vit_model.encoder.layer.12.attention.attention.key.weight", "vit_model.encoder.layer.12.attention.attention.key.bias", "vit_model.encoder.layer.12.attention.attention.value.weight", "vit_model.encoder.layer.12.attention.attention.value.bias", "vit_model.encoder.layer.12.attention.output.dense.weight", "vit_model.encoder.layer.12.attention.output.dense.bias", "vit_model.encoder.layer.12.intermediate.dense.weight", "vit_model.encoder.layer.12.intermediate.dense.bias", "vit_model.encoder.layer.12.output.dense.weight", "vit_model.encoder.layer.12.output.dense.bias", "vit_model.encoder.layer.12.layernorm_before.weight", "vit_model.encoder.layer.12.layernorm_before.bias", "vit_model.encoder.layer.12.layernorm_after.weight", "vit_model.encoder.layer.12.layernorm_after.bias", "vit_model.encoder.layer.13.attention.attention.query.weight", "vit_model.encoder.layer.13.attention.attention.query.bias", "vit_model.encoder.layer.13.attention.attention.key.weight", "vit_model.encoder.layer.13.attention.attention.key.bias", "vit_model.encoder.layer.13.attention.attention.value.weight", "vit_model.encoder.layer.13.attention.attention.value.bias", "vit_model.encoder.layer.13.attention.output.dense.weight", "vit_model.encoder.layer.13.attention.output.dense.bias", "vit_model.encoder.layer.13.intermediate.dense.weight", "vit_model.encoder.layer.13.intermediate.dense.bias", "vit_model.encoder.layer.13.output.dense.weight", "vit_model.encoder.layer.13.output.dense.bias", "vit_model.encoder.layer.13.layernorm_before.weight", "vit_model.encoder.layer.13.layernorm_before.bias", "vit_model.encoder.layer.13.layernorm_after.weight", "vit_model.encoder.layer.13.layernorm_after.bias", "vit_model.encoder.layer.14.attention.attention.query.weight", "vit_model.encoder.layer.14.attention.attention.query.bias", 
"vit_model.encoder.layer.14.attention.attention.key.weight", "vit_model.encoder.layer.14.attention.attention.key.bias", "vit_model.encoder.layer.14.attention.attention.value.weight", "vit_model.encoder.layer.14.attention.attention.value.bias", "vit_model.encoder.layer.14.attention.output.dense.weight", "vit_model.encoder.layer.14.attention.output.dense.bias", "vit_model.encoder.layer.14.intermediate.dense.weight", "vit_model.encoder.layer.14.intermediate.dense.bias", "vit_model.encoder.layer.14.output.dense.weight", "vit_model.encoder.layer.14.output.dense.bias", "vit_model.encoder.layer.14.layernorm_before.weight", "vit_model.encoder.layer.14.layernorm_before.bias", "vit_model.encoder.layer.14.layernorm_after.weight", "vit_model.encoder.layer.14.layernorm_after.bias", "vit_model.encoder.layer.15.attention.attention.query.weight", "vit_model.encoder.layer.15.attention.attention.query.bias", "vit_model.encoder.layer.15.attention.attention.key.weight", "vit_model.encoder.layer.15.attention.attention.key.bias", "vit_model.encoder.layer.15.attention.attention.value.weight", "vit_model.encoder.layer.15.attention.attention.value.bias", "vit_model.encoder.layer.15.attention.output.dense.weight", "vit_model.encoder.layer.15.attention.output.dense.bias", "vit_model.encoder.layer.15.intermediate.dense.weight", "vit_model.encoder.layer.15.intermediate.dense.bias", "vit_model.encoder.layer.15.output.dense.weight", "vit_model.encoder.layer.15.output.dense.bias", "vit_model.encoder.layer.15.layernorm_before.weight", "vit_model.encoder.layer.15.layernorm_before.bias", "vit_model.encoder.layer.15.layernorm_after.weight", "vit_model.encoder.layer.15.layernorm_after.bias", "vit_model.encoder.layer.16.attention.attention.query.weight", "vit_model.encoder.layer.16.attention.attention.query.bias", "vit_model.encoder.layer.16.attention.attention.key.weight", "vit_model.encoder.layer.16.attention.attention.key.bias", "vit_model.encoder.layer.16.attention.attention.value.weight", 
"vit_model.encoder.layer.16.attention.attention.value.bias", "vit_model.encoder.layer.16.attention.output.dense.weight", "vit_model.encoder.layer.16.attention.output.dense.bias", "vit_model.encoder.layer.16.intermediate.dense.weight", "vit_model.encoder.layer.16.intermediate.dense.bias", "vit_model.encoder.layer.16.output.dense.weight", "vit_model.encoder.layer.16.output.dense.bias", "vit_model.encoder.layer.16.layernorm_before.weight", "vit_model.encoder.layer.16.layernorm_before.bias", "vit_model.encoder.layer.16.layernorm_after.weight", "vit_model.encoder.layer.16.layernorm_after.bias", "vit_model.encoder.layer.17.attention.attention.query.weight", "vit_model.encoder.layer.17.attention.attention.query.bias", "vit_model.encoder.layer.17.attention.attention.key.weight", "vit_model.encoder.layer.17.attention.attention.key.bias", "vit_model.encoder.layer.17.attention.attention.value.weight", "vit_model.encoder.layer.17.attention.attention.value.bias", "vit_model.encoder.layer.17.attention.output.dense.weight", "vit_model.encoder.layer.17.attention.output.dense.bias", "vit_model.encoder.layer.17.intermediate.dense.weight", "vit_model.encoder.layer.17.intermediate.dense.bias", "vit_model.encoder.layer.17.output.dense.weight", "vit_model.encoder.layer.17.output.dense.bias", "vit_model.encoder.layer.17.layernorm_before.weight", "vit_model.encoder.layer.17.layernorm_before.bias", "vit_model.encoder.layer.17.layernorm_after.weight", "vit_model.encoder.layer.17.layernorm_after.bias", "vit_model.encoder.layer.18.attention.attention.query.weight", "vit_model.encoder.layer.18.attention.attention.query.bias", "vit_model.encoder.layer.18.attention.attention.key.weight", "vit_model.encoder.layer.18.attention.attention.key.bias", "vit_model.encoder.layer.18.attention.attention.value.weight", "vit_model.encoder.layer.18.attention.attention.value.bias", "vit_model.encoder.layer.18.attention.output.dense.weight", "vit_model.encoder.layer.18.attention.output.dense.bias", 
"vit_model.encoder.layer.18.intermediate.dense.weight", "vit_model.encoder.layer.18.intermediate.dense.bias", "vit_model.encoder.layer.18.output.dense.weight", "vit_model.encoder.layer.18.output.dense.bias", "vit_model.encoder.layer.18.layernorm_before.weight", "vit_model.encoder.layer.18.layernorm_before.bias", "vit_model.encoder.layer.18.layernorm_after.weight", "vit_model.encoder.layer.18.layernorm_after.bias", "vit_model.encoder.layer.19.attention.attention.query.weight", "vit_model.encoder.layer.19.attention.attention.query.bias", "vit_model.encoder.layer.19.attention.attention.key.weight", "vit_model.encoder.layer.19.attention.attention.key.bias", "vit_model.encoder.layer.19.attention.attention.value.weight", "vit_model.encoder.layer.19.attention.attention.value.bias", "vit_model.encoder.layer.19.attention.output.dense.weight", "vit_model.encoder.layer.19.attention.output.dense.bias", "vit_model.encoder.layer.19.intermediate.dense.weight", "vit_model.encoder.layer.19.intermediate.dense.bias", "vit_model.encoder.layer.19.output.dense.weight", "vit_model.encoder.layer.19.output.dense.bias", "vit_model.encoder.layer.19.layernorm_before.weight", "vit_model.encoder.layer.19.layernorm_before.bias", "vit_model.encoder.layer.19.layernorm_after.weight", "vit_model.encoder.layer.19.layernorm_after.bias", "vit_model.encoder.layer.20.attention.attention.query.weight", "vit_model.encoder.layer.20.attention.attention.query.bias", "vit_model.encoder.layer.20.attention.attention.key.weight", "vit_model.encoder.layer.20.attention.attention.key.bias", "vit_model.encoder.layer.20.attention.attention.value.weight", "vit_model.encoder.layer.20.attention.attention.value.bias", "vit_model.encoder.layer.20.attention.output.dense.weight", "vit_model.encoder.layer.20.attention.output.dense.bias", "vit_model.encoder.layer.20.intermediate.dense.weight", "vit_model.encoder.layer.20.intermediate.dense.bias", "vit_model.encoder.layer.20.output.dense.weight", 
"vit_model.encoder.layer.20.output.dense.bias", "vit_model.encoder.layer.20.layernorm_before.weight", "vit_model.encoder.layer.20.layernorm_before.bias", "vit_model.encoder.layer.20.layernorm_after.weight", "vit_model.encoder.layer.20.layernorm_after.bias", "vit_model.encoder.layer.21.attention.attention.query.weight", "vit_model.encoder.layer.21.attention.attention.query.bias", "vit_model.encoder.layer.21.attention.attention.key.weight", "vit_model.encoder.layer.21.attention.attention.key.bias", "vit_model.encoder.layer.21.attention.attention.value.weight", "vit_model.encoder.layer.21.attention.attention.value.bias", "vit_model.encoder.layer.21.attention.output.dense.weight", "vit_model.encoder.layer.21.attention.output.dense.bias", "vit_model.encoder.layer.21.intermediate.dense.weight", "vit_model.encoder.layer.21.intermediate.dense.bias", "vit_model.encoder.layer.21.output.dense.weight", "vit_model.encoder.layer.21.output.dense.bias", "vit_model.encoder.layer.21.layernorm_before.weight", "vit_model.encoder.layer.21.layernorm_before.bias", "vit_model.encoder.layer.21.layernorm_after.weight", "vit_model.encoder.layer.21.layernorm_after.bias", "vit_model.encoder.layer.22.attention.attention.query.weight", "vit_model.encoder.layer.22.attention.attention.query.bias", "vit_model.encoder.layer.22.attention.attention.key.weight", "vit_model.encoder.layer.22.attention.attention.key.bias", "vit_model.encoder.layer.22.attention.attention.value.weight", "vit_model.encoder.layer.22.attention.attention.value.bias", "vit_model.encoder.layer.22.attention.output.dense.weight", "vit_model.encoder.layer.22.attention.output.dense.bias", "vit_model.encoder.layer.22.intermediate.dense.weight", "vit_model.encoder.layer.22.intermediate.dense.bias", "vit_model.encoder.layer.22.output.dense.weight", "vit_model.encoder.layer.22.output.dense.bias", "vit_model.encoder.layer.22.layernorm_before.weight", "vit_model.encoder.layer.22.layernorm_before.bias", 
"vit_model.encoder.layer.22.layernorm_after.weight", "vit_model.encoder.layer.22.layernorm_after.bias", "vit_model.encoder.layer.23.attention.attention.query.weight", "vit_model.encoder.layer.23.attention.attention.query.bias", "vit_model.encoder.layer.23.attention.attention.key.weight", "vit_model.encoder.layer.23.attention.attention.key.bias", "vit_model.encoder.layer.23.attention.attention.value.weight", "vit_model.encoder.layer.23.attention.attention.value.bias", "vit_model.encoder.layer.23.attention.output.dense.weight", "vit_model.encoder.layer.23.attention.output.dense.bias", "vit_model.encoder.layer.23.intermediate.dense.weight", "vit_model.encoder.layer.23.intermediate.dense.bias", "vit_model.encoder.layer.23.output.dense.weight", "vit_model.encoder.layer.23.output.dense.bias", "vit_model.encoder.layer.23.layernorm_before.weight", "vit_model.encoder.layer.23.layernorm_before.bias", "vit_model.encoder.layer.23.layernorm_after.weight", "vit_model.encoder.layer.23.layernorm_after.bias", "vivit_model.encoder.layer.12.attention.attention.query.weight", "vivit_model.encoder.layer.12.attention.attention.query.bias", "vivit_model.encoder.layer.12.attention.attention.key.weight", "vivit_model.encoder.layer.12.attention.attention.key.bias", "vivit_model.encoder.layer.12.attention.attention.value.weight", "vivit_model.encoder.layer.12.attention.attention.value.bias", "vivit_model.encoder.layer.12.attention.output.dense.weight", "vivit_model.encoder.layer.12.attention.output.dense.bias", "vivit_model.encoder.layer.12.intermediate.dense.weight", "vivit_model.encoder.layer.12.intermediate.dense.bias", "vivit_model.encoder.layer.12.output.dense.weight", "vivit_model.encoder.layer.12.output.dense.bias", "vivit_model.encoder.layer.12.layernorm_before.weight", "vivit_model.encoder.layer.12.layernorm_before.bias", "vivit_model.encoder.layer.12.layernorm_after.weight", "vivit_model.encoder.layer.12.layernorm_after.bias", 
"vivit_model.encoder.layer.13.attention.attention.query.weight", "vivit_model.encoder.layer.13.attention.attention.query.bias", "vivit_model.encoder.layer.13.attention.attention.key.weight", "vivit_model.encoder.layer.13.attention.attention.key.bias", "vivit_model.encoder.layer.13.attention.attention.value.weight", "vivit_model.encoder.layer.13.attention.attention.value.bias", "vivit_model.encoder.layer.13.attention.output.dense.weight", "vivit_model.encoder.layer.13.attention.output.dense.bias", "vivit_model.encoder.layer.13.intermediate.dense.weight", "vivit_model.encoder.layer.13.intermediate.dense.bias", "vivit_model.encoder.layer.13.output.dense.weight", "vivit_model.encoder.layer.13.output.dense.bias", "vivit_model.encoder.layer.13.layernorm_before.weight", "vivit_model.encoder.layer.13.layernorm_before.bias", "vivit_model.encoder.layer.13.layernorm_after.weight", "vivit_model.encoder.layer.13.layernorm_after.bias", "vivit_model.encoder.layer.14.attention.attention.query.weight", "vivit_model.encoder.layer.14.attention.attention.query.bias", "vivit_model.encoder.layer.14.attention.attention.key.weight", "vivit_model.encoder.layer.14.attention.attention.key.bias", "vivit_model.encoder.layer.14.attention.attention.value.weight", "vivit_model.encoder.layer.14.attention.attention.value.bias", "vivit_model.encoder.layer.14.attention.output.dense.weight", "vivit_model.encoder.layer.14.attention.output.dense.bias", "vivit_model.encoder.layer.14.intermediate.dense.weight", "vivit_model.encoder.layer.14.intermediate.dense.bias", "vivit_model.encoder.layer.14.output.dense.weight", "vivit_model.encoder.layer.14.output.dense.bias", "vivit_model.encoder.layer.14.layernorm_before.weight", "vivit_model.encoder.layer.14.layernorm_before.bias", "vivit_model.encoder.layer.14.layernorm_after.weight", "vivit_model.encoder.layer.14.layernorm_after.bias", "vivit_model.encoder.layer.15.attention.attention.query.weight", "vivit_model.encoder.layer.15.attention.attention.query.bias", 
"vivit_model.encoder.layer.15.attention.attention.key.weight", "vivit_model.encoder.layer.15.attention.attention.key.bias", "vivit_model.encoder.layer.15.attention.attention.value.weight", "vivit_model.encoder.layer.15.attention.attention.value.bias", "vivit_model.encoder.layer.15.attention.output.dense.weight", "vivit_model.encoder.layer.15.attention.output.dense.bias", "vivit_model.encoder.layer.15.intermediate.dense.weight", "vivit_model.encoder.layer.15.intermediate.dense.bias", "vivit_model.encoder.layer.15.output.dense.weight", "vivit_model.encoder.layer.15.output.dense.bias", "vivit_model.encoder.layer.15.layernorm_before.weight", "vivit_model.encoder.layer.15.layernorm_before.bias", "vivit_model.encoder.layer.15.layernorm_after.weight", "vivit_model.encoder.layer.15.layernorm_after.bias", "vivit_model.encoder.layer.16.attention.attention.query.weight", "vivit_model.encoder.layer.16.attention.attention.query.bias", "vivit_model.encoder.layer.16.attention.attention.key.weight", "vivit_model.encoder.layer.16.attention.attention.key.bias", "vivit_model.encoder.layer.16.attention.attention.value.weight", "vivit_model.encoder.layer.16.attention.attention.value.bias", "vivit_model.encoder.layer.16.attention.output.dense.weight", "vivit_model.encoder.layer.16.attention.output.dense.bias", "vivit_model.encoder.layer.16.intermediate.dense.weight", "vivit_model.encoder.layer.16.intermediate.dense.bias", "vivit_model.encoder.layer.16.output.dense.weight", "vivit_model.encoder.layer.16.output.dense.bias", "vivit_model.encoder.layer.16.layernorm_before.weight", "vivit_model.encoder.layer.16.layernorm_before.bias", "vivit_model.encoder.layer.16.layernorm_after.weight", "vivit_model.encoder.layer.16.layernorm_after.bias", "vivit_model.encoder.layer.17.attention.attention.query.weight", "vivit_model.encoder.layer.17.attention.attention.query.bias", "vivit_model.encoder.layer.17.attention.attention.key.weight", "vivit_model.encoder.layer.17.attention.attention.key.bias", 
"vivit_model.encoder.layer.17.attention.attention.value.weight", "vivit_model.encoder.layer.17.attention.attention.value.bias", "vivit_model.encoder.layer.17.attention.output.dense.weight", "vivit_model.encoder.layer.17.attention.output.dense.bias", "vivit_model.encoder.layer.17.intermediate.dense.weight", "vivit_model.encoder.layer.17.intermediate.dense.bias", "vivit_model.encoder.layer.17.output.dense.weight", "vivit_model.encoder.layer.17.output.dense.bias", "vivit_model.encoder.layer.17.layernorm_before.weight", "vivit_model.encoder.layer.17.layernorm_before.bias", "vivit_model.encoder.layer.17.layernorm_after.weight", "vivit_model.encoder.layer.17.layernorm_after.bias", "vivit_model.encoder.layer.18.attention.attention.query.weight", "vivit_model.encoder.layer.18.attention.attention.query.bias", "vivit_model.encoder.layer.18.attention.attention.key.weight", "vivit_model.encoder.layer.18.attention.attention.key.bias", "vivit_model.encoder.layer.18.attention.attention.value.weight", "vivit_model.encoder.layer.18.attention.attention.value.bias", "vivit_model.encoder.layer.18.attention.output.dense.weight", "vivit_model.encoder.layer.18.attention.output.dense.bias", "vivit_model.encoder.layer.18.intermediate.dense.weight", "vivit_model.encoder.layer.18.intermediate.dense.bias", "vivit_model.encoder.layer.18.output.dense.weight", "vivit_model.encoder.layer.18.output.dense.bias", "vivit_model.encoder.layer.18.layernorm_before.weight", "vivit_model.encoder.layer.18.layernorm_before.bias", "vivit_model.encoder.layer.18.layernorm_after.weight", "vivit_model.encoder.layer.18.layernorm_after.bias", "vivit_model.encoder.layer.19.attention.attention.query.weight", "vivit_model.encoder.layer.19.attention.attention.query.bias", "vivit_model.encoder.layer.19.attention.attention.key.weight", "vivit_model.encoder.layer.19.attention.attention.key.bias", "vivit_model.encoder.layer.19.attention.attention.value.weight", "vivit_model.encoder.layer.19.attention.attention.value.bias", 
"vivit_model.encoder.layer.19.attention.output.dense.weight", "vivit_model.encoder.layer.19.attention.output.dense.bias", "vivit_model.encoder.layer.19.intermediate.dense.weight", "vivit_model.encoder.layer.19.intermediate.dense.bias", "vivit_model.encoder.layer.19.output.dense.weight", "vivit_model.encoder.layer.19.output.dense.bias", "vivit_model.encoder.layer.19.layernorm_before.weight", "vivit_model.encoder.layer.19.layernorm_before.bias", "vivit_model.encoder.layer.19.layernorm_after.weight", "vivit_model.encoder.layer.19.layernorm_after.bias", "vivit_model.encoder.layer.20.attention.attention.query.weight", "vivit_model.encoder.layer.20.attention.attention.query.bias", "vivit_model.encoder.layer.20.attention.attention.key.weight", "vivit_model.encoder.layer.20.attention.attention.key.bias", "vivit_model.encoder.layer.20.attention.attention.value.weight", "vivit_model.encoder.layer.20.attention.attention.value.bias", "vivit_model.encoder.layer.20.attention.output.dense.weight", "vivit_model.encoder.layer.20.attention.output.dense.bias", "vivit_model.encoder.layer.20.intermediate.dense.weight", "vivit_model.encoder.layer.20.intermediate.dense.bias", "vivit_model.encoder.layer.20.output.dense.weight", "vivit_model.encoder.layer.20.output.dense.bias", "vivit_model.encoder.layer.20.layernorm_before.weight", "vivit_model.encoder.layer.20.layernorm_before.bias", "vivit_model.encoder.layer.20.layernorm_after.weight", "vivit_model.encoder.layer.20.layernorm_after.bias", "vivit_model.encoder.layer.21.attention.attention.query.weight", "vivit_model.encoder.layer.21.attention.attention.query.bias", "vivit_model.encoder.layer.21.attention.attention.key.weight", "vivit_model.encoder.layer.21.attention.attention.key.bias", "vivit_model.encoder.layer.21.attention.attention.value.weight", "vivit_model.encoder.layer.21.attention.attention.value.bias", "vivit_model.encoder.layer.21.attention.output.dense.weight", "vivit_model.encoder.layer.21.attention.output.dense.bias", 
"vivit_model.encoder.layer.21.intermediate.dense.weight", "vivit_model.encoder.layer.21.intermediate.dense.bias", "vivit_model.encoder.layer.21.output.dense.weight", "vivit_model.encoder.layer.21.output.dense.bias", "vivit_model.encoder.layer.21.layernorm_before.weight", "vivit_model.encoder.layer.21.layernorm_before.bias", "vivit_model.encoder.layer.21.layernorm_after.weight", "vivit_model.encoder.layer.21.layernorm_after.bias", "vivit_model.encoder.layer.22.attention.attention.query.weight", "vivit_model.encoder.layer.22.attention.attention.query.bias", "vivit_model.encoder.layer.22.attention.attention.key.weight", "vivit_model.encoder.layer.22.attention.attention.key.bias", "vivit_model.encoder.layer.22.attention.attention.value.weight", "vivit_model.encoder.layer.22.attention.attention.value.bias", "vivit_model.encoder.layer.22.attention.output.dense.weight", "vivit_model.encoder.layer.22.attention.output.dense.bias", "vivit_model.encoder.layer.22.intermediate.dense.weight", "vivit_model.encoder.layer.22.intermediate.dense.bias", "vivit_model.encoder.layer.22.output.dense.weight", "vivit_model.encoder.layer.22.output.dense.bias", "vivit_model.encoder.layer.22.layernorm_before.weight", "vivit_model.encoder.layer.22.layernorm_before.bias", "vivit_model.encoder.layer.22.layernorm_after.weight", "vivit_model.encoder.layer.22.layernorm_after.bias", "vivit_model.encoder.layer.23.attention.attention.query.weight", "vivit_model.encoder.layer.23.attention.attention.query.bias", "vivit_model.encoder.layer.23.attention.attention.key.weight", "vivit_model.encoder.layer.23.attention.attention.key.bias", "vivit_model.encoder.layer.23.attention.attention.value.weight", "vivit_model.encoder.layer.23.attention.attention.value.bias", "vivit_model.encoder.layer.23.attention.output.dense.weight", "vivit_model.encoder.layer.23.attention.output.dense.bias", "vivit_model.encoder.layer.23.intermediate.dense.weight", "vivit_model.encoder.layer.23.intermediate.dense.bias", 
"vivit_model.encoder.layer.23.output.dense.weight", "vivit_model.encoder.layer.23.output.dense.bias", "vivit_model.encoder.layer.23.layernorm_before.weight", "vivit_model.encoder.layer.23.layernorm_before.bias", "vivit_model.encoder.layer.23.layernorm_after.weight", "vivit_model.encoder.layer.23.layernorm_after.bias". 
	size mismatch for vit_model.embeddings.cls_token: copying a param with shape torch.Size([1, 1, 768]) from checkpoint, the shape in current model is torch.Size([1, 1, 1024]).
	size mismatch for vit_model.embeddings.position_embeddings: copying a param with shape torch.Size([1, 197, 768]) from checkpoint, the shape in current model is torch.Size([1, 197, 1024]).
	size mismatch for vit_model.embeddings.patch_embeddings.projection.weight: copying a param with shape torch.Size([768, 3, 16, 16]) from checkpoint, the shape in current model is torch.Size([1024, 3, 16, 16]).
	size mismatch for vit_model.embeddings.patch_embeddings.projection.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.0.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.0.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.0.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.0.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.0.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.0.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.0.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.0.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.0.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.0.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.0.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.0.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.0.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.0.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.0.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.0.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.1.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.1.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.1.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.1.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.1.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.1.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.1.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.1.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.1.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.1.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.1.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.1.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.1.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.1.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.1.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.1.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.2.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.2.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.2.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.2.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.2.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.2.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.2.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.2.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.2.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.2.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.2.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.2.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.2.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.2.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.2.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.2.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.3.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.3.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.3.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.3.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.3.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.3.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.3.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.3.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.3.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.3.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.3.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.3.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.3.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.3.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.3.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.3.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.4.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.4.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.4.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.4.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.4.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.4.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.4.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.4.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.4.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.4.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.4.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.4.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.4.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.4.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.4.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.4.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.5.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.5.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.5.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.5.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.5.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.5.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.5.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.5.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.5.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.5.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.5.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.5.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.5.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.5.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.5.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.5.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.6.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.6.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.6.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.6.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.6.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.6.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.6.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.6.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.6.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.6.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.6.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.6.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.6.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.6.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.6.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.6.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.7.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.7.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.7.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.7.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.7.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.7.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.7.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.7.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.7.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.7.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.7.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.7.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.7.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.7.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.7.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.7.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.8.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.8.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.8.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.8.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.8.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.8.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.8.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.8.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.8.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.8.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.8.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.8.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.8.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.8.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.8.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.8.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.9.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.9.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.9.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.9.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.9.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.9.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.9.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.9.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.9.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.9.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.9.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.9.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.9.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.9.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.9.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.9.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.10.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.10.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.10.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.10.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.10.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.10.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.10.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.10.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.10.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.10.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.10.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.10.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.10.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.10.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.10.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.10.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.11.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.11.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.11.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.11.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.11.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.11.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.11.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.11.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.11.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.11.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.11.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.11.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.11.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.11.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.11.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.11.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.pooler.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.pooler.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.embeddings.cls_token: copying a param with shape torch.Size([1, 1, 768]) from checkpoint, the shape in current model is torch.Size([1, 1, 1024]).
	size mismatch for vivit_model.embeddings.position_embeddings: copying a param with shape torch.Size([1, 3137, 768]) from checkpoint, the shape in current model is torch.Size([1, 3137, 1024]).
	size mismatch for vivit_model.embeddings.patch_embeddings.projection.weight: copying a param with shape torch.Size([768, 3, 2, 16, 16]) from checkpoint, the shape in current model is torch.Size([1024, 3, 2, 16, 16]).
	size mismatch for vivit_model.embeddings.patch_embeddings.projection.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.0.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.0.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.0.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.0.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.0.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.0.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.0.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.0.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.0.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.0.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.0.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.0.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.0.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.0.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.0.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.0.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.1.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.1.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.1.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.1.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.1.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.1.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.1.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.1.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.1.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.1.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.1.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.1.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.1.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.1.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.1.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.1.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.2.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.2.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.2.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.2.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.2.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.2.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.2.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.2.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.2.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.2.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.2.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.2.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.2.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.2.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.2.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.2.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.3.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.3.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.3.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.3.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.3.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.3.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.3.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.3.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.3.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.3.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.3.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.3.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.3.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.3.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.3.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.3.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.4.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.4.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.4.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.4.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.4.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.4.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.4.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.4.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.4.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.4.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.4.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.4.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.4.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.4.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.4.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.4.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.5.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.5.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.5.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.5.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.5.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.5.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.5.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.5.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.5.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.5.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.5.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.5.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.5.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.5.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.5.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.5.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.6.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.6.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.6.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.6.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.6.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.6.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.6.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.6.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.6.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.6.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.6.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.6.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.6.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.6.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.6.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.6.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.7.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.7.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.7.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.7.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.7.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.7.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.7.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.7.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.7.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.7.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.7.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.7.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.7.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.7.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.7.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.7.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.8.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.8.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.8.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.8.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.8.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.8.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.8.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.8.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.8.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.8.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.8.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.8.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.8.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.8.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.8.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.8.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.9.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.9.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.9.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.9.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.9.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.9.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.9.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.9.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.9.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.9.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.9.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.9.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.9.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.9.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.9.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.9.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.10.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.10.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.10.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.10.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.10.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.10.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.10.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.10.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.10.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.10.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.10.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.10.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.10.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.10.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.10.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.10.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.11.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.11.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.11.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.11.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.11.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.11.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.11.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.11.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.11.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.11.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.11.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.11.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.11.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.11.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.11.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.11.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.pooler.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.pooler.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
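
Every mismatch in the log above follows one pattern: the checkpoint tensors use 768 (and 3072) where the instantiated model expects 1024 (and 4096). Those are the hidden and intermediate sizes of the base and large ViT configurations respectively, so the log is consistent with the model being built from a larger encoder variant than the one the checkpoint was saved with. The thread does not confirm this. A minimal sketch for condensing such a log into a per-submodule summary is below. It works on plain `{name: shape}` mappings so it runs without PyTorch; the helper names `diff_shapes` and `summarize_by_prefix` are hypothetical, and the shape given for the missing key is an assumption for illustration only.

```python
from collections import Counter


def diff_shapes(ckpt_shapes, model_shapes):
    """Compare two {parameter_name: shape} mappings.

    Returns (missing, unexpected, mismatched):
      missing    - names the model expects but the checkpoint lacks
      unexpected - names in the checkpoint the model does not have
      mismatched - {name: (ckpt_shape, model_shape)} for differing shapes
    """
    missing = sorted(k for k in model_shapes if k not in ckpt_shapes)
    unexpected = sorted(k for k in ckpt_shapes if k not in model_shapes)
    mismatched = {
        k: (tuple(ckpt_shapes[k]), tuple(model_shapes[k]))
        for k in ckpt_shapes
        if k in model_shapes and tuple(ckpt_shapes[k]) != tuple(model_shapes[k])
    }
    return missing, unexpected, mismatched


def summarize_by_prefix(names):
    """Count parameter names per top-level submodule (text before the first dot)."""
    return Counter(name.split(".", 1)[0] for name in names)


# Shapes copied from the error log in this issue.
ckpt_shapes = {
    "vivit_model.layernorm.weight": (768,),
    "vivit_model.pooler.dense.weight": (768, 768),
    "vivit_model.encoder.layer.11.intermediate.dense.weight": (3072, 768),
}
model_shapes = {
    "vivit_model.layernorm.weight": (1024,),
    "vivit_model.pooler.dense.weight": (1024, 1024),
    "vivit_model.encoder.layer.11.intermediate.dense.weight": (4096, 1024),
    # Reported as a missing key in the original post; this shape is assumed.
    "vit_model.encoder.layer.12.attention.attention.query.weight": (1024, 1024),
}

missing, unexpected, mismatched = diff_shapes(ckpt_shapes, model_shapes)
mismatch_summary = summarize_by_prefix(mismatched)
missing_summary = summarize_by_prefix(missing)

print("size mismatches per submodule:", dict(mismatch_summary))
print("missing keys per submodule:", dict(missing_summary))
```

With PyTorch available, the two mappings can be built from real objects with `{k: tuple(v.shape) for k, v in state_dict.items()}`, once for the loaded checkpoint and once for `model.state_dict()`, before calling `load_state_dict`.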

I have the same problem.

Just tried running this model. Getting the same error.

Traceback (most recent call last):
  File "/content/M2UGen/M2UGen/gradio_app.py", line 75, in
    load_result = model.load_state_dict(new_ckpt, strict=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for M2UGen

I get the same problem. How did you resolve it?