Errors in loading state_dict for M2UGen when running inference.py

Question

Opened this issue 10 months ago · 7 comments

Hi! I tried to run inference.py but encountered the error below, which seems to indicate missing keys and size mismatches.
I believe I have set up the checkpoint files correctly.
For the LLaMA model, I made a request to Meta and downloaded the 7B weights with a signed download link.
Other than that, I got everything from Hugging Face, including knn.index.

I'm not sure what I should fix at this point. I would appreciate it if you could give me some hints! I hope the problem is in my setup.

Traceback (most recent call last):
  File "/workspace/M2UGen/M2UGen/inference.py", line 95, in <module>
    load_result = model.load_state_dict(new_ckpt, strict=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for M2UGen:
        Missing key(s) in state_dict: "vit_model.encoder.layer.12.attention.attention.query.weight", "vit_model.encoder.layer.12.attention.attention.query.bias", 
...
...
        size mismatch for vit_model.embeddings.cls_token: copying a param with shape torch.Size([1, 1, 768]) from checkpoint, the shape in current model is torch.Size([1, 1, 1024]).
        size mismatch for vit_model.embeddings.position_embeddings: copying a param with shape torch.Size([1, 197, 768]) from checkpoint, the shape in current model is torch.Size([1, 197, 1024]).
        size mismatch for vit_model.embeddings.patch_embeddings.projection.weight: copying a param with shape torch.Size([768, 3, 16, 16]) from checkpoint, the shape in current model is torch.Size([1024, 3, 16, 16]).
        size mismatch for vit_model.embeddings.patch_embeddings.projection.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
...

Below is the command I used to run inference:

python M2UGen/inference.py --video_file "/workspace/video-test.mp4" --model ./ckpts/M2UGen-MusicGen/checkpoint.pth --llama_dir ./ckpts/LLaMA --music_decoder musicgen

And this is the structure of the ckpts folder:

├── LLaMA
│   ├── 7B
│   │   ├── checklist.chk
│   │   ├── consolidated.00.pth
│   │   └── params.json
│   ├── tokenizer.model
│   └── tokenizer_checklist.chk
├── M2UGen-MusicGen
│   └── checkpoint.pth
└── knn.index
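
Since the full error message is very long, it can help to condense it before reading it. The following is a minimal sketch, not part of the M2UGen repository: `summarize_load_error` is a hypothetical helper that parses the text of the `RuntimeError` and reports which encoder layer indices appear in the quoted key list and which distinct (checkpoint, model) shape pairs occur. It assumes the message follows the wording shown in the traceback above, and it counts every quoted key, so a message that also lists unexpected keys would need separate handling.

```python
import re
from collections import defaultdict

# Matches one "size mismatch" line as printed by load_state_dict.
MISMATCH_RE = re.compile(
    r"size mismatch for (?P<name>[\w.]+): copying a param with shape "
    r"torch\.Size\(\[(?P<ckpt>[\d, ]*)\]\) from checkpoint, "
    r"the shape in current model is torch\.Size\(\[(?P<model>[\d, ]*)\]\)"
)
# Matches quoted keys such as "vit_model.encoder.layer.12.<...>".
LAYER_RE = re.compile(r'"(?P<prefix>\w+)\.encoder\.layer\.(?P<idx>\d+)\.')


def parse_shape(text):
    """Turn '1, 1, 768' into (1, 1, 768)."""
    return tuple(int(part) for part in text.split(",") if part.strip())


def summarize_load_error(message):
    """Condense a load_state_dict error message into a small dict."""
    layers = defaultdict(set)
    for match in LAYER_RE.finditer(message):
        layers[match["prefix"]].add(int(match["idx"]))

    shape_pairs = set()
    for match in MISMATCH_RE.finditer(message):
        shape_pairs.add((parse_shape(match["ckpt"]), parse_shape(match["model"])))

    return {
        "quoted_layers": {name: sorted(idx) for name, idx in layers.items()},
        "shape_pairs": sorted(shape_pairs),
    }


# A short excerpt in the same format as the error above.
sample = (
    "Missing key(s) in state_dict: "
    '"vit_model.encoder.layer.12.attention.attention.query.weight", '
    '"vit_model.encoder.layer.12.attention.attention.query.bias", \n'
    "size mismatch for vit_model.embeddings.cls_token: copying a param with shape "
    "torch.Size([1, 1, 768]) from checkpoint, the shape in current model is "
    "torch.Size([1, 1, 1024])."
)
summary = summarize_load_error(sample)
print(summary)
```

Read this way, the error says that the checkpoint stores encoder weights with hidden size 768 while the constructed model expects 1024 and additionally expects layers the checkpoint does not contain. Those dimensions are consistent with a ViT-Base-sized checkpoint being loaded into a ViT-Large-sized model, though the log alone does not show why the two differ.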

Answer 1 · 2024-06-26T08:15:19.000Z

Have you solved it? I have the same problem.

Answer 2 · 2024-06-26T16:27:03.000Z

Just tried running this model. Getting the same error

Traceback (most recent call last):
  File "/content/M2UGen/M2UGen/gradio_app.py", line 75, in <module>
    load_result = model.load_state_dict(new_ckpt, strict=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for M2UGen

Answer 3 · 2024-07-02T14:48:20.000Z

What does "knn.index" mean?

Answer 4 · 2024-11-12T14:19:26.000Z

Same problem here. Using the medium model fixes it for me.
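
If switching model variants makes the error go away, it may be worth checking up front whether a checkpoint matches the model being built, rather than waiting for `load_state_dict(..., strict=True)` to fail. The sketch below is only an illustration under that assumption: `diff_state_shapes` is a hypothetical helper, and the two dictionaries are toy data modeled on the key names in this thread, not values read from the real checkpoint.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def diff_state_shapes(
    ckpt_shapes: Dict[str, Shape], model_shapes: Dict[str, Shape]
) -> Tuple[List[str], List[str], List[Tuple[str, Shape, Shape]]]:
    """Compare parameter names and shapes of a checkpoint and a model.

    Returns (missing, unexpected, mismatched):
      missing    - keys the model expects but the checkpoint lacks
      unexpected - keys the checkpoint has but the model does not
      mismatched - (key, checkpoint shape, model shape) for differing shapes
    """
    missing = sorted(k for k in model_shapes if k not in ckpt_shapes)
    unexpected = sorted(k for k in ckpt_shapes if k not in model_shapes)
    mismatched = sorted(
        (k, ckpt_shapes[k], model_shapes[k])
        for k in ckpt_shapes
        if k in model_shapes and ckpt_shapes[k] != model_shapes[k]
    )
    return missing, unexpected, mismatched


# Toy data only: a 768-wide checkpoint against a 1024-wide model.
ckpt_shapes = {
    "vit_model.embeddings.cls_token": (1, 1, 768),
    "vit_model.encoder.layer.0.output.dense.bias": (768,),
}
model_shapes = {
    "vit_model.embeddings.cls_token": (1, 1, 1024),
    "vit_model.encoder.layer.0.output.dense.bias": (1024,),
    "vit_model.encoder.layer.12.output.dense.bias": (1024,),
}

missing, unexpected, mismatched = diff_state_shapes(ckpt_shapes, model_shapes)
print("missing:", missing)
print("unexpected:", unexpected)
print("mismatched:", mismatched)
```

With PyTorch, the shape dictionaries could be built with a comprehension such as `{k: tuple(v.shape) for k, v in state_dict.items()}`, applied once to the loaded checkpoint and once to `model.state_dict()`.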

Answer 5 · 2024-11-26T02:40:51.000Z

Same error here.

RuntimeError: Error(s) in loading state_dict for M2UGen:
	Missing key(s) in state_dict: "vit_model.encoder.layer.12.attention.attention.query.weight", "vit_model.encoder.layer.12.attention.attention.query.bias", "vit_model.encoder.layer.12.attention.attention.key.weight", "vit_model.encoder.layer.12.attention.attention.key.bias", "vit_model.encoder.layer.12.attention.attention.value.weight", "vit_model.encoder.layer.12.attention.attention.value.bias", "vit_model.encoder.layer.12.attention.output.dense.weight", "vit_model.encoder.layer.12.attention.output.dense.bias", "vit_model.encoder.layer.12.intermediate.dense.weight", "vit_model.encoder.layer.12.intermediate.dense.bias", "vit_model.encoder.layer.12.output.dense.weight", "vit_model.encoder.layer.12.output.dense.bias", "vit_model.encoder.layer.12.layernorm_before.weight", "vit_model.encoder.layer.12.layernorm_before.bias", "vit_model.encoder.layer.12.layernorm_after.weight", "vit_model.encoder.layer.12.layernorm_after.bias", "vit_model.encoder.layer.13.attention.attention.query.weight", "vit_model.encoder.layer.13.attention.attention.query.bias", "vit_model.encoder.layer.13.attention.attention.key.weight", "vit_model.encoder.layer.13.attention.attention.key.bias", "vit_model.encoder.layer.13.attention.attention.value.weight", "vit_model.encoder.layer.13.attention.attention.value.bias", "vit_model.encoder.layer.13.attention.output.dense.weight", "vit_model.encoder.layer.13.attention.output.dense.bias", "vit_model.encoder.layer.13.intermediate.dense.weight", "vit_model.encoder.layer.13.intermediate.dense.bias", "vit_model.encoder.layer.13.output.dense.weight", "vit_model.encoder.layer.13.output.dense.bias", "vit_model.encoder.layer.13.layernorm_before.weight", "vit_model.encoder.layer.13.layernorm_before.bias", "vit_model.encoder.layer.13.layernorm_after.weight", "vit_model.encoder.layer.13.layernorm_after.bias", "vit_model.encoder.layer.14.attention.attention.query.weight", "vit_model.encoder.layer.14.attention.attention.query.bias", 
"vit_model.encoder.layer.14.attention.attention.key.weight", "vit_model.encoder.layer.14.attention.attention.key.bias", "vit_model.encoder.layer.14.attention.attention.value.weight", "vit_model.encoder.layer.14.attention.attention.value.bias", "vit_model.encoder.layer.14.attention.output.dense.weight", "vit_model.encoder.layer.14.attention.output.dense.bias", "vit_model.encoder.layer.14.intermediate.dense.weight", "vit_model.encoder.layer.14.intermediate.dense.bias", "vit_model.encoder.layer.14.output.dense.weight", "vit_model.encoder.layer.14.output.dense.bias", "vit_model.encoder.layer.14.layernorm_before.weight", "vit_model.encoder.layer.14.layernorm_before.bias", "vit_model.encoder.layer.14.layernorm_after.weight", "vit_model.encoder.layer.14.layernorm_after.bias", "vit_model.encoder.layer.15.attention.attention.query.weight", "vit_model.encoder.layer.15.attention.attention.query.bias", "vit_model.encoder.layer.15.attention.attention.key.weight", "vit_model.encoder.layer.15.attention.attention.key.bias", "vit_model.encoder.layer.15.attention.attention.value.weight", "vit_model.encoder.layer.15.attention.attention.value.bias", "vit_model.encoder.layer.15.attention.output.dense.weight", "vit_model.encoder.layer.15.attention.output.dense.bias", "vit_model.encoder.layer.15.intermediate.dense.weight", "vit_model.encoder.layer.15.intermediate.dense.bias", "vit_model.encoder.layer.15.output.dense.weight", "vit_model.encoder.layer.15.output.dense.bias", "vit_model.encoder.layer.15.layernorm_before.weight", "vit_model.encoder.layer.15.layernorm_before.bias", "vit_model.encoder.layer.15.layernorm_after.weight", "vit_model.encoder.layer.15.layernorm_after.bias", "vit_model.encoder.layer.16.attention.attention.query.weight", "vit_model.encoder.layer.16.attention.attention.query.bias", "vit_model.encoder.layer.16.attention.attention.key.weight", "vit_model.encoder.layer.16.attention.attention.key.bias", "vit_model.encoder.layer.16.attention.attention.value.weight", 
"vit_model.encoder.layer.16.attention.attention.value.bias", "vit_model.encoder.layer.16.attention.output.dense.weight", "vit_model.encoder.layer.16.attention.output.dense.bias", "vit_model.encoder.layer.16.intermediate.dense.weight", "vit_model.encoder.layer.16.intermediate.dense.bias", "vit_model.encoder.layer.16.output.dense.weight", "vit_model.encoder.layer.16.output.dense.bias", "vit_model.encoder.layer.16.layernorm_before.weight", "vit_model.encoder.layer.16.layernorm_before.bias", "vit_model.encoder.layer.16.layernorm_after.weight", "vit_model.encoder.layer.16.layernorm_after.bias", "vit_model.encoder.layer.17.attention.attention.query.weight", "vit_model.encoder.layer.17.attention.attention.query.bias", "vit_model.encoder.layer.17.attention.attention.key.weight", "vit_model.encoder.layer.17.attention.attention.key.bias", "vit_model.encoder.layer.17.attention.attention.value.weight", "vit_model.encoder.layer.17.attention.attention.value.bias", "vit_model.encoder.layer.17.attention.output.dense.weight", "vit_model.encoder.layer.17.attention.output.dense.bias", "vit_model.encoder.layer.17.intermediate.dense.weight", "vit_model.encoder.layer.17.intermediate.dense.bias", "vit_model.encoder.layer.17.output.dense.weight", "vit_model.encoder.layer.17.output.dense.bias", "vit_model.encoder.layer.17.layernorm_before.weight", "vit_model.encoder.layer.17.layernorm_before.bias", "vit_model.encoder.layer.17.layernorm_after.weight", "vit_model.encoder.layer.17.layernorm_after.bias", "vit_model.encoder.layer.18.attention.attention.query.weight", "vit_model.encoder.layer.18.attention.attention.query.bias", "vit_model.encoder.layer.18.attention.attention.key.weight", "vit_model.encoder.layer.18.attention.attention.key.bias", "vit_model.encoder.layer.18.attention.attention.value.weight", "vit_model.encoder.layer.18.attention.attention.value.bias", "vit_model.encoder.layer.18.attention.output.dense.weight", "vit_model.encoder.layer.18.attention.output.dense.bias", 
"vit_model.encoder.layer.18.intermediate.dense.weight", "vit_model.encoder.layer.18.intermediate.dense.bias", "vit_model.encoder.layer.18.output.dense.weight", "vit_model.encoder.layer.18.output.dense.bias", "vit_model.encoder.layer.18.layernorm_before.weight", "vit_model.encoder.layer.18.layernorm_before.bias", "vit_model.encoder.layer.18.layernorm_after.weight", "vit_model.encoder.layer.18.layernorm_after.bias", "vit_model.encoder.layer.19.attention.attention.query.weight", "vit_model.encoder.layer.19.attention.attention.query.bias", "vit_model.encoder.layer.19.attention.attention.key.weight", "vit_model.encoder.layer.19.attention.attention.key.bias", "vit_model.encoder.layer.19.attention.attention.value.weight", "vit_model.encoder.layer.19.attention.attention.value.bias", "vit_model.encoder.layer.19.attention.output.dense.weight", "vit_model.encoder.layer.19.attention.output.dense.bias", "vit_model.encoder.layer.19.intermediate.dense.weight", "vit_model.encoder.layer.19.intermediate.dense.bias", "vit_model.encoder.layer.19.output.dense.weight", "vit_model.encoder.layer.19.output.dense.bias", "vit_model.encoder.layer.19.layernorm_before.weight", "vit_model.encoder.layer.19.layernorm_before.bias", "vit_model.encoder.layer.19.layernorm_after.weight", "vit_model.encoder.layer.19.layernorm_after.bias", "vit_model.encoder.layer.20.attention.attention.query.weight", "vit_model.encoder.layer.20.attention.attention.query.bias", "vit_model.encoder.layer.20.attention.attention.key.weight", "vit_model.encoder.layer.20.attention.attention.key.bias", "vit_model.encoder.layer.20.attention.attention.value.weight", "vit_model.encoder.layer.20.attention.attention.value.bias", "vit_model.encoder.layer.20.attention.output.dense.weight", "vit_model.encoder.layer.20.attention.output.dense.bias", "vit_model.encoder.layer.20.intermediate.dense.weight", "vit_model.encoder.layer.20.intermediate.dense.bias", "vit_model.encoder.layer.20.output.dense.weight", 
"vit_model.encoder.layer.20.output.dense.bias", "vit_model.encoder.layer.20.layernorm_before.weight", "vit_model.encoder.layer.20.layernorm_before.bias", "vit_model.encoder.layer.20.layernorm_after.weight", "vit_model.encoder.layer.20.layernorm_after.bias", "vit_model.encoder.layer.21.attention.attention.query.weight", "vit_model.encoder.layer.21.attention.attention.query.bias", "vit_model.encoder.layer.21.attention.attention.key.weight", "vit_model.encoder.layer.21.attention.attention.key.bias", "vit_model.encoder.layer.21.attention.attention.value.weight", "vit_model.encoder.layer.21.attention.attention.value.bias", "vit_model.encoder.layer.21.attention.output.dense.weight", "vit_model.encoder.layer.21.attention.output.dense.bias", "vit_model.encoder.layer.21.intermediate.dense.weight", "vit_model.encoder.layer.21.intermediate.dense.bias", "vit_model.encoder.layer.21.output.dense.weight", "vit_model.encoder.layer.21.output.dense.bias", "vit_model.encoder.layer.21.layernorm_before.weight", "vit_model.encoder.layer.21.layernorm_before.bias", "vit_model.encoder.layer.21.layernorm_after.weight", "vit_model.encoder.layer.21.layernorm_after.bias", "vit_model.encoder.layer.22.attention.attention.query.weight", "vit_model.encoder.layer.22.attention.attention.query.bias", "vit_model.encoder.layer.22.attention.attention.key.weight", "vit_model.encoder.layer.22.attention.attention.key.bias", "vit_model.encoder.layer.22.attention.attention.value.weight", "vit_model.encoder.layer.22.attention.attention.value.bias", "vit_model.encoder.layer.22.attention.output.dense.weight", "vit_model.encoder.layer.22.attention.output.dense.bias", "vit_model.encoder.layer.22.intermediate.dense.weight", "vit_model.encoder.layer.22.intermediate.dense.bias", "vit_model.encoder.layer.22.output.dense.weight", "vit_model.encoder.layer.22.output.dense.bias", "vit_model.encoder.layer.22.layernorm_before.weight", "vit_model.encoder.layer.22.layernorm_before.bias", 
"vit_model.encoder.layer.22.layernorm_after.weight", "vit_model.encoder.layer.22.layernorm_after.bias", "vit_model.encoder.layer.23.attention.attention.query.weight", "vit_model.encoder.layer.23.attention.attention.query.bias", "vit_model.encoder.layer.23.attention.attention.key.weight", "vit_model.encoder.layer.23.attention.attention.key.bias", "vit_model.encoder.layer.23.attention.attention.value.weight", "vit_model.encoder.layer.23.attention.attention.value.bias", "vit_model.encoder.layer.23.attention.output.dense.weight", "vit_model.encoder.layer.23.attention.output.dense.bias", "vit_model.encoder.layer.23.intermediate.dense.weight", "vit_model.encoder.layer.23.intermediate.dense.bias", "vit_model.encoder.layer.23.output.dense.weight", "vit_model.encoder.layer.23.output.dense.bias", "vit_model.encoder.layer.23.layernorm_before.weight", "vit_model.encoder.layer.23.layernorm_before.bias", "vit_model.encoder.layer.23.layernorm_after.weight", "vit_model.encoder.layer.23.layernorm_after.bias", "vivit_model.encoder.layer.12.attention.attention.query.weight", "vivit_model.encoder.layer.12.attention.attention.query.bias", "vivit_model.encoder.layer.12.attention.attention.key.weight", "vivit_model.encoder.layer.12.attention.attention.key.bias", "vivit_model.encoder.layer.12.attention.attention.value.weight", "vivit_model.encoder.layer.12.attention.attention.value.bias", "vivit_model.encoder.layer.12.attention.output.dense.weight", "vivit_model.encoder.layer.12.attention.output.dense.bias", "vivit_model.encoder.layer.12.intermediate.dense.weight", "vivit_model.encoder.layer.12.intermediate.dense.bias", "vivit_model.encoder.layer.12.output.dense.weight", "vivit_model.encoder.layer.12.output.dense.bias", "vivit_model.encoder.layer.12.layernorm_before.weight", "vivit_model.encoder.layer.12.layernorm_before.bias", "vivit_model.encoder.layer.12.layernorm_after.weight", "vivit_model.encoder.layer.12.layernorm_after.bias", 
"vivit_model.encoder.layer.13.attention.attention.query.weight", "vivit_model.encoder.layer.13.attention.attention.query.bias", "vivit_model.encoder.layer.13.attention.attention.key.weight", "vivit_model.encoder.layer.13.attention.attention.key.bias", "vivit_model.encoder.layer.13.attention.attention.value.weight", "vivit_model.encoder.layer.13.attention.attention.value.bias", "vivit_model.encoder.layer.13.attention.output.dense.weight", "vivit_model.encoder.layer.13.attention.output.dense.bias", "vivit_model.encoder.layer.13.intermediate.dense.weight", "vivit_model.encoder.layer.13.intermediate.dense.bias", "vivit_model.encoder.layer.13.output.dense.weight", "vivit_model.encoder.layer.13.output.dense.bias", "vivit_model.encoder.layer.13.layernorm_before.weight", "vivit_model.encoder.layer.13.layernorm_before.bias", "vivit_model.encoder.layer.13.layernorm_after.weight", "vivit_model.encoder.layer.13.layernorm_after.bias", "vivit_model.encoder.layer.14.attention.attention.query.weight", "vivit_model.encoder.layer.14.attention.attention.query.bias", "vivit_model.encoder.layer.14.attention.attention.key.weight", "vivit_model.encoder.layer.14.attention.attention.key.bias", "vivit_model.encoder.layer.14.attention.attention.value.weight", "vivit_model.encoder.layer.14.attention.attention.value.bias", "vivit_model.encoder.layer.14.attention.output.dense.weight", "vivit_model.encoder.layer.14.attention.output.dense.bias", "vivit_model.encoder.layer.14.intermediate.dense.weight", "vivit_model.encoder.layer.14.intermediate.dense.bias", "vivit_model.encoder.layer.14.output.dense.weight", "vivit_model.encoder.layer.14.output.dense.bias", "vivit_model.encoder.layer.14.layernorm_before.weight", "vivit_model.encoder.layer.14.layernorm_before.bias", "vivit_model.encoder.layer.14.layernorm_after.weight", "vivit_model.encoder.layer.14.layernorm_after.bias", "vivit_model.encoder.layer.15.attention.attention.query.weight", "vivit_model.encoder.layer.15.attention.attention.query.bias", 
"vivit_model.encoder.layer.15.attention.attention.key.weight", "vivit_model.encoder.layer.15.attention.attention.key.bias", "vivit_model.encoder.layer.15.attention.attention.value.weight", "vivit_model.encoder.layer.15.attention.attention.value.bias", "vivit_model.encoder.layer.15.attention.output.dense.weight", "vivit_model.encoder.layer.15.attention.output.dense.bias", "vivit_model.encoder.layer.15.intermediate.dense.weight", "vivit_model.encoder.layer.15.intermediate.dense.bias", "vivit_model.encoder.layer.15.output.dense.weight", "vivit_model.encoder.layer.15.output.dense.bias", "vivit_model.encoder.layer.15.layernorm_before.weight", "vivit_model.encoder.layer.15.layernorm_before.bias", "vivit_model.encoder.layer.15.layernorm_after.weight", "vivit_model.encoder.layer.15.layernorm_after.bias", "vivit_model.encoder.layer.16.attention.attention.query.weight", "vivit_model.encoder.layer.16.attention.attention.query.bias", "vivit_model.encoder.layer.16.attention.attention.key.weight", "vivit_model.encoder.layer.16.attention.attention.key.bias", "vivit_model.encoder.layer.16.attention.attention.value.weight", "vivit_model.encoder.layer.16.attention.attention.value.bias", "vivit_model.encoder.layer.16.attention.output.dense.weight", "vivit_model.encoder.layer.16.attention.output.dense.bias", "vivit_model.encoder.layer.16.intermediate.dense.weight", "vivit_model.encoder.layer.16.intermediate.dense.bias", "vivit_model.encoder.layer.16.output.dense.weight", "vivit_model.encoder.layer.16.output.dense.bias", "vivit_model.encoder.layer.16.layernorm_before.weight", "vivit_model.encoder.layer.16.layernorm_before.bias", "vivit_model.encoder.layer.16.layernorm_after.weight", "vivit_model.encoder.layer.16.layernorm_after.bias", "vivit_model.encoder.layer.17.attention.attention.query.weight", "vivit_model.encoder.layer.17.attention.attention.query.bias", "vivit_model.encoder.layer.17.attention.attention.key.weight", "vivit_model.encoder.layer.17.attention.attention.key.bias", 
"vivit_model.encoder.layer.17.attention.attention.value.weight", "vivit_model.encoder.layer.17.attention.attention.value.bias", "vivit_model.encoder.layer.17.attention.output.dense.weight", "vivit_model.encoder.layer.17.attention.output.dense.bias", "vivit_model.encoder.layer.17.intermediate.dense.weight", "vivit_model.encoder.layer.17.intermediate.dense.bias", "vivit_model.encoder.layer.17.output.dense.weight", "vivit_model.encoder.layer.17.output.dense.bias", "vivit_model.encoder.layer.17.layernorm_before.weight", "vivit_model.encoder.layer.17.layernorm_before.bias", "vivit_model.encoder.layer.17.layernorm_after.weight", "vivit_model.encoder.layer.17.layernorm_after.bias", "vivit_model.encoder.layer.18.attention.attention.query.weight", "vivit_model.encoder.layer.18.attention.attention.query.bias", "vivit_model.encoder.layer.18.attention.attention.key.weight", "vivit_model.encoder.layer.18.attention.attention.key.bias", "vivit_model.encoder.layer.18.attention.attention.value.weight", "vivit_model.encoder.layer.18.attention.attention.value.bias", "vivit_model.encoder.layer.18.attention.output.dense.weight", "vivit_model.encoder.layer.18.attention.output.dense.bias", "vivit_model.encoder.layer.18.intermediate.dense.weight", "vivit_model.encoder.layer.18.intermediate.dense.bias", "vivit_model.encoder.layer.18.output.dense.weight", "vivit_model.encoder.layer.18.output.dense.bias", "vivit_model.encoder.layer.18.layernorm_before.weight", "vivit_model.encoder.layer.18.layernorm_before.bias", "vivit_model.encoder.layer.18.layernorm_after.weight", "vivit_model.encoder.layer.18.layernorm_after.bias", "vivit_model.encoder.layer.19.attention.attention.query.weight", "vivit_model.encoder.layer.19.attention.attention.query.bias", "vivit_model.encoder.layer.19.attention.attention.key.weight", "vivit_model.encoder.layer.19.attention.attention.key.bias", "vivit_model.encoder.layer.19.attention.attention.value.weight", "vivit_model.encoder.layer.19.attention.attention.value.bias", 
"vivit_model.encoder.layer.19.attention.output.dense.weight", "vivit_model.encoder.layer.19.attention.output.dense.bias", "vivit_model.encoder.layer.19.intermediate.dense.weight", "vivit_model.encoder.layer.19.intermediate.dense.bias", "vivit_model.encoder.layer.19.output.dense.weight", "vivit_model.encoder.layer.19.output.dense.bias", "vivit_model.encoder.layer.19.layernorm_before.weight", "vivit_model.encoder.layer.19.layernorm_before.bias", "vivit_model.encoder.layer.19.layernorm_after.weight", "vivit_model.encoder.layer.19.layernorm_after.bias", "vivit_model.encoder.layer.20.attention.attention.query.weight", "vivit_model.encoder.layer.20.attention.attention.query.bias", "vivit_model.encoder.layer.20.attention.attention.key.weight", "vivit_model.encoder.layer.20.attention.attention.key.bias", "vivit_model.encoder.layer.20.attention.attention.value.weight", "vivit_model.encoder.layer.20.attention.attention.value.bias", "vivit_model.encoder.layer.20.attention.output.dense.weight", "vivit_model.encoder.layer.20.attention.output.dense.bias", "vivit_model.encoder.layer.20.intermediate.dense.weight", "vivit_model.encoder.layer.20.intermediate.dense.bias", "vivit_model.encoder.layer.20.output.dense.weight", "vivit_model.encoder.layer.20.output.dense.bias", "vivit_model.encoder.layer.20.layernorm_before.weight", "vivit_model.encoder.layer.20.layernorm_before.bias", "vivit_model.encoder.layer.20.layernorm_after.weight", "vivit_model.encoder.layer.20.layernorm_after.bias", "vivit_model.encoder.layer.21.attention.attention.query.weight", "vivit_model.encoder.layer.21.attention.attention.query.bias", "vivit_model.encoder.layer.21.attention.attention.key.weight", "vivit_model.encoder.layer.21.attention.attention.key.bias", "vivit_model.encoder.layer.21.attention.attention.value.weight", "vivit_model.encoder.layer.21.attention.attention.value.bias", "vivit_model.encoder.layer.21.attention.output.dense.weight", "vivit_model.encoder.layer.21.attention.output.dense.bias", 
"vivit_model.encoder.layer.21.intermediate.dense.weight", "vivit_model.encoder.layer.21.intermediate.dense.bias", "vivit_model.encoder.layer.21.output.dense.weight", "vivit_model.encoder.layer.21.output.dense.bias", "vivit_model.encoder.layer.21.layernorm_before.weight", "vivit_model.encoder.layer.21.layernorm_before.bias", "vivit_model.encoder.layer.21.layernorm_after.weight", "vivit_model.encoder.layer.21.layernorm_after.bias", "vivit_model.encoder.layer.22.attention.attention.query.weight", "vivit_model.encoder.layer.22.attention.attention.query.bias", "vivit_model.encoder.layer.22.attention.attention.key.weight", "vivit_model.encoder.layer.22.attention.attention.key.bias", "vivit_model.encoder.layer.22.attention.attention.value.weight", "vivit_model.encoder.layer.22.attention.attention.value.bias", "vivit_model.encoder.layer.22.attention.output.dense.weight", "vivit_model.encoder.layer.22.attention.output.dense.bias", "vivit_model.encoder.layer.22.intermediate.dense.weight", "vivit_model.encoder.layer.22.intermediate.dense.bias", "vivit_model.encoder.layer.22.output.dense.weight", "vivit_model.encoder.layer.22.output.dense.bias", "vivit_model.encoder.layer.22.layernorm_before.weight", "vivit_model.encoder.layer.22.layernorm_before.bias", "vivit_model.encoder.layer.22.layernorm_after.weight", "vivit_model.encoder.layer.22.layernorm_after.bias", "vivit_model.encoder.layer.23.attention.attention.query.weight", "vivit_model.encoder.layer.23.attention.attention.query.bias", "vivit_model.encoder.layer.23.attention.attention.key.weight", "vivit_model.encoder.layer.23.attention.attention.key.bias", "vivit_model.encoder.layer.23.attention.attention.value.weight", "vivit_model.encoder.layer.23.attention.attention.value.bias", "vivit_model.encoder.layer.23.attention.output.dense.weight", "vivit_model.encoder.layer.23.attention.output.dense.bias", "vivit_model.encoder.layer.23.intermediate.dense.weight", "vivit_model.encoder.layer.23.intermediate.dense.bias", 
"vivit_model.encoder.layer.23.output.dense.weight", "vivit_model.encoder.layer.23.output.dense.bias", "vivit_model.encoder.layer.23.layernorm_before.weight", "vivit_model.encoder.layer.23.layernorm_before.bias", "vivit_model.encoder.layer.23.layernorm_after.weight", "vivit_model.encoder.layer.23.layernorm_after.bias". 
	size mismatch for vit_model.embeddings.cls_token: copying a param with shape torch.Size([1, 1, 768]) from checkpoint, the shape in current model is torch.Size([1, 1, 1024]).
	size mismatch for vit_model.embeddings.position_embeddings: copying a param with shape torch.Size([1, 197, 768]) from checkpoint, the shape in current model is torch.Size([1, 197, 1024]).
	size mismatch for vit_model.embeddings.patch_embeddings.projection.weight: copying a param with shape torch.Size([768, 3, 16, 16]) from checkpoint, the shape in current model is torch.Size([1024, 3, 16, 16]).
	size mismatch for vit_model.embeddings.patch_embeddings.projection.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.0.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.0.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.0.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.0.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.0.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.0.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.0.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.0.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.0.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.0.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.0.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.0.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.0.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.0.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.0.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.0.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.1.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.1.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.1.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.1.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.1.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.1.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.1.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.1.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.1.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.1.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.1.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.1.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.1.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.1.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.1.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.1.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.2.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.2.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.2.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.2.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.2.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.2.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.2.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.2.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.2.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.2.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.2.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.2.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.2.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.2.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.2.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.2.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.3.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.3.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.3.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.3.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.3.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.3.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.3.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.3.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.3.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.3.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.3.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.3.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.3.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.3.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.3.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.3.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.4.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.4.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.4.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.4.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.4.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.4.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.4.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.4.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.4.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.4.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.4.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.4.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.4.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.4.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.4.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.4.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.5.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.5.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.5.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.5.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.5.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.5.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.5.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.5.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.5.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.5.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.5.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.5.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.5.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.5.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.5.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.5.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.6.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.6.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.6.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.6.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.6.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.6.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.6.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.6.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.6.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.6.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.6.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.6.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.6.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.6.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.6.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.6.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.7.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.7.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.7.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.7.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.7.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.7.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.7.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.7.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.7.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.7.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.7.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.7.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.7.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.7.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.7.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.7.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.8.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.8.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.8.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.8.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.8.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.8.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.8.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.8.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.8.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.8.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.8.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.8.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.8.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.8.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.8.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.8.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.9.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.9.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.9.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.9.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.9.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.9.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.9.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.9.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.9.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.9.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.9.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.9.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.9.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.9.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.9.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.9.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.10.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.10.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.10.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.10.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.10.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.10.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.10.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.10.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.10.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.10.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.10.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.10.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.10.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.10.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.10.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.10.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.11.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.11.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.11.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.11.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.11.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.11.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.11.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.encoder.layer.11.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.11.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vit_model.encoder.layer.11.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vit_model.encoder.layer.11.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vit_model.encoder.layer.11.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.11.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.11.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.11.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.encoder.layer.11.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vit_model.pooler.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vit_model.pooler.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.embeddings.cls_token: copying a param with shape torch.Size([1, 1, 768]) from checkpoint, the shape in current model is torch.Size([1, 1, 1024]).
	size mismatch for vivit_model.embeddings.position_embeddings: copying a param with shape torch.Size([1, 3137, 768]) from checkpoint, the shape in current model is torch.Size([1, 3137, 1024]).
	size mismatch for vivit_model.embeddings.patch_embeddings.projection.weight: copying a param with shape torch.Size([768, 3, 2, 16, 16]) from checkpoint, the shape in current model is torch.Size([1024, 3, 2, 16, 16]).
	size mismatch for vivit_model.embeddings.patch_embeddings.projection.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.0.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.0.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.0.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.0.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.0.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.0.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.0.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.0.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.0.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.0.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.0.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.0.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.0.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.0.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.0.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.0.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.1.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.1.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.1.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.1.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.1.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.1.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.1.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.1.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.1.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.1.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.1.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.1.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.1.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.1.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.1.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.1.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.2.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.2.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.2.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.2.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.2.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.2.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.2.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.2.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.2.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.2.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.2.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.2.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.2.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.2.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.2.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.2.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.3.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.3.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.3.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.3.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.3.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.3.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.3.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.3.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.3.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.3.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.3.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.3.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.3.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.3.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.3.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.3.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.4.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.4.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.4.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.4.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.4.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.4.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.4.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.4.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.4.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.4.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.4.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.4.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.4.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.4.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.4.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.4.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.5.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.5.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.5.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.5.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.5.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.5.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.5.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.5.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.5.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.5.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.5.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.5.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.5.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.5.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.5.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.5.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.6.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.6.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.6.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.6.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.6.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.6.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.6.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.6.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.6.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.6.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.6.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.6.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.6.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.6.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.6.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.6.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.7.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.7.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.7.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.7.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.7.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.7.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.7.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.7.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.7.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.7.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.7.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.7.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.7.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.7.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.7.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.7.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.8.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.8.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.8.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.8.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.8.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.8.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.8.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.8.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.8.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.8.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.8.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.8.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.8.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.8.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.8.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.8.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.9.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.9.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.9.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.9.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.9.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.9.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.9.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.9.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.9.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.9.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.9.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.9.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.9.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.9.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.9.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.9.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.10.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.10.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.10.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.10.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.10.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.10.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.10.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.10.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.10.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.10.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.10.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.10.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.10.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.10.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.10.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.10.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.11.attention.attention.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.11.attention.attention.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.11.attention.attention.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.11.attention.attention.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.11.attention.attention.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.11.attention.attention.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.11.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.encoder.layer.11.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.11.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
	size mismatch for vivit_model.encoder.layer.11.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
	size mismatch for vivit_model.encoder.layer.11.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
	size mismatch for vivit_model.encoder.layer.11.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.11.layernorm_before.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.11.layernorm_before.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.11.layernorm_after.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.encoder.layer.11.layernorm_after.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for vivit_model.pooler.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for vivit_model.pooler.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).

Answer 6 · 2024-12-20T12:50:41.000Z

I have the same problem.

Answer 7 · 2025-01-13T12:36:17.000Z

Just tried running this model. Getting the same error:

Traceback (most recent call last):
  File "/content/M2UGen/M2UGen/gradio_app.py", line 75, in
    load_result = model.load_state_dict(new_ckpt, strict=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for M2UGen

I get the same problem. How did you resolve it?
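
One way to narrow this down before calling load_state_dict is to list which keys are missing and which shapes disagree. The sketch below is not from the M2UGen code: `compare_shapes` is a hypothetical helper, and the shape dictionaries are toy values copied from the error messages in this thread. The 768 vs 1024 widths and the missing `layer.12` keys are consistent with the checkpoint holding base-size encoder weights while the model builds a large-size encoder, though nothing in this thread confirms that as the cause.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def compare_shapes(ckpt: Dict[str, Shape], model: Dict[str, Shape]):
    """Return (missing, unexpected, mismatched) between a checkpoint and a model.

    Both arguments map parameter names to shape tuples. With PyTorch they can
    be built as {k: tuple(v.shape) for k, v in state_dict.items()}.
    """
    # Keys the model expects but the checkpoint does not provide.
    missing = sorted(k for k in model if k not in ckpt)
    # Keys the checkpoint provides but the model does not have.
    unexpected = sorted(k for k in ckpt if k not in model)
    # Keys present on both sides whose shapes differ.
    mismatched = sorted(
        (k, ckpt[k], model[k]) for k in ckpt if k in model and ckpt[k] != model[k]
    )
    return missing, unexpected, mismatched


# Toy shapes taken from the error text (assumption: illustrative subset only).
ckpt_shapes = {
    "vivit_model.layernorm.weight": (768,),
    "vivit_model.pooler.dense.weight": (768, 768),
}
model_shapes = {
    "vivit_model.layernorm.weight": (1024,),
    "vivit_model.pooler.dense.weight": (1024, 1024),
    "vit_model.encoder.layer.12.attention.attention.query.weight": (1024, 1024),
}

missing, unexpected, mismatched = compare_shapes(ckpt_shapes, model_shapes)
for name, ckpt_shape, model_shape in mismatched:
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
for name in missing:
    print(f"missing from checkpoint: {name}")
```

If every mismatch shows the same pair of widths, as it does here, it is worth checking which encoder variants the model configuration points to and whether they match the ones the checkpoint was saved with.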