MCG-NJU/VideoMAE

VideoMAE ViT-H pre-train does not contain the decoder weights

sandstorm12 opened this issue · 2 comments

Problem

The VideoMAE ViT-H and ViT-S pre-trained Kinetics weights seem to have a problem. When loading the weights of other pre-trained models, such as ViT-L or ViT-B, the state_dict contains the weights for the decoder layers, but this is not the case for ViT-H and ViT-S. As a result, these checkpoints cannot be loaded into an encoder/decoder setup.

How to reproduce

To reproduce, download the weights and load the state_dict. Comparing it to the other pre-trained checkpoints, you can see that the decoder weights are missing.

import gdown
import torch

URL = "https://drive.google.com/file/d/1AJQR1Rsi2N1pDn9tLyJ8DQrUREiBA1bO/view?usp=sharing"
output_name = "checkpoint.pth"
gdown.cached_download(URL, output_name)

state_dict = torch.load(output_name, map_location="cpu")
print(state_dict["module"])

The state_dict is very large, so I don't include the output here.
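A quick way to confirm the problem without printing the whole dict is to scan the checkpoint's keys for a decoder prefix. This is a minimal sketch: the helper name and the toy key names are hypothetical, standing in for a real checkpoint's `state_dict["module"]`.

```python
def missing_decoder_keys(state_dict):
    """Return True when no key in the checkpoint belongs to the decoder."""
    return not any(k.startswith("decoder.") for k in state_dict)

# Toy dicts standing in for real checkpoints' state_dict["module"]:
encoder_only = {"encoder.blocks.0.attn.qkv.weight": 0}
full = {**encoder_only, "decoder.blocks.0.attn.qkv.weight": 0}

print(missing_decoder_keys(encoder_only))  # True  -- ViT-H / ViT-S case
print(missing_decoder_keys(full))          # False -- ViT-L / ViT-B case
```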

innat commented

The link for the pre-trained VideoMAE ViT-H checkpoint is sort of wrong:
it contains only the encoder part.
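If only the encoder weights are available, one workaround is to load them with `strict=False`, which reports the absent decoder keys instead of raising. The tiny model below is purely illustrative, not the repo's actual architecture.

```python
import torch.nn as nn

# Hypothetical stand-in for an encoder/decoder pretraining model;
# the submodule names mirror the "encoder."/"decoder." key prefixes.
class PretrainModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.decoder = nn.Linear(4, 4)

model = PretrainModel()
# Simulate an encoder-only checkpoint like the ViT-H one:
ckpt = {k: v for k, v in model.state_dict().items() if k.startswith("encoder.")}
result = model.load_state_dict(ckpt, strict=False)
print(result.missing_keys)  # the decoder.* keys absent from the checkpoint
```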