
Open-set video recognition

kazunaritakeichi opened this issue · 5 comments

Dear author, thank you for publishing your work!
I want to try open-set video recognition.
How should I go about it? In other words, how do I get the features from a video?

Thanks for your interest, @kazunaritakeichi. To obtain features for a video, you need to run a pre-trained CLIP model frame by frame and then aggregate the per-frame features. After that, you may use bilinear sampling to get the temporal points over the features by adjusting the strides. This gives you a fixed-dimension video feature, the same as with I3D, C3D, and TSN.
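As a minimal sketch of the resampling step described above (not the author's actual preprocessing), the per-frame features can be interpolated along the time axis to a fixed number of temporal points. The function name `resample_temporal`, the target length of 100, and the use of linear interpolation via `torch.nn.functional.interpolate` are assumptions made for illustration; 512 is the embedding size of the ViT-B CLIP models.

```python
import torch
import torch.nn.functional as F


def resample_temporal(features: torch.Tensor, num_steps: int) -> torch.Tensor:
    """Resample (num_frames, dim) features to (num_steps, dim) along time.

    Hypothetical helper: uses linear interpolation over the temporal axis.
    """
    # F.interpolate expects (batch, channels, length), so treat the
    # feature dimension as channels and time as length.
    x = features.float().t().unsqueeze(0)  # (1, dim, num_frames)
    x = F.interpolate(x, size=num_steps, mode="linear", align_corners=True)
    return x.squeeze(0).t()  # (num_steps, dim)


# Random stand-in for per-frame CLIP features of a 137-frame video.
frame_features = torch.randn(137, 512)
video_feature = resample_temporal(frame_features, num_steps=100)
print(video_feature.shape)
```

With `align_corners=True` the first and last output rows coincide with the first and last input frames, so the whole video span is covered regardless of its length.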

For starters, you may look at this file for the data pre-processing from the AFSD paper (CVPR 21): Preprocessing Script

After this step, you may follow this step for passing the features into the model: Passing into Model

Let me know if this answers your question.

Thank you for your response.
I think I was able to get the features with code like the one below, based on openai/CLIP.

I have a question.
Which CLIP should I use for extracting features: ViT-B/16 (as in the README) or ViT-B/32 (as in the text model)?
For my video, I got better results with ViT-B/32.

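import clip
import cv2
import numpy as np
import torch
from PIL import Image

if __name__ == '__main__':
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    # model, preprocess = clip.load("ViT-B/16", device=device)

    cap = cv2.VideoCapture('/path/to/video')

    features = []
    while True:
        ret, img = cap.read()
        if not ret:  # no more frames to read
            break
        # OpenCV returns BGR; reverse the channel axis to get RGB for PIL.
        image = preprocess(Image.fromarray(img[:, :, ::-1])).unsqueeze(0).to(device)
        with torch.no_grad():
            feature = model.encode_image(image)
        features.append(feature)
    cap.release()

    # Stack the per-frame features into (num_frames, feature_dim) and save.
    features = torch.cat(features)
    np.save('/path/to/features.npy', features.to('cpu').detach().numpy().copy())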

Yes, this is a good way to extract the visual features. Regarding the choice of ViT, I think you can use any variant. Of course, I used ViT-B/16 because the other competitors used it, for a fair comparison. You can use any variant, but make sure both the text and video features come from the same ViT variant.
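One way to keep the two sides consistent is to derive the text checkpoint from the variant used for the video features. The sketch below is only an illustration: `CLIP_VARIANTS` and `text_checkpoint_for` are hypothetical names, and the mapping pairs the model names used by the openai/CLIP package with the corresponding Hugging Face checkpoint ids.

```python
# Names used by the openai/CLIP package -> matching Hugging Face checkpoint ids.
CLIP_VARIANTS = {
    "ViT-B/32": "openai/clip-vit-base-patch32",
    "ViT-B/16": "openai/clip-vit-base-patch16",
}


def text_checkpoint_for(visual_variant: str) -> str:
    """Return the checkpoint id matching the variant used for video features."""
    try:
        return CLIP_VARIANTS[visual_variant]
    except KeyError:
        # Fail loudly rather than silently mixing two different variants.
        raise ValueError(f"unknown CLIP variant: {visual_variant!r}") from None


print(text_checkpoint_for("ViT-B/32"))
```

The returned id could then be passed to `CLIPModel.from_pretrained` and `CLIPTokenizer.from_pretrained`, so that the text model always follows the choice made for feature extraction.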

Thank you for your reply.

The default text model appears to be ViT-B/32.


Lines 36 to 37 in a574ca6

self.txt_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").float()
self.tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

Since the features you provide come from ViT-B/16, should I modify the code as below when using those features?

self.txt_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").float() 
self.tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16") 

When I made that change, I got the following error.

# python 
Traceback (most recent call last):
  File "", line 46, in <module>
  File "/root/local/python-3.7.3/lib/python3.7/site-packages/torch/nn/modules/", line 1407, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DataParallel:
        size mismatch for module.txt_model.vision_model.embeddings.position_ids: copying a param with shape torch.Size([1, 50]) from checkpoint, the shape in current model is torch.Size([1, 197]).
        size mismatch for module.txt_model.vision_model.embeddings.patch_embedding.weight: copying a param with shape torch.Size([768, 3, 32, 32]) from checkpoint, the shape in current model is torch.Size([768, 3, 16, 16]).
        size mismatch for module.txt_model.vision_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([50, 768]) from checkpoint, the shape in current model is torch.Size([197, 768]).

I used the checkpoint from this link.

Since the text model of this checkpoint is based on ViT-B/32, should I use features coming from ViT-B/32 with this checkpoint?

Thanks for pointing it out. The error occurs because the pre-trained model was trained using ViT-B/32. After your change, the ViT-B/32 parameter shapes stored in the checkpoint no longer match those of the ViT-B/16 text model.
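The shapes in the traceback follow from the patch size. As a small worked example (assuming the 224x224 input resolution of the ViT-B CLIP models), the number of position embeddings is one per image patch plus one class token. The function name `num_position_embeddings` is hypothetical.

```python
def num_position_embeddings(image_size: int, patch_size: int) -> int:
    """Positions in the vision transformer: one per patch plus the class token."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1


for patch_size in (32, 16):
    print(patch_size, num_position_embeddings(224, patch_size))
# 32 -> 7 * 7 + 1 = 50, 16 -> 14 * 14 + 1 = 197
```

These are the 50 and 197 reported for `position_ids` and `position_embedding.weight`. The mismatched keys sit under `txt_model.vision_model`, which suggests that the loaded `CLIPModel` carries its vision tower along with the text encoder, so the checkpoint and the model have to use the same variant.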