sauradip/STALE

Open-set video recognition

kazunaritakeichi opened this issue · 5 comments

Dear author, thank you for publishing your work!
I want to try open-set video recognition.
How can I do this? In other words, how do I extract the features from a video?

Thanks for your interest @kazunaritakeichi. To obtain features for a video, you need to run a pre-trained CLIP model frame by frame and then aggregate the per-frame features. After that, you may use bilinear sampling to get temporal points over the feature sequence by adjusting the strides. This gives you a fixed-dimension video feature, just like I3D, C3D, and TSN features.
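As a rough sketch of that rescaling step (assuming per-frame features of shape (T, D) and a hypothetical target length of 100 snippets; for a 1-D temporal axis, linear interpolation plays the role of the bilinear sampling mentioned above):

import torch
import torch.nn.functional as F

def rescale_features(frame_features, target_len=100):
    # frame_features: (T, D) per-frame CLIP features; target_len is a
    # hypothetical snippet count -- pick whatever the downstream model expects.
    x = frame_features.t().unsqueeze(0)  # (1, D, T): interpolate wants (N, C, T)
    x = F.interpolate(x, size=target_len, mode='linear', align_corners=False)
    return x.squeeze(0).t()  # (target_len, D), fixed-size like I3D/C3D/TSN features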

For starters, you may look into this file for the data pre-processing, from the AFSD paper (CVPR 2021): Preprocessing Script

After this step, you may follow these instructions for passing the features into the model: Passing into Model

Let me know if this answers your question.

@sauradip
Thank you for your response.
I think I was able to extract features with the code below, referring to openai/CLIP.

I have a question.
Which CLIP model should I use for extracting features: ViT-B/16 (per the README) or ViT-B/32 (per the text model)?
In the case of my video, I got a better result with ViT-B/32.

import cv2
import numpy as np
import torch
import clip
from PIL import Image

if __name__ == '__main__':
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    # model, preprocess = clip.load("ViT-B/16", device=device)

    cap = cv2.VideoCapture('/path/to/video')

    features = []
    while True:
        ret, img = cap.read()
        if not ret:
            break
        # OpenCV reads BGR; reverse the channel axis to get RGB for PIL/CLIP
        image = preprocess(Image.fromarray(img[:, :, ::-1])).unsqueeze(0).to(device)
        with torch.no_grad():
            feature = model.encode_image(image)
        features.append(feature)

    cap.release()

    features = torch.cat(features)  # (num_frames, feature_dim)
    np.save('/path/to/features.npy', features.to('cpu').detach().numpy().copy())

Yes, this is a good way to extract the visual features. Regarding the choice of ViT, I think you can use any variant. Of course, I used ViT-B/16 because the other competitors used it, for fair comparison. You can use any variant, but make sure both the text and video features come from the same style of ViT.
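For instance, a minimal sketch of keeping the text side on the same variant, using the openai/CLIP package with hypothetical class-name prompts:

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)  # same variant as the frame features

# Hypothetical class-name prompts
texts = clip.tokenize(["a video of high jump", "a video of surfing"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(texts)  # same embedding space as encode_image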

Thank you for your reply.

The default text model looks like ViT-B/32.

STALE/stale_model.py

Lines 36 to 37 in a574ca6

self.txt_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").float()
self.tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

Considering that the features you provide come from ViT-B/16, should I modify the code as below when using those features?

self.txt_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").float() 
self.tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16") 

When I modified it like that, I got the following error.

# python stale_inference.py 
Traceback (most recent call last):
  File "stale_inference.py", line 46, in <module>
    model.load_state_dict(checkpoint['state_dict'])
  File "/root/local/python-3.7.3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1407, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DataParallel:
        size mismatch for module.txt_model.vision_model.embeddings.position_ids: copying a param with shape torch.Size([1, 50]) from checkpoint, the shape in current model is torch.Size([1, 197]).
        size mismatch for module.txt_model.vision_model.embeddings.patch_embedding.weight: copying a param with shape torch.Size([768, 3, 32, 32]) from checkpoint, the shape in current model is torch.Size([768, 3, 16, 16]).
        size mismatch for module.txt_model.vision_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([50, 768]) from checkpoint, the shape in current model is torch.Size([197, 768]).

I used the checkpoint from this link.

Since the text model of this checkpoint is based on ViT-B/32, should I use features coming from ViT-B/32 for this checkpoint?

Thanks for pointing it out. The error occurs because the pre-trained model was trained using ViT-B/32, so after your change, the ViT-B/32 checkpoint keys mismatch the ViT-B/16 shapes on the text side.
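If in doubt, one way to check which variant a checkpoint expects is to inspect the patch-embedding shape directly (the key name is taken from the traceback above; the path is a placeholder):

import torch

checkpoint = torch.load('/path/to/checkpoint.pth', map_location='cpu')
key = 'module.txt_model.vision_model.embeddings.patch_embedding.weight'
print(checkpoint['state_dict'][key].shape)
# torch.Size([768, 3, 32, 32]) -> patch size 32, i.e. ViT-B/32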