google-research/big_vision

[Question] How to run inference / captioning on a short video?


I read in the README that PaliGemma can caption a short video. Can anyone guide me on how to do that?

Does it extract every frame from the video? Does the PaliGemma tokenizer support video directly, or do I need to convert my video to a NumPy array first?

PaliGemma can process a stack of frames without architecture modifications. We also released preprocessing ops to subsample videos or extract frames with a fixed stride. There are fine-tuning configs for several academic video datasets, for example MSR-VTT:
https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/transfers/msrvtt_cap.py

However, there are no fine-tuned checkpoints available, and some extra work is required to set up data loading for fine-tuning. Please see the video configs for details.
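
For illustration, here is a minimal sketch of fixed-stride frame sampling using OpenCV. This is not the released big_vision preprocessing op; the function name, frame count, and defaults are just placeholders showing how to produce the [num_frames, H, W, 3] stack the model consumes.

```python
# Minimal sketch (not the big_vision preprocessing op): sample a fixed number
# of frames from a video at a regular stride and stack them into one array.
import cv2
import numpy as np


def sample_frames(video_path: str, num_frames: int = 16) -> np.ndarray:
    """Returns a [num_frames, H, W, 3] uint8 RGB array sampled at a fixed stride."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the clip.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame_bgr = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    if not frames:
        raise ValueError(f"Could not decode any frames from {video_path}")
    return np.stack(frames)


# frames = sample_frames("clip.mp4")  # feed this stack as the image input
```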

Wow, great thank you for the guidance. 🙏

@mitscha Could you please share some example code showing how one could do short video captioning with a base pretrained model? I'm very interested in this.

Are the paligemma-mix models also fine-tuned for video captioning?

> PaliGemma can process a stack of frames without architecture modifications. We also released preprocessing ops to subsample videos or extract frames with a fixed stride. There are fine-tuning configs for several academic video datasets, for example MSR-VTT https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/transfers/msrvtt_cap.py
>
> However, there are no fine-tuned checkpoints available, and some extra work is required to set up data loading for fine-tuning. Please see the video configs for details.

Sorry, what kind of extra work is needed?

@mitscha Just following up on your point. It seems that simply stacking frames (i.e. passing a tensor of frames of shape [16, width, height, 3]) to the processor doesn't quite work. The error returned is:

ValueError: Number of images does not match number of special image tokens in the input text. Got 1024 image tokens in the text but 16384 tokens from image embeddings.

The PaliGemma demo on HF doesn't say anything about passing image tokens, so I assumed this is handled implicitly when a video is passed, but something about my input is not triggering that. What is the standard usage for video inputs to the paligemma-3b models via the HF API? It would be great if an example could be added to the README there.

Edit: I just realized that video inference may never have been set up in the processor for the PyTorch port. It would be nice to get confirmation on whether this is true.
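
In the meantime, one possible workaround (a sketch only, assuming the HF transformers PaliGemma port; the checkpoint id, prompt, and generation settings below are just examples, not official video support) is to caption each sampled frame independently with the documented single-image API and aggregate the captions yourself:

```python
# Per-frame captioning workaround using only the single-image HF API.
# Checkpoint id and prompt are examples; this is not official video support.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # example mix checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()


def caption_frame(frame_rgb) -> str:
    """Captions a single RGB frame (H, W, 3 uint8 array or PIL image)."""
    image = frame_rgb if isinstance(frame_rgb, Image.Image) else Image.fromarray(frame_rgb)
    inputs = processor(text="caption en", images=image, return_tensors="pt")
    input_len = inputs["input_ids"].shape[-1]
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=30, do_sample=False)
    # Drop the prompt tokens and decode only the generated caption.
    return processor.decode(output[0][input_len:], skip_special_tokens=True)


# frames = sample_frames("clip.mp4")             # e.g. the stride-sampling sketch above
# captions = [caption_frame(f) for f in frames]  # then summarize/aggregate as needed
```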