NVIDIA/metropolis-nim-workflows

Best practice to maintain context across multiple frames


Creating this issue to pool together ideas for maintaining context across multiple frames.

Context

I was playing with metropolis-nim-workflows/nim_workflows/vlm_alerts/vlm_nim_workshop.ipynb - Part 3.2 Interactive Video Pipeline. I tested it with a video of a drawer toppling over while a baby was playing with it (https://www.youtube.com/watch?v=Ucuiu1giM-M).

I tested it with prompts such as:

  • is the baby in danger? yes or no
  • is the drawer falling down? yes or no
  • describe what is happening

Problem

The model's response to "is the drawer falling down? yes or no" is always "no", which could be due to many factors: poor video quality, model performance, etc.

However, this also made me realise that the current approach does not maintain context across frames, since at any one time we only send the current frame to the model.

Ideation

The most straightforward approach would be to send multiple frames to the model, e.g. the last 10 frames sampled at 1 frame per second. I'll test this out and see if it works.
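
For reference, here is a minimal sketch of that idea using OpenCV (not taken from the notebook itself); the 1 fps rate, 10-frame window, and video path are placeholders:

```python
# Minimal sketch: sample ~1 frame per second from a video with OpenCV and
# keep a rolling window of the most recent 10 frames as context.
from collections import deque
import cv2

def sample_frames(video_path, fps_target=1, max_frames=10):
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back if metadata is missing
    step = max(int(round(native_fps / fps_target)), 1)

    frames = deque(maxlen=max_frames)  # rolling window of the latest sampled frames
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return list(frames)

# Usage: frames = sample_frames("drawer_video.mp4"), then send the batch to a multi-image VLM.
```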

Does anyone have other ideas on how to maintain context across frames?

Hi @hengkuanwee,

Yes, you are correct: sending a single image at a time to the model limits its ability to detect events that take place over time. Currently, most of the vision language models on build.nvidia.com can only accept 1 image per request. You could experiment with https://build.nvidia.com/microsoft/phi-3_5-vision-instruct, which is currently the only multi-image VLM we have; it can take up to 8 frames in one request.
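
In case it helps, here is a rough sketch of what a multi-image request might look like. The invoke URL, the base64 `<img>`-tag payload, and the response shape are assumptions based on the pattern used by other VLM NIMs in these workflows, so please verify them against the model card on build.nvidia.com before relying on this:

```python
# Hedged sketch (not a confirmed API reference): send several frames in one
# request to the phi-3.5-vision NIM. Endpoint and payload format are assumed.
import base64
import cv2
import requests

INVOKE_URL = "https://ai.api.nvidia.com/v1/vlm/microsoft/phi-3_5-vision-instruct"  # assumed endpoint
API_KEY = "nvapi-..."  # your NVIDIA API key

def encode_frame(frame):
    # JPEG-encode an OpenCV frame and return it as a base64 string.
    ok, buf = cv2.imencode(".jpg", frame)
    return base64.b64encode(buf).decode("utf-8")

def ask_vlm(frames, question, max_frames=8):
    # Embed up to 8 frames as data-URI <img> tags after the question text.
    image_tags = "".join(
        f'<img src="data:image/jpeg;base64,{encode_frame(f)}" />'
        for f in frames[:max_frames]
    )
    payload = {
        "messages": [{"role": "user", "content": question + " " + image_tags}],
        "max_tokens": 128,
        "temperature": 0.2,
    }
    headers = {"Authorization": f"Bearer {API_KEY}", "Accept": "application/json"}
    resp = requests.post(INVOKE_URL, headers=headers, json=payload)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Usage with the sampled frames from the earlier sketch:
# answer = ask_vlm(frames, "Is the drawer falling down? Answer yes or no.")
```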

Alternatively, you can apply for early access to VIA, a more complex pipeline capable of maintaining context across frames to summarize long videos; it could answer the types of questions you are interested in: https://github.com/NVIDIA/metropolis-nim-workflows/tree/main/via_workflows

Hi @ssmmoo1,

Thanks for your response. I'll try out phi-3_5-vision-instruct to see how it performs, and I'll definitely give VIA a try.

Much appreciated!