longvideobench/LongVideoBench

About Phi-3-Vision-Instruct (128K) in Table 5

Gaozhongpai opened this issue · 1 comment

Hi, thanks for the work. If Phi-3-Vision-Instruct has a context window of 128K, why is the context window exceeded at 64 frames in Table 5, as shown below?

image

I saw in the supplementary material that each frame consumes more than 2000 tokens. Thanks.

The default setting for Phi-3-Vision-Instruct is num_crop = 16. Video frames are usually not high-resolution, so I don't think num_crop = 16 is needed. With num_crop = 4, you can fit roughly 4 times as many frames.
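
To make the arithmetic concrete, here is a rough sketch of the token budget. It assumes that "128K" means 131,072 tokens, that the 2000 tokens per frame quoted above is a lower bound at num_crop = 16, and that the per-frame cost scales about linearly with num_crop (the real tokenizer adds some fixed overhead per image, so the true gain is probably a bit under 4x). The function names are made up for illustration and are not part of any library.

```python
# Back-of-the-envelope frame budget; every constant here is an assumption.
CONTEXT_WINDOW = 128 * 1024  # assumes "128K" = 131,072 tokens


def per_frame_budget(num_frames, context_window=CONTEXT_WINDOW, text_tokens=0):
    """Largest per-frame token cost that still lets num_frames fit."""
    return (context_window - text_tokens) // num_frames


def max_frames(tokens_per_frame, context_window=CONTEXT_WINDOW, text_tokens=0):
    """Number of frames that fit at a given per-frame token cost."""
    return (context_window - text_tokens) // tokens_per_frame


def scaled_tokens_per_frame(tokens_at_16, num_crop):
    """Assumed linear scaling of the per-frame cost with num_crop."""
    return tokens_at_16 * num_crop // 16


# 64 frames only fit if each frame costs at most this many tokens,
# before counting any prompt or question text.
print(per_frame_budget(64))  # 2048

# Upper bound on frames at num_crop = 16, using the 2000-token lower bound.
print(max_frames(2000))  # 65

# Same bound at num_crop = 4 under the linear-scaling assumption.
print(max_frames(scaled_tokens_per_frame(2000, 4)))  # 262
```

Since each frame costs more than 2000 tokens and the budget at 64 frames is 2048 tokens per frame, there is almost no room left for the text prompt, which would be consistent with the context window being exceeded at 64 frames.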