About Phi-3-Vision-Instruct (128K) in Table 5
Gaozhongpai opened this issue · 1 comments
Gaozhongpai commented
I saw from the supplementary that each frame consumes more than 2000 tokens. Thanks
The default setting for Phi-3-Vision-Instruct is num_crop = 16. Video frames are usually not high-resolution, so I don't think we need num_crop = 16. With num_crop = 4, each frame should need roughly a quarter of the tokens, so you can fit about 4 times as many frames.
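To put rough numbers on this, here is a minimal sketch of the frame-budget arithmetic. It assumes that the per-frame token count scales linearly with num_crop (a simplification; the real count likely includes some fixed overhead per image), that "128K" means 128,000 tokens, and that 2000 tokens per frame at num_crop = 16 is a lower bound taken from the supplementary. The helper names are hypothetical and are not part of any library.

```python
# Rough frame-budget estimate. All numbers are assumptions for illustration.

CONTEXT_TOKENS = 128_000       # assumed size of the 128K context window
TOKENS_PER_FRAME_AT_16 = 2000  # lower bound quoted above for num_crop = 16


def tokens_per_frame(num_crop: int) -> int:
    """Estimate tokens per frame, assuming linear scaling with num_crop."""
    return TOKENS_PER_FRAME_AT_16 * num_crop // 16


def max_frames(num_crop: int, reserved_tokens: int = 0) -> int:
    """Frames that fit in the context after reserving room for text tokens."""
    return (CONTEXT_TOKENS - reserved_tokens) // tokens_per_frame(num_crop)


if __name__ == "__main__":
    for crops in (16, 4):
        print(crops, tokens_per_frame(crops), max_frames(crops))
```

Under these assumptions the budget goes from 64 frames at num_crop = 16 to 256 frames at num_crop = 4. Since each frame actually uses more than 2000 tokens at num_crop = 16, the real counts would be somewhat lower, and the gain may be less than a clean 4x if there is fixed per-image overhead.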