checkpoint and more benchmark results
Wonderful job!
I've also been thinking about video token compression recently, and your idea is very interesting.
I'd like to know when the checkpoint will be released.
I'd also like to know the results on other benchmarks, such as MVBench and Video-MME.
Hi,
Thank you for your interest in our work! We'll have the final model weights available soon.
We're in the process of cleaning up the inference code and will open-source the relevant checkpoints soon after that is done.
Regarding the additional video understanding benchmarks you mentioned: since the main contribution of this paper lies in visual compression, we only evaluated on simple, common video QA benchmarks and compared against other video understanding methods that use vision compression. If you are interested, you can try VoCo-LLaMA on those benchmarks once our weights are released.
As for your question about using visual compression for long video understanding: although our method shows relatively good results, we only used a baseline strategy to extend VoCo-LLaMA to video, i.e., compressing each video frame into the same fixed set of compressed tokens, simply to demonstrate the effectiveness of VoCo-LLaMA for visual compression. When training resources do not allow retaining the entire video token sequence, pruning and memory design for video frame tokens remain mainstream and effective approaches; however, that is not our main contribution, so we did not discuss it in the paper.
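To make that baseline concrete, here is a minimal sketch of the per-frame compression idea. It is a simplified stand-in, not the actual VoCo-LLaMA implementation: the real method distills vision tokens through the LLM's own attention over special VoCo tokens, while this sketch uses a separate cross-attention module, and the names (`FrameCompressor`, `num_voco_tokens`) and shapes below are hypothetical.

```python
import torch
import torch.nn as nn

class FrameCompressor(nn.Module):
    """Hypothetical stand-in: distill each frame's vision tokens into a
    fixed, shared set of compressed tokens via cross-attention."""

    def __init__(self, dim: int, num_voco_tokens: int = 1, num_heads: int = 8):
        super().__init__()
        # The same learned query tokens are reused for every frame, mirroring
        # "compress each video frame into the same compressed tokens".
        self.voco_queries = nn.Parameter(torch.randn(num_voco_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_patches, dim) vision tokens for one frame
        b = frame_tokens.size(0)
        queries = self.voco_queries.unsqueeze(0).expand(b, -1, -1)
        compressed, _ = self.attn(queries, frame_tokens, frame_tokens)
        return compressed  # (batch, num_voco_tokens, dim)

# Compress each frame independently, then concatenate along the sequence
# axis before feeding the much shorter sequence to the LLM.
compressor = FrameCompressor(dim=1024, num_voco_tokens=1)
video = torch.randn(2, 16, 576, 1024)  # (batch, frames, patches, dim)
per_frame = [compressor(video[:, t]) for t in range(video.size(1))]
video_tokens = torch.cat(per_frame, dim=1)  # (2, 16, 1024)
```

The only point this illustrates is that the same learned queries are shared across frames, so every frame is reduced to the same small number of tokens before the sequence reaches the LLM.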
Best Regards,
Xubing