- PyTorch
- transformers
- Install the required packages: `pip install -r requirements.txt`
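After installing, a quick sanity check can confirm that the core dependencies are importable (a minimal sketch, not part of the repository):

```python
# Sanity check: confirm the core dependencies are installed and importable.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```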
We provide code that reproduces our experiments with LLaVA v1.6 7b/13b/34b and GPT-4V using the IG-VLM approach. For each VLM, we offer scripts covering three benchmark families:
- Open-ended Video Question Answering (VQA): MSVD-QA, MSRVTT-QA, ActivityNet-QA, and TGIF-QA
- Text generation performance VQA: CI, DO, CU, TU, and CO
- Multiple-choice VQA: NExT-QA, STAR, TVQA, IntentQA, and EgoSchema
- To run these benchmark experiments, download the datasets and prepare a QA pair sheet.
- The QA pair sheet should follow the format outlined below and must be saved as a CSV file.
# For an open-ended QA sheet, include video_name, question, answer, question_id, and question_type (optional).
# For a multiple-choice QA sheet, include video_name, question, options (a0, a1, a2, ...), answer, question_id, and question_type (optional).
# question_id must be unique.
# Example of a multiple-choice QA sheet:
| video_name | question_id | question                                            | a0            | a1            | a2   | a3               | a4                | answer | question_type (optional) |
|------------|-------------|-----------------------------------------------------|---------------|---------------|------|------------------|-------------------|--------|---------------------------|
| 5333075105 | unique1234  | what did the man do after he reached the cameraman? | play with toy | inspect wings | stop | move to the side | pick up something | stop   | TN                        |
...
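As an illustration, the sketch below writes a multiple-choice QA sheet in this format using pandas. The row mirrors the example table above; the output filename is hypothetical, and pandas is assumed to be available:

```python
import pandas as pd

# Build a multiple-choice QA sheet with the columns described above.
# The row content mirrors the example table; it is illustrative, not real data.
rows = [
    {
        "video_name": "5333075105",
        "question_id": "unique1234",
        "question": "what did the man do after he reached the cameraman?",
        "a0": "play with toy",
        "a1": "inspect wings",
        "a2": "stop",
        "a3": "move to the side",
        "a4": "pick up something",
        "answer": "stop",
        "question_type": "TN",  # optional column
    },
]

df = pd.DataFrame(rows)
assert df["question_id"].is_unique  # question_id must be unique
df.to_csv("my_multiple_choice_qa.csv", index=False)  # hypothetical filename
```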
- To experiment with LLaVA v1.6 combined with IG-VLM, use the following commands. Install the LLaVA code in the execution path, and reinstall it before every reproduction run. The llm_size parameter selects among the 7b, 13b, and 34b model configurations:
# Open-ended video question answering
python eval_llava_openended.py --path_qa_pair_csv ./data/open_ended_qa/ActivityNet_QA.csv --path_video /data/activitynet/videos/%s.mp4 --path_result ./result_activitynet/ --api_key {api_key} --llm_size 7b
# Text generation performance
python eval_llava_textgeneration_openended.py --path_qa_pair_csv ./data/text_generation_benchmark/Generic_QA.csv --path_video /data/activitynet/videos/%s.mp4 --path_result ./result_textgeneration/ --api_key {api_key} --llm_size 13b
# Multiple-choice VQA
python eval_llava_multiplechoice.py --path_qa_pair_csv ./data/multiple_choice_qa/TVQA.csv --path_video /data/TVQA/videos/%s.mp4 --path_result ./result_tvqa/ --llm_size 34b
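Note that --path_video is a printf-style template: each video_name from the QA sheet is substituted for %s to locate the video file. The sketch below shows our reading of that convention; the video_name value is hypothetical:

```python
# --path_video is a template; %s is replaced by video_name from the QA sheet.
path_video = "/data/activitynet/videos/%s.mp4"
video_name = "v_QOlSCBRmfWY"  # hypothetical video_name from a QA sheet
print(path_video % video_name)  # -> /data/activitynet/videos/v_QOlSCBRmfWY.mp4
```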
- To conduct experiments with GPT-4V combined with IG-VLM, use the following commands. Be aware that calls to the GPT-4 vision API may incur significant costs.
# Open-ended video question answering
python eval_gpt4v_openended.py --path_qa_pair_csv ./data/open_ended_qa/MSVD_QA.csv --path_video /data/msvd/videos/%s.avi --path_result ./result_msvd_gpt4/ --api_key {api_key}
# Text generation performance
python eval_gpt4v_textgeneration_openended.py --path_qa_pair_csv ./data/text_generation_benchmark/Generic_QA.csv --path_video /data/activitynet/videos/%s.mp4 --path_result ./result_textgeneration_gpt4/ --api_key {api_key}
# Multiple-choice VQA
python eval_gpt4v_multiplechoice.py --path_qa_pair_csv ./data/multiple_choice_qa/EgoSchema.csv --path_video /data/EgoSchema/videos/%s.mp4 --path_result ./result_egoschema_gpt4/ --api_key {api_key}
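Before launching a run, it can help to estimate the request volume from the QA sheet. A minimal sketch, assuming one GPT-4V request per QA pair (the scripts' internal batching may differ):

```python
import pandas as pd

# Count QA pairs to gauge the minimum number of GPT-4V requests a run will make.
df = pd.read_csv("./data/open_ended_qa/MSVD_QA.csv")
print(f"{len(df)} QA pairs -> at least {len(df)} GPT-4V requests")
```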