VisionQA-Llama2-OWLViT

Introduction

This is a multimodal model design for the Vision Question Answering (VQA) task. It integrates the Llama2 13B, OWL-ViT, and YOLOv8 models, utilizing hard prompt tuning.

Features:

  1. Llama2 13B handles language understanding and generation.
  2. OWL-ViT identifies objects in the image relevant to the question.
  3. YOLOv8 efficiently detects and annotates objects within the image.

Combining these models leverages their complementary strengths for precise and efficient VQA: accurate object recognition from the visual input together with contextual understanding of both the language and visual inputs.
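
Below is a minimal sketch of how such a pipeline can be wired together. The checkpoint names, prompt template, and helper structure are illustrative assumptions and may differ from the actual scripts in this repository.

# Illustrative sketch only: checkpoints, prompt template, and glue code are
# assumptions and may differ from the scripts in this repository.
import torch
from PIL import Image
from ultralytics import YOLO
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          OwlViTForObjectDetection, OwlViTProcessor)

def answer_question(img_path, question, yolo_weight="yolov8n.pt"):
    image = Image.open(img_path).convert("RGB")

    # 1. YOLOv8 detects and annotates generic objects in the image.
    yolo = YOLO(yolo_weight)
    det = yolo(image)[0]
    yolo_labels = [det.names[int(c)] for c in det.boxes.cls]

    # 2. OWL-ViT grounds the detected labels against the image as
    #    open-vocabulary queries relevant to the question context.
    owl_proc = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
    owl = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
    queries = list(set(yolo_labels)) or ["object"]
    inputs = owl_proc(text=[queries], images=image, return_tensors="pt")
    with torch.no_grad():
        out = owl(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])
    owl_res = owl_proc.post_process_object_detection(
        out, threshold=0.1, target_sizes=target_sizes)[0]
    grounded = [queries[int(i)] for i in owl_res["labels"]]

    # 3. Llama2 13B answers the question from a hard prompt that lists
    #    the detected objects as visual context (gated checkpoint shown
    #    here as an assumed example).
    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")
    llm = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-13b-chat-hf",
        torch_dtype=torch.float16, device_map="auto")
    prompt = (f"Objects in the image: {', '.join(grounded)}.\n"
              f"Question: {question}\nAnswer:")
    ids = tok(prompt, return_tensors="pt").to(llm.device)
    gen = llm.generate(**ids, max_new_tokens=20)
    return tok.decode(gen[0][ids["input_ids"].shape[1]:],
                      skip_special_tokens=True).strip()

In this sketch, YOLOv8 supplies candidate object labels, OWL-ViT verifies and grounds them in the image, and the grounded labels are injected into a hard prompt so Llama2 can answer the question.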

Requirements

pip install -r requirements.txt

Data

I evaluate on the test data of the GQA dataset.
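
For reference, a GQA balanced-question file can be loaded as sketched below; the directory layout and file naming follow the standard GQA release but are assumptions here and may need to be adapted to how the data is stored under --dataroot.

# Assumed layout: <dataroot>/<mode>_balanced_questions.json (GQA release naming).
import json, os

dataroot = "./gqa"   # placeholder path
mode = "testdev"
qfile = os.path.join(dataroot, f"{mode}_balanced_questions.json")
with open(qfile) as f:
    questions = json.load(f)

# Each entry maps a question id to its image id, question text, and answer.
qid, entry = next(iter(questions.items()))
print(qid, entry["imageId"], entry["question"], entry["answer"])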

Eval

python val_zero_shot.py 

--imgs_path: Path to the GQA image directory
--dataroot: Path to the GQA data root
--mode: One of 'testdev', 'val', or 'train'
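
Example invocation (the paths are placeholders for a local GQA copy):

python val_zero_shot.py --imgs_path ./gqa/images --dataroot ./gqa --mode testdev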

Run

python zero_shot.py

--img_path: Path to the image the question is about
--yolo_weight: Path to the pre-trained YOLOv8 weights
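
Example invocation (the image path and weight file are placeholders):

python zero_shot.py --img_path ./examples/demo.jpg --yolo_weight ./yolov8n.pt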

Prediction results

  1. The GQA accuracy score is 0.52.
