Paper link: https://arxiv.org/abs/2406.08702
- Download the dataset from https://huggingface.co/datasets/klee972/VLind-Bench
- The directory structure should be as follows:

      ├── data
      │   ├── data.json
      │   ├── counterfactual
      │   └── factual
      └── evel
          ├── ctx_cfq
          ├── gpt4o_eval.py
          ├── instructblip_eval.py
          ├── score_pipeline.py
          └── score.sh

- Run `gpt4o_eval.py` or `instructblip_eval.py` to generate model predictions.
- Run `score.sh` to evaluate pipeline scores and accuracies.
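
Before running the evaluation scripts, it may help to confirm that the files are where the layout listed above expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` and the `EXPECTED_PATHS` list are hypothetical, the paths are copied verbatim from the directory listing (including the `evel` directory name as written there), and the script is assumed to be run from the repository root.

```python
from pathlib import Path

# Paths copied from the directory listing above, relative to the repository root.
# The list itself is an illustrative assumption, not something the repository ships.
EXPECTED_PATHS = [
    "data/data.json",
    "data/counterfactual",
    "data/factual",
    "evel/ctx_cfq",
    "evel/gpt4o_eval.py",
    "evel/instructblip_eval.py",
    "evel/score_pipeline.py",
    "evel/score.sh",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        # Print each missing path on its own indented line.
        print("Missing:", *missing, sep="\n  ")
    else:
        print("Directory layout looks complete.")
```

If anything is reported as missing, re-check that the downloaded dataset was unpacked into `data` before generating predictions.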