Accuracy Results
model_name | MMB | OOD | SQA | SB | AI2D | Avg. |
---|---|---|---|---|---|---|
LLaVA-v1.6-Vicuna-13B | 76.75 | 72.93 | 70.56 | 70.37 | 73.67 | 72.85 |
Monkey-Chat | 76.98 | 70.6 | 74.66 | 66.1 | 67.95 | 71.26 |
LLaVA-v1.6-Vicuna-7B | 75.56 | 73.7 | 65.86 | 69.06 | 69.75 | 70.78 |
InternLM-XComposer2-VL | 71.77 | 70.04 | 77.95 | 64.44 | 66.13 | 70.07 |
Yi-VL-6B | 75.24 | 73.91 | 66.72 | 66.25 | 58.84 | 68.19 |
CogAgent-VQA | 74.78 | 68.57 | 67.12 | 68.01 | 58.2 | 67.34 |
MobileVLM_V2-7B | 75.97 | 66.53 | 72.33 | 66.71 | 53.55 | 67.02 |
MoE-LLaVA-Phi2-2.7B | 73.73 | 74.82 | 64.04 | 66.42 | 55.76 | 66.95 |
mPLUG-Owl2 | 73.05 | 73.28 | 65.71 | 61.49 | 54.38 | 65.58 |
Qwen-VL-Chat | 71.4 | 54.22 | 63.23 | 59.79 | 65.09 | 62.74 |
Set Sizes Results
model_name | MMB | OOD | SQA | SB | AI2D | Avg. |
---|---|---|---|---|---|---|
LLaVA-v1.6-Vicuna-13B | 2.34 | 2.18 | 2.45 | 2.49 | 2.33 | 2.36 |
Monkey-Chat | 2.7 | 2.92 | 2.56 | 3.26 | 3.19 | 2.93 |
LLaVA-v1.6-Vicuna-7B | 2.37 | 2.34 | 2.45 | 2.53 | 2.37 | 2.41 |
InternLM-XComposer2-VL | 2.72 | 2.2 | 2.41 | 3.08 | 3.02 | 2.69 |
Yi-VL-6B | 2.47 | 2.02 | 2.76 | 2.61 | 3 | 2.57 |
CogAgent-VQA | 2.33 | 2.46 | 2.36 | 2.49 | 2.94 | 2.52 |
MobileVLM_V2-7B | 2.53 | 2.61 | 2.62 | 2.8 | 3.4 | 2.79 |
MoE-LLaVA-Phi2-2.7B | 2.54 | 1.89 | 2.7 | 2.69 | 2.92 | 2.55 |
mPLUG-Owl2 | 2.55 | 2.09 | 2.71 | 2.93 | 3 | 2.65 |
Qwen-VL-Chat | 2.7 | 3.32 | 2.9 | 3.32 | 3.1 | 3.07 |
Uncertainty-aware Accuracy Results
model_name | MMB | OOD | SQA | SB | AI2D | Avg. |
---|---|---|---|---|---|---|
LLaVA-v1.6-Vicuna-13B | 90.41 | 86.29 | 78.04 | 73.87 | 84.58 | 82.64 |
Monkey-Chat | 83.41 | 63.22 | 81.7 | 52.49 | 56.08 | 67.38 |
LLaVA-v1.6-Vicuna-7B | 87.87 | 82.81 | 69.77 | 70.69 | 77.2 | 77.67 |
InternLM-XComposer2-VL | 69.98 | 80.49 | 94.37 | 52.6 | 56.4 | 70.77 |
Yi-VL-6B | 84.56 | 95.05 | 64.01 | 64.56 | 49.31 | 71.5 |
CogAgent-VQA | 85.56 | 71.14 | 72.4 | 69.96 | 49 | 69.61 |
MobileVLM_V2-7B | 84.19 | 64.35 | 79.07 | 62.18 | 39.39 | 65.84 |
MoE-LLaVA-Phi2-2.7B | 82.83 | 100.73 | 61.18 | 64.67 | 47.87 | 71.46 |
mPLUG-Owl2 | 78.4 | 89.24 | 62.92 | 52.91 | 45.09 | 65.71 |
Qwen-VL-Chat | 69.58 | 40.28 | 54.71 | 44.7 | 54.3 | 52.71 |
Six groups of models can be launched from one environment: LLaVA, CogVLM, Yi-VL, Qwen-VL, InternLM-XComposer, and MoE-LLaVA. This environment can be created with the following commands:
python3 -m venv venv
source venv/bin/activate
pip install git+https://github.com/haotian-liu/LLaVA.git
pip install git+https://github.com/PKU-YuanGroup/MoE-LLaVA.git --no-deps
pip install deepspeed==0.9.5
pip install -r requirements.txt
pip install xformers==0.0.23 --no-deps
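After the installation above, a quick sanity check can confirm that the shared environment resolves the key packages. This is only a sketch; the import names (`llava`, `deepspeed`, `xformers`) are assumptions based on the pip commands listed here, so adjust them if your installs differ.

```python
# Sanity check for the shared environment (sketch; package names are assumptions
# based on the pip commands above).
import importlib.util

for pkg in ("llava", "deepspeed", "xformers"):
    status = "ok" if importlib.util.find_spec(pkg) is not None else "missing"
    print(f"{pkg}: {status}")
```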
The mPLUG-Owl2 model can be launched from the following environment:
python3 -m venv venv_mplug
source venv_mplug/bin/activate
git clone https://github.com/X-PLUG/mPLUG-Owl.git
cd mPLUG-Owl/mPLUG-Owl2
git checkout 74f6be9f0b8d42f4c0ff9142a405481e0f859e5c
pip install -e .
pip install git+https://github.com/haotian-liu/LLaVA.git --no-deps
cd ../../
pip install -r requirements.txt
Monkey models can be launched from the following environment:
python3 -m venv venv_monkey
source venv_monkey/bin/activate
git clone https://github.com/Yuliang-Liu/Monkey.git
cd ./Monkey
pip install -r requirements.txt
pip install git+https://github.com/haotian-liu/LLaVA.git --no-deps
cd ../
pip install -r requirements.txt
To check all models, you can run scripts/test_model_logits.sh.
To work with Yi-VL, install Git LFS and download the model weights:
apt-get install git-lfs
cd ../
git clone https://huggingface.co/01-ai/Yi-VL-6B
To get model logits on the benchmarks, run the commands from scripts/run.sh. Then apply conformal prediction to the collected outputs and build the result tables:
python -m uncertainty_quantification_via_cp --result_data_path 'output' --file_to_write 'full_result.json'
python -m make_tables --result_path 'full_result.json' --dir_to_write 'tables'
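For readers unfamiliar with the technique, below is a minimal, self-contained sketch of how a split-conformal prediction set can be built from multiple-choice softmax scores, using an LAC-style nonconformity score. It illustrates the general idea behind the set-size and coverage numbers reported above; it is not the repository's exact implementation, and the function name, `alpha`, and the toy data are assumptions.

```python
# Sketch of split-conformal prediction sets over multiple-choice softmax scores
# (LAC-style nonconformity). Illustrative only; not the repository's exact code.
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """cal_probs/test_probs: (n, k) softmax scores; cal_labels: (n,) true option indices."""
    n = len(cal_labels)
    # Nonconformity score: 1 minus the probability assigned to the true option.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected (1 - alpha) quantile of the calibration scores
    # (the `method` argument requires NumPy >= 1.22).
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    # Prediction set: every option whose score does not exceed the threshold.
    return [np.where(1.0 - p <= q)[0] for p in test_probs]

# Toy usage with 4 answer options and random scores.
rng = np.random.default_rng(0)
cal = rng.dirichlet(np.ones(4), size=500)
test = rng.dirichlet(np.ones(4), size=5)
sets = conformal_sets(cal, rng.integers(0, 4, size=500), test)
print([s.tolist() for s in sets], "avg set size:", np.mean([len(s) for s in sets]))
```

Smaller average prediction sets at the same coverage level indicate lower model uncertainty, which is what the "Set Sizes Results" table reports.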
@article{kostumov2024uncertainty,
title={Uncertainty-Aware Evaluation for Vision-Language Models},
author={Kostumov, Vasily and Nutfullin, Bulat and Pilipenko, Oleg and Ilyushin, Eugene},
journal={arXiv preprint arXiv:2402.14418},
year={2024}
}
LLM-Uncertainty-Bench: conformal prediction applied to LLMs. Thanks to the authors for providing the framework.
We welcome suggestions to help us improve the benchmark. For any queries, please contact us at v.kostumov@ensec.ai. If you find something interesting, feel free to share it with us by email or open an issue. Thanks!