BenchLMM: A Python repository from The Artificial Intelligence Frontier Exploration Group (AIFEG)

[ECCV 2024] BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models

Rizhao Cai¹, Zirui Song²,³, Dayan Guan†¹, Zhenhao Chen⁴, Yaohang Li²,³, Xing Luo⁵, Chenyu Yi¹, Alex Kot¹

*Equal contribution, †Corresponding Author

If you like our project, please give us a star ⭐ on GitHub to get the latest updates.

Benchmark Examples

Note: For simplicity of presentation, the questions in Domestic Robot and Open Game have been simplified from their multiple-choice format. Please see our Benchmark for more examples and detailed questions.

Directory Structure

baseline/:
- Contains LLaVA and InstructBLIP baseline code.
evaluate/:
- Contains the Python code for evaluating model outputs. The evaluation uses ChatGPT to compare the model's answers with the ground-truth answers.
evaluate_results/:
- This directory contains the evaluation results of the baseline models.
jsonl/:
- This directory contains all JSONL files with the question, image relative location, and the ground truth answer.
- Sample JSONL format:
```
{
  "question_id": "bottle_test_broken_large_000_001", 
  "image": "bottle_test_broken_large_000.png", 
  "text": "Is there any defect in the object in this image? Answer the question using a single word or phrase.", 
  "answer": "Yes"
}
```
  Here "image" is the image path relative to the image folder of the corresponding style, "text" is the question, and "answer" is the ground-truth answer.
imgs/:
- This directory contains the images used on this page. However, they are not our benchmark images.
results/:
- This directory contains the inference results of the baseline models.
scripts/:
- Contains the scripts to run the baseline and evaluate the results.
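
Reading one of the files in jsonl/ can be sketched as follows. This is a minimal sketch, assuming one JSON object per line with the four keys shown in the sample above. The function names `parse_benchmark_lines` and `load_benchmark` and the `image_root` argument are hypothetical and are not part of the repository.

```python
import json
from pathlib import Path


def parse_benchmark_lines(lines, image_root):
    """Parse JSONL lines and resolve each image path.

    Assumes every non-blank line is a JSON object with the keys
    question_id, image, text, and answer. `image_root` (hypothetical)
    is the folder that holds the images of the corresponding style.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        item = json.loads(line)
        # "image" is stored relative to the style's image folder
        item["image_path"] = str(Path(image_root) / item["image"])
        records.append(item)
    return records


def load_benchmark(jsonl_path, image_root):
    """Read a benchmark JSONL file from disk."""
    with open(jsonl_path, encoding="utf-8") as f:
        return parse_benchmark_lines(f, image_root)
```

Each returned record keeps the original fields and adds an "image_path" entry that can be passed to an image loader.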

Evaluate on our Benchmark

Install
Download our benchmark images from our Releases page or from Hugging Face.

git clone git@github.com:AIFEG/BenchLMM.git
cd BenchLMM
mkdir evaluate_results

Prepare your model output
Prepare your results in the following format. The key "prompt" is the input to the model. We recommend storing your results in a JSONL file.

{
  "question_id": 110, 
  "prompt": "Is there any defect in the object in this image? Answer the question using a single word or phrase.", 
  "model_output": "Yes"
}
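
A minimal sketch of writing model outputs in the format above, one JSON object per line. The helper names `to_result_line` and `save_results` are hypothetical; only the three keys come from the format shown above.

```python
import json


def to_result_line(question_id, prompt, model_output):
    """Serialize one prediction as a single JSONL line (no trailing newline)."""
    record = {
        "question_id": question_id,
        "prompt": prompt,
        "model_output": model_output,
    }
    # ensure_ascii=False keeps non-ASCII answers readable in the file
    return json.dumps(record, ensure_ascii=False)


def save_results(path, predictions):
    """Write (question_id, prompt, model_output) tuples to `path` as JSONL."""
    with open(path, "w", encoding="utf-8") as f:
        for question_id, prompt, model_output in predictions:
            f.write(to_result_line(question_id, prompt, model_output) + "\n")
```

Using `json.dumps` instead of hand-built strings avoids invalid JSON such as trailing commas or unescaped quotes in the model output.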

Rename your JSONL file
Rename your JSONL file to xxxx_StyleName.jsonl, as in the following project tree. The style suffix must match the example exactly.

.
├── answers_Benchmark_AD.jsonl
├── xxxxxxxx_CT.jsonl
├── xxxxxxxx_MRI.jsonl
├── xxxxxxxx_Med-X-RAY.jsonl
├── xxxxxxxx_RS.jsonl
├── xxxxxxxx_Robots.jsonl
├── xxxxxxxx_defect_detection.jsonl
├── xxxxxxxx_game.jsonl
├── xxxxxxxx_infrard.jsonl
├── xxxxxxxx_style_cartoon.jsonl
├── xxxxxxxx_style_handmake.jsonl
├── xxxxxxxx_style_painting.jsonl
├── xxxxxxxx_style_sketch.jsonl
├── xxxxxxxx_style_tattoo.jsonl
├── xxxxxxxx_xray.jsonl
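
The naming rule can be checked before running the evaluation. The sketch below takes the suffix list from the tree above; the function names `style_of` and `missing_styles` are hypothetical, and treating the match as case-sensitive is an assumption about how the evaluation script reads file names.

```python
# Style suffixes copied verbatim from the project tree above.
EXPECTED_SUFFIXES = [
    "AD", "CT", "MRI", "Med-X-RAY", "RS", "Robots",
    "defect_detection", "game", "infrard",
    "style_cartoon", "style_handmake", "style_painting",
    "style_sketch", "style_tattoo", "xray",
]


def style_of(filename):
    """Return the style suffix of a result file name, or None if none matches."""
    if not filename.endswith(".jsonl"):
        return None
    stem = filename[: -len(".jsonl")]
    # Try longer suffixes first so multi-part suffixes are matched whole.
    for suffix in sorted(EXPECTED_SUFFIXES, key=len, reverse=True):
        if stem.endswith("_" + suffix):
            return suffix
    return None


def missing_styles(filenames):
    """List the expected styles that have no matching result file."""
    found = {style_of(name) for name in filenames}
    return [s for s in EXPECTED_SUFFIXES if s not in found]
```

For example, `missing_styles(os.listdir(result_dir))` (with `os` imported and `result_dir` pointing at your results folder) lists the styles that still lack a correctly named file.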

Evaluate your model output
Modify the file path and run the script scripts/evaluate.sh

bash scripts/evaluate.sh

Note: The scores will be saved in the file results. The Robots and game scores are written to evaluate_results/Robots.jsonl and evaluate_results/game.jsonl, respectively.
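
Before spending ChatGPT calls, a quick local sanity check can catch format problems such as mismatched question IDs. The sketch below is not the repository's evaluation method: it only pairs model outputs with ground-truth answers by question_id and counts case-insensitive exact matches. The function name `exact_match_rate` is hypothetical, and both inputs are assumed to be lists of already parsed JSONL records.

```python
def exact_match_rate(ground_truth, predictions):
    """Fraction of ground-truth questions whose prediction matches exactly.

    ground_truth: dicts with "question_id" and "answer"
    predictions:  dicts with "question_id" and "model_output"
    Questions without a prediction count as wrong.
    """
    # Compare IDs as strings so that 110 and "110" refer to the same question.
    outputs = {str(p["question_id"]): p["model_output"] for p in predictions}

    def normalize(text):
        # Ignore surrounding whitespace, surrounding periods, and letter case.
        return text.strip().strip(".").lower()

    if not ground_truth:
        return 0.0
    correct = 0
    for item in ground_truth:
        predicted = outputs.get(str(item["question_id"]))
        if predicted is not None and normalize(predicted) == normalize(item["answer"]):
            correct += 1
    return correct / len(ground_truth)
```

A rate near zero on yes/no questions usually points to a formatting or ID problem rather than a weak model, so it is worth inspecting the files before running scripts/evaluate.sh.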

Baseline

| Model | VRAM required |
|---|---|
| InstructBLIP-7B | 30GB |
| InstructBLIP-13B | 65GB |
| LLaVA-1.5-7B | <24GB |
| LLaVA-1.5-13B | 30GB |

LLaVA

Install

Clone the LLaVA repository and navigate to the LLaVA folder

git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA

Install Package

conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Install additional packages for training cases

pip install -e ".[train]"
pip install flash-attn --no-build-isolation

LLaVA Weights
Please check out the LLaVA Model Zoo for all public LLaVA checkpoints and for instructions on how to use the weights.

Run and evaluate LLaVA on our Benchmark

Add the file BenchLMM_LLaVA_model_vqa.py to the path LLaVA/llava/eval/
Modify the file path and run the script scripts/LLaVA.sh

bash scripts/LLaVA.sh

Evaluate results

bash scripts/evaluate.sh

Note: The scores will be saved in the file results.

InstructBLIP

Install

git clone https://github.com/salesforce/LAVIS.git  
cd LAVIS  
pip install -e .

Prepare Vicuna Weights
InstructBLIP uses frozen Vicuna 7B and 13B models. Please first follow the instructions to prepare the Vicuna v1.1 weights.
Then set llm_model in the Model Config to the folder that contains the Vicuna weights.

Run InstructBLIP on our Benchmark

Modify the file path and run the script BenchLMM/scripts/InstructBLIP.sh

bash BenchLMM/scripts/InstructBLIP.sh

Evaluate results

Modify the file path and run the script BenchLMM/scripts/evaluate.sh

bash BenchLMM/scripts/evaluate.sh

Note: The scores will be saved in the file results.

Cite our work

@article{cai2023benchlmm,
  title={BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models},
  author={Cai, Rizhao and Song, Zirui and Guan, Dayan and Chen, Zhenhao and Luo, Xing and Yi, Chenyu and Kot, Alex},
  journal={arXiv preprint arXiv:2312.02896},
  year={2023}
}

Contact

If you have any questions or issues with our project, please contact Dayan Guan: dayan.guan@outlook.com

Acknowledgement

This research is supported in part by the Rapid-Rich Object Search (ROSE) Lab of Nanyang Technological University and the NTU-PKU Joint Research Institute (a collaboration between NTU and Peking University that is sponsored by a donation from the Ng Teng Fong Charitable Foundation). We are deeply grateful to Yaohang Li from the University of Technology Sydney for his invaluable assistance in conducting the experiments, and to Jingpu Yang, Helin Wang, Zihui Cui, Yushan Jiang, Fengxian Ji, and Yuxiao Hang from NLULab@NEUQ (Northeastern University at Qinhuangdao, China) for their meticulous efforts in annotating the dataset. We would also like to thank Prof. Miao Fang (PI of NLULab@NEUQ) for his supervision and insightful suggestions during discussions on this project.
