Towards Large Multimodal Models as Visual Foundation Agents

VisualAgentBench (VAB)

🌐 Website | 📃 Paper | 🗂️ VAB Training (Under Construction)

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

VisualAgentBench (VAB) is the first benchmark designed to systematically evaluate and develop large multimodal models (LMMs) as visual foundation agents. It comprises 5 distinct environments across 3 representative types of visual agent tasks (Embodied, GUI, and Visual Design).

  • VAB-OmniGibson (Embodied)
  • VAB-Minecraft (Embodied)
  • VAB-Mobile (GUI)
  • VAB-WebArena-Lite (GUI, based on WebArena and VisualWebArena)
  • VAB-CSS (Visual Design)

Compared to its predecessor AgentBench, VAB emphasizes visual inputs and supports the development of foundation agent capabilities by training open LLMs/LMMs on trajectories.

Dataset Summary

We offer two splits for each dataset: Testing and Training. Unlike its predecessor AgentBench, VAB is accompanied by a trajectory training set for behavior cloning (BC), which allows the development of more capable visual foundation agents from emerging open LMMs.

Leaderboard

The leaderboard reports VAB test-set results. All metrics are task Success Rate (SR). Note that proprietary LMMs are tested with prompting only, while open LMMs are tested after multitask fine-tuning on the VAB training set, because open LMMs usually fail to follow complicated agent task instructions otherwise.
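
As a rough illustration of the metric, task Success Rate is the share of test tasks that an agent completes successfully. The sketch below is not taken from the VAB codebase: the function name `success_rate` and the boolean outcome list are assumptions made for this example, and the value is returned as a fraction (the leaderboard may present it as a percentage).

```python
from typing import Iterable


def success_rate(outcomes: Iterable[bool]) -> float:
    """Return the fraction of tasks marked successful.

    `outcomes` is a hypothetical per-task record: True if the agent
    completed the task, False otherwise.
    """
    results = list(outcomes)
    if not results:
        raise ValueError("no task outcomes given")
    # True counts as 1 and False as 0, so sum() counts the successes.
    return sum(results) / len(results)


# Made-up outcomes for four tasks in a single environment.
example_outcomes = [True, False, True, True]
print(success_rate(example_outcomes))
```

Multiply the result by 100 to express it as a percentage.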

Quick Start

TODO

Acknowledgement

This project is heavily built upon the following repositories (to be updated):

  • AgentBench: serves as the backbone framework of this project for efficient and reliable parallel agent evaluation.
  • WebArena and VisualWebArena: serve as the testing framework and data source for the VAB-WebArena-Lite dataset.
  • OmniGibson: serves as the environment for VAB-OmniGibson.
  • JARVIS-1: provides the pipeline from which VAB-Minecraft's framework is adapted.
  • STEVE-1: serves as the action executor for VAB-Minecraft.

Citation

@article{liu2024visualagentbench,
  title={VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents},
  author={Liu, Xiao and Zhang, Tianjie and Gu, Yu and Iong, Iat Long and Xu, Yifan and Song, Xixuan and Zhang, Shudan and Lai, Hanyu and Liu, Xinyi and Zhao, Hanlin and others},
  journal={arXiv preprint arXiv:2408.06327},
  year={2024}
}