Code, Dataset, and Demo for the paper "GPT-4V(ision) is a Generalist Web Agent, if Grounded".
See the project website for an overview and demo videos.
Release process:
- Dataset
  - Example data for the three element grounding methods
  - Data used in the paper with screenshot images
- Code
  - Offline Experiments
    - Screenshot generation
    - Code to overlay image annotation
    - BLIP-2 fine-tuning
  - Online Evaluation Tool
  - Offline Experiments
- Models
  - Fine-tuned BLIP-2 Model
The dataset is derived from Mind2Web by pairing the HTML text of each webpage with its rendered screenshot. The screenshot images come from the Mind2Web "Raw Dump with Full Traces and Snapshots", which was captured with Playwright during data annotation.
These scripts can collect screenshot images from the Mind2Web raw dump and overlay image annotation for action grounding.
We develop a new online evaluation tool using Playwright to evaluate web agents on live websites. Our tool can convert the predicted action into a browser event and execute it on the website.
We acknowledge Xiang Deng for his initial contribution to this tool.
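The step of converting a predicted action into a browser event can be illustrated with a minimal sketch. This is not the tool's actual implementation: the function name `execute_action`, the action dictionary layout (`op`, `value`), and the three supported operations (CLICK, TYPE, SELECT, following the Mind2Web action vocabulary) are assumptions made for illustration. The function only relies on the `click`, `fill`, and `select_option` methods that a Playwright `Locator` provides, so it can be exercised here with a small stand-in object instead of a real browser.

```python
from typing import Any, Dict, List, Tuple


def execute_action(locator: Any, action: Dict[str, str]) -> None:
    """Dispatch one predicted action onto a Playwright-style locator.

    `action` is a hypothetical format: {"op": "CLICK" | "TYPE" | "SELECT",
    "value": "<text to type or option label>"}.
    """
    op = action["op"].upper()
    value = action.get("value", "")
    if op == "CLICK":
        locator.click()
    elif op == "TYPE":
        # fill() replaces the current content of an input with `value`.
        locator.fill(value)
    elif op == "SELECT":
        # Select the <option> whose visible label matches `value`.
        locator.select_option(label=value)
    else:
        raise ValueError(f"Unsupported action: {op}")


class RecordingLocator:
    """Stand-in for a Playwright Locator that records calls (illustration only)."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, ...]] = []

    def click(self) -> None:
        self.calls.append(("click",))

    def fill(self, value: str) -> None:
        self.calls.append(("fill", value))

    def select_option(self, label: str = "") -> None:
        self.calls.append(("select_option", label))


if __name__ == "__main__":
    locator = RecordingLocator()
    execute_action(locator, {"op": "TYPE", "value": "Columbus, OH"})
    execute_action(locator, {"op": "CLICK"})
    print(locator.calls)
```

With a real browser session, the object returned by Playwright's `page.locator(selector)` could be passed in place of `RecordingLocator`. How the target element is chosen is the job of the grounding method and is outside this sketch.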
Questions or issues? File an issue or contact Boyuan Zheng.
The code in this repository is licensed under the MIT License.
The code was released solely for research purposes, with the goal of making the web more accessible via language technologies. The authors are strongly against any potentially harmful use of the data or technology by any party.
If you find this work useful, please consider citing our paper:
@article{zheng2023seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2401.01614},
  year={2024},
}