GPT-4V(ision) is a Generalist Web Agent, if Grounded

Code, Dataset, and Demo for the paper "GPT-4V(ision) is a Generalist Web Agent, if Grounded".

See the project website for an overview and demo videos.

Release progress:

  • Dataset
    • Example data for the three element grounding methods
    • Data used in the paper with screenshot images
  • Code
    • Offline Experiments
      • Screenshot generation
      • Code to overlay image annotation
      • BLIP-2 fine-tuning
    • Online Evaluation Tool
  • Models
    • Fine-tuned BLIP-2 Model

Dataset

The dataset is derived from Mind2Web by pairing each HTML document with its rendered webpage screenshot. The screenshot images come from the Mind2Web raw dump (Raw Dump with Full Traces and Snapshots), which was captured with Playwright during data annotation.

Screenshot Generation

These scripts collect screenshot images from the Mind2Web raw dump and overlay image annotations on them for action grounding.
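To make the overlay step more concrete, here is a minimal sketch of the geometry that would come before any drawing: clip each candidate element's bounding box to the screenshot and assign it a letter label. This is not the repository's script. The names `Annotation`, `clip_box`, and `plan_annotations`, the `(left, top, width, height)` box format, and the choice to anchor the label at the bottom center of the box are all assumptions made for illustration.

```python
from dataclasses import dataclass
from string import ascii_uppercase
from typing import List, Optional, Tuple

# Hypothetical box format: (left, top, width, height) in screenshot pixels.
Box = Tuple[float, float, float, float]
Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


@dataclass
class Annotation:
    label: str                 # letter that would be drawn next to the box
    rect: Rect                 # box corners, clipped to the screenshot
    label_xy: Tuple[int, int]  # assumed anchor: bottom center of the box


def clip_box(box: Box, img_w: int, img_h: int) -> Optional[Rect]:
    """Clip a box to the image; return None if nothing remains visible."""
    left, top, width, height = box
    x0, y0 = max(0, int(left)), max(0, int(top))
    x1 = min(img_w, int(left + width))
    y1 = min(img_h, int(top + height))
    if x1 <= x0 or y1 <= y0:
        return None
    return x0, y0, x1, y1


def plan_annotations(boxes: List[Box], img_w: int, img_h: int) -> List[Annotation]:
    """Label visible boxes A, B, C, ... in order, skipping off-screen ones."""
    plan: List[Annotation] = []
    for box in boxes:
        rect = clip_box(box, img_w, img_h)
        if rect is None:
            continue
        if len(plan) >= len(ascii_uppercase):
            break  # this sketch only supports single-letter labels
        x0, y0, x1, y1 = rect
        label = ascii_uppercase[len(plan)]
        plan.append(Annotation(label, rect, ((x0 + x1) // 2, y1)))
    return plan
```

Each resulting `rect` could then be passed to an image library's rectangle-drawing routine (for example, Pillow's `ImageDraw.rectangle`), with the label text drawn at `label_xy`.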

Online Evaluation Tool

We developed a new online evaluation tool, built on Playwright, to evaluate web agents on live websites. The tool converts each predicted action into a browser event and executes it on the website.
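The action-to-event conversion can be pictured roughly as follows. This is a minimal sketch, not the tool's actual code: the action dictionary with `operation`, `selector`, and `value` keys is a hypothetical schema, and `execute_action` is an illustrative name. The only library calls assumed are Playwright's `page.locator`, `Locator.click`, `Locator.fill`, and `Locator.select_option`.

```python
from typing import Any, Dict


def execute_action(page: Any, action: Dict[str, str]) -> None:
    """Dispatch one predicted action to a Playwright-style page object.

    `action` uses a hypothetical schema:
        {"operation": "CLICK" | "TYPE" | "SELECT",
         "selector": "<CSS selector>",
         "value": "<text or option label>"}   # only for TYPE / SELECT
    """
    operation = action["operation"].upper()
    target = page.locator(action["selector"])

    if operation == "CLICK":
        target.click()
    elif operation == "TYPE":
        # fill() replaces the current content of an input with the value.
        target.fill(action["value"])
    elif operation == "SELECT":
        # Choose the <option> whose visible label matches the value.
        target.select_option(label=action["value"])
    else:
        raise ValueError(f"Unsupported operation: {operation}")
```

With Playwright's sync API, `page` would be the object returned by `browser.new_page()` after navigating to the target site. Because the function depends only on the `locator` interface, it can also be exercised against a stub page in tests.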

We acknowledge Xiang Deng for his initial contribution to this tool.

Contact

Questions or issues? File an issue or contact Boyuan Zheng.

Licensing Information

The code in this repository is licensed under the MIT License.

Disclaimer

The code was released solely for research purposes, with the goal of making the web more accessible via language technologies. The authors are strongly against any potentially harmful use of the data or technology by any party.

Citation Information

If you find this work useful, please consider citing our paper:

@article{zheng2023seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2401.01614},
  year={2024},
}