SteP: Stacked LLM Policies for Web Actions

Paper link: https://arxiv.org/abs/2310.03720

Installation

To set up the project, clone the repository, then create and activate a virtual environment:

cd webagents-step
pyenv virtualenv webagents-step
pyenv activate webagents-step
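
If you do not use pyenv, a standard-library virtual environment works as well (a minimal sketch; the environment name is arbitrary):

python -m venv .venv
source .venv/bin/activate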

Install the required packages:

pip install -r requirements.txt

WebArena Evaluation

WebArena Results

We break down the success rates by website and link the trajectory logs below; each log contains the observations, model predictions, and evaluator outputs for every task.

The latest runs, using the gpt-4-turbo-2024-04-09 model and the WebArena code as of its last commit on May 29, 2024, are linked below:

Website                 Number of tasks   Success Rate   Trajectory Logs
Gitlab                  180               31.7%          logs
Reddit                  106               59.4%          logs
Shopping                187               36.9%          logs
Shopping admin (CMS)    182               24.2%          logs
Map                     109               30.3%          logs
Multisite               48                12.5%          logs
All                     812               33.5%          logs

Installing WebArena

Install WebArena from the WebArena GitHub repository. This code uses the repository's last commit as of May 29, 2024 (4c741b4b20a3e183836e58f383f9be1785248160).

Generate test data configs:

python scripts/generate_test_data.py

You will see *.json files generated in the config_files/ folder. Copy these into a tasks/webarena directory at the root of webagents-step/, as sketched below.
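
A minimal sketch of that copy step, assuming the WebArena checkout sits next to webagents-step (the ../webarena path is illustrative):

mkdir -p tasks/webarena
cp ../webarena/config_files/*.json tasks/webarena/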

You will also need to set up authentication for all websites, following the instructions in the WebArena README (see "Obtain the auto-login cookies for all websites"). This generates a .auth folder; copy it to the webagents-step/ root directory (see below).
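
Assuming the same sibling layout as in the sketch above:

cp -r ../webarena/.auth .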

Running Evaluation

To run WebArena evaluation:

python scripts/evaluate/eval_webarena.py --config configs/webarena/eval_openai_agent.yml
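
The OpenAI-backed agent needs API credentials. Assuming it follows the standard OPENAI_API_KEY environment-variable convention (check configs/webarena/eval_openai_agent.yml and the agent code for the exact mechanism):

export OPENAI_API_KEY="your-key-here"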

Important:

  • Set up each website as a Docker container, as described in the WebArena instructions.
  • Reset the website state before running an evaluation; the initial state of a website affects whether a task can succeed.
  • Reddit rate-limits accounts to no more than 3 posts per hour, so you need to add a 21-minute sleep before every new task. This can be done by adding time.sleep(1260) inside the for loop in eval_webarena.py, as sketched below.
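
A minimal sketch of that change; the loop and variable names below are illustrative, not the actual identifiers in eval_webarena.py:

import time

for task_config in task_configs:  # illustrative name for the task loop in eval_webarena.py
    # Sleep 21 minutes (1260 s) so Reddit's 3-posts-per-hour limit is not hit
    time.sleep(1260)
    run_task(task_config)  # illustrative stand-in for the per-task evaluation body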

MiniWoB++ Evaluation

Installing MiniWob++

Install MiniWoB++ from its GitHub repository, using commit 43bd1fe.
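
Assuming a local clone of the MiniWoB++ repository (the directory name is illustrative):

cd miniwob-plusplus
git checkout 43bd1fe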

Running Evaluation

To run MiniWoB++ evaluation:

python scripts/evaluate/eval_miniwob.py --config configs/miniwob/eval_openai_agent.yml

Contact

This project is still in active development. For any questions or issues, please contact us at psodhi@asapp.com.