Paper link: https://arxiv.org/abs/2310.03720
To set up the project, clone the repository and create a virtual environment:
cd webagents-step
pyenv virtualenv webagents-step
pyenv activate webagents-step
Install the required packages:
pip install -r requirements.txt
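The evaluation below runs an OpenAI-backed agent, so you will also need API credentials. A minimal sketch, assuming the key is picked up from the standard OPENAI_API_KEY environment variable (the exact mechanism this repo uses is not stated here):

```bash
# Assumption: the OpenAI client reads the API key from the environment.
export OPENAI_API_KEY="sk-..."
```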
We break down the success rates across the different websites and link the trajectory logs, which contain the observations, model predictions, and evaluator outputs for each task. The latest runs use the gpt-4-turbo-2024-04-09 model and WebArena code (last commit May 29, 2024) and are linked below.
| Website | Number of tasks | Success Rate | Trajectory Logs |
|---|---|---|---|
| Gitlab | 180 | 31.7% | logs |
| Reddit | 106 | 59.4% | logs |
| Shopping | 187 | 36.9% | logs |
| Shopping admin (CMS) | 182 | 24.2% | logs |
| Map | 109 | 30.3% | logs |
| Multisite | 48 | 12.5% | logs |
| All | 812 | 33.5% | logs |
Install WebArena from the WebArena GitHub repository. This code uses the last commit, 4c741b4b20a3e183836e58f383f9be1785248160, from May 29, 2024.
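A minimal sketch for pinning that commit, assuming the public web-arena-x/webarena repository; follow the WebArena README for the complete installation steps:

```bash
# Clone WebArena and check out the commit this code was tested against.
git clone https://github.com/web-arena-x/webarena.git
cd webarena
git checkout 4c741b4b20a3e183836e58f383f9be1785248160
# Install dependencies as described in the WebArena README.
pip install -r requirements.txt
```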
Generate test data configs:
python scripts/generate_test_data.py
You will see *.json files generated in the config_files/ folder. Copy these over to a tasks/webarena directory in the webagents-step/ root directory.
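A hedged sketch of this copy step, assuming generate_test_data.py was run from the WebArena checkout and with /path/to/webarena as a placeholder for wherever you cloned it:

```bash
# From the webagents-step/ root: create tasks/webarena and copy the generated configs.
mkdir -p tasks/webarena
cp /path/to/webarena/config_files/*.json tasks/webarena/
```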
You will also need to set up authentication for all websites as per the instructions in the WebArena README (see the instructions for obtaining the auto-login cookies for all websites). This will generate a .auth folder. Copy this over to the webagents-step/ root directory.
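Similarly, a sketch for copying the generated cookies (again with /path/to/webarena as a placeholder):

```bash
# Copy the auto-login cookies into the webagents-step/ root directory.
cp -r /path/to/webarena/.auth .
```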
To run WebArena evaluation:
python scripts/evaluate/eval_webarena.py --config configs/webarena/eval_openai_agent.yml
Important:
- Set up each website as a Docker container as described in the WebArena instructions.
- Reset the website state before running an evaluation. This matters because the initial state of the website affects task success.
- For Reddit tasks, there is a rate limit on making more than 3 posts in an hour, so you need to add a sleep of 21 minutes before every new task. This can be done by adding time.sleep(1260) inside the for loop in eval_webarena.py.
Install MiniWoB++ from the MiniWoB++ repository, using commit 43bd1fe.
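A minimal sketch, assuming the Farama-Foundation/miniwob-plusplus repository is the intended one (the original link is not preserved here); follow that repository's setup instructions after checking out the commit:

```bash
# Clone MiniWoB++ and check out the pinned commit.
git clone https://github.com/Farama-Foundation/miniwob-plusplus.git
cd miniwob-plusplus
git checkout 43bd1fe
```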
To run MiniWoB++ evaluation:
python scripts/evaluate/eval_miniwob.py --config configs/miniwob/eval_openai_agent.yml
This project is still in active development. For any questions or issues, please contact us at psodhi@asapp.com.