SteP: Stacked LLM Policies for Web Actions

Paper link: https://arxiv.org/abs/2310.03720

Installation

To set up the project, clone the repository, then create and activate a virtual environment:

cd webagents-step
pyenv virtualenv webagents-step
pyenv activate webagents-step
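
If you do not use pyenv, a standard-library virtual environment works as well (a minimal sketch; the environment name is arbitrary):

python -m venv .venv
source .venv/bin/activate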

Install the required packages:

pip install -r requirements.txt

WebArena Evaluation

WebArena Results

We break down the success rates by website and link the trajectory logs below; each log contains the observations, model predictions, and evaluator outputs for every task.

The latest runs, using the gpt-4-turbo-2024-04-09 model and the WebArena code as of its last commit on May 29, 2024, are linked below:

Website                 Number of tasks   Success Rate   Trajectory Logs
Gitlab                  180               31.7%          logs
Reddit                  106               59.4%          logs
Shopping                187               36.9%          logs
Shopping admin (CMS)    182               24.2%          logs
Map                     109               30.3%          logs
Multisite               48                12.5%          logs
All                     812               33.5%          logs

Installing WebArena

Install WebArena from the WebArena GitHub repository. This code uses the repository's last commit as of May 29, 2024 (4c741b4b20a3e183836e58f383f9be1785248160).

Generate test data configs:

python scripts/generate_test_data.py

You will see *.json files generated in the config_files/ folder. Copy these into a tasks/webarena directory at the root of webagents-step/, as sketched below.
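
A minimal sketch of that copy step, assuming the WebArena checkout sits next to webagents-step (the ../webarena path is illustrative):

mkdir -p tasks/webarena
cp ../webarena/config_files/*.json tasks/webarena/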

You will also need to set up authentication for all websites, following the instructions in the WebArena README (see "Obtain the auto-login cookies for all websites"). This generates a .auth folder; copy it to the webagents-step/ root directory (see below).
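
Assuming the same sibling layout as in the sketch above:

cp -r ../webarena/.auth .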

Running Evaluation

To run WebArena evaluation:

python scripts/evaluate/eval_webarena.py --config configs/webarena/eval_openai_agent.yml
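
The OpenAI-backed agent needs API credentials. Assuming it follows the standard OPENAI_API_KEY environment-variable convention (check configs/webarena/eval_openai_agent.yml and the agent code for the exact mechanism):

export OPENAI_API_KEY="your-key-here"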

Important:

  • Set up each website as a Docker container, as described in the WebArena instructions.
  • Reset the website state before running an evaluation; the initial state of a website affects whether a task can succeed.
  • Reddit rate-limits accounts to no more than 3 posts per hour, so you need to add a 21-minute sleep before every new task. This can be done by adding time.sleep(1260) inside the for loop in eval_webarena.py, as sketched below.
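
A minimal sketch of that change; the loop and variable names below are illustrative, not the actual identifiers in eval_webarena.py:

import time

for task_config in task_configs:  # illustrative name for the task loop in eval_webarena.py
    # Sleep 21 minutes (1260 s) so Reddit's 3-posts-per-hour limit is not hit
    time.sleep(1260)
    run_task(task_config)  # illustrative stand-in for the per-task evaluation body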

MiniWoB++ Evaluation

Installing MiniWob++

Install MiniWoB++ from its GitHub repository, using commit 43bd1fe.
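
Assuming a local clone of the MiniWoB++ repository (the directory name is illustrative):

cd miniwob-plusplus
git checkout 43bd1fe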

Running Evaluation

To run MiniWoB++ evaluation:

python scripts/evaluate/eval_miniwob.py --config configs/miniwob/eval_openai_agent.yml

Contact

This project is still in active development. For any questions or issues, please contact us at psodhi@asapp.com.