This is a Dockerfile-based implementation of the SWE-bench evaluation framework.
The solution is designed so that each "testbed" for testing a version of a repository is built in a separate Docker image. Each test is then run in its own Docker container. This approach ensures more stable test results because the environment is completely isolated and is reset for each test. Since the Docker container can be recreated each time, there's no need for reinstallation, speeding up the benchmark process.
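The isolation described above can be pictured with a minimal sketch. It only assembles a `docker run` command line and does not reflect how this project actually launches containers; the function name, image name, and test command are hypothetical, and it assumes the image ships with `bash`.

```python
def fresh_container_command(image: str, test_command: str) -> list[str]:
    """Build a `docker run` invocation for one test in a throwaway container."""
    # --rm deletes the container when the command exits, so nothing
    # written during one test run is visible to the next one.
    return ["docker", "run", "--rm", image, "bash", "-c", test_command]


# Hypothetical image name and test command, for illustration only.
cmd = fresh_container_command("example/testbed:1.0", "pytest tests/test_example.py")
print(" ".join(cmd))
```

Passing the resulting list to `subprocess.run` would start a new container from the unchanged image each time, which is what makes reinstallation unnecessary.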
Docker images for the testbeds used in the SWE-bench_Lite dataset have been built and tested on the gold predictions. 2 benchmark instances are currently failing. See the results in the evaluations/SWE-bench_Lite_golden folder.
Docker images for the testbeds used in the SWE-bench dataset have been built and tested on the check-harness predictions published by SWE-bench. 10 benchmark instances are currently failing. See the results in the evaluations/SWE-bench_check_harness folder.
I have run the Docker benchmarks on the SWE-agent GPT-4 predictions and on AutoCodeRover's first benchmark run.
The SWE-agent GPT-4 predictions yield exactly the same result as SWE-agent's own evaluation, 18% (54) resolved issues, which suggests that the Docker image approach is equally accurate.
However, the Docker benchmark gives better results for AutoCodeRover. In AutoCodeRover's own benchmarks, the three runs achieve 16.00% (48), 15.67% (47), and 16.67% (50) resolved issues. In swe-bench-docker, the same predictions result in 18.00% (54), 19.00% (57), and 19.00% (57) resolved issues. Across the three runs this gives a pass@3 of 26.00% (78), compared to the 22.33% (67) reported in the AutoCodeRover paper. In each individual run, there are also benchmark instances that fail in swe-bench-docker's evaluation but not in AutoCodeRover's. So there still seem to be false positives or negatives that are not detected when comparing against the gold patches, likely due to incorrect dependency versions.
This suggests that other agents' reported results may also be lower than what they actually achieve, because it is difficult to run evaluations with completely accurate results.
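To make the pass@3 figure concrete, here is a minimal sketch that assumes pass@3 means "resolved in at least one of the three runs", i.e. the union of the resolved instance sets. The function name and the instance IDs are hypothetical and are not taken from the actual evaluation results.

```python
def pass_at_k(runs: list[set[str]], total: int) -> tuple[int, float]:
    """Count instances resolved in at least one run, as a count and a percentage."""
    resolved = set().union(*runs)  # an instance counts once, however many runs solved it
    return len(resolved), 100.0 * len(resolved) / total


# Hypothetical instance IDs for three runs over a 10-instance dataset.
runs = [
    {"repo__a-1", "repo__b-2"},
    {"repo__b-2", "repo__c-3"},
    {"repo__c-3", "repo__d-4"},
]
count, percent = pass_at_k(runs, total=10)
print(f"pass@3: {percent:.2f}% ({count})")
```

Because the runs overlap, the union is smaller than the sum of the per-run counts, which is why three runs of 54, 57, and 57 resolved instances can combine to 78 rather than 168.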
There are currently three different types of Docker images for running benchmarks.
In the first type, testbeds are set up in a Conda environment similar to the original SWE-bench environment.
In the second type, testbeds are set up with only the correct Python version installed via Pyenv, since Conda may be overkill when each benchmark is tested in its own container. This approach has resulted in fewer erroneous benchmark instances in the repositories where it has been tested, and the image is smaller. Currently, django, psf/requests, and scikit-learn use this type of Docker image. Hopefully, more repositories can be run this way.
In the third type, an image is built for each benchmark instance. In scikit-learn, some benchmarks seem to fail because Cython code isn't compiled, and building an image per instance avoids building the project before each test.
Run run_evaluation.py to evaluate a predictions file. A log for each test is written to log_dir in the same format as in the SWE-bench evaluation tools, and the same tooling can then be used to generate a report.
python run_evaluation.py
--predictions_path [Required] Path to the predictions file
--log_dir [Required] Path to directory to save evaluation log files
--swe_bench_tasks [Required] Path to SWE-bench task instances file or dataset
--namespace [Optional] Namespace of the Docker repository
--log_suffix [Optional] Suffix to append to log file names
--skip_existing [Optional] Skip evaluating task instances with logs that already exist
--timeout [Optional] Timeout for installation + test script execution
--num_processes [Optional] Number of processes to run in parallel (-1 for unlimited)
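As an illustration of how the options above fit together, the following sketch assembles an argument list for run_evaluation.py. Only flags listed above are used; the helper name and all paths and values are hypothetical, and the sketch builds the command without running it.

```python
from typing import Optional


def evaluation_argv(
    predictions_path: str,
    log_dir: str,
    swe_bench_tasks: str,
    namespace: Optional[str] = None,
    num_processes: Optional[int] = None,
) -> list[str]:
    """Assemble the command line for run_evaluation.py from the documented flags."""
    argv = [
        "python", "run_evaluation.py",
        "--predictions_path", predictions_path,
        "--log_dir", log_dir,
        "--swe_bench_tasks", swe_bench_tasks,
    ]
    # Optional flags are added only when a value is given.
    if namespace is not None:
        argv += ["--namespace", namespace]
    if num_processes is not None:
        argv += ["--num_processes", str(num_processes)]
    return argv


# Hypothetical paths and namespace, for illustration only.
argv = evaluation_argv(
    "predictions.json", "logs", "princeton-nlp/SWE-bench_Lite",
    namespace="example", num_processes=4,
)
print(" ".join(argv))
```

The list can be handed to `subprocess.run(argv, check=True)` from the repository root to start the evaluation.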
It might be worth pulling all images before running the script to get more consistent timing in the evaluation.
scripts/pull_docker_images.sh [Dockerfiles directory] [Namespace]
docker build -t aorwall/swe-bench-base:bookworm-slim -f docker/Dockerfile .
docker build -t aorwall/swe-bench-pyenvs:bookworm-slim -f docker/Dockerfile-pyenvs .
generate_dockerfiles.py generates Dockerfiles for all testbeds in a SWE-bench benchmark dataset. These can then be used to build Docker images.
python generate_dockerfiles.py
--swe_bench_tasks [Required] Path to SWE-bench task instances file or dataset
--namespace [Required] Namespace of the Docker repository
--docker_dir [Required] Path to the directory where the Dockerfiles will be saved
This script builds Docker images from all Dockerfiles.
scripts/build_docker_images.sh [Dockerfiles directory] [Namespace]
This script pushes the Docker images built from all Dockerfiles.
scripts/push_docker_images.sh [Dockerfiles directory] [Namespace]
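The generate, build, and push steps above are naturally run in that order. The sketch below lists the three commands as argument lists, using the flags and the `[Dockerfiles directory] [Namespace]` argument order shown above. The helper name and the example values are hypothetical, and nothing is executed.

```python
def image_pipeline(swe_bench_tasks: str, namespace: str, docker_dir: str) -> list[list[str]]:
    """Return the generate, build, and push commands in the order they should run."""
    return [
        # 1. Write one Dockerfile per testbed into docker_dir.
        ["python", "generate_dockerfiles.py",
         "--swe_bench_tasks", swe_bench_tasks,
         "--namespace", namespace,
         "--docker_dir", docker_dir],
        # 2. Build images from the generated Dockerfiles.
        ["scripts/build_docker_images.sh", docker_dir, namespace],
        # 3. Push the built images.
        ["scripts/push_docker_images.sh", docker_dir, namespace],
    ]


# Hypothetical dataset, namespace, and directory, for illustration only.
steps = image_pipeline("princeton-nlp/SWE-bench_Lite", "example", "docker")
for step in steps:
    print(" ".join(step))
```

Each entry could be passed to `subprocess.run(step, check=True)` so that a failed step stops the pipeline before later steps run.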
Run a single instance and print logs to stdout.
python run_single_instance.py
--instance_id [Required] Instance ID of the task to run
--swe_bench_tasks [Optional] Path to SWE-bench task instances file or dataset (default is princeton-nlp/SWE-bench_Lite)
--namespace [Optional] Namespace of the Docker repository
--predictions_path [Optional] Path to the predictions file, if not set the golden patch will be used
scripts/build_docker_images.sh [Namespace] [Testbed directory]