Code and data for SWE-Bench Pro, a challenging benchmark that evaluates LLMs and agents on long-horizon software engineering tasks. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem.
The dataset is inspired by SWE-bench: https://github.com/SWE-bench/SWE-bench
To access SWE-bench Pro, copy and run the following code:
from datasets import load_dataset
swebench = load_dataset('ScaleAI/SWE-bench_Pro', split='test')
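Each row describes one task instance. As a quick sanity check, you can print the first example and the available columns; the exact field names come from the dataset schema, so inspect them rather than assuming any particular layout:

```python
# Inspect the first instance and the column names.
# (Expect fields describing the repository, base commit, problem statement, etc.,
# but the exact column names are defined by the dataset itself.)
print(swebench[0])
print(swebench.column_names)
```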
SWE-bench Pro uses Docker for reproducible evaluations. In addition, the evaluation script requires Modal to scale evaluation across instances.
Follow the instructions in the Docker setup guide to install Docker on your machine. If you're setting up on Linux, we recommend following the post-installation steps as well.
Run the following commands to store your Modal credentials:
pip install modal
modal setup  # and follow the prompts to generate your token and secret
After running these steps, you should see a token ID and secret in ~/.modal.toml, for example:
token_id = <token id>
token_secret = <token secret>
active = true
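To confirm the credentials work before launching a full evaluation, you can run a trivial Modal function. This is an optional sketch, not part of the evaluation harness; it assumes a recent modal client where modal.App is available, and the app name is arbitrary:

```python
import modal

app = modal.App("swebench-pro-smoke-test")  # hypothetical app name

@app.function()
def ping() -> str:
    # Runs remotely on Modal; if this succeeds, the token in ~/.modal.toml is valid.
    return "modal credentials OK"

if __name__ == "__main__":
    with app.run():
        print(ping.remote())
```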
We store prebuilt Docker images for each instance. They can be found in this Docker Hub repository:
https://hub.docker.com/repository/docker/jefzda/sweap-images/general
Image tags follow this format:
jefzda/sweap-images:{repo_base}.{repo_name}-{repo_base}__{repo_name}-{hash}
For example:
jefzda/sweap-images:gravitational.teleport-gravitational__teleport-82185f232ae8974258397e121b3bc2ed0c3729ed-v626ec2a48416b10a88641359a169d99e935ff03
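If you need to derive an image name programmatically, the pattern above can be reproduced with a small helper. This is an illustrative sketch, not part of the released tooling; the function name is ours, and it assumes the trailing hash is taken verbatim from the instance metadata:

```python
def sweap_image_tag(repo: str, commit_hash: str) -> str:
    """Build a jefzda/sweap-images tag for a repo like 'gravitational/teleport'.

    Hypothetical helper: `commit_hash` is assumed to be the full hash suffix
    exactly as it appears in the instance metadata.
    """
    base, name = repo.split("/")
    return f"jefzda/sweap-images:{base}.{name}-{base}__{name}-{commit_hash}"

# Reproduces the example tag above:
# sweap_image_tag(
#     "gravitational/teleport",
#     "82185f232ae8974258397e121b3bc2ed0c3729ed-v626ec2a48416b10a88641359a169d99e935ff03",
# )
```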
First, generate patch predictions using your harness of choice. Then evaluate them on SWE-bench Pro with the following command:
python sweap_pro_eval_modal.py \
--raw_sample_path=external_hf_v2.csv \
--patch_path={OUTPUT}/gold_patches.json \
--output_dir={OUTPUT}/ \
--scripts_dir=run_scripts \
--num_workers=100 \
--dockerhub_username=your-username

Replace gold_patches.json with your own predictions JSON, and point raw_sample_path to the SWE-Bench Pro CSV.
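As a sanity check, you can build a predictions file from the dataset's own gold patches and confirm the harness reports them as resolved. The sketch below assumes SWE-bench-style column names ("instance_id", "patch") and prediction schema; verify both against the actual SWE-bench Pro data before relying on it:

```python
import json
from datasets import load_dataset

swebench = load_dataset('ScaleAI/SWE-bench_Pro', split='test')

# The field names ("instance_id", "patch") and the prediction schema below are
# assumptions carried over from SWE-bench conventions; check the dataset schema.
predictions = [
    {
        "instance_id": example["instance_id"],
        "model_name_or_path": "gold",
        "model_patch": example["patch"],
    }
    for example in swebench
]

with open("gold_patches.json", "w") as f:
    json.dump(predictions, f, indent=2)
```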