Reproducing swebench-lite results
Thanks for releasing the repo, as well as the trajectories for swebench-lite! I am trying to reproduce the results with gpt-4o, but am seeing a fix rate of 59/300 (≈19.7%), as opposed to the 27.33% (≈82/300) reported.
- Other than the `--plausible` flag in rerank, are there any other possible causes for this?
- Did you notice a large amount of variance between runs?
- I changed the prompts slightly, adding a sentence before `# Examples` to clarify that we are giving output examples. Could this lead to large changes in resolution?
Hi Justin,
We have run Agentless multiple times ourselves, and while the results have some variance, it should not be as large as dropping to 59/300. I would expect a range somewhere in the ~70s to ~80s out of 300 (even without the plausible flag). As a reference, OpenAI recently ran our configuration and got 24.3%, and it seems they only generated 1 sample per bug.
Please check that your configuration is correct; you can refer to the README_swebenchlite.md file to fully recover our experimental settings.
Thanks
I tried a fresh clone and ran the commands in README_swebenchlite.md again. However, after the repair step I'm seeing `wc -l results/repair_run_1/output.jsonl` == 284, as opposed to the expected 300 in the v0.1.0 release. Oddly, the same is true for `results/repair_run_2/output.jsonl`.
I'll debug a bit more, try evaluating the locations, and also try feeding the locations from the v0.1.0 release into the repair step instead. Could OpenAI GPT outages cause some of the prompts to fail midway, but be labeled as completed and thus not be run again on a subsequent repair call?
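In case it helps anyone hitting the same 284-vs-300 gap, here is a rough way I'm checking which instances are missing from a repair run. It assumes `jq` is installed, that each line of `output.jsonl` carries an `instance_id` field, and that the v0.1.0 release artifacts are unpacked under a local `v0.1.0/` directory; all of those are my assumptions, not something the repo guarantees.

```bash
# instance_ids present in my local repair run (assumes one JSON object per line with an instance_id field)
jq -r '.instance_id' results/repair_run_1/output.jsonl | sort > local_ids.txt

# instance_ids in the released trajectories (the v0.1.0/ path is an assumption)
jq -r '.instance_id' v0.1.0/repair_run_1/output.jsonl | sort > release_ids.txt

# ids present in the release but missing locally (284 vs. 300 means ~16 ids should show up here)
comm -13 local_ids.txt release_ids.txt
```

If any of those ids also appear in the run logs as "completed", that would point toward the mid-run failure scenario I described above.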
I noticed you added a lot of new models like DeepSeek and gpt-4o-mini; do you have reference evaluation results for these models? Similar to Justin, I can't seem to fully reproduce the reported performance, but the high cost prevents me from running gpt-4o multiple times.
I am also very curious about the gpt-4o-mini results. Any follow-up updates on that?
I am also having trouble reproducing the swebench-lite results with my initial run. I plan to continue working out any possible issues or oversights from my end.
The swebench README has a lot of information and many commands. There are even commands (the "additional repair commands") that stay hidden until their collapsed section is expanded. Because of at least this repair phase, it seems that simply copy-pasting the commands from the README would not replicate the swebench-lite results: as I understand it, repairs are generated from 4 sets of edit locations, yet the README only shows commands for running the selected regression tests on the first set of repairs/edits (see the loop sketch below).
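To illustrate what I mean, I believe the later sets need something like the loop below, where the body is the exact regression-test command from README_swebenchlite.md; the `repair_run_N` folder naming is my assumption based on the earlier steps, not something I've confirmed.

```bash
# Rough sketch: repeat the regression-test selection step for every repair run, not just repair_run_1.
# Replace the echo with the exact command copied from README_swebenchlite.md.
for i in 1 2 3 4; do
    echo "run the README regression-test command against results/repair_run_${i}"
done
```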
I expect that at least some of my own replication issues come from my unfamiliarity with, and limited understanding of, each of the many steps, and therefore which commands to run. I also run into errors in the scripts themselves; I wonder if others have the same issue.
Providing a bash script with the exact commands that produced your leaderboard results would help users like myself with reproduction. Is this possible? Also, if anyone in the community can share such a bash script here or in a PR, that would be greatly appreciated.
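To make the request concrete, below is the kind of wrapper I have in mind. It is only a skeleton: every placeholder would need to be replaced with the corresponding command copied verbatim from README_swebenchlite.md, and the step names and folder layout are my guesses rather than the repo's actual interface.

```bash
#!/usr/bin/env bash
# Skeleton reproduction script: fill each step with the exact command from README_swebenchlite.md.
set -euo pipefail

RESULTS_DIR=results                                   # assumed output root
export OPENAI_API_KEY="${OPENAI_API_KEY:?set your API key first}"
mkdir -p "${RESULTS_DIR}"

run_step () {
    # Run one pipeline step and keep a per-step log so partial failures are visible.
    local name="$1"; shift
    echo "=== ${name} ==="
    "$@" 2>&1 | tee "${RESULTS_DIR}/${name}.log"
}

# 1. Localization (file level -> related elements -> edit locations)
# run_step localization   <README command here>

# 2. Repair: generate patches for each of the 4 sets of edit locations
# run_step repair_run_1   <README command here>
# run_step repair_run_2   <README command here>  # ...and so on for runs 3 and 4

# 3. Regression/reproduction test selection, rerun, and reranking (with --plausible)
# run_step rerank         <README command here>
```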