paul-gauthier/aider-swe-bench

A few issues while trying to rerun swe-bench-lite with aider

daniel-vainsencher opened this issue · 13 comments

  • In the README.md Installation instructions, there is a bare pip install of requirements.txt, without any mention of creating a venv or other kind of environment. People should know to set up an environment themselves, but it would be better to smooth this out in the docs.
  • The suggestion to read the SWE-bench-docker docs to ensure the docker images are built or pulled is confusing: pulls happen automatically, don't they? It would be helpful to understand how harness.py relates to that repo. If there is a reason pulls don't happen automatically, I haven't gotten to that stage yet.
  • The Running the benchmark harness section says to edit and run harness.py. I've run into the following issues with that:
  • The import of get_lite_dataset is commented out, and that's exactly what I was trying to use. It looks like an autofix to appease a linter after the call to it was commented out; it would probably be better to import all the dataset loaders and add a noqa comment (see the sketch after this list).
  • just_devin_570 (which I didn't want, since I'm running lite) is turned on and used in two ways:
    • a condition filters the dataset down immediately
    • it is also threaded through process_instances to filter the predictions. Isn't this redundant with the previous filter? The predictions need to match the dataset whether it is lite, full, or the 570 subset.
  • I just set it to False, hoping that works.
  • After that, I just ran harness.py and got "NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported." This seems to be a bug in datasets or fsspec (which would be pure bad luck on my part), but it could be mitigated by providing a frozen requirements file.
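
For the import point above, a minimal sketch of what I have in mind (the utils module path and the get_full_dataset name are guesses; only get_lite_dataset is confirmed to exist in harness.py):

from utils import get_full_dataset, get_lite_dataset  # noqa: F401

USE_LITE = True  # single switch instead of commenting imports and calls in and out
get_dataset = get_lite_dataset if USE_LITE else get_full_dataset
dataset = get_dataset()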

Reading more about SWE-bench-docker, I'd recommend adding something like the following to the Installation section (possibly with some disclaimers):

SWE-bench-docker/scripts/pull_docker_images.sh SWE-bench-docker/docker/ aorwall

OK, upgrading the datasets package with pip install -U datasets resolved that issue.

I recommend adding mkdir chat-logs to the running instructions (or fixing the code to create the directory automatically; see the sketch after the traceback), because:

Traceback (most recent call last):
  File "/home/danielv/System/Software/aider-swe-bench/./harness.py", line 482, in <module>
    status = main()
  File "/home/danielv/System/Software/aider-swe-bench/./harness.py", line 385, in main
    process_instances(
  File "/home/danielv/System/Software/aider-swe-bench/./harness.py", line 450, in process_instances
    chat_history_dname.mkdir(exist_ok=True)
  File "/usr/lib/python3.10/pathlib.py", line 1175, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'chat-logs/testing---gpt-4o'
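
For reference, a minimal sketch of the auto-create fix; the only change to the existing mkdir call is passing parents=True:

from pathlib import Path

chat_history_dname = Path("chat-logs") / "testing---gpt-4o"  # path from the traceback above
# parents=True also creates the missing "chat-logs" parent, so no manual mkdir is needed
chat_history_dname.mkdir(parents=True, exist_ok=True)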

With that, the benchmark is running; I will report back once the repro is done.

The run was killed partway through (due to OpenAI rate-limit errors), so now I'm trying to run ./report.py predictions/testing---gpt-4o.

  • Because DEVIN_570 was hard-coded to True in report.py, I had 0 filtered predictions, so I got this error (see the guard sketch after this list):
  File "/home/danielv/System/Software/aider-swe-bench/./report.py", line 147, in preds_to_jsonl
    model_name_or_path = list(predictions.values())[0]["model_name_or_path"]
IndexError: list index out of range
  • Because the logs directory didn't exist, I got: FileNotFoundError: [Errno 2] No such file or directory: 'logs/testing---gpt-4o'.
  • My environment uses Python 3.10, so I had to change SWE-bench-docker to not use TaskGroup (asyncio.TaskGroup requires Python 3.11). It might be worth printing a warning that explains the issue, but it's not a big deal.
  • report.py is hardcoded to pass FULL_DATASET_FNAME to run_evaluation, but that file is not in the aider-swe-bench repo. I used the LITE file instead, which is.
  • The directory predictions/full is assumed to already exist.
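
For the first two bullets above, a sketch of the kind of guard I mean (preds_to_jsonl and the predictions dict shape come from the traceback; the function name and messages here are hypothetical):

import sys
from pathlib import Path

def preds_to_jsonl_guarded(predictions, log_dir: Path):
    # Fail with a readable message instead of IndexError when the DEVIN_570 /
    # dataset filter leaves no predictions behind.
    if not predictions:
        sys.exit("No predictions left after filtering; check the dataset flags in report.py")
    model_name_or_path = list(predictions.values())[0]["model_name_or_path"]
    # Create logs/<DIRNAME> up front so the evaluation step doesn't hit FileNotFoundError.
    log_dir.mkdir(parents=True, exist_ok=True)
    return model_name_or_path  # the rest of the original function would continue from here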

After dealing with all of these, report.py completes; I will mention that in a separate comment.

I fully agree with @daniel-vainsencher; running this repo is not that smooth.

BTW, many thanks to Daniel for the running log; it's really helpful.

@RenzeLou I appreciate it.

BTW, my report ended up looking very bad, because the logs are missing for many instances. Looking back, there is some issue writing the logs to /opt/logs, but I haven't debugged it yet. If anyone has seen this and has a fix, I would be happy to hear about it.

I have gone through the testing scripts of this repo; they basically use the log files to decide whether the instances are resolved (i.e., the *.eval.log files are required, otherwise the final test score becomes zero). You can also check the resolve-score calculation here:

https://github.com/princeton-nlp/SWE-bench/blob/8b1265b7817cf3cba114c56e7a5b98bba3f9979d/swebench/metrics/report.py#L328

However, I didn't find any .eval.log files after running this repo (as @daniel-vainsencher also mentioned); there were only .md and .json files under the output dir.
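
As a quick sanity check (a hypothetical snippet, assuming the logs/<run-name> layout used by report.py), you can count the eval logs directly:

from pathlib import Path

log_dir = Path("logs/lite---gpt-4o")  # adjust to your own logs/<run-name> directory
eval_logs = sorted(log_dir.glob("*.eval.log"))
print(f"{len(eval_logs)} eval logs found under {log_dir}")
# If this prints 0, the evaluation step never produced logs, so every instance
# will be scored as unresolved no matter how good the predictions are.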

@paul-gauthier Could you answer these questions? Where are the .eval.log files saved (by default)? Or is there anything we missed?

I would very much appreciate it if you could help on this issue.

The workflow for working with SWE Bench in general is 2 steps:

  1. Run your agent on the problems to produce predictions, which are a series of json records that get bundled up into a jsonl file.
  2. Evaluate the predictions jsonl file using the acceptance tests. This produces the .eval.log files that I think you are asking about.

This repo is for running and evaluating aider on SWE Bench. As described in the README, it consists of 2 scripts:

  1. The harness.py script will run aider on all the problems and produce predictions. It does not do any acceptance testing. It does run any pre-existing tests that were part of the problem's repo, but never runs any acceptance tests. This script produces a bunch of predictions as individual json files in predictions/<DIRNAME>/<instance_id>.json.

  2. The report.py script consumes all those predictions and turns them into predictions/<DIRNAME>/all_preds.jsonl. It then feeds that jsonl file through the SWE Bench evaluation and reporting scripts to produce logs/<DIRNAME>/<instance_id>...eval.log files as well as a summary report in predictions/<DIRNAME>/results.json.
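
Roughly, the bundling part of step 2 looks like this; a simplified sketch, not the exact code in report.py, just to show where all_preds.jsonl comes from:

import json
from pathlib import Path

def bundle_predictions(pred_dir: Path) -> Path:
    """Collect the per-instance prediction JSON files into all_preds.jsonl."""
    out_path = pred_dir / "all_preds.jsonl"
    with out_path.open("w") as out:
        for json_file in sorted(pred_dir.glob("*.json")):
            if json_file.name == "results.json":
                continue  # skip the report output, bundle only per-instance predictions
            record = json.loads(json_file.read_text())
            out.write(json.dumps(record) + "\n")
    return out_path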

Let me know if that was helpful?

Thanks for your reply @paul-gauthier!

I am running SWE-bench Lite, and I think I have correctly set the dataset (get_lite_dataset in harness.py).

After Aider had produced predictions for several instances (I didn't run the full Lite benchmark), I ran report.py, but no results.json or .eval.log files were generated.

Here is the info printed on the terminal:

...
LnB5IiwgInRlc3RzL3Rlc3RfZXh0X2F1dG9kb2NfY29uZmlncy5weSJdLCAidGVzdF9jbWQiOiAidG94IC1lcHkzOSAtdiAtLSB0ZXN0cy9yb290cy90ZXN0LWV4dC1hdXRvZG9jL3RhcmdldC9hbm5vdGF0aW9ucy5weSB0ZXN0cy90ZXN0X2V4dF9hdXRvZG9jX2NvbmZpZ3MucHkifQ== -e LOG_DIR=/opt/logs -e TIMEOUT=1800 -e LOG_SUFFIX= aorwall/swe-bench-sphinx-doc_sphinx-testbed:3.4
2024-06-05 16:11:50,583 - swebench_docker.run_docker - WARNING - Stdout - /bin/sh: docker: command not found

2024-06-05 16:11:50,583 - swebench_docker.run_docker - WARNING - Stderr - None
Processing predictions: 100%|████████████████████████████████████████| 5/5 [00:00<00:00, 10871.71it/s]
==> log_dir: logs/lite---gpt-4o
sorted(resolved_instances): []
len(generated_minus_applied): 4
generated_minus_applied: django__django-13315* django__django-15061* django__django-16041* sphinx-doc__sphinx-8435*
len(with_logs_minus_applied): 0
with_logs_minus_applied: set()
len(no_apply): 0
no_apply: 
predictions/OLD/lite---gpt-4o.240605-161150 predictions/lite---gpt-4o
len(chosen): 0
No predictions

I don't know what's going on. Could you provide any hints on this issue? Thanks so much.

@RenzeLou I would start from
/bin/sh: docker: command not found
Do you have docker installed?
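
A quick, hypothetical way to check from Python, since the warning above shows swebench_docker shelling out to the docker CLI:

import shutil

# The `docker` binary must be on PATH for swebench_docker's shell calls to work.
print(shutil.which("docker") or "docker not found on PATH")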

@RenzeLou another thing: keep in mind that this is a very new repo, and that the task it attempts to do closely coordinates 4 repos (not counting mere libraries):

  • Aider
  • this repo
  • The SWE-Bench-docker repo
  • The SWE-Bench repo

This is a complicated integration piece, so it's totally understandable that it starts out imperfect. This is not for the faint of heart; it's still a bit of a wild west. If you are not ready to do serious debugging, I'd recommend waiting to see if things stabilize.

@paul-gauthier two things:

  1. I'd like running LITE to just work in harness.py and report.py, requiring the user either to change the code in a single place or to pass arguments (see the sketch below). Let me know if you'd like me to create a PR for that, based on the changes I've already made.
  2. The issue I am encountering with logs seems to be due to improperly set permissions in the original images in the aorwall namespace. Did you have to rebuild the docker images? Any notes on what you did to fix them?
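
For reference, the shape of the change I have in mind for point 1; a sketch only, the flag name and choices are made up:

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Run the aider SWE Bench harness / report")
    # One switch instead of hand-editing just_devin_570 and the dataset imports.
    parser.add_argument(
        "--dataset",
        choices=["lite", "full", "devin570"],
        default="lite",
        help="which SWE Bench split to run against",
    )
    return parser.parse_args()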

For posterity: part of the issue I was encountering with permissions was probably because I was using rootless docker, but I haven't completely resolved it. Switching to rootless podman solved some problems and created others :/
I am not sure the current approach to isolation is optimal.