princeton-nlp/SWE-bench

It seems that current evaluation does not handle the apply failure case?

Hodge931 opened this issue · 4 comments

Describe the issue

As titled. Thanks!


It should be recorded in the run_instance.log file via the logic here, is that what you're referring to?

The generated report no longer explicitly prints the number of instances where applying the patch failed; those instances are included in the count of unresolved instances. However, the number of failed patch applies should be recoverable by parsing the logs (i.e., checking whether the APPLY_PATCH_FAIL string shows up).
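A minimal sketch of that log parsing, in case it helps. It assumes the logs sit somewhere under a single root directory with one run_instance.log per instance, and that the APPLY_PATCH_FAIL constant expands to the marker text shown below. Both are assumptions to verify against your SWE-bench version, and `find_apply_failures` is a hypothetical helper, not part of the harness.

```python
from pathlib import Path

# Assumption: the text that the APPLY_PATCH_FAIL constant writes into the log.
# Check the harness constants for your SWE-bench version before relying on it.
APPLY_PATCH_FAIL_MARKER = ">>>>> Patch Apply Failed"


def find_apply_failures(log_root, marker=APPLY_PATCH_FAIL_MARKER):
    """Return the sorted paths of run_instance.log files that contain the marker."""
    failed = []
    # Search recursively so the exact directory layout does not matter.
    for log_path in Path(log_root).rglob("run_instance.log"):
        text = log_path.read_text(encoding="utf-8", errors="replace")
        if marker in text:
            failed.append(log_path)
    return sorted(failed)
```

Calling `len(find_apply_failures("path/to/logs"))` (the path is a placeholder) would then give the number of instances whose patch failed to apply, and the parent directory of each returned path identifies the instance.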

Thanks for the reply!

In my case, if errors are raised during parallel execution within Docker, the program gets stuck without evaluating the remaining instances.

Do you have a stack trace or some program output?

Closing this due to inactivity. @Hodge931 please feel free to re-open with more execution details or create a new issue.