record_test_for_each_commit flag does not store all commits, and has the same result every time

Question

denizbt opened this issue a year ago · 1 comment

Description

I encountered an issue when using the record_test_for_each_commit: true flag in the agent configuration for the commit0 split containing only simpy.

Despite having 25+ commits in the git log (after running an agent for the simpy repository), the eval_results.json file contains only six entries, and each entry shows the same test results—83/140 test cases passed with identical runtimes.

The test results stored for each commit in eval_results.json are identical, even though substantial code changes and test runs occurred across different commits.
Running commit0 evaluate --branch commit0-test yields inconsistent results. Specifically, it reports 0/140 test cases passed (the run took over 30 minutes to finish, so I assume it timed out and was terminated), whereas eval_results.json shows 83/140 tests passed for the same commit hash.
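
The mismatch described above (25+ commits in the git log, six identical entries in the results file) can be checked with a short script. This is only a sketch: the layout of eval_results.json is not shown in this issue, so the code assumes it is a JSON object mapping full commit hashes to per-commit results. The helper names list_commits and summarize are hypothetical and not part of commit0.

```python
import json
import subprocess


def list_commits(repo_dir, branch):
    """Return the commit hashes reachable from `branch`, newest first."""
    out = subprocess.run(
        ["git", "-C", str(repo_dir), "rev-list", branch],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return out.split()


def summarize(commits, results):
    """Compare commit hashes against a {commit_hash: result} mapping.

    ASSUMPTION: `results` mirrors eval_results.json as a dict keyed by
    full commit hash. Adjust the lookup if the real file differs.
    """
    missing = [c for c in commits if c not in results]
    # Serialize each entry so nested dicts and lists compare by value.
    distinct = {json.dumps(r, sort_keys=True) for r in results.values()}
    return {
        "commits": len(commits),
        "recorded": len(results),
        "missing": missing,
        "all_identical": len(results) > 1 and len(distinct) == 1,
    }
```

To use it, load the results file with json.load and call summarize(list_commits(repo_dir, "commit0-test"), results). A non-empty "missing" list together with "all_identical" set to True would match the symptoms reported here.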

Steps to Reproduce:

Set up an agent configuration file with the record_test_for_each_commit: true flag. Here's the .agent.yaml file I used.

add_import_module_to_context: true
agent_name: aider
max_iteration: 3
max_lint_info_length: 10000
max_repo_info_length: 10000
max_spec_info_length: 10000
max_unit_tests_info_length: 10000
model_name: gpt-4o-mini
pre_commit_config_path: .pre-commit-config.yaml
record_test_for_each_commit: true
run_entire_dir_lint: false
run_tests: true
use_lint_info: false
use_repo_info: false
use_spec_info: false
use_topo_sort_dependencies: true
use_unit_tests_info: true
use_user_prompt: false
user_prompt: 'Here is your task:

  You need to complete the implementations for all functions (i.e., those with pass
  statements) and pass the unit tests.

  Do not change the names of existing functions or classes, as they may be referenced
  from other code like unit tests, etc.

  When you generate code, you must maintain the original formatting of the function
  stubs (such as whitespaces), otherwise we will not able to search/replace blocks
  for code modifications, and therefore you will receive a score of 0 for your generated
  code.'

Run the agent on the simpy repository with agent run --branch commit0-test (commit0-test is a new branch created in the simpy repository with git checkout -b commit0-test).
Compare the test results in eval_results.json with those obtained by running commit0 evaluate --branch commit0-test after the agent has finished running.

Screenshots:

Here is the output I receive after running commit0 evaluate --branch commit0-test.
I assume that this command runs the evaluation on the most recent commit, which the eval_results.json file claims passes 83/140 test cases.

.commit0.yaml File

base_dir: /Users/denizbt/Documents/commit0-files/repos
dataset_name: wentingzhao/commit0_combined
dataset_split: test
repo_split: simpy

Answer 1 · 2024-11-06T03:26:50.000Z

@denizbt confirmed offline that there isn't an issue in commit0.