record_test_for_each_commit flag does not store all commits and records the same result every time
denizbt opened this issue · 1 comments
denizbt commented
Description
I encountered an issue when using the record_test_for_each_commit: true flag in the agent configuration for the commit0 split containing only simpy.
Despite having 25+ commits in the git log (after running an agent for the simpy repository), the eval_results.json file contains only six entries, and each entry shows the same test results—83/140 test cases passed with identical runtimes.
- The test results stored for each commit in eval_results.json are identical, even though substantial code changes and test runs occurred across different commits.
- Running commit0 evaluate --branch commit0-test yields inconsistent results. Specifically, it reports 0/140 test cases passed (it took over 30 minutes to finish, so I assume it timed out and was terminated), whereas eval_results.json shows 83/140 tests passed for the same commit hash.
Steps to Reproduce:
- Set up an agent configuration file with the record_test_for_each_commit: true flag. Here's the .agent.yaml file I used:
add_import_module_to_context: true
agent_name: aider
max_iteration: 3
max_lint_info_length: 10000
max_repo_info_length: 10000
max_spec_info_length: 10000
max_unit_tests_info_length: 10000
model_name: gpt-4o-mini
pre_commit_config_path: .pre-commit-config.yaml
record_test_for_each_commit: true
run_entire_dir_lint: false
run_tests: true
use_lint_info: false
use_repo_info: false
use_spec_info: false
use_topo_sort_dependencies: true
use_unit_tests_info: true
use_user_prompt: false
user_prompt: 'Here is your task:
  You need to complete the implementations for all functions (i.e., those with pass
  statements) and pass the unit tests.
  Do not change the names of existing functions or classes, as they may be referenced
  from other code like unit tests, etc.
  When you generate code, you must maintain the original formatting of the function
  stubs (such as whitespaces), otherwise we will not able to search/replace blocks
  for code modifications, and therefore you will receive a score of 0 for your generated
  code.'
- Run the agent on the simpy repository with agent run --branch commit0-test (commit0-test is a new branch created in the simpy repository through git checkout -b commit0-test).
- Compare the test results in eval_results.json to those obtained by running commit0 evaluate --branch commit0-test after the agent has finished running.
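The comparison in the last step can be sketched in Python. This is only an illustration: the layout of eval_results.json is not shown in this issue, so the sketch assumes it is a JSON object mapping each commit hash to a result object. The function names, the file layout, and the sample data are all hypothetical. The only external command used is git log --format=%H, which prints one full commit hash per line.

```python
import json
import subprocess


def list_commit_hashes(repo_dir):
    """Return the commit hashes on the current branch, newest first."""
    out = subprocess.run(
        ["git", "log", "--format=%H"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return out.split()


def check_recorded_results(commit_hashes, recorded):
    """Compare the commits in the log against the recorded results.

    ASSUMPTION: `recorded` maps commit hash -> result object. The real
    eval_results.json layout may differ and would need adapting.
    """
    missing = [h for h in commit_hashes if h not in recorded]
    # Serialize canonically so dict-valued results can be compared.
    distinct = {json.dumps(r, sort_keys=True) for r in recorded.values()}
    return {
        "commits_in_log": len(commit_hashes),
        "commits_recorded": len(recorded),
        "missing": missing,
        "all_identical": len(recorded) > 1 and len(distinct) == 1,
    }


if __name__ == "__main__":
    # Made-up sample data, not taken from the actual run.
    sample_log = ["a1", "b2", "c3"]
    sample_recorded = {
        "a1": {"passed": 83, "total": 140},
        "c3": {"passed": 83, "total": 140},
    }
    print(check_recorded_results(sample_log, sample_recorded))
```

A report with a non-empty missing list and all_identical set to true would match the symptoms described above: fewer recorded entries than commits, all carrying the same result.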
Screenshots:
Here is the output I receive after running commit0 evaluate --branch commit0-test.
I assume that this command runs the evaluation on the most recent commit, which the eval_results.json file claims passes 83/140 test cases.
.commit0.yaml File
base_dir: /Users/denizbt/Documents/commit0-files/repos
dataset_name: wentingzhao/commit0_combined
dataset_split: test
repo_split: simpy
wenting-zhao commented
Confirmed by @denizbt offline: there isn't an issue in commit0.