noahshinn/reflexion

Interpreting results files

Closed this issue · 6 comments

Hi, super interesting work here! I was wondering how to interpret the results files in the repo root -- initially I thought is_solved meant correct or not, and that would indeed give 87.8% for Reflexion with GPT-4; but then the equivalent file without Reflexion gives 81.7% when my impression is that it should be 67%. How should I be interpreting the columns, and if is_solved != correct, how do I check correctness? Also, I see the reflections for the entries that have is_solved False, but not the predicted (incorrect) solutions -- how can I see those?
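(For reference, here's roughly how I'm computing those percentages -- a minimal sketch that assumes the results files are JSONL with one entry per problem and a boolean is_solved field; the filename is made up:)

    import json

    def pass_rate(results_path: str) -> float:
        # fraction of entries whose is_solved flag is true
        with open(results_path) as f:
            items = [json.loads(line) for line in f if line.strip()]
        return sum(1 for item in items if item.get("is_solved")) / len(items)

    # e.g. pass_rate("reflexion_gpt4_results.jsonl")  # made-up filename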

@sachit-menon I know this is unrelated to your question, but did you change any specific parameters when running the solution? I get is_solved = False on all the tasks.

I haven't even tried running it yet, just looking at the already-included output files 😅

Looks like the code doesn't log the incorrect solutions

From reflexion.py:

    if is_solved:
        item["is_solved"] = True
    else:
        item["is_solved"] = False
        item["solution"] = ""
    item["reflections"] = reflections
    write_jsonl(log_path, [item], append=True)
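If the intent is to keep the failing attempts too, something like this would do it (just a sketch -- cur_impl is my made-up name for whatever variable holds the last generated solution):

    if is_solved:
        item["is_solved"] = True
    else:
        item["is_solved"] = False
        # keep the last (failing) implementation instead of blanking it,
        # so incorrect solutions can be inspected in the results file
        item["solution"] = cur_impl  # hypothetical variable name
    item["reflections"] = reflections
    write_jsonl(log_path, [item], append=True)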

> Hi, super interesting work here! I was wondering how to interpret the results files in the repo root -- initially I thought is_solved meant correct or not, and that would indeed give 87.8% for Reflexion with GPT-4; but then the equivalent file without Reflexion gives 81.7% when my impression is that it should be 67%. How should I be interpreting the columns, and if is_solved != correct, how do I check correctness? Also, I see the reflections for the entries that have is_solved False, but not the predicted (incorrect) solutions -- how can I see those?

Hi, thanks for the note! I've pushed some changes that clean up some of the logic and include another run of the GPT-4 results, for those who do not want to rerun the entire benchmark. Use ./validate_py_results.py as a utility script to check new results if you choose to rerun.

> Looks like the code doesn't log the incorrect solutions
>
> From reflexion.py:
>
>     if is_solved:
>         item["is_solved"] = True
>     else:
>         item["is_solved"] = False
>         item["solution"] = ""
>     item["reflections"] = reflections
>     write_jsonl(log_path, [item], append=True)

Thanks for the note. I've cleaned up the code a bit to remove redundancy. Previously, I was only logging successful solutions.

Thanks Noah! In line 30 of validate_results.py I think there's a typo -- green_text_out = green_text(f"passes {num_tests}/{num_tests} test cases") is what displays how many tests pass, but that's just the same variable over itself. Where is the actual number of passing test cases computed, rather than just the total number present in item["test"]?

Edit: actually, I guess it's not a typo, my bad -- if there are no exceptions, it means the test cases all passed, since they're using asserts. So it doesn't measure how many individual tests passed, just all-or-nothing. (Still interested in the question below!)
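(For anyone else who trips over this: the all-or-nothing behavior falls out of assert-based checking. Conceptually it's something like the sketch below -- this is the general idea, not the actual contents of validate_results.py:)

    def passes_all_tests(solution_code: str, test_code: str) -> bool:
        # run the candidate solution, then its assert-based tests, in one namespace;
        # any failed assert (or other exception) fails the whole item, so there is
        # no per-test count -- just pass or fail
        env = {}
        try:
            exec(solution_code, env)
            exec(test_code, env)
            return True
        except Exception:
            return False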

Also, does is_solved being False mean that no test cases pass? I interpreted your previous comment as saying you didn't save the incorrect solutions before but that they're in the updated files, yet the solution field still looks blank for any is_solved False cases.
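(This is roughly how I'm checking that, by the way -- same JSONL assumption as above, made-up filename:)

    import json

    with open("reflexion_gpt4_results.jsonl") as f:  # made-up filename
        items = [json.loads(line) for line in f if line.strip()]

    # do any unsolved entries actually carry a non-empty solution?
    unsolved = [it for it in items if not it.get("is_solved")]
    kept = [it for it in unsolved if it.get("solution")]
    print(f"{len(kept)}/{len(unsolved)} unsolved entries have a non-empty solution field")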