noahshinn/reflexion

Potential unexpected behavior in the executor (which uses `globals()`)

allanj opened this issue · 5 comments

I was running the script https://github.com/noahshinn/reflexion/blob/main/programming_runs/run_reflexion_codellama_multi.sh
with a CodeLLaMA model, simply changing `codellama` to `codellama-7b`:

CUDA_VISIBLE_DEVICES=$1 python main.py \
  --run_name "reflexion_codellama_$1" \
  --root_dir "root" \
  --dataset_path ./benchmarks/humaneval-py.jsonl \
  --strategy "reflexion" \
  --language "py" \
  --model "codellama-7b" \
  --pass_at_k "1" \
  --max_iters "2" \
  --verbose | tee ./logs/reflexion_codellama_$1

The performance is around 30-40%, although the run gets killed in the middle of the experiment.

Observation

Some predictions exhibit unexpected behavior yet are still marked as correct (`is_solved = True`). Consider the following prediction (also attached here):

  1. The first program generated actually failed the generated test cases (NOTE: these test cases are themselves generated by CodeLLaMA-7B), although this program eventually passes the ground-truth test cases.
  2. After reflection, the model actually generates nothing (i.e., None).
  3. So presumably we should take "" as our final output, right?
  4. But "is_solved" is actually assigned True.


Reason

Because the executor always uses the function globals(), even when the program after reflection is empty, globals() still contains the program from the first attempt.

Thus, since we never clear the information in globals() between attempts, this can lead to unexpected behavior.
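Here is a minimal sketch of the effect, assuming the executor runs code against one shared namespace the way exec(code, globals()) would (this is an illustration, not the repo's actual executor code):

# Shared namespace standing in for globals(): definitions from the
# first attempt survive into later attempts.
shared_globals = {}

first_attempt = """
def add(a, b):
    return a + b
"""
exec(first_attempt, shared_globals)

# Second attempt: no code could be extracted, so we execute "".
exec("", shared_globals)

# The test still passes because `add` is memorized from attempt one,
# so the empty second attempt gets scored as solved.
exec("assert add(1, 2) == 3", shared_globals)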

I have been a bit confused here; I hope the authors can clarify this further.

In particular, globals() will definitely memorize previous implementations, which will then be used if a future generation is None.

Or, a future generation may reference something (e.g., some functions) it does not define itself, but those functions are memorized in globals() from earlier runs. Potentially, that introduces some contamination into the evaluation.

That means the order in which the evaluation samples are run matters.

So if I understand correctly, it could potentially cause incorrect answers to be marked as correct?

Yes, right!

But I assume that for more capable models like GPT-3.5/4 or other large models, such an issue rarely happens.

I just got in touch with the authors, and I'm currently making a fix.

I think the current issue is that we need to decide what to do when cur_func_impl is None.

We currently always assert on the current implementation:

assert isinstance(cur_func_impl, str)

This can fail because it is possible that the response is pure natural language without any extractable code, which makes the variable cur_func_impl None.
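For illustration, here is a hypothetical extraction helper (parse_code_block and its regex are stand-ins, not necessarily the repo's implementation) that returns None when the response contains no code block:

import re

def parse_code_block(response):
    # Return the contents of the first fenced code block, or None.
    match = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else None

# A purely conversational response yields no code, hence None.
cur_func_impl = parse_code_block("Sure! First, let's think about the edge cases...")
assert cur_func_impl is None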

My current attempt:

  1. For the first attempt, we just set it to the empty string "", as it is also the initial implementation:
if cur_func_impl is None:
    # code cannot be extracted from the response.
    cur_func_impl = ""
  2. For the reflection attempts, we can roll back and reflect again, rather than substituting an empty string (see the loop sketch below for where these guards fit):
if reflected_func_impl is None:
    # code cannot be extracted from the response.
    # we just skip this iteration and reflect again.
    cur_iter += 1
    continue
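To show where the two guards sit, here is a minimal, self-contained sketch of a reflexion-style loop; generate_impl, reflect_and_regenerate, run_tests, prompt, and max_iters are hypothetical stand-ins, not the repo's actual API:

def generate_impl(prompt):
    # hypothetical stand-in for the LLM call; pretend no code was extracted
    return None

def reflect_and_regenerate(prompt, impl):
    # hypothetical stand-in for the reflection + regeneration call
    return "def solution():\n    return 42\n"

def run_tests(impl):
    # hypothetical stand-in for the internal test executor
    return "def solution" in impl

prompt, max_iters = "write solution()", 2

cur_func_impl = generate_impl(prompt)
if cur_func_impl is None:
    # guard 1: first attempt falls back to "" instead of failing the assertion
    cur_func_impl = ""

cur_iter = 0
while cur_iter < max_iters:
    if run_tests(cur_func_impl):
        break
    reflected_func_impl = reflect_and_regenerate(prompt, cur_func_impl)
    if reflected_func_impl is None:
        # guard 2: no code extracted, so skip this iteration and reflect again
        cur_iter += 1
        continue
    cur_func_impl = reflected_func_impl
    cur_iter += 1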

I'm making a PR for this and will post more results here.

Some changes: main...allanj:reflexion:fix_global
I'm running some experiments to verify the results.

If I clear the globals() variable information, the performance seems to drop a lot (when using CodeLLaMA-7B); I'm checking the reason.
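For reference, one way to avoid the leakage entirely (an assumed approach for illustration, not necessarily what the fix branch does) is to execute each program and its tests in a fresh namespace instead of the shared globals():

def run_isolated(program, test):
    # Fresh namespace per sample: nothing leaks between attempts
    # or between evaluation samples.
    namespace = {}
    try:
        exec(program, namespace)
        exec(test, namespace)
        return True
    except Exception:
        return False

# An empty second attempt now fails instead of inheriting `add`
# from an earlier attempt.
assert run_isolated("def add(a, b):\n    return a + b\n", "assert add(1, 2) == 3")
assert not run_isolated("", "assert add(1, 2) == 3")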