noahshinn/reflexion

Can't reproduce HumanEval score

geekan opened this issue · 5 comments

geekan commented

I followed programming_runs/run_reflexion.sh

and got scores of 0.77-0.83 across multiple trials.

I cannot reproduce the result either 😭 Is it possible for the authors to release the tests generated by GPT-4 that they used in the experiments?

Hi @geekan and @FloridSleeves,

As with many LLM papers, we are subject to the performance of proprietary models, since there is no better open-source option that performs at a comparably high level. We show results for some open-source models in the appendix of the most recent version of the paper to demonstrate this. If you want to use OpenAI's models with Reflexion, I would advise appending the -0314 suffix to the gpt-4 or gpt-3.5-turbo model names so that you evaluate a checkpoint from a time closer to our experiments. I hope we will have more open-source options on which to run Reflexion in the future.
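
For reference, pinning a dated checkpoint looks roughly like the sketch below using the OpenAI Python SDK. This is only illustrative and not the exact call the repo makes; the prompt and parameters here are assumptions.

```python
# Minimal sketch: pinning a dated model checkpoint with the OpenAI Python SDK.
# Illustrative only; not the exact invocation used in the reflexion repo.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-0314",  # dated checkpoint closer in time to the paper's experiments
    messages=[
        {"role": "system", "content": "You are a Python programming assistant."},
        {"role": "user", "content": "Complete the following function:\n\ndef add(a, b):"},
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)
```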

I just ran programming_runs/run_reflexion.sh directly and also only got 80%.

Also, the HumanEval run only has 161 tasks (I would expect 164)?

I am also curious about this; HumanEval-Python should have 164 tasks.

We used the MultiPL-E benchmark, which includes 161 tasks; we also use MultiPL-E for our Rust experiments.
The HumanEval dataset is not clean, so transformations are required for a sound evaluation. MultiPL-E makes the following adjustments to the Python dataset:

Of the 164 original HumanEval benchmarks: (1) we exclude 3 benchmarks that have Python helper functions in their prompt; (2) we modify 2 benchmarks to use unit tests instead of randomized testing; and (3) for certain typed languages, we fail to compile up to 5 benchmarks with untranslatable types. These changes do not lead to significantly different results for Python.
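
If you want to confirm the 161-task count yourself, one way is to load the MultiPL-E Python split from the Hugging Face Hub. The dataset ID, config name, and split below are my assumptions about how MultiPL-E is published, so adjust them if the Hub listing differs.

```python
# Sketch: count the tasks in MultiPL-E's Python split of HumanEval.
# Assumes the dataset is published as "nuprl/MultiPL-E" with config "humaneval-py"
# and a "test" split; verify the names against the Hub listing.
from datasets import load_dataset

ds = load_dataset("nuprl/MultiPL-E", "humaneval-py", split="test")
print(len(ds))  # expected: 161 (164 original tasks minus the exclusions above)
```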

Going forward, I recommend that people use EvalPlus or avoid HumanEval altogether in favor of datasets that are guaranteed not to be included in the training data (e.g. LiveCodeBench).
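
If you go the EvalPlus route, an evaluation has roughly the shape sketched below. The evalplus names used here (get_human_eval_plus, write_jsonl, the evalplus.evaluate CLI) are my best recollection of that package's public API and should be checked against its documentation; generate_one_completion is a hypothetical placeholder for your own model call.

```python
# Sketch: generating samples for HumanEval+ with the evalplus package.
# The imports and CLI command reflect my understanding of evalplus's API;
# verify them against the evalplus docs before relying on this.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Hypothetical placeholder for your model call
    # (e.g. the pinned gpt-4-0314 request shown earlier in this thread).
    raise NotImplementedError

problems = get_human_eval_plus()
samples = [
    {"task_id": task_id, "solution": generate_one_completion(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)

# Then score the samples, e.g. with the evalplus CLI:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
```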