imoneoi/openchat

Cannot reproduce benchmarks


I use

python -m ochat.evaluation.run_eval --condition "GPT4 Correct" --model openchat/openchat-3.5-0106 --eval_sets coding fs_cothub/bbh fs_cothub/mmlu zs/agieval zs/bbh_mc_orca zs/truthfulqa_orca

and

python -m ochat.evaluation.run_eval --condition "Math Correct" --model openchat/openchat-3.5-0106 --eval_sets fs_cothub/gsm8k zs/math

to reproduce the code, math, and other reasoning benchmarks, but I can't reproduce the scores listed in the README.

I use the newest commit 30da91b, transformers 4.36.1/4.36.2, ochat 3.5.1, and vllm 0.2.1.
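
For reference, the environment above can be recorded with something like the following (a minimal sketch assuming a pip-managed environment and a local clone of the repository at `./openchat`, which is an assumption):

```
# show the installed versions of the packages mentioned above
pip show transformers ochat vllm | grep -E "^(Name|Version)"
# confirm the checked-out commit of the repository (path is an assumption)
git -C openchat rev-parse --short HEAD
```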

@zhang7346 Can you post the results or the error messages? There may be some fluctuation, since the vLLM implementation may change between versions.

By the way, you can check out the independent evalplus leaderboard, which reports slightly higher results than ours.

It's likely a vLLM issue, as the latest vLLM produces a lot of empty answers during evaluation. We're actively fixing it up.

Closing as vllm>=0.3.3 will fix this issue, and we'll update the package requirements in the next release. Re-open if needed.
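
For anyone landing here with the same symptom, upgrading vLLM to the version range mentioned above is straightforward in a pip-managed environment (a minimal sketch; adjust to your own setup):

```
# check which vLLM version is currently installed
pip show vllm
# upgrade to a version covered by the fix described above
pip install "vllm>=0.3.3"
```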

Thank you for your reply!
Now, with vllm==0.3.3, I can reproduce almost all of the benchmarks (bbh_mc, bbh_cot, agieval, gsm8k, truthfulqa, mmlu) except HumanEval.

I run the following commands:
```
python -m ochat.evaluation.run_eval --condition "GPT4 Correct" --model openchat/openchat-3.5-0106 --eval_sets coding
python ochat/evaluation/view_results.py
python ochat/evaluation/convert_to_evalplus.py
```
and then I run the evaluation outside of Docker:
```
evalplus.evaluate --dataset humaneval --samples /ochat/evaluation/evalplus_codegen/openchat3.5-0106_vllm033_transformers4382.jsonl
```
I got:
```
Base
{'pass@1': 0.25}
Base + Extra
{'pass@1': 0.23780487804878048}
```
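
A quick sanity check on the converted samples file can help rule out the empty-answer problem mentioned earlier in the thread. This is a rough sketch only: the exact field names in the JSONL depend on what convert_to_evalplus.py writes, so the grep pattern is an assumption.

```
SAMPLES=/ochat/evaluation/evalplus_codegen/openchat3.5-0106_vllm033_transformers4382.jsonl
# HumanEval has 164 problems, so there should be (at least) one record per problem
wc -l "$SAMPLES"
# crude check for empty generations; the field name in the JSONL is an assumption
grep -c '": ""' "$SAMPLES"
```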