Performance Results on HumanEval
htcml opened this issue · 1 comment
I am reading your CodeRL paper. It uses the APPS benchmark to compare performance with Codex. Do you have any comparison results on the HumanEval dataset?
@htcml thanks for reading the paper.
In our case, the HumanEval dataset would not be the best evaluation benchmark. HumanEval is framed as a docstring-to-code task in which the function signature and its docstring (in a code comment block) are given, which makes it well suited to zero-shot evaluation of larger LMs such as CodeGen and Codex.
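For illustration, a HumanEval-style prompt looks roughly like the (hypothetical, not an actual benchmark item) example below: the model only has to complete the function body under the given signature and docstring.

```python
from typing import List

def below_threshold(numbers: List[int], threshold: int) -> bool:
    """Return True if every number in the list is strictly below the given threshold.

    >>> below_threshold([1, 2, 4], 10)
    True
    """
    # <-- the model is asked to generate only the function body from here
```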
In our paper, we focus more on problems given as natural-language text descriptions, where the model generates a program from scratch.
One workaround would be to reformulate HumanEval as a text-to-code task, but the comparison with the current baselines might not be fair. A minimal sketch of that reformulation follows below.
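The sketch assumes the field names (`prompt`, `entry_point`) used in OpenAI's `human-eval` release: it pulls the docstring out of each prompt and presents it as a standalone natural-language problem statement, so the model has to write the whole function rather than complete a given signature.

```python
import re
from human_eval.data import read_problems  # pip install human-eval

def to_text_prompt(problem: dict) -> str:
    """Turn a HumanEval item (signature + docstring) into a plain-text
    problem statement for text-to-code style generation."""
    prompt = problem["prompt"]
    # Extract the docstring; fall back to the raw prompt if none is found.
    match = re.search(r'"""(.*?)"""|\'\'\'(.*?)\'\'\'', prompt, re.DOTALL)
    description = (match.group(1) or match.group(2)).strip() if match else prompt
    return (
        f"Write a Python function named `{problem['entry_point']}` "
        f"that solves the following problem:\n{description}\n"
    )

problems = read_problems()
print(to_text_prompt(next(iter(problems.values()))))
```

Note that models trained on text-to-code data (like CodeRL on APPS) never saw prompts in this rewritten format, which is why such a comparison would still be hard to interpret.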