salesforce/CodeRL

Performance Results on HumanEval

htcml opened this issue · 1 comment

htcml commented

I am reading your CodeRL paper. It uses the APPS benchmark to compare performance with Codex. Do you have any comparison results on the HumanEval dataset?

@htcml thanks for reading the paper.

In our case, the HumanEval dataset would not be the best evaluation benchmark. The reason is that HumanEval is framed as a docstring-to-code task in which the function signature and its docstring (in a code comment block) are given. That makes it ideal for zero-shot evaluation of larger LMs such as CodeGen and Codex.
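
For illustration, a HumanEval-style prompt looks roughly like this (a minimal sketch modeled on the benchmark's format, not copied from it): the model receives the signature and docstring verbatim and only has to complete the function body.

```python
from typing import List, Optional

# Everything through the docstring is given to the model as the prompt;
# everything below the marker is what the model is asked to generate.
def rolling_max(numbers: List[int]) -> List[int]:
    """Return the running maximum seen so far at each position.

    >>> rolling_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
    # --- model completion starts here ---
    result: List[int] = []
    current: Optional[int] = None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result
```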

In our paper, we focus more on generating a program from scratch given a natural-language text description of a problem.

One workaround is to reformulate HumanEval as a text-to-code task, but the comparison with the current baselines might not be fair.
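
As a rough sketch of that reformulation (my own illustration, not something from the paper; it assumes the `prompt` and `entry_point` fields of the public HumanEval release), one could strip the starter code and keep only the natural-language docstring as an APPS-style problem statement:

```python
import ast

def docstring_of(prompt: str, entry_point: str) -> str:
    """Extract the docstring for `entry_point` from a HumanEval 'prompt'
    field. The field parses as valid Python even without a function body,
    because the docstring itself serves as the body."""
    for node in ast.walk(ast.parse(prompt)):
        if isinstance(node, ast.FunctionDef) and node.name == entry_point:
            return ast.get_docstring(node) or ""
    raise ValueError(f"{entry_point} not found in prompt")

def to_text_to_code_prompt(task: dict) -> str:
    """Build an APPS-style, text-only prompt: just the problem statement,
    with no signature or starter code given to the model."""
    desc = docstring_of(task["prompt"], task["entry_point"])
    return (
        f"Write a Python function named {task['entry_point']} "
        f"that solves the following problem:\n{desc}\n"
    )
```

The generated programs would still have to be executed against HumanEval's original unit tests, and since the baselines were prompted with the signature and docstring, the numbers would not be directly comparable.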