Combine with Parsel for improving the state of the art?
LifeIsStrange opened this issue · 1 comment
Hi @noahshinn024, this paper's results are fascinating, but if we look at the HumanEval leaderboard, the second-best paper (Parsel) combines GPT-4 with the prior best codegen paper, CodeT, which had previously brought a 20% absolute accuracy gain:
https://paperswithcode.com/paper/codet-code-generation-with-generated-tests
Therefore, my question about synergies is twofold:
1) Could you, like Parsel, combine Reflexion with CodeT for even better accuracy?
2) Could you combine Parsel with Reflexion for potentially even better accuracy than 1)?
Edit: I have skimmed the paper, and you do mention CodeT and note that Reflexion also relies on test generation. Therefore my point might be moot, but 1) CodeT might generate tests in a better way than your in-house solution (unlikely),
and 2) the innovation brought by Parsel is not mentioned in the paper and could allow record accuracy.
TL;DR I believe Reflexion could go beyond 91% with Parsel
Interesting! In general, I think there's a lot of room to grow with unit test generation. In our implementation, we enforce that all unit tests must pass, which can lead to bad behavior if one unit test is wrong.
Possible optimizations:
- CodeT-like ranking for the "best" implementation according to a self-generated unit test suite
- ranking for performance across many self-generated unit test suites
- hard-coded unit test accuracy threshold heuristic(s)
- "code review" for unit tests given to a committee of LLMs