# Improving autonomous codegen
## How codegen works now
At the moment, code is generated autonomously by `auto-v1` (with subsequent `NewFile` and `EditFile` actions). The `EditFile` action edits file hunks, not whole files. It is shown a code hunk it needs to edit, like this:
```
 9 |
10 | </div>
11 *
12 * # 🛠 Usage
13 *
14 | Warning: This Github Action is currently **in development**, and in **alpha release**.
15 | If you're interested in using this action, please reach out on [Discord](https://discord.gg/vz7p9TfHsh).
```
Most of the time, it's pretty good at returning only the `*`-highlighted lines, but there are some distinct improvements that can be made.
It's asked to respond in this format:
````
```
<string>
```

{
    "outcome": string  # A description of the outcome of the attempt to rewrite the file hunk according to the problem statement.
}
````
The outcome is used to gauge the effect of the action and is fed back into the autonomous agent.
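For illustration, a minimal sketch of how a response in this format might be parsed (assuming the actual reply contains valid JSON without the inline `#` comment; `parse_response` is a hypothetical helper, not the existing code):

```python
import json
import re

def parse_response(response: str) -> tuple[str, str]:
    # Pull out the fenced code hunk and the trailing JSON object.
    match = re.search(r"```\n(.*?)\n```\s*(\{.*\})", response, re.DOTALL)
    if match is None:
        raise ValueError("response does not match the expected format")
    code_hunk, json_blob = match.groups()
    outcome = json.loads(json_blob)["outcome"]
    return code_hunk, outcome
```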
## What improvements we can make
Each of these could be made into a separate issue, but for now, I'm listing them as potential TODO items:
- Sometimes it returns the lines with leading line number markers (like `9 |`). These should be removed in a post-processing step, if each line begins with a line number and a pipe/star character. Make sure to `lstrip` the line (see the post-processing sketch after this list).
- `gpt-3.5-turbo` struggles with generating both the code hunk and the JSON in one go. An alternative method, asking it to generate the code and reflect on the outcome in two separate calls, should be implemented. For GPT-4 it still makes sense to do it in one go to conserve tokens, so the `auto-v1` codegen agent should expose a config parameter to choose whether to do it in one go or two (add a kwarg to `__init__`, passed via `codegen_agent_config`; see the config sketch after this list).
- Reflexion outlines a framework for the model to decide when to "reflect" on its actions, which helps it choose better actions moving forward.
- RepoCoder generates a first-pass code sample, then looks at the similarity between the generated sample and the rest of the (chunkified) repository. It compares a "sparse" Jaccard similarity approach (which is what Github Copilot uses, thanks for highlighting that @dmarx) with a "dense" text embedding approach; they found that the two approaches perform equivalently, so either works (see the Jaccard sketch after this list).
- We can ask it to generate tests first, then add to the agent's list of actions the ability to run tests when it thinks it's finished.
- Add "RemoveFileHunk" and "AppendFileHunk" actions