
Improving autonomous codegen


How codegen works now

At the moment, code is generated autonomously by auto-v1 (through successive NewFile and EditFile actions). The EditFile action edits file hunks, not whole files. It's shown the code hunk it needs to edit like this:

```
 9 | 
10 | </div>
11 * 
12 * # 🛠 Usage
13 * 
14 | Warning: This Github Action is currently **in development**, and in **alpha release**.
15 | If you're interested in using this action, please reach out on [Discord](https://discord.gg/vz7p9TfHsh).
```

Most of the time, it's pretty good at returning only the `*`-highlighted lines, but there are some distinct improvements that can be made.
It's asked to respond in this format:

````
```
<string>
```
{
  "outcome": string  # A description of the outcome of the attempt to rewrite the file hunk according to the problem statement.
}
````

The outcome is used to gauge the effect of the action and is fed back into the autonomous agent.

What improvements we can make

Each of these could be made into a separate issue, but for now, I'm listing them as potential TODO items:

  • Sometimes it returns the lines with leading line-number markers (like `9 | `). These should be removed in a post-processing step, if each line begins with a line number and a pipe/star character. Make sure to lstrip the line first. See the first sketch after this list.
  • gpt-3.5-turbo struggles with generating both the code hunk and the JSON in one go. An alternative method, asking it to generate the code and reflect on the outcome in two separate calls, should be implemented. For GPT-4 it still makes sense to do it in one go to conserve tokens, so the auto-v1 codegen agent should expose a config parameter to choose between one call or two (add a kwarg to __init__, and it'll be passed via codegen_agent_config). See the second sketch after this list.
  • Reflexion outlines a framework for the model to decide when to "reflect" on its actions, which helps it choose better actions moving forward.
  • RepoCoder generates a first-pass code sample, then looks at the similarity between the generated sample and the rest of the (chunkified) repository. It compares a "sparse" Jaccard-similarity approach (which is what GitHub Copilot uses; thanks for highlighting that, @dmarx) with a "dense" text-embedding approach. They found that the two approaches perform equivalently, so either works. See the third sketch after this list.
  • We can ask it to generate tests first, then add to the agent's list of actions the ability to run the tests when it thinks it's finished.
  • Add "RemoveFileHunk" and "AppendFileHunk" actions
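For the line-number-marker cleanup, here's a minimal sketch of the post-processing step; the helper name and regex are mine, not AutoPR's:

```python
import re

# Matches a marker like " 9 | " or "12 * " at the start of a line:
# optional padding, a line number, a space, a pipe or star, and an
# optional trailing space. The leading \s* plays the role of the lstrip.
LINE_MARKER = re.compile(r"^\s*\d+ [|*] ?")

def strip_line_markers(hunk: str) -> str:
    """Remove leading line-number markers, but only if every line has one."""
    lines = hunk.splitlines()
    if lines and all(LINE_MARKER.match(line) for line in lines):
        lines = [LINE_MARKER.sub("", line, count=1) for line in lines]
    return "\n".join(lines)

# e.g. strip_line_markers("12 * # 🛠 Usage") == "# 🛠 Usage"
```

Guarding on "every line matches" avoids mangling code that legitimately starts with a number.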
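For the one-call/two-call toggle, a minimal sketch of the config parameter; the class and kwarg names are hypothetical stand-ins for whatever the auto-v1 codegen agent actually uses:

```python
class CodegenAgent:
    """Hypothetical stand-in for the auto-v1 codegen agent."""

    def __init__(self, *, generate_outcome_separately: bool = False, **kwargs):
        # False: one call produces both the code hunk and the JSON outcome
        #        (token-efficient; fine for GPT-4).
        # True:  two calls, one for the code and one to reflect on the
        #        outcome (more reliable with gpt-3.5-turbo).
        self.generate_outcome_separately = generate_outcome_separately

# The flag would then be threaded through via codegen_agent_config, e.g.:
# codegen_agent_config = {"generate_outcome_separately": True}
```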
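And for the "sparse" retrieval idea, a minimal sketch of ranking repository chunks by Jaccard similarity against a first-pass generation; the tokenizer here is a naive identifier split, not RepoCoder's actual chunking:

```python
import re

def tokens(text: str) -> set[str]:
    # Naive tokenizer: lowercase, identifier-like tokens only.
    return set(re.findall(r"[A-Za-z_]\w*", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def rank_chunks(generated: str, repo_chunks: list[str], top_k: int = 5) -> list[str]:
    # Rank chunks by token overlap with the first-pass sample; the top
    # chunks would be fed back as context for a second generation pass.
    query = tokens(generated)
    return sorted(repo_chunks, key=lambda c: jaccard(query, tokens(c)), reverse=True)[:top_k]
```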