I would use CodeGen-16B or StarCoder with the Spider dev set for fine-tuning. Data annotation may be needed for fine-tuning (FT).
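A minimal sketch of how Spider-style records could be turned into (prompt, completion) pairs for fine-tuning. The field names (`db_id`, `question`, `query`) follow Spider's JSON layout; the prompt template itself is just an assumption, not a fixed choice:

```python
import json

def spider_to_pairs(examples):
    """Convert Spider-style records into (prompt, completion) pairs.
    Assumes each record has 'db_id', 'question', and 'query' keys."""
    pairs = []
    for ex in examples:
        # Hypothetical prompt format: DB name + question, ending at "SELECT"
        prompt = f"-- Database: {ex['db_id']}\n-- Question: {ex['question']}\nSELECT"
        # Strip the leading SELECT so prompt + completion reads as one query
        completion = ex["query"].removeprefix("SELECT")
        pairs.append({"prompt": prompt, "completion": completion})
    return pairs

if __name__ == "__main__":
    demo = [{"db_id": "concert_singer",
             "question": "How many singers are there?",
             "query": "SELECT count(*) FROM singer"}]
    print(json.dumps(spider_to_pairs(demo), indent=2))
```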
I mostly used the GPT-4 GUI. A proper evaluation would need an API without rate limits, or a local inference server with open-source models:
- LLaMA-65B
- Falcon-40B-Instruct
- BLOOM
- Dolly-v2
- etc.
Evaluation metrics:
- Percentage of predictions that are valid SQL (VA)
- Execution accuracy (EX)
- Component Matching (CM)
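A rough sketch of computing VA and EX. It runs against an in-memory SQLite DB purely to stay self-contained (the notes target ClickHouse); CM is omitted here since it needs a SQL parser, e.g. the official Spider evaluator:

```python
import sqlite3

def valid_sql(conn, sql):
    """VA helper: does the predicted SQL execute at all?"""
    try:
        conn.execute(sql)
        return True
    except sqlite3.Error:
        return False

def execution_match(conn, pred_sql, gold_sql):
    """EX helper: do predicted and gold queries return the same rows?"""
    try:
        pred = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False
    gold = conn.execute(gold_sql).fetchall()
    # Order-insensitive comparison of result sets
    return sorted(pred) == sorted(gold)

def evaluate(conn, pairs):
    """pairs: list of (predicted_sql, gold_sql). Returns (VA, EX) as fractions."""
    va = sum(valid_sql(conn, p) for p, _ in pairs) / len(pairs)
    ex = sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)
    return va, ex
```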
TODO:
- Dummy dataset in ClickHouse
- Prompts implementation
- GPT-4 call
- Caching placeholder
- Eval
- Experiments with different LLMs
- Prompt automation for arbitrary DBs
- DB for cached queries, prompts and SQL results
- QDecomp as a separate LLM call
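The caching placeholder plus the cached-queries DB could be sketched as below. `call_llm` is a stand-in for the real GPT-4 (or local-model) request, and SQLite stands in for whatever DB ends up holding the cache; the schema and key scheme are assumptions:

```python
import hashlib
import sqlite3

def make_cache(path=":memory:"):
    """Open (or create) the cache DB for prompts and responses."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS llm_cache "
        "(key TEXT PRIMARY KEY, prompt TEXT, response TEXT)"
    )
    return conn

def cached_call(conn, prompt, model, call_llm):
    """Look up (model, prompt) in the cache; fall back to the real call on a miss."""
    key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
    row = conn.execute(
        "SELECT response FROM llm_cache WHERE key = ?", (key,)
    ).fetchone()
    if row:
        return row[0]
    response = call_llm(prompt, model)  # stand-in for the actual API request
    conn.execute("INSERT INTO llm_cache VALUES (?, ?, ?)", (key, prompt, response))
    conn.commit()
    return response
```

With this in place, repeated evaluation runs over the same prompts hit the cache instead of the rate-limited API.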