symflower/eval-dev-quality
DevQualityEval: an evaluation benchmark 📈 and framework to compare and evolve the quality of code generation by LLMs.
Go · MIT License
Issues
- Assess failing tests (#235, 1 comment)
- Weight "executed code" more prominently (#233)
- Do not start the ollama server if not needed (#225)
- Follow up "Docker runtime" (#224)
- Extract model costs into log and CSVs (#210)
- Interactive result comparison (#208)
- Extract human-readable names for models (#206, 1 comment)
- Evaluation task: Transpile (#201)
- Isolation of evaluations (#198)
- Roadmap for v0.6.0 (#195)
- Evaluation task: TDD (#194, 4 comments)
- OpenRouter returns 524 when querying models (#186)
- Add timeout to `symflower test` (#185)
- Collect Go coverage if tests trigger panic (#175, 2 comments)
- Deal with dependencies requested by LLMs (#174, 3 comments)
- LLM result parsing bug (#173, 2 comments)
- Improve maintainability of assessments (#169)
- Evaluation task: Code repair (#168, 2 comments)
- Support multiple evaluation tasks (#165, 3 comments)
- Deal with failing tests (#158)
- Repository not reset for multiple tasks (#147)
- Java (#143)
- Test for pulling Ollama model is flaky (#135)
- Follow-up: Allow to retry a model when it errors (#131)