symflower/eval-dev-quality
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
Go · MIT
Issues
Follow-up "Check if the testdata repository is valid before running the evaluation, so it is checked just once"
#266 opened by ruiAzevedo19 - 0
Roadmap for v0.7.0
#301 opened by bauersimon - 3
Ruby support
#300 opened by ahumenberger - 1
Data visualization based on evaluation CSV files
#296 opened by ruiAzevedo19 - 0
Roadmap for v0.6.0
#195 opened by zimmski - 0
Copy of evaluation data for kubernetes is not working
#312 opened by Munsio - 5
Docker runtime broken on main
#302 opened by bauersimon - 5
Check if all testdata repositories are well-formed just once, and not in every task run
#263 opened by ruiAzevedo19 - 0
Rethink retry logic for LLM Providers
#305 opened by Munsio - 2
Openrouter Provider preferences
#286 opened by Munsio - 1
Assess failing tests
#235 opened by zimmski - 0
Evaluation run for all "good open weight models" with all available quantizations and different GPUs
#209 opened by zimmski - 0
Weight "executed code" more prominently
#233 opened by zimmski - 0
Evaluation task: TDD
#194 opened by ahumenberger - 3
Interactive result comparison
#208 opened by bauersimon - 4
Openrouter returns 524 when querying models
#186 opened by bauersimon - 0
unable to create temporary repository path: exec: WaitDelay expired before I/O complete
#219 opened by bauersimon - 0
Isolation of evaluations
#198 opened by Munsio - 2
Evaluation task: Transpile
#201 opened by ruiAzevedo19 - 0
Docker runtime is using the wrong container image
#242 opened by zimmski - 2
Dump the assessments in the CSV files once they happen and not in the end of all executions
#237 opened by ruiAzevedo19 - 0
Pull ollama models
#283 opened by Munsio - 0
Malformed Maven version
#270 opened by Munsio - 0
Docker containers may use the same result-path
#273 opened by Munsio - 0
Do not start the ollama server if not needed
#225 opened by Munsio - 1
Extract model costs into log and CSVs
#210 opened by bauersimon - 0
Follow-up "Code repairing task to enable models to fix code with compilation errors"
#200 opened by ruiAzevedo19 - 0
Change all prompts to enforce code fences
#257 opened by bauersimon - 0
Follow up - Isolated evaluations
#224 opened by Munsio - 0
Follow-up: Apply "symflower fix" to a "write-test" result of a model when it errors, so model responses can possibly be fixed
#232 opened by ruiAzevedo19 - 3
Extract human-readable names for models
#206 opened by bauersimon - 0
CSV report header is missing the task identifier
#187 opened by bauersimon - 0
Add timeout to `symflower test`
#185 opened by bauersimon - 0