symflower/eval-dev-quality
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
Go
MIT
Issues
- 4
Fix svg Y axis ticks
#73 opened by bauersimon
- 0
Implement "chain of thought" tasks
#31 opened by zimmski
- 0
Infer if a model produced "too much" code
#44 opened by bauersimon
- 0
Retry with feedback and retry without feedback.
#30 opened by rnbwdsh
- 0
Upload binaries of the evaluation binary for all OSes and architectures for users that only want to benchmark
#20 opened by zimmski
- 0
Follow up: Ollama Support
#100 opened by bauersimon
- 0
Flaky CI because of corrupted Z3 installation
#107 opened by bauersimon
- 3
Introduce an AST-differ that also gives metrics
#80 opened by zimmski
- 1
Add linters where each error is a metric
#81 opened by zimmski
- 1
Sandbox execution
#17 opened by zimmski
- 0
Infer if a model actually returned source code
#43 opened by bauersimon
- 7
Roadmap for v0.5.0
#79 opened by zimmski
- 10
Roadmap for v0.4.0
#35 opened by zimmski
- 4
Generic OpenAI API provider
#111 opened by bauersimon
- 6
Preload/Unload Ollama models before prompting
#116 opened by Munsio
- 0
Multiple runs without interleaving
#119 opened by bauersimon
- 1
Give models a retry on error
#123 opened by bauersimon
- 0
Multiple Runs
#108 opened by bauersimon
- 0
Measure Model response time
#105 opened by bauersimon
- 0
Fixed Ollama version
#117 opened by bauersimon
- 3
Integrate Ollama
#91 opened by bauersimon
- 5
Unable to run benchmark tasks on windows due to incorrect directory creation syntax
#101 opened by mkovelamudi
- 1
Empty responses should not be tested but should fail
#92 opened by zimmski
- 0
Automatic Markdown export
#49 opened by bauersimon
- 0
Java language implementation
#61 opened by zimmski
- 0
Move more output into the logs
#52 opened by bauersimon
- 0
Don't hardcode test file path
#58 opened by bauersimon
- 7
Automatic response categorization
#32 opened by bauersimon
- 1
Scoring and Ranking
#41 opened by bauersimon
- 0
Refactor metrics to assessment
#34 opened by bauersimon