symflower/eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.

Go · MIT

Issues

Fix svg Y axis ticks
#73 opened 5 months ago by bauersimon
4
Implement "chain of thought" tasks
#31 opened 6 months ago by zimmski
0
Infer if a model produced "too much" code
#44 opened 5 months ago by bauersimon
0
Better error handling - panic instead of "failing up"
#28 opened 6 months ago by rnbwdsh
1
Better prompting with templates: Setting a mandatory document start
#29 opened 6 months ago by rnbwdsh
0
Retry with feedback and retry without feedback.
#30 opened 6 months ago by rnbwdsh
0
Upload binaries of the evaluation binary for all OSes and architectures for users that only want to benchmark
#20 opened 6 months ago by zimmski
0
Do an up-to-date leaderboard/dashboard for current models current evaluation
#26 opened 6 months ago by zimmski
0
Include linters in the development environment and CI
#18 opened 6 months ago by zimmski
0
`InstallToolsPath` is not used for test execution (`make test`)
#93 opened 5 months ago by Munsio
5
Follow up: Ollama Support
#100 opened 5 months ago by bauersimon
0
Flaky CI because of corrupted Z3 installation
#107 opened 5 months ago by bauersimon
0
Include metrics about the models for comparing models
#82 opened 5 months ago by zimmski
0
Introduce an AST-differ that also gives metrics
#80 opened 5 months ago by zimmski
3
Add linters where each error is a metric
#81 opened 5 months ago by zimmski
1
Sandbox execution
#17 opened 3 months ago by zimmski
1
Infer if a model actually returned source code
#43 opened 3 months ago by bauersimon
0
Roadmap for v0.5.0
#79 opened 2 months ago by zimmski
7
Exclude openrouter/auto since it is just a random model
#126 opened 2 months ago by bauersimon
1
Roadmap for v0.4.0
#35 opened 5 months ago by zimmski
10
Track how many characters were present in code part / complete response
#128 opened 4 months ago by bauersimon
2
Generic OpenAI API provider
#111 opened 4 months ago by bauersimon
4
Preload/Unload Ollama models before prompting
#116 opened 4 months ago by Munsio
6
Multiple runs without interleaving
#119 opened 4 months ago by bauersimon
0
Optimize repository handling in multiple runs per model
#113 opened 4 months ago by Munsio
0
Give models a retry on error
#123 opened 4 months ago by bauersimon
1
Do not cancel successive runs if previous runs had problems
#127 opened 4 months ago by bauersimon
0
Multiple Runs
#108 opened 5 months ago by bauersimon
0
Measure Model response time
#105 opened 5 months ago by bauersimon
0
Fixed Ollama version
#117 opened 5 months ago by bauersimon
0
Integrate Ollama
#91 opened 5 months ago by bauersimon
3
Unable to run benchmark tasks on windows due to incorrect directory creation syntax
#101 opened 5 months ago by mkovelamudi
5
Non deterministic test output leads to flaky CI Jobs
#98 opened 5 months ago by Munsio
0
Empty responses should not be tested but should fail
#92 opened 5 months ago by zimmski
1
Add additional CSV files that sum up: overall, per-language
#83 opened 5 months ago by zimmski
0
Automatic Markdown export
#49 opened 5 months ago by bauersimon
0
Java language implementation
#61 opened 5 months ago by zimmski
0
Move more output into the logs
#52 opened 5 months ago by bauersimon
0
Don't hardcode test file path
#58 opened 5 months ago by bauersimon
0
Evaluating a single model prints all label stats
#51 opened 5 months ago by bauersimon
0
Automatic Symflower installation with fixed version
#47 opened 5 months ago by zimmski
0
Automatic response categorization
#32 opened 5 months ago by bauersimon
7
Scoring and Ranking
#41 opened 5 months ago by bauersimon
1
Refactor metrics to assessment
#34 opened 6 months ago by bauersimon
0