symflower/eval-dev-quality
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
Go · MIT License
Issues
- Collect Go coverage if tests trigger panic (#175)
- Deal with dependencies requested by LLMs (#174)
- LLM result parsing bug (#173)
- Improve maintainability of assessments (#169)
- Evaluation task: Code repair (#168)
- Support multiple evaluation tasks (#165)
- Deal with failing tests (#158)
- Repository not reset for multiple tasks (#147)
- Java (#143)
- Test for pulling Ollama model is flaky (#135)
- Follow-up: Allow to retry a model when it errors (#131)
- Give models a retry on error (#123)
- Multiple runs without interleaving (#119)
- Fixed Ollama version (#117)
- Generic OpenAI API provider (#111)
- Multiple Runs (#108)
- Measure Model response time (#105)
- Follow up: Ollama Support (#100)
- Integrate Ollama (#91)
- Add linters where each error is a metric (#81)
- Roadmap for v0.5.0 (#79)
- Fix svg Y axis ticks (#73)
- Java language implementation (#61)
- Don't hardcode test file path (#58)
- Move more output into the logs (#52)