symflower/eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.

GoMIT

Issues

Follow-up "Check if the testdata repository is valid before running the evaluation, so it is checked just once"
#266 opened a month ago by ruiAzevedo19
1
Roadmap for v0.7.0
#301 opened 5 months ago by bauersimon
0
Ruby support
#300 opened 5 months ago by ahumenberger
3
Data visualization based on evaluation CSV files
#296 opened 5 months ago by ruiAzevedo19
1
Roadmap for v0.6.0
#195 opened 6 months ago by zimmski
0
Code repair should only consider #(passing tests) and never coverage
#320 opened 4 months ago by bauersimon
0
Move Dependency installation of Docker into multistage builds
#319 opened 4 months ago by Munsio
0
Copy of evaluation data for kubernetes is not working
#312 opened 4 months ago by Munsio
0
Docker runtime broken on main
#302 opened 4 months ago by bauersimon
5
Reporting tool does not work with `/**/` directory globing
#304 opened 4 months ago by Munsio
5
Use a JSON configuration file to set up an evaluation run
#282 opened 4 months ago by ruiAzevedo19
1
Check if all testdata repositories are well-formed just once, and not in every task run
#263 opened 4 months ago by ruiAzevedo19
2
Follow-up: Use a JSON configuration file to set up an evaluation run
#307 opened 4 months ago by ruiAzevedo19
0
Rethink retry logic for LLM Providers
#305 opened 4 months ago by Munsio
0
Openrouter Provider preferences
#286 opened 5 months ago by Munsio
2
Assess failing tests
#235 opened 5 months ago by zimmski
1
"symflower unit-tests" timeout error differs between Linux and Windows
#280 opened 5 months ago by ruiAzevedo19
0
Evaluation run for all "good open weight models" with all available quantizations and different GPUs
#209 opened 6 months ago by zimmski
0
Multiple model parameter with same value result in multiple evaluations
#220 opened 6 months ago by Munsio
0
Weight "executed code" more prominently
#233 opened 5 months ago by zimmski
4
Evaluation task: TDD
#194 opened 6 months ago by ahumenberger
0
Keep individual coverage files and LLM query/responses
#204 opened 6 months ago by zimmski
3
Interactive result comparison
#208 opened 6 months ago by bauersimon
0
Coverage for Java is tracked for lines, while Go is tracked for ranges
#193 opened 6 months ago by bauersimon
4
Make the Knapsack.java case easier to solve for models
#230 opened 5 months ago by zimmski
0
Log model responses directly to file and reuse them for debugging
#181 opened 5 months ago by bauersimon
1
Openrouter returns 524 when querying models
#186 opened 6 months ago by bauersimon
0
Report the maximum theoretically reachable #files-executed
#215 opened 5 months ago by bauersimon
0
unable to create temporary repository path: exec: WaitDelay expired before I/O complete
#219 opened 5 months ago by bauersimon
3
Isolation of evaluations
#198 opened 5 months ago by Munsio
0
Evaluation task: Transpile
#201 opened 5 months ago by ruiAzevedo19
2
Docker runtime is using the wrong container image
#242 opened 5 months ago by zimmski
0
Tool/command to combine multiple evaluations into one
#205 opened 5 months ago by bauersimon
2
Dump the assessments in the CSV files once they happen and not in the end of all executions
#237 opened 5 months ago by ruiAzevedo19
2
Pull ollama models
#283 opened 5 months ago by Munsio
0
Malformed Maven version
#270 opened 5 months ago by Munsio
0
Flaky test when testing `symflower unit-tests` timeout
#276 opened 5 months ago by ruiAzevedo19
0
Docker containers may use the same result-path
#273 opened 5 months ago by Munsio
0
Do not start the ollama server if not needed
#225 opened 5 months ago by Munsio
0
Extract model costs into log and CSVs
#210 opened 5 months ago by bauersimon
1
Follow-up "Code repairing task to enable models to fix code with compilation errors"
#200 opened 5 months ago by ruiAzevedo19
0
Add the current commit revision to the binary, Docker image and reports
#207 opened 5 months ago by zimmski
0
Change all prompts to enforce code fences
#257 opened 5 months ago by bauersimon
0
Follow up - Isolated evaluations
#224 opened 5 months ago by Munsio
0
Follow-up: Apply "symflower fix" to a "write-test" result of a model when it errors, so model responses can possibly be fixed
#232 opened 5 months ago by ruiAzevedo19
0
Apply symflower fix to a "write-test" result of a model
#213 opened 5 months ago by bauersimon
3
Extract human-readable names for models
#206 opened 6 months ago by bauersimon
0
CSV report header is missing the task identifier
#187 opened 6 months ago by bauersimon
0
Add timeout to `symflower test`
#185 opened 6 months ago by bauersimon
0
If results folder already exists, add suffix but don't overwrite or error
#176 opened 6 months ago by bauersimon
0