symflower/eval-dev-quality

Roadmap for v0.6.0


TODO sort and sort out

  • Models
  • Metrics & Reporting
    • Looking through logs... Java consistently has more code than Go for the same tasks, which yields more coverage. So a model that solves all Java tasks but no Go is automatically higher ranked than the opposite.
    • Write out results right away so we don't lose anything if the evaluation crashes (see the Go sketch at the end of this list)
    • AST differ #80
    • Non-benchmark metrics #82
    • #81
    • Remove absolute paths completely e.g. in stack traces too.
    • Automatically interpret "Extra code" #44
    • Figure out the "perfect" coverage score so we can display percentage of coverage reached
    • Make coverage metric fair
      • "Looking through logs... Java consistently has more code than Go for the same tasks, which yields more coverage. So a model that solves all Java tasks but no Go is automatically higher ranked than the opposite." -> only Symflower coverage will make this fair
    • Save the descriptions of the models as well (https://openrouter.ai/api/v1/models). The reason is that these can change over time, and we need to know after a while what they were, e.g. right now I would like to know if mistral-7b-instruct for the last evaluation was v0.1 or not
    • Bar charts should have their value on the bar. The axis values do not work that well
    • Pick an example or several examples per category: goal is to find interesting results automatically, because it will get harder and harder to go manually through results.
    • Charts to showcase data
      • Total-scores vs. costs scatterplot. Result is an upper-left-corner sweet spot: cheap and good results.
      • Pie chart of the whole evaluation's costs: for each LLM, show how much it costs. Result is to see which LLMs cost the most when running the eval.
    • Reporting and documentation on writing deep-dives
      • What are results that align with expectations? What are results against expectations? E.g. are there small LLMs that are better than big ones?
      • Are there big LLMs that totally fail?
      • Are there small LLMs that are surprisingly good?
      • What about LLMs where the community doesn't know that much yet: e.g. Snowflake, DBRX, ...
    • Order models by open-weight, allows commercial use, closed, and price(!) and size: e.g. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 is great because it is open-weight and Apache 2.0, so commercial use is allowed. It should be rated better than GPT-4
    • Categorize by parameters/experts https://www.reddit.com/r/LocalLLaMA/comments/1cdivc8/comment/l1davhv/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
    • Compare input/output/request/... costs https://twitter.com/oleghuman/status/1786296672785420744
    • Benchmark quantized models, because they need less memory
    • Distinguish between latency (time-to-first-token) and throughput (tokens generated per second); see the Go sketch at the end of this list
  • Documentation
    • Clean up and extend README
      • Better examples for contributions
      • Overhaul explanation of "why" we need evaluation, i.e. why is it good to evaluate for an empty function that does nothing.
    • Write down a playbook for evaluations, e.g. one thing that should happen is that we let the benchmark run 5 times and then sum up points, but ... the runs should have at least a one-hour break in between to not run into cached responses.
    • Write Tutorial for using Ollama
    • YouTube video for using Ollama
  • Tooling & Installation
    • Rescore existing models / evals with fixes, e.g. when we build a better code repair tool, the LLM answers did not change, so we should rescore right away with the new version of the tool over the whole result of an eval.
    • Automatic tool installation with fixed version
      • Go
      • Java
    • Ensure that non-critical CLI input validation (such as unavailable models) does not panic (see the Go sketch at the end of this list)
    • Ollama support
      • Install and test Ollama on macOS
      • Install and test Ollama on Windows
    • Allow forwarding CLI commands to be evaluated: #27 (comment)
    • Refactor Model and Provider to be in the same package #121 (comment)
  • Outreach
    • Automatically updated leader board for this repository: #26
      • Take a look at current leaderboards and evals to know what could be interesting. Current popular code leaderboards are [LiveCodeBench](https://huggingface.co/spaces/livecodebench/leaderboard), the [BigCode models leaderboard](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard), [CyberSecEval](https://huggingface.co/spaces/facebook/CyberSecEval) and [CanAICode](https://huggingface.co/spaces/mike-ravkine/can-ai-code-results)
    • Blog post about the different suffixes of models, e.g. "chat" and "instruct", and eval them somehow. Idea from https://www.reddit.com/r/LocalLLaMA/comments/1bz5oyx/comment/kyrfap4/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
    • Blog post about HumanEval
    • Blog post about training a small LLM directly on HumanEval
    • Blog post about the "non-determinism of LLMs" (https://community.openai.com/t/a-question-on-determinism/8185 is a good starting point) and how we can make them at least more stable.
    • Blogpost idea: misleading comments, weird coding style... how much does it take to confuse the most powerful AI? @ahumenberger
      • Maybe not only comments. What about obfuscated code, e.g. function and variables names are just random strings?
  • Research
  • Evaluation
    • Java
      • Let the Java test case for "No test files" actually identify and report an error that there are no test files (needs to be implemented in symflower test)
    • LLM
    • Prepare language and evaluation logic for multiple files:
      • Use symflower symbols to receive files
    • Sandboxed execution #17 e.g. with Docker as its first implementation
    • Timeout for test execution (we've seen tests that take > 15 minutes to execute in some benchmarks); see the Go sketch at the end of this list
    • Do an evaluation with different temperatures
    • Failing tests should receive a score penalty
    • Evaluation tasks
      • Introduce the interface for doing "evaluation tasks" so we can easily add them #197
      • Add evaluation task for code repair #170
      • Add evaluation task for "querying the relative test file path of a relative implementation file path", e.g. "What is the relative test file path for some/implementation/file.go?" ... it is "some/implementation/file_test.go" in most cases (see the Go sketch at the end of this list)
      • Add evaluation task for transpilation Go->Java and Java->Go
      • Add evaluation task for code refactoring: two functions with the same code -> extract into a helper function
      • Add evaluation task for implementing and fixing bugs using TDD
      • Scoring, Categorization, Bar Charts split by language.
      • Check determinism of models, e.g. execute each plain repository X times and then check if the results are stable (see the Go sketch at the end of this list).
    • Code repair
      • Own task category
      • 0-shot, 1-shot, ...
        • With LLM repair
        • With tool repair
    • Determine test file paths through
      • symflower symbols
      • Task for models
    • Query REAL costs of all the testing of a model: the reason this is interesting is that some models have HUGE outputs, and since more output means more costs, this should be addressed in the score.
    • Move towards generated cases so models cannot integrate fixed cases to always have 100% score
    • Think about adding more training data generation features: this will also help with dynamic cases
      • Heard that Snowflake Arctic is very open with how they gathered training data... so we can see what LLM creators think of and want from training data
  • Think about a commercial effort for the eval, so that we can balance some of the costs that go into maintaining this eval
  • Benchmark that showcases base models vs. their fine-tuned coding models, e.g. in v0.5.0 we see that Codestral, CodeLlama, ... are worse
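
Some rough Go sketches for the items marked above; these are assumptions to illustrate the ideas, not final designs.

Writing out results right away: a minimal sketch assuming a hypothetical `TaskResult` record and a JSON-lines output file; the real evaluation has its own result types and storage layout. Appending and syncing after every task means a crash only loses the task that was in flight.

```go
package main

import (
	"encoding/json"
	"log"
	"os"
)

// TaskResult is a hypothetical per-task result record for illustration only.
type TaskResult struct {
	Model    string `json:"model"`
	Language string `json:"language"`
	Task     string `json:"task"`
	Score    uint   `json:"score"`
}

// appendResult writes one result as a single JSON line and syncs the file,
// so a later crash cannot lose results that were already computed.
func appendResult(path string, result TaskResult) error {
	file, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	defer file.Close()

	if err := json.NewEncoder(file).Encode(result); err != nil {
		return err
	}

	return file.Sync()
}

func main() {
	result := TaskResult{Model: "mistralai/mistral-7b-instruct", Language: "golang", Task: "write-tests", Score: 10}
	if err := appendResult("evaluation.jsonl", result); err != nil {
		log.Fatal(err)
	}
}
```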
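
Latency vs. throughput: a sketch that times a token stream, modeled here as an `io.Reader` with one token per line; reading actual server-sent events from a provider API is left out.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
	"time"
)

// streamMetrics reads a token stream and reports the time to the first token
// (latency) and the tokens generated per second after that (throughput).
func streamMetrics(stream io.Reader) (timeToFirstToken time.Duration, tokensPerSecond float64) {
	start := time.Now()
	scanner := bufio.NewScanner(stream)

	var firstToken time.Time
	tokens := 0
	for scanner.Scan() {
		if tokens == 0 {
			firstToken = time.Now()
		}
		tokens++
	}
	if tokens == 0 {
		return 0, 0
	}

	timeToFirstToken = firstToken.Sub(start)
	if generation := time.Since(firstToken); generation > 0 && tokens > 1 {
		tokensPerSecond = float64(tokens-1) / generation.Seconds()
	}

	return timeToFirstToken, tokensPerSecond
}

func main() {
	// One token per line stands in for a streamed model response.
	latency, throughput := streamMetrics(strings.NewReader("func\nAdd\n(\na\n,\nb\nint\n)\nint"))
	fmt.Printf("time to first token: %s, throughput: %.1f tokens/s\n", latency, throughput)
}
```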
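
Non-panicking CLI input validation: the `availableModels` set is a stand-in for whatever the configured providers actually report; the point is returning an error and a non-zero exit code instead of panicking.

```go
package main

import (
	"fmt"
	"os"
)

// availableModels is a placeholder for the models reported by the providers.
var availableModels = map[string]bool{
	"openrouter/mistralai/mistral-7b-instruct": true,
	"ollama/llama3":                            true,
}

// validateModels returns an error for unknown model IDs instead of panicking,
// so bad CLI input leads to a readable message and a non-zero exit code.
func validateModels(modelIDs []string) error {
	for _, id := range modelIDs {
		if !availableModels[id] {
			return fmt.Errorf("model %q is not available", id)
		}
	}

	return nil
}

func main() {
	if err := validateModels(os.Args[1:]); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("all requested models are available")
}
```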
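
Test execution timeout: uses `exec.CommandContext` so the test process is killed once the deadline passes; the `go test ./...` command and the 15-minute limit are placeholders.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// runTests executes the test command of a repository and kills it once the
// timeout is exceeded, so a single hanging test run cannot block the whole
// evaluation.
func runTests(repositoryPath string, timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, "go", "test", "./...")
	cmd.Dir = repositoryPath

	output, err := cmd.CombinedOutput()
	if errors.Is(ctx.Err(), context.DeadlineExceeded) {
		return string(output), fmt.Errorf("test execution exceeded %s", timeout)
	}

	return string(output), err
}

func main() {
	output, err := runTests(".", 15*time.Minute)
	fmt.Print(output)
	if err != nil {
		fmt.Println("error:", err)
	}
}
```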
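
Relative test file path task: a possible reference-answer generator using the common Go and Maven/Gradle layout conventions; real repositories may deviate from these conventions.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// expectedTestFilePath returns the conventional test file path for an
// implementation file, which could serve as the reference answer of the task.
func expectedTestFilePath(implementationPath string) string {
	extension := filepath.Ext(implementationPath)
	base := strings.TrimSuffix(implementationPath, extension)

	switch extension {
	case ".go":
		// some/implementation/file.go -> some/implementation/file_test.go
		return base + "_test.go"
	case ".java":
		// src/main/java/com/example/Foo.java -> src/test/java/com/example/FooTest.java
		return strings.Replace(base, "src/main/java", "src/test/java", 1) + "Test.java"
	default:
		return ""
	}
}

func main() {
	fmt.Println(expectedTestFilePath("some/implementation/file.go"))
	fmt.Println(expectedTestFilePath("src/main/java/com/example/Foo.java"))
}
```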
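
Determinism check: `queryModel` is a stand-in for running a full task against a plain repository; the check simply repeats the same prompt and compares the responses for equality.

```go
package main

import "fmt"

// queryModel is a placeholder for running a full task against a plain repository.
func queryModel(prompt string) string {
	return "package plain\n\nfunc TestPlain(t *testing.T) {}"
}

// isDeterministic queries the model "runs" times with the same prompt and
// reports whether every response was identical to the first one.
func isDeterministic(prompt string, runs int) bool {
	first := queryModel(prompt)
	for i := 1; i < runs; i++ {
		if queryModel(prompt) != first {
			return false
		}
	}

	return true
}

func main() {
	fmt.Println("deterministic:", isDeterministic("Write tests for the plain repository.", 5))
}
```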