symflower/eval-dev-quality

Tool/command to combine multiple evaluations into one

bauersimon opened this issue · 2 comments

If we run evaluations on multiple machines, we want to easily combine their results.

TODO

1st iteration

  • Remove all CSVs but evaluation.csv
    • Remove the models-summed.csv and <language>-summed.csv files
    • Remove all the evaluation CSV parsing logic
    • Remove all occurrences of model's cost and human-readable name
    • #256
  • Let the "report" command also generate the Markdown report
  • Reason: we want to move to a more flexible way of handling the evaluation data (SQL)

2nd iteration

  • Create a new command eval-dev-quality report
    type Report struct {
        EvaluationPaths []string // holds "evaluation.csv" file paths
        ResultPath      string
    }
  • Create an evaluation.csv file in ResultPath that will hold the combined evaluation records
  • Loop through the EvaluationPaths and find all evaluation.csv
    • Append the CSV records to the overall evaluation.csv

Current workarounds:

  • cat docs/reports/<version>/*/models-summed.csv | sort | uniq for overall scores
  • cat docs/reports/<version>/*/evaluation.csv | sort | uniq for repository-specific scores
  • I think eval-dev-quality report is great
  • If we want the reporting tool to show e.g. the real names and costs of the models, we might need to share some of the command arguments (e.g. the provider tokens). I think OpenRouter allows querying that information without a token, though, so we should be good for now.
  • I think @zimmski mentioned he wants to use Postgres now. Not sure how that would work, though, because it seems you would need a DB running in the background for that?