
EPIC: LeapfrogAI Evaluations v1.1

Description

Now that a baseline evaluation framework for LeapfrogAI exists, it needs to be expanded further to meet the needs of the product and mission-success teams.

Feedback has been provided, with the following common themes that need to be addressed:

  • Some evaluations (primarily NIAH) always pass at 100% and, as such, are not helpful for tracking growth over time
  • Some NIAH and QA evals do not leverage the full chunk data in RAG responses and, as such, do not evaluate RAG to the extent they should
  • Evaluation results are not currently being stored anywhere
  • The current implementation of LFAI evals is specific to the OpenAI way of handling RAG, so the evaluations can't be run against custom RAG pipelines (a delivery concern)
  • MMLU results sometimes return suspiciously identical scores for multiple topics, indicating a potential problem with the evaluation 🐛

Completion Criteria

  • Create a new Needle in a Haystack dataset for more difficult evaluations
  • Create a new Question/Answer dataset for more difficult evaluations
  • Utilize new annotations and chunk data for better evaluations of retrieval (see the chunk-recall sketch after this list)
  • Store evaluation results as GitHub artifacts for long-term tracking (see the results-export sketch after this list)
  • Add an abstraction layer to the evaluation suite so that custom RAG pipelines can be evaluated (see the pipeline-interface sketch after this list)
  • Fix MMLU scores being constant 🐛
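
As a rough illustration of scoring retrieval with the chunk data itself rather than only the final answer, the sketch below computes a per-question chunk recall. The function name and the assumption that the harness exposes retrieved chunk texts as a list of strings are hypothetical, not existing LeapfrogAI APIs.

```python
# Minimal sketch: score retrieval using the retrieved chunks themselves.
# `chunk_recall` and the list-of-strings chunk shape are assumptions,
# not part of the current LeapfrogAI evaluation code.
def chunk_recall(retrieved_chunks: list[str], expected_fact: str) -> float:
    """Return 1.0 if the expected fact (e.g., the NIAH needle) appears in any
    retrieved chunk, else 0.0; averaging over a dataset yields retrieval recall."""
    hit = any(expected_fact.lower() in chunk.lower() for chunk in retrieved_chunks)
    return 1.0 if hit else 0.0
```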
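
For storing results as GitHub artifacts, one option is to have the eval run write a timestamped JSON file that a workflow step (e.g., actions/upload-artifact) then uploads. This is a minimal sketch; `write_results`, the output directory, and the metric names are placeholders.

```python
# Minimal sketch: serialize eval results so CI can upload them as an artifact.
# The function, directory name, and metric keys below are placeholders.
import json
import time
from pathlib import Path


def write_results(results: dict[str, float], out_dir: str = "eval-artifacts") -> Path:
    """Write a timestamped JSON results file for upload by a CI artifact step."""
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    out_file = path / f"lfai-evals-{time.strftime('%Y%m%dT%H%M%SZ', time.gmtime())}.json"
    out_file.write_text(json.dumps(results, indent=2))
    return out_file


if __name__ == "__main__":
    # Placeholder scores, only to show the expected shape of the results dict.
    print(write_results({"niah_retrieval": 0.0, "qa_accuracy": 0.0}))
```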
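
For the abstraction layer, one possible shape is an interface the eval suite calls so that the OpenAI-style Assistants flow and custom delivery pipelines are interchangeable. The names below (`RAGPipeline`, `RAGResult`, `run`) are hypothetical and intended only to sketch the idea.

```python
# Minimal sketch of a pipeline interface the eval suite could target.
# All names here are hypothetical; they do not exist in LeapfrogAI today.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class RAGResult:
    """Normalized output the evaluation suite scores, regardless of backend."""
    answer: str
    retrieved_chunks: list[str] = field(default_factory=list)


class RAGPipeline(ABC):
    """Implemented once for the OpenAI-style flow and once per custom pipeline."""

    @abstractmethod
    def run(self, question: str) -> RAGResult:
        """Answer a question and report the chunks used to ground the answer."""


class CustomDeliveryPipeline(RAGPipeline):
    """Example adapter for a mission-specific pipeline (placeholder logic)."""

    def run(self, question: str) -> RAGResult:
        chunks = ["...context from the custom retriever..."]
        answer = "...answer from the custom generator..."
        return RAGResult(answer=answer, retrieved_chunks=chunks)
```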