
EPIC: LeapfrogAI Evaluations v1.1

Description

Now that a baseline evaluation framework for LeapfrogAI exists, it needs to be expanded further to meet the needs of the product and mission-success teams.

Feedback has been provided, with the following common themes that need to be addressed:

  • Some evaluations (primarily NIAH) always pass at 100% and, as such, are not helpful for tracking growth over time
  • Some NIAH and QA evals do not leverage the full chunk data in RAG responses and, as such, do not evaluate RAG to the extent they should
  • Evaluation results are not currently being stored anywhere
  • The current implementation of LFAI evals is specific to the OpenAI way of handling RAG, so the evaluations can't be run against custom RAG pipelines (a delivery concern)
  • MMLU results sometimes return suspiciously identical scores for multiple topics, indicating a potential problem with the evaluation 🐛

Completion Criteria

  • Create a new Needle in a Haystack dataset for more difficult evaluations
  • Create a new Question/Answer dataset for more difficult evaluations
  • Utilize new annotations and chunk data for better evaluations of retrieval (see the chunk-recall sketch after this list)
  • Store evaluation results as GitHub artifacts for long-term tracking (see the results-export sketch after this list)
  • Add an abstraction layer to the evaluation suite so that custom RAG pipelines can be evaluated (see the pipeline-interface sketch after this list)
  • Fix MMLU scores being constant 🐛
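
As a rough illustration of scoring retrieval with the chunk data itself rather than only the final answer, the sketch below computes a per-question chunk recall. The function name and the assumption that the harness exposes retrieved chunk texts as a list of strings are hypothetical, not existing LeapfrogAI APIs.

```python
# Minimal sketch: score retrieval using the retrieved chunks themselves.
# `chunk_recall` and the list-of-strings chunk shape are assumptions,
# not part of the current LeapfrogAI evaluation code.
def chunk_recall(retrieved_chunks: list[str], expected_fact: str) -> float:
    """Return 1.0 if the expected fact (e.g., the NIAH needle) appears in any
    retrieved chunk, else 0.0; averaging over a dataset yields retrieval recall."""
    hit = any(expected_fact.lower() in chunk.lower() for chunk in retrieved_chunks)
    return 1.0 if hit else 0.0
```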
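
For storing results as GitHub artifacts, one option is to have the eval run write a timestamped JSON file that a workflow step (e.g., actions/upload-artifact) then uploads. This is a minimal sketch; `write_results`, the output directory, and the metric names are placeholders.

```python
# Minimal sketch: serialize eval results so CI can upload them as an artifact.
# The function, directory name, and metric keys below are placeholders.
import json
import time
from pathlib import Path


def write_results(results: dict[str, float], out_dir: str = "eval-artifacts") -> Path:
    """Write a timestamped JSON results file for upload by a CI artifact step."""
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    out_file = path / f"lfai-evals-{time.strftime('%Y%m%dT%H%M%SZ', time.gmtime())}.json"
    out_file.write_text(json.dumps(results, indent=2))
    return out_file


if __name__ == "__main__":
    # Placeholder scores, only to show the expected shape of the results dict.
    print(write_results({"niah_retrieval": 0.0, "qa_accuracy": 0.0}))
```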
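
For the abstraction layer, one possible shape is an interface the eval suite calls so that the OpenAI-style Assistants flow and custom delivery pipelines are interchangeable. The names below (`RAGPipeline`, `RAGResult`, `run`) are hypothetical and intended only to sketch the idea.

```python
# Minimal sketch of a pipeline interface the eval suite could target.
# All names here are hypothetical; they do not exist in LeapfrogAI today.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class RAGResult:
    """Normalized output the evaluation suite scores, regardless of backend."""
    answer: str
    retrieved_chunks: list[str] = field(default_factory=list)


class RAGPipeline(ABC):
    """Implemented once for the OpenAI-style flow and once per custom pipeline."""

    @abstractmethod
    def run(self, question: str) -> RAGResult:
        """Answer a question and report the chunks used to ground the answer."""


class CustomDeliveryPipeline(RAGPipeline):
    """Example adapter for a mission-specific pipeline (placeholder logic)."""

    def run(self, question: str) -> RAGResult:
        chunks = ["...context from the custom retriever..."]
        answer = "...answer from the custom generator..."
        return RAGResult(answer=answer, retrieved_chunks=chunks)
```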