EPIC: LeapfrogAI Evaluations v1.1
jalling97 commented
LeapfrogAI Evaluations v1.1
Description
Now that a baseline evaluations framework for LeapfrogAI exists, it needs to be further expanded to meet the needs of the product and mission-success teams.
The feedback received highlights several common themes that need to be addressed:
- Some evaluations (primarily NIAH) consistently pass at 100% and are therefore not helpful for tracking growth over time
- Some NIAH and QA evals do not leverage the full chunk data in RAG responses and therefore do not evaluate RAG as thoroughly as they should
- Evaluation results are not currently being stored anywhere
- The current implementation of LFAI evals is tightly coupled to the OpenAI way of handling RAG, so the evaluations can't be run against custom RAG pipelines (a delivery concern).
- MMLU results sometimes return suspiciously identical scores for multiple topics, indicating a potential problem with the evaluation 🐛
Completion Criteria
- Create a new Needle in a Haystack (NIAH) dataset for more difficult evaluations
- Create a new question/answer (QA) dataset for more difficult evaluations
- Utilize new annotations and chunk data for better evaluation of retrieval (see the retrieval-check sketch after this list)
- Store evaluation results as GitHub artifacts for long-term tracking (see the results-serialization sketch after this list)
- Add an abstraction layer to the evaluation suite to allow custom RAG pipelines (see the interface sketch after this list)
- Fix MMLU evaluations returning identical scores across topics 🐛
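
For the retrieval criterion, a minimal sketch of what "using the chunk data" could look like is below. It assumes the RAG pipeline exposes its retrieved chunks as a list of strings per query; the function and field names are illustrative, not the actual LFAI evals API.

```python
# Minimal sketch: score NIAH retrieval by checking whether the needle text
# appears in the chunks returned by RAG, rather than only judging the final
# generated answer. Names are illustrative.
from dataclasses import dataclass


@dataclass
class RetrievalResult:
    """Chunks returned by the RAG pipeline for a single query."""
    query: str
    chunks: list[str]


def needle_retrieved(needle: str, result: RetrievalResult) -> bool:
    """Return True if any retrieved chunk contains the needle text."""
    needle_lower = needle.lower()
    return any(needle_lower in chunk.lower() for chunk in result.chunks)


def retrieval_hit_rate(needles: list[str], results: list[RetrievalResult]) -> float:
    """Fraction of needles present in at least one retrieved chunk."""
    hits = sum(needle_retrieved(n, r) for n, r in zip(needles, results))
    return hits / len(needles) if needles else 0.0
```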
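For storing results, one possible approach is to have each evaluation run write a timestamped JSON summary that a CI step (e.g., `actions/upload-artifact`) publishes as a build artifact. The file layout and field names below are assumptions for illustration only.

```python
# Minimal sketch: serialize a run's evaluation scores to a timestamped JSON
# file so a CI step can upload it as a GitHub artifact. Names are illustrative.
import json
from datetime import datetime, timezone
from pathlib import Path


def write_eval_results(scores: dict[str, float], output_dir: str = "eval_results") -> Path:
    """Write evaluation scores plus run metadata to a JSON file and return its path."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"lfai_evals_{timestamp}.json"
    payload = {"timestamp": timestamp, "scores": scores}
    out_path.write_text(json.dumps(payload, indent=2))
    return out_path


if __name__ == "__main__":
    # Example usage with placeholder scores.
    print(write_eval_results({"niah_retrieval": 0.85, "qa_accuracy": 0.72}))
```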
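For the abstraction layer, a rough interface sketch is below: the evaluation harness would depend only on an abstract RAG pipeline, so delivery teams can plug in custom pipelines instead of the OpenAI-specific path. Class and method names are hypothetical, not a commitment to a specific design.

```python
# Minimal sketch: an abstract interface the evaluation suite could call
# instead of hitting the OpenAI assistants flow directly. Names are illustrative.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class RAGResponse:
    """A generated answer plus the chunks used to produce it."""
    answer: str
    chunks: list[str] = field(default_factory=list)


class RAGPipeline(ABC):
    """Interface the evaluation suite depends on; any backend can implement it."""

    @abstractmethod
    def add_documents(self, documents: list[str]) -> None:
        """Index the documents the evaluation will query against."""

    @abstractmethod
    def query(self, question: str) -> RAGResponse:
        """Run retrieval + generation and return the answer with its chunks."""


def run_eval(pipeline: RAGPipeline, questions: list[str]) -> list[RAGResponse]:
    """The harness only sees the interface, never the backend."""
    return [pipeline.query(q) for q in questions]
```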