Improve evaluation procedure for extensive results
Opened this issue · 1 comments
monatis commented
Problem
In the current implementation, we use samplers to calculate evaluation metrics on a small subset of the dataset. Scores can therefore vary slightly from run to run because of the random state in sampling. Seeding the RNGs makes results reproducible, but a given seed may still produce an unusually lucky or unlucky sample. Comparing different checkpoints with seeded evaluators remains fair, but we cannot tell whether we are overestimating or underestimating the performance of all the checkpoints.
Possible solution
- Add an option to enable multiple passes over the data and report the mean and STD of all passes, or
- Accept an optional `QdrantClient`, and if it is not `None`, use Qdrant as the backend to store embeddings and retrieve from.
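To make the first option concrete, here is a minimal sketch of a multi-pass evaluation that reports the mean and STD across passes. It uses only the Python standard library and is not tied to the existing evaluator or sampler classes. The names `multi_pass_evaluate`, `metric_fn`, `sample_size`, and `num_passes` are hypothetical and chosen only for illustration.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def multi_pass_evaluate(
    dataset: Sequence,
    metric_fn: Callable[[Sequence], float],
    sample_size: int,
    num_passes: int = 5,
    seed: int = 0,
) -> Tuple[float, float]:
    """Score `num_passes` random subsets with `metric_fn`; return (mean, std).

    Hypothetical helper: `metric_fn` stands in for whatever metric the
    evaluator computes on a sampled subset.
    """
    if num_passes < 2:
        raise ValueError("num_passes must be at least 2 to estimate a standard deviation")

    # A single seeded RNG keeps the whole run reproducible while still
    # drawing a different subset on every pass.
    rng = random.Random(seed)
    items = list(dataset)
    k = min(sample_size, len(items))

    scores = []
    for _ in range(num_passes):
        subset = rng.sample(items, k)  # sampling without replacement
        scores.append(metric_fn(subset))

    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    # Toy metric: fraction of even numbers in the sampled subset.
    def even_fraction(subset: Sequence[int]) -> float:
        return sum(x % 2 == 0 for x in subset) / len(subset)

    mean, std = multi_pass_evaluate(range(1000), even_fraction, sample_size=50, num_passes=10)
    print(f"score: {mean:.3f} ± {std:.3f}")
```

Reporting the STD next to the mean shows how much of the difference between two checkpoints could be explained by sampling noise alone, which a single seeded pass cannot reveal.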
parthkl021 commented
@generall is this issue solved?
If not, can I work on it?