Implement Evaluations (Decide on datasets and benchmark initial methods)
griff4692 opened this issue · 0 comments
griff4692 commented
A few considerations:

- GPT Fast is integrated with lm-eval-harness, so see if we can write the evals in lm-eval-harness (a rough wrapper sketch is included after this list). Will this work for non-QA tasks?
- We want different kinds of evals that differ by prompt length and by the length of the required outputs.
- Decide on evaluation metrics
- Let's provide granular speed / memory metrics: prefill / prompt encoding, cache operations, and attention (see the timing sketch below). If we make attention faster (smaller KV cache) but introduce a lot of overhead in costly cache operations, we want the metrics to surface that.
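
As a starting point on the lm-eval-harness question, here's a rough, untested sketch of wrapping a gpt-fast-style model behind the harness's `LM` interface (assuming the v0.4-style API with `loglikelihood` / `generate_until`). The `GPTFastLM` class name, the `_score` / `_generate` helpers, and the task name are placeholders for whatever gpt-fast actually exposes, not real entry points:

```python
import lm_eval
from lm_eval.api.instance import Instance
from lm_eval.api.model import LM


class GPTFastLM(LM):
    """Adapter exposing a gpt-fast generator through the lm-eval-harness LM interface."""

    def __init__(self, model, tokenizer, max_gen_toks: int = 256):
        super().__init__()
        self.model = model
        self.tokenizer = tokenizer
        self.max_gen_toks = max_gen_toks

    def loglikelihood(self, requests: list[Instance]) -> list[tuple[float, bool]]:
        # Needed for multiple-choice / QA-style tasks: score each candidate continuation.
        return [self._score(context, continuation)
                for context, continuation in (req.args for req in requests)]

    def loglikelihood_rolling(self, requests: list[Instance]) -> list[float]:
        raise NotImplementedError("Not needed for the generation-style evals we care about.")

    def generate_until(self, requests: list[Instance]) -> list[str]:
        # Needed for long-output / non-QA tasks (e.g. summarization): free-form generation.
        return [self._generate(context, **gen_kwargs)
                for context, gen_kwargs in (req.args for req in requests)]

    # Hypothetical helpers -- these would call gpt-fast's actual scoring / decode loop.
    def _score(self, context: str, continuation: str) -> tuple[float, bool]:
        raise NotImplementedError

    def _generate(self, context: str, **gen_kwargs) -> str:
        raise NotImplementedError


# Usage sketch (task name illustrative):
#   results = lm_eval.simple_evaluate(model=GPTFastLM(model, tokenizer), tasks=["gsm8k"])
```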
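
For the granular speed metrics, a minimal timing sketch, assuming a PyTorch / CUDA setup like gpt-fast's. The section names (`prefill`, `cache_ops`, `attention`) and the calls in the usage comment are illustrative; the timer would wrap the corresponding blocks inside the real decode loop:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

import torch

_timings = defaultdict(float)


@contextmanager
def timed(section: str):
    """Accumulate wall-clock time per section, synchronizing around GPU work."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    try:
        yield
    finally:
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        _timings[section] += time.perf_counter() - start


# Usage inside a (hypothetical) generation loop:
#   with timed("prefill"):
#       logits = model.prefill(prompt_ids)
#   with timed("cache_ops"):
#       kv_cache.update(new_keys, new_values)   # eviction / compression overhead shows up here
#   with timed("attention"):
#       out = attention(query, kv_cache)
#
# Report _timings alongside peak memory, e.g. torch.cuda.max_memory_allocated().
```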