AnswerDotAI/cold-compress

Implement Evaluations (Decide on datasets and benchmark initial methods)

griff4692 opened this issue · 0 comments

A few considerations:

  • GPT-Fast is already integrated with lm-eval-harness, so see whether we can write our evals within lm-eval-harness (a rough sketch follows this list). Will this work for non-QA (open-ended generation) tasks?
  • We want several kinds of evals that vary in prompt length and in the length of the required outputs
  • Decide on evaluation metrics
  • Let's provide granular speed / memory metrics broken down by phase: prefill / prompt encoding, cache operations, and attention. If a smaller KV cache makes attention faster but adds significant overhead on costly cache operations, we want that trade-off to be visible (a timing sketch follows this list).
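
A minimal sketch of what running evals through lm-eval-harness could look like, assuming the 0.4.x `lm_eval.simple_evaluate` entry point and the built-in Hugging Face backend; the checkpoint and task names are placeholders, not decisions. A GPT-Fast model would instead need a custom `lm_eval.api.model.LM` subclass implementing `loglikelihood` / `generate_until`, which is also the path that matters for non-QA generation tasks.

```python
import lm_eval

# Assumes lm-eval-harness >= 0.4, where simple_evaluate is the top-level entry point.
# model="hf" uses the built-in Hugging Face backend; a GPT-Fast model would need a
# custom lm_eval.api.model.LM subclass instead.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf",  # placeholder checkpoint
    tasks=["hellaswag", "gsm8k"],  # placeholder tasks: one multiple-choice, one generative
    batch_size=8,
)

# results["results"] maps each task name to its metric dict.
for task, metrics in results["results"].items():
    print(task, metrics)
```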
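For the per-phase breakdown, one option is a small CUDA-event timer that each phase gets wrapped in. This is a hypothetical helper, not an existing cold-compress hook: the phase names and the call sites (prefill, cache update, attention) are assumptions about where it would be wired into the generation loop.

```python
from collections import defaultdict
from contextlib import contextmanager

import torch


class PhaseTimer:
    """Accumulates GPU time per named phase using CUDA events.

    Hypothetical helper: the phase names and call sites below are assumptions
    about where this would be inserted in the decoding loop.
    """

    def __init__(self):
        self.totals_ms = defaultdict(float)

    @contextmanager
    def phase(self, name: str):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        try:
            yield
        finally:
            end.record()
            # Synchronize so elapsed_time is valid. Fine for offline benchmarking,
            # but note it serializes the stream at every phase boundary.
            torch.cuda.synchronize()
            self.totals_ms[name] += start.elapsed_time(end)


timer = PhaseTimer()

# Example usage inside a (hypothetical) decoding loop:
# with timer.phase("prefill"):
#     logits = model.prefill(prompt_ids)
# with timer.phase("cache_ops"):
#     kv_cache.update(new_keys, new_values)
# with timer.phase("attention"):
#     out = attention(q, k, v)

print({name: round(ms, 2) for name, ms in timer.totals_ms.items()})
```

For the memory side, `torch.cuda.max_memory_allocated()` (reset with `torch.cuda.reset_peak_memory_stats()` before each phase) gives a coarse peak-usage number that could be reported alongside the timings.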