AnswerDotAI/cold-compress

Implement Evaluations (Decide on datasets and benchmark initial methods)

griff4692 opened this issue · 0 comments

A few considerations:

  • GPT-Fast is already integrated with lm-eval-harness, so see whether we can write our evals within lm-eval-harness (a rough sketch follows this list). Will this work for non-QA (open-ended generation) tasks?
  • We want several kinds of evals that vary in prompt length and in the length of the required outputs
  • Decide on evaluation metrics
  • Let's provide granular speed / memory metrics broken down by phase: prefill / prompt encoding, cache operations, and attention. If a smaller KV cache makes attention faster but adds significant overhead on costly cache operations, we want that trade-off to be visible (a timing sketch follows this list).
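
A minimal sketch of what running evals through lm-eval-harness could look like, assuming the 0.4.x `lm_eval.simple_evaluate` entry point and the built-in Hugging Face backend; the checkpoint and task names are placeholders, not decisions. A GPT-Fast model would instead need a custom `lm_eval.api.model.LM` subclass implementing `loglikelihood` / `generate_until`, which is also the path that matters for non-QA generation tasks.

```python
import lm_eval

# Assumes lm-eval-harness >= 0.4, where simple_evaluate is the top-level entry point.
# model="hf" uses the built-in Hugging Face backend; a GPT-Fast model would need a
# custom lm_eval.api.model.LM subclass instead.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf",  # placeholder checkpoint
    tasks=["hellaswag", "gsm8k"],  # placeholder tasks: one multiple-choice, one generative
    batch_size=8,
)

# results["results"] maps each task name to its metric dict.
for task, metrics in results["results"].items():
    print(task, metrics)
```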
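For the per-phase breakdown, one option is a small CUDA-event timer that each phase gets wrapped in. This is a hypothetical helper, not an existing cold-compress hook: the phase names and the call sites (prefill, cache update, attention) are assumptions about where it would be wired into the generation loop.

```python
from collections import defaultdict
from contextlib import contextmanager

import torch


class PhaseTimer:
    """Accumulates GPU time per named phase using CUDA events.

    Hypothetical helper: the phase names and call sites below are assumptions
    about where this would be inserted in the decoding loop.
    """

    def __init__(self):
        self.totals_ms = defaultdict(float)

    @contextmanager
    def phase(self, name: str):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        try:
            yield
        finally:
            end.record()
            # Synchronize so elapsed_time is valid. Fine for offline benchmarking,
            # but note it serializes the stream at every phase boundary.
            torch.cuda.synchronize()
            self.totals_ms[name] += start.elapsed_time(end)


timer = PhaseTimer()

# Example usage inside a (hypothetical) decoding loop:
# with timer.phase("prefill"):
#     logits = model.prefill(prompt_ids)
# with timer.phase("cache_ops"):
#     kv_cache.update(new_keys, new_values)
# with timer.phase("attention"):
#     out = attention(q, k, v)

print({name: round(ms, 2) for name, ms in timer.totals_ms.items()})
```

For the memory side, `torch.cuda.max_memory_allocated()` (reset with `torch.cuda.reset_peak_memory_stats()` before each phase) gives a coarse peak-usage number that could be reported alongside the timings.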