premAI-io/benchmarks

An Evaluation Dataset for quality benchmarking of different inference engine implementations.

Anindyadeep opened this issue · 1 comment

The current benchmarks repo covers performance benchmarking. However, knowing only which implementation is fastest says nothing about the quality of the LLM's generations, which may well be compromised. There is a direct relationship between quality degradation and reduced precision, and sometimes even a change of implementation or backend can affect it.

So here is the idea:

  • We need to curate a good evaluation dataset with well-chosen prompts (the type or subjects of the prompts still need to be discussed)
  • Once that is done, we need to implement a simple evaluation pipeline, i.e. a script/function that can do a one-shot evaluation of this dataset
  • Then we can show the results of those prompts per engine. Inside each engine/implementation folder we will have a results.md file showing the results in the following sample format (a rough sketch of the pipeline follows the sample tables below):

AutoGPTQ

Float 32 precision

| Id | Prompt | Result | Score |
|----|--------|--------|-------|
| 1  | This is a sample prompt | This is a sample result | 5.5 |

Float 16 precision

| Id | Prompt | Result | Score |
|----|--------|--------|-------|
| 1  | This is a sample prompt | This is a sample result | 5.5 |

Whether or not to implement a scoring mechanism is still open for discussion; however, this can be the format.
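As a rough illustration of the one-shot evaluation step, here is a minimal sketch of what the script/function could look like. All the names (`evaluate_engine`, `write_results_md`, the `generate` and `score` callables) are hypothetical and do not correspond to anything in the repo yet; the scorer is kept optional to mirror the open question above.

```python
from pathlib import Path
from typing import Callable, List, Optional


def evaluate_engine(
    precision: str,
    prompts: List[str],
    generate: Callable[[str], str],
    score: Optional[Callable[[str, str], float]] = None,
) -> str:
    """Return one markdown section (precision heading + table) for results.md."""
    rows = []
    for idx, prompt in enumerate(prompts, start=1):
        result = generate(prompt)
        # Scoring is still an open question; leave the column blank if no scorer is given.
        value = f"{score(prompt, result):.1f}" if score else "-"
        rows.append(f"| {idx} | {prompt} | {result} | {value} |")
    header = (
        f"### {precision} precision\n\n"
        "| Id | Prompt | Result | Score |\n"
        "|----|--------|--------|-------|\n"
    )
    return header + "\n".join(rows) + "\n"


def write_results_md(engine_dir: str, engine_name: str, sections: List[str]) -> None:
    """Write a results.md inside the given engine/implementation folder."""
    out = Path(engine_dir)
    out.mkdir(parents=True, exist_ok=True)
    content = f"## {engine_name}\n\n" + "\n\n".join(sections)
    (out / "results.md").write_text(content)
```

A real scorer could be anything from exact match against a reference answer to an LLM-as-a-judge call; that choice is exactly the point left open for discussion above.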

So here are the subtasks:

  • Get 5 good prompts
  • Build a simple evaluation pipeline supporting all the engines (see the driver sketch after this list)
  • State the results in a README, or better, generate them directly from the function
  • (Optional) Make a Hugging Face Space out of it
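Building on the `evaluate_engine` / `write_results_md` sketch above, a driver along these lines could cover every engine. The engine list, precisions, directory naming, and `load_generate_fn` loader are all placeholder assumptions, not the repo's actual layout.

```python
# Hypothetical driver tying the subtasks together. `load_generate_fn` stands in
# for however each engine implementation exposes its inference call.
PROMPTS = ["This is a sample prompt"]  # to be replaced by the curated dataset

ENGINES = {
    "AutoGPTQ": ["Float 32", "Float 16"],
}


def run_all(load_generate_fn) -> None:
    for engine, precisions in ENGINES.items():
        sections = [
            evaluate_engine(precision, PROMPTS, load_generate_fn(engine, precision))
            for precision in precisions
        ]
        # Assumed directory naming; adjust to wherever each engine's files actually live.
        write_results_md(f"bench_{engine.lower()}", engine, sections)
```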

We are closing this, since we are going with the approach mentioned in issue #162.

cc: @nsosio