An Evaluation Dataset for quality benchmarking of different inference engine implementations.
Anindyadeep opened this issue · 1 comment
Anindyadeep commented
The current benchmarks repo does performance benchmarking. However, only measuring which implementation is fastest can hide degradation in the quality of the LLM's generations. There is a direct relationship between quality degradation and reduced precision, and sometimes even a change of implementation or backend can affect quality.
So here is the idea:
- We need to curate a good evaluation dataset with strong prompts (the types or subjects of the prompts still need to be discussed)
- Once that is done, we need to implement a simple evaluation pipeline, i.e. a script/function that can do a one-shot evaluation of this dataset
- Then we can show the results of those prompts per engine. Inside each engine/implementation folder we will have a `results.md` file, in which we can show the results in the following sample format:
**AutoGPTQ**

**Float 32 precision**

| Id | Prompt | Result | Score |
|---|---|---|---|
| 1 | This is a sample prompt | This is a sample result | 5.5 |

**Float 16 precision**

| Id | Prompt | Result | Score |
|---|---|---|---|
| 1 | This is a sample prompt | This is a sample result | 5.5 |
Whether to implement a scoring mechanism is still open for discussion. However, this can be the format.
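As a starting point for discussion, here is a minimal sketch of how rendering such a `results.md` file could work. `EvalResult` and `write_results_md` are hypothetical names (not anything that exists in the repo today), and the layout simply mirrors the sample format above:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

@dataclass
class EvalResult:
    prompt_id: int
    prompt: str
    result: str
    score: Optional[float] = None  # optional, since scoring is still open for discussion

def write_results_md(engine: str, results_by_precision: dict[str, list[EvalResult]], out_dir: str = ".") -> Path:
    """Render results grouped by precision into <out_dir>/<engine>/results.md."""
    lines = [f"**{engine}**", ""]
    for precision, results in results_by_precision.items():
        lines += [f"**{precision}**", "", "| Id | Prompt | Result | Score |", "|---|---|---|---|"]
        for r in results:
            score = f"{r.score:.1f}" if r.score is not None else "-"
            lines.append(f"| {r.prompt_id} | {r.prompt} | {r.result} | {score} |")
        lines.append("")
    path = Path(out_dir) / engine / "results.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines))
    return path

# Example usage reproducing the sample table above:
rows = {"Float 32 precision": [EvalResult(1, "This is a sample prompt", "This is a sample result", 5.5)]}
write_results_md("AutoGPTQ", rows)
```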
So here are the subtasks:
- Get 5 good prompts
- Make a simple evaluation pipeline supporting all the engines (see the sketch after this list)
- State the results in a README, or better, have the evaluation function write them out
- (Optional) Make a Hugging Face Space out of it
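A minimal sketch of what such a pipeline could look like, assuming each engine is wrapped as a plain prompt-to-text callable. The `Engine` alias, `evaluate` function, and the scorer hook are all hypothetical names for illustration, not the repo's actual API:

```python
from typing import Callable, Optional

# Each engine is represented as a callable mapping a prompt to generated text,
# e.g. a thin wrapper around whichever engine bindings the repo already benchmarks.
Engine = Callable[[str], str]

def evaluate(
    engines: dict[str, Engine],
    prompts: list[str],
    scorer: Optional[Callable[[str, str], float]] = None,  # scoring is still an open question
) -> dict[str, list[dict]]:
    """Run every prompt once (one-shot) through every engine and collect result rows."""
    all_results: dict[str, list[dict]] = {}
    for name, generate in engines.items():
        rows = []
        for idx, prompt in enumerate(prompts, start=1):
            output = generate(prompt)
            rows.append({
                "Id": idx,
                "Prompt": prompt,
                "Result": output,
                "Score": scorer(prompt, output) if scorer else None,
            })
        all_results[name] = rows
    return all_results
```

The per-engine rows returned here could then be fed straight into whatever writes the `results.md` files, so the README/Markdown output and the pipeline stay decoupled.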
Anindyadeep commented