An Evaluation Dataset for quality benchmarking of different inference engine implementations.
Anindyadeep opened this issue · 1 comment
Anindyadeep commented
The current benchmarks repo does performance benchmarking. However, only measuring which implementation is fastest can hide degradation in the quality of the LLM's generations. There is a direct relationship between quality degradation and reduced precision, and sometimes even a change of implementation or backend can affect quality.
So here is the idea:
- We need to curate a good evaluation dataset with strong prompts (the types or subjects of the prompts still need to be discussed)
- Once that is done, we need to implement a simple evaluation pipeline, i.e. a script/function that can do a one-shot evaluation of this dataset
- Then we can show the results of those prompts per engine. Inside each engine/implementation folder we will have a `results.md` file, in which we can show the results in the following sample format:
**AutoGPTQ**

**Float 32 precision**

| Id | Prompt | Result | Score |
|---|---|---|---|
| 1 | This is a sample prompt | This is a sample result | 5.5 |

**Float 16 precision**

| Id | Prompt | Result | Score |
|---|---|---|---|
| 1 | This is a sample prompt | This is a sample result | 5.5 |
Whether to implement a scoring mechanism is still open for discussion. However, this can be the format.
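As a starting point for discussion, here is a minimal sketch of how rendering such a `results.md` file could work. `EvalResult` and `write_results_md` are hypothetical names (not anything that exists in the repo today), and the layout simply mirrors the sample format above:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

@dataclass
class EvalResult:
    prompt_id: int
    prompt: str
    result: str
    score: Optional[float] = None  # optional, since scoring is still open for discussion

def write_results_md(engine: str, results_by_precision: dict[str, list[EvalResult]], out_dir: str = ".") -> Path:
    """Render results grouped by precision into <out_dir>/<engine>/results.md."""
    lines = [f"**{engine}**", ""]
    for precision, results in results_by_precision.items():
        lines += [f"**{precision}**", "", "| Id | Prompt | Result | Score |", "|---|---|---|---|"]
        for r in results:
            score = f"{r.score:.1f}" if r.score is not None else "-"
            lines.append(f"| {r.prompt_id} | {r.prompt} | {r.result} | {score} |")
        lines.append("")
    path = Path(out_dir) / engine / "results.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines))
    return path

# Example usage reproducing the sample table above:
rows = {"Float 32 precision": [EvalResult(1, "This is a sample prompt", "This is a sample result", 5.5)]}
write_results_md("AutoGPTQ", rows)
```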
So here are the subtasks:
- Get 5 good prompts
- Make a simple evaluation pipeline supporting all the engines (see the sketch after this list)
- State the results in a README, or better, have the evaluation function write them out
- (Optional) Make a Hugging Face Space out of it
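A minimal sketch of what such a pipeline could look like, assuming each engine is wrapped as a plain prompt-to-text callable. The `Engine` alias, `evaluate` function, and the scorer hook are all hypothetical names for illustration, not the repo's actual API:

```python
from typing import Callable, Optional

# Each engine is represented as a callable mapping a prompt to generated text,
# e.g. a thin wrapper around whichever engine bindings the repo already benchmarks.
Engine = Callable[[str], str]

def evaluate(
    engines: dict[str, Engine],
    prompts: list[str],
    scorer: Optional[Callable[[str, str], float]] = None,  # scoring is still an open question
) -> dict[str, list[dict]]:
    """Run every prompt once (one-shot) through every engine and collect result rows."""
    all_results: dict[str, list[dict]] = {}
    for name, generate in engines.items():
        rows = []
        for idx, prompt in enumerate(prompts, start=1):
            output = generate(prompt)
            rows.append({
                "Id": idx,
                "Prompt": prompt,
                "Result": output,
                "Score": scorer(prompt, output) if scorer else None,
            })
        all_results[name] = rows
    return all_results
```

The per-engine rows returned here could then be fed straight into whatever writes the `results.md` files, so the README/Markdown output and the pipeline stay decoupled.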
Anindyadeep commented