aws/fmeval

[Feature] LLM-based (QA Accuracy) eval algorithm

athewsey opened this issue · 2 comments

The metrics-based approaches in the QAAccuracy eval algorithm seem to harshly penalize verbose models (like Claude) on datasets with concise reference answers (like SQuAD).

It'd be useful if this library could provide support for LLM-based evaluation of LLM results: for example, asking a model whether the reference answer and the generated answer agree or disagree. I'd imagine it working something along the lines of LlamaIndex's CorrectnessEvaluator?
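A rough sketch of what I mean (the prompt wording, the AGREE/DISAGREE contract, and the function name are purely illustrative, not any particular library's implementation):

```python
from typing import Callable

JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Does the candidate answer agree with the reference answer on the facts,
ignoring differences in length, tone, or phrasing? Reply with exactly one
word: AGREE or DISAGREE."""


def llm_correctness(question: str, reference: str, candidate: str,
                    judge: Callable[[str], str]) -> float:
    """Return 1.0 if the judge LLM says the answers agree, else 0.0.

    `judge` is any callable that takes a prompt string and returns the
    model's text completion (e.g. a thin wrapper around a Bedrock call).
    """
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate)
    verdict = judge(prompt).strip().upper()
    return 1.0 if verdict.startswith("AGREE") else 0.0
```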

As I understand it, it should be possible in theory to implement something like this by building a custom EvalAlgorithmInterface-based class (rough skeleton after the list below), but there are a lot of design questions to consider, like:

  • Is it possible to control whether the evaluation step uses the same LLM as the original answer generation, a different LLM, or a panel of multiple LLMs?
  • Since there are lots of different ways to use LLMs for self-critique, maybe something like QAAccuracyByLLMCritic should be a subtype of some broader class? It'd certainly be interesting to use LLMs to judge other aspects like relevancy, or specific aspects of tone (e.g. "did it discuss my competitor companies XYZ?").
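For concreteness, here's the kind of skeleton I'm imagining. It deliberately doesn't subclass EvalAlgorithmInterface (I haven't checked the exact abstract signatures), and all the names are made up; the point is just that the judge model(s) are configured independently of the model under evaluation, so the same LLM, a different one, or a panel could do the critique:

```python
from dataclasses import dataclass
from typing import Callable, List

# A "judge" is any prompt-in, text-out callable; in practice it could wrap an
# fmeval ModelRunner, a Bedrock client, or even the same model under test.
Judge = Callable[[str], str]

DEFAULT_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Do the two answers agree on the facts? Reply AGREE or DISAGREE."
)


@dataclass
class QAAccuracyByLLMCritic:
    """Hypothetical LLM-critic evaluator (all names here are illustrative)."""

    judges: List[Judge]                  # a single judge, or a panel
    prompt_template: str = DEFAULT_TEMPLATE

    def evaluate_sample(self, question: str, target_output: str,
                        model_output: str) -> float:
        """Fraction of the panel that judges the answers to agree."""
        prompt = self.prompt_template.format(
            question=question, reference=target_output, candidate=model_output)
        votes = [j(prompt).strip().upper().startswith("AGREE")
                 for j in self.judges]
        return sum(votes) / len(votes)


# Toy usage with stub judges (real judges would call actual LLMs):
critic = QAAccuracyByLLMCritic(judges=[lambda p: "AGREE", lambda p: "DISAGREE"])
print(critic.evaluate_sample(
    "Who wrote Hamlet?", "Shakespeare",
    "Hamlet was written by William Shakespeare."))  # -> 0.5
```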

Thanks @athewsey for the feedback.

We recently added recall over tokens as an evaluation metric (#157). recall should not penalize verbose generation as harshly as f1_score.
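To illustrate on a toy example why recall is gentler on verbose answers (whitespace tokenization for simplicity; this is not the library's exact implementation):

```python
from collections import Counter


def token_scores(reference: str, candidate: str) -> tuple:
    """Toy token-overlap precision/recall/F1 over whitespace tokens."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# A verbose but correct answer against a concise SQuAD-style reference:
p, r, f1 = token_scores("Denver Broncos",
                        "The Denver Broncos won Super Bowl 50")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.29 recall=1.00 f1=0.44 -- recall stays at 1.0 while f1 is
# dragged down by the extra tokens in the verbose answer.
```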

On LLM-based metrics: we are looking into adding some along the lines of what you suggested. Though a bit different from the metrics you proposed, one could also bring the bert_score metric from the SummarizationAccuracy evaluation into the QAAccuracy evaluation.
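If you want to experiment with that offline, the standalone bert-score package is a quick way to try it (this is just for exploration, not how fmeval wires the metric internally):

```python
# pip install bert-score  (downloads a RoBERTa model on first use)
from bert_score import score

references = ["Denver Broncos"]
candidates = ["The Denver Broncos won Super Bowl 50."]

# Returns precision/recall/F1 tensors, one entry per (candidate, reference) pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"bert_score F1: {F1[0].item():.3f}")
```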

Thanks for the update! I'd be a bit concerned that neither alternative token-based metrics nor similarity-based LM ones like bert_score fully allow for a divergence between correctness and style/tone/other factors? It seems like even a supposedly semantic similarity score would be biased by differences in tone, framing, or level of detail in the answer.

IMO for many use cases it would be useful to separate evaluating a system for factual accuracy (trustworthiness / hallucination) from other factors that are still important but are more about whether it actually enables productivity gains: e.g. is it too verbose, does it correctly cite references, etc.

LLM-based critique provides a natural way to formulate this multi-axis validation, by telling the critic LLM in natural language which specific aspects to assess. Of course it's fair that there'd be concerns about when self-critique metrics might be biased, but I haven't seen any research yet that quantifies those concerns and gives a strong steer to avoid that kind of method... If anybody's aware of any, I'd love to read it!
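To make the multi-axis idea concrete, the kind of rubric I have in mind is just a prompt like the following (entirely illustrative; the aspect names and the JSON contract are made up, and it assumes the judge replies with JSON only):

```python
import json
from typing import Callable, Dict

RUBRIC_PROMPT = """You are auditing an answer produced by a QA system.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Score the candidate on each aspect from 1 (worst) to 5 (best) and reply with
JSON only, e.g. {{"factual_accuracy": 5, "conciseness": 3, "cites_references": 1}}:
- factual_accuracy: does it agree with the reference and avoid hallucination?
- conciseness: is it free of unnecessary verbosity?
- cites_references: does it cite sources for its factual claims?"""


def multi_aspect_critique(question: str, reference: str, candidate: str,
                          judge: Callable[[str], str]) -> Dict[str, int]:
    """Ask the judge LLM for per-aspect scores and parse its JSON reply."""
    reply = judge(RUBRIC_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    return json.loads(reply)
```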