I was wondering about the evaluation method
gauss5930 opened this issue · 2 comments
Hi, I was very impressed by the retrieval performance measurements of LLMs made with Needle In A Haystack! Meanwhile, I was curious what results open-source models would show.
My experiment with Mistral-7B-Instruct-v0.2 showed very poor performance. That seemed strange, so I analyzed the model's responses and found that in most cases the needle was retrieved correctly, but the evaluator gave a very low score because of the additional explanation around it. My questions are as follows:
- I think that, to evaluate a model's retrieval performance, it is sufficient that the needle sentence is included in the generation. Is generating the exact needle sentence, without any additional explanation, the evaluation criterion for Needle In A Haystack?
- If so, is it possible to accurately evaluate a non-instruction-tuned model, which keeps generating until it hits the context length limit, with Needle In A Haystack?
The following is an example of a model response that received a score of 1.
[Score 1 Response]
The best thing to do in San Francisco, according to the author, is to eat a sandwich and sit in Dolores Park on a sunny day. However,
the author also warns that as the world becomes more addictive, living a normal life will become increasingly difficult and the two
senses of "normal" will be driven further apart. The author suggests that people will have to figure out for themselves what to avoid
and how, and even that may not be enough as existing things may become more addictive. The author also mentions that most
people he knows have problems with Internet addiction and that he avoids having an iPhone to limit his access to the Internet. The
author's latest trick for avoiding addiction is taking long hikes.
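To make the first question concrete, here is a minimal sketch of a containment-style check, which only asks whether the needle's key content appears in the response. It is not how the NIAH evaluator works; the function names and the `KEY_PHRASES` list are hypothetical, and the phrases are simply taken from the answer portion of the response above (shortened here).

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumeric characters into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def needle_found(response: str, key_phrases: list[str]) -> bool:
    """Return True only if every key phrase appears somewhere in the response."""
    haystack = normalize(response)
    return all(normalize(phrase) in haystack for phrase in key_phrases)


# Hypothetical key phrases, copied from the answer part of the response above.
KEY_PHRASES = ["eat a sandwich", "Dolores Park", "sunny day"]

# Shortened version of the response that received a score of 1.
response = (
    "The best thing to do in San Francisco, according to the author, is to eat a "
    "sandwich and sit in Dolores Park on a sunny day. However, the author also warns "
    "that as the world becomes more addictive, living a normal life will become "
    "increasingly difficult."
)

print(needle_found(response, KEY_PHRASES))                             # True
print(needle_found("The author recommends long hikes.", KEY_PHRASES))  # False
```

Because only containment is checked, any text after the needle does not change the result, so a model that keeps generating until it hits a length limit would still pass. The trade-off is that this measures recall only and says nothing about conciseness.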
Hey! Thanks for the question and research. TLDR: I think it is up to the test designer to decide what they want to look for.
Just as there are multiple accuracy metrics in RAG, I believe the same goes for NIAH.
In your response, it got the answer, but it included a lot of fluff too. The fluff that was included seems to be slightly off topic.
I agree it doesn't feel like a 1, but that is a subjective opinion.
I think the route forward is allowing users more control over which evaluator is used (and the grading criteria) to allow them to make the test they want.
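As a rough sketch of what that kind of control could look like, the snippet below lets the caller pick a scoring function by name. Everything here is hypothetical (the `EVALUATORS` registry, both scoring functions, and the `max_words` threshold) and is not the repository's actual API; it only illustrates that "found the needle" and "found the needle without fluff" can be separate, selectable criteria.

```python
from typing import Callable, Dict, List


def recall_only(response: str, phrases: List[str]) -> float:
    """Fraction of expected phrases that appear in the response (case-insensitive)."""
    if not phrases:
        raise ValueError("phrases must not be empty")
    text = response.lower()
    hits = sum(1 for phrase in phrases if phrase.lower() in text)
    return hits / len(phrases)


def recall_with_brevity(response: str, phrases: List[str], max_words: int = 40) -> float:
    """Recall scaled down when the response is longer than max_words (assumed threshold)."""
    recall = recall_only(response, phrases)
    n_words = len(response.split())
    penalty = min(1.0, max_words / n_words) if n_words else 0.0
    return recall * penalty


# Hypothetical registry: the test designer chooses the grading criteria by name.
EVALUATORS: Dict[str, Callable[[str, List[str]], float]] = {
    "recall_only": recall_only,
    "recall_with_brevity": recall_with_brevity,
}


def score(response: str, phrases: List[str], evaluator: str = "recall_only") -> float:
    """Score a response with the evaluator selected by name."""
    return EVALUATORS[evaluator](response, phrases)


phrases = ["eat a sandwich", "Dolores Park", "sunny day"]
short = "Eat a sandwich and sit in Dolores Park on a sunny day."
long = short + " " + "The author also discusses addiction. " * 20

print(score(short, phrases))                        # 1.0
print(score(long, phrases))                         # 1.0
print(score(long, phrases, "recall_with_brevity"))  # lower than 1.0 because of the extra text
```

With this split, the same response can score perfectly under a recall-only criterion and poorly under one that penalizes extra text, which is the kind of choice left to the test designer.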
Thank you for answering my question! I asked out of curiosity because there was a slight difference between how I understood NIAH's evaluation criteria and the actual criteria. Thanks to Kamradt's answer, my question is resolved.
As you mentioned, the model response contains unrelated content that drifts from the purpose of the question, so it is hard to call it a perfect retrieval. To run experiments that better fit NIAH's evaluation criteria, it would be a good idea to use methods such as prompt engineering!
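For instance, one way to try this is to constrain the answer format in the prompt itself. The sketch below is only an illustration: the `build_prompt` helper and its wording are assumptions, and it has not been tested against Mistral-7B-Instruct-v0.2 or any other model.

```python
def build_prompt(context: str, question: str) -> str:
    """Build a prompt that asks for the single relevant sentence and nothing else."""
    return (
        "You are given a document and a question.\n"
        "Answer using only the single most relevant sentence from the document. "
        "Do not add explanations or summarize other parts of the document.\n\n"
        f"Document:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder context; in a real run this would be the long haystack text.
prompt = build_prompt(
    context="(long essay text with the needle inserted somewhere)",
    question="What is the best thing to do in San Francisco?",
)
print(prompt)
```

Whether an instruction like this actually removes the extra explanation depends on the model, so it would need to be checked empirically.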
Thank you for releasing such a useful benchmark for measuring LLM performance!