These seven steps are not deterministic at each step; is this about alleviating hallucination rather than eliminating it?
yuedajiong opened this issue
human-level?
Hi,
I'm not sure I fully understand your question. All steps in WikiChat are deterministic; for instance, all temperature values are set to 0 by default (i.e., greedy decoding).
You may see slightly different results with OpenAI because its outputs are not fully deterministic even with greedy decoding (see the numerous posts about this topic on the OpenAI community forum, for example this).
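For reference, here is a minimal sketch of what requesting greedy decoding from the OpenAI API looks like; the model name and prompt are placeholders, and this is not WikiChat's actual code:

```python
# Minimal sketch (not WikiChat's code): requesting greedy decoding by setting
# temperature to 0 with the OpenAI Python client (openai>=1.0).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[{"role": "user", "content": "Tell me about Haruki Murakami."}],
    temperature=0,  # greedy decoding; OpenAI outputs can still vary slightly
)
print(response.choices[0].message.content)
```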
Also, please note that asking the same question with different conversation histories might result in different outputs, since previous conversation turns are part of the input given to the LLM. For example, the following two conversations may have slightly different outputs:
Conversation 1:
User: Hi
WikiChat: Hi, how can I help you?
User: Tell me about Haruki Murakami.

Conversation 2:
User: Tell me about Haruki Murakami.
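To make this concrete, here is a minimal, illustrative sketch of how the two conversations above become different LLM inputs (the message format follows the common chat-completions convention and is not WikiChat's actual prompt construction):

```python
# Conversation 1: previous turns are part of the input sent to the LLM.
conversation_1 = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hi, how can I help you?"},
    {"role": "user", "content": "Tell me about Haruki Murakami."},
]

# Conversation 2: the same question, but with no prior turns.
conversation_2 = [
    {"role": "user", "content": "Tell me about Haruki Murakami."},
]

# Even with temperature=0, these two inputs differ, so the outputs can differ too.
assert conversation_1 != conversation_2
```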
Regarding alleviating vs. eliminating hallucination: WikiChat achieves 97.3% factual accuracy on average when tested with gpt-4 as the underlying LLM. This test includes conversations on challenging "recent" or "tail" topics.
If weaker LLMs are used as WikiChat's backbone, hallucination increases slightly. For example, with text-davinci-003, factual accuracy is 89.2%. Please refer to our paper for more numbers and details.
Thanks