truera/trulens

[BUG] How do I evaluate my Llama-Index app for hallucination?

Closed this issue · 1 comment

Bug Description
What happened?

To Reproduce
Which steps should someone take to run into the same error? A small, reproducible code example is useful here.

Expected behavior
A clear and concise description of what you expected to happen.

Relevant Logs/Tracebacks
Please copy and paste any relevant log output. This will be automatically formatted into code, so no need for backticks. If the issue is related to the TruLens dashboard, please also include a screenshot.

Environment:

  • OS: [e.g. MacOS, Windows]
  • Python Version
  • TruLens version
  • Versions of other relevant installed libraries

Additional context
Add any other context about the problem here.

🤖

Hey there, @joshreini1! Good to see you back in the world of TruLens. Ready to tackle another adventure together? 😜

To evaluate your Llama-Index app for hallucination with TruLens, focus on the "hallucination triad": groundedness, context relevance, and answer relevance. Here's how, step by step:
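
The three snippets below share some setup: a provider instance (openai), a selector for the retrieved context chunks (context), and numpy. A minimal sketch of that setup, assuming the trulens_eval API (exact import paths can vary slightly between versions, and query_engine is your existing LlamaIndex query engine):

import numpy as np

from trulens_eval import Feedback, TruLlama
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider import OpenAI

# LLM provider used by the feedback functions below
openai = OpenAI()

# Selector pointing at the context chunks retrieved by your query engine
context = TruLlama.select_context(query_engine)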

  1. Groundedness: Use the Groundedness feedback function to evaluate how well the LLM's response is supported by the provided context. Each statement in the LLM's output is scored against the context chunks collected from the app, and the per-statement scores are aggregated with grounded.grounded_statements_aggregator.

    grounded = Groundedness(groundedness_provider=openai)
    f_groundedness = (
        Feedback(grounded.groundedness_measure_with_cot_reasons)
        .on(context.collect())  # collect retrieved context chunks into one list
        .on_output()            # the app's final answer
        .aggregate(grounded.grounded_statements_aggregator)
    )
  2. Question/Answer Relevance: Evaluate the relevance between the overall question and the LLM's answer using the relevance feedback function from the OpenAI provider. This checks whether the answer actually addresses the question asked.

    f_qa_relevance = Feedback(openai.relevance).on_input_output()
  3. Question/Statement Relevance: Assess the relevance between the question and each retrieved context chunk using the qs_relevance feedback function. This checks whether the chunks used to generate the response are relevant to the question; the per-chunk scores are aggregated with np.mean.

    f_qs_relevance = (
        Feedback(openai.qs_relevance)
        .on_input()          # the user's question
        .on(context)         # each retrieved context chunk, scored individually
        .aggregate(np.mean)  # average relevance across chunks
    )

After defining these feedback functions, instrument your app for logging with TruLens using TruLlama and specify the feedback functions you want to apply:

from trulens_eval import TruLlama

tru_query_engine_recorder = TruLlama(
    query_engine,
    app_id='LlamaIndex_App1',
    feedbacks=[f_groundedness, f_qa_relevance, f_qs_relevance],
)

You can then execute queries and retrieve feedback results to evaluate the hallucination aspects of your Llama-Index app:

with tru_query_engine_recorder as recording:
    query_engine.query("What did the author do growing up?")

# Retrieve feedback results (this blocks until all feedback evaluations finish)
for feedback, feedback_result in recording.get().wait_for_feedback_results().items():
    print(feedback.name, feedback_result.result)

Together, these feedback functions systematically evaluate the groundedness, context relevance, and answer relevance of your Llama-Index app's responses, surfacing potential hallucinations in the generated content. You can find more details in the Llama Index Quickstart example.
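
If you want to look at the scores in aggregate rather than per record, here is a minimal sketch assuming the standard trulens_eval Tru workflow (the app_id matches the recorder above):

from trulens_eval import Tru

tru = Tru()

# Pull logged records and feedback scores into a pandas DataFrame
records_df, feedback_names = tru.get_records_and_feedback(app_ids=['LlamaIndex_App1'])
print(records_df[feedback_names].mean())

# Or browse records and scores interactively in the TruLens dashboard
tru.run_dashboard()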

Regarding the Groundedness feedback function: it measures the groundedness of a statement with respect to a source text, i.e. how well the source material supports each sentence of the statement. The class provides several methods for this evaluation, using either a Large Language Model (LLM) provider like OpenAI or a Natural Language Inference (NLI) model from Huggingface. You can find more details in the groundedness.py file.
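
As a rough sketch of the NLI route (assuming your trulens_eval version exposes groundedness_measure_with_nli and the Huggingface provider; check groundedness.py for the exact method names in your version):

from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.hugs import Huggingface

# NLI-based groundedness: scores statements with a Huggingface NLI model
# instead of an LLM judge
grounded_nli = Groundedness(groundedness_provider=Huggingface())
f_groundedness_nli = (
    Feedback(grounded_nli.groundedness_measure_with_nli)
    .on(context.collect())
    .on_output()
    .aggregate(grounded_nli.grounded_statements_aggregator)
)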

If you're encountering any specific errors or issues while implementing this, could you please provide more details? This could include the exact error message, traceback, or any other relevant information that could help in diagnosing the problem.

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.