Outdated documentation for Groundedness
Arseniy-II opened this issue · 2 comments
Bug Description
Here is the reference for the old use of Groundedness
Expected behavior
According to the recent update, it should be used differently.
To address the outdated documentation for Groundedness, it should be updated to reflect the changes introduced in the pull request Updated Groundedness Usage. Below is the updated documentation content that should replace the old references:
Groundedness Benchmark
Overview
The Groundedness functionality evaluates the consistency of summarized responses with the source texts. This is done using various feedback function providers such as OpenAI's GPT-3.5-turbo, GPT-4, and Huggingface.
Updated Usage
The recent update introduces new methods and classes for Groundedness. Here is how you can use the updated Groundedness functionality:
Importing Required Modules and Functions
```python
import numpy as np  # used below to aggregate per-statement scores

from trulens_eval.feedback import GroundTruthAgreement, Groundedness
from trulens_eval import TruBasicApp, Feedback, Tru, Select
from test_cases import generate_summeval_groundedness_golden_set
```
Generating Test Cases
```python
Tru().reset_database()

test_cases_gen = generate_summeval_groundedness_golden_set("./datasets/summeval_test_100.json")
groundedness_golden_set = [next(test_cases_gen) for _ in range(100)]
```
Setting Environment Variables
```python
import os

os.environ["OPENAI_API_KEY"] = "..."
os.environ["HUGGINGFACE_API_KEY"] = "..."
```
Benchmarking Groundedness Feedback Function Providers
Huggingface Provider
```python
from trulens_eval.feedback.provider.hugs import Huggingface

huggingface_provider = Huggingface()
groundedness_hug = Groundedness(groundedness_provider=huggingface_provider)
f_groundedness_hug = Feedback(
    groundedness_hug.groundedness_measure, name="Groundedness Huggingface"
).on_input().on_output().aggregate(groundedness_hug.grounded_statements_aggregator)

def wrapped_groundedness_hug(input, output):
    return np.mean(list(f_groundedness_hug(input, output)[0].values()))
```
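The Huggingface wrapper above averages the per-statement scores returned by the feedback function. As a minimal sketch of that aggregation step, assuming the result is shaped like a dict of `{statement: score}` (the statements and scores below are made up for illustration):

```python
import numpy as np

# Hypothetical per-statement groundedness scores; the {statement: score}
# shape is an assumption for illustration, not the exact return type.
statement_scores = {
    "The sky is blue.": 1.0,
    "The report was written in 2020.": 0.5,
    "Cats can fly.": 0.0,
}

# Same aggregation as wrapped_groundedness_hug: mean of the score values.
mean_score = float(np.mean(list(statement_scores.values())))
print(mean_score)  # -> 0.5
```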
OpenAI GPT-3.5-turbo Provider
```python
from trulens_eval.feedback.provider import OpenAI

groundedness_openai = Groundedness(groundedness_provider=OpenAI(model_engine="gpt-3.5-turbo"))
f_groundedness_openai = Feedback(
    groundedness_openai.groundedness_measure, name="Groundedness OpenAI GPT-3.5"
).on_input().on_output().aggregate(groundedness_openai.grounded_statements_aggregator)

def wrapped_groundedness_openai(input, output):
    return f_groundedness_openai(input, output)[0]['full_doc_score']
```
OpenAI GPT-4 Provider
```python
groundedness_openai_gpt4 = Groundedness(groundedness_provider=OpenAI(model_engine="gpt-4"))
f_groundedness_openai_gpt4 = Feedback(
    groundedness_openai_gpt4.groundedness_measure, name="Groundedness OpenAI GPT-4"
).on_input().on_output().aggregate(groundedness_openai_gpt4.grounded_statements_aggregator)

def wrapped_groundedness_openai_gpt4(input, output):
    return f_groundedness_openai_gpt4(input, output)[0]['full_doc_score']
```
Creating Feedback Object for Mean Absolute Error
```python
ground_truth = GroundTruthAgreement(groundedness_golden_set)
f_mae = Feedback(
    ground_truth.mae, name="Mean Absolute Error"
).on(Select.Record.calls[0].args.args[0]).on(Select.Record.calls[0].args.args[1]).on_output()
```
Creating TruBasicApp Instances
```python
tru_wrapped_groundedness_hug = TruBasicApp(wrapped_groundedness_hug, app_id="groundedness huggingface", feedbacks=[f_mae])
tru_wrapped_groundedness_openai = TruBasicApp(wrapped_groundedness_openai, app_id="groundedness openai gpt-3.5", feedbacks=[f_mae])
tru_wrapped_groundedness_openai_gpt4 = TruBasicApp(wrapped_groundedness_openai_gpt4, app_id="groundedness openai gpt-4", feedbacks=[f_mae])
```
Running the Groundedness Evaluation
```python
for i in range(len(groundedness_golden_set)):
    source = groundedness_golden_set[i]["query"]
    response = groundedness_golden_set[i]["response"]

    with tru_wrapped_groundedness_hug as recording:
        tru_wrapped_groundedness_hug.app(source, response)
    with tru_wrapped_groundedness_openai as recording:
        tru_wrapped_groundedness_openai.app(source, response)
    with tru_wrapped_groundedness_openai_gpt4 as recording:
        tru_wrapped_groundedness_openai_gpt4.app(source, response)
```
Getting the Leaderboard
```python
Tru().get_leaderboard(app_ids=[]).sort_values(by="Mean Absolute Error")
```
Additional Information
The trulens_eval/feedback/prompts.py file contains various templates and prompts used for evaluating groundedness and other feedback metrics. These include:
- GROUNDEDNESS_REASON_TEMPLATE: Provides a template for scoring and supporting evidence.
- LLM_GROUNDEDNESS_FULL_PROMPT: Used to determine the information overlap between the source and statement.
- SYSTEM_FIND_SUPPORTING and USER_FIND_SUPPORTING: Prompts for finding supporting evidence in the source text.
- COMPREHENSIVENESS_SYSTEM_PROMPT and COMPREHENSIVENESS_USER_PROMPT: Used for evaluating the comprehensiveness of key points in summaries.
These templates and prompts are part of the feedback mechanisms that help in assessing the groundedness and other quality metrics of the responses.
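These templates are plain prompt strings that the providers fill in at evaluation time. As an illustration only, a reason-style template might be filled like this (the template text and field names below are assumptions, not the actual contents of prompts.py):

```python
# Hypothetical template in the style of GROUNDEDNESS_REASON_TEMPLATE;
# the real template text lives in trulens_eval/feedback/prompts.py.
REASON_TEMPLATE = (
    "Statement: {statement}\n"
    "Supporting evidence: {evidence}\n"
    "Score: {score}"
)

filled = REASON_TEMPLATE.format(
    statement="The report was published in 2020.",
    evidence="The source text says the report came out in 2020.",
    score=10,
)
print(filled)
```

The provider sends the filled prompt to the LLM, which returns the score and supporting evidence used by the groundedness measures above.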
This updated documentation should replace the old references to ensure users are guided correctly on how to use the new Groundedness functionality.
Details
Sources
- trulens_eval/trulens_eval/tests/groundedness_benchmark.ipynb
- trulens_eval/trulens_eval/tests/groundedness_smoke_tests.ipynb
- trulens_eval/trulens_eval/feedback/prompts.py
- trulens_eval/trulens_eval/feedback/provider/base.py
- trulens_eval/trulens_eval/feedback/v2/feedback.py
About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
To continue the conversation, mention @dosu.
Fixed by #1161