/rag-llm-prompt-evaluator-guard


Overview

Developed by: Arize AI
Date of development: August 6, 2024
Validator type: RAG, LLM Judge
Blog: https://docs.arize.com/arize/large-language-models/guardrails
License: Apache 2.0
Input/Output: RAG Retrieval or Output

Description

Given a RAG application, this Guard will use an LLM Judge to decide whether the LLM response is acceptable. Users can instantiate the Guard with one of the Arize off-the-shelf evaluators (Context Relevancy, Hallucination or QA Correctness), which match our off-the-shelf RAG evaluators in Phoenix.

Alternatively, users can customize the Guard with their own LLM Judge by writing a custom prompt class that inherits from the abstract ArizeRagEvalPromptBase class.
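
For example, a custom judge might check whether every claim in the response is supported by the retrieved context. The sketch below is illustrative only: the import path for ArizeRagEvalPromptBase, the generate_prompt method name, and its argument names are assumptions modeled on the off-the-shelf prompt classes, so check the validator source for the exact abstract interface.

# Illustrative sketch of a custom LLM Judge prompt (interface names assumed; see note above)
from guardrails import Guard
from guardrails.hub import LlmRagEvaluator, ArizeRagEvalPromptBase  # import path assumed

class CitationSupportPrompt(ArizeRagEvalPromptBase):
    def generate_prompt(self, user_input_message: str, reference_text: str, llm_response: str) -> str:
        # Build the prompt sent to the Judge LLM; it must ask for one of the
        # pass/fail strings configured on the Guard below
        return (
            "You are given a question, reference text retrieved by a RAG application, and an answer.\n"
            'Respond with a single word: "supported" if every claim in the answer is backed by the\n'
            'reference text, and "unsupported" otherwise.\n\n'
            f"[Question]: {user_input_message}\n"
            f"[Reference text]: {reference_text}\n"
            f"[Answer]: {llm_response}\n"
        )

custom_guard = Guard().use(
    LlmRagEvaluator(
        eval_llm_prompt_generator=CitationSupportPrompt(prompt_name="citation_support_judge_llm"),
        llm_evaluator_fail_response="unsupported",
        llm_evaluator_pass_response="supported",
        llm_callable="gpt-4o-mini",
        on_fail="exception",
        on="prompt",
    ),
)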

Benchmark Results

For the off-the-shelf Guards, we have benchmarked results on public datasets.

Context Relevancy LLM Judge

We benchmarked the Context Relevancy Guard on the "wiki_qa-train" benchmark dataset in benchmark_context_relevancy_prompt.py.

Model: gpt-4o-mini
Guard Results
              precision    recall  f1-score   support

    relevant       0.70      0.86      0.77        93
   unrelated       0.85      0.68      0.76       107

    accuracy                           0.77       200
   macro avg       0.78      0.77      0.76       200
weighted avg       0.78      0.77      0.76       200

Latency (seconds)
count    200.000000
mean       2.812122
std        1.753805
min        1.067620
25%        1.708051
50%        2.248962
75%        3.321251
max       14.102804
Name: guard_latency_gpt-4o-mini, dtype: float64
median latency
2.2489616039965767

Hallucination LLM Judge

This Guard was benchmarked on the "halueval_qa_data" from the HaluEval benchmark:

Model: gpt-4o-mini
Guard Results
              precision    recall  f1-score   support

     factual       0.79      0.97      0.87       129
hallucinated       0.96      0.73      0.83       121

    accuracy                           0.85       250
   macro avg       0.87      0.85      0.85       250
weighted avg       0.87      0.85      0.85       250

Latency (seconds)
count    250.000000
mean       1.865513
std        0.603700
min        1.139974
25%        1.531160
50%        1.758210
75%        2.026153
max        6.403010
Name: guard_latency_gpt-4o-mini, dtype: float64
median latency
1.7582097915001214

QA Correctness LLM Judge

This Guard was benchmarked on version 2.0 of the large-scale Stanford Question Answering Dataset (SQuAD 2.0): https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/default/15785042.pdf

Model: gpt-4o-mini
Guard Results
              precision    recall  f1-score   support

     correct       1.00      0.96      0.98       133
   incorrect       0.96      1.00      0.98       117

    accuracy                           0.98       250
   macro avg       0.98      0.98      0.98       250
weighted avg       0.98      0.98      0.98       250

Latency (seconds)
count    250.000000
mean       2.610912
std        1.415877
min        1.148114
25%        1.678278
50%        2.263149
75%        2.916726
max       10.625763
Name: guard_latency_gpt-4o-mini, dtype: float64
median latency
2.263148645986803

Installation

guardrails hub install hub://arize-ai/llm_rag_evaluator

Usage Examples

Validating string output via Python

In this example, we apply the validator to a string output generated by an LLM.

# Import Guard and Validator
from guardrails.hub import LlmRagEvaluator, HallucinationPrompt
from guardrails import Guard

# Setup Guard with the off-the-shelf hallucination judge
guard = Guard().use(
    LlmRagEvaluator(
        # Off-the-shelf prompt class; ContextRelevancyPrompt, QACorrectnessPrompt,
        # or a custom ArizeRagEvalPromptBase subclass can be used instead
        eval_llm_prompt_generator=HallucinationPrompt(prompt_name="hallucination_judge_llm"),
        llm_evaluator_fail_response="hallucinated",  # judge output that fails validation
        llm_evaluator_pass_response="factual",       # judge output that passes validation
        llm_callable="gpt-4o-mini",                  # LLM used as the judge
        on_fail="exception",
        on="prompt"
    ),
)

# Metadata required by the validator: the user's question, the retrieved
# context, and the candidate LLM response to be judged
metadata = {
    "user_message": "User message",
    "context": "Context retrieved from RAG application",
    "llm_response": "Proposed response from LLM before Guard is applied"
}

# Run the Guard on the proposed response
guard.validate(llm_output="Proposed response from LLM before Guard is applied", metadata=metadata)
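
Because the Guard above is configured with on_fail="exception", a failing evaluation raises instead of returning quietly. A common pattern, sketched below under that assumption (catching the broad Exception type purely for illustration), is to fall back to a safe response when the judge rejects the proposed answer:

proposed = "Proposed response from LLM before Guard is applied"

try:
    # Raises if the Judge LLM returns the configured fail response ("hallucinated")
    guard.validate(llm_output=proposed, metadata=metadata)
    final_response = proposed
except Exception as err:  # broad catch for illustration only
    final_response = "Sorry, I can't answer that based on the retrieved context."
    print(f"Guard blocked the LLM response: {err}")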

API Reference

__init__(self, eval_llm_prompt_generator, llm_evaluator_fail_response, llm_evaluator_pass_response, llm_callable, on_fail="noop")

    Initializes a new instance of the LlmRagEvaluator class.

    Parameters

    • eval_llm_prompt_generator (Type[ArizeRagEvalPromptBase]): Child class that will use a fixed interface to generate a prompt for an LLM Judge given the retrieved context, user input message and proposed LLM response. Off-the-shelf child classes include QACorrectnessPrompt, HallucinationPrompt and ContextRelevancyPrompt.
    • llm_evaluator_fail_response (str): Expected string output from the Judge LLM when the validator fails, e.g. "hallucinated".
    • llm_evaluator_pass_response (str): Expected string output from the Judge LLM when the validator passes, e.g. "factual".
    • llm_callable (str): Callable LLM string used to instantiate the LLM Judge, such as gpt-4o-mini.
    • on_fail (str, Callable): The policy to enact when a validator fails. If str, must be one of reask, fix, filter, refrain, noop, exception or fix_reask. Otherwise, must be a function that is called when the validator fails.
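
Putting these parameters together, an analogous configuration using the off-the-shelf QA correctness judge might look like the sketch below; the prompt_name value is illustrative, and the pass/fail strings mirror the labels reported in the benchmark section above.

from guardrails import Guard
from guardrails.hub import LlmRagEvaluator, QACorrectnessPrompt

# Guard configured with the off-the-shelf QA correctness judge
qa_guard = Guard().use(
    LlmRagEvaluator(
        eval_llm_prompt_generator=QACorrectnessPrompt(prompt_name="qa_correctness_judge_llm"),  # illustrative name
        llm_evaluator_fail_response="incorrect",
        llm_evaluator_pass_response="correct",
        llm_callable="gpt-4o-mini",
        on_fail="exception",
        on="prompt",
    ),
)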

validate(self, value, metadata) -> ValidationResult

    Validates the given `value` using the rules defined in this validator, relying on the `metadata` provided to customize the validation process. This method is automatically invoked by `guard.parse(...)`, ensuring the validation logic is applied to the input data.

    Note:

    1. This method should not be called directly by the user. Instead, invoke guard.parse(...) where this method will be called internally for each associated Validator.
    2. When invoking guard.parse(...), ensure to pass the appropriate metadata dictionary that includes keys and values required by this validator. If guard is associated with multiple validators, combine all necessary metadata into a single dictionary.

    Parameters

    • value (Any): The input value to validate.

    • metadata (dict): A dictionary containing metadata required for validation. Keys and values must match the expectations of this validator.

      Key           Type    Description                                                   Default
      user_message  String  User input message to RAG application.                        N/A
      context       String  Retrieved context from RAG application.                       N/A
      llm_response  String  Proposed response from the LLM used in the RAG application.   N/A
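
As noted above, all of these keys are supplied through a single metadata dictionary; if the guard carried additional validators, their metadata keys would be merged into the same dictionary. A minimal sketch, assuming the guard configured in the usage example and a placeholder retrieval helper:

def retrieve_context(question: str) -> str:
    # Placeholder for your RAG retrieval step (e.g. a vector store lookup)
    return "France is a country in Western Europe. Its capital is Paris."

user_message = "What is the capital of France?"
context = retrieve_context(user_message)
llm_response = "The capital of France is Paris."

# Combine every key required by this validator (and any other validators on the
# guard) into one metadata dictionary
guard.parse(
    llm_output=llm_response,
    metadata={
        "user_message": user_message,
        "context": context,
        "llm_response": llm_response,
    },
)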