truera/trulens

The feedback gives a high score when the model says it doesn't know something, even though the answer is stated in the prompt

loulblemo opened this issue · 3 comments

Discussed in #1182

Originally posted by loulblemo June 6, 2024

from trulens_eval import Feedback
from trulens_eval import OpenAI as fOpenAI

provider = fOpenAI()

f_qa_relevance = Feedback(
    provider.relevance_with_cot_reasons,
    name="Answer Relevance"
).on_input_output()

f_qa_relevance("My favorite color is blue, what is my favorite color?", "I don't know")

(1.0,
 {'reason': 'Criteria: The response directly addresses the prompt by stating "I don\'t know."\nSupporting Evidence: The response provided is a direct answer to the prompt, acknowledging that the responder does not know the favorite color of the user. This aligns perfectly with the criteria of the evaluation, as it is a relevant and intentional refusal to answer the question posed in the prompt. Therefore, the response deserves a high score of 10 for being the most relevant in this context.'})

To ensure that the response "I don't know" does not receive a high relevance score when the prompt clearly states the favorite color, you need to adjust the evaluation criteria in the PromptResponseRelevance class. Specifically, modify the guideline that currently states "Answers that intentionally do not answer the question, such as 'I don't know', should also be counted as the most RELEVANT."

Here is the modified PromptResponseRelevance class:

class PromptResponseRelevance(Relevance, WithPrompt):
    system_prompt: ClassVar[PromptTemplate] = PromptTemplate.from_template(
        """You are a RELEVANCE grader; providing the relevance of the given RESPONSE to the given PROMPT.
        Respond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. 

        A few additional scoring guidelines:

        - Long RESPONSES should score equally well as short RESPONSES.

        - Answers that intentionally do not answer the question, such as 'I don't know' and model refusals, should be counted as the least RELEVANT.

        - RESPONSE must be relevant to the entire PROMPT to get a score of 10.

        - RELEVANCE score should increase as the RESPONSE provides RELEVANT context to more parts of the PROMPT.

        - RESPONSE that is RELEVANT to none of the PROMPT should get a score of 0.

        - RESPONSE that is RELEVANT to some of the PROMPT should get a score of 2, 3, or 4. Higher score indicates more RELEVANCE.

        - RESPONSE that is RELEVANT to most of the PROMPT should get a score between a 5, 6, 7 or 8. Higher score indicates more RELEVANCE.

        - RESPONSE that is RELEVANT to the entire PROMPT should get a score of 9 or 10.

        - RESPONSE that is RELEVANT and answers the entire PROMPT completely should get a score of 10.

        - RESPONSE that is confidently FALSE should get a score of 0.

        - RESPONSE that is only seemingly RELEVANT should get a score of 0.

        - Never elaborate.
        """
    )
    user_prompt: ClassVar[PromptTemplate] = PromptTemplate.from_template(
        """PROMPT: {prompt}

        RESPONSE: {response}

        RELEVANCE: """
    )

This change ensures that responses like "I don't know" will receive a low relevance score when they do not address the prompt appropriately [1][2].
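
With that guideline changed, the original check from the top of this issue (repeated here as a minimal sketch; the exact score and reasoning depend on the grading model) should now return a low relevance score for the refusal:

from trulens_eval import Feedback
from trulens_eval import OpenAI as fOpenAI

provider = fOpenAI()

f_qa_relevance = Feedback(
    provider.relevance_with_cot_reasons,
    name="Answer Relevance"
).on_input_output()

# Expected to be scored as least relevant under the modified guideline
f_qa_relevance("My favorite color is blue, what is my favorite color?", "I don't know")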

To continue talking to Dosu, mention @dosu.

This is my interpretation: as it stands today, relevance_with_cot_reasons mainly assesses how relevant an answer is to the user's query, so a model answering that it doesn't know the user's favorite color can (or maybe even should) be considered perfectly relevant, hence the 1.0 score.

Let me know if this makes sense.

@loulblemo please check out custom feedback functions if you'd like different handling for your case.

For example, you could create a custom feedback function as follows, changing the handling of "I don't know" responses so that they are scored as least relevant.

from trulens_eval.feedback.provider.openai import OpenAI
from typing import Tuple, Dict
from trulens_eval.feedback import prompts

class Custom_OpenAI(OpenAI):
    def relevance_with_cot_reasons_extreme(self, prompt: str, response: str) -> Tuple[float, Dict]:
        """
        Tweaked version of answer relevance, extending OpenAI provider.
        A function that completes a template to check the relevance of the answer to the question.
        Updated to score "I don't know" and similar responses to have low relevance.
        Also uses chain of thought methodology and emits the reasons.

        Args:
            prompt (str): A prompt.
            response (str): A response to the prompt.

        Returns:
            Tuple[float, Dict]: A value between 0 and 1 (0 being "not relevant" and 1 being "relevant") and a dictionary with the chain-of-thought reasons.
        """

        # Update handling of "I don't know" so such answers are scored as least relevant
        system_prompt = prompts.ANSWER_RELEVANCE_SYSTEM.replace(
            "Answers that intentionally do not answer the question, such as 'I don't know', should also be counted as the most RELEVANT.",
            "Answers that intentionally do not answer the question, such as 'I don't know', should also be counted as the least RELEVANT."
        )
        
        user_prompt = str.format(prompts.ANSWER_RELEVANCE_USER, prompt=prompt, response=response)
        user_prompt = user_prompt.replace(
            "RELEVANCE:", prompts.COT_REASONS_TEMPLATE
        )

        return self.generate_score_and_reasons(system_prompt, user_prompt)
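
To use it, instantiate the custom provider and wrap the new method in a Feedback object, just like the built-in relevance feedback above (a minimal sketch; the feedback name is only a label):

from trulens_eval import Feedback

custom_provider = Custom_OpenAI()

f_qa_relevance_strict = Feedback(
    custom_provider.relevance_with_cot_reasons_extreme,
    name="Answer Relevance (strict)"
).on_input_output()

# "I don't know" should now be scored as least relevant
f_qa_relevance_strict("My favorite color is blue, what is my favorite color?", "I don't know")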