agential-ai/agential

[Feature Request]: CRITIC, Standardizing Critique Few-Shots


Feature Description

Figure 1. Number of few-shot examples per benchmark.

| Benchmark | Number of critique few-shot examples | Matches Figure 1 | Issues |
|---|---|---|---|
| HotpotQA | 5 | | different examples used |
| FEVER | 5 | | |
| TriviaQA | 5 | | different examples used, different number of few-shot examples |
| AmbigNQ | 5 | | different examples used, different number of few-shot examples, different order |
| GSM8K | 5 | | different examples used, different number of few-shot examples |
| SVAMP | 5 | | different examples used, different number of few-shot examples |
| TabMWP | 5 | | different examples used, different number of few-shot examples |
| MBPP | 5 | | different examples used, different number of few-shot examples |
| HumanEval | 5 | ✅ (HumanEval is 0-shot, the convention for 0-shot is to have 5 examples) | |

Table 1. Number of critique few-shot examples per benchmark.

For CRITIC, we have to craft critique prompts. These are different from the few-shot examples.

Every benchmark must satisfy the following 2 criteria:

  • The number of critique few-shot examples == the number of few-shot examples for that benchmark (Figure 1)
  • For each benchmark, every critique few-shot example should use the same question as the corresponding few-shot example

The green checkmark in the "Matches Figure 1" column indicates both criteria are satisfied. The "Issues" column indicates why the 2 sets of benchmark few-shot examples don't match.
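The two criteria above can be checked mechanically. The following is a minimal sketch, assuming each set of examples has already been reduced to an ordered list of question strings; the function name `matches_few_shots` and that list representation are hypothetical and not part of the repository.

```python
from typing import List


def matches_few_shots(
    few_shot_questions: List[str], critique_questions: List[str]
) -> bool:
    """Return True only if both criteria hold for one benchmark.

    Hypothetical helper: assumes the questions were extracted from the
    few-shot and critique few-shot prompts beforehand, in prompt order.
    """
    # Criterion 1: same number of examples in both sets.
    same_count = len(few_shot_questions) == len(critique_questions)

    # Criterion 2: the i-th critique example uses the i-th few-shot question.
    same_questions_in_order = all(
        fs.strip() == cr.strip()
        for fs, cr in zip(few_shot_questions, critique_questions)
    )
    return same_count and same_questions_in_order
```

A benchmark would earn the checkmark only when this returns `True`; a mismatch in count, question text, or order makes it return `False`.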

Note: Make sure to follow the prompt formatting.

Ways to Go About This

For any given benchmark:

  • num few-shots > num critique few-shots
    • write more critique examples
    • ensure all critique examples match the few-shot examples (same questions, same ordering)
    • you may have to replace existing critique examples and use the questions from the few-shots
  • num few-shots == num critique few-shots AND no checkmark above (Table 1)
    • replace existing critique examples with ones that use the few-shot questions
    • ensure ordering is the same too
  • num few-shots < num critique few-shots
    • remove some critique examples
    • ensure the remaining critique examples use the same questions as the few-shots, in the same order
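The three cases above can be summarized as a small decision helper. This is only a sketch under the assumption that both sets are available as ordered lists of question strings; `suggest_action` and its return strings are hypothetical names chosen for illustration.

```python
from typing import List


def suggest_action(
    few_shot_questions: List[str], critique_questions: List[str]
) -> str:
    """Map one benchmark's two question lists to the action described above."""
    n_few_shots = len(few_shot_questions)
    n_critiques = len(critique_questions)

    if n_few_shots > n_critiques:
        return "write more critique examples"
    if n_few_shots < n_critiques:
        return "remove some critique examples"
    # Counts are equal: the questions and their order must still line up.
    if few_shot_questions != critique_questions:
        return "replace or reorder critique examples"
    return "no change needed"
```

In the first two cases the questions and ordering still need to be checked after the count is fixed, as the sub-bullets note.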