[Feature Request]: CRITIC, Standardizing Critique Few-Shots

Closed this issue 25 days ago · 0 comments

Feature Description

Figure 1. Number of few-shot examples per benchmark.

Benchmark	Number of critique few-shot examples	Matches Figure 1	Issues
HotpotQA	5	✅	different examples used
FEVER	5	✅
TriviaQA	5	✅	different examples used, different number of few-shot examples
AmbigNQ	5	✅	different examples used, different number of few-shot examples, different order
GSM8K	5	✅	different examples used, different number of few-shot examples
SVAMP	5	✅	different examples used, different number of few-shot examples
TabMWP	5	✅	different examples used, different number of few-shot examples
MBPP	5	✅	different examples used, different number of few-shot examples
HumanEval	5	✅ (HumanEval is 0-shot, the convention for 0-shot is to have 5 examples)

Table 1. Number of critique few-shot examples per benchmark.

For CRITIC, we have to craft critique prompts. These are different from the few-shot examples.

Every benchmark must satisfy the following 2 criteria:

The number of critique few-shot examples == the number of few-shot examples for that benchmark (Figure 1)
For each benchmark, each critique few-shot example should use the same question as the corresponding few-shot example, in the same order

The green checkmark in the "Matches Figure 1" column indicates both criteria are satisfied. The "Issues" column indicates why the 2 sets of benchmark few-shot examples don't match.
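As a rough illustration, the two criteria could be checked mechanically along these lines. This is only a sketch: the function name `matches_few_shots` and the idea of representing each example set as a list of question strings are assumptions made for illustration, not part of the existing codebase. It also assumes that order matters.

```python
from typing import List


def matches_few_shots(few_shot_questions: List[str],
                      critique_questions: List[str]) -> bool:
    """Return True when both criteria hold for one benchmark.

    Hypothetical helper: each argument is the list of questions used by
    the few-shot examples and by the critique few-shot examples.
    """
    # Criterion 1: both sets contain the same number of examples.
    if len(few_shot_questions) != len(critique_questions):
        return False
    # Criterion 2: the i-th critique example reuses the i-th few-shot question.
    return all(
        fs.strip() == cq.strip()
        for fs, cq in zip(few_shot_questions, critique_questions)
    )
```

A benchmark would earn the checkmark only when this returns `True` for its two example sets.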

Note: Make sure to follow the prompt formatting.

Ways to Go About This

For any given benchmark:

num few-shots > num critique few-shots
- write more critique examples
- ensure all critique examples match the few-shot examples (same questions, in the same order)
- you may have to replace existing critique examples with new ones that use the questions from the few-shots
num few-shots == num critique few-shots AND no checkmark above (Table 1)
- replace existing critique examples with ones that use the same questions as the few-shot examples
- ensure ordering is the same too
num few-shots < num critique few-shots
- remove some critique examples
- ensure the remaining critique examples use the same questions as the few-shot examples and that the ordering matches
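The three cases above can be sketched as a small planning helper. Everything here is hypothetical: `plan_fix` and its step strings are illustrative names, each example set is again assumed to be a list of question strings, and questions are assumed to be unique within a set.

```python
from typing import List


def plan_fix(few_shots: List[str], critiques: List[str]) -> List[str]:
    """Suggest the steps needed to align critique examples with few-shots."""
    steps: List[str] = []
    # Compare counts to pick the matching case from the list above.
    if len(few_shots) > len(critiques):
        steps.append("write more critique examples")
    elif len(few_shots) < len(critiques):
        steps.append("remove some critique examples")
    # Any few-shot question without a critique example must be covered.
    if any(q not in critiques for q in few_shots):
        steps.append("add or replace critique examples to cover every few-shot question")
    # Shared questions must appear in the same relative order in both sets.
    shared_in_critique_order = [q for q in critiques if q in few_shots]
    shared_in_few_shot_order = [q for q in few_shots if q in critiques]
    if shared_in_critique_order != shared_in_few_shot_order:
        steps.append("reorder critique examples to match the few-shot order")
    return steps
```

An empty result means the two sets already line up, so no change is needed for that benchmark.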