Structured data extraction/known results for test cases
philpax opened this issue · 1 comment
Hi there!
First off, thanks for this - it's great and as-is it's given me some ideas for prompt design!
I'm working on extracting dates from arbitrary text into JSON: given an optimised system prompt, I can pass GPT-3.5 an arbitrary string and it will produce an array that corresponds to this TypeScript schema:
```typescript
type Result =
  {"Millennium": {"year": number, "metadata"?: string}} |
  {"Century": {"year": number, "metadata"?: string}} |
  {"Decade": {"year": number, "metadata"?: string}} |
  {"Year": {"year": number, "metadata"?: string}} |
  {"Month": {"year": number, "month": number, "metadata"?: string}} |
  {"Day": {"year": number, "month": number, "day": number, "metadata"?: string}} |
  {"Range": {"start": Result, "end": Result, "metadata"?: string}} |
  {"Ambiguous": Result[]} |
  {"Present": {"metadata"?: string}}
```
My existing prompt is a many-shot prompt in which I show how a given date string should be parsed into JSON. It works pretty well, but it's over a thousand tokens, making evaluation costly.
If you're curious, here's a subset of the examples:
- 2024: `[{"Year":{"year":2024}}]`
- c. 2016: `[{"Year":{"year":2016}}]`
- 1930-1937 1942-1945: `[{"Range":{"start":{"Year":{"year":1930}},"end":{"Year":{"year":1937}}}},{"Range":{"start":{"Year":{"year":1942}},"end":{"Year":{"year":1945}}}}]`
- 7–12 June 1967: `[{"Range":{"start":{"Day":{"year":1967,"month":6,"day":7}},"end":{"Day":{"year":1967,"month":6,"day":12}}}}]`
- 16, 20-27 March 1924: `[{"Day":{"year":1924,"month":3,"day":16}},{"Range":{"start":{"Day":{"year":1924,"month":3,"day":20}},"end":{"Day":{"year":1924,"month":3,"day":27}}}}]`
- 12 June 1723 - 26 September 1726: `[{"Range":{"start":{"Day":{"year":1723,"month":6,"day":12}},"end":{"Day":{"year":1726,"month":9,"day":26}}}}]`
- 14th century: `[{"Century":{"year":1300}}]`
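The full system prompt is basically a short instruction followed by all of these pairs concatenated together, roughly the shape sketched below (the instruction wording here is a paraphrase, not my actual prompt), which is where the thousand-plus tokens come from:

```python
# Rough shape of the many-shot prompt; the instruction text is a
# placeholder, not the actual wording I use.
EXAMPLES = [
    ("2024", '[{"Year":{"year":2024}}]'),
    ("c. 2016", '[{"Year":{"year":2016}}]'),
    ("14th century", '[{"Century":{"year":1300}}]'),
    # ...dozens more pairs, which is where most of the tokens go
]

def build_system_prompt() -> str:
    instruction = (
        "Parse the date(s) in the input into a JSON array of Result objects. "
        "Examples:"
    )
    shots = "\n".join(f"{text}: {output}" for text, output in EXAMPLES)
    return f"{instruction}\n{shots}"
```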
I was hoping to use GPE to find a more optimised prompt: take my existing many-shot examples as test cases, specify what each should evaluate to, and let GPE/its GPT instances find a prompt that satisfies those test cases without actually spelling out every example.
Unfortunately, at the time of writing, GPE only comes in two flavours: test cases with GPT evaluation, and classification with multiple-choice answers.
Using the former, I was able to find a slightly better prompt prelude, but the many-shot cases are still required. The classification flavour appears to be coupled fairly tightly to multiple-choice evaluation, so it wouldn't work for me.
For my use case, I'd like a flavour in between: test cases with known solutions, where each prompt is graded on its ability to match the solution. I considered hacking up the classification flavour, but I wasn't sure how best to adapt the prompt to handle this.
Is this something that you think would be feasible? I figure that this might come up in other contexts, too - being able to pass a set of input/output pairs to GPE and have it optimise for the best prompt would be wonderful!
More concretely: I'd like to pass in
```python
test_cases = [
    { 'prompt': "The Bank at Burbank", 'output': '[]' },
    { 'prompt': "Red Bull Studios, AWOLSTUDIO, Avatar Studios, Main and Market, Gymnasium, Fireside Sound Studio", 'output': '[]' },
    { 'prompt': "1980s", 'output': '[{"Decade":{"year":1980}}]' },
    { 'prompt': "3000 BC", 'output': '[{"Year":{"year":-3000}}]' },
    # ...
]
```
and have GPE optimise a prompt that produces the given `output` for a `prompt`.
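To make the grading concrete, here's a minimal sketch of the kind of scoring I have in mind - not GPE's actual API, just exact-match comparison of each completion against the known output (the `generate` callable is a stand-in for however GPE invokes the model):

```python
import json

def score_prompt(generate, system_prompt, test_cases):
    """Score a candidate system prompt by exact-match grading against known outputs."""
    passed = 0
    for case in test_cases:
        completion = generate(system_prompt, case['prompt'])
        try:
            # Compare parsed JSON rather than raw strings so whitespace
            # and key ordering don't cause false failures.
            if json.loads(completion) == json.loads(case['output']):
                passed += 1
        except json.JSONDecodeError:
            pass  # invalid JSON counts as a failure
    return passed / len(test_cases)
```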
I was hoping for something similar; in my case only part of the completion needs to be a specific value, and the rest can vary. I was thinking of having an evaluation function: the test cases would have two fields, input and expected output, and one would define an evaluation function that takes the model's output for that input and the expected output and returns true/false as a success value, or even a score if that's what you want.
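Something along these lines (the field names, example data, and `eval_fn` signature are just hypothetical, to illustrate the shape):

```python
def contains_expected(completion: str, expected: str) -> bool:
    # Only part of the completion needs to match; the rest can vary.
    return expected in completion

test_cases = [
    {
        'input': "In what year was the studio founded? Answer in a full sentence.",
        'expected': "1987",
        'eval_fn': contains_expected,  # could also return a 0..1 score instead of a bool
    },
]

def score(generate, system_prompt, test_cases):
    # Average the per-case results; bools coerce to 0/1.
    results = [
        case['eval_fn'](generate(system_prompt, case['input']), case['expected'])
        for case in test_cases
    ]
    return sum(results) / len(results)
```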