stanford-crfm/helm

Support few-shot chain-of-thought in GPQA / MMLU


Related: #3017 and #3018

GPQA and MMLU handle few-shot chain-of-thought in a similar way, so we should reuse common infrastructure across the two issues. We should coordinate to avoid duplicating work.

Scenario

  1. When constructing instances, set instance.extra_data["chain_of_thought"] to the chain of thought from the dataset instance (see the sketch below).
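
A minimal sketch of the scenario change, assuming Instance grows an extra_data dict as proposed above. The dataset field names (question, choices, answer_index, explanation) are hypothetical and depend on the actual GPQA / MMLU loaders:

```python
from helm.benchmark.scenarios.scenario import (
    CORRECT_TAG,
    TRAIN_SPLIT,
    Input,
    Instance,
    Output,
    Reference,
)


def row_to_instance(row: dict) -> Instance:
    # Tag the gold choice with CORRECT_TAG, as the existing scenarios do.
    references = [
        Reference(Output(text=choice), tags=[CORRECT_TAG] if i == row["answer_index"] else [])
        for i, choice in enumerate(row["choices"])
    ]
    return Instance(
        input=Input(text=row["question"]),
        references=references,
        split=TRAIN_SPLIT,
        # Step 1: carry the dataset's reasoning along with the instance.
        extra_data={"chain_of_thought": row["explanation"]},
    )
```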

Run spec function

  1. Update the run spec function to take a boolean parameter use_chain_of_thought
  2. Update the function to use the new adapter and metric if and only if use_chain_of_thought is true (see the sketch after this list).
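
A sketch of the run spec change (GPQA shown; MMLU would be analogous). Module paths are assumed from the current HELM layout; get_gpqa_spec, GPQAScenario, and the max_tokens values are illustrative, and the new method value and ChainOfThoughtMetric are defined in the Adapter and Metric sections below:

```python
from helm.benchmark.adaptation.adapter_spec import AdapterSpec
from helm.benchmark.adaptation.adapters.adapter_factory import ADAPT_MULTIPLE_CHOICE_JOINT
from helm.benchmark.metrics.common_metric_specs import get_exact_match_metric_specs
from helm.benchmark.metrics.metric import MetricSpec
from helm.benchmark.run_spec import RunSpec, run_spec_function
from helm.benchmark.scenarios.scenario import ScenarioSpec


@run_spec_function("gpqa")
def get_gpqa_spec(use_chain_of_thought: bool = False) -> RunSpec:
    scenario_spec = ScenarioSpec(
        class_name="helm.benchmark.scenarios.gpqa_scenario.GPQAScenario", args={}
    )

    # Use the chain-of-thought adapter and metric if and only if requested.
    if use_chain_of_thought:
        method = "multiple_choice_joint_chain_of_thought"  # new value, see Adapter section
        metric_specs = [
            MetricSpec(
                class_name="helm.benchmark.metrics.chain_of_thought_metrics.ChainOfThoughtMetric",
                args={},
            )
        ]
        max_tokens = 512  # leave room for the model's reasoning
    else:
        method = ADAPT_MULTIPLE_CHOICE_JOINT
        metric_specs = get_exact_match_metric_specs()
        max_tokens = 1

    adapter_spec = AdapterSpec(method=method, max_train_instances=5, max_tokens=max_tokens)
    return RunSpec(
        name=f"gpqa:use_chain_of_thought={use_chain_of_thought}",
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        metric_specs=metric_specs,
        groups=["gpqa"],
    )
```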

Adapter

  1. Create a subclass of MultipleChoiceJointAdapter that includes the chain of thought in the output.
  2. Add a new enum value (similar to ADAPT_MULTIPLE_CHOICE_JOINT) for that subclass and add the enum value and the subclass to AdapterFactory.
  3. Set the AdapterSpec to use this enum value when constructing the run spec in the run spec functions (see the sketch below).
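
A sketch of the adapter subclass. The construct_example_prompt override point and the splice-before-output_prefix trick are assumptions about MultipleChoiceJointAdapter's internals, to be confirmed against the actual implementation:

```python
from typing import Optional

from helm.benchmark.adaptation.adapters.multiple_choice_joint_adapter import (
    MultipleChoiceJointAdapter,
)
from helm.benchmark.scenarios.scenario import Instance

# New method value, to be defined next to ADAPT_MULTIPLE_CHOICE_JOINT.
ADAPT_MULTIPLE_CHOICE_JOINT_CHAIN_OF_THOUGHT: str = "multiple_choice_joint_chain_of_thought"


class MultipleChoiceJointChainOfThoughtAdapter(MultipleChoiceJointAdapter):
    """Like MultipleChoiceJointAdapter, but in-context examples also show the
    chain of thought stored in instance.extra_data["chain_of_thought"]."""

    def construct_example_prompt(
        self, instance: Instance, include_output: bool, reference_index: Optional[int]
    ) -> str:
        prompt = super().construct_example_prompt(instance, include_output, reference_index)
        if include_output and instance.extra_data and "chain_of_thought" in instance.extra_data:
            # Splice the reasoning in just before the answer line. Exactly where
            # it goes (and under what prefix) is a formatting decision to settle
            # in review; this naive replace assumes output_prefix appears only
            # once in the example.
            chain_of_thought: str = instance.extra_data["chain_of_thought"]
            prompt = prompt.replace(
                self.adapter_spec.output_prefix,
                f"{chain_of_thought}\n{self.adapter_spec.output_prefix}",
                1,
            )
        return prompt


# Registration in AdapterFactory.get_adapter() would mirror the existing branch:
#   elif method == ADAPT_MULTIPLE_CHOICE_JOINT_CHAIN_OF_THOUGHT:
#       adapter = MultipleChoiceJointChainOfThoughtAdapter(adapter_spec, tokenizer_service)
```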

Metric

  1. Create a new subclass of Metric in a new file chain_of_thought_metrics.py
  2. Override evaluate_generation() to parse the model-generated output for an answer like (A) and output a Stat named chain_of_thought_exact_match (see the sketch below).
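
A sketch of the metric, following HELM's Metric.evaluate_generation signature; the extraction regex and the output_mapping lookup are reasonable first choices rather than a settled design:

```python
# chain_of_thought_metrics.py
import re
from typing import List

from helm.benchmark.adaptation.adapter_spec import AdapterSpec
from helm.benchmark.adaptation.request_state import RequestState
from helm.benchmark.metrics.metric import Metric
from helm.benchmark.metrics.metric_name import MetricName
from helm.benchmark.metrics.metric_service import MetricService
from helm.benchmark.metrics.statistic import Stat
from helm.benchmark.scenarios.scenario import CORRECT_TAG


class ChainOfThoughtMetric(Metric):
    """Extracts the final answer letter, e.g. "(A)", from a chain-of-thought
    generation and compares it against the gold reference."""

    def evaluate_generation(
        self,
        adapter_spec: AdapterSpec,
        request_state: RequestState,
        metric_service: MetricService,
        eval_cache_path: str,
    ) -> List[Stat]:
        assert request_state.result is not None
        output_text: str = request_state.result.completions[0].text

        # Take the last "(X)" in the generation as the model's answer; the
        # exact extraction pattern is a design choice to settle in review.
        matches = re.findall(r"\(([A-Z])\)", output_text)
        predicted_letter = matches[-1] if matches else None

        # Recover the gold letter by matching the CORRECT_TAG reference text
        # against the letter-to-text output_mapping built by the adapter.
        gold_text = next(
            reference.output.text
            for reference in request_state.instance.references
            if CORRECT_TAG in reference.tags
        )
        assert request_state.output_mapping is not None
        gold_letter = next(
            letter
            for letter, text in request_state.output_mapping.items()
            if text == gold_text
        )

        score = 1 if predicted_letter == gold_letter else 0
        return [Stat(MetricName("chain_of_thought_exact_match")).add(score)]
```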