Support few-shot chain-of-thought in GPQA / MMLU
yifanmai commented
GPQA and MMLU do few-shot chain-of-thought in a similar way, so we should reuse the common infrastructure. We should coordinate to avoid duplicating work.
Scenario
- When constructing instances, set `instance.extra_data["chain_of_thought"]` to the chain of thought in the dataset instance.
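A minimal sketch of the scenario step, not the actual implementation. `Instance` here is a simplified stand-in dataclass rather than the real class, and the row keys `question`, `choices`, and `explanation` are hypothetical names for whatever the dataset actually provides.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Instance:
    """Simplified stand-in for the real Instance class (assumption)."""

    input: str
    references: List[str]
    extra_data: Optional[Dict[str, Any]] = None


def make_instance(row: Dict[str, Any]) -> Instance:
    """Build an instance, attaching the chain of thought when the row has one."""
    chain_of_thought = row.get("explanation")  # hypothetical dataset field
    extra_data = {"chain_of_thought": chain_of_thought} if chain_of_thought else None
    return Instance(
        input=row["question"],
        references=list(row["choices"]),
        extra_data=extra_data,
    )
```

Leaving `extra_data` as `None` when the dataset row has no explanation keeps instances without chain of thought unchanged.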
Run spec function
- Update the run spec function to take a boolean parameter `use_chain_of_thought`.
- Update the function to use the new adapter and metric if and only if `use_chain_of_thought` is true.
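The switch could look roughly like the sketch below. `RunSpecSketch` and `get_run_spec_sketch` are hypothetical stand-ins, not the real run spec types. The string values of both constants are assumptions, and the name `ADAPT_MULTIPLE_CHOICE_JOINT_CHAIN_OF_THOUGHT` and the metric name `exact_match` are also assumed.

```python
from dataclasses import dataclass
from typing import List

ADAPT_MULTIPLE_CHOICE_JOINT = "multiple_choice_joint"
# Hypothetical name and value for the new enum entry.
ADAPT_MULTIPLE_CHOICE_JOINT_CHAIN_OF_THOUGHT = "multiple_choice_joint_chain_of_thought"


@dataclass(frozen=True)
class RunSpecSketch:
    """Simplified stand-in for a run spec (assumption)."""

    name: str
    adapter_method: str
    metric_names: List[str]


def get_run_spec_sketch(scenario: str, use_chain_of_thought: bool = False) -> RunSpecSketch:
    """Pick the new adapter and metric if and only if chain of thought is requested."""
    if use_chain_of_thought:
        adapter_method = ADAPT_MULTIPLE_CHOICE_JOINT_CHAIN_OF_THOUGHT
        metric_names = ["chain_of_thought_exact_match"]
    else:
        adapter_method = ADAPT_MULTIPLE_CHOICE_JOINT
        metric_names = ["exact_match"]  # assumed default metric name
    return RunSpecSketch(
        name=f"{scenario}:use_chain_of_thought={use_chain_of_thought}",
        adapter_method=adapter_method,
        metric_names=metric_names,
    )
```

Defaulting the parameter to `False` would keep existing run entries behaving as before.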
Adapter
- Create a subclass of `MultipleChoiceJointAdapter` that includes the chain of thought in the output.
- Add a new enum value (similar to `ADAPT_MULTIPLE_CHOICE_JOINT`) for that subclass, and add the enum value and the subclass to `AdapterFactory`.
- Set the `AdapterSpec` to use this enum value when constructing the run spec in the run spec functions.
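As a rough illustration of the adapter steps, the sketch below uses stand-in classes instead of the real `MultipleChoiceJointAdapter` and `AdapterFactory`. All class names, method names, the prompt wording, and the method strings are assumptions made for this example.

```python
from typing import Dict, List, Optional, Type


class JointAdapterSketch:
    """Stand-in for the joint multiple-choice adapter (assumption)."""

    def format_output(self, label: str, chain_of_thought: Optional[str]) -> str:
        return f"Answer: {label}"

    def format_example(
        self,
        question: str,
        choices: List[str],
        correct_index: int,
        chain_of_thought: Optional[str] = None,
    ) -> str:
        lines = [f"Question: {question}"]
        for index, choice in enumerate(choices):
            letter = chr(ord("A") + index)
            lines.append(f"({letter}) {choice}")
        correct_letter = chr(ord("A") + correct_index)
        lines.append(self.format_output(correct_letter, chain_of_thought))
        return "\n".join(lines)


class ChainOfThoughtJointAdapterSketch(JointAdapterSketch):
    """Subclass that puts the chain of thought before the final answer."""

    def format_output(self, label: str, chain_of_thought: Optional[str]) -> str:
        if not chain_of_thought:
            return super().format_output(label, chain_of_thought)
        return f"Explanation: {chain_of_thought}\nThe correct answer is ({label})."


# Stand-in for the factory lookup; the second key is a hypothetical enum value.
ADAPTERS: Dict[str, Type[JointAdapterSketch]] = {
    "multiple_choice_joint": JointAdapterSketch,
    "multiple_choice_joint_chain_of_thought": ChainOfThoughtJointAdapterSketch,
}


def create_adapter(method: str) -> JointAdapterSketch:
    return ADAPTERS[method]()
```

Overriding only the output formatting keeps the question and choice layout shared between the two adapters.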
Metric
- Create a new subclass of `Metric` in a new file, `chain_of_thought_metrics.py`.
- Override `evaluate_generation()` to parse the model-generated output to look for something like `(A)`, and output a `Stat` named `chain_of_thought_exact_match`.
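The parsing step could be sketched as follows. `StatSketch` is a stand-in for the real `Stat` class, the helper names are hypothetical, and two choices here are assumptions rather than requirements from this issue: the answer letters are limited to `A` through `J`, and the last parenthesized letter in the output is treated as the final answer.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches a parenthesized answer letter such as "(A)"; the A-J range is an assumption.
ANSWER_PATTERN = re.compile(r"\(([A-J])\)")


@dataclass
class StatSketch:
    """Simplified stand-in for the real Stat class (assumption)."""

    name: str
    value: float


def extract_answer_letter(generated_text: str) -> Optional[str]:
    """Return the last parenthesized letter in the output, or None if absent."""
    matches = ANSWER_PATTERN.findall(generated_text)
    return matches[-1] if matches else None


def score_generation(generated_text: str, correct_letter: str) -> StatSketch:
    """Score 1.0 when the extracted letter equals the correct one, else 0.0."""
    predicted = extract_answer_letter(generated_text)
    value = 1.0 if predicted == correct_letter else 0.0
    return StatSketch(name="chain_of_thought_exact_match", value=value)
```

Taking the last match rather than the first is meant to skip letters mentioned while reasoning through the options. An output with no parenthesized letter scores 0.0.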