stanford-crfm/helm

Support few-shot chain-of-thought in GPQA / MMLU


Related: #3017 and #3018

GPQA and MMLU handle few-shot chain-of-thought in a similar way, so we should reuse common infrastructure across the two issues. We should coordinate to avoid duplicating work.

Scenario

  1. When constructing instances, set instance.extra_data["chain_of_thought"] to the chain of thought from the dataset instance (see the sketch below).
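
A minimal sketch of the scenario change, assuming Instance grows an extra_data dict as proposed above. The dataset field names (question, choices, answer_index, explanation) are hypothetical and depend on the actual GPQA / MMLU loaders:

```python
from helm.benchmark.scenarios.scenario import (
    CORRECT_TAG,
    TRAIN_SPLIT,
    Input,
    Instance,
    Output,
    Reference,
)


def row_to_instance(row: dict) -> Instance:
    # Tag the gold choice with CORRECT_TAG, as the existing scenarios do.
    references = [
        Reference(Output(text=choice), tags=[CORRECT_TAG] if i == row["answer_index"] else [])
        for i, choice in enumerate(row["choices"])
    ]
    return Instance(
        input=Input(text=row["question"]),
        references=references,
        split=TRAIN_SPLIT,
        # Step 1: carry the dataset's reasoning along with the instance.
        extra_data={"chain_of_thought": row["explanation"]},
    )
```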

Run spec function

  1. Update the run spec function to take a boolean parameter use_chain_of_thought
  2. Update the function to use the new adapter and metric if and only if use_chain_of_thought is true (see the sketch after this list).
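
A sketch of the run spec change (GPQA shown; MMLU would be analogous). Module paths are assumed from the current HELM layout; get_gpqa_spec, GPQAScenario, and the max_tokens values are illustrative, and the new method value and ChainOfThoughtMetric are defined in the Adapter and Metric sections below:

```python
from helm.benchmark.adaptation.adapter_spec import AdapterSpec
from helm.benchmark.adaptation.adapters.adapter_factory import ADAPT_MULTIPLE_CHOICE_JOINT
from helm.benchmark.metrics.common_metric_specs import get_exact_match_metric_specs
from helm.benchmark.metrics.metric import MetricSpec
from helm.benchmark.run_spec import RunSpec, run_spec_function
from helm.benchmark.scenarios.scenario import ScenarioSpec


@run_spec_function("gpqa")
def get_gpqa_spec(use_chain_of_thought: bool = False) -> RunSpec:
    scenario_spec = ScenarioSpec(
        class_name="helm.benchmark.scenarios.gpqa_scenario.GPQAScenario", args={}
    )

    # Use the chain-of-thought adapter and metric if and only if requested.
    if use_chain_of_thought:
        method = "multiple_choice_joint_chain_of_thought"  # new value, see Adapter section
        metric_specs = [
            MetricSpec(
                class_name="helm.benchmark.metrics.chain_of_thought_metrics.ChainOfThoughtMetric",
                args={},
            )
        ]
        max_tokens = 512  # leave room for the model's reasoning
    else:
        method = ADAPT_MULTIPLE_CHOICE_JOINT
        metric_specs = get_exact_match_metric_specs()
        max_tokens = 1

    adapter_spec = AdapterSpec(method=method, max_train_instances=5, max_tokens=max_tokens)
    return RunSpec(
        name=f"gpqa:use_chain_of_thought={use_chain_of_thought}",
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        metric_specs=metric_specs,
        groups=["gpqa"],
    )
```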

Adapter

  1. Create a subclass of MultipleChoiceJointAdapter that includes the chain of thought in the output.
  2. Add a new enum value (similar to ADAPT_MULTIPLE_CHOICE_JOINT) for that subclass and add the enum value and the subclass to AdapterFactory.
  3. Set the AdapterSpec to use this enum value when constructing the run spec in the run spec functions (see the sketch below).
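
A sketch of the adapter subclass. The construct_example_prompt override point and the splice-before-output_prefix trick are assumptions about MultipleChoiceJointAdapter's internals, to be confirmed against the actual implementation:

```python
from typing import Optional

from helm.benchmark.adaptation.adapters.multiple_choice_joint_adapter import (
    MultipleChoiceJointAdapter,
)
from helm.benchmark.scenarios.scenario import Instance

# New method value, to be defined next to ADAPT_MULTIPLE_CHOICE_JOINT.
ADAPT_MULTIPLE_CHOICE_JOINT_CHAIN_OF_THOUGHT: str = "multiple_choice_joint_chain_of_thought"


class MultipleChoiceJointChainOfThoughtAdapter(MultipleChoiceJointAdapter):
    """Like MultipleChoiceJointAdapter, but in-context examples also show the
    chain of thought stored in instance.extra_data["chain_of_thought"]."""

    def construct_example_prompt(
        self, instance: Instance, include_output: bool, reference_index: Optional[int]
    ) -> str:
        prompt = super().construct_example_prompt(instance, include_output, reference_index)
        if include_output and instance.extra_data and "chain_of_thought" in instance.extra_data:
            # Splice the reasoning in just before the answer line. Exactly where
            # it goes (and under what prefix) is a formatting decision to settle
            # in review; this naive replace assumes output_prefix appears only
            # once in the example.
            chain_of_thought: str = instance.extra_data["chain_of_thought"]
            prompt = prompt.replace(
                self.adapter_spec.output_prefix,
                f"{chain_of_thought}\n{self.adapter_spec.output_prefix}",
                1,
            )
        return prompt


# Registration in AdapterFactory.get_adapter() would mirror the existing branch:
#   elif method == ADAPT_MULTIPLE_CHOICE_JOINT_CHAIN_OF_THOUGHT:
#       adapter = MultipleChoiceJointChainOfThoughtAdapter(adapter_spec, tokenizer_service)
```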

Metric

  1. Create a new subclass of Metric in a new file chain_of_thought_metrics.py
  2. Override evaluate_generation() to parse the model-generated output for an answer like (A) and output a Stat named chain_of_thought_exact_match (see the sketch below).
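
A sketch of the metric, following HELM's Metric.evaluate_generation signature; the extraction regex and the output_mapping lookup are reasonable first choices rather than a settled design:

```python
# chain_of_thought_metrics.py
import re
from typing import List

from helm.benchmark.adaptation.adapter_spec import AdapterSpec
from helm.benchmark.adaptation.request_state import RequestState
from helm.benchmark.metrics.metric import Metric
from helm.benchmark.metrics.metric_name import MetricName
from helm.benchmark.metrics.metric_service import MetricService
from helm.benchmark.metrics.statistic import Stat
from helm.benchmark.scenarios.scenario import CORRECT_TAG


class ChainOfThoughtMetric(Metric):
    """Extracts the final answer letter, e.g. "(A)", from a chain-of-thought
    generation and compares it against the gold reference."""

    def evaluate_generation(
        self,
        adapter_spec: AdapterSpec,
        request_state: RequestState,
        metric_service: MetricService,
        eval_cache_path: str,
    ) -> List[Stat]:
        assert request_state.result is not None
        output_text: str = request_state.result.completions[0].text

        # Take the last "(X)" in the generation as the model's answer; the
        # exact extraction pattern is a design choice to settle in review.
        matches = re.findall(r"\(([A-Z])\)", output_text)
        predicted_letter = matches[-1] if matches else None

        # Recover the gold letter by matching the CORRECT_TAG reference text
        # against the letter-to-text output_mapping built by the adapter.
        gold_text = next(
            reference.output.text
            for reference in request_state.instance.references
            if CORRECT_TAG in reference.tags
        )
        assert request_state.output_mapping is not None
        gold_letter = next(
            letter
            for letter, text in request_state.output_mapping.items()
            if text == gold_text
        )

        score = 1 if predicted_letter == gold_letter else 0
        return [Stat(MetricName("chain_of_thought_exact_match")).add(score)]
```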