google-deepmind/opro

What are the config settings for the final evaluation?

Victordongy opened this issue · 1 comment

I noticed that several parameters are important to the final result, such as extract_final_answer_by_prompting_again. Could you post the final config settings used for the evaluation, so that the GSM8K results reported in the original paper are reproducible?

Thanks for the question! For each externally accessible model (text-bison and the GPT models), the settings we used can be found around the

if scorer_llm_name == "text-bison":

branch in the prompt optimization script, and around the same branch in the prompt/instruction evaluation script.
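
Schematically, the per-model settings in both scripts sit in branches of that shape (a sketch only; the GPT model names are illustrative, and the actual values live at the linked lines rather than here):

```python
# Schematic of the per-model config branches in both scripts.
# `scorer_llm_name` is the variable from the repo; the branch bodies
# are elided because the real values are at the linked code lines.
if scorer_llm_name == "text-bison":
    ...  # text-bison scorer settings (temperature, max_decode_steps, etc.)
elif scorer_llm_name in {"gpt-3.5-turbo", "gpt-4"}:
    ...  # GPT scorer settings
```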

The settings we used for internal models are:

  1. Pre-trained PaLM 2-L as scorer:

temperature=0.0, max_decode_steps=256, extract_final_answer_by_prompting_again=True, old_instruction_score_threshold=0.01

  2. Pre-trained PaLM 2-L as optimizer:

temperature=1.5, max_decode_steps=256

  3. PaLM 2-L-IT as optimizer:

temperature=1.0, max_decode_steps=1024
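
For convenience, here are those internal-model settings gathered into plain Python dicts (the dict layout is just for illustration; the numbers are exactly the ones listed above, with keys mirroring the parameter names):

```python
# The internal-model settings from the list above, as plain dicts.
PALM_2_L_SCORER = dict(
    temperature=0.0,
    max_decode_steps=256,
    extract_final_answer_by_prompting_again=True,
    old_instruction_score_threshold=0.01,
)
PALM_2_L_OPTIMIZER = dict(temperature=1.5, max_decode_steps=256)
PALM_2_L_IT_OPTIMIZER = dict(temperature=1.0, max_decode_steps=1024)
```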

Basically, extract_final_answer_by_prompting_again should be set to True when the scorer is a pre-trained model. When the scorer is an instruction-tuned model, setting it to either True or False should work, as long as the outputs follow a consistent pattern such as "So the answer is xxx".
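
To make the flag's effect concrete, here is a minimal sketch of the two-step extraction it enables; `call_scorer` and the exact prompt wording are hypothetical stand-ins, not the repo's actual helpers:

```python
def answer_with_optional_extraction(call_scorer, instruction, question,
                                    extract_final_answer_by_prompting_again):
    """Sketch of scoring with an optional second-pass answer extraction."""
    # First pass: the scorer reasons about the question under the instruction.
    raw_output = call_scorer(
        f"{instruction}\nQ: {question}\nA:",
        temperature=0.0, max_decode_steps=256,
    )
    if not extract_final_answer_by_prompting_again:
        # Instruction-tuned scorers typically end with a parseable pattern
        # like "So the answer is xxx", so the raw output is enough.
        return raw_output
    # Second pass: feed the reasoning back and prompt once more so a
    # pre-trained scorer emits just the final answer.
    return call_scorer(
        f"{instruction}\nQ: {question}\nA: {raw_output}\nSo the answer is",
        temperature=0.0, max_decode_steps=32,
    )
```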

Note that the exact numbers may not be reproducible: the PaLM 2-L model is Google-internal, and the externally callable text-bison model has been updated regularly. That said, you should be able to get similar numbers with the text-bison scorer if you follow the settings linked above.