google-deepmind/opro

What are the config settings for the final evaluation?

Victordongy opened this issue · 1 comment

I noticed that several parameters are important to the final result, such as extract_final_answer_by_prompting_again. Could you post the final config settings used for the evaluation, so that the GSM8K results reported in the original paper are reproducible?

Thanks for the question! For each externally accessible model (text-bison and the GPT models), the settings we used can be found around the

if scorer_llm_name == "text-bison":

branch in the prompt optimization script, and around the same branch in the prompt/instruction evaluation script.
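
Schematically, the per-model settings in both scripts sit in branches of that shape (a sketch only; the GPT model names are illustrative, and the actual values live at the linked lines rather than here):

```python
# Schematic of the per-model config branches in both scripts.
# `scorer_llm_name` is the variable from the repo; the branch bodies
# are elided because the real values are at the linked code lines.
if scorer_llm_name == "text-bison":
    ...  # text-bison scorer settings (temperature, max_decode_steps, etc.)
elif scorer_llm_name in {"gpt-3.5-turbo", "gpt-4"}:
    ...  # GPT scorer settings
```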

The settings we used for internal models are:

  1. Pre-trained PaLM 2-L as scorer:

temperature=0.0, max_decode_steps=256, extract_final_answer_by_prompting_again=True, old_instruction_score_threshold=0.01

  2. Pre-trained PaLM 2-L as optimizer:

temperature=1.5, max_decode_steps=256

  3. PaLM 2-L-IT as optimizer:

temperature=1.0, max_decode_steps=1024
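
For convenience, here are those internal-model settings gathered into plain Python dicts (the dict layout is just for illustration; the numbers are exactly the ones listed above, with keys mirroring the parameter names):

```python
# The internal-model settings from the list above, as plain dicts.
PALM_2_L_SCORER = dict(
    temperature=0.0,
    max_decode_steps=256,
    extract_final_answer_by_prompting_again=True,
    old_instruction_score_threshold=0.01,
)
PALM_2_L_OPTIMIZER = dict(temperature=1.5, max_decode_steps=256)
PALM_2_L_IT_OPTIMIZER = dict(temperature=1.0, max_decode_steps=1024)
```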

Basically, extract_final_answer_by_prompting_again should be set to True when the scorer is a pre-trained model. When the scorer is an instruction-tuned model, setting it to either True or False should work, as long as the outputs follow a consistent pattern such as "So the answer is xxx".
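
To make the flag's effect concrete, here is a minimal sketch of the two-step extraction it enables; `call_scorer` and the exact prompt wording are hypothetical stand-ins, not the repo's actual helpers:

```python
def answer_with_optional_extraction(call_scorer, instruction, question,
                                    extract_final_answer_by_prompting_again):
    """Sketch of scoring with an optional second-pass answer extraction."""
    # First pass: the scorer reasons about the question under the instruction.
    raw_output = call_scorer(
        f"{instruction}\nQ: {question}\nA:",
        temperature=0.0, max_decode_steps=256,
    )
    if not extract_final_answer_by_prompting_again:
        # Instruction-tuned scorers typically end with a parseable pattern
        # like "So the answer is xxx", so the raw output is enough.
        return raw_output
    # Second pass: feed the reasoning back and prompt once more so a
    # pre-trained scorer emits just the final answer.
    return call_scorer(
        f"{instruction}\nQ: {question}\nA: {raw_output}\nSo the answer is",
        temperature=0.0, max_decode_steps=32,
    )
```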

Note that the exact numbers may not be reproducible: the PaLM 2-L model is Google-internal, and the externally callable text-bison model has been updated regularly. That said, you should be able to get similar numbers with the text-bison scorer if you follow the settings linked above.