What's the evaluation config setting for the final evaluation ?
Victordongy opened this issue · 1 comment
I noticed that several parameters were important to the final result, such as extract_final_answer_by_prompting_again. I'm wondering whether you could post the final config settings used for evaluation, so the GSM8K results reported in the original paper can be reproduced?
Thanks for the question! The settings we used for each externally accessible model (text-bison and GPT models) can be found in the code for prompt optimization, and in opro/opro/evaluation/evaluate_instructions.py (around line 241 at commit 38af462) for evaluation.
The settings we used for internal models are:
- Pre-trained PaLM 2-L as scorer: temperature=0.0, max_decode_steps=256, extract_final_answer_by_prompting_again=True, old_instruction_score_threshold=0.01
- Pre-trained PaLM 2-L as optimizer: temperature=1.5, max_decode_steps=256
- PaLM 2-L-IT as optimizer: temperature=1.0, max_decode_steps=1024
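For reference, the settings above could be collected into plain Python dicts like the following. This is a minimal sketch: the variable names are hypothetical, and the actual flags live in opro's optimization and evaluation scripts.

```python
# Hypothetical config dicts mirroring the settings listed above;
# the real flags are command-line/script arguments in the opro repo.
palm2l_scorer_config = dict(
    temperature=0.0,
    max_decode_steps=256,
    extract_final_answer_by_prompting_again=True,
    old_instruction_score_threshold=0.01,
)
palm2l_optimizer_config = dict(
    temperature=1.5,  # higher temperature for diverse candidate instructions
    max_decode_steps=256,
)
palm2l_it_optimizer_config = dict(
    temperature=1.0,
    max_decode_steps=1024,  # instruction-tuned optimizer gets a longer budget
)
```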
Basically, extract_final_answer_by_prompting_again should be set to True when the scorer model is a pre-trained model. When it's an instruction-tuned model, setting it to either True or False should work, as long as the outputs follow similar patterns, like "So the answer is xxx".
Note that the exact numbers may not be reproducible because the PaLM 2-L model is Google-internal, and the externally callable text-bison model has been regularly updated. But you should at least be able to get similar numbers with text-bison scorer if you follow the settings linked above.