Code for the paper "Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters" by Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer and Huan Sun.
```
.
├── grade-school-math/                     # GSM8K dataset, from https://github.com/openai/grade-school-math
├── indices_800.json                       # Indices for the 800 GSM8K test examples used for evaluation
├── Bamboogle Prerelease - Sheet1.csv      # Bamboogle dataset, from https://github.com/ofirpress/self-ask
├── Bamboogle Prerelease - Sheet1_inter.csv # Annotated intermediate bridging entities for Bamboogle
├── utils.py                               # Helper functions
├── prompts_*/                             # Full prompts for all settings in our experiments
├── main_*.py                              # Scripts for getting model predictions via the OpenAI API
├── eval_*.ipynb                           # Evaluation scripts, including cached evaluation results
└── result_*/                              # Cached model prediction results
```
First, put your OpenAI API key in a file named `api_key.txt`.
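The scripts presumably read the key from this file before calling the API; a minimal sketch of how that might look (the helper name `load_api_key` is illustrative, not from the repo):

```python
from pathlib import Path

def load_api_key(path: str = "api_key.txt") -> str:
    """Read the OpenAI API key from a local file, stripping whitespace.

    Hypothetical helper -- the repo's scripts may load the key differently.
    """
    key = Path(path).read_text().strip()
    if not key:
        raise ValueError(f"No API key found in {path}")
    return key
```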
Details can be found in the parameter descriptions in `main_*.py`. For example, to run the invalid reasoning setting on GSM8K and Bamboogle:
```
python main_gsm8k.py --prompt_dir prompts_arithmetic/invalid_reasoning.txt --eng text-davinci-002 --num_test 800 --seed 1357 --temp 0.0 --test_ind indices_800.json
python main_bamboogle.py --prompt_dir prompts_bamboogle/invalid_reasoning.txt --eng text-davinci-002 --num_test -1 --seed 1357 --temp 0.0
```
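The `--test_ind` and `--num_test` flags select which examples are evaluated (here, the 800 GSM8K indices in `indices_800.json`, and `-1` for all of Bamboogle). A sketch of one plausible way these flags could interact (the function `select_test_examples` and its exact semantics are assumptions, not the repo's actual code):

```python
import json
from typing import List, Optional

def select_test_examples(examples: List, num_test: int = -1,
                         test_ind: Optional[str] = None) -> List:
    """Pick the evaluation subset: an explicit index file if given,
    otherwise the first num_test examples (-1 means use all).

    Hypothetical reimplementation of the CLI flag behavior.
    """
    if test_ind is not None:
        with open(test_ind) as f:
            indices = json.load(f)
        examples = [examples[i] for i in indices]
    if num_test >= 0:
        examples = examples[:num_test]
    return examples
```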
`eval_*.ipynb` contains the evaluation scripts and cached evaluation results.
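For GSM8K, gold answers end with `#### <number>`, and a common evaluation heuristic compares that target against the last number in the model's generated answer. A sketch of such a check (the function names are illustrative; the notebooks may implement this differently):

```python
import re

def extract_final_number(text: str):
    """Return the last number in a generated answer (commas stripped), or None."""
    nums = re.findall(r"-?\d[\d]*\.?\d*", text.replace(",", ""))
    return nums[-1] if nums else None

def is_correct(prediction: str, gold: str) -> bool:
    """Compare the model's final number against the gold target after '####'."""
    pred = extract_final_number(prediction)
    return pred is not None and float(pred) == float(gold)
```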