math-evals

We will benchmark several prompting techniques on mathematical datasets such as GSM8K and MATH. The ideas are all human-generated; the code is mostly AI-generated (Cursor + claude-3.5-sonnet).

Lab notebook / observations

TODO