math-evals

We will benchmark several prompting techniques on mathematical datasets such as GSM8K and MATH.
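
As a rough illustration of what such a benchmark involves, here is a minimal sketch that scores prompt templates on GSM8K-style examples. Everything in it is an assumption made for illustration and is not taken from this repository: the template names and wording, the helper functions, and the `model` callable (any function that maps a prompt string to a completion string). It also assumes the public GSM8K format, where each example has a `question` field and an `answer` field whose reference solution ends with `#### <number>`.

```python
import re
from typing import Callable, Dict, List

# Hypothetical prompt templates; the prompts used in this project may differ.
PROMPTS: Dict[str, str] = {
    "direct": "Question: {question}\nAnswer with a single number.",
    "chain_of_thought": (
        "Question: {question}\n"
        "Let's think step by step, then give the final number."
    ),
}

# Matches integers and decimals, optionally negative, with thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def gold_answer(solution: str) -> str:
    """Return the text after the final '####' marker, with commas removed."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(completion: str) -> str:
    """Heuristic: treat the last number in the completion as the prediction."""
    matches = _NUMBER.findall(completion)
    return matches[-1].replace(",", "") if matches else ""


def is_correct(pred: str, gold: str) -> bool:
    """Compare numerically so that '18.0' and '18' count as equal."""
    try:
        return abs(float(pred) - float(gold)) < 1e-6
    except ValueError:
        # Missing or non-numeric answers are scored as wrong.
        return False


def accuracy(model: Callable[[str], str], template: str, examples: List[dict]) -> float:
    """Fraction of examples whose predicted answer matches the reference."""
    correct = 0
    for ex in examples:
        completion = model(template.format(question=ex["question"]))
        correct += is_correct(predicted_answer(completion), gold_answer(ex["answer"]))
    return correct / len(examples)


if __name__ == "__main__":
    # Stub model and toy data, for demonstration only.
    toy = [{"question": "What is 2 + 3?", "answer": "2 + 3 = 5\n#### 5"}]
    for name, template in PROMPTS.items():
        print(name, accuracy(lambda prompt: "The answer is 5.", template, toy))
```

Taking the last number in the completion is a simple extraction heuristic. It can misfire when the model keeps talking after stating its answer, so a real harness may want a stricter output format.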

The ideas are all human-generated, and the code is mostly AI-generated (Cursor + claude-3.5-sonnet).

shreyshahi/math-evals