Primary language: Python. License: MIT.

math-evals

We will benchmark several prompting techniques on mathematical datasets such as GSM8K and MATH.
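As a rough illustration of what such a benchmark could look like, here is a minimal sketch that scores a prompting technique on GSM8K-style examples. It is not this repo's code. It assumes each example is a dict with `question` and `answer` keys and that the reference answer ends with `#### <number>`, as in the published GSM8K data. The names `zero_shot`, `chain_of_thought`, `model_fn`, and `accuracy` are hypothetical, and taking the last number in the completion as the prediction is a simplifying heuristic.

```python
import re
from typing import Callable, Optional

# Matches integers and decimals, optionally negative, with thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_gold(reference: str) -> str:
    """Return the final answer from a GSM8K-style reference ('... #### 42')."""
    return reference.split("####")[-1].strip().replace(",", "")


def extract_prediction(completion: str) -> Optional[str]:
    """Heuristic (an assumption): treat the last number in the completion as the answer."""
    matches = _NUMBER.findall(completion)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def is_correct(completion: str, reference: str) -> bool:
    """Compare predicted and reference answers numerically."""
    pred = extract_prediction(completion)
    if pred is None:
        return False
    try:
        return abs(float(pred) - float(extract_gold(reference))) < 1e-6
    except ValueError:
        return False


# Two hypothetical prompting techniques to compare.
def zero_shot(question: str) -> str:
    return f"Question: {question}\nAnswer:"


def chain_of_thought(question: str) -> str:
    return f"Question: {question}\nLet's think step by step."


def accuracy(
    examples: list,
    prompt_fn: Callable[[str], str],
    model_fn: Callable[[str], str],
) -> float:
    """Fraction of examples answered correctly under one prompting technique.

    model_fn is a placeholder for whatever LLM call is used; it maps a prompt
    string to a completion string.
    """
    if not examples:
        return 0.0
    correct = sum(
        is_correct(model_fn(prompt_fn(ex["question"])), ex["answer"])
        for ex in examples
    )
    return correct / len(examples)
```

To compare techniques, one would call `accuracy` once per prompt function with the same examples and the same `model_fn`, for example `accuracy(examples, chain_of_thought, model_fn)`.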

The ideas are all human-generated; the code is mostly AI-generated (Cursor + claude-3.5-sonnet).

Lab notebook / observations

TODO