Curation of resources for LLM mathematical reasoning.
📢 If you have any suggestions, please don't hesitate to let us know. You can
- E-mail Yuxuan Tong directly,
- comment under the Twitter thread,
- or post an issue in the GitHub repository.
The following resources are listed in (roughly) chronological order of publication.
- Llemma & Proof-Pile-2: Open-sourced re-implementation of Minerva.
- Open-sourced corpus Proof-Pile-2 comprising 51.9B tokens (by DeepSeek tokenizer).
- Continually pre-trained based on CodeLLaMAs.
- OpenWebMath:
- 13.6B tokens (by DeepSeek tokenizer).
- Used by Rho-1 to achieve performance comparable with DeepSeekMath.
- MathPile:
- 8.9B tokens (by DeepSeek tokenizer).
- Mainly comprising arXiv papers.
- Shown to be ineffective (on 7B models) by DeepSeekMath.
- DeepSeekMath: Open-sourced SotA (as of 2024-04-18).
- Continually pre-trained based on DeepSeek-LLMs and DeepSeek-Coder-7B.
- Rho-1: Selecting training tokens based on loss/perplexity, achieving performance comparable with DeepSeekMath while training only on the 15B-token OpenWebMath corpus (see the token-selection sketch after this list).
- RFT: SFT on rejection-sampled model outputs is effective (see the rejection-sampling sketch after this list).
- MetaMath: Constructing problems with known ground-truth answers (but not necessarily feasible ones) via self-verification.
- Augmenting with GPT-3.5-Turbo.
- AugGSM8k: Common data augmentation on GSM8k helps little in generalization to MATH.
- MathScale: Scaling synthetic data to ~2M samples using GPT-3.5-Turbo with a knowledge graph.
- KPMath: Scaling synthetic data to 1.576M samples using GPT-4-Turbo with a knowledge graph.
- XWin-Math: Simply scaling synthetic data to 480k MATH + 960k GSM8k samples using GPT-4-Turbo.
- MAmmoTH: SFT on mixed CoT & PoT data is effective.
- ToRA & MARIO: The first open-sourced works verifying the effectiveness of SFT for tool-integrated reasoning.
- OpenMathInstruct-1: Scaling synthetic data to 1.8M samples using Mixtral-8x7B.
- Math-Shepherd: Constructing step-correctness labels based on an MCTS-like method (see the step-labeling sketch after this list).
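
For readers unfamiliar with the techniques above, here are a few minimal, hedged sketches. First, the loss/perplexity-based token selection behind Rho-1 (selective language modeling): only tokens whose loss under the training model is high relative to a reference model contribute to the training objective. This is an illustrative sketch under our own assumptions, not code from the Rho-1 release; names, the `keep_ratio` parameter, and the exact scoring rule are simplifications.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(train_logits, ref_logits, labels, keep_ratio=0.6):
    """Illustrative selective-LM loss: train only on tokens with high "excess loss".

    train_logits, ref_logits: (batch, seq, vocab) from the training and reference
    models; labels: (batch, seq) next-token targets, assumed already aligned
    (shifted) to the logits.
    """
    # Per-token cross-entropy under both models.
    ce_train = F.cross_entropy(
        train_logits.flatten(0, 1), labels.flatten(), reduction="none"
    ).view_as(labels).float()
    ce_ref = F.cross_entropy(
        ref_logits.flatten(0, 1), labels.flatten(), reduction="none"
    ).view_as(labels).float()

    # Excess loss: tokens the training model still finds hard but a clean
    # reference model explains well are treated as the useful ones.
    excess = ce_train - ce_ref

    # Keep only the top-k tokens per sequence; mask out the rest.
    k = max(1, int(keep_ratio * labels.size(1)))
    threshold = excess.topk(k, dim=1).values[:, -1:]
    mask = (excess >= threshold).float()

    return (ce_train * mask).sum() / mask.sum().clamp(min=1.0)
```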
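Second, the core loop of rejection-sampling fine-tuning (RFT): sample several candidate solutions per training problem, keep only those whose final answer matches the ground truth (deduplicated here by exact string, a simplification), and run ordinary SFT on the survivors. The `generate` and `extract_answer` helpers below are assumed, not part of any specific library.

```python
def rejection_sample_sft_data(problems, generate, extract_answer, k=8):
    """Build an SFT set from self-generated solutions whose final answer is correct.

    `problems`: list of dicts with "question" and "answer" fields.
    `generate(question, n)`: samples n candidate solutions from the model (assumed).
    `extract_answer(solution)`: parses the final answer from a solution (assumed).
    """
    sft_examples = []
    for prob in problems:
        kept = set()  # deduplicate identical reasoning paths
        for solution in generate(prob["question"], n=k):
            # Rejection step: keep a sample only if its final answer is correct.
            if extract_answer(solution) == prob["answer"] and solution not in kept:
                kept.add(solution)
                sft_examples.append(
                    {"prompt": prob["question"], "completion": solution}
                )
    return sft_examples
```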
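Third, Math-Shepherd-style step labels can be sketched as a Monte-Carlo completion check: for each prefix of a solution, sample several continuations and mark the step positive if some rollout still reaches the gold answer (the paper also describes a soft, fraction-based variant). This reuses the same assumed `generate` and `extract_answer` helpers as above and is a simplification of the actual pipeline.

```python
def label_solution_steps(question, steps, answer, generate, extract_answer, n=8):
    """Assign a correctness label to each reasoning step via Monte-Carlo rollouts.

    `steps`: list of reasoning-step strings. For every prefix steps[:i+1],
    sample `n` continuations; the step is labeled 1 ("the correct answer is
    still reachable from here") if any rollout ends in `answer`.
    """
    labels = []
    for i in range(len(steps)):
        prefix = question + "\n" + "\n".join(steps[: i + 1])
        rollouts = generate(prefix, n=n)
        # Hard estimation: 1 if any completion recovers the gold answer.
        labels.append(int(any(extract_answer(r) == answer for r in rollouts)))
    return labels
```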
Here we focus on several of the most important benchmarks.
- MMLU(-Math): Measuring Massive Multitask Language Understanding, reference: https://arxiv.org/abs/2009.03300, https://github.com/hendrycks/test, MIT License
- MATH: Measuring Mathematical Problem Solving With the MATH Dataset, reference: https://arxiv.org/abs/2103.03874, https://github.com/hendrycks/math, MIT License
- MGSM: Multilingual Grade School Math Benchmark (MGSM), Language Models are Multilingual Chain-of-Thought Reasoners, reference: https://arxiv.org/abs/2210.03057, https://github.com/google-research/url-nlp, Creative Commons Attribution 4.0 International Public License (CC-BY)
- miniF2F: “a formal mathematics benchmark (translated across multiple formal systems) consisting of exercise statements from olympiads (AMC, AIME, IMO) as well as high-school and undergraduate maths classes”.
- OlympiadBench: “an Olympiad-level bilingual multimodal scientific benchmark”.
- GPT-4V attains an average score of 17.23% on OlympiadBench, with a mere 11.28% in physics.
- AIMO: “a new $10mn prize fund to spur the open development of AI models capable of performing as well as top human participants in the International Mathematical Olympiad (IMO)”.