Curation of resources for LLM mathematical reasoning.
📢 If you have any suggestions, please don't hesitate to let us know. You can
- E-mail Yuxuan Tong directly,
- comment under the Twitter thread,
- or post an issue in the GitHub repository.
The following resources are listed in (roughly) chronological order of publication.
- Llemma & Proof-Pile-2: Open-sourced re-implementation of Minerva.
- Open-sourced corpus Proof-Pile-2 comprising 51.9B tokens (by DeepSeek tokenizer).
- Continually pre-trained based on CodeLLaMAs.
- OpenWebMath:
- 13.6B tokens (by DeepSeek tokenizer).
- Used by Rho-1 to achieve performance comparable with DeepSeekMath.
- MathPile:
- 8.9B tokens (by DeepSeek tokenizer).
- Mainly comprising arXiv papers.
- Shown to be ineffective (on 7B models) by DeepSeekMath.
- DeepSeekMath: Open-sourced SotA (as of 2024-04-18).
- Continually pre-trained based on DeepSeek-LLMs and DeepSeek-Coder-7B.
- Rho-1: Selecting training tokens based on loss/perplexity, achieving performance comparable with DeepSeekMath while training only on the 15B-token OpenWebMath corpus (see the token-selection sketch after this list).
- RFT: SFT on rejection-sampled model outputs is effective (see the rejection-sampling sketch after this list).
- MetaMath: Constructing problems with known ground-truth answers (but not necessarily feasible ones) via self-verification.
- Augmenting with GPT-3.5-Turbo.
- AugGSM8k: Common data augmentation on GSM8k helps little in generalization to MATH.
- MathScale: Scaling synthetic data to ~2M samples using GPT-3.5-Turbo with a knowledge graph.
- KPMath: Scaling synthetic data to 1.576M samples using GPT-4-Turbo with a knowledge graph.
- XWin-Math: Simply scaling synthetic data to 480k MATH + 960k GSM8k samples using GPT-4-Turbo.
- MAmmoTH: SFT on mixed CoT & PoT data is effective.
- ToRA & MARIO: The first open-sourced works verifying the effectiveness of SFT for tool-integrated reasoning.
- OpenMathInstruct-1: Scaling synthetic data to 1.8M samples using Mixtral-8x7B.
- Math-Shepherd: Constructing step-correctness labels based on an MCTS-like method (see the step-labeling sketch after this list).
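
For readers unfamiliar with the techniques above, here are a few minimal, hedged sketches. First, the loss/perplexity-based token selection behind Rho-1 (selective language modeling): only tokens whose loss under the training model is high relative to a reference model contribute to the training objective. This is an illustrative sketch under our own assumptions, not code from the Rho-1 release; names, the `keep_ratio` parameter, and the exact scoring rule are simplifications.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(train_logits, ref_logits, labels, keep_ratio=0.6):
    """Illustrative selective-LM loss: train only on tokens with high "excess loss".

    train_logits, ref_logits: (batch, seq, vocab) from the training and reference
    models; labels: (batch, seq) next-token targets, assumed already aligned
    (shifted) to the logits.
    """
    # Per-token cross-entropy under both models.
    ce_train = F.cross_entropy(
        train_logits.flatten(0, 1), labels.flatten(), reduction="none"
    ).view_as(labels).float()
    ce_ref = F.cross_entropy(
        ref_logits.flatten(0, 1), labels.flatten(), reduction="none"
    ).view_as(labels).float()

    # Excess loss: tokens the training model still finds hard but a clean
    # reference model explains well are treated as the useful ones.
    excess = ce_train - ce_ref

    # Keep only the top-k tokens per sequence; mask out the rest.
    k = max(1, int(keep_ratio * labels.size(1)))
    threshold = excess.topk(k, dim=1).values[:, -1:]
    mask = (excess >= threshold).float()

    return (ce_train * mask).sum() / mask.sum().clamp(min=1.0)
```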
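Second, the core loop of rejection-sampling fine-tuning (RFT): sample several candidate solutions per training problem, keep only those whose final answer matches the ground truth (deduplicated here by exact string, a simplification), and run ordinary SFT on the survivors. The `generate` and `extract_answer` helpers below are assumed, not part of any specific library.

```python
def rejection_sample_sft_data(problems, generate, extract_answer, k=8):
    """Build an SFT set from self-generated solutions whose final answer is correct.

    `problems`: list of dicts with "question" and "answer" fields.
    `generate(question, n)`: samples n candidate solutions from the model (assumed).
    `extract_answer(solution)`: parses the final answer from a solution (assumed).
    """
    sft_examples = []
    for prob in problems:
        kept = set()  # deduplicate identical reasoning paths
        for solution in generate(prob["question"], n=k):
            # Rejection step: keep a sample only if its final answer is correct.
            if extract_answer(solution) == prob["answer"] and solution not in kept:
                kept.add(solution)
                sft_examples.append(
                    {"prompt": prob["question"], "completion": solution}
                )
    return sft_examples
```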
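Third, Math-Shepherd-style step labels can be sketched as a Monte-Carlo completion check: for each prefix of a solution, sample several continuations and mark the step positive if some rollout still reaches the gold answer (the paper also describes a soft, fraction-based variant). This reuses the same assumed `generate` and `extract_answer` helpers as above and is a simplification of the actual pipeline.

```python
def label_solution_steps(question, steps, answer, generate, extract_answer, n=8):
    """Assign a correctness label to each reasoning step via Monte-Carlo rollouts.

    `steps`: list of reasoning-step strings. For every prefix steps[:i+1],
    sample `n` continuations; the step is labeled 1 ("the correct answer is
    still reachable from here") if any rollout ends in `answer`.
    """
    labels = []
    for i in range(len(steps)):
        prefix = question + "\n" + "\n".join(steps[: i + 1])
        rollouts = generate(prefix, n=n)
        # Hard estimation: 1 if any completion recovers the gold answer.
        labels.append(int(any(extract_answer(r) == answer for r in rollouts)))
    return labels
```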
Here we focus on several of the most important benchmarks.
- MMLU(-Math): Measuring Massive Multitask Language Understanding, reference: https://arxiv.org/abs/2009.03300, https://github.com/hendrycks/test, MIT License
- MATH: Measuring Mathematical Problem Solving With the MATH Dataset, reference: https://arxiv.org/abs/2103.03874, https://github.com/hendrycks/math, MIT License
- MGSM: Multilingual Grade School Math Benchmark (MGSM), Language Models are Multilingual Chain-of-Thought Reasoners, reference: https://arxiv.org/abs/2210.03057, https://github.com/google-research/url-nlp, Creative Commons Attribution 4.0 International Public License (CC-BY)
- miniF2F: “a formal mathematics benchmark (translated across multiple formal systems) consisting of exercise statements from olympiads (AMC, AIME, IMO) as well as high-school and undergraduate maths classes”.
- OlympiadBench: “an Olympiad-level bilingual multimodal scientific benchmark”.
- GPT-4V attains an average score of 17.23% on OlympiadBench, with a mere 11.28% in physics.
- AIMO: “a new $10mn prize fund to spur the open development of AI models capable of performing as well as top human participants in the International Mathematical Olympiad (IMO)”.