
Awesome LLM4Math

Curation of resources for LLM mathematical reasoning.

License: Apache-2.0

🐱 GitHub | 🐦 Twitter

📢 If you have any suggestions, please don't hesitate to let us know.

The following resources are listed in (roughly) chronological order of publication.

Continual Pre-Training: Methods / Models / Corpora

  • Llemma & Proof-Pile-2: Open-sourced re-implementation of Minerva.
    • Open-sourced the Proof-Pile-2 corpus, comprising 51.9B tokens (by DeepSeek tokenizer).
    • Continually pre-trained from the Code Llama models.
  • OpenWebMath:
    • 13.6B tokens (by DeepSeek tokenizer).
    • Used by Rho-1 to achieve performance comparable with DeepSeekMath.
  • MathPile:
    • 8.9B tokens (by DeepSeek tokenizer).
    • Mainly comprising arXiv papers.
    • Shown by DeepSeekMath to be ineffective (on 7B models).
  • DeepSeekMath: Open-sourced SotA (as of 2024-04-18).
    • Continually pre-trained from DeepSeek-LLMs and DeepSeek-Coder-7B.
  • Rho-1: Selects training tokens by loss/perplexity, achieving performance comparable with DeepSeekMath while training on only the 15B-token OpenWebMath corpus (see the sketch below).
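
As a rough illustration of the token selection described in the Rho-1 entry above, here is a minimal PyTorch sketch (ours, not the official implementation): per-token losses of the model being trained are compared against a reference model, and only the tokens with the largest excess loss contribute to the training objective. The `keep_ratio` parameter and the exact selection rule are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(logits, ref_logits, labels, keep_ratio=0.6):
    """Selective language modeling in the spirit of Rho-1 (illustrative sketch).

    logits, ref_logits: (batch, seq, vocab) from the trained and reference
    models; labels: (batch, seq), assumed already shifted to align with logits.
    Only tokens whose excess loss (trained minus reference) falls in the top
    `keep_ratio` fraction contribute to the returned loss.
    """
    vocab = logits.size(-1)
    # Per-token cross-entropy for the model being trained, shape (batch, seq).
    ce = F.cross_entropy(logits.reshape(-1, vocab), labels.reshape(-1),
                         reduction="none").view(labels.shape)
    with torch.no_grad():
        ref_ce = F.cross_entropy(ref_logits.reshape(-1, vocab), labels.reshape(-1),
                                 reduction="none").view(labels.shape)
        # Excess loss: how much harder this token is for the trained model
        # than for the reference; select the top-k such tokens.
        excess = ce.detach() - ref_ce
        k = max(1, int(excess.numel() * keep_ratio))
        threshold = excess.flatten().topk(k).values.min()
        mask = (excess >= threshold).float()
    return (ce * mask).sum() / mask.sum().clamp(min=1.0)
```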

SFT: Methods / Models / Datasets

Natural language (only)

  • RFT: SFT on rejection-sampled model outputs is effective (sketched after this list).
  • MetaMath: Constructing problems with known ground-truth answers (though not necessarily feasible problems) via self-verification.
    • Augmenting with GPT-3.5-Turbo.
  • AugGSM8k: Common data augmentation on GSM8k helps little in generalization to MATH.
  • MathScale: Scaling synthetic data to ~2M samples using GPT-3.5-Turbo with a knowledge graph.
  • KPMath: Scaling synthetic data to 1.576M samples using GPT-4-Turbo with a knowledge graph.
  • XWin-Math: Simply scaling synthetic data to 480k MATH + 960k GSM8k samples using GPT-4-Turbo, without a knowledge graph.
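
For concreteness, the rejection-sampling recipe behind RFT can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's code; `sample_solution` and `extract_answer` are hypothetical stand-ins for a model sampler and an answer parser.

```python
def build_rft_dataset(problems, sample_solution, extract_answer, k=8):
    """RFT-style data construction (illustrative sketch).

    problems: list of {"question": str, "answer": str};
    sample_solution(question) -> one sampled reasoning path (hypothetical);
    extract_answer(solution) -> final answer string (hypothetical).
    Keeps only sampled solutions whose final answer matches the ground
    truth, deduplicated, and uses them as SFT targets.
    """
    dataset = []
    for prob in problems:
        kept = set()
        for _ in range(k):
            sol = sample_solution(prob["question"])
            if extract_answer(sol) == prob["answer"]:
                kept.add(sol)  # set() drops duplicate reasoning paths
        dataset.extend({"question": prob["question"], "solution": s}
                       for s in sorted(kept))
    return dataset
```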

Code integration

  • MAmmoTH: SFT on data mixing CoT (chain-of-thought) and PoT (program-of-thought) rationales is effective; a toy PoT example follows this list.
  • ToRA & MARIO: The first open-sourced works to verify the effectiveness of SFT for tool-integrated reasoning.
  • OpenMathInstruct-1: Scaling synthetic data to 1.8M samples using Mixtral-8x7B.
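
To make the CoT/PoT contrast concrete, here is a toy program-of-thought rationale for a GSM8k-style word problem: the model emits executable Python rather than prose, and the final answer is obtained by running the program.

```python
# Program-of-thought (PoT) rationale for a GSM8k-style word problem:
# "Natalia sold clips to 48 of her friends in April, and then she sold
#  half as many clips in May. How many clips did Natalia sell altogether?"
def solve():
    april = 48
    may = april // 2   # half as many as in April
    return april + may # total over both months

print(solve())  # -> 72
```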

RL: Methods / Models / Datasets

  • Math-Shepherd: Constructing step-correctness labels with an MCTS-like rollout method (see the sketch below).
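
A minimal sketch of that rollout-based labeling idea (our illustration; `complete_from` and `extract_answer` are hypothetical stand-ins): each solution prefix is scored by how often sampled continuations starting from it reach the gold final answer.

```python
def estimate_step_labels(question, steps, gold_answer,
                         complete_from, extract_answer, n=8):
    """Math-Shepherd-style automatic process labels (illustrative sketch).

    steps: a candidate solution split into reasoning steps;
    complete_from(prefix) -> one sampled completion (hypothetical);
    extract_answer(text) -> final answer string (hypothetical).
    Returns one soft label per step: the fraction of n rollouts from that
    step's prefix that end in the gold answer (a hard label would instead
    check whether any rollout succeeds).
    """
    labels, prefix = [], question
    for step in steps:
        prefix += "\n" + step
        hits = sum(extract_answer(complete_from(prefix)) == gold_answer
                   for _ in range(n))
        labels.append(hits / n)
    return labels
```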

Evaluation: Benchmarks

Here we focus on several of the most important benchmarks.

Other benchmarks

  • miniF2F: “a formal mathematics benchmark (translated across multiple formal systems) consisting of exercise statements from olympiads (AMC, AIME, IMO) as well as high-school and undergraduate maths classes”.
  • OlympiadBench: “an Olympiad-level bilingual multimodal scientific benchmark”.
    • GPT-4V attains an average score of 17.23% on OlympiadBench, with a mere 11.28% in physics.

Curations, collections and surveys

Events

  • AIMO: “a new $10mn prize fund to spur the open development of AI models capable of performing as well as top human participants in the International Mathematical Olympiad (IMO)”.