Reverse-Process Synthetic Data Generation: Automatically Generating Training Data for Language Models in Complex Problem Solving
This paper introduces a methodology for generating high-quality, diverse training data for Language Models (LMs) in complex problem-solving domains. Our approach, termed "Reverse-Process Synthetic Data Generation" (RPSDG), inverts traditionally difficult problems to create an abundance of training examples with known solutions, e.g., symbolically differentiating a function to produce an integration problem whose answer is known by construction.
Introduction
- The challenge of training data for complex problem-solving
- Overview of Reverse-Process Synthetic Data Generation (RPSDG)
- Potential impact on AI capabilities and explainability
Methodology
- Core principles
- Automating generation of process supervision training data
- Curriculum learning and problem difficulty progression
Mathematics
- Algebra: Equation solving and manipulation
- Calculus: From differentiation to integration
Implementation and Results
- Data generation pipelines
- Transformer-based LMs
- Self-Supervised Learning (SSL)
- Evals and benchmarks
Discussion
- Implications for AI problem-solving capabilities
- Enhancing explainability and transparency
- Limitations and challenges of the RPSDG approach using SSL
Future Work
- Expanding to new domains and problem types
- Reinforcement learning to reward multi-step reasoning when solutions are verifiable but not known in advance
Conclusion
- Summary of key findings
- Broader impact on AI research and applications
In "The Bitter Lesson," Richard Sutton argues that learning algorithms that scale with compute and data will eventually outperform handcrafted algorithms.
The next frontier in AI research is finding ways to acquire high-quality data that can be used to train models to predict the latent structure and processes in the world. Much of this structure is latent in the sense that the processes generating the data are not observable (e.g., never written down). In mathematics, for example, the way a proof was discovered is rarely shown; only the polished proof is presented, hiding the creative process and the "dark matter" that led to it. Understanding and modeling this latent structure can lead to significant improvements in AI capabilities.
In this paper, we explore algorithmic data generation, applying classical algorithms (GOFAI) to automatically generate high-quality step-by-step (process supervision) training data for LMs. In particular, we focus on problems that are easy to solve in one direction but hard to solve in the other.
In mathematics, computing derivatives of functions is generally easier than finding their antiderivatives (integrals). This inherent asymmetry allows us to use the more straightforward differentiation process to generate a rich dataset for training LMs by reversing the problem-solving direction: starting with derivatives and recovering the original functions.
Starting with Known Functions:
- Select functions \( f(x) \) that have closed-form expressions and well-defined derivatives.
- Examples include polynomials, trigonometric functions, exponential functions, and logarithmic functions.
- To ensure the training data covers a wide range of functions and their transformations, we compose base functions into expressions of varying complexity and form, as sketched below.
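A minimal sketch of this sampling step, assuming SymPy as the computer algebra system; the helper `random_function` and the particular composition operators are illustrative choices, not a prescribed implementation:

```python
# Sketch: sample base functions and compose them into expressions of
# varying complexity, assuming SymPy as the symbolic backend.
import random
import sympy as sp

x = sp.Symbol("x")

# Base families with closed-form expressions and well-defined derivatives.
BASE_FUNCS = [
    lambda: x ** random.randint(1, 4),   # polynomials
    lambda: sp.sin(x),                   # trigonometric
    lambda: sp.cos(x),
    lambda: sp.exp(x),                   # exponential
    lambda: sp.log(x),                   # logarithmic
]

def random_function(depth: int = 2) -> sp.Expr:
    """Compose random base functions via +, *, and composition up to `depth` levels."""
    if depth == 0:
        return random.choice(BASE_FUNCS)()
    left = random_function(depth - 1)
    right = random_function(depth - 1)
    op = random.choice(["add", "mul", "compose"])
    if op == "add":
        return left + right
    if op == "mul":
        return left * right
    return left.subs(x, right)  # function composition f(g(x))

if __name__ == "__main__":
    for _ in range(3):
        print(random_function(depth=2))
```

Increasing `depth` is one simple way to control expression complexity when building a difficulty progression.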
Formulating the Reverse Process:
- Take the derivative of \( f(x) \) to generate \( f'(x) \).
Reversing the Process:
- Generate the training example by reversing the problem and solution steps: starting from \( f'(x) \), show how to arrive at the corresponding antiderivative \( f(x) \).
- Ensure that each step in this process is well-documented, capturing the intermediate transformations.
This approach allows us to generate integration problems of graduated difficulty, leveraging the inherent asymmetry between differentiation and integration. By automating this process, we create a diverse dataset for training LMs to predict solutions in integral calculus.
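The reversal itself is then a one-line application of differentiation. The sketch below, again assuming SymPy, illustrates the idea; `make_integration_example` and the `count_ops` difficulty proxy are assumptions for illustration:

```python
# Sketch: turn the easy direction (differentiation) into a hard problem
# (integration) whose answer is known by construction.
import sympy as sp

x = sp.Symbol("x")

def make_integration_example(f: sp.Expr) -> dict:
    """Differentiate f and present the result as an integration problem."""
    f_prime = sp.diff(f, x)
    return {
        "problem": f"Find the antiderivative of {f_prime} with respect to x.",
        "solution": f"{f} + C",
        "difficulty": sp.count_ops(f_prime),  # crude signal for curriculum ordering
    }

if __name__ == "__main__":
    example = make_integration_example(x * sp.sin(x))
    print(example["problem"])   # Find the antiderivative of x*cos(x) + sin(x) with respect to x.
    print(example["solution"])  # x*sin(x) + C
```

Recording the differentiation rules applied at each step and replaying them in reverse yields the step-level (process supervision) annotations described above.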
One of the key challenges in mathematics and logic is generating proofs for given theorems. While verifying a proof is generally straightforward, generating the proof itself can be significantly more complex. Our methodology leverages this asymmetry by focusing on the reverse process: starting with randomly generated expressions and using rewrite rules to create both theorems and their proofs.
Random Walks in Expression Space:
- Begin with a randomly generated expression \( e_{\text{start}} \).
- Apply a series of rewrite rules \( r_1, r_2, \ldots, r_n \) to generate a sequence of intermediate expressions \( e_1, e_2, \ldots, e_n \), ultimately arriving at a final expression \( e_{\text{end}} \).
- Each step \( e_i \rightarrow e_{i+1} \) represents a proof step within a logical or mathematical framework.
Forming Theorems and Proofs:
- The pair \( (e_{\text{start}}, e_{\text{end}}) \) represents a theorem, where \( e_{\text{start}} \) is the hypothesis and \( e_{\text{end}} \) is the conclusion.
- The sequence of intermediate steps provides the proof for this theorem.
- This method effectively generates theorems and their corresponding proofs by random exploration of the expression space.
Reversibility and Bidirectional Processes:
- The reverse process can also be applied, starting with \( e_{\text{end}} \) and working backward to \( e_{\text{start}} \).
- Intermediate steps that involve complex operations (e.g., integration) can often be generated by running their simpler inverses (e.g., differentiation) in the opposite direction.
- This bidirectional approach ensures a rich dataset with varied difficulty levels for training models.
Automated Proof Generation:
- By applying rewrite rules to randomly generated starting expressions and stopping at arbitrary points, we automatically generate both theorems and their proofs (see the sketch after this list).
- This method circumvents the traditional difficulty of finding a theorem to prove and then discovering its proof.
- The randomness ensures a diverse set of theorems and proofs, capturing a wide range of logical and mathematical concepts.
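As a concrete illustration, the sketch below performs such a random walk using a handful of SymPy transformations as stand-in rewrite rules; the rule set and the helper `random_walk` are assumptions for illustration rather than a prescribed rewrite system:

```python
# Sketch: random walk over expression space; each rewrite application is one
# proof step, and the recorded sequence is a proof that e_start equals e_end.
import random
import sympy as sp

x = sp.Symbol("x")

# Meaning-preserving rewrite rules (stand-ins for a richer rewrite system).
REWRITE_RULES = [
    ("expand", sp.expand),
    ("factor", sp.factor),
    ("expand_trig", sp.expand_trig),
    ("trigsimp", sp.trigsimp),
    ("simplify", sp.simplify),
]

def random_walk(e_start: sp.Expr, n_steps: int = 4):
    """Apply random rewrite rules to e_start, recording each intermediate step."""
    steps, expr = [], e_start
    for _ in range(n_steps):
        name, rule = random.choice(REWRITE_RULES)
        new_expr = rule(expr)
        if new_expr != expr:          # keep only rewrites that change the expression
            steps.append((name, expr, new_expr))
            expr = new_expr
    return expr, steps

if __name__ == "__main__":
    e_start = (sp.sin(x) + sp.cos(x)) ** 2
    e_end, proof = random_walk(e_start)
    # (e_start, e_end) is the generated "theorem"; `proof` is its step-by-step
    # derivation, which can also be replayed in reverse from e_end to e_start.
    print(f"Theorem: {e_start} = {e_end}")
    for name, before, after in proof:
        print(f"  {name}: {before} -> {after}")
```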
This methodology allows us to systematically generate a large volume of training data for LMs. By focusing on problems that are easy in one direction but complex in the other, we create a diverse dataset that captures a wide range of logical and mathematical challenges. This not only enhances the problem-solving capabilities of LMs but also provides a framework for explainable AI, as the step-by-step problem-solving process is inherent in the training data.