
Accelerating Symbolic Regression with Deep Learning

by Tyler Hughes, Siddharth Buddhiraju, and Rituraj

Introduction

The goal of symbolic regression is to generate a function that describes a given set of datapoints. This function can, in general, include undetermined constants that are fit later. Typical approaches to symbolic regression involve global optimization techniques based on genetic algorithms [1, 2, 3], which search a subset of the entire space of possible equations for the best fit. This approach is computationally cumbersome and does not exploit any of the features or structure inherent in the original data.

In this work, we approach symbolic regression as a modified machine translation problem, where our goal is to translate the original dataset into an equation using a mixture of machine learning techniques. We first train a neural network to fit the dataset and extract a feature vector by stripping out the network's learned parameters. We then train a recurrent neural network model (using LSTMs) to decode this feature vector into an equation. Further processing of this equation can be performed to fit constants and to trade off accuracy against simplicity.
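To make the feature-extraction step concrete, here is a minimal sketch of the idea, assuming a small scikit-learn MLP as the fitting network; the architecture and API choice are illustrative assumptions, not the implementation in NN.py:

```python
# Sketch of the feature-extraction idea: fit a small neural network to the
# (x, y) samples of one equation, then flatten its learned weights into a
# fixed-length feature vector. Uses scikit-learn for brevity; the actual
# NN.py implementation may differ.
import numpy as np
from sklearn.neural_network import MLPRegressor

def dataset_features(x, y, hidden=(16, 16), seed=0):
    """Fit a small MLP to the (x, y) samples and return its flattened weights."""
    net = MLPRegressor(hidden_layer_sizes=hidden, max_iter=2000, random_state=seed)
    net.fit(x.reshape(-1, 1), y)
    # The learned parameters act as a fixed-length summary of the dataset.
    params = [w.ravel() for w in net.coefs_] + [b.ravel() for b in net.intercepts_]
    return np.concatenate(params)

# Example: feature vector for samples drawn from y = sin(2x)
x = np.linspace(-3, 3, 200)
features = dataset_features(x, np.sin(2 * x))
print(features.shape)  # length fixed by the architecture, not the dataset size
```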

This work presents a fresh approach to the problem of symbolic regression and may allow for potential increases in predictive power, reliability, and speed compared to previous solutions.

Paper

This work was performed as the final project for Stanford CS 221, Artificial Intelligence: Principles and Techniques. The final writeup is linked here.

Code

Generating the Dataset

The data is generated by running the generate_examples.py script. One must specify the maximum allowable equation tree depth by setting the tree_depth variable. For each training example, this script generates a random equation up to this maximum depth, sets its constants randomly, and then generates a set of (x, y) pairs. A neural network in NN.py is then fit to this dataset and returns a feature vector for the training example. All of this data is stored in the ./data/ directory, separated by maximum allowed tree depth. 1500 examples of depths 1, 2, and 3 have been pre-generated and are included in ./data/, so for training purposes this step can be skipped.
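As a rough illustration of what the generator does, the sketch below builds a random equation tree up to a maximum depth, draws constants, and samples (x, y) pairs; the operator set, branching probabilities, and sampling ranges here are assumptions for illustration, not the exact logic of generate_examples.py:

```python
# Sketch of random equation generation: build a random expression tree up to
# a maximum depth, draw constants at random, then sample (x, y) pairs.
import random
import numpy as np

BINARY = {'+': np.add, '*': np.multiply}
UNARY = {'sin': np.sin, 'cos': np.cos, 'exp': np.exp}

def random_tree(depth):
    """Return (equation string, callable) for a random equation of at most `depth`."""
    if depth == 0 or random.random() < 0.3:        # leaf: the variable or a constant
        if random.random() < 0.5:
            return 'x', lambda x: x
        c = round(random.uniform(-2, 2), 2)
        return str(c), lambda x, c=c: np.full_like(x, c)
    if random.random() < 0.5:                      # unary operator node
        name = random.choice(list(UNARY))
        s, f = random_tree(depth - 1)
        return f'{name}({s})', lambda x, f=f, op=UNARY[name]: op(f(x))
    name = random.choice(list(BINARY))             # binary operator node
    ls, lf = random_tree(depth - 1)
    rs, rf = random_tree(depth - 1)
    return (f'({ls} {name} {rs})',
            lambda x, lf=lf, rf=rf, op=BINARY[name]: op(lf(x), rf(x)))

equation, f = random_tree(3)
x = np.linspace(-3, 3, 100)
y = f(x)                                           # the (x, y) pairs for one example
print(equation)
```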

Decoding and Equation Prediction

The files LSTM_basic.py, LSTM_sequence.py, and LSTM_tree.py are three separate implementations of the LSTM network used to decode feature vectors into equations.

LSTM_basic.py allows for user-defined equations and feature vectors, and was used for debugging and investigating simple cases.

LSTM_sequence.py trains a linear LSTM sequence model on the dataset and returns equations in which parentheses appear as their own tokens. For example, sin(x) is represented as [ sin , ( , x , ) ].
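A tokenizer producing this representation can be as simple as the following sketch (illustrative; the repository's actual tokenization may differ):

```python
# Sketch: split an equation string into tokens, with parentheses as
# standalone sequence elements.
import re

def tokenize(equation):
    return re.findall(r'sin|cos|exp|log|\d+\.?\d*|[x()+\-*/]', equation)

print(tokenize('sin(x)'))     # ['sin', '(', 'x', ')']
print(tokenize('(x + 1.5)'))  # ['(', 'x', '+', '1.5', ')']
```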

LSTM_tree.py trains a tree-based LSTM model intended to mirror the natural representation of equations as trees. This script trains a full binary tree of LSTMs on the original dataset. At prediction time, it returns a full binary equation tree containing special characters, which is then pruned to leave the predicted equation.
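For intuition, the pruning step can be sketched as below, assuming a hypothetical 'PAD' token marks unused nodes in the predicted full binary tree (the actual special characters in LSTM_tree.py may differ):

```python
# Sketch of pruning a predicted full binary equation tree: subtrees rooted
# at the assumed padding token 'PAD' are discarded.

def prune(tree):
    """tree: nested (token, left, right) tuples; leaves use None children."""
    if tree is None:
        return None
    token, left, right = tree
    if token == 'PAD':                     # an unused subtree: discard it entirely
        return None
    return (token, prune(left), prune(right))

# A depth-2 full binary tree predicting sin(x); unused nodes hold 'PAD'.
full = ('sin',
        ('x', ('PAD', None, None), ('PAD', None, None)),
        ('PAD', ('PAD', None, None), ('PAD', None, None)))
print(prune(full))                         # ('sin', ('x', None, None), None)
```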

All three scripts report training and prediction statistics to the user.

Curve Fitting

After generating the equations, we provide fitter.py for fitting the undetermined constants to the original (x, y) data points. The script contains an example that runs when it is executed directly; its functions can also be imported for post-processing.
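As a reference for this post-processing step, constant fitting of this kind can be done with scipy.optimize.curve_fit; the equation skeleton and parameter names in this sketch are illustrative, not fitter.py's interface:

```python
# Sketch of the constant-fitting step: given a predicted equation skeleton,
# e.g. c0 * sin(c1 * x), fit its constants to the original (x, y) data.
import numpy as np
from scipy.optimize import curve_fit

def skeleton(x, c0, c1):
    """A predicted equation skeleton with undetermined constants c0, c1."""
    return c0 * np.sin(c1 * x)

# Synthetic data from y = 1.5 * sin(2x) with a little noise
x = np.linspace(-3, 3, 200)
y = 1.5 * np.sin(2 * x) + 0.05 * np.random.randn(x.size)

(c0, c1), _ = curve_fit(skeleton, x, y, p0=[1.0, 1.0])
print(f'fitted: {c0:.2f} * sin({c1:.2f} * x)')
```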

Dependencies

References

[1] Josh Bongard and Hod Lipson. Automated reverse engineering of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 104(24):9943–9948, 2007.

[2] John Duffy and Jim Engle-Warnick. Using symbolic regression to infer strategies from experimental data. In Evolutionary computation in Economics and Finance, pages 61–82. Springer, 2002.

[3] Wouter Minnebo, Sean Stijven, and Katya Vladislavleva. Empowering knowledge computing with variable selection, 2011.