/PySR

Simple, fast, and parallelized symbolic regression in Python/Julia via regularized evolution and simulated annealing

Primary LanguageJuliaApache License 2.0Apache-2.0

Documentation Status PyPI version Build Status

Parallelized symbolic regression built on Julia, and interfaced by Python. Uses regularized evolution, simulated annealing, and gradient-free optimization.

Cite this software

Documentation

Symbolic regression is a very interpretable machine learning algorithm for low-dimensional problems: these tools search equation space to find algebraic relations that approximate a dataset.

One can also extend these approaches to higher-dimensional spaces by using a neural network as proxy, as explained in 2006.11287, where we apply it to N-body problems. Here, one essentially uses symbolic regression to convert a neural net to an analytic equation. Thus, these tools simultaneously present an explicit and powerful way to interpret deep models.

Backstory:

Previously, we have used eureqa, which is a very efficient and user-friendly tool. However, eureqa is GUI-only, doesn't allow for user-defined operators, has no distributed capabilities, and has become proprietary (and recently been merged into an online service). Thus, the goal of this package is to have an open-source symbolic regression tool as efficient as eureqa, while also exposing a configurable python interface.

Installation

PySR uses both Julia and Python, so you need to have both installed.

Install Julia - see downloads, and then instructions for mac and linux. (Don't use the conda-forge version; it doesn't seem to work properly.) Then, at the command line, install the Optim and SpecialFunctions packages via:

julia -e 'import Pkg; Pkg.add("Optim"); Pkg.add("SpecialFunctions")'

For python, you need to have Python 3, numpy, sympy, and pandas installed.

You can install this package from PyPI with:

pip install pysr

Quickstart

import numpy as np
from pysr import pysr, best, get_hof

# Dataset
X = 2*np.random.randn(100, 5)
y = 2*np.cos(X[:, 3]) + X[:, 0]**2 - 2

# Learn equations
equations = pysr(X, y, niterations=5,
        binary_operators=["plus", "mult"],
        unary_operators=["cos", "exp", "sin"])

...# (you can use ctl-c to exit early)

print(best())

which gives:

x0**2 + 2.000016*cos(x3) - 1.9999845

One can also use best_tex to get the LaTeX form, or best_callable to get a function you can call. This uses a score which balances complexity and error; however, one can see the full list of equations with:

print(get_hof())

This is a pandas table, with additional columns:

  • MSE - the mean square error of the formula
  • score - a metric akin to Occam's razor; you should use this to help select the "true" equation.
  • sympy_format - sympy equation.
  • lambda_format - a lambda function for that equation, that you can pass values through.