gplearn

Genetic Programming in Python, with a scikit-learn inspired API

Primary language: Python · License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause)

Changes in this fork of Trevor Stephens' gplearn

This fork extends the original code in three main ways:

  • Initial guesses for programs can be supplied as equations with variable names X0, X1, ... for the features (e.g. '1.5*X0 + 10*X1/X2'), passed as a list of strings via the optional parameter previous_programs of the modified SymbolicRegressor.
  • Setting the new optional parameter optimize of SymbolicRegressor to True triggers symbolic program simplification via sympy and optimization of the numerical program parameters via scipy.
  • Setting the new optional parameter n_program_sum of SymbolicRegressor to an integer larger than 1 changes how the observation input is interpreted: the first column is a weight w_1, the following n_features columns are the program's feature input features_1, the next column is a weight w_2, and so on. A program P is then evaluated as the sum from i=1 to n_program_sum over w_i * P(features_i).
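For intuition, the weighted-sum evaluation rule can be sketched in a few lines of numpy. This is an illustrative restatement of the described semantics, not the fork's internal code, and the helper name weighted_program_sum is hypothetical:

```python
import numpy as np

# Illustrative sketch (not the fork's internal code) of the n_program_sum
# evaluation rule: each row of X is laid out as
# [w_1, feat_1..., w_2, feat_2..., ...] and a program P is scored as
# sum_i w_i * P(feat_i).

def weighted_program_sum(program, X, n_features, n_program_sum):
    """Evaluate P on each (weight, feature-block) pair and sum the results."""
    X = np.asarray(X, dtype=float)
    block = 1 + n_features                     # one weight column per block
    assert X.shape[1] == n_program_sum * block
    total = np.zeros(X.shape[0])
    for i in range(n_program_sum):
        w = X[:, i * block]                    # weight column w_i
        feats = X[:, i * block + 1 : (i + 1) * block]
        total += w * program(feats)
    return total

# Toy program P(F) = F0 + 2*F1 with two feature columns, n_program_sum = 2.
P = lambda F: F[:, 0] + 2.0 * F[:, 1]
X = np.array([[1.0, 1.0, 2.0,   0.5, 4.0, 0.0]])
# The row evaluates to 1*(1 + 2*2) + 0.5*(4 + 2*0) = 5 + 2 = 7
print(weighted_program_sum(P, X, n_features=2, n_program_sum=2))  # [7.]
```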

Additional extensions:

  • The optional parameter penalties takes a dictionary of function-specific weights used as program penalties, e.g. {'add': 2.0, 'var': 1.0, 'coeff': 1.5}, where the keys 'var' and 'coeff' penalize variables and numerical coefficients, respectively.
  • The optional parameter force_coeff inserts factors of one before numerical optimization. This avoids, for example, sums of features with different physical units in which the summands carry no numerical pre-factors.
  • Use gplearn._programparser.program_to_math to convert the list representation of a program into a mathematical expression with standard infix operators *, /, +, -, etc. instead of mul(...), div(...), etc., e.g. mathstring = program_to_math(est_gp._program.program).
  • A modified AIC metric, aic0, is implemented. Use it together with parsimony_coefficient=2.0 to properly penalize operators, variables, and numerical coefficients as degrees of freedom.
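To illustrate the kind of translation program_to_math performs, here is a toy prefix-to-infix converter. It operates on a simplified, string-based prefix list; gplearn itself stores function objects with arities, so this to_math helper and its token format are illustrative assumptions, not the fork's implementation:

```python
# Toy sketch of prefix-list -> infix conversion, similar in spirit to
# gplearn._programparser.program_to_math, over a simplified token format
# where plain strings stand in for gplearn's function objects.

BINARY = {'add': '+', 'sub': '-', 'mul': '*', 'div': '/'}

def to_math(program):
    """Convert a flat prefix list like ['mul', 'add', 'X0', 'X1', 0.5]
    into an infix string like '((X0 + X1) * 0.5)'."""
    def parse(pos):
        token = program[pos]
        if token in BINARY:
            left, pos = parse(pos + 1)         # first operand
            right, pos = parse(pos)            # second operand
            return '({} {} {})'.format(left, BINARY[token], right), pos
        return str(token), pos + 1             # feature name or constant
    expr, _ = parse(0)
    return expr

print(to_math(['mul', 'add', 'X0', 'X1', 0.5]))
# ((X0 + X1) * 0.5)
```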


Original README below:


Genetic Programming in Python, with a scikit-learn inspired API

Welcome to gplearn!

gplearn implements Genetic Programming in Python, with a scikit-learn inspired and compatible API.

While Genetic Programming (GP) can be used to perform a very wide variety of tasks, gplearn is purposefully constrained to solving symbolic regression problems. This is motivated by the scikit-learn ethos of having powerful estimators that are straightforward to implement.

Symbolic regression is a machine learning technique that aims to identify an underlying mathematical expression that best describes a relationship. It begins by building a population of naive random formulas to represent a relationship between known independent variables and their dependent variable targets in order to predict new data. Each successive generation of programs is then evolved from the one that came before it by selecting the fittest individuals from the population to undergo genetic operations.
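As a toy illustration of this evolve-and-select loop (deliberately simplified, and not gplearn's implementation, which adds tournament selection, crossover, and much more), consider:

```python
import random
import operator

# Toy sketch of the evolutionary loop described above. Programs are nested
# prefix lists over add/sub/mul, the feature 'X0', and the constant 1.0;
# fitness is mean absolute error against the target relationship.

OPS = {'add': operator.add, 'sub': operator.sub, 'mul': operator.mul}

def random_program(depth):
    """Grow a random expression tree encoded as nested lists."""
    if depth <= 0 or random.random() < 0.3:
        return random.choice(['X0', 1.0])
    op = random.choice(list(OPS))
    return [op, random_program(depth - 1), random_program(depth - 1)]

def evaluate(node, x0):
    if node == 'X0':
        return x0
    if not isinstance(node, list):
        return node                            # numeric constant
    op, left, right = node
    return OPS[op](evaluate(left, x0), evaluate(right, x0))

def fitness(program, xs, ys):
    """Mean absolute error; lower is fitter."""
    return sum(abs(evaluate(program, x) - y) for x, y in zip(xs, ys)) / len(xs)

def mutate(node, depth=3):
    """Replace a randomly chosen subtree with a freshly grown one."""
    if not isinstance(node, list) or random.random() < 0.3:
        return random_program(depth)
    op, left, right = node
    if random.random() < 0.5:
        return [op, mutate(left, depth - 1), right]
    return [op, left, mutate(right, depth - 1)]

if __name__ == '__main__':
    random.seed(0)
    xs = [i / 4 for i in range(-8, 9)]
    ys = [x * x + x for x in xs]               # target relationship to recover
    population = [random_program(3) for _ in range(200)]
    for _ in range(20):
        population.sort(key=lambda p: fitness(p, xs, ys))
        survivors = population[:50]            # keep the fittest quarter
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in range(150)]
    best = min(population, key=lambda p: fitness(p, xs, ys))
    print(round(fitness(best, xs, ys), 3))
```

Because the fittest survivors are carried over unchanged, the best fitness in the population never worsens from one generation to the next.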

gplearn retains the familiar scikit-learn fit/predict API and works with the existing scikit-learn pipeline and grid search modules. The package attempts to squeeze a lot of functionality into a scikit-learn-style API. While there are a lot of parameters to tweak, reading the documentation should make the more relevant ones clear for your problem.

gplearn supports regression through the SymbolicRegressor, binary classification with the SymbolicClassifier, as well as transformation for automated feature engineering with the SymbolicTransformer, which is designed to support regression problems, but should also work for binary classification.

gplearn is built on scikit-learn, and a fairly recent version (0.22.1+) is required for installation. If you come across any issues in running or installing the package, please submit a bug report.