/GP-GOMEA

Genetic Programming version of GOMEA. Also includes standard tree-based GP, and Semantic Backpropagation-based GP

Primary LanguageC++Apache License 2.0Apache-2.0

Genetic Programming - GOMEA, Standard & Semantic

Implementations of the Gene-pool Optimal Mixing Evolutionary Algorithm for Genetic Programming (GP-GOMEA), standard tree-based GP (Standard GP), and Semantic Backpropagation-based GP (SBP-GP). The last two can also be used in a multi-objective setting (in which case NSGA-II is used) --- Multi-objective versions of GP-GOMEA and its population size-free scheme (called interleaved multistart scheme) have yet to be implemented. This code uses C++ under the hood for speed, and has a Python 3 interface to scikit-learn.

NEW: Interested in GP-GOMEA alone?

Find a more streamlined and better-performing version at this new repo

Algorithms

GP-GOMEA: GP-GOMEA estimates interdependencies between model components and uses this info to cross-over interdependent components en block, to preserve their concerted action. GP-GOMEA is especially proficient in finding models in the form of small symbolic expressions, that can often be interpreted.

SBP-GP: SBP-GP uses recursive function inversion to estimate effective ways of modifying evolving models. SBP-GP works well when high accuracy is demanded, at the cost of producing very complex models, which are hard or impossible to interpret.

GP-GOMEA and SBP-GP are some of the top-performing algorithms when it comes to symbolic regression:

image

Image from SRBench, a large benchmark for symbolic regression including over 250 datasets.

Related research work

If you use our code for academic purposes, please support our research by citing our research papers. The main work for which this code base was created is:

@article{virgolin2021improving,
  title={Improving model-based genetic programming for symbolic regression of small expressions},
  author={Virgolin, Marco and Alderliesten, Tanja and Witteveen, Cees and Bosman, Peter A. N.},
  journal={Evolutionary Computation},
  volume={29},
  number={2},
  pages={211--237},
  year={2021},
  publisher={MIT Press}
}

Other works featuring GP-GOMEA (in some form) or the other algorithms implemented here are:

  • M. Virgolin, T. Alderliesten, C. Witteveen, and P.A.N. Bosman. Scalable Genetic Programming by Gene-pool Optimal Mixing and Input-space Entropy-based Building Block Learning. In Proceedings of the Genetic and Evolutionary Computation Conference - GECCO-2017, pages 1041-1048, ACM, 2017. (The source code for this paper is available on Peter's website: https://homepages.cwi.nl/~bosman/source_code.php)
  • M. Virgolin, T. Alderliesten, A. Bel, C. Witteveen, and P.A.N. Bosman. Symbolic Regression and Feature Construction with GP-GOMEA applied to Radiotherapy Dose Reconstruction of Childhood Cancer Survivors. In Proceedings of the Genetic and Evolutionary Computation Conference - GECCO-2018, pages 1395-1402, ACM, 2018.
  • M. Virgolin, T. Alderliesten, P.A.N. Bosman. Linear Scaling with and within Semantic Backpropagation-based Genetic Programming for Symbolic Regression. In Proceedings of the Genetic and Evolutionary Computation Conference - GECCO-2019, pages 1084-1092, ACM, 2019.
  • M. Virgolin, T. Alderliesten, C. Witteveen, P.A.N. Bosman. Improving Model-based Genetic Programming for Symbolic Regression of Small Expressions. Evolutionary Computation, pages 211-237, MIT Press, 2021.
  • M. Virgolin, T. Alderliesten, P.A.N. Bosman. On Explaining Machine Learning Models by Evolving Crucial and Compact Features. Swarm and Evolutionary Computation, v. 53, p. 100640, 2020.
  • M. Virgolin, A. De Lorenzo, E. Medvet, F. Randone. "Learning a Formula of Interpretability to Learn Interpretable Formulas". Parallel Problem Solving from Nature - PPSN-XVI, pages 79-93, Springer, 2020.
  • W. La Cava, P. Orzechowski, B, Burlacu, F.O. de França, M. Virgolin, Y. Jin, M. Kommenda, J.H. Moore. Contemporary symbolic regression methods and their relative performance. Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021.

Installation

Native

This code can be compiled on Linux, we tested on Ubuntu and Fedora (kudos to @abouter for helping out). There are a few steps to follow:

  • Inspect and potentially edit the deps_ubuntu (or deps_fedora) file to align it to your system (no change should be needed).
  • Run chmod +x deps_ubuntu; sudo ./deps_ubuntu (or chmod +x deps_fedora; sudo ./deps_fedora) to install the needed dependencies on your system.

The project is built using CMake (kudos to @EviSijben). Run make (or make debug for a debug build). To test that everything works fine, run python3 test.py. You can use Ninja to speed up builds, by prefixing the make command with GEN=ninja (e.g. GEN=ninja make release).

Conda

To install using conda, run:

conda env create -f environment.yml
conda activate gpgomenv
make

Docker

This code can also be compiled and run inside a Docker container (kudos to @roy-tc for providing this!):

docker build -t gp-gomea .

This will:

  • Install dependencies (implementation derived from on deps_ubuntu).
  • Build the project.
  • Test that everything is fine using python3 test.py.

You can run the container in interactive mode using docker run -it gp-gomea and issue for instance python3 test.py or execute your own script.

Using the Python interface

See test.py. For example:

from pyGPGOMEA import GPGOMEARegressor as GPGR
from sklearn.datasets import load_boston
import numpy as np

X, y = load_boston(return_X_y=True)
model = GPGR( gomea=True, ims='5_1', generations=10, seed=42 )
model.fit(X, y)
print('Model found:', model.get_model())
print('RMSE:', np.sqrt( np.mean( np.square( model.predict(X) - y ) ) ))

Take a look at test.py for more details.

Do you want to run a C++ executable instead?

After running make, you will find a C++ executable called main in build/release/src/ that you can run using a parameter setting file, for example, main --file params_sgp.txt.

Datasets

For the C++ executable, datasets must be organized as follows. Each row is an example, and each column is a feature, with exception for the last column, which is the target variable. Values should be separated by spaces. Do not include any textual header. You can find examples at: https://goo.gl/9D2z3b