GP-NLS-julia

Julia implementation of the GP-NLS algorithm for symbolic regression described in the paper:

Kommenda, M., Burlacu, B., Kronberger, G. et al. Parameter identification for symbolic regression using nonlinear least squares. Genet Program Evolvable Mach 21, 471–501 (2020).

The documentation of this package is available here.

What is Symbolic Regression?

Symbolic regression is the task of finding a good mathematical expression to describe the relationship between a set of independent variables $X$ and a dependent variable $y$, normally represented as tabular data:

$$X = \begin{bmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nd} \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$$

In other words, suppose that you have $n$ observations with $d$ variables each, and a response variable $y$ that you suppose is related to the inputs by $y = f(\mathbf{x})$, but the function $f$ is unknown: we can only see how the response changes when the input changes, but we don't know how the response is described by the variables of the problem. Symbolic regression tries to find a function $\widehat{f}$ that approximates the output of the unknown function $f$ just by learning mathematical structures from the data itself.
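
In the simplest formulation, the search can be stated as the following optimization (a sketch using the Mean Squared Error, which is also mentioned below as a possible fitness function):

$$\widehat{f} = \arg\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(\mathbf{x}_i) \right)^2$$

where $\mathcal{F}$ is the space of expressions that can be built from the algorithm's function and terminal sets.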

How GP-NLS works

The idea behind the use of evolutionary algorithms is to computationally manipulate a population of mathematical expressions (represented using expression trees). A fitness function measures how good each expression is (this function could be, for example, the Mean Squared Error), so each individual in the population has a fitness that represents how well its function describes the data. Through variation operations (which define the exploration power of the algorithm, performed over the full search space, and the exploitation power, which performs a local search) and selective pressure (which promotes the maintenance of good solutions in the population), the simulated evolutionary process tends to converge to the good solutions present in the population. However, it is worth noting that there is no guarantee that the optimal solution will be found, although the algorithm will generally return a good solution if possible.

The genetic programming algorithm starts with a random population of solutions, represented by trees. Then, using the fitness function, it repeats the process of selecting parents, performing crossover between them, applying mutation to the child solutions, and finally choosing the new generation from among parents and children. This process is repeated until a stopping criterion is met.
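
To make that loop concrete, here is a minimal Julia-style sketch of the generic GP cycle. All names in it (random_tree, fitness, tournament, crossover, mutate) are illustrative placeholders, not the GP_NLS API:

```julia
# Sketch of a generational GP loop. Placeholder functions are assumed:
# random_tree() builds a random expression tree, fitness(t, X, y) scores it
# (lower is better, e.g. MSE), and tournament/crossover/mutate are the
# usual GP operators.
function gp_sketch(X, y; pop_size=100, generations=50)
    population = [random_tree() for _ in 1:pop_size]
    for _ in 1:generations
        children = similar(population)
        for i in 1:pop_size
            p1 = tournament(population, X, y)  # selective pressure
            p2 = tournament(population, X, y)
            child = crossover(p1, p2)          # exploration
            children[i] = mutate(child)        # variation
        end
        # New generation: keep the best among parents and children
        merged = vcat(population, children)
        sort!(merged, by = t -> fitness(t, X, y))
        population = merged[1:pop_size]
    end
    return population[1]  # best expression found
end
```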

GP-NLS creates symbolic trees as usual but expands them by adding an intercept, a slope, and a coefficient to every variable. These new free parameters are then adjusted using the Levenberg-Marquardt non-linear least squares method.
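
To illustrate just the parameter-fitting step, the sketch below uses the LsqFit.jl package, whose curve_fit defaults to Levenberg-Marquardt. This shows the idea, not this package's internal code; the example expression and data are made up:

```julia
using LsqFit  # curve_fit uses Levenberg-Marquardt by default

# Suppose GP evolved the structure exp(x) + x^2. GP-NLS wraps it as
#   p[1] + p[2] * (exp(p[3]*x) + (p[4]*x)^2)
# i.e. an intercept p[1], a slope p[2], and a coefficient on each variable.
model(x, p) = p[1] .+ p[2] .* (exp.(p[3] .* x) .+ (p[4] .* x) .^ 2)

# Synthetic data generated from known parameters plus noise
xdata = collect(range(0, 2, length=50))
ydata = 1.5 .+ 0.8 .* (exp.(1.2 .* xdata) .+ (0.5 .* xdata) .^ 2) .+ 0.1 .* randn(50)

p0  = ones(4)                            # all free parameters start at 1.0
fit = curve_fit(model, xdata, ydata, p0) # Levenberg-Marquardt adjustment
fit.param                                # the fitted coefficients
```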

GP-NLS tree example


The implementation is within the folder ./GP_NLS. In ./docs you can find automatically generated documentation for the package. In ./experiments there are scripts to evaluate the different ways of creating/optimizing the free parameters of symbolic regression models, with results reported in ./experiments/results and comparative plots in ./experiments/plots.

Installing:

Inside the git root folder (which contains the implementation of GP-NLS for symbolic regression), start a Julia REPL and enter the package manager environment by pressing ]. Then, to add GP_NLS to your local packages:

dev ./GP_NLS

Now you can use the "original" GP as well as the GP with non-linear least squares optimization by importing it:

using GP_NLS

The first time you import it, Julia will precompile all files; this can take a while.

Minimal usage:

There is a simple Pluto notebook inside ./examples showing how to load a data set and execute the GP algorithm for symbolic regression; a rough sketch of such a run follows below.
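
The entry point name and the evaluation helper in this sketch are illustrative assumptions, not the confirmed exported API; check the notebook for the actual names and keyword arguments:

```julia
using GP_NLS, DelimitedFiles

# Hypothetical file path and hypothetical GP/evaluate entry points;
# the real names and options are shown in ./examples.
data = readdlm("./datasets/airfoil.csv", ',')
X, y = data[:, 1:end-1], Float64.(data[:, end])

best = GP(X, y)           # hypothetical: evolve and return the best tree
yhat = evaluate(best, X)  # hypothetical: predictions of the best tree
```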

Testing:

Inside the ./GP_NLS folder (which contains the implementation of GP-NLS for symbolic regression), start a Julia REPL and enter the package manager environment by pressing ].

First you need to activate the local package:

activate .

Then you can run the tests:

test

Benchmarking:

There are three scripts to benchmark different ways of creating/optimizing the constants in the expression trees. You can execute the experiments for a list of datasets with the scripts inside the ./experiments folder.

For example, to test the GP-NLS coefficient optimization for the airfoil and cars data set:

julia LsqOptExperiment.jl airfoil cars

The experiment script will perform 30 executions for each data set, and the results are saved inside ./experiments/results. To further analyse the results, you can run the post_hoc_analysis.jl script (which is a Pluto notebook):

julia post_hoc_analysis.jl

and the plots comparing the methods will be saved in ./experiments/plots.

Building the docs:

You need to have Documenter.jl. First, install it using the package manager:

import Pkg; Pkg.add("Documenter")

Then, inside the ./docsource folder, run in the terminal:

julia make.jl