Avoiding repeated work: cache node models via node hashing or manually
perib opened this issue · 0 comments
During the evolutionary algorithm, TPOT2 will often fit the exact same estimator on the exact same data. Ideally, we should be able to catch this and reuse a cached version of the estimator that we previously trained.
Currently, we have support for the Joblib Memory module. The joblib memory parameter can be used to cache the inputs and outputs of the fit functions within the graphpipeline. However, this is memory/storage intensive, because it needs to store multiple transformations of the data.
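For reference, this is the same mechanism scikit-learn exposes on its own `Pipeline` via the `memory` parameter: transformer fits are cached on disk, keyed on the transformer and its input, so refitting with identical inputs is a cache lookup. A minimal sketch:

```python
from tempfile import mkdtemp

from joblib import Memory
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=20, random_state=0)

# Transformer fits and outputs are stored in cache_dir; a second fit with
# identical inputs loads the cached results instead of recomputing them.
cache_dir = mkdtemp()
pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=5)),
        ("clf", LogisticRegression()),
    ],
    memory=Memory(location=cache_dir, verbose=0),
)
pipe.fit(X, y)
```

The drawback described above applies directly: the cache holds every intermediate transformed dataset, not just the fitted estimators.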
Another approach could be to use graph theory to hash node locations within a graph. We could then store just the fitted estimator, without storing its inputs and outputs.
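One way this could look (a hypothetical sketch, not TPOT2's actual API): a node's cache key is derived from its estimator specification, the keys of its parent nodes, and a hash of the training data, so two graphs that feed the same upstream chain into the same estimator share one fitted model.

```python
import hashlib
import pickle

def node_hash(estimator_spec, parent_hashes, data_hash):
    # Key a node by its own spec plus the (order-independent) keys of its
    # parents, so identical subgraphs on identical data collide on purpose.
    payload = pickle.dumps((estimator_spec, sorted(parent_hashes), data_hash))
    return hashlib.sha256(payload).hexdigest()

fitted_cache = {}  # node hash -> fitted estimator

def fit_cached(key, estimator, X, y):
    # Only fit when we have never seen this (subgraph, data) combination.
    if key not in fitted_cache:
        fitted_cache[key] = estimator.fit(X, y)
    return fitted_cache[key]
```

Only the fitted estimators live in the cache; intermediate transformed data is recomputed on demand, trading CPU for storage relative to the joblib approach.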
One more idea could be to store fitted pipelines manually within the graphindividual. Mutation/crossover operations would keep track of which nodes need to be refitted, and crossover would copy the fitted estimators to the other graphindividuals. We would have to store a fitted model for each node and for each fold. This would work, but it would add a lot of complexity to the code.
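The invalidation part of that bookkeeping could be sketched like this (all names here are hypothetical, not TPOT2 internals): when a mutation touches a node, its cached per-fold fits are dropped, along with those of every downstream node, since their inputs have changed.

```python
class GraphIndividual:
    """Hypothetical sketch of per-node, per-fold fitted-model bookkeeping."""

    def __init__(self, nodes, downstream):
        self.nodes = nodes            # node id -> estimator spec
        self.downstream = downstream  # node id -> ids consuming its output
        self.fitted = {}              # (node id, fold) -> fitted estimator

    def invalidate(self, node_id):
        # A mutated node and everything downstream of it must be refit;
        # untouched upstream nodes keep their cached fits.
        for nid in [node_id] + self._descendants(node_id):
            for key in [k for k in self.fitted if k[0] == nid]:
                del self.fitted[key]

    def _descendants(self, node_id):
        seen, stack = [], list(self.downstream.get(node_id, []))
        while stack:
            nid = stack.pop()
            if nid not in seen:
                seen.append(nid)
                stack.extend(self.downstream.get(nid, []))
        return seen
```

Crossover would additionally copy the surviving `(node, fold)` entries into the offspring, which is where most of the complexity mentioned above comes from.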