Better warm starting with automatically converting pipeline or pipeline string to gama individual string
Opened this issue · 6 comments
It is currently a lot of work for a user to convert a model string to the GAMA individual string format. It would be great to have a function for that, or for GAMA to automatically accept a pipeline string for warm starting.
Here is the gist for my last experiment where I still had to eliminate some of the search space to make it work https://gist.github.com/prabhant/ebc0f4f9eb17fec4a80047f2aeb4b184
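For reference, the conversion this issue asks for can be sketched with a small stdlib-only helper. The function name `pipeline_to_individual_str` and the plain `(name, hyperparameters)` step representation are illustrative, not part of GAMA's API; the sketch only builds the nested individual string format shown later in this thread, innermost step first:

```python
# Hypothetical helper: build a GAMA-style individual string from a list of
# (estimator_name, hyperparameter_dict) steps, ordered from the first
# transformer to the final estimator. Produces the nested format
# "Outer(Inner(data), Outer.param=value, ...)".
def pipeline_to_individual_str(steps):
    expr = "data"
    for name, params in steps:
        args = [expr]
        for key, value in params.items():
            # String hyperparameters are quoted, everything else is rendered as-is.
            rendered = f"'{value}'" if isinstance(value, str) else str(value)
            args.append(f"{name}.{key}={rendered}")
        expr = f"{name}({', '.join(args)})"
    return expr

steps = [
    ("RobustScaler", {}),
    ("FastICA", {"tol": 0.75, "whiten": "unit-variance"}),
    ("ExtraTreesClassifier", {"min_samples_leaf": 2, "min_samples_split": 5}),
]
print(pipeline_to_individual_str(steps))
# -> ExtraTreesClassifier(FastICA(RobustScaler(data), FastICA.tol=0.75,
#    FastICA.whiten='unit-variance'), ExtraTreesClassifier.min_samples_leaf=2,
#    ExtraTreesClassifier.min_samples_split=5)
```

A real helper would still need to restrict the emitted hyperparameters to those present in the search-space configuration, as the code below does.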
I have tried working with the code posted by @prabhant; however, when I try to warm-start GAMA, I get an error. The code for reproducing the error is listed below:
from sklearn.decomposition import FastICA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from gama.configuration.classification import clf_config
p = Pipeline(steps=[
    ('imputation', SimpleImputer(strategy='median')),
    ('2', RobustScaler()),
    ('1', FastICA(tol=0.75, whiten='unit-variance')),
    ('0', ExtraTreesClassifier(max_features=0.8, min_samples_leaf=2, min_samples_split=5)),
])
# Drop the imputation step if present
try:
    if p['imputation']:
        p = p[1:]
except KeyError:
    pass
l = []
for i in range(len(p)):
    l.append(str(p[i].__class__()).replace('()',''))
#making string from pipeline
s = []
#For making list
for i in reversed(l):
    s.append(f"{i}(")
#for making data
data_string ="data"
s.append(data_string)
#for making hyperparameters
for i in range(len(p)):
    keys = p[i].__dict__.keys() & clf_config[p[i].__class__].keys()
    for j in keys:
        if j == list(keys)[-1]:
            if type(p[i].__dict__[j]) == str:
                s.append(f"{str(p[i].__class__()).replace('()','')}.{j}='{p[i].__dict__[j]}'")
            else:
                s.append(f"{str(p[i].__class__()).replace('()','')}.{j}={p[i].__dict__[j]}")
        else:
            if type(p[i].__dict__[j]) == str:
                s.append(f"{str(p[i].__class__()).replace('()','')}.{j}='{p[i].__dict__[j]}', ")
            else:
                s.append(f"{str(p[i].__class__()).replace('()','')}.{j}={p[i].__dict__[j]}, ")
    s.append('), ')
s[-1] = ')'
# This is not yet the correct format; print(s) gives:
# ['ExtraTreesClassifier(', 'FastICA(', 'RobustScaler(', 'data', '), ', 'FastICA.tol=0.75, ', "FastICA.whiten='unit-variance'", '), ', 'ExtraTreesClassifier.min_samples_leaf=2, ', "ExtraTreesClassifier.criterion='gini', ", 'ExtraTreesClassifier.min_samples_split=5, ', 'ExtraTreesClassifier.bootstrap=False, ', 'ExtraTreesClassifier.n_estimators=100, ', 'ExtraTreesClassifier.max_features=0.8', ')']
# But joining the fragments gives what I think is the correct format:
warm_starting_candidates = [''.join(s)]
# print(warm_starting_candidates) gives:
# ["ExtraTreesClassifier(FastICA(RobustScaler(data), FastICA.tol=0.75, FastICA.whiten='unit-variance'), ExtraTreesClassifier.min_samples_leaf=2, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.min_samples_split=5, ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.n_estimators=100, ExtraTreesClassifier.max_features=0.8)"]
#However, in the context of warm-starting, I get the following error:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from gama import GamaClassifier
if __name__ == '__main__':
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
    automl = GamaClassifier(max_total_time=180, store="nothing")
    print("Starting `fit` which will take roughly 3 minutes.")
    automl.fit(X_train, y_train, warm_start=warm_starting_candidates)
# Error message:
KeyError: "Could not find Terminal of type 'ExtraTreesClassifier.min_samples_leaf=2'."
@Wman1001 This means you might not have this value in your search space; can you check that?
Correct, adding them to the search space does fix the issue. I was wondering whether it is intended that the tree-based models have an empty list for min_samples_leaf and min_samples_split by default?
The search space depends on your needs, but if you do not define any values, GAMA only uses the default value for that parameter.
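To illustrate the answer above: the search-space configuration maps each estimator's hyperparameters to a list of candidate values, and an empty list leaves only the default, so no Terminal exists for a warm-start string that names that hyperparameter. A minimal sketch, with a plain dict keyed by class name standing in for GAMA's clf_config (the real one is keyed by the sklearn classes themselves):

```python
# Stand-in for gama.configuration.classification.clf_config (illustrative:
# the real dict is keyed by sklearn estimator classes, not strings).
clf_config = {
    "ExtraTreesClassifier": {
        "min_samples_leaf": [],   # empty -> no Terminal; warm start raises KeyError
        "min_samples_split": [],
    },
}

# Adding candidate values creates Terminals such as
# "ExtraTreesClassifier.min_samples_leaf=2", which fixes the KeyError:
clf_config["ExtraTreesClassifier"]["min_samples_leaf"] = list(range(1, 21))
clf_config["ExtraTreesClassifier"]["min_samples_split"] = list(range(2, 21))

print(2 in clf_config["ExtraTreesClassifier"]["min_samples_leaf"])   # True
print(5 in clf_config["ExtraTreesClassifier"]["min_samples_split"])  # True
```

The candidate value ranges shown are examples, not GAMA defaults; any list that contains the values used in the warm-start string will do.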