openml-labs/gama

Better warm-starting by automatically converting a pipeline or pipeline string to a GAMA individual string

Opened this issue · 6 comments

Converting a model string to the GAMA individual string format currently takes a lot of work for the user. It would be great to have a function for that, or for GAMA to accept the pipeline string directly for warm-starting.

Here is the gist for my last experiment, where I still had to eliminate some of the search space to make it work: https://gist.github.com/prabhant/ebc0f4f9eb17fec4a80047f2aeb4b184
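A minimal sketch of what such a helper could look like, assuming the individual string nests steps as `Outer(Inner(data), Inner.param=value), Outer.param=value)`. The name `to_individual_string` and the plain `steps` / `search_space` structures are hypothetical stand-ins for illustration; they are not part of GAMA, and a real version would read the steps from the sklearn pipeline and the hyperparameter names from the GAMA configuration.

```python
def to_individual_string(steps, search_space):
    """Nest pipeline steps into one individual string.

    steps: (class_name, params) pairs in pipeline order, first step first.
    search_space: maps class_name to the hyperparameter names to emit.
    Both structures are hypothetical stand-ins, not GAMA objects.
    """
    expression = "data"  # the innermost argument is the data terminal
    for name, params in steps:
        # repr() quotes strings and leaves numbers and booleans bare
        terminals = [
            f"{name}.{hp}={params[hp]!r}"
            for hp in sorted(search_space.get(name, ()))
            if hp in params
        ]
        # Wrap everything built so far in the current step
        arguments = ", ".join([expression, *terminals])
        expression = f"{name}({arguments})"
    return expression


steps = [
    ("RobustScaler", {}),
    ("FastICA", {"tol": 0.75, "whiten": "unit-variance"}),
    ("ExtraTreesClassifier", {"max_features": 0.8, "min_samples_leaf": 2}),
]
search_space = {
    "RobustScaler": [],
    "FastICA": ["tol", "whiten"],
    "ExtraTreesClassifier": ["max_features", "min_samples_leaf"],
}
example = to_individual_string(steps, search_space)
print(example)
```

Hyperparameters are sorted by name only to make the output deterministic; whether GAMA needs a particular order is not something this sketch checks.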

I have tried working with the code posted by @prabhant. However, when I try to warm-start GAMA, I get an error. The code to reproduce the error is listed below:

from sklearn.decomposition import FastICA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from gama.configuration.classification import clf_config

p = Pipeline(steps=[('imputation',SimpleImputer(strategy='median')),('2',RobustScaler()),('1',FastICA(tol=0.75,whiten='unit-variance')),('0',ExtraTreesClassifier(max_features=0.8,min_samples_leaf=2,min_samples_split=5))])

# Drop the imputation step if present (it is the first step of this pipeline)
if 'imputation' in p.named_steps:
  p = p[1:]

# Class names of the pipeline steps, in pipeline order
l = [type(p[i]).__name__ for i in range(len(p))]

# Build the individual string piece by piece
s = []
# Opening part: the last pipeline step is the outermost call
for name in reversed(l):
  s.append(f"{name}(")
# The innermost argument is the data terminal
s.append("data")
# Hyperparameters: only those that appear in GAMA's search space for the step
for i in range(len(p)):
  name = l[i]
  keys = list(p[i].__dict__.keys() & clf_config[p[i].__class__].keys())
  for j in keys:
    value = p[i].__dict__[j]
    # Strings are quoted, all other values are written as-is
    text = f"'{value}'" if isinstance(value, str) else str(value)
    # No separator after the last hyperparameter of a step
    sep = '' if j == keys[-1] else ', '
    s.append(f"{name}.{j}={text}{sep}")
  s.append('), ')
s[-1] = ')'

#incorrect format:
#print(s)
['ExtraTreesClassifier(', 'FastICA(', 'RobustScaler(', 'data', '), ', 'FastICA.tol=0.75, ', "FastICA.whiten='unit-variance'", '), ', 'ExtraTreesClassifier.min_samples_leaf=2, ', "ExtraTreesClassifier.criterion='gini', ", 'ExtraTreesClassifier.min_samples_split=5, ', 'ExtraTreesClassifier.bootstrap=False, ', 'ExtraTreesClassifier.n_estimators=100, ', 'ExtraTreesClassifier.max_features=0.8', ')']

#but when I do this:

warm_starting_candidates = [''.join(s)]

#I think this is the correct format
#print(warm_starting_candidates)

["ExtraTreesClassifier(FastICA(RobustScaler(data), FastICA.tol=0.75, FastICA.whiten='unit-variance'), ExtraTreesClassifier.min_samples_leaf=2, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.min_samples_split=5, ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.n_estimators=100, ExtraTreesClassifier.max_features=0.8)"]

#However, in the context of warm-starting, I get the following error:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from gama import GamaClassifier

if __name__ == '__main__':
    X, y = load_breast_cancer(return_X_y=True)  
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    automl = GamaClassifier(max_total_time=180, store="nothing")
    print("Starting `fit` which will take roughly 3 minutes.")
    automl.fit(X_train, y_train, warm_start = warm_starting_candidates)

#Error Message
KeyError: "Could not find Terminal of type 'ExtraTreesClassifier.min_samples_leaf=2'."

@Wman1001 This means you might not have this value in your search space. Can you check?
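One way to check this before calling `fit` is to compare every `Class.param=value` terminal in the candidate string with the values listed in the search space. The sketch below is only an illustration under a few assumptions: `missing_terminals` is a hypothetical helper (not part of GAMA), the search space is a plain dict keyed by class name (GAMA's `clf_config` is keyed by the classes themselves), and values contain no commas or parentheses.

```python
import re


def missing_terminals(candidate, search_space):
    """List the 'Class.param=value' terminals that search_space does not contain."""
    missing = []
    # A terminal runs from 'Class.param=' up to the next comma or closing parenthesis
    for cls_name, param, raw in re.findall(r"(\w+)\.(\w+)=([^,)]+)", candidate):
        allowed = search_space.get(cls_name, {}).get(param, [])
        # Compare textual forms, since the candidate is a string
        if raw not in {repr(value) for value in allowed}:
            missing.append(f"{cls_name}.{param}={raw}")
    return missing


# Toy search space: min_samples_leaf has no values at the classifier level
toy_space = {
    "FastICA": {"tol": [0.75], "whiten": ["unit-variance"]},
    "ExtraTreesClassifier": {"max_features": [0.8], "min_samples_leaf": []},
}
candidate = (
    "ExtraTreesClassifier(FastICA(RobustScaler(data), FastICA.tol=0.75, "
    "FastICA.whiten='unit-variance'), ExtraTreesClassifier.max_features=0.8, "
    "ExtraTreesClassifier.min_samples_leaf=2)"
)
not_found = missing_terminals(candidate, toy_space)
print(not_found)
```

With this toy search space, the only terminal reported is the one named in the error message above.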

Correct, I added them to the search space, which fixes the issue. Is it intended that the tree-based models have an empty list for min_samples_leaf and min_samples_split by default?

The search space depends on your needs, but if you do not define any values, GAMA only uses the default value for that parameter.

Actually, a definition of the form

Classifier: {
 "hyperparameter": []
 }

means that the hyperparameter is defined at the search-space level instead of just the classifier level, which allows certain hyperparameters to be "shared" across different classifiers. See also 1 and 2
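To illustrate the idea, here is a toy model of such a lookup. It is a sketch of the concept, not GAMA's implementation: `resolve_values` is a hypothetical helper, the search space is keyed by class name instead of by class, and the values are illustrative.

```python
def resolve_values(search_space, cls_name, param):
    """Return the candidate values for a hyperparameter of one classifier.

    An empty list at the classifier level is treated as a reference to the
    entry with the same name at the top level of the search space.
    """
    values = search_space[cls_name][param]
    if isinstance(values, list) and not values:
        return list(search_space[param])
    return list(values)


toy_space = {
    # Defined once at the search-space level (illustrative values)
    "min_samples_leaf": range(1, 21),
    # Both classifiers refer to the shared definition through an empty list
    "ExtraTreesClassifier": {"min_samples_leaf": [], "bootstrap": [True, False]},
    "RandomForestClassifier": {"min_samples_leaf": []},
}

shared = resolve_values(toy_space, "ExtraTreesClassifier", "min_samples_leaf")
own = resolve_values(toy_space, "ExtraTreesClassifier", "bootstrap")
print(shared)
print(own)
```

In this toy model, an empty list does not mean "default only"; it points both classifiers at the same top-level values.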