Better warm starting with automatically converting pipeline or pipeline string to gama individual string
Opened this issue · 6 comments
It is currently a lot of work for a user to convert a model string to the GAMA individual string format. It would be great to have a function for that, or for GAMA to automatically accept a pipeline string for warm starting.
Here is the gist for my last experiment where I still had to eliminate some of the search space to make it work https://gist.github.com/prabhant/ebc0f4f9eb17fec4a80047f2aeb4b184
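For reference, the conversion this issue asks for can be sketched with a small stdlib-only helper. The function name `pipeline_to_individual_str` and the plain `(name, hyperparameters)` step representation are illustrative, not part of GAMA's API; the sketch only builds the nested individual string format shown later in this thread, innermost step first:

```python
# Hypothetical helper: build a GAMA-style individual string from a list of
# (estimator_name, hyperparameter_dict) steps, ordered from the first
# transformer to the final estimator. Produces the nested format
# "Outer(Inner(data), Outer.param=value, ...)".
def pipeline_to_individual_str(steps):
    expr = "data"
    for name, params in steps:
        args = [expr]
        for key, value in params.items():
            # String hyperparameters are quoted, everything else is rendered as-is.
            rendered = f"'{value}'" if isinstance(value, str) else str(value)
            args.append(f"{name}.{key}={rendered}")
        expr = f"{name}({', '.join(args)})"
    return expr

steps = [
    ("RobustScaler", {}),
    ("FastICA", {"tol": 0.75, "whiten": "unit-variance"}),
    ("ExtraTreesClassifier", {"min_samples_leaf": 2, "min_samples_split": 5}),
]
print(pipeline_to_individual_str(steps))
# -> ExtraTreesClassifier(FastICA(RobustScaler(data), FastICA.tol=0.75,
#    FastICA.whiten='unit-variance'), ExtraTreesClassifier.min_samples_leaf=2,
#    ExtraTreesClassifier.min_samples_split=5)
```

A real helper would still need to restrict the emitted hyperparameters to those present in the search-space configuration, as the code below does.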
I have tried working with the code posted by @prabhant; however, when I try to warm-start GAMA, I get an error. The code for reproducing the error is listed below:
from sklearn.decomposition import FastICA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from gama.configuration.classification import clf_config
p = Pipeline(steps=[
    ('imputation', SimpleImputer(strategy='median')),
    ('2', RobustScaler()),
    ('1', FastICA(tol=0.75, whiten='unit-variance')),
    ('0', ExtraTreesClassifier(max_features=0.8, min_samples_leaf=2, min_samples_split=5)),
])
# Drop the imputation step if present
try:
    if p['imputation']:
        p = p[1:]
except KeyError:
    pass
l = []
for i in range(len(p)):
    l.append(str(p[i].__class__()).replace('()',''))
#making string from pipeline
s = []
#For making list
for i in reversed(l):
    s.append(f"{i}(")
#for making data
data_string ="data"
s.append(data_string)
#for making hyperparameters
for i in range(len(p)):
    keys = p[i].__dict__.keys() & clf_config[p[i].__class__].keys()
    for j in keys:
        if j == list(keys)[-1]:
            if type(p[i].__dict__[j]) == str:
                s.append(f"{str(p[i].__class__()).replace('()','')}.{j}='{p[i].__dict__[j]}'")
            else:
                s.append(f"{str(p[i].__class__()).replace('()','')}.{j}={p[i].__dict__[j]}")
        else:
            if type(p[i].__dict__[j]) == str:
                s.append(f"{str(p[i].__class__()).replace('()','')}.{j}='{p[i].__dict__[j]}', ")
            else:
                s.append(f"{str(p[i].__class__()).replace('()','')}.{j}={p[i].__dict__[j]}, ")
    s.append('), ')
s[-1] = ')'
# This is not yet the correct format; print(s) gives:
# ['ExtraTreesClassifier(', 'FastICA(', 'RobustScaler(', 'data', '), ', 'FastICA.tol=0.75, ', "FastICA.whiten='unit-variance'", '), ', 'ExtraTreesClassifier.min_samples_leaf=2, ', "ExtraTreesClassifier.criterion='gini', ", 'ExtraTreesClassifier.min_samples_split=5, ', 'ExtraTreesClassifier.bootstrap=False, ', 'ExtraTreesClassifier.n_estimators=100, ', 'ExtraTreesClassifier.max_features=0.8', ')']
# But joining the fragments gives what I think is the correct format:
warm_starting_candidates = [''.join(s)]
# print(warm_starting_candidates) gives:
# ["ExtraTreesClassifier(FastICA(RobustScaler(data), FastICA.tol=0.75, FastICA.whiten='unit-variance'), ExtraTreesClassifier.min_samples_leaf=2, ExtraTreesClassifier.criterion='gini', ExtraTreesClassifier.min_samples_split=5, ExtraTreesClassifier.bootstrap=False, ExtraTreesClassifier.n_estimators=100, ExtraTreesClassifier.max_features=0.8)"]
#However, in the context of warm-starting, I get the following error:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from gama import GamaClassifier
if __name__ == '__main__':
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
    automl = GamaClassifier(max_total_time=180, store="nothing")
    print("Starting `fit` which will take roughly 3 minutes.")
    automl.fit(X_train, y_train, warm_start=warm_starting_candidates)
# Error message:
KeyError: "Could not find Terminal of type 'ExtraTreesClassifier.min_samples_leaf=2'."
@Wman1001 This means you might not have this value in your search space; can you check that?
Correct, adding them to the search space does fix the issue. I was wondering whether it is intended that the tree-based models have an empty list for min_samples_leaf and min_samples_split by default?
The search space depends on your needs, but if you do not define any values, GAMA only uses the default value for that parameter.
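To illustrate the answer above: the search-space configuration maps each estimator's hyperparameters to a list of candidate values, and an empty list leaves only the default, so no Terminal exists for a warm-start string that names that hyperparameter. A minimal sketch, with a plain dict keyed by class name standing in for GAMA's clf_config (the real one is keyed by the sklearn classes themselves):

```python
# Stand-in for gama.configuration.classification.clf_config (illustrative:
# the real dict is keyed by sklearn estimator classes, not strings).
clf_config = {
    "ExtraTreesClassifier": {
        "min_samples_leaf": [],   # empty -> no Terminal; warm start raises KeyError
        "min_samples_split": [],
    },
}

# Adding candidate values creates Terminals such as
# "ExtraTreesClassifier.min_samples_leaf=2", which fixes the KeyError:
clf_config["ExtraTreesClassifier"]["min_samples_leaf"] = list(range(1, 21))
clf_config["ExtraTreesClassifier"]["min_samples_split"] = list(range(2, 21))

print(2 in clf_config["ExtraTreesClassifier"]["min_samples_leaf"])   # True
print(5 in clf_config["ExtraTreesClassifier"]["min_samples_split"])  # True
```

The candidate value ranges shown are examples, not GAMA defaults; any list that contains the values used in the warm-start string will do.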