maxpumperla/hyperas

Working with non-hardcoded data

flexthink opened this issue · 4 comments

So far I haven't found a way to pass parameters to the data() function, and Hyperas appears to ignore everything in the code except imports. This is because Hyperas creates a new Python file out of data(), model() and everything else before attempting to train. That works well if you're training on MNIST or some other dataset that ships with the framework, or on random data. But what if the dataset is selected from a drop-down or retrieved from a URL? What if you want to run it from a script whose config file specifies the path to the data? What if it needs to read from a database? Is there a way to do this with Hyperas as it is currently set up? If not, is there anything on the roadmap?

As you pointed out, hyperas is currently a simple wrapper that uses data() and model() as templates from which it formats code that it then executes. This means that within data() you have to define everything, just like in a regular script.

In all of your examples, you basically want to be able to generate new templates that hyperas can call.

For example, let's say you have an application that runs hyperas on a dataset chosen at runtime:

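# the body and the return statement must be indented to sit inside data()
data_template = "def data():\n{pipeline}\n    return x_train, y_train, x_test, y_test"

# each pipeline is the already-indented body of data()
pipelines = {'mnist': '    import something\n    # some reshape\n    # some scaling', ...}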

def get_data_func(dset):
    pipeline = pipelines[dset]
    return data_template.format(pipeline=pipeline)

def model(x_train, y_train, x_test, y_test):
    # define model
    return {'loss': -acc, ...}

def do_optimize(input_dset):
    data_func_string = get_data_func(input_dset)
    best_run, best_model = optim.minimize(model=model,
                                          data=data_func_string,
                                          ...)
    return best_run, best_model

if __name__ == '__main__':
    input_dset = input('What dataset do you want to optimize a model for? ')
    best_run, best_model = do_optimize(input_dset)

get_data_func('mnist') would return a string like:

'''def data():
    import something
    # some reshape
    # some scaling
    return x_train, y_train, x_test, y_test'''

This is currently not allowed, but it shouldn't take too long to hack in. It is basically a matter of making sure the formatting is consistent with the internals of hyperas. The source that you'd want to touch is here, around line 194 or so.

Something like:

if not isinstance(data, str):
    # line 194
else:
    data_string = data

The example above is also not how you should template strings in this situation. I recommend something like jinja if you are really going to go down that path and need flexibility. It may be better to just go with regular hyperopt in this situation.
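If you stay with plain string templates, here is a minimal sketch of building the data() source with consistent indentation using only the standard library. The names PIPELINES, DATA_TEMPLATE and the 'toy' pipeline are hypothetical and invented for illustration. The sketch only builds and sanity-checks the source string; it does not call hyperas, since passing a string as data is not supported as noted above.

```python
import textwrap

# Hypothetical pipelines: each value is the body of a data() function,
# written at any indentation. The toy body fabricates tiny lists so the
# sketch runs without a real dataset.
PIPELINES = {
    "toy": """
        x_train, y_train = [[0.0], [1.0]], [0, 1]
        x_test, y_test = [[0.5]], [1]
    """,
}

DATA_TEMPLATE = "def data():\n{body}\n    return x_train, y_train, x_test, y_test\n"


def get_data_func(dset):
    """Return the source code of a data() function for the chosen pipeline."""
    # Normalise the body to zero indentation, then indent it by four spaces
    # so it sits inside the function definition.
    body = textwrap.dedent(PIPELINES[dset]).strip("\n")
    return DATA_TEMPLATE.format(body=textwrap.indent(body, "    "))


source = get_data_func("toy")

# compile() raises SyntaxError early if the generated source is malformed.
compile(source, "<data>", "exec")

# Execute the generated source in a scratch namespace and call data().
namespace = {}
exec(source, namespace)
x_train, y_train, x_test, y_test = namespace["data"]()
```

Calling compile() on the generated string before handing it to anything else catches indentation mistakes like the ones that are easy to make with hand-written "\n" separators.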

Does this help?

Another way is to pickle the arguments for data() to a file and then unpickle them inside data(). The file path has to be hardcoded in data(). You can do the same for the model() function. Instead of pickling the arguments, you can also save them as a plaintext file.

A simple example:

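def data():
    import argparse
    import pickle

    # the path is hardcoded because data() cannot take arguments
    args_file = 'data_args.pkl'
    with open(args_file, 'rb') as f:
        args = pickle.load(f)
    # some_file_loader is a placeholder for your own loading code
    (X_train, y_train) = some_file_loader(args.train)
    (X_valid, y_valid) = some_file_loader(args.valid)
    return X_train, y_train, X_valid, y_valid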


import argparse
import pickle

parser = argparse.ArgumentParser()
parser.add_argument('--train', help='Training data file', type=str, required=True)
parser.add_argument('--valid', help='Validation data file', type=str, required=True)
args = parser.parse_args()

# write the arguments to the same hardcoded path that data() reads from
args_file = 'data_args.pkl'
with open(args_file, 'wb') as f:
    pickle.dump(args, f)
X_train, y_train, X_valid, y_valid = data()
best_run, best_model = optim.minimize(model=model,
                                      data=data,
                                      ...)
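The plaintext variant mentioned above could look roughly like this, using JSON instead of pickle. This is a sketch in plain Python that has not been run through hyperas. The nested load() helper, the write_args() function, the CSV layout (label in the last column) and the demo files are all assumptions made up for illustration.

```python
import json
import os
import tempfile

ARGS_FILE = "data_args.json"  # must match the literal used inside data()


def data():
    # Imports and the file name live inside data() so the function does not
    # depend on anything defined elsewhere in the script.
    import csv
    import json

    with open("data_args.json") as fh:
        args = json.load(fh)

    def load(path):
        # Hypothetical loader: the last column is the label, the rest are features.
        with open(path, newline="") as f:
            rows = [[float(v) for v in row] for row in csv.reader(f)]
        return [r[:-1] for r in rows], [r[-1] for r in rows]

    X_train, y_train = load(args["train"])
    X_valid, y_valid = load(args["valid"])
    return X_train, y_train, X_valid, y_valid


def write_args(train_path, valid_path, args_file=ARGS_FILE):
    """Driver side: store the paths where data() will look for them."""
    with open(args_file, "w") as fh:
        json.dump({"train": train_path, "valid": valid_path}, fh)


# Demo with two tiny CSV files created on the fly.
tmp = tempfile.mkdtemp()
train_csv = os.path.join(tmp, "train.csv")
valid_csv = os.path.join(tmp, "valid.csv")
with open(train_csv, "w") as f:
    f.write("0.0,1.0,0\n1.0,0.0,1\n")
with open(valid_csv, "w") as f:
    f.write("0.5,0.5,1\n")

write_args(train_csv, valid_csv)
X_train, y_train, X_valid, y_valid = data()
```

Compared with pickle, a JSON file can be inspected and edited by hand, at the cost of only supporting simple values such as strings and numbers.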

Not sure I can answer this question any better than @pkairys or @dehdari. Will add this to the README for future reference.

For future reference, if someone else has this issue, there is a simple way to do it:

We just have to write a function that returns the args:

import argparse

def my_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--train', help='Training data file', type=str, required=True)
    parser.add_argument('--valid', help='Validation data file', type=str, required=True)
    args = parser.parse_args()
    return args

Then we can call it in minimize as follows:

best_run, best_model = optim.minimize(model=model,
                                      data=data,
                                      functions=[my_args],
                                      ...)

Then call it in model:

def model(x_train, y_train, x_test, y_test):
    args = my_args()
    train_file = args.train
    valid_file = args.valid
    # define model
    return {'loss': -acc, ...}
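One possible refinement of the helper, sketched here as an assumption rather than anything hyperas requires: accept an optional argv list so the function can be exercised without a real command line, and use parse_known_args so that flags meant for other parts of the script do not abort parsing. The '--epochs' flag in the demo call is hypothetical.

```python
import argparse


def my_args(argv=None):
    """Parse --train and --valid; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--train', help='Training data file', type=str, required=True)
    parser.add_argument('--valid', help='Validation data file', type=str, required=True)
    # parse_known_args returns unrecognised flags separately instead of exiting
    args, _unknown = parser.parse_known_args(argv)
    return args


# Demo call with an explicit argument list, including one unrelated flag.
args = my_args(['--train', 'train.csv', '--valid', 'valid.csv', '--epochs', '5'])
print(args.train, args.valid)
```

Because argv defaults to None, calling my_args() with no arguments still reads the real command line.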