SimonBlanke/Hyperactive

Printing of Results from Runs - Preference to be able to provide additional parameters to objective function not through search space

mlittmanabbvie opened this issue · 20 comments

Is your feature request related to a problem? Please describe.
Yes. When the "print_results" verbosity option is enabled in the Hyperactive initialization, the printed parameter set includes every parameter in the search space. In my case, one of those parameters is the dataframe that I pass to the objective_function, because I can't find any other way to get the dataframe that the objective function operates on into it without including it in the search space.

Describe the solution you'd like

  1. Either the ability to choose which parameters from the parameter set are printed by print_results, or
  2. The ability to pass extra parameters to the objective function without including them in the search space (preferred).
    This could perhaps be done through the "initialize" parameter if it were opened up to more arguments than grid, vertices, and random, perhaps **kwargs, so that arbitrary user parameters could be added. Adding an extra parameter to add_search might complicate things: instead of the optimizer only looking within search_space, that change would have to be made everywhere and for every optimizer, which is a ton of work. Adding it to initialize might mean fewer changes.

Describe alternatives you've considered
I have considered not printing the results at all, because the printed dataframe clutters the output in the console.

Additional context
If I do not include the dataframe in the search space, I can't run my objective_function. But if I do include it, I can't use memory=True, because storing the dataframe would consume a huge amount of memory very quickly.

Thank you @mlittmanabbvie for the detailed explanation of the issue! :-)

The search space should only be used for parameters you want to change/optimize. Passing something like a dataframe to the objective function via the search space is not necessary if you define it at the top level of your script (not inside a function, i.e. with no indentation; see "What is a top-level statement in Python?"). A very simple example can be found in the README.md:

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import load_boston
from hyperactive import Hyperactive

data = load_boston()
X, y = data.data, data.target

# define the model in a function
def model(opt):
    # pass the suggested parameter to the machine learning model
    gbr = GradientBoostingRegressor(
        n_estimators=opt["n_estimators"]
    )
    scores = cross_val_score(gbr, X, y, cv=3)

    # return a single numerical value, which gets maximized
    return scores.mean()


# search space determines the ranges of parameters you want the optimizer to search through
search_space = {"n_estimators": list(range(10, 200, 5))}

# start the optimization run
hyper = Hyperactive()
hyper.add_search(model, search_space, n_iter=50)
hyper.run()

In this example we need the data X and y from sklearn inside the objective function to fit the model. Since those variables are defined at the top level, we can access them from inside the objective function. Another benefit is that the results printed in the command line are much cleaner.

Therefore you should use the search space only for parameters you want to change.

Is this explanation helpful to you?

I appreciate the response! Unfortunately that doesn't help in my case, because the dataset I am referring to changes throughout the process. Therefore, I cannot set it once at the beginning and must pass it in each time. There doesn't seem to be any way to pass other variables to the objective function except through the search space, which means that even values that are constant with respect to the optimization (but change during the course of the program) would need to be sent through the search space as single values.

Okay this is interesting! I will prepare an example with a dataset in the search space and try to look for:

  1. the prints in the command line
  2. the memory

I don't think the memory would blow up if memory=True because the memory_dictionary only stores the positions in the search space, not the actual values. The memory impact should be minimal.
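
To illustrate what I mean by "positions" (this is only a simplified sketch, not Hyperactive's actual implementation): the memory is keyed by index tuples, so the dataframe itself never gets copied into it.

import pandas as pd

ret_df = pd.DataFrame(([1, 2, 3], [4, 5, 6], [7, 8, 9]))
search_space = {"exp": list(range(0, 5)), "df": [ret_df]}

memory = {}  # maps position tuples like (2, 0) to scores


def evaluate(position):
    # the dataframe never enters the cache, only its index in the value list
    if position in memory:
        return memory[position]
    params = {
        name: values[idx]
        for (name, values), idx in zip(search_space.items(), position)
    }
    score = params["exp"] ** 2  # stand-in objective
    memory[position] = score
    return score


evaluate((2, 0))  # a second call with (2, 0) would be answered from the memory dict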

The memory blows up because, if you need to pass [df] as a value in the search space, I believe the entire dataframe gets stored in memory once it is wrapped in that list.

Could you provide a minimal example that I can run? I am currently working on figuring this problem out, but an example would be very helpful.

import numpy as np
import pandas as pd
from hyperactive import Hyperactive
from hyperactive import RepulsingHillClimbingOptimizer

ret_df = pd.DataFrame(([1, 2, 3], [4, 5, 6], [7, 8, 9]))


def func_minl(opts):
    return opts["slope"] ** 2 + opts["exp"] ** 2


h = Hyperactive(["progress_bar", "print_results", "print_times"])
search_space = {
    "exp": list(range(0, 5)),
    "slope": list(np.arange(0.001, 10, step=0.05)),
    "clust": [5],
    "df": [ret_df],
}

h.add_search(
    func_minl,
    search_space=search_space,
    n_iter=10,
    optimizer=RepulsingHillClimbingOptimizer(
        epsilon=0.05,
        distribution="normal",
        n_neighbours=3,
        rand_rest_p=0.03,
        repulsion_factor=3,
    ),
    n_jobs=1,
    max_score=None,
    initialize={
        "warm_start": [
            {
                "exp": 2,
                "slope": 5,
                "clust": 5,
                "df": ret_df,
            }
        ]
    },
    early_stopping={"tol_rel": 0.001, "n_iter_no_change": 3},
    random_state=0,
    memory=True,
    memory_warm_start=None,
)
h.run()

when "print results" is used in Hyperactive init, notice that the dataframe itself gets printed to the console for best_params

Yes, I see what you mean. The solution for this issue is somewhat related to #39. In the patched version you will be able to put the dataframe into a function and then put the function into the search space. This way I can use the function name as an ID (which is very important!) in the backend.
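
A rough sketch of that idea (assuming the patched version accepts functions as search-space values; the names here are only for illustration):

import pandas as pd

ret_df = pd.DataFrame(([1, 2, 3], [4, 5, 6], [7, 8, 9]))


def df_provider():
    # only the function name "df_provider" shows up in prints and as the backend ID
    return ret_df


search_space = {
    "exp": list(range(0, 5)),
    "df": [df_provider],
}


def objective(opt):
    df = opt["df"]()  # call the wrapper to get the actual dataframe
    return df.iloc[0, 0] - opt["exp"] ** 2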

The problem with that approach is that if the dataframe is changing, the function would need to be given the new dataframe each time in order to return the correct one.

I do not see how your example shows a changing dataframe. If you could show your idea in a short but clear example, I might be able to help you.

Normally you know the contents of the search space before your optimization run, but it sounds like the contents of the search space (the dataframe) are changing during the optimization run. Is this correct? This would be quite unusual, and I am not sure Hyperactive is prepared for this application.

import numpy as np
import pandas as pd
from hyperactive import Hyperactive
from hyperactive import RepulsingHillClimbingOptimizer

ret_df = pd.DataFrame(([1, 2, 3], [4, 5, 6], [7, 8, 9]))


def func_minl(opts):
    opts["df"]["clust"] = opts["clust"]
    return opts["slope"] ** 2 + opts["exp"] ** 2


h = Hyperactive(["progress_bar", "print_results", "print_times"])
search_space = {
    "exp": list(range(0, 5)),
    "slope": list(np.arange(0.001, 10, step=0.05)),
    "clust": [5],
    "df": [ret_df],
}

h.add_search(
    func_minl,
    search_space=search_space,
    n_iter=10,
    optimizer=RepulsingHillClimbingOptimizer(
        epsilon=0.05,
        distribution="normal",
        n_neighbours=3,
        rand_rest_p=0.03,
        repulsion_factor=3,
    ),
    n_jobs=1,
    max_score=None,
    initialize={
        "warm_start": [
            {
                "exp": 2,
                "slope": 5,
                "clust": 5,
                "df": [ret_df],
            }
        ]
    },
    early_stopping={"tol_rel": 0.001, "n_iter_no_change": 3},
    random_state=0,
    memory=True,
    memory_warm_start=None,
)
h.run()

I edited the previous example to include what I mean about changing the df. Obviously, in this context it doesn't make much sense to do this, but it is a simplified version.

Changing the search space during the optimization run is a problem that Hyperactive is not really equipped for. Could you explain what you want to achieve? I am unfamiliar with this type of optimization problem.

Also, I still do not see why the dataframe must be in the search space in your example. You do not access its contents in the objective function to calculate anything score-related.

In that case I was adding to the dataframe as the optimization was running, for example to calculate additional metrics and append them to the dataframe on each iteration. A rough sketch of what I mean is below.
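
(Only a sketch, assuming the dataframe still travels through the search space as in the edited example above; the metric columns are made up.)

import numpy as np
import pandas as pd

metrics_df = pd.DataFrame(columns=["exp", "slope", "score"])

search_space = {
    "exp": list(range(0, 5)),
    "slope": list(np.arange(0.001, 10, step=0.05)),
    "df": [metrics_df],
}


def func_minl(opts):
    score = opts["slope"] ** 2 + opts["exp"] ** 2
    df = opts["df"]
    df.loc[len(df)] = [opts["exp"], opts["slope"], score]  # metrics accumulate during the run
    return score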

In regards to why it works in your example: it is because you don't have nested functions. In my case, it looks like this:

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_wine
from hyperactive import Hyperactive


def run_process(search_space):
    ret_df = pd.DataFrame(([1, 2, 3], [4, 5, 6], [7, 8, 9]))
    hyper = Hyperactive()
    hyper.add_search(model, search_space, n_iter=40)
    hyper.run()

def model(opt):
    gbr = GradientBoostingClassifier(
        n_estimators=opt["n_estimators"],
        max_depth=opt["max_depth"],
        min_samples_split=opt["min_samples_split"],
        min_samples_leaf=opt["min_samples_leaf"],
        criterion=opt["criterion"],
    )
    
    print(ret_df)  # NameError: ret_df is defined only inside run_process

    return ret_df.iloc[0]


search_space = {
    "n_estimators": list(range(10, 150, 5)),
    "max_depth": list(range(2, 12)),
    "min_samples_split": list(range(2, 25)),
    "min_samples_leaf": list(range(1, 25)),
    "criterion": ["friedman_mse", "squared_error", "absolute_error"],
    "subsample": list(np.arange(0.1, 3, 0.1))
}


run_process(search_space)

You will see that you get a NameError because the model doesn't know what the dataframe is. Therefore, I would have to pass the dataframe into the search space in order for the model to be able to use it.

One workaround would be to convert the dataframe to a string and send it through the search space, then read the string back into a dataframe inside the objective function and use it there. The reason I can't just pass the dataframe itself is that the system quickly runs out of memory when it tries to create a numpy array with a dataframe as the first element, which puts a cap on how large a dataframe you can use. It would be easier if there were a way to pass additional parameters to the objective function without sending them through the search space.
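
Something like this is what I have in mind for the string workaround (only a sketch; the df_json key name is just for illustration):

import io

import pandas as pd

ret_df = pd.DataFrame(([1, 2, 3], [4, 5, 6], [7, 8, 9]))

search_space = {
    "exp": list(range(0, 5)),
    "df_json": [ret_df.to_json()],  # a plain string instead of the dataframe object
}


def objective(opts):
    df = pd.read_json(io.StringIO(opts["df_json"]))  # rebuild the dataframe on each call
    return float(df.iloc[0, 0]) - opts["exp"] ** 2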

Hello @mlittmanabbvie,

I think I can see what you are trying to do. I will prepare a design proposal for this new feature. Then we can discuss whether it solves your problem.

Sorry for the late reply. I had a lot of work to do (Hyperactive v4, GFO v1, ...).

Hello @mlittmanabbvie,

I found an easy way to pass objects to the objective function. I hope this is what you are searching for:

You can pass a dictionary with any object to pass_through in the add_search-method. After that you can access it in the objective function. This example should demonstrate the usage:

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_wine
from hyperactive import Hyperactive


def run_process(search_space):
    ret_df = pd.DataFrame(([1, 2, 3], [4, 5, 6], [7, 8, 9]))
    pass_through = {"ret_df": ret_df}

    hyper = Hyperactive()
    hyper.add_search(model, search_space, n_iter=15, pass_through=pass_through) # pass the dataframe
    hyper.run()


def model(opt):
    ret_df = opt.pass_through["ret_df"] # access the dataframe

    gbr = GradientBoostingClassifier(
        n_estimators=opt["n_estimators"],
        max_depth=opt["max_depth"],
        min_samples_split=opt["min_samples_split"],
        min_samples_leaf=opt["min_samples_leaf"],
        criterion=opt["criterion"],
    )

    print("\n ret_df \n", ret_df)

    ret_df.loc[0, 0] = np.random.randint(0, 100)

    return ret_df.loc[0, 0]


search_space = {
    "n_estimators": list(range(10, 150, 5)),
    "max_depth": list(range(2, 12)),
    "min_samples_split": list(range(2, 25)),
    "min_samples_leaf": list(range(1, 25)),
    "criterion": ["friedman_mse", "squared_error", "absolute_error"],
    "subsample": list(np.arange(0.1, 3, 0.1)),
}


run_process(search_space)

You can change the object inside the objective-function without using the search space.

Let me know if this would solve your problem.

Which version is this pass_through feature added in? I just installed 4.0.2, and I am getting an error that the add_search method doesn't have pass_through as a parameter.

Hello @mlittmanabbvie,

the code above was a suggestion for how this feature could work. In the meantime I have implemented the feature, and I will release it in the next version (probably within the next few days).

Oh, got it, thank you! The idea is to be able to pass any number of items to the objective function that don't necessarily need to be in the search space, correct? If so, that would be amazing! Very excited for the release.

Hello @mlittmanabbvie,

Correct! You can pass any data-type to the pass_through-parameter. This feature works exactly as shown in the example above. You can change the data in the pass_through-parameter during the optimization run (like updating a dataframe).
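
For example (a small sketch, assuming pass_through works as shown in the example above), you could collect a metrics row on every objective call:

import pandas as pd
from hyperactive import Hyperactive

log_df = pd.DataFrame(columns=["x", "score"])


def objective(opt):
    score = -(opt["x"] ** 2)
    log = opt.pass_through["log_df"]
    log.loc[len(log)] = [opt["x"], score]  # the dataframe grows during the run
    return score


search_space = {"x": list(range(-10, 11))}

hyper = Hyperactive()
hyper.add_search(objective, search_space, n_iter=20, pass_through={"log_df": log_df})
hyper.run()

print(log_df)  # one row per objective call, filled in during the optimization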

  • I tested the pass_through-dictionary for numbers, lists and functions in 21d3f8a.
  • The README was updated in cb9ec47.

I will close this issue. If you have further questions or additional feature requests, we can open another issue.

Thank you very much for your explanations and support for this useful feature! :-)