keras-team/keras-tuner

Absurd RAM Consumption Growth during Search

JLPiper opened this issue · 4 comments

Whilst training a model using a BayesianOptimization tuner on an Ubuntu 22.04 system, the program's RAM usage keeps growing without being released in between trials, until the system has completely run out of memory and either the computer or the program crashes. I am not running on a system with a GPU.

Here is the relevant code:

```python
from tensorflow import keras
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential
from keras_tuner import BayesianOptimization

def build_model(hp):
    model = Sequential()
    model.add(LSTM(hp.Int('lstm_1_unit', min_value=4, max_value=512, sampling='log'),
                   return_sequences=True, input_shape=(n_steps, n_features)))
    n_additional = hp.Int('additional_layers', min_value=1, max_value=5, step=1)
    for i in range(n_additional):
        # Only the last LSTM layer drops the sequence dimension.
        model.add(LSTM(hp.Int(f'lstm_{i+2}_unit', min_value=4, max_value=512, sampling='log'),
                       return_sequences=(i != n_additional - 1)))
    model.add(Dropout(hp.Float('Dropout_rate', min_value=0, max_value=0.5, step=0.1)))
    model.add(Dense(hp.Int('dense_units', min_value=4, max_value=256, sampling='log'),
                    activation=hp.Choice('dense_activation',
                                         values=['linear', 'relu', 'sigmoid', 'tanh', 'softmax', 'softplus'],
                                         default='linear')))
    model.add(Dense(1, activation=hp.Choice('output_dense',
                                            values=['linear', 'relu', 'sigmoid', 'tanh', 'softmax', 'softplus'],
                                            default='linear')))
    optimizer = hp.Choice('optimizer', values=['adam', 'sgd', 'rmsprop', 'adagrad', 'adadelta', 'adamax'],
                          default='adam')
    lr = hp.Float('learning_rate', min_value=1e-5, max_value=1e-1, sampling='log')
    # Build the optimizer from its name so the sampled learning rate is actually applied.
    opt = keras.optimizers.get(optimizer)
    opt.learning_rate = lr
    model.compile(loss='mean_squared_error', optimizer=opt, metrics=['mse'])
    return model
```

```python
tuner = BayesianOptimization(
    build_model,
    objective='mse',
    max_trials=10,
)

tuner.search(
    X,
    y,
    epochs=75,
    batch_size=32,
    validation_data=(x_input, y_input),
)
```

I have tried adding clear_session() to the beginning of my build_model function to no avail. All other fixes I have tried so far have been of no use.
Any help in resolving this matter would be greatly appreciated.

I'd like to add that although this issue is almost a year old now it still seems to be an issue on the latest version (1.4.6) and any guidance or bugfixes would be much appreciated. It's very annoying to have to babysit the entire search process when it can take several hours to complete in my case. Even if there's only a bit of a "hacky" solution for now, I'll take any fix to just be able to let this run and go do something else in the meantime.

For everyone who has hit this issue and continues to hit it: until it has been addressed, here are two ways I have found to alleviate it.

A. Use a less memory-intensive hyperparameter search that can feasibly finish before the consumption becomes too much. I found switching from BayesianOptimization to Hyperband gave me a lot more leeway, at the cost of giving up the benefits of Bayesian optimization.

B. Use a separate program to launch, monitor, and kill the main tuner program. A handful of libraries let you track resource usage relatively easily. Simply have that program launch the tuner, monitor its memory usage, and kill the tuner when usage passes a chosen threshold.

KerasTuner saves its progress to disk, so a restarted search will pick up right where it left off. A word of warning though: I have still run into occasions where a single trial consumes too much memory by itself and the search gets stuck, because the watchdog program kills the tuner before that trial can ever finish.

Hopefully, these fixes won't be necessary in the future.

> I'd like to add that although this issue is almost a year old now it still seems to be an issue on the latest version (1.4.6) and any guidance or bugfixes would be much appreciated.

Just posted some suggestions.

Thank you very much for the advice! It's not ideal, but at least I'll be able to get something up and running