keras-team/keras-tuner

Kernel crashes when trying to build a Hyperband tuner

Faptimus420 opened this issue · 1 comment

Describe the bug
I am attempting to build a Vision Transformer as an experiment. My code is based, with modifications, on the code in this Keras guide. However, I am also trying to use Keras Tuner to tune the model's hyperparameters. The function used to generate the hypermodel is below. The cell with the hypermodel function definition executes fine, but when I try to execute the cell containing the call that builds the Hyperband tuner:

tuner = keras_tuner.Hyperband(hypermodel=generate_vit, objective='val_accuracy', max_epochs=100, directory=normpath('C:/'))
tuner.search_space_summary()

the kernel just crashes after a few seconds, with no error message of any kind. Looking at resource usage, I also don't see any increase in RAM, CPU, or GPU usage while the cell executes, just a drop once the kernel crashes.

To Reproduce
This is the function that should generate a model:

def generate_vit(hp):
    patch_size = hp.Fixed('patch_size', 32)
    num_patches = hp.Fixed('num_patches', (IMAGE_DIMS[0] // patch_size) ** 2)
    projection_dim = hp.Choice('projection_dim', [512, 256, 128, 64, 32])
    normalization_epsilon = hp.Choice('norm_epsilon', [1e-3, 1e-6])

    inputs = Input(shape=IMAGE_DIMS)
    patches = Patches(patch_size)(inputs)
    encoded_patches = PatchEncoder(num_patches, projection_dim)(patches)

    for i in range(hp.Int('num_layers', min_value=3, max_value=15, step=3)):
        x1 = LayerNormalization(epsilon=normalization_epsilon)(encoded_patches)

        if hp.Boolean(f'apply_regularization{i}'):
            attention_output = MultiHeadAttention(num_heads=hp.Int(f'num_heads{i}', min_value=3, max_value=6, parent_name=f'apply_regularization{i}', parent_values=[True]),
                                                  key_dim=projection_dim,
                                                  dropout=hp.Float(f'attention_dropout{i}', min_value=0.1, max_value=0.7, step=0.2, parent_name=f'apply_regularization{i}', parent_values=[True]),
                                                  kernel_regularizer=L1L2(hp.Float(f'l1_reg{i}', min_value=1e-6, max_value=1e-1, sampling='log', parent_name=f'apply_regularization{i}', parent_values=[True]),
                                                                          hp.Float(f'l2_reg{i}', min_value=1e-6, max_value=1e-1, sampling='log', parent_name=f'apply_regularization{i}', parent_values=[True])))(x1, x1)
        else:
            attention_output = MultiHeadAttention(num_heads=hp.Int(f'num_heads{i}', min_value=3, max_value=6, parent_name=f'apply_regularization{i}', parent_values=[False]),
                                                  key_dim=projection_dim,
                                                  dropout=hp.Float(f'attention_dropout{i}', min_value=0.1, max_value=0.7, step=0.2, parent_name=f'apply_regularization{i}', parent_values=[False]))(x1, x1)

        x2 = Add()([attention_output, encoded_patches])
        x3 = LayerNormalization(epsilon=normalization_epsilon)(x2)
        x3 = mlp(x3,
                 hidden_units=[projection_dim * 2, projection_dim],
                 dropout_rate=hp.Float(f'mlp_dropout{i}', min_value=0.1, max_value=0.7, step=0.2),
                 activation=hp.Choice(f'mlp_act{i}', ['gelu', 'relu', 'leaky_relu']))
        encoded_patches = Add()([x3, x2])

    representation = LayerNormalization(epsilon=normalization_epsilon)(encoded_patches)
    if hp.Boolean('global_avg_pooling'):
        representation = GlobalAveragePooling1D()(representation)
    else:
        representation = Flatten()(representation)
    representation = Dropout(hp.Float('representation_dropout', min_value=0.1, max_value=0.7, step=0.2))(representation)

    mlp_head_units = hp.Choice('mlp_head_units', [8192, 4096, 2048, 1024, 512, 256])
    features = mlp(representation,
                   hidden_units=[mlp_head_units, mlp_head_units // 2],
                   dropout_rate=hp.Float('mlp_head_dropout', min_value=0.1, max_value=0.7, step=0.2),
                   activation=hp.Choice('mlp_head_act', ['gelu', 'relu', 'leaky_relu']))
    logits = Dense(NUM_CLASSES)(features)
    model = Model(inputs=inputs, outputs=logits)


    model.compile(loss=SparseCategoricalCrossentropy(from_logits=True),
                  metrics=[SparseCategoricalAccuracy(name='accuracy')],
                  optimizer=Adam(learning_rate=hp.Float('lr', min_value=1e-6, max_value=1e-3, sampling='log'),
                                 epsilon=hp.Choice('opt_epsilon', [1.0, 0.1, 1e-7]),
                                 decay=hp.Float('wd', min_value=0.0001, max_value=0.1, sampling='log')))
    return model

Helper classes/functions:
This is the mlp function:

def mlp(x, hidden_units, dropout_rate: float, activation: str):
    for units in hidden_units:
        x = Dense(units, activation=activation)(x)
        x = Dropout(dropout_rate)(x)
    return x

This is the Patches class:

class Patches(Layer):
    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        batch_size = tf_shape(images)[0]
        patches = extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        patch_dims = patches.shape[-1]
        patches = tf_reshape(patches, [batch_size, -1, patch_dims])
        return patches

    def get_config(self):
        config = super().get_config()
        config.update({
            'patch_size': self.patch_size
        })
        return config

This is the PatchEncoder class:

class PatchEncoder(Layer):
    def __init__(self, num_patches, projection_dim):
        super().__init__()
        self.num_patches = num_patches
        self.projection_dim = projection_dim
        self.projection = Dense(units=projection_dim)
        self.position_embedding = Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, patch):
        positions = tf_range(start=0, limit=self.num_patches, delta=1)
        encoded = self.projection(patch) + self.position_embedding(positions)
        return encoded

    def get_config(self):
        config = super().get_config()
        config.update({
            'num_patches': self.num_patches,
            'projection_dim': self.projection_dim
        })
        return config
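
For completeness, the snippets above omit the imports. The aliased names (tf_shape, tf_reshape, tf_range, extract_patches) presumably map to the standard TensorFlow symbols, roughly as in this best-guess reconstruction (IMAGE_DIMS and NUM_CLASSES are constants defined elsewhere in the notebook):

import keras_tuner
from os.path import normpath
from tensorflow import shape as tf_shape, reshape as tf_reshape, range as tf_range
from tensorflow.image import extract_patches
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import (Layer, Dense, Dropout, Embedding, Add, Flatten,
                                     LayerNormalization, MultiHeadAttention,
                                     GlobalAveragePooling1D)
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import SparseCategoricalAccuracy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import L1L2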

Expected behavior
The Hyperband tuner to be built and its search space generated, with a model configuration for each permutation of the hyperparameters.

Additional context

  • Defining the hypermodel as a class sub-classed from keras_tuner.HyperModel instead of as a function gives the same result.
  • Building the model as-is, without involving Keras Tuner at all and just using fixed hyperparameters, works fine (a sketch of how to exercise the hypermodel function directly, without the tuner, is below).
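
One quick way to separate the hypermodel from the tuner is to call the build function directly with a fresh HyperParameters object, which fills in default values (the first Choice entry, min_value for Int/Float, False for Boolean). This is just a debugging sketch, not code from the original notebook:

import keras_tuner

hp = keras_tuner.HyperParameters()   # every hyperparameter takes its default value
model = generate_vit(hp)             # builds a single model outside the tuner
model.summary()

Note that hp.Boolean defaults to False, so this call takes the Flatten branch, which later turned out to be the problematic path.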

Would you like to help us fix it?
Yes, if it is actually a bug, but I'm not sure whether I'm just doing something wrong and my environment is simply failing to tell me so.

After experimenting with adding the hyperparameters one by one, I discovered that the issue is in this part of the code:

if hp.Boolean('global_avg_pooling'):
    representation = GlobalAveragePooling1D()(representation)
else:
    representation = Flatten()(representation)

It seems the issue has nothing to do with Keras Tuner; it is specific to the Flatten layer. Apparently, the representation is so large by the time it reaches this part of the code that applying the Flatten layer causes the kernel to crash without Keras/TensorFlow itself reporting an error. I changed the code to instead test whether average or max pooling works better, dropping the Flatten layer entirely; a sketch of the revised block is below.
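
To illustrate the size problem with made-up numbers (IMAGE_DIMS is not shown above): with hypothetical 512x512 inputs and patch_size 32 there are 256 patches, so at projection_dim 512 Flatten produces 256 * 512 = 131,072 features, and the following Dense layer at mlp_head_units = 8192 alone would need roughly a billion weights (about 4 GB in float32). A rough sketch of the revised block, using GlobalMaxPooling1D from tensorflow.keras.layers as the alternative and keeping the hyperparameter name from the original code, might look like this:

representation = LayerNormalization(epsilon=normalization_epsilon)(encoded_patches)
# Both pooling options collapse the patch dimension, so the classification
# head sees projection_dim features instead of num_patches * projection_dim.
if hp.Boolean('global_avg_pooling'):
    representation = GlobalAveragePooling1D()(representation)
else:
    representation = GlobalMaxPooling1D()(representation)
representation = Dropout(hp.Float('representation_dropout', min_value=0.1, max_value=0.7, step=0.2))(representation)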