keras-team/keras-nlp

Preprocessor does not respect sequence_length

Closed this issue · 2 comments

52631 commented

Describe the bug
If I initialize a preprocessor from a preset, it does not respect the specified sequence length.

To Reproduce
In keras-nlp==0.11.1, the preprocessor pads to the default sequence length of 512 regardless of the specified length:

keras_nlp.models.BertPreprocessor.from_preset('bert_tiny_en_uncased', sequence_length=16)("The quick brown fox jumped.")
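A minimal way to see the mismatch is to inspect the shape of the returned token_ids (the output dict keys below come from the expected-behavior example; the exact printed shape on 0.11.1 is an assumption based on the reported 512 default):

preprocessor = keras_nlp.models.BertPreprocessor.from_preset('bert_tiny_en_uncased', sequence_length=16)
outputs = preprocessor("The quick brown fox jumped.")
print(outputs['token_ids'].shape)  # prints (512,) on 0.11.1 instead of the expected (16,)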

Expected behavior
In keras-nlp==0.8.2, the preprocessor would respect the specified length:

{'token_ids': <tf.Tensor: shape=(16,), dtype=int32, numpy=
 array([ 101, 1996, 4248, 2829, 4419, 5598, 1012,  102,    0,    0,    0,
           0,    0,    0,    0,    0], dtype=int32)>,
 'segment_ids': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>,
 'padding_mask': <tf.Tensor: shape=(16,), dtype=bool, numpy=
 array([ True,  True,  True,  True,  True,  True,  True,  True, False,
        False, False, False, False, False, False, False])>}

Additional context
In my case, this showed up as a large performance hit when migrating code to the latest version. The penalty may be more subtle depending on how the desired sequence length compares to the default value.

It seems the workaround is to override the sequence length after initializing:

preprocessor = keras_nlp.models.BertPreprocessor.from_preset('bert_tiny_en_uncased', sequence_length=16)
preprocessor.sequence_length = 16
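A quick sanity check (a sketch, reusing the preprocessor and sentence from above) that the override takes effect:

outputs = preprocessor("The quick brown fox jumped.")
assert outputs['token_ids'].shape == (16,)  # sequence length is respected after the override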

Thanks for reporting this issue! I'll look into this!

This issue is fixed in #1632.