keras-team/keras-nlp

Preprocessor does not respect sequence_length

Closed this issue · 2 comments

52631 commented

Describe the bug
If I initialize a preprocessor from a preset, it does not respect the specified sequence length.

To Reproduce
In keras-nlp==0.11.1, the preprocessor pads to the default sequence length of 512 regardless of the specified length:

keras_nlp.models.BertPreprocessor.from_preset('bert_tiny_en_uncased', sequence_length=16)("The quick brown fox jumped.")
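A minimal way to see the mismatch is to inspect the shape of the returned token_ids (the output dict keys below come from the expected-behavior example; the exact printed shape on 0.11.1 is an assumption based on the reported 512 default):

preprocessor = keras_nlp.models.BertPreprocessor.from_preset('bert_tiny_en_uncased', sequence_length=16)
outputs = preprocessor("The quick brown fox jumped.")
print(outputs['token_ids'].shape)  # prints (512,) on 0.11.1 instead of the expected (16,)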

Expected behavior
In keras-nlp==0.8.2, the preprocessor would respect the specified length:

{'token_ids': <tf.Tensor: shape=(16,), dtype=int32, numpy=
 array([ 101, 1996, 4248, 2829, 4419, 5598, 1012,  102,    0,    0,    0,
           0,    0,    0,    0,    0], dtype=int32)>,
 'segment_ids': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>,
 'padding_mask': <tf.Tensor: shape=(16,), dtype=bool, numpy=
 array([ True,  True,  True,  True,  True,  True,  True,  True, False,
        False, False, False, False, False, False, False])>}

Additional context
In my case, this showed up as a large performance hit when migrating code to the latest version. The penalty may be more subtle depending on how the desired sequence length compares to the default value.

It seems the workaround is to override the sequence length after initializing:

preprocessor = keras_nlp.models.BertPreprocessor.from_preset('bert_tiny_en_uncased', sequence_length=16)
preprocessor.sequence_length = 16
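A quick sanity check (a sketch, reusing the preprocessor and sentence from above) that the override takes effect:

outputs = preprocessor("The quick brown fox jumped.")
assert outputs['token_ids'].shape == (16,)  # sequence length is respected after the override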

Thanks for reporting this issue! I'll look into this!

This issue is fixed in #1632.