keras-team/keras-nlp

How does gemma_lm.preprocessor.sequence_length handle input data larger than the sequence length?

Closed this issue · 3 comments

Hi,
Can someone help me, please?
I want to know how gemma_lm.preprocessor.sequence_length handles input data whose tokenized length exceeds sequence_length.
Thanks

If the prompt tokenizes to more tokens than sequence_length, the output is truncated: only the leftmost sequence_length tokens are kept. (Shorter prompts are right-padded up to sequence_length.) Here's an example:

from keras_nlp.models import MistralCausalLMPreprocessor

preprocessor = MistralCausalLMPreprocessor.from_preset("mistral_7b_en")

prompt = "This is how KerasNLP tokenizers work."

# The prompt tokenizes to fewer than 15 tokens, so the output is
# right-padded with 0s up to sequence_length:
preprocessor(prompt, sequence_length=15)[0]["token_ids"]
# <tf.Tensor: shape=(15,), dtype=int32, numpy=
# array([    1,   851,   349,   910,   524, 11234, 28759, 11661,  6029,
#        17916,   771,     0,     0,     0,     0], dtype=int32)>

# With sequence_length=5, everything after the first 5 tokens is dropped:
preprocessor(prompt, sequence_length=5)[0]["token_ids"]
# <tf.Tensor: shape=(5,), dtype=int32, numpy=array([  1, 851, 349, 910, 524], dtype=int32)>
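
Since the original question was about Gemma: the same truncation/padding behavior applies to the Gemma preprocessor. As a minimal sketch (assuming the "gemma_2b_en" preset is available in your environment; it requires Kaggle credentials), you can also set sequence_length once on the preprocessor instead of passing it on every call:

from keras_nlp.models import GemmaCausalLMPreprocessor

# Assumption: the "gemma_2b_en" preset has been downloaded/authorized.
preprocessor = GemmaCausalLMPreprocessor.from_preset("gemma_2b_en")

# Every subsequent call now truncates or pads to 128 tokens,
# with no per-call sequence_length argument needed.
preprocessor.sequence_length = 128

# The preprocessor returns (x, y, sample_weight); token ids live in x.
x, y, sample_weight = preprocessor("Some prompt that may be very long ...")
print(x["token_ids"].shape)  # (128,)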

Let me know if this answers your question!

Thanks 😄

Closing this. If you have any other questions, feel free to reopen!