keras-team/keras-nlp

How does gemma_lm.preprocessor.sequence_length handle input data larger than the sequence length?

Closed this issue · 3 comments

Hi,
Can someone help me, please?
I want to know how gemma_lm.preprocessor.sequence_length handles input data whose tokenized length exceeds sequence_length.
Thanks

If the prompt tokenizes to more tokens than sequence_length, the output is truncated: only the leftmost sequence_length tokens are kept. (Shorter prompts are right-padded up to sequence_length.) Here's an example:

from keras_nlp.models import MistralCausalLMPreprocessor

preprocessor = MistralCausalLMPreprocessor.from_preset("mistral_7b_en")

prompt = "This is how KerasNLP tokenizers work."

# The prompt tokenizes to fewer than 15 tokens, so the output is
# right-padded with 0s up to sequence_length:
preprocessor(prompt, sequence_length=15)[0]["token_ids"]
# <tf.Tensor: shape=(15,), dtype=int32, numpy=
# array([    1,   851,   349,   910,   524, 11234, 28759, 11661,  6029,
#        17916,   771,     0,     0,     0,     0], dtype=int32)>

# With sequence_length=5, everything after the first 5 tokens is dropped:
preprocessor(prompt, sequence_length=5)[0]["token_ids"]
# <tf.Tensor: shape=(5,), dtype=int32, numpy=array([  1, 851, 349, 910, 524], dtype=int32)>
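
Since the original question was about Gemma: the same truncation/padding behavior applies to the Gemma preprocessor. As a minimal sketch (assuming the "gemma_2b_en" preset is available in your environment; it requires Kaggle credentials), you can also set sequence_length once on the preprocessor instead of passing it on every call:

from keras_nlp.models import GemmaCausalLMPreprocessor

# Assumption: the "gemma_2b_en" preset has been downloaded/authorized.
preprocessor = GemmaCausalLMPreprocessor.from_preset("gemma_2b_en")

# Every subsequent call now truncates or pads to 128 tokens,
# with no per-call sequence_length argument needed.
preprocessor.sequence_length = 128

# The preprocessor returns (x, y, sample_weight); token ids live in x.
x, y, sample_weight = preprocessor("Some prompt that may be very long ...")
print(x["token_ids"].shape)  # (128,)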

Let me know if this answers your question!

Thanks 😄

Closing this. If you have any other questions, feel free to reopen!