How does gemma_lm.preprocessor.sequence_length handle input data that exceeds the sequence length?
Closed this issue · 3 comments
mostafamdy commented
Hi
Can someone help me please?
I want to know how gemma_lm.preprocessor.sequence_length handles input data that exceeds the sequence length.
Thanks
tirthasheshpatel commented
If the prompt tokenizes to more tokens than sequence_length, the output is truncated: only the leftmost sequence_length tokens are kept. (Prompts shorter than sequence_length are padded with 0s instead, as the first example below shows.) Here's an example:
from keras_nlp.models import MistralCausalLMPreprocessor

prompt = "This is how KerasNLP tokenizers work."
preprocessor = MistralCausalLMPreprocessor.from_preset("mistral_7b_en")

# The prompt tokenizes to 11 tokens (including the start token), so with
# sequence_length=15 the output is padded with 0s up to length 15.
preprocessor(prompt, sequence_length=15)[0]['token_ids']
# <tf.Tensor: shape=(15,), dtype=int32, numpy=
# array([    1,   851,   349,   910,   524, 11234, 28759, 11661,  6029,
#        17916,   771,     0,     0,     0,     0], dtype=int32)>

# With sequence_length=5, only the leftmost 5 tokens are kept.
preprocessor(prompt, sequence_length=5)[0]['token_ids']
# <tf.Tensor: shape=(5,), dtype=int32, numpy=array([  1, 851, 349, 910, 524], dtype=int32)>
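The same behavior applies to the Gemma preprocessor from the original question. Here's a minimal sketch, assuming keras_nlp's GemmaCausalLMPreprocessor and the "gemma_2b_en" preset are available in your environment (the exact token ids will differ from the Mistral output above, so they are not shown):

from keras_nlp.models import GemmaCausalLMPreprocessor

prompt = "This is how KerasNLP tokenizers work."
preprocessor = GemmaCausalLMPreprocessor.from_preset("gemma_2b_en")

# Inputs longer than sequence_length are truncated to the leftmost
# sequence_length tokens; shorter inputs are padded with 0s.
preprocessor(prompt, sequence_length=5)[0]['token_ids']  # shape (5,)

If you are working with a full GemmaCausalLM, you can also set the limit once on the attached preprocessor instead of passing it per call, e.g. gemma_lm.preprocessor.sequence_length = 512.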
Let me know if this answers your question!
mostafamdy commented
Thanks😄
tirthasheshpatel commented
Closing this issue. If you have any other questions, feel free to reopen!