token length limit
mmoskal opened this issue · 0 comments
`StackRecognizer` currently has a 130-byte limit on token length, and there is also `assert!(word.len() < 0xff);` in `toktree.rs`, which has to do with the toktree format.
We can probably just ignore tokens longer than 255 bytes (so far I've seen them only in the starcoder tokenizer, which has one slightly longer token of spaces).
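A rough sketch of what "ignoring" could look like, assuming the vocabulary is available as a slice of byte strings indexed by token id. `MAX_TOKEN_LEN` and `oversized_token_ids` are hypothetical names for illustration, not existing items in `toktree.rs`; the bound just mirrors the existing assert.

```rust
/// Hypothetical constant mirroring the `assert!(word.len() < 0xff)` check.
const MAX_TOKEN_LEN: usize = 0xff;

/// Returns the ids (indices) of tokens that would trip the assert,
/// i.e. tokens whose byte length is not strictly below `MAX_TOKEN_LEN`.
/// The caller can skip these ids when building the trie instead of panicking.
fn oversized_token_ids(vocab: &[Vec<u8>]) -> Vec<usize> {
    vocab
        .iter()
        .enumerate()
        .filter(|(_, word)| word.len() >= MAX_TOKEN_LEN)
        .map(|(id, _)| id)
        .collect()
}
```

Skipping by id rather than removing entries keeps the remaining token ids unchanged, which is probably what we want if the ids have to stay aligned with the tokenizer.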
CC @saikat107