microsoft/aici

token length limit

mmoskal opened this issue · 0 comments

StackRecognizer currently has a 130-byte limit on token length, and there is also an assert!(word.len() < 0xff); in toktree.rs, which has to do with the toktree format.

We can probably just ignore tokens longer than 255 bytes. So far I've seen them only in the starcoder tokenizer, which has one slightly longer token made up of spaces.
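
A rough sketch of what "ignoring" over-long tokens could look like when building the tree. The names `MAX_TOKEN_LEN` and `partition_tokens` are hypothetical and not part of aici; the bound is an assumption that simply mirrors the assert quoted above, and the actual toktree code may need a different place to filter.

```rust
/// Hypothetical limit; mirrors the `word.len() < 0xff` assert in toktree.rs.
const MAX_TOKEN_LEN: usize = 0xff;

/// Splits a vocabulary into tokens that fit the length limit and the ids of
/// tokens that are skipped. Token ids are taken to be indices into `words`
/// (an assumption made for this sketch).
fn partition_tokens(words: &[Vec<u8>]) -> (Vec<(u32, &[u8])>, Vec<u32>) {
    let mut kept = Vec::new();
    let mut skipped = Vec::new();
    for (id, word) in words.iter().enumerate() {
        if word.len() < MAX_TOKEN_LEN {
            // Short enough to be stored in the tree.
            kept.push((id as u32, word.as_slice()));
        } else {
            // Too long: record the id instead of panicking on the assert.
            skipped.push(id as u32);
        }
    }
    (kept, skipped)
}
```

Keeping the skipped ids around (rather than silently dropping them) would let the caller log them or make sure they are never allowed during sampling.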

CC @saikat107