The tokenizer needs to be improved.
zu1k opened this issue · 2 comments
zu1k commented
We currently use cang-jie and jieba-rs as our tokenizer, which causes Latin-language text to be tokenized incorrectly.
We need to combine multiple tokenizers to improve accuracy.
Known bugs:
- case is not ignored yet
- jieba-rs is not used in search mode (see the sketch after this list)
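For the search-mode bug, a minimal sketch, assuming cang-jie's public `CangJieTokenizer { worker, option }` struct and its `TokenizerOption::ForSearch` variant; `search_mode_tokenizer` is just an illustrative name:

```rust
use std::sync::Arc;

use cang_jie::{CangJieTokenizer, TokenizerOption};
use jieba_rs::Jieba;

fn search_mode_tokenizer() -> CangJieTokenizer {
    CangJieTokenizer {
        worker: Arc::new(Jieba::new()),
        // `ForSearch` maps to jieba's cut_for_search, which additionally
        // emits overlapping sub-words and usually improves recall.
        option: TokenizerOption::ForSearch { hmm: true },
    }
}
```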
zu1k commented
We now use cang-jie, which is based on jieba-rs, as the tokenizer for tantivy, and combine it with some other filters such as a length limit, ASCII folding, and lowercasing.
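Roughly, registering that pipeline looks like the sketch below. The filter names come from tantivy's tokenizer module, but the exact `TextAnalyzer` construction API differs between tantivy versions, and the 40-character limit is an arbitrary example, not our actual setting:

```rust
use std::sync::Arc;

use cang_jie::{CangJieTokenizer, TokenizerOption, CANG_JIE};
use jieba_rs::Jieba;
use tantivy::tokenizer::{AsciiFoldingFilter, LowerCaser, RemoveLongFilter, TextAnalyzer};
use tantivy::Index;

fn register_tokenizer(index: &Index) {
    let analyzer = TextAnalyzer::from(CangJieTokenizer {
        worker: Arc::new(Jieba::new()),
        option: TokenizerOption::ForSearch { hmm: true },
    })
    .filter(RemoveLongFilter::limit(40)) // drop unreasonably long tokens
    .filter(AsciiFoldingFilter)          // fold accented Latin characters
    .filter(LowerCaser);                 // case-insensitive matching

    index.tokenizers().register(CANG_JIE, analyzer);
}
```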
In fact, the 9.4 million books in the raw data span a wide variety of languages, and the current tokenizer is well optimized for only one of them.
We need to build a more general-purpose tokenizer for tantivy, either by combining existing tokenizers or by using existing AI-based NLP tools.
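One possible direction for combining tokenizers, sketched below: split the input into script runs and route each run to the segmenter that handles it best. Everything here is illustrative, not existing code in this repo; the `tokenize` name, the U+4E00..U+9FFF range check, and the Latin fallback are all assumptions:

```rust
use jieba_rs::Jieba;

/// Hypothetical combined tokenizer: CJK runs go through jieba's search
/// mode, everything else is split on non-alphanumeric boundaries and
/// lowercased. The U+4E00..U+9FFF check is a deliberate simplification;
/// a real implementation would use full Unicode script data.
fn tokenize(jieba: &Jieba, text: &str) -> Vec<String> {
    let is_cjk = |c: char| ('\u{4E00}'..='\u{9FFF}').contains(&c);
    let mut tokens = Vec::new();
    let mut run = String::new();
    let mut run_is_cjk = false;

    // Walk the text, accumulating maximal runs of CJK or non-CJK
    // characters; a '\0' sentinel flushes the final run.
    for c in text.chars().chain(std::iter::once('\0')) {
        let cjk = is_cjk(c);
        if c != '\0' && cjk == run_is_cjk {
            run.push(c);
            continue;
        }
        if !run.is_empty() {
            if run_is_cjk {
                // Search mode also emits overlapping sub-words.
                tokens.extend(
                    jieba.cut_for_search(&run, true).into_iter().map(String::from),
                );
            } else {
                tokens.extend(
                    run.split(|ch: char| !ch.is_alphanumeric())
                        .filter(|w| !w.is_empty())
                        .map(str::to_lowercase),
                );
            }
            run.clear();
        }
        run_is_cjk = cjk;
        if c != '\0' {
            run.push(c);
        }
    }
    tokens
}
```

With this, `tokenize(&Jieba::new(), "Hello, 世界!")` should yield something like `["hello", "世界"]`: the Latin run is lowercased and split on punctuation, while the CJK run is segmented by jieba.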