The tokenizer needs to be improved.
zu1k opened this issue · 2 comments
zu1k commented
We currently use cang-jie and jieba-rs as our tokenizer, which causes Latin-language text to be tokenized incorrectly.
We need to combine multiple tokenizers to improve accuracy.
Known bugs:
- case is not ignored yet
- jieba-rs is not used in search mode (see the sketch after this list)
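For the search-mode bug, a minimal sketch, assuming cang-jie's public `CangJieTokenizer { worker, option }` struct and its `TokenizerOption::ForSearch` variant; `search_mode_tokenizer` is just an illustrative name:

```rust
use std::sync::Arc;

use cang_jie::{CangJieTokenizer, TokenizerOption};
use jieba_rs::Jieba;

fn search_mode_tokenizer() -> CangJieTokenizer {
    CangJieTokenizer {
        worker: Arc::new(Jieba::new()),
        // `ForSearch` maps to jieba's cut_for_search, which additionally
        // emits overlapping sub-words and usually improves recall.
        option: TokenizerOption::ForSearch { hmm: true },
    }
}
```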
zu1k commented
We now use cang-jie, which is based on jieba-rs, as the tokenizer for tantivy, and combine it with some other filters such as a length limit, ASCII folding, and lowercasing.
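Roughly, registering that pipeline looks like the sketch below. The filter names come from tantivy's tokenizer module, but the exact `TextAnalyzer` construction API differs between tantivy versions, and the 40-character limit is an arbitrary example, not our actual setting:

```rust
use std::sync::Arc;

use cang_jie::{CangJieTokenizer, TokenizerOption, CANG_JIE};
use jieba_rs::Jieba;
use tantivy::tokenizer::{AsciiFoldingFilter, LowerCaser, RemoveLongFilter, TextAnalyzer};
use tantivy::Index;

fn register_tokenizer(index: &Index) {
    let analyzer = TextAnalyzer::from(CangJieTokenizer {
        worker: Arc::new(Jieba::new()),
        option: TokenizerOption::ForSearch { hmm: true },
    })
    .filter(RemoveLongFilter::limit(40)) // drop unreasonably long tokens
    .filter(AsciiFoldingFilter)          // fold accented Latin characters
    .filter(LowerCaser);                 // case-insensitive matching

    index.tokenizers().register(CANG_JIE, analyzer);
}
```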
In fact, the 9.4 million books in the raw data span a wide variety of languages, and the current tokenizer is well optimized for only one of them.
We need to build a more general-purpose tokenizer for tantivy, either by combining existing tokenizers or by using existing AI-based NLP tools.
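One possible direction for combining tokenizers, sketched below: split the input into script runs and route each run to the segmenter that handles it best. Everything here is illustrative, not existing code in this repo; the `tokenize` name, the U+4E00..U+9FFF range check, and the Latin fallback are all assumptions:

```rust
use jieba_rs::Jieba;

/// Hypothetical combined tokenizer: CJK runs go through jieba's search
/// mode, everything else is split on non-alphanumeric boundaries and
/// lowercased. The U+4E00..U+9FFF check is a deliberate simplification;
/// a real implementation would use full Unicode script data.
fn tokenize(jieba: &Jieba, text: &str) -> Vec<String> {
    let is_cjk = |c: char| ('\u{4E00}'..='\u{9FFF}').contains(&c);
    let mut tokens = Vec::new();
    let mut run = String::new();
    let mut run_is_cjk = false;

    // Walk the text, accumulating maximal runs of CJK or non-CJK
    // characters; a '\0' sentinel flushes the final run.
    for c in text.chars().chain(std::iter::once('\0')) {
        let cjk = is_cjk(c);
        if c != '\0' && cjk == run_is_cjk {
            run.push(c);
            continue;
        }
        if !run.is_empty() {
            if run_is_cjk {
                // Search mode also emits overlapping sub-words.
                tokens.extend(
                    jieba.cut_for_search(&run, true).into_iter().map(String::from),
                );
            } else {
                tokens.extend(
                    run.split(|ch: char| !ch.is_alphanumeric())
                        .filter(|w| !w.is_empty())
                        .map(str::to_lowercase),
                );
            }
            run.clear();
        }
        run_is_cjk = cjk;
        if c != '\0' {
            run.push(c);
        }
    }
    tokens
}
```

With this, `tokenize(&Jieba::new(), "Hello, 世界!")` should yield something like `["hello", "世界"]`: the Latin run is lowercased and split on punctuation, while the CJK run is segmented by jieba.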