book-searcher-org/book-searcher

The tokenizer needs to be improved.

zu1k opened this issue · 2 comments

zu1k commented

We currently use cang-jie and jieba-rs as our tokenizer, which causes text in Latin-script languages to be tokenized incorrectly.

We need to combine multiple tokenizers to improve accuracy.

Known bugs:

  • case is not ignored yet (matching is case-sensitive)
  • jieba-rs is not used in search mode (see the sketch after this list)
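
For reference, jieba-rs exposes a separate search-mode cut that also emits the shorter sub-words a query is likely to contain, instead of only the longest match. A minimal sketch of the difference (assuming the jieba-rs crate API; the output shown in comments is illustrative):

```rust
use jieba_rs::Jieba;

fn main() {
    let jieba = Jieba::new();
    let text = "中华人民共和国";

    // Default cut: keeps the longest dictionary match as a single token.
    println!("{:?}", jieba.cut(text, false));
    // e.g. ["中华人民共和国"]

    // Search-mode cut: additionally emits shorter sub-words,
    // which is usually what a search index wants.
    println!("{:?}", jieba.cut_for_search(text, false));
    // e.g. ["中华", "华人", "人民", "共和", "共和国", "中华人民共和国"]
}
```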
zu1k commented

We now use cang-jie, which is based on jieba-rs, as the tokenizer for tantivy, and combine it with some other filters such as a length limit, ASCII folding, and lowercasing.
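
A rough sketch of that kind of analyzer chain, assuming the cang-jie crate and a tantivy version where `TextAnalyzer::from(...).filter(...)` is available (this is an illustration, not the project's exact code):

```rust
use std::sync::Arc;

use cang_jie::{CangJieTokenizer, TokenizerOption, CANG_JIE};
use jieba_rs::Jieba;
use tantivy::tokenizer::{AsciiFoldingFilter, LowerCaser, RemoveLongFilter, TextAnalyzer};
use tantivy::Index;

fn register_tokenizer(index: &Index) {
    let tokenizer = CangJieTokenizer {
        worker: Arc::new(Jieba::new()),
        // Unicode splits on Unicode word boundaries; ForSearch { hmm: .. }
        // would be jieba's search mode mentioned above.
        option: TokenizerOption::Unicode,
    };

    let analyzer = TextAnalyzer::from(tokenizer)
        .filter(RemoveLongFilter::limit(40)) // length limit: drop overly long tokens
        .filter(AsciiFoldingFilter)          // ASCII folding for accented Latin characters
        .filter(LowerCaser);                 // lowercase so matching is case-insensitive

    index.tokenizers().register(CANG_JIE, analyzer);
}
```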

In fact, the 9.4 million books in the raw data cover a wide variety of languages, and the current tokenizer is well optimized for only one of them.

We need to build a more general-purpose tokenizer for tantivy, either by combining existing tokenizers or by using existing AI-based NLP tools.
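
One way to combine existing tokenizers is to split the input by script first and route each run to the tokenizer that handles it best. The sketch below is hypothetical (the `tokenize_mixed` helper and the CJK range check are illustrative, not part of the project): CJK runs go to jieba's search-mode cut, and everything else is split on non-alphanumeric characters and lowercased.

```rust
use jieba_rs::Jieba;

/// Hypothetical helper: route CJK runs to jieba, split the rest on
/// non-alphanumeric characters and lowercase it.
fn tokenize_mixed(jieba: &Jieba, text: &str) -> Vec<String> {
    // Basic CJK Unified Ideographs block; a real implementation would also
    // cover the extension blocks, kana, hangul, etc.
    let is_cjk = |c: char| ('\u{4E00}'..='\u{9FFF}').contains(&c);

    // Tokenize one buffered run, depending on its script.
    let flush = |buf: &mut String, cjk: bool, out: &mut Vec<String>| {
        if buf.is_empty() {
            return;
        }
        if cjk {
            out.extend(jieba.cut_for_search(buf, false).iter().map(|s| s.to_string()));
        } else {
            out.extend(
                buf.split(|c: char| !c.is_alphanumeric())
                    .filter(|s| !s.is_empty())
                    .map(|s| s.to_lowercase()),
            );
        }
        buf.clear();
    };

    let mut tokens = Vec::new();
    let mut buf = String::new();
    let mut buf_is_cjk = false;
    for c in text.chars() {
        // When the script changes, tokenize the buffered run first.
        if !buf.is_empty() && is_cjk(c) != buf_is_cjk {
            flush(&mut buf, buf_is_cjk, &mut tokens);
        }
        buf_is_cjk = is_cjk(c);
        buf.push(c);
    }
    flush(&mut buf, buf_is_cjk, &mut tokens);
    tokens
}
```

For a mixed title such as "Rust编程 The Book", this yields lowercase Latin tokens ("rust", "the", "book") plus jieba sub-words for the CJK run, so both Latin and Chinese queries can match.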

zu1k commented

#39 Conversion between Simplified and Traditional Chinese may be a requirement, but it may also introduce some difficulties in searching, so it needs careful consideration.

Possible solution:
OpenCC for Simplified/Traditional conversion, combined with a further search-result scoring mechanism.
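
A hedged sketch of that direction, assuming the opencc-rust crate as the OpenCC binding (the crate choice and its exact API are assumptions, not something the project has settled on): normalize both the indexed text and the query to Simplified Chinese so variants of the same title collide, and leave preferring the user's original variant to a later scoring step.

```rust
// Sketch only: assumes the opencc-rust crate's OpenCC::new(DefaultConfig::T2S)
// and convert(); the real API or crate choice may differ.
use opencc_rust::{DefaultConfig, OpenCC};

fn normalize_to_simplified(text: &str) -> String {
    // T2S: Traditional -> Simplified. Apply the same conversion to both the
    // indexed fields and the user query so that variants match each other.
    let opencc = OpenCC::new(DefaultConfig::T2S).expect("load OpenCC T2S config");
    opencc.convert(text)
}

fn main() {
    // "飛鳥集" (Traditional) normalizes to "飞鸟集" (Simplified).
    println!("{}", normalize_to_simplified("飛鳥集"));
}
```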