nextapps-de/flexsearch

Does this module support CJK word splitting?

Closed this issue · 5 comments

dzcpy commented

For example, in Chinese, 一个单词 ("a word") consists of two words: 一个 and 单词. How can I make sure I get the correct result when searching for 单词?

ts-thomas commented

That's a good point. Actually, I don't have enough linguistic knowledge to provide a solution to this issue, but I will take it into account.

dzcpy commented

@ts-thomas Thanks for your quick response.
You may take https://github.com/codepiano/lunr.js as a reference. It's basically a lunr.js fork but with Chinese word splitting support.

ts-thomas commented

Thanks for the reference, I will take a look into it. I also did a little research, and maybe this workaround could help you.

Set a custom tokenizer which fits your needs, e.g.:

var index = FlexSearch.create({
    encode: false,
    tokenize: function(str){
        // split on runs of ASCII characters (spaces, commas, etc.),
        // so each non-ASCII word becomes its own token
        return str.split(/[\x00-\x7F]+/);
    }
});
index.add(0, "서울시가 잠이 든 시간에 아무 말, 미뤄, 미뤄");
var results = index.search("든");
results = index.search("시간에");

You can also pass a custom encoder function to apply some linguistic transformations. I would be happy to get some feedback from you.
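
Here is a minimal sketch of what that could look like, assuming the encode option accepts a function the same way tokenize does above (the encoder names and the normalization rules below are just illustrative):

var index = FlexSearch.create({
    // hypothetical custom encoder: lowercase the input and strip
    // basic punctuation before indexing and before searching
    encode: function(str){
        return str.toLowerCase().replace(/[.,!?]/g, "");
    },
    tokenize: function(str){
        return str.split(/\s+/);
    }
});
index.add(0, "Hello, World!");
var results = index.search("hello world");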

@dzcpy Related to your example, the tokenizer should probably look like this:

var index = FlexSearch.create({
    encode: false,
    tokenize: function(str){
        // strip ASCII characters and index every remaining
        // character (e.g. each CJK ideograph) as its own token
        return str.replace(/[\x00-\x7F]/g, "").split("");
    }
});
index.add(0, "一个单词");
var results = index.search("单词");
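
Note that the query runs through the same tokenizer, so searching 单词 actually looks up the two single-character tokens 单 and 词, which both occur in the indexed document.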

You can use a Chinese tokenizer such as https://github.com/yanyiwu/nodejieba:
var nodejieba = require("nodejieba");
var index = FlexSearch.create({
    encode: false,
    tokenize: function(str){
        // let jieba perform the actual Chinese word segmentation
        return nodejieba.cut(str);
    }
});

index.add(1, "一个单词");
var result = index.search("单词");
console.log(result);
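
Assuming jieba's default dictionary, nodejieba.cut("一个单词") should yield the tokens 一个 and 单词, so the search for 单词 matches the indexed entry.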