nextapps-de/flexsearch

Does this module support CJK word splitting?

Closed this issue · 5 comments

dzcpy commented

For example, in Chinese, 一个单词 ("a word") consists of two words: 一个 and 单词. How can I make sure I get the correct result when searching for 单词?

ts-thomas commented

That's a good point. Actually, I don't have enough linguistic knowledge to provide a solution to this issue, but I will take it into account.

dzcpy commented

@ts-thomas Thanks for your quick response.
You may take https://github.com/codepiano/lunr.js as a reference. It's basically a lunr.js fork but with Chinese word splitting support.

ts-thomas commented

Thanks for the reference, I will take a look into it. I also did a little research, and maybe this workaround could help you.

Set a custom tokenizer which fits your needs, e.g.:

var index = FlexSearch.create({
    encode: false,
    tokenize: function(str){
        // split on runs of ASCII characters (spaces, commas, etc.),
        // so each non-ASCII word becomes its own token
        return str.split(/[\x00-\x7F]+/);
    }
});
index.add(0, "서울시가 잠이 든 시간에 아무 말, 미뤄, 미뤄");
var results = index.search("든");
results = index.search("시간에");

You can also pass a custom encoder function to apply some linguistic transformations. I would be happy to get some feedback from you.
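
Here is a minimal sketch of what that could look like, assuming the encode option accepts a function the same way tokenize does above (the encoder names and the normalization rules below are just illustrative):

var index = FlexSearch.create({
    // hypothetical custom encoder: lowercase the input and strip
    // basic punctuation before indexing and before searching
    encode: function(str){
        return str.toLowerCase().replace(/[.,!?]/g, "");
    },
    tokenize: function(str){
        return str.split(/\s+/);
    }
});
index.add(0, "Hello, World!");
var results = index.search("hello world");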

@dzcpy Related to your example, the tokenizer should probably look like this:

var index = FlexSearch.create({
    encode: false,
    tokenize: function(str){
        // strip ASCII characters and index every remaining
        // character (e.g. each CJK ideograph) as its own token
        return str.replace(/[\x00-\x7F]/g, "").split("");
    }
});
index.add(0, "一个单词");
var results = index.search("单词");
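
Note that the query runs through the same tokenizer, so searching 单词 actually looks up the two single-character tokens 单 and 词, which both occur in the indexed document.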

You can use a Chinese tokenizer such as https://github.com/yanyiwu/nodejieba:
var nodejieba = require("nodejieba");
var index = FlexSearch.create({
    encode: false,
    tokenize: function(str){
        // let jieba perform the actual Chinese word segmentation
        return nodejieba.cut(str);
    }
});

index.add(1, "一个单词");
var result = index.search("单词");
console.log(result);
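
Assuming jieba's default dictionary, nodejieba.cut("一个单词") should yield the tokens 一个 and 单词, so the search for 单词 matches the indexed entry.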