Does this module support CJK word splitting?
For example, in Chinese, 一个单词 is made up of two words. How can I make sure I get the correct result when searching for 单词?
That's a good point. I don't have enough linguistic knowledge to provide a solution to this issue myself, but I will take it into account.
@ts-thomas Thanks for your quick response.
You may take https://github.com/codepiano/lunr.js as a reference. It's basically a lunr.js fork with Chinese word splitting support.
Thanks for the reference, I will take a look at it. I also did a little research, and maybe this workaround could help you.
Set a custom tokenizer that fits your needs, e.g.:
```js
var index = FlexSearch.create({
    encode: false, // disable the built-in encoder
    tokenize: function(str){
        // split on runs of ASCII characters (spaces, punctuation, latin),
        // so the remaining CJK segments become the tokens
        return str.split(/[\x00-\x7F]+/);
    }
});

index.add(0, "서울시가 잠이 든 시간에 아무 말, 미뤄, 미뤄");

var results = index.search("든");   // matches the token "든"
results = index.search("시간에");    // matches the token "시간에"
```
You can also pass a custom encoder function to apply some linguistic transformations. I would be happy to get some feedback from you.
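For illustration, here is a minimal sketch of what a custom encoder could look like (the lowercasing/trimming below is just a placeholder transformation, not something FlexSearch prescribes):

```js
var index = FlexSearch.create({
    tokenize: function(str){
        return str.split(/[\x00-\x7F]+/);
    },
    // custom encoder: applies a normalization to the text before it is
    // indexed and searched (the transformation here is only an example)
    encode: function(str){
        return str.toLowerCase().trim();
    }
});
```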
@dzcpy regarding your example, the tokenizer should probably look like this:
```js
var index = FlexSearch.create({
    encode: false,
    tokenize: function(str){
        // strip ASCII characters, then split into individual CJK characters
        return str.replace(/[\x00-\x7F]/g, "").split("");
    }
});

index.add(0, "一个单词");

var results = index.search("单词");
```
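With this tokenizer every CJK character is indexed on its own; the query 单词 is split the same way, so it matches entries containing both characters.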
You can use a Chinese tokenizer such as https://github.com/yanyiwu/nodejieba:
```js
var nodejieba = require("nodejieba");

var index = FlexSearch.create({
    encode: false,
    tokenize: function(str){
        // let jieba segment the text into Chinese words
        return nodejieba.cut(str);
    }
});

index.add(1, "一个单词");

var result = index.search("单词");
console.log(result);
```
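Since jieba performs dictionary-based segmentation, both the indexed text and the query get cut at word boundaries (presumably 一个单词 becomes 一个 / 单词), so the search for 单词 should return an array containing the id 1.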