Compound word with nakaguro in it
mhko opened this issue · 2 comments
Thanks for the library.
I was testing compound words that contain the nakaguro character (・) and noticed that the compound word 'コカ・コーラ' is tokenized to a single term <コカ・コーラ> in Search mode, whereas another such word, 'アイス・キューブ', is tokenized into its components <アイス>, <キューブ>. Does the former produce a single token because it's a trademark, or could this be a bug? Ultimately, I'd like to find documents that contain <コカ・コーラ> using the search term <コーラ>.
Thanks in advance for your help!
Which dictionary are you using?
If it is the default one (IPADic), this is not a bug, because the dictionary contains the following entry:
コカ・コーラ,1288,1288,3891,名詞,固有名詞,一般,*,*,*,コカ・コーラ,コカコーラ,コカコーラ
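For readers unfamiliar with the IPADic CSV layout, here is a rough sketch of how the entry above breaks down into fields (surface form, left and right context IDs, word cost, four part-of-speech levels, conjugation type and form, base form, reading, pronunciation). The class name IpadicEntry is made up for illustration and is not part of Kuromoji, and the plain comma split is a simplification that assumes no quoted fields.

```java
// Hypothetical helper (not part of Kuromoji): splits one IPADic CSV line
// into its 13 fields. Assumes the line contains no quoted commas.
final class IpadicEntry {
    final String surface;        // field 0: surface form
    final int leftId;            // field 1: left context ID
    final int rightId;           // field 2: right context ID
    final int cost;              // field 3: word cost
    final String[] pos;          // fields 4-7: part-of-speech levels 1 to 4
    final String baseForm;       // field 10 (fields 8-9 are conjugation type/form)
    final String reading;        // field 11
    final String pronunciation;  // field 12

    private IpadicEntry(String[] f) {
        surface = f[0];
        leftId = Integer.parseInt(f[1]);
        rightId = Integer.parseInt(f[2]);
        cost = Integer.parseInt(f[3]);
        pos = new String[] {f[4], f[5], f[6], f[7]};
        baseForm = f[10];
        reading = f[11];
        pronunciation = f[12];
    }

    static IpadicEntry parse(String line) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        if (fields.length != 13) {
            throw new IllegalArgumentException("expected 13 fields, got " + fields.length);
        }
        return new IpadicEntry(fields);
    }

    public static void main(String[] args) {
        IpadicEntry e = parse(
            "コカ・コーラ,1288,1288,3891,名詞,固有名詞,一般,*,*,*,コカ・コーラ,コカコーラ,コカコーラ");
        System.out.println(e.surface + " " + String.join("-", e.pos) + " cost=" + e.cost);
    }
}
```

The part-of-speech fields 名詞,固有名詞,一般 mark the entry as a general proper noun, which is why the whole string is a known word to the tokenizer.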
Search mode works mainly on compound words that are not in the dictionary. In fact, アイスキューブ is not in IPADic.
Again, you can customize the tokenizer by adding words to the dictionary if you want a quick fix.
There's also a builder option for splitting unknown words on nakaguro that can be used as follows:
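import com.atilika.kuromoji.TokenizerBase;
import com.atilika.kuromoji.ipadic.Tokenizer;

// Search-mode tokenizer that also splits unknown words on nakaguro (・)
Tokenizer tokenizer = new Tokenizer.Builder()
    .isSplitOnNakaguro(true)
    .mode(TokenizerBase.Mode.SEARCH)
    .build();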
This option only applies to unknown words. In combination with search mode, though, perhaps it would make more sense to split on nakaguro in all cases.
In your case, as pointed out by Fujinuma-san, "コカ・コーラ" is a known word, so the option will not split it.
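Until something like that exists in the tokenizer, one possible workaround is to post-process token surfaces yourself and emit the nakaguro-separated parts as extra index terms. The following is only a sketch of that idea in plain Java: NakaguroSplitter and expand are hypothetical names, nothing here calls Kuromoji, and wiring it into your indexing pipeline is left as an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing step (not a Kuromoji API): given a token
// surface, return the surface itself plus its nakaguro-separated parts.
final class NakaguroSplitter {
    // KATAKANA MIDDLE DOT (nakaguro)
    private static final char NAKAGURO = '\u30FB';

    static List<String> expand(String surface) {
        List<String> terms = new ArrayList<>();
        terms.add(surface); // always keep the original token
        if (surface.indexOf(NAKAGURO) < 0) {
            return terms;   // nothing to split
        }
        StringBuilder part = new StringBuilder();
        for (int i = 0; i < surface.length(); i++) {
            char c = surface.charAt(i);
            if (c == NAKAGURO) {
                // close the current part, skipping empty ones
                if (part.length() > 0) {
                    terms.add(part.toString());
                    part.setLength(0);
                }
            } else {
                part.append(c);
            }
        }
        if (part.length() > 0) {
            terms.add(part.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(expand("コカ・コーラ"));
        System.out.println(expand("アイス"));
    }
}
```

With this, expand("コカ・コーラ") returns the original surface followed by コカ and コーラ, so a query for コーラ can match, while surfaces without a nakaguro come back unchanged.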