Implement a Japanese specialized Normalizer
ManyTheFish opened this issue · 8 comments
Today, there is no specialized normalizer for the Japanese language.
Drawback
Meilisearch is unable to find the hiragana version of a word with a katakana query; for instance, ダメ is also spelled 駄目 or だめ.
Technical approach
Create a new Japanese normalizer that unifies equivalent hiragana and katakana spellings.
Interesting libraries
- wana_kana seems promising for converting everything to hiragana (see the sketch below)
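As an illustration only, the unification could look roughly like the following sketch. This is not charabia's actual `Normalizer` trait, and it assumes the `wana_kana` crate's `to_hiragana` function:

```rust
// Illustrative sketch only: not charabia's actual Normalizer trait. It assumes
// the `wana_kana` crate's `to_hiragana` function, which folds katakana (and
// romaji) into hiragana while passing kanji through unchanged.
use wana_kana::to_hiragana::to_hiragana;

fn normalize_japanese(token: &str) -> String {
    // "ダメ" and "だめ" both normalize to "だめ", so a katakana query can
    // match the hiragana spelling of the same word.
    to_hiragana(token)
}

fn main() {
    assert_eq!(normalize_japanese("ダメ"), "だめ");
    assert_eq!(normalize_japanese("だめ"), "だめ");
}
```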
Files expected to be modified
Misc
Related to product#532
Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurring rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your contribution! 🤝
Update: Copied to #149

Limitations

> ... for instance, ダメ is also spelled 駄目 or だめ

> ... wana_kana seems promising for converting everything to hiragana
After some experiments and checking convert options, it seems like wana_kana does not support converting Kanji to Hiragana or Romaji. For example:
- `to_hiragana("ダメ駄目だめ")` returns `"だめ駄目だめ"`
- `to_romaji("ダメ駄目だめ")` returns `"dame駄目dame"`
Is it okay if the normalizer only converts katakana to hiragana?
Hey @choznerol, the conversion from kanji to hiragana is not straightforward; that's why none of the libraries I found support it.
However, converting katakana to hiragana is a great enhancement!
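For illustration, a katakana-only fold does not even need an external crate. The following is a minimal sketch of the idea (an assumption, not charabia's actual implementation), relying on the fixed code-point offset between the two kana blocks and leaving kanji untouched:

```rust
/// Minimal sketch of a katakana-to-hiragana fold (illustrative only, not
/// charabia's implementation). Katakana U+30A1..=U+30F6 sits exactly 0x60
/// above hiragana U+3041..=U+3096, so folding is a fixed code-point shift;
/// kanji and everything else (including the prolonged sound mark ー and
/// half-width katakana) is left as-is.
fn katakana_to_hiragana(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{30A1}'..='\u{30F6}' => char::from_u32(c as u32 - 0x60).unwrap_or(c),
            _ => c,
        })
        .collect()
}

fn main() {
    // Katakana is folded, the kanji 駄目 is untouched.
    assert_eq!(katakana_to_hiragana("ダメ駄目だめ"), "だめ駄目だめ");
}
```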
Hi @ManyTheFish @choznerol ,
I think that in Japanese, excessive normalization of katakana and hiragana can have the opposite effect and create a lot of noise.
If you wish to treat these hiragana and katakana tokens identically, it is common practice to register the required synonyms for that business domain.
https://docs.meilisearch.com/learn/configuration/synonyms.html
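For example, a synonyms object along these lines (the entries are only an illustration; the exact list depends on the domain) would make the three spellings of ダメ match each other:

```json
{
  "だめ": ["ダメ", "駄目"],
  "ダメ": ["だめ", "駄目"],
  "駄目": ["だめ", "ダメ"]
}
```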
Regarding the normalization of Japanese characters, I think what I wrote in this comment is more important:
#139 (comment)
What do you think?
Thank you @mosuka for your feedback, so let's be cautious.
I think we will disable this feature on Meilisearch by default and make a prototype enabling it to gather some feedback. 🤔
On my side, I'll investigate more about the pros and cons of doing this transliteration.
But keep in mind that we are in an IR context and not a translation context; sometimes it is better to lose precision in favor of higher recall.
About your other issue, I started a redesign to implement a pre-normalization; however, it's not an easy task, mainly if you want good highlighting in Meilisearch. 😅
@ManyTheFish
Thank you for your reply.
The above comment is just my personal opinion, and I agree with you.
It would be helpful if users could make a choice. 😄
Thanks!
Thanks @mosuka @ManyTheFish for the discussion. I am afraid I don't have enough Japanese/tokenization knowledge to give input 😅.
> I think we will disable this feature on Meilisearch by default and make a prototype enabling it to gather some feedback. 🤔
Not sure exactly how this will be implemented, but please let me know if I should also address the "disable by default" part in #149. And of course, if after reconsideration we think #149 is actually too risky, please don't hesitate to close it or leave it pending.
@choznerol
Thank you for your feedback. 😄
It's alright. No worries.
That is my personal opinion and I am sure there are many who would welcome your PR.
Thanks! 😄
@mosuka @choznerol,
we will merge the PR, but we have to add a feature flag allowing us to activate or deactivate this normalizer at compile time. Instead of depending on #[cfg(feature = "japanese")], this normalizer should depend on a new flag, #[cfg(feature = "japanese-transliteration")].
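For reference, the compile-time gating could look roughly like this (a sketch only; the function name is hypothetical, and the `japanese-transliteration` feature would also have to be declared under `[features]` in Cargo.toml):

```rust
// Sketch of the compile-time gate; `fold_kana` is a hypothetical name used
// only for illustration, and the `japanese-transliteration` feature must be
// declared in Cargo.toml.
#[cfg(feature = "japanese-transliteration")]
fn fold_kana(token: &str) -> String {
    // With the feature enabled, katakana is folded into hiragana.
    token
        .chars()
        .map(|c| match c {
            '\u{30A1}'..='\u{30F6}' => char::from_u32(c as u32 - 0x60).unwrap_or(c),
            _ => c,
        })
        .collect()
}

#[cfg(not(feature = "japanese-transliteration"))]
fn fold_kana(token: &str) -> String {
    // With the feature disabled, tokens pass through untouched.
    token.to_string()
}

fn main() {
    println!("{}", fold_kana("ダメ"));
}
```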
I requested this last change on your PR @choznerol, everything else is good and should be merged!
Thanks to both of you!