Implement a Japanese specialized Normalizer
ManyTheFish opened this issue · 8 comments
Today, there is no specialized normalizer for the Japanese language.
Drawback
Meilisearch is unable to find the hiragana version of a word with a katakana query; for instance, ダメ is also spelled 駄目 or だめ.
Technical approach
Create a new Japanese normalizer that unifies equivalent hiragana and katakana spellings.
Interesting libraries
- wana_kana seems promising for converting everything to hiragana (see the sketch below)
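As an illustration only, the unification could look roughly like the following sketch. This is not charabia's actual `Normalizer` trait, and it assumes the `wana_kana` crate's `to_hiragana` function:

```rust
// Illustrative sketch only: not charabia's actual Normalizer trait. It assumes
// the `wana_kana` crate's `to_hiragana` function, which folds katakana (and
// romaji) into hiragana while passing kanji through unchanged.
use wana_kana::to_hiragana::to_hiragana;

fn normalize_japanese(token: &str) -> String {
    // "ダメ" and "だめ" both normalize to "だめ", so a katakana query can
    // match the hiragana spelling of the same word.
    to_hiragana(token)
}

fn main() {
    assert_eq!(normalize_japanese("ダメ"), "だめ");
    assert_eq!(normalize_japanese("だめ"), "だめ");
}
```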
Files expected to be modified
Misc
Related to product#532
Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurring rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your contribution! 🤝
Update: Copied to #149

Limitations

> ... for instance, ダメ is also spelled 駄目 or だめ

> ... wana_kana seems promising for converting everything to hiragana
After some experiments and checking convert options, it seems like wana_kana does not support converting Kanji to Hiragana or Romaji. For example:
- `to_hiragana("ダメ駄目だめ")` returns `"だめ駄目だめ"`
- `to_romaji("ダメ駄目だめ")` returns `"dame駄目dame"`
Is it okay if the normalizer only converts katakana to hiragana?
Hey @choznerol, the conversion from kanji to hiragana is not straightforward; that's why none of the libraries I found support it.
However, converting katakana to hiragana is a great enhancement!
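For illustration, a katakana-only fold does not even need an external crate. The following is a minimal sketch of the idea (an assumption, not charabia's actual implementation), relying on the fixed code-point offset between the two kana blocks and leaving kanji untouched:

```rust
/// Minimal sketch of a katakana-to-hiragana fold (illustrative only, not
/// charabia's implementation). Katakana U+30A1..=U+30F6 sits exactly 0x60
/// above hiragana U+3041..=U+3096, so folding is a fixed code-point shift;
/// kanji and everything else (including the prolonged sound mark ー and
/// half-width katakana) is left as-is.
fn katakana_to_hiragana(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{30A1}'..='\u{30F6}' => char::from_u32(c as u32 - 0x60).unwrap_or(c),
            _ => c,
        })
        .collect()
}

fn main() {
    // Katakana is folded, the kanji 駄目 is untouched.
    assert_eq!(katakana_to_hiragana("ダメ駄目だめ"), "だめ駄目だめ");
}
```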
Hi @ManyTheFish @choznerol ,
I think that in Japanese, excessive normalization of katakana and hiragana can have the opposite effect and create a lot of noise.
If you wish to treat these hiragana and katakana tokens identically, it is common practice to register the required synonyms for that business domain.
https://docs.meilisearch.com/learn/configuration/synonyms.html
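For example, a synonyms object along these lines (the entries are only an illustration; the exact list depends on the domain) would make the three spellings of ダメ match each other:

```json
{
  "だめ": ["ダメ", "駄目"],
  "ダメ": ["だめ", "駄目"],
  "駄目": ["だめ", "ダメ"]
}
```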
Regarding the normalization of Japanese characters, I think what I wrote in this comment is more important:
#139 (comment)
What do you think?
Thank you @mosuka for your feedback, so let's be cautious.
I think we will disable this feature on Meilisearch by default and make a prototype enabling it to gather some feedback. 🤔
On my side, I'll investigate more about the pros and cons of doing this transliteration.
But keep in mind that we are in an IR context and not a translation context; sometimes it is better to lose precision in favor of higher recall.
About your other issue, I started a redesign to implement a pre-normalization; however, it's not an easy task, mainly if you want good highlighting in Meilisearch. 😅
@ManyTheFish
Thank you for your reply.
The above comment is just my personal opinion, and I agree with you.
It would be helpful if users could make a choice. 😄
Thanks!
Thanks @mosuka @ManyTheFish for the discussion. I am afraid I don't have enough Japanese/tokenization knowledge to give input 😅.
> I think we will disable this feature on Meilisearch by default and make a prototype enabling it to gather some feedback. 🤔
Not sure exactly how this will be implemented, but please let me know if I should also address the "disable by default" part in #149. And of course, if after reconsideration we think #149 is actually too risky, please don't hesitate to close it or leave it pending.
@choznerol
Thank you for your feedback. 😄
It's alright. No worries.
That is my personal opinion and I am sure there are many who would welcome your PR.
Thanks! 😄
@mosuka @choznerol,
we will merge the PR, but we have to add a feature flag allowing us to activate or deactivate this normalizer at compile time. Instead of depending on #[cfg(feature = "japanese")], this normalizer should depend on a new flag, #[cfg(feature = "japanese-transliteration")].
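For reference, the compile-time gating could look roughly like this (a sketch only; the function name is hypothetical, and the `japanese-transliteration` feature would also have to be declared under `[features]` in Cargo.toml):

```rust
// Sketch of the compile-time gate; `fold_kana` is a hypothetical name used
// only for illustration, and the `japanese-transliteration` feature must be
// declared in Cargo.toml.
#[cfg(feature = "japanese-transliteration")]
fn fold_kana(token: &str) -> String {
    // With the feature enabled, katakana is folded into hiragana.
    token
        .chars()
        .map(|c| match c {
            '\u{30A1}'..='\u{30F6}' => char::from_u32(c as u32 - 0x60).unwrap_or(c),
            _ => c,
        })
        .collect()
}

#[cfg(not(feature = "japanese-transliteration"))]
fn fold_kana(token: &str) -> String {
    // With the feature disabled, tokens pass through untouched.
    token.to_string()
}

fn main() {
    println!("{}", fold_kana("ダメ"));
}
```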
I requested this last change on your PR @choznerol, everything else is good and should be merged!
Thanks to both of you!