meilisearch/charabia

Implement Jyutping normalizer

Closed issue · 1 comment

Today Meilisearch normalizes Chinese characters by converting traditional characters into simplified ones.

drawback

This normalization process doesn't seem to enhance the recall of Meilisearch.

enhancement

Following the official discussion about Chinese support in Meilisearch, it is more relevant to normalize Chinese characters by transliterating them into a phonological version.
To get accurate phonology for Cantonese, we should normalize Chinese characters into Jyutping using the kCantonese dictionary of the Unihan database.
We should find an efficient way to normalize characters, so the dictionary may need to be reformatted.
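As a rough illustration of the idea, here is a minimal standalone sketch in Rust. It is not charabia's actual Normalizer API. It assumes the dictionary keeps the Unihan text layout (`U+XXXX<TAB>kCantonese<TAB>reading`); the function names `parse_kcantonese` and `to_jyutping` are hypothetical, and keeping only the first listed reading is a simplification chosen for the example.

```rust
use std::collections::HashMap;

/// Build a char -> Jyutping map from Unihan-style lines.
/// Assumed line layout: `U+9999<TAB>kCantonese<TAB>hoeng1`.
fn parse_kcantonese(data: &str) -> HashMap<char, String> {
    let mut map = HashMap::new();
    for line in data.lines() {
        // Skip comments and blank lines.
        if line.starts_with('#') || line.trim().is_empty() {
            continue;
        }
        let fields: Vec<&str> = line.split('\t').collect();
        // Ignore malformed lines and every field other than kCantonese.
        if fields.len() < 3 || fields[1] != "kCantonese" {
            continue;
        }
        // Decode the `U+XXXX` code point into a char.
        let ch = fields[0]
            .strip_prefix("U+")
            .and_then(|hex| u32::from_str_radix(hex, 16).ok())
            .and_then(char::from_u32);
        // Simplification: keep only the first listed reading.
        let reading = fields[2].split_whitespace().next();
        if let (Some(ch), Some(reading)) = (ch, reading) {
            map.insert(ch, reading.to_string());
        }
    }
    map
}

/// Replace every character found in the map by its reading,
/// separated by spaces; other characters are kept unchanged.
fn to_jyutping(text: &str, dict: &HashMap<char, String>) -> String {
    let mut out = String::new();
    for c in text.chars() {
        match dict.get(&c) {
            Some(reading) => {
                if !out.is_empty() && !out.ends_with(' ') {
                    out.push(' ');
                }
                out.push_str(reading);
                out.push(' ');
            }
            None => out.push(c),
        }
    }
    out.trim_end().to_string()
}
```

Parsing the text file on every start would be wasteful, which is one reason the dictionary may be worth reformatting ahead of time, for example into a precomputed map embedded in the binary.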

Files expected to be modified

Misc

Related to product#503
Original source of the dictionary: Unihan.zip in https://unicode.org/Public/UNIDATA/

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the usual rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your contribution! 🤝

Having several concurrent normalizations is not possible for now; we will have to rework the API before allowing it.
Until the rework, the Jyutping and Cangjie normalizer implementations are delayed in favor of the Pinyin normalizer.