Implement Jyutping normalizer
Closed this issue · 1 comments
Today Meilisearch normalizes Chinese characters by converting traditional characters into simplified ones.
Drawback
This normalization process doesn't seem to enhance the recall of Meilisearch.
Enhancement
Following the official discussion about Chinese support in Meilisearch, it is more relevant to normalize Chinese characters by transliterating them into a phonological form.
To get accurate phonology for Cantonese, we should normalize Chinese characters into Jyutping using the kCantonese dictionary of the Unihan database.
We should find an efficient way to normalize characters, so the dictionary may need to be reformatted.
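As a rough illustration of the idea, here is a minimal sketch that loads kCantonese entries into a lookup table and transliterates text character by character. It assumes the tab-separated `U+XXXX<TAB>kCantonese<TAB>readings` layout used by the Unihan data files. The function names `parse_kcantonese` and `to_jyutping` are hypothetical, this is not the existing Normalizer API, and keeping only the first listed reading is a simplifying assumption.

```rust
use std::collections::HashMap;

/// Hypothetical helper: parse Unihan-style lines into a map from a
/// character to its first listed kCantonese (Jyutping) reading.
fn parse_kcantonese(data: &str) -> HashMap<char, String> {
    let mut map = HashMap::new();
    for line in data.lines() {
        // Skip comment lines and blank lines.
        if line.starts_with('#') || line.trim().is_empty() {
            continue;
        }
        // Expected fields: code point, field name, value.
        let mut fields = line.split('\t');
        let (Some(cp), Some(field), Some(value)) =
            (fields.next(), fields.next(), fields.next())
        else {
            continue;
        };
        if field != "kCantonese" {
            continue;
        }
        // Turn "U+4E00" into the character it denotes.
        let Some(hex) = cp.strip_prefix("U+") else { continue };
        let Some(c) = u32::from_str_radix(hex, 16).ok().and_then(char::from_u32) else {
            continue;
        };
        // Assumption: keep only the first reading when several are listed.
        if let Some(first) = value.split_whitespace().next() {
            map.insert(c, first.to_string());
        }
    }
    map
}

/// Hypothetical helper: replace every character found in the dictionary
/// with its Jyutping reading; other characters pass through unchanged.
fn to_jyutping(text: &str, dict: &HashMap<char, String>) -> String {
    let mut out = String::new();
    for c in text.chars() {
        match dict.get(&c) {
            Some(reading) => out.push_str(reading),
            None => out.push(c),
        }
    }
    out
}
```

Parsing the text file at startup is the simplest option but probably not the most efficient one. Reformatting the dictionary ahead of time, for example into a sorted table or a precompiled map, would avoid that cost, which is the kind of reformatting this issue leaves open.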
Files expected to be modified
Misc
related to product#503
original source of the dictionary: unihan.zip in https://unicode.org/Public/UNIDATA/
Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your contribution! 🤝
Having several concurrent normalizations is not possible for now; we will have to rework the API before allowing it.
Until the rework, the Jyutping and Cangjie normalizer implementations are delayed in favor of the Pinyin normalizer.