Unable to get results from using normalizer
ArupDas15 opened this issue · 2 comments
Hello,
I want to perform Unicode text normalization for Bengali. For example, the words মনীন্দ্র and মণীন্দ্র differ in their Unicode code points as follows (note the ন versus ণ in the second character of each word):
WORD 1: মনীন্দ্র
[('ম', 2478), ('ন', 2472), ('ী', 2496), ('ন', 2472), ('্', 2509), ('দ', 2470), ('্', 2509), ('র', 2480)]
WORD 2: মণীন্দ্র
[('ম', 2478), ('ণ', 2467), ('ী', 2496), ('ন', 2472), ('্', 2509), ('দ', 2470), ('্', 2509), ('র', 2480)]
When I tried to use this library (https://github.com/csebuetnlp/normalizer) for normalization, the Unicode values of the input text were unchanged after normalizing. Kindly help.
Hi, I am not sure what the issue is here. These two strings are already in their standard form. Are you expecting these two strings to be equal after normalization? Since "ন" and "ণ" are two distinct characters of the Bengali alphabet, we can't arbitrarily replace one with the other. For example, the words "মন" and "মণ" have entirely different meanings, so we can't normalize both of them to the same string.
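To illustrate the point, here is a minimal sketch that rebuilds the two words from the code points quoted in the question and normalizes them. It uses Python's standard `unicodedata` module (NFC) as a stand-in, not this library, so it only shows the general principle; the helper `code_points` is a hypothetical name introduced for the example.

```python
import unicodedata

def code_points(text):
    """Return (character, code point) pairs, like the lists in the question."""
    return [(ch, ord(ch)) for ch in text]

# Rebuilt from the code points quoted in the question.
word1 = "".join(map(chr, [2478, 2472, 2496, 2472, 2509, 2470, 2509, 2480]))
word2 = "".join(map(chr, [2478, 2467, 2496, 2472, 2509, 2470, 2509, 2480]))

# Standard Unicode normalization (NFC); an assumption standing in for the library.
nfc1 = unicodedata.normalize("NFC", word1)
nfc2 = unicodedata.normalize("NFC", word2)

print(code_points(nfc1))
print(code_points(nfc2))
print(nfc1 == nfc2)  # the second character still differs: 2472 vs 2467
```

Both words come back unchanged, because the only difference between them is a genuinely different letter rather than an alternative encoding of the same one.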
This library is useful for converting same-looking Unicode strings to a standard format. For example, consider the following two words with their Unicode representations:
বড় -> [('ব', 2476), ('ড', 2465), ('়', 2492)]
বড় -> [('ব', 2476), ('ড়', 2524)]
As you can see, they look the same despite having different representations. You can convert them to their standardized format using this library, i.e. normalize("বড়") == normalize("বড়")
Hope this answers your query.
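The same idea can be sketched with Python's standard `unicodedata` module. This is an assumption made for illustration: it is not this library's implementation, and the library may apply additional Bengali-specific rules. The strings are written with escapes so that the two encodings stay distinct when copied.

```python
import unicodedata

# Two encodings of the same visible word, using the code points listed above.
decomposed = "\u09ac\u09a1\u09bc"  # BA + DDA + NUKTA: 2476, 2465, 2492
precomposed = "\u09ac\u09dc"       # BA + RRA:         2476, 2524

# As raw strings they are different sequences of code points.
print(decomposed == precomposed)

# After normalizing both to the same form (NFC here), they compare equal.
same = (unicodedata.normalize("NFC", decomposed)
        == unicodedata.normalize("NFC", precomposed))
print(same)
```

Comparing normalized strings rather than raw ones is what makes lookups and equality checks robust to these encoding differences.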
Hi, thank you for your reply. I have understood my mistake after reading your response. I was considering both to be the same character with different Unicode representations. Thank you very much.