andreihar/taibun

Better zh_TW and zh_CN conversion

Closed this issue · 9 comments

Thank you for making this! As a native Hokkien speaker I find it very professionally done.

However, when converting between zh_TW and zh_* (to_traditional and to_simplified), the context (the word and the sentence a character appears in) should be considered. Simple char-to-char mapping can be problematic in some cases.

https://github.com/BYVoid/OpenCC This library seems to be better at handling the subtleties of conversion.

Thank you very much for your kind words!

Indeed, the current Simplified-to-Traditional converter doesn't handle cases where a single Simplified character maps to multiple Traditional characters. I've modified both the conversion dataset and the codebase. When tested on the Taibun dataset, accuracy improved by 10% (2.17% higher than OpenCC's conversion), and it is currently 32% more efficient than OpenCC's conversion.

I'll think about how to further boost efficiency and I plan to release the new version by week's end at the latest.

I deeply appreciate your valuable feedback!
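To illustrate the one-to-many case described above, here is a minimal sketch of a context-aware conversion that tries the longest matching phrase first and falls back to a per-character table. It is not Taibun's or OpenCC's actual implementation; the function name `to_traditional_sketch` and both tables are hypothetical and contain only a handful of entries chosen for the example.

```python
# Hypothetical, tiny lookup tables for illustration only.
# Phrase entries resolve characters that have more than one Traditional form.
PHRASES = {
    "头发": "頭髮",  # 发 -> 髮 (hair)
    "发现": "發現",  # 发 -> 發 (to discover)
    "皇后": "皇后",  # 后 stays 后 (empress)
    "后来": "後來",  # 后 -> 後 (afterwards)
}
# Fallback used when no phrase matches: one default form per character.
CHARS = {"头": "頭", "发": "發", "现": "現", "后": "後", "来": "來"}


def to_traditional_sketch(text, phrases=PHRASES, chars=CHARS):
    """Convert Simplified to Traditional using longest-phrase-first matching."""
    max_len = max(map(len, phrases), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate phrase starting at position i, down to 2 chars.
        for n in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in phrases:
                out.append(phrases[chunk])
                i += n
                break
        else:
            # No phrase matched: map the single character, or keep it unchanged.
            out.append(chars.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(to_traditional_sketch("头发"))
print(to_traditional_sketch("发现"))
```

With a plain char-to-char table, 发 would always become the same Traditional character in both words. The phrase table is what lets the surrounding word decide.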

Thank you Andrei!

How do you measure efficiency, is it the execution time of the function?

Yes, I measure the time it takes to convert all items in words.json from Simplified to Traditional. The converter I've developed is designed to handle only the characters found in words.json rather than all Chinese characters, which accounts for its faster execution.
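A timing comparison of this kind could be sketched roughly as follows. This is an assumption about the general approach, not the actual benchmark used for Taibun: `time_conversion` and `relative_speedup` are hypothetical helpers, the sample list stands in for the items of words.json (whose structure is not described here), and "X% more efficient" is read as a relative reduction in wall-clock time.

```python
import time


def time_conversion(convert, items, repeats=5):
    """Return the best wall-clock time (seconds) to convert every item once."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for item in items:
            convert(item)
        best = min(best, time.perf_counter() - start)
    return best


def relative_speedup(baseline_seconds, candidate_seconds):
    """Percentage by which the candidate is faster than the baseline."""
    return (baseline_seconds - candidate_seconds) / baseline_seconds * 100


# Stand-in data and converters; a real run would load the word list from
# words.json and pass the two converters being compared.
sample_items = ["头发", "发现", "后来"] * 1000
baseline = time_conversion(lambda s: "".join(reversed(s)), sample_items)
candidate = time_conversion(lambda s: s[::-1], sample_items)
print(f"baseline: {baseline:.6f}s, candidate: {candidate:.6f}s")
```

Taking the best of several repeats reduces noise from other processes. Both converters should be run on the same list so the numbers are comparable.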

@andreihar Thank you Andrei!

I made a simple Gradio app to make it easier for non-technical people to use taibun: https://huggingface.co/spaces/tddschn/taibun-converter. Do you think you could include it in your README?

Sorry for the late reply! It seems GitHub doesn't notify about messages in closed issues.

The live demo of Taibun can currently be accessed at https://taibun.vercel.app/. I plan to change domains for all my web projects very soon, which is why the README doesn't link to it yet. I hope I'll get to it in the near future.

I currently live in the Metro Vancouver area, so I have quite a lot of Taiwanese friends. Besides that, the main grammar resource I use is Taiwanese Grammar: A Concise Reference by Philip T. Lin. It's written in English and explains many grammar points by comparing them with both English and Mandarin grammar, so it makes it very easy to understand the Taiwanese language.

When it comes to Written Taiwanese, pretty much nobody knows it since in schools Taiwanese is taught primarily as a spoken language. When I ask my friends to translate something into Taiwanese, they will usually use iTaigi and the Taiwanese Ministry of Education Dictionary to find Chinese characters for Taiwanese words.

Thank you Andrei!