Syllable segmentation tool for Myanmar language (Burmese) by Ye.

Primary language: HTML. License: Apache License 2.0 (Apache-2.0).

sylbreak

Myanmar language (Burmese) README

Syllable segmentation is an important preprocessing step for many natural language processing (NLP) tasks such as romanization, transliteration and grapheme-to-phoneme (g2p) conversion.

"sylbreak" is a syllable segmentation tool for Myanmar language (Burmese) text encoded with Unicode (e.g. Myanmar3, Padauk). I use only one short regular expression (RE), as follows:

$line =~ s/((?<!$ssSymbol)[$myConsonant](?![$aThat$ssSymbol])|[$enChar$otherChar])/$sep$1/g;

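The key condition is: a consonant that does not follow a subscript symbol AND is not followed by an a-That character or a subscript symbol. The substitution inserts the separator $sep before every such consonant, and before every character in $enChar or $otherChar.

The variables are declared as follows: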

my $myConsonant = "က-အ";
my $enChar = "a-zA-Z0-9";
my $otherChar = "ဣဤဥဦဧဩဪဿ၌၍၏၀-၉၊။!-\/:-\@\[-`{-~\\s";
my $ssSymbol = "္";   # U+1039, the subscript (stacking) symbol
my $aThat = "်";      # U+103A, the a-That character
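
As a rough illustration, the same substitution can be sketched in Python. This is a minimal port written for this README under stated assumptions, not the code of sylbreak.py: it assumes the subscript symbol is U+1039 and the a-That character is U+103A, and the names `PATTERN` and `syl_break` are hypothetical. Characters are written as escapes so that the combining marks stay visible.

```python
import re

# Character classes mirroring the Perl variables.
MY_CONSONANT = "\u1000-\u1021"      # က-အ
EN_CHAR = "a-zA-Z0-9"
OTHER_CHAR = (
    "\u1023\u1024\u1025\u1026\u1027\u1029\u102A\u103F"  # ဣ ဤ ဥ ဦ ဧ ဩ ဪ ဿ
    "\u104C\u104D\u104F"            # ၌ ၍ ၏
    "\u1040-\u1049\u104A\u104B"     # digits ၀-၉, then ၊ and ။
    "!-/:-@\\[-`{-~\\s"             # ASCII punctuation ranges and whitespace
)
SS_SYMBOL = "\u1039"  # assumption: subscript (stacking) symbol
A_THAT = "\u103A"     # assumption: a-That character

# A consonant that is not preceded by the subscript symbol and not followed
# by a-That or the subscript symbol, OR any character from the other classes.
PATTERN = re.compile(
    "((?<!" + SS_SYMBOL + ")[" + MY_CONSONANT + "](?![" + A_THAT + SS_SYMBOL + "])"
    "|[" + EN_CHAR + OTHER_CHAR + "])"
)

def syl_break(text, sep="|"):
    """Insert sep before every position matched by the sylbreak-style RE."""
    return PATTERN.sub(lambda m: sep + m.group(1), text)

# မြန်မာစာ: the န is followed by a-That, so it stays with the first syllable.
segmented = syl_break("\u1019\u103C\u1014\u103A\u1019\u102C\u1005\u102C")
# ဗုဒ္ဓ: the stacked consonants around the subscript symbol are not split.
stacked = syl_break("\u1017\u102F\u1012\u1039\u1013")
```

Here `segmented` is `|မြန်|မာ|စာ` and `stacked` is `|ဗုဒ္ဓ`. As in the Perl expression, the separator is placed before each syllable, so the result starts with a separator.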

Fig. Visualization of sylbreak RE

The shell (sylbreak.sh), Perl (sylbreak.pl) and Python (sylbreak.py) scripts can be used directly; no installation is needed.

Enjoy syllable breaking!

Ye@Lab

Demo/Explanation

In the paper titled "An Algorithm for Myanmar Syllable Segmentation based on the Official Standard Myanmar Unicode Text" presented at the ICCA-2023 conference, the authors make the following statement in Section VI, Performance Evaluation:

Furthermore, we compared the correctness of our algorithm with an existing algorithm, sylbreak3. As stated in Section II, the drawback of the sylbreak3 algorithm is that it cannot correctly segment syllables that contain consonants, ‘်’ and ‘့’. To evaluate this, we tested another set of 165 common syllables in 8 random Myanmar sentences shown Table IX. The results obtained should be seen in the Table X.

According to this experiment, it can be clearly seen that the sylbreak3 algorithm can correctly segment all Myanmar syllables including Parli and digits but it fails in detecting the boundary of syllables composed of ‘်’ and ‘့’.

The statement that sylbreak "fails in detecting the boundary of syllables composed of ‘်’ and ‘့’" is wrong. When I read their paper carefully, I found that the test data was not typed in the correct Unicode order for the Myanmar language. In detail, they typed Auk-ka-myit ("့") and then A-that ("်"), instead of A-that ("်") and then Auk-ka-myit ("့"). I assume they got wrong segmentation results because of this. The sylbreak tool works well when the user provides Myanmar text typed in the correct order according to the Unicode standard.
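
The effect of the typing order can be reproduced with a small Python sketch. It is an illustration under stated assumptions, not code from this repository or from the paper: it uses only the consonant branch of the RE, assumes the subscript symbol is U+1039, uses a test word chosen here rather than taken from the paper, and the helper `fix_dot_asat_order` is hypothetical and handles only this one pair of marks.

```python
import re

DOT_BELOW = "\u1037"  # Auk-ka-myit
ASAT = "\u103A"       # A-that
VIRAMA = "\u1039"     # assumption: the subscript symbol

# Consonant branch of the sylbreak-style RE only.
CONSONANT_BREAK = re.compile(
    "(?<!" + VIRAMA + ")([\u1000-\u1021])(?![" + ASAT + VIRAMA + "])"
)

def break_consonants(text, sep="|"):
    """Insert sep before each consonant that starts a syllable."""
    return CONSONANT_BREAK.sub(lambda m: sep + m.group(1), text)

def fix_dot_asat_order(text):
    """Hypothetical helper: rewrite Auk-ka-myit + A-that as A-that + Auk-ka-myit."""
    return text.replace(DOT_BELOW + ASAT, ASAT + DOT_BELOW)

# The word ခွင့် in two typing orders.
typed_wrong = "\u1001\u103D\u1004\u1037\u103A"  # ..., Auk-ka-myit, A-that
typed_right = "\u1001\u103D\u1004\u103A\u1037"  # ..., A-that, Auk-ka-myit

# The lookahead sees Auk-ka-myit instead of A-that, so င starts a new syllable.
wrong_result = break_consonants(typed_wrong)
# The lookahead sees A-that, so င stays attached to the preceding syllable.
right_result = break_consonants(typed_right)
# Reordering the two marks first gives the same result as the correct typing.
fixed_result = break_consonants(fix_dot_asat_order(typed_wrong))
```

With the first typing order the word is split into two pieces (`|ခွ|င့်`), while the second order and the reordered input both yield a single syllable (`|ခွင့်`). A full normalization of Myanmar text involves more than this one swap; the helper only demonstrates the case discussed above.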

In the following video, I explain this in detail by comparing the example words from their paper. The explanation is in the Myanmar language, but I hope everyone can follow it.

Video Link: https://vimeo.com/864665740?share=copy

Acknowledgement

Thanks to Swan Htet Aung, who reported my typo in $otherChar (ဥဥ ---> ဥဦ).
The sylbreak RE example programs for Java and JavaScript were written by Chan Mrate Ko Ko.

Reference

  1. Dr. Thein Tun, Acoustic Phonetics and The Phonology of the Myanmar Language
  2. Romanization: https://en.wikipedia.org/wiki/Romanization
  3. Myanmar Unicode: http://unicode.org/charts/PDF/U1000.pdf
  4. Syllable segmentation algorithm of Myanmar text: http://gii2.nagaokaut.ac.jp/gii/media/share/20080901-ZMM%20Presentation.pdf
  5. Zin Maung Maung and Yoshiki Mikami, "A rule-based syllable segmentation of Myanmar Text", in Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, January, 2008, Hyderabad, India, pp. 51-58.
  6. Tin Htay Hlaing, "Manually constructed context-free grammar for Myanmar syllable structure", in Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL '12), Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 32-37.
  7. Ye Kyaw Thu, Andrew Finch, Yoshinori Sagisaka and Eiichiro Sumita, "A Study of Myanmar Word Segmentation Schemes for Statistical Machine Translation", in Proceedings of the 11th International Conference on Computer Applications (ICCA 2013), February 26-27, 2013, Yangon, Myanmar, pp. 167-179.
  8. Ye Kyaw Thu, Andrew Finch, Win Pa Pa, and Eiichiro Sumita, "A Large-scale Study of Statistical Machine Translation Methods for Myanmar Language", in Proceedings of SNLP2016, February 10-12, 2016, Phranakhon Si Ayutthaya, Thailand.
  9. Regular Expression: https://en.wikipedia.org/wiki/Regular_expression
  10. Debuggex: https://www.debuggex.com/
  11. Run UTN11 normalization on Myanmar text? harfbuzz/harfbuzz#494