Rabbit-Converter/Rabbit

isUni and isZg string check feature

Closed this issue · 11 comments

Rabbit converter should provide a way to check whether the given data is Unicode or Zawgyi.
Example:

if ( Rabbit::isZawgyi(data)) {
    data = Rabbit::zg2uni(data);
}

Argh, you mean a font detector?
(Replying by email to "Nyan Lynn Htut", May 10, 2015 8:50 PM.)

@nyanlynnhtut, font detection cannot be 100% correct because some code points conflict between the two encodings.

Example: \u103A or \u103D

After က, \u103A shows as ကျ in Zawgyi but as က် in Unicode.

So is that text Zawgyi or Unicode?
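To make the conflict concrete, here is a minimal sketch that flags the two code points mentioned above and shows how each encoding would read them. The helper name `findConflicts` and the glyph names in the table are assumptions made for this example; nothing here is part of Rabbit.

```javascript
// Code points named in the thread whose meaning depends on the encoding.
// The glyph names are this sketch's assumption, not taken from Rabbit.
const CONFLICTS = new Map([
  ["\u103A", { zawgyi: "medial ya", unicode: "asat" }],
  ["\u103D", { zawgyi: "medial ha", unicode: "medial wa" }],
]);

// Hypothetical helper: list every ambiguous code point found in `text`.
function findConflicts(text) {
  const hits = [];
  for (const ch of text) {
    const readings = CONFLICTS.get(ch);
    if (readings) {
      const hex = ch.codePointAt(0).toString(16).toUpperCase();
      hits.push({ codePoint: "U+" + hex, ...readings });
    }
  }
  return hits;
}

// The two-character string from the example: U+1000 followed by U+103A.
console.log(findConflicts("\u1000\u103A"));
```

A string made up only of a consonant plus one of these code points gives a detector nothing else to go on, which is why short input is hard to classify.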

In Tagu, I use something like:

var regexUni = new RegExp("[ဃငဆဇဈဉညဋဌဍဎဏဒဓနဘရဝဟဠအ]်|ျ[က-အ]ါ|ျ[ါ-း]|\u103e|\u103f|\u1031[^\u1000-\u1021\u103b\u1040\u106a\u106b\u107e-\u1084\u108f\u1090]|\u1031$|\u1031[က-အ]\u1032|\u1025\u102f|\u103c\u103d[\u1000-\u1001]|ည်း|ျင်း|င်|န်း|ျာ|င့်");
// "\\s" is needed here: inside a string literal a bare "\s" collapses to the letter "s".
var regexZG = new RegExp("\\s\u1031| ေ[က-အ]်|[က-အ]း");

It is about 80% accurate for long strings, but short strings are a problem.
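One way to live with the short-string problem is to let the detector abstain instead of guessing. The sketch below is only an illustration: the function name `classify`, the `minLength` threshold, and the three-way result are assumptions, and the two demo patterns are small subsets of the regexes above (`\u103e|\u103f` and `\s\u1031`), not the full Tagu rules.

```javascript
// Hypothetical wrapper around two detection patterns.
// Returns "unicode", "zawgyi", or "unknown" when the evidence is
// missing, contradictory, or the input is too short to trust.
// Pass non-global regexes, since test() on a /g regex keeps state between calls.
function classify(text, uniPattern, zgPattern, minLength = 4) {
  // Count code points rather than UTF-16 units.
  if ([...text].length < minLength) return "unknown";
  const looksUni = uniPattern.test(text);
  const looksZg = zgPattern.test(text);
  if (looksUni === looksZg) return "unknown"; // both matched or neither matched
  return looksUni ? "unicode" : "zawgyi";
}

// Reduced demo patterns taken from the regexes quoted in this thread.
const uniSubset = new RegExp("\u103e|\u103f");
const zgSubset = new RegExp("\\s\u1031");

console.log(classify("\u1000\u103E", uniSubset, zgSubset));             // too short
console.log(classify("\u1000\u103E\u102F\u1010", uniSubset, zgSubset)); // U+103E present
console.log(classify("\u1000\u102C \u1031\u1000", uniSubset, zgSubset)); // space then U+1031
```

Callers can then leave "unknown" text untouched rather than risk converting data that was already in the right encoding.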

Another concern I've always had is that people would take this as a chance to detect Unicode and convert everything to Zawgyi (a fear @ravichhabra shares).
That would go totally wrong, and nobody can say for sure that nobody will do it with the "free" software available out there, because nowadays people don't really give a fuck about open licensing. (That's another story.)

All the font detection rules you will ever find on the Internet are also based on Ko @ravichhabra's font buster script.

Technically, things will never be perfect, but detection will be "okay"-ish for most cases. We've done that in closed-source apps like PyawKyi and it works well.

@yelinaung, the font detection code used in Tagu is based on Thant Thet's MMFontTagger code. You can check it on my blog: http://blog.saturngod.net/knowledgebase/tagu-firefox-addon

But I am not sure about the license of that font detection code. I need to confirm it with Thant Thet, or we need to write our own. If it's not under the WTFPL license, I don't want to use it in Rabbit and would prefer to write my own.

It is indeed a complicated matter. I propose postponing this for now. Any thoughts, @thantthet @ravichhabra?

MyanmarFontTagger uses a modified version of Ko @ravichhabra's regex, so it's best to ask him.

I detect Zawgyi or Unicode using a few patterns: sequences such as ကွ, strings that start with ေ, and the fact that most Zawgyi text contains \u107E to \u1084. It's not reliable but useful 😜
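A rough sketch of two of those signals follows. It rests on one assumption about why a leading ေ is informative: Unicode stores U+1031 after its consonant, while Zawgyi stores it in visual order, before the consonant, so only Zawgyi text would normally begin a string or word with it. The name `zawgyiHints` is hypothetical, and the ကွ pattern is left out because the comment does not spell out how it is applied.

```javascript
// Hypothetical helper: report which Zawgyi-leaning signals appear in `text`.
function zawgyiHints(text) {
  return {
    // U+1031 at the start of the string or right after whitespace.
    leadingE: /(?:^|\s)\u1031/.test(text),
    // Any code point in U+107E..U+1084, the range named in the comment above.
    extendedRange: /[\u107E-\u1084]/.test(text),
  };
}

// The same syllable in each storage order.
console.log(zawgyiHints("\u1031\u1000")); // vowel sign first (visual order)
console.log(zawgyiHints("\u1000\u1031")); // consonant first (logical order)
```

Neither signal fires on text that happens to contain no ေ and no code points from that range, so this can only raise suspicion of Zawgyi, never confirm Unicode.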

ေ is U+1031 in both charsets. I am curious how you detect it?