UniversalDependencies/tools

validate.py complains about combining macrons

amir-zeldes opened this issue · 7 comments

In UD_Coptic-Scriptorium we have combining macrons in the MISC field, indicating supra-linear strokes added to characters in the manuscript forms (a MISC annotation called orig, containing unnormalized word forms):

[Line 149 Sent shenoute_a22-MONB_YA_421_428_s0002]: Unicode not normalized: MISC.character[6] is COMBINING MACRON, should be COMBINING DOT BELOW

Why are combining macrons bad unicode? There are no combined glyphs for these characters. Why should they be DOT BELOW instead?

Update: this seems to be flagged for rows that have both a combining macron and a dot below. However, this is the desired form. Is there some reason we shouldn't have both a combining macron and dot below? It renders fine and is the normal way to do this in Unicode Coptic transcriptions AFAIK.

Combining macron itself is not bad Unicode. The issue is that it is used in a non-canonical sequence of characters, which can be replaced by a canonical one (and the resulting glyph will stay the same).

Sometimes there is more than one way to encode the same thing in Unicode (see also this blog post). So in your case, ⲛ̣̄ is encoded as the following sequence of Unicode code points:

ⲛ       11419   2C9B    L       COPTIC SMALL LETTER NI
 ̄         772   0304    M       COMBINING MACRON
 ̣         803   0323    M       COMBINING DOT BELOW

But the same glyph could also be encoded as another sequence, switching the order of the diacritical marks:

ⲛ       11419   2C9B    L       COPTIC SMALL LETTER NI
 ̣         803   0323    M       COMBINING DOT BELOW
 ̄         772   0304    M       COMBINING MACRON

In general it is problematic if one thing can occur as multiple different sequences of bytes. Therefore Unicode defines a normal form called NFC (in fact it defines several normal forms, but this is the one we use now). Canonical ordering sorts adjacent combining marks by their canonical combining class: COMBINING DOT BELOW has class 220 and COMBINING MACRON has class 230, so the dot comes first, then the macron. The issue is more dangerous in the FORM and LEMMA columns, which are more likely to be used in statistical modeling. But since we now check the normalization, we check it everywhere, including the MISC column.
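For anyone who wants to see the reordering for themselves, here is a minimal sketch using Python's standard `unicodedata` module. The helper names `combining_classes` and `first_mismatch` are made up for illustration, and the mismatch report only imitates the shape of the validator message quoted above; it is not taken from validate.py.

```python
import unicodedata

MACRON_FIRST = "\u2c9b\u0304\u0323"  # NI, COMBINING MACRON, COMBINING DOT BELOW
DOT_FIRST = "\u2c9b\u0323\u0304"     # NI, COMBINING DOT BELOW, COMBINING MACRON


def combining_classes(text):
    """Return (character name, canonical combining class) for each code point."""
    return [(unicodedata.name(ch), unicodedata.combining(ch)) for ch in text]


def first_mismatch(text):
    """Return (index, found name, expected name) for the first code point that
    differs from the NFC form of the text, or None if no position differs.
    Hypothetical helper: a pure length difference is not reported here."""
    normalized = unicodedata.normalize("NFC", text)
    for i, (found, expected) in enumerate(zip(text, normalized)):
        if found != expected:
            return i, unicodedata.name(found), unicodedata.name(expected)
    return None


print(combining_classes(MACRON_FIRST))
# [('COPTIC SMALL LETTER NI', 0), ('COMBINING MACRON', 230), ('COMBINING DOT BELOW', 220)]

# NFC sorts the marks by ascending combining class, so both inputs converge.
print(unicodedata.normalize("NFC", MACRON_FIRST) == DOT_FIRST)  # True
print(unicodedata.normalize("NFC", DOT_FIRST) == DOT_FIRST)     # True

print(first_mismatch(MACRON_FIRST))
# (1, 'COMBINING MACRON', 'COMBINING DOT BELOW')
```

There is no precomposed character for NI with either mark, so NFC changes only the order of the marks here, not the number of code points.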

You should be able to trivially fix the problem if you pipe the data through normalize_unicode.pl, a script that I added recently to the tools repository.
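If Perl is not at hand, roughly the same effect can probably be had in a few lines of Python. This is only a sketch of the idea and makes no claim about what normalize_unicode.pl does internally; `normalize_stream` is a hypothetical name, and the sample line (including the `Orig=` spelling of the MISC attribute) is made up for the demonstration.

```python
import io
import unicodedata


def normalize_stream(src, dst):
    """Copy src to dst line by line, converting each line to NFC.
    Returns the number of lines that were changed."""
    changed = 0
    for line in src:
        normalized = unicodedata.normalize("NFC", line)
        if normalized != line:
            changed += 1
        dst.write(normalized)
    return changed


# Made-up two-line sample; the MISC value has macron first, then dot below.
sample = (
    "# sent_id = example\n"
    "1\t\u2c9b\t_\t_\t_\t_\t_\t_\t_\tOrig=\u2c9b\u0304\u0323\n"
)
out = io.StringIO()
n_changed = normalize_stream(io.StringIO(sample), out)
print(n_changed)  # 1
print(out.getvalue().endswith("Orig=\u2c9b\u0323\u0304\n"))  # True
```

For real files, the same function can be given `sys.stdin` and `sys.stdout`, or file objects opened with `encoding="utf-8"`.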

I see, thanks for the explanation @dan-zeman. That does make sense on the one hand, but on the other hand it causes a practical problem for our corpus:

Coptic corpora outside of the treebank, created in a range of projects and by various people, consistently use macron, then dot. The reason is that semantically, an n with a macron is a Coptic letter (it stands for a syllabic n), but the dot below is a paleographic convention indicating that the letter in the manuscript is damaged or reconstructed. So it makes more sense to say 'first, it's a syllabic n, then I note just like for any letter that it is damaged', rather than saying 'it's a damaged n, and by the way it has the syllabicity marker on it'.

From a more practical perspective, if we 'fix' the order in MISC, we will lose parity with the source corpora, which I don't always have control over (and even if I did, we would be out of sync with the rest of Coptic studies resources).

What would you suggest - can we turn off this check for MISC? These are all omitted in form and lemma anyway, since those are normalized forms.

Hmm, I am somewhat reluctant to introduce exceptions at such a low level as character encoding, although obviously it is doable. Isn't it a mistake to seek internal semantics in the order? I think the normal form together with UTF-8 just define a sequence of bytes that represents the character COPTIC SMALL LETTER NI WITH MACRON AND DOT BELOW. It seems similar to the FEATS column, where for practical reasons we also must sort all features in a canonical way, although one may claim that some features are related and it would be better to make them adjacent.

The actual impact of the issue disappears if the swapped order is used consistently in the treebank. But then again, normalization of a consistent approach will not lose any information, and de-normalization (when needed to sync with non-UD corpora) should be trivial as well. I can make the validator tolerate this special case in Coptic, but if you can be persuaded that normalization does not hurt that much, I'd like to persuade you :-)
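To illustrate why de-normalization should be trivial, here is a minimal sketch. It assumes the only difference from the source corpora is the order of an adjacent COMBINING DOT BELOW and COMBINING MACRON; `to_source_order` is a hypothetical helper, and any other mark combinations would need their own handling.

```python
import unicodedata

MACRON = "\u0304"     # COMBINING MACRON
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW


def to_source_order(text):
    """Turn the NFC order (dot below, macron) back into the
    macron-then-dot order assumed to be used by the source corpora."""
    return text.replace(DOT_BELOW + MACRON, MACRON + DOT_BELOW)


nfc_form = "\u2c9b" + DOT_BELOW + MACRON   # what the treebank would store
source_form = to_source_order(nfc_form)    # what the external corpora use

print(source_form == "\u2c9b" + MACRON + DOT_BELOW)             # True
print(unicodedata.normalize("NFC", source_form) == nfc_form)    # True
```

The round trip is lossless under that assumption: normalizing the restored text gives back exactly the stored form.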

I see your point. I'm actually pretty ambivalent about this too, because I like the idea that Unicode is predictable and I intuitively understand the viewpoint that after combining 'it's just one glyph' or 'there is no order'. The only worry I have is that it makes our text not reconstitute to what external resources have as the base text, and I don't think I'll be able to influence other projects using Coptic to obey this order of glyphs consistently.

Let me ask @ctschroeder who knows much more about Coptic encoding in projects outside Scriptorium: the issue is that our data has combining supralinear stroke, THEN underdot, but the Unicode standard actually has a spec saying it should be the other way around. Is this something we should conform to, or is the risk of mismatching other corpora too great?

OK, after some internal discussion we've decided to follow the Unicode standard and change our non-treebank corpora as well, and add this as a necessary step for future data. Thanks for the clarification!

Great, thanks!