jgm/unicode-collation

More unexpected results for "de"

njbart opened this issue · 2 comments

With the latest unicode-collate, there’s now something odd about German non-phonebook collation:

For testing, I used this file, de.txt:

Aa
Ab
Äb
Ac
Ad
Ae
Aeb
Af
Az

These are the results:

$ unicode-collate < de.txt
Aa
Ab
Äb
Ac
Ad
Ae
Aeb
Af
Az
$ unicode-collate de < de.txt 
Aa
Ab
Ac
Ad
Ae
Aeb
Äb
Af
Az
$ unicode-collate de-DE < de.txt 
Aa
Ab
Ac
Ad
Ae
Aeb
Af
Az
Äb
$ unicode-collate de-AT < de.txt 
Aa
Ab
Ac
Ad
Ae
Aeb
Af
Az
Äb
$ unicode-collate de-CH < de.txt 
Aa
Ab
Ac
Ad
Ae
Aeb
Af
Az
Äb
$ unicode-collate de-u-co-phonebk < de.txt
Aa
Ab
Ac
Ad
Ae
Aeb
Äb
Af
Az
$ unicode-collate de-AT-u-co-phonebk < de.txt
Aa
Ab
Ac
Ad
Ae
Aeb
Af
Az
Äb
$ unicode-collate de-u-co-search < de.txt
Aa
Ab
Ac
Ad
Ae
Aeb
Äb
Af
Az
$ unicode-collate de-u-co-standard < de.txt
Aa
Ab
Ac
Ad
Ae
Aeb
Äb
Af
Az

So, the results for de-u-co-phonebk (Ä=Ae) and de-AT-u-co-phonebk (Ä after Az) are as expected. (However, I would have expected de-DE-u-co-phonebk to give the same results as de-u-co-phonebk, but the actual output matches de-AT-u-co-phonebk.)

For de, de-u-co-standard, de-DE, de-AT and de-CH I would have expected the root collation order (accents ignored, Äb together with Ab). https://github.com/jgm/unicode-collation/blob/main/README.md seems to support this in stating, “For languages not listed here, the root collation is used.”

However, the actual de and de-u-co-standard output matches de-u-co-phonebk, and the others correspond to de-AT-u-co-phonebk.
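To make the grouping easier to see, here is a small sketch that takes the outputs reported above and groups the locale tags by the word that immediately precedes Äb. The lists are transcribed from the runs in this report. The names `observed` and `group_by_predecessor` are illustrative only and are not part of unicode-collation.

```python
# Outputs transcribed from the runs reported above (one list per locale tag).
ROOT = ["Aa", "Ab", "Äb", "Ac", "Ad", "Ae", "Aeb", "Af", "Az"]
AFTER_AEB = ["Aa", "Ab", "Ac", "Ad", "Ae", "Aeb", "Äb", "Af", "Az"]
AFTER_AZ = ["Aa", "Ab", "Ac", "Ad", "Ae", "Aeb", "Af", "Az", "Äb"]

observed = {
    "(no tag)": ROOT,
    "de": AFTER_AEB,
    "de-DE": AFTER_AZ,
    "de-AT": AFTER_AZ,
    "de-CH": AFTER_AZ,
    "de-u-co-phonebk": AFTER_AEB,
    "de-AT-u-co-phonebk": AFTER_AZ,
    "de-u-co-search": AFTER_AEB,
    "de-u-co-standard": AFTER_AEB,
}


def group_by_predecessor(results, word="Äb"):
    """Group locale tags by the entry that sorts directly before `word`."""
    groups = {}
    for tag, lines in results.items():
        predecessor = lines[lines.index(word) - 1]
        groups.setdefault(predecessor, []).append(tag)
    return groups


groups = group_by_predecessor(observed)
for predecessor, tags in groups.items():
    print(f"Äb sorts after {predecessor}: {', '.join(tags)}")
```

Only the run without a locale tag places Äb next to Ab; every German tag falls into one of the two phonebook-like groups.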

I’m not sure what to make of de-u-co-search (and -u-co-search variants in general), since “search” doesn’t really correspond to any kind of variant discussed in, e.g., https://en.wikipedia.org/wiki/Alphabetical_order#Language-specific_conventions.

A previous version of unicode-collation used to contain data/collation/de.xml (not sure what this has been replaced with in more recent versions), which claimed, without giving any more details: “# Per Apple language group, these rules [apparently: search] match phonebook collation below.” On the other hand, not being explicitly listed in README.md would imply root collation in this case, too.

jgm commented

Sure enough, the fallback logic is wrong; de is falling back to the phonebook collation.
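For illustration, here is a minimal sketch of the fallback behaviour the report expects. It is not the library's actual code (unicode-collation is written in Haskell). It assumes a hypothetical `TAILORINGS` set holding only the two German tailorings mentioned in this issue, assumes `standard` names the default collation type, and handles only a single `-u-co-` extension rather than full BCP 47 parsing. The key idea is that fallback may drop region subtags but never switches to a different collation type.

```python
# Hypothetical set of available tailorings; only the two German entries
# discussed in this issue are listed. Not the library's real data.
TAILORINGS = {"de-u-co-phonebk", "de-AT-u-co-phonebk"}


def parse(tag):
    """Split a tag into (language/region subtags, collation type or None)."""
    base, _, collation = tag.partition("-u-co-")
    return base.split("-"), (collation or None)


def resolve(tag):
    """Return the tailoring to use for `tag`, or "root" if none applies."""
    subtags, collation = parse(tag)
    if collation == "standard":
        collation = None  # assumption: "standard" means the default type
    suffix = f"-u-co-{collation}" if collation else ""
    # Drop trailing subtags one at a time, keeping the collation type fixed.
    while subtags:
        key = "-".join(subtags) + suffix
        if key in TAILORINGS:
            return key
        subtags = subtags[:-1]
    return "root"


for tag in ["de", "de-DE", "de-u-co-standard", "de-u-co-phonebk",
            "de-AT-u-co-phonebk", "de-DE-u-co-phonebk", "de-u-co-search"]:
    print(f"{tag} -> {resolve(tag)}")
```

Under these assumptions, `de`, `de-DE` and `de-u-co-standard` resolve to the root collation, while `de-DE-u-co-phonebk` falls back to `de-u-co-phonebk`, matching the expectations stated in the report.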

jgm commented

> A previous version of unicode-collation used to contain data/collation/de.xml (not sure what this has been replaced with in more recent versions)

I switched to using some tailoring data files from Perl's Unicode::Collate::Locale.
They don't contain the "search" collations.