More unexpected results for "de"
njbart opened this issue · 2 comments
With the latest unicode-collate, there’s now something odd about German non-phonebook collation:
For testing, I used this file, de.txt:
Aa
Ab
Äb
Ac
Ad
Ae
Aeb
Af
Az
These are the results:
$ unicode-collate < de.txt
Aa
Ab
Äb
Ac
Ad
Ae
Aeb
Af
Az
$ unicode-collate de < de.txt
Aa
Ab
Ac
Ad
Ae
Aeb
Äb
Af
Az
$ unicode-collate de-DE < de.txt
Aa
Ab
Ac
Ad
Ae
Aeb
Af
Az
Äb
$ unicode-collate de-AT < de.txt
Aa
Ab
Ac
Ad
Ae
Aeb
Af
Az
Äb
$ unicode-collate de-CH < de.txt
Aa
Ab
Ac
Ad
Ae
Aeb
Af
Az
Äb
$ unicode-collate de-u-co-phonebk < de.txt
Aa
Ab
Ac
Ad
Ae
Aeb
Äb
Af
Az
$ unicode-collate de-AT-u-co-phonebk < de.txt
Aa
Ab
Ac
Ad
Ae
Aeb
Af
Az
Äb
$ unicode-collate de-u-co-search < de.txt
Aa
Ab
Ac
Ad
Ae
Aeb
Äb
Af
Az
$ unicode-collate de-u-co-standard < de.txt
Aa
Ab
Ac
Ad
Ae
Aeb
Äb
Af
Az
So, the results for de-u-co-phonebk (Ä = Ae) and de-AT-u-co-phonebk (Ä after Az) are as expected. (However, I would have expected de-DE-u-co-phonebk to give the same results as de-u-co-phonebk, but the actual output matches de-AT-u-co-phonebk.)
For de, de-u-co-standard, de-DE, de-AT and de-CH I would have expected the root collation order (accents ignored, Äb together with Ab). https://github.com/jgm/unicode-collation/blob/main/README.md seems to support this in stating, “For languages not listed here, the root collation is used.”
However, the actual de and de-u-co-standard output matches de-u-co-phonebk, and the others correspond to de-AT-u-co-phonebk.
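To make the three orderings discussed above concrete, here is a minimal Python sketch that reproduces them for the test words with hand-written sort keys. It is not how unicode-collation or the Unicode Collation Algorithm works internally; the key functions (`root_key`, `phonebook_key`, `austrian_key`) are hypothetical names, and they only handle the umlauts ä, ö, ü.

```python
import unicodedata

WORDS = ["Aa", "Ab", "Äb", "Ac", "Ad", "Ae", "Aeb", "Af", "Az"]
UMLAUTS = {"ä": "a", "ö": "o", "ü": "u"}


def root_key(word):
    """Root-like order: compare base letters first, accents only break ties."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base.casefold(), decomposed)


def phonebook_key(word):
    """German phonebook-like order: ä, ö, ü are treated as ae, oe, ue."""
    folded = unicodedata.normalize("NFC", word).casefold()
    expanded = "".join(UMLAUTS[c] + "e" if c in UMLAUTS else c for c in folded)
    # The original word breaks ties, so "Aeb" sorts before "Äb".
    return (expanded, word)


def austrian_key(word):
    """Austrian phonebook-like order: ä is its own letter, right after a."""
    folded = unicodedata.normalize("NFC", word).casefold()
    # Each letter becomes (base letter, is_umlaut), so ä sorts after every a.
    return tuple((UMLAUTS.get(c, c), c in UMLAUTS) for c in folded)


for name, key in [("root", root_key),
                  ("phonebook", phonebook_key),
                  ("Austrian phonebook", austrian_key)]:
    print(name, sorted(WORDS, key=key))
```

With these keys, the root-like order puts Äb directly after Ab, the phonebook-like order puts it directly after Aeb, and the Austrian-style order puts it after Az, which are the three patterns that appear in the outputs listed earlier.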
I’m not sure what to make of de-u-co-search (and -u-co-search variants in general), since “search” doesn’t really correspond to any kind of variant discussed in, e.g., https://en.wikipedia.org/wiki/Alphabetical_order#Language-specific_conventions.
A previous version of unicode-collation used to contain data/collation/de.xml (not sure what this has been replaced with in more recent versions), which claimed, without giving any more details: “# Per Apple language group, these rules [apparently: search] match phonebook collation below.” On the other hand, not being explicitly listed in README.md would imply root collation in this case, too.
Sure enough, the fallback logic is wrong; de is falling back to the phonebook collation.
> A previous version of unicode-collation used to contain data/collation/de.xml (not sure what this has been replaced with in more recent versions)
I switched to using some tailoring data files from Perl's Unicode::Collate::Locale.
They don't contain the "search" collations.
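For reference, the fallback behaviour the issue expects can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the library's actual lookup code: the `TAILORINGS` table and the `parse_tag` and `resolve` helpers are hypothetical, the table lists only the two German phonebook tailorings mentioned in this thread, and tag parsing is simplified to an optional `-u-co-` suffix.

```python
# Hypothetical table of available tailorings, keyed by (locale, collation type).
TAILORINGS = {
    ("de", "phonebk"): "de phonebook",
    ("de-AT", "phonebk"): "de-AT phonebook",
}


def parse_tag(tag):
    """Split e.g. 'de-AT-u-co-phonebk' into (['de', 'AT'], 'phonebk')."""
    locale_part, _, collation = tag.partition("-u-co-")
    # No explicit collation type means the standard collation is requested.
    return locale_part.split("-"), collation or "standard"


def resolve(tag):
    """Return the name of the tailoring to use, or 'root' if none applies."""
    subtags, collation = parse_tag(tag)
    # Try the most specific locale first, then drop trailing subtags,
    # but never switch to a different collation type along the way.
    while subtags:
        name = TAILORINGS.get(("-".join(subtags), collation))
        if name is not None:
            return name
        subtags = subtags[:-1]
    return "root"


for tag in ["de", "de-DE", "de-u-co-standard", "de-u-co-phonebk",
            "de-DE-u-co-phonebk", "de-AT-u-co-phonebk", "de-u-co-search"]:
    print(tag, "->", resolve(tag))
```

The key design point is that the collation type is held fixed while region subtags are stripped, so a plain `de` request can never land on the phonebook tailoring, and `de-DE-u-co-phonebk` falls back to the `de` phonebook rules rather than the Austrian ones.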