jgm/unicode-collation

Questions re: collation element tables


I'm asking here because this is one of the more recently-worked-on UCA implementations (in any language) that I could find. As far as I've been able to figure out, the Unicode people publish three complete tables that assign weights to code points:

  • DUCET, i.e., allkeys.txt. My sense is that, in practice, this should be used mostly for the conformance tests. It exists as the standard starting point, and for the validation of UCA implementations. (But I could be wrong!)
  • The CLDR "root collation order" table, i.e., allkeys_CLDR.txt. This is based on DUCET, but with certain modifications that make it preferable for real-world use. (Note that there are conformance tests associated with this table, which also differ subtly from the main UCA tests.)
  • FractionalUCA.txt (also available in a short version that omits comments). I haven't used this table yet; its format is very different. I think it agrees with the CLDR root order out of the box, but it's designed to facilitate tailoring: for example, there are gaps between primary weight ranges that allow whole scripts to be moved, so a tailoring that orders all of Arabic script before all of Latin can be accommodated.
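For what it's worth, the line format of allkeys.txt and allkeys_CLDR.txt is at least straightforward: each data line maps a code point sequence to one or more collation elements, with a leading '*' instead of '.' marking a variable element. Here's a rough sketch of parsing one such line (made-up names, not anything from this library, and far less careful than a real parser needs to be):

```haskell
import Numeric (readHex)

data CE = CE { ceVariable :: Bool, ceWeights :: [Int] }
  deriving (Show)

-- Parse one data line, e.g.
--   0041 ; [.1C47.0020.0008] # LATIN CAPITAL LETTER A
-- (the weight values here are illustrative; they change between Unicode
-- versions). Returns Nothing for @version lines, comments, and blank lines.
parseLine :: String -> Maybe ([Int], [CE])
parseLine ln =
  case break (== ';') (takeWhile (/= '#') ln) of
    (cps, ';' : rest) ->
      Just (map hex (words cps), parseCEs (dropWhile (/= '[') rest))
    _ -> Nothing
  where
    hex s = case readHex s of
              ((n, _) : _) -> n
              _            -> 0
    -- e.g. "[.1C47.0020.0008][.0000.0021.0002]"; '*' marks a variable element.
    parseCEs ('[' : marker : body0) =
      let (body, rest) = break (== ']') body0
      in CE (marker == '*') (map hex (splitDots body))
           : parseCEs (dropWhile (/= '[') (drop 1 rest))
    parseCEs _ = []
    splitDots s = case break (== '.') s of
      (a, _ : b) -> a : splitDots b
      (a, [])    -> [a]
```

For the example line above this yields code point 65 with weights [7239, 32, 8], and multi-code-point entries (contractions) come through as longer key lists.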

Does this make sense? I'm curious what you found out. For my in-progress UCA implementation, I started with DUCET, until I could pass the conformance tests. Then I added the CLDR table and made the necessary adjustments to pass those tests, too. When it came time to look into supporting tailorings, I was really confused about how that works.

The approach taken by the Perl library (and adapted here) seems easier to understand, but I don't get how it could be right. It's all based on DUCET, isn't it? And for a given tailoring, they generate a list of code points for which different weights should be used. So you can check that list first, use the weights if any are found, and if not, go to the main table?
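If I've understood that correctly, it boils down to a lookup with a fallback. A toy sketch (made-up names, not the Perl module's or this library's API):

```haskell
import qualified Data.Map as M

-- Toy collation element: (primary, secondary, tertiary) weights.
type CE = (Int, Int, Int)

-- A weight table keyed by a single code point (real tables also need
-- multi-code-point keys for contractions).
type WeightTable = M.Map Char [CE]

-- Consult the small tailoring table first; if the code point isn't
-- mentioned there, fall back to the full DUCET/CLDR table.
lookupWeights :: WeightTable -> WeightTable -> Char -> Maybe [CE]
lookupWeights tailoring base c =
  case M.lookup c tailoring of
    Just ces -> Just ces
    Nothing  -> M.lookup c base
```

The real structures are obviously richer than a Map keyed by Char, but that's the shape of the fallback as I picture it.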

The problem is that, as I mentioned above, tailorings often stipulate the wholesale reordering of scripts and other groups of characters. And the table designed for that is FractionalUCA. Applying a tailoring also needs to be rule-based, rather than list-based—unless one is willing to duplicate the entire table for each tailoring. Did you manage to deal with this?

Thanks in advance for any wisdom that you might be able/willing to share. I guess I feel a bit like I've been given the run-around by the Unicode docs.

jgm commented

I found the documentation extremely confusing too. In the end (if I recall correctly) I adopted the approach of the Perl library, which I could understand, and which at least had the imprimatur of the Perl community, which has historically cared about text; I remained confused by some of the Unicode documentation, though. I don't think this library has the flexibility to completely reorder scripts.

I'm not quite sure what you mean by "rule-based rather than list-based". Also not sure about "duplicate the entire table for each tailoring." If you look at data/tailorings, you'll see that each tailoring is just a mini-Collation that only defines the new values for characters deviating from the Ducet ordering. This is then combined with the Ducet collation via monoidal "append" -- but this combining only takes place if the tailored collation is actually needed, because Haskell is lazy.

Thanks for taking the time to reply!

I found the documentation extremely confusing too.

Yeah. I started to wonder if the docs are written by them, for them, and made public as a courtesy. It is what it is.

In the end (if I recall correctly) I adopted the approach of the Perl library, which I could understand, and which at least had the imprimatur of the Perl community, which has historically cared about text; I remained confused by some of the Unicode documentation, though.

That makes sense. The Perl library seems to be updated regularly to stay current with Unicode. That alone is very rare. (I think the UCA implementation in Go has seen little attention for almost a decade.)

I don't think this library has the flexibility to completely reorder scripts.

It's good to have this confirmed. I'm not sure I'll even try to support full reordering in my implementation. The Perl approach, if I understand it correctly, allows for "local tailoring" (within a block of code points) but not "global tailoring." Maybe that's the best realistic strategy for libraries other than ICU itself.

I'm not quite sure what you mean by "rule-based rather than list-based". Also not sure about "duplicate the entire table for each tailoring." If you look at data/tailorings, you'll see that each tailoring is just a mini-Collation that only defines the new values for characters deviating from the Ducet ordering.

I saw the tailorings. Some of them, e.g. Arabic, are quite small. What I meant is that, if we wanted to change the order of a whole script relative to others, we would need a tailoring list that would practically replicate DUCET. Of course, the answer here is that the library isn't trying to do that.

This is then combined with the Ducet collation via monoidal "append"...

The merging is the part I misunderstood. That's a more elegant approach than what I imagined (check the tailoring list first, use its weights if present, and otherwise fall back to DUCET).
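Just to check my own understanding with a toy model (plain Data.Map, nothing like the library's actual Collation type): Data.Map's append is a left-biased union, so putting the small tailoring map in front of the base table makes the tailored entries shadow the base ones, and because of laziness the combined map is only built if something actually forces it.

```haskell
import qualified Data.Map as M

type CE = (Int, Int, Int)  -- toy (primary, secondary, tertiary); values made up

ducet :: M.Map Char [CE]
ducet = M.fromList [ ('a', [(0x2075, 0x20, 0x02)])
                   , ('b', [(0x2089, 0x20, 0x02)]) ]

-- A "mini" tailoring that overrides 'a' only.
tailoring :: M.Map Char [CE]
tailoring = M.fromList [ ('a', [(0x2100, 0x20, 0x02)]) ]

-- (<>) on Data.Map is a left-biased union, so tailored entries shadow the
-- base table; the union is only computed if 'tailored' is actually forced.
tailored :: M.Map Char [CE]
tailored = tailoring <> ducet

main :: IO ()
main = print (M.lookup 'a' tailored, M.lookup 'b' tailored)
-- 'a' comes back with the tailored weights; 'b' falls through to ducet.
```

The real Collation type surely carries more structure than a Map, but the shadowing-plus-laziness picture is what I was missing.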