jgm/unicode-collation

Remove commented-out lines in conformance tests

jgm opened this issue · 4 comments

jgm commented

Currently we comment out a few Tibetan characters -- once #5 is fixed, we can restore them.
We also comment out lines beginning with D8.. (surrogate code points), because (as I recall) something goes wrong here with unpacking and normalization. It would be good either to fix this, or to add code to the unit test runner to ignore these lines, so that we can use unmodified conformance test files.

jgm commented

Here's the issue with the surrogates:

GHCI> Data.Text.pack "\xD800\x0021"
"\65533!"  <-- 0xFFFD 0x0021

which normalizes to [0xFFFD, 0x0021].
As you can see, pack replaces \xD800 with the Unicode replacement character 0xFFFD.
The conformance tests have (for SHIFTED):

D800 0021;      # ('\uD800') <surrogate-D800>   [FBC1 D800 | 0020 | 0002 | FFFF 0267 |] 

and for NON_IGNORABLE:

D800 0021;      # ('\uD800') <surrogate-D800>   [FBC1 D800 0267 | 0020 0020 | 0002 0002 |]  

while we get (shifted):

FFFD 0021; # (�!) [FFFD | 0020 | 0002 | FFFF 0267]

and non_ignorable:

FFFD 0021; # (�!) [FFFD 0267 | 0020 0020 | 0002 0002]

jgm commented

I guess this isn't a bug in pack, since pack expects full code points; the surrogate pair should be expanded before we call pack.

But I don't really understand this; D800 0021 isn't actually a valid surrogate pair, is it?

I didn't think so either. The page with the tests on it doesn't seem to specify an encoding. It just calls each one a "sequence of Unicode code points". The page does say,

Implementations that do not weight surrogate code points the same way as reserved code points may filter out such lines lines in the test cases, before testing for conformance.

It looks like the intent is to test the standard's recommendations for ill-formed code unit sequences in section 10.1.1. But this library operates on Text and not raw ByteStrings, so I don't think that issue will ever come up.
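
For reference, here is a minimal sketch of why D800 0021 is not a well-formed surrogate pair. In UTF-16, a high surrogate (U+D800..U+DBFF) must be followed by a low surrogate (U+DC00..U+DFFF), and 0021 is not in the low range. The function names are hypothetical and are not part of this library.

```haskell
-- Hypothetical helpers for illustration; not part of unicode-collation.

-- | High (leading) surrogates occupy U+D800..U+DBFF.
isHighSurrogate :: Int -> Bool
isHighSurrogate cp = cp >= 0xD800 && cp <= 0xDBFF

-- | Low (trailing) surrogates occupy U+DC00..U+DFFF.
isLowSurrogate :: Int -> Bool
isLowSurrogate cp = cp >= 0xDC00 && cp <= 0xDFFF

-- | A well-formed UTF-16 surrogate pair is a high surrogate
-- immediately followed by a low surrogate.
isSurrogatePair :: Int -> Int -> Bool
isSurrogatePair hi lo = isHighSurrogate hi && isLowSurrogate lo
```

With these definitions, `isSurrogatePair 0xD800 0x0021` is `False`, so the test line describes a lone surrogate followed by an ordinary character rather than an encoded supplementary code point.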

jgm commented

Implementations that do not weight surrogate code points the same way as reserved code points may filter out such lines lines in the test cases, before testing for conformance.

Oh good, then I won't feel bad about doing that.
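
A minimal sketch of such a filter might look like the following. It assumes each test line has the form shown above (hex code points, then `;`, then a `#` comment) and that the runner reads the file as a list of lines. The names `lineCodePoints`, `isSurrogate` and `dropSurrogateLines` are hypothetical, not the actual test runner's API.

```haskell
import Numeric (readHex)

-- | Code points listed on a conformance-test line, i.e. the hex fields
-- before the ';'. Comment-only lines, blank lines and fields that are
-- not pure hex yield no code points.
lineCodePoints :: String -> [Int]
lineCodePoints line =
  [ cp
  | field <- words (takeWhile (/= ';') (takeWhile (/= '#') line))
  , (cp, "") <- readHex field
  ]

-- | Surrogate code points occupy U+D800..U+DFFF.
isSurrogate :: Int -> Bool
isSurrogate cp = cp >= 0xD800 && cp <= 0xDFFF

-- | Keep only the lines that mention no surrogate code points.
-- (Assumption: comment and blank lines are passed through unchanged
-- and left for the runner to skip as it already does.)
dropSurrogateLines :: [String] -> [String]
dropSurrogateLines = filter (not . any isSurrogate . lineCodePoints)
```

Applying `dropSurrogateLines` to the lines of an unmodified conformance test file before parsing them would make the commented-out D8.. lines unnecessary.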