siara-cc/Unishox2

Some code point sequences can't round-trip

Closed · 8 comments

As an example, 'Ç、.' gets turned into 'Ç、。'. I'm not sure if this is expected behavior or not when mixing planes. That particular sequence was found with Hypothesis, a property-based fuzzer.
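For anyone who wants to reproduce this, here is a minimal sketch of the kind of property-based test that finds such sequences, assuming a hypothetical `compress`/`decompress` Python wrapper around the library (`unishox_binding` is a placeholder, not an actual published binding):

```python
from hypothesis import given
from hypothesis import strategies as st

# Hypothetical wrapper around the C API; substitute your own binding.
from unishox_binding import compress, decompress

@given(st.text())
def test_round_trip(s):
    # A lossless codec should reproduce every legal Unicode string exactly.
    assert decompress(compress(s)) == s
```

When the property fails, Hypothesis shrinks the input down to a minimal counterexample like the one above.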

I think this may actually be expected behavior based on the Unishox algorithm. ("Full stops" are encoded to a generic code, then the specific representation is chosen based on which language(s) are in use.)

https://raw.githubusercontent.com/siara-cc/Unishox/master/Unishox_Article_1.pdf#

6.13 Encoding punctuations

Some languages, such as Japanese and Chinese use their own punctuation characters. For example full-stop is indicated using U+3002 which is represented visually as a small circle.

So when encountering a Japanese full-stop, the special code for full-stop is used, only in this case, the decoder is expected to decode it as U+3002 instead of '.'. In general, if the prior unicode character is greater than U+3000, then the special full-stop is decoded.

There are other types of full-stops used in other languages. For example, Hindi uses a kind of pipe symbol to indicate full-stop. However, to avoid confusion, this is left to delta coding, since it does not make much difference in compression ratio.
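For reference, here is a minimal Python sketch of that rule as quoted; this is my reading of the paper, not the actual library code (the library is C):

```python
IDEOGRAPHIC_FULL_STOP = "\u3002"  # '。'

def decode_full_stop(prev_char):
    # Per the quoted rule: the shared full-stop code decodes to U+3002
    # when the previously decoded character is above U+3000.
    if prev_char and ord(prev_char) > 0x3000:
        return IDEOGRAPHIC_FULL_STOP
    return "."

# Why 'Ç、.' fails to round-trip: '、' is U+3001, which is above U+3000,
# so the ASCII '.' that followed it decodes as '。' instead.
assert decode_full_stop("、") == "\u3002"
```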

Yes, presently whether it decodes as the Unicode full-stop or ASCII full-stop depends on the previous character decoded.

Is this still true of Unishox2? I've been hitting it with Hypothesis as well and haven't found the same issue.

Hi, thank you for the follow-up. I removed this feature of automatically deciding which full-stop to use in Unishox2 because it caused inconsistency and confusion.

I had the same concern about round-tripping (not just for text), FYI:
#36 (comment)

I answered there; in short, @siara-cc seemingly confirmed it is not an issue, but then I noticed this open issue. I wouldn't say it puts version 2 into doubt. Can you confirm round-tripping should work there, for all legal UTF-8 strings, Japanese or not, but also for arbitrary byte-strings? I or someone could run some tests on random strings, though testing can't prove the absence of bugs, only their presence. I'm just taking his word for it until I understand how illegal UTF-8 is handled. It didn't seem obvious to me that it would work, while he answered that it should (somehow).
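To make the byte-string question concrete, this is the kind of test I have in mind, assuming a hypothetical byte-level wrapper (I don't know whether such an API exists; that is part of the question):

```python
from hypothesis import given
from hypothesis import strategies as st

# Hypothetical bytes-in/bytes-out wrapper; Unishox targets text, so
# whether this property can hold for arbitrary (possibly non-UTF-8)
# input is exactly what I'd like confirmed.
from unishox_binding import compress_bytes, decompress_bytes

@given(st.binary())
def test_round_trip_bytes(data):
    assert decompress_bytes(compress_bytes(data)) == data
```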

If version 1 is now outdated/unsupported, should the limitation for it just be documented and this issue closed?

presently whether it decodes as the Unicode full-stop or ASCII full-stop depends on the previous character decoded.

That seems like a bug: the behavior isn't wanted, but changing it now isn't great either... a catch-22. I guess everyone should just use the newer version 2. I mentioned this should be documented; it's rather obscure, so maybe document it better and/or state that version 1 is not supported, if that isn't done already...

@PallHaraldsson This is already documented in README.md at the beginning.