siara-cc/Unishox2

Some code point sequences can't round-trip

Closed · 8 comments

As an example, 'Ç、.' gets turned into 'Ç、。'. I'm not sure if this is expected behavior or not when mixing planes. That particular sequence was found with Hypothesis, a property-based fuzzer.
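For anyone who wants to reproduce this, here is a minimal sketch of the kind of property-based test that finds such sequences, assuming a hypothetical `compress`/`decompress` Python wrapper around the library (`unishox_binding` is a placeholder, not an actual published binding):

```python
from hypothesis import given
from hypothesis import strategies as st

# Hypothetical wrapper around the C API; substitute your own binding.
from unishox_binding import compress, decompress

@given(st.text())
def test_round_trip(s):
    # A lossless codec should reproduce every legal Unicode string exactly.
    assert decompress(compress(s)) == s
```

When the property fails, Hypothesis shrinks the input down to a minimal counterexample like the one above.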

I think this may actually be expected behavior based on the Unishox algorithm. ("Full stops" are encoded to a generic code, then the specific representation is chosen based on which language(s) are in use.)

https://raw.githubusercontent.com/siara-cc/Unishox/master/Unishox_Article_1.pdf#

6.13 Encoding punctuations

Some languages, such as Japanese and Chinese use their own punctuation characters. For example full-stop is indicated using U+3002 which is represented visually as a small circle.

So when encountering a Japanese full-stop, the special code for full-stop is used, only in this case, the decoder is expected to decode it as U+3002 instead of '.'. In general, if the prior unicode character is greater than U+3000, then the special full-stop is decoded.

There are other types of full-stops used in other languages. For example, Hindi uses a kind of pipe symbol to indicate full-stop. However, to avoid confusion, this is left to delta coding, since it does not make much difference in compression ratio.
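For reference, here is a minimal Python sketch of that rule as quoted; this is my reading of the paper, not the actual library code (the library is C):

```python
IDEOGRAPHIC_FULL_STOP = "\u3002"  # '。'

def decode_full_stop(prev_char):
    # Per the quoted rule: the shared full-stop code decodes to U+3002
    # when the previously decoded character is above U+3000.
    if prev_char and ord(prev_char) > 0x3000:
        return IDEOGRAPHIC_FULL_STOP
    return "."

# Why 'Ç、.' fails to round-trip: '、' is U+3001, which is above U+3000,
# so the ASCII '.' that followed it decodes as '。' instead.
assert decode_full_stop("、") == "\u3002"
```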

Yes, presently whether it decodes as the Unicode full-stop or ASCII full-stop depends on the previous character decoded.

Is this still true of Unishox2? I've been hitting it with Hypothesis as well and haven't found the same issue.

Hi, thank you for the follow-up. I removed this feature of automatically deciding which full-stop to use in Unishox2 because it caused inconsistency and confusion.

I had the same concern about round-tripping (not just for text), FYI:
#36 (comment)

I answered there; in short, @siara-cc seemingly confirmed it is not an issue, but then I noticed this open issue. I wouldn't say it puts version 2 into doubt. Can you confirm round-tripping should work there, for all legal UTF-8 strings, Japanese or not, but also for arbitrary byte-strings? I or someone could run some tests on random strings, though testing can't prove the absence of bugs, only their presence. I'm just taking his word for it until I understand how illegal UTF-8 is handled. It didn't seem obvious to me that it would work, while he answered that it should (somehow).
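To make the byte-string question concrete, this is the kind of test I have in mind, assuming a hypothetical byte-level wrapper (I don't know whether such an API exists; that is part of the question):

```python
from hypothesis import given
from hypothesis import strategies as st

# Hypothetical bytes-in/bytes-out wrapper; Unishox targets text, so
# whether this property can hold for arbitrary (possibly non-UTF-8)
# input is exactly what I'd like confirmed.
from unishox_binding import compress_bytes, decompress_bytes

@given(st.binary())
def test_round_trip_bytes(data):
    assert decompress_bytes(compress_bytes(data)) == data
```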

If version 1 is now outdated/unsupported, should the limitation for it just be documented and this issue closed?

presently whether it decodes as the Unicode full-stop or ASCII full-stop depends on the previous character decoded.

That seems like a bug: the behavior isn't wanted, but changing it now isn't great either... a catch-22. I guess everyone should just use the newer version 2. I mentioned this should be documented; it's rather obscure, so maybe document it better and/or state that version 1 is not supported, if that isn't done already...

@PallHaraldsson This is already documented in README.md at the beginning.